
This project demonstrates a basic **Natural Language Processing (NLP)** pipeline on five sample messages from the **SMS Spam Collection dataset**. The pipeline consists of **tokenization**, **stopword removal**, **lemmatization**, and **TF-IDF vectorization**. Together, these steps clean the text and transform it into numerical features suitable for further analysis or for machine learning tasks such as **spam detection**.

In [None]:
# Load 5 sample messages from the SMSSpamCollection dataset
texts = [
    "Go until jurong point, crazy.. Available only in bugis n great world la e buffet...",
    "Ok lar... Joking wif u oni...",
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
    "U dun say so early hor... U c already then say...",
    "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
]

# Print the messages
for i, msg in enumerate(texts):
    print(f"Text {i+1}: {msg}")


Text 1: Go until jurong point, crazy.. Available only in bugis n great world la e buffet...
Text 2: Ok lar... Joking wif u oni...
Text 3: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Text 4: U dun say so early hor... U c already then say...
Text 5: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.


**Student number:** 1113552

**Student Name:** Zithile

**Job Breakdown:** Responsible for Tokenization

In [None]:
# Tokenize each message by splitting on whitespace (punctuation stays attached to words)
tokenized_texts = [text.split() for text in texts]

# Display tokenized results
for i, tokens in enumerate(tokenized_texts):
    print(f"Tokens for Text {i+1}:")
    print(tokens)



Tokens for Text 1:
['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...']
Tokens for Text 2:
['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...']
Tokens for Text 3:
['Free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'FA', 'Cup', 'final', 'tkts', '21st', 'May', '2005.', 'Text', 'FA', 'to', '87121', 'to', 'receive', 'entry', 'question(std', 'txt', "rate)T&C's", 'apply', "08452810075over18's"]
Tokens for Text 4:
['U', 'dun', 'say', 'so', 'early', 'hor...', 'U', 'c', 'already', 'then', 'say...']
Tokens for Text 5:
['WINNER!!', 'As', 'a', 'valued', 'network', 'customer', 'you', 'have', 'been', 'selected', 'to', 'receivea', '£900', 'prize', 'reward!', 'To', 'claim', 'call', '09061701461.', 'Claim', 'code', 'KL341.', 'Valid', '12', 'hours', 'only.']
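
The whitespace split above keeps punctuation attached to words (`'point,'`, `'crazy..'`, `'only.'`), which affects the later steps. As a minimal sketch of one possible alternative, assuming a simple regular expression is acceptable for this data, the helper below extracts runs of word characters and keeps internal apostrophes. The name `regex_tokenize` is hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper (not in the original notebook): match runs of word
# characters, optionally joined by apostrophes, and drop other punctuation.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")

def regex_tokenize(text):
    """Return the word-like tokens of `text` without attached punctuation."""
    return TOKEN_RE.findall(text)

sample = "Ok lar... Joking wif u oni..."
print(regex_tokenize(sample))
# ['Ok', 'lar', 'Joking', 'wif', 'u', 'oni']
```

With this pattern, `'lar...'` becomes `'lar'`, so a later stopword lookup or lemmatizer sees the bare word rather than the word plus trailing dots.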


**Student number:** 1113523

**Student Name:** Mlandvo Dlamini

**Job Breakdown:** Responsible for Stopword removal

In [None]:
# Stopword Removal
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords
nltk.download('stopwords')

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from each tokenized message
filtered_texts = []
for tokens in tokenized_texts:
    filtered = [word for word in tokens if word.lower() not in stop_words]
    filtered_texts.append(filtered)

# Display filtered (stopwords removed) tokens
for i, tokens in enumerate(filtered_texts):
    print(f"Filtered Tokens for Text {i+1}:")
    print(tokens)


Filtered Tokens for Text 1:
['Go', 'jurong', 'point,', 'crazy..', 'Available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...']
Filtered Tokens for Text 2:
['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...']
Filtered Tokens for Text 3:
['Free', 'entry', '2', 'wkly', 'comp', 'win', 'FA', 'Cup', 'final', 'tkts', '21st', 'May', '2005.', 'Text', 'FA', '87121', 'receive', 'entry', 'question(std', 'txt', "rate)T&C's", 'apply', "08452810075over18's"]
Filtered Tokens for Text 4:
['U', 'dun', 'say', 'early', 'hor...', 'U', 'c', 'already', 'say...']
Filtered Tokens for Text 5:
['WINNER!!', 'valued', 'network', 'customer', 'selected', 'receivea', '£900', 'prize', 'reward!', 'claim', 'call', '09061701461.', 'Claim', 'code', 'KL341.', 'Valid', '12', 'hours', 'only.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
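
In the filtered output above, Text 5 still contains `'only.'` even though `only` was removed from Text 1: the trailing period makes the token differ from the stop-list entry. The sketch below shows one way to handle this, assuming it is acceptable to strip leading and trailing punctuation before the lookup. The function name `remove_stopwords` and the small `demo_stop_words` set are illustrative stand-ins; in the notebook, the NLTK `stop_words` set would be passed instead.

```python
import string

# Small stand-in stop list so the sketch runs without downloading NLTK data.
# In the notebook, the NLTK-based `stop_words` set would be used instead.
demo_stop_words = {"a", "as", "to", "you", "have", "been", "only"}

def remove_stopwords(tokens, stop_words):
    """Drop tokens whose punctuation-stripped, lowercased form is a stopword."""
    kept = []
    for token in tokens:
        core = token.strip(string.punctuation).lower()
        # Tokens that are pure punctuation (empty core) are dropped as well.
        if core and core not in stop_words:
            kept.append(token)
    return kept

tokens = ["Valid", "12", "hours", "only."]
print(remove_stopwords(tokens, demo_stop_words))
# ['Valid', '12', 'hours']
```

The original tokens are kept unchanged in the result; only the comparison uses the stripped form.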


In [None]:
# Lemmatization
from nltk.stem import WordNetLemmatizer

# Download required data
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

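# Lemmatize the lowercased filtered tokens; lemmatize() treats words as nouns
# by default, so verb forms such as 'joking' or 'selected' are left unchanged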
lemmatized_texts = []
for tokens in filtered_texts:
    lemmatized = [lemmatizer.lemmatize(word.lower()) for word in tokens]
    lemmatized_texts.append(lemmatized)

# Display lemmatized tokens
for i, tokens in enumerate(lemmatized_texts):
    print(f"Lemmatized Tokens for Text {i+1}:")
    print(tokens)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Lemmatized Tokens for Text 1:
['go', 'jurong', 'point,', 'crazy..', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...']
Lemmatized Tokens for Text 2:
['ok', 'lar...', 'joking', 'wif', 'u', 'oni...']
Lemmatized Tokens for Text 3:
['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', '87121', 'receive', 'entry', 'question(std', 'txt', "rate)t&c's", 'apply', "08452810075over18's"]
Lemmatized Tokens for Text 4:
['u', 'dun', 'say', 'early', 'hor...', 'u', 'c', 'already', 'say...']
Lemmatized Tokens for Text 5:
['winner!!', 'valued', 'network', 'customer', 'selected', 'receivea', '£900', 'prize', 'reward!', 'claim', 'call', '09061701461.', 'claim', 'code', 'kl341.', 'valid', '12', 'hour', 'only.']


**Student number:** 1103558

**Student Name:** Mikollito Ong

**Job Breakdown:** Responsible for TF-IDF Vectorization
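
As background for the vectorization cell that follows, here is a minimal pure-Python sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` defaults: tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is an illustration under those assumptions, not the library's implementation. The helper names `tokenize` and `tfidf` are hypothetical, and `docs` holds only two of the notebook's lemmatized messages (Text 2 and Text 4).

```python
import math
import re
from collections import Counter

# Lemmatized tokens of Text 2 and Text 4 from the notebook, re-joined.
docs = [
    "ok lar... joking wif u oni...",
    "u dun say early hor... u c already say...",
]

def tokenize(text):
    # Keep lowercase tokens of 2+ word characters; single characters such
    # as 'u' and 'c' and all punctuation are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(documents):
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize the row (guard against an empty document).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(docs)
print(round(rows[1]["already"], 3))  # 0.354
print(round(rows[1]["say"], 3))      # 0.707
```

Because no term is shared between these messages, every term has the same idf and it cancels during normalization. The weights reduce to L2-normalized counts, which is why `say` (two occurrences) scores twice as high as `already`.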

In [None]:
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# Join lemmatized tokens back into full strings
final_texts = [' '.join(tokens) for tokens in lemmatized_texts]

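# Initialize TF-IDF Vectorizer (its default tokenizer re-splits the text,
# dropping punctuation and single-character tokens such as 'u' or '2')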
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(final_texts)

# Display TF-IDF feature names
print("TF-IDF Feature Names:")
print(vectorizer.get_feature_names_out())

# Display the TF-IDF matrix shape
print("\nTF-IDF Matrix Shape:", tfidf_matrix.shape)

# Optional: Show TF-IDF values as a dense matrix
import pandas as pd
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF Values:")
print(df_tfidf.round(3))


TF-IDF Feature Names:
['08452810075over18' '09061701461' '12' '2005' '21st' '87121' '900'
 'already' 'apply' 'available' 'buffet' 'bugis' 'call' 'claim' 'code'
 'comp' 'crazy' 'cup' 'customer' 'dun' 'early' 'entry' 'fa' 'final' 'free'
 'go' 'great' 'hor' 'hour' 'joking' 'jurong' 'kl341' 'la' 'lar' 'may'
 'network' 'ok' 'oni' 'only' 'point' 'prize' 'question' 'rate' 'receive'
 'receivea' 'reward' 'say' 'selected' 'std' 'text' 'tkts' 'txt' 'valid'
 'valued' 'wif' 'win' 'winner' 'wkly' 'world']

TF-IDF Matrix Shape: (5, 59)

TF-IDF Values:
   08452810075over18  09061701461     12   2005   21st  87121    900  already  \
0              0.000        0.000  0.000  0.000  0.000  0.000  0.000    0.000   
1              0.000        0.000  0.000  0.000  0.000  0.000  0.000    0.000   
2              0.192        0.000  0.000  0.192  0.192  0.192  0.000    0.000   
3              0.000        0.000  0.000  0.000  0.000  0.000  0.000    0.354   
4              0.000        0.218  0.218  0.000  0.0