                                                      NLP PORTFOLIO PROJECT

✅ TASK 1: Data Acquisition & Exploration (10 pts)
🎯 Real-life Use-Case (VERY IMPORTANT – grading impact)

Use-case:
📱 Telecom companies want to automatically filter spam SMS to protect users from fraud, scams, and unwanted promotions.

Stakeholders:

Telecom operators

Mobile users

Fraud prevention teams

                               Problem Statement:      
Given an SMS message, classify it as Spam or Ham (legitimate).

In [1]:
# loading the dataset

import pandas as pd

# Absolute or relative path to the file (no extension)
file_path = r"/content/sample_data/SMSSpamCollection"

df = pd.read_csv(
    file_path,
    sep="\t",
    header=None,
    names=["label", "message"],
    encoding="utf-8"
)

df.head()


,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
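One thing worth checking when loading a tab-separated text file: under default quoting rules, a message that starts with a double quote can swallow the following lines into one field. Whether that happens with this file is not verified here. The sketch below uses only the standard library and a made-up three-line sample (the `sample` string is hypothetical) to show the effect and the `csv.QUOTE_NONE` setting that avoids it.

```python
import csv
import io

# Hypothetical three-line sample in the same label<TAB>message layout.
sample = 'ham\t"Ok lar\nspam\tWin cash now"\nham\tSee you later\n'

# Default quoting: the opening quote starts a quoted field that spans the newline.
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE: quote characters are ordinary text, so each line stays one row.
literal_rows = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(literal_rows))  # 2 3
```

`pd.read_csv` accepts the same setting through its `quoting` parameter (`quoting=csv.QUOTE_NONE`), so comparing `df.shape` with and without it is a quick sanity check.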


In [2]:
# Basic information
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [3]:
# Dataset shape
df.shape


(5572, 2)

In [4]:
# Class proportions
df['label'].value_counts(normalize=True) * 100


label,proportion
ham,86.593683
spam,13.406317
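With this much imbalance, accuracy alone can mislead. As a rough sketch, the class counts can be recovered from the printed proportions and the 5572 rows, and a trivial "always predict ham" baseline scored against them. The baseline is an illustration, not a model trained in this notebook.

```python
total = 5572                      # rows reported by df.shape
spam_share = 13.406317 / 100      # proportion printed above

spam = round(total * spam_share)  # 747 spam messages
ham = total - spam                # 4825 ham messages

# A classifier that always answers "ham" gets every ham right and no spam.
baseline_accuracy = ham / total
baseline_spam_recall = 0 / spam

print(spam, ham)                          # 747 4825
print(round(baseline_accuracy, 4))        # 0.8659
print(baseline_spam_recall)               # 0.0
```

Any model scoring close to 0.866 accuracy should therefore be checked for spam recall before it is called useful.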


Representative Examples (3–5 per class)

In [5]:
# Sample ham and spam messages (3 per class)
print("HAM messages:\n")
for msg in df[df['label'] == 'ham']['message'].head(3):
    print("-", msg)

print("\nSPAM messages:\n")
for msg in df[df['label'] == 'spam']['message'].head(3):
    print("-", msg)


HAM messages:

- Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
- Ok lar... Joking wif u oni...
- U dun say so early hor... U c already then say...

SPAM messages:

- Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
- FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
- WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.


                  ✅ TASK 2: Pre-processing Pipeline

In [6]:
import re
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


In [7]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')   # needed by word_tokenize on recent NLTK versions (avoids a LookupError)
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [8]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Remove numbers & special characters
    text = re.sub(r'[^a-z\s]', '', text)

    # 3. Tokenization
    tokens = word_tokenize(text)

    # 4. Stopword removal + Lemmatization
    tokens = [lemmatizer.lemmatize(word)
              for word in tokens
              if word not in stop_words]

    return tokens


In [9]:
example_text = df.loc[10, 'message']

print("Original Text:\n", example_text)
print("\nProcessed Tokens:\n", preprocess_text(example_text))


Original Text:
 I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

Processed Tokens:
 ['im', 'gon', 'na', 'home', 'soon', 'dont', 'want', 'talk', 'stuff', 'anymore', 'tonight', 'k', 'ive', 'cried', 'enough', 'today']
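Step 2 of the pipeline deletes every digit and symbol, so cues that are common in the spam examples (prices, long phone numbers) vanish before vectorization. The sketch below compares that behaviour with a possible variant that swaps such cues for placeholder tokens first. The function names and the `moneytoken` / `longnumtoken` placeholders are hypothetical, and `str.split()` stands in for `word_tokenize` to keep the example dependency-free.

```python
import re

def letters_only(text):
    # Same regex as step 2 of preprocess_text: keep only a-z and whitespace.
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def with_placeholders(text):
    # Variant: mark money amounts and long digit runs before stripping symbols.
    text = text.lower()
    text = re.sub(r"[£$€]\s?\d+(?:\.\d+)?", " moneytoken ", text)
    text = re.sub(r"\d{5,}", " longnumtoken ", text)
    return re.sub(r"[^a-z\s]", "", text).split()

msg = "WINNER!! Claim your £900 prize, call 09061701461."

print(letters_only(msg))
# ['winner', 'claim', 'your', 'prize', 'call']
print(with_placeholders(msg))
# ['winner', 'claim', 'your', 'moneytoken', 'prize', 'call', 'longnumtoken']
```

Whether the placeholders actually improve the classifiers would need to be measured on the validation split.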


In [10]:
df['clean_tokens'] = df['message'].apply(preprocess_text)

df.head()


,label,message,clean_tokens
0,ham,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, dont, think, go, usf, life, around, though]"


                              👉 TASK 3: Feature Engineering

In [11]:
# 🔧 Convert tokens → cleaned text

df['clean_text'] = df['clean_tokens'].apply(lambda x: " ".join(x))

df[['message', 'clean_text']].head()


,message,clean_text
0,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...
1,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...
3,U dun say so early hor... U c already then say...,u dun say early hor u c already say
4,"Nah I don't think he goes to usf, he lives aro...",nah dont think go usf life around though


In [12]:
#  Bag-of-Words (CountVectorizer)

from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(
    ngram_range=(1, 1),   # unigram (mandatory)
    max_features=5000
)

X_bow = bow_vectorizer.fit_transform(df['clean_text'])

X_bow.shape


(5572, 5000)

In [13]:
# TF-IDF Representation
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 1),   # unigram
    max_features=5000
)

X_tfidf = tfidf_vectorizer.fit_transform(df['clean_text'])

X_tfidf.shape



(5572, 5000)
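To make the TF-IDF weighting concrete, here is a small hand computation on a toy corpus. It assumes the scikit-learn defaults used above (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document). The three toy documents are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned token lists.
docs = [
    ["free", "entry", "win"],
    ["ok", "lar"],
    ["win", "free", "prize", "free"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[2])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In the third document "free" gets the largest weight because it occurs twice, while "prize" outranks "win" because it appears in fewer documents.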

In [14]:
#Word2Vec (Dense Representation)
!pip install gensim
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=df['clean_tokens'],
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    sg=0  # CBOW (stable for small datasets)
)


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [15]:
# ✅ Average Word Vectors

import numpy as np

def document_vector(tokens, model):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

X_w2v = np.array([
    document_vector(tokens, w2v_model)
    for tokens in df['clean_tokens']
])

X_w2v.shape


(5572, 100)

                              👉 TASK 4: Modelling & Evaluation

In [16]:
#Encoding Labels

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['label'])

# ham → 0, spam → 1
label_encoder.classes_



array(['ham', 'spam'], dtype=object)

In [17]:
from sklearn.model_selection import train_test_split

# First split: train (70%) + temp (30%)
X_train_idx, X_temp_idx, y_train, y_temp = train_test_split(
    df.index, y, test_size=0.30, random_state=42, stratify=y
)

# Second split: validation (10%) + test (20%)
X_val_idx, X_test_idx, y_val, y_test = train_test_split(
    X_temp_idx, y_temp, test_size=2/3, random_state=42, stratify=y_temp
)
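The two-step split can be checked with simple arithmetic. The sketch assumes `train_test_split` rounds the test share up to a whole number of rows and gives the remainder to the other part; the resulting sizes are not printed anywhere in the notebook.

```python
import math

n_total = 5572

# First split: 30% held out as "temp".
n_temp = math.ceil(0.30 * n_total)
n_train = n_total - n_temp

# Second split: 2/3 of temp becomes test, the rest validation.
n_test = math.ceil((2 / 3) * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)   # 3900 557 1115
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
# 0.7 0.1 0.2
```

The validation rows are set aside here but are not used by any later cell; they would be the natural place for model selection or threshold tuning.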


In [18]:
X_bow_train = X_bow[X_train_idx]
X_bow_val   = X_bow[X_val_idx]
X_bow_test  = X_bow[X_test_idx]


In [19]:
X_tfidf_train = X_tfidf[X_train_idx]
X_tfidf_val   = X_tfidf[X_val_idx]
X_tfidf_test  = X_tfidf[X_test_idx]


In [20]:
X_w2v_train = X_w2v[X_train_idx]
X_w2v_val   = X_w2v[X_val_idx]
X_w2v_test  = X_w2v[X_test_idx]
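Note that the feature matrices sliced here were fitted on all 5572 messages before the split, so the vocabulary (and the IDF statistics and Word2Vec vectors) have already seen validation and test text. A stricter setup learns the vocabulary from training rows only and merely transforms the rest. The pure-Python sketch below illustrates that fit/transform separation; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features=5000):
    # Learn the vocabulary from training documents only.
    counts = Counter(tok for doc in train_docs for tok in doc.split())
    kept = sorted(tok for tok, _ in counts.most_common(max_features))
    return {tok: col for col, tok in enumerate(kept)}

def transform(docs, vocab):
    # Count known tokens; tokens unseen during fitting are ignored.
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

train_docs = ["free entry win", "ok lar joking"]
test_docs = ["win free cash"]

vocab = fit_vocabulary(train_docs)
X_train = transform(train_docs, vocab)
X_test = transform(test_docs, vocab)

print(sorted(vocab))   # ['entry', 'free', 'joking', 'lar', 'ok', 'win']
print(X_test)          # [[0, 1, 0, 0, 0, 1]]
```

With scikit-learn the equivalent would be calling `fit_transform` on `df.loc[X_train_idx, 'clean_text']` and `transform` on the validation and test rows; the reported scores could shift slightly under that setup.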


                            Model 1: Multinomial Naïve Bayes (Generative)

In [21]:
from sklearn.naive_bayes import MultinomialNB

# ✅ Train on Bag-of-Words

nb_bow = MultinomialNB()
nb_bow.fit(X_bow_train, y_train)


In [22]:
# ✅ Train on TF-IDF

nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_tfidf_train, y_train)


                      Model 2: Logistic Regression (Discriminative)

In [23]:
#Sparse (BoW)

from sklearn.linear_model import LogisticRegression

lr_bow = LogisticRegression(max_iter=1000)
lr_bow.fit(X_bow_train, y_train)



In [24]:
#Sparse (TF-IDF)

lr_tfidf = LogisticRegression(max_iter=1000)
lr_tfidf.fit(X_tfidf_train, y_train)


In [25]:
#Dense (Word2Vec)

lr_w2v = LogisticRegression(max_iter=1000)
lr_w2v.fit(X_w2v_train, y_train)


In [26]:
#  Evaluation Function (Reusable & Clean)

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average='binary'
    )

    return acc, precision, recall, f1




In [27]:
# Evaluating all models on the test set

results = []

results.append(("NB + BoW", *evaluate_model(nb_bow, X_bow_test, y_test)))
results.append(("NB + TF-IDF", *evaluate_model(nb_tfidf, X_tfidf_test, y_test)))

results.append(("LR + BoW", *evaluate_model(lr_bow, X_bow_test, y_test)))
results.append(("LR + TF-IDF", *evaluate_model(lr_tfidf, X_tfidf_test, y_test)))
results.append(("LR + Word2Vec", *evaluate_model(lr_w2v, X_w2v_test, y_test)))


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [28]:
results_df = pd.DataFrame(
    results,
    columns=["Model", "Accuracy", "Precision", "Recall", "F1-score"]
)

results_df


,Model,Accuracy,Precision,Recall,F1-score
0,NB + BoW,0.969507,0.866242,0.912752,0.888889
1,NB + TF-IDF,0.973991,0.991803,0.812081,0.892989
2,LR + BoW,0.98565,1.0,0.892617,0.943262
3,LR + TF-IDF,0.956951,1.0,0.677852,0.808
4,LR + Word2Vec,0.866368,0.0,0.0,0.0
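The table reports only ratios. Assuming the test split holds 1115 messages, 149 of them spam (sizes implied by the stratified 20% split, not printed in the notebook), the confusion counts below are the ones that reproduce every reported figure to six decimals. They are inferred from the table, not taken from notebook output.

```python
def metrics(tp, fp, fn, tn):
    # Spam is the positive class; guard against empty denominators.
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, precision, recall, f1

# Inferred (tp, fp, fn, tn) per model; an assumption, see the note above.
inferred = {
    "NB + BoW":      (136, 21, 13, 945),
    "NB + TF-IDF":   (121, 1, 28, 965),
    "LR + BoW":      (133, 0, 16, 966),
    "LR + TF-IDF":   (101, 0, 48, 966),
    "LR + Word2Vec": (0, 0, 149, 966),
}

for name, counts in inferred.items():
    print(name, [round(m, 6) for m in metrics(*counts)])
```

Read this way, LR + Word2Vec labels all 1115 test messages as ham: it misses all 149 spam messages, and its 0.866 accuracy is just the ham share of the test set. That is also why the precision warning appears in the evaluation cell.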


                                  
                                  Analysis & Discussion

                       Generative vs Discriminative Models



1. Multinomial Naïve Bayes (Generative):

Models the probability of words given a class

Very fast and memory-efficient

Works well with high-dimensional sparse data

Assumes word independence (simplifying assumption)

2. Logistic Regression (Discriminative):

Learns a direct decision boundary between classes

More flexible than Naïve Bayes

Often achieves higher precision (both sparse Logistic Regression models reached 1.0 on the test set), though not always a higher F1-score

Slightly more computationally expensive
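To illustrate the generative side described above, here is a minimal multinomial Naïve Bayes computed by hand: a class prior times per-word likelihoods, with add-one (Laplace) smoothing, compared in log space. The four training messages are made up, and the code is a teaching sketch rather than what `MultinomialNB` runs internally.

```python
import math
from collections import Counter

# Hypothetical toy training set of (label, tokens) pairs.
train = [
    ("spam", ["free", "prize", "call"]),
    ("spam", ["free", "entry", "win"]),
    ("ham", ["ok", "call", "later"]),
    ("ham", ["see", "you", "later"]),
]

vocab = sorted({tok for _, toks in train for tok in toks})
class_docs = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_docs}
for label, toks in train:
    word_counts[label].update(toks)

def log_posterior(tokens, label, alpha=1.0):
    # log P(class) + sum of log P(word | class), words treated as independent.
    log_prior = math.log(class_docs[label] / len(train))
    denom = sum(word_counts[label].values()) + alpha * len(vocab)
    return log_prior + sum(
        math.log((word_counts[label][tok] + alpha) / denom)
        for tok in tokens
        if tok in vocab
    )

def predict(tokens):
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["free", "prize"]))   # spam
print(predict(["see", "later"]))    # ham
```

Logistic Regression skips the per-class word model entirely and fits one weight per feature so that the weighted sum separates the two classes directly.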

                
                
                
                
                Observation from Results:

Logistic Regression consistently outperformed Na√Øve Bayes, especially when using TF-IDF features.

                    Sparse vs Dense Representations





  Sparse Features (BoW, TF-IDF)

Advantages:

Easy to interpret

Efficient for short text (SMS)

Strong baseline performance




Limitations:

No semantic understanding

High dimensionality


                      Dense Features (Word2Vec)

Advantages:

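Captures semantic similarity

Lower-dimensional representation (100 dimensions versus 5000 sparse features)

Can share information between related words through similar vectors, although words outside the trained vocabulary are skipped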



                    Limitations:

Loses some fine-grained frequency information

Averaging word vectors ignores word order
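The averaging limitation is easy to demonstrate. The sketch below mirrors the logic of `document_vector` with plain lists and a made-up two-dimensional `toy_vectors` table (hypothetical values, not vectors from the trained model): reordering the tokens gives the same document vector, and a message made only of unknown words collapses to zeros.

```python
# Hypothetical 2-d word vectors, for illustration only.
toy_vectors = {
    "call": [1.0, 0.0],
    "me":   [0.0, 1.0],
    "free": [1.0, 1.0],
}

def average_vector(tokens, vectors, size=2):
    # Same idea as document_vector: mean of known word vectors, zeros otherwise.
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if not found:
        return [0.0] * size
    return [sum(column) / len(found) for column in zip(*found)]

a = average_vector(["call", "me", "free"], toy_vectors)
b = average_vector(["free", "me", "call"], toy_vectors)
oov = average_vector(["unseenword"], toy_vectors)

print(a == b)   # True: word order is lost
print(oov)      # [0.0, 0.0]: nothing known to average
```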

                    Observation:

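Logistic Regression with BoW performed best overall (accuracy 0.986, F1 0.943). Word2Vec with Logistic Regression was not competitive: it predicted no spam on the test set (precision, recall, and F1 all 0.0), so its 0.866 accuracy only reflects the share of ham messages.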






                       Impact of N-grams and Embeddings


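Unigram features alone were enough to reach an F1-score of 0.943 (LR + BoW), which fits the short length of SMS messages

Higher-order n-grams were not evaluated in this notebook (all vectorizers use ngram_range=(1, 1)), so their effect on this task remains untested

The averaged Word2Vec embeddings did not outperform either sparse representation; the LR + Word2Vec model detected no spam at all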




                          
Aspect            Sparse Models           Dense Models
Speed             Very fast               Moderate
Memory            High (many features)    Lower
Explainability    High                    Low
Deployment        Easy                    Slightly complex



                              Final Trade-off:

For SMS spam filtering, the sparse models (BoW and TF-IDF) provide the best balance of accuracy, speed, and interpretability; in these results Logistic Regression with BoW had the highest F1-score.

                              Checking the model's Prediction

In [29]:
def predict_sms(text, vectorizer, model):
    """
    Predicts whether an SMS is 'ham' or 'spam'.

    Parameters:
    - text: str, raw SMS message
    - vectorizer: fitted TF-IDF vectorizer
    - model: trained Logistic Regression model

    Returns:
    - str: 'ham' or 'spam'
    """
    # Preprocess the text using the same pipeline
    tokens = preprocess_text(text)          # from TASK 2
    clean_text = " ".join(tokens)

    # Transform using the fitted TF-IDF vectorizer
    X = vectorizer.transform([clean_text])

    # Predict
    y_pred = model.predict(X)[0]

    # Convert numeric label to original
    return label_encoder.inverse_transform([y_pred])[0]


In [31]:
# Ask user input
user_sms = input("Enter an SMS to classify: ")
prediction = predict_sms(user_sms, tfidf_vectorizer, lr_tfidf)
print(f"Prediction: {prediction.upper()}")


Enter an SMS to classify: "Congratulations! You've won a $1000 gift card. Call now to claim."
Prediction: SPAM
