# ***NLP Portfolio Project***
## **Muhammad Furqan Rauf**


## Imports & Setup

In [2]:
# Core
import numpy as np
import pandas as pd
import re
import random

# Visualization
import matplotlib.pyplot as plt

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# ML
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Word embeddings
!pip install gensim
import gensim
from gensim.models import Word2Vec

# Reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [3]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

## Dataset Loading & Exploration

In [5]:
import requests
import zipfile
import io

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

# Download the zip file content
response = requests.get(url)

# Open the zip file from bytes
with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
    # Read the 'SMSSpamCollection' file directly from the zip archive
    with zip_file.open('SMSSpamCollection') as file:
        df = pd.read_csv(
            file,
            sep='\t',
            header=None,
            names=['label', 'message']
        )

df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [6]:
df['label'].value_counts()

| label | count |
|-------|-------|
| ham | 4825 |
| spam | 747 |


In [7]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

In [8]:
for label in [0, 1]:
    print(f"\nLabel {label} examples:")
    print(df[df['label'] == label]['message'].head(3).to_string(index=False))



Label 0 examples:
Go until jurong point, crazy.. Available only i...
                     Ok lar... Joking wif u oni...
 U dun say so early hor... U c already then say...

Label 1 examples:
Free entry in 2 a wkly comp to win FA Cup final...
FreeMsg Hey there darling it's been 3 week's no...
WINNER!! As a valued network customer you have ...


### Preprocessing Pipeline

In [9]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()

    # Remove URLs, numbers, special chars
    text = re.sub(r"http\S+|www\S+|[^a-z\s]", "", text)

    # Tokenize
    tokens = text.split()

    # Stopword removal + lemmatization
    tokens = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token not in stop_words
    ]

    return " ".join(tokens)

In [10]:
df['clean_text'] = df['message'].apply(preprocess_text)

In [11]:
sample = df.iloc[10]
print("Original:", sample['message'])
print("Cleaned :", sample['clean_text'])

Original: I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
Cleaned : im gonna home soon dont want talk stuff anymore tonight k ive cried enough today




> The pipeline performs:
- Lower-casing to ensure case-insensitive matching
- Regex-based cleaning to remove URLs, digits, and special characters
- Tokenization by splitting text on whitespace
- Stop-word removal to eliminate non-informative tokens
- Lemmatization to reduce words to their base form (`WordNetLemmatizer` defaults to the noun part of speech, so verb forms such as "cried" in the sample above are left unchanged)
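The lower-casing, regex, and tokenization steps can be tried in isolation with the standard library alone. The sketch below reuses the regex from `preprocess_text`; the helper name `clean` and the example messages are illustrative assumptions, not part of the notebook. It also shows a side effect visible in the sample output above: stripping the apostrophe turns "don't" into "dont", which is why such tokens survive the later stop-word filter.

```python
import re

# Same pattern as in preprocess_text: URLs first, then any character
# that is not a lowercase letter or whitespace.
CLEAN_PATTERN = re.compile(r"http\S+|www\S+|[^a-z\s]")

def clean(text):
    """Lowercase, strip URLs/digits/punctuation, then split on whitespace."""
    return CLEAN_PATTERN.sub("", text.lower()).split()

# Hypothetical messages, made up for illustration.
print(clean("WINNER!! Text WIN to 80086 or visit www.example.com"))
# ['winner', 'text', 'win', 'to', 'or', 'visit']

print(clean("I don't want to talk"))
# ['i', 'dont', 'want', 'to', 'talk']
```

If contractions should be removed as stop words, one option is to filter stop words before stripping punctuation, or to add the apostrophe-free forms to `stop_words`.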





## Feature Engineering

### Bag-of-Words

In [12]:
bow_vectorizer = CountVectorizer(ngram_range=(1,1))
X_bow = bow_vectorizer.fit_transform(df['clean_text'])
y = df['label']

### TF-IDF

In [13]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
X_tfidf = tfidf_vectorizer.fit_transform(df['clean_text'])

### Word2Vec Embeddings

In [14]:
# Tokenize corpus:
tokenized_corpus = [text.split() for text in df['clean_text']]

In [15]:
# train Word2Vec:
w2v_model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    seed=SEED
)

In [16]:
# Document embedding (average):
def document_vector(tokens, model):
    vectors = [
        model.wv[word]
        for word in tokens
        if word in model.wv
    ]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

X_w2v = np.array([
    document_vector(text.split(), w2v_model)
    for text in df['clean_text']
])
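The averaging logic in `document_vector` can be checked by hand on a toy vocabulary. The following is a dependency-free sketch under stated assumptions: the two-dimensional embeddings in `toy` and the helper name `average_vectors` are made up for illustration and stand in for `model.wv`.

```python
def average_vectors(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-d embeddings, not taken from the trained Word2Vec model.
toy = {"free": [1.0, 0.0], "prize": [0.0, 1.0]}

print(average_vectors(["free", "prize", "unseen"], toy, dim=2))  # [0.5, 0.5]
print(average_vectors(["unseen"], toy, dim=2))                   # [0.0, 0.0]
```

Out-of-vocabulary tokens are skipped rather than counted as zeros, so they do not pull the mean towards the origin; a message with no known tokens at all maps to the zero vector.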

## Train / Validation / Test Split

In [20]:
X_train_bow, X_temp_bow, y_train, y_temp = train_test_split(
    X_bow, y, test_size=0.3, random_state=SEED, stratify=y
)

X_val_bow, X_test_bow, y_val, y_test = train_test_split(
    X_temp_bow, y_temp, test_size=2/3, random_state=SEED, stratify=y_temp
)

X_train_tfidf, X_temp_tfidf, _, _ = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=SEED, stratify=y
)

X_val_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_temp_tfidf, y_temp, test_size=2/3, random_state=SEED, stratify=y_temp
)
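The two-stage split above yields roughly 70% train, 10% validation, and 20% test, because the second call takes 2/3 of the held-out 30%. The arithmetic can be sketched as follows, assuming the test portion is rounded up for fractional sizes (which is how I understand `train_test_split` to behave); `split_sizes` is a hypothetical helper, and the total comes from the class counts shown earlier.

```python
import math

def split_sizes(n_samples, test_size):
    """(n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 4825 + 747                         # ham + spam = 5572 messages
n_train, n_temp = split_sizes(n_total, 0.3)  # first split: 70% / 30%
n_val, n_test = split_sizes(n_temp, 2 / 3)   # second split: 1/3 val, 2/3 test

print(n_train, n_val, n_test)  # 3900 557 1115
```

The resulting test size of 1115 agrees with the `support` total in the classification reports.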

## Models & Training

### Naive Bayes (BoW)

In [22]:
nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)

y_pred_nb = nb_bow.predict(X_test_bow)

### Logistic Regression

In [23]:
# Sparse TF-IDF features:
lr_tfidf = LogisticRegression(max_iter=1000, random_state=SEED)
lr_tfidf.fit(X_train_tfidf, y_train)

y_pred_lr = lr_tfidf.predict(X_test_tfidf)

In [24]:
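# Dense Word2Vec features. This is a separate 80/20 split, so its test rows
# are not guaranteed to match the BoW/TF-IDF test set used above.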
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(
    X_w2v, y, test_size=0.2, random_state=SEED, stratify=y
)

lr_w2v = LogisticRegression(max_iter=1000, random_state=SEED)
lr_w2v.fit(X_train_w2v, y_train_w2v)

y_pred_w2v = lr_w2v.predict(X_test_w2v)

## Evaluation

In [31]:
def evaluate_model(name, y_true, y_pred):
    print(f"\n{name}")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))

### Evaluate Naive Bayes (BoW)

In [32]:
evaluate_model(
    name="Multinomial Naive Bayes (Bag-of-Words)",
    y_true=y_test,
    y_pred=y_pred_nb
)


Multinomial Naive Bayes (Bag-of-Words)
Accuracy: 0.9730941704035875
              precision    recall  f1-score   support

           0       0.99      0.98      0.98       966
           1       0.87      0.93      0.90       149

    accuracy                           0.97      1115
   macro avg       0.93      0.96      0.94      1115
weighted avg       0.97      0.97      0.97      1115



### Evaluate Logistic Regression (TF-IDF)

In [33]:
evaluate_model(
    name="Logistic Regression (TF-IDF)",
    y_true=y_test,
    y_pred=y_pred_lr
)


Logistic Regression (TF-IDF)
Accuracy: 0.9336322869955157
              precision    recall  f1-score   support

           0       0.93      1.00      0.96       966
           1       1.00      0.50      0.67       149

    accuracy                           0.93      1115
   macro avg       0.96      0.75      0.82      1115
weighted avg       0.94      0.93      0.92      1115



### Evaluate Logistic Regression (Word2Vec)

In [34]:
evaluate_model(
    name="Logistic Regression (Word2Vec)",
    y_true=y_test_w2v,
    y_pred=y_pred_w2v
)


Logistic Regression (Word2Vec)
Accuracy: 0.8663677130044843
              precision    recall  f1-score   support

           0       0.87      1.00      0.93       966
           1       0.00      0.00      0.00       149

    accuracy                           0.87      1115
   macro avg       0.43      0.50      0.46      1115
weighted avg       0.75      0.87      0.80      1115





In [35]:
results = []

def collect_results(model_name, y_true, y_pred):
    report = classification_report(y_true, y_pred, output_dict=True)
    results.append({
        "Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": report["weighted avg"]["precision"],
        "Recall": report["weighted avg"]["recall"],
        "F1-score": report["weighted avg"]["f1-score"]
    })

collect_results("NB (BoW)", y_test, y_pred_nb)
collect_results("LR (TF-IDF)", y_test, y_pred_lr)
collect_results("LR (Word2Vec)", y_test_w2v, y_pred_w2v)

results_df = pd.DataFrame(results)
results_df



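|   | Model | Accuracy | Precision | Recall | F1-score |
|---|-------|----------|-----------|--------|----------|
| 0 | NB (BoW) | 0.973094 | 0.974128 | 0.973094 | 0.973461 |
| 1 | LR (TF-IDF) | 0.933632 | 0.938355 | 0.933632 | 0.923894 |
| 2 | LR (Word2Vec) | 0.866368 | 0.750593 | 0.866368 | 0.804336 |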


## Results Interpretation: Analysis and Insights

The evaluation results show a clear distinction in performance among the models:

*   **Multinomial Naive Bayes (Bag-of-Words)**: Achieved the highest accuracy (0.973) and weighted F1-score (0.973). Per class, it reaches 0.99 precision and 0.98 recall on 'ham', and 0.87 precision and 0.93 recall on 'spam', so it separates the two classes well with the Bag-of-Words representation.

*   **Logistic Regression (TF-IDF)**: Reached an accuracy of 0.934 and a weighted F1-score of 0.924. It never flagged a ham message as spam (spam precision 1.00, ham recall 1.00), but its spam recall was only 0.50, so it missed about half of the spam messages. This suggests the TF-IDF and Logistic Regression combination was less sensitive to the minority class than BoW with Naive Bayes; since both the features and the classifier differ, the gap cannot be attributed to TF-IDF alone.

*   **Logistic Regression (Word2Vec)**: Showed the lowest performance, with an accuracy of 0.866. Critically, it recorded 0.00 precision and recall for the 'spam' class: the model predicted 'ham' for every test message, and 0.866 is simply the share of ham in the test set (966 of 1115). This could be due to the simplicity of averaging word vectors for document representation, the small corpus used to train Word2Vec, or the Word2Vec parameters. Note that this model was evaluated on its own 80/20 split rather than the test set used for the other two models.

Overall, the Bag-of-Words representation coupled with Naive Bayes proved most effective for this SMS spam classification task, outperforming both TF-IDF and Word2Vec embeddings.
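The Word2Vec row in the results table can be reproduced without any model, which confirms that it corresponds to an "always ham" predictor. The sketch below is a minimal illustration; `binary_metrics` is a hypothetical helper, and the class counts (966 ham, 149 spam) are taken from the `support` column of the reports above.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy plus positive-class precision/recall/F1 (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

n_ham, n_spam = 966, 149

# An "always ham" predictor: every spam message is a false negative.
spam = binary_metrics(tp=0, fp=0, fn=n_spam, tn=n_ham)
# The same predictions viewed with ham as the positive class.
ham = binary_metrics(tp=n_ham, fp=n_spam, fn=0, tn=0)

# Support-weighted precision, as in the "weighted avg" row.
weighted_precision = (n_ham * ham["precision"] + n_spam * spam["precision"]) / (n_ham + n_spam)

print(round(spam["accuracy"], 3))     # 0.866
print(spam["recall"])                 # 0.0
print(round(weighted_precision, 3))   # 0.751
```

Both the accuracy and the weighted precision match the reported Word2Vec figures, which is why accuracy alone is a weak signal on a dataset this imbalanced.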

## Conclusion & Future Scope: Summary and Improvements

**Conclusion:**
The project demonstrates the application of several NLP techniques to SMS spam classification. The simple yet robust **Multinomial Naive Bayes model with Bag-of-Words features emerged as the best performer**, with an accuracy and weighted F1-score of 0.973. TF-IDF with Logistic Regression provided a decent baseline but missed about half of the spam, while averaged Word2Vec embeddings failed to detect any spam, highlighting the limitations of this embedding approach for this specific dataset and model.

**Future Scope:**
1.  **Word Embeddings Enhancement**: Explore pre-trained word embeddings (e.g., GloVe, FastText) or train larger Word2Vec models with different parameters. Consider more sophisticated methods for combining word embeddings into document embeddings (e.g., weighted averages, attention mechanisms, or neural networks like LSTMs/GRUs).
2.  **Model Exploration**: Experiment with other machine learning algorithms such as Support Vector Machines (SVMs), Gradient Boosting (e.g., XGBoost, LightGBM), or even deep learning models like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) for text classification.
3.  **Hyperparameter Tuning**: Conduct systematic hyperparameter tuning for all models and vectorizers using techniques like GridSearchCV or RandomizedSearchCV to optimize performance.
4.  **Error Analysis**: Perform a detailed error analysis on misclassified messages to gain insights into specific patterns or linguistic features that models struggle with, which could inform further preprocessing or feature engineering.
5.  **Feature Engineering**: Introduce additional linguistic features (e.g., part-of-speech tags, sentiment scores, message length, presence of specific keywords) that might help distinguish spam from ham messages.
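As a starting point for the hyperparameter tuning item above, the vectorizer and classifier can be wrapped in a scikit-learn `Pipeline` and searched with `GridSearchCV`. This is a minimal sketch, not the notebook's method: the toy corpus, the parameter grid, and the choice of F1 as the scoring metric are all assumptions made for illustration. One benefit of the pipeline is that the vectorizer is refit on the training folds only, whereas the notebook fits its vectorizers on the full dataset before splitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, purely illustrative (not drawn from the SMS dataset).
texts = [
    "win free prize now", "free entry win cash", "claim your free prize",
    "urgent win cash now", "free cash prize claim", "win urgent free entry",
    "see you at lunch", "call me later tonight", "are you home yet",
    "lunch at noon today", "talk to you later", "home late tonight sorry",
]
labels = [1] * 6 + [0] * 6  # 1 = spam, 0 = ham

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Hypothetical grid; the values are placeholders rather than recommendations.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

# F1 on the spam class is used because accuracy is misleading under imbalance.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_)
```

On the real data, `texts` would be the training portion of `df['clean_text']` and `labels` the matching `y_train`, with the held-out test set left untouched until the final evaluation.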