<a href="https://colab.research.google.com/github/MuhammadAbdullah80/Velocity-Solutions/blob/main/Week_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📦 Week 2 – Text Classification with Machine Learning (Spam Detection)



In Week 2 of the internship, I worked on building a complete text classification pipeline to detect spam SMS messages using machine learning techniques. This involved **data preprocessing**, **feature extraction**, **model training**, and **interactive testing** using two different vectorization methods and classifiers.

---

### 📁 Step 1 - Dataset Loading

- **SMS Spam Collection Dataset**  
- Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)

It contains 5,572 SMS messages labeled as either `spam` or `ham`.

---

### 🧹 Step 2 – Text Preprocessing

Cleaned the raw SMS text using the following techniques:
- Lowercasing all text
- Tokenization using `nltk.word_tokenize()`
- Removing punctuation and stopwords
- Applying **stemming** and **lemmatization** for word normalization

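This produced a `clean_text` column, which is the input to feature extraction. The stemmed and lemmatized variants are stored in separate columns (`stemmed_text`, `lemmatized_text`) and are not used by the vectorizers.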

---

### 📊 Step 3 – Feature Extraction

Converted the cleaned messages into numerical feature matrices using:
- **Bag-of-Words** via `CountVectorizer`
- **TF-IDF** via `TfidfVectorizer`

These created two sparse matrices: `X_bow` and `X_tfidf`.

---

### 🤖 Step 4 – Model Training

Trained and evaluated two different models:

| Model                      | Vectorization | Accuracy | Spam F1-Score |
|----------------------------|----------------|----------|---------------|
| Naive Bayes                | Bag-of-Words   | 96.95%   | 0.89          |
| Logistic Regression        | TF-IDF         | 95.96%   | 0.83          |

**Findings:**
- Naive Bayes was better at catching spam (spam recall 0.94 vs. 0.72)
- Logistic Regression was better at avoiding false positives (spam precision 0.97 vs. 0.85)

---

### 🧪 Step 5 – Custom Message Testing

Built an interactive system to input custom SMS messages and see predictions from both models.

Example:

```
Type an SMS message → Congratulations! You won a free ticket.
Naive Bayes (BoW)         → SPAM (confidence ≈ 0.98)
Logistic Regression (TF-IDF) → SPAM (confidence ≈ 0.96)
```

This showed how both models behave on new, hand-typed input and where their predictions and confidence levels differ.

---

### 🧠 What I Learned

- How to preprocess and clean raw text data
- The difference between Bag-of-Words and TF-IDF vectorization
- When to use Naive Bayes vs. Logistic Regression for text classification
- How to evaluate classification models using precision, recall, and F1-score
- How to build an interactive real-time prediction tool for any new SMS input




# 📥 Step 1 – Loading the SMS Spam Dataset




In this step, I loaded a real-world dataset to be used for a binary text classification task — identifying whether a given SMS message is **spam** or **ham** (not spam).

---

## 📌 Dataset Description

- **Name**: SMS Spam Collection Dataset  
- **Source**: Originally from the UCI Machine Learning Repository  
- **Size**: 5,572 labeled SMS messages in the copy loaded below (4,825 `ham`, 747 `spam`)  
- **Classes**:
  - `spam`: unwanted promotional or fraudulent messages  
  - `ham`: normal, legitimate text messages  

---

## ❗️ Initial Approach & Issue Faced

I first attempted to load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset
dataset = load_dataset("ucirvine/sms_spam")
```

However, this resulted in an error:

```
ValueError: '**' can only be an entire path component
```

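I did not investigate the root cause. This error is commonly reported when an older `datasets` release is used together with a newer, incompatible `fsspec` version, so upgrading `datasets` may resolve it.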

---

## ✅ Solution

To resolve this, I used a copy of the dataset hosted on GitHub in the repository for a PyCon 2016 tutorial.

I successfully loaded it using `pandas` as follows:

```python
import pandas as pd

url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', names=['label', 'text'])

print(df.head())
```

The dataset loaded correctly with two columns:
- `label`: contains either `'spam'` or `'ham'`  
- `text`: the actual SMS message content  
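
To make the `sep='\t'` and `names=[...]` arguments concrete: the file has no header row, and each line holds a label and a message separated by a tab. Below is a minimal standard-library sketch of the same parsing. The two sample lines and all variable names are illustrative stand-ins for the real file, not part of the notebook.

```python
import csv
import io
from collections import Counter

# Two made-up lines in the same "label<TAB>text" layout as sms.tsv (no header row).
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp to win FA Cup\n"
)

# QUOTE_NONE keeps quotation marks inside messages as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)

# Attach column names ourselves, because the file does not provide any.
rows = [dict(zip(["label", "text"], fields)) for fields in reader]
label_counts = Counter(row["label"] for row in rows)

print(rows[0])
print(label_counts)
```

In the `pandas` call, `names=` plays the role of the `zip` here: it supplies column names because the file itself has none.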

---

## 🧠 What I Learned

- How to work with public NLP datasets  
- How to handle dataset loading errors and fallbacks  
- How to use alternate sources like GitHub or direct URLs when APIs fail  
- Gained hands-on experience with basic `pandas` functions for data loading and inspection


In [1]:
!pip install datasets




In [2]:
# from datasets import load_dataset

# # Load the SMS Spam dataset
# dataset = load_dataset("ucirvine/sms_spam")




In [3]:
import pandas as pd

# Direct link to raw SMS spam data
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"

# Load using tab separator
df = pd.read_csv(url, sep='\t', names=['label', 'text'])

In [4]:
# Preview
print(df.head())
print(df['label'].value_counts())


  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
label
ham     4825
spam     747
Name: count, dtype: int64



# 🧹 Step 2 – Text Preprocessing



After successfully loading the dataset, the next step was to clean and prepare the SMS text messages for feature extraction and modeling.

Text data in its raw form contains a lot of noise — such as uppercase letters, punctuation, stopwords, and grammatical variations of words. Preprocessing helps standardize and simplify the data for machine learning algorithms.

---

### 🔧 Preprocessing Steps Applied

1. **Lowercasing**  
   Converted all SMS text to lowercase to ensure that "Free", "free", and "FREE" are treated the same.

2. **Tokenization**  
   Used NLTK’s `word_tokenize()` to split each message into individual words (tokens), enabling word-level operations.

3. **Stopword Removal**  
   Removed common words like “the”, “is”, “in”, etc., using NLTK’s predefined English stopword list. These words are frequent but usually not meaningful for classification.

4. **Punctuation and Non-Alphabetic Removal**  
   Removed all punctuation and tokens that were not purely alphabetic using `word.isalpha()`.

5. **Stemming**  
   Applied the PorterStemmer to reduce words to their root forms. For example, “playing”, “played”, and “plays” were reduced to “play”.  
   Note: This sometimes produced artificial words like `"crazy"` → `"crazi"`.

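6. **Lemmatization**  
   Used WordNetLemmatizer to convert words to their dictionary base form. Unlike stemming, lemmatization returns proper English words.  
   Note: `lemmatize()` treats every word as a noun unless a part-of-speech tag is passed, so in this notebook `"lives"` → `"life"` while `"joking"` and `"looked"` stay unchanged. Examples such as `"running"` → `"run"` or `"better"` → `"good"` only work with `pos='v'` or `pos='a'` respectively.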
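
Steps 1 to 4 can be sketched without NLTK to show the order of operations. This is a simplified stand-in, not the notebook's function: the regular-expression tokenizer and the tiny `STOP_WORDS` set are assumptions chosen for illustration, whereas the notebook uses `word_tokenize()` and NLTK's full English stopword list.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "in", "a", "to", "until", "only", "so", "then"}

def simple_clean(text):
    """Lowercase, tokenize, then drop stopwords and non-alphabetic tokens."""
    # Split into runs of letters, runs of digits, or single punctuation marks.
    tokens = re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower())
    kept = [tok for tok in tokens if tok.isalpha() and tok not in STOP_WORDS]
    return " ".join(kept)

cleaned = simple_clean("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)  # free entry wkly comp win fa cup
```

The digit `2`, the exclamation mark, and the stopwords `in`, `a`, and `to` are all removed, which matches the start of the `clean_text` value shown for the same message in the DataFrame.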

---

### 📊 Columns Created in the DataFrame

- `text_lower`: Lowercased version of original text
- `clean_text`: Lowercased + tokenized + stopword-free + punctuation removed
- `stemmed_text`: Stemmed version of `clean_text`
- `lemmatized_text`: Lemmatized version of `clean_text`

---

### 🧠 What I Learned

- How to tokenize and clean real-world text data using `nltk`
- The difference between stemming and lemmatization, and their effect on word quality
- Why it's important to normalize data before converting it to numerical features
- How each preprocessing step improves consistency and reduces noise in the dataset



In [5]:
df['text_lower'] = df['text'].str.lower()


In [6]:
print(df[['text', 'text_lower']].head())


                                                text  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                          text_lower  
0  go until jurong point, crazy.. available only ...  
1                      ok lar... joking wif u oni...  
2  free entry in 2 a wkly comp to win fa cup fina...  
3  u dun say so early hor... u c already then say...  
4  nah i don't think he goes to usf, he lives aro...  


In [7]:
df.head()

Unnamed: 0,label,text,text_lower
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro..."


In [8]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Define stopwords
stop_words = set(stopwords.words('english'))

# Define preprocessing function
def clean_text(text):
    words = word_tokenize(text)  # Split text into words
    clean_words = [
        word for word in words
        if word.isalpha() and word not in stop_words  # Keep only meaningful words
    ]
    return " ".join(clean_words)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [9]:
df['clean_text'] = df['text_lower'].apply(clean_text)
df.head()

Unnamed: 0,label,text,text_lower,clean_text
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry wkly comp win fa cup final tkts may...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah think goes usf lives around though


In [10]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [11]:
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [12]:
def stem_clean_text(text):
    words = text.split()  # it's already clean and tokenized (space-separated)
    stemmed = [stemmer.stem(word) for word in words]
    return " ".join(stemmed)

df['stemmed_text'] = df['clean_text'].apply(stem_clean_text)


In [13]:
def lemmatize_clean_text(text):
    words = text.split()
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(lemmatized)

df['lemmatized_text'] = df['clean_text'].apply(lemmatize_clean_text)


In [14]:
df.head(5000)

Unnamed: 0,label,text,text_lower,clean_text,stemmed_text,lemmatized_text
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go jurong point crazy available bugis n great ...,go jurong point crazi avail bugi n great world...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,ok lar joke wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry wkly comp win fa cup final tkts may...,free entri wkli comp win fa cup final tkt may ...,free entry wkly comp win fa cup final tkts may...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say early hor u c already say,u dun say earli hor u c alreadi say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah think goes usf lives around though,nah think goe usf live around though,nah think go usf life around though
...,...,...,...,...,...,...
4995,ham,My drive can only be read. I need to write,my drive can only be read. i need to write,drive read need write,drive read need write,drive read need write
4996,ham,"Just looked it up and addie goes back Monday, ...","just looked it up and addie goes back monday, ...",looked addie goes back monday sucks,look addi goe back monday suck,looked addie go back monday suck
4997,ham,Happy new year. Hope you are having a good sem...,happy new year. hope you are having a good sem...,happy new year hope good semester,happi new year hope good semest,happy new year hope good semester
4998,ham,Esplanade lor. Where else...,esplanade lor. where else...,esplanade lor else,esplanad lor els,esplanade lor else


# 📊 Step 3 – Feature Extraction (Vectorization)


With the text fully pre‑processed, the next task was to convert the cleaned messages into numerical features that a machine‑learning model can understand.  
I implemented **two classic vectorization techniques**:

| Vectorizer           | Purpose                                      | Key Output            |
|----------------------|----------------------------------------------|-----------------------|
| **CountVectorizer**  | Bag‑of‑Words: simple word counts per message | Sparse matrix of integers |
| **TfidfVectorizer**  | TF‑IDF: weighted counts that down‑rank frequent words and up‑rank rare, informative words | Sparse matrix of floats |

---

### 🛠 Code Summary

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag‑of‑Words
bow_vec = CountVectorizer()
X_bow   = bow_vec.fit_transform(df['clean_text'])       # shape → (n_messages, n_words)

# TF‑IDF
tfidf_vec = TfidfVectorizer()
X_tfidf   = tfidf_vec.fit_transform(df['clean_text'])   # shape → (n_messages, n_words)
```

Both `X_bow` and `X_tfidf` are **sparse matrices** where  
*rows* = messages, *columns* = vocabulary terms, *values* = word importance (counts or TF‑IDF scores).

> **Note:** No explicit loop is required; the vectorizers internally iterate over every message and build the full matrix in one pass.
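
To show what the two matrices contain, here is a plain-Python sketch on a three-message toy corpus. It assumes the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` applies by default, and it tokenizes with `split()`, which is simpler than the vectorizers' own token pattern. The corpus and variable names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned messages (illustrative, not from the dataset).
docs = ["free entry win", "ok lar", "free free call"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag-of-Words: one row per message, one column per vocabulary term, raw counts.
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(counts):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [count * idf[term] for count, term in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm if norm else 0.0 for w in weighted]

tfidf_rows = [tfidf_row(row) for row in bow]

print(vocab)   # ['call', 'entry', 'free', 'lar', 'ok', 'win']
print(bow[2])  # [1, 0, 2, 0, 0, 0]
print(idf["free"] < idf["call"])  # True: 'free' occurs in more messages
```

In this toy corpus `free` appears in two of three messages, so its IDF is lower than that of `call`, which appears in only one.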

---

### 🧠 What I Learned
- How Bag‑of‑Words provides a straightforward baseline representation.
- How TF‑IDF improves on Bag‑of‑Words by reducing the weight of words that appear in many messages and highlighting rarer, more informative words.
- Why vectorizers handle entire text corpora without manual loops, returning efficient **sparse matrices** ready for model training.


### Bag of Words

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# We'll use the clean_text column (already lowercased & cleaned)
vectorizer = CountVectorizer()

X_bow = vectorizer.fit_transform(df['clean_text'])


In [16]:
X_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 42959 stored elements and shape (5572, 7197)>

In [17]:
print(X_bow)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 42959 stored elements and shape (5572, 7197)>
  Coords	Values
  (0, 2526)	1
  (0, 3256)	1
  (0, 4665)	1
  (0, 1368)	1
  (0, 427)	1
  (0, 833)	1
  (0, 2600)	1
  (0, 7049)	1
  (0, 3371)	1
  (0, 831)	1
  (0, 1109)	1
  (0, 2566)	1
  (0, 210)	1
  (0, 6853)	1
  (1, 4298)	1
  (1, 3404)	1
  (1, 3226)	1
  (1, 6959)	1
  (1, 4323)	1
  (2, 2345)	1
  (2, 1954)	2
  (2, 7009)	1
  (2, 1213)	1
  (2, 6970)	1
  (2, 2080)	2
  :	:
  (5567, 1843)	1
  (5567, 4724)	1
  (5568, 2841)	1
  (5568, 2536)	1
  (5568, 2332)	1
  (5568, 1983)	1
  (5569, 3971)	1
  (5569, 4611)	1
  (5569, 6051)	1
  (5570, 2345)	1
  (5570, 6891)	1
  (5570, 3501)	1
  (5570, 4149)	1
  (5570, 6679)	1
  (5570, 5745)	1
  (5570, 1897)	1
  (5570, 2451)	1
  (5570, 866)	1
  (5570, 2653)	1
  (5570, 3087)	1
  (5570, 58)	1
  (5570, 651)	1
  (5571, 4070)	1
  (5571, 6516)	1
  (5571, 5230)	1


In [18]:
print(vectorizer.get_feature_names_out()[:20])  # first 20 words


['aa' 'aah' 'aaniye' 'aaooooright' 'aathi' 'ab' 'abbey' 'abdomen' 'abeg'
 'abel' 'aberdeen' 'abi' 'ability' 'abiola' 'abj' 'able' 'abnormally'
 'aboutas' 'abroad' 'absence']


In [19]:
bow_df = pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out())

# Preview first few rows
print(bow_df.head())


   aa  aah  aaniye  aaooooright  aathi  ab  abbey  abdomen  abeg  abel  ...  \
0   0    0       0            0      0   0      0        0     0     0  ...   
1   0    0       0            0      0   0      0        0     0     0  ...   
2   0    0       0            0      0   0      0        0     0     0  ...   
3   0    0       0            0      0   0      0        0     0     0  ...   
4   0    0       0            0      0   0      0        0     0     0  ...   

   zebra  zed  zeros  zhong  zindgi  zoe  zogtorius  zoom  zouk  zyada  
0      0    0      0      0       0    0          0     0     0      0  
1      0    0      0      0       0    0          0     0     0      0  
2      0    0      0      0       0    0          0     0     0      0  
3      0    0      0      0       0    0          0     0     0      0  
4      0    0      0      0       0    0          0     0     0      0  

[5 rows x 7197 columns]


### TF-IDF

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()


In [22]:
X_tfidf = tfidf.fit_transform(df['clean_text'])
print(X_tfidf.shape)

(5572, 7197)


In [23]:
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
print(tfidf_df.head())

    aa  aah  aaniye  aaooooright  aathi   ab  abbey  abdomen  abeg  abel  ...  \
0  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0  ...   
1  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0  ...   
2  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0  ...   
3  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0  ...   
4  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0  ...   

   zebra  zed  zeros  zhong  zindgi  zoe  zogtorius  zoom  zouk  zyada  
0    0.0  0.0    0.0    0.0     0.0  0.0        0.0   0.0   0.0    0.0  
1    0.0  0.0    0.0    0.0     0.0  0.0        0.0   0.0   0.0    0.0  
2    0.0  0.0    0.0    0.0     0.0  0.0        0.0   0.0   0.0    0.0  
3    0.0  0.0    0.0    0.0     0.0  0.0        0.0   0.0   0.0    0.0  
4    0.0  0.0    0.0    0.0     0.0  0.0        0.0   0.0   0.0    0.0  

[5 rows x 7197 columns]


In [24]:
df.head()

Unnamed: 0,label,text,text_lower,clean_text,stemmed_text,lemmatized_text
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go jurong point crazy available bugis n great ...,go jurong point crazi avail bugi n great world...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,ok lar joke wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry wkly comp win fa cup final tkts may...,free entri wkli comp win fa cup final tkt may ...,free entry wkly comp win fa cup final tkts may...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say early hor u c already say,u dun say earli hor u c alreadi say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah think goes usf lives around though,nah think goe usf live around though,nah think go usf life around though


# 🤖 Step 4 – Model Training


In [27]:
# Core
import pandas as pd
import numpy as np

# Models + tools
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)


In [28]:
# X_bow and X_tfidf were created in Step 3
X_bow                      # Bag‑of‑Words sparse matrix
X_tfidf                # TF‑IDF sparse matrix

y = df['label']                    # 'spam' or 'ham'


In [30]:
Xbow_train, Xbow_test, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42)

# The label splits are discarded here: the same row count and random_state
# give the same row indices as the split above
Xtf_train, Xtf_test, _, _ = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42)
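
The two calls above line up because a seeded shuffle of the same number of rows always yields the same order. The sketch below illustrates that idea with the standard library. It is not scikit-learn's implementation, and `split_indices` is a hypothetical helper written for this example.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a seeded RNG; the first n_test become the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)

train_a, test_a = split_indices(n_rows=10, test_size=0.2, seed=42)
train_b, test_b = split_indices(n_rows=10, test_size=0.2, seed=42)

print(test_a == test_b)  # True: same seed and same length give the same rows
```

`train_test_split` also accepts several arrays in a single call and splits them with the same indices, which avoids relying on two matching calls.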


## Model A — Naive Bayes + BoW

In [32]:
nb = MultinomialNB()
nb.fit(Xbow_train, y_train)

y_pred_nb = nb.predict(Xbow_test)

print("🔹 Naive Bayes (Bag‑of‑Words)")
print("Accuracy :", accuracy_score(y_test, y_pred_nb))
print(confusion_matrix(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))


🔹 Naive Bayes (Bag‑of‑Words)
Accuracy : 0.9695067264573991
[[941  25]
 [  9 140]]
              precision    recall  f1-score   support

         ham       0.99      0.97      0.98       966
        spam       0.85      0.94      0.89       149

    accuracy                           0.97      1115
   macro avg       0.92      0.96      0.94      1115
weighted avg       0.97      0.97      0.97      1115



### 🤖 Model 1 – Naive Bayes (Using Bag-of-Words)

After converting the SMS messages into Bag-of-Words vectors, I trained a **Multinomial Naive Bayes** classifier to detect spam messages.

---

### 📊 Model Evaluation Results

**Accuracy:** `96.95%`  
**Test Samples:** `1,115`

| Metric       | Ham (Not Spam) | Spam        |
|--------------|----------------|-------------|
| Precision    | 0.99           | 0.85        |
| Recall       | 0.97           | 0.94        |
| F1-Score     | 0.98           | 0.89        |
| Support      | 966            | 149         |

---

### 📋 Confusion Matrix

|                | Predicted Ham | Predicted Spam |
|----------------|---------------|----------------|
| **Actual Ham** | 941           | 25             |
| **Actual Spam**| 9             | 140            |

---

### 🧠 What It Means

- ✅ **941 ham messages** were correctly classified as ham  
- ✅ **140 spam messages** were correctly caught as spam  
- ❌ **25 ham messages** were incorrectly marked as spam (false positives)  
- ❌ **9 spam messages** slipped through as ham (false negatives)

---

### 🔍 Interpretation

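- The model has **high precision for ham (0.99)**: when it predicts ham it is almost always right, so little spam slips through (9 of the 950 messages predicted as ham).
- It also has **strong recall for spam (0.94)**, meaning it catches most of the spam.
- **Precision for spam is lower (0.85)**: 25 of the 165 messages flagged as spam were actually ham.
- **Overall F1-score for spam is 0.89**, a solid balance between spam precision and recall.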
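
The precision, recall, and F1 figures can be recomputed directly from the confusion matrix, treating spam as the positive class. A small sketch follows; the helper name `binary_metrics` is made up for illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Confusion matrix from the Naive Bayes run: rows are actual, columns are predicted.
# [[941, 25],   actual ham:  941 predicted ham, 25 predicted spam
#  [  9, 140]]  actual spam:   9 predicted ham, 140 predicted spam
nb_metrics = binary_metrics(tn=941, fp=25, fn=9, tp=140)

print({name: round(value, 4) for name, value in nb_metrics.items()})
# {'precision': 0.8485, 'recall': 0.9396, 'f1': 0.8917, 'accuracy': 0.9695}
```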

---

### 📌 Summary

The Naive Bayes classifier paired with Bag-of-Words achieved **97% accuracy**, with very few spam messages missed and only a small number of false alarms. It's a strong baseline model for text classification.




## Model B — Logistic Regression + TF‑IDF

In [35]:
lr = LogisticRegression(max_iter=1000)
lr.fit(Xtf_train, y_train)

y_pred_lr = lr.predict(Xtf_test)

print("\n🔹 Logistic Regression (TF‑IDF)")
print("Accuracy :", accuracy_score(y_test, y_pred_lr))
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))



🔹 Logistic Regression (TF‑IDF)
Accuracy : 0.9596412556053812
[[963   3]
 [ 42 107]]
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       0.97      0.72      0.83       149

    accuracy                           0.96      1115
   macro avg       0.97      0.86      0.90      1115
weighted avg       0.96      0.96      0.96      1115



### 🤖 Model 2 – Logistic Regression (Using TF-IDF)

For the second model, I used **Logistic Regression** trained on TF-IDF vectors.  
Logistic Regression on TF-IDF features is a common baseline for text classification. Here it was trained with default settings apart from `max_iter=1000`, with no hyperparameter tuning.

---

### 📊 Model Evaluation Results

**Accuracy:** `95.96%`  
**Test Samples:** `1,115`

| Metric       | Ham (Not Spam) | Spam        |
|--------------|----------------|-------------|
| Precision    | 0.96           | 0.97        |
| Recall       | 1.00           | 0.72        |
| F1-Score     | 0.98           | 0.83        |
| Support      | 966            | 149         |

---

### 📋 Confusion Matrix

|                | Predicted Ham | Predicted Spam |
|----------------|---------------|----------------|
| **Actual Ham** | 963           | 3              |
| **Actual Spam**| 42            | 107            |

---

### 🧠 What It Means

- ✅ **963 ham messages** correctly predicted
- ✅ **107 spam messages** correctly caught
- ❌ **42 spam messages missed**
- ❌ **3 ham messages wrongly flagged**

---

### 🔍 Interpretation

- **Precision for spam is excellent (0.97)** — very few false positives (almost no legit messages flagged as spam).
- But **recall for spam is lower (0.72)** — the model missed quite a few spam messages.
- F1-score for spam is 0.83 — lower than Naive Bayes (which had 0.89).

---

### 📌 Summary

Logistic Regression with TF-IDF produced **fewer false alarms** than Naive Bayes, but also **missed more spam**.  
This model may be better suited when **minimizing incorrect spam flags** is more important than catching every single spam message.
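
If avoiding false alarms matters most, another lever (not used in this notebook) is the probability threshold above which a message is flagged as spam. The sketch below uses made-up spam probabilities and labels to show the effect. With a real model, the probabilities would come from `predict_proba`.

```python
# Hypothetical (spam_probability, true_label) pairs; not taken from the notebook.
scored = [
    (0.95, "spam"), (0.80, "spam"), (0.60, "ham"), (0.55, "spam"),
    (0.30, "spam"), (0.20, "ham"), (0.10, "ham"), (0.05, "ham"),
]

def spam_precision_recall(scored, threshold):
    """Flag a message as spam when its probability reaches the threshold."""
    tp = sum(1 for p, label in scored if p >= threshold and label == "spam")
    fp = sum(1 for p, label in scored if p >= threshold and label == "ham")
    fn = sum(1 for p, label in scored if p < threshold and label == "spam")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.5, 0.7):
    precision, recall = spam_precision_recall(scored, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```

Raising the threshold removes the one false alarm in this toy data but misses one more spam message, the same trade-off seen between the two models above.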


# 🧪 Step 5 – Custom Message Testing

In [38]:
def predict_message():
    """Ask the user for a message, then show predictions from both models."""
    raw_msg = input("\nType an SMS message → ")
    if not raw_msg.strip():
        print("No text entered. Try again.")
        return

    msg_clean = clean_text(raw_msg.lower())

    # Reuse the vectorizers fitted in Step 3
    vec_bow   = vectorizer.transform([msg_clean])      # CountVectorizer (Bag-of-Words)
    vec_tfidf = tfidf.transform([msg_clean])           # TfidfVectorizer

    pred_nb   = nb.predict(vec_bow)[0]
    conf_nb   = nb.predict_proba(vec_bow)[0].max()

    pred_lr   = lr.predict(vec_tfidf)[0]
    conf_lr   = lr.predict_proba(vec_tfidf)[0].max()

    print("\n================  RESULTS  =================")
    print(f"Original text :  {raw_msg}")
    print("--------------------------------------------")
    print(f"Naive Bayes (Bag‑of‑Words)   →  {pred_nb.upper()} "
          f"(confidence ≈ {conf_nb:.2f})")
    print(f"LogReg (TF‑IDF)             →  {pred_lr.upper()} "
          f"(confidence ≈ {conf_lr:.2f})")
    print("============================================\n")

# Keep prompting until the user answers 'n' (any other reply continues the loop)
while True:
    predict_message()
    cont = input("Test another message? (y/n) → ").strip().lower()
    if cont == 'n':
        print("Goodbye!")
        break
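
One behaviour worth knowing when testing hand-typed messages: `transform()` only counts tokens that exist in the fitted vocabulary, so a message made up entirely of unseen or removed tokens becomes an all-zero vector. For Multinomial Naive Bayes, an all-zero vector leaves only the class prior. That is probably why `for freeeeee` in the session below comes back as HAM with confidence ≈ 0.87, assuming `freeeeee` is not in the vocabulary. The sketch reproduces that arithmetic in plain Python. The two-word vocabulary and the word probabilities are hypothetical, while the class counts are derived from the notebook output (4,825 - 966 = 3,859 ham and 747 - 149 = 598 spam in the training split).

```python
import math

def vectorize(tokens, vocabulary):
    """Count only tokens present in the fitted vocabulary; unknown tokens are dropped."""
    counts = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            counts[idx] += 1
    return counts

def nb_posterior(counts, log_prior, log_likelihood):
    """Multinomial NB: normalize exp(log prior + sum of count * log P(word | class))."""
    scores = {
        cls: log_prior[cls] + sum(n * lp for n, lp in zip(counts, log_likelihood[cls]))
        for cls in log_prior
    }
    top = max(scores.values())
    expd = {cls: math.exp(score - top) for cls, score in scores.items()}
    total = sum(expd.values())
    return {cls: value / total for cls, value in expd.items()}

# Hypothetical vocabulary and per-class word probabilities (illustration only).
vocabulary = {"free": 0, "ok": 1}
log_likelihood = {
    "ham": [math.log(0.2), math.log(0.8)],
    "spam": [math.log(0.9), math.log(0.1)],
}

# Training-split class counts derived from the notebook output.
n_ham, n_spam = 3859, 598
log_prior = {
    "ham": math.log(n_ham / (n_ham + n_spam)),
    "spam": math.log(n_spam / (n_ham + n_spam)),
}

counts = vectorize(["freeeeee"], vocabulary)  # unknown token gives an all-zero vector
posterior = nb_posterior(counts, log_prior, log_likelihood)
print(counts, round(posterior["ham"], 2))  # [0, 0] 0.87
```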



Type an SMS message → free

Original text :  free
--------------------------------------------
Naive Bayes (Bag‑of‑Words)   →  SPAM (confidence ≈ 0.57)
LogReg (TF‑IDF)             →  SPAM (confidence ≈ 0.68)

Test another message? (y/n) → for free

Type an SMS message → for freeeeee

Original text :  for freeeeee
--------------------------------------------
Naive Bayes (Bag‑of‑Words)   →  HAM (confidence ≈ 0.87)
LogReg (TF‑IDF)             →  HAM (confidence ≈ 0.93)

Test another message? (y/n) → get for free

Type an SMS message → get for free

Original text :  get for free
--------------------------------------------
Naive Bayes (Bag‑of‑Words)   →  HAM (confidence ≈ 0.53)
LogReg (TF‑IDF)             →  HAM (confidence ≈ 0.56)

Test another message? (y/n) → y

Type an SMS message → why is this free?

Original text :  why is this free?
--------------------------------------------
Naive Bayes (Bag‑of‑Words)   →  SPAM (confidence ≈ 0.57)
LogReg (TF‑IDF)             →  SPAM (confidence ≈