# Binary Text Classification

This notebook compares several approaches to binary text classification using traditional machine learning techniques.

We evaluate and compare:
- CountVectorizer and TF-IDF
- Logistic Regression, SVM, and Naive Bayes classifiers
- Performance on common evaluation metrics
- Feature importance and insights from the vectorizer


In [None]:
# Setup Kaggle API (upload kaggle.json)
# Before you run this in Colab, make sure your Kaggle API token is uploaded
from google.colab import files
files.upload()  # Upload your kaggle.json

In [None]:
# Move kaggle.json to the right folder
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Download the SMS Spam Collection Dataset
!kaggle datasets download -d uciml/sms-spam-collection-dataset
!unzip sms-spam-collection-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
License(s): unknown
Downloading sms-spam-collection-dataset.zip to /content
  0% 0.00/211k [00:00<?, ?B/s]
100% 211k/211k [00:00<00:00, 556MB/s]
Archive:  sms-spam-collection-dataset.zip
  inflating: spam.csv                


# Introduction & Dataset
- **Objective:** Classify texts into two categories (binary classification) using multiple models and feature extraction methods.

- **Dataset:** The SMS Spam Collection dataset, which contains text messages labeled as one of two classes: spam or ham (legitimate).

We’ll begin by loading and cleaning the data.

## Loading Data

In [None]:
# Import Required Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [27]:
# Load the dataset (comma-separated CSV; the encoding is passed explicitly)
df = pd.read_csv("spam.csv", encoding='ISO-8859-1')

# Keep only relevant columns; .copy() avoids a SettingWithCopyWarning on later assignments
df = df[['v1', 'v2']].copy()
df.columns = ['label', 'text']

In [28]:
# Encode labels: ham = 0, spam = 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Quick check
df.head(3)

|   | label | text |
| - | ----- | ---- |
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |


In [29]:
df.shape

(5572, 2)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

## Exploring Fitted Vectorizers

We convert raw text into numerical representations using:

- **CountVectorizer**: Converts text to a matrix of token counts.
- **TF-IDF Vectorizer**: Weights each term by its frequency within a document and its inverse document frequency across the corpus, so terms that appear in many documents count for less.
- **n-grams**: We explore unigrams, bigrams, and trigrams to improve feature richness.

We'll also explore how the fitted vectorizers tokenize and transform text using the `.vocabulary_` and `.idf_` attributes and the `get_feature_names_out()` method.


### TF-IDF

In [None]:
# Fit vectorizer
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(X_train)

In [None]:
# Rows = number of documents
# Columns = number of unique tokens kept after stop-word removal
print("TF-IDF Matrix Shape:", X_tfidf.shape)

# Vocabulary List (Words learned by vectorizer)
print("Vocabulary Size:", len(tfidf.vocabulary_))
print("Sample Words:", list(tfidf.vocabulary_.keys())[:10])

TF-IDF Matrix Shape: (4457, 7472)
Vocabulary Size: 7472
Sample Words: ['boat', 'moms', 'check', 'yo', 'half', 'naked', 'bank', 'granite', 'issues', 'strong']


In [None]:
# Dictionary: word -> index
print("Word to Index Mapping:")
for word, idx in list(tfidf.vocabulary_.items())[:5]:
    print(f"{word} → {idx}")

Word to Index Mapping:
boat → 1371
moms → 4416
check → 1706
yo → 7415
half → 3210


In [None]:
# Get feature names
feature_names = tfidf.get_feature_names_out()

print("Index to Word Mapping:")
for i in range(1000,1005):
    print(f"{i} → {feature_names[i]}")

Index to Word Mapping:
1000 → april
1001 → aproach
1002 → apt
1003 → aptitude
1004 → aquarius


In [None]:
# View first row (sparse → dense)
vector_dense = X_tfidf[0].toarray()[0]

# Show non-zero values only
print("Non-zero TF-IDF scores in first document:")
for i in np.where(vector_dense > 0)[0]:
    print(f"{feature_names[i]}: {vector_dense[i]:.4f}")

Non-zero TF-IDF scores in first document:
boat: 0.4658
check: 0.3432
half: 0.3487
moms: 0.4528
naked: 0.4658
yo: 0.3487


In [None]:
# Get top N important words
top_n = 5
sorted_indices = np.argsort(vector_dense)[::-1][:top_n]
print(f"Top {top_n} TF-IDF words in first doc:")
for idx in sorted_indices:
    print(f"{feature_names[idx]} → {vector_dense[idx]:.4f}")

Top 5 TF-IDF words in first doc:
boat → 0.4658
naked → 0.4658
moms → 0.4528
yo → 0.3487
half → 0.3487


### CountVectorizer
It converts text into a sparse matrix of word counts (frequency of each word in each document).


In [None]:
cv = CountVectorizer(stop_words='english')
X_cv = cv.fit_transform(X_train)

In [None]:
print("Count Matrix Shape:", X_cv.shape)
print("Vocabulary Size:", len(cv.vocabulary_))
print("Sample Vocabulary Items:")

Count Matrix Shape: (4457, 7472)
Vocabulary Size: 7472
Sample Vocabulary Items:


In [None]:
for word, idx in list(cv.vocabulary_.items())[:5]:
    print(f"{word} → {idx}")

boat → 1371
moms → 4416
check → 1706
yo → 7415
half → 3210


In [None]:
feature_names = cv.get_feature_names_out()
for i in range(1000,1005):
    print(f"{i} → {feature_names[i]}")

1000 → april
1001 → aproach
1002 → apt
1003 → aptitude
1004 → aquarius


In [None]:
# Convert sparse to dense array
vector_dense = X_cv[0].toarray()[0]

# Print only non-zero values
print("Non-zero word counts in first document:")
for i in np.where(vector_dense > 0)[0]:
    print(f"{feature_names[i]}: {vector_dense[i]}")

Non-zero word counts in first document:
boat: 1
check: 1
half: 1
moms: 1
naked: 1
yo: 1


In [None]:
# Getting Top 5 words
top_n = 5
sorted_indices = np.argsort(vector_dense)[::-1][:top_n]
print(f"Top {top_n} words in first doc:")
for idx in sorted_indices:
    print(f"{feature_names[idx]} → {vector_dense[idx]}")

Top 5 words in first doc:
boat → 1
naked → 1
yo → 1
check → 1
half → 1


### TF-IDF with N-grams (e.g., Bigrams & Trigrams)

N-grams are contiguous sequences of N words from a text. For the text "hello there friend":

- Unigrams → ["hello", "there", "friend"]

- Bigrams → ["hello there", "there friend"]

- Trigrams → ["hello there friend"]
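
The definition above can be sketched in a few lines of plain Python. The helper `word_ngrams` is a hypothetical name used only for illustration; it splits on whitespace, which is simpler than the tokenization scikit-learn applies.

```python
def word_ngrams(text, n):
    """Return all contiguous n-word sequences from `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "hello there friend"
for n in (1, 2, 3):
    print(n, word_ngrams(sentence, n))
```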




| Vectorizer             | `ngram_range` | `analyzer` | When Useful                           |
| ---------------------- | ------------- | ---------- | ------------------------------------- |
| TF-IDF (word-level)    | `(1, 1)`      | `'word'`   | General use                           |
| TF-IDF with bigrams    | `(2, 2)`      | `'word'`   | Contextual meaning (phrases)          |
| TF-IDF with trigrams   | `(3, 3)`      | `'word'`   | Longer dependencies                   |
| TF-IDF character-level | `(3, 5)`      | `'char'`   | Typos, style, or spam/fuzzy detection |


In [None]:
# Example: Using bigrams (2-grams)
tfidf_bigram = TfidfVectorizer(ngram_range=(2, 2), stop_words='english')
X_bi = tfidf_bigram.fit_transform(X_train)

print("TF-IDF bigram matrix shape:", X_bi.shape)

# Show some bigram features
feature_names = tfidf_bigram.get_feature_names_out()
print("Sample bigrams:")
print(feature_names[:10])

TF-IDF bigram matrix shape: (4457, 24042)
Sample bigrams:
['00 easter' '00 sub' '00 subs' '000 bonus' '000 cash' '000 homeowners'
 '000 pounds' '000 prize' '000 xmas' '000pes 48']


In [None]:
# Trigrams (3-word sequences)
tfidf_trigram = TfidfVectorizer(ngram_range=(3, 3), stop_words='english')
X_tri = tfidf_trigram.fit_transform(X_train)

print("Trigram matrix shape:", X_tri.shape)
print(tfidf_trigram.get_feature_names_out()[:10])

Trigram matrix shape: (4457, 23078)
['00 easter prize' '00 sub 16' '00 subs 16' '000 bonus caller'
 '000 cash 000' '000 cash await' '000 cash needs' '000 homeowners tenants'
 '000 pounds txt' '000 prize claim']


In [None]:
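doc_idx = 0
dense = X_bi[doc_idx].toarray()[0]
feature_names = tfidf_bigram.get_feature_names_out()

# Use a separate name so the dataset DataFrame `df` is not overwritten
bigram_scores = pd.DataFrame(dense, index=feature_names, columns=["tfidf"])
bigram_scores = bigram_scores[bigram_scores['tfidf'] > 0].sort_values(by="tfidf", ascending=False)
bigram_scores.head(10)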

|   | tfidf |
| - | ----- |
| boat moms | 0.447214 |
| check yo | 0.447214 |
| half naked | 0.447214 |
| moms check | 0.447214 |
| yo half | 0.447214 |


### TF-IDF with Character-level Features

Instead of words, we split input into character n-grams. Useful for:

- Typos

- Spam detection

- Languages with no word spacing (e.g., Chinese)

- Style-based classification (e.g., author detection)
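
As a rough illustration of why character n-grams tolerate typos, the sketch below compares a word with a misspelled variant. `char_ngrams` is a hypothetical helper and the word pair is invented for the example.

```python
def char_ngrams(text, n):
    """Return the set of contiguous character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

clean = "winner"
typo = "winnner"  # hypothetical misspelling

# As whole words the two strings do not match at all...
word_match = clean == typo

# ...but most of their character trigrams are shared
shared = char_ngrams(clean, 3) & char_ngrams(typo, 3)

print("Exact word match:", word_match)
print("Shared trigrams:", sorted(shared))
```

A word-level vectorizer would treat the two spellings as unrelated features, whereas a character-level one gives them overlapping representations.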

In [None]:
tfidf_char = TfidfVectorizer(analyzer='char', ngram_range=(3, 5))  # Character 3-grams to 5-grams
X_char = tfidf_char.fit_transform(X_train)

print("Character-level TF-IDF shape:", X_char.shape)
print("Sample character n-grams:")
print(tfidf_char.get_feature_names_out()[1000:1020])

Character-level TF-IDF shape: (4457, 133947)
Sample character n-grams:
[' 117' ' 1172' ' 118' ' 118p' ' 11?' ' 11m' ' 11mt' ' 12' ' 12 ' ' 12 2'
 ' 12 a' ' 12 h' ' 12 m' ' 12 r' ' 12,' ' 12,0' ' 120' ' 1205' ' 121'
 ' 121 ']


In [None]:
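dense = X_char[0].toarray()[0]
feature_names = tfidf_char.get_feature_names_out()

# Use a separate name so the dataset DataFrame `df` is not overwritten
char_scores = pd.DataFrame(dense, index=feature_names, columns=["tfidf"])
char_scores = char_scores[char_scores['tfidf'] > 0].sort_values(by="tfidf", ascending=False)
char_scores.head(10)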

|   | tfidf |
| - | ----- |
| yo. | 0.097712 |
| boat. | 0.097712 |
| e boa | 0.097712 |
| alf n | 0.097712 |
| me bo | 0.097712 |
| lf na | 0.097712 |
| lf n | 0.097712 |
| m hal | 0.097712 |
| f nak | 0.097712 |
| oat. | 0.097712 |


## TF-IDF + Logistic Regression
We train a Logistic Regression classifier using the extracted features.

Logistic Regression is a strong linear baseline model that works well on high-dimensional text features.

We also:
- Check model coefficients
- Evaluate with classification metrics

In [33]:
# Vectorize with TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [34]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train_tfidf, y_train)
y_pred_logreg = logreg.predict(X_test_tfidf)

print("TF-IDF + Logistic Regression")
print(classification_report(y_test, y_pred_logreg))

TF-IDF + Logistic Regression
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       965
           1       0.97      0.67      0.79       150

    accuracy                           0.95      1115
   macro avg       0.96      0.83      0.88      1115
weighted avg       0.95      0.95      0.95      1115



## CountVectorizer + Naive Bayes
We use Multinomial Naive Bayes, a probabilistic classifier commonly used in text classification due to its simplicity and efficiency.

Despite its independence assumption, it performs surprisingly well on many NLP tasks.

In contrast to Logistic Regression, which fits all feature weights jointly, Naive Bayes estimates each word's class-conditional probability separately.
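
The independence assumption means a message's class score is just the class log prior plus a sum of per-word log probabilities. The sketch below reproduces that sum by hand on a made-up four-message corpus (all `toy_*` names are hypothetical) and checks it against `predict`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = ham
toy_texts = ["free prize win", "free cash prize", "lunch at noon", "see you at lunch"]
toy_labels = np.array([1, 1, 0, 0])

toy_vec = CountVectorizer()
X_toy = toy_vec.fit_transform(toy_texts)
toy_nb = MultinomialNB().fit(X_toy, toy_labels)

# Score a new message: log P(class) + sum over words of count * log P(word | class)
new_counts = toy_vec.transform(["free prize lunch"])
manual_scores = toy_nb.class_log_prior_ + new_counts.toarray() @ toy_nb.feature_log_prob_.T
manual_pred = toy_nb.classes_[np.argmax(manual_scores, axis=1)]

print("Manual scores per class:", manual_scores)
print("Manual prediction:", manual_pred, "| predict():", toy_nb.predict(new_counts))
```

Each word contributes to the score independently of the others, which is what keeps the model simple and fast to train.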


In [37]:
# Count Vectorizer
count_vec = CountVectorizer(stop_words='english')
X_train_cv = count_vec.fit_transform(X_train)
X_test_cv = count_vec.transform(X_test)

In [40]:
X_train_cv

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 34794 stored elements and shape (4457, 7472)>

In [42]:
# Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_cv, y_train)
y_pred_nb = nb.predict(X_test_cv)

print("CountVectorizer + Naive Bayes")
print(classification_report(y_test, y_pred_nb))

CountVectorizer + Naive Bayes
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       965
           1       0.96      0.92      0.94       150

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115



## TF-IDF + Bigrams + SVM
We apply an SVM with a linear kernel to TF-IDF features built from unigrams and bigrams (`ngram_range=(1, 2)`).

SVMs work well for sparse data and are robust to high-dimensional feature spaces, making them suitable for text classification.

We use `LinearSVC` from scikit-learn and compare its results with logistic regression.
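
One optional way to package this setup is scikit-learn's `make_pipeline`, which bundles the vectorizer and the classifier so the vectorizer is only ever fitted on training text. This is a sketch on a made-up corpus (the `toy_*` names and `svm_pipeline` are hypothetical), not the approach the cells in this section take.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = spam, 0 = ham
toy_texts = [
    "win free prize now", "free cash prize claim", "claim free entry now",
    "see you at lunch", "call me after lunch", "are you home now",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Unigram + bigram TF-IDF features feeding a linear SVM
svm_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
svm_pipeline.fit(toy_texts, toy_labels)

# Raw strings go in directly; the pipeline applies the fitted vectorizer first
toy_pred = svm_pipeline.predict(["free prize now", "lunch after call"])
print(toy_pred)
```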


In [None]:
from sklearn.svm import LinearSVC

# TF-IDF with unigrams and bigrams
tfidf_bigram = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
X_train_bigram = tfidf_bigram.fit_transform(X_train)
X_test_bigram = tfidf_bigram.transform(X_test)

In [45]:
# SVM
svm = LinearSVC()
svm.fit(X_train_bigram, y_train)
y_pred_svm = svm.predict(X_test_bigram)

print("TF-IDF (1,2) + SVM")
print(classification_report(y_test, y_pred_svm))

TF-IDF (1,2) + SVM
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       965
           1       0.94      0.87      0.91       150

    accuracy                           0.98      1115
   macro avg       0.96      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115



## Char-level TF-IDF + Logistic Regression

In [46]:
# Char-level TF-IDF
char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3,6))
X_train_char = char_vectorizer.fit_transform(X_train)
X_test_char = char_vectorizer.transform(X_test)

In [47]:
# Logistic Regression
logreg_char = LogisticRegression()
logreg_char.fit(X_train_char, y_train)
y_pred_char = logreg_char.predict(X_test_char)

print("Char-level TF-IDF + Logistic Regression")
print(classification_report(y_test, y_pred_char))

Char-level TF-IDF + Logistic Regression
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.70      0.82       150

    accuracy                           0.96      1115
   macro avg       0.98      0.85      0.90      1115
weighted avg       0.96      0.96      0.96      1115



In [None]:
print("TF-IDF + Logistic Regression Accuracy:             ", accuracy_score(y_test, y_pred_logreg))
print("CountVectorizer + Naive Bayes Accuracy:            ", accuracy_score(y_test, y_pred_nb))
print("TF-IDF (1,2) + SVM Accuracy:                       ", accuracy_score(y_test, y_pred_svm))
print("Char-level TF-IDF + Logistic Regression Accuracy:  ", accuracy_score(y_test, y_pred_char))

TF-IDF + Logistic Regression Accuracy:              0.9524663677130045
CountVectorizer + Naive Bayes Accuracy:             0.9838565022421525
TF-IDF (1,2) + SVM Accuracy:                        0.9757847533632287
Char-level TF-IDF + Logistic Regression Accuracy:   0.9596412556053812
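
Accuracy alone can hide differences on the minority class. The test set has 965 ham and 150 spam messages, and in the reports above spam recall ranges from 0.67 (TF-IDF + Logistic Regression) to 0.92 (CountVectorizer + Naive Bayes). The sketch below uses made-up labels and predictions (`y_true_toy`, `toy_preds`, `model_a`, and `model_b` are hypothetical) to show two models with identical accuracy but very different recall.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced labels: 8 ham (0) and 2 spam (1)
y_true_toy = [0] * 8 + [1] * 2

toy_preds = {
    "model_a": [0] * 9 + [1],      # misses one spam message
    "model_b": [0] * 7 + [1] * 3,  # catches both spam, flags one ham
}

# Collect accuracy plus spam-class precision, recall, and F1 for each model
summary = {}
for name, y_pred in toy_preds.items():
    summary[name] = {
        "accuracy": accuracy_score(y_true_toy, y_pred),
        "precision": precision_score(y_true_toy, y_pred),
        "recall": recall_score(y_true_toy, y_pred),
        "f1": f1_score(y_true_toy, y_pred),
    }

for name, scores in summary.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Applied to the notebook's own predictions, the same loop could take `y_test` with `y_pred_logreg`, `y_pred_nb`, `y_pred_svm`, and `y_pred_char`.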
