# Classification and NLP

In [None]:
!pip install nltk

## Step 1. Read dataset

We will use an SMS dataset where each message is labeled as **spam** or **ham** (not spam).

This dataset will help us understand how text data is handled in machine learning and how NLP preprocessing fits into a classification pipeline.

In [None]:
import pandas as pd


df = pd.read_csv("../datasets/sms.csv")
df.head()

## Step 2. Set Up NLP Tools and Simple Preprocessing

Natural Language Processing often relies on external language resources such as:
- Tokenizers
- Stopword lists
- Word dictionaries

NLTK provides these resources, which need to be downloaded once before use.

In [None]:
# Download necessary data from nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

### Tokenization

Tokenization is the process of breaking text into smaller units called **tokens** (usually words).

Here, we compare:
- A simple string split
- NLTK’s tokenizer, which handles punctuation and contractions better

In [None]:
from nltk.tokenize import word_tokenize

text = "Don't split contractions badly! It's important."
text = text.lower()
basic_split = text.split()
nltk_tokens = word_tokenize(text)

print(f"Basic split: {basic_split}")
print(f"NLTK tokens: {nltk_tokens}")

### Removing Stop Words

Stop words are very common words such as *the, is, and, to*.

These words usually do not add much meaning for tasks like spam detection, so we often remove them to reduce noise in the data.

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text.lower())
filtered_tokens = [word for word in tokens if word not in stop_words]

print(f"Original tokens: {tokens}")
print(f"Without stop words: {filtered_tokens}")

### Stemming

Stemming reduces words to a common stem by stripping suffixes with simple rules.

The goal is to treat similar words (e.g., *running* and *runs*) as the same feature, even if the resulting word is not grammatically correct.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words_to_stem = ['running', 'runs', 'easily', 'studies', 'happiness']

print("Stemming examples:")
for word in words_to_stem:
    print(f"{word} → {stemmer.stem(word)}")

### Lemmatization

Lemmatization is similar to stemming, but it converts words into their **dictionary form** (lemma).

Compared to stemming, lemmatization:
- Is more accurate
- Is more linguistically correct
- Is usually slower

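We compare both approaches to understand the difference. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verbs such as *running* and adjectives such as *better* come back unchanged in the examples below.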

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("Lemmatization examples:")
for word in words_to_stem:
    print(f"{word} → {lemmatizer.lemmatize(word)}")

# Compare stemming vs lemmatization
print("\nStemming vs Lemmatization:")
comparison_words = ['better', 'running', 'studies', 'geese', 'feet']
for word in comparison_words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word)
    print(f"{word} → Stem: {stem}, Lemma: {lemma}")



### Text Cleaning Pipeline

Instead of applying each preprocessing step manually every time, we combine them into a single function.

This function performs:
- Lowercasing
- Removing special characters
- Tokenization
- Stopword removal
- Stemming

This makes preprocessing reusable and consistent across the dataset.

In [None]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Setup preprocessing tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()


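def clean_text(text):
    """Lowercase, strip non-alphanumerics, tokenize, drop stop words, and stem."""
    # Normalize case and trim surrounding whitespace
    text = text.lower().strip()

    # Remove everything except letters, digits, and whitespace,
    # then collapse repeated whitespace into single spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()

    # Split into tokens and drop common stop words
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]

    # Reduce each remaining token to its stem
    tokens = [stemmer.stem(word) for word in tokens]

    return ' '.join(tokens)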


In [None]:
original = "      This is very &&!Good Text. Visit www.testdoc.com for more info or mail us at a@gmail.com! or 981111111 "
cleaned = clean_text(original)

cleaned

In [None]:
df['sms'] = df['sms'].apply(clean_text)
df.head()

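## Step 3. Train-Test Split

We hold out 20% of the messages as a test set. Passing `stratify=df['class']` keeps the spam/ham ratio the same in both parts.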

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['sms'],
    df['class'],
    test_size=0.2,
    random_state=42,
    stratify=df['class']
)
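
To see what `stratify` does, the sketch below splits a small made-up dataset and compares the class proportions in each part. The toy data (80 ham, 20 spam) and the `toy_*` names are assumptions for illustration only; they are not the real SMS dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham and 20 spam messages (not the real dataset)
toy = pd.DataFrame({
    "sms": [f"message {i}" for i in range(100)],
    "class": ["ham"] * 80 + ["spam"] * 20,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy["sms"],
    toy["class"],
    test_size=0.2,
    random_state=42,
    stratify=toy["class"],
)

# With stratify, both parts keep the overall ham/spam proportions
train_ratio = toy_y_train.value_counts(normalize=True)
test_ratio = toy_y_test.value_counts(normalize=True)

print(train_ratio)
print(test_ratio)
```

Without `stratify`, a random split of a small or imbalanced dataset can leave the test set with noticeably more or fewer spam messages than the training set.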


## Step 4. Feature Extraction: Representing Text as Numbers

Machine learning models cannot work directly with text.
They only understand numbers.

Feature extraction is the process of converting raw text into a numerical representation that a model can learn from.

In this step, we will use a simple and widely used approach called **Bag of Words (BoW)** to represent text as numbers.


### Bag of Words (BoW)

The Bag of Words approach represents text by:
- Building a vocabulary of all unique words in the dataset
- Counting how many times each word appears in a message

Important points:
- Word order is ignored
- Only word frequency matters
- Each word becomes a feature (column)

In scikit-learn, this is implemented using **CountVectorizer**.
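
The two steps above can be sketched by hand in plain Python. This is only a simplified illustration with two made-up messages; `to_counts` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

# Two made-up messages, used only for illustration
messages = ["free prize free entry", "see you at lunch"]

# 1. Build the vocabulary: every unique word, sorted alphabetically
vocabulary = sorted({word for msg in messages for word in msg.split()})


def to_counts(message):
    """Hypothetical helper: count each vocabulary word in one message."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


# 2. Turn each message into a row of word counts (one column per word)
bow_matrix = [to_counts(msg) for msg in messages]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

`CountVectorizer` follows the same idea, but by default it also lowercases the text and ignores tokens that are only one character long.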

### Convert Text into Numeric Features

We use `CountVectorizer` to:
- Learn the vocabulary from the training data
- Convert each message into a vector of word counts

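The vocabulary is learned from the training data only (`fit_transform`). The test data is then converted with that same vocabulary (`transform`), so words that appear only in the test set are ignored.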

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

### Inspect the Vocabulary


In [None]:
vectorizer.get_feature_names_out()[100:120]

In [None]:
X_train_vec_df = pd.DataFrame(
    X_train_vec[:5].toarray(),
    columns=vectorizer.get_feature_names_out()
)

X_train_vec_df

In [None]:
X_train_vec_df.loc[:, (X_train_vec_df != 0).any(axis=0)]

### Possible Improvements (Not Implemented Here)

In practice, models can be improved by adding more features, such as:
- TF-IDF instead of raw word counts
- N-grams (word pairs or triples)
- Message length
- Number of digits or special characters
- Presence of URLs or phone numbers

For this workshop, we keep the feature extraction simple, but students are encouraged to experiment with these ideas.

## Step 5. Model Training

Now that the text has been converted into numerical features, we can train a classification model.

We will use **Logistic Regression**, a commonly used baseline model for text classification problems such as spam detection.


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train_vec, y_train)

### Model Accuracy

In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
accuracy

### Predict on New Messages

In [None]:
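message = [
    "Free entry in 2 a weekly competition to win prizes",
    "URGENT! You have won a 1 week FREE membership",
]

# Apply the same cleaning used on the training data before vectorizing
message_clean = [clean_text(m) for m in message]
message_vec = vectorizer.transform(message_clean)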

prediction = model.predict(message_vec)
prediction


In [None]:
proba = model.predict_proba(message_vec)
proba
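
The columns returned by `predict_proba` follow the order of the model's `classes_` attribute, so it helps to label them. The sketch below does this on made-up data; the `toy_*` names and messages are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, already-cleaned messages and labels for illustration
toy_msgs = [
    "free prize win", "win free cash", "free cash prize",
    "lunch tomorrow", "see you tomorrow", "call you after lunch",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression().fit(
    toy_vectorizer.fit_transform(toy_msgs), toy_labels
)

new_msgs = ["free cash prize", "see you after lunch"]

# Label each probability column with the class it belongs to
proba_table = pd.DataFrame(
    toy_model.predict_proba(toy_vectorizer.transform(new_msgs)),
    columns=toy_model.classes_,
    index=new_msgs,
)

print(proba_table.round(3))
```

Each row sums to 1, and `predict` returns the class with the larger probability.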

### Class Distribution

Before relying only on accuracy, it is important to check how balanced the dataset is.

If one class appears much more frequently than the other, accuracy alone can be misleading.

In [None]:
df["class"].value_counts()
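
To illustrate why accuracy can mislead, consider a "model" that always predicts the majority class. The label counts below (90 ham, 10 spam) are hypothetical and chosen only for illustration; they are not taken from the real dataset.

```python
# Hypothetical label counts chosen for illustration (not the real dataset)
y_true = ["ham"] * 90 + ["spam"] * 10

# A "model" that ignores its input and always predicts the majority class
y_always_ham = ["ham"] * len(y_true)

baseline_accuracy = sum(
    t == p for t, p in zip(y_true, y_always_ham)
) / len(y_true)

spam_caught = sum(
    t == "spam" and p == "spam" for t, p in zip(y_true, y_always_ham)
)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Spam messages caught: {spam_caught} of {y_true.count('spam')}")
```

The accuracy looks high even though no spam is detected, which is why the metrics below look at the spam class specifically.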

### F1 Score

The F1 score combines **precision** and **recall** into a single metric.

Here, we focus on the F1 score for the **spam** class, since spam is the class we want to detect and correctly identifying it usually matters more than correctly identifying ham.

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred, pos_label="spam")

In [None]:
import pandas as pd
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
labels = ["ham", "spam"]
cm = confusion_matrix(y_test, y_pred, labels=labels)

pd.DataFrame(
    cm,
    index=["Actual ham", "Actual spam"],
    columns=["Pred ham", "Pred spam"]
)
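
Precision, recall, and F1 for the spam class can all be read off a confusion matrix laid out like the table above. The counts in this sketch are hypothetical and only show the arithmetic; they are not results from the model.

```python
# Hypothetical counts (rows: actual ham/spam, columns: predicted ham/spam)
example_cm = [
    [950, 10],   # actual ham:  predicted ham, predicted spam
    [20, 130],   # actual spam: predicted ham, predicted spam
]

tn, fp = example_cm[0]  # ham kept as ham, ham wrongly flagged as spam
fn, tp = example_cm[1]  # spam missed, spam correctly flagged

# Precision: of the messages flagged as spam, how many really were spam
precision = tp / (tp + fp)

# Recall: of the real spam messages, how many were flagged
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```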
