## Theoretical Introduction to Text Classification and NLP

Text classification is one of the core tasks in Natural Language Processing (NLP).  
In this laboratory, we focus on building a machine-learning model that automatically classifies text messages as **spam** or **ham** (not spam).  

### What is Spam Classification?

Spam detection is a binary classification problem in which the goal is to:
- assign label **1** → spam,
- assign label **0** → ham (legitimate message).

It is widely used in email filtering, SMS moderation, and security systems.

### NLP Preprocessing

Raw text cannot be directly used by machine-learning models.  
We must convert it into a numerical representation through several steps:

1. **Cleaning and normalization**  
   Lowercasing, removing punctuation, removing unnecessary characters.

2. **Tokenization**  
   Splitting text into individual words.

3. **Stop-word removal**  
   Removing very common words (e.g., *the*, *is*, *and*),  
   which carry little information for classification.

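4. **Stemming**  
   Reducing words to crude roots by stripping suffixes  
   (*walking*, *walked* → *walk*).  
   Fast, but can produce artificial word forms (e.g., *studies* → *studi*).

5. **Lemmatization**  
   Reducing words to dictionary forms  
   (*cars* → *car*; *better* → *good* when the word is tagged as an adjective).  
   More accurate but slower, and it works best with part-of-speech information.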
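
The first three steps above can be sketched in plain Python. This is a minimal illustration, not the lab solution: the `preprocess` helper and the tiny `STOP_WORDS` set are assumptions made for this example (NLTK's English stop-word list is much longer), and tokenization is reduced to splitting on whitespace.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "and", "a", "to", "you", "have"}

def preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Replace punctuation, digits, and other non-letter characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("WINNER!! You have won a FREE prize, call 0800 now.")
print(tokens)
```

Note that removing digits also discards phone numbers and prices, which may themselves be useful spam signals, so this choice is worth revisiting during the exercises.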

### Vectorization

Machine-learning models require numerical input, so text must be converted into vectors:

- **Bag-of-Words (BoW)** — counts occurrences of each word.
- **TF-IDF** — measures how important a word is in a document relative to the whole dataset.

These methods transform text into a matrix that a classifier can use.
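
To make the two representations concrete, here is a hand-rolled sketch on a toy corpus of three already-tokenized messages. The corpus and the helper names (`bow_vector`, `tfidf_vector`, `doc_freq`) are invented for illustration. The TF-IDF formula used here is the basic textbook variant (relative term frequency times `log(N / df)`); scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2 normalization, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (invented for this example).
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

# The vocabulary fixes the column order of the feature matrix.
vocab = sorted({w for doc in docs for w in doc})

def bow_vector(doc):
    """Bag-of-Words: raw count of every vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

bow = [bow_vector(d) for d in docs]

# Document frequency: in how many documents does each word occur?
n_docs = len(docs)
doc_freq = {w: sum(1 for d in docs if w in d) for w in vocab}
idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}

def tfidf_vector(doc):
    """Basic TF-IDF: relative term frequency weighted by inverse document frequency."""
    counts = Counter(doc)
    return [counts[w] / len(doc) * idf[w] for w in vocab]

tfidf = [tfidf_vector(d) for d in docs]

print(vocab)
print(bow[2])
```

Words that occur in fewer documents (such as `entry`) receive a higher IDF than words shared by several documents (such as `free` or `call`), which is the "importance relative to the whole dataset" described above.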

### Machine-Learning Models

For spam detection, common models include:
- **Multinomial Naive Bayes** — fast baseline model.
- **Logistic Regression** — strong linear classifier.
- **Linear SVM** — often achieves high accuracy.
- **Random Forest** — tree-based ensemble model.

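Each model handles sparse, high-dimensional text data differently: Naive Bayes and the linear models usually cope well with it and train quickly, while tree ensembles such as Random Forest tend to be slower on such features.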
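
As an illustration of how the simplest of these models works, the following is a from-scratch sketch of Multinomial Naive Bayes with Laplace (add-one) smoothing. The four training messages and the function names `log_score` and `predict` are made up for this example; in the exercises you would use scikit-learn's `MultinomialNB` instead.

```python
import math
from collections import Counter

# Toy training set of (tokens, label) pairs; 1 = spam, 0 = ham (invented data).
train = [
    (["free", "prize", "call"], 1),
    (["win", "free", "cash"], 1),
    (["see", "you", "later"], 0),
    (["call", "you", "later"], 0),
]

vocab = {w for tokens, _ in train for w in tokens}
class_docs = Counter(label for _, label in train)   # documents per class
word_counts = {0: Counter(), 1: Counter()}          # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)

def log_score(tokens, label, alpha=1.0):
    """Unnormalized log-posterior: log prior + sum of smoothed log likelihoods."""
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in tokens:
        if w not in vocab:
            continue  # words never seen in training are ignored
        score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    """Return the class with the higher score."""
    return max((0, 1), key=lambda label: log_score(tokens, label))

print(predict(["free", "cash"]))
print(predict(["see", "you"]))
```

The smoothing term `alpha` keeps a word that was never observed in one class from driving that class's probability to zero.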

### Goal of the Laboratory

During the exercises, you will:
- preprocess and clean textual data,
- compare stemming and lemmatization,
- convert text into numerical features,
- train multiple ML models,
- evaluate and compare their performance,
- analyze which words are most strongly associated with spam.

Together, these steps form a complete workflow typical of real-world NLP classification tasks.


## Task 0 - Download data

This code automatically downloads the **SMS Spam Collection** dataset from the UCI Machine Learning Repository.  
It then opens the ZIP archive from memory, extracts its contents to the working directory, loads the raw text file into a pandas DataFrame, and finally saves it locally as `spam.csv`.

This allows you to use a real, publicly available spam–ham dataset without manually downloading or preparing any files.


In [None]:
import pandas as pd
import requests
import zipfile
import io

# Download the ZIP archive from UCI
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

# Open the archive from memory and extract it to the working directory
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

# Read the SMSSpamCollection file (tab-separated, no header row)
df = pd.read_csv(
    "SMSSpamCollection",
    sep="\t",
    names=["v1", "v2"],
    encoding="latin-1"
)

# Save as CSV using the same column names as the Kaggle version
df.to_csv("spam.csv", index=False, encoding="latin-1")

print("Created spam.csv from the UCI SMS Spam Collection!")
print(df.head())


## Task 1 — Build a Basic SPAM/HAM Classifier

In this task, you will create a simple machine-learning model that classifies text messages as **spam** or **ham** (not spam).  
You will work with the SMS Spam Collection dataset from UCI.

### What you need to do:

1. **Load the dataset** (`spam.csv`) and keep two columns:
   - `label` (spam/ham)
   - `text` (message content)

2. **Convert labels**:
   - `ham` → `0`
   - `spam` → `1`

3. **Explore the dataset**:
   - Display the first few rows
   - Show how many spam and ham messages are in the dataset

4. **Split data** into training and test sets (80/20).
   Use `stratify=y` to preserve class distribution.

5. **Vectorize the text** using Bag-of-Words (`CountVectorizer`).

6. **Train a Multinomial Naive Bayes model** on the vectorized data.

7. **Evaluate the model** using:
   - Accuracy  
   - Precision (spam)  
   - Recall (spam)  
   - F1-score (spam)  
   - Confusion matrix  
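
The metrics in step 7 all follow from the four confusion-matrix counts. The sketch below computes them by hand on a made-up pair of label lists (not real model output), with spam (`1`) as the positive class. The helper names `confusion_counts` and `spam_metrics` are hypothetical; in the task itself you would use the scikit-learn functions imported in the template.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def spam_metrics(y_true, y_pred):
    """Accuracy plus precision, recall, and F1 for the spam class (label 1)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative labels only: 6 ham and 4 spam messages, with 2 mistakes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

metrics = spam_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))
print(metrics)
```

For labels `0` and `1`, scikit-learn's `confusion_matrix` arranges the same counts as `[[tn, fp], [fn, tp]]`, with rows for true labels and columns for predictions.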


In [None]:
# TASK 1 – Basic SPAM / HAM Classifier (Template)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)

# 1. Load the dataset.
# The file "spam.csv" must be located in the working directory.
# The dataset should contain at least two columns: one with labels and one with message text.
df = pd.read_csv("spam.csv", encoding="latin-1")

# Select the relevant columns and rename them.
# The file created in Task 0 contains only the columns "v1" (label) and "v2" (text);
# the Kaggle version of the dataset also includes several unused columns.
df = df[['v1', 'v2']]
df.columns = ['label', 'text']

# 2. Convert categorical labels into numerical form and store them in a new column
# named 'label_num' (Task 2 expects this name). The mapping is: "ham" → 0, "spam" → 1.





# 3. Display a short preview and class distribution.






# 4. Define features and target variables.
# X should contain text messages, and y should contain the numerical labels.


# Split the dataset into training and test subsets (80/20),
# using stratified sampling to preserve label proportions.





# 5. Vectorize the text using the Bag-of-Words method.
# CountVectorizer transforms text into a sparse numerical matrix.






# 6. Train a Multinomial Naive Bayes classifier.
# This model is commonly used for text classification tasks.






# 7. Evaluate the classifier.
# Compute accuracy, precision (spam), recall (spam), F1-score (spam),
# and display the confusion matrix for detailed error analysis.


## Task 2 — Compare Text Preprocessing Techniques: No Cleaning vs. Stemming vs. Lemmatization

In this task, you will investigate how different text-cleaning approaches influence the performance of a machine-learning model.

You will compare **three versions** of the SMS dataset:

1. **No stemming or lemmatization**  
2. **With stemming** (PorterStemmer)  
3. **With lemmatization** (WordNetLemmatizer)

### Steps to complete:

1. Implement a `clean_text()` function that:
   - converts text to lowercase,  
   - removes punctuation and digits,  
   - tokenizes text,  
   - removes stopwords,  
   - optionally applies:
     - stemming (`mode="stem"`),
     - lemmatization (`mode="lemma"`),
     - or nothing (`mode="none"`).

2. Create 3 new columns in your DataFrame:
   - `text_clean_none`
   - `text_clean_stem`
   - `text_clean_lemma`

3. For each version of cleaned text:
   - split into train/test sets,
   - vectorize using **TF-IDF** (`TfidfVectorizer`),
   - train a **Logistic Regression** classifier,
   - compute accuracy, plus precision, recall, and F1-score for the *spam* class.

4. Create a comparison table with:
   - accuracy,
   - precision (spam),
   - recall (spam),
   - F1-score (spam).

5. (Optional) Consider the following questions:
   - Which cleaning method performs best?
   - Does stemming produce unnatural word roots?
   - Does lemmatization create more interpretable features?
   - Does more preprocessing always mean better performance?
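
To see why the two approaches can yield different features, here is a deliberately simplified toy comparison. It does **not** reproduce `PorterStemmer` or `WordNetLemmatizer`: `toy_stem` is a naive suffix stripper and `toy_lemmatize` is a lookup in a tiny hand-written dictionary, both invented for this illustration.

```python
def toy_stem(word):
    """Naive rule-based stemmer: strip the first matching suffix (toy rules only)."""
    for suffix in ("ing", "ed", "ies", "es", "s"):
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written dictionary standing in for a real lexical database (assumption).
TOY_LEMMAS = {
    "walking": "walk",
    "walked": "walk",
    "studies": "study",
    "cars": "car",
    "better": "good",
}

def toy_lemmatize(word):
    """Dictionary-based lemmatizer: look the word up, fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["walking", "studies", "cars", "better"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based function is cheap but turns *studies* into the non-word *stud* and cannot relate *better* to *good*, whereas the dictionary-based one returns real words but only for entries it knows.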



In [None]:
# Run this cell once to download the required NLTK resources

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

In [None]:
# TASK 2

import re
import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text, mode="none"):
    """
    Clean and normalize text, optionally applying stemming or lemmatization.
    mode: "none" / "stem" / "lemma"
    """

    pass


# Assume that the DataFrame `df` from Task 1 is already available
# and contains the columns: 'text' (raw messages) and 'label_num' (numeric labels).

# Apply text cleaning for all preprocessing variants
# Generate three cleaned text columns:
#  - text_clean_none  (no stemming/lemmatization)
#  - text_clean_stem  (stemming)
#  - text_clean_lemma (lemmatization)
# Each should be created using df['text'].apply(clean_text, ...).



# Model training and evaluation for each preprocessing variant
# For each version of the cleaned text (cleaning only, stemming, lemmatization), apply the following procedure:

# 1. Split the data into training and test sets using stratified sampling to preserve the label distribution.
# 2. Convert the text into numerical features using TF-IDF vectorization (TfidfVectorizer).
# 3. Train a Logistic Regression classifier on the vectorized training data.
# 4. Evaluate the model by computing accuracy, plus precision, recall, and F1-score for the spam class,
#    which is typically the minority class and harder to detect.





# Performance comparison table
# After evaluating all three preprocessing variants, build a comparison table with the following metrics for each variant:

# - Accuracy: overall proportion of correct predictions.
# - Precision (spam): proportion of messages predicted as spam that are actually spam.
# - Recall (spam): proportion of true spam messages correctly detected by the model.
# - F1-score (spam): harmonic mean of precision and recall, giving a balanced measure of classifier performance.

# This table allows a direct comparison of how preprocessing choices influence the effectiveness of the spam-detection model.


## Task 3 — Compare ML Models and Identify the Most “Spammy” Words

In this task, you will:
1. Train multiple machine-learning models,
2. Compare their performance,
3. Identify the most important words that indicate spam.

### Steps to complete:

1. Choose the best-performing text version from Task 2  
   (often `text_clean_lemma`, but rely on your own results).  

2. Vectorize the text using **TF-IDF** with a limited vocabulary size  
   (e.g., `max_features=3000`).

3. Train and evaluate the following ML models:
   - **Multinomial Naive Bayes**
   - **Logistic Regression**
   - **Linear SVM (LinearSVC)**
   - **Random Forest Classifier**

4. For each model, compute:
   - Accuracy  
   - Precision (spam)  
   - Recall (spam)  
   - F1-score (spam)

5. Create a summary table comparing all models.

6. Select the best *linear* model  
   (Logistic Regression or LinearSVC).

7. Extract and display the top **20 most spam-indicative words** and their ham counterparts, based on the model coefficients:
   - largest positive coefficients → words strongly associated with spam  
   - most negative coefficients → words strongly associated with ham  

8. Plot horizontal bar charts showing the top 10 spam and ham words.

9. **Interpretation**:
   - What kinds of words typically appear in spam?  
   - What words are common in ham (normal) messages?  
   - Which ML model performed best, and why do you think so?
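
The ranking in step 7 amounts to sorting vocabulary words by their coefficient. The sketch below shows the idea with a hypothetical helper `top_words` and invented coefficient values (not results from a trained model). With scikit-learn, the words would typically come from the fitted vectorizer's `get_feature_names_out()` and the weights from the first row of the linear model's `coef_` attribute.

```python
def top_words(feature_names, coefficients, k=3):
    """Return the k most spam-like and k most ham-like (word, weight) pairs."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    ham = ranked[:k]            # most negative weights first
    spam = ranked[::-1][:k]     # most positive weights first
    return spam, ham

# Invented example values, chosen only to demonstrate the sorting.
feature_names = ["call", "free", "home", "later", "prize", "sorry"]
coefficients = [0.4, 2.1, -1.2, -0.9, 1.7, -1.5]

spam_words, ham_words = top_words(feature_names, coefficients, k=2)
print("spam:", spam_words)
print("ham: ", ham_words)
```

The two returned lists can be passed directly to a horizontal bar chart, one per class, as requested in step 8.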
