4. Implement e-mail spam filtering using text classification algorithm with appropriate dataset.

In [5]:
!pip install numpy
!pip install pandas
!pip install scikit-learn

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

file_name = 'spam.csv' 
try:
    # Use on_bad_lines='skip' to avoid errors from "broken" data
    mail_data = pd.read_csv(file_name, encoding='latin-1', on_bad_lines='skip')
    print(f"--- 2. Successfully loaded '{file_name}'. ---")
except FileNotFoundError:
    print(f"--- ERROR: '{file_name}' not found! ---")
    print("Please make sure you saved the sample data as 'spam.csv' in the same folder.")

# --- 3. Pre-processing ---
# Keep only the two columns we need
clean_data = mail_data[['v1', 'v2']]
clean_data.rename(columns={'v1': 'category', 'v2': 'message'}, inplace=True)

# Convert category ('ham' or 'spam') into numbers (0 or 1)
clean_data['category_num'] = clean_data['category'].map({'ham': 0, 'spam': 1})

print("--- 3. Data cleaned and pre-processed. ---")
print(clean_data.head()) # Shows the first 5 rows


# --- 4. Define your X (features) and y (target) ---
X = clean_data['message']
y = clean_data['category_num']

# --- 5. Split Data into Training and Testing sets ---
# --- random_state=5 ensures we get the exact same "random" split every time, making our results reproducible. ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
print("\n--- 4. Data split into training (70%) and testing (30%) sets. ---")

# --- 6. Text Feature Extraction (TF-IDF) ---
# This converts your text (e.g., "free offer") into numbers
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
print("--- 5. Text converted to numbers (TF-IDF). ---")

# --- 7. Train the Classification Model ---
model = MultinomialNB()
model.fit(X_train_features, y_train)
print("--- 6. Model has been trained. ---")

# --- 8. Evaluate the Model ---
print("\n--- 7. Evaluating Model Performance ---")

# Make predictions on the test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
# This will now be a more realistic number (like 80-95%)
print(f"Accuracy on Test Data: {accuracy_on_test_data * 100:.2f}%") 

# Print the full report
print("\nClassification Report (Test Data):")
print(classification_report(y_test, prediction_on_test_data))


# --- 9. Test with a new, custom email ---
print("\n--- 8. Testing with a new email ---")
input_mail = ["Congratulations! You have won a 1 million dollar prize. Click here to claim."]

# Convert the new email to numbers
input_data_features = feature_extraction.transform(input_mail)

# Make a prediction
prediction = model.predict(input_data_features)

if prediction[0] == 1:
    print("This email is SPAM.")
else:
    print("This email is HAM (not spam).")

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 25.1.1 -> 25.3
[notice] To update, run: C:\Program Files\Python313\python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 25.1.1 -> 25.3
[notice] To update, run: C:\Program Files\Python313\python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable
--- 2. Successfully loaded 'spam.csv'. ---
--- 3. Data cleaned and pre-processed. ---
  category                                            message  category_num
0      ham  Go until jurong point, crazy.. Available only ...             0
1      ham                      Ok lar... Joking wif u oni...             0
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...             1
3      ham  U dun say so early hor... U c already then say...             0
4      ham  Even my brother is not like to speak with me. ...             0

--- 4. Data split into training (70%) and testing (30%) sets. ---
--- 5. Text converted to numbers (TF-IDF). ---
--- 6. Model has been trained. ---

--- 7. Evaluating Model Performance ---
Accuracy on Test Data: 71.43%

Classification Report (Test Data):
              precision    recall  f1-score   support

           0       0.71      1.00      0.83         5
           1   


[notice] A new release of pip is available: 25.1.1 -> 25.3
[notice] To update, run: C:\Program Files\Python313\python.exe -m pip install --upgrade pip
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  clean_data.rename(columns={'v1': 'category', 'v2': 'message'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  clean_data['category_num'] = clean_data['category'].map({'ham': 0, 'spam': 1})
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", resul

Purpose of the Practical
The purpose is to build a supervised machine learning model that can automatically read an email and classify it as either "Spam" (junk) or "Ham" (a normal, non-junk email).

This is a classic "Text Classification" problem. We will train a model on a dataset of emails that are already labeled as 'spam' or 'ham'. The model will learn the patterns of words associated with spam, and then we'll use it to predict the label for new, unseen emails.

üß† Core Theory (How it Works)
This practical involves two key concepts:

TF-IDF Vectorization: Computers don't understand words, they only understand numbers. We need to convert the text of each email into a numerical format. We do this using TF-IDF (Term Frequency-Inverse Document Frequency).

Term Frequency (TF): How often a word appears in one email. A word like "win" appearing 5 times is important.

Inverse Document Frequency (IDF): How "rare" a word is across all emails. A common word like "the" (low IDF) is ignored. A rare word like "lottery" or "viagra" (high IDF) is a very strong signal.

The Score: The TF-IDF score is high for words that are frequent in one email but rare overall. This helps the model identify important, signal-carrying words.

Multinomial Naive Bayes (MultinomialNB): This is the classification algorithm (the "brain") we are using.

It's a fast, simple, and highly effective algorithm for text.

It uses Bayes' Theorem to calculate the probability of an email being "Spam" given the set of TF-IDF scores for the words it contains.

It's called "Naive" because it naively assumes that every word is independent of the others (e.g., it doesn't care that "free" and "prize" appear together, just that both words are present).

üìã Step-by-Step Code Explanation
Import Libraries: We import pandas to read the CSV, train_test_split to divide our data, TfidfVectorizer to convert text to numbers, MultinomialNB as our classifier, and accuracy_score to check our work.

Load & Clean Data:

We use pd.read_csv('spam.csv') to load the dataset.

The most important cleaning step is clean_data['category_num'] = .... This converts the text labels "ham" and "spam" into the numbers 0 and 1, which the model can understand.

Define X and y: We set our X (features) to be the message column (the text) and our y (target) to be the category_num column (the 0 or 1 label).

Split Data: We use train_test_split to split our data.

X_train & y_train: The training set (e.g., 70% of the data). The model learns from this.

X_test & y_test: The testing set (e.g., 30% of the data). We "hide" this from the model to check its performance later.

Feature Extraction (TF-IDF): This is the most crucial part.

We create a TfidfVectorizer object.

X_train_features = feature_extraction.fit_transform(X_train): This does two things:

fit: It reads all of X_train to learn the entire vocabulary (all unique words).

transform: It converts the text in X_train into a matrix of TF-IDF numbers.

X_test_features = feature_extraction.transform(X_test): It uses the same vocabulary learned from X_train to convert X_test into a number matrix.

Train the Model:

We create a MultinomialNB object.

model.fit(X_train_features, y_train): This is the "training." The model learns the probability patterns that connect the TF-IDF numbers (the words) to their correct labels (spam or ham).

Evaluate the Model:

prediction_on_test_data = model.predict(X_test_features): We use the trained model to make predictions on the "hidden" test data.

accuracy_score(y_test, prediction_on_test_data): We compare the model's predictions to the actual answers (y_test) to calculate its accuracy.

Test on New Data: The last step shows how you can use the same trained feature_extraction and model to predict the class of any new email you type in.

üõ†Ô∏è Key Libraries & Functions
pandas: For loading and cleaning the spam.csv file.

sklearn.model_selection.train_test_split: The function used to split the data into training and testing sets.

sklearn.feature_extraction.text.TfidfVectorizer: The most important tool. It converts raw text into a meaningful matrix of TF-IDF scores.

.fit_transform(X_train): (Fit + Transform) Learns the vocabulary and transforms the training data. Use this on X_train ONLY.

.transform(X_test): (Only Transform) Only transforms the test data using the vocabulary it already learned from X_train. Use this on X_test and any new data.

sklearn.naive_bayes.MultinomialNB: The classification algorithm (the "brain").

model.fit(...): The function that trains the model.

model.predict(...): The function that makes predictions on new, unseen data.