<a href="https://colab.research.google.com/github/Bhargav-Parmar/Internship-Project-1/blob/main/Email_Spam_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Setup and Data Loading

This cell imports the libraries and loads the SMS Spam Collection CSV file with pandas. It cleans the data by dropping three unused columns, renaming the remaining two, and converting the text labels (ham/spam) to numerical labels (0/1).

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# --- Load real data from the uploaded CSV ---
try:
    # The SMS Spam Collection dataset has two columns: v1 (label) and v2 (message)
    df = pd.read_csv('spam.csv', encoding='latin-1')

    # 1. Drop unnecessary columns and rename for clarity
    df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
    df = df.rename(columns={'v1': 'Label', 'v2': 'Message'})

    # 2. Convert 'ham'/'spam' text labels to numerical labels (0 and 1)
    df['Label_Num'] = df['Label'].apply(lambda x: 1 if x == 'spam' else 0)

    # Define X (messages) and y (numerical labels)
    X = df['Message']
    y = df['Label_Num']

    print(f"Successfully loaded {len(df)} real data samples.")
    print(df.head())

except FileNotFoundError:
    print("ERROR: 'spam.csv' not found. Please upload the file to your Colab session.")
    # Stop execution or fall back to dummy data if preferred.
    X = np.array([])
    y = np.array([])

Successfully loaded 5572 real data samples.
  Label                                            Message  Label_Num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...          0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0
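
The label conversion in the cell above sends every value that is not exactly 'spam' to 0, so a misspelled or missing label would silently count as ham. A minimal sketch of a stricter alternative, assuming a small hypothetical frame (`toy`) in place of the real CSV and an explicit `label_map` dictionary (both names are illustrative and not part of the notebook):

```python
import pandas as pd

# Hypothetical three-row frame standing in for spam.csv after renaming.
toy = pd.DataFrame({
    "Label": ["ham", "spam", "ham"],
    "Message": ["See you at 5", "WIN a free prize now", "Ok lar"],
})

# Explicit mapping: labels outside the dictionary become NaN
# instead of silently turning into 0.
label_map = {"ham": 0, "spam": 1}
toy["Label_Num"] = toy["Label"].map(label_map)

# Fail early if any row carried an unexpected label.
if toy["Label_Num"].isna().any():
    raise ValueError("unexpected label value found")

toy["Label_Num"] = toy["Label_Num"].astype(int)
print(toy[["Label", "Label_Num"]])
```

Whether the extra check is worth it depends on how much the input file is trusted; for the curated Kaggle file the original lambda gives the same result.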


The spam.csv file is deleted when the Colab session ends. To download it again, use this link: https://www.kaggle.com/datasets/tmehul/spamcsv?resource=download

Vectorization and Data Splitting

This cell converts the full set of real messages into a numerical word-count matrix and splits the data randomly (shuffled with a fixed seed, random_state=42) into 80% training and 20% testing sets.

In [2]:
# 1. Vectorization: Convert text into numerical feature vectors.
# The vocabulary will now be much larger due to the real dataset's size.
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# 2. Split data: Use 80% for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.20, random_state=42)

print("\nData has been vectorized and split into training/testing sets.")
print(f"Total features (unique words): {X_vectorized.shape[1]}")
print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")


Data has been vectorized and split into training/testing sets.
Total features (unique words): 8672
Training samples: 4457, Testing samples: 1115
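
Cell 2 fits the vectorizer on all messages before splitting, so the vocabulary also contains words that appear only in the test messages. A minimal sketch of the stricter ordering (split the raw text first, then fit the vectorizer on the training part only), assuming a hypothetical five-message toy corpus; all `toy_` names are illustrative and do not touch the notebook's variables:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; the notebook uses df['Message'] instead.
toy_messages = [
    "free prize now",
    "call me now",
    "win a free holiday",
    "see you at lunch",
    "lunch tomorrow ok",
]
toy_labels = [1, 0, 1, 0, 0]

# Split the raw text first so the test messages stay unseen.
toy_msg_train, toy_msg_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages, toy_labels, test_size=0.20, random_state=42
)

# Learn the vocabulary from the training messages only.
toy_vectorizer = CountVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_msg_train)

# Reuse that vocabulary for the test messages; words the vectorizer
# has never seen are ignored rather than added.
toy_X_test = toy_vectorizer.transform(toy_msg_test)

print(toy_X_train.shape, toy_X_test.shape)
```

Both matrices end up with the same number of columns, which is what the classifier needs. On the real dataset this ordering would likely give a slightly smaller vocabulary than the figure printed above.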


Model Training and Evaluation

This cell trains the Multinomial Naive Bayes model and reports its accuracy on the held-out test set. Accuracy should be high on this dataset (97.85% in the run below).

In [4]:
# 1. Initialize and Train the Model
model = MultinomialNB()
model.fit(X_train, y_train)
print("Model training complete.")

# 2. Test the Model
predictions = model.predict(X_test)

# 3. Evaluate Accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"\nModel Accuracy on Real Test Set: **{accuracy * 100:.2f}%**")

Model training complete.

Model Accuracy on Real Test Set: **97.85%**
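
Accuracy alone can hide how a spam filter fails, because it does not separate spam that slips through from ham that gets flagged. A minimal sketch of a confusion matrix with precision and recall, using hypothetical labels and predictions (`toy_y_true`, `toy_y_pred`) and not the notebook's results:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = spam, 0 = ham).
toy_y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(toy_y_true, toy_y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision: share of messages flagged as spam that really are spam.
precision = precision_score(toy_y_true, toy_y_pred)
# Recall: share of real spam messages that were caught.
recall = recall_score(toy_y_true, toy_y_pred)

print(cm)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

In the notebook the same calls could be applied to y_test and predictions from the cell above.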


Live Prediction Test

Use this final cell to test new, unseen messages and confirm the model is working correctly.

In [5]:
# Test with a classic spam message
test_spam = ["Congratulations! You've won a free holiday. Text 'CLAIM' now to 888."]

# Test with a clear ham (legitimate) message
test_ham = ["Hey, can we confirm the meeting time for tomorrow's presentation?"]

test_messages = [test_spam[0], test_ham[0]]

# The new text MUST be transformed using the SAME vectorizer fitted in Cell 2.
new_messages_vectorized = vectorizer.transform(test_messages)

# Get the predictions
predictions = model.predict(new_messages_vectorized)

print("--- Live Prediction Results ---")
for msg, pred in zip(test_messages, predictions):
    result = "SPAM 🗑️" if pred == 1 else "HAM ✅"
    print(f"\nMessage: '{msg[:60]}...'")
    print(f"Classification: **{result}**")

--- Live Prediction Results ---

Message: 'Congratulations! You've won a free holiday. Text 'CLAIM' now...'
Classification: **SPAM 🗑️**

Message: 'Hey, can we confirm the meeting time for tomorrow's presenta...'
Classification: **HAM ✅**
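
The requirement that new text go through the same fitted vectorizer can be enforced by bundling the vectorizer and the classifier into one object. A minimal sketch using scikit-learn's make_pipeline, trained on a hypothetical six-message toy set (the `toy_` names and `spam_pipeline` are illustrative, and the toy model is far weaker than the one trained above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (1 = spam, 0 = ham).
toy_messages = [
    "win a free prize now",
    "free holiday claim now",
    "text claim to win cash",
    "are we meeting tomorrow",
    "see you at the presentation",
    "can you confirm the meeting time",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the model together and
# applies the same vocabulary to any text passed to predict().
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

new_messages = ["claim your free prize", "confirm the meeting tomorrow"]
toy_predictions = spam_pipeline.predict(new_messages)

# Class probabilities, columns ordered as [ham, spam].
toy_probabilities = spam_pipeline.predict_proba(new_messages)

for msg, pred, proba in zip(new_messages, toy_predictions, toy_probabilities):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"{msg!r}: {label} (spam probability {proba[1]:.2f})")
```

With a pipeline, raw strings go straight into predict(), so there is no separate transform step to forget.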
