# Intelligent Gmail Spam Remover - Project Walkthrough

Welcome to the interactive notebook for the Spam Remover project. In this notebook, we will:
1.  **Load and Train** a Naive Bayes model on our email dataset.
2.  **Evaluate** the model's accuracy.
3.  **Connect** to your Gmail account.
4.  **Scan and Filter** your real inbox for spam.

## Step 1: Import Libraries & Load Data
We start by importing the necessary Python libraries and loading our datasets (`spam_ham_dataset.csv` and `emails.csv`).

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import joblib
import os

# Load datasets
try:
    df1 = pd.read_csv('spam_ham_dataset.csv')
    df2 = pd.read_csv('emails.csv')
    
    # Standardize column names
    # df1 has 'text' and 'label_num'
    # df2 has 'text' and 'spam' -> rename 'spam' to 'label_num'
    df2.rename(columns={'spam': 'label_num'}, inplace=True)
    
    # Combine into one dataframe
    df = pd.concat([df1[['text', 'label_num']], df2[['text', 'label_num']]], ignore_index=True)
    
    print(f"Dataset loaded successfully. Total emails: {len(df)}")
    print("Sample data:")
    display(df.head())
except FileNotFoundError:
    print("Error: Dataset files not found. Please ensure 'spam_ham_dataset.csv' and 'emails.csv' are in the directory.")

Dataset loaded successfully. Total emails: 10899
Sample data:


|   | text | label_num |
|---|------|-----------|
| 0 | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | Subject: re : indian springs\r\nthis deal is t... | 0 |


In [4]:
df1.isnull().values.any()


NameError: name 'df1' is not defined

## Step 2: Preprocessing & Vectorization
Classifiers work on numbers, not raw text. We use a **CountVectorizer** (Bag of Words) to convert each email into a vector of word counts, with one position per word in the vocabulary learned from the training set.
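
To make the bag-of-words idea concrete, here is a simplified pure-Python sketch of what a count vectorizer does conceptually. It is not scikit-learn's implementation: the tokenizer (lowercased runs of two or more word characters) only approximates the `CountVectorizer` default, stop-word removal is left out, and `tokenize`, `build_vocabulary`, and `to_count_vector` are hypothetical helper names used for illustration.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # runs of 2+ word characters


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(documents):
    """Map every distinct training token to a column index (alphabetical)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(text, vocabulary):
    """Return one count per vocabulary word; words outside it are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Toy documents, not taken from the datasets
train_docs = ["cheap pills cheap offer", "meeting moved to monday"]
vocab = build_vocabulary(train_docs)

print(list(vocab))
# ['cheap', 'meeting', 'monday', 'moved', 'offer', 'pills', 'to']
print(to_count_vector("cheap cheap meeting today", vocab))
# [2, 1, 0, 0, 0, 0, 0]
```

Note that "today" never appeared in the training documents, so it contributes nothing to the vector. This is why the vectorizer is fitted on the training set only and then reused to transform the test set.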

In [3]:
# Split into Training and Testing sets (80% Train, 20% Test)
X = df['text']
y = df['label_num']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text
vectorizer = CountVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())} unique words.")

Vocabulary size: 62860 unique words.


## Step 3: Train the Naive Bayes Model
We use the **Multinomial Naive Bayes** algorithm, a standard baseline for text classification that works directly on word-count features.
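
For intuition, the following is a minimal pure-Python sketch of the multinomial Naive Bayes idea: estimate a prior per class, estimate Laplace-smoothed word likelihoods per class, and pick the class with the highest log posterior. It is an illustration under simplifying assumptions (whitespace tokenization, toy data), not how `MultinomialNB` is implemented internally, and `train_multinomial_nb` and `predict` are hypothetical names.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    model = {}
    for cls in set(labels):
        cls_docs = [doc for doc, lab in zip(docs, labels) if lab == cls]
        counts = Counter(tok for doc in cls_docs for tok in doc.split())
        # Additive (Laplace) smoothing keeps unseen words from zeroing a class
        total = sum(counts.values()) + alpha * len(vocab)
        model[cls] = {
            "log_prior": math.log(len(cls_docs) / len(docs)),
            "log_likelihood": {
                tok: math.log((counts[tok] + alpha) / total) for tok in vocab
            },
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for cls, params in model.items():
        score = params["log_prior"]
        for tok in text.split():
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[cls] = score
    return max(scores, key=scores.get)


# Toy data for illustration only; 1 = spam, 0 = ham, matching label_num
docs = ["win cash now", "cheap cash offer", "lunch at noon", "project meeting at noon"]
labels = [1, 1, 0, 0]
nb = train_multinomial_nb(docs, labels)

print(predict(nb, "cash offer now"))    # 1
print(predict(nb, "meeting at lunch"))  # 0
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long emails.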

In [4]:
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Evaluate
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Save the model and vectorizer for future use
joblib.dump(model, 'spam_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
print("Model saved to 'spam_model.pkl'")

Model Accuracy: 98.72%

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1596
           1       0.98      0.97      0.98       584

    accuracy                           0.99      2180
   macro avg       0.98      0.98      0.98      2180
weighted avg       0.99      0.99      0.99      2180

Model saved to 'spam_model.pkl'
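
For reading the classification report, it may help to see how precision, recall, and F1 follow from the raw counts of true positives, false positives, and false negatives. The sketch below uses made-up labels, not predictions from this dataset, and `binary_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for illustration: 1 = spam, 0 = ham
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```

For a spam filter, precision on class `1` deserves particular attention: every false positive is a legitimate email that would be moved out of the inbox.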


## Step 4: Gmail Integration
Now we connect to your real Gmail account.
**Note:** `credentials.json` must be in this directory. If no saved `token.json` exists yet, a browser window will open so you can authorize access.
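
Before running the connection cell, a quick pre-flight check can tell you whether the browser flow is likely to start. This is a small sketch that assumes both `credentials.json` and `token.json` live in the notebook's working directory; `oauth_file_status` is a hypothetical helper and is not part of `gmail_service`.

```python
from pathlib import Path


def oauth_file_status(directory="."):
    """Report which OAuth files are present in the given directory."""
    base = Path(directory)
    if not (base / "credentials.json").is_file():
        return "missing credentials.json"
    if not (base / "token.json").is_file():
        # Assumption: without a saved token the helper starts the browser flow
        return "browser authorization required"
    return "saved token found"


print(oauth_file_status())
```

A saved token can still be expired or revoked, so "saved token found" only means the file exists, not that authentication will succeed.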

In [5]:
from gmail_service import GmailService
from spam_filter import SpamFilter

# Initialize helper classes
try:
    # This will trigger the OAuth flow if token.json doesn't exist
    gmail = GmailService()
    user_email = gmail.get_email_address()
    print(f"Successfully authenticated as: {user_email}")
    
    # Load our just-trained model wrapper
    spam_filter = SpamFilter(model_path='spam_model.pkl', vectorizer_path='vectorizer.pkl')
    print("Spam Filter loaded.")
    
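except Exception as e:
    # This handler covers both OAuth failures and problems loading the saved model files
    print(f"Setup error: {e}")
    print("Make sure 'credentials.json', 'spam_model.pkl', and 'vectorizer.pkl' are present.")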

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=638667017428-v22aaiq5pnfnj6g2tvlhhamdu6qm4fjn.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A60829%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.modify&state=rqcFvM9Ypy9a7DjfmVItg2yVLdsrxa&access_type=offline


KeyboardInterrupt: 

## Step 5: Live Scan & Filter
The first cell fetches your unread emails and classifies each one. The second cell then asks whether you want to move the detected spam to the Spam folder.
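
The scan cell reads the subject from the first line of the message content. The exact format returned by `get_message_content` is not shown in this notebook, so if the `Subject:` line is ever not the first line, a more defensive lookup may be useful. The sketch below is one possible approach; `extract_subject` is a hypothetical helper name.

```python
def extract_subject(content, max_length=60):
    """Return the first 'Subject:' line, truncated, or a placeholder."""
    for line in content.splitlines():
        if line.lower().startswith("subject:"):
            return line[len("subject:"):].strip()[:max_length]
    return "(no subject)"


print(extract_subject("From: a@example.com\nSubject: Weekly report\n\nHi team"))
# Weekly report
print(extract_subject("Body text only"))
# (no subject)
```

Using `splitlines()` instead of `split('\n')` also handles the `\r\n` line endings that appear in the training data.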

In [1]:
# Configuration
MAX_EMAILS_TO_SCAN = 20

print(f"Scanning top {MAX_EMAILS_TO_SCAN} unread emails...")
messages = gmail.get_unread_messages(max_results=MAX_EMAILS_TO_SCAN)

spam_list = []
ham_count = 0

if not messages:
    print("No unread messages found.")
else:
    print(f"Found {len(messages)} emails. Analyzing...\n")
    print(f"{'-'*80}")
    print(f"{'STATUS':<10} | {'SUBJECT'}")
    print(f"{'-'*80}")

    for msg in messages:
        msg_id = msg['id']
        content = gmail.get_message_content(msg_id)
        
        if not content:
            continue
        
        # Get subject for display
        lines = content.split('\n')
        subject = lines[0].replace('Subject:', '').strip()[:60] # Truncate long subjects
        
        # Predict
        is_spam = spam_filter.is_spam(content)
        
        if is_spam:
            print(f"\033[91m[SPAM]\033[0m    | {subject}") # Red text for spam
            spam_list.append(msg_id)
        else:
            print(f"\033[92m[HAM]\033[0m     | {subject}") # Green text for ham
            ham_count += 1

    print(f"{'-'*80}")
    print(f"Summary: {len(spam_list)} Spam detected, {ham_count} Ham safe.")

Scanning top 20 unread emails...


NameError: name 'gmail' is not defined

In [None]:
# Action Cell
if spam_list:
    print(f"Found {len(spam_list)} spam emails.")
    action = input("Type 'cleanup' to move these to the Spam folder, or press Enter to skip: ")

    # Ignore surrounding whitespace and letter case in the typed answer
    if action.strip().lower() == 'cleanup':
        print("Moving messages to Spam folder...")
        for msg_id in spam_list:
            gmail.move_to_spam(msg_id)
        print("Cleanup complete!")
    else:
        print("Skipped cleanup.")
else:
    print("No spam to clean up! Great!")
