
# **1) Email Spam Classification**
Problem Statement:
Create a binary classifier to identify whether an email is spam
or not using text processing and ML algorithms.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Preprocessing is Key: The preprocess_text function is a crucial step. It removes punctuation, converts text to lowercase, drops English stopwords, and stems each remaining word. This reduces noise in the data and collapses surface variants of a word into a single feature. For example, "Congratulations!" and "congratulations" are treated as the same feature.

In [2]:
# Build the stopword set and the stemmer once, instead of rebuilding them for every word
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Function for text preprocessing
def preprocess_text(text):
    """
    Cleans and preprocesses text by removing punctuation, converting to lowercase,
    removing stopwords, and stemming words.
    """
    # Remove punctuation
    text = "".join(char for char in text if char not in string.punctuation)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords, then stem the remaining words
    words = [stemmer.stem(word) for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)
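
To see the normalization at work, the sketch below applies the same steps to the two spellings mentioned above. It assumes NLTK is installed and uses a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, a hypothetical stand-in for NLTK's English list) so that it runs without downloading a corpus; `normalize` is likewise an illustrative name, not part of the notebook.

```python
import string
from nltk.stem import PorterStemmer

# Assumption: a tiny illustrative stopword set, not NLTK's full English list
DEMO_STOP_WORDS = {"you", "have", "a", "the", "is", "are"}
demo_stemmer = PorterStemmer()

def normalize(text):
    """Strip punctuation, lowercase, drop stopwords, and stem (illustrative sketch)."""
    text = "".join(ch for ch in text if ch not in string.punctuation).lower()
    return " ".join(demo_stemmer.stem(w) for w in text.split() if w not in DEMO_STOP_WORDS)

# Both spellings should collapse to the same stemmed token
same_token = normalize("Congratulations!") == normalize("congratulations")
print(same_token)

# Stopwords are dropped before stemming
print(normalize("You have won a prize"))
```

Comparing the two normalized strings, rather than hard-coding the stem, keeps the check independent of the exact form the stemmer produces.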


In [3]:
# 1. Load the dataset
df = pd.read_csv('/content/drive/MyDrive/email-spam data/archive (3)/spam.csv', encoding='latin-1')
df.head(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
# Rename columns for clarity
df = df.rename(columns={'v1': 'label', 'v2': 'text'})
df.head(5)

Unnamed: 0,label,text,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
# 2. Data Cleaning and Preprocessing
# Keep only the first two columns (label and text); copy to avoid a SettingWithCopyWarning
df = df.iloc[:, :2].copy()
# Convert 'label' to binary (0 for ham, 1 for spam)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head(2)

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...


In [6]:
# Apply text preprocessing to the 'text' column
df['text'] = df['text'].apply(preprocess_text)
df.head(2)

Unnamed: 0,label,text
0,0,go jurong point crazi avail bugi n great world...
1,0,ok lar joke wif u oni


TF-IDF Effectiveness: The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer gives more weight to words that occur often within a given message but rarely across the dataset as a whole, and less weight to words that appear in almost every message. Terms such as "win", "free", and "prize", which are concentrated in spam messages, therefore receive relatively high weights there, which helps the model differentiate between the classes.
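
A minimal sketch of this weighting, using a hypothetical three-message corpus (`toy_corpus`) rather than the notebook's data: "call" and "now" appear in every message, while "prize" appears in only one, so within the first message "prize" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "call" and "now" occur everywhere, "prize" only once
toy_corpus = [
    "free prize call now",
    "call me now",
    "call you later now",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# Look up the weights of two terms in the first message
vocab = toy_vectorizer.vocabulary_
first_row = toy_matrix[0].toarray().ravel()
prize_weight = first_row[vocab["prize"]]
call_weight = first_row[vocab["call"]]

print(f"prize: {prize_weight:.3f}, call: {call_weight:.3f}")
```

Both terms occur once in the first message, so the difference in weight comes entirely from the inverse document frequency.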

In [7]:
# 3. Feature Extraction
# Use TF-IDF Vectorizer to convert text to numerical features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['text'])
y = df['label']
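
One caveat with this cell: the vectorizer is fitted on the full dataset, so its vocabulary and IDF statistics include messages that later land in the test set. A common alternative is to split the raw text first and let a `Pipeline` fit the vectorizer on the training portion only. The sketch below illustrates the idea on hypothetical toy messages (`toy_texts`, `toy_labels`); it is not the notebook's method, and its results are not comparable to the metrics reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy messages standing in for the preprocessed 'text' column
toy_texts = [
    "win free prize now", "free cash claim prize", "claim free entri win",
    "urgent win cash now", "lunch tomorrow ok", "see you at home",
    "call me later pleas", "meet at home tomorrow",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first, so the test messages stay unseen
text_train, text_test, label_train, label_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training messages only
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("svc", SVC(kernel="linear")),
])
spam_pipeline.fit(text_train, label_train)

# predict() transforms the test text with the training vocabulary before classifying
toy_predictions = spam_pipeline.predict(text_test)
learned_vocabulary = set(spam_pipeline.named_steps["tfidf"].vocabulary_)
print(sorted(learned_vocabulary))
```

A side benefit of the pipeline is that new messages can be passed to `predict` as raw strings, without a separate `transform` call.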


In [8]:
# 4. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Model Training
# Using Support Vector Machine (SVC) for potentially better accuracy
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# 6. Model Evaluation
y_pred = model.predict(X_test)

print("--- Model Performance Metrics ---")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['ham', 'spam']))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}\n")

--- Model Performance Metrics ---
Classification Report:
               precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.98      0.87      0.92       150

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Accuracy: 0.98



# 3. Insights


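High Accuracy: The model reaches 98% accuracy on the held-out test set. Because the test set is imbalanced (965 ham vs. 150 spam), always predicting ham would already score about 87%, so accuracy should be read alongside the per-class metrics. Even so, the result indicates that the features generated by the TF-IDF vectorizer are effective at distinguishing between spam and ham emails.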
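
Since the test set is imbalanced, accuracy is easier to interpret next to a majority-class baseline. The short calculation below derives that baseline from the support counts in the classification report above; the variable names are illustrative.

```python
# Support counts taken from the classification report above
ham_support = 965
spam_support = 150
total = ham_support + spam_support

# Accuracy of a classifier that labels every message as ham
majority_baseline = ham_support / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```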

Precision and Recall: The classification report shows high precision for both classes (0.98 each), but recall differs: 1.00 for 'ham' versus 0.87 for 'spam'.

Precision for 'spam' tells us that when the model predicts an email is spam, it is correct about 98% of the time.

Recall for 'spam' shows that the model catches about 87% of all actual spam emails, so roughly one spam email in eight is misclassified as ham.
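
The two definitions can be checked by hand from a confusion matrix. The sketch below uses hypothetical labels (`toy_true`, `toy_pred`), not the notebook's test set, chosen so that precision is high while recall is lower, mirroring the pattern in the report.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels (1 = spam, 0 = ham), not the notebook's test set
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

manual_precision = tp / (tp + fp)  # share of predicted spam that is really spam
manual_recall = tp / (tp + fn)     # share of actual spam that was caught

print(manual_precision, precision_score(toy_true, toy_pred))
print(manual_recall, recall_score(toy_true, toy_pred))
```

Here one of the four spam messages is missed, which lowers recall without affecting precision, since no ham message is flagged as spam.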

In [9]:
# 7. Example prediction
new_email = ["Congratulations! You've won a free iPhone. Click here to claim your prize.",
             "Hey, are you free for lunch tomorrow?"]
preprocessed_email = [preprocess_text(email) for email in new_email]
vectorized_email = tfidf_vectorizer.transform(preprocessed_email)

predictions = model.predict(vectorized_email)
print("\n--- Example Predictions ---")
for i, email in enumerate(new_email):
    prediction = "Spam" if predictions[i] == 1 else "Not Spam (Ham)"
    print(f"Email: '{email}'\nPrediction: {prediction}\n")


--- Example Predictions ---
Email: 'Congratulations! You've won a free iPhone. Click here to claim your prize.'
Prediction: Spam

Email: 'Hey, are you free for lunch tomorrow?'
Prediction: Not Spam (Ham)



# 4. Conclusion

This project demonstrates the fundamentals of building an email spam detection system. By following a structured process of data preparation, feature engineering with TF-IDF, and training a suitable machine learning model (here, a linear-kernel Support Vector Machine), we reach 98% accuracy on the test set, with 0.98 precision and 0.87 recall for the spam class. The model can be further improved, particularly in spam recall, by incorporating more advanced NLP techniques (such as word embeddings) and by training on a larger, more diverse dataset. This project serves as a foundation for more complex text classification tasks.