<a href="https://colab.research.google.com/github/SafiUllahAdam/Safi_ML_Prac/blob/main/Email_Spam_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Step 1: Loading Email Dataset**

In [None]:
import pandas as pd

# Load the dataset from Google Drive
file_path = '/content/drive/MyDrive/Data/Datasets/emails.csv'  # Path of dataset in our drive
emails_df = pd.read_csv(file_path)

# Display the first few rows of the dataset
emails_df.head()


Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0


**Step 2: Preprocessing (Remove Unnecessary Columns and Separate Features and Target)**

In [None]:
# Drop the 'Email No.' column as it is just an identifier
emails_df = emails_df.drop('Email No.', axis=1)

# Separate the features (X) and target (y)
X = emails_df.drop('Prediction', axis=1)  # Features (word counts)
y = emails_df['Prediction']  # Target (spam: 1, not spam: 0)

# Let's check the dimensions of X and y to ensure they are correct
print(X.shape, y.shape)


(5172, 3000) (5172,)
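Before splitting, it can help to check how the two classes are distributed, since accuracy is easier to interpret when the spam share is known. The sketch below uses a small made-up label series (`y_demo` is a hypothetical stand-in for the `Prediction` column, not the real data):

```python
import pandas as pd

# Toy stand-in for the 'Prediction' column (assumption: 1 = spam, 0 = not spam)
y_demo = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Prediction")

# Absolute number of emails per class
class_counts = y_demo.value_counts().sort_index()

# Share of each class (fractions that sum to 1)
class_share = y_demo.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share)
```

On the real data, the same two calls on `y` would show how many of the 5172 emails are labeled spam.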


**Step 3: Splitting the Dataset into Training and Testing Sets**

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the split data
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)


Training data shape: (4137, 3000)
Testing data shape: (1035, 3000)
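The split above is purely random. One possible variation, shown here on synthetic data rather than the email dataset, is to pass `stratify=` so that the training and testing sets keep the same spam proportion. The arrays `X_demo` and `y_demo` are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 "emails", 4 word-count features, 30% spam
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(100, 4))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Spam share in train:", y_tr.mean())
print("Spam share in test:", y_te.mean())
```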


**Step 4: Train the Logistic Regression Model**

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model on the training data
model.fit(X_train, y_train)

# Training is complete; the next step checks how well the model performs on the test data.

# LogisticRegression(max_iter=1000): creates the model. max_iter=1000 raises the iteration cap above the default of 100 so the solver has enough iterations to converge.
# model.fit(X_train, y_train): trains the model on the training data, so it learns which word-count patterns separate spam from non-spam emails.
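One way to confirm that the iteration cap was high enough is to inspect the fitted model's `n_iter_` attribute, which records how many iterations the solver actually used. A minimal sketch on synthetic count data (`X_demo`, `y_demo`, and `demo_model` are made-up names, not the notebook's variables):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic word-count-like features and a label derived from two of them
rng = np.random.default_rng(0)
X_demo = rng.poisson(2.0, size=(200, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 4).astype(int)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver ran
iterations_used = int(demo_model.n_iter_.max())
print("Iterations used:", iterations_used)
print("Stopped before the cap:", iterations_used < demo_model.max_iter)
```

If the number of iterations equals `max_iter`, the solver most likely hit the cap, and scikit-learn normally emits a `ConvergenceWarning` in that case.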


**Step 5: Evaluate the Model's Performance**

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Predict the labels (1 = spam, 0 = not spam) for the test emails
y_pred = model.predict(X_test)

# Overall accuracy: the fraction of test emails classified correctly
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')

# Detailed report: precision, recall, and F1-score for each class (spam and non-spam)
print("Classification Report:")
print(classification_report(y_test, y_pred))



Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       739
           1       0.94      0.96      0.95       296

    accuracy                           0.97      1035
   macro avg       0.96      0.97      0.97      1035
weighted avg       0.97      0.97      0.97      1035



***The model achieved 97% accuracy on the 1,035 test emails. For non-spam emails (class 0), precision and recall are both 98%; for spam emails (class 1), precision is 94% and recall is 96%, giving F1-scores of 0.98 and 0.95 respectively.***
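Precision and recall come directly from the counts in a confusion matrix. The sketch below uses ten made-up labels (not the notebook's test set) to show how the spam-class numbers are derived and that they match scikit-learn's own scores:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions (1 = spam, 0 = not spam)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of the emails flagged as spam, how many really were spam
spam_precision = tp / (tp + fp)
# Recall: of the real spam emails, how many were caught
spam_recall = tp / (tp + fn)

print("Precision:", spam_precision, "Recall:", spam_recall)
```

For a spam filter, the false positives (legitimate emails flagged as spam) are usually the costlier mistake, so the raw counts are worth looking at alongside the percentages.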

**Step 6: Save the Trained Model for Future Use**

In [None]:
import joblib

# Save the trained model to a file named 'Safi_spam_filter_model.pkl' so it can be reused without retraining
joblib.dump(model, 'Safi_spam_filter_model.pkl')

# The model can be loaded later with joblib.load('Safi_spam_filter_model.pkl') to make predictions on new data.



['Safi_spam_filter_model.pkl']
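The saved model only makes sense together with the exact column order it was trained on. One possible approach, sketched here with a tiny made-up dataset and a hypothetical file name (`demo_spam_bundle.pkl`), is to store the model and the feature names in a single file:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: three word-count columns, four emails
X_demo = pd.DataFrame(
    {"free": [3, 0, 2, 0], "meeting": [0, 2, 0, 1], "offer": [1, 0, 2, 0]}
)
y_demo = [1, 0, 1, 0]
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Bundle the model with the column order it expects
bundle = {"model": demo_model, "columns": list(X_demo.columns)}
bundle_path = os.path.join(tempfile.mkdtemp(), "demo_spam_bundle.pkl")
joblib.dump(bundle, bundle_path)

# Later: restore both pieces from the same file
restored = joblib.load(bundle_path)
print(restored["columns"])
```

With this layout, code that builds a feature vector for a new email does not need the original training DataFrame in memory.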

**Step 7: Load the Model and Make Predictions on New Data**

In [None]:
# Load the saved model from the file
loaded_model = joblib.load('Safi_spam_filter_model.pkl')

# Simulate new data: take the first 5 emails from the test set as an example.
# This can be replaced with real new data later.
new_emails = X_test.iloc[:5]

# Use the loaded model to make predictions on the new data
predictions = loaded_model.predict(new_emails)

# Display the predictions (1 means spam, 0 means not spam)
print("Predictions for new emails:", predictions)



Predictions for new emails: [0 0 1 0 0]


## **Real-World Example: An Email I Received from DiceCamp**

**Step 1: Preprocess and Tokenize the Email**

In [None]:
import re
from collections import Counter

# Provided email as a string
email_text = """
Dear Muhammad Safi Ullah Adam,

Ready to advance your skills in AI and Data Science? We have an exciting opportunity for you! Our AI & Data Science Training Program is now available with a 50% discount, including two self-paced courses designed to elevate your career.

Heres what you get:
Data Science & Machine Learning
Artificial Intelligence with Deep Learning & Computer Vision
Both programs come with lifetime access, allowing you to learn at your own pace, and earn two separate certifications upon completion.

Dont miss out on this exclusive offer—secure your spot now and fast-track your career in the most in-demand fields!

To Register: Click Here

Invest in yourself today and unlock the future of AI and Data Science.

Best Regards,
Dicecamp Team"""

# Basic preprocessing: lowercasing and removing special characters/numbers
email_cleaned = re.sub(r'[^a-zA-Z\s]', '', email_text.lower())

# Tokenize the text (split into individual words)
email_words = email_cleaned.split()

# Count word frequencies
word_freq = Counter(email_words)

print(word_freq.most_common(10))  # Check top 10 most common words


[('your', 5), ('and', 5), ('to', 4), ('data', 4), ('science', 4), ('in', 3), ('ai', 3), ('you', 3), ('with', 3), ('now', 2)]


***Preprocessing: We lowercased the text, removed everything except letters and whitespace, split it into words, and counted how often each word occurs.***
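One side effect of deleting punctuation outright is that hyphenated words are fused together. The sketch below compares that behavior with a possible alternative that replaces punctuation with a space instead; the sample sentence is adapted from the email, and which variant matches the dataset's original tokenization is not known:

```python
import re

sample = "Fast-track your career in the most in-demand fields!"

# Same pattern as above: punctuation is deleted, so hyphenated words merge
stripped = re.sub(r"[^a-zA-Z\s]", "", sample.lower()).split()

# Alternative: punctuation becomes a space, so hyphenated words split apart
spaced = re.sub(r"[^a-zA-Z\s]", " ", sample.lower()).split()

print(stripped)
print(spaced)
```

The two variants give different counts for short words such as `in`, so the choice should follow however the training data was built.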

**Step 2: Create a Frequency Vector**

In [None]:
import numpy as np

# Create a vector for the email (same length as the training data's features)
email_vector = np.zeros(X_train.shape[1])

# Map the words to the corresponding columns in the training data
for word, count in word_freq.items():
    if word in X.columns:
        # Find the column index corresponding to the word
        index = X.columns.get_loc(word)
        # Set the frequency count in the vector
        email_vector[index] = count

# Convert to a 2D array since the model expects batches of data
email_vector = email_vector.reshape(1, -1)

print(email_vector)  # Check the email vector


[[2. 4. 0. ... 1. 0. 0.]]


***Create a vector: We mapped the word frequencies onto the 3,000 feature columns used in training. Words that do not appear among those columns are ignored.***
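The cleaning and mapping steps can be combined into one helper. The sketch below is one possible way to do it: `text_to_feature_row` is a hypothetical function name, and the four-word vocabulary is made up for the demo. It returns a one-row DataFrame with named columns, because recent scikit-learn versions warn about missing feature names when a model fitted on a DataFrame is given a plain NumPy array:

```python
import re
from collections import Counter

import pandas as pd


def text_to_feature_row(text, columns):
    """Turn raw email text into a one-row DataFrame of word counts."""
    # Same cleaning as in Step 1: lowercase, keep only letters and whitespace
    cleaned = re.sub(r"[^a-zA-Z\s]", "", text.lower())
    counts = Counter(cleaned.split())
    # Words outside the vocabulary are ignored; missing words count as 0
    row = {col: counts.get(col, 0) for col in columns}
    return pd.DataFrame([row], columns=list(columns))


# Made-up four-word vocabulary standing in for the 3000 training columns
demo_columns = ["the", "to", "free", "offer"]
demo_row = text_to_feature_row(
    "Claim the FREE offer: free to the first 10!", demo_columns
)
print(demo_row)
```

In the notebook, the call would presumably be `text_to_feature_row(email_text, X.columns)`, with the result passed straight to `loaded_model.predict`.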

**Step 3: Make Predictions**

In [None]:
# Make a prediction using the loaded model
email_prediction = loaded_model.predict(email_vector)

# Display the result
print("Prediction for the new email from DiceCamp (0 = not spam, 1 = spam):", email_prediction[0])


Prediction for the new email from DiceCamp (0 = not spam, 1 = spam): 0




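***Make predictions: We used the trained model to predict whether this new email is spam. The model returned 0, meaning it classified the email as not spam. That is the right answer only if this promotional email is regarded as legitimate; if it is treated as spam, the model misclassified it.***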
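A hard 0/1 label hides how confident the model is, which matters for borderline cases such as promotional emails. A minimal sketch, using a toy model trained on made-up two-feature data (not the notebook's model), of reading the spam probability with `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: column 0 is a "spammy" word count, column 1 a "normal" one
X_demo = np.array([[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 4]])
y_demo = np.array([1, 1, 1, 0, 0, 0])
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# A new, made-up email vector
new_vector = np.array([[2, 1]])

# predict_proba returns one probability per class, ordered as in classes_
proba = demo_model.predict_proba(new_vector)[0]
spam_index = list(demo_model.classes_).index(1)
spam_probability = proba[spam_index]

print(f"P(spam) = {spam_probability:.2f}")
```

In the notebook, the equivalent call would be `loaded_model.predict_proba(email_vector)`; a probability close to 0.5 would suggest the email is a borderline case.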