<a href="https://colab.research.google.com/github/BOM-Developer/A--to--Z-JavaScript/blob/main/Spam_email_classifier_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## My supervised spam/ham classification model, with accuracy up to 97.7%

## Installing Essential Libraries
* pandas
* scikit-learn
* nltk

In [None]:
import pandas as pd


## Loading the Dataset

In [None]:
data = pd.read_csv('/content/spam_ham_dataset.csv', encoding='latin-1')  # Replacing with the correct encoding
data.head(5)




Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [None]:
data.isnull().sum()

Unnamed: 0    0
label         0
text          0
label_num     0
dtype: int64

## Dropping Unnecessary Columns
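The notebook does not show this step. A minimal sketch, assuming a DataFrame shaped like the one in the later cells (a leftover index column, plus `v1` for the label and `v2` for the text). The toy rows and the name `df` are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for the loaded dataset (rows are illustrative only)
df = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'v1': ['ham', 'spam', 'ham'],
    'v2': ['Ok lar...', 'FREE entry!!', 'See you at 5?'],
})

# Drop leftover index columns such as 'Unnamed: 0' that carry no signal
unnamed_cols = [col for col in df.columns if col.startswith('Unnamed')]
df = df.drop(columns=unnamed_cols)

print(list(df.columns))
```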

## Lowercasing the Text
Converting text to lowercase can help model performance, because some tokenizers and algorithms treat "Free" and "free" as different tokens.
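The notebook does not show the lowercasing cell. One way it could look, assuming the text lives in a `v2` column as in the later cells (the two sample rows are invented):

```python
import pandas as pd

# Toy frame; the column name 'v2' mirrors the later cells
df = pd.DataFrame({'v2': ['FREE Entry NOW', 'Ok Lar']})

# Lowercase so 'FREE' and 'free' map to the same token
df['v2'] = df['v2'].str.lower()

print(df['v2'].tolist())
```

Note that scikit-learn's `TfidfVectorizer` lowercases input by default (`lowercase=True`), so this step mainly matters for preprocessing that runs before vectorization, such as stop-word removal.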

## Removing Punctuation and Special Characters:

Punctuation and special characters may add little signal for spam classification. We can remove them with a regular expression.

In [None]:
import re

# Remove punctuation and special characters except for whitespace
data['v2'] = data['v2'].apply(lambda text: re.sub(r'[^\w\s]', '', text))
data


Unnamed: 0,v1,v2
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...
...,...,...
5567,spam,this is the 2nd time we have tried 2 contact u...
5568,ham,will ì_ b going to esplanade fr home
5569,ham,pity was in mood for that soany other suggest...
5570,ham,the guy did some bitching but i acted like id ...


## Removing Stop Words (Optional):

Stop words are common words (such as "the", "is", and "and") that might not be very helpful for classification. We will remove them using the English stop-word list from nltk.

In [None]:
import nltk
from nltk.corpus import stopwords

In [None]:
# Download the stop-word list (only needed once per environment)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# Remove stop words from text messages
data['v2'] = data['v2'].apply(lambda text: ' '.join([word for word in text.split() if word not in stop_words]))
data


Unnamed: 0,v1,v2
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
...,...,...
5567,spam,2nd time tried 2 contact u u å750 pound prize ...
5568,ham,ì_ b going esplanade fr home
5569,ham,pity mood soany suggestions
5570,ham,guy bitching acted like id interested buying s...


## Let's move on to TF-IDF (Term Frequency-Inverse Document Frequency):
This is a common technique that converts text into numerical features. It weighs how often a word appears in a document (term frequency) against how common the word is across all documents (inverse document frequency), so words that appear almost everywhere receive less weight. scikit-learn provides `TfidfVectorizer` for this.
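To make the weighting concrete, here is a small sketch on a three-document toy corpus (the documents and the name `toy_vectorizer` are illustrative, not from the dataset). It shows that a word found in fewer documents gets a higher inverse document frequency:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; 'free' appears in two documents, 'meeting' in one
docs = [
    'free prize win free',
    'meeting at noon',
    'win a free prize',
]

toy_vectorizer = TfidfVectorizer()
tfidf = toy_vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each term to its column index; idf_ holds the learned IDF weights
idf_free = toy_vectorizer.idf_[toy_vectorizer.vocabulary_['free']]
idf_meeting = toy_vectorizer.idf_[toy_vectorizer.vocabulary_['meeting']]

print(tfidf.shape)
print(f'idf(free)={idf_free:.3f}  idf(meeting)={idf_meeting:.3f}')
```

With the default token pattern, single-character tokens such as "a" are dropped, so the vocabulary here has six terms.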

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer object
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the cleaned text data (v2 column)
X_features = vectorizer.fit_transform(data['v2'])

# Now X_features is a sparse matrix containing TF-IDF features for each email


In [None]:
print(X_features.shape)

(5572, 8672)


In [None]:
print(X_features.toarray()[0])

[0. 0. 0. ... 0. 0. 0.]


## Splitting the Data into Training and Testing Sets

Holding out a testing set is crucial for evaluating the machine learning model's performance on data it has not seen during training.

In [None]:
from sklearn.model_selection import train_test_split

# TF-IDF features in X_features and spam labels in data['v1']
X_train, X_test, y_train, y_test = train_test_split(X_features, data['v1'], test_size=0.2, random_state=42)

# X_train, X_test: Training and testing data (TF-IDF features)
# y_train, y_test: Training and testing labels (spam or not)
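
In the cells above, the vectorizer is fitted on the full dataset before the split, so the IDF weights are computed with the test messages included. One possible variation, sketched here on invented toy data, splits the raw text first, fits TF-IDF on the training part only, and uses `stratify` to keep the spam/ham ratio the same in both parts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy messages: 5 ham and 5 spam
texts = [
    'lunch at noon', 'see you tomorrow', 'call me later',
    'meeting moved to friday', 'thanks for the notes',
    'win a free prize', 'free cash now', 'claim your reward',
    'cheap pills online', 'urgent offer click here',
]
labels = ['ham'] * 5 + ['spam'] * 5

# Split the raw text first; stratify keeps the class ratio in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Fit TF-IDF on the training text only, then reuse it for the test text
split_vectorizer = TfidfVectorizer()
X_tr = split_vectorizer.fit_transform(text_train)
X_te = split_vectorizer.transform(text_test)

print(X_tr.shape, X_te.shape)
```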


In [None]:
from sklearn.naive_bayes import MultinomialNB  # Example for Naive Bayes

# Replace with 'SVC' from sklearn.svm if using SVM
model = MultinomialNB()

# You can also adjust hyperparameters here (e.g., for SVM)


In [None]:
model.fit(X_train, y_train)


In [None]:
model.score(X_test, y_test)

0.9623318385650225

## Wow! That's a very good result!
`model.score(X_test, y_test)` reports an accuracy of 96.23% on the testing set. This means our model correctly classified about 96% of the emails in the testing set (which it did not see during training) as spam or not spam.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_pred = model.predict(X_test)
y_pred


array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'spam'], dtype='<U4')

In [None]:
cm = confusion_matrix(y_test, y_pred)


In [None]:
print(cm)


[[965   0]
 [ 42 108]]


## Looking at the values:

scikit-learn puts the actual classes in rows and the predicted classes in columns, with labels in sorted order (ham first, then spam). Treating spam as the positive class:

* True Negatives (965): The model correctly classified all 965 ham emails in the testing set as not spam (row 1, column 1).
* False Positives (0): This is ideal! No emails were incorrectly classified as spam when they were actually not spam (row 1, column 2).
* False Negatives (42): These are the emails the model incorrectly classified as not spam (predicted as "ham") when they were actually spam (row 2, column 1). This represents misclassified spam emails.
* True Positives (108): These are the spam emails the model correctly classified as spam (row 2, column 2).

## Overall Performance:

Based on the confusion matrix, the model never mislabels ham as spam, but it misses 42 of the 150 spam emails in the testing set, so it catches only 72% of the spam. The overall accuracy of 96.23% looks high partly because ham makes up 965 of the 1,115 test emails. The missed spam is the main thing to improve.
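
As a check on the numbers, the sketch below recomputes accuracy, precision, and recall directly from the matrix printed above. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, labels sorted as ham then spam) and treats spam as the positive class. The name `cm_nb` is only for illustration:

```python
import numpy as np

# Naive Bayes confusion matrix from the output above
cm_nb = np.array([[965, 0],
                  [42, 108]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP (spam = positive class)
tn, fp, fn, tp = cm_nb.ravel()

accuracy = (tp + tn) / cm_nb.sum()   # share of all emails classified correctly
precision = tp / (tp + fp)           # share of predicted spam that is really spam
recall = tp / (tp + fn)              # share of actual spam that was caught

print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}')
```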

## Let's try using a Support Vector Machine (SVM) for your spam classification task

In [None]:
from sklearn.svm import SVC


In [None]:
# Replace 'MultinomialNB' with 'SVC'
model = SVC()

# You can also adjust hyperparameters here (e.g., kernel, C)


In [None]:
model.fit(X_train, y_train)


## Model Evaluation

In [None]:
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)


[[965   0]
 [ 26 124]]
Accuracy: 0.9766816143497757


## Confusion Matrix: Shows strong performance. Treating spam as the positive class, it indicates:
* True Negatives (965): The model correctly classified all 965 ham emails in the testing set as not spam.
* False Positives (0): There were no emails incorrectly classified as spam (when they were not spam). This is ideal!
* False Negatives (26): The model missed 26 spam emails, classifying them as not spam. This is an improvement over Naive Bayes (42 False Negatives).
* True Positives (124): The model correctly classified 124 of the 150 spam emails as spam.
* Accuracy (0.9767): This is higher than the accuracy achieved with Naive Bayes (0.9623). The model correctly classified about 97.7% of the emails in the testing set.

Overall, the SVM model performs better than the Naive Bayes model on this testing set. Let's check whether GridSearchCV can find hyperparameters that improve the accuracy further.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define the hyperparameter search space (example for SVM)
param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 100]}

# Create a GridSearchCV object with your model and hyperparameter space
grid_search = GridSearchCV(SVC(), param_grid)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Get the model with the highest mean cross-validated accuracy, refit on all training data
best_model = grid_search.best_estimator_

# Evaluate the best model on the testing set
accuracy = best_model.score(X_test, y_test)
print("Accuracy with best hyperparameters:", accuracy)



## That's a very good result!
The accuracy with the tuned hyperparameters (0.9776) is slightly higher than the accuracy achieved without tuning (0.9767). On a testing set of 1,115 emails, that difference corresponds to one additional correctly classified email, so tuning helped only marginally here. As we have achieved our target accuracy, let's save the model and deploy it on a server.
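
To see which configuration a grid search selected, the fitted object exposes `best_params_` and `best_score_`. The sketch below uses synthetic data from `make_classification` rather than the email features, so the numbers it prints say nothing about the spam model. It only illustrates where to look:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the email dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Same search space as the cell above, with an explicit 5-fold cross-validation
search = GridSearchCV(SVC(), {'kernel': ['linear', 'rbf'], 'C': [1, 10, 100]}, cv=5)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Mean cross-validated accuracy:', search.best_score_)
print('Held-out accuracy:', search.score(X_te, y_te))
```

Comparing the cross-validated score with the held-out score gives a rough sense of whether a small gain on the testing set reflects a real improvement.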
# Thanks for reading! Please share your valuable thoughts with us.

# Saving the model

In [None]:
!pip install joblib



In [None]:
import joblib

In [None]:
joblib.dump(best_model, 'spam_classifier.pkl')


In [None]:
import os

import joblib
from google.colab import drive

MOUNT_POINT = '/content/gdrive'  # Where Google Drive is mounted in the Colab filesystem

def save_model_to_drive(model, desired_path):
  """Saves the model to a specified location in Google Drive.

  Args:
      model: The trained model object to save.
      desired_path: The desired path within Google Drive (e.g., '/MyDrive/Colab Notebooks/spam_classifier.pkl').
  """
  try:
    drive.mount(MOUNT_POINT)  # Mount and handle authentication if needed
    # Drive paths live under the mount point, so prepend it to the Drive path
    full_path = os.path.join(MOUNT_POINT, desired_path.lstrip('/'))
    joblib.dump(model, full_path)
    print(f'Model saved successfully to: {full_path}')
  except Exception as e:
    print(f'Error saving model: {e}')
  finally:
    drive.flush_and_unmount()  # Flush pending writes and unmount the drive


# Check if the model is defined in this cell (less likely)
if 'model' in locals():
  # If yes, prompt for the desired path and save the model
  desired_path = input("Enter the desired path within your Drive (e.g., '/MyDrive/Colab Notebooks/spam_classifier.pkl'): ")
  save_model_to_drive(model, desired_path)
else:
  # If not defined here, assume it's defined elsewhere (more likely)
  print("Assuming your trained model ('model') is defined elsewhere in your Colab notebook.")
  print("Make sure to call the 'save_model_to_drive(model, desired_path)' function from that location, providing the desired path as the second argument.")

print('All changes made in this Colab session should now be visible in Drive (after saving the model).')


Assuming your trained model ('model') is defined elsewhere in your Colab notebook.
Make sure to call the 'save_model_to_drive(model, desired_path)' function from that location, providing the desired path as the second argument.
All changes made in this Colab session should now be visible in Drive (after saving the model).


In [None]:
from flask import Flask, request, jsonify
import joblib  # Assuming you have joblib installed

# Define the Flask app
app = Flask(__name__)

# Load the saved model and the fitted vectorizer (replace with your filenames)
# model = joblib.load('spam_classifier_v1.4.2.pkl')
model = joblib.load('spam_classifier.pkl')
vectorizer = joblib.load('vectorizer.pkl')

@app.route('/classify_spam', methods=['POST'])
def classify_spam():
  # Get the email content from the request
  email_content = request.form.get('email_content')
  if not email_content:
    return jsonify({'error': 'email_content is required'}), 400

  # Preprocess the email content (replace with your logic)
  # ... (e.g., convert to lowercase, remove punctuation)

  # Convert the text to the feature space the model was trained on
  features = vectorizer.transform([email_content])

  # Probability of the second class in model.classes_ (the spam class here).
  # Requires a model with predict_proba, e.g. MultinomialNB or SVC(probability=True).
  prediction = float(model.predict_proba(features)[0][1])

  # Return JSON response with the prediction
  return jsonify({'spam_probability': prediction})

if __name__ == '__main__':
  app.run(debug=True)


In [None]:
# ... (rest of the code)

# Define or load your trained model (if applicable)
model = ...  # Your model definition or loading logic

# Save the model with the desired path
save_model_to_drive(model, '/MyDrive/spam_classifier.pkl')

# ... (rest of the code)


In [None]:
import joblib

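# Save the model and the fitted vectorizer to files
joblib.dump(model, 'spam_classifier.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Load them back from the files
loaded_model = joblib.load('spam_classifier.pkl')
loaded_vectorizer = joblib.load('vectorizer.pkl')

# Use the loaded model to make predictions on new data.
# Raw text must go through the same vectorizer that was used for training.
new_data = ['This is a new email', 'This is another new email']
predictions = loaded_model.predict(loaded_vectorizer.transform(new_data))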

# Print the predictions
print(predictions)
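
An alternative that avoids keeping two files in sync is to wrap the vectorizer and the classifier in a single scikit-learn pipeline and persist that one object. This is a sketch on invented toy messages, not the notebook's trained model. The names `spam_pipeline` and `spam_pipeline.pkl` are hypothetical:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training data
train_texts = ['win a free prize now', 'free cash claim now',
               'lunch at noon tomorrow', 'see you at the meeting']
train_labels = ['spam', 'spam', 'ham', 'ham']

# The pipeline vectorizes raw text and then classifies it
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Save and reload the whole pipeline as one file (in a temporary directory here)
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, 'spam_pipeline.pkl')
    joblib.dump(spam_pipeline, pipeline_path)
    restored = joblib.load(pipeline_path)

# The restored pipeline accepts raw strings directly
toy_predictions = restored.predict(['claim your free prize'])
print(toy_predictions)
```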

In [None]:
import re

import joblib

# Load the model and vectorizer
model = joblib.load('spam_classifier.pkl')
vectorizer = joblib.load('vectorizer.pkl')

def check_spam(email_text):
    # Preprocess the email text
    email_text = email_text.lower()
    email_text = re.sub(r'\W', ' ', email_text)
    email_text = re.sub(r'\s+', ' ', email_text)

    # Convert the text to a feature vector
    text_vectorized = vectorizer.transform([email_text])

    # Make a prediction
    spam_probability = model.predict_proba(text_vectorized)[0][1]

    # Interpret the result
    if spam_probability > 0.5:
        print("Email is classified as spam with a probability of {:.2f}%.".format(spam_probability * 100))
    else:
        print("Email is classified as legitimate with a probability of {:.2f}%.".format((1 - spam_probability) * 100))

# Example usage
email_text = """
Dear customer,

You have won a free iPhone! Click on the link below to claim your prize.

http://www.scamwebsite.com

Sincerely,

The iPhone Giveaway Team
"""

check_spam(email_text)

In [None]:
import re

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

# Load the spam and ham datasets
spam_data = pd.read_csv('spam.csv')
ham_data = pd.read_csv('ham.csv')

# Combine the datasets and prepare features and labels
data = pd.concat([spam_data, ham_data])
data['label'] = data['label'].map({'spam': 1, 'ham': 0})
X = data['text']
y = data['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

# Save the model and vectorizer
joblib.dump(model, 'spam_classifier.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Check a new email
email_text = """
... (text content of the new email)
"""

def check_spam(email_text):
    # Preprocess the email text
    email_text = email_text.lower()
    email_text = re.sub(r'\W', ' ', email_text)
    email_text = re.sub(r'\s+', ' ', email_text)

    # Convert the text to a feature vector
    text_vectorized = vectorizer.transform([email_text])

    # Make a prediction
    spam_probability = model.predict_proba(text_vectorized)[0][1]

    # Interpret the result
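    if spam_probability > 0.5:
        print("Email is classified as spam with a probability of {:.2f}%.".format(spam_probability * 100))
    else:
        print("Email is classified as legitimate with a probability of {:.2f}%.".format((1 - spam_probability) * 100))

check_spam(email_text)

Email is classified as spam with a probability of 98.32%.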