# EMAIL SPAM DETECTION WITH MACHINE LEARNING

In this project, we develop a Python-based email spam detector using machine learning to classify emails as either spam or non-spam, addressing the issue of unwanted and potentially harmful email communications.

Required Modules:

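Pandas: For data manipulation.
NumPy: Fundamental package for scientific computing.
NLTK: For stop-word removal and stemming.
scikit-learn: For feature extraction, train/test splitting, the Naive Bayes model, and evaluation metrics.
Matplotlib: For creating quality plots.
Seaborn: Data visualization library based on Matplotlib.
SciPy: Ecosystem for mathematics, science, and engineering software.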

In [5]:
#importing libraries
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Prajakta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Prajakta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Prajakta\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
#reading of csv file
df = pd.read_csv('spam.csv',encoding = "ISO-8859-1")

In [7]:
#Displaying dataset
print("Email spam detection dataset is: \n",df)

Email spam detection dataset is: 
         v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        

In [8]:
#Top and bottom 5 rows of the email spam detection dataset
print("Top 5 rows of the dataset are: \n",df.head())
print("\n\nBottom 5 rows of the dataset are: \n",df.tail())

Top 5 rows of the dataset are: 
      v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


Bottom 5 rows of the dataset are: 
         v1                                                 v2 Unnamed: 2  \
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some b

The Porter stemming algorithm, commonly referred to as the 'Porter stemmer,' strips common morphological and inflectional endings from English words, so that variants such as "connected" and "connection" reduce to the same stem, "connect". This term normalization is a standard preprocessing step when building Information Retrieval systems, and it is applied here before the words are counted.

In [10]:
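# PorterStemmer reduces each word to its stem
ps=PorterStemmer()
# build the stop-word set once instead of reloading the list for every word
stop_words=set(stopwords.words('english'))
corpus=[]
for i in range(0,len(df)):
  # keep letters only, then lowercase and split into words
  review=re.sub('[^a-zA-Z]', ' ', df['v2'][i])
  review = review.lower()
  review = review.split()
  # drop stop words and stem the remaining words
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)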
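
To make the cleaning loop above concrete, here is a minimal sketch of the same steps applied to a single message, using only the standard library. The sample message and the tiny `STOP_WORDS` set are made up for illustration (the notebook uses NLTK's full English list), and the stemming step is deliberately left out so the snippet runs without NLTK.

```python
import re

# Hypothetical, heavily shortened stop-word list; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"a", "in", "to", "the", "you", "have"}

def clean_message(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Made-up message in the style of the dataset.
sample = "You have WON a free prize!!! Call 0800-123 to claim."
cleaned = clean_message(sample)
print(cleaned)  # ['won', 'free', 'prize', 'call', 'claim']
```

Digits and punctuation disappear because the regular expression replaces every non-letter with a space, which is why the phone number leaves no token behind.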

CountVectorizer, available in the Python scikit-learn library, converts textual data into a numerical vector representation. It builds a vocabulary from the whole corpus and represents each message as a row of counts, one per vocabulary word. Setting max_features=2500 keeps only the 2,500 most frequent words.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()
# one-hot encode the labels; column 1 is 'spam', so y is 1 for spam and 0 for ham
y=pd.get_dummies(df['v1'])
y=y.iloc[:,1].values
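
The bag-of-words idea behind the cell above can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: the function name `count_vectorize` and the toy corpus are hypothetical, tokens are split on whitespace only, and ties for `max_features` are broken alphabetically, which may differ from what CountVectorizer does.

```python
from collections import Counter

def count_vectorize(docs, max_features=None):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Rank terms by corpus frequency (ties alphabetical), keep the top ones,
    # then store the vocabulary in sorted order.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = sorted(ranked[:max_features])
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix

# Toy corpus of already cleaned and stemmed messages (made up).
toy_corpus = ["free prize call", "call me later", "free free entri"]
vocab, counts = count_vectorize(toy_corpus)
print(vocab)   # ['call', 'entri', 'free', 'later', 'me', 'prize']
print(counts)  # [[1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 1, 0], [0, 1, 2, 0, 0, 0]]
```

Each row has the same length regardless of message length, which is what lets a classifier treat every message as a fixed-size feature vector.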

# Train-Test Split

train_test_split() takes the arrays you want to divide (lists, NumPy arrays, pandas DataFrames, or SciPy sparse matrices of equal length) along with any optional parameters, and returns a train part and a test part for each input, in the same order. Here test_size = 0.20 holds out 20% of the messages for testing, and random_state = 0 makes the shuffle reproducible.

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)  
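
As a rough illustration of what a shuffled hold-out split does, the sketch below reimplements the idea with the standard library. The helper `simple_train_test_split` is hypothetical and will not reproduce scikit-learn's exact shuffle. It rounds the test size up, which is an assumption about scikit-learn's behavior that is at least consistent with the 1115 test rows reported later for the 5572-row dataset.

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices once, then hold out ceil(n * test_size) rows."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Ten made-up rows; each label is derived from its row so alignment can be checked.
X_demo = [[i] for i in range(10)]
y_demo = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo, test_size=0.2)

print(len(X_tr), len(X_te))     # 8 2
print(math.ceil(5572 * 0.20))   # 1115
```

Shuffling indices rather than the data itself keeps each feature row paired with its label, which is the property any split must preserve.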

The Naive Bayes classifier applies Bayes' Theorem to assign data to distinct classes under the "naive" assumption that the predictors are conditionally independent given the class. In other words, once the class is known, the presence of one feature (here, a word) is treated as unrelated to the presence of any other feature.

The multinomial Naive Bayes classifier is specifically designed for classification tasks involving discrete features, such as word counts for text classification. Typically, the multinomial distribution expects integer feature counts, although fractional counts like tf-idf can also be used effectively in practice.

In [13]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)

In [15]:
y_pred=spam_detect_model.predict(X_test)
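
For intuition about what the model learns, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is not scikit-learn's code: the function names, the four-word vocabulary, and the toy count matrix are all made up for illustration, and `alpha=1.0` is chosen to mirror the smoothing default of `MultinomialNB`.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word across all messages of class c.
        word_totals = [sum(col) for col in zip(*rows)]
        denom = sum(word_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in word_totals]
    return log_prior, log_likelihood

def predict_multinomial_nb(model, x):
    """Pick the class with the highest log prior plus count-weighted log likelihood."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy counts over the hypothetical vocabulary ["call", "free", "later", "prize"].
X_small = [
    [1, 1, 0, 1],  # spam
    [0, 2, 0, 1],  # spam
    [1, 0, 1, 0],  # ham
    [1, 0, 2, 0],  # ham
]
y_small = [1, 1, 0, 0]  # 1 = spam, 0 = ham

model = fit_multinomial_nb(X_small, y_small)
pred_spam = predict_multinomial_nb(model, [0, 1, 0, 1])  # "free prize"
pred_ham = predict_multinomial_nb(model, [1, 0, 1, 0])   # "call later"
print(pred_spam, pred_ham)  # 1 0
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long messages. The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.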

**Model Precision:**

The model precision score measures the proportion of positively predicted labels that are actually correct. Precision is also known as the positive predictive value. It is a useful metric for evaluating the accuracy of a model's positive predictions and is sensitive to the class distribution.

**Recall Score:**

The model recall score is the proportion of actual positives that the model correctly predicts as positive. This differs from precision, which is the proportion of positive predictions that are actually positive. Recall is another important metric for assessing the performance of a classification model.

**Accuracy Score:**

The accuracy score is the proportion of all predictions, positive and negative, that are correct.

**Confusion Matrix:**

A confusion matrix, also referred to as an error matrix, tabulates a classification model's predictions against the actual labels. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the off-diagonal entries show the different types of errors the model makes, such as false positives and false negatives.

**Classification Report:**

A classification report summarizes key metrics (precision, recall, F1 score, and support) for each class in your trained classification model. It gives a more detailed picture of a model's performance than a single accuracy figure.

In [16]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
print(confusion_matrix(y_test,y_pred))

[[943   6]
 [  9 157]]
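
The four cells of the matrix above can be named and turned into the precision and recall described earlier. This sketch copies the printed values by hand and assumes class 1 is spam, which follows from the label encoding taking the second dummy column.

```python
# Confusion matrix as printed: rows are actual classes, columns are predicted
# classes, with class 0 = ham and class 1 = spam.
cm = [[943, 6],
      [9, 157]]
(tn, fp), (fn, tp) = cm

precision_spam = tp / (tp + fp)  # of messages flagged as spam, the share that were spam
recall_spam = tp / (tp + fn)     # of actual spam messages, the share that were caught

print(f"ham flagged as spam (false positives): {fp}")  # 6
print(f"spam missed (false negatives): {fn}")          # 9
print(f"precision (spam): {precision_spam:.2f}")       # 0.96
print(f"recall (spam): {recall_spam:.2f}")             # 0.95
```

For a spam filter the two error types have different costs: a false positive hides a legitimate message, while a false negative lets spam through.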


In [17]:
print("Accuracy Score {}".format(accuracy_score(y_test,y_pred)))

Accuracy Score 0.9865470852017937
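
The accuracy figure can be checked by hand from the confusion matrix: it is the sum of the diagonal divided by the total number of test messages. The sketch below also computes a majority-class baseline (always predicting ham) as a point of comparison. That baseline is an added illustration, not something the notebook computes.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = [[943, 6],
      [9, 157]]

correct = cm[0][0] + cm[1][1]
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Accuracy of a trivial model that labels every message as ham (class 0).
baseline = sum(cm[0]) / total

print(f"accuracy: {accuracy:.4f}")  # 0.9865
print(f"baseline: {baseline:.4f}")  # 0.8511
```

Because about 85% of the test messages are ham, accuracy alone overstates how hard the task is, which is why the per-class precision and recall are worth reading alongside it.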


In [18]:
print("Classification report:\n{}".format(classification_report(y_test,y_pred)))

Classification report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       949
           1       0.96      0.95      0.95       166

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
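
Every number in the report can be derived from the confusion matrix. The following sketch recomputes the per-class F1 scores and the macro and weighted averages. The helper `per_class_scores` is hypothetical, and the matrix values are copied by hand from the earlier output.

```python
cm = [[943, 6],
      [9, 157]]

def per_class_scores(cm, k):
    """Precision, recall, F1 and support for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k

scores = [per_class_scores(cm, k) for k in range(len(cm))]
total = sum(s[3] for s in scores)

# Macro average treats both classes equally; weighted average weights by support.
macro_f1 = sum(s[2] for s in scores) / len(scores)
weighted_f1 = sum(s[2] * s[3] for s in scores) / total

for k, (p, r, f1, support) in enumerate(scores):
    print(f"{k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"macro f1={macro_f1:.2f} weighted f1={weighted_f1:.2f}")
```

The macro average comes out lower than the weighted average because it gives the smaller spam class, where the model is slightly weaker, the same weight as ham.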

