### Aim:
To develop a machine learning system that classifies email messages as either "spam" (unwanted) or "ham" (legitimate) from a labeled dataset, in order to improve email filtering and reduce the time and effort users spend dealing with unwanted emails.

### Importing Required Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Loading the CSV file into a pandas DataFrame

In [2]:
df = pd.read_csv(r'D:\data sets\spam.csv')

In [3]:
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
3495,ham,Happy birthday... May u find ur prince charmin...,,,
3496,ham,"Oh, the grand is having a bit of a party but i...",,,
3497,ham,You said to me before i went back to bed that ...,,,
3498,ham,I hope you arnt pissed off but id would really...,,,


### Data Preprocessing 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          3500 non-null   object
 1   v2          3500 non-null   object
 2   Unnamed: 2  28 non-null     object
 3   Unnamed: 3  7 non-null      object
 4   Unnamed: 4  3 non-null      object
dtypes: object(5)
memory usage: 136.8+ KB


In [5]:
df.rename(columns={'v1': 'spam_ham', 'v2': 'email'}, inplace=True)


In [6]:
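import re

def clean_text(text):
    # Drop everything except ASCII letters and whitespace (digits and punctuation go too).
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse repeated whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df['spam_ham'] = df['spam_ham'].apply(clean_text)
df['email'] = df['email'].apply(clean_text)
print(df)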


     spam_ham                                              email Unnamed: 2  \
0         ham  Go until jurong point crazy Available only in ...        NaN   
1         ham                            Ok lar Joking wif u oni        NaN   
2        spam  Free entry in a wkly comp to win FA Cup final ...        NaN   
3         ham        U dun say so early hor U c already then say        NaN   
4         ham  Nah I dont think he goes to usf he lives aroun...        NaN   
...       ...                                                ...        ...   
3495      ham  Happy birthday May u find ur prince charming s...        NaN   
3496      ham  Oh the grand is having a bit of a party but it...        NaN   
3497      ham  You said to me before i went back to bed that ...        NaN   
3498      ham  I hope you arnt pissed off but id would really...        NaN   
3499     spam  Dorothykiefercom Bank of Granite issues Strong...        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN       
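To make the effect of `clean_text` concrete, here is a minimal sketch that runs the same function on a made-up message (the sample string is an assumption for illustration, not a row from the dataset):

```python
import re

def clean_text(text):
    # Keep only ASCII letters and whitespace; digits and punctuation are removed.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse the whitespace runs left behind by the removal.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical sample message, not taken from spam.csv.
sample = "WIN a FREE prize!!! Call 0800-123 now."
cleaned = clean_text(sample)
print(cleaned)  # WIN a FREE prize Call now
```

Note that the phone number disappears entirely. Digits can be a useful spam signal, so keeping them (for example with `[^a-zA-Z0-9\s]`) may be worth trying.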

In [7]:
df = df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'])

In [8]:
df

Unnamed: 0,spam_ham,email
0,ham,Go until jurong point crazy Available only in ...
1,ham,Ok lar Joking wif u oni
2,spam,Free entry in a wkly comp to win FA Cup final ...
3,ham,U dun say so early hor U c already then say
4,ham,Nah I dont think he goes to usf he lives aroun...
...,...,...
3495,ham,Happy birthday May u find ur prince charming s...
3496,ham,Oh the grand is having a bit of a party but it...
3497,ham,You said to me before i went back to bed that ...
3498,ham,I hope you arnt pissed off but id would really...


In [9]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
# Convert all text to lowercase for uniformity.
df['email'] = df['email'].str.lower()


In [13]:
# Tokenization: split the text into word tokens.
# nltk.word_tokenize requires the NLTK Punkt tokenizer data to be downloaded beforehand.
df['email'] = df['email'].apply(nltk.word_tokenize)


In [14]:
# Removing stopwords: filter out common words that add little meaning.
# The NLTK resource is named 'stopwords'; 'stop_words' is not in the NLTK index.
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['email'] = df['email'].apply(lambda x: [word for word in x if word not in stop_words])


In [17]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df['email'] = df['email'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\deshp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [18]:
df

Unnamed: 0,spam_ham,email
0,ham,"[go, jurong, point, crazy, available, bugis, n..."
1,ham,"[ok, lar, joking, wif, u, oni]"
2,spam,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,ham,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"[nah, dont, think, go, usf, life, around, though]"
...,...,...
3495,ham,"[happy, birthday, may, u, find, ur, prince, ch..."
3496,ham,"[oh, grand, bit, party, doesnt, mention, cover..."
3497,ham,"[said, went, back, bed, cant, sleep, anything]"
3498,ham,"[hope, arnt, pissed, id, would, really, like, ..."
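The next output shows `email` as plain strings again, so the token lists were presumably joined back together; the cell that did this is not included. A minimal sketch of that step, assuming a simple space join and using a toy frame in place of the notebook's `df`:

```python
import pandas as pd

# Toy frame standing in for the notebook's df after lemmatization.
toy = pd.DataFrame({
    'spam_ham': ['ham', 'ham'],
    'email': [['ok', 'lar', 'joking', 'wif', 'u', 'oni'],
              ['said', 'went', 'back', 'bed', 'cant', 'sleep', 'anything']],
})

# Join each token list into one space-separated string.
toy['email'] = toy['email'].apply(' '.join)
print(toy['email'].tolist())
```

`TfidfVectorizer` expects strings by default, which is why the tokens need to be joined before vectorization.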


In [20]:
df

Unnamed: 0,spam_ham,email
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry wkly comp win fa cup final tkts st ...
3,ham,u dun say early hor u c already say
4,ham,nah dont think go usf life around though
...,...,...
3495,ham,happy birthday may u find ur prince charming s...
3496,ham,oh grand bit party doesnt mention cover charge...
3497,ham,said went back bed cant sleep anything
3498,ham,hope arnt pissed id would really like see tomo...


### Splitting data into x and y

In [33]:
x = df.iloc[:,-1].values
y = df.iloc[:,:-1].values

In [34]:
x

array([list([101, 101, 2175, 18414, 17583, 2391, 4689, 2800, 11829, 2483, 1050, 2307, 2088, 2474, 1041, 28305, 25022, 2638, 2288, 26297, 28194, 102, 102]),
       list([101, 101, 7929, 2474, 2099, 16644, 15536, 2546, 1057, 2006, 2072, 102, 102]),
       list([101, 101, 2489, 4443, 1059, 2243, 2135, 4012, 2361, 2663, 6904, 2452, 2345, 1056, 25509, 2015, 2358, 2089, 3793, 6904, 4374, 4443, 3980, 2102, 2094, 19067, 2102, 3446, 13535, 2015, 6611, 2058, 102, 102]),
       ...,
       list([101, 101, 2056, 2253, 2067, 2793, 2064, 2102, 3637, 2505, 102, 102]),
       list([101, 101, 3246, 12098, 3372, 9421, 8909, 2052, 2428, 2066, 2156, 4826, 2293, 22038, 20348, 20348, 20348, 20348, 20348, 20348, 102, 102]),
       list([101, 101, 9984, 11602, 7512, 9006, 2924, 9753, 3277, 2844, 8569, 2100, 11355, 4060, 2266, 17235, 2850, 4160, 6454, 3729, 13512, 2566, 102, 102])],
      dtype=object)

In [35]:
y

array([['ham'],
       ['ham'],
       ['spam'],
       ...,
       ['ham'],
       ['ham'],
       ['spam']], dtype=object)

In [36]:
df.isnull().sum()

spam_ham    0
email       0
dtype: int64

### Data Encoding

In [37]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# y has shape (n, 1); flatten it so LabelEncoder receives the 1-D array it expects.
y = le.fit_transform(y.ravel())


In [38]:
len(y)

3500

In [39]:
y

array([0, 0, 1, ..., 0, 0, 1])
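To see which integer corresponds to which label, the fitted encoder can be inspected. The sketch below uses a small toy label array (an assumption for illustration) rather than the notebook's `y`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the spam_ham column.
labels = np.array(['ham', 'ham', 'spam', 'ham', 'spam'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so index 0 is 'ham' and index 1 is 'spam'.
print(le.classes_)
print(encoded)
# inverse_transform maps integers back to the original strings.
print(le.inverse_transform(encoded))
```

Because the classes are sorted alphabetically, `0` means ham and `1` means spam, which matches the encoded array shown above.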

### TF-IDF Vectorization

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(x)

In [30]:
tfidf_matrix 

<3500x6088 sparse matrix of type '<class 'numpy.float64'>'
	with 28503 stored elements in Compressed Sparse Row format>
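As a rough illustration of what the sparse matrix contains, the sketch below vectorizes three toy messages (assumed for illustration) with the same default `TfidfVectorizer` settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the cleaned email strings.
docs = ["free entry win cup",
        "ok lar joking wif u oni",
        "free prize call now"]

vec = TfidfVectorizer()
m = vec.fit_transform(docs)

# One row per document, one column per vocabulary term.
print(m.shape)
# The learned vocabulary; the default token pattern ignores one-character tokens.
print(sorted(vec.vocabulary_))
```

With the default token pattern, single-character tokens such as `u` never enter the vocabulary, so the abbreviations common in these messages are silently dropped unless `token_pattern` is changed.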

In [40]:
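# BERT tokenization (exploratory): this replaces df['email'] with BERT token ids.
# The SVC model below is trained on tfidf_matrix, not on these ids.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
df['email'] = df['email'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))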

In [None]:
x = df.iloc[:,-1].values

In [41]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size=0.2, random_state=42)

In [42]:
x_train

<2800x6088 sparse matrix of type '<class 'numpy.float64'>'
	with 22954 stored elements in Compressed Sparse Row format>

### Training the Model using the SVC Algorithm

In [43]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(x_train,y_train)

SVC(kernel='linear', random_state=0)
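In this notebook the TF-IDF vectorizer is fitted on all 3,500 messages before the train/test split, so the IDF weights are computed with the test messages included. One possible alternative, sketched here on a toy corpus (the texts, labels, and variable names are assumptions for illustration), is to split the raw text first and wrap the vectorizer and classifier in a pipeline so that the vectorizer is fitted on the training portion only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the cleaned emails and encoded labels (1 = spam).
texts = ["free prize call now", "win free entry cup",
         "free cash prize claim now", "call now win cash",
         "ok lar joking wif u", "see you at lunch",
         "are you coming home", "happy birthday to you"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits TF-IDF on the training text only, then trains the SVC.
model = make_pipeline(TfidfVectorizer(), SVC(kernel='linear', random_state=0))
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(preds)
```

The same `model` object can then be applied directly to new raw strings, since the vectorization step travels with the classifier.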

In [44]:
y_pred = classifier.predict(x_test)


In [48]:
# Side-by-side view: predicted label (left column) vs. true label (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


### Confusion Matrix

In [49]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[613   1]
 [ 15  71]]


### Accuracy score

In [50]:
accuracy_score(y_test,y_pred)

0.9771428571428571
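Accuracy alone hides how the errors are split between the two classes. The sketch below derives precision, recall, and F1 for the spam class directly from the confusion matrix printed above, using scikit-learn's layout (rows are true labels, columns are predictions):

```python
# Confusion matrix from the notebook: rows = true (ham, spam), cols = predicted.
cm = [[613, 1],
      [15, 71]]

tn, fp = cm[0]   # ham correctly kept, ham flagged as spam
fn, tp = cm[1]   # spam missed, spam caught

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9771 precision=0.9861 recall=0.8256 f1=0.8987
```

Since ham makes up 614 of the 700 test messages, a high accuracy is expected even for a weak spam detector; the recall of about 0.83 is the more informative figure for the spam class.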

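### Conclusion

The project set out to build a spam filtering system that enhances email security, boosts user productivity, and reduces exposure to malicious content. On the 700-message test set, the linear SVC trained on TF-IDF features reaches an accuracy of about 97.7% (684 messages classified correctly). The confusion matrix shows where the remaining errors fall: only 1 legitimate message was flagged as spam, while 15 of the 86 spam messages were missed, so the model is more reliable at preserving legitimate communications than at catching every unwanted email.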