# **Project Name** - Email Classification and Spam Detection

![email](email.jpg)

##### **Project Type**    - Natural Language Processing
##### **Contribution**    - Individual
##### **Member 1**        - Sreenivasulu


## Project Summary

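The system is designed to classify incoming emails in real time as either spam or ham (not spam) based on their textual content. It also allows for further categorization (e.g., work, personal, promotions) to enhance inbox management. This notebook prototypes the spam/ham step on the SMS Spam Collection dataset; the further categorization is not implemented here.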

# Step 1: Business Problem Understanding
- Classify whether a given SMS message is spam or ham (a normal message).

In [1]:
import pandas as pd
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")


In [2]:
# Load the data set 
df = pd.read_csv("SMSSpamCollection",sep="\t", names=["label","message"])
df

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


# Data Understanding

In [3]:
df["label"].unique()

array(['ham', 'spam'], dtype=object)

In [4]:
df["label"].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64

In [5]:
len(df)   # Total number of rows

5572
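The class counts above show that the data is imbalanced, which matters when reading accuracy figures later. As a rough reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The counts are copied from the `value_counts()` output; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"ham": 4825, "spam": 747}

total = sum(class_counts.values())                        # total number of rows
majority_label = max(class_counts, key=class_counts.get)  # most frequent class
baseline_accuracy = class_counts[majority_label] / total  # accuracy of "always predict majority"
spam_share = class_counts["spam"] / total                 # fraction of spam messages

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"spam share: {spam_share:.4f}")
```

Always answering "ham" would already be right for roughly 87% of the messages, so any model accuracy should be judged against that baseline rather than against 50%.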

# Text Preparation
1. Tokenization
2. Text Cleaning
3. Text Vectorization

**Text Cleaning**
- Remove special characters (everything other than letters and whitespace)
- Remove stop words
- Stemming/lemmatization

In [6]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('stopwords')   # uncomment on first run if the stop-word corpus is not installed
ps = PorterStemmer()

In [7]:
corpus = []
stop_words = set(stopwords.words('english'))             # build the stop-word set once, not once per word

for i in range(len(df)):
    s = re.sub('[^a-zA-Z]', " ", df['message'][i])       # replace everything except letters with a space
    s = s.lower()                                        # convert to lower case
    s = s.split()                                        # split into words
    s = [ps.stem(word) for word in s if word not in stop_words]   # drop stop words, stem the rest
    s = " ".join(s)                                       # group all words in sentence
    corpus.append(s)        
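The cleaning loop above is needed again later, when a new message is prepared for prediction, so it can help to express it as one reusable function. The following is a minimal sketch under stated assumptions: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked one rather than NLTK's English list, and the default `stem` leaves words unchanged instead of applying the Porter stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption); the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"i", "you", "the", "to", "for", "a", "is", "in"}

def clean_text(text, stop_words=DEMO_STOP_WORDS, stem=lambda word: word):
    """Return a cleaned, space-joined version of `text`."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)    # replace non-letters with spaces
    tokens = letters_only.lower().split()             # lowercase, then split on whitespace
    kept = [stem(tok) for tok in tokens if tok not in stop_words]  # filter, then stem
    return " ".join(kept)

print(clean_text("WINNER!! You have won a FREE ticket to the FA Cup final..."))
```

With the notebook's objects, the same function could presumably be called as `clean_text(message, set(stopwords.words('english')), ps.stem)`, so that training and prediction share exactly one cleaning path.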
             

**Text Vectorization**
- Convert text data to numerical data
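To make the conversion concrete, here is a small pure-Python sketch of the bag-of-words idea: build a vocabulary from the documents, then count how often each vocabulary token occurs in each document. It is an illustration only, not scikit-learn's implementation; `build_vocabulary` and `count_vectorize` are hypothetical helper names. `CountVectorizer` additionally applies its own tokenization (by default it keeps only tokens of two or more characters) and returns a sparse matrix.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vectorize(documents, vocabulary):
    """Return one row of token counts per document; unknown tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        # Dict iteration follows insertion order, so columns match the vocabulary indices.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

docs = ["free entri win free", "ok lar joke"]
vocab = build_vocabulary(docs)
matrix = count_vectorize(docs, vocab)

print(vocab)
print(matrix)
```

Tokens that were not seen when the vocabulary was built contribute nothing to the row, which is also why new messages must be transformed with the vocabulary learned on the training corpus.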

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv  = CountVectorizer()
x = cv.fit_transform(corpus)

In [9]:
x.shape

(5572, 6296)

In [10]:
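#y = pd.get_dummies(df["label"],drop_first=True)
# Encode labels as integers: ham -> 0, spam -> 1 (plain assignment instead of an in-place replace on a column)
df["label"] = df["label"].map({"ham": 0, "spam": 1})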

In [11]:
y = df["label"]

**Train-Test-split**

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)   # integer seed for a reproducible split
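The idea behind a seeded 80/20 split can be sketched in plain Python: shuffle the row indices with a fixed seed, then cut off the test share. This is an illustration of the concept, not scikit-learn's algorithm; `shuffle_split` is a hypothetical helper, it rounds the test share up, and it will not reproduce the exact rows that `train_test_split` selects because the random generator is different.

```python
import math
import random

def shuffle_split(n_samples, test_size=0.2, seed=1):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    n_test = math.ceil(n_samples * test_size)   # test share, rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # same seed gives the same shuffle
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(5572)
print(len(train_idx), len(test_idx))
```

Fixing the seed is what makes the reported accuracies repeatable from run to run; changing it gives a different partition and slightly different scores.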

# Modelling
### Naive Bayes Classifier with default parameters

In [13]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train,y_train)

| Parameter | Value |
|---|---|
| alpha | 1.0 |
| force_alpha | True |
| fit_prior | True |
| class_prior | None |


**Predictions**

In [14]:
ypred_train = model.predict(x_train)
ypred_test = model.predict(x_test)
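To show roughly what fitting and predicting with a multinomial Naive Bayes model involves, the sketch below implements the textbook version on toy count vectors: a log prior per class plus, for each feature, the count multiplied by a smoothed log likelihood (additive smoothing with `alpha`). The function names and the toy data are assumptions made for illustration; this is not scikit-learn's code, although it follows the same basic formula.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count rows."""
    n_features = len(rows[0])
    class_docs = Counter(labels)                              # documents per class
    feature_totals = defaultdict(lambda: [0] * n_features)    # token counts per class
    for row, label in zip(rows, labels):
        totals = feature_totals[label]
        for j, count in enumerate(row):
            totals[j] += count
    model = {}
    for label, n_docs in class_docs.items():
        totals = feature_totals[label]
        denom = sum(totals) + alpha * n_features              # smoothed normalizer
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model

def predict_multinomial_nb(model, row):
    """Return the class with the highest log posterior score for one count row."""
    scores = {
        label: log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
        for label, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)

# Toy data (made up): columns are counts of ["free", "win", "meet", "lunch"].
toy_rows = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1], [0, 1, 1, 1]]
toy_labels = [1, 1, 0, 0, 0]                                  # 1 = spam, 0 = ham
toy_model = fit_multinomial_nb(toy_rows, toy_labels)

print(predict_multinomial_nb(toy_model, [1, 1, 0, 0]))
print(predict_multinomial_nb(toy_model, [0, 0, 1, 2]))
```

Smoothing with `alpha=1.0` keeps a token that never appeared in one class from driving that class's probability to zero.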

**Evaluation**

In [15]:
from sklearn.metrics import accuracy_score
print("Train Accuracy: ",accuracy_score(y_train,ypred_train))
print("Test Accuracy: ",accuracy_score(y_test,ypred_test))


Train Accuracy:  0.9919228180390397
Test Accuracy:  0.9820627802690582
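Because ham messages greatly outnumber spam in this dataset, accuracy alone can hide how many spam messages are missed. The sketch below computes precision, recall, and F1 for the spam class from scratch; `binary_metrics` is a hypothetical helper and the labels are toy values, not the notebook's predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up): 8 ham, 2 spam, one spam message missed.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

In this toy case accuracy is 0.9 while only half of the spam is caught. For the notebook's own predictions, `sklearn.metrics.classification_report(y_test, ypred_test)` reports the same quantities.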


## Predicting on new data

In [16]:
input_mail = "I hope you're doing well. I wanted to thank you again for the opportunity to interview."
df_test = pd.DataFrame({"message":input_mail},index=[0])

In [17]:
df_test

| | message |
|---|---|
| 0 | I hope you're doing well. I wanted to thank yo... |


**Text preprocessing**

In [18]:
corpus1 = []

for i in range(len(df_test)):
    s = re.sub('[^a-zA-Z]', " ", df_test['message'][i])  # clean the new message (df_test, not the training df)
    s = s.lower()                                        # convert to lower case
    s = s.split()                                        # split into words
    s = [ps.stem(word) for word in s if word not in set(stopwords.words('english'))]
    s = " ".join(s)                                      # rejoin the words into one string
    corpus1.append(s)

In [19]:
corpus1

['hope well want thank opportun interview']

In [20]:
X_input = cv.transform(corpus1)
prediction = model.predict(X_input)

In [21]:

if prediction[0] == 0:   # predict returns an array; take the single prediction
    print("ham")
else:
    print("spam")


ham
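The prediction steps above (clean, vectorize, predict, map the number back to a label) can be wrapped in one function so that new messages always go through the same path as the training data. This is a sketch under assumptions: `classify_message` and the `demo_*` stand-ins are hypothetical and exist only so the example runs on its own; in the notebook the real cleaning function, `cv.transform`, and `model.predict` would be passed in.

```python
LABEL_NAMES = {0: "ham", 1: "spam"}   # inverse of the 0/1 encoding used for y

def classify_message(text, clean, vectorize, predict):
    """Clean one message, vectorize it, and return 'ham' or 'spam'."""
    cleaned = clean(text)                 # same cleaning as used for training
    features = vectorize([cleaned])       # vectorizers expect a list of documents
    predicted = predict(features)         # one prediction per document
    return LABEL_NAMES[int(predicted[0])]

# Hypothetical stand-ins so the sketch runs without NLTK or scikit-learn.
def demo_clean(text):
    return text.lower()

def demo_vectorize(documents):
    return [[doc.split().count("free")] for doc in documents]

def demo_predict(rows):
    return [1 if row[0] > 0 else 0 for row in rows]

print(classify_message("FREE entry to win", demo_clean, demo_vectorize, demo_predict))
print(classify_message("See you at lunch", demo_clean, demo_vectorize, demo_predict))
```

Passing the components in as arguments keeps the function independent of any particular library and makes it hard to clean the wrong DataFrame by accident.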


## 📝 Conclusion

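The Email Classification and Spam Detection system demonstrates the practical application of NLP and machine learning to a real-world communication challenge. Using a bag-of-words representation and a Multinomial Naive Bayes classifier with default parameters, the model reaches about 99.2% accuracy on the training split and about 98.2% on the held-out test split of the SMS Spam Collection data. A classifier of this kind can support real-time filtering: by identifying spam messages accurately, it improves message filtering, enhances productivity, and helps users avoid potentially harmful content.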