**Oasis Data Science Internship**

**Aditya Rajesh Sakhadeo**

**Task 2 - Email Spam Detection**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

**Importing the dataset**

In [None]:
df = pd.read_csv("spam.csv",encoding="ISO-8859-1")

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Category    5572 non-null   object
 1   Message     5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [None]:
df.head()

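|   | Category | Message | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----------|---------|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |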


**Removing the last three, mostly empty, columns**

In [None]:
df.drop(df.columns[[2,3,4]], axis=1, inplace=True)
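
Dropping by position works here, but it depends on the column order staying the same. As an alternative sketch, the unnamed columns can be selected by name instead. The small `toy_df` below is made up for illustration and is not the real dataset:

```python
import pandas as pd

# Hypothetical miniature of the loaded frame, with two stray unnamed columns.
toy_df = pd.DataFrame(
    {
        "Category": ["ham", "spam"],
        "Message": ["Ok lar...", "Free entry..."],
        "Unnamed: 2": [None, None],
        "Unnamed: 3": [None, None],
    }
)

# Keep every column whose name does not start with "Unnamed".
keep_mask = ~toy_df.columns.str.startswith("Unnamed")
cleaned_df = toy_df.loc[:, keep_mask]

print(list(cleaned_df.columns))
```

Selecting by name keeps working if the CSV gains or loses a trailing column.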

In [None]:
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Separating features and labels**

In [None]:
features = df['Message']
labels = df['Category']

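**Encoding labels as 0 (ham) and 1 (spam)**

**The labels are encoded because the models cannot process categorical string values directly; `LabelEncoder` assigns the codes in alphabetical order of the class names**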



In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
labels= label_encoder.fit_transform(labels)
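
`LabelEncoder` assigns integer codes in sorted order of the class names, so it is worth checking the mapping rather than assuming it. A small sketch on a made-up list of labels (`demo_encoder` and `label_mapping` are illustrative names, not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Fit a separate encoder on a tiny, made-up label list.
demo_encoder = LabelEncoder()
demo_codes = demo_encoder.fit_transform(["ham", "spam", "ham"])

# classes_ holds the original labels; the position of each label is its code.
label_mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(label_mapping)  # {'ham': 0, 'spam': 1}
```

The same check on the notebook's own `label_encoder.classes_` shows which code stands for spam.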

**Extracting TF-IDF features from the message text**

**This also converts the text into a numeric matrix for further processing**

In [None]:
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
features_encoded = feature_extraction.fit_transform(features)
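
To make the vectorizer's output concrete, the sketch below applies the same settings to three made-up messages. `toy_messages` and the other names are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up messages standing in for the real dataset.
toy_messages = [
    "Free entry to win tickets",
    "Are you free to talk tonight",
    "Win a free prize today",
]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
toy_matrix = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix, one row per message

# The learned vocabulary: lowercased tokens with English stop words removed.
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)

# With the default norm="l2", every non-empty row has unit length.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

print(toy_vocabulary)
print(toy_matrix.shape)
```

Each column corresponds to one vocabulary term. Terms that appear in many messages, such as "free" here, receive a lower inverse-document-frequency weight than rarer terms.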

**Splitting the training and testing data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_encoded, labels, test_size=0.2, random_state=42)
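
In the cells above, the vectorizer is fitted on every message before the split, so document frequencies from the test messages influence the training features. One common way to avoid this kind of leakage is to split the raw text first and fit the vectorizer on the training portion only, for instance through a `Pipeline`. The sketch below uses made-up messages and hypothetical names (`toy_messages`, `leak_free_model`). The `stratify` argument is an optional extra that keeps the spam/ham ratio the same in both parts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up data: 1 = spam, 0 = ham.
toy_messages = [
    "Win a free prize now", "Free entry to cup final", "Claim your cash reward",
    "Urgent free tickets waiting", "See you at lunch", "Running late, sorry",
    "Can we talk tonight", "Happy birthday my friend",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test messages stay unseen.
train_text, test_text, train_y, test_y = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
leak_free_model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
leak_free_model.fit(train_text, train_y)

# At prediction time the fitted vocabulary is reused on the test text.
toy_predictions = leak_free_model.predict(test_text)
print(list(zip(test_text, toy_predictions)))
```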

**Instantiating the models**

In [None]:
logic = LogisticRegression()

In [None]:
d_tree = DecisionTreeClassifier()

In [None]:
r_forest = RandomForestClassifier()

**Training these models**

In [None]:
logic.fit(X_train,y_train)

In [None]:
d_tree.fit(X_train,y_train)

In [None]:
r_forest.fit(X_train,y_train)

**Generating predictions on the held-out test set**

In [None]:
y_pred_one = logic.predict(X_test)

In [None]:
y_pred_two = d_tree.predict(X_test)

In [None]:
y_pred_three = r_forest.predict(X_test)
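
The instantiate, fit and predict cells repeat the same steps for each model, so they could be collapsed into one loop over a dictionary of estimators. A minimal sketch on made-up data, with hypothetical names such as `candidate_models` and `toy_scores`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test messages: 1 = spam, 0 = ham.
toy_train = ["Win a free prize", "Free cash reward now", "Lunch at noon", "Call me after work"]
toy_train_labels = [1, 1, 0, 0]
toy_test = ["Free prize reward", "See you after lunch"]
toy_test_labels = [1, 0]

toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy_train = toy_vectorizer.fit_transform(toy_train)
X_toy_test = toy_vectorizer.transform(toy_test)

candidate_models = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Fit each model and record its accuracy on the toy test messages.
toy_scores = {}
for model_name, model in candidate_models.items():
    model.fit(X_toy_train, toy_train_labels)
    toy_scores[model_name] = accuracy_score(toy_test_labels, model.predict(X_toy_test))

print(toy_scores)
```

Fixing `random_state` on the tree-based models makes repeated runs comparable.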

**Checking the performance of the models using two metrics: accuracy score and confusion matrix**

In [None]:
accuracy_one = accuracy_score(y_test, y_pred_one)
conf_one = confusion_matrix(y_test, y_pred_one)

In [None]:
accuracy_two = accuracy_score(y_test, y_pred_two)
conf_two = confusion_matrix(y_test, y_pred_two)

In [None]:
accuracy_three = accuracy_score(y_test, y_pred_three)
conf_three = confusion_matrix(y_test, y_pred_three)
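
When one class is much more frequent than the other, as is often the case with spam data, accuracy alone can look good even if many spam messages are missed. Precision and recall for the spam class can be read off the confusion matrix. The sketch below assumes ham is encoded as 0 and spam as 1, so scikit-learn's matrix layout is `[[TN, FP], [FN, TP]]`. The numbers in `example_matrix` are made up and are not results from this notebook:

```python
def spam_metrics(conf_matrix):
    """Return precision, recall and F1 for the positive (spam) class.

    Expects a 2x2 matrix laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
example_matrix = [[960, 5], [25, 125]]
example_precision, example_recall, example_f1 = spam_metrics(example_matrix)
print(f"precision={example_precision:.3f} recall={example_recall:.3f} f1={example_f1:.3f}")
```

The function also accepts the NumPy arrays returned by `confusion_matrix`, for example `spam_metrics(conf_three)`.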

**Function that appends the performance of each model to a text file**

In [None]:
def write_scores(classifier, accuracy, conf_matrix):
    # Parameter names avoid shadowing sklearn's accuracy_score; the third
    # argument is a confusion matrix, not an F1 score.
    # The with-block closes the file even if a write fails.
    with open("Classifier_score.txt", "a+") as f:
        f.write(str(classifier) + "\n")
        f.write("Accuracy Score: " + str(accuracy) + "\n")
        f.write("confusion matrix: \n" + str(conf_matrix) + "\n")

In [None]:
write_scores(logic,accuracy_one,conf_one)
write_scores(d_tree,accuracy_two,conf_two)
write_scores(r_forest,accuracy_three,conf_three)

**Using the trained model**

**Creating two sample messages**

**The first message is spam and the second is ham**

In [None]:
input_mail_1 = "Free entry in 2 a wkly comp to watch WC Cup final tkts 21st August 2023. Text WC to 13234 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
input_mail_2 = "Would really appreciate if you call me. Just need someone to talk to."

**The text must be converted into features with the same fitted vectorizer that was used for the training set**

**The predicted code is then inverse-transformed to recover the original label, which shows whether the mail is spam or ham**

In [None]:
input_mail_fe = feature_extraction.transform([input_mail_2])
predicted_label = r_forest.predict(input_mail_fe)
predicted_label = label_encoder.inverse_transform(predicted_label)

**Obtaining the output**

In [None]:
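accuracy_percent = float(accuracy_three) * 100
# The percentage is the model's accuracy on the test split, not a confidence for this email.
print(f"Predicted label: {predicted_label[0]} (random forest test accuracy: {accuracy_percent:.2f} %)")

Predicted label: ham (random forest test accuracy: 97.58 %)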
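
The percentage printed above is the random forest's accuracy on the test split, not a confidence for this particular message. A per-message estimate can be obtained from `predict_proba`, which returns one probability per class in the order given by `classes_`. The sketch below trains a small forest on made-up messages. All names (`toy_forest`, `probability_by_label`) are hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training messages with string labels.
toy_messages = [
    "Win a free prize now", "Free entry text WIN", "Claim your cash reward",
    "See you at lunch", "Running late, sorry", "Can we talk tonight",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_forest = RandomForestClassifier(n_estimators=50, random_state=42)
toy_forest.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# Transform the new message with the already fitted vectorizer.
new_message = "Free prize, text WIN now"
class_probabilities = toy_forest.predict_proba(toy_vectorizer.transform([new_message]))[0]

# Pair each class label with its estimated probability.
probability_by_label = {
    str(label): float(probability)
    for label, probability in zip(toy_forest.classes_, class_probabilities)
}
print(probability_by_label)
```

For a random forest these values are averages of the trees' class-probability estimates, so they are rough rather than calibrated confidences.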
