# Spam/Ham Configuration Model
## Objective
Train a model on a collection of emails containing both spam and ham mails, so that it can tell the two apart.

## Glossary
*   Spam mail: A mail that poses a potential hazard to the reader or manipulates them into giving up personal information.
*   Ham mail: A term for normal, legitimate emails that are not spam and do not attempt any kind of manipulation.


## Work Flow 
In this project we will use a logistic regression model, as it is a simple and well-suited choice for binary classification problems.
<br>


1. Get the mail data as a CSV file
2. Data preprocessing: clean the data and prepare it for training
3. Split the data set into train data and test data
4. Supply the training data to our logistic regression model
5. After the model is trained, supply the test data to check the accuracy of the model
<br> 



In [1]:
#importing all the dependencies for the project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Loading data from a csv file to a pandas dataframe
raw_mail_data = pd.read_csv("./mail_data.csv")

Next we clean the raw data by replacing any null values.

In [3]:
# Replace all null values with an empty string ''
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)), '')
mail_data

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [4]:
print(mail_data.isna().sum().sum())  # total number of null values in the dataframe

0


In [5]:
mail_data.shape

(5572, 2)

## Label Encoding

In [6]:
# Label spam mail as 0 and ham mail as 1

mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

In [7]:
print(mail_data)

     Category                                            Message
0           1  Go until jurong point, crazy.. Available only ...
1           1                      Ok lar... Joking wif u oni...
2           0  Free entry in 2 a wkly comp to win FA Cup fina...
3           1  U dun say so early hor... U c already then say...
4           1  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567        0  This is the 2nd time we have tried 2 contact u...
5568        1               Will ü b going to esplanade fr home?
5569        1  Pity, * was in mood for that. So...any other s...
5570        1  The guy did some bitching but I acted like i'd...
5571        1                         Rofl. Its true to its name

[5572 rows x 2 columns]
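
The same encoding can also be written with `Series.map`, which yields an integer column directly. Below is a minimal sketch on a small hypothetical frame; the three rows are made up for illustration and are not taken from `mail_data.csv`.

```python
import pandas as pd

# Hypothetical stand-in for mail_data; the real frame comes from mail_data.csv
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "win a free prize", "ok, on my way"],
})

# Map each label to its integer code: spam -> 0, ham -> 1
label_codes = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_codes)

print(toy["Category"].tolist())
```

Any label that is missing from the dictionary would turn into NaN, so it is worth checking for nulls again after mapping real data.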


Train-test split: we divide the data into two parts, one used to train the model and the other used to evaluate it.

In [8]:
# Separate the data into texts (X) and labels (Y)

X = mail_data['Message']

Y = mail_data['Category']

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
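
To make the idea behind the split concrete, here is a rough pure-Python sketch: shuffle the row indices with a fixed seed, then hold out a fraction of them for testing. The function name `simple_train_test_split` and the toy data are hypothetical, and the sketch will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(items, labels, test_size=0.2, seed=3):
    """Shuffle indices with a fixed seed and hold out the first part as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical toy data: ten mails with alternating labels
messages = [f"mail {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = simple_train_test_split(messages, labels)
print(len(x_train), len(x_test))
```

Fixing the seed plays the same role as `random_state=3` above: the split is the same on every run, so the accuracy scores are repeatable.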

Now that we have our train-test split, the next step is to convert the text into numbers that the model can work with. A TF-IDF vectorizer transforms each message into a numeric feature vector. This step is called feature extraction.

In [11]:
# transform the text data to feature vectors that can be used as input to the Logistic regression

feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test values to integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [None]:
print(X_train_features)
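
To give a feel for what these feature vectors contain, here is a simplified pure-Python sketch of TF-IDF. It splits on whitespace and skips stop-word removal, so it is not a replacement for `TfidfVectorizer`. The IDF formula used, ln((1 + n) / (1 + df)) + 1, is my understanding of scikit-learn's smoothed default and should be treated as an assumption; the function name `tfidf_vectors` and the three documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalised TF-IDF weight list per document)."""
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(word for tokens in tokenised for word in set(tokens))
    # Smoothed inverse document frequency: rarer words get a larger weight
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = [counts[word] * idf[word] for word in vocabulary]
        # Scale each vector to unit length; guard against an all-zero vector
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        vectors.append([value / norm for value in raw])
    return vocabulary, vectors

docs = ["free prize win", "free lunch today", "lunch meeting today"]
vocabulary, vectors = tfidf_vectors(docs)
first = dict(zip(vocabulary, vectors[0]))
print(first["prize"] > first["free"])
```

In the first document, "prize" gets a larger weight than "free" because "free" also appears in another document, which makes it less distinctive.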

## Train the Model
Now we supply the training feature vectors and their labels to the logistic regression model.

In [13]:
model = LogisticRegression()

In [15]:
model.fit(X_train_features, Y_train)
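
As a rough picture of what the fitted model does at prediction time: it computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and picks a label by thresholding. The sketch below uses hypothetical weights and a made-up three-word vocabulary; in the real model the weights are learned by `model.fit`.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (ham) if the estimated probability reaches the threshold, else 0 (spam)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability_of_ham = sigmoid(z)
    return 1 if probability_of_ham >= threshold else 0

# Hypothetical weights for a three-word vocabulary: ["free", "lunch", "prize"]
weights = [-2.0, 1.5, -2.5]
bias = 0.5

spam_like = predict_label([0.7, 0.0, 0.7], weights, bias)  # contains "free" and "prize"
ham_like = predict_label([0.0, 1.0, 0.0], weights, bias)   # contains only "lunch"
print(spam_like, ham_like)
```

Words with negative weights push the prediction towards 0 (spam), and words with positive weights push it towards 1 (ham), matching the label encoding used above.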

First we evaluate the model on the training data.
Here we supply only X_train_features to the model, without Y_train.
We then compare the predictions generated by the model with Y_train to check the accuracy.
We treat any score above 0.75 as good.

In [16]:
predicted_on_train_data = model.predict(X_train_features)
accuracy_on_train_data = accuracy_score(Y_train, predicted_on_train_data)
print(accuracy_on_train_data)

0.9670181736594121


In [17]:
# Now we test our model on the test data
predicted_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, predicted_on_test_data)
print(accuracy_on_test_data)

0.9659192825112107
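
The accuracy score is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on hypothetical labels, together with the accuracy of always predicting the most common label, which is a handy reference point when one class is much more frequent than the other. The helper names `accuracy` and `majority_baseline` are made up for this example.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

def majority_baseline(true_labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)

# Hypothetical labels for ten mails (0 = spam, 1 = ham)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```

A model is only doing useful work if its accuracy is clearly above this baseline, so it can be worth computing the baseline for `Y_test` before judging the scores above.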


Finally, we create a function that takes a mail as an input and tells us whether it is a spam mail or a ham mail.

The input should be short in length for better accuracy (as I dive deeper into ML models, I will improve this project based on my learning).

Algorithm for the function:
1. Take a string input and store it in an input variable
2. Convert the input to a feature vector
3. Supply the feature vector to the ML model
4. After the model predicts a value of either 0 or 1, print the message as spam or ham respectively.


In [None]:
def SpamOrHam(input_mail_by_user):
  # The vectorizer expects an iterable of documents, so wrap the single mail in a list
  input_mail = [input_mail_by_user]
  # Convert the text to a feature vector using the already fitted vectorizer
  input_mail_features = feature_extraction.transform(input_mail)
  # predict returns an array with one label per input mail
  prediction_result = model.predict(input_mail_features)
  if prediction_result[0] == 0:
    print("This mail is a spam mail")
  else:
    print("This mail is a ham mail")

x = input("Enter a mail :")
SpamOrHam(x)

Running the cell above prompts for a mail. For a reliable prediction, enter a mail either taken from the existing CSV file in the folder
or similar in wording to the mails in that file, since the vectorizer ignores words it did not see during training.
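
The reason the input needs to resemble the training mails can be seen in a small sketch: a fitted `TfidfVectorizer` only knows the words from the mails it was fitted on, so a mail made up entirely of other words becomes an all-zero feature vector. The two training mails and the two inputs below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-mail training corpus, standing in for the mails in the CSV file
training_mails = ["Free entry to win a prize", "Are we meeting for lunch today"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
vectorizer.fit(training_mails)

familiar = vectorizer.transform(["Claim your free prize"])
unfamiliar = vectorizer.transform(["Quarterly invoice attached"])

# nnz is the number of non-zero entries in each sparse feature vector
print(familiar.nnz, unfamiliar.nnz)
```

With an all-zero vector the model has nothing to go on except its bias term, so the prediction for such a mail says little about the mail itself.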