Spam Mail Prediction using Machine Learning with Python:
==

Generally, a mail can be classified as one of two kinds:
Spam mail: Free entry in 2 a wkly comp to win FA cup final tkts 21st May.
Ham mail: Pls go ahead with watts. I just wanted to be sure.

Workflow:
==
Mail Data -> Data Preprocessing -> Train Test Split -> Logistic Regression Model (binary classification problem) -> Trained 
Logistic Regression Model (a new mail is fed into the trained model) -> Spam or Ham (prediction) 
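
The workflow above can be sketched end to end on a handful of messages. This is a minimal sketch, not the project's actual code: the toy messages are invented for illustration, and it chains the steps with scikit-learn's make_pipeline, which the notebook itself does not use. The 0 = spam, 1 = ham encoding is the one used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration; 0 = spam, 1 = ham
toy_messages = [
    "Free entry to win cash prize now",
    "Win a free ticket, claim your prize",
    "Free prize waiting, call now to claim",
    "Are we still meeting for lunch today",
    "Pls go ahead, I just wanted to be sure",
    "Ok see you at home later",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# Feature extraction and the classifier chained into one estimator
workflow = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
workflow.fit(toy_messages, toy_labels)

# A new mail goes through the same vectorizer before it reaches the model
new_mail = ["Claim your free prize now"]
toy_prediction = workflow.predict(new_mail)
print("Spam mail" if toy_prediction[0] == 0 else "Ham mail")
```

Chaining the steps this way keeps the vectorizer and the model together, so a new mail cannot accidentally skip the feature extraction step.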

The dataset for this project can be downloaded from the Google Drive link or from Kaggle.

In [1]:
#Importing the dependencies 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score

Data Collection & Pre-processing:
==


In [2]:
#loading the data from the csv file to the pandas dataframe
raw_mail_data = pd.read_csv(r"C:\Users\17347\Sapna's Projects\Spam Mail Prediction\mail_Data.csv")
print(raw_mail_data.head())

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [3]:
print(raw_mail_data.shape)

(5572, 2)


In [4]:
# replace any null values with an empty string
mail_data = raw_mail_data.fillna('')
print(mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


Label Encoding:
==
0 -> Spam mail
1 -> Ham (non-spam) mail


In [5]:
# encode the labels: spam -> 0, ham -> 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1
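
As an alternative sketch, the same encoding can be written with Series.map and a dictionary. The small frame below is a stand-in invented for illustration (the project loads mail_Data.csv instead), and label_map is a hypothetical name. One practical difference is that map produces an integer column directly when every label is found in the dictionary, whereas the .loc assignments leave the column with the object dtype.

```python
import pandas as pd

# Small stand-in frame; the real project loads mail_Data.csv instead
toy_mail = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar...", "Free entry...", "Rofl. Its true to its name"],
})

# Same encoding as above: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy_mail["Category"] = toy_mail["Category"].map(label_map)

print(toy_mail["Category"].tolist())  # [1, 0, 1]
print(toy_mail["Category"].dtype)
```

Any label missing from the dictionary would become NaN, so this form also makes unexpected category values easy to spot.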

In [6]:
# separating the data into texts (X) and labels (Y)
X = mail_data['Message']
Y = mail_data['Category']
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [7]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


Train Test Split:
==
Splitting the dataset into training data and test data.


In [8]:
# hold out 20% of the messages for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)


In [9]:
print(X.shape)
print(x_train.shape)
print(x_test.shape)

(5572,)
(4457,)
(1115,)
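
When one class is much rarer than the other, a plain random split can leave the training and test sets with different spam ratios. The sketch below shows the stratify option of train_test_split, which the split above does not use, on invented toy data with 4 spam and 16 ham labels. All the toy_ and t-prefixed names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Invented toy data: 4 spam (0) and 16 ham (1) messages
toy_X = [f"message {i}" for i in range(20)]
toy_Y = [0] * 4 + [1] * 16

# stratify keeps the spam/ham ratio the same in both parts of the split
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_X, toy_Y, test_size=0.25, random_state=2, stratify=toy_Y
)

print(len(tx_train), len(tx_test))           # 15 5
print(ty_train.count(0), ty_test.count(0))   # 3 1
```

Both parts keep the 1-in-5 spam ratio of the full toy set.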


Feature Extraction:
==
Transform the text data into feature vectors (numeric values) that can be used as input to the Logistic Regression model.

TfidfVectorizer goes through every word in the training messages and assigns each word in each message a TF-IDF score. The score is higher when a word occurs often in that message and lower when the word occurs in many messages across the dataset. Spam mails tend to contain words like free, offer and discount, so these words get distinctive scores.
The model then uses these scores to learn which mails are spam and which are ham.

min_df -> the minimum number of messages (the document frequency) in which a word must appear to be kept in the vocabulary. It is a threshold on how many messages contain the word, not on the word's score. With min_df = 1, as used here, every word that appears in at least one message is kept; a larger value such as 2 would drop words that appear in only one message.

stop_words -> stop words are very common words such as "the", "is" and "and" that carry little information for classification. With stop_words = 'english', the words in scikit-learn's built-in English stop word list are removed from the text.

lowercase = True -> all letters are converted to lowercase before the text is split into words.

fit_transform -> learns the vocabulary and the IDF weights from the data (fit) and then converts that data into numeric feature vectors (transform). It is called only on the training data; the test data goes through transform alone, so nothing is learned from it.
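
The effect of these parameters can be seen on a few invented messages. This is a minimal sketch: toy_train, vec_all and vec_common are hypothetical names, and min_df=2 appears only for contrast with the min_df=1 used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented training messages for illustration
toy_train = [
    "Free offer for you",
    "Free discount offer today",
    "See you at the office today",
]

# min_df=1 keeps every non-stop word that appears in at least one message
vec_all = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
vec_all.fit(toy_train)
print(sorted(vec_all.vocabulary_))

# min_df=2 drops words that appear in only one message, such as "discount"
vec_common = TfidfVectorizer(min_df=2, stop_words="english", lowercase=True)
vec_common.fit(toy_train)
print(sorted(vec_common.vocabulary_))

# transform reuses the fitted vocabulary: words never seen during fit are ignored
unseen = vec_all.transform(["Brand new free voucher"])
print(unseen.shape, unseen.nnz)
```

Stop words such as "the" and "for" never enter either vocabulary, and "Free" is matched as "free" because of the lowercasing. In the unseen message only "free" is known to the fitted vectorizer, so the resulting vector has a single non-zero entry.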


In [10]:
# lowercase must be the boolean True, not the string 'True'
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(x_train)
# the test data is only transformed, never fitted, so the vocabulary comes from the training data alone
x_test_features = feature_extraction.transform(x_test)

# convert y_train and y_test to integers, since they are stored with the object dtype
y_train = y_train.astype('int')
y_test = y_test.astype('int')

In [11]:
print(x_train_features)

  (0, 4334)	0.42941702167641554
  (0, 3958)	0.6161071828926097
  (0, 6586)	0.44333254982109394
  (0, 6927)	0.48935591439341625
  (1, 2121)	0.3573617143022146
  (1, 1428)	0.5869421390016223
  (1, 6971)	0.42812434651556874
  (1, 3168)	0.5869421390016223
  (2, 5115)	0.3408491178137899
  (2, 7353)	0.31988118061968496
  (2, 3852)	0.3408491178137899
  (2, 4884)	0.35749230587184955
  (2, 5695)	0.35749230587184955
  (2, 806)	0.26730249393705324
  (2, 5894)	0.35749230587184955
  (2, 1876)	0.28751725124107325
  (2, 6878)	0.35749230587184955
  (3, 197)	0.36522237107066735
  (3, 3723)	0.16297045459835785
  (3, 2435)	0.26698378141852
  (3, 1825)	0.26858331513730566
  (3, 5231)	0.2266831802864503
  (3, 300)	0.2915969875465198
  (3, 7248)	0.23571908490908416
  (3, 5005)	0.3169028431039865
  :	:
  (4454, 2244)	0.2526916142542512
  (4454, 666)	0.28653660324238944
  (4454, 1575)	0.20946314330145205
  (4454, 1094)	0.24862733340971144
  (4454, 5068)	0.22284357632450164
  (4454, 311)	0.19547195974237946
  

Training the model : Logistic Regression 
==


In [12]:
model = LogisticRegression()

In [13]:
# training the Logistic Regression model with the training data 
model.fit(x_train_features,y_train)

LogisticRegression()

Evaluating the trained model:
==


In [14]:
#prediction on training data
prediction_on_training_data = model.predict(x_train_features)
accuracy_on_training_data = accuracy_score(y_train, prediction_on_training_data)

In [15]:
print('Accuracy on training data:',accuracy_on_training_data)

Accuracy on training data: 0.9683643706529056


In [16]:
#prediction on test data
prediction_on_test_data = model.predict(x_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
print('Accuracy on test data:',accuracy_on_test_data)

Accuracy on test data: 0.9524663677130045
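
Accuracy alone can hide how well the rarer class is handled. The sketch below uses hypothetical labels and predictions, not results from this project, to show how a confusion matrix and the precision and recall for the spam class (label 0) could be computed with scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, not results from this project; 0 = spam, 1 = ham
toy_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# rows are the true labels, columns the predicted labels, both in the order [spam, ham]
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
# recall: share of real spam that was caught; precision: share of spam predictions that were right
spam_recall = recall_score(toy_true, toy_pred, pos_label=0)
spam_precision = precision_score(toy_true, toy_pred, pos_label=0)

print(cm)
print('Spam recall:', spam_recall)
print('Spam precision:', spam_precision)
```

In this toy case 8 of 10 predictions are correct, yet only one of the three spam messages is caught, which the accuracy figure by itself would not reveal.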


Building a Predictive System:
==

In [17]:
input_mail = ["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times"]

# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# making prediction

prediction = model.predict(input_data_features)
print(prediction)


if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

[1]
Ham mail
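
The prediction steps can also be wrapped in a reusable function. This is a sketch under stated assumptions: classify_mail is a hypothetical helper, the tiny training set is invented so that the block runs on its own, and the spam probability from predict_proba is an addition that the notebook does not compute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_mail(text, vectorizer, classifier):
    """Hypothetical helper: return (label, spam probability) for one message."""
    features = vectorizer.transform([text])
    # predict_proba columns follow classifier.classes_, so look up the column for label 0 (spam)
    spam_column = list(classifier.classes_).index(0)
    spam_probability = float(classifier.predict_proba(features)[0, spam_column])
    label = 'Ham mail' if classifier.predict(features)[0] == 1 else 'Spam mail'
    return label, spam_probability


# Tiny invented training set so the sketch runs on its own; 0 = spam, 1 = ham
toy_texts = [
    "Free entry to win a prize",
    "Claim your free cash prize today",
    "Thanks for your help yesterday",
    "Will you be home for dinner",
]
toy_targets = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_targets)

label, spam_probability = classify_mail("You have been wonderful, thanks", toy_vectorizer, toy_model)
print(label, round(spam_probability, 3))
```

With the objects from this project, the call would take feature_extraction and model in place of the toy vectorizer and toy model.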
