### **Spam Email Prediction Using Random Forest**


## Understand the data set

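The email data contains four columns:


1.  ID
1.  Mail (the class as text, for example `ham`)
1.  Text (the email content, starting with the subject line)
1.  Label (the class as an integer; `ham` rows carry 0)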

## Import Library

In [34]:
import pandas as pd
import numpy as np


## Import CSV as Dataframe

Use the URL of the file directly

In [35]:
df=pd.read_csv(r'https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Spam%20Email.csv')

Or use the file path after uploading the file to a Google Colab notebook

In [36]:
# df = pd.read_csv(r'/content/Spam Email.csv')

Or use a local file path in a Jupyter notebook

In [37]:
# df = pd.read_csv(r'C:\Users\abc\Downloads\Spam Email.csv')

## Get first 5 rows of dataframe

In [38]:
df.head()

Unnamed: 0,ID,Mail,Text,Label
0,1,ham,Subject: christmas tree farm pictures\r\n,0
1,2,ham,"Subject: vastar resources , inc .\r\ngary , pr...",0
2,3,ham,Subject: calpine daily gas nomination\r\n- cal...,0
3,4,ham,Subject: re : issue\r\nfyi - see note below - ...,0
4,5,ham,Subject: meter 7268 nov allocation\r\nfyi .\r\...,0


## Get information of Dataframe

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      5171 non-null   int64 
 1   Mail    5171 non-null   object
 2   Text    5171 non-null   object
 3   Label   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


## Get column names

In [40]:
df.columns

Index(['ID', 'Mail', 'Text', 'Label'], dtype='object')

## Get shape of Dataframe

In [41]:
df.shape

(5171, 4)

## Define y (dependent, label, or target variable) and X (independent, feature, or attribute variable)

In [42]:
y=df['Label']


In [43]:
y.shape

(5171,)

In [44]:
y

0       0
1       0
2       0
3       0
4       0
       ..
5166    1
5167    1
5168    1
5169    1
5170    1
Name: Label, Length: 5171, dtype: int64

In [45]:
X=df['Text']

In [46]:
X.shape

(5171,)

In [47]:
X

0               Subject: christmas tree farm pictures\r\n
1       Subject: vastar resources , inc .\r\ngary , pr...
2       Subject: calpine daily gas nomination\r\n- cal...
3       Subject: re : issue\r\nfyi - see note below - ...
4       Subject: meter 7268 nov allocation\r\nfyi .\r\...
                              ...                        
5166    Subject: our pro - forma invoice attached\r\nd...
5167    Subject: str _ rndlen ( 2 - 4 ) } { extra _ ti...
5168    Subject: check me out !\r\n61 bb\r\nhey derm\r...
5169    Subject: hot jobs\r\nglobal marketing specialt...
5170    Subject: save up to 89 % on ink + no shipping ...
Name: Text, Length: 5171, dtype: object

## Get Train Test split

In [48]:
from sklearn.model_selection import train_test_split

In [49]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,stratify=y,random_state=2529)

In [50]:

X_train.shape,X_test.shape,y_train.shape,y_test.shape

((3619,), (1552,), (3619,), (1552,))
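The shapes above can be checked by hand. The sketch below assumes that, when only `test_size` is given as a fraction, scikit-learn rounds the test count up and assigns the remaining rows to the training set; the variable names are illustrative.

```python
import math

n_samples = 5171   # number of rows, taken from df.shape
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```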

## Get X variable Feature Extraction

The `sklearn.feature_extraction` module extracts features from datasets in formats such as text and images, in a form supported by machine learning algorithms.

Raw data, a sequence of symbols, cannot be fed directly to the algorithms, because most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length.

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. Describing each document by the occurrences of its words is called the "Bag of Words" or "Bag of n-grams" representation.

To re-weight the count features into floating point values suitable for use by a classifier, it is very common to apply the tf-idf transform. Tf means term frequency, while tf-idf means term frequency times inverse document frequency.
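To make the tf-idf idea concrete, here is a minimal pure-Python sketch on a made-up three-document corpus (not taken from the spam data). It assumes whitespace tokenization with no stop-word removal, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector, which is how scikit-learn describes its default settings. It is an illustration of the weighting, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration only.
docs = [
    "free prize click now",
    "meeting schedule for gas nomination",
    "free gas prize",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(tokens) for tokens in tokenized]

# Terms that occur in fewer documents receive a larger idf weight.
print(idf["click"] > idf["free"])
```

Every document becomes a vector of the same length (one entry per vocabulary term), which is the fixed-size numerical input a classifier expects.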


In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
tfidf=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)

In [53]:
X_train_features=tfidf.fit_transform(X_train)

In [54]:
X_test_features=tfidf.transform(X_test)

In [55]:
X_train

4367    Subject: time sensitive . . . refer to # f 781...
3849    Subject: homeowners - get more money in your p...
2199    Subject: calpine daily gas nomination\r\n>\r\n...
2057    Subject: holiday invitation\r\nplease click on...
3019    Subject: first deliveries - comstock oil & gas...
                              ...                        
579     Subject: new update for buybacks\r\nthere are ...
4280    Subject: get it free - ibm thinkpad computer !...
3909    Subject: buckhorn doberman\r\nhello . i did no...
2133    Subject: galleryfurniture . com bowl\r\nenron ...
4508              Subject: want a new playstation 2 ?\r\n
Name: Text, Length: 3619, dtype: object

In [56]:
X_train_features

<3619x40400 sparse matrix of type '<class 'numpy.float64'>'
	with 237060 stored elements in Compressed Sparse Row format>
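The summary above shows why the features are stored as a sparse matrix. A quick back-of-the-envelope calculation, using only the numbers printed in that output, gives the share of non-zero cells and the average number of distinct vocabulary terms per email:

```python
n_rows, n_cols = 3619, 40400   # shape of X_train_features
stored = 237060                # stored (non-zero) elements

density = stored / (n_rows * n_cols)   # fraction of cells that are non-zero
avg_terms_per_email = stored / n_rows  # average non-zero features per row

print(f"density: {density:.4%}")
print(f"average non-zero features per email: {avg_terms_per_email:.1f}")
```

Well under one percent of the cells are non-zero, so a dense array of the same shape would mostly hold zeros.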

## Get model train

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf=RandomForestClassifier(random_state=2529)

In [57]:
rf.fit(X_train_features,y_train)

RandomForestClassifier(random_state=2529)

## Get model prediction

In [58]:
y_pred=rf.predict(X_test_features)

In [59]:
y_pred.shape

(1552,)

In [60]:
y_pred

array([0, 1, 0, ..., 1, 0, 0])

## Get probability of each predicted class

In [61]:
rf.predict_proba(X_test_features)

array([[0.99, 0.01],
       [0.1 , 0.9 ],
       [0.72, 0.28],
       ...,
       [0.3 , 0.7 ],
       [0.89, 0.11],
       [0.68, 0.32]])
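Each row holds the probability of class 0 followed by the probability of class 1. For a two-class model, the predicted label corresponds to whichever column is larger, which amounts to a 0.5 cut-off on the second column. The sketch below applies that cut-off to the six rows displayed above, then a stricter one; the 0.8 threshold and the helper `labels_at` are hypothetical choices for illustration, not part of the notebook.

```python
# First three and last three rows of the predict_proba output shown above;
# the rows hidden behind "..." are not reproduced here.
proba_rows = [
    [0.99, 0.01],
    [0.10, 0.90],
    [0.72, 0.28],
    [0.30, 0.70],
    [0.89, 0.11],
    [0.68, 0.32],
]

def labels_at(threshold, rows):
    """Label a row as spam (1) when P(class 1) exceeds the threshold."""
    return [1 if p_spam > threshold else 0 for _, p_spam in rows]

default_labels = labels_at(0.5, proba_rows)
strict_labels = labels_at(0.8, proba_rows)  # hypothetical stricter cut-off

print(default_labels)
print(strict_labels)
```

Raising the threshold flags fewer emails as spam, trading missed spam for fewer legitimate emails wrongly flagged.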

## Get model evaluation

In [62]:
from sklearn.metrics import confusion_matrix,classification_report

In [63]:
print(confusion_matrix(y_test,y_pred))

[[1081   21]
 [  15  435]]


In [64]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1102
           1       0.95      0.97      0.96       450

    accuracy                           0.98      1552
   macro avg       0.97      0.97      0.97      1552
weighted avg       0.98      0.98      0.98      1552
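The figures for class 1 (spam) in the report can be recomputed from the confusion matrix printed earlier, where rows are true labels and columns are predicted labels. The following is a small worked check in plain Python; the variable names are illustrative.

```python
# Confusion matrix values: rows are true labels, columns are predictions.
tn, fp = 1081, 21   # true ham: predicted ham, predicted spam
fn, tp = 15, 435    # true spam: predicted ham, predicted spam

precision_spam = tp / (tp + fp)   # share of predicted spam that is spam
recall_spam = tp / (tp + fn)      # share of actual spam that was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_spam, 2), round(recall_spam, 2),
      round(f1_spam, 2), round(accuracy, 2))
```

In this test split, 21 legitimate emails were flagged as spam and 15 spam emails were missed.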

