# **Spam Email Prediction using Random Forest**

# **Understanding the Dataset**

## The email data contains the following variables

1. **ID**
2. **Mail**
3. **Text** 
4. **Label**

# **Import Libraries**

In [8]:
import pandas as pd

In [9]:
import numpy as np

# **Import CSV as DataFrame**

In [10]:
df = pd.read_csv('https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Spam%20Email.csv')

# **Analyzing the data**

# **Display the First 5 Rows of the DataFrame**

In [11]:
df.head()

   ID Mail                                               Text  Label
0   1  ham          Subject: christmas tree farm pictures\r\n      0
1   2  ham  Subject: vastar resources , inc .\r\ngary , pr...      0
2   3  ham  Subject: calpine daily gas nomination\r\n- cal...      0
3   4  ham  Subject: re : issue\r\nfyi - see note below - ...      0
4   5  ham  Subject: meter 7268 nov allocation\r\nfyi .\r\...      0


# **Detailed Information of DataFrame**

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      5171 non-null   int64 
 1   Mail    5171 non-null   object
 2   Text    5171 non-null   object
 3   Label   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


# **Column Names**

In [14]:
df.columns

Index(['ID', 'Mail', 'Text', 'Label'], dtype='object')

# **Shape of the DataFrame (Total Number of Rows and Columns)**

In [15]:
df.shape

(5171, 4)

# **Define X and y**

## **X: Features (Independent Variables)**
## **y: Label (Dependent or Target Variable)**

In [16]:
X = df['Text']

In [17]:
X.shape

(5171,)

In [18]:
X

0               Subject: christmas tree farm pictures\r\n
1       Subject: vastar resources , inc .\r\ngary , pr...
2       Subject: calpine daily gas nomination\r\n- cal...
3       Subject: re : issue\r\nfyi - see note below - ...
4       Subject: meter 7268 nov allocation\r\nfyi .\r\...
                              ...                        
5166    Subject: our pro - forma invoice attached\r\nd...
5167    Subject: str _ rndlen ( 2 - 4 ) } { extra _ ti...
5168    Subject: check me out !\r\n61 bb\r\nhey derm\r...
5169    Subject: hot jobs\r\nglobal marketing specialt...
5170    Subject: save up to 89 % on ink + no shipping ...
Name: Text, Length: 5171, dtype: object

In [19]:
y = df['Label']

In [20]:
y.shape

(5171,)

In [21]:
y

0       0
1       0
2       0
3       0
4       0
       ..
5166    1
5167    1
5168    1
5169    1
5170    1
Name: Label, Length: 5171, dtype: int64

# **Split the Data into Train and Test Sets**

In [22]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=202529)

In [26]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4136,), (1035,), (4136,), (1035,))
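
The `stratify=y` argument keeps the class ratio the same in the train and test sets. The sketch below illustrates this on made-up labels; the toy data and its 70/30 ratio are assumptions for illustration, not the email dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 70 "ham" (0) and 30 "spam" (1) labels.
X_toy = [f"email {i}" for i in range(100)]
y_toy = [0] * 70 + [1] * 30

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Both splits keep the 70/30 class ratio of the full data.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Without `stratify`, a random split could leave the test set with noticeably more or less spam than the training set, which makes the evaluation harder to interpret.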

# **Feature Extraction: Convert X to TF-IDF Features**

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfid = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

In [29]:
X_train_features = tfid.fit_transform(X_train)

In [30]:
X_test_features = tfid.transform(X_test)

In [31]:
X_train

4493    Subject: send out five motor\r\natt : client i...
2517    Subject: hpl nom for february 2 , 2001\r\n( se...
3508    Subject: proposed rule extends marketing affil...
4587    Subject: randal , i can help you\r\never been ...
1798    Subject: union carbide - seadrift\r\ndaren\r\n...
                              ...                        
2045    Subject: enron / hpl noms for november 16 , 20...
2867    Subject: natural gas nomination for 04 / 01\r\...
2778    Subject: hpl nom for march 21 , 2001\r\n( see ...
3060    Subject: southern union - 03 / 01 prod - austi...
808     Subject: # 6487 rangel / dewpoint\r\nplease se...
Name: Text, Length: 4136, dtype: object

In [33]:
X_train_features

<4136x43174 sparse matrix of type '<class 'numpy.float64'>'
	with 269189 stored elements in Compressed Sparse Row format>
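
The vectorizer is fitted on the training text only and then reused on the test text, so both matrices share the same columns. A small sketch of that behaviour, using an invented mini-corpus (the documents and variable names here are assumptions, not part of the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus, for illustration only.
train_docs = ["win free money", "meeting schedule attached", "the free meeting"]
test_docs = ["free lottery money"]

vec = TfidfVectorizer(stop_words="english", lowercase=True)
train_features = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_features = vec.transform(test_docs)        # reuses them, learns nothing new

# Same number of columns in both matrices. "lottery" never appeared in
# training, so it has no column and is ignored at transform time.
print(train_features.shape, test_features.shape)
print(sorted(vec.vocabulary_))
```

Calling `fit_transform` on the test text instead would build a different vocabulary, and the model could not be applied to the resulting matrix.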

# **Train the Model**

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [35]:
rf = RandomForestClassifier(random_state=202529)

In [36]:
rf.fit(X_train_features, y_train)

RandomForestClassifier(random_state=202529)

# **Model Prediction**

In [37]:
y_pred = rf.predict(X_test_features)

In [38]:
y_pred.shape

(1035,)

In [39]:
y_pred

array([0, 1, 0, ..., 0, 0, 0])
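
To classify an email that was not in the dataset, the raw text has to pass through the same fitted vectorizer before it reaches the model. A sketch on invented training texts follows; the texts, labels, and `n_estimators=10` are assumptions chosen to keep the example small.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented examples: 1 = spam, 0 = ham.
texts = [
    "win free money today",
    "claim your free prize",
    "meeting agenda attached",
    "gas nomination for march",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(stop_words="english")
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(vectorizer.fit_transform(texts), labels)

# New raw text must be transformed (not re-fitted) before prediction.
new_email = ["free prize waiting for you"]
new_pred = model.predict(vectorizer.transform(new_email))
print(new_pred)
```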

# **Get the Predicted Probability of Each Class**

In [40]:
rf.predict_proba(X_test_features)

array([[1.  , 0.  ],
       [0.03, 0.97],
       [0.9 , 0.1 ],
       ...,
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.99, 0.01]])
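
Each row holds the estimated probability of class 0 and class 1, in the order given by `rf.classes_`. `predict` returns the class with the larger probability, and a different cut-off can be applied by hand. A sketch using the first three rows shown above (the 0.8 threshold is an arbitrary assumption):

```python
import numpy as np

# First three rows of the predict_proba output shown above.
proba = np.array([[1.00, 0.00],
                  [0.03, 0.97],
                  [0.90, 0.10]])

# Default decision: the class with the highest probability.
# The column index equals the class label here because the classes are 0 and 1.
default_labels = proba.argmax(axis=1)

# Stricter rule (hypothetical): flag as spam only if P(class 1) >= 0.8.
strict_labels = (proba[:, 1] >= 0.8).astype(int)

print(default_labels, strict_labels)
```

Raising the threshold trades missed spam for fewer legitimate emails flagged by mistake.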

# **Model Evaluation**

In [41]:
from sklearn.metrics import confusion_matrix, classification_report

In [42]:
print(confusion_matrix(y_test, y_pred))

[[718  17]
 [  8 292]]
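
Rows of the confusion matrix are true classes and columns are predicted classes, so 17 class 0 emails were predicted as class 1 and 8 class 1 emails were predicted as class 0. The headline metrics can be recomputed from these four counts. A plain-Python sketch, treating class 1 as the positive class:

```python
# Counts from the confusion matrix above (rows: true, columns: predicted).
tn, fp = 718, 17   # true class 0: predicted 0 / predicted 1
fn, tp = 8, 292    # true class 1: predicted 0 / predicted 1

precision_1 = tp / (tp + fp)                 # of emails predicted 1, share truly 1
recall_1 = tp / (tp + fn)                    # of true class 1, share that was found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all emails classified correctly

print(f"precision={precision_1:.2f} recall={recall_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.94 recall=0.97 accuracy=0.98
```

These values agree with the class 1 row and the accuracy line of the classification report.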


In [43]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98       735
           1       0.94      0.97      0.96       300

    accuracy                           0.98      1035
   macro avg       0.97      0.98      0.97      1035
weighted avg       0.98      0.98      0.98      1035

