
In [1]:
import pandas as pd
import numpy as np

In [2]:
#load email dataset
email_dataset = pd.read_csv("https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/9_bag_of_words/spam.csv")
email_dataset.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
#Create a binary 'spam' column: 1 if the message is spam, 0 if it is ham
email_dataset['spam'] = email_dataset['Category'].apply(lambda x: 1 if x == "spam" else 0)
email_dataset.drop('Category', axis="columns", inplace=True)
email_dataset.head()

Unnamed: 0,Message,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


In [4]:
#Display how many spam and ham messages the dataset contains
email_dataset['spam'].value_counts()

spam
0    4825
1     747
Name: count, dtype: int64

The dataset is imbalanced: it contains 747 spam messages and 4825 non-spam (ham) messages.
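To make the imbalance concrete, the short calculation below uses only the two counts printed above. It is a sketch of the "always predict ham" baseline, which is the accuracy any real model has to beat.

```python
# Class counts copied from the value_counts() output above
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# Fraction of messages that are spam
spam_fraction = n_spam / total

# Accuracy of a trivial classifier that labels every message as ham (0)
baseline_accuracy = n_ham / total

print(f"spam fraction:     {spam_fraction:.3f}")
print(f"majority baseline: {baseline_accuracy:.3f}")
```

Because this baseline is already high, precision and recall on the spam class are more informative than accuracy alone.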

In [5]:
#Train Test Split
from sklearn.model_selection import train_test_split

X = email_dataset['Message']
y = email_dataset['spam']
X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=0)

In [6]:
#display the shape of X_train
X_train.shape

(4457,)
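The split above passes `stratify=y`, which keeps the class ratio the same in the train and test sets. A minimal sketch on hypothetical toy labels (not the email data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 90 zeros and 10 ones (10% positive class)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=0
)

# With stratification both splits keep the 10% positive rate
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Without `stratify`, a small minority class can end up over- or under-represented in the test set purely by chance.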

### **Feature Engineering**
Convert each message into a vector of word counts using CountVectorizer.

In [7]:
#Bag of words with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Learn the vocabulary from X_train and transform X_train into a count matrix
v = CountVectorizer()
X_train_cv = v.fit_transform(X_train.values)

In [8]:
#display shape of the X_train_cv
X_train_cv.shape

(4457, 7708)

In [9]:
#Show the column indices with non-zero counts for row 1257 of the training matrix
X_train_np = X_train_cv.toarray()
np.where(X_train_np[1257] != 0)

(array([ 942,  952, 1187, 1577, 2253, 2261, 2400, 3596, 3871, 4053, 4151,
        4586, 4650, 4815, 4918, 5402, 5573, 5583, 6670, 6700, 6763, 6788,
        6875, 6905, 7331, 7444, 7669]),)

In [10]:
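#Read the email whose index label is 1257
#Note: X_train[1257] looks up by label, while X_train_np[1257] is positional (row 1257),
#so the two may be different emails; X_train.iloc[1257] reads the email matching that row
X_train[1257]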

"Not yet chikku..going to room nw, i'm in bus.."

In [11]:
#Look up the vocabulary token stored at column index 4815
v.get_feature_names_out()[4815]

'not'

### **Model Building:**
Here we use a Naive Bayes model to predict whether an email is spam or not.

In [12]:
#Naive Bayes Model
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)
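As a rough illustration of what `MultinomialNB` learns from a count matrix, the sketch below recomputes its smoothed per-class token likelihoods by hand on hypothetical toy counts and compares them with the fitted `feature_log_prob_` attribute. It assumes the default smoothing `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents x 3 vocabulary terms
X_toy = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 2],
])
y_toy = np.array([0, 0, 1, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Recompute by hand:
# log P(term j | class c) = log((N_cj + alpha) / (N_c + alpha * n_terms))
manual = []
for c in nb.classes_:
    term_counts = X_toy[y_toy == c].sum(axis=0)
    denom = term_counts.sum() + alpha * X_toy.shape[1]
    manual.append(np.log((term_counts + alpha) / denom))
manual = np.vstack(manual)

matches = np.allclose(manual, nb.feature_log_prob_)
print(matches)
```

The smoothing term keeps a token that never appeared in one class from forcing that class's probability to zero.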

In [13]:
#Transform X_test with the vectorizer already fitted on X_train (do not refit)
X_test_cv = v.transform(X_test.values)

#Prediction
y_pred = model.predict(X_test_cv)
y_pred[:5]

array([0, 0, 0, 0, 1])

In [14]:
#Check Accuracy
from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_test, y_pred)

0.9775784753363229

In [15]:
#Classification Report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.97      0.86      0.91       149

    accuracy                           0.98      1115
   macro avg       0.97      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
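In the report above, recall for class 1 is 0.86 with a support of 149, so roughly 21 of the 149 spam messages in the test set were classified as ham. The sketch below shows how precision and recall follow from a confusion matrix, using hypothetical toy labels rather than the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 ham (0) and 4 spam (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print(precision, recall)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

For a spam filter, low recall means spam reaches the inbox, while low precision means legitimate mail is flagged.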



### Prediction

In [16]:
#Predict new email
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

Here, 1 means the email is spam and 0 means it is not spam.

### Use a pipeline to reduce errors

In [17]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [18]:
clf.fit(X_train, y_train)

In [19]:
y_pred = clf.predict(X_test)

#Classification Report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.97      0.86      0.91       149

    accuracy                           0.98      1115
   macro avg       0.97      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
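The identical report suggests the pipeline performs the same two steps as the manual workflow. A minimal sketch on hypothetical toy texts checks that equivalence directly: the pipeline takes raw strings, so there is no separate `transform` call to forget or to apply with the wrong vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data used only for illustration
train_texts = [
    "win a free prize now",
    "free entry to win cash",
    "see you at lunch",
    "are we meeting tomorrow",
]
train_labels = [1, 1, 0, 0]
new_texts = ["free cash prize", "lunch tomorrow"]

# Manual route: fit the vectorizer, fit the model, then transform before predicting
vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
manual_pred = nb.predict(vec.transform(new_texts))

# Pipeline route: one object handles vectorizing and classifying
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
]).fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

same = np.array_equal(manual_pred, pipe_pred)
print(same)
```

A single fitted pipeline object is also easier to pass to utilities such as cross-validation, since the vectorizer is refitted inside each training fold.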

