<a href="https://colab.research.google.com/github/Ali-Asgar-Lakdawala/ML-Practice/blob/main/spam_detection_Naive_Bayes_Classifier_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <u><b> Objective </b></u>
## <b> Your task is to predict whether a message is spam or not. In class we used scikit-learn's <code>CountVectorizer</code> to build a vector for each message. Here you will do the same task, but with the TF-IDF vectorizer instead of <code>CountVectorizer</code>. </b>

### You will use <code>TfidfVectorizer</code>. It converts a collection of text documents (the SMS corpus) into a 2D matrix. One dimension represents the documents and the other represents each unique word in the SMS corpus.

### If the $n^{th}$ term $t$ occurs $p$ times in the $m^{th}$ document, the $(m, n)$ entry of this matrix is $\text{TF-IDF}(t)$, where
$\text{TF-IDF}(t) = \text{Term Frequency (TF)} \times \text{Inverse Document Frequency (IDF)}$
* ### <b>Term Frequency (TF)</b> is a measure of how frequently a term occurs in a document.

* ### $TF(t)$= Number of times term $t$ appears in document ($p$) / Total number of terms in that document

* ### <b>Inverse Document Frequency (IDF)</b> is a measure of how informative a term is. TF treats all terms equally, whereas IDF assigns less weight to words that occur in many documents, such as 'is', 'the', and 'of', and more weight to rare terms that help distinguish the classes.

* ###  $IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right)$

### In the end, every message is represented by a vector whose length equals the size of the vocabulary (the number of unique terms in the entire SMS corpus), normalized to unit norm.
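The formulas above can be traced by hand on a tiny corpus. The sketch below is only an illustration: the three toy messages and the helper names `tf`, `idf`, and `tfidf_vector` are made up for this example. It applies the textbook definitions given here and then rescales each vector so its entries sum to 1 (an L1 norm, chosen to match the `norm='l1'` setting used later in this notebook). Note that scikit-learn's `TfidfVectorizer` uses raw counts for TF and a smoothed IDF by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased and tokenized
docs = [
    "free entry win cup".split(),
    "ok lar joking".split(),
    "win free prize now".split(),
]

# Vocabulary: every unique term in the corpus, in a fixed order
vocab = sorted({word for doc in docs for word in doc})

def tf(term, doc):
    """Occurrences of term in doc divided by the number of terms in doc."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(total documents / documents containing term)."""
    n_with_term = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_with_term)

def tfidf_vector(doc, docs, vocab):
    """TF-IDF weight for each vocabulary term, rescaled to sum to 1."""
    raw = [tf(term, doc) * idf(term, docs) for term in vocab]
    total = sum(abs(v) for v in raw)
    return [v / total for v in raw] if total else raw

vec = tfidf_vector(docs[0], docs, vocab)
for term, weight in zip(vocab, vec):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the first message, "entry" and "cup" appear in only one document, so they end up with a larger weight than "free" and "win", which appear in two.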





In [136]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [137]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [138]:
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML/spam.csv", encoding='latin-1')[['v1', 'v2']]

In [139]:
df.v2=df.v2.apply(lambda x:x.lower())

In [140]:
df

Unnamed: 0,v1,v2
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,this is the 2nd time we have tried 2 contact u...
5568,ham,will ì_ b going to esplanade fr home?
5569,ham,"pity, * was in mood for that. so...any other s..."
5570,ham,the guy did some bitching but i acted like i'd...


In [141]:
import string

In [142]:
df.v2=df.v2.apply(lambda x:x.translate(str.maketrans('', '', string.punctuation))) 
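Deleting punctuation outright can fuse neighbouring words: in the output below, "so...any" becomes "soany". The following sketch contrasts the deletion used above with one possible alternative that replaces punctuation with spaces and then collapses repeated whitespace. The helper names `strip_punct` and `space_punct` are made up for this illustration, and the alternative is a suggestion, not what this notebook does.

```python
import string

# Translation table that deletes every punctuation character (as in the cell above)
delete_table = str.maketrans('', '', string.punctuation)
# Alternative table that maps every punctuation character to a space
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punct(text):
    """Delete punctuation; adjacent words may be fused together."""
    return text.translate(delete_table)

def space_punct(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return ' '.join(text.translate(space_table).split())

sample = "so...any other suggestions?"
print(strip_punct(sample))
print(space_punct(sample))
```

The trade-off is that contractions are split instead of fused: "don't" becomes "don t" with the second approach and "dont" with the first.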

In [143]:
df

Unnamed: 0,v1,v2
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...
...,...,...
5567,spam,this is the 2nd time we have tried 2 contact u...
5568,ham,will ì b going to esplanade fr home
5569,ham,pity was in mood for that soany other suggest...
5570,ham,the guy did some bitching but i acted like id ...


In [144]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [145]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [146]:
df.v2=df.v2.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
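The order of the cleaning steps matters when the stop list contains contracted forms. Because punctuation was removed first, "don't" has already become "dont" by the time stop words are filtered, so an entry such as "don't" can no longer match. The sketch below shows the effect with a tiny hypothetical stop set (a stand-in for illustration, not the NLTK list). It also uses a `set`, which makes membership tests cheaper than scanning a list.

```python
import string

# Hypothetical stand-in for a stop-word list; NOT the NLTK list
stop = {"i", "he", "to", "don't"}
delete_table = str.maketrans('', '', string.punctuation)

def remove_stop(text):
    """Drop every whitespace-separated token found in the stop set."""
    return ' '.join(word for word in text.split() if word not in stop)

def strip_punct(text):
    """Delete all punctuation characters."""
    return text.translate(delete_table)

msg = "nah i don't think he goes to usf"

# Order used in this notebook: punctuation first, then stop words
punct_first = remove_stop(strip_punct(msg))
# Alternative order: stop words first, then punctuation
stop_first = strip_punct(remove_stop(msg))

print(punct_first)
print(stop_first)
```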

In [147]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [148]:
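# norm='l1' scales each row so its entries sum to 1; max_df=0.95 drops terms found in more than 95% of messages.
# Note: fitting on the full corpus before the train/test split lets test messages influence the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(max_df=0.95,norm='l1')
X = vectorizer.fit_transform(df.v2)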

In [149]:
df.iloc[4065].v2

'fyi im gonna call sporadically starting like ltgt bc doin shit'

In [150]:
sum(X.toarray()[4065])

1.0

In [151]:
from sklearn.model_selection import train_test_split
msg_train,msg_test,label_train,label_test = train_test_split(df.v2,df.v1,test_size=0.2)

In [152]:
train_vectorized = vectorizer.transform(msg_train)
test_vectorized = vectorizer.transform(msg_test)
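The vectorizer used here was fitted on all messages before the split, so the test messages contributed to the vocabulary and IDF weights. A variant that avoids this is to split first and call `fit_transform` on the training messages only. The sketch below shows that order on a small hypothetical toy dataset (the messages, labels, and `random_state=0` are assumptions for illustration, not the notebook's data or settings).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data standing in for df.v2 and df.v1
messages = [
    "free entry win cup final tickets",
    "win cash prize now call",
    "urgent free prize claim call now",
    "free ringtone text win",
    "ok lar joking wif u",
    "see you at home tonight",
    "call me when you get home",
    "are we still going for lunch",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split first, keeping the class balance in both parts
msg_train, msg_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels
)

# Learn the vocabulary and IDF weights from the training messages only
vec = TfidfVectorizer(norm='l1')
X_train = vec.fit_transform(msg_train).toarray()
# Reuse the fitted vocabulary for the test messages; unseen words are ignored
X_test = vec.transform(msg_test).toarray()

model = GaussianNB().fit(X_train, y_train)
preds = model.predict(X_test)
print(list(zip(msg_test, preds)))
```

With this order, the test set plays the role of genuinely unseen messages, at the cost of ignoring any test-only words.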

In [153]:
train_array= train_vectorized.toarray()
test_array = test_vectorized.toarray()

In [154]:
from sklearn.naive_bayes import GaussianNB
spam_detect_model = GaussianNB().fit(train_array,label_train)

In [155]:
train_preds = spam_detect_model.predict(train_array)
test_preds = spam_detect_model.predict(test_array)

In [156]:
from sklearn.metrics import classification_report,confusion_matrix

In [157]:
# Confusion matrix for the test set (rows: true ham/spam, columns: predicted ham/spam)
print(confusion_matrix(label_test,test_preds))

[[854  94]
 [ 26 141]]


In [158]:
# Print the classification report for the test set
print(classification_report(label_test,test_preds))

              precision    recall  f1-score   support

         ham       0.97      0.90      0.93       948
        spam       0.60      0.84      0.70       167

    accuracy                           0.89      1115
   macro avg       0.79      0.87      0.82      1115
weighted avg       0.91      0.89      0.90      1115
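The figures in the report can be re-derived from the confusion matrix printed earlier, which helps when reading it: rows are the true labels (ham, spam) and columns are the predicted labels. The sketch below hard-codes the four counts from that output; the variable names are chosen for this illustration.

```python
# Counts copied from the confusion matrix above
# rows = true label, columns = predicted label, order = [ham, spam]
ham_as_ham, ham_as_spam = 854, 94
spam_as_ham, spam_as_spam = 26, 141

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Precision: of everything predicted as a class, how much was correct
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
ham_precision = ham_as_ham / (ham_as_ham + spam_as_ham)

# Recall: of everything truly in a class, how much was found
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
ham_recall = ham_as_ham / (ham_as_ham + ham_as_spam)

# F1: harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

accuracy = (ham_as_ham + spam_as_spam) / total

print(f"spam precision={spam_precision:.2f} recall={spam_recall:.2f} f1={spam_f1:.2f}")
print(f"ham  precision={ham_precision:.2f} recall={ham_recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The low spam precision comes from the 94 ham messages flagged as spam, which is a large share of the 235 messages predicted as spam.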

