## Spam Classifier (using basic concepts + ML classifier)

We have a tab-separated file for the data. Let's read it.

In [72]:
import pandas as pd

In [73]:
# setting the data path
data_path = "/content/Natural-Language-Processing/Basics/mini-projects/Spam-Classifier/data/SMSSpamCollection"

messages = pd.read_csv(data_path, sep='\t',
                        names=["label", "message"])


In [74]:
messages

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [75]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [76]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [77]:
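lemmatizer = WordNetLemmatizer()
# build the stop-word set once; a set lookup avoids reloading the list for every word
stop_words = set(stopwords.words('english'))

corpus = []

for message in messages['message']:
    # keep letters only, lowercase, and split into tokens
    text = re.sub('[^a-zA-Z]', ' ', message)
    text = text.lower()
    text = text.split()
    # drop stop words and lemmatize the remaining tokens
    text = [lemmatizer.lemmatize(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    corpus.append(text)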
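
To make the cleaning steps concrete, here is a minimal sketch applied to a single message. It assumes a tiny hand-written stop-word set (`TOY_STOP_WORDS`, hypothetical) instead of NLTK's English list and skips lemmatization, so it runs without any downloads; `clean_message` is an illustrative name, not something defined elsewhere in this notebook.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses nltk's stopwords.words('english').
TOY_STOP_WORDS = {"in", "a", "to", "the", "is"}

def clean_message(message, stop_words=TOY_STOP_WORDS):
    """Keep letters only, lowercase, tokenize, and drop stop words."""
    text = re.sub('[^a-zA-Z]', ' ', message)  # digits and punctuation become spaces
    tokens = text.lower().split()
    return ' '.join(t for t in tokens if t not in stop_words)

cleaned = clean_message("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)
```

Note that the regex also removes digits, so "2" disappears before the stop-word filter ever sees it.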


#### Let's try both approaches for creating word vectors (bag of words and TF-IDF)

In [78]:
### Bag Of Words

In [79]:
from sklearn.feature_extraction.text import CountVectorizer
# let's keep the features to all for now
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

X.shape

(5572, 7098)

So, we have 5572 messages in total and a vocabulary of 7098 unique words.
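
As a rough illustration of what the count matrix contains, the sketch below builds a bag-of-words matrix by hand for a toy corpus. It is a simplified stand-in, not `CountVectorizer`'s implementation (which, among other things, ignores single-character tokens by default); `toy_corpus` and `bag_of_words` are made-up names.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenized docs."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # one column per vocabulary word; missing words count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows

toy_corpus = ["free entry win", "ok lar joking", "win win free"]
vocab, matrix = bag_of_words(toy_corpus)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one message and each column is one vocabulary word, which is why the shape of `X` is (number of messages, vocabulary size).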

In [80]:
Let's convert the label to a one-hot vector (ham/spam) and keep the spam column

In [81]:
y = pd.get_dummies(messages['label'])

# getting spam column (1 for spam 0 for not)
y = y.iloc[:, 1].values
y

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)
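
An equivalent and slightly more direct encoding compares the label column with the string `'spam'`. This is only a sketch on a toy Series (`toy_labels` is hypothetical); its one advantage is that it does not rely on the alphabetical column order that `get_dummies` produces.

```python
import pandas as pd

toy_labels = pd.Series(["ham", "ham", "spam", "ham"])

# True where the label is spam, then cast to 0/1 integers
y_toy = (toy_labels == "spam").astype(int).to_numpy()
print(y_toy)
```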

In [82]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [83]:
# training model using Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
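
One thing to keep in mind: the vectorizer in this notebook is fitted on the full corpus before the split, so the vocabulary also sees the test messages. A possible alternative, sketched here on made-up toy messages, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is learned from the training data only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, hand-written training data (1 = spam, 0 = ham); not from the dataset
train_texts = [
    "free entry win cash prize",
    "ok lar joking wif you",
    "win free ticket now",
    "see you at home tonight",
]
train_labels = [1, 0, 1, 0]

# fit() learns the vocabulary and the classifier from the training texts only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["free cash prize now", "see you tonight"])
print(pred)
```

With the real data, the raw cleaned messages (not the precomputed `X`) would be split first and the pipeline fitted on the training part.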

In [85]:
from sklearn.metrics import confusion_matrix
cfm = confusion_matrix(y_test, y_pred)

In [86]:
# rows: actual class (0 = ham, 1 = spam); columns: predicted class (0, 1)
cfm

array([[943,  23],
       [  6, 143]])

In [87]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)

In [88]:
accuracy

0.9739910313901345

In [89]:
# trying with fewer features (example: 2500)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()
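
`max_features` keeps only the most frequent terms across the corpus. The idea can be sketched by hand as below; this is a simplification with a made-up helper name (`top_k_vocabulary`), and the way ties are broken here may differ from scikit-learn.

```python
from collections import Counter

def top_k_vocabulary(docs, k):
    """Return the k most frequent words across all docs, sorted alphabetically."""
    totals = Counter(word for doc in docs for word in doc.split())
    return sorted(word for word, _ in totals.most_common(k))

toy_corpus = ["free entry win", "win free cash", "win now"]
vocab = top_k_vocabulary(toy_corpus, k=2)
print(vocab)
```

Rare words such as "entry" or "cash" fall out of the vocabulary, which shrinks the feature matrix from one column per unique word to at most `k` columns.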

In [90]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [91]:
X_train.shape

(4457, 2500)

In [94]:
# training model using Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [95]:
cfm = confusion_matrix(y_test, y_pred)
cfm

array([[954,  12],
       [  6, 143]])

accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9838565022421525

We got a boost in accuracy (I know that accuracy is not an ideal metric for this problem, since ham messages far outnumber spam, but still), and the confusion matrix shows fewer misclassifications: ham messages wrongly flagged as spam dropped from 23 to 12.
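
Since accuracy can look good simply because most messages are ham, precision and recall for the spam class are a useful complement. The sketch below derives them from the two confusion matrices reported above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes); `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(cfm):
    """cfm is [[TN, FP], [FN, TP]] with spam as the positive class."""
    (tn, fp), (fn, tp) = cfm
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the two bag-of-words runs above
runs = {
    "all features": [[943, 23], [6, 143]],
    "max_features=2500": [[954, 12], [6, 143]],
}
results = {name: precision_recall_f1(cfm) for name, cfm in runs.items()}
for name, (p, r, f1) in results.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Recall is identical in both runs (143 of 149 spam messages caught); the improvement comes entirely from precision, which rises from about 0.861 to about 0.923.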

In [108]:
### TF-IDF

In [109]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=2500)
X = tfidf.fit_transform(corpus).toarray()
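
For intuition, TF-IDF scales each term count by how rare the term is across documents. The sketch below follows what I understand to be scikit-learn's default weighting (smoothed idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); treat it as an illustration on a toy corpus with made-up names, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # document frequency: number of docs that contain each word
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows

vocab, rows = tfidf(["free win win", "ok win"])
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the second toy message, "ok" gets a larger weight than "win" even though both occur once, because "win" appears in every document.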

In [110]:
y = pd.get_dummies(messages['label'])

# getting spam column (1 for spam 0 for not)
y = y.iloc[:, 1].values

In [111]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [112]:
# training model using Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [113]:
cfm = confusion_matrix(y_test, y_pred)
cfm

array([[964,   2],
       [ 18, 131]])

0.9820627802690582