<a href="https://colab.research.google.com/github/Gladiator07/Natural-Language-Processing/blob/main/Basics/mini-projects/Spam-Classifier/Spam_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Spam Classifier (using basic concepts + ML classifier)

The data comes as a tab-separated file. Let's read it.

In [1]:
# Downloading and unzipping data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

--2021-09-18 08:23:33--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’


2021-09-18 08:23:33 (1.66 MB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [2]:
import pandas as pd

In [4]:
# setting the data path
data_path = "/content/SMSSpamCollection"

messages = pd.read_csv(data_path, sep='\t',
                        names=["label", "message"])


In [5]:
messages

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [6]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [7]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [8]:
lemmatizer = WordNetLemmatizer()

In [9]:
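corpus = []
# build the stop-word set once; set lookups are fast
stop_words = set(stopwords.words('english'))

for message in messages['message']:
    # keep letters only, then lowercase and tokenize on whitespace
    text = re.sub('[^a-zA-Z]', ' ', message)
    text = text.lower()
    text = text.split()
    # drop stop words and lemmatize what remains
    text = [lemmatizer.lemmatize(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    corpus.append(text)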
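
To see what the cleaning loop does to a single message, here is a minimal sketch. It assumes a tiny hand-picked stop-word set (`stop_words` below is hypothetical, not NLTK's English list) and skips lemmatization, so it only illustrates the regex, lowercasing, and filtering steps.

```python
import re

# hypothetical mini stop-word set; the notebook uses NLTK's full English list
stop_words = {"in", "a", "to"}

# one of the messages shown in the DataFrame preview
message = "Free entry in 2 a wkly comp to win FA Cup fina..."

# replace everything that is not a letter (digits, punctuation) with a space
letters_only = re.sub('[^a-zA-Z]', ' ', message)

# lowercase, split on whitespace, and drop stop words
tokens = letters_only.lower().split()
cleaned = ' '.join(word for word in tokens if word not in stop_words)

print(cleaned)
```

The digit `2` and the trailing dots disappear because the regex keeps letters only, so the cleaned text is `free entry wkly comp win fa cup fina`.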

#### Let's try both approaches for creating word vectors (bag of words and TF-IDF)

## Bag Of Words

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
# keep all features for now
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

In [11]:
X.shape

(5572, 7098)

So, we have 5572 messages in total and a vocabulary of 7098 unique words.
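
As a rough illustration of where a shape like `(messages, vocabulary size)` comes from, the sketch below builds a bag-of-words matrix by hand for a three-document toy corpus. The corpus and variable names are made up for this example, and plain whitespace splitting stands in for `CountVectorizer`'s tokenizer, which by default also drops single-character tokens.

```python
from collections import Counter

# hypothetical toy corpus, already cleaned like the notebook's `corpus`
toy_corpus = ["free entry win cup", "ok lar joking", "win free prize"]

# vocabulary: every unique word, sorted alphabetically
vocab = sorted({word for doc in toy_corpus for word in doc.split()})

# one row per document, one column per vocabulary word, cells hold word counts
matrix = []
for doc in toy_corpus:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

shape = (len(matrix), len(vocab))
print(vocab)
print(matrix)
print(shape)
```

Here the shape is `(3, 8)`: three documents and eight unique words, in the same way the full corpus gives 5572 rows and 7098 columns.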

Let's convert the label to a one-hot vector (ham/spam)

In [12]:
y = pd.get_dummies(messages['label'])

# getting spam column (1 for spam 0 for not)
y = y.iloc[:, 1].values
y

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)
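
The `get_dummies` approach relies on `spam` being the second column (columns are ordered alphabetically), and to my knowledge recent pandas releases (2.0 and later) return boolean columns by default rather than `uint8`. A possible alternative, sketched here on a small stand-in DataFrame (`sample` is hypothetical, not the notebook's `messages`), is to map the labels explicitly:

```python
import pandas as pd

# hypothetical stand-in for the notebook's `messages` DataFrame
sample = pd.DataFrame({"label": ["ham", "ham", "spam", "ham"]})

# explicit mapping: 1 for spam, 0 for ham
y_sample = sample["label"].map({"ham": 0, "spam": 1}).to_numpy()

print(y_sample)
```

This keeps the meaning of 0 and 1 visible in the code instead of depending on column order.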

In [13]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [14]:
# training model using Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [15]:
from sklearn.metrics import confusion_matrix
cfm = confusion_matrix(y_test, y_pred)

In [16]:
# rows are true labels, columns are predicted labels (0 = ham, 1 = spam)
cfm

array([[943,  23],
       [  6, 143]])

In [17]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)

In [18]:
accuracy

0.9739910313901345

In [19]:
# trying with fewer features (example: 2500)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

In [20]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [21]:
X_train.shape

(4457, 2500)

In [22]:
# training model using Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [23]:
cfm = confusion_matrix(y_test, y_pred)
cfm

array([[954,  12],
       [  6, 143]])

In [24]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9838565022421525

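Accuracy improved from about 0.974 to about 0.984. (I know that accuracy is not an ideal metric for this problem, since ham messages far outnumber spam in the test set, but it still gives a rough picture.) The confusion matrix also shows fewer misclassifications: ham messages wrongly flagged as spam dropped from 23 to 12, while missed spam stayed at 6.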
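
Since accuracy can hide what happens on the minority class, one option is to derive precision, recall, and F1 for the spam class directly from the two confusion matrices printed above. This is a small sketch; the helper name `spam_metrics` is made up for this example, and the numbers are copied from the outputs of the two bag-of-words runs.

```python
def spam_metrics(cfm):
    """Return (precision, recall, f1) for the spam class.

    Expects the scikit-learn layout: rows are true labels,
    columns are predicted labels, with 0 = ham and 1 = spam.
    """
    (tn, fp), (fn, tp) = cfm
    precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
    recall = tp / (tp + fn)     # of actual spam, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# confusion matrices copied from the notebook outputs
all_features = [[943, 23], [6, 143]]
top_2500 = [[954, 12], [6, 143]]

for name, cfm in [("all features", all_features), ("2500 features", top_2500)]:
    precision, recall, f1 = spam_metrics(cfm)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these matrices, spam precision rises from roughly 0.86 to roughly 0.92 while recall stays at roughly 0.96, which matches the drop in false positives.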

## TF-IDF

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=2500)
X = tfidf.fit_transform(corpus).toarray()
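
For intuition about what the vectorizer computes, here is a hand-rolled sketch of TF-IDF on a toy corpus. It assumes the weighting I understand `TfidfVectorizer` to use by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per row); the corpus and names are made up, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

# hypothetical toy corpus, already cleaned like the notebook's `corpus`
toy_corpus = ["free entry win cup", "ok lar joking", "win free prize"]
tokenized = [doc.split() for doc in toy_corpus]

vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# document frequency: in how many documents each word appears
df = Counter(word for doc in tokenized for word in set(doc))

# smoothed inverse document frequency (assumed default weighting)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in tokenized]

# in the first document, "cup" (1 document) outweighs "free" (2 documents)
print(rows[0][vocab.index("cup")], rows[0][vocab.index("free")])
```

Words that appear in fewer documents get a larger weight, which is the main difference from the plain counts used in the bag-of-words section.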

In [26]:
y = pd.get_dummies(messages['label'])

# getting spam column (1 for spam 0 for not)
y = y.iloc[:, 1].values

In [27]:
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [28]:
# training model using Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [29]:
cfm = confusion_matrix(y_test, y_pred)
cfm

array([[964,   2],
       [ 18, 131]])

In [30]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9820627802690582
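
The TF-IDF accuracy is close to the 2500-feature bag-of-words result, but the two models make different kinds of mistakes. The sketch below compares per-class error rates using the confusion matrices printed above; the helper name `error_rates` is made up for this example.

```python
def error_rates(cfm):
    """Return (false_positive_rate, false_negative_rate).

    Expects the scikit-learn layout: rows are true labels,
    columns are predicted labels, with 0 = ham and 1 = spam.
    """
    (tn, fp), (fn, tp) = cfm
    fpr = fp / (fp + tn)  # share of ham wrongly flagged as spam
    fnr = fn / (fn + tp)  # share of spam that slipped through as ham
    return fpr, fnr

# confusion matrices copied from the notebook outputs (both with max_features=2500)
bow_2500 = [[954, 12], [6, 143]]
tfidf_2500 = [[964, 2], [18, 131]]

for name, cfm in [("bag of words", bow_2500), ("TF-IDF", tfidf_2500)]:
    fpr, fnr = error_rates(cfm)
    print(f"{name}: false positive rate={fpr:.4f} false negative rate={fnr:.4f}")
```

On this split, TF-IDF flags far fewer ham messages as spam (2 versus 12) but misses more spam (18 versus 6), so which vectorizer is preferable depends on which error is more costly.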