# <h1 style='color:purple' align='center'> NLP: SPAM CLASSIFIER USING TFIDF MODEL </h1>

Dataset: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

Loading the data into a pandas dataframe:

Since the file is tab-separated, use \t as the delimiter.

Read the tab-separated text into two columns of the pandas dataframe.

Column names: label (spam or not), message (text of the message)

Each message in the dataset carries one of two labels, ham or spam:

ham: Legitimate message (not spam)

spam: Spam message


In [1]:
import pandas as pd
df=pd.read_csv("SMSSpamCollection",sep='\t',names=['label','message'])

Let's have a look at the data

In [2]:
df.head(2)

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...


In [3]:
df.shape

(5572, 2)

In [4]:
df['message'][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

# Data cleaning

As we can see, the data contains punctuation and stop words (such as in, there), which contribute little to spam classification. Let's remove them.



In [5]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gunas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
import re

In [7]:
from nltk.stem.porter import PorterStemmer

In [8]:
ps=PorterStemmer()

In [9]:
len(df)

5572

In [10]:
from nltk.corpus import stopwords

# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))
messages = []
for i in range(len(df)):
    # Retain only letters in the message; replace everything else with a space
    data = re.sub('[^a-zA-Z]', ' ', df['message'][i])
    # Convert everything to lowercase
    data = data.lower()
    # Split the message into a list of words
    data = data.split()
    # Remove stop words
    data = [word for word in data if word not in stop_words]
    # Perform stemming on the list of words
    data = [ps.stem(word) for word in data]
    # Join the words back together, separated by single spaces
    data = ' '.join(data)
    messages.append(data)
        

In [11]:
messages[0]

'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'
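To see what the regular expression and the stop-word filter do on their own, here is a minimal sketch that uses only the standard library. It skips stemming, and `STOP_WORDS` and `clean_without_stemming` are hypothetical names with a deliberately tiny stop-word set, not NLTK's English list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses NLTK's full English list.
STOP_WORDS = {"until", "only", "in", "there"}

def clean_without_stemming(text, stop_words=STOP_WORDS):
    # Replace every non-letter character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    tokens = letters_only.lower().split()
    # Drop stop words and join the remaining words with single spaces
    return " ".join(t for t in tokens if t not in stop_words)

raw = ("Go until jurong point, crazy.. Available only in bugis n great "
       "world la e buffet... Cine there got amore wat...")
print(clean_without_stemming(raw))
# go jurong point crazy available bugis n great world la e buffet cine got amore wat
```

Comparing this with messages[0] shows what the Porter stemmer adds on top: crazy becomes crazi, available becomes avail, and so on.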

The data cleaning process is now complete.

The next step is to convert this list of strings into a machine-readable form, that is, numbers.

For this we use the TFIDF model.

TFIDF model: In this model, the distinct words present in the dataset are the features.

For each sentence in the dataset a vector of numbers is generated. Each number in the vector is the product of the term frequency and the inverse document frequency of the corresponding word.

Term frequency = Number of times the word occurs in the sentence / Total number of words in the sentence

Inverse document frequency = log(Total number of sentences in the corpus / Number of sentences in which the particular word is present)

The length of the vector is equal to the number of different words in the dataset.
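The two formulas can be worked through by hand on a small example. The sketch below is a direct, standard-library translation of the definitions above; the toy corpus and the function names are hypothetical and are not part of the notebook.

```python
import math

def term_frequency(term, tokens):
    # Occurrences of the term divided by the number of words in the sentence
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    # log(number of sentences / number of sentences containing the term)
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tfidf_vector(tokens, corpus, vocabulary):
    # One entry per vocabulary word: TF * IDF (0 when the word is absent)
    return [term_frequency(t, tokens) * inverse_document_frequency(t, corpus)
            for t in vocabulary]

# Hypothetical toy corpus, already cleaned and split into words
corpus = [s.split() for s in ["free prize call now", "call me later", "free free entry"]]
vocabulary = sorted({t for tokens in corpus for t in tokens})

vector = tfidf_vector(corpus[2], corpus, vocabulary)
for term, weight in zip(vocabulary, vector):
    print(f"{term:>6}: {weight:.3f}")
```

Note that scikit-learn's TfidfVectorizer, with its default settings, uses raw counts for the term frequency, a smoothed inverse document frequency, and L2 normalisation of each row, so its numbers will not match this textbook version exactly.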

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()

In [13]:
X=cv.fit_transform(messages)

In [14]:
X.shape

(5572, 6296)

As we can see, each sentence is represented by a vector of length 6296. If a particular word is present in a sentence, the corresponding position in the vector holds that word's TFIDF weight (the product of term frequency and inverse document frequency); all other positions are zero.

We can reduce the number of features with the max_features parameter of TfidfVectorizer. Let's say we want to keep only the 2000 most frequent words in the TFIDF model.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv=TfidfVectorizer(max_features=2000)
X=cv.fit_transform(messages)

In [16]:
X.shape

(5572, 2000)
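As a rough illustration of what limiting the vocabulary means, the sketch below keeps the k terms with the highest total count across a toy corpus. `top_k_vocabulary` is a hypothetical helper, and the way it breaks ties between equally frequent terms is an assumption that may differ from scikit-learn.

```python
from collections import Counter

def top_k_vocabulary(messages, k):
    # Count every word across the whole corpus
    counts = Counter(token for msg in messages for token in msg.split())
    # Keep the k most frequent words, returned in alphabetical order
    return sorted(term for term, _ in counts.most_common(k))

# Hypothetical toy corpus of already-cleaned messages
toy = ["free prize call", "call later", "free free entry"]
print(top_k_vocabulary(toy, 2))  # ['call', 'free']
```

Words outside the retained vocabulary are simply ignored when the vectors are built, which is why X shrinks from 6296 columns to 2000.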

Now that X is ready, let's look at the labels y for each data sample.

In [19]:
df['label'][:2]

0    ham
1    ham
Name: label, dtype: object

In [20]:
dummies=pd.get_dummies(df['label'])

In [22]:
dummies[:2]

   ham  spam
0    1     0
1    1     0


Only one column is sufficient to represent the label information. Let's retain only the second column, which has index 1.

Spam = 1

Not spam = 0

In [30]:
y=dummies.iloc[:,1].values

In [32]:
y

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)
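The same 0/1 encoding can also be written as an explicit mapping, which makes the meaning of each value visible in the code. This is a minimal sketch on a hypothetical list of labels; `LABEL_TO_INT` and `y_sketch` are illustrative names, not part of the notebook.

```python
# Hypothetical sample of labels, in the same form as df['label']
labels = ["ham", "ham", "spam", "ham"]

# Explicit mapping: ham (not spam) -> 0, spam -> 1
LABEL_TO_INT = {"ham": 0, "spam": 1}

y_sketch = [LABEL_TO_INT[label] for label in labels]
print(y_sketch)  # [0, 0, 1, 0]
```

On the dataframe itself, a comparable result could be obtained with df['label'].map({'ham': 0, 'spam': 1}).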

Now both X and y are ready, so we can train the model. For that, let's split the data into training and test sets.

In [33]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=2)

Naive Bayes classifier

In [37]:
from sklearn.naive_bayes import MultinomialNB
spam_classifier=MultinomialNB()

Fit the model on the training data

In [38]:
spam_classifier.fit(X_train,y_train)

MultinomialNB()

Predict the values for the test data using the trained model

In [39]:
y_pred=spam_classifier.predict(X_test)

Evaluate the model on the test data using accuracy score and confusion matrix

In [40]:
from sklearn.metrics import confusion_matrix

In [41]:
cf=confusion_matrix(y_test,y_pred)
print(cf)

[[955   2]
 [ 33 125]]


In [42]:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_test,y_pred)
print(accuracy)

0.968609865470852
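Accuracy alone can hide how the errors are distributed, so it may help to derive precision and recall for the spam class from the confusion matrix. The sketch below assumes scikit-learn's layout, with rows as true classes and columns as predicted classes in sorted label order (0 = ham, 1 = spam), and copies the four counts from the printed matrix.

```python
# Counts copied from the printed confusion matrix
# (rows: true class, columns: predicted class; 0 = ham, 1 = spam)
tn, fp = 955, 2     # true ham:  predicted ham, predicted spam
fn, tp = 33, 125    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all messages classified correctly
precision = tp / (tp + fp)     # share of predicted spam that really is spam
recall = tp / (tp + fn)        # share of actual spam that was caught

print(f"accuracy:  {accuracy:.3f}")   # 0.969
print(f"precision: {precision:.3f}")  # 0.984
print(f"recall:    {recall:.3f}")     # 0.791
```

Under these assumptions, 33 of the 158 spam messages in the test set are classified as ham, while only 2 of the 957 ham messages are flagged as spam.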
