
In this project we classify whether an SMS is spam or not with the help of Natural Language Processing.

Link to dataset: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#

In [21]:
# Importing needed libraries for data manipulation
import numpy as np
import pandas as pd

In [22]:
# Reading the file
messages = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML Algorithms Shahrukh/Datasets/SMSSpamCollection", sep="\t", names=["Label", "Message"])

In [23]:
# Checking first five observations
messages.head()

Unnamed: 0,Label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


We have loaded and checked the dataset. The messages contain many stopwords and punctuation marks, which would add noise and hurt the classifier when it decides whether an SMS is spam. So let's proceed and handle them.
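As a rough illustration of the clean-up described above, the sketch below strips everything except letters, lowercases the text, and drops stopwords. `TINY_STOPWORDS` and `clean_message` are hypothetical names, and the stopword set is a small hand-picked stand-in rather than NLTK's English list. Stemming is left out here to keep the sketch dependency-free.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the notebook itself relies on NLTK's English stopword list.
TINY_STOPWORDS = {"until", "only", "in", "there", "a", "to", "the"}

def clean_message(text, stop_words=TINY_STOPWORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

sample = "Go until jurong point, crazy.. Available only in bugis"
print(clean_message(sample))  # go jurong point crazy available bugis
```

Note that the regular expression also removes digits, so a token such as "2" in "Free entry in 2 a wkly comp" disappears along with the punctuation.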

## Stemming with Bag of Words (CountVectorizer)

In [24]:
# Data cleaning and preprocessing
import re               # Regular expressions
import nltk             # Natural Language Toolkit
nltk.download('stopwords')    # Downloading stopwords

from nltk.corpus import stopwords       # Stopwords
from nltk.stem.porter import PorterStemmer      # For stemming (reducing a word to its stem, which may not be a real word)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
# Creating an instance of PorterStemmer
ps = PorterStemmer()

In [26]:
# Removing punctuation and stopwords, lowercasing, stemming each word and appending the result to a new list
stop_words = set(stopwords.words('english'))   # Built once; set lookups are fast
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['Message'][i])   # Keep letters only (digits are dropped too)
    review = review.lower()
    review = review.split()

    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [27]:
# Checking the first message before and after manipulation
print(messages["Message"][0])
print("--"*60)
print(corpus[0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
------------------------------------------------------------------------------------------------------------------------
go jurong point crazi avail bugi n great world la e buffet cine got amor wat


Great, the data is now preprocessed, but one problem remains. A machine learning classifier only works with numbers, so the text has to be converted into a numeric form first. Let's use the Bag of Words technique to vectorize the corpus, and then separate the independent and dependent variables.
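To make the idea concrete, here is a minimal pure-Python sketch of Bag of Words: build a sorted vocabulary, then count how often each vocabulary word occurs in each document. The function name `bag_of_words` and the toy corpus are made up for illustration; this is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for whitespace-tokenised documents."""
    vocabulary = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(doc.split()).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocabulary, matrix

# Toy, already-cleaned messages (hypothetical data)
bow_corpus = ["free entri win win", "ok lar joke", "win free prize"]
bow_vocab, bow_counts = bag_of_words(bow_corpus)
print(bow_vocab)   # ['entri', 'free', 'joke', 'lar', 'ok', 'prize', 'win']
for row in bow_counts:
    print(row)
```

Each row is one message and each column is one vocabulary word, so most entries are zero. This is the same reason the array printed by the notebook is almost entirely zeros.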

In [35]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000)
x = cv.fit_transform(corpus).toarray()

y = pd.get_dummies(messages["Label"], drop_first=True).values   # 'ham' column is dropped, so 1 = spam and 0 = ham

In [36]:
# Checking our independent variables
x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [38]:
# Checking our dependent variable
y

array([[0],
       [0],
       [1],
       ...,
       [0],
       [0],
       [0]], dtype=uint8)

Cool, we have separated our data into features and target. Now let's split the dataset into train and test sets for model training and evaluation.

In [41]:
# Importing train test split from sklearn
from sklearn.model_selection import train_test_split

# Splitting the dataset (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

In [42]:
# Checking the shape of each set
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((4457, 5000), (1115, 5000), (4457, 1), (1115, 1))
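The shapes above add up to 5,572 messages, of which 1,115 are held out. A minimal sketch of what such a split does, assuming a plain shuffle with a fixed seed and a test size rounded up (which reproduces the 4,457 / 1,115 sizes); `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and it will not pick the same rows as `train_test_split`.

```python
import math
import random

def simple_train_test_split(rows, labels, test_size=0.20, seed=0):
    """Shuffle indices with a fixed seed and hold out the first `test_size` share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)     # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_tr = [rows[i] for i in train_idx]
    x_te = [rows[i] for i in test_idx]
    y_tr = [labels[i] for i in train_idx]
    y_te = [labels[i] for i in test_idx]
    return x_tr, x_te, y_tr, y_te

# Toy data: ten "messages" identified by number, labels alternate 0/1
toy_rows = list(range(10))
toy_labels = [i % 2 for i in toy_rows]
x_tr, x_te, y_tr, y_te = simple_train_test_split(toy_rows, toy_labels)
print(len(x_tr), len(x_te))   # 8 2
```

Fixing the seed plays the same role as `random_state=0` in the notebook: rerunning the cell gives the same partition.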

We have split the data into train and test sets. Now let's implement a classification algorithm (say, Naive Bayes) for model training and testing.
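For intuition about what a multinomial Naive Bayes classifier computes on word counts, here is a small pure-Python sketch with add-one (Laplace) smoothing. The function names, the four-word vocabulary, and the toy data are all hypothetical, and the code is a teaching sketch rather than scikit-learn's `MultinomialNB`.

```python
import math
from collections import defaultdict

def train_multinomial_nb(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    n_features = len(count_rows[0])
    class_doc_counts = defaultdict(int)
    class_word_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(count_rows, labels):
        class_doc_counts[label] += 1
        totals = class_word_counts[label]
        for j, count in enumerate(row):
            totals[j] += count
    model = {}
    n_docs = len(labels)
    for label, doc_count in class_doc_counts.items():
        totals = class_word_counts[label]
        denom = sum(totals) + alpha * n_features
        log_likelihood = [math.log((c + alpha) / denom) for c in totals]
        model[label] = (math.log(doc_count / n_docs), log_likelihood)
    return model

def predict_multinomial_nb(model, row):
    """Pick the class with the highest log prior plus count-weighted log likelihood."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
    return max(model, key=score)

# Toy count vectors over a hypothetical vocabulary: [free, win, ok, lar]
nb_x = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1], [0, 1, 1, 1]]
nb_y = [1, 1, 0, 0, 0]   # 1 = spam, 0 = ham

nb_model = train_multinomial_nb(nb_x, nb_y)
print(predict_multinomial_nb(nb_model, [1, 1, 0, 0]))   # 1
print(predict_multinomial_nb(nb_model, [0, 0, 1, 2]))   # 0
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.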

In [43]:
# Importing the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

# Training the model using the Naive Bayes classifier
spam_detect_model = MultinomialNB().fit(x_train, y_train)

# Model prediction
y_pred = spam_detect_model.predict(x_test)

  y = column_or_1d(y, warn=True)
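The `column_or_1d` line above is the tail of a scikit-learn warning that shows up when the target is passed as a column of shape `(n, 1)` instead of a flat array of shape `(n,)`. The model still trains, but the warning can be avoided by flattening the target first. A minimal sketch, assuming NumPy and a made-up four-row label column:

```python
import numpy as np

# Hypothetical label column shaped like the notebook's y: one column, n rows
y_column = np.array([[0], [0], [1], [0]], dtype=np.uint8)

# ravel() flattens (n, 1) into (n,), the shape classifiers expect for targets
y_flat = y_column.ravel()

print(y_column.shape, y_flat.shape)   # (4, 1) (4,)
```

In the notebook this would amount to passing something like `y_train.ravel()` to `fit`, which leaves the predictions unchanged.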


Model Evaluation

In [45]:
# Importing Classification report for evaluation
from sklearn.metrics import classification_report

# Model evaluation
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       955
           1       0.94      0.95      0.95       160

    accuracy                           0.98      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.98      0.98      0.98      1115



Superb!!!
From the classification report we can see an accuracy of 98%, with a precision of 0.94 and a recall of 0.95 for the spam class (label 1).
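Because only 160 of the 1,115 test messages are spam, accuracy alone can look good even when spam is missed, which is why the per-class precision and recall matter. The sketch below shows how those figures are derived from true and predicted labels. `precision_recall_f1` is a hypothetical helper and the label lists are toy data, not the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 4 spam (1) and 6 ham (0); one spam is missed, one ham is flagged
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall_f1(toy_true, toy_pred))   # (0.75, 0.75, 0.75)
```

Recall for spam answers "what share of real spam was caught", while precision answers "what share of messages flagged as spam really were spam".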

## Lemmatization with TF-IDF Vectorizer (TfidfVectorizer)

In [61]:
# Data cleaning and preprocessing
import re               # Regular expressions
import nltk             # Natural Language Toolkit
nltk.download('stopwords')    # Downloading stopwords
nltk.download('all')      # Downloading all NLTK resources (the lemmatizer itself needs the 'wordnet' corpus)

from nltk.corpus import stopwords       # Stopwords
from nltk.stem import WordNetLemmatizer      # For lemmatization (reducing a word to its meaningful dictionary form)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'all'

In [62]:
# Creating an instance of WordNetLemmatizer
wordnet = WordNetLemmatizer()

In [63]:
# Removing punctuation and stopwords, lowercasing, lemmatizing each word and appending the result to a new list
stop_words = set(stopwords.words('english'))   # Built once instead of once per word
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['Message'][i])   # Keep letters only (digits are dropped too)
    review = review.lower()
    review = review.split()

    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [64]:
# Checking the first message before and after manipulation
print(messages["Message"][0])
print("--"*60)
print(corpus[0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
------------------------------------------------------------------------------------------------------------------------
go jurong point crazy available bugis n great world la e buffet cine got amore wat


Great, the data is preprocessed again, but the same problem remains: a machine learning classifier only works with numbers, so the text has to be converted into a numeric form first. This time let's use the TF-IDF technique to vectorize the corpus, and then separate the independent and dependent variables.
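Unlike raw counts, TF-IDF scales each word's count down when the word appears in many documents. The pure-Python sketch below uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is the weighting scikit-learn documents as the `TfidfVectorizer` default. The function name `tfidf_matrix` and the toy corpus are hypothetical, and tokenisation is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) with smoothed-IDF weights and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocabulary, rows

# Toy, already-cleaned messages (hypothetical data)
tfidf_corpus = ["free entry win", "ok see you", "win free prize"]
tfidf_vocab, tfidf_rows = tfidf_matrix(tfidf_corpus)
first = dict(zip(tfidf_vocab, tfidf_rows[0]))
print(first["entry"] > first["free"])   # True: "entry" is rarer, so it weighs more
```

In the first message "entry" occurs in one document while "free" occurs in two, so "entry" receives the larger weight even though both appear once.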

In [65]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfv = TfidfVectorizer(max_features=5000)
x = tfidfv.fit_transform(corpus).toarray()

y = pd.get_dummies(messages["Label"], drop_first=True).values   # 'ham' column is dropped, so 1 = spam and 0 = ham

In [67]:
# Checking our independent variables
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [68]:
# Checking our dependent variable
y

array([[0],
       [0],
       [1],
       ...,
       [0],
       [0],
       [0]], dtype=uint8)

Cool, we have separated our data into features and target. Now let's split the dataset into train and test sets for model training and evaluation.

In [69]:
# Importing train test split from sklearn
from sklearn.model_selection import train_test_split

# Splitting the dataset (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

In [70]:
# Checking the shape of each set
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((4457, 5000), (1115, 5000), (4457, 1), (1115, 1))

We have split the data into train and test sets. Now let's implement a classification algorithm (say, Naive Bayes) for model training and testing.

In [71]:
# Importing the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

# Training the model using the Naive Bayes classifier
spam_detect_model = MultinomialNB().fit(x_train, y_train)

# Model prediction
y_pred = spam_detect_model.predict(x_test)

  y = column_or_1d(y, warn=True)


Model Evaluation

In [72]:
# Importing Classification report for evaluation
from sklearn.metrics import classification_report

# Model evaluation
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99       955
           1       1.00      0.84      0.91       160

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



Superb!!!
From the classification report we can see an accuracy of 98% again. For the spam class (label 1), precision is 1.00 but recall is 0.84, which is lower than the 0.95 recall obtained with stemming and Bag of Words.