## **In this exercise we use natural language processing to build a spam classifier**

## **We use the SMS spam classification dataset for the implementation**

**Data processing**
*   Import the required packages
*   Load the data into train and test variables
*   Remove the unwanted data columns
*   Build word clouds to see which words are common in spam and in non-spam (ham) messages
*   Remove stop words and punctuation
*   Convert the text data into vectors

**Building a classification model**
*   Split the data into train and test sets
*   Use scikit-learn's built-in classifiers to build the models
*   Train the models on the data
*   Make predictions on new data
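
The steps listed above can be sketched end to end on a toy example. This is only an illustration: the six messages are made up rather than taken from the dataset, and chaining the vectorizer and classifier with `make_pipeline` is a convenience assumed here, not something the notebook itself does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the SMS data (hypothetical messages, not from the dataset)
messages = [
    "win a free prize now",
    "free entry win cash",
    "claim your free prize",
    "are you coming home",
    "see you at lunch",
    "call me when you are home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Convert text to TF-IDF vectors and fit a Naive Bayes classifier in one object
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)

# Predict the label of an unseen message
prediction = model.predict(["free prize inside"])
print(prediction[0])
```

The rest of the notebook performs the same stages one at a time on the real data.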













### **Installing Dependencies**

In [59]:
#Install Packages
!pip install wordcloud



## **Import the required packages**

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV,train_test_split,StratifiedKFold,cross_val_score,learning_curve


## **Preprocessing and Exploring the Dataset**

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/akshaybhatia10/SMS-Spam-Classification/master/data/spam.csv', encoding='latin-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## **Removing unwanted columns**

In [4]:
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v2" : "text", "v1":"label"})

In [5]:
data[1990:2000]

Unnamed: 0,label,text
1990,ham,HI DARLIN IVE JUST GOT BACK AND I HAD A REALLY...
1991,ham,No other Valentines huh? The proof is on your ...
1992,spam,Free tones Hope you enjoyed your new content. ...
1993,ham,Eh den sat u book e kb liao huh...
1994,ham,Have you been practising your curtsey?
1995,ham,Shall i come to get pickle
1996,ham,Lol boo I was hoping for a laugh
1997,ham,\YEH I AM DEF UP4 SOMETHING SAT
1998,ham,"Well, I have to leave for my class babe ... Yo..."
1999,ham,LMAO where's your fish memory when I need it?


In [6]:
data['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
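
The counts above show that the classes are imbalanced. A short calculation, using only the two numbers printed by `value_counts()`, gives the share of spam and the accuracy that a trivial "always ham" model would reach. That baseline is a reference point suggested here for reading the accuracy scores later on; the notebook does not compute it.

```python
# Class counts taken from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(counts.values()) / total

print(f"total messages:    {total}")
print(f"spam share:        {spam_share:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```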

In [None]:
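# Download the Punkt tokenizer models used by nltk.word_tokenize
import nltk
nltk.download("punkt")
# Newer NLTK releases may look for "punkt_tab" instead, so request it as well
nltk.download("punkt_tab")
# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')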

### **Word clouds: which words are common in spam and in ham (non-spam) messages**

In [None]:
ham_words = ''
spam_words = ''

In [None]:
# Build one lowercase, space-separated corpus per class for the word clouds
def build_corpus(messages):
    return ' '.join(
        token for message in messages for token in nltk.word_tokenize(message.lower())
    )

# Creating a corpus of spam messages
spam_words = build_corpus(data[data['label'] == 'spam'].text)
# Creating a corpus of ham messages
ham_words = build_corpus(data[data['label'] == 'ham'].text)

## **Creating the spam and ham word clouds**

In [None]:
spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)

In [None]:
#Spam Word cloud
plt.figure( figsize=(10,8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In [None]:
#Creating Ham wordcloud
plt.figure( figsize=(10,8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
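
A word cloud sizes each word by how often it occurs, so the same information can also be read as a frequency table. The sketch below uses three made-up messages and plain whitespace splitting instead of `nltk.word_tokenize`; both are simplifications for illustration.

```python
from collections import Counter

# Hypothetical messages standing in for the spam subset
spam_messages = ["Free entry win a prize", "Win free cash", "free call now"]

# Lowercase and split on whitespace (a simplification of nltk.word_tokenize)
tokens = [word for message in spam_messages for word in message.lower().split()]

# The most frequent words are the ones a word cloud would draw largest
top_words = Counter(tokens).most_common(2)
print(top_words)  # [('free', 3), ('win', 2)]
```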

In [None]:
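# Encode the labels numerically (ham -> 0, spam -> 1); only the label column is touched
data['label'] = data['label'].map({'ham': 0, 'spam': 1})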

In [None]:
data.head(10)

## **Removing Stopwords from the messages**

In [None]:
import nltk
nltk.download('stopwords')

## **Remove punctuation and stopwords**

In [None]:
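# Remove punctuation and English stopwords from a message
import string

# Build the stopword set once; calling stopwords.words() for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def text_process(text):
    # Strip every punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Keep only the words that are not stopwords (case-insensitive comparison)
    words = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(words)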

In [None]:
data['text'] = data['text'].apply(text_process)
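
To see what this cleaning step does to a single message, here is a self-contained sketch of the same two operations. The function name and the six-word stopword set are hypothetical stand-ins so the example runs without NLTK data; the notebook uses NLTK's full English list.

```python
import string

# Small hypothetical stopword set; the notebook uses NLTK's English list instead
SAMPLE_STOPWORDS = {"you", "have", "a", "now", "the", "to"}

def strip_punct_and_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    # Delete every punctuation character, then drop stopwords case-insensitively
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

cleaned = strip_punct_and_stopwords("Hello, you have WON a prize!!! Call now.")
print(cleaned)  # Hello WON prize Call
```

Note that the original casing of the kept words is preserved; only the stopword comparison is lowercased.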

In [None]:
data.head()

In [52]:
text = pd.DataFrame(data['text'])
label = pd.DataFrame(data['label'])
label

Unnamed: 0,label
0,ham
1,ham
2,spam
3,ham
4,ham
...,...
5567,spam
5568,ham
5569,ham
5570,ham


## **Converting words to vectors**



In [9]:
## Counting how many times a word appears in the dataset

from collections import Counter

total_counts = Counter()
for i in range(len(text)):
    for word in text.values[i][0].split(" "):
        total_counts[word] += 1

print("Total words in data set: ", len(total_counts))

Total words in data set:  15586


In [11]:
# Sorting in decreasing order (Word with highest frequency appears first)
vocab = sorted(total_counts, key=total_counts.get, reverse=True)
print(vocab[:60])

['to', 'you', 'I', 'a', 'the', 'and', 'in', 'is', 'i', 'u', 'for', 'my', '', 'of', 'your', 'me', 'on', 'have', '2', 'that', 'it', 'are', 'call', 'or', 'be', 'at', 'with', 'not', 'will', 'get', 'can', 'U', 'so', 'ur', "I'm", 'but', '&lt;#&gt;', 'You', 'from', '4', 'do', 'up', 'just', 'if', '.', 'go', 'when', 'know', 'this', 'like', 'we', 'all', 'out', 'got', 'was', 'come', 'now', '?', 'am', '...']


In [12]:
# Mapping from words to index

vocab_size = len(vocab)
word2idx = {word: i for i, word in enumerate(vocab)}

In [13]:
# Text to Vector
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        else:
            word_vector[word2idx.get(word)] += 1
    return np.array(word_vector)

In [14]:
# Convert all messages to count vectors
word_vectors = np.zeros((len(text), len(vocab)), dtype=np.int_)
for i, message in enumerate(text['text']):
    word_vectors[i] = text_to_vector(message)
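
The count vectors built above are easier to picture on a tiny example. The sketch below uses a hypothetical three-word vocabulary and plain Python lists rather than NumPy arrays, but follows the same logic: one slot per vocabulary word, and words outside the vocabulary are skipped.

```python
# Hypothetical three-word vocabulary; the notebook builds its own from the data
toy_vocab = ["free", "call", "home"]
toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}

def count_vector(message):
    # One slot per vocabulary word; words outside the vocabulary are skipped
    vector = [0] * len(toy_vocab)
    for word in message.split(" "):
        idx = toy_word2idx.get(word)
        if idx is not None:
            vector[idx] += 1
    return vector

vec = count_vector("free free call me")
print(vec)  # [2, 1, 0]
```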

## **Converting words to vectors using the TF-IDF vectorizer**

In [17]:
#convert the text data into vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape

(5572, 8672)
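
Each of the 8672 columns holds a TF-IDF weight rather than a raw count. As a rough sketch of how such a weight arises, the code below applies the formula that scikit-learn documents for `TfidfVectorizer` defaults, as far as I understand them: raw term count times a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three tokenized messages are hypothetical.

```python
import math

# Hypothetical tokenized messages
docs = [["free", "prize"], ["free", "lunch"], ["free", "call"]]
n_docs = len(docs)

# Document frequency: in how many messages each term occurs
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(doc):
    # Raw term count times smoothed idf, followed by L2 normalization
    raw = {t: doc.count(t) * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first message, "prize" occurs in one document and "free" in all three, so "prize" ends up with the larger weight.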

In [18]:
#features = word_vectors
features = vectors

## **Splitting into training and test set**

In [19]:
#split the dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
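
The split above is random and not stratified, so the spam share in the test set can drift from the overall share. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The sketch below shows it on made-up labels; it is an optional variant, and using it in this notebook would change the scores reported further down.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 170 ham (0) and 30 spam (1)
y = [0] * 170 + [1] * 30
X = [[i] for i in range(len(y))]  # placeholder features

# stratify=y keeps the ham/spam ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=111, stratify=y
)

print("spam share overall:", sum(y) / len(y))
print("spam share in test:", sum(y_te) / len(y_te))
```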

## **Classifying with scikit-learn's built-in classifiers**

In [21]:
#import sklearn packages for building classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [22]:
#initialize multiple classification models 
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)

In [23]:
#create a dictionary of variables and models
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}

In [24]:
#fit the data onto the models
def train(clf, features, targets):    
    clf.fit(features, targets)

def predict(clf, features):
    return (clf.predict(features))

In [25]:
pred_scores_word_vectors = []
for k,v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test , pred)]))

## **Accuracy scores using the TF-IDF features**

In [26]:
pred_scores_word_vectors

[('SVC', [0.9892344497607656]),
 ('KN', [0.9617224880382775]),
 ('NB', [0.9940191387559809]),
 ('DT', [0.9677033492822966]),
 ('LR', [0.9688995215311005]),
 ('RF', [0.9856459330143541])]
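
Because only about 13% of the messages are spam, accuracy alone can hide missed spam. Precision and recall for the spam class give a fuller picture. The sketch below uses made-up labels and predictions, not the notebook's results, to show a case where accuracy is high while half of the spam is missed.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one of the two spam messages is missed

acc = accuracy_score(y_true, y_pred)    # share of correct predictions
prec = precision_score(y_true, y_pred)  # of the predicted spam, how much is spam
rec = recall_score(y_true, y_pred)      # of the actual spam, how much was caught
print(acc, prec, rec)
```

The same three functions could be applied to `y_test` and each model's predictions, provided the positive label matches how the labels are encoded.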

## **Model predictions**





In [54]:
# Report whether a predicted label means spam
def find(x):
    # predict() returns an array; its first element is 1 (or 'spam' if labels were not encoded)
    if x[0] in (1, 'spam'):
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

In [57]:
newtext = ["win FREE mobiles and play games"]
integers = vectorizer.transform(newtext)

In [58]:
x = mnb.predict(integers)
find(x)        

Message is SPAM
