# Project overview
* Build a spam classification model 
* Step_1 : Text preprocessing (tokenization, stopword removal, lemmatization and stemming)
* Step_2 : Text to vectors: BOW, TFIDF, Word2Vec, AvgWord2Vec 
* Step_3 : Check the accuracy 

# Import important libraries

In [1]:
import pandas as pd
import numpy as np

# Import Data

In [2]:
data_path = "C:\\Personal\\Carrier Path\\Data_Scientist\\Advanced Phase\\NLP\\"
data = pd.read_csv(data_path + "spam.csv" ,  encoding="ISO-8859-1")

* Two encodings were tried when reading the file: 1) encoding="utf-8" 2) encoding="ISO-8859-1"
* Here "utf-8" did not work because it could not decode some data points (some bytes in the file are not valid UTF-8). "ISO-8859-1" (Latin-1) maps every byte to a character, so the file was read without errors.

**Ques: Understand the various encoding schemes. Why do we use them? What is the difference between these 2 encoding schemes?**
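
One way to see the difference between the two encodings is to decode the same bytes with each. The sketch below uses made-up bytes (not taken from spam.csv) in which the pound sign is stored as the single Latin-1 byte `0xA3`. Decoding as UTF-8 fails because that byte alone is not a valid UTF-8 sequence, while ISO-8859-1 accepts any byte.

```python
# The pound sign stored as a single Latin-1 byte (illustrative bytes, not from spam.csv)
raw = b"Win \xa3100 now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0xA3 on its own is not a valid UTF-8 sequence

latin1_text = raw.decode("ISO-8859-1")  # every byte maps to exactly one character

print(utf8_ok)
print(latin1_text)
# The same character needs two bytes in UTF-8 and one byte in ISO-8859-1
print("£".encode("utf-8"), "£".encode("ISO-8859-1"))
```

Note that ISO-8859-1 never raises an error, so it can also silently produce wrong characters if the file was actually written in another encoding.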

In [3]:
data.drop(columns = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace = True)
data.columns = ['Label', 'Message']

In [4]:
data.head()

Unnamed: 0,Label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
data.shape

(5572, 2)

# Text preprocessing

**Note: Make sure that the BOW vocabulary is learned from the training data only and not from the test data. Hence, split the data into training and test sets first, and then continue with the exercise.**
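
A minimal sketch of the order described in the note, assuming a toy list of already cleaned messages (`texts` and `labels` are hypothetical stand-ins for the corpus and the target). The text is split first, the vectorizer is fitted on the training part only, and the test part is transformed with the vocabulary learned from training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy messages and labels (hypothetical, standing in for the cleaned corpus and y)
texts = ["free entri win cup", "ok lar joke", "win free prize txt",
         "say earli alreadi", "free txt rate appli", "nah think goe usf"]
labels = [1, 0, 1, 0, 1, 0]

# 1) Split the raw text first
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=42)

# 2) Learn the vocabulary from the training messages only
cv_split = CountVectorizer(binary=True)
X_train = cv_split.fit_transform(text_train)

# 3) Reuse that vocabulary for the test messages (words unseen in training are ignored)
X_test = cv_split.transform(text_test)

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, because the test set is encoded with the training vocabulary.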

In [6]:
data["Message"].loc[100]

'Okay name ur price as long as its legal! Wen can I pick them up? Y u ave x ams xx'

In [7]:
data.head()

Unnamed: 0,Label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
import re
import nltk

In [9]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [10]:
ps = PorterStemmer()

In [11]:
# Stemming
stop_words = set(stopwords.words('english'))  # build the lookup once; needs nltk.download('stopwords')
corpus = []
for i in range(len(data)):
    review = re.sub('[^a-zA-Z]', ' ', data['Message'][i])  # replace every non-letter character with a space
    review = review.lower()                                # lowercase the text
    review = review.split()                                # split on whitespace into words

    review = [ps.stem(word) for word in review if word not in stop_words]  # drop stopwords, stem the rest
    review = ' '.join(review)
    corpus.append(review)

In [12]:
corpus[0:5]

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though']

# Bag of words

In [13]:
# Creating the BOW model 
from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(max_features = 2500 , binary = True , ngram_range= (1,2))
X = cv.fit_transform(corpus).toarray()

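* `max_features = 2500` caps the vocabulary size: if the corpus contains more than 2500 distinct terms, only the 2500 terms with the highest frequency across the corpus are kept. Because `ngram_range = (1,2)`, a term can be a single word or a pair of consecutive words.
* `binary = True` records only whether a term occurs in a message (1 or 0), not how many times.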

In [14]:
print(X.shape , type(X))

(5572, 2500) <class 'numpy.ndarray'>


In [16]:
# Target column: one-hot encode 'Label' and keep the 'spam' column (spam -> 1, ham -> 0)
y = pd.get_dummies(data['Label'])
y = y.iloc[:,1].values
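
The `get_dummies` route relies on the columns being sorted alphabetically, so that position 1 is `spam`. In recent pandas versions it also returns booleans rather than integers. A more explicit alternative is sketched below on toy labels (not the real dataset); both routes give the same target.

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "ham", "spam", "ham"], name="Label")  # toy labels

# Explicit comparison: does not depend on column order or on the pandas version
y_explicit = (labels == "spam").astype(int).values

# get_dummies route: columns are sorted, so position 1 is 'spam'; cast to int for 0/1
y_dummies = pd.get_dummies(labels).iloc[:, 1].astype(int).values

print(y_explicit)
print(y_dummies)
```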

In [17]:
# train test split
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test , y_train, y_test = train_test_split(X , y, test_size = 0.2, random_state= 42)

In [19]:
print(X_train.shape, X_test.shape , y_train.shape, y_test.shape)

(4457, 2500) (1115, 2500) (4457,) (1115,)


**Hence we can see that there are 2500 features**

## Modelling

In [25]:
from sklearn.naive_bayes import MultinomialNB  

In [26]:
spam_detect_model = MultinomialNB().fit(X_train, y_train)

In [27]:
# prediction
y_pred =spam_detect_model.predict(X_test)

In [28]:
from sklearn.metrics import accuracy_score , classification_report
score = accuracy_score(y_test, y_pred)
print(score)

0.9811659192825112
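
Accuracy alone can hide missed spam when ham messages outnumber spam messages. The cell above already imports `classification_report`; the sketch below shows how it and a confusion matrix could be read, using made-up labels rather than the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels (hypothetical): 1 = spam, 0 = ham; one of the two spam messages is missed
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)   # 5 of 6 predictions are correct
cm = confusion_matrix(y_true, y_hat)  # rows = true class, columns = predicted class

print(acc)
print(cm)
print(classification_report(y_true, y_hat, target_names=["ham", "spam"]))
```

Here the accuracy is about 0.83, yet the recall for spam is only 0.5, which the report makes visible.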


# TFIDF Model 

In [29]:
# Creating the model 
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=2500, ngram_range=(1,2))
X = tv.fit_transform(corpus).toarray()

In [30]:
X_train, X_test , y_train, y_test = train_test_split(X , y, test_size = 0.2, random_state= 44)

### Model: MultinomialNB 

In [33]:
spam_detect_model = MultinomialNB().fit(X_train, y_train)

In [34]:
y_pred =spam_detect_model.predict(X_test)

In [35]:
score = accuracy_score(y_test, y_pred)
print(score)

0.9739910313901345


### Model: RandomForest 

In [36]:
from sklearn.ensemble import RandomForestClassifier

In [37]:
rf = RandomForestClassifier()

**We used the TFIDF data for random forest**

In [38]:
spam_detect_model_rf = rf.fit(X_train, y_train)

In [39]:
y_pred = spam_detect_model_rf.predict(X_test)

In [40]:
score = accuracy_score(y_test, y_pred)

In [41]:
print(score)

0.9766816143497757


# Word2Vec

In [45]:
# pre trained model 
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')



In [43]:
# Lemmatization
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()

In [44]:
stop_words = set(stopwords.words('english'))  # build the lookup once instead of once per word
corpus = []
for i in range(len(data)):
    review = re.sub('[^a-zA-Z]', ' ', data['Message'][i])  # replace every non-letter character with a space
    review = review.lower()                                # lowercase the text
    review = review.split()                                # split on whitespace into words

    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]  # lemmatization
    review = ' '.join(review)
    corpus.append(review)

In [46]:
from nltk import tokenize , sent_tokenize
from gensim.utils import simple_preprocess 

In [47]:
words=[]
for doc in corpus:
    sent_token = sent_tokenize(doc)            # split each message into sentences
    for sent in sent_token:
        words.append(simple_preprocess(sent))  # lowercase and tokenize each sentence

In [None]:
# rough
sent = corpus[0]
sent_token = sent_tokenize(sent)
sent_token

In [None]:
# rough
group = []
for sent in sent_token:
    group.append(simple_preprocess(sent)) 

In [None]:
# rough
group

In [53]:
words[0]

['go',
 'jurong',
 'point',
 'crazy',
 'available',
 'bugis',
 'great',
 'world',
 'la',
 'buffet',
 'cine',
 'got',
 'amore',
 'wat']

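**What does the function `simple_preprocess(sent)` do?**
* It is a lightweight gensim helper that converts a string into a list of tokens.
* It does the following things:
* Tokenization: the text is split into alphabetic tokens, so punctuation, numbers and extra whitespace are dropped
* Lowercasing
* Length filtering: tokens shorter than `min_len` (default 2) or longer than `max_len` (default 15) are discarded, which is why the single-letter tokens `n` and `e` are missing from `words[0]`
* Optional accent removal when `deacc=True`
* It does not remove stopwords. Here they were already removed in the lemmatization loop.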

In [49]:
import gensim