## Installing and importing required libraries

In [1]:
!pip install -U gensim

Collecting gensim
  Downloading gensim-4.1.0-cp38-cp38-win_amd64.whl (24.0 MB)
Collecting Cython==0.29.23
  Downloading Cython-0.29.23-cp38-cp38-win_amd64.whl (1.7 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: Cython, smart-open, gensim
  Attempting uninstall: Cython
    Found existing installation: Cython 0.29.21
    Uninstalling Cython-0.29.21:
      Successfully uninstalled Cython-0.29.21
Successfully installed Cython-0.29.23 gensim-4.1.0 smart-open-5.2.1


In [1]:
!pip install -U keras

Collecting keras
  Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: Keras 2.4.3
    Uninstalling Keras-2.4.3:
      Successfully uninstalled Keras-2.4.3
Successfully installed keras-2.6.0


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
import gensim.downloader as api
import gensim
from tensorflow import keras


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
import string
import re

_The command below downloads pre-trained 100-dimensional GloVe word vectors (trained on Wikipedia and Gigaword text) through the gensim downloader. The file is about 128 MB, so it will take some time._

In [2]:
w2v_wiki = api.load('glove-wiki-gigaword-100')

In [8]:
#seeing a vector for the word "victor"
w2v_wiki['victor']

array([ 0.34364  , -0.046794 ,  0.18192  , -0.050262 ,  0.87526  ,
       -0.33039  ,  0.78294  , -0.23179  , -0.35979  , -1.0337   ,
        0.40359  ,  0.48397  , -0.72324  ,  0.046765 ,  0.26511  ,
       -0.33097  ,  0.69357  , -0.045868 ,  0.2698   ,  0.49527  ,
        0.13056  , -0.095836 ,  0.99915  ,  0.097056 , -0.46398  ,
        0.40895  ,  0.072931 , -0.37482  ,  1.0035   ,  0.73496  ,
       -0.10795  , -0.076917 , -0.061385 ,  0.29896  , -0.099562 ,
       -0.4148   ,  0.1317   ,  0.26688  ,  0.038517 , -1.0769   ,
        0.66625  , -0.24342  , -0.047666 , -0.3902   ,  0.14802  ,
       -0.29275  , -0.59396  , -0.91602  , -0.28666  ,  0.75782  ,
        0.81966  , -0.19011  , -0.24243  ,  0.43011  ,  0.64499  ,
       -1.4213   , -0.60807  ,  0.6863   , -0.018351 , -0.34212  ,
        0.21337  ,  0.15804  , -0.29021  ,  0.26644  ,  0.31479  ,
       -0.33131  ,  0.20754  ,  0.5988   , -0.7563   ,  0.55374  ,
        0.19392  ,  0.11201  , -0.46302  , -0.90231  ,  0.5045

_Measuring the distance between two related words. Comparing the two results below, "vector" and "scalar" are closer to each other than "vector" and "victor", which suggests that the embedding captures the context in which words are used and not just how similar their spellings are._

In [9]:
# Taking the mean of the element-wise difference as a rough single-number proxy,
# because a 100-dimensional difference is hard to read directly.
print(np.mean(w2v_wiki['vector'] - w2v_wiki['scalar']))
print(np.mean(w2v_wiki['vector'] - w2v_wiki['victor']))

0.09825662
0.15752597
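
The mean of an element-wise difference is only a rough indicator, because positive and negative components can cancel each other out. A more common way to compare embeddings is cosine similarity. The sketch below shows the computation with NumPy on small made-up vectors (they are not values from the downloaded model), and `cosine_similarity` is a helper name chosen for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors, invented for illustration only
v_vector = np.array([1.0, 2.0, 0.0, 1.0])
v_scalar = np.array([1.0, 1.5, 0.5, 1.0])
v_victor = np.array([-1.0, 0.5, 2.0, -0.5])

sim_related = cosine_similarity(v_vector, v_scalar)
sim_unrelated = cosine_similarity(v_vector, v_victor)
print(sim_related, sim_unrelated)
```

With the loaded model, `w2v_wiki.similarity('vector', 'scalar')` should return the same kind of score directly.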


### Checking which words our w2v model has learned

_We look up the vector of every word in X_train that the model has learned and collect them into one array per text._

_These texts differ in length, so their arrays of word vectors do too, while ML algorithms expect every input to have the same number of features. We therefore average all word vectors in a text to get a single 100-dimensional vector representing it. This gives 100 features, with each row holding the vector for one text._
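
A minimal sketch of the averaging step, using a hypothetical three-dimensional embedding table in place of a trained model. The names `toy_vectors` and `text_to_vector` are made up for this example. Texts with no known words fall back to a zero vector so that every row keeps the same length.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a real model would supply 100-dimensional ones
toy_vectors = {
    "free":  np.array([1.0, 0.0, 2.0]),
    "entry": np.array([3.0, 2.0, 0.0]),
    "win":   np.array([2.0, 4.0, 1.0]),
}

def text_to_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

texts = [["free", "entry", "win"], ["free", "unknownword"], ["unknownword"]]
features = np.vstack([text_to_vector(t, toy_vectors, dim=3) for t in texts])
print(features.shape)  # one fixed-length row per text
```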

## Implementation using the _doc2vec_ model
- The d2v model converts each doc (a text, string or paragraph) into an n-dimensional vector.
- This is more convenient than the w2v model, where we have to average the vectors of all words in a doc to get a single representation of it.
- The d2v model builds the doc vector in a more sophisticated way than simply averaging word vectors, so it may prove stronger than the w2v approach.

-----------------------------------------------

- _First we have to tag the documents/texts so that the d2v model can tell the individual docs apart during training._
- _The simplest way is to use the index of each text as its tag, but other tagging schemes work as well._

- We pass the tag as a list because TaggedDocument expects a list of tags for each document.

- Words that occur fewer than min_count times in the training texts are dropped from the vocabulary, so the model has no vector for them.

- Below we are retrieving the vectors for each text in X_test.
- Note: in the w2v model we stored the vectors in arrays because we had to take an element-wise mean over them. Here we don't need that, so we store them directly in lists.
- infer_vector is not deterministic, i.e. it won't return exactly the same vector every time you run it on the same text.

## Defining a function to clean the text messages
- Removes punctuation and lowercases the text
- Splits the text into tokens on runs of non-word (special) characters
- Removes stopwords

In [2]:
stopwords = nltk.corpus.stopwords.words('english')

In [3]:
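def clean_text(text):
    # Drop punctuation character by character and lowercase the rest
    t1 = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split into tokens on runs of non-word characters (raw string avoids an invalid-escape warning)
    t2 = re.split(r'\W+', t1)
    # Remove English stopwords
    clean_txt = [word for word in t2 if word not in stopwords]
    return clean_txt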
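
A small, self-contained variant of the cleaning step for experimentation. It uses a short stand-in stopword set (`demo_stopwords`, an assumption made so the example runs without the NLTK corpus) and an optional `drop_empty` flag, which is not part of the function above. The flag illustrates that `re.split` can leave an empty string when a message ends in whitespace or punctuation.

```python
import re
import string

# Small stand-in stopword list so the example runs without downloading the NLTK corpus
demo_stopwords = {"i", "he", "to", "the", "a", "so"}

def clean_text_demo(text, stopwords=demo_stopwords, drop_empty=True):
    # Strip punctuation character by character and lowercase
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters
    tokens = re.split(r"\W+", no_punct)
    # Drop stopwords and, optionally, the empty strings re.split can leave at the edges
    return [t for t in tokens if t not in stopwords and (t or not drop_empty)]

print(clean_text_demo("Ok lar... Joking wif u oni..."))
print(clean_text_demo("See you soon! ", drop_empty=False))
```

Filtering out the empty tokens is a design choice: left in, they become an extra "word" that every downstream vectorizer has to deal with.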
    

## Reading in the data and applying the cleaning function

In [4]:
messages = pd.read_csv('spam.csv', encoding = 'latin-1')
messages.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis = 1, inplace=True)
messages.columns = ['labels', 'text']

messages.labels = np.where(messages['labels'] == 'spam', 1,0)

messages['txt_clean'] = messages['text'].apply(lambda x : clean_text(x) )
messages.head()

| | labels | text | txt_clean |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | [go, jurong, point, crazy, available, bugis, n... |
| 1 | 0 | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, u, oni] |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, 2, wkly, comp, win, fa, cup, fin... |
| 3 | 0 | U dun say so early hor... U c already then say... | [u, dun, say, early, hor, u, c, already, say] |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | [nah, dont, think, goes, usf, lives, around, t... |


## Splitting data into train and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(messages['txt_clean'],messages['labels'], test_size=0.2)

## Preparing data for different models

- Three different vectorization models will be used. TF-IDF from sklearn, word2vec and doc2vec from gensim.


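_X_train already holds lists of tokens rather than raw strings, so TfidfVectorizer's default tokenization cannot be used on it. We pass an identity function as the analyzer so that each pre-cleaned token list is used as is. (Passing clean_text again would re-join the tokens of each message into one long string.)_

In [85]:
# X_train is already tokenized by clean_text, so the analyzer just returns the tokens
tfidf = TfidfVectorizer(analyzer = lambda tokens: tokens)
tfidf_fit = tfidf.fit(X_train)
X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)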
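
To make the weighting less of a black box, here is a hand-rolled sketch on three toy token lists. It follows the smoothed formula documented for scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`), namely idf = ln((1 + n) / (1 + df)) + 1. The toy documents and variable names are made up for illustration, and the sketch is not a replacement for the vectorizer.

```python
import numpy as np

docs = [["free", "entry", "win"], ["ok", "lar"], ["free", "call", "free"]]

vocab = sorted({tok for doc in docs for tok in doc})
index = {tok: j for j, tok in enumerate(vocab)}

# Raw term counts, one row per document
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for tok in doc:
        counts[i, index[tok]] += 1

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight counts by idf, then scale every row to unit Euclidean length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)
print(vocab)
print(tfidf_manual.round(3))
```

Words that appear in fewer documents get a larger idf, so a rare word contributes more per occurrence than a common one.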

_For word2vec_

In [7]:
w2v = gensim.models.Word2Vec(X_train, vector_size=100, window=5, min_count=2)

In [13]:
# Keep a plain list of per-message arrays: messages differ in length, so they cannot form one rectangular array
X_train_vect = [np.array([w2v.wv[i] for i in ls if i in w2v.wv.key_to_index]) for ls in X_train]
X_test_vect = [np.array([w2v.wv[i] for i in ls if i in w2v.wv.key_to_index]) for ls in X_test]



In [12]:
def avg_vect(w2v_vect):
    
    vect_avg = []

    for vect in w2v_vect:
        if len(vect) != 0:
            vect_avg.append(vect.mean(axis=0))
        else:
            vect_avg.append(np.zeros(100))
    return vect_avg

In [19]:
X_train_w2v_avg = avg_vect(X_train_vect)
X_test_w2v_avg = avg_vect(X_test_vect)

_For doc2vec_

In [54]:
tag_train = [gensim.models.doc2vec.TaggedDocument(v,[i]) for i,v in enumerate(X_train)]
tag_test = [gensim.models.doc2vec.TaggedDocument(v,[i]) for i,v in enumerate(X_test)]

In [62]:
tag_train

[TaggedDocument(words=['lasting', 'much', '2', 'hours', 'might', 'get', 'lucky'], tags=[0]),
 TaggedDocument(words=['nope', 'waiting', 'sch', '4', 'daddy', ''], tags=[1]),
 TaggedDocument(words=['sorry', 'missed', 'call', 'please', 'call', 'back'], tags=[2]),
 TaggedDocument(words=['dont', 'file', 'bagi', 'work', 'called', 'mei', 'tell', 'find', 'anything', 'room'], tags=[3]),
 TaggedDocument(words=['cool', 'come', 'havent', 'wined', 'dined'], tags=[4]),
 TaggedDocument(words=['one', 'joys', 'lifeis', 'waking', 'daywith', 'thoughts', 'somewheresomeone', 'cares', 'enough', 'tosend', 'warm', 'morning', 'greeting', ''], tags=[5]),
 TaggedDocument(words=['okies', 'ill', 'go', 'yan', 'jiu', 'skip', 'ard', 'oso', 'go', 'cine', 'den', 'go', 'mrt', 'one', 'blah', 'blah', 'blah', ''], tags=[6]),
 TaggedDocument(words=['latest', 'news', 'police', 'station', 'toilet', 'stolen', 'cops', 'nothing', 'go'], tags=[7]),
 TaggedDocument(words=['ìï', 'log', '4', 'wat', 'sdryb8i'], tags=[8]),
 TaggedDocum

In [51]:
X_test[0:5]

3329                          [send, yettys, number, pls]
1632    [hello, little, party, animal, thought, id, bu...
3235    [aight, text, youre, back, mu, ill, swing, nee...
3082                           [kkhow, training, process]
1264                                    [see, half, hour]
Name: txt_clean, dtype: object

In [56]:
d2v = gensim.models.Doc2Vec(tag_train, vector_size = 50, window = 2, min_count = 2 )

In [63]:
X_train_d2v = [d2v.infer_vector(text.words) for text in tag_train]
X_test_d2v = [d2v.infer_vector(text.words) for text in tag_test]

In [64]:
print(X_train_d2v[0])

[ 0.00210781 -0.01025606  0.00424152  0.00024143 -0.00917417 -0.00439115
  0.00161037  0.02572165 -0.02221104 -0.01134223  0.00095406 -0.02488241
 -0.00110105  0.01128565 -0.00049442  0.00772907  0.02609073  0.00314334
 -0.01531209 -0.01115839 -0.00600235  0.01197011  0.01796303 -0.01073938
  0.01127774  0.00446393 -0.00327941 -0.00060531 -0.00953925 -0.00706836
 -0.00897983 -0.0018637  -0.0016146   0.011136   -0.00790186  0.00457287
  0.00237767 -0.00195308  0.01756881 -0.0118155   0.02458095 -0.0097932
  0.00855546  0.01297083  0.01960312 -0.0118759  -0.00112738 -0.01837532
  0.00637462  0.01632589]


## Applying Random Forest Classifier

In [65]:
rf_d2v = RandomForestClassifier()
rf_d2v_model = rf_d2v.fit(X_train_d2v,y_train.values.ravel())

In [66]:
y_pred = rf_d2v_model.predict(X_test_d2v)

In [69]:
precision  = precision_score(y_test,y_pred)
recall  = recall_score(y_test,y_pred)

## Applying LSTM

In [79]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_tokenized = tokenizer.texts_to_sequences(X_train)
X_test_tokenized = tokenizer.texts_to_sequences(X_test)

In [91]:
len(tokenizer.index_word)

8328

In [81]:
X_train_padded = pad_sequences(X_train_tokenized, 50)
X_test_padded = pad_sequences(X_test_tokenized, 50)
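
What the padding step does can be sketched in plain Python. The helper `pad_pre` is a hypothetical stand-in, written on the assumption that `pad_sequences` is used with its default `padding='pre'` and `truncating='pre'`: short sequences get zeros on the left, and long ones keep only their last `maxlen` tokens.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 8, 2], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

Index 0 is reserved for padding because the tokenizer's word indices start at 1, which is why the embedding layer later needs `len(tokenizer.index_word) + 1` rows.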

_Defining custom precision and recall metric functions_

In [87]:
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Dense,Embedding,LSTM
from tensorflow.keras.models import Sequential

In [104]:
def precision_m(y_true,y_pred):
    
    tp = K.sum(K.round(K.clip(y_true*y_pred,0,1))) #true positives
    pp = K.sum(K.round(K.clip(y_pred,0,1))) #predicted positives
    precision = tp/(pp+K.epsilon()) #epsilon avoids division by zero
    return precision

def recall_m(y_true,y_pred):
    
    tp = K.sum(K.round(K.clip(y_true*y_pred,0,1))) #true positives
    ap = K.sum(K.round(K.clip(y_true,0,1))) #actual positives
    recall = tp/(ap+K.epsilon()) #epsilon avoids division by zero
    return recall
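
To sanity-check metrics like these outside of training, the same arithmetic can be mirrored in NumPy. This is a sketch under assumptions: `EPS` stands in for `K.epsilon()` (whose default is 1e-7) and is added to the denominator so that a batch without positives yields 0 instead of a division by zero. The labels and probabilities are made up.

```python
import numpy as np

EPS = 1e-7  # assumed here to play the role of K.epsilon()

def precision_recall_np(y_true, y_prob):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.round(np.clip(y_prob, 0, 1))      # threshold probabilities at 0.5
    tp = np.sum(y_true * y_pred)                  # true positives
    precision = tp / (np.sum(y_pred) + EPS)       # tp / predicted positives
    recall = tp / (np.sum(y_true) + EPS)          # tp / actual positives
    return precision, recall

y_true_demo = [1, 0, 1, 1, 0]
y_prob_demo = [0.9, 0.8, 0.3, 0.7, 0.1]
p, r = precision_recall_np(y_true_demo, y_prob_demo)
print(round(p, 3), round(r, 3))
```

Keep in mind that Keras averages such metrics over batches, so the values logged during training can differ slightly from a computation over the full dataset.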

In [105]:
model = Sequential()

model.add(Embedding(len(tokenizer.index_word)+1,32))
model.add(LSTM(32, dropout = 0, recurrent_dropout = 0 ))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, None, 32)          266528    
_________________________________________________________________
lstm_5 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_10 (Dense)             (None, 32)                1056      
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 33        
Total params: 275,937
Trainable params: 275,937
Non-trainable params: 0
_________________________________________________________________
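
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The variable names are chosen for this illustration.

```python
vocab_size = 8328 + 1      # tokenizer vocabulary plus the padding index 0
embed_dim = 32
lstm_units = 32
dense_units = 32

embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
```

Almost all of the weights sit in the embedding layer, so the vocabulary size is the main lever on model size here.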


In [106]:
model.compile(optimizer = 'adam',
             loss = 'binary_crossentropy',
             metrics = ['accuracy',precision_m,recall_m])

In [107]:
history = model.fit(X_train_padded, y_train, batch_size = 32, epochs = 5,
                   validation_data = (X_test_padded,y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
