
In [1]:
import nltk
#nltk.download()

In [2]:
#dir(nltk)

In [3]:

nltk.download("stopwords")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
#what can i do with nltk?

from nltk.corpus import stopwords
stopwords.words('english')[0:5]

['i', 'me', 'my', 'myself', 'we']

In [5]:
#let's look at additional words later in the list
stopwords.words('english')[0:500:25]

['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']

NLP Basics: reading in text data.
Read in Semi-structured text data.

In [6]:
#read in and view the raw data
import pandas as pd
messages = pd.read_csv('https://raw.githubusercontent.com/Gaukhar-ai/NLP/main/spam.csv', encoding = "latin-1") #other encoding = "cp1252"
#encoding = "ISO-8859-1", 'utf-8'
messages.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [7]:
messages = messages.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1) #axis=1 drops columns, axis=0 drops rows
messages.columns = ['label', 'text']
messages.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
#how big is the dataset?
messages.shape

(5572, 2)

In [9]:
#what portion of our text messages are actually spam?
messages['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [10]:
#are we missing any data?
print('Number of nulls in label: {}'.format(messages['label'].isnull().sum()))
print('Number of nulls in text: {}'.format(messages['text'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in text: 0


NLP Basics: Implementing a Pipeline to Clean Text
Pre-processing Text Data
Cleaning up the text data is necessary to highlight the attributes that we want the ML system to pick up on. We will explore 3 pre-processing steps in this lesson:

1. remove punctuation
2. tokenization
3. remove stopwords

In [11]:
#read in raw data and clean up the column names:
import pandas as pd
pd.set_option('display.max_colwidth', 100) #show up to 100 characters per column so more of each message is visible

messages = pd.read_csv('https://raw.githubusercontent.com/Gaukhar-ai/NLP/main/spam.csv', encoding = "latin-1")
messages = messages.drop(labels = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages.columns = ['label', 'text']
messages.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


REMOVE PUNCTUATION

In [12]:
#what punctuation is included in the default list?
import string #the string module provides a ready-made punctuation constant
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
#why is it important to remove punctuation? Python treats '.' and ',' as ordinary characters, so otherwise identical strings no longer match
'This message is spam' == 'This message is spam.'


False

In [14]:
#define a function to remove punctuation in our messages

def remove_punct(text):
  text = ''.join([char for char in text if char not in string.punctuation]) #'' join on nothing.
  return text

messages['text_clean'] = messages['text'].apply(lambda x: remove_punct(x)) #lambda to apply that function
messages.head()

Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though


TOKENIZE

In [15]:
#define a function to split our sentences into a list of words
import re #regular expression module

def tokenize(text):
  tokens = re.split(r'\W+', text) #r'\W+' is a regex matching one or more non-word characters (whitespace, special characters, etc.); the text is split on them
  return tokens

messages['text_tokenized'] = messages['text_clean'].apply(lambda x: tokenize(x.lower()))
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"


REMOVE STOPWORDS

In [16]:
#what does an example look like?

tokenize('I am learning NLP'.lower())


['i', 'am', 'learning', 'nlp']

In [17]:
#load the list of stopwords built into nltk
import nltk

stopwords = nltk.corpus.stopwords.words('english') 
#stopwords: the, but, a, i, am, etc.

In [18]:
#Define a function to remove all stopwords
def remove_stopwords(tokenized_text):
  text = [word for word in tokenized_text if word not in stopwords]
  return text

messages['text_nostop'] = messages['text_tokenized'].apply(lambda x: remove_stopwords(x))
messages.head()

Unnamed: 0,label,text,text_clean,text_tokenized,text_nostop
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"


In [19]:
#remove stopwords in our example
remove_stopwords(tokenize('I am learning NLP'.lower()))

['learning', 'nlp']

The steps above make up the pre-processing pipeline.

Term Frequency - Inverse Document Frequency (TF-IDF)
 - creates a document-term matrix: one row per document, one column per word in the corpus
 - generates a weighting for each word/document pair, intended to reflect how important a given word is to the document in the context of its frequency within the larger corpus.

W(ij) = TF(ij) * log(N / DF(i))
W(ij) = weight of word i in doc j
TF(ij) = number of times word i occurs in doc j, divided by the total number of terms in j
DF(i) = number of docs containing word i
N = total number of docs

Example: the message "I like NLP" (logs are base 10, N = 5572 messages):

TF(i)    = 1/3,  log(N/DF(i))    = log(5572/2690) = 0.32

TF(like) = 1/3,  log(N/DF(like)) = log(5572/922)  = 0.78

TF(nlp)  = 1/3,  log(N/DF(nlp))  = log(5572/1)    = 3.75

Now let's see the weight of each word:

W(i)    = 1/3 * 0.32 = 0.11
W(like) = 1/3 * 0.78 = 0.26
W(nlp)  = 1/3 * 3.75 = 1.25
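
The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch of the formula as written in these notes: the helper name `tfidf_weights` is made up for illustration, and the document frequencies are the ones quoted in the example.

```python
import math

N = 5572  # total number of messages (from messages.shape)
doc_freq = {"i": 2690, "like": 922, "nlp": 1}  # DF values quoted in the example
tokens = ["i", "like", "nlp"]  # the message "I like NLP", lowercased and tokenized

def tfidf_weights(tokens, doc_freq, n_docs):
    """Return W = TF * log10(N / DF) for every distinct word in one message."""
    weights = {}
    for word in set(tokens):
        tf = tokens.count(word) / len(tokens)          # term frequency in this message
        idf = math.log10(n_docs / doc_freq[word])      # inverse document frequency
        weights[word] = tf * idf
    return weights

weights = tfidf_weights(tokens, doc_freq, N)
for word in tokens:
    print(word, round(weights[word], 2))
```

Note that scikit-learn's TfidfVectorizer uses a different variant of the formula (natural log, smoothing, and L2 normalization of each row), so the values it produces will not match these hand-computed weights.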


CREATE Function to Clean Text 

In [20]:
#define a function to handle all data cleaning
def clean_text(text):
  text = ''.join([char.lower() for char in text if char not in string.punctuation]) #drop punctuation and lowercase
  tokens = re.split(r'\W+', text) #split on non-word characters
  text = [word for word in tokens if word not in stopwords] #drop stopwords
  return text

Apply TFIDFVectorizer

In [21]:
#fit a basic TFIDF Vectorizer and view the results
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(messages['text'])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names()) #get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()

(5572, 9395)


In [22]:
#how is the output of the TfidfVectorizer stored?
#sparse matrix 
X_tfidf

<5572x9395 sparse matrix of type '<class 'numpy.float64'>'
	with 50453 stored elements in Compressed Sparse Row format>
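
To see why a sparse format is used here, the sketch below computes how full the matrix is from the numbers printed above, then builds the three arrays that the Compressed Sparse Row layout keeps for a tiny made-up matrix. The `to_csr` helper and the `toy` matrix are hypothetical and for illustration only; SciPy's real implementation is more involved.

```python
# Density of the TF-IDF matrix reported above: stored values / total cells.
rows, cols, stored = 5572, 9395, 50453
density = stored / (rows * cols)
print(f"{density:.4%} of the cells are non-zero")

def to_csr(dense):
    """Convert a list-of-lists matrix into the three CSR arrays (toy version)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts inside data
    return data, indices, indptr

# Made-up 3x4 matrix with mostly zeros, like a TF-IDF matrix in miniature.
toy = [[0.0, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.7, 0.0, 0.0, 0.2]]
data, indices, indptr = to_csr(toy)
print(data, indices, indptr)
```

Only the non-zero values and their positions are kept, which is why a matrix with over 52 million cells can be stored with about 50 thousand entries.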

NLP Basics: Building a Basic Random Forest Model on Top of Vectorized Text

Read In and Clean Text

In [23]:
#read in, clean, vectorize data
import nltk
import pandas as pd
import re 
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')

messages = pd.read_csv('https://raw.githubusercontent.com/Gaukhar-ai/NLP/main/spam.csv', encoding = "latin-1")
messages = messages.drop(labels = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
messages.columns = ['label', 'text']

def clean_text(text):
  text = ''.join([char.lower() for char in text if char not in string.punctuation]) #drop punctuation and lowercase
  tokens = re.split(r'\W+', text) #split on non-word characters
  text = [word for word in tokens if word not in stopwords] #drop stopwords
  return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(messages['text'])

X_features = pd.DataFrame(X_tfidf.toarray())
X_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,9355,9356,9357,9358,9359,9360,9361,9362,9363,9364,9365,9366,9367,9368,9369,9370,9371,9372,9373,9374,9375,9376,9377,9378,9379,9380,9381,9382,9383,9384,9385,9386,9387,9388,9389,9390,9391,9392,9393,9394
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Explore RandomForestClassifier Attributes & Hyperparams

In [24]:
#import RF for classification from sklearn
from sklearn.ensemble import RandomForestClassifier

In [25]:
#View the arguments (and default values) for RandomForestClassifier
print(RandomForestClassifier())
#these are hyperparameters, e.g. max_depth=None (no limit on tree depth) and n_estimators=100 (100 decision trees that vote to determine the final prediction)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)


Explore RandomForestClassifier on a Holdout Set

In [26]:
#import the methods that will be needed to evaluate a basic model
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

In [27]:
#split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_features, messages['label'], test_size=0.2)

In [28]:
#fit a basic Random Forest model
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)

In [29]:
#make predictions on the test set using the fit model
y_pred = rf_model.predict(X_test)

In [30]:
#evaluate model predictions using precision and recall
precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
print('Precision: {} / Recall: {}'.format(round(precision, 3), round(recall, 3)))

Precision: 1.0 / Recall: 0.799


In [31]:
#precision 1.0 = every message the model flagged as spam really was spam
#recall 0.799 = of all the spam that came in, about 80% was properly placed in the spam folder; the other 20% reached the inbox
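
To make the two metrics concrete, here is a minimal sketch that computes them by hand from counts of true positives, false positives, and false negatives. The labels are made up (they are not the notebook's test set) and the helper `precision_recall` is hypothetical; in the notebook itself, `precision_score` and `recall_score` from scikit-learn do this work.

```python
def precision_recall(y_true, y_pred, pos_label="spam"):
    """Compute precision and recall for one positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # spam caught
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # ham flagged as spam
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 5 spam messages, 4 of them caught, no ham wrongly flagged.
y_true = ["spam", "spam", "spam", "spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print("Precision: {} / Recall: {}".format(round(precision, 3), round(recall, 3)))
```

In this toy case nothing legitimate is flagged (precision 1.0) while one spam message in five slips through (recall 0.8), which is the same pattern as the model result above.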

1. How is NLP useful in the real world?
Through spam filters, autocomplete, and autocorrect.

2. A basic random forest model needs to be fit on top of the cleaned data. How is this done?
rf = RandomForestClassifier()
rf_model = rf.fit(X_train, y_train)

3. Your system calls for vectorizing a field named text and applying the results to a document-term matrix. What are the steps to do this?
from sklearn.feature_extraction
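
One way the steps for question 3 might look, sketched on three made-up messages rather than the real `text` column. This assumes scikit-learn is installed and uses the vectorizer's default tokenizer instead of the `clean_text` analyzer defined earlier; the variable names `texts` and `doc_term_matrix` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the 'text' field of the messages DataFrame.
texts = ["free entry win", "ok see you", "win free prize"]

# Step 1: create the vectorizer (default settings, not the clean_text analyzer).
vectorizer = TfidfVectorizer()

# Step 2: learn the vocabulary and build the document-term matrix in one call.
doc_term_matrix = vectorizer.fit_transform(texts)

# One row per document, one column per distinct word.
print(doc_term_matrix.shape)

# vocabulary_ maps each learned word to its column index.
print(sorted(vectorizer.vocabulary_))
```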


word2vec is a shallow, two-layer neural network that accepts a text corpus as input and returns a set of vectors (also known as embeddings); each vector is a numeric representation of a given word.
In short, it converts words into numeric vectors.


word2vec: How to implement word2vec
Explore Pre-trained Embeddings
some other options:
1. glove-twitter-{25/50/100/200}
2. glove-wiki-gigaword-{50/200/300}
3. word2vec-google-news-300
4. word2vec-ruscorpora-news-300

In [34]:
#Install gensim
#!pip install -U gensim #-U = just upgrade it

In [35]:
#Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [37]:
#explore the word vector for 'king'
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [38]:
#Find the words most similar to 'king' based on the pretrained embeddings
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]

Train the Model

In [41]:
#read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('https://raw.githubusercontent.com/Gaukhar-ai/NLP/main/spam.csv', encoding = "latin-1")
messages = messages.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1)
messages.columns = ['label', 'text']
messages.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [42]:
#clean data using the built in cleaner in gensim
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
messages.head()

Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,..."
3,ham,U dun say so early hor... U c already then say...,"[dun, say, so, early, hor, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, don, think, he, goes, to, usf, he, lives, around, here, though]"


In [44]:
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [48]:
#train the word2vec model (gensim 4+ renames the size argument to vector_size)
w2v_model = gensim.models.Word2Vec(X_train, size=100, window=5, min_count=2)

In [49]:
#Explore the word vector for 'king' based on our trained model
w2v_model.wv['king']

array([-0.06577569,  0.04970483,  0.03738878, -0.04651305, -0.02355495,
        0.00493436, -0.04110452, -0.02830215, -0.01045995, -0.01255331,
        0.02835891, -0.06017068, -0.00155131,  0.03175579,  0.00489237,
        0.0151575 ,  0.03418704, -0.02944662, -0.01141755, -0.0620597 ,
       -0.03086465, -0.00038265, -0.00927016, -0.00051424, -0.0033051 ,
        0.00337956, -0.03100642,  0.04210328, -0.00942894,  0.05410887,
       -0.01109493,  0.00103933, -0.02420425, -0.00438388, -0.05591501,
        0.00153315,  0.03049744, -0.00900078, -0.02925864,  0.01755   ,
        0.00012719, -0.02030995,  0.02769895, -0.00079072, -0.02855062,
        0.01951789, -0.06114267,  0.01808042, -0.05182774,  0.00183947,
       -0.00814203, -0.03153287,  0.0086607 ,  0.00145464,  0.02552261,
        0.01771739,  0.00696966,  0.03301093,  0.02948623, -0.00278988,
        0.0213562 , -0.00357028,  0.01048429, -0.00345482, -0.03070234,
       -0.02911132,  0.05170259, -0.00215378, -0.02489405,  0.00

In [50]:
#Find the most similar words to 'king' based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('cost', 0.9958551526069641),
 ('silent', 0.9957934617996216),
 ('friendship', 0.9957230687141418),
 ('nokia', 0.9957073926925659),
 ('gbp', 0.9957016110420227),
 ('points', 0.9956694841384888),
 ('entry', 0.9956380128860474),
 ('service', 0.995632529258728),
 ('txt', 0.9956250190734863),
 ('extra', 0.9956110119819641)]

Prep Word Vectors

In [None]:
#Generate a list of words the word2vec model learned word vectors for
w2v_model.wv.index2word #gensim 4+ renames this attribute to index_to_key
#these are the words that appeared at least twice in the training data (min_count=2)

In [53]:
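#Generate aggregated sentence vectors based on the word vectors for each word in the sentence
words = set(w2v_model.wv.index2word) #set lookup is much faster than scanning the list for every word
w2v_vect = np.array([np.array([w2v_model.wv[i] for i in ls if i in words]) for ls in X_test], dtype=object) #dtype=object because the number of word vectors differs per sentence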

  


In [59]:
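#Why is the length of the sentence different from the length of the sentence vector?
for i, v in enumerate(w2v_vect):
  print(len(X_test.iloc[i]), len(v))
#the first number is the number of words in a text message
#the second number is the number of word vectors found for that message
#they differ when a word appeared fewer than min_count=2 times in the training data, so the model has no vector for it
#an ML model needs the same number of features for every message, so these variable-length arrays cannot be used directly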

8 8
6 5
12 12
5 5
21 21
5 5
23 17
11 10
7 7
23 22
6 6
7 5
26 26
24 19
101 101
20 15
15 13
12 10
22 21
4 2
5 5
22 19
6 6
8 8
23 22
7 6
5 5
13 9
22 19
4 3
2 2
10 10
18 16
29 29
8 7
12 8
11 11
6 6
9 8
10 10
1 1
10 7
11 11
5 5
5 4
7 7
12 12
9 9
5 5
12 12
19 18
25 23
26 25
9 9
15 15
4 4
29 28
14 13
4 4
23 22
13 12
3 3
22 22
4 4
4 2
38 35
11 11
9 8
13 13
5 3
7 7
7 7
11 9
30 29
8 7
14 14
26 26
1 1
4 2
47 46
12 11
15 14
17 17
8 8
9 8
16 14
8 7
7 7
26 25
6 6
6 5
9 8
27 27
22 21
30 28
7 7
3 3
16 16
11 11
9 8
7 7
8 8
4 4
14 14
14 14
4 4
8 7
26 25
21 14
14 14
6 5
29 23
18 18
22 20
13 7
4 4
4 4
9 8
11 8
7 7
30 28
7 6
15 14
4 3
4 2
11 9
23 23
18 18
12 12
7 7
5 5
4 3
12 10
20 19
5 4
15 15
22 18
20 19
14 13
14 14
1 1
9 8
7 5
9 9
6 6
21 20
16 16
10 6
8 8
9 9
8 8
9 9
9 9
22 21
6 6
6 6
7 7
12 11
14 14
25 24
16 14
24 23
22 19
9 9
4 4
4 4
10 9
46 39
4 4
21 21
16 15
6 6
8 8
4 4
9 9
5 5
7 7
7 5
26 26
9 6
18 15
10 10
8 7
23 23
16 15
22 18
15 12
22 18
13 12
6 5
21 18
12 12
21 18
20 20
15 14
16 15
10 10
12 9
24

In [58]:
#Compute sentence vectors by averaging the word vectors for the words contained in the sentence
w2v_vect_avg = []

for vect in w2v_vect:
  if len(vect)!=0:
    w2v_vect_avg.append(vect.mean(axis=0))
  else:
    w2v_vect_avg.append(np.zeros(100))
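
The averaging step is easier to follow on a toy case. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of `w2v_model.wv`, and a made-up helper `sentence_vector`, to show that unknown words are skipped, that the result always has the same length, and that a message with no known words falls back to zeros.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for w2v_model.wv.
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "entry": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the known words; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'xyzzy' has no vector, so only 'free' and 'entry' are averaged.
print(sentence_vector(["free", "entry", "xyzzy"], toy_vectors, 3))

# No known words at all: the fallback keeps the length consistent.
print(sentence_vector(["xyzzy"], toy_vectors, 3))
```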

In [60]:
#Are our sentence vector lengths consistent?
for i, v in enumerate(w2v_vect_avg):
  print(len(X_test.iloc[i]), len(v))
#now the ML model will have 100 features for every message, regardless of how many words it contains

8 100
6 100
12 100
5 100
21 100
5 100
23 100
11 100
7 100
23 100
6 100
7 100
26 100
24 100
101 100
20 100
15 100
12 100
22 100
4 100
5 100
22 100
6 100
8 100
23 100
7 100
5 100
13 100
22 100
4 100
2 100
10 100
18 100
29 100
8 100
12 100
11 100
6 100
9 100
10 100
1 100
10 100
11 100
5 100
5 100
7 100
12 100
9 100
5 100
12 100
19 100
25 100
26 100
9 100
15 100
4 100
29 100
14 100
4 100
23 100
13 100
3 100
22 100
4 100
4 100
38 100
11 100
9 100
13 100
5 100
7 100
7 100
11 100
30 100
8 100
14 100
26 100
1 100
4 100
47 100
12 100
15 100
17 100
8 100
9 100
16 100
8 100
7 100
26 100
6 100
6 100
9 100
27 100
22 100
30 100
7 100
3 100
16 100
11 100
9 100
7 100
8 100
4 100
14 100
14 100
4 100
8 100
26 100
21 100
14 100
6 100
29 100
18 100
22 100
13 100
4 100
4 100
9 100
11 100
7 100
30 100
7 100
15 100
4 100
4 100
11 100
23 100
18 100
12 100
7 100
5 100
4 100
12 100
20 100
5 100
15 100
22 100
20 100
14 100
14 100
1 100
9 100
7 100
9 100
6 100
21 100
16 100
10 100
8 100
9 100
8 100
9 100
9 100
22

While prepping word vectors, you need to generate aggregated sentence vectors based on the word vectors. How is this done in Python?

w2v_vect = np.array([np.array([w2v_model.wv[i] for i in ls if i in w2v_model.wv.index2word])
   for ls in X_test])



You want to implement the pretrained word vectors from Wikipedia. How do you load them into the system? 

import gensim.downloader as api
wiki_embeddings = api.load('glove-wiki-gigaword-100')

What formula represents the similarity between a king and queen vector?

Sim(King, Queen) = cos(θ), where θ is the angle between the two word vectors
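
As a rough illustration of that formula, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below uses made-up 3-dimensional vectors rather than the real 100-dimensional embeddings, and the `cosine_similarity` helper is written by hand for illustration.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) between two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, only to show the calculation; real embeddings have 100 values.
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]

print(round(cosine_similarity(king, queen), 3))
```

Vectors pointing in the same direction score 1, and perpendicular vectors score 0.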

The word2vec function is a powerful tool to help in analysis. What does it return?

a vector of numeric values

doc2vec


