# Introduction to NLP preprocessing and encoding

## Part 1: Preprocessing

We will be using a subsample of Twitter data from this dataset: https://www.kaggle.com/datasets/monogenea/game-of-thrones-twitter?resource=download

In [182]:
import pandas as pd
import numpy as np
# read in the gotTwitter dataset we will be working with
got_data = pd.read_csv('gotTwitter.csv', dtype='str')

In [183]:
# set_option to view full column in notebook
pd.set_option('display.max_colwidth', None)

# make sure it loaded correctly, there should be 3 columns, an id number, a created_at date, and the text of the tweets
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"Over 370,000 'Game of Thrones' fans sign petition for remake of season 8 https://t.co/jqxWu4E5k3"
1,x1129144306206298112,5/16/2019 21:59,"With both Game of Thrones and The Big Bang Theory ending this week, I<U+2019>m wondering if it would be fun to take hugely-popular, long-running, juggernaut shows I<U+2019>ve never seen a single episode of, and watch JUST the finale, and see what I think?"
2,x1129144249205895171,5/16/2019 21:59,"Suddenly, last episode, Daenerys embraced the thrill of genocide, specifically targeting civilians with dragon fire. Personality changes happen in fiction, but not with such lack of subtlety -- not to characters @GameOfThrones respect and understand.\n\n https://t.co/FR9HzLcGB0"
3,x1129144246869663745,5/16/2019 21:59,"Sprinkles causes a stampede by releasing a limited-time-only <U+2018>Game of Thrones<U+2019> dragon fruit cupcake sold Friday, May 17-Sunday, May 19 https://t.co/9zNhC00Sj4"
4,x1129141956095954956,5/16/2019 21:49,"<U+2018>Game of Thrones<U+2019> is airing its final episode, and here<U+2019>s what we<U+2019>ll miss when it ends https://t.co/Adb12iWRqb"


#### Remove things like special characters, symbols, punctuation, URLs, etc. from the data that contains little information for a model to learn and are often primarily noise.

In [184]:
import re, string #import packages for regex replacement

def clean_text_round(row):
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    return row

clean = lambda x: clean_text_round(x)

In [185]:
# apply the function above across each row of the text column
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"Over 370,000 'Game of Thrones' fans sign petition for remake of season 8"
1,x1129144306206298112,5/16/2019 21:59,"With both Game of Thrones and The Big Bang Theory ending this week, I<U+2019>m wondering if it would be fun to take hugely-popular, long-running, juggernaut shows I<U+2019>ve never seen a single episode of, and watch JUST the finale, and see what I think?"
2,x1129144249205895171,5/16/2019 21:59,"Suddenly, last episode, Daenerys embraced the thrill of genocide, specifically targeting civilians with dragon fire. Personality changes happen in fiction, but not with such lack of subtlety -- not to characters respect and understand.\n\n"
3,x1129144246869663745,5/16/2019 21:59,"Sprinkles causes a stampede by releasing a limited-time-only <U+2018>Game of Thrones<U+2019> dragon fruit cupcake sold Friday, May 17-Sunday, May 19"
4,x1129141956095954956,5/16/2019 21:49,"<U+2018>Game of Thrones<U+2019> is airing its final episode, and here<U+2019>s what we<U+2019>ll miss when it ends"


We can see there are still elements that appear noisy, let's add a line to our function:

In [186]:

def clean_text_round(row):
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    row = re.sub(r"<[^>]+>", '', row) # remove carrot inserts from collection <---- new operation
    return row

clean = lambda x: clean_text_round(x)

In [187]:
# apply the function above across each row of the text column
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"Over 370,000 'Game of Thrones' fans sign petition for remake of season 8"
1,x1129144306206298112,5/16/2019 21:59,"With both Game of Thrones and The Big Bang Theory ending this week, Im wondering if it would be fun to take hugely-popular, long-running, juggernaut shows Ive never seen a single episode of, and watch JUST the finale, and see what I think?"
2,x1129144249205895171,5/16/2019 21:59,"Suddenly, last episode, Daenerys embraced the thrill of genocide, specifically targeting civilians with dragon fire. Personality changes happen in fiction, but not with such lack of subtlety -- not to characters respect and understand.\n\n"
3,x1129144246869663745,5/16/2019 21:59,"Sprinkles causes a stampede by releasing a limited-time-only Game of Thrones dragon fruit cupcake sold Friday, May 17-Sunday, May 19"
4,x1129141956095954956,5/16/2019 21:49,"Game of Thrones is airing its final episode, and heres what well miss when it ends"


In [188]:

def clean_text_round(row):
    row.lower()
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    row = re.sub(r"<[^>]+>", '', row) 
    row = re.sub(r"\n","",row)
    row = re.sub(r"[^\w\s]", " ", row)
    return row

clean = lambda x: clean_text_round(x)
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,Over 370 000 Game of Thrones fans sign petition for remake of season 8
1,x1129144306206298112,5/16/2019 21:59,With both Game of Thrones and The Big Bang Theory ending this week Im wondering if it would be fun to take hugely popular long running juggernaut shows Ive never seen a single episode of and watch JUST the finale and see what I think
2,x1129144249205895171,5/16/2019 21:59,Suddenly last episode Daenerys embraced the thrill of genocide specifically targeting civilians with dragon fire Personality changes happen in fiction but not with such lack of subtlety not to characters respect and understand
3,x1129144246869663745,5/16/2019 21:59,Sprinkles causes a stampede by releasing a limited time only Game of Thrones dragon fruit cupcake sold Friday May 17 Sunday May 19
4,x1129141956095954956,5/16/2019 21:49,Game of Thrones is airing its final episode and heres what well miss when it ends


#### Lowercase

There are many ways to lowercase your data, here we use re, feel free to drop in your new lines as we redefine the function

In [189]:
# added a new line
def clean_text_round(row):
    row.lower()
    row = re.sub(r'http\S+', '', row) #remove urls
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row) #remove mentions
    row = re.sub(r"<[^>]+>", '', row) 
    row = re.sub(r"\n","",row)
    row = re.sub(r"[^\w\s]", " ", row)
    return row

clean = lambda x: clean_text_round(x)

In [190]:
got_data.loc[:, 'text'] = got_data['text'].apply(clean)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,Over 370 000 Game of Thrones fans sign petition for remake of season 8
1,x1129144306206298112,5/16/2019 21:59,With both Game of Thrones and The Big Bang Theory ending this week Im wondering if it would be fun to take hugely popular long running juggernaut shows Ive never seen a single episode of and watch JUST the finale and see what I think
2,x1129144249205895171,5/16/2019 21:59,Suddenly last episode Daenerys embraced the thrill of genocide specifically targeting civilians with dragon fire Personality changes happen in fiction but not with such lack of subtlety not to characters respect and understand
3,x1129144246869663745,5/16/2019 21:59,Sprinkles causes a stampede by releasing a limited time only Game of Thrones dragon fruit cupcake sold Friday May 17 Sunday May 19
4,x1129141956095954956,5/16/2019 21:49,Game of Thrones is airing its final episode and heres what well miss when it ends


#### Tokenization

Again, there are many ways to tokenize. The package NLTK has built in functions to help us tokenize in different ways, including by word and by sentence. Here, we tokenize using a special tweet tokenizer that is able to take things like emojis into account. Read the documentation here: https://www.nltk.org/api/nltk.tokenize.html

In [191]:
# import tokenizer from nltk
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer() 

def tweet_tokenize(row):
    row = tweet_tokenizer.tokenize(row)
    return row

tokenized = lambda x: tweet_tokenize(x)

In [192]:
got_data.loc[:, 'text'] = got_data['text'].apply(tokenized)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"[Over, 370, 000, Game, of, Thrones, fans, sign, petition, for, remake, of, season, 8]"
1,x1129144306206298112,5/16/2019 21:59,"[With, both, Game, of, Thrones, and, The, Big, Bang, Theory, ending, this, week, Im, wondering, if, it, would, be, fun, to, take, hugely, popular, long, running, juggernaut, shows, Ive, never, seen, a, single, episode, of, and, watch, JUST, the, finale, and, see, what, I, think]"
2,x1129144249205895171,5/16/2019 21:59,"[Suddenly, last, episode, Daenerys, embraced, the, thrill, of, genocide, specifically, targeting, civilians, with, dragon, fire, Personality, changes, happen, in, fiction, but, not, with, such, lack, of, subtlety, not, to, characters, respect, and, understand]"
3,x1129144246869663745,5/16/2019 21:59,"[Sprinkles, causes, a, stampede, by, releasing, a, limited, time, only, Game, of, Thrones, dragon, fruit, cupcake, sold, Friday, May, 17, Sunday, May, 19]"
4,x1129141956095954956,5/16/2019 21:49,"[Game, of, Thrones, is, airing, its, final, episode, and, heres, what, well, miss, when, it, ends]"


#### Remove stop words

In [196]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopeng = set(stopwords.words('english'))

def remove_stopwords(row):
    row = [w for w in row if w not in stopeng]
    return row

no_stopwords = lambda x: remove_stopwords(x)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ishadoshi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [197]:
got_data.loc[:, 'text'] = got_data['text'].apply(no_stopwords)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,"[Over, 370, 000, Game, Thrones, fans, sign, petition, remake, season, 8]"
1,x1129144306206298112,5/16/2019 21:59,"[With, Game, Thrones, The, Big, Bang, Theory, ending, week, Im, wondering, would, fun, take, hugely, popular, long, running, juggernaut, shows, Ive, never, seen, single, episode, watch, JUST, finale, see, I, think]"
2,x1129144249205895171,5/16/2019 21:59,"[Suddenly, last, episode, Daenerys, embraced, thrill, genocide, specifically, targeting, civilians, dragon, fire, Personality, changes, happen, fiction, lack, subtlety, characters, respect, understand]"
3,x1129144246869663745,5/16/2019 21:59,"[Sprinkles, causes, stampede, releasing, limited, time, Game, Thrones, dragon, fruit, cupcake, sold, Friday, May, 17, Sunday, May, 19]"
4,x1129141956095954956,5/16/2019 21:49,"[Game, Thrones, airing, final, episode, heres, well, miss, ends]"


#### Lemmatization/Stemming

In [198]:

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()


def lemmatize(row):
    row = [lmtzr.lemmatize(token) for token in row]
    row = ' '.join(row) # this is the final step of our guided walkthrough, so I have re-joined the tweets into single documents instead of lists
    return row

lemmatized = lambda x: lemmatize(x)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ishadoshi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/ishadoshi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [199]:
got_data.loc[:, 'text'] = got_data['text'].apply(lemmatized)
got_data.head()

Unnamed: 0,status_id,created_at,text
0,x1129144346618540033,5/16/2019 21:59,Over 370 000 Game Thrones fan sign petition remake season 8
1,x1129144306206298112,5/16/2019 21:59,With Game Thrones The Big Bang Theory ending week Im wondering would fun take hugely popular long running juggernaut show Ive never seen single episode watch JUST finale see I think
2,x1129144249205895171,5/16/2019 21:59,Suddenly last episode Daenerys embraced thrill genocide specifically targeting civilian dragon fire Personality change happen fiction lack subtlety character respect understand
3,x1129144246869663745,5/16/2019 21:59,Sprinkles cause stampede releasing limited time Game Thrones dragon fruit cupcake sold Friday May 17 Sunday May 19
4,x1129141956095954956,5/16/2019 21:49,Game Thrones airing final episode here well miss end


How would you apply stemming? Hint: https://www.nltk.org/howto/stem.html

In [202]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

got_data['stemmed_text'] = got_data['text'].apply(stem_text)
got_data.head()

Unnamed: 0,status_id,created_at,text,stemmed_text
0,x1129144346618540033,5/16/2019 21:59,Over 370 000 Game Thrones fan sign petition remake season 8,over 370 000 game throne fan sign petit remak season 8
1,x1129144306206298112,5/16/2019 21:59,With Game Thrones The Big Bang Theory ending week Im wondering would fun take hugely popular long running juggernaut show Ive never seen single episode watch JUST finale see I think,with game throne the big bang theori end week im wonder would fun take huge popular long run juggernaut show ive never seen singl episod watch just final see i think
2,x1129144249205895171,5/16/2019 21:59,Suddenly last episode Daenerys embraced thrill genocide specifically targeting civilian dragon fire Personality change happen fiction lack subtlety character respect understand,suddenli last episod daeneri embrac thrill genocid specif target civilian dragon fire person chang happen fiction lack subtleti charact respect understand
3,x1129144246869663745,5/16/2019 21:59,Sprinkles cause stampede releasing limited time Game Thrones dragon fruit cupcake sold Friday May 17 Sunday May 19,sprinkl caus stamped releas limit time game throne dragon fruit cupcak sold friday may 17 sunday may 19
4,x1129141956095954956,5/16/2019 21:49,Game Thrones airing final episode here well miss end,game throne air final episod here well miss end


If you have extra time: how might you split our data into ngrams? Hint: https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams

In [203]:
from nltk import ngrams

def get_bigrams(text):
    bigrams = ngrams(text.split(), 2)
    return list(bigrams)

got_data['bigrams'] = got_data['text'].apply(get_bigrams)
got_data.head()

Unnamed: 0,status_id,created_at,text,stemmed_text,bigrams
0,x1129144346618540033,5/16/2019 21:59,Over 370 000 Game Thrones fan sign petition remake season 8,over 370 000 game throne fan sign petit remak season 8,"[(Over, 370), (370, 000), (000, Game), (Game, Thrones), (Thrones, fan), (fan, sign), (sign, petition), (petition, remake), (remake, season), (season, 8)]"
1,x1129144306206298112,5/16/2019 21:59,With Game Thrones The Big Bang Theory ending week Im wondering would fun take hugely popular long running juggernaut show Ive never seen single episode watch JUST finale see I think,with game throne the big bang theori end week im wonder would fun take huge popular long run juggernaut show ive never seen singl episod watch just final see i think,"[(With, Game), (Game, Thrones), (Thrones, The), (The, Big), (Big, Bang), (Bang, Theory), (Theory, ending), (ending, week), (week, Im), (Im, wondering), (wondering, would), (would, fun), (fun, take), (take, hugely), (hugely, popular), (popular, long), (long, running), (running, juggernaut), (juggernaut, show), (show, Ive), (Ive, never), (never, seen), (seen, single), (single, episode), (episode, watch), (watch, JUST), (JUST, finale), (finale, see), (see, I), (I, think)]"
2,x1129144249205895171,5/16/2019 21:59,Suddenly last episode Daenerys embraced thrill genocide specifically targeting civilian dragon fire Personality change happen fiction lack subtlety character respect understand,suddenli last episod daeneri embrac thrill genocid specif target civilian dragon fire person chang happen fiction lack subtleti charact respect understand,"[(Suddenly, last), (last, episode), (episode, Daenerys), (Daenerys, embraced), (embraced, thrill), (thrill, genocide), (genocide, specifically), (specifically, targeting), (targeting, civilian), (civilian, dragon), (dragon, fire), (fire, Personality), (Personality, change), (change, happen), (happen, fiction), (fiction, lack), (lack, subtlety), (subtlety, character), (character, respect), (respect, understand)]"
3,x1129144246869663745,5/16/2019 21:59,Sprinkles cause stampede releasing limited time Game Thrones dragon fruit cupcake sold Friday May 17 Sunday May 19,sprinkl caus stamped releas limit time game throne dragon fruit cupcak sold friday may 17 sunday may 19,"[(Sprinkles, cause), (cause, stampede), (stampede, releasing), (releasing, limited), (limited, time), (time, Game), (Game, Thrones), (Thrones, dragon), (dragon, fruit), (fruit, cupcake), (cupcake, sold), (sold, Friday), (Friday, May), (May, 17), (17, Sunday), (Sunday, May), (May, 19)]"
4,x1129141956095954956,5/16/2019 21:49,Game Thrones airing final episode here well miss end,game throne air final episod here well miss end,"[(Game, Thrones), (Thrones, airing), (airing, final), (final, episode), (episode, here), (here, well), (well, miss), (miss, end)]"


#### BOW (bag of words)

In [204]:
data_list = got_data.iloc[:50]['text'].to_list()

In [205]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()

bow_result = bow.fit_transform(data_list)

In [206]:
import numpy as np

feature_array = np.array(bow.get_feature_names_out()) 
bow_sorting = np.argsort(bow_result.toarray()).flatten()[::-1]

n = 10
bow_top_n = feature_array[bow_sorting][:n]
print(bow_top_n)

['thrones' 'destroy' 'parallel' 'penultimate' 'classical' 'history'
 'theories' 'queens' 'dany' 'rain']


#### TF-IDF

In [207]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create variable for vectorizer
tfidf = TfidfVectorizer()
 
# apply function on data subsample
tfidf_result = tfidf.fit_transform(data_list)

In [208]:
feature_array = np.array(tfidf.get_feature_names_out()) # get_feature_names is dependent on sklearn version
tfidf_sorting = np.argsort(tfidf_result.toarray()).flatten()[::-1]

n = 10
tfidf_top_n = feature_array[tfidf_sorting][:n]
print(tfidf_top_n)

['parallel' 'world' 'rain' 'destroy' 'did' 'theories' 'real' 'history'
 'kings' 'queens']


## Part 3: Word Embeddings

For our word embedding exploration we will be using the gensim library. Gensim has pretrained embeddings for glove, word2vec, and fasttext;

documentation: https://radimrehurek.com/gensim/models/word2vec.html, https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html


In [210]:
#pip install gensim
import gensim.downloader as api

In [211]:
w2v_google_news = api.load('word2vec-google-news-300')

[--------------------------------------------------] 1.4% 23.4/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[==------------------------------------------------] 4.3% 70.8/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[===-----------------------------------------------] 7.3% 121.1/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[=====---------------------------------------------] 10.5% 174.6/1662.8MB downloaded

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)






In [216]:
# you can run code for analogies, I have included the king - man, + woman = queen example to start
w2v_google_news.most_similar_cosmul(positive=['king','woman'], negative=['man'])

[('queen', 0.9314123392105103),
 ('monarch', 0.858533501625061),
 ('princess', 0.8476566672325134),
 ('Queen_Consort', 0.8150269389152527),
 ('queens', 0.8099815249443054),
 ('crown_prince', 0.8089976906776428),
 ('royal_palace', 0.8027306795120239),
 ('monarchy', 0.8019613027572632),
 ('prince', 0.800979733467102),
 ('empress', 0.7958388328552246)]

In [217]:
# comparing words similar to a single word
w2v_google_news.most_similar('patriot')

[('patriots', 0.6744824051856995),
 ('patriotic', 0.5867904424667358),
 ('statesman', 0.5711327791213989),
 ('constitutionalist', 0.557580292224884),
 ('traitor', 0.5424764752388),
 ('revolutionist', 0.5388074517250061),
 ('hero', 0.5304942727088928),
 ('ardent_patriot', 0.5239595174789429),
 ('patriotism', 0.5227564573287964),
 ('rabble_rouser', 0.5194998383522034)]

In [218]:
w2v_google_news.distance('president', 'patriot')

0.7869569361209869

Explore different word relationships using the tools above or commands from the linked documentation, what relationships seem surprising? Why might the model have embedded certain words in similar ways when linguistically we wouldn't expect them to be similar? Use at least one new example per method.

In [238]:
w2v_google_news.most_similar_cosmul(positive=['buffalo','woman'], negative=['male'])

[('grandmother', 0.8582088351249695),
 ('cow', 0.8554173707962036),
 ('Grandmother', 0.8421149849891663),
 ('granddaughter', 0.8380693793296814),
 ('albino_buffalo', 0.8244361877441406),
 ('Vance_Ehmke_tracks', 0.8210681676864624),
 ('buffalo_roam', 0.8183169364929199),
 ('husband_funeral_pyre', 0.8180189728736877),
 ('aunt', 0.8175702095031738),
 ('hunter_stray_bullet', 0.81653892993927)]

In [233]:
w2v_google_news.most_similar('disney')

[('alice', 0.6571455001831055),
 ('harry_potter', 0.6108603477478027),
 ('gwen', 0.5985742211341858),
 ('mario', 0.5946715474128723),
 ('nolan', 0.5925750136375427),
 ('disneyland', 0.584934413433075),
 ('orlando', 0.5845032334327698),
 ('hannah_montana', 0.5791226625442505),
 ('jackie', 0.5755027532577515),
 ('nikki', 0.5743952989578247)]

In [229]:
w2v_google_news.distance('president', 'scam')

0.992013331502676