#### CSC 215 Artificial Intelligence (Spring 2023)

#### Dr. Haiquan Chen, Dept of Computer Science

#### California State University, Sacramento



## Lab 15: Natural Language Processing using NLTK and Gensim (word2vec)


To run the code for this lab, you need to install NLTK and Gensim using the following:

* ***pip install gensim***
* ***pip install nltk***

In [8]:
# do this in Google Colab

!pip install --upgrade gensim
!pip install --upgrade numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy>=1.18.5
  Using cached numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.1
    Uninstalling numpy-1.24.1:
      Successfully uninstalled numpy-1.24.1
Successfully installed numpy-1.22.4


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy
  Using cached numpy-1.24.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.24.1 which is incompatible.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.24.1 which is incompatible.
Successfully installed numpy-1.24.1


NLTK is a leading platform for building Python programs to work with human language data. NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

* [NLTK](https://www.nltk.org/)


Some simple things you can do with NLTK:

* Tokenize and tag some text
* Identify named entities
* Display a parse tree


Gensim is a Python library for topic modelling, document indexing, and similarity retrieval with large corpora.

* [Gensim](https://radimrehurek.com/gensim/)




In this lab, we focus on word2vec. Word2vec is a group of models that are used to produce word embeddings. These models are ***shallow, two-layer neural networks*** that are trained to reconstruct linguistic contexts of words. 

#### Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. 

Word vectors are positioned in the vector space such that ***words that share common contexts in the corpus are located in close proximity to one another in the vector space.***



## This enables math operations on words!

![Word2Vec](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/06062705/Word-Vectors.png)



### Download pre-trained word embeddings:  


* [GoogleNews Vectors](https://code.google.com/archive/p/word2vec/), [GitHub Mirror](https://github.com/mmihaltz/word2vec-GoogleNews-vectors)

* [Stanford GloVe](https://nlp.stanford.edu/projects/glove/)


### Next we will train a Word2Vec model to get word embeddings from scratch using the provided corpus


In [1]:
import numpy as np
import pandas as pd

from gensim.models import Word2Vec   # if your numpy is too old, update it:   !pip install --upgrade numpy
from scipy import spatial

from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import RegexpTokenizer



### Import training dataset

https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment


In [2]:
# Tweets about the first 2016 Republican Party presidential debate (held in August 2015)

data = pd.read_csv('data/Sentiment.csv')
data

Unnamed: 0,id,candidate,candidate_confidence,relevant_yn,relevant_yn_confidence,sentiment,sentiment_confidence,subject_matter,subject_matter_confidence,candidate_gold,...,relevant_yn_gold,retweet_count,sentiment_gold,subject_matter_gold,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,1,No candidate mentioned,1.0000,yes,1.0000,Neutral,0.6578,None of the above,1.0000,,...,,5,,,RT @NancyLeeGrahn: How did everyone feel about...,,2015-08-07 09:54:46 -0700,629697200650592256,,Quito
1,2,Scott Walker,1.0000,yes,1.0000,Positive,0.6333,None of the above,1.0000,,...,,26,,,RT @ScottWalker: Didn't catch the full #GOPdeb...,,2015-08-07 09:54:46 -0700,629697199560069120,,
2,3,No candidate mentioned,1.0000,yes,1.0000,Neutral,0.6629,None of the above,0.6629,,...,,27,,,RT @TJMShow: No mention of Tamir Rice and the ...,,2015-08-07 09:54:46 -0700,629697199312482304,,
3,4,No candidate mentioned,1.0000,yes,1.0000,Positive,1.0000,None of the above,0.7039,,...,,138,,,RT @RobGeorge: That Carly Fiorina is trending ...,,2015-08-07 09:54:45 -0700,629697197118861312,Texas,Central Time (US & Canada)
4,5,Donald Trump,1.0000,yes,1.0000,Positive,0.7045,None of the above,1.0000,,...,,156,,,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,,2015-08-07 09:54:45 -0700,629697196967903232,,Arizona
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13866,13867,No candidate mentioned,1.0000,yes,1.0000,Negative,0.7991,Abortion,0.6014,No candidate mentioned,...,yes,7,Negative,Abortion\nWomen's Issues (not abortion though),RT @cappy_yarbrough: Love to see men who will ...,,2015-08-07 09:29:43 -0700,629690895479250944,Como,
13867,13868,Mike Huckabee,0.9611,yes,1.0000,Positive,0.7302,None of the above,0.9229,Mike Huckabee,...,yes,1,,,RT @georgehenryw: Who thought Huckabee exceede...,,2015-08-07 09:25:02 -0700,629689719056568320,USA,
13868,13869,Ted Cruz,1.0000,yes,1.0000,Positive,0.8051,None of the above,0.9647,Ted Cruz,...,yes,67,Positive\nNeutral,,"RT @Lrihendry: #TedCruz As President, I will a...",,2015-08-07 07:19:18 -0700,629658075784282112,,
13869,13870,Donald Trump,1.0000,yes,1.0000,Negative,1.0000,Women's Issues (not abortion though),0.9202,Donald Trump,...,yes,149,,Women's Issues (not abortion though),RT @JRehling: #GOPDebate Donald Trump says tha...,,2015-08-07 09:54:04 -0700,629697023663546368,,


In [3]:
data = data[['text','sentiment']]
data = data[data.sentiment != "Neutral" ] 
data = data[~data.text.str.startswith('RT')].reset_index(drop=True)

In [4]:
data

Unnamed: 0,text,sentiment
0,Deer in the headlights RT @lizzwinstead: Ben C...,Negative
1,@JGreenDC @realDonaldTrump In all fairness #Bi...,Negative
2,Me reading my family's comments about how grea...,Negative
3,Hey @ChrisChristie exploiting the tragedy of 9...,Negative
4,reason comment is funny 'in case you're ignora...,Negative
...,...,...
4586,"This is why I don't watch Fox News, they're al...",Negative
4587,@marcorubio came out of the gate like a true l...,Positive
4588,"Best line of #GOPDebate was ""Immigration witho...",Positive
4589,People who say they are #prolife are usually a...,Negative


### Preprocess data (tokenize text)
- Convert all letters into lowercase
- Remove punctuation, numbers, etc.

In [5]:
tkr = RegexpTokenizer('[a-zA-Z]+')
#stemmer = LancasterStemmer()


In [6]:
data['tokenized'] = data['text'].apply(lambda row: [t.lower() for t in tkr.tokenize(row)])

#data['tokenized'] = data['text'].apply(lambda row: [stemmer.stem(t.lower()) for t in tkr.tokenize(row)])

In [7]:
data['tokenized']

0       [deer, in, the, headlights, rt, lizzwinstead, ...
1       [jgreendc, realdonaldtrump, in, all, fairness,...
2       [me, reading, my, family, s, comments, about, ...
3       [hey, chrischristie, exploiting, the, tragedy,...
4       [reason, comment, is, funny, in, case, you, re...
                              ...                        
4586    [this, is, why, i, don, t, watch, fox, news, t...
4587    [marcorubio, came, out, of, the, gate, like, a...
4588    [best, line, of, gopdebate, was, immigration, ...
4589    [people, who, say, they, are, prolife, are, us...
4590    [so, trans, soldiers, can, die, for, you, huck...
Name: tokenized, Length: 4591, dtype: object

In [8]:
tweets = data['tokenized']
print(tweets[0])    
print(tweets[1])    
print(tweets[2])    

['deer', 'in', 'the', 'headlights', 'rt', 'lizzwinstead', 'ben', 'carson', 'may', 'be', 'the', 'only', 'brain', 'surgeon', 'who', 'has', 'performed', 'a', 'lobotomy', 'on', 'himself', 'gopdebate']
['jgreendc', 'realdonaldtrump', 'in', 'all', 'fairness', 'billclinton', 'owns', 'that', 'phrase', 'gopdebate']
['me', 'reading', 'my', 'family', 's', 'comments', 'about', 'how', 'great', 'the', 'gopdebate', 'was', 'http', 't', 'co', 'giagjpygxz']
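
The stray tokens `s`, `t`, `http`, and `co` in the output above are a side effect of the `[a-zA-Z]+` pattern: apostrophes, digits, and URL punctuation all act as separators. The sketch below reproduces that behaviour with the standard-library `re` module rather than NLTK, on the assumption that the two behave the same for this pattern. The helper name `simple_tokenize` and the sample tweet are made up for illustration.

```python
import re

# Same pattern as the RegexpTokenizer above: runs of ASCII letters only.
LETTERS = re.compile(r'[a-zA-Z]+')

def simple_tokenize(text):
    """Lowercase every run of letters; everything else acts as a separator."""
    return [t.lower() for t in LETTERS.findall(text)]

# Hypothetical tweet, not taken from the dataset.
sample = "I don't watch #GOPDebate 2015 http://t.co/abc"
tokens = simple_tokenize(sample)
print(tokens)
```

Note how the contraction is split in two and the number disappears entirely. If such fragments turn out to hurt the embeddings, a stricter pattern or a stop-word filter is one option to explore.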


### Create and train model
- Create a word2vec model and train it with the corpus
- Key parameter description (https://radimrehurek.com/gensim/models/word2vec.html)
    - **sentences**: training data (***an iterable of tokenized sentences, e.g. a list of token lists***)
    - **vector_size (formerly: size)**: dimension of the embedding space
    - **sg**: Continuous Bag-of-Words (CBOW) model if 0, Skip-Gram model if 1. CBOW slides a window over the text and predicts the current word from its “context” (the surrounding words). Skip-Gram is the opposite: instead of predicting one word from its context, it uses one word to predict each of the surrounding words (the “context”).
    - **window**: maximum distance between the current word and a context word (with a window size of 3, up to 3 words to the left and 3 words to the right are considered)
    - **min_count**: ignores all words with a total frequency lower than this
    - **epochs (formerly: iter)**: number of training passes over the corpus
    - **workers**: number of worker threads used for training (useful on multicore machines)
    
For details,  go to http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/    
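
To make the `window` and `sg = 1` settings more concrete, the sketch below lists the (center word, context word) pairs that a Skip-Gram model would be trained on for one tokenized sentence. It is a simplified illustration of the idea, not Gensim's actual implementation, which adds refinements such as negative sampling. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position in one sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of the center word.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ['ben', 'carson', 'may', 'be']
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, '->', context)
```

With CBOW (`sg = 0`) the roles are reversed: the context words inside the window are combined to predict the center word.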



In [9]:
model = Word2Vec(sentences = tweets, vector_size = 128, sg = 1, window = 3, min_count = 1, epochs = 10)

### Normalize all the derived word vectors so they have equal length (optional if you use Gensim 4.0)

In [10]:
model.init_sims(replace = True)   # normalize all word vectors so they have equal (unit) length

# Note that you cannot continue training after doing a replace.
# The model becomes effectively read-only: you can call most_similar, similarity, etc., but not train.
# In Gensim 4.0 and later, init_sims() is deprecated and this call is unnecessary.
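
Normalizing a vector to unit length means dividing it by its Euclidean (L2) norm. A minimal NumPy sketch with made-up vectors (not taken from the trained model) shows the effect: after normalization every row has length 1, so the dot product of two rows equals their cosine similarity.

```python
import numpy as np

# Toy 3-dimensional "word vectors" (illustrative values only).
vectors = np.array([[3.0, 4.0, 0.0],
                    [1.0, 2.0, 2.0]], dtype=np.float32)

# Divide each row by its L2 norm so that every vector has length 1.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms

# Each entry printed here should be (approximately) 1.0.
print(np.linalg.norm(unit_vectors, axis=1))
```

In Gensim 4.x, `model.wv.get_vector(word, norm=True)` returns a unit-length copy of a single vector without modifying the model.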



### Save and load model
- A word2vec model can be saved to disk and loaded again later
- Doing so avoids having to retrain the model

In [11]:
model.save('word2vec_model')

In [12]:
model = Word2Vec.load('word2vec_model')

### Once the model is trained, its word vectors are accessible via the “wv” attribute. 

#### The ***wv*** property of the word2vec model holds all trained word vectors.

In [13]:
X_vecs = model.wv

### You can print the learned vocabulary of tokens (words) as follows:

In [14]:
words = list(model.wv.index_to_key)
words

['gopdebate',
 'the',
 't',
 'gopdebates',
 'to',
 'a',
 'i',
 'co',
 'of',
 'is',
 's',
 'and',
 'http',
 'trump',
 'in',
 'it',
 'you',
 'that',
 'for',
 'on',
 'was',
 'last',
 'night',
 'he',
 'not',
 'this',
 'about',
 'realdonaldtrump',
 'https',
 'amp',
 'but',
 'debate',
 'like',
 'be',
 'at',
 'are',
 'with',
 'gop',
 'they',
 'foxnews',
 'megynkelly',
 'all',
 'have',
 'so',
 'my',
 'can',
 'we',
 'who',
 'just',
 'what',
 'as',
 'from',
 'if',
 'me',
 'his',
 'm',
 'fox',
 'candidates',
 'no',
 'how',
 'did',
 'up',
 'out',
 'more',
 'people',
 'one',
 'donald',
 'when',
 'or',
 'your',
 'don',
 'has',
 'questions',
 'think',
 'by',
 'an',
 'news',
 'were',
 'do',
 'after',
 'rubio',
 'only',
 'time',
 'would',
 'why',
 'get',
 'god',
 'carson',
 'him',
 'women',
 'will',
 'really',
 'should',
 'good',
 'than',
 'kasich',
 'know',
 'tcot',
 'these',
 'cruz',
 'jeb',
 'their',
 'bush',
 'now',
 'great',
 'watching',
 'republican',
 'won',
 'want',
 'say',
 'didn',
 'any',
 'p

In [15]:
print ('trump' in words)

True


In [16]:
print(model.wv['trump'])
print(model.wv['trump'].shape)

[ 0.03349504 -0.04257391 -0.01241147  0.08473593  0.09193628  0.00250722
  0.04860888  0.0785149  -0.09423184  0.04285305  0.09216811 -0.11547117
  0.00734323 -0.01464565  0.15262897  0.08627716 -0.05485145  0.03828108
 -0.0142516  -0.04458943  0.04672495  0.12381701  0.00163646 -0.15341294
 -0.12056599 -0.06133847 -0.05325154  0.11909362  0.13320418  0.05845943
 -0.0707586   0.1438664  -0.09212553 -0.00732785 -0.1591769   0.05757797
  0.17115906  0.08361261  0.13478763 -0.05267585  0.04322574  0.01830761
  0.03495894 -0.02452877  0.04426668 -0.00721593 -0.07042611 -0.18496332
  0.06080728 -0.00371308  0.09611719 -0.04782627  0.11422641  0.11177275
  0.10654738 -0.00759606  0.22715497 -0.03908908 -0.09738907 -0.03214465
 -0.04792882 -0.0754894   0.07186405 -0.00287175  0.10343757 -0.07429752
  0.00441283  0.0359907  -0.06266224 -0.0785659   0.080979   -0.01882071
 -0.19112276  0.00942762  0.07876293 -0.15119855  0.0207716  -0.08172785
 -0.06769384  0.15542078  0.01994049 -0.03418295 -0

Why 128? Because the model was created with `vector_size = 128`, so every word vector has 128 dimensions.

### Similarity calculation

- Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity

In [17]:
model.wv.most_similar('trump')

[('preview', 0.8394254446029663),
 ('donald', 0.8279519081115723),
 ('trumps', 0.8120870590209961),
 ('insulted', 0.8016189932823181),
 ('wins', 0.7937921285629272),
 ('jebbush', 0.7878875732421875),
 ('sexist', 0.7866778373718262),
 ('bringbackdarrellhammond', 0.7820214033126831),
 ('bimbo', 0.780591607093811),
 ('obviously', 0.7796945571899414)]

In [18]:
v1 = model.wv['ted']
v2 = model.wv['trump']

In [19]:
# define a function that computes the cosine similarity between two word vectors
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

In [20]:
cosine_similarity(v1, v2)

0.7444007396697998
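
The helper above relies on SciPy's cosine *distance*, which is 1 minus the cosine similarity. Written out, the similarity is the dot product of the two vectors divided by the product of their lengths. The pure-Python sketch below spells that formula out. The name `cosine_similarity_manual` and the toy vectors are illustrative only.

```python
import math

def cosine_similarity_manual(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, regardless of length.
same_direction = cosine_similarity_manual([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score close to -1.
opposite = cosine_similarity_manual([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, it is unaffected by whether the word vectors were normalized beforehand.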

Words similar to ['ted', 'trump'] but dissimilar to ['cnn']:

In [21]:
model.wv.most_similar(positive=['ted', 'trump'], negative=['cnn'])

[('cruz', 0.7120761871337891),
 ('kasich', 0.6357433795928955),
 ('huckabee', 0.6288331151008606),
 ('carson', 0.6092767715454102),
 ('donald', 0.6077600717544556),
 ('ben', 0.6065826416015625),
 ('dear', 0.6032679080963135),
 ('john', 0.6012837886810303),
 ('says', 0.599044919013977),
 ('dr', 0.5922253131866455)]
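
One way to picture what `most_similar(positive=..., negative=...)` does: add up the unit-length vectors of the positive words, subtract those of the negative words, and rank the rest of the vocabulary by cosine similarity to the result. The NumPy sketch below follows that idea on a tiny hand-made vocabulary. The words, the vectors, and the function name `toy_most_similar` are all invented for illustration, and Gensim's real implementation may differ in its details.

```python
import numpy as np

# Hand-made 3-dimensional vectors (illustrative values only).
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'apple': np.array([0.0, 0.5, -0.5]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_most_similar(vectors, positive, negative, topn=3):
    # Combine the normalized input vectors into a single query direction.
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # Score every word that was not part of the query, highest similarity first.
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = toy_most_similar(toy_vectors, positive=['king', 'woman'], negative=['man'])
print(result)
```

In this toy vocabulary, `queen` ranks first for "king + woman - man". With the small tweet corpus used in this lab, such analogies are much noisier.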

### Use word2vec result as input data to train a neural network model.

#### Now each tweet can be treated as a 2-D array. Let's take the first tweet as an example.

In [22]:
[token for token in tweets[0]]

['deer',
 'in',
 'the',
 'headlights',
 'rt',
 'lizzwinstead',
 'ben',
 'carson',
 'may',
 'be',
 'the',
 'only',
 'brain',
 'surgeon',
 'who',
 'has',
 'performed',
 'a',
 'lobotomy',
 'on',
 'himself',
 'gopdebate']

In [23]:
len(tweets[0])

22

In [24]:
for t, token in enumerate(tweets[0]):
    print(model.wv[token])

[ 0.01527211 -0.13625318  0.05632527  0.02510232  0.07749157 -0.07799084
 -0.00305168  0.01561658 -0.03489351  0.16915986  0.10616387 -0.08212677
 -0.0653687  -0.04441346  0.04029745  0.09241129 -0.0825252   0.09331887
 -0.11898465  0.07169112  0.12186944  0.14663832 -0.02138282 -0.14636658
 -0.12227961  0.06027051 -0.06814151  0.09819517  0.04018774 -0.04638248
 -0.03833468  0.08404088  0.02917851 -0.00840866 -0.0544882   0.00889804
  0.21430561  0.00781945  0.04715935  0.01726187  0.0054331   0.14732233
 -0.02693737 -0.07024334  0.123269    0.05579787 -0.07234985 -0.04271977
  0.04152122  0.06560847  0.06062153  0.05837453  0.06174793  0.09771547
  0.02155089  0.04507793  0.22670943  0.01411681 -0.05886986  0.09373379
 -0.08763298  0.00363162  0.0982978  -0.0119391   0.15804315 -0.01351247
  0.02905325  0.0087295  -0.07345144 -0.03417888  0.06119534 -0.04169855
 -0.22709662 -0.12329203  0.08782518 -0.05259509 -0.10497121 -0.02105464
 -0.1833303   0.1163919  -0.03122422 -0.05424273  0

### Let's write code to convert each tweet to its corresponding 2D array representation. 

Suppose we set the maximum tweet length (the maximum number of tokens taken into account per tweet) to 20.

#### In this case, ***each tweet can be represented as a 20 by 128 2D array/matrix.***    

How do we convert a particular tweet to its corresponding 20 by 128 matrix? See the example code below. 

In [25]:
max_tweet_length = 20
vector_size = 128

tweet0 = np.zeros((max_tweet_length, vector_size), dtype=np.float32)

In [26]:
for t, token in enumerate(tweets[0]):
    if t >= max_tweet_length:   # ignore tokens beyond the maximum tweet length
        break
    if token not in model.wv:   # unknown token: leave its row as all zeros
        continue
    tweet0[t, :] = model.wv[token]

In [27]:
tweet0

array([[ 1.52721088e-02, -1.36253178e-01,  5.63252680e-02, ...,
         3.03159244e-02, -7.51966238e-02,  3.69942449e-02],
       [-4.41542044e-02, -1.12128094e-01,  1.05844922e-01, ...,
        -2.94101648e-02, -5.99560887e-02,  4.91375476e-03],
       [-9.60457511e-03, -6.27710223e-02,  1.69044763e-01, ...,
        -7.29673803e-02, -7.19162822e-02,  3.41904685e-02],
       ...,
       [ 2.05860417e-02, -1.04951270e-01,  1.03138499e-01, ...,
        -7.88663253e-02, -1.96735024e-01,  4.71174568e-02],
       [ 1.71085994e-05, -1.58520222e-01,  9.34343040e-02, ...,
        -2.86284648e-02, -7.70124644e-02,  4.77802679e-02],
       [-2.81199496e-02, -5.46709746e-02,  1.16341911e-01, ...,
        -7.67739443e-03, -7.20800981e-02,  4.57515270e-02]], dtype=float32)

In [28]:
tweet0.shape

(20, 128)
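
The same steps can be wrapped in a reusable helper. The sketch below uses a plain dictionary of NumPy vectors as a stand-in for `model.wv` so that it runs without a trained model. The function name `tweet_to_matrix` and the toy lookup are assumptions made for illustration. Tokens beyond the maximum length are dropped, while short tweets and unknown tokens leave all-zero rows.

```python
import numpy as np

def tweet_to_matrix(tokens, lookup, max_len, dim):
    """Return a (max_len, dim) matrix with one row per token position."""
    matrix = np.zeros((max_len, dim), dtype=np.float32)
    for t, token in enumerate(tokens[:max_len]):   # truncate long tweets
        if token in lookup:                        # unknown tokens stay zero
            matrix[t, :] = lookup[token]
    return matrix

# Toy 4-dimensional embeddings standing in for model.wv (illustrative only).
toy_lookup = {
    'great':  np.array([1.0, 0.0, 0.0, 0.0]),
    'debate': np.array([0.0, 1.0, 0.0, 0.0]),
}

m = tweet_to_matrix(['great', 'unknownword', 'debate'], toy_lookup, max_len=5, dim=4)
print(m.shape)
print(m)
```

Padding every tweet to the same shape is what allows the per-tweet matrices to be stacked into a single 3D array.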

### Now aggregate such 2D arrays for all the tweets and you will get the input data ready in 3D shape (i.e., X) for model training.   Super cool, right?  Here is sample code to get X and Y for model training. 

In [None]:
### Create train and test sets
from sklearn.model_selection import train_test_split

num_of_tweets = len(tweets)
labels = data['sentiment']

# Generate a random ordering of the tweet indexes
indexes = np.random.permutation(num_of_tweets)

X = np.zeros((num_of_tweets, max_tweet_length, vector_size), dtype=np.float32)
Y = np.zeros((num_of_tweets, 2), dtype=np.float32)

for i, index in enumerate(indexes):
    for t, token in enumerate(tweets[index]):
        if t >= max_tweet_length:
            break

        if token not in model.wv:
            continue

        X[i, t, :] = model.wv[token]

    # One-hot label. Assumption: column 0 = Negative, column 1 = Positive.
    Y[i, :] = [1.0, 0.0] if labels[index] == 'Negative' else [0.0, 1.0]


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
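
The two-column `Y` array is a one-hot encoding of the sentiment label. Below is a small sketch of that encoding in isolation. The lab code does not state which column stands for which class, so the ordering used here (column 0 = Negative, column 1 = Positive) is an assumption, as are the names `CLASSES` and `one_hot`.

```python
import numpy as np

# Assumed class order: column 0 = Negative, column 1 = Positive.
CLASSES = ['Negative', 'Positive']

def one_hot(labels, classes=CLASSES):
    """Turn a sequence of class names into a (n_samples, n_classes) 0/1 matrix."""
    Y = np.zeros((len(labels), len(classes)), dtype=np.float32)
    for i, label in enumerate(labels):
        Y[i, classes.index(label)] = 1.0
    return Y

Y_demo = one_hot(['Negative', 'Positive', 'Negative'])
print(Y_demo)
```

Each row contains a single 1, which is the format a two-unit softmax output layer would typically be trained against.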