# Creating Word Vectors with word2vec in Python

In this first example we will use the focus group comments.

In [1]:
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import gensim
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE
from bokeh.io import output_notebook
from bokeh.plotting import show, figure
%matplotlib inline



#### Loading the comments
Let's load the social media comments, as in the previous examples.

In [3]:
# Read the dataset and lowercase it
filename = './data/AllWeeks.txt'

# The context manager closes the file automatically
with open(filename, 'r', encoding='utf8') as f:
    text = f.read().lower()

In [4]:
print (text)

no i dont think so - we did this with oab and patients seem to be able to make correct decisions and understand their co-existent issues. in other words i do think that they are appropriate the symptoms mentioned only can be confuse with underactive bladder which may have similar type symptoms
detrusor underactivity is a difficult to diagnose and not as common condition. in selected clinic and urodynamic series it may occur in 15% or so. i agree that the symptoms may partially overlap, but even md provider will likely give alpha blocker to those pts as they do not have a full uds examination to review.
it's not inappropriate for these patients to take tamsulosin. the main issue is going to be whether or not the data are considered adequate to demonstrate that people who have not been diagnosed already with these comorbid conditions and take otc tamsulosin hydrochloride receive their diagnosis promptly. if the data are adequate to show that the otc drug does not delay the diagnosis long

#### Extract sentences
We will parse the text and split it into sentences, then split each sentence into tokens.

In [5]:
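# Make sure you have downloaded the NLTK sentence tokenization resources
# (the stopword list and WordNet lemmatizer used below also need
# nltk.download('stopwords') and nltk.download('wordnet'))
nltk.download('punkt')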

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Giancarlo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
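# Single punctuation characters to drop from the token list
PUNCTUATION = set(".,\"-\\/#!?$%^&*;:{}=_'~()")
# Build the stopword set once instead of reloading the list for every token
STOPWORDS = set(stopwords.words('english'))

def remove_punctuation(corpus):
    # Drop tokens that are a single punctuation character
    return [token for token in corpus if token not in PUNCTUATION]

def apply_stopwording(corpus, min_len):
    # Keep tokens that are not stopwords and are longer than min_len characters
    return [token for token in corpus if token not in STOPWORDS and len(token) > min_len]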

def apply_lemmatization(corpus):
    lemmatizer = nltk.WordNetLemmatizer()
    normalized_corpus = [lemmatizer.lemmatize(token) for token in corpus]
    return normalized_corpus


# Extract sentences
sa_sentences = sent_tokenize(text)

# Extract tokens in each sentence
tokens = []
for sentence in sa_sentences:
    t = word_tokenize(sentence)
    tokens.append(apply_lemmatization(apply_stopwording(remove_punctuation(t),3)))

In [7]:
print (len(sa_sentences))
print (sa_sentences[0:5])
print (len(tokens))
print (tokens[0:5])

557
['no i dont think so - we did this with oab and patients seem to be able to make correct decisions and understand their co-existent issues.', 'in other words i do think that they are appropriate the symptoms mentioned only can be confuse with underactive bladder which may have similar type symptoms\ndetrusor underactivity is a difficult to diagnose and not as common condition.', 'in selected clinic and urodynamic series it may occur in 15% or so.', 'i agree that the symptoms may partially overlap, but even md provider will likely give alpha blocker to those pts as they do not have a full uds examination to review.', "it's not inappropriate for these patients to take tamsulosin."]
557
[['dont', 'think', 'patient', 'seem', 'able', 'make', 'correct', 'decision', 'understand', 'co-existent', 'issue'], ['word', 'think', 'appropriate', 'symptom', 'mentioned', 'confuse', 'underactive', 'bladder', 'similar', 'type', 'symptom', 'detrusor', 'underactivity', 'difficult', 'diagnose', 'common',

#### Creating the Word2Vec model
Using the tokenized sentences from the previous step, we will train the Word2Vec model. Keep in mind that the corpus is small (557 sentences), so I am not expecting great results.

Parameters:
  - sentences: the list of tokenized sentences
  - size: the number of dimensions of the Word2Vec space being generated (renamed vector_size in gensim 4.x)
  - sg: set to 1 to use the skip-gram algorithm instead of CBOW (this is a small dataset)
  - window: maximum distance between the current word and a context word
  - min_count: minimum number of times a word must appear to be included in the vocabulary
  - seed: for reproducibility (fully deterministic runs also require workers=1)
  - workers: number of worker threads used to train the model
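
To make the window parameter more concrete, here is a minimal sketch of how skip-gram training pairs could be formed from one tokenized sentence. The function name `skipgram_pairs` and the example sentence are hypothetical, and this is not gensim's actual code: the real implementation adds details such as negative sampling, which this sketch ignores.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for one tokenized sentence.

    Illustrative only: every word within `window` positions of the
    center word is treated as a context word.
    """
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical example sentence
example = ['patient', 'take', 'tamsulosin', 'daily']
pairs = skipgram_pairs(example, window=1)
print(pairs)
```

With a larger window, each word is paired with more distant neighbours, which tends to capture broader topical similarity rather than close syntactic context.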

In [8]:
# Note: gensim 4.x renamed the size argument to vector_size
w2v_model = Word2Vec(sentences=tokens, size=32, sg=1, window=5, min_count=3, seed=20, workers=2)

# You can save the model so you can reuse it later
#w2v_model.save('./models/socialposts_01.w2v')

# You can reload a saved model
#w2v_model = gensim.models.Word2Vec.load('./models/socialposts_01.w2v')

In [9]:
# Vocabulary size after the min_count filter
# (gensim 4.x replaced wv.vocab with wv.key_to_index)
print(len(w2v_model.wv.vocab))
print(w2v_model.wv.vocab)

520
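
The vocabulary is much smaller than the number of distinct tokens in the corpus because of the min_count filter. The following is a rough sketch of that filtering step, assuming a simple frequency count over all sentences; `build_vocab` and the toy data are hypothetical names for illustration, not part of gensim.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Hypothetical toy corpus
toy = [
    ['patient', 'symptom'],
    ['patient', 'study'],
    ['patient', 'symptom', 'symptom'],
]
vocab = build_vocab(toy, min_count=3)
print(vocab)
```

Words dropped by this filter never get a vector, so looking them up in the trained model raises a KeyError.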


In [10]:
# Each term is a vector in a 32-dimensional space
len(w2v_model.wv['tamsulosin'])

32

In [12]:
# Try words like 'tamsulosin', 'cancer', 'patient'
w2v_model.wv.most_similar('tamsulosin')

[('symptom', 0.9974979162216187),
 ('company', 0.9968241453170776),
 ('prostate', 0.9967771768569946),
 ('need', 0.9965850114822388),
 ('study', 0.9963476657867432),
 ('also', 0.9962530732154846),
 ('consumer', 0.9962126016616821),
 ('product', 0.9961062669754028),
 ('self-selection', 0.9960931539535522)]
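
The scores returned by most_similar are cosine similarities between word vectors. As a minimal sketch of what that number means, the helper below (a hypothetical `cosine_similarity` written in plain Python, with made-up vectors) computes it directly. Scores that are all close to 1, as above, mean the vectors point in almost the same direction, which is consistent with the small-corpus caveat mentioned earlier.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [-3.0, 0.0, 1.0]  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```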

In [13]:
# Retrieve the vectors for the whole vocabulary from the 32-dimensional space
words = list(w2v_model.wv.vocab.keys())
X_32D = w2v_model.wv[words]
# Project the vectors to 2D with t-SNE and load them into a pandas DataFrame
tSNE = TSNE(n_components=2, n_iter=1000)
X_2D = tSNE.fit_transform(X_32D)
x2D_df = pd.DataFrame(X_2D, columns=['x','y'])
x2D_df['word'] = words
x2D_df.head(10)

|   | x | y | word |
|---|---|---|---|
| 0 | 3.408856 | -3.90942 | dont |
| 1 | -39.261364 | -12.356514 | think |
| 2 | -46.412117 | -11.238316 | patient |
| 3 | 37.998791 | 12.22721 | seem |
| 4 | -9.732599 | 0.689811 | able |
| 5 | -43.475597 | -11.008735 | make |
| 6 | -12.356521 | -6.835112 | correct |
| 7 | -11.773398 | -4.412108 | decision |
| 8 | -17.814341 | -5.890518 | understand |
| 9 | -32.039474 | -11.865029 | issue |


In [14]:
# Configure Bokeh to render plots inside notebook cells
# Always call this before creating any visualization
output_notebook()

In [15]:
# Plot every word in the vocabulary; with only 520 words there is no need to sample
plot = figure(plot_width=800, plot_height=800)
_ = plot.text(x=x2D_df.x, y=x2D_df.y, text=x2D_df.word)
show(plot)

In [21]:
print(w2v_model.wv.most_similar(positive=['patient','tamsulosin','adverse','event']))

[('specific', 0.996046781539917), ('symptom', 0.9958760142326355), ('condition', 0.9957076907157898), ('prostate', 0.9955739974975586), ('issue', 0.9950482845306396), ('would', 0.9949253797531128), ('cause', 0.9947460889816284), ('study', 0.9945489168167114), ('section', 0.9945038557052612), ('know', 0.9944651126861572)]
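
When several positive words are given, the query is, roughly speaking, the average of their length-normalized vectors, and candidates are ranked by cosine similarity to that average, with the query words themselves excluded. The sketch below illustrates the idea in plain Python under those assumptions; `most_similar_sketch` and the two-dimensional toy vectors are hypothetical and are not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def most_similar_sketch(vectors, positive, topn=3):
    """Rank words by cosine similarity to the mean of the positive words."""
    normed = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(normed.values())))
    # Average the unit vectors of the query words, then renormalize
    mean = [sum(normed[w][k] for w in positive) / len(positive) for k in range(dim)]
    mean = unit(mean)
    # Dot product of unit vectors equals their cosine similarity
    scores = [
        (w, sum(a * b for a, b in zip(mean, vec)))
        for w, vec in normed.items()
        if w not in positive  # the query words are excluded from the results
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2D vectors for illustration
toy_vectors = {
    'patient': [1.0, 0.0],
    'tamsulosin': [0.0, 1.0],
    'symptom': [1.0, 1.0],
    'company': [1.0, -1.0],
}
result = most_similar_sketch(toy_vectors, positive=['patient', 'tamsulosin'])
print(result)
```

In this toy example 'symptom' ranks first because it lies exactly between the two query words.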

