# Word2Vec

The following notebook walks you through using word2vec from the Python Gensim package. Word2Vec is a Word Embedding Model (WEM) and helps to find how specific words are used in a given text. 

###  Before we begin
Before we start, you will need to have set up a [Carbonate account](https://kb.iu.edu/d/aolp) in order to access [Research Desktop (RED)](https://kb.iu.edu/d/apum). You will also need to have access to RED through the [ThinLinc client](https://kb.iu.edu/d/aput). If you have not done any of this, or have done only part of it, you should go to our [textPrep-Py.ipynb](https://github.com/cyberdh/Text-Analysis/blob/drafts/textPrep-Py.ipynb) before you proceed further. The textPrep-Py notebook provides information and resources on how to get a Carbonate account, how to set up RED, and how to get started using the Jupyter Notebook on RED. 

### Run CyberDH environment
The code in the cell below points to a Python environment specifically for use with the Python Jupyter Notebooks created by Cyberinfrastructure for Digital Humanities. It allows for the use of the different packages in our notebooks and their subsequent data sets.

##### Packages
- **sys:** Provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available.
- **os:** Provides a portable way of using operating system dependent functionality.

#### NOTE: This cell is only for use with Research Desktop. You will get an error if you try to run this cell on your personal device!!

In [1]:
import sys
import os
sys.path.insert(0,"/N/u/cyberdh/Carbonate/dhPyEnviron/lib/python3.6/site-packages")
os.environ["NLTK_DATA"] = "/N/u/cyberdh/Carbonate/dhPyEnviron/nltk_data"

### Include necessary packages for notebook 

Python's extensibility comes in large part from packages. Packages are groups of functions, data, and algorithms that allow users to easily carry out processes without recreating the wheel. Some packages are included in the basic installation of Python, others created by Python users are available for download. Make sure to have the following packages installed before beginning so that they can be accessed while running the scripts.

In your terminal, packages can be installed by typing `pip install nameofpackage --user`. However, since you are using ReD and our Python environment, you will not need to install any of the packages below to use this notebook. Anytime you need to make use of a package, however, you need to import it so that Python knows to look in these packages for any functions or commands you use. Below is a brief description of the packages we are using in this notebook:


- **re:** Provides regular expression matching operations similar to those found in Perl.
- **nltk:** A leading platform for building Python programs to work with human language data.
- **glob:** Finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. 
- **pandas:** An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- **warnings:** Allows for the manipulation of warning messages in Python.
- **pprint:** Provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter.
- **spacy:** A library for advanced Natural Language Processing in Python and Cython.
- **collections:** implements specialized container datatypes providing alternatives to Python's general purpose built-in containers, dict, list, set, and tuple.
- **gensim:** Python library for topic modelling, document indexing and similarity retrieval with large corpora.

Notice we import some of the packages differently. In some cases we just import the entire package when we say `import XYZ`. For some packages which are small, or, from which we are going to use a lot of the functionality it provides, this is fine. 

Sometimes when we import the package directly we say `import XYZ as X`. All this does is allow us to type `X` instead of `XYZ` when we use certain functions from the package. So we can now say `X.function()` instead of `XYZ.function()`. This saves time typing and eliminates errors from having to type out longer package names. I could just as easily type `import XYZ as potato` and whenever I use a function from the `XYZ` package I would need to type `potato.function()`. The alias you choose is up to you, but some commonly used packages have abbreviations that are standard amongst Python users such as `import pandas as pd` or `import matplotlib.pyplot as plt`. You do not need to use `pd` or `plt`; however, these are widely used, and using something else could confuse other users and is generally considered bad practice. 

Other times we import only specific elements or functions from a package. This is common with packages that are very large and provide a lot of functionality, but from which we are only using a couple of functions or a specific subset of the package that contains the functionality we need. This is seen when we say `from XYZ import ABC`. This is saying I only want the `ABC` function from the `XYZ` package. Sometimes we need to point to the specific location where a function is located within the package. We do this by adding periods in between the directory names, so it would look like `from XYZ.DEF.A1B2 import LMN`. This says we want the `LMN` function which is located in the `XYZ` package and then the `DEF` and `A1B2` directory in that package. 

You can also import more than one function from a package by separating the functions with commas like this `from XYZ import ABC, LMN, QRS`. This imports the `ABC`, `LMN` and `QRS` functions from the `XYZ` package.
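
As a small illustration of the import styles described above, the sketch below uses only modules from Python's standard library. The alias `col` is an arbitrary choice made for this example, not a convention.

```python
# Several ways to reach the same standard-library objects.
import collections                    # import the whole package
import collections as col             # the same package under a shorter alias
from collections import Counter       # import a single name from the package
from os.path import join, basename    # several names from a submodule, comma separated

words = ["to", "be", "or", "not", "to", "be"]

# All three spellings refer to the very same Counter class.
print(collections.Counter(words) == col.Counter(words) == Counter(words))

# Names imported with "from ... import ..." are used without a prefix.
print(basename(join("data", "1599Hamlet.txt")))
```

Whichever style you use, the imported object is identical; only the name you type in your code changes.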

In [2]:
import re
from nltk.corpus import stopwords
import glob
import pandas as pd
import warnings
from pprint import pprint
import spacy
from collections import Counter

import gensim
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser

The cell below suppresses deprecation, user, and future warnings. None of the warnings raised by this code are cause for concern; they do not break the code or cause errors in the results.

In [3]:
warnings.filterwarnings("ignore", category=UserWarning,
                        module = "gensim", lineno = 598)

warnings.filterwarnings("ignore", category=FutureWarning,
                        module = "gensim", lineno = 737)
warnings.filterwarnings("ignore", category=DeprecationWarning)

### Getting your data

#### File paths
Here we are saving as variables different file paths that we need in our code. We do this so that they are easier to call later and so that you can make most of your changes now and not need to make as many changes later. 

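First we use the `os` package above to find our home directory by looking up the `HOME` entry in `os.environ`, a dictionary-like object that holds the environment variables. `HOME` is defined on ReD and on most Linux and macOS systems, so if you try this out on such a personal computer instead of ReD, the `homePath` variable will still be the path to your "home" directory and no changes are needed. On Windows `HOME` is often not set, in which case `os.path.expanduser("~")` is a more portable way to get the same path.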

Next, we combine the `homePath` variable with the folder names that lead to where our data is stored. Note that we do not use any file names yet, just the path to the folder. This is because we may want to read in all the files in the directory, or just one file. There are options below for doing both. We save the path as a variable named `dataHome`.

Now we add the `homePath` variable to other folder names that lead to a folder where we will want to save any output generated by this code. We will change the file names for our output in other cells as we need to down below. We save this file path as the variable `dataResults`.

In [4]:
homePath = os.environ["HOME"]
dataHome = os.path.join(homePath, "Text-Analysis-master", "data")
dataResults = os.path.join(homePath, "Text-Analysis-master", "Output")
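
If you run this notebook somewhere the `HOME` environment variable is not defined, the sketch below shows one possible alternative using `os.path.expanduser`. It assumes the same folder layout as the cell above; the variable names `portableHome`, `portableData`, and `portableResults` are made up for this example.

```python
import os

# expanduser("~") resolves the home directory on Linux, macOS, and Windows.
portableHome = os.path.expanduser("~")

# Same folder layout as the cell above, built from the portable home path.
portableData = os.path.join(portableHome, "Text-Analysis-master", "data")
portableResults = os.path.join(portableHome, "Text-Analysis-master", "Output")

print(portableData)
print(portableResults)
```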

### Set needed variables
This is where you will make some decisions about your data and set the necessary variables. Much like the file path variables above, we do this so you do not need to make as many changes later.

**source**<br>
First, we need to decide if we want our code to read all the files in a directory or just a single file. If we want all the files in a directory then we set `source` equal to `"*"`. This means 'all' and will be added to the file type later in the code. If you want a single file change `"*"` to the file name without the ".txt" or ".csv" or ".json" at the end. So if you have a file named "myFile.txt" you would set `source` equal to `"myFile"` without the ".txt".

**fileType**<br>
Next we assign the file type our data comes in to a variable. At the moment the only options are ".txt", ".csv" or ".json". The ".txt" format is the most popular format for analysis of a text or corpus, while ".csv" and ".json" are the most common formats for Twitter data. We assign the format to the `fileType` variable. It should look like this: `fileType = ".txt"`.

**docLevel**<br>
The `docLevel` variable is only for file types of ".txt" so if you have a ".csv" or ".json" you want to set it to **False** or it will cause problems in other parts of the code later since we do not keep track of file names for the ".csv" and ".json" files. If your data is in ".txt" format, then you need to determine if you want to chunk your corpus by line or by document.

We do this in case your data is a single ".txt" file. If you have multiple documents then the documents themselves are the chunks. If you have a single document, then we need to create chunks, and we do this by splitting the document up by line so that each line is a separate chunk.

If you want to separate by document, then set `docLevel` equal to **True**. If you want to separate a line at a time and have each line be its own entity or 'chunk' then set `docLevel` equal to **False**. If you set `source` equal to `"*"` then you will want to set `docLevel` equal to **True**. If you set `source` equal to a specific file name, then you will want to set `docLevel` equal to **False**.

**nltkStop**<br>
`nltkStop` is where you determine if you want to use the built-in stopword list provided by the NLTK package. NLTK provides stopword lists in multiple languages. If you wish to use this then set `nltkStop` equal to **True**. If you do not, then set `nltkStop` equal to **False**.

**customStop**<br>
`customStop` is for if you have a .txt file that contains additional stopwords that you would like to read in and have added to the existing `stopWords` list. You do *NOT* need to use the NLTK stopwords list in order to add your own custom list of stopwords. **NOTE: Your custom stopwords file needs to have one word per line as it reads in a line at a time and the full contents of the line is read in and added to the existing stopWords list.** If you have a list of your own then set `customStop` equal to **True**. If you do not have your own custom stopwords list then set `customStop` equal to **False**.

**spacyLem**<br>
`spacyLem` is where we decide if we want to use the spaCy package's lemmatization function. What is lemmatization? Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning.

Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, whereas stemming does not take the context of the word into account. For example, if we lemmatize the word "running" or "ran" it will become the word "run". A stemmer only strips suffixes, so it cannot turn "ran" into "run", and a naive one that simply removes "ing" would turn "running" into "runn". An aggressive stemmer may also reduce "police" and "policy" to the same stem, "polic", so that the model would treat them as the same word. The lemmatizer will leave both words as "police" and "policy".

Lemmatization is useful and recommended because it allows the algorithm to just consider "walk" instead of "walking", "walked", and "walk", and thereby can increase the accuracy of your results. To use the spaCy lemmatizer set `spacyLem` equal to **True**. If you do not wish to use the lemmatizer set `spacyLem` equal to **False**.
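
The difference between stemming and lemmatization can be illustrated with a deliberately simplified sketch. Neither function below is what spaCy or any real stemmer does: `naiveStem` only strips a few suffixes, and `toyLemmatize` looks words up in a tiny hand-written dictionary. Both names and the dictionary are invented for this example.

```python
def naiveStem(word):
    """Strip a few common suffixes without looking at meaning or context."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lookup standing in for a real lemmatizer's dictionary.
toyLemmas = {"running": "run", "ran": "run", "walked": "walk", "walking": "walk"}

def toyLemmatize(word):
    """Return the dictionary form if it is known, otherwise the word itself."""
    return toyLemmas.get(word, word)

for word in ["running", "ran", "walked", "policy"]:
    print(word, "->", naiveStem(word), "|", toyLemmatize(word))
```

The suffix stripper leaves "runn" for "running" and cannot relate "ran" to "run" at all, while the dictionary lookup maps both to "run".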

**stopLang**<br>
Now we choose the language we will be using for the nltk stopwords list. If you need a different language, simply change 'english' (keep the quotes) in the `stopLang` variable to the anglicized name of the language you wish to use (e.g. 'spanish' instead of 'espanol' or 'german' instead of 'deutsch'). If you need to see the list of available languages in nltk simply remove the `#` from in front of `#print(" ".join(stopwords.fileids()))` on the last line and run the cell. A list of available languages will print out.

**lemLang**<br>
Now we choose the language for our lemmatizer. The languages available for spacy include the list below and the abbreviation spacy uses for that language:

- **English:** en
- **Spanish:** es
- **German:** de
- **French:** fr
- **Italian:** it
- **Portuguese:** pt
- **Dutch:** nl
- **Multi-Language:** xx

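To choose a language simply type the two letter code following the anglicized language name in the list above. So for Spanish it would be `'es'` (with the quotes) and for German `'de'` and so on. Note that these two letter shortcuts work with the spaCy version in our environment; spaCy 3 and later no longer accept them, and there you would pass the full name of an installed model package instead (for example `'en_core_web_sm'` for English).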

**encoding, errors**<br>
The variable `encoding` is where you determine what type of encoding to use (ascii, ISO-8859-1, utf-8, etc...). We have it set to utf-8 at the moment as we have found it is less likely to have any problems. Encoding errors do still occur, however. They rarely impact our results, but by default they cause the Python code to exit. So instead of dealing with unhelpful errors we ignore the ones dealing with encoding by assigning `'ignore'` to the `errors` variable.
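
To see what `errors = "ignore"` actually does, here is a minimal sketch that decodes a few bytes directly instead of reading a file. The sample word and the variable names are chosen only for illustration.

```python
# "café" stored as ISO-8859-1 (Latin-1); its last byte, 0xE9, is not valid UTF-8 here.
rawBytes = "caf\u00e9".encode("iso-8859-1")

# Strict decoding (the default) raises an error, which would stop the notebook.
try:
    rawBytes.decode("utf-8")
    strictFailed = False
except UnicodeDecodeError:
    strictFailed = True

# errors="ignore" silently drops the byte that cannot be decoded.
ignored = rawBytes.decode("utf-8", errors="ignore")

print(strictFailed)
print(ignored)
```

The code keeps running, but the undecodable character is lost, so it is still worth setting `encoding` to match your files when you know what it is.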

**textColIndex**<br>
The `textColIndex` variable is only applicable if our `fileType` is ".csv" or ".json". The `textColIndex` variable is where we put the header name of the dataframe column that will contain the content we are interested in from our tweets. Generally the content of the tweets is labeled as "text" since this is the label given to the tweet content when it is pulled directly from the Twitter API. For this reason our default value assigned to the `textColIndex` is `"text"`. If for some reason the tweet content has a different label or header, and you need to change this, remember to keep the quotes around the new label.

**stopWords, docs**<br>
The `stopWords =[]` variable is simply an empty list. This is where the words from the nltk stopword list or your custom stopword list or both combined or neither (depending on what you decide) will reside later on. You do not need to do anything to this line of code.

The `docs = []` variable also does not need to have anything done to it as it is also an empty list that will be added to later.

In [5]:
source = "1599Hamlet"
fileType = ".txt"
docLevel = False
nltkStop = True
customStop = True
spacyLem = True
stopLang = 'english'
lemLang = 'en'
encoding = "utf-8"
errors = "ignore"
textColIndex = "text"
stopWords = []
docs = []
#print(" ".join(stopwords.fileids()))

### Stopwords
If you set `nltkStop` equal to **True** above then this will add the NLTK stopwords list to the empty list named `stopWords`.

You already chose your desired language above, so you do not need to do that now. 

If you need to add a few more words to the `stopWords` list that are specific to your dataset (such as common names or phrases that may make your results inaccurate), then add those to the `stopWords.extend(['would', 'said', 'says', 'also'])` part of the code in the square brackets with single quotes around each word and separated by a comma.

In [6]:
# NLTK Stop words
if nltkStop is True:
    stopWords.extend(stopwords.words(stopLang))

    stopWords.extend(['would', 'said', 'says', 'also', 'let', 'not', 'know', 'come', 'good'])

#### Add own stopword list

Here is where your own stopword list is added if you selected **True** in `customStop` above. Here you will need to change the folder names and file name to match your folders and file. Remember to put each folder name in quotes and in the correct order always putting the file name including the file extension (.txt) last.

In [7]:
if customStop is True:
    stopWordsFilepath = os.path.join(homePath, "Text-Analysis-master", "data", "earlyModernStopword.txt")

    with open(stopWordsFilepath, "r",encoding = encoding, errors = errors) as stopfile:
        stopWordsCustom = [x.strip() for x in stopfile.readlines()]

    stopWords.extend(stopWordsCustom)
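
To make the "one word per line" requirement concrete, the sketch below writes a tiny stopword file to a temporary folder and reads it back in the same way as the cell above, with one small variation: it also skips empty lines. The file name `exampleStopwords.txt` and its contents are made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpDir:
    examplePath = os.path.join(tmpDir, "exampleStopwords.txt")

    # Write a small example file: one word per line, plus stray spaces and a blank line.
    with open(examplePath, "w", encoding="utf-8") as out:
        out.write("thou\nthee\n  hath  \n\n")

    # Read it back a line at a time, trimming whitespace and skipping empty lines.
    with open(examplePath, "r", encoding="utf-8", errors="ignore") as stopfile:
        exampleStopwords = [line.strip() for line in stopfile if line.strip()]

print(exampleStopwords)
```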

### Reading in .txt files
The code below reads in text files if you chose `fileType = ".txt"` above. It can do this in two ways: it reads in either an entire directory or a single file, depending on what you chose for `source` above. Then it chunks your data, either by document or by line, depending on what you chose for `docLevel` above. 

In [8]:
if fileType == ".txt":
    paths = glob.glob(os.path.join(dataHome, "shakespeareDated", source + fileType))
    for path in paths:
        # Skip hidden files (file names that start with a dot)
        if os.path.basename(path).startswith('.'):
            continue
        with open(path, "r", encoding = encoding, errors = errors) as file:
            if docLevel is True:
                # Whole document is one chunk
                docs.append(file.read().strip('\n').splitlines())
            else:
                # Each non-empty line is its own chunk
                for line in file:
                    stripLine = line.strip()
                    if len(stripLine) == 0:
                        continue
                    docs.append(stripLine.split())
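
To make the two chunking modes concrete, here is a small sketch that applies the same logic to an in-memory string instead of a file. The helper name `chunkText` and the sample text are made up for illustration.

```python
def chunkText(text, docLevel):
    """Chunk one document's text by document (True) or by line (False)."""
    if docLevel:
        # The whole document is one chunk: a single list holding its lines.
        return [text.strip("\n").splitlines()]
    chunks = []
    for line in text.splitlines():
        stripLine = line.strip()
        if not stripLine:
            continue  # skip blank lines
        # Each non-empty line becomes its own chunk (a list of words).
        chunks.append(stripLine.split())
    return chunks

sample = "Who's there?\n\nNay answer me.\n"
print(chunkText(sample, docLevel=True))
print(chunkText(sample, docLevel=False))
```

With `docLevel=True` the sample yields one chunk, and with `docLevel=False` it yields one chunk per non-empty line.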

### Reading in .csv and .json files

If you chose `".csv"` as your `fileType` up above, then the first `if` statement in the code below reads in ".csv" files and saves the contents to a dataframe using the Pandas package. It will read in either an entire directory or a single ".csv" file depending on what you chose for `source` above. 

Once we have read in the ".csv" files using the Pandas `read_csv` function, we need to concatenate them if there are multiple. Because of this it is important that your ".csv" files have an identical column count and that each column has identical header names, or you will get errors. If you have a single ".csv" file then you should be fine for this step. We assign the combined result to the variable `cdf` so we can use it later.

Next we rebuild `cdf` as a pandas dataframe whose values are all stored as strings (`dtype = 'str'`). This allows for easier manipulation of the data in the next line.

Finally, we pull in the column containing the data we are interested in which we assigned to the variable `textColIndex` earlier and turn it into a list assigned to the variable `tweets`.

If you chose `".json"` for your fileType, then the second `if` statement will read in ".json" files and save the content to a dataframe using the Pandas package much like the ".csv" file process described above. The only difference is that we use the Pandas function `read_json` instead of `read_csv`. Everything else is exactly the same as what is described above in the ".csv" section. 

In [9]:
if fileType == ".csv":
    allFiles = glob.glob(os.path.join(dataHome, "twitter", "CSV", "Iran", source + fileType))     
    df = (pd.read_csv(f, engine = "python") for f in allFiles)
    cdf = pd.concat(df, ignore_index=True)
    cdf = pd.DataFrame(cdf, dtype = 'str')
    tweets = cdf[textColIndex].values.tolist()
if fileType == ".json":
    allFiles = glob.glob(os.path.join(dataHome, "twitter", "JSON", source + fileType))     
    df = (pd.read_json(f, encoding = encoding) for f in allFiles)
    cdf = pd.concat(df, ignore_index=True)
    cdf = pd.DataFrame(cdf, dtype = 'str')
    tweets = cdf[textColIndex].values.tolist()

### Data variable

Now we need to change our variable containing our data (either `docs` or `tweets` from above) to the variable `data` since this is the variable used going forward and it saves you from having to switch between `tweets` and `docs` later in the code. If you read in ".csv" or ".json" files then your data is saved in the `tweets` list and if you read in ".txt" files then it is in the `docs` list. This code says if the length of the `docs` list is greater than 0 then assign `docs` to the variable `data`. If the length of `tweets` is greater than 0 then assign `tweets` to the variable `data`.

If your data was in `tweets` then it most likely needs some additional cleaning. So the next chunk of code removes URLs and new line characters from the `data` variable if the length of `tweets` is greater than 0.

The last line prints the number of chunks in `data`. If you are reading in a single document a line at a time, as we are here, this is the number of non-empty lines in the document. If you are reading in whole documents, it is the number of documents.

In [11]:
if len(docs) > 0:
    data = docs
else:
    if len(tweets) > 0:
        data = tweets
        # Remove Urls
        data = [re.sub(r'http\S+', '', sent) for sent in data]
        # Remove new line characters
        data = [re.sub(r'\s+', ' ', sent) for sent in data]
print(len(data))

4154
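
The two cleaning steps applied to tweets can be tried on their own with the sketch below. The example tweets are invented; the regular expressions are the same ones used in the cell above.

```python
import re

exampleTweets = [
    "Breaking news https://t.co/abc123 read more",
    "Line one\nline two\t\tline three",
]

# Remove anything that starts with "http" up to the next whitespace character.
noUrls = [re.sub(r"http\S+", "", tweet) for tweet in exampleTweets]

# Collapse every run of whitespace (spaces, tabs, new lines) into a single space.
cleaned = [re.sub(r"\s+", " ", tweet) for tweet in noUrls]

print(cleaned)
```

Running the whitespace step after the URL step also tidies up the double space that removing a URL leaves behind.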


### Tokenizing

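This block of code separates each chunk of text into a list of individual words. In the process it also lowercases all the words, removes punctuation, and discards very short tokens (fewer than two characters by default). Setting `deacc=True` additionally strips accent marks from the words, so "café" becomes "cafe". If you wish to keep accent marks change `deacc=True` to `deacc=False`; punctuation is removed either way.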

In [12]:
def sentToWords(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes accent marks

dataWords = list(sentToWords(data))

if docLevel is True:
    for i in dataWords:
        print(i[:1])
else:
    print(dataWords[:1])

[['who', 'there']]
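
To see why "Who's there?" comes out as `['who', 'there']`, the sketch below imitates the main steps of the tokenizer using only the standard library. It is a rough approximation written for this example, not Gensim's actual implementation; the names `stripAccents` and `roughPreprocess` and the default length limits of 2 and 15 characters are assumptions modelled on `simple_preprocess`.

```python
import re
import unicodedata

def stripAccents(text):
    """Remove accent marks by dropping combining characters after decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def roughPreprocess(text, deacc=False, minLen=2, maxLen=15):
    """Rough stand-in for simple_preprocess: lowercase, tokenize, filter by length."""
    text = text.lower()
    if deacc:
        text = stripAccents(text)
    # Keep runs of letters only, so punctuation and digits never reach the output.
    tokens = re.findall(r"[^\W\d]+", text)
    return [t for t in tokens if minLen <= len(t) <= maxLen]

accented = "Caf\u00e9, na\u00efve!"  # "Café, naïve!" written with escapes

print(roughPreprocess("Who's there?", deacc=True))
print(roughPreprocess(accented, deacc=True))
print(roughPreprocess(accented, deacc=False))
```

The apostrophe splits "Who's" into "who" and "s", and the one-letter token "s" is then dropped for being too short.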


### Find Bigrams and Trigrams

This code will most likely not need to be adjusted. It creates a model of bigrams and trigrams in your dataset that occur frequently and then connects them with an underscore so Word2Vec will later consider them as one word. This is a good idea for items like 'new york' or 'new zealand' or 'Ho Chi Minh'. If we do not combine these frequently occurring phrases then 'new' and 'york' will be considered independently and give us less accurate results. 

Right now we have a `min_count` of 5 and a `threshold` of 100. The `min_count` is simply the minimum number of times the bigram or trigram needs to occur in order to be combined with an underscore. The `threshold` is a score that the bigram or trigram needs to exceed in order to be combined with an underscore. The score is determined by using this formula, where vocab_count is the size of the vocabulary: (bigram_count - min_count)\*vocab_count/(wordA_count \* wordB_count). So let's say we have the bigram "good_lord" and it appears 30 times in a text with a vocabulary of 10,000 words, where "good" appears 60 times in total and "lord" appears 40 times. With our `min_count` set to 5 we get the following: (30 - 5)\*10000/(60 \* 40) = 104.167. Since this score is above our `threshold` of 100, the ngram is considered important enough that "good_lord" will be combined with an underscore and viewed as one word from then on. Therefore, if you increase the `threshold`, you will get fewer bigrams and trigrams. If our threshold was set to 110, then "good" and "lord" would not be combined into "good_lord".
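
The worked example above can be checked with a few lines of Python. This is only a sketch of the formula as written in this notebook; the function name `phraseScore` and its argument names are made up for this example and are not part of Gensim.

```python
def phraseScore(bigramCount, wordACount, wordBCount, vocabCount, minCount):
    """Score from the formula above: a higher score means a stronger candidate phrase."""
    return (bigramCount - minCount) * vocabCount / (wordACount * wordBCount)

# The "good_lord" example: 30 joint occurrences, "good" 60 times, "lord" 40 times.
score = phraseScore(bigramCount=30, wordACount=60, wordBCount=40,
                    vocabCount=10000, minCount=5)
print(round(score, 3))

# Compare the score with two possible thresholds.
for threshold in (100, 110):
    print(threshold, score > threshold)
```

The score clears a threshold of 100 but not 110, which matches the explanation above.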

The Phraser function takes the model you built with the Phrases function and cuts down memory consumption of Phrases, by discarding model state not strictly needed for the bigram detection task.

Lastly, we take a look at the ngrams created from the first item in our dataset only, so the results are for only one chunk, not the whole dataset. We do this by counting the number of words that contain an underscore as this is used to connect the words in the ngram together. **NOTE:** The output is only to test if the ngrams work so you will probably see ngrams containing stopwords. We will create a few functions next and then apply them to remove stopwords, create bigrams, and lemmatize the chunk.

In [13]:
# Build the bigram and trigram models
bigram = Phrases(dataWords, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = Phrases(bigram[dataWords], threshold=100)  

# Removes model state from Phrases thereby reducing memory use.
bigramMod = Phraser(bigram)
trigramMod = Phraser(trigram)

# See bigram/trigram example
testNgram = trigramMod[bigramMod[dataWords[0]]]
char = "_"
nGrams = [s for s in testNgram if char in s]
            
pprint(Counter(nGrams))

Counter()


### Functions
We need to create functions in order to remove stopwords, build ngrams, and lemmatize our data. Any time you see `def` that means we are **DEF**ining a function. The `def` is followed by the name of the function being created and then in parentheses are the parameters required by the function. After the parentheses is a colon, which closes the first line of the definition, then one or more lines of indented code below. The indented code is the program statement or statements to be executed. Once you have created your function all you need to do in order to run it is call the function by name and make sure you have included all the required parameters in the parentheses. This allows you to call the function without having to write out all the code in the function every time you wish to perform that task.
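
As a minimal illustration of defining and then calling a function, consider the sketch below. The function `countLongWords` and the sample words are invented for this example and are not used elsewhere in the notebook.

```python
def countLongWords(words, minLength=5):
    """Return how many words have at least minLength characters."""
    return sum(1 for word in words if len(word) >= minLength)

line = ["long", "live", "the", "king", "barnardo"]

# Call the function by name, passing the required parameter.
print(countLongWords(line))

# Optional parameters can be overridden by name when calling.
print(countLongWords(line, minLength=4))
```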

### Some functions

Below are functions we are creating that perform certain tasks. First we are creating a function to remove the stopwords that are in our stopword list we created previously. Then we create functions to apply our bigram and trigram code from above. 

Lastly, if you set `spacyLem` equal to **True** above then we will create the `lemmatization` function. If you set it equal to **False** then it will not create the function.

In [14]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def removeStopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stopWords] for doc in texts]

def makeBigrams(texts):
    return [bigramMod[doc] for doc in texts]

def makeTrigrams(texts):
    return [trigramMod[bigramMod[doc]] for doc in texts]


if spacyLem is True:
    def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
        """Lemmatize each chunk, keeping only words with an allowed part of speech.

        See https://spacy.io/api/annotation for the tag names.
        """
        textsOut = []
        for sent in texts:
            # nlp is the spaCy language model loaded in the next cell
            doc = nlp(" ".join(sent))
            textsOut.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        return textsOut

Now we apply the functions. There are really only two parts where you may need to make changes and they are in the lines `dataWordsNgrams = makeBigrams(dataWordsNostops)` and `nlp = spacy.load(lemLang, disable=['parser', 'ner'])`.

The `dataWordsNgrams` variable is where you will change between either using the `makeBigrams` or `makeTrigrams` functions above. If you only want bigrams, then keep the code as it is. If you want both bigrams and trigrams to be considered by your model, then change the `makeBigrams` part to `makeTrigrams` and it will now calculate both bigrams and trigrams. 


The other line mentioned above only matters if you previously set `spacyLem` equal to **True**, and even then you may not need to change it. The line is `nlp = spacy.load(lemLang, disable=['parser', 'ner'])` and it is where you can disable the parser and named entity recognizer (ner). 

If you wish for your words to be parsed simply remove `'parser'` from the `disable=` bracket. Same for `'ner'`. If you wish to use both the parser and ner then just remove the `, disable=['parser', 'ner']` entirely (including the preceding comma), but leave the closing parenthesis. The reason we disable the parser and ner is that they slow down the lemmatization process and are not necessary to lemmatize our dataset.

Lastly, we print out the ngrams we find in the first chunk (document or line) of our data. Notice there are no trigrams included. This is because we applied only the `makeBigrams` function from above. If we had applied the `makeTrigrams` function we would have both bigrams and trigrams. Feel free to change this in the code as described above. If we set `spacyLem` equal to **True** then we will get the first 10 words, their lemmatized form (which sometimes is identical to the word being lemmatized), with their parts of speech tagging from the `lemmatization` function above. Below this is a list of the lemmatized bigrams from the first chunk of our data. If we set it to **False** then we will get bigrams from the first chunk that have not been lemmatized.

In [15]:
# Remove Stop Words
dataWordsNostops = removeStopwords(dataWords)

# Form Bigrams
dataWordsNgrams = makeBigrams(dataWordsNostops)

if spacyLem is True:
    # Initialize spacy language model, eliminating the parser and ner components
    nlp = spacy.load(lemLang, disable=['parser', 'ner'])
    
    # Do lemmatization tagging only noun, adj, vb, adv
    allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    dataLemmatized = lemmatization(dataWordsNgrams, allowed_postags=allowed_postags)
    lemmaPOS = []
    for sent in dataLemmatized:
        lemmaNLP = nlp(" ".join(sent))
        for token in lemmaNLP:
            lemmaPOS.append([token.text, token.lemma_, token.pos_])
    print(lemmaPOS[:10])
    

    # Find ngrams and count number of times they occur
    dataNgrams = [s for s in dataLemmatized[0] if char in s]
    
else:
    dataNgrams = [s for s in dataWordsNgrams[0] if char in s]
print(Counter(dataNgrams))

[['answer', 'answer', 'VERB'], ['stand', 'stand', 'VERB'], ['unfold', 'unfold', 'ADJ'], ['long', 'long', 'ADV'], ['live', 'live', 'ADJ'], ['king', 'king', 'NOUN'], ['barnardo', 'barnardo', 'NOUN'], ['carefully', 'carefully', 'ADV'], ['hour', 'hour', 'NOUN'], ['strike', 'strike', 'NOUN']]
Counter()


#### Getting Information

Now that our corpus is cleaned, we want to get some information about it, since some words have been removed, some may have been combined into one word by our bigram/trigram function, and others may have been lemmatized. We need some word counts as these will help inform some of our parameters for the word2vec function later.

It is important to note that we have an `if` `else` statement. This is used so that you will not need to make adjustments if you chose not to use the spacy lemmatizer above.

The first thing in both the `if` and the `else` part is to save our cleaned corpus as the variable `texts`.

However, our cleaned corpus is a list of lists, and our `Counter` function from the `collections` package needs everything in a single list. So we combine our list of lists into a single list in the `tokens = sum(texts, [])` part of the code. 

Next, we count how often every word appears in our cleaned and finalized list of words and ngrams. We assign this to the variable `count`. What we have just assigned to the variable `count` is a dictionary, meaning it has a {"key":"value", "key":"value"} configuration with the "key" being the word and the "value" being the number of times the word appears in our data.

Now we convert our `count` dictionary to a list sorted by the "values" in descending order. We assign this list to the variable `sortList`.

Now we print out the total number of words (`print(sum(count.values()))`), followed by the number of unique words (`print(len(count))`), and ending with the word at index 150 of `sortList` (the 151st most frequent word) and how many times it appears in our data (`print(sortList[150])`). You will want to change this index based on the size of your dataset. 

We print this information as it will inform our choices in the next cell.  

In [16]:
if spacyLem is True:
    # Create Corpus
    texts = dataLemmatized
    tokens = sum(texts, [])
else:
    # Create Corpus
    texts = dataWordsNgrams
    tokens = sum(texts, [])

count = Counter(tokens)
sortList = sorted(count.items(), key=lambda x:x[1], reverse = True)
print(sum(count.values()))
print(len(count))
print(sortList[150])

10985
3450
('norway', 12)
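
One possible way to use these counts when choosing `min_count` is to check how much of the vocabulary would survive different thresholds. The sketch below is only an illustration: the helper `vocab_after_min_count` is hypothetical, and the counts in `toy_count` are invented rather than taken from the Hamlet data.

```python
from collections import Counter

# Invented counts for illustration only.
toy_count = Counter(
    {"hamlet": 40, "lord": 25, "king": 12, "norway": 12, "ghost": 5, "cock": 1}
)


def vocab_after_min_count(count, min_count):
    """Return (unique words kept, total tokens kept) for a given threshold."""
    kept = {word: n for word, n in count.items() if n >= min_count}
    return len(kept), sum(kept.values())


for threshold in (1, 5, 12):
    n_words, n_tokens = vocab_after_min_count(toy_count, threshold)
    print(threshold, n_words, n_tokens)
```

With your own data you would pass the notebook's `count` variable instead of `toy_count` and try thresholds near the count printed for the word you inspected above.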


### Build vocabulary and train the model
Now we pass our corpus to `Word2Vec`, which builds the vocabulary and runs an initial round of training. Then we call `train` to continue training on the same corpus for the number of `epochs` given. The parameters in the `Word2Vec` function may need to be changed depending on your needs, your dataset, and the results of the word counts from the previous cell of code. The parameters do the following:

- **texts:** Passes in the variable we created above, letting Word2Vec know that this is the dataset we wish to use.
- **size:** The number of dimensions of the dense vector used to represent each token or word. If you have limited data, then size should be a much smaller value, since you would only have so many unique neighbors for a given word. If you have lots of data, it's good to experiment with various sizes. (In Gensim 4.0 and later this parameter is named `vector_size`.)
- **window:** The maximum distance between the target word and a neighboring word. Neighbors that sit farther than the window width to the left or the right are not considered related to the target word. In theory, a smaller window should give you terms that are more closely related.
- **min_count:** Minimum number of times a word needs to appear to be counted. This should be adjusted based on your dataset. This is where the number of times the 100th or 1000th (depending on your data) most frequently occurring word appears in your data might be helpful.
- **workers:** How many cores to use on a machine with multiple cores.
- **sg:** Training algorithm: 1 for skip-gram; 0 for continuous bag-of-words (CBOW).  In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. CBOW is faster while skip-gram is slower but does a better job for infrequent words. 
- **seed:** Sets the seed for reproducibility.
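
To make the `window` and `sg` descriptions above more concrete, here is a simplified sketch of how (target, context) pairs could be collected from one sentence with a fixed window. This is not Gensim's implementation: the function `context_pairs` is hypothetical, and real training adds steps such as randomly shrinking the window for each target, which are left out here.

```python
def context_pairs(sentence, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs


toy_sentence = ["long", "live", "king", "barnardo"]

# A narrow window only pairs each word with its immediate neighbors.
print(context_pairs(toy_sentence, window=1))

# A window wider than the sentence pairs every word with every other word.
print(len(context_pairs(toy_sentence, window=3)))
```

In skip-gram terms, each pair reads as "use the target to predict this context word". CBOW would instead gather all context words of one target and use them together to predict that target.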

In [17]:
# build vocabulary and train model
model = gensim.models.Word2Vec(
    texts,
    size=50,
    window=10,
    min_count=12,
    workers=1,
    sg = 1,
    seed = 42)
model.train(texts, total_examples=len(texts), epochs=10)

(23316, 109850)

### Let's find some word relationships
First, we have assigned the name of the output .csv file to the variable `w2vCSVfile`. You may wish to change the name of the file to better match your data.

Now we choose a word of interest and see what other words are associated with that word in the text. To look up a different word, change the word in quotes assigned to `w1`. You may also want to change the `topn` value, as this determines how many of the top words you will get in the .csv file output.

The output you see shows the ten words most similar to our word of interest. `Word2Vec` turns each word into a vector, and when two words are compared it computes the cosine similarity of their vectors. The numbers you see are the cosine similarity scores of our word of interest with the corresponding word in the table. They are listed from highest to lowest (most similar to least).
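
The ranking idea can be sketched in plain Python with made-up two-dimensional vectors. Everything here is illustrative: `toy_vectors`, `cosine_similarity`, and `toy_most_similar` are hypothetical names, and real word2vec vectors have many more dimensions (50 in this notebook).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors, chosen so the similarities are easy to check by hand.
toy_vectors = {
    "hamlet": [1.0, 0.0],
    "lord": [1.0, 1.0],
    "norway": [0.0, 1.0],
    "ghost": [-1.0, 0.0],
}


def toy_most_similar(word, vectors, topn):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar("hamlet", toy_vectors, topn=2))
```

Scores close to 1 mean the vectors point in nearly the same direction, scores near 0 mean they are unrelated, and negative scores mean they point in opposite directions.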

In [18]:
w2vCSVfile = 'word2vec.csv'
w1 = "hamlet"
topn = 30

wtv = model.wv.most_similar(positive=[w1], topn = topn)
df = pd.DataFrame(wtv)
df.to_csv(os.path.join(dataResults, w2vCSVfile))
df.head(10)

|   | 0 | 1 |
|---|---|---|
| 0 | lord | 0.759488 |
| 1 | dear | 0.74805 |
| 2 | life | 0.745671 |
| 3 | spirit | 0.719273 |
| 4 | nature | 0.716717 |
| 5 | great | 0.714609 |
| 6 | horatio | 0.711564 |
| 7 | man | 0.709857 |
| 8 | look | 0.706127 |
| 9 | love | 0.700015 |


Here we can compare two words to each other to see how similar they are in their usage. Just change the words in quotes to the ones you want to compare.

In [19]:
model.wv.similarity("ophelia", "hamlet")

0.53032297

## VOILA!!

This code was adapted from Kavita Ganesan at [http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.XFnQmc9KjUI](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.XFnQmc9KjUI). Accessed 02/05/2019.