# Introduction to Word Vectors in Python

This Jupyter Notebook is designed to walk you through the basics of creating a word embedding model using two of the most popular natural language processing libraries, Gensim and spaCy. The notebook shows you how to use both because each library has its own strengths and weaknesses.

### What is word embedding useful for?

In addition to letting you ask interesting questions of your textual data (for instance, which word is most similar to "king"?), word embeddings have other uses in natural language processing. A word embedding model can supply the input features for tasks such as text classification, which often improves the accuracy of these tasks. Because word embeddings capture how a word is used in context, words with similar meanings end up with similar vectors. This lets a downstream machine learning algorithm handle words that never appeared in its own training data, as long as the embedding model has a vector for them. Additionally, while Word2Vec is one of the most popular algorithms for constructing word embeddings, the Doc2Vec algorithm extends Word2Vec to learn a vector for each document as well, which lets you compare the semantics of entire documents. Doc2Vec is useful if you want to find the semantic similarities between two documents, and the two algorithms together let you break down a corpus at both the word level, using Word2Vec, and the document level, using Doc2Vec.
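To make the "most similar word" question concrete, here is a minimal sketch of how similarity between word vectors is commonly measured, using cosine similarity. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are invented for illustration; a trained model learns vectors with many more dimensions from a corpus, and this is not Gensim's own implementation.

```python
import math

# Hypothetical toy vectors, made up for this illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Return the other word whose vector points most nearly the same way."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("king", toy_vectors))
```

With these made-up numbers, "queen" points in nearly the same direction as "king" while "apple" does not, so "queen" is returned.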

### Anaconda

Anaconda is a distribution of Python that is designed to make library and package management easy. One of the benefits of using Anaconda is that it comes with many libraries pre-installed, along with popular tools such as the Spyder IDE and Jupyter Notebook.

### Downloading Anaconda

For Macs: https://docs.anaconda.com/anaconda/install/mac-os/

For Windows: https://www.anaconda.com/products/distribution

### How do I navigate this Jupyter Notebook?

This notebook is designed to be read from top to bottom. We consider this particular notebook to contain the core concepts that you need to get started with Word2Vec. The notebook uses a combination of text cells and code cells. The code cells contain real code that can be run in the notebook itself or brought over into your IDE of choice. To run a code cell, click the cell and then click the "run" button in the toolbar at the top. As a warning, some of the code blocks may not produce very useful results if they have been taken out of a larger block of code. Typically, the code will be explained line by line, and then the code, in its entirety, will be located in a single block at the end of each section.

# Word Embeddings Using Gensim

One of the first things that we need to do is make sure that all of the libraries that we need are installed. For this tutorial, we will be using the following libraries:

- **re** The re library gives us access to regular expressions, which make cleaning data much easier
- **os** The os library lets us interact with the operating system, for example to walk through folders
- **string** The string library provides string constants and helper functions. Since we are working with text data, this is useful
- **glob** The glob library finds files whose names match a pattern, such as a file extension. This will be useful for loading a set of models into memory
- **Path** The Path class, from the pathlib library, lets us build and work with file paths, including paths to files outside our current working directory
- **gensim** Gensim is the library that contains the implementation of Word2Vec that we are using
- **Word2Vec** We will access this implementation of Word2Vec through Gensim. Word2Vec is what actually converts our text data into vectors
- **pandas** The pandas library lets us work with dataframes, which make sorting and organizing data much easier


To install these libraries, refer back to the "Libraries" portion of the introduction to Python notebook. Note that re, os, string, glob, and pathlib are part of Python's standard library, so only gensim and pandas need to be installed. It is good coding practice to keep all of your imports at the top of your code, so we are going to load everything that we need for the entire tutorial here. The comment next to each import explains what the library is for.

In [2]:
# A good practice in programming is to place your import statements at the top of your code, and to keep them together

import re                                               # for regular expressions
import os                                               # to look up operating system-based info
import string                                           # to do fancy things with strings
import glob                                             # to locate a specific file type
from pathlib import Path                                # to access files in other directories
import gensim                                           # to access Word2Vec
from gensim.models import Word2Vec                      # to access Gensim's flavor of Word2Vec
import pandas as pd                                     # to sort and organize data

## Loading Your Data ##


### Loading Texts from a Folder ###

Next, we need to load our data into Python. It is a good idea to place your dataset somewhere that is easy to navigate to, such as a folder inside your Documents folder or the same repository as your code file. In either case, you will need to know the **file path** of the folder that holds your data. Then, we are going to tell the computer to iterate through that folder, pull the text from each file, and store it in a list. The code is written to process a folder of plain text files (.txt). These files can be anywhere within this folder, including in sub-folders.

A few important things to note:

1. When you are inputting your file path, you should use the **entire** file path. For example, on a Windows computer, that file path might look something like: C:/users/admin/Documents/MY_FOLDER

2. If you are having trouble getting your file path to load successfully, try using double backslashes in the file path or switching the direction of the slashes (Windows uses backslashes in its file paths, while Macs use forward slashes)

3. Remember, you can use a file path to a folder full of different types of files, but this code is only going to look for **.txt** files. If you want to work with different file types, you'll have to change the `endswith(".txt")` call. However, keep in mind that these files should always contain some form of plain text. For example, a Word document or a PDF won't work with this code.
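The slash advice in note 2 can be illustrated with a short sketch. The folder `C:\new_folder\texts` is a hypothetical path chosen because it contains `\n` and `\t`, which Python treats as a newline and a tab in an ordinary string. `PureWindowsPath` is used so that the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows path, written three ways.
plain = "C:\new_folder\texts"      # ordinary string: \n and \t become newline and tab
raw = r"C:\new_folder\texts"       # raw string: backslashes are kept exactly as typed
forward = "C:/new_folder/texts"    # forward slashes need no special treatment

print("\n" in plain)   # the ordinary string now contains a newline character
print("\n" in raw)     # the raw string does not

# The raw-string and forward-slash spellings describe the same Windows path.
print(PureWindowsPath(raw) == PureWindowsPath(forward))
```

Doubling each backslash (`"C:\\new_folder\\texts"`) has the same effect as the raw string.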

### The Code

Let's walk through what the code is doing before we run it. As the comments indicate, the code begins by storing the file path that you provided. That little "r" in front of the file path marks it as a raw string, which tells Python to keep every backslash exactly as typed instead of treating it as the start of a special character (this matters for Windows file paths). Then, two empty lists are initialized, one called `filenames` and one called `data`. `filenames` is going to be used to store the name of each file as the code is traversing (or walking through) the folder. `data` is going to hold all of the textual data from each .txt file.

The first set of `for` loops tells the computer "hey, find all of the files that end with .txt in this folder and save their filenames to our `filenames` list." The loops are nested because the code traverses subfolders as well: the outer loop visits every folder, and the inner loop checks the files inside it. So, you could provide a file path that points to a folder with tons of other folders nested at varying levels within that main folder, and the code will peek into each of these folders and pull out any file that ends with .txt.

The second code chunk takes that list of relevant filenames and tells the computer "open each file in this filename list, and dump whatever is in that file into our `data` list." As the computer works through the files, it will open a file, read it, and then close it. The `with` statement takes care of closing each file automatically once it has been read, which is an important step for saving memory. Otherwise, you could very well have over a hundred text files open. Remember, computers are actually pretty simple: they only do what you tell them to and nothing else.

In [3]:
dirpath = r'INSERT FILE PATH HERE' # get file path (you can change this)

filenames = []
data = []

# this for loop will run through folders and subfolders looking for a specific file type
for root, dirs, files in os.walk(dirpath, topdown=False):
    for name in files:
        # if you want a different file type, change this to a different ending
        if name.endswith(".txt"):
            filenames.append(os.path.join(root, name))

# this for loop then goes through the list of files, reads them, and then adds the text to a list
for filename in filenames:
    # the with statement closes the file automatically once the block ends
    with open(filename) as afile:
        data.append(afile.read()) # read the file and then add it to the list
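Since `Path` is already among our imports, the same folder walk can also be sketched with pathlib. This is an alternative under stated assumptions, not the approach the rest of the notebook relies on: the helper name `load_texts` is hypothetical, the files are assumed to be UTF-8 encoded, and the demo runs on a temporary folder created just for illustration.

```python
import tempfile
from pathlib import Path


def load_texts(folder):
    """Return (filenames, data) for every .txt file under folder, subfolders included."""
    # rglob searches the folder recursively; sorting keeps the order predictable
    paths = sorted(p for p in Path(folder).rglob("*.txt") if p.is_file())
    filenames = [str(p) for p in paths]
    # read_text opens, reads, and closes each file (UTF-8 is an assumption)
    data = [p.read_text(encoding="utf-8") for p in paths]
    return filenames, data


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("first text", encoding="utf-8")
    (root / "sub" / "b.txt").write_text("second text", encoding="utf-8")
    (root / "notes.md").write_text("not a .txt file", encoding="utf-8")
    demo_filenames, demo_data = load_texts(root)

print(len(demo_filenames))
print(demo_data)
```

To use it on your own data, call `load_texts(dirpath)` with the same `dirpath` as above.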


### OPTIONAL: Loading Data from a Spreadsheet

Gensim is pretty versatile in that it doesn't particularly care **where** your text data comes from, as long as it is machine readable. Let's take, for example, a researcher who, instead of individual text files, has a spreadsheet where one column records where the text is sourced from (an online database, for example) and one column contains the actual text that the researcher is interested in. Converting a spreadsheet like this to plain text and feeding it into Gensim is actually really simple.

Begin by saving your spreadsheet in CSV format. CSV (comma-separated values) is a plain-text format, unlike an Excel .xls or .xlsx file, so our code will be able to read the spreadsheet directly. Once you have your CSV file, you are going to run the following code:

In [4]:
col_list = ["cluster", "text"] # columns you want to use (change these to match your spreadsheet)

df = pd.read_csv(r'FILEPATH TO CSV FILE/file.csv', usecols=col_list)

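If pandas is not available, the same two columns can be pulled out with the standard library's `csv` module. The sketch below is only an illustration: the sample rows are made up and held in memory in place of a real file, and the `notes` column is a hypothetical extra column that gets dropped.

```python
import csv
import io

# Made-up CSV content standing in for a real file on disk.
sample_csv = io.StringIO(
    "cluster,text,notes\n"
    "1,The king spoke to the court.,ignored\n"
    '2,"The queen, however, disagreed.",ignored\n'
)

col_list = ["cluster", "text"]  # columns you want to keep

# DictReader uses the header row to turn each line into a dictionary
reader = csv.DictReader(sample_csv)
rows = [{col: row[col] for col in col_list} for row in reader]

# one string per spreadsheet row, ready for cleaning and tokenizing
texts = [row["text"] for row in rows]
print(texts)
```

With a real file, replace `sample_csv` with a file opened as `open(path, newline="", encoding="utf-8")`, where the encoding is an assumption about your data.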