## Working With Text, Simplified

Generally, projects that work with text data follow the same overall pattern as any other projects. The main difference is that text projects usually require a bit more cleaning and preprocessing than regular data, in order to get the text into a format that's usable for modeling. 

Here are some of the ways that NLTK can make our lives easier when working with text data:

* **_Stop Word Removal_**. NLTK contains a full library of stop words, making it easy to remove the words that don't matter from our data.  
<br>  
* **_Filtering and Cleaning_**. NLTK provides simple ways to create and filter Frequency Distributions, as well as multiple ways to clean, stem, lemmatize, or tokenize datasets.
<br>  
* **_Feature Selection and Feature Engineering_**. NLTK contains tools to quickly generate features such as bigrams and ngrams. It also includes major resources such as the **_Penn Tree Bank_** to allow quick feature engineering, such as generating Part-of-Speech tags or sentence polarity.


### Regular Expressions (Regex) and Use Cases for NLP

Regex is especially useful for Natural Language Processing, because just about any document you work with on an NLP task contains a large amount of text. One of the more common NLP-specific use cases for regex is during the tokenization stage, to define the rules for where we should split strings into separate tokens. As an example, NLTK's basic `word_tokenize` function splits a contraction at the apostrophe--`"they're"` gets broken into `['they', "'re"]`. This is because the default tokenizer follows Penn Treebank conventions, which treat the contracted part as its own token. When preprocessing text data, it's quite common to use some small regex patterns to create a tokenization scheme better suited to the task, so that our tokenizer treats words like `"they're"` as a single token.

### Creating Basic Patterns

Regex is only as good as the **_Patterns_** we create. We can use these patterns to find, or to replace text. There are many, many things we can do with regex, and covering them all is outside the scope of this lesson. Instead, we'll just focus on some of the more useful, basic patterns that allow us to begin using regex to work with text data. 

Let's take a look at a basic regex pattern, to get a feel for what they look like. 

```python
import re

sentence = 'he said that she said "hello".'
pattern = 'he'

# Compile the pattern, then search the sentence for every match
p = re.compile(pattern)
p.findall(sentence)  # Output will be ['he', 'he', 'he']
```

We define a pattern by just writing it in a Python string. We can then use the regular expressions library, `re`, to compile this pattern. Once we have a compiled pattern, we just need to pass a string to its `findall()` method, and it will return every instance of that pattern in the string. 

For people new to regex, the results from the pattern above might be surprising at first. The pattern successfully matches the word 'he', but it also matches the letters 'he' that are found inside of the words 'she' and 'hello'.  Subsequences inside of larger sequences are fair game to regex. If we just wanted to match the word 'he', we would need to specify that the pattern needs to start and end with a space, or use **_anchors_** for things like word boundaries. 
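As a minimal sketch of the anchor approach, the word-boundary anchor `\b` can be wrapped around the same pattern. The variable names here are only for illustration:

```python
import re

sentence = 'he said that she said "hello".'

# \b matches the position between a word character and a non-word character,
# so 'he' inside 'she' or 'hello' no longer qualifies
whole_word = re.compile(r'\bhe\b')
matches = whole_word.findall(sentence)
print(matches)  # ['he']
```

Unlike requiring a space on both sides, `\b` also works when the word sits at the very start or end of the string.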

### Ranges

Obviously, we don't want to have to explicitly type every valid match for any search into our pattern. That would defeat the purpose. Luckily, we don't have to type every possible uppercase letter to match on uppercase letters. Instead, we can use a **_Range_** such as `[A-Z]`. This will match any uppercase letter. Ranges are always inside of square brackets. We can put many things inside of ranges at the same time, and regex will match on any of them. For instance, if we wanted to find any uppercase letter, lowercase letter, or digit, we could use `[A-Za-z0-9]`. 


### Character Classes

Character classes are a special case of ranges. Since it's quite a common task to use ranges to do things like match on words or numbers, regex actually includes character classes as a shortcut. For instance, we could use `\d` to match any digit--this is equivalent to using `[0-9]`. We could also use `\w` to match any word character (a letter, digit, or underscore). In the same vein, we can use `\D` to get anything that _isn't_ a digit, or `\W` to match anything that isn't a word character. There are a few other types of character classes as well. For a full list, check out the cheat sheet below!
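To make the shortcuts concrete, here is a small sketch that runs a few character classes over a made-up string. The `+` after each class simply means "one or more in a row":

```python
import re

text = 'Order 66 shipped on 2024-01-15!'

numbers = re.findall(r'\d+', text)      # runs of digits
same_numbers = re.findall(r'[0-9]+', text)  # the equivalent explicit range
words = re.findall(r'\w+', text)        # runs of letters, digits, or underscores
non_words = re.findall(r'\W', text)     # single characters that are not word characters

print(numbers)  # ['66', '2024', '01', '15']
print(words)    # ['Order', '66', 'shipped', 'on', '2024', '01', '15']
```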

### Groups and Quantifiers

Groups are kind of like ranges, but they specify an exact pattern to match on. Groups are denoted by parentheses. Whereas `[A-Z0-9]` matches on any uppercase letter or any digit, `(A-Z0-9)` will only match on the sequence `'A-Z0-9'` exactly. This becomes much more useful when paired with **_Quantifiers_**, which allow us to specify how many times a group should happen in a row. If we want to specify an exact number of times, we can use curly braces. For instance, a group followed by `{3}` will only match on patterns that have that group repeated exactly 3 times. The most common quantifiers are usually:

* `*` (0 or more times)
* `+` (1 or more times)
* `?` (0 or 1 times)

In this way, we can fill a grouping with any pattern, and specify the number of times we expect to see that pattern. When we include things like ranges, groupings, and quantifiers together, it becomes easy to write a pattern that can match complex things, like email addresses--take a look at the example provided below, and see if you can figure out how it works!

`'([A-Za-z]+)@([A-Za-z]+)\.com'` 

This pattern matches basic email addresses like 'joe@gmail.com', but not 'john.doe@gmail.com', or 'joe@stanford.edu'. Take a look at the pattern again--how would you need to modify the pattern in order for it to match either of those, as well?
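One possible answer is sketched below. It is not the only way to extend the pattern, and it is still far from a complete email validator. `fullmatch` is used so that the whole string has to fit the pattern. The `extended` pattern is an illustration, not part of the original lesson:

```python
import re

basic = re.compile(r'([A-Za-z]+)@([A-Za-z]+)\.com')

# Assumed extension: allow dots in the user name, and accept .com or .edu
extended = re.compile(r'([A-Za-z.]+)@([A-Za-z]+)\.(com|edu)')

addresses = ['joe@gmail.com', 'john.doe@gmail.com', 'joe@stanford.edu']

basic_results = [bool(basic.fullmatch(a)) for a in addresses]
extended_results = [bool(extended.fullmatch(a)) for a in addresses]

print(basic_results)     # [True, False, False]
print(extended_results)  # [True, True, True]
```

Adding `.` inside the range handles dotted user names, and the `(com|edu)` group uses `|` to accept either ending.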

### Cheat Sheet Link

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

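#### Sample Regex for Phone Number

    # `file` is assumed to hold the text to search; raw strings keep the backslashes intact
    pattern = r'(\(\d{3}\) (\d{3}-\d{4}))'
    p = re.compile(pattern)
    digits = p.findall(file)  # two groups, so each match is a (full number, local number) tuple
    digits

#### Sample Regex for Price

    pattern = r'(\$\d+\.?\d*)'
    p = re.compile(pattern)
    digits = p.findall(file)
    digits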

## Common Approaches to NLP Feature Engineering

As you've likely noticed by now, working with text data comes with **_a lot_** of ambiguity. When all we start with is an arbitrarily-sized string of words, there's no clear answer as to what sorts of features we should engineer, or even where we should start! The goal of this lesson is to provide a framework for working with text data, and help us figure out exactly what sorts of features we should create when working with text data. 

In this lesson, we'll focus on the following topics:

* Stopword Removal
* Frequency Distributions
* Stemming and Lemmatization
* Bigrams, Ngrams, and Mutual Information Score

### Removing Stop Words

When working with text data, one of the first steps to try is to remove the **_Stop Words_** from the text. One common feature of text data (regardless of language!) is the inclusion of stop words for grammatical structure. Words such as "a", "and", "but", and "or" are examples of stop words. While a sentence would be both grammatically incorrect and hard to understand without them, from a modeling standpoint, stop words provide little to no actual value. If we create a **_Frequency Distribution_** to see the number of times each word is used in a corpus, we'll almost always find that the top spots are dominated by stop words, which tell us nothing about the actual content of the corpus. Removing stop words allows us to reduce the overall dimensionality of our dataset (which is always a good thing), while also distilling the overall vocabulary of our Bag-Of-Words down only to the words that really matter. 

_NLTK_ makes it extremely easy to remove stopwords. The library includes a full corpus of all stopwords for all the languages NLTK supports. Since we usually only want the stopwords relevant to the language our text data is in, NLTK even makes it easy to filter out the unneeded stop words and grab only the ones that pertain to our problem. 

The following example shows how we can get all the stopwords for English from NLTK:

```python
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')

# It is generally a good idea to also remove punctuation
import string

# Now we have a list that includes all english stopwords, as well as all punctuation
stopwords_list += list(string.punctuation)
```

Once we have a list of stopwords, we can easily remove them from our text data after we've tokenized our data. Recall that we can easily tokenize text data using NLTK's `word_tokenize` function. Once we have a list of word tokens, all we need to do is use a list comprehension, and omit any tokens that can be found in our stopwords list.  For example:

```python
from nltk import word_tokenize

tokens = word_tokenize(some_text_data)

# Lowercase each token before checking it, so that capitalized stop words such as 'The' are removed too
stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
```

### Frequency Distributions

Once we have tokenized our data and removed all the stop words, the next step is usually to explore our text data through a **_Frequency Distribution_**. This is just a fancy way of saying that we create a histogram that tells us the total number of times each word is used in a given corpus. 

Once we have tokenized our text data, we can use NLTK to easily create a Frequency Distribution using `nltk.FreqDist()`. A Frequency Distribution is analogous to a Python dictionary, with a few more bells and whistles attached to make it easier to use for NLP tasks. Each key is a word token, and each value is the corresponding number of times that token appeared in the tokenized corpus given to the `FreqDist` object at instantiation. 

We can easily filter a FreqDist to see the most common words by using the built-in method, as seen below:

```python
from nltk import FreqDist
freqdist = FreqDist(tokens)

# get the 200 most common words 
most_common = freqdist.most_common(200)
```

Once we have the most common words, we can easily use this to filter out the text and reduce the dimensionality of particularly large datasets, as needed. 
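A rough sketch of that filtering step is shown here with `collections.Counter` standing in for `FreqDist` (NLTK's `FreqDist` is a `Counter` subclass, so `most_common` behaves the same way). The token list and the cutoff of 2 are made up for illustration:

```python
from collections import Counter

tokens = ['dog', 'cat', 'dog', 'bird', 'dog', 'cat', 'fish']

counts = Counter(tokens)

# Keep only the 2 most frequent words as our reduced vocabulary
top_words = {word for word, _ in counts.most_common(2)}
filtered = [t for t in tokens if t in top_words]

print(filtered)  # ['dog', 'cat', 'dog', 'dog', 'cat']
```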

### Stemming and Lemmatization

Consider the words 'run', 'running', 'ran', and 'runs'. If we create a basic frequency distribution, each of these words will be treated as a separate token. After all, they are different words. However, we know that they pretty much mean the same thing. Counting these words as individual separate tokens can sometimes hurt our model by needlessly increasing dimensionality, and hiding important information from our model. Although we instinctively know that those four words are all talking about the same action, our model will default to thinking that they are four completely different concepts. The way we deal with this is to remove suffixes through techniques such as **_Stemming_** or **_Lemmatization_**.

People often get stemming and lemmatization confused, because they are extremely similar. They generally accomplish the same task, but they use different means to do so. 

**_Stemming_** follows a predetermined set of rules to reduce a word to its _stem_.  Words like 'running' and 'runs' will be reduced down to 'run', because the stemmer contains rules that understand how to deal with suffixes such as '-ing' and '-s'. One of the most widely used stemmers is the **_Porter Stemmer_**. For code samples demonstrating how to use it, check out NLTK's documentation for the [Porter Stemmer](http://www.nltk.org/howto/stem.html).
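To give a feel for what "a predetermined set of rules" means, here is a deliberately tiny suffix-stripping function. It is a toy written for this note, not the Porter algorithm, and `toy_stem` is a hypothetical name:

```python
def toy_stem(word):
    """Strip a few common suffixes using fixed rules (toy example, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undouble a trailing consonant pair, e.g. 'runn' -> 'run'
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


stems = [toy_stem(w) for w in ["run", "running", "runs", "ran"]]
print(stems)  # ['run', 'run', 'run', 'ran']
```

Notice that 'ran' is left alone: no suffix rule applies to it, which is the kind of case that rule-based stemming cannot handle.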

**_Lemmatization_** differs from stemming in that it reduces each word down to a linguistically valid **_lemma_**, or root word. It does this through stored linguistic mappings. Lemmatization is generally more complex, but also more accurate. This is because the rules that guide things like the Porter Stemmer are good, but far from perfect. For example, stemmers commonly deal with the suffix `-ed` by just dropping it from the word. This usually works, until it runs into an edge case like the word 'agreed'. When stemmed, 'agreed' becomes 'agre'. Lemmatization does not make this mistake, because it contains a mapping for the word that tells it what 'agreed' should be reduced down to. Generally, most lemmatizers make use of the famous **_WordNet_** lexical database. 

NLTK makes it quite easy to make use of lemmatization, as demonstrated below:

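```python
from nltk.stem.wordnet import WordNetLemmatizer

# Requires the WordNet data, e.g. via nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize('feet') # foot
# Words are treated as nouns by default, so verbs need pos='v'
lemmatizer.lemmatize('running', pos='v') # run
```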

### Bigrams and Mutual Information Score

Another alternative to tokenization is to instead create **_Bigrams_** out of the text. A bigram is just a pair of adjacent words, treated as a single unit. 

Consider the sentence "the dog played outside". If we created bigrams out of this sentence, we would get `('the', 'dog'), ('dog', 'played'), ('played', 'outside')`. From a modeling perspective, this can be quite useful, because sometimes pairs of words are greater than the sum of their parts. Note that bigrams are just a special case of **_ngrams_**--we can choose any number of words for a sequence. Alternatively, it's quite common to create ngrams at the character level, rather than the word level. 
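The sliding-window idea behind bigrams and ngrams can be sketched in a few lines of plain Python. NLTK ships its own helpers for this (such as `nltk.ngrams`), so the `ngrams` function below is just an illustrative stand-in:

```python
def ngrams(sequence, n):
    """Return every run of n consecutive items in the sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


tokens = "the dog played outside".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# The same function works at the character level
char_trigrams = ["".join(gram) for gram in ngrams("played", 3)]

print(bigrams)        # [('the', 'dog'), ('dog', 'played'), ('played', 'outside')]
print(char_trigrams)  # ['pla', 'lay', 'aye', 'yed']
```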

One handy feature of bigrams is that we can apply a frequency filter to only keep bigrams that show up more than a set number of times. In this way, we can get rid of all bigrams that only occur because of random chance, and keep the bigrams that must mean something, because they occur together multiple times. How strict your frequency filter should be depends on a number of factors, and generally, it's something you'll have to experiment with to get right. However, a minimum frequency filter of 5 is a common starting point. 

Another way we can make use of bigrams is to calculate their **_Pointwise Mutual Information Score_**. This is a statistical measure from information theory that generally measures the mutual dependence between two words. In plain English, this measures how much information the bigram itself contains by computing the dependence between the two words in the bigram. For instance, the bigram `('San', 'Francisco')` would likely have a high mutual information score, because when these tokens appear in the text, it is highly likely that they appear together, and unlikely that they appear next to other words. 
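For intuition, here is a minimal sketch of the textbook definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw counts on a made-up sentence. NLTK's own implementation may normalize the probabilities slightly differently, so treat `pmi_scores` as an illustration only:

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair with pointwise mutual information."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "san francisco is foggy but i love san francisco and i love fog".split()
scores = pmi_scores(tokens)

# 'san' is always followed by 'francisco', but 'love' is followed by two different words
print(scores[('san', 'francisco')] > scores[('love', 'san')])  # True
```

In this toy corpus, a pair of words that each appear only once, such as `('is', 'foggy')`, scores even higher than `('san', 'francisco')`. That inflation of rare pairs is one reason a frequency filter is usually applied before ranking bigrams by PMI.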

In practice, you don't need to worry too much about how to calculate Mutual Information, because NLTK provides an easy way to do this for us. We'll explore this in detail in the next lab. Instead, your main takeaway on this topic should be that Mutual Information scores are a type of feature that you can engineer for text data that may provide good information for you when it comes to exploring the text data or fitting a model to it. 

### Vectorization Strategies

Once we've cleaned and tokenized our text data, we can convert it to vectors. However, there are a few different ways we can do this. Depending on our goals and our dataset, some may be more useful than others. 

#### Count Vectorization

[sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

One of the most basic, but useful ways of vectorizing text data is to simply count the number of times each word appears in the corpus.  If working with a single document, we just create a single vector, where each element in the vector corresponds to the count of a unique word in the document. If working with multiple documents, we would store everything in a DataFrame, with each column representing a unique word, while each row represents the count vector for a given document. 

| Document | Aardvark | Apple | ... | Zebra |
|:--------:|:--------:|:-----:|-----|-------|
|     1    |     0    |   3   | ... | 1     |
|     2    |     1    |   2   | ... | 0     |

Note that we do not need to have a column for every word in the English language--just a column for each word that shows up in the total vocabulary of our document or documents. If we have multiple documents, we just combine the unique words from each document to get the total dimensionality that allows us to represent each. If a word doesn't show up in a given document, that's fine--that just means the count is 0 for that row and column. 
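The table above can be reproduced with a short plain-Python sketch. This is not how scikit-learn's `CountVectorizer` is implemented, just the same idea on two made-up documents:

```python
from collections import Counter

documents = [
    "apple zebra apple apple",
    "aardvark apple apple",
]
tokenized = [doc.split() for doc in documents]

# The vocabulary is the set of unique words across all documents
vocabulary = sorted({word for doc in tokenized for word in doc})

count_vectors = []
for doc in tokenized:
    counts = Counter(doc)
    # A Counter returns 0 for words that never appear in this document
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)     # ['aardvark', 'apple', 'zebra']
print(count_vectors)  # [[0, 3, 1], [1, 2, 0]]
```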

#### TF-IDF Vectorization

[sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

TF-IDF stands for **_Term Frequency-Inverse Document Frequency_**.  It is a combination of two individual metrics, which are the TF and IDF, respectively. TF-IDF is used when we have multiple documents. It is based on the idea that rare words contain more information about the content of a document than words that are used many times throughout all the documents. For instance, if we treated every article in a newspaper as a separate document, looking at the number of times the word "he" or "she" is used probably doesn't tell us much about what that given article is about--however, the number of times "touchdown" is used can provide good signal that the article is probably about sports.
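As a rough sketch, one common textbook formulation multiplies a word's relative frequency within a document (TF) by the log of the total number of documents divided by the number of documents containing the word (IDF). scikit-learn's `TfidfVectorizer` uses a smoothed variant, so its numbers will differ from this illustration. The three tiny documents are made up:

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document (unsmoothed textbook TF-IDF)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))

    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "he scored a touchdown".split(),
    "he voted in the election".split(),
    "she said he left".split(),
]
weights = tf_idf(docs)

print(weights[0]['he'])             # 0.0, because 'he' appears in every document
print(weights[0]['touchdown'] > 0)  # True, because 'touchdown' is rare
```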

## NLP on Macbeth

    #import packages
    import nltk
    from nltk.corpus import gutenberg, stopwords
    from nltk.collocations import *
    from nltk import FreqDist
    from nltk import word_tokenize
    import string
    import re
    #look at available corpora
    file_ids = gutenberg.fileids()
    file_ids
    #get macbeth and verify first 1000 characters
    macbeth_text = gutenberg.raw('shakespeare-macbeth.txt')
    print(macbeth_text[:1000])
    #Preprocessing the Data
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)" #keeps only letter sequences (dropping numbers) and allows an apostrophe inside a word
    macbeth_tokens_raw = nltk.regexp_tokenize(macbeth_text, pattern) #tokenize data
    macbeth_tokens = [word.lower() for word in macbeth_tokens_raw] #lowercase all

    #Create a Frequency Distribution
    macbeth_freqdist = FreqDist(macbeth_tokens)
    macbeth_freqdist.most_common(50)

    #Remove stopwords
    stopwords_list = stopwords.words('english') #gets all english stopwords
    stopwords_list += list(string.punctuation) #adds punctuation to stopwords
    stopwords_list += ['0','1','2','3','4','5','6','7','8','9'] #adds all numbers to stopwords
    macbeth_words_stopped = [word for word in macbeth_tokens if word not in stopwords_list]

    #Recreate a frequency distribution
    macbeth_stopped_freqdist = FreqDist(macbeth_words_stopped)
    macbeth_stopped_freqdist.most_common(50)
    #check vocabulary length
    len(macbeth_stopped_freqdist)

    #Normalize the frequency distribution as a proportion of all words
    total_word_count = sum(macbeth_stopped_freqdist.values())
    macbeth_top_50 = macbeth_stopped_freqdist.most_common(50)
    print("Word\t\t\tNormalized Frequency")
    for word in macbeth_top_50:
        normalized_frequency = word[1] / total_word_count
        print("{} \t\t\t {:.4}".format(word[0], normalized_frequency))

    #Create Bigrams
    bigram_measures = nltk.collocations.BigramAssocMeasures() #association measures used to score bigrams
    macbeth_finder = BigramCollocationFinder.from_words(macbeth_words_stopped) #instantiate w/ text
    macbeth_scored = macbeth_finder.score_ngrams(bigram_measures.raw_freq) #grabs scores
    macbeth_scored[:50]

    #Get Pointwise Mutual Information Scores
    macbeth_pmi_finder = BigramCollocationFinder.from_words(macbeth_words_stopped) #create another finder
    macbeth_pmi_finder.apply_freq_filter(5) #apply frequency filter at 5
    macbeth_pmi_scored = macbeth_pmi_finder.score_ngrams(bigram_measures.pmi) #calculate pmi scores
    macbeth_pmi_scored[:50] #grab first 50 scores

## Context Free Grammar and POS Tagging

#### CFG meaning
Speech contains an underlying "deep structure" that we recognize, regardless of the actual content of the sentence. We don't need any context about what the sentence is about to determine whether its grammar is correct.

#### Sentence Structure and Parsing
One way that we can help a computer understand how to interpret a sentence is to create a CFG for it to use when parsing. The CFG defines the rules of how sentences can exist. 

#### Sample Sentence Structure
* `S -> NP VP` A sentence (S) consists of a Noun Phrase (NP) followed by a Verb Phrase (VP).
* `PP -> P NP` A Prepositional Phrase (PP) consists of a Preposition (P) followed by a Noun Phrase (NP)
* `NP -> Det N | Det N PP | 'I'` A Noun Phrase (NP) can consist of:
    * a Determiner (Det) followed by a Noun (N), or (as denoted by `|`) 
    * a Determiner (Det) followed by a Noun (N), followed by a Prepositional Phrase (PP), or
    * The token `'I'`.
* `VP -> V NP | VP PP` A Verb Phrase can consist of:
    * a Verb (V) followed by a Noun Phrase (NP) or
    * a Verb Phrase (VP) followed by a Prepositional Phrase (PP)
* `Det -> 'an' | 'my'` Determiners are the tokens 'an' or 'my'
* `N -> 'elephant' | 'pajamas'` Nouns are the tokens 'elephant' or 'pajamas'
* `V -> 'shot'` Verbs are the token 'shot'
* `P -> 'in'` Prepositions are the token 'in'
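To see these rules in action, the sketch below encodes the same grammar as a plain Python dictionary and counts how many distinct parse trees a sentence has. It is a small hand-rolled illustration (the `count_parses` helper is hypothetical), not NLTK's parser, which provides `nltk.CFG.fromstring` and `nltk.ChartParser` for real use:

```python
from functools import lru_cache

# Non-terminals map to their alternatives; any symbol that is not a key is a terminal token
GRAMMAR = {
    "S":   [("NP", "VP")],
    "PP":  [("P", "NP")],
    "NP":  [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP":  [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N":   [("elephant",), ("pajamas",)],
    "V":   [("shot",)],
    "P":   [("in",)],
}


def count_parses(tokens, start="S"):
    """Count the distinct parse trees that derive `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def symbol(sym, i, j):
        # Number of ways `sym` can derive tokens[i:j]
        if sym not in GRAMMAR:
            return int(j - i == 1 and tokens[i] == sym)
        return sum(sequence(rhs, i, j) for rhs in GRAMMAR[sym])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # Number of ways the symbols in `rhs` can jointly derive tokens[i:j]
        if len(rhs) == 1:
            return symbol(rhs[0], i, j)
        # Try every split point; each symbol must cover at least one token
        return sum(
            symbol(rhs[0], i, k) * sequence(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return symbol(start, 0, len(tokens))


sentence = "I shot an elephant in my pajamas".split()
n_parses = count_parses(sentence)
print(n_parses)  # 2
```

The two parses correspond to the two readings of the sentence: either the shooting happened in my pajamas (`VP -> VP PP`), or the elephant was in my pajamas (`NP -> Det N PP`). The grammar alone cannot tell us which one was intended.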

#### Generating POS Tags with NLTK
Generating POS tags is very simple with NLTK--all we need to do is pass in a tokenized version of our corpus and NLTK will return a list of tuples containing each token and its part of speech.

    #generating part of speech tags for tokens
    nltk.download('averaged_perceptron_tagger')
    nltk.pos_tag(tokenized_sent)

Note that the abbreviations NLTK uses for their POS tags come from the Penn Tree Bank, and won't be immediately recognizable to you. To understand what these tags stand for, take a look at this [reference list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

## Text Classification

#### Questions to ask before dealing with text data
* Do we remove stop words, or not?  
<br>  
* Do we stem or lemmatize our text data, or leave the words as is?  
<br>  
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?  
<br>  
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?  
<br>  
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?  
<br>  
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec? 

#### Feature Engineering in Text Classification
Experiment and treat the entire project as an iterative process! When working with text data, don't be afraid to try modeling on alternative forms of the text data, such as bigrams or ngrams. Similarly, explore how adding in additional features such as POS tags or mutual information scores affects the overall model performance.

## Text Classification for Article Class Prediction
    #get packages
    import nltk
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize, FreqDist
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import MultinomialNB
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    
    #set classes to predict and remove unnecessary parts of the articles
    categories = ['alt.atheism', 'comp.windows.x', 'rec.sport.hockey', 'sci.crypt', 'talk.politics.guns']
    newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers','quotes'))
    newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers','quotes'))
    
    #set target, labels and relevant data
    data = newsgroups_train.data
    target = newsgroups_train.target
    label_names = newsgroups_train.target_names
    
    #create stopwords list
    stopwords_list = stopwords.words('english')
    punc_list = list(string.punctuation)
    stopwords_list += punc_list
    other_list = ["''", '""','...', '`']
    stopwords_list += other_list
    
    #iterate through data to tokenize and remove stopwords
    def process_article(article):
        tokens = nltk.word_tokenize(article)
        #lowercase before checking so capitalized stopwords are removed too
        stopwords_removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
        return stopwords_removed
    #create new processed list
    processed_data = list(map(process_article, data))
    
    #counting word totals
    total_vocab = set()
    for comment in processed_data:
        total_vocab.update(comment)
    len(total_vocab)
    
    #creating a frequency distribution for all articles
    articles_concat = []
    for article in processed_data:
        articles_concat += article
    articles_freqdist = FreqDist(articles_concat)
    articles_freqdist.most_common(200)
    
    #vectorize the data: fit on the training set only, then reuse that vocabulary for the test set
    vectorizer = TfidfVectorizer()
    tf_idf_data_train = vectorizer.fit_transform(data)
    tf_idf_data_test = vectorizer.transform(newsgroups_test.data)
    tf_idf_data_train.shape #get shape of data 
    
    #run multinomial naive bayes and random forest classifier models
    nb_classifier = MultinomialNB()
    rf_classifier = RandomForestClassifier(n_estimators=100)
    nb_classifier.fit(tf_idf_data_train, target)
    nb_train_preds = nb_classifier.predict(tf_idf_data_train)
    nb_test_preds = nb_classifier.predict(tf_idf_data_test)
    rf_classifier.fit(tf_idf_data_train, target)
    rf_train_preds = rf_classifier.predict(tf_idf_data_train)
    rf_test_preds = rf_classifier.predict(tf_idf_data_test)
    nb_train_score = accuracy_score(target, nb_train_preds)
    nb_test_score = accuracy_score(newsgroups_test.target, nb_test_preds)
    rf_train_score = accuracy_score(target, rf_train_preds)
    rf_test_score = accuracy_score(newsgroups_test.target, rf_test_preds)

    print("Multinomial Naive Bayes")
    print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
    print("")
    print('-'*70)
    print("")
    print('Random Forest')
    print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))
    
_Needs further tuning_

## Deep NLP

### Word Embeddings

**_Word Embeddings_** are a type of vectorization strategy that computes word vectors from a text corpus by training a neural network, which results in a high-dimensional embedding space, where each word in the corpus is a unique vector in that space. In this embedding space, the position of the vector relative to the other vectors captures semantic meaning. Because embeddings are dense vectors rather than sparse counts, they take up less space, which tends to make processing faster and more accurate.

### Word2Vec

At its core, Word2Vec is just another neural network, and not even a particularly complex one--the model contains an input layer, a single hidden layer, and an output layer that uses the softmax activation function, meaning that the model is meant for multiclass classification. The model examines a window of words, the size of which is a tunable parameter that we can set when working with the model. 

Instead of predicting the next word given a context, the model trains to predict the context surrounding a given word!  It turns out that by training to predict the context window for a given word, the neurons in the hidden layer end up learning the embedding space! This is the reason why the size of the word vectors output by a Word2Vec model is a parameter that we can set ourselves. 

If we want word vectors of size 300, then we just include 300 neurons in our hidden layer. If we want vectors of size 100, then we include 100 neurons, and so on. If there are 10,000 words and we want vectors of size 300, then this means the hidden layer's weights will be of shape [10000, 300]. To put it another way--each of the 10,000 words will have its own unique vector of weights, which will be of size 300, since there are 300 neurons. 

Once we've trained the model, we don't actually need the output layer anymore--all that matters is the hidden layer, which will now act as a "Lookup Table" that allows us to quickly get the vector for any given word in the vocabulary.
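The "Lookup Table" idea can be checked with a toy example. Assuming each input word is fed in as a one-hot vector and the hidden layer has no bias or activation, multiplying that vector by the hidden weights simply selects one row. The three-word vocabulary and the weight values below are made up:

```python
vocabulary = ['cat', 'dog', 'fish']

# Hidden-layer weights of shape [3, 2]: 3 words, 2 hidden neurons (made-up numbers)
hidden_weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]


def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocabulary]


def hidden_activation(word):
    """Multiply the one-hot input vector by the weight matrix."""
    x = one_hot(word)
    n_neurons = len(hidden_weights[0])
    return [
        sum(x[i] * hidden_weights[i][j] for i in range(len(vocabulary)))
        for j in range(n_neurons)
    ]


def lookup(word):
    """Read the word's row straight out of the weight matrix."""
    return hidden_weights[vocabulary.index(word)]


print(hidden_activation('dog'))  # [0.3, 0.4]
print(lookup('dog'))             # [0.3, 0.4]
```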

In the case of the Word2Vec model, the "company" a word keeps is quite literally the words in the context window around a given word, which the model is learning to predict! The more similar words are, the more sentences in which they are likely to share context windows! This is exactly what the model is learning, and this is why words that are similar end up near each other inside the embedding space. The ways that they are not similar also help the model learn to differentiate between them, since there will be patterns here as well. 

#### Training A Word2Vec Model

To train a Word2Vec model, we first need to import the model from the `gensim` library and instantiate it. Upon instantiation, we'll need to provide the model with certain parameters such as:
* the dataset we'll be training on
* the size of the word vectors we want to learn (`vector_size` in gensim 4.0 and later; older releases call it `size`)
* the `window` size to use when training the model
* `min_count`, which corresponds to the minimum number of times a word must be used in the corpus in order to be included in the training (for instance, `min_count=5` would only learn word embeddings for words that appear 5 or more times throughout the entire training set)
* `workers`, the number of threads to use for training, which can speed up processing (`4` is typically used, since most processors nowadays have at least 4 cores). 

Once we've instantiated the model, we'll still need to call the model's `.train()` function, and pass in the following parameters:

* The same dataset that we passed in at instantiation
* The `total_examples`, which is the number of sentences (training examples) in the corpus. You don't need to calculate this manually--instead, you can just pass in the instantiated model's `.corpus_count` attribute for this parameter.
* The number of `epochs` to train the model for. 

The following example demonstrates how to instantiate and train a Word2Vec model:


    from gensim.models import Word2Vec

    #Let's assume we have our text corpus already tokenized and stored inside the variable 'data'--the regular text preprocessing steps still need to be handled before training a Word2Vec model!

    #vector_size is the gensim 4.0+ name for this parameter (older releases call it size)
    model = Word2Vec(data, vector_size=100, window=5, min_count=1, workers=4)

    #epochs must be given explicitly; 10 here is an arbitrary choice
    model.train(data, total_examples=model.corpus_count, epochs=10)

#### Exploring the Embedding Space

Once we have trained the model, we can easily explore the embedding space using the built-in methods and functionality provided by gensim's `Word2Vec` class. 

The actual Word2Vec model itself is quite large. Normally, we only need the actual vectors and the words that correspond to them, which are stored inside of `model.wv` as a `Word2VecKeyedVectors` object. To save time and space, it's usually easiest to just store the `model.wv` inside its own variable, and then work directly with that. We can then use this model for various sorts of functionality, which we'll demonstrate below!

   
    wv = model.wv

    #Get the most similar words to a given word
    wv.most_similar('Cat')

    #Get the least similar words to a given word. NOTE: We'll see in 
    #the next lab that this function doesn't always work the way we   
    #might intuitively think it will!
    wv.most_similar(negative=['Cat'])

    #Get the word vector for a given word
    wv['Cat']

    #Get all the vectors for all the words!
    wv.vectors

    #Compute some word vector arithmetic, such as (king - man + woman) 
    #which should be roughly equal to 'queen'
    wv.most_similar(positive=['king', 'woman'], negative=['man'])
    
### Text Classification

One of the most commonly used sources of pretrained word vectors is the **_GloVe_** (short for **_Global Vectors for Word Representation_**) model by the Stanford NLP Group.

#### GloVe file

For text classification purposes, loading pretrained GloVe weights removes the need for us to instantiate or train a Word2Vec model at all--instead, we just need to:

* Get the total vocabulary in our dataset
* Download and unzip the GloVe file needed from the Stanford NLP Group's website
* Read the GloVe file, and save only the vectors that correspond to the words that appear in the vocabulary of our dataset.

#### Mean Word Embedding

To get the vector representation for any arbitrarily-sized block of text, all we need to do is get the vector for every individual word that appears in that block of text, and average them together! The benefit of this is that no matter how big or small that block of text is, the **_Mean Word Embedding_** of that sentence will be the same size as all of the others, because the vectors we're averaging together all have the exact same dimensionality! This makes it a simple matter to get a block of text into a format that we can use with traditional Supervised Learning models such as Support Vector Machines or Gradient Boosted Trees.
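A minimal sketch of the averaging step is shown below, using a made-up three-dimensional vocabulary in place of real GloVe vectors. Skipping words that have no vector is an assumption of this sketch rather than a rule, and `mean_embedding` is a hypothetical helper name:

```python
# Made-up 3-dimensional vectors standing in for pretrained embeddings
word_vectors = {
    'the':    [0.0, 1.0, 0.0],
    'dog':    [1.0, 0.0, 2.0],
    'barked': [2.0, 2.0, 1.0],
}


def mean_embedding(tokens, vectors, size=3):
    """Average the vectors of all known tokens, dimension by dimension."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words at all: fall back to a zero vector of the right size
        return [0.0] * size
    return [sum(dim) / len(known) for dim in zip(*known)]


long_text = mean_embedding(['the', 'dog', 'barked'], word_vectors)
short_text = mean_embedding(['dog'], word_vectors)

print(long_text)   # [1.0, 1.0, 1.0]
print(short_text)  # [1.0, 0.0, 2.0]
```

Both results have three elements even though the inputs differ in length, which is what lets them be stacked into a single feature matrix.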

**_Best to perform Classification using a Pipeline to save on work_**

#### Embedding Layer

An **_Embedding Layer_** is just a layer that learns the word embeddings for our dataset on the fly, right there inside the Neural Network. Essentially, it's a way to make use of all the benefits of Word2Vec, without worrying about finding a way to include a separately trained Word2Vec model's output in our Neural Networks.

You should make note of a couple caveats that come with using embedding layers in your Neural Network--namely:

* The Embedding Layer must always be the first layer of the network, meaning that it should immediately follow the `Input()` layer
* All words in the text should be integer-encoded, with each unique word encoded as its own unique integer. 
* The vocabulary size given to the Embedding Layer (its first parameter) must be greater than the largest integer index in the data, which usually means the total vocabulary size plus one. The second parameter denotes the size of the actual word vectors.
* The size of the sequences passed in as data must be set when creating the layer (all data will be converted to padded sequences of the same size during the preprocessing step). 
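The integer-encoding and padding caveats above can be sketched in plain Python. Deep learning libraries provide their own utilities for this, so the helper names below are hypothetical, and padding at the end of each sequence (rather than the start) is an arbitrary choice made for this illustration:

```python
def build_index(tokenized_docs):
    """Assign each unique word its own integer, reserving 0 for padding."""
    index = {}
    for doc in tokenized_docs:
        for word in doc:
            if word not in index:
                index[word] = len(index) + 1
    return index


def encode_and_pad(tokenized_docs, index, max_len):
    """Integer-encode each document, then truncate or zero-pad it to max_len."""
    padded = []
    for doc in tokenized_docs:
        ids = [index[word] for word in doc][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


docs = [['the', 'dog', 'barked'], ['the', 'cat']]
index = build_index(docs)
padded = encode_and_pad(docs, index, max_len=4)

# One more than the largest index, since 0 is reserved for padding
vocab_size = len(index) + 1

print(padded)      # [[1, 2, 3, 0], [1, 4, 0, 0]]
print(vocab_size)  # 5
```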