In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import wikipedia
import spacy
from textblob import TextBlob

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

## Mini Review Lab (50 Minutes)

There are a lot of moving pieces in NLP and it is worthwhile to keep practicing the techniques we started to acquire yesterday. 

The first section of our lesson today will be a chance to review those topics and to practice discussing NLP and machine learning together. 

We'll be using a truncated version of the [Amazon Fine Food Review](https://www.kaggle.com/snap/amazon-fine-food-reviews/data) dataset. For a larger project, we would make use of the full set of data. However, in the interest of processing time, we'll use a randomly sampled set of 10,000 reviews for our training set and an additional 2,000 reviews for our test set.

Your goal will be to create a predictive model that classifies a review into a high scoring review (5 stars) or not a high scoring review (1-4 stars). This value is already present in the data under the name `high_score`.

In [None]:
train = pd.read_csv('./datasets/amazon_train.csv')
train.head()

In [None]:
test = pd.read_csv('./datasets/amazon_test.csv')
test.head()

Split into pairs and work together to do the following.

#### Model Generation (30 Minutes)

Try to create a predictive model that identifies whether a review will be a high-scoring review or not (the `high_score` column in the data). While you can use any of the NLP techniques we discussed yesterday, here are some areas to focus on:

1. Should you use `CountVectorizer` or `TfidfVectorizer` to transform your DataFrame?
    - Keep stop words or drop them?
    - Limit the words going in using `max_df` or `min_df`?
2. Apply dimensionality reduction using `TruncatedSVD` or not?
    - If you do, how many components should you keep?
3. What modeling technique should you use (`LogisticRegression`, `RandomForestClassifier`, etc.)? How will you tune the hyperparameters?

Make sure that you are checking your model's performance against the test set.
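
As a small illustration of what `min_df` and `max_df` do, here is a minimal sketch on four made-up reviews (the `docs` list is hypothetical, not the Amazon data). It only shows how the vocabulary changes; it is not a recommendation for which values to use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, invented for illustration only
docs = [
    "good coffee good price",
    "bad coffee",
    "good tea",
    "coffee arrived late",
]

# min_df=2 keeps only words that appear in at least 2 documents
cv_min = CountVectorizer(min_df=2).fit(docs)
vocab_min = sorted(cv_min.vocabulary_)

# max_df=0.5 drops words that appear in more than 50% of the documents
cv_max = CountVectorizer(max_df=0.5).fit(docs)
vocab_max = sorted(cv_max.vocabulary_)

print('min_df=2   ->', vocab_min)
print('max_df=0.5 ->', vocab_max)
```

Here `coffee` appears in three of the four documents, so `max_df=0.5` removes it, while `min_df=2` removes every word that shows up in only one document.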

#### Discussion (10 Minutes)

A pair from each market will come on mic and discuss how they've chosen to transform their data. Additionally, we'll compare the **mean accuracy** for each market to see who has (at this point) made the most predictive model.

#### Model Refinement (10 Minutes)

Continue to refine your model or include some choices made by other markets. At the end of these 10 minutes, we'll report each market's best finding (and final model) by mic. 

# NLP Techniques

Today's lesson is designed as an introduction to more advanced libraries or techniques in the realm of Natural Language Processing. These techniques can help you gain even greater accuracy in your modeling, but require more in-depth knowledge of new libraries, new techniques, etc. 

While we'll be introducing a lot of new material today, we'll be doing our best to limit the discussion to what is most immediately helpful. Each of these libraries and techniques has much more going on than we have time to discuss this week and we encourage you to spend time investigating and understanding these libraries. However, **mastery of these libraries, techniques, and materials introduced today is not required nor expected.**

For Project 4 and your Capstone Project, if you are pursuing an NLP approach, these libraries may be very helpful. However, you can get a lot of mileage out of refining and using the sklearn libraries that we discussed yesterday. A good workflow is to try simple answers first and move into more advanced techniques as your use-case requires -- your goal as modelers should be to make the best choice that you can, contingent on time and use-case. Having something that works but is not 100% correct is better than having something 100% correct that doesn't work yet. 

## Using `spacy` to extract parts of speech and named entities

[`spaCy`](https://spacy.io/) is a large-scale NLP and text processing library designed to help you extract useful information from text quickly and accurately. You can imagine it like `CountVectorizer()` turned up to 11. Its core is compiled to C (via Cython) for speed, and it has a focus on usability.

`spaCy` does *so* much more than we are able to discuss at this point. It is quickly becoming the go-to library for text processing and feature extraction for text. Today, we'll use it to extract parts of speech and named entities.

### Parts of Speech

We may want to use some derived statistics about parts of speech in our work as Data Scientists, either as the inputs to a model (document _x_ is _y_% verbs) or to help us modify the inputs to a model (we may want to treat `book` the verb differently than `book` the noun). While many different libraries can do parts of speech (`textblob`, which we'll introduce shortly, can do that as well), we'll introduce this using `spaCy`.

First, we set up some text from Wikipedia to parse:

In [None]:
chicago = wikipedia.page('chicago')

print(chicago.content[0:500])

Next, let's create rough sentences by splitting on `. ` (a period followed by a space):

In [None]:
chicago_sents = chicago.content.split('. ')
chicago_sents[0:5]

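Next, we'll load a pretrained English pipeline in `spaCy`. This gives `spaCy` the statistical models it needs to tag and parse text. By convention, we name the loaded pipeline `nlp`:

In [None]:
# spaCy 3 removed the 'en' shortcut; this assumes the small English model is installed
# (python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')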

Then, we'll feed a sentence into `nlp`. This splits the text into a sequence of tokens (roughly one token per word or punctuation mark), each already tagged with its part of speech:

In [None]:
doc = nlp(chicago_sents[1])
print(doc)
for token in doc:
    print(token.text, token.pos_)

If we wanted to convert this into a set of part of speech tags, we could add in a little extra Python to do so:

In [None]:
tags = {}
for token in doc:
    if token.pos_ not in tags.keys():
        tags[token.pos_] = 1
    else:
        tags[token.pos_] += 1
print(tags)
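
The same tally can be written more compactly with `collections.Counter`, which also makes it easy to turn counts into shares of the document (the "document _x_ is _y_% verbs" idea above). This is a minimal sketch: the `tagged` list is a hypothetical stand-in for the `(token.text, token.pos_)` pairs you would get from a `spaCy` doc, so that it runs without loading a model.

```python
from collections import Counter

# Hypothetical (text, part-of-speech) pairs standing in for
# (token.text, token.pos_) taken from a spaCy doc
tagged = [
    ("Chicago", "PROPN"), ("is", "AUX"), ("the", "DET"),
    ("third", "ADJ"), ("most", "ADV"), ("populous", "ADJ"),
    ("city", "NOUN"), ("in", "ADP"), ("the", "DET"),
    ("United", "PROPN"), ("States", "PROPN"), (".", "PUNCT"),
]

# Count how often each part-of-speech tag occurs
tag_counts = Counter(pos for _, pos in tagged)

# Convert counts into a share of all tokens
total = sum(tag_counts.values())
tag_share = {pos: count / total for pos, count in tag_counts.items()}

for pos, count in tag_counts.most_common():
    print(f"{pos}\t{count}\t{tag_share[pos]:.1%}")
```

With a real doc, the generator would read `Counter(token.pos_ for token in doc)` instead.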

There are many more attributes that `spaCy` can provide for each token:

- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalisation, punctuation, digits.
- is alpha: Does the token consist of alphabetic characters?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?

In [None]:
print('\t'.join(['Text', 'Lemma', 'POS', 'Detailed POS', 'Dependency',
                'Shape', 'Is alphabetic?', 'Is stopword?']))
for token in doc:
    print('\t'.join([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, str(token.is_alpha), str(token.is_stop)]))

### Check For Understanding 1 (10 minutes)

With a partner, do the following:

1. Pick two different wikipedia articles
2. Get the content using the `wikipedia` library
3. Using `spacy`, derive the following:
    1. How many tokens are in your article?
    2. How many parts of speech are in each article? How often do they occur?
    3. As a percentage of the total number of tokens, how often does each part of speech occur?
4. Does it look like there's a difference across your documents? What other types of documents would have different distributions of parts of speech?

### Named Entities

Named entities are businesses, people, countries, or other things that refer to a specific person, place, or thing (think `Apple`, the computer manufacturer, versus `apple`, the delicious crunchy fruit). `spaCy` can identify named entities for us, which we can either highlight or drop from our analyses. 

Once we've parsed a string of text using `spaCy`, we can call out the named entities using the `.ents` attribute:

In [None]:
print(doc, type(doc))

In [None]:
for named_entity in doc.ents:
    print(named_entity.text, named_entity.label_)

`spaCy` provides a set of labels for each type of named entity:

|Label|Description|
|:-- | :-- |
|PERSON |	People, including fictional. |
|NORP |	Nationalities or religious or political groups. |
|FAC |	Buildings, airports, highways, bridges, etc. |
|ORG |	Companies, agencies, institutions, etc. |
|GPE |	Countries, cities, states. |
|LOC |	Non-GPE locations, mountain ranges, bodies of water. |
|PRODUCT |	Objects, vehicles, foods, etc. (Not services.) |
|EVENT |	Named hurricanes, battles, wars, sports events, etc. |
|WORK_OF_ART |	Titles of books, songs, etc. |
|LAW |	Named documents made into laws. |
|LANGUAGE |	Any named language.|
|DATE |	Absolute or relative dates or periods. |
|TIME |	Times smaller than a day. |
|PERCENT |	Percentage, including "%". |
|MONEY |	Monetary values, including unit. |
|QUANTITY |	Measurements, as of weight or distance. |
|ORDINAL |	"first", "second", etc. |
|CARDINAL |	Numerals that do not fall under another type. |

If we wanted to see all the unique named entities in the Chicago page, for example:

In [None]:
chicago.content[0:500]

In [None]:
chicago_model = nlp(chicago.content)
named_entities = []
for entity in chicago_model.ents:
    named_entities.append(entity.text)
print(set(named_entities), len(set(named_entities)))

### Check for Understanding 2 (10 Minutes)

As a market (or as groups in your market), please discuss the following:

1. As modelers, we will frequently have to make decisions about how to transform data. If you were using NLP to predict things, would it make sense to keep named entities? Would it make sense to drop them? If it would depend on the circumstances, under what circumstances would it make sense to keep or drop named entities?

We'll have a couple of markets come on mic to discuss cases they identified where keeping named entities might make sense and cases where it would not make sense.

## Using `textblob` to do sentiment analysis

We can also use a library known as [`textblob`](https://textblob.readthedocs.io/en/dev/) to do a **lot** of text transformation and extraction on our behalf. For our purposes, we are going to use it to analyze text and derive the overall sentiment of the text.

Sentiment can be split into two related scales:

- subjectivity (0 to 1): scores closer to 0 are more objective in tone, scores closer to 1 are more subjective in tone
- polarity (-1 to 1): scores closer to -1 are more negative in tone, closer to 0 are more neutral, and closer to 1 are more positive in tone.

Using `textblob` is user-friendly -- pass a string into the `TextBlob()` class and then read the `.sentiment.polarity` or `.sentiment.subjectivity` attributes:

In [None]:
really_good_review = '''
Goodness me, what a fantastic movie. 
Caught the world premiere at the Toronto International Film Festival and 
the entire theater laughed until they cried. 
Amazingly directed, HILARIOUSLY funny, it blends a 1930s gangster 
stylishness into a Hong Kong kung fu movie to astonishing results. 
Who would've thought you could top Shaolin Soccer? 
Not me, until I saw this movie. Stephen Chow pulled it off. 
Chow's comedic timing gets better and better with every movie 
he makes, and while his films are depending more and more on 
CGI these days, and makes this movie much more a fantasy kung 
fu film than a traditional one, it hardly detracts from the 
enjoyable experience. Make it your mission to see this film - 
it will be one of the most entertaining you ever see. 
I can't remember the last film I enjoyed myself in more. 
My eyes still hurt from wiping away tears of laughter. Seriously.  
'''

really_bad_review = '''
Thank you for coming into your performance review Mr. Smith.
The company is concerned about your performance. Lately your work has 
been subpar and at times counter to this company's stated goals.
Your demeanor has been aggresive and at times hostile to your 
fellow coworkers.
We have no choice but to terminate your employment, effective 
immediately. Thank you.
'''

In [None]:
good_review = TextBlob(really_good_review)
print(good_review.sentiment.subjectivity, good_review.sentiment.polarity)

bad_review = TextBlob(really_bad_review)
print(bad_review.sentiment.subjectivity, bad_review.sentiment.polarity)

### Check for Understanding 3 (5 Minutes)

Individually, please answer the following:

1. What type of subjectivity and polarity scores would you expect wikipedia articles to have?
2. Confirm your hypothesis by using `textblob` on some of the wikipedia pages we have used so far. Were your thoughts confirmed?

### Adding these features to DataFrames

We may want to include these features into a DataFrame for use in a later model. The most straightforward way to do so would be to apply them using Pandas.

Here, we'll make use of the same dataset on economic news that we used yesterday. 

In [None]:
econ = pd.read_csv('datasets/economic_news.csv',
                  usecols=[7, 11, 14],
                  nrows=200)
econ['text'] = econ['text'].apply(lambda x: x.replace('</br>', ''))
econ['relevance'] = econ['relevance'].apply(lambda x: 1 if x == 'yes' else 0)
econ.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(econ[['text']],
                                                   econ['relevance'],
                                                   test_size=0.50,
                                                   random_state=8675309)

Let's use `spaCy` to create a column for the number of monetary-based named entities, followed by using `textblob` to create a polarity score for each article.

While we could try to put this into a lambda function, it will probably be easiest in this case to define two functions and apply them.

However, because we're sequentially loading up each row of data and processing it, this can be a little bit of a time and memory sink. Expect processing to take some extra time for this step.

**Note**: for `spaCy`, there are faster ways to process the data that do not involve pushing it through Pandas. Investigate the `nlp.pipe` method if you're looking to do a larger amount of text transformation.

In [None]:
def number_of_monetary_ents(text):
    text = nlp(text)
    return len([x.text for x in text.ents if x.label_ == 'MONEY'])

def polarity(text):
    text = TextBlob(text)
    return text.sentiment.polarity

In [None]:
X_train['num_monetary'] = X_train['text'].apply(number_of_monetary_ents)

In [None]:
X_train['num_monetary'].describe()

In [None]:
X_train['num_monetary'].plot(kind='hist')

In [None]:
X_train['polarity'] = X_train['text'].apply(polarity)

In [None]:
print(X_train['polarity'].describe())  # print explicitly; only a cell's last expression is displayed
X_train['polarity'].plot(kind='hist')

We could also pass these features into a predictive model to see if they help predict whether an article is relevant (the `relevance` column):

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train[['num_monetary', 'polarity']], y_train)
print(rfc.score(X_train[['num_monetary', 'polarity']], y_train))
predictions = rfc.predict(X_train[['num_monetary', 'polarity']])
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))

In [None]:
X_test['num_monetary'] = X_test['text'].apply(number_of_monetary_ents)
X_test['polarity'] = X_test['text'].apply(polarity)
print(rfc.score(X_test[['num_monetary', 'polarity']], y_test))
predictions = rfc.predict(X_test[['num_monetary', 'polarity']])
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

**Note**: you're probably wondering _why_ we would try these things when they don't seem to immediately help. During a larger project, we will likely spend days if not weeks on feature extraction and analysis and will want to make as many useful features as possible to make as good a model as possible. Other techniques may involve more nuanced modeling, such as looking at the sequence of parts of speech, etc. Part of this lesson is designed to expose you to what is out there so that when faced with a situation where those techniques may be useful, you're aware of their existence.

## Assigning documents to topics using LDA

LDA (Latent Dirichlet Allocation) is an unsupervised machine learning technique that iteratively attempts to find clusters of words that are likely to occur together across multiple documents. We interpret the co-occurrence of these words as analogous to the different topics discussed across a body of documents. 

LDA works by iteratively guessing how likely a given word is to be part of a given topic until we tell it to stop. 

This process of updating probabilities will make more sense after next week's lectures on Bayes, but we'll quickly discuss it here and move forward.

(Explanation cribbed from [Introduction to Latent Dirichlet Allocation](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/))

We begin by picking a set of documents and a number of topics that we want to generate. One way to fit the model is what's known as collapsed Gibbs sampling. We do the following:

1. Randomly assign every word in every document to one of the $k$ topics:
    - $w$: a word in a document
    - $d$: a document
    - $k$: a topic
2. At this point, every word has a likelihood of belonging to a given topic, based on the other words in the documents that it appears in. 
3. Iterate through every word in every document and:
    1. Assume that every other word has the correct likelihood of belonging to each topic (so, `apple` might have a distribution of `[0.1, 0.1, 0.2, 0.4, 0.2]` for five topics).
    2. Look at the likelihood of seeing word $w$ in document $d$ and adjust the topic probabilities as needed.
    > For example, if there are a lot of words in topic 1 in document $d$ and word $w$ currently has a stronger likelihood of being in topic 2, then, because we're assuming that every **other** distribution is correct, we should change our understanding of where word $w$ belongs and tweak it more in favor of topic 1, not topic 2.
    
You can kind of interpret this with an analogy:

> Imagine you move to a new town and you don't know what sort of people you want to hang out with. You imagine there are five different groups of people. You start visiting different places around town (the park, the library, the mall, etc.) and noting who's there. Every time you go to a place, you adjust your expectation of who you'll see there (for example, the goths are constantly at the mall, so we should expect less and less that they'll show up at the library). This is (very roughly) analogous to what LDA is doing.
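
The sampling steps described above can be sketched in a few lines of `numpy`. This is a minimal illustration under stated assumptions, not a production implementation: the toy corpus `docs` (documents as lists of word ids), the two topics, and the smoothing values `alpha` and `beta` are all made up for the example. Note that sklearn's `LatentDirichletAllocation` fits the model with a variational Bayes method rather than Gibbs sampling, so this is not what sklearn runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: each document is a list of word ids (6-word vocabulary)
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
n_topics, n_words = 2, 6
alpha, beta = 0.1, 0.01  # assumed smoothing hyperparameters

# Count tables that summarize the current assignments
doc_topic = np.zeros((len(docs), n_topics))  # words in doc d assigned to topic k
topic_word = np.zeros((n_topics, n_words))   # times word w is assigned to topic k
topic_total = np.zeros(n_topics)             # total words assigned to topic k

# Step 1: randomly assign every word in every document to a topic
assignments = []
for d, doc in enumerate(docs):
    z_d = rng.integers(n_topics, size=len(doc))
    assignments.append(z_d)
    for w, k in zip(doc, z_d):
        doc_topic[d, k] += 1
        topic_word[k, w] += 1
        topic_total[k] += 1

# Step 3: revisit each word, treating all other assignments as correct
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = assignments[d][i]
            # Remove this word's current assignment from the counts
            doc_topic[d, k_old] -= 1
            topic_word[k_old, w] -= 1
            topic_total[k_old] -= 1
            # Weight for topic k: (how much doc d uses topic k)
            #                   * (how much topic k uses word w)
            p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                 / (topic_total + n_words * beta))
            k_new = rng.choice(n_topics, p=p / p.sum())
            # Record the newly sampled assignment
            assignments[d][i] = k_new
            doc_topic[d, k_new] += 1
            topic_word[k_new, w] += 1
            topic_total[k_new] += 1

print(doc_topic)   # per-document topic counts
print(topic_word)  # per-topic word counts
```

Each pass removes one word from the counts, asks which topic best fits it given everything else, and puts it back, which is the "assume every other word is correct" step from the list above.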

The name latent dirichlet allocation should begin to make more sense in this context:
- latent -- because we have no explicit marker of topic and are grouping things together based on features we are inferring, not seeing
- [dirichlet](https://en.wikipedia.org/wiki/Dirichlet_distribution) -- a type of probability distribution over vectors of proportions that sum to 1 (like one document's mix of topics, or one topic's mix of words)
- allocation -- we are allocating different words to different topics via this iterative updating of priors
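
To get a feel for the Dirichlet part, here is a short sketch that draws topic mixtures with `numpy`. The concentration values (0.2 and 10) and the choice of 3 documents over 5 topics are arbitrary, picked only to show the contrast between the two settings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw topic mixtures for 3 hypothetical documents over 5 topics.
# A small concentration value tends to give "peaky" mixtures
# (most of the weight on one or two topics); a large value tends
# to give mixtures that are close to uniform.
peaky_mix = rng.dirichlet([0.2] * 5, size=3)
even_mix = rng.dirichlet([10.0] * 5, size=3)

print(peaky_mix.round(2))
print(even_mix.round(2))

# Every draw is a valid set of proportions: non-negative and summing to 1
print(peaky_mix.sum(axis=1), even_mix.sum(axis=1))
```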

Both sklearn and `gensim`, a library we will discuss in the context of a technique called `word2vec`, can handle LDA. However, we'll rely on the sklearn implementation here to reduce the amount of extra work we'll need to do in picking up a new library.

First, let's reimport all of the economic news data instead of just the first 200 rows:

In [None]:
econ = pd.read_csv('datasets/economic_news.csv',
                  usecols=[14])
econ['text'] = econ['text'].apply(lambda x: x.replace('</br>', ''))
econ.head()

Next, we'll transform the data using `CountVectorizer` and remove stop words:

In [None]:
cv = CountVectorizer(stop_words='english')
cv.fit(econ['text'].values)
X = cv.transform(econ['text'].values)
X

Next we'll instantiate an LDA and fit it to our sparse matrix of words. We have to provide a number of topics that we are looking for (in this case, we're looking for 5 topics). We'll also store the names of each of the words created during the `CountVectorizer` step for use with the LDA results:

In [None]:
feature_names = cv.get_feature_names_out()  # use cv.get_feature_names() on sklearn < 1.0
lda = LatentDirichletAllocation(n_components=5)

lda.fit(X)

The results of our work will be held in the `.components_` attribute. Each row of this array is one of our topics and each column (in order) is a word created by `CountVectorizer`. The values are the relative "likelihoods" that word $w$ belongs in topic $k$.

> From the sklearn docs, `.components_`: "can be viewed as pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as distribution over the words for each topic after normalization" (we could normalize by dividing each row by the row total for that topic). 

For our purposes, it's enough to say that a bigger value means the word belongs more in that topic. 
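
The normalization mentioned in the quote can be sketched as follows. The `components` array and `words` list here are hypothetical stand-ins shaped like `lda.components_` and the `CountVectorizer` vocabulary, so the arithmetic is easy to follow by eye.

```python
import numpy as np

# Hypothetical pseudocounts shaped like lda.components_: 2 topics x 4 words
components = np.array([[8.0, 1.0, 0.5, 0.5],
                       [0.2, 0.3, 4.5, 5.0]])
words = np.array(["market", "stocks", "jobs", "wages"])

# Divide each row by its total so every topic becomes a distribution over words
topic_word_probs = components / components.sum(axis=1, keepdims=True)

for k, row in enumerate(topic_word_probs):
    # Indices of the two largest probabilities, largest first
    top_two = words[row.argsort()[::-1][:2]]
    print('Topic', k, row.round(2), list(top_two))
```

With the fitted model, the same line would read `lda.components_ / lda.components_.sum(axis=1, keepdims=True)`.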

In [None]:
print(lda.components_.shape)

In [None]:
results = pd.DataFrame(lda.components_,
                      columns=feature_names)

To see what words are most likely in each topic, we could sort by the biggest values for each topic.

> Every feature has a likelihood of being in a topic, just a very, very low one

In [None]:
for topic in range(5):
    print('Topic', topic)
    word_list = results.T[topic].sort_values(ascending=False).index
    print(' '.join(word_list[0:25]), '\n')

As we change the number of topics, we should see the topics change slightly. Remember that because this is an unsupervised technique, our editorial judgment as modelers is important for identifying useful topics. 

However, this provides a powerful tool to create summaries of larger bodies of documents!
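
So far we have looked at which words belong to each topic. To assign each *document* to topics, one option is the model's `transform` (or `fit_transform`) method, which returns one row per document and one column per topic. The sketch below uses four made-up headlines (`toy_docs` is hypothetical, not the economic news data) so that it runs quickly; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy headlines, invented for illustration only
toy_docs = [
    "stocks fall as markets react to interest rates",
    "interest rates rise and stocks slide",
    "unemployment drops as hiring and wages grow",
    "wages and hiring rise while unemployment falls",
]

toy_cv = CountVectorizer(stop_words='english')
toy_X = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
# One row per document, one column per topic; each row sums to 1
doc_topics = toy_lda.fit_transform(toy_X)

# The topic with the largest share is a simple "best guess" label
dominant_topic = doc_topics.argmax(axis=1)

print(doc_topics.round(2))
print(dominant_topic)
```

On the full dataset, `lda.transform(X)` would give the same kind of matrix for the economic news articles, which could then be added to a DataFrame as features.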

### Check for Understanding 4 (20 Minutes)

In pairs, do the following:

1. Rerun the LDA, choosing 10 topics instead of 5. 
> Make sure that you can explain to each other what each line of the code does. This can be as generic as "This runs an LDA with 10 components on a matrix of words and documents" but it's important to be able to explain what a block of code is doing. In particular, make sure that you're able to explain what happens in this line of code from above: `word_list = results.T[topic].sort_values(ascending=False).index` -- if you need to, start with the very first portion (`results`) and investigate what each subsequent step does.
2. Look at the results of your LDA. How would you summarize what each topic says?
3. Does 10 look to be a correct number of topics? Are the same words showing up in multiple topics? 