# Text Analysis

In this lab we'll do a short text analysis so that you become familiar with the Python packages and tools available for working with text data. Nothing here is very in-depth, since the lab is meant to be completed in a short period of time, but it will give you a starting point for your final assignment and projects, should you want to analyze text data.

## Data Science Question
In this short project, we're going to answer the question: *For each presidential inauguration, which word is most unique?* 

To do this, we'll use the text of each inaugural address in American history and carry out a TF-IDF (term frequency-inverse document frequency) analysis.

Secondarily, we'll think about whether these words make sense in the historical context of the time, and we'll visualize word uniqueness over the course of history.
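For intuition about what TF-IDF measures, here is a minimal sketch on a made-up three-document corpus. It uses the textbook form (term frequency times the log of inverse document frequency), which is an assumption chosen for illustration; scikit-learn's `TfidfVectorizer`, used later in this lab, adds smoothing and normalization, so its numbers will differ. The documents and the helper name `toy_tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up mini corpus (not real inaugural text)
docs = {
    "A": "peace and union and peace",
    "B": "union and war",
    "C": "union and jobs and jobs",
}

def toy_tfidf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    tokens = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokens)
    # Number of documents each word appears in
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        scores[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

scores = toy_tfidf(docs)
# Highest-scoring word per document
top_words = {name: max(s, key=s.get) for name, s in scores.items()}
print(top_words)
```

Words that appear in every document ("and", "union") score zero here, while a word confined to one document rises to the top. That is the sense in which this lab calls a word "unique".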

# Part I : Setup & Data Wrangling

This lab uses a number of functions from several packages. **Run the following code cell and take a look at each package we'll be using. Make sure you understand what each package is used for, and familiarize yourself with anything that's new to you.**

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd

# Import nltk package 
# NLTK provides support for a wide variety of text processing tasks: 
# tokenization, stemming, proper name identification, part of speech identification, etc. 
#   PennTreeBank word tokenizer 
#   English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# scikit-learn imports
#   TF-IDF Vectorizer that can first remove widely used words (stop words) and then transforms text data
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# import re for regular expression
import re

# seaborn for plotting
import seaborn as sns
sns.set(font_scale=1.2, style="white")

# import matplotlib for plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as style
# set plotting size parameter
plt.rcParams['figure.figsize'] = (12, 5)

# improve resolution
%config InlineBackend.figure_format = 'retina'

To get started on your text analysis using the `nltk` package, run the code below to **download the NLTK English tokenizer (`'punkt'`), the stop word lists for all supported languages (`'stopwords'`), and the inaugural address corpus (`'inaugural'`)**. To learn more about the `download` function, you can explore the NLTK documentation [here](https://www.nltk.org/) or the NLTK book [here](http://www.nltk.org/book/).

In [None]:
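nltk.download('punkt')
# Note: recent NLTK releases may additionally require the 'punkt_tab' resource
# for word_tokenize; if you hit a LookupError, try nltk.download('punkt_tab').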
nltk.download('stopwords')
nltk.download('inaugural')

Now that you have downloaded a few of the datasets you'll need, **import the `inaugural` dataset from `nltk.corpus`.**

In [None]:
### BEGIN SOLUTION
from nltk.corpus import inaugural
### END SOLUTION

In [None]:
assert inaugural

If all is working well, the following cell should display the files included in this dataset. 

In [None]:
inaugural.fileids()

As you can see, there is one file for each address, and each filename includes the year of the address. We'll want to use those years later, so **write code that extracts the year from each filename and stores the results in a list. Assign this list to the variable `years`.**

In [None]:
### BEGIN SOLUTION
years = [fileid[:4] for fileid in inaugural.fileids()]
### END SOLUTION

In [None]:
assert len(years) == len(inaugural.fileids())
assert years[1] == '1793'
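
Slicing the first four characters works because every file ID starts with a four-digit year. A slightly more defensive variant, sketched below on sample file IDs that follow the same `YYYY-Name.txt` pattern, uses a regular expression and converts the years to integers. The helper `extract_year` is hypothetical, and whether you want integers or strings depends on how you plan to use the values later (the solution above keeps strings, which is what the assertions expect).

```python
import re

def extract_year(fileid):
    """Return the leading four-digit year of a file ID as an int."""
    match = re.match(r"(\d{4})-", fileid)
    if match is None:
        raise ValueError(f"no leading year in {fileid!r}")
    return int(match.group(1))

sample_ids = ["1789-Washington.txt", "1793-Washington.txt", "1861-Lincoln.txt"]
sample_years = [extract_year(fid) for fid in sample_ids]
print(sample_years)  # [1789, 1793, 1861]
```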

Let's take a look at one of these addresses. We'll pick a short one - Washington's *second* address. **Run the code below to take a look.**

In [None]:
# see Washington's Second Inaugural Address
inaugural.raw('1793-Washington.txt')

You'll notice some newline characters, as well as a colon, some commas, and some periods. For TF-IDF we're really only interested in the words, so let's remove all punctuation. **Write code that produces a list with one element per address, where each element contains the text as above, but with:**
- punctuation removed
- each word separated by a space
- all words in lowercase (i.e. "Constitution" should be "constitution")

**Assign this list to the variable `text`.**

In [None]:
### BEGIN SOLUTION
text = [re.sub(r'[^A-Za-z0-9]+', ' ', x) for x in [inaugural.raw(file_id) for file_id in inaugural.fileids()]]
text = list(map(str.lower, text))
### END SOLUTION

In [None]:
assert isinstance(text, list)
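# one cleaned string per address (58 when this lab was written; newer corpus versions may include more)
assert len(text) == len(inaugural.fileids())
out = re.search('^fellow', text[0])
assert out is not None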

If you've done this correctly and you **run the following cell, all punctuation should be stripped from the text, so that you only see the words from Washington's second address, separated by spaces, with all words lowercase.**

In [None]:
text[1]

With that, you now have a dataset ready for analysis by TF-IDF!
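
To see what the cleaning step does in isolation, here is a small sketch on a made-up sentence (not taken from the corpus). It applies the same regular expression as the solution above and then lowercases; the `strip()` call is an extra, optional step that drops the trailing space left behind when the text ends in punctuation.

```python
import re

# Made-up sample containing a newline, punctuation, and an apostrophe
sample = "The Nation's Constitution,\nwhich we revere: it endures!"

# Replace every run of non-alphanumeric characters with a single space
cleaned = re.sub(r'[^A-Za-z0-9]+', ' ', sample).lower().strip()
print(cleaned)  # the nation s constitution which we revere it endures
```

Note the side effect on possessives: "Nation's" becomes "nation s", leaving a stray single-letter token. For this lab that is mostly harmless, but it is worth keeping in mind when you interpret unusual tokens in the results.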

# Part II : Text Analysis

To get started on your TF-IDF analysis, you'll first want to **create a `TfidfVectorizer` object to transform your text data into vectors. Assign this `TfidfVectorizer` object to `tfidf`.**

In this object, you'll need to **pass five arguments to initialize `tfidf`**: 
* apply sublinear TF scaling: `sublinear_tf=True`
* analyze at the word level: `analyzer='word'`
* set the maximum number of unique words: `max_features=2000`
* tokenize the data using `word_tokenize` from NLTK: `tokenizer=word_tokenize`
* remove English-language stop words: `stop_words=stopwords.words("english")`
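
As a rough illustration of what `sublinear_tf=True` changes: scikit-learn's documentation describes it as replacing the raw term count `tf` with `1 + log(tf)`. The sketch below applies that formula to a few hypothetical counts (the helper name `sublinear` is made up) to show how heavily repeated words are dampened.

```python
import math

def sublinear(tf):
    """Sublinear term-frequency scaling: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

for count in [1, 2, 10, 100]:
    print(count, round(sublinear(count), 2))
```

With this scaling, a word used 100 times in an address counts roughly 5.6 times as much as a word used once, rather than 100 times as much, so long addresses that repeat a word many times do not dominate the scores.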

In [None]:
### BEGIN SOLUTION
tfidf = TfidfVectorizer(sublinear_tf=True,
                        analyzer='word',
                        max_features=2000,
                        tokenizer=word_tokenize,
                        stop_words=stopwords.words("english"))
### END SOLUTION

In [None]:
assert tfidf.analyzer == 'word'
assert tfidf.max_features == 2000
assert tfidf.tokenizer == word_tokenize

Now, it's time to calculate TF-IDF for words across our corpus of Inaugural addresses! 

To do this:

1. Generate a DataFrame `inaug_tfidf` by using the `tfidf.fit_transform` method to calculate TF-IDF on your `text` variable. 
2. Be sure that the index is the year of each address and that each column is named for the word whose values it holds. The `get_feature_names_out` method of `tfidf` (`get_feature_names` in older scikit-learn releases) may help you assign the column names, and the `years` list you created earlier may help you with the index.
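
The labeling step can be sketched on a tiny made-up matrix. The numbers and word list below are invented for illustration; in the lab, the matrix comes from `fit_transform`, the column labels from the vectorizer, and the row labels from `years`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 3 documents (rows) x 4 words (columns)
toy_matrix = np.array([
    [0.5, 0.0, 0.3, 0.1],
    [0.0, 0.7, 0.1, 0.2],
    [0.2, 0.1, 0.0, 0.9],
])
toy_words = ["america", "british", "jobs", "war"]
toy_years = ["1801", "1813", "1817"]

# Rows are labeled by year, columns by word
toy_df = pd.DataFrame(toy_matrix, index=toy_years, columns=toy_words)
print(toy_df.loc["1813", "british"])  # 0.7
```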

In [None]:
### BEGIN SOLUTION
inaug_tfidf = pd.DataFrame(tfidf.fit_transform(text).toarray())
# get_feature_names_out() requires scikit-learn 1.0+; older releases use get_feature_names()
inaug_tfidf.columns = tfidf.get_feature_names_out()
inaug_tfidf.index = years
### END SOLUTION

In [None]:
assert len(inaug_tfidf.index) == len(years)
assert len(inaug_tfidf.columns) == 2000
assert inaug_tfidf.shape == (len(years), 2000)

# Part III : Results

We're almost there. We now have a DataFrame that holds the TF-IDF scores for the 2000 most frequent words in our corpus! **Now, you'll want to extract the single most unique word (the one with the highest TF-IDF score) from each address. Assign this information (most likely a Series object) to the variable `most_unique`.**

In [None]:
### BEGIN SOLUTION
most_unique = inaug_tfidf.idxmax(axis=1)
### END SOLUTION
most_unique
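
`DataFrame.idxmax(axis=1)` returns, for each row, the label of the column holding the largest value. A minimal sketch on invented scores (the words and numbers are hypothetical):

```python
import pandas as pd

toy_scores = pd.DataFrame(
    {"british": [0.0, 0.7, 0.1], "jobs": [0.3, 0.1, 0.0], "war": [0.1, 0.2, 0.9]},
    index=["1801", "1813", "1817"],
)

# For each row (year), pick the column (word) with the highest score
toy_top = toy_scores.idxmax(axis=1)
print(toy_top.to_dict())  # {'1801': 'jobs', '1813': 'british', '1817': 'war'}
```

If two words tie for the highest score in a row, `idxmax` returns the first one in column order, so the "single most unique word" can be somewhat arbitrary when scores are equal.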

Take a look through this list of most unique words over time. Do they make sense based on what you know about American history? Do any surprise you?

With that part of our analysis done, one thing that stood out to me in this list is that "british" was the most unique word in the 1813 inaugural address. This made sense to me: it was early in American history and we had only recently left British rule. But I was curious to see whether or not "british" would show up uniquely (albeit less uniquely) in any later addresses. **Generate a line plot that plots the TF-IDF for the word "british" on the y-axis. Plot year on the x-axis.**

In [None]:
### BEGIN SOLUTION
x = inaug_tfidf.index
plt.plot(x, inaug_tfidf['british'], label="british")
plt.xlabel('Year')
plt.ylabel('TF-IDF')
# x holds year strings, so they are plotted at positions 0..n-1; label every 5th address
plt.xticks(np.arange(0, len(x), step=5))
plt.show()
### END SOLUTION

Here you should see that over time "british" peaked in inaugural addresses at a few interesting points throughout history. What about some other words?

Using a similar approach, **plot TF-IDF for "british", "america", "war", and "jobs". Take a look at the trends over time. Feel free to look at other words' trends over time.**

In [None]:
### BEGIN SOLUTION
plt.plot(x, inaug_tfidf['british'], label="british")
plt.plot(x, inaug_tfidf['america'], label="america")
plt.plot(x, inaug_tfidf['war'], label="war")
plt.plot(x, inaug_tfidf['jobs'], label="jobs")
plt.xlabel('Year')
plt.ylabel('TF-IDF')
plt.legend(loc="upper right")
# label every 5th address on the x-axis
plt.xticks(np.arange(0, len(x), step=5))
plt.show()
### END SOLUTION

You should see that "america" was mentioned frequently in the country's infancy but then became less common, whereas "british" was really common early on and "jobs" has only become applicable in recent inaugural addresses.

As with all analyses, TF-IDF is not without its limitations. Let's take a look at how our results change if we set the `max_features` argument to 4000 words (rather than 2000). **Redo the analysis to 1) calculate TF-IDF for these 4000 words, 2) identify the word with the highest TF-IDF in each year (assign this to `most_unique_4000`), and 3) generate a DataFrame with the most unique word from each analysis. Then, take a look to see how changing one argument in your analysis can affect your results! Finally, you can regenerate the line plots if you're interested in seeing how they have changed in this new analysis.**
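
One reason the results can change: according to scikit-learn's documentation, `max_features` keeps only the top terms ordered by term frequency across the corpus, so rarer words are dropped from the vocabulary before any TF-IDF score is computed. The sketch below mimics that truncation on a made-up corpus using plain Python. The helper `top_vocab` is hypothetical and is not scikit-learn's implementation (in particular, tie-breaking may differ).

```python
from collections import Counter

# Made-up corpus: counts are union=4, war=3, peace=2, jobs=1
toy_docs = [
    "union union peace peace",
    "union war war war",
    "union jobs",
]

def top_vocab(docs, max_features):
    """Keep the max_features most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(max_features)}

print(sorted(top_vocab(toy_docs, 2)))  # ['union', 'war']
print(sorted(top_vocab(toy_docs, 4)))  # ['jobs', 'peace', 'union', 'war']
```

Here "jobs" appears in only one document, which makes it exactly the kind of distinctive word TF-IDF would rank highly, yet it only enters the vocabulary when the cutoff is large enough. Raising `max_features` from 2000 to 4000 can therefore surface a different top word for some addresses.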

In [None]:
# define tfidfvectorizer object 
### BEGIN SOLUTION
tfidf2 = TfidfVectorizer(sublinear_tf=True,
                         analyzer='word',
                         max_features=4000,
                         tokenizer=word_tokenize,
                         stop_words=stopwords.words("english"))
### END SOLUTION

In [None]:
# calculate TF-IDF on input text
### BEGIN SOLUTION
inaug_tfidf2 = pd.DataFrame(tfidf2.fit_transform(text).toarray())
# get_feature_names_out() requires scikit-learn 1.0+; older releases use get_feature_names()
inaug_tfidf2.columns = tfidf2.get_feature_names_out()
inaug_tfidf2.index = years
### END SOLUTION

In [None]:
# identify most unique word each year from new model
### BEGIN SOLUTION
most_unique_4000 = inaug_tfidf2.idxmax(axis=1)
most_unique_4000
### END SOLUTION

In [None]:
# join most_unique from original model with this new list
# in a single dataframe to compare word each year
### BEGIN SOLUTION
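# keys= labels the two columns so it is clear which analysis each came from
pd.concat([most_unique, most_unique_4000], axis=1, keys=['max_features=2000', 'max_features=4000'])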
### END SOLUTION

In [None]:
# regenerate plot
### BEGIN SOLUTION
x = inaug_tfidf2.index
plt.plot(x, inaug_tfidf2['british'], label="british")
plt.plot(x, inaug_tfidf2['america'], label="america")
plt.plot(x, inaug_tfidf2['war'], label="war")
plt.plot(x, inaug_tfidf2['jobs'], label="jobs")
plt.xlabel('Year')
plt.ylabel('TF-IDF')
plt.legend(loc="upper right")
# label every 5th address on the x-axis
plt.xticks(np.arange(0, len(x), step=5))
plt.show()
### END SOLUTION

Good work getting comfortable with text data here, and hopefully you learned a bit more about inaugural addresses over time. Go ahead and submit your discussion lab!