# Text Data

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get
import re

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import nltk

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from nltk.corpus import stopwords
from nltk import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

import string
import time

## Text Analysis over Time

We've so far only looked at the New York Times API for one month. However, we have access to much more data than that. The NYT Archive API allows us to pull data for any given month, so we could have just pulled data for an entire year by looping over the months and using the API for each month. The code to do this is shown below.

In [None]:
with open('nyt-key.txt', 'r') as f:
    nyt_key = f.readline().strip()  # drop the trailing newline so the key is sent cleanly

In [None]:
year = 2020
month = 7

base_url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

In [None]:
r = get(base_url, params= {'api-key':nyt_key})

In [None]:
def get_nyt_archive(month, year, key):
    '''
    Pull from the NYT Archive API for a given month and year. Returns a DataFrame
    with one row per article, including its abstract.
    
    Arguments:
        month: int, month for which the data should be pulled
        year: int, year for which the data should be pulled
        key: str, the NYT API key to use to pull from the API
        
    Returns:
        A DataFrame with the columns web_url, abstract, pub_date,
        type_of_material, and word_count
    '''
    base_url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    # Use the key passed in as an argument rather than the global nyt_key
    r = get(base_url, params={'api-key': key})
    articles = r.json()['response']['docs']
    
    # Keep only the fields we need from each article
    fields = ['web_url', 'abstract', 'pub_date', 'type_of_material', 'word_count']
    nyt_dict = {field: [article[field] for article in articles] for field in fields}
    return pd.DataFrame(nyt_dict)
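
The loop over months is not reproduced in this section. A minimal sketch of what it might look like is given here, under a few assumptions: `pull_year` and its `fetch_month` and `delay` parameters are hypothetical names, `fetch_month` is any function with the same signature as `get_nyt_archive`, and the `month` column is added by the loop so that it is available later on.

```python
import time

import pandas as pd


def pull_year(fetch_month, year, key, delay=12):
    """Call fetch_month(month, year, key) for months 1-12 and stack the results.

    fetch_month is assumed to return a DataFrame, as get_nyt_archive does.
    delay is the pause in seconds between requests.
    """
    frames = []
    for month in range(1, 13):
        df = fetch_month(month, year, key)
        df['month'] = month  # record which month each row came from
        frames.append(df)
        if month < 12:
            time.sleep(delay)  # pause between requests to respect the rate limit
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage with the function and key defined above:
# nyt_2020 = pull_year(get_nyt_archive, 2020, nyt_key)
```

Passing the fetch function in as an argument keeps the loop easy to try out with a stand-in function before spending real API requests.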

Note that there is a `time.sleep(12)` in the loop. This is because the NYT API only allows 5 API requests per minute. So, we slow down our code so that we don't make too many requests, or else they will be denied. This code might take some time to run because of this, so instead you can bring in the CSV file provided that has the results of this process.

In [None]:
nyt_2020 = pd.read_csv('nyt_2020.csv')

In [None]:
nyt_2020.head()

## Visualizing NYT Abstracts Over Time

First, let's take the data over a full year and apply the same cleaning techniques that we discussed. That is, we want to tokenize, stem, and remove stopwords within our corpus.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer("english")
stop = stopwords.words('english')

First, we tokenize the words. Using `reset_index()` afterwards means that we have an `index` variable that acts as our document ID variable. In other words, the `index` variable represents a unique abstract, while the `abstract` variable contains the words within those abstracts.

In [None]:
# Drop missing abstracts first, since the tokenizer only works on strings
tokens = nyt_2020.abstract.dropna().apply(tokenizer.tokenize).explode().reset_index()

In [None]:
tokens.head()

We'll merge this back in with our `type_of_material` and `month` data so that we have that to use with our visualizations.

In [None]:
token_df = tokens.merge(nyt_2020[['type_of_material', 'month']], 
                        how = 'left', left_on = 'index', right_index = True).dropna()

token_df.head()
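
To see what `explode()`, `reset_index()`, and the merge do, here is a small sketch on a made-up two-row DataFrame. The `toy` data is invented for illustration, and `str.findall(r'\w+')` is used as a stand-in for `RegexpTokenizer(r'\w+')` so that the example only needs pandas.

```python
import pandas as pd

# Made-up data: two abstracts, each with a month
toy = pd.DataFrame({
    'abstract': ['Stocks fell sharply', 'Voters turned out'],
    'month': [1, 2],
})

# One list of tokens per abstract, then one row per token.
# reset_index() turns the original row label into an `index` column (the document ID).
toy_tokens = toy.abstract.str.findall(r'\w+').explode().reset_index()

# Attach the month of each abstract back onto its tokens
toy_token_df = toy_tokens.merge(toy[['month']], how='left',
                                left_on='index', right_index=True)
print(toy_token_df)
```

Each token ends up on its own row, with `index` identifying the abstract it came from and `month` carried over from that abstract.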

Next, we proceed with our data pre-processing steps. We first lowercase all words, then remove stopwords. Lastly, we stem the words. 

In [None]:
token_df['abstract'] = token_df['abstract'].str.lower()
# Use ~ to negate the boolean mask; .copy() avoids a SettingWithCopyWarning on the next line
cleaned_token_df = token_df[~token_df.abstract.isin(stop)].copy()
cleaned_token_df['abstract'] = cleaned_token_df.abstract.apply(stemmer.stem)
cleaned_token_df.head()

Suppose we want to look at the most frequent words within each month of the year. We can use `groupby` to aggregate by the `month` variable, then use `value_counts()` to count how often each word was used within that month. Then, we take the top 10 words from each month by using the `nlargest` method within each group. The `reset_index()` here again lets us pull the grouped multi-index levels, month and abstract, out as variables within the DataFrame, so that we can use them within our visualizations.

In [None]:
tokens_freq = cleaned_token_df.groupby('month').abstract.value_counts()
top_tokens = tokens_freq.groupby(level = 0, group_keys=False).nlargest(10).reset_index()

In [None]:
top_tokens.head()
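
The same `groupby`, `value_counts()`, and `nlargest` pattern can be followed on a tiny made-up example. The data here is invented, only the single top word per month is kept, and the explicit `rename('count')` is an addition so that the count column has the same name across pandas versions.

```python
import pandas as pd

# Made-up cleaned tokens: one row per token, tagged with its month
toy_clean = pd.DataFrame({
    'month':    [1, 1, 1, 1, 2, 2, 2],
    'abstract': ['virus', 'virus', 'trump', 'elect', 'elect', 'elect', 'vote'],
})

# Count each word within each month; the result is indexed by (month, abstract)
toy_freq = toy_clean.groupby('month').abstract.value_counts().rename('count')

# Keep the most frequent word per month, then move the index levels into columns
toy_top = toy_freq.groupby(level=0, group_keys=False).nlargest(1).reset_index()
print(toy_top)
```

With `group_keys=False`, the month is not added to the index a second time, so `reset_index()` produces clean `month`, `abstract`, and `count` columns.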

One possibility for a graph is a bar graph with a slider for month. This allows us to look at the top words within each month separately, as well as see how it changes over time.

In [None]:
px.bar(top_tokens, x = 'count', y = 'abstract', orientation = 'h',
       animation_frame="month", animation_group = 'abstract',
       range_x=[0,1200])

We could also try to use a line plot. Plotting every word at once is not very helpful, though, because there are too many colors to tell apart.

In [None]:
sns.relplot(top_tokens, x = 'month', y = 'count', hue = 'abstract', kind = 'line')

<font color ='red'>**Question 1: What are some trends in the words used each month of 2020? What are some possible relationships that might be of interest in this dataset, and what are some other possible types of graphs you might use to analyze these?**</font>

Your answer here:

## Turning Text Data into a Matrix Format

When we work with data, we usually think about it in terms of rows and columns. That is, we have rows of observations and columns of variables. Text data as it is doesn't quite fit into that format, so we need to do some work to be able to do more advanced analyses. We've gone over doing some cleaning and breaking down of text data, but now we want to convert it into a matrix or table format that we can use to do analyses.

To do this, we're going to treat each token as a variable and each document as an observation. So, in the case of NYT article abstracts, we will treat each individual article abstract as an observation. There will be as many columns as there are unique tokens in the overall corpus (so there will be many, many variables!). The dataset that we end up with will look something like this:

|document ID|about|america|author|ask|...|
|-|-|-|-|-|-|
|1|0|0|0|0|...|
|2|0|1|0|0|...|
|3|0|0|3|0|...|
|4|1|0|0|0|...|
|5|0|0|0|2|...|
|...|...|...|...|...|...|

To convert our abstracts into this format, we first take a Series of the abstracts with everything lowercased.

In [None]:
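# Drop missing abstracts, lowercase, and renumber so row i here matches row i of the matrix built later
abstracts = nyt_2020.abstract.dropna().str.lower().reset_index(drop=True)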
abstracts.head()

Next, we create a `tokenize` function that does the tokenizing and stemming steps that we had done before. We will pass this function to `CountVectorizer` below rather than calling it on the abstracts ourselves.

In [None]:
stemmer = SnowballStemmer("english")

def tokenize(text):
    tokens = tokenizer.tokenize(text)
    return [stemmer.stem(token) for token in tokens]
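
As a quick check of what a function like this returns, here is a self-contained sketch that rebuilds the same tokenizer and stemmer and applies them to a made-up phrase. The name `tokenize_and_stem` is used only to keep it separate from the `tokenize` defined above.

```python
from nltk import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

demo_tokenizer = RegexpTokenizer(r'\w+')
demo_stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    # Split on runs of word characters, then reduce each token to its stem
    return [demo_stemmer.stem(token) for token in demo_tokenizer.tokenize(text)]


stems = tokenize_and_stem("running cats")
print(stems)  # ['run', 'cat']
```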

## CountVectorizer

We can apply this to each abstract in our corpus using `CountVectorizer`. This will not only do the tokenizing, but it will also count any duplicates of words and create a matrix that contains the frequency of each word. This will be quite a large matrix (number of columns will be number of unique words), so it outputs the data as a sparse matrix.

We will first create the `vectorizer` object (you can think of this like a model object), and then fit it with our abstracts. This should give us back our overall corpus bag of words, as well as a list of features (that is, the unique words in all the abstracts).

In [None]:
# Tokenize stop words to match
eng_stopwords = [tokenize(s)[0] for s in stop]
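
The reason for stemming the stop words is that `CountVectorizer` compares its `stop_words` list against the tokens that come out of the tokenizer, which here are already stemmed. The sketch below illustrates the mismatch with plain Python lists instead of `CountVectorizer`, using a made-up sentence and a hand-picked two-word stop list.

```python
from nltk import SnowballStemmer

demo_stemmer = SnowballStemmer("english")

raw_stop = ["because", "the"]                            # unstemmed stop words
stemmed_stop = [demo_stemmer.stem(w) for w in raw_stop]  # stemmed to match the tokens

# Stemmed tokens from a made-up sentence; "because" becomes "becaus"
tokens = [demo_stemmer.stem(t) for t in "because the markets fell".split()]

kept_with_raw = [t for t in tokens if t not in raw_stop]
kept_with_stemmed = [t for t in tokens if t not in stemmed_stop]

print(kept_with_raw)      # "becaus" slips through the unstemmed list
print(kept_with_stemmed)  # "becaus" is removed once the stop words are stemmed too
```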

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words 
                             tokenizer=tokenize, # function to create tokens
                             ngram_range=(1,1), # Tokens are individual words (unigrams) for now
                             strip_accents='unicode',
                             stop_words= eng_stopwords)

Once we have created the vectorizer, we can use it to transform our abstracts.

In [None]:
bag_of_words = vectorizer.fit_transform(abstracts) #transform our corpus into a bag of words 
features = vectorizer.get_feature_names_out()

Note that since this can be quite large, it will be stored as a sparse matrix. That is, it only stores information about which rows and columns have non-zero values.

In [None]:
print(bag_of_words[0])
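
For a sense of what a sparse matrix keeps track of, here is a minimal sketch using `scipy.sparse.csr_matrix` on a made-up 2-by-4 array. It assumes SciPy is available (scikit-learn depends on it), and the numbers are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up counts: 2 documents, 4 tokens, mostly zeros
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 2]])
sparse = csr_matrix(dense)

n_stored = sparse.nnz            # number of non-zero entries actually stored
rows, cols = sparse.nonzero()    # positions of those entries
density = n_stored / (sparse.shape[0] * sparse.shape[1])

print(n_stored, density)
print(sorted(zip(rows.tolist(), cols.tolist())))
```

Only 3 of the 8 cells are stored here. In a real bag of words, where each abstract uses a tiny fraction of the vocabulary, the savings are much larger.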

In [None]:
features[:10]

In [None]:
features[[10649, 22204, 25233, 18085, 26651, 9483, 5041, 5005]]

In [None]:
abstracts[0]