# Homework #2

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import os
import json
import re

import numpy as np
import pandas as pd

# Problem 1: Word phrases

### In this problem we will look at methods to identify valid n-grams such as 'New York' or 'Barack Obama' while eliminating statistical flukes such as `in the` or `i write`.

### Preprocessing such as this can drastically improve embeddings, since n-grams often have a different meaning than the sum of their parts:
### `V('united')` + `V('states')` != `V('united states')`
### `V('real')` + `V('estate')` != `V('real estate')`

In [6]:
wiki_df = pd.read_csv('./data/kdwd_r1k_articles.csv')

In [7]:
wiki_df.head()

Unnamed: 0,page_id,page_title,page_views,intro_text
0,856,Apple Inc.,190485,Apple Inc. is an American multinational techno...
1,2386,American Airlines,40829,"American Airlines, Inc. (AA) is a major Americ..."
2,2400,Advanced Micro Devices,40665,"Advanced Micro Devices, Inc. (AMD) is an Ameri..."
3,2439,Anthem,2967,An anthem is a musical composition of celebrat...
4,6690,Coca-Cola,457810,"Coca-Cola, or Coke, is a carbonated soft drink..."


### get consecutive unigrams for the 'intro_text' column of our dataset

In [8]:
# list of lists of unigrams
unigram_pattern = ...
corpus = [re.findall(unigram_pattern, doc.lower()) for doc in wiki_df['intro_text'].tolist()]

TypeError: first argument must be string or compiled pattern
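
The `TypeError` above is raised because the placeholder `...` (Python's `Ellipsis`) is neither a string nor a compiled pattern. A minimal sketch of one way to fill it in, assuming a token is simply a run of ASCII letters or digits. The pattern and the two sample sentences are illustrative assumptions, not the required solution:

```python
import re

# Assumed pattern: one or more ASCII letters or digits form a token.
unigram_pattern = r"[a-z0-9]+"

# Hypothetical stand-in for wiki_df['intro_text'].tolist()
docs = ["Apple Inc. is an American company.", "Coca-Cola, or Coke, is a soft drink."]

# list of lists of unigrams, one inner list per document
corpus = [re.findall(unigram_pattern, doc.lower()) for doc in docs]
print(corpus[0])
```

A stricter or looser pattern (for example one that keeps hyphens or apostrophes) changes which bigrams can be found later, so the choice is worth revisiting.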

### The package `gensim` has a convenient wrapper, `Phrases`, to obtain statistically significant n-grams automatically

### we need to first `pip install gensim`
### `gensim` is a useful library for anything related to word representations and embeddings. It will come up a few more times. https://radimrehurek.com/gensim/index.html

In [10]:
from gensim.models.phrases import Phrases

### Write some code to parse our corpus and find valid n-grams using `Phrases`

In [None]:
phrases = ...

In [None]:
vocab_count_dict = {k.decode('utf8'): v for k, v in phrases.vocab.items()}

n_grams = ...

In [None]:
print(n_grams.shape[0], 'n-grams found')
n_grams.head(10)

### How do the results look? Can you improve the results by excluding common terms using the `common_terms` kwarg of `Phrases`?

In [None]:
phrases = ...

In [None]:
n_grams = ...

In [None]:
print(n_grams.shape[0], 'n-grams found')
n_grams.head(10)

## This was convenient, but it's also a black box where many of the knobs for tuning are actually broken in the newest version. Let's try to create our own solution for finding n-grams.

### To do this, let's start by counting unigrams and bigrams within our corpus

### Tip: use `Counter` for easy counting. It behaves like a dictionary with some added functionality for counting, such as `my_counter[unknown_key]` returning `0` for any unknown key.
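
A small sketch of how `Counter` can collect unigram and bigram counts in one pass. The toy corpus is a hypothetical stand-in for the tokenized `intro_text` column, and storing bigrams as tuples is an assumption made here for illustration:

```python
from collections import Counter

# Hypothetical toy corpus: a list of token lists.
toy_corpus = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]

unigram_counter = Counter()
bigram_counter = Counter()
for tokens in toy_corpus:
    unigram_counter.update(tokens)
    # zip(tokens, tokens[1:]) pairs each token with its successor
    bigram_counter.update(zip(tokens, tokens[1:]))

print(unigram_counter["new"])            # count of a seen key
print(bigram_counter[("new", "york")])   # bigrams are stored as tuples
print(unigram_counter["unseen"])         # missing keys count as 0, no KeyError
```

Looking up a missing key does not insert it, so the counters stay free of spurious zero entries.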

In [None]:
from collections import Counter

unigram_counter = Counter()
bigram_counter = Counter()
for tokens in corpus:
    # your code here

### Now we need to come up with a score for each bigram that helps us decide how important it is and whether it is truly a bigram or just two independent unigrams.
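
One possible score, sketched here as an assumption rather than the expected answer, is the discounted co-occurrence ratio `(count(a, b) - delta) / (count(a) * count(b))`. The discount `delta` penalizes bigrams built from very rare words. The function name and the counts below are hypothetical:

```python
from collections import Counter

def bigram_score(bigram, unigram_counter, bigram_counter, delta=1):
    """Discounted co-occurrence score: (count(a, b) - delta) / (count(a) * count(b))."""
    first, second = bigram
    return (bigram_counter[bigram] - delta) / (unigram_counter[first] * unigram_counter[second])

# Hypothetical counts, chosen only to illustrate the behaviour
unigram_counter = Counter({"new": 10, "york": 8, "in": 500, "the": 1000})
bigram_counter = Counter({("new", "york"): 7, ("in", "the"): 60})

scores = {bg: bigram_score(bg, unigram_counter, bigram_counter) for bg in bigram_counter}
for bg, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(bg, round(score, 5))
```

Although `in the` occurs far more often in absolute terms, its score is much lower because both of its words are frequent on their own.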

In [None]:
# your code here

bigram_df = ...

In [None]:
bigram_df.head(10)

### Find ways to sort and filter your output to bigrams that make sense, such as `wells fargo`, `apple inc` or `puerto rico`

In [None]:
# your code here

filtered_bigram_df = ...

In [None]:
filtered_bigram_df.head(10)

# Problem 2: Word vectors via Pointwise Mutual Information (PMI)

### In this problem we will investigate another way of creating word representations from word co-occurrences. For this we will create a word-word matrix that counts the number of times that two words appear close to each other.

## More formally:

### The pointwise mutual information (PMI) for a (word, context) pair in a corpus is defined as the logarithm of the probability of their co-occurrence divided by the product of the probabilities of them appearing individually:
## $$
{\rm pmi}(w, c) = \log \frac{p(w, c)}{p(w) p(c)}
$$

## $$
p(w, c) = \frac{
f_{i,j}
}{
\sum_{i=1}^N \sum_{j=1}^N f_{i,j}
}, \quad 
p(w) = \frac{
\sum_{j=1}^N f_{i,j}
}{
\sum_{i=1}^N \sum_{j=1}^N f_{i,j}
}, \quad
p(c) = \frac{
\sum_{i=1}^N f_{i,j}
}{
\sum_{i=1}^N \sum_{j=1}^N f_{i,j}
}
$$
### where $f_{i,j}$ is the word-word count matrix, $i$ is the index of word $w$, $j$ is the index of context $c$, and $N$ is the vocabulary size. <br />
### In addition we can define the positive pointwise mutual information as, 
## $$
{\rm ppmi}(w, c) = {\rm max}\left[{\rm pmi}(w, c), 0 \right]
$$
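
To make the definitions concrete, here is a small worked example on a hypothetical 3x3 count matrix (the numbers are made up). It uses the natural logarithm and dense NumPy arrays purely for illustration:

```python
import numpy as np

# Hypothetical 3x3 word-context count matrix f[i, j]
f = np.array([[0, 4, 1],
              [4, 0, 1],
              [1, 1, 8]], dtype=float)

total = f.sum()
p_wc = f / total                             # joint probability p(w, c)
p_w = f.sum(axis=1, keepdims=True) / total   # row sums: p(w)
p_c = f.sum(axis=0, keepdims=True) / total   # column sums: p(c)

with np.errstate(divide="ignore"):           # log(0) = -inf for unseen pairs
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                    # clip negatives (and -inf) to 0

print(np.round(ppmi, 3))
```

Pairs that never co-occur have a PMI of negative infinity, which is one practical reason to work with PPMI instead.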

## We will implement this on our wiki featured articles dataset

In [None]:
wiki_feat_df = pd.read_csv('../../data/kdwd_featured_articles.csv')
wiki_feat_df.sample(5)

In [None]:
corpus = wiki_feat_df['intro_text'].tolist()

In [None]:
def get_tokens(text):
    token_pattern = ...
    return re.findall(token_pattern, text.lower())

In [None]:
from collections import Counter

In [None]:
unigram_counts = Counter()
for doc in corpus:
    # your code here

vocab = ...
inv_vocab = ...
print('vocabulary size: {}'.format(len(unigram_counts)))

### Skip-grams are a generalization of n-grams: https://en.wikipedia.org/wiki/N-gram#Skip-gram
### We will use this term here to find pairs of words within a context window, meaning that any two words separated by at most N words will be considered a bigram
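
A minimal sketch of window-based pair extraction, assuming a symmetric window and that a word should not be paired with itself. The helper name `window_pairs` and the example sentence are hypothetical:

```python
from collections import Counter

def window_pairs(tokens, window=2):
    """Yield (word, context) pairs with context at most `window` positions away."""
    for idx, word in enumerate(tokens):
        start = max(0, idx - window)   # clamp: a negative slice start would wrap around
        stop = idx + window + 1        # range end is exclusive, hence the +1
        for ctx_idx in range(start, min(stop, len(tokens))):
            if ctx_idx != idx:         # a word is not its own context
                yield word, tokens[ctx_idx]

tokens = ["the", "quick", "brown", "fox"]   # hypothetical example sentence
skipgram_counts = Counter(window_pairs(tokens, window=2))
print(sorted(skipgram_counts))
```

With a window of 2, `the` and `fox` are three positions apart and therefore never paired.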

In [None]:
# use skip-2-grams and context length 2 in each direction
word_window_len = 2
skipgram_counts = Counter()
for doc in corpus:
    tokens = get_tokens(doc)
    for token_idx, token in enumerate(tokens):
        for context_token in tokens[max(0, token_idx - word_window_len):token_idx + word_window_len + 1]:
            #your code here

print('number of skipgrams:', len(skipgram_counts))
print('most common:')
[((inv_vocab[t1], inv_vocab[t2]), v) for (t1, t2), v in skipgram_counts.most_common(5)]

### Now let's create a sparse matrix that contains word-word co-occurrence counts
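
A sketch of building the sparse count matrix from a `Counter` keyed by integer id pairs. The counts and the vocabulary size are hypothetical. Passing `shape` explicitly is a design choice made here so that the matrix stays square even when the highest ids have no stored entries:

```python
from collections import Counter
from scipy import sparse as ssp

# Hypothetical skip-gram counts keyed by (word_id, context_id)
skipgram_counts = Counter({(0, 1): 3, (1, 0): 3, (1, 2): 1, (2, 1): 1})
vocab_size = 4   # id 3 never co-occurs with anything

row_indxs, col_indxs, dat_values = [], [], []
for (word_id, context_id), count in skipgram_counts.items():
    row_indxs.append(word_id)
    col_indxs.append(context_id)
    dat_values.append(count)

# Without shape=..., the matrix size is inferred from the largest index seen.
wwcnt_mat = ssp.csr_matrix((dat_values, (row_indxs, col_indxs)),
                           shape=(vocab_size, vocab_size))
print(wwcnt_mat.shape, wwcnt_mat.nnz)
```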

In [None]:
from scipy import sparse as ssp

row_indxs = []
col_indxs = []
dat_values = []

# your code here

wwcnt_mat = ssp.csr_matrix((dat_values, (row_indxs, col_indxs)))

### Next, create the PPMI matrix

In [None]:
# reusable quantities
num_skipgrams = wwcnt_mat.sum()
sum_over_words = np.array(wwcnt_mat.sum(axis=0)).flatten()
sum_over_contexts = np.array(wwcnt_mat.sum(axis=1)).flatten()

ppmi_dat_values = []   # positive pointwise mutual information
row_indxs = []  # for creating sparse matrices
col_indxs = []  # for creating sparse matrices
for (tok_word, tok_context), sg_count in skipgram_counts.items():

    nwc = ...
    Pwc = ...
    nw = ...
    Pw = ...
    nc = ...
    Pc = ...
    
    pmi = np.log2(Pwc / (Pw * Pc))   
    ppmi = max(pmi, 0)
    
    row_indxs.append(tok_word)
    col_indxs.append(tok_context)
    ppmi_dat_values.append(ppmi)

ppmi_mat = ssp.csr_matrix((ppmi_dat_values, (row_indxs, col_indxs)))

## Use `ppmi_mat` to investigate the most similar terms for a few test tokens.
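
A sketch of a nearest-neighbour lookup over a similarity matrix. It assumes `vocab` maps token to row index and `inv_vocab` maps row index back to token. The helper name `most_similar` and the 4-term matrix are hypothetical:

```python
import numpy as np

def most_similar(token, vocab, inv_vocab, sim_mat, topn=3):
    """Return the `topn` (term, similarity) pairs closest to `token`, excluding itself."""
    idx = vocab[token]
    order = np.argsort(-sim_mat[idx]).tolist()   # indices by descending similarity
    return [(inv_vocab[j], float(sim_mat[idx, j])) for j in order if j != idx][:topn]

# Hypothetical 4-term vocabulary and similarity matrix
vocab = {"rap": 0, "hiphop": 1, "physics": 2, "quantum": 3}
inv_vocab = {i: t for t, i in vocab.items()}
sim_mat = np.array([[1.0, 0.9, 0.1, 0.2],
                    [0.9, 1.0, 0.0, 0.1],
                    [0.1, 0.0, 1.0, 0.8],
                    [0.2, 0.1, 0.8, 1.0]])

print(most_similar("quantum", vocab, inv_vocab, sim_mat, topn=2))
```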

In [None]:
# to speed up calculation we do dimensionality reduction here
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=64, random_state=6006)
trafo_ppmi_mat = svd.fit_transform(ppmi_mat)
sim_mat = cosine_similarity(trafo_ppmi_mat)

In [None]:
token = 'eminem'
# print most similar terms

In [None]:
token = 'quantum'
# print most similar terms

## In what way do these embeddings differ from the TF-IDF-based ones we covered in class? Can you think of advantages/disadvantages for each approach?

In [None]:
# Your answer here!

# Problem 3: Word vectors for different domains


### In this problem we will create embeddings for the `intro_text` column of the datasets `kdwd_featured_articles.csv` and `kdwd_r1k_articles.csv`
### We can think of these as examples of 'generic' and 'finance specific' word representations

## The goal of this exercise is to compare these two representations and find out which words change meaning the most across these two domains

In [2]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as ssp
from scipy.sparse import csr_matrix
import pandas as pd
import numpy as np

In [3]:
generic_df = pd.read_csv('./data/kdwd_featured_articles.csv')

In [4]:
generic_df

Unnamed: 0,page_id,page_title,page_views,intro_text
0,25,Autism,49693,Autism is a developmental disorder characteriz...
1,621,Amphibian,18926,"Amphibians are ectothermic, tetrapod vertebrat..."
2,662,Apollo 11,64044,Apollo 11 was the spaceflight that first lande...
3,663,Apollo 8,9848,Apollo 8 was the first crewed spacecraft to le...
4,751,Aikido,18702,is a modern Japanese martial art developed by ...
...,...,...,...,...
5767,61561199,David Hillhouse Buel,86,"David Hillhouse Buel (July 19, 1862 – May 23, ..."
5768,61688854,Roman temple of Bziza,627,The Roman temple of Bziza is a well-preserved ...
5769,61788967,Hurricane Humberto (2019),2140,Hurricane Humberto was a large and powerful tr...
5770,61824268,Battle of Babylon Hill,146,The Battle of Babylon Hill was a skirmish that...


In [5]:
finance_df = pd.read_csv('./data/kdwd_r1k_articles.csv')

In [5]:
finance_df

Unnamed: 0,page_id,page_title,page_views,intro_text
0,856,Apple Inc.,190485,Apple Inc. is an American multinational techno...
1,2386,American Airlines,40829,"American Airlines, Inc. (AA) is a major Americ..."
2,2400,Advanced Micro Devices,40665,"Advanced Micro Devices, Inc. (AMD) is an Ameri..."
3,2439,Anthem,2967,An anthem is a musical composition of celebrat...
4,6690,Coca-Cola,457810,"Coca-Cola, or Coke, is a carbonated soft drink..."
...,...,...,...,...
756,59351825,Diamondback Energy,1023,Diamondback Energy is a company engaged in hyd...
757,59690565,The Michaels Companies,4605,"The Michaels Companies, Inc. is North America'..."
758,60008806,Cheniere Energy,1124,"Cheniere Energy, Inc. is a liquefied natural g..."
759,60653452,CNX Resources,448,CNX Resources is a natural gas company based i...


### Create word representations for our two corpora using your favorite method

In [19]:
generic_vectorizer = TfidfVectorizer(min_df=1, max_df=1.0)
generic_mat = generic_vectorizer.fit_transform(generic_df['intro_text'].tolist())
generic_vocab = {token: n for n, token in enumerate(pd.Series(generic_vectorizer.vocabulary_).sort_values().index)}

finance_vectorizer = TfidfVectorizer(min_df=1, max_df=1.0)
finance_mat = finance_vectorizer.fit_transform(finance_df['intro_text'].tolist())
finance_vocab = {token: n for n, token in enumerate(pd.Series(finance_vectorizer.vocabulary_).sort_values().index)}

### Since our two corpora use different vocabularies, we want to sub-select each representation matrix so that it contains only vocabulary tokens that occur in both corpora

In [20]:
common_terms = list(set(generic_vocab) & set(finance_vocab))

In [21]:
len(common_terms)

7473

In [22]:
filtered_generic_mat = []
filtered_finance_mat = []

# your code here
for word in common_terms:
    filtered_generic_mat.append(generic_mat[:,generic_vocab[word]])
    filtered_finance_mat.append(finance_mat[:,finance_vocab[word]])

filtered_generic_mat = ssp.hstack(filtered_generic_mat)
filtered_finance_mat = ssp.hstack(filtered_finance_mat)

In [23]:
filtered_generic_mat

<5772x7473 sparse matrix of type '<class 'numpy.float64'>'
	with 713222 stored elements in COOrdinate format>

### Our documents for each corpus are different, so there is no direct way of comparing our two representations, even though they now have the same dimension. To get them on equal footing, let's look at the word-word similarity matrix for each domain.
### Comparing these two, find terms that seem to have a drastically different meaning within the two domains.
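
One way to make this comparison concrete, sketched on hypothetical 3x3 similarity matrices: for each shared term, take its row of neighbour similarities in each domain and compute the cosine similarity between the two rows. A low value suggests the term keeps different company in the two corpora. The helper `row_cosine` and all numbers are illustrative assumptions:

```python
import numpy as np

def row_cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

terms = ["bank", "river", "loan"]   # hypothetical shared vocabulary
# Hypothetical term-term similarity matrices for two domains
generic_sim = np.array([[1.0, 0.8, 0.1],
                        [0.8, 1.0, 0.0],
                        [0.1, 0.0, 1.0]])
finance_sim = np.array([[1.0, 0.0, 0.9],
                        [0.0, 1.0, 0.0],
                        [0.9, 0.0, 1.0]])

drift = {t: row_cosine(generic_sim[i], finance_sim[i]) for i, t in enumerate(terms)}
for term, score in sorted(drift.items(), key=lambda kv: kv[1]):
    print(term, round(score, 3))   # lowest similarity = largest change in neighbours
```

Returning a plain `float` per term keeps the resulting scores easy to sort and display.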

In [24]:
# get the term-term similarity matrix
generic_term_sim_mat = cosine_similarity(filtered_generic_mat.T)
finance_term_sim_mat = cosine_similarity(filtered_finance_mat.T)

In [25]:
finance_term_sim_mat

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [26]:
generic_term_sim_mat

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.00691614],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.00691614, 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [27]:
term_drift_scores = {}
for n, term in enumerate(common_terms):
    term_sim = cosine_similarity([generic_term_sim_mat[n]],[finance_term_sim_mat[n]])
    term_drift_scores[term] = term_sim

In [28]:
term_drifts = pd.Series(term_drift_scores).sort_values()

In [29]:
term_drifts.head(10)

pancakes         [[0.05038100790221931]]
bellamy          [[0.06005089573428904]]
haas             [[0.07207226355123118]]
methodists       [[0.07404461116193248]]
circus           [[0.07439743333673383]]
oldham           [[0.07447224405096084]]
understands      [[0.07458418077041584]]
cynthia          [[0.07485072508366464]]
individuality    [[0.07515285069220118]]
levi             [[0.07648182601374093]]
dtype: object

In [30]:
term_drifts.tail(10)

it      [[0.7428606442366982]]
its     [[0.7529668360683097]]
with    [[0.7589772440928931]]
by       [[0.765719225451046]]
as      [[0.7784307198243836]]
to      [[0.7947377813721546]]
in      [[0.7976528411563721]]
and     [[0.7990783673375985]]
of      [[0.8026969896289652]]
the     [[0.8192134197968853]]
dtype: object

# Problem 4: Corporate Similarity and Returns
### In this example we'll explore how to use NLP to measure corporate similarity

### In particular we will
 - ### Make word vectors for firms in order to get an NLP measure of similarity
 - ### Measure the quality of this similarity metric by predicting future co-movement of returns. 
 
## Step X: This problem uses a few concepts of basic modeling such as `sklearn.model_selection.train_test_split` and `sklearn.linear_model.LinearRegression`
## Feel free to read some of the sklearn documentation, but otherwise we will cover these concepts next class
 

# $ \\ $
## Step 0: Load the MD&A sections from Form 10-K filings from 2016

In [None]:
with open('../../data/parsed_mda.json') as f:
    data = json.load(f)

### Next, take only the first filing for each company

In [None]:
clean = []
seen = set()
for item in data:
    if item['ticker'] in seen:
        continue
    else:
        seen.add(item['ticker'])
        clean.append(item)
data = clean
del clean

### Now load the price data for 2015-2018

In [None]:
prices = pd.read_csv(
    '../../data/sp500_prices.csv', 
    index_col=0, 
    parse_dates=True
).loc['2015-01-01':'2018-01-01']

In [None]:
prices.head()

In [None]:
data_tickers = [item['ticker'] for item in data]
assert len(data_tickers) == len(set(data_tickers)), 'non-unique tickers, this will not work'

## Step 1: clean the text
### Much of NLP boils down to doing reasonable processing on text.
### First, we'll try out very minimal processing

In [None]:
def clean_mda_simple(mda):
    return mda.lower()

In [None]:
# add import here

In [None]:
word_vecs = ...

## Step 2: Pairwise Word similarity
### Calculate the pairwise cosine similarity between word vectors
### Make the cosine similarities into a dataframe indexed/columned on ticker symbols
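
A sketch of the target shape, using plain NumPy in place of `sklearn`'s `cosine_similarity` and a small dense array in place of the sparse TF-IDF matrix. The tickers and vectors are hypothetical:

```python
import numpy as np
import pandas as pd

tickers = ["AAA", "BBB", "CCC"]          # hypothetical ticker symbols
# Hypothetical dense stand-in for the TF-IDF document vectors (one row per firm)
word_vecs = np.array([[1.0, 2.0, 0.0],
                      [2.0, 4.0, 0.0],   # same direction as AAA, twice the length
                      [0.0, 0.0, 3.0]])

# Normalize each row to unit length; dot products of unit vectors are cosines
unit = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
word_sims = pd.DataFrame(unit @ unit.T, index=tickers, columns=tickers)
print(word_sims.round(3))
```

`AAA` and `BBB` come out with similarity 1 even though one vector is twice as long, since only the direction of the vectors matters.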

In [None]:
word_sims = ...

## Step 2a: Why `cosine_similarity` and not another measure?

In [None]:
# ANSWER GOES HERE

## Step 3: Wrangle the price and word data
### Our goal here is to have a dataframe which is indexed on PAIRS of tickers and has columns
 - ### `returns_correlation`: the correlation of returns for those two tickers from Jan 1 2016 to Jan 1 2017
 - ### `word_similarity`: the cosine similarity of the word vectors for the two companies' MD&A sections
 
## Tips
 - ### NB: use pct_change to calculate returns in pandas
 - ### NB: use the pandas builtin corr function to calculate correlations (we don't need anything fancy)
 - ### NB: the index of the dataframe should have two columns (the tickers)
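
The tips above can be sketched as follows on a hypothetical price table. The level names `ticker_a` and `ticker_b` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical price table: rows are dates, columns are tickers
dates = pd.date_range("2016-01-04", periods=6, freq="D")
prices = pd.DataFrame({"AAA": [10.0, 11.0, 10.5, 11.5, 12.0, 11.0],
                       "BBB": [20.0, 22.0, 21.0, 23.0, 24.0, 22.0],
                       "CCC": [5.0, 4.8, 5.1, 4.9, 4.7, 5.2]}, index=dates)

returns = prices.pct_change().dropna()          # simple period-over-period returns
corr_mat = returns.corr().rename_axis(index="ticker_a", columns="ticker_b")

# stack() turns the square matrix into a Series indexed on (ticker_a, ticker_b)
rets_cor = corr_mat.stack().rename("returns_correlation").to_frame()
print(rets_cor.head())
```

A word-similarity matrix stacked the same way shares this two-level index, which is what makes a later `join` line up pair by pair.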

In [None]:
# one way you might do this is
rets_cor = ... # calculate returns correlations
word_cor = ...  # calculate the word similarities in the right shape

all_data = rets_cor.join(word_cor)
all_data = all_data.dropna()
all_data.head()

## Step 3a: 
 - ### What is the contemporaneous correlation of these data?
 - ### Make a scatter plot of the returns correlation and word similarities

In [None]:
all_data.corr()

In [None]:
## This should be about 12%. That's not bad, but we can do better

In [None]:
all_data.plot.scatter(x='returns_correlation', y='word_similarity')

## Part 4: Try to predict the future returns correlations
### Use OLS (`LinearRegression`) to predict `returns_correlation` from `word_similarity`. 
### What is the (contemporaneous) out-of-sample performance?
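
A sketch of what "out-of-sample performance" means, on synthetic data. Ordinary least squares is solved here with `np.linalg.lstsq` as a stand-in for `LinearRegression`, and the hold-out set comes from a simple shuffled index split rather than `train_test_split`. The data-generating coefficients are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: returns correlation depends linearly on word similarity
word_similarity = rng.uniform(0.0, 1.0, size=200)
returns_correlation = 0.1 + 0.5 * word_similarity + rng.normal(0.0, 0.01, size=200)

# Shuffle indices and hold out 25% as a test set
order = rng.permutation(200)
train_idx, test_idx = order[:150], order[150:]

def design(x):
    """Add an intercept column to a 1-D feature vector."""
    return np.column_stack([np.ones_like(x), x])

# Fit on the training rows only
coef, *_ = np.linalg.lstsq(design(word_similarity[train_idx]),
                           returns_correlation[train_idx], rcond=None)

# Score on the held-out rows: R^2 = 1 - SS_res / SS_tot
pred = design(word_similarity[test_idx]) @ coef
y_test = returns_correlation[test_idx]
r2 = 1.0 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print("intercept, slope:", coef.round(3), "out-of-sample R^2:", round(float(r2), 3))
```

The R^2 computed on the held-out rows is the same quantity that `reg.score(X_test, y_test)` reports for a fitted `LinearRegression`.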

In [None]:
data_df =  ...
feature_cols =  ...
target_col =  ...

In [None]:
# add code here

reg = ...# add code here
reg.score(X_test, y_test)

In [None]:
pd.Series(reg.coef_, index=feature_cols)

### This is not amazing. We can do better!

## $ \\ $ 

## Part 5: Repeat, but be careful
### Here we will see if we can clean the data better

### Things to try
 - ### Look at the histograms of word similarities to see if we can "ignore" some ill-behaved data
 - ### Try limiting how greedy the `TfidfVectorizer` is: `min_df`, `max_df`, `max_features`, etc.
 
### We will examine our data and look for things that look out of place
 - ### We will ultimately want our data to look normally distributed


In [None]:
def clean_mda(mda):
    paras = [p.lower() for p in mda.split('\n') if len(p) > 40]
    cleaned =  ' '.join(paras)
    words = cleaned.split()
    words = [word for word in words if len(word) > 2]
    if len(words) > 10:
        return ' '.join(words)
    else:
        return ''

In [None]:
vec = TfidfVectorizer(
    min_df=...,
    max_df=...,
    max_features=...,
)
word_vecs = vec.fit_transform((clean_mda(item['mda']) for item in data))


word_sims =  ...
# Many word similarities are all zeros, so we'll ignore them
# add code here to remove rows of word_sims where all the elements are zero

In [None]:
# calculate the returns correlation and the cosine similarities as above
all_data = ...

In [None]:
# inspect your data- make some histograms
plt.figure(figsize=(12,7))
all_data.returns_correlation.hist(bins=40)
plt.title('Returns Correlation')
plt.figure(figsize=(12,7))
all_data.word_similarity.hist(bins=40)
plt.title('Word Similarity')

### Cleaning our data
### It seems lots of things are identically 0 (no word overlap) or identically 1 (the MD&A section for one company perfectly overlaps itself). We will exclude those
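
A minimal sketch of this exclusion on a hypothetical four-row table. The tolerance `eps` is an assumption added to guard against floating-point round-off in values that are only nominally 0 or 1:

```python
import pandas as pd

# Hypothetical pair-level data with the two degenerate cases present
all_data = pd.DataFrame({
    "returns_correlation": [1.0, 0.30, 0.10, 0.45],
    "word_similarity":     [1.0, 0.00, 0.25, 0.60],
})

# Keep strictly interior similarities
eps = 1e-9
mask = (all_data["word_similarity"] > eps) & (all_data["word_similarity"] < 1 - eps)
filtered = all_data[mask]
print(filtered)
```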

In [None]:
# your code here

In [None]:
# examine histograms again
plt.figure(figsize=(12,7))
all_data.returns_correlation.hist(bins=40)
plt.title('Returns Correlation')
plt.figure(figsize=(12,7))
all_data.word_similarity.hist(bins=40)
plt.title('Word Similarity')

In [None]:
all_data.corr()

### Lastly, there is a bit of a "hump" at low `word_similarity`

In [None]:
# add code here
all_data.corr()

### The contemporaneous correlation is twice as large!

## Part 6: Now, repeat the exercise of predicting future returns correlation

In [None]:
data_df =  ...# Add code here
# add code here

In [None]:
reg =  ...
reg.score(X_test, y_test)

In [None]:
pd.Series(reg.coef_, index=feature_cols)

### This is about 5 times better than before!
## $ \\ $ 
## Part 7: What will happen if we include last year's returns correlation as a feature

In [None]:
last_year_corr =  ...
data_df = last_year_corr.join(...)

In [None]:
reg =  ...
reg.score(X_test, y_test)

In [None]:
pd.Series(reg.coef_, index=feature_cols)

## Indeed, we do much better, but the word features still help!