# NMF Application to Dataframe

## Workflow

Step 1: run the model against the entire dataframe to collect the topics

Step 2: take this model and apply it back to the dataframe to assign the most likely topic to each case (we want the topic number and its weight)

Step 3: make a dictionary of the components that make up each topic from the original model

Step 4: use this dictionary to "look up" the topic components and apply those to the dataframe

Step 5: get the data together for visualization

In [1]:
import pandas as pd
import re

In [None]:
##########################################  modeling imports  #######################################################
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
#from sklearn.preprocessing import Normalizer

In [None]:
df = pd.read_pickle("full_proj_lemmatized3.pickle")

In [None]:
df.head(5)

![](../images/full-project.png)

In [None]:
df.loc[15000, "case_url"]  # .ix was removed from pandas; .loc does the same label lookup
#'http://caselaw.findlaw.com/us-supreme-court/382/12.html'

## Step 1: Run model against entire dataframe (as a corpus)
Think of it like this: we need to find the themes across the entire set of documents (over 23,000 in all), so we treat every document together as one corpus and extract the topics from that.

In [None]:
def nmf_mod(corp):
    max_df = .80       # drop terms that appear in more than 80% of documents
    n_topics = 30
    n_features = 2000  # only used if max_features is re-enabled below
    n_top_words = 40
    
    # Use tf-idf features for NMF.
    print("Extracting tf-idf features for NMF...")
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=5, # ngram_range=(1,2), #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)

    # Fit the NMF model; report the number of features actually extracted
    print("Fitting the NMF model with tf-idf features, "
          "n_topics= %d, n_topic_words= %d, n_features= %d..."
          % (n_topics, n_top_words, tfidf.shape[1]))

    # scikit-learn 1.2+ no longer accepts alpha (alpha_W / alpha_H replace it there)
    nmf = NMF(n_components=n_topics, random_state=2, alpha=.1, l1_ratio=.5).fit(tfidf)
    
    print("\nTopics in NMF model:")
    # scikit-learn 1.2+ spells this get_feature_names_out()
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    #return print_top_words(nmf, tfidf_feature_names, n_top_words) 
    return tfidf, nmf

In [None]:
tfidf, nmf_mod_test = nmf_mod(df.lem)
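
To see what the two return values look like without loading the full corpus, here is a minimal sketch on a made-up four-document corpus. The documents, the name `toy_corpus`, and the choice of two topics are illustrative assumptions, not values from the project.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for df.lem: four tiny "documents".
toy_corpus = [
    "tax revenue income tax",
    "income tax statute revenue",
    "jury trial criminal defendant",
    "criminal defendant trial counsel",
]

vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = vectorizer.fit_transform(toy_corpus)   # documents x terms (sparse)

toy_nmf = NMF(n_components=2, random_state=2, max_iter=500).fit(toy_tfidf)

print(toy_tfidf.shape)             # one row per document
print(toy_nmf.components_.shape)   # one row per topic, one column per term
```

The real run has the same layout, only with roughly 23,000 rows and 30 topics.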

## Step 2: Applying the model back to the dataframe

NMF (like other kinds of topic modeling) returns a matrix with one row per document and one column per topic. Unlike LDA, the entries of an NMF matrix are not probabilities of inclusion. NMF factors the tf-idf matrix into the product of two non-negative matrices, and `transform` returns the document-by-topic one, so each entry is a non-negative weight. Don't worry about the linear algebra details; just imagine that for each document we need to find the biggest number in its row and return the index of that.
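
A minimal numeric sketch of that idea, using hand-written matrices rather than fitted ones. The names `W` and `H` and all of their values are made up for illustration: each row of `W` holds one document's topic weights, each row of `H` holds one topic's term weights, and their product approximates the tf-idf matrix.

```python
import numpy as np

# Hypothetical weights: 3 documents x 2 topics.
W = np.array([[0.9, 0.0],
              [0.1, 0.7],
              [0.4, 0.4]])
# Hypothetical weights: 2 topics x 4 terms.
H = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.8, 0.6]])

approx_tfidf = W @ H             # documents x terms reconstruction
best_topic = W.argmax(axis=1)    # column index of the largest weight per row

print(approx_tfidf.shape)
print(best_topic)                # ties go to the first (lowest) topic index
```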

In [None]:
out = nmf_mod_test.transform(tfidf)
out[49]  # verified that each of these is different

![](../images/dataframe-array.png)

**Returning these as a Series** 
It's easy to run the model against a column of the dataframe, return the result as a Series, and append that Series as a new column. Remember not to sort the dataframe in between, because the rows of `out` are matched to the dataframe by position.

In [None]:
import operator
topics = []
for item in out:
    max_index, max_value = max(enumerate(item), key=operator.itemgetter(1))
    topics.append(max_index) 
    
df["topicnumber"] = pd.Series(topics, index=df.index)

In [None]:
topics_likelihood = []
for item in out:
    max_index, max_value = max(enumerate(item), key=operator.itemgetter(1))
    topics_likelihood.append(max_value)
    
df["strengthoftopic"] = pd.Series(topics_likelihood, index=df.index)
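
The two loops above could also be written without Python-level iteration. A minimal sketch, assuming `out` is the 2-D NumPy array that `transform` returns; the toy array, the `toy_df` frame, and its index labels are invented for the example.

```python
import numpy as np
import pandas as pd

toy_out = np.array([[0.02, 0.31, 0.00],
                    [0.44, 0.05, 0.10]])
# A non-default index, to show that rows line up by position, not by label.
toy_df = pd.DataFrame({"lem": ["doc a", "doc b"]}, index=[15000, 15017])

# Assigning a plain array is positional, so row order must match toy_out.
toy_df["topicnumber"] = toy_out.argmax(axis=1)
toy_df["strengthoftopic"] = toy_out.max(axis=1)
print(toy_df)
```

This is why the ordering warning matters: nothing ties a row of the array to a dataframe label except its position.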

In [None]:
df.topicnumber.value_counts() #let's make sure this is a good model...

![](../images/topic-count-output.png)

## Step 3: Creating dictionary of topic components

There's probably an easier way to do this, but I haven't found one. I'm running the model function again (the fixed `random_state` reproduces the earlier results), but this time building a dictionary of each topic's top words to "look up" from my dataframe.

In [None]:
def nmf_topics_dict(corp, n_topics):
    max_df = .80      # drop terms that appear in more than 80% of documents
    n_top_words = 40
    
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=5,# ngram_range=(1,2), #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)
    # scikit-learn 1.2+ no longer accepts alpha (alpha_W / alpha_H replace it there)
    nmf = NMF(n_components=n_topics, random_state=2, alpha=.1, l1_ratio=.5).fit(tfidf)
    # scikit-learn 1.2+ spells this get_feature_names_out()
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()
      
    # map each topic index to its n_top_words highest-weighted terms, strongest first
    topic_dict = {}
    for topic_idx, topic in enumerate(nmf.components_):
        topic_dict[topic_idx] = ", ".join([tfidf_feature_names[i]
                                    for i in topic.argsort()[:-n_top_words - 1:-1]])
    return topic_dict
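
If the fitted model and vectorizer from Step 1 were still in memory, the dictionary could probably be built without refitting. A minimal sketch, assuming a hypothetical helper `top_words_dict` that takes the `components_` array and the list of feature names; the toy arrays are invented, and `nmf_mod` as written does not return its vectorizer, so it would need to be changed to do so.

```python
import numpy as np

def top_words_dict(components, feature_names, n_top_words):
    """Map each topic index to its highest-weighted terms, strongest first."""
    topic_dict = {}
    for topic_idx, weights in enumerate(components):
        # argsort is ascending; reverse it and keep the first n_top_words.
        top = np.argsort(weights)[::-1][:n_top_words]
        topic_dict[topic_idx] = ", ".join(feature_names[i] for i in top)
    return topic_dict

toy_components = np.array([[0.1, 0.9, 0.0, 0.4],
                           [0.7, 0.0, 0.3, 0.2]])
toy_names = ["court", "tax", "jury", "income"]
toy_topic_dict = top_words_dict(toy_components, toy_names, 2)
print(toy_topic_dict)   # {0: 'tax, income', 1: 'court, jury'}
```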

In [None]:
# After testing different topic distributions, 30 was optimal
nmf_words_30 = nmf_topics_dict(df.lem, 30) #dict object

In [None]:
nmf_words_30

![](../images/topic-store.png)

In [None]:
import json
with open('finaliteration_topics.json', 'w') as fp:
    json.dump(nmf_words_30, fp)
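
One thing to keep in mind when reading that file back: JSON object keys are always strings, so the integer topic numbers come back as `'0'`, `'1'`, and so on. A small sketch with an invented two-topic dictionary (`toy_topics` is a stand-in for `nmf_words_30`):

```python
import json

toy_topics = {0: "tax, income", 1: "court, jury"}

encoded = json.dumps(toy_topics)     # integer keys are written as strings
loaded = json.loads(encoded)         # ...and come back as strings
restored = {int(k): v for k, v in loaded.items()}

print(list(loaded))     # ['0', '1']
print(list(restored))   # [0, 1]
```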

## Step 4: Looking up topic words for each item in dataframe

In [None]:
def word_lookup(num):
    return nmf_words_30.get(num)

In [None]:
df["words"] = df.topicnumber.apply(word_lookup)
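
The same lookup can be sketched without a helper function, since `Series.map` accepts a dictionary directly. The frame and dictionary below are toy stand-ins for `df` and `nmf_words_30`:

```python
import pandas as pd

toy_words = {0: "tax, income", 1: "court, jury"}
toy_df = pd.DataFrame({"topicnumber": [1, 0, 1]})

# Series.map accepts a dict; topic numbers missing from it become NaN.
toy_df["words"] = toy_df["topicnumber"].map(toy_words)
print(toy_df)
```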

In [None]:
df.loc[15017,"words"] # This cell and the one below verify that it worked

![](../images/word-lookup.png)

In [None]:
df.loc[14972,"lem"]

![](../images/lem-lookup.png)

In [None]:
df.loc[15017,"case_url"]
# 'http://caselaw.findlaw.com/us-supreme-court/380/145.html'

In [None]:
df.to_pickle("full_project_modelled_final.pickle")

In [None]:
df = pd.read_pickle("full_project_modelled_final.pickle")

## Step 5: Arranging data for visualization

Creating a brushable area chart with D3 requires a datapoint for every topic for every year, so we have to do some pivoting to make that happen.

In [None]:
# some topics were extremely similar and at the suggestion of my instructors,
# for the sake of the visualization, I have condensed the topics to 20

def topic_condenser(topicnum):
    if topicnum == 20:
        return 24
    if topicnum == 25:
        return 1
    if topicnum == 2:
        return 12
    if topicnum == 27:
        return 26
    if topicnum == 18 or topicnum == 5:
        return 29
    if topicnum == 8 or topicnum == 22:
        return 7
    if topicnum == 15:
        return 16
    if topicnum == 9:
        return 14
    if topicnum == 19:
        return 3
    else: 
        return topicnum
df["condensedtopics"] = df.topicnumber.apply(topic_condenser)
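
The same merges can be written as data rather than as a chain of `if` statements, which makes the mapping easier to audit. A minimal sketch; the name `CONDENSE` and the toy Series are assumptions for illustration, while the pairs themselves are copied from `topic_condenser`:

```python
import pandas as pd

# Same merges as topic_condenser: old topic -> topic it is folded into.
CONDENSE = {20: 24, 25: 1, 2: 12, 27: 26, 18: 29, 5: 29,
            8: 7, 22: 7, 15: 16, 9: 14, 19: 3}

toy_topics = pd.Series([20, 5, 18, 4, 22])
# Topics absent from the mapping are returned unchanged.
condensed = toy_topics.map(lambda t: CONDENSE.get(t, t))
print(condensed.tolist())   # [24, 29, 29, 4, 7]
```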

In [None]:
# doing some research on the not so obvious topics
# assumption: df_16 holds the cases assigned to topic 16 (df itself is left untouched)
df_16 = df[df["topicnumber"] == 16]
#df_16.loc[15065, "case_url"]
df_16

![](../images/research.png)

In [None]:
df_details = pd.read_csv("detailsford3.csv", encoding = 'iso-8859-1')
df_details.columns = ["condensedtopics", "topicname", "title", "exampleURL", "leadpp", "topicwords"]
df_details

![](../images/research2.png)

In [None]:
df_with_details = pd.merge(df, df_details, how = "inner", on = "condensedtopics")
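
An inner merge silently drops any case whose condensed topic has no row in the details file. A hedged sketch of one way to check for that, using a left merge with `indicator=True`; the two toy frames and the topic names in them are invented for illustration.

```python
import pandas as pd

toy_data = pd.DataFrame({"condensedtopics": [1, 3, 4], "count": [2, 4, 1]})
# Hypothetical details table that is missing topic 4.
toy_details = pd.DataFrame({"condensedtopics": [1, 3],
                            "topicname": ["Taxation", "Criminal procedure"]})

# indicator=True adds a _merge column recording where each row was found.
checked = pd.merge(toy_data, toy_details, how="left",
                   on="condensedtopics", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "condensedtopics"]
print(missing.tolist())   # [4]
```

An empty list here would mean the inner merge loses nothing.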

In [None]:
#temp_df = df_with_details[['years', 'condensedtopics', "topicname", "title", "exampleURL", "leadpp", "topicwords"]]
#temp_df.to_csv("temp.csv")
temp_df = df_with_details[['years', 'condensedtopics']]
temp_df.condensedtopics.value_counts()

![](../images/condensed-count.png)

In [None]:
#dummy value for each existing topic; assign() returns a new frame, so pandas raises no SettingWithCopyWarning
temp_df = temp_df.assign(count=1)
temp_df

In [None]:
#this collapses the rows for each year/topic pair into a single row holding the number of cases
temp_df = temp_df.groupby(["years", "condensedtopics"]).count().reset_index()
temp_df

![](../images/temp-df.png)

In [None]:
data_fillna = temp_df.pivot_table("count", "years", "condensedtopics").fillna(0).unstack().reset_index()

**A few (really cool) things are happening**

First, we pivot to add zero-valued points for year/topic pairs that have no cases (for example, 1792 has only 1 case, so every other topic needs a point of 0 for that year). The topic numbers become column headers, the NaNs are filled with 0's, and then `unstack()` flattens the table back into one row per topic/year pair before we reset the index.
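
A toy version of that pivot, small enough to follow by eye. The counts are invented, and the sketch passes the arguments by keyword and names the value column during `reset_index`, which differs slightly from the cell above:

```python
import pandas as pd

# Invented counts: 1792 has cases for topic 1 only, 1793 for topics 1 and 3.
toy_counts = pd.DataFrame({
    "years": [1792, 1793, 1793],
    "condensedtopics": [1, 1, 3],
    "count": [1, 2, 4],
})

wide = toy_counts.pivot_table(values="count", index="years",
                              columns="condensedtopics").fillna(0)
# unstack() turns the wide table into one row per (topic, year) pair.
long_form = wide.unstack().reset_index(name="count")
print(long_form)
```

The result has four rows instead of three: the missing 1792/topic 3 pair now exists with a count of 0.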

In [None]:
#we lose the count label column in the previous steps, so we're just renaming it here, and reordering columns based on 
#how they are arranged in the viz csv
data_fillna.columns = ["condensedtopics", "years", "count"]
data_fillna = data_fillna[["years", "condensedtopics", "count"]]

In [None]:
#merge data
final_data = pd.merge(data_fillna, df_details, how = "inner", on = "condensedtopics")
final_data

![](../images/merge-data.png)

In [None]:
#sort by year
final_data.sort_values("years", inplace = True, ascending = True)
final_data

![](../images/sort-data.png)

In [None]:
#backup file
final_data.to_csv("topicsbyyear.csv", index = False)
final_data.to_csv("year_topic_data2.csv", index = False)

In [None]:
# The best part of this viz is the side-to-side brushing effect. For that, we need
# the total number of cases for every year and no other information.

data_fillna.groupby("years")["count"].sum().reset_index().to_csv("year_data.csv", index = False)
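
For reference, a toy run of that yearly total (the frame `toy_long` and its values are invented):

```python
import pandas as pd

toy_long = pd.DataFrame({
    "years": [1792, 1792, 1793, 1793],
    "condensedtopics": [1, 3, 1, 3],
    "count": [1, 0, 2, 4],
})

# One row per year, holding the total number of cases across all topics.
yearly = toy_long.groupby("years")["count"].sum().reset_index()
print(yearly)
```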

# Fin!

![](../images/visualization.png)