# Week 13 Demo

This exercise involves paper data taken from the [HCI Bibliography](http://hcibib.org/); in particular, abstracts for papers at CHI (the human-computer interaction conference).

You can download it from the [course data sets page](https://cs533.ekstrandom.net/f21/resources/data/).

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

## Load Data

In [None]:
papers = pd.read_csv('chi-papers.csv', encoding='utf8')
papers.info()

Some papers are missing an abstract or a title. Let's replace those missing values with empty strings:

In [None]:
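# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore
papers['abstract'] = papers['abstract'].fillna('')
papers['title'] = papers['title'].fillna('')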

For some purposes, we want *all text*.  Let's make a field:

In [None]:
papers['all_text'] = papers['title'] + ' ' + papers['abstract']

## Counting

With the data loaded and cleaned, we can start counting words.

Set up a `CountVectorizer` to tokenize the words and compute counts:

In [None]:
vec = CountVectorizer(encoding='utf8')

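Fit the vectorizer to get a sparse document-term matrix with one row per abstract and one column per word. You can use the `sum` method on a sparse matrix to sum up entries. Summing with `axis=0` adds up each *column* across all rows, which gives the total count of each word. The result is a 1-row matrix rather than a flat array, so convert it with `np.array(...).flatten()`: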

In [None]:
mat = vec.fit_transform(papers['abstract'])
mat

In [None]:
abs_counts = np.array(mat.sum(axis=0)).flatten()

Plot the distribution of the log of word counts.
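One possible way to approach this plot is sketched below. It is a minimal example under stated assumptions: the three-document corpus is made up so the block runs on its own, and in the notebook you would use `abs_counts` in place of `counts`. The choice of base-10 logs and 20 bins is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in corpus; in the notebook, use papers['abstract'] instead
docs = [
    'users click the button',
    'the users read the menu',
    'the menu shows the button',
]
vec = CountVectorizer()
mat = vec.fit_transform(docs)

# Column sums: total occurrences of each word across all documents
counts = np.asarray(mat.sum(axis=0)).ravel()

# Every word in the vocabulary occurs at least once, so the log is well defined
log_counts = np.log10(counts)

# Histogram of the log counts
fig, ax = plt.subplots()
ax.hist(log_counts, bins=20)
ax.set_xlabel('log10(word count)')
ax.set_ylabel('number of words')
```

On real text you should expect a heavily skewed shape: most words are rare, and a few are very common.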

## Classifying

Train a classifier to predict whether a paper is "recent". A paper is recent if its year is >= 2000, and not recent if it was published before 2000.

Use either Naive Bayes or k-NN.
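As a starting point, here is a minimal Naive Bayes sketch. It assumes the publication year lives in a column named `year`, which you should check against `papers.info()`. The four-row frame is made up so the block runs on its own, and the step names `tokenize` and `classify` are arbitrary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-in for the papers frame; the 'year' column name is an assumption
toy = pd.DataFrame({
    'abstract': [
        'command line terminal interface',
        'keyboard terminal command study',
        'touch screen mobile phone',
        'mobile phone social media',
    ],
    'year': [1988, 1992, 2008, 2012],
})

# Binary outcome: True when the paper is from 2000 or later
toy['recent'] = toy['year'] >= 2000

# Word counts feed directly into a multinomial Naive Bayes classifier
clf = Pipeline([
    ('tokenize', CountVectorizer()),
    ('classify', MultinomialNB()),
])
clf.fit(toy['abstract'], toy['recent'])

# Predict for two new made-up abstracts
new_docs = pd.Series(['mobile touch interface', 'terminal command keyboard'])
preds = clf.predict(new_docs)
```

With the real data, hold out a test set (for example with `sklearn.model_selection.train_test_split`) and measure accuracy on it rather than on the training papers.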

## Factorizing

Fit a `TruncatedSVD` with 10 components on the word counts from *all* text (title plus abstract).  This is a "topic model": we're trying to learn latent topics from a corpus of texts.

In [None]:
svd = Pipeline([
    ('tokenize', CountVectorizer()),
    ('svd', TruncatedSVD(10))
])
svd_X = svd.fit_transform(papers['all_text'])

What does the pairplot of these dimensions look like? (The [movie decomposition demo notebook](https://cs533.ekstrandom.net/f21/resources/tutorials/moviedecomp/) is helpful!)

In [None]:
sns.pairplot(pd.DataFrame(svd_X))

What **words** are most strongly aligned with the first 3 dimensions (topics)?  The `vocabulary_` field on a vectorizer contains a dictionary mapping terms to feature indices. You need to invert this mapping (map indices to terms) in order to look up the term for a column of your reduced matrix:

In [None]:
vocab = pd.Series(svd.named_steps['tokenize'].vocabulary_)
vocab

In [None]:
# vocabulary_ maps term -> column index; flip it so the column index is the key
vocab.index.name = 'word'
words = vocab.to_frame(name='index').reset_index().set_index('index').sort_index()
words

Let's make a data frame.  We're going to transpose the components so that rows are words and columns are dimensions, then use the words as the row index.

In [None]:
svd_df = pd.DataFrame(svd.named_steps['svd'].components_.T, index=words['word'])
svd_df

What are the most important words on the first component?

In [None]:
svd_df[0].nlargest(10)

And the second?

In [None]:
svd_df[1].nlargest(10)

And the third?

In [None]:
svd_df[2].nlargest(10)
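SVD loadings are signed, so `nlargest` only surfaces words with large *positive* weights. A word with a large negative weight is just as strongly aligned with the dimension, in the opposite direction. The sketch below ranks by absolute value instead. The loadings are made-up numbers for illustration, and in the notebook you would apply the same idea to a column such as `svd_df[0]`.

```python
import pandas as pd

# Hypothetical loadings for one dimension (not real results)
loadings = pd.Series({'design': 0.6, 'user': 0.3, 'gesture': -0.7, 'menu': 0.1})

# Rank by magnitude, then look up the original signed values
top = loadings.loc[loadings.abs().nlargest(2).index]
```

Here `top` keeps the signs, so you can still tell which end of the dimension each word sits on.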

**Exercise 1:** What happens if you remove stop words before fitting the model?
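One way to try this is the `stop_words='english'` option of `CountVectorizer`, which drops words on scikit-learn's built-in English stop list. The sketch below uses a made-up two-document corpus and only compares vocabularies. In the notebook you would make the same change to the `tokenize` step of the pipeline and refit.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in corpus
docs = ['the users click the button', 'the menu has the button']

# Same corpus, with and without the built-in English stop list
plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words='english').fit(docs)

# Terms that the stop list removed from the vocabulary
removed = set(plain.vocabulary_) - set(no_stop.vocabulary_)
```

The built-in list is generic, so it is worth inspecting `removed` to see whether it drops anything you care about for this corpus.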

**Exercise 2:** What if you use LDA (latent Dirichlet allocation) instead of truncated SVD for the topic model? Do the topics make more sense?
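A minimal sketch of the LDA variant follows, assuming scikit-learn's `LatentDirichletAllocation` as the implementation. The four-document corpus, the choice of 2 topics, and the `random_state` are made up so the block runs quickly on its own. In the notebook you would fit on `papers['all_text']` with 10 topics to match the SVD.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up stand-in corpus
docs = [
    'command line terminal interface',
    'keyboard terminal command study',
    'touch screen mobile phone',
    'mobile phone social media',
]

# Same pipeline shape as the SVD model, with LDA as the second step
lda = Pipeline([
    ('tokenize', CountVectorizer()),
    ('lda', LatentDirichletAllocation(n_components=2, random_state=42)),
])

# Each row is a document's mixture over topics
lda_X = lda.fit_transform(docs)

# Topic-word weights: one row per topic, one column per vocabulary term
topic_words = lda.named_steps['lda'].components_
```

Unlike SVD, the document-topic rows are non-negative and sum to 1, and the topic-word weights are non-negative, so `nlargest` on a topic's column gives its strongest words without any sign ambiguity.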