## Natural Language Visualization With Scattertext
## Jason S. Kessler @jasonkessler
### Global AI Conference 2018, Seattle, WA. April 27, 2018.

The GitHub repository for this talk is at [https://github.com/JasonKessler/GlobalAI2018](https://github.com/JasonKessler/GlobalAI2018).

Visualizations were made using [Scattertext](https://github.com/JasonKessler/scattertext).

Please cite as: 
Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.

In [2]:
import pandas as pd
import numpy as np
import scattertext as st
import spacy
from IPython.display import IFrame, display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
import matplotlib.pyplot as plt
%matplotlib inline 

In [3]:
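# Compare version parts numerically; a plain string comparison would rank '0.0.2.9' above '0.0.2.25'
assert tuple(int(part) for part in st.__version__.split('.')) >= (0, 0, 2, 25)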

### The data
 
The dataset consists of movie reviews and plot descriptions. Each plot description is guaranteed to come from a movie that was reviewed.

The dataset is from http://www.cs.cornell.edu/people/pabo/movie-review-data/

References:
* Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of EMNLP 2002.

* Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL 2004.

In [4]:
rdf = st.SampleCorpora.RottenTomatoes.get_data()
rdf['category_name'] = rdf['category'].apply(lambda x: {'plot': 'Plot', 'rotten': 'Negative', 'fresh': 'Positive'}[x])
print(rdf.category_name.value_counts())
rdf[['text', 'movie_name', 'category_name']].head()

Positive    2455
Negative    2411
Plot         156
Name: category_name, dtype: int64


| | text | movie_name | category_name |
|---|---|---|---|
| 0 | A senior at an elite college (Katie Holmes), a... | abandon | Plot |
| 1 | Will Lightman is a hip Londoner who one day re... | about_a_boy | Plot |
| 2 | Warren Schmidt (Nicholson) is forced to deal w... | about_schmidt | Plot |
| 3 | An account of screenwriter Charlie Kaufman's (... | adaptation | Plot |
| 4 | Ali G unwittingly becomes a pawn in the evil C... | ali_g_indahouse | Plot |


In [5]:
corpus = (st.CorpusFromPandas(rdf, 
                              category_col='category_name', 
                              text_col='text',
                              nlp = st.whitespace_nlp_with_sentences)
          .build())
corpus.get_term_freq_df().to_csv('term_freqs.csv')
unigram_corpus = corpus.get_unigram_corpus()

### Let's visualize the corpus using Scattertext

The x-axis indicates the rank of a word or bigram among the positive reviews, and the y-axis indicates its rank among the negative reviews.

Ranks are determined using "dense" ranking: the most frequent terms all receive rank 1, however many are tied; the next most frequent terms all receive rank 2; and so on.
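
As a small illustration of dense ranking, the sketch below ranks a handful of made-up term counts with pandas. The terms and counts are hypothetical and are not drawn from the corpus; following the description above, rank 1 goes to the most frequent terms.

```python
import pandas as pd

# Hypothetical term counts, for illustration only
counts = pd.Series({'the': 50, 'movie': 12, 'film': 12,
                    'dazzling': 3, 'tedious': 3, 'zany': 1})

# Dense ranking: tied counts share a rank, and no rank is skipped after a tie
dense_rank = counts.rank(method='dense', ascending=False).astype(int)
print(dense_rank.to_dict())
```

Six terms collapse onto four distinct ranks here, which is why many points can land on the same coordinate in the plot.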

Terms more associated with a class appear to lie farther from the diagonal line between the lower-left and upper-right corners. Terms are colored according to this distance. We'll return to this in a bit.

Scattertext selectively labels points so that labels do not overlap other elements of the graph. Mouse over points and term labels for a preview, and click for a keyword-in-context view.

References:
* Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.

In [5]:
html = st.produce_scattertext_explorer(
    corpus,
    category='Positive',
    not_categories=['Negative'],
    sort_by_dist=False,
    metadata=rdf['movie_name'],
    term_scorer=st.RankDifference(),
    transform=st.Scalers.percentile_dense
)
file_name = 'rotten_fresh_stdense.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1300, height=700)

### We can see more terms by breaking ties in ranking alphabetically.
Lower frequency terms are more prominent in this view, and more terms can be labeled.

In [6]:
html = st.produce_scattertext_explorer(
    corpus,
    category='Positive',
    not_categories=['Negative'],
    sort_by_dist=False,
    metadata=rdf['movie_name'],
    term_scorer=st.RankDifference(),
)
file_name = 'rotten_fresh_st.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1300, height=700)
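
To see why alphabetical tie-breaking spreads points out, the sketch below contrasts dense ranks with ranks whose ties are broken alphabetically, using hypothetical counts. It relies on plain pandas and not on Scattertext's own scaler, so treat it as an approximation of the idea, not the library's implementation.

```python
import pandas as pd

# Hypothetical term counts, for illustration only
counts = pd.Series({'zany': 1, 'witty': 1, 'awful': 1,
                    'movie': 12, 'film': 12, 'the': 50})

# Dense ranking collapses tied terms onto a single position
dense_rank = counts.rank(method='dense', ascending=False).astype(int)

# Alphabetical tie-breaking: sort by descending count, then by term
ordered = (counts.rename_axis('term').reset_index(name='count')
           .sort_values(['count', 'term'], ascending=[False, True]))
ordered['alpha_rank'] = range(1, len(ordered) + 1)
alpha_rank = ordered.set_index('term')['alpha_rank']

print(dense_rank.nunique(), alpha_rank.nunique())
```

With ties broken, every term gets its own position, so low-frequency terms that share a count no longer sit on top of one another.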

## Scaled F-Score
### Associated terms have a *relatively* high category-specific precision and category-specific term frequency (i.e., the % of term occurrences in the category that are the term)
### Take the harmonic mean of precision and frequency (both have to be high)
### We will make two adjustments to this method in order to come up with the final formulation of Scaled F-Score

Given a word $w_i \in W$ and a category $c_j \in C$, define the precision of the word $w_i$ with respect to a category as:
$$ \mbox{prec}(i,j) = \frac{\#(w_i, c_j)}{\sum_{c \in C} \#(w_i, c)}. $$

The function $\#(w_i, c_j)$ represents either the number of times $w_i$ occurs in a document labeled with the category $c_j$ or the number of documents labeled $c_j$ which contain $w_i$.

Similarly, define the frequency with which a word occurs in the category as:

$$ \mbox{freq}(i, j) = \frac{\#(w_i, c_j)}{\sum_{w \in W} \#(w, c_j)}. $$
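
As a worked example of the two definitions above, the sketch below computes prec and freq for the "Positive" category from a tiny table of made-up counts. The terms, counts, and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical counts #(w_i, c_j): rows are terms, columns are categories
counts = pd.DataFrame({'Positive': [6, 3, 1], 'Negative': [2, 3, 5]},
                      index=['moving', 'film', 'dull'])

# prec(i, j): a term's count in the category over its count across all categories
prec = counts['Positive'] / counts.sum(axis=1)

# freq(i, j): a term's count in the category over all term counts in that category
freq = counts['Positive'] / counts['Positive'].sum()

print(pd.DataFrame({'prec': prec, 'freq': freq}))
```

Note that freq sums to 1 over the vocabulary, whereas prec sums to 1 over the categories for each term.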

The harmonic mean of these two values is defined as:

$$ \mathcal{H}_\beta(i,j) = (1 + \beta^2) \frac{\mbox{prec}(i,j) \cdot \mbox{freq}(i,j)}{\beta^2 \cdot \mbox{prec}(i,j) + \mbox{freq}(i,j)}. $$

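$\beta \in \mathbb{R}^+$ is a scaling factor. With the formula as written above, precision is favored if $\beta < 1$, frequency if $\beta > 1$, and both are equally weighted if $\beta = 1$: as $\beta \to 0$ the expression tends to $\mbox{prec}(i,j)$, and as $\beta \to \infty$ it tends to $\mbox{freq}(i,j)$. F-Score is equivalent to the harmonic mean where $\beta = 1$.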
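
To make the role of $\beta$ concrete, the sketch below evaluates the formula for $\mathcal{H}_\beta$ exactly as written above at a few values of $\beta$. The function name `weighted_hmean` and the input values are hypothetical, and returning 0 when either input is 0 is an assumption that mirrors the guard used in the notebook code.

```python
def weighted_hmean(prec, freq, beta=1.0):
    """H_beta as written in the formula above; 0 if either input is 0."""
    if prec == 0 or freq == 0:
        return 0.0
    return (1 + beta ** 2) * prec * freq / (beta ** 2 * prec + freq)

# Hypothetical inputs: high precision, low frequency
prec, freq = 0.9, 0.1
for beta in (0.01, 1.0, 100.0):
    # A very small beta pulls the result toward prec,
    # a very large beta pulls it toward freq
    print(beta, weighted_hmean(prec, freq, beta))
```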

In [6]:
from scipy.stats import hmean

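term_freq_df = corpus.get_unigram_corpus().get_term_freq_df()[['Positive freq', 'Negative freq']]
# Drop terms that occur only in plot descriptions; .copy() avoids a SettingWithCopyWarning below
term_freq_df = term_freq_df[term_freq_df.sum(axis=1) > 0].copy()

# prec(i, j): share of a term's occurrences that fall in positive reviews
term_freq_df['pos_precision'] = (term_freq_df['Positive freq'] * 1./
                                 (term_freq_df['Positive freq'] + term_freq_df['Negative freq']))

# freq(i, j): share of all term occurrences in positive reviews that are this term
term_freq_df['pos_freq_pct'] = (term_freq_df['Positive freq'] * 1.
                                /term_freq_df['Positive freq'].sum())

# Harmonic mean of the two values, set to 0 when either value is 0
term_freq_df['pos_hmean'] = (term_freq_df
                             .apply(lambda x: (hmean([x['pos_precision'], x['pos_freq_pct']])
                                               if x['pos_precision'] > 0 and x['pos_freq_pct'] > 0 
                                               else 0), axis=1))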
term_freq_df.sort_values(by='pos_hmean', ascending=False).iloc[:10]

| term | Positive freq | Negative freq | pos_precision | pos_freq_pct | pos_hmean |
|---|---|---|---|---|---|
| the | 2346 | 2288 | 0.506258 | 0.048037 | 0.087748 |
| a | 1775 | 1613 | 0.523908 | 0.036345 | 0.067975 |
| and | 1637 | 1179 | 0.581321 | 0.03352 | 0.063385 |
| of | 1480 | 1235 | 0.54512 | 0.030305 | 0.057418 |
| to | 942 | 1010 | 0.482582 | 0.019289 | 0.037095 |
| it | 826 | 801 | 0.507683 | 0.016913 | 0.032736 |
| is | 818 | 726 | 0.529793 | 0.01675 | 0.032473 |
| s | 808 | 749 | 0.518947 | 0.016545 | 0.032067 |
| in | 676 | 622 | 0.520801 | 0.013842 | 0.026967 |
| that | 617 | 602 | 0.506153 | 0.012634 | 0.024652 |


In [7]:
term_freq_df.pos_freq_pct.describe()

count    12032.000000
mean         0.000083
std          0.000826
min          0.000000
25%          0.000000
50%          0.000020
75%          0.000041
max          0.048037
Name: pos_freq_pct, dtype: float64

In [8]:
term_freq_df.pos_precision.describe()

count    12032.000000
mean         0.506651
std          0.418623
min          0.000000
25%          0.000000
50%          0.500000
75%          1.000000
max          1.000000
Name: pos_precision, dtype: float64
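
The two summaries above put the inputs on very different scales: the median precision is 0.5, while the median frequency is about 0.00002. A quick check with those two medians (a sketch, not part of the original analysis) shows that the harmonic mean is then determined almost entirely by the smaller value, which is consistent with the most frequent function words topping the ranking.

```python
from scipy.stats import hmean

median_precision = 0.5       # 50% row of pos_precision.describe()
median_freq_pct = 0.000020   # 50% row of pos_freq_pct.describe()

h = hmean([median_precision, median_freq_pct])

# The harmonic mean of two positive numbers is at most twice the smaller one
print(h, 2 * median_freq_pct)
```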


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from scipy.stats import norm

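# `font` is not defined in the cells shown; this value is an assumed placeholder
font = {'size': 12}
matplotlib.rc('font', **font)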

fig, ax = plt.subplots(figsize=(10,6))
log_freqs = np.log(term_freq_df.pos_freq_pct[term_freq_df.pos_freq_pct > 0])

# sns.distplot is deprecated; with the histogram and KDE turned off it only drew a rug plot
sns.rugplot(log_freqs[:1000], color='k', ax=ax)

x = np.linspace(log_freqs.min(), 
                log_freqs.max(), 
                100)
y = norm(log_freqs.mean(), log_freqs.std()).pdf(x)
plt.plot(x, y ,color='k')
word_freq = np.log(term_freq_df.pos_freq_pct[term_freq_df.pos_freq_pct > 0].loc['bad'])
plt.axvline(x=word_freq, color='red', label='Log frequency of "bad"')
plt.fill_between(x[x < word_freq], 
                 y[x < word_freq], y[x < word_freq] * 0, 
                 facecolor='blue', 
                 alpha=0.5,
                 label='''Log-normal CDF of bad's frequency''')
ax.set_xlabel('Log term frequency')
ax.set_ylabel('Cumulative term probability')
plt.legend()
for item in ([ax.title,  ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels() ):
    item.set_fontsize(12)
plt.rc('legend', fontsize=15)     
plt.show()
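
The shaded area in the plot can also be computed directly. The sketch below fits a normal distribution to the log of some made-up term frequencies and evaluates its CDF at each term. The frequencies and term names are hypothetical, and this illustrates the plotted quantity; it is not Scattertext's internal code.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical relative frequencies, for illustration only
freq_pct = pd.Series({'the': 0.05, 'film': 0.01, 'bad': 0.002,
                      'dazzling': 0.0004, 'zany': 0.0001})

log_freqs = np.log(freq_pct)

# Normal CDF of each log frequency under a normal fit to all log frequencies:
# the share of the fitted distribution lying to the left of the term
freq_cdf = pd.Series(norm.cdf(log_freqs, log_freqs.mean(), log_freqs.std()),
                     index=freq_pct.index)
print(freq_cdf.round(3))
```

Unlike the raw frequencies, these values are spread across the interval between 0 and 1, which puts them on a scale comparable to precision.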