# Metadata

```yaml
Course:    DS 5001
Module:    M05 Lab
Topic:     Variant TFIDFs and Document Significance
Author:    R.C. Alvarado
Date:      12 February 2023
```

# Exposition

* Three kinds of significance:
  * __Local__: `TF-IDF` (significance of a term in a document; related to $p(w|d,C)$).
  * __Global__: Aggregate `TF-IDF` by term (significance of a term in the corpus; related to $p(w|C)$).
  * __Document__: Aggregate `TF-IDF` by document (significance of a document in the corpus; related to $p(d|W_d,C)$).
* `TF-IDF` is essentially local frequency balanced by global frequency.
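* `DF-IDF` is what you get by summing `TF-IDF` over documents when the counts are boolean.
* `DF-IDF` is global boolean term entropy, up to a constant factor of $N$ (the number of documents).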
* Boolean counts are bad for computing local significance, but good for global.
* Max normalization is good for local significance.
* Doc significance should be computed from good local significance.
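As a minimal sketch of the three kinds of significance listed above, the toy example below computes `TF-IDF` on a small count matrix and then aggregates it by term and by document. The matrix, its terms, and all `toy_*` names are hypothetical and are not part of the lab data. It aggregates with a sum, which is one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy document-term count matrix (3 documents, 3 terms).
toy_tf = pd.DataFrame(
    [[3, 1, 0],
     [2, 0, 0],
     [1, 0, 4]],
    index=['d1', 'd2', 'd3'],
    columns=['the', 'whale', 'ship'],
)

toy_n_docs = len(toy_tf)                  # number of documents
toy_df = toy_tf.astype(bool).sum()        # document frequency of each term
toy_idf = np.log2(toy_n_docs / toy_df)    # inverse document frequency
toy_tfidf = toy_tf * toy_idf              # local significance (term in document)

toy_global = toy_tfidf.sum()              # global significance (aggregate by term)
toy_doc_sig = toy_tfidf.sum(axis=1)       # document significance (aggregate by document)

print(toy_tfidf.round(3))
print(toy_global.round(3))
print(toy_doc_sig.round(3))
```

A term such as `the`, which appears in every document, has an IDF of zero and so contributes nothing at any of the three levels, however often it occurs.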

# Set Up

In [1]:
data_in = '../data/output'
data_out = '../data/output'
data_prefix = 'austen-melville'

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go # For more control on graphs
import re

In [None]:
sns.set()

# Get Data

In [None]:
LIB = pd.read_csv(f"{data_in}/{data_prefix}-LIB.csv").set_index('book_id')
# Split the raw title once, before overwriting it: the text after "by" is the author
title_parts = LIB.title.str.split(r',? by')
LIB['author'] = title_parts.str[-1].str.strip()
LIB['title'] = title_parts.str[0].str.strip()
for idx in [15859, 13720, 53861, 13721]:
    LIB.loc[idx, 'author'] = 'Herman Melville'
LIB = LIB[['title','author']]

In [None]:
TOKEN = pd.read_csv(f'{data_in}/{data_prefix}-TOKEN.csv')
OHCO = TOKEN.columns.to_list()[:5] 
TOKEN = TOKEN.set_index(OHCO)

In [None]:
VOCAB = pd.read_csv(f'{data_in}/{data_prefix}-VOCAB.csv').set_index('term_str')
# VOCAB = VOCAB.drop('term_id', 1) # We will forego using numeric term_ids and just use the term_str
VOCAB['max_pos'] = TOKEN.groupby(['term_str','pos']).pos.count().unstack().idxmax(1) # Most frequent POS tag for each term
VOCAB['pos_group'] = VOCAB.max_pos.str[:2]
VOCAB['term_code'] = VOCAB.apply(lambda x: str(x.name) + '/' + x.max_pos, 1)
VOCAB['term_len'] = VOCAB.index.str.len()

# Recreate BOW

In [None]:
# BAG = OHCO[:1] # Book
BAG = OHCO[:2] # Chapter
# BAG = OHCO[:3] # Paragraph

In [None]:
BOW = TOKEN.groupby(BAG + ['term_str']).term_str.count().to_frame('n')

In [None]:
BOW

## Extract the DOCS table

This is a table of bag-level observations. We'll use this later when exploring document level significance.

In [None]:
DOCS = BOW.groupby(BAG).n.sum().to_frame('n')

In [None]:
DOCS

# Local Significance (TFIDF)

## Traditional

In [None]:
TF = BOW.n.unstack(fill_value=0) # Document-Term Count Matrix
DF = TF.astype('bool').sum() 
N = len(DOCS)
IDF = np.log2(N/DF)      
TFIDF = TF * IDF
TFIDF_agg = TFIDF.sum()

In [None]:
TFIDF_agg.sort_values(ascending=False).head(20)

## Variants

In [None]:
tf_variants = {
    'raw': lambda tf: tf,
    'rel': lambda tf: (tf.T / tf.T.sum()).T,
    'max': lambda tf, alpha=.4: alpha + (1 - alpha) * (tf.T / tf.T.max()).T,
    'log': lambda tf: np.log2(1 + tf),
    'bool': lambda tf: tf.astype('bool').astype('int'),
    # 'sub': lambda tf: 1 + np.log2(tf)
}

In [None]:
tfidf_variants = {k: tf_variants[k](TF) * IDF for k, v in tf_variants.items()}
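To make the variants above concrete, here is a small sketch that applies the same transformations to a hypothetical two-document count matrix. The `toy_*` names and the values are illustrative assumptions. The row-wise operations use `DataFrame.div(..., axis=0)`, which is equivalent to the transpose idiom used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: 2 documents (rows) by 3 terms (columns).
toy_counts = pd.DataFrame(
    [[4, 2, 0],
     [1, 0, 1]],
    index=['d1', 'd2'],
    columns=['a', 'b', 'c'],
)

# Relative frequency: each row divided by its row total.
toy_rel = toy_counts.div(toy_counts.sum(axis=1), axis=0)

# Max normalization: each row divided by its row maximum, smoothed by alpha.
alpha = 0.4
toy_max = alpha + (1 - alpha) * toy_counts.div(toy_counts.max(axis=1), axis=0)

# Log scaling and boolean presence.
toy_log = np.log2(1 + toy_counts)
toy_bool = toy_counts.astype(bool).astype(int)

print(toy_rel.round(3))
print(toy_max.round(3))
print(toy_log.round(3))
print(toy_bool)
```

One property worth noticing: with this max formula, a term that is absent from a document still receives the value `alpha` rather than zero, so every term gets a nonzero weight in every document before the IDF is applied.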

# Global Significance

Global significance in this context means aggregate significance for the whole corpus.

We sum each TFIDF variant over documents to get one score per term.

In [None]:
tfidf_sums = {k: v.sum().to_frame('sum_val') for k, v in tfidf_variants.items()}

In [None]:
SUMS = pd.concat([v.sort_values('sum_val', ascending=False).head(20).reset_index() 
    for v in tfidf_sums.values()], keys=tfidf_sums.keys(), axis=1)

In [None]:
SUMS.style.background_gradient('YlGnBu')

We compare sums to means.

In [None]:
tfidf_means = {k: v.mean().to_frame('mean_val') for k, v in tfidf_variants.items()}

In [None]:
MEANS = pd.concat([v.sort_values('mean_val', ascending=False).head(20).reset_index() 
        for v in tfidf_means.values()], keys=tfidf_means.keys(), axis=1)

In [None]:
MEANS.style.background_gradient('YlGnBu')

We combine the two tables and compare them.

In [None]:
AGGS = pd.concat([SUMS, MEANS], axis=1, ignore_index=True) # Combine the two tables side by side
AGGS.columns = ['_'.join(idx) + '_x' for idx in SUMS.columns] + ['_'.join(idx) + '_y' for idx in MEANS.columns] # Flatten the MultiIndex
AGGS = AGGS[[col for col in AGGS.columns if 'term_str' not in col]] # Drop the string columns before doing arithmetic
AGGS = (AGGS - AGGS.mean()) / AGGS.std() # Normalize each column as Z-scores
AGGS = (AGGS * 100).astype('int') # Scale and truncate so the numbers stay compact in the heatmap

In [None]:
AGGS.T.style.background_gradient(cmap='RdYlBu', axis=None)

We can see that sorting terms by mean gives the same order as sorting by sum. This is expected: every term is averaged over the same number of documents, so the mean is just the sum divided by `N`.

We can also see that boolean counting produces a different distribution than the others, which are similar to each other.
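The equivalence of sum and mean noted above can be checked on a toy matrix. This is a sketch with hypothetical random data, not the lab's TFIDF tables. It also shows that the Z-scores of the two aggregates coincide, which is why the corresponding rows of the heatmap look identical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy score matrix: 5 documents (rows) by 4 terms (columns).
rng = np.random.default_rng(0)
toy_scores = pd.DataFrame(rng.random((5, 4)), columns=list('wxyz'))

toy_sum = toy_scores.sum()
toy_mean = toy_scores.mean()   # the same as toy_sum / 5 for every term

# Term order from most to least significant under each aggregate.
sum_order = toy_sum.sort_values(ascending=False).index.tolist()
mean_order = toy_mean.sort_values(ascending=False).index.tolist()

def zscore(s):
    """Standardize a Series to zero mean and unit standard deviation."""
    return (s - s.mean()) / s.std()

same_order = sum_order == mean_order
same_z = np.allclose(zscore(toy_sum), zscore(toy_mean))
print(same_order, same_z)
```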

## DFIDF as DH

Let's compute the global entropy of terms in the corpus and use that as another measure of global significance.

It turns out that this measure ranks terms exactly as the sum of TFIDF over a boolean count matrix does. That sum equals `DF * IDF`, and `DH` is the same quantity divided by `N`.

In [None]:
DFIDF = (DF * IDF).to_frame('val')
DP = DF / N
DI = np.log2(1/DP)
DH = (DP * DI).to_frame('val')

In [None]:
pd.concat([
        DFIDF.sort_values('val', ascending=False).head(20).reset_index(),
        DH.sort_values('val', ascending=False).head(20).reset_index(), 
        tfidf_sums['bool'].sort_values('sum_val', ascending=False).head(20).reset_index()
    ], keys=['dfidf', 'dh', 'bool'], axis=1)\
    .style.background_gradient('YlGnBu')
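The relationship among the three columns above can be verified on a small example. The sketch below uses a hypothetical four-document count matrix and illustrative variable names. It checks that the boolean TFIDF sum equals `DF * IDF`, and that the entropy-style measure is that same quantity divided by the number of documents.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: 4 documents (rows) by 3 terms (columns).
toy_counts = pd.DataFrame(
    [[3, 1, 0],
     [2, 0, 0],
     [1, 0, 4],
     [0, 2, 1]],
    columns=['a', 'b', 'c'],
)

n_docs = len(toy_counts)
toy_bool = toy_counts.astype(bool).astype(int)    # boolean term presence
doc_freq = toy_bool.sum()                         # DF
idf = np.log2(n_docs / doc_freq)                  # IDF

bool_tfidf_sum = (toy_bool * idf).sum()           # TFIDF summed over a boolean matrix
dfidf = doc_freq * idf                            # DF * IDF
doc_prob = doc_freq / n_docs                      # share of documents containing the term
dh = doc_prob * np.log2(1 / doc_prob)             # entropy-style term p * log2(1/p)

print(pd.DataFrame({'bool_sum': bool_tfidf_sum, 'dfidf': dfidf, 'dh': dh}).round(3))
```

Because `dh` differs from `dfidf` only by the constant `n_docs`, the two always sort terms in the same order.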

# Document Significance

In [None]:
for k, v in tfidf_variants.items():
    DOCS[f'doc_sig_{k}'] = v.T.mean()

In [None]:
DOCS

We compute TFIDF with the book, rather than the whole corpus, as the context.

In [None]:
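def get_chap_sigs(bow):
    """Return one significance score per chapter, treating a single book as the corpus."""
    X = bow.unstack(fill_value=0).astype('bool') # Boolean chapter-term matrix
    df = X.sum() # Number of chapters containing each term
    tf = (X.T / X.T.sum()).T # Boolean counts normalized by the number of distinct terms in the chapter
    idf = np.log2(len(tf)/df) # IDF relative to the chapters of this book only
    tfidf = tf * idf
    ds = tfidf.T.sum() # Aggregate by chapter
    return ds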

In [None]:
DOCS['book_chap_sig'] = DOCS.groupby('book_id').apply(lambda x: get_chap_sigs(BOW.loc[x.name]))
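To illustrate what "book as context" means in the cells above, here is a self-contained sketch on a hypothetical two-book table. The data, the helper `toy_chap_sigs`, and the other `toy_*` names are assumptions made for illustration. The helper follows the same steps as `get_chap_sigs`, but it is applied with an explicit loop over books rather than `groupby().apply()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bag-of-words: two books, with chapters as bags.
toy_bow = pd.DataFrame(
    {
        'book_id':  [1, 1, 1, 1, 2, 2, 2],
        'chap_num': [1, 1, 2, 2, 1, 2, 2],
        'term_str': ['sea', 'ship', 'sea', 'whale', 'ball', 'ball', 'dance'],
        'n':        [2, 1, 1, 3, 1, 2, 1],
    }
).set_index(['book_id', 'chap_num', 'term_str'])

def toy_chap_sigs(counts):
    """Sum of boolean, length-normalized TF-IDF for each chapter of one book."""
    present = counts.unstack(fill_value=0).astype(bool).astype(int)
    doc_freq = present.sum()                         # chapters containing each term
    tf = present.div(present.sum(axis=1), axis=0)    # normalize by distinct terms per chapter
    idf = np.log2(len(present) / doc_freq)           # N is the chapter count of this book
    return (tf * idf).sum(axis=1)

# Compute chapter scores separately within each book, then stack the results.
toy_sigs = pd.concat(
    {book_id: toy_chap_sigs(group.droplevel('book_id'))
     for book_id, group in toy_bow.n.groupby(level='book_id')}
)
toy_sigs.index.names = ['book_id', 'chap_num']
print(toy_sigs)
```

Because the IDF is recomputed inside each book, a term that appears in every chapter of one book scores zero there, even if it is rare in the corpus as a whole.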

See how length and significance are related.

In [None]:
DOCS.loc[105].plot.scatter('n', 'book_chap_sig');

In [None]:
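def plot_sig_docs2(book_id):
    """Plot chapter significance for one book, with marker size scaled by chapter length."""
    D = DOCS.loc[book_id]
    title = LIB.loc[book_id].title
    point_size = (D.n / D.n.sum()) * 700 # Each chapter's share of the book's tokens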
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=D.index, y=D.book_chap_sig, 
                             text=D.index, 
                             mode = 'lines+markers+text',
                             marker = dict(size=point_size, color='#BBB'),
                             line = dict(color='#DADADA'),
                             textfont = dict(color="black")
                            ))
                  
    fig.update_layout(
        font = dict(color="#000", size=14),
        title=title,
        xaxis_title="Chapter",
        yaxis_title="Significance",
        height=800
    )
    fig.show()

In [None]:
plot_sig_docs2(105)

> **Chapter 12 signals a climax in the novel's narrative.** Persuasion is a linear narrative that is organized chronologically. The original edition of this novel was published in two volumes, **the first volume ending at the close of Chapter 12**. Louisa's fall is the greatest dramatic occurrence which has happened so far. By inserting the fall here, Austen creates a cliffhanger and encourages her readers to buy the second volume of her novel. In these chapters, the reader is shown the negative effects of what can happen when one is too stubborn. Louisa would not be persuaded to keep from jumping off the wall. Her firmness of mind means serious injury for her and significant guilt for Captain Wentworth. He is encouraged to rethink his initial judgment of the benefit of a "strong character." [Sparknotes](https://www.sparknotes.com/lit/persuasion/section6/page/2/)

In [None]:
plot_sig_docs2(1342)

In [None]:
plot_sig_docs2(2701)