## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban vs. rural environments.

## Set up

In [1]:
%%capture
!pip install scattertext

In [2]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path for the HTML display class

In [4]:
# load the Dickens data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [5]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
18833,dickens,bleak,"As we advanced, I began to feel misgivings tha...",misgiving companion confidence roadside people...,same long weary other other reassuring perplexed,advance begin feel lose look sit see go overhe...
2785,dickens,expectations,“Stay a bit. I know what you’re a-going to say...,bit going bit sister back fall time sister ram...,heavy such candour,stay know ’re say stay deny come deny throw dr...
14978,dickens,bleak,"“But it’s the inside of the man, the warm hear...",inside man heart man passion man blood man vis...,warm fresh little interested superlative more ...,’ speak pursue sound suppose say believe tell ...
10104,dickens,copperfield,"There are many faces that I know, among the li...",face crowd church mother village bloom grief b...,many little first youthful,know face know wonder face see come mind mind ...
17620,dickens,bleak,"“Lady Dedlock, I have not yet been able to com...",decision course meantime secret,able satisfactory clear,come act request keep keep wonder keep


In [6]:
# load the Eliot data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [7]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
11763,eliot,bede,"“Remember,” Mr. Irwine went on, “there are oth...",other friend stroke strength mind sense duty m...,good,remember go think act fall bear think expect t...
3689,eliot,middlemarch,"Celia’s rare tears had got into her eyes, and ...",tear eye corner mouth,rare,get agitate
7774,eliot,deronda,“I pity the man who can travel from Dan to...,man barren world fruit _,sentimental,pity travel say ti cultivate offer
16396,eliot,felix,"This speech, in its chief points, had been del...",speech point face flint gentry duty her tone c...,chief defiant defensive due powerful strong ow...,prepare set make know know brave dare stand wa...
6197,eliot,mill,"“No compliment can be eloquent, except as an e...",compliment expression indifference,eloquent little,say flush


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [8]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)
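
Note that `DataFrame.sample` draws a different random sample each time the cell runs, so the resulting plot will vary slightly between runs. If you want repeatable results, one option is to pass a fixed `random_state`. The sketch below shows the effect on a small made-up `DataFrame`; `toy_df` and the seed value `42` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for dickens_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "author": ["dickens"] * 6,
    "nouns": ["fog river", "court law", "street lamp",
              "coach door", "clerk office", "child home"],
})

# With the same seed, sample() picks the same rows on every run.
sample_a = toy_df.sample(3, random_state=42)
sample_b = toy_df.sample(3, random_state=42)

same_rows = sample_a.index.equals(sample_b.index)
print(same_rows)
```

In the notebook itself this would amount to something like `dickens_df.sample(10_000, random_state=42)`.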

In [9]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [10]:
# drop all columns except 'author' and 'nouns'
nouns_df = df[['author', 'nouns']]

In [11]:
print(nouns_df.shape)
nouns_df.sample(10)

(20000, 2)


Unnamed: 0,author,nouns
10913,dickens,suffolk
22559,dickens,indignation solicitor friend door coach purpose
13647,dickens,mother bite handkerchief hand gun
13858,dickens,people wind coast way house one knocking way l...
15788,dickens,office row law stationer name thing
17503,eliot,thing
8326,dickens,smoke night father
10127,eliot,intensity lightning course frame voice soul so...
15221,eliot,company hear command expectation utterance sen...
11331,eliot,relief knowledge past damage evening rencontre...


## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). We first create a `Scattertext` corpus, then transform that corpus into an HTML-based visualization, and finally display that visualization within our notebook. Note: you can also save the visualization as an HTML file.

In [12]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [13]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [14]:
# display visualization in notebook
HTML(html)

In [15]:
# Note: You can save this visualization as an html file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)
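
To build some intuition for what the plot compares, it can help to count by hand how often a term appears in each author's text. The sketch below does this for a few made-up noun strings using only the standard library. It illustrates the general idea of per-category term frequencies; it is not `Scattertext`'s actual positioning or scoring method, and `docs`, `counts`, and `rate_per_thousand` are hypothetical names introduced here.

```python
from collections import Counter

# Made-up noun strings standing in for rows of nouns_df.
docs = [
    ("dickens", "fog street court street"),
    ("dickens", "clerk street law"),
    ("eliot", "field cottage village"),
    ("eliot", "village church street"),
]

# Count how often each term occurs within each category (author).
counts = {"dickens": Counter(), "eliot": Counter()}
for author, nouns in docs:
    counts[author].update(nouns.split())


def rate_per_thousand(term, author):
    """Occurrences of `term` per 1,000 tokens attributed to `author`."""
    total = sum(counts[author].values())
    return 1000 * counts[author][term] / total


# A term used much more by one author than the other is "distinguishing".
for term in ["street", "village"]:
    print(term,
          round(rate_per_thousand(term, "dickens")),
          round(rate_per_thousand(term, "eliot")))
```

In this toy data, "street" is relatively more frequent for Dickens and "village" for Eliot, which is the kind of contrast the interactive plot lets you explore at scale.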

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, make a comparison on the adjectives used by these authors using `Scattertext`.

In [None]:
# same workflow as above, except with the 'adjectives' column

In [16]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [17]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [18]:
# drop all columns except 'author' and 'adjectives'
adjectives_df = df[['author', 'adjectives']]

In [19]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adjectives_df, category_col='author', text_col='adjectives').build()

In [20]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [21]:
# display visualization in notebook
HTML(html)

In [22]:
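# save the adjectives visualization under its own name so it does not overwrite 'example.html'
file_name_2 = 'example_adjectives.html'
with open(file_name_2, encoding='utf8', mode='w') as f:
    f.write(html)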
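
The nouns and adjectives comparisons repeat the same three pre-processing steps (sample, combine, select columns). If you plan to try other columns, such as `verbs`, one possible refactoring is a small helper function. The sketch below is only a suggestion: `build_pos_frame`, its parameters, and the toy frames are hypothetical names introduced here, and it assumes each input frame has an `author` column.

```python
import pandas as pd


def build_pos_frame(frames, column, n=None, seed=0):
    """Sample each frame, stack them, and keep only 'author' and `column`."""
    # Sample n rows per frame (or keep every row when n is None).
    sampled = [f if n is None else f.sample(n, random_state=seed) for f in frames]
    combined = pd.concat(sampled, ignore_index=True)
    return combined[["author", column]]


# Made-up rows standing in for dickens_df and eliot_df.
toy_dickens = pd.DataFrame({
    "author": ["dickens"] * 3,
    "nouns": ["fog street", "court law", "clerk office"],
    "adjectives": ["long weary", "heavy", "warm fresh"],
})
toy_eliot = pd.DataFrame({
    "author": ["eliot"] * 3,
    "nouns": ["field cottage", "village church", "tear eye"],
    "adjectives": ["rare", "good", "eloquent little"],
})

adjectives_toy = build_pos_frame([toy_dickens, toy_eliot], "adjectives", n=2)
print(adjectives_toy.shape)
```

With the real data, the call would look something like `build_pos_frame([dickens_df, eliot_df], 'adjectives', n=10_000)`, and the result could be passed to `st.CorpusFromPandas` as in the cells above.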