## Scatter plot with `Scattertext`
`Scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban vs. rural environments.

## Set up

In [1]:
%%capture
!pip install scattertext

In [2]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this

In [3]:
# load the Dickens data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [4]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


| | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 14498 | dickens | bleak | After a little while he opened his outer wrapp... | wrapper coach arm pocket side | little outer large whole deep | open appear wrap put |
| 23654 | dickens | pickwick | Oh, certainly, ma'am,' replied Mrs. Rogers; af... | ma'am lady | other | reply respond |
| 23608 | dickens | pickwick | They found Mr. Pickwick, in company with Jingl... | company look group racket ground group looking... | motley worth idle | find talk bestow congregate |
| 17178 | dickens | bleak | “Diminutive,” whispered Miss Flite, making a v... | variety motion forehead intellect love counsel | own sagacious dear clear | whisper make express ’ hear |
| 11071 | dickens | copperfield | The little picture was so instantaneously diss... | picture midst family face face hand | little astonished | dissolve go doubt hold shout |


In [5]:
# load the Eliot data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [6]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


| | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 13024 | eliot | romola | Nello whispered in the ear of Sandro, who roll... | ear eye sign smile heel rapidity | solemn slow surprising | whisper roll nod follow understand take |
| 3689 | eliot | middlemarch | Celia’s rare tears had got into her eyes, and ... | tear eye corner mouth | rare | get agitate |
| 9625 | eliot | deronda | “But there is some other fear on your mind,” s... | fear mind fear peace evil child anxiety defens... | other dear good more anxious | say disturb forecast guard turn ’ |
| 9323 | eliot | deronda | He was received with the usual friendliness, s... | friendliness costume woman child elder air bou... | usual additional slight proud own | receive wonder allow pass think lay enclose sa... |
| 16253 | eliot | felix | The active Harold had almost always something ... | way time ground snow horseback indoor billiard... | active definite fine slippery most | propose fill walk see melt get learn ride stai... |


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [7]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [8]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [9]:
# drop all columns except 'author' and 'nouns'
nouns_df = df[['author', 'nouns']]
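
Before building the corpus, it can help to confirm that the combined frame is balanced between the two authors and has no missing values in the text column. The sketch below shows one way to check this on a tiny made-up `DataFrame`; the rows, the `toy_` names, and the `random_state` seed are assumptions for illustration and are not part of the dataset above.

```python
import pandas as pd

# Toy stand-ins for dickens_df and eliot_df (illustrative rows only)
toy_dickens = pd.DataFrame({
    'author': ['dickens'] * 4,
    'nouns': ['coach arm pocket', 'lady', None, 'picture family hand'],
})
toy_eliot = pd.DataFrame({
    'author': ['eliot'] * 4,
    'nouns': ['ear eye smile', 'tear eye mouth', 'fear mind peace', 'way time snow'],
})

# random_state makes the sample reproducible; 42 is an arbitrary seed
toy_df = pd.concat(
    [toy_dickens.sample(3, random_state=42), toy_eliot.sample(3, random_state=42)],
    ignore_index=True,
)

# keep only the two columns of interest and drop passages with no nouns
toy_nouns_df = toy_df[['author', 'nouns']].dropna(subset=['nouns'])

# count how many passages per author remain
author_counts = toy_nouns_df['author'].value_counts()
print(author_counts)
```

On the real data, the same two checks would be `nouns_df['author'].value_counts()` and `nouns_df['nouns'].isna().sum()`.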

## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). We first create a `Scattertext` corpus, then transform that corpus into an HTML-based visualization, and finally display that visualization within our notebook. Note: you can also save the visualization as an HTML file.

In [10]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [11]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [12]:
# display visualization in notebook
HTML(html)

In [13]:
# Note: you can save this visualization as an HTML file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)
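
To get a feel for what the plot encodes, the sketch below counts how often each term appears in each author's passages, which is roughly the raw information behind a term's position along the two axes. This is a simplified illustration on made-up passages, not `Scattertext`'s actual scoring or scaling; the `passages` data and the `term_counts` helper are hypothetical.

```python
from collections import Counter

# Made-up (author, space-separated nouns) pairs in the same shape as nouns_df
passages = [
    ('dickens', 'coach street pocket'),
    ('dickens', 'street lady coach'),
    ('eliot', 'field mind fear'),
    ('eliot', 'mind street peace'),
]

def term_counts(rows, author):
    """Count whitespace-separated terms across one author's passages."""
    counts = Counter()
    for row_author, text in rows:
        if row_author == author:
            counts.update(text.split())
    return counts

dickens_counts = term_counts(passages, 'dickens')
eliot_counts = term_counts(passages, 'eliot')

# Print the per-author counts side by side; terms that are frequent for one
# author and rare for the other are the distinguishing ones.
for term in sorted(set(dickens_counts) | set(eliot_counts)):
    print(f'{term:8s} dickens={dickens_counts[term]} eliot={eliot_counts[term]}')
```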

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, compare the adjectives used by these authors using `Scattertext`.

In [14]:
# create samples
dickensadj_sample_df = dickens_df.sample(10_000)
eliotadj_sample_df = eliot_df.sample(10_000)

In [16]:
# combine DataFrames
adj_df = pd.concat([dickensadj_sample_df, eliotadj_sample_df])

In [17]:
# drop all columns except 'author' and 'adjectives'
# (select from adj_df, the frame built from the new samples, not the earlier df)
adj_df = adj_df[['author', 'adjectives']]

In [18]:
# create a scattertext corpus
corpusadj = st.CorpusFromPandas(adj_df, category_col='author', text_col='adjectives').build()

In [19]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpusadj,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [20]:
# display visualization in notebook
HTML(html)

In [21]:
# save the adjective visualization as an HTML file
file_name = 'adj.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)
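
Because the noun and adjective comparisons repeat the same sample, combine, and select steps, those steps could be wrapped in one function so that each comparison starts from its own clearly named data. The sketch below is one possible way to do this, shown on toy rows; `build_pos_frame`, its parameters, the default seed, and the toy data are hypothetical names and values introduced for illustration.

```python
import pandas as pd

def build_pos_frame(frames, column, n, seed=0):
    """Sample n rows from each frame, combine them, and keep 'author' plus one column."""
    samples = [frame.sample(n, random_state=seed) for frame in frames]
    combined = pd.concat(samples, ignore_index=True)
    return combined[['author', column]]

# Toy stand-ins for dickens_df and eliot_df (illustrative rows only)
toy_dickens = pd.DataFrame({
    'author': ['dickens'] * 3,
    'nouns': ['coach arm', 'lady', 'picture family'],
    'adjectives': ['little outer', 'other', 'little astonished'],
})
toy_eliot = pd.DataFrame({
    'author': ['eliot'] * 3,
    'nouns': ['ear eye', 'tear eye', 'fear mind'],
    'adjectives': ['solemn slow', 'rare', 'other dear'],
})

toy_adj_df = build_pos_frame([toy_dickens, toy_eliot], 'adjectives', n=2)
print(toy_adj_df)
```

With the real data, the equivalent call would be `build_pos_frame([dickens_df, eliot_df], 'adjectives', n=10_000)`, and the result could be passed to `st.CorpusFromPandas` as in the cells above.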