## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban vs. rural environments.

## Set up

In [5]:
%%capture
!pip install scattertext

In [6]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path for HTML

In [7]:
# load the Dickens data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [8]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
124,dickens,carol,"Scrooge went to bed again, and thought, and th...",bed,perplexed,go think think think make think endeavour thin...
9349,dickens,times,"Since the Pegler affair, this gentlewoman had ...",affair gentlewoman pity veil melancholy contri...,quiet woful woful,cover become assume bestow
4088,dickens,expectations,The candles that lighted that room of hers wer...,candle room sconce wall ground dulness light a...,high steady artificial pale withered bridal ow...,light place burn renew look make stop throw se...
21969,dickens,pickwick,From the centre of the ceiling of this kitchen...,centre ceiling kitchen hand branch branch rise...,old own huge same general delightful old mysti...,suspend give do take lead salute submit befit ...
23828,dickens,pickwick,"'Strange sitivation for one o' the family,' ob...",sitivation o family aunt chair depitty sawbone,strange,observe hoist bring


In [9]:
# load the Eliot data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [10]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
12070,eliot,bede,"“Why, that’s just the reason she wants to go, ...",reason fur reason country t eat folk week alla...,comfortable much miserable next canna wi,’ want go give say say ’ arena ’ go turn say ’...
4363,eliot,silas,"Silas, always ill at ease when he was being sp...",ease better man horseback constraint,ill such tall powerful florid,speak see answer
16922,eliot,scenes,"‘Depend upon it,’ said Mr. Cleves, ‘there is s...",explanation affair man knack injustice manner,simple whole right minded,depend say happen know impress do
10459,eliot,bede,But he had the best antidote against imaginati...,antidote dread necessity coffin minute hammer ...,good imaginative next other strange,get ring sound overpower come take come howl s...
10029,eliot,deronda,"“Some time—gradually—you will know all,” said ...",time yourself time trouble distress,sure,know say tell pass go


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [11]:
# create samples
# the underscore in 10_000 is just a digit separator that makes the number easier to read
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [12]:
# combine DataFrames: pd.concat stacks them into one; they must be passed together in a list
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [13]:
# drop all columns except 'author' and 'nouns'; the inner square brackets are a list of column names
nouns_df = df[['author', 'nouns']]

In [14]:
# sanity check: expect 20,000 rows and two columns
print(nouns_df.shape)
nouns_df.sample(10)

(20000, 2)


Unnamed: 0,author,nouns
23682,dickens,waiter gentleman gentleman lady hospitality sa...
17663,dickens,home
17288,eliot,advantage love thing secret sake day riding ho...
10231,dickens,voice half round prospect
7524,eliot,no curtness toss head hat
15032,eliot,pet
2665,eliot,room attention phoenix cleverness sense profes...
21283,dickens,arm neck sigh form lip smile face lip smile world
3299,dickens,ground dike sluice manner air order possessor ...
10741,dickens,belief outset story unknown delusion line diff...
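
The three steps above can also be sketched on toy data. This is a minimal variant rather than the notebook's exact code: the rows below are made up, and `random_state`, `ignore_index=True`, and the `dropna` call are optional additions. The `dropna` step assumes that some rows may have no extracted nouns, which has not been checked against the real CSV files.

```python
import pandas as pd

# Toy stand-ins for dickens_df and eliot_df (hypothetical rows, not the real data).
toy_dickens = pd.DataFrame({
    "author": ["dickens"] * 4,
    "title": ["carol"] * 4,
    "nouns": ["bed ghost", "candle room", None, "waiter gentleman"],
})
toy_eliot = pd.DataFrame({
    "author": ["eliot"] * 4,
    "title": ["silas"] * 4,
    "nouns": ["ease man", "reason country", "time trouble", None],
})

# random_state fixes the random draw, so re-running the cell gives the same sample.
dickens_part = toy_dickens.sample(3, random_state=42)
eliot_part = toy_eliot.sample(3, random_state=42)

# ignore_index=True renumbers the rows 0..n-1 instead of keeping both original indexes.
combined = pd.concat([dickens_part, eliot_part], ignore_index=True)

# Keep the two columns of interest and drop rows that have no nouns at all.
toy_nouns = combined[["author", "nouns"]].dropna(subset=["nouns"])
print(toy_nouns.shape)
```

Without `ignore_index=True`, the combined `DataFrame` keeps the row labels from both sources, so the same label can appear once for each author, as in the sample output above.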


## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). We first create a `Scattertext` corpus, then transform that corpus into an html-based visualization, and finally display that visualization within our notebook. Note: you can also save the visualization as an html file.

In [15]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [16]:
# corpus is a Scattertext corpus object built from nouns_df; it lets us compare the nouns used by the two authors

In [17]:
# transform corpus into html-based visualization with scattertext
# st.produce_scattertext_explorer takes the corpus and returns the visualization as a string of html
# category is the value in the 'author' column that we compare against the rest; category_name labels it "Eliot"
# minimum_term_frequency removes the words used fewer than 20 times
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [18]:
# display visualization in notebook
HTML(html)

In [19]:
# Note: You can save this visualization as an html file
file_name = 'example.html'
with open(file_name, encoding='utf8', mode='w') as f:
  f.write(html)
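
To build some intuition for what the plot is based on, here is a minimal sketch in plain Python of counting terms per author and applying a frequency cutoff. It is only an approximation for illustration and is not how `Scattertext` works internally: the rows are made up, `min_term_frequency` is a hypothetical stand-in for the `minimum_term_frequency` argument, and the exact cutoff rule the library applies may differ.

```python
from collections import Counter

# Hypothetical (author, nouns) rows in the same shape as nouns_df.
rows = [
    ("dickens", "gentleman waiter gentleman"),
    ("dickens", "gentleman room"),
    ("eliot", "room love room"),
    ("eliot", "love thing"),
]

# Count how often each term appears in each category.
counts = {"dickens": Counter(), "eliot": Counter()}
for author, nouns in rows:
    counts[author].update(nouns.split())

# Keep only terms whose total count across both categories reaches the cutoff.
min_term_frequency = 2  # the notebook uses 20 on the real data
totals = counts["dickens"] + counts["eliot"]
kept = sorted(term for term, n in totals.items() if n >= min_term_frequency)

# Print each surviving term with its Dickens count and its Eliot count.
for term in kept:
    print(term, counts["dickens"][term], counts["eliot"][term])
```

A term with a high Eliot count and a low Dickens count would land toward the Eliot end of the y-axis, which is the kind of distinguishing term the visualization is meant to surface.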

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, make a comparison on the adjectives used by these authors using `Scattertext`.

In [23]:
# create samples (10,000 rows per author, matching the shape check below)
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [24]:
# combine DataFrames
dfs = pd.concat([dickens_sample_df, eliot_sample_df])

In [29]:
# drop all columns except 'author' and 'adjectives'
wanted_columns = dfs[['author', 'adjectives']]

In [31]:
print(wanted_columns.shape)
wanted_columns.sample(10)

(20000, 2)


Unnamed: 0,author,adjectives
2426,eliot,well
15046,eliot,good deep lazy good
5211,dickens,independent
13640,eliot,main sickly weak dark human
13074,eliot,arched distant pale
5252,dickens,first small young
3706,dickens,degraded vile
7431,eliot,happy next sorry same charming
19710,dickens,sudden several inconsistent general facetious ...
15609,dickens,precious golden little hasty own worth great p...


In [32]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(wanted_columns, category_col='author', text_col='adjectives').build()

In [39]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus, category='eliot', category_name='Eliot',
                                       not_category_name='Dickens', minimum_term_frequency=13,
                                       width_in_pixels=900)

In [40]:
# display visualization in notebook
HTML(html)

In [42]:
# save visualization as an html file
file = "adjectives_used_by_Eliot_and_Dickens.html"
with open(file, encoding='utf8', mode='w') as f:
    f.write(html)