# Demo 05

In [None]:
import nltk
import spacy
import pandas as pd
import os
from tqdm import tqdm

import numpy as np

## Types vs Tokens

In [None]:
speech = "We refuse to believe that there are insufficient funds in the great vaults \
of opportunity of this nation. And so we've come to cash this check, a check that \
will give us upon demand the riches of freedom and the security of justice"

In [None]:
speech

In [None]:
tokens = nltk.tokenize.word_tokenize(speech)
" ".join(tokens)

In [None]:
f"Number of tokens is {len(tokens)}, number of types is {len(set(tokens))}"

### Duplicate types?

**Question:** Can you find any duplicate types in our vocabulary?

In [None]:
vocab = set(tokens)
" ".join(vocab)

<details>
<summary>Answer</summary>
    <b>"We"</b> and <b>"we"</b>

</details>

Do we want to treat these as different types?

**Question:** What solution would you suggest? 

<details>
<summary>Solution</summary>
<b>Lowercase</b>

</details>

In [None]:
# Solution is below in code

In [None]:
lower_tokens = nltk.tokenize.word_tokenize(speech.lower())
f"Number of lowered tokens is {len(lower_tokens)}, number of types is {len(set(lower_tokens))}"
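
The token/type distinction does not depend on NLTK. As a minimal sketch, the cell below counts tokens and types with a regex tokenizer; `simple_tokenize` and `type_token_counts` are hypothetical helpers, and the regex is only a rough stand-in for `word_tokenize` (for example, it splits contractions differently).

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def type_token_counts(text, lowercase=False):
    """Return (number of tokens, number of types) for the text."""
    if lowercase:
        text = text.lower()
    tokens = simple_tokenize(text)
    return len(tokens), len(set(tokens))

sample = "We came, and we saw."
cased = type_token_counts(sample)                     # "We" and "we" are separate types
lowered = type_token_counts(sample, lowercase=True)   # lowercasing merges them
print(cased, lowered)
```

Lowercasing leaves the number of tokens unchanged and can only reduce the number of types.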

(back to slides)

## Lemmatization

In [None]:
import nltk

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizer.lemmatize("transformed", "v") # The NLTK WordNet Lemmatizer needs to know the part of speech tag

#### Go, Goes, Went, Gone, Going

**Question:** What do you think the lemma for these terms should be? 

In [None]:
lemmatizer.lemmatize("go"), lemmatizer.lemmatize("goes")

In [None]:
lemmatizer.lemmatize("went", "v"), lemmatizer.lemmatize("gone", "v"), lemmatizer.lemmatize("going", "v")
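
To see why a lemmatizer needs a dictionary and not just suffix rules, here is a toy sketch. The `IRREGULAR_VERBS` table and `toy_lemmatize_verb` are hypothetical and hand-written for this example; this is not how the WordNet lemmatizer is implemented, and the `-ing` rule is deliberately naive.

```python
# Hypothetical lookup table of irregular verb forms, written by hand for this example
IRREGULAR_VERBS = {"went": "go", "gone": "go", "goes": "go"}

def toy_lemmatize_verb(word):
    """Map a verb form to its lemma using a lookup table plus one naive suffix rule."""
    word = word.lower()
    if word in IRREGULAR_VERBS:
        return IRREGULAR_VERBS[word]
    # Naive rule: strip a trailing "ing" (wrong for many verbs, e.g. "running")
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    return word

forms = ["go", "goes", "went", "gone", "going"]
lemmas = [toy_lemmatize_verb(form) for form in forms]
print(lemmas)
```

No suffix rule can turn "went" into "go"; irregular forms have to be looked up.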

(back to slides)
## Stemming

In [None]:
snowball_stemmer = nltk.stem.SnowballStemmer("english") # Snowball ("Porter2") is a revised version of the Porter stemmer
snowball_stemmer.stem("babies")

### constitutional, constitutionality, ...

In [None]:
snowball_stemmer.stem("constitution"), snowball_stemmer.stem("constitutions"), snowball_stemmer.stem("constitutional"), snowball_stemmer.stem("constitutionality"), snowball_stemmer.stem("constitutionalism")

### Relat

In [None]:
snowball_stemmer.stem("relativity"), snowball_stemmer.stem("relative")

(back to slides)
## Stopwords

In [None]:
" ".join(nltk.corpus.stopwords.words('english'))

**Question:** What do we notice about these words?
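
A common use of a stopword list is to filter tokens before counting. The sketch below assumes a small hand-picked set, `TOY_STOPWORDS`, which is not the NLTK list; `remove_stopwords` is a hypothetical helper.

```python
# Hand-picked toy stopword set (an assumption for this example, not the NLTK list)
TOY_STOPWORDS = {"the", "of", "and", "to", "in", "a", "that", "this", "we", "are", "there"}

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Keep only tokens whose lowercased form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = "We refuse to believe that there are insufficient funds".split()
content_tokens = remove_stopwords(tokens)
print(content_tokens)
```

Comparing lowercased forms means "We" is filtered just like "we".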

(back to slides)
## Part of Speech Tagging

In [None]:
nltk.pos_tag(speech)

**Question:** What does this error mean?

In [None]:
nltk.pos_tag(nltk.word_tokenize(speech))
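
`nltk.pos_tag` expects a list of token strings. A raw Python string is an iterable of characters, not of words, which is why it has to be tokenized first. The plain-Python sketch below shows the difference, using whitespace splitting as a rough stand-in for `word_tokenize`.

```python
sentence = "time flies"

# Iterating over a string yields single characters
as_characters = list(sentence)

# Splitting on whitespace yields word-level tokens
as_tokens = sentence.split()

print(as_characters[:4])
print(as_tokens)
```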

Let's look at another tagset

In [None]:
nltk.pos_tag(nltk.word_tokenize(speech), tagset='universal')

Tutorial 2.1 will further explore differences between these sets

### Tricky examples

***time flies like an arrow***

**Question:** What should the POS tags here be?

- time: 
- flies:
- like: 
- an:
- arrow:

Let's see what nltk tells us

In [None]:
nltk.pos_tag(nltk.word_tokenize("time flies like an arrow"), tagset='universal')

**Question:** Do we agree?

Tutorial 2.1 will focus on the difference between these

(back to slides)

## Dependency Parsing

### spaCy

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

You might need to run 

> !python3 -m spacy download en_core_web_sm

In [None]:
doc = nlp(speech)
doc, type(doc)

In [None]:
list(doc.sents)[0]

Tutorial 2.1 will go into detail about the spaCy `Doc` object

In [None]:
from spacy import displacy
displacy.render(list(doc.sents)[0], style="dep")

In [None]:
for tok in list(doc.sents)[0]:
    print(tok.text, tok.dep_.upper(), tok.head)

spaCy dependency parse labels are explained [here](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)

(back to slides)

## Named Entity Recognition

In [None]:
example_doc = nlp("Monday, October 30, Hillary Clinton will present her book in Chicago at the University of Chicago.")
example_doc

**Question:** How do we get the entities?

In [None]:
example_ents = ...
example_ents

**Question:** Let's get the text of the entities and the label of the entity

<details>
<summary>Solution</summary>
<b>[(ent.text, ent.label_) for ent in example_doc.ents]</b>

</details>

### Entities in Dracula

I downloaded Dracula from Project Gutenberg: https://www.gutenberg.org/ebooks/345

In [None]:
!ls data/Dracula.txt

The next cell takes about 2 minutes to run.

In [None]:
%%time 
with open("data/Dracula.txt") as f:  # close the file once it has been read
    doc = nlp(f.read())

I ran a [tool](https://github.com/JonathanReeve/chapterize) developed by Jonathan Reeve that splits novels from Project Gutenberg into files for each chapter.

[Jonathan](https://jonreeve.com/) is a computational literary analyst here at Columbia.

In [None]:
!ls data/Dracula-chapters

In [None]:
%%time

DRACULA_PATH = "data/Dracula-chapters/"

chapter2doc = {}
for file in tqdm(os.listdir(DRACULA_PATH)):
    chapter_id = file.split(".")[0]
    with open(os.path.join(DRACULA_PATH, file)) as f:  # close each chapter file after reading
        chapter2doc[chapter_id] = nlp(f.read())

In [None]:
chapter2doc.keys()

In [None]:
type(chapter2doc['01'])

In [None]:
texts, labels = [], []
for ent in chapter2doc['01'].ents:
    texts.append(ent.text)
    labels.append(ent.label_)
    
ents_df = pd.DataFrame({'text': texts, 'label': labels})
ents_df.sample(10)

**Question:** What labels do we see the most in the first Chapter?

In [None]:
ents_df['label'].value_counts()

**Question:** What person is mentioned the most in the first chapter?

In [None]:
ents_df[ents_df['label'] == 'PERSON']['text'].value_counts() # count by entity text only

**Question:** Who is mentioned the most throughout the entire book?

In [None]:
chapters, texts, labels = [], [], []

for chapter, doc in chapter2doc.items():
    for ent in doc.ents:
        texts.append(ent.text)
        labels.append(ent.label_)
        chapters.append(chapter)
    
ents_df = pd.DataFrame({'text': texts, 'label': labels, 'chapter': chapters})
ents_df.sample(10)

In [None]:
ents_df.sort_values(by='chapter')

In [None]:
lucy_mentions_df = ents_df[ents_df['text'] == 'Lucy']
lucy_mentions_df

In [None]:
lucy_mentions_df['label'].value_counts()

In [None]:
lucy_mentions_df = lucy_mentions_df.drop(columns=['label']) 
lucy_mentions_df

In [None]:
lucy_mentions_df['chapter'].value_counts().plot(kind='line')

**Question:** What don't we like about this graph?

In [None]:
ax = lucy_mentions_df['chapter'].value_counts().sort_index().plot(kind='line')
ax.set_title("Number of times Lucy is mentioned per chapter")
ax.set_xlabel("Chapter Number")
ax.set_ylabel("Number of Lucy mentions")
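
The reason `sort_index()` matters: `value_counts()` orders its result by count, so a line plot would visit the chapters out of order. A minimal sketch on made-up chapter labels (the toy data is an assumption, not taken from the novel):

```python
import pandas as pd

# Made-up chapter labels, one entry per mention
chapters = pd.Series(["03", "01", "03", "02", "03", "01"])

by_count = chapters.value_counts()                 # ordered by frequency, most common first
by_chapter = chapters.value_counts().sort_index()  # ordered by chapter label

print(by_count)
print(by_chapter)
```

Because the chapter labels are zero-padded strings, sorting them alphabetically also gives the numeric chapter order.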

**Question:** Does this figure make sense based on the novel?


#### Plotting most common characters in Dracula

**Question:** Who are the 50 most commonly mentioned characters in Dracula?


<details>
<summary>Solution</summary>
<b>ents_df[ents_df['label'] == 'PERSON']['text'].value_counts().head(50)</b>

</details>

In [None]:
# write code here to determine that based on ents_df

Let's query the dataframe to find all rows that have been tagged as a PERSON and 
save the result from the query in `person_df`


<details>
<summary>Solution</summary>
<b>ents_df[ents_df['label'] == 'PERSON']</b>

</details>

In [None]:
person_df = ...
person_df

Now let's determine how many times each person was mentioned in each chapter.

We want to make a new dataframe where the indices are the chapters and the columns represent the counts of how many times a specific character was mentioned in the chapter.
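
To make the target shape concrete, here is a minimal sketch on a made-up dataframe. `toy_person_df` and its contents are assumptions for illustration, and it uses `pd.crosstab`, one of several ways to build such a chapter-by-character count table.

```python
import pandas as pd

# Made-up PERSON mentions: one row per mention
toy_person_df = pd.DataFrame({
    "text": ["Lucy", "Mina", "Lucy", "Mina", "Jonathan"],
    "chapter": ["01", "01", "01", "02", "02"],
})

# Rows are chapters, columns are characters, cells are mention counts
toy_counts = pd.crosstab(toy_person_df["chapter"], toy_person_df["text"])
print(toy_counts)
```

Combinations that never occur, such as Lucy in chapter "02" here, show up as 0.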

In [None]:
pv_table = pd.pivot_table(person_df, index=['chapter'],
                    columns=['text'], aggfunc=len, fill_value=0)
pv_table

In [None]:
pv_table.reset_index()

Let's plot just the 10 most frequently mentioned characters

In [None]:
person_df['text'].value_counts().head(10) # first find the 10 most frequently mentioned characters

In [None]:
ten_freq_people = person_df['text'].value_counts().index[:10] # Let's get their names
ten_freq_people

In [None]:
pv_table['label'][ten_freq_people].plot(kind='line') # Query the pivot table and then plot the result 

Let's make subplots as well.