# Intro

In [2]:
import spacy

%load_ext nb_black

nlp = spacy.load("en_core_web_sm")

In [3]:
# Process sentences 'Hello, world. Antonio is learning Python.' using spaCy
doc = nlp(u"Hello, world. Antonio is learning Python.")

## Get tokens and sentences

#### What is a Token?
A token is a single element of a sentence obtained by splitting up the text: usually a word or a punctuation mark, sometimes a group of words analysed as one unit. The task of splitting the sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting up the sentence into individual words.

	"Antonio is learning Python!"
	["Antonio","is","learning","Python!"]
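Splitting on whitespace alone leaves punctuation attached to the neighbouring word ("Python!"), whereas tokenisers such as spaCy's normally emit punctuation as separate tokens. A minimal sketch of that idea using only a regular expression; it is an illustrative stand-in rather than spaCy's tokeniser, and the name `simple_tokenise` is hypothetical:

```python
import re


def simple_tokenise(text):
    # \w+ matches a run of letters/digits, [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenise("Antonio is learning Python!"))
# ['Antonio', 'is', 'learning', 'Python', '!']
```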

In [4]:
# Get first token of the processed document
token = doc[0]
print(token)

# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

Hello
Hello, world.
Antonio is learning Python.


## Part of speech tags

#### What is a Part-of-Speech Tag?
A part-of-speech (POS) tag is a context-sensitive label for the grammatical role a word plays in its sentence (noun, verb, adjective and so on). The same word can receive different tags in different sentences.
More information about the kinds of part-of-speech tags used in NLP can be [found here](http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/).

Examples:

1. NUM, Numeral - 1, 2, 3
2. PROPN, Proper Noun - "Jan", "Javier", "Antonio", "Italy"
3. INTJ, Interjection - "Ohhhhhhhhhhh"

In [5]:
# For each token, print corresponding part of speech tag
for token in doc:
    print(token.text, token.pos_)

Hello INTJ
, PUNCT
world NOUN
. PUNCT
Antonio PROPN
is AUX
learning VERB
Python PROPN
. PUNCT


In [6]:
from spacy import displacy

In [7]:
displacy.render(doc, style='dep')



In [33]:
displacy.render(doc, style = "ent",jupyter = True)


In [9]:
print(doc[2])

world


Recall that dependency structures are represented by directed graphs, whose set of vertices V contains the tokens of a sentence, and which satisfy the following constraints:

1. There is a single designated root node that has no incoming arcs.

2. With the exception of the root node, each vertex has exactly one incoming arc.

3. There is a unique path from the root node to each vertex in V.
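The three constraints can be checked mechanically. The sketch below assumes a sentence is given as a plain list `heads`, where `heads[i]` is the index of token i's head and the root points to itself (the convention spaCy uses for `.head`). The function name and the example head lists are hypothetical, and indices are assumed to be in range.

```python
def is_valid_dependency_tree(heads):
    """heads[i] is the index of token i's head; the root is its own head."""
    roots = [i for i, h in enumerate(heads) if h == i]
    if len(roots) != 1:
        return False  # constraint 1: exactly one root
    # Constraint 2 holds by construction: every token stores exactly one head.
    # Constraint 3: walking up from any token must reach the root without looping.
    for start in range(len(heads)):
        seen = set()
        node = start
        while heads[node] != node:
            if node in seen:
                return False  # cycle, so the root cannot reach this token
            seen.add(node)
            node = heads[node]
    return True


# "Hello , world ." with every token attached to "Hello" (illustrative heads)
print(is_valid_dependency_tree([0, 0, 0, 0]))  # True
# Tokens 1 and 2 point at each other, so they are cut off from the root
print(is_valid_dependency_tree([0, 2, 1]))     # False
```

The check applies to one sentence at a time: a document with several sentences has one root per sentence.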

You can inspect the head of each token through the `.head` attribute of a spaCy token:


In [10]:
doc[2]

world

In [11]:
doc[2].head

Hello

So how would you search for the root?

Since there is a unique path from the root node to each vertex in V, each sentence has exactly one root node with no incoming arcs. In spaCy the root token is its own head, so we can search for the tokens whose head is the token itself!

In [12]:
for token in doc:
    if token.head == token:
        print(token)

Hello
learning


As expected, since there were two sentences in the doc, we got two roots.

We can also build a function that, given a spaCy token, returns the path from that token up to the root:

In [13]:
# Define a function that follows the heads from a token up to the root of its sentence

def path_to_the_root(token):
    path = [token]
    # The root is the only token that is its own head
    while token.head != token:
        token = token.head
        path.append(token)
    return path

print(path_to_the_root(doc[2]))

[world, Hello]

In [15]:
print(doc[4])

Antonio


## Embeddings 

An embedding is a fixed-size numerical vector that attempts to encode some of the semantic meaning of the word or sentence it represents. Most embeddings build on the distributional hypothesis, which states that words that often have the same neighboring words tend to be semantically similar. For example, if 'football' and 'basketball' usually appear close to the word 'play', we assume that they are semantically similar. Word2Vec is one algorithm based on this idea. A common way of obtaining a sentence embedding is to average the embeddings of the words in the sentence and use that average as the representation of the whole sentence.
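The averaging step can be sketched in a few lines of plain Python. The three-dimensional vectors below are made-up toy values, not output from any model, and `sentence_embedding` is a hypothetical helper name:

```python
def sentence_embedding(word_vectors):
    """Average equally sized word vectors component by component."""
    n = len(word_vectors)
    return [sum(components) / n for components in zip(*word_vectors)]


# Toy embeddings (invented numbers, for illustration only)
toy_vectors = {
    "play": [1.0, 0.0, 2.0],
    "football": [3.0, 2.0, 0.0],
}

print(sentence_embedding([toy_vectors["play"], toy_vectors["football"]]))
# [2.0, 1.0, 1.0]
```

Averaging discards word order, so "dog bites man" and "man bites dog" end up with the same sentence embedding under this scheme.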

- In spaCy every token has its embedding.
- It is under the attribute `vector`.
- The embedding size depends on the loaded model; with `en_core_web_sm`, `token.vector` has 96 dimensions.


Obtain the embeddings of all the tokens.

In [16]:
def print_token_vectors(tokens):
    # Print the embedding of every token
    for token in tokens:
        print(token.vector)


print_token_vectors(doc)


[ 1.0823821  -0.42946622 -0.5029499   0.16072363  0.6249163  -0.43111977
  1.4949617   0.84327805  1.1696405  -1.146951    0.22747783  1.609772
 -0.5234932  -0.6649049  -0.43582094  0.55953836 -0.15735066 -1.0758792
  0.6815688  -0.05594122 -0.14967234 -0.36605948 -0.3892722   0.46946785
 -0.35326886 -0.15151012 -0.07002574  0.01581563 -0.7716576   0.16755503
  0.50840676 -1.330525   -0.84367275  0.76433086 -0.49710384  0.20579234
 -1.0266633  -0.42367968  0.1842682   0.9837595   0.35812497 -0.04406814
 -0.13158312 -1.0771542  -0.07300432  0.02702707  0.10031849 -0.6668478
  0.69046175 -0.6554684  -1.0477257  -0.25338298 -0.84254164 -0.3190821
  0.13100918 -0.1516825  -0.8030119  -1.0564109  -0.5003395   1.9936432
  2.533157   -0.04505131  1.0323157   0.18503037  0.78973675  0.9035241
 -0.60990644  0.02243002 -0.9370111  -1.0024397   0.54751515 -0.83813816
 -0.82515997 -0.9111524   1.2109123  -1.2123573  -0.29819542 -0.24432868
  0.33222318  0.30747586  0.08078419 -1.6028439   0.115246

## Semantic similarity 

To compute the semantic similarity between two sentences, $u$ and $v$, we measure the cosine similarity between the two sentence embeddings. The formula is as follows:

$sim(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$


Use the formula above to get the semantic similarity between the words in doc.
Feel free to test it between different words too.

In [17]:
import numpy as np
def semantic_sim(u,v):
    # Divide the dot product by the product of the Euclidean norms
    return np.dot(u,v)/(np.linalg.norm(u)*np.linalg.norm(v))
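Three directions make a quick sanity check for any cosine similarity implementation: vectors pointing the same way should score about 1, orthogonal vectors 0, and opposite vectors about -1. The sketch below is a pure-Python restatement of the formula with toy vectors, so it runs without NumPy; `cosine_similarity` is a hypothetical name.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The result is a single scalar and does not depend on vector length: scaling either input leaves it unchanged.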

# Pride and Prejudice analysis

We would like to:

- Extract the names of all the characters from the book (e.g. Elizabeth, Darcy, Bingley)
- Visualize characters' occurrences with regard to relative position in the book
- Automatically describe any character from the book
- Find out which characters have been mentioned in a context of marriage
- Build keywords extraction that could be used to display a word cloud (example)

To load the text file, it is convenient to decode it as UTF-8:

In [18]:
def read_file(file_name):
    with open(file_name, "r", encoding="utf-8") as file:
        return file.read()

### Process full text

In [19]:
text = read_file("data/pride_and_prejudice.txt")
# Process the text

In [20]:
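nlp = spacy.load('en_core_web_sm')
# Decode explicitly as UTF-8 so the result does not depend on the platform default
with open('data/pride_and_prejudice.txt', encoding='utf-8') as f:
    text = f.read()
doc = nlp(text)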

In [21]:
# How many sentences are in the book (Pride & Prejudice)?
sentences = list(doc.sents)
len_sent = len(sentences)
print(f'There are {len_sent} sentences in the book (Pride & Prejudice)')

# Print sentences 10 to 14, to make sure that we have parsed the correct book
print(sentences[10:15])


There are 5764 sentences in the book (Pride & Prejudice)
[

"Why, my dear, you must know, Mrs. Long says that Netherfield is taken
by a young man of large fortune from the north of England; that he came
down on Monday in a chaise and four to see the place, and was so much
delighted with it, that he agreed with Mr. Morris immediately; that he
is to take possession before Michaelmas, and some of his servants are to
be in the house by the end of next week., "

"What is his name?"

"Bingley., "

"Is he married or single?"

"Oh!, Single, my dear, to be sure!, A single man of large fortune; four or
five thousand a year.]


## Find all the personal names

[Hint](# "List doc.ents and check ent.label_")

In [31]:
# Extract all the personal names from Pride & Prejudice and count their occurrences.
# Expected output is a list in the following form: [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266) ...].
from collections import Counter, defaultdict


def find_character_occurences(doc):
    """
    Return a list of actors from `doc` with corresponding occurrences.

    :param doc: Spacy NLP parsed document
    :return: list of tuples in form
        [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266)]
    """

    characters = Counter()
    # Count every entity that spaCy labels as a person
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            characters[ent.lemma_] += 1
    return characters.most_common()


print(find_character_occurences(doc))

[('Elizabeth', 413), ('Darcy', 361), ('Jane', 268), ('Bennet', 248), ('Bingley', 214), ('Collins', 169), ('Wickham', 160), ('Lizzy', 95), ('Gardiner', 88), ('Lady Catherine', 76), ('William', 33), ('Fitzwilliam', 33), ('Hurst', 29), ('Phillips', 28), ('Forster', 21), ('Maria', 18), ('Lady Lucas', 16), ('Lucas', 16), ('Long', 14), ('Pemberley', 13), ('Lydia', 13), ("Lady Catherine's", 13), ('Project Gutenberg-tm', 13), ('Miss Bennet', 12), ('Netherfield', 11), ('Brighton', 11), ('Miss', 10), ('de Bourgh', 10), ('Catherine', 9), ('Jenkinson', 9), ('Lady Catherine de Bourgh', 8), ('William Lucas', 7), ('Denny', 7), ('George Wickham', 7), ('Miss de Bourgh', 7), ('Reynolds', 7), ('Elizabeth Bennet', 6), ('Miss Lucas', 6), ('Charlotte', 6), ('Eliza', 6), ('Charles', 6), ('Kitty', 5), ('Miss King', 5), ('Lady\nCatherine', 5), ('Hill', 5), ('Jane Austen', 4), ('Carter', 4), ('Jones', 4), ('Miss Bennets', 4), ('Younge', 4), ('Robinson', 3), ("William Lucas's", 3), ('Louisa', 3), ('Eliza Bennet'

In [27]:
teachers = Counter() 
for i in range(10):
    teachers['alessio'] += 1
print(teachers.most_common())


[('alessio', 10)]


## Plot characters' personal names as a time series

In [24]:
# Matplotlib Jupyter HACK
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

We can investigate where a particular entity occurs in the text by accessing the `.start` attribute of the entity, which gives the index of its first token:

[Hint](# "ent.start")

In [25]:
# List all the start positions of person entities

So we can create a function that stores all the offsets of every character:
   
   
[Hint](# "Create a dictionary with the lowered lemmas [ent.lemma_.lower()] and associate a list of all the ent.starts")

In [26]:
# Plot characters' mentions as a time series relative to the position of the actor's occurrence in a book.

def get_character_offsets(doc):
    """
    For every character in a `doc` collect all the occurrence offsets and store them into a list. 
    The function returns a dictionary that has actor lemma as a key and list of occurrences as a value for every character.
    
    :param doc: Spacy NLP parsed document
    :return: dict object in form
        {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    """
    character_offsets = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            # Key: lowercased lemma; value: token index where the mention starts
            character_offsets[ent.lemma_.lower()].append(ent.start)

    return dict(character_offsets)

character_occurences = get_character_offsets(doc)

In [None]:
character_occurences

[Hint](# "Use the character offsets for each character as x")

In [None]:
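# Plot the histogram of the character occurrences in the whole text
NUM_BINS = 20

def plot_character_hist(character_offsets, character_label, cumulative=False):
    # The token offsets of every mention of the character are used as x
    x = character_offsets[character_label]
    plt.hist(x, bins=NUM_BINS, cumulative=cumulative)
    plt.title(character_label)
    plt.xlabel("Token offset in the book")
    plt.show()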

In [None]:
plot_character_hist(character_occurences, "elizabeth")

In [None]:
plot_character_hist(character_occurences, "darcy")

### Cumulative occurrences

In [None]:
plot_character_hist(character_occurences, "elizabeth", cumulative=True)

In [None]:
plot_character_hist(character_occurences, "darcy", cumulative=True)

### spaCy parse tree in action

[Hint](# "ent.subtree, token.pos_ == 'ADJ'") 

In [None]:
# Find words (adjectives) that describe Mr. Darcy.

def get_character_adjectives(doc, character_lemma):
    """
    Find all the adjectives related to `character_lemma` in `doc`
    
    :param doc: Spacy NLP parsed document
    :param character_lemma: string object
    :return: list of adjectives related to `character_lemma`
    """
    
    adjectives = []
    for ent in doc.ents:
        if ent.lemma_.lower() == character_lemma:
            # The character must be the subject of the clause
            if ent.root.dep_ == 'nsubj':
                for child in ent.root.head.children:
                    # Adjectival complements of the governing verb describe the subject
                    if child.dep_ == 'acomp':
                        adjectives.append(child.lemma_)
                        
    return adjectives

print(get_character_adjectives(doc, 'darcy'))

In [None]:
# Find words (adjectives) that describe Elizabeth.


print(get_character_adjectives(doc, 'elizabeth'))

For the full list of dependency labels, see the Stanford typed dependencies manual: https://nlp.stanford.edu/software/dependencies_manual.pdf

`acomp`: adjectival complement
*i.e.* an adjectival phrase which functions as the complement of a verb (like an object of the verb). *e.g.* "She looks very beautiful": *beautiful* is an adjectival complement of *looks*.

`nsubj`: nominal subject
*i.e.* a noun phrase which is the syntactic subject of a clause. The head of this relation
might not always be a verb: when the verb is a copular verb, the root of the clause is the complement of
the copular verb, which can be an adjective or noun.
*e.g.* "Clinton defeated Dole". The relationship is *nsubj(defeated, Clinton)*

"The baby is cute". The relationship is *nsubj(cute, baby)*.

In the code, `.dep_` stands for syntactic dependency, *i.e.* the relation between a token and its head.
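The `nsubj` / `acomp` pattern can be tried on a toy parse without loading a model. The sketch below uses a hand-annotated parse of "Darcy was proud" in which the copular verb is the head of both the subject and the adjective; that annotation is an assumption made for illustration, not parser output, and `ToyToken` and `adjectives_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    text: str
    dep: str   # dependency label, e.g. "nsubj"
    head: int  # index of the head token; the root points to itself


# Hand-annotated parse of "Darcy was proud" (assumed, not produced by a parser)
sentence = [
    ToyToken("Darcy", "nsubj", 1),
    ToyToken("was", "ROOT", 1),
    ToyToken("proud", "acomp", 1),
]


def adjectives_for(subject, tokens):
    found = []
    for i, tok in enumerate(tokens):
        if tok.text == subject and tok.dep == "nsubj":
            # Look at the other tokens attached to the same head as the subject
            found += [t.text for j, t in enumerate(tokens)
                      if j != i and t.head == tok.head and t.dep == "acomp"]
    return found


print(adjectives_for("Darcy", sentence))
# ['proud']
```

The subject and the adjective are never linked directly; both hang off the verb, which is why the lookup goes up to the head and back down to its children.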

In [None]:
doc.ents[30].root.dep_

[Hint](# "ent.label_, ent.root.head.lemma_") 

In [None]:
# Find characters that are 'talking', 'saying', 'doing' the most. Find the relationship between 
# entities and corresponding root verbs.

character_verb_counter = Counter()


for ent in doc.ents:
    # Count persons whose mention is governed by the verb "say"
    if ent.label_ == 'PERSON' and ent.root.head.lemma_ == 'say':
        character_verb_counter[ent.text] += 1

print(character_verb_counter.most_common(10)) 

# do the same for talking and doing

print(character_verb_counter.most_common(10)) 


[Hint](# "ent.label_, ent.root.head.pos_") 

In [None]:
# Find 20 most used verbs
verb_counter = Counter()

# your code here

print(verb_counter.most_common(20))

In [None]:
# Create a dataframe with the most used verbs and how many times a character used each verb

import pandas as pd
verb_characters = {}
verb_list = [verb[0] for verb in verb_counter.most_common(20)]
for ent in doc.ents:
    if ent.label_ == 'PERSON' and ent.root.head.lemma_ in verb_list:
        # complete the code
        pass
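One possible layout for `verb_characters` is a dictionary of per-character verb counts. Passed to `pd.DataFrame(...)` and transposed, such a dictionary yields one row per character and one column per verb. The layout is an assumption (the notebook leaves it open), and the (character, verb) pairs below are made-up stand-ins for what the loop over entities would produce.

```python
from collections import Counter, defaultdict

# Hypothetical (character, verb lemma) pairs, for illustration only
pairs = [
    ("Elizabeth", "say"), ("Darcy", "say"), ("Elizabeth", "say"),
    ("Elizabeth", "think"), ("Darcy", "look"),
]

# Outer key: character; inner Counter: how often each verb governs that character
verb_characters = defaultdict(Counter)
for character, verb in pairs:
    verb_characters[character][verb] += 1

for character, counts in verb_characters.items():
    print(character, dict(counts))
# Elizabeth {'say': 2, 'think': 1}
# Darcy {'say': 1, 'look': 1}
```

Verbs a character never used are simply missing from that character's counter, which is what the later `fillna(0)` call fills in.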


In [None]:
df = pd.DataFrame(verb_characters).transpose().fillna(0)
df

In [None]:
# drop the less meaningful columns
df = df[df.columns[df.sum()>=10]].sort_index()
df

In [None]:
import seaborn as sns
%matplotlib inline
sns.heatmap(df, annot=True, cmap='Blues')
df.style.background_gradient(cmap='Blues')
