# IST664 Lab 3 #

**Credit**: **Jeff Stanton and Preeti Jagadev**

In the realm of natural language processing, syntax is one level up from morphology. Whereas morphology pertains to the components of words, syntax examines how words are sequenced together.  

Although contemporary deep learning methods tend to hide a lot of these details behind the veil of the neural network, syntactical analysis remains a key part of effective NLP solutions, which is why it is such a core process in spaCy. Your ability to create, debug, and successfully modify a natural language system will be enhanced by deepening your understanding of how we use code to assign meaning to various parts of speech as well as the ways that sentences fit together.

This lab begins by reading a complete text from the Project Gutenberg website. We are downloading Dostoevsky's Crime and Punishment, as plain text, in a translation by Constance Garnett.

# Section 3.1 #

In [1]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# text from online gutenberg
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw), len(raw)

[nltk_data] Downloading package punkt to C:\Users\Black
[nltk_data]     Knight\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Black
[nltk_data]     Knight\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


(str, 1135213)

In [2]:
# Show the first 178 characters of the string stored in raw.
# It means: “start at index 0, and go up to (but not including) index 178.”
raw[:178]

'*** START OF THE PROJECT GUTENBERG EBOOK 2554 ***\n\n\n\n\nCRIME AND PUNISHMENT\n\nBy Fyodor Dostoevsky\n\n\n\nTranslated By Constance Garnett\n\n\n\n\nTRANSLATOR’S PREFACE\n\nA few words about Do'

In [3]:
# We'll begin our processing with tokenization
crimetokens = nltk.word_tokenize(raw)
crimetokens[112:122]

['the',
 'Petersburg',
 'school',
 'of',
 'Engineering',
 '.',
 'There',
 'he',
 'had',
 'already']

In [4]:
# Let's keep track of how many unique tokens we're starting with.
len(set(crimetokens))

11103

In [5]:
# Let's normalize to lower case to reduce the number of unique tokens
crimetokens = [w.lower() for w in crimetokens]
crimetokens[112:122]

['the',
 'petersburg',
 'school',
 'of',
 'engineering',
 '.',
 'there',
 'he',
 'had',
 'already']

Let's compare three stemmers provided by NLTK.

In [6]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.SnowballStemmer('english')
type(porter), type(lancaster), type(snowball)


(nltk.stem.porter.PorterStemmer,
 nltk.stem.lancaster.LancasterStemmer,
 nltk.stem.snowball.SnowballStemmer)

In [7]:
# From a data reduction standpoint, which stemmer results in the greatest reduction in the number of unique tokens?
# Answer: the Lancaster stemmer leaves the fewest unique tokens (see the counts below).
crimePstem = [porter.stem(t) for t in crimetokens]
crimeLstem = [lancaster.stem(t) for t in crimetokens]
crimeSstem = [snowball.stem(t) for t in crimetokens]

len(set(crimePstem)), len(set(crimeLstem)), len(set(crimeSstem))

(7173, 6245, 6988)

In [8]:
# What proportion of the unique tokens remains after Porter stemming? (The reduction is 1 minus this ratio.)
len(set(crimePstem))/len(set(crimetokens))

0.6907077515647568

In [9]:
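# Question 3.1
# Calculate and show the percent reduction in the number of tokens for the other two stemmers.
# Note: each ratio below is the proportion of unique tokens retained; the percent reduction is (1 - ratio) * 100.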
len(set(crimeLstem))/len(set(crimetokens))


0.6013480982185845

In [10]:
len(set(crimeSstem))/len(set(crimetokens))

0.6728935965334617
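
The three ratios above are the proportions of unique tokens that remain after stemming, so the percent reduction the question asks for is one minus each ratio. A minimal sketch of that conversion, using the ratios printed above (the dictionary name `retained` is just an illustrative choice):

```python
# Proportion of unique tokens retained, copied from the outputs above.
retained = {
    "Porter": 0.6907077515647568,
    "Lancaster": 0.6013480982185845,
    "Snowball": 0.6728935965334617,
}

# Percent reduction = (1 - proportion retained) * 100
reduction = {name: (1 - ratio) * 100 for name, ratio in retained.items()}

for name, pct in reduction.items():
    print(f"{name}: {pct:.1f}% fewer unique tokens")
```

Read this way, Lancaster removes roughly 40% of the unique tokens, while Porter and Snowball each remove a little under a third.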

In [11]:
pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [12]:
# Let's compare the highest frequency tokens from the three stemmers
from tabulate import tabulate
from nltk import FreqDist
pdist = FreqDist(crimePstem)
ldist = FreqDist(crimeLstem)
sdist = FreqDist(crimeSstem)

compare = zip(pdist.most_common(20),
              ldist.most_common(20),
              sdist.most_common(20))

print(tabulate(compare, headers=["Porter", "Lancaster", "Snowball"]))

Porter          Lancaster       Snowball
--------------  --------------  --------------
(',', 16041)    (',', 16041)    (',', 16041)
('.', 8790)     ('.', 8790)     ('.', 8790)
('the', 7821)   ('the', 7853)   ('the', 7821)
('and', 6960)   ('and', 6960)   ('and', 6960)
('to', 5267)    ('to', 5267)    ('to', 5267)
('he', 4767)    ('he', 4767)    ('he', 4767)
('a', 4593)     ('a', 4593)     ('a', 4593)
('i', 4397)     ('i', 4397)     ('i', 4397)
('’', 4039)     ('’', 4039)     ('’', 4039)
('you', 4011)   ('you', 4019)   ('you', 4011)
('“', 3980)     ('“', 3980)     ('“', 3980)
('”', 3931)     ('”', 3931)     ('”', 3931)
('of', 3808)    ('of', 3808)    ('of', 3808)
('it', 3456)    ('it', 3456)    ('it', 3456)
('that', 3267)  ('that', 3267)  ('that', 3267)
('in', 3188)    ('in', 3201)    ('in', 3188)
('wa', 2824)    ('was', 2824)   ('was', 2824)
('!', 2363)     ('on', 2591)    ('!', 2363)
('?', 2277)     ('!', 2363)     ('?', 2277)
('hi', 2114)    ('?', 2277)     ('his', 2113)


In [13]:
# Question 3.2:
# Reference: https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg

# Can you think of some reasons why the word "the" has a different count for the Lancaster stemmer?
# The Lancaster stemmer is said to be very aggressive, so it could over-stem words like "their" or "there" down to "the", increasing the count for "the".

# What's going on near the end of the list where we have the following output: (('wa', 2824), ('was', 2824), ('was', 2824))
# The Porter stemmer has removed the "s" from the word "was", while the other two stemmers leave the word intact.

# The counts match, but what has the Porter stemmer done differently?
# Porter is a rule-based algorithm that does not consult a dictionary, so it removes a trailing "s" even when the word is not a plural.

# What conclusions can you draw about the advantages and disadvantages of various stemmers?
# Stemmers trade data reduction against accuracy: Lancaster reduces the vocabulary most but over-stems, while Porter and Snowball are gentler.
# Some stemmers also cope better than others with irregular words, such as verbs with odd conjugations.
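
To see why a purely rule-based stemmer can produce non-words such as "wa" and "hi", here is a deliberately naive sketch. It is not the Porter algorithm, only a single hypothetical suffix rule (`naive_strip_s`) applied without any dictionary check:

```python
def naive_strip_s(word):
    """Strip one trailing 's' from words longer than two letters.

    Hypothetical toy rule for illustration; real stemmers apply many
    ordered rules with extra conditions.
    """
    if len(word) > 2 and word.endswith("s"):
        return word[:-1]
    return word

for w in ["cats", "was", "his", "is"]:
    print(w, "->", naive_strip_s(w))
```

The rule helps with "cats" but damages "was" and "his", which is the same kind of trade-off visible in the Porter column above.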

# Section 3.2 #

Because stemming is quite variable in the results it produces, some NLP processing methods use lemmatization instead. A lemma is the root (dictionary) form of a word. Let's try this with the WordNet Lemmatizer:

In [14]:
nltk.download('wordnet')
nltk.download('omw-1.4')
wnl = nltk.WordNetLemmatizer()
type(wnl)

[nltk_data] Downloading package wordnet to C:\Users\Black
[nltk_data]     Knight\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Black
[nltk_data]     Knight\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


nltk.stem.wordnet.WordNetLemmatizer

In [15]:
wnl.lemmatize("am", pos ="v")

'be'

In [16]:
# Question 3.3
# a: Lemmatize "is" and "are" using wnl.lemmatize(), where pos="v"
# b: Test what happens if you leave out the pos argument?
# c: Write a comment describing what the pos argument does.

# Solution:
# The pos argument stands for part of speech; it specifies the part of speech of the word being passed in. If it is left out, the default is noun, so verb forms such as "is" and "are" are often returned unchanged.

In [17]:
wnl.lemmatize("is", pos ="v")
wnl.lemmatize("are", pos ="v")

'be'

In [18]:
wnl.lemmatize("is")

'is'

In [19]:
wnl.lemmatize("are")

'are'

# Section 3.3 #
WordNet really seems to require that the user specify the part of speech. Without that specification, there are likely to be errors.
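
One way to picture why the part of speech matters is to treat lemmatization as a lookup keyed on both the word and its part of speech. The sketch below is a toy stand-in, not how WordNet is implemented; the table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical, and the default of "n" simply mirrors the noun default seen earlier.

```python
# Toy lookup table keyed on (word, part of speech); illustrative only.
TOY_LEMMAS = {
    ("am", "v"): "be",
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("are", pos="v"))  # found under the verb key
print(toy_lemmatize("are"))           # no noun entry, so returned unchanged
```

With the wrong part of speech the lookup silently falls through, which is the behavior observed with `wnl.lemmatize("are")`.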

Switching gears for a moment, one way of capturing more contextual information in our token lists is to analyze tokens in sets of two or more. A sequence of two tokens is called a bigram, a sequence of three is called a trigram, and more generally a sequence of any number "n" of tokens is called an n-gram.

NLTK and other language packages contain numerous tools for working with bigrams. Let's look at the output of the NLTK ngrams() function:

In [20]:
# Rather than a whole book, let's begin by working with one sentence:
sentence = "thomas jefferson began building monticello at the age of twenty-six."
len(sentence)  # len() counts every character in the string, including letters, spaces, and punctuation marks.

68

In [21]:
from nltk.util import ngrams
import re # Regular expressions library
pattern = re.compile('[a-z]+')
tokens = pattern.findall(sentence)
list(ngrams(tokens, 2))

[('thomas', 'jefferson'),
 ('jefferson', 'began'),
 ('began', 'building'),
 ('building', 'monticello'),
 ('monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', 'twenty'),
 ('twenty', 'six')]

In [22]:
# We can easily repeat the process for trigrams:
list(ngrams(tokens, 3))

[('thomas', 'jefferson', 'began'),
 ('jefferson', 'began', 'building'),
 ('began', 'building', 'monticello'),
 ('building', 'monticello', 'at'),
 ('monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', 'twenty'),
 ('of', 'twenty', 'six')]
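
Conceptually, an n-gram generator just slides a window of width n across the token list. The following plain-Python sketch (the helper name `simple_ngrams` is hypothetical, and this is not NLTK's implementation) reproduces the bigram and trigram tuples shown above:

```python
import re

def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, which yields len(tokens) - n + 1 windows
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "thomas jefferson began building monticello at the age of twenty-six."
tokens = re.findall("[a-z]+", sentence)

window2 = simple_ngrams(tokens, 2)
window3 = simple_ngrams(tokens, 3)
print(len(tokens), len(window2), len(window3))
```

A list of k tokens always yields k - n + 1 n-grams, which is why the trigram list is one item shorter than the bigram list.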

Research has shown that there is value in understanding the context around words, i.e., the other words that occur nearby. This idea, known as "**The Distributional Hypothesis**" and proposed by linguist Zellig Harris, holds that words with similar meanings tend to occur in similar contexts.

In [23]:
bigrams = [" ".join(w) for w in ngrams(tokens, 2)]
print(bigrams)

['thomas jefferson', 'jefferson began', 'began building', 'building monticello', 'monticello at', 'at the', 'the age', 'age of', 'of twenty', 'twenty six']


In [24]:
# Question 3.4:
# Build trigram tokens from the Thomas Jefferson tokens.
# Make sure the trigram tokens have spaces between the component words as shown in the bigram example in the code block just above.

# Solution
trigrams = [" ".join(w) for w in ngrams(tokens, 3)]
print(trigrams)

['thomas jefferson began', 'jefferson began building', 'began building monticello', 'building monticello at', 'monticello at the', 'at the age', 'the age of', 'age of twenty', 'of twenty six']


In [25]:
# Given only one sentence, that result is not very exciting, but what if we did a whole book?
nltk.download('stopwords')
nltk_stops = nltk.corpus.stopwords.words('english')

crimenopunct = [w for w in crimetokens if w.isalnum()]
crimenostops = [w for w in crimenopunct if w not in nltk_stops]
crimebigrams = [" ".join(w) for w in ngrams(crimenostops, 2)]

fdist = FreqDist(crimebigrams) # This creates a list of frequencies for bigrams
len(fdist) # This is the total number of unique bigrams

[nltk_data] Downloading package stopwords to C:\Users\Black
[nltk_data]     Knight\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


74155

In [26]:
fdist.most_common(10) # What do you notice about the most frequent bigrams?
# Nearly all of the most frequent bigrams are character names (first name plus patronymic); "old woman" is the only exception.

[('katerina ivanovna', 215),
 ('pyotr petrovitch', 172),
 ('pulcheria alexandrovna', 123),
 ('avdotya romanovna', 112),
 ('old woman', 91),
 ('rodion romanovitch', 82),
 ('porfiry petrovitch', 81),
 ('marfa petrovna', 76),
 ('sofya semyonovna', 71),
 ('amalia ivanovna', 54)]

**Part Two**

The WordNet lemmatizer does not work very well unless we already know the part of speech of the word we are trying to lemmatize. This is a significant limitation.

Given the limitations of stemmers and simple lemmatizers, it is time to take a more serious look at part of speech tagging. For this, we are going to graduate from NLTK to our first effort with spaCy. Whereas NLTK was designed for teaching and research, spaCy was architected so that it can serve as the basis of a production-grade NLP pipeline. Unlike some other NLP toolkits (e.g., Stanford CoreNLP, which is written in Java), spaCy is written in Python and Cython, so it is convenient for use directly from the Jupyter notebook environment. For now, we will just try out a few basic techniques.


# Section 3.4 #

In [27]:
pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


In [28]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [29]:
import spacy
nlp = spacy.load('en_core_web_sm') # sm means small - some pipeline capabilities not loaded
type(nlp) # This is our pipeline: an instantiated class that we can use to process any string
# You can ignore this warning if you see it: "UserWarning: Can't initialize NVML"

spacy.lang.en.English

In [30]:
# Let's process a small example from NLPIA first
sentence = "The faster Harry got to the store, the faster Harry would get home."
spsent = nlp(sentence)
type(spsent), len(spsent)

(spacy.tokens.doc.Doc, 15)

In [31]:
[m for m in dir(spsent) if m[0] != '_']

['cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_ents',
 'set_extension',
 'similarity',
 'spans',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_bytes',
 'to_dict',
 'to_disk',
 'to_json',
 'to_utf8_array',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']

In [32]:
spsent.has_annotation("TAG")

True

In [33]:
tags = [(i, i.pos_) for i in spsent]
print(tabulate(tags, headers=["Token", "POS Tag"]))

Token    POS Tag
-------  ---------
The      DET
faster   ADJ
Harry    PROPN
got      VERB
to       ADP
the      DET
store    NOUN
,        PUNCT
the      DET
faster   ADJ
Harry    PROPN
would    AUX
get      VERB
home     ADV
.        PUNCT


In [34]:
from tabulate import tabulate

# Make a little dataset for tabulate() to work on.
poslist = [ (i, i.lemma_, i.pos_) for i in spsent]

print(tabulate(poslist,  headers=["Token", "Lemma", "Tag"]))

Token    Lemma    Tag
-------  -------  -----
The      the      DET
faster   fast     ADJ
Harry    Harry    PROPN
got      get      VERB
to       to       ADP
the      the      DET
store    store    NOUN
,        ,        PUNCT
the      the      DET
faster   fast     ADJ
Harry    Harry    PROPN
would    would    AUX
get      get      VERB
home     home     ADV
.        .        PUNCT


In [35]:
[m for m in dir(spsent[0]) if m[0:2] == 'is']

['is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper']

In [36]:
# Let's make a more detailed table with a few of these fields.
poslist = [ (i, i.head, i.lemma_, i.pos_, i.tag_, i.is_alpha) for i in spsent]

print(tabulate(poslist,  headers=["Token", "Head", "Lemma", "Tag", "Details","Alpha?"]))


Token    Head    Lemma    Tag    Details    Alpha?
-------  ------  -------  -----  ---------  --------
The      Harry   the      DET    DT         True
faster   Harry   fast     ADJ    JJR        True
Harry    got     Harry    PROPN  NNP        True
got      get     get      VERB   VBD        True
to       got     to       ADP    IN         True
the      store   the      DET    DT         True
store    to      store    NOUN   NN         True
,        get     ,        PUNCT  ,          False
the      Harry   the      DET    DT         True
faster   Harry   fast     ADJ    JJR        True
Harry    get     Harry    PROPN  NNP        True
would    get     would    AUX    MD         True
get      get     get      VERB   VB         True
home     get     home     ADV    RB         True
.        get     .        PUNCT  .          False


The table above just scratches the surface, but there's still a lot of interesting stuff happening there. In the first column we have the token itself, which can be a word, a number, or punctuation. The second column starts to unpack the idea of dependency grammar - that each word in a sentence represents a portion of a tree, with "ancestors" that it depends on and "children" that depend on it. "Head" refers to the immediate ancestor of a word. So for instance, the proper noun "Harry" depends on the corresponding verb "got." Next we have the lemmas and the simple part of speech tag as before. By the way, you can find an explanation of these tags here:

https://universaldependencies.org/docs/u/pos/
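
To make the "head" idea concrete, the sketch below rebuilds the dependency links from the Token and Head columns of the table above and follows them up to the root. It does not use spaCy; the index positions are an assumption (the table shows head text only, so each repeated word such as "Harry" is assumed to point at the nearest matching token), and `path_to_root` is a hypothetical helper.

```python
tokens = ["The", "faster", "Harry", "got", "to", "the", "store", ",",
          "the", "faster", "Harry", "would", "get", "home", "."]

# head_index[i] is the position of token i's head (assumed from the table).
# The root is the token that is its own head.
head_index = [2, 2, 3, 12, 3, 6, 4, 12, 10, 10, 12, 12, 12, 12, 12]

def path_to_root(i):
    """Follow head links from token i up to the root of the tree."""
    path = [tokens[i]]
    while head_index[i] != i:
        i = head_index[i]
        path.append(tokens[i])
    return path

root = next(i for i, h in enumerate(head_index) if i == h)
print(tokens[root])
print(" -> ".join(path_to_root(6)))
```

In the table, "get" lists itself as its head, which is how the root of the sentence can be recognized.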

Finally, there is a fine-grained part of speech - a more complicated tag provided by spaCy. These are unique to each language model, but there is a function call that will provide information about any of the tags:

In [37]:
spacy.explain("JJR")

'adjective, comparative'

**Checkpoint!** Use spacy.explain("RB")



In [38]:
# Question 3.5
# Add and run spacy.explain("RB")

spacy.explain("RB")

'adverb'

In [39]:
# Let's practice by tagging another sentence. Here's some text extracted from
# Wikipedia's article on kites.
kites = """A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces.
A kite consists of wings, tethers and anchors. Kites often have a bridle and tail to guide the face of the kite so the wind can lift it.
Some kite designs don’t need a bridle; box kites can have a single attachment point.
A kite may have fixed or moving anchors that can balance the kite.
One technical definition is that a kite is “a collection of tether-coupled wing sets“.
The name derives from its resemblance to a hovering bird."""

spkites = nlp(kites)
type(spkites), len(spkites)

(spacy.tokens.doc.Doc, 130)

In [40]:
# Question 3.6:
# Display tokens, lemmas, and parts of speech for spkites.
# Try using a nice, neat tabular format for the output.

# Solution
poslist = [ (i, i.lemma_, i.pos_) for i in spkites]

print(tabulate(poslist,  headers=["Token", "Lemma", "Tag"]))

Token        Lemma        Tag
-----------  -----------  -----
A            a            DET
kite         kite         NOUN
is           be           AUX
a            a            DET
tethered     tethered     ADJ
heavier      heavy        ADJ
-            -            PUNCT
than         than         ADP
-            -            PUNCT
air          air          NOUN
or           or           CCONJ
lighter      light        ADJ
-            -            PUNCT
than         than         ADP
-            -            PUNCT
air          air          NOUN
craft        craft        NOUN
with         with         ADP
wing         wing         NOUN
surfaces     surface      NOUN
that         that         PRON
react        react        VERB
against      against      ADP
the          the          DET
air          air          NOUN
to           to           PART
create       create       VERB
lift         lift         NOUN
and          and          CCONJ
drag         drag         NOUN
forces       

In [41]:
# It might be more convenient to work with individual sentences:
kitespans = list(spkites.sents)
kitespans[0] # Let's view just the first sentence

A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces.

In [42]:
len(kitespans)

7

In [43]:
from IPython.display import display
# And import any other functions here as well: from IPython.display import HTML, Image


In [44]:
# One other neat trick: We can use spaCy to display a graphical
# version of the dependency tree for any sentence or document.
from spacy import displacy
from IPython.display import display, HTML
html = displacy.render(kitespans[0], style="dep", jupyter=False)

display(HTML(html))


In [45]:
# Question 3.7
# Add a dependency structure graph for the second sentence in kites.

# Solution
html1 = displacy.render(kitespans[1], style="dep", jupyter=False)

display(HTML(html1))

Let's close the loop on the idea of data reduction by seeing how many unique lemmas spaCy creates for Crime and Punishment.

Recall that we were unsatisfied with the lemmatizer from NLTK because, for it to work well, we needed to know the POS for each token before calling the lemmatizer.

The spaCy nlp() call ingests our whole text, applies tags, and determines lemmas, all based on a swappable language model.

In [46]:
# Process Crime and Punishment with spaCy: takes a minute!
nlp.max_length = 1200000
crimespacy = nlp(raw) # We're going back to the original raw text data!
type(crimespacy), len(crimespacy)

(spacy.tokens.doc.Doc, 270889)

In [47]:
# Let's count unique lemmas
newcrimelemma = [l.lemma_ for l in crimespacy]
len(set(newcrimelemma))

7570

In [48]:
len(set(newcrimelemma))/len(set(crimetokens))

0.7289359653346172

**Part Three**

One essential way of representing a corpus is to transform the token counts into a "**Document Term Matrix**" (DTM) or a transposed version of the same thing, a "**Term Document Matrix**."

The most basic DTM contains word frequencies in each cell. A more advanced DTM contains adjusted values known as **TF-IDF** (term frequency, inverse document frequency).
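
As a rough illustration of the two kinds of cell values, the sketch below builds a count DTM and a TF-IDF DTM by hand for three made-up documents. It uses the textbook weighting tf × log(N / df); scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]  # made-up toy corpus
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Count DTM: one row per document, one column per vocabulary term
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
n_docs = len(docs)

# Textbook TF-IDF: term count times log(N / df)
tfidf = [[row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
         for row in counts]

print(vocab)
for row in counts:
    print(row)
```

A word that appears in every document, like "the" here, gets a weight of zero, which is the point of the IDF adjustment: terms that distinguish documents are weighted up and ubiquitous ones are weighted down.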

Creating a DTM, either with counts or with TF-IDF values begins with a process called vectorization. Let's vectorize Crime and Punishment, treating each sentence as a document.

In [49]:
crimespans = list(crimespacy.sents)
crimespans[42:45] # Let's view three sample sentences

[His
 garret was under the roof of a high, five-storied house and was more
 like a cupboard than a room.,
 The landlady who provided him with garret,
 dinners, and attendance, lived on the floor below, and every time
 he went out he was obliged to pass her kitchen, the door of which
 invariably stood open.,
 And each time he passed, the young man had a
 sick, frightened feeling, which made him scowl and feel ashamed.]

In [50]:
type(crimespans[42]) # Check on the type of a single sentence

spacy.tokens.span.Span

In [51]:
# Create a vectorizer using the powerful scikit-learn library.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a vectorizer, removing stopwords, setting min doc frequency
vectorizer = CountVectorizer(min_df=1, stop_words='english', lowercase=True)

crimesparse = vectorizer.fit_transform([ t.text for t in crimespans])
type(crimesparse)


scipy.sparse._csr.csr_matrix

In [52]:
# A sparse matrix DTM is excellent for efficient storage, but to do useful
# manipulations, we will need to blow it up into a data frame.
import pandas as pd
dtmDF = pd.DataFrame(crimesparse.toarray(),
                      columns=vectorizer.get_feature_names_out())

dtmDF.shape # Make sure you know what these numbers are: Confirm with your partner!

(14177, 9295)

In [53]:
dtmDF

Unnamed: 0,14,1849,1859,1861,1864,1880,2554,47,_a,_a_,...,zest,zeus,zigzags,zimmerman,zossimov,æsthetic,æsthetically,æsthetics,éternelle_,êtes
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14172,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14173,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14174,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14175,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [54]:
# We can look up any word in the DTM by name and find out how frequently it occurs.
dtmDF['priest'].sum() # We're computing the column sum of word counts

np.int64(19)

In [55]:
# 3.8: Get a total frequency count for a different word.

# Solution
dtmDF['money'].sum() # We're computing the column sum of word counts

np.int64(166)

In [56]:
# Let's make a complete frequency list of all words (columns)
wordfreqs = [ (word, dtmDF[word].sum()) for word in vectorizer.get_feature_names_out()]
wordfreqs.sort(key=lambda w: w[1], reverse=True)
# Show the top 20 items
wordfreqs[0:20]

[('raskolnikov', np.int64(785)),
 ('know', np.int64(529)),
 ('said', np.int64(519)),
 ('did', np.int64(497)),
 ('come', np.int64(479)),
 ('man', np.int64(479)),
 ('don', np.int64(464)),
 ('like', np.int64(453)),
 ('sonia', np.int64(402)),
 ('time', np.int64(385)),
 ('went', np.int64(356)),
 ('razumihin', np.int64(347)),
 ('dounia', np.int64(325)),
 ('thought', np.int64(306)),
 ('ivanovna', np.int64(304)),
 ('say', np.int64(296)),
 ('looked', np.int64(293)),
 ('suddenly', np.int64(293)),
 ('little', np.int64(288)),
 ('petrovitch', np.int64(287))]

In [57]:
# Rodion Raskolnikov and Dmitri Prokofych Razumikhin (spelled "Razumihin" in this translation) are focal characters in the book,
# so it is pretty cool that their names are among the most frequently appearing terms in our DTM.

# Question 3.9:
# Make a list of *row sums* from our dtm using dtmDF.sum(axis=1).
# Examine this list to see if there are any documents that have a row sum of zero.
# What would this imply, if you found it?
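# Answer: a row sum of zero means the "sentence" contains no terms from the vocabulary, for example spans that are only
# the whitespace between paragraphs, punctuation, or stop words.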

# Solution
sorted(dtmDF.sum(axis=1))

[0,
 0,
 0,
 0,
 0,
 ...


In [58]:
# Let's find out all of the words that are included in the default list of stop words used by CountVectorizer
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS
len(stop_words)

318

In [59]:
# Question 3.10:
# Print out the contents of stop_words.
# Review it carefully. Are there any surprises?
# The number words (e.g., 'one', 'two', 'three', 'eight', 'hundred') are surprising to me, since they carry meaning by stating a quantity.

# Solution
print(stop_words)

frozenset({'himself', 'due', 'towards', 'beyond', 'noone', 'therefore', 'our', 'every', 'and', 'only', 'others', 'seeming', 'through', 'per', 'three', 'last', 'are', 'since', 'co', 'why', 'whereafter', 'someone', 'give', 'never', 'anyway', 'becoming', 'neither', 'herself', 'toward', 'side', 'with', 'in', 'that', 'system', 'else', 'becomes', 'via', 'although', 'yourselves', 'whose', 'get', 'cannot', 'should', 'un', 'over', 'up', 'together', 'until', 'upon', 'among', 'eg', 'anywhere', 'put', 'nor', 'once', 'you', 'became', 'without', 'de', 'against', 'within', 'bill', 'were', 'around', 'ever', 'first', 'hundred', 'namely', 'name', 'anyhow', 'below', 'former', 'one', 'hereby', 'ie', 'elsewhere', 'along', 'yourself', 'to', 'two', 'or', 'between', 'has', 'before', 'most', 'they', 'as', 'out', 'herein', 're', 'further', 'any', 'whom', 'perhaps', 'empty', 'detail', 'everything', 'next', 'wherein', 'under', 'her', 'a', 'been', 'mill', 'such', 'many', 'eight', 'seem', 'about', 'often', 'anythin

In [60]:
# Let's conclude with a primitive analysis of the dtm.
# First we'll make two subsets of our data, based on mentions of characters:
raskolDF = dtmDF[dtmDF.raskolnikov > 0]
razumihinDF = dtmDF[dtmDF.razumihin > 0]
raskolDF.shape, razumihinDF.shape

((775, 9295), (343, 9295))

In [61]:
# This computes the ratio of how often the word "good" occurs in the Raskolnikov subset to how often it occurs in the Razumihin subset.
raskolDF['good'].sum()/razumihinDF['good'].sum()

np.float64(3.1666666666666665)

In [62]:
# Question 3.11: Obtain a ratio of total word frequency for a word other than good

# Solution
raskolDF['time'].sum()/razumihinDF['time'].sum()

np.float64(1.9090909090909092)

In [63]:
# Question 3.12: Revectorize Crime and Punishment sentences with TF-IDF

# Solution
from sklearn.feature_extraction.text import TfidfVectorizer
tfvectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

crimesparsetf = tfvectorizer.fit_transform([ t.text for t in crimespans])
type(crimesparsetf)

scipy.sparse._csr.csr_matrix

In [64]:
# Question 3.13: Convert vectorization results (the TF-IDF DTM) to pandas data frame

# Solution
dtmDFtf = pd.DataFrame(crimesparsetf.toarray(),
                      columns=tfvectorizer.get_feature_names_out())

dtmDFtf.shape

(14177, 9295)

In [65]:
# Question 3.14:
# Repeat one or more of the diagnostic tests demonstrated for the count vectorization.

# Solution
raskolDFtf = dtmDFtf[dtmDFtf.raskolnikov > 0]
razumihinDFtf = dtmDFtf[dtmDFtf.razumihin > 0]
raskolDFtf.shape, razumihinDFtf.shape

((775, 9295), (343, 9295))

In [66]:
# Question 3.15: Repeat ratio tests, comparing the contents of the DTMs for the two characters.

# Solution
raskolDFtf['time'].sum()/razumihinDFtf['time'].sum()

np.float64(2.5104687596893855)

**Part Four**

## Pointwise Mutual Information (PMI)

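PMI measures how much more (or less) often two words co-occur than would be expected if they were independent: it is the base-2 logarithm of the joint probability of the pair divided by the product of the two individual word probabilities.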

Here's an example: Let's say that "fish" occurs five times in 100 words, while "cake" appears eight times. The combination "fish cake" appears 3 times. Now run the code below:

In [67]:
pfish = 5/100
pcake = 8/100
pfishcake = 3/100

import math # We will need the log2() function
pmi = math.log2( pfishcake / (pfish * pcake))
print(pmi)

2.9068905956085187


So based on this result, "fish" and "cake" occur together about 7.5 times as often as would be expected based on how often they appear independently (2^2.907 ≈ 7.5). You can fiddle around with the probability values to see how they affect the PMI calculation.
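
The same calculation can be wrapped in a small function that works from raw counts. This is a sketch with hypothetical names (`pmi_from_counts` and its parameters), using the fish and cake numbers from the example above:

```python
import math

def pmi_from_counts(pair_count, x_count, y_count, total):
    """PMI in bits: log2 of P(x, y) / (P(x) * P(y)), estimated from counts."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

score = pmi_from_counts(pair_count=3, x_count=5, y_count=8, total=100)
print(round(score, 4))
print(round(2 ** score, 2))  # how many times more often than chance
```

A PMI of 0 means the pair occurs exactly as often as independence predicts, and a negative PMI means it occurs less often.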

To do this kind of analysis at scale, we'll pull in some code from the nltk.collocations module.

In [68]:
from nltk.collocations import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
bigram_measures = BigramAssocMeasures()  # Use the class imported above
finder = BigramCollocationFinder.from_words(crimetokens)
type(bigram_measures), type(finder)

(nltk.metrics.association.BigramAssocMeasures,
 nltk.collocations.BigramCollocationFinder)

In [69]:
# The NLTK Pointwise Mutual Information scoring function, pmi,
# scores each bigram relative to the frequencies of its two component words.
# Bigrams made of infrequent words get a boost in PMI, so a higher score
# flags a potentially more interesting bigram, though not necessarily a common one.
finder.apply_freq_filter(2)  # Ignore bigrams that occur fewer than 2 times
scored = finder.score_ngrams(bigram_measures.pmi)
# Examine the pairs with PMI greater than 16.5 (an arbitrary number chosen simply to keep the list short).
[bg for bg in scored if bg[1] > 16.5]

[(('_die', 'wäsche_'), 16.952690065907774),
 (('_sein', 'rock_'), 16.952690065907774),
 (('_special', 'case_'), 16.952690065907774),
 (('_tout', 'court_'), 16.952690065907774),
 (('alexandr', 'grigorievitch'), 16.952690065907774),
 (('du', 'mehr'), 16.952690065907774),
 (('ebook', '2554'), 16.952690065907774),
 (('en', 'va-t-en'), 16.952690065907774),
 (('gutenberg', 'ebook'), 16.952690065907774),
 (('unfailing', 'regularity'), 16.952690065907774),
 (('willst', 'du'), 16.952690065907774)]
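
The eleven-way tie at the top is what one would expect from the frequency filter. Assuming NLTK's PMI has the count form log2(pair × N / (left × right)), which matches the formula in the fish cake example, the highest possible score after `apply_freq_filter(2)` belongs to a pair seen exactly twice whose two words also occur exactly twice. The sketch below uses a hypothetical corpus size `N` (not the real token count) to show the resulting tiers:

```python
import math

def pmi_counts(pair, left, right, n_tokens):
    """Count form of PMI: log2(pair * N / (left * right))."""
    return math.log2(pair * n_tokens / (left * right))

N = 250_000  # hypothetical corpus size, for illustration only

top = pmi_counts(2, 2, 2, N)      # both words occur only inside the pair
second = pmi_counts(2, 2, 3, N)   # one word occurs once more elsewhere
third = pmi_counts(2, 2, 4, N)    # one word occurs twice more elsewhere

print(round(top - second, 5))  # gap between first and second tier
print(round(top - third, 5))   # gap between first and third tier
```

These gaps, log2(1.5) ≈ 0.585 and log2(2) = 1, do not depend on N, so the next tiers of scores should sit about that far below the top value whatever the corpus size.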

In [70]:
# Question 3.16:
# Lower the PMI threshold to 14 or 15 and examine some of the additional bigrams.
# What do you see? Are high PMI bigrams useful at telling us something about the corpus?
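# Somewhat: from the high-PMI bigrams I may be able to tell what type of story it is going to be (e.g., 'penal servitude', 'mental diseases'),
# although many of the top pairs are foreign phrases or Project Gutenberg boilerplate.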

# Solution
[bg for bg in scored if bg[1] > 14.5]

[(('_die', 'wäsche_'), 16.952690065907774),
 (('_sein', 'rock_'), 16.952690065907774),
 (('_special', 'case_'), 16.952690065907774),
 (('_tout', 'court_'), 16.952690065907774),
 (('alexandr', 'grigorievitch'), 16.952690065907774),
 (('du', 'mehr'), 16.952690065907774),
 (('ebook', '2554'), 16.952690065907774),
 (('en', 'va-t-en'), 16.952690065907774),
 (('gutenberg', 'ebook'), 16.952690065907774),
 (('unfailing', 'regularity'), 16.952690065907774),
 (('willst', 'du'), 16.952690065907774),
 (('cracking', 'nuts'), 16.36772756518662),
 (('va-t-en', 'guerre'), 16.36772756518662),
 (('_vater', 'aus'), 16.367727565186616),
 (('aus', 'berlin_'), 16.367727565186616),
 (('cleft', 'palate'), 15.952690065907774),
 (('cleft', 'palates'), 15.952690065907774),
 (('krestovsky', 'island'), 15.952690065907774),
 (('mental', 'diseases'), 15.952690065907774),
 (('ninety', 'versts'), 15.952690065907774),
 (('penal', 'servitude'), 15.952690065907774),
 (('titular', 'counsellor'), 15.952690065907774),
 (('d