In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("tutorial2_1.ipynb")

# Tutorial 2.1: Tokenization, Lemmatization, Stemming, and StopWords

Welcome to Tutorial 2.1!  In today's class we covered fundamentals of text processing and used `nltk` and `spacy`. 

In this tutorial, we will go deeper into `spacy`, `nltk`, and other libraries used to process text. 

First, set up the tests and imports by running the cell below.

In [None]:
# Run this cell, but please don't change it.

# These lines load the tests.
import otter
grader = otter.Notebook()

import nltk
import spacy
!python -m spacy download en_core_web_sm
import sklearn

import pandas as pd
import matplotlib.pyplot as plt
import textblob
#%matplotlib notebook
%matplotlib inline
import numpy as np

## 1. Karen Sparck Jones

The text we will use today is the NYTimes obituary for Karen Sparck Jones.
Make sure to first read the [obituary](https://www.nytimes.com/2019/01/02/obituaries/karen-sparck-jones-overlooked.html).

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_1
points: 2
manual: true
-->

**Question 1.1:** In at most 3 sentences, who was Karen Sparck Jones and what is her connection to this course?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q1_2
points: 1
-->

We have extracted Karen Sparck Jones's obituary from the NYTimes and stored it in `data/sparck-jones-obit.txt`.

**Question 1.2:** Read in the obituary and store it as a single string in the variable named `obit`.

In [None]:
obit = ...
" ".join(obit.split()[:50])

In [None]:
grader.check("q1_2")

<!--
BEGIN QUESTION
name: q1_3
points: 1
-->

**Question 1.3:** Use nltk's sentence tokenizer to convert the obituary into a list of sentences and store the first 3 sentences in the variable called `snippet`. `snippet` should be a single string in which the sentences are concatenated and separated by a space.

In [None]:
snippet = ...
snippet

In [None]:
grader.check("q1_3")

## 2. Tokenization

We will use three different nlp libraries to tokenize this snippet:

- nltk
- spacy
- Textblob

### 2.1 nltk

<!--
BEGIN QUESTION
name: q2_1
points: 
    - 0.1
    - 0.9
-->

**Question 2.1:** Use the default word tokenizer in nltk to tokenize the snippet. Store the list of tokens in a list called `nltk_tokens`. Make sure to first lowercase the text before applying the tokenizer.

*Hint: look at the completed Demo 05 to see an example of how we tokenized text* 

In [None]:
nltk_tokens = ...
nltk_tokens[40:50]

In [None]:
grader.check("q2_1")

### 2.2 Spacy

Spacy is another powerful and popular python library for working with text. 
This [Cheat Sheet](http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06) is a good reference for looking up how to do different things with text using spacy. We've included a link to this cheat sheet on the course webpage.
<br>
We will now walk through Spacy briefly.

#### Loading Spacy Model

Spacy has released pre-built models to tokenize, lemmatize, parse, and do other things with text, including extracting named entities. At the top of this notebook, we downloaded `en_core_web_sm`, one of these models for working with English text.

<!-- BEGIN QUESTION -->

The [Spacy documentation](https://spacy.io/models/en) describes this and other models. 

<!--
BEGIN QUESTION
name: q2_2
points: 1
manual: true
-->

**Question 2.2:** Based on the documentation, what type of texts was the model trained on?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



Run the next line to load the spacy English model.

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp

Based on the documentation, this `nlp` object is a text-processing pipeline. 
As the documentation mentions, "*usually you'll load this once per process,
and pass the instance around your application*."


#### Spacy Doc & Token objects

The next line will run the spacy NLP pipeline on this sentence. This will create a [Doc object](https://spacy.io/api/doc), which we assign to the variable named `example_doc`.

In [None]:
example_doc = nlp('The next line will run the spacy NLP pipeline on this sentence')
example_doc

A spacy Doc object is a container for accessing linguistic annotations.

<!--
BEGIN QUESTION
name: q2_3
points: 1
manual: false
-->

**Question 2.3:** In the next cell, determine how many tokens are in `example_doc` and assign the value to the variable named `number_example_tokens`.

*Hint:* How do we determine the number of elements in a list or dictionary?

*Note:* The test here does not check the number; we have kept that check on Gradescope as a hidden test.

In [None]:
number_example_tokens = ...
f"There are {number_example_tokens} tokens in the sentence \"{example_doc}\""

In [None]:
grader.check("q2_3")

When we iterate through a Doc object, we access each [Token object](https://spacy.io/api/token) one at a time. A spacy Doc object is a sequence, i.e. an ordered collection, of Token objects.

We can access these Token objects with indexing as seen in the next line.

In [None]:
token_first = example_doc[0]
token_last = example_doc[-1]

token_first, token_last

Although the previous line printed out the words, these Token objects contain a lot of information beyond the surface form word. Run the next line.

In [None]:
token_first.lemma, token_first.lemma_

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_4
points: 1
manual: true
-->

**Question 2.4:** Briefly explain what was printed out on the line above. Feel free to look at the Token object documentation [online](https://spacy.io/api/token)

_Type your answer here, replacing this text._

<!-- END QUESTION -->



The next cell shows some more information we can get from a Token object.

In [None]:
print(token_first.lower_)
print(token_first.is_stop)
print(token_first.is_currency)
print(token_first.is_alpha)
print(token_first.i)
print(token_first.like_url)

Go through the [documentation](https://spacy.io/api/token) on the Spacy website to familiarize yourself with the other types of information stored in a Token object. We will use this later in the tutorial.

##### Slicing Doc objects

Remember that we can get a subset of an array with slicing as shown in the next cell:

In [None]:
np.random.randn(10)[3:6]

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_5
points: 1
manual: true
-->


**Question 2.5:** When we slice a numpy array or a python list, what type is returned?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

We can similarly slice a Doc object.

<!--
BEGIN QUESTION
name: q2_6
points: 1
-->


**Question 2.6:** In the next cell, use index slicing to get the 3rd, 4th, and 5th Tokens from `example_doc` and
assign the result to the variable named `example_sliced_doc`.

In [None]:
example_sliced_doc = ...
example_sliced_doc

In [None]:
grader.check("q2_6")

`example_sliced_doc` is not a new Doc object; rather, it is a Spacy [Span](https://spacy.io/api/span#attributes) object.

In [None]:
type(example_sliced_doc)
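
As a rough sketch of what a Span carries, the example below slices a made-up sentence and inspects the result. It assumes `spacy.blank("en")`, a tokenizer-only pipeline used here only so the example runs without a model download; it is not the `en_core_web_sm` pipeline loaded above, and the names `blank_nlp`, `toy_doc`, and `toy_span` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show slicing.
blank_nlp = spacy.blank("en")
toy_doc = blank_nlp("Slicing a Doc gives back a view of its tokens")

toy_span = toy_doc[1:4]                       # tokens at positions 1, 2 and 3
span_type = type(toy_span).__name__           # name of the class that slicing returns
span_words = [tok.text for tok in toy_span]   # surface forms inside the span
span_bounds = (toy_span.start, toy_span.end)  # token offsets into the parent Doc
same_parent = toy_span.doc is toy_doc         # a Span keeps a reference to its Doc

print(span_type, span_words, span_bounds, same_parent)
```

Because a Span points back to its parent Doc, token attributes such as `tok.i` still refer to positions in the full document rather than positions within the slice.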

The next line applies the `nlp` pipeline on the snippet from earlier in the assignment.

<!--
BEGIN QUESTION
name: q2_7
points: 
    - 0.25
    - 0.25
    - 0.25
    - 0.5
    - 0.5
-->


**Question 2.7:** Complete the cell to loop through the spacy [Doc object](https://spacy.io/api/doc) called `doc` and add the lowercased text of each token to the list called `spacy_tokens`.

In [None]:
spacy_tokens = []
doc = nlp(snippet)

...
spacy_tokens[40:50]

In [None]:
grader.check("q2_7")

### 2.3 Textblob

Textblob is another python library commonly used for processing text. Running the next line will run the snippet through the NLP pipeline using textblob.

In [None]:
blob = textblob.TextBlob(snippet.lower())
blob

The next line will print out a textblob object's functions and attributes.

In [None]:
" ".join(dir(blob))

In [None]:
textblob_tokens = list(blob.tokens)
textblob_tokens[40:50]

### Comparing tokenizers

<!--
BEGIN QUESTION
name: q2_8
points: 1
-->


**Question 2.8:** In the next cell, write an expression to determine if the 3 lists of tokens each have the same number of tokens. Assign the boolean value to `same_tok_number`

In [None]:
same_tok_number = ...
same_tok_number

In [None]:
grader.check("q2_8")

<!-- BEGIN QUESTION -->

It is possible that the different libraries result in different lists of tokens. 

<!--
BEGIN QUESTION
name: q2_9
points: 2
manual: true
-->


**Question 2.9:** If the lists are different, in the next cell write what tokens are different and why might this be the case? If the lists are exactly the same, indicate that in the next cell.

*You are encouraged to create a new python cell to write code to determine what words are different.*

_Type your answer here, replacing this text._

<!-- END QUESTION -->
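
One possible way to locate where two token lists disagree is to align them with the standard library's `difflib`. The sketch below uses two hand-written token lists (hypothetical tokenizer outputs, not the results from the cells above), and `token_differences` is an illustrative helper name.

```python
from difflib import SequenceMatcher

# Hypothetical outputs of two tokenizers for the same text (made up for illustration).
tokens_a = ["she", "did", "n't", "stop", "."]
tokens_b = ["she", "didn't", "stop", "."]


def token_differences(left, right):
    """Return (left_chunk, right_chunk) pairs where the two token lists disagree."""
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    return [
        (left[i1:i2], right[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = token_differences(tokens_a, tokens_b)
print(differences)
```

Aligning the sequences, rather than comparing sets of tokens, keeps the position of each disagreement and shows which tokens one library split that the other kept whole.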



## 3. Part of Speech Tags

We are going to look at just the first two sentences of the obituary and explore different types of part of speech tags.

<!--
BEGIN QUESTION
name: q3_1
points: 1
manual: false
-->

**Question 3.1:** Use nltk's sentence tokenizer to convert the obituary into a list of sentences and store the first 2 sentences in the variable called `short_snippet`. `short_snippet` should be a single string in which the sentences are concatenated and separated by a space. Also, make sure to lowercase the text.

Make sure not to word tokenize the sentence.

In [None]:
short_snippet = ...
short_snippet

In [None]:
grader.check("q3_1")

The next cell will create a dataframe with part of speech tags for each token. 
There are different sets of labels used for part of speech tagging. Here, we will look at the 
[Penn Treebank (PTB)](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
and a simplified version of the [Universal](http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf) part of speech tags.

In [None]:
pos_df = pd.DataFrame()
pos_df['token'] = nltk.word_tokenize(short_snippet)
pos_df['PTB'] = nltk.pos_tag(pos_df['token']) #, tagset='en-ptb')
pos_df['PTB'] = pos_df['PTB'].map(lambda x: x[1])
pos_df['Universal'] = nltk.pos_tag(pos_df['token'], tagset='universal')
pos_df['Universal'] = pos_df['Universal'].map(lambda x: x[1])
pos_df.head(5)
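
The cell above relies on the tagger returning `(token, tag)` pairs, from which `.map(lambda x: x[1])` keeps only the tag. The sketch below shows that step in isolation on hand-written pairs; the tags in `tagged` are assumptions for illustration, not real tagger output.

```python
import pandas as pd

# Hypothetical tagger output: a list of (token, tag) pairs, written by hand here.
tagged = [("computers", "NNS"), ("process", "VBP"), ("text", "NN")]

toy_df = pd.DataFrame()
toy_df["token"] = [token for token, _ in tagged]
toy_df["pair"] = tagged                             # each cell holds a whole tuple
toy_df["tag"] = toy_df["pair"].map(lambda x: x[1])  # keep only the second element

print(toy_df)
```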

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_2
points: 2
manual: true
-->

**Question 3.2:** What is the PTB tag and what is the Universal tag for the token `people`? What is different about these tags and what do they mean?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_3
points: 2
manual: true
-->

**Question 3.3:** What are the PTB tags for the 4th token (were), 5th token (trying) and 12th token (talk)? What do the differences mean?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q3_4
points: 1
-->

**Question 3.4:** Complete the missing code in the next cell to loop through each token in `snippet_doc` and add the Spacy simple part of speech tag for each token to `pos_df`. Assign the new column in `pos_df` the name `Spacy_Simple_POS`.

In [None]:
spacy_pos = []
snippet_doc = nlp(short_snippet)

for tok in snippet_doc:
    ...

pos_df.head(5)

In [None]:
grader.check("q3_4")

<!--
BEGIN QUESTION
name: q3_5
points: 2
manual: false
-->

**Question 3.5:** Complete the missing code in the next cell to add the Spacy detailed part of speech tag for each token to `pos_df`. Assign the new column in `pos_df` the name `Spacy_Detailed_POS`.

In [None]:
spacy_pos = []
snippet_doc = nlp(short_snippet)

for tok in snippet_doc:
    ...

pos_df.head(5)

In [None]:
grader.check("q3_5")

<!--
BEGIN QUESTION
name: q3_6
points: 1
-->
**Question 3.6:** For which tokens does the Spacy Simple POS tag differ from the Universal tag in nltk? Assign them to the variable named `different_tags`, which should be a set or numpy array of the unique terms.

In [None]:
different_tags = ...
different_tags

<!-- BEGIN QUESTION -->

It is possible that some tags are different but convey more or less the same information. For example, periods are given the `Universal` tag `.` in nltk, but in `spacy` they have the tag `PUNCT`.

<!--
BEGIN QUESTION
name: q3_7
points: 2
manual: true
-->

**Question 3.7:** Looking at the tags for these terms, which of these tokens have very different tags between nltk and spacy and why might that be the case?

_Type your answer here, replacing this text._

<!-- END QUESTION -->
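
To separate tags that differ only in name from tags that differ in meaning, one option is to rename labels before comparing. The sketch below uses a hand-written toy table and an assumed renaming table `NLTK_TO_SPACY`; both are illustrative, the renamings are a judgment call rather than an official mapping, and none of the rows are real tagger output.

```python
import pandas as pd

# Assumed renamings between the two label sets; extend or adjust as needed.
NLTK_TO_SPACY = {".": "PUNCT", "CONJ": "CCONJ", "PRT": "PART"}

# Hand-written toy rows, not real tagger output.
toy_tags = pd.DataFrame(
    {
        "token": ["she", "was", "tired", "and", "left", "."],
        "nltk_universal": ["PRON", "VERB", "ADJ", "CONJ", "VERB", "."],
        "spacy_simple": ["PRON", "AUX", "ADJ", "CCONJ", "VERB", "PUNCT"],
    }
)

# Rename nltk labels that have a direct counterpart and leave the others as-is.
normalized = toy_tags["nltk_universal"].replace(NLTK_TO_SPACY)
real_disagreements = toy_tags[normalized != toy_tags["spacy_simple"]]

print(real_disagreements)
```

Rows that remain after the renaming are the ones worth inspecting by hand.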



## 4 Lemmatization in nltk

The following line of code prints out the morphological substitutions that are used for lemmatization in nltk.

In [None]:
from nltk.corpus import wordnet
wordnet.MORPHOLOGICAL_SUBSTITUTIONS

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4_1
points: 2
manual: true
-->
**Question 4.1:** Briefly describe what is printed out above.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



In Monday's class we saw how the above rules change how we lemmatize a word based on its pos tag.
Run the next line to see how we lemmatize `leaves` depending on its POS tag

In [None]:
lemmatizer = nltk.wordnet.WordNetLemmatizer()
lemmatizer.lemmatize("leaves", 'v'), lemmatizer.lemmatize("leaves", 'n')

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4_2
points: 2
manual: true
-->
**Question 4.2:** In the next cell, come up with another word that has different lemmas based on its part of speech. Feel free to use the substitutions printed in 4.1 as a guide.

<!-- END QUESTION -->



## 5. Stopwords

The following cell reads in stopwords from three popular libraries. 
`sklearn` is a library used for machine learning; we will cover it later in the course.

In [None]:
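# Import the submodules explicitly so the attribute lookups below always resolve.
import sklearn.feature_extraction.text
import spacy.lang.en.stop_words
nltk_sw = set(nltk.corpus.stopwords.words('english'))
spacy_sw = spacy.lang.en.stop_words.STOP_WORDS
sklearn_sw = sklearn.feature_extraction.text.ENGLISH_STOP_WORDS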

<!--
BEGIN QUESTION
name: q5_1
points: 1
manual: false
-->

**Question 5.1:** How many words are in each of the lists of stopwords?

In [None]:
number_sklearn_sw = ...
number_spacy_sw = ...
number_nltk_sw = ...

number_sklearn_sw, number_spacy_sw, number_nltk_sw

In [None]:
grader.check("q5_1")

<!--
BEGIN QUESTION
name: q5_2
points: 1
-->

**Question 5.2:** What words are unique to the stopword list in sklearn? In other words, what words are in sklearn's stop list that are not in the other two lists? Assign these words to the variable named `unique_sklearn_sw`.

*Hint:* Python `set`s have methods for determining the values that are in one set but not in another.
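
As a minimal illustration of that kind of set operation, the sketch below uses three made-up miniature word lists (`list_a`, `list_b`, `list_c` are hypothetical and unrelated to the real stop lists loaded above).

```python
# Hypothetical miniature word lists, made up for illustration.
list_a = {"the", "and", "cry", "system"}
list_b = {"the", "and", "ca"}
list_c = {"the", "and", "ain"}

# Words in list_a that appear in neither of the other two lists.
only_in_a = list_a - (list_b | list_c)
# The same result using the named method instead of operators.
only_in_a_method = list_a.difference(list_b, list_c)

print(sorted(only_in_a))
```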

In [None]:
unique_sklearn_sw = ...
unique_sklearn_sw

In [None]:
grader.check("q5_2")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q5_3
points: 1
manual: true
-->

**Question 5.3:** Looking at these examples, what is a situation in which we might not want to use this list of stop words? In other words, give an example of a type of research question or specific domain where removing some of these words would be a bad idea.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q5_4
points: 1
-->

**Question 5.4:** What words are unique to the stopword list in nltk? In other words, what words are in nltk's stop list that are not in the other two lists? Assign these words to the variable named `unique_nltk_sw`.

In [None]:
unique_nltk_sw = ...
unique_nltk_sw

In [None]:
grader.check("q5_4")

<!--
BEGIN QUESTION
name: q5_5
points: 1
-->

**Question 5.5:** Is there a particular category that you notice among the words that are unique to nltk's stopwords?

_Type your answer here, replacing this text._

<!--
BEGIN QUESTION
name: q5_6
points: 1
-->

**Question 5.6:** What words are unique to the stopword list in spacy? In other words, what words are in spacy's stop list that are not in the other two lists? Assign these words to the variable named `unique_spacy_sw`.

In [None]:
unique_spacy_sw = ...
unique_spacy_sw

In [None]:
grader.check("q5_6")

<!--
BEGIN QUESTION
name: q5_7
points: 1
-->

**Question 5.7:** Comparing the `unique_spacy_sw` with `unique_nltk_sw`, what do you think might be causing these differences?

_Type your answer here, replacing this text._

[This paper](https://www.aclweb.org/anthology/W18-2502.pdf) discusses issues with blindly using stopwords from open source libraries. Specifically, 

>We have found
that popular stop lists, which users often apply
blindly, may suffer from surprising omissions and
inclusions, or their incompatibility with particular
tokenizers

The lesson here is to be careful when applying existing lists of stopwords to clean your corpora.
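
The tokenizer incompatibility mentioned in the quote can be sketched with a toy case. Everything below is made up for illustration: `toy_stop_list` is a hypothetical stop list that stores a contraction as a single entry, and `treebank_style_tokens` is a hand-written token list in the Penn Treebank style, which splits contractions.

```python
# Assumed stop list that stores contractions as single entries.
toy_stop_list = {"they", "don't", "the"}

text = "They don't like the rain"

# Tokenizer 1: plain whitespace splitting keeps "don't" in one piece.
whitespace_tokens = text.lower().split()
# Tokenizer 2: hand-written tokens in the Penn Treebank style, which splits contractions.
treebank_style_tokens = ["they", "do", "n't", "like", "the", "rain"]

kept_whitespace = [t for t in whitespace_tokens if t not in toy_stop_list]
kept_treebank = [t for t in treebank_style_tokens if t not in toy_stop_list]

print(kept_whitespace)
print(kept_treebank)
```

With the second tokenizer, the pieces of the contraction never match the stop list entry, so they survive the filtering step.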

## 6. Exploring the obituary

This section demonstrates how we can explore the obituary. There is one question at the end of this section.

The following cell defines a function, `spacy_doc2pd`, that builds a dataframe in which each row represents a token. The result is stored in `obit_df`. The columns of the dataframe are:
- `Word` (the original surface form of the token)
- `Lower` (the lowercased version of the token) 
- `POS` (the simplified universal pos tag of the token)
- `Lemma` (the lemma of the word)
- `Stop Word` (boolean if the word is a stop word)
- `Punctuation` (boolean if the word is punctuation)

The cell will first pass the obit through the Spacy nlp pipeline and then apply the function.

In [None]:
def spacy_doc2pd(doc):
    toks_df = pd.DataFrame()
    toks_df["Word"] = [word.text for word in doc]
    toks_df["Lower"] = [word.lower_ for word in doc] 
    toks_df["POS"]  = [word.pos_ for word in doc] 
    toks_df["Lemma"] =[word.lemma_ for word in doc] 
    toks_df["Stop Word"] =[word.is_stop for word in doc]  
    toks_df["Punctuation"] =[word.is_punct for word in doc] 
    return toks_df

with open("data/sparck-jones-obit.txt") as f:
    obit = f.read()
doc = nlp(obit)
obit_df = spacy_doc2pd(doc)
obit_df.head()

The next cell makes a barplot to show how many words have each Part of Speech Tag. 

In [None]:
ax = obit_df['POS'].value_counts().plot(kind='bar', rot=45)
ax.set_title("Count of POS tags in Sparck Jones' obit")

While nouns are the most frequent part of speech, the most common token is not a noun. Rather, it is a comma.

In [None]:
obit_df['Lower'].value_counts().head(15)

If we look at lemmas instead, we can see some differences.

In [None]:
obit_df['Lemma'].value_counts().head(15)
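
Why lemma counts can differ from surface-form counts is easiest to see on a tiny example. The sketch below counts hand-written `(surface form, lemma)` pairs; the pairs are assumptions for illustration, and a real lemmatizer may produce different lemmas.

```python
from collections import Counter

# Hand-written (surface form, lemma) pairs; a real lemmatizer may differ.
pairs = [
    ("was", "be"), ("is", "be"), ("were", "be"),
    ("computers", "computer"), ("computer", "computer"), ("is", "be"),
]

surface_counts = Counter(word for word, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

print(surface_counts.most_common(2))
print(lemma_counts.most_common(2))
```

Several inflected forms collapse onto one lemma, so a lemma can rank far higher than any of its individual surface forms.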

Let's remove punctuation. The next cell will print the most common lemmas that are not punctuation marks.

In [None]:
obit_df[~obit_df['Punctuation']]['Lemma'].value_counts().head(15)

We can see that many of these lemmas are function words rather than content words.
The next line will determine the most common lemmas that are neither stop words nor punctuation.

In [None]:
obit_df[(~obit_df['Stop Word']) & (~obit_df['Punctuation']) ]['Lemma'].value_counts().head(20)
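
The filter above combines two boolean columns. As a small sketch of how that works, the example below applies the same pattern to a toy dataframe; the rows are made up, and only the column names mirror the ones used in `obit_df`.

```python
import pandas as pd

# Toy rows made up for illustration; column names mirror the ones used above.
toy = pd.DataFrame(
    {
        "Lemma": ["the", "computer", ",", "search", "be"],
        "Stop Word": [True, False, False, False, True],
        "Punctuation": [False, False, True, False, False],
    }
)

not_stop = ~toy["Stop Word"]              # element-wise NOT of a boolean column
not_punct = ~toy["Punctuation"]
content_rows = toy[not_stop & not_punct]  # element-wise AND of the two masks

print(content_rows["Lemma"].tolist())
```

Naming the intermediate masks is optional, but it makes each condition easy to inspect on its own.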

<!--
BEGIN QUESTION
name: q6_1
points: 1
-->

**Question 6.1:** Does this bag of words representation capture who Karen Sparck Jones was, based on reading her obituary? Are any aspects of her life missing from this representation and, if so, which ones?

_Type your answer here, replacing this text._

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()