# Lab 2: NLTK and spaCy

In this lab, you will be learning a bit about how to use the Python libraries <code>nltk</code> and ``spaCy`` to perform text normalization and to explore and analyze texts. 

The first few parts of this notebook will help you understand how to use Jupyter Notebooks, and there are many tutorials and quick-start guides on the web (<a href="https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/">here</a>, <a href="https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook">here</a>, <a href="http://bi1.caltech.edu/code/t0b_jupyter_notebooks.html">here</a>). Note that if you ever get weird behavior in a notebook, just go up to the Kernel menu, restart the kernel and clear the output, and then re-run each code cell up to where you started having the problem.

After you have completed all code and questions in this notebook, commit and push your version of this file, along with the file you will create in part 7, to your repo. The deadline is Wednesday, September 14, at 11:59pm EDT. 

## 1. Getting started

In the cell below, where it says <code> In [ ]: </code>, type <code>print("Hello, World!")</code>. Click in the cell below, and then hit the Run button from the menu of icons at the top of the page. Depending on your installation of jupyter, the run button might have the text Run or it might just be an icon that looks like a black triangle pointing to the right. The keyboard shortcut is <code>shift-return</code>, holding both keys down at the same time.

In [None]:
# enter your Hello World code here


Underneath your command you should now see the output <code>Hello, World!</code>. 

Great! Now you have run your first command in this Jupyter Notebook. You can always go back and edit the stuff you've written in any code cell. Just remember to re-run it if you change anything. 

*Note: Many jupyter beginners forget that if you change the value of some variable in a block of code, that variable now has that new value everywhere -- even in earlier blocks of code. If you are having trouble, it often helps to go back and re-run the block of code where you originally set the value of that variable.* 

Now let's start using nltk. Start by typing <code>import nltk</code> in the command cell below to import the nltk library. Don't forget to hit the Run button above while the cursor is in the command cell below.

In [None]:
# enter your import nltk command here and hit Run



It's likely that you don't have all the packages you need by default in nltk. Just in case, you should download the most popular ones. 

Run the code below **one time** to download the necessary packages:

In [None]:
# NOTE: YOU ONLY NEED TO DO THIS ONCE!
# When you run the whole notebook at the end, you don't need 
# to execute this block. :)

import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('popular')


**If your whole computer crashes or things go really wrong, you'll need to download the packages as described in the README.md file in this repo.**

Just to make sure your nltk is working, use it to calculate the minimum edit distance between two words. The function is <code>nltk.edit_distance</code> and the arguments are the two strings you want to compare.
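If you are curious what "minimum edit distance" means computationally, here is a rough pure-Python sketch of the standard dynamic-programming approach. It is not nltk's implementation. The name `edit_distance_sketch` is made up for illustration, and the sketch assumes every insertion, deletion, and substitution costs 1, which I believe matches the defaults of <code>nltk.edit_distance</code>.

```python
def edit_distance_sketch(source, target):
    """Minimum number of single-character insertions, deletions, and
    substitutions (each costing 1) needed to turn source into target."""
    # previous[j] is the distance between the source prefix seen so far
    # (initially the empty string) and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance_sketch("kitten", "sitting"))  # 3
```

You can use a sketch like this to sanity-check the number nltk gives you for short words.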

In [None]:
# enter your call to edit distance here and hit Run



Since we're going to be downloading files a lot, you should also learn how to download files in Python. There are a few different libraries, but we'll be using <code>urllib.request</code>, so import that (a bare <code>import urllib</code> is not guaranteed to load the <code>request</code> submodule). We're also going to be using regular expressions, so we'll import the <code>re</code> library, too.

In [None]:
# enter your import urllib.request and import re commands here



Now let's download a file. We're going to look at Great Expectations, by Charles Dickens. Click on the cell below and hit the Run button to issue the command to download the plain text version of the book from Project Gutenberg.

**If you get a long error about a bad SSL certificate, just download the file directly and put it in your repo (and add, commit, and push it). We can work on getting downloads with urllib to work properly later.**


In [None]:
# reminder: if you get an error, just download the file directly, rename it, 
# and add, commit, and push it to your repo

urllib.request.urlretrieve("http://www.gutenberg.org/files/1400/1400-0.txt", "greatexpectations.txt")

Now you have a text to work with. In the same directory where you saved this Notebook, you should see the file you just downloaded saved as <code>greatexpectations.txt</code>. Using whatever text editor you like (on a Mac, it will open by default with TextEdit.app), have a look at the file, and familiarize yourself with the format.

## 2. Loading in the text

You'll notice that plain text Project Gutenberg books are formatted to have 80 or fewer characters per line. This is fine for reading on an old-timey computer screen, but when we're processing text, we don't want a lot of manually inserted hard line breaks in the middle of our text. We're going to read in the text and replace line breaks with spaces. Run the code below.

In [None]:
with open("greatexpectations.txt", "r", encoding="utf-8") as f:
    alltext = f.read().rstrip()
alltext = re.sub("\n", " ", alltext)

<code>alltext</code> is a single string containing the entire text of the book. You can see that this is true by printing out the whole thing, but that will take up lots of space. Instead, just try printing a couple of short slices from the beginning and the end, like this:

In [None]:
print(alltext[0:25])
print(alltext[-99:])

Recall from when you examined the file in a text editor that there was a bunch of text at the beginning and end of the file that was not actually a part of the text of the book. Above I showed how to use <code>re.sub</code> to remove all the line breaks. In the cell below, use <code>re.sub</code> to delete everything up to and including ``Chapter I.   `` **followed by three spaces**. Then use <code>re.sub</code> to delete everything starting from the white space that appears before ``*** END OF THE PROJECT GUTENBERG EBOOK GREAT EXPECTATIONS ***`` all the way to the end of the file. 

Hint: Be very careful about spaces, case, punctuation, etc. Some regular expressions you will find very useful: <code>+ ^ $ \s .\*</code> and <code>.\*?</code> and the backslash.
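To see how some of these pieces behave, here is a small sketch on a made-up string (the `MARK` and `END` markers are invented for illustration and are not the markers in the Gutenberg file). It contrasts the greedy <code>.\*</code> with the lazy <code>.\*?</code>, and shows <code>\s+</code> and <code>$</code> trimming everything from a marker to the end.

```python
import re

toy = "front matter MARK table of contents MARK the story itself   END back matter"

# Greedy .* runs as far as it can, so it deletes through the LAST "MARK ".
greedy = re.sub(r"^.*MARK ", "", toy)

# Lazy .*? stops as early as it can, so it deletes through the FIRST "MARK ".
lazy = re.sub(r"^.*?MARK ", "", toy)

# \s+ absorbs the run of spaces before END; .*$ absorbs the rest of the string.
trimmed = re.sub(r"\s+END.*$", "", greedy)

print(greedy)   # the story itself   END back matter
print(lazy)     # table of contents MARK the story itself   END back matter
print(trimmed)  # the story itself
```

Remember that characters such as <code>.</code> and <code>\*</code> have special meanings in a regular expression, so when you want to match them literally you need to put a backslash in front of them.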

In [None]:
# enter your code here and run it




If you did your regular expressions right, repeating the slice printing commands above will yield the following output:

<code>My father’s family name b</code><br> 
<code>the broad expanse of tranquil light they showed to me, I saw no shadow of another parting from her.</code>

<b>If you didn't get this output, *go back and reload the file* by putting the cursor in the command cell where you originally read in the file and clicking Run.</b> Then try your regular expression again. Do not continue until you get the right output.

## 3. Word tokenization

In Python, you can turn a "sentence" into a list of "words" by splitting on white space using the string method <code>split</code>. As we've discussed in class, however, splitting on white space is not a great way to tokenize (i.e., to separate out each actual word) because you leave punctuation attached to words. This prevents you from recognizing that, for instance, "dogs" is the same word whether it's before a space or a comma. In addition, you won't be able to learn anything about the distribution of different punctuation marks since they will always be attached to something else.
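Here is a quick illustration of the problem, using a made-up sentence and nothing but plain Python:

```python
sentence = "I like dogs, and dogs like me."
tokens = sentence.split()

print(tokens)
# ['I', 'like', 'dogs,', 'and', 'dogs', 'like', 'me.']

# "dogs," and "dogs" are different strings, so only one match is found.
print(tokens.count("dogs"))  # 1
```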

Fortunately, nltk has a word tokenizer function that, when given a string, will return a list of tokens. Here's the syntax for calling it:

<code>listoftokens = nltk.word_tokenize(inputstring)</code>

Call this function on <code>alltext</code> to produce a list of tokens called <code>alltokens</code>.

In [None]:
# call nltk.word_tokenize here and Run




### <b>Q1: How many tokens are there in this text? How many types are there in this text? What is the type:token ratio? Write three python commands in the line below that will calculate these three numbers. Then print out all three numbers.</b>

In [None]:
# line of code for token count


# line of code for type count


# line of code for type:token ratio


# line of code to print out all three



### <b>Q2: What text normalization might you want to do before counting the number of types and tokens? (Hint: there are some words you might be counting as separate types because of the way they are spelled.) How might this normalization make your type and token counts more accurate? How might it make these counts less accurate?</b>

### Double click here to enter your answers to Q2 
  

## 4. Frequency distributions

Your answers to Q1 demonstrate that there must be some words that were used more than once. Suppose you want to know what are the most frequent words. You can do this using the <code>FreqDist()</code> class in nltk. Run the code below to create a frequency distribution for your list of tokens and to print out the 10 most frequent words and their counts.

In [None]:
fdist = nltk.FreqDist(alltokens)
fdist.most_common(10)


It's not too surprising that the words you see in this list are the most common words. These little words that don't add a lot of content to language but appear frequently and usually serve a specific function are called <i><b>function words</b></i> or <i><b>closed class words</b></i>. These words are important, but they don't tell us much by themselves about the story.

What should we do if we want to know the most frequent words that are <i><b>content words</b></i> or <i><b>open class words</b></i> like nouns, verbs, adjectives, and adverbs -- the kinds of words that can tell us more about the story itself?

We filter out the function words using a <i><b>stop list</b></i>, which is a list of words that we can skip when we're interested in the real content of a text. nltk provides a stop list that you can use and add to. Let's get it and print it out to see what's there.

In [None]:
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
print(stoplist)


### Q3: What common and important class of tokens is missing from this list that we also might like to ignore?

### Double click here to answer Q3 




Add at least three of these missing tokens to the stop list using the usual Python syntax for appending to or extending a list, and check to make sure it worked. Then make a new version of <code>alltokens</code> from which all stop words in your stoplist have been removed. Finally, create a new <code>FreqDist</code> from this stopword-free list of tokens, and print out the top 10 tokens.

Keep adding stop words (or stop tokens!) to the stoplist until you start seeing mostly real content words in the top 10.

(Note: There are smart quotes in the text because it's UTF-8, not ASCII. You can add these to the stoplist by just copying and pasting them into your list of things you're adding to the stoplist.)
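If the list operations are rusty, here is a minimal sketch of the general pattern on toy data. Everything in it is invented for illustration: `toy_stoplist` is a tiny stand-in for nltk's list, the filler words are not meant as an answer to Q3, and `collections.Counter` stands in for <code>FreqDist</code> (which, as far as I know, is a subclass of `Counter`, so `most_common` works the same way).

```python
from collections import Counter

toy_stoplist = ["the", "a", "of", "on"]   # hypothetical stand-in for nltk's list
toy_stoplist.extend(["um", "uh"])          # extend adds several items at once
toy_stoplist.append("like")                # append adds a single item

tokens = ["Um", "the", "cat", "uh", "sat", "on", "the", "mat", "like", "a", "cat"]

# Membership tests are faster on a set than on a list.
stopset = set(toy_stoplist)

# Compare in lowercase so that "Um" is caught by the entry "um".
content_tokens = [t for t in tokens if t.lower() not in stopset]

print(content_tokens)                           # ['cat', 'sat', 'mat', 'cat']
print(Counter(content_tokens).most_common(1))   # [('cat', 2)]
```

Whether you lowercase before comparing is a design choice: nltk's English stop list is all lowercase, so a capitalized "The" at the start of a sentence will slip through unless you account for it.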

In [None]:
stoplist = stopwords.words('english')

# enter your code for appending at least three tokens to the stop list here



# print out the stoplist to make sure your new tokens were added correctly



# make a new version of alltokens called allcontenttokens that doesn't contain items from the stop list



# create a new FreqDist from this new version of allcontenttokens



# print out the top 10 most frequent tokens in this new FreqDist



# Remember to repeat the above steps until the 10 most frequent words are content words
# rather than function words or punctuation!


### Q4: How many tokens did you have to add to the stoplist? What do you think of nltk's stoplist?

### Double click here to answer Q4




## 5. N-grams

In class, we learned about language modeling with n-grams. nltk makes counting n-grams really easy with <code>nltk.util.ngrams</code>. 

**Note: I am using ``alltokens`` here and not ``allcontenttokens``. Why? Because language models are used when we are interested in word *sequences and how words fit together with each other*.**

In [None]:
from nltk.util import ngrams

mybigrams = ngrams(alltokens,2)
mytrigrams = ngrams(alltokens,3)



Make a <code>FreqDist</code> for the bigrams and trigrams in the text, and print out the 10 most frequent for each.
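One thing to watch out for: as far as I know, <code>ngrams</code> hands back a one-pass iterator rather than a list, so once something has looped over it, it is empty. The sketch below imitates that behavior in plain Python. `ngrams_sketch` is a made-up name and is only an approximation of what nltk does.

```python
def ngrams_sketch(tokens, n):
    """Return a one-pass iterator over each run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


tokens = ["to", "be", "or", "not", "to", "be"]
bigrams = ngrams_sketch(tokens, 2)

first_pass = list(bigrams)
second_pass = list(bigrams)  # the iterator is already used up

print(first_pass)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(second_pass)
# []
```

If you get an unexpectedly empty result, re-create the n-grams (or wrap them in `list(...)` once) before using them again.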

In [None]:
# enter your code for creating FreqDist for bigrams and trigrams






When working with bigrams in nltk, you can also build a <i>conditional</i> frequency distribution, which, for a given word, keeps track of the frequencies of any following words. Let's look.
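Conceptually, a conditional frequency distribution is like a dictionary that maps each word to its own counter of following words. Here is a rough plain-Python sketch of that idea on a made-up token list. It is only meant to show the shape of the data, not how nltk implements it.

```python
from collections import Counter, defaultdict

tokens = ["the", "dog", "saw", "the", "cat", "and", "the", "dog", "ran"]

# For each word, count the words that immediately follow it.
following = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    following[first][second] += 1

print(following["the"].most_common(2))  # [('dog', 2), ('cat', 1)]
```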

In [None]:
bicfreq = nltk.ConditionalFreqDist(ngrams(alltokens,2))

print(bicfreq["Mr."].most_common(10))  # prints out common words after Mr.

print(bicfreq["who"].most_common(10))  # prints out common words after who


Although the <code>FreqDist</code> class is useful, nltk's language modeling functionality is very buggy, so we probably won't be using it for this class. We'll be using other tools, one of which you'll learn about in future labs.

## 6. Stemming and lemmatization

There's a common normalization task we haven't performed yet: stemming or lemmatization.

### Q5: Looking at the top 50 or 100 most frequent unigrams, how can you tell the tokens are not stemmed or lemmatized? 

### Double click here to answer Q5




The command cell below shows how to use nltk's only true lemmatizer, the WordNet Lemmatizer.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# an example
print(lemmatizer.lemmatize("dogs"))
print(lemmatizer.lemmatize("speaks", "v"))

Use this lemmatizer to lemmatize every token in the <code>allcontenttokens</code> list you created above. Then make a new frequency distribution and examine the 50 or 100 most frequent words.

**Note: I want you to use ``allcontenttokens`` here. Why? Because we are thinking about words by themselves here rather than word sequences, so we can disregard function words. In addition, you can lemmatize only verbs, nouns, adjectives, and adverbs (in English, at least).**

In [None]:
# Create a new list of tokens, all_lemmas by lemmatizing allcontenttokens



# Build a new FreqDist on all_lemmas and print out the 50 or 100 
# most frequent lemmatized tokens



It probably doesn't look much better. This is because the WordNet lemmatizer in nltk assumes by default that every word is a noun. Unless you tell the lemmatizer that something is a verb, it won't try to look it up as a verb. This is why "said" doesn't get lemmatized, and also why "was" gets lemmatized to "wa". In a future lab or problem set, we'll be exploring automatic part of speech tagging, which allows us to label every word as a noun, verb, adjective, preposition, etc. We'll also see shortly that spaCy does a much better job with this.

Note that there are also several different stemmers implemented in the <code>nltk.stem</code> package, which you can explore on your own.

## 7. Sentence tokenization

For the second part of this lab, you'll need to take this text and save it out to a file with one tokenized sentence per line. Let's start by going back to the string holding our original text, <code>alltext</code>. We can turn this into a list of strings, each of which is a sentence, using the <code>sent_tokenize()</code> function, which takes a string as an argument and returns a list of sentences.

Below, take <code>alltext</code> and break it up into sentences with <code>sent_tokenize()</code>. Then loop through the sentences in that list, and use <code>word_tokenize</code> to tokenize each sentence. Print each tokenized sentence out to a file so that you have one sentence per line. 

**Note: Do not just call <code>print()</code>! This will print out an ugly list of lists. Cycle through the lists to print out strings.** 
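To see the difference, here is a tiny sketch using a hand-written token list (not actual tokenizer output). Calling `print()` on the list shows Python's list notation, while `" ".join(...)` produces a plain string with one space between tokens.

```python
tokens = ["Open", "the", "pod", "bay", "doors", ",", "Hal", "!"]

# Printing the list itself shows brackets, quotes, and commas.
print(tokens)
# ['Open', 'the', 'pod', 'bay', 'doors', ',', 'Hal', '!']

# Joining the tokens gives a clean, space-separated line.
line = " ".join(tokens)
print(line)
# Open the pod bay doors , Hal !
```

Writing the tokens out one by one in a loop, as the code cell below suggests, gets you to the same place. Joining is just one convenient option.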

For example, this text:

<code>Open the pod bay doors, Hal! I'm sorry, Dave. I can't do that.</code>

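would get printed to a file as this (shown here without the ``<s>`` and ``</s>`` sentence markers that the code cell below asks you to add):

<code>Open the pod bay doors , Hal !</code><br>
<code>I 'm sorry , Dave .</code><br>
<code>I ca n't do that .</code><br>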

Please observe that there are *no quotes, no square brackets, and no commas, as you would get if you just called ``print()`` on a Python list!* TAs will be instructed to give you a 0 for this section if you print out a raw list. 

Name the file you print out to <code>great.txt</code>. 

In [None]:
# use sent_tokenize() to break alltext into a list of sentences, allsent


# Open a file to write to called great.txt.


# Loop through the sentences

    # call word_tokenize() on each sentence

    # First write out <s> to the file to indicate the beginning of a sentence, then a space.
    
    # Then write out to the file each token one-by-one, each followed by a space. 
   
    # Then write out </s> to indicate the end of the sentence.
    
    
# Close the file great.txt.



The second line of your file should look like this:

```
<s> So , I called myself Pip , and came to be called Pip . </s>
```

## 8. Using spaCy to do a lot of this work for you

It is crucial that you understand how to do each of these steps yourself since which steps you do depends very much on what task you are working on. Things like capitalization, punctuation, function words, and sentence boundaries might be important for what you want to do, or they might not matter at all.

However, there is a different Python library, spaCy, that will do a lot of this for you (and more!) automagically (and more slowly). Experiment with the code below to see the different things spaCy can do. To explore more, you can consult [the official spaCy documentation](https://spacy.io/api) or helpful websites like [this](https://realpython.com/natural-language-processing-spacy-python/).

In [None]:
import spacy

# Cool jupyter feature: Putting an exclamation point at the beginning of a line
# in a jupyter notebook lets you run a lot of commands that you would normally
# run at a linux command line.
!python3.9 -m spacy download en_core_web_sm  # Comment this out after you run it for the first time.

# This line loads a big model/pipeline that works specifically for English.
nlp = spacy.load('en_core_web_sm')

# Remember: spaCy is fancy, so it can be slow. Let's look at just the 
# first 10000 characters of Great Expectations.

doc = nlp(alltext[0:10000])



The pipeline you loaded in line 9 in the above code block, which I called ``nlp``, takes as input a text. It then returns a data structure that contains a very detailed processing and analysis of that text, including sentence boundary detection, tokenization, lemmatization, part of speech tagging, and all kinds of other helpful things.

In [None]:
# Here's how to get access to sentences.
for sent in doc.sents:
    print(sent)



In [None]:
# Here's how to get tokens, along with information 
# about each token such as its lemma and part of speech.
for token in doc:
    print(token, token.lemma_, token.pos_)


In [None]:
# spaCy has stoplists, too, and they are much more
# complete and expansive (perhaps too expansive)
# than the nltk list

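# Import the stop word set explicitly rather than relying on
# spacy.lang.en having been loaded as a side effect of spacy.load().
from spacy.lang.en.stop_words import STOP_WORDS
english_stops = STOP_WORDS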
print(english_stops)

To get some practice using spaCy, I'd like you to try out some of the above commands but with a new text in a language that is *not* English!

**Step 1**: There are many trained pipelines for different languages here: https://spacy.io/models. Go to that address, and pick a language. In the code box below where you pick the language, you'll see a line that shows you how to load a pipeline for the language you have selected (e.g., ``nlp = spacy.load("es_core_news_sm")`` for Spanish).

**Step 2**: Go on the web and find a chunk of text for the language you chose. You can pick text from Gutenberg or from Google News for that language or from any website where you can get a good continuous chunk of 100-200 words. 

**Step 3**: Process that chunk of text with the language pipeline you chose in Step 1. (You can just copy and paste the chunk into your code block.)

**Step 4**: Print out the following:
* The number of tokens in the text.
* The number of sentences in the text.
* All the verbs in the text.
* All the stopwords in the text.


In [None]:
# Write your code for Part 8 here. Do not forget to include comments!




## 9. Verifying and submitting your work

Make sure you've answered every <b>Q</b> question.

Make sure you've written code wherever required. 

Go up to the Kernel menu and select Restart and Run All. **(Don't forget that you can comment out or skip the nltk download block and the urllib block.)** This will run all of the code you've written. Make sure there are no errors.

Add, commit, and push this file and the <code>great.txt</code> file to your repo. When you are totally done, make the commit message say "FINAL SUBMISSION - PLEASE GRADE".

This lab is due September 14, 2022, at 11:59pm.