# Python Text Analysis: Fundamentals, Part 1

In this workshop series, we'll establish building blocks for performing text analysis in Python. These techniques lie in the domain of *natural language processing*, where we apply computational techniques to language written by humans in order to explain some of the underlying structure.

So, the million dollar question: How exactly do we go about performing computational methods on words?

This is ultimately a question of *representations*. Text naturally is represented as words, which are understandable to humans because we have a grammatical and syntactical structure we use to extract meaning from those words. However, most machine learning and data science techniques utilize numerical methods to extract patterns from large datasets. So, we need to find a way to convert the language into a numerical representation. We'll start with this goal in mind, and demonstrate how involved this process can be.

We'll start by importing text into Python. Then, we'll cover a variety of preprocessing steps you might want to apply before proceeding with computational methods. In the next part of this workshop, we'll work with the bag-of-words model, the first numerical representation of text that we'll encounter in this series.

## Importing Files Containing Text

Text data we want to analyze will be stored in external files that need to be imported. These files will generally be text files (`.txt`) or comma separated value files (`.csv`).

All the data used in this notebook are stored in a `data` folder that we need to access. We need to adjust our filepaths accordingly:

In [1]:
text_path = '../data/sowing_and_reaping.txt'

### Text Files

We'll first start by importing "Sowing and Reaping" by Frances Harper, which is stored in a text file. Python has built-in functionality for importing text files:

In [2]:
# Open and read the text
with open(text_path, 'r') as file:
    raw_text = file.read()

We've stored the text file in an object called `raw_text`. We'll remove the front and end matter for better preprocessing later:

In [3]:
# Remove the front and end matter
sowing_and_reaping = raw_text[1114:684814]
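
The hard-coded indices above only work for this particular file. As a hedged alternative, the sketch below finds the body of the text by searching for marker lines. It assumes the file uses `*** START OF ...` and `*** END OF ...` lines, a common Project Gutenberg convention; the function name `strip_front_and_end_matter`, the default marker strings, and the `sample` text are all illustrative assumptions, so check them against your own file.

```python
def strip_front_and_end_matter(text, start_marker='*** START OF', end_marker='*** END OF'):
    """Return the text between the start and end marker lines.

    If a marker is missing, fall back to the start or end of the text.
    """
    start = text.find(start_marker)
    if start == -1:
        body_start = 0
    else:
        # Skip to the end of the line that holds the start marker
        newline = text.find('\n', start)
        body_start = len(text) if newline == -1 else newline + 1
    end = text.find(end_marker, body_start)
    body_end = len(text) if end == -1 else end
    return text[body_start:body_end].strip()

# Made-up sample text, not the actual file contents
sample = (
    "Title page and license notes\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "I hear that John Andrews has given up his saloon.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "More license text\n"
)
body = strip_front_and_end_matter(sample)
print(body)
```

Searching for markers is more robust than fixed offsets when you process many files, but it depends on every file actually following the assumed convention.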

---

### Challenge 1: Working with Strings

* What type of object is `sowing_and_reaping`?
* How many characters are in `sowing_and_reaping`?
* How can we get the first 1000 characters of `sowing_and_reaping`?

---

In [4]:
# YOUR CODE HERE
# What type of object is sowing_and_reaping?
type(sowing_and_reaping)


str

In [5]:
# YOUR CODE HERE
# How many characters are in sowing_and_reaping?
len(sowing_and_reaping)

167642

In [6]:
# YOUR CODE HERE
# How can we get the first 1000 characters of sowing_and_reaping?
sowing_and_reaping[:1000]


'I hear that John Andrews has given up his saloon; and a foolish thing\nit was. He was doing a splendid business. What could have induced him?"\n\n"They say that his wife was bitterly opposed to the business. I don\'t\nknow, but I think it quite likely. She has never seemed happy since John\nhas kept saloon."\n\n"Well, I would never let any woman lead me by the nose. I would let her\nknow that as the living comes by me, the way of getting it is my affair,\nnot hers, as long as she is well provided for."\n\n"All men are not alike, and I confess that I value the peace and\nhappiness of my home more than anything else; and I would not like to\nengage in any business which I knew was a source of constant pain to my\nwife."\n\n"But, what right has a woman to complain, if she has every thing she\nwants. I would let her know pretty soon who holds the reins, if I had\nsuch an unreasonable creature to deal with. I think as much of my wife\nas any man, but I want her to know her place, and I kno

### Comma Separated Value (CSV) Files

Often, we may have data stored in "dataframes" or "tables", which consist of many samples (rows), each containing several features (columns). Among the features is likely a text column which contains the text of interest. These dataframes are often found as Comma Separated Value (CSV) files (and somewhat less frequently as tab separated value (TSV) files). In either case, there is some "delimiter" (i.e., a comma or tab) which helps separate entries from each other.

The `pandas` package is the best package for dealing with dataframes in Python, and this package comes with its own function for reading CSV files. For example, let's read in a file containing many Tweets about airlines:

In [7]:
# Import pandas
import pandas as pd
# Use pandas to import Tweets
csv_path = '../data/airline_tweets.csv'
tweets = pd.read_csv(csv_path, sep=',')

In [8]:
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


Let's take a look at some of the Tweets:

In [9]:
print(tweets['text'].iloc[0])
print(tweets['text'].iloc[1])
print(tweets['text'].iloc[2])

@VirginAmerica What @dhepburn said.
@VirginAmerica plus you've added commercials to the experience... tacky.
@VirginAmerica I didn't today... Must mean I need to take another trip!
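
To make the idea of a delimiter concrete, here is a small sketch that uses Python's built-in `csv` module instead of `pandas`. The two in-memory strings are made-up data, not rows from the airline file; they hold the same record, once comma-separated and once tab-separated.

```python
import csv
import io

# Made-up data: the same row, once comma-separated and once tab-separated
csv_data = 'airline,text\nVirgin America,"Great flight, thanks!"\n'
tsv_data = 'airline\ttext\nVirgin America\tGreat flight, thanks!\n'

# DictReader uses the first row as the column names
csv_rows = list(csv.DictReader(io.StringIO(csv_data), delimiter=','))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_data), delimiter='\t'))

print(csv_rows[0]['text'])
print(tsv_rows[0]['text'])
```

Note that the comma inside the Tweet text has to be quoted in the CSV version, while the TSV version needs no quoting. With `pandas`, a tab-separated file can be read with `pd.read_csv(path, sep='\t')`.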


---

### Challenge 2: Reading in Many Files

The `data` folder contains another folder called `amazon`, which contains many `csv` files of Amazon reviews. Use a `for` loop to read in each dataframe. Do the following:

* We've provided a path to the `amazon` folder, and a list of all the file names within the folder using the `os.listdir()` function.
* Iterate over all these files, and import them using `pd.read_csv()`. You will need to use `os.path.join()` to create the correct path. Additionally, you need to provide `pandas` with the column names since they are not included in the files. We have created the `column_names` variable for you.
* Extract the text column from each dataframe, and add its entries to the `reviews` list.
* How many total reviews do you obtain?

---

In [10]:
# The os package has useful tools for file manipulation
import os
# Amazon review folder
amazon_path = '../data/amazon'
# List all the files in the amazon folder
files = os.listdir(amazon_path)
# Column names for each file
column_names = ['id',
                'product_id',
                'user_id',
                'profile_name',
                'helpfulness_num',
                'helpfulness_denom',
                'score',
                'time',
                'summary',
                'text']
# Add each review text to this list
reviews = []

In [11]:
for file in files:
    # Check that the file is actually a CSV file
    if os.path.splitext(file)[1] == '.csv':
        # YOUR CODE HERE
        full_path = os.path.join(amazon_path, file)
        reviews_df = pd.read_csv(full_path, sep=',', names=column_names)
        text = list(reviews_df['text'])
        reviews.extend(text)
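
The file-discovery step can also be written with `pathlib`, which filters by extension in a single call. This is only a sketch of that one step: the folder is a temporary directory and the file names are hypothetical placeholders, so that the example runs without the workshop data.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with hypothetical file names so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ['reviews_a.csv', 'reviews_b.csv', 'notes.txt']:
        (folder / name).write_text('placeholder\n', encoding='utf-8')

    # glob('*.csv') keeps only CSV files; sorted() makes the order reproducible
    csv_files = sorted(folder.glob('*.csv'))
    csv_names = [path.name for path in csv_files]

print(csv_names)
```

Each `Path` in `csv_files` can be passed to `pd.read_csv()` in place of the string built with `os.path.join()`. Once the loop above has finished, `len(reviews)` answers the last question of the challenge.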

There are other file types which you may come across: `json`, `xml`, `html`, etc. There are packages you can use to import each of these. The main challenge, in most cases, is dealing with multiple files, and extracting the actual text you want.
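
As one illustration, JSON can be read with Python's built-in `json` module. The sketch below parses a made-up JSON string; the `id` and `text` field names are assumptions for illustration, and a real file will have its own structure that you need to inspect first.

```python
import json

# Hypothetical JSON document: a list of records, each with a "text" field
json_string = '''
[
    {"id": 1, "text": "The flight was on time."},
    {"id": 2, "text": "Lost my luggage again."}
]
'''

records = json.loads(json_string)
# Pull out only the field that holds the text of interest
texts = [record['text'] for record in records]
print(texts)
```

For a file on disk, `json.load(file)` plays the same role as `json.loads()` does for a string.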

## Preprocessing

Our goal is to convert a text representation to a numerical representation. However, language can be messy. There's a variety of preprocessing steps that we'd like to do before we get to the numerical representation.

We will largely be using a package called Natural Language Toolkit, or `nltk`, to perform these operations. In some cases, we'll use basic Python.

There are a host of natural language processing packages one can use. For example, one newer package is `spaCy`, which is extremely powerful. Our goal here is to not make you an expert in a variety of NLP packages, but to expose you to principles that are shared by all of them. In this way, you'll be better prepared to open up any new NLP package you might have to use.

### Installing `nltk`

If this is your first time using `nltk`, we'll go through a couple steps to get set up. First, install `nltk` if you have not already:

In [12]:
# Run if you do not have nltk installed
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


Next, we need to install a couple packages within `nltk`:

In [13]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/blueraspberry/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/blueraspberry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Text Cleaning

"Text cleaning" is a catch-all term for the process of performing relatively simple tasks in order to normalize our text. Text cleaning can mean a variety of different things depending on your use case.

#### A Brief Introduction to Regular Expressions

Before we dive into the specific text cleaning processes, let's briefly introduce regular expressions. We do this here since many text cleaning steps may require regular expressions, and many NLP libraries heavily use them under the hood.

Regular expressions (regexes) are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but are very efficient when you get a handle on them.

Our goal in this workshop is not to provide a deep (or even shallow) dive into regexes; instead, we want to expose you to them so that you're better prepared to do deep dives in the future.

Regex testers are a useful tool for both understanding and creating regular expressions. An example is this [website](https://regex101.com).

In [14]:
import re
pattern = 'test'

In [15]:
test_string = 'This is a test.'
# Find tokens
tokens = re.findall(pattern=pattern, string=test_string)
print(tokens)
# Replace tokens
replaced = re.sub(pattern=pattern, repl='not a test', string=test_string) 
print(replaced)

['test']
This is a not a test.


This is nice, but we could have done this somewhat easily with basic Python `string` functions. Let's try something more interesting:

In [16]:
# Word pattern matcher
pattern = r'\w+'
re.findall(pattern, test_string)

['This', 'is', 'a', 'test']

What did this do? Use the regex website to confirm your guess!

For now, we won't go much further than this, but there are many resources online to continue learning about regexes.
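
As a small taste of what else is possible, the sketch below runs three common character classes over the same made-up sentence. The sentence and variable names are illustrative only.

```python
import re

sentence = "Flight 89 left at 10:45, didn't it?"

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r'\w+', sentence)
# \d+ matches runs of digits only
numbers = re.findall(r'\d+', sentence)
# [A-Za-z]+ matches runs of ASCII letters only
letters = re.findall(r'[A-Za-z]+', sentence)

print(words)
print(numbers)
print(letters)
```

Notice that `\w+` splits `didn't` into `didn` and `t`, because the apostrophe is not a word character. Small decisions like this come up again when we discuss tokenization.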

#### Lowercasing

While there is often information in the "casing" of words (e.g., whether text is lowercase or uppercase), we often don't work in a regime where we're able to properly leverage this information. So, a common text cleaning step is to lowercase all text, in order to simplify our analysis.

We can easily do this with the built-in string function `lower()`:

In [17]:
sowing_and_reaping_lower = sowing_and_reaping.lower()
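
One consequence of lowercasing is that differently cased versions of a word collapse into a single form. The sketch below counts distinct forms before and after lowercasing, using a made-up sentence and a plain blank space split.

```python
sample = "The saloon was closed. the saloon stayed closed. THE END"

tokens_original = sample.split()
tokens_lower = sample.lower().split()

# Count distinct word forms before and after lowercasing
n_original = len(set(tokens_original))
n_lower = len(set(tokens_lower))

print(n_original)
print(n_lower)
```

Here `The`, `the`, and `THE` count as three separate forms before lowercasing and as one afterwards.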

In [18]:
print(sowing_and_reaping[:200])
print('------')
print(sowing_and_reaping_lower[:200])

I hear that John Andrews has given up his saloon; and a foolish thing
it was. He was doing a splendid business. What could have induced him?"

"They say that his wife was bitterly opposed to the busin
------
i hear that john andrews has given up his saloon; and a foolish thing
it was. he was doing a splendid business. what could have induced him?"

"they say that his wife was bitterly opposed to the busin


#### Removing Punctuation

Sometimes, you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. This becomes less common when we consider more advanced NLP algorithms. In many cases, you may do this step *after* tokenization, which we will discuss in the next section. 

In [19]:
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [20]:
punctuation_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."
no_punctuation = ''.join([char for char in punctuation_text if char not in punctuation])
print(no_punctuation)

Weve got quite a bit of punctuation here dont we Python DLab
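
An alternative that avoids the character-by-character loop is `str.translate()`. The sketch below builds a translation table that deletes every character in `punctuation`; it should produce the same result as the list comprehension above.

```python
from string import punctuation

punctuation_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."

# The third argument of str.maketrans lists characters to delete
remove_punctuation = str.maketrans('', '', punctuation)
no_punctuation = punctuation_text.translate(remove_punctuation)
print(no_punctuation)
```

Keep in mind that `punctuation` only covers ASCII symbols, so characters such as curly quotes are left untouched by either approach.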


#### Stripping Blank Spaces

Removing blank space is a common step, as we might come across text with extraneous blank space. This is particularly common when text is imported from messy places, like webpages.

Python has a built-in function to deal with blank space on the *ends* of strings:

In [21]:
string = ' Hello! '
string.strip()

'Hello!'

What about within text? We will need to use a regular expression for this:

In [22]:
example1_path = '../data/example1.txt'

with open(example1_path, 'r') as file:
    example1 = file.read()
    
print(example1)



This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches blankspace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.





In [23]:
# Stripping only removes the ends
print(example1.strip())

This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches blankspace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.


In [24]:
# A regular expression will handle blank spaces within the text
blankspace_pattern = r'\s+'
blankspace_repl = ' '
clean_text = re.sub(blankspace_pattern, blankspace_repl, example1)
clean_text.strip()

'This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines. The Python method called "strip" only catches blankspace at the start and end of a string. But it won\'t catch it in the middle, for example, in this sentence. Once again, regular expressions will help us with this.'
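
If you prefer to avoid a regular expression here, splitting and re-joining the text gives the same effect on typical input. The `messy` string below is a made-up stand-in for the example file.

```python
import re

messy = "  But it won't catch it in       the middle,\t\tfor example,\n\nin this sentence.  "

# split() with no arguments splits on any run of blank space and drops empty strings
collapsed = ' '.join(messy.split())
# The regex approach from above, for comparison
collapsed_regex = re.sub(r'\s+', ' ', messy).strip()

print(collapsed)
print(collapsed == collapsed_regex)
```

The regular expression becomes the better tool once you want to treat different kinds of blank space differently, for example keeping paragraph breaks while collapsing spaces and tabs.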

#### Removing URLs, Hashtags, and Numbers

Text containing non-alphabetic symbols may have additional meaning beyond simply using punctuation or numbers. For example, text may contain URLs, hashtags, or numbers. Each of these is informative in its own right.

However, we rarely care about the exact URL used in a tweet. Similarly, we might not care about specific hashtags, or the precise number used. While we could remove them completely, it's often informative to know that there *exists* a URL, hashtag, or number.

So, we replace individual URLs, hashtags, and numbers with a "symbol" that preserves the fact these structures exist in the text. It's standard to just use the strings "URL", "HASHTAG", and "DIGIT".

Since these types of text often contain precise structure, they're an apt case for using regular expressions. Let's apply these patterns to the Tweets above.

In [25]:
# Get a Tweet with a URL in it
url_tweet = tweets['text'].iloc[13]
print(url_tweet)

@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn


In [26]:
# URL 
url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
url_repl = ' URL '
re.sub(url_pattern, url_repl, url_tweet)

"@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel  URL "

In [27]:
# Hashtag
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
hashtag_repl = ' HASHTAG '
re.sub(hashtag_pattern, hashtag_repl, url_tweet)

"@VirginAmerica @virginmedia I'm flying your HASHTAG  HASHTAG  skies again! U take all the HASHTAG  away from travel http://t.co/ahlXHhKiyn"

In [28]:
# Digits
digit_tweet = tweets['text'].iloc[32]
print(digit_tweet)
digit_pattern = r'\d+'
digit_repl = ' DIGIT '
re.sub(digit_pattern, digit_repl, digit_tweet)

@VirginAmerica help, left expensive headphones on flight 89 IAD to LAX today. Seat 2A. No one answering L&amp;F number at LAX!


'@VirginAmerica help, left expensive headphones on flight  DIGIT  IAD to LAX today. Seat  DIGIT A. No one answering L&amp;F number at LAX!'
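
The three replacements can be chained in a single helper. The sketch below is a minimal version under stated assumptions: the function name `replace_symbols` is hypothetical, the patterns are deliberately simpler than the ones above (they will miss some URLs and hashtags the longer patterns catch), and the tweet is made up.

```python
import re

def replace_symbols(text):
    """Replace URLs, hashtags, and numbers with placeholder strings."""
    # URLs go first: they may contain digits and '#' characters
    text = re.sub(r'https?://\S+', ' URL ', text)
    text = re.sub(r'#\w+', ' HASHTAG ', text)
    text = re.sub(r'\d+', ' DIGIT ', text)
    # Collapse the extra spaces introduced by the placeholders
    return re.sub(r'\s+', ' ', text).strip()

tweet = "Loved flight 89 #fabulous http://t.co/ahlXHhKiyn"
replaced_tweet = replace_symbols(tweet)
print(replaced_tweet)
```

Running the URL replacement first matters here: the shortened link contains letters and digits that the later patterns would otherwise pick apart.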

What other kinds of text strings can you think of that we might want to replace?

Natural language is complex, and so there may be use cases where we might need specialized packages for preprocessing or removing text. For example, the [`emoji` package](https://pypi.org/project/emoji/) may be useful for social media text. The [`textacy` package](https://textacy.readthedocs.io/en/latest/) also provides useful preprocessing tools.

---

### Challenge 3: Text Cleaning with Multiple Steps

In Challenge 2, we imported many Amazon reviews, and stored them in a variable called `reviews`. Each element of the list is a string, representing the text of a single review. For each review:

* Replace any URLs and digits.
* Make all characters lower case.
* Strip all blankspace.

Keep in mind: the order in which you do these steps matters!

---

In [29]:
def preprocess(text):
    # YOUR CODE HERE
    """Preprocesses a string."""
    # Lowercase
    text = text.lower()
    # Replace URLs
    # Match up to the next blank space so that text after the URL is kept
    url_pattern = r'https?://\S+'
    url_repl = ' URL '
    text = re.sub(url_pattern, url_repl, text)
    # Replace digits
    digit_pattern = r'\d+'
    digit_repl = ' DIGIT '
    text = re.sub(digit_pattern, digit_repl, text)
    # Remove blank spaces
    blankspace_pattern = r'\s+'
    blankspace_repl = ' '
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    # Strip
    return text.strip()

In [30]:
reviews_preprocessed = [preprocess(review) for review in reviews]

In [31]:
print(reviews_preprocessed[0])

we have a DIGIT week old... he had gas and constipation problems for the first DIGIT weeks. we tried two different kinds of similac including for fussiness and gas and neither seemed to work. we switched to the organic a few weeks ago and saw quick improvement. i wish i could breast feed but i'm unable to, so for now this seems the best option especially since it was recommended we stick with a ready made formula for the gas problems.<br />ive read a lot of the reviews and took into consideration the information about sucrose. i plan on talking to the pediatrician and my midwife for additional information beyond the article written about it, especially since that is from DIGIT . i realize the concern and i am doing research on making my own formula so i know exactly whats in it and that its organic, but in the mean time baby l eats great with this, is healthy, and has fewer stomach problems. it's middle of the road when it comes to $ - although amazn is one of the more expensive places
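
To see why the order of the steps matters, the sketch below applies the same operations in three different orders to one made-up string. The helper names and the simplified URL pattern are assumptions for illustration.

```python
import re

text = "Visit HTTP://Example.com/page2 NOW"

def replace_urls(s):
    # re.IGNORECASE lets the pattern match 'HTTP' as well as 'http'
    return re.sub(r'https?://\S+', ' URL ', s, flags=re.IGNORECASE)

def replace_digits(s):
    return re.sub(r'\d+', ' DIGIT ', s)

def tidy(s):
    # Collapse runs of blank space and strip the ends
    return ' '.join(s.split())

# Lowercase first, then replace: the placeholder stays uppercase
lower_first = tidy(replace_digits(replace_urls(text.lower())))
# Replace first, then lowercase: the placeholder is lowercased too
lower_last = tidy(replace_digits(replace_urls(text)).lower())
# Digits before URLs: the number inside the URL is split off before the URL is matched
digits_first = tidy(replace_urls(replace_digits(text.lower())))

print(lower_first)
print(lower_last)
print(digits_first)
```

Only the first ordering gives a clean `URL` placeholder. Lowercasing last turns it into an ordinary-looking word, and replacing digits first leaves a stray `DIGIT` that came from inside the URL.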

## Tokenization

One of the most important steps in text analysis is tokenization. This is the process of breaking down the text into "tokens", which are distinct chunks that we recognize as unique in whatever corpus we're working in.

Let's start by importing an example file:

In [32]:
example2_path = '../data/example2.txt'

with open(example2_path) as file:
    example2 = file.read()
    
print(example2)

In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. What do you do when you have #hashtags, @TwitterHandles, or https://urls.com? Different packages will make different decisions on when to split text apart, and when not to.



Let's try naively tokenizing by splitting up the text according to blankspace, using a basic Python string method:

In [33]:
tokens = example2.split()
# Print first twenty tokens
tokens[:20]

['In',
 'this',
 'little',
 'example,',
 "we're",
 'going',
 'to',
 'see',
 'some',
 'of',
 'the',
 'problems',
 'that',
 'regularly',
 'appear',
 'in',
 'tokenization.',
 'Tokenization',
 'may',
 'seem']

We can roughly think of this as "word tokenization". However, it's not always clear that simply splitting up by spaces will get what we want. Consider contractions, for example, which really consist of two words connected together. More advanced tokenizations will actually treat these words differently.
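
A middle ground between splitting on blank space and using a library is a small regular expression. The sketch below is not what any particular package does; it is one illustrative pattern that separates punctuation from words while keeping contractions in one piece. The sentence is adapted from the example text.

```python
import re

sentence = "Why is it so hard? Punctuation and contractions (like don't) get in the way."

# Either a word with an optional apostrophe part, or a single non-space symbol
token_pattern = r"\w+(?:'\w+)?|[^\w\s]"
regex_tokens = re.findall(token_pattern, sentence)
print(regex_tokens)
```

This pattern keeps `don't` as one token, which is a design choice rather than the right answer. Whether a contraction should be one token or two depends on what you plan to do with the tokens afterwards.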

`nltk` has a function called `word_tokenize` which can tokenize a string for us in a more intelligent fashion. Under the hood, it relies largely on regular expressions:

In [34]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/blueraspberry/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
from nltk.tokenize import word_tokenize
nltk_tokens = word_tokenize(example2)

In [36]:
print(nltk_tokens)

['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see', 'some', 'of', 'the', 'problems', 'that', 'regularly', 'appear', 'in', 'tokenization', '.', 'Tokenization', 'may', 'seem', 'simple', ',', 'but', 'it', "'s", 'harder', 'than', 'it', 'first', 'appears', '.', 'Why', 'is', 'it', 'so', 'hard', '?', 'Punctuations', ',', 'contractions', '(', 'like', 'do', "n't", ',', 'wo', "n't", 'and', 'would', "'ve", ')', 'get', 'in', 'the', 'way', '.', 'What', 'do', 'you', 'do', 'when', 'you', 'have', '#', 'hashtags', ',', '@', 'TwitterHandles', ',', 'or', 'https', ':', '//urls.com', '?', 'Different', 'packages', 'will', 'make', 'different', 'decisions', 'on', 'when', 'to', 'split', 'text', 'apart', ',', 'and', 'when', 'not', 'to', '.']


Looking at this example, you can see how `nltk` has made certain decisions about where and when to tokenize. Tokenization is critical for downstream processing, and there's a variety of methods for performing the tokenizing. Let's take a look at `spaCy`'s tokenizer.

In [37]:
# Install spaCy if necessary
%pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [38]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [39]:
# Import spaCy and load the small English pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
# Pass the example into the English pipeline
doc = nlp(example2)
spacy_tokens = [token.text for token in doc]

In [40]:
# Compare NLTK to spaCy
print(nltk_tokens)
print(spacy_tokens)

['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see', 'some', 'of', 'the', 'problems', 'that', 'regularly', 'appear', 'in', 'tokenization', '.', 'Tokenization', 'may', 'seem', 'simple', ',', 'but', 'it', "'s", 'harder', 'than', 'it', 'first', 'appears', '.', 'Why', 'is', 'it', 'so', 'hard', '?', 'Punctuations', ',', 'contractions', '(', 'like', 'do', "n't", ',', 'wo', "n't", 'and', 'would', "'ve", ')', 'get', 'in', 'the', 'way', '.', 'What', 'do', 'you', 'do', 'when', 'you', 'have', '#', 'hashtags', ',', '@', 'TwitterHandles', ',', 'or', 'https', ':', '//urls.com', '?', 'Different', 'packages', 'will', 'make', 'different', 'decisions', 'on', 'when', 'to', 'split', 'text', 'apart', ',', 'and', 'when', 'not', 'to', '.']
['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see', 'some', 'of', 'the', 'problems', 'that', 'regularly', 'appear', 'in', 'tokenization', '.', 'Tokenization', 'may', 'seem', 'simple', ',', 'but', 'it', "'s", 'harder', 'than', 

---

### Challenge 4: Tokenizing a Large Text

Tokenize "Sowing and Reaping", which we imported at the beginning of this workshop. Use a method of your choice.

Once you've tokenized, find all the unique word types (you might want the `set` function). Then, sort the resulting `set` object to create a vocabulary (you might want to use the `sorted` function).

---

In [45]:
# YOUR CODE HERE
# nltk
nltk_sowing = word_tokenize(sowing_and_reaping)
print(nltk_sowing[:100])

['I', 'hear', 'that', 'John', 'Andrews', 'has', 'given', 'up', 'his', 'saloon', ';', 'and', 'a', 'foolish', 'thing', 'it', 'was', '.', 'He', 'was', 'doing', 'a', 'splendid', 'business', '.', 'What', 'could', 'have', 'induced', 'him', '?', "''", '``', 'They', 'say', 'that', 'his', 'wife', 'was', 'bitterly', 'opposed', 'to', 'the', 'business', '.', 'I', "don't", 'know', ',', 'but', 'I', 'think', 'it', 'quite', 'likely', '.', 'She', 'has', 'never', 'seemed', 'happy', 'since', 'John', 'has', 'kept', 'saloon', '.', "''", '``', 'Well', ',', 'I', 'would', 'never', 'let', 'any', 'woman', 'lead', 'me', 'by', 'the', 'nose', '.', 'I', 'would', 'let', 'her', 'know', 'that', 'as', 'the', 'living', 'comes', 'by', 'me', ',', 'the', 'way', 'of', 'getting']


In [48]:
# YOUR CODE HERE
# spacy
nlp = spacy.load("en_core_web_sm")
# Pass the example into the English pipeline
spacy_sowing = nlp(sowing_and_reaping)
spacy_tokens = [token.text for token in spacy_sowing]
print(spacy_tokens[:100])

['I', 'hear', 'that', 'John', 'Andrews', 'has', 'given', 'up', 'his', 'saloon', ';', 'and', 'a', 'foolish', 'thing', '\n', 'it', 'was', '.', 'He', 'was', 'doing', 'a', 'splendid', 'business', '.', 'What', 'could', 'have', 'induced', 'him', '?', '"', '\n\n', '"', 'They', 'say', 'that', 'his', 'wife', 'was', 'bitterly', 'opposed', 'to', 'the', 'business', '.', 'I', 'do', "n't", '\n', 'know', ',', 'but', 'I', 'think', 'it', 'quite', 'likely', '.', 'She', 'has', 'never', 'seemed', 'happy', 'since', 'John', '\n', 'has', 'kept', 'saloon', '.', '"', '\n\n', '"', 'Well', ',', 'I', 'would', 'never', 'let', 'any', 'woman', 'lead', 'me', 'by', 'the', 'nose', '.', 'I', 'would', 'let', 'her', '\n', 'know', 'that', 'as', 'the', 'living', 'comes']


In [49]:
# YOUR CODE HERE
# nltk: unique words with set
unique_nltk = set(nltk_sowing)
sorted_nltk = sorted(unique_nltk)

In [53]:
# YOUR CODE HERE
# spacy: unique words with set
unique_spacy = set(spacy_tokens)
sorted_spacy = sorted(unique_spacy)

In [50]:
# YOUR CODE HERE
# nltk: first 100 entries of the sorted vocabulary
print(sorted_nltk[:100])

['!', '#', '$', '%', '&', "'", "''", "'AS-IS", "'Do", "'He", "'Here", "'Hurrah", "'Mary", "'My", "'One", "'Paul", "'That", "'What", "'Young", "'and", "'bout", "'business", "'d", "'fail", "'is", "'ll", "'one", "'our", "'re", "'right", "'s", "'spec", "'the", "'ve", '(', ')', '*', ',', '-', '--', '.', '._', '//gutenberg.net/license', '//pglaf.org', '//pglaf.org/donate', '//pglaf.org/fundraising', '//www.gutenberg.net', '//www.gutenberg.net/1/0/2/3/10234', '//www.gutenberg.net/1/1/0/2/11022', '//www.gutenberg.net/2/4/6/8/24689', '//www.gutenberg.net/GUTINDEX.ALL', '//www.ibiblio.org/gutenberg/etext06', '//www.pglaf.org', '/etext', '00', '01', '02', '03', '04', '05', '1', '1.A', '1.B', '1.C', '1.D', '1.E', '1.E.1', '1.E.2', '1.E.3', '1.E.4', '1.E.5', '1.E.6', '1.E.7', '1.E.8', '1.E.9', '1.F', '1.F.1', '1.F.2', '1.F.3', '1.F.4', '1.F.5', '1.F.6', '10', '10000', '10234', '11022.txt', '11022.zip', '1500', '16th', '2', '20', '200', '2001', '2003', '24689', '25', '3', '30', '4', '4557']


In [54]:
# YOUR CODE HERE
# spacy: first 100 entries of the sorted vocabulary
print(sorted_spacy[:100])

['\n', '\n\n', '\n\n\n', '\n\n\n\n', '\n\n\n\n\n', '\n\n ', '\n\n  ', '\n\n    ', '\n\n       ', '\n  ', '\n   ', '\n     ', ' ', '      ', '!', '"', '"[3', '"[6', '#', '$', '%', '&', "'", "'bout", "'d", "'ll", "'re", "'s", "'ve", '(', ')', '*', ',', '-', '--', '.', '/', '/etext', '00', '01', '02', '03', '04', '05', '1', '1.A.', '1.B.', '1.C', '1.C.', '1.D.', '1.E', '1.E.', '1.E.1', '1.E.2', '1.E.3', '1.E.4', '1.E.5', '1.E.6', '1.E.7', '1.E.8', '1.E.9', '1.F.', '1.F.1', '1.F.2', '1.F.3', '1.F.4', '1.F.5', '1.F.6', '10', '10000', '10234', '11022.txt', '11022.zip', '1500', '16th', '1887', '2', '20', '200', '2001', '2003', '24689', '25', '3', '30', '4', '4557', '5', '5,000', '50', '500', '501(c)(3', '596', '6', '60', '6221541', '64', '7', '8', '801']


In [52]:
# YOUR CODE HERE
# nltk: last 100 entries of the sorted vocabulary
print(sorted_nltk[-100:])

['wifely', 'wild', 'wilderness', 'will', 'willed', 'willing', 'win', 'wind', 'winds', 'wine', 'wings', 'wins', 'winter', 'wise', 'wisely', 'wiser', 'wish', 'wished', 'wishing', 'with', 'within', 'without', 'witness', 'witnessed', 'witnesses', 'witty', 'wives', 'wizard', 'wo', 'wolf', 'woman', "woman's", 'womanhood', 'womanly', 'women', 'won', 'wonder', 'wonderful', 'woods', 'woolen', 'word', 'words', 'wore', 'work', 'worked', 'working', 'works', 'world', "world's", 'worldly', 'worm', 'worn', 'worrying', 'worse', 'worst', 'worth', 'worthless', 'worthy', 'would', "wouldn't", 'wound', 'wove', 'woven', 'wrath', 'wreath', 'wreck', 'wrecked', 'wrecks', 'wretch', 'wretched', 'wretchedness', 'wrinkles', 'write', 'writer', 'writes', 'writing', 'written', 'wrong', 'wronged', 'wrote', 'www.gutenberg.net', 'year', 'yearned', 'years', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'you', 'young', 'younger', 'your', 'yours', 'yourself', 'yourselves', 'youth', 'zealous', 'zipped']


In [55]:
# YOUR CODE HERE
# spacy: last 100 entries of the sorted vocabulary
print(sorted_spacy[-100:])

['wifely', 'wild', 'wilderness', 'will', 'willed', 'willing', 'win', 'wind', 'winds', 'wine', 'wings', 'wins', 'winter', 'wise', 'wisely', 'wiser', 'wish', 'wished', 'wishing', 'with', 'within', 'without', 'witness', 'witnessed', 'witnesses', 'witty', 'wives', 'wizard', 'wo', 'wolf', 'woman', 'womanhood', 'womanly', 'women', 'won', 'wonder', 'wonderful', 'woods', 'woolen', 'word', 'words', 'words--_getting', 'wore', 'work', 'worked', 'working', 'works', 'world', 'worldly', 'worm', 'worn', 'worrying', 'worse', 'worst', 'worth', 'worthless', 'worthy', 'would', 'wound', 'wove', 'woven', 'wrath', 'wreath', 'wreck', 'wrecked', 'wrecks', 'wretch', 'wretched', 'wretchedness', 'wrinkles', 'write', 'writer', 'writes', 'writing', 'written', 'wrong', 'wronged', 'wrote', 'www.gutenberg.net', 'year', 'yearned', 'years', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'you', 'you--', 'you[r', 'young', 'younger', 'your', 'yours', 'yourself', 'yourselves', 'youth', 'zealous', 'zipped']
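
The vocabularies above still contain punctuation, numbers, and (for spaCy) blank space tokens such as `'\n'`. One possible cleanup, sketched below on a short made-up token list rather than the full novel, keeps only alphabetic tokens, lowercases them, and also counts how often each word occurs.

```python
from collections import Counter

# Hypothetical token list; in the challenge this would be nltk_sowing or spacy_tokens
sample_tokens = ['I', 'hear', 'that', 'John', '\n', 'has', 'kept', 'saloon', '.', 'I', 'hear', '--']

# Keep alphabetic tokens only and lowercase them
words = [token.lower() for token in sample_tokens if token.isalpha()]

vocabulary = sorted(set(words))
counts = Counter(words)

print(vocabulary)
print(counts.most_common(2))
```

Filtering with `isalpha()` is a blunt instrument: it also drops tokens like `woman's` or `16th`, so whether it is appropriate depends on your analysis.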


## Removing Stop Words

Text often contains words that are very common but carry little information. These tend to be function words such as articles and pronouns: "the", "a", "it", "them", etc. We may wish to remove these "stop words" before performing computation.

In practice, this is simple to do: we just filter out any token that appears in a list of stop words. However, we may want to use different "stop word lists", depending on our use case. For example, `nltk` has a stop word list:

In [56]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [57]:
# What kinds of words are in the list?
print(stop[:50])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


In [58]:
# Remove tokens that are stop words
tokens = [token for token in tokens if token not in stop]
print(tokens)

['In', 'little', 'example,', "we're", 'going', 'see', 'problems', 'regularly', 'appear', 'tokenization.', 'Tokenization', 'may', 'seem', 'simple,', 'harder', 'first', 'appears.', 'Why', 'hard?', 'Punctuations,', 'contractions', '(like', "don't,", "would've)", 'get', 'way.', 'What', '#hashtags,', '@TwitterHandles,', 'https://urls.com?', 'Different', 'packages', 'make', 'different', 'decisions', 'split', 'text', 'apart,', 'to.']


In [59]:
# Compare to the original text
print(example2)

In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. What do you do when you have #hashtags, @TwitterHandles, or https://urls.com? Different packages will make different decisions on when to split text apart, and when not to.
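
Notice that 'In', 'Why', and 'What' survived the filter above: the stop word entries printed earlier are all lowercase, and the membership test is case-sensitive. Here is a minimal sketch of one way around this. It uses a small hand-written set, `toy_stop_words`, which is a hypothetical stand-in for the full `nltk` list, and compares lowercased tokens against it:

```python
# Hypothetical miniature stop word list; in the notebook you would use `stop` from nltk
toy_stop_words = {"in", "this", "we", "to", "the", "is", "it", "so", "why", "what"}

sample_tokens = "In this example we see why it is hard".split()

# Compare the lowercased token against the list, but keep the original token text
filtered = [tok for tok in sample_tokens if tok.lower() not in toy_stop_words]
print(filtered)
```

Storing the stop words in a `set` rather than a list also makes each membership check faster, which can matter on long texts.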



## Stemming and Lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. Many words consist of a "core" word with a modified ending that adjusts the word's meaning in a given context. For example, the word "grows" is simply "grow" with an "s" added to mark the third-person singular form of the verb. In many cases, we're interested in the core content of the word. Stemming and lemmatization are both ways of getting at the "core" of a word. This "core" component is often referred to as the *lemma*.

Stemming is a rudimentary approach to obtaining the lemma: it simply strips the ending off a word according to fixed rules. So, "grows" would be stemmed to "grow". The word "running" would be stemmed to "run".

Lemmatization is more general: it aims to find the lemma of a word, but can handle cases where stemming may not work. For example, the word "fairies" cannot be stemmed to the lemma, "fairy". So, we need additional rules - provided by lemmatization - that can appropriately turn "fairies" into "fairy".
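
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. The function `toy_stem` is hypothetical and is not how the Snowball Stemmer works internally; it only strips one common suffix and undoes a doubled consonant:

```python
def toy_stem(word):
    """Strip one common suffix; a naive stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for word in ["grows", "running", "fairies", "wolves"]:
    print(word, "->", toy_stem(word))
```

This toy version turns "grows" into "grow" and "running" into "run", but it leaves "fairies" as "fairi" and "wolves" as "wolv". Chopping endings alone cannot recover "fairy" or "wolf", which is the gap lemmatization is meant to fill.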

`nltk` provides many algorithms for stemming. We'll use the Snowball Stemmer, which we'll import from `nltk`. We'll also look at the WordNet Lemmatizer:

In [60]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer

In [65]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/blueraspberry/nltk_data...


True

In [66]:
# Instantiate the stemmer and lemmatizer
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [67]:
# Stemming examples
print(stemmer.stem('grows'))
print(stemmer.stem('running'))
print(stemmer.stem('coded'))

grow
run
code


In [68]:
# When does stemming not quite work?
print(stemmer.stem('fairies'))
print(stemmer.stem('wolves'))
print(stemmer.stem('abaci'))
print(stemmer.stem('leaves'))
print(stemmer.stem('carried'))

fairi
wolv
abaci
leav
carri


In [69]:
# Let's try lemmatizing these, instead:
print(lemmatizer.lemmatize('fairies'))
print(lemmatizer.lemmatize('wolves'))
print(lemmatizer.lemmatize('abaci'))
print(lemmatizer.lemmatize('leaves'))
print(lemmatizer.lemmatize('carried'))

fairy
wolf
abacus
leaf
carried


What happened with that last one? The lemmatizer treats every word as a noun unless told otherwise, so the verb "carried" was left unchanged. We can pass a part-of-speech tag through the `pos` argument to resolve cases like this:

In [70]:
print(lemmatizer.lemmatize('carried', pos='v'))

carry


Try it with "leaves", which has more than one way to lemmatize!

In [71]:
print(lemmatizer.lemmatize('leaves', pos='n'))
print(lemmatizer.lemmatize('leaves', pos='v'))

leaf
leave
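
When lemmatizing a whole text, the part-of-speech tags usually come from a tagger rather than being typed by hand. Taggers such as `nltk.pos_tag` emit Penn Treebank tags ('NNS', 'VBD', and so on), while the lemmatizer expects single letters like 'n' and 'v'. Here is a minimal sketch of a conversion helper. The name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun (assumption)


for tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With a helper like this, each tagged token could be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.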


---

### Challenge 5: Apply a Lemmatizer to Text

Lemmatize the tokenized `example2` text using `nltk`'s `WordNetLemmatizer`.

---

In [76]:
# YOUR CODE HERE
# Lemmatize the `tokens` list from above (example2 with stop words removed)
example2_lemma = [lemmatizer.lemmatize(token) for token in tokens]

In [77]:
print(example2_lemma)

['In', 'little', 'example,', "we're", 'going', 'see', 'problem', 'regularly', 'appear', 'tokenization.', 'Tokenization', 'may', 'seem', 'simple,', 'harder', 'first', 'appears.', 'Why', 'hard?', 'Punctuations,', 'contraction', '(like', "don't,", "would've)", 'get', 'way.', 'What', '#hashtags,', '@TwitterHandles,', 'https://urls.com?', 'Different', 'package', 'make', 'different', 'decision', 'split', 'text', 'apart,', 'to.']


In [78]:
print(example2)

In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. What do you do when you have #hashtags, @TwitterHandles, or https://urls.com? Different packages will make different decisions on when to split text apart, and when not to.



---

### Challenge 6: Putting it All Together

Write a function called `preprocess()` that accepts a string and performs the following preprocessing steps:

* Lowercase text.
* Replace all URLs and numbers with their respective tokens.
* Strip blankspace.
* Tokenize.
* Remove punctuation.
* Remove stop words.
* Lemmatize the tokens.

Apply this function to `sowing_and_reaping`.

---

In [79]:
def preprocess(text):
    # YOUR CODE HERE
    """Preprocesses a string."""
    # Lowercase
    text = text.lower()
    # Replace URLs
    url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
    url_repl = ' URL '
    text = re.sub(url_pattern, url_repl, text)
    # Replace digits
    digit_pattern = r'\d+'
    digit_repl = ' DIGIT '
    text = re.sub(digit_pattern, digit_repl, text)
    # Remove blank spaces
    blankspace_pattern = r'\s+'
    blankspace_repl = ' '
    text = re.sub(blankspace_pattern, blankspace_repl, text).strip()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove punctuation
    tokens = [token for token in tokens if token not in punctuation]
    # Remove stop words
    stop = stopwords.words('english')
    tokens = [token for token in tokens if token not in stop]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

In [85]:
preprocess(sowing_and_reaping[:500])

['hear',
 'john',
 'andrew',
 'given',
 'saloon',
 'foolish',
 'thing',
 'splendid',
 'business',
 'could',
 'induced',
 "''",
 '``',
 'say',
 'wife',
 'bitterly',
 'opposed',
 'business',
 "n't",
 'know',
 'think',
 'quite',
 'likely',
 'never',
 'seemed',
 'happy',
 'since',
 'john',
 'kept',
 'saloon',
 "''",
 '``',
 'well',
 'would',
 'never',
 'let',
 'woman',
 'lead',
 'nose',
 'would',
 'let',
 'know',
 'living',
 'come',
 'way',
 'getting',
 'affair',
 'long',
 'well',
 'provided',
 "''",
 '``']
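
The output above still contains quote tokens such as `''`. Assuming `punctuation` is the string `string.punctuation`, the test `token not in punctuation` is a substring check, so it only removes tokens that appear verbatim inside that string. A token made of two quote characters does not, and so it slips through. Here is a minimal sketch of a stricter check that drops any token made up entirely of punctuation characters (`is_punctuation` is a hypothetical helper name):

```python
from string import punctuation


def is_punctuation(token):
    """Return True if every character in the token is a punctuation mark."""
    return all(char in punctuation for char in token)


sample_tokens = ["hear", "''", "``", "n't", ",", "saloon"]
cleaned = [tok for tok in sample_tokens if not is_punctuation(tok)]
print(cleaned)
```

Tokens such as "n't" are kept because they contain letters. Whether to keep or drop them is a separate decision that depends on the analysis.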

## Powerful Features of `spaCy`

We will end this portion of the workshop by examining some of the more powerful features offered by a newer NLP library, `spaCy`. Besides being quite fast, `spaCy` bundles powerful tools into its processing pipeline. For example, we automatically get many of the above operations in one fell swoop:

In [86]:
short_example = "We're learning about natural language processing at Berkeley."
doc = nlp(short_example)

for token in doc:
    print(
        f"Token: {token.text}; Lemma: {token.lemma_}; Part-of-speech: {token.pos_}; "
        f"Token shape: {token.shape_}; Alphabetical? {token.is_alpha}; Stop Word? {token.is_stop}"
    )

Token: We; Lemma: we; Part-of-speech: PRON; Token shape: Xx; Alphabetical? True; Stop Word? True
Token: 're; Lemma: be; Part-of-speech: AUX; Token shape: 'xx; Alphabetical? False; Stop Word? True
Token: learning; Lemma: learn; Part-of-speech: VERB; Token shape: xxxx; Alphabetical? True; Stop Word? False
Token: about; Lemma: about; Part-of-speech: ADP; Token shape: xxxx; Alphabetical? True; Stop Word? True
Token: natural; Lemma: natural; Part-of-speech: ADJ; Token shape: xxxx; Alphabetical? True; Stop Word? False
Token: language; Lemma: language; Part-of-speech: NOUN; Token shape: xxxx; Alphabetical? True; Stop Word? False
Token: processing; Lemma: processing; Part-of-speech: NOUN; Token shape: xxxx; Alphabetical? True; Stop Word? False
Token: at; Lemma: at; Part-of-speech: ADP; Token shape: xx; Alphabetical? True; Stop Word? True
Token: Berkeley; Lemma: Berkeley; Part-of-speech: PROPN; Token shape: Xxxxx; Alphabetical? True; Stop Word? False
Token: .; Lemma: .; Part-of-speech: PUNCT; T

Tokenization, lemmatization, part-of-speech tagging, stop word detection, and a few other things are provided up front when we pass text through the `nlp` pipeline.
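
The "token shape" column may be the least familiar of these attributes. It summarizes a token's orthographic pattern: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, and long runs of the same symbol are cut short. The function below is a rough approximation of that idea, written to reproduce the shapes printed above. `rough_shape` is a hypothetical name, and this sketch is not `spaCy`'s own implementation:

```python
def rough_shape(text):
    """Approximate a token's shape: X for uppercase, x for lowercase, d for digits."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        # Count how long the current run of identical symbols is
        run_length = run_length + 1 if symbol == previous else 1
        previous = symbol
        # Keep at most four symbols per run so long words collapse
        if run_length <= 4:
            shape.append(symbol)
    return "".join(shape)


for word in ["We", "'re", "learning", "Berkeley"]:
    print(word, "->", rough_shape(word))
```

Shapes like these can serve as features that group together tokens with similar capitalization or digit patterns, regardless of the specific letters involved.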

`spaCy` also comes with some pretty shiny visualization tools:

In [87]:
from spacy import displacy
displacy.render(doc, style="dep", options={'compact': True})

For longer texts, we also get the ability to perform a variety of other operations very easily. Here are some cases:

In [88]:
example3_path = '../data/example3.txt'

with open(example3_path, 'r') as file:
    example3 = file.read()
    
doc = nlp(example3)

In [89]:
print(example3)

D-Lab helps Berkeley faculty, staff, and students move forward with world-class research in data intensive social science. We think of data as an expansive category, one that is constantly changing as the research frontier moves. We offer a venue for methodological exchange from all corners of campus and across its bounds. 

D-Lab provides cross-disciplinary resources for in-depth consulting and advising access to staff support and training and provisioning for software and other infrastructure needs. Networking with other Berkeley centers and facilities and with our departments and schools, we offer our services to researchers across the disciplines and underwrite the breadth of excellence of Berkeley’s graduate programs and faculty research. D-Lab builds networks that Berkeley researchers can connect with users of social science data in the off-campus world.


In [90]:
# Sentence segmentation
print('Sentence Segmentation')
for sentence in doc.sents:
    print(sentence)

# Entity detection
print('\nEntity Detection:')
for entity in doc.ents:
    print(entity.text, entity.label_)

# Noun chunks
print('\nNoun Chunks:')
for chunk in doc.noun_chunks:
    print(chunk)

Sentence Segmentation
D-Lab helps Berkeley faculty, staff, and students move forward with world-class research in data intensive social science.
We think of data as an expansive category, one that is constantly changing as the research frontier moves.
We offer a venue for methodological exchange from all corners of campus and across its bounds. 


D-Lab provides cross-disciplinary resources for in-depth consulting and advising access to staff support and training and provisioning for software and other infrastructure needs.
Networking with other Berkeley centers and facilities and with our departments and schools, we offer our services to researchers across the disciplines and underwrite the breadth of excellence of Berkeley’s graduate programs and faculty research.
D-Lab builds networks that Berkeley researchers can connect with users of social science data in the off-campus world.

Entity Detection:
Berkeley GPE
Berkeley ORG
Berkeley ORG
Berkeley ORG

Noun Chunks:
D-Lab
Berkeley facu

There's a lot more we can do with `spaCy`! Check out its documentation to see more.