# [LEGALST-190] Preprocessing Text - Lab 3-1

---

This lab will provide an introduction to manipulating strings and chunking sentences.

*Estimated Time: 30-40 minutes*

---

### Topics Covered
- How to tokenize text
- How to stem text
- How to chunk text

### Table of Contents

[The Data](#section data)<br>

1 - [Tokenization](#section 1)<br>

2 - [Stemming](#section 2)<br>

3 - [Chunking](#section 3)<br>


In [1]:
! pip install nltk



---

## The Data <a id='data'></a>


In this notebook, you'll be working with the text of each country’s statement from the General Debate in annual sessions of the United Nations General Assembly. This dataset is separated by country, session and year and tagged for each, and has over forty years of data from different countries.



### Visualizing data

Run the below cells and take a look at a sample of the data that we'll be working with.

In [2]:
import pandas as pd
data = pd.read_csv("../data/un-general-debates.zip", compression='zip')

In [3]:
data.head()

Unnamed: 0,session,year,country,text
0,44,1989,MDV,﻿It is indeed a pleasure for me and the member...
1,44,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ..."
2,44,1989,NER,"﻿\nMr. President, it is a particular pleasure ..."
3,44,1989,URY,﻿\nDuring the debate at the fortieth session o...
4,44,1989,ZWE,﻿I should like at the outset to express my del...


## Tokenization  <a id='section 1'></a>

Tokenization is defined as <b>the process of segmenting running text into words and sentences</b>.


### Why do we need to tokenize text

Electronic text is a linear sequence of symbols. Before any processing is to be done, text needs to be segmented into linguistic units, and this process is called tokenization.

We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call "tokens".

### How to tokenize

You might imagine that the easiest way to identify sentences is to split the document at every period '.', and to split the sentences using white space to get the words.

In [4]:
# using the split function to create tokens
paragraph = data['text'][0]
sentences = paragraph.split(".")
for s in sentences[:5]:
    print(s + '\n')

﻿It is indeed a pleasure for me and the members of my delegation to extend to Ambassador Garba our sincere congratulations on his election to the presidency of the forty-fourth session of the General Assembly

 His election to this high office is a well-deserved tribute to his personal qualities and experience

 I am fully confident that under his able and wise leadership the Assembly will further consolidate the gains achieved during the past year


My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr

 Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly



Then to split sentences further into words.

In [5]:
sentence = "What kind of patterns do you see in this graph?"
tokens = sentence.split(" ")
tokens

['What', 'kind', 'of', 'patterns', 'do', 'you', 'see', 'in', 'this', 'graph?']

We'll stop here as NLTK provides handy tools for us to use.

### NLTK

NLTK (Natural Language Toolkit) is a platform for building Python programs to work with human language data

In [6]:
import nltk
# run the below commented command if error
#nltk.download('punkt')

In [7]:
# create sentence tokens
speech = data['text'][4]
sents = nltk.sent_tokenize(speech)
sents[:3]

["\ufeffI should like at the outset to express my delegation's satisfaction and pleasure at your election, Sir, to the presidency of the General Assembly at its forty-fourth session.",
 'The unanimity of that decision reflects not only your own distinguished record as Foreign Minister and Permanent Representative of your country to the United Nations but also the prestige of your country, Nigeria, of which all of us in Africa are proud.',
 'The outgoing President of the General Assembly, Mr. Dante Caputo of Argentina, shouldered the responsibility of his office with distinction in a momentous and difficult year.']

In [8]:
s4 = "At eight o'clock on Thursday morning Arthur didn't feel very good."
nltk.word_tokenize(s4)

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']

nltk recognized that "o'clock" is one word and separated "didn't" into "did" and "n't"

For more complicated metrics, it's easier to use NLTK's classes and methods.

In [9]:
# Find the 10 most common tokens
tokens = nltk.word_tokenize(speech)
fd = nltk.collocations.FreqDist(tokens)
fd.most_common()[:10]

[('the', 401),
 ('of', 213),
 (',', 180),
 ('to', 177),
 ('.', 175),
 ('and', 139),
 ('in', 106),
 ('that', 88),
 ('a', 70),
 ('is', 63)]

Not so interesting as the most common words seem to be words that have no particular meanings.

A common step in text analysis is to remove noise. *However*, what you deem "noise" is not only very important but also dependent on the project at hand. For the purposes of today, we will discuss two common categories of strings often considered "noise". 

- Punctuation: While important for sentence analysis, punctuation will get in the way of word frequency and n-gram analyses. They will also affect any clustering on topic modeling.

- Stopwords: Stopwords are the most frequent words in any given language. Words like "the", "a", "that", etc. are considered not semantically important, and would also skew any frequency or n-gram analysis.

<b>Question</b> Write a function below that takes a string as an argument and returns a list of words without punctuation or stopwords.

`punctuation` is a list of punctuation strings, and we have created the list `stop_words` for you.

Hint: first you'll want to remove punctuation, then tokenize, then remove stop words. Make sure you account for upper and lower case!

In [10]:
def rem_punc_stop(text):
    
    from string import punctuation
    from nltk.corpus import stopwords
    
    stop_words = set(stopwords.words("english"))
    punctuation = set(punctuation)
    
    punc_free = "".join([ch for ch in text if ch not in punctuation])
    
    words = nltk.word_tokenize(punc_free)
    
    noise_free = [word for word in words if word not in stop_words]
    
    return noise_free

In [11]:
from nltk.corpus import stopwords
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Now we can rerun our frequency analysis without the noise:

In [12]:
#nltk.download('stopwords')
tokens_reduced = rem_punc_stop(speech)
tokens_reduced
#type(tokens_reduced)
#fd_reduced = nltk.collocations.FreqDist(tokens_reduced)
#fd_reduced.most_common()[:10]

['\ufeffI',
 'like',
 'outset',
 'express',
 'delegations',
 'satisfaction',
 'pleasure',
 'election',
 'Sir',
 'presidency',
 'General',
 'Assembly',
 'fortyfourth',
 'session',
 'The',
 'unanimity',
 'decision',
 'reflects',
 'distinguished',
 'record',
 'Foreign',
 'Minister',
 'Permanent',
 'Representative',
 'country',
 'United',
 'Nations',
 'also',
 'prestige',
 'country',
 'Nigeria',
 'us',
 'Africa',
 'proud',
 'The',
 'outgoing',
 'President',
 'General',
 'Assembly',
 'Mr',
 'Dante',
 'Caputo',
 'Argentina',
 'shouldered',
 'responsibility',
 'office',
 'distinction',
 'momentous',
 'difficult',
 'year',
 'We',
 'wish',
 'acknowledge',
 'debt',
 'Our',
 'SecretaryGeneral',
 'Mr',
 'Javier',
 'Perez',
 'de',
 'Cuellar',
 'head',
 'Organization',
 'roost',
 'troubled',
 'also',
 'productive',
 'successful',
 'years',
 'The',
 'turnaround',
 'fortunes',
 'United',
 'Nations',
 'watch',
 'owes',
 'much',
 'skill',
 'helmsman',
 'want',
 'reassure',
 'continued',
 'confidence',
 

Now our analysis is much more informational and revealing.

<b>POS tagging</b> The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging.

In [13]:
#nltk.download('averaged_perceptron_tagger')
tagged = nltk.pos_tag(tokens[2:8])
tagged

[('like', 'IN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('outset', 'NN'),
 ('to', 'TO'),
 ('express', 'VB')]

## Stemming <a id='section 2'></a>

In NLP it is often the case that the specific form of a word is not as important as the idea to which it refers. For example, if you are trying to identify the topic of a document, counting 'running', 'runs', 'ran', and 'run' as four separate words is not useful. Reducing words to their stems is a process called stemming.

A popular stemming implementation is the Snowball Stemmer, which is based on the Porter Stemmer. Its algorithm looks at word forms and does things like drop final 's's, 'ed's, and 'ing's.

Just like the tokenizers, we first have to create a stemmer object with the language we are using. Refer to [this documentation](http://www.nltk.org/howto/stem.html) to create a snowball stemmer.

In [14]:
snowball = nltk.SnowballStemmer('english')

Now, we can try stemming some words

In [15]:
snowball.stem('running')

'run'

In [16]:
snowball.stem('eats')

'eat'

In [17]:
snowball.stem('embarassed')

'embarass'

Snowball is a very fast algorithm, but it has a lot of edge cases. In some cases, words with the same stem are reduced to two different stems

In [18]:
snowball.stem('cylinder'), snowball.stem('cylindrical')

('cylind', 'cylindr')

Sometimes two different words are reduced to the same stem.

In [19]:
snowball.stem('vacation'), snowball.stem('vacate')

('vacat', 'vacat')

<b>Question</b> How would the above two situations affect our text analysis

Your answer here

## Chunking<a id='section 3'></a>

We may want to work with larger segments of text than single words (but still smaller than a sentence). For instance, in the sentence "The black cat climbed over the tall fence", we might want to treat "The black cat" as one thing (the subject), "climbed over" as a distinct act, and "the tall fence" as another thing (the object). The first and third sequences are noun phrases, and the second is a verb phrase.

We can separate these phrases by "chunking" the sentence, i.e. splitting it into larger chunks than individual tokens. This is also an important step toward identifying entities, which are often represented by more than one word. You can probably imagine certain patterns that would define a noun phrase, using part of speech tags. For instance, a determiner (e.g. an article like "the") could be concatenated onto the noun that follows it. If there's an adjective between them, we can include that too.

To define rules about how to structure words based on their part of speech tags, we use a grammar (in this case, a "chunk grammar"). NLTK provides a RegexpParser that takes as input a grammar composed of regular expressions (which define patterns in text, we'll learn it in later labs). The grammar is defined as a string, with one line for each rule we define. Each rule starts with the label we want to assign to the chunk (e.g. NP for "noun phrase"), followed by a colon, then an expression in regex-like notation that will be matched to tokens' POS (part-of-speech) tags.

We can define a single rule for a noun phrase like this. The rule allows 0 or 1 determiner, then 0 or more adjectives, and finally at least 1 noun. (By using 'NN.*' as the last POS tag, we can match 'NN', 'NNP' for a proper noun, or 'NNS' for a plural noun.) If a matching sequence of tokens is found, it will be labeled 'NP'.

Take a look at different [POS tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [20]:
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

We create a chunk parser object by supplying this grammar, then use it to parse a sentence into chunks. The sentence we want to parse must already be POS-tagged, since our grammar uses those POS tags to identify chunks. Let's try this on the second sentence of the speech we generated above.

In [21]:
from nltk import RegexpParser

cp = RegexpParser(grammar)

sent_tagged = nltk.pos_tag(sents[1])
sent_chunked = cp.parse(sent_tagged)

print(sent_chunked)

(S
  (NP T/NNP h/NN e/NN  /NNP)
  u/JJ
  n/FW
  (NP a/DT n/NN i/NN)
  m/VBP
  (NP i/NN)
  t/VBP
  (NP y/NN  /NNP)
  o/VBZ
  (NP f/JJ  /NNP t/NN)
  h/VBD
  (NP a/DT t/NN  /NNP d/NN e/NN c/NN i/NN)
  s/VBP
  (NP i/NN)
  o/VBP
  (NP n/JJ  /NNP r/NN e/NN f/NN l/NN e/NN)
  c/VBP
  (NP t/NN s/NN  /NNP)
  n/CC
  (NP o/JJ t/NN  /NNP)
  o/VBZ
  (NP n/JJ l/NN y/NN  /NNP y/NNP)
  o/MD
  u/VB
  (NP r/NN  /NNP)
  o/VBZ
  (NP w/JJ n/JJ  /NNP d/NN i/NN)
  s/VBP
  (NP t/NN i/NN)
  n/VBP
  (NP g/NN u/NN i/NN)
  s/VBP
  (NP h/NN e/NN d/NN  /NNP r/NN e/NN c/NN o/NN r/NN d/NN)
   /VBZ
  (NP a/DT s/JJ  /NN F/NNP)
  o/VBZ
  (NP r/NN e/NN i/NN)
  g/VBP
  (NP n/JJ  /NNP M/NNP i/NN)
  n/VBP
  (NP i/NN)
  s/VBP
  (NP t/NN e/NN r/NN)
   /VBZ
  (NP a/DT n/JJ d/NN  /NNP P/NNP e/NN r/NN)
  m/VBD
  (NP a/DT n/JJ e/NN)
  (NP n/JJ t/NN  /NNP R/NNP e/NN p/NN r/NN e/NN s/NN e/NN)
  n/JJ
  t/VBZ
  (NP a/DT t/NN i/NN)
  v/VBP
  (NP e/NN  /NNP)
  o/VBZ
  (NP f/JJ  /NNP y/NN o/NN)
  (NP u/JJ r/NN  /NNP)
  c/VBZ
  (NP
    o/

When we called print() on this chunked sentence, it printed out a nested list of nodes.

In [None]:
type(sent_chunked)

nltk.tree.Tree

The tree object has a number of methods we can use to interact with its components. For instance, we can use the method draw() to see a more graphical representation. This will open a separate window.

The tree is pretty flat, because we defined a grammar that only grouped words into non-overlapping noun phrases, with no additional hierarchy above them. This is sometimes referred to as "shallow parsing".

In [None]:
sent_chunked.draw()

## Combining it all

Write a function that takes in a strubg, tokenizes it, removes noise, turns everything to lower-case, and returns a string of stems of all tokens.

Hint: any function from above that we can just grab and use?

As a reminder, this is what our table looks like

In [None]:
data.head()

In [None]:
def does_it_all(text):
    tokens = ""
    
    not_stemmed = rem_punc_stop(text)
    stemmed = [snowball.stem(word.lower()) for word in not_stemmed]
    for word in not_stemmed:
        tokens += snowball.stem(word) + " "
    
    return tokens

Let's apply our function to speeches from 2001.

First create a table that includes all 2001 speeches. Refer to [this doc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)

In [None]:
speech_2001 = data.loc[data['year'] == 2001]
speech_2001.head()

Then create a new column in speech_2001 which contains the tokenized string.

In [None]:
speech_2001_with_tokens = speech_2001.copy()
speech_2001_with_tokens['tokens'] = speech_2001['text'].apply(does_it_all)
speech_2001_with_tokens

Congratulations! You've learned tokenizing, stemming, and chunking texts.

---
Notebook developed by: Tian Qin

Data Science Modules: http://data.berkeley.edu/education/modules
