# Working with corpora and individual texts

(Recap - Roadmap for this section of the module)

* Lecture 14 Intro, two approaches to NLP - a quick summary, Linguistics - syntax vs semantics, intro to nltk/basic text processing/descriptive statistics

* **Lecture 15 - SYNTAX text processing and corpora processing, preprocessing pipeline, analysis vs generation, class on processing corpora** 

* Lecture 16 SEMANTICS information extraction for classification  + named entity recognition 
* Lecture 17 SEMANTICS topic modelling - class on topic modelling and NER
* Lecture 18 SEMANTICS sentiment analysis https://www.nltk.org/howto/sentiment.html https://www.nltk.org/api/nltk.sentiment.html

In [1]:
#let's get started 
import nltk

# this will run on jupyter.kent.ac.uk 

# if working on a different jupyter server, e.g. anaconda - you may need to try:
# !pip install nltk / %pip install nltk  / !conda install nltk / %conda install nltk
# or try these commands (minus ! or %) in the anaconda command prompt

The pre-processing pipeline we covered in Monday's lecture was:
* From text to tokens: tokenization
 * to determine the units that will be annotated
* Remove case differences (lower() or similar)
 * to convert everything to the same case
* From tokens to lemmas: stemming and lemmatization
 * to mark each token with its dictionary form or short form
* From tokens to part-of-speech tags: PoS tagging
 * to mark grammatical information about each token
* Parsing
 * to identify grammatical structure
(* Etc…)
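
The first three steps of this pipeline can be chained in a few lines. This is only a sketch: it uses `wordpunct_tokenize` (a regular-expression tokenizer that needs no downloaded data) as a stand-in for `nltk.word_tokenize`, and the sample sentence is made up for illustration.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import PorterStemmer

# Made-up sample sentence (an assumption, not taken from the corpus).
sample = "The bears were eating porridge."

# 1. Tokenization: split the text into word and punctuation tokens.
sample_tokens = wordpunct_tokenize(sample)

# 2. Case folding: convert every token to lower case.
sample_lower = [t.lower() for t in sample_tokens]

# 3. Stemming: reduce each token to a short form.
stemmer = PorterStemmer()
sample_stems = [stemmer.stem(t) for t in sample_lower]

print(sample_tokens)
print(sample_lower)
print(sample_stems)
```

Each step takes the previous step's list as input, so the three lists always have the same length.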


In [2]:
# this next step may take some time
# you should only have to do this step once ...
# so if you've already done it, for the previous lab class, on this jupyter server,
# you probably don't need to do it again

# nltk.download()
#hit d for download in the dialogue that comes up
#when you are asked which package to download, type: book
#when it has finished, type q to quit

In [3]:
# Let's use the guide from https://www.nltk.org/book/ch03.html 
# to get a text to play with. 

from urllib import request

url = "https://www.gutenberg.org/cache/epub/23322/pg23322.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

In [4]:
raw

'\ufeffThe Project Gutenberg EBook of The Three Bears, by Anonymous\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: The Three Bears\r\n\r\nAuthor: Anonymous\r\n\r\nRelease Date: November 4, 2007 [EBook #23322]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE THREE BEARS ***\r\n\r\n\r\n\r\n\r\nProduced by Jacqueline Jeremy and the Online Distributed\r\nProofreading Team at http://www.pgdp.net (This file was\r\nproduced from images generously made available by The\r\nInternet Archive/American Libraries.)\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n[Illustration: Cover]\r\n\r\n\r\n_THE STORY OF_ THE THREE BEARS.\r\n\r\n\r\n    There were once three bears, who lived in a wood,\r\n    Their porridge was thick, and their chairs a

In [5]:
print(raw)

﻿The Project Gutenberg EBook of The Three Bears, by Anonymous

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: The Three Bears

Author: Anonymous

Release Date: November 4, 2007 [EBook #23322]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE THREE BEARS ***




Produced by Jacqueline Jeremy and the Online Distributed
Proofreading Team at http://www.pgdp.net (This file was
produced from images generously made available by The
Internet Archive/American Libraries.)










[Illustration: Cover]


_THE STORY OF_ THE THREE BEARS.


    There were once three bears, who lived in a wood,
    Their porridge was thick, and their chairs and beds good.
    The biggest bear, Bruin, was surly and rough;
    His wife, Mrs. Bruin, 

The cells above should have loaded 'The Project Gutenberg EBook of The Three Bears, by Anonymous' into `raw`.

### 1. Double-check the first 100 characters of the text in `raw`

In [6]:
len(raw)

27577

In [7]:
raw[0:100]

'\ufeffThe Project Gutenberg EBook of The Three Bears, by Anonymous\r\n\r\nThis eBook is for the use of anyone'
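
The `'\ufeff'` at the start of `raw` is a byte order mark (BOM). It stays attached to the first token (`'\ufeffThe'`), which the tagger later labels `NN`. One possible fix is to decode with `'utf-8-sig'`, a standard Python codec that drops a leading BOM. The sketch below uses a short byte string as a stand-in for `response.read()`, so it runs without network access.

```python
# Stand-in for response.read(): UTF-8 bytes that start with a BOM.
sample_bytes = b"\xef\xbb\xbfThe Project Gutenberg EBook of The Three Bears"

with_bom = sample_bytes.decode("utf8")          # BOM survives as '\ufeff'
without_bom = sample_bytes.decode("utf-8-sig")  # BOM is dropped while decoding

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

In the notebook this would amount to `response.read().decode('utf-8-sig')`; the later counts would then change slightly, because `'\ufeffThe'` and `'The'` would no longer be separate tokens.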

* From text to tokens: tokenization

### 2. Tokenize the text in `raw` into a variable called `tokens`.
How many tokens are there in total?
And how many distinct tokens?

In [16]:
tokens = nltk.word_tokenize(raw)

In [20]:
tokens

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Three',
 'Bears',
 ',',
 'by',
 'Anonymous',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'You',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.org',
 'Title',
 ':',
 'The',
 'Three',
 'Bears',
 'Author',
 ':',
 'Anonymous',
 'Release',
 'Date',
 ':',
 'November',
 '4',
 ',',
 '2007',
 '[',
 'EBook',
 '#',
 '23322',
 ']',
 'Language',
 ':',
 'English',
 '*',
 '*',
 '*',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'THE',
 'THREE',
 'BEARS',
 '*',
 '*',
 '*',
 'Produced',
 'by',
 'Jacqueline',
 'Jeremy',
 'and',
 'the',
 'Online',
 'Distributed',
 'Proofreading',
 'Team',
 'at',
 

In [18]:
len(tokens)

5037

In [19]:
len(set(tokens))

1298

* Remove case differences (lower() or similar)

### 3. Use the `lower()` string method to convert all tokens to lower case, and store the result in a variable called `tokens_l`
How many tokens are there in total now?
And how many distinct tokens? 
Is this different from Q2? Inspect the variables to understand why/why not.
(The answer cells below name this variable `tokens_1`, with a digit one, rather than `tokens_l`.)




In [21]:
tokens_1 = [t.lower() for t in tokens]

In [22]:
tokens_1

['\ufeffthe',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'the',
 'three',
 'bears',
 ',',
 'by',
 'anonymous',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'you',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'project',
 'gutenberg',
 'license',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www.gutenberg.org',
 'title',
 ':',
 'the',
 'three',
 'bears',
 'author',
 ':',
 'anonymous',
 'release',
 'date',
 ':',
 'november',
 '4',
 ',',
 '2007',
 '[',
 'ebook',
 '#',
 '23322',
 ']',
 'language',
 ':',
 'english',
 '*',
 '*',
 '*',
 'start',
 'of',
 'this',
 'project',
 'gutenberg',
 'ebook',
 'the',
 'three',
 'bears',
 '*',
 '*',
 '*',
 'produced',
 'by',
 'jacqueline',
 'jeremy',
 'and',
 'the',
 'online',
 'distributed',
 'proofreading',
 'team',
 'at',
 

In [23]:
len(tokens_1)

5037

In [24]:
len(set(tokens_1))

1101

Yes. The total number of tokens is unchanged (5037), but the number of distinct tokens drops from 1298 to 1101, because words that differ only in case (e.g. 'The', 'the' and 'THE') were previously counted as separate tokens.
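
A tiny example makes the effect of case folding on the two counts visible. The list below is hand-built from tokens that occur in the text; it is an illustration, not a slice of `tokens`.

```python
# Hand-built sample: the same three words in different cases.
case_sample = ["The", "Three", "Bears", "the", "THREE", "BEARS", "bears"]
case_sample_lower = [t.lower() for t in case_sample]

# Total number of tokens: lowercasing never adds or removes items.
print(len(case_sample), len(case_sample_lower))

# Number of distinct tokens: case variants collapse into one entry.
print(len(set(case_sample)), len(set(case_sample_lower)))
```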

* From tokens to lemmas: stemming and lemmatization

In [26]:
#Let's use the Porter stemmer and the Lancaster stemmer (two popular stemmers):

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()



### 4. How many *distinct* stems are identified by the Porter stemmer?
Are there more or fewer than for the Lancaster stemmer?

In [28]:
# Let's try out the Porter stemmer first
porter_stems = [porter.stem(x) for x in tokens_1]
# How many distinct stems are identified?
print(len(set(porter_stems)))

947


In [30]:
lancaster_stems = [lancaster.stem(x) for x in tokens_1]

print(len(set(lancaster_stems)))

886


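The Porter stemmer identifies more distinct stems (947) than the Lancaster stemmer (886). The Lancaster stemmer is the more aggressive of the two, so it merges more word forms into the same stem.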
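
To see where the two stemmers differ, it can help to run both on a handful of words. The words below all occur in the lower-cased token list; the `_demo` variable names are introduced here only to avoid overwriting `porter` and `lancaster`.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter_demo = PorterStemmer()
lancaster_demo = LancasterStemmer()

# Words that occur in the lower-cased tokens of this text.
demo_words = ["this", "any", "copies", "distributed", "electronic"]

# Print each word next to the stem each stemmer produces for it.
for word in demo_words:
    print(f"{word:12} porter={porter_demo.stem(word):10} lancaster={lancaster_demo.stem(word)}")
```

Neither stemmer guarantees that a stem is a dictionary word, which is why forms such as 'thi' and 'ani' show up in the frequency lists.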

### 5. How many *distinct* lemmas are identified by the WordNet Lemmatizer?

In [31]:
wn_lemmatizer = nltk.stem.WordNetLemmatizer()

wn_lemmas = [wn_lemmatizer.lemmatize(x) for x in tokens_1]

In [32]:
wn_lemmas

['\ufeffthe',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'the',
 'three',
 'bear',
 ',',
 'by',
 'anonymous',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restriction',
 'whatsoever',
 '.',
 'you',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'term',
 'of',
 'the',
 'project',
 'gutenberg',
 'license',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www.gutenberg.org',
 'title',
 ':',
 'the',
 'three',
 'bear',
 'author',
 ':',
 'anonymous',
 'release',
 'date',
 ':',
 'november',
 '4',
 ',',
 '2007',
 '[',
 'ebook',
 '#',
 '23322',
 ']',
 'language',
 ':',
 'english',
 '*',
 '*',
 '*',
 'start',
 'of',
 'this',
 'project',
 'gutenberg',
 'ebook',
 'the',
 'three',
 'bear',
 '*',
 '*',
 '*',
 'produced',
 'by',
 'jacqueline',
 'jeremy',
 'and',
 'the',
 'online',
 'distributed',
 'proofreading',
 'team',
 'at',
 'http

In [33]:
print(len(set(wn_lemmas)))

1046
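
`lemmatize()` treats every word as a noun unless a `pos` argument is passed, which is probably why 'is' and 'included' come back unchanged in the output above. One common workaround is to translate the Penn Treebank tags produced by `nltk.pos_tag` (see Q7) into WordNet's one-letter codes. The helper below is a hypothetical sketch, not part of nltk, and runs without any downloaded data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter.

    Hypothetical helper for illustration; it is not part of nltk.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

# (token, tag) pairs copied from the tagger output in Q7.
sample_pairs = [("is", "VBZ"), ("included", "VBD"), ("restrictions", "NNS"), ("almost", "RB")]

for word, tag in sample_pairs:
    print(word, tag, penn_to_wordnet(tag))

# With the WordNet data installed, the mapping could be used like this:
# wn_lemmatizer.lemmatize("included", pos=penn_to_wordnet("VBD"))
```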


### 6. Create a Frequency Distribution of the lemmas from Q5. Create a second Frequency Distribution of one of the sets of stems from Q4. Compare the most common 30 from each FreqDist. 
Do you notice the differences between lemmas and stems?

In [34]:
from nltk import FreqDist

In [35]:
fdist_stems = FreqDist(porter_stems)

In [36]:
fdist_stems

FreqDist({',': 269, 'the': 233, '.': 200, 'of': 135, 'and': 122, 'to': 98, 'project': 87, 'a': 86, 'in': 84, 'or': 78, ...})

In [41]:
fdist_stems.most_common(30)

[(',', 269),
 ('the', 233),
 ('.', 200),
 ('of', 135),
 ('and', 122),
 ('to', 98),
 ('project', 87),
 ('a', 86),
 ('in', 84),
 ('or', 78),
 ('you', 77),
 ('work', 77),
 ('with', 59),
 ('gutenberg-tm', 56),
 ('thi', 54),
 ('*', 48),
 ('ani', 38),
 ("''", 35),
 ('is', 34),
 ('for', 31),
 ('not', 31),
 ('``', 31),
 ('she', 31),
 ('gutenberg', 30),
 ('bear', 30),
 ('it', 30),
 (':', 29),
 ('electron', 29),
 ('copi', 25),
 ('distribut', 25)]

In [37]:
fdist_lemmas = FreqDist(wn_lemmas)

In [38]:
fdist_lemmas

FreqDist({',': 269, 'the': 233, '.': 200, 'of': 135, 'and': 122, 'a': 100, 'to': 98, 'project': 87, 'in': 84, 'or': 78, ...})

In [42]:
fdist_lemmas.most_common(30)

[(',', 269),
 ('the', 233),
 ('.', 200),
 ('of', 135),
 ('and', 122),
 ('a', 100),
 ('to', 98),
 ('project', 87),
 ('in', 84),
 ('or', 78),
 ('you', 77),
 ('work', 77),
 ('with', 59),
 ('gutenberg-tm', 56),
 ('this', 54),
 ('*', 48),
 ('any', 38),
 ("''", 35),
 ('is', 34),
 ('for', 31),
 ('not', 31),
 ('``', 31),
 ('she', 31),
 ('gutenberg', 30),
 ('bear', 30),
 ('it', 30),
 (':', 29),
 ('electronic', 27),
 ('foundation', 25),
 ('by', 24)]
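
Several of the most frequent items above ('project', 'gutenberg-tm', 'electronic', 'foundation') appear to come from the Project Gutenberg header and licence text rather than from the story. The NLTK book chapter linked earlier trims this material with `str.find` before tokenizing. The sketch below follows that idea on a short made-up string; the function name is hypothetical, and the `"*** END OF"` marker is an assumption, since the end of the file is not shown in the output above.

```python
def strip_gutenberg_boilerplate(text, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the start and end marker lines (hypothetical helper)."""
    start = text.find(start_marker)
    if start != -1:
        # Skip to the end of the marker line so the marker itself is dropped.
        newline = text.find("\n", start)
        start = newline + 1 if newline != -1 else len(text)
    else:
        start = 0  # no start marker found: keep the text from the beginning
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no end marker found: keep the text to the end
    return text[start:end].strip()

# Made-up miniature of the file layout, for illustration only.
sample_raw = (
    "Title: The Three Bears\r\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK THE THREE BEARS ***\r\n"
    "\r\n"
    "There were once three bears\r\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ***\r\n"
    "Licence text"
)

print(strip_gutenberg_boilerplate(sample_raw))
```

Re-running the earlier steps on the trimmed text would give different counts from the ones recorded in this notebook.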

* From tokens to part-of-speech tags: PoS tagging

### 7. Apply part-of-speech tagging to (i) the `tokens_l` and (ii) `tokens` sets (i.e. the lower case tokens, and the tokens that haven't been converted to lower case). Store these in variables `pos_tokens_l` and `pos_tokens` respectively.

Do we get the same number of results for (i) and for (ii)? Why/Why not?


In [43]:
pos_tokens = nltk.pos_tag(tokens)

In [46]:
pos_tokens

[('\ufeffThe', 'NN'),
 ('Project', 'NNP'),
 ('Gutenberg', 'NNP'),
 ('EBook', 'NNP'),
 ('of', 'IN'),
 ('The', 'DT'),
 ('Three', 'NNP'),
 ('Bears', 'NNP'),
 (',', ','),
 ('by', 'IN'),
 ('Anonymous', 'NNP'),
 ('This', 'DT'),
 ('eBook', 'NN'),
 ('is', 'VBZ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('use', 'NN'),
 ('of', 'IN'),
 ('anyone', 'NN'),
 ('anywhere', 'RB'),
 ('at', 'IN'),
 ('no', 'DT'),
 ('cost', 'NN'),
 ('and', 'CC'),
 ('with', 'IN'),
 ('almost', 'RB'),
 ('no', 'DT'),
 ('restrictions', 'NNS'),
 ('whatsoever', 'RB'),
 ('.', '.'),
 ('You', 'PRP'),
 ('may', 'MD'),
 ('copy', 'VB'),
 ('it', 'PRP'),
 (',', ','),
 ('give', 'VB'),
 ('it', 'PRP'),
 ('away', 'RB'),
 ('or', 'CC'),
 ('re-use', 'VB'),
 ('it', 'PRP'),
 ('under', 'IN'),
 ('the', 'DT'),
 ('terms', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Project', 'NNP'),
 ('Gutenberg', 'NNP'),
 ('License', 'NNP'),
 ('included', 'VBD'),
 ('with', 'IN'),
 ('this', 'DT'),
 ('eBook', 'NN'),
 ('or', 'CC'),
 ('online', 'NN'),
 ('at', 'IN'),
 ('www.gutenbe

In [49]:
len(pos_tokens)

5037

In [54]:
pos_tokens_1 = nltk.pos_tag(tokens_1)

In [55]:
pos_tokens_1

[('\ufeffthe', 'NN'),
 ('project', 'NN'),
 ('gutenberg', 'JJ'),
 ('ebook', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('three', 'CD'),
 ('bears', 'NNS'),
 (',', ','),
 ('by', 'IN'),
 ('anonymous', 'JJ'),
 ('this', 'DT'),
 ('ebook', 'NN'),
 ('is', 'VBZ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('use', 'NN'),
 ('of', 'IN'),
 ('anyone', 'NN'),
 ('anywhere', 'RB'),
 ('at', 'IN'),
 ('no', 'DT'),
 ('cost', 'NN'),
 ('and', 'CC'),
 ('with', 'IN'),
 ('almost', 'RB'),
 ('no', 'DT'),
 ('restrictions', 'NNS'),
 ('whatsoever', 'RB'),
 ('.', '.'),
 ('you', 'PRP'),
 ('may', 'MD'),
 ('copy', 'VB'),
 ('it', 'PRP'),
 (',', ','),
 ('give', 'VB'),
 ('it', 'PRP'),
 ('away', 'RB'),
 ('or', 'CC'),
 ('re-use', 'VB'),
 ('it', 'PRP'),
 ('under', 'IN'),
 ('the', 'DT'),
 ('terms', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('project', 'NN'),
 ('gutenberg', 'NN'),
 ('license', 'NN'),
 ('included', 'VBD'),
 ('with', 'IN'),
 ('this', 'DT'),
 ('ebook', 'NN'),
 ('or', 'CC'),
 ('online', 'NN'),
 ('at', 'IN'),
 ('www.gutenberg.org',

In [57]:
len(pos_tokens_1)

5037

In [58]:
len(set(pos_tokens_1))

1233
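
There are more distinct (token, tag) pairs (1233) than distinct lower-case tokens (1101) because the same word can receive different tags in different contexts. The sketch below groups tags by word, using a few pairs copied from the outputs above.

```python
from collections import defaultdict

# A few (token, tag) pairs copied from the tagger outputs above.
sample_tagged = [
    ("produced", "VBD"),
    ("produced", "VBN"),
    ("gutenberg", "JJ"),
    ("gutenberg", "NN"),
    ("bears", "NNS"),
    ("bears", "NNS"),
]

# Collect the set of tags seen for each word.
tags_by_word = defaultdict(set)
for word, tag in sample_tagged:
    tags_by_word[word].add(tag)

# Words that received more than one tag.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}

print(len({w for w, _ in sample_tagged}))  # distinct words
print(len(set(sample_tagged)))             # distinct (word, tag) pairs
print(ambiguous)
```

The same loop could be run over `pos_tokens_1` to list every word that the tagger labels in more than one way.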

In [61]:
pos_tokens_1_set = list(set(pos_tokens_1))

In [62]:
pos_tokens_1_set

[('machine', 'NN'),
 ('she', 'PRP'),
 ('ein', 'NN'),
 ('quarto', 'NN'),
 ('bones', 'NNS'),
 ('future', 'NN'),
 ('granted', 'VBN'),
 ('back-grounds', 'NNS'),
 ('complete', 'NN'),
 ('frowning', 'NN'),
 ('charge', 'NN'),
 (']', 'VBP'),
 ('people', 'NNS'),
 ('includes', 'VBZ'),
 ('simon', 'NN'),
 ('shared', 'VBN'),
 ('nor', 'CC'),
 ('paragraphs', 'NN'),
 ('simple', 'JJ'),
 ('you', 'PRP'),
 ('hold', 'VB'),
 ('one', 'NN'),
 ('jumped', 'VBD'),
 ('next', 'JJ'),
 ('secure', 'NN'),
 ('irs', 'NNS'),
 ('release', 'NN'),
 ('array', 'NN'),
 ('roamed', 'VBN'),
 ('illuminated', 'VBD'),
 ('paragraph', 'JJ'),
 ('certain', 'JJ'),
 ('lieu', 'NN'),
 ('produced', 'VBD'),
 ('copies', 'NNS'),
 ('find', 'VB'),
 ('daring', 'VBG'),
 ('roar', 'NN'),
 ('produced', 'VBN'),
 ('address', 'NN'),
 ('michael', 'JJ'),
 ('refund', 'VB'),
 ('enough', 'RB'),
 ('comply', 'VB'),
 ('be', 'VB'),
 ('expense', 'NN'),
 ('set', 'VBN'),
 ('except', 'IN'),
 ('following', 'JJ'),
 ('swamp', 'VBP'),
 ('asleep', 'JJ'),
 ('replace', 'VB')

### 8. Remove all stopwords from the part-of-speech tagged lower-case tokens (`pos_tokens_l`) and save the results in a variable `pos_tokens_l_no_stop`
Generate and print a list of all distinct nouns in `pos_tokens_l_no_stop` 

In [63]:
from nltk.corpus import stopwords

In [64]:
stop_words = set(stopwords.words('english'))  # new name, so the imported corpus reader is not overwritten

In [65]:
# Keep the distinct (token, tag) pairs whose token is not a stopword
pos_tokens_1_no_stop = [pair for pair in pos_tokens_1_set if pair[0] not in stop_words]

In [67]:
len(pos_tokens_1_no_stop)

1113

In [71]:
nouns = [w[0] for w in pos_tokens_1_no_stop if w[1].startswith('N')]

In [75]:
list(set(nouns))

['diamonds',
 'frowning',
 'share',
 'paperwork',
 'executive',
 'exclusion',
 'staff',
 'links',
 'user',
 'mother',
 'address',
 'fear',
 'chairs',
 'royalties',
 'infringement',
 'nail',
 '//gutenberg.org/license',
 '//www.pglaf.org',
 'taxes',
 'fury',
 'son',
 'price',
 'work',
 'efforts',
 'alteration',
 'miss',
 'credit',
 'money',
 'assistance',
 'profits',
 'bed',
 'secure',
 'cover',
 'collection',
 'compilation',
 'additions',
 'jeremy',
 'opportunities',
 'defect',
 'wife',
 'costs',
 'property',
 'agent',
 'computer',
 'end',
 'edition',
 '//www.gutenberg.org/2/3/3/2/23322/',
 'mediæval',
 'simon',
 'chimney',
 'steam',
 'donation',
 'transcription',
 'pride',
 'phrase',
 'carrion',
 'warranties',
 'fees',
 'data',
 'unenforceability',
 'indemnity',
 'mrs.',
 'alphabet',
 'hither',
 'director',
 'representations',
 'cts._',
 'colors',
 'form',
 'return',
 'owner',
 'charge',
 'royalty',
 'machine',
 'originator',
 'promotion',
 'series',
 'liability',
 'name',
 'copyright'

In [73]:
len(nouns)

461

In [74]:
len(set(nouns))

454
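
The cells above filter the distinct (token, tag) pairs, so a word only appears twice in `nouns` if it was tagged with two different noun tags. An alternative reading of Q8 is to filter the full tagged list and build the distinct nouns with a set comprehension. The sketch below shows this on a small sample; the stopword set is hand-picked so that the example needs no downloads.

```python
# Hand-picked stopwords (an assumption; the notebook uses nltk's English list).
sample_stop_words = {"the", "of", "and", "it"}

# (token, tag) pairs copied from the lower-case tagger output.
sample_tagged_q8 = [
    ("the", "DT"), ("three", "CD"), ("bears", "NNS"), ("of", "IN"),
    ("the", "DT"), ("project", "NN"), ("gutenberg", "JJ"),
    ("gutenberg", "NN"), ("bears", "NNS"),
]

# Drop every pair whose token is a stopword, keeping order and repeats.
no_stop = [(w, t) for w, t in sample_tagged_q8 if w not in sample_stop_words]

# Distinct nouns: every Penn Treebank noun tag starts with "NN".
distinct_nouns = sorted({w for w, t in no_stop if t.startswith("NN")})

print(len(no_stop))
print(distinct_nouns)
```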

### Challenge exercise (if you have time)
Can you deploy a parser on the original text?

Follow the examples at https://www.nltk.org/book/ch08.html 

Does the parser give you useful information that you can use to understand the text better?
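
As a possible starting point for the challenge, the sketch below follows the pattern of the grammar examples in chapter 8 of the NLTK book: define a small context-free grammar and run a chart parser over one sentence. The grammar and the sentence are made up for illustration and cover only this one sentence, not the original text.

```python
import nltk

# Toy grammar written for this example only.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'bear' | 'porridge'
V -> 'ate'
""")

parser = nltk.ChartParser(toy_grammar)

# The parser expects a list of tokens, and every token must be covered by the grammar.
sentence = "the bear ate the porridge".split()
trees = list(parser.parse(sentence))

for tree in trees:
    print(tree)
```

A hand-written grammar like this raises an error on any word it does not list, so parsing the full story would need either a much larger grammar or a different kind of parser.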