In [None]:
'''
What is a Corpus?
A corpus is a collection of written texts; “corpora” is the plural of “corpus”. NLTK includes several corpora, such as the Gutenberg Corpus and the Web and Chat Text corpus.
'''

In [1]:
from nltk.corpus import gutenberg as gt


In [None]:
'''
This corpus consists of several .txt files, each containing a different text. To list all the texts in the corpus, you can call:
'''

In [2]:

print(gt.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [None]:
'''
So you can see that this corpus includes texts such as Hamlet, Macbeth and Milton’s epic poem Paradise Lost.

Let’s say that you want to access the file shakespeare-macbeth.txt and see which words the text contains. To do this, you can use the words method:
'''

In [3]:
shakespeare_macbeth = gt.words("shakespeare-macbeth.txt")
print(shakespeare_macbeth)


['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', ...]


In [None]:
'''
As you can see, the words method receives the file id as its parameter. So if you want to access Milton’s poem, for example, you can type gt.words("milton-paradise.txt").

Running this code gives you a list of all the words of the text; the printed output abbreviates the list with an ellipsis.
'''

In [None]:
'''
Another important function is raw. It returns the whole text as a single string, without any linguistic processing. For example:
'''

In [5]:
raw = gt.raw("shakespeare-macbeth.txt")
print(raw)


[The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through the fogge and filthie ayre.

Exeunt.


Scena Secunda.

Alarum within. Enter King Malcome, Donalbaine, Lenox, with
attendants,
meeting a bleeding Captaine.

  King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state

   Mal. This is the Serieant,
Who like a good and hardie Souldier fought
'Gainst my Captiuitie: Haile braue friend;
Say to the King, the knowledge of the Broyle,
As thou didst leaue it

   Cap. Doubtfull it stood,
As two spent Swimmers, t

In [None]:
'''
Now let’s say you want to see the sentences in your text. You can use the sents function:
'''

In [6]:
sents = gt.sents("shakespeare-macbeth.txt")
print(sents)


[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]


In [None]:
'''
You can use those functions to do more elaborate things. For example, to see the number of words and sentences in each text of the corpus, you can write:
'''

In [7]:
for fileid in gt.fileids():
    num_words = len(gt.words(fileid))
    num_sents = len(gt.sents(fileid))
    print("Data for file:", fileid)
    print("Number of words:", num_words)
    print("Number of sentences:", num_sents, end="\n\n\n")


Data for file: austen-emma.txt
Number of words: 192427
Number of sentences: 7752


Data for file: austen-persuasion.txt
Number of words: 98171
Number of sentences: 3747


Data for file: austen-sense.txt
Number of words: 141576
Number of sentences: 4999


Data for file: bible-kjv.txt
Number of words: 1010654
Number of sentences: 30103


Data for file: blake-poems.txt
Number of words: 8354
Number of sentences: 438


Data for file: bryant-stories.txt
Number of words: 55563
Number of sentences: 2863


Data for file: burgess-busterbrown.txt
Number of words: 18963
Number of sentences: 1054


Data for file: carroll-alice.txt
Number of words: 34110
Number of sentences: 1703


Data for file: chesterton-ball.txt
Number of words: 96996
Number of sentences: 4779


Data for file: chesterton-brown.txt
Number of words: 86063
Number of sentences: 3806


Data for file: chesterton-thursday.txt
Number of words: 69213
Number of sentences: 3742


Data for file: edgeworth-parents.txt
Number of words: 210663

In [None]:
'''
Loading your own corpus
Now that you have learned what a corpus is, you will learn how to load your own corpus.

To do this, you need a corpus reader, so create a new file named loading-your-own-corpus.py with the following lines.
'''

In [8]:
from nltk.corpus import PlaintextCorpusReader
import os


In [None]:
'''
The first import statement is for the PlaintextCorpusReader class, which will be your corpus reader, and the second is for the os module, which is used to build the path of the files you want to load.

To continue, download the play The Taming of the Shrew and place it in the same directory as your Python file. The examples here assume it is saved as shakespeare-taming-2.txt.

After you download the play, create a PlaintextCorpusReader object with the following lines:
'''

In [34]:
# Read every .txt file in the current working directory

corpus_root = os.getcwd() + "/"
file_ids = r".*\.txt"  # regular expression: any file name ending in ".txt"
corpus = PlaintextCorpusReader(corpus_root, file_ids)


In [35]:
'''
As you can see, PlaintextCorpusReader expects two inputs in its constructor. The first one is corpus_root and the second one is file_ids. The corpus_root is the path of your files and file_ids identifies the files to load, either as a list of names or as a regular expression.

To get the path of your files, you can use the getcwd function of the os module. Note that we add a / to the path. In file_ids, we use a regular expression to fetch all the files that you want. In our example, we want all files that have the .txt extension.

As this constructor returns a corpus object, you can use the same functions you used in the previous section. So if you want to see the words in the text, for example, you can use:
'''


In [36]:
print(corpus.words("shakespeare-taming-2.txt"))


['THE', 'TAMING', 'OF', 'THE', 'SHREW', 'DRAMATIS', ...]
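In a file_ids pattern, an unescaped dot matches any character, so ".*.txt" is slightly looser than r".*\.txt". The sketch below compares the two, assuming the reader requires the pattern to match the whole file name (approximated here with re.fullmatch). The file names are made up for illustration.

```python
import re

# Hypothetical file names, used only to compare the two patterns.
candidates = ["shakespeare-taming-2.txt", "notes.txt", "notestxt", "data.csv"]

loose = ".*.txt"     # the unescaped dot matches any single character
strict = r".*\.txt"  # the escaped dot matches only a literal "."

loose_hits = [name for name in candidates if re.fullmatch(loose, name)]
strict_hits = [name for name in candidates if re.fullmatch(strict, name)]

print(loose_hits)   # also contains 'notestxt'
print(strict_hits)  # only names that really end in '.txt'
```

In a directory that only holds well-named text files both patterns select the same files, so the difference matters mainly when stray file names are present.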


In [None]:
# Regular Expressions with NLTK

In [None]:
'''
Assuming you have a background in regular expressions, this section focuses on using the search function from the re module.

To start this tutorial, create a file named regular-expressions.py and import the following modules:
'''

In [37]:
from nltk.tokenize import word_tokenize
import re


In [None]:
'''
You will use the same text from the previous tutorial, “Taming of the Shrew”, and the same read_file  function, so add to your code:
'''

In [38]:
def read_file(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text


In [39]:
'''
Next, you need to tokenize the text to get its words, so add the following code:
'''


In [40]:
text = read_file("shakespeare-taming-2.txt")
words = word_tokenize(text)


In [None]:
'''
Now, let’s talk about the search function. It lives in the re module and takes two parameters: the first is a RegEx pattern and the second is the string to which you want to apply the pattern. For example, let’s say you want to check whether the string "abc def" starts with “a”. The code you will write is:
'''

In [43]:
re.search("^a", "abc def")


<_sre.SRE_Match object; span=(0, 1), match='a'>
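To collect the individual words that start with “a”, the same pattern can be applied word by word. This is a small sketch that uses str.split as a stand-in tokenizer; the sample string is illustrative.

```python
import re

sentence = "abc def apple"

# Applied to the whole string: "^" anchors at position 0 only.
match = re.search("^a", sentence)
print(match.span())  # (0, 1)

# Applied word by word: every word starting with "a" is kept.
words_starting_with_a = [w for w in sentence.split() if re.search("^a", w)]
print(words_starting_with_a)  # ['abc', 'apple']
```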

In [44]:
'''
A useful thing to note is that you can use the search function in an if  statement, so the following code will print a message if the pattern is found:
'''


In [45]:
if re.search("^a", "abc"):
    print("Found!!!")


Found!!!


In [None]:
'''
You can use it as well in list comprehensions to find words that end with “ed”, for example:
'''

In [49]:
words_ending_with_ed = [w for w in words if re.search("ed$", w)]

print(words_ending_with_ed)


['bed', 'winded', 'cried', 'bed', 'bed', 'bed', 'waked', 'distilled', 'husbanded', 'bed', 'fitted', 'observed', 'accomplished', 'shed', 'restored', 'commanded', 'infused', 'kindred', 'caged', 'bed', 'studded', 'breathed', 'painted', 'beguiled', 'surprised', 'painted', 'deed', 'shed', 'indeed', 'restored', 'waked', 'waked', 'thanked', 'bed', 'bed', 'charged', 'bed', 'arrived', 'approved', 'conceived', 'achieved', 'affected', 'abjured', 'resolved', 'need', 'resolved', 'appointed', 'brooked', 'married', 'whipped', 'maintained', 'agreed', 'wed', 'bed', 'rated', 'Sacred', 'advised', 'plotted', 'need', 'tied', 'charged', 'wounded', 'descried', 'changed', 'indeed', 'beloved', 'approved', 'rebused', 'deceased', 'wed', 'deceased', 'rehearsed', 'disguised', 'unsuspected', 'disguised', 'perused', 'perfumed', 'assured', 'promised', 'lighted', 'promised', 'beloved', 'Beloved', 'chafed', 'pitched', 'arrived', 'promised', 'provided', 'speed', 'wed', 'indeed', 'jested', 'revenged', 'grieved', 'called'
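The list above contains repeats (“bed”, “promised”) and words such as “need” or “indeed” that merely happen to end in “ed”. The sketch below counts distinct matches with collections.Counter, using a short hand-written token list in place of the play.

```python
import re
from collections import Counter

# Illustrative tokens; in the tutorial these would come from word_tokenize.
tokens = ["bed", "winded", "cried", "bed", "Beloved", "beloved", "need", "the"]

# Lowercase before counting so "Beloved" and "beloved" are merged.
ed_counts = Counter(w.lower() for w in tokens if re.search("ed$", w))

print(ed_counts.most_common(2))
```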

In [None]:
'''
Talking about RegEX, you cannot forget about ranges and closures. Let’s say that you want to find all words that end with one or more “e”. You can use the +  operator:
'''

In [50]:
words_ending_with_one_or_more_e = [w for w in words if re.search("e+$", w)]

print(words_ending_with_one_or_more_e)

['the', 'Page', 'Page', 'love', 'the', 'house', 'Before', 'alehouse', 'pheeze', 'rogue', 'Ye', 'are', 'baggage', 'the', 'are', 'the', 'we', 'came', 'Therefore', 'the', 'slide', 'the', 'have', 'thee', 'the', 'budge', 'come', 'charge', 'thee', 'the', 'couple', 'the', 'made', 'the', 'the', 'lose', 'the', 'he', 'He', 'the', 'twice', 'the', 'me', 'take', 'the', 'were', 'here', 'one', 'See', 'he', 'breathe', 'He', 'Were', 'he', 'ale', 'were', 'like', 'swine', 'he', 'loathsome', 'thine', 'image', 'practise', 'he', 'were', 'brave', 'he', 'the', 'Believe', 'me', 'he', 'choose', 'strange', 'he', 'take', 'manage', 'the', 'make', 'the', 'Procure', 'me', 'he', 'make', 'he', 'chance', 'be', 'submissive', 'reverence', 'one', 'the', 'the', 'please', 'Some', 'one', 'be', 'he', 'horse', 'disease', 'Persuade', 'he', 'he', 'he', 'he', 'he', 'gentle', 'be', 'pastime', 'be', 'we', 'he', 'true', 'diligence', 'He', 'we', 'he', 'Take', 'one', 'office', 'he', 'Some', 'see', 'Belike', 'some', 'noble', 'some', 'r

In [None]:
'''
Similarly, you can use the * operator to search for zero or more occurrences of a certain pattern. Keep in mind that zero occurrences also count as a match, so a pattern for words that end with zero or more “e” accepts every word:
'''

In [None]:
words_that_may_end_with_e = [w for w in words if re.search("e*$", w)]

print(words_that_may_end_with_e)
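Because * also allows zero occurrences, "e*$" matches at the end of any string. A quick check on a few sample words (chosen for illustration) shows the difference from "e+$":

```python
import re

sample = ["the", "thee", "bed", "man"]

one_or_more = [w for w in sample if re.search("e+$", w)]
zero_or_more = [w for w in sample if re.search("e*$", w)]

print(one_or_more)   # ['the', 'thee']
print(zero_or_more)  # ['the', 'thee', 'bed', 'man']

# For a word without a final "e", the match is the empty string at the end.
print(repr(re.search("e*$", "man").group()))  # ''
```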

In [None]:
'''
Applications of RegEX
Now that you are familiar with the search function, you are going to search through tokenized text using the findall method of the nltk.text.Text class. This class expects a list of words in its constructor:
'''

In [52]:
from nltk.corpus import gutenberg, nps_chat
import nltk

moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))


In [None]:
'''
As you can see, this example uses a text from the Gutenberg corpus.

The findall method expects a regular expression as its parameter, but the syntax differs slightly from ordinary regular expressions. Because the Text class holds a tokenized list of words, the pattern is written token by token.

Let’s say you want to find three-word sequences that start with “a”, continue with any word, and end with “man”. You need to match three tokens, where the second can be any word. So the code for this is:
'''

In [53]:
# findall prints the matches itself and returns None, so print() is not needed
moby.findall(r'<a><.*><man>')


a monied man; a nervous man; a dangerous man; a white man; a white
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man


In [None]:
'''
As you can see, each token is enclosed in <> and you specify a RegEx for each one. To make it clearer, let’s see another example using the nps_chat corpus.

chat_obj = nltk.Text(nps_chat.words())

Let’s say you want to find three-word sequences that end with “bro”. Given that only the last word matters, you can use <.*> for the first two words, to accept anything, and <bro> for the last one. So the code you will use is:
'''

In [61]:
chat_obj = nltk.Text(nps_chat.words())

In [62]:
# findall prints the matches itself and returns None, so print() is not needed
chat_obj.findall(r"<.*><.*><bro>")


you rule bro; telling you bro; u twizted bro


In [None]:
'''
Now, let’s create our own nltk.text.Text  object. To create a Text  object, you need a list of words, so first create a string:
'''

In [63]:
text = "Hello , I am a computer programmer who is currently learning and studying NLP !"


In [64]:
# Tokenize it:

our_own_text_obj = nltk.Text(nltk.word_tokenize(text))


In [65]:
#  And now you can use the findall method:

our_own_text_obj.findall(r"<.*ing>")


learning; studying


In [None]:
'''
Note that as this is an  nltk.text.Text  object, you can use all the functions mentioned in the previous tutorials such as concordance , similar  and count .

'''
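To make the token-pattern syntax used with findall more concrete, here is a rough pure-Python sketch of the idea: one regular expression per token, matched against consecutive tokens. It is a simplified stand-in and not NLTK's implementation. It only handles fixed-length patterns, and the function name find_token_sequences and the sample sentence are illustrative.

```python
import re


def find_token_sequences(tokens, token_patterns):
    """Return every run of tokens where the i-th token fully matches
    the i-th pattern (one pattern per <...> slot)."""
    compiled = [re.compile(p) for p in token_patterns]
    width = len(compiled)
    hits = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(" ".join(window))
    return hits


tokens = "he was a brave man and a wise man , not a woman".split()

# Similar in spirit to the token pattern <a><.*><man>
print(find_token_sequences(tokens, ["a", ".*", "man"]))
```

Each pattern has to match its whole token, which is why “woman” is not accepted in the <man> slot.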

In [None]:
# Lexical Resources and NLP Pipeline

In [None]:
'''
Lexical Resources Terms:

A lexical resource is a database containing one or more dictionaries or corpora. Dictionaries here are lists of stop words, homonyms, common words, and so on. To start this tutorial, there are some definitions you have to know.

Tokenization is the act of breaking up a string into pieces called tokens. These tokens can be words, keywords or sentences. You will see more about this later in this tutorial.

Homonyms are distinct words that share the same spelling. They can create a problem when dealing with language because it is difficult to tell them apart automatically. An example is pike (the fish) and pike (the weapon).

Stop words are very common words that a search engine or NLP application filters out before processing. Examples of these words are “the”, “a” and “is”.

'''

In [None]:
'''
Understanding Lexical Resources Using NLTK:

To better understand these concepts and get started with lexical resources, let’s create a file named lexical-resources-vocabulary.py. In this file you will write the unusual_words function, which takes a list of words and returns the ones that are not in common use.

Let’s start with the imports:

'''

In [66]:

import nltk
from nltk.corpus import gutenberg


In [67]:
# input - list of words 
# output - returns a list of unusual words 
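def unusual_words(text):
    # Normalize: keep alphabetic tokens only, lowercased and deduplicated
    text_vocab = {w.lower() for w in text if w.isalpha()}
    english_vocab = {w.lower() for w in nltk.corpus.words.words()}
    # Words that appear in the text but not in the English word list
    unusual_list = text_vocab.difference(english_vocab)
    return sorted(unusual_list)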


In [None]:
'''
In the function, you can see three variables: text_vocab, english_vocab and unusual_list.

The first variable is a normalized version of the input list: alphabetic tokens only, lowercased, with duplicates removed. As noted in past tutorials, if you search for the word “the” in a text that is not normalized, you may miss occurrences such as “The”.

The english_vocab variable is a set containing the common words of English. As you can see, the nltk module has a function that returns a list of those words, nltk.corpus.words.words().

The last variable, unusual_list, contains the difference between text_vocab and english_vocab, in other words, all the words in text_vocab that are not in english_vocab.

You can test your code by adding the following lines:

'''

In [68]:
nltk.download('words')

[nltk_data] Downloading package words to /home/qx816/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [69]:

list_of_unusual_words = unusual_words(gutenberg.words('austen-emma.txt'))
print(list_of_unusual_words)


['abbots', 'abdy', 'abhorred', 'abilities', 'absences', 'absented', 'absenting', 'abstained', 'absurdities', 'abused', 'abusing', 'acceded', 'accents', 'accepting', 'accepts', 'accommodations', 'accompanied', 'accompanying', 'accomplishments', 'accounted', 'accounts', 'accumulations', 'aches', 'achieved', 'acknowledging', 'acknowledgment', 'acknowledgments', 'acquiesced', 'acquirements', 'acquitted', 'acted', 'actions', 'adding', 'addressed', 'addresses', 'addressing', 'adieus', 'administered', 'admires', 'admits', 'adored', 'adoring', 'advances', 'advantages', 'adventuring', 'advising', 'affairs', 'affections', 'affixed', 'afforded', 'affording', 'affords', 'ages', 'aggrandise', 'agitated', 'agitating', 'agrees', 'aids', 'aimable', 'aimed', 'aired', 'airs', 'alarms', 'alderneys', 'alleviations', 'alliances', 'allowances', 'allowed', 'allowing', 'alluded', 'alluding', 'allusions', 'almane', 'alphabets', 'altered', 'amounted', 'amounting', 'amuses', 'announced', 'announcing', 'answered'
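Many entries in the output are ordinary inflected forms (“abbots”, “accepting”), which suggests that the word list holds mostly base forms. The same set-difference logic can be tried on a toy vocabulary. The function name unusual_words_in and the tiny vocabulary below are illustrative assumptions, not NLTK data.

```python
def unusual_words_in(tokens, vocabulary):
    """Return the alphabetic, lowercased tokens that are missing from vocabulary."""
    text_vocab = {w.lower() for w in tokens if w.isalpha()}
    known = {w.lower() for w in vocabulary}
    return sorted(text_vocab - known)


# Toy stand-ins for a tokenized text and for nltk.corpus.words.words().
tokens = ["The", "abbots", "accepted", "the", "account", ",", "1603"]
toy_vocabulary = ["the", "abbot", "accept", "account"]

print(unusual_words_in(tokens, toy_vocabulary))  # ['abbots', 'accepted']
```

If inflected forms should not count as unusual, one option would be to reduce words to a base form before the comparison.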

In [None]:
'''
The nltk module also includes a corpus of stop words. To see the English stop words, for example, you can use the following code:
'''

In [27]:
from nltk.corpus import stopwords
print(stopwords.words('english'))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
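A typical use of this list is to drop stop words from a tokenized text. The sketch below uses a small hand-picked subset of the list above so that it runs without the corpus. With NLTK available, the set could be built from stopwords.words('english') instead.

```python
# A small subset of the stop word list printed above.
stop_subset = {"i", "am", "a", "who", "is", "and"}

tokens = ["Hello", ",", "I", "am", "a", "computer", "programmer", "who",
          "is", "currently", "learning", "and", "studying", "NLP", "!"]

# Compare in lowercase because the stop word list is lowercase,
# and skip punctuation by keeping alphabetic tokens only.
content_words = [w for w in tokens
                 if w.isalpha() and w.lower() not in stop_subset]

print(content_words)
```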

In [None]:
'''
NLP Pipeline:

Before you start processing raw text, you need to be familiar with the NLP pipeline. Let’s say you want to process the text on this site (https://www.nrdc.org/stories/global-warming-101). You cannot simply analyze the page’s HTML, because you do not want everything on the page, such as images or icons.

So what do you have to do? First, you download the HTML, then you tokenize the text, and finally you normalize the words. This is what we call the NLP pipeline. As web scraping is not the focus of this tutorial, let’s start the NLP pipeline with tokenization.

Note: You will be using the play from the previous tutorial, Taming of the Shrew. If you have not downloaded it yet, you can get it from the path below.
path : http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/
'''
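The pipeline stages can be sketched end to end with the standard library only. This is a simplified illustration under stated assumptions: the HTML is an inline string rather than a downloaded page, a regular expression stands in for word_tokenize, and the names TextExtractor and simple_pipeline are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def simple_pipeline(html):
    # Step 1: strip the markup and keep only the visible text.
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks)
    # Step 2: tokenize (runs of letters and apostrophes, a crude tokenizer).
    tokens = re.findall(r"[A-Za-z']+", text)
    # Step 3: normalize by lowercasing every token.
    return [t.lower() for t in tokens]


page = ("<html><body><h1>Global Warming 101</h1><img src='x.png'>"
        "<p>The planet is warming.</p></body></html>")
print(simple_pipeline(page))
```

The image tag contributes nothing to the result, which is the point made above about not wanting everything on the page.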

In [None]:
'''
Tokenization
You will start tokenization with two functions from the nltk.tokenize module, word_tokenize and sent_tokenize. As the names suggest, the first function divides your text into words and the second divides it into sentences. Both functions take a string as their parameter and return a list of tokens as their output.

Start the example by creating a file named tokenization.py and adding the following lines:
'''

In [70]:
from nltk.tokenize import word_tokenize, sent_tokenize

def read_file(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text

text = read_file("/home/qx816/notebooks/Ref_Notebooks/Practice/shakespeare-taming-2.txt")


In [None]:
'''
These lines import the functions word_tokenize and sent_tokenize from the nltk.tokenize module and define a function that reads your file and returns its contents as a string. You can then tokenize the text into words:
'''

In [72]:

words = word_tokenize(text)
print(words)



In [None]:
'''
If you want only the unique words, you can transform the list into a set with:
'''

In [73]:
words = set(words)
print(words)



In [None]:
'''
And you can see how many unique words the text has as well:

 
'''

In [74]:
print(len(words))


3606
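The figure above counts distinct tokens exactly as the tokenizer produced them, so punctuation marks and case variants such as “The” and “the” are counted separately. Here is a sketch of a normalized count on a few illustrative tokens:

```python
tokens = ["The", "taming", "of", "the", "shrew", ",", "the", "Shrew", "!"]

raw_vocab = set(tokens)
# Keep alphabetic tokens only and fold case before deduplicating.
normalized_vocab = {w.lower() for w in tokens if w.isalpha()}

print(len(raw_vocab))         # 8
print(len(normalized_vocab))  # 4
```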

