## Notebook Introduction

We cover the following topics:

- Tokenizing text using functions **word_tokenize** and **sent_tokenize**.

- Computing Frequencies with **FreqDist** and **ConditionalFreqDist**.

- Generating Bigrams and collocations with **bigrams** and **collocations**.

- Stemming word affixes using **PorterStemmer** and **LancasterStemmer**.

- Tagging words to their parts of speech using **pos_tag**.

## nltk

**nltk** (the Natural Language Toolkit) is a popular Python library for building programs that work with human language data. Key features of nltk:

- It provides access to over 50 text corpora and other lexical resources.
- It is a suite of text processing tools.
- It is free to use and open source.
- It is available for Windows, Mac OS X, and Linux.

## Installing nltk

nltk can be installed with pip from the command line:

***pip install nltk***

The corpora and models it relies on are then downloaded from within Python:

***nltk.download()***

Installation can be verified by opening a Python interpreter and running:

***import nltk***
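
The manual check can also be done programmatically. The sketch below uses only the standard library to test whether a package is importable; the helper name `is_installed` is a hypothetical name introduced here for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None

# The result depends on the environment in which this runs.
print("nltk installed:", is_installed("nltk"))
```

A result of `False` means that `pip install nltk` still needs to be run in the active environment.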

## Basic Understanding of nltk

Now let's get familiar with nltk by performing a few simple tasks in the next couple of cells.

In [1]:
## Splitting a sample text into a list of sentences using "sent_tokenize" function
import nltk
text = "Python is an interpreted high-level programming language for general-purpose programming. \
Created by Guido van Rossum and first released in 1991."
sentences = nltk.sent_tokenize(text)
len(sentences)

2

In [2]:
## Splitting a sample text into words using "word_tokenize" function
words = nltk.word_tokenize(text)
print(len(words))

# The expression words[:5] displays first five words of list words.
words[:5]

22


['Python', 'is', 'an', 'interpreted', 'high-level']

In [3]:
## Determining the frequency of words present in the sample text using "FreqDist"
## The expression wordfreq.most_common(n) displays the n most frequent words with their respective counts.
wordfreq = nltk.FreqDist(words)
wordfreq.most_common(2)

[('programming', 2), ('.', 2)]

## Downloading NLTK Book collection

In this notebook, you will be working with several texts curated by the NLTK authors. These texts are available in the book collection of nltk. They can be downloaded by running the following command in a Python interpreter, after importing nltk successfully.

***nltk.download('book')***

In [2]:
## The command loads nine texts and nine sentences, from the collection book
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Searching Text
There are multiple ways of searching for a pattern in a text. The example below uses the ***findall*** method to search for words starting with ***tri*** and ending with ***r***.

In [5]:
text1.findall("<tri.*r>")

triangular; triangular; triangular; triangular
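
In the pattern, the angle brackets mark the boundaries of a single token, so the regular expression inside them has to match a whole word. For single-token patterns this is roughly equivalent to the plain-Python sketch below, which uses invented sample tokens and a hypothetical helper named `find_tokens`.

```python
import re

def find_tokens(tokens, pattern):
    """Return the tokens that the regular expression matches in full."""
    compiled = re.compile(pattern)
    return [tok for tok in tokens if compiled.fullmatch(tok)]

# Invented tokens for illustration; matching is case-sensitive.
sample_tokens = ["a", "triangular", "sail", "trip", "trier", "Triangular"]
matches = find_tokens(sample_tokens, r"tri.*r")
print(matches)  # ['triangular', 'trier']
```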


## Basic Tasks with Text
In this topic, you will understand how to perform the following activities, using **text1** as input text.

- Total Word Count
- Unique Word Count
- Transforming Words
- Word Coverage
- Filtering Words
- Frequency Distribution

## Determining Total Word Count

In [6]:
## The text1, imported from nltk.book is an object of nltk.text.Text class
print(type(text1))

## Total number of words in text1 is determined using len
n_words = len(text1)
n_words

<class 'nltk.text.Text'>


260819

## Determining Unique Word Count

In [7]:
## The number of unique words in text1 is determined using set and len
## The expression set(text1) generates the set of unique words from text1
n_unique_words = len(set(text1))
n_unique_words

19317

## Transforming Words
It is possible to apply a function to any number of words and transform them.

In [13]:
## Now let's transform every word of text1 to lowercase and determine unique words once again.
text1_lcw = [ word.lower() for word in set(text1) ]
n_unique_words_lc = len(set(text1_lcw))
print(n_unique_words_lc)
print(abs(n_unique_words_lc - n_unique_words))
## The lowercased vocabulary is smaller than n_unique_words by 2086 words.

17231
2086


## Determining Word Coverage
***Word Coverage*** refers to the average number of times a word occurs in the text, i.e. the total number of words divided by the number of unique words. The following example computes the word coverage of text1 as it is, and again after all words have been lowercased.

In [14]:
word_coverage1 = n_words / n_unique_words
print(word_coverage1)
word_coverage2 = n_words / n_unique_words_lc
print(word_coverage2)

13.502044830977896
15.136614241773549
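
The same arithmetic can be followed by hand on a tiny input. The sketch below uses an invented seven-token list rather than text1, so the numbers are small enough to check.

```python
# Invented tokens for illustration.
tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

n_tokens = len(tokens)                              # 7 tokens in total
n_unique = len(set(tokens))                         # 'The' and 'the' are counted separately
n_unique_lc = len({tok.lower() for tok in tokens})  # case-folded vocabulary

coverage_raw = n_tokens / n_unique
coverage_lc = n_tokens / n_unique_lc
print(n_unique, n_unique_lc)      # 6 5
print(coverage_raw, coverage_lc)
```

Lowercasing shrinks the vocabulary, so the coverage of the lowercased text is never lower than that of the raw text.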


## Filtering Words
Now let's see how to filter words based on specific criteria. The following example uses a list comprehension with a condition to select words longer than 17 characters.

In [15]:
big_words = [word for word in set(text1) if len(word) > 17 ]
big_words

['uninterpenetratingly', 'characteristically']

Now let's see one more example, which filters words having the prefix ***Sun***. The check is case-sensitive, so words starting with a lowercase ***s*** followed by ***un*** are not selected.

In [16]:
sun_words = [word for word in set(text1) if word.startswith('Sun') ]
sun_words

['Sunda', 'Sunday', 'Sunset']

## Frequency Distribution
The ***FreqDist*** class of nltk can be used to determine the frequency of every word present in an input text.

## Common Methods of Frequency Distribution
Commonly used methods of a frequency distribution, fdist, are illustrated below.
![Frequency Distribution](files/Frequency_Distribution.jpeg "Title")

In [19]:
## The following example determines the frequency distribution of text1 and then displays the frequency of the word Sunday.
text1_freq = nltk.FreqDist(text1)
text1_freq['Sunday']

7

In [20]:
## Now let's identify the three most frequent words in text1_freq using the most_common method.
top3_text1 = text1_freq.most_common(3)
top3_text1

[(',', 18713), ('the', 13721), ('.', 6862)]

## Examples
In general, you would be interested in frequent words that are specific to the input text rather than common in everyday usage. In the next example, you will perform the following:

- Filter words that are purely alphabetic and longer than seven characters.
- Determine the frequency distribution of the filtered words.
- Identify the three most common words.

In [21]:
large_uncommon_words = [word for word in text1 if word.isalpha() and len(word) > 7 ]
text1_uncommon_freq = nltk.FreqDist(large_uncommon_words)
text1_uncommon_freq.most_common(3)

[('Queequeg', 252), ('Starbuck', 196), ('something', 119)]
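
The filter-then-count pattern does not depend on nltk. As a minimal sketch, the standard library's `collections.Counter` can stand in for FreqDist on an invented token list:

```python
from collections import Counter

# Invented tokens for illustration.
tokens = ["Starbuck", "said", "something", ",", "and", "Starbuck",
          "heard", "something", "else", "from", "Queequeg", "."]

# Keep purely alphabetic tokens longer than 7 characters.
long_words = [tok for tok in tokens if tok.isalpha() and len(tok) > 7]
freq = Counter(long_words)
print(freq.most_common(2))
```

Short words and punctuation never reach the counter, so the most common entries are the longer, more text-specific words.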

In [10]:
from nltk.book import text6
text6_freq = nltk.FreqDist(text6)
text6_freq['BROTHER']
big_words6 = [word for word in set(text6) if len(word) > 10 ]
big_words6
b = [word for word in set(text6) if word.endswith('ship') ]
len(b)

['dictatorship']

In [22]:
from nltk.book import text6
b = [word for word in text6 if word.endswith('ing') ]
len(b)

281

## Popular Text Corpora
Two popular text corpora available from nltk, which you will be using, are:

- **Genesis**: It is a collection of texts of the Book of Genesis in multiple languages.

- **Brown**: It is the first million-word electronic corpus of English.

### Other Corpora in nltk

- **Gutenberg** : Collections from Project Gutenberg
- **Inaugural** : Collection of U.S. presidents' inaugural speeches
- **stopwords** : Collection of stop words.
- **reuters** : Collection of news articles.
- **cmudict** : Collection of CMU Pronouncing Dictionary entries.
- **movie_reviews** : Collection of movie reviews.
- **nps_chat** : Collection of chat text.
- **names** : Collection of names associated with males and females.
- **state_union** : Collection of State of the Union addresses.
- **wordnet** : Collection of all lexical entries.
- **words** : Collection of words in Wordlist corpus.

### Accessing Text Corpora
Any text corpus has to be imported before you start working with it. The code below imports the genesis text corpus.

In [28]:
from nltk.corpus import genesis
## Various text collections available under genesis text corpus are viewed by fileids method.
genesis.fileids()

['english-kjv.txt',
 'english-web.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'lolcat.txt',
 'portuguese.txt',
 'swedish.txt']

## Working with a Text Corpus
Now let's understand how to work with a text corpus. The following example determines the average word length and average sentence length of each text collection present in the genesis corpus.

The methods raw, words and sents return the characters, words, and sentences of a specific text collection; applying len to each result gives the corresponding total count.

In [29]:
for fileid in genesis.fileids():
    n_chars = len(genesis.raw(fileid))
    n_words = len(genesis.words(fileid))
    n_sents = len(genesis.sents(fileid))
    print(int(n_chars/n_words), int(n_words/n_sents), fileid)

4 30 english-kjv.txt
4 19 english-web.txt
5 15 finnish.txt
4 23 french.txt
4 23 german.txt
4 20 lolcat.txt
4 27 portuguese.txt
4 30 swedish.txt


## Text Corpus Structure
A text corpus is organized into one of the following four structures.

- Isolated - Holds individual text collections.
- Categorized - Each text collection is tagged to a category.
- Overlapping - Each text collection is tagged to one or more categories.
- Temporal - Each text collection is tagged to a period, date, time, etc.

## Loading User Specific Corpus
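Now let's see how to convert your own collection of text files into a text corpus. Suppose you have three files, c1.txt, c2.txt and c3.txt, in the current directory.

The following example creates a corpus named wordlists from that directory. The pattern '.*' matches every file under the corpus root, so the listing also includes files other than the three text files.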

In [1]:
from nltk.corpus import PlaintextCorpusReader
import os 
corpus_root = os.getcwd()

wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

['.ipynb_checkpoints/NLP-checkpoint.ipynb',
 'Frequency_Distribution.jpeg',
 'NLP.ipynb',
 'c1.txt',
 'c2.txt',
 'c3.txt',
 'cfdist.jpeg']

In [2]:
wordlists

<PlaintextCorpusReader in 'C:\\Users\\svksh\\Documents\\#Personal\\Fresco\\Machine Learning\\NLP using Python'>

## Conditional Frequency
In the previous topic, you studied frequency distributions. The ***FreqDist*** class computes the frequency of each item in a list. While computing a frequency distribution, you count the occurrences of each event.

In [4]:
items = ['apple', 'apple', 'kiwi', 'cabbage', 'cabbage', 'potato']
nltk.FreqDist(items)

FreqDist({'apple': 2, 'cabbage': 2, 'kiwi': 1, 'potato': 1})

## Computing Conditional Frequency
A conditional frequency distribution is a collection of frequency distributions, one for each condition. To compute it, you attach a condition to every occurrence of an event, giving a list of (condition, event) pairs such as the one in the following example.

The ***ConditionalFreqDist*** class of nltk is used to compute a Conditional Frequency Distribution (CFD).

In [9]:
c_items = [('F','apple'), ('F','apple'), ('F','kiwi'), ('V','cabbage'), ('V','cabbage'), ('V','potato') ]
cfd = nltk.ConditionalFreqDist(c_items)
print(cfd.conditions())
print(cfd['V'])
print(cfd['F'])

['F', 'V']
<FreqDist with 2 samples and 3 outcomes>
<FreqDist with 2 samples and 3 outcomes>
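
Conceptually, the result is a dictionary that maps each condition to its own counter. The sketch below rebuilds the same counts with the standard library only; it illustrates the idea and is not how nltk implements the class.

```python
from collections import Counter, defaultdict

pairs = [("F", "apple"), ("F", "apple"), ("F", "kiwi"),
         ("V", "cabbage"), ("V", "cabbage"), ("V", "potato")]

# One Counter per condition, created on first use.
cond_freq = defaultdict(Counter)
for condition, event in pairs:
    cond_freq[condition][event] += 1

print(sorted(cond_freq))             # ['F', 'V']
print(dict(cond_freq["F"]))          # {'apple': 2, 'kiwi': 1}
print(sum(cond_freq["V"].values()))  # 3
```

Each condition holds 2 distinct samples and 3 outcomes in total, which matches the summaries printed for `cfd['V']` and `cfd['F']`.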


## Common methods of a Conditional Frequency
Commonly used methods of a conditional frequency distribution, cfdist, are illustrated below.

![CFD](files/cfdist.jpeg "Title")

## Counting Words by Genre
Now let's determine the frequency of words in each genre of the brown corpus.

In [11]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist([ (genre, word) for genre in brown.categories() for word in brown.words(categories=genre) ])
cfd.conditions()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## Viewing Word Count
After computing a conditional frequency distribution, the tabulate method can be used to view the counts. Its conditions and samples arguments select the rows and columns to display.

In [12]:
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'])

           leadership    worship   hardship 
government         12          3          2 
     humor          1          0          0 
   reviews         14          1          2 


## Viewing Cumulative Word Count
The cumulative count for different conditions is found by setting the cumulative argument to True. Each cell then holds the running total across the listed samples, from left to right.

In [13]:
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'], cumulative = True)

           leadership    worship   hardship 
government         12         15         17 
     humor          1          1          1 
   reviews         14         15         17 
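
Each cumulative row is a running sum of the corresponding plain row. A minimal check of the government row, using the counts from the non-cumulative table above:

```python
from itertools import accumulate

samples = ["leadership", "worship", "hardship"]
government_counts = [12, 3, 2]  # the 'government' row of the plain table

# accumulate yields 12, then 12 + 3, then 12 + 3 + 2.
cumulative = list(accumulate(government_counts))
for sample, total in zip(samples, cumulative):
    print(sample, total)
```

This reproduces 12, 15 and 17, the values shown for government in the cumulative table.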


## Accessing Individual Frequency Distributions
From the obtained conditional frequency distribution, you can access individual frequency distributions. The example below extracts the frequency distribution of words in the news genre of the brown corpus.

In [6]:
news_fd = cfd['news']
news_fd.most_common(3)
news_fd['the']

0

## Comparing Frequency Distributions
Now let's see another example, which computes the frequency of the last character of male and female names and compares the two distributions. The names text corpus contains two files, male.txt and female.txt.

In [17]:
from nltk.corpus import names
nt = [(fid.split('.')[0], name[-1]) for fid in names.fileids() for name in names.words(fid) ]
cfd2 = nltk.ConditionalFreqDist(nt)

## The expression cfd2['female'] > cfd2['male'] checks if the last characters in females occur more frequently than the
## last characters in males.
print(cfd2['female'] > cfd2['male'])

## The following code snippet displays frequency count of characters a and e in females and males, respectively.
cfd2.tabulate(samples=['a', 'e'])

True
          a    e 
female 1773 1432 
  male   29  468 


## Raw Text Processing
In most NLP tasks that you carry out, the data is not readily available as a text corpus. Instead, raw text is obtained from other sources, processed, and then used for analysis.

Some of the processing steps that you perform are:
- Tokenization
- Stemming

### Reading a Text File
In this topic, you will understand how data is read from different external sources. The following example reads content from a text file available on the Project Gutenberg site.

In [18]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
content1 = request.urlopen(url).read()

### Reading an HTML File
The following example reads content from a news article available over the web. The ***BeautifulSoup*** module is used for scraping the required text from the webpage.

In [19]:
from urllib import request
url = "http://www.bbc.com/news/health-42802191"
html_content = request.urlopen(url).read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

The ***find_all*** method returns all div elements whose class attribute is ***story-body__inner***. The headings, paragraphs and list items inside the first such element are then joined into a single text.

In [27]:
inner_body = soup.find_all('div', attrs={'class':'story-body__inner'})
inner_text = [elm.text for elm in inner_body[0].find_all(['h1', 'h2', 'p', 'li']) ]
text_content2 = '\n'.join(inner_text)

### Reading from Other Sources
You can also read text from other resources such as RSS feeds, FTP repositories, local text files, etc.
It is also possible to read text stored in binary formats, such as Microsoft Word and PDF documents.
Third-party libraries such as ***pywin32*** and ***pypdf*** are required for accessing Microsoft Word or PDF documents.

## Tokenization
***Tokenization*** is a step in which a text is broken down into words and punctuation. The simplest way of tokenizing is to use the ***word_tokenize*** function. The example below tokenizes the text read from Project Gutenberg.

In [22]:
text_content1 = content1.decode('unicode_escape')  # Converts bytes to unicode
tokens1 = nltk.word_tokenize(text_content1)
tokens1[3:8]

['Project', 'Gutenberg', 'EBook', 'of', 'Crime']

The following example tokenizes the text scraped from the HTML page.

In [23]:
tokens2 = nltk.word_tokenize(text_content2)
print(tokens2[:5])
len(tokens2)

['Smokers', 'need', 'to', 'quit', 'cigarettes']


751

### Regular Expressions for Tokenization
Regular expressions can also be used to split text into tokens, using the ***re*** module in Python. The example below extracts tokens from text_content2 with the regular expression **\w+**, which keeps runs of word characters and drops punctuation.

In [25]:
import re
tokens2_2 = re.findall(r'\w+', text_content2)
len(tokens2_2)

668
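
The two token counts differ because the choice of pattern decides what counts as a token. The sketch below, on an invented sentence, compares `\w+` with a second pattern that also keeps each punctuation mark as its own token; the second pattern is an illustrative assumption and does not reproduce the rules of word_tokenize.

```python
import re

# Invented sentence for illustration.
sentence = "Don't panic: it's a high-level, general-purpose language."

# Runs of word characters only: punctuation is dropped and words are
# split at apostrophes and hyphens.
word_only = re.findall(r"\w+", sentence)

# Word runs, or any single character that is neither a word character nor whitespace.
words_and_punct = re.findall(r"\w+|[^\w\s]", sentence)

print(word_only)
print(words_and_punct)
print(len(word_only), len(words_and_punct))  # 11 18
```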

## Creation of NLTK text
Using the obtained list of tokens, an object of NLTK text can be created as shown below. This obtained text can be used for further linguistic processing.

In [26]:
input_text2 = nltk.Text(tokens2)
type(input_text2)

nltk.text.Text

In [39]:
s = 'Python is cool!!!'
nltk.sent_tokenize(s)

['Python is cool!!', '!']

## Bigrams
A bigram is a pair of consecutive words appearing in a text.
The ***bigrams*** function is called on tokenized words, as shown in the following example, to obtain bigrams.

In [1]:
import nltk
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.bigrams(tokens))

[('Python', 'is'),
 ('is', 'an'),
 ('an', 'awesome'),
 ('awesome', 'language'),
 ('language', '.')]

## Computing Frequent Bigrams
Now let's find the three most frequent bigrams in the ***english-kjv*** collection of the ***genesis*** corpus. Let's consider only those bigrams in which both words have a length of at least 5.

In [9]:
from nltk.corpus import genesis
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
filtered_bigrams = [ (w1, w2) for w1, w2 in eng_bigrams if len(w1) >=5 and len(w2) >= 5 ]

After computing the bigrams, the following code computes their frequency distribution and displays the three most frequent bigrams.

In [5]:
eng_bifreq = nltk.FreqDist(filtered_bigrams)
eng_bifreq.most_common(3)

[(('their', 'father'), 19), (('lived', 'after'), 16), (('seven', 'years'), 15)]

### Determining Frequent After Words
Now let's see an example that determines the two most frequent words occurring after ***living***.

In [13]:
from nltk.corpus import genesis
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
eng_cfd = nltk.ConditionalFreqDist(eng_bigrams)
eng_cfd['living'].most_common(2)

[('creature', 7), ('thing', 4)]

### Generating Frequent Next Word
Now let's define a function named generate, which starts from a given word and repeatedly appends the word that most frequently follows the current one.
After defining generate, it is called with the ***eng_cfd*** and ***living*** arguments.
The output shows that the word occurring most frequently after ***living*** is ***creature***.

In [14]:
def generate(cfd, word, n=5):
    n_words = []
    for i in range(n):
        n_words.append(word)
        word = cfd[word].max()
    return n_words

generate(eng_cfd, 'living')

['living', 'creature', 'that', 'he', 'said']
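
The same greedy idea can be tried without any corpus data. The sketch below builds the bigram counts with the standard library from an invented sentence; `build_cfd` is a hypothetical helper, and this variant of `generate` also stops early when a word has no known successor, which the function above does not do.

```python
from collections import Counter, defaultdict

def build_cfd(tokens):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        cfd[first][second] += 1
    return cfd

def generate(cfd, word, n=5):
    """Greedily follow the most frequent next word, for at most n words."""
    chain = []
    for _ in range(n):
        chain.append(word)
        if not cfd[word]:  # no known successor: stop early
            break
        word = cfd[word].most_common(1)[0][0]
    return chain

# Invented sentence for illustration.
tokens = "the cat sat on the mat and the cat ran".split()
toy_cfd = build_cfd(tokens)
print(generate(toy_cfd, "the", n=4))
```

Because the most frequent successor is always chosen, the output is deterministic and tends to fall into loops on longer runs.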

## Trigrams
Similar to bigrams, trigrams are sequences of three consecutive words appearing in a text.

In [15]:
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.trigrams(tokens))

[('Python', 'is', 'an'),
 ('is', 'an', 'awesome'),
 ('an', 'awesome', 'language'),
 ('awesome', 'language', '.')]

### ngrams
***nltk*** also provides the function **ngrams**. It returns every sequence of n consecutive words appearing in a text.

The following example displays a list of four consecutive words appearing in the text ***s***.

In [16]:
list(nltk.ngrams(tokens, 4))

[('Python', 'is', 'an', 'awesome'),
 ('is', 'an', 'awesome', 'language'),
 ('an', 'awesome', 'language', '.')]
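
An n-gram list is a sliding window of width n over the tokens. A minimal pure-Python sketch of that idea follows; the name `simple_ngrams` is hypothetical and the function is not nltk's implementation.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["Python", "is", "an", "awesome", "language", "."]
print(simple_ngrams(tokens, 2)[:2])   # [('Python', 'is'), ('is', 'an')]
print(len(simple_ngrams(tokens, 4)))  # 3
```

A list of k tokens yields k - n + 1 n-grams, which is why six tokens give five bigrams, four trigrams and three 4-grams.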

## Collocations
A collocation is a sequence of words, typically a pair, that occur together unusually often.

For example, red wine is a collocation.

One characteristic of a collocation is that the words in it cannot be substituted with words having similar senses.

For example, the combination maroon wine sounds odd.

### Generating Collocations
Now let's see how to generate collocations from text with the following example.

In [17]:
from nltk.corpus import genesis
tokens = genesis.words('english-kjv.txt')
gen_text = nltk.Text(tokens)
gen_text.collocations()

said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast


## Stemming
Stemming is a process of stripping affixes from words.

A common first step is to normalize text by converting all the words to lowercase, so that **The** and **the** are treated as the same word. With stemming, the words **playing, played and play** are all reduced to a single form, i.e. **play**.

### Stemmers in nltk
**nltk** comes with a few stemmers. The two most widely used are the ***Porter*** and ***Lancaster*** stemmers. Each has its own rules for stripping affixes.

The following example demonstrates stemming of the word ***builders*** using ***PorterStemmer***.

In [2]:
from nltk import PorterStemmer
porter = PorterStemmer()
porter.stem('builders')

'builder'

Now let's see how to use ***LancasterStemmer*** to stem the word builders. The Lancaster stemmer returns build, whereas the Porter stemmer returns builder.

In [3]:
from nltk import LancasterStemmer
lancaster = LancasterStemmer()
lancaster.stem('builders')

'build'
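
To get a feel for what rule-based suffix stripping means, here is a deliberately naive sketch. It is neither the Porter nor the Lancaster algorithm: the suffix list and the minimum stem length are assumptions chosen only for illustration, and `naive_stem` is a hypothetical helper.

```python
# Illustrative suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ers", "ing", "ed", "er", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["builders", "playing", "played", "play"]:
    print(w, "->", naive_stem(w))
```

Real stemmers apply many more rules, with conditions on what remains of the word, which is why two stemmers can disagree on a word such as builders.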

### Normalizing with Stemming
Let's consider the text collection, text1.

Let's first determine the number of unique words present in original text1. Then normalize the text by converting all the words into lower case and again determine the number of unique words.

In [5]:
from nltk.book import *
print(len(set(text1)))
lc_words = [ word.lower() for word in text1] 
len(set(lc_words))

19317


17231

Now let's further normalize text1 with the ***Porter*** stemmer.
The output below shows that, after normalizing with the Porter stemmer, the text1 collection has 10927 unique stems.

In [7]:
from nltk import PorterStemmer
porter = PorterStemmer()
p_stem_words = [porter.stem(word) for word in set(lc_words) ]
len(set(p_stem_words))

10927

Now let's normalize with the ***Lancaster*** stemmer and determine the unique words of text1. Applying the Lancaster stemmer to the text1 collection results in 9036 unique stems.

In [8]:
from nltk import LancasterStemmer
lancaster = LancasterStemmer()
l_stem_words = [lancaster.stem(word) for word in set(lc_words) ]
len(set(l_stem_words))

9036

## Understanding Lemma
A **lemma** is a lexical entry in a lexical resource such as a dictionary. You can find multiple lemmas with the same spelling. These are known as **homonyms**.

For example, consider the two lemmas listed below, which are **homonyms**.
1. saw [verb] - Past tense of see
2. saw [noun] - Cutting instrument

### Lemmatization
**nltk** comes with **WordNetLemmatizer**. This lemmatizer removes affixes only if the resulting word is found in the lexical resource **WordNet**.

**WordNetLemmatizer** is mainly used to build a vocabulary of words that are valid lemmas.

In [10]:
wnl = nltk.WordNetLemmatizer()
wnl_stem_words = [wnl.lemmatize(word) for word in set(lc_words) ]
len(set(wnl_stem_words))

15168

## POS Tagging
The process of categorizing words into their parts of speech and labeling them accordingly is called **POS Tagging**.

### POS Tagger
A POS Tagger processes a sequence of words and tags a part of speech to each word.

***pos_tag*** is the simplest way to tag text in nltk. The example below shows its usage.

In [16]:
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
nltk.pos_tag(words)

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]

The words ***Python***, ***is*** and ***awesome*** are tagged as proper noun (NNP), verb in the third-person singular present tense (VBZ), and adjective (JJ), respectively. You can read more about the POS tags with the help command below.

In [18]:
nltk.help.upenn_tagset()
## To know about a specific tag like JJ, use the below-shown expression
nltk.help.upenn_tagset('JJ')

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### Tagging Text
Constructing a list of tagged words from a string is possible. A tagged word or token is represented in a tuple, having the word and the tag. In the input text, each word and tag are separated by **/**.

In [19]:
text = 'Python/NN is/VB awesome/JJ ./.'
[ nltk.tag.str2tuple(word) for word in text.split() ]

[('Python', 'NN'), ('is', 'VB'), ('awesome', 'JJ'), ('.', '.')]
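
The conversion splits each token at its last separator. The sketch below shows roughly the same behaviour in plain Python; `split_tagged` is a hypothetical helper, and upper-casing the tag and returning None for a missing tag are assumptions about how str2tuple behaves.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator; the tag is None if there is none."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag.upper())

text = "Python/NN is/VB awesome/JJ ./."
print([split_tagged(tok) for tok in text.split()])
```

Splitting at the last separator rather than the first keeps words that themselves contain a slash intact.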

### Tagged Corpora
Many of the text corpora available in nltk are already tagged with their respective parts of speech.

The ***tagged_words*** method can be used to obtain the tagged words of a corpus. The following example fetches the tagged words of the brown corpus and displays the first few.

In [20]:
from nltk.corpus import brown
brown_tagged = brown.tagged_words()
brown_tagged[:3]

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]

### DefaultTagger
***DefaultTagger*** assigns a specified tag to every word or token of a given text. An example of assigning the **NN** tag to all words of a sentence is shown below.

In [21]:
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(words)

[('Python', 'NN'), ('is', 'NN'), ('awesome', 'NN'), ('.', 'NN')]

### Lookup Tagger
You can define a custom tagger and use it to tag words present in any text. The example below defines a dictionary, defined_tags, with three words and their respective tags.

In [22]:
import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}

### Unigram Tagger
***UnigramTagger*** gives you the flexibility to create your own taggers. Unigram taggers are built from statistical information, i.e., they assign each word or token the tag that is most likely for that particular word.

You can build a unigram tagger through a process known as **training**, then use it to tag the words in a test set and evaluate its performance.

A unigram tagger can also be built directly from a lookup table. The following example defines a **UnigramTagger** with the defined_tags dictionary from the previous cell and uses it to predict the tags of the words in text.
Since the words Python and awesome, and the period, are not found in the **defined_tags** dictionary, they are tagged as None.

In [23]:
baseline_tagger = nltk.UnigramTagger(model=defined_tags)
baseline_tagger.tag(words)

[('Python', None), ('is', 'BEZ'), ('awesome', None), ('.', None)]
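
A lookup tagger is essentially a dictionary lookup per word. The plain-Python sketch below shows that idea and adds a fallback tag for unknown words; `lookup_tag` and its `default` parameter are hypothetical names introduced here. In nltk, a comparable fallback is usually arranged by passing a backoff tagger such as `nltk.DefaultTagger('NN')` when constructing the tagger.

```python
defined_tags = {"is": "BEZ", "over": "IN", "who": "WPS"}

def lookup_tag(words, table, default=None):
    """Tag each word from the table, falling back to `default` when it is unknown."""
    return [(word, table.get(word, default)) for word in words]

words = ["Python", "is", "awesome", "."]
print(lookup_tag(words, defined_tags))                # unknown words get None
print(lookup_tag(words, defined_tags, default="NN"))  # unknown words get 'NN'
```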

Let's consider the tagged sentences of the brown corpus that belong to the government genre. Let's also compute the training set size, i.e., 80% of the sentences.

In [24]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='government')
brown_sents = brown.sents(categories='government')
len(brown_sents)
train_size = int(len(brown_sents)*0.8)
train_size

2425

***unigram_tagger*** is built by passing the tagged training sentences as an argument to ***UnigramTagger***.

The resulting unigram_tagger is then evaluated on the test sentences.

In [25]:
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

0.7799495586380832
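
The score is the fraction of test tokens whose predicted tag matches the gold tag. The sketch below walks through training and evaluation in plain Python on invented toy sentences; `train_unigram` and `accuracy` are hypothetical helpers that illustrate the idea and are not nltk's implementation.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Return a dict mapping each word to its most frequent tag in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = correct = 0
    for sent in tagged_sents:
        for word, gold in sent:
            total += 1
            correct += model.get(word) == gold  # unseen words predict None
    return correct / total

# Toy data, invented for illustration.
tagged = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("runs", "VBZ")],
    [("a", "AT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "AT"), ("run", "NN"), ("ends", "VBZ")],
    [("the", "AT"), ("bird", "NN"), ("runs", "VBZ")],
]
split = int(len(tagged) * 0.8)  # 80% train, 20% test
model = train_unigram(tagged[:split])
score = accuracy(model, tagged[split:])
print(score)
```

In the toy test sentence, bird never appears in training and is therefore mis-tagged, so two of three tokens are correct. Words unseen in training are a typical source of errors for a unigram tagger.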

In [26]:
## The following code snippet shows tagging words of a sentence, taken from the test set.
unigram_tagger.tag(brown_sents[3000])

[('The', 'AT'),
 ('first', 'OD'),
 ('step', 'NN'),
 ('is', 'BEZ'),
 ('a', 'AT'),
 ('comprehensive', 'JJ'),
 ('self', None),
 ('study', 'NN'),
 ('made', 'VBN'),
 ('by', 'IN'),
 ('faculty', None),
 (',', ','),
 ('by', 'IN'),
 ('outside', 'IN'),
 ('consultants', 'NNS'),
 (',', ','),
 ('or', 'CC'),
 ('by', 'IN'),
 ('a', 'AT'),
 ('combination', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('two', 'CD'),
 ('.', '.')]

## Summary on NLP with Python

- Tokenizing text using functions word_tokenize and sent_tokenize.

- Computing Frequencies with FreqDist and ConditionalFreqDist.

- Generating Bigrams and collocations with bigrams and collocations.

- Stemming word affixes using PorterStemmer and LancasterStemmer.

- Tagging words to their parts of speech using pos_tag.