# NLTK - Basics

** The *Natural Language Toolkit*, more commonly known as NLTK, is a suite of libraries and programs, written in *Python*, for symbolic and statistical natural language processing of English. **

In [1]:
import nltk

** Next, we download the datasets and models that we will need. Calling `nltk.download()` with no arguments opens NLTK's interactive downloader. **

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
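The interactive downloader is not always convenient, for example on a machine without a display. As a minimal sketch, assuming the package identifiers `brown`, `inaugural`, `webtext` and `book` cover everything used in this notebook, the same data can be requested by name. The helper `download_packages` and the list `PACKAGES` are hypothetical names introduced for illustration.

```python
# Package identifiers assumed to cover the corpora used in this notebook.
PACKAGES = ['brown', 'inaugural', 'webtext', 'book']


def download_packages(packages=PACKAGES):
    """Download each named NLTK data package and report success per package."""
    # Imported here so the list above can be inspected without NLTK installed.
    import nltk

    # nltk.download returns True on success; quiet=True suppresses progress output.
    return {name: nltk.download(name, quiet=True) for name in packages}
```

Calling `download_packages()` needs network access and returns a dictionary mapping each package name to `True` or `False`.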

** We import the Brown corpus from the collection of available corpora. **

In [3]:
from nltk.corpus import brown

** We can list the genres of text in the Brown corpus with `categories()`, which returns the list of categories below. **

In [4]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

** We can extract the words in a particular category with the following line of code. **

In [5]:
brown.words(categories='adventure')

['Dan', 'Morgan', 'told', 'himself', 'he', 'would', ...]

** Here we slice the result to limit it to the first 100 words of the category. The displayed output is still abbreviated with `...`. **

In [6]:
brown.words(categories='adventure')[:100]

['Dan', 'Morgan', 'told', 'himself', 'he', 'would', ...]

#### Number of words in the corpus

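** We can determine the total number of words in the *adventure* category of the Brown corpus with the following line of code. This counts every token, including repeated words and punctuation, not the number of unique words. **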

In [7]:
len(brown.words(categories='adventure'))

69342
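Because `len()` counts every token, repeated words and punctuation included, it can help to compute the number of distinct words next to it. The following is a minimal sketch: the helper `count_tokens_and_types` is a hypothetical name, and the word list is a small made-up sample standing in for `brown.words(categories='adventure')`.

```python
def count_tokens_and_types(words):
    """Return (total tokens, distinct alphabetic words) for an iterable of tokens."""
    tokens = list(words)
    # Lower-case so that 'The' and 'the' count once; skip punctuation and numbers.
    types = {w.lower() for w in tokens if w.isalpha()}
    return len(tokens), len(types)


# Made-up sample; with the corpus installed, pass
# brown.words(categories='adventure') instead.
sample = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']
total, distinct = count_tokens_and_types(sample)
print(total, distinct)
```

Note that `str.isalpha()` also drops hyphenated tokens and tokens containing digits or apostrophes, which may or may not be what you want.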

### Analysis on Inaugural Corpus

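** The *Inaugural* corpus is a collection of the inaugural addresses of the presidents of the United States of America, one text per address. The NLTK book describes it as 55 texts; newer releases of the corpus data, including the one used here, also contain later addresses such as those of 2013 and 2017. **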

In [8]:
from nltk.corpus import inaugural

** We can list the texts in the inaugural corpus by their file ids. The following line of code displays the file ids of all the texts in the corpus. **

In [9]:
inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

** We can print the first 20 words of a particular presidential address by passing its *file id* to `words()` and slicing the result with `[:20]`. **

In [11]:
inaugural.words(fileids='1861-Lincoln.txt')[:20]

['Fellow',
 '-',
 'Citizens',
 'of',
 'the',
 'United',
 'States',
 ':',
 'In',
 'compliance',
 'with',
 'a',
 'custom',
 'as',
 'old',
 'as',
 'the',
 'Government',
 'itself',
 ',']

** Similarly, we can print the first 20 words of Obama's 2009 and 2013 addresses and of Trump's 2017 address with the following lines of code. **

In [12]:
inaugural.words(fileids='2009-Obama.txt')[:20]

['My',
 'fellow',
 'citizens',
 ':',
 'I',
 'stand',
 'here',
 'today',
 'humbled',
 'by',
 'the',
 'task',
 'before',
 'us',
 ',',
 'grateful',
 'for',
 'the',
 'trust',
 'you']

In [13]:
inaugural.words(fileids='2013-Obama.txt')[:20]

['Thank',
 'you',
 '.',
 'Thank',
 'you',
 'so',
 'much',
 '.',
 'Vice',
 'President',
 'Biden',
 ',',
 'Mr',
 '.',
 'Chief',
 'Justice',
 ',',
 'Members',
 'of',
 'the']

In [14]:
inaugural.words(fileids='2017-Trump.txt')[:20]

['Chief',
 'Justice',
 'Roberts',
 ',',
 'President',
 'Carter',
 ',',
 'President',
 'Clinton',
 ',',
 'President',
 'Bush',
 ',',
 'President',
 'Obama',
 ',',
 'fellow',
 'Americans',
 ',',
 'and']
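The file ids used above follow a `YEAR-Name.txt` pattern, so the year and the president's name can be recovered from the id itself. The following is a minimal sketch; `parse_fileid` is a hypothetical helper, and the short list stands in for `inaugural.fileids()`.

```python
def parse_fileid(fileid):
    """Split an inaugural file id such as '1861-Lincoln.txt' into (year, name)."""
    stem = fileid.rsplit('.', 1)[0]   # drop the '.txt' extension
    year, name = stem.split('-', 1)   # split on the first hyphen only
    return int(year), name


# File ids taken from the cells above; with the corpus installed,
# use inaugural.fileids() instead.
fileids = ['1861-Lincoln.txt', '2009-Obama.txt', '2013-Obama.txt', '2017-Trump.txt']
after_2000 = [f for f in fileids if parse_fileid(f)[0] >= 2000]
print(after_2000)
```

This makes it easy to select addresses by period before passing their file ids to `inaugural.words()`.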

### Frequency Distribution of words in Book corpus

** Importing everything from `nltk.book` gives us access to several predefined texts (`text1` to `text9`) and sentences (`sent1` to `sent9`). **

In [15]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


** The *FreqDist* class is used to encode frequency distributions, which count the number of times that each outcome of an experiment occurs. **

In [16]:
f = FreqDist(text1)

In [17]:
print(f)

<FreqDist with 19317 samples and 260819 outcomes>


** We print the 10 most commonly occurring tokens in `text1` (*Moby Dick*). Punctuation marks are counted as tokens, so several of them appear in the list. **

In [18]:
f.most_common(10)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982)]
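Punctuation takes up several of the top slots in the list above. One way to focus on words is to filter the tokens before counting. The sketch below uses `collections.Counter` on a made-up token list as a stand-in; in NLTK 3, `FreqDist` is a subclass of `Counter`, so the same generator expression could be passed to `FreqDist`. The helper name `word_frequencies` is hypothetical.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count alphabetic tokens only, ignoring case."""
    return Counter(t.lower() for t in tokens if t.isalpha())


# Made-up sample; with NLTK loaded, pass text1 instead.
sample = ['The', 'whale', ',', 'the', 'whale', '!', 'A', 'sail', ';',
          'the', 'whale', ',', 'whale', '.']
counts = word_frequencies(sample)
print(counts.most_common(2))
```

Lower-casing merges `The` and `the` into one entry, which changes the counts compared with the unfiltered distribution.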

### Applying FreqDist() on texts in inaugural corpus

** We extract all the words from the text with the given file id and then build a frequency distribution from them. **

In [19]:
common = inaugural.words(fileids='1977-Carter.txt')

In [20]:
f1 = FreqDist(common)

In [21]:
print(f1)

<FreqDist with 529 samples and 1380 outcomes>


** We list the 5 most commonly occurring tokens in this text, according to the frequency distribution. **

In [22]:
f1.most_common(5)

[(',', 65), ('.', 50), ('the', 49), ('and', 45), ('to', 43)]
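Raw counts are easier to compare across texts once they are expressed as a share of all tokens. The following sketch redoes that arithmetic by hand with the numbers printed above (1380 outcomes in total); on a real `FreqDist`, the `freq()` method returns the same ratio for a single sample.

```python
# Counts and total copied from the outputs above.
top_five = [(',', 65), ('.', 50), ('the', 49), ('and', 45), ('to', 43)]
total_outcomes = 1380

# Share of all tokens taken by each of the five most common tokens.
shares = {token: count / total_outcomes for token, count in top_five}

# Combined share of the five tokens together.
combined = sum(count for _, count in top_five) / total_outcomes

for token, share in shares.items():
    print(f"{token!r}: {share:.1%}")
print(f"combined: {combined:.1%}")
```

The five most common tokens account for 252 of the 1380 tokens, or roughly 18% of the address.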

In [1]:
brown.words(categories='fiction')[:100]

NameError: name 'brown' is not defined

In [2]:
from nltk.corpus import brown

In [3]:
brown.words(categories='fiction')[:100]

['Thirty-three', 'Scotty', 'did', 'not', 'go', 'back', ...]

In [5]:
from nltk.corpus import inaugural

In [6]:
inaugural.words(fileids='1905-Roosevelt.txt')[:10]

['My',
 'fellow',
 'citizens',
 ',',
 'no',
 'people',
 'on',
 'earth',
 'have',
 'more']

In [7]:
inaugural.words(fileids='1933-Roosevelt.txt')[:10]

['I',
 'am',
 'certain',
 'that',
 'my',
 'fellow',
 'Americans',
 'expect',
 'that',
 'on']

### Analysis of contents in Webtext corpus

** Unlike established literature or formal presidential addresses, the *webtext* corpus contains more informal language. This small collection is useful because informal text of this kind is what we encounter most often in daily life. We display the full list of texts in the corpus. **

In [8]:
from nltk.corpus import webtext

In [9]:
webtext.fileids()

['firefox.txt',
 'grail.txt',
 'overheard.txt',
 'pirates.txt',
 'singles.txt',
 'wine.txt']

** We print the first 20 words of a particular text, referenced by its file id. **

In [10]:
webtext.words(fileids='grail.txt')[:20]

['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa',
 'there',
 '!',
 '[',
 'clop',
 'clop']

** Next, we display the first 20 words of every text in the corpus by looping over the file ids with a `for` loop. We get 6 lists, one per text, each holding that text's first 20 words. **

In [13]:
for e in webtext.fileids():
    print(webtext.words(fileids=e)[:20])

['Cookie', 'Manager', ':', '"', 'Don', "'", 't', 'allow', 'sites', 'that', 'set', 'removed', 'cookies', 'to', 'set', 'future', 'cookies', '"', 'should', 'stay']
['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop', ']', 'KING', 'ARTHUR', ':', 'Whoa', 'there', '!', '[', 'clop', 'clop']
['White', 'guy', ':', 'So', ',', 'do', 'you', 'have', 'any', 'plans', 'for', 'this', 'evening', '?', 'Asian', 'girl', ':', 'Yeah', ',', 'being']
['PIRATES', 'OF', 'THE', 'CARRIBEAN', ':', 'DEAD', 'MAN', "'", 'S', 'CHEST', ',', 'by', 'Ted', 'Elliott', '&', 'Terry', 'Rossio', '[', 'view', 'looking']
['25', 'SEXY', 'MALE', ',', 'seeks', 'attrac', 'older', 'single', 'lady', ',', 'for', 'discreet', 'encounters', '.', '35YO', 'Security', 'Guard', ',', 'seeking', 'lady']
['Lovely', 'delicate', ',', 'fragrant', 'Rhone', 'wine', '.', 'Polished', 'leather', 'and', 'strawberries', '.', 'Perhaps', 'a', 'bit', 'dilute', ',', 'but', 'good', 'for']
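The lists printed by the loop are not labelled, so it is not obvious which file each one comes from. A minimal sketch that keeps the file id next to its words is shown below. The helper `first_words_by_file` is a hypothetical name, and the small dictionary stands in for `{fid: webtext.words(fileids=fid) for fid in webtext.fileids()}`.

```python
from itertools import islice


def first_words_by_file(words_by_file, n=20):
    """Map each file id to a list of its first n tokens."""
    return {fid: list(islice(words, n)) for fid, words in words_by_file.items()}


# Stand-in data built from the first tokens shown above.
words_by_file = {
    'grail.txt': ['SCENE', '1', ':', '[', 'wind', ']'],
    'wine.txt': ['Lovely', 'delicate', ',', 'fragrant', 'Rhone', 'wine', '.'],
}

for fid, words in first_words_by_file(words_by_file, n=3).items():
    print(f"{fid}: {words}")
```

Using `islice` reads only the first `n` tokens of each text rather than the whole file.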
