# Gutenberg corpus with the NLTK package
### Step 1: Download and Install NLTK

    Install NLTK: run sudo pip install -U nltk
    Install Numpy (optional): run sudo pip install -U numpy
    Test installation: run python then type import nltk
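
The installation check can also be scripted. The following is a minimal sketch that uses only the standard library: `importlib.util.find_spec` reports whether a package can be found without importing it. The helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None

# Report the status of the two packages mentioned in the steps above
for name in ("nltk", "numpy"):
    print(name, "is installed" if is_installed(name) else "is missing")
```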

### Step 2: Import source data from the NLTK package
#### The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on.

In [1]:
import nltk as NLTK
import itertools as ListOps

In [2]:
from nltk.corpus import brown
cats = brown.categories()
print(cats)

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


### Step 3: Relative Frequency distribution and conditional frequency of modals

In [3]:
# Frequency distribution of the lowercased words in the 'news' genre
text = brown.words(categories='news')
fdist = NLTK.FreqDist(w.lower() for w in text)
modals = ['can', 'could', 'may', 'might', 'will', 'would', 'should']
print('Frequency Distribution of modals \n')
for mods in modals:
    print(mods + ':', fdist[mods], end=' ')

Frequency Distribution of modals 

can: 94 could: 87 may: 93 might: 38 will: 389 would: 246 should: 61 
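
The cell above prints raw counts, while the step title speaks of relative frequency. As a rough sketch, the counts from the output can be turned into shares of the seven modals taken together (not shares of all words in the genre; for that, NLTK's `FreqDist.freq` method divides by the total number of tokens). The helper name `relative_shares` is an assumption for this example.

```python
# Counts copied from the output of In [3] (news genre, lowercased words)
modal_counts = {'can': 94, 'could': 87, 'may': 93, 'might': 38,
                'will': 389, 'would': 246, 'should': 61}

def relative_shares(counts):
    """Divide each count by the total so that the shares sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

shares = relative_shares(modal_counts)

# Print the modals from most to least frequent
for word, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f'{word}: {share:.3f}')
```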

In [4]:
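# Conditional frequency distribution: one word-count table per genre.
# Note: unlike In [3], the words are not lowercased here, so counts are case-sensitive.
cFrqDist = NLTK.ConditionalFreqDist((genre, word)
                               for genre in brown.categories()
                               for word in brown.words(categories=genre))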
genres = cats
print('Conditional frequency Distribution for modals across all genres\n')
cFrqDist.tabulate(conditions=genres, samples=modals)

Conditional frequency Distribution for modals across all genres

                   can  could    may  might   will  would should 
      adventure     46    151      5     58     50    191     15 
 belles_lettres    246    213    207    113    236    392    102 
      editorial    121     56     74     39    233    180     88 
        fiction     37    166      8     44     52    287     35 
     government    117     38    153     13    244    120    112 
        hobbies    268     58    131     22    264     78     73 
          humor     16     30      8      8     13     56      7 
        learned    365    159    324    128    340    319    171 
           lore    170    141    165     49    175    186     76 
        mystery     42    141     13     57     20    186     29 
           news     93     86     66     38    389    244     59 
       religion     82     59     78     12     71     68     45 
        reviews     45     40     45     26     58     47     18 
        rom
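
The news row in this table (can 93, may 66) does not match the output of In [3] (can 94, may 93). A likely reason is that In [3] lowercases every word, while the conditional distribution counts words exactly as they appear, so capitalized forms such as 'May' are counted separately. The sketch below reproduces the effect on made-up data with the standard library; `conditional_counts` and `toy_pairs` are hypothetical names, not part of NLTK.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs, lowercase=False):
    """Build one Counter per condition (genre) from (genre, word) pairs."""
    table = defaultdict(Counter)
    for genre, word in pairs:
        table[genre][word.lower() if lowercase else word] += 1
    return table

# Made-up (genre, word) pairs, for illustration only
toy_pairs = [('news', 'May'), ('news', 'may'), ('news', 'Can'),
             ('news', 'can'), ('news', 'can'), ('humor', 'may')]

case_sensitive = conditional_counts(toy_pairs)
lowercased = conditional_counts(toy_pairs, lowercase=True)

print('case-sensitive:', case_sensitive['news']['may'], case_sensitive['news']['can'])
print('lowercased:    ', lowercased['news']['may'], lowercased['news']['can'])
```

Passing `(genre, word.lower())` pairs to `ConditionalFreqDist` would be one way to make both cells count the same thing.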

### Inaugural corpus

In [5]:
from nltk.corpus import inaugural

In [6]:
inaguralCats = inaugural.fileids()
print('The text files are: \n', inaguralCats, '\n \n The number of text files is: ', len(inaguralCats))

The text files are: 
 ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Re
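
Each file ID follows the pattern year, hyphen, president's name, `.txt`. A small sketch of splitting an ID into its parts, using a few IDs from the output above; the helper name `parse_fileid` is an assumption for this example.

```python
def parse_fileid(fileid):
    """Split an ID such as '1789-Washington.txt' into (year, name)."""
    stem = fileid.rsplit('.', 1)[0]   # drop the '.txt' extension
    year, name = stem.split('-', 1)   # split at the first hyphen only
    return int(year), name

# A few IDs taken from the listing above
sample_ids = ['1789-Washington.txt', '1837-VanBuren.txt', '1861-Lincoln.txt']
for fileid in sample_ids:
    print(parse_fileid(fileid))
```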

In [7]:
NLTK.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /Users/raam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
print('Combining words from all', len(inaguralCats), 'text files of inaugural \n')

def getWords(x):
    return(inaugural.words(fileids=x))

allWordsFromInagural = list(map(getWords, inaguralCats))

#To join nested lists of allWordsFromInagural
allWordsFromInagural = list(ListOps.chain.from_iterable(allWordsFromInagural))

print('Number of words before treating the list with Stopwords: ', len(allWordsFromInagural), '\n --------')

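# Keep only alphabetic words longer than 7 characters that are not English stopwords.
# The stopword list is built once, as a set, instead of being reloaded for every word.
englishStopwords = set(stopwords.words('english'))
allWordsFromInagural = [x for x in allWordsFromInagural
                        if x not in englishStopwords and x.isalpha() and len(x) > 7]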

print('Number of words after implementing conditions and cleaning the list with Stopwords: ', 
      len(allWordsFromInagural))

Combining words from all 58 text files of inaugural 

Number of words before treating the list with Stopwords:  149797 
 --------
Number of words after implementing conditions and cleaning the list with Stopwords:  22762
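
The filter keeps 22762 of 149797 tokens, roughly 15 percent. The two steps of In [8], flattening the per-file word lists and filtering them, can be sketched on made-up data as follows. The function name, the toy lists and the toy stopword set are all assumptions for illustration; the real cell uses NLTK's English stopword list.

```python
import itertools

def flatten_and_filter(word_lists, stop_words, min_length=7):
    """Chain nested word lists, then keep alphabetic non-stopwords longer than min_length."""
    flat = itertools.chain.from_iterable(word_lists)
    return [w for w in flat
            if w not in stop_words and w.isalpha() and len(w) > min_length]

# Made-up input: one list of tokens per "file"
toy_lists = [['Fellow', '-', 'Citizens', 'of', 'the', 'Senate'],
             ['government', 'themselves', 'and', 'constitution', '1789']]
toy_stop_words = {'of', 'the', 'and', 'themselves'}

kept = flatten_and_filter(toy_lists, toy_stop_words)
print(kept)
```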


In [9]:
inaguralFDist = NLTK.FreqDist(everyWord.lower() for everyWord in allWordsFromInagural)

topTen = dict(map(lambda x: (x, inaguralFDist[x]), allWordsFromInagural))

topTen = sorted(topTen.items(), key=lambda x: x[1], reverse = True)

# Keep only the words (not the counts) of the ten most frequent entries
topTenWord = [word for word, _ in topTen[:10]]

print('Top 10 frequently used words across all the text files are: \n', topTenWord)

Top 10 frequently used words across all the text files are: 
 ['government', 'citizens', 'constitution', 'national', 'congress', 'interests', 'political', 'executive', 'principles', 'progress']
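
`nltk.FreqDist` is a subclass of `collections.Counter`, so `most_common(n)` returns the top entries directly, without building and sorting a separate dictionary. The sketch below uses a plain `Counter` on made-up words to show this, and also one caveat: In [9] looks counts up with the words as they appear in `allWordsFromInagural`, while the distribution is keyed by lowercased words, so a word that occurs only in capitalized form would be looked up as 0. Whether that affects the list above is not checked here.

```python
from collections import Counter

# Made-up words, for illustration only
toy_words = ['American', 'Government', 'American', 'government', 'American', 'citizens']

# Count lowercased words, as In [9] does with FreqDist
counts = Counter(w.lower() for w in toy_words)

# The two most frequent entries, as (word, count) pairs
top_two = counts.most_common(2)
print(top_two)

# A lookup with the capitalized form misses the lowercased key and yields 0
print(counts['American'], counts['american'])
```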


In [10]:
print('Frequency Distribution of Top 10 words whose character length is > 7 \n')
for top10 in topTenWord:
    print(top10 + ':', inaguralFDist[top10], '\n', end=' ')

Frequency Distribution of Top 10 words whose character length is > 7 

government: 600 
 citizens: 247 
 constitution: 206 
 national: 157 
 congress: 130 
 interests: 115 
 political: 106 
 executive: 97 
 principles: 96 
 progress: 94 
 

In [12]:
from nltk.corpus import wordnet
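def synonym_count_func(word):
    # Collect the lemma names of every WordNet synset that contains the word
    synonyms = []
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
    # The count uses a set (distinct names); the printed list keeps duplicates
    print('Number of distinct synonyms of', word, 'is: ', len(set(synonyms)))
    print('Synonyms are: ', *synonyms)
synonym_count_func('progress')

Number of distinct synonyms of progress is:  19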
Synonyms are:  advancement progress progress progression procession advance advancement forward_motion onward_motion progress progression advance progress come_on come_along advance get_on get_along shape_up advance progress pass_on move_on march_on go_on build_up work_up build progress
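
The count of 19 refers to distinct names, while the printed list repeats a name once for every synset it belongs to. A minimal sketch of removing the repeats while keeping first-seen order, applied to the names printed above; `dict.fromkeys` is used because dictionaries preserve insertion order, and the variable names are made up for this example.

```python
# Names copied from the output above, duplicates included
printed = ('advancement progress progress progression procession advance '
           'advancement forward_motion onward_motion progress progression '
           'advance progress come_on come_along advance get_on get_along '
           'shape_up advance progress pass_on move_on march_on go_on '
           'build_up work_up build progress').split()

# dict.fromkeys keeps the first occurrence of each name, in order
unique_synonyms = list(dict.fromkeys(printed))

print(len(printed), 'names printed,', len(unique_synonyms), 'distinct')
print(*unique_synonyms)
```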
