## NLP - Natural Language Processing

Natural Language Processing (NLP) enables computers to understand human language, written or spoken, and to respond to it.

NLP techniques can process the huge volume of unstructured text available online and extract meaningful insights from it.

- It can automate **translating text** from one language to another.

- It can be used to perform **sentiment analysis**.

- It helps in building applications that **interact with humans** the way humans do.

- It can also automate **text classification, spam filtering,** and more.

In [1]:
#!pip install nltk  # Natural Language Toolkit

In [2]:
import nltk

In [3]:
text = "Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991."

In [4]:
sentences = nltk.sent_tokenize(text)
sentences

['Python is an interpreted high-level programming language for general-purpose programming.',
 'Created by Guido van Rossum and first released in 1991.']

In [5]:
words = nltk.word_tokenize(text)

In [6]:
len(words)

22

In [7]:
words[:5]

['Python', 'is', 'an', 'interpreted', 'high-level']

In [8]:
wordfreq = nltk.FreqDist(words)
wordfreq

FreqDist({'programming': 2, '.': 2, 'Python': 1, 'is': 1, 'an': 1, 'interpreted': 1, 'high-level': 1, 'language': 1, 'for': 1, 'general-purpose': 1, ...})

In [9]:
wordfreq.most_common(2)

[('programming', 2), ('.', 2)]
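FreqDist builds on Python's collections.Counter, so the counting step can be sketched with the standard library alone. The token list below is typed out by hand (an assumption made so the sketch needs no tokenizer data); it mirrors the first sentence of the text above.

```python
from collections import Counter

# Hand-written tokens for the first sentence, so no NLTK data is required.
tokens = ["Python", "is", "an", "interpreted", "high-level", "programming",
          "language", "for", "general-purpose", "programming", "."]

# Counter maps each distinct token to the number of times it occurs.
freq = Counter(tokens)

# most_common(k) returns the k highest counts as (token, count) pairs.
top = freq.most_common(1)
print(top)

# A token that never occurred has a count of 0 rather than raising KeyError.
print(freq["Java"])
```

Punctuation is counted like any other token, which is why '.' appears among the most common items above.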

##### Downloading NLTK Book collection
 - These texts are available in the book collection of NLTK.
 - They can be downloaded by running the following command in a Python interpreter after importing nltk.

In [10]:
#nltk.download('book')

In [11]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [12]:
# Search for tokens that start with 'tri' and end with 'r'.
text1.findall("<tri.*r>")

triangular; triangular; triangular; triangular


##### Determining Total Word Count

In [13]:
display(type(text1))
n_words = len(text1)
n_words

nltk.text.Text

260819

In [14]:
n_unique_words = len(set(text1))
n_unique_words

19317

##### Transforming Words

In [15]:
text1_lcw = [ word.lower() for word in set(text1) ]
n_unique_words_lc = len(set(text1_lcw))
n_unique_words_lc

17231

##### Determining Word Coverage

In [16]:
word_coverage1 = n_words / n_unique_words
word_coverage1

13.502044830977896

In [17]:
word_coverage2 = n_words / n_unique_words_lc
word_coverage2

15.136614241773549
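The two ratios above divide the total token count by the number of distinct tokens, with and without case folding. A minimal sketch of that calculation as a reusable function; `word_coverage` and the sample sentence are hypothetical names and data introduced here for illustration.

```python
def word_coverage(tokens, lowercase=False):
    """Average number of occurrences per distinct token."""
    if lowercase:
        # Fold case first so 'The' and 'the' count as one distinct token.
        tokens = [t.lower() for t in tokens]
    distinct = set(tokens)
    if not distinct:
        return 0.0  # avoid dividing by zero on empty input
    return len(tokens) / len(distinct)


sample = ["The", "whale", "and", "the", "sea", "and", "the", "ship"]
print(word_coverage(sample))                  # 8 tokens / 6 distinct
print(word_coverage(sample, lowercase=True))  # 8 tokens / 5 distinct
```

Lowercasing can only shrink the set of distinct tokens, so the second ratio is never smaller than the first.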

##### Filtering Words

In [18]:
big_words = [word for word in set(text1) if len(word) > 17 ]
big_words

['characteristically', 'uninterpenetratingly']

In [19]:
sun_words = [word for word in set(text1) if word.startswith('Sun') ]
sun_words

['Sunda', 'Sunday', 'Sunset']

##### Frequency Distribution

In [30]:
text1_freq = nltk.FreqDist(text1)
text1_freq

FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

In [31]:
text1_freq['Sunday']

7

In [33]:
top3_text1 = text1_freq.most_common(3)
top3_text1

[(',', 18713), ('the', 13721), ('.', 6862)]

In [34]:
large_common_words = [word for word in text1 if word.isalpha() and len(word) > 7 ]
text1_common_freq = nltk.FreqDist(large_common_words)
text1_common_freq.most_common(3)

[('Queequeg', 252), ('Starbuck', 196), ('something', 119)]

##### Accessing Text Corpora

In [36]:
from nltk.corpus import genesis

In [37]:
genesis.fileids()

['english-kjv.txt',
 'english-web.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'lolcat.txt',
 'portuguese.txt',
 'swedish.txt']

The methods raw, words, and sents return the characters, words, and sentences of a given file. The code below uses their lengths to print the average word length and the average sentence length for each file.

In [38]:
for fileid in genesis.fileids():
    n_chars = len(genesis.raw(fileid))
    n_words = len(genesis.words(fileid))
    n_sents = len(genesis.sents(fileid))
    # Average characters per word, average words per sentence, file name
    print(int(n_chars/n_words), int(n_words/n_sents), fileid)

4 30 english-kjv.txt
4 19 english-web.txt
5 15 finnish.txt
4 23 french.txt
4 23 german.txt
4 20 lolcat.txt
4 27 portuguese.txt
4 30 swedish.txt


##### Conditional Frequency

In [40]:
c_items = [('F','apple'), ('F','apple'), ('F','kiwi'), ('V','cabbage'), ('V','cabbage'), ('V','potato') ]

In [42]:
cfd = nltk.ConditionalFreqDist(c_items)
cfd

<ConditionalFreqDist with 2 conditions>

In [43]:
cfd.conditions()

['F', 'V']

In [44]:
cfd.items()

dict_items([('F', FreqDist({'apple': 2, 'kiwi': 1})), ('V', FreqDist({'cabbage': 2, 'potato': 1}))])

In [45]:
cfd['V']

FreqDist({'cabbage': 2, 'potato': 1})

In [48]:
cfd['F']['apple']

2
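A conditional frequency distribution is one frequency distribution per condition. The same idea can be sketched in plain Python with a dictionary of counters; this is an illustration of the data structure, not NLTK's implementation, and `cond_freq` is a name chosen here.

```python
from collections import Counter, defaultdict

# (condition, sample) pairs, the same shape ConditionalFreqDist accepts.
pairs = [('F', 'apple'), ('F', 'apple'), ('F', 'kiwi'),
         ('V', 'cabbage'), ('V', 'cabbage'), ('V', 'potato')]

# One Counter per condition, created on first use.
cond_freq = defaultdict(Counter)
for condition, sample in pairs:
    cond_freq[condition][sample] += 1

print(sorted(cond_freq))                # the conditions
print(cond_freq['F']['apple'])          # count of 'apple' under condition 'F'
print(cond_freq['V'].most_common(1))    # most frequent sample under 'V'
```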

In [50]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist([ (genre, word) for genre in brown.categories() for word in brown.words(categories=genre) ])

In [55]:
cfd.conditions()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [54]:
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'])

           leadership    worship   hardship 
government         12          3          2 
     humor          1          0          0 
   reviews         14          1          2 


In [56]:
news_fd = cfd['news']
news_fd

FreqDist({'the': 5580, ',': 5188, '.': 4030, 'of': 2849, 'and': 2146, 'to': 2116, 'a': 1993, 'in': 1893, 'for': 943, 'The': 806, ...})

##### Raw Text Processing

In [57]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
content1 = request.urlopen(url).read()

In [67]:
content1[:500]

b'\xef\xbb\xbfThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Crime and Punishment\r\n\r\nAuthor: Fyodor Dostoevsky\r\n\r\nRelease Date: March 28, 2006 [EBook #2554]\r\nLast Updated: October 27, 2016\r\n\r\nLanguage: English\r\n\r\nChar'

##### Reading an HTML file

In [68]:
from urllib import request

url = "http://www.bbc.com/news/health-42802191"
html_content = request.urlopen(url).read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')


In [73]:
inner_body = soup.find_all('div', attrs={'class':'story-body__inner'})
inner_text = [elm.text for elm in inner_body[0].find_all(['h1', 'h2', 'p', 'li']) ]
text_content2 = '\n'.join(inner_text)
text_content2

'Smokers need to quit cigarettes rather than cut back on them to significantly lower their risk of heart disease and stroke, a large BMJ study suggests. \nPeople who smoked even one cigarette a day were still about 50% more likely to develop heart disease and 30% more likely to have a stroke than people who had never smoked, researchers said. \nThey said it showed there was no safe level of smoking for such diseases. \nBut an expert said people who cut down were more likely to stop.\nWhy young people are now less likely to smoke\nQuit smoking campaign backs e-cigs\n\'One smoke leads to daily habit for most\'\n\'Stop completely\'\nCardiovascular disease, not cancer, is the greatest mortality risk for smoking, causing about 48% of smoking-related premature deaths.\nWhile the percentage of adults in the UK who smoked had been falling, the proportion of people who smoked one to five cigarettes a day had been rising steadily, researchers said. \nTheir analysis of 141 studies, published in t

In [78]:
text_content1 = content1.decode('utf-8-sig')  # utf-8-sig drops the leading byte order mark (BOM)
tokens1 = nltk.word_tokenize(text_content1)

In [79]:
tokens1[:6]

['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime']
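The downloaded bytes start with `\xef\xbb\xbf`, the UTF-8 byte order mark (BOM), so the codec passed to decode determines whether those bytes survive as stray characters in the first token. A small sketch on a hand-copied prefix of the download, not the full file:

```python
# First bytes of the response shown earlier, copied by hand for illustration.
raw = b'\xef\xbb\xbfThe Project Gutenberg EBook'

# unicode_escape maps each non-ASCII byte to one Latin-1 character,
# so the three BOM bytes turn into three visible characters.
as_escape = raw.decode('unicode_escape')

# utf-8-sig decodes UTF-8 and removes a leading BOM if one is present.
as_utf8_sig = raw.decode('utf-8-sig')

print(repr(as_escape[:6]))
print(repr(as_utf8_sig[:6]))
```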

In [81]:
tokens2 = nltk.word_tokenize(text_content2)
print(len(tokens2))
tokens2[:8]

751


['Smokers', 'need', 'to', 'quit', 'cigarettes', 'rather', 'than', 'cut']

##### Regular Expressions for Tokenization

In [83]:
import re

In [84]:
tokens2_2 = re.findall(r'\w+', text_content2)
len(tokens2_2)

668

In [85]:
pattern = r'\w+'
tokens2_3 = nltk.regexp_tokenize(text_content2, pattern)
len(tokens2_3)

668
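The pattern `\w+` yields fewer tokens than word_tokenize did on the same text. One reason is that `\w+` matches only runs of letters, digits, and underscores, so punctuation never becomes a token. A short sketch on a shortened sentence adapted from the article; the second pattern is an assumption added here to show one way of keeping punctuation.

```python
import re

sample = ("People who smoked even one cigarette a day were still "
          "about 50% more likely, researchers said.")

# Word characters only: '%', ',' and '.' are silently dropped.
word_tokens = re.findall(r'\w+', sample)

# Alternative: also emit each non-space, non-word character as its own token.
with_punct = re.findall(r'\w+|[^\w\s]', sample)

print(len(word_tokens), len(with_punct))
print(with_punct[-4:])
```

The same patterns can be passed to nltk.regexp_tokenize, as in the cell above.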

##### Creation of NLTK text

In [86]:
input_text2 = nltk.Text(tokens2)
type(input_text2)

nltk.text.Text

In [87]:
input_text2[:5]

['Smokers', 'need', 'to', 'quit', 'cigarettes']

##### Bigrams

In [88]:
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.bigrams(tokens))

[('Python', 'is'),
 ('is', 'an'),
 ('an', 'awesome'),
 ('awesome', 'language'),
 ('language', '.')]

##### Computing Frequent Bigrams

In [104]:
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
filtered_bigrams = [ (w1, w2) for w1, w2 in eng_bigrams if len(w1) >=5 and len(w2) >= 5 ]

In [106]:
filtered_bigrams[:5]

[('called', 'Night'),
 ('waters', 'which'),
 ('waters', 'which'),
 ('firmament', 'Heaven'),
 ('waters', 'under')]

In [107]:
eng_bifreq = nltk.FreqDist(filtered_bigrams)

In [108]:
eng_bifreq.most_common(3)

[(('their', 'father'), 19), (('lived', 'after'), 16), (('seven', 'years'), 15)]

##### Determining Frequent After Words

In [122]:
eng_cfd = nltk.ConditionalFreqDist(filtered_bigrams)

In [126]:
eng_cfd['living'].most_common(2)

[('creature', 7), ('thing', 4)]

##### Generating Frequent Next Word

In [145]:
def generate(cfd, word, n=5):
    n_words = []
    for i in range(n):
        n_words.append(word)
        word = cfd[word].max()
    return n_words

In [146]:
generate(eng_cfd, 'living')

['living', 'creature', 'after', 'their', 'father']
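The generate function repeatedly picks the most frequent follower of the current word. It can fail if it reaches a word that has no recorded follower, because there is then nothing to take a maximum over. A plain-Python sketch of the same idea that stops early in that case; `build_followers` and the toy bigram list are hypothetical and made up for illustration, not taken from the Genesis corpus.

```python
from collections import Counter, defaultdict


def build_followers(bigrams):
    """Map each first word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second in bigrams:
        followers[first][second] += 1
    return followers


def generate(followers, word, n=5):
    """Chain up to n words, always moving to the most frequent follower."""
    chain = []
    for _ in range(n):
        chain.append(word)
        nxt = followers.get(word)  # .get avoids creating empty entries
        if not nxt:
            break  # dead end: this word was never seen as a first word
        word = nxt.most_common(1)[0][0]
    return chain


# Toy data, hand-made for this sketch.
toy_bigrams = [('living', 'creature'), ('living', 'creature'), ('living', 'thing'),
               ('creature', 'after'), ('after', 'their'), ('their', 'father')]
followers = build_followers(toy_bigrams)
print(generate(followers, 'living'))
print(generate(followers, 'father'))
```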

##### Trigrams

In [148]:
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.trigrams(tokens))

[('Python', 'is', 'an'),
 ('is', 'an', 'awesome'),
 ('an', 'awesome', 'language'),
 ('awesome', 'language', '.')]

##### ngrams

In [149]:
list(nltk.ngrams(tokens, 4))

[('Python', 'is', 'an', 'awesome'),
 ('is', 'an', 'awesome', 'language'),
 ('an', 'awesome', 'language', '.')]
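Bigrams and trigrams are special cases of n-grams: every run of n consecutive tokens. A minimal stdlib sketch of that sliding window; the local `ngrams` function is written here for illustration and is not NLTK's implementation.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, which yields len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['Python', 'is', 'an', 'awesome', 'language', '.']
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 4))  # 4-grams
print(ngrams(tokens, 7))  # longer than the input: no n-grams
```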

##### Collocations

In [150]:
tokens = genesis.words('english-kjv.txt')
gen_text = nltk.Text(tokens)
gen_text.collocations()

said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast


##### Stemming

Stemming is the process of stripping affixes from words.

- Two widely used stemmers are the Porter and Lancaster stemmers.
- Each stemmer applies its own rules for stripping affixes.
- The following example stems the word builders using PorterStemmer.

In [152]:
from nltk import PorterStemmer
porter = PorterStemmer()
porter.stem('builders')

'builder'

In [153]:
from nltk import LancasterStemmer
lancaster = LancasterStemmer()
lancaster.stem('builders')

'build'
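The two results differ because each stemmer uses its own suffix rules, and some rule sets strip more than others. The toy sketch below is neither the Porter nor the Lancaster algorithm; `strip_suffix`, `light_rules`, and `heavy_rules` are hypothetical names and rules invented for illustration, chosen only to mimic the two outputs above for this one word.

```python
def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word  # no rule applied


light_rules = ['s']               # conservative: plural 's' only
heavy_rules = ['ers', 'er', 's']  # aggressive: agent suffixes too

print(strip_suffix('builders', light_rules))
print(strip_suffix('builders', heavy_rules))
print(strip_suffix('is', light_rules))  # too short to strip
```

A more aggressive rule set maps more words onto the same stem, which shrinks the vocabulary further.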

In [154]:
len(set(text1))  # text1 comes from: from nltk.book import *

19317

In [156]:
lc_words = [ word.lower() for word in text1] 
len(set(lc_words))

17231

In [157]:
p_stem_words = [porter.stem(word) for word in set(lc_words) ]
len(set(p_stem_words))

10927

In [158]:
l_stem_words = [lancaster.stem(word) for word in set(lc_words) ]
len(set(l_stem_words))

9036

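WordNetLemmatizer reduces a word to its lemma, a valid dictionary form found in WordNet, so it is mainly used to build a vocabulary of valid words. Called without a part of speech, lemmatize treats every word as a noun, which is one reason it merges fewer forms here than the stemmers do.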

In [159]:
wnl = nltk.WordNetLemmatizer()
wnl_stem_words = [wnl.lemmatize(word) for word in set(lc_words) ]
len(set(wnl_stem_words))

15168

##### POS Tags

In [160]:
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
nltk.pos_tag(words)

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]

In [161]:
brown_tagged = brown.tagged_words()

In [162]:
brown_tagged[:5]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL')]

##### DefaultTagger

In [163]:
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(words)

[('Python', 'NN'), ('is', 'NN'), ('awesome', 'NN'), ('.', 'NN')]

##### Defined Tags

In [167]:
words = nltk.word_tokenize(text)
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}

In [174]:
baseline_tagger = nltk.UnigramTagger(model=defined_tags)

In [175]:
baseline_tagger.tag(words)

[('Python', None), ('is', 'BEZ'), ('awesome', None), ('.', None)]
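Words missing from the lookup table are tagged None. A common remedy is to fall back to a default tag; NLTK taggers accept a backoff argument for this, for example a DefaultTagger. The plain-Python sketch below shows only the idea; `lookup_tag` is a hypothetical helper, not an NLTK function.

```python
def lookup_tag(words, table, default=None):
    """Tag each word from a lookup table, falling back to a default tag."""
    return [(word, table.get(word, default)) for word in words]


words = ['Python', 'is', 'awesome', '.']
defined_tags = {'is': 'BEZ', 'over': 'IN', 'who': 'WPS'}

# Without a fallback, unknown words get None, as in the output above.
print(lookup_tag(words, defined_tags))

# With 'NN' as the fallback, every word receives some tag.
print(lookup_tag(words, defined_tags, default='NN'))
```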

##### UnigramTagger

In [176]:
brown_tagged_sents = brown.tagged_sents(categories='government')
brown_sents = brown.sents(categories='government')
len(brown_sents)

3032

In [178]:
train_size = int(len(brown_sents)*0.8)
train_size

2425

In [179]:
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

0.7799495586380832

In [180]:
unigram_tagger.tag(brown_sents[3000])

[('The', 'AT'),
 ('first', 'OD'),
 ('step', 'NN'),
 ('is', 'BEZ'),
 ('a', 'AT'),
 ('comprehensive', 'JJ'),
 ('self', None),
 ('study', 'NN'),
 ('made', 'VBN'),
 ('by', 'IN'),
 ('faculty', None),
 (',', ','),
 ('by', 'IN'),
 ('outside', 'IN'),
 ('consultants', 'NNS'),
 (',', ','),
 ('or', 'CC'),
 ('by', 'IN'),
 ('a', 'AT'),
 ('combination', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('two', 'CD'),
 ('.', '.')]
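A unigram tagger assigns each word the tag it carried most often in the training data, and words never seen in training get None, as with 'self' and 'faculty' above. A minimal stdlib sketch of training and scoring such a tagger; `train_unigram`, `accuracy`, and the tiny tagged sentences are hypothetical and hand-made, not drawn from the Brown corpus.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sent in tagged_sents:
        for word, tag in sent:
            total += 1
            correct += (model.get(word) == tag)  # unseen words predict None
    return correct / total if total else 0.0


# Toy training and test data, made up for this sketch.
train = [[('the', 'AT'), ('study', 'NN')],
         [('they', 'PPSS'), ('study', 'VB')],
         [('a', 'AT'), ('study', 'NN')]]
test = [[('the', 'AT'), ('faculty', 'NN'), ('study', 'NN')]]

model = train_unigram(train)
print(model['study'])          # 'NN' wins 2 to 1 over 'VB'
print(accuracy(model, test))   # 'faculty' is unseen, so 2 of 3 tokens match
```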