# **Natural Language Processing Webinar - Day 5**

Natural Language Processing can be done with a number of libraries, including:

*   NLTK
*   Gensim
*   SpaCy
 
We are going to use NLTK, one of the most basic of these libraries and an easy one to start with.

NLTK stands for Natural Language Toolkit. It is a free, open-source library.

In [None]:
pip install nltk                          # Install NLTK library



In [None]:
import nltk                                # Import the NLTK library

In [None]:
nltk.download()                            # Open the interactive NLTK downloader to browse and fetch packages

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.....

True

In [None]:
nltk.download('gutenberg')                     # Download the Gutenberg corpus

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [None]:
from nltk.corpus import gutenberg              # Import the corpus reader so we can refer to it simply as gutenberg

In [None]:
gutenberg.fileids()                            # Returns the list of .txt files in the corpus

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [None]:
emma = gutenberg.words('austen-emma.txt')      # All tokens (words and punctuation) of Emma

In [None]:
emma

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

With slicing, we can look at the first 100 tokens (words and punctuation marks) of our text file.

In [None]:
emma[0 : 100]

['[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty',
 '-',
 'one',
 'years',
 'in',
 'the',
 'world',
 'with',
 'very',
 'little',
 'to',
 'distress',
 'or',
 'vex',
 'her',
 '.',
 'She',
 'was',
 'the',
 'youngest',
 'of',
 'the',
 'two',
 'daughters',
 'of',
 'a',
 'most',
 'affectionate',
 ',',
 'indulgent',
 'father',
 ';',
 'and',
 'had',
 ',',
 'in',
 'consequence',
 'of',
 'her',
 'sister',
 "'",
 's',
 'marriage',
 ',',
 'been',
 'mistress',
 'of',
 'his',
 'house',
 'from',
 'a',
 'very',
 'early',
 'period',
 '.',
 'Her']

We can also find the number of sentences, words and characters in any text we are looking at.


For this, we have to download the **Punkt Sentence Tokenizer**.

This tokenizer divides a text into a list of sentences.

In [None]:
nltk.download('punkt')               

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
characters = len(gutenberg.raw('austen-emma.txt')) 
words      = len(gutenberg.words('austen-emma.txt'))
sents      = len(gutenberg.sents('austen-emma.txt'))

print('No. of Characters: ', characters)
print('No. of Words: ', words)
print('No. of Sentences: ', sents)

No. of Characters:  887071
No. of Words:  192427
No. of Sentences:  7752
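
These three counts can be combined into rough averages. The sketch below hard-codes the figures printed above instead of reloading the corpus, and the variable names are chosen only for this example. Keep in mind that the character count includes whitespace and the word count includes punctuation tokens, so the averages are approximate.

```python
# Counts copied from the output above (austen-emma.txt)
n_chars = 887071
n_words = 192427
n_sents = 7752

avg_word_length = n_chars / n_words        # characters per token (whitespace included)
avg_sentence_length = n_words / n_sents    # tokens per sentence

print(f"Average characters per word: {avg_word_length:.2f}")
print(f"Average words per sentence: {avg_sentence_length:.2f}")
```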


# **Text Processing**

Data pre-processing converts raw text into a cleaner, more consistent form that is easier for a machine to work with.
Let us look at the methods we discussed in the presentation slides.

## **1) Tokenization**

Tokenization is the process of breaking text up into smaller chunks (tokens), such as words or sentences, as per our requirements. As we saw, this requires the 'punkt' package from NLTK.

### **Word Tokenization**

This method breaks a sentence down into words. It is not just a split on spaces: punctuation marks are also separated out as their own tokens.

To perform tokenization, we use the "tokenize" module from the NLTK library.

In [None]:
from nltk.tokenize import word_tokenize                # Function for word tokenization

In [None]:
sentence = "My name is Rohan Mathur and I like to study natural language processing"

words = word_tokenize(sentence)
type(words)

list

In [None]:
print(words)                                            # Prints out the individual words in a sentence 

['My', 'name', 'is', 'Rohan', 'Mathur', 'and', 'I', 'like', 'to', 'study', 'natural', 'language', 'processing']
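
To see why a tokenizer is more than a whitespace split, we can compare `str.split()` with a small regular-expression tokenizer on a sentence that contains punctuation. The helper `simple_tokenize` and its pattern are written only for this illustration. They are a rough approximation and not the algorithm that `word_tokenize` uses.

```python
import re

text = "Hello, world! NLP is fun."

# Splitting on whitespace leaves punctuation attached to the words
whitespace_tokens = text.split()
print(whitespace_tokens)

# Hypothetical helper: runs of word characters, or single punctuation marks, become tokens
def simple_tokenize(s):
    return re.findall(r"\w+|[^\w\s]", s)

regex_tokens = simple_tokenize(text)
print(regex_tokens)
```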


We can also tokenize a paragraph into its sentences. Let us have a look at the implementation of this as well.

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
para  = "The weather is good today. It is 27 Celsius. I am really thirsty. I want water"

sents = sent_tokenize(para)                       # Separates the paragraph into sentences

print(sents)

['The weather is good today.', 'It is 27 Celsius.', 'I am really thirsty.', 'I want water']
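
A naive alternative is to split wherever whitespace follows a full stop, question mark or exclamation mark. The sketch below (the function name `naive_sent_split` is made up for this example) works on our paragraph but breaks on abbreviations, which is the kind of case a trained sentence tokenizer such as Punkt is designed to handle.

```python
import re

def naive_sent_split(text):
    # Split on whitespace that comes right after '.', '!' or '?'
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple_case = naive_sent_split(
    "The weather is good today. It is 27 Celsius. I am really thirsty. I want water"
)
print(simple_case)

# The abbreviation "Dr." is wrongly treated as the end of a sentence
abbreviation_case = naive_sent_split("Dr. Smith arrived. He sat down.")
print(abbreviation_case)
```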


### **Lowercasing**


Lowercasing all your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems. It helps most when your dataset is not very large, and it significantly improves the consistency of the expected output.

In [None]:
for sent in sents:
    print(sent.lower())

the weather is good today.
it is 27 celsius.
i am really thirsty.
i want water
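
The consistency benefit is easy to see when counting tokens. In this small sketch, with a token list invented for the example, the same word in different cases is counted as several different keys until the tokens are lowercased.

```python
from collections import Counter

tokens = ["The", "the", "THE", "Weather", "weather"]

raw_counts = Counter(tokens)                         # every spelling is a separate key
lower_counts = Counter(t.lower() for t in tokens)    # spellings are merged after lowercasing

print(raw_counts)
print(lower_counts)
```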


### **Punctuation Removal**


Punctuation marks carry little information for many NLP tasks and can interfere when we use our data for further modelling, so they are often removed.

We use the ```RegexpTokenizer``` class from the ```tokenize``` module in NLTK for this.

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
result    = tokenizer.tokenize("The vaccine is out! When is college going to reopen??")

print(result)

['The', 'vaccine', 'is', 'out', 'When', 'is', 'college', 'going', 'to', 'reopen']
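
The pattern `\w+` matches runs of letters, digits and underscores, so everything else (spaces and punctuation) is dropped. The same pattern can be tried with Python's built-in `re` module, as in the sketch below. The second sentence is an extra example added here to show a side effect: an apostrophe splits a contraction into two tokens.

```python
import re

pattern = r"\w+"   # one or more letters, digits or underscores

first = re.findall(pattern, "The vaccine is out! When is college going to reopen??")
print(first)

# The apostrophe is not a word character, so "Don't" becomes 'Don' and 't'
second = re.findall(pattern, "Don't stop, user_name!")
print(second)
```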


### **Stop Words Removal**

Stop words are words that occur very frequently in a language but carry little meaning on their own, e.g. a, an, the, in. They are often removed from the corpus as part of text normalization.

In [None]:
nltk.download('stopwords')                             # Like punkt for tokenizing, the stop word lists must be downloaded before use

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords

Let us have a look at the stop words included for English.

In [None]:
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [None]:
stop_words = stopwords.words('english')                     # A list containing all stop words

In [None]:
para_2 = "Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding."

In [None]:
tokenizer    = RegexpTokenizer(r'\w+')                                             # Pattern that matches words only, so punctuation is dropped
no_puncts    = tokenizer.tokenize(para_2)                                          # Tokenize the paragraph without punctuation
no_stopwords = [word for word in no_puncts if word.lower() not in stop_words]      # Keep only words that are not stop words (the list is lowercase)

no_stopwords

['Natural',
 'language',
 'processing',
 'NLP',
 'branch',
 'artificial',
 'intelligence',
 'helps',
 'computers',
 'understand',
 'interpret',
 'manipulate',
 'human',
 'language',
 'NLP',
 'draws',
 'many',
 'disciplines',
 'including',
 'computer',
 'science',
 'computational',
 'linguistics',
 'pursuit',
 'fill',
 'gap',
 'human',
 'communication',
 'computer',
 'understanding']
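
One detail to watch is letter case: the NLTK stop words shown above are all lowercase, so a capitalised word such as "The" only matches if it is lowercased before the comparison. The sketch below uses a tiny stop list invented for this example (`mini_stop_words`), stored as a set so that membership checks are fast.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer
mini_stop_words = {"the", "is", "on", "a"}

tokens = ["The", "cat", "is", "on", "the", "mat"]

case_sensitive = [t for t in tokens if t not in mini_stop_words]
case_insensitive = [t for t in tokens if t.lower() not in mini_stop_words]

print(case_sensitive)      # 'The' survives because it is capitalised
print(case_insensitive)
```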

### **Stemming**

Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be the same as "walk".

Stemmers find the root by applying hand-written suffix-stripping rules. Lemmatizers, covered later, rely on a vocabulary lookup instead.
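
To make the idea of suffix-stripping rules concrete, here is a deliberately tiny stemmer. It is a toy written for this example (the function `toy_stem` and its three rules are assumptions), not the Porter algorithm, which has many more rules and conditions.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy, not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when a stem of at least three letters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["walks", "walking", "walked", "walk"]]
print(stems)

# Blind rules can also produce strings that are not real words
print(toy_stem("faking"))
```

Even this small version shows the trade-off: the rules are fast and need no dictionary, but they can produce strings that are not real words.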


The three major stemming algorithms in use nowadays:



1.   **Porter:** It is the most commonly used stemmer nowadays. Implementations are available in many languages, including Java. It is often described as the most computationally intensive of the three, and it is also the oldest of them (published by Martin Porter in 1980).

2.   **Snowball:** This is an improvement over Porter, also known as Porter2. It has a slightly faster computation time than Porter and a reasonably large community around it.

3.   **Lancaster:** It is a very aggressive stemming algorithm. With Porter and Snowball, the stemmed representations are usually intuitive to a reader. This is not so with Lancaster, as many shorter words become totally confusing. (*This method was developed at Lancaster University and is named after it.*)


#### **All three are available in the ```nltk``` library**
 





We will use the Porter Stemmer for demonstration purposes in this webinar. You can try the others as well and see how the results pan out.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [None]:
to_be_stemmed = ['Trouble', 'Troubling' , 'Eat' , 'Eating', 
                 'Stroked' , 'Stroke' , 'Fake','Faking']

stemmed_words = [stemmer.stem(i) for i in to_be_stemmed]          # Stem each word in the list

print(stemmed_words)

['troubl', 'troubl', 'eat', 'eat', 'stroke', 'stroke', 'fake', 'fake']


**PorterStemmer** is known for its simplicity. It is commonly used in information retrieval systems.

### **Lemmatization**

Lemmatization is another way of reducing inflected words. It differs from stemming in that it reduces words to their dictionary forms (lemmas), which are real words with actual meaning. Stemming sometimes produces strings that are not words at all, such as 'troubl' above.

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

WordNet is a lexical database for the English language. It was created at Princeton University and is available as one of the NLTK corpora. You can use WordNet alongside NLTK to find the meanings of words, synonyms, antonyms, and more.

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [None]:
to_be_lemmed     = ['rocks' , 'corpora', 'goods']

lemmatized_words = [lemmatizer.lemmatize(i) for i in to_be_lemmed]   # Words are treated as nouns by default (pos='n')

print(lemmatized_words)

['rock', 'corpus', 'good']
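
The idea of a vocabulary-backed lookup can be sketched in plain Python. The mini vocabulary, the exception table and the function `toy_lemmatize` below are all invented for this illustration. This is a simplified picture and not WordNet's actual procedure or data.

```python
# Hypothetical mini vocabulary and exception table, for illustration only
VOCAB = {"rock", "corpus", "good", "bus"}
EXCEPTIONS = {"corpora": "corpus", "geese": "goose"}

def toy_lemmatize(word):
    if word in EXCEPTIONS:                       # irregular forms are looked up directly
        return EXCEPTIONS[word]
    for suffix, replacement in (("ses", "s"), ("s", "")):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCAB:               # accept only candidates that are real words
                return candidate
    return word                                  # unknown words are returned unchanged

lemmas = [toy_lemmatize(w) for w in ["rocks", "corpora", "goods", "buses", "bus"]]
print(lemmas)
```

Because every candidate is checked against the vocabulary, the result is always either a known word or the unchanged input, unlike a stemmer's output.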


### **Counting Words**




Let's implement a simple function that is often used in Natural Language Processing: **Counting Word Frequencies**

In [None]:
from collections import Counter

passage = '''As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His
 left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity
 , hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. 
 Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the 
 more favoured of his guests.'''


def count_words(text):
    """Return a Counter mapping each word in text to its frequency."""

    text = text.lower()                             # Convert to lowercase

    tokenizer = RegexpTokenizer(r'\w+')             # Drop punctuation and split text into tokens (words)
    words = tokenizer.tokenize(text)

    return Counter(words)                           # Counter of { <word>: <count> } pairs

print(count_words(passage))

Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'i': 2, 'of': 2, 'his': 2, 'left': 2, 'shoulder': 2, 'about': 2, 'waiting': 1, 'man': 1, 'came': 1, 'out': 1, 'side': 1, 'room': 1, 'at': 1, 'glance': 1, 'sure': 1, 'must': 1, 'be': 1, 'long': 1, 'john': 1, 'leg': 1, 'cut': 1, 'off': 1, 'close': 1, 'by': 1, 'hip': 1, 'under': 1, 'carried': 1, 'crutch': 1, 'which': 1, 'managed': 1, 'wonderful': 1, 'dexterity': 1, 'hopping': 1, 'upon': 1, 'it': 1, 'like': 1, 'bird': 1, 'very': 1, 'tall': 1, 'strong': 1, 'face': 1, 'big': 1, 'ham': 1, 'plain': 1, 'pale': 1, 'but': 1, 'intelligent': 1, 'smiling': 1, 'indeed': 1, 'seemed': 1, 'in': 1, 'most': 1, 'cheerful': 1, 'spirits': 1, 'whistling': 1, 'moved': 1, 'among': 1, 'tables': 1, 'merry': 1, 'word': 1, 'or': 1, 'slap': 1, 'on': 1, 'for': 1, 'more': 1, 'favoured': 1, 'guests': 1})
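
A `Counter` offers a few conveniences beyond a plain dictionary. The short sketch below uses a sentence invented for this example and shows how to get the most frequent words and how missing words are handled.

```python
from collections import Counter

word_counts = Counter("a man a plan a canal panama a man".split())

top_two = word_counts.most_common(2)     # the two most frequent words with their counts
print(top_two)

print(word_counts["plan"])               # counts are read like dictionary values
print(word_counts["missing"])            # absent words count as 0 instead of raising KeyError
```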
