![alt text](https://miro.medium.com/max/1600/1*p_zgFaUyb66IeyHsi15soA.jpeg)

**NLP** is short for **Natural Language Processing**. As you probably know, computers are not as good at understanding words as they are at handling numbers. This is changing, though, as advances in NLP happen every day. Devices like Apple’s Siri and Amazon’s Alexa, which can (usually) understand us when we ask about the weather, ask for directions, or ask to play a certain genre of music, are all examples of NLP. The spam filter in your email and the spellcheck you’ve used since you learned to type in elementary school are other basic examples of a computer processing language.


As data scientists, we may use NLP for sentiment analysis (classifying text as having a positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we’re given the data or have to scrape it, the text will be in its natural human format of sentences, paragraphs, tweets, etc. Before we can dig into the analysis, we have to do some cleaning to break the text down into a format the computer can easily work with.

# NLTK (Natural Language Toolkit)

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing, written in the Python programming language. It targets primarily English, although NLTK has since been adapted to more than 38 languages.

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning and wrappers for industrial-strength NLP libraries.  NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

## NLP Libraries

**NLTK**: One of the most widely used NLP libraries, often described as the mother of all NLP libraries.

**spaCy**: A highly optimized and accurate library, widely used in deep learning workflows.

**Stanford CoreNLP Python**: A good choice for client-server architectures. CoreNLP itself is written in Java, but Python wrappers, including an interface in NLTK, let you use it from Python.

**TextBlob**: An NLP library that works in both Python 2 and Python 3. It is used for processing textual data and exposes most common operations through a simple API.

**Gensim**: Gensim is a robust open source NLP library for Python. It is highly efficient and scalable.

**Pattern**: A lightweight NLP module, generally used for web mining, crawling, and similar spidering tasks.

**Polyglot**: For massively multilingual applications, Polyglot is the most suitable NLP library. Its features include language identification and entity extraction.

**PyNLPl**: PyNLPl, pronounced 'pineapple', is a Python library that provides parsers for many data formats, such as FoLiA, Giza, Moses, ARPA, Timbl, and CQL.

**Vocabulary**: This library is well suited to getting semantic information about the words in a given text.

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [4]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [5]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\pallaw\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [6]:
nltk.download('genesis')

[nltk_data] Downloading package genesis to
[nltk_data]     C:\Users\pallaw\AppData\Roaming\nltk_data...
[nltk_data]   Package genesis is already up-to-date!


True

In [7]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [8]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
print(whitman)

['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]


## Text Preprocessing

We will talk about the basic steps of text preprocessing. These steps are needed for converting text from human language to a machine-readable format for further processing. We will also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:



*    removing punctuation, accent marks and other diacritics
*    removing white spaces
*    expanding abbreviations
*    removing stop words, sparse terms, and particular words
*    text canonicalization
*    converting all letters to lower or upper case
*    converting numbers into words or removing numbers



### Removing punctuation, accent marks, special symbols and diacritics

In [9]:
# Sample code to remove every match of a regex pattern
import re

def remove_regex(input_text, regex_pattern):
    # re.sub replaces all non-overlapping matches in a single pass,
    # so the matched text is never re-interpreted as a pattern
    return re.sub(regex_pattern, '', input_text)

# Raw string avoids invalid-escape warnings for \w
regex_pattern = r"#\w*"

remove_regex("remove this #hashtag from my given string object", regex_pattern)

'remove this  from my given string object'
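The cell above only strips hashtags, while the heading also mentions punctuation and diacritics. A minimal sketch of those two steps using only the standard library follows. The helper names `strip_accents` and `strip_punctuation` are made up for illustration, and `string.punctuation` covers ASCII punctuation only, so curly quotes and similar symbols would need extra handling.

```python
import string
import unicodedata

def strip_accents(text):
    # NFKD splits accented letters into a base letter plus combining marks;
    # dropping the combining marks leaves the plain base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_punctuation(text):
    # Delete every ASCII punctuation character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "Café déjà vu: naïve, isn't it?!"
cleaned = strip_punctuation(strip_accents(sample))
print(cleaned)
```

Note that removing the apostrophe turns "isn't" into "isnt", so whether to strip punctuation before or after tokenization depends on the task.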

### Remove whitespaces

In [10]:
input_str = " \t a string example\t "
input_str = input_str.strip()
input_str

'a string example'
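`strip()` only trims the ends of a string. If runs of spaces, tabs, or newlines inside the text should also be collapsed, one possible approach is a regex substitution. The function name `normalize_whitespace` is an illustrative choice, not part of any library.

```python
import re

def normalize_whitespace(text):
    # \s+ matches any run of spaces, tabs, or newlines; replace each run
    # with a single space, then trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace(" \t a   string \n example\t "))
```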

### Remove Numbers

In [11]:
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r"\d+", "", input_str)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
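The normalization list also mentions converting numbers into words instead of removing them. A rough sketch for standalone single digits is shown below. The `DIGIT_WORDS` table and `digits_to_words` helper are assumptions made for this example, and multi-digit numbers such as 2017 are deliberately left untouched.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def digits_to_words(text):
    # \b[0-9]\b matches a digit that stands alone as a word;
    # digits inside longer numbers have no word boundary around them
    return re.sub(r"\b[0-9]\b", lambda m: DIGIT_WORDS[int(m.group())], text)

print(digits_to_words("Box A contains 3 red and 5 white balls."))
```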


### Convert Case

In [12]:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


**Tokenization**

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pallaw\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
input_str = "NLTK is a leading platform for building Python programs to work with human language data."

from nltk.tokenize import word_tokenize
tokens = word_tokenize(input_str)
print (tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


### Remove stop words

“Stop words” are the most common words in a language, like “the”, “a”, “on”, “is”, and “all”. These words usually carry little meaning on their own and are often removed from texts before analysis.

In [15]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pallaw\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'is', 'then', "aren't", 'wouldn', 'we', 'ours', 'off', 'such', 'himself', "needn't", 'being', 'you', 'having', 'are', 'most', 'here', 'y', 'she', 'same', 'our', 't', 'ain', 'doing', 'they', 'weren', 'under', 'so', "hasn't", "wouldn't", 'hers', 'after', 'all', 'he', 'the', "hadn't", "that'll", 'these', "wasn't", 'below', "isn't", 'down', 'it', 'or', 'doesn', "you'd", 'above', 'only', 'nor', 'through', 'me', 'yourselves', 'why', 'for', 'as', 'has', 'had', "you'll", 'itself', 'before', 'myself', 'when', 'o', 're', 'have', 'don', "won't", 'her', 'to', 'each', "mustn't", 'those', 'that', 'needn', 'shan', 'was', 'an', 'should', 'i', 'if', 'didn', 'my', 'his', 'both', "should've", 'can', 'd', 'into', 'mustn', 'in', 'own', 'about', 'until', 'yourself', "weren't", "haven't", 'been', 'a', 'am', 'there', "don't", 'yours', 'whom', 'aren', 'from', "couldn't", 'ourselves', 'won', 'just', 'll', 'up', 'which', 'isn', "didn't", 'theirs', 'mightn', 'what', "mightn't", 'again', 'couldn', "shan't", 'have

In [17]:
input_str = "All work and no play makes jack dull boy. Its good to go out and have fun at times."
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'Its', 'good', 'go', 'fun', 'times', '.']
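The output above still contains 'All' and 'Its' because the membership test is case-sensitive and the stop list is all lowercase. A small sketch of a case-insensitive filter follows. To keep it self-contained it uses a tiny hand-picked stop list (`demo_stop_words`, an assumption for illustration) and a pre-tokenized list rather than the NLTK resources.

```python
# Tiny illustrative stop list; in practice this would be the NLTK set
demo_stop_words = {"all", "and", "no", "its", "to", "out", "have", "at"}

tokens = ["All", "work", "and", "no", "play", "makes", "jack", "dull", "boy", ".",
          "Its", "good", "to", "go", "out", "and", "have", "fun", "at", "times", "."]

# Lowercase each token only for the comparison, so capitalized
# stop words are removed while kept tokens retain their original case
filtered = [t for t in tokens if t.lower() not in demo_stop_words]
print(filtered)
```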


In [18]:
# sklearn also provides a list of standard English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'is', 'due', 'hasnt', 'moreover', 'otherwise', 'himself', 'top', 'you', 'become', 'are', 'fifteen', 'former', 'same', 'so', 'after', 'therein', 'it', 'empty', 'third', 'would', 'inc', 'nor', 'through', 'hence', 'thereby', 'except', 'had', 'has', 'elsewhere', 'seems', 'whole', 'have', 'her', 'either', 'next', 'bill', 'interest', 'his', 'whereafter', 'herein', 'whom', 'wherein', 'call', 'nothing', 'amongst', 'twelve', 'un', 'few', 'them', 'must', 'one', 'how', 'might', 'via', 'hereby', 'during', 'became', 'across', 'anything', 'other', 'etc', 'with', 'indeed', 'someone', 'detail', 'herself', 'us', 'already', 'enough', 'out', 'its', 'de', 'once', 'him', 'their', 'often', 'then', 'full', 'such', 'give', 'meanwhile', 'among', 'she', 'hereupon', 'least', 'hers', 'noone', 'the', 'thus', 'nevertheless', 'per', 'however', 'or', 'therefore', 'four', 'above', 'whereupon', 'alone', 'although', 'show', 'why', 'as', 'for', 'itself', 'part', 'whence', 'cannot', 'seemed', 'never', 'each', '

Most of what we are going to do with language relies on first separating out, or tokenizing, words from running text (splitting the text into minimal meaningful units). This task is known as tokenization.

English words are often separated from each other by whitespace, but whitespace is not always sufficient. “New York” and “rock ’n’ roll” are sometimes treated as single words despite the fact that they contain spaces, while sometimes we’ll need to separate “I’m” into the two words I and am.

For processing tweets or texts we’ll need to tokenize emoticons like “:)” or hashtags like #nlproc.
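As a rough illustration of tokenizing tweets, the sketch below uses a single regular expression that tries emoticons, hashtags, and mentions before ordinary words. The pattern and the name `tweet_tokenize` are assumptions for this example. It covers only a handful of emoticon shapes and keeps contractions such as "I'm" as one token rather than splitting them. NLTK also ships a `TweetTokenizer` class in `nltk.tokenize` for this purpose.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    [:;=][-']?[)(DPp]      # simple emoticons such as :) ;-( :D
  | \#\w+                  # hashtags such as #nlproc
  | @\w+                   # user mentions
  | \w+(?:'\w+)?           # words, optionally with an internal apostrophe
  | [^\w\s]                # any other single non-space symbol
""", re.VERBOSE)

def tweet_tokenize(text):
    # Alternatives are tried left to right, so emoticons and hashtags
    # are matched before they can be split into separate symbols
    return TOKEN_PATTERN.findall(text)

print(tweet_tokenize("I'm loving #nlproc :) thanks @nltk!"))
```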

## Stemming

Stemming is the process of reducing a word to its stem, the part to which suffixes and prefixes attach. Unlike a lemma, the stem does not have to be a valid dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

Stemming draws on morphology from linguistics and is used in information retrieval and extraction. When extracting meaningful information from vast sources like big data or the Internet, additional forms of a word related to a subject may need to be searched to get the best results, which is why stemming is also part of query handling in Internet search engines.

Recognizing, searching for, and retrieving more forms of a word returns more results. Once one form of a word is recognized, a search can return results that might otherwise have been missed. That additional recall is why stemming is integral to search queries and information retrieval.


Applications of stemming are:

* Stemming is used in information retrieval systems like search engines.
* It is used to determine domain vocabularies in domain analysis.
* Stemming is desirable because it may reduce redundancy: most of the time, a word stem and its inflected or derived forms mean the same thing.

In [19]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms for Natural languages"
# Stem each token and print one stem per line
for word in word_tokenize(input_str):
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
for
natur
languag


**Errors in Stemming**:
There are mainly two errors in stemming: overstemming and understemming. Overstemming occurs when two words with different stems are reduced to the same root (a false positive). Understemming occurs when two words that should share a root are reduced to different roots (a false negative).
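To make the two error types concrete, here is a deliberately crude suffix-stripping stemmer. It is a toy written for this illustration (the suffix list and `toy_stem` are assumptions, not the Porter algorithm), but it shows both failure modes: unrelated words merged into one stem, and related forms left with different stems.

```python
# Longest suffixes first, so "ity" is tried before "e" or "s"
SUFFIXES = sorted(["ity", "ing", "ed", "es", "s", "e"], key=len, reverse=True)

def toy_stem(word, min_stem_len=3):
    # Strip the first (longest) matching suffix, keeping at least
    # min_stem_len characters of the word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Overstemming: words with different meanings collapse to one stem
print(toy_stem("universe"), toy_stem("university"))

# Understemming: two forms of the same word get different stems
print(toy_stem("alumnus"), toy_stem("alumni"))
```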

In [20]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer.stem("having"))
print(stemmer2.stem("having"))

print(SnowballStemmer("english").stem("generously"))

print(SnowballStemmer("porter").stem("generously"))

have
having
generous
gener


**N-Gram Stemmer**

An n-gram is a sequence of n consecutive characters extracted from a word. Similar words will have a high proportion of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes: `*I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*`, where `*` marks padding at the word boundaries.

Advantage: It is based on string comparisons and it is language independent.

Limitation: It requires space to create and index the n-grams and it is not time efficient.
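A small sketch of the idea follows: it generates padded character n-grams and compares two words by the share of n-grams they have in common. Using the Dice coefficient as the similarity measure is an assumption made here, since the text does not name one, and the function names are illustrative.

```python
def char_ngrams(word, n=2, pad="*"):
    # Pad both ends with n - 1 markers so the first and last letters
    # also start and end an n-gram
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def dice_similarity(a, b, n=2):
    # Dice coefficient over the sets of unique n-grams of each word
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

print(char_ngrams("INTRODUCTIONS"))
print(dice_similarity("statistics", "statistical"))
```

Words whose similarity exceeds a chosen threshold can then be grouped under a common stem, without any language-specific rules.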

### Lemmatizer

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word "walk" is the base form of the word "walking", and hence this is matched in both stemming and lemmatisation.

The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
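The three cases above can be sketched with a toy dictionary lookup keyed by word and part of speech. The table entries, the tag letters ("n", "v", "a"), and the `toy_lemmatize` helper are all illustrative assumptions. A real lemmatizer relies on a full lexicon and a part-of-speech tagger rather than a hand-written table.

```python
# Tiny hand-written lemma table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("better", "a"): "good",      # needs a dictionary; no suffix rule finds it
    ("walking", "v"): "walk",
    ("meeting", "v"): "meet",     # "We are meeting again tomorrow"
    ("meeting", "n"): "meeting",  # "in our last meeting"
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the table has no entry
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better", "a"))
print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
```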

In [21]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pallaw\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [25]:
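from nltk.stem import WordNetLemmatizer

# With no pos argument, lemmatize() treats every word as a noun
lemma = WordNetLemmatizer()
lemma.lemmatize('article')
lemma.lemmatize('leaves')
# In a notebook, only the value of the last expression is displayed
lemma.lemmatize('worst')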

'worst'

**Object Standardization**

Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.

In [23]:
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love'}

def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # Match case-insensitively against the slang dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # Join once, after the loop, so empty input also returns a string
    return " ".join(new_words)

print(lookup_words("RT We are going to CCD @ MG Road!! dm for more info.!!"))

Retweet We are going to CCD @ MG Road!! direct message for more info.!!
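Splitting on whitespace means a token such as "dm," or "luv!" never matches a dictionary key, because the punctuation stays attached. One possible variation, sketched here with a regex substitution, looks up each run of word characters instead. The names `slang_dict` and `replace_slang` are chosen for this example only.

```python
import re

slang_dict = {"rt": "Retweet", "dm": "direct message",
              "awsm": "awesome", "luv": "love"}

def replace_slang(text, lookup=slang_dict):
    # \w+ isolates each word, so trailing punctuation is left in place
    # and does not block the dictionary lookup
    return re.sub(r"\w+",
                  lambda m: lookup.get(m.group().lower(), m.group()),
                  text)

print(replace_slang("RT: that talk was awsm, dm me!"))
```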
