
**NLP** is short for **Natural Language Processing**. As you probably know, computers are not as good at understanding words as they are at understanding numbers. This is changing, though, as advances in NLP arrive every day. Devices like Apple’s Siri and Amazon’s Alexa can (usually) comprehend us when we ask about the weather, ask for directions, or ask to play a certain genre of music; these are all examples of NLP. The spam filter in your email and the spellcheck you’ve used since you learned to type in elementary school are other basic examples of a computer understanding language.


As data scientists, we may use NLP for sentiment analysis (classifying text as having a positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we’re given the data or have to scrape it, the text will be in its natural human format of sentences, paragraphs, tweets, etc. Before we can dig into the analysis, we have to do some cleaning to break the text down into a format the computer can easily understand.

# NLTK (Natural Language Toolkit)

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing, written in the Python programming language. It was originally built for English, although it has since been adapted to more than 38 languages.

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning and wrappers for industrial-strength NLP libraries.  NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

## NLP Libraries

**NLTK**: One of the most widely used NLP libraries, often described as the mother of all NLP libraries.

**spaCy**: A highly optimized and accurate library, widely used alongside deep learning.

**Stanford CoreNLP Python**: A good choice for client-server architectures. CoreNLP itself is written in Java, but wrappers (including one in NLTK) make it usable from Python.

**TextBlob**: An NLP library that works in both Python 2 and Python 3. It is used for processing textual data and exposes most common operations through a simple API.

**Gensim**: A robust open source NLP library for Python that is highly efficient and scalable.

**Pattern**: A lightweight NLP module, generally used for web mining, crawling, and similar spidering tasks.

**Polyglot**: Best suited to massively multilingual applications. It offers feature extraction such as language identification and named entity recognition.

**PyNLPl**: Pronounced "pineapple", PyNLPl is a Python library that provides parsers for many data formats, such as FoLiA, Giza, Moses, ARPA, Timbl, and CQL.

**Vocabulary**: Best suited to retrieving semantic information about the words in a given text.

In [None]:
pip install nltk



In [None]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.....

True

In [None]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [None]:
nltk.download('genesis')

[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.


True

In [None]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [None]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
print(whitman)

['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]


In [None]:
nltk.download('inaugural')

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.


True

In [None]:
nltk.download('nps_chat')

[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.


True

In [None]:
nltk.download('webtext')

[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.


True

In [None]:
nltk.download('treebank')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [None]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [None]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [None]:
len(text6)

16967

In [None]:
len(set(text6))

2166

In [None]:
sorted(set(text6))

['!',
 '!)',
 '!,',
 '!]',
 '#',
 "'",
 "'!",
 "',",
 "'.",
 "'...",
 "'?",
 '(',
 ',',
 ",'",
 ',--',
 '-',
 '--',
 '--...',
 '.',
 ".'",
 '.)',
 '..',
 '...',
 '...?',
 '...]',
 '1',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '2',
 '20',
 '21',
 '22',
 '23',
 '24',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '?!',
 'A',
 'ALL',
 'AMAZING',
 'ANIMATOR',
 'ARMY',
 'ARTHUR',
 'Aaaaaaaaah',
 'Aaaaaaaah',
 'Aaaaaah',
 'Aaaah',
 'Aaaaugh',
 'Aaagh',
 'Aaah',
 'Aaauggh',
 'Aaaugh',
 'Aaauugh',
 'Aagh',
 'Aah',
 'Aauuggghhh',
 'Aauuugh',
 'Aauuuuugh',
 'Aauuuves',
 'Action',
 'Actually',
 'African',
 'Ages',
 'Aggh',
 'Agh',
 'Ah',
 'Ahh',
 'Alice',
 'All',
 'Allo',
 'Almighty',
 'Alright',
 'Am',
 'Amen',
 'An',
 'Anarcho',
 'And',
 'Angnor',
 'Anthrax',
 'Antioch',
 'Anybody',
 'Anyway',
 'Apples',
 'Aramaic',
 'Are',
 'Arimathea',
 'Armaments',
 'Arthur',
 'As',
 'Ask',
 'Assyria',
 'At',
 'Attila',
 'Augh',
 'Autumn',
 'Auuuuuuuugh',
 'Away',


## Text Preprocessing

We will talk about the basic steps of text preprocessing. These steps convert text from human language into a machine-readable format for further processing. We will also discuss text preprocessing tools.

After a text is obtained, we start with text normalization. Text normalization includes:



*    removing punctuation, accent marks, and other diacritics
*    removing whitespace
*    expanding abbreviations
*    removing stop words, sparse terms, and particular words
*    text canonicalization
*    converting all letters to lower or upper case
*    converting numbers into words or removing numbers



### Removing punctuation, accent marks, special symbols and diacritics

In [None]:
# Sample code to remove every match of a regex pattern
import re

def remove_regex(input_text, regex_pattern):
    # re.sub replaces all non-overlapping matches in a single pass
    return re.sub(regex_pattern, '', input_text)

regex_pattern = r"#\w*"  # a '#' followed by word characters, i.e. a hashtag

remove_regex("remove this #hashtag from my given string object", regex_pattern)

'remove this  from my given string object'

In [None]:
regex_pattern = r"@\w*"  # an '@' followed by word characters, i.e. a mention

remove_regex("remove this @learnbay from my given string object", regex_pattern)


'remove this  from my given string object'
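The cells above strip hashtags and mentions, but not punctuation or accent marks. Here is a minimal sketch of those two steps using only the standard library. The helper names `strip_accents` and `strip_punctuation` are illustrative and are not part of NLTK.

```python
import string
import unicodedata

def strip_accents(text):
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining
    # accent), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_punctuation(text):
    # string.punctuation covers ASCII punctuation only.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_accents("café naïve Zürich"))                 # cafe naive Zurich
print(strip_punctuation("Hello, world! (It's a test.)"))  # Hello world Its a test
```

Note that removing punctuation also removes apostrophes, so it is usually safer to do this after tokenization rather than before.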

### Remove whitespace

In [None]:
input_str = " \t a string example\t "
print (input_str)
input_str = input_str.strip()
input_str

 	 a string example	 


'a string example'

In [None]:
print(input_str.split())

['a', 'string', 'example']
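`strip()` only trims the ends of a string. When runs of spaces, tabs, or newlines inside the text also need collapsing, one common approach is to split and re-join. A small sketch (`collapse_whitespace` is an illustrative name):

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())

print(collapse_whitespace(" \t a   string\n example\t "))  # a string example
```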


### Remove or Replace Numbers

In [None]:
regex_pattern = "[0-9]+"  
input_txt = '1 banana 23 apple 456 oranges'

remove_regex(input_txt, regex_pattern)

' banana  apple  oranges'

In [None]:
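import re
substitution_dict = {3: "three", 5: "five"}
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
# Replace each number found in the dictionary; leave other numbers unchanged
result = re.sub(r"\d+", lambda m: substitution_dict.get(int(m.group()), m.group()), input_str)
print(result)

Box A contains three red and five white balls, while Box B contains 4 red and 2 blue balls.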

In [None]:
pip install num2words

Collecting num2words
  Downloading https://files.pythonhosted.org/packages/eb/a2/ea800689730732e27711c41beed4b2a129b34974435bdc450377ec407738/num2words-0.5.10-py3-none-any.whl (101kB)
Installing collected packages: num2words
Successfully installed num2words-0.5.10


In [None]:
from num2words import num2words
num2words(42)

'forty-two'

In [None]:
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
# Convert only the whitespace-separated tokens that consist entirely of digits
out = ' '.join([num2words(int(i)) if i.isdigit() else i for i in input_str.split()])
print(out)

Box A contains three red and five white balls, while Box B contains four red and two blue balls.


In [None]:
# homework
# replace every number with its word equivalent using regex only
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
# result = re.sub(r"\d+", num2words(), input_str)
input_str

'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
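One possible approach to the exercise is to pass a function as the replacement argument of `re.sub`, so that each match is converted individually. The sketch below is self-contained: `words_under_100` is a hypothetical stand-in that only handles 0 to 99. In the notebook, the callback could instead be `lambda m: num2words(int(m.group()))`.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def words_under_100(n):
    # Hypothetical helper: spells out integers from 0 to 99 only.
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def replace_numbers(text):
    def repl(match):
        n = int(match.group())
        # Numbers outside the helper's range are left as digits.
        return words_under_100(n) if n < 100 else match.group()
    return re.sub(r"\d+", repl, text)

input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
print(replace_numbers(input_str))
```

Unlike the `split()`-based version, the regex also catches numbers that are attached to punctuation, such as "3," or "(5)".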

### Convert Case

In [None]:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


**Tokenization**

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
input_str = "NLTK is a leading platform for building Python programs to work with human language data."

from nltk.tokenize import word_tokenize
tokens = word_tokenize(input_str)
print (tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


### Remove stop words

“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words carry little meaning on their own and are usually removed from texts.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'out', "it's", 'most', 'yourselves', 'same', 'hers', 'is', "haven't", 'o', 'they', 'only', 'do', 'above', 'then', 'myself', 'while', 'any', 'over', 'them', 'but', 'does', "isn't", 'why', 'about', 'has', 'whom', 'being', 'ma', 'too', 'be', 'own', 'that', 'up', 'will', 'at', 're', 'of', "shan't", 'been', 'our', 'hasn', 'wasn', 'theirs', 'those', 'it', 'doing', "you've", 'further', "didn't", "hasn't", 'yourself', 'off', 'll', 'by', 'm', "don't", 'below', 'nor', 'again', 'shouldn', 'both', 'what', 'ours', 'have', "needn't", 'having', 'won', 'after', 'very', 'didn', 'mustn', 'so', 'the', 'down', 'your', 'until', 'my', 'who', "should've", 'yours', 'if', 'where', 'are', 'himself', "mustn't", 'wouldn', 'themselves', 'she', 'me', 'before', "wasn't", 'into', 'i', 'you', 'its', 'from', 'between', 'which', "mightn't", 'how', 'shan', 'not', 'him', 'here', 'should', 'all', "aren't", 'because', 'can', 'a', 'once', 'such', "she's", 'don', 'couldn', 'am', 'than', 'against', 'through', 'few', 'isn', "w

In [None]:
input_str = "All work and no play makes jack a dull boy. Its good to go out and have fun at times."
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'Its', 'good', 'go', 'fun', 'times', '.']
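In the output above, 'All' and 'Its' survive because the stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The `stop_words` set here is a small illustrative subset, so the example runs without downloads; in the notebook it would be `set(stopwords.words('english'))`.

```python
# Illustrative subset of English stop words (an assumption for this sketch).
stop_words = {"all", "and", "no", "a", "its", "to", "out", "have", "at"}

tokens = ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull',
          'boy', '.', 'Its', 'good', 'to', 'go', 'out', 'and', 'have', 'fun',
          'at', 'times', '.']

def remove_stop_words(tokens, stop_words):
    # Compare the lowercase form so capitalised stop words are removed too.
    return [tok for tok in tokens if tok.lower() not in stop_words]

result = remove_stop_words(tokens, stop_words)
print(result)
# ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'good', 'go', 'fun', 'times', '.']
```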


In [None]:
# scikit-learn also provides a list of standard English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'out', 'most', 'five', 'sixty', 'anyone', 'is', 'becomes', 'since', 'above', 'often', 'any', 'besides', 'amoungst', 'bottom', 'though', 'almost', 'neither', 'seem', 'further', 'might', 'by', 'ever', 'after', 'very', 'yet', 'so', 'until', 'interest', 'where', 'himself', 'via', 'therefore', 'you', 'its', 'upon', 'wherein', 'put', 'here', 'whereupon', 'nobody', 'forty', 'beforehand', 'hence', 'sincere', 'than', 'anything', 'eleven', 'on', 'was', 'herself', 'as', 'un', 'with', 'other', 'thereby', 'moreover', 'twenty', 'thru', 'latterly', 'con', 'itself', 'cannot', 'system', 'were', 'due', 'hereupon', 'twelve', 'beyond', 'fire', 'amount', 'something', 'same', 'nine', 'nowhere', 'enough', 'someone', 'couldnt', 'then', 'everything', 'see', 'while', 'them', 'but', 'together', 'hereby', 'has', 'call', 'get', 'take', 'per', 'much', 'nothing', 'hundred', 'around', 'whereas', 'serious', 'sometimes', 'whence', 'else', 'another', 'again', 'everywhere', 'ie', 'cry', 'never', 'have', 'forme



Most of what we do with language relies on first separating out, or tokenizing, words from running text (splitting the text into minimal meaningful units). This task is known as tokenization.

English words are often separated from each other by whitespace, but whitespace is not always sufficient. “New York” and “rock ’n’ roll” are sometimes treated as single large words despite containing spaces, while at other times we need to separate “I’m” into the two words “I” and “am”.

When processing tweets or texts, we also need to tokenize emoticons like “:)” and hashtags like #nlproc.
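To make the point about emoticons and hashtags concrete, here is a rough regex-based tokenizer sketch. The pattern and the name `simple_tweet_tokenize` are assumptions made for illustration, and the pattern covers only a handful of emoticons. NLTK also ships `nltk.tokenize.TweetTokenizer` for this purpose.

```python
import re

TOKEN_PATTERN = re.compile(
    r"[:;=][-']?[()DPp]"   # simple emoticons such as :) ;-) :D
    r"|[#@]\w+"            # hashtags and @mentions
    r"|\w+(?:'\w+)*"       # words, keeping internal apostrophes (I'm)
    r"|[^\w\s]"            # any other single punctuation mark
)

def simple_tweet_tokenize(text):
    # Alternatives are tried left to right, so emoticons and hashtags
    # win over the generic punctuation rule.
    return TOKEN_PATTERN.findall(text)

print(simple_tweet_tokenize("Loving #nlproc :) thanks @nltk, I'm in!"))
# ['Loving', '#nlproc', ':)', 'thanks', '@nltk', ',', "I'm", 'in', '!']
```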

## Stemming

Stemming is the process of reducing a word to its stem, the base to which suffixes and prefixes attach, or to the root form of the word known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

Stemming is studied in linguistic morphology and in artificial intelligence (AI) information retrieval and extraction. It helps extract meaningful information from vast sources like big data or the Internet, since additional forms of a word related to a subject may need to be searched to get the best results. Stemming is also used in queries and Internet search engines.

Recognizing, searching for, and retrieving more forms of a word returns more results. When one form of a word is recognized, the search can return results that might otherwise have been missed. That additional retrieved information is why stemming is integral to search queries and information retrieval.


Applications of stemming are:

* Stemming is used in information retrieval systems like search engines.
* It is used to determine domain vocabularies in domain analysis.
* Stemming is desirable because it can reduce redundancy: most of the time, a word stem and its inflected or derived forms mean the same thing.

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer= PorterStemmer()
input_str="There are several types of stemming algorithms for Natural languages"
input_str=word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
for
natur
languag


**Errors in Stemming**:
There are two main kinds of error in stemming: overstemming and understemming. Overstemming occurs when two words with different stems are reduced to the same root (a false positive). Understemming occurs when two words that share a stem are not reduced to the same root (a false negative).
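Both kinds of error can be illustrated with a deliberately naive suffix stripper. `toy_stem` and its suffix list are invented for this sketch. It is not the Porter or Snowball algorithm, only a stand-in that is small enough to trace by hand.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ity", "ing", "al", "ed", "es", "e", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: three words with different meanings collapse to one stem.
print([toy_stem(w) for w in ["universal", "university", "universe"]])
# ['univers', 'univers', 'univers']

# Understemming: two forms of the same word end up with different stems.
print([toy_stem(w) for w in ["alumnus", "alumni"]])
# ['alumnu', 'alumni']
```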

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer.stem("having"))
print(stemmer2.stem("having"))

print(SnowballStemmer("english").stem("generously"))

print(SnowballStemmer("porter").stem("generously"))

have
having
generous
gener


In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
# input_str="There are several types of stemming algorithms for Natural languages"
# input_str = "We say today our players were not playing with the spirit of the game."
input_str = "English fans were really rude and disrespectful to Italian fans and the city on London. This is so humiliating in a closely contested Euro Cup in Wembley, London"
input_str=word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

english
fan
were
realli
rude
and
disrespect
to
italian
fan
and
the
citi
on
london
.
this
is
so
humili
in
a
close
contest
euro
cup
in
wembley
,
london


**N-Gram Stemmer**

An n-gram is a sequence of n consecutive characters extracted from a word. Similar words share a high proportion of their n-grams, so words can be grouped by n-gram overlap.
Example: ‘INTRODUCTIONS’ for n=2 becomes: \*I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S\*, where \* marks a padding character at each end of the word.

Advantage: It is based on string comparisons and is language independent.

Limitation: It requires space to create and index the n-grams and it is not time efficient.
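The n-gram idea can be sketched in a few lines. `char_ngrams` reproduces the padded bigram example above. Scoring the overlap with the Dice coefficient is an assumption of this sketch, since the text does not say which similarity measure is used.

```python
def char_ngrams(word, n=2, pad="*"):
    # Pad both ends with n-1 padding characters so the word
    # boundaries are captured as n-grams too.
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def dice_similarity(a, b, n=2):
    # Dice coefficient over the sets of unique n-grams:
    # 2 * |shared| / (|grams of a| + |grams of b|)
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

print(char_ngrams("INTRODUCTIONS"))
# ['*I', 'IN', 'NT', 'TR', 'RO', 'OD', 'DU', 'UC', 'CT', 'TI', 'IO', 'ON', 'NS', 'S*']

print(round(dice_similarity("statistics", "statistical"), 3))  # 0.737
```

Words whose similarity exceeds a chosen threshold can then be grouped as variants of the same stem.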

### Lemmatizer

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word "walk" is the base form of the word "walking", so this is matched by both stemming and lemmatisation.

The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
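The three cases above can be mimicked with a toy lookup keyed on the word and its part of speech. `LEMMA_TABLE` and `toy_lemmatize` are invented for this sketch. A real lemmatizer consults a full dictionary, and NLTK's `WordNetLemmatizer.lemmatize` accepts a `pos` argument for the same reason.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
# 'n' = noun, 'v' = verb, 'a' = adjective.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("walking", "v"): "walk",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged, in lowercase.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("meeting", pos="n"))  # meeting
print(toy_lemmatize("meeting", pos="v"))  # meet
```

The part of speech has to come from the context, for example from a POS tagger. A stemmer sees only the isolated word and cannot make this distinction.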

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
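from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
# lemmatize() treats the word as a noun by default (pos='n'), so the verb form 'played' comes back unchanged
lemma.lemmatize('played')
# lemma.lemmatize('meeting')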

'played'

In [None]:
# input_str="There are several types of stemming algorithms for Natural languages"
input_str = "English fans were really rude and disrespectful to Italian fans and the city of London. This is so humiliating in a closely contested Euro Cup in Wembley, London"
input_str=word_tokenize(input_str)
for word in input_str:
    print(lemma.lemmatize(word))

English
fan
were
really
rude
and
disrespectful
to
Italian
fan
and
the
city
of
London
.
This
is
so
humiliating
in
a
closely
contested
Euro
Cup
in
Wembley
,
London


**Object Standardization**

Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.

Examples include acronyms, hashtags with attached words, and colloquial slang. This type of noise can be fixed with the help of regular expressions and manually prepared data dictionaries. The code below uses a dictionary lookup to replace social media slang in a text.

In [None]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love", "@":"at", "info":"information"}
def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # Replace the token when its lowercase form is a known slang term
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # Join once after the loop so that empty input returns '' instead of raising an error
    return " ".join(new_words)

print(lookup_words("RT We are going to CCD @ MG Road!! dm for more info !!"))

Retweet We are going to CCD at MG Road!! direct message for more information !!
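The whitespace-based lookup above misses slang that is attached to punctuation, such as "info!!". A regex with word boundaries is one way around that. The sketch below is an illustration under stated assumptions: `SLANG` repeats the word entries from the lookup dictionary, omits "@" (which is not a word character, so `\b` does not apply to it), and `standardize` is a hypothetical name.

```python
import re

SLANG = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome",
         "luv": "love", "info": "information"}

# \b(...)\b matches whole words only, so "info" inside "information"
# is not touched; re.escape guards against special characters in keys.
SLANG_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in SLANG) + r")\b",
    re.IGNORECASE,
)

def standardize(text):
    return SLANG_PATTERN.sub(lambda m: SLANG[m.group(0).lower()], text)

print(standardize("RT luv this, awsm info!! dm me"))
# Retweet love this, awesome information!! direct message me
```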


In [None]:
{"IEEE 811.11ax": "Wireless protocol for internet router and network cards"}

{'IEEE 811.11ax': 'Wireless protocol for internet router and network cards'}