# NLP

Natural Language Processing (NLP) is a subfield of computer science, and of artificial intelligence (AI) in particular, concerned with enabling computers to understand and process human language. In practice, the main task of NLP is to program computers to analyze and process large amounts of natural language data.

# Corpus
A corpus is a large, structured set of machine-readable texts produced in a natural communicative setting. The plural is corpora. A corpus can be built from several kinds of sources, such as text that was originally electronic, transcripts of spoken language, and the output of optical character recognition (OCR).

# NLTK
NLTK (Natural Language Toolkit) is a suite of libraries and programs for statistical language processing in Python. It is a widely used NLP library whose packages help programs analyze human language and respond to it appropriately.

## Install nltk

In [None]:
!pip install nltk

In [1]:
import nltk
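# nltk.download()  # opens the interactive downloader
# To download all NLTK data from a shell instead:
#   python -m nltk.downloader all
#   python -m nltk.downloader -d /usr/local/share/nltk_data all  # custom target directory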


In [2]:
nltk.__version__

'3.2.4'

<img src="./images/nltk.png">

Test the installed data

In [3]:
from nltk.corpus import brown
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [4]:
len(brown.words())

1161192

In [5]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## Loading all items from the nltk.book module

In [6]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


# Text Extraction and Preprocessing


## Tokenization
Tokenization is the process of dividing a text into smaller units called tokens, such as words, punctuation marks, or sentences.
<img src="tokenization.JPG">
****************************************************************************


# RegexpTokenizer
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

In [7]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
# \w+ matches one or more word characters (for ASCII text, the same as [a-zA-Z0-9_]+).
filtered_text = tokenizer.tokenize('Hello ALL, This is the first session of NLP and we are  studying NLTK. 123')
filtered_text

['Hello',
 'ALL',
 'This',
 'is',
 'the',
 'first',
 'session',
 'of',
 'NLP',
 'and',
 'we',
 'are',
 'studying',
 'NLTK',
 '123']
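
The two modes described above, matching the tokens or matching the separators, can be illustrated with the standard library alone. This is a minimal sketch using Python's `re` module rather than NLTK; in NLTK itself, passing `gaps=True` to `RegexpTokenizer` tells it that the pattern describes the separators. The sample sentence is shortened for illustration.

```python
import re

sentence = "Hello ALL, This is the first session of NLP. 123"

# Mode 1: the pattern describes the tokens themselves.
by_tokens = re.findall(r"\w+", sentence)

# Mode 2: the pattern describes the separators (gaps) between tokens.
by_gaps = [piece for piece in re.split(r"\s+", sentence) if piece]

print(by_tokens)
print(by_gaps)
```

In the second mode only whitespace counts as a separator, so punctuation stays attached to the neighbouring word (`'ALL,'`, `'NLP.'`).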

********************************************************
word_tokenize() splits a sentence into words and punctuation tokens.

In [8]:
# word tokenize
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
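
To see why a dedicated word tokenizer is useful, it may help to compare plain whitespace splitting with a rough regex approximation. The sketch below uses only the standard library; the regex is an assumption chosen for illustration and is not the algorithm `word_tokenize()` uses internally, although it happens to give the same tokens for this simple sentence.

```python
import re

text = "God is Great! I won a lottery."

# Plain whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = text.split()

# Rough approximation: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
approx_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(approx_tokens)
```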


## TweetTokenizer (regex-based)

In [14]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
tknzr.tokenize(s0)

['This',
 'is',
 'a',
 'cooool',
 '#dummysmiley',
 ':',
 ':-)',
 ':-P',
 '<3',
 'and',
 'some',
 'arrows',
 '<',
 '>',
 '->',
 '<--']

In [15]:
# A TweetTokenizer instance accepts strip_handles and reduce_len parameters.
# strip_handles=True removes Twitter handles (e.g. usernames).
# reduce_len=True replaces runs of three or more repeated characters with exactly three.
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
s6 = '@remy: This is waaaaayyyy too much for you!!!!!!'
print(tknzr.tokenize(s6))

[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
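
The effect of the two options can be approximated with two regular expressions. This is a hedged sketch, not NLTK's implementation: the function names are hypothetical, and the handle pattern is a simplified assumption (an `@` followed by 1 to 15 word characters).

```python
import re

def strip_handles_sketch(text):
    # Simplified assumption: a handle is '@' followed by 1-15 word characters.
    return re.sub(r"(?<!\w)@\w{1,15}", "", text)

def reduce_len_sketch(text):
    # Collapse any character repeated three or more times down to exactly three.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
cleaned = reduce_len_sketch(strip_handles_sketch(tweet))
print(cleaned)
```

Both steps work on the raw string before it is split into tokens, which is why the colon after the removed handle is still present.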


In [16]:
# Setting preserve_case=False (default: True) converts tokens to lowercase. Emoticons are not affected.
tknzr = TweetTokenizer(preserve_case=False)
s9 = "@jrmy: I'm REALLY HAPPYYY about that! NICEEEE :D :P"
tknzr.tokenize(s9)

['@jrmy',
 ':',
 "i'm",
 'really',
 'happyyy',
 'about',
 'that',
 '!',
 'niceeee',
 ':D',
 ':P']
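
The idea of lowercasing everything except emoticons can be sketched as follows. The emoticon pattern here is a deliberately small, hypothetical one written for this example; NLTK's own pattern covers far more cases. The token list is a shortened version of the sentence above.

```python
import re

# Hypothetical, deliberately small emoticon pattern (not NLTK's own):
# eyes, an optional nose, then a mouth.
EMOTICON = re.compile(r"[:;=][-']?[)(DPpd|/]")

def lowercase_except_emoticons(tokens):
    # Keep a token unchanged if it is an emoticon, otherwise lowercase it.
    return [tok if EMOTICON.fullmatch(tok) else tok.lower() for tok in tokens]

tokens = ["@jrmy", ":", "I'm", "REALLY", "HAPPYYY", ":D", ":P"]
lowered = lowercase_except_emoticons(tokens)
print(lowered)
```

Without the emoticon check, `:D` and `:P` would become `:d` and `:p`, which can change or lose their meaning.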

**************************************
sent_tokenize() splits a paragraph or group of sentences into individual sentences.

In [9]:
# sentence tokenize
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))
sent_tokenize("Enchanté, comment allez-vous? Très bien. Merci, et vous?", language="french")  # tokenize a language other than English

['God is Great!', 'I won a lottery.']


['Enchanté, comment allez-vous?', 'Très bien.', 'Merci, et vous?']
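
A naive rule such as "split after `.`, `!` or `?`" works for the first example but breaks on abbreviations, which is one reason `sent_tokenize()` relies on the pre-trained Punkt models instead. The helper below is a hypothetical standard-library sketch of the naive rule, shown only to illustrate its limits.

```python
import re

def naive_sent_split(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sent_split("God is Great! I won a lottery.")
tricky = naive_sent_split("Dr. Smith arrived. He sat down.")
print(simple)
print(tricky)
```

The second call wrongly treats `Dr.` as a sentence of its own.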

## Lowercase conversion and punctuation removal

In [11]:
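import re
# Lowercase the text and replace every non-alphanumeric character with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
words = text.split()  # split on whitespace
print(words)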

['god', 'is', 'great', 'i', 'won', 'a', 'lottery']
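
The same normalization can be wrapped in a small helper to see how it behaves on other inputs. The function name `normalize` is hypothetical; the regular expression is the one used above.

```python
import re

def normalize(text):
    # Lowercase, replace non-alphanumeric characters with spaces, split on whitespace.
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()

plain = normalize("God is Great! I won a lottery.")
contractions = normalize("Don't SHOUT, it's 9am!")
print(plain)
print(contractions)
```

Because apostrophes are also replaced, contractions are split into fragments such as `don` and `t`. The character class is ASCII-only, so accented letters are dropped as well.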


# Stopwords
Stopwords are common words that add little meaning to a sentence, such as "the", "he", and "have" in English. They can usually be removed without changing what the sentence means.

In [12]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Today is a great day. It is even better than yesterday. And yesterday was the best day ever!"
# Use a separate name so the stopwords corpus module is not overwritten
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
# The comparison is case-sensitive, so capitalized words such as 'It' and 'And' are kept
wordsFiltered = [w for w in words if w not in stop_words]
wordsFiltered

['Today',
 'great',
 'day',
 '.',
 'It',
 'even',
 'better',
 'yesterday',
 '.',
 'And',
 'yesterday',
 'best',
 'day',
 'ever',
 '!']
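
In the output above, `'It'` and `'And'` survive because the stopword list is all lowercase, and punctuation tokens are kept too. A possible refinement is to compare in lowercase and keep alphabetic tokens only. The sketch below uses only the standard library: the stopword set is a hand-picked subset for illustration, and the regex tokenization is a stand-in for `word_tokenize()`, both being assumptions made for this example.

```python
import re

text = "Today is a great day. It is even better than yesterday. And yesterday was the best day ever!"

# Hand-picked subset of stopwords, for illustration only (an assumption;
# the full list comes from stopwords.words('english')).
stop_words = {"is", "a", "it", "than", "and", "was", "the"}

# Rough regex tokenization standing in for word_tokenize.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Compare in lowercase and keep alphabetic tokens only.
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)
```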