 # Introduction to Natural Language Processing

![image.png ](attachment:image.png) 

### Learning Objectives
   - Describe what natural language processing (NLP) is all about
   - Describe the history of NLP
   - Differentiate between NLP and Text Analytics
   - Implement various preprocessing tasks
   - Describe the various phases of an NLP project

### Introduction
Before looking at NLP, let's understand what natural language is. In simple terms, it is the language we use to express ourselves and a basic means of communication. More specifically, language is a mutually agreed set of protocols involving words and sounds that we use to communicate with each other.

In this era of digitization and computation, we tend to comprehend language scientifically. This is because we are constantly trying to make inanimate objects understand us. Thus, it has become essential to develop mechanisms by which language can be fed to inanimate objects such as computers. NLP helps us do this.

For example, detecting spam emails is a common application of NLP.

### History of NLP
NLP is an area that overlaps with others. It has emerged from fields such as artificial intelligence, linguistics, formal languages, and compilers. With the advancement of computing technologies and the increased availability of data, the way natural language is being processed has changed. Previously, a traditional rule-based system was used for computations. Today, computations on natural language are being done using machine learning and deep learning techniques.

The major work on machine learning-based NLP started during the 1980s, when developments across various disciplines such as artificial intelligence, linguistics, formal languages, and computation led to the emergence of NLP as an interdisciplinary subject.

## Text Analytics and NLP


![image.png](attachment:image.png)

The art of extracting useful insights from any given text data can be referred to as text analytics. NLP, on the other hand, is not just restricted to text data. Voice (speech) recognition and analysis also come under the domain of NLP. 


## NLP can be broadly categorized into two types:

- `Natural Language Understanding (NLU):` NLU refers to a process by which an inanimate object with computing power is able to comprehend spoken language.

- `Natural Language Generation (NLG):` NLG refers to a process by which an inanimate object with computing power is able to manifest its thoughts in a language that humans are able to understand.


![image.png](attachment:image.png)

##  Basic Text Analytics

In this exercise, we will perform some basic text analytics on the given text data. Follow these steps to implement this exercise:

In [36]:
sentence = 'The quick brown fox jumps over the lazy dog'

In [50]:
sentence.split()[0]+" "+ sentence.split()[-1]

'The dog'

In [37]:
#Check whether the word 'quick' belongs to that text 
'quick' in sentence

True

In [39]:
if "brown" in sentence:
    print (" The brown is present in the sentence")

 The brown is present in the sentence


In [40]:
#Find out the index value of the word 'fox' 
sentence.index('fox')

16

In [42]:
#To find out the zero-based position of the word 'lazy' in the list of words
sentence.split().index('lazy')

7

In [45]:
#For printing the third word of the given text
sentence.split()[2]

'brown'

In [46]:
#To print the third word of the given sentence in reverse order
sentence.split()[2][::-1]

'nworb'

In [51]:
#To concatenate the first and last words of the given sentence
words = sentence.split()
first_word = words[0]
last_word = words[len(words)-1]
concat_word = first_word + last_word 
print(concat_word)

Thedog


In [52]:
words 

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [53]:
#For printing words at even positions
[words[i] for i in range(len(words)) if i%2 == 0]

['The', 'brown', 'jumps', 'the', 'dog']

In [60]:
[words[i] for i in range(len(words)) if i%2 != 0]

['quick', 'fox', 'over', 'lazy']

In [59]:
[(words[i],"Even") if i%2==0 else (words[i],"Odd") for i in range(len(words))]

[('The', 'Even'),
 ('quick', 'Odd'),
 ('brown', 'Even'),
 ('fox', 'Odd'),
 ('jumps', 'Even'),
 ('over', 'Odd'),
 ('the', 'Even'),
 ('lazy', 'Odd'),
 ('dog', 'Even')]

In [64]:
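marks = [22, 33, 45, 56, 67, 78, 82, 99]

# A list comprehension cannot contain elif; chain conditional expressions instead.
# The comparison is on the index i, as in the original attempt. The final "Average"
# label for the remaining indices is an assumed placeholder.
["outstanding" if i >= 6 else "VGood" if i >= 4 else "Good" if i >= 2 else "Average" for i in range(len(marks))]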

In [56]:
#For printing words at odd positions
[words[i] for i in range(len(words)) if i%2 != 0]

['quick', 'fox', 'over', 'lazy']

In [57]:
#To print the last three letters of the text
sentence[-3:]

'dog'

In [67]:
# To print the text in reverse order
sentence[::-1]                                                   

'god yzal eht revo spmuj xof nworb kciuq ehT'

In [68]:
words

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [69]:
[word[::-1] for word in words]

['ehT', 'kciuq', 'nworb', 'xof', 'spmuj', 'revo', 'eht', 'yzal', 'god']

In [15]:
#To print each word of the given text in reverse order, maintaining their sequence

print(' '.join([word[::-1] for word in words]))

ehT kciuq nworb xof spmuj revo eht yzal god


# Various Steps in NLP

### Tokenization

- Tokenization refers to the procedure of splitting a sentence into its constituent words

For example, consider this sentence: "I am reading a book." Here, our task is to extract words/tokens from this sentence. After passing this sentence to a tokenization program, the extracted words/tokens would be "I", "am", "reading", "a", "book", and ".".
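To make the idea concrete without any library, here is a minimal sketch of a tokenizer built on a regular expression. The function name `simple_tokenize` and the pattern are illustrative assumptions; this is not how NLTK's tokenizer is implemented, and it ignores cases such as contractions.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I am reading a book.")
print(tokens)  # ['I', 'am', 'reading', 'a', 'book', '.']
```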

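- unigrams: single tokens
- bigrams: sequences of two consecutive tokens
- trigrams: sequences of three consecutive tokens
- In general, an n-gram is a sequence of n consecutive items from a given text.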
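The n-gram definitions above can be sketched in a few lines of plain Python. The helper name `ngrams` is an assumption made for this illustration (NLTK ships its own n-gram utilities); the sketch simply slides a window of size n over a list of tokens.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    # Slide a window of width n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

If n is larger than the number of tokens, the function returns an empty list.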

## Tokenization of a Simple Sentence

- Requires the NLTK library

In [17]:
!pip install nltk



In [70]:
import nltk
from nltk import word_tokenize

In [71]:
word_tokenize("i am learning Python for NLP")

['i', 'am', 'learning', 'Python', 'for', 'NLP']

The word_tokenize() method is used to split a sentence into words/tokens. It takes the sentence as input and returns a list, which we store in the `words` variable.

In [19]:
words = word_tokenize("I am reading NLP Fundamentals")
print(words)

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\karthick/nltk_data'
    - 'C:\\Users\\karthick\\anaconda3\\nltk_data'
    - 'C:\\Users\\karthick\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\karthick\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\karthick\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


In [73]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\karthick\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [21]:
words = word_tokenize("I am reading NLP Fundamentals")
print(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


In [114]:
paragraph='''

Rajalakshmi Eduverse is an education initiative with a keen vision to impart education which is relevant to the real-world industry and intellectual needs and it is part of Sabari Foundation.

With decades of experience in the field of education, we believe that by harnessing the new age digital teaching methodologies, we can provide the quintessential learning experience.

Rajalakshmi Institutions established in 1997 is the largest private engineering college group in Tamil Nadu in terms of student enrolment. Having an enviable placement record, we have always been keen on imparting the latest and most relevant education to students ensuring industry readiness. The cornerstone of the Rajalakshmi experience has been a learner centric approach to ensure each and every learner has comfortably completed the Learning Objectives of the program they have signed up for.

We at REV are on a mission of leveraging the wealth of experience and knowledge available across our Institutions and inculcating it to the new age digital citizens.

Our contact and digital programs have been designed by industry and academic doyens taking into account of all the leading and global corporate demands.
'''

In [116]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

In [126]:
sent_tokenize(paragraph)

['\n\nRajalakshmi Eduverse is an education initiative with a keen vision to impart education which is relevant to the real-world industry and intellectual needs and it is part of Sabari Foundation.',
 'With decades of experience in the field of education, we believe that by harnessing the new age digital teaching methodologies, we can provide the quintessential learning experience.',
 'Rajalakshmi Institutions established in 1997 is the largest private engineering college group in Tamil Nadu in terms of student enrolment.',
 'Having an enviable placement record, we have always been keen on imparting the latest and most relevant education to students ensuring industry readiness.',
 'The cornerstone of the Rajalakshmi experience has been a learner centric approach to ensure each and every learner has comfortably completed the Learning Objectives of the program they have signed up for.',
 'We at REV are on a mission of leveraging the wealth of experience and knowledge available across our

In [127]:
len(sent_tokenize(paragraph))

7

In [131]:
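# Split the paragraph into sentences, then tokenize each sentence into words.
# Note: the output shown below came from a run in which `sentences` held a different text.
sentences = sent_tokenize(paragraph)
for sent in sentences:
    words = nltk.word_tokenize(sent)
    print(words)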

['I', 'have', 'three', 'visions', 'for', 'India', '.']
['In', '3000', 'years', 'of', 'our', 'history', ',', 'people', 'from', 'all', 'over', 'the', 'world', 'have', 'come', 'and', 'invaded', 'us', ',', 'captured', 'our', 'lands', ',', 'conquered', 'our', 'minds', '.']
['From', 'Alexander', 'onwards', ',', 'the', 'Greeks', ',', 'the', 'Turks', ',', 'the', 'Moguls', ',', 'the', 'Portuguese', ',', 'the', 'British', ',', 'the', 'French', ',', 'the', 'Dutch', ',', 'all', 'of', 'them', 'came', 'and', 'looted', 'us', ',', 'took', 'over', 'what', 'was', 'ours', '.']
['Yet', 'we', 'have', 'not', 'done', 'this', 'to', 'any', 'other', 'nation', '.']
['We', 'have', 'not', 'conquered', 'anyone', '.']
['We', 'have', 'not', 'grabbed', 'their', 'land', ',', 'their', 'culture', ',', 'their', 'history', 'and', 'tried', 'to', 'enforce', 'our', 'way', 'of', 'life', 'on', 'them', '.']
['Why', '?']


## Part-of-Speech (PoS) Tagging

![image.png](attachment:image.png)

In [22]:
import nltk
from nltk import word_tokenize
words = word_tokenize("I am reading NLP Fundamentals")
print(words)
nltk.pos_tag(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - 'C:\\Users\\karthick/nltk_data'
    - 'C:\\Users\\karthick\\anaconda3\\nltk_data'
    - 'C:\\Users\\karthick\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\karthick\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\karthick\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [23]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\karthick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [24]:
import nltk
from nltk import word_tokenize
words = word_tokenize("I am reading NLP Fundamentals")
print(words)
nltk.pos_tag(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


[('I', 'PRP'),
 ('am', 'VBP'),
 ('reading', 'VBG'),
 ('NLP', 'NNP'),
 ('Fundamentals', 'NNS')]

In [132]:
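# Tokenize each sentence and tag every token with its part of speech.
# Note: the output shown below came from a run in which `sentences` held a different text.
sentences = sent_tokenize(paragraph)
for sent in sentences:
    words = nltk.word_tokenize(sent)
    print(words)
    print(nltk.pos_tag(words))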

['I', 'have', 'three', 'visions', 'for', 'India', '.']
[('I', 'PRP'), ('have', 'VBP'), ('three', 'CD'), ('visions', 'NNS'), ('for', 'IN'), ('India', 'NNP'), ('.', '.')]
['In', '3000', 'years', 'of', 'our', 'history', ',', 'people', 'from', 'all', 'over', 'the', 'world', 'have', 'come', 'and', 'invaded', 'us', ',', 'captured', 'our', 'lands', ',', 'conquered', 'our', 'minds', '.']
[('In', 'IN'), ('3000', 'CD'), ('years', 'NNS'), ('of', 'IN'), ('our', 'PRP$'), ('history', 'NN'), (',', ','), ('people', 'NNS'), ('from', 'IN'), ('all', 'DT'), ('over', 'IN'), ('the', 'DT'), ('world', 'NN'), ('have', 'VBP'), ('come', 'VBN'), ('and', 'CC'), ('invaded', 'VBN'), ('us', 'PRP'), (',', ','), ('captured', 'VBD'), ('our', 'PRP$'), ('lands', 'NNS'), (',', ','), ('conquered', 'VBD'), ('our', 'PRP$'), ('minds', 'NNS'), ('.', '.')]
['From', 'Alexander', 'onwards', ',', 'the', 'Greeks', ',', 'the', 'Turks', ',', 'the', 'Moguls', ',', 'the', 'Portuguese', ',', 'the', 'British', ',', 'the', 'French', ',', '

## Stop Word Removal

Stop words are common words that are used mainly to support the construction of sentences. Examples include "a", "am", and "the". Since they occur very frequently and contribute little to the meaning of the sentences they appear in, they are usually removed before analysis.

In [25]:
import nltk
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\karthick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [26]:
# To get the list of stop words provided for the English language,
# pass the language name (in lowercase) to the words() function.
stop_words = stopwords.words('english')
print(stop_words)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

To remove the stop words from a sentence, we first assign a string to the sentence variable and tokenize it into words using the word_tokenize() method.

In [30]:
sentence = "I am learning Python. It is one of the most popular programming languages"
sentence_words = word_tokenize(sentence)
print(sentence_words)

['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'languages']


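To remove the stop words, we loop through each word in the sentence, keep only the words that are not in the stop word list, and then join the remaining words back into a sentence. Note that the comparison is case-sensitive and the list is lowercase, so capitalized words such as "I" and "It" are not removed: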

In [133]:
sentence_no_stops = ' '.join([word for word in sentence_words if word not in stop_words])
print(sentence_no_stops)

I learning Python . It one popular programming languages
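The output above still contains "I" and "It" because the stop word list is lowercase while those tokens are capitalized. A minimal sketch of a case-insensitive filter follows. To keep it self-contained, it uses a small hand-picked stop word set, which is an assumption made for illustration; in the notebook the list would come from `stopwords.words('english')`.

```python
# Tiny hand-picked stop word set, for illustration only (assumption);
# in practice this would be set(stopwords.words('english'))
stop_words = {"i", "am", "it", "is", "of", "the", "most"}

sentence_words = ['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of',
                  'the', 'most', 'popular', 'programming', 'languages']

# Lowercase each token before checking it against the lowercase stop word set
filtered = [word for word in sentence_words if word.lower() not in stop_words]
sentence_no_stops = ' '.join(filtered)
print(sentence_no_stops)  # learning Python . one popular programming languages
```

Using a set rather than a list also makes each membership check faster on long texts.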


# Text Normalization


Some words are spelled, pronounced, or represented differently even though they mean the same thing, for example, Mumbai and Bombay, or US and United States. There are also different forms of words that need to be converted into base forms. For example, words such as "does" and "doing", when converted to their base form, become "do". Text normalization is the process of converting such variations of text into a standard form. There are various ways of normalizing text, such as spelling correction, stemming, and lemmatization.

In [134]:
sentence = "I visited US from UK on 22-10-18"

We want to replace "US" with "United States", "UK" with "United Kingdom", and "18" with "2018". To do so, we make use of the replace() function and store the updated output in the "normalized_sentence" variable

In [135]:
normalized_sentence = sentence.replace("US", "United States").replace("UK", "United Kingdom").replace("-18", "-2018")

In [136]:
# Print the result to check whether the text has been normalized
print(normalized_sentence)

I visited United States from United Kingdom on 22-10-2018
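Plain `replace()` substitutes every occurrence of a substring, so "US" inside a longer word such as "USB" would also be rewritten. A hedged alternative is to use regular expressions with word boundaries. The `normalize` function, the `replacements` mapping, and the assumption that two-digit years belong to the 2000s are all illustrative choices, not part of the original exercise.

```python
import re

# Abbreviations and their expanded forms, taken from the example above
replacements = {"US": "United States", "UK": "United Kingdom"}

def normalize(text):
    # \b anchors the match at word boundaries, so "US" inside "USB" is left alone
    for short, full in replacements.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    # Expand a two-digit year in a dd-mm-yy date, assuming the year is in the 2000s
    text = re.sub(r"\b(\d{2}-\d{2})-(\d{2})\b", r"\1-20\2", text)
    return text

print(normalize("I visited US from UK on 22-10-18 with a USB drive"))
```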


## Spelling Correction

Spelling correction is one of the most important tasks in any NLP project. It can be time-consuming, but without it there is a high chance of losing required information. We make use of the "autocorrect" Python library to correct spellings.

In [137]:
!pip install autocorrect



In [143]:
from autocorrect import Speller
spell=Speller()
spell("Natureal")

'Natural'

In [138]:
# Spelling correction of a word with the older function-style API
import nltk
from nltk import word_tokenize
from autocorrect import spell
# Pass a wrongly spelled word to spell(). This function is deprecated in favor of
# the Speller class used above, as the warning in the output indicates.
spell('Natureal')

autocorrect.spell is deprecated,             use autocorrect.Speller instead


'Natural'

In [None]:
#Exercise

In [145]:
#In order to correct the spelling of a sentence, we first need to tokenize it into words
sentence = word_tokenize("Ntural Luanguage Processin deals with the art of extracting insightes from Natural Languaes")
print(sentence)
from autocorrect import Speller
spell = Speller()
[spell(word) for word in sentence]

['Ntural', 'Luanguage', 'Processin', 'deals', 'with', 'the', 'art', 'of', 'extracting', 'insightes', 'from', 'Natural', 'Languaes']


['Natural',
 'Language',
 'Processing',
 'deals',
 'with',
 'the',
 'art',
 'of',
 'extracting',
 'insights',
 'from',
 'Natural',
 'Languages']

In [96]:
#Now that we have got the tokens, we loop through each token in sentence, correct them, and assign them to new variable
sentence_corrected = ' '.join([spell(word) for word in sentence])
print(sentence_corrected)

Natural Language Processing deals with the art of extracting insights from Natural Languages


In the preceding output, we can see that the wrongly spelled words have been corrected. Corrections of this kind are not always reliable, however. For example, changing "Processin" to either "Procession" or "Processing" requires the same number of edits (one insertion), so a corrector that ignores context may pick "Procession" when "Processing" was intended. To rectify this, we need to use other kinds of spelling correctors that are aware of context.
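The claim that "Processin" is equally far from "Procession" and "Processing" can be checked with a small edit-distance function. The sketch below implements the standard Levenshtein distance; the function name `edit_distance` is an assumption, and nothing here is meant to describe how the autocorrect library works internally.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(edit_distance("Processin", "Processing"))  # 1
print(edit_distance("Processin", "Procession"))  # 1
```

Both candidates are one insertion away, which is why edit distance alone cannot decide between them.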

# Stemming

In languages such as English, words get transformed into various forms when being used in a sentence. For example, the word "product" might get transformed into "production" when referring to the process of making something, or into "products" in plural form. It is necessary to convert these words into their base forms, as they carry the same meaning. Stemming is a process that helps us do so. The following figures illustrate how words get transformed into their base forms:

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

#### Exercise: Stemming
In this exercise, we will pass a few words through the stemming process so that they get converted into their base forms.

In [98]:
import nltk
stemmer = nltk.stem.PorterStemmer()
stemmer.stem("production")

'product'

In [148]:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
stemmer.stem("production")

'product'

In [149]:
stemmer.stem("fairly")  # the stem 'fairli' is not a dictionary word

'fairli'

In [99]:
stemmer.stem("coming")

'come'

In [100]:
stemmer.stem("firing")

'fire'

In [101]:
stemmer.stem("battling")

'battl'

### Lemmatization
- Sometimes, the stemming process leads to inappropriate results.
- For example, in the last exercise, the word "battling" was transformed to "battl", which is not a valid word.
- To overcome these problems with stemming, we make use of lemmatization.
- In this process, an additional check is made by looking through a dictionary to extract the base form of a word.
- However, this additional check slows down the process.
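The difference between the two approaches can be sketched with toy code. Everything below is made up for illustration: the suffix list, the small dictionary, and the function names are assumptions, and neither function reflects the actual Porter or WordNet algorithms. The point is only that a stemmer strips suffixes blindly, while a lemmatizer consults a dictionary.

```python
# Toy suffix list and toy dictionary, both invented for this illustration
SUFFIXES = ("ing", "s")
LEMMA_DICTIONARY = {"battling": "battle", "products": "product", "coming": "come"}

def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICTIONARY.get(word, word)

for word in ("battling", "products", "coming"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer produces non-words such as "battl" and "com", whereas the dictionary lookup returns valid base forms at the cost of needing the dictionary.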

#### Extracting the base word using Lemmatization

In [150]:
#the lemmatization process to produce the proper form of a given word
import nltk
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
#Bring the word to its proper form by using the lemmatize() method of the WordNetLemmatizer class.
lemmatizer.lemmatize('products')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\karthick\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'product'

In [151]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
lemmatizer.lemmatize("fairly")

'fairly'

In [153]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
lemmatizer.lemmatize("history")

'history'

In [103]:
lemmatizer.lemmatize('production')

'production'

In [104]:
lemmatizer.lemmatize('coming')  # treated as a noun by default; lemmatize('coming', pos='v') returns 'come'

'coming'

In [105]:
lemmatizer.lemmatize('battle')

'battle'

### Named Entity Recognition (NER)
- Named entities are usually not present in dictionaries. So, we need to treat them separately. 
- The main objective of this process is to identify the named entities (such as proper nouns) and map them to the categories that are already defined. 
- For example, the categories might include names of persons, places, and so on. 
- To get a better understanding of this process, we'll look at an exercise.

In [154]:
#In this exercise, we will find out the named entities in a sentence.
import nltk
from nltk import word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "We are reading a book published by Packt which is based out of Birmingham."
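tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(sentence)), binary=True)
# Named entities are returned as subtrees; plain (word, tag) tuples are not entities
[subtree for subtree in tree if isinstance(subtree, nltk.Tree)]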

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\karthick\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\karthick\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


[Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]

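In the preceding output, we can see that the code identifies the named entities "Packt" and "Birmingham". Because binary=True was passed, each entity is simply labeled "NE"; "NNP" is the part-of-speech tag (proper noun) of the word, not an entity category. With binary=False, ne_chunk assigns categories such as PERSON, ORGANIZATION, or GPE.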

### Word Sense Disambiguation
- There's a popular saying, "A man is known by the company he keeps". 
- Similarly, a word's meaning depends on its association with other words in the sentence. 
- This means two or more words with the same spelling may have different meanings in different contexts. 
- This often leads to ambiguity. 
- Word sense disambiguation is the process of mapping a word to the correct sense it carries.
- We need to disambiguate words based on the sense they carry so that they can be treated as different entities when being analyzed. 
- The following figure shows an example of how ambiguity arises when the same word is used in different sentences:

![image.png](attachment:image.png)
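One simple way to approach disambiguation is to compare the words around the ambiguous word with a short description (gloss) of each candidate sense, and pick the sense with the largest overlap. This is the idea behind the Lesk algorithm. The sketch below is a simplified version under stated assumptions: the two glosses are hand-written for this illustration, and a real system would read them from a lexical database such as WordNet.

```python
# Hand-written glosses for two senses of "bank" (assumptions for illustration)
SENSES = {
    "financial_institution": "an institution where people keep money and savings",
    "river_side": "sloping land beside a river or a road",
}

def simple_lesk(context_sentence, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(context_sentence.lower().split())

    def overlap(sense):
        return len(context & set(senses[sense].split()))

    return max(senses, key=overlap)

print(simple_lesk("Keep your savings in the bank", SENSES))
print(simple_lesk("It is risky to drive over the banks of the road", SENSES))
```

Because the method only counts shared words, it is sensitive to how the glosses are worded, so its choices should be treated as estimates.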

In [155]:
#In this exercise, we will find the sense of the word "bank" in two different sentences.
import nltk
from nltk.wsd import lesk
from nltk import word_tokenize
sentence1 = "Keep your savings in the bank"
sentence2 = "It's so risky to drive over the banks of the road"
print(lesk(word_tokenize(sentence1), 'bank'))
print(lesk(word_tokenize(sentence2), 'bank'))


In [157]:
#In order to find the sense of the word "bank" in the preceding two sentences,
#we make use of the lesk algorithm provided by the nltk.wsd library.
print(lesk(word_tokenize(sentence1), 'bank')) #Sense carried by the word "bank" in sentence1

Synset('savings_bank.n.02')


Here, savings_bank.n.02 refers to a container for keeping money safely at home.

In [110]:
print(lesk(word_tokenize(sentence2), 'bank')) # Sense carried by the word "bank" in sentence2

Synset('bank.v.07')


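Here, bank.v.07 is a verb sense of "bank" (the ".v." in the synset name marks a verb) rather than the noun sense of a slope beside a road. The gloss of any synset can be inspected with its definition() method, which is a useful check because the Lesk algorithm does not always pick the sense a reader would expect.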

- Thus, with the help of the Lesk algorithm, we can estimate the sense of a word from its context, although the result should be checked because the algorithm is a simple heuristic.

### Sentence Boundary Detection
- Sentence boundary detection is the method of detecting where one sentence ends and where another sentence begins.
- If you are thinking that it is pretty easy, as a full stop (.) denotes the end of any sentence and the beginning of another sentence, then you are wrong. 
- This is because there can be instances wherein abbreviations are separated by full stops. 
- Various analyses need to be performed at a sentence level, so detecting boundaries of sentences is essential. 
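To see why full stops alone are not enough, here is a minimal sketch that splits a made-up text on every "." character. The example text is an assumption chosen to contain abbreviations; it has two sentences, but the naive split produces five fragments.

```python
text = "Dr. Smith arrived at 5 p.m. on Monday. He left early."

# Naive approach: treat every full stop as a sentence boundary
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]
print(naive_sentences)
print(len(naive_sentences))  # 5 fragments, although the text has 2 sentences
```

Tools such as sent_tokenize() handle many of these cases by taking common abbreviations into account.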

In [111]:
#In this exercise, we will extract sentences from a paragraph
import nltk
from nltk.tokenize import sent_tokenize
#We make use of the sent_tokenize() method to detect sentences in a given text
sent_tokenize("We are reading a book. Do you know who is the publisher? It is Packt. Packt is based out of Birmingham.")
# separate out the sentences from given text

['We are reading a book.',
 'Do you know who is the publisher?',
 'It is Packt.',
 'Packt is based out of Birmingham.']

- We have covered all the preprocessing steps that are involved in NLP

# Bag of Words

- The bag-of-words model converts text into a numerical representation (numerical feature vectors) that can be used to train machine learning models.
- Here are the key steps of fitting a bag-of-words model:


    1. Create a vocabulary index of words or tokens from the entire set of documents. The vocabulary index can be built in alphabetical order.
    2. Construct a numerical feature vector for each document that records how frequently each vocabulary word appears in that document. Each feature vector is sparse, because the words in any one document are only a small subset of all the words (the bag of words) present in the entire set of documents.


- The picture below illustrates this concept. Note the following:

   - The words in the header are the unique words across all three documents listed in the first column. Against each document, the number represents the number of occurrences. For example, in the first document, “bird” occurred 5 times, “the” occurred twice, and “about” occurred once.


![image.png](attachment:image.png)
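The two steps can be sketched in plain Python before turning to scikit-learn. The three short documents below are a toy corpus invented for this illustration, and the variable names are assumptions; the sketch builds an alphabetical vocabulary and then one count vector per document.

```python
from collections import Counter

# Toy corpus, invented for illustration
documents = ["the bird sings", "the cat chased the bird", "the dog barks"]

# Step 1: build the vocabulary in alphabetical order
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 2: build one count vector per document, aligned with the vocabulary
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Most entries are zero even in this tiny example, which is why real implementations store the vectors in a sparse format.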

### Creating a bag-of-words model using Python Sklearn

- To construct a bag-of-words model based on the word counts in the respective documents, the CountVectorizer class implemented in scikit-learn is used.

- CountVectorizer (sklearn.feature_extraction.text.CountVectorizer) is used to fit the bag-of-words model.
- The fit_transform method of CountVectorizer takes an array of text data, which can be documents or sentences.

In [None]:
paragraph='''

Rajalakshmi Eduverse is an education initiative with a keen vision to impart education which is relevant to the real-world industry and intellectual needs and it is part of Sabari Foundation.

With decades of experience in the field of education, we believe that by harnessing the new age digital teaching methodologies, we can provide the quintessential learning experience.

Rajalakshmi Institutions established in 1997 is the largest private engineering college group in Tamil Nadu in terms of student enrolment. Having an enviable placement record, we have always been keen on imparting the latest and most relevant education to students ensuring industry readiness. The cornerstone of the Rajalakshmi experience has been a learner centric approach to ensure each and every learner has comfortably completed the Learning Objectives of the program they have signed up for.

We at REV are on a mission of leveraging the wealth of experience and knowledge available across our Institutions and inculcating it to the new age digital citizens.

Our contact and digital programs have been designed by industry and academic doyens taking into account of all the leading and global corporate demands.
'''

In [162]:
# Cleaning the texts
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per sentence
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for sentence_text in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence_text)  # keep letters only
    review = review.lower().split()
    # Drop stop words, then reduce the remaining words to their stems
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)  # keep at most the 1500 most frequent terms
X = cv.fit_transform(corpus).toarray()

In [159]:
corpus

['rajalakshmi eduvers educ initi keen vision impart educ relev real world industri intellectu need part sabari foundat',
 'decad experi field educ believ har new age digit teach methodolog provid quintessenti learn experi',
 'rajalakshmi institut establish largest privat engin colleg group tamil nadu term student enrol',
 'enviabl placement record alway keen impart latest relev educ student ensur industri readi',
 'cornerston rajalakshmi experi learner centric approach ensur everi learner comfort complet learn object program sign',
 'rev mission leverag wealth experi knowledg avail across institut inculc new age digit citizen',
 'contact digit program design industri academ doyen take account lead global corpor demand']

In [163]:
X 

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
        0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0

In [164]:
import pandas as pd
pd.DataFrame(X)

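|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |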


### Term Frequency-Inverse Document Frequency (TF-IDF)
- Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

- It is a scoring measure widely used in information retrieval (IR) and summarization.
- A term scores highly when it is frequent in a given document but rare across the rest of the corpus.

### Term Frequency (TF)
- Let’s first understand Term Frequency (TF). It is a measure of how frequently a term, t, appears in a document, d:
![image.png](attachment:image.png)

### Inverse Document Frequency (IDF)

- IDF is a measure of how important a term is. We need the IDF value because computing just the TF alone is not sufficient to understand the importance of words:
![image.png](attachment:image.png)
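The two measures can be sketched in plain Python, assuming the commonly used textbook definitions: TF is the number of times a term appears in a document divided by the number of terms in that document, and IDF is the logarithm of the total number of documents divided by the number of documents containing the term. The toy corpus and function names below are assumptions made for illustration.

```python
import math

# Toy corpus of already-tokenized documents, invented for illustration
documents = [["nlp", "is", "fun"], ["nlp", "is", "hard"], ["python", "is", "fun"]]

def term_frequency(term, document):
    # Share of the document's tokens that are this term
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    # log(total documents / documents containing the term);
    # assumes the term occurs in at least one document
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

print(tf_idf("is", documents[0], documents))    # 0.0: "is" appears in every document
print(tf_idf("hard", documents[1], documents))  # higher: "hard" appears in only one document
```

scikit-learn's TfidfVectorizer will not reproduce these exact numbers: by default it uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit L2 norm.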

#### Implementation

In [165]:
# Cleaning the texts
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per sentence
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for sentence_text in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence_text)  # keep letters only
    review = review.lower().split()
    # Drop stop words, then reduce the remaining words to their stems
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
    

In [166]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()

In [167]:
X


array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.36533294, 0.25744718, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.25744718, 0.        , 0.        , 0.        ,
        0.21370327, 0.        , 0.18266647, 0.25744718, 0.        ,
        0.25744718, 0.21370327, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.25744718, 0.        , 0.        ,
        0.25744718, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.18266647, 0.        , 0.25744718, 0.        ,
        0.21370327, 0.        , 0.25744718, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  