# **The Power of Natural Language Processing**

**Natural Language Processing (NLP)** is about how computers interact with human language. NLP is a branch of artificial intelligence (AI), the science and engineering of making intelligent machines.

[Where does NLP fit with AI?](https://drive.google.com/file/d/1NzSueuixQGYtJJWnKnx70T14kPvwyZNA/view?usp=sharing)

NLP involves techniques, trends and technologies deployed in a range of powerful business cases and applications. The global pandemic is said to have brought the future forward by five years through accelerated AI adoption, investment and the latest NLP language models.

[NLP applications: a range of examples](https://drive.google.com/file/d/1_uBRrJ7ZtRfC4l1q3lrjgTlIO90da93f/view?usp=sharing)

[Everyday NLP use](https://drive.google.com/file/d/1MzPYEa6myLDq_yJ0cavWxeVRkq3G0dZE/view?usp=sharing)

**Aim:** To gain an awareness of NLP applications and, through hands-on experience, explore some powerful NLP techniques used in solutions. We will also identify some of the challenges of NLP.

**Objectives**
*   To use a range of NLP libraries, techniques and datasets for specific use cases and explore data
*   To create a simple chatbot and appreciate its limitations

# **1   Examples of natural language communications**

*   Conversations between people
*   Communicating with a hearing-impaired friend via sign language
*   Learning a foreign language to prepare for travel abroad
*   Using a smartphone to read a menu in another language
*   Reading/writing text messages
*   A blind colleague reading braille or listening to a screen reader describe what’s on a computer screen
*   Receiving a client email in Spanish, translating it and responding in English knowing that your client can easily translate your email back to Spanish

and much more.

NLP is performed on text collections (corpora, the plural of corpus), such as:

*   Tweets
*   Facebook Posts
*   Conversations
*   Movie Reviews
*   Documents, Books and many more

Nuances of meaning make natural language understanding difficult:

*   A text's meaning can be influenced by context and the reader's "world view"
*   Text is highly contextual, ambiguous and irregular

# **2   Natural Language Processing**

**Objectives**

*   Perform NLP tasks fundamental to most case studies
*   Use Libraries - TextBlob, Textatistic and spaCy 
*   Tokenise text into words and sentences
*   Part-of-speech (POS) tagging
*   Sentiment analysis for determining whether text is **positive**, **negative** or **neutral**
*   Detect the language of text and translate between languages
*   Word roots via stemming and lemmatization
*   Spell checking and correction
*   Remove stop words from text
*   Create word cloud visualisations
*   Readability assessment
*   Named entity recognition and similarity detection
*   Review one example domain - a chatbot 
*   Identify a couple of challenges of NLP

## **NLTK (Natural Language Toolkit)**

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, and wrappers for industrial-strength NLP libraries.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

The book *Natural Language Processing with Python* provides a practical introduction to programming for language processing. I highly recommend it to people beginning NLP with Python.

**Downloading and installing NLTK**

1.   Install NLTK: run pip install nltk
2.   Test the installation: run python, then type import nltk

**Installing NLTK Packages**

Import NLTK and run nltk.download(). This opens the NLTK downloader, where you can choose the corpora and models to download. You can also download all packages at once.




# **Installing NLTK Packages**

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
nltk.download('punkt') # first-time use only
nltk.download('wordnet') # first-time use only
nltk.download('brown') # first-time use only
nltk.download('stopwords') # first-time use only

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# **TextBlob Library**

TextBlob is a library built on top of NLTK.
Some NLP tasks TextBlob can do:

*   **Tokenization**—splitting text into pieces called tokens, which are meaningful units, such as words and numbers

*   **Parts-of-speech (POS) tagging**—identifying each word’s part of speech, such as noun, verb, adjective, etc.
*   **Noun phrase extraction**—locating groups of words that represent nouns, such as “red brick factory.”

*   The phrase “red brick factory” illustrates why natural language is such a difficult subject. Is a “red brick factory” a factory that makes red bricks? Is it a red factory that makes bricks of any colour? Is it a factory built of red bricks that makes products of any type? In today’s music world, it could even be the name of a rock band or the name of a game on your smartphone.

*   **Sentiment analysis**—determining whether text has positive, neutral or negative sentiment
*   **Inter-language translation** and language detection powered by Google Translate.
*   **Inflection**—pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob.
*   **Spell checking** and spelling correction.
*   **Stemming**—reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”
*   **Lemmatization**—like stemming, but produces real words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”
*   **Word frequencies**—determining how often each word appears in a corpus.
*   **WordNet integration** for finding word definitions, synonyms and antonyms.
*   **Stop word elimination**—removing common words, such as a, an, the, I, we, you and more to analyze the important words in a corpus.
*   **n-grams**—producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another.









In [None]:
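# Setup
!pip install textblob
!pip install wordcloud
!pip install spacy

# Download the corpora TextBlob needs; importing textblob.download_corpora alone does not run the download
!python -m textblob.download_corpora

from textblob import TextBlob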

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Importing the necessary libraries
import io
import imageio
import random
import string  # to process standard python strings
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings('ignore')


In [None]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
blob = TextBlob(text)
blob


TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

**TextBlob, Sentences and Words Support String Methods and Comparisons**
Sentences, Words and TextBlobs inherit from BaseBlob, which defines many common methods and properties

**Tokenizing Text into Sentences and Words**

Getting lists of sentences and words

In [None]:
blob.sentences  # a list of Sentence objects (not shown: a cell echoes only its last expression)
blob.words      # a WordList of Word objects

WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks', 'like', 'bad', 'weather'])

# **Parts-of-Speech Tagging**

1.   Evaluate words based on context to determine parts of speech, which can help determine meaning
2.   Eight primary English parts of speech

*   Nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections (words that express emotion and that are typically followed by punctuation, like “Yes!” or “Ha!”)
*   Many subcategories
*   Some words have multiple meanings. E.g., “set” and “run” have hundreds of meanings each!
*   TextBlob uses a PatternTagger to determine parts of speech
*   PatternTagger uses the pattern library's POS tagging, with Pattern's **63 parts-of-speech tags**

Tags that appear in the output below:

*   **NN**—a singular noun or mass noun
*   **VBZ**—a third person singular present verb
*   **DT**—a determiner (the, an, that, this, my, their, etc.)
*   **JJ**—an adjective
*   **NNP**—a proper singular noun
*   **IN**—a subordinating conjunction or preposition

In [None]:
blob
blob.tags

[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]

# **Extracting Noun Phrases**

Preparing to purchase a water ski

*   Might search for “best water ski”—“water ski” is a noun phrase

*   For best results, search engine must parse the noun phrase properly
*   Try searching for “best water,” “best ski”, “water ski” and “best water ski” and see what you get


In [None]:
blob
blob.noun_phrases

# A Word can represent a noun phrase with multiple words.

WordList(['beautiful day', 'tomorrow', 'bad weather'])

# **Sentiment Analysis with TextBlob’s Default Sentiment Analyzer**

*   Determines whether text is positive, neutral or negative.
*   One of the most common and valuable NLP tasks 
*   Consider the positive word “good” and the negative word “bad”
*   Alone they are positive and negative, respectively, but...
*   The food is not good — clearly has negative sentiment
*   The movie was not bad — clearly has positive sentiment (but not as positive as The movie was excellent!)
*   Complex machine-learning problem, but libraries like TextBlob can do it for you
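To make the “not good” and “not bad” examples concrete, here is a minimal sketch of a lexicon-based scorer that flips and dampens a word's score when it follows a negation. The word scores, the negation list and the 0.5 damping factor are all made-up values for illustration; this is not how TextBlob is implemented.

```python
# Hypothetical word scores, chosen only for illustration.
LEXICON = {"good": 0.7, "bad": -0.7, "excellent": 1.0}
NEGATIONS = {"not", "never", "no"}


def toy_polarity(text):
    """Average the scores of known words, flipping and damping a score that follows a negation."""
    scores = []
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            score = LEXICON[word]
            # "not good" is negative, but weaker than "bad": hence the 0.5 damping factor
            scores.append(-0.5 * score if negate else score)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


for sentence in ["The food is not good",
                 "The movie was not bad",
                 "The movie was excellent!"]:
    print(sentence, "->", toy_polarity(sentence))
```

With these made-up scores, “not bad” comes out positive but lower than “excellent”, which mirrors the ordering described in the list above.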

# **Getting the Sentiment of a TextBlob**

In [None]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
blob = TextBlob(text)
blob

blob.sentiment



Sentiment(polarity=0.07500000000000007, subjectivity=0.8333333333333333)

*   Polarity is the sentiment — from -1.0 (negative) to 1.0 (positive) with 0.0 being neutral.
*   Subjectivity is a value from 0.0 (objective) to 1.0 (subjective).

## **Getting the polarity and subjectivity from the Sentiment Object**

*   %precision magic specifies the default precision for standalone float objects and float objects in built-in types like lists, dictionaries and tuples:





In [None]:
%precision 3

'%.3f'

In [None]:
blob.sentiment.polarity

0.075

In [None]:
blob.sentiment.subjectivity

0.833

# **Getting the Sentiment of a Sentence**
One is positive (0.85) and one is negative (-0.6999999999999998), which might explain why the entire TextBlob’s sentiment was close to 0.0 (neutral)

In [None]:
blob
for sentence in blob.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)
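As a quick arithmetic check, the sketch below averages the two sentence-level scores copied from the output above. In this example the averages match the whole-text sentiment reported earlier (0.075 and 0.833); treat that as an observation about this particular text rather than a general rule for how TextBlob combines sentences.

```python
from statistics import mean

# Sentence-level scores copied from the output above
sentence_polarities = [0.85, -0.6999999999999998]
sentence_subjectivities = [1.0, 0.6666666666666666]

overall_polarity = mean(sentence_polarities)
overall_subjectivity = mean(sentence_subjectivities)

print(round(overall_polarity, 3), round(overall_subjectivity, 3))
```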


## **Sentiment Analysis with the NaiveBayesAnalyzer**

In [None]:
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
blob
blob.sentiment
for sentence in blob.sentences:
    print(sentence.sentiment)

Sentiment(classification='pos', p_pos=0.8117563121751951, p_neg=0.18824368782480477)
Sentiment(classification='neg', p_pos=0.174363226578349, p_neg=0.8256367734216521)


# **Language Detection and Translation**


*   **Google Translate, Microsoft Bing Translator** and others can translate between scores of languages instantly
*   Now working on near-real-time translation 
*   Converse in real time with people who do not know your natural language


*   Can specify a source language explicitly by passing the from_lang keyword argument to the translate method
*   For example - chinese = blob.translate(from_lang='en', to='zh')
*   from_lang and to use ISO 639-1 language codes - see Google Translate’s list of supported languages

In [None]:
blob


TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

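In [None]:
# detect_language() and translate() send the text to the Google Translate web service,
# so they need network access. Both methods were deprecated in TextBlob 0.16 and are
# not available in recent releases, so the calls in the cells below are left commented out.
# blob.detect_language()

In [None]:
# spanish = blob.translate(to='es')
# spanish

In [None]:
# chinese = blob.translate(to='zh')
# chinese

In [None]:
# spanish.translate()

In [None]:
# chinese.translate()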

## **Inflection: Pluralization and Singularization**

*   Inflections are different forms of the same words, such as singular and plural (like “person” and “people”) and different verb tenses (like “run” and “ran”)
*   When you’re calculating word frequencies, you might first want to convert all inflected words to the same form for more accurate word frequencies


In [None]:
from textblob import Word
index = Word('index')
index.pluralize()

In [None]:
cacti = Word('cacti')
cacti.singularize()

In [None]:
animals = TextBlob('dog cat fish bird').words
animals.pluralize()

# **Spell Checking and Correction**

*   For natural language processing tasks, it’s important that the text be free of spelling errors
*   A Word’s spellcheck method returns a list of tuples containing possible correct spellings and confidence values
*   Assume we meant to type “they” but misspelled it as “theyr”
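The idea of candidate spellings with confidence values can be sketched in the style of Peter Norvig's well-known spelling corrector: generate every string one edit away from the input, keep those that are known words, and rank them by how common they are. The word counts below are hypothetical stand-ins for a real frequency dictionary, and `toy_spellcheck` is an illustrative name, not part of TextBlob.

```python
import string

# Hypothetical word counts standing in for a real frequency dictionary.
WORD_COUNTS = {"they": 10, "their": 8, "the": 20, "there": 5}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def toy_spellcheck(word):
    """Return (candidate, confidence) pairs, most likely first."""
    if word in WORD_COUNTS:
        return [(word, 1.0)]
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    total = sum(WORD_COUNTS[w] for w in candidates)
    ranked = sorted(candidates, key=WORD_COUNTS.get, reverse=True)
    return [(w, WORD_COUNTS[w] / total) for w in ranked]


print(toy_spellcheck("theyr"))
```

Here “they” (drop the r) and “their” (replace y with i) are both one edit away from “theyr”, so the ranking depends entirely on the assumed counts.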


In [None]:
from textblob import Word
word = Word('theyr')
word.spellcheck()

In [None]:
word.correct()  # chooses word with the highest confidence value

# **Correction**
*   Word with the highest confidence value might not be the correct word for the given context
*   TextBlobs, Sentences and Words all have a correct method that you can call to correct spelling
*   Calling correct on a Word returns the correctly spelled word that has the highest confidence value


In [None]:
from textblob import TextBlob
sentence = TextBlob('Ths sentense has missplled wrds.')
sentence.correct()

# **Normalization: Stemming and Lemmatization**

*   Stemming removes a prefix or suffix from a word leaving only a stem, which may or may not be a real word
*   Lemmatization is similar, but factors in the word’s part of speech and meaning and results in a real word
*   Both normalize words for analysis
*   Before calculating statistics on words in a body of text, you might convert all words to lowercase so that capitalized and lowercase words are not treated differently.
*   You might want to use a word’s root to represent the word’s many forms. E.g., treat "program" and "programs" as "program"
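To show why a stem may not be a real word, here is a minimal sketch of just the first rule group (step 1a) of the classic Porter stemming algorithm, which handles plural suffixes. A full stemmer applies several more steps, so this is an illustration of the idea rather than a replacement for a library stemmer.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemming algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # varieties -> varieti
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # programs -> program
    return word


for word in ["varieties", "programs", "caresses", "program"]:
    print(word, "->", porter_step_1a(word.lower()))
```

Lowercasing before stemming, as in the loop above, is the normalization step mentioned in the list.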

In [None]:
from textblob import Word
word = Word('varieties')
word.stem()

In [None]:
word.lemmatize()

# **Word Frequencies**


*   Various techniques for detecting similarity between documents rely on word frequencies
*   TextBlob can count word frequencies for you
*   When you read a file with Path’s read_text method, it closes the file immediately after it finishes reading the file

*   Access the word frequencies through the TextBlob’s word_counts dictionary
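For intuition about what a word-count dictionary holds, the following sketch counts words with the standard library's `Counter`. The sample text and the simple regular-expression tokenizer are assumptions for illustration; TextBlob's own tokenization differs in its details.

```python
import re
from collections import Counter

text = "Today is a beautiful day. Tomorrow looks like bad weather. Today is fine."

# Lowercase first so that "Today" and "today" are counted together
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

print(counts["today"])        # look up a single word
print(counts.most_common(3))  # the most frequent words with their counts
```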

In [None]:
# Upload the 'RomeoAndJuliet.txt' file from a source

from google.colab import files
data_to_load = files.upload()

In [None]:
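from pathlib import Path
# Read the file's contents; TextBlob('RomeoAndJuliet.txt') would analyse only the file name
blob = TextBlob(Path('RomeoAndJuliet.txt').read_text())
blob.noun_phrases.count('lady capulet')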

# **Getting Definitions, Synonyms and Antonyms from WordNet**

**WordNet** is an English word database created by Princeton University.
**TextBlob** uses NLTK’s WordNet interface to look up word definitions and to get **synonyms** and **antonyms**.

(We skip this here, but it is easy to do.)

# **Deleting Stop Words**


*   Common words that are often removed before analysis because they do not provide useful information
*   Returned by the NLTK stopwords module’s words function
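One detail worth seeing in isolation is case: stop-word lists are usually lowercase, so a capitalized “The” slips through unless words are lowercased for the comparison. The sketch below uses a tiny hypothetical stop-word set rather than NLTK's full list.

```python
# A tiny hypothetical stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "i", "we", "you"}


def remove_stop_words(words):
    """Keep only the words whose lowercase form is not a stop word."""
    return [word for word in words if word.lower() not in STOP_WORDS]


print(remove_stop_words(["Today", "is", "a", "beautiful", "day"]))
print(remove_stop_words(["The", "day", "Is", "beautiful"]))
```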

[Host of stop words found here: ](https://drive.google.com/file/d/155URmKRv7Ao6uzF9vj4A9oJoaFjKTUvi/view?usp=sharing)



In [None]:
from nltk.corpus import stopwords
stops = stopwords.words('english')
from textblob import TextBlob
blob = TextBlob('Today is a beautiful day.')
[word for word in blob.words if word not in stops]


# **n-grams**
An **n-gram** is a sequence of n text items, such as letters in words or words in a sentence.

**Used to** identify letters or words that frequently appear adjacent to one another, for example in:

*   Predictive text input
*   Speech-to-text
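Producing n-grams needs nothing more than sliding a window of length n over the words. Here is a minimal standard-library sketch; the function name `ngrams` and the whitespace tokenization are illustrative choices.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive items as a tuple."""
    # zip stops at the shortest slice, so only complete windows are produced
    return list(zip(*(words[i:] for i in range(n))))


words = "Today is a beautiful day".split()
trigrams = ngrams(words)
bigrams = ngrams(words, n=2)

print(trigrams)
print(bigrams)
```

A list of k words yields k - n + 1 n-grams, so the five words here give three trigrams and four bigrams.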


In [None]:
from textblob import TextBlob
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
blob = TextBlob(text)
#  TextBlob’s ngrams method produces a list of WordList n-grams of length three by default—known as trigrams

#  Use keyword argument n to produce n-grams of any desired length

In [None]:
blob.ngrams()

In [None]:
blob.ngrams(n=5)

# **Visualizing Word Frequencies with Bar charts  and Word Clouds**

Can enhance your corpus analyses


*   A bar chart quantitatively visualizes the top 20 words in Romeo and Juliet as bars representing each word and its frequency.
*   A word cloud qualitatively visualizes more frequently occurring words in larger fonts and less frequently occurring words in smaller fonts.

**Examples**

[Masked image wordcloud](https://drive.google.com/file/d/1DokgIjWuesvTx9zup7a8crJbDejPRNq0/view?usp=sharing)

[Bar chart word frequency](https://drive.google.com/file/d/19Jt2akrK4BHuIjYIOPV9moSv72t7WHFk/view?usp=sharing)



In [None]:
%matplotlib inline

# Loading the data
from pathlib import Path
from textblob import TextBlob

import re
# Method 1: read the file and split it into sentences with a regular expression
with open(Path('/content/RomeoAndJuliet.txt'), 'r') as file:
    docstr = file.read()
sentences = re.split(r'[.!?]', docstr)
print('All sentences - method 1')
print(2 * "\n")
print('sentences', sentences)
print('length of sentences', len(sentences))

blob = TextBlob(Path('/content/RomeoAndJuliet.txt').read_text())
print('All sentences')
print(2 * "\n")
print(blob.sentences)
print(2 * "\n")
print('All words')
print(blob.words)
print(2 * "\n")
print('All tags')
print(blob.tags)
print(2 * "\n")
print('All noun phrases')
print(2 * "\n")
print(blob.noun_phrases)
print(2 * "\n")

print('All counts')

print('Sentence count is ', len(blob.sentences))

words = blob.word_counts
print('Distinct word count is ', len(words))

print('Tag count ', len(blob.tags))

print('Noun phrase count', len(blob.noun_phrases))

#Load NLTK stop words

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

#Getting the Word Frequencies
# Get word frequency tuples

items = blob.word_counts.items()
print (items)
#Eliminating the Stop Words
#The expression item[0] gets the word from each tuple so we can check whether it’s in stop_words
pattern=['n\'t', 'points', 'th', 'um', 'uh', 'em', 'oh', 'na','ges']



items = [item for item in items if item[0] not in stop_words]
items = [item for item in items if item[0] not in pattern]


#Sorting the Words by Frequency
#Sort the tuples in items in descending order by frequency
#To specify the tuple element to sort by, use the itemgetter function from the Python Standard Library’s operator module


from operator import itemgetter

sorted_items = sorted(items, key=itemgetter(1), reverse=True)




# **Getting the Top 20 Words**

TextBlob tokenization splits all contractions at their apostrophes and counts the total number of apostrophes as one of the “words”.

Romeo and Juliet has many contractions.

If you display sorted_items[0], you’ll see that the apostrophe is the most frequently occurring “word”, with 867 occurrences.

(In some locales this does not happen and element 0 is indeed 'romeo'.)
We ignore element 0.

In [None]:
# Skip element 0 (the apostrophe "word") and keep the next 20 items.
# If element 0 is a real word in your environment, use sorted_items[0:20] instead.
top20 = sorted_items[1:21]

print(top20)




Convert top20 to a DataFrame

In [None]:
import pandas as pd
df = pd.DataFrame(top20, columns=['word', 'count'])  

In [None]:
df

In [None]:
import matplotlib.pyplot as plt

axes = df.plot.bar(x='word', y='count', legend=False)
plt.gcf().tight_layout()

# **Named Entity Recognition with spaCy**

*   NLP can determine what a text is about
*   Named entity recognition attempts to locate and categorize items such as dates, times, quantities, places, people, things, organizations and more

# **Loading the Language Model**

*   spaCy docs recommend the variable name nlp.

# **Creating a spaCy Doc**

*   Use the nlp object to create a spaCy Doc object representing the document to process


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut is not accepted by spaCy 3 and later

In [None]:
document = nlp('In 1994, Tim Berners-Lee founded the ' + 
    'World Wide Web Consortium (W3C), devoted to ' +
    'developing web technologies')

# **Getting the Named Entities**


*   Returns tuple of Spans representing the named entities
*   Spans have many properties
*   Display text (the entity's text) and label_ (the kind of entity)

In [None]:
for entity in document.ents:
    print(f'{entity.text}: {entity.label_}')

# **Similarity Detection with spaCy**

*   Analyzing documents to determine how alike they are
*   Who wrote the works of William Shakespeare? Sir Francis Bacon? Christopher Marlowe? Others?
*   Comparing word frequencies can reveal writing-style similarities
*   We’ll compare Doc objects for Shakespeare’s Romeo and Juliet and Christopher Marlowe's Edward the Second

Loading the Language model - spaCy's medium sized model (~91MB) for better accuracy



In [None]:
# Setup
# Choosing a model size

# spaCy's small model (<40MB) - the baseline model, already installed

# spaCy's medium model (~91MB) for better accuracy
# !python -m spacy download en_core_web_md

# For spaCy's best accuracy, you can load the large model (~788MB)
# !python -m spacy download en_core_web_lg  (not run here due to the bandwidth needed)

# **Creating the spaCy Docs**

*   Create two Doc objects—one for Romeo and Juliet and one for Edward the Second:



In [None]:
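# Upload 'RomeoAndJuliet.txt' and the Edward the Second text file
from pathlib import Path
from google.colab import files
uploaded = files.upload()

# Pass the files' contents to nlp; nlp('RomeoAndJuliet.txt') would only process the file name.
document1 = nlp(Path('RomeoAndJuliet.txt').read_text())
# 'EdwardTheSecond.txt' is an assumed file name; use the name of the file you uploaded.
document2 = nlp(Path('EdwardTheSecond.txt').read_text())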


# **Comparing the Books’ Similarity**

*   Returns a value from 0.0 (not similar) to 1.0 (identical) indicating how similar the documents are

In [None]:
document1.similarity(document2)
# This uses the small baseline model. For better results use one of the larger models (best result is 0.981459724155179)

# **Chatbot Practice**

## **Building a Simple Chatbot from Scratch in Python (using NLTK)**

![Alt text](https://cdn-images-1.medium.com/max/800/1*pPcVfZ7i-gLMabUol3zezA.gif)

The history of chatbots dates back to 1966, when Joseph Weizenbaum created a computer program called ELIZA. It imitated the language of a psychotherapist using only about 200 lines of code. You can still converse with it here: [Eliza](http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm?utm_source=ubisend.com&utm_medium=blog-link&utm_campaign=ubisend).

On similar lines, let's create a very basic chatbot utilising Python's NLTK library. It's a very simple bot with hardly any cognitive skills, but still a good way to get into NLP and get to know about chatbots.



**What is a Chatbot?**

A chatbot is a piece of artificial intelligence-powered software in a device (Siri, Alexa, Google Assistant etc.), application, website or other network that tries to gauge a consumer’s needs and then assists them in performing a particular task, such as a commercial transaction, hotel booking or form submission. Today many companies deploy a chatbot to engage with their users. Some of the ways in which companies are using chatbots are:

*   To deliver flight information
*   To connect customers and their finances
*   As customer support

The possibilities are (almost) limitless.

**Two types of chatbots**: rule based and self learning 

1.   **Rule-based approach** - a bot answers questions based on rules that have been defined for it. The rules can range from very simple to very complex. These bots can handle simple queries but fail to manage complex ones.
2.   **Self-learning bots** use machine learning-based approaches and are generally more capable than rule-based bots. These bots are of two further types: retrieval-based or generative.


# **Read in a Chatbot knowledge base/dataset**
For our example, we will be using the Wikipedia page for chatbots as our corpus. 
Copy the contents from the page and place it in a text file named ‘chatbot.txt’. However, you can use any corpus of your choice.

In [None]:
# It will prompt you to select a file. Click on “Choose Files” then select and upload the file. Wait for the file to be 100% uploaded.

from google.colab import files
uploaded = files.upload()

with open('chatbot.txt', 'r', encoding='utf8', errors='ignore') as f:
    raw = f.read()
raw = raw.lower()  # converts to lowercase

# **Basic text pre-processing includes:**

*   Converting to lowercase (so the algorithm treats the same word consistently)
*   Tokenization
*   Removing the stop words
*   Stemming
*   Lemmatization

# **Tokenisation**

In [None]:
sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences 
word_tokens = nltk.word_tokenize(raw)# converts to list of words

# **Preprocessing**
We shall now define a function called LemTokens which will take as input the tokens and return normalized tokens.

In [None]:
# Preprocessing
lemmer = WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

# **Keyword matching**
Next, we shall define a function for a greeting by the bot i.e if a user’s input is a greeting, the bot shall return a greeting response. 
ELIZA uses a simple keyword matching for greetings. We will utilize the same concept here.

In [None]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
 
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

### **Generating Response**

**Bag of Words**

After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

*   A vocabulary of known words.
*   A measure of the presence of known words.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
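The vector in this example can be reproduced with a few lines of Python. This is a minimal sketch that assumes lowercase matching and whitespace tokenization; `bag_of_words` is an illustrative name.

```python
vocabulary = ["learning", "is", "the", "not", "great"]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


vector = bag_of_words("Learning is great", vocabulary)
print(vector)  # [1, 1, 0, 0, 1]
```

Word order is discarded: “great is Learning” would produce the same vector, which is where the “bag” in the name comes from.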

**TF-IDF Approach**

A problem with the Bag of Words approach is that highly frequent words start to dominate in the document (e.g. larger score), but may not contain as much “informational content”. Also, it will give more weight to longer documents than shorter documents.

One approach is to rescale the frequency of words by how often they appear in all documents so that the scores for frequent words like “the” that are also frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

**Term Frequency** is a scoring of the frequency of the word in the current document.

TF = (Number of times term t appears in a document) / (Number of terms in the document)

**Inverse Document Frequency** is a scoring of how rare the word is across documents.

IDF = 1 + log(N/n), where N is the number of documents and n is the number of documents in which term t appears.
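The two formulas can be tried out on a toy corpus. The sketch below assumes the natural logarithm and uses three made-up documents; note that scikit-learn's TfidfVectorizer, used later in this notebook, applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math

# Three made-up, already tokenized documents
documents = [
    "the chatbot answers the question".split(),
    "the user asks a question".split(),
    "the weather is bad".split(),
]


def tf(term, document):
    """Share of the document's terms that are this term."""
    return document.count(term) / len(document)


def idf(term, documents):
    """1 + log(N / n); assumes the term occurs in at least one document."""
    n = sum(1 for document in documents if term in document)
    return 1 + math.log(len(documents) / n)


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


first = documents[0]
print("the:    ", tf_idf("the", first, documents))
print("chatbot:", tf_idf("chatbot", first, documents))
```

In the first document “the” occurs twice and “chatbot” once, yet “chatbot” receives the higher score because it appears in only one of the three documents.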

**Cosine Similarity**
Tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus

Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
where d1 and d2 are two non-zero vectors.
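Written out in plain Python, the formula looks like this. It is a minimal sketch for short lists of numbers (the function name is illustrative); the notebook itself uses scikit-learn's cosine_similarity on TF-IDF matrices.

```python
import math


def cosine_similarity_sketch(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths (both non-zero)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


same_direction = cosine_similarity_sketch([1, 1, 0, 0, 1], [2, 2, 0, 0, 2])
no_shared_words = cosine_similarity_sketch([1, 1, 0, 0, 1], [0, 0, 1, 1, 0])
one_shared_word = cosine_similarity_sketch([1, 1, 0, 0, 1], [1, 0, 0, 0, 0])

print(same_direction, no_shared_words, one_shared_word)
```

Because the result depends only on direction, a document and a copy of it repeated twice score as identical, while documents with no words in common score 0.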

To generate a response from our bot for input questions, the concept of document similarity will be used. 

We define a function response that compares the user’s utterance with every sentence in the corpus, using TF-IDF vectors and cosine similarity, and returns the most similar sentence.

If the input has nothing in common with any sentence (a similarity of zero), it returns the response: “I am sorry! I don’t understand you”

In [None]:
def response(user_response):
    robo_response=''
    sent_tokens.append(user_response)  # temporarily add the query as the last "document"
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)  # similarity of the query to every sentence
    idx=vals.argsort()[0][-2]  # index of the best match; [-1] is the query itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]  # similarity score of the best match
    if(req_tfidf==0):
        robo_response=robo_response+"I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response+sent_tokens[idx]
        return robo_response

Finally, we will feed the lines that we want our bot to say while starting and ending a conversation depending upon user’s input.

In [None]:
flag=True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while(flag==True):
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_response)!=None):
                print("ROBO: "+greeting(user_response))
            else:
                print("ROBO: ",end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag=False
        print("ROBO: Bye! take care..")

# **Your Questions**

**Question 1**: What is wrong with the chatbot we created? - please answer at [menti.com](https://www.menti.com), code: 85 12 71 5

**Question 2**: Which of the following NLP statements are false? - please answer at [menti.com](https://www.menti.com), code: 85 12 71 5


# **Other NLP Libraries and Tools** 

Additional mostly free and open source NLP libraries and APIs:

*   Gensim—Similarity detection and topic modeling
*   Google Cloud Natural Language API—Cloud-based API for NLP tasks such as named entity recognition, sentiment analysis, parts-of-speech analysis and visualization, determining content categories and more
*   Microsoft Linguistic Analysis API
*   Bing sentiment analysis—Microsoft’s Bing search engine now uses sentiment in its search results
*   PyTorch NLP—Deep learning library for NLP
*   Stanford CoreNLP—A Java NLP library, which also provides a Python wrapper. Includes coreference resolution, which finds all references to the same thing.
*   Apache OpenNLP—Another Java-based NLP library for common tasks, including coreference resolution. Python wrappers are available.
*   PyNLPl (pineapple)—Python NLP library provides a range of NLP capabilities
*   KoNLPy—Korean language NLP
*   Latest BERT language models

**Machine Learning and Deep Learning Natural Language Applications**

*   Answering natural language questions—For example, Pearson Education has a partnership with IBM Watson that uses Watson as a virtual tutor. Students ask Watson natural language questions and get answers.
*   Summarizing documents—analyzing documents and producing short summaries (abstracts) that can, for example, be included with search results and can help you decide what to read.
*   Speech synthesis (text-to-speech), speech recognition (speech-to-text), inter-language text-to-text translation.
*   Collaborative filtering—used to implement recommender systems (“if you liked this movie, you might also like…”).
*   Text classification—e.g., classifying news articles by categories, such as world news, national news, local news, sports, business, entertainment, etc.
*   Topic modeling—finding the topics discussed in documents.
*   Sarcasm detection—often used with sentiment analysis.
*   Closed captioning—automatically adding text captions to video.
*   Speech to sign language and vice versa—to enable a conversation with a hearing-impaired person.
*   Lip reader technology—for people who can’t speak, convert lip movement to text or speech to enable conversation.


# **Natural Language Datasets**

*   Wikipedia—some or all of Wikipedia (https://meta.wikimedia.org/wiki/Datasets)
*   kaggle.com
*   [Big Bad NLP Database - dataset](https://www.kdnuggets.com/2020/02/big-bad-nlp-database.html)
*   IMDB (Internet Movie Database)—various movie and TV datasets are available.
*   Jeopardy! dataset—200,000+ questions from the Jeopardy! TV show. A milestone in AI occurred in 2011 when IBM Watson famously beat two of the world’s best Jeopardy! players.
*   [Natural language processing datasets](https://machinelearningmastery.com/datasets-natural-language-processing/)
*   [Amazon Customer Reviews Dataset](https://registry.opendata.aws/amazon-reviews/): over 160 million product reviews




# **Summary of the session**


*   An introduction to NLP
*   Use of a range of NLP libraries
*   Hands-on practice with a range of NLP tasks - mostly pre-processing for NLP analysis
*   An example NLP use case with specific features, models - Chatbot and its limitations
*   Some further ML, DL and Dataset information on NLP

Thank you!
