
Natural language processing (NLP) is a branch of artificial intelligence (AI) concerned with giving computers the ability to understand text and spoken words in much the same way that humans can.

Python provides a wide range of tools and libraries for NLP tasks. One of the most widely used is the Natural Language Toolkit (NLTK).

NLTK is a set of open source Python modules, together with accompanying data sets, for working with human language data and applying statistical NLP. NLTK requires Python 3; the exact minimum version depends on the NLTK release, so check the installation notes for the version you install.

## THE NLP PROCESS WORKFLOW

Best practice suggests taking the following steps, in the stated order, when performing an NLP operation:
1. Tokenization
2. Stopwords removal
3. Stemming and lemmatization
4. POS tagging
5. Information retrieval

## SETTING UP THE NLTK ENVIRONMENT

In [None]:
!pip install nltk



In [None]:
import nltk

In [None]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
# Test that nltk is successfully installed and imported by loading the Brown corpus

from nltk.corpus import brown
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

## SOME NLTK TOOLS FOR TEXT EXTRACTION AND PRE-PROCESSING

### 1. Tokenization

In NLP, tokenization is the process of breaking text into smaller units called tokens.
The tokens are typically sentences or words.

There are two common kinds:
1. Sentence tokenization
2. Word tokenization

#### Sentence tokenization demo

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
# The sent_tokenize function requires the punkt tokenizer models in order to tokenize the text.
# Here we install them with the NLTK command-line downloader (equivalent to nltk.download('punkt')).
# Newer NLTK releases may ask for the 'punkt_tab' resource instead.

!python3 -m nltk.downloader punkt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# we will assign a string of text to "text" and tokenize the string

text = "This is some random text we are using to demonstrate the implementation of text tokenization. Please follow through"

In [None]:
# calling sent_tokenize function on "text"

print(sent_tokenize(text))

['This is some random text we are using to demonstrate the implementation of text tokenization.', 'Please follow through']


Notice that our "text" string has been split into two sentences, returned as a list of two strings.

#### Word tokenization demo

In [None]:
from nltk.tokenize import word_tokenize

In [54]:
# calling the word_tokenize function on "text"

print(word_tokenize(text))

['This', 'is', 'some', 'random', 'text', 'we', 'are', 'using', 'to', 'demonstrate', 'the', 'implementation', 'of', 'text', 'tokenization', '.', 'Please', 'follow', 'through']


Notice that our "text" string has been split into a list of individual words, and that the full stop is returned as a token of its own.
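To make the idea of tokenization concrete, here is a minimal sketch written with Python's built-in `re` module rather than NLTK. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive (they would mis-split abbreviations such as "Dr."), which is one reason a trained tokenizer such as punkt is usually preferred.

```python
import re

sample = ("This is some random text we are using to demonstrate the "
          "implementation of text tokenization. Please follow through")

def simple_sent_tokenize(s):
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", s.strip())

def simple_word_tokenize(s):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", s)

sample_sentences = simple_sent_tokenize(sample)
sample_words = simple_word_tokenize(sample)

print(sample_sentences)
print(sample_words)
```

For this particular string, the sketch gives the same split as the NLTK functions above. On messier text the two will differ.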

### 2. N-grams

In language, the meaning of a sentence depends on the order of its words.

An n-gram is a contiguous sequence of n items (usually words) taken from a document. N-gram language models build on these sequences by assigning probabilities to sequences of words.

The value of n sets the size of each group. A 1-gram (unigram) split yields each word on its own, a 2-gram (bigram) split yields every pair of neighbouring words, a 3-gram (trigram) split yields every run of three neighbouring words, and so on.

N-grams help retain the context of a word, because a word standing alone can lose the meaning it had alongside its neighbours.
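The splitting described above can be sketched in a few lines of plain Python. The function name `make_ngrams` and the sample sentence are made up for illustration. NLTK also provides `nltk.util.ngrams(tokens, n)`, which yields the same tuples lazily.

```python
def make_ngrams(tokens, n):
    # Slide a window of size n over the token list, one position at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is enough for this small example.
ngram_tokens = "Please follow through the demo".split()

unigrams = make_ngrams(ngram_tokens, 1)
bigrams = make_ngrams(ngram_tokens, 2)
trigrams = make_ngrams(ngram_tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of five tokens yields five unigrams, four bigrams, and three trigrams. Each increase in n keeps more context per item but produces fewer items.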

### 3. Stop words

These are natural language words that carry very little meaning on their own, e.g. "or", "at", "they", and other common prepositions, pronouns, and conjunctions.

Keeping these words takes up storage space and increases processing time.

The NLTK stopwords corpus contains stop word lists for over 16 languages.

In [None]:
# Install the stopwords corpus using the NLTK downloader
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Necessary import statements for the stopwords task

import nltk
from nltk.corpus import stopwords

In [None]:
# The following command loads the English stop words from nltk.corpus as a set
# and displays them
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

We can filter our tokens against this list to remove stop words from our text while performing NLP.
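A minimal sketch of that filtering step is shown below. To keep it self-contained, it uses a small hand-picked stop word set (`demo_stop_words`, an assumption made for illustration). In the notebook you would build the set with `set(stopwords.words('english'))` instead.

```python
# Small illustrative subset of English stop words (assumption); replace with
# set(stopwords.words('english')) to use the full NLTK list.
demo_stop_words = {"this", "is", "some", "we", "are", "to", "the", "of"}

demo_tokens = ["This", "is", "some", "random", "text", "we", "are", "using",
               "to", "demonstrate", "the", "implementation", "of", "text",
               "tokenization", "."]

# Compare in lowercase, because the NLTK stop word lists are all lowercase.
filtered_tokens = [t for t in demo_tokens if t.lower() not in demo_stop_words]

print(filtered_tokens)
```

Punctuation tokens such as "." are not stop words, so they survive this filter and need to be removed separately if they are not wanted.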

### 4. Stemming

Stemming involves reducing words to a stem, or base form, by removing suffixes, e.g. reducing "dancing" and "danced" to a common stem, or reducing "cooking" and "cooked" to "cook". The stem is not always a dictionary word: the Porter stemmer, for instance, reduces "dancing" to "danc".

Stemming algorithms include the Porter stemmer, the Lancaster stemmer, and the Snowball stemmer.

We will use the Porter stemmer to demonstrate stemming in Python here.

In [None]:
# Necessary import statements for stemming task

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
# Assigning PorterStemmer to a variable ps

ps = PorterStemmer()

In [None]:
# We have to tokenize our text before stemming.
# We will continue to use the "text" created earlier

words = word_tokenize(text)

In [None]:
# Now we stem each word in "text" using a for loop

for w in words:
    print(ps.stem(w))

thi
is
some
random
text
we
are
use
to
demonstr
the
implement
of
text
token
.
pleas
follow
through


Notice that the words in the output above have been lowercased and reduced to their stems, and that some stems, such as "thi", "demonstr", and "pleas", are not real words.
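To see why suffix removal can produce stems that are not real words, consider the toy stemmer below. It is only a crude sketch and is not the Porter algorithm, which applies a much larger, ordered set of rules. The name `naive_stem` and the suffix list are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["dancing", "danced", "cooking", "cooked", "This", "is"]:
    print(w, "->", naive_stem(w))
```

Because the rules look only at the spelling of a word, "This" loses its final "s" just as a plural noun would. Rules of this kind are fast, but they have no notion of what the word actually is.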

### 5. Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item, identified by the word's dictionary form (its lemma).

In [None]:
# Required import statements

import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
#Assigning WordNetLemmatizer to the variable lemmatizer

lemmatizer = WordNetLemmatizer()

In [None]:
# Applying lemmatizer to a string "drinks"

print("drinks:", lemmatizer.lemmatize("drinks"))

drinks: drink


Notice that "drinks" has been lemmatized to "drink"

In [None]:
# Applying lemmatizer to another string "corpora"

print("corpora:", lemmatizer.lemmatize("corpora"))

corpora: corpus


Notice that "corpora" has been lemmatized to "corpus"

In [None]:
# Applying lemmatizer to another string "Sunsets"

print("Sunsets:", lemmatizer.lemmatize("Sunsets"))

Sunsets: Sunsets


Notice that "Sunsets" has been left unchanged. This is most likely because the WordNet lookup is case-sensitive, so lowercasing the word before lemmatizing it should help.

### 6. Part of speech (POS) tagging

This is the process of assigning each word in a corpus (a collection of written texts or a body of writing on a particular subject) a part-of-speech tag based on its context and definition.

POS tags are useful in named entity recognition, in lemmatization, and in extracting relationships between words.

In [None]:
# Required import statements

from nltk.tag import DefaultTagger

In [None]:
# Defining a tagger that assigns the tag "NN" (noun) to every token

tagging = DefaultTagger("NN")

In [None]:
# Now, doing the actual tagging

tagging.tag(["hello", "world"])

[('hello', 'NN'), ('world', 'NN')]

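Notice that each word in the list has been tagged NN, because DefaultTagger assigns the same tag to every token regardless of the word. It is mainly useful as a baseline or as a fallback for other taggers.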
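A default tag becomes more useful when it is the fallback for a few simple rules. The sketch below shows the idea in plain Python. The patterns and the function name `rule_tag` are hypothetical, and real taggers rely on context rather than suffixes alone. In NLTK, `RegexpTagger` with `backoff=DefaultTagger("NN")` follows a similar approach.

```python
import re

# Hypothetical suffix rules; the tag names follow the Penn Treebank convention.
TAG_PATTERNS = [
    (re.compile(r"^\d+$"), "CD"),    # cardinal numbers
    (re.compile(r".*ing$"), "VBG"),  # gerunds / present participles
    (re.compile(r".*ed$"), "VBD"),   # simple past
    (re.compile(r".*ly$"), "RB"),    # adverbs
]

def rule_tag(tokens, default="NN"):
    tagged = []
    for token in tokens:
        tag = default  # fall back to the default tag when no rule matches
        for pattern, candidate in TAG_PATTERNS:
            if pattern.match(token):
                tag = candidate
                break
        tagged.append((token, tag))
    return tagged

rule_tagged = rule_tag(["hello", "world", "running", "quickly", "42"])
print(rule_tagged)
```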

### 7. Named entity recognition (NER)

NER seeks to extract real-world entities from text and sort them into pre-defined categories such as people, locations, organizations, dates, etc.

A typical NER output for a sample text could be as follows:

- Person: Pendo Manjele, Manny Loore, Maria Parker, John, etc. (that is, NER would find these names and tag them as Person)

- Location: Lusaka, Kigali, Lagos, etc. (that is, if NER identifies a location, it will tag it as Location and extract the information)

- Organization: AT&T, Chevron, etc. (that is, if there is organization information in a stream of text, NER would extract it and tag it as such)

NER is generally based on grammar rules and supervised models.

In [None]:
# Required import statements

import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
# Initialising a body of text to be used as demo for NER

text_body = "Microsoft Corporation, headquartered in Redmond, Washington, announced a strategic partnership with Tesla, Inc. to develop autonomous vehicles."

In [None]:
# First, we perform word tokenization on text_body

tokenized_text_body = nltk.word_tokenize(text_body)

In [None]:
# Next, we tag each token with its part of speech

tag_sentences = nltk.pos_tag(tokenized_text_body)

In [None]:
# Next, we download the chunker resources and apply chunking to our tagged tokens

nltk.download('maxent_ne_chunker')
nltk.download('words')

ner_chunked_sentences = nltk.ne_chunk(tag_sentences)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [None]:
# Next, we create an empty list we will call "ne"
ne = []

# Iterate through ner_chunked_sentences; named entities are subtrees with a label,
# while ordinary tokens are plain (word, tag) tuples
for tagged_tree in ner_chunked_sentences:
    if hasattr(tagged_tree, 'label'):
        # Join multi-word entities with a space so the words do not run together
        entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
        entity_type = tagged_tree.label()
        ne.append((entity_name, entity_type))

# Print the collected entities once, after the loop has finished
print(ne)


[('Microsoft', 'PERSON'), ('Corporation', 'ORGANIZATION'), ('Redmond', 'GPE'), ('Washington', 'GPE'), ('Tesla', 'PERSON')]


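As can be seen above, most of the entities in the body of text have been detected, but not all of the labels are accurate: "Microsoft" and "Tesla" are tagged PERSON rather than ORGANIZATION, and "Microsoft Corporation" is split into two entities.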
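Once the entities are extracted, sorting them into categories is a simple grouping step. The sketch below takes the (name, type) pairs printed above as its input, copied by hand into a list called `found_entities` (a name chosen for illustration), and groups them by entity type.

```python
from collections import defaultdict

# (name, type) pairs copied from the NER output above.
found_entities = [
    ("Microsoft", "PERSON"), ("Corporation", "ORGANIZATION"),
    ("Redmond", "GPE"), ("Washington", "GPE"), ("Tesla", "PERSON"),
]

# Map each entity type to the list of names tagged with it.
entities_by_type = defaultdict(list)
for name, label in found_entities:
    entities_by_type[label].append(name)

for label, names in entities_by_type.items():
    print(f"{label}: {', '.join(names)}")
```

Grouping the results this way also makes mislabelled entities easier to spot, since each category can be reviewed at a glance.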