
In [1]:
import nltk
import string

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Text Normalization

Text normalization converts the variations of a word to a common form when those variations mean the same thing. It covers the cleaning and preprocessing needed to bring text data into a consistent form.

The process involves handling capitalization, removing punctuation and stop words, stemming, and lemmatization.
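
As a rough sketch of how the steps listed above fit together, the snippet below chains lowercasing, punctuation removal, and stop word removal using only the standard library. The `normalize` function and the tiny `STOP_WORDS` set are illustrative assumptions, not part of NLTK, and stemming and lemmatization are left out here.

```python
import string

# Tiny illustrative stop word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "on", "with", "her"}

def normalize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(normalize("Kayla is going on Vacation with her friends!!!!"))
```

The order matters in this sketch: lowercasing first lets the lowercase stop word list match words such as 'The' or 'IS'.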


# Steps for Text Normalization


## Case Normalization

In [3]:
# case normalization
# apply str.lower() to convert the whole text to lowercase

text = 'Mr Kaya, bought a New P-40N Warhawk and Kayla is going on Vacation with her friends!!!!'

text_norm = text.lower()

print(text_norm)

mr kaya, bought a new p-40n warhawk and kayla is going on vacation with her friends!!!!
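
`lower()` is enough for plain English text. For text that may contain other scripts, Python also offers `str.casefold()`, which applies a more aggressive folding intended for caseless comparison. A small sketch, with an example word chosen purely for illustration:

```python
word = "Straße"  # German for "street"; illustrative input

print(word.lower())     # keeps the sharp s
print(word.casefold())  # folds the sharp s to "ss"

# Case-folded strings compare equal even when the spellings differ.
same = word.casefold() == "STRASSE".casefold()
print(same)
```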


## Handling Punctuations

In [4]:
# handling punctuation
# str.maketrans builds a translation table that deletes every punctuation character
# punct = 'Mr Kaya, bought a New P-40N Warhawk!!!!'

punctuation = string.punctuation
table = str.maketrans('', '', punctuation)

print(text.translate(table))

Mr Kaya bought a New P40N Warhawk and Kayla is going on Vacation with her friends
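
Deleting punctuation outright joins the two halves of a hyphenated token, so 'P-40N' becomes 'P40N' in the output above. If that is not wanted, one possible alternative is to replace each punctuation mark with a space and then collapse the extra whitespace. This is a sketch of that idea using the standard `re` module; whether 'P 40N' is better than 'P40N' depends on the task.

```python
import re
import string

text = "Mr Kaya, bought a New P-40N Warhawk!!!!"

# Character class matching any single ASCII punctuation mark.
punct_pattern = re.compile(f"[{re.escape(string.punctuation)}]")

# Replace each mark with a space, then collapse repeated whitespace.
spaced = punct_pattern.sub(" ", text)
cleaned = " ".join(spaced.split())

print(cleaned)
```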


## Stop Word Removal

In [5]:
# stop word removal
# this is the process of removing common words that add little meaning
# to a sentence, such as 'the', 'is', 'and', 'a', etc.
# It is useful for text that contains a lot of stop words:
# it reduces dimensionality by dropping unnecessary words.
# Its downside is that it can lose information,
# for example when a stop word is used to emphasize sentiment.

from nltk.corpus import stopwords

text = 'Mr Kaya bought a New P-40N Warhawk and Kayla is Driving with the Roof OPEN!!!!'
# load the English stop word list (all entries are lowercase)
stop_words = set(stopwords.words('english'))

word_ = text.split()
# keep only the words that are not stop words; compare in lowercase
filter_word = [word for word in word_ if word.lower() not in stop_words]

# filter_word: ['Mr', 'Kaya', 'bought', 'New', 'P-40N', 'Warhawk', 'Kayla', 'Driving', 'Roof', 'OPEN!!!!']

# join the remaining words back into a string
join_word = ' '.join(filter_word)
print(join_word)


Mr Kaya bought New P-40N Warhawk Kayla Driving Roof OPEN!!!!
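
The information loss mentioned in the comments is easiest to see with negation. NLTK's English list includes words such as 'not' and 'no', so removing stop words can flip the apparent sentiment of a sentence. The sketch below uses a tiny hand-written stop word set (an assumption, so that it runs without NLTK) and a hypothetical `keep` parameter to protect selected words.

```python
# Tiny illustrative stop word list (an assumption, not NLTK's list).
STOP_WORDS = {"the", "is", "a", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stop_words(text, keep=frozenset()):
    """Drop stop words, except those listed in `keep`."""
    return [
        word for word in text.lower().split()
        if word not in STOP_WORDS or word in keep
    ]

review = "The movie is not good"
print(remove_stop_words(review))                  # negation is lost
print(remove_stop_words(review, keep=NEGATIONS))  # negation is kept
```

With the NLTK list, the same effect could be had by subtracting the protected words from `stop_words` before filtering.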


## Stemming

Stemming reduces words to a root form (the stem) by stripping affixes, mainly suffixes. Three popular stemming algorithms are Porter, Snowball, and Lancaster.
- Porter algorithm: One of the oldest and most widely used stemmers, published by Martin Porter in 1980. It strips common English suffixes in a fixed sequence of steps, and most rules apply only when the remaining stem is long enough.
  - Step 1a: Reduce plurals, e.g. -sses to -ss and -ies to -i, and drop a final -s (but not -ss).
  - Step 1b: Replace -eed with -ee, and remove -ed or -ing when the remaining stem contains a vowel, then tidy the result (e.g. restore a final -e or undouble a final consonant).
  - Step 1c: Replace a final -y with -i when the stem contains a vowel.
  - Step 2: Map longer suffixes to shorter ones, e.g. -ational to -ate, -ization to -ize, -fulness to -ful.
  - Step 3: Simplify or remove suffixes such as -icate (to -ic), -alize (to -al), -ful, and -ness.
  - Step 4: Remove suffixes such as -al, -ance, -ence, -er, -able, -ment, and -ive when the remaining stem is long enough.
  - Step 5: Remove a final -e, and reduce a final -ll to -l, when the remaining stem is long enough.
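
To make the rule-based nature of the algorithm concrete, here is a minimal sketch of only the plural-handling rules from the original Porter paper (its step 1a). The function name `porter_step_1a` is made up for this example, and this is not the full algorithm. NLTK's `PorterStemmer` adds some extensions by default, so its output may differ for some words.

```python
def porter_step_1a(word):
    """Apply the plural rules of Porter step 1a to a lowercase word."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats", "run"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried from the longest suffix to the shortest, and only the first match is applied. The later steps follow the same pattern.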






In [6]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()

In [7]:
words= ['run', 'running', 'runner', 'runs', 'runners', 'ran']

for word in words:
  print(word + ' --->' + p_stemmer.stem(word))

run --->run
running --->run
runner --->runner
runs --->run
runners --->runner
ran --->ran


In [8]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer= SnowballStemmer(language='english')

for word in words:
  print(word + '---->' + s_stemmer.stem(word))

run---->run
running---->run
runner---->runner
runs---->run
runners---->runner
ran---->ran


## Snowball

The Snowball stemmer for English (also known as Porter2) is an improved version of the Porter algorithm. Like Porter, it works by applying an ordered set of suffix-stripping rules that remove suffixes while preserving the stem of the word. Snowball is also the name of the small language in which these stemmers are written, and it provides stemmers for many languages besides English.

In [9]:
new_words = ['generate', 'generous', 'generation', 'generously']

for new in new_words:
  print(new + '---->' + p_stemmer.stem(new))
  print(new + '---->' + s_stemmer.stem(new))
  print('-------------')

generate---->gener
generate---->generat
-------------
generous---->gener
generous---->generous
-------------
generation---->gener
generation---->generat
-------------
generously---->gener
generously---->generous
-------------


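Snowball is the newer of the two algorithms and builds on the Porter algorithm. In the output above, the first line of each pair is the Porter stem and the second is the Snowball stem: Porter reduces all four words to 'gener', while Snowball keeps the 'generat' and 'generous' families apart.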
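
One way to see the difference is to group the words by the stem each algorithm produced. The sketch below copies the stems from the output above into plain dictionaries, so it runs without NLTK. The helper `group_by_stem` is a made-up name for this illustration.

```python
from collections import defaultdict

# Stems copied from the notebook output above.
porter_stems = {"generate": "gener", "generous": "gener",
                "generation": "gener", "generously": "gener"}
snowball_stems = {"generate": "generat", "generous": "generous",
                  "generation": "generat", "generously": "generous"}

def group_by_stem(stems):
    """Map each stem to the list of words that share it."""
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return dict(groups)

print(group_by_stem(porter_stems))
print(group_by_stem(snowball_stems))
```

Porter yields a single group for all four words, while Snowball yields two groups.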


## Lancaster Algorithm

The Lancaster (Paice/Husk) algorithm is the most aggressive of the three and tends to over-stem. It can cut a word down so far that the stem is no longer a real word, or that unrelated words end up sharing the same stem.

In [10]:
from nltk.stem import LancasterStemmer

ls= LancasterStemmer()

In [11]:
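y = ['is', 'was', 'are','be','been', 'were']

# for each word, the first line printed is the Porter stem
# and the second line is the Lancaster stem
for x in y:
  print(f'{x} has the stemming of {p_stemmer.stem(x)}')
  print(f'{x} has the stemming of {ls.stem(x)}')
  print('-----------------')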

is has the stemming of is
is has the stemming of is
-----------------
was has the stemming of wa
was has the stemming of was
-----------------
are has the stemming of are
are has the stemming of ar
-----------------
be has the stemming of be
be has the stemming of be
-----------------
been has the stemming of been
been has the stemming of been
-----------------
were has the stemming of were
were has the stemming of wer
-----------------
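
When comparing several stemmers, it can help to label each result with the stemmer that produced it. The helper below is a sketch of that idea. `compare_stemmers` is a hypothetical name, and the two toy stemmers are stand-ins so the example runs without NLTK.

```python
def compare_stemmers(words, stemmers):
    """Return one row per word: (word, {stemmer name: stem})."""
    return [
        (word, {name: stem(word) for name, stem in stemmers.items()})
        for word in words
    ]

# Stand-in stemmers so the sketch runs without NLTK (hypothetical rules).
toy_stemmers = {
    "strip_s": lambda w: w[:-1] if w.endswith("s") else w,
    "first_3": lambda w: w[:3],
}

for word, stems in compare_stemmers(["was", "been"], toy_stemmers):
    print(word, stems)
```

In this notebook, the same helper could be called with `{'porter': p_stemmer.stem, 'lancaster': ls.stem}` instead of the toy stemmers.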


In [12]:
word= ['smiling', 'lead', 'marked', 'leader', 'likely', 'city', 'pretty', 'formula', 'drama', 'hero', 'solo', 'smiles']

for w in word:
  print(f'{w} has stem  ----> porter: {p_stemmer.stem(w)}')
  print('-----------')
  print(f'{w} has stem ------> snowball: {s_stemmer.stem(w)}')
  print('--------------')
  print(f'{w} has stem --------> lancaster: {ls.stem(w)}')
  print('-------------')

smiling has stem  ----> porter: smile
-----------
smiling has stem ------> snowball: smile
--------------
smiling has stem --------> lancaster: smil
-------------
lead has stem  ----> porter: lead
-----------
lead has stem ------> snowball: lead
--------------
lead has stem --------> lancaster: lead
-------------
marked has stem  ----> porter: mark
-----------
marked has stem ------> snowball: mark
--------------
marked has stem --------> lancaster: mark
-------------
leader has stem  ----> porter: leader
-----------
leader has stem ------> snowball: leader
--------------
leader has stem --------> lancaster: lead
-------------
likely has stem  ----> porter: like
-----------
likely has stem ------> snowball: like
--------------
likely has stem --------> lancaster: lik
-------------
city has stem  ----> porter: citi
-----------
city has stem ------> snowball: citi
--------------
city has stem --------> lancaster: city
-------------
pretty has stem  ----> porter: pretti
-----------
pretty

# Lemmatization

Lemmatization groups the different inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma (its dictionary form).

- It is more accurate than stemming: the result is a dictionary word, and related forms are grouped together correctly.
- It reduces ambiguity in NLP tasks. A stem can stand for several meanings, such as 'bank', which can be a financial institution or the side of a river. By analyzing the word together with its context, a lemmatizer can better work out how 'bank' is used in the sentence.
- It reduces the number of distinct words to process.
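
The contrast with stemming can be sketched with a toy example. The lookup table `LEMMAS` and both functions below are hypothetical and only illustrate the idea. Real lemmatizers such as those in spaCy or NLTK rely on full dictionaries and rules.

```python
# Hypothetical lookup table; real lemmatizers use a full dictionary.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)

def toy_stem(word):
    """Naive suffix stripping, for contrast with the lookup above."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for w in ["ran", "running", "mice"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Suffix stripping cannot relate 'ran' or 'mice' to their base forms and produces the non-word 'runn', while the dictionary lookup returns 'run' and 'mouse'.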

In [13]:
import spacy
# spacy.load loads a trained language pipeline by name
# en_core_web_sm is a small English pipeline that is fast to load and run
# and can be applied to text.
nlp = spacy.load('en_core_web_sm')

In [18]:
# The u prefix marks a Unicode string literal. In Python 3 every str is
# already Unicode, so the prefix is optional and has no effect here.
# spaCy works with Unicode text internally.

text = nlp(u'I can be a data scientist and a blockchain engineer in the United State of America')

# t.lemma is the hash ID of the lemma, t.lemma_ is the lemma as a string
for t in text:
  print(t.text,'\t', t.lemma, '\t', t.lemma_)

I 	 4690420944186131903 	 I
can 	 6635067063807956629 	 can
be 	 10382539506755952630 	 be
a 	 11901859001352538922 	 a
data 	 6645506661261177361 	 data
scientist 	 16370364435822077466 	 scientist
and 	 2283656566040971221 	 and
a 	 11901859001352538922 	 a
blockchain 	 13707632201927900929 	 blockchain
engineer 	 2945926927285067412 	 engineer
in 	 3002984154512732771 	 in
the 	 7425985699627899538 	 the
United 	 13226800834791099135 	 United
State 	 3438489356621435858 	 State
of 	 886050111519832510 	 of
America 	 13134984502707718284 	 America


# WordNetLemmatizer

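WordNetLemmatizer looks up a word's lemma in WordNet. By default it treats every word as a noun, so it is mainly used to find the singular forms of words.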

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [16]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

ml= ['cats', 'boxes', 'radii', 'cacti', 'visionaries', 'runners', 'speeches']

for m in ml:
  print(lem.lemmatize(m))

cat
box
radius
cactus
visionary
runner
speech


In [17]:
# indicating the part of speech of the word
# lemmatize() takes an optional part of speech and returns the lemma for that reading

print(lem.lemmatize('beauty', 'n'))
print(lem.lemmatize('beauty', 'v'))
# WordNet uses single-letter tags: 'n' noun, 'v' verb, 'a' adjective, 'r' adverb
# print(lem.lemmatize('beautiful', 'a'))

beauty
beauty
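
The lemmatizer expects WordNet's single-letter tags ('n', 'v', 'a', 'r'), while POS taggers commonly emit Penn Treebank tags such as 'NNS' or 'VBD'. A small mapping function is one common way to bridge the two. The name `penn_to_wordnet` is an assumption for this sketch, and falling back to noun mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # everything else is treated as a noun

for tag in ["JJ", "VBD", "RB", "NNS", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the second argument of `lem.lemmatize(word, pos)`.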


# Stage 4 of Language Processing

# Syntactic Analysis

Syntax in language means the arrangement of words. Syntactic analysis processes words as a group rather than one at a time. Its goal is to determine the grammatical structure of a sentence. It therefore covers how words depend on each other, word classes, word order, grammatical relations, and constituency parsing.



## Word class
> In natural language processing (NLP), a word class is a set of words that have similar grammatical properties. Word classes are essential for NLP because they help to determine the meaning of words and sentences.
 - Words in a class behave alike, e.g. 'dogs' and 'cats', which are nouns.
 - They perform similar functions, e.g. 'walk' and 'run', which are action words (verbs).
 - They undergo similar transformations.




## Part of Speech

> These are the traditional parts of speech:
- Nouns: Words that refer to people, places, things, or ideas.
- Verbs: Words that describe actions or states of being.
- Adjectives: Words that describe nouns.
- Adverbs: Words that modify verbs, adjectives, or other adverbs.
- Determiners: Words that precede nouns to indicate quantity, number, or definiteness.
- Pronouns: Words that take the place of nouns.
- Prepositions: Words that show the relationship between a noun or pronoun and another word in a sentence.
- Conjunctions: Words that join words, phrases, or clauses together.
- Interjections: Words that express emotions or sudden thoughts.

These categories go by several names: parts of speech (POS), lexical categories, word classes, morphological classes, or lexical tags.


# POS Tagging

This is the process of assigning a part of speech to each word in a corpus.
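
The simplest possible tagger is a dictionary lookup. The sketch below is a deliberately naive illustration with a hypothetical mini lexicon and illustrative tag names. Real taggers also use the surrounding context, which this one ignores.

```python
# Hypothetical mini lexicon mapping words to a single tag.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "bank": "NOUN",
    "went": "VERB", "barks": "VERB",
    "i": "PRON", "to": "ADP",
}

def lookup_tag(sentence, default="NOUN"):
    """Tag each token by lexicon lookup; unknown words get `default`."""
    return [(tok, LEXICON.get(tok.lower(), default))
            for tok in sentence.split()]

print(lookup_tag("I went to the bank"))
```

A lookup tagger cannot handle words that belong to more than one class, which is why practical taggers are trained on tagged corpora.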

Importance of POS Tagging
- It helps in understanding how words can be joined together to create grammatically correct sentences.
- It helps in stemming: if the part of speech of a word is correctly identified, a more logical root can be chosen for it.
- It helps in predicting follow-up words. For example, possessive pronouns (my, your) are followed by nouns, while personal pronouns are followed by verbs.
- It helps in automatic disambiguation of words: once the part of speech of a word is identified, the context of the sentence is easier to understand. For example, in 'I went to the bank', the word 'bank' could mean a financial institution or the side of a river, both of which are nouns. Knowing the part of speech helps to narrow down the possible meanings.
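
The point about predicting follow-up words can be sketched by counting which tag follows which in tagged text. The tiny hand-tagged corpus and the tag names below are made up for illustration. A real estimate would need a large tagged corpus.

```python
from collections import Counter

# Tiny hand-tagged corpus (hypothetical data, illustrative tag names).
tagged = [
    [("my", "POSS"), ("dog", "NOUN"), ("barks", "VERB")],
    [("your", "POSS"), ("car", "NOUN"), ("is", "VERB"), ("red", "ADJ")],
    [("she", "PRON"), ("runs", "VERB")],
    [("they", "PRON"), ("sing", "VERB")],
]

# Count which tag follows which across all sentences.
following = Counter()
for sentence in tagged:
    tags = [tag for _, tag in sentence]
    following.update(zip(tags, tags[1:]))

print(following[("POSS", "NOUN")])  # possessive pronoun then noun
print(following[("PRON", "VERB")])  # personal pronoun then verb
```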


## Open and Closed Class Words

Parts of speech are divided into two groups:
- Open class: categories that readily accept new words, often formed through morphological processes such as compounding or affixation, e.g. 'faxed'. Nouns, verbs, adjectives, and adverbs are open classes.
- Closed class: categories that rarely accept new words, e.g. pronouns, auxiliary verbs, and other function words.
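
As a sketch, the split can be expressed as two sets of tags. The grouping below follows the Universal POS tag set used by Universal Dependencies, which is an assumption here since the notebook does not commit to a tag set. The function name `word_class_kind` is made up for this example.

```python
# Universal POS tags grouped as in the Universal Dependencies guidelines.
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}

def word_class_kind(tag):
    """Return 'open', 'closed', or 'other' for a Universal POS tag."""
    if tag in OPEN_CLASS:
        return "open"
    if tag in CLOSED_CLASS:
        return "closed"
    return "other"   # punctuation, symbols, unknown tags

for tag in ["VERB", "PRON", "AUX", "PUNCT"]:
    print(tag, "->", word_class_kind(tag))
```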





