<a href="https://colab.research.google.com/github/SufyAD/AI-ML/blob/nlp/ml_for_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Practical examples of NLP are:
-

### What is `NLTK` and why are we using it?

`NLTK` (Natural Language Toolkit) is a powerful **Python library** designed for working with human language data (text). It offers user-friendly interfaces to over 50 corpora and lexical resources, including WordNet, and provides a comprehensive suite of tools for text processing tasks such as:

* Tokenization

* Stemming

* Tagging

* Parsing

* Classification

* Semantic reasoning

We use `NLTK` to perform foundational **NLP operations efficiently** and to access rich linguistic resources for building robust language-processing applications such as NLP bots and real-world chatbots.

## 1. Stemming techniques and their drawbacks

In [None]:
!pip install nltk



In [None]:
# Sample words used in the stemming and lemmatization examples
words = [
    'preprocessing', 'unbelievably', 'counterintuitive', 'misunderstanding',
    'reimplementation', 'overcompensating', 'underachievement', 'disenfranchisement',
    'miscommunication', 'internationalization', 'antidisestablishmentarianism',
    'bioengineering', 'microencapsulation', 'hyperresponsiveness', 'reconfiguration',
    'suboptimization', 'transcontinental', 'overspecialization', 'deconstructionist'
]

### PorterStemmer
A word stemmer based on the Porter stemming algorithm.

> Porter, M. "An algorithm for suffix stripping."


In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in words:
  print(f"{word} -> {stemmer.stem(word)}")  # show each word next to its stem
# Drawback of stemming: it can deform words and produce
# stems that do not exist in the language
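
To make the drawback concrete, here is a minimal sketch of suffix stripping. It implements only step 1a (plural handling) of Porter's published algorithm, in plain Python. The function name `porter_step_1a` is hypothetical, and NLTK's `PorterStemmer` applies many more steps and may differ in details.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a (plural endings) of Porter's suffix-stripping rules."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(f"{w} -> {porter_step_1a(w)}")
```

The rule for `ies` turns "ponies" into "poni", which is not an English word. That is the kind of deformation the comment in the cell above refers to.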

### RegexpStemmer
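A stemmer that uses a regular expression to identify affixes. Any substring that matches the expression is removed from the word.

```
RegexpStemmer(regexp, min=0)
```

- `regexp`: the pattern (a string or compiled regular expression) whose matches are stripped
- `min`: the minimum word length to stem; shorter words are returned unchanged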






In [None]:
from nltk.stem import RegexpStemmer
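# Strip these substrings from words of at least 4 characters.
# The pattern is not anchored with `$`, so matches are removed anywhere in the word.
reg_stemmer = RegexpStemmer('ing|able|ion|ent|ness', min=4)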
for word in words:
  print(f"{word} -> {reg_stemmer.stem(word)}")


preprocessing -> preprocess
unbelievably -> unbelievably
counterintuitive -> counterintuitive
misunderstanding -> misunderstand
reimplementation -> reimplemat
overcompensating -> overcompensat
underachievement -> underachievem
disenfranchisement -> disenfranchisem
miscommunication -> miscommunicat
internationalization -> internatalizat
antidisestablishmentarianism -> antidisestablishmarianism
bioengineering -> bioengineer
microencapsulation -> microencapsulat
hyperresponsiveness -> hyperresponsive
reconfiguration -> reconfigurat
suboptimization -> suboptimizat
transcontinental -> transcontinal
overspecialization -> overspecializat
deconstructionist -> deconstructist
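
Several results above lose letters from the middle of the word (for example `reimplementation -> reimplemat`), which suggests that every match of the pattern is deleted, not only a trailing suffix. The sketch below reproduces that behaviour with the standard `re` module and compares it with a pattern anchored at the end of the word. `regexp_stem` is a hypothetical stand-in written for illustration, not NLTK's implementation.

```python
import re

SUFFIXES = "ing|able|ion|ent|ness"


def regexp_stem(word: str, pattern: str, min_len: int = 4) -> str:
    """Delete every match of `pattern` from `word` (assumed RegexpStemmer-like behaviour)."""
    if len(word) < min_len:
        return word  # words shorter than the minimum are left alone
    return re.sub(pattern, "", word)


unanchored = SUFFIXES            # matches anywhere in the word
anchored = f"({SUFFIXES})$"      # matches only at the end of the word

for w in ["reimplementation", "transcontinental", "preprocessing"]:
    print(f"{w}: {regexp_stem(w, unanchored)} | {regexp_stem(w, anchored)}")
```

With the anchored pattern, "transcontinental" is left untouched and "reimplementation" only loses its final `ion`. If suffix-only stripping is the goal, anchoring the pattern is probably the safer choice.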


## 2. Lemmatization > Stemming

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.stem import WordNetLemmatizer  # maps words to their dictionary form (lemma) using WordNet
from nltk.corpus import wordnet
from nltk.tag import pos_tag

lemmatizer = WordNetLemmatizer()

# Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS constant
def get_pos(treebank_tag):
  if treebank_tag.startswith('J'):  # Adjective
    return wordnet.ADJ
  elif treebank_tag.startswith('V'): # Verb
    return wordnet.VERB
  elif treebank_tag.startswith('N'): # Noun
    return wordnet.NOUN
  elif treebank_tag.startswith('R'): # Adverb
    return wordnet.ADV
  else:
    return wordnet.NOUN  # fall back to noun, the lemmatizer's default

tagged = pos_tag(words)  # pair each word in the list with its POS tag
for word, tag in tagged:
  print(f"{word} -> {lemmatizer.lemmatize(word, get_pos(tag))}")

preprocessing -> preprocessing
unbelievably -> unbelievably
counterintuitive -> counterintuitive
misunderstanding -> misunderstand
reimplementation -> reimplementation
overcompensating -> overcompensate
underachievement -> underachievement
disenfranchisement -> disenfranchisement
miscommunication -> miscommunication
internationalization -> internationalization
antidisestablishmentarianism -> antidisestablishmentarianism
bioengineering -> bioengineering
microencapsulation -> microencapsulation
hyperresponsiveness -> hyperresponsiveness
reconfiguration -> reconfiguration
suboptimization -> suboptimization
transcontinental -> transcontinental
overspecialization -> overspecialization
deconstructionist -> deconstructionist
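
The tag mapping in `get_pos` can also be written as a lookup table. The sketch below assumes the WordNet POS constants are the single characters `'a'`, `'v'`, `'n'`, and `'r'`, and uses those literals so it runs without NLTK. The name `TREEBANK_TO_WORDNET` is introduced here for illustration.

```python
# First letter of a Penn Treebank tag -> WordNet POS character (assumed values)
TREEBANK_TO_WORDNET = {
    "J": "a",  # adjective
    "V": "v",  # verb
    "N": "n",  # noun
    "R": "r",  # adverb
}


def get_pos(treebank_tag: str) -> str:
    """Return the WordNet POS for a Treebank tag, defaulting to noun."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], "n")


for tag in ["JJ", "VBG", "NNS", "RB", "DT"]:
    print(f"{tag} -> {get_pos(tag)}")
```

Slicing with `[:1]` keeps the function safe for an empty tag string, which falls through to the noun default.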


## Removing stopwords and lemmatizing
- Use the English stopword list from `nltk.corpus.stopwords`
- Lemmatize the remaining words

In [None]:
from nltk.corpus import stopwords  # import before first use
stopwords.words('english')  # the English stopword list

In [None]:
paragraph = """
Dr. Abdus Salam, in his Nobel Prize acceptance speech, emphasized the vital role of science in the progress of developing nations.
He spoke about the unity of scientific thought across cultures and highlighted the contributions of Muslim scientists throughout history.
With humility, he dedicated his award to the poor of the Third World, whose struggles inspired his pursuit of knowledge.
His words reflected a deep belief in the power of education and global collaboration for a better future.
"""

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Step 1: Split the paragraph into sentences
sentences = sent_tokenize(paragraph)

# Step 2: Remove stopwords and lemmatize (the default POS is noun)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not per word
for sent in sentences:
  words = word_tokenize(sent)
  # The comparison is case-sensitive, so capitalized stopwords such as "He" are kept
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  print(' '.join(words))

Dr. Abdus Salam , Nobel Prize acceptance speech , emphasized vital role science progress developing nation .
He spoke unity scientific thought across culture highlighted contribution Muslim scientist throughout history .
With humility , dedicated award poor Third World , whose struggle inspired pursuit knowledge .
His word reflected deep belief power education global collaboration better future .
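
In the output above, "He", "With", and "His" survive even though their lowercase forms are stopwords. The likely cause is that the stopword list is lowercase while the tokens are compared as-is. The sketch below shows the difference on a small, hypothetical stopword sample (`STOPWORDS` is not NLTK's full list).

```python
# Hypothetical sample; NLTK's English list is much longer and also lowercase
STOPWORDS = frozenset({"he", "his", "with", "the", "of", "in", "a"})


def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Remove stopwords, lowercasing each token before the lookup."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["He", "spoke", "with", "humility"]

case_sensitive = [t for t in tokens if t not in STOPWORDS]
print(case_sensitive)          # "He" is kept because "He" != "he"
print(drop_stopwords(tokens))  # "He" is dropped after lowercasing
```

Lowercasing only for the lookup keeps the original casing of the words that remain, so proper nouns such as "Nobel" are not altered.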


## Adding **Part-of-Speech** tagging to stopword removal and lemmatization
1. Split the paragraph into sentences
2. Tokenize each sentence into words and remove the stopwords
3. Tag each remaining word with `pos_tag`
4. Lemmatize each word according to its POS tag
5. Join the lemmatized words with `.join` to form the pre-processed sentence

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.tag import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize

# Step 1: Split the paragraph into sentences
sentences = sent_tokenize(paragraph)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not per word

for sent in sentences:
    # Step 2: Tokenize into words and drop stopwords (case-insensitive)
    words = word_tokenize(sent)
    words = [word for word in words if word.lower() not in stop_words]

    # Step 3: POS tagging
    pos_tagged = pos_tag(words)

    # Step 4: Lemmatize each word according to its part of speech
    l_words = []
    for word, tag in pos_tagged:
        if tag.startswith('J'):
            l_words.append(lemmatizer.lemmatize(word, wordnet.ADJ))
        elif tag.startswith('V'):
            l_words.append(lemmatizer.lemmatize(word, wordnet.VERB))
        elif tag.startswith('R'):
            l_words.append(lemmatizer.lemmatize(word, wordnet.ADV))
        else:
            l_words.append(lemmatizer.lemmatize(word))  # default POS is noun

    # Step 5: Join the lemmas back into a sentence
    sentence = ' '.join(l_words)
    print(sentence)

Dr. Abdus Salam , Nobel Prize acceptance speech , emphasize vital role science progress develop nation .
spoke unity scientific think across culture highlight contribution Muslim scientist throughout history .
humility , dedicate award poor Third World , whose struggle inspire pursuit knowledge .
word reflect deep belief power education global collaboration well future .
