<a href="https://colab.research.google.com/github/AASHA-CHANPA/GEN_AI/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install --user -U nltk
!pip install --user -U numpy



In [3]:
import nltk

1. Tokenization (Text → Words/Sentences)

In [6]:
# Installing the nltk package does not install its data files.
# Download the Punkt tokenizer tables (used by word_tokenize) only if
# they are not already on NLTK's data search path.
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [7]:
from nltk.tokenize import word_tokenize  # needed here; otherwise this cell raises NameError

s1 = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."
word_tokenize(s1)

['On',
 'a',
 '$',
 '50,000',
 'mortgage',
 'of',
 '30',
 'years',
 'at',
 '8',
 'percent',
 ',',
 'the',
 'monthly',
 'payment',
 'would',
 'be',
 '$',
 '366.88',
 '.']

In [8]:
from nltk.tokenize import word_tokenize
text = "Hello! I’m learning NLP with NLTK. It's fun."
tokens = word_tokenize(text)
print(tokens)


['Hello', '!', 'I', '’', 'm', 'learning', 'NLP', 'with', 'NLTK', '.', 'It', "'s", 'fun', '.']
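
The output above splits "I’m" into 'I', '’', 'm', while "It's" becomes 'It', "'s". The difference is that the first contraction uses a typographic apostrophe (U+2019) and the second an ASCII one. A minimal sketch of normalizing quotes before tokenizing follows; `QUOTE_MAP` and `normalize_quotes` are hypothetical helper names, not part of NLTK.

```python
# Hypothetical helper (not part of NLTK): map typographic quotes to ASCII
# so that "I’m" and "It's" use the same apostrophe character.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / typographic apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by their ASCII equivalents."""
    return text.translate(QUOTE_MAP)


text = "Hello! I’m learning NLP with NLTK. It's fun."
normalized = normalize_quotes(text)
print(normalized)
```

Passing `normalized` instead of `text` to word_tokenize gives the tokenizer an ASCII apostrophe in both contractions.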


2. Noise Removal / Regex Cleaning (Remove punctuation, links, emojis)

In [14]:
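import re
text = "Hello #Aasha! I https://want.org to be more than friends with you! :)"
# Remove URLs, then every character that is not an ASCII letter or whitespace
# (this also drops digits, punctuation, and emojis).
cleaned = re.sub(r"https?://\S+|[^A-Za-z\s]", "", text)
print(cleaned)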

Hello Aasha I  to be more than friends with you 
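
The cleaned string above still has a double space where the URL was and a trailing space at the end. One possible follow-up, sketched here under the assumption that whitespace should be collapsed to single spaces, splits the cleaning into named steps; `clean_text` and the three pattern names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")        # links
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")  # anything but ASCII letters/whitespace
SPACES_RE = re.compile(r"\s+")              # runs of whitespace


def clean_text(text: str) -> str:
    """Strip URLs and non-letter characters, then collapse whitespace."""
    text = URL_RE.sub("", text)
    text = NON_LETTER_RE.sub("", text)
    return SPACES_RE.sub(" ", text).strip()


cleaned = clean_text("Hello #Aasha! I https://want.org to be more than friends with you! :)")
print(cleaned)
```

Keeping the patterns separate makes it easier to relax one of them later, for example to keep digits.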


3. Stopword Removal (Eliminate common words like “the”, “is”)

In [18]:
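try:
    # The resource path is 'corpora/stopwords'; a misspelled path always
    # raises LookupError and forces a download on every run.
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')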


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokens if word.lower() not in stop_words]
print(filtered)


['Hello', '!', '’', 'learning', 'NLP', 'NLTK', '.', "'s", 'fun', '.']
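
The filtered list above still contains punctuation tokens such as '!', '’', '.', and "'s", because they are not in the stopword list. A minimal sketch of dropping them in the same pass follows. The token list is copied from the word_tokenize output shown earlier, and `stop_words` is a tiny hand-written stand-in (an assumption) for `set(stopwords.words('english'))` so that the sketch runs without the corpus.

```python
# Tokens copied from the word_tokenize output shown earlier.
tokens = ['Hello', '!', 'I', '’', 'm', 'learning', 'NLP', 'with', 'NLTK',
          '.', 'It', "'s", 'fun', '.']

# Assumption: a small stand-in for set(stopwords.words('english')).
stop_words = {"i", "m", "with", "it"}

# Keep only purely alphabetic tokens that are not stopwords.
content_words = [
    word for word in tokens
    if word.isalpha() and word.lower() not in stop_words
]
print(content_words)
```

Note that `str.isalpha()` also rejects tokens containing digits or hyphens, which may or may not be desirable depending on the text.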


4. Lexicon Normalization: Stemming and Lemmatization (Reduce words to root forms)

In [22]:
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [23]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))      # run
print(lemmatizer.lemmatize("running", pos="v"))  # run

run
run


5. POS Tagging (Part-of-Speech tagging: identify word roles such as noun or verb; helps with grammar, meaning, and context)

In [25]:
try:
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


In [26]:

from nltk import pos_tag
pos_tags = pos_tag(tokens)
print(pos_tags)


[('Hello', 'NN'), ('!', '.'), ('I', 'PRP'), ('’', 'VBP'), ('m', 'JJ'), ('learning', 'VBG'), ('NLP', 'NNP'), ('with', 'IN'), ('NLTK', 'NNP'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('fun', 'NN'), ('.', '.')]
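
The tags above are Penn Treebank tags, whereas WordNetLemmatizer.lemmatize() expects a WordNet part-of-speech code in its `pos` argument ('n', 'v', 'a', or 'r'). A common way to connect the two steps is a small mapping function; the sketch below is one possible version, with `to_wordnet_pos` as a hypothetical name and the fallback to 'n' as an assumption that mirrors the lemmatizer's default.

```python
# Tagged tokens copied (abridged) from the pos_tag output shown above.
pos_tags = [('learning', 'VBG'), ('NLP', 'NNP'), ('with', 'IN'), ('fun', 'NN')]


def to_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'n', 'r')."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun; also the fallback for all other tags


mapped = [(word, to_wordnet_pos(tag)) for word, tag in pos_tags]
print(mapped)
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=wn_pos)`, using the lemmatizer created in the previous section.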



6. Named Entity Recognition (NER)
Use nltk.ne_chunk() on POS-tagged tokens to extract names, places, etc.

In [35]:
try:
    nltk.data.find('chunkers/maxent_ne_chunker_tab')
except LookupError:
    nltk.download('maxent_ne_chunker_tab')
try:
    nltk.data.find('corpora/words')
except LookupError:
    nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [37]:

from nltk import ne_chunk
tree = ne_chunk(pos_tags)
print(tree)


(S
  (GPE Hello/NN)
  !/.
  I/PRP
  ’/VBP
  m/JJ
  learning/VBG
  (ORGANIZATION NLP/NNP)
  with/IN
  (ORGANIZATION NLTK/NNP)
  ./.
  It/PRP
  's/VBZ
  fun/NN
  ./.)
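
The printed tree marks entities as labeled subtrees, while ordinary tokens stay as plain (word, tag) tuples. A minimal sketch of collecting the entities into a list follows. It relies only on chunk nodes having a `label()` method and being iterable over their leaves, which holds for the flat chunks in the tree above. `Chunk` is a hypothetical stand-in for nltk.Tree so that the sketch runs without the NER models, and `extract_entities` is likewise a hypothetical name.

```python
class Chunk(list):
    """Hypothetical stand-in for nltk.Tree: a list of leaves with a label()."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def extract_entities(tree):
    """Return (entity text, entity label) pairs from a chunked sentence."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; chunks expose label().
        if hasattr(node, "label"):
            entity_text = " ".join(token for token, _tag in node)
            entities.append((entity_text, node.label()))
    return entities


# Abridged copy of the tree printed above.
sample = Chunk("S", [
    Chunk("GPE", [("Hello", "NN")]),
    ("!", "."),
    ("I", "PRP"),
    Chunk("ORGANIZATION", [("NLP", "NNP")]),
    ("with", "IN"),
    Chunk("ORGANIZATION", [("NLTK", "NNP")]),
])
entities = extract_entities(sample)
print(entities)
```

In the notebook, `extract_entities(tree)` would be called on the result of ne_chunk instead of `sample`. The "Hello" entry labeled GPE shows that the chunker's labels are worth reviewing before they are used downstream.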
