
### **Tokenization**



In [2]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required in newer NLTK versions

from nltk.tokenize import sent_tokenize, word_tokenize

text = "I love Natural Language Processing. It's amazing!"
print("Sentence Tokenization:", sent_tokenize(text))
print("Word Tokenization:", word_tokenize(text))



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Sentence Tokenization: ['I love Natural Language Processing.', "It's amazing!"]
Word Tokenization: ['I', 'love', 'Natural', 'Language', 'Processing', '.', 'It', "'s", 'amazing', '!']


### **Stemming**

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["playing", "played", "plays"]
print([ps.stem(w) for w in words])


['play', 'play', 'play']


### **Lemmatization**

In [5]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lm = WordNetLemmatizer()
print(lm.lemmatize("running", pos='v'))


run


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### **Morphology (Lemma, POS, and Morphological Features with spaCy)**

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The cats are playing happily."
doc = nlp(text)

print(f"\n{'Word':<15} {'Lemma':<15} {'POS':<10} {'Morphology'}")
print("-" * 60)
for token in doc:
    print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {token.morph}")


Word            Lemma           POS        Morphology
------------------------------------------------------------
The             the             DET        Definite=Def|PronType=Art
cats            cat             NOUN       Number=Plur
are             be              AUX        Mood=Ind|Tense=Pres|VerbForm=Fin
playing         play            VERB       Aspect=Prog|Tense=Pres|VerbForm=Part
happily         happily         ADV        
.               .               PUNCT      PunctType=Peri


### **Normalization**


In [11]:
import re

text = "Hey!! What's up??? I'm learning NLP right now..."
# Convert to lowercase
text = text.lower()
# Remove everything except lowercase letters and whitespace
# (this also drops digits and apostrophes)
text = re.sub(r'[^a-z\s]', '', text)
# Remove extra spaces
text = re.sub(r'\s+', ' ', text).strip()

print("\nNormalized Text:", text)



Normalized Text: hey whats up im learning nlp right now
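
The cell above also deletes apostrophes, which is why "what's" comes out as "whats". As a minimal sketch, the same three steps can be wrapped in a reusable function with an option to keep apostrophes. The names `normalize` and `keep_apostrophes` are hypothetical and not part of any library.

```python
import re

def normalize(text, keep_apostrophes=False):
    """Lowercase, strip non-letter characters, and collapse whitespace."""
    text = text.lower()
    # Optionally keep apostrophes so contractions stay readable.
    pattern = r"[^a-z'\s]" if keep_apostrophes else r"[^a-z\s]"
    text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Hey!! What's up??? I'm learning NLP right now..."
print(normalize(sample))
print(normalize(sample, keep_apostrophes=True))
```

Whether to keep apostrophes depends on the downstream tokenizer, so the default here simply matches the behavior of the cell above.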


### **N-Gram (Unigram, Bigram, Trigram)**

In [13]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is amazing"
tokens = word_tokenize(text.lower())

print("\nUnigrams:", list(ngrams(tokens, 1)))
print("Bigrams:", list(ngrams(tokens, 2)))
print("Trigrams:", list(ngrams(tokens, 3)))


Unigrams: [('natural',), ('language',), ('processing',), ('is',), ('amazing',)]
Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'amazing')]
Trigrams: [('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'amazing')]
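
For intuition, an n-gram is just a sliding window of n consecutive tokens. The sketch below reproduces the same windows in plain Python, without NLTK. The helper `simple_ngrams` is a hypothetical name, and a whitespace split stands in for `word_tokenize`.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "natural language processing is amazing".split()
print(simple_ngrams(tokens, 2))
# A sequence of N tokens yields N - n + 1 n-grams (here 5 - 3 + 1).
print(len(simple_ngrams(tokens, 3)))
```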


### **N-Gram Smoothing (Add-1 Laplace)**

In [14]:
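from collections import Counter
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

corpus = "I love NLP. NLP is fun."
tokens = word_tokenize(corpus.lower())
bigrams = list(ngrams(tokens, 2))
vocab = len(set(tokens))          # vocabulary size V (punctuation tokens included)
bigram_counts = Counter(bigrams)  # count(w1, w2)
word_counts = Counter(tokens)     # count(w1)

def laplace_prob(bigram):
    # Add-1 smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
    w1, w2 = bigram
    return (bigram_counts[bigram] + 1) / (word_counts[w1] + vocab)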

example_bigram = ("nlp", "is")
print(f"\nLaplace Smoothed Probability of {example_bigram}:", laplace_prob(example_bigram))



Laplace Smoothed Probability of ('nlp', 'is'): 0.25
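
The value 0.25 follows directly from the counts: the bigram ("nlp", "is") occurs once, "nlp" occurs twice, and the vocabulary has 6 types (i, love, nlp, ., is, fun), so (1 + 1) / (2 + 6) = 0.25. The sketch below repeats the calculation and checks that an unseen bigram still receives non-zero probability. It assumes a regex split as a stand-in for `word_tokenize`, so no NLTK data is needed, and it takes two words instead of a tuple.

```python
import re
from collections import Counter

corpus = "I love NLP. NLP is fun."
# Assumption: this regex split stands in for word_tokenize.
tokens = re.findall(r"\w+|[^\w\s]", corpus.lower())
bigram_counts = Counter(zip(tokens, tokens[1:]))
word_counts = Counter(tokens)
vocab = sorted(set(tokens))

def laplace_prob(w1, w2):
    # Add-1 smoothing: (count(w1, w2) + 1) / (count(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (word_counts[w1] + len(vocab))

print(laplace_prob("nlp", "is"))    # seen bigram
print(laplace_prob("nlp", "love"))  # unseen bigram, still non-zero
print(sum(laplace_prob("nlp", w) for w in vocab))  # total mass for the context "nlp"
```

The total comes to 1 for "nlp" because every occurrence of "nlp" is followed by another token. For a word that also ends the corpus, such as ".", the total falls slightly short with this denominator.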


### **POS Tagging**

In [15]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # required by newer NLTK versions

from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "John is playing football"
tokens = word_tokenize(text)
print("\nPOS Tags:", pos_tag(tokens))



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.



POS Tags: [('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'), ('football', 'NN')]


### **Hidden Markov Model (Simple POS Example using NLTK)**

In [16]:
from nltk.tag import hmm
trainer = hmm.HiddenMarkovModelTrainer()

train_data = [[('John', 'NOUN'), ('plays', 'VERB'), ('football', 'NOUN')],
              [('She', 'PRON'), ('enjoys', 'VERB'), ('music', 'NOUN')]]

tagger = trainer.train_supervised(train_data)
print("\nHMM Tagger Result:", tagger.tag(['She', 'plays', 'football']))



HMM Tagger Result: [('She', 'PRON'), ('plays', 'VERB'), ('football', 'NOUN')]
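
A supervised HMM tagger is built from three kinds of statistics: which tag starts a sentence, which tag follows which, and which word each tag emits. The sketch below collects those counts from the same `train_data` in plain Python. It illustrates the idea only and is not NLTK's actual implementation; the names `starts`, `transitions`, `emissions`, and `relative_freq` are hypothetical.

```python
from collections import Counter, defaultdict

train_data = [[('John', 'NOUN'), ('plays', 'VERB'), ('football', 'NOUN')],
              [('She', 'PRON'), ('enjoys', 'VERB'), ('music', 'NOUN')]]

starts = Counter()                  # tag of the first word in each sentence
transitions = defaultdict(Counter)  # previous tag -> counts of the next tag
emissions = defaultdict(Counter)    # tag -> counts of the words it emits

for sentence in train_data:
    tags = [tag for _, tag in sentence]
    starts[tags[0]] += 1
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transitions[prev_tag][next_tag] += 1
    for word, tag in sentence:
        emissions[tag][word] += 1

def relative_freq(counter, key):
    """Unsmoothed relative frequency of key within counter."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

print(relative_freq(starts, 'PRON'))               # P(first tag = PRON)
print(relative_freq(transitions['VERB'], 'NOUN'))  # P(NOUN | previous tag = VERB)
print(relative_freq(emissions['VERB'], 'plays'))   # P("plays" | VERB)
```

With only two training sentences, every word in "She plays football" has been seen with exactly one tag, which is why the tagger above recovers PRON, VERB, NOUN.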




### **Brill POS Tagger (Transformation-Based Tagger)**

In [17]:
import nltk
from nltk.tag import brill, brill_trainer

train_data = [[('He', 'PRON'), ('is', 'VERB'), ('good', 'ADJ')],
              [('She', 'PRON'), ('was', 'VERB'), ('happy', 'ADJ')]]

initial_tagger = nltk.DefaultTagger('NOUN')
templates = brill.fntbl37()
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates)

brill_tagger = trainer.train(train_data, max_rules=10)
print("\nBrill Tagger Output:", brill_tagger.tag(['He', 'is', 'happy']))



Brill Tagger Output: [('He', 'PRON'), ('is', 'VERB'), ('happy', 'ADJ')]
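
Transformation-based tagging starts from a baseline tagging and then applies an ordered list of correction rules. The sketch below shows that mechanism with three hand-written rules. These rules are hypothetical and chosen only for illustration; a real Brill trainer learns its rules from the training data, and they need not match these.

```python
# Hypothetical hand-written rules: (from_tag, to_tag, condition).
# Each condition receives the words, the current tags, and a position i.
rules = [
    ('NOUN', 'PRON', lambda w, t, i: w[i].lower() in {'he', 'she'}),
    ('NOUN', 'VERB', lambda w, t, i: i > 0 and t[i - 1] == 'PRON'),
    ('NOUN', 'ADJ',  lambda w, t, i: i > 0 and t[i - 1] == 'VERB'),
]

def transform_tag(words, default='NOUN'):
    # Baseline: every token gets the default tag.
    tags = [default] * len(words)
    # Apply the rules in order, each one over the whole sentence.
    for from_tag, to_tag, cond in rules:
        tags = [to_tag if tag == from_tag and cond(words, tags, i) else tag
                for i, tag in enumerate(tags)]
    return list(zip(words, tags))

print(transform_tag(['He', 'is', 'happy']))
```

Rule order matters here: the VERB rule can only fire after the PRON rule has corrected the preceding tag.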
