
### **Tokenization**



In [18]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is amazing!"
print("Sentence Tokenization:", sent_tokenize(text))
print("Word Tokenization:", word_tokenize(text))

Sentence Tokenization: ['NLP is amazing!']
Word Tokenization: ['NLP', 'is', 'amazing', '!']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
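
For intuition about what word tokenization does, here is a minimal regex-based sketch that uses only the standard library. It is not the algorithm NLTK uses, and the helper name `simple_word_tokenize` is hypothetical; it treats contractions and abbreviations much more crudely than `word_tokenize`.

```python
import re

def simple_word_tokenize(text):
    # Match either a run of word characters or a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("NLP is amazing!"))
```

On this short sentence the sketch should agree with the `word_tokenize` output above; on text with contractions such as "What's" the two will differ.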


### **Stemming**

In [19]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
word = "running"
print(ps.stem(word))


run


### **Lemmatization**

In [20]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lm = WordNetLemmatizer()
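# With pos='n' the word is treated as a noun, so "running" is returned unchanged
print(lm.lemmatize("running", pos='n'))
# With pos='v' the word is treated as a verb and reduced to its base form
print(lm.lemmatize("running", pos='v'))


running
run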


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### **Morphology (Lemma, POS, and Morphological Features with spaCy)**

In [21]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Text
text = "The cats are playing happily."
doc = nlp(text)

# Print results
for token in doc:
    print(token.text, "→", token.lemma_, ",", token.pos_, ",", token.morph)


The → the , DET , Definite=Def|PronType=Art
cats → cat , NOUN , Number=Plur
are → be , AUX , Mood=Ind|Tense=Pres|VerbForm=Fin
playing → play , VERB , Aspect=Prog|Tense=Pres|VerbForm=Part
happily → happily , ADV , 
. → . , PUNCT , PunctType=Peri
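
The feature strings printed above use a `Key=Value|Key=Value` layout. As a small sketch, the hypothetical helper `parse_morph` below turns such a string into a dictionary using plain Python. spaCy itself offers `token.morph.to_dict()` for the same purpose, so this is only meant to show how the printed format is structured.

```python
def parse_morph(features):
    # An empty string means the token carries no morphological features
    if not features:
        return {}
    # Split "Key=Value|Key=Value" into pairs, then each pair into key and value
    return dict(pair.split("=", 1) for pair in features.split("|"))

print(parse_morph("Aspect=Prog|Tense=Pres|VerbForm=Part"))
print(parse_morph(""))
```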


### **Normalization**


In [22]:
import re

text = "Hey!! What's up??? I'm learning NLP right now..."
# Convert to lowercase
text = text.lower()
# Remove punctuation/special chars
text = re.sub(r'[^a-z\s]', '', text)
# Remove extra spaces
text = re.sub(r'\s+', ' ', text).strip()

print("\nNormalized Text:", text)



Normalized Text: hey whats up im learning nlp right now
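
Note that removing every non-letter also removes apostrophes, so "what's" becomes "whats". The sketch below wraps the same steps in a hypothetical `normalize` function with an assumed `keep_apostrophes` option, in case contractions should stay readable. Whether that is desirable depends on the downstream task.

```python
import re

def normalize(text, keep_apostrophes=False):
    # Lowercase first so the character class only needs a-z
    text = text.lower()
    # Optionally keep apostrophes so contractions stay intact
    pattern = r"[^a-z'\s]" if keep_apostrophes else r"[^a-z\s]"
    text = re.sub(pattern, "", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "Hey!! What's up??? I'm learning NLP right now..."
print(normalize(sample))
print(normalize(sample, keep_apostrophes=True))
```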


### **N-Gram (Unigram, Bigram, Trigram)**

In [23]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is amazing"
tokens = word_tokenize(text.lower())

print("\nUnigrams:", list(ngrams(tokens, 1)))
print("Bigrams:", list(ngrams(tokens, 2)))
print("Trigrams:", list(ngrams(tokens, 3)))


Unigrams: [('natural',), ('language',), ('processing',), ('is',), ('amazing',)]
Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'amazing')]
Trigrams: [('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'amazing')]
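
An n-gram is just a sliding window of `n` consecutive tokens. The sketch below reproduces that idea with `zip` and no NLTK dependency; the name `simple_ngrams` is hypothetical and the token list is typed in by hand to match the example above.

```python
def simple_ngrams(tokens, n):
    # Zip the token list against itself shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["natural", "language", "processing", "is", "amazing"]
print(simple_ngrams(tokens, 2))
print(simple_ngrams(tokens, 3))
```

A sentence with `k` tokens yields `k - n + 1` n-grams, which is why the five tokens above give four bigrams and three trigrams.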


### **N-Gram Smoothing (Add-1 Laplace)**

In [24]:
from nltk import word_tokenize, ngrams
from collections import Counter

# Sample text
text = "I love NLP. NLP is fun."

# Tokenize words
tokens = word_tokenize(text.lower())

# Create bigrams (pairs of words)
bigrams = list(ngrams(tokens, 2))

# Count words and bigrams
word_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

# Vocabulary size
vocab = len(set(tokens))

# Laplace Smoothing function
def laplace_prob(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (word_counts[w1] + vocab)

# Example
print("Laplace Smoothed Probability of ('nlp', 'is'):", laplace_prob('nlp', 'is'))

Laplace Smoothed Probability of ('nlp', 'is'): 0.25
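
One way to sanity-check add-1 smoothing is to confirm that the smoothed probabilities of all possible next words sum to 1. The sketch below does this for the context word `nlp`, using exact fractions. The token list is written out by hand and assumes the tokenizer emits each period as its own token; that is an assumption, not something the output above confirms.

```python
from collections import Counter
from fractions import Fraction

# Assumed tokenization of "i love nlp. nlp is fun."
tokens = ["i", "love", "nlp", ".", "nlp", "is", "fun", "."]
word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))

def laplace_prob(w1, w2):
    # Add 1 to the bigram count and the vocabulary size to the context count
    return Fraction(bigram_counts[(w1, w2)] + 1, word_counts[w1] + len(vocab))

# Smoothed distribution over every possible word following "nlp"
dist = {w2: laplace_prob("nlp", w2) for w2 in vocab}
total = sum(dist.values())
print(dist["is"], total)
```

The check works for `nlp` because each of its occurrences is followed by another token. The final token of the text has no successor, so its row sums to slightly less than 1 unless an end-of-text marker is added.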


### **POS Tagging**

In [25]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

from nltk import pos_tag, word_tokenize

text = "John is playing football"
tokens = word_tokenize(text)
print("\nPOS Tags:", pos_tag(tokens))




POS Tags: [('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'), ('football', 'NN')]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
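
The tags in the output come from the Penn Treebank tagset. As a reading aid, the sketch below maps the four tags seen here to short descriptions. The dictionary `PENN_TAGS` is a hand-written, partial lookup for illustration only, and the tagged list is copied from the output above rather than recomputed.

```python
# Partial, hand-written lookup covering only the tags in this example
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "NN": "noun, singular or mass",
}

tagged = [("John", "NNP"), ("is", "VBZ"), ("playing", "VBG"), ("football", "NN")]

def describe(tagged_tokens):
    # Pair each word with its tag and a readable description
    return [(word, tag, PENN_TAGS.get(tag, "unknown")) for word, tag in tagged_tokens]

for word, tag, meaning in describe(tagged):
    print(f"{word:10} {tag:4} {meaning}")
```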


### **Hidden Markov Model (Simple POS Example using NLTK)**

In [26]:
from nltk.tag import hmm
trainer = hmm.HiddenMarkovModelTrainer()

train_data = [[('John', 'NOUN'), ('plays', 'VERB'), ('football', 'NOUN')],
              [('She', 'PRON'), ('enjoys', 'VERB'), ('music', 'NOUN')]]

tagger = trainer.train_supervised(train_data)
print("\nHMM Tagger Result:", tagger.tag(['She', 'plays', 'football']))



HMM Tagger Result: [('She', 'PRON'), ('plays', 'VERB'), ('football', 'NOUN')]
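
To show what an HMM tagger does with this training data, here is a minimal sketch of Viterbi decoding in plain Python. It estimates start, transition, and emission probabilities by unsmoothed relative frequency and multiplies raw probabilities, which is fine for a three-word sentence but is not how NLTK implements its tagger (the helper names `prob` and `viterbi` are hypothetical).

```python
from collections import Counter, defaultdict

train_data = [
    [("John", "NOUN"), ("plays", "VERB"), ("football", "NOUN")],
    [("She", "PRON"), ("enjoys", "VERB"), ("music", "NOUN")],
]

# Count how often each tag starts a sentence, follows another tag, and emits a word
start = Counter()
trans = defaultdict(Counter)
emit = defaultdict(Counter)
for sent in train_data:
    start[sent[0][1]] += 1
    for word, tag in sent:
        emit[tag][word] += 1
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        trans[t1][t2] += 1

tags = sorted(emit)

def prob(counter, key):
    # Relative frequency; 0.0 when the context was never observed
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

def viterbi(words):
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (prob(start, t) * prob(emit[t], words[0]), None) for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * prob(trans[p], t) * prob(emit[t], w), p)
                for p in tags
            )
        best.append(column)
    # Follow back-pointers from the most probable final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))

print(viterbi(["She", "plays", "football"]))
```

Because nothing is smoothed, any word that never appeared in training gets probability zero under every tag, so this sketch only gives meaningful results for vocabulary it has seen.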




### **Brill POS Tagger (Transformation-Based Tagging)**

In [27]:
import nltk
from nltk.tag import brill, brill_trainer

train_data = [[('He', 'PRON'), ('is', 'VERB'), ('good', 'ADJ')],
              [('She', 'PRON'), ('was', 'VERB'), ('happy', 'ADJ')]]

initial_tagger = nltk.DefaultTagger('NOUN')
templates = brill.fntbl37()
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates)

brill_tagger = trainer.train(train_data, max_rules=10)
print("\nBrill Tagger Output:", brill_tagger.tag(['He', 'is', 'happy']))



Brill Tagger Output: [('He', 'PRON'), ('is', 'VERB'), ('happy', 'ADJ')]
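
A Brill tagger starts from a simple initial tagging and then applies an ordered list of correction rules. The sketch below imitates that idea with three hand-written rules; the rules and the helper `apply_rules` are hypothetical and are not the rules NLTK learns from the training data above.

```python
def apply_rules(words, rules, default="NOUN"):
    # Step 1: tag every word with the default tag
    tags = [default] * len(words)
    # Step 2: apply each rule in order, left to right across the sentence
    for from_tag, to_tag, condition in rules:
        for i in range(len(words)):
            if tags[i] == from_tag and condition(words, tags, i):
                tags[i] = to_tag
    return list(zip(words, tags))

# Each rule: (tag to change, replacement tag, condition on the context)
rules = [
    ("NOUN", "PRON", lambda w, t, i: w[i] in {"He", "She"}),
    ("NOUN", "VERB", lambda w, t, i: i > 0 and t[i - 1] == "PRON"),
    ("NOUN", "ADJ", lambda w, t, i: i > 0 and t[i - 1] == "VERB"),
]

print(apply_rules(["He", "is", "happy"], rules))
```

Rule order matters here: the third rule can only fire after the second has turned "is" into a verb.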


### **Spell Correction**

In [28]:
from textblob import TextBlob

# Input sentence with mistakes
text = "I havv a pencll and a book"

# Create TextBlob object
blob = TextBlob(text)

# Correct the spelling
corrected_text = blob.correct()

print("Original Text:", text)
print("Corrected Text:", corrected_text)


Original Text: I havv a pencll and a book
Corrected Text: I have a pencil and a book
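
TextBlob's corrector follows the approach from Peter Norvig's "How to Write a Spelling Corrector". The sketch below shows the core idea under simplifying assumptions: it only considers candidates one edit away, and `WORD_FREQ` is a tiny made-up frequency table rather than a real corpus.

```python
import string

# Hypothetical word frequencies; a real corrector derives these from a large corpus
WORD_FREQ = {"i": 50, "have": 30, "a": 80, "pencil": 5, "and": 60, "book": 10}

def edits1(word):
    # All strings one delete, transpose, replace, or insert away from word
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Known words are left alone
    if word in WORD_FREQ:
        return word
    # Otherwise pick the most frequent known word one edit away, if any
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(" ".join(correct(w) for w in "i havv a pencll and a book".split()))
```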


### **Deduction**

In [29]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

ps = PorterStemmer()

def stem(text):
    return [ps.stem(w) for w in word_tokenize(text.lower())]

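def deduce(p, h):
    # Naive heuristic, not real logical inference: report entailment when
    # every stemmed token of the hypothesis also occurs in the premise
    return "entailment" if set(stem(h)) <= set(stem(p)) else "no entailment"

print("Stems:", stem("running runner runs easily fairer"))
print("Deduction:", deduce("All men are mortal Socrates is a man", "Socrates is mortal"))

Stems: ['run', 'runner', 'run', 'easili', 'fairer']
Deduction: entailment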
