<a href="https://colab.research.google.com/github/Burka-Developer/Machine-Learning/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization

Splitting text into pieces: words, characters, or sentences.


In [13]:
import nltk
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

text = "I love Natural Language Processing!"
tokens = word_tokenize(text)
print(tokens)


['I', 'love', 'Natural', 'Language', 'Processing', '!']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
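
The cell above produces word-level tokens only. As a minimal sketch of the other two granularities, the following uses just the standard library. The regex sentence splitter is an assumption made for illustration and is much cruder than NLTK's `sent_tokenize`; the sample text is made up.

```python
import re

text = "I love NLP. It is fun! Do you agree?"

# Character-level tokens: every character, including punctuation
char_tokens = list("NLP!")

# Naive sentence-level tokens: split on whitespace that follows ., ! or ?
# (an assumption for illustration; it breaks on abbreviations such as "Dr.")
sentence_tokens = re.split(r'(?<=[.!?])\s+', text.strip())

print(char_tokens)
print(sentence_tokens)
```

For real text, a trained sentence tokenizer is usually the safer choice, since punctuation alone is an unreliable boundary signal.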


# Normalization

Cleaning text: lowercase it, remove punctuation, and trim extra spaces.


In [14]:
import re

text = "Hello, NLP World!!   "
normalized = re.sub(r'[^\w\s]', '', text.lower()).strip()
print(normalized)


hello nlp world
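
Note that `.strip()` only trims the ends of the string; runs of spaces inside the text are left alone. A minimal sketch of one way to collapse those too, where the helper name `normalize` and the sample string are hypothetical:

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace runs to one space."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove anything that is not a word char or whitespace
    text = re.sub(r'\s+', ' ', text)     # collapse runs of spaces, tabs, and newlines
    return text.strip()

result = normalize("  Hello,   NLP    World!!  ")
print(result)
```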


# Stemming

Reducing a word to its root with a rough cut of the suffix. For example: running → run

In [15]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran", "easily", "fairly"]
stems = [stemmer.stem(w) for w in words]
print(stems)


['run', 'runner', 'ran', 'easili', 'fairli']
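
To see why stemming is only a rough cut, here is a toy suffix stripper. It is not the Porter algorithm: `toy_stem` and its hand-picked suffix list are hypothetical, and the point is only that blind suffix removal can produce non-words.

```python
SUFFIXES = ("ing", "ly", "ed", "s")  # hand-picked for illustration only

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["running", "easily", "jumped", "cats", "ran"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```

Real stemmers such as Porter add many extra rules (for example, handling doubled consonants), which is why `PorterStemmer` returns `run` for `running` while this sketch does not.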


# Lemmatization

A smarter alternative to stemming: it uses a dictionary (WordNet) and the word's part of speech to return a real base form.

In [16]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["running", "better", "flies"]
lemmas = [lemmatizer.lemmatize(w, pos='v') for w in words]
print(lemmas)


[nltk_data] Downloading package wordnet to /root/nltk_data...


['run', 'better', 'fly']
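
In the output above, "better" is unchanged because it was looked up as a verb (`pos='v'`). The sketch below illustrates the idea that the lemma depends on the part of speech. It uses a hypothetical hand-written lookup table, not WordNet data, and `toy_lemmatize` is an invented helper.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech); not WordNet data
LEMMAS = {
    ("running", "v"): "run",
    ("flies", "v"): "fly",
    ("flies", "n"): "fly",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("better", "v"))  # no verb entry, so the word is returned as-is
print(toy_lemmatize("better", "a"))  # the adjective entry maps to its base form
```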


# Corpus

A collection of texts, like a library.

In [17]:
from nltk.corpus import gutenberg
nltk.download('gutenberg')

print(gutenberg.fileids())  # list the files in the Gutenberg corpus
print(gutenberg.raw('austen-emma.txt')[:300])


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was t


# Stop Words

Words like "the", "is", "in" that carry little meaning on their own.

In [18]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
words = word_tokenize("This is a very important message.")
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)


['important', 'message', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# POS Tagging (Part of Speech)

The role of each word: noun, verb, adjective...

In [20]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [21]:
nltk.download('averaged_perceptron_tagger')
sentence = word_tokenize("Dogs bark loudly.")
tags = nltk.pos_tag(sentence)
print(tags)


[('Dogs', 'NNS'), ('bark', 'NN'), ('loudly', 'RB'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Parsing

Grammar and structure analysis (the skeleton of the sentence).

In [23]:
!pip install benepar

Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.6.0->benepar)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.6.0->benepar)
  Downloading nvidia_cublas_

In [24]:
!pip install spacy



In [25]:
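import benepar  # imported here but not used in this cell
import spacy
spacy.prefer_gpu()  # use the GPU if one is available
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
# Print each token with its dependency label from spaCy's parser
for token in doc:
    print(f"{token.text} --> {token.dep_}")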


The --> det
cat --> nsubj
sat --> ROOT
on --> prep
the --> det
mat --> pobj
. --> punct


# Syntax

The rules of sentence structure, such as subject, verb, object order.

🧠 No special code needed: it is covered by parsing and POS tagging

# Semantics

Understanding what a word means based on its context.

🧠 Example:
"Bank" = financial institution vs. river bank → contextual models such as GPT or BERT handle this

# Pragmatics

Understanding the user's intent, not just the literal meaning.

🧠 “Can you pass the salt?” = request, not a question
→ Hard to capture with hand-written rules; handled by LLMs and dialog agents

# Discourse

How sentences link to one another.

🧠 Example:

"I'm cold."
"I'll close the window."

→ Second sentence is a reply to the first.

# Bag of Words

Text → word-frequency counts.
Grammar and word order are ignored.



In [26]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP loves me"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())


['love' 'loves' 'me' 'nlp']
[[1 0 0 1]
 [0 1 1 1]]
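
To make the counting explicit, here is a standard-library sketch that rebuilds the same matrix by hand. The regex mirrors `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens with two or more word characters, which is why "I" is missing from the vocabulary above. The helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

corpus = ["I love NLP", "NLP loves me"]

def tokenize(doc):
    # Lowercase and keep tokens with at least two word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of token frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Vocabulary: all distinct tokens, sorted alphabetically
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(matrix)
```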


# n-grams

A sequence of n consecutive words, useful for phrase detection.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["I love NLP"]
vectorizer = CountVectorizer(ngram_range=(2, 2))  # bigrams
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())


['love nlp']
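
Only one bigram appears above because `CountVectorizer` drops the single-character token "I" before forming n-grams. A minimal sketch of the sliding-window idea on a plain whitespace split, where the helper name `ngrams` is hypothetical:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I love NLP".lower().split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so three tokens give two bigrams and one trigram.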


# Named Entity Recognition (NER)

Detect names, places, and organizations in text.

In [28]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, "→", ent.label_)


Barack Obama → PERSON
Hawaii → GPE


# Sentiment Analysis

The tone of a text: positive, negative, or neutral?

In [29]:
from textblob import TextBlob

text = "I absolutely love this movie!"
blob = TextBlob(text)
print(blob.sentiment)


Sentiment(polarity=0.625, subjectivity=0.6)


# Keyword Extraction

Finding the most important words in a passage, scored here with TF-IDF.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["battery life is great", "screen quality is amazing"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray())


['amazing' 'battery' 'great' 'is' 'life' 'quality' 'screen']
[[0.         0.53404633 0.53404633 0.37997836 0.53404633 0.
  0.        ]
 [0.53404633 0.         0.         0.37997836 0.         0.53404633
  0.53404633]]
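
The numbers above can be reproduced by hand. With scikit-learn's default settings, each term's weight is its count times a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and each document vector is then scaled to unit length. The sketch below follows that recipe with the standard library only (the helper name `tfidf` is hypothetical) and ranks each document's words from most to least distinctive.

```python
import math
from collections import Counter

docs = ["battery life is great", "screen quality is amazing"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document: smoothed idf, then L2 normalisation."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    # Rank terms from highest to lowest weight
    ranked = sorted(w, key=w.get, reverse=True)
    print(doc, "->", ranked)
```

"is" occurs in both documents, so it gets the lowest weight in each and ends up last in the ranking; words unique to one document score highest.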


# Statistical Language Modeling

Estimating the probability of the next word.
“Today is a sunny…” → likely next = “day”

🧠 Used in GPT (next-word prediction); BERT uses a related masked-word objective
⚡ Needs big corpus + LSTM/Transformer models (advanced)
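
Before reaching for neural models, the idea can be sketched with plain bigram counts: estimate P(next | previous) as count(previous, next) / count(previous). The three-sentence corpus below is made up for illustration and the helper name `next_word_probs` is hypothetical.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model needs far more text
corpus = [
    "today is a sunny day",
    "today is a rainy day",
    "it was a sunny day",
]

# Count how often each word follows each other word
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next | prev) from bigram counts."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_word_probs("sunny"))
print(next_word_probs("a"))
```

In this corpus "sunny" is always followed by "day", while "a" is followed by "sunny" twice and "rainy" once. A word never seen as a predecessor gets an empty distribution, which is the sparsity problem that smoothing and neural models address.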

# Speech Recognition

Speech → Text

In [36]:
!pip install SpeechRecognition



In [34]:
import speech_recognition as sr
# The microphone does not work in Colab
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Speak:")
    audio = r.listen(source)
    print("You said:", r.recognize_google(audio))


AttributeError: Could not find PyAudio; check installation

# Natural Language Generation (NLG)

AI writes text (like ChatGPT does)

Use:

1. GPT-2

2. T5

3. OpenAI APIs

# Word Sense Disambiguation

Same word, different meanings?
“Bat” 🦇 vs. 🏏

🧠 Use: Contextual models like BERT or WordNet similarity
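
A classic non-neural baseline is the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the surrounding context. It is a gloss-overlap method, not one of the options named above, and the two glosses and the stop word list below are hand-written assumptions rather than WordNet entries.

```python
# Hand-written glosses for illustration; a real system would take them from a dictionary
SENSES = {
    "animal": "small flying mammal that is active at night and lives in caves",
    "cricket": "wooden club used to hit the ball in cricket or baseball",
}
# Very small stop word list so that filler words do not count as overlap
STOP = {"the", "a", "an", "in", "at", "to", "of", "is", "that", "and", "or", "with"}

def simplified_lesk(context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split()) - STOP

    def overlap(sense):
        gloss_words = set(SENSES[sense].split()) - STOP
        return len(context_words & gloss_words)

    return max(SENSES, key=overlap)

print(simplified_lesk("the bat flew out of the caves at night"))
print(simplified_lesk("he swung the bat to hit the ball"))
```

The first sentence shares "caves" and "night" with the animal gloss; the second shares "hit" and "ball" with the cricket gloss. Exact word matching is brittle, which is one reason contextual models tend to do better on this task.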