<a href="https://colab.research.google.com/github/Burka-Developer/Machine-Learning/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization

Splitting text into smaller pieces: words, characters, or sentences.


In [None]:
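import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models; newer NLTK releases may look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('punkt_tab')

text = "I love Natural Language Processing!"
tokens = word_tokenize(text)  # split into word and punctuation tokens
print(tokens)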
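
The cell above covers word tokens only. Below is a minimal sketch of the other two granularities, characters and sentences, using just the standard library. The regex sentence splitter is a simplifying assumption: it breaks after `.`, `!` or `?` followed by whitespace, so it will mis-split abbreviations such as "Dr.", and it is not how NLTK's sentence tokenizer works.

```python
import re

text = "I love NLP. It is fun! Do you agree?"

# Character tokens: every character except whitespace
char_tokens = [ch for ch in text if not ch.isspace()]

# Sentence tokens: naive split after ., ! or ? followed by whitespace
sentence_tokens = re.split(r'(?<=[.!?])\s+', text.strip())

print(char_tokens[:5])
print(sentence_tokens)
```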


# Normalization

Cleaning text: lowercase it, remove punctuation, and collapse extra spaces.


In [None]:
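import re

text = "Hello, NLP World!!   "
no_punct = re.sub(r'[^\w\s]', '', text.lower())     # lowercase, then drop punctuation
normalized = re.sub(r'\s+', ' ', no_punct).strip()  # collapse runs of whitespace
print(normalized)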


hello nlp world


# Stemming

Reducing a word to its root by roughly chopping off suffixes, e.g. running → run

In [None]:
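from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran", "easily", "fairly"]
# Rule-based suffix stripping: results need not be dictionary words,
# and irregular forms such as "ran" are left unchanged
stems = [stemmer.stem(w) for w in words]
print(stems)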
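
To see why stemming is called a rough cut, here is a toy suffix stripper. It is a sketch, not the Porter algorithm: the `SUFFIXES` tuple and the `toy_stem` helper are hypothetical names made up for illustration, and the rule set is deliberately tiny.

```python
SUFFIXES = ("ing", "ly", "ed", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "jumped", "easily", "cats", "ran"]:
    print(w, "->", toy_stem(w))
```

Like a real stemmer, it returns non-words ("easily" becomes "easi") and cannot relate "ran" to "run", because it only looks at surface suffixes.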


# Lemmatization

Smarter than stemming: maps a word to its dictionary form (lemma) using a vocabulary and the word's part of speech.

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["running", "better", "flies"]
# pos='v' treats every word as a verb; an adjective like "better" needs pos='a'
lemmas = [lemmatizer.lemmatize(w, pos='v') for w in words]
print(lemmas)


# Corpus

A collection of texts, like a library.

In [None]:
from nltk.corpus import gutenberg
nltk.download('gutenberg')

print(gutenberg.fileids())  # file IDs of the texts in the Gutenberg corpus
print(gutenberg.raw('austen-emma.txt')[:300])  # first 300 characters of one text


# Stop Words

Words like "the", "is", "in" that carry little meaning on their own.

In [None]:
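import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
words = word_tokenize("This is a very important message.")
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)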


# POS Tagging (Part of Speech)

The grammatical role of each word: noun, verb, adjective...

In [None]:
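# Tagger model; newer NLTK releases may look for 'averaged_perceptron_tagger_eng'
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
sentence = word_tokenize("Dogs bark loudly.")
tags = nltk.pos_tag(sentence)  # list of (token, Penn Treebank tag) pairs
print(tags)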


# Parsing

Analysis of grammar and structure (the skeleton of a sentence).

In [None]:
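import spacy

spacy.prefer_gpu()  # use the GPU if one is available, otherwise stay on CPU
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
# Dependency parse: print each token with its syntactic relation label
for token in doc:
    print(f"{token.text} --> {token.dep_}")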


# Syntax

The rules of sentence structure, such as subject-verb-object order.

🧠 No special code needed: it is covered by parsing and POS tagging

# Semantics

Understanding what a word means based on its context.

🧠 Example:
"Bank" = financial institution vs. river bank → contextual models such as GPT or BERT handle this

# Pragmatics

Understanding the user's intent, not just the literal meaning.

🧠 “Can you pass the salt?” = a request, not a question about ability
→ Hard to code by hand; handled by LLMs and dialog agents

# Discourse

Link between sentences.

🧠 Example:

"I'm cold."
"I'll close the window."

→ Second sentence is a reply to the first.

# Bag of Words

Text → word-frequency dict
Grammar and word order are ignored.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

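corpus = ["I love NLP", "NLP loves me"]
# Default settings lowercase the text and skip one-character tokens such as "I"
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())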
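
The "word frequency dict" idea can also be shown without scikit-learn. This is a minimal sketch using `collections.Counter`; the `bag_of_words` helper is a hypothetical name, and whitespace splitting is an assumed, very simple tokenizer. Unlike `CountVectorizer` with default settings, it keeps the one-letter token "i".

```python
from collections import Counter

corpus = ["I love NLP", "NLP loves me"]

def bag_of_words(doc: str) -> Counter:
    """Lowercase, split on whitespace, and count each word."""
    return Counter(doc.lower().split())

bags = [bag_of_words(doc) for doc in corpus]
for doc, bag in zip(corpus, bags):
    print(doc, "->", dict(bag))
```

Each document becomes a dict of counts, so word order is lost: "love NLP I" would produce the same bag as "I love NLP".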


# n-grams

Sequences of n consecutive words, useful for phrase detection.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["I love NLP"]
# The default tokenizer drops the one-letter "I", so the only bigram is "love nlp"
vectorizer = CountVectorizer(ngram_range=(2, 2))  # bigrams only
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
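
For intuition, n-grams are just a sliding window over the token list. A minimal sketch in plain Python; the `ngrams` helper is a hypothetical name and whitespace splitting is an assumed tokenizer.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.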


# Named Entity Recognition (NER)

Detecting names, places, and organizations in text.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, "→", ent.label_)


# Sentiment Analysis

The tone of a text: positive, negative, or neutral?

In [None]:
from textblob import TextBlob

text = "I absolutely love this movie!"
blob = TextBlob(text)
# polarity ranges from -1 (negative) to 1 (positive); subjectivity from 0 to 1
print(blob.sentiment)


# Keyword Extraction

Finding the most important words in a passage, here by TF-IDF weight.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["battery life is great", "screen quality is amazing"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray())
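
The TF-IDF matrix above gives weights but does not pick keywords. Below is a sketch of that last step in plain Python. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` (the one scikit-learn documents for its default `smooth_idf=True`), and it skips the L2 normalization, which does not change the ranking within a document. `tfidf_scores` and `top_keywords` are hypothetical helper names.

```python
import math
from collections import Counter

docs = ["battery life is great", "screen quality is amazing"]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_scores(tokens):
    """Score each word as term frequency times smoothed inverse document frequency."""
    tf = Counter(tokens)
    return {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}

def top_keywords(tokens, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = tfidf_scores(tokens)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

keywords = [top_keywords(tokens) for tokens in tokenized]
for doc, kws in zip(docs, keywords):
    print(doc, "->", kws)
```

"is" appears in both documents, so it gets the lowest weight and falls out of the top three for each.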


# Statistical Language Modeling

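The probability of the next word given the words before it.
“Today is a sunny…” → likely next = “day”

🧠 GPT-style models are trained on next-word prediction; BERT instead predicts masked words
⚡ Classic models use n-gram counts; modern ones need a big corpus + LSTM/Transformer models (advanced)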
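
The next-word idea can be sketched with a count-based bigram model, the simplest statistical language model. The four-sentence corpus is made up for illustration, and `next_word_probs` is a hypothetical helper; real models are trained on far more text and use smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus (an assumption for illustration)
corpus = [
    "today is a sunny day",
    "today is a rainy day",
    "it is a sunny day",
    "a sunny morning is nice",
]

# Count how often each word follows each previous word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next | prev) from the bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("sunny")
print(probs)
print(max(probs, key=probs.get))
```

Here "sunny" is followed by "day" twice and "morning" once, so the model estimates P(day | sunny) = 2/3 and predicts "day".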

# Speech Recognition

Speech → Text

In [None]:
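import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:  # needs PyAudio and a local microphone
    print("Speak:")
    audio = r.listen(source)

try:
    # recognize_google sends the audio to Google's Web Speech API (needs internet)
    print("You said:", r.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Speech service request failed:", e)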


# Natural Language Generation (NLG)

AI writes text (like ChatGPT does)

Use:

1. GPT-2

2. T5

3. OpenAI APIs

# Word Sense Disambiguation

Same word, different meanings?
“Bat” 🦇 vs. 🏏

🧠 Use: Contextual models like BERT or WordNet similarity
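
As a rough illustration of the dictionary-based route, here is a simplified Lesk-style sketch: choose the sense whose definition shares the most words with the sentence. This is a gloss-overlap heuristic, not BERT and not a WordNet similarity measure. The `SENSES` glosses, the `STOP` list, and the helper names are all hypothetical and hand-written for this example; a real system would pull glosses from WordNet.

```python
# Hypothetical hand-written glosses; a real system would take them from WordNet
SENSES = {
    "animal": "small flying mammal that is active at night and lives in a cave",
    "sports": "wooden club used to hit the ball in cricket or baseball",
}
STOP = {"a", "an", "the", "in", "at", "is", "to", "of", "or", "and", "that", "he", "with"}

def content_words(text: str) -> set:
    """Lowercase, strip basic punctuation, and drop very common words."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOP

def disambiguate(sentence: str) -> str:
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence)
    return max(SENSES, key=lambda s: len(context & content_words(SENSES[s])))

print(disambiguate("He hit the ball with his bat."))
print(disambiguate("A bat flew out of the cave at night."))
```

The first sentence overlaps with the sports gloss on "hit" and "ball", the second with the animal gloss on "cave" and "night". With no overlap at all, `max` simply returns the first sense, which is one reason contextual models do better in practice.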