### 📚 Basic NLP Terminology Explained

- **Corpus**:  
  A corpus is a collection of texts, usually **paragraphs or full documents**.  
  *Example:* A folder of news articles or a dataset of tweets.

- **Documents**:  
  A document is one individual item within a corpus, such as **a single article or paragraph**.  
  *Example:* One product review or one blog post.

- **Sentences**:  
  Each document is made up of **sentences**.  
  *Example:* "I love this product. It works great."

- **Words**:  
  Sentences are made up of **words** (also called tokens once the text has been split).  
  *Example:* "I", "love", "this", "product".

- **Vocabulary**:  
  The vocabulary is the set of **unique words** found in the entire corpus.  
  *Example:* If your corpus has the words "apple", "banana", "apple", the vocabulary is `{"apple", "banana"}`.
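
The hierarchy above (corpus → documents → sentences → words → vocabulary) can be sketched in plain Python. This is only an illustration: the two-document corpus is made up, and the regular expressions are a naive stand-in for a real tokenizer.

```python
import re

# A tiny hypothetical corpus: a list of documents (here, two short reviews).
corpus = [
    "I love this product. It works great.",
    "I love this blog post.",
]

vocabulary = set()
for document in corpus:
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document)
    for sentence in sentences:
        # Naive word split: runs of word characters, lowercased.
        words = re.findall(r"\w+", sentence.lower())
        vocabulary.update(words)

print(sorted(vocabulary))
```

Because the vocabulary is a set, repeated words such as "love" are stored only once.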


### 🔹 Key Terminologies in NLP

#### **Tokenization**
Splitting text into smaller units (tokens) like words, subwords, or characters.

---

#### **Stopwords**
Common words (like "the," "and," "is") that are often ignored in NLP tasks because they don’t contribute much to meaning.

---

#### **Stemming**
Reducing words to a root form (stem) by stripping affixes; the result is not always a real word.  
**Example:** `"running"` → `"run"`

---

#### **Lemmatization**
Reducing words to their base or dictionary form.  
**Example:** `"better"` → `"good"`

---

#### **Part-of-Speech (POS) Tagging**
Identifying the grammatical parts of speech in a sentence (e.g., noun, verb, adjective).  
**Example:** `"She runs quickly"` → `["She" (Pronoun), "runs" (Verb), "quickly" (Adverb)]`

---

#### **Named Entity Recognition (NER)**
Identifying spans of text that name real-world entities (e.g., people, organizations, locations) and classifying them by type.  
**Example:** `"Hashmi Bhatt is from Gujrat"` → `"Hashmi Bhatt"` (Person), `"Gujrat"` (Location)

---

#### **Dependency Parsing**
Analyzing the grammatical structure of a sentence to establish relationships between words.  
**Example:** `"I love programming"` → `"love"` (root verb), `"I"` (subject of "love"), `"programming"` (object of "love")

---

#### **Bag of Words (BoW)**
A simple model used to represent text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.

---

#### **TF-IDF (Term Frequency-Inverse Document Frequency)**
A statistical measure used to evaluate the importance of a word in a document relative to all other documents in a corpus.

---

#### **Word Embeddings**
A technique to represent words in continuous vector space, where similar words have similar vector representations.  
**Common methods:** Word2Vec, GloVe, FastText


# 🧩 Tokenization in NLP

### 🔹 What is Tokenization?
Tokenization is the process of breaking text into smaller units such as **sentences**, **words**, **subwords**, or **characters**.

---

### 🔸 Types of Tokenization

- **Sentence Tokenization**  
  Splits text into individual sentences.  
  📚 *Library:* `nltk.sent_tokenize`, `spaCy`

- **Word Tokenization**  
  Splits sentences into words.  
  📚 *Library:* `nltk.word_tokenize`, `spaCy`, `TextBlob`

- **Subword Tokenization**  
  Breaks words into smaller parts (useful in deep learning).  
  📚 *Library:* `SentencePiece`, Hugging Face's `tokenizers` (both implement algorithms such as Byte-Pair Encoding, BPE)

- **Character Tokenization**  
  Splits text into individual characters.  
  📚 *Library:* Simple Python code or custom logic
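
As a rough illustration of the word and character levels, the sketch below uses only the standard library. The regular expression is a simple assumption (runs of word characters, or runs of punctuation) and is not a replacement for the library tokenizers listed above.

```python
import re

text = "Mary had a little lamb."

# Word-level: runs of word characters, or runs of punctuation.
word_tokens = re.findall(r"\w+|[^\w\s]+", text)

# Character-level: every character of the string becomes a token.
char_tokens = list("lamb.")

print(word_tokens)
print(char_tokens)
```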

---

### ✅ Summary Table

| Type            | Description                     | Common Libraries                  |
|-----------------|---------------------------------|-----------------------------------|
| Sentence        | Text → Sentences                | `nltk`, `spaCy`                   |
| Word            | Sentences → Words               | `nltk`, `spaCy`, `TextBlob`       |
| Subword         | Words → Subword units           | `SentencePiece`, `tokenizers`     |
| Character       | Text → Characters               | Python string methods             |
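
Subword tokenizers are usually trained rather than hand-written. The sketch below shows one merge step of Byte-Pair Encoding training on a made-up word-frequency table: count adjacent symbol pairs, then merge the most frequent pair everywhere. The function names and counts are hypothetical, and real implementations add details (such as end-of-word markers or byte-level handling) that are left out here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word counts; each word starts as a tuple of single characters.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}

pair = most_frequent_pair(words)
words_after_merge = merge_pair(words, pair)
print(pair)
print(words_after_merge)
```

Repeating these two steps until a target vocabulary size is reached yields the list of merges that a BPE tokenizer later applies to new text.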


In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
corpus="""Mary closed on closing night when she was in the mood to close.
Mary had a little lamb.
Her fleece was white as snow"""


In [None]:
print(corpus)

Mary closed on closing night when she was in the mood to close.
Mary had a little lamb.
Her fleece was white as snow


In [None]:
## Tokenization
## Paragraph --> sentences, sentence --> words
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize import TreebankWordTokenizer

In [None]:
d1 = sent_tokenize(corpus)
print(type(d1))
d1

<class 'list'>


['Mary closed on closing night when she was in the mood to close.',
 'Mary had a little lamb.',
 'Her fleece was white as snow']

In [None]:
d2 = word_tokenize(corpus)
print(type(d2))
d2

<class 'list'>


['Mary',
 'closed',
 'on',
 'closing',
 'night',
 'when',
 'she',
 'was',
 'in',
 'the',
 'mood',
 'to',
 'close',
 '.',
 'Mary',
 'had',
 'a',
 'little',
 'lamb',
 '.',
 'Her',
 'fleece',
 'was',
 'white',
 'as',
 'snow']

In [None]:
wordpunct_tokenize(corpus)

['Mary',
 'closed',
 'on',
 'closing',
 'night',
 'when',
 'she',
 'was',
 'in',
 'the',
 'mood',
 'to',
 'close',
 '.',
 'Mary',
 'had',
 'a',
 'little',
 'lamb',
 '.',
 'Her',
 'fleece',
 'was',
 'white',
 'as',
 'snow']

In [None]:
tokenizer=TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Mary',
 'closed',
 'on',
 'closing',
 'night',
 'when',
 'she',
 'was',
 'in',
 'the',
 'mood',
 'to',
 'close.',
 'Mary',
 'had',
 'a',
 'little',
 'lamb.',
 'Her',
 'fleece',
 'was',
 'white',
 'as',
 'snow']

# 🌱 Stemming Process in NLP

### 🔹 What is Stemming?
Stemming is the process of **reducing a word to its root form** by stripping affixes, usually suffixes, with simple rules.

- Example:  
  `"playing"`, `"played"`, `"plays"` → `"play"`

> ⚠️ Note: The stemmed word may not always be a valid word in English.

---

### 🔸 Why Use Stemming?
- Helps in **text normalization**
- Reduces **dimensionality** of the data
- Groups related words together during analysis

---

### 🔸 Common Stemming Algorithms & Libraries

| Stemmer               | Description                                 | Library         |
|-----------------------|---------------------------------------------|------------------|
| **Porter Stemmer**     | Most common, rule-based, fast and simple     | `nltk.PorterStemmer` |
| **Lancaster Stemmer**  | More aggressive than Porter                  | `nltk.LancasterStemmer` |
| **Snowball Stemmer**   | An improvement over Porter (supports multiple languages) | `nltk.SnowballStemmer` |

---



In [None]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemming=PorterStemmer()

In [None]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [None]:
from nltk.stem import RegexpStemmer

In [None]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [None]:
reg_stemmer.stem('eating')


'eat'

In [None]:
reg_stemmer.stem('ingeating')

'ingeat'
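
The two results above follow from how a regular-expression stemmer works: it deletes whatever the pattern matches, and the `$` anchors restrict matches to the end of the word, so the leading `ing` in `ingeating` is left alone. A minimal standard-library sketch of that behavior (an approximation written for illustration, with a hypothetical function name, not NLTK's actual code):

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Delete a matching suffix; words shorter than min_length are returned unchanged."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

print(regexp_stem("eating"))     # suffix "ing" removed
print(regexp_stem("ingeating"))  # only the trailing "ing" is removed
print(regexp_stem("bus"))        # shorter than min_length, left as-is
```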

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
snowballsstemmer=SnowballStemmer('english')

In [None]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [None]:
# Difference between PorterStemmer and SnowballStemmer
# Porter
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [None]:
# Difference between PorterStemmer and SnowballStemmer
# Snowball
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

# 🌿 Lemmatization Process in NLP

### 🔹 What is Lemmatization?
Lemmatization reduces a word to its **base or dictionary form** (called a *lemma*), considering the **context and part of speech**.

- Example:  
  `"better"` → `"good"`  
  `"running"` → `"run"` (as a verb)

> ✅ Lemmatization returns **real words** and is generally more accurate than stemming, at the cost of needing a dictionary and, ideally, the word's part of speech.

---

### 🔸 Why Use Lemmatization?
- More accurate normalization than stemming  
- Helps in better **semantic analysis**  
- Preserves meaning by considering grammar rules

---

### 🔸 Common Lemmatization Tools & Libraries

| Tool               | Description                                 | Library         |
|--------------------|---------------------------------------------|------------------|
| **WordNet Lemmatizer** | Uses WordNet dictionary to find base form  | `nltk.WordNetLemmatizer` |
| **spaCy Lemmatizer**   | Advanced lemmatizer with POS tagging built-in | `spaCy` |
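
To see why the part of speech matters, here is a toy dictionary-based lemmatizer. The lookup table and function name are hypothetical and cover only a handful of words; real lemmatizers rely on a full lexicon such as WordNet plus morphological rules.

```python
# Hypothetical lookup table: (surface form, POS) -> lemma.
LEMMAS = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("going", "v"): "go",
    ("eaten", "v"): "eat",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if the pair is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("going", pos="v"))
print(toy_lemmatize("going", pos="n"))
print(toy_lemmatize("better", pos="a"))
```

The same surface form maps to different results depending on the POS passed in, which is why a wrong or missing POS often leaves a word unchanged.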

---

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer=WordNetLemmatizer()

In [None]:
'''
POS- Noun-n
verb-v
adjective-a
adverb-r
'''
lemmatizer.lemmatize("going",pos='v')

'go'

In [None]:
lemmatizer.lemmatize("going",pos='a')

'going'

In [None]:
lemmatizer.lemmatize("going",pos='n')

'going'

In [None]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


# 🚫 Stopwords in NLP

### 🔹 What are Stopwords?
Stopwords are **common words** (like "the", "and", "is") that are often **ignored** in text processing because they don’t add significant meaning.

- Example:  
  "The quick brown fox jumps over the lazy dog."  
  Stopwords: `"The", "over", "the"`

> ✅ **Stopwords removal** helps reduce noise in data and improve model efficiency.

---

### 🔸 Why Remove Stopwords?
- **Reduce dimensionality** by ignoring words with little semantic value
- Helps **focus on meaningful words** for tasks like sentiment analysis, topic modeling, etc.
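
A minimal sketch of stopword filtering, assuming a small hand-picked stopword set (real lists are much longer) and a naive regex tokenizer. Note the lowercase comparison: stopword lists are typically lowercase, so a case-sensitive check would let "The" through.

```python
import re

# Small hand-picked stopword set for illustration only.
STOPWORDS = {"the", "over", "a", "is", "and"}

def remove_stopwords(text):
    tokens = re.findall(r"\w+", text)
    # Compare in lowercase so that "The" is filtered as well as "the".
    return [t for t in tokens if t.lower() not in STOPWORDS]

filtered = remove_stopwords("The quick brown fox jumps over the lazy dog.")
print(filtered)
```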

---

In [None]:
## Speech Of Nikola Tesla
paragraph = """I have scarcely had courage enough to address an audience on a few unavoidable occasions, and the experience of this evening, even as disconnected from the cause of our meeting, is quite novel to me.
Although in those few instances, of which I have retained agreeable memory, my words have met with a generous reception, I never deceived myself, and knew quite well that my success was not due to any excellency in the rhetorical or demonstrative art.
Nevertheless, my sense of duty to respond to the request with which I was honored a few days ago was strong enough to overcome my very grave apprehensions in regard to my ability of doing justice to the topic assigned to me.
It is true, at times—even now, as I speak—my mind feels full of the subject, but I know that, as soon as I shall attempt expression, the fugitive conceptions will vanish, and I shall experience certain well known sensations of abandonment, chill and silence.
I can see already your disappointed countenances and can read in them the painful regret of the mistake in your choice."""

In [None]:
from nltk.corpus import stopwords

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [None]:
sentences=nltk.sent_tokenize(paragraph)

In [None]:
sentences

['I have scarcely had courage enough to address an audience on a few unavoidable occasions, and the experience of this evening, even as disconnected from the cause of our meeting, is quite novel to me.',
 'Although in those few instances, of which I have retained agreeable memory, my words have met with a generous reception, I never deceived myself, and knew quite well that my success was not due to any excellency in the rhetorical or demonstrative art.',
 'Nevertheless, my sense of duty to respond to the request with which I was honored a few days ago was strong enough to overcome my very grave apprehensions in regard to my ability of doing justice to the topic assigned to me.',
 'It is true, at times—even now, as I speak—my mind feels full of the subject, but I know that, as soon as I shall attempt expression, the fugitive conceptions will vanish, and I shall experience certain well known sensations of abandonment, chill and silence.',
 'I can see already your disappointed countenanc

In [None]:
from nltk.stem import SnowballStemmer
snowballstemmer=SnowballStemmer('english')

In [None]:
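## Remove stopwords, then apply Snowball stemming to each sentence

stop_words=set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    # The comparison is case-sensitive, so capitalized tokens such as "I" are not filtered here
    words=[snowballstemmer.stem(word) for word in words if word not in stop_words]
    sentences[i]=' '.join(words)  # join the stemmed tokens back into one string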

In [None]:
sentences

['i scarc courag enough address audienc unavoid occas , experi even , even disconnect caus meet , quit novel .',
 'although instanc , i retain agreeabl memori , word met generous recept , i never deceiv , knew quit well success due excel rhetor demonstr art .',
 'nevertheless , sens duti respond request i honor day ago strong enough overcom grave apprehens regard abil justic topic assign .',
 'it true , times—even , i speak—mi mind feel full subject , i know , soon i shall attempt express , fugit concept vanish , i shall experi certain well known sensat abandon , chill silenc .',
 'i see alreadi disappoint counten read pain regret mistak choic .']

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

In [None]:
## Remove stopwords, then lemmatize each token as a verb
## Note: `sentences` was already stemmed by the loop above, so the lemmatizer sees stems, not the original words

stop_words=set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    words=[lemmatizer.lemmatize(word.lower(),pos='v') for word in words if word not in stop_words]
    sentences[i]=' '.join(words)  # join the lemmatized tokens back into one string

In [None]:
sentences

['scarc courag enough address audienc unavoid occas , experi even , even disconnect caus meet , quit novel .',
 'although instanc , retain agreeabl memori , word meet generous recept , never deceiv , know quit well success due excel rhetor demonstr art .',
 'nevertheless , sens duti respond request honor day ago strong enough overcom grave apprehens regard abil justic topic assign .',
 'true , times—even , speak—mi mind feel full subject , know , soon shall attempt express , fugit concept vanish , shall experi certain well know sensat abandon , chill silenc .',
 'see alreadi disappoint counten read pain regret mistak choic .']

# 🏷️ Part-of-Speech (POS) Tagging in NLP

### 🔹 What is POS Tagging?
Part-of-Speech (POS) tagging is the process of **labeling each word** in a sentence with its **grammatical role**, like noun, verb, adjective, etc.

- Example:  
  `"She runs quickly"` →  
  `"She" (Pronoun), "runs" (Verb), "quickly" (Adverb)`

---

### 🔸 Why Use POS Tagging?
- Helps in **understanding sentence structure**
- Useful for **syntax analysis**, **lemmatization**, **NER**, and **translation**
- Provides **contextual meaning** to words

---

### 🔸 Common POS Tags

| Tag  | Meaning       | Example        |
|------|---------------|----------------|
| NN   | Noun          | dog, book      |
| VB   | Verb          | run, eat       |
| JJ   | Adjective     | quick, blue    |
| RB   | Adverb        | quickly, very  |
| PRP  | Pronoun       | he, she        |
| IN   | Preposition   | on, at         |
| DT   | Determiner    | the, a         |

---

### 🔸 Libraries for POS Tagging

| Library  | Function                         |
|----------|----------------------------------|
| **NLTK** | `nltk.pos_tag()`                 |
| **spaCy**| Built-in POS tagging pipeline    |
| **TextBlob** | `textblob.tags`             |

---

In [None]:
''' ALL POS TAGs
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective – ‘big’
JJR adjective, comparative – ‘bigger’
JJS adjective, superlative – ‘biggest’
LS list marker 1)
MD modal – could, will
NN noun, singular – ‘desk’
NNS noun plural – ‘desks’
NNP proper noun, singular – ‘Harrison’
NNPS proper noun, plural – ‘Americans’
PDT predeterminer – ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun –  I, he, she
PRP$ possessive pronoun – my, his, hers
RB adverb – very, silently,
RBR adverb, comparative – better
RBS adverb, superlative – best
RP particle – give up
TO – to go ‘to’ the store.
UH interjection – errrrrrrrm
VB verb, base form – take
VBD verb, past tense – took
VBG verb, gerund/present participle – taking
VBN verb, past participle – taken
VBP verb, non-3rd person singular present – take
VBZ verb, 3rd person sing. present – takes
WDT wh-determiner – which
WP wh-pronoun – who, what
WP$ possessive wh-pronoun, eg- whose
WRB wh-adverb, eg- where, when
'''


In [None]:
## Speech Of DR APJ Abdul Kalam
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""

In [None]:
from nltk.corpus import stopwords
sentences=nltk.sent_tokenize(paragraph)

In [None]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme

In [None]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
## Find the POS tags for each sentence

stop_words=set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    words=nltk.word_tokenize(sentences[i])
    # Stopwords are removed before tagging here; the tagger then sees less context,
    # which can affect the tags it assigns
    words=[word for word in words if word not in stop_words]
    tagged=nltk.pos_tag(words)
    print(tagged)

[('I', 'PRP'), ('three', 'CD'), ('visions', 'NNS'), ('India', 'NNP'), ('.', '.')]
[('In', 'IN'), ('3000', 'CD'), ('years', 'NNS'), ('history', 'NN'), (',', ','), ('people', 'NNS'), ('world', 'NN'), ('come', 'VBP'), ('invaded', 'VBN'), ('us', 'PRP'), (',', ','), ('captured', 'VBD'), ('lands', 'NNS'), (',', ','), ('conquered', 'VBD'), ('minds', 'NNS'), ('.', '.')]
[('From', 'IN'), ('Alexander', 'NNP'), ('onwards', 'NNS'), (',', ','), ('Greeks', 'NNP'), (',', ','), ('Turks', 'NNP'), (',', ','), ('Moguls', 'NNP'), (',', ','), ('Portuguese', 'NNP'), (',', ','), ('British', 'NNP'), (',', ','), ('French', 'NNP'), (',', ','), ('Dutch', 'NNP'), (',', ','), ('came', 'VBD'), ('looted', 'JJ'), ('us', 'PRP'), (',', ','), ('took', 'VBD'), ('.', '.')]
[('Yet', 'RB'), ('done', 'VBN'), ('nation', 'NN'), ('.', '.')]
[('We', 'PRP'), ('conquered', 'VBD'), ('anyone', 'NN'), ('.', '.')]
[('We', 'PRP'), ('grabbed', 'VBD'), ('land', 'NN'), (',', ','), ('culture', 'NN'), (',', ','), ('history', 'NN'), ('tried'

In [None]:
"Taj Mahal is a beautiful Monument".split()

['Taj', 'Mahal', 'is', 'a', 'beautiful', 'Monument']

In [None]:

print(nltk.pos_tag("Taj Mahal is a beautiful Monument".split()))

[('Taj', 'NNP'), ('Mahal', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'), ('Monument', 'NN')]
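
Tagger output is a plain list of `(word, tag)` tuples, so it can be filtered with ordinary Python. A small sketch using the tagged tokens copied from the output above, relying on the fact that the Penn Treebank noun tags (NN, NNS, NNP, NNPS) share the `NN` prefix and the adjective tags share `JJ`:

```python
# Tagged tokens copied from the output above.
tagged = [("Taj", "NNP"), ("Mahal", "NNP"), ("is", "VBZ"),
          ("a", "DT"), ("beautiful", "JJ"), ("Monument", "NN")]

# Keep words whose tag starts with the noun or adjective prefix.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]

print(nouns)
print(adjectives)
```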


# 🏷️ Named Entity Recognition (NER) in NLP

### 🔹 What is Named Entity Recognition (NER)?
Named Entity Recognition (NER) is the process of identifying **named entities** in a text (mostly proper nouns, plus expressions such as dates and amounts) and classifying them into predefined categories, such as **persons**, **organizations**, **locations**, **dates**, etc.

- Example:  
  `"Barack Obama is from Hawaii."`  
  **Entities**:  
  - "Barack Obama" (Person)  
  - "Hawaii" (Location)

---

### 🔸 Why Use NER?
- Extracts valuable information from text automatically
- Helps in **structured data extraction** for tasks like knowledge graph creation and data analysis
- Enables **question answering**, **summarization**, and **chatbots**

---

### 🔸 Common Named Entities Categories

| Category       | Example                  |
|----------------|--------------------------|
| **Person**     | Barack Obama, Einstein   |
| **Location**   | New York, Hawaii         |
| **Organization** | Google, Microsoft       |
| **Date**       | January 1, 2022          |
| **Time**       | 5 PM, morning            |
| **Money**      | $100, 50 euros           |
| **Percent**    | 50%, 30 percent          |
| **Miscellaneous** | Nobel Prize, Everest   |

---

### 🔸 Libraries for NER

| Library      | Function                                 |
|--------------|------------------------------------------|
| **spaCy**    | Built-in NER model, highly efficient     |
| **NLTK**     | Supports NER with pre-trained models    |
| **Stanford NER** | Java-based NER library for fine-grained categories |

---

In [None]:
import spacy

# Load spaCy's pre-trained model
nlp = spacy.load("en_core_web_sm")


In [None]:
# Process a sentence
sentence = "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel, whose company specialized in building metal frameworks and structures."
doc = nlp(sentence)

# Extract named entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

The Eiffel Tower (LOC)
1887 to 1889 (DATE)
Gustave Eiffel (PERSON)


# 🔡 Text to Vectors in NLP

### 🔸 What is Text to Vectors?

Text to Vectors (also called **Text Vectorization**) is the process of converting **textual data** into **numerical format** (vectors), so that machine learning models can understand and process the data.

---

### 🔹 Why Convert Text to Vectors?

- Most machine learning algorithms operate on numbers, so they **can't work** directly with raw text.
- Text must be **numerically represented** to perform classification, clustering, sentiment analysis, etc.

---

### 🔸 Common Text Vectorization Techniques

#### 1. **Bag of Words (BoW)**
- Creates a matrix based on **word frequency**.
- Ignores grammar and word order.
- Simple but can be very sparse.

#### 2. **TF-IDF (Term Frequency-Inverse Document Frequency)**
- Measures how **important** a word is in a document relative to the entire corpus.
- Reduces weight of common words and boosts rare but meaningful ones.

#### 3. **Word Embeddings**
- Represents words as **dense vectors** in a continuous space.
- Captures **semantic meaning** and word similarity.
- Examples:
  - **Word2Vec**
  - **GloVe**
  - **FastText**
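
"Similar words have similar vectors" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from data and typically have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_king_queen, sim_king_apple)
```

With these toy values, "king" and "queen" point in nearly the same direction, while "king" and "apple" do not.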

---


# 🧺 Bag of Words (BoW) in NLP

### 🔸 What is Bag of Words?

**Bag of Words (BoW)** is a basic and popular technique to convert **text data into numerical vectors**.  
It treats each document as a **collection (bag)** of its words, **ignoring grammar and word order**.

---

### 🔹 How It Works

1. Create a **vocabulary** of all unique words in the dataset.
2. For each document, count how many times each word from the vocabulary appears.
3. Represent the document as a **vector of word counts**.

---

### 🔸 Example

Suppose we have two sentences:

- Sentence 1: `"I love NLP"`
- Sentence 2: `"NLP is fun"`

Vocabulary: `[I, love, NLP, is, fun]`

Vectors:

- Sentence 1 → `[1, 1, 1, 0, 0]`
- Sentence 2 → `[0, 0, 1, 1, 1]`
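
The three steps and the example above can be reproduced with a few lines of plain Python. This sketch splits on whitespace and keeps the vocabulary in order of first appearance so that the vectors line up with the example; library vectorizers usually lowercase the text and sort the vocabulary instead.

```python
from collections import Counter

docs = ["I love NLP", "NLP is fun"]
tokenized = [doc.split() for doc in docs]

# Step 1: build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 2 and 3: count each vocabulary word in each document.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```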

---

### 🔹 Pros

- Easy to understand and implement
- Works well with small and clean datasets

---

### 🔹 Cons

- Ignores word order and context
- Produces **sparse vectors** (lots of zeros)
- Doesn't capture meaning or relationships between words

---

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
sentences = [
    "he is good boy",
    "she is good girl",
    "he and she is good person"
]

In [None]:
# Initialize the vectorizer with scikit-learn's built-in English stopword list, then build the BoW matrix
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(sentences)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Matrix:\n", X.toarray())

Vocabulary: ['boy' 'girl' 'good' 'person']
Bag of Words Matrix:
 [[1 0 1 0]
 [0 1 1 0]
 [0 0 1 1]]


# 📚 N-grams in NLP

### 🔸 What is an N-gram?

An **N-gram** is a contiguous sequence of **N items** (typically words or characters) from a given text.

- Helps preserve some **word order** information (unlike BoW).
- Useful in **language modeling**, **text generation**, and **feature extraction**.

---

### 🔹 Types of N-grams

| N-gram Type | Description                 | Example (Text: "I love NLP") |
|-------------|-----------------------------|-------------------------------|
| Unigram     | Single word (N = 1)         | `["I", "love", "NLP"]`        |
| Bigram      | Pair of words (N = 2)       | `["I love", "love NLP"]`      |
| Trigram     | Sequence of 3 words (N = 3) | `["I love NLP"]`              |
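
The rows of the table can be generated with a short sliding-window function. This is a minimal sketch over whitespace-separated tokens; the helper name `ngrams` is chosen here for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love NLP".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(unigrams, bigrams, trigrams)
```

A text with fewer than N tokens simply yields an empty list.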

---

### 🔸 Why Use N-grams?

- Capture **local context** and **word order**.
- Helps in **predictive text**, **spelling correction**, and **sentiment analysis**.

---

### 🔹 Pros and Cons

**✅ Pros**
- Maintains some sequence information
- Improves context understanding over BoW

**❌ Cons**
- Higher N → Larger feature space → More memory
- Rare N-grams may not generalize well

---

In [None]:
# Step 1: Import library
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Step 2: Define the input sentences
sentences = [
    "food is good",
    "food is not good"
]


In [None]:
# Basic Bag of Words (Unigrams = single words)
# BoW captures only individual words and their frequency. It doesn't understand context.
# Example: in "food is not good", "not" and "good" are counted as separate words, so the negation is lost.
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(sentences)

print("Unigram (BoW) Vocabulary:")
print(bow_vectorizer.get_feature_names_out())

print("\nBoW Matrix:")
print(X_bow.toarray())


Unigram (BoW) Vocabulary:
['food' 'good' 'is' 'not']

BoW Matrix:
[[1 1 1 0]
 [1 1 1 1]]


In [None]:
# Bigram Vectorizer (2-word combinations)
# Bigrams capture 2-word phrases, helping to preserve context like "not good" or "food is".
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(sentences)

print("\nBigram Vocabulary:")
print(bigram_vectorizer.get_feature_names_out())

print("\nBigram Matrix:")
print(X_bigram.toarray())



Bigram Vocabulary:
['food is' 'is good' 'is not' 'not good']

Bigram Matrix:
[[1 1 0 0]
 [1 0 1 1]]


In [None]:
#Trigram Vectorizer (3-word combinations)
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
X_trigram = trigram_vectorizer.fit_transform(sentences)

print("\nTrigram Vocabulary:")
print(trigram_vectorizer.get_feature_names_out())

print("\nTrigram Matrix:")
print(X_trigram.toarray())



Trigram Vocabulary:
['food is good' 'food is not' 'is not good']

Trigram Matrix:
[[1 0 0]
 [0 1 1]]


# 📊 TF-IDF (Term Frequency-Inverse Document Frequency) in NLP

### 🔸 What is TF-IDF?

**TF-IDF** is a statistical measure used to evaluate the **importance of a word** in a document relative to a collection of documents (corpus).

- **Term Frequency (TF)**: How frequently a word appears in a document.
- **Inverse Document Frequency (IDF)**: How rare a word is across all documents in the corpus; rarer words receive a higher weight.

The formula for TF-IDF is:

### TF-IDF = TF * IDF


---

### 🔹 Why Use TF-IDF?

- **Highlights important words** in a document.
- Reduces the impact of common words (e.g., "the", "and").
- Helps capture the **relevance** of words in the context of a document set.

---

### 🔸 How It Works

1. **Term Frequency (TF)**: Measures the frequency of a word in a single document.
   - Formula:  
### TF = (Number of times word appears in a document) / (Total number of words in the document)


2. **Inverse Document Frequency (IDF)**: Measures the importance of a word in the entire corpus.
   - Formula:  
### IDF = log[(Total number of documents) / (Number of documents containing the word)]


3. **TF-IDF**: The product of TF and IDF, giving us a score that reflects both the frequency of the word in a specific document and its rarity in the whole corpus.

---

### 🔹 Example

Suppose we have the following two documents:

- Document 1: `"I love NLP"`
- Document 2: `"NLP is fun"`

To calculate the TF-IDF for a word like **"NLP"**, we calculate the TF and IDF separately:

- **TF** in Document 1 for "NLP":  
`TF = 1 / 3 ≈ 0.33`  
(since "NLP" appears once in a document of 3 words)

- **IDF** for "NLP":  
`IDF = log(2 / 2) = log(1) = 0`  
(since "NLP" appears in both documents, its IDF takes the lowest possible value)

- **TF-IDF** for "NLP" in Document 1:  
`TF-IDF = 0.33 * 0 = 0`  
(a score of 0 means "NLP" does not help distinguish Document 1 from the rest of the corpus, because it appears in every document)
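
The worked example can be checked with a few lines of plain Python. This is a minimal sketch of the textbook formulas given above, not of any particular library; the function names are illustrative, the natural logarithm is an assumption (the formula above does not fix a base), and tokens are lowercased and split on whitespace.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document; natural log is an assumption.
    """
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF score of term in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


corpus = [doc.lower().split() for doc in ["I love NLP", "NLP is fun"]]

# Scores for Document 1: "nlp" is in every document, "love" only in this one
for term in ("nlp", "love"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

"nlp" scores 0 because it occurs in both documents, while "love" gets a positive score because it occurs only in Document 1.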

---




In [None]:
# Step 1: Import the necessary library
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
# Step 2: Define the sentences
sentences = [
    "good boy",
    "good girl",
    "good boy and girl"
]


In [None]:
# Step 3: Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 4: Fit and transform the sentences into TF-IDF matrix
X_tfidf = tfidf_vectorizer.fit_transform(sentences)

# Step 5: Get the vocabulary (words)
print("TF-IDF Vocabulary:")
print(tfidf_vectorizer.get_feature_names_out())

# Step 6: Print the TF-IDF matrix
print("\nTF-IDF Matrix:")
print(X_tfidf.toarray())


TF-IDF Vocabulary:
['and' 'boy' 'girl' 'good']

TF-IDF Matrix:
[[0.         0.78980693 0.         0.61335554]
 [0.         0.         0.78980693 0.61335554]
 [0.63174505 0.4804584  0.4804584  0.37311881]]
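
The matrix above gives "good" a nonzero weight even though it occurs in every sentence, which the textbook formula would score as 0. The reason is that `TfidfVectorizer`, with its default settings, uses raw counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the printed values in plain Python under those assumptions; variable names are illustrative.

```python
import math

docs = [s.split() for s in ["good boy", "good girl", "good boy and girl"]]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1, where df = documents containing the word
idf_weights = {
    word: math.log((1 + n_docs) / (1 + sum(word in doc for doc in docs))) + 1
    for word in vocab
}

matrix = []
for doc in docs:
    row = [doc.count(word) * idf_weights[word] for word in vocab]  # raw count * IDF
    norm = math.sqrt(sum(value * value for value in row))          # L2 norm of the row
    matrix.append([round(value / norm, 8) for value in row])

print(vocab)
for row in matrix:
    print(row)
```

Because of the `+ 1` term, a word that appears in every document keeps a weight of 1 before normalization instead of dropping to 0.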


# 🔠 Word2Vec in NLP

### 🔸 What is Word2Vec?

**Word2Vec** is a technique to represent words as **vectors** in a continuous vector space. It captures **semantic meaning** — similar words end up close together in that space.

---

### 🔹 Why Use Word2Vec?

- Converts text data into a format (vectors) that ML algorithms can work with.
- Captures **contextual and semantic** relationships between words.
- Helps improve model performance in text classification, sentiment analysis, etc.

---

### 🔸 How Does Word2Vec Work?

There are two training models:

1. **CBOW (Continuous Bag of Words)**  
   - Predicts a word based on context.
   - Example: From "The cat _ on the mat", predict the missing word "sits".

2. **Skip-Gram**  
   - Predicts context words from a target word.
   - Example: Given "sits", predict ["The", "cat", "on", "the", "mat"].
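
The difference between the two models is easiest to see in the training examples each one builds from a sentence. The following is a minimal sketch of that pair generation only; the function names and the fixed `window` of 3 are assumptions for illustration, and real Word2Vec training adds further details such as negative sampling.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: each word is used to predict its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_pairs(tokens, window):
    """(context, target) pairs: the neighbours together predict the middle word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


tokens = "the cat sits on the mat".split()

# CBOW example for the word at index 2 ("sits"): context list -> target
print(cbow_pairs(tokens, window=3)[2])

# Skip-gram examples whose target is "sits": one context word per pair
print([ctx for tgt, ctx in skipgram_pairs(tokens, window=3) if tgt == "sits"])
```

CBOW produces one training example per position, while Skip-gram produces one per (target, neighbour) pair, which is part of why Skip-gram trains more slowly.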

---

### 🔹 Advantages of Word2Vec

- ✅ Learns **semantic relationships** between words (e.g., king - man + woman ≈ queen).
- ✅ Efficient and **scalable** for large datasets.
- ✅ Supports vector arithmetic and analogy reasoning.
- ✅ Captures **context** in a more meaningful way than one-hot or BoW.

---

### 🔸 Disadvantages of Word2Vec

- ❌ Can't handle **out-of-vocabulary (OOV)** words (words not seen during training).
- ❌ Word vectors are **static** — the meaning of a word doesn't change with context.
  - Example: "bank" in "river bank" and "money bank" has the same vector.
- ❌ Requires large corpus to train meaningful vectors.
- ❌ Doesn't capture **morphological variations** well: "run" and "running" are learned as separate tokens with no shared subword information.

---

In [8]:
!pip install gensim



In [None]:
# Step 1: Import Word2Vec
import gensim
from gensim.models import Word2Vec

In [None]:
# Step 2: Create a bigger dataset (tokenized sentences)
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["nlp", "includes", "text", "preprocessing", "and", "vectorization"],
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "part", "of", "ai"],
    ["word2vec", "creates", "word", "embeddings"],
    ["you", "can", "use", "cbow", "or", "skipgram"],
    ["unsupervised", "learning", "is", "useful", "for", "text", "data"],
    ["love", "this", "awesome", "nlp", "course"],
    ["vector", "representations", "help", "understand", "meaning"],
    ["language", "models", "predict", "next", "word"]
]

In [None]:
# Step 3: Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

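# vector_size: dimensionality of each word vector. Higher values can encode more information but train more slowly.

# window: maximum distance between the target word and a context word. Smaller = more local context; larger = broader context.

# min_count: ignores words that occur fewer times than this. 1 keeps every word; higher values remove rare, noisy words in larger datasets.

# sg: 1 selects Skip-gram, 0 selects CBOW. Skip-gram tends to represent rare words better; CBOW trains faster.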

In [None]:
# Vector of a Word
print("Vector for 'nlp':")
print(model.wv["nlp"])

In [None]:
# Similarity Between Two Words
similarity = model.wv.similarity("nlp", "language")
print("\nSimilarity between 'nlp' and 'language':", similarity)
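
The score returned by `similarity` is the cosine similarity of the two word vectors. As a minimal sketch of what that means, the function below computes it by hand; the 3-dimensional vectors are hypothetical values chosen only for illustration, not output from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, chosen only for illustration
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))     # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))     # orthogonal
print(cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]))  # opposite direction
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions. Applied to `model.wv["nlp"]` and `model.wv["language"]`, this function should agree with `model.wv.similarity` up to floating-point error.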

In [None]:
# Most Similar Words
print("\nMost similar to 'learning':")
print(model.wv.most_similar("learning"))

In [None]:
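# Find the Odd Word Out
# Every word in the list must be in the training vocabulary; a word such as "apple",
# which never appears in `sentences`, has no vector to compare.
odd_word = model.wv.doesnt_match(["nlp", "machine", "fun", "language"])
print("\nOdd one out:", odd_word)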