# NLP


---

## 🔄 NLP Pipeline Overview

1. **Text Input**
2. **Text Cleaning / Preprocessing**
3. **Tokenization**
4. **Stopword Removal**
5. **Stemming / Lemmatization**
6. **Part-of-Speech Tagging**
7. **Named Entity Recognition**
8. **Vectorization** (for ML)
9. **Modeling / Analysis**

---

## ✅ Step-by-Step Pipeline with Code

We’ll use **NLTK** for tokenization, stopword removal, and stemming, **spaCy** for lemmatization, POS tagging, and NER, and **scikit-learn** for vectorization.

### 🔹 1. Text Input

```python
text = "John's running in the marathon was better than his previous attempts. He studies NLP techniques daily."
```

---

### 🔹 2. Text Cleaning

```python
import re

# Remove punctuation and lowercase
clean_text = re.sub(r'[^\w\s]', '', text.lower())
print(clean_text)
```

**Output**:
`johns running in the marathon was better than his previous attempts he studies nlp techniques daily`
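
Stripping every punctuation mark turns the possessive `John's` into `johns`, which later steps treat as a different word from `john`. One possible refinement, sketched here as an assumption rather than part of the pipeline above, removes a trailing `'s` before dropping the remaining punctuation. It assumes straight ASCII apostrophes, and the variable name `clean_text_alt` is only illustrative.

```python
import re

text = "John's running in the marathon was better than his previous attempts. He studies NLP techniques daily."

# Drop a trailing 's first (assumes straight ASCII apostrophes), then all other punctuation
no_possessive = re.sub(r"'s\b", "", text.lower())
clean_text_alt = re.sub(r"[^\w\s]", "", no_possessive)
print(clean_text_alt)
# john running in the marathon was better than his previous attempts he studies nlp techniques daily
```

The trade-off is that contractions such as `it's` are also shortened (to `it`), which is usually acceptable when stopwords are removed afterwards.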

---

### 🔹 3. Tokenization

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # recent NLTK releases may ask for 'punkt_tab' instead

tokens = word_tokenize(clean_text)
print(tokens)
```

**Output**:
`['johns', 'running', 'in', 'the', 'marathon', 'was', 'better', 'than', 'his', 'previous', 'attempts', 'he', 'studies', 'nlp', 'techniques', 'daily']`

---

### 🔹 4. Stopword Removal

```python
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)
```

**Output**:
`['johns', 'running', 'marathon', 'better', 'previous', 'attempts', 'studies', 'nlp', 'techniques', 'daily']`

---

### 🔹 5. Stemming (using NLTK)

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed)
```

**Output**:
`['john', 'run', 'marathon', 'better', 'previou', 'attempt', 'studi', 'nlp', 'techniqu', 'daili']`

---

### 🔹 6. Lemmatization (using spaCy)

```python
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(" ".join(filtered_tokens))
lemmatized = [token.lemma_ for token in doc]
print(lemmatized)
```

**Example Output** (exact lemmas and casing depend on the spaCy model version):
`['john', 'run', 'marathon', 'well', 'previous', 'attempt', 'study', 'NLP', 'technique', 'daily']`

---

### 🔹 7. POS Tagging

```python
for token in doc:
    print((token.text, token.pos_))
```

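**Example Output** (tags depend on the model; `token.text` is the surface form, so print `token.lemma_` if you want lemmas instead):

```
('johns', 'PROPN')
('running', 'VERB')
('marathon', 'NOUN')
('better', 'ADV')
...
```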

---

### 🔹 8. Named Entity Recognition (NER)

```python
doc = nlp(text)  # use original text
for ent in doc.ents:
    print((ent.text, ent.label_))
```

**Example Output** (the entities found depend on the model):

```
('John', 'PERSON')
```

---

### 🔹 9. Vectorization (Optional for ML)

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["John is running a marathon.", "He studies NLP."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

---

## 📌 Summary

| Stage            | Tool Used    | Output                             |
| ---------------- | ------------ | ---------------------------------- |
| Cleaning         | Regex        | Clean, lowercase text              |
| Tokenization     | NLTK / spaCy | List of words                      |
| Stopword Removal | NLTK         | Filtered tokens                    |
| Stemming         | NLTK         | Root-like forms (may be non-words) |
| Lemmatization    | spaCy        | Real base forms (dictionary words) |
| POS Tagging      | spaCy        | Tags like NOUN, VERB, etc.         |
| NER              | spaCy        | Extract entities like PERSON, DATE |
| Vectorization    | scikit-learn | Numerical matrix for modeling      |

---


## 🌱 What is Stemming?

**Stemming** is like chopping off the ends of words to get to the **root form**, even if the result isn’t a real word.

### 🔍 Think of it like this:

Imagine you're trying to group similar words together. Stemming helps by removing suffixes such as:

* `-ing`, `-ed`, `-ly`, `-es`, `-s`, etc.

---

## 🛠 How it Works (In Simple Terms)

You don't care about **perfect grammar** — you just want to match **similar words**.

For example:

* **“play”, “playing”, “played”, “plays”** → all become → **“play”**
* **“studies”, “studied”, “studying”** → → **“studi”** (note: not a real word)
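
The suffix-chopping idea can be sketched in a few lines. This is a deliberately naive illustration, not the Porter algorithm: the suffix list, the minimum stem length, and the name `naive_stem` are all assumptions made for the example.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["play", "playing", "played", "plays", "studies", "studied", "studying"]:
    print(w, "->", naive_stem(w))
# play -> play, playing -> play, played -> play, plays -> play
# studies -> studi, studied -> studi, studying -> study
```

Note that the naive version leaves `studying` as `study`, so it does not group with `studi`. Real stemmers such as Porter add further rules (for example, rewriting a trailing `y`) so that all three forms collapse to `studi`.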

### ✅ Why it's useful:

It helps search engines or NLP tools recognize that **“playing” and “played” are about the same concept**.

---

## 🔁 Real-Life Example

### Sentence:

> "She was playing and studied hard every night."

### After Stemming:

> "she wa **play** and **studi** hard everi night."

You can see:

* "was" → "wa"
* "playing" → "play"
* "studied" → "studi"
* "every" → "everi"

Some of the stemmed words (like **studi** and **everi**) aren’t real English words, but that’s okay — the goal is to match **similar** words quickly.

---

## ⚙ Common Stemming Algorithms:

* **Porter Stemmer** (most common, fast, simple)
* **Snowball Stemmer** (improved version of Porter)
* **Lancaster Stemmer** (more aggressive)

---

## 🔑 Summary in Layman's Terms:

| Feature         | Description                                          |
| --------------- | ---------------------------------------------------- |
| 📌 What it does | Cuts words to a basic form (may not be a real word)  |
| 🧠 Why it helps | Groups similar words together for analysis or search |
| ⚠️ Downsides    | Can chop too much or give meaningless words          |
| 📚 Example      | “Studies” → “Studi”, “Running” → “Run”               |

---



In [2]:
corpus = """
The Great question! spaCy and NLTK are two of the most widely used Natural Language Processing (NLP) libraries in Python, but they have different focuses and strengths.
"""

In [5]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt_tab')


In [6]:
sent_tokenize(corpus)

['\nThe Great question!',
 'spaCy and NLTK are two of the most widely used Natural Language Processing (NLP) libraries in Python, but they have different focuses and strengths.']

### Stemming

In [7]:
from nltk.stem import PorterStemmer

In [8]:
stemming = PorterStemmer()

In [9]:
words = ["Eating","Eaten","Eates","Programming","Programs","Programmer"]

In [11]:
for word in words:
    print(f"{word} ---> {stemming.stem(word)}")

Eating ---> eat
Eaten ---> eaten
Eates ---> eat
Programming ---> program
Programs ---> program
Programmer ---> programm


### Snowball Stemmer

In [12]:
from nltk.stem import SnowballStemmer

In [14]:
ball_stem = SnowballStemmer(
    language='english'
)

In [15]:
for word in words:
    print(f"{word}-->{ball_stem.stem(word)}")

Eating-->eat
Eaten-->eaten
Eates-->eat
Programming-->program
Programs-->program
Programmer-->programm


In [16]:
stemming.stem("Fairly")

'fairli'

In [17]:
ball_stem.stem("Fairly")

'fair'



---

## 📌 What is Lemmatization?

**Lemmatization** is the process of reducing a word to its **base or dictionary form** (called a **lemma**). Unlike stemming, it aims to return real dictionary words; a word the lemmatizer does not recognize is returned unchanged.

---

## 🧠 What is WordNetLemmatizer?

* `WordNetLemmatizer` is a tool in **NLTK (Natural Language Toolkit)**.
* It uses the **WordNet** lexical database to find the **correct lemma** based on the **part of speech (POS)**.
* It’s **smarter than stemming** because it takes the part of speech into account (verb vs noun). It does not infer the POS itself: you pass it in, and it defaults to noun.

---

## 🔄 Difference: Lemmatization vs Stemming

| Word      | Stemming | Lemmatization |
| --------- | -------- | ------------- |
| `studies` | `studi`  | `study`       |
| `better`  | `better` | `good`        |
| `running` | `run`    | `run`         |
| `was`     | `wa`     | `be`          |

🔎 **Lemmatization uses grammar rules + dictionary**
✂️ **Stemming just chops word endings**
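
To make the "dictionary + part of speech" idea concrete, here is a toy lookup-based lemmatizer. It is only a sketch: the hand-written `LEMMA_TABLE` and the name `toy_lemmatize` are assumptions for illustration, whereas real lemmatizers rely on a full lexical database such as WordNet plus morphological rules.

```python
# Tiny hand-written lemma table: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("running", "verb"): "run",
    ("was", "verb"): "be",
}

def toy_lemmatize(word, pos):
    """Look up (word, pos); fall back to the lowercased word when unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("better", "adj"))   # good
print(toy_lemmatize("better", "adv"))   # well
print(toy_lemmatize("was", "verb"))     # be
print(toy_lemmatize("was", "noun"))     # was (no entry, returned unchanged)
```

The same surface form maps to different lemmas depending on the POS, which is why the lemma column in the table above only holds when the right tag is supplied.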

---

## 📚 Real-World Example (Layman Terms)

Imagine a **search engine**. If someone searches for:

> “How I **studied** English”

You’d also want it to find:

* “study English”
* “studying English”
* “studies in English”

With **lemmatization** (given the right part-of-speech tags), all forms become → `study` ✅
This helps produce **better search results, more accurate text mining, and cleaner data**.

---

## ✅ Python Example using WordNetLemmatizer

### 🔹 Step-by-step:

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus.reader.wordnet import VERB, NOUN, ADJ, ADV

# Download necessary resources
# (recent NLTK releases may ask for 'punkt_tab' and
#  'averaged_perceptron_tagger_eng' instead of the last two names below)
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')         # For tokenization
nltk.download('averaged_perceptron_tagger')  # For POS tagging

# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example text
text = "The children are running and studies were being conducted."

# Tokenize the text
tokens = word_tokenize(text)

# Helper to convert NLTK (Penn Treebank) POS tags to WordNet POS tags

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return ADJ
    elif tag.startswith('V'):
        return VERB
    elif tag.startswith('N'):
        return NOUN
    elif tag.startswith('R'):
        return ADV
    else:
        return NOUN  # default

# POS tagging
pos_tags = nltk.pos_tag(tokens)

# Lemmatize each word with POS
lemmatized = [
    lemmatizer.lemmatize(word, get_wordnet_pos(pos))
    for word, pos in pos_tags
]

print("Original:", tokens)
print("Lemmatized:", lemmatized)
```

---

### 🔎 Output:

```
Original: ['The', 'children', 'are', 'running', 'and', 'studies', 'were', 'being', 'conducted', '.']
Lemmatized: ['The', 'child', 'be', 'run', 'and', 'study', 'be', 'be', 'conduct', '.']
```

💡 Notice:

* "children" → "child"
* "running" → "run"
* "studies" → "study"
* "conducted" → "conduct"
* "being" → "be"

✅ These are the **true base forms** used in dictionaries.

---

## 🔧 When to Use Lemmatization?

Use **lemmatization** when:

* You care about **grammatically correct base forms**
* You're building search engines, chatbots, question answering systems
* You need **clean input** for sentiment analysis, classification, etc.

---

## 🧠 Summary

| Concept                   | Description                                 |
| ------------------------- | ------------------------------------------- |
| **Lemmatization**         | Converts a word to its dictionary form      |
| **WordNetLemmatizer**     | NLTK tool that uses WordNet dictionary      |
| **Needs POS?**            | Yes — works better with part-of-speech tags |
| **Better than stemming?** | Usually more accurate, but slower; stemming often suffices for search-style matching |

---




In [21]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')


In [22]:
lemma = WordNetLemmatizer()

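In [31]:
# The output below is unchanged: the words are capitalized (WordNet lookups are
# case-sensitive) and pos='n' treats each one as a noun.
# Try lemma.lemmatize(i.lower(), pos='v') to lemmatize them as verbs.
for i in words:
    print(lemma.lemmatize(i,pos = 'n'))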

Eating
Eaten
Eates
Programming
Programs
Programmer


## Lemmatization Use Cases
### Q&A, chatbots, text summarization

## Stop Words

In [33]:
para = """
Natural Language Processing (NLP) is a subfield of artificial intelligence. It focuses on enabling computers to understand and process human languages. Many real-world applications such as chatbots, sentiment analysis, and machine translation use NLP.

However, processing raw text isn't straightforward. It requires multiple steps like tokenization, stopword removal, stemming or lemmatization, and vectorization. Each step helps simplify and clean the data before using it in machine learning models.

Removing stopwords like "is", "the", "and", or "in" helps reduce noise and improves model accuracy. These words are common but usually don't add much meaning to the content.

"""

In [36]:
import nltk

In [37]:
nltk.download('stopwords')


In [38]:
from nltk.corpus import stopwords

In [44]:
stop_words = stopwords.words('english')

In [45]:
stop_words

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on... (output truncated)
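
Filtering tokens against this list is one membership test per token, so converting the list to a `set` first keeps each lookup constant-time. A minimal sketch follows; it uses a small hand-picked subset of the words shown above instead of the full NLTK list so that it runs without downloads, and `SAMPLE_STOP_WORDS` and `remove_stopwords` are illustrative names.

```python
# Hand-picked subset of the NLTK English stopword list (not the full list)
SAMPLE_STOP_WORDS = {"a", "and", "in", "is", "it", "of", "on"}

def remove_stopwords(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Keep tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "Natural Language Processing is a subfield of artificial intelligence".split()
print(remove_stopwords(tokens))
# ['Natural', 'Language', 'Processing', 'subfield', 'artificial', 'intelligence']
```

With the real list, the equivalent setup would be `stop_words = set(stopwords.words('english'))`.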

In [None]:
from nltk.stem import PorterStemmer,WordNetLemmatizer

lemma = WordNetLemmatizer()
stemmer = PorterStemmer()


In [47]:
para_token = nltk.sent_tokenize(para)

In [74]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

# Download resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample paragraph
para = """
Natural Language Processing (NLP) is a subfield of artificial intelligence. It focuses on enabling computers to understand and process human languages. Many real-world applications such as chatbots, sentiment analysis, and machine translation use NLP.
"""

# Setup tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Tokenize paragraph into sentences
para_token = sent_tokenize(para)

# Dictionaries to hold processed sentences
sentence_stem = {}
sentence_lemma = {}

for i in range(len(para_token)):
    # Tokenize each sentence into words
    words = word_tokenize(para_token[i])

    # Remove stopwords and punctuation
    filtered = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]

    # Apply stemming
    stem_words = [stemmer.stem(word) for word in filtered]

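    # Apply lemmatization as verbs (pos='v'). Note: this runs on the already-stemmed
    # words, so the result matches the stems; pass `filtered` to lemmatize the originals
    lemma_words = [lemmatizer.lemmatize(word,pos='v') for word in stem_words]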

    # Save results
    sentence_stem[i] = ' '.join(stem_words)
    sentence_lemma[i] = ' '.join(lemma_words)

# Output
print("Stemmed Sentences:\n", sentence_stem)
print("\nLemmatized Sentences:\n", sentence_lemma)


Stemmed Sentences:
 {0: 'natur languag process nlp subfield artifici intellig', 1: 'focus enabl comput understand process human languag', 2: 'mani applic chatbot sentiment analysi machin translat use nlp'}

Lemmatized Sentences:
 {0: 'natur languag process nlp subfield artifici intellig', 1: 'focus enabl comput understand process human languag', 2: 'mani applic chatbot sentiment analysi machin translat use nlp'}




In [75]:
sentence_stem

{0: 'natur languag process nlp subfield artifici intellig',
 1: 'focus enabl comput understand process human languag',
 2: 'mani applic chatbot sentiment analysi machin translat use nlp'}

In [76]:
sentence_lemma

{0: 'natur languag process nlp subfield artifici intellig',
 1: 'focus enabl comput understand process human languag',
 2: 'mani applic chatbot sentiment analysi machin translat use nlp'}

In [69]:
sentence_stem

{0: 'natur languag process ( nlp ) subfield artifici intellig .',
 1: 'it focus enabl comput understand process human languag .',
 2: 'mani real-world applic chatbot , sentiment analysi , machin translat use nlp .',
 3: "howev , process raw text n't straightforward .",
 4: 'it requir multipl step like token , stopword remov , stem lemmat , vector .',
 5: 'each step help simplifi clean data use machin learn model .',
 6: "remov stopword like `` '' , `` '' , `` '' , `` '' help reduc nois improv model accuraci .",
 7: "these word common usual n't add much mean content ."}

In [70]:
sentence_lemma

{0: 'natur languag process ( nlp ) subfield artifici intellig .',
 1: 'it focus enabl comput understand process human languag .',
 2: 'mani real-world applic chatbot , sentiment analysi , machin translat use nlp .',
 3: "howev , process raw text n't straightforward .",
 4: 'it requir multipl step like token , stopword remov , stem lemmat , vector .',
 5: 'each step help simplifi clean data use machin learn model .',
 6: "remov stopword like `` '' , `` '' , `` '' , `` '' help reduc nois improv model accuraci .",
 7: "these word common usual n't add much mean content ."}

In [77]:
lemmatizer.lemmatize("dogs",pos='v')

'dog'