# NLP


---

## 🔄 NLP Pipeline Overview

1. **Text Input**
2. **Text Cleaning / Preprocessing**
3. **Tokenization**
4. **Stopword Removal**
5. **Stemming / Lemmatization**
6. **Part-of-Speech Tagging**
7. **Named Entity Recognition**
8. **Vectorization** (for ML)
9. **Modeling / Analysis**

---

## ✅ Step-by-Step Pipeline with Code

We’ll use **NLTK** for tokenization, stopword removal, and stemming, **spaCy** for lemmatization, POS tagging, and NER, and **scikit-learn** for vectorization.

### 🔹 1. Text Input

```python
text = "John's running in the marathon was better than his previous attempts. He studies NLP techniques daily."
```

---

### 🔹 2. Text Cleaning

```python
import re

# Remove punctuation and lowercase
clean_text = re.sub(r'[^\w\s]', '', text.lower())
print(clean_text)
```

**Output**:
`johns running in the marathon was better than his previous attempts he studies nlp techniques daily`

---

### 🔹 3. Tokenization

```python
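import nltk
from nltk.tokenize import word_tokenize

# Tokenizer data: recent NLTK releases load 'punkt_tab', older ones 'punkt'
nltk.download('punkt')
nltk.download('punkt_tab')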

tokens = word_tokenize(clean_text)
print(tokens)
```

**Output**:
`['johns', 'running', 'in', 'the', 'marathon', 'was', 'better', 'than', 'his', 'previous', 'attempts', 'he', 'studies', 'nlp', 'techniques', 'daily']`

---

### 🔹 4. Stopword Removal

```python
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)
```

**Output**:
`['johns', 'running', 'marathon', 'better', 'previous', 'attempts', 'studies', 'nlp', 'techniques', 'daily']`

---

### 🔹 5. Stemming (using NLTK)

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed)
```

**Output**:
`['john', 'run', 'marathon', 'better', 'previou', 'attempt', 'studi', 'nlp', 'techniqu', 'daili']`

---

### 🔹 6. Lemmatization (using spaCy)

```python
import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(" ".join(filtered_tokens))
lemmatized = [token.lemma_ for token in doc]
print(lemmatized)
```

**Example Output** (exact lemmas depend on the spaCy model version):
`['john', 'run', 'marathon', 'well', 'previous', 'attempt', 'study', 'nlp', 'technique', 'daily']`
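
To make the contrast with stemming concrete, here is a minimal sketch of dictionary-based lemmatization. It is not how spaCy is implemented: the `LEMMA_TABLE` entries and the `toy_lemmatize` helper are hypothetical, hand-written for this example. The point it illustrates is that a lemma depends on the word's part of speech, which is why `better` can come out as `well` (adverb) or `good` (adjective).

```python
# Hypothetical hand-written lookup keyed by (word, part of speech).
# Real lemmatizers use much larger tables plus rules; this is illustration only.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("running", "VERB"): "run",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
    ("attempts", "NOUN"): "attempt",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("better", "ADJ"), ("better", "ADV"), ("Studies", "VERB"), ("marathon", "NOUN")]:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

Unlike a stemmer, the result is always a real dictionary word, but the lookup needs a part-of-speech tag to choose between candidates.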

---

### 🔹 7. POS Tagging

```python
for token in doc:
    print((token.text, token.pos_))
```

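**Example Output** (`token.text` prints the token as it appears in `doc`, not its lemma; tags may vary by model version):

```
('johns', 'PROPN')
('running', 'VERB')
('marathon', 'NOUN')
('better', 'ADV')
...
```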

---

### 🔹 8. Named Entity Recognition (NER)

```python
doc = nlp(text)  # use original text
for ent in doc.ents:
    print((ent.text, ent.label_))
```

**Example Output** (the entities found depend on the spaCy model version):

```
('John', 'PERSON')
```

---

### 🔹 9. Vectorization (Optional for ML)

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["John is running a marathon.", "He studies NLP."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```
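
To see what the count matrix contains, the sketch below rebuilds the same bag-of-words table with the standard library only. It assumes `CountVectorizer`'s default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); `bag_of_words` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import re
from collections import Counter


def bag_of_words(corpus):
    """Return (sorted vocabulary, one row of term counts per document)."""
    # Mirror CountVectorizer's defaults: lowercase, keep tokens of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


corpus = ["John is running a marathon.", "He studies NLP."]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)  # ['he', 'is', 'john', 'marathon', 'nlp', 'running', 'studies']
for row in matrix:
    print(row)
# [0, 1, 1, 1, 0, 1, 0]
# [1, 0, 0, 0, 1, 0, 1]
```

Each column is one vocabulary term and each row is one document. The one-letter word `a` is dropped by the default token pattern.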

---

## 📌 Summary

| Stage            | Tool Used    | Output                             |
| ---------------- | ------------ | ---------------------------------- |
| Cleaning         | Regex        | Clean, lowercase text              |
| Tokenization     | NLTK / spaCy | List of words                      |
| Stopword Removal | NLTK         | Filtered tokens                    |
| Stemming         | NLTK         | Root-like forms (may be non-words) |
| Lemmatization    | spaCy        | Real base forms (dictionary words) |
| POS Tagging      | spaCy        | Tags like NOUN, VERB, etc.         |
| NER              | spaCy        | Extract entities like PERSON, DATE |
| Vectorization    | scikit-learn | Numerical matrix for modeling      |

---


## 🌱 What is Stemming?

**Stemming** is like chopping off the ends of words to get to the **root form**, even if the result isn’t a real word.

### 🔍 Think of it like this:

Imagine you're trying to group similar words together. Stemming helps by stripping common suffixes such as:

* `-ing`, `-ed`, `-ly`, `-es`, `-s`, etc.
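
The suffix-chopping idea can be sketched in a few lines. This is a deliberately naive illustration, not the Porter algorithm: the `SUFFIXES` list and the `naive_stem` helper are assumptions made for this example, and real stemmers apply ordered rules with extra conditions.

```python
# Suffixes to try, longest first so "es" is tested before "s".
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["playing", "played", "plays", "quickly", "boxes", "is"]:
    print(f"{word} -> {naive_stem(word)}")
```

A rule list this simple breaks quickly: `naive_stem("running")` gives `runn` because nothing handles the doubled consonant, which is the kind of case the extra rules in real stemmers exist for.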

---

## 🛠 How it Works (In Simple Terms)

You don't care about **perfect grammar** — you just want to match **similar words**.

For example:

* **“play”, “playing”, “played”, “plays”** → all become → **“play”**
* **“studies”, “studied”, “studying”** → all become → **“studi”** (note: not a real word)

### ✅ Why it's useful:

It helps search engines or NLP tools recognize that **“playing” and “played” are about the same concept**.
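
A minimal sketch of how a search tool could use stems to match related word forms. The `STEMS` table is a hand-written stand-in for a real stemmer's output (its values follow the examples above), and `stem`, `build_index`, and `search` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

# Stand-in for a stemmer: maps each word form to its stem.
STEMS = {
    "play": "play", "playing": "play", "played": "play", "plays": "play",
    "studies": "studi", "studied": "studi", "studying": "studi",
}


def stem(word):
    """Look up the stem; unknown words are returned lowercased and unchanged."""
    word = word.lower()
    return STEMS.get(word, word)


def build_index(documents):
    """Map each stem to the set of document ids that contain any of its forms."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents matching the stemmed query word."""
    return sorted(index.get(stem(query), set()))


documents = ["She played chess", "He is playing outside", "They studied maps"]
index = build_index(documents)
print(search(index, "plays"))     # [0, 1]
print(search(index, "studying"))  # [2]
```

The query `plays` appears in neither document, yet both are found because the index and the query are reduced to the same stem.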

---

## 🔁 Real-Life Example

### Sentence:

> "She was playing and studied hard every night."

### After Stemming:

> "she wa **play** and **studi** hard everi night."

You can see:

* "was" → "wa"
* "playing" → "play"
* "studied" → "studi"
* "every" → "everi"

Some of the stemmed words (like **studi** and **everi**) aren’t real English words, but that’s okay — the goal is to match **similar** words quickly.

---

## ⚙ Common Stemming Algorithms:

* **Porter Stemmer** (most common, fast, simple)
* **Snowball Stemmer** (also known as Porter2, a refinement of Porter)
* **Lancaster Stemmer** (more aggressive, so it over-stems more often)

---

## 🔑 Summary in Layman's Terms:

| Feature         | Description                                          |
| --------------- | ---------------------------------------------------- |
| 📌 What it does | Cuts words to a basic form (may not be a real word)  |
| 🧠 Why it helps | Groups similar words together for analysis or search |
| ⚠️ Downsides    | Can chop too much or give meaningless words          |
| 📚 Example      | “studies” → “studi”, “running” → “run”               |

---



### Sentence Tokenization

```python
corpus = """
The Great question! spaCy and NLTK are two of the most widely used Natural Language Processing (NLP) libraries in Python, but they have different focuses and strengths.
"""

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt_tab')  # sentence tokenizer data
```


```python
sent_tokenize(corpus)
```

**Output**:

```
['\nThe Great question!',
 'spaCy and NLTK are two of the most widely used Natural Language Processing (NLP) libraries in Python, but they have different focuses and strengths.']
```

### Stemming

```python
from nltk.stem import PorterStemmer

stemming = PorterStemmer()
words = ["Eating", "Eaten", "Eates", "Programming", "Programs", "Programmer"]

for word in words:
    print(f"{word} ---> {stemming.stem(word)}")
```

**Output**:

```
Eating ---> eat
Eaten ---> eaten
Eates ---> eat
Programming ---> program
Programs ---> program
Programmer ---> programm
```


### Snowball Stemmer

```python
from nltk.stem import SnowballStemmer

ball_stem = SnowballStemmer(language='english')

for word in words:
    print(f"{word}-->{ball_stem.stem(word)}")
```

**Output**:

```
Eating-->eat
Eaten-->eaten
Eates-->eat
Programming-->program
Programs-->program
Programmer-->programm
```


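```python
# Porter and Snowball disagree on some words, such as this "-ly" adverb
print(stemming.stem("Fairly"))   # Porter
print(ball_stem.stem("Fairly"))  # Snowball
```

**Output**:

```
fairli
fair
```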