<a href="https://colab.research.google.com/github/Jacobgokul/ML-Playground/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is NLP?
Natural Language Processing (NLP) is a field in AI that helps computers understand, interpret, and generate human language.

✅ It's how machines work with text and speech; Google Translate, Siri, and Alexa are everyday examples.

# Why NLP Matters
Humans speak in natural languages like English, Tamil, or Hindi, but computers operate only on numbers. NLP bridges the two by turning human language into numeric representations a machine can process, and turning the results back into language.



# What Exactly Does NLP Do?

| Task                    | Example                                             |
| ----------------------- | --------------------------------------------------- |
| **Understand Meaning**  | Recognize that “I’m feeling down” signals sadness   |
| **Extract Information** | Pull names, dates, locations from articles (NER)    |
| **Translate Languages** | Convert English → Japanese using translation models |
| **Generate Text**       | Write a paragraph or code based on your input       |
| **Summarize Documents** | Condense a 2000-word article into 3 lines           |
| **Answer Questions**    | Reply to a user's question, as ChatGPT does         |


# Key Concepts | Techniques

## 1. Tokenization

What: Splits sentences into words or subwords.

Why: ML models can’t consume raw text directly; they need it split into discrete units (tokens) that can be mapped to numbers.

Types:

Word tokenization → ["Hello", "world"]

Character tokenization → ["H", "e", "l", "l", "o"]

Subword tokenization (used in Transformers; in BERT's WordPiece, "##" marks a piece that continues the previous one) → "playing" → ["play", "##ing"]

```
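import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of tokenizer data; newer NLTK releases may ask for 'punkt_tab'
word_tokenize("I'm learning NLP.")
# ['I', "'m", 'learning', 'NLP', '.']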
```
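
The three tokenization types listed above can be sketched in plain Python, without any library. This is only an illustration: the function names and `toy_vocab` are made up for this example, and the subword function is a simplified greedy longest-match in the style of WordPiece, not the tokenizer of any particular model.

```python
import re

def word_tokens(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Every character is its own token.
    return list(text)

def subword_tokens(word, vocab):
    # Greedy longest-match-first; pieces after the first carry a "##" prefix.
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed"}  # hypothetical vocabulary

print(word_tokens("Hello world"))            # ['Hello', 'world']
print(char_tokens("Hello"))                  # ['H', 'e', 'l', 'l', 'o']
print(subword_tokens("playing", toy_vocab))  # ['play', '##ing']
```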

## 2. Stopword Removal
What: Remove frequent/common words like the, is, a, an.

Why: These words occur a lot but carry little meaning in classification tasks.
```
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stopword lists
stopwords.words('english')  # includes 'is', 'the', etc.
```
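
To show the removal step itself, here is a minimal sketch that filters a token list against a stopword set. The four-word `STOPWORDS` set is hand-picked for this example (an assumption, not NLTK's list, which is much longer), and `remove_stopwords` is a hypothetical helper name.

```python
# Tiny hand-picked set for illustration only.
STOPWORDS = {"the", "is", "a", "an"}

def remove_stopwords(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The movie is a masterpiece".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['movie', 'masterpiece']
```

In practice the set would likely come from `stopwords.words('english')` rather than being written by hand.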

## 3. Stemming vs Lemmatization
| Stemming            | Lemmatization               |
| ------------------- | --------------------------- |
| Cuts suffix         | Finds proper root word      |
| “studies” → “studi” | “studies” → “study”         |
| Less accurate       | More linguistically correct |

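Stemming:
 Running -> strip 'ing' -> Runn (a naive rule; practical stemmers such as Porter also undo the doubled consonant and return 'run')

Lemmatization:
 Running -> look up the root word (lemma) -> Run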
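
The difference can be sketched with two toy functions. Both are assumptions made for illustration: `naive_stem` applies a few hard-coded suffix rules, and `TOY_LEMMAS` stands in for the full lexicon (such as WordNet) that a real lemmatizer would consult.

```python
def naive_stem(word):
    # Apply the first matching suffix rule; no dictionary is consulted.
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

# Hypothetical two-entry lemma dictionary.
TOY_LEMMAS = {"studies": "study", "running": "run"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return TOY_LEMMAS.get(word, word)

for w in ("studies", "running"):
    print(w, "-> stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
# studies -> stem: studi | lemma: study
# running -> stem: runn | lemma: run
```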

## 4. Bag of Words (BoW)
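Each word in the vocabulary becomes one feature (one position) in a vector. A sentence is represented by how many times each vocabulary word occurs in it; word order is ignored.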
```
Vocabulary: ['I', 'love', 'NLP']
Sentence: "I love NLP" → [1,1,1]
Sentence: "I love Python" → [1,1,0]
```
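
The vectors above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and a fixed, hand-given vocabulary; `bag_of_words` is a name chosen for this example.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vocabulary = ["I", "love", "NLP"]
print(bag_of_words("I love NLP", vocabulary))     # [1, 1, 1]
print(bag_of_words("I love Python", vocabulary))  # [1, 1, 0]
```

Note that "Python" is silently dropped because it is not in the vocabulary.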

## 5. TF-IDF (Term Frequency-Inverse Document Frequency)
What: Improves BoW by reducing the weight of words that are common across documents.

Why: Words like "good", "the", "very" may appear in every document, but we want to focus on rare but important terms.

### TF
- Measures how often a term appears in a document.
- So, repetitive words in a document will have a high TF score.
#### Formula
- TF = number of times the word occurs in the doc (often divided by the doc's total word count)

### IDF
- Measures how rare a term is across all documents.
- So, common terms across many documents (like “the”, “is”) will have a low IDF score.
- Rare terms (that appear in few documents) will have a high IDF score.

#### Formula
- IDF = log(total docs / docs containing the word)

#### Formula for TF-IDF
    TF-IDF(t,d)=TF(t,d)×IDF(t)

- High TF-IDF: Term appears frequently in a specific doc, but rarely in others.

- Low TF-IDF: Term is either common in all docs or infrequent in the target doc.

```
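from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()  # lowercases text and, by default, drops one-letter tokens such as "I"
X = tfidf.fit_transform(["I love NLP", "NLP loves me"])  # sparse matrix, one row per document
print(tfidf.get_feature_names_out())  # vocabulary: love, loves, me, nlp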
```
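
To make the formulas concrete, here is a hand calculation in plain Python. It is a sketch under stated assumptions: TF is the raw count, IDF uses the natural log with no smoothing, the documents are already tokenized and lowercased, and every queried term occurs in at least one document. `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math

docs = [["i", "love", "nlp"], ["nlp", "loves", "me"]]

def tf(term, doc):
    # Raw count of the term in one document.
    return doc.count(term)

def idf(term, docs):
    # log(total docs / docs containing the term)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("nlp", docs[0], docs))   # 0.0, because "nlp" is in every document
print(tf_idf("love", docs[0], docs))  # log(2/1), about 0.693
```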

## 6. Word Embeddings
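What: Map each word to a dense vector of real numbers that captures its meaning, learned from the contexts the word appears in.

Why: Unlike BoW/TF-IDF, embeddings such as Word2Vec or GloVe place words with similar meanings close together, so similarity between words can be measured.

### Popular Embedding Techniques:
- Word2Vec
- GloVe
- FastText
- BERT embeddings (contextual: the vector for a word changes with the surrounding sentence)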
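
Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors that are invented purely for illustration; they are not output from Word2Vec, GloVe, or any trained model, whose vectors typically have hundreds of learned dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors (assumption for this example).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(sim_king_queen)  # close to 1: the vectors point in nearly the same direction
print(sim_king_apple)  # much lower: the vectors point in different directions
```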
  

## 7. Named Entity Recognition (NER)
- Extracts entities such as names, organizations, locations, and dates.
- Useful in chatbots, search engines, and information extraction.

Example:
'Gokul works in Accenture at Coimbatore campus'
-> PERSON: Gokul, ORG: Accenture, LOCATION: Coimbatore
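
The input and output of the example above can be mimicked with a simple dictionary (gazetteer) lookup. This is a deliberately naive stand-in, not how trained NER models work: they predict labels from context rather than from a fixed list. `GAZETTEER` and `tag_entities` are hypothetical names made up for this sketch.

```python
# Hypothetical gazetteer: entity text -> label.
GAZETTEER = {"Gokul": "PERSON", "Accenture": "ORG", "Coimbatore": "LOCATION"}

def tag_entities(text):
    # Return (token, label) pairs for whitespace tokens found in the gazetteer.
    return [(tok, GAZETTEER[tok]) for tok in text.split() if tok in GAZETTEER]

entities = tag_entities("Gokul works in Accenture at Coimbatore campus")
print(entities)
# [('Gokul', 'PERSON'), ('Accenture', 'ORG'), ('Coimbatore', 'LOCATION')]
```

A lookup like this fails on any name it has not seen, which is the main reason statistical or neural NER models are used instead.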