<a href="https://colab.research.google.com/github/Jacobgokul/ML-Playground/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is NLP?
Natural Language Processing (NLP) is a field in AI that helps computers understand, interpret, and generate human language.

✅ It is how machines work with text and speech, as in Google Translate, Siri, or Alexa.

# Why NLP Matters
Humans speak in natural languages like English, Tamil, or Hindi, but computers operate only on numbers. NLP bridges the two by turning human language into numeric representations that a machine can process.



# What Exactly Does NLP Do?

| Task                    | Example                                             |
| ----------------------- | --------------------------------------------------- |
| **Understand Meaning**  | Understand “I’m feeling down” means someone is sad  |
| **Extract Information** | Pull names, dates, locations from articles (NER)    |
| **Translate Languages** | Convert English → Japanese using translation models |
| **Generate Text**       | Write a paragraph or code based on your input       |
| **Summarize Documents** | Condense a 2000-word article into 3 lines           |
| **Answer Questions**    | Like ChatGPT does                                   |


# Key Concepts | Techniques

## 1. Tokenization

- What: Splits sentences into words or subwords.

- Why: ML models can’t understand full text – they need units (tokens).

- Types:

    - Word tokenization → ["Hello", "world"]

    - Character tokenization → ["H", "e", "l", "l", "o"]

    - Subword tokenization (used in Transformers) → "playing" → ["play", "##ing"]

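    ```py
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")  # one-time tokenizer data; newer NLTK releases need "punkt_tab" instead
    word_tokenize("I'm learning NLP.")
    # ['I', "'m", 'learning', 'NLP', '.']
    ```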
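
The three tokenization types listed above can also be sketched without any external library. This is a minimal illustration, not how NLTK or any Transformer tokenizer is implemented: the function names and `toy_vocab` are hypothetical, the regex tokenizer splits contractions differently from NLTK, and the subword function is a simplified greedy longest-match split in the style of WordPiece.

```python
import re

def word_tokenize_simple(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split text into individual characters."""
    return list(text)

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece fits
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; real models learn tens of thousands of pieces
toy_vocab = {"play", "walk", "##ing", "##ed"}

word_tokens = word_tokenize_simple("Hello world")
char_tokens = char_tokenize("Hello")
subword_tokens = wordpiece_tokenize("playing", toy_vocab)

print(word_tokens)     # ['Hello', 'world']
print(char_tokens)     # ['H', 'e', 'l', 'l', 'o']
print(subword_tokens)  # ['play', '##ing']
```

The subword split is what lets a model handle a word it has never seen as a whole, as long as its pieces are in the vocabulary.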

## 2. Stopword Removal
- What: Removes very common words such as "the", "is", "a", and "an".

- Why: These words occur a lot but carry little meaning in classification tasks.
    
    ```py
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # one-time download of the stopword lists
    stopwords.words('english')  # includes 'is', 'the', etc.
    ```
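
To see the effect on a sentence, here is a minimal sketch of the filtering step itself. It assumes a tiny hand-picked stopword set (`STOPWORDS`) instead of NLTK's full English list, and the helper name `remove_stopwords` is illustrative.

```python
# Tiny hand-picked stopword set (an assumption; NLTK's English list is much longer)
STOPWORDS = {"the", "is", "a", "an"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]

tokens = ["The", "movie", "is", "a", "masterpiece"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['movie', 'masterpiece']
```

Lowercasing before the lookup matters because stopword lists are normally stored in lowercase.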

## 3. Stemming vs Lemmatization
| Stemming            | Lemmatization               |
| ------------------- | --------------------------- |
| Cuts suffix         | Finds proper root word      |
| “studies” → “studi” | “studies” → “study”         |
| Less accurate       | More linguistically correct |

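Stemming (naive suffix stripping):
 Running → remove "ing" → Runn

Lemmatization (dictionary lookup):
 Running → root word → Run

Real stemmers apply extra rules on top of suffix stripping; NLTK's `PorterStemmer`, for example, returns "run" for "running". `WordNetLemmatizer` treats words as nouns by default, so it needs `pos="v"` to map "running" to "run".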

```py
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # studi
print(lemmatizer.lemmatize("studies"))  # study
```
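
The difference between the two approaches can also be sketched in plain Python. This is a toy illustration under stated assumptions: `naive_stem` only strips a few hard-coded suffixes (it is not the Porter algorithm), and `TOY_LEMMAS` is a hypothetical two-entry dictionary standing in for a real lexicon such as WordNet.

```python
def naive_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, with no linguistic knowledge."""
    word = word.lower()
    for suffix in suffixes:
        # Require a few leftover characters so short words are not destroyed
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer consults a full lexicon
TOY_LEMMAS = {"running": "run", "studies": "study"}

def toy_lemmatize(word):
    """Look the word up and fall back to the lowercase word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

for word in ["Running", "studies"]:
    print(word, "->", naive_stem(word), "|", toy_lemmatize(word))
# Running -> runn | run
# studies -> studi | study
```

Stemming is cheap but can produce non-words; lemmatization returns real words but depends on having a dictionary.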


## 4. Bag of Words (BoW)
- It’s the simplest way to turn text into numbers so a machine can use it.
- Idea: “Don’t care about order of words, just count how many times each word appears.”
- It is called a bag because it is like putting all the words into a bag, shaking it, and counting them, with no regard for grammar or sequence.

#### Step-by-step example
- Imagine two sentences:

    - “I love NLP”

    - “I love Python”

- Step 1: Build Vocabulary
    - Collect all unique words → [“I”, “love”, “NLP”, “Python”]

- Step 2: Represent each sentence as a vector (word counts)
    - Sentence 1: “I love NLP” → [1, 1, 1, 0]
    - Sentence 2: “I love Python” → [1, 1, 0, 1]

Each number says how many times that word appears.
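
The two steps above can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and case-sensitive matching; the helper names `build_vocabulary` and `to_bow_vector` are illustrative and not part of any library.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Collect unique words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def to_bow_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

sentences = ["I love NLP", "I love Python"]
vocab = build_vocabulary(sentences)
vectors = [to_bow_vector(sentence, vocab) for sentence in sentences]

print(vocab)    # ['I', 'love', 'NLP', 'Python']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```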

#### Pros and Cons
- Advantages
    - Super simple, easy to understand.

    - Works well for small tasks (spam detection, simple text classification).

- Disadvantages
    - Ignores order → “dog bites man” vs “man bites dog” look the same.

    - Vocabulary grows huge as text grows (sparse matrix).

    - Doesn’t capture meaning or similarity (e.g., “happy” ≠ “glad”).
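
The word-order limitation is easy to check directly. The sketch below assumes whitespace tokenization and uses a hypothetical helper, `bag_of_words`, built on `collections.Counter`.

```python
from collections import Counter

def bag_of_words(sentence):
    """Map each lowercase word to its count, discarding word order."""
    return Counter(sentence.lower().split())

bow_a = bag_of_words("dog bites man")
bow_b = bag_of_words("man bites dog")

# Same words, same counts: the two sentences are indistinguishable
print(bow_a == bow_b)  # True
```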


```py
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
docs = ["I love NLP", "I love Python"]

# Create BoW model
cv = CountVectorizer()
X = cv.fit_transform(docs)

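# Show vocabulary
# Note: CountVectorizer lowercases the text and, by default, drops
# single-character tokens, so "I" does not appear in the vocabulary.
print(cv.get_feature_names_out())
# ['love' 'nlp' 'python']

# Show BoW vectors (one row per document, one column per vocabulary word)
print(X.toarray())
# [[1 1 0]
#  [1 0 1]]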
```


## 5. TF-IDF (Term Frequency-Inverse Document Frequency)
What: Improves BoW by reducing the weight of common words.

Why: Words like "good", "the", "very" may appear in every document, but we want to focus on rare but important terms.
### TF
- Measures how often a term appears in a document.
- So, repetitive words in a document will have a high TF score.
#### Formula
- TF = (Count of term t in document d) / (Total terms in document d)

##### Example:
- Document = "I love NLP, I love Python"
    - Word “love” count = 2
    - Total words = 6
    - TF(love) = 2/6 ≈ 0.33
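
The same calculation can be written as a small function. This is a minimal sketch, assuming lowercase matching and a simple regex tokenizer that ignores punctuation; `term_frequency` is an illustrative name.

```python
import re

def term_frequency(term, document):
    """TF = occurrences of the term / total number of tokens in the document."""
    tokens = re.findall(r"\w+", document.lower())
    return tokens.count(term.lower()) / len(tokens)

tf_love = term_frequency("love", "I love NLP, I love Python")
print(round(tf_love, 2))  # 0.33
```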

### IDF
- Measures how rare a term is across all documents.
- So, common terms across many documents (like “the”, “is”) will have a low IDF score.
- Rare terms (that appear in few documents) will have a high IDF score.

#### Formula
- IDF = log(Total number of documents / Number of documents containing the word)

##### Example:
- Suppose we have 10 documents.
- Word “Python” appears in 2 of them.
- IDF(Python) = log(10 / 2) = log(5) ≈ 1.61 (using the natural logarithm)
- Word “the” appears in all 10 docs → log(10 / 10) = 0
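
The two IDF values above can be checked with Python's `math.log`, which is the natural logarithm. This sketch assumes the plain, unsmoothed formula given in this section; `inverse_document_frequency` is an illustrative name.

```python
import math

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF = ln(total documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term)

idf_python = inverse_document_frequency(10, 2)  # rare term
idf_the = inverse_document_frequency(10, 10)    # term present in every document

print(round(idf_python, 2))  # 1.61
print(idf_the)               # 0.0
```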

#### Formula for TF-IDF
    TF-IDF(t, d) = TF(t, d) × IDF(t)

- High TF-IDF: Term appears frequently in a specific doc, but rarely in others.

- Low TF-IDF: Term is either common in all docs or infrequent in the target doc.
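
Putting TF and IDF together, here is a from-scratch sketch of the formula. It assumes whitespace tokenization, lowercase matching, and the plain unsmoothed IDF defined in this section; the function names are illustrative, and library implementations use a smoothed variant, so their numbers will differ.

```python
import math

def tokenize(doc):
    """Lowercase the document and split on whitespace."""
    return doc.lower().split()

def tf(term, doc):
    """Share of the document's tokens that equal the term."""
    tokens = tokenize(doc)
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """ln(number of documents / documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in tokenize(doc))
    if n_containing == 0:
        return 0.0  # term never occurs; avoid division by zero
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = ["I love NLP", "NLP loves me", "I love Python and NLP"]

# "nlp" occurs in every document, so its IDF (and TF-IDF) is zero
score_nlp = tf_idf("nlp", corpus[0], corpus)
# "python" occurs in only one document, so it gets a positive score there
score_python = tf_idf("python", corpus[2], corpus)

print(score_nlp)                # 0.0
print(score_python > score_nlp) # True
```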


##### Code:
```py
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "NLP loves me", "I love Python and NLP"]

# Create TF-IDF model
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Vocabulary
print(tfidf.get_feature_names_out())

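# TF-IDF values (one row per document, one column per vocabulary term).
# scikit-learn smooths the IDF and L2-normalizes each row by default,
# so the numbers differ from the plain TF × IDF formula.
print(X.toarray())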
```


## 6. Word Embeddings

### What are Word Embeddings?
Word embeddings represent words as dense vectors (lists of numbers) where:

- Words with similar meanings are closer together in vector space.
- The vectors capture semantics (meaning), unlike BoW or TF-IDF.

Think of it as giving each word an address in a multi-dimensional space. Words that “live” near each other mean similar things.

### Why not BoW/TF-IDF?
- **BoW/TF-IDF** → Just counts. Doesn’t know that “king” and “queen” are related.

- **Embeddings** → Learn from context. So “dog” is close to “puppy,” but far from “car.”
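
"Close" and "far" are usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration (`toy_embeddings` is hypothetical; real embeddings are learned from data and have far more dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT learned embeddings
toy_embeddings = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.20],
    "car":   [0.10, 0.20, 0.95],
}

sim_dog_puppy = cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"])
sim_dog_car = cosine_similarity(toy_embeddings["dog"], toy_embeddings["car"])

print(sim_dog_puppy > sim_dog_car)  # True
```

Cosine similarity ignores vector length and compares only direction, which is why it is a common choice for comparing embeddings.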

### Types of Embeddings
- Word2Vec
    - Introduced by Google in 2013.
    - Trained on large text corpora.
    - Captures the famous analogy: king - man + woman ≈ queen
- GloVe
    - Introduced by Stanford in 2014.
    - Uses global word co-occurrence.
    - Learns “statistics of the whole corpus.”
- FastText
    - Introduced by Facebook in 2016.
    - Works with subwords (handles rare words and misspellings better).
- BERT embeddings
    - Introduced by Google in 2018.
    - Contextual embeddings → a word's meaning depends on the sentence.
    - Example:
        - “I sat by the bank of the river.”
        - “I went to the bank to deposit money.”
    - Same word → different embeddings.
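
To illustrate why subwords help with misspellings, the sketch below splits words into character trigrams with `<` and `>` as boundary markers, in the spirit of FastText. It is a simplification under stated assumptions: only n = 3 is used, `char_ngrams` is an illustrative helper, and no vectors are trained.

```python
def char_ngrams(word, n=3):
    """Character n-grams of the word, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

correct = char_ngrams("playing")
misspelled = char_ngrams("plaing")
shared = set(correct) & set(misspelled)

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(sorted(shared))        # ['<pl', 'ing', 'ng>', 'pla']
```

Because the misspelled word still shares several n-grams with the correct one, a model that builds word vectors from n-gram vectors can place the two close together.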

```py
# Notebook-only shell command; outside Jupyter/Colab, run `pip install gensim` in a terminal
!pip install gensim

from gensim.models import Word2Vec

# Training a small toy model
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["I", "love", "deep", "learning"],
    ["NLP", "is", "fun"],
    ["Python", "is", "great", "for", "NLP"]
]

model = Word2Vec(sentences, vector_size=20, window=3, min_count=1)

# Vector for word 'NLP'
print(model.wv['NLP'])

# Find similar words (with such a tiny corpus the neighbours are essentially random)
print(model.wv.most_similar('love'))
```