### 1. Tokenization
**Definition**:  
Tokenization is the process of splitting text into individual units or tokens, such as words, sentences, or subwords.

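**Types of Tokenization**:
- **Word Tokenization**: Splitting text into words.
- **Sentence Tokenization**: Splitting text into sentences.
- **Subword Tokenization**: Splitting words into smaller units, as byte-pair encoding (BPE) and WordPiece do.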

**Applications**:
- Preprocessing step in NLP tasks like translation, summarization, and classification.

**Example**:  
Text: "I love Python!"  
Tokens: `["I", "love", "Python", "!"]`
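
The example above can be reproduced with a small regex-based sketch. The helper names `word_tokenize_simple` and `sent_tokenize_simple` are made up for illustration; library tokenizers such as NLTK's handle many more edge cases (abbreviations, contractions, URLs).

```python
import re

def word_tokenize_simple(text):
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one
    # punctuation character, so "Python!" becomes "Python" and "!".
    return re.findall(r"\w+|[^\w\s]", text)

def sent_tokenize_simple(text):
    # Split on whitespace that follows ., ! or ?.
    # Naive on purpose: it would also split after "Dr." in "Dr. Smith".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

tokens = word_tokenize_simple("I love Python!")
sentences = sent_tokenize_simple("I love Python! It is fun.")
print(tokens)     # ['I', 'love', 'Python', '!']
print(sentences)  # ['I love Python!', 'It is fun.']
```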

---

### 2. Stemming
**Definition**:  
Stemming is the process of reducing a word to a stem by stripping affixes (in practice mostly suffixes) with heuristic rules. The resulting stem need not be a dictionary word.

**Goal**:  
To normalize different variations of a word into a base form.

**Example**:  
- "running" → "run"  
- "happiness" → "happi"  

**Algorithms**:
- **Porter Stemmer**
- **Lancaster Stemmer**

**Applications**:
- Used in information retrieval to group words with the same root.
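
To make the idea of rule-based suffix stripping concrete, here is a toy sketch. It is a deliberately simplified stand-in and not the Porter or Lancaster algorithm; the rule list `SUFFIXES` and the function `naive_stem` are assumptions made for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
# of how a stemmer applies ordered rewrite rules.
SUFFIXES = ("ing", "ed", "ly", "s")  # hypothetical rule list, checked in order

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        long_enough = len(word) - len(suffix) >= 3
        # Never strip the final "s" of a word ending in "ss" (e.g. "happiness").
        if word.endswith(suffix) and long_enough and not word.endswith("ss"):
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

stems = [naive_stem(w) for w in ["running", "jumped", "cats", "happiness"]]
print(stems)  # ['run', 'jump', 'cat', 'happiness']
```

Unlike the Porter stemmer, this sketch leaves "happiness" untouched, which shows how much the output depends on the rule set.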

---

### 3. Lemmatization
**Definition**:  
Lemmatization is the process of reducing a word to its base or dictionary form (lemma), considering context and meaning.

**Difference from Stemming**:  
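- Stemming strips affixes by rule and may produce non-words (e.g., "happi"), while lemmatization uses a vocabulary and the word's part of speech to return a valid dictionary form.

**Example**:  
- "better" → "good" (when tagged as an adjective)  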
- "running" → "run"

**Applications**:
- More accurate than stemming for tasks requiring semantics (e.g., text classification, sentiment analysis).
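
The dependence on part of speech can be sketched with a lookup table. The table `LEMMA_TABLE` and the function `lookup_lemma` are hypothetical; real lemmatizers such as NLTK's `WordNetLemmatizer` consult a full lexicon plus morphological rules.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to a lemma.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("running", "verb"): "run",
    ("running", "noun"): "running",  # "running is fun": already a lemma
    ("mice", "noun"): "mouse",
}

def lookup_lemma(word, pos):
    # Fall back to the lowercased word when the lexicon has no entry.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lookup_lemma("better", "adj"))    # good
print(lookup_lemma("running", "verb"))  # run
print(lookup_lemma("running", "noun"))  # running
```

The same surface form maps to different lemmas depending on the tag, which is why lemmatizers usually need a part-of-speech hint to do better than a stemmer.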

---

### 4. Bag of Words (BoW)
**Definition**:  
BoW is a model used to represent text data in which the order of words is ignored, and only their frequencies are considered.

**How it Works**:
- Each word in the corpus is treated as a feature.
- The text is represented by a vector of word counts or occurrences.

**Example**:
Text: "I love programming and I love Python"  
Bag of Words Representation:  
`{"I": 2, "love": 2, "programming": 1, "and": 1, "Python": 1}`

**Applications**:
- Text classification, document clustering, sentiment analysis.
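
The word-count representation from the example can be sketched with the standard library alone. Whitespace splitting and the alphabetical ordering of the vocabulary are simplifying assumptions; a vectorizer such as scikit-learn's `CountVectorizer` also lowercases and drops one-character tokens by default.

```python
from collections import Counter

text = "I love programming and I love Python"

# Count each token; word order is discarded.
bow = Counter(text.split())

# Fix a vocabulary order so the document becomes a plain count vector.
vocabulary = sorted(bow)
vector = [bow[word] for word in vocabulary]

print(dict(bow))   # {'I': 2, 'love': 2, 'programming': 1, 'and': 1, 'Python': 1}
print(vocabulary)  # ['I', 'Python', 'and', 'love', 'programming']
print(vector)      # [2, 1, 1, 2, 1]
```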

---

### 5. TF-IDF (Term Frequency-Inverse Document Frequency)
**Definition**:  
TF-IDF is a statistic used to measure the importance of a word in a document relative to its frequency across the corpus.

**Formula**:
- **TF** (Term Frequency): The frequency of a term in a document.
  $
  TF(t, d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}}
  $
- **IDF** (Inverse Document Frequency): The logarithm of the ratio between the total number of documents and the number of documents containing the term, so rarer terms score higher.
  $
  IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term t}}\right)
  $
- **TF-IDF**: The product of TF and IDF.
  $
  TF\_IDF(t, d) = TF(t, d) \times IDF(t)
  $

**Applications**:
- Text search engines, document classification, information retrieval.

**Example**:  
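For the word "Python" appearing 3 times in a 100-term document, in a corpus of 100 documents of which 10 contain the word:  
TF = 3/100 = 0.03, IDF = log(100/10) = log(10) ≈ 2.303 (using the natural logarithm)  
TF-IDF = 0.03 × 2.303 ≈ 0.069. The score rises with the term's frequency in the document and with its rarity across the corpus.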
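
The formulas above translate almost line for line into code. This is a minimal sketch assuming the natural logarithm, pre-tokenized documents, and a term that occurs in at least one document (otherwise `idf` would divide by zero); the function names and the toy corpus are illustrative.

```python
import math

def tf(term, doc_tokens):
    # Occurrences of the term divided by the document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log(total documents / documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Worked numbers from the example: 3 occurrences in 100 terms,
# 10 of 100 documents contain the term.
example_score = (3 / 100) * math.log(100 / 10)
print(round(example_score, 4))  # 0.0691

# The same functions on a tiny, made-up corpus.
toy_corpus = [["python", "is", "fun"], ["java", "is", "verbose"], ["python", "python", "rocks"]]
toy_score = tf_idf("python", toy_corpus[2], toy_corpus)
print(round(toy_score, 4))  # 0.2703
```

Library implementations often differ in detail: scikit-learn's `TfidfVectorizer`, for instance, uses a smoothed IDF and L2-normalizes each row by default, so its numbers will not match this textbook version.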

---

### 6. Word Embedding (Word2Vec)
**Definition**:  
Word2Vec is a technique that trains a shallow neural network to map words to dense, continuous vectors that capture semantic relationships between words.

**How it Works**:
- **Skip-Gram**: Predicts the context words given a target word.
- **Continuous Bag of Words (CBOW)**: Predicts the target word from context words.
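
The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only builds those examples and does not train a network; `skipgram_pairs`, `cbow_examples`, and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    # Skip-Gram: one (target, context_word) pair per neighbor inside the window.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    # CBOW: one (context_words, target) example per position.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = ["I", "love", "programming"]
sg = skipgram_pairs(sentence)
cbow = cbow_examples(sentence)
print(sg)
# [('I', 'love'), ('love', 'I'), ('love', 'programming'), ('programming', 'love')]
print(cbow)
# [(['love'], 'I'), (['I', 'programming'], 'love'), (['love'], 'programming')]
```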

**Benefits**:
- Captures semantic meanings and word similarities (e.g., "king" - "man" + "woman" ≈ "queen").
- Produces dense, low-dimensional vectors (typically tens to a few hundred dimensions) instead of sparse one-hot vectors as long as the vocabulary.

**Example**:  
Word2Vec can map words like "king" and "queen" close to each other in vector space.

**Applications**:
- Text classification, sentiment analysis, recommendation systems.

**Libraries**:  
- **Gensim**: Popular library for training Word2Vec models.


### 7. Named Entity Recognition (NER)

**Definition**:  
NER is a natural language processing (NLP) task that identifies and classifies proper nouns or entities in text into predefined categories such as people, organizations, locations, dates, etc.

**Key Categories**:
- **Person (PER)**: Names of people (e.g., "Elon Musk")
- **Location (LOC)**: Geographical locations (e.g., "Paris")
- **Organization (ORG)**: Companies or institutions (e.g., "Google")
- **Date (DATE)**: Temporal expressions (e.g., "January 1, 2024")
- **Other**: Time, Money, Percentage, etc.

**NER Process**:
1. **Tokenization**: Breaking the text into tokens (words).
2. **Entity Detection**: Identifying potential named entities.
3. **Entity Classification**: Categorizing the identified entities into types.

**Applications**:
- Information extraction
- Question answering
- Search engine optimization
- Content categorization

**Example**:  
Text: "Elon Musk, the CEO of SpaceX, visited New York on December 25, 2023."  
Output:  
- "Elon Musk" → Person (PER)  
- "SpaceX" → Organization (ORG)  
- "New York" → Location (LOC)  
- "December 25, 2023" → Date (DATE)

**Challenges**:  
- Ambiguity of terms (e.g., "Apple" could be a company or fruit)
- Domain-specific entities (e.g., medical terms)
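
The detect-then-classify steps can be sketched with a lookup table and a date pattern. This is a toy, rule-based stand-in: the gazetteer `GAZETTEER`, the date regex, and `tag_entities` are all hypothetical, and production NER systems generally rely on trained statistical or neural models instead.

```python
import re

# Hypothetical gazetteer: a tiny hand-written lookup table, not a trained model.
GAZETTEER = {
    "Elon Musk": "PER",
    "SpaceX": "ORG",
    "New York": "LOC",
}
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

def tag_entities(text):
    found = []
    # Entity detection and classification via exact gazetteer matches.
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    # Dates follow a regular surface pattern, so a regex is enough here.
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    # Return entities in the order they appear in the text.
    return [(span, label) for _, span, label in sorted(found)]

entities = tag_entities(
    "Elon Musk, the CEO of SpaceX, visited New York on December 25, 2023."
)
print(entities)
# [('Elon Musk', 'PER'), ('SpaceX', 'ORG'), ('New York', 'LOC'), ('December 25, 2023', 'DATE')]
```

A lookup like this cannot resolve the ambiguity noted above: it would label every "Apple" the same way regardless of context.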

## **8. Normalization**
- **Definition**: Standardizing text for uniformity.
- **Tasks**:
  - Convert to lowercase: "Text" → "text".
  - Remove punctuation: "Hello, world!" → "Hello world".
  - Expand contractions: "don't" → "do not".
- **Purpose**: Ensures consistent text representation.
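
The three tasks can be chained in one small function. This is a sketch under assumptions: the contraction table `CONTRACTIONS` is a hypothetical three-entry sample, and only ASCII punctuation from `string.punctuation` is removed.

```python
import string

# Hypothetical contraction table; a realistic one would be much larger.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def normalize(text):
    text = text.lower()
    # Expand contractions first, while the apostrophe is still present.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any repeated whitespace left behind.
    return " ".join(text.split())

normalized = normalize("Hello, world! Don't panic.")
print(normalized)  # hello world do not panic
```

The order of the steps matters: stripping punctuation before expanding contractions would turn "don't" into "dont" and the lookup would miss it.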

## **9. Corpus**
- **Definition**: A collection of texts for linguistic or statistical analysis.
- **Types**:
  - Monolingual: Texts in one language.
  - Multilingual: Texts across multiple languages.
- **Example**: Historical or scientific texts.

---

## **10. Stop Words**
- **Definition**: Common words (e.g., "the," "is") removed from text to focus on meaningful content.
- **Example**:
  Original: "The quick brown fox jumps over the lazy dog."  
  Without Stop Words: "quick brown fox jumps lazy dog"
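
Stop-word removal is a filter against a fixed word list. The set `STOP_WORDS` below is a small hypothetical sample; libraries such as NLTK ship curated lists per language.

```python
import string

# Hypothetical stop-word list, kept short for illustration.
STOP_WORDS = {"the", "is", "a", "an", "and", "over", "of", "to", "in"}

def remove_stop_words(text):
    # Strip ASCII punctuation, split on whitespace, then drop stop words.
    words = text.translate(str.maketrans("", "", string.punctuation)).split()
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)

filtered = remove_stop_words("The quick brown fox jumps over the lazy dog.")
print(filtered)  # quick brown fox jumps lazy dog
```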

---

## **11. Parts-of-Speech (POS) Tagging**
- **Definition**: Assigning grammatical tags to words (e.g., noun, verb, adjective).
- **Purpose**: Helps in understanding the structure and meaning of text.
- **Example**:
  - "The cat sleeps."
    - "The" → Determiner
    - "cat" → Noun
    - "sleeps" → Verb

---

## **12. Statistical Language Modeling**
- **Definition**: Estimating probabilities of word sequences to predict the likelihood of a text.
- **Applications**:
  - Speech recognition
  - Text generation
- **Example**: Predicting the next word in "The weather is..."
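
One of the simplest statistical language models is a bigram model that estimates the probability of the next word from counts. The three-sentence corpus below is made up for illustration, and the estimate is a plain maximum-likelihood ratio without smoothing.

```python
from collections import Counter, defaultdict

# Made-up training corpus.
corpus = [
    "the weather is sunny",
    "the weather is cold",
    "the weather is sunny today",
]

# bigram_counts[prev][nxt] = how often `nxt` follows `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # sentence boundary markers
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev, anything).
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("is")
best = max(probs, key=probs.get)
print(probs)  # 'sunny' gets 2/3 and 'cold' gets 1/3
print(best)   # sunny
```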

---

## **13. n-grams**

- **Definition**: An n-gram is a contiguous sequence of n items (words or characters) from a text; an n-gram model represents a text by the n-grams it contains.
- **Difference from Bag of Words**: Unlike Bag of Words, n-grams consider word order within a sequence.
- **Example**: Trigrams (3-grams) for the sentence:  
  "There, there," said James. "There, there."  
  Representation:
  ```python
  [
      "there there said",
      "there said james",
      "said james there",
      "james there there"
  ]
  ```
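
The trigram list above can be generated with a sliding window over the tokens. This sketch assumes lowercasing and a `\w+` tokenizer that drops punctuation; `ngrams` is an illustrative helper, not a library function.

```python
import re

def ngrams(text, n):
    # Lowercase and keep only word characters, so quotes and commas vanish.
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of n tokens across the sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = ngrams('"There, there," said James. "There, there."', 3)
print(trigrams)
# ['there there said', 'there said james', 'said james there', 'james there there']
```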

---

## **14. Regular Expressions**

- **Definition**: Regular expressions (regex or regexp) are concise patterns used for searching, matching, and manipulating text.
- **Purpose**: Extend beyond simple wildcard characters (`*`, `?`) to define complex text patterns.
- **Example Use Case**: Extracting email addresses from text:
  ```python
  import re

  text = "Contact us at support@example.com or sales@example.org"
  # [A-Za-z]{2,} matches the top-level domain; a "|" inside a character
  # class would be a literal pipe, so it is left out.
  pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
  emails = re.findall(pattern, text)
  print(emails)  # Output: ['support@example.com', 'sales@example.org']
  ```


```python
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import string

# Download the tokenizer and WordNet data on first use
# (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

# Sample text
text = "I love programming. Python is my favorite language! I am learning Natural Language Processing."

# 1. Tokenization
sentences = sent_tokenize(text)  # Sentence Tokenization
words = word_tokenize(text)      # Word Tokenization
print("Tokenized Sentences:", sentences)
print("Tokenized Words:", words)

# 2. Stemming (the Porter stemmer also lowercases its input)
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words if word not in string.punctuation]
print("\nStemmed Words:", stemmed_words)

# 3. Lemmatization (without a pos argument every word is treated as a noun,
#    so verb forms such as "learning" are left unchanged)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words if word not in string.punctuation]
print("\nLemmatized Words:", lemmatized_words)

# 4. Bag of Words (BoW)
# CountVectorizer lowercases and ignores one-character tokens such as "I"
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform([text])  # Fit and transform the text
print("\nBag of Words Representation:", bow_matrix.toarray())

# 5. TF-IDF
# TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([text])
print("\nTF-IDF Representation:", tfidf_matrix.toarray())

# 6. Word Embedding (Word2Vec)
# Create a list of tokenized sentences for Word2Vec
sentences_list = [["I", "love", "programming"], ["Python", "is", "my", "favorite", "language"],
                  ["I", "am", "learning", "Natural", "Language", "Processing"]]

# Train a Word2Vec model (a corpus this small yields essentially random vectors)
model = Word2Vec(sentences_list, vector_size=50, window=5, min_count=1, workers=4)

# Get vector for the word 'love'
vector = model.wv["love"]
print("\nWord2Vec Vector for 'love':", vector)

# Find most similar words to 'love'
similar_words = model.wv.most_similar("love")
print("\nMost Similar Words to 'love':", similar_words)
```


Tokenized Sentences: ['I love programming.', 'Python is my favorite language!', 'I am learning Natural Language Processing.']
Tokenized Words: ['I', 'love', 'programming', '.', 'Python', 'is', 'my', 'favorite', 'language', '!', 'I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']

Stemmed Words: ['i', 'love', 'program', 'python', 'is', 'my', 'favorit', 'languag', 'i', 'am', 'learn', 'natur', 'languag', 'process']

Lemmatized Words: ['I', 'love', 'programming', 'Python', 'is', 'my', 'favorite', 'language', 'I', 'am', 'learning', 'Natural', 'Language', 'Processing']

Bag of Words Representation: [[1 1 1 2 1 1 1 1 1 1 1]]

TF-IDF Representation: [[0.26726124 0.26726124 0.26726124 0.53452248 0.26726124 0.26726124
  0.26726124 0.26726124 0.26726124 0.26726124 0.26726124]]

Word2Vec Vector for 'love': [ 1.62645429e-02 -8.91466811e-03 -2.13671452e-03  2.01272964e-03
 -3.82227910e-04  2.29635485e-03  1.22277215e-02 -4.05430801e-05
 -6.49193069e-03 -3.02145723e-03  1.17945978e-02  3