**1. Tokenization**

Tokenization splits text into smaller units called tokens, such as words or sentences. It is usually the first step in an NLP pipeline, because most later processing operates on tokens rather than on raw strings.

```
Word Tokenization Example
```

In [4]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt_tab' data package
nltk.download('punkt_tab')

# Example text
text = "Hello, how are you doing today? I'm learning NLP!"

# Tokenize the text into words
tokens = word_tokenize(text)

print("Original Text:", text)
print("Tokenized Words:", tokens)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Original Text: Hello, how are you doing today? I'm learning NLP!
Tokenized Words: ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'I', "'m", 'learning', 'NLP', '!']

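To get a feel for what a word tokenizer does, here is a rough sketch that uses only the standard library. It is not how `word_tokenize` works internally; the regular expression and the `simple_tokenize` name are illustrative assumptions.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, how are you doing today? I'm learning NLP!")
print(tokens)
```

Unlike NLTK, which keeps `'m` together as a single token, this pattern splits the apostrophe off on its own. Handling cases like contractions is one reason to prefer a trained or rule-rich tokenizer.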

```
Sentence Tokenization Example
```

In [5]:
from nltk.tokenize import sent_tokenize

# Example text
text = "NLP is fascinating. It has many real-world applications. I'm excited to learn it!"

# Tokenize the text into sentences
sentences = sent_tokenize(text)

print("Original Text:", text)
print("Tokenized Sentences:", sentences)


Original Text: NLP is fascinating. It has many real-world applications. I'm excited to learn it!
Tokenized Sentences: ['NLP is fascinating.', 'It has many real-world applications.', "I'm excited to learn it!"]


**2. Stop Words Removal:**

Stop words are common words in a language (e.g., "is," "the," "and") that carry little meaning on their own for many NLP tasks, so they are often removed before further processing.

In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Example text
text = "This is an example sentence showing off the stop words filtration."

# Tokenize and remove stop words
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered_words = [word for word in tokens if word.lower() not in stop_words]

print("Original Words:", tokens)
print("Filtered Words (without stop words):", filtered_words)


Original Words: ['This', 'is', 'an', 'example', 'sentence', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
Filtered Words (without stop words): ['example', 'sentence', 'showing', 'stop', 'words', 'filtration', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**3. Stemming and Lemmatization:**

These processes reduce words to their base forms, but in different ways.
```
Stemming Example
```
Stemming strips suffixes with heuristic rules to reduce words to a root form. It can be aggressive and may produce non-words such as "easili", and it leaves irregular forms like "ran" untouched.

In [7]:
from nltk.stem import PorterStemmer

# Example text
words = ["running", "runner", "ran", "easily", "fairly"]

# Apply stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

print("Original Words:", words)
print("Stemmed Words:", stems)


Original Words: ['running', 'runner', 'ran', 'easily', 'fairly']
Stemmed Words: ['run', 'runner', 'ran', 'easili', 'fairli']


```
Lemmatization Example
```
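Lemmatization uses a dictionary (here WordNet) together with the word's part of speech to return a valid base form, called a lemma. In the example below every word is treated as a verb, so "ran" becomes "run" while the adjective "better" is left unchanged.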


In [8]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Example text
words = ["running", "runner", "ran", "better", "easily"]

# Apply lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]  # Specify 'v' for verbs

print("Original Words:", words)
print("Lemmatized Words:", lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Original Words: ['running', 'runner', 'ran', 'better', 'easily']
Lemmatized Words: ['run', 'runner', 'run', 'better', 'easily']


**4. Bag of Words:**

Bag of Words (BoW) represents a text as the collection of words it contains. Each distinct word in the vocabulary becomes a feature, and each document is described by how often each word occurs in it. Grammar, word order, and context are ignored.
```
Implementation
```

In [9]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Raw data
corpus = [
    "I loVe NLP!!! It's AMAZING 😊",
    "NLP?? Is it fUn, or AMAZING?!!",
    "Learning NLP and its applications!!!",
    "NLP is powerful: it solves many real-world problems!",
    "Do you LOVE NLP? #fun #amazing"
]

# Initialize tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Data cleaning function
def clean_text(text):
    # Lowercase the text
    text = text.lower()
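    # Remove everything except lowercase letters and whitespace (punctuation, digits, emoji);
    # this also merges "it's" into "its" and "real-world" into "realworld"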
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Apply stemming
    tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(tokens)

# Clean the entire corpus
cleaned_corpus = [clean_text(doc) for doc in corpus]

print("Original Corpus:")
print(corpus)
print("\nCleaned Corpus:")
print(cleaned_corpus)


Original Corpus:
["I loVe NLP!!! It's AMAZING 😊", 'NLP?? Is it fUn, or AMAZING?!!', 'Learning NLP and its applications!!!', 'NLP is powerful: it solves many real-world problems!', 'Do you LOVE NLP? #fun #amazing']

Cleaned Corpus:
['love nlp amaz', 'nlp fun amaz', 'learn nlp applic', 'nlp power solv mani realworld problem', 'love nlp fun amaz']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


After cleaning, we convert the text into a Bag of Words representation.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the BoW model
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(cleaned_corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBag of Words Representation:\n", X_bow.toarray())


Vocabulary: ['amaz' 'applic' 'fun' 'learn' 'love' 'mani' 'nlp' 'power' 'problem'
 'realworld' 'solv']

Bag of Words Representation:
 [[1 0 0 0 1 0 1 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0]
 [0 1 0 1 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 1 1 1 1 1]
 [1 0 1 0 1 0 1 0 0 0 0]]

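The matrix above can be reproduced by hand, which makes the "count each vocabulary word per document" idea concrete. This is a minimal sketch using only the standard library. The cleaned documents are copied from the earlier output, and the names `vocabulary` and `bow_matrix` are illustrative.

```python
from collections import Counter

# Cleaned documents copied from the output of the cleaning step
cleaned_corpus = [
    "love nlp amaz",
    "nlp fun amaz",
    "learn nlp applic",
    "nlp power solv mani realworld problem",
    "love nlp fun amaz",
]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in cleaned_corpus for token in doc.split()})

# One row per document, one column per vocabulary term
bow_matrix = []
for doc in cleaned_corpus:
    counts = Counter(doc.split())
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row lines up with the corresponding row of the `CountVectorizer` output, because the vocabulary is sorted the same way.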

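**5. TF-IDF:**

TF-IDF (term frequency-inverse document frequency) refines raw counts: a term's weight grows with how often it appears in a document and shrinks with the number of documents that contain it, so words common to the whole corpus count for less.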

```
Implementation
```

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF model
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(cleaned_corpus)

print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Representation:\n", X_tfidf.toarray())


Vocabulary: ['amaz' 'applic' 'fun' 'learn' 'love' 'mani' 'nlp' 'power' 'problem'
 'realworld' 'solv']

TF-IDF Representation:
 [[0.58148208 0.         0.         0.         0.70050458 0.
  0.41372929 0.         0.         0.         0.        ]
 [0.58148208 0.         0.70050458 0.         0.         0.
  0.41372929 0.         0.         0.         0.        ]
 [0.         0.67009179 0.         0.67009179 0.         0.
  0.31930233 0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.43739254
  0.20841989 0.43739254 0.43739254 0.43739254 0.43739254]
 [0.47625576 0.         0.57373967 0.         0.57373967 0.
  0.33885989 0.         0.         0.         0.        ]]

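To see where these numbers come from, the weights can be recomputed by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document row. The helper name `tfidf_row` is illustrative.

```python
import math
from collections import Counter

# Cleaned documents copied from the output of the cleaning step
cleaned_corpus = [
    "love nlp amaz",
    "nlp fun amaz",
    "learn nlp applic",
    "nlp power solv mani realworld problem",
    "love nlp fun amaz",
]
docs = [doc.split() for doc in cleaned_corpus]
n_docs = len(docs)
vocabulary = sorted({token for doc in docs for token in doc})

# Document frequency: in how many documents does each term occur?
df = Counter(token for doc in docs for token in set(doc))

# Smoothed inverse document frequency (assumed to match the library default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_matrix = [tfidf_row(doc) for doc in docs]
print([round(w, 8) for w in tfidf_matrix[0]])
```

Because "nlp" occurs in all five documents, its idf is the minimum value of 1, which is why it ends up with the smallest weight in every row.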

**Analysis:**
- **Bag of Words**: Uses raw counts, which can work well for simple models but treats every word as equally important.
- **TF-IDF**: Gives more weight to rare words (e.g., "realworld", "solv") and down-weights words that appear in every document, such as "nlp".

# N-grams

## **Definition**
An **n-gram** is a contiguous sequence of `n` words from a given text. It captures local word context by considering a fixed number of consecutive words.

---

## **Types of N-grams**
### 1. **Unigram (n=1)**
- **Description**: Single words.
- **Example**: "I love NLP" yields the unigrams ["I", "love", "NLP"]

### 2. **Bigram (n=2)**
- **Description**: Pairs of consecutive words.
- **Example**: "I love NLP" yields the bigrams [("I", "love"), ("love", "NLP")]

### 3. **Trigram (n=3)**
- **Description**: Three consecutive words.
- **Example**: "I love NLP" yields the trigram [("I", "love", "NLP")]
```
Example Implementation:
```

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus
corpus = ["I love natural language processing"]

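# Generate bigrams: ngram_range=(2, 2) sets both the minimum and maximum n to 2
# Note: by default CountVectorizer lowercases text and ignores one-character
# tokens, so "I" is dropped and no "i love" bigram appears in the vocabulary
vectorizer = CountVectorizer(ngram_range=(2, 2))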
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bigram Representation:\n", X.toarray())


Vocabulary: ['language processing' 'love natural' 'natural language']
Bigram Representation:
 [[1 1 1]]

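The unigram, bigram, and trigram definitions above can also be sketched without any library. The `ngrams` helper below is an illustrative name, and it splits on whitespace rather than using a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love NLP".split()
print(ngrams(tokens, 1))  # [('I',), ('love',), ('NLP',)]
print(ngrams(tokens, 2))  # [('I', 'love'), ('love', 'NLP')]
print(ngrams(tokens, 3))  # [('I', 'love', 'NLP')]
```

When `n` is larger than the number of tokens, the function returns an empty list.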


---

## **Use Cases**
1. **Text Classification**: Capturing context in sentences for sentiment analysis or spam detection.
2. **Machine Translation**: Translating sequences of words instead of individual words.
3. **Speech Recognition**: Predicting word sequences for more accurate transcriptions.

---


# Word2Vec

## **Definition**
**Word2Vec** is a neural network-based model that represents words as dense vectors in a continuous vector space. Introduced by researchers at Google in 2013, it captures semantic and syntactic relationships between words.

---

## **Architectures**
1. **CBOW (Continuous Bag of Words)**  
   - **Description**: Predicts a word based on its surrounding context words.  
   - **Example**:  
     For the sentence `"I love NLP"`, CBOW predicts `"love"` using `"I"` and `"NLP"` as context.
     
2. **Skip-gram**  
   - **Description**: Predicts the surrounding context words given a single word.  
   - **Example**:  
     For the word `"love"` in `"I love NLP"`, Skip-gram predicts `"I"` and `"NLP"`.

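The difference between the two architectures comes down to how training examples are framed. The sketch below only builds those examples from a token list; it does not train anything. The helper names and the `window` parameter are illustrative assumptions, not part of any library. With `window=1` on `"I love NLP"`, CBOW pairs the context `["I", "NLP"]` with the target `"love"`, while skip-gram emits `("love", "I")` and `("love", "NLP")` as separate pairs.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """(context words, target word) pairs: CBOW predicts the target from its context."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """(target word, context word) pairs: skip-gram predicts each context word from the target."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]

tokens = ["I", "love", "NLP"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```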
---

## **How It Works**
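- Words that appear in similar contexts have similar vector representations in the embedding space.
- Example (a three-sentence toy corpus, so the resulting vectors and similarity scores are essentially random and only illustrate the API):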


In [13]:
from gensim.models import Word2Vec

# Example corpus (tokenized sentences)
corpus = [
    ["I", "love", "natural", "language", "processing"],
    ["language", "processing", "is", "fun"],
    ["I", "am", "learning", "NLP"]
]

# Train a Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=10, window=2, min_count=1, workers=4)

# Access word vector
vector = model.wv['language']
print("Vector for 'language':", vector)

# Find similar words
similar = model.wv.most_similar("language")
print("\nWords similar to 'language':", similar)


Vector for 'language': [ 0.07380505 -0.01533471 -0.04536613  0.06554051 -0.0486016  -0.01816018
  0.0287658   0.00991874 -0.08285215 -0.09448818]

Words similar to 'language': [('processing', 0.5436006188392639), ('I', 0.32937225699424744), ('is', 0.23243840038776398), ('natural', 0.035253241658210754), ('fun', -0.17998705804347992), ('NLP', -0.21133741736412048), ('am', -0.38205230236053467), ('love', -0.5145737528800964), ('learning', -0.5381841063499451)]

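The scores returned by `most_similar` are cosine similarities between word vectors, ranging from -1 (opposite direction) to 1 (same direction). A minimal sketch of that measure, using only the standard library and assuming neither vector is all zeros:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the measure depends only on direction, scaling a vector by a positive constant does not change its similarity to other vectors.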