# Bag of Words (BoW)

## What is Bag of Words?

**Bag of Words (BoW)** is a natural language processing (NLP) technique for representing text as numbers. It converts each document into a vector of word counts that machine learning models can consume. The BoW model disregards grammar, word order, and semantics, and keeps only how often each word occurs in the text.

### Why is it Called "Bag of Words"?

The term "Bag of Words" emphasizes that:
1. It treats text as a collection (or "bag") of words.
2. The position or sequence of words is ignored.
3. Only the count or presence of words matters.

---

## How Does Bag of Words Work?

### 1. Text Preprocessing:
Before applying BoW, text data is preprocessed. Common preprocessing steps include:
- **Lowercasing**: Convert all text to lowercase so that "Machine" and "machine" are treated as the same word.
- **Tokenization**: Split text into individual words (tokens).
- **Stop Word Removal**: Remove common words like "the", "is", "and", which may not carry significant meaning.
- **Stemming or Lemmatization**: Reduce words to their base forms (e.g., "running" → "run").
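
The first three steps above can be sketched in plain Python. This is a minimal illustration rather than a production pipeline: the `STOP_WORDS` set and the `preprocess` helper are hypothetical names introduced here, and stemming or lemmatization is left out because it normally relies on a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "and", "a", "an", "of"}

def preprocess(text):
    """Lowercase, tokenize on runs of letters/digits, and drop stop words."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The cat is sleeping and the dog is running."))
```

Tokenizing with a regular expression also strips punctuation, which is why no separate punctuation step appears in the sketch.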

### 2. Vocabulary Creation:
- Create a unique list of words (vocabulary) from the entire corpus (collection of text data).
- Each word in the vocabulary becomes a feature.

### 3. Vector Representation:
- Each document is represented as a vector of word counts.
- The length of the vector is equal to the size of the vocabulary.
- Each element of the vector corresponds to the frequency (or presence) of a specific word in the document.
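
Steps 2 and 3 can be written from scratch in a few lines. The sketch below assumes the documents are already preprocessed into token lists; the toy documents and the `vectorize` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (lists of tokens).
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog", "dog"],
]

# Step 2: the vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({tok for doc in docs for tok in doc})

def vectorize(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Step 3: one count vector per document, all of length len(vocabulary).
vectors = [vectorize(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Sorting the vocabulary fixes the column order, so the same position always refers to the same word across documents.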

---

### Example of Bag of Words:

#### Input Documents:
1. "I love machine learning."
2. "Machine learning is amazing."
3. "I love amazing AI."

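#### Preprocessing:
- Lowercase the text and strip punctuation.
- Tokenize the text into words.
- Remove the stop word "is". (The word "i" is kept here for illustration, although many stop word lists would remove it as well.)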

#### Vocabulary:
`['ai', 'amazing', 'i', 'learning', 'love', 'machine']`

#### Bag of Words Representation:
| Document Index | AI | Amazing | I | Learning | Love | Machine |
|----------------|----|---------|---|----------|------|---------|
| 1              | 0  | 0       | 1 | 1        | 1    | 1       |
| 2              | 0  | 1       | 0 | 1        | 0    | 1       |
| 3              | 1  | 1       | 1 | 0        | 1    | 0       |

Each document is now represented as a numerical vector.

---

## Advantages of Bag of Words

1. **Simple to Understand and Implement**:
   - BoW is straightforward and requires minimal computational resources for small datasets.

2. **Effective for Small Text Data**:
   - Works well for applications like spam detection or sentiment analysis when text data is limited.

3. **Compatible with ML Models**:
   - BoW vectors can be used as input for algorithms like Naive Bayes, SVM, and Logistic Regression.
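
To make the last point concrete, here is a small sketch that feeds BoW vectors into a Naive Bayes classifier with scikit-learn. The training texts and labels are invented for illustration, and four examples are far too few for a real spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free money offer",
    "meeting at noon tomorrow",
    "project report attached",
]
train_labels = [1, 1, 0, 0]

# CountVectorizer builds the BoW matrix; MultinomialNB models the word counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Words never seen in training (e.g. "for", "the") are simply ignored.
predictions = model.predict(["free prize offer", "report for the meeting"])
print(predictions.tolist())
```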

---

## Limitations of Bag of Words

1. **Ignores Word Order**:
   - BoW does not capture the sequence or context of words, which can lead to loss of meaning. For example, "dog bites man" and "man bites dog" produce identical vectors.

2. **High Dimensionality**:
   - For large vocabularies, the vector size can become very large, leading to memory and computational inefficiencies.

3. **Sparse Representation**:
   - Many words in the vocabulary may not appear in a given document, resulting in sparse vectors (vectors with many zeros).

4. **No Semantic Information**:
   - BoW treats synonyms (e.g., "good" and "great") as completely different words and does not account for polysemy (words with multiple meanings).
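
Three of these limitations can be observed directly with a few lines of plain Python. The sentences and the `bag` helper below are hypothetical and use simple whitespace tokenization.

```python
from collections import Counter

def bag(text):
    """Unordered word counts for a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())

# Word order is lost: opposite meanings, identical bags.
same_bag = bag("dog bites man") == bag("man bites dog")

# No semantics: "good" and "great" are unrelated features; only "movie" is shared.
shared = set(bag("good movie")) & set(bag("great movie"))

# Sparsity: a short document uses only a few words of the vocabulary.
vocabulary = sorted(bag("dog bites man good movie great film about a cat"))
doc_counts = bag("good movie")
doc_vector = [doc_counts[word] for word in vocabulary]
zero_fraction = doc_vector.count(0) / len(doc_vector)

print(same_bag, shared, zero_fraction)
```

With real corpora the vocabulary often contains tens of thousands of words, so the fraction of zeros per document is usually much closer to 1.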

---

## Enhancements Over Bag of Words

1. **TF-IDF (Term Frequency-Inverse Document Frequency)**:
   - Weights words by their importance in a document relative to the entire corpus.
   - Reduces the impact of common but less informative words.

2. **Word Embeddings**:
   - Techniques like Word2Vec, GloVe, and FastText encode words as dense vectors learned from the contexts in which the words occur, so semantically related words end up close together.

3. **N-grams**:
   - Instead of single words, BoW can use N-grams (sequences of N consecutive words) to capture some contextual information.
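
As an illustration of the TF-IDF idea, the sketch below uses one common textbook variant, `tf * log(N / df)`, where `tf` is the word's relative frequency in the document, `N` is the number of documents, and `df` is the number of documents containing the word. This formula is an assumption made for the example; libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
print(weights[2])
```

The word "the" occurs in every document, so its weight is zero, while "chased", which occurs in only one document, receives the highest weight in the third document.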

---

## Bag of Words in Python

### Example Implementation with `CountVectorizer` (scikit-learn)
```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample Data
documents = [
    "I love machine learning.",
    "Machine learning is amazing.",
    "I love amazing AI."
]

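# Initialize the CountVectorizer
# Note: the defaults lowercase the text, keep "is" (no stop word removal), and
# drop one-character tokens such as "i", so the vocabulary differs from the
# hand-worked example above.
vectorizer = CountVectorizer()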

# Fit and Transform the Data
X = vectorizer.fit_transform(documents)

# Display Vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Display Bag of Words Representation
print("Bag of Words Matrix:\n", X.toarray())
```

### Example with a pandas DataFrame

The same technique applied to a small labeled dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'text': [
        'People watch youtube people',
        'youtube is good',
        'people write comment',
        'user post videos',
    ],
    'output': [1, 1, 0, 0],
})
df
```

|   | text                        | output |
|---|-----------------------------|--------|
| 0 | People watch youtube people | 1      |
| 1 | youtube is good             | 1      |
| 2 | people write comment        | 0      |
| 3 | user post videos            | 0      |

```python
# CountVectorizer is scikit-learn's Bag of Words implementation
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bow = cv.fit_transform(df['text'])

# Mapping from each word to its column index in the matrix
print(cv.vocabulary_)
```

```
{'people': 3, 'watch': 7, 'youtube': 9, 'is': 2, 'good': 1, 'write': 8, 'comment': 0, 'user': 5, 'post': 4, 'videos': 6}
```

```python
bow.toarray()
```

```
array([[0, 0, 0, 2, 0, 0, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]])
```

---

# N-grams

## What is an N-gram?

An **N-gram** is a contiguous sequence of `N` items (words, characters, or tokens) from a given text or speech sample. N-grams are commonly used in Natural Language Processing (NLP) to capture local context and patterns in text. They retain some information about word order and the relationships between neighboring words, which a plain (unigram) Bag of Words discards.

---

## Types of N-grams

### 1. **Unigrams (1-grams)**:
   - Single-word tokens.
   - Example: For the sentence **"I love data science"**, the unigrams are:
     ```
     ["I", "love", "data", "science"]
     ```

### 2. **Bigrams (2-grams)**:
   - Pairs of consecutive words.
   - Example: For the sentence **"I love data science"**, the bigrams are:
     ```
     ["I love", "love data", "data science"]
     ```

### 3. **Trigrams (3-grams)**:
   - Groups of three consecutive words.
   - Example: For the sentence **"I love data science"**, the trigrams are:
     ```
     ["I love data", "love data science"]
     ```

### 4. **N-grams (N > 3)**:
   - Groups of `N` consecutive words.
   - Example: For **"I love data science"**, the 4-gram (N=4) is:
     ```
     ["I love data science"]
     ```

---

## How Do N-grams Work?

1. **Tokenization**: 
   - The text is split into words or tokens.
2. **Grouping**:
   - A window of `N` consecutive tokens slides over the sequence one token at a time, so a text with `T` tokens yields `T - N + 1` N-grams.

For example, given the sentence:  
**"Natural Language Processing is interesting."**

- Unigrams: `["Natural", "Language", "Processing", "is", "interesting"]`
- Bigrams: `["Natural Language", "Language Processing", "Processing is", "is interesting"]`
- Trigrams: `["Natural Language Processing", "Language Processing is", "Processing is interesting"]`
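
The two steps above fit in a short function. This is a minimal sketch that tokenizes on whitespace; the `ngrams` helper is a hypothetical name, not part of any library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Natural Language Processing is interesting".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

If `n` is larger than the number of tokens, the range is empty and the function returns an empty list.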

---

## Applications of N-grams

1. **Text Representation**:
   - N-grams provide a better understanding of context compared to unigrams (single words).

2. **Language Modeling**:
   - Predict the next word from the preceding N-1 words, as in predictive text input or autocomplete.

3. **Text Classification**:
   - Used to extract features for tasks like sentiment analysis, spam detection, or topic modeling.

4. **Speech Recognition**:
   - N-grams help identify common word sequences to improve recognition accuracy.

5. **Machine Translation**:
   - Useful in capturing relationships between words to translate phrases effectively.
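
The language modeling use case can be sketched with bigram counts: for each word, count which words follow it and predict the most frequent one. The corpus and the `predict_next` helper below are hypothetical, and a usable model would need far more text plus smoothing for unseen words.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model needs far more text.
corpus = [
    "i love machine learning",
    "i love data science",
    "i love machine translation",
]

# Count how often each word follows each preceding word (bigram counts).
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("i"))
print(predict_next("love"))
```

Here "love" is followed by "machine" twice and by "data" once, so "machine" is the prediction.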

---

## Advantages of N-grams

1. **Captures Context**:
   - Bigrams and higher-order N-grams capture the relationship between consecutive words, unlike a unigram Bag of Words.

2. **Simplicity**:
   - Easy to compute and implement.

3. **Improved Features**:
   - Provides richer features for machine learning models compared to unigrams.

---

## Limitations of N-grams

1. **High Dimensionality**:
   - As `N` increases, the number of possible N-grams grows exponentially (up to `V^N` for a vocabulary of `V` words), leading to high memory usage and sparsity.

2. **Data Dependency**:
   - Higher-order N-grams require more data to generate meaningful patterns. For example, in a small dataset most trigrams occur only once, so their counts are unreliable.

3. **Loss of Global Context**:
   - Even with N-grams, the model may still not capture long-range dependencies in text.
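
The first two limitations can be seen even on a tiny text: as `N` grows, a larger share of the N-grams tends to be distinct, so each feature is backed by fewer observations. The text and the `distinct_ngrams` helper below are made up for illustration.

```python
def distinct_ngrams(tokens, n):
    """Set of distinct contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Hypothetical toy text with some repetition.
tokens = "the cat sat on the mat and the cat sat on the log".split()

for n in (1, 2, 3):
    grams = distinct_ngrams(tokens, n)
    total = len(tokens) - n + 1
    print(f"n={n}: {len(grams)} distinct out of {total} occurrences")
```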

---

## Implementation in Python

Here’s an example of how to generate N-grams using Python:

### Example: Generate N-grams
```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample Text
text = ["I love natural language processing and machine learning"]

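# Initialize CountVectorizer for bigrams (N=2).
# The default tokenizer lowercases the text and drops one-character tokens,
# so "I" is not part of any bigram.
vectorizer = CountVectorizer(ngram_range=(2, 2))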
X = vectorizer.fit_transform(text)

# Display Vocabulary
print("Bigrams Vocabulary:", vectorizer.get_feature_names_out())

# Display Bigrams Frequency
print("Bigrams Matrix:\n", X.toarray())
```