# Text Representation Techniques: BoW, TF-IDF, and n-grams

## Agenda:
1. **Introduction** - Overview of text representation methods.
2. **Bag of Words (BoW)** - Explanation and implementation.
3. **Term Frequency-Inverse Document Frequency (TF-IDF)** - Explanation and implementation.
4. **n-grams** - Explanation and implementation.
5. **Practical Applications** - Code for applying these methods on example data.
6. **Summary** - Final thoughts and summary of techniques.


## Introduction

In this notebook, we explore three techniques for representing text data numerically: **Bag of Words (BoW)**, **TF-IDF**, and **n-grams**. These representations are commonly used as input features for natural language processing (NLP) tasks such as text classification and sentiment analysis.

We will use simple examples to illustrate how each method works.


## Bag of Words (BoW)

**Bag of Words (BoW)** represents each document as a vector of word counts over a vocabulary shared by the whole corpus. It records how often each word occurs in a document while ignoring grammar and word order.

For example, let's consider the following sentences:

1. **Sentence 1**: "I have a food, I hate it tastes."
2. **Sentence 2**: "I have a pen."
3. **Sentence 3**: "The weather is beautiful."


In [1]:
# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Example sentences
documents = [
    "I have a food, I hate it tastes.",
    "I have a pen.",
    "The weather is beautiful."
]

# Bag of Words (BoW)
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
bow_array = bow_matrix.toarray()

# Displaying the results
print("BoW Representation:\n", bow_array)
print("\nFeature Names (Words):\n", vectorizer.get_feature_names_out())

BoW Representation:
 [[0 1 1 1 0 1 0 1 0 0]
 [0 0 0 1 0 0 1 0 0 0]
 [1 0 0 0 1 0 0 0 1 1]]

Feature Names (Words):
 ['beautiful' 'food' 'hate' 'have' 'is' 'it' 'pen' 'tastes' 'the' 'weather']
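
Note that "I" and "a" do not appear among the feature names. `CountVectorizer` lowercases the text and, by default, only keeps tokens of two or more characters (its default `token_pattern` is `(?u)\b\w\w+\b`). The sketch below applies that regular expression directly with Python's `re` module to show which tokens survive; the variable names are chosen for illustration only.

```python
import re

sentence = "I have a food, I hate it tastes."

# Default token_pattern of CountVectorizer: words of two or more characters.
default_pattern = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps one-character words.
keep_short_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, sentence.lower())
all_tokens = re.findall(keep_short_pattern, sentence.lower())

print(default_tokens)  # ['have', 'food', 'hate', 'it', 'tastes']
print(all_tokens)      # ['i', 'have', 'a', 'food', 'i', 'hate', 'it', 'tastes']
```

If one-character words matter for a task, the second pattern can be passed to the vectorizer as `CountVectorizer(token_pattern=r"(?u)\b\w+\b")`.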


## Term Frequency-Inverse Document Frequency (TF-IDF)

**TF-IDF** refines raw word counts by weighting each word according to how common or rare it is across all documents. It combines two components:
1. **Term Frequency (TF)**: The number of times a term appears in a document.
2. **Inverse Document Frequency (IDF)**: A measure of how rare a term is across the corpus; terms that appear in few documents receive a higher weight.

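The formula for TF-IDF is:

\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{df(t)}\right)
\]

Where:
- \( t \) is the term,
- \( d \) is the document,
- \( \text{TF}(t, d) \) is the number of times \( t \) appears in \( d \),
- \( N \) is the total number of documents,
- \( df(t) \) is the number of documents containing the term \( t \).

Libraries often implement smoothed or normalized variants of this formula, so their output may not match it exactly.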
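
As a worked example, the textbook formula can be evaluated directly. The sketch below is an illustration under stated assumptions: the corpus is tokenized by hand, the logarithm is taken as the natural log, and the helper names `tf`, `idf`, and `tf_idf` are introduced here rather than taken from any library.

```python
import math

# Hand-tokenized toy corpus (an assumption made for this illustration).
corpus = [
    ["have", "food", "hate", "it", "tastes"],
    ["have", "pen"],
    ["the", "weather", "is", "beautiful"],
]

def tf(term, doc):
    # Term frequency: raw count of the term in one document.
    return doc.count(term)

def idf(term, corpus):
    # Inverse document frequency with the natural log.
    # Assumes the term occurs in at least one document (df > 0).
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "have" occurs in 2 of 3 documents, "pen" in only 1 of 3.
print(round(tf_idf("have", corpus[1], corpus), 4))  # 0.4055
print(round(tf_idf("pen", corpus[1], corpus), 4))   # 1.0986
```

Both words occur once in the second document, but "pen" receives the larger weight because it is rarer across the corpus.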


In [2]:
# Importing necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
tfidf_array = tfidf_matrix.toarray()

# Displaying the results
print("TF-IDF Representation:\n", tfidf_array)
print("\nFeature Names (Words):\n", tfidf_vectorizer.get_feature_names_out())

TF-IDF Representation:
 [[0.         0.46735098 0.46735098 0.35543247 0.         0.46735098
  0.         0.46735098 0.         0.        ]
 [0.         0.         0.         0.60534851 0.         0.
  0.79596054 0.         0.         0.        ]
 [0.5        0.         0.         0.         0.5        0.
  0.         0.         0.5        0.5       ]]

Feature Names (Words):
 ['beautiful' 'food' 'hate' 'have' 'is' 'it' 'pen' 'tastes' 'the' 'weather']
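
These values differ from what the plain \( \text{TF} \times \log(N / df) \) formula would give. With its default settings, `TfidfVectorizer` uses a smoothed IDF, \( \ln\left(\frac{1 + N}{1 + df(t)}\right) + 1 \), and then scales each document vector to unit Euclidean (L2) length. The sketch below reproduces that calculation with the standard library; the hand-tokenized corpus and the helper names `smoothed_idf` and `tfidf_row` are assumptions made for this illustration.

```python
import math

# Hand-tokenized corpus, matching the tokens the vectorizer keeps.
corpus = [
    ["have", "food", "hate", "it", "tastes"],
    ["have", "pen"],
    ["the", "weather", "is", "beautiful"],
]

def smoothed_idf(term, corpus):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc, corpus):
    # Raw count times smoothed IDF for every distinct term in the document.
    weights = {term: doc.count(term) * smoothed_idf(term, corpus) for term in set(doc)}
    # L2 normalization: divide by the Euclidean length of the row.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(corpus[1], corpus)
print(f"{row['have']:.4f} {row['pen']:.4f}")  # 0.6053 0.7960
```

The two printed weights agree with the second row of the matrix above, and the third row is 0.5 everywhere because its four terms all have the same weight before normalization.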


## n-grams

**n-grams** are contiguous sequences of \( n \) words from a given text. Unlike single-word counts, they preserve some local word order and context. The most common types are:
1. **Unigrams**: Individual words.
2. **Bigrams**: Pairs of consecutive words.
3. **Trigrams**: Triplets of consecutive words.

For example, using the sentence "I have a food":
- **Unigrams**: ["I", "have", "a", "food"]
- **Bigrams**: ["I have", "have a", "a food"]
- **Trigrams**: ["I have a", "have a food"]
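
The lists above can be generated with a sliding window over the tokens. This is a minimal sketch: the `ngrams` helper is a name introduced for this example, and splitting on whitespace is a simplifying assumption about tokenization.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I have a food".split()

print(ngrams(tokens, 1))  # ['I', 'have', 'a', 'food']
print(ngrams(tokens, 2))  # ['I have', 'have a', 'a food']
print(ngrams(tokens, 3))  # ['I have a', 'have a food']
```

A sentence with \( k \) tokens yields \( k - n + 1 \) n-grams, so longer n-grams are fewer per sentence.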


In [3]:
# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Example sentences
documents = [
    "I have a food, I hate it tastes.",
    "I have a pen.",
    "The weather is beautiful."
]

# n-grams (Unigrams, Bigrams, Trigrams)
vectorizer = CountVectorizer(ngram_range=(1, 3))  # Unigrams, Bigrams, and Trigrams
ngram_matrix = vectorizer.fit_transform(documents)
ngram_array = ngram_matrix.toarray()

# Displaying the results
print("n-grams Representation (Unigrams, Bigrams, Trigrams):\n", ngram_array)
print("\nFeature Names (n-grams):\n", vectorizer.get_feature_names_out())

n-grams Representation (Unigrams, Bigrams, Trigrams):
 [[0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1]]

Feature Names (n-grams):
 ['beautiful' 'food' 'food hate' 'food hate it' 'hate' 'hate it'
 'hate it tastes' 'have' 'have food' 'have food hate' 'have pen' 'is'
 'is beautiful' 'it' 'it tastes' 'pen' 'tastes' 'the' 'the weather'
 'the weather is' 'weather' 'weather is' 'weather is beautiful']


## Summary

In this notebook, we covered three common techniques for representing text numerically:
1. **Bag of Words (BoW)**: Counts how often each word occurs in a document, ignoring word order.
2. **TF-IDF**: Reweights word counts by how rare each word is across the corpus, so that words appearing in few documents count for more.
3. **n-grams**: Represents sequences of consecutive words, which preserves some local word order and context.

These techniques are foundational for many NLP tasks like text classification, sentiment analysis, and information retrieval.