## Unigrams

Unigrams are the simplest n-grams: each word is treated as a single token. For the sentence "The quick brown fox", the unigrams are:

**Unigrams**

| Index | Unigram |
|-------|---------|
| 0     | The     |
| 1     | quick   |
| 2     | brown   |
| 3     | fox     |


**Explanation:** Each word is treated as an independent item. Unigrams record which words occur (and how often) but capture no context or word order.
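As a minimal sketch of the idea, assuming simple whitespace tokenization (real tokenizers also handle punctuation and casing), unigrams can be extracted and counted with the standard library. The comparison at the end illustrates that unigram counts ignore word order:

```python
from collections import Counter

sentence = "The quick brown fox"

# Whitespace tokenization: every word becomes one unigram.
unigrams = sentence.split()
print(unigrams)  # ['The', 'quick', 'brown', 'fox']

# Unigram counts only record how often each word occurs.
counts = Counter(unigrams)

# A reordered sentence produces exactly the same counts.
shuffled_counts = Counter("fox brown quick The".split())
print(counts == shuffled_counts)  # True
```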

## Bigrams

Bigrams are pairs of consecutive words. They carry more context than unigrams because each one records which word immediately follows another. The bigrams for our sentence are shown in the table below:

**Bigrams**

| Index | Bigram      |
|-------|-------------|
| 0     | The quick   |
| 1     | quick brown |
| 2     | brown fox   |


**Explanation:** Each entry is a pair of adjacent words from the sentence. Because bigrams capture local context, they are useful for tasks such as predictive text input, where the previous word helps predict the next, and sentiment analysis, where a pair like "not good" means something different from its individual words.
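The sketch below first builds the bigrams of the example sentence, then shows how bigram counts could drive a very simple next-word suggestion. The three-sentence corpus and the `suggest` helper are hypothetical and exist only for illustration; a real predictive-text system would use far more data and smoothing.

```python
from collections import Counter, defaultdict

# Bigrams of the example sentence: pair each word with its successor.
words = "The quick brown fox".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

# Hypothetical toy corpus, used only to illustrate next-word suggestion.
corpus = ["the quick brown fox", "the quick red fox", "the lazy dog"]

# For every word, count which words follow it.
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for first, second in zip(tokens, tokens[1:]):
        following[first][second] += 1

def suggest(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(suggest("the"))  # quick
```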

## Trigrams

Trigrams consist of three consecutive words, giving a wider window of context than bigrams. The trigrams for our sentence are:

**Trigrams**

| Index | Trigram         |
|-------|-----------------|
| 0     | The quick brown |
| 1     | quick brown fox |


**Explanation:** Trigrams capture a wider window of context than bigrams: each one records a word together with the two words that precede it. They are particularly valuable in language modeling and translation, where the order of words is crucial.
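Unigrams, bigrams, and trigrams are all the same sliding-window operation with a different window size. A minimal sketch, using a hypothetical helper named `ngrams` and whitespace tokenization, makes this explicit. A sentence of `len(tokens)` words yields `len(tokens) - n + 1` n-grams, which is why the four-word example has three bigrams and two trigrams.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox".split()

trigrams = ngrams(tokens, 3)
print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox')]

# The same helper covers the other cases by changing n.
print(len(ngrams(tokens, 1)), len(ngrams(tokens, 2)), len(ngrams(tokens, 3)))
# 4 3 2
```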

## Understanding the Tables

**Unigrams Table:** Shows individual words. While it's good for basic text analysis, it lacks contextual information.

**Bigrams Table:** Provides pairs of consecutive words, introducing a basic level of context and sequence, which is absent in unigrams.

**Trigrams Table:** Offers a richer context by including sequences of three words, making it more effective for capturing the nuances of language.

Using these n-grams, we can build features for machine learning models in natural language processing tasks. Unigrams are useful for frequency-based analysis but ignore context. Bigrams and trigrams preserve more of the sentence structure, which can be crucial for understanding the meaning, sentiment, or intent of a text.

Higher-order n-grams such as trigrams are particularly useful in tasks like speech recognition, machine translation, or sentiment analysis, where the exact sequence of words significantly affects the meaning.

```python
# pip install -U scikit-learn
```

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
text = ["The quick brown fox"]

# Initialize CountVectorizer for unigrams
vectorizer_unigrams = CountVectorizer(ngram_range=(1, 1))
unigrams = vectorizer_unigrams.fit_transform(text)

# Initialize CountVectorizer for bigrams
vectorizer_bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams = vectorizer_bigrams.fit_transform(text)

# Initialize CountVectorizer for trigrams
vectorizer_trigrams = CountVectorizer(ngram_range=(3, 3))
trigrams = vectorizer_trigrams.fit_transform(text)

# Get the feature names (the n-gram vocabulary) learned by each vectorizer
unigram_features = vectorizer_unigrams.get_feature_names_out()
bigram_features = vectorizer_bigrams.get_feature_names_out()
trigram_features = vectorizer_trigrams.get_feature_names_out()

# Print the results
print("Unigrams:\n", unigram_features)
print("\nBigrams:\n", bigram_features)
print("\nTrigrams:\n", trigram_features)
```


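Running the code prints:

```text
Unigrams:
 ['brown' 'fox' 'quick' 'the']

Bigrams:
 ['brown fox' 'quick brown' 'the quick']

Trigrams:
 ['quick brown fox' 'the quick brown']
```

Note that `CountVectorizer` lowercases the text by default and returns its features in alphabetical order, so the output differs from the tables above in casing and ordering, not in content.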
