
## 8.3.1 N-grams

### Explanation of N-gram Models

N-grams are a fundamental concept in Natural Language Processing (NLP) used to model the structure of text. An N-gram is a contiguous sequence of \( N \) items from a given sample of text or speech. In text, these items are most often words or characters, although other units such as syllables can also be used.

For example:
- Unigram (1-gram): "The"
- Bigram (2-gram): "The cat"
- Trigram (3-gram): "The cat sat"
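
The examples above can be reproduced with a few lines of plain Python. The sketch below is only illustrative: the helper name `ngrams` and the whitespace tokenization are assumptions made for this example, not part of any library used later in this section.

```python
def ngrams(items, n):
    """Return every contiguous window of n items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows (none if it is shorter than n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word-level n-grams from a whitespace-tokenized sentence
tokens = "The cat sat".split()
print(ngrams(tokens, 1))  # [('The',), ('cat',), ('sat',)]
print(ngrams(tokens, 2))  # [('The', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))  # [('The', 'cat', 'sat')]

# Character-level n-grams work the same way, since a string is a sequence
print(["".join(g) for g in ngrams("cat", 2)])  # ['ca', 'at']
```

The same function works for characters and words because it only relies on slicing a sequence.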

An N-gram model predicts the next item in a sequence from the previous \( N-1 \) items, typically by estimating conditional probabilities from counts in a training corpus. These models are simple, yet they have long been used in tasks such as text generation, machine translation, and speech recognition.
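
To make the idea of predicting from the previous \( N-1 \) items concrete, here is a minimal sketch of a bigram model (\( N = 2 \)), where the next word depends only on the one word before it. The function names `train_bigram_model` and `next_word_probs` and the toy corpus are assumptions chosen for illustration, and the probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """For each word, count how often every other word follows it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers

def next_word_probs(followers, prev):
    """Relative-frequency estimate of P(next | prev); empty if prev is unseen."""
    counts = followers.get(prev, Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}

# Toy corpus (illustrative only)
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog chased the cat",
]

model = train_bigram_model(corpus)
probs = next_word_probs(model, "the")
print(probs["cat"])                 # 0.5 ("cat" follows "the" in 3 of 6 cases)
print(max(probs, key=probs.get))    # cat
```

A trigram model would follow the same pattern, keyed on the previous two words instead of one.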


___
___
### Readings:
- [What are N-Grams?](https://kavita-ganesan.com/what-are-n-grams/)
- [What Are N-Grams and How to Implement Them in Python?](https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/)
- [Exploring N-gram Models in Natural Language Processing](https://readmedium.com/en/https:/medium.com/@evertongomede/exploring-n-gram-models-in-natural-language-processing-bf5852b32050)
___
___

### Benefits and Scenarios for Using N-grams

N-grams have several benefits and are used in various NLP scenarios:

- **Simplicity**: N-gram models are relatively easy to understand and implement.
- **Text Generation**: N-grams can be used to generate new text by predicting the next word in a sequence.
- **Language Modeling**: They are used in building language models that can predict the probability of a sequence of words.
- **Feature Extraction**: N-grams serve as features in text classification and sentiment analysis tasks.
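
As a rough illustration of the language-modeling use listed above, the following sketch scores whole sentences with a bigram model. It assumes whitespace tokenization, hypothetical boundary markers `<s>` and `</s>`, and add-one (Laplace) smoothing so that unseen bigrams do not get zero probability. These are choices made for this example, not the only way to build such a model.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers

def train(sentences):
    """Collect unigram and bigram counts, padding each sentence with boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, nxt, unigrams, bigrams):
    """Add-one smoothed estimate of P(nxt | prev)."""
    vocab_size = len(unigrams) - 1  # START is never predicted, so exclude it
    return (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + vocab_size)

def sentence_log_prob(sentence, unigrams, bigrams):
    """Sum of log P(word | previous word) over the padded sentence."""
    tokens = [START] + sentence.lower().split() + [END]
    return sum(
        math.log(bigram_prob(prev, nxt, unigrams, bigrams))
        for prev, nxt in zip(tokens, tokens[1:])
    )

# Toy corpus (illustrative only)
corpus = [
    "i love machine learning",
    "machine learning is amazing",
    "i am learning new techniques in machine learning",
]
unigrams, bigrams = train(corpus)

fluent = sentence_log_prob("machine learning is amazing", unigrams, bigrams)
scrambled = sentence_log_prob("amazing is learning machine", unigrams, bigrams)
print(fluent > scrambled)  # True: the word order seen in training scores higher
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.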

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "I love machine learning.",
    "Machine learning is amazing.",
    "I am learning new techniques in machine learning."
]

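# Create a CountVectorizer that extracts both unigrams and bigrams (ngram_range=(1, 2)).
# Note: by default it lowercases the text and drops one-character tokens such as "I".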
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get the feature names (n-grams)
feature_names = vectorizer.get_feature_names_out()

# Sum up the occurrences of each n-gram
sum_words = X.sum(axis=0)

# Create a dictionary of n-grams and their counts
ngram_counts = {word: sum_words[0, idx] for word, idx in vectorizer.vocabulary_.items()}

# Display the n-grams and their counts
for ngram, count in ngram_counts.items():
    print(f"{ngram}: {count}")


love: 1
machine: 3
learning: 4
love machine: 1
machine learning: 3
is: 1
amazing: 1
learning is: 1
is amazing: 1
am: 1
new: 1
techniques: 1
in: 1
am learning: 1
learning new: 1
new techniques: 1
techniques in: 1
in machine: 1


___
___
### Making a DataFrame

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "The cat sat on the mat",
    "The dog barked at the mailman",
    "The quick brown fox",
    "The cat chased the mouse",
    "The dog chased the cat",
    "The mouse ran away",
    "The quick cat jumped over the lazy dog",
    "The quick brown dog",
    "The lazy dog lay in the sun"
]

# Initialize the CountVectorizer for bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(documents)

# Create a DataFrame to view the bigrams and their frequencies
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)


   at the  barked at  brown dog  brown fox  cat chased  cat jumped  cat sat  \
0       0          0          0          1           0           0        0   
1       0          0          0          0           0           0        1   
2       1          1          0          0           0           0        0   
3       0          0          0          1           0           0        0   
4       0          0          0          0           1           0        0   
5       0          0          0          0           0           0        0   
6       0          0          0          0           0           0        0   
7       0          0          0          0           0           1        0   
8       0          0          1          0           0           0        0   
9       0          0          0          0           0           0        0   

   chased the  dog barked  dog chased  ...  ran away  sat on  the cat  \
0           0           0           0  ...         0     

### Conclusion

N-gram models provide a foundational approach to understanding and generating text based on contiguous sequences of items. Despite their simplicity, they are widely used in NLP applications and serve as a basis for more complex models. Libraries such as scikit-learn (`sklearn`) make it straightforward to extract and count these sequences in text data.