### N-grams: contiguous sequences of words

**N-grams are contiguous sequences of 𝑛 items** (words, characters, or tokens) extracted from a given text. In Natural Language Processing (NLP), they are commonly used as features for text classification and as the building blocks of simple language models.  

**Key Concepts:**  
- Unigrams: Single words/tokens.  
Example: "I love NLP" → ["I", "love", "NLP"]    

- Bigrams: Two consecutive words.   
Example: "I love NLP" → [("I", "love"), ("love", "NLP")]    

- Trigrams: Three consecutive words.  
Example: "I love NLP" → [("I", "love", "NLP")]    

- N-grams: General form for sequences of 𝑛 consecutive items.
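
The definitions above amount to sliding a window of size 𝑛 over a sequence. A minimal sketch in plain Python, assuming whitespace tokenization for the word case; `make_ngrams` is a hypothetical helper name used only for illustration, not a library function:

```python
def make_ngrams(items, n):
    """Return all contiguous n-item windows of `items` as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n staggered views of the sequence; zip stops at the shortest view
    return list(zip(*(items[i:] for i in range(n))))


words = "I love NLP".split()      # assumption: split on whitespace
print(make_ngrams(words, 1))      # unigrams, as 1-tuples
print(make_ngrams(words, 2))      # bigrams
print(make_ngrams(words, 3))      # trigrams
print(make_ngrams("NLP", 2))      # character bigrams of a string
```

A sequence of k items yields k - n + 1 n-grams, and none at all when k < n.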

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the feature names (vocabulary terms, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # (3, 12): 3 documents x 12 unique words

Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay'
 'terrible' 'the' 'was']
Document-Term Matrix:
 [[0 1 0 0 1 1 1 0 0 0 1 1]
 [1 0 1 0 0 0 1 1 1 0 1 1]
 [0 0 0 1 1 0 1 0 0 1 1 1]]
(3, 12)


In [3]:
len(vectorizer.vocabulary_)

12

In [5]:
vectorizer.vocabulary_

{'loved': 5,
 'the': 10,
 'movie': 6,
 'it': 4,
 'was': 11,
 'fantastic': 1,
 'okay': 8,
 'but': 0,
 'not': 7,
 'great': 2,
 'hated': 3,
 'terrible': 9}
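
Note that "I" is missing from the vocabulary and "The" appears as "the". By default, `CountVectorizer` lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters and treats punctuation as a separator. A rough sketch of that default tokenization using only the `re` module; the helper name `default_tokens` is hypothetical, and this approximates the default behavior rather than reproducing the library's implementation:

```python
import re

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Lowercase the text, then keep tokens with at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(default_tokens("I loved the movie, it was fantastic!"))
# ['loved', 'the', 'movie', 'it', 'was', 'fantastic']
```

To keep single-character tokens such as "i", a pattern like `token_pattern=r"(?u)\b\w+\b"` can be passed to `CountVectorizer`.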

In [8]:
# Unigrams only: ngram_range=(1, 1) is the default, so the output matches the previous cell

from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1))

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the feature names (vocabulary terms, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # (3, 12): 3 documents x 12 unigrams

Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay'
 'terrible' 'the' 'was']
Document-Term Matrix:
 [[0 1 0 0 1 1 1 0 0 0 1 1]
 [1 0 1 0 0 0 1 1 1 0 1 1]
 [0 0 0 1 1 0 1 0 0 1 1 1]]
(3, 12)


In [9]:
# Unigrams and bigrams: ngram_range=(1, 2)

from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,2))

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the feature names (unigrams and bigrams, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # (3, 24): 3 documents x (12 unigrams + 12 bigrams)

Vocabulary: ['but' 'but not' 'fantastic' 'great' 'hated' 'hated the' 'it' 'it was'
 'loved' 'loved the' 'movie' 'movie it' 'movie was' 'not' 'not great'
 'okay' 'okay but' 'terrible' 'the' 'the movie' 'was' 'was fantastic'
 'was okay' 'was terrible']
Document-Term Matrix:
 [[0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0]
 [1 1 0 1 0 0 0 0 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 0]
 [0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 1 0 0 1]]
(3, 24)


In [10]:
# Bigrams only: ngram_range=(2, 2)

from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2,2))

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the feature names (bigrams, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # (3, 12): 3 documents x 12 bigrams

Vocabulary: ['but not' 'hated the' 'it was' 'loved the' 'movie it' 'movie was'
 'not great' 'okay but' 'the movie' 'was fantastic' 'was okay'
 'was terrible']
Document-Term Matrix:
 [[0 0 1 1 1 0 0 0 1 1 0 0]
 [1 0 0 0 0 1 1 1 1 0 1 0]
 [0 1 1 0 1 0 0 0 1 0 0 1]]
(3, 12)


In [11]:
# Trigrams only: ngram_range=(3, 3)

from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer(ngram_range=(3,3))

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the feature names (trigrams, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # (3, 11): 3 documents x 11 trigrams

Vocabulary: ['but not great' 'hated the movie' 'it was fantastic' 'it was terrible'
 'loved the movie' 'movie it was' 'movie was okay' 'okay but not'
 'the movie it' 'the movie was' 'was okay but']
Document-Term Matrix:
 [[0 0 1 0 1 1 0 0 1 0 0]
 [1 0 0 0 0 0 1 1 0 1 1]
 [0 1 0 1 0 1 0 0 1 0 0]]
(3, 11)
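
The matrix widths seen so far (12, 24, 12, and 11 columns) are the numbers of distinct n-grams in the corpus for each `ngram_range`. A plain-Python sketch that recomputes them, assuming lowercasing and the two-or-more-character token pattern that `CountVectorizer` uses by default; `tokenize` and `vocabulary_for` are hypothetical helpers, not part of scikit-learn:

```python
import re

corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]


def tokenize(text):
    """Approximate the default tokenization: lowercase, tokens of 2+ word chars."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def vocabulary_for(docs, ngram_range):
    """Collect the distinct space-joined n-grams for every n in the range."""
    min_n, max_n = ngram_range
    vocab = set()
    for doc in docs:
        toks = tokenize(doc)
        for n in range(min_n, max_n + 1):
            # A document with k tokens contributes k - n + 1 windows of size n
            for i in range(len(toks) - n + 1):
                vocab.add(" ".join(toks[i:i + n]))
    return sorted(vocab)


for rng in [(1, 1), (1, 2), (2, 2), (3, 3)]:
    print(rng, len(vocabulary_for(corpus, rng)))
# (1, 1) 12
# (1, 2) 24
# (2, 2) 12
# (3, 3) 11
```

The trigram vocabulary is smaller than the bigram one here only because the documents are short; on larger corpora the number of distinct n-grams usually grows quickly with 𝑛.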


In [12]:
# Using NLTK
 
from nltk.util import ngrams

# Input text
text = "I love learning NLP and Python".split()

# Generate n-grams (e.g., bigrams)
bigrams = list(ngrams(text, 2))
trigrams = list(ngrams(text, 3))

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)


Bigrams: [('I', 'love'), ('love', 'learning'), ('learning', 'NLP'), ('NLP', 'and'), ('and', 'Python')]
Trigrams: [('I', 'love', 'learning'), ('love', 'learning', 'NLP'), ('learning', 'NLP', 'and'), ('NLP', 'and', 'Python')]
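
Because each n-gram above is a tuple, it is hashable and can be counted directly, which is the idea behind the counts in a document-term matrix. A minimal sketch with `collections.Counter`; it uses a `zip`-based stand-in for `nltk.util.ngrams` so that it runs without NLTK installed, and the sample sentence is made up for illustration:

```python
from collections import Counter


def bigrams_of(tokens):
    """Pair each token with its successor (stand-in for ngrams(tokens, 2))."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical sample text, chosen so that some bigrams repeat
tokens = "the movie was great and the movie was long".split()
counts = Counter(bigrams_of(tokens))

# Show the two most frequent bigrams with their counts
for gram, freq in counts.most_common(2):
    print(gram, freq)
```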


# End!