# N-Gram Implementation using NLTK

This notebook demonstrates how to generate **unigrams**, **bigrams**, and **trigrams** in Python.

We'll clean the text with `nltk` (stopword removal and stemming), then extract n-gram features with scikit-learn's `CountVectorizer` via its `ngram_range` parameter. N-grams are useful for understanding local word context and co-occurrence patterns.
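
For token-level n-grams outside a vectorizer, NLTK provides an `ngrams` helper that slides a window of size `n` over a token list. The snippet below is a minimal sketch; the sentence is a made-up example and the whitespace split stands in for a real tokenizer.

```python
from nltk.util import ngrams  # the same helper is exposed as nltk.ngrams

# Hypothetical example sentence, tokenized with a plain whitespace split
tokens = "free entry in a weekly competition".split()

# Each call returns a generator of tuples; list() materializes it
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
print(trigrams)
```

A sentence with 6 tokens yields 6 unigrams, 5 bigrams, and 4 trigrams, since each window of size `n` needs `n` consecutive tokens.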


In [24]:
# Import the pandas library for data manipulation and analysis, especially for working with DataFrames
import pandas as pd

In [25]:
# Read the CSV file 'spamclassification.csv' from the 'Datasets' folder using pandas
# Specifying encoding='latin1' to avoid UnicodeDecodeError with special characters
messages = pd.read_csv('Datasets/spamclassification.csv', encoding='latin1')

# Display the first 5 rows of the DataFrame to get a quick look at the data
messages.head()

Unnamed: 0,Label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Data Cleaning and Preprocessing

In [26]:
# Import the regular expression module for text cleaning
import re

# Import the Natural Language Toolkit (nltk) library for text processing
import nltk

# Import a list of common English stopwords (e.g., "the", "is", "and") from nltk
from nltk.corpus import stopwords

# Import the Porter Stemmer algorithm from nltk for stemming words (reducing words to their base/root form)
from nltk.stem.porter import PorterStemmer

# Initialize the PorterStemmer object
ps = PorterStemmer()

In [31]:
corpus = []  # Cleaned and processed messages, one string per row

# Build the stopword set once; calling stopwords.words('english') inside the
# loop reloads the list for every word. Needs nltk.download('stopwords') once.
stop_words = set(stopwords.words('english'))

for text in messages['text']:
    # Replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', text)

    # Convert all characters to lowercase
    review = review.lower()

    # Split the message into individual words (tokenization)
    review = review.split()

    # Remove stopwords, then stem each remaining word
    review = [ps.stem(word) for word in review if word not in stop_words]

    # Join the processed words back into a single string
    review = ' '.join(review)

    # Append the cleaned message to the corpus list
    corpus.append(review)
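
To see what the loop does to a single message, the steps can be wrapped in a small function. This is only a sketch: `clean_message` and `DEMO_STOPWORDS` are hypothetical names, and the tiny hand-picked stopword set is an assumption that lets the example run without downloading the NLTK stopwords corpus.

```python
import re
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Assumption: a small stand-in for stopwords.words('english')
DEMO_STOPWORDS = {"i", "the", "a", "to", "is", "your", "you"}

def clean_message(text, stop_words=DEMO_STOPWORDS):
    """Hypothetical helper: keep letters only, lowercase, drop stopwords, stem."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(ps.stem(w) for w in words if w not in stop_words)

cleaned = clean_message("Please claim your FREE prize today!!!")
print(cleaned)
```

Punctuation and case are gone, the stopword `your` is dropped, and `please` is stemmed to `pleas`, which matches the stemmed forms that appear in the vocabularies printed by the vectorizer cells.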


## Create Bag of Words Model with N-grams

In [32]:
# Import the CountVectorizer class from scikit-learn's text feature extraction module
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
print("🔹 Unigram (ngram_range=(1, 1))")
cv = CountVectorizer(max_features=100, binary=True, ngram_range=(1,1))
X = cv.fit_transform(corpus).toarray()
print("Vocabulary (Unigrams):")
print(cv.vocabulary_)

🔹 Unigram (ngram_range=(1, 1))
Vocabulary (Unigrams):
{'go': 22, 'great': 25, 'got': 24, 'wat': 90, 'ok': 56, 'free': 18, 'win': 94, 'text': 77, 'txt': 85, 'say': 67, 'alreadi': 0, 'think': 80, 'hey': 28, 'week': 92, 'back': 3, 'like': 38, 'still': 73, 'send': 69, 'even': 15, 'friend': 19, 'prize': 62, 'claim': 7, 'call': 4, 'mobil': 47, 'co': 8, 'home': 30, 'want': 89, 'today': 82, 'cash': 6, 'day': 12, 'repli': 64, 'www': 96, 'right': 65, 'thank': 78, 'take': 75, 'time': 81, 'use': 87, 'messag': 44, 'oh': 55, 'ye': 97, 'make': 42, 'way': 91, 'feel': 16, 'dont': 14, 'miss': 46, 'ur': 86, 'tri': 84, 'da': 11, 'lor': 39, 'meet': 43, 'realli': 63, 'get': 20, 'know': 33, 'love': 40, 'let': 37, 'work': 95, 'wait': 88, 'yeah': 98, 'tell': 76, 'pleas': 61, 'msg': 49, 'see': 68, 'pl': 60, 'need': 51, 'tomorrow': 83, 'hope': 31, 'well': 93, 'lt': 41, 'gt': 26, 'ask': 1, 'morn': 48, 'happi': 27, 'sorri': 72, 'give': 21, 'new': 52, 'find': 17, 'year': 99, 'later': 35, 'pick': 59, 'good': 23, 'co

In [34]:
print("🔹 Bigram (ngram_range=(2, 2))")
cv = CountVectorizer(max_features=100, binary=True, ngram_range=(2,2))
X = cv.fit_transform(corpus).toarray()
print("Vocabulary (Bigrams):")
print(cv.vocabulary_)

🔹 Bigram (ngram_range=(2, 2))
Vocabulary (Bigrams):
{'free entri': 32, 'claim call': 17, 'call claim': 3, 'free call': 31, 'call mobil': 9, 'chanc win': 16, 'txt word': 90, 'let know': 53, 'go home': 36, 'pleas call': 67, 'lt gt': 57, 'want go': 97, 'like lt': 54, 'sorri call': 80, 'call later': 8, 'ur award': 91, 'call custom': 4, 'custom servic': 24, 'cash prize': 15, 'po box': 68, 'tri contact': 87, 'draw show': 29, 'show prize': 79, 'prize guarante': 73, 'guarante call': 43, 'valid hr': 95, 'select receiv': 76, 'privat account': 71, 'account statement': 0, 'statement show': 82, 'call identifi': 5, 'identifi code': 49, 'code expir': 21, 'urgent mobil': 94, 'caller prize': 13, 'call landlin': 7, 'wat time': 98, 'ur mob': 93, 'gud ni': 45, 'new year': 62, 'send stop': 78, 'get back': 34, 'co uk': 20, 'gud mrng': 44, 'nice day': 63, 'lt decim': 56, 'decim gt': 26, 'txt nokia': 88, 'good morn': 38, 'ur friend': 92, 'good night': 39, 'network min': 61, 'repli call': 75, 'last night': 52,

In [35]:
print("🔹 Trigram (ngram_range=(3, 3))")
cv = CountVectorizer(max_features=100, binary=True, ngram_range=(3,3))
X = cv.fit_transform(corpus).toarray()
print("Vocabulary (Trigrams):")
print(cv.vocabulary_)

🔹 Trigram (ngram_range=(3, 3))
Vocabulary (Trigrams):
{'like lt gt': 43, 'sorri call later': 80, 'pleas call custom': 66, 'call custom servic': 6, 'custom servic repres': 22, 'guarante cash prize': 35, 'draw show prize': 24, 'show prize guarante': 78, 'prize guarante call': 70, 'special select receiv': 82, 'speak live oper': 81, 'live oper claim': 45, 'privat account statement': 68, 'account statement show': 0, 'call identifi code': 7, 'identifi code expir': 40, 'bonu caller prize': 4, 'select receiv award': 77, 'match pleas call': 54, 'urgent tri contact': 96, 'lt decim gt': 47, 'secret admir look': 76, 'admir look make': 1, 'look make contact': 46, 'make contact find': 53, 'contact find reveal': 19, 'find reveal think': 29, 'reveal think ur': 73, 'think ur special': 86, 'ur special call': 94, 'draw txt music': 25, 'www ldew com': 99, 'anytim network min': 2, 'camcord repli call': 12, 'cant pick phone': 13, 'pick phone right': 64, 'phone right pl': 63, 'right pl send': 74, 'pl send me
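
The effect of `ngram_range` is easier to inspect on a tiny corpus than on the full dataset. The sketch below uses a made-up two-message corpus (an assumption, written as if already cleaned and stemmed) and omits `max_features` so that every n-gram is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-message corpus, already cleaned and stemmed
toy_corpus = ["free entri win prize", "call claim prize"]

features = {}
for n in (1, 2, 3):
    cv = CountVectorizer(binary=True, ngram_range=(n, n))
    cv.fit(toy_corpus)
    # vocabulary_ maps each n-gram to a column index; sort the keys for display
    features[n] = sorted(cv.vocabulary_)
    print(n, features[n])
```

N-grams never cross message boundaries, so the two messages contribute 3 + 2 bigrams and 2 + 1 trigrams.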

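When we set `ngram_range=(1, 1)` in `CountVectorizer`, we extract only **unigrams**, i.e., individual words.

This is exactly what the **Bag of Words (BoW)** model does:
- It builds a vocabulary of the unique words in the corpus (capped here at the 100 most frequent terms by `max_features=100`)
- It creates a vector for each message based on the **frequency** of each word (with `binary=True`, as in the cells above, each entry records only presence or absence as 1 or 0)
- It **ignores word order and context**

So,
- `ngram_range=(1, 1)` → Unigrams only → **BoW representation**
- `ngram_range=(2, 2)` or `(3, 3)` → Bigrams only or trigrams only → features that keep some local word order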

🔍 Example:
> Sentence: `"Food is good"`

Unigrams: `["food", "is", "good"]`  
BoW vector: `[1, 1, 1]` for those 3 words

➡️ Conclusion: **BoW is just a unigram model**, which is why `ngram_range=(1, 1)` produces the same feature set as a plain Bag of Words vectorizer.
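
The example can be checked directly. The sketch below compares an explicit `ngram_range=(1, 1)` vectorizer with a default `CountVectorizer()` (whose `ngram_range` default is `(1, 1)`), and then uses a second, made-up sentence with a repeated word to show how `binary=True` differs from raw counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["Food is good"]

# Explicit unigram setting
unigram_cv = CountVectorizer(ngram_range=(1, 1))
unigram_vec = unigram_cv.fit_transform(sentence).toarray()

# Default settings, i.e. a plain Bag of Words vectorizer
bow_cv = CountVectorizer()
bow_vec = bow_cv.fit_transform(sentence).toarray()

print(unigram_cv.vocabulary_)   # word -> column index (alphabetical order)
print(unigram_vec)

# Hypothetical sentence with a repeated word: counts vs. presence/absence
repeated = ["food food good"]
count_vec = CountVectorizer(binary=False).fit_transform(repeated).toarray()
binary_vec = CountVectorizer(binary=True).fit_transform(repeated).toarray()
print(count_vec, binary_vec)
```

Both vectorizers learn the same vocabulary and produce the same vector. Note that the columns follow alphabetical order (`food`, `good`, `is`), not the order of words in the sentence.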
