## **N-Grams**

---

### **What is an N-Gram?**

An **N-gram** is a sequence of **N consecutive tokens** from a text. In this notebook the tokens are words.

* `Unigram` = a single word (N = 1)
* `Bigram` = two consecutive words (N = 2)
* `Trigram` = three consecutive words (N = 3)
* and so on for larger N

---

### **Why Use N-Grams?**

* They **capture word combinations**, which preserves some local word order and context.
* They help models pick up **phrases** (for example, "not good"), not just single words.
* They are useful for tasks like **text classification, sentiment analysis,** and **language modeling**.
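
As a small illustration of the first point, the sketch below uses plain Python and two made-up sentences (not part of the dataset used later). Both sentences contain exactly the same words, so their unigram counts match, but their bigrams differ because bigrams keep local word order.

```python
from collections import Counter

# Two made-up sentences with the same words in a different order.
a = "dog bites man".split()
b = "man bites dog".split()

unigrams_a, unigrams_b = Counter(a), Counter(b)

# Pair each word with its successor to form bigrams.
bigrams_a = Counter(zip(a, a[1:]))
bigrams_b = Counter(zip(b, b[1:]))

print(unigrams_a == unigrams_b)  # True: unigrams cannot tell the sentences apart
print(bigrams_a == bigrams_b)    # False: bigrams can
```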

---

### **Example:**

Text: `"I love cats"`

* **Unigrams**: `["I", "love", "cats"]`
* **Bigrams**: `["I love", "love cats"]`
* **Trigrams**: `["I love cats"]`
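
The lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the `ngrams` helper is a hypothetical name introduced here, and it splits on whitespace, which is simpler than the tokenization `CountVectorizer` applies later in this notebook.

```python
def ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated words."""
    words = text.split()
    # There are len(words) - n + 1 windows of size n (none if n > len(words)).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "I love cats"
for n in (1, 2, 3):
    print(n, ngrams(text, n))
# 1 ['I', 'love', 'cats']
# 2 ['I love', 'love cats']
# 3 ['I love cats']
```

Asking for more words than the text contains, for example `ngrams(text, 4)`, returns an empty list.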

---

### **Summary**

| Aspect    | Details                                               |
| --------- | ----------------------------------------------------- |
| Purpose   | Capture short word sequences (local context)          |
| Pros      | Keeps local word order that unigram BoW discards      |
| Cons      | Larger vocabulary, sparser vectors                    |
| Use cases | Text classification, sentiment analysis, autocomplete |

---



In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output': [1, 1, 0, 0]})
df

|   | text                  | output |
| - | --------------------- | ------ |
| 0 | people watch campusx  | 1      |
| 1 | campusx watch campusx | 1      |
| 2 | people write comment  | 0      |
| 3 | campusx write comment | 0      |


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Bigrams only: ngram_range=(min_n, max_n)
cv = CountVectorizer(ngram_range=(2, 2))

In [6]:
BOW = cv.fit_transform(df['text'])  # learn the vocabulary and build the count matrix

In [7]:
print(cv.vocabulary_)

{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}


In [8]:
print(BOW[0].toarray())

[[0 0 1 0 1 0]]


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Trigrams only
cv = CountVectorizer(ngram_range=(3, 3))

In [10]:
BOW = cv.fit_transform(df['text'])  # learn the vocabulary and build the count matrix

In [11]:
print(cv.vocabulary_)

{'people watch campusx': 2, 'campusx watch campusx': 0, 'people write comment': 3, 'campusx write comment': 1}


In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams together
cv = CountVectorizer(ngram_range=(1, 2))

In [13]:
BOW = cv.fit_transform(df['text'])  # learn the vocabulary and build the count matrix

In [14]:
print(cv.vocabulary_)

{'people': 4, 'watch': 7, 'campusx': 0, 'people watch': 5, 'watch campusx': 8, 'campusx watch': 1, 'write': 9, 'comment': 3, 'people write': 6, 'write comment': 10, 'campusx write': 2}


In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams, bigrams, and trigrams together
cv = CountVectorizer(ngram_range=(1, 3))

In [16]:
BOW = cv.fit_transform(df['text'])  # learn the vocabulary and build the count matrix

In [17]:
print(cv.vocabulary_)

{'people': 6, 'watch': 11, 'campusx': 0, 'people watch': 7, 'watch campusx': 12, 'people watch campusx': 8, 'campusx watch': 1, 'campusx watch campusx': 2, 'write': 13, 'comment': 5, 'people write': 9, 'write comment': 14, 'people write comment': 10, 'campusx write': 3, 'campusx write comment': 4}
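
The runs above can be compared side by side by counting the vocabulary size for each `ngram_range`. This is a small sketch on the same toy corpus, rebuilt here as a plain list. The `sizes` dictionary and the extra `(1, 1)` unigram-only setting are additions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same four toy documents as in the DataFrame above.
docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

sizes = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)]:
    cv = CountVectorizer(ngram_range=ngram_range)
    cv.fit(docs)
    # One vocabulary entry per distinct n-gram, i.e. one column in the matrix.
    sizes[ngram_range] = len(cv.vocabulary_)

print(sizes)
# {(1, 1): 5, (2, 2): 6, (3, 3): 4, (1, 2): 11, (1, 3): 15}
```

Even on four short sentences, widening the range from `(1, 1)` to `(1, 3)` triples the number of columns, which is the "larger vector" cost listed in the summary table.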


### **Let's Try With the Same Dataset (`spam.csv`)**

In [2]:
import pandas as pd

In [3]:
spam_classifier = pd.read_csv("spam.csv")

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# Bigrams over lowercased, letters-only tokens with English stop words removed
cv = CountVectorizer(
    lowercase=True,
    stop_words='english',
    token_pattern='[a-zA-Z]+',
    ngram_range=(2, 2),
)

In [14]:
BOW = cv.fit_transform(spam_classifier['Message'])  # learn the vocabulary and build the count matrix

In [15]:
cv.vocabulary_

{'jurong point': 11615,
 'point crazy': 18409,
 'crazy available': 4377,
 'available bugis': 1102,
 'bugis n': 2414,
 'n great': 15924,
 'great world': 9168,
 'world la': 28425,
 'la e': 12373,
 'e buffet': 6217,
 'buffet cine': 2410,
 'cine got': 3405,
 'got amore': 8869,
 'amore wat': 544,
 'ok lar': 17028,
 'lar joking': 12452,
 'joking wif': 11567,
 'wif u': 27948,
 'u oni': 25930,
 'free entry': 7689,
 'entry wkly': 6638,
 'wkly comp': 28158,
 'comp win': 3934,
 'win fa': 27997,
 'fa cup': 6909,
 'cup final': 4489,
 'final tkts': 7275,
 'tkts st': 24839,
 'st text': 22546,
 'text fa': 23929,
 'fa receive': 6910,
 'receive entry': 19484,
 'entry question': 6633,
 'question std': 19013,
 'std txt': 22700,
 'txt rate': 25577,
 'rate t': 19235,
 't c': 23293,
 'c s': 2606,
 's apply': 20207,
 'apply s': 744,
 'u dun': 25755,
 'dun say': 6168,
 'say early': 20791,
 'early hor': 6306,
 'hor u': 10616,
 'u c': 25693,
 'c say': 2607,
 'nah don': 16003,
 'don t': 5817,
 't think': 23476,
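
Entries such as `'don t'` and `'t think'` come from the tokenizer, not from the messages themselves. The pattern `[a-zA-Z]+` matches runs of letters only, so an apostrophe acts as a separator. The sketch below applies the same pattern with Python's `re` module to a made-up fragment. It mimics only the lowercasing and token-matching steps, not stop-word removal.

```python
import re

# Same pattern as the token_pattern argument above: runs of ASCII letters only.
pattern = re.compile('[a-zA-Z]+')

# Made-up message fragment for illustration.
message = "Nah I don't think so"
tokens = pattern.findall(message.lower())

print(tokens)  # ['nah', 'i', 'don', 't', 'think', 'so']
```

`CountVectorizer` removes stop words from the token list before it forms n-grams, so a bigram like `'nah don'` can join two words that were not adjacent in the original message.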
 

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

# Trigrams over lowercased, letters-only tokens with English stop words removed
cv = CountVectorizer(
    lowercase=True,
    stop_words='english',
    token_pattern='[a-zA-Z]+',
    ngram_range=(3, 3),
)

In [17]:
BOW = cv.fit_transform(spam_classifier['Message'])  # learn the vocabulary and build the count matrix

In [18]:
print(cv.vocabulary_)

