# Textual Data: Bag-of-Words and N-Grams

## Textual Data

A textual dataset consists of multiple texts. Each text is called a **document**. The collection of texts is called a **corpus**.

In [4]:
seuss_dir = "http://dlsun.github.io/pods/data/drseuss/"
seuss_files = [
    "green_eggs_and_ham.txt", "cat_in_the_hat.txt",
    "fox_in_socks.txt", "how_the_grinch_stole_christmas.txt",
    "hop_on_pop.txt", "horton_hears_a_who.txt",
    "oh_the_places_youll_go.txt", "one_fish_two_fish.txt"]

In [5]:
# read the files
import requests

docs = {}
for filename in seuss_files:
    url = seuss_dir + filename
    print(f"Reading {url}...")
    response = requests.get(url)
    docs[filename] = response.text

Reading http://dlsun.github.io/pods/data/drseuss/green_eggs_and_ham.txt...
Reading http://dlsun.github.io/pods/data/drseuss/cat_in_the_hat.txt...
Reading http://dlsun.github.io/pods/data/drseuss/fox_in_socks.txt...
Reading http://dlsun.github.io/pods/data/drseuss/how_the_grinch_stole_christmas.txt...
Reading http://dlsun.github.io/pods/data/drseuss/hop_on_pop.txt...
Reading http://dlsun.github.io/pods/data/drseuss/horton_hears_a_who.txt...
Reading http://dlsun.github.io/pods/data/drseuss/oh_the_places_youll_go.txt...
Reading http://dlsun.github.io/pods/data/drseuss/one_fish_two_fish.txt...


*Goal*: Turn this corpus into a matrix of numbers.

But what would each column represent?!

## Bag-of-Words Model

In the **bag-of-words model**, each column represents a word, and the values in the column are the word counts.

First, we need to count the words in each document.

In [6]:
# get the word counts
from collections import Counter
Counter(docs["hop_on_pop.txt"].split())

Counter({'is': 10,
         'on': 10,
         'We': 10,
         'a': 9,
         'He': 6,
         'PAT': 6,
         'Brown': 6,
         'in': 5,
         'all': 5,
         'like': 5,
         'Pup': 4,
         'ALL': 4,
         'RED': 4,
         'and': 4,
         'to': 4,
         'Mr.': 4,
         'PUP': 3,
         'the': 3,
         'can': 3,
         'Pat': 3,
         'sat': 3,
         'What': 3,
         'THING': 3,
         'WALK': 3,
         'of': 3,
         'down.': 3,
         'went': 3,
         'up.': 2,
         'CUP': 2,
         'MOUSE': 2,
         'HOUSE': 2,
         'are': 2,
         'BALL': 2,
         'play': 2,
         'wall.': 2,
         'day.': 2,
         'after': 2,
         'SEE': 2,
         'BEE': 2,
         'see': 2,
         'THREE': 2,
         'that': 2,
         'They': 2,
         'call': 2,
         'me': 2,
         'BED': 2,
         'I': 2,
         'him': 2,
         'NO': 2,
         'Dad': 2,
         'sad.': 2,
         'That

In [7]:
# stack the word counts into a dataframe
# add fillna(0) to replace NaN with 0
import pandas as pd
pd.DataFrame([Counter(doc.split()) for doc in docs.values()],
              index=docs.keys()).fillna(0).astype(int)

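| | I | am | Sam | That | Sam-I-am | Sam-I-am! | do | not | like | that | ... | Gack | park | home, | Clark. | grow | sleep | Zeep. | gone. | Tomorrow | one. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| green_eggs_and_ham.txt | 71 | 3 | 3 | 2 | 4 | 2 | 34 | 46 | 44 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cat_in_the_hat.txt | 48 | 0 | 0 | 4 | 0 | 0 | 13 | 27 | 13 | 16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| fox_in_socks.txt | 9 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| how_the_grinch_stole_christmas.txt | 6 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 2 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| hop_on_pop.txt | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 2 | 5 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| horton_hears_a_who.txt | 18 | 1 | 0 | 7 | 0 | 0 | 0 | 3 | 0 | 24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| oh_the_places_youll_go.txt | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 6 | 1 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| one_fish_two_fish.txt | 48 | 3 | 0 | 0 | 0 | 0 | 11 | 9 | 21 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |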


This is called the **term-frequency matrix**.

## Bag-of-Words in Scikit-Learn

Alternatively, we can use `CountVectorizer` in scikit-learn to produce a term-frequency matrix.

In [8]:
# use CountVectorizer to produce a term-frequency matrix
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(docs.values())
vectorizer.transform(docs.values())

# Wait! Why are there only 1344 words?

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 2308 stored elements and shape (8, 1344)>

The set of words across a corpus is called the **vocabulary**. We can view the vocabulary in a fitted `CountVectorizer` as follows:

In [9]:
# The number here represents the column index in the matrix!
vectorizer.vocabulary_

{'am': 23,
 'sam': 935,
 'that': 1138,
 'do': 287,
 'not': 767,
 'like': 644,
 'you': 1336,
 'green': 471,
 'eggs': 326,
 'and': 26,
 'ham': 495,
 'them': 1141,
 'would': 1316,
 'here': 526,
 'or': 786,
 'there': 1143,
 'anywhere': 32,
 'in': 576,
 'house': 558,
 'with': 1303,
 'mouse': 722,
 'eat': 323,
 'box': 132,
 'fox': 419,
 'could': 242,
 'car': 179,
 'they': 1145,
 'are': 35,
 'may': 688,
 'will': 1292,
 'see': 953,
 'tree': 1204,
 'let': 635,
 'me': 691,
 'be': 62,
 'mot': 718,
 'train': 1202,
 'on': 778,
 'say': 944,
 'the': 1139,
 'dark': 265,
 'rain': 884,
 'goat': 453,
 'boat': 118,
 'so': 1035,
 'try': 1213,
 'if': 575,
 'good': 459,
 'thank': 1136,
 'sun': 1107,
 'did': 279,
 'shine': 972,
 'it': 586,
 'was': 1255,
 'too': 1188,
 'wet': 1268,
 'to': 1178,
 'play': 836,
 'we': 1261,
 'sat': 940,
 'all': 16,
 'cold': 231,
 'day': 270,
 'sally': 934,
 'two': 1220,
 'said': 932,
 'how': 560,
 'wish': 1302,
 'had': 488,
 'something': 1042,
 'go': 452,
 'out': 789,
 'ball': 50

## Text Normalization

What's wrong with the way we counted words originally?

`Counter({'UP': 1, 'PUP': 3, 'Pup': 4, 'is': 10, 'up.': 2, ...})`

It's usually good to **normalize** for punctuation and capitalization.

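Normalization options are specified when you initialize the `CountVectorizer`. By default, Scikit-Learn converts all characters to lowercase and treats punctuation as a separator between tokens. Its default token pattern also keeps only tokens of two or more alphanumeric characters, so one-letter words such as "I" and "a" are dropped. This is why the vocabulary above is smaller than the number of distinct strings produced by `split()`.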

But if you don't want Scikit-Learn to normalize for punctuation and capitalization, you can do the following:

In [10]:
# turn off the normalization
vec = CountVectorizer(lowercase=False, token_pattern=r"[\S]+")
vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3679 stored elements and shape (8, 2562)>
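
To see what the two settings do to a single line of text, we can ask each vectorizer for its analyzer, the function it uses to turn a string into tokens. This is a small sketch; the sample line is a hypothetical one modeled on the counts shown earlier, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

line = "UP PUP Pup is up."  # hypothetical sample line

# build_analyzer() returns the text -> tokens function; no fitting is required.
default_analyzer = CountVectorizer().build_analyzer()
raw_analyzer = CountVectorizer(
    lowercase=False, token_pattern=r"[\S]+"
).build_analyzer()

default_tokens = list(default_analyzer(line))
raw_tokens = list(raw_analyzer(line))

print(default_tokens)  # ['up', 'pup', 'pup', 'is', 'up']
print(raw_tokens)      # ['UP', 'PUP', 'Pup', 'is', 'up.']
```

With the defaults, `PUP` and `Pup` collapse into one token and `up.` loses its period; with normalization turned off, every whitespace-separated string is its own token.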

## The Shortcomings of Bag-of-Words

Bag-of-words is easy to understand and easy to implement.

What are its disadvantages?

Consider the following documents:

1. "The dog bit her owner."
2. "Her dog bit the owner."

Both documents have exactly the same bag-of-words representation:

| Document  | the | her | dog | owner | bit |
|-----------|-----|-----|-----|-------|-----|
| 1         | 1   | 1   | 1   | 1     | 1   |
| 2         | 1   | 1   | 1   | 1     | 1   |

But they mean something quite different!

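## What are the advantages and disadvantages of the BoW model? (by Copilot)

### ✅ Advantages

- **Simplicity and efficiency**  
  The structure is simple and documents can be vectorized quickly, which suits basic text analysis and model training.

- **Compatible with machine learning models**  
  Can be used directly with most machine learning algorithms.

- **High-dimensional representation**  
  With a large vocabulary, fine-grained differences between documents can be analyzed.

---

### ❌ Disadvantages

- **Loss of context**  
  Word order and grammatical meaning are not taken into account.

- **Curse of dimensionality**  
  As the number of words grows, the vectors gain dimensions and computation becomes less efficient.

- **Stop words may be included**  
  Unimportant words (`the`, `is`, `a`, etc.) can appear with high frequency.

---

### 🔧 Ways to improve

- **TF-IDF (Term Frequency-Inverse Document Frequency)**  
  Emphasizes the words that are important to each document and down-weights common words.

- **N-gram model**  
  Uses sequences of consecutive words (2-grams, 3-grams, etc.) to reflect word order.

- **Word embeddings (e.g., Word2Vec, GloVe)**  
  Learned vectors that reflect word meaning and context; more sophisticated than BoW.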

## N-grams

An **n-gram** is a sequence of $n$ words.

[Google Books Ngram Viewer](https://books.google.com/ngrams/)

N-grams allow us to capture more of the meaning because they preserve some of the word order.

For example, if we count **bigrams** (2-grams) instead of words, we can distinguish the two documents from before:

1. "The dog bit her owner."
2. "Her dog bit the owner."

| Document  | the,dog | her,dog | dog,bit | bit,the | bit,her | the,owner | her,owner |
|-----------|---------|---------|---------|---------|---------|-----------|-----------|
| 1         | 1       | 0       | 1       | 0       | 1       | 0         | 1         |
| 2         | 0       | 1       | 1       | 1       | 0       | 1         | 0         |
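
Bigram counts for the two sentences can also be worked out by hand. The sketch below uses only the standard library; the helper `ngrams` is a hypothetical function written for this illustration, and it lowercases the text and drops punctuation before pairing up adjacent words.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return the n-grams (as tuples of words) in a lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    # Zip the token list against shifted copies of itself.
    return list(zip(*(tokens[i:] for i in range(n))))

doc1 = Counter(ngrams("The dog bit her owner.", 2))
doc2 = Counter(ngrams("Her dog bit the owner.", 2))

print(doc1)
print(doc2)

# Only one bigram is shared by the two sentences.
shared = set(doc1) & set(doc2)
print(shared)  # {('dog', 'bit')}
```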

Scikit-Learn can create n-grams.

Just pass `ngram_range=` to the `CountVectorizer`. To get bigrams, we set the range to `(2, 2)`:

In [11]:
vec = CountVectorizer(ngram_range=(2, 2))
vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6459 stored elements and shape (8, 5846)>

We can also get individual words (unigrams) alongside the bigrams:

In [12]:
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(docs.values())
vec.transform(docs.values())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8767 stored elements and shape (8, 7190)>