# Bag of Words: A Step-by-Step Implementation
This notebook demonstrates text preprocessing techniques including tokenization, stemming, lemmatization, and feature extraction using Bag of Words.


## 1. Import Required Libraries

In [31]:
import nltk
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

## 2. Dataset Preparation

In [32]:
review_text = (
    "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, "
    "under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek "
    "(the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully "
    "one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. "
    "It's not. It's clichéd and uninspiring.)"
)
display(review_text)

"I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.)"

## 3. Sentence Tokenization
Tokenize the review into sentences using both NLTK and regex for comparison.

Using NLTK

In [34]:
sentences_nltk = nltk.sent_tokenize(review_text)
print("Sentences (NLTK):\n", sentences_nltk)

Sentences (NLTK):
 ['I love sci-fi and am willing to put up with a lot.', 'Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood.', 'I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original).', "Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting.", "(I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV.", "It's not.", "It's clichéd and uninspiring.)"]


Using regex

In [35]:
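# Split on whitespace that follows "." or "?", except after dotted
# abbreviations such as "e.g." or short titles such as "Dr."
sentences_regex = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", review_text)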
print("Sentences (Regex):\n", sentences_regex)

Sentences (Regex):
 ['I love sci-fi and am willing to put up with a lot.', 'Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood.', 'I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original).', "Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting.", "(I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV.", "It's not.", "It's clichéd and uninspiring.)"]
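The review happens to contain no abbreviations, so the two negative lookbehinds in the pattern never fire on it. The following minimal sketch applies the same pattern to a made-up sentence to show what they do; the sample text and the `split_sentences` helper are illustrative and not part of the notebook.

```python
import re

# Same pattern as in the regex cell above.
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

def split_sentences(text):
    """Split at whitespace that follows '.' or '?', skipping abbreviations."""
    return SENTENCE_BOUNDARY.split(text)

sample = "Dr. Smith moved to the U.S. in 1990. Did he stay? Yes."
parts = split_sentences(sample)
print(parts)
# ['Dr. Smith moved to the U.S. in 1990.', 'Did he stay?', 'Yes.']

# Limitation: '!' is not listed in the final lookbehind,
# so an exclamation mark does not end a sentence.
print(split_sentences("Wow! Great."))
# ['Wow! Great.']
```

The pattern only approximates sentence boundaries: it protects "Dr." and "U.S." but would still split after longer abbreviations such as "Prof.", which is one reason `nltk.sent_tokenize` is usually the safer choice.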


## 4. Word Tokenization
Tokenize the text into words, first with NLTK on the full review and then with a custom regex on the first five sentences.

Using NLTK

In [37]:
words_nltk = nltk.word_tokenize(review_text)
print("Words (NLTK):\n", words_nltk)

Words (NLTK):
 ['I', 'love', 'sci-fi', 'and', 'am', 'willing', 'to', 'put', 'up', 'with', 'a', 'lot', '.', 'Sci-fi', 'movies/TV', 'are', 'usually', 'underfunded', ',', 'under-appreciated', 'and', 'misunderstood', '.', 'I', 'tried', 'to', 'like', 'this', ',', 'I', 'really', 'did', ',', 'but', 'it', 'is', 'to', 'good', 'TV', 'sci-fi', 'as', 'Babylon', '5', 'is', 'to', 'Star', 'Trek', '(', 'the', 'original', ')', '.', 'Silly', 'prosthetics', ',', 'cheap', 'cardboard', 'sets', ',', 'stilted', 'dialogues', ',', 'CG', 'that', 'does', "n't", 'match', 'the', 'background', ',', 'and', 'painfully', 'one-dimensional', 'characters', 'can', 'not', 'be', 'overcome', 'with', 'a', "'sci-fi", "'", 'setting', '.', '(', 'I', "'m", 'sure', 'there', 'are', 'those', 'of', 'you', 'out', 'there', 'who', 'think', 'Babylon', '5', 'is', 'good', 'sci-fi', 'TV', '.', 'It', "'s", 'not', '.', 'It', "'s", 'clichéd', 'and', 'uninspiring', '.', ')']


Custom regex tokenization

In [38]:
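def tokenise(sentence):
    # Alternatives are tried in order at each position:
    #   1. two or more capitals not followed by a lowercase letter ("TV", "CG")
    #   2. a capitalised word directly followed by another capital (camel-case pieces)
    #   3. any run of word characters, apostrophes, or hyphens ("doesn't", "sci-fi")
    return re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+", sentence)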

word_tokens_regex = [tokenise(sent) for sent in sentences_regex[:5]]
print("Words (Regex):")
print(*word_tokens_regex, sep="\n")

Words (Regex):
['I', 'love', 'sci-fi', 'and', 'am', 'willing', 'to', 'put', 'up', 'with', 'a', 'lot']
['Sci-fi', 'movies', 'TV', 'are', 'usually', 'underfunded', 'under-appreciated', 'and', 'misunderstood']
['I', 'tried', 'to', 'like', 'this', 'I', 'really', 'did', 'but', 'it', 'is', 'to', 'good', 'TV', 'sci-fi', 'as', 'Babylon', '5', 'is', 'to', 'Star', 'Trek', 'the', 'original']
['Silly', 'prosthetics', 'cheap', 'cardboard', 'sets', 'stilted', 'dialogues', 'CG', 'that', "doesn't", 'match', 'the', 'background', 'and', 'painfully', 'one-dimensional', 'characters', 'cannot', 'be', 'overcome', 'with', 'a', "'sci-fi'", 'setting']
["I'm", 'sure', 'there', 'are', 'those', 'of', 'you', 'out', 'there', 'who', 'think', 'Babylon', '5', 'is', 'good', 'sci-fi', 'TV']
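To make the three alternatives in the tokenizer pattern concrete, here is a small sketch that runs the same regex on a few short strings. The camel-case inputs are invented for illustration; the review itself only exercises the first and third alternatives.

```python
import re

# Same pattern as in tokenise() above, compiled once.
TOKEN = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+")

def tokenise(sentence):
    """Return the regex tokens found in a sentence."""
    return TOKEN.findall(sentence)

print(tokenise("movies/TV"))               # ['movies', 'TV']
print(tokenise("BagOfWords"))              # ['Bag', 'Of', 'Words']
print(tokenise("NLTKTokenizer"))           # ['NLTK', 'Tokenizer']
print(tokenise("doesn't match 'sci-fi'"))  # ["doesn't", 'match', "'sci-fi'"]
```

Unlike `nltk.word_tokenize`, this pattern keeps contractions whole and drops punctuation, but it also leaves quotation marks attached to a quoted word, as the `'sci-fi'` token in the output above shows.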


## 5. Stemming
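Use the Porter stemmer to strip suffixes and reduce each word to its stem. Stems are not always dictionary words (for example, "usually" becomes "usual" and "tried" becomes "tri"), and the stemmer lowercases its input. Stopwords are removed before stemming.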

In [40]:
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups

stemmed_sentences = []
for sentence in sentences_nltk:
    words = nltk.word_tokenize(sentence)
    stemmed = [ps.stem(word) for word in words if word.lower() not in stop_words]
    stemmed_sentences.append(" ".join(stemmed))

print("Stemmed Sentences:\n", stemmed_sentences)

Stemmed Sentences:
 ['love sci-fi will put lot .', 'sci-fi movies/tv usual underfund , under-appreci misunderstood .', 'tri like , realli , good tv sci-fi babylon 5 star trek ( origin ) .', "silli prosthet , cheap cardboard set , stilt dialogu , cg n't match background , pain one-dimension charact overcom 'sci-fi ' set .", "( 'm sure think babylon 5 good sci-fi tv .", "'s .", "'s clichéd uninspir . )"]


## 6. Lemmatization
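Use the WordNet lemmatizer to map words to their dictionary base forms. Because `lemmatize` treats every word as a noun unless a part-of-speech tag is passed, plurals are reduced ("sets" becomes "set") while verb forms such as "tried" are left unchanged.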

In [41]:
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups

lemmatized_sentences = []
for sentence in sentences_nltk:
    words = nltk.word_tokenize(sentence)
    lemmatized = [wordnet.lemmatize(word) for word in words if word.lower() not in stop_words]
    lemmatized_sentences.append(" ".join(lemmatized))

print("Lemmatized Sentences:\n", lemmatized_sentences)

Lemmatized Sentences:
 ['love sci-fi willing put lot .', 'Sci-fi movies/TV usually underfunded , under-appreciated misunderstood .', 'tried like , really , good TV sci-fi Babylon 5 Star Trek ( original ) .', "Silly prosthetics , cheap cardboard set , stilted dialogue , CG n't match background , painfully one-dimensional character overcome 'sci-fi ' setting .", "( 'm sure think Babylon 5 good sci-fi TV .", "'s .", "'s clichéd uninspiring . )"]


## 7. Bag of Words Representation
Build a cleaned corpus (letters only, lowercased, stopwords removed, lemmatized), then convert it to a Bag of Words representation with `CountVectorizer`.


In [27]:
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
corpus = []

# First five sentences only, as in the regex tokenization step
for sentence in sentences_nltk[:5]:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # replace every non-letter with a space
    review = review.lower()
    words = review.split()
    # Iterate over the word list, not over the characters of the string
    processed = [wordnet.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(processed))

print(corpus)


['love sci fi willing put lot', 'sci fi movie tv usually underfunded appreciated misunderstood', 'tried like really good tv sci fi babylon star trek original', 'silly prosthetics cheap cardboard set stilted dialogue cg match background painfully one dimensional character cannot overcome sci fi setting', 'sure think babylon good sci fi tv']
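The cleaning step explains why "sci-fi" shows up as the two tokens "sci" and "fi" in the corpus. The sketch below isolates that step and also shows a side effect on accented words such as "clichéd" from the review. The `unicode_letters_only` variant is a hypothetical alternative, not something the notebook uses.

```python
import re

def letters_only(text):
    """ASCII-only cleaning, as in the cell above."""
    return re.sub("[^a-zA-Z]", " ", text).lower().split()

def unicode_letters_only(text):
    """Hypothetical variant: drop non-word characters, digits, and underscores,
    which keeps accented letters intact."""
    return re.sub(r"[\W\d_]", " ", text).lower().split()

print(letters_only("Sci-fi movies/TV"))
# ['sci', 'fi', 'movies', 'tv']

# "\u00e9" is the accented e in "clichéd"
print(letters_only("It's clich\u00e9d"))
# ['it', 's', 'clich', 'd']
print(unicode_letters_only("It's clich\u00e9d"))
# ['it', 's', 'clichéd']
```

Hyphenated and slashed terms are split into separate tokens either way, and with the ASCII-only pattern an accented word is broken into fragments.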


### 7.1 Single Word (Unigram) Bag of Words

In [43]:
# Unigram BOW
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()

# Create DataFrame for visualization
feature_names = cv.get_feature_names_out()
df_unigram = pd.DataFrame(X, columns=feature_names)
print("Unigram BOW:")
df_unigram

Unigram BOW:


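|   | appreciated | babylon | background | cannot | cardboard | cg | character | cheap | dialogue | dimensional | ... | star | stilted | sure | think | trek | tried | tv | underfunded | usually | willing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |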
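Each row of the matrix is a sentence, each column is a vocabulary term, and each cell counts how often the term occurs in the sentence. The following stdlib-only sketch computes the same kind of matrix by hand for the first three corpus entries. It is a simplified stand-in that assumes plain whitespace splitting; `CountVectorizer` applies its own token pattern and lowercasing. The `bag_of_words` helper is an illustrative name.

```python
from collections import Counter

# First three entries of the cleaned corpus printed above.
corpus = [
    "love sci fi willing put lot",
    "sci fi movie tv usually underfunded appreciated misunderstood",
    "tried like really good tv sci fi babylon star trek original",
]

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # Counter returns 0 for terms that are absent from a document.
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(corpus)
for row in matrix:
    print(row)
```

Sorting the vocabulary alphabetically mirrors the column order in the DataFrame above, so the first column here is also `appreciated`.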


### 7.2 Bigram Bag of Words
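Moving from unigrams to bigrams adds adjacent word pairs as features, so the vectors capture some local word order: two sentences score as more similar when they share phrases, not just individual words, which brings vector similarity closer to intuitive similarity. Note that `ngram_range=(1, 2)` keeps the unigram features and adds bigrams; `ngram_range=(2, 2)` would produce bigrams only.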

In [44]:
# Unigram + bigram BOW: ngram_range=(1, 2) keeps single words and adds adjacent word pairs
cv_bigram = CountVectorizer(max_features=1500, ngram_range=(1, 2))
X_bigram = cv_bigram.fit_transform(corpus).toarray()

# Create DataFrame for visualization
feature_names_bigram = cv_bigram.get_feature_names_out()
df_bigram = pd.DataFrame(X_bigram, columns=feature_names_bigram)
print("Bigram BOW:")
df_bigram

Bigram BOW:


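|   | appreciated | appreciated misunderstood | babylon | babylon good | babylon star | background | background painfully | cannot | cannot overcome | cardboard | ... | tried like | tv | tv sci | tv usually | underfunded | underfunded appreciated | usually | usually underfunded | willing | willing put |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |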
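Rows 2 and 4 share the words "babylon", "good", "tv", "sci", and "fi", but in a different order. The sketch below compares their cosine similarity with unigram features only and with unigrams plus bigrams, to show how the extra features affect the score. It uses only the standard library; the `ngrams` and `cosine` helpers are illustrative names, and whitespace splitting is assumed in place of `CountVectorizer`'s tokenizer.

```python
import math
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Corpus entries 2 and 4 from the output above.
s2 = "tried like really good tv sci fi babylon star trek original".split()
s4 = "sure think babylon good sci fi tv".split()

uni = cosine(Counter(ngrams(s2, 1, 1)), Counter(ngrams(s4, 1, 1)))
uni_bi = cosine(Counter(ngrams(s2, 1, 2)), Counter(ngrams(s4, 1, 2)))
print(f"unigrams only: {uni:.3f}, unigrams + bigrams: {uni_bi:.3f}")
```

The unigram score is about 0.57, and it drops to about 0.36 once bigrams are included, because "sci fi" is the only word pair the two sentences have in common. Shared vocabulary alone no longer makes two sentences look alike.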
