# 8.2 Training a Bag-of-Words Model

## Part 1: Preparing the dataset

- Large Movie Review Dataset, https://ai.stanford.edu/~amaas/data/sentiment/

> This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.



In [1]:
%load_ext watermark
%watermark -p torch,lightning,pandas --conda

torch    : 1.13.1
lightning: 1.9.2
pandas   : 1.5.3

conda environment: dl-fundamentals



In [2]:
# pip install datasets

import os.path as op

import numpy as np
import pandas as pd

from local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset
from local_dataset_utilities import IMDBDataset

## 1) Loading the dataset into DataFrames

In [3]:
download_dataset()

df = load_dataset_into_to_dataframe()
df.head()

100%|███████████████████████████████████| 50000/50000 [00:21<00:00, 2358.83it/s]

Class distribution:





|   | text | label |
|---|------|-------|
| 0 | I went and saw this movie last night after bei... | 1 |
| 0 | Actor turned director Bill Paxton follows up h... | 1 |
| 0 | As a recreational golfer with some knowledge o... | 1 |
| 0 | I saw this film in a sneak preview, and it is ... | 1 |
| 0 | Bill Paxton has taken the true story of the 19... | 1 |


In [4]:
partition_dataset(df)
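
The internals of `partition_dataset` are not shown in this notebook. As a rough sketch only, assuming it shuffles the rows and cuts them 70/10/20 (which would match the 35,000 / 5,000 / 10,000 rows in the split files loaded below), the idea could look like this. `partition_sketch`, its parameters, and the seed are hypothetical and are not the course helper's actual code.

```python
import pandas as pd


def partition_sketch(df, train_frac=0.7, val_frac=0.1, seed=123):
    """Hypothetical stand-in: shuffle rows, then cut into train/val/test."""
    # Shuffle all rows reproducibly and discard the old index
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train, val, test


# Toy data standing in for the 50,000 reviews
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(50)],
    "label": [i % 2 for i in range(50)],
})
train, val, test = partition_sketch(toy)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters here because the loaded DataFrame appears to be grouped by label (the first rows all have label 1), so an unshuffled cut would give very unbalanced splits.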

In [5]:
df_train = pd.read_csv("train.csv")
df_train.tail()

|       | index | text | label |
|-------|-------|------|-------|
| 34995 | 0 | Frank Capra's creativity must have been just a... | 0 |
| 34996 | 0 | Just saw the film tonight in a preview and it'... | 0 |
| 34997 | 0 | If you love Japanese monster movies, you'll lo... | 1 |
| 34998 | 0 | Because it came from HBO and based on the IMDb... | 0 |
| 34999 | 0 | WARNING!!! SOME POSSIBLE PLOT SPOILERS, AS IF ... | 0 |


In [6]:
np.bincount(df_train['label'])

array([17452, 17548])

In [7]:
df_val = pd.read_csv("val.csv")
df_val.tail()

|      | index | text | label |
|------|-------|------|-------|
| 4995 | 0 | The Matador is a strange film. Its main charac... | 1 |
| 4996 | 0 | Not bad performances. Whoopi plays the wise/wa... | 0 |
| 4997 | 0 | I was surprised when I saw this film. I'd hear... | 0 |
| 4998 | 0 | When great director/actor combinations are tal... | 0 |
| 4999 | 0 | This show is non Stop hilarity. the first joke... | 1 |


In [8]:
np.bincount(df_val['label'])

array([2542, 2458])

In [9]:
df_test = pd.read_csv("test.csv")
df_test.tail()

|      | index | text | label |
|------|-------|------|-------|
| 9995 | 0 | Every generation fully believes it is living i... | 0 |
| 9996 | 0 | Possibly the most brilliant thing about Che: P... | 1 |
| 9997 | 0 | I was unsure of this movie before renting and ... | 1 |
| 9998 | 0 | Just got out of an advance screening, and wow ... | 1 |
| 9999 | 0 | I sense out there a mix of confusion and varyi... | 1 |


In [10]:
np.bincount(df_test['label'])

array([5006, 4994])
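
The three `np.bincount` outputs above can be turned into class proportions to confirm that each split is close to balanced. This is only a small check that reuses the counts printed above. The dictionary and variable names are chosen for illustration.

```python
import numpy as np

# Label counts copied from the np.bincount outputs above (index = label)
counts = {
    "train": np.array([17452, 17548]),
    "val": np.array([2542, 2458]),
    "test": np.array([5006, 4994]),
}

for name, c in counts.items():
    frac = c / c.sum()
    print(f"{name}: n={c.sum()}, label 0={frac[0]:.3f}, label 1={frac[1]:.3f}")

# Summed over all splits, the two labels should be equally frequent
total = sum(counts.values())
print(total)
```

Because every split is within about one percentage point of 50/50, a classifier that always predicts the majority label would reach only about 50% accuracy, which makes plain accuracy a reasonable metric here.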

## 2) Bag-of-Words Model

In [13]:
# pip install scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

In [47]:
cv = CountVectorizer(lowercase=True, max_features=10_000, stop_words="english")

cv.fit(df_train["text"])
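
To make the effect of `CountVectorizer` concrete, here is a small sketch on two made-up sentences (the sentences and the `toy_` names are illustrative, not taken from the dataset). It uses the same `lowercase` and `stop_words` settings as above but omits `max_features`, since the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The movie was great, great fun",
    "The movie was dull",
]

toy_cv = CountVectorizer(lowercase=True, stop_words="english")
toy_X = toy_cv.fit_transform(docs)  # learn the vocabulary and count in one step

# vocabulary_ maps each kept word to its column index
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(columns)
print(toy_X.toarray())  # one row per document, one column per word
```

Each row is the bag-of-words vector of one document: word order is discarded and only the per-word counts remain. Stop words such as "the" and "was" receive no column at all.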

In [38]:
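# The output below comes from an earlier run without stop_words="english":
# it still contains stop words such as 'the' and 'and', which the setting above removes.
cv.vocabulary_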

{'when': 9774,
 'we': 9718,
 'started': 8484,
 'watching': 9705,
 'this': 9018,
 'series': 7924,
 'on': 6249,
 'cable': 1335,
 'had': 4084,
 'no': 6087,
 'idea': 4457,
 'how': 4399,
 'it': 4795,
 'would': 9916,
 'be': 885,
 'even': 3182,
 'you': 9971,
 'hate': 4164,
 'character': 1551,
 'hold': 4315,
 'back': 790,
 'because': 905,
 'they': 9002,
 'are': 587,
 'so': 8266,
 'beautifully': 902,
 'developed': 2559,
 'can': 1371,
 'almost': 409,
 'understand': 9358,
 'why': 9800,
 'react': 7173,
 'to': 9102,
 'frustration': 3725,
 'fear': 3419,
 'greed': 4002,
 'or': 6283,
 'temptation': 8931,
 'the': 8976,
 'way': 9714,
 'do': 2720,
 'as': 642,
 'if': 4473,
 'viewer': 9556,
 'is': 4781,
 'experiencing': 3266,
 'one': 6251,
 'of': 6212,
 'christopher': 1661,
 'learning': 5167,
 'br': 1170,
 'put': 7041,
 'up': 9441,
 'with': 9856,
 'abuse': 186,
 'her': 4241,
 'physically': 6598,
 'and': 470,
 'emotionally': 3031,
 'but': 1316,
 'just': 4934,
 'have': 4174,
 'read': 7176,
 'newspaper': 6058

In [17]:
X_train = cv.transform(df_train["text"])
X_val = cv.transform(df_val["text"])
X_test = cv.transform(df_test["text"])

In [18]:
X_train.shape

(35000, 10000)

In [21]:
X_train[0]

<1x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 139 stored elements in Compressed Sparse Row format>
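
The row above stores only 139 of its 10,000 entries (1.39%), because the Compressed Sparse Row (CSR) format keeps only the nonzero values and their column positions. A minimal sketch on a hand-made 1×10 row (the values are invented for illustration) shows which attributes hold that information:

```python
import numpy as np
from scipy import sparse

# A toy "review" over a 10-word vocabulary: word 2 occurs 3 times, word 7 once
dense_row = np.zeros((1, 10), dtype=np.int64)
dense_row[0, [2, 7]] = [3, 1]

row = sparse.csr_matrix(dense_row)

print(row.nnz)      # number of stored (nonzero) entries
print(row.indices)  # column indices of the stored entries
print(row.data)     # the counts at those columns

density = row.nnz / (row.shape[0] * row.shape[1])
print(f"density: {density:.2%}")
```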

In [33]:
np.array(X_train[0].todense())[0]

array([0, 0, 0, ..., 0, 0, 0])

In [34]:
np.bincount(np.array(X_train[0].todense())[0])

array([9861,  101,   17,    9,    6,    3,    2,    0,    0,    0,    1])
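
Position `k` of this output is the number of vocabulary entries that occur exactly `k` times in the first review: 9,861 words do not occur at all, 101 occur once, and one word occurs 10 times. The short check below reuses the printed histogram (the variable names are illustrative) to show how it ties back to the vocabulary size and to the 139 stored elements reported earlier.

```python
import numpy as np

# Histogram copied from the output above: hist[k] = number of words seen k times
hist = np.array([9861, 101, 17, 9, 6, 3, 2, 0, 0, 0, 1])

vocab_size = hist.sum()          # every vocabulary entry falls in exactly one bin
nonzero_words = hist[1:].sum()   # distinct vocabulary words present in the review
token_count = (np.arange(len(hist)) * hist).sum()  # in-vocabulary tokens counted

print(vocab_size, nonzero_words, token_count)
```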

In [28]:
X_train[0].todense().flatten()

matrix([[0, 0, 0, ..., 0, 0, 0]])

In [45]:
np.array(X_train.todense()).shape

(35000, 10000)
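
Calling `.todense()` on the full training matrix materializes every zero. With the `int64` dtype shown earlier, that is 35,000 × 10,000 × 8 bytes, roughly 2.8 GB. A minimal sketch of one alternative, assuming the downstream model consumes mini-batches: keep the matrix sparse and densify one slice at a time. `dense_batches` is a hypothetical helper, and the small random matrix merely stands in for `X_train`.

```python
import numpy as np
from scipy import sparse

# Memory needed for the full dense int64 matrix
dense_bytes = 35_000 * 10_000 * np.dtype(np.int64).itemsize
print(f"{dense_bytes / 1e9:.1f} GB")


def dense_batches(X, batch_size):
    """Yield dense float32 row blocks of a sparse matrix, one batch at a time."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size].toarray().astype(np.float32)


# Toy stand-in for X_train: 10 documents, 6 vocabulary entries
rng = np.random.default_rng(0)
toy_X = sparse.csr_matrix(rng.integers(0, 3, size=(10, 6)))

shapes = [batch.shape for batch in dense_batches(toy_X, batch_size=4)]
print(shapes)
```

Only one batch is dense at any moment, so peak memory scales with `batch_size` rather than with the number of reviews.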