## Bag of Words (BOW) Vectorization

In this file, I will go through Bag of Words (BOW) vectorization, one of the simplest and most commonly used methods for text vectorization. It converts each piece of text into a vector of word counts, a numerical format that machine learning algorithms can use.
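To make the idea concrete, here is a minimal pure-Python sketch of BOW counting. It is only an illustration, not the `quick_sentiments` implementation: the `bag_of_words` helper and the two toy documents are hypothetical, and tokens are assumed to be separated by whitespace.

```python
from collections import Counter


def bag_of_words(docs):
    """Hypothetical helper: return (vocabulary, count matrix) for a list of documents."""
    # Vocabulary: every distinct token, sorted so the column order is reproducible.
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["good movie good cast", "bad movie"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['bad', 'cast', 'good', 'movie']
print(matrix)      # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Word order is discarded: only how often each word occurs in a document is kept, which is why the method is called a "bag" of words.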

This file shows how BOW works in Python and how to use it in the `quick_sentiments` package. The BOW vectorization method is implemented in the `quick_sentiments.vect.BOW` module.

BOW is a helper function, and users do not have to call it directly. The main function `run_pipeline` uses it to vectorize the text data. Still, it is useful to understand how it works in case you want to call it yourself.

In [1]:
import polars as pl
df = pl.read_csv("training_data/train.csv", encoding="utf-8")
df.head()

movieid,reviewerName,isFrequentReviewer,reviewText,sentiment
str,str,bool,str,str
"""marvelous_pira…","""Benjamin Henry…",False,"""Henry Selick’s…","""POSITIVE"""
"""tony_montana_f…","""Felicia Lopez""",False,"""With a cast th…","""NEGATIVE"""
"""darth_vader_ka…","""Mr. Charles Bu…",True,"""Creed II does …","""POSITIVE"""
"""lara_croft_gli…","""Ryan Barrett""",False,"""I know what yo…","""POSITIVE"""
"""jason_bourne_s…","""Alexander Glov…",False,"""Director Ferna…","""POSITIVE"""


In [7]:
from quick_sentiments import pre_process
from quick_sentiments.vect.BOW import vectorize

In [5]:
BOW_demo = [pre_process(text) for text in df["reviewText"][:5]]
BOW_demo

['henry selicks first movie since 2009s coraline fifth stopmotion masterpiece',
 'cast read like vogue oscar party guest list valentine day cantmiss cinema instead standard hollywood schmaltz',
 'creed ii give u anything another slightly superior rocky sequel win point nt expect knockout',
 'know thinking limitless bradley cooper cell multiply lucy tap brain new thrillsnew skill passing hour',
 'director fernando meirelles tell story urgency sharp visual composition washed cinematography ooze gangster life']

In [9]:
BOW, vect = vectorize(BOW_demo)

   - Generating Bag-of-Words features...


In [14]:

BOW_df = pl.DataFrame(BOW.toarray())
BOW_df

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,column_37,column_38,column_39,column_40,column_41,column_42,column_43,column_44,column_45,column_46,column_47,column_48,column_49,column_50,column_51,column_52,column_53,column_54,column_55,column_56,column_57,column_58,column_59,column_60,column_61,column_62,column_63,column_64,column_65,column_66,column_67,column_68
i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0
0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1
0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0


Each row is the vector representation of one review, and each column counts how often one vocabulary word appears in that review. The columns follow the alphabetically sorted vocabulary, so column_0 corresponds to '2009s' and column_23 to 'henry'. The columns could have been named after the words, but we avoid that so that a new dataset can later be mapped onto the same column positions, with any words that were not in the original dataset dropped.

--------------------------------------------------------------------

This is how the BOW matrix looks with the words as column names:

In [18]:
feature_names = vect.get_feature_names_out()
BOW_df_names = pl.DataFrame(BOW.toarray(), schema=feature_names.tolist())
BOW_df_names.head()

2009s,another,anything,bradley,brain,cantmiss,cast,cell,cinema,cinematography,composition,cooper,coraline,creed,day,director,expect,fernando,fifth,first,gangster,give,guest,henry,hollywood,hour,ii,instead,knockout,know,life,like,limitless,list,lucy,masterpiece,meirelles,movie,multiply,new,nt,ooze,oscar,party,passing,point,read,rocky,schmaltz,selicks,sequel,sharp,since,skill,slightly,standard,stopmotion,story,superior,tap,tell,thinking,thrillsnew,urgency,valentine,visual,vogue,washed,win
i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0
0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1
0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0
