# Bag of Words Text Vectorization

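Bag of Words (BoW) is a text representation technique that converts each document into a fixed-length vector of word counts, with one dimension per vocabulary term. Word order and grammar are discarded; only how often each word occurs is kept. It is commonly used as a feature representation in natural language processing and machine learning.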
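
To make the counting concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the toy corpus, the `tokenize` and `bow_vector` helpers, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration
docs = ["the cat sat on the mat", "the dog sat"]

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace
    return text.lower().split()

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def bow_vector(text):
    counts = Counter(tokenize(text))
    # One count per vocabulary term; terms absent from the text get 0
    return [counts[term] for term in vocab]

vectors = [bow_vector(doc) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Every vector has the same length as the vocabulary, however long the document is, which is what makes the representation fixed-length.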

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = [' Most shark attacks occur about 10 feet from the beach since that is where the people ware',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [10]:
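# Learn the vocabulary and build the sparse document-term count matrix
countvec = CountVectorizer()
x = countvec.fit_transform(data)
# Convert to a dense DataFrame with one column per vocabulary term
df = pd.DataFrame(x.toarray(), columns=countvec.get_feature_names_out())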

In [11]:
df.head()

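| | 10 | about | admirable | ahead | are | as | attacks | back | bait | beach | ... | were | west | when | where | which | with | work | works | worms | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |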


# TF-IDF Text Vectorization

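TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents. A term's weight rises with its frequency in the document and falls with the number of documents that contain it, so words that appear everywhere (such as "the") contribute less than distinctive ones. The resulting weights give machine learning algorithms a numerical vector for each document.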
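
The weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the toy corpus and the `tfidf` helper are hypothetical, and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is one common variant (to my understanding, the one `TfidfVectorizer` applies by default). Other definitions of IDF exist.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus, used only for illustration
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]

def tfidf(docs):
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        raw = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize so each document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

weights = tfidf(docs)
print(weights[0])
```

In the first document, "the" (present in all three documents) receives the smallest weight, "sat" (present in two) a larger one, and "cat" (present in one) the largest.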

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

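# Fit TF-IDF weights on the same corpus; rows are L2-normalized by default
tfidfvec = TfidfVectorizer()
x = tfidfvec.fit_transform(data)

# One column per vocabulary term, one row per document
df = pd.DataFrame(x.toarray(), columns=tfidfvec.get_feature_names_out())
df.head()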

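| | 10 | about | admirable | ahead | are | as | attacks | back | bait | beach | ... | were | west | when | where | which | with | work | works | worms | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.254324 | 0.254324 | 0.0 | 0.0 | 0.0 | 0.0 | 0.254324 | 0.0 | 0.0 | 0.254324 | ... | 0.0 | 0.0 | 0.0 | 0.254324 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.293641 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.293641 | 0.293641 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.292313 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.356474 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.267837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.267837 | 0.267837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.267837 | 0.0 | 0.267837 |
| 4 | 0.0 | 0.0 | 0.0 | 0.290766 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.290766 | 0.0 | 0.0 | 0.0 |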
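
As a sanity check, the 0.254324 entries in row 0 can be reproduced by hand. The sketch below assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization (to my understanding, the `TfidfVectorizer` defaults) and that tokens shorter than two characters are dropped. The term counts were tallied manually from the first sentence of `data`, and the variable names are illustrative.

```python
import math

n_docs = 6  # number of sentences in `data`

def idf(df):
    # Smoothed IDF, assumed to match the TfidfVectorizer defaults
    return math.log((1 + n_docs) / (1 + df)) + 1

# Terms of the first sentence, tallied by hand:
#   14 terms occur once and in no other sentence (df = 1)
#   "that" occurs once and appears in 2 sentences (df = 2)
#   "the" occurs twice and appears in all 6 sentences (df = 6)
raw_unique = 1 * idf(1)
raw_that = 1 * idf(2)
raw_the = 2 * idf(6)

# L2 norm of the unnormalized row
norm = math.sqrt(14 * raw_unique**2 + raw_that**2 + raw_the**2)

w_unique = raw_unique / norm
w_that = raw_that / norm
w_the = raw_the / norm

print(round(w_unique, 6), round(w_that, 6), round(w_the, 6))
```

Under these assumptions the three weights come out at about 0.254324, 0.208549, and 0.225788. Although "the" occurs twice in the sentence, it ends up below the single-occurrence terms because it appears in every document.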


In [17]:
x.toarray()

array([[0.25432361, 0.25432361, 0.        , 0.        , 0.        ,
        0.        , 0.25432361, 0.        , 0.        , 0.25432361,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.25432361, 0.25432361,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.25432361, 0.        ,
        0.25432361, 0.        , 0.25432361, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.25432361, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.25432361, 0.        , 0.        , 0.        , 0.25432361,
        0.        , 0.        , 0.        , 0.208549  , 0.22578817,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.25432361, 0.        , 0.        , 0.        , 0.        ,
        0.25432361, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.     