# 02-BOW

Now that we have mastered preprocessing, let's make our first Bag Of Words (BOW).

We will reuse our dataset of Coldplay songs to make a BOW.

As usual, the first step is to import some libraries. So import *nltk* as well as all the libraries you will need.

In [1]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset in *coldplay.csv* using pandas.

In [2]:
# TODO: Load the dataset in coldplay.csv
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('coldplay.csv')
print(df)

       Artist                           Song  \
0    Coldplay                 Another's Arms   
1    Coldplay                Bigger Stronger   
2    Coldplay                       Daylight   
3    Coldplay                       Everglow   
4    Coldplay  Every Teardrop Is A Waterfall   
..        ...                            ...   
115  Coldplay           Hymn For The Weekend   
116  Coldplay                    In My Place   
117  Coldplay                            Ink   
118  Coldplay              Ladder To The Sun   
119  Coldplay                           Lost   

                                                  Link  \
0              /c/coldplay/anothers+arms_21079526.html   
1            /c/coldplay/bigger+stronger_20032648.html   
2                   /c/coldplay/daylight_20032625.html   
3                   /c/coldplay/everglow_21104546.html   
4    /c/coldplay/every+teardrop+is+a+waterfall_2091...   
..                                                 ...   
115     /c/coldpl

You already know this dataset, but you can check it again if you want to refresh your memory.

In [11]:
# TODO: Explore the data
# df.info() prints the summary itself; wrapping it in print() only adds a stray "None"
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Artist  120 non-null    object
 1   Song    120 non-null    object
 2   Link    120 non-null    object
 3   Lyrics  120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


Now, using scikit-learn's *CountVectorizer*, make a BOW of all the Coldplay lyrics and print the result.

In [4]:
# TODO: Compute a BOW
countVectorizer = CountVectorizer(max_features=2000, stop_words='english')
BOW = countVectorizer.fit_transform(df.Lyrics).toarray()

print(BOW.shape)
print(BOW)

(120, 1569)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Now that we have the BOW matrix, we would like a new dataframe that holds the BOW of each song, with the corresponding words as columns (just as we did at the end of the lecture).

In the end, the dataframe should look something like the following (120 rows for 120 songs, and as many columns as there are words):

| | ah | adventure | ... | yeah |
|---|---|---|---|---|
| 0 | 0 | 1 | ... | 4 |
| 1 | 8 | 0 | ... | 2 |
|...|...|...|...|...|
| 119 | 5 | 0 | ... | 8 |

In [6]:
# TODO: Create a new dataframe containing the BOW outputs, with the corresponding words as columns, and print it
tokens = countVectorizer.get_feature_names_out()
print(tokens)

lyricsDF = pd.DataFrame(BOW, columns=tokens)
print(lyricsDF)


['10' '2000' '2gether' ... 'yesterday' 'young' 'yuletide']
     10  2000  2gether  76543  aaaaaah  aaaaah  aaaah  achin  adventure  \
0     0     0        0      0        0       0      0      0          0   
1     0     0        0      0        0       0      0      0          0   
2     0     0        0      0        0       0      0      0          0   
3     0     0        0      0        0       0      0      0          0   
4     0     0        0      0        0       0      0      0          0   
..   ..   ...      ...    ...      ...     ...    ...    ...        ...   
115   0     0        0      0        0       0      0      0          0   
116   0     0        0      0        0       0      0      0          0   
117   0     0        1      0        0       0      0      0          0   
118   0     0        0      0        0       0      0      0          0   
119   0     0        0      0        0       0      0      0          0   

     advice  ...  x2  x7  ya  yeah  year

As you can see, we still have an issue: some tokens are not words, such as '10' or '2000'.

To get rid of them, we could pass a regular expression directly to *CountVectorizer*. Another solution would be to preprocess the lyrics before calling *CountVectorizer*.

For the moment, we won't pay attention to this issue. But if you are curious and have time, you can search online for how to remove these tokens using *CountVectorizer*.

Now we would like to see which words Coldplay uses the most.

In [7]:
sum_bow = lyricsDF.sum()
sum_bow.idxmax()

# I am not surprised

'oh'

So what is the most used word? Are you surprised?

Now sort the counts to show the 10 most used words.

In [10]:
# TODO: Print the 10 most used words by Coldplay
sumBowSort = sum_bow.sort_values(ascending=False)
print(sumBowSort[:10])

oh      334
don     190
know    137
just    136
ll      132
come    126
yeah    111
love     95
ooh      95
want     86
dtype: int64
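
The entries `don` and `ll` in this list are not real words. They are most likely fragments of contractions such as "don't" and "we'll": the default token pattern of *CountVectorizer* splits on the apostrophe and drops single-character tokens. The short sketch below shows this on a made-up line (the sentence is an assumption, not a lyric from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer uses to turn
# a raw string into tokens (lowercasing, then the default token pattern)
analyze = CountVectorizer().build_analyzer()

# Hypothetical line, written for illustration only
sample_tokens = analyze("Don't you know we'll come")
print(sample_tokens)
# ['don', 'you', 'know', 'we', 'll', 'come']
```

Expanding contractions during preprocessing, before calling *CountVectorizer*, is one possible way to avoid these fragments.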


There you go! You now know the Coldplay lyrics better than the singers do!