# 02-BOW

![](https://images.unsplash.com/photo-1543549477-d62fd00770d9?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1189&q=80)

Photo by [Sheldon Nunes](https://unsplash.com/photos/IfHj-ucav3c)

Now that we have mastered preprocessing, let's build our first Bag Of Words (BOW).

We will reuse our dataset of Coldplay songs to make a BOW.

As usual, the first step is to import the libraries: import *nltk* along with any others you will need.

In [19]:
# TODO: Import NLTK and all the needed libraries
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np
import nltk


Now load the dataset in *coldplay.csv* using pandas.

In [20]:
# TODO: Load the dataset in coldplay.csv
df = pd.read_csv('coldplay.csv')

You already know this dataset, but you can check it again if you want to refresh your memory.

In [21]:
# TODO: Explore the data
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Artist  120 non-null    object
 1   Song    120 non-null    object
 2   Link    120 non-null    object
 3   Lyrics  120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


Now, using scikit-learn's *CountVectorizer*, build a BOW of all the Coldplay lyrics and print the result.

In [24]:
# TODO: Compute a BOW of all the lyrics in the csv
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000, stop_words='english')
BOW = vectorizer.fit_transform(df.Lyrics).toarray()


Now that we have the BOW matrix, we would like a new dataframe holding the BOW of each song, with the corresponding words as columns (just as we did at the end of the lecture).

The result should look something like the following (120 rows for 120 songs, and as many columns as words):

| | ah | adventure | ... | yeah |
|---|---|---|---|---| 
| 0 | 0 | 1 | ... | 4 |
| 1 | 8 | 0 | ... | 2 |
|...|...|...|...|...|
| 119 | 5 | 0 | ... | 8 |

In [25]:
# TODO: Create a new dataframe containing the BOW outputs, with the corresponding words as columns, and print it
# Get the words associated with each column of the BOW matrix
tokens = vectorizer.get_feature_names_out()
print(f'tokens: {tokens}')

# Dataframe holding the BOW of each song
df_bow = pd.DataFrame(data=BOW, columns=tokens)
df_bow
# 120 rows: OK


tokens: ['2000' 'aaaaaah' 'aaaaah' 'aaaah' 'adventure' 'ah' 'ahh' 'ahhhh' 'aim'
 'aiming' 'air' 'alive' 'alright' 'angel' 'animal' 'animals' 'answer'
 'answers' 'anybody' 'apart' 'appear' 'arms' 'army' 'arrive' 'aside' 'ask'
 'asleep' 'aterfall' 'atmosphere' 'attention' 'attitude' 'avoid' 'awake'
 'away' 'ba' 'baby' 'backwards' 'bad' 'ball' 'baltimore' 'battle'
 'beating' 'beautiful' 'beckon' 'beg' 'began' 'begin' 'beginning' 'begun'
 'believe' 'believer' 'bells' 'belong' 'best' 'better' 'big' 'bigger'
 'birds' 'bit' 'black' 'blame' 'bleed' 'blind' 'block' 'blood' 'blossom'
 'blow' 'blue' 'body' 'bones' 'bono' 'boom' 'born' 'bought' 'boy' 'boys'
 'break' 'breakdown' 'breaking' 'breath' 'bridge' 'bright' 'bring' 'broke'
 'broken' 'brother' 'brothers' 'brought' 'brutality' 'bubble' 'bullet'
 'burn' 'burning' 'burst' 'bursting' 'button' 'buy' 'cage' 'came' 'canvas'
 'car' 'care' 'careful' 'carry' 'castle' 'catch' 'cathedrals' 'caught'
 'cause' 'caused' 'chance' 'change' 'changing' 'chaos'

,2000,aaaaaah,aaaaah,aaaah,adventure,ah,ahh,ahhhh,aim,aiming,...,x15,x2,x7,ya,yeah,years,yellow,yes,yesterday,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
116,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,11,0,0,0,0,0
117,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,0
118,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As you can see, there is still an issue: some tokens are not words, such as '10' or '2000'.

To get rid of them, we could pass a regular expression directly to *CountVectorizer* (through its `token_pattern` parameter). Another solution would be to preprocess the lyrics before calling *CountVectorizer*.

For the moment, we won't pay attention to this issue. But if you are curious and have time, you can look up how to remove those tokens using *CountVectorizer*.

Now we would like to see which words Coldplay uses the most.

In [18]:
# Most used words: NOTE, no preprocessing was applied here!
df_bow.sum(axis=0).sort_values(ascending=False)[:10]

oh      334
don     190
know    137
just    136
ll      132
come    126
yeah    111
ooh      95
love     95
want     86
dtype: int64
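
The tokens `don` and `ll` in this list are probably fragments of contractions such as "don't" and "I'll", since no preprocessing was applied. The sketch below uses a made-up sentence to show how the default *CountVectorizer* tokenization splits on the apostrophe and drops single-character pieces.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer returns the function CountVectorizer applies to each document
# (lowercasing + default token pattern, no stop-word removal here)
analyzer = CountVectorizer().build_analyzer()

# Hypothetical sentence, not taken from the dataset
fragments = analyzer("I don't know, I'll come")
print(fragments)  # ['don', 'know', 'll', 'come']
```

Expanding or removing contractions during preprocessing would keep such fragments out of the ranking.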

So what is the most used word? Are you surprised?

Now sort the word counts to show the 10 most used words.

In [None]:
# TODO: print the 10 most used words by Coldplay
# see above

There you go! You now know Coldplay's lyrics better than the singers themselves!