## **WORD VECTORIZATION**

Importing every library needed for this exercise.

In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet = True)
np.random.seed(0)

The songs are stored in the `lyrics data` subdirectory, which sits in the same folder as this practice project. Each song has its own file, named `song1.txt` through `song20.txt`.

To make it easy to read in all of the documents, the cell below uses a list comprehension to build a list containing the name of every song file.

In [2]:
filenames = [f"song{i}.txt" for i in range(1, 21)]
filenames

['song1.txt',
 'song2.txt',
 'song3.txt',
 'song4.txt',
 'song5.txt',
 'song6.txt',
 'song7.txt',
 'song8.txt',
 'song9.txt',
 'song10.txt',
 'song11.txt',
 'song12.txt',
 'song13.txt',
 'song14.txt',
 'song15.txt',
 'song16.txt',
 'song17.txt',
 'song18.txt',
 'song19.txt',
 'song20.txt']

Next, we import a single song to see what our text looks like so that we can make sure we clean and tokenize it correctly.

The code in the cell below reads in the lyrics from `song18.txt` as a list of lines, using plain Python.

In [3]:
with open("lyrics data/song18.txt") as f:
    test_song = f.readlines()

test_song

['[Kendrick Lamar:]\n',
 "Two wrongs don't make us right away\n",
 "Tell me something's wrong\n",
 'Party all of our lives away\n',
 'To take you on\n',
 '[Zacari:]\n',
 'Oh, baby I want you\n',
 'Baby I need you\n',
 'I wanna see you\n',
 'Baby I wanna go out yeah\n',
 'Baby I wanna go out yeah\n',
 'Baby I want you\n',
 'Baby I need you\n',
 'I wanna see you\n',
 'Baby I wanna go out yeah\n',
 'Baby I wanna go out yeah\n',
 'All night (all night, all night)\n',
 'All night\n',
 "Your body's on fire\n",
 'And your drinks on ice\n',
 'All night (all night, all night)\n',
 'All night\n',
 "Your body's on fire\n",
 'And your drinks on ice\n',
 '[Babes Wodumo:]\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Everybody say kikiritikiki (kikiritikiki)\n',
 'Everybody say kikiritikiki (kikiritikiki)\n',
 'Everybody say kikiritik

## **Tokenizing our Data**

Before we can create a bag of words or vectorize each document, we need to clean it up and split each song into an array of individual words.

Before we tokenize, however, we need to do one more step! Computers are very particular about strings. If we tokenized our data in its current state, we would run into the following problems:

- Counting things that aren't actually words. In the example above, "[Kendrick Lamar:]" is a note specifying who is speaking, not a lyric contained in the actual song, so it should be removed.
- Punctuation and capitalization would mess up our word counts. To Python, `all`, `All`, and `(all` are three different strings, and each would be counted separately. We need to remove punctuation and capitalization so that all words are counted correctly.

Before we tokenize our songs, we'll do a small bit of manual cleaning.
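
To make the punctuation and capitalization problem concrete, the short sketch below counts the tokens of one lyric line from the test song before and after a minimal cleanup. It uses a plain `str.split()` and `collections.Counter` rather than the tokenizer used later in this notebook, and the names `naive_counts` and `cleaned_counts` are only illustrative.

```python
from collections import Counter

line = "All night (all night, all night)\n"

# Naive whitespace split: case and punctuation make variants look distinct.
naive_counts = Counter(line.split())

# Minimal cleanup: lowercase and drop the punctuation that appears in this line.
cleaned = line.lower()
for symbol in ",()":
    cleaned = cleaned.replace(symbol, "")
cleaned_counts = Counter(cleaned.split())

print(len(naive_counts))     # 6 distinct "words"
print(dict(cleaned_counts))  # {'all': 3, 'night': 3}
```

Without cleaning, a line that only uses two words is counted as six different ones.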

In the cell below, we write a function that will:

- Remove lines that contain bracketed notes such as `[Kendrick Lamar:]`
- Join the list of strings into one big string for the entire song
- Remove newline characters (`\n`)
- Remove the following punctuation marks: `,.'?!()`
- Make every word lowercase

In [7]:
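def clean_song(song):
    # Drop annotation lines such as "[Kendrick Lamar:]"
    clean_lines = [line for line in song if "[" not in line and "]" not in line]
    # Join the remaining lines into one string for the whole song
    song_text = " ".join(clean_lines)

    # Strip the punctuation marks listed above
    for symbol in ",.'?!()":
        song_text = song_text.replace(symbol, "")

    # Turn newline characters into spaces
    song_text = song_text.replace("\n", " ")

    return song_text.lower()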

In [8]:
clean_test_song = clean_song(test_song)
print(clean_test_song)

two wrongs dont make us right away  tell me somethings wrong  party all of our lives away  to take you on  oh baby i want you  baby i need you  i wanna see you  baby i wanna go out yeah  baby i wanna go out yeah  baby i want you  baby i need you  i wanna see you  baby i wanna go out yeah  baby i wanna go out yeah  all night all night all night  all night  your bodys on fire  and your drinks on ice  all night all night all night  all night  your bodys on fire  and your drinks on ice  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudlu

Now, we can use NLTK's `word_tokenize()` function on the song string to get a fully tokenized version of the song.

In [9]:
tokenized_test_song = word_tokenize(clean_test_song)
tokenized_test_song[:10]

['two',
 'wrongs',
 'dont',
 'make',
 'us',
 'right',
 'away',
 'tell',
 'me',
 'somethings']

## **Count Vectorization**

Machine learning algorithms don't understand strings. However, they do understand math, which means they understand vectors and matrices. By `Vectorizing` the text, we convert it into a vector in which each element corresponds to a different word.

Consider the following example: 

<center>"I scream, you scream, we all scream for ice cream."</center>

| 'aardvark' | 'apple' | [...] | 'I' | 'you' | 'scream' | 'we' | 'all' | 'for' | 'ice' | 'cream' | [...] | 'xylophone' | 'zebra' |
|:----------:|:-------:|:-----:|:---:|:-----:|:--------:|:----:|:-----:|:-----:|:-----:|:-------:|:-----:|:-----------:|:-------:|
|      0     |    0    |   0   |  1  |   1   |     3    |   1  |   1   |   1   |   1   |    1    |   0   |      0      |    0    |

This is called a `Sparse Representation`, since the vast majority of the columns will have a value of 0.
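
As a rough sketch of how such a row could be produced, the code below counts the example sentence against a small, fixed vocabulary. The `vocabulary` list is a made-up stand-in for a full dictionary, and the sentence is lowercased here for simplicity (so `'I'` becomes `'i'`), which the table above does not do.

```python
sentence = "I scream, you scream, we all scream for ice cream."

# Hypothetical miniature vocabulary; a real one would hold thousands of words.
vocabulary = ["aardvark", "apple", "i", "you", "scream", "we", "all",
              "for", "ice", "cream", "xylophone", "zebra"]

# Strip the punctuation in this sentence, lowercase, and split on whitespace.
tokens = sentence.lower().replace(",", "").replace(".", "").split()

# One slot per vocabulary word, holding how often it occurs in the sentence.
vector = [tokens.count(word) for word in vocabulary]

print(vector)  # [0, 0, 1, 1, 3, 1, 1, 1, 1, 1, 0, 0]
```

With a realistic vocabulary, nearly every slot would be 0, which is what makes the representation sparse.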

Alternatively, we can represent this sentence as a plain old Python dictionary of word frequency counts:

```python

BoW = {
    'I':1,
    'you':1,
    'scream':3,
    'we':1,
    'all':1,
    'for':1,
    'ice':1,
    'cream':1
}

```
Both of these are examples of **Count Vectorization**. They allow us to represent a sentence as a vector, with each element in the vector corresponding to how many times that word is used.

In the cell below, we create a function that takes in a tokenized, cleaned song and returns a count-vectorized representation of it as a Python dictionary.

Hint: We'll use `set()` since we'll need each unique word in the tokenized song.

In [10]:
def count_vectorize(tokenized_song):
    unique_words = set(tokenized_song)

    song_dict = {word : 0 for word in unique_words}

    for word in tokenized_song:
        song_dict[word] += 1

    return song_dict

In [11]:
test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)

{'wrongs': 1, 'make': 1, 'i': 30, 'wan': 18, 'night': 24, 'two': 1, 'all': 25, 'me': 1, 'say': 4, 'away': 2, 'out': 12, 'go': 13, 'wrong': 1, 'ice': 6, 'my': 16, 'na': 18, 'gosh': 8, 'somethings': 1, 'take': 1, 'us': 1, 'drinks': 6, 'up': 16, 'your': 12, 'party': 1, 'of': 1, 'fire': 6, 'dont': 1, 'oh': 17, 'need': 6, 'word': 8, 'our': 1, 'baby': 24, 'ungbambe': 4, 'everybody': 4, 'want': 6, 'and': 6, 'ungdedele': 4, 'yeah': 12, 'we': 1, 'you': 19, 'tell': 1, 'right': 1, 'to': 1, 'on': 13, 'see': 6, 'kikiritikiki': 8, 'bodys': 6, 'lives': 1, 'high': 16, 'ungbhasobhe': 4, 'unggudluke': 4}
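
For comparison, the standard library's `collections.Counter` builds the same kind of word-to-count mapping in a single call. The sketch below is an alternative to `count_vectorize()`, not the approach this notebook uses; the function name `count_vectorize_with_counter` and the short token list are made up for illustration.

```python
from collections import Counter

def count_vectorize_with_counter(tokens):
    # Counter tallies each distinct token; dict() gives a plain dictionary.
    return dict(Counter(tokens))

toy_tokens = ["all", "night", "all", "night", "all", "night"]
print(count_vectorize_with_counter(toy_tokens))  # {'all': 3, 'night': 3}
```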


## **TF-IDF**

We just calculated our Term Frequency above with Count Vectorization!

In the cell below, we complete a function that takes in a list of tokenized songs, with each item in the list being a clean, tokenized version of the song. The function should return a dictionary containing the inverse document frequency values for each word.  

The formula for Inverse Document Frequency is:

$$\large \text{IDF}(t) = \log_e\left(\frac{\text{Total Number of Documents}}{\text{Number of Documents with } t \text{ in it}}\right)$$
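
As a quick worked example of the formula, assume a corpus of 20 documents (the number of song files listed earlier). The document counts passed in below are illustrative, not measured from the lyrics, and `idf_value` is a hypothetical helper name.

```python
import math

total_docs = 20

def idf_value(docs_with_term):
    # Natural log of (total documents / documents containing the term)
    return math.log(total_docs / docs_with_term)

print(round(idf_value(1), 4))   # 2.9957 (term appears in a single document)
print(round(idf_value(5), 4))   # 1.3863 (term appears in a quarter of them)
print(round(idf_value(20), 4))  # 0.0    (term appears in every document)
```

The rarer a word is across the corpus, the larger its IDF; a word found in every document gets an IDF of 0.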

In [12]:
def inverse_document_frequency(list_of_token_songs):

    num_docs = len(list_of_token_songs)

    unique_words = {item for sublist in list_of_token_songs for item in sublist}

    inv_doc_freq = {word : 0 for word in unique_words}

    for word in unique_words:
        num_docs_with_word = 0
        for song_tokens in list_of_token_songs:
            if word in song_tokens:
                num_docs_with_word += 1
        inv_doc_freq[word] = np.log(num_docs / num_docs_with_word)

    return inv_doc_freq


In the cell below, we complete the `tf_idf()` function. It takes in a list of tokenized songs, just as the `inverse_document_frequency()` function did, and returns a new list of dictionaries, with each dictionary containing the TF-IDF vectorized representation of the corresponding song. The term frequency for each song is calculated with the `count_vectorize()` function we defined above.

**_NOTE:_** Each document should contain the full vocabulary of the entire combined corpus! So, even if a song doesn't have the word "kikiritikiki" (a vocalization in our test song), it should have a dictionary entry with that word as the key and `0` as the value.

In [13]:
def tf_idf(list_of_token_songs):
    
    unique_words = {item for sublist in list_of_token_songs for item in sublist}

    idf = inverse_document_frequency(list_of_token_songs)

    tf_idf_list_of_dicts = []

    for song_tokens in list_of_token_songs:
        song_tf = count_vectorize(song_tokens)
        doc_tf_idf = {word : 0 for word in unique_words}

        for word in unique_words:
            if word in song_tokens:
                doc_tf_idf[word] = song_tf[word] * idf[word]
            else:
                doc_tf_idf[word] = 0
        tf_idf_list_of_dicts.append(doc_tf_idf)

    return tf_idf_list_of_dicts
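
To see what this kind of output looks like on something small, here is a self-contained sketch of the same calculation (raw count times IDF, with every document sharing the full vocabulary). The three "songs" in `toy_songs` are invented for illustration, and the code uses `math.log` and `collections.Counter` instead of the functions defined above so that it runs on its own.

```python
import math
from collections import Counter

# Toy corpus of three already-tokenized "songs" (made up for illustration).
toy_songs = [
    ["all", "night", "all", "night"],
    ["all", "day"],
    ["all", "night", "long"],
]

toy_vocab = sorted({word for song in toy_songs for word in song})
num_docs = len(toy_songs)

# IDF: log of (total documents / documents containing the word).
toy_idf = {
    word: math.log(num_docs / sum(1 for song in toy_songs if word in song))
    for word in toy_vocab
}

# TF-IDF: raw count times IDF, with an entry for every vocabulary word.
toy_tf_idf = []
for song in toy_songs:
    counts = Counter(song)  # missing words count as 0
    toy_tf_idf.append({word: counts[word] * toy_idf[word] for word in toy_vocab})

for doc in toy_tf_idf:
    print({word: round(score, 3) for word, score in doc.items()})
```

Every dictionary has the same four keys, words absent from a song score 0, and `all` scores 0 everywhere because it appears in every document.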

In [16]:
def main(filenames):

    all_songs = []

    for song in filenames:
        with open(f"lyrics data/{song}") as f:
            song_lyrics = f.readlines()
            all_songs.append(song_lyrics)

    all_song_tokens = []

    for song in all_songs:
        song_tokens = word_tokenize(clean_song(song))
        all_song_tokens.append(song_tokens)

    tf_idf_all_docs = tf_idf(all_song_tokens)
    
    return tf_idf_all_docs



In [18]:
tf_idf_all_docs = main(filenames)
tf_idf_all_docs

[{'87': 0,
  'fourteen': 2.995732273553991,
  'just': 0,
  'land': 0,
  'stop': 0,
  'ooh': 0,
  'spent': 2.995732273553991,
  'angel': 0,
  'else': 0,
  'ice': 0,
  'trees': 0,
  'pearly': 8.987196820661973,
  'wreck': 0,
  'prowl': 0,
  'truly': 0,
  'steppin': 0,
  'small': 0,
  'down': 0,
  'move': 0,
  'confrontation': 0,
  'hell': 0,
  'bullshit': 0,
  'mothafuck': 0,
  'swing': 0,
  'psych': 0,
  'cutlass': 0,
  'pull': 0,
  'lime': 0,
  'kinda': 0,
  'ride': 1.8971199848858813,
  'cabanas': 0,
  'belittles': 0,
  'beat': 0,
  'skrrt': 0,
  'ta': 0,
  'afternoon': 0,
  'pack': 0,
  'guards': 0,
  'has': 0,
  'runnin': 4.1588830833596715,
  'purist': 0,
  'beer': 0,
  'involved': 0,
  'hurricanes': 0,
  'name': 0,
  'rebound': 0,
  'depending': 0,
  'insides': 0,
  'charter': 0,
  'today': 0,
  'indiana': 0,
  'side': 0,
  'chess': 0,
  'foreman': 0,
  'magnums': 0,
  'second': 0,
  'rightful': 0,
  'committing': 0,
  'raging': 0,
  'forever': 4.605170185988092,
  'hail': 0,
  's

## **Visualizing our Vectorizations**

In [19]:
vocab = list(tf_idf_all_docs[0].keys())
num_dims = len(vocab)
print(f"Number of Dimensions: {num_dims}")

Number of Dimensions: 1342
