## **WORD VECTORIZATION**

Importing every library necessary for this exercise.

In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet = True)
np.random.seed(0)

The songs live in the `lyrics data` subdirectory, which sits in the same folder as this practice project. Each song is stored in its own file, named `song1.txt` through `song20.txt`.

To make it easy to read in all of the documents, the cell below uses a list comprehension to build a list containing the name of every song file.

In [2]:
filenames = [f"song{i}.txt" for i in range(1, 21)]
filenames

['song1.txt',
 'song2.txt',
 'song3.txt',
 'song4.txt',
 'song5.txt',
 'song6.txt',
 'song7.txt',
 'song8.txt',
 'song9.txt',
 'song10.txt',
 'song11.txt',
 'song12.txt',
 'song13.txt',
 'song14.txt',
 'song15.txt',
 'song16.txt',
 'song17.txt',
 'song18.txt',
 'song19.txt',
 'song20.txt']

Next, we import a single song to see what our text looks like so that we can make sure we clean and tokenize it correctly.

The code in the cell below reads in the lyrics from `song18.txt` as a list of lines, using plain Python.

In [3]:
with open("lyrics data/song18.txt") as f:
    test_song = f.readlines()

test_song

['[Kendrick Lamar:]\n',
 "Two wrongs don't make us right away\n",
 "Tell me something's wrong\n",
 'Party all of our lives away\n',
 'To take you on\n',
 '[Zacari:]\n',
 'Oh, baby I want you\n',
 'Baby I need you\n',
 'I wanna see you\n',
 'Baby I wanna go out yeah\n',
 'Baby I wanna go out yeah\n',
 'Baby I want you\n',
 'Baby I need you\n',
 'I wanna see you\n',
 'Baby I wanna go out yeah\n',
 'Baby I wanna go out yeah\n',
 'All night (all night, all night)\n',
 'All night\n',
 "Your body's on fire\n",
 'And your drinks on ice\n',
 'All night (all night, all night)\n',
 'All night\n',
 "Your body's on fire\n",
 'And your drinks on ice\n',
 '[Babes Wodumo:]\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Everybody say kikiritikiki (kikiritikiki)\n',
 'Everybody say kikiritikiki (kikiritikiki)\n',
 'Everybody say kikiritik

## **Tokenizing our Data**

Before we can create a bag of words or vectorize each document, we need to clean it up and split each song into an array of individual words.

Before we tokenize, however, we need one more step. Computers are very particular about strings. If we tokenized our data in its current state, we would run into the following problems:

- We would count things that aren't actually words. In the example above, `[Kendrick Lamar:]` is a note specifying who is speaking, not a lyric in the song, so it should be removed.
- Punctuation and capitalization would distort our word counts. To the Python interpreter, `all`, `All`, and `(all` are three different strings and would each be counted separately. We need to remove punctuation and capitalization so that all words are counted correctly.

So before we tokenize our songs, we'll do a small amount of manual cleaning.
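To see the counting problem concretely, here is a small sketch using `collections.Counter` on one line copied from the `test_song` output above. Splitting on whitespace is a simplification for the demo, not the tokenizer used later.

```python
from collections import Counter

# One raw lyric line from the test song, split on whitespace only.
raw_line = "All night (all night, all night)"
raw_counts = Counter(raw_line.split())

# The same line after lowercasing and stripping the punctuation we plan to remove.
cleaned = raw_line.lower()
for symbol in ",.'?!()":
    cleaned = cleaned.replace(symbol, "")
clean_counts = Counter(cleaned.split())

print(raw_counts)    # six distinct strings, each counted once
print(clean_counts)  # only "all" and "night" remain
```

Before cleaning, `All`, `(all`, and `all` are tallied as separate entries; after cleaning they collapse into a single `all` count.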

In the cell below, we write a function that will:

- Remove lines that contain bracketed annotations, such as `[Kendrick Lamar:]`
- Join the list of strings into one big string for the entire song
- Remove newline characters `\n`
- Remove the following punctuation marks: `,.'?!()`
- Make every word lowercase

In [4]:
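def clean_song(song):
    # Drop annotation lines such as "[Kendrick Lamar:]".
    clean_lines = [line for line in song if "[" not in line and "]" not in line]
    # Join the remaining lines into one string for the whole song.
    cleaned = " ".join(clean_lines)

    # Strip the punctuation marks so they are not counted as part of words.
    for symbol in ",.'?!()":
        cleaned = cleaned.replace(symbol, "")

    # Replace newlines with spaces, then lowercase everything.
    cleaned = cleaned.replace("\n", " ")

    return cleaned.lower()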

In [5]:
clean_test_song = clean_song(test_song)
print(clean_test_song)

two wrongs dont make us right away  tell me somethings wrong  party all of our lives away  to take you on  oh baby i want you  baby i need you  i wanna see you  baby i wanna go out yeah  baby i wanna go out yeah  baby i want you  baby i need you  i wanna see you  baby i wanna go out yeah  baby i wanna go out yeah  all night all night all night  all night  your bodys on fire  and your drinks on ice  all night all night all night  all night  your bodys on fire  and your drinks on ice  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudlu

Now we can use NLTK's `word_tokenize()` function on the song string to get a fully tokenized version of the song.

In [6]:
tokenized_test_song = word_tokenize(clean_test_song)
tokenized_test_song[:10]

['two',
 'wrongs',
 'dont',
 'make',
 'us',
 'right',
 'away',
 'tell',
 'me',
 'somethings']

## **Count Vectorization**

Machine learning algorithms don't understand strings. However, they do understand math, which means they understand vectors and matrices. By `Vectorizing` the text, we convert it into a vector in which each element represents a different word.

Consider the following example: 

<center>"I scream, you scream, we all scream for ice cream."</center>

| 'aardvark' | 'apple' | [...] | 'I' | 'you' | 'scream' | 'we' | 'all' | 'for' | 'ice' | 'cream' | [...] | 'xylophone' | 'zebra' |
|:----------:|:-------:|:-----:|:---:|:-----:|:--------:|:----:|:-----:|:-----:|:-----:|:-------:|:-----:|:-----------:|:-------:|
|      0     |    0    |   0   |  1  |   1   |     3    |   1  |   1   |   1   |   1   |    1    |   0   |      0      |    0    |

This is called a `Sparse Representation`, since the vast majority of the columns will have a value of 0.
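As a rough sketch of how such a vector could be built, the snippet below counts the example sentence against a small fixed vocabulary. The `vocabulary` list is a made-up stand-in for a real dictionary-sized word list, and the words are lowercased here for simplicity.

```python
import numpy as np

# Hypothetical mini-vocabulary; a real one would hold every word we care about.
vocabulary = ["aardvark", "apple", "i", "you", "scream", "we", "all",
              "for", "ice", "cream", "xylophone", "zebra"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

sentence = "I scream, you scream, we all scream for ice cream."
tokens = sentence.lower().replace(",", "").replace(".", "").split()

# One slot per vocabulary word; increment the slot for each token seen.
vector = np.zeros(len(vocabulary), dtype=int)
for token in tokens:
    vector[word_to_index[token]] += 1

print(vector)
```

Even with only twelve vocabulary words, four of the slots stay at 0; with a realistic vocabulary, nearly all of them would.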

Alternatively, we can represent this sentence as a plain old Python dictionary of word frequency counts:

```python

BoW = {
    'I':1,
    'you':1,
    'scream':3,
    'we':1,
    'all':1,
    'for':1,
    'ice':1,
    'cream':1
}

```
Both of these are examples of **Count Vectorization**. They allow us to represent a sentence as a vector, with each element in the vector corresponding to how many times that word is used.

In the cell below, we create a function that takes in a tokenized, cleaned song and returns a count vectorized representation of it as a Python dictionary.

Hint: We'll use `set()` since we'll need each unique word in the tokenized song.

In [7]:
def count_vectorize(tokenized_song):
    unique_words = set(tokenized_song)

    song_dict = {word : 0 for word in unique_words}

    for word in tokenized_song:
        song_dict[word] += 1

    return song_dict

In [8]:
test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)

{'kikiritikiki': 8, 'make': 1, 'right': 1, 'fire': 6, 'your': 12, 'us': 1, 'baby': 24, 'up': 16, 'lives': 1, 'i': 30, 'we': 1, 'all': 25, 'go': 13, 'ice': 6, 'ungbhasobhe': 4, 'bodys': 6, 'to': 1, 'me': 1, 'tell': 1, 'and': 6, 'of': 1, 'drinks': 6, 'take': 1, 'oh': 17, 'need': 6, 'na': 18, 'wan': 18, 'party': 1, 'my': 16, 'ungbambe': 4, 'wrongs': 1, 'everybody': 4, 'want': 6, 'away': 2, 'on': 13, 'see': 6, 'unggudluke': 4, 'out': 12, 'say': 4, 'wrong': 1, 'yeah': 12, 'you': 19, 'high': 16, 'dont': 1, 'two': 1, 'our': 1, 'somethings': 1, 'word': 8, 'night': 24, 'ungdedele': 4, 'gosh': 8}
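For comparison, Python's standard library offers `collections.Counter`, which produces the same kind of word-to-count mapping in one call. The token list below is a short illustrative sample built from two cleaned lyric lines, not the full song.

```python
from collections import Counter

# Short illustrative token list taken from two cleaned lines of the test song.
sample_tokens = "your bodys on fire and your drinks on ice".split()

# Counter tallies each distinct token in a single pass.
sample_counts = Counter(sample_tokens)

# Counter is a dict subclass, so it converts cleanly to a plain dictionary.
print(dict(sample_counts))
```

Writing the loop by hand, as `count_vectorize()` does, makes the mechanics visible; `Counter` is simply a shorter route to the same result.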


## **TF-IDF**

We just calculated our Term Frequency above with Count Vectorization!

In the cell below, we complete a function that takes in a list of tokenized songs, with each item in the list being a clean, tokenized version of the song. The function should return a dictionary containing the inverse document frequency values for each word.  

The formula for Inverse Document Frequency is:  
<br>  
<br>
$$\large \text{IDF}(t) = \log_e\left(\frac{\text{Total Number of Documents}}{\text{Number of Documents with } t \text{ in it}}\right)$$
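To get a feel for the formula, the sketch below evaluates it for a 20-document corpus (the size of this project's dataset). The helper name `idf_value` is made up for this illustration.

```python
import math

def idf_value(total_docs, docs_with_word):
    """Natural-log IDF, following the formula above (hypothetical helper)."""
    return math.log(total_docs / docs_with_word)

total_docs = 20  # number of song files in this project

# IDF for words that appear in 1, 2, 18, and all 20 documents.
for docs_with_word in (1, 2, 18, 20):
    print(docs_with_word, round(idf_value(total_docs, docs_with_word), 4))
```

A word that appears in every document gets an IDF of exactly 0, while a word found in only one document gets the largest possible value, `log(20)`.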

In [9]:
def inverse_document_frequency(list_of_token_songs):
    num_docs = len(list_of_token_songs)

    # Every distinct word across the whole corpus.
    unique_words = {word for song_tokens in list_of_token_songs for word in song_tokens}

    inv_doc_freq = {}

    for word in unique_words:
        # Count how many songs contain this word at least once.
        num_docs_with_word = 0
        for song_tokens in list_of_token_songs:
            if word in song_tokens:
                num_docs_with_word += 1
        inv_doc_freq[word] = np.log(num_docs / num_docs_with_word)

    return inv_doc_freq


In the cell below, complete the `tf_idf()` function. This function should take in a list of tokenized songs, just as the `inverse_document_frequency()` function did. This function returns a new list of dictionaries, with each dictionary containing the tf-idf vectorized representation of a corresponding song document. You'll need to calculate the term frequency for each song using the `count_vectorize()` function we defined above.

**_NOTE:_** Each document should contain the full vocabulary of the entire combined corpus! So, even if a song doesn't have the word "kikiritikiki" (a vocalization in our test song), it should have a dictionary entry with that word as the key and `0` as the value.

In [10]:
def tf_idf(list_of_token_songs):
    # Full vocabulary of the combined corpus.
    unique_words = {word for song_tokens in list_of_token_songs for word in song_tokens}

    idf = inverse_document_frequency(list_of_token_songs)

    tf_idf_list_of_dicts = []

    for song_tokens in list_of_token_songs:
        song_tf = count_vectorize(song_tokens)
        # Start every word at 0 so each document covers the full vocabulary.
        doc_tf_idf = {word : 0 for word in unique_words}

        # Only words that occur in this song can get a non-zero score.
        for word in song_tf:
            doc_tf_idf[word] = song_tf[word] * idf[word]
        tf_idf_list_of_dicts.append(doc_tf_idf)

    return tf_idf_list_of_dicts
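The same idea can be checked by hand on a tiny corpus. The sketch below is self-contained and uses a made-up three-document corpus with `collections.Counter` and `math.log`; it mirrors the logic of `tf_idf()` but is not the notebook's implementation.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration): three tiny tokenized "songs".
toy_corpus = [
    ["baby", "all", "night"],
    ["baby", "truck", "truck"],
    ["baby", "night", "rain"],
]

vocab = sorted({token for doc in toy_corpus for token in doc})
num_docs = len(toy_corpus)

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in doc for doc in toy_corpus) for word in vocab}
idf = {word: math.log(num_docs / doc_freq[word]) for word in vocab}

# TF-IDF per document over the full vocabulary (absent words score 0).
toy_tf_idf = []
for doc in toy_corpus:
    tf = Counter(doc)  # missing words count as 0
    toy_tf_idf.append({word: tf[word] * idf[word] for word in vocab})

for row in toy_tf_idf:
    print({word: round(score, 3) for word, score in row.items()})
```

Here `baby` appears in all three documents, so its score is 0 everywhere, while `truck` appears twice in a single document and ends up with the highest score in the corpus.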

In [11]:
def main(filenames):

    all_songs = []

    for song in filenames:
        with open(f"lyrics data/{song}") as f:
            song_lyrics = f.readlines()
            all_songs.append(song_lyrics)

    all_song_tokens = []

    for song in all_songs:
        song_tokens = word_tokenize(clean_song(song))
        all_song_tokens.append(song_tokens)

    tf_idf_all_docs = tf_idf(all_song_tokens)
    
    return tf_idf_all_docs



In [12]:
tf_idf_all_docs = main(filenames)
tf_idf_all_docs[:3]

[{'champaign-urbana': 0,
  'grabs': 0,
  'cause': 0,
  'keep': 0,
  'flowers': 0,
  'pristine': 2.995732273553991,
  'nursin': 0,
  'destruction': 0,
  'beer': 0,
  'hers': 0,
  'trees': 0,
  'keys': 0,
  'traffic': 0,
  'engine': 0,
  'voicing': 0,
  'spent': 2.995732273553991,
  'here': 0,
  'goes': 0,
  'title': 0,
  'kung-fu': 0,
  'standout': 0,
  'day': 2.0996442489973552,
  'in': 0.5268025782891318,
  'swaps': 0,
  'muncie': 0,
  'no': 0,
  'raging': 0,
  'mama': 0,
  'mirror': 0,
  'five': 0,
  'away': 0,
  'case': 0,
  'calculated': 0,
  'section': 0,
  'forever': 4.605170185988092,
  'might': 0,
  'front': 0,
  'shits': 0,
  'lose': 0,
  'before': 0,
  'nope': 0,
  'steppin': 0,
  'sunshine': 2.995732273553991,
  'knob': 0,
  'clique': 0,
  'ropes': 0,
  'japan': 2.995732273553991,
  'foreman': 0,
  'trance': 0,
  'everlasting': 0,
  'hittas': 0,
  'dock': 0,
  'probably': 0,
  'lil': 0,
  'fucked': 0,
  'prayer': 0,
  'damn': 0,
  'fair': 0,
  'knockoff': 0,
  'safe': 0,
  '

## **Visualizing our Vectorizations**

In [13]:
vocab = list(tf_idf_all_docs[0].keys())
num_dims = len(vocab)
print(f"Number of Dimensions: {num_dims}")

Number of Dimensions: 1342


There are too many dimensions for us to visualize! In order to make it understandable to human eyes, we'll need to reduce it to 2 or 3 dimensions.  

To do this, we'll use a technique called **_t-SNE_** (short for _t-distributed Stochastic Neighbor Embedding_).

First, we need to pull the values out of the dictionaries stored in `tf_idf_all_docs` and store them in lists instead of dictionaries, dropping the words themselves. This is because scikit-learn's t-SNE implementation works with array-like objects, not dictionaries.  

In the cell below, create a list of lists that contains a list representation of the values of each of the dictionaries stored in `tf_idf_all_docs`.  The same structure should remain -- e.g. the first list should contain only the values that were in the first dictionary in `tf_idf_all_docs`, and so on. 

In [14]:
# One list of TF-IDF values per song, in the same order as the dictionaries.
tf_idf_vals_list = [list(doc.values()) for doc in tf_idf_all_docs]

tf_idf_vals_list[0][:10]

[0, 0, 0, 0, 0, 2.995732273553991, 0, 0, 0, 0]
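Relying on `.values()` works as long as every dictionary lists its keys in the same order, which holds here because each one is built from the same `unique_words` set. One way to make that alignment explicit, sketched on made-up toy dictionaries, is to look each word up through a shared vocabulary list:

```python
import numpy as np

# Toy TF-IDF dictionaries (values invented for illustration).
toy_docs = [
    {"baby": 0.0, "night": 0.41, "truck": 0.0},
    {"truck": 2.2, "baby": 0.0, "night": 0.0},  # keys in a different order
]

# Fix one column order and look each word up by key.
toy_vocab = sorted(toy_docs[0])
toy_matrix = np.array([[doc[word] for word in toy_vocab] for doc in toy_docs])

print(toy_vocab)
print(toy_matrix.shape)
```

Each row is a document and each column is a word, so column `j` always refers to `toy_vocab[j]` regardless of how the dictionaries were assembled.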

Now that we have only the values, we can use the `TSNE()` class from `sklearn` to transform our data appropriately. In the cell below, instantiate `TSNE()` with the following arguments:
- `n_components=3` (so we can compare 2 vs 3 components when graphing)
- `perplexity=19` (perplexity must be smaller than the number of samples, and we only have 20 songs)
- `learning_rate=200` (a higher learning rate than using 'auto', to reduce the chance of getting stuck in a local minimum)
- `init='random'` (so scikit-learn will randomize the initialization)
- `random_state=13` (so that the random initialization is reproducible)

Then, use the created object's `.fit_transform()` method to transform the data stored in `tf_idf_vals_list` into 3-dimensional data.  Then, inspect the newly transformed data to confirm that it has the correct dimensionality. 
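As a quick sanity check of the expected output shape, here is a minimal sketch that runs `TSNE` on synthetic data. The random matrix is only a stand-in for the TF-IDF values, and `perplexity=5` is an arbitrary choice for this toy input.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the TF-IDF matrix: 20 "documents", 50 "words".
rng = np.random.default_rng(0)
fake_tf_idf = rng.random((20, 50))

# perplexity must be smaller than the number of samples (20 here).
embedder = TSNE(n_components=2, perplexity=5, learning_rate=200,
                init="random", random_state=13)
embedded = embedder.fit_transform(fake_tf_idf)

# One row per document, one column per requested component.
print(embedded.shape)
```

The result keeps one row per input document and has `n_components` columns, which is the dimensionality to confirm on the real data.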

In [15]:
t_sne_object_3d = TSNE(n_components = 3, 
                       perplexity = 19,
                       learning_rate = 200,
                       init = 'random',
                       random_state = 13)


transformed_data_3d = t_sne_object_3d.fit_transform(np.array(tf_idf_vals_list))
transformed_data_3d

AttributeError: 'NoneType' object has no attribute 'split'

In [None]:
t_sne_object_2d = TSNE(n_components = 2, 
                       perplexity = 19,
                       learning_rate = 200,
                       init = 'random', 
                       random_state = 13)
transformed_data_2d = t_sne_object_2d.fit_transform(np.array(tf_idf_vals_list))
transformed_data_2d

AttributeError: 'NoneType' object has no attribute 'split'

In [None]:
kendrick_3d = transformed_data_3d[:10]
k3_x = [i[0] for i in kendrick_3d]
k3_y = [i[1] for i in kendrick_3d]
k3_z = [i[2] for i in kendrick_3d]

garth_3d = transformed_data_3d[10:]
g3_x = [i[0] for i in garth_3d]
g3_y = [i[1] for i in garth_3d]
g3_z = [i[2] for i in garth_3d]

fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(k3_x, k3_y, k3_z, c='b', s=60, label='Kendrick')
ax.scatter(g3_x, g3_y, g3_z, c='red', s=60, label='Garth')
ax.view_init(40,10)
ax.legend()
plt.show()

kendrick_2d = transformed_data_2d[:10]
k2_x = [i[0] for i in kendrick_2d]
k2_y = [i[1] for i in kendrick_2d]

garth_2d = transformed_data_2d[10:]
g2_x = [i[0] for i in garth_2d]
g2_y = [i[1] for i in garth_2d]

fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(222)
ax.scatter(k2_x, k2_y, c='b', label='Kendrick')
ax.scatter(g2_x, g2_y, c='red', label='Garth')
ax.legend()
plt.show()

NameError: name 'transformed_data_3d' is not defined


Both graphs show a basic trend among the red and blue dots, although the 3-dimensional 
graph is more informative than the 2-dimensional graph. We see a separation between the 
two artists because each artist uses words that the other does not. 

Words that are common to many songs are reduced to very small values, or to 0 for words 
that appear in every song, because of the log operation in the IDF function. This means 
that the largest elements of each song vector correspond to words that are unique to 
that specific document, or at least are rarely used in others.
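One way to inspect this is to list the highest-scoring words in a document's TF-IDF dictionary. The sketch below uses a toy dictionary with invented scores, and `top_terms` is a hypothetical helper, not part of the notebook.

```python
# Toy TF-IDF vector for one song (scores invented for illustration).
toy_song_vector = {"baby": 0.0, "the": 0.0, "night": 0.41,
                   "kikiritikiki": 23.97, "in": 0.53}

def top_terms(tf_idf_dict, n=3):
    """Return the n highest-scoring (word, score) pairs, best first."""
    return sorted(tf_idf_dict.items(), key=lambda pair: pair[1], reverse=True)[:n]

print(top_terms(toy_song_vector))
```

Applied to an entry of `tf_idf_all_docs`, the same helper would surface the words that most distinguish that song from the rest of the corpus.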