# Word Vectorization - Lab

## Introduction

In this lab, we'll learn how to tokenize and vectorize text documents, create and use a Bag of Words, and identify words unique to individual documents using TF-IDF Vectorization. 

## Objectives

You will be able to: 

* Tokenize a corpus of words and identify the different choices to be made while parsing them
* Use a Count Vectorization strategy to create a Bag of Words
* Use TF-IDF Vectorization with multiple documents to identify words that are important/unique to certain documents

## Let's get started!

Run the cell below to import everything necessary for this lab.  

In [3]:
import pandas as pd
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
np.random.seed(0)

### Our Corpus

In this lab, we'll be working with 20 different documents, each containing song lyrics from either Garth Brooks or Kendrick Lamar albums.  

The songs are stored in the `data` subdirectory, located in the same folder as this lab.  Each song is stored in its own file, with files named `song1.txt` through `song20.txt`.  

To make it easy to read in all of the documents, use a list comprehension to create a list containing the name of every single song file in the cell below. 

In [4]:
!ls data

lyrics_helper.py  song12.txt  song16.txt  song1.txt   song4.txt  song8.txt
lyrics_url.txt	  song13.txt  song17.txt  song20.txt  song5.txt  song9.txt
song10.txt	  song14.txt  song18.txt  song2.txt   song6.txt
song11.txt	  song15.txt  song19.txt  song3.txt   song7.txt


In [5]:
filenames = ['song' + str(n+1) + '.txt' for n in range(20)]
print(filenames)

['song1.txt', 'song2.txt', 'song3.txt', 'song4.txt', 'song5.txt', 'song6.txt', 'song7.txt', 'song8.txt', 'song9.txt', 'song10.txt', 'song11.txt', 'song12.txt', 'song13.txt', 'song14.txt', 'song15.txt', 'song16.txt', 'song17.txt', 'song18.txt', 'song19.txt', 'song20.txt']


Next, let's import a single song to see what our text looks like so that we can make sure we clean and tokenize it correctly. 

In the cell below, read in and print out the lyrics from `song11.txt`.  Use vanilla python, no pandas needed.  

In [6]:
with open('./data/song11.txt') as f:
    for line in f:
        print(line)

[Kendrick Lamar:]

Love, let's talk about love

Is it anything and everything you hoped for?

Or do the feeling haunt you?

I know the feeling haunt you

[SZA:]

This may be the night that my dreams might let me know

All the stars approach you, all the stars approach you, all the stars approach you

This may be the night that my dreams might let me know

All the stars are closer, all the stars are closer, all the stars are closer

[Kendrick Lamar:]

Tell me what you gon' do to me

Confrontation ain't nothin' new to me

You can bring a bullet, bring a sword, bring a morgue

But you can't bring the truth to me

Fuck you and all your expectations

I don't even want your congratulations

I recognize your false confidence

And calculated promises all in your conversation

I hate people that feel entitled

Look at me crazy 'cause I ain't invite you

Oh, you important?

You the moral to the story? You endorsin'?

Motherfucker, I don't even like you

Corrupt a man's heart with a gift

That's 

In [7]:
documents = []

for file in filenames:
    #print(file)
    with open('./data/' + file) as f:
        documents.append(f.read())


In [8]:
documents[1:3]

["My head is aching, and I'm late for work\nThere's a girl in the kitchen, and she's wearing my shirt\nMy buddies are home again, they're threatening to leave\nYes that beer on my nightstand will be breakfast for me\nCause that's how it goes with cowboys and friends\nAs soon as it's over, it all starts again\nThat's the way that it should be\nCause that's the way that it's been\nYeah the fun never ends, when the party begins\nCowboys and friends\nI've been working all morning, just busting my back\nAnd it's all for a foreman who doesn't know jack\nMy buddies keep talking, they say we're going out late\nGuess that sleep that I'm wanting will just have to wait\nCause that's how it goes with cowboys and friends\nAs soon as it's over, it all starts again\nThat's the way that it should be\nCause that's the way that it's been\nYeah the fun never ends, when the party begins\nCowboys and friends\nYeah the party begins where this two lane road ends\nCowboys and friends\n",
 'She worked the wind

### Tokenizing our Data

Before we can create a Bag of Words or vectorize each document, we need to clean it up and split each song into an array of individual words.  Computers are very particular about strings. If we tokenized our data in its current state, we would run into the following problems:

1. Counting things that aren't actually words.  In the example above, `"[Kendrick Lamar:]"` is a note specifying who is speaking, not a lyric contained in the actual song, so it should be removed.  
1. Punctuation and capitalization would mess up our word counts.  To the Python interpreter, `love`, `Love`, `Love?`, and `Love\n` are all different strings, and would all be counted separately.  We need to remove punctuation and capitalization, so that all words will be counted correctly. 
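
To make the second problem concrete, here is a minimal sketch that counts "words" by splitting on single spaces with no cleaning at all. The `raw` string is a made-up lyric used only for illustration; it is not taken from the song files.

```python
from collections import Counter

# Hypothetical two-line lyric, used only for illustration
raw = "Love, let's talk about love\nIs love what you hoped for? Love?\n"

# Naive tokenization: split on single spaces, with no cleaning
naive_counts = Counter(raw.split(' '))

# Every variant that contains "love" ends up as a different key
love_variants = [token for token in naive_counts if 'love' in token.lower()]
print(love_variants)
print(naive_counts['love'])
```

Although "love" occurs four times in this lyric, the plain key `'love'` is only counted once, because the other occurrences carry punctuation, capitalization, or a newline.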

Consider the following sentences from the example above:

`"Love, let's talk about love\n", 'Is it anything and everything you hoped for?\n'`

After tokenization, this should look like:

`['love', "let's", 'talk', 'about', 'love', 'is', 'it', 'anything', 'and', 'everything', 'you', 'hoped', 'for']`
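
As a rough sketch of what this cleaning involves, the two sentences can be tokenized with plain string methods. The `simple_tokenize` function below is a hypothetical helper, not the `nltk` tokenizer used later in this lab, and it assumes we only strip `,.?!` while keeping apostrophes.

```python
def simple_tokenize(text):
    # Lowercase, drop sentence punctuation (apostrophes are kept), split on whitespace
    to_strip = ',.?!'
    cleaned = text.lower().translate(str.maketrans('', '', to_strip))
    return cleaned.split()

lines = "Love, let's talk about love\nIs it anything and everything you hoped for?\n"
tokens = simple_tokenize(lines)
print(tokens)
```

Calling `split()` with no arguments splits on any whitespace, so the newline characters disappear without a separate step.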

Tokenization is pretty tedious if we handle it manually, and would probably make use of Regular Expressions, which are outside the scope of this lab.  In order to keep this lab moving, we'll use a library function to clean and tokenize our data so that we can move on to vectorization.  

Tokenization is required for just about any Natural Language Processing (NLP) task, so great industry-standard tools exist to tokenize text for us. That lets us spend our time on more important tasks without getting bogged down hunting for every special symbol or punctuation mark in a massive dataset. For this lab, we'll make use of the tokenizer in the amazing `nltk` library, which is short for _Natural Language Tool Kit_.

**_NOTE:_** NLTK requires extra installation steps the first time certain methods are used.  If `nltk` throws an error about needing to install additional packages, follow the instructions in the error message to install the dependencies, and then rerun the cell.  

Before we tokenize our songs, we'll do a small bit of manual cleaning.  In the cell below, write a function that removes lines containing `['artist names']`, to ensure that our song files contain only lyrics that are actually in the song. For the lines that remain, make every word lowercase, and remove newline characters `\n` and any of the following punctuation marks: `",.'?!"`

Test the function on `test_song` to show that it has successfully removed `'[Kendrick Lamar:]'` and other instances of artist names from the song and returned it.  

In [9]:
test_song = documents[15]

In [10]:
artists = ['Kendrick Lamar', 'Garth Brooks', 'SZA', 
           'Kendrick Lamar & SZA', 'Khalid', 'The Weeknd', 
           'Babes Wodumo', 'Zacari', 'Hykeem Carter (Kendrick Lamar)',
           'Future & Kendrick Lamar', 'Jay Rock', 'James Blake',
           'Future','Ab-Soul','Anderson .Paak', 'Jorja Smith',
           'Jorja Smith (Kendrick Lamar)', 'Swae Lee',
           'Swae Lee (Kendrick Lamar)']

punctuations = ",.'?!()[]`"

In [11]:
def clean_song(song):
    """
    Clean songs:
    -Remove artists | remove newlines 
    """
    
    # remove artist headers
    for artist in artists:
        song = song.replace('[' + artist  + ':]',' ')
    
    # remove newlines
    song = song.replace('\n',' ')
    
    # remove double backticks
    song = song.replace('``',' ')
    
    # remove double quotes
    song = song.replace("''",' ')
    
    # every word lowercase
    song = song.lower()
    
    # any of the following punctuation marks: ",.'?!"
    output = ''
    for char in song:
        if char in punctuations:
            char = ' '
        output += char
    song = output
    return song
    
song_without_brackets = clean_song(test_song)
print(song_without_brackets[:20])

  miss me with that 


Great. Now, write a function that takes in a song that has had its brackets removed, joins all of the lines into a single string, and then uses `word_tokenize()` on it to get a fully tokenized version of the song.  Test this function on `song_without_brackets` to ensure that the function works. 

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/werlindo/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
def tokenize(song):
    """
    Input: string of song lyrics
    Output: list of words(tokens)
    """
    song = word_tokenize(song)
    return song

tokenized_test_song = tokenize(song_without_brackets)
tokenized_test_song[:10]


['miss',
 'me',
 'with',
 'that',
 'bullshit',
 'bullshit',
 'you',
 'ain',
 't',
 'really']

Great! Now that we are able to tokenize our songs, we can move on to Vectorization. 

### Count Vectorization

Machine Learning algorithms don't understand strings.  However, they do understand math, which means they understand vectors and matrices.  By **_Vectorizing_** the text, we convert the entire text into a vector, where each element in the vector represents a different word.  The vector is the length of the entire vocabulary--usually, every word that occurs in the English language, or at least every word that appears in our corpus.  Any given sentence can then be represented as a vector in which every element is 0, except for the elements corresponding to words that appear in the sentence, which hold a 1 (or a count of how many times that word appears in the sentence). 

Consider the following example: 

<center>"I scream, you scream, we all scream for ice cream."</center>

| 'aardvark' | 'apple' | [...] | 'I' | 'you' | 'scream' | 'we' | 'all' | 'for' | 'ice' | 'cream' | [...] | 'xylophone' | 'zebra' |
|:----------:|:-------:|:-----:|:---:|:-----:|:--------:|:----:|:-----:|:-----:|:-----:|:-------:|:-----:|:-----------:|:-------:|
|      0     |    0    |   0   |  1  |   1   |     3    |   1  |   1   |   1   |   1   |    1    |   0   |      0      |    0    |

This is called a **_Sparse Representation_**, since the vast majority of the columns will have a value of 0.  Note that elements corresponding to words that do not occur in the sentence have a value of 0, while words that do appear in the sentence have a value equal to the number of times they appear.
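
The table above can be reproduced with a short sketch. The `vocab` list here is a tiny, hypothetical stand-in for a full vocabulary, and the punctuation handling is deliberately minimal.

```python
# Tiny stand-in vocabulary (a real one would hold every word in the corpus)
vocab = ['aardvark', 'apple', 'i', 'you', 'scream', 'we', 'all',
         'for', 'ice', 'cream', 'xylophone', 'zebra']

sentence = "I scream, you scream, we all scream for ice cream."

# Minimal cleaning: lowercase, drop commas and periods, split on whitespace
tokens = sentence.lower().replace(',', '').replace('.', '').split()

# One element per vocabulary word, holding how often it occurs in the sentence
vector = [tokens.count(word) for word in vocab]
print(vector)
```

Only 8 of the 12 positions are non-zero here; with a realistic vocabulary of thousands of words, nearly every position would be 0.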

Alternatively, we can represent this sentence as a plain old python dictionary of word frequency counts:

```python
BoW = {
    'I':1,
    'you':1,
    'scream':3,
    'we':1,
    'all':1,
    'for':1,
    'ice':1,
    'cream':1
}
```

Both of these are examples of **_Count Vectorization_**. They allow us to represent a sentence as a vector, with each element in the vector corresponding to how many times that word is used.

#### Positional Information and Bag of Words

Notice that when we vectorize a sentence this way, we lose the order that the words were in.  This is the **_Bag of Words_** approach mentioned earlier.  Note that sentences that contain the same words will create the same vectors, even if they mean different things--e.g. `'cats are scared of dogs'` and `'dogs are scared of cats'` would both produce the exact same vector, since they contain the same words.  
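
A quick way to check this claim is to compare the word counts of the two example sentences. This sketch uses `collections.Counter` as a stand-in for a Bag of Words.

```python
from collections import Counter

# Same words, different order, different meaning
bow_a = Counter('cats are scared of dogs'.split())
bow_b = Counter('dogs are scared of cats'.split())

# The two bags are equal because word order is not stored
same_bag = bow_a == bow_b
print(same_bag)
```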

In the cell below, create a function that takes in a tokenized, cleaned song and returns a Count Vectorized representation of it as a python dictionary. Add in an optional parameter called `vocab` that defaults to `None`. This way, if we are using a vocabulary that contains words not seen in the song, we can still use this function by passing it in to the `vocab` parameter. 

**_Hint:_**  Consider using a `set` object to make this easier!

In [14]:
def count_vectorize(song, vocab=None):
    """
    Takes in song lyrics as a list of words (tokens)
    Outputs dictionary of unique words in list, with 
        respective counts
    If vocab is passed in, every word in vocab gets an entry
        (with a count of 0 if it is not in the song)
    """    
    
    # Start every vocabulary word at 0 when a vocabulary is supplied
    if vocab is not None:
        d = dict.fromkeys(vocab, 0)
    else:
        d = {}

    # Iterate through the tokens of the song and count each one
    for word in song:
        d[word] = d.get(word, 0) + 1
    return d  

test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)

{'miss': 3, 'me': 8, 'with': 7, 'that': 10, 'bullshit': 6, 'you': 28, 'ain': 14, 't': 19, 'really': 2, 'wild': 2, 'a': 16, 'tourist': 6, 'i': 66, 'be': 6, 'blackin': 4, 'out': 5, 'the': 22, 'purist': 4, 'made': 6, 'hundred': 3, 'thou': 4, 'then': 10, 'freaked': 14, 'it': 30, '500': 3, 'bought': 3, '87': 2, 'for': 5, 'weekend': 6, 'this': 8, 'what': 13, 'want': 15, 'and': 5, 's': 6, 'like': 9, 'lil': 5, 'bitch': 7, 'mvp': 1, 'get': 9, 'no': 3, 'sleep': 1, 'don': 4, 'bust': 1, 'open': 1, 'ocean': 1, 'yeah': 4, 'bite': 2, 'back': 2, 'do': 2, 'need': 1, 'two': 1, 'life': 3, 'jackets': 1, 'gon': 11, 'hold': 5, 'press': 1, 'never': 5, 'control': 1, 'front': 1, 'keep': 3, '100': 1, 'know': 5, 'boss': 1, 'top': 1, 'dawg': 1, 'bossed': 1, 'my': 19, 'up': 8, 'crossin': 1, 'over': 1, 'stutter': 1, 'steppin': 1, 'got': 12, 'hall': 1, 'of': 1, 'fame': 1, 'in': 6, 'all': 6, 'posters': 1, 've': 1, 'been': 6, 'ready': 9, 'whip': 1, 'clique': 1, 'shit': 1, 'check': 1, 'shot': 1, 'on': 7, 'full': 2, 'ar
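
To illustrate why a shared vocabulary is useful, the sketch below counts two short token lists against the union of their words, so that both results have the same keys in the same order. The function name `count_with_vocab` and the two token lists are hypothetical and only serve this example.

```python
def count_with_vocab(tokens, vocab):
    # Start every vocabulary word at 0, then count the tokens we actually see
    counts = dict.fromkeys(sorted(vocab), 0)
    for token in tokens:
        if token in counts:
            counts[token] += 1
    return counts

song_a = ['cowboys', 'and', 'friends']
song_b = ['all', 'the', 'stars', 'and', 'the', 'night']

# A set union gives the combined vocabulary of both songs
vocab = set(song_a) | set(song_b)

vec_a = count_with_vocab(song_a, vocab)
vec_b = count_with_vocab(song_b, vocab)
print(vec_a)
print(vec_b)
```

Because both dictionaries are built from the same sorted vocabulary, their values line up position by position and can be compared directly.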

Great! You've just successfully vectorized your first text document! Now, let's look at a more advanced type of vectorization, TF-IDF!

### TF-IDF Vectorization

TF-IDF stands for **_Term Frequency, Inverse Document Frequency_**.  This is a more advanced form of vectorization that weights each term in a document by how unique it is to the given document it is contained in, which allows us to summarize the contents of a document using a few key words.  If the word is used often in many other documents, it is not unique, and therefore probably not too useful if we wanted to figure out how this document is unique in relation to other documents.  Conversely, if a word is used many times in a document, but rarely in all the other documents we are considering, then it is likely a good indicator for telling us that this word is important to the document in question.  

The formula TF-IDF uses to determine the weights of each term in a document is **_Term Frequency_** multiplied by **_Inverse Document Frequency_**, where the formula for Term Frequency is:

$$\large Term\ Frequency(t) = \frac{number\ of\ times\ t\ appears\ in\ a\ document} {total\ number\ of\ terms\ in\ the\ document} $$
<br>
<br>
Complete the function below to calculate term frequency for every term in a document.  

In [15]:
def term_frequency(BoW_dict):
    """
    input : a bag of words dictionary, with counts
    output : a dictionary with counts replaced with frequency within
                document
    """
    tf = {}
    
    num_terms_doc = sum(BoW_dict.values())
    
    for word, count in BoW_dict.items():
        tf[word]  = count/num_terms_doc
    
    return tf

test = term_frequency(test_vectorized)
print(list(test)[10:20])

['a', 'tourist', 'i', 'be', 'blackin', 'out', 'the', 'purist', 'made', 'hundred']
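
As a small worked example of the Term Frequency formula, the sketch below applies it to the "I scream" Bag of Words from earlier. Since every count is divided by the same total, the frequencies should add up to 1, up to floating-point rounding.

```python
# Bag of Words for "I scream, you scream, we all scream for ice cream."
bow = {'i': 1, 'you': 1, 'scream': 3, 'we': 1,
       'all': 1, 'for': 1, 'ice': 1, 'cream': 1}

# Total number of terms in the document (10 here)
total_terms = sum(bow.values())

# Term frequency: count of each word divided by the total number of terms
tf = {word: count / total_terms for word, count in bow.items()}

print(tf['scream'])
print(sum(tf.values()))
```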


The formula for Inverse Document Frequency is:  
<br>  
<br>
$$\large  IDF(t) =  \log_e\left(\frac{Total\ Number\ of\ Documents}{Number\ of\ Documents\ with\ t\ in\ it}\right)$$
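
To get a feel for the scale of these values, here is a short sketch that evaluates the formula for a corpus of 20 documents, the size of our corpus. The helper name `idf` is chosen for this example only.

```python
import math

def idf(n_docs, n_docs_with_term):
    # Natural log of (total documents / documents containing the term)
    return math.log(n_docs / n_docs_with_term)

n_docs = 20
idf_values = {k: idf(n_docs, k) for k in (1, 2, 4, 20)}

for docs_with_term, value in idf_values.items():
    print(docs_with_term, round(value, 4))
```

A word that appears in only 1 of 20 documents gets the largest weight, and a word that appears in all 20 gets a weight of exactly 0.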

With this formula, we can calculate _Inverse Document Frequency_.  In the cell below, complete the following function.  This function should take in a list of dictionaries, with each item in the list being a Bag of Words representing the words in a different song. The function should return a dictionary containing the inverse document frequency values for each word.  

In [16]:
def inverse_document_frequency(list_of_dicts):
    # Instantiate set
    all_words = set()
   
    # Fill set with unique words
    for song in list_of_dicts:
        for word in song.keys():
            all_words.add(word)
    
    # Based on set, create base dictionary with 0 counts
    # Let's use dictionary comprehension
    all_words_dict = {word:0 for word in all_words}
    
    # Loop through all_words
    for word, count in all_words_dict.items():
        
        curr_word_ct = 0
        
        # Loop through each song, get words:
        for song in list_of_dicts:
            if word in song:
                curr_word_ct += 1

        all_words_dict[word] = np.log(len(list_of_dicts)/curr_word_ct)
        
    return all_words_dict



In [18]:
# Testing
songs_dict = [ count_vectorize(tokenize(clean_song(doc))) for doc in documents ]

test_dict = inverse_document_frequency(songs_dict)

test_dict

{'second': 2.302585092994046,
 'afraid': 2.995732273553991,
 'man': 1.2039728043259361,
 'mando': 2.995732273553991,
 'freedom': 2.302585092994046,
 'whoever': 2.995732273553991,
 'dirt': 2.995732273553991,
 'head': 1.6094379124341003,
 'people': 2.302585092994046,
 'hey': 2.302585092994046,
 'team': 2.302585092994046,
 'wrong': 2.302585092994046,
 'brother': 2.302585092994046,
 'herb': 2.995732273553991,
 'realest': 2.995732273553991,
 'friend': 2.995732273553991,
 'matches': 2.995732273553991,
 'hope': 1.8971199848858813,
 'lose': 2.995732273553991,
 'kill': 2.302585092994046,
 'cleaver': 2.995732273553991,
 'heading': 2.995732273553991,
 'doors': 2.302585092994046,
 'teach': 2.995732273553991,
 'serve': 2.995732273553991,
 'drummer': 2.995732273553991,
 'been': 1.0498221244986776,
 'sound': 2.995732273553991,
 'fingerprints': 2.995732273553991,
 'yeah': 1.0498221244986776,
 'trying': 1.8971199848858813,
 'sing': 2.995732273553991,
 'neighbor': 2.995732273553991,
 'surface': 2.995732

### Computing TF-IDF

Now that we can compute both Term Frequency and Inverse Document Frequency, computing an overall TF-IDF value is simple! All we need to do is multiply the two values.  

In the cell below, complete the `tf_idf()` function.  This function should take in a list of dictionaries, just as the `inverse_document_frequency()` function did.  This function should return a new list of dictionaries, with each dictionary containing the tf-idf vectorized representation of a corresponding song document. 

**_NOTE:_** Each document should contain the full vocabulary of the entire combined corpus.  

In [None]:
def tf_idf(list_of_dicts):
    # Create list to hold the tf-idf dictionary for each song
    tf_idf_list = []
    
    # Leverage idf function; its keys are the full vocabulary of the corpus
    idf = inverse_document_frequency(list_of_dicts)
    
    # Create tf-idf list of dictionaries, one dictionary per document
    for song in list_of_dicts:
        # Get term-frequency dict for this song
        song_tf = term_frequency(song)
        # Start from the full vocabulary so every document has the same keys
        tf_idf_dict = {word: 0 for word in idf}
        # tf-idf is term frequency multiplied by inverse document frequency
        for word, tf in song_tf.items():
            tf_idf_dict[word] = tf * idf[word]
        tf_idf_list.append(tf_idf_dict)
    
    return tf_idf_list
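
As a sanity check on the multiplication described above, the following self-contained sketch computes tf-idf for a toy corpus of three very short "documents". The corpus and variable names are made up for illustration and do not use the helper functions from this lab.

```python
import math

# Toy corpus: three tokenized "documents"
toy_corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'bird', 'flew'],
]

vocab = sorted({word for doc in toy_corpus for word in doc})
n_docs = len(toy_corpus)

# idf: log of (number of documents / number of documents containing the word)
idf = {w: math.log(n_docs / sum(w in doc for doc in toy_corpus)) for w in vocab}

scores = []
for doc in toy_corpus:
    # tf is the share of the document's tokens equal to this word (0 if absent)
    row = {w: (doc.count(w) / len(doc)) * idf[w] for w in vocab}
    scores.append(row)

print(scores[0])
```

In this toy corpus, `'the'` appears in every document, so its idf is `log(1) = 0` and its tf-idf score is 0 everywhere, while `'cat'` appears in only one document and gets the highest score in that document.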

---

### Vectorizing All Documents

Now that we've created all the necessary helper functions, we can load in all of our documents and run each through the vectorization pipeline we've just created.

In the cell below, complete the `main` function.  This function should take in a list of file names (provided for you in the `filenames` list we created at the start), and then:

1. Read in each document
1. Tokenize each document
1. Convert each document to a Bag of Words (dictionary representation)
1. Return a list of dictionaries vectorized using tf-idf, where each dictionary is a vectorized representation of a document.  

**_HINT:_** Remember that all files are stored in the `data/` directory.  Be sure to prepend this to the filename when reading in each file, otherwise the path won't be correct!

---

In [None]:
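def main(filenames):
    # Read, clean, tokenize, and count-vectorize each song
    bags_of_words = []
    for filename in filenames:
        with open('./data/' + filename) as f:
            bags_of_words.append(count_vectorize(tokenize(clean_song(f.read()))))
    return tf_idf(bags_of_words)

tf_idf_all_docs = main(filenames)
print(list(tf_idf_all_docs[0])[:10])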

### Visualizing our Vectorizations

Now that we have a tf-idf representation of each document, we can move on to the fun part--visualizing everything!

Let's investigate how many dimensions our data currently has.  In the cell below, examine the dataset to find its number of dimensions. 

**_HINT_**: Remember that every word is its own dimension!

In [None]:
num_dims = len(tf_idf_all_docs[0])  # one dimension per word in the vocabulary
print("Number of Dimensions: {}".format(num_dims))

That's much too high-dimensional for us to visualize! In order to make it understandable to human eyes, we'll need to reduce dimensionality to 2 or 3 dimensions.  

### Reducing Dimensionality

To do this, we'll use a technique called **_t-SNE_** (short for _t-distributed Stochastic Neighbor Embedding_).  This is too complex for us to code ourselves, so we'll make use of sklearn's implementation of it.  

First, we need to pull the words out of the dictionaries stored in `tf_idf_all_docs` so that only the values remain, and store them in lists instead of dictionaries.  This is because the t-SNE object only works with Array-like objects, not dictionaries.  

In the cell below, create a list of lists that contains a list representation of the values of each of the dictionaries stored in `tf_idf_all_docs`.  The same structure should remain--e.g. the first list should contain only the values that were in the 1st dictionary in `tf_idf_all_docs`, and so on. 

In [None]:
tf_idf_vals_list = []

for i in tf_idf_all_docs:
    tf_idf_vals_list.append(list(i.values()))
    
tf_idf_vals_list[0][:10]
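
Note that `list(d.values())` only produces comparable rows if every dictionary lists its keys in the same order. As a hedged alternative, the sketch below builds each row explicitly from a sorted vocabulary, so the columns line up regardless of key order. The `doc_scores` data is made up for illustration.

```python
# Two hypothetical tf-idf dictionaries with different key orders
doc_scores = [
    {'cat': 0.37, 'the': 0.0},
    {'the': 0.0, 'dog': 0.37, 'cat': 0.0},
]

# Fix one column order for every document
vocab = sorted(set().union(*doc_scores))

# Look each word up by name; words missing from a document default to 0.0
rows = [[d.get(word, 0.0) for word in vocab] for d in doc_scores]

print(vocab)
print(rows)
```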

Now that we have only the values, we can use the `TSNE` object from `sklearn` to transform our data appropriately.  In the cell below, create a `TSNE` with `n_components=3` passed in as a parameter.  Then, use the created object's `fit_transform()` method to transform the data stored in `tf_idf_vals_list` into 3-dimensional data.  Then, inspect the newly transformed data to confirm that it has the correct dimensionality. 

In [None]:
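# perplexity=5 is an assumption: recent scikit-learn versions require perplexity < number of samples (20 songs)
t_sne_object_3d = TSNE(n_components=3, perplexity=5)
transformed_data_3d = t_sne_object_3d.fit_transform(np.array(tf_idf_vals_list))
transformed_data_3d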

We'll also want to check out how the visualization looks in 2d.  Repeat the process above, but this time, create a `TSNE` object with 2 components instead of 3.  Again, use `fit_transform()` to transform the data and store it in the variable below, and then inspect it to confirm the transformed data has only 2 dimensions. 

In [None]:
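# perplexity=5 is an assumption: recent scikit-learn versions require perplexity < number of samples (20 songs)
t_sne_object_2d = TSNE(n_components=2, perplexity=5)
transformed_data_2d = t_sne_object_2d.fit_transform(np.array(tf_idf_vals_list))
transformed_data_2d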

Now, let's visualize everything!  Run the cell below to create a 3D and a 2D visualization of the songs.

In [None]:
kendrick_3d = transformed_data_3d[:10]
k3_x = [i[0] for i in kendrick_3d]
k3_y = [i[1] for i in kendrick_3d]
k3_z = [i[2] for i in kendrick_3d]

garth_3d = transformed_data_3d[10:]
g3_x = [i[0] for i in garth_3d]
g3_y = [i[1] for i in garth_3d]
g3_z = [i[2] for i in garth_3d]

fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(k3_x, k3_y, k3_z, c='b', s=60, label='Kendrick')
ax.scatter(g3_x, g3_y, g3_z, c='red', s=60, label='Garth')
ax.view_init(30, 10)
ax.legend()
plt.show()

kendrick_2d = transformed_data_2d[:10]
k2_x = [i[0] for i in kendrick_2d]
k2_y = [i[1] for i in kendrick_2d]

garth_2d = transformed_data_2d[10:]
g2_x = [i[0] for i in garth_2d]
g2_y = [i[1] for i in garth_2d]

fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(222)
ax.scatter(k2_x, k2_y, c='b', label='Kendrick')
ax.scatter(g2_x, g2_y, c='red', label='Garth')
ax.legend()
plt.show()

Interesting! Take a crack at interpreting these graphs by answering the following questions:

What does each graph mean? Do you find one graph more informative than the other? Do you think that this method shows us discernible differences between Kendrick Lamar songs and Garth Brooks songs?  Use the graphs and your understanding of TF-IDF to support your answer.  

Write your answer to this question below this line:
________________________________________________________________________________________________________________________________

Both graphs show a basic trend among the red and blue dots, although the 3-dimensional graph is more informative than the 2-dimensional graph.  We see a separation between the two artists because each artist uses words that the other does not.  The values for words that are common to both artists are reduced to very small numbers or to 0, because of the log operation in the IDF function.  This means that the elements of each song vector with the highest values will be the ones for words that are unique to that specific document, or at least are rarely used in others.  

## Summary

In this lab, we learned how to: 
* Tokenize a corpus of words and identify the different choices to be made while parsing them
* Use a Count Vectorization strategy to create a Bag of Words
* Use TF-IDF Vectorization with multiple documents to identify words that are important/unique to certain documents
* Visualize and compare vectorized text documents

---


In [115]:
import os
[filename for filename in os.listdir('data/')
 if 'song' in filename]

['song19.txt',
 'song10.txt',
 'song11.txt',
 'song15.txt',
 'song14.txt',
 'song18.txt',
 'song2.txt',
 'song17.txt',
 'song4.txt',
 'song20.txt',
 'song7.txt',
 'song6.txt',
 'song1.txt',
 'song12.txt',
 'song13.txt',
 'song9.txt',
 'song5.txt',
 'song3.txt',
 'song8.txt',
 'song16.txt']

---


### Digression:

Let's build our own version of the pipeline, this time using `glob` and scikit-learn's `TfidfVectorizer`!

In [116]:
import glob

filenames = glob.glob('data/song*.txt')

In [117]:
filenames

['data/song19.txt',
 'data/song10.txt',
 'data/song11.txt',
 'data/song15.txt',
 'data/song14.txt',
 'data/song18.txt',
 'data/song2.txt',
 'data/song17.txt',
 'data/song4.txt',
 'data/song20.txt',
 'data/song7.txt',
 'data/song6.txt',
 'data/song1.txt',
 'data/song12.txt',
 'data/song13.txt',
 'data/song9.txt',
 'data/song5.txt',
 'data/song3.txt',
 'data/song8.txt',
 'data/song16.txt']
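
As the output above shows, `glob.glob` does not return the files in numeric order. If the order of the documents matters for a later step, one option is to sort by the number embedded in each filename. The sketch below assumes names of the form `data/songN.txt`; `song_number` is a hypothetical helper, and the list is a shortened copy of the output above.

```python
import re

def song_number(path):
    # Pull the integer out of names like 'data/song19.txt'
    return int(re.search(r'song(\d+)\.txt$', path).group(1))

unsorted_names = ['data/song19.txt', 'data/song10.txt',
                  'data/song2.txt', 'data/song1.txt']

# Sort by the numeric part, so that song2 comes before song10
sorted_names = sorted(unsorted_names, key=song_number)
print(sorted_names)
```

A plain `sorted(unsorted_names)` would compare the names as strings and place `song10.txt` before `song2.txt`, which is why the numeric key is used here.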

In [120]:
data = {}
for filename in filenames:
    with open(filename) as f:
        text = ""
        for line in f:
            if line.strip().startswith('['):
                continue
            text += line
    data[filename] = text


In [123]:
documents = list(data.values())

In [124]:
documents[0]

"Wakanda\nWelcome\nBig shot, hol' up, wait, peanut butter insides (no)\nOutside, cocaine white, body look like Gentiles (Gentiles)\nEmotion, emotion, emotion, emotional\nWhy you emotional? Why you emotional?\nAh, bitch, you emotional, yeah\nBig shot, big shot, (hol' on, hol' on), peanut butter insides (hol' on)\nOutside, cocaine white, body look like Gentiles (Gentiles)\nEmotion, emotion, emotion, emotional\nWhy you emotional? Why you emotional?\nAh, bitch, you emotional, yeah\nServe that work for Kung-Fu Kenny\nGot juice, got work, got weight, got plenty\nGot them, got her, got more, got Benji (yeah)\nTop off gettin' topped-off in the city\nBig Top Dawg and I dance on 'em like Diddy\nPop off and I pop back like Fiddy (yeah)\nI hit the ceiling and forgot about the floor (yeah)\nBrand so big, got my haters on the ropes (yeah)\nThis be the wave, plus I live on the coast (yeah)\nWhen I touch a bag, young nigga do the most (yeah)\nMmm, woo, and I Wakanda flex\nAnd you know what time it is 

In [127]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [128]:
vec = TfidfVectorizer()

In [129]:
vec.fit(documents)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [132]:
tfidf_matrix = vec.transform(documents).toarray()

In [133]:
tfidf_matrix

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.02207968, 0.02207968, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [134]:
additional_data = vec.transform(["Happy Birthday Mando"])

In [135]:
additional_data

<1x1307 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>
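
The `2 stored elements` in the output above suggest that only two of the three words in "Happy Birthday Mando" were found in the fitted vocabulary, since a fitted vectorizer ignores words it did not see during `fit`. The sketch below imitates that behavior in plain Python. The `fitted_vocab` dictionary and the `to_count_row` function are hypothetical stand-ins rather than scikit-learn objects, and the output above does not tell us which two words actually matched.

```python
# Hypothetical fitted vocabulary: maps each known word to a column index
fitted_vocab = {'birthday': 0, 'mando': 1, 'song': 2}

def to_count_row(text, vocab):
    # Build one row of counts; words outside the vocabulary are skipped
    row = [0] * len(vocab)
    for token in text.lower().split():
        index = vocab.get(token)
        if index is not None:
            row[index] += 1
    return row

row = to_count_row("Happy Birthday Mando", fitted_vocab)
nonzero = sum(1 for value in row if value != 0)
print(row, nonzero)
```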