# Word Vectorization - Lab

[Solutions](https://github.com/learn-co-curriculum/dsc-word-vectorization-lab)

## Introduction

In this lab, you'll learn how to tokenize and vectorize text documents, create a bag of words, and apply TF-IDF vectorization. The objectives of this lab include:

- Implementing tokenization and count vectorization from scratch
- Implementing TF-IDF from scratch
- Applying dimensionality reduction techniques to visualize vectorized text data

By the end of this lab, you will have gained hands-on experience in these techniques and be able to interpret visualizations of the vectorized text data.

In [2]:
import pandas as pd
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt', quiet=True)
np.random.seed(0)

### Corpus Overview

In this lab, we have a corpus consisting of 20 documents containing song lyrics from Garth Brooks and Kendrick Lamar albums. The song files are stored in the `data` subdirectory of this lab's folder. Each song is stored in a separate file, named from `song1.txt` to `song20.txt`. 

To conveniently read in all the documents, we can use a list comprehension to create a list that contains the names of all the song files.

In [1]:
filenames = None

In [3]:
filenames = [f'song{i}.txt' for i in range(1, 21)]
filenames

['song1.txt',
 'song2.txt',
 'song3.txt',
 'song4.txt',
 'song5.txt',
 'song6.txt',
 'song7.txt',
 'song8.txt',
 'song9.txt',
 'song10.txt',
 'song11.txt',
 'song12.txt',
 'song13.txt',
 'song14.txt',
 'song15.txt',
 'song16.txt',
 'song17.txt',
 'song18.txt',
 'song19.txt',
 'song20.txt']

Next, let's import a single song to see what our text looks like so that we can make sure we clean and tokenize it correctly. 

Use the code in the cell below to read in the lyrics from `song18.txt` as a list of lines, just using vanilla Python:

In [None]:
# Import and print song18.txt
with open('data/song18.txt') as f:
    test_song = f.readlines()
    
test_song

['[Kendrick Lamar:]\n',
 "Two wrongs don't make us right away\n",
 "Tell me something's wrong\n",
 'Party all of our lives away\n',
 'To take you on\n',
 '[Zacari:]\n',
 'Oh, baby I want you\n',
 'Baby I need you\n',
 'I wanna see you\n',
 'Baby I wanna go out yeah\n',
 'Baby I wanna go out yeah\n',
 'Baby I want you\n',
 'Baby I need you\n',
 'I wanna see you\n',
 'Baby I wanna go out yeah\n',
 'Baby I wanna go out yeah\n',
 'All night (all night, all night)\n',
 'All night\n',
 "Your body's on fire\n",
 'And your drinks on ice\n',
 'All night (all night, all night)\n',
 'All night\n',
 "Your body's on fire\n",
 'And your drinks on ice\n',
 '[Babes Wodumo:]\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Oh my word oh my gosh oh my word (Oh my gosh)\n',
 'Everybody say kikiritikiki (kikiritikiki)\n',
 'Everybody say kikiritikiki (kikiritikiki)\n',
 'Everybody say kikiritik

### Tokenizing Data

Before creating a bag of words or vectorizing each document, it is necessary to clean up the data and split each song into individual words. Tokenization is the process of breaking down text into smaller units, such as words or tokens.

Consider the example sentences:

`"Two wrongs don't make us right away\n", "Tell me something's wrong\n"`

After tokenization, they would look like:

`['two', 'wrongs', 'dont', 'make', 'us', 'right', 'away', 'tell', 'me', 'somethings', 'wrong']`

Tokenizing by hand is tedious: it usually means writing regular expressions and handling many kinds of symbols and punctuation. To streamline the process, we can use existing tools. In this lab, we will use the `nltk` library, short for _Natural Language Toolkit_, which provides ready-made tokenizers.
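To give a sense of what a hand-rolled approach involves, here is a minimal sketch using only Python's built-in `re` module. The helper name `simple_tokenize` is hypothetical and is not part of `nltk`. It assumes that dropping apostrophes and keeping runs of ASCII letters is good enough, which will not hold for every text.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: lowercase, drop apostrophes, keep runs of letters."""
    text = text.lower().replace("'", "")
    # Every maximal run of a-z becomes one token; anything else acts as a separator.
    return re.findall(r"[a-z]+", text)

lines = ["Two wrongs don't make us right away\n", "Tell me something's wrong\n"]
tokens = simple_tokenize(" ".join(lines))
print(tokens)
```

For these two lines the sketch reproduces the token list shown above, but it would silently discard digits and accented characters, which is one reason to prefer a tested tokenizer.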

**_NOTE:_** If you encounter an error related to missing packages when using `nltk`, it may require additional dependencies to be installed. Follow the instructions provided in the error message to install the required packages, and then rerun the cell.

### Preparing for Tokenization

Before tokenizing the data, we need one more cleaning step. Python compares strings character by character, so unless we normalize the text first, we will run into two problems:

- Counting non-word elements: Some text may contain notes or annotations that are not part of the actual lyrics, such as `"[Kendrick Lamar:]"`. These elements should be removed to avoid incorrect word counts.
- Punctuation and capitalization: Python interprets words with different capitalization or punctuation as unique, leading to separate counts. To ensure accurate counting, we need to remove punctuation and convert all words to lowercase.
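The second problem is easy to see on a small example. The sketch below uses a made-up line (not taken from the corpus) and `collections.Counter` to compare counts before and after a simple normalization.

```python
from collections import Counter

# Made-up line for illustration; not taken from the song files.
raw_words = "Baby I want you baby, I need you Baby".split()

# Without cleaning, 'Baby' and 'baby,' are counted as different words.
raw_counts = Counter(raw_words)
print(raw_counts["Baby"], raw_counts["baby,"], raw_counts["baby"])

# Lowercasing and stripping the comma merges them into a single key.
normalized_words = [word.lower().strip(",") for word in raw_words]
normalized_counts = Counter(normalized_words)
print(normalized_counts["baby"])
```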

To address these concerns, we'll perform a manual cleaning step before tokenizing the songs. In the following cell, write a function that performs the following tasks:

- Remove lines that contain bracketed annotations, such as `[Kendrick Lamar:]`.
- Join the list of strings into a single string representing the entire song.
- Remove newline characters (`\n`).
- Remove the following punctuation marks: `,.'?!()`
- Convert all words to lowercase.

Test the function using `test_song` to demonstrate that it successfully removes `'[Kendrick Lamar:]'` and other artist names, returns the song as a single string (not a list of strings), removes newlines and punctuation, and converts all words to lowercase.

In [None]:
def clean_song(song):
    pass

clean_test_song = clean_song(test_song)
print(clean_test_song)

In [None]:
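def clean_song(song):
    # Drop annotation lines such as '[Kendrick Lamar:]'
    clean_lines = [line for line in song if "[" not in line and "]" not in line]
    # Join the remaining lines into a single string
    cleaned = " ".join(clean_lines)
    # Strip the punctuation marks listed above
    for symbol in ",.'?!()":
        cleaned = cleaned.replace(symbol, "")
    # Replace newlines with spaces, then lowercase everything
    cleaned = cleaned.replace("\n", " ")
    return cleaned.lower()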
    
clean_test_song = clean_song(test_song)
print(clean_test_song)

two wrongs dont make us right away  tell me somethings wrong  party all of our lives away  to take you on  oh baby i want you  baby i need you  i wanna see you  baby i wanna go out yeah  baby i wanna go out yeah  baby i want you  baby i need you  i wanna see you  baby i wanna go out yeah  baby i wanna go out yeah  all night all night all night  all night  your bodys on fire  and your drinks on ice  all night all night all night  all night  your bodys on fire  and your drinks on ice  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  oh my word oh my gosh oh my word oh my gosh  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  everybody say kikiritikiki kikiritikiki  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudluke  ungbambe ungdedele ungbhasobhe unggudlu

Great! Now, we can use `nltk`'s `word_tokenize()` function on the song string to get a fully tokenized version of the song. Test this function on `clean_test_song` to ensure that the function works. 

In [None]:
tokenized_test_song = None

In [None]:
tokenized_test_song = word_tokenize(clean_test_song)
tokenized_test_song[:10]

### Vectorizing Text: Count Vectorization

To enable machine learning algorithms to process text, we need to convert it into a numerical representation. Vectorization is the process of representing text as vectors or matrices, where each element corresponds to a word in the vocabulary.

Count vectorization converts a text into a vector with one element per unique word in the vocabulary. The vector's length therefore equals the size of the vocabulary, which in practice is usually the set of unique words in our corpus rather than every word in the English language. Each sentence is represented by such a vector, where the value of each element is the number of times that word occurs in the sentence.

For example, the sentence "I scream, you scream, we all scream for ice cream" can be represented as a vector:

| 'aardvark' | 'apple' | [...] | 'I' | 'you' | 'scream' | 'we' | 'all' | 'for' | 'ice' | 'cream' | [...] | 'xylophone' | 'zebra' |
|:----------:|:-------:|:-----:|:---:|:-----:|:--------:|:----:|:-----:|:-----:|:-----:|:-------:|:-----:|:-----------:|:-------:|
|      0     |    0    |   0   |  1  |   1   |     3    |   1  |   1   |   1   |   1   |    1    |   0   |      0      |    0    |

This is known as a sparse representation, because most elements in the vector are 0. A non-zero value gives the number of times the word occurs in the sentence (3 for 'scream', 1 for 'ice'), while words that do not occur have a value of 0.
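As a rough sketch of how such a vector could be built, the snippet below counts each vocabulary word in the example sentence. The four extra vocabulary words stand in for the rest of a much larger vocabulary, and the alphabetical ordering is an arbitrary choice made for this illustration.

```python
# The example sentence, already lowercased and stripped of punctuation.
sentence_tokens = "i scream you scream we all scream for ice cream".split()

# Toy vocabulary: the sentence's words plus a few words that never occur in it.
vocabulary = sorted(set(sentence_tokens) | {"aardvark", "apple", "xylophone", "zebra"})

# One element per vocabulary word, holding that word's count in the sentence.
vector = [sentence_tokens.count(word) for word in vocabulary]

for word, count in zip(vocabulary, vector):
    print(word, count)
```

Calling `list.count` once per vocabulary word is fine at this size, though a single pass with a dictionary scales better for real corpora.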

Alternatively, we can represent the sentence as a dictionary of word frequency counts:

```python
BoW = {
    'I': 1,
    'you': 1,
    'scream': 3,
    'we': 1,
    'all': 1,
    'for': 1,
    'ice': 1,
    'cream': 1
}
```

Both representations exemplify count vectorization, allowing us to capture the frequency of each word and represent sentences as vectors.

It's important to note that count vectorization using the Bag of Words approach does not preserve the order of words in the sentence. Sentences containing the same words with the same frequencies will result in identical vectors, even if their meanings are different.
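A short illustration of this loss of word order, using two made-up sentences and `collections.Counter` as a stand-in for a bag of words:

```python
from collections import Counter

# Two made-up sentences with the same words in a different order.
first = "dog bites man".split()
second = "man bites dog".split()

bag_first = Counter(first)
bag_second = Counter(second)

# The token lists differ, but the bags of words are indistinguishable.
print(first == second)
print(bag_first == bag_second)
```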

In the cell below, create a function that takes in a tokenized, cleaned song and returns a count vectorized representation of it as a Python dictionary.

**_Hint:_**  Consider using a `set()` since you'll need each unique word in the tokenized song! 

In [None]:
def count_vectorize(tokenized_song):
    pass

test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)

In [None]:
def count_vectorize(tokenized_song):
    unique_words = set(tokenized_song)

    song_dict = {word:0 for word in unique_words}

    for word in tokenized_song:
        song_dict[word] += 1

    return song_dict

test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)

### TF-IDF Vectorization: Summarizing Document Contents

TF-IDF (Term Frequency-Inverse Document Frequency) is a more refined form of vectorization that weights each term by how often it appears in a document and how rare it is across the corpus. The highest-weighted terms act as key words that summarize the document's contents.

TF-IDF considers the term frequency, which is the frequency of a term in a document, and the inverse document frequency, which measures the rarity of the term across all documents. A term that appears frequently in many documents is deemed less unique and less informative for distinguishing the document. On the other hand, a term that occurs frequently within a document but rarely in the rest of the corpus is considered more important in representing the document.

The TF-IDF formula calculates the weight of each term by multiplying its term frequency by its inverse document frequency. Count vectorization, as demonstrated earlier, already gives us the raw counts that term frequency is built on.
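Written out, and assuming the common convention that term frequency is a term's count divided by the length of the document (one of several definitions in use), the weight of term $t$ in document $d$ is:

$$\text{TF}(t, d) = \frac{\text{Number of times } t \text{ appears in } d}{\text{Total number of terms in } d}$$

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$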

TF-IDF vectorization enables us to capture the significance of terms within a document, highlighting their importance in representing the document's unique characteristics.

In the following cell, write a function that takes a list of tokenized songs as input. Each item in the list should be a clean, tokenized version of a song. The function should return a dictionary containing the IDF value for each word.

Because the IDF (Inverse Document Frequency) calculation requires every document in the corpus, not just an individual document, we will postpone testing this function for now.

The IDF formula is:

$$\text{IDF}(t) =  \log_e\left(\frac{\text{Total Number of Documents}}{\text{Number of Documents with } t \text{ in it}}\right)$$
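To get a feel for the numbers this formula produces, the sketch below plugs in two hypothetical document counts for a 20-document corpus like ours. The counts are invented for illustration and do not come from the song files.

```python
import math

total_documents = 20  # size of this lab's corpus

# Hypothetical document counts, chosen only for illustration.
idf_common = math.log(total_documents / 20)  # term appears in every document
idf_rare = math.log(total_documents / 2)     # term appears in only 2 documents

print(idf_common)
print(round(idf_rare, 3))
```

A term found in every document gets an IDF of 0, so it contributes nothing to a TF-IDF weight, while rarer terms get larger values.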

In [None]:
def inverse_document_frequency(list_of_token_songs):
    pass