In [1]:
import numpy as np
import pandas as pd
import psycopg2 as psy
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
import pickle as pic

%matplotlib inline

# From Language to Lyrics

## Introduction

Human beings have spent millennia attempting to express themselves through poetry and music. Often, emotions and thoughts find their purest form when they are funneled through a creative outlet. However, although everyone can appreciate a well-formed metaphor or an apt phrase when they hear or see it, not everyone is capable of creating at the same level. If that's the case, what is one to do? If only there were a way to translate a mundane phrase into a lyrical snippet from your favorite eloquent artist to describe how you really feel.


**ENTER** From Language to Lyrics: a recommendation engine for matching an inputted phrase to a set of lyrics that best embodies the meaning of that phrase.

On a more serious note, translation and information retrieval are crucial to providing users with relevant information and/or transforming that information into an understandable format. My love of music and lyricism, combined with the practical value of deeply understanding information retrieval tasks, led me to create this lyrics recommendation engine.

## Getting the Data

For my primary dataset, I chose genius.com, a lyrics site that contains user-created explanations of what those lyrics mean. This was an ideal dataset for this problem for a number of reasons:

1. They have a readily accessible and actively maintained API where one can query for artist, song, lyrics, and annotation (user explanations of lyrics) information
2. The lyrics, through their connection to the annotations, are broken down into easily digestible chunks, usually between 2 and 8 lines. This is convenient because when translating a regular statement into lyrics, a user likely wants a result of similar length instead of an entire song that they need to parse through
3. Perhaps most importantly, the fact that almost all pieces of lyrics have annotations means that there is an **explanation** of what those lyrics mean. The great (and terrible, from an information retrieval perspective) part about lyrics is that they are often metaphorical or roundabout in nature. This means that traditional IR methods may not be effective in returning the most relevant results for a user. The annotation provides a more explicit description of what a given set of lyrics means, which helps get the user better results

To obtain the data, I first created a list of artists that combined the artists behind the top-selling hip-hop albums (per Complex magazine) and the most critically acclaimed hip-hop songs of the past 5 years from Billboard. I then used this list of artists to query the Genius API for all the lyrical snippets (from here on referred to as referents), annotations, and associated metadata from their entire corpus of lyrics, as well as some general information about the songs themselves. I stored this information in a SQL database, using the Genius-created IDs to link across songs, referents, and annotations.
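
To make the linking across songs, referents, and annotations concrete, here is a minimal sketch of what such a schema could look like. The table names, column names, and sample rows are all hypothetical placeholders, not the actual schema or data. The sketch uses Python's built-in `sqlite3` with an in-memory database so it runs anywhere, whereas the notebook itself imports `psycopg2` for PostgreSQL.

```python
import sqlite3

def build_schema(conn):
    """Create three linked tables (hypothetical names and columns)."""
    conn.executescript("""
        CREATE TABLE songs (
            song_id INTEGER PRIMARY KEY,   -- ID assigned by Genius
            title   TEXT NOT NULL,
            artist  TEXT NOT NULL
        );
        CREATE TABLE referents (
            referent_id INTEGER PRIMARY KEY,   -- ID assigned by Genius
            song_id     INTEGER NOT NULL REFERENCES songs(song_id),
            fragment    TEXT NOT NULL          -- the lyrical snippet
        );
        CREATE TABLE annotations (
            annotation_id INTEGER PRIMARY KEY, -- ID assigned by Genius
            referent_id   INTEGER NOT NULL REFERENCES referents(referent_id),
            body          TEXT NOT NULL        -- the user explanation
        );
    """)

def fetch_joined(conn):
    """Return (artist, title, fragment, annotation body) for every annotation."""
    return conn.execute("""
        SELECT s.artist, s.title, r.fragment, a.body
        FROM annotations AS a
        JOIN referents AS r ON r.referent_id = a.referent_id
        JOIN songs AS s ON s.song_id = r.song_id
        ORDER BY a.annotation_id
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_schema(conn)

# Placeholder rows for illustration only; not real Genius data.
conn.execute("INSERT INTO songs VALUES (?, ?, ?)",
             (1, "Example Song", "Example Artist"))
conn.execute("INSERT INTO referents VALUES (?, ?, ?)",
             (10, 1, "placeholder lyric fragment"))
conn.execute("INSERT INTO annotations VALUES (?, ?, ?)",
             (100, 10, "placeholder explanation"))

rows = fetch_joined(conn)
print(rows)
```

Keeping the IDs from the source as primary keys means each table can be loaded independently and joined afterwards, which is one reasonable way to read "using the Genius-created IDs to link" above.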

## Cleaning/Exploring the Data

Because I pulled the data from the Genius API, the data itself did not require much cleaning before I began working with it. I did remove Unicode characters that couldn't be encoded as ASCII so that I would have an easier time splitting documents and creating vector representations. The sections that follow show some preliminary exploratory analysis.
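
The ASCII cleaning step can be sketched roughly as follows. This is an illustrative version under the assumption that non-encodable characters were simply dropped; the function name `strip_non_ascii` is hypothetical and not taken from the project code.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters outside the ASCII range.
    return text.encode("ascii", errors="ignore").decode("ascii")

# A curly apostrophe (U+2019) and an accented e (U+00E9) are both removed.
sample = "It\u2019s a caf\u00e9"
cleaned = strip_non_ascii(sample)
print(cleaned)  # Its a caf
```

Note the trade-off: dropping characters is simple but lossy, since accented letters and curly quotes vanish entirely. Normalizing with `unicodedata.normalize("NFKD", text)` before encoding would keep the base letters of accented characters, if that matters for the downstream vectors.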

### Distribution of Referent Length

<img src="./Images/ref_length.png">

### Annotations by State

<img src="./Images/ann_state.png">

### Annotation Comment Count

<img src="./Images/ann_comment_count.png">

### Annotation Length (Words)

<img src="./Images/ann_length.png">