# Data Clean Up

Looking through the data, I need to check the 'duplicates' and null values and decide whether they are true duplicates/nulls. I also noticed that the Featured column does not list the artist who actually collaborated on the track, so I will need to clear it and pull the featured artist from the Title column where available.

I will need to create a target column based on whether the artists sing their own names/labels in the track. For that I will also need an 'Alias' column: Akon always sings 'Convict' in his tracks, Nicki Minaj 'Young Money', Lil' Wayne 'Weezy', etc.
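
One way the alias lookup and the target flag could fit together is sketched below. The `ALIASES` dictionary, the `sings_own_name` helper, and the artist-name spellings used as keys are illustrative assumptions, not the final columns; the alias values simply mirror the examples above.

```python
import re

# Hypothetical alias table: artist -> names or label tags they might say on a track.
# A real table would need to be curated by hand.
ALIASES = {
    "Akon": ["Convict"],
    "Nicki Minaj": ["Young Money"],
    "Lil Wayne": ["Weezy"],
}


def sings_own_name(artist, lyrics, aliases=ALIASES):
    """Return 1 if the artist's name or one of their aliases appears in the lyrics, else 0."""
    candidates = [artist] + aliases.get(artist, [])
    for name in candidates:
        # Whole-word, case-insensitive match so a short alias is not matched inside a longer word
        if re.search(r"\b" + re.escape(name) + r"\b", lyrics, flags=re.IGNORECASE):
            return 1
    return 0


print(sings_own_name("Nicki Minaj", "Drank Young Money"))
print(sings_own_name("Akon", "Nothing to see here"))
```

With the DataFrame from this notebook, this could be applied row by row, for example `df.apply(lambda r: sings_own_name(r['Artist'], r['lyrics']), axis=1)`.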

### Imports

In [1]:
import pandas as pd
import numpy as np

import re # regular expressions, used for the title and lyrics cleanup further down
import time
import lyricsgenius # lyricsgenius library to use with the Genius API
from requests.exceptions import Timeout # To avoid timeout errors during pull

In [2]:
df = pd.read_csv('./data/gathering_lyrics.csv', index_col = 0)

In [3]:
df.drop('Unnamed: 0.1', axis = 1, inplace = True) # Leftover index column; I did the data gathering with two notebooks to make the process faster.

In [4]:
df.head()

Unnamed: 0,Artist,Featured,Title,release_year,lyrics
0,Jason Derulo,Jason Derulo,Swalla,2017-02-24,[Intro: Nicki Minaj]\nDrank\nYoung Money\n\n[V...
1,Jason Derulo,Jason Derulo,Talk Dirty,2013-08-02,"[Intro: Jason Derulo & Rie Abe]\n(Jason, haha\..."
2,Jason Derulo,Jason Derulo,Wiggle,2014-06-06,"[Intro: Jason Derulo & Snoop Dogg]\nAyo, Jason..."
3,Jason Derulo,Jason Derulo,Trumpets,2013-11-07,[Chorus]\nEvery time that you get undressed\nI...
4,Jason Derulo,Jason Derulo,Tip Toe,2017-11-10,[Intro: Jason Derulo & Soaky Siren]\nDerulo\nW...


In [5]:
df.dtypes

Artist          object
Featured        object
Title           object
release_year    object
lyrics          object
dtype: object

### Filling in Lyrics as needed
Running the lyrics-gathering function once more; this time I will feed it the artist name and song title in the hope of getting the missing lyrics.

In [6]:
def fill_lyrics(artist, title, year): # artist and title to search for, plus the release year to store
    genius = lyricsgenius.Genius(API) # API is the Genius access token, assumed to be defined elsewhere
    song = genius.search_song(title, artist) # get song based on title and artist
    mask = (df['Artist'] == artist) & (df['Title'] == title) # rows matching this artist and title
    if song is not None:
        df.loc[mask, ['lyrics', 'release_year']] = [song.lyrics, year] # using .loc to replace the lyrics and year with the gathered values
    else:
        df.loc[mask, 'lyrics'] = np.nan # replace the lyrics with NaN if nothing is gathered
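
The `Timeout` import suggests the Genius requests can time out mid-pull. A minimal sketch of retrying a lookup is below. `with_retries` is a hypothetical helper, and `flaky_lookup` only simulates a search so the example runs without an API token. In the notebook, the wrapped call would presumably be something like `genius.search_song(title, artist)` with `exceptions=(Timeout,)`.

```python
import time


def with_retries(func, *args, retries=3, wait=0.0, exceptions=(TimeoutError,)):
    """Call func(*args), retrying up to `retries` times on the given exceptions.

    Returns None if every attempt fails, which lines up with the "no song found" case.
    """
    for _ in range(retries):
        try:
            return func(*args)
        except exceptions:
            time.sleep(wait)  # brief pause before trying again
    return None


# Stand-in for an API call that times out twice before succeeding
calls = {"count": 0}


def flaky_lookup(title, artist):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"lyrics for {title} by {artist}"


result = with_retries(flaky_lookup, "After You're Gone", "The Proclaimers")
print(result)
```

Returning `None` on failure keeps the caller's `if song is not None` check unchanged.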


In [7]:
df.isnull().sum()

Artist             0
Featured           0
Title              0
release_year    1901
lyrics             4
dtype: int64

The release_year values are in yyyy-mm-dd format, but I only want the year. I will try to fill in the missing years where I can as I fill in the lyrics.

In [8]:
df['release_year'] = df['release_year'].str[:4] # slicing the strings in release_year to keep only the first four characters (the year)
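
A small sketch of pulling a four-digit year out of the mixed values seen in this column (full dates, float-like strings such as `2014.0`, and missing values). `extract_year` is a hypothetical helper, not part of the notebook.

```python
import math
import re


def extract_year(value):
    """Return the leading four-digit year as a string, or None if there is none."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    match = re.match(r"\d{4}", str(value).strip())
    return match.group() if match else None


print(extract_year("2017-02-24"))
print(extract_year("2014.0"))
print(extract_year(float("nan")))
```

It could be applied with `df['release_year'].map(extract_year)`. Unlike a plain `.str[:4]` slice, it gives a missing value rather than a partial string when the value does not start with four digits.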

#### Checking Nulls

In [9]:
df.loc[df['lyrics'].isnull()]
# Elastic Heart already exists
# Yellow Flicker Beat already exists
# The Proclaimers should have lyrics
# Pharrell Burger haha he has his own Signature Burger in Tokyo...

Unnamed: 0,Artist,Featured,Title,release_year,lyrics
2052,Sia,Sia,Elastic Heart (Video Breakdown),,
2378,Lorde,Lorde,“Yellow Flicker Beat” Single Art,2014.0,
4866,The Proclaimers,The Proclaimers,After You’re Gone,,
5421,Pharrell Williams,Pharrell Williams,Pharrell Burger,2014.0,


Using the function above to fill in the year and lyrics for The Proclaimers; checking Genius, the lyrics do exist.

In [10]:
fill_lyrics("The Proclaimers", "After You’re Gone", 2012) # filling the one row that should have lyrics

Searching for "After You’re Gone" by The Proclaimers...
Done.


In [11]:
df.loc[(df['Artist'] == "The Proclaimers") & (df['Title'] == "After You’re Gone")] # Looks Good

Unnamed: 0,Artist,Featured,Title,release_year,lyrics
4866,The Proclaimers,The Proclaimers,After You’re Gone,2012,The love you leave\nWill be there after you're...


Dropping the rest of the null values in lyrics

In [12]:
df.drop(df.loc[df['lyrics'].isnull()].index, inplace=True)

In [13]:
df.shape

(11698, 5)

In [14]:
# df[df['Title'].str.contains("feat.")]

In [105]:
df[df.duplicated('lyrics', keep = False)].sort_values('lyrics')

Unnamed: 0,Artist,Featured,Title,release_year,lyrics
95,Jason Derulo,Jason Derulo,Red Card,1996.0,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
2694,Florida Georgia Line,Florida Georgia Line,Rooftop,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
2690,Florida Georgia Line,Florida Georgia Line,Turnt,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
6332,Shawn Mendes,Shawn Mendes,Always Been You,2020.0,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
6161,Rick Astley,Rick Astley,Every One of Us*,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
9867,Kehlani,Kehlani,Act Up*,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
10374,ZAYN,ZAYN,Dragonfly,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
10376,ZAYN,ZAYN,Night and Day,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
10383,ZAYN,ZAYN,Roses,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n
10398,ZAYN,ZAYN,Windowsill,,\n Lyrics for this song have yet to be released. Please check back once the song has been released.\n \n


Of the 441 duplicates, 73 are songs whose lyrics have not yet been released or are unavailable. Another 159 rows have 'Instrumental' as their lyrics; these come from artists like Deadmau5, Dirty South, and Armin van Buuren, who work in techno/electronic genres.
The rest of the rows marked as duplicates were true duplicates. Fischerspooner, for example, had 10 songs with identical lyrics because they are different remixes of the same track.

After that, I drop another 81 rows where the lyrics are less than 120 characters long, which catches some odd bits that were pulled.
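
The kinds of unusable rows described above could be labelled before dropping them, which makes the counts easy to double-check. This is a sketch only: `lyrics_status` is a hypothetical helper, and the exact placeholder wording and the `Instrumental` marker are assumptions based on the rows shown in this notebook.

```python
PLACEHOLDER = "Lyrics for this song have yet to be released"
MIN_LENGTH = 120  # same threshold as the length filter used in this notebook


def lyrics_status(text):
    """Classify a lyrics string as 'unreleased', 'instrumental', 'too_short' or 'ok'."""
    cleaned = text.strip()
    if PLACEHOLDER.lower() in cleaned.lower():
        return "unreleased"
    if cleaned.lower().strip("[]()") == "instrumental":
        return "instrumental"
    if len(cleaned) < MIN_LENGTH:
        return "too_short"
    return "ok"


print(lyrics_status("Instrumental"))
print(lyrics_status("la la"))
```

On the DataFrame, `df['lyrics'].map(lyrics_status).value_counts()` would give a per-category tally to compare against the numbers above.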

In [158]:
df.drop_duplicates(subset='lyrics', keep="first", inplace=True) # Keeping first instance and dropping the rest

In [159]:
df.drop(df.loc[df['lyrics'].str.len() < 120].index, inplace=True) # Catching any lyrics that are less than 120 characters

In [169]:
df.drop(df.loc[(df['Artist'] == 'Jason Derulo') & (df['Title'] == 'Red Card')].index, inplace=True) # One very stubborn row, has no lyrics

In [191]:
df['lyrics'][10]

' Girl, ladies, let your hurr down (Ooh, yeah) Let your hurr down Wes about to get down (Ooh, yeah)   Get ugly Diddily, diddily, diddily, diddily (Ooh) Diddily, diddily, diddily, diddily (Yeah) Diddily, diddily, diddily, diddily, get ugly, babe (Ooh) Diddily, diddily, diddily, diddily, get ugly Diddily, diddily, diddily, diddily (Ooh) Youre too sexy to me, sexy to me Diddily, diddily, diddily, diddily (Yeah) Youre too sexy to me, sexy to me Diddily, diddily, diddily, diddily (Ooh) So sexy Diddily, diddily, diddily, diddily Damn, thats ugly   Bruh, I cant, I cant even lie Im about to be that guy Someone else gon have to drive me home (Home) La la la Bang-a-rang-rang, bang-a-ring-a-rang-rang Bass in the trunk, vibrate that thang Do your thang, thang, girl, do that thang Like la la la   To them pretty facety girls tryna impress each other (Tryna impress) And them undercover freaks who aint nothin but trouble (Undercover, baby, yeah) Baby, Ima tell you some only cause I love ya People all 

#### String Cleanup
I need to clean up the characters in the lyrics column. Where there was a line break we now have \n. There are also section markers such as [Verse], [Chorus], [Intro] and so on that need to be removed.

In [190]:
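df['lyrics'] = df['lyrics'].str.replace('\n', ' ', regex=False).str.replace("'", "", regex=False) # line breaks become spaces, apostrophes are dropped
df['lyrics'] = df['lyrics'].str.replace(r'\[[^\]]*\]', '', regex=True) # remove section tags like [Verse 1]; regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default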
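
To see what this cleanup does on a single string, here is a minimal sketch using the standard `re` module. `clean_lyrics` is a hypothetical helper; its pattern uses `[^\]]*` so that each bracketed tag is matched on its own rather than everything between the first `[` and a later `]`. Collapsing repeated spaces is an extra step that is not in the notebook.

```python
import re

SECTION_TAG = re.compile(r"\[[^\]]*\]")  # e.g. [Verse 1], [Chorus], [Intro: Nicki Minaj]


def clean_lyrics(text):
    """Remove section tags, line breaks and apostrophes, then collapse extra spaces."""
    text = text.replace("\n", " ").replace("'", "")
    text = SECTION_TAG.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "[Intro: Nicki Minaj]\nDrank\nYoung Money\n\n[Verse 1]\nI can't lie"
print(clean_lyrics(sample))
```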

## TO DO:

- Alias Column?
- Target Column
- Balance
- Popularity? Explicitness?


#### Fixing the Featured column
Currently the Featured column just repeats the main artist's name instead of naming the featured artist.

In [None]:
# Matches "(feat. X)", "[feat X]" or "(with X)" in a title and captures X
feat_pattern = r'\s*[\(\[](?:feat\.?|with)\s+([^\)\]]*)[\)\]]'

feat = []
for title in df['Title']:
    match = re.search(feat_pattern, title, re.I)
    feat.append(match.group(1).strip() if match else '') # empty string when no feature is listed

# Remove the "(feat. X)" tag from the title itself
df['Title'] = df['Title'].str.replace(feat_pattern, '', flags=re.I, regex=True).str.strip()
df.insert(1, 'featuring', feat)
#https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns?page=1&tab=votes#tab-top
# Thanks Brett for Regex Help
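
The same idea can be tried on individual titles before touching the DataFrame. This is a sketch under assumptions: `split_featured` is a hypothetical helper and the sample titles are made up for illustration, not rows from the data.

```python
import re

# Matches "(feat. X)", "[feat X]" or "(with X)" and captures X
FEAT_TAG = re.compile(r"\s*[\(\[](?:feat\.?|with)\s+([^\)\]]*)[\)\]]", re.IGNORECASE)


def split_featured(title):
    """Return (title without the tag, featured artist or '')."""
    match = FEAT_TAG.search(title)
    if match is None:
        return title, ""
    return FEAT_TAG.sub("", title).strip(), match.group(1).strip()


print(split_featured("Swalla (feat. Nicki Minaj & Ty Dolla $ign)"))
print(split_featured("Trumpets"))
```

Because `with` is also an ordinary word, a title such as "Stay (With Me)" would be misread as a feature, so the extracted values are worth a manual check.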

In [None]:
skip = []
for row in df.values:
    if row[7] in (row[0]) or row[7] in (row[2]):
        skip.append(1)
    else:
        skip.append(0)

In [17]:
pd.set_option('display.max_rows', None)
pd.options.display.max_colwidth = 300