# Data Clean Up

Looking through the data, I need to resolve the 'duplicates' and null values and evaluate whether they are true duplicates/nulls. I also noticed that the featured column does not list the artist who collaborated on the track; it holds the raw API metadata (a stringified list of dictionaries) instead, so I will need to extract the artist names and replace the current value. The same goes for media, where I want to extract the Spotify ID in case I want to pull audio features later.

I will need to create a target column based on whether the artists sing their own names/labels in the track. I will also need to create an 'Alias' column: Akon will always sing 'Convict' in his tracks, Nicki 'Young Money', Lil' Wayne 'Weezy', etc.
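
One way the alias lookup and target flag might look is sketched below. The `ALIASES` table holds only the three examples named above; the function name `mentions_self` and the matching rule (case-insensitive, whole phrase) are assumptions for illustration, not the final design.

```python
import re

# Hypothetical alias table: only the three examples mentioned above.
ALIASES = {
    "Akon": ["Convict"],
    "Nicki Minaj": ["Young Money"],
    "Lil Wayne": ["Weezy"],
}

def mentions_self(artist, lyrics, aliases=ALIASES):
    """Return 1 if the artist's name or a known alias appears in the lyrics, else 0."""
    candidates = [artist] + aliases.get(artist, [])
    for phrase in candidates:
        # \b keeps 'Weezy' from matching inside a longer word
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, lyrics, flags=re.IGNORECASE):
            return 1
    return 0

print(mentions_self("Nicki Minaj", "Drank Young Money"))   # 1
print(mentions_self("Akon", "No names dropped here"))      # 0
```

On the dataframe this could be applied row by row, for example `[mentions_self(a, l) for a, l in zip(df['artist'], df['lyrics'])]`.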

### Imports

In [1]:
import pandas as pd
import numpy as np

import time   
import lyricsgenius # lyricsgenius library to use with the Genius API
from requests.exceptions import Timeout # To avoid timeout errors during pull

In [2]:
df = pd.read_csv('./data/gathering2_lyrics.csv')

In [3]:
df.head()

Unnamed: 0,artist,featured,title,media,release_year,lyrics
0,Jason Derulo,"[{'api_path': '/artists/25005', 'header_image_...",Swalla,"[{'provider': 'youtube', 'start': 0, 'type': '...",2017-02-24,[Intro: Nicki Minaj]\nDrank\nYoung Money\n\n[V...
1,Jason Derulo,"[{'api_path': '/artists/14325', 'header_image_...",Talk Dirty,"[{'provider': 'youtube', 'start': 0, 'type': '...",2013-08-02,"[Intro: Jason Derulo & Rie Abe]\n(Jason, haha\..."
2,Jason Derulo,"[{'api_path': '/artists/46', 'header_image_url...",Wiggle,[{'native_uri': 'spotify:track:2sLwPnIP3CUVmIu...,2014-06-06,"[Intro: Jason Derulo & Snoop Dogg]\nAyo, Jason..."
3,Jason Derulo,[],Trumpets,[{'native_uri': 'spotify:track:5KONnBIQ9LqCxye...,2013-11-07,[Chorus]\nEvery time that you get undressed\nI...
4,Jason Derulo,"[{'api_path': '/artists/1583', 'header_image_u...",Tip Toe,[{'native_uri': 'spotify:track:2z4pcBLQXF2BXKF...,2017-11-10,[Intro: Jason Derulo & Soaky Siren]\nDerulo\nW...


Checking the `describe` output for the data and adding a null-count column for quick viewing.

In [4]:
desc = df.describe() # assign describe to variable
add_null = pd.concat([df.isnull().sum().rename('Nulls'),desc.T],axis=1)

In [5]:
add_null

Unnamed: 0,Nulls,count,unique,top,freq
artist,0,11890,130,Nicki Minaj,100
featured,0,11890,1899,[],8784
title,0,11890,11000,Intro,17
media,0,11890,10540,[],1340
release_year,1920,9970,2786,2016-03-25,42
lyrics,3,11887,11640,[Instrumental],62


### Filling in Lyrics as needed
Running the lookup function once more; this time I will feed it the artist name and song title in the hope of getting the lyrics.

In [6]:
def fill_lyrics(artist, title, year): # artist and title identify the song; year is written to release_year
    genius = lyricsgenius.Genius(API) # API: Genius access token, assumed to be defined elsewhere
    song = genius.search_song(title, artist) # get song based on title and artist
    mask = (df['artist'] == artist) & (df['title'] == title) # column names in df are lowercase
    if song is not None:
        df.loc[mask, ['lyrics', 'release_year']] = [song.lyrics, year] # using .loc to replace the lyric with gathered lyrics
    else:
        df.loc[mask, 'lyrics'] = np.nan # replace the lyric with NaN if nothing is gathered
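
Since `time` and `Timeout` are imported to guard against timeouts during the pull, a retry wrapper along these lines could sit around the lookup. This is only a sketch: `call_with_retry` and `flaky_fetch` are hypothetical names, and the demo uses the built-in `TimeoutError` so it runs without network access. For the real call, one would presumably pass `exceptions=(Timeout,)` and wrap `genius.search_song(title, artist)` in a small function.

```python
import time

def call_with_retry(fetch, retries=3, wait=1.0, exceptions=(TimeoutError,)):
    """Call fetch() and retry after a pause if it raises one of `exceptions`."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except exceptions:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(wait)

# Demo with a stand-in fetcher that times out twice before succeeding.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "lyrics text"

result = call_with_retry(flaky_fetch, retries=3, wait=0.0)
print(result, calls["n"])   # lyrics text 3
```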


The release dates are in yyyy-mm-dd format. I only want the year, and I will try to fill in the missing values where I can as I fill in the lyrics.

In [7]:
df['release_year'] = df['release_year'].str[:4] # splitting the strings in release_year value to only give me the year
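
As a small illustration of the year extraction, the sketch below compares the string slice used above with a date-parsing alternative. The two dates are the first two from the `df.head()` output; the variable names and the choice of the nullable `Int64` dtype are assumptions.

```python
import pandas as pd

# Toy stand-in for the release_year column, with one missing value
dates = pd.Series(["2017-02-24", "2013-08-02", None])

# String slicing, as in the cell above: keeps the first four characters
year_str = dates.str[:4]

# Alternative: parse as dates, then take the year; missing values stay missing
year_int = pd.to_datetime(dates, errors="coerce").dt.year.astype("Int64")

print(year_str.tolist())
print(year_int.tolist())
```

The slice keeps the years as strings, while the parsed version gives integers that sort and compare numerically.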

#### Checking Nulls

In [8]:
df.loc[df['lyrics'].isnull()]
# Elastic Heart already exists
# Yellow Flicker Beat already exists
# Pharrell Burger: haha, he has his own signature burger in Tokyo...

Unnamed: 0,artist,featured,title,media,release_year,lyrics
2052,Sia,"[{'api_path': '/artists/49361', 'header_image_...",Elastic Heart (Video Breakdown),"[{'provider': 'youtube', 'start': 0, 'type': '...",,
2378,Lorde,[],“Yellow Flicker Beat” Single Art,[],2014.0,
5321,Pharrell Williams,[],Pharrell Burger,[],2014.0,


Dropping the null values in lyrics

In [9]:
df.drop(df.loc[df['lyrics'].isnull()].index, inplace=True)

In [10]:
df.shape

(11887, 6)

In [11]:
df[df.duplicated('lyrics', keep = False)]

Unnamed: 0,artist,featured,title,media,release_year,lyrics
95,Jason Derulo,[],Red Card,[],1996,\n Lyrics for this song h...
155,CAKE,[],Never Gonna Give You Up,[],2007,You look so good to me right now...\nYou reall...
174,CAKE,[],Bound Away,[],2011,I'm an unknown individual in an unattended car...
179,CAKE,[],Teenage Pregnancy,[],2011,[Instrumental]
184,CAKE,[],Conroy,[],2007,[Instrumental]
...,...,...,...,...,...,...
11561,Ava Max,[],My Way (Shew Remix),"[{'attribution': 'iamshew', 'provider': 'sound...",2018,"[Verse 1]\nMy momma use to say\n""Baby make me ..."
11563,Ava Max,[],Dream Away,[],2006,No Lyrics Available Yet
11570,Ava Max,[],Treat Me Like a Lady,[],2006,No Lyrics Available Yet
11571,Ava Max,[],Head & Heart,[],,\n Lyrics for this song h...


Of the 335 duplicates, 63 are songs whose lyrics have not yet been released or are unavailable. Another 69 rows have 'Instrumental' as the lyrics. These come from artists such as Christina Perri, who also records lullabies, from MGMT and Diplo, who make electronic music, and from album intro tracks.
The rest of the tracks marked as duplicates are true duplicates. Fischerspooner had 10 songs that were the same: they are different remixes, but the lyrics are identical.

I then drop another 104 rows where the lyrics are less than 150 characters long, which removes some odd fragments that were pulled.

In [12]:
df.drop_duplicates(subset='lyrics', keep="first", inplace=True) # Keeping first instance and dropping the rest

In [13]:
df.drop(df.loc[df['lyrics'].str.len() < 150].index, inplace=True) # Catching any lyrics that are less than 150 characters
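
The breakdown of duplicates into unavailable lyrics, instrumentals, and true duplicates could be tallied roughly as follows. This is a sketch on a toy frame: the `label` helper and its marker strings are assumptions based on the placeholder texts visible in the output above, and the counts here are not the real ones.

```python
import pandas as pd

# Toy frame; the lyric strings are stand-ins, not rows from the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "lyrics": ["[Instrumental]", "[Instrumental]",
               "No Lyrics Available Yet", "No Lyrics Available Yet",
               "same words", "same words"],
})

# Every row whose lyrics value occurs more than once
dupes = toy[toy.duplicated("lyrics", keep=False)]

def label(text):
    """Rough bucket for a duplicated lyric value (marker strings are assumptions)."""
    if text.strip() == "[Instrumental]":
        return "instrumental"
    if "No Lyrics Available" in text or "Lyrics for this song" in text:
        return "placeholder"
    return "true duplicate"

breakdown = dupes["lyrics"].map(label).value_counts()
print(breakdown)
```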

#### String Cleanup
I need to clean up the characters in the lyrics column. Wherever there was a line break we now have \n. There are also section markers such as [Verse], [Chorus], [Intro] and so on that need to be removed.

In [14]:
df['lyrics'][0]

'[Intro: Nicki Minaj]\nDrank\nYoung Money\n\n[Verse 1: Jason Derulo]\nLove in a thousand different flavors\nI wish that I could taste them all tonight\nNo, I ain\'t got no dinner plans\nSo you should bring all your friends\nI swear that a-all y\'all my type\n\n[Pre-Chorus: Jason Derulo]\nAll you girls in here, if you\'re feeling thirsty\nCome on take a sip \'cause you know what I\'m servin\', ooh\n\n[Chorus: Jason Derulo]\nShimmy shimmy yay, shimmy yay, shimmy ya (Drank)\nSwalla-la-la (Drank)\nSwalla-la-la (Swalla-la-la)\nSwalla-la-la\nShimmy shimmy yay, shimmy yay, shimmy ya (Drank)\nSwalla-la-la (Drank)\nSwalla-la-la (Swalla-la-la)\nSwalla-la-la\nFreaky, freaky gyal\nMy freaky, freaky gyal\n\n[Verse 2: Ty Dolla $ign]\nShimmy shimmy shimmy yay, shimmy yah\nBad girls gon\' swalla-la-la\nBust down on my wrist in this bitch\nMy pinky-ring bigger than his\nMet her out in Beverly Hills, ay\nDolla got too many girls, ay\nMet her out in Beverly Hills\nAll she wear is red bottom heels\nWhen s

In [15]:
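df['lyrics'] = df['lyrics'].str.replace('\n', ' ').str.replace("'", "") # line breaks become spaces, apostrophes are dropped
df['lyrics'] = df['lyrics'].str.replace(r'\[.*?\]', '', regex=True) # remove [Verse], [Chorus], ... markers; regex=True is needed in pandas 2.0+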

In [16]:
df['lyrics'][0]

' Drank Young Money   Love in a thousand different flavors I wish that I could taste them all tonight No, I aint got no dinner plans So you should bring all your friends I swear that a-all yall my type   All you girls in here, if youre feeling thirsty Come on take a sip cause you know what Im servin, ooh   Shimmy shimmy yay, shimmy yay, shimmy ya (Drank) Swalla-la-la (Drank) Swalla-la-la (Swalla-la-la) Swalla-la-la Shimmy shimmy yay, shimmy yay, shimmy ya (Drank) Swalla-la-la (Drank) Swalla-la-la (Swalla-la-la) Swalla-la-la Freaky, freaky gyal My freaky, freaky gyal   Shimmy shimmy shimmy yay, shimmy yah Bad girls gon swalla-la-la Bust down on my wrist in this bitch My pinky-ring bigger than his Met her out in Beverly Hills, ay Dolla got too many girls, ay Met her out in Beverly Hills All she wear is red bottom heels When she back it up, put it on the Snap When she droppin low, put it on the Gram DJ poppin, she gon swallow that Champagne poppin, she gon swallow that   All you girls in 
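
The same cleanup can be expressed as a single function on one string, which makes it easy to try on a sample before running it over the whole column. A minimal sketch: `clean_lyrics` is a hypothetical name, and unlike the cell above it also collapses repeated whitespace, which is an added assumption about the desired result.

```python
import re

def clean_lyrics(text):
    """Strip section headers, line breaks, and apostrophes from one lyric string."""
    text = re.sub(r"\[.*?\]", "", text)       # drop [Verse 1: ...], [Chorus], ...
    text = text.replace("\n", " ")            # line breaks become spaces
    text = text.replace("'", "")              # "ain't" -> "aint", as in the cell above
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

sample = ("[Intro: Nicki Minaj]\nDrank\nYoung Money\n\n"
          "[Verse 1: Jason Derulo]\nNo, I ain't got no dinner plans")
print(clean_lyrics(sample))
# Drank Young Money No, I aint got no dinner plans
```

It could then be applied with `df['lyrics'].map(clean_lyrics)`.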

## TO DO:
- Fix Featured
- Fix media & rename column
- Alias Column?
- Target Column
- Balance
- Popularity? Explicitness?


#### Fixing the Featured column
Currently the featured column holds a stringified list of dictionaries when a track has featured artists and [] when no additional artists are on the track. I am going to extract the names from it and refill the column where possible.

In [17]:
df.reset_index(inplace=True)

In [18]:
import regex as re # third-party 'regex' package, aliased so the calls mirror the standard 're' API

In [19]:
feat = []
for values in df['featured']:
    if values != '[]':
        search = re.findall(r'name(.+?),', values, re.I) # capture the text between each 'name' key and the next comma
        res = str(search)
        feat.append(re.sub(r'([][:\'\"])', '', res).replace(',', ' & ').strip()) # strip brackets, colons and quotes; join multiple names with ' & '
    else:
        feat.append('') # no featured artists on this track

#https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns?page=1&tab=votes#tab-top
# Thanks Brett & Alex for Regex Help & John for Workarounds to a dictionary that was not a dictionary.
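
An alternative to the regex is to parse the string back into Python objects with `ast.literal_eval` and read the names directly. This is only a sketch: it assumes every value in the column is a valid Python literal and that each entry has a `'name'` key, neither of which is confirmed for the full dataset. The function name `featured_names` and the shortened record are hypothetical.

```python
import ast

def featured_names(raw):
    """Return featured artist names joined by ' & ', or '' if none or unparseable."""
    try:
        entries = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return ""
    if not isinstance(entries, list):
        return ""
    names = [e["name"] for e in entries if isinstance(e, dict) and "name" in e]
    return " & ".join(names)

# Hypothetical, shortened record: the real strings carry more keys per artist
raw = ("[{'api_path': '/artists/1', 'name': 'Ty Dolla $ign'}, "
       "{'api_path': '/artists/2', 'name': 'Nicki Minaj'}]")
print(featured_names(raw))         # Ty Dolla $ign & Nicki Minaj
print(repr(featured_names("[]")))  # ''
```

Parsing avoids the stray spaces the regex leaves around the `&` separator, at the cost of failing on any string that is not a clean literal.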

In [20]:
feat[:25]

['Ty Dolla $ign &   Nicki Minaj',
 '2 Chainz',
 'Snoop Dogg',
 '',
 'French Montana',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Tyga',
 '',
 '',
 '',
 'Kid Ink',
 'Jordin Sparks',
 'Farruko',
 'Jennifer Lopez &   Matoma']

In [21]:
df['featured'] = feat

In [22]:
df

Unnamed: 0,index,artist,featured,title,media,release_year,lyrics
0,0,Jason Derulo,Ty Dolla $ign & Nicki Minaj,Swalla,"[{'provider': 'youtube', 'start': 0, 'type': '...",2017,Drank Young Money Love in a thousand differ...
1,1,Jason Derulo,2 Chainz,Talk Dirty,"[{'provider': 'youtube', 'start': 0, 'type': '...",2013,"(Jason, haha Jason Derulo) Haha, get Jazzy on..."
2,2,Jason Derulo,Snoop Dogg,Wiggle,[{'native_uri': 'spotify:track:2sLwPnIP3CUVmIu...,2014,"Ayo, Jason (Oh yeah!) Say somethin to her, ho..."
3,3,Jason Derulo,,Trumpets,[{'native_uri': 'spotify:track:5KONnBIQ9LqCxye...,2013,Every time that you get undressed I hear symp...
4,4,Jason Derulo,French Montana,Tip Toe,[{'native_uri': 'spotify:track:2z4pcBLQXF2BXKF...,2017,Derulo Whine fa me darlin Way you move ya spi...
...,...,...,...,...,...,...,...
11531,11885,21 Savage,,Hold Up,"[{'provider': 'youtube', 'start': 0, 'type': '...",2018,"21, 21 21, 21 Good head make a nigga toes cur..."
11532,11886,21 Savage,,Hollow Tips (Freestyle),"[{'provider': 'youtube', 'start': 0, 'type': '...",2016,I got a lot of extended clips Lot of extended...
11533,11887,21 Savage,,Act A Fool,[],,"Wheezy Beats Internet, internet Fuck the inte..."
11534,11888,21 Savage,,Pass Her,"[{'provider': 'youtube', 'start': 0, 'type': '...",,"Good job, First All the bitches want me (2..."


In [None]:
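# Draft, not yet run. Assumed intent: flag rows where the artist or a featured artist is named in the lyrics.
# Columns are addressed by name because positional indexes break once columns are inserted.
skip = []
for artist, featured, lyrics in zip(df['artist'], df['featured'], df['lyrics']):
    if artist in lyrics or (featured and featured in lyrics):
        skip.append(1)
    else:
        skip.append(0)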

In [None]:
df.insert(1, 'featuring', feat) 

In [None]:
df['media'][0]
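
For the Spotify ID mentioned at the top, the `native_uri` values in the media column follow the `spotify:track:<id>` pattern, so a regex along these lines might be enough. A sketch only: `spotify_track_id` is a hypothetical helper, and the ID in the example is made up because the real ones are truncated in the output above.

```python
import re

def spotify_track_id(media):
    """Return the track ID from a 'spotify:track:<id>' URI in the media string, else None."""
    match = re.search(r"spotify:track:([A-Za-z0-9]+)", media)
    return match.group(1) if match else None

# Made-up ID for illustration
example = "[{'native_uri': 'spotify:track:0000000000000000000000', 'provider': 'spotify'}]"
print(spotify_track_id(example))   # 0000000000000000000000
print(spotify_track_id("[]"))      # None
```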

In [None]:
pd.set_option('display.max_rows', None)


In [None]:
pd.options.display.max_colwidth = 300