# Data Cleaning

Once the data has been collected and scraped from the two websites, we need to clean it based on the kind of lyrics each song contains. After converting all the columns to their respective types, we will remove all songs that are:
1. Instrumentals
2. Non-English songs (if the lyrics are majority non-English)

First, we'll import the packages we'll be using for this notebook. Then, let's load all the songs from the data csv files.

In [1]:
# langdetect is not available from the default conda channels, so install it with pip
%pip install langdetect
import pandas as pd
import numpy as np
import re
from langdetect import detect
from ast import literal_eval


In [2]:
# Read the 11 CSV chunks, then stack them into one DataFrame with a fresh index
frames = []
for i in range(1,12):
    FILE = '../Data Collection/data/collected/all data/data' + str(i) + '.csv'
    print(FILE)
    frames.append(pd.read_csv(FILE))
data = pd.concat(frames, ignore_index = True)

../Data Collection/data/collected/all data/data1.csv
../Data Collection/data/collected/all data/data2.csv
../Data Collection/data/collected/all data/data3.csv
../Data Collection/data/collected/all data/data4.csv
../Data Collection/data/collected/all data/data5.csv
../Data Collection/data/collected/all data/data6.csv
../Data Collection/data/collected/all data/data7.csv
../Data Collection/data/collected/all data/data8.csv
../Data Collection/data/collected/all data/data9.csv
../Data Collection/data/collected/all data/data10.csv
../Data Collection/data/collected/all data/data11.csv


Now let us use the detect function from langdetect to see whether these example strings are written in English or not.

In [3]:
examples = ['this is a sentence in english',
            'welcome to the twilight zone', 
            "'hola' is spanish for hello",
            "おはようございます"]

In [4]:
for example in examples:
    print(detect(example))

en
en
en
ja


The detect function labels these strings as expected. The third example is classified as English even though it contains a Spanish word, because most of the sentence is English, which matches the "majority" rule we want.
Now, let's detect the language of each song in our dataset and add it as a feature, then check whether each song is an instrumental and add that as a feature as well.

In [5]:
def get_language(lyrics):
    # detect() raises on empty or non-text input; label those rows 'NaN'
    try:
        return detect(lyrics)
    except Exception:
        return 'NaN'

data['language'] = data['lyrics'].apply(get_language)
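
The cell above assigns one label per song from the full lyric string. If the "majority non-English" rule is meant line by line, one possible variant is to label each line and keep the most common label. The following is a minimal sketch under that assumption: `majority_language` and `toy_detect` are hypothetical names that do not appear in the notebook, and `toy_detect` is only a stand-in for `langdetect.detect` so that the block runs without the package installed.

```python
from collections import Counter

def majority_language(lines, detect_fn, default='NaN'):
    """Return the most common language label among the non-empty lines."""
    votes = Counter()
    for line in lines:
        if not line.strip():
            continue  # blank lines cast no vote
        try:
            votes[detect_fn(line)] += 1
        except Exception:
            continue  # a line the detector cannot classify casts no vote
    if not votes:
        return default
    return votes.most_common(1)[0][0]

def toy_detect(text):
    # Hypothetical stand-in for langdetect.detect: a line counts as 'es'
    # if it contains one of a few Spanish marker words, otherwise 'en'
    spanish_markers = {'hola', 'quiero', 'cosas', 'hoy'}
    words = set(text.lower().split())
    return 'es' if words & spanish_markers else 'en'

lyrics = ['Ya me canse de tus cosas', 'Hoy quiero bailar', 'Hello again']
print(majority_language(lyrics, toy_detect))  # es (two of three lines)
```

In the notebook itself, `detect` could be passed as `detect_fn`, at the cost of one detection call per line instead of one per song.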

In [6]:
def is_instrumental(lyrics):
    # A song is flagged when its lyrics hold fewer than 5 space-separated
    # tokens and mention the word 'instrumental'
    return len(lyrics.split(' ')) < 5 and 'instrumental' in lyrics.lower()

data['instrumental'] = data['lyrics'].apply(is_instrumental)
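
To see how this rule behaves on edge cases, here is a small self-contained check. The sample strings are made up for illustration and are not taken from the dataset. Note the last case: because the text is split on spaces only, words separated by newlines count as a single token.

```python
def is_instrumental(lyrics):
    # Same rule as the notebook cell: fewer than 5 space-separated tokens
    # and the word 'instrumental' somewhere in the text
    return len(lyrics.split(' ')) < 5 and 'instrumental' in lyrics.lower()

samples = [
    '[Instrumental]',                            # short and tagged
    'Instrumental track',                        # short and tagged
    'This is not an instrumental song at all',   # mentions the word, but too long
    'La la la',                                  # short, but no tag
    'Instrumental\nbreak\nla\nla\nla\nla',       # six words, yet no spaces at all
]

for text in samples:
    print(repr(text), is_instrumental(text))
```

If the newline case matters for the data at hand, splitting with `lyrics.split()` (any whitespace) would count those words separately; whether that is preferable here is a judgment call.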

In [7]:
data[:5]

Unnamed: 0,title,artist,lyrics,listens,hotness,genres,genius ID,spotify ID,language,instrumental
0,Fast Cars,Craig David,\n\n[Chorus - Craig David]\nFast cars\nFast wo...,751624,28,"['R&B Genius', 'Rock Genius']",,,en,False
1,Watching The Rain,Scapegoat Wax,"\n\n(Ya ya ya ya ya ya ya)\nHello, hello, it's...",10681,6,['Pop Genius'],,,en,False
2,Infierno,Mesita,"\n\n[Letra de ""Infierno""]\n\n[Estribillo]\nNo ...",628847,0,"['Uruguay', 'Latin Urban', 'Trap', 'En Español...",,,es,False
3,Balaio,Itamar Assumpção,\n\nNega\nO que que tem no balaio?\nO que que ...,16495,10,"['Brasil', 'Avant Garde', 'Em Português', 'Pop...",,,pt,False
4,Venganza,Ivy Queen,\n\n*coro*\nYa me canse de tus cosas\nHoy quie...,94916,0,"['En Español', 'Pop Genius']",,,es,False


Now, we need to clean the lyric strings and reformat all the data types.

We will split each song into lines on "\n" and replace "\r" characters, but we also need to remove section tags written in square brackets (such as "[Chorus]") and strip punctuation: parentheses (keeping the words inside them), question marks, exclamation points, periods, commas, hyphens, apostrophes, and similar signs. That way, when we create our corpus, the words we extract are the same ("corn." should be the same as "corn!"). Doing so requires re.sub() for the bracketed tags and str.replace() for the individual characters. Reference: https://stackoverflow.com/questions/14596884/remove-text-between-and-in-python

In [8]:
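def clean_lyrics(lyrics):
    # Strip tags and punctuation, then split the song into a list of lines
    return split_lyrics(remove_extraneous(lyrics))

def split_lyrics(lyrics):
    # One entry per non-empty line, with surrounding whitespace trimmed
    return [line.strip() for line in lyrics.split('\n') if line != '']

def remove_extraneous(lyrics):
    # Drop bracketed section tags such as [Chorus] or [x2]
    lyrics = re.sub(r'\[.*?\]', '', lyrics)
    # Carriage returns become spaces
    lyrics = lyrics.replace('\r', ' ')
    # Delete punctuation so that "corn." and "corn!" become the same token
    for character in ['?','.','!',',','-',"'",'’','(',')','*','/','"']:
        lyrics = lyrics.replace(character, '')
    return lyrics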
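
As a side note, the character-by-character removal can also be written as a single pass with `str.translate`. This is an alternative sketch, not what the notebook uses; `PUNCTUATION`, `strip_with_loop`, and `strip_with_translate` are names introduced here for illustration.

```python
# The same twelve characters that the cleaning cell removes
PUNCTUATION = '?.!,-\'’()*/"'

def strip_with_loop(text):
    # One str.replace call per character
    for character in PUNCTUATION:
        text = text.replace(character, '')
    return text

# With three arguments, str.maketrans maps every character in the third
# argument to None, which makes translate() delete it
DELETE_TABLE = str.maketrans('', '', PUNCTUATION)

def strip_with_translate(text):
    # Single pass over the string
    return text.translate(DELETE_TABLE)

sample = "You're the one (that I want) - a 5.0!"
print(strip_with_loop(sample))
print(strip_with_translate(sample))
```

Both functions return the same string. Removing the hyphen leaves two consecutive spaces behind, which is worth keeping in mind when the cleaned lines are later split on single spaces.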

In [9]:
test_lyrics = data.loc[0]['lyrics'] + '(This stuff should stay...) [don’t keep this] won’t go'
print(test_lyrics)



[Chorus - Craig David]
Fast cars
Fast women
Speed bikes with the nitro in them
Dangerous when driven
Those are the type that I be feeling [x2]

[Verse 1 - Craig David]
Sitting there while I observe
I like your lines I love your curves
Checking out your bodywork
How can I get with her
You're the one that I want
Do anything to turn you on
Somebody please just pass the keys so you can take a ride with me

[Pre-Chorus - Craig David]
I'm on a mission
First thing disarming your system
Next thing slip the key in the ignition
Just listen
To the way that you purr at me you know you prefer the speed
When your back starts dipping
Wheel spinning when the gears start shifting
I'm sticking til the turbo kicks in
You know that I'm missing
Got me moving so fast you got me missing the flash a 5.0

[Chorus - Craig David]
Fast cars
Fast women
Speed bikes with the nitro in them
Dangerous when driven
Those are the type that I be feeling [x2]

[Verse 2 - Craig David]
Feel the ride feel the rush
The moment

In [10]:
print(clean_lyrics(test_lyrics))

['Fast cars', 'Fast women', 'Speed bikes with the nitro in them', 'Dangerous when driven', 'Those are the type that I be feeling', 'Sitting there while I observe', 'I like your lines I love your curves', 'Checking out your bodywork', 'How can I get with her', 'Youre the one that I want', 'Do anything to turn you on', 'Somebody please just pass the keys so you can take a ride with me', 'Im on a mission', 'First thing disarming your system', 'Next thing slip the key in the ignition', 'Just listen', 'To the way that you purr at me you know you prefer the speed', 'When your back starts dipping', 'Wheel spinning when the gears start shifting', 'Im sticking til the turbo kicks in', 'You know that Im missing', 'Got me moving so fast you got me missing the flash a 50', 'Fast cars', 'Fast women', 'Speed bikes with the nitro in them', 'Dangerous when driven', 'Those are the type that I be feeling', 'Feel the ride feel the rush', 'The moment I tease your clutch', 'Reacting to my every touch', 'We

In [11]:
data['lyrics'] = data['lyrics'].apply(clean_lyrics)

In [12]:
def get_lyrics_length(lyrics):
    # Total number of space-separated tokens across all lines of a song
    length = 0
    for line in lyrics:
        length += len(line.split(' '))
    return length

data['song length'] = data['lyrics'].apply(get_lyrics_length)
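
Counting with `split(' ')` treats every single space as a separator, so doubled spaces and empty lines add to the total. The sketch below compares that with `split()`, which splits on runs of whitespace. The helper names and the sample lines are made up for illustration.

```python
def count_tokens_space(lines):
    # Mirrors the counting rule used for 'song length'
    return sum(len(line.split(' ')) for line in lines)

def count_tokens_whitespace(lines):
    # Alternative: split on any run of whitespace, ignoring empty pieces
    return sum(len(line.split()) for line in lines)

lines = ['Fast cars', 'Got me  moving', '']

print(count_tokens_space(lines))       # 7: 2 + 4 + 1
print(count_tokens_whitespace(lines))  # 5: 2 + 3 + 0
```

Which count is more appropriate depends on how the lengths are used later; the second variant is closer to a plain word count.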

In [13]:
def clean_genre(genres_string):
    genres = literal_eval(genres_string)
    genre_list = []
    for genre in genres:
        genre_list.append(genre.replace('Genius','').strip().lower())
    return genre_list
data['genres'] = data['genres'].apply(clean_genre)
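
The genres column is read from CSV as text, so each cell holds the string form of a Python list. `literal_eval` parses such a string back into a list and, unlike `eval`, accepts only literals. Below is a minimal sketch of that step on one sample value; `parse_genres` is a hypothetical variant that returns an empty list when a cell cannot be parsed, which the notebook's `clean_genre` does not do.

```python
from ast import literal_eval

def parse_genres(genres_string):
    # Parse the string form of a list; fall back to an empty list when the
    # cell is malformed or is not a plain literal
    try:
        genres = literal_eval(genres_string)
    except (ValueError, SyntaxError):
        return []
    # Drop the 'Genius' suffix, trim whitespace, and lowercase each genre
    return [genre.replace('Genius', '').strip().lower() for genre in genres]

print(parse_genres("['R&B Genius', 'Rock Genius']"))  # ['r&b', 'rock']
print(parse_genres("['Pop Genius'"))                  # [] (unterminated list)
print(parse_genres("__import__('os')"))               # [] (not a literal)
```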

In [14]:
# Combine both conditions into one boolean mask instead of chaining two selections
english = data[(~data['instrumental']) & (data['language'] == 'en')]
print('We have in total ' + str(len(data)) + ' datapoints')
print('We have ' + str(len(english)) + ' English datapoints')
data[:5]

We have in total 118709 datapoints
We have 85829 English datapoints


  


Unnamed: 0,title,artist,lyrics,listens,hotness,genres,genius ID,spotify ID,language,instrumental,song length
0,Fast Cars,Craig David,"[Fast cars, Fast women, Speed bikes with the n...",751624,28,"[r&b, rock]",,,en,False,379
1,Watching The Rain,Scapegoat Wax,"[Ya ya ya ya ya ya ya, Hello hello its me agai...",10681,6,[pop],,,en,False,360
2,Infierno,Mesita,"[No sé lo que me estás haciendo, Con esa mirad...",628847,0,"[uruguay, latin urban, trap, en español, latin...",,,es,False,418
3,Balaio,Itamar Assumpção,"[Nega, O que que tem no balaio, O que que tem ...",16495,10,"[brasil, avant garde, em português, pop]",,,pt,False,349
4,Venganza,Ivy Queen,"[coro, Ya me canse de tus cosas, Hoy quiero ba...",94916,0,"[en español, pop]",,,es,False,290


The data looks good, so let's export it to CSV for use in the next steps. We'll save both the entire dataset and the filtered dataset (English, non-instrumental songs only).

In [16]:
data.to_csv('entire_clean.csv', index = False)
# Keep only non-instrumental English songs, using a single combined mask
data[(~data['instrumental']) & (data['language'] == 'en')].to_csv('english_clean.csv', index = False)
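
One thing to keep in mind for the next steps: the `lyrics` and `genres` columns now hold Python lists, and a CSV file stores such cells as text. After reloading, they come back as strings and need to be parsed again, for example with `literal_eval`. The sketch below illustrates the round trip with the standard `csv` module standing in for pandas, so that it runs without third-party packages; the single row is made-up sample data.

```python
import csv
import io
from ast import literal_eval

rows = [{'title': 'Fast Cars', 'lyrics': ['Fast cars', 'Fast women']}]

# Write the row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'lyrics'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the lyrics cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]['lyrics']).__name__)  # str

# Parse the string back into a list of lines
restored = literal_eval(loaded[0]['lyrics'])
print(restored)  # ['Fast cars', 'Fast women']
```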

  
