# Data Cleaning

After collecting and scraping all the data from the two websites, we need to clean it based on the lyrics each song contains. After converting all the columns to their respective types, we will remove all songs that are:
1. Instrumentals
2. Non-English songs (songs whose lyrics are mostly not in English)

First, we'll import the packages we'll be using for this notebook. Then, let's load all the songs from the data CSV files.

In [2]:
import langdetect as ld
import langid
import pandas as pd
import numpy as np
import re
from ast import literal_eval

In [3]:
# Read each CSV once, then concatenate in a single call
frames = []
for i in range(1, 12):
    FILE = '../Data Collection/data/collected/all data/data' + str(i) + '.csv'
    print(FILE)
    frames.append(pd.read_csv(FILE))
data = pd.concat(frames, ignore_index=True)

../Data Collection/data/collected/all data/data1.csv
../Data Collection/data/collected/all data/data2.csv
../Data Collection/data/collected/all data/data3.csv
../Data Collection/data/collected/all data/data4.csv
../Data Collection/data/collected/all data/data5.csv
../Data Collection/data/collected/all data/data6.csv
../Data Collection/data/collected/all data/data7.csv
../Data Collection/data/collected/all data/data8.csv
../Data Collection/data/collected/all data/data9.csv
../Data Collection/data/collected/all data/data10.csv
../Data Collection/data/collected/all data/data11.csv


Now let's use the detect_langs function from langdetect to check whether these example strings are written in English. It returns a list of candidate languages, each with a probability.

In [4]:
examples = ['this is a sentence in english',
            'welcome to the twilight zone', 
            "'hola' is spanish for hello",
            "おはようございます"]

In [4]:
for example in examples:
    print(ld.detect_langs(example))

[en:0.9999966178658325]
[en:0.9999987822353855]
[en:0.9999968118127358]
[ja:0.9999999999997472]


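langid gives us a second opinion. Its classify function returns a (language, score) tuple, so we keep only the first element.

In [5]: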
string = 'here is some stuff'
langid.classify(string)[0]

'en'

In [6]:
DETECTION_THRESHOLD = .9999
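def get_langdetect(lyrics):
    # Return the first language langdetect reports above the threshold,
    # or 'Likely <lang>' for its top guess when none is confident enough.
    try:
        detection = ld.detect_langs(lyrics)
        for lang in detection:
            language, prob = lang.lang, lang.prob
            if prob > DETECTION_THRESHOLD:
                return language
        return 'Likely ' + detection[0].lang
    except Exception:
        # langdetect raises when the text has no usable features (e.g. empty lyrics)
        return 'NaN'
    
def get_language(lyrics):
    # Local names chosen so they do not shadow the langdetect module alias ld
    langdetect_guess = get_langdetect(lyrics)
    langid_guess = langid.classify(lyrics)[0]
    if langdetect_guess == langid_guess:
        return langdetect_guess
    else:
        return {'langid' : langid_guess, 'langdetect' : langdetect_guess}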

data['language'] = data['lyrics'].apply(get_language)
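
To make the shape of the new language column concrete, here is a minimal sketch of the agreement rule on its own. The two detector outputs are passed in as plain strings so it runs without either library; the helper names combine_votes and is_english are hypothetical and not part of the notebook.

```python
def combine_votes(langdetect_label, langid_label):
    """Mirror get_language: a plain code on agreement, a dict otherwise."""
    if langdetect_label == langid_label:
        return langdetect_label
    return {'langid': langid_label, 'langdetect': langdetect_label}


def is_english(label):
    """Only an agreed-upon 'en' counts; dicts (disagreements) never do."""
    return label == 'en'


agreed = combine_votes('en', 'en')
unsure = combine_votes('Likely es', 'es')
print(agreed, is_english(agreed))
print(unsure, is_english(unsure))
```

Because a 'Likely en' label never equals langid's 'en', a song only receives the plain 'en' label when langdetect clears the threshold and langid agrees, so a later == 'en' filter is deliberately conservative.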

In [7]:
def is_instrumental(lyrics):
    if len(lyrics.split(' ')) < 5 and 'instrumental' in lyrics.lower():
        return True
    return False

data['instrumental'] = data['lyrics'].apply(is_instrumental)
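
As a quick check of the rule above, the sketch below runs is_instrumental on a few made-up strings. The sample lyrics are hypothetical and only illustrate which inputs the heuristic catches.

```python
def is_instrumental(lyrics):
    # Same rule as the notebook: very short text that mentions "instrumental".
    return len(lyrics.split(' ')) < 5 and 'instrumental' in lyrics.lower()


samples = [
    '[Instrumental]',                           # typical placeholder text
    'Instrumental',
    'This song is not an instrumental at all',  # too many words to match
    'La la la',                                 # short, but no keyword
]
flags = [is_instrumental(sample) for sample in samples]
print(flags)
```

Note that a placeholder of five or more words would not be flagged, so the word limit trades a few missed instrumentals for fewer false positives on real lyrics.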

In [1]:
data[:5]


Now, we need to clean the lyric strings and reformat all the data types.

Not only will we have to replace "\n"s and "\r"s, but we will also need to remove text inside brackets and parentheses, the parentheses themselves, colons, exclamation points, periods, and other symbols, so that when we create our corpus the words we extract match ("corn." should be the same as "corn!"). Doing so will once again require the str.replace() function, along with re.sub for the bracketed text. Reference: https://stackoverflow.com/questions/14596884/remove-text-between-and-in-python
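
Before the full implementation, here is a minimal sketch of the idea on a single string. The normalize helper and its short punctuation list are illustrative assumptions; the notebook's own functions use a much longer character list.

```python
import re


def normalize(text):
    # Drop bracketed or parenthesised asides such as "[Chorus]" or "(yeah)".
    text = re.sub(r'(?s)\[.*?\]', '', text)
    text = re.sub(r'(?s)\(.*?\)', '', text)
    # Strip a small, illustrative set of punctuation marks.
    for character in ['.', '!', ',', ':', '?']:
        text = text.replace(character, '')
    return text.strip()


print(normalize('corn.') == normalize('corn!'))
print(normalize('[Chorus] I love corn (yeah)!'))
```

The bracketed text has to be removed before the individual characters are stripped; otherwise the brackets would disappear first and the regular expressions would no longer find anything to match.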

In [90]:
bad_characters = ['%','@','&','=','?','❓','？','.','!',',','-','~',"'",'’','`','*','^','/','"','{','}','_','�',';','‘','…','[',']','—','”','\\','“',':',
                 '©', '£', '$', '🔥', '#', '👑', '💃🏽', '🔐', '👋🏽', '+', '\u0024', '\u20AC', '\u00A3', '\u00A5', '\u00A2',
                 '\u20B9', '\u20A8', '\u20B1', '\u20A9', '\u0E3F', '\u20AB', '\u20AA', '\u00A9', '\u00AE', '\u2117',
                 '\u2122', '\u2120', '\xad', '\u2028', '⛽️', '✡', '《', '「', '」', '。', '￼', '🐐', '👅', '👉🏾', '👴🏼', '💇', '💋',
                 '💪', '💸', '🔮', '😉', '😎', '😷', '🚷', '►', '„', '•', '†', '–', '‒', '«', '\x93', '°', '¡', '¦',
                 '♪', '\x98', '|', '|', '½', '\x80', '🍻', '🙏', '®', '¿', '🏁', '❤', '∞', 'â', '€˜','\u03B1','\u03B2','\u03B3','\u03B4','\u03B5','\u03B6','\u03B7','\u03B8','\u03B9','\u03BA','\u03BB','\u03BC','\u03BD','\u03BE','\u03BF','\u03C1','\u03C3','\u03C4','\u03C5','\u03C6','\u03C7','\u03C8','\u03C9','\u0391','\u0392','\u0393','\u0394','\u0395','\u0396','\u0397','\u0398','\u0399','\u039A','\u039B','\u039C','\u039D','\u039E','\u039F','\u03A0','\u03A2','\u03A3','\u03A4','\u03A5','\u03A6','\u03A7','\u03A8','\u03A9',
                 '<', '>']

space_likes = ['\xa0', '\t', '\u0009', '\u000D', '\u00A0', '\u0020', '\u1680', '\u180E', '\u2000',
               '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009',
               '\u200A', '\u200B', '\u202F', '\u205F', '\u3000', '\uFEFF', '\r']

def clean_lyrics(lyrics):
    return split_lyrics(replace_numerics(remove_extranious(remove_emoji(lyrics))))

def split_lyrics(lyrics):
    # Split into lines, trim whitespace, and drop lines that end up empty
    lines = []
    for line in lyrics.split('\n'):
        line = line.strip()
        if line != '':
            lines.append(line)
    return lines

def remove_extranious(lyrics):
    lyrics = re.sub(r'(?s)\[.*?\]', '', lyrics)
    lyrics = re.sub(r'(?s)\(.*?\)', '', lyrics)
    lyrics = re.sub(r'(?s)\<.*?\>', '', lyrics)
    for character in space_likes:
        lyrics = lyrics.replace(character, ' ')
    for character in bad_characters:
        lyrics = lyrics.replace(character, '')
    return lyrics

def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

def replace_numerics(lyrics):
    lyrics = re.sub(r'[0-9]+', ' # ', lyrics)
    return lyrics
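
One thing worth checking in remove_emoji is its last range, \U000024C2-\U0001F251, which is far wider than the emoji blocks and also covers kana, CJK ideographs, and Hangul. The sketch below shows the effect on a short Japanese string and contrasts it with a narrower pattern. Here narrow_pattern is a hypothetical alternative whose ranges are an assumption, not an exhaustive definition of emoji.

```python
import re

# The pattern as written in the notebook.
notebook_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"  # part of the dingbats block
    "\U000024C2-\U0001F251"  # very wide: includes kana, CJK, Hangul
    "]+"
)

# Hypothetical narrower alternative that leaves non-Latin scripts alone.
narrow_pattern = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)

text = "おはよう 🔥"
print(repr(notebook_pattern.sub('', text)))
print(repr(narrow_pattern.sub('', text)))
```

Since the language labels are computed before cleaning, they are not affected, but the cleaned lyrics of songs written in those scripts can end up nearly empty.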

In [97]:
data['lyrics'] = data['lyrics'].apply(clean_lyrics)

In [11]:
def get_lyrics_length(lyrics):
    length = 0
    for line in lyrics:
        length += len(line.split(' '))
    return length
data['song length'] = data['lyrics'].apply(get_lyrics_length)
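
A detail to keep in mind when reading the song length column: split(' ') treats every single space as a separator, so consecutive spaces produce empty strings that are counted as words. The sketch below shows this on a made-up line passed through replace_numerics, next to split() with no argument for comparison.

```python
import re


def replace_numerics(lyrics):
    # Same substitution as the notebook: digits become ' # '.
    return re.sub(r'[0-9]+', ' # ', lyrics)


line = replace_numerics('I got 99 problems')
by_single_space = line.split(' ')
by_whitespace = line.split()
print(repr(line))
print(len(by_single_space), by_single_space)
print(len(by_whitespace), by_whitespace)
```

The lengths computed in this notebook therefore run slightly high for lines that contained numbers; switching to split() would give a stricter word count if that matters downstream.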

In [12]:
def clean_genre(genres_string):
    genres = literal_eval(genres_string)
    genre_list = []
    for genre in genres:
        genre_list.append(genre.replace('Genius','').strip().lower())
    return genre_list
data['genres'] = data['genres'].apply(clean_genre)
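
The genres column arrives from the CSV as the string form of a Python list, which is why literal_eval is needed before the tags can be cleaned. Here is a small sketch on one cell; the raw value is a hypothetical example, since the exact tag format in the collected files is an assumption.

```python
from ast import literal_eval


def clean_genre(genres_string):
    # Parse the list literal, then drop the 'Genius' marker and normalise case.
    genres = literal_eval(genres_string)
    return [genre.replace('Genius', '').strip().lower() for genre in genres]


raw = "['R&B Genius', 'Rock Genius']"  # hypothetical raw CSV cell
cleaned = clean_genre(raw)
print(cleaned)
```

literal_eval only evaluates Python literals, so it is a safer choice than eval for parsing cells read from a file.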

In [98]:
print('We have in total ' + str(len(data)) + ' datapoints')
# Combine both conditions into one boolean mask instead of chaining two indexing steps
print('We have ' + str(len(data[(data['instrumental'] == False) & (data['language'] == 'en')])) + ' English datapoints')
data[:5]

We have in total 118709 datapoints
We have 84338 English datapoints


  


| Unnamed: 0 | title | artist | lyrics | listens | hotness | genres | genius ID | spotify ID | language | instrumental | song length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fast Cars | Craig David | [Fast cars, Fast women, Speed bikes with the n... | 751624 | 28 | [r&b, rock] | | | en | False | 402 |
| 1 | Watching The Rain | Scapegoat Wax | [Hello hello its me again, You know since youv... | 10681 | 6 | [pop] | | | en | False | 278 |
| 2 | Infierno | Mesita | [No sé lo que me estás haciendo, Con esa mirad... | 628847 | 0 | [uruguay, latin urban, trap, en español, latin... | | | {'langid': 'es', 'langdetect': 'Likely es'} | False | 430 |
| 3 | Balaio | Itamar Assumpção | [Nega, O que que tem no balaio, O que que tem ... | 16495 | 10 | [brasil, avant garde, em português, pop] | | | pt | False | 360 |
| 4 | Venganza | Ivy Queen | [coro, Ya me canse de tus cosas, Hoy quiero ba... | 94916 | 0 | [en español, pop] | | | es | False | 294 |


The data looks great, so let's export it to CSV for use in the next steps. We'll save both the entire dataset and the filtered dataset of English, non-instrumental songs.

In [99]:
data.to_csv('entire_clean.csv', index = False)
# Single combined mask: non-instrumental songs that both detectors label as English
data[(data['instrumental'] == False) & (data['language'] == 'en')].to_csv('english_clean.csv', index = False)

  
