# EDA/Data cleaning artist data

After scraping the top100 charts, the top20 hip hop charts and the Deutschrap (dr) artists from Spotify, I'm now going to clean the data sets up for further processing (getting the audio features).

Next steps are:
- EDA
- data cleaning/wrangling
    - add another feature called genre
      - take the artists from the top20 hip hop charts and assign the genre hip hop to them
    - clean up the not needed index column
    - extract the year and month into separate columns
    - check outliers like all the christmas songs
- getting audio features from tracks (top100 & top20 hip hop)

In [1]:
# importing needed libraries

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [None]:
# importing dr artist data

dr_artists = pd.read_csv('../data/dr_artist_name_id_df.csv')

In [None]:
# creating a list of all unique dr artist names

dr_artists_lst = list(dr_artists['artist_name'].unique())

In [None]:
# importing top20 hip hop data

top20_hiphop = pd.read_csv('../data/top20_hiphop_tracks_raw.csv')

In [None]:
# creating a list of all unique hiphop artist names

top20_hiphop_artists_lst = list(top20_hiphop['artist'].unique())

In [None]:
# importing top100 2010-2021 data

top100 = pd.read_csv('../data/top100_tracks_raw.csv')

- data cleaning/wrangling
    - add another feature called genre - **NEXT**
    - clean up the not needed index column
    - extract the year in a separate column
    - check outliers like all the christmas songs

In [None]:
# creating new column genre in top100, filled with the placeholder 0 for now

top100['genre'] = 0

In [None]:
# setting genre to HipHop for every top100 row whose artist string contains a top20 hip hop artist

for artist in top20_hiphop_artists_lst:
    for i in range(len(top100)):
        if artist in top100.loc[i,'artist']:
            top100.loc[i, 'genre'] = 'HipHop'

In [None]:
# setting genre to Deutschrap for every top100 row whose artist string contains a dr artist (this overrides HipHop)

for artist in dr_artists_lst:
    for i in range(len(top100)):
        if artist in top100.loc[i,'artist']:
            top100.loc[i, 'genre'] = 'Deutschrap'
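
The two nested loops above check every artist against every row. As a rough sketch of a faster alternative, the same substring matching can be done with one `str.contains` call per genre. The small `demo` frame, the two name lists, the helper `contains_any` and the `'other'` placeholder are made up for illustration and are not part of the original data.

```python
import re
import pandas as pd

# hypothetical stand-in for top100 (the real data comes from top100_tracks_raw.csv)
demo = pd.DataFrame({'artist': ['Capital Bra & Samra', 'Ed Sheeran', 'Drake feat. Rihanna']})
hiphop_names = ['Drake', 'Capital Bra']   # stand-in for top20_hiphop_artists_lst
dr_names = ['Capital Bra', 'Samra']       # stand-in for dr_artists_lst


def contains_any(series, names):
    """Boolean mask: True where the text contains any of the names as a substring."""
    # re.escape keeps characters like '.' or '$' in artist names from acting as regex syntax
    pattern = '|'.join(re.escape(name) for name in names)
    return series.str.contains(pattern, regex=True, na=False)


# assumption: a string placeholder instead of 0, so the column holds only strings
demo['genre'] = 'other'
demo.loc[contains_any(demo['artist'], hiphop_names), 'genre'] = 'HipHop'
# applied second, so Deutschrap overrides HipHop, same order as the loops above
demo.loc[contains_any(demo['artist'], dr_names), 'genre'] = 'Deutschrap'

print(demo)
```

Like the loops, this is plain substring matching, so a short artist name can still match inside a longer, unrelated one.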

In [None]:
# did it work?

top100['genre'].unique()

In [None]:
# saving df with new feature genre 

top100.to_csv('../data/top100_tracks_V1.csv')

- data cleaning/wrangling
    - add another feature called genre - **DONE**
    - clean up the not needed index column - **NEXT**
    - extract the year in a separate column
    - check outliers like all the christmas songs

In [None]:
# checking the column names

top100.columns

In [None]:
# dropping the not needed columns

top100 = top100.drop(['Unnamed: 0','artistMatch?'], axis=1)
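
A column named `Unnamed: 0` usually appears when a DataFrame was saved with its index and then read back in. The sketch below shows that round trip on a tiny made-up frame (`demo` is hypothetical) and how `index=False` avoids the extra column in the first place. Whether the raw file here was created that way is an assumption.

```python
import io
import pandas as pd

# hypothetical two-row frame, only used to show the CSV round trip
demo = pd.DataFrame({'title': ['a', 'b'], 'artist': ['x', 'y']})

# by default to_csv writes the index as an extra first column without a name
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False leaves the index out, so nothing has to be dropped after loading
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` in the `to_csv` calls of this notebook would have the same effect on the saved files, but it changes their layout, so later notebooks that read them would need to be checked.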

- data cleaning/wrangling
    - add another feature called genre - **DONE**
    - clean up the not needed index column - **DONE** 
    - extract the year in a separate column - **NEXT**
    - check outliers like all the christmas songs

In [None]:
# creating the new column year and extracting the year from the week start column

top100['year'] = pd.DatetimeIndex(top100['week_start']).year

In [None]:
# let's do the same for month, which we can use to detect the christmas outlier tracks, e.g. in December

top100['month'] = pd.DatetimeIndex(top100['week_start']).day

# note: "day" is used here on purpose although I mean month; day and month appear swapped when pandas parses the dates
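
Reading the month from `.day` only works as long as pandas swaps day and month for every single date. A sketch of a more explicit way is to tell pandas the date format and then use `.dt.month` directly. The format `'%d.%m.%Y'` and the sample values are assumptions, since the actual layout of `week_start` is not shown here.

```python
import pandas as pd

# hypothetical week_start values, assumed to be in day.month.year order
dates = pd.Series(['05.12.2020', '06.01.2021'])

# an explicit format removes any guessing about which part is the day
parsed = pd.to_datetime(dates, format='%d.%m.%Y')

years = parsed.dt.year
months = parsed.dt.month

print(months.tolist())
```

With an explicit format, a date that does not fit raises an error instead of being silently parsed the other way around.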

In [None]:
# saving df with new features month and year 

top100.to_csv('../data/top100_tracks_V2.csv')

- data cleaning/wrangling
    - add another feature called genre - **DONE**
    - clean up the not needed index column - **DONE** 
    - extract the year in a separate column - **DONE** 
    - check outliers like all the christmas songs - **NEXT**

In [2]:
# found this Wikipedia website with a long list of christmas songs

url = 'https://en.wikipedia.org/wiki/List_of_popular_Christmas_singles_in_the_United_States'

website_text = requests.get(url).text
soup = BeautifulSoup(website_text, 'html.parser')  # the page is HTML, so an HTML parser fits better than 'xml'

table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')

data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

christmas_tracks = pd.DataFrame(data, columns=['title', 'artist', 'year', 'info'])

The table needs a bit of cleaning:
- remove the double quotes around the titles
- remove text in parentheses
- make everything lower case
- drop rows where info is None

In [3]:
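# strip the double quotes around the titles
christmas_tracks['title'] = christmas_tracks['title'].str.replace('"', '', regex=False)
# remove text in parentheses; regex=True is needed because pandas >= 2.0 treats the pattern literally by default
christmas_tracks['title'] = christmas_tracks['title'].str.replace(r"\(.*\)", "", regex=True)
# lower case and trim leftover whitespace so the titles can be used for substring matching
christmas_tracks['title'] = christmas_tracks['title'].str.lower().str.strip()
christmas_tracks = christmas_tracks.dropna(subset=['info'])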
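
To check what the cleaning steps do, here is a small sketch on a few made-up titles in the same shape as the Wikipedia table (quoted, sometimes followed by a note in parentheses). The sample strings are hypothetical, and the pattern is a slightly stricter variant that also removes the space in front of the parentheses.

```python
import pandas as pd

# hypothetical raw titles, quoted as in the scraped table
titles = pd.Series([
    '"All I Want for Christmas Is You"',
    '"Last Christmas" (re-release)',
    '"Jingle Bell Rock"',
])

cleaned = (
    titles
    .str.replace('"', '', regex=False)            # literal replacement of the quotes
    .str.replace(r'\s*\(.*?\)', '', regex=True)   # drop parenthesised notes and the space before them
    .str.strip()
    .str.lower()
)

print(cleaned.tolist())
```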

In [None]:
christmas_lst = list(christmas_tracks['title'].unique())
christmas_lst

In [5]:
# saving the cleaned christmas tracks df

christmas_tracks.to_csv('../data/christmas_tracks.csv')

In [None]:
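# making everything lower case so the titles can be matched against christmas_lst

top100['title'] = top100['title'].str.lower()
top100['artist'] = top100['artist'].str.lower()
# note: rows that still hold the numeric placeholder 0 become NaN here, because .str only handles strings
top100['genre'] = top100['genre'].str.lower()
top100['label'] = top100['label'].str.lower()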

In [None]:
# let's see how many christmas songs we have (every match between a list entry and a row is counted)

count = 0

for item in christmas_lst:
    for i in range(len(top100)):
        if item in top100.loc[i,'title']:
            count += 1
print(count)

In [None]:
# marking the christmas tracks by overwriting their genre with 'christmas'

for track in christmas_lst:
    for i in range(len(top100)):
        if track in top100.loc[i,'title']:
            top100.loc[i, 'genre'] = 'christmas'
            
top100['genre'].value_counts()
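
Plain substring matching can flag tracks that only contain a christmas title as part of a longer word. One possible way to reduce such false positives is to match whole words only, using `\b` word boundaries. The sketch below uses a made-up `demo` frame and a hypothetical title list, so it illustrates the idea rather than the actual data.

```python
import re
import pandas as pd

# hypothetical stand-ins for top100['title'] and christmas_lst
demo = pd.DataFrame({'title': ['last christmas', 'enjoy the silence', 'joy to the world']})
christmas_titles = ['last christmas', 'joy']

# \b ... \b only matches whole words, so 'joy' no longer matches inside 'enjoy'
pattern = r'\b(?:' + '|'.join(re.escape(t) for t in christmas_titles) + r')\b'
is_christmas = demo['title'].str.contains(pattern, regex=True, na=False)

demo['genre'] = 'other'  # assumed placeholder for this demo
demo.loc[is_christmas, 'genre'] = 'christmas'

print(demo)
print(int(is_christmas.sum()))  # number of rows flagged, each row counted once
```

Since each row is counted once, this number can be lower than the count from the nested loop, which adds one for every matching list entry.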

In [None]:
# saving df with the christmas tracks marked in genre

top100.to_csv('../data/top100_tracks_V3.csv')