# Feature engineering
Feature engineering extracts new features that are relevant to the problem. For the Spotify data set, three additional features (`genres`, `popularity`, and `lyrics`) are extracted from the Spotify API and the Genius API. Two scripts were developed for the data extraction.

1. Extract artist genres using the Spotify API (Java): [here](https://github.com/MacyChan/spotify-user-behaviour-predictor/blob/master/spotify_user_behaviour_predictor/scr/getLyrics.py)
2. Extract song lyrics using a self-developed package (Python): [here](https://github.com/UBC-MDS/pylyrics)

In [1]:
import pandas as pd
import re

## Reading the data CSV
Read in the data CSV and store it as a pandas dataframe named `spotify_df`.

In [2]:
spotify_df = pd.read_csv('data/spotify_data.csv', index_col=0)
spotify_df.head(6)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4.0,0.286,1,Mask Off,Future
1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4.0,0.588,1,Redbone,Childish Gambino
2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4.0,0.173,1,Xanny Family,Future
3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4.0,0.23,1,Master Of None,Beach House
4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4.0,0.904,1,Parallel Lines,Junior Boys
5,0.00479,0.804,251333,0.56,0.0,8,0.164,-6.682,1,0.185,85.023,4.0,0.264,1,Sneakin’,Drake


## Artist Information
`genres` and `popularity` are extracted from the Spotify API and give the genres and popularity of the corresponding artist. The table is in long format, with one row per artist-genre pair.

In [3]:
artist_df = pd.read_csv('data/artist_info.csv', index_col=0)
artist_df.head(6)

artist_id,genres,name,popularity
1RyvyyTE3xzB2ZywiAwp0i,atl hip hop,Future,91
1RyvyyTE3xzB2ZywiAwp0i,hip hop,Future,91
1RyvyyTE3xzB2ZywiAwp0i,pop rap,Future,91
1RyvyyTE3xzB2ZywiAwp0i,rap,Future,91
1RyvyyTE3xzB2ZywiAwp0i,southern hip hop,Future,91
1RyvyyTE3xzB2ZywiAwp0i,trap,Future,91


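Pivot the artist table so that each artist `name` becomes a row and each genre becomes a column, counting the rows for every artist-genre pair. Missing combinations are then filled with 0, so each `genres_*` column indicates whether the artist belongs to that genre.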

In [4]:
artist_df_pivot = (
    artist_df.pivot_table(
        index="name",
        columns="genres",
        values="popularity",
        #aggfunc=lambda x: len(x.unique()),
        aggfunc="count",
    )
    .add_prefix("genres_")
    .reset_index()
)

artist_df_pivot.fillna(0, inplace=True)
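
To make the reshaping concrete, here is a minimal sketch of the same pivot on a toy table. The rows for the second artist and all variable names prefixed with `toy_` are assumptions made up for illustration; they are not taken from `artist_info.csv`.

```python
import pandas as pd

# Toy long-format table shaped like artist_info.csv (illustrative values only).
toy_artist_df = pd.DataFrame(
    {
        "name": ["Future", "Future", "Drake", "Drake"],
        "genres": ["rap", "trap", "rap", "pop rap"],
        "popularity": [91, 91, 95, 95],
    }
)

# Same steps as the cell above: count rows per (name, genre), fill gaps with 0.
toy_pivot = (
    toy_artist_df.pivot_table(
        index="name", columns="genres", values="popularity", aggfunc="count"
    )
    .fillna(0)
    .add_prefix("genres_")
    .reset_index()
)
print(toy_pivot)

# pd.crosstab gives the same counts directly, as integers and without NaN.
toy_crosstab = pd.crosstab(toy_artist_df["name"], toy_artist_df["genres"])
print(toy_crosstab)
```

Because the pivot is keyed on `name`, two different artists who share a name would be counted together; pivoting on `artist_id` instead would avoid that if it ever matters for this data.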

Join the pivoted artist table to the original table, matching `artist` in `spotify_df` to `name` in the pivoted table.

In [5]:
spotify_df = spotify_df.merge(artist_df_pivot, left_on='artist', right_on='name')
spotify_df = spotify_df.drop(['name'], axis=1)
spotify_df.head(6)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,...,genres_vocaloid,genres_west coast rap,genres_west coast trap,genres_wonky,genres_worcester ma indie,genres_world,genres_world fusion,genres_world worship,genres_worship,genres_zolo
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.185,0.704,282253,0.431,0.0972,8,0.249,-7.893,1,0.131,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0796,0.868,342773,0.627,0.0,1,0.0983,-4.843,0,0.116,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.218,0.793,194240,0.607,5e-06,4,0.348,-6.488,0,0.0821,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.374,0.772,167158,0.612,2e-06,7,0.108,-7.274,0,0.254,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
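
`merge` defaults to an inner join, so any song whose artist has no row in the pivoted table is dropped without warning. The sketch below shows one way to check for that on toy data; the frames `songs` and `genres` and the artist "Some Artist" are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy stand-ins for spotify_df and artist_df_pivot (illustrative values only).
songs = pd.DataFrame(
    {"song_title": ["Mask Off", "Some Song"], "artist": ["Future", "Some Artist"]}
)
genres = pd.DataFrame({"name": ["Future"], "genres_trap": [1.0]})

# how="left" keeps every song; indicator=True adds a `_merge` column
# that records whether a matching artist row was found.
merged = songs.merge(
    genres, left_on="artist", right_on="name", how="left", indicator=True
)

# Artists that would have been dropped by the default inner join.
unmatched = merged.loc[merged["_merge"] == "left_only", "artist"].tolist()
print(unmatched)
```

Whether to keep unmatched songs (with missing genre columns) or drop them is a modelling choice; the check only makes the loss visible.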


## Song Information
`lyrics` is extracted from the Genius API and contains the lyrics of the corresponding song.  
A Python script was developed for scraping the lyrics [here]()

In [6]:
lyrics_df = pd.read_csv('data/lyrics_info.csv', index_col=0)
lyrics_df.head(6)

song_title,artist,lyrics
Mask Off,Future,\nCall it how it is (Call it how it is)\nHendr...
Xanny Family,Future,\nThree exotic broads and I got 'em soakin' pa...
Blood On the Money,Future,"\nThey gave lil' Trotty 25 for them thangs, ni..."
Move That Dope,Future,"\n\n\nReal dope dealers for real, haha\nHahaha..."
Blow a Bag,Future,\nYeah\nI woke up feeling like fucking up some...
Lay Up,Future,\nBeast mode\n(Zaytoven)\n\n\nI fuck on that b...


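Join the lyrics with the dataframe. Note that `lyrics_df` was read with `song_title` as its index, so `merge` without an `on` argument matches only on the shared `artist` column. Each song is therefore paired with every lyrics row of its artist, which is why the first rows of the output repeat the same audio features with different lyrics.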

In [7]:
spotify_df = spotify_df.merge(lyrics_df)
spotify_df.head(6)

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,...,genres_west coast rap,genres_west coast trap,genres_wonky,genres_worcester ma indie,genres_world,genres_world fusion,genres_world worship,genres_worship,genres_zolo,lyrics
0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,\nCall it how it is (Call it how it is)\nHendr...
1,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,\nThree exotic broads and I got 'em soakin' pa...
2,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"\nThey gave lil' Trotty 25 for them thangs, ni..."
3,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"\n\n\nReal dope dealers for real, haha\nHahaha..."
4,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,\nYeah\nI woke up feeling like fucking up some...
5,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,\nBeast mode\n(Zaytoven)\n\n\nI fuck on that b...
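
If the intent is one lyrics row per song, a possible alternative is to merge on both `song_title` and `artist`. The following is a sketch on toy data, not the notebook's actual method; the frames `songs` and `lyrics` and the placeholder lyric strings are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for spotify_df (illustrative values only).
songs = pd.DataFrame(
    {
        "song_title": ["Mask Off", "Xanny Family"],
        "artist": ["Future", "Future"],
        "energy": [0.434, 0.412],
    }
)

# Toy stand-in for lyrics_df: song_title is the index, as with index_col=0.
lyrics = pd.DataFrame(
    {"artist": ["Future", "Future"], "lyrics": ["lyrics A", "lyrics B"]},
    index=pd.Index(["Mask Off", "Xanny Family"], name="song_title"),
)

# Default merge: the only shared column is `artist`, so rows multiply.
by_artist_only = songs.merge(lyrics)

# Keyed merge: move song_title back into a column and join on both keys.
by_song_and_artist = songs.merge(
    lyrics.reset_index(), on=["song_title", "artist"]
)

print(len(by_artist_only), len(by_song_and_artist))
```

With two songs by the same artist, the default merge yields four rows while the keyed merge yields two, one per song.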


## Export CSV
Export the dataframe with the additional features to a new CSV for the subsequent machine learning steps.

In [8]:
spotify_df.to_csv('data/spotify_df_processed.csv',index=False)
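
Since the `lyrics` column contains line breaks, it can be worth confirming that the exported file reads back unchanged. A minimal sketch of such a round-trip check follows, assuming an in-memory buffer in place of the real file; the frame `toy` and its lyric text are made up for illustration.

```python
import io

import pandas as pd

# Toy frame with a multi-line text field, similar to the lyrics column.
toy = pd.DataFrame(
    {"song_title": ["Mask Off"], "lyrics": ["\nfirst line\nsecond line"]}
)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The embedded newlines survive because to_csv quotes such fields.
print(restored["lyrics"].iloc[0] == toy["lyrics"].iloc[0])
```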