# **Preprocessing Data for an ML Model**


---
scikit-learn requirements:


1.   No missing data
2.   Numeric values only

**Dealing with Categorical Values**
 - Categorical values must be converted to numeric values before we train the model.
 - To do this, we create a binary dummy variable for each category using either:
   - scikit-learn: OneHotEncoder()
   - pandas: get_dummies()
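
As a minimal sketch of the two options above, the following compares them on a small made-up DataFrame. The `toy` data and its column names are assumptions for illustration only and are not part of the music dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (assumption: not the notebook's dataset)
toy = pd.DataFrame({"genre": ["pop", "rock", "pop", "jazz"],
                    "bpm": [120, 95, 110, 80]})

# pandas: one indicator column per category; drop_first=True drops the
# first category in sorted order ("jazz") to avoid a redundant column
dummies = pd.get_dummies(toy, columns=["genre"], drop_first=True)
print(dummies.columns.tolist())

# scikit-learn: the encoder learns the categories in fit() and can reuse
# them on new data; drop="first" mirrors drop_first=True
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(toy[["genre"]]).toarray()
print(encoder.categories_[0].tolist())
print(encoded.shape)
```

Both approaches produce the same two indicator columns here. The encoder object is the more convenient choice when the same encoding has to be applied later to unseen data.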







In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
music_df= pd.read_csv("/content/drive/MyDrive/top10s.csv", encoding='latin-1')
music_df.rename(columns={'pop':'popularity'},inplace=True)
music_df.head()

Unnamed: 0.1,Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,popularity
0,1,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,2,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,3,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,4,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,5,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


In [None]:
#drop the title column; otherwise get_dummies() would also encode every title as dummy columns
music_df= music_df.drop('title', axis=1)
music_df.head()

Unnamed: 0.1,Unnamed: 0,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,popularity
0,1,Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,2,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,3,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,4,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,5,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


Here, we use popularity as the target and genre as a feature. Genre is categorical, so it must be encoded first. Note that artist is also a text column, so get_dummies() encodes it as well.


**Create Dummy Values**

In [None]:
#create dummy variables for every text column using pandas
music_dummies = pd.get_dummies(music_df, drop_first=True)


In [None]:
music_dummies.columns

Index(['Unnamed: 0', 'year', 'bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur',
       'acous',
       ...
       'top genre_house', 'top genre_indie pop',
       'top genre_irish singer-songwriter', 'top genre_latin',
       'top genre_metropopolis', 'top genre_moroccan pop',
       'top genre_neo mellow', 'top genre_permanent wave', 'top genre_pop',
       'top genre_tropical house'],
      dtype='object', length=827)

**Create the Model**

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
#preparing target and feature values
X= music_dummies.drop("popularity",axis=1).values
y= music_dummies['popularity'].values

In [None]:

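#cross-validation splitter
kf= KFold(n_splits=5,shuffle=True, random_state=42)

#instantiate the model
ridge= Ridge(alpha=0.2)

#cross-validation: returns one negative MSE per fold
score= cross_val_score(ridge, X,y, cv=kf,
                       scoring= "neg_mean_squared_error",error_score='raise')
#flip the sign and take the square root to get the RMSE per fold
rmse= np.sqrt(-score)

print("Average RMSE:{}".format(np.mean(rmse)))
#standard deviation of the target, as a baseline to compare the RMSE against
print("S.D:{}".format(np.std(y)))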

Average RMSE:11.626285782081009
S.D:14.50570259447759


**scoring="neg_mean_squared_error":** This parameter sets the metric used to evaluate the model, here the negative mean squared error. cross_val_score assumes that higher scores are better, but a lower mean squared error (MSE) is better, so the MSE is negated to fit that convention. This is why the code takes the square root of -score to recover the RMSE. The average RMSE of about 11.6 is below the target's standard deviation of about 14.5, so the model does better than always predicting the mean popularity.
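
The sign convention can be checked on synthetic data. This is a sketch only: `X_demo`, `y_demo`, and `kf_demo` are hypothetical names, and the data comes from `make_regression` rather than the music dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(Ridge(alpha=0.2), X_demo, y_demo, cv=kf_demo,
                          scoring="neg_mean_squared_error")

# Every fold score is <= 0 because it is -MSE
print(neg_mse)

# Flip the sign to get the MSE, then take the square root for the RMSE
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```

Recent scikit-learn versions also accept scoring="neg_root_mean_squared_error", which removes the need for the square root but still returns negative values.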

# Handling Missing Data

In [None]:
#checking missing data
print(music_df.isna().sum().sort_values())

Unnamed: 0    0
artist        0
top genre     0
year          0
bpm           0
nrgy          0
dnce          0
dB            0
live          0
val           0
dur           0
acous         0
spch          0
popularity    0
dtype: int64


**Approaches to tackle missing data:**


---


1.   Dropping missing data:
     - Common approach: remove the observations with missing values in a column when they account for less than 5% of the data.

2.   Imputing missing data:
     - Use an educated guess to replace the missing values.
     - Use the mean (most common), median, mode, or another value.
     - For categorical values, use the mode (the most frequent value).
     - Note: We must split the data before imputing, so that information from the test set does not leak into the model. This problem is called **data leakage**.
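
The 5% rule for dropping can be sketched as follows. The `toy_missing` DataFrame is made up for illustration (the music dataset has no missing values), and the threshold is the rule of thumb stated above rather than a fixed requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (assumption: not the notebook's dataset)
toy_missing = pd.DataFrame({
    "bpm":   [120, np.nan, 110, 80] * 10,      # 25% missing
    "genre": ["pop", "rock", "pop", "jazz"] * 10,
    "year":  [2010] * 39 + [np.nan],           # 2.5% missing
})

# Fraction of missing values per column
missing_frac = toy_missing.isna().mean()
print(missing_frac)

# Columns that have missing values, but in fewer than 5% of the rows
sparse_cols = missing_frac[(missing_frac > 0) & (missing_frac < 0.05)].index.tolist()

# Drop only the rows that are missing a value in one of those columns
cleaned = toy_missing.dropna(subset=sparse_cols)
print(len(toy_missing), len(cleaned))
```

Here only `year` falls under the threshold, so one row is dropped, and the missing `bpm` values remain to be imputed.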




In [None]:
#drop rows with missing values in these columns (returns a new DataFrame; music_df itself is unchanged)
music_df.dropna(subset=['artist','top genre','year'])

Unnamed: 0.1,Unnamed: 0,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,popularity
0,1,Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,2,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,3,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,4,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,5,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
598,599,Mark Ronson,dance pop,2019,104,66,61,-7,20,16,176,1,3,75
599,600,Ed Sheeran,pop,2019,95,79,75,-6,7,61,206,21,12,75
600,601,DJ Khaled,dance pop,2019,136,76,53,-5,9,65,260,7,34,70
601,602,Mark Ronson,dance pop,2019,114,79,60,-6,42,24,217,1,7,69


In [None]:
#imputing

from sklearn.impute import SimpleImputer

#split categorical and numerical data

X_cat = music_df['top genre'].values.reshape(-1,1)
X_num = music_df.drop(['top genre','popularity','artist'], axis=1).values

y=music_df['popularity'].values

#train and test sets (the same random_state keeps the rows of both splits aligned)
X_train_cat, X_test_cat, y_train,y_test = train_test_split(X_cat,y,
                                                           test_size=0.3, random_state=42)

X_train_num, X_test_num, y_train,y_test = train_test_split(X_num,y,
                                                           test_size=0.3, random_state=42)

#imputing categorical values

imp_cat = SimpleImputer(strategy="most_frequent")

X_train_cat = imp_cat.fit_transform(X_train_cat)   #fit on the training set only, then transform it
X_test_cat =  imp_cat.transform(X_test_cat) #reuse the training-set fit on the test set

#imputing numerical values
imp_num = SimpleImputer() #default mean

X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

#concatenating the numerical and categorical features of each set
X_train = np.append(X_train_num,X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)

# Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

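#task: predict whether a song is dance pop or not

#convert the genre column to a binary target: 1 for dance pop, 0 otherwise
music_df['top genre'] = np.where(music_df['top genre'] == 'dance pop',1,0)

#drop the target and the text column artist, since the default mean imputation needs numeric features
X = music_df.drop(['top genre','artist'],axis=1).values
y= music_df['top genre'].values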

#create pipeline
#declare steps

steps = [("imputation", SimpleImputer()),
         ("logistic_regression", LogisticRegression())]

pipeline = Pipeline(steps)

X_train,X_test, y_train, y_test = train_test_split( X,y, test_size = 0.3,
                                                   random_state =42)

#fit the imputer and the classifier on the training set only
pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)
#mean accuracy on the test set
pipeline.score(X_test,y_test)
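
A pipeline also helps with the data leakage concern from the imputation notes: when it is passed to cross_val_score, the imputer is refit on the training folds of every split. The sketch below assumes synthetic data from `make_classification` with values removed at random. The names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption: for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Replace roughly 10% of the feature values with NaN
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

demo_pipeline = Pipeline([("imputation", SimpleImputer()),
                          ("logistic_regression", LogisticRegression())])

# In each fold the imputer's means come from the training part only,
# so the held-out part never influences the imputed values
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(scores.mean())
```

Imputing the full dataset before cross-validation would instead let every fold's means include its own held-out rows.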