# Data Preprocessing with scikit-learn

#### The main task for machine learning engineers is to **first analyze the data for viable trends, then create an efficient input pipeline for training a model.** This process involves using libraries like NumPy and pandas for handling data, along with machine learning frameworks like TensorFlow for creating the model and input pipeline. 
#### The scikit-learn library includes tools for data preprocessing and data mining. It is imported in Python with the statement:
> **import sklearn**

# Standardizing Data

#### When data can take on any range of values, it is difficult to interpret. Data scientists therefore convert the data into a standard format to make it easier to understand. **The standard format refers to data that has zero mean and unit variance (i.e. standard deviation = 1), and the process of converting data into this format is called data standardization.**

#### Data standardization is a relatively simple process. For each data value, x, we subtract the overall mean of the data, μ, then divide by the overall standard deviation, σ. The new value, z, is the standardized data value: z = (x - μ) / σ.
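
#### The formula can be checked by hand with NumPy. The following is a minimal sketch; the sample values in x are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # overall mean: 5.0
sigma = x.std()   # population standard deviation (ddof=0): 2.0

# Standardize: subtract the mean, then divide by the standard deviation
z = (x - mu) / sigma

print(z)          # standardized values
print(z.mean())   # mean of the standardized data
print(z.std())    # standard deviation of the standardized data
```

#### Note that NumPy's std uses the population standard deviation by default, which is also what the standardized output in this notebook is consistent with.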

# NumPy and scikit-learn

#### scikit-learn functions take 2-D NumPy arrays as input data. The array's rows represent individual data observations, while each column represents a particular feature of the data, i.e. the same format as a spreadsheet data table.
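
#### As a small illustration of this layout, the sketch below builds a hypothetical 3 x 2 array (the values and variable names are invented for the example) and pulls out one observation and one feature.

```python
import numpy as np

# Hypothetical table: 3 observations (rows) x 2 features (columns)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# shape is (number of observations, number of features)
n_observations, n_features = data.shape

first_observation = data[0]    # one row: every feature of one observation
second_feature = data[:, 1]    # one column: one feature across all observations

print(n_observations, n_features)
print(first_observation)
print(second_feature)
```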

#### The scikit-learn data preprocessing module is called **sklearn.preprocessing**. One of the functions in this module, **scale**, applies data standardization along a given axis of a NumPy array.

In [2]:
import numpy as np
# Define the pizza data (rows are observations, columns are features)
pizza_data = np.array([[2100,   10,  800],
                       [2500,   11,  850],
                       [1800,   10,  760],
                       [2000,   12,  800],
                       [2300,   11,  810]])
# Newline to separate print statements
print('{}\n'.format(repr(pizza_data)))

from sklearn.preprocessing import scale
# Standardizing each column of pizza_data
col_standardized = scale(pizza_data)
print('{}\n'.format(repr(col_standardized)))

# Column means (rounded to nearest thousandth)
col_means = col_standardized.mean(axis=0).round(decimals=3)
print('{}\n'.format(repr(col_means)))

# Column standard deviations
col_stds = col_standardized.std(axis=0)
print('{}\n'.format(repr(col_stds)))

array([[2100,   10,  800],
       [2500,   11,  850],
       [1800,   10,  760],
       [2000,   12,  800],
       [2300,   11,  810]])

array([[-0.16552118, -1.06904497, -0.1393466 ],
       [ 1.4896906 ,  0.26726124,  1.60248593],
       [-1.40693001, -1.06904497, -1.53281263],
       [-0.57932412,  1.60356745, -0.1393466 ],
       [ 0.66208471,  0.26726124,  0.2090199 ]])

array([ 0., -0.,  0.])

array([1., 1., 1.])





#### We normally standardize the data independently across each feature of the data array. This way, we can see how many standard deviations a particular observation's feature value is from the mean.
#### If for some reason we need to standardize the data across rows, rather than columns, we can set the axis keyword argument in the scale function to 1. This may be the case when analyzing data within observations, rather than within a feature.
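
#### A minimal sketch of row-wise standardization follows. The array is hypothetical; only scale and its axis keyword argument come from the text above.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical data: 2 observations (rows) x 3 features (columns)
data = np.array([[1.0, 2.0, 3.0],
                 [10.0, 20.0, 60.0]])

# axis=1 standardizes each row independently instead of each column
row_standardized = scale(data, axis=1)

# Each row now has (approximately) zero mean and unit standard deviation
row_means = row_standardized.mean(axis=1)
row_stds = row_standardized.std(axis=1)

print(row_standardized)
print(row_means)
print(row_stds)
```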

## Preprocessing Using Pandas

In [2]:
# Import library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [28]:
music = pd.read_csv('datasets/music.csv')
print(music.columns)
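# Note: 'violence' is listed twice below, so the selection contains a duplicate
# column (it appears as 'violence.1' in the output that follows)
music_small = music[['release_date', 'len', 'dating', 'violence', 'world/life', 'night/time',
       'shake the audience', 'family/gospel', 'romantic', 'communication',
       'obscene', 'music', 'movement/places', 'light/visual perceptions', 'violence', 'age', 'genre', 'topic']]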
music_small.head()
# music_small.value_counts()

Index(['Unnamed: 0', 'artist_name', 'track_name', 'release_date', 'genre',
       'lyrics', 'len', 'dating', 'violence', 'world/life', 'night/time',
       'shake the audience', 'family/gospel', 'romantic', 'communication',
       'obscene', 'music', 'movement/places', 'light/visual perceptions',
       'family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability',
       'loudness', 'acousticness', 'instrumentalness', 'valence', 'energy',
       'topic', 'age'],
      dtype='object')


Unnamed: 0,release_date,len,dating,violence,world/life,night/time,shake the audience,family/gospel,romantic,communication,obscene,music,movement/places,light/visual perceptions,violence.1,age,genre,topic
0,1950,95,0.000598,0.063746,0.000598,0.000598,0.000598,0.048857,0.017104,0.263751,0.000598,0.039288,0.000598,0.000598,0.063746,1.0,pop,sadness
1,1950,51,0.035537,0.096777,0.443435,0.001284,0.001284,0.027007,0.001284,0.001284,0.001284,0.118034,0.001284,0.212681,0.096777,1.0,pop,world/life
2,1950,24,0.00277,0.00277,0.00277,0.00277,0.00277,0.00277,0.158564,0.250668,0.00277,0.323794,0.00277,0.00277,0.00277,1.0,pop,music
3,1950,54,0.048249,0.001548,0.001548,0.001548,0.0215,0.001548,0.411536,0.001548,0.001548,0.001548,0.12925,0.001548,0.001548,1.0,pop,romantic
4,1950,48,0.00135,0.00135,0.417772,0.00135,0.00135,0.00135,0.46343,0.00135,0.00135,0.00135,0.00135,0.00135,0.00135,1.0,pop,romantic


#### In the music DataFrame, the genre and topic columns contain categorical values, so we need to convert them into numeric values.

#### We can do this with ```pd.get_dummies()```, which creates one binary (dummy) column per category.
#### Notably, we pass ```drop_first=True``` (the default is False) because we don't want to create redundant columns. **How does that work?**
>#### **The first category of each categorical column is dropped (for example, topic_feelings is absent from the first output below). A song with 0 in every remaining dummy column of a feature must belong to the dropped category, so no information is lost.**
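
#### The effect is easier to see on a tiny example. The sketch below uses a hypothetical four-row DataFrame (not the music dataset) with a single genre column.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate drop_first
toy = pd.DataFrame({"genre": ["pop", "rock", "blues", "pop"]})

# One dummy column per category
full = pd.get_dummies(toy, drop_first=False)
# The first category in sorted order ("blues") is dropped
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # ['genre_blues', 'genre_pop', 'genre_rock']
print(list(reduced.columns))  # ['genre_pop', 'genre_rock']

# Rows with 0 in every remaining dummy column belong to the dropped category
is_dropped_category = reduced.sum(axis=1) == 0
print(is_dropped_category.tolist())  # [False, False, True, False]
```

#### The third row ("blues") is the only one with no active dummy column, so the dropped column can always be reconstructed from the others.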

In [29]:
# drop_first=True drops the first category of each categorical column
music_dummies_drop_first = pd.get_dummies(music_small, drop_first=True)
# drop_first=False (the default) keeps one column for every category
music_dummies_all = pd.get_dummies(music_small, drop_first=False)
display(music_dummies_drop_first.head())
display(music_dummies_all.head())

# Use the drop_first=True version for modeling
music_dummies = music_dummies_drop_first

Unnamed: 0,release_date,len,dating,violence,world/life,night/time,shake the audience,family/gospel,romantic,communication,...,genre_pop,genre_reggae,genre_rock,topic_music,topic_night/time,topic_obscene,topic_romantic,topic_sadness,topic_violence,topic_world/life
0,1950,95,0.000598,0.063746,0.000598,0.000598,0.000598,0.048857,0.017104,0.263751,...,1,0,0,0,0,0,0,1,0,0
1,1950,51,0.035537,0.096777,0.443435,0.001284,0.001284,0.027007,0.001284,0.001284,...,1,0,0,0,0,0,0,0,0,1
2,1950,24,0.00277,0.00277,0.00277,0.00277,0.00277,0.00277,0.158564,0.250668,...,1,0,0,1,0,0,0,0,0,0
3,1950,54,0.048249,0.001548,0.001548,0.001548,0.0215,0.001548,0.411536,0.001548,...,1,0,0,0,0,0,1,0,0,0
4,1950,48,0.00135,0.00135,0.417772,0.00135,0.00135,0.00135,0.46343,0.00135,...,1,0,0,0,0,0,1,0,0,0


Unnamed: 0,release_date,len,dating,violence,world/life,night/time,shake the audience,family/gospel,romantic,communication,...,genre_reggae,genre_rock,topic_feelings,topic_music,topic_night/time,topic_obscene,topic_romantic,topic_sadness,topic_violence,topic_world/life
0,1950,95,0.000598,0.063746,0.000598,0.000598,0.000598,0.048857,0.017104,0.263751,...,0,0,0,0,0,0,0,1,0,0
1,1950,51,0.035537,0.096777,0.443435,0.001284,0.001284,0.027007,0.001284,0.001284,...,0,0,0,0,0,0,0,0,0,1
2,1950,24,0.00277,0.00277,0.00277,0.00277,0.00277,0.00277,0.158564,0.250668,...,0,0,0,1,0,0,0,0,0,0
3,1950,54,0.048249,0.001548,0.001548,0.001548,0.0215,0.001548,0.411536,0.001548,...,0,0,0,0,0,0,1,0,0,0
4,1950,48,0.00135,0.00135,0.417772,0.00135,0.00135,0.00135,0.46343,0.00135,...,0,0,0,0,0,0,1,0,0,0


In [32]:
# Import the model and the cross-validation tools
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Define a 6-fold cross-validation splitter
kf = KFold(n_splits=6)

# Create X and y
X = music_dummies.drop('age', axis=1).values
y = music_dummies[['age']].values

# Instantiate a ridge model
ridge = Ridge(alpha=0.2)

# Perform cross-validation
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")

# Calculate RMSE
rmse = np.sqrt(-scores)
print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))

Average RMSE: 1.4681512562530744e-08
Standard Deviation of the target array: 0.2641019611262181
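
#### An average RMSE this far below the standard deviation of the target usually indicates that some feature almost fully determines the target. One plausible candidate here (an assumption, not verified against the dataset) is release_date, if age was derived from it. The sketch below reproduces the pattern on synthetic data; the year feature, the noise feature, and the formula for y are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical features: a release year and an unrelated noise column
year = rng.integers(1950, 2020, size=300).astype(float)
noise_feature = rng.normal(size=300)
X = np.column_stack([year, noise_feature])

# Hypothetical target computed directly from the year feature
y = (2019 - year) / 69

# Same model and cross-validation settings as in the cell above
kf = KFold(n_splits=6)
scores = cross_val_score(Ridge(alpha=0.2), X, y, cv=kf,
                         scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)

print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))
```

#### If the same thing is happening in the music data, dropping the suspected feature from X and rerunning the cross-validation should give a more realistic error estimate.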
