## Transformation

### Import the pandas library

In [1]:
import pandas as pd

### Create a DataFrame, which is a 2-D labeled data structure

In [2]:
df = pd.DataFrame([["Edward Remirez","Male",28,"Bachelors"],
["Arnav Sharma","Male",23,"Masters"],
["Sophia Smith","Female",19,"High School"]], columns=['Name','Gender','Age',
'Degree'])
df

,Name,Gender,Age,Degree
0,Edward Remirez,Male,28,Bachelors
1,Arnav Sharma,Male,23,Masters
2,Sophia Smith,Female,19,High School


### Import the OneHotEncoder class from the sklearn.preprocessing module

`OneHotEncoder` is a utility class that converts categorical data into binary indicator columns, a numeric format that most ML algorithms can consume.

In [3]:
from sklearn.preprocessing import OneHotEncoder

### Create an instance of the OneHotEncoder class and fit it to the `Gender` column of the DataFrame

The `fit()` method is used to analyze the `Gender` column, identify the unique categories, and learn the mapping from categories to one-hot vectors.
The result is stored in the variable `encoder_for_gender`, which can now be used to transform the `Gender` column or any other data with the same categories into one-hot vectors.

In [4]:
encoder_for_gender = OneHotEncoder().fit(df[['Gender']])

### Verify the values and their column indices

In [5]:
encoder_for_gender.categories_

[array(['Female', 'Male'], dtype=object)]

### Use the `encoder_for_gender` that was previously fitted to transform the `Gender` column

The `transform()` method applies the mapping learned by the `fit()` method to the `Gender` column, converting each category into a one-hot vector.
The result is stored in the variable `gender_values`, which now holds the one-hot encoded values for the `Gender` column. By default, `transform()` returns them as a sparse matrix.

In [6]:
gender_values = encoder_for_gender.transform(df[['Gender']]) 

### Convert the sparse matrix `gender_values` to a dense NumPy array using the `toarray()` method

In [7]:
gender_values.toarray()

array([[0., 1.],
       [0., 1.],
       [1., 0.]])
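The text above notes that a fitted encoder can be reused on other data with the same categories. A minimal sketch of that idea follows; the `new_people` rows are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same Gender values as the tutorial's DataFrame.
df = pd.DataFrame({"Gender": ["Male", "Male", "Female"]})
encoder_for_gender = OneHotEncoder().fit(df[["Gender"]])

# Hypothetical new rows, not part of the original data.
new_people = pd.DataFrame({"Gender": ["Female", "Male"]})
new_values = encoder_for_gender.transform(new_people[["Gender"]]).toarray()

# Column order follows encoder_for_gender.categories_: Female first, then Male.
print(new_values)
```

With the default settings, a category that was not seen during `fit()` raises an error at transform time, so new data should contain only the categories the encoder has learned.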

### Add the one-hot encoded `Gender` values to the DataFrame as new columns `Gender_F` and `Gender_M`

`Gender_F` will be 1 for females and 0 for males, and `Gender_M` will be 1 for males and 0 for females.

In [8]:
df[['Gender_F', 'Gender_M']] = gender_values.toarray()
df

,Name,Gender,Age,Degree,Gender_F,Gender_M
0,Edward Remirez,Male,28,Bachelors,0.0,1.0
1,Arnav Sharma,Male,23,Masters,0.0,1.0
2,Sophia Smith,Female,19,High School,1.0,0.0
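Hard-coding `Gender_F` and `Gender_M` relies on knowing the column order in advance. As a sketch of one alternative, the column names can be derived from `categories_` so they always match the encoder's ordering. The `Gender_<category>` naming scheme and the `encoded_df` and `result` names are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Name": ["Edward Remirez", "Arnav Sharma", "Sophia Smith"],
    "Gender": ["Male", "Male", "Female"],
})

encoder_for_gender = OneHotEncoder().fit(df[["Gender"]])
dense = encoder_for_gender.transform(df[["Gender"]]).toarray()

# Build one column name per learned category, in the encoder's own order.
new_columns = [f"Gender_{category}" for category in encoder_for_gender.categories_[0]]

# Wrap the dense array in a DataFrame that shares df's index, then join it.
encoded_df = pd.DataFrame(dense, columns=new_columns, index=df.index)
result = df.join(encoded_df)
print(result)
```

This keeps names and values aligned even if the set of categories changes between datasets.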


## Normalization

### Create a new DataFrame containing only the numeric and encoded columns

In [9]:
df = pd.DataFrame({'Age': {0: 28, 1: 23, 2: 19},
 'Gender_F': {0: 0.0, 1: 0.0, 2: 1.0},
 'Gender_M': {0: 1.0, 1: 1.0, 2: 0.0},
 'Degree_encoded': {0: 0.0, 1: 2.0, 2: 1.0}})
df

,Age,Gender_F,Gender_M,Degree_encoded
0,28,0.0,1.0,0.0
1,23,0.0,1.0,2.0
2,19,1.0,0.0,1.0
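The notebook does not show how `Degree_encoded` was produced. One way to obtain the same values, offered here only as an assumption, is scikit-learn's `OrdinalEncoder`, which assigns integer codes to the sorted categories.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

degrees = pd.DataFrame({"Degree": ["Bachelors", "Masters", "High School"]})

# Categories are sorted alphabetically by default:
# Bachelors -> 0, High School -> 1, Masters -> 2.
encoder_for_degree = OrdinalEncoder()
degree_encoded = encoder_for_degree.fit_transform(degrees[["Degree"]]).ravel()

print(encoder_for_degree.categories_)
print(degree_encoded)
```

The alphabetical order does not reflect the level of education. If a meaningful ranking is needed, an explicit order can be passed through the `categories` parameter of `OrdinalEncoder`.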


## 1. Min-Max Scaling

### Import the MinMaxScaler class from the `sklearn.preprocessing` module

`MinMaxScaler` is a utility class that scales numerical data to a specified range (the default is 0 to 1).

In [10]:
from sklearn.preprocessing import MinMaxScaler

### Create an instance of the MinMaxScaler class and fit it to the `Age` column of the DataFrame

The `fit()` method computes the minimum and maximum values of the `Age` column for later scaling.
The `transform()` method then scales the `Age` column using the minimum and maximum computed by `fit()`.

In [11]:
scaler = MinMaxScaler()
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])

df

,Age,Gender_F,Gender_M,Degree_encoded
0,1.0,0.0,1.0,0.0
1,0.444444,0.0,1.0,2.0
2,0.0,1.0,0.0,1.0
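For the default 0 to 1 range, min-max scaling computes `(x - min) / (max - min)` for each value. The sketch below reproduces the scaled `Age` values by hand with NumPy and compares them with `MinMaxScaler`; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([28.0, 23.0, 19.0])

# Manual min-max scaling: subtract the minimum, divide by the range.
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same result from scikit-learn, which expects a 2-D input.
from_sklearn = MinMaxScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(manual)
print(from_sklearn)
```

For example, the age 23 maps to (23 - 19) / (28 - 19) = 4/9, which is the 0.444444 shown in the table.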


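## 2. Standard Scaling

`StandardScaler` rescales a column to zero mean and unit variance. Here it is applied to the `Age` column that was already min-max scaled in the previous step.
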
In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])

In [13]:
df

,Age,Gender_F,Gender_M,Degree_encoded
0,1.2675,0.0,1.0,0.0
1,-0.090536,0.0,1.0,2.0
2,-1.176965,1.0,0.0,1.0
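Standard scaling computes `z = (x - mean) / std` for each value, where `std` is the population standard deviation. The sketch below reproduces the values in the table by hand, starting from the original ages. Because standardization removes any shift and positive rescaling, the min-max-scaled ages give the same z-scores.

```python
import numpy as np

ages = np.array([28.0, 23.0, 19.0])

# np.std uses the population formula (ddof=0) by default,
# which matches what StandardScaler uses.
z_from_raw = (ages - ages.mean()) / ages.std()

# The min-max-scaled ages from the previous section give the same z-scores.
ages_minmax = (ages - ages.min()) / (ages.max() - ages.min())
z_from_minmax = (ages_minmax - ages_minmax.mean()) / ages_minmax.std()

print(z_from_raw)
print(z_from_minmax)
```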


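You can view the parameters learned by the scaler, the per-column mean (`mean_`) and standard deviation (`scale_`). Because the scaler was fitted on the min-max-scaled `Age` column, the values describe that column rather than the original ages: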

In [14]:
print(scaler.mean_)
print(scaler.scale_)

[0.48148148]
[0.40908745]
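As a small sketch of how these parameters are used, the stored `mean_` and `scale_` let the scaler undo the transformation with `inverse_transform()`. The input below is the min-max-scaled `Age` column rebuilt by hand, and the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Min-max-scaled ages for 28, 23 and 19 (see the Min-Max Scaling section).
age_minmax = np.array([[1.0], [4 / 9], [0.0]])

scaler = StandardScaler().fit(age_minmax)
print(scaler.mean_)   # mean of the fitted column
print(scaler.scale_)  # population standard deviation of the fitted column

# Standardize, then map the z-scores back to the fitted column's scale.
standardized = scaler.transform(age_minmax)
restored = scaler.inverse_transform(standardized)
print(restored.ravel())
```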
