# ***Power Transformers In-Depth Understanding***
***sklearn.preprocessing.PowerTransformer***

>class sklearn.preprocessing.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)

1. ***Applies a power transform feature-wise to make the data more Gaussian-like (that is, closer to a normal distribution).***


2. ***Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.***


3. ***Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter (lambda) for stabilizing variance and minimizing skewness is estimated through maximum likelihood. The parameter can also be estimated with Bayesian methods, but scikit-learn does not implement that.***


4. ***Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports positive, zero, and negative data.***


5. ***By default, zero-mean, unit-variance normalization is applied to the transformed data, so a separate standardization step is not needed.***
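
The two transforms named in the list above can be written out directly. The following is a minimal sketch, assuming the standard textbook definitions; the helper names `box_cox` and `yeo_johnson` are hypothetical and not part of scikit-learn. Lambda is fixed by hand here, whereas `PowerTransformer` estimates it per feature. The results are checked against `scipy.stats.boxcox` and `scipy.stats.yeojohnson`.

```python
import numpy as np
from scipy import stats


def box_cox(x, lmbda):
    """Box-Cox transform for strictly positive x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform for any real x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative branch: Box-Cox applied to (x + 1)
    if lmbda == 0:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    # Negative branch: mirrored transform with exponent (2 - lambda)
    if lmbda == 2:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -(((-x[~pos]) + 1.0) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    return out


x_pos = np.array([0.5, 1.0, 2.0, 10.0])        # strictly positive, valid for Box-Cox
x_any = np.array([-3.0, -0.5, 0.0, 0.5, 4.0])  # mixed signs, valid for Yeo-Johnson

bc_matches = np.allclose(box_cox(x_pos, 0.3), stats.boxcox(x_pos, lmbda=0.3))
yj_matches = np.allclose(yeo_johnson(x_any, 0.3), stats.yeojohnson(x_any, lmbda=0.3))
print(bc_matches, yj_matches)
```

Passing zero or negative values to `box_cox` raises a `ValueError`, which mirrors the strict-positivity requirement in point 4.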


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as stats

from sklearn.model_selection import train_test_split,cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PowerTransformer


In [None]:
df = pd.read_csv('../input/regression-with-neural-networking/concrete_data.csv')

In [None]:
df.head(4)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()  # Some columns contain 0, which is a problem for Box-Cox

In [None]:
df.isnull().sum()

In [None]:
X = df.drop(columns=['Strength'])
y = df.iloc[:,-1]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)

y_pred = lr.predict(X_test)

r2_score(y_test,y_pred)

In [None]:
# Cross-validate: the mean R2 across folds is much lower than the single train/test split suggests
np.mean(cross_val_score(lr,X,y,scoring ='r2'))
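
When no `cv` argument is given, `cross_val_score` splits a regression dataset into 5 consecutive, unshuffled folds, so the row order of the file can influence the score. The sketch below shows how an explicit, shuffled `KFold` could be passed instead. It runs on synthetic data from `make_regression`, which is only a stand-in for the concrete dataset, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data (assumption, not the concrete dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Default behaviour: 5 consecutive folds, no shuffling
default_scores = cross_val_score(LinearRegression(), X_demo, y_demo, scoring="r2")

# Explicit splitter: 5 folds, rows shuffled with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(LinearRegression(), X_demo, y_demo, scoring="r2", cv=cv)

print(default_scores.mean(), shuffled_scores.mean())
```

On the real data the two means may differ noticeably if the rows are ordered in some systematic way.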

In [None]:
%matplotlib inline

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

## ***Distribution Plots and QQ plots for each feature***

In [None]:
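for col in X_train.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))

    # Left: kernel density estimate of the raw feature
    sns.kdeplot(X_train[col], ax=axes[0])
    axes[0].set_title(col)

    # Right: QQ plot against a normal distribution, drawn explicitly on the second axis
    stats.probplot(X_train[col], dist='norm', plot=axes[1])
    axes[1].set_title(col)

    plt.show()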
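
A QQ plot can also be summarized by a single number. When called without the `plot` argument, `stats.probplot` returns the correlation coefficient `r` of the least-squares line through the QQ points; values close to 1 indicate a nearly normal sample. The sketch below uses synthetic samples (an assumption, not the concrete dataset) to show the idea.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)       # should hug the QQ line
skewed_sample = rng.exponential(size=500)  # right-skewed, bends away from the line

# Without `plot`, probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(r_normal, r_skewed)
```

The same call could be applied per column of `X_train` to rank features by how far they are from normal.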

In [None]:
# Apply the Box-Cox transform.
# The unscaled data is used here because PowerTransformer already standardizes
# each column by default (standardize=True).

pt = PowerTransformer(method='box-cox')

# Box-Cox cannot handle 0 values, so a very small constant is added to every value.
# The shift is tiny relative to the feature values (a trick picked up from a blog post).
X_train_transformed = pt.fit_transform(X_train + 0.000001)
X_test_transformed = pt.transform(X_test + 0.000001)

pd.DataFrame({'cols': X_train.columns, 'box-cox_lambdas': pt.lambdas_})
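
To see why the offset is needed, and to check how sensitive the fitted lambda is to its size, the sketch below uses a synthetic zero-inflated column. The data, the variable names, and the list of offsets are all assumptions for illustration. It first confirms that Box-Cox rejects zeros, then fits the transformer with several offsets.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical zero-inflated feature: roughly 30% exact zeros, the rest right-skewed
col = np.where(rng.random(500) < 0.3, 0.0, rng.gamma(shape=2.0, scale=50.0, size=500))
X_demo = col.reshape(-1, 1)

# Box-Cox raises a ValueError when the data is not strictly positive
try:
    PowerTransformer(method="box-cox").fit(X_demo)
    raised = False
except ValueError:
    raised = True
print("Box-Cox rejects zeros:", raised)

# Fit with different offsets and record the estimated lambda for each
lambdas = {}
for eps in (1e-6, 1e-3, 1.0):
    pt_demo = PowerTransformer(method="box-cox").fit(X_demo + eps)
    lambdas[eps] = pt_demo.lambdas_[0]
print(lambdas)
```

If the lambdas vary a lot with the offset, the choice of constant matters for that column, and Yeo-Johnson, which accepts zeros directly, may be the simpler option.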

In [None]:
# applying Linear regression on transformed data

lr.fit(X_train_transformed,y_train)
y_pred2 = lr.predict(X_test_transformed)

r2_score(y_test,y_pred2)

In [None]:
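# The score on the single split is better, but always check the cross-validation score as well.
# A separate transformer is fitted here so that `pt` keeps the lambdas learned on the training split.
pt_full = PowerTransformer(method='box-cox')
X_transformed = pt_full.fit_transform(X + 0.0000001)
np.mean(cross_val_score(lr, X_transformed, y, scoring='r2'))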
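
Fitting the transformer on all of `X` before cross-validating, as in the cell above, lets every validation fold influence the lambdas. One possible alternative is to put the offset, the transformer, and the regressor into a single pipeline so that the lambdas are re-estimated inside each training fold. The sketch below runs on synthetic data; `add_offset`, `EPS`, `X_demo`, and `y_demo` are hypothetical names, and the data is not the concrete dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

EPS = 1e-6  # hypothetical offset so that zeros become strictly positive


def add_offset(X):
    """Shift all values by EPS (hypothetical helper)."""
    return X + EPS


rng = np.random.default_rng(0)
X_demo = rng.gamma(shape=2.0, scale=3.0, size=(300, 4))
X_demo[rng.random(X_demo.shape) < 0.1] = 0.0  # inject some exact zeros
y_demo = np.log1p(X_demo).sum(axis=1) + rng.normal(scale=0.1, size=300)

# Each cross-validation fold refits the whole pipeline on its own training part
model = make_pipeline(
    FunctionTransformer(add_offset),
    PowerTransformer(method="box-cox"),
    LinearRegression(),
)
scores = cross_val_score(model, X_demo, y_demo, scoring="r2", cv=5)
print(scores.mean())
```

With this layout the validation rows never contribute to the lambda estimates, so the score is a less optimistic estimate of performance on unseen data.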

In [None]:
X_train_transformed = pd.DataFrame(X_train_transformed , columns = X_train.columns)
X_test_transformed = pd.DataFrame(X_test_transformed , columns = X_test.columns)

## ***Before and after comparison for Box-Cox plot { Distribution Plots }***
- Left: before transformation
- Right: after transformation

In [None]:
# Before and after comparison for Box-Cox plot
for col in X_train_transformed.columns:
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    sns.kdeplot(X_train[col])
    plt.title(col)
    
    plt.subplot(122)
    sns.kdeplot(X_train_transformed[col])
    plt.title(col)
    
    plt.show()

In [None]:
# Now apply the Yeo-Johnson transform (the default method of PowerTransformer)

pt1 = PowerTransformer()
X_train_transformed2 = pt1.fit_transform(X_train)
X_test_transformed2 = pt1.transform(X_test)

lr1 = LinearRegression()

lr1.fit(X_train_transformed2,y_train)

y_pred3 = lr1.predict(X_test_transformed2)

print(r2_score(y_test,y_pred3))

pd.DataFrame({'cols':X_train.columns , 'yeo-johnson': pt1.lambdas_})

In [None]:
pt2 = PowerTransformer()
X_transformed2 = pt2.fit_transform(X)
np.mean(cross_val_score(lr1 , X_transformed2 , y, scoring = 'r2'))

In [None]:
X_train_transformed2 = pd.DataFrame(X_train_transformed2 ,columns=X_train.columns)
X_test_transformed2 = pd.DataFrame(X_test_transformed2 ,columns=X_test.columns)

## ***Before and after comparison for Yeo-Johnson plot { Distribution Plots }***

In [None]:

for col in X_train_transformed2.columns:
    plt.figure(figsize=(14,4))
    plt.subplot(121)
    sns.kdeplot(X_train[col])
    plt.title(col)
    
    plt.subplot(122)
    sns.kdeplot(X_train_transformed2[col])
    plt.title(col)
    
    plt.show()
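
Before-and-after density plots can be backed up with a number. The sketch below computes the sample skewness with `scipy.stats.skew` before and after each transform; a value near 0 means the distribution is roughly symmetric. It uses a synthetic log-normal feature, which is an assumption for illustration and not one of the concrete columns.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Hypothetical strictly positive, right-skewed feature
X_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Skewness of the raw feature, then after each power transform
skew_by_method = {"raw": skew(X_demo[:, 0])}
for method in ("box-cox", "yeo-johnson"):
    transformed = PowerTransformer(method=method).fit_transform(X_demo)
    skew_by_method[method] = skew(transformed[:, 0])

print(skew_by_method)
```

Running the same loop over the columns of `X_train` would give a per-feature table to read alongside the plots.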

## ***Observations :***
> We calculated the r2 score without any power transformer, with the Box-Cox transformation, and with the Yeo-Johnson transformation applied to all the features, and plotted the features in each case to see how the data is distributed.

>## Without any transformation
     - r2 score: ***0.6275531792314851***
     - cross val score: ***0.4609940491662866***
>## Box-Cox Transformation
     - r2 score: ***0.8047825006181188***
     - cross val score: ***0.6658537942219864***
>## Yeo-Johnson Transformation
     - r2 score: ***0.8161906513339305***
     - cross val score: ***0.6834625134285746***


## ***Box-Cox and Yeo-Johnson Lambda Values for Each Column***

In [None]:
pd.DataFrame({'cols': X_train.columns ,'box-cox-lambdas': pt.lambdas_ , 'Yeo-Johnson_lambdas':pt1.lambdas_})

# Thanks for giving it a read !!