# Power Transformer 101

## What? 

### `PowerTransformer` is part of the `sklearn.preprocessing` module


### The individual transformations are also available in the `scipy.stats` module

- The key part of this technique is right in the name: **TRANSFORM**
- Two data transformers are inside this module
    - Box-Cox Transformation (takes **ONLY** strictly positive values)
    - Yeo-Johnson Transformation (takes **BOTH** positive and negative values)
    - What about 0?
        - Box-Cox rejects it (the input must be strictly positive); Yeo-Johnson handles it.
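
A minimal sketch of the input restrictions listed above, using `scipy.stats.boxcox` and `scipy.stats.yeojohnson` on a small made-up array (the values are illustrative, not taken from the housing data):

```python
import numpy as np
from scipy import stats

# Toy data containing a negative value and a zero (illustrative values only)
data = np.array([-2.0, 0.0, 1.5, 3.0, 10.0])

# Box-Cox needs strictly positive input, so zero or negative values raise ValueError
try:
    stats.boxcox(data)
    boxcox_ok = True
except ValueError as err:
    boxcox_ok = False
    print(f"Box-Cox failed: {err}")

# Yeo-Johnson accepts the same array and returns the transformed data and the fitted lambda
yj_values, yj_lambda = stats.yeojohnson(data)
print(yj_values.shape, yj_lambda)
```

If the data only contains a few zeros, switching to Yeo-Johnson is usually less invasive than editing the values to make Box-Cox work.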
    
    
![alt text](boxcox_beforeafter.png "Title")    

# The Math 

## Lambda is typically searched between -5 and 5

## The Confidence Interval is important! 

![alt text](math_boxcox.png "Title")
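
For reference, these are the standard textbook definitions of the two transformations, written with $x_t$ for the transformed value. This assumes the figure above shows the usual Box-Cox form; the Yeo-Johnson cases are added here for comparison.

Box-Cox (requires $x > 0$):

$$
x_t =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$

Yeo-Johnson (any real $x$):

$$
x_t =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & x \geq 0,\ \lambda \neq 0 \\
\ln(x + 1), & x \geq 0,\ \lambda = 0 \\
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & x < 0,\ \lambda \neq 2 \\
-\ln(1 - x), & x < 0,\ \lambda = 2
\end{cases}
$$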

In [28]:
!pip install scikit-learn
!pip install matplotlib


In [1]:
#Libraries needed for this demonstration
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import PowerTransformer
from scipy import stats
import seaborn as sns 
import matplotlib.pyplot as plt

%matplotlib inline 


## A Quick Demo

In [0]:
# Import the data, in this case the Ames Housing data. We are only working with one column.
df = pd.read_csv('./train.csv')

df['Garage Area'].head(10)


## Some Quick EDA 

In [0]:
df['Garage Area'].isnull().sum()

# There is only one NaN value; fill it with 0
df['Garage Area'] = df['Garage Area'].replace(np.nan, 0)

# Box-Cox requires strictly positive input, so a garage area of 0 makes it
# complain about non-positive data even though there are no negative values.
# As a workaround, replace the zeros with an arbitrary positive placeholder (50).
df['Garage Area'] = df['Garage Area'].replace(0, 50)

In [0]:
# Do we still have any nulls?
print(df['Garage Area'].isnull().sum())

# Is the minimum now non-zero?
print(df['Garage Area'].min())

In [0]:
garage_area = np.asarray(df['Garage Area'])

In [0]:
# We can see that this data is skewed and does not follow a normal distribution
sns.kdeplot(garage_area, shade=True);

In [0]:
# Values in the Box-Cox transformation (for lambda != 0):
# xt = (x**lambda - 1) / lambda

# Define the set of lambdas that we want to search over
lmbda = np.linspace(start=0.5, stop=1.0, num=4)

xt = []
for lam in lmbda:
    # x = the input data
    x = garage_area
    # Box-Cox transformation for this lambda
    transform = (x**lam - 1) / lam
    # Append the transformed array to the list
    xt.append(transform)
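
As a sanity check on the hand-written formula above, the sketch below compares it with `scipy.stats.boxcox` called with a fixed `lmbda`. The sample array and the helper name `box_cox_manual` are made up for illustration. It also covers the lambda = 0 case, where the transform is defined as the natural log.

```python
import numpy as np
from scipy import stats

# Small strictly positive sample (illustrative values, not the housing data)
x = np.array([50.0, 240.0, 480.0, 576.0, 1200.0])

def box_cox_manual(values, lam):
    """Box-Cox transform for a fixed lambda; lambda == 0 falls back to log."""
    if lam == 0:
        return np.log(values)
    return (values**lam - 1) / lam

for lam in (0.0, 0.5, 1.0):
    manual = box_cox_manual(x, lam)
    # With lmbda given, stats.boxcox returns only the transformed array
    reference = stats.boxcox(x, lmbda=lam)
    print(lam, np.allclose(manual, reference))
```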

In [0]:
print('These are the transformed values (Box-Cox transformation) for a given lambda')
print('--'*50)
print(xt[1])  # we can grab the transformed values for an individual lambda
print('--'*50)
print('These are the lambdas which we used in this transformation')
print(lmbda)  # this is the array of lambdas we generated

In [0]:
lmbda[0]

In [0]:
print (lmbda)

In [0]:
# round()

In [0]:
sns.kdeplot(garage_area, shade=True, label="Raw")
# One KDE per lambda, labelled with the lambda value
for values, lam in zip(xt, lmbda):
    sns.kdeplot(values, shade=True, label=f"BC_Transform (Lambda Value: {round(lam, 3)})")

# control x and y limits
plt.ylim(0, 0.06)
plt.xlim(0, 300)


In [0]:
# The trick of the Box-Cox transformation is finding the lambda value.
# In practice this is cheap: the following function returns the transformed
# variable, the fitted lambda, and a confidence interval for lambda at the given alpha level.

# garage_area_xt: transformed variable
# maxlog: lambda that maximizes the log-likelihood
# interval: confidence interval for lambda

# http://dataunderthehood.com/2018/01/15/box-cox-transformation-with-python/

# alpha=0.05 gives the 100 * (1 - alpha) = 95% confidence interval
garage_area_xt, maxlog, interval = stats.boxcox(garage_area, alpha=0.05)

print(maxlog)
print(interval)
print(garage_area_xt)
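
A small sketch of how `alpha` maps to the confidence level in `scipy.stats.boxcox`, run on synthetic log-normal data (the seed, sample size, and grid are arbitrary assumptions). It also uses `scipy.stats.boxcox_llf` to check that the returned lambda scores at least as well as any value on a coarse grid.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive data (assumed for illustration)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=0.5, size=500)

# alpha=0.05 requests a 100 * (1 - 0.05) = 95% confidence interval for lambda
transformed, best_lambda, (ci_low, ci_high) = stats.boxcox(sample, alpha=0.05)
print(f"lambda = {best_lambda:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")

# Compare the log-likelihood at the fitted lambda against a coarse grid of lambdas
grid = np.linspace(-2, 2, 41)
grid_llf = np.array([stats.boxcox_llf(lam, sample) for lam in grid])
best_llf = stats.boxcox_llf(best_lambda, sample)
print(best_llf, grid_llf.max())
```

A narrow interval suggests the data pin lambda down well. If the interval includes 1, the transformation may not be buying much, since lambda = 1 only shifts the data.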


In [0]:
# Changes in the KDE... 
sns.kdeplot(garage_area, shade=True, label="Raw")
sns.kdeplot(garage_area_xt, shade=True, label="Transformed")
plt.title('Comparing the Raw Input to the Transformed Output')

In [0]:
## Z-Score (garage area)

garage_area_std = np.std(garage_area)

print (garage_area_std)

garage_area_Z = ((garage_area -garage_area.mean()) / (garage_area_std)) 

# plt.hist(garage_area_Z);

In [0]:
## Z-Score (garage area_xt)

garage_area_xt_std = np.std(garage_area_xt)

print (garage_area_xt_std)

garage_area_xt_Z = ((garage_area_xt - garage_area_xt.mean()) / (garage_area_xt_std))

# plt.hist(garage_area_xt_Z);
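
The two cells above standardize by hand. As a cross-check, `scipy.stats.zscore` performs the same computation and, like `np.std`, uses the population standard deviation (`ddof=0`) by default. A sketch on made-up values:

```python
import numpy as np
from scipy import stats

# Illustrative values standing in for a feature column
values = np.array([50.0, 240.0, 480.0, 576.0, 1200.0])

# Manual z-score: subtract the mean, divide by the population standard deviation
manual_z = (values - values.mean()) / np.std(values)

# scipy.stats.zscore uses ddof=0 by default, matching np.std
scipy_z = stats.zscore(values)

print(np.allclose(manual_z, scipy_z))
print(manual_z.mean(), manual_z.std())
```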

In [0]:
# Changes in the KDE!  

plt.figure(figsize=(13,8))
sns.set_context('poster')
sns.kdeplot(garage_area_Z, shade=True, color='navy', label="Raw (normalized)")
sns.kdeplot(garage_area_xt_Z, shade=True, color='gold', label="Transformed (normalized)")

## Using the sklearn Module

In [0]:
# Let's use two features...

print (df['Lot Area'].min())

In [0]:
plt.figure(figsize=(8,5))
sns.kdeplot(df['Lot Area'], shade=True)

In [0]:
#Features 
X = df[['Garage Area', 'Lot Area']]

## Instantiate PowerTransformer 


### Parameters 

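**method** : str, (default=’yeo-johnson’)
The power transform method: ’yeo-johnson’ (works with positive and negative values) or ’box-cox’ (works with strictly positive values only).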

**standardize** : boolean, default=True
Set to True to apply zero-mean, unit-variance normalization to the transformed output.

### Attributes

**lambdas_**: array of float, shape (n_features,)
The parameters of the power transformation for the selected features.
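
A minimal sketch of these parameters and attributes on synthetic data (the array, seed, and the name `pt_demo` are assumptions for illustration). With the defaults `method='yeo-johnson'` and `standardize=True`, each output column should come out with roughly zero mean and unit variance, and `lambdas_` holds one value per feature.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Two synthetic right-skewed features; the second one includes negative values
rng = np.random.default_rng(42)
features = np.column_stack([
    rng.lognormal(mean=2.0, sigma=0.7, size=300),
    rng.exponential(scale=5.0, size=300) - 1.0,
])

# Defaults: method='yeo-johnson', standardize=True
pt_demo = PowerTransformer()
transformed = pt_demo.fit_transform(features)

print(pt_demo.lambdas_.shape)    # one lambda per feature
print(transformed.mean(axis=0))  # close to 0 for each column
print(transformed.std(axis=0))   # close to 1 for each column
```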



In [0]:
pt = PowerTransformer(method='box-cox', standardize=True)

# Fit the PowerTransformer to the data (fit returns the fitted transformer)
skl_boxcox = pt.fit(X)

In [0]:
#Here are the lambdas 
skl_boxcox.lambdas_

In [0]:
# Transform the data (this overwrites skl_boxcox with the transformed array)
skl_boxcox = pt.transform(X)
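
To connect this with the SciPy section, the sketch below fits `PowerTransformer(method='box-cox', standardize=False)` on synthetic data (assumed for illustration, as are the names `X_demo` and `pt_raw`) and checks that each column matches `scipy.stats.boxcox` evaluated at the lambda sklearn fitted. It also shows `inverse_transform`, which maps the transformed values back to the original scale.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Two synthetic strictly positive features (Box-Cox cannot handle zeros or negatives)
rng = np.random.default_rng(7)
X_demo = np.column_stack([
    rng.lognormal(mean=6.0, sigma=0.4, size=200),
    rng.lognormal(mean=9.0, sigma=0.6, size=200),
])

# standardize=False so the output is the plain Box-Cox transform
pt_raw = PowerTransformer(method='box-cox', standardize=False)
X_bc = pt_raw.fit_transform(X_demo)

# Each column should equal scipy's Box-Cox at the lambda sklearn fitted
for j, lam in enumerate(pt_raw.lambdas_):
    print(j, np.allclose(X_bc[:, j], stats.boxcox(X_demo[:, j], lmbda=lam)))

# inverse_transform undoes the transformation
X_back = pt_raw.inverse_transform(X_bc)
print(np.allclose(X_back, X_demo))
```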

In [0]:
# skl_boxcox

In [0]:
lot_area = np.asarray(df['Lot Area'])

In [0]:
plt.figure(figsize=(13,8))
sns.set_context('poster')
sns.kdeplot(garage_area, shade=True, color='navy', label="Garage Area")
sns.kdeplot(lot_area, shade=True, color='gold', label="Lot Area")

# control x and y limits
# plt.ylim(0, 0.06)
# plt.xlim(0, 20_000)



In [0]:
plt.figure(figsize=(8,4))
plt.hist(skl_boxcox, bins=15);

## Good Tidbits:

"Box-cox transformation is a statistical technique used to remove heteroscedasticity of a variable and also make it look like more normally distributed, which represents a big deal for statisticians and economists regarding normality and homoscedasticity assumptions for linear models."

http://dataunderthehood.com/2018/01/15/box-cox-transformation-with-python/



"For example, the data may have a skew, meaning that the bell in the bell shape may be pushed one way or another."

https://machinelearningmastery.com/how-to-transform-data-to-fit-the-normal-distribution/



"But, generally the answer is that for most meaningful analysis, you need the same 𝜆 value for all datasets. The reason is that the Box-Cox transformation **not only changes the scale of the data, it also changes the unit of measurement**. "

https://stats.stackexchange.com/questions/243975/skewness-transformation-for-one-but-not-the-other-variable/243984#243984
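
One way to apply this advice in code, sketched with SciPy on synthetic data (the two samples and all variable names are assumptions for illustration): estimate lambda on one dataset, then reuse that same lambda for the other instead of fitting a second one.

```python
import numpy as np
from scipy import stats

# Two synthetic positive samples standing in for two datasets to be compared
rng = np.random.default_rng(1)
data_a = rng.lognormal(mean=3.0, sigma=0.5, size=400)
data_b = rng.lognormal(mean=3.2, sigma=0.5, size=400)

# Estimate lambda on the first dataset only
a_transformed, shared_lambda = stats.boxcox(data_a)

# Reuse the same lambda for the second dataset so both share one scale
b_transformed = stats.boxcox(data_b, lmbda=shared_lambda)

print(shared_lambda, a_transformed.mean(), b_transformed.mean())
```

The same idea applies to `PowerTransformer`: call `fit` once and then `transform` every dataset with the fitted object.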


#### References: 

http://www.kmdatascience.com/2017/07/box-cox-transformations-in-python.html

    