# **TRANSFORMATION**
- It is the process of converting skewed data into an approximately normal distribution.
- **USE** - Many inferential statistics methods assume the data is normally distributed, so we transform the data before applying them; otherwise transformation is not required.
- **HOW** - By applying one of the following transformations:

    - *for right skewed data* -- log and nth root transformation.
    - *for left skewed data* -- exp(exponential) and nth power transformation.
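
The rule above can be sketched as a small helper that looks only at the sign of the skewness. The function name `suggest_transformations` and the two sample series are hypothetical and made up for illustration; they are not taken from Sample.csv.

```python
import pandas as pd


def suggest_transformations(series: pd.Series) -> list:
    """Return candidate transformations based on the sign of the skewness."""
    skew = series.skew()
    if skew > 0:   # long tail on the right
        return ["log", "nth root"]
    if skew < 0:   # long tail on the left
        return ["exp", "nth power"]
    return []      # perfectly symmetric: nothing to suggest


# Made-up values for illustration only (assumption, not the notebook's data)
right_tail = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 15, 40])
left_tail = pd.Series([1, 26, 33, 36, 37, 38, 38, 39, 39, 40])

print(suggest_transformations(right_tail))
print(suggest_transformations(left_tail))
```

This only picks the family of transformations; which member of the family works best still has to be checked by computing the skewness after transforming.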


In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv('/Users/sakshisahu/Documents/learning/Data_Analytics/data/Sample.csv')
df.head()

Unnamed: 0,ID,Income,ReviewScore,Height,Age,ProductViews
0,1,22000,5.0,168,25,50
1,2,25000,5.0,172,30,60
2,3,30000,4.9,171,28,100
3,4,40000,4.8,173,34,150
4,5,60000,4.8,169,29,180


In [3]:
df.shape

(10, 6)

In [4]:
continuous = ['Income', 'ReviewScore', 'ProductViews']  # continuous columns to check
df[continuous].skew()

Income          1.589953
ReviewScore    -1.982947
ProductViews    0.514834
dtype: float64

## **1. for right skewed data**
- We don't need to apply both the log and the nth root transformation together; try each one separately and keep whichever gives skewness closest to 0.
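
A minimal sketch of that comparison, assuming a made-up right-skewed series (the values in `income` are illustrative, not the notebook's data, and the dictionary names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values for illustration only
income = pd.Series([20000, 25000, 30000, 40000, 60000,
                    65000, 70000, 90000, 150000, 400000])

# Apply each candidate separately, never stacked on top of each other
candidates = {
    "log": np.log(income),
    "square_root": income ** (1 / 2),
    "cube_root": income ** (1 / 3),
}

# Skewness after each transformation
skews = {name: values.skew() for name, values in candidates.items()}

# Keep the transformation whose skewness is closest to 0
best = min(skews, key=lambda name: abs(skews[name]))

print("original skewness:", income.skew())
print(skews)
print("closest to 0:", best)
```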

In [5]:
df['Income'].skew()  # right skewed: the skewness is positive (here > 1)

np.float64(1.58995343495131)

### *1.1 log transformation*
- It means applying log (**np.log()**) to each value of a column.

- Before applying log, check the minimum value of the column: **log(0) = -∞** (NumPy returns `-inf`). With an infinite value in the column, the skewness comes out as NaN, and other statistical measures such as the mean are no longer meaningful either, because ∞ is not a fixed value.

- So, if a column actually contains 0, add **0.01 to each value** of that column before taking the log; such a small constant has little impact on the result and resolves the 0 problem.

- As a rule of thumb, if the skewness lies between -1 and 1, the column is treated as approximately normally distributed.
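
The zero problem and the 0.01 offset described above can be sketched as follows. The series `views` is made up for illustration, and `offset` is a hypothetical name; the check on the minimum is one possible way to decide whether the offset is needed.

```python
import numpy as np
import pandas as pd

# Made-up column that contains a zero (illustration only)
views = pd.Series([0, 3, 5, 8, 12, 20, 45, 150])

# Taking the log directly: log(0) becomes -inf
with np.errstate(divide="ignore"):   # silence NumPy's divide-by-zero warning
    raw_log = np.log(views)
print(raw_log.iloc[0])               # -inf
print(raw_log.skew())                # NaN, because of the infinite value

# Check the minimum first and add 0.01 to every value only if a 0 is present
offset = 0.01 if views.min() == 0 else 0.0
shifted_log = np.log(views + offset)
print(shifted_log.skew())            # a finite number
```

The offset is applied to the whole column, not only to the zero entries, so the ordering of the values is preserved.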

In [6]:
df['log_Income'] = np.log(df['Income'])
print(df['log_Income'].skew())

0.34390326559175677


In [7]:
df.head(2)

Unnamed: 0,ID,Income,ReviewScore,Height,Age,ProductViews,log_Income
0,1,22000,5.0,168,25,50,9.998798
1,2,25000,5.0,172,30,60,10.126631


### *1.2 nth root transformation*
- It means applying the nth root to every value in the column.
- x^(1/n) = ⁿ√x, written `x**(1/n)` in Python.
- Here 0 is not an issue, since 0**(1/n) = 0.
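
Rather than guessing n by hand, one option is to scan a range of n and keep the root whose skewness is closest to 0. This is only a sketch: the series `income` is made up for illustration, and the range 2 to 12 is an arbitrary assumption.

```python
import pandas as pd

# Made-up right-skewed values for illustration only
income = pd.Series([20000, 25000, 30000, 40000, 60000,
                    65000, 70000, 90000, 150000, 400000])

# Skewness of the nth root for n = 2, 3, ..., 12
skew_by_n = {n: (income ** (1 / n)).skew() for n in range(2, 13)}

# n whose root gives skewness closest to 0
best_n = min(skew_by_n, key=lambda n: abs(skew_by_n[n]))

for n, skew in skew_by_n.items():
    print(n, round(skew, 4))
print("n with skewness closest to 0:", best_n)
```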

In [8]:
# n chosen manually (here n = 12) -- check that the skewness lies between -1 and 1
df['root_Income'] = df['Income']**(1/12)
print(df['root_Income'].skew())

0.4471821816742738


In [9]:
df.head(2)

Unnamed: 0,ID,Income,ReviewScore,Height,Age,ProductViews,log_Income,root_Income
0,1,22000,5.0,168,25,50,9.998798,2.300745
1,2,25000,5.0,172,30,60,10.126631,2.325386


## **2. for left skewed data**

In [10]:
df['ReviewScore'].skew()

np.float64(-1.9829469559344794)

### *2.1 exp transformation*

In [11]:
df['exp_ReviewScore'] = np.exp(df['ReviewScore'])
df['exp_ReviewScore'].skew()

np.float64(-0.8742559746847688)

In [12]:
df.head(2)

Unnamed: 0,ID,Income,ReviewScore,Height,Age,ProductViews,log_Income,root_Income,exp_ReviewScore
0,1,22000,5.0,168,25,50,9.998798,2.300745,148.413159
1,2,25000,5.0,172,30,60,10.126631,2.325386,148.413159


### *2.2 power transformation*

In [13]:
df['pwr_ReviewScore'] = df['ReviewScore']**5
df['pwr_ReviewScore'].skew()

np.float64(-0.8689113517717533)

In [14]:
df.head(2)

Unnamed: 0,ID,Income,ReviewScore,Height,Age,ProductViews,log_Income,root_Income,exp_ReviewScore,pwr_ReviewScore
0,1,22000,5.0,168,25,50,9.998798,2.300745,148.413159,3125.0
1,2,25000,5.0,172,30,60,10.126631,2.325386,148.413159,3125.0


## **3. BOX-COX Transformation**