# <p style='text-align: center;'> Feature Transformations / Preprocessing </p>

## Feature Transformation :
- Feature transformation applies a mathematical formula to a particular column (feature) and replaces its values with transformed values that are more useful for further analysis. 


- It is a technique that can improve model performance. 


- It is part of Feature Engineering, which also covers creating new features from existing features that may help improve the model performance.
  
  
## Reasons for using transformations :
- Convenience: A transformed scale may be as natural as the original scale and more convenient for a specific purpose (e.g. 
  percentages rather than original data, sines rather than degrees). One important example is standardization.


- Reducing skewness: A transformation may be used to reduce skewness. A distribution that is symmetric or nearly so is often   
  easier to handle and interpret than a skewed distribution. Right skewness is the commonest problem in practice; to reduce it, 
  take roots, logarithms, or reciprocals (roots are weakest, reciprocals strongest). To reduce left skewness, take squares, 
  cubes, or higher powers.


- Equal spreads: A transformation may be used to produce approximately equal spreads, despite marked variations in level, which 
  again makes data easier to handle and interpret.


- Linear relationships: When looking at relationships between variables, it is often far easier to think about patterns that 
  are approximately linear than about patterns that are highly curved.


- Additive relationships: Relationships are often easier to analyse when additive rather than multiplicative.
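
Standardization, mentioned above as an important example, can be sketched as follows. The values are hypothetical and the variable names are only illustrative; the sample standard deviation (the pandas default) is assumed here.

```python
import pandas as pd

# Hypothetical values, used only for illustration.
ages = pd.Series([25, 18, 42, 51], name="Age")

# Standardize: subtract the mean, then divide by the sample standard deviation.
ages_std = (ages - ages.mean()) / ages.std()

print(ages_std.round(3).tolist())
print("mean:", round(ages_std.mean(), 6), "std:", round(ages_std.std(), 6))
```

After standardization the column has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.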


<b> Here we load the "sks.csv" dataset to perform the transformation operations. </b>

In [2]:
# Importing the necessary libraries.
import pandas as pd
import numpy as np

# Load the "sks.csv" dataset.
df = pd.read_csv("sks.csv")

# print the dataset.
df

|   | Income | Age | Department |
|---|--------|-----|------------|
| 0 | 15000  | 25  | HR         |
| 1 | 1800   | 18  | Legal      |
| 2 | 120000 | 42  | Marketing  |
| 3 | 10000  | 51  | Management |


## Logarithmic Transformation :
- The Logarithmic Transformation replaces each value with its logarithm. It is mainly used to make a right-skewed distribution less skewed and closer to a normal distribution.


- Taking the log is a convenient way to deal with large numbers (the base-10 log of 1,000,000 is only 6). It compresses large values much more than small ones, which reduces the influence of extreme high values in a feature.
  
  
- `np.log`, used in the cells that follow, is the natural logarithm. The choice of base only rescales the result and does not change the skewness.
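
As a small sketch of how the base of the logarithm affects the result, the snippet below compares `np.log10` with `np.log` on hypothetical values (the numbers are chosen only for illustration).

```python
import numpy as np

# Hypothetical values spanning several orders of magnitude.
values = np.array([10.0, 1_000.0, 1_000_000.0])

log10_values = np.log10(values)   # base-10 logarithm
ln_values = np.log(values)        # natural logarithm (what np.log computes)

print(log10_values)
print(ln_values)

# The two results differ only by the constant factor ln(10).
ratio = ln_values / log10_values
print(ratio)
```

Because the two versions differ only by a constant factor, either base reduces skewness by the same amount.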

In [3]:
# checking the skewness of the "Income" variable from "df" DataFrame.
df["Income"].skew()

1.9426145704858286

- The distribution is right skewed, so we apply the Logarithmic Transformation to make it less skewed and closer to a normal 
  distribution.

In [15]:
# Apply the natural logarithm to each value of the "Income" variable
# and store the result in a new column, "log_sks".
df["log_sks"] = np.log(df["Income"])

# Check the skewness of the "log_sks" variable in the "df" DataFrame.
df["log_sks"].skew()

0.3099392840650014

- The skewness has dropped from about 1.94 to about 0.31, so the distribution is now much closer to symmetric.

- Note : the log is undefined for zero and for negative numbers, so applying it directly to such data gives -inf or NaN values (NumPy typically issues a warning rather than raising an error). Values between 0 and 1 are valid inputs, but their logs are negative. In such cases, we can add a constant so that all values become positive (or greater than 1, if non-negative results are wanted) before taking the log. 
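
The note above can be sketched in two ways. Both series below are hypothetical, and the shift constant in the second option is an assumption chosen so that the smallest value maps to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical column containing a zero and a value between 0 and 1.
s = pd.Series([0.0, 0.5, 3.0, 40.0, 500.0])

# Option 1: log1p computes log(1 + x), which is defined for x >= 0.
log1p_values = np.log1p(s)

# Option 2: for data with negative values, shift so the minimum becomes 1.
t = pd.Series([-20.0, -5.0, 0.0, 15.0, 300.0])
shift = 1 - t.min()              # assumed constant: maps the minimum to 1
shifted_log = np.log(t + shift)

print(log1p_values.round(3).tolist())
print(shifted_log.round(3).tolist())
```

The shift changes the meaning of the values, so the same constant has to be reused whenever the transformation is applied to new data.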

## Square Root Transformation :
- The Square Root Transformation replaces each value with its square root. It is mainly used to make a right-skewed distribution less skewed and closer to a normal distribution.


- The square root is defined only for non-negative numbers, and this transformation is weaker than the Log Transformation.



<b> Here we use the same "sks.csv" dataset to perform the transformation operations. </b>

In [17]:
# checking the skewness of the "Income" variable from "df" DataFrame.
df["Income"].skew()

1.9426145704858286

- The distribution is right skewed, so we apply a Root Transformation to make it less skewed and closer to a normal distribution.

In [36]:
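# Apply a root transformation to the "Income" variable and store the result
# in a new column, "sqr_sks". Note: the exponent 1/200 takes the 200th root,
# which is far stronger than a true square root (exponent 1/2, or np.sqrt).
df["sqr_sks"] = df["Income"]**(1/200)

# Check the skewness of the "sqr_sks" variable in the "df" DataFrame.
df["sqr_sks"].skew()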

0.3281471105495321

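- The skewness has dropped from about 1.94 to about 0.33, so the distribution is now much closer to symmetric.


- This is a lengthier process than the Logarithmic Transformation: a plain square root (exponent 1/2) is too weak for this data, so the exponent had to be lowered to 1/200 to reach a comparable skewness.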
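
For comparison, the sketch below applies a true square root (`np.sqrt`) to the four "Income" rows displayed earlier and sets it against the natural log. It assumes those four rows are the whole column, which is consistent with the skewness values reported above.

```python
import numpy as np
import pandas as pd

# The four "Income" values displayed earlier in this notebook.
income = pd.Series([15000, 1800, 120000, 10000], name="Income")

skew_raw = income.skew()
skew_sqrt = np.sqrt(income).skew()   # true square root, exponent 1/2
skew_log = np.log(income).skew()     # natural log, for comparison

print("raw :", round(skew_raw, 3))
print("sqrt:", round(skew_sqrt, 3))
print("log :", round(skew_log, 3))
```

The square root lowers the skewness only modestly, while the log lowers it much further, which illustrates why the square root is described as the weaker transformation.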

## Reciprocal Transformation :
- The reciprocal transformation is defined as the transformation of x to 1/x.


- Reciprocal Transformation is mainly used to make a right-skewed distribution less skewed and closer to a normal distribution.


- The transformation has a dramatic effect on the shape of the distribution, reversing the order of values with the same sign. The transformation can only be used for non-zero values.


- A negative reciprocal transformation is almost identical, except that x maps to -1/x, which preserves the order of values with the same sign.
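
The effect on ordering described above can be checked with a short sketch. The values are hypothetical and all positive, so they share the same sign.

```python
import pandas as pd

x = pd.Series([2.0, 5.0, 10.0, 50.0])   # hypothetical positive values

reciprocal = 1 / x          # reverses the order of the values
neg_reciprocal = -1 / x     # preserves the order of the values

print(reciprocal.tolist())
print(neg_reciprocal.tolist())

# Compare the rankings with those of the original series.
order_reversed = reciprocal.rank().tolist() == x.rank(ascending=False).tolist()
order_kept = neg_reciprocal.rank().tolist() == x.rank().tolist()
print(order_reversed, order_kept)
```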

<b> Here we use the same "df" DataFrame for the Reciprocal transformation operations. </b>

In [3]:
# checking the skewness of the "Income" variable from "df" DataFrame.
df["Income"].skew()

1.9426145704858286

- The distribution is right skewed, so we apply the Reciprocal Transformation to try to make it less skewed and closer to a normal distribution.

In [4]:
# Apply the Reciprocal Transformation to the "Income" variable
# and store the result in a new column, "rec_sks".
df["rec_sks"]=1/(df["Income"])

# Check the skewness of the "rec_sks" variable in the "df" DataFrame.
df["rec_sks"].skew()

1.862821602439993

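- The result is still strongly right skewed (about 1.86). Because the reciprocal reverses the order of the values, the smallest income now forms the long right tail, so the transformation has over-corrected. The Reciprocal Transformation is not suitable for this dataset.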

## Square Transformation :
- The Square Transformation replaces each value with its square. It is mainly used to make a left-skewed distribution less skewed and closer to a normal distribution.

<b> Here we create our own dataset, a DataFrame named "df1" with a single column "x". </b>

In [12]:
# Importing the necessary libraries.
import pandas as pd

# Create DataFrame.
df1=pd.DataFrame({"x":[120,100,80,100,10]})

# print the DataFrame.
df1

|   | x   |
|---|-----|
| 0 | 120 |
| 1 | 100 |
| 2 | 80  |
| 3 | 100 |
| 4 | 10  |


In [71]:
# Check the skewness of the "x" variable in the "df1" DataFrame.
df1["x"].skew()

-1.645977036097245

- The distribution is left skewed, so we apply a power transformation to make it less skewed and closer to a normal distribution.

In [63]:
# Apply a power transformation to the "x" variable and store the result
# in a new column, "x_square". Note: the exponent 3 cubes the values;
# a true square would use the exponent 2.
df1["x_square"] = df1["x"]**(3)

# Check the skewness of the "x_square" variable in the "df1" DataFrame.
df1["x_square"].skew()

0.06588432141496725

- The skewness has moved from about -1.65 to about 0.07, so the distribution is now close to symmetric.
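
The cell above raises "x" to the power 3. As a sketch for comparison, the snippet below rebuilds the same "df1" and reports the skewness for a true square (exponent 2) next to the cube.

```python
import pandas as pd

# Same data as the "df1" DataFrame created above.
df1 = pd.DataFrame({"x": [120, 100, 80, 100, 10]})

skew_raw = df1["x"].skew()
skew_square = (df1["x"] ** 2).skew()   # true square
skew_cube = (df1["x"] ** 3).skew()     # cube, as in the cell above

print("raw   :", round(skew_raw, 3))
print("square:", round(skew_square, 3))
print("cube  :", round(skew_cube, 3))
```

On this data the square reduces the left skew only partly, and the higher power is needed to bring the skewness close to zero.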

## Box-Cox Transformation :
- A box-cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.


- It belongs to the family of power transformations.


- The data must be strictly positive. If the data contains zero or negative values, the Box-Cox transformation cannot be applied directly.


- The Box-Cox Transformation is defined as :

        (Y**λ - 1)/λ    if λ ≠ 0
        log(Y)          if λ = 0
        
        
- Where,
        - Y is the response variable (the column being transformed).
        - λ is the transformation parameter.



- λ is typically searched over a range such as -5 to 5. The candidate values are compared, and the one that makes the transformed data closest to a normal distribution is selected for the given variable.
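
The definition above can be written as a small NumPy function. This is a minimal sketch: `box_cox` is a hypothetical helper named here for illustration, not a library function, and the sample values are only an example.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of strictly positive values (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)            # limiting case of the formula as lambda -> 0
    return (y ** lam - 1) / lam

y = np.array([1800.0, 10000.0, 15000.0, 120000.0])

print(box_cox(y, 1.0))   # lambda = 1: only shifts the data by -1
print(box_cox(y, 0.5))   # lambda = 0.5: behaves like a square root
print(box_cox(y, 0.0))   # lambda = 0: natural log
```

The λ = 0 case is the limit of the general formula, which is why very small values of λ give results close to the natural log.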


<b> Here we use the same "sks.csv" dataset, i.e. the "df" DataFrame, to perform the Box-Cox transformation operations. </b>

In [6]:
# checking the skewness of the "Income" variable from "df" DataFrame.
df["Income"].skew()

1.9426145704858286

- The distribution is right skewed, so we apply the Box-Cox transformation to make it less skewed and closer to a normal distribution.

- We can perform a box-cox transformation in Python by using the scipy.stats.boxcox() function. 

In [10]:
# Importing the necessary libraries.
import pandas as pd
from scipy.stats import boxcox 

# Apply the Box-Cox transformation to the "Income" variable
# and store the result in a new column, "Income_boxcox".
# boxcox() returns the transformed values and the fitted lambda.
df["Income_boxcox"], param = boxcox(df.Income)

# Print the lambda value.
print("λ =",param)

# Check the skewness of the "Income_boxcox" variable in the "df" DataFrame.
df["Income_boxcox"].skew()

λ = -0.06750036465839077


0.060932187578592846

- In the above example, the skewness has dropped from about 1.94 to about 0.06, so the distribution is now close to symmetric.


- When no λ is supplied, `boxcox()` estimates the value of λ that makes the transformed data as close to normally distributed as possible (by maximizing the log-likelihood) and returns it together with the transformed values.


- In the above example, the optimal value of λ is about -0.0675, and the transformed "Income" has a skewness of about 0.0609.
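
The idea of trying many λ values and keeping the best one can be sketched with a plain grid search. This is an illustration under stated assumptions, not how `scipy.stats.boxcox` is implemented: the grid of 10,001 candidates between -5 and 5 and the helper `log_likelihood` are hypothetical choices, and the four "Income" values are the rows displayed earlier.

```python
import numpy as np

def log_likelihood(y, lam):
    """Box-Cox profile log-likelihood for one lambda (hypothetical helper)."""
    log_y = np.log(y)
    n = len(y)
    if abs(lam) < 1e-12:
        # lambda = 0: the transform is log(y)
        log_var = np.log(np.var(log_y))
    else:
        # var((y**lam - 1) / lam) equals var(y**lam) / lam**2
        log_var = np.log(np.var(y ** lam)) - 2 * np.log(abs(lam))
    return (lam - 1) * log_y.sum() - n / 2 * log_var

# The four "Income" values displayed earlier in this notebook.
income = np.array([15000.0, 1800.0, 120000.0, 10000.0])

# Assumed grid of candidate lambdas between -5 and 5.
lambdas = np.linspace(-5, 5, 10001)
scores = np.array([log_likelihood(income, lam) for lam in lambdas])
best_lambda = lambdas[np.argmax(scores)]

print("best lambda on the grid:", round(best_lambda, 3))
```

The grid result should land close to the λ reported by `boxcox()` above; a numerical optimizer reaches the same kind of answer without having to fix a grid in advance.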