## Attribute Transformation 
An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.

#### A dataset often contains features measured in different units and on different scales. For example, the pregnant and insulin values in the diabetes dataset are measured on different scales, and the magnitude of "insulin" is much larger than that of "pregnant". Many algorithms that are sensitive to the scale of their inputs will therefore be biased towards the feature with the larger magnitude. Neural networks, for example, are highly sensitive to the scaling of the data attributes, so we need to convert the dataset into a suitable format before it is fed into the neurons.
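
To make the bias concrete, the sketch below computes the Euclidean distance between two patients described only by (pregnant, insulin). The two value pairs are used purely for illustration, and the names `a`, `b`, `diff`, `dist` and `share` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

# Two illustrative patients described by (pregnant, insulin).
a = np.array([1.0, 94.0])
b = np.array([8.0, 125.0])

diff = a - b
dist = np.linalg.norm(diff)  # Euclidean distance on the raw, unscaled values

# Fraction of the squared distance contributed by each feature.
share = diff**2 / np.sum(diff**2)

print(f"distance = {dist:.2f}")
print("share of squared distance (pregnant, insulin) =", share.round(3))
```

Even though the two patients differ a lot in the number of pregnancies, almost all of the distance comes from insulin, simply because insulin is recorded in larger numbers.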

### Solution to varying scale values

>We need a mechanism that scales all the attribute values into a common range, typically 0 to 1 or some other specified range. This approach is called feature scaling.

> Below are two approaches that convert each feature to the same scale:

         1. Min-Max Scaler(Normalization)
         2. Standardization



Rescaling the X_train dataset using MinMaxScaler()


#### Here, $min_{j}$ and $max_{j}$ represent the minimum and maximum values of attribute j. The jth attribute value $x_{i}^{j}$ of the ith row is scaled as:

####                             $y_{i}^{j} = (x_{i}^{j} - min_{j})/(max_{j}-min_{j})$
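
As a rough check of the formula, the sketch below applies it column by column with NumPy and compares the result with scikit-learn's `MinMaxScaler`. The matrix `X_demo` is a small made-up example (an assumption for illustration, not the diabetes data), and `col_min`, `col_max` and `Y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up matrix: rows are samples i, columns are attributes j.
X_demo = np.array([[1.0, 94.0],
                   [8.0, 125.0],
                   [0.0, 168.0]])

col_min = X_demo.min(axis=0)  # min_j for every attribute
col_max = X_demo.max(axis=0)  # max_j for every attribute

# y_i^j = (x_i^j - min_j) / (max_j - min_j), applied to all rows at once
Y_demo = (X_demo - col_min) / (col_max - col_min)

print(Y_demo)
print(np.allclose(Y_demo, MinMaxScaler().fit_transform(X_demo)))
```

Note that this manual version divides by zero for a constant column (where $max_{j} = min_{j}$), so it is only a sketch of the idea.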

<font color = red> We fit the scaler (or any other data transformation) on the training dataset only, so that no information from the test dataset leaks into the transformation</font>

#### Splitting the cleaned data into input features $(X_{i})$ and output component (Y)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
dbts_new= pd.read_csv('C:/Users/acer/nikhil/DataMiningLab/lab/imputed_data_diabetes.csv')
dbts_new.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,6,148.0,72,35.0,125,33.6,0.627,50,1
1,1,85.0,66,29.0,125,26.6,0.351,31,0
2,8,183.0,64,29.15342,125,23.3,0.672,32,1
3,1,89.0,66,23.0,94,28.1,0.167,21,0
4,0,137.0,40,35.0,168,43.1,2.288,33,1


In [3]:
# convert the DataFrame to a NumPy array
spltd_data = dbts_new.values
# separate the dataset into input components (columns 0-7) and output component (column 8)
X = spltd_data[:, 0:8]
Y = spltd_data[:, 8]

### Split the dataset into training and testing sets, with the training set = 80% of the cleaned data and the test set = 20% of the cleaned data

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)

### Use scikit-learn's MinMaxScaler() for normalization

In [5]:
from sklearn.preprocessing import MinMaxScaler
sclr = MinMaxScaler(feature_range=(0, 1))
scaled_data_X_train = sclr.fit_transform(X_train)
# summarize transformed data
np.set_printoptions(precision=4)
print(scaled_data_X_train[0:5,:])

[[0.     0.5613 0.5333 0.2408 0.1521 0.5112 0.082  0.0833]
 [0.1333 0.2581 0.2889 0.1739 0.0849 0.2495 0.38   0.    ]
 [0.0667 0.1871 0.2889 0.0326 0.1521 0.0982 0.0726 0.    ]
 [0.2    0.4    0.5333 0.2408 0.1521 0.1554 0.0551 0.1   ]
 [0.     0.6065 0.4    0.3043 0.2096 0.3354 0.1947 0.    ]]


#### The above code converted all the feature values to a scale between 0 and 1 using normalization (min-max scaling).
<font color = green>Some learning algorithms, such as neural networks, tend to train better when input values lie in a small range such as [0,1], hence we use normalization for scaling in such cases. </font>
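
When the test set is needed later, one common pattern is to reuse the scaler that was fitted on the training data rather than fitting a new one. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, `scaler` and the other names are hypothetical stand-ins, and the fixed `random_state` is an assumption added only to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 3))  # synthetic stand-in for X
y_demo = rng.integers(0, 2, size=50)                      # synthetic stand-in for Y

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

Because the test rows are scaled with the training minimum and maximum, some test values may fall slightly outside [0,1]; that is expected and is not an error.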

## Standardization 
****
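It is another approach to scaling, in which the scaled values are not confined to the [0,1] range. <b>It is suitable where the data collection process has errors and the data therefore contains extreme values or outliers, because it is generally less distorted by a single extreme value than min-max scaling is.</b>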

The jth attribute value $x_{i}^{j}$ of the ith row is standardized by:

###                         $z_{i}^{j} = (x_{i}^{j} - \mu_{j}) / \sigma_{j}$

where the $j^{th}$ attribute has mean $\mu_{j}$ and standard deviation $\sigma_{j}$. This is also called z-score normalization.
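
The formula can be checked by hand with NumPy. The sketch below uses a small made-up matrix (`X_demo` is an assumption for illustration, and `mu`, `sigma`, `Z_manual` and `Z_sklearn` are hypothetical names) and compares the manual result with scikit-learn's `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: rows are samples i, columns are attributes j.
X_demo = np.array([[1.0, 94.0],
                   [8.0, 125.0],
                   [0.0, 168.0],
                   [6.0, 125.0]])

mu = X_demo.mean(axis=0)    # mean of every attribute
sigma = X_demo.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

Z_manual = (X_demo - mu) / sigma
Z_sklearn = StandardScaler().fit_transform(X_demo)

print(Z_manual)
print(np.allclose(Z_manual, Z_sklearn))
```

After standardization every column has mean 0 and standard deviation 1, but individual values can be negative or larger than 1.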
                       
****
>We use scikit-learn's StandardScaler() for standardization.

In [6]:
from sklearn.preprocessing import StandardScaler
scale_ftrs_stndrd = StandardScaler().fit(X_train)
scaled_stndrd_X_train = scale_ftrs_stndrd.transform(X_train)
# summarize transformed data
np.set_printoptions(precision=3)
print(scaled_stndrd_X_train[0:5,:])

[[-1.142  0.321 -0.041  0.005 -0.163  1.594 -0.616 -0.615]
 [-0.547 -1.221 -1.897 -0.68  -0.762 -0.278  1.455 -1.036]
 [-0.845 -1.582 -1.897 -2.128 -0.163 -1.361 -0.681 -1.036]
 [-0.249 -0.499 -0.041  0.005 -0.163 -0.951 -0.803 -0.53 ]
 [-1.142  0.55  -1.053  0.655  0.35   0.336  0.167 -1.036]]


## Dimensionality Reduction
Dimensionality reduction is about summarizing the data in a compact form while preserving most of the information. Reducing the dimension of the feature space leaves fewer relationships between variables for the model to learn, and hence the model is less likely to overfit. One such technique, discussed here, is Principal Component Analysis (PCA).
****
<b> PCA is a dimensionality-reduction technique for large datasets: it transforms a large set of input features into a smaller set that still contains most of the information in the original dataset. Before applying PCA, the dataset must be rescaled. PCA is driven by variance, so without rescaling the features with the largest magnitudes dominate the components and the model's accuracy may not improve much. </b>
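
The effect of skipping the rescaling step can be sketched on synthetic data. Below, two independent features are generated on very different scales (an assumption made purely for illustration; `small`, `large`, `X_demo`, `ratio_raw` and `ratio_scaled` are hypothetical names), and PCA is fitted once on the raw values and once on the standardized values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales.
small = rng.normal(0.0, 1.0, size=200)
large = rng.normal(0.0, 100.0, size=200)
X_demo = np.column_stack([small, large])

# PCA on the raw values: the large-scale feature dominates the first component.
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA on standardized values: both features contribute on an equal footing.
X_std = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw:   ", ratio_raw)
print("scaled:", ratio_scaled)
```

On the raw data the first component captures almost all of the variance only because one feature is recorded in larger numbers; after standardization the variance is shared much more evenly.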

In [7]:
from sklearn.decomposition import PCA

prcpl_cmpnts = PCA(n_components=3)  # keep the first three principal components for data reduction and summarization
prncpl_smmry = prcpl_cmpnts.fit(scaled_stndrd_X_train)
print("Explained Variance: %s" % prncpl_smmry.explained_variance_ratio_)  # fraction of variance explained by each component


Explained Variance: [0.285 0.192 0.139]


In [8]:
print(prncpl_smmry.components_)

[[ 0.259  0.432  0.371  0.411  0.332  0.416  0.181  0.352]
 [-0.582  0.041 -0.168  0.259  0.254  0.362  0.26  -0.549]
 [ 0.027  0.436 -0.286 -0.427  0.526 -0.39   0.338  0.036]]


****
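The above code created three principal components, shown as the three rows of the array. Each row holds the weights (loadings) of the eight original features for one component, and together the three components explain about 61.6% of the variance in the scaled training data (0.285 + 0.192 + 0.139).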
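
Fitting PCA only learns the components; to actually obtain the reduced dataset, the data has to be projected onto them. The sketch below shows one way to do this with `transform`, using synthetic data as a stand-in for the eight standardized features (`X_demo`, `X_std`, `pca` and `X_reduced` are hypothetical names, not variables from the cells above).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))              # synthetic stand-in for 8 features
X_std = StandardScaler().fit_transform(X_demo)  # standardize before PCA

pca = PCA(n_components=3).fit(X_std)
X_reduced = pca.transform(X_std)  # project every row onto the 3 components

print(pca.components_.shape)  # one row of 8 loadings per component
print(X_reduced.shape)        # same number of rows, only 3 columns
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative variance explained
```

The reduced matrix can then be passed to a model in place of the original eight features. The cumulative explained variance is a common guide for choosing how many components to keep.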