# **Attribute Transformation**
An attribute transformation refers to the process of converting an attribute's values into a new set of values, ensuring that each original value can be uniquely identified by its transformed counterpart.

When datasets contain features with varying scales and units, such as "pregnant" and "insulin" in a diabetes dataset, the difference in magnitude between attributes can introduce bias in algorithms that are sensitive to scale. For example, since "insulin" typically has much larger values than "pregnant," algorithms like neural networks may prioritize features with larger magnitudes, leading to inaccurate or suboptimal results.

To address this issue, it is essential to bring all attributes onto a similar scale, typically within a range like 0 to 1 or another predefined interval. This process, known as feature scaling, ensures that no attribute dominates others solely due to its scale.
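As a rough illustration of how magnitude alone can dominate, the sketch below compares two rows using only "pregnant" and "insulin". The values are taken from the first and fourth rows of the data preview shown later in this notebook. Using squared Euclidean distance is an assumption made for illustration (it is what distance-based methods rely on); the variable names are hypothetical.

```python
import numpy as np

# (pregnant, insulin) for two rows of the data preview; names are illustrative
row_a = np.array([6.0, 125.0])
row_b = np.array([1.0, 94.0])

# Per-feature contribution to the squared Euclidean distance
squared_diff = (row_a - row_b) ** 2

# Share of the total squared distance that comes from insulin alone
insulin_share = squared_diff[1] / squared_diff.sum()

print(squared_diff)
print(round(insulin_share, 3))
```

On these two rows, insulin accounts for nearly all of the distance even though the difference in "pregnant" (6 versus 1) is large relative to that feature's own range.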

## **Common Methods for Feature Scaling**
**Min-Max Scaling (Normalization):**
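This technique rescales the feature values to fit within a specific range, usually between 0 and 1. It is particularly useful when the data do not follow a normal distribution or when an algorithm expects bounded inputs. Because the range is defined by the minimum and maximum values, the result is sensitive to outliers.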

**Standardization (Z-Score Scaling):**
Standardization transforms the data so that each feature has a mean of 0 and a standard deviation of 1. This method works well for algorithms that are sensitive to feature scale or that benefit from centered features with comparable variance, such as regularized logistic regression or support vector machines.

By applying feature scaling, datasets are converted into a format that improves algorithm performance, ensures fair treatment of all features, and enhances the reliability of the analysis or prediction.
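The two methods can be compared side by side on a small column that contains one extreme value. This is a minimal sketch with made-up numbers (not taken from the diabetes data) and hypothetical variable names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature column with one extreme value
values = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])

# Min-Max scaling: the extreme value becomes 1 and squeezes the rest toward 0
minmax_scaled = MinMaxScaler().fit_transform(values)

# Standardization: the result has mean 0 and standard deviation 1, with no fixed bounds
standard_scaled = StandardScaler().fit_transform(values)

print(minmax_scaled.ravel())
print(standard_scaled.ravel())
```

In this toy case the four ordinary values end up very close to 0 after Min-Max scaling, which illustrates why the choice of scaler deserves a look at the data first.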



Rescaling the X_train dataset with MinMaxScaler()


#### $\min_{j}$ and $\max_{j}$ represent the minimum and maximum values of attribute j. The jth attribute value $x_{i}^{j}$ of the ith row is scaled as:

#### $y_{i}^{j} = (x_{i}^{j} - \min_{j}) / (\max_{j} - \min_{j})$
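The formula can be checked by hand against scikit-learn. The sketch below applies it column by column to a toy matrix (illustrative values, hypothetical names) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: rows are samples i, columns are attributes j (illustrative values)
X_toy = np.array([[6.0, 125.0],
                  [1.0,  94.0],
                  [8.0, 168.0]])

col_min = X_toy.min(axis=0)   # min_j for each attribute
col_max = X_toy.max(axis=0)   # max_j for each attribute

# y_i^j = (x_i^j - min_j) / (max_j - min_j), applied to every column at once
Y_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation using scikit-learn with its default range (0, 1)
Y_sklearn = MinMaxScaler().fit_transform(X_toy)

print(Y_manual)
print(np.allclose(Y_manual, Y_sklearn))
```

The manual version divides by zero if a column is constant, so it is meant only as a check of the formula.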

<font color = red> We fit the scaler (or any other data transformation) on the training dataset only; the test dataset is then transformed with the parameters learned from the training data.</font>
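One common way to keep the test data out of the fitting step is to call `fit_transform` on the training rows and `transform` on the test rows. The sketch below uses synthetic data and hypothetical names; the `random_state` values are assumptions added only to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a feature matrix and a binary label
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 200, size=(50, 3))
y_demo = rng.integers(0, 2, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max on the test rows

print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```

Scaled test values can fall slightly outside [0, 1] when a test row lies beyond the training minimum or maximum. That is expected and does not mean the scaler was applied incorrectly.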

#### **Split the cleaned data into input  features $(X_{i})$  and output component (Y)**

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [5]:
dbts_new= pd.read_csv('/content/imputed_data_diabetes1.csv')
dbts_new.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,6,148.0,72,35.0,125,33.6,0.627,50,1
1,1,85.0,66,29.0,125,26.6,0.351,31,0
2,8,183.0,64,29.15342,125,23.3,0.672,32,1
3,1,89.0,66,23.0,94,28.1,0.167,21,0
4,0,137.0,40,35.0,168,43.1,2.288,33,1


In [6]:
spltd_data = dbts_new.values
# separate the dataset into input and output components:
# columns 0-7 are the input features, column 8 is the output label
X = spltd_data[:, 0:8]
Y = spltd_data[:, 8]
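Positional slicing depends on the column order of the file. A possible alternative is to select the features by column name. The sketch below rebuilds the first three rows of the preview above as a small frame; `df_demo`, `feature_cols`, `X_named`, and `Y_named` are hypothetical names.

```python
import pandas as pd

# Small stand-in frame built from the first three rows of the preview
df_demo = pd.DataFrame({
    "pregnant": [6, 1, 8],
    "glucose": [148.0, 85.0, 183.0],
    "bp": [72, 66, 64],
    "skin": [35.0, 29.0, 29.15342],
    "insulin": [125, 125, 125],
    "bmi": [33.6, 26.6, 23.3],
    "pedigree": [0.627, 0.351, 0.672],
    "age": [50, 31, 32],
    "Diabetic": [1, 0, 1],
})

# Every column except the label is treated as an input feature
feature_cols = [c for c in df_demo.columns if c != "Diabetic"]

X_named = df_demo[feature_cols].to_numpy()
Y_named = df_demo["Diabetic"].to_numpy()

print(feature_cols)
print(X_named.shape, Y_named.shape)
```

Selecting by name keeps the feature set explicit if the CSV gains or loses a column.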

### **Split the dataset into training and test sets, with 80% of the cleaned data for training and 20% for testing**

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
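The call above draws a different random split on every run. If repeatable results are wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the class proportions similar in both parts. The sketch below uses synthetic data; the seed value 42 and the use of stratification are assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 2 features, 35% positive labels
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_demo,    # keep the label proportions similar in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```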

### **Use scikit-learn's MinMaxScaler() for normalization**

In [8]:
from sklearn.preprocessing import MinMaxScaler
sclr = MinMaxScaler(feature_range=(0, 1))
scaled_data_X_train = sclr.fit_transform(X_train)
# summarize transformed data
np.set_printoptions(precision=4)
print(scaled_data_X_train[0:5,:])

[[0.4118 0.6014 0.7333 0.1848 0.6997 0.2495 0.0188 0.4314]
 [0.     0.2657 0.5333 0.2408 0.1667 0.2883 0.0736 0.0784]
 [0.     0.3147 0.4444 0.1087 0.1667 0.0573 0.0719 0.    ]
 [0.1176 0.2797 0.4889 0.0652 0.0526 0.0593 0.241  0.098 ]
 [0.0588 0.4476 0.6222 0.4457 0.2793 0.4233 0.4615 0.3922]]


#### The code above rescales all the feature values to the range 0 to 1 using Min-Max scaling (normalization).
<font color = green>Some learning algorithms, such as neural networks, often train better when input values lie in a small bounded range such as [0,1], so normalization is a common choice for scaling in such cases. </font>

# **Standardization**
****
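It is another approach to scaling, in which the scaled values are not confined to the [0,1] range. <b>It is often preferred when the data collection process has errors and the data therefore contain extreme values or outliers, because a single extreme value does not compress the remaining values into a narrow band the way it does with Min-Max scaling.</b>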

The jth attribute value $x_{i}^{j}$ of the ith row is standardized as:

### $z_{i}^{j} = (x_{i}^{j} - \mu_{j}) / \sigma_{j}$

where $z_{i}^{j}$ is the standardized value (z-score) and the $j^{th}$ attribute has mean $\mu_{j}$ and standard deviation $\sigma_{j}$.
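The z-score formula can also be verified by hand. The sketch below computes the column means and standard deviations with NumPy on a toy matrix (illustrative values, hypothetical names) and compares the result with `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: rows are samples i, columns are attributes j (illustrative values)
X_toy = np.array([[6.0, 125.0],
                  [1.0,  94.0],
                  [8.0, 168.0],
                  [0.0, 110.0]])

mu = X_toy.mean(axis=0)     # mean of each attribute
sigma = X_toy.std(axis=0)   # population standard deviation (ddof=0) of each attribute

# Subtract the column mean and divide by the column standard deviation
Z_manual = (X_toy - mu) / sigma

Z_sklearn = StandardScaler().fit_transform(X_toy)

print(Z_manual)
print(np.allclose(Z_manual, Z_sklearn))
```

Using the sample standard deviation instead (`ddof=1`) would give slightly different values from scikit-learn on small datasets.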
                       
****
> We use the scikit-learn class "StandardScaler()" for standardization.

In [9]:
from sklearn.preprocessing import StandardScaler
scale_ftrs_stndrd = StandardScaler().fit(X_train)
scaled_stndrd_X_train = scale_ftrs_stndrd.transform(X_train)
# summarize transformed data
np.set_printoptions(precision=3)
print(scaled_stndrd_X_train[0:5,:])

[[ 0.915  0.653  1.467 -0.616  4.202 -0.334 -1.065  0.867]
 [-1.139 -0.903 -0.031 -0.031 -0.164 -0.052 -0.674 -0.694]
 [-1.139 -0.676 -0.697 -1.41  -0.164 -1.725 -0.686 -1.041]
 [-0.552 -0.838 -0.364 -1.864 -1.099 -1.71   0.523 -0.607]
 [-0.845 -0.06   0.635  2.108  0.758  0.925  2.098  0.693]]


## **Dimensionality Reduction**
#### Dimensionality reduction summarizes the data in a compact form while preserving most of the information. Reducing the dimension of the feature space leaves fewer relationships between variables for a model to fit, so the model is less likely to overfit.

#### One such technique, discussed here, is Principal Component Analysis (PCA).
****
<b> PCA reduces the dimensionality of large datasets by transforming a large set of input features into a smaller set that still contains most of the information in the original dataset. Before applying PCA, the dataset should be rescaled. PCA is driven by variance, so without rescaling, features with large magnitudes dominate the components and the model's accuracy may not improve much. </b>
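The effect of rescaling on PCA can be seen on synthetic data. The sketch below builds two independent features with very different scales (an assumption for illustration, with hypothetical names) and compares the explained variance ratios with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small = rng.normal(0, 1, size=200)     # small-scale feature
large = rng.normal(0, 100, size=200)   # independent feature on a much larger scale
X_demo = np.column_stack([small, large])

# PCA on the raw data: the large-scale feature dominates the first component
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: both features contribute on an equal footing
X_std = StandardScaler().fit_transform(X_demo)
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print(ratio_raw)
print(ratio_std)
```

On the raw data, almost all of the variance is assigned to the first component simply because one feature has larger units. After standardization the two components share the variance much more evenly.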

In [10]:
from sklearn.decomposition import PCA

prcpl_cmpnts = PCA(n_components=3)  # keep the first three principal components for data reduction and summarization
prncpl_smmry = prcpl_cmpnts.fit(scaled_stndrd_X_train)
print("Explained Variance: %s" % prncpl_smmry.explained_variance_ratio_)  # fraction of variance explained by each component


Explained Variance: [0.289 0.181 0.139]


In [11]:
print(prncpl_smmry.components_)

[[ 0.288  0.413  0.388  0.401  0.313  0.418  0.145  0.376]
 [ 0.591 -0.109  0.125 -0.284 -0.277 -0.385 -0.13   0.549]
 [-0.034  0.51  -0.282 -0.403  0.568 -0.373  0.178  0.061]]


****
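The code above fits three principal components, shown as the three rows of the printed array. Each row holds the weights (loadings) of the eight original features for one component. Together, the three components explain about 61% of the variance in the standardized training data (0.289 + 0.181 + 0.139).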
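Fitting PCA only learns the components. To obtain the reduced dataset, the data still has to be projected onto them. The sketch below shows this step on synthetic data with eight features; the array contents and names are hypothetical stand-ins for the notebook's standardized training data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features, standardized before PCA
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
X_std = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)   # each row now has 3 coordinates instead of 8

print(X_reduced.shape)        # (100, 3)
print(pca.components_.shape)  # (3, 8)

# Running total of the variance explained by the first 1, 2, and 3 components
print(np.cumsum(pca.explained_variance_ratio_))
```

With the objects from this notebook, the equivalent projection would presumably be `prcpl_cmpnts.transform(scaled_stndrd_X_train)`. Test rows would first need `scale_ftrs_stndrd.transform(X_test)` so that they pass through the same standardization.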