In [1]:
import pandas as pd

In [2]:
flower_data = pd.read_csv("Iris.csv")

In [3]:
flower_data.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [4]:
flower_data.describe()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


## Handling data normalization

Data normalization transforms features onto a common scale so that differences in magnitude do not dominate an analysis and values become directly comparable. It is a common preprocessing step: scale-sensitive methods, such as distance-based models and models trained by gradient descent, often perform better or converge faster on normalized inputs.

1. **Min-Max Normalization**

This method rescales each feature to a fixed range, typically between 0 and 1, using the feature's minimum (`x_min`) and maximum (`x_max`). The formula for min-max normalization is:

```python
x_norm = (x - x_min) / (x_max - x_min)
```
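As a quick sanity check of the formula, the first sepal length in the dataset (5.1 cm) can be normalized by hand with the column minimum (4.3) and maximum (7.9) reported by `describe()`. This is only an illustrative calculation; the variable names are placeholders.

```python
# Values taken from the describe() output for SepalLengthCm
x = 5.1       # first row's sepal length
x_min = 4.3   # column minimum
x_max = 7.9   # column maximum

x_norm = (x - x_min) / (x_max - x_min)
print(round(x_norm, 4))  # 0.2222
```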

In [10]:
flower_data.SepalLengthCm.describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: SepalLengthCm, dtype: float64

In [6]:
min_length = flower_data.SepalLengthCm.min()
max_length = flower_data.SepalLengthCm.max()

In [7]:
norm_length = (flower_data.SepalLengthCm - min_length) / (max_length - min_length)

In [9]:
norm_length.describe()

count    150.000000
mean       0.428704
std        0.230018
min        0.000000
25%        0.222222
50%        0.416667
75%        0.583333
max        1.000000
Name: SepalLengthCm, dtype: float64
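The manual steps above could be wrapped in a small reusable function. The sketch below is one possible version, not part of the original notebook: `min_max_normalize`, `new_min`, and `new_max` are hypothetical names, and the guard for constant columns is an added assumption about how such a case should be handled.

```python
import pandas as pd

def min_max_normalize(series, new_min=0.0, new_max=1.0):
    """Rescale a numeric Series to [new_min, new_max] (hypothetical helper)."""
    lo, hi = series.min(), series.max()
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map everything to new_min
        # instead of dividing by zero.
        return pd.Series(new_min, index=series.index, name=series.name, dtype=float)
    return (series - lo) / span * (new_max - new_min) + new_min

# The five sepal lengths shown by head()
sample = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="SepalLengthCm")
scaled = min_max_normalize(sample)
print(scaled.min(), scaled.max())
```

On the full dataset the call would be `min_max_normalize(flower_data.SepalLengthCm)`, which should reproduce `norm_length`.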

In [10]:
# Using Sklearn
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()

norm_data = minmax_scaler.fit_transform(flower_data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].values)

In [11]:
norm_data

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     , ... (output truncated)
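`fit_transform` learns the per-column minimum and maximum and applies them in one call. One common practice, not shown in the cells above, is to call `fit` on one portion of the data and `transform` on another so that both are scaled with the same parameters. The sketch below illustrates this on the five rows shown by `head()`; the split into `X_train` and `X_new` is arbitrary and purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First five rows from head(): sepal length, sepal width,
# petal length, petal width (cm)
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# Arbitrary split for illustration: fit on the first four rows only
X_train, X_new = X[:4], X[4:]

scaler = MinMaxScaler()
scaler.fit(X_train)                      # learns per-column min and max
X_new_scaled = scaler.transform(X_new)   # reuses the learned min and max

# inverse_transform maps scaled values back to the original units
X_new_restored = scaler.inverse_transform(X_new_scaled)
print(np.allclose(X_new_restored, X_new))
```

Values in `X_new` that lie outside the range seen during `fit` (here the sepal width of 3.6) end up outside the interval from 0 to 1 after scaling.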

2. **Z-score Normalization / StandardScaler**

   This method scales the data to have zero mean and unit variance. The formula for z-score normalization is:

```python
x_norm = (x - mean) / std
```

where `x` is the original value, `mean` and `std` are the mean and standard deviation of the data, respectively, and `x_norm` is the normalized value.
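As a small worked example, the formula can be applied by hand to a sepal length of 5.1 cm using the rounded mean and standard deviation from the `describe()` output. The numbers are copied from that output; the variable names are placeholders.

```python
# Rounded statistics from the describe() output for SepalLengthCm
x = 5.1
mean = 5.843333
std = 0.828066   # pandas reports the sample standard deviation (ddof=1)

x_norm = (x - mean) / std
print(round(x_norm, 3))  # -0.898
```

A negative result means the value lies below the column mean, here by a little under one standard deviation.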

In [12]:
mean_length = flower_data.SepalLengthCm.mean()
std_length = flower_data.SepalLengthCm.std()

In [13]:
norm_length = (flower_data.SepalLengthCm - mean_length) / std_length

In [14]:
norm_length.describe()

count    1.500000e+02
mean    -5.684342e-16
std      1.000000e+00
min     -1.863780e+00
25%     -8.976739e-01
50%     -5.233076e-02
75%      6.722490e-01
max      2.483699e+00
Name: SepalLengthCm, dtype: float64

In [15]:
# Sklearn
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()

norm_data = std_scaler.fit_transform(flower_data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].values)

In [17]:
pd.DataFrame(norm_data).describe()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | -4.736952e-16 | -6.631732e-16 | 3.315866e-16 | -2.842171e-16 |
| std | 1.00335 | 1.00335 | 1.00335 | 1.00335 |
| min | -1.870024 | -2.438987 | -1.568735 | -1.44445 |
| 25% | -0.9006812 | -0.5877635 | -1.227541 | -1.181504 |
| 50% | -0.05250608 | -0.1249576 | 0.3362659 | 0.1332259 |
| 75% | 0.6745011 | 0.5692513 | 0.7627586 | 0.7905908 |
| max | 2.492019 | 3.114684 | 1.786341 | 1.710902 |
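The `std` row reads 1.00335 rather than exactly 1. This follows from two different conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). For `n` rows the two differ by a factor of `sqrt(n / (n - 1))`. The sketch below checks that factor for `n = 150` and reproduces the effect on the five sepal lengths shown by `head()`.

```python
import numpy as np

n = 150
print(round(np.sqrt(n / (n - 1)), 5))  # 1.00335

# Same effect on the five sepal lengths shown by head()
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
z = (x - x.mean()) / x.std(ddof=0)   # population std, as StandardScaler uses

print(np.isclose(z.std(ddof=0), 1.0))             # True
print(np.isclose(z.std(ddof=1), np.sqrt(5 / 4)))  # True
```

The same difference explains why the manual z-scores computed with pandas `.std()` are slightly smaller in magnitude than the scikit-learn values.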


3. **Log Transformation**

   This method applies a logarithm to the data, which compresses large values and reduces right skew, often making the distribution more symmetric. It requires strictly positive values. The formula for the log transformation is:

```python
x_norm = log(x)
```

where `x` is the original value, and `x_norm` is the normalized value.
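The logarithm is undefined at zero and for negative numbers. When a feature is non-negative but may contain zeros, one commonly used variant is `log(1 + x)`, available in NumPy as `np.log1p`, with `np.expm1` as its inverse. The sketch below uses made-up values purely to illustrate the round trip; it is a variant of the formula above, not the same transformation.

```python
import numpy as np

# Made-up non-negative values, including a zero that np.log cannot handle
x = np.array([0.0, 0.2, 1.4, 5.1])

x_log = np.log1p(x)        # log(1 + x), defined at x = 0
x_back = np.expm1(x_log)   # exp(x_log) - 1, the inverse

print(np.allclose(x_back, x))  # True
```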

In [18]:
import numpy as np

In [20]:
np.log(flower_data.SepalLengthCm).describe()

count    150.000000
mean       1.755393
std        0.141189
min        1.458615
25%        1.629241
50%        1.757858
75%        1.856298
max        2.066863
Name: SepalLengthCm, dtype: float64
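Whether a log transformation actually makes a feature more symmetric depends on the data, so it can help to compare skewness before and after. The sketch below does this on a synthetic right-skewed sample rather than on the iris columns; `skewness` is a hypothetical helper that computes the population form of the skewness coefficient.

```python
import numpy as np

def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed population std."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Synthetic right-skewed sample, used only for illustration
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

print(skewness(skewed))          # clearly positive
print(skewness(np.log(skewed)))  # close to zero
```

The same check could be run on a real column, for example `skewness(flower_data.SepalLengthCm)` against `skewness(np.log(flower_data.SepalLengthCm))`.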

In [21]:
np.log(flower_data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].values)

array([[ 1.62924054,  1.25276297,  0.33647224, -1.60943791],
       [ 1.58923521,  1.09861229,  0.33647224, -1.60943791],
       [ 1.54756251,  1.16315081,  0.26236426, -1.60943791],
       [ 1.5260563 ,  1.13140211,  0.40546511, -1.60943791],
       [ 1.60943791,  1.28093385,  0.33647224, -1.60943791],
       [ 1.68639895,  1.36097655,  0.53062825, -0.91629073],
       [ 1.5260563 ,  1.22377543,  0.33647224, -1.2039728 ],
       [ 1.60943791,  1.22377543,  0.40546511, -1.60943791],
       [ 1.48160454,  1.06471074,  0.33647224, -1.60943791],
       [ 1.58923521,  1.13140211,  0.40546511, -2.30258509],
       [ 1.68639895,  1.30833282,  0.40546511, -1.60943791],
       [ 1.56861592,  1.22377543,  0.47000363, -1.60943791],
       [ 1.56861592,  1.09861229,  0.33647224, -2.30258509],
       [ 1.45861502,  1.09861229,  0.09531018, -2.30258509],
       [ 1.75785792,  1.38629436,  0.18232156, -1.60943791],
       [ 1.74046617,  1.48160454,  0.40546511, -0.91629073],
       [ 1.68639895,  1. ... (output truncated)

4. **Power Transformation**

   This method raises the data to a power to adjust the skewness and kurtosis of the distribution. With an exponent between 0 and 1 it compresses large values, which reduces right skew and can make the data more symmetric. The formula for the power transformation is:

```python
x_norm = sign(x) * abs(x) ** a
```

where `x` is the original value, `a` is the power parameter (typically between 0 and 1), `sign` returns the sign of `x` (+1, -1, or 0 when `x` is 0), and `abs` is the absolute value function. Raising `abs(x)` to the power `a` and then multiplying by `sign(x)` keeps the transformation defined for negative values and preserves the sign of each value.
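The role of the `sign` and `abs` terms is easiest to see on data with mixed signs. The sketch below uses made-up values and a hypothetical helper named `signed_power`; a plain `x ** 0.5` on the same array would produce `nan` for the negative entries.

```python
import numpy as np

def signed_power(x, a):
    """Apply sign(x) * |x| ** a element-wise (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.abs(x) ** a

# Made-up values with mixed signs
x = np.array([-9.0, -1.0, 0.0, 4.0, 16.0])
print(signed_power(x, 0.5))
```

With `a = 0.5` the result is the signed square root of each entry, so -9 maps to -3 and 16 maps to 4.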

In [40]:
pow_factor = 0.8
norm = np.sign(flower_data.SepalLengthCm) * np.power(np.abs(flower_data.SepalLengthCm), pow_factor)

In [41]:
norm.describe()

count    150.000000
mean       4.098643
std        0.464265
min        3.211994
25%        3.681766
50%        4.080773
75%        4.415135
max        5.225185
Name: SepalLengthCm, dtype: float64
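In the cells above the exponent 0.8 is chosen by hand. As a related but different technique, scikit-learn's `PowerTransformer` implements the Yeo-Johnson and Box-Cox transformations and estimates their parameter from the data by maximum likelihood. The sketch below runs it on a synthetic right-skewed feature; it does not use the signed power formula from this section.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature, used only for illustration
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# Yeo-Johnson also accepts zero and negative inputs; Box-Cox needs positives
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_trans = pt.fit_transform(X)

print(pt.lambdas_)      # one fitted parameter per column
print(X_trans.mean())   # approximately 0 because standardize=True
```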