In [17]:
import pandas as pd
import sklearn
from sklearn.preprocessing import (MinMaxScaler, StandardScaler)
import numpy as np


**ETL (extract, transform, load) integration**: This involves extracting data from different sources, transforming it to meet the needs of the target system, and loading it into a target database or data warehouse. ETL tools are commonly used to automate this process, making it more efficient and less error-prone. Tools often used for this alongside Python include `Apache NiFi`, `Apache Airflow`, and `Apache Beam`: Airflow is a workflow orchestrator written in Python, Beam provides a Python SDK, and NiFi is a Java-based data-flow tool that often sits next to Python pipelines.
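
The three ETL stages can be sketched with nothing but the Python standard library. This is a minimal illustration, not a replacement for the tools named above: the CSV content, the `sales` table, and the column names are made up for the example, and an in-memory SQLite database stands in for a real target warehouse.

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (an in-memory string stands in for a real file).
raw_csv = "store,sales\nA,100.5\nB,\nC,250.0\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: drop rows with missing sales and convert the remaining values to floats.
clean_rows = [(r["store"], float(r["sales"])) for r in rows if r["sales"]]

# Load: write the cleaned rows into a target table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
conn.commit()

loaded = conn.execute("SELECT store, sales FROM sales ORDER BY store").fetchall()
print(loaded)
```

Store `B` has no sales value, so the transform step filters it out before loading.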

# Handling Data Normalization


Data normalization is the process of transforming data onto a common scale or range, so that differences in magnitude do not dominate and the values become more comparable and interpretable. Normalization is an important step in data preprocessing, as it can improve the accuracy and training behavior of machine learning models and other data analysis techniques, particularly those that are sensitive to feature scale.

In [2]:
data = pd.read_csv("C:\\Users\\Prabin\\clean_data.csv")

## 1. Min-Max Normalization
This method scales the data to a fixed range, typically between 0 and 1. The formula for min-max normalization is:
\begin{equation}
x_{\text{norm}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
\end{equation}
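
As a small sketch of the formula, the helper below applies it to a plain array. The function name `min_max_scale` and its `new_min`/`new_max` parameters are hypothetical, added here to show how the same idea extends to a range other than 0 to 1. The guard for constant input is one possible convention, chosen because the formula would otherwise divide by zero.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Rescale x linearly so its minimum maps to new_min and its maximum to new_max."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Constant input: the formula would divide by zero, so map everything to new_min.
        return np.full_like(x, new_min)
    unit = (x - x_min) / (x_max - x_min)  # the min-max formula, giving values in [0, 1]
    return new_min + unit * (new_max - new_min)

values = [10.0, 20.0, 40.0]
scaled = min_max_scale(values)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and every other value keeps its relative position between them.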


In [3]:
sales_min = data['sales'].min()
sales_max = data['sales'].max()

In [4]:
sales_normalization = (data['sales'] - sales_min) / (sales_max - sales_min)

In [5]:
sales_normalization.describe()

count    85870.000000
mean         0.015259
std          0.054357
min          0.000000
25%          0.000738
50%          0.003694
75%          0.011074
max          1.000000
Name: sales, dtype: float64

In [6]:
data['sales'].describe()

count     85870.000000
mean       5980.048970
std       21302.453181
min           0.000000
25%         289.144572
50%        1447.518085
75%        4339.950965
max      391900.008970
Name: sales, dtype: float64


#### Using sklearn
```python
scaler = MinMaxScaler()
scaler.fit(data[['sales']])  # fit expects 2-D input, so select the column as a DataFrame
scaler.transform(data[['sales']])
```

In [8]:
scaler = MinMaxScaler() 
norm = scaler.fit_transform(data['sales'].values.reshape(-1, 1))

In [9]:
norm

array([[2.84810425e-03],
       [6.30484573e-04],
       [1.53688397e-03],
       ...,
       [3.96281792e-03],
       [5.33952966e-03],
       [5.06251583e-08]])

In [10]:
print(norm.min())
print(norm.max())

0.0
1.0


## 2. Z-score Normalization
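Z-score normalization (standardization) centers the data at zero and rescales it to unit standard deviation. Written in the same notation as the min-max formula, with $\mu$ for the mean and $\sigma$ for the standard deviation of the column (symbols introduced here for illustration), the computation in the next cell corresponds to:
\begin{equation}
x_{\text{norm}} = \frac{x - \mu}{\sigma}
\end{equation}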

In [11]:
x_mean = data['sales'].mean()
x_std = data['sales'].std()  # pandas uses the sample standard deviation (ddof=1) by default
sales_norm = (data['sales'] - x_mean) / x_std
sales_norm.describe()

count    8.587000e+04
mean    -9.681321e-18
std      1.000000e+00
min     -2.807211e-01
25%     -2.671478e-01
50%     -2.127704e-01
75%     -7.699104e-02
max      1.811622e+01
Name: sales, dtype: float64

In [12]:
# Using sklearn
scaler = StandardScaler()
norm = scaler.fit_transform(data['sales'].values.reshape(-1, 1))
print(f"{norm.mean():.4f}")
print(f"{norm.std():.4f}")

-0.0000
1.0000
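
One detail worth knowing when comparing the manual and sklearn results: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` and NumPy's `.std()` use the population standard deviation (`ddof=0`). The sketch below uses a small made-up array to show that the two z-score variants differ only by a constant factor, which is negligible for a column with tens of thousands of rows.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

z_population = (x - x.mean()) / x.std(ddof=0)  # population std, as StandardScaler uses
z_sample = (x - x.mean()) / x.std(ddof=1)      # sample std, the pandas default

# Each variant has unit standard deviation under its own convention.
print(z_population.std(ddof=0))
print(z_sample.std(ddof=1))

# The two results differ by the constant factor sqrt((n - 1) / n).
factor = np.sqrt((n - 1) / n)
print(np.allclose(z_population * factor, z_sample))
```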


## 3. Log transformation

In [13]:
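# Note: sales contains zeros (its minimum is 0), and np.log(0) evaluates to -inf with a RuntimeWarning
norm = np.log(data['sales'])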
norm.head()

  result = getattr(ufunc, method)(*inputs, **kwargs)


0    7.017660
1    5.509740
2    6.400764
3    2.783431
4    5.523882
Name: sales, dtype: float64
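
A log transform compresses large values and is often applied to right-skewed data such as sales. Because the natural logarithm is undefined at zero, one common workaround is `np.log1p`, which computes log(1 + x). This is an alternative to the cell above, not what the notebook itself does, and the small array below is made up for illustration.

```python
import numpy as np

sales = np.array([0.0, 9.0, 99.0, 999.0])

# Plain log: zero maps to -inf (the warning is silenced here only for the demo).
with np.errstate(divide="ignore"):
    plain_log = np.log(sales)

# log1p computes log(1 + x), so zero maps cleanly to zero.
shifted_log = np.log1p(sales)

# expm1 is the inverse of log1p and recovers the original values.
restored = np.expm1(shifted_log)

print(plain_log[0], shifted_log[0])
```

Note that `np.log1p` shifts every value, so its results are not identical to `np.log` for non-zero inputs, although the difference shrinks as the values grow.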

## 4. Power Transformation

In [14]:
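power_factor = 0.5  # 0.5 corresponds to a square-root transform
# Apply the power to |x| and restore the sign, so negative values would not produce NaN
norm = np.sign(data['sales'])*np.power(np.abs(data['sales']), power_factor)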


In [16]:
norm.describe()

count    85870.000000
mean        51.291368
std         57.872995
min          0.000000
25%         17.004252
50%         38.046262
75%         65.878304
max        626.019176
Name: sales, dtype: float64
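
To make the sign-preserving form of the power transform concrete, the sketch below wraps the same expression in a helper and applies it to a tiny made-up array that includes a negative value. The name `signed_power` is hypothetical. Applying the reciprocal power inverts the transform.

```python
import numpy as np

def signed_power(x, power):
    """Raise |x| to the given power and restore the original sign."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.power(np.abs(x), power)

values = np.array([-4.0, 0.0, 9.0])

rooted = signed_power(values, 0.5)    # signed square root
restored = signed_power(rooted, 2.0)  # the reciprocal power undoes the transform

print(rooted)
print(restored)
```

A plain `np.power(values, 0.5)` would return NaN for the negative entry, which is why the sign is handled separately.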