3.1 Introduction

Data transformation changes the format, structure, or values of data so that it is consistent, of good quality, and ready for analysis. This chapter covers techniques for transforming data, including normalization, standardization, and encoding categorical variables.

Objectives
Understand different data transformation techniques.
Apply these techniques to prepare data for analysis and modeling.

3.2 Techniques
1. Normalization
2. Standardization
3. Encoding Categorical Variables
   1. One-Hot Encoding
   2. Label Encoding
4. Binning
5. Log Transformation
6. Polynomial Transformation
7. Box-Cox Transformation
8. Feature Scaling
9. Text Data Transformation
10. Tokenization
11. Stemming/Lemmatization
12. Handling Dates and Times
13. Aggregations and Rolling Calculations
14. Discretization

3.2.1 Normalization

Introduction to Normalization
Normalization rescales the values of a feature to a common range, usually [0, 1]. Putting features on a comparable scale helps them contribute equally to the analysis and keeps any single feature from dominating merely because of its larger scale.

Techniques
Min-Max Normalization: Linearly rescales a feature to a fixed range (typically [0, 1]), mapping the feature's minimum to 0 and its maximum to 1.
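
Written out, the rescaling takes the following form. The symbols x, x_min, and x_max are introduced here for illustration and denote a value of the feature and the feature's observed minimum and maximum:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

For the 'Price' column of the Products.csv data used in this section, the smallest non-missing price is 9.99 and the largest is 49.99, so a price of 19.99 maps to (19.99 - 9.99) / (49.99 - 9.99) = 10 / 40 = 0.25.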

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Read the data from the specified location
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter03 Data Transformation/Products.csv')

# Display the initial DataFrame
print("Initial DataFrame:")
print(df.to_string(index=False))

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

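# Apply Min-Max Normalization to the 'Price' column.
# fit_transform returns a 2-D array of shape (n_rows, 1), so flatten it before assigning.
# Missing prices (NaN) are ignored when fitting and remain NaN in the result.
df['Price_Normalized'] = scaler.fit_transform(df[['Price']]).ravel()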

# Display the DataFrame after normalization
print("\nDataFrame After Min-Max Normalization:")
print(df.to_string(index=False))


Initial DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class

DataFrame After Min-Max Normalization:
 Product ID Product Name  Price    Category  Stock              Description  Price_Normalized
          1

Explanation:

Read the Data: Load the dataset from the specified location using pd.read_csv().

Initial Display: Display the DataFrame to see the data before applying normalization.

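Initialize Scaler: Create a MinMaxScaler instance from sklearn.preprocessing. Its default target range is [0, 1].

Apply Normalization: Fit the scaler to the 'Price' column, transform it, and store the result in a new 'Price_Normalized' column. Rows with a missing price stay NaN in the new column.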

Final Display: Display the DataFrame after applying normalization.
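
The steps above can also be checked by hand. The following is a minimal sketch in plain Python, not the scikit-learn implementation: the helper name min_max_normalize is hypothetical, the prices are copied from the initial DataFrame shown earlier, and mapping a constant column to 0.0 is an assumption made for this sketch.

```python
import math

def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; NaN entries stay NaN."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        return list(values)  # nothing to scale
    lo, hi = min(present), max(present)
    span = hi - lo
    if span == 0:
        # Constant column: the formula would divide by zero.
        # Mapping to 0.0 is a choice made for this sketch.
        return [v if math.isnan(v) else 0.0 for v in values]
    return [v if math.isnan(v) else (v - lo) / span for v in values]

# 'Price' values copied from the initial DataFrame (NaN marks a missing price)
prices = [19.99, 29.99, 15.00, math.nan, 9.99,
          25.00, math.nan, 39.99, math.nan, 49.99]
normalized = min_max_normalize(prices)

# Print each original price next to its rescaled value
for price, scaled in zip(prices, normalized):
    print(price, "->", scaled)
```

The lowest price (9.99) maps to 0 and the highest (49.99) maps to 1, while missing prices remain missing. In pandas, the same arithmetic on a Series s can be written as (s - s.min()) / (s.max() - s.min()), since min() and max() skip NaN by default.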