Lecture: AI I - Basics 

Previous:
[**Chapter 4.1: Data Preparation with scikit-learn**](../01_data_preparation.ipynb)

---

# Solution 4.1: Data Preparation with scikit-learn

- [Task 1: Missing Value Imputation](#task-1-missing-value-imputation)
- [Task 2: Data Scaling](#task-2-data-scaling)
- [Task 3: Building a Preprocessing Pipeline](#task-3-building-a-preprocessing-pipeline)

> Hint: When doing the exercises, put your solution in the designated "Solution" section:
> ```python
> # Solution (put your code here)
> ```

## Task 1: Missing Value Imputation

Missing data is common in real-world datasets, and many scikit-learn estimators raise an error when their input contains `NaN`. You'll practice different imputation strategies to handle missing values.
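
As a rough illustration of how the strategies differ, the sketch below fits a mean and a median imputer on a hypothetical one-column array (`toy` is made up for this example and is not part of the exercise data). It reads the learned fill values from the fitted imputer's `statistics_` attribute; the outlier pulls the mean far more than the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column: one missing entry and one outlier (100.0).
toy = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

toy_mean_imputer = SimpleImputer(strategy="mean").fit(toy)
toy_median_imputer = SimpleImputer(strategy="median").fit(toy)

# statistics_ holds the per-column fill value learned during fit;
# NaN entries are ignored when the statistic is computed.
mean_fill = toy_mean_imputer.statistics_[0]      # mean of 1, 2, 3, 100
median_fill = toy_median_imputer.statistics_[0]  # median of 1, 2, 3, 100

print("mean fill:", mean_fill)
print("median fill:", median_fill)
```

Which strategy is appropriate depends on the data; the median is usually the more robust choice when a column contains outliers.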

a) Use `SimpleImputer` to replace missing values with the mean and store it in `mean_imputed`.

In [1]:
# prerequisites (don't edit this block)
import numpy as np
from sklearn.impute import SimpleImputer

np.random.seed(42)
data_with_nan = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, np.nan],
    [np.nan, 11.0, 12.0],
    [13.0, 14.0, 15.0]
])

In [2]:
# Solution (put your code here)
mean_imputer = SimpleImputer(strategy='mean')
mean_imputed = mean_imputer.fit_transform(data_with_nan)

In [3]:
# Test case (don't edit this block)
assert ~np.isnan(mean_imputed).any()
assert mean_imputed[1, 1] == 8.75

b) Use `SimpleImputer` to replace missing values with the median and store it in `median_imputed`.

In [4]:
# Solution (put your code here)
median_imputer = SimpleImputer(strategy='median')
median_imputed = median_imputer.fit_transform(data_with_nan)

In [5]:
# Test case (don't edit this block)
assert ~np.isnan(median_imputed).any()
assert median_imputed[1, 1] == 9.5

c) Use `SimpleImputer` to replace missing values with a constant value (`0`) and store it in `constant_imputed`.

In [6]:
# Solution (put your code here)
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
constant_imputed = constant_imputer.fit_transform(data_with_nan)

In [7]:
# Test case (don't edit this block)
assert ~np.isnan(constant_imputed).any()
assert constant_imputed[1, 1] == 0

## Task 2: Data Scaling

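Features on very different scales (e.g., height in meters vs. age in years) can bias scale-sensitive algorithms such as k-nearest neighbors or models trained with gradient descent, because features with larger numeric ranges dominate. Scaling puts all features on a comparable range.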
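
To make the two scalers used in this task concrete, here is a minimal sketch that recomputes them by hand with NumPy and compares the result with scikit-learn. The array `X` and the names `lo` and `hi` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

# Standardization: subtract the column mean and divide by the column
# standard deviation (population std, i.e. ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling to [0, 1]: shift by the column minimum and divide
# by the column range.
unit_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Min-max scaling to an arbitrary range [lo, hi]: stretch and shift
# the [0, 1] result.
lo, hi = -1.0, 1.0
ranged_manual = unit_manual * (hi - lo) + lo

z_sklearn = StandardScaler().fit_transform(X)
unit_sklearn = MinMaxScaler().fit_transform(X)
ranged_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(unit_manual, unit_sklearn))
print(np.allclose(ranged_manual, ranged_sklearn))
```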

a) Apply `StandardScaler` to standardize the data (mean=0, std=1 per feature) and store it in `standard_scaled`.

In [8]:
# prerequisites (don't edit this block)
from sklearn.preprocessing import StandardScaler, MinMaxScaler

sample_data = np.array([
    [1.75, 70.5, 25], 
    [1.80, 85.2, 30],
    [1.65, 58.9, 22],
    [1.90, 95.1, 35],
    [1.70, 68.3, 28]
])

In [9]:
# Solution (put your code here)
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(sample_data)

In [10]:
# Test case (don't edit this block)
assert np.isclose(np.min(standard_scaled), -1.355261854357877)
assert np.isclose(np.max(standard_scaled), 1.6274669424134713)

b) Apply `MinMaxScaler` to scale the data to the range [0, 1] and store it in `minmax_scaled`.

In [11]:
# Solution (put your code here)
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(sample_data)

In [12]:
# Test case (don't edit this block)
assert np.isclose(np.min(minmax_scaled), 0)
assert np.isclose(np.max(minmax_scaled), 1.0)

c) Apply `MinMaxScaler` to scale the data to the range [-1, 1] and store it in `minmax_neg_scaled`.

In [13]:
# Solution (put your code here)
minmax_neg_scaler = MinMaxScaler(feature_range=(-1, 1))
minmax_neg_scaled = minmax_neg_scaler.fit_transform(sample_data)

In [15]:
# Test case (don't edit this block)
assert np.isclose(np.min(minmax_neg_scaled), -1)
assert np.isclose(np.max(minmax_neg_scaled), 1.0)

## Task 3: Building a Preprocessing Pipeline

Real-world data preprocessing involves multiple steps. Pipelines chain transformations together so that the same steps run in the same order on every dataset. Fitting a pipeline on the training set only also helps prevent data leakage, because imputation and scaling statistics are then never computed from test data.

Real datasets also often mix numerical and categorical features that need different preprocessing. `ColumnTransformer` lets you apply different transformations to different columns in a single step.
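
The data leakage point can be sketched as follows: the pipeline is fitted on the training split only, and the test split is transformed with the statistics learned there. `X_train`, `X_test`, and `leak_free_pipeline` are hypothetical names and values chosen for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split with missing values in both parts.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0], [4.0, 40.0]])

leak_free_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the column means and standard deviations
# from the training data only.
train_out = leak_free_pipeline.fit_transform(X_train)

# transform reuses those training statistics; nothing is re-fitted
# on the test data.
test_out = leak_free_pipeline.transform(X_test)

print(test_out)
```

The missing test value in the first column is filled with the training mean, so it maps to zero after scaling.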

a) Use `Pipeline` from `sklearn.pipeline` with two named steps:
   - `'imputer'`: `SimpleImputer(strategy='mean')`
   - `'scaler'`: `StandardScaler()`
Store in variable `preprocessing_pipeline`.
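
One reason to name the steps is that each fitted step can be looked up afterwards via `named_steps`. The sketch below assumes a small hypothetical array `X_demo` and a separate pipeline `demo_pipeline`, so it does not interfere with the exercise variables.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in the second column.
X_demo = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 8.0]])

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
demo_pipeline.fit(X_demo)

# Look up the fitted steps by the names given in the Pipeline.
imputer_fill_values = demo_pipeline.named_steps["imputer"].statistics_
scaler_means = demo_pipeline.named_steps["scaler"].mean_

print("fill values:", imputer_fill_values)
print("scaler means:", scaler_means)
```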

In [16]:
# prerequisites (don't edit this block)
from sklearn.pipeline import Pipeline

In [17]:
# Solution (put your code here)
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

In [18]:
# Test case (don't edit this block)
np.random.seed(42)
raw_data = np.array([
    [1.75, 70.5, np.nan],
    [np.nan, 85.2, 30],
    [1.65, np.nan, 22],
    [1.90, 95.1, 35],
    [1.70, 68.3, 28],
    [1.82, 72.1, np.nan],
    [1.68, 61.4, 26],
    [np.nan, 88.9, 32]
])

transformed_data = preprocessing_pipeline.fit_transform(raw_data)
assert ~np.isnan(transformed_data).any()
assert np.isclose(np.min(transformed_data), -1.887677947616136)
assert np.isclose(np.max(transformed_data), 2.0044593143431815)

b) Define a `ColumnTransformer` with the following transformations:
- for numerical features (`age`, `salary`, `experience`), apply the `preprocessing_pipeline` from Task 3a and name it `'num'`.
- for categorical features (`department`), apply `OneHotEncoder()` and name it `'cat'`.
Store in variable `column_transformer`.

Additionally, fit and transform `mixed_data` using `column_transformer` and store the result in `transformed_mixed`.

In [19]:
# prerequisites (don't edit this block)
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

mixed_data = pd.DataFrame({
    'age': [25, 30, 35, 22, 28, 33],
    'salary': [50000, 60000, 75000, 45000, 55000, 70000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT'],
    'experience': [2.5, 5.0, 8.5, 1.0, 3.5, 6.0]
})

In [20]:
# Solution (put your code here)
column_transformer = ColumnTransformer(
    transformers=[
        ('num', preprocessing_pipeline, ['age', 'salary', 'experience']),
        ('cat', OneHotEncoder(), ['department'])
    ],
)

transformed_mixed = column_transformer.fit_transform(mixed_data)

In [30]:
# Test case (don't edit this block)
feature_names = column_transformer.get_feature_names_out()
assert len(feature_names) == 6

df = pd.DataFrame(transformed_mixed, columns=feature_names)
assert df["cat__department_Finance"].sum() == 1
assert df["cat__department_HR"].sum() == 2
assert df["cat__department_IT"].sum() == 3

assert not df["num__age"].isnull().any()
assert not df["num__salary"].isnull().any()
assert not df["num__experience"].isnull().any()

---

Next: [**Chapter 4.2: Machine Learning with scikit-learn**](../04_ml/02_machine_learning.ipynb)