# End-to-End Machine Learning

## Author: AC

## Date: 27-11-2025

### Learning Outcomes:
1. Master the End-to-End Machine Learning Workflow

- Execute a complete machine learning project loop: from data acquisition, cleaning, and feature engineering to model training and evaluation.

- Understand how to structure a data processing workflow that meets industry standards.

<br>

2. Perform Advanced Data Preprocessing

- Missing Values: Use SimpleImputer for median or mode imputation instead of dropping data.

- Outlier Detection: Apply various denoising strategies, including IQR, Winsorization (Capping), and Isolation Forest.

- Distribution Transformation: Identify skewed data and apply Log Transform or Yeo-Johnson power transformations to improve model performance.

<br>

3. Implement Feature Engineering & Encoding

- Categorical Encoding: Distinguish between and apply OneHotEncoder (for nominal data) and OrdinalEncoder/LabelEncoder (for ordinal/target data).

- Feature Scaling: Master the use of StandardScaler (Z-score) and MinMaxScaler, understanding why scaling is critical for many algorithms.

<br>

4. Automate with Scikit-Learn Pipelines

- Core Skill: Encapsulate preprocessing steps using ColumnTransformer and Pipeline.

- Value: Ensure consistent transformations across training and test sets to effectively prevent Data Leakage and simplify model deployment.

<br>

5. Optimize Model Selection & Evaluation

- Stratified Splitting: Use StratifiedShuffleSplit to ensure train/test set distributions match the overall population, avoiding sampling bias.

- Hyperparameter Tuning: Automate the search for the best model parameters (e.g., n_estimators, max_features) using GridSearchCV and RandomizedSearchCV.

# Project Setup & Data Split

## Import Libraries

In [169]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load and Read Dataset

In [170]:
# Url to Dataset
url = "https://raw.githubusercontent.com/ageron/handson-ml2/refs/heads/master/datasets/housing/housing.csv"

# Read the dataset from the url link
df = pd.read_csv(url)

# Display first row
df.head(1)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY


# Exploratory Data Analysis (EDA)

## Checking the distribution of categorical features

In [171]:
df["ocean_proximity"].value_counts()

# df["ocean_proximity"].unique() # check distinct categories

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

## Checking for missing values

In [172]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

## Checking for duplicates

In [173]:
df.duplicated().sum()  # Check for duplicates

np.int64(0)

## Checking Feature Correlations

In [174]:
corr_matrix = df.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

# corr_matrix #display all correlation between each category

median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64

## Data Splitting

### 1. Split and Create the 80% Train and 20% Test Dataset

In [175]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='ocean_proximity')
y = df['ocean_proximity']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    
    random_state=42,  
    shuffle=True        
)

### 2. Train–Validation–Test Split

**Train Set** — Used to Learn Patterns

**Validation Set** — Used to Tune the Model

**Test Set** — Used Once at the Very End

In [176]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='ocean_proximity')
y = df['ocean_proximity']

# Train + Temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    shuffle=True,
    stratify=y
)

# Temp → Val + Test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42,
    shuffle=True,
    stratify=y_temp
)

print(X_train.shape, X_val.shape, X_test.shape)

(14448, 9) (3096, 9) (3096, 9)


### 3. Stratified Shuffle Split

create train/test splits while keeping the same class proportions in each split as in the original dataset.

In [177]:
from sklearn.model_selection import StratifiedShuffleSplit


split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

for train_index, test_index in split.split(df, df['ocean_proximity']):
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]

print("Train Set Category Proportions")
print(strat_train_set['ocean_proximity'].value_counts(normalize=True))

print("="*30)

print("Test Set Category Proportions")
print(strat_test_set['ocean_proximity'].value_counts(normalize=True))

df = strat_train_set.copy()

Train Set Category Proportions
ocean_proximity
<1H OCEAN     0.442622
INLAND        0.317414
NEAR OCEAN    0.128807
NEAR BAY      0.110950
ISLAND        0.000208
Name: proportion, dtype: float64
Test Set Category Proportions
ocean_proximity
<1H OCEAN     0.442668
INLAND        0.317345
NEAR OCEAN    0.128714
NEAR BAY      0.110950
ISLAND        0.000323
Name: proportion, dtype: float64


# Data Preprocessing

## Data Cleaning

### 1. Remove duplicates

In [178]:
df = df.drop_duplicates()  # Remove these rows

### 2. Handle missing values using imputer

In [179]:
from sklearn.impute import SimpleImputer

# Create a Simple Imputer
imputer = SimpleImputer(strategy='median') # strategy -> 'mean' / 'most_frequent' / 'median' / 'constant'

# Fill NAN using median
df['total_bedrooms'] = imputer.fit_transform(df[['total_bedrooms']])

print(imputer.statistics_)

[433.]


### 3. Remove rows with missing values

In [180]:
# Drop rows with empty values
df = df.dropna(axis=1)

### 4. Remove outliers using **IQR**

In [181]:
# IQR
Q1 = df['median_house_value'].quantile(0.25)
Q3 = df['median_house_value'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['median_house_value'] < lower_bound) | (df['median_house_value'] > upper_bound)]

# plt.boxplot(df['median_house_value'])
# plt.show()

### 5. Capping/Winsorizing

Winsoring columns to reduce the impact of extreme outliers

In [182]:
lower = df['median_house_value'].quantile(0.01)
upper = df['median_house_value'].quantile(0.99)
df['median_house_value'] = df['median_house_value'].clip(lower, upper)

### 6. Isolation Forest

To automatically detect outliers in a dataset

In [183]:
from sklearn.ensemble import IsolationForest

def detect_outliers(df):
    #only works with numerical features
    num_cols = df.select_dtypes(include='number').columns

    # Initialize Isolation Forest
    iso = IsolationForest(contamination=0.01, random_state=42) # expect about 1% of samples to be anomalies

    # # Train the model and let it label which rows are anomalies (-1) or normal (1)
    pred = iso.fit_predict(df[num_cols])

    # Add the anomaly results to the dataframe
    df["outlier"] = (pred == -1).astype(int)

    return df

df = detect_outliers(df)
df[df["outlier"] == 1].head(2)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,outlier
9185,-118.56,34.41,4.0,17313.0,3224.0,6902.0,2707.0,5.6798,320900.0,<1H OCEAN,1
9159,-118.45,34.44,16.0,13406.0,2574.0,7030.0,2440.0,4.6861,187900.0,<1H OCEAN,1


## Feature Transformation

### Log Transform

Suitable for: **Right-skewness**

- All values are positive

This step can be used to examine the skewness patterns.

In [184]:
# num_cols = df.select_dtypes(include='number').columns
#
# df[num_cols].hist(figsize=(10, 8), bins=50)
# plt.tight_layout()
# plt.show()

After that, we choose columns to transform

In [185]:
df_log = df.copy()
cols = ["total_rooms","total_bedrooms","population","households","median_income","median_house_value"]
for col in cols:
    df_log[col+"_log"] = np.log1p(df_log[col])

#### Yeo-Johnson

Automatically finds skewed numerical features in your DataFrame and applies the Yeo-Johnson transformation to reduce the skewness.

- Data included zero or negative

- mixed skewness

In [186]:
import numpy as np
from sklearn.preprocessing import PowerTransformer

def auto_yeojohnson(df, threshold=1.0):
    num_cols = df.select_dtypes(include="number").columns
    skewed = df[num_cols].skew()
    cols = skewed[skewed > threshold].index.tolist()

    pt = PowerTransformer(method="yeo-johnson")
    df[cols] = pt.fit_transform(df[cols])

    print("Applied Yeo-Johnson to:", cols)
    return df

df3 = auto_yeojohnson(df)

Applied Yeo-Johnson to: ['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'outlier']


## Feature Encoding

### 1. One Hot Encoder

Suitable for: **Feature without order**

- Convert unordered categorical features into multiple binary columns (one per category)



In [187]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['ocean_proximity']])

encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['ocean_proximity']))
df3 = pd.concat([df.drop('ocean_proximity', axis=1), encoded_df], axis=1)

df3.head(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,outlier,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
11844,-120.92,40.02,35.0,-2.044276,-1.945714,-2.091243,-2.096325,-0.604136,102500.0,-0.100686,0.0,1.0,0.0,0.0,0.0
10949,-117.87,33.75,18.0,-1.424389,-0.772604,-0.526188,-0.870282,-0.590028,162500.0,-0.100686,1.0,0.0,0.0,0.0,0.0
17965,-121.99,37.32,20.0,1.101247,1.029851,0.825399,1.034451,0.649832,217700.0,-0.100686,,,,,


### 2. Label Encoder

Suitable for: **Target Variable**

- Convert class labels (strings) into integer codes, typically for target variable (y)

In [188]:
from sklearn.preprocessing import LabelEncoder

Encoder = LabelEncoder()
df2 = df.copy()
df2['ocean_proximity'] = Encoder.fit_transform(df2['ocean_proximity'])

df2['ocean_proximity'].value_counts()

ocean_proximity
0    6395
1    4586
4    1861
3    1603
2       3
Name: count, dtype: int64

### 3. Ordinal Encoder

Suitable for: **Feature with order**

- Convert ordered categorical features into integer values representing ranking or level

In [189]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df2 = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'size': ['S', 'M', 'L', 'XL', 'M']
})
print(df2)

print("-"*15)
encoder = OrdinalEncoder()
df_encoded = df2.copy()
df_encoded[['color', 'size']] = encoder.fit_transform(df2[['color', 'size']])

print(df_encoded)

   color size
0    red    S
1  green    M
2   blue    L
3  green   XL
4    red    M
---------------
   color  size
0    2.0   2.0
1    1.0   1.0
2    0.0   0.0
3    1.0   3.0
4    2.0   1.0


## Feature Scaling

### 1. Standard Scaler

Scales features to have mean = 0 and standard deviation = 1

- Sensitive to outliers

In [190]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()

X = df.drop(columns=["ocean_proximity"])

X_scaled = scaler.fit_transform(X)

### 2. MinMax Scaler

Scales features to a fixed range, usually [0, 1]

- Sensitive to outliers

In [191]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X = df.drop(columns=["ocean_proximity"])

X_scaled = scaler.fit_transform(X)

### 3. Robust Scaler

Uses median and IQR (Interquartile Range)

- Data has outliers or is skewed
- Keeps most values within a manageable range without being affected by extreme values.

In [192]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X = df.drop(columns=["ocean_proximity"])

X_scaled = scaler.fit_transform(X)

## Pipelines

The pipeline combines preprocessing and model training into a single, unified workflow.

**Remind**: The steps must be executed in the correct order.

In [193]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

housing = strat_train_set.drop("median_house_value", axis=1) # features (X)
housing_labels = strat_train_set["median_house_value"].copy() # target (y)

# Identify numerical and categorical columns
num_cols = housing.select_dtypes(include='number').columns
cat_cols = housing.select_dtypes(exclude='number').columns

# Preprocessing for numerical features:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numerical and categorical transformers
# so each column receives the appropriate preprocessing.
preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ]
)

# Full pipeline:
clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', RandomForestRegressor(
        n_estimators=200,
        random_state=42
    ))
])

# Model Training & Hyperparameter Tuning

### 1. Grid Search

In [194]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Define the hyperparameter grid
param_grid = {
    'model__n_estimators': [30, 100],
    'model__max_features': [2, 4, 8],
}

# Create the GridSearchCV
grid = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    return_train_score=True
)

# Train using the training set (housing & housing_labels)
grid.fit(housing, housing_labels)

# Output the best-found hyperparameters
print("Best Parameters:", grid.best_params_)

# Convert the negative MSE back to a readable RMSE
best_rmse = np.sqrt(-grid.best_score_)
print("Best Cross-Validation Score (RMSE):", best_rmse)

# Extract features and labels from the stratified test set
X_test_final = strat_test_set.drop("median_house_value", axis=1)
y_test_final = strat_test_set["median_house_value"].copy()

# Retrieve the best model found by GridSearchCV
best_model = grid.best_estimator_
final_predictions = best_model.predict(X_test_final)

# Compute the final RMSE to measure real-world performanceE
final_mse = mean_squared_error(y_test_final, final_predictions)
final_rmse = np.sqrt(final_mse) # <--- 修正点3：计算 RMSE
print("Final RMSE:", final_rmse)

Best Parameters: {'model__max_features': 8, 'model__n_estimators': 100}
Best Cross-Validation Score (RMSE): 49684.84664464555
Final RMSE: 48244.37018042808


In [199]:
# Get the best model
final_model = grid.best_estimator_

# Get the feature importance values
feature_importances = final_model.named_steps["model"].feature_importances_

# Since you used a Pipeline (OneHotEncoder), the number of feature names increased.
# We need to align them. Here we simplify by only looking at the numerical values,
# or you can try to manually align the column names.

print("Feature importance ranking:")

# Assume 'attributes' is your list of feature names.
# (If it’s inconvenient to get the column names, you can directly print the values to see the distribution.)
# sorted(zip(feature_importances, attributes), reverse=True)

print(feature_importances)

Feature importance ranking:
[1.13307588e-01 1.03569127e-01 4.90404095e-02 2.99481958e-02
 2.79803109e-02 3.79115572e-02 2.56946273e-02 4.38899323e-01
 1.03560399e-02 1.54348407e-01 3.45530037e-04 2.20241299e-03
 6.39647136e-03]


### Randomized Search

In [195]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import randint

# 1. Load the Iris dataset
X, y = load_iris(return_X_y=True)

# 2. Train/Test Split
# Splits the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

# 3. Initialize a RandomForest classifier
rf = RandomForestClassifier(random_state=42)

# 4. Define "range-based" hyperparameter distributions (key concept of RandomizedSearchCV)
param_distributions = {
    'n_estimators': randint(50, 300),      # Randomly sample between 50 and 300 trees
    'max_depth': randint(3, 20),           # Randomly sample tree depth between 3 and 20
    'min_samples_split': randint(2, 10),   # Randomly sample minimum split size between 2 and 10
}

# 5. Set up Randomized Search
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=20,            # Draw 20 random combinations (not all)
    cv=5,                 # 5-fold cross-validation
    scoring='accuracy',   # Evaluate using accuracy
    random_state=42,
    n_jobs=-1             # Use all available CPU cores
)

# 6. Train the model (Random sampling + cross-validation on training set)
random_search.fit(X_train, y_train)

# 7. Display the best-found hyperparameters and CV score
print("Best Parameters:", random_search.best_params_)
print("Best Cross-Validation Score:", random_search.best_score_)

# 8. Evaluate the best model on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Set Accuracy:", accuracy_score(y_test, y_pred))

Best Parameters: {'max_depth': 9, 'min_samples_split': 5, 'n_estimators': 142}
Best Cross-Validation Score: 0.95
Test Set Accuracy: 1.0
