# Introduction to Scikit-learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together

## 0. An end-to-end Scikit-Learn workflow

In [None]:
import warnings 
warnings.filterwarnings("default")

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import sklearn as sk


In [None]:
# 1. Get the data ready 
heart_disease = pd.read_csv("./data/heart-disease.csv")
heart_disease

In [None]:
# Create X (features matrix)
X = heart_disease.drop("target", axis = 1)

#Create Y (Labels)
Y = heart_disease["target"]



In [None]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)  # 100 is the default, so this argument is optional

# We'll keep the default hyperparameters 
clf.get_params()

In [None]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X,Y,test_size=0.2)

In [None]:
clf.fit(X_Train, Y_Train)

In [None]:
# make a prediction 
y_preds = clf.predict(X_Test)

In [None]:
y_preds

In [None]:
# 4. Evaluate the model 
clf.score(X_Train, Y_Train)

In [None]:
clf.score(X_Test, Y_Test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(Y_Test, y_preds))

In [None]:
confusion_matrix(Y_Test, y_preds)

In [None]:
accuracy_score(Y_Test,y_preds)

In [None]:
# 5. Improve a model
# Try different numbers of estimators (n_estimators)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_Train, Y_Train)
    print(f"Model accuracy on test set: {clf.score(X_Test, Y_Test) * 100:.2f}%")
    print(" ")

In [None]:
# 6. Save a model and load it 
import pickle

with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [None]:

with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

    
loaded_model.score(X_Test, Y_Test)

In [None]:
# Let's listify the contents 

whats_were_covering = [
"0. An end-to-end Scikit Learn workflow",
"1. Getting the data ready",
"2. choose the right estimator/algorithm for our problems",
"3. Fit the model/algorithm and use it to make predictions on our data",
"4. Evaluating a model",
"5. Improve a model",
"6. Save and load a trained model",
"7. Putting it all together"]

In [None]:
whats_were_covering

## 1. Getting the data ready
Three main things we have to do:
1. Split the data into features and labels (X & y)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
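
The three steps can be sketched on a tiny made-up table before applying them to the real datasets. The column names and values below are hypothetical, and pandas is used for filling and encoding only to keep the sketch short; the notebook itself uses Scikit-Learn transformers for these steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column, one numeric column, both with a gap
toy = pd.DataFrame({
    "colour": ["red", "blue", None, "red"],
    "km": [1000.0, np.nan, 3000.0, 5000.0],
    "price": [10, 20, 30, 40],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# 2. Fill (impute) missing values: a placeholder category and the column mean
X_toy = X_toy.assign(
    colour=X_toy["colour"].fillna("missing"),
    km=X_toy["km"].fillna(X_toy["km"].mean()),
)

# 3. Encode the non-numerical column as numbers (one column per category)
X_toy_encoded = pd.get_dummies(X_toy, columns=["colour"], dtype=int)
X_toy_encoded
```

After these steps every column is numeric and free of missing values, which is what most Scikit-Learn estimators expect.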

In [None]:
heart_disease.head()

In [None]:
X = heart_disease.drop("target", axis = 1)
X.head()

In [None]:
y = heart_disease["target"]
y.head()

In [None]:
# Train and Test Sets
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2)

In [None]:
X_train.shape, X_test.shape, y_test.shape, y_train.shape

In [None]:
X.shape[0]*0.8

In [None]:
len(heart_disease)

### 1.1 Make sure it's all numerical

In [None]:
car_sales = pd.read_csv ("./data/car-sales-extended.csv")
car_sales.head()

In [None]:
len(car_sales)

In [None]:
car_sales.info()

In [None]:
#Split into X and y
X = car_sales.drop("Price", axis = 1)
y = car_sales["Price"]

# Split into Training and test
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

In [None]:
# Build ML model
from sklearn.ensemble import RandomForestRegressor # Predict a number

model = RandomForestRegressor()
# model.fit(X_train, y_train)  # raises an error because the model can't handle strings
# model.score(X_test, y_test)


In [None]:
from sklearn.preprocessing import OneHotEncoder # Turn the categories into numbers 
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]  # Doors has only 3 distinct values (3, 4 and 5), so we treat it as categorical too
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot,
                                 categorical_features)],
                               remainder = "passthrough")
transformed_X = transformer.fit_transform(X)

transformed_X


In [None]:
pd.DataFrame(transformed_X)  # the odometer column has not changed because it is not categorical

<img src="./images/one_hot.png"/>

In [None]:
# PREPROCESSING: One-Hot-Encoding for categorical columns (and passthrough for numeric columns)
#
# Motivation:
# Most ML algorithms in scikit-learn expect purely numerical input (a matrix in R^n).
# Raw categorical values such as "Toyota" or "Red" cannot be used directly.
# Also, we must NOT simply map categories to integers (e.g., Toyota=1, BMW=2),
# because that would create an artificial order and distance between categories.
#
# OneHotEncoder:
# - Learns the set of unique categories in each selected column during `fit()`.
# - During `transform()`, it converts each category into a binary vector:
#   Example for Colour = {Red, Blue, Black}:
#     Red   -> [1, 0, 0]
#     Blue  -> [0, 1, 0]
#     Black -> [0, 0, 1]
# - This produces "orthogonal" features (no implied ranking), which is the correct
#   mathematical representation for nominal variables.
# - By default, the output is a sparse matrix to save memory (because most entries are 0).
#
# Why "Doors" is treated as categorical:
# Even though Doors is numeric-looking (3, 4, 5), the values represent discrete types,
# not a continuous measurement. The difference between 3 and 4 doors is not a linear
# quantity in the same way as "Odometer (KM)". Treating it as categorical avoids
# misleading linear assumptions (especially important for linear models).
#
# ColumnTransformer:
# - Applies transformations to specific columns only.
# - Here: apply OneHotEncoder to ["Make", "Colour", "Doors"].
# - `remainder="passthrough"` ensures that all other columns (here: Odometer (KM))
#   remain in the dataset unchanged. Without this, those columns would be dropped.
#
# fit_transform(X):
# - `fit`: discovers categories in the selected columns (learns the encoding schema).
# - `transform`: outputs a final numerical design matrix that concatenates:
#     (a) one-hot encoded categorical columns
#     (b) untouched numerical columns (passthrough)
# - The resulting matrix `transformed_X` is ready for model training in scikit-learn.
#
# Best practice note:
# In a full ML workflow, you typically fit the transformer ONLY on training data
# (fit on X_train, transform X_train and X_test) to avoid data leakage.

In [None]:
# Another way: pandas get_dummies
# By default it only encodes object/category columns, so the integer Doors column is left as-is
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

In [None]:
# Let's refit the model 
np.random.seed(42)

X_train, X_test,y_train, y_test = train_test_split(transformed_X,y,test_size = 0.2)

model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)  # modest score: too little data; a real project would need more

### 1.2 What if there were missing values? 

1. Fill them with some value (also known as imputation).
2. Remove the samples with missing data altogether.
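
Both options can be tried quickly with pandas. This is a minimal sketch on made-up rows (not the contents of `car-sales-extended-missing-data.csv`), and the fill values are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one gap in each column
toy_missing = pd.DataFrame({
    "Make": ["Toyota", None, "BMW", "Honda"],
    "Odometer (KM)": [10_000.0, 20_000.0, np.nan, 40_000.0],
    "Price": [5_000.0, 7_000.0, 9_000.0, np.nan],
})

# Option 1: fill the gaps in the features
filled = toy_missing.copy()
filled["Make"] = filled["Make"].fillna("missing")
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(filled["Odometer (KM)"].mean())

# A row without a label cannot be used for supervised training, so drop it
filled = filled.dropna(subset=["Price"])

# Option 2: remove every row that has any missing value
dropped = toy_missing.dropna()
```

Filling keeps three of the four rows here, while dropping keeps only one, which is why imputation is often preferred on small datasets.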

In [None]:
# import car_sales missing data
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()  # number of missing values per column

In [None]:
# Create X & y
X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [None]:
# Let's try to convert our data to numbers

from sklearn.preprocessing import OneHotEncoder  # turns the categories into numbers
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]  # Doors has only 3 distinct values (3, 4 and 5), so we treat it as categorical too
one_hot = OneHotEncoder(sparse_output=False)  # return a dense array instead of a sparse matrix
transformer = ColumnTransformer([("one_hot", 
                                 one_hot,
                                 categorical_features)],
                               remainder = "passthrough")
transformed_X = transformer.fit_transform(X)

transformed_X

In [None]:
pd.DataFrame(transformed_X).head()

In [None]:
# Example 1: Standardization (Z-Score Scaling)
# This transforms each numerical feature so that it has a mean of 0
# and a standard deviation of 1.
# The resulting values can be negative or positive and are not bounded.
# Standardization is commonly used for distance-based or gradient-based
# algorithms such as Logistic Regression, SVMs, and KNN.

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Example dataset
data = {
    "Odometer_KM": [6000, 45000, 120000, 250000, 345000],
    "Repair_Cost": [100, 300, 700, 1200, 1700]
}

df = pd.DataFrame(data)

# Initialise the StandardScaler
scaler = StandardScaler()

# Fit + transform
df_scaled = pd.DataFrame(
    scaler.fit_transform(df),
    columns=df.columns
)

df_scaled


In [None]:
# Example 2: Normalization (Min-Max Scaling)
# This rescales each numerical feature to a fixed range between 0 and 1.
# The minimum value of a feature becomes 0, the maximum value becomes 1,
# and all other values are scaled proportionally in between.
# Min-Max scaling is often used when features have known bounds or when
# training neural networks, where a consistent input range is beneficial.

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Example dataset
data = {
    "Odometer_KM": [6000, 45000, 120000, 250000, 345000],
    "Repair_Cost": [100, 300, 700, 1200, 1700]
}

df = pd.DataFrame(data)

# Initialise the MinMaxScaler
scaler = MinMaxScaler()

# Fit + transform (on the full data, for demonstration only)
df_scaled = pd.DataFrame(
    scaler.fit_transform(df),
    columns=df.columns
)

df_scaled


### Option 2: Fill missing values with Scikit-Learn

In [None]:
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
#drop the rows with no labels
car_sales_missing.dropna(subset = ["Price"], inplace = True)
car_sales_missing.isna().sum()

In [None]:
# Split into X & y
X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [None]:
# Fill missing values with Scikit-Learn (best done before encoding into numbers)
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing", Doors with 4 and numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
car_sales_filled = pd.DataFrame(filled_X, columns=["Make","Colour","Doors","Odometer (KM)"])

In [None]:
car_sales_filled

In [None]:
car_sales_filled.isna().sum()

In [None]:
# y doesn't change; only X is replaced by the filled data
X = car_sales_filled


In [None]:
# Let's try to convert our data to numbers

from sklearn.preprocessing import OneHotEncoder  # turns the categories into numbers
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]  # Doors has only 3 distinct values (3, 4 and 5), so we treat it as categorical too
one_hot = OneHotEncoder(sparse_output=False)  # return a dense array instead of a sparse matrix
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")  # leave the other columns unchanged
transformed_X = transformer.fit_transform(X)

transformed_X

In [None]:
pd.DataFrame(transformed_X)

In [None]:
# Now we've got our data as numbers and filled (no missing data)
# Let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)  # lower score than before: maybe the wrong model, or we need more samples

In [None]:
whats_were_covering

## 2. Choose the right estimator/algorithm for our problems

Some things to note:
* Sklearn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not)
* Sometimes you'll see `clf` (short for classifier) used as a classification estimator
* Regression problem - predicting a number (selling price of a car)
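
If it is unclear which kind of estimator an object is, Scikit-Learn can tell you. A minimal sketch using the helper functions `is_classifier` and `is_regressor` from `sklearn.base`; the variable names are made up so they don't overwrite `clf` or `model` from the notebook.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf_example = RandomForestClassifier()  # predicts a category
reg_example = RandomForestRegressor()   # predicts a number

print(is_classifier(clf_example), is_regressor(clf_example))
print(is_classifier(reg_example), is_regressor(reg_example))
```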

<img src ="./images/sklearn-ml-map.png" />

### 2.1 Picking an ML model for a regression problem
Let's use the California housing dataset.

In [None]:
from sklearn.datasets import fetch_california_housing 

In [None]:
df = fetch_california_housing()

In [None]:
df 

In [None]:
housing_df = pd.DataFrame(df["data"])
housing_df

In [None]:
housing_df = pd.DataFrame(df["data"], columns = df["feature_names"])
housing_df

In [None]:
housing_df["target"]= df ["target"]
housing_df

In [None]:
housing_df = housing_df.drop("target", axis = 1)

In [None]:
#import algorithm
from sklearn.linear_model import Ridge 
from sklearn.model_selection import train_test_split
#setup random seed
np.random.seed(42)

# Create the data 
X = housing_df
y = df["target"]

# Split into train and test data

X_train, X_test, y_train, y_test = train_test_split(X,y)

# Instantiate and fit the model 
model = Ridge()
model.fit(X_train,y_train)

model.score(X_test,y_test)

In [None]:
# Variance describes the vertical spread of the target values around a reference value.
#
# - Total variance (SST):
#   Squared distances of the true values y from the mean y_mean.
#   It shows how much the data spread overall,
#   i.e. how much "information" or disorder the target values contain.
#
# - Unexplained variance (SSE):
#   Squared distances of the true values y from the model predictions y_pred.
#   It shows how much of the original spread is still left after the model
#   and cannot be explained by it.
#
# - Relation to R²:
#   R² = 1 - (unexplained variance / total variance)
#   → R² measures what share of the original spread
#     was reduced, i.e. explained, by the model.
#
# Visual intuition:
#   • distance point → mean       = total variance
#   • distance point → model line = unexplained variance

What if Ridge didn't work or the score didn't fit our needs?

Try an ensemble model.

In [None]:
# Ensemble methods combine several individual models (base learners)
# into one overall model in order to achieve more stable and more accurate
# predictions than a single model.
#
# Motivation:
# Single models often suffer from high bias (underfitting) or
# high variance (overfitting). Ensembles counteract this problem.
#
# Basic principle:
# - Several models are trained
# - Each model makes slightly different errors
# - Averaging or weighting partly cancels these errors out
#
# Main types of ensemble methods:
#
# 1) Bagging (e.g. Random Forest):
#    - Models are trained in parallel on random samples of the data
#    - Mainly reduces variance
#    - Particularly effective for unstable models (e.g. decision trees)
#
# 2) Boosting (e.g. Gradient Boosting, XGBoost):
#    - Models are trained sequentially
#    - Each new model focuses on the errors of the previous ones
#    - Mainly reduces bias
#
# 3) Stacking:
#    - Several different models are combined
#    - A meta-model learns how to blend their predictions optimally
#
# Relation to the bias-variance trade-off:
# Ensemble methods reduce variance, bias or both, and thereby lower
# the unexplained variance, which leads to better generalisation.

In [None]:
# import the RandomForestRegressor model class
from sklearn.ensemble import RandomForestRegressor

# Setup random seed 
np.random.seed(42)

# model 
rfg = RandomForestRegressor()

# data 
# Create the data 
X = housing_df
y = df["target"]

# Split into train and test data

X_train, X_test, y_train, y_test = train_test_split(X,y)

rfg.fit(X_train, y_train) 

rfg.score(X_test,y_test)

### 2.2 Picking an ML model for a classification problem

In [None]:
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

In [None]:
len(heart_disease)

In [None]:
# import the LinearSVC
from sklearn.svm import LinearSVC

# Setup random seed 
np.random.seed(42)

# getting data ready 
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,y)

#Instantiate LinearSVC
clf = LinearSVC()
clf.fit(X_train,y_train)

#Evaluate the Linear SVC
clf.score(X_test,y_test)

In [None]:
# LinearSVC works well!

In [None]:
# import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Setup random seed 
np.random.seed(42)

# getting data ready 
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,y)

#Instantiate RandomForest
clf = RandomForestClassifier()
#fit the model to the data (training)
clf.fit(X_train,y_train)

#Evaluate the Random Forest ( use the patterns the model has learned)
clf.score(X_test,y_test)

In [None]:
whats_were_covering

## 3. Fit the model/algorithm and use it to make predictions on our data

### 3.1 Fitting a model to the data

X = features, feature variables, data

y = labels, targets, target variables

In [None]:
# import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Setup random seed 
np.random.seed(42)

# getting data ready 
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,y)

#Instantiate RandomForest
clf = RandomForestClassifier()
#fit the model to the data (training)
clf.fit(X_train,y_train)

#Evaluate the Random Forest ( use the patterns the model has learned)
clf.score(X_test,y_test)

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# The fit method finds patterns that distinguish target = 1 from target = 0

### 3.2 Make predictions using a ML model

Two ways to make predictions:

1. predict()
2. predict_proba()
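
The relation between the two methods can be sketched on synthetic data. The dataset from `make_classification` and all variable names below are assumptions for illustration: `predict_proba()` returns one probability per class, and `predict()` returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

labels = demo_clf.predict(X_demo[:5])       # one label per sample
probs = demo_clf.predict_proba(X_demo[:5])  # one row per sample, one column per class

# Picking the most probable class per row reproduces predict()
labels_from_probs = demo_clf.classes_[np.argmax(probs, axis=1)]
```

The columns of `probs` follow the order of `demo_clf.classes_`, and each row sums to 1.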

In [None]:
# Use a trained model to make predictions with predict()

In [None]:
# clf.predict(np.array([1, 6, 7, 8, 5]))  # this doesn't work: the input must be 2D with the same features as X

In [None]:
clf.predict(X_test)

In [None]:
np.array(y_test)

In [None]:
# Compare predictions to the truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)  # this is the accuracy, which is what the score method returns

In [None]:
# Another way to get the accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

Make predictions with predict_proba()

In [None]:
# predict_proba returns probabilities of a classification label

In [None]:
clf.predict_proba(X_test[:5])

In [None]:
# Lets predict on the same data
clf.predict(X_test[:5])

`predict()` can also be used for regression models.

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df
y = pd.Series(fetch_california_housing()["target"])

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)

model = RandomForestRegressor()

model.fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)

In [None]:
y_preds

In [None]:
np.array(y_test[:10])

In [None]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

# Mean Absolute Error (MAE):
# The MAE states by how many units a model is off on average.
# For each prediction, the absolute deviation from the true value is computed
# (minus signs are ignored), and then the mean of these deviations is taken.
# The MAE has the same unit as the target variable and is easy to interpret.

In [None]:
whats_were_covering

## 4. Evaluating a model

Three ways to evaluate Scikit-Learn models/estimators:

1. Estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions
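
Side by side, the three approaches look roughly like this. The sketch uses synthetic data and `LogisticRegression` as stand-ins (both are assumptions, chosen to keep the example fast and deterministic); the notebook itself works with the heart disease data and random forests.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data
X_eval, y_eval = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_eval, y_eval, test_size=0.2, random_state=0)

eval_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# 1. Built-in score method (mean accuracy for classifiers)
built_in = eval_clf.score(X_te, y_te)

# 2. The scoring parameter of cross-validation helpers
cv_scores_demo = cross_val_score(
    LogisticRegression(max_iter=1000), X_eval, y_eval, cv=5, scoring="accuracy"
)

# 3. A metric function from sklearn.metrics
metric_fn = accuracy_score(y_te, eval_clf.predict(X_te))
```

Options 1 and 3 give the same number here because a classifier's `score()` is its accuracy; option 2 returns one score per fold.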

### 4.1 Evaluating a model with the score method

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Create X & y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

#Create train test
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2)

#Create Model
rfc = RandomForestClassifier()

# Fit model
rfc.fit(X_train, y_train)


In [None]:
rfc.score(X_train, y_train)  # for classifiers, score returns the mean accuracy: highest is 1.0, lowest is 0.0

In [None]:
rfc.score(X_test,y_test)

Let's use the data on a regression model 

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create X & y
X = pd.DataFrame(fetch_california_housing()["data"], columns = fetch_california_housing()["feature_names"])
y = fetch_california_housing()["target"]

#Create train test
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2)

#Create Model
rfc = RandomForestRegressor()

# Fit model
rfc.fit(X_train, y_train)

In [None]:
rfc.score(X_test, y_test)  # for regressors, score returns R^2: the best value is 1.0, and it can be negative for models worse than predicting the mean

### 4.2 Evaluating a model using the scoring parameter

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Create X & y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

#Create train test
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2)

#Create Model
rfc = RandomForestClassifier()

# Fit model
rfc.fit(X_train, y_train);

In [None]:
rfc.score(X_test,y_test)

<img src="./images/cross_validation.png" />
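
What `cross_val_score` does can be sketched as an explicit loop over folds. The example assumes synthetic data and a `DecisionTreeClassifier` with a fixed `random_state` so that the manual loop and the library call are exactly comparable; for classifiers, an integer `cv` uses stratified folds, which is why `StratifiedKFold` is used here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a deterministic model
X_cv, y_cv = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    # Train a fresh copy on four folds, score it on the held-out fold
    fold_model = clone(tree).fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

# The library helper runs the same procedure in one call
library_scores = cross_val_score(tree, X_cv, y_cv, cv=folds)
```

Every sample is used for testing exactly once, so the mean of the fold scores depends less on one lucky or unlucky split.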

In [None]:
cross_val_score(rfc, X, y, cv=5)  # returns an array with one score per fold

In [None]:
np.random.seed(42)
# Single training and test split score
# (rfc was fitted on this split's training data; clf came from an earlier split)
clf_single_score = rfc.score(X_test, y_test)

# Take the mean of the 5-fold cross-validation score
clf_cross_val_score = np.mean(cross_val_score(rfc, X, y))

# Compare the two
clf_single_score, clf_cross_val_score

In [None]:
# The scoring parameter is set to None by default
cross_val_score(rfc, X, y, scoring=None)

In [None]:
# Cross-validation is used to assess a model's actual performance reliably.
# Unlike a single train/test split, it averages the performance over several
# data splits and thereby reduces the influence of chance.
# It is the standard choice for model comparison, hyperparameter tuning and
# professional ML projects, while a single split is only suited to quick tests.

### 4.2.1 Classification model evaluation metrics

1. Accuracy
2. Area under ROC Curve
3. Confusion matrix
4. Classification report

### Accuracy

In [None]:
heart_disease.head()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed (42)
X = heart_disease.drop("target",axis=1)
y =  heart_disease["target"]

clf = RandomForestClassifier()

cross_val_score(clf, X, y, cv=5)  # returns one accuracy score per fold

In [None]:
cv_scores = cross_val_score(clf, X, y, cv=5)  # don't reuse the name cross_val_score, that would shadow the function
np.mean(cv_scores)

In [None]:
print(f"Heart Disease Classifier Cross-Validated Accuracy: {np.mean(cv_scores) * 100:.2f}%")

### Area under the receiver operating characteristic curve (AUC/ROC)

* Area under curve (AUC)
* ROC curve

ROC curves are a comparison of a model's true positive rate (tpr) versus a model's false positive rate (fpr).

* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1
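
Each point on a ROC curve comes from one decision threshold. A minimal sketch with made-up labels and probabilities (the arrays and the helper `rates_at` are hypothetical) that counts the four outcomes and turns them into the two rates:

```python
import numpy as np

# Made-up truth labels and predicted probabilities for the positive class
truth = np.array([1, 1, 1, 0, 0, 0, 0, 1])
positive_probs = np.array([0.9, 0.4, 0.8, 0.1, 0.6, 0.3, 0.2, 0.7])

def rates_at(threshold):
    """Return (tpr, fpr) when predicting 1 for probabilities >= threshold."""
    predicted = (positive_probs >= threshold).astype(int)
    tp = np.sum((predicted == 1) & (truth == 1))  # true positives
    fp = np.sum((predicted == 1) & (truth == 0))  # false positives
    fn = np.sum((predicted == 0) & (truth == 1))  # false negatives
    tn = np.sum((predicted == 0) & (truth == 0))  # true negatives
    return tp / (tp + fn), fp / (fp + tn)

tpr_half, fpr_half = rates_at(0.5)   # (0.75, 0.25)
tpr_low, fpr_low = rates_at(0.25)    # (1.0, 0.5)
```

Lowering the threshold catches more true positives but also lets through more false positives; sweeping the threshold traces out the full curve.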
  



In [None]:
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.2)

clf.fit(X_train,y_train)

In [None]:
from sklearn.metrics import roc_curve

# Make predictions with probabilities
y_probs = clf.predict_proba(X_test)
y_probs[:10]

In [None]:
y_probs_positive = y_probs[:,1]
y_probs_positive[:10]

In [None]:
# Calculate fpr, tpr and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)

# Check the false positive rates
fpr

In [None]:
# Create a function for plotting ROC curves
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr):
    """
    Plots a ROC curve given the fpr and tpr of a model.
    """
    # Plot the ROC curve
    plt.plot(fpr, tpr, color="orange", label="ROC")
    # Plot a line with no predictive power (baseline)
    plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Guessing")

    # Customize the plot
    plt.xlabel("False positive rate (fpr)")
    plt.ylabel("True positive rate (tpr)")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend()
    plt.show()

plot_roc_curve(fpr,tpr)

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test,y_probs_positive)

In [None]:
# perfect ROC and AUC score 
fpr, tpr, thresholds = roc_curve(y_test,y_test)
plot_roc_curve(fpr,tpr)

In [None]:
#perfect AUC score 
roc_auc_score(y_test,y_test)

In [None]:
# =============================================================================
# ROC & AUC: short summary
# =============================================================================
#
# ROC (Receiver Operating Characteristic):
# Shows how a classification model behaves across all possible thresholds.
# It plots the relationship between:
# - True Positive Rate  (how many actual positives are detected)
# - False Positive Rate (how many negatives are wrongly flagged as positive)
#
# AUC (Area Under the Curve):
# A single number that summarises the ROC curve.
# It measures how well the model separates the classes.
#
# Intuition:
# AUC = probability that a randomly chosen positive example
# receives a higher score than a randomly chosen negative one.
#
# Interpretation:
# AUC = 1.0 -> perfect separation
# AUC = 0.5 -> random guessing
#
# Rule of thumb:
# ROC/AUC do not evaluate one concrete decision,
# but how well the model separates positive from negative cases.
# =============================================================================
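
The ranking intuition behind AUC can be checked by brute force. This is a sketch on made-up labels and scores: it compares every positive example with every negative one and counts how often the positive gets the higher score (ties would count as half).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and model scores
labels_demo = np.array([0, 0, 1, 1, 0, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

pos = scores_demo[labels_demo == 1]
neg = scores_demo[labels_demo == 0]

# Fraction of (positive, negative) pairs that are ranked correctly
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

# The library value agrees with the pairwise count (8 of 9 pairs here)
auc = roc_auc_score(labels_demo, scores_demo)
```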

### Confusion Matrix

A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict.
In essence, it gives you an idea of where the model is getting confused.
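
The layout of the matrix is easiest to read on a small made-up example (the two label lists below are hypothetical). In Scikit-Learn, rows are the actual labels and columns are the predicted labels, so for binary labels 0 and 1 the flattened order is tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
# cm is [[3, 1],
#        [1, 3]]

# Unpack the four cells: row 0 = actual 0, row 1 = actual 1
tn, fp, fn, tp = cm.ravel()
```

The off-diagonal cells (`fp` and `fn`) are where the model is confused.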

In [None]:
from sklearn.metrics import confusion_matrix

y_preds = clf.predict(X_test)

confusion_matrix(y_test,y_preds)

In [None]:
# Visualize confusion matrix with pd.crosstab

pd.crosstab(y_test,y_preds, rownames = ["Actual Labels"], colnames = ["Predicted Labels"])

In [None]:
24+8+3+26

In [None]:
len(y_preds)

<img src="./images/cm_anatomy.png"/>

In [None]:
import sys  # How to install a new package from within a Jupyter notebook
# !conda install --yes --prefix {sys.prefix} seaborn  # already installed in this environment

In [None]:
# Make our Confusion matrix more visual with seaborn heatmap()
import seaborn as sns

#set the font scale 
sns.set(font_scale=1.5)
#Create a confusion matrix 
conf_mat = confusion_matrix(y_test,y_preds)

# Plot it using Seaborn
sns.heatmap(conf_mat)


### Creating a confusion matrix using Scikit-Learn


In [None]:
import sklearn
sklearn.__version__

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(estimator = clf, X = X, y= y)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true = y_test, y_pred= y_preds)

### Classification Report 

In [None]:
from sklearn.metrics import classification_report 

print(classification_report(y_test,y_preds))

<img src="./images/cr_anatomy.png"/>

In [None]:
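# Where precision and recall become valuable

disease_true = np.zeros(10000)
disease_true[0] = 1  # only one positive case

disease_preds = np.zeros(10000)  # the model predicts 0 for every sample

# zero_division=0 reports undefined metrics (no predicted positives) as 0 without a warning
pd.DataFrame(classification_report(disease_true, disease_preds, output_dict=True, zero_division=0))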

* Accuracy is a good measure to start with if all classes are balanced (e.g. the same number of samples labelled 0 and 1).
* Precision and recall become more important when classes are imbalanced.
* If false positive predictions are worse than false negatives, aim for higher precision.
* If false negative predictions are worse than false positives, aim for higher recall.
* F1-score is a combination of precision and recall (their harmonic mean).
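
The three metrics can be computed by hand from the counts of true positives, false positives and false negatives. A minimal sketch on made-up labels, with the Scikit-Learn functions used only as a cross-check:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 actual positives, 3 predicted positives
y_true_pr = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_pr = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_pr, y_pred_pr))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 2
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 1
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 2

precision = tp / (tp + fp)  # of all predicted positives, how many were right
recall = tp / (tp + fn)     # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Cross-check against the library functions
sk_precision = precision_score(y_true_pr, y_pred_pr)
sk_recall = recall_score(y_true_pr, y_pred_pr)
sk_f1 = f1_score(y_true_pr, y_pred_pr)
```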

### 4.2.2 Regression model evaluation metrics 

Model evaluation metrics documentation - https://scikit-learn.org/0.15/modules/model_evaluation.html#regression-metrics

The ones we are going to cover are: 

1. R^2 (coefficient of determination)
2. Mean absolute error (MAE)
3. Mean squared error (MSE)
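
All three metrics follow from the prediction errors. A short sketch on four made-up values that computes each one with NumPy and compares it with the matching function from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 2.0, 7.0])
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_predicted - y_actual

mae_manual = np.mean(np.abs(errors))  # average absolute error: 0.875
mse_manual = np.mean(errors ** 2)     # average squared error: 1.3125

sse = np.sum(errors ** 2)                         # unexplained spread
sst = np.sum((y_actual - y_actual.mean()) ** 2)   # total spread around the mean
r2_manual = 1 - sse / sst

# Cross-check against the library functions
mae_sklearn = mean_absolute_error(y_actual, y_predicted)
mse_sklearn = mean_squared_error(y_actual, y_predicted)
r2_sklearn = r2_score(y_actual, y_predicted)
```

Because MSE squares the errors, the single error of 2 contributes 4 of the 5.25 total, which is why MSE reacts more strongly to large errors than MAE.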

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df
y = df["target"]

X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.2)

model = RandomForestRegressor()

model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)

In [None]:
y_test

In [None]:
y_test.mean()

In [None]:
from sklearn.metrics import r2_score

# Fill an array with y_test mean
y_test_mean = np.full(len(y_test),y_test.mean())

In [None]:
y_test_mean[:10]

In [None]:
r2_score(y_true = y_test, 
        y_pred = y_test_mean)

In [None]:
r2_score(y_true = y_test, 
        y_pred = y_test)

### Mean absolute error

MAE is the average of the absolute differences between predictions and actual values.
It gives you an idea of how wrong your model's predictions are.

In [None]:
# MAE 
from sklearn.metrics import mean_absolute_error 

y_preds = model.predict(X_test)
mae = mean_absolute_error(y_test, y_preds)
mae

In [None]:
y_preds

In [None]:
y_test

In [None]:
df2 = pd.DataFrame( data = {"actual values": y_test, "predicted values": y_preds})
df2["diffrences"]= df2["predicted values"] - df2["actual values"]
df2.head(10)

In [None]:
# MAE computed by hand from the differences
np.abs(df2["diffrences"]).mean() # MAE

### Mean squared error 

MSE is the mean of the squared errors between actual and predicted values.

In [None]:
# Mean squared Error 
from sklearn.metrics import mean_squared_error

y_preds = model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
mse

In [None]:
df2["squared_diffrences"] = np.square(df2["diffrences"])
df2.head(10)

In [None]:
# Calculate MSE by hand 
squared = np.square(df2["diffrences"])
squared.mean()

In [None]:
df_large_error = df2.copy()
df_large_error.loc[0, "squared_diffrences"] = 16

In [None]:
df_large_error.head()

In [None]:
# Calculate MSE with large error 
df_large_error["squared_diffrences"].mean()

In [None]:
df_large_error.loc[1:100,"squared_diffrences"] = 20

In [None]:
df_large_error.head()

In [None]:
# Calculate MSE with large error 
df_large_error["squared_diffrences"].mean()

<img src="./images/regression_metrics.png" />

### 4.2.3 Finally using the scoring parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

clf = RandomForestClassifier()


In [None]:
np.random.seed(42)

# Cross-validated accuracy
# If scoring=None, the estimator's default scoring metric is used (accuracy for classifiers)
cv_acc = cross_val_score(clf, X, y, cv=5, scoring=None)
cv_acc

In [None]:
# Cross-validated accuracy 
print(f"The cross-validated accuracy is: {np.mean(cv_acc)*100:.2f}%")

In [None]:
np.random.seed(42)

cv_acc = cross_val_score(clf, X,y, cv = 5, scoring = "accuracy")
cv_acc

In [None]:
# Cross-validated accuracy 
print(f"The cross-validated accuracy is: {np.mean(cv_acc)*100:.2f}%")

In [None]:
#Precision
np.random.seed(42)
cv_precision = cross_val_score(clf, X,y, cv = 5, scoring = "precision")
cv_precision

In [None]:
# Cross-validated precision
print(f"The cross-validated precision is: {np.mean(cv_precision)*100:.2f}%")

In [None]:
# Recall 
np.random.seed(42)
cv_recall = cross_val_score(clf, X,y, cv = 5, scoring = "recall")
cv_recall

In [None]:
# Cross-validated recall
print(f"The cross-validated recall is: {np.mean(cv_recall)*100:.2f}%")

Let's see the `scoring` parameter being used for a regression problem.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df
y = df["target"]

model = RandomForestRegressor()

In [None]:
np.random.seed(42)

cv_r2 = cross_val_score(model,X,y, cv = 3, scoring = None)
np.mean(cv_r2)

In [None]:
# MAE 
cv_mae = cross_val_score(model,X,y, cv = 3, scoring = "neg_mean_absolute_error")
np.mean(cv_mae)

In [None]:
cv_mae

In [None]:
#MSE
cv_mse = cross_val_score(model,X,y, cv = 3, scoring = "neg_mean_squared_error")
np.mean(cv_mse)

In [None]:
cv_mse
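
The MAE and MSE scores above come out negative. Scikit-Learn scorers follow a "higher is better" convention, so error metrics are returned negated (hence the `neg_` prefix). A minimal sketch of turning them back into the usual values, assuming synthetic data from `make_regression` rather than the housing data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for this sketch, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
demo_model = RandomForestRegressor(n_estimators=20, random_state=42)

neg_mae = cross_val_score(demo_model, X_demo, y_demo, cv=3, scoring="neg_mean_absolute_error")
neg_mse = cross_val_score(demo_model, X_demo, y_demo, cv=3, scoring="neg_mean_squared_error")

mae_per_fold = -neg_mae             # flip the sign to get the usual MAE
rmse_per_fold = np.sqrt(-neg_mse)   # RMSE is in the same units as the target

print(mae_per_fold.mean(), rmse_per_fold.mean())
```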

### 4.3 Using different evaluation metrics as Scikit-Learn functions 

The third way to evaluate Scikit-Learn machine learning models is to use the functions in the `sklearn.metrics` module.

In [None]:
from sklearn.metrics import accuracy_score, precision_score,recall_score,f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create X and y 
X = heart_disease.drop("target",axis = 1) 
y = heart_disease["target"]

# Split 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2) 

# Create model 
clf = RandomForestClassifier()

# Fit model
clf.fit(X_train,y_train)

y_preds = clf.predict(X_test)

accuracy_score(y_test,y_preds), precision_score(y_test,y_preds), recall_score(y_test,y_preds), f1_score(y_test,y_preds)




In [None]:
clf.score(X_test,y_test)

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create X and y 
X = housing_df
y = df["target"]

# Split 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2) 

# Create model 
clf = RandomForestRegressor()

# Fit model
clf.fit(X_train,y_train)

y_preds = clf.predict(X_test)

r2_score(y_test,y_preds), mean_absolute_error(y_test,y_preds), mean_squared_error(y_test,y_preds)

In [None]:
clf.score(X_test,y_test)

In [None]:
whats_were_covering

## 5. Improve a model

First predictions = baseline predictions.
First model = baseline model. 

From a data perspective: 
* Could we collect more data? (the more data, the better)
* Could we improve our data?

From a model perspective: 
* Is there a better model we could use?
* Could we improve the current model?

Hyperparameters vs Parameters 

Parameters = patterns the model finds in the data by itself during fitting.
Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns. 
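
As a rough illustration of the difference, the sketch below sets hyperparameters before fitting and then inspects what the model learned afterwards. It assumes synthetic data from `make_classification`; the names `X_demo`, `y_demo` and `demo_clf` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, an assumption for this sketch
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

# Hyperparameters: chosen by us before fitting
demo_clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
print(demo_clf.get_params()["n_estimators"])

# Parameters: learned from the data during fit
# (fitted attributes end with an underscore by Scikit-Learn convention)
demo_clf.fit(X_demo, y_demo)
print(len(demo_clf.estimators_))       # the fitted decision trees
print(demo_clf.feature_importances_)   # importances learned from the data
```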

Three ways to adjust hyperparameters: 
1. By hand
2. Randomly with RandomizedSearchCV
3. Exhaustively with GridSearchCV


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [None]:
#how to get the Hyperparameters

clf.get_params()

### 5.1 Tuning hyperparameters by hand 

Let's make 3 sets, training, validation and test.

<img src = "./images/hp.png"/>

<img src="./images/concept.png" />
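
One possible way to get the three sets is to call `train_test_split` twice. This is only a sketch on dummy arrays (an assumption for illustration), not the approach used in the cells that follow, which slice the shuffled DataFrame by position.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for the features and labels (assumption for this sketch)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# First split off 70% for training, then halve the remaining 30%
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, train_size=0.7, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_tr), len(X_val), len(X_te))
```

The validation set is used to compare hyperparameter settings, and the test set is kept aside for one final evaluation.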

In [None]:
clf.get_params() # see the Scikit-Learn docs for which hyperparameters they suggest tuning

We're going to try and adjust: 

* `max_depth`
* `max_features`
* `min_samples_leaf`
* `min_samples_split`
* `n_estimators`

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_preds(y_true, y_preds):
    """
    Performs evaluation comparison on y_true labels vs. y_preds labels
    on a binary classification model.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    # Rounded metrics, returned for later comparison
    metric_dict = {"accuracy": round(accuracy, 2),
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}")
    print(f"Recall: {recall * 100:.2f}")
    print(f"F1 score: {f1 * 100:.2f}")

    return metric_dict

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

#Shuffle the data 
heart_disease_shuffled = heart_disease.sample(frac=1) 

# Split into X & y

X = heart_disease_shuffled.drop("target", axis = 1) 
y = heart_disease_shuffled["target"]

# Split the data into train, validation & test sets 
train_split = round(0.7*len(heart_disease_shuffled)) #70% of data
valid_split = round(train_split+0.15 *len(heart_disease_shuffled))
X_train,y_train = X[:train_split], y[:train_split]
X_valid,y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test,y_test =X[valid_split:], y[valid_split:]

# Print the set sizes (a bare expression in the middle of a cell is not displayed)
print(len(X_train), len(X_valid), len(X_test))

clf = RandomForestClassifier()

clf.fit(X_train,y_train)

# Make baseline predictions
y_preds = clf.predict(X_valid)

# Evaluate the classifier on validation set 
baseline_metrics = evaluate_preds(y_valid,y_preds)

In [None]:
np.random.seed(42) 

# Create a second classifier with different hyperparameters
clf_2 = RandomForestClassifier(n_estimators=1000)
clf_2.fit(X_train, y_train)

# Make predictions with the new hyperparameters
y_preds_2 = clf_2.predict(X_valid)

# Evaluate the 2nd classifier on the validation set
metrics = evaluate_preds(y_valid, y_preds_2)

In [None]:
np.random.seed(42) 

# Create a third classifier with different hyperparameters
clf_3 = RandomForestClassifier(n_estimators=500,
                               max_features="sqrt",
                               min_samples_leaf=5,
                               min_samples_split=10)

clf_3.fit(X_train, y_train)

# Make predictions with the new hyperparameters
y_preds_3 = clf_3.predict(X_valid)

# Evaluate the 3rd classifier on the validation set
metrics_2 = evaluate_preds(y_valid, y_preds_3)

### 5.2 Hyperparameter tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

grid = {"n_estimators":[10,100,200,500,1000,1200],
        "max_depth":[None, 5,10,20,30],
        "max_features": ["log2","sqrt"],
        "min_samples_split":[2,4,6],
        "min_samples_leaf":[1,2,4]}

np.random.seed(42)

#Split into X & y
X = heart_disease_shuffled.drop("target",axis = 1) 
y = heart_disease_shuffled["target"]

#Split into train and test sets 
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)  # number of CPU cores to use (1 = a single core, -1 = all cores)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10,  # number of hyperparameter combinations to try
                            cv=5,       # 5-fold cross-validation
                            verbose=2   # higher values print more detail
                           )
# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);  # cross-validation creates the validation folds from the training data

In [None]:
rs_clf.best_params_

In [None]:
# Make predictions  with the best parameters 
rs_y_preds = rs_clf.predict(X_test) 

#Evaluate the predictions
rs_metrics = evaluate_preds(y_test,rs_y_preds)
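
`param_distributions` accepts lists, as above, but it can also take distribution objects that have an `rvs` method, such as those in `scipy.stats`. The following is a sketch of that variant on synthetic data; the ranges and the names `param_dist` and `demo_search` are assumptions chosen for illustration, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, an assumption for this sketch
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Distributions are sampled from; lists are sampled uniformly
param_dist = {"n_estimators": randint(10, 50),      # integers from 10 to 49
              "max_depth": [None, 5, 10],
              "min_samples_split": randint(2, 8)}   # integers from 2 to 7

demo_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                 param_distributions=param_dist,
                                 n_iter=5,
                                 cv=3,
                                 random_state=42)   # makes the sampled candidates reproducible
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
```

Passing `random_state` to the search itself is a more explicit alternative to relying on `np.random.seed` for reproducible candidates.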

### 5.3 Hyperparameter tuning with GridSearchCV

In [None]:
grid

In [None]:
# Grid search is a brute-force search: it tries every combination in the grid

6*5*2*3*3*5  # number of models fitted for `grid`; the final 5 comes from 5-fold CV

In [None]:
# Reduced hyperparameter combinations, based on the RandomizedSearchCV results
grid_2 = {'n_estimators': [100, 200, 500],
          'max_depth': [None],
          'max_features': ['log2', 'sqrt'],  # 'auto' is not supported
          'min_samples_split': [6],
          'min_samples_leaf': [1, 2]}

In [None]:
3*1*2*1*2*5
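
The two products above can be double-checked with `ParameterGrid`, which enumerates every combination in a grid. This is a small sketch; the dictionaries repeat the values of `grid` and `grid_2` under new names (`full_grid`, `small_grid`) so that the block is self-contained.

```python
from sklearn.model_selection import ParameterGrid

# Same values as `grid` and `grid_2`, repeated here so the sketch runs on its own
full_grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
             "max_depth": [None, 5, 10, 20, 30],
             "max_features": ["log2", "sqrt"],
             "min_samples_split": [2, 4, 6],
             "min_samples_leaf": [1, 2, 4]}
small_grid = {"n_estimators": [100, 200, 500],
              "max_depth": [None],
              "max_features": ["log2", "sqrt"],
              "min_samples_split": [6],
              "min_samples_leaf": [1, 2]}

cv_folds = 5
n_full = len(ParameterGrid(full_grid))     # number of hyperparameter combinations
n_small = len(ParameterGrid(small_grid))

print(n_full, n_full * cv_folds)     # 540 combinations, 2700 fits
print(n_small, n_small * cv_folds)   # 12 combinations, 60 fits
```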

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split


np.random.seed(42)

#Split into X & y
X = heart_disease_shuffled.drop("target",axis = 1) 
y = heart_disease_shuffled["target"]

#Split into train and test sets 
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)  # number of CPU cores to use (1 = a single core, -1 = all cores)

# Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf,
                      param_grid=grid_2,
                      cv=5,       # 5-fold cross-validation
                      verbose=2   # higher values print more detail
                     )
# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);  # cross-validation creates the validation folds from the training data

In [None]:
gs_y_preds = gs_clf.predict(X_test)

# evaluate the predictions
gs_metrics = evaluate_preds(y_test,gs_y_preds)
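
Besides predicting with the best model, a fitted search object also stores the results of every combination it tried. Below is a sketch of inspecting them, assuming synthetic data and a deliberately tiny grid so it runs quickly (`demo_gs` and `results` are names made up for this example):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, an assumption for this sketch
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_gs = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid={"n_estimators": [10, 50], "max_depth": [None, 5]},
                       cv=3)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)   # the winning combination
print(demo_gs.best_score_)    # its mean cross-validated score

# One row per combination tried
results = pd.DataFrame(demo_gs.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]].sort_values("rank_test_score"))
```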

Let's compare our different models' metrics.

In [None]:
# Note: metrics_2 holds the results of the third classifier (clf_3)
compare_metrics = pd.DataFrame({"baseline": baseline_metrics,
                                "clf_3": metrics_2,
                                "random search": rs_metrics,
                                "grid search": gs_metrics})

compare_metrics.plot.bar(figsize = (10,8))

In [None]:
# Which model is "best" depends on the metric the model should focus on: accuracy, precision, recall or F1.
# For a fair comparison, all models should be evaluated on the same train/test split.
# See: https://colab.research.google.com/drive/1ISey96a5Ag6z2CvVZKVqTKNWRwZbZl0m

In [None]:
whats_were_covering

## 6. Save and load a trained model

Two ways to save and load machine learning models: 
1. With Python's `pickle` module
2. With the `joblib` module

**Pickle**: Python object serialization

In [None]:
# Our Python Object is our model 
import pickle 

# Save an existing model to file 
with open("gs_random_forest_model_1.pkl", "wb") as f:  # "wb" = write binary
    pickle.dump(gs_clf, f)

In [None]:
# Load a saved model
with open("gs_random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

In [None]:
# Make some predictions
pickle_y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_test,pickle_y_preds)


### Joblib Module

In [None]:
from joblib import dump, load # joblib is usually more efficient for objects that carry large NumPy arrays

# save model to file 

dump(gs_clf, filename="gs_random_forest_model_1.joblib")

In [None]:
# import a saved joblib model 

loaded_job_model = load(filename="gs_random_forest_model_1.joblib")

In [None]:
joblib_y_preds = loaded_job_model.predict(X_test)
evaluate_preds(y_test,joblib_y_preds)
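
A quick sanity check after saving is to confirm that the reloaded model predicts exactly what the original did. This is a sketch on synthetic data that writes to a temporary directory; the file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, an assumption for this sketch
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to and reload from a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(demo_clf, model_path)
    reloaded_clf = load(model_path)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_clf.predict(X_demo), reloaded_clf.predict(X_demo))
print(same_predictions)
```

Both `pickle` and `joblib` files can execute code when loaded, so only load model files you trust, and ideally with the same Scikit-Learn version that saved them.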

In [None]:
whats_were_covering

## 7. Putting it all together!

In [None]:
data = pd.read_csv("data/car-sales-extended-missing-data.csv")
data

In [None]:
data.dtypes

In [None]:
data.isna().sum()

<img src="images/all.png" />

Steps we want to do (all in one cell):
1. Fill missing data
2. Convert data to numbers
3. Build a model on the data 

In [None]:
# Getting the data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Setup random seed 
import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv("data/car-sales-extended-missing-data.csv")
data.dropna(subset=["Price"], inplace = True)

# Define different features and transformer pipelines
categorical_features = ["Make","Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy ="constant", fill_value="missing" )),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

door_feature =["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy ="constant", fill_value=4 )),
    ])

numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[  # separate pipeline, so door_transformer keeps its constant fill of 4
    ("imputer", SimpleImputer(strategy ="mean" )),
    ])

# Setup the preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[("cat", categorical_transformer, categorical_features),
                 ("door",door_transformer,door_feature),
                 ("num", numeric_transformer, numeric_features)
                ])

# Creating a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor",preprocessor),
                        ("model",RandomForestRegressor())])
#Split the data
X = data.drop("Price", axis = 1)
y = data["Price"]
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

#Fit and score the model
model.fit(X_train,y_train)
model.score(X_test,y_test)

In [None]:
# Why the Pipeline/Imputer matters:
# Everything that .fit() does (Imputer, OneHotEncoder, scaling) "learns" rules from the data,
# e.g. the mean, the most frequent category, or the set of categories present.
# If you fit these steps on the whole X, information from X_test flows into training → data leakage.
# The model then sees the test data indirectly, because X_train is transformed using test statistics.
# Pipeline prevents this automatically: fit only on X_train, transform on X_test.
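
To make the point about leakage concrete, the sketch below fits a tiny pipeline on toy training data and checks which mean the imputer actually learned. The arrays and the use of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data: one feature with missing values (made up for illustration)
X_tr = np.array([[1.0], [2.0], [np.nan], [3.0]])
y_tr = np.array([1.0, 2.0, 2.0, 3.0])
X_te = np.array([[100.0], [np.nan]])

demo_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                            ("model", LinearRegression())])
demo_pipe.fit(X_tr, y_tr)

# The imputer's mean comes from the training data only
learned_mean = demo_pipe.named_steps["imputer"].statistics_[0]
print(learned_mean)

# The missing test value is filled with the training mean, not the test mean
filled_test = demo_pipe.named_steps["imputer"].transform(X_te)
print(filled_test)
```

Had the imputer been fitted on training and test data together, the value 100 from the test set would have pulled the fill value upwards.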

It's also possible to use `GridSearchCV` or `RandomizedSearchCV` with our `Pipeline`.

In [None]:
# use GSCV with our regression pipeline
from sklearn.model_selection import GridSearchCV

pipe_grid = {
    "preprocessor__num__imputer__strategy" : ["mean", "median"], 
    "model__n_estimators": [100,1000],
    "model__max_depth":[None,5],
    "model__max_features": ["sqrt"],
    "model__min_samples_split":[2,4]
}

gs_model = GridSearchCV(model, pipe_grid,cv=5, verbose = 2)
gs_model.fit(X_train,y_train)

In [None]:
gs_model.score(X_test,y_test)
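
The keys in `pipe_grid` follow the pattern `step__substep__parameter`, with double underscores separating each level of nesting. One way to discover the valid names is `get_params()`, sketched here on a cut-down copy of the pipeline (`demo_pre` and `demo_model` are names made up for this example, and nothing is fitted):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# A reduced version of the preprocessing + model pipeline (assumption for this sketch)
demo_pre = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))]), ["Odometer (KM)"])])
demo_model = Pipeline(steps=[("preprocessor", demo_pre),
                             ("model", RandomForestRegressor())])

# Every key returned by get_params() can be used as a key in a search grid
tunable = [name for name in demo_model.get_params() if "__" in name]

print("preprocessor__num__imputer__strategy" in tunable)
print("model__n_estimators" in tunable)
```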

In [None]:
whats_were_covering