# BIAS & VARIANCE EXAMPLE

Importing the libraries

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Loading the data

In [4]:
iris = load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

Extract features and target

In [5]:
X, y = iris.data, iris.target

Splitting the data into train and test sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
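
A plain random split does not guarantee that each class appears in the same proportion in the train and test sets, which can shift the accuracies slightly. The sketch below counts the classes in the split used here and compares it with a stratified variant. The `stratify=y` argument is an addition for illustration only and is not used in the rest of this notebook; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the notebook: rows are drawn at random,
# so the per-class counts in each part may be uneven.
_, _, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
plain_counts = (np.bincount(y_train), np.bincount(y_test))

# Assumed variant (not in the original): stratify=y keeps the
# class ratio of the full dataset in both parts.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
strat_counts = (np.bincount(y_train_s), np.bincount(y_test_s))

print("Plain split      train/test class counts:", plain_counts)
print("Stratified split train/test class counts:", strat_counts)
```

If the plain split turns out to be uneven, part of the gap between train and test accuracy at small depths may come from the class mix rather than from the model.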

Train decision tree classifiers with different maximum depths

In [12]:
depths = [1, 5, 10]
for depth in depths:
    # max_depth caps how deep the tree may grow; random_state makes the fit reproducible
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)

    # Predict on both splits and compare the accuracies
    y_pred_train = tree.predict(X_train)
    y_pred_test = tree.predict(X_test)
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)

    print(f"Depth {depth}: Train Accuracy = {train_acc:.3f}, Test Accuracy = {test_acc:.3f}")

Depth 1: Train Accuracy = 0.648, Test Accuracy = 0.711
Depth 5: Train Accuracy = 0.990, Test Accuracy = 1.000
Depth 10: Train Accuracy = 1.000, Test Accuracy = 1.000
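
The output above shows the same test accuracy at depths 5 and 10. One possible reason is that `max_depth` is only an upper bound: a tree stops growing once its leaves are pure, so on a small dataset it may never reach depth 10. The sketch below, which rebuilds the same split and models, reports the depth and leaf count each fitted tree actually has. The `actual` dictionary is a hypothetical name used for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

actual = {}  # max_depth -> (depth actually reached, number of leaves)
for depth in [1, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    actual[depth] = (tree.get_depth(), tree.get_n_leaves())
    print(f"max_depth={depth}: actual depth = {actual[depth][0]}, "
          f"leaves = {actual[depth][1]}")
```

If the tree with `max_depth=10` reports a much smaller actual depth, the depth-5 and depth-10 models are close to the same model, and a large difference between them should not be expected.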


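At depth = 1 the model has high bias (it is underfitting): a tree with a single split has only two leaves, so it can predict at most two of the three classes, and both train and test accuracy stay low. At depth = 5 bias and variance are balanced: the tree fits the training set almost perfectly (0.990) and generalizes to the test set (1.000). At depth = 10 the tree fits the training set perfectly, which means the lowest bias and, in general, the highest risk of high variance (overfitting). On this particular split, however, test accuracy is still 1.000, so these numbers alone do not show overfitting; a single test set of 45 samples may be too small to reveal it.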
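
Variance is about how much a model's performance changes when the data changes, so a single train/test split gives only a rough picture. A minimal sketch, assuming 5-fold cross-validation is an acceptable way to probe this, reports the mean and standard deviation of accuracy across folds for each depth. The helper `cv_summary` is hypothetical and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def cv_summary(depths=(1, 5, 10), folds=5):
    """Return {depth: (mean accuracy, std of accuracy)} over the CV folds."""
    results = {}
    for depth in depths:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
        # With an integer cv and a classifier, scikit-learn uses stratified folds
        scores = cross_val_score(tree, X, y, cv=folds)
        results[depth] = (scores.mean(), scores.std())
    return results


for depth, (mean, std) in cv_summary().items():
    print(f"Depth {depth}: CV accuracy = {mean:.3f} +/- {std:.3f}")
```

A low mean points to bias, while a large spread across folds points to variance. On a dataset as small and clean as Iris the spread may be modest at every depth.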