**Exercise 1:**

In [17]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Loading the Iris dataset
iris = load_iris()

# Extracting petal length and width
X = iris.data[:, 2:4]  # Petal length and width are the third and fourth features

# Creating a DataFrame for better visualization
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_name'] = iris_df['target'].apply(lambda x: iris.target_names[x])

# Printing the target variable's names, values and counting the number of classes
target_names = iris.target_names
target_values = iris.target
class_count = len(set(target_values))

iris_df.head(), target_names, target_values, class_count


(   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 0                5.1               3.5                1.4               0.2   
 1                4.9               3.0                1.4               0.2   
 2                4.7               3.2                1.3               0.2   
 3                4.6               3.1                1.5               0.2   
 4                5.0               3.6                1.4               0.2   
 
    target target_name  
 0       0      setosa  
 1       0      setosa  
 2       0      setosa  
 3       0      setosa  
 4       0      setosa  ,
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1,
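
The printed target array is cut off, so the class balance cannot be read from it. As a minimal sketch, the samples per class can be counted with `np.unique`. The variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# return_counts=True yields the distinct labels and how often each one occurs
labels, counts = np.unique(iris.target, return_counts=True)

# Map each class name to its number of samples (class_counts is a hypothetical name)
class_counts = {str(iris.target_names[l]): int(c) for l, c in zip(labels, counts)}
print(class_counts)
```

A balanced dataset matters later on: the Gini impurity of the root node depends only on these class proportions.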

In [18]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Creating a DecisionTreeClassifier object with maximum depth of 2
treeClassifier = DecisionTreeClassifier(max_depth=2)

treeClassifier.fit(X, target_values)

In [19]:
# Export the decision tree to a .dot file
export_graphviz(treeClassifier, 
                out_file="IrisDTree.dot",
                feature_names=iris.feature_names[2:],  # Using petal length and width
                class_names=iris.target_names,
                rounded=True,
                filled=True)

![Visualize the decision tree](graphviz.png)
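
If Graphviz is not available to render the `.dot` file, the same tree can be inspected as plain text. The sketch below uses scikit-learn's `export_text` on a tree refitted on the two petal features. The name `X_petal` and the `random_state=0` argument are assumptions added here for a self-contained, reproducible example; they are not part of the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_petal = iris.data[:, 2:4]  # petal length and petal width (hypothetical name)

# random_state is an assumption added for reproducibility
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_petal, iris.target)

# Render the split rules as indented text, one line per node
tree_rules = export_text(clf, feature_names=list(iris.feature_names[2:]))
print(tree_rules)
```

Each indented line shows the feature and threshold tested at a node, and each `class:` line is a leaf.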


**Exercise 2:**

In [20]:
# Calculating the Gini impurity for the entire dataset (root node)
total_samples = len(iris.target)
gini_root = 1 - sum([(np.count_nonzero(iris.target == c) / total_samples)**2 for c in np.unique(iris.target)])

# Lowest weighted Gini impurity reachable with a single split on one feature,
# evaluated over the entire dataset
def calculate_gini_for_entire_dataset(feature_index):
    # Candidate thresholds: every distinct value taken by the feature
    thresholds = np.unique(iris.data[:, feature_index])
    classes = np.unique(iris.target)
    best_gini = 1.0

    for threshold in thresholds:
        left_classes = iris.target[iris.data[:, feature_index] <= threshold]
        right_classes = iris.target[iris.data[:, feature_index] > threshold]

        # Skip thresholds that leave one side empty (no actual split)
        if len(left_classes) == 0 or len(right_classes) == 0:
            continue

        # Gini impurity of each child node: 1 - sum of squared class proportions
        gini_left = 1.0 - sum((np.count_nonzero(left_classes == c) / len(left_classes))**2 for c in classes)
        gini_right = 1.0 - sum((np.count_nonzero(right_classes == c) / len(right_classes))**2 for c in classes)

        # Weighted average of the two child impurities
        gini = (len(left_classes) / total_samples) * gini_left + (len(right_classes) / total_samples) * gini_right

        best_gini = min(best_gini, gini)

    return best_gini

gini_petal_length_entire = calculate_gini_for_entire_dataset(2)  # Petal length
gini_petal_width_entire = calculate_gini_for_entire_dataset(3)  # Petal width

gini_root, gini_petal_length_entire, gini_petal_width_entire



(0.6666666666666667, 0.3333333333333333, 0.3333333333333333)
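
These values can be checked by hand. The derivation below is a sketch that assumes the 150 samples are split evenly into three classes of 50, and that the best split isolates the 50 setosa samples on one side, which is consistent with the printed result of 1/3 for both features.

```latex
% Root node: three classes with proportion 1/3 each
G_{\text{root}} = 1 - \sum_{c=1}^{3} p_c^2
                = 1 - 3\left(\tfrac{50}{150}\right)^2
                = \tfrac{2}{3} \approx 0.667

% Left child: 50 setosa only (pure node)
G_{\text{left}} = 1 - 1^2 = 0

% Right child: 50 versicolor and 50 virginica
G_{\text{right}} = 1 - 2\left(\tfrac{50}{100}\right)^2 = \tfrac{1}{2}

% Weighted impurity of the split
G_{\text{split}} = \tfrac{50}{150}\, G_{\text{left}} + \tfrac{100}{150}\, G_{\text{right}}
                 = \tfrac{50}{150}\cdot 0 + \tfrac{100}{150}\cdot\tfrac{1}{2}
                 = \tfrac{1}{3} \approx 0.333
```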

In [21]:
# Estimating the class probabilities for a flower with petals 5 cm long and 1.5 cm wide
flower = np.array([[5.0, 1.5]])  # Petal length = 5 cm, Petal width = 1.5 cm
probabilities = treeClassifier.predict_proba(flower)

probabilities


array([[0.        , 0.90740741, 0.09259259]])

1. **Start at the Root Node**: The root node of a decision tree tests one feature against a threshold. Here the root threshold is 0.8 cm. Our flower's petal length (5 cm) and petal width (1.5 cm) are both greater than 0.8 cm, so we move to the right child of the root node.

2. **Follow the Tree Path**: At each subsequent node, a similar decision is made based on the thresholds for petal length or width defined by the tree. We continue this process, moving left or right at each node, depending on how our flower's measurements compare to the thresholds at each node.

3. **Reach the Final Leaf Node**: The process continues until we reach a leaf node. Each leaf node in a decision tree represents a class prediction or a probability distribution over the classes. In our case, the final leaf node where we end up will give us the probability distribution for the classes.

   - With a petal length of 5 cm and a petal width of 1.5 cm, the flower ends in a leaf where Iris versicolor is the most probable class. The estimated probability of 90.74% is the fraction of versicolor training samples in that leaf, which is consistent with 49 out of 54 samples (the remaining 5, or 9.26%, being virginica).

4. **Interpreting the Leaf Node**: The final leaf node's prediction aligns with the highest probability class estimated by the model. In our example, the leaf node we reach through this process indicates that the flower is most likely an Iris versicolor, which is consistent with the `predict_proba` function's output.
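
The traversal described in the steps above can be reproduced in code. The sketch below refits a depth-2 tree on the petal features and follows the flower from the root to its leaf using `decision_path`, `apply`, and the fitted `tree_` arrays. The names `X_petal`, `path`, `leaf_id`, and `leaf_proba`, as well as `random_state=0`, are assumptions introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_petal = iris.data[:, 2:4]          # petal length and petal width
feature_names = iris.feature_names[2:]

# random_state is an assumption added for reproducibility
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_petal, iris.target)

flower = np.array([[5.0, 1.5]])

# decision_path returns a sparse indicator matrix; for a single sample,
# its .indices attribute holds the ids of the nodes the sample visits
path = clf.decision_path(flower).indices
leaf_id = clf.apply(flower)[0]       # id of the leaf the sample ends in

for node_id in path:
    if node_id == leaf_id:
        n = clf.tree_.n_node_samples[node_id]
        print(f"node {node_id}: leaf holding {n} training samples")
        continue
    f = clf.tree_.feature[node_id]   # index of the feature tested at this node
    t = clf.tree_.threshold[node_id] # threshold of the test
    side = "left" if flower[0, f] <= t else "right"
    print(f"node {node_id}: {feature_names[f]} = {flower[0, f]} vs {t:.2f} -> go {side}")

# Normalise the leaf's class values so they sum to 1; this works whether
# tree_.value stores raw counts or class fractions
leaf_value = clf.tree_.value[leaf_id][0]
leaf_proba = leaf_value / leaf_value.sum()
print(leaf_proba)
```

The normalised leaf values should match what `predict_proba` returns for the same flower, since the predicted probabilities are the class fractions of the training samples in the leaf.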


In [22]:
predicted_class = treeClassifier.predict(flower)

predicted_class

array([1])

**Exercise 3:**

In [23]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def kfoldCrossValidation(X, y, k, M):
    # Set up k-fold cross-validation
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    # Perform k-fold cross-validation
    for train_index, test_index in kfold.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Fit the classifier on the training data
        M.fit(X_train, y_train)
        # Predict on the test data
        y_pred = M.predict(X_test)
        # Calculate accuracy and store in scores list
        accuracy = accuracy_score(y_test, y_pred)
        scores.append(accuracy)
    return scores

In [24]:
# Implementing the average accuracy
def average_accuracy(X, y, k, M):
    scores = kfoldCrossValidation(X, y, k, M)
    return sum(scores) / len(scores)
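
As a cross-check on the hand-written loop, scikit-learn's `cross_val_score` performs the same fit, predict, and score cycle for each fold. The sketch below is not the implementation asked for in the exercise, only a comparison point. It assumes the same `KFold` settings as above and adds `random_state=0` to the tree for reproducibility; `cv_scores` and `mean_accuracy` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# random_state on the tree is an assumption added for reproducibility
model = DecisionTreeClassifier(max_depth=2, random_state=0)

# Same splitter configuration as in kfoldCrossValidation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; the estimator is cloned and refitted for each fold
cv_scores = cross_val_score(model, iris.data, iris.target, cv=cv, scoring="accuracy")
mean_accuracy = cv_scores.mean()
print(cv_scores, mean_accuracy)
```

Unlike the manual loop, which refits the same estimator object `M` in place, `cross_val_score` works on clones and leaves the estimator passed in unfitted.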

In [26]:
# Computing the average cross-validated accuracy of the tree for different values of k
X = iris.data  # Using all features this time
y = iris.target
k_values = [2, 3, 5, 10]

# Tree with max depth of 2
treeClassifierMax2 = DecisionTreeClassifier(max_depth=2)
avg_accuracies_max2 = [average_accuracy(X, y, k, treeClassifierMax2) for k in k_values]
print("Average accuracies for max depth of 2:", avg_accuracies_max2)

Average accuracies for max depth of 2: [0.96, 0.94, 0.9466666666666667, 0.9466666666666669]


**Exercise 4:**