In [None]:
1. Estimated Depth of a Decision Tree Trained on a One Million Instance Training Set
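A well-balanced binary tree with m leaves has a depth of log2(m), rounded up. A Decision Tree trained without restrictions ends up with roughly one leaf per training instance, so for one million instances the depth is approximately log2(1,000,000) ≈ 20. In practice it is usually somewhat deeper, because the tree is rarely perfectly balanced. The actual depth also depends on several factors:

Number of Features: More features give the algorithm more candidate splits at each node, which affects how balanced the resulting tree is.
Data Characteristics: If the classes are hard to separate, the tree needs more splits, and therefore more depth, to isolate them.
Stopping Criteria: With restrictions such as max_depth, growth stops early. Without them, the tree grows until every leaf is pure or cannot be split further, which is the case the estimate of about 20 assumes.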
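
The depth estimate for a balanced, unrestricted tree can be checked with a few lines of arithmetic. This is a minimal sketch that assumes roughly one leaf per training instance and a well-balanced tree; the variable names are illustrative.

```python
import math

n_instances = 1_000_000  # training set size from the question

# Assumption: an unrestricted tree ends up with about one leaf per instance
# and is roughly balanced, so its depth is about log2(number of leaves).
estimated_depth = math.ceil(math.log2(n_instances))
print(estimated_depth)  # 20
```

A real tree is rarely perfectly balanced, so this figure is better read as a lower-end estimate than as an exact prediction.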

In [None]:
2. Gini Impurity of a Node Compared to its Parent
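A node's Gini impurity is generally lower than its parent's, but not always. The training algorithm chooses each split to minimize the weighted sum of the children's Gini impurities, so it is that weighted sum that ends up lower than the parent's impurity. An individual child can be more impure than its parent, as long as the other child's decrease in impurity more than compensates.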
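
A small worked example may make the comparison concrete. The sketch below uses a hypothetical node with four instances of class A and one of class B, and a hypothetical split that separates it into {A, B} and {A, A, A}; the `gini` helper is written here for illustration and is not a scikit-learn function.

```python
def gini(counts):
    """Gini impurity of a node given its per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical parent node: 4 instances of class A, 1 of class B
parent = gini([4, 1])

# Hypothetical split: left child = {A, B}, right child = {A, A, A}
left = gini([1, 1])
right = gini([3, 0])

# Weighted sum of the children's impurities (weights = share of instances)
weighted_children = (2 / 5) * left + (3 / 5) * right

print(f"parent={parent:.2f}, left={left:.2f}, right={right:.2f}, "
      f"weighted={weighted_children:.2f}")
# parent=0.32, left=0.50, right=0.00, weighted=0.20
```

Here the left child is more impure than the parent (0.50 versus 0.32), yet the weighted impurity of the two children (0.20) is still lower than the parent's.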

In [None]:
3. Reducing Max Depth to Address Overfitting
Yes, it is a good idea to reduce the maximum depth of a Decision Tree if it is overfitting the training set. Overfitting occurs when the model learns the noise in the training data rather than the underlying patterns. Limiting the tree depth can help generalize the model better to unseen data, thus improving performance on the test set.
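
The effect of capping the depth can be observed on a small noisy dataset. The following is a sketch only: the dataset, its size, the seeds, and the choice of max_depth=3 are arbitrary assumptions made for illustration, and the printed scores will vary with them.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative noisy dataset (size and seeds are arbitrary choices)
X_demo, y_demo = make_moons(n_samples=1000, noise=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# One tree with no restrictions, one with a capped depth
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare depth, training accuracy, and test accuracy for both trees
for name, model in [("unrestricted", unrestricted), ("max_depth=3", shallow)]:
    print(f"{name}: depth={model.get_depth()}, "
          f"train={model.score(X_tr, y_tr):.3f}, "
          f"test={model.score(X_te, y_te):.3f}")
```

A large gap between the training and test scores of the unrestricted tree is the usual sign of overfitting that a depth limit is meant to reduce.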

In [None]:
4. Scaling Input Features for Underfitting Decision Trees
Scaling the input features will not help. Decision Trees are invariant to feature scale because each split compares a single feature against a threshold rather than computing distances, so scaling or centering leaves the tree's predictions unchanged. If a Decision Tree is underfitting, relax its constraints instead: for example, increase max_depth or decrease min_samples_leaf.
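
Scale invariance can be checked empirically. This sketch trains the same tree on raw features and on features multiplied by positive constants, then compares the predictions. The dataset, the max_depth value, and the scaling constants are illustrative assumptions; powers of two are used so that the rescaling is exact in floating point.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_raw, y_raw = make_moons(n_samples=200, noise=0.3, random_state=0)

# Rescale each feature by a different positive constant. Powers of two are
# an illustrative choice that keeps the rescaling exact in floating point.
X_scaled = X_raw * np.array([1024.0, 8.0])

# Same hyperparameters and seed for both trees; only the feature scale differs
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_raw, y_raw)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y_raw)

# Compare the predictions of the two trees instance by instance
same_predictions = np.array_equal(
    tree_raw.predict(X_raw), tree_scaled.predict(X_scaled)
)
print(same_predictions)
```

Because rescaling preserves the ordering of each feature's values, both trees pick equivalent splits, which is why scaling cannot fix an underfitting tree.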

In [None]:
5. Time to Train a Decision Tree on 10 Million Instances
Training time does not scale linearly with the number of instances. The computational complexity of training a Decision Tree is O(n × m log(m)), where m is the number of instances and n is the number of features. Multiplying the training set size by 10 therefore multiplies the training time by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m).
If it takes 1 hour to train on m = 1 million instances, then K ≈ 11.7, so training on 10 million instances should take roughly 11.7 hours, slightly more than linear scaling would suggest. The actual time also depends on the implementation and the hardware.
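
The scaling factor can be worked out numerically. This is a rough sketch that assumes training cost grows as n_features × m × log(m) and that training on one million instances takes 1 hour, as in the question; the variable names are illustrative.

```python
import math

m = 1_000_000      # original training set size
hours_for_m = 1.0  # assumed training time on m instances (from the question)

# Assumption: training cost grows as n_features * m * log(m). The number of
# features cancels out in the ratio, so it does not appear below.
ratio = (10 * m * math.log(10 * m)) / (m * math.log(m))
estimated_hours = hours_for_m * ratio
print(f"{estimated_hours:.1f} hours")  # 11.7 hours
```

The base of the logarithm does not matter here, since it cancels in the ratio.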

In [None]:
6. Setting presort=True with 100,000 Instances
Presorting the training set speeds up training only for small datasets (up to a few thousand instances). With 100,000 instances, setting presort=True slows training down considerably because of the extra time and memory spent sorting, so the default behavior (no presorting) is the better choice. Note that the presort parameter was deprecated in scikit-learn 0.22 and has since been removed.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# a. Build a moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# b. Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# c. Grid search for best hyperparameters
param_grid = {'max_leaf_nodes': [10, 20, 30, 40, 50, None]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# d. Evaluate the best model on the test set
# (GridSearchCV refits the best estimator on the full training set by default,
# so no extra call to fit() is needed)
best_tree = grid_search.best_estimator_
accuracy = best_tree.score(X_test, y_test)

print(f"Accuracy: {accuracy * 100:.2f}%")


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ShuffleSplit
from scipy.stats import mode

# a. Create subsets of the training set
n_trees = 1000
n_instances_per_subset = 100
shuffle_split = ShuffleSplit(n_splits=n_trees, train_size=n_instances_per_subset, random_state=42)

# b. Train Decision Trees on each subset
trees = []
for train_index, _ in shuffle_split.split(X_train):
    X_subset = X_train[train_index]
    y_subset = y_train[train_index]
    tree = DecisionTreeClassifier(max_leaf_nodes=best_tree.max_leaf_nodes, random_state=42)
    tree.fit(X_subset, y_subset)
    trees.append(tree)

# c. Predict with every tree, then take the majority vote for each test instance
# tree_predictions has shape (n_trees, n_test_instances)
tree_predictions = np.array([tree.predict(X_test) for tree in trees])
majority_votes = mode(tree_predictions, axis=0)[0].flatten()

# d. Evaluate the majority-vote predictions on the test set
accuracy_forest = np.mean(majority_votes == y_test)
print(f"Majority-vote ensemble accuracy: {accuracy_forest * 100:.2f}%")
