This homework reviews the basic concepts associated with tree-based methods and introduces the machine learning workflow using real-world datasets. It combines conceptual exercises with applied modeling tasks in Python.
Some questions require independent research beyond the material covered in lectures.
tree-based-methods-assignment/
│── README.md
│── requirements.txt
│── .gitignore
│── Week5_Tree_Based_Methods.ipynb
└── Dataset/
    └── Auto.csv
Question | Marks |
---|---|
Q1 | 3 |
Q2 | 3 |
Q3a | 2 |
Q3b | 2 |
Q3c | 5 |
- Task: Draw an example of a partition of a two-dimensional feature space resulting from recursive binary splitting, with at least six regions.
- Deliverable: Partition sketch and corresponding decision tree, labeled with regions (R1, R2, …) and cutpoints (t1, t2, …).
- Tools used: `numpy`, `matplotlib`, `sklearn.tree.DecisionTreeClassifier`, and `plot_tree`.
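To make the link between a partition and its tree concrete, here is a minimal sketch in plain Python. It assumes five hypothetical cutpoints (`T1` to `T5`, with arbitrary values that are not taken from the notebook) on the unit square; each nested `if` plays the role of one internal node, and each returned label is one leaf region.

```python
# Hypothetical cutpoints on the unit square; the values are arbitrary
# and only serve to produce six regions.
T1, T2, T3, T4, T5 = 0.5, 0.4, 0.7, 0.25, 0.75


def region(x1: float, x2: float) -> str:
    """Return the region label for a point by walking the tree top-down."""
    if x1 < T1:                                # root: split on X1 at t1
        if x2 < T2:                            # left branch: split on X2 at t2
            return "R1" if x1 < T4 else "R2"   # split on X1 at t4
        return "R3"
    if x2 < T3:                                # right branch: split on X2 at t3
        return "R4" if x1 < T5 else "R5"       # split on X1 at t5
    return "R6"


if __name__ == "__main__":
    points = [(0.1, 0.1), (0.4, 0.1), (0.2, 0.9),
              (0.6, 0.2), (0.9, 0.2), (0.8, 0.9)]
    for point in points:
        print(point, "->", region(*point))
```

Five splits give six rectangular regions, and every split only ever divides one existing rectangle in two, which is what distinguishes recursive binary splitting from an arbitrary partition.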
- Scenario: Ten bootstrapped samples produce estimated probabilities of a class being Red.
- Task: Compare two approaches for combining results:
- Majority vote
- Average probability
- Finding: Majority vote → Red. Average probability → Green.
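
The two combination rules can be checked with a few lines of Python. This README does not list the ten probabilities, so the values below are an assumption: they are taken to match the ISLR exercise this question appears to be based on, and should be replaced with the values from the notebook if they differ.

```python
from statistics import mean

# Assumed values: the ten bootstrap estimates of P(Red | X), taken to match
# the ISLR exercise; the README itself does not list them.
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each bootstrap model votes Red when its estimate exceeds 0.5.
red_votes = sum(p > 0.5 for p in probs)
majority_label = "Red" if red_votes > len(probs) / 2 else "Green"

# Average probability: classify once, using the mean of the estimates.
avg_prob = mean(probs)
average_label = "Red" if avg_prob > 0.5 else "Green"

print(f"Red votes: {red_votes}/{len(probs)} -> {majority_label}")
print(f"Mean P(Red): {avg_prob:.2f} -> {average_label}")
```

With these values, six of the ten estimates exceed 0.5, so the vote yields Red, while the mean of 0.45 falls below 0.5 and yields Green. The two rules disagree because the four low estimates are much further from 0.5 than the six high ones.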
- Dataset: `Auto.csv` (from An Introduction to Statistical Learning).
- Tasks:
  a) Load and preprocess data (convert columns, impute missing values, handle outliers).
  b) Split data into training and test sets.
  c) Fit and compare three models:
     - Linear Regression
     - Gradient Boosting
     - Random Forest
- Evaluation: Compare train and test R² and MSE scores.
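
For reference, the two evaluation metrics can be written out directly. The sketch below is a plain-Python illustration with made-up numbers, not the notebook's code; in practice `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` compute the same quantities.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - RSS / TSS."""
    y_bar = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - rss / tss


# Toy mpg values, made up for illustration only.
y_true = [18.0, 22.0, 30.0, 26.0]
y_pred = [20.0, 21.0, 28.0, 27.0]
print(f"MSE = {mse(y_true, y_pred):.3f}, R^2 = {r2(y_true, y_pred):.3f}")
```

Computing both metrics on the training set and on the test set is what makes overfitting visible: a model that scores much better on the training data than on the test data has learned noise specific to the training sample.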
- Linear Regression: Consistent train/test performance, but relatively high MSE and lower R², which suggests underfitting.
- Gradient Boosting: Improved accuracy, lower MSE, but some overfitting observed.
- Random Forest: Best overall performance with highest R², lowest MSE, and slightly less overfitting than Gradient Boosting.
Conclusion: Random Forest performed best, followed by Gradient Boosting, with Linear Regression performing the worst.
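
A comparison of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the notebook's code: it uses synthetic data from `make_friedman1` as a stand-in for the preprocessed Auto features, and the hyperparameters (`n_estimators=200`, `test_size=0.25`, `random_state=0`) are illustrative choices. The ranking it prints on synthetic data need not match the findings above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic nonlinear regression data standing in for the Auto features;
# swap in the preprocessed Auto data to reproduce the actual comparison.
X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
        "train_mse": mean_squared_error(y_train, pred_train),
        "test_mse": mean_squared_error(y_test, pred_test),
    }

for name, m in results.items():
    # A large gap between train and test scores points to overfitting.
    print(f"{name:18s} train R2={m['train_r2']:.3f} test R2={m['test_r2']:.3f} "
          f"train MSE={m['train_mse']:.2f} test MSE={m['test_mse']:.2f}")
```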
The assignment uses the Auto dataset from An Introduction to Statistical Learning.
- Features: `cylinders`, `displacement`, `horsepower`, `weight`, `acceleration`, `year`, `origin`
- Target: `mpg` (miles per gallon)
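
The feature/target split and a simple imputation step might look like the sketch below. Several things here are assumptions rather than facts from the notebook: the rows are made up, the `name` column and the use of `?` as the missing-value marker for `horsepower` reflect how some distributions of `Auto.csv` are laid out, and median imputation is just one reasonable strategy. The notebook most likely uses pandas; the standard library is used here only to keep the example dependency-free.

```python
import csv
import io
from statistics import median

# Made-up rows in the assumed Auto.csv column layout (not real data).
SAMPLE = """mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
18.0,8,307.0,130,3504,12.0,70,1,car a
25.0,4,98.0,?,2046,19.0,71,1,car b
31.0,4,85.0,70,1990,17.0,76,3,car c
"""

FEATURES = ["cylinders", "displacement", "horsepower", "weight",
            "acceleration", "year", "origin"]
TARGET = "mpg"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Median of the observed horsepower values, used to fill the "?" entries.
hp_median = median(
    float(r["horsepower"]) for r in rows if r["horsepower"] != "?"
)

X, y = [], []
for r in rows:
    hp = r["horsepower"]
    r["horsepower"] = hp_median if hp == "?" else float(hp)
    X.append([float(r[f]) for f in FEATURES])  # feature vector, in FEATURES order
    y.append(float(r[TARGET]))                 # target: miles per gallon

print(X[1], y[1])
```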
The assignment was developed in Python using a shared virtual environment (`~/venvs/ml-env`).
To install dependencies:
cd ~/projects/tree-based-methods-assignment
source ~/venvs/ml-env/bin/activate
pip install -r requirements.txt
- James, Witten, Hastie, Tibshirani, Taylor (2023). An Introduction to Statistical Learning with Applications in Python.
- Scikit-learn Documentation: https://scikit-learn.org/stable/
---