Week 5: Tree-Based Methods

Objective

This homework reviews the basic concepts associated with tree-based methods and introduces the machine learning workflow using real-world datasets. It combines conceptual exercises with applied modeling tasks in Python.

Some questions require independent research beyond the material covered in lectures.

Repository Structure


tree-based-methods-assignment/
├── README.md
├── requirements.txt
├── .gitignore
├── Week5_Tree_Based_Methods.ipynb
└── Dataset/
    └── Auto.csv

Marks Distribution

Question	Marks
Q1	3
Q2	3
Q3a	2
Q3b	2
Q3c	5

Questions

Q1. Recursive Binary Splitting

Task: Draw an example of a partition of a two-dimensional feature space resulting from recursive binary splitting, with at least six regions.
Deliverable: Partition sketch and corresponding decision tree, labeled with regions (R1, R2, …) and cutpoints (t1, t2, …).
Tools used: numpy, matplotlib, sklearn.tree.DecisionTreeClassifier and plot_tree.
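The correspondence between a tree and its partition can be sketched in a few lines. This is a hand-written illustration, not the notebook's DecisionTreeClassifier approach: the cutpoint values t1 to t5 and the function name region are hypothetical, chosen only so that five binary splits produce six rectangular regions.

```python
# Hypothetical cutpoints on the unit square, chosen only for illustration.
CUTPOINTS = {"t1": 0.3, "t2": 0.5, "t3": 0.7, "t4": 0.4, "t5": 0.6}


def region(x1, x2, t=CUTPOINTS):
    """Return the region label (R1..R6) that contains the point (x1, x2).

    Each `if` is one internal node of the tree; each return is a leaf.
    """
    if x1 < t["t1"]:                       # first split: X1 < t1
        return "R1" if x2 < t["t2"] else "R2"
    if x1 < t["t3"]:                       # split on the right branch: X1 < t3
        return "R3" if x2 < t["t4"] else "R4"
    return "R5" if x2 < t["t5"] else "R6"  # remaining strip: X1 >= t3


print(region(0.1, 0.9))  # R2
print(region(0.5, 0.1))  # R3
```

Every split divides one existing rectangle in two along a single axis, which is why the resulting regions are always axis-aligned boxes.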

Q2. Bagging and Combining Predictions

Scenario: Ten trees, each grown on a different bootstrapped sample, produce ten estimates of the probability that an observation's class is Red.
Task: Compare two approaches for combining results:
1. Majority vote
2. Average probability
Finding: Majority vote → Red. Average probability → Green.
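The two rules can disagree because majority vote only counts which side of 0.5 each estimate falls on, while averaging also uses how far from 0.5 it is. The sketch below assumes the ten estimates are those given in the corresponding ISLR exercise; this README does not list them, so treat the values as an assumption. The function names are illustrative.

```python
# Assumed values: the ten estimates of P(Red) from the ISLR exercise.
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]


def majority_vote(probs, threshold=0.5):
    """Classify each estimate separately, then take the most common class."""
    red_votes = sum(p > threshold for p in probs)
    return "Red" if red_votes > len(probs) / 2 else "Green"


def average_probability(probs, threshold=0.5):
    """Average the estimates first, then classify the mean."""
    mean_p = sum(probs) / len(probs)
    return "Red" if mean_p > threshold else "Green"


print(majority_vote(probs))        # Red   (6 of 10 estimates exceed 0.5)
print(average_probability(probs))  # Green (mean is 0.45)
```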

Q3. Boosting and Random Forests

Dataset: Auto.csv (from An Introduction to Statistical Learning).
Tasks:
a) Load and preprocess data (convert columns, impute missing values, handle outliers).
b) Split data into training and test sets.
c) Fit and compare three models:
- Linear Regression
- Gradient Boosting
- Random Forest
Evaluation: Compare train and test R² and MSE scores.
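Steps (a) and (b) might look roughly like the sketch below. It is not the notebook's code: median imputation, clipping to 1.5 × IQR, the 80/20 split, and the helper names preprocess and split are all assumptions made for illustration. In the ISLR copy of Auto.csv, missing horsepower values are recorded as "?", which is why the column needs coercing to numeric.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["cylinders", "displacement", "horsepower", "weight",
            "acceleration", "year", "origin"]
TARGET = "mpg"


def preprocess(df):
    """Coerce columns to numeric, impute medians, and clip feature outliers."""
    out = df[FEATURES + [TARGET]].copy()
    # Non-numeric placeholders such as "?" become NaN.
    out = out.apply(pd.to_numeric, errors="coerce").astype(float)
    # Assumption: fill missing values with the column median.
    out = out.fillna(out.median())
    # Assumption: clip each feature to the 1.5 * IQR whiskers.
    q1 = out[FEATURES].quantile(0.25)
    q3 = out[FEATURES].quantile(0.75)
    iqr = q3 - q1
    out[FEATURES] = out[FEATURES].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1
    )
    return out


def split(df, test_size=0.2, seed=42):
    """Return X_train, X_test, y_train, y_test for the cleaned frame."""
    return train_test_split(
        df[FEATURES], df[TARGET], test_size=test_size, random_state=seed
    )


# Usage, run from the repository root:
# auto = preprocess(pd.read_csv("Dataset/Auto.csv"))
# X_train, X_test, y_train, y_test = split(auto)
```

Computing the medians and quantiles before splitting follows the task order above; a stricter workflow would fit them on the training set only to avoid leaking test information.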

Findings

Linear Regression: Train and test scores were consistent with each other, but MSE was relatively high and R² lower than for the two ensemble models, which suggests underfitting.
Gradient Boosting: More accurate than Linear Regression, with lower MSE, but the gap between train and test scores indicates some overfitting.
Random Forest: Best overall performance, with the highest R², the lowest MSE, and slightly less overfitting than Gradient Boosting.

Conclusion: Random Forest performed best, followed by Gradient Boosting, with Linear Regression performing the worst.
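A comparison of this kind could be produced along the following lines. This is a minimal sketch with scikit-learn defaults, not the notebook's code: the hyperparameters, the seed, and the helper name compare_models are assumptions, so it will not necessarily reproduce the ranking reported above.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def compare_models(X_train, X_test, y_train, y_test, seed=42):
    """Fit the three models and return train/test R² and MSE for each."""
    models = {
        "Linear Regression": LinearRegression(),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores = {}
        for label, X, y in (("train", X_train, y_train),
                            ("test", X_test, y_test)):
            pred = model.predict(X)
            scores[f"{label}_r2"] = r2_score(y, pred)
            scores[f"{label}_mse"] = mean_squared_error(y, pred)
        results[name] = scores
    return results
```

A large gap between train_r2 and test_r2 for a model is the usual sign of overfitting; similar but poor scores on both sets point to underfitting.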

Dataset

The assignment uses the Auto dataset from An Introduction to Statistical Learning.

Features: cylinders, displacement, horsepower, weight, acceleration, year, origin
Target: mpg (miles per gallon)

Setup Instructions

The assignment was developed in Python using a shared virtual environment (~/venvs/ml-env).

To install dependencies:

cd ~/projects/tree-based-methods-assignment
source ~/venvs/ml-env/bin/activate
pip install -r requirements.txt

References

James, Witten, Hastie, Tibshirani, Taylor (2023). An Introduction to Statistical Learning with Applications in Python.
Scikit-learn Documentation: https://scikit-learn.org/stable/
