
Assignment 5 (Week 5) for the Master’s in MIS/ML program at the University of Arizona – Tree-Based Methods. Covers decision trees, bagging, random forests, and boosting using real-world datasets from An Introduction to Statistical Learning.


JDede1/tree-based-methods-assignment


Week 5: Tree-Based Methods

Objective

This homework reviews the basic concepts associated with tree-based methods and introduces the machine learning workflow using real-world datasets. It combines conceptual exercises with applied modeling tasks in Python.

Some questions require independent research beyond the material covered in lectures.


Repository Structure


tree-based-methods-assignment/
├── README.md
├── requirements.txt
├── .gitignore
├── Week5_Tree_Based_Methods.ipynb
└── Dataset/
    └── Auto.csv


Marks Distribution

Question   Marks
Q1         3
Q2         3
Q3a        2
Q3b        2
Q3c        5

Questions

Q1. Recursive Binary Splitting

  • Task: Draw an example of a partition of a two-dimensional feature space resulting from recursive binary splitting, with at least six regions.
  • Deliverable: Partition sketch and corresponding decision tree, labeled with regions (R1, R2, …) and cutpoints (t1, t2, …).
  • Tools used: numpy, matplotlib, sklearn.tree.DecisionTreeClassifier and plot_tree.
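The Q1 deliverable can be sketched programmatically. The snippet below is a minimal illustration, not the assignment's solution: the synthetic data, seed, and class boundary are assumptions. `max_leaf_nodes=6` forces at least six terminal regions, and the leaf id returned by `clf.apply` serves as the region label for shading the partition.

```python
# Hypothetical Q1 sketch: fit a small decision tree on synthetic 2-D data,
# then plot the induced partition of the feature space and the tree itself.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))        # two features, X1 and X2
y = (X[:, 0] + X[:, 1] > 10).astype(int)     # an arbitrary diagonal boundary

# max_leaf_nodes=6 caps the tree at six terminal regions R1..R6
clf = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X, y)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: colour a grid of points by the leaf (region) they fall into
xx, yy = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 10, 200))
Z = clf.apply(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax1.contourf(xx, yy, Z, alpha=0.4)
ax1.set_xlabel("X1")
ax1.set_ylabel("X2")
ax1.set_title("Recursive binary partition (6 regions)")

# Right panel: the corresponding tree; each split shows its cutpoint
plot_tree(clf, feature_names=["X1", "X2"], ax=ax2)
fig.savefig("partition_and_tree.png")

print("number of regions:", clf.get_n_leaves())
```

Because axis-aligned splits cannot perfectly separate a diagonal boundary, the tree uses all six allowed leaves, giving the required six-region partition.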

Q2. Bagging and Combining Predictions

  • Scenario: Ten bootstrapped samples produce estimated probabilities of a class being Red.
  • Task: Compare two approaches for combining results:
    1. Majority vote
    2. Average probability
  • Finding: Majority vote → Red. Average probability → Green.
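The disagreement between the two rules is easy to reproduce. The probabilities below are an assumption taken from the corresponding ISL exercise; substitute the assignment's actual values if they differ.

```python
# Ten bootstrap estimates of P(class = Red) — assumed example values
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Approach 1: majority vote — classify each estimate first, then vote
votes_red = sum(p > 0.5 for p in probs)            # 6 of 10 vote Red
majority = "Red" if votes_red > len(probs) / 2 else "Green"

# Approach 2: average probability — average first, then threshold once
avg = sum(probs) / len(probs)                      # 0.45 < 0.5
average = "Red" if avg > 0.5 else "Green"

print(majority, average)  # the two rules disagree: Red vs Green
```

A majority of the individual estimates exceed 0.5, so voting yields Red; but the below-0.5 estimates are far below the threshold, dragging the average to 0.45 and yielding Green.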

Q3. Boosting and Random Forests

  • Dataset: Auto.csv (from An Introduction to Statistical Learning).
  • Tasks:
    a) Load and preprocess data (convert columns, impute missing values, handle outliers).
    b) Split data into training and test sets.
    c) Fit and compare three models:
    • Linear Regression
    • Gradient Boosting
    • Random Forest
  • Evaluation: Compare train and test MSE and R² scores for each model.
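The Q3 workflow can be sketched as below. This is a self-contained illustration, not the notebook's code: a synthetic stand-in replaces Auto.csv so the snippet runs on its own, and the commented `read_csv` lines show how the real file would be loaded (the ISL Auto data marks missing horsepower values with "?").

```python
# Minimal sketch of the Q3(a)-(c) workflow with a synthetic stand-in dataset.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# (a) Load and preprocess.  With the real file this would be roughly:
#     df = pd.read_csv("Dataset/Auto.csv", na_values="?")
#     df["horsepower"] = pd.to_numeric(df["horsepower"])
#     df = df.fillna(df.median(numeric_only=True))
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "weight": rng.uniform(1500, 5000, n),
    "horsepower": rng.uniform(50, 230, n),
    "year": rng.integers(70, 83, n),
})
df["mpg"] = (60 - 0.008 * df["weight"] - 0.05 * df["horsepower"]
             + rng.normal(0, 2, n))

# (b) Split into training and test sets
X = df.drop(columns="mpg")
y = df["mpg"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# (c) Fit the three models and compare train/test MSE and test R²
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train MSE": mean_squared_error(y_tr, model.predict(X_tr)),
        "test MSE": mean_squared_error(y_te, model.predict(X_te)),
        "test R2": r2_score(y_te, model.predict(X_te)),
    }

print(pd.DataFrame(results).T.round(3))
```

Comparing train MSE against test MSE for each model is what surfaces the overfitting pattern described in the findings: a large train/test gap signals overfitting, while uniformly high MSE on both signals underfitting.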

Findings

  • Linear Regression: Consistent train/test performance, but relatively high MSE and lower R². Indicates underfitting.
  • Gradient Boosting: Improved accuracy, lower MSE, but some overfitting observed.
  • Random Forest: Best overall performance with highest R², lowest MSE, and slightly less overfitting than Gradient Boosting.

Conclusion: Random Forest performed best, followed by Gradient Boosting, with Linear Regression performing the worst.


Dataset

The assignment uses the Auto dataset from An Introduction to Statistical Learning.

  • Features: cylinders, displacement, horsepower, weight, acceleration, year, origin
  • Target: mpg (miles per gallon)

Setup Instructions

The assignment was developed in Python using a shared virtual environment (~/venvs/ml-env).

To install dependencies:

cd ~/projects/tree-based-methods-assignment
source ~/venvs/ml-env/bin/activate
pip install -r requirements.txt

References

  • James, Witten, Hastie, Tibshirani (2021). An Introduction to Statistical Learning with Applications in Python.
  • Scikit-learn Documentation: https://scikit-learn.org/stable/
