# Section 1: From zero to one

## 0. Introduction
- Start with the Decision Tree model
- [Fitting/Training the model] Split the training data into two groups according to patterns in the features.
- [Improving the Decision Tree] Use more "splits" to capture more factors. A tree with more "splits" is a "deeper" tree.
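
The effect of extra splits can be sketched on made-up data. This is only an illustration: the `rooms` and `distance` columns and the price formula are invented, and `max_depth` is used here as one way to limit how many levels of splits the tree may make.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: the price depends on two factors, rooms and distance.
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200)
distance = rng.uniform(0, 30, size=200)
X = np.column_stack([rooms, distance])
y = 200_000 * rooms - 10_000 * distance

# One split: the training data ends up in exactly two groups.
shallow = DecisionTreeRegressor(max_depth=1, random_state=1).fit(X, y)
# Up to four levels of splits: a "deeper" tree with more groups.
deeper = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X, y)

shallow_mae = mean_absolute_error(y, shallow.predict(X))
deeper_mae = mean_absolute_error(y, deeper.predict(X))

print("groups (leaves):", shallow.get_n_leaves(), "vs", deeper.get_n_leaves())
print("training MAE:", round(shallow_mae), "vs", round(deeper_mae))
```

The deeper tree fits the training data more closely because it can react to both factors, not just the single most informative one.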

## 1. Get Familiar With Data
- Use pandas as follows:
  1. Import the package: `import pandas as pd`.
  2. Read the CSV file: `data = pd.read_csv(file_path)`.
  3. Print a summary of the data (count, mean, std, min, max, 25%, 50%, 75%): `data.describe()`.
  - count: the number of rows with a non-missing value in that column.
  - 25% (25th percentile): with the values sorted from smallest to largest, the value that is larger than 25% of the values and smaller than 75% of them.
- [DataFrame] The pandas table that holds the data, similar to a sheet in Excel.
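
To see how `count` and the percentiles behave, here is a small sketch on a hand-made DataFrame. The column names mirror the housing data, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row table; BuildingArea has two missing values.
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5],
    "BuildingArea": [80.0, np.nan, 120.0, 150.0, np.nan],
})

summary = toy.describe()
print(summary)

# count only includes the non-missing rows of each column.
print(summary.loc["count", "BuildingArea"])
# The 25th percentile of Rooms is the value a quarter of the way through the sorted data.
print(summary.loc["25%", "Rooms"])
```

A `count` lower than the number of rows is a quick signal that a column contains missing values.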

## 2. Selecting Variables for Modelling
1. Repeat steps 1-2 from Part 1
2. Print variables/columns: `data.columns`
3. Optional: Drop missing values: `data = data.dropna(axis=0)`
4. Set the prediction target (the value you want to predict, e.g. Price) as `y`: `y = data.Price`
5. Select the features (variables used for prediction): `features = ['feature1', 'feature2', 'feature3']`
  - Note that only numerical data types can be used at this stage, and the prediction target itself must not be listed as a feature.
6. Store the feature columns as `X`: `X = data[features]`
7. Optional: Print the summary of these features: `X.describe()`
8. Optional: Print the first 5 rows of these features: `X.head()`
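
The steps above can be sketched on a tiny, made-up table. The values are invented, and using `select_dtypes` to list the numeric columns is just one convenient option, not something the steps require.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row table with one text column and one missing value.
toy = pd.DataFrame({
    "Suburb": ["A", "B", "C", "D"],
    "Rooms": [2, 3, 4, 3],
    "Landsize": [156.0, np.nan, 120.0, 245.0],
    "Price": [1035000.0, 1465000.0, 1600000.0, 1876000.0],
})

# Drop every row that has a missing value (here, the row for suburb "B").
toy = toy.dropna(axis=0)

# The prediction target.
y = toy["Price"]

# Keep the numeric columns only, and leave the target out of the features.
features = list(toy.select_dtypes(include="number").columns.drop("Price"))
X = toy[features]

print(features)
print(X.head())
```

Leaving `Price` out of `features` matters: a model that is given the answer as an input will look far more accurate than it really is.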

## 3. Building the Model and Making Predictions (Without Evaluation)
1. Define the model (in our case, a decision tree): `from sklearn.tree import DecisionTreeRegressor`, then `model = DecisionTreeRegressor(random_state=1)`
  - `random_state=1` ensures you get the same results in each run; the default is `None`.
2. Fit the model to the data: `model.fit(X, y)`
3. Predict: `prediction = model.predict(X)`
4. View the first few predictions: `prediction[:5]` (use `y.head()` to see the matching actual values)

- To evaluate, we can calculate the mean absolute error (MAE), which is the average absolute difference between the predictions and the actual values. However, if we compute it on the same data that was used to train the model, the score is overly optimistic, because the tree has already seen those answers. In the next section, we will evaluate the model using new, unseen data.
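
As a rough illustration of both points, the sketch below first works out the MAE by hand on three made-up prices and then shows how small the error becomes when a tree is scored on its own training data. The synthetic data and its formula are assumptions for the example only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Worked MAE on three made-up prices: the absolute errors are 10, 10 and 30.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])
manual_mae = np.mean(np.abs(actual - predicted))  # (10 + 10 + 30) / 3
print(manual_mae, mean_absolute_error(actual, predicted))

# Hypothetical noisy data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Fit and score on the very same rows.
model = DecisionTreeRegressor(random_state=1).fit(X, y)
in_sample_mae = mean_absolute_error(y, model.predict(X))
print("in-sample MAE:", in_sample_mae)
```

An unrestricted tree can memorise every training row, noise included, so the in-sample error says little about how the model handles houses it has never seen.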

In [1]:
# Download data from Kaggle

import kagglehub

# Download latest version
path = kagglehub.dataset_download("dansbecker/melbourne-housing-snapshot")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/dansbecker/melbourne-housing-snapshot/versions/5





In [2]:
# 1. Get Familiar with Data

## 1.1 Import package
import pandas as pd

## 1.2 Read CSV file
data = pd.read_csv('/root/.cache/kagglehub/datasets/dansbecker/melbourne-housing-snapshot/versions/5/melb_data.csv')

## 1.3 Print summary of data
data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


In [3]:
# 2. Selecting Variables for Modelling

## 2.2 Print variables
data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [4]:
## 2.3 Drop missing values
data = data.dropna(axis=0)

## 2.4 Set the prediction target (the value you want to predict, e.g. Price) as y
y = data.Price

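## 2.5 Select the features (variables used for prediction). Only numerical data types can be used.
## Caution: 'Price' is also the prediction target y. Keeping it in this list leaks the answer
## into the model, so the errors reported later are unrealistically low; drop it for a realistic model.
features = ['Rooms', 'Price', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt']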

## 2.6 Store the feature columns as X
X = data[features]

## 2.7 Optional: Print the summary of these features
X.describe()

## 2.8 Optional: Print the first 5 rows of these features
X.head()

| | Rooms | Price | Distance | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1035000.0 | 2.5 | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 |
| 2 | 3 | 1465000.0 | 2.5 | 3.0 | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 |
| 4 | 4 | 1600000.0 | 2.5 | 3.0 | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 |
| 6 | 3 | 1876000.0 | 2.5 | 4.0 | 2.0 | 0.0 | 245.0 | 210.0 | 1910.0 |
| 7 | 2 | 1636000.0 | 2.5 | 2.0 | 1.0 | 2.0 | 256.0 | 107.0 | 1890.0 |


In [5]:
# 3. Building the Model and Making Predictions (Without Evaluation)

## 3.1 Define the model used (In our case, it's the decision tree)
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=1)

### `random_state=1` ensures you get the same results in each run; the default is None

## 3.2 Fit the model to the data
model.fit(X, y)

## 3.3 Predict
prediction = model.predict(X)

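## 3.4 View the first few actual prices (y.head() shows the targets; compare them with prediction[:5] to see the predicted values)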
y.head()

| | Price |
|---|---|
| 1 | 1035000.0 |
| 2 | 1465000.0 |
| 4 | 1600000.0 |
| 6 | 1876000.0 |
| 7 | 1636000.0 |


# Section 2: How well does the Model Predict Unseen Data?

## 1. Split the data
- We don't have to generate new data! We can split the original data into two parts: one for training and one for validation.
1. Import the function that splits the data: `from sklearn.model_selection import train_test_split`
2. Split the data into training and validation sets: `train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)` (by default, 25% of the rows go to the validation set)
3. Build the model, same as Section 1 Part 3, but fit it on the training data only: `model = DecisionTreeRegressor()`, then `model.fit(train_X, train_y)`
4. Use the model on unseen data, and get the predictions: `val_predictions = model.predict(val_X)`
5. Print the mean absolute error, which is the average absolute difference between the predictions and the actual values: `from sklearn.metrics import mean_absolute_error`, then `print(f"The mean absolute error for the validation data = {mean_absolute_error(val_y, val_predictions)}.")`
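
The workflow above can be tried on synthetic data to see the size of each part and the gap between the two errors. The data below is invented, and fixing `random_state` on the model is an addition made here for reproducibility.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: 100 rows, two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# With the default settings, 75% of the rows are used for training and 25% for validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
print(train_X.shape, val_X.shape)

# Fit on the training rows only.
model = DecisionTreeRegressor(random_state=1).fit(train_X, train_y)

# Compare the error on rows the model has seen with the error on rows it has not.
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print("training MAE:", train_mae)
print("validation MAE:", val_mae)
```

The validation MAE is the number to trust, since it is measured on rows that played no part in fitting.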

In [9]:
# 1. Import package to split data
from sklearn.model_selection import train_test_split

# 2. Split data into training and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# 3. Build the model, same as Section 1 Part 3, but use the training data
model = DecisionTreeRegressor()
model.fit(train_X, train_y)

# 4. Use the model on unseen data, and get the predictions
val_predictions = model.predict(val_X)

# 5. Print the mean absolute error, which is the average absolute difference between predictions and actual values:
from sklearn.metrics import mean_absolute_error
print(f"Mean absolute error for the validation data = {mean_absolute_error(val_y, val_predictions)}")

Mean absolute error for the validation data = 2353.2014202711425


# Section 3: An Alternative

## 1. Why is the Decision Tree Model Not Working?
- Underfitting: When the tree has too few branches, it groups houses based on overly general patterns, leading to poor performance on both training and validation data.
- Overfitting: When the tree has too many branches, it captures overly specific patterns, resulting in high accuracy on training data but poor performance on validation data.
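
One way to observe both effects is to vary the tree size and compare training and validation errors. The sketch below uses invented data, and `max_leaf_nodes` is chosen here as the size control; other settings such as `max_depth` would serve the same purpose.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a non-linear pattern plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 2))
y = 50.0 * np.sin(X[:, 0]) + 5.0 * X[:, 1] + rng.normal(0, 5, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Map each tree size to its (training MAE, validation MAE).
results = {}
for max_leaf_nodes in [2, 10, 50, 500]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[max_leaf_nodes] = (train_mae, val_mae)
    print(f"max_leaf_nodes={max_leaf_nodes:>3}  train MAE={train_mae:6.2f}  validation MAE={val_mae:6.2f}")
```

A very small tree shows high error on both sets (underfitting), while the largest tree drives the training error towards zero without a matching drop in validation error (overfitting). A sensible size lies where the validation error is lowest.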

## 2. Random Forest Model
- Uses many trees, in contrast to the decision tree, which uses one tree.
- Makes a prediction by averaging the predictions of each component tree.
1. Import the package: `from sklearn.ensemble import RandomForestRegressor`
2. Build the model and make predictions, same as Section 2 Part 1, but use a random forest: `forest_model = RandomForestRegressor(random_state=1)`, then `forest_model.fit(train_X, train_y)`, then `melb_preds = forest_model.predict(val_X)`
3. Print the mean absolute error: `print(f"Mean absolute error using Random Forest = {mean_absolute_error(val_y, melb_preds)}")`
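
The averaging step can be checked directly, since a fitted scikit-learn forest exposes its component trees through `estimators_`. The data below is synthetic, and `n_estimators=20` is an arbitrary choice to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 200 rows, three features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X, y)

# Ask every component tree for its own prediction on the first five rows.
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging the trees by hand should reproduce the forest's prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])

print(manual_average)
print(forest_prediction)
```

Individual trees disagree with each other, and averaging smooths out much of that disagreement, which is why a forest tends to be less sensitive to overfitting than a single deep tree.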

In [12]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(f"Mean absolute error using Random Forest = {mean_absolute_error(val_y, melb_preds)}")

Mean absolute error using Random Forest = 3024.9537314396384


# Conclusion

In this notebook, we have learnt about two machine learning models (the decision tree and the random forest), how to select the data to train them, and how to evaluate them on validation data. Possible next steps:

- The Intermediate Machine Learning course teaches you how to handle non-numerical features. You will also learn to use xgboost, a technique that often gives better accuracy than Random Forest.
- The Pandas course gives you the data manipulation skills to go quickly from a conceptual idea to an implementation in your data science projects.
- The Deep Learning course is also within reach; there you build models that can reach better-than-human performance on some computer vision tasks.