# First ML model

I want to follow a guide from "Machine Learning for Absolute Beginners: Python for Data Science, Book" to create my first ML model. 

## Get data 

The data used in this notebook is available here: <https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market>

To get this data, we first need to install kagglehub:

In [35]:
%pip install kagglehub

Note: you may need to restart the kernel to use updated packages.


Now we can download the dataset.

In [36]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("anthonypino/melbourne-housing-market")


Let's install the other libraries we need:

In [37]:
%pip install pandas


Note: you may need to restart the kernel to use updated packages.


In [38]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


Once dependencies are installed, we can import them:

In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error

## Import the dataset

In [40]:
df = pd.read_csv(path + "/Melbourne_housing_FULL.csv")
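Joining the directory and file name with string concatenation works, but pathlib takes care of the separator on any operating system. A minimal sketch, assuming a hypothetical download directory in place of the real `path` returned by kagglehub:

```python
from pathlib import Path

# Hypothetical stand-in for the directory returned by kagglehub.dataset_download
download_dir = "/tmp/melbourne-housing-market"

# Path's "/" operator inserts the correct separator for the current OS
csv_path = Path(download_dir) / "Melbourne_housing_FULL.csv"
print(csv_path.name)
```

`pd.read_csv` accepts a `Path` object directly, so `csv_path` could be passed to it without converting to a string.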

## Scrubbing the dataset

Goal: modify or remove incomplete, irrelevant, or duplicated data.

First, delete the columns we don't need:

In [41]:
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
# 'Lattitude' and 'Longtitude' are spelled this way in the CSV header
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
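The same kind of cleanup can also be written as a single `DataFrame.drop` call. A small sketch on an invented frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    "Address": ["1 Example St", "2 Sample Rd"],
    "Method": ["S", "SP"],
    "Rooms": [2, 3],
    "Price": [850000.0, 1200000.0],
})

# Drop several columns at once; returns a new frame and leaves `toy` untouched
trimmed = toy.drop(columns=["Address", "Method"])
print(list(trimmed.columns))
```

Unlike `del`, `drop` does not modify the original frame unless the result is assigned back to the same name.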

The next step is to remove rows with missing values.

In [42]:
df.dropna(axis = 0, how = 'any', subset = None, inplace = True)
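To see what `how = 'any'` does, here is a small sketch on invented data: a row is dropped as soon as it contains at least one missing value.

```python
import pandas as pd

nan = float("nan")

# Toy frame; row 1 lacks BuildingArea and row 2 lacks Price
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [95.0, nan, 140.0],
    "Price": [850000.0, 1200000.0, nan],
})

# axis=0 drops rows (not columns); how="any" drops a row if any value is missing
cleaned = toy.dropna(axis=0, how="any")
print(len(toy), "->", len(cleaned))
```

Dropping every incomplete row can discard a large share of a dataset, so comparing `len(df)` before and after the call is a cheap sanity check.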

Next, convert columns that contain non-numeric data to numeric values using one-hot encoding.

In [45]:
df = pd.get_dummies(df, columns = ['Suburb', 'CouncilArea', 'Type'])
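As a small illustration of one-hot encoding, the sketch below runs `get_dummies` on an invented frame. Each distinct value in the encoded column becomes its own indicator column, named after the original column and the value.

```python
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "h"],
})

# "Type" is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["Type"])
print(list(encoded.columns))
```

A column with many distinct values, such as a suburb name, therefore adds many columns to the frame.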

Lastly, assign the dependent variable y (Price) and the independent variables X (the
remaining 11 variables, which one-hot encoding has expanded into many more columns).

In [46]:
X = df.drop('Price',axis=1)
y = df['Price']

## Split the dataset

Use 70% of the data for training and 30% for testing.

In [47]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, shuffle = True
)
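The effect of `test_size = 0.3` is easy to check on a tiny invented dataset. The sketch below also passes `random_state`, which is an addition and not part of the notebook's own call; fixing it makes the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Ten invented samples: one feature column and one target value each
X_toy = [[i] for i in range(10)]
y_toy = [i * 100 for i in range(10)]

# random_state is an assumption added here; it fixes the shuffle so that
# the same rows land in the same split on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=True, random_state=42
)
print(len(X_tr), len(X_te))
```

Without a fixed `random_state`, each run produces a different split, so the error figures in the evaluation cells will vary somewhat from run to run.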

## Select Algorithm and Configure Hyperparameters

In [48]:
model = ensemble.GradientBoostingRegressor(
    n_estimators = 250,
    learning_rate = 0.1,
    max_depth = 5,
    min_samples_split = 4,
    min_samples_leaf = 6,
    max_features = 0.6,
    loss = 'huber'
)

In [49]:
model.fit(X_train, y_train)

## Evaluate the Results

In [50]:
mae_train = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae_train)

Training Set Mean Absolute Error: 122609.90


In [51]:
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae_test)

Test Set Mean Absolute Error: 158780.88
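Mean absolute error is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on invented prices, then uses the two figures printed above to measure how much larger the error is on the test set. The helper `mae` is a hypothetical name for illustration, not the scikit-learn function.

```python
def mae(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Invented prices, for illustration only
actual = [500000, 750000, 1000000]
predicted = [520000, 700000, 1090000]
toy_mae = mae(actual, predicted)  # (20000 + 50000 + 90000) / 3

# Figures copied from the output cells above
mae_train_reported = 122609.90
mae_test_reported = 158780.88
gap = mae_test_reported - mae_train_reported
relative_gap = gap / mae_train_reported

print("Toy MAE: %.2f" % toy_mae)
print("Test error exceeds training error by %.2f (%.1f%%)" % (gap, relative_gap * 100))
```

A test error noticeably higher than the training error suggests the model fits the training data more closely than it generalizes, which is one reason to experiment with the hyperparameters.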
