# **Lab: Engineering for ML**



## Exercise 2: EDA and Baseline Model

In this exercise, we will start our data science project by exploring the dataset, preparing it for modeling, and training a baseline model.

**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install pyenv (https://realpython.com/lessons/installing-pyenv/)
- Install poetry (https://python-poetry.org/docs/#installation)
- Install Wget for Windows users (https://eternallybored.org/misc/wget/)

The steps are:
1.   Setup Environment
2.   Load and explore dataset
3.   Prepare Data
4.   Split Dataset
5.   Get Baseline model
6.   Push changes


## 1. Setup Environment

**[1.1]** Download the dataset (https://raw.githubusercontent.com/aso-uts/labs_datasets/refs/heads/main/36120-adv_mla/lab01/wfh.csv) into the sub-folder data/raw

In [None]:
# Windows users: install Wget first, or download the file manually from the link and save it to the path below
! wget -P ~/Projects/adv_mla_2025/adv_mla_lab_1/data/raw https://raw.githubusercontent.com/aso-uts/labs_datasets/refs/heads/main/36120-adv_mla/lab01/wfh.csv

**[1.5]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
! poetry run jupyter lab

**[1.6]** Create a new Jupyter Notebook called `1_baseline.ipynb` inside the `work/adv_mla_lab_1/notebooks/` directory



## 2. Load and Explore Dataset



**[2.1]** Launch magic commands to automatically reload modules

In [None]:
%load_ext autoreload
%autoreload 2

**[2.2]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import pandas as pd
import numpy as np

**[2.3]** Load the dataset into a DataFrame called `df`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
#Solution:
df = pd.read_csv('../data/raw/wfh.csv')

**[2.4]** Display the first 5 rows of df

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.head()

**[2.5]** Display the dimensions (shape) of df

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.shape

**[2.6]** Display the summary (info) of df

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.info()

**[2.7]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.describe(include='all')

## 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned = df.copy()

**[3.2]** Extract the column `work_home_actual` and save it into variable called `y`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
y = df_cleaned.pop('work_home_actual')
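
As a small illustration of what `pop` does, the sketch below uses a toy DataFrame (the column names `feature` and `target` and their values are made up, not taken from `wfh.csv`). It shows that `pop` returns the column as a Series and removes it from the DataFrame in place.

```python
import pandas as pd

# Toy data with hypothetical columns (not from wfh.csv)
toy = pd.DataFrame({"feature": [1, 2, 3], "target": [0, 1, 0]})

# pop() returns the column as a Series and drops it from the DataFrame
toy_y = toy.pop("target")

print(toy.columns.tolist())  # remaining feature columns
print(toy_y.tolist())        # extracted target values
```

This is why `df_cleaned` no longer contains `work_home_actual` after step 3.2.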

**[3.3]** Import OrdinalEncoder from sklearn.preprocessing

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from sklearn.preprocessing import OrdinalEncoder

**[3.4]** Instantiate an OrdinalEncoder with the ordered values from the `workday` column

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ord_enc = OrdinalEncoder(categories=[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']])

**[3.5]** Fit and apply the OrdinalEncoder on `workday` column and replace with the encoded values

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned['workday'] = ord_enc.fit_transform(df_cleaned[['workday']])
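
To see how the explicit category list controls the encoding, here is a minimal sketch on toy data (the four rows are invented for illustration). With the order given above, Monday should map to 0 and Friday to 4.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values, not taken from wfh.csv)
toy = pd.DataFrame({"workday": ["Wednesday", "Monday", "Friday", "Tuesday"]})

# The explicit category list fixes the order: Monday -> 0, ..., Friday -> 4
toy_enc = OrdinalEncoder(
    categories=[["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]]
)

# fit_transform returns a 2D array with one column; ravel() flattens it
toy["workday_encoded"] = toy_enc.fit_transform(toy[["workday"]]).ravel()
print(toy)
```

Without the `categories` argument, the encoder would sort the labels alphabetically, so Friday would come before Monday.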

**[3.6]** Apply one-hot encoding on `salary_range` and save the result back in `df_cleaned`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned = pd.get_dummies(df_cleaned, columns=["salary_range"])
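
The sketch below shows the effect of `pd.get_dummies` on a toy DataFrame (the `low` and `high` labels are assumptions for illustration; the real `salary_range` values may differ). The original column is replaced by one indicator column per category.

```python
import pandas as pd

# Toy data with hypothetical salary labels (not taken from wfh.csv)
toy = pd.DataFrame({"id": [1, 2, 3], "salary_range": ["low", "high", "low"]})

# Each distinct value becomes its own indicator column, prefixed by the column name
toy_encoded = pd.get_dummies(toy, columns=["salary_range"])

print(toy_encoded.columns.tolist())
print(toy_encoded)
```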

**[3.7]** Remove the `id` column from `df_cleaned`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned.drop(["id"], axis=1, inplace=True)

## 4. Split Dataset

**[4.1]** Import train_test_split from sklearn.model_selection

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from sklearn.model_selection import train_test_split

**[4.2]** Split the data into training, validation, and testing sets

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
X_data, X_test, y_data, y_test = train_test_split(df_cleaned, y, test_size=0.2, random_state=8)
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)
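
Applying `test_size=0.2` twice does not give a 60/20/20 split: the second call takes 20% of the remaining 80%. A minimal sketch with 100 synthetic rows (the arrays `X_toy` and `y_toy` are made up for illustration) makes the resulting proportions visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows so that counts read directly as percentages
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100) % 2

# First split: hold out 20% of all rows for testing
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8
)

# Second split: hold out 20% of the remaining 80 rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=8
)

print(len(X_tr), len(X_va), len(X_te))  # 64 16 20
```

So the training, validation, and testing sets hold roughly 64%, 16%, and 20% of the rows.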

**[4.3]** Print the dimensions of `X_train`, `X_val`, `X_test`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

**[4.4]** Print the dimensions of `y_train`, `y_val`, `y_test`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

**[4.5]** Save the sets into the folder `data/processed`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_val.to_csv('../data/processed/X_val.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_val.to_csv('../data/processed/y_val.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
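
`to_csv` raises an error if the target folder does not exist. The sketch below shows one way to create the folder first with `pathlib`. It writes to a temporary directory as a stand-in for the project root, and the DataFrame `toy_X` is invented for illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Temporary directory used as a stand-in for the project root (assumption)
base_dir = Path(tempfile.mkdtemp())
processed_dir = base_dir / "data" / "processed"

# Create the folder (and any missing parents); do nothing if it already exists
processed_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical data, saved and reloaded to confirm the round trip
toy_X = pd.DataFrame({"feature": [1, 2, 3]})
toy_X.to_csv(processed_dir / "X_train.csv", index=False)

reloaded = pd.read_csv(processed_dir / "X_train.csv")
print(reloaded.shape)
```

In the notebook, the equivalent call would presumably be `Path('../data/processed').mkdir(parents=True, exist_ok=True)` before saving.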

## 5. Get Baseline Model

**[5.1]** Import the DummyClassifier class from sklearn.dummy

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from sklearn.dummy import DummyClassifier

**[5.2]** Instantiate DummyClassifier into a variable called `base_clf` and fit it on the training set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
base_clf = DummyClassifier(strategy='most_frequent')
base_clf.fit(X_train, y_train)

**[5.3]** Import roc_auc_score from sklearn.metrics

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from sklearn.metrics import roc_auc_score

**[5.6]** Display the ROC AUC score of this baseline model on the training set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
y_proba_preds = base_clf.predict_proba(X_train)
roc_auc_score(y_train, y_proba_preds[:, 1])
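
A `most_frequent` dummy classifier assigns the same probability to every row, so it cannot rank positives above negatives and its ROC AUC should come out at 0.5. The sketch below checks this on a synthetic binary target (the arrays are made up and the class balance is an assumption).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data: features are ignored by the dummy model anyway
X_toy = np.zeros((10, 2))
y_toy = np.array([0] * 7 + [1] * 3)  # imbalanced binary target (assumption)

toy_clf = DummyClassifier(strategy="most_frequent")
toy_clf.fit(X_toy, y_toy)

# Every row gets probability 1 for the majority class and 0 for the other
toy_proba = toy_clf.predict_proba(X_toy)
toy_auc = roc_auc_score(y_toy, toy_proba[:, 1])
print(toy_auc)  # 0.5
```

Any real model trained later should beat this 0.5 reference to be considered useful.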

## 6.   Push changes

**[6.1]** Add your changes to git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git add .

**[6.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git commit -m "prepare data and baseline"

**[6.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git push

**[6.4]** Stop Jupyter Lab

In [None]:
# Solution:
# Press Ctrl+C in the terminal where Jupyter Lab is running