# **Lab: ML Engineering**



## Exercise 2: EDA and Baseline Model

In this exercise, we will start our data science project by exploring the dataset, preparing it for modeling and building a baseline model.

**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install pyenv (https://realpython.com/lessons/installing-pyenv/)
- Install poetry (https://python-poetry.org/docs/#installation)
- Install Wget for Windows users (https://eternallybored.org/misc/wget/)

The steps are:
1. Setup Environment
2. Load and Explore Dataset
3. Prepare Data
4. Split Dataset
5. Get Baseline Model
6. Push Changes


## 1. Setup Environment

**[1.1]** Download the dataset (https://raw.githubusercontent.com/aso-uts/adv_mla/main/lab1/insurance.csv) into the sub-folder data/raw

In [None]:
# Run this from the root of your project folder.
# Windows users can install Wget first, or download the file manually from the link and save it into data/raw
wget -P data/raw https://raw.githubusercontent.com/aso-uts/adv_mla/main/lab1/insurance.csv

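
If Wget is not available, the same download can be done from Python. The sketch below is one possible alternative, using only the standard library; the helper name `download_file` is hypothetical and not part of the lab material, and the `data/raw` destination assumes the call is made from the project root.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://raw.githubusercontent.com/aso-uts/adv_mla/main/lab1/insurance.csv"


def download_file(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` and return the path of the saved file.

    Hypothetical helper for illustration; not part of the lab code base.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    target = dest / url.rsplit("/", 1)[-1]    # keep the remote file name
    urlretrieve(url, str(target))
    return target


# Example call, run from the project root (commented out so nothing is downloaded on import):
# download_file(DATA_URL, "data/raw")
```

The helper creates the destination folder when needed, so it also works on a fresh clone where `data/raw` does not exist yet.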

**[1.5]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
poetry run jupyter lab

**[1.6]** Create a new Jupyter Notebook called `1_baseline.ipynb` inside the `work/adv_mla_lab_1/notebooks/` directory



## 2. Load and Explore Dataset



**[2.1]** Run the magic commands to automatically reload modules

In [1]:
%load_ext autoreload
%autoreload 2

**[2.2]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [2]:
# Solution
import pandas as pd
import numpy as np

**[2.3]** Load the dataset into a dataframe called `df`

In [None]:
# Placeholder for student's code (Python code)

In [6]:
# Solution:
# The notebook lives in notebooks/, so the data folder is one level up
df = pd.read_csv('../data/raw/insurance.csv')

**[2.4]** Display the first 5 rows of df

In [7]:
# Placeholder for student's code (Python code)

In [8]:
# Solution
df.head()


**[2.5]** Display the dimensions (shape) of df

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.shape

(50000, 7)

**[2.6]** Display the summary (info) of df

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       50000 non-null  int64  
 1   sex       50000 non-null  object 
 2   bmi       50000 non-null  float64
 3   children  50000 non-null  int64  
 4   smoker    50000 non-null  object 
 5   region    50000 non-null  object 
 6   charges   50000 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 2.7+ MB


**[2.7]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.describe(include='all')

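|        | age       | sex   | bmi       | children | smoker | region    | charges      |
|--------|-----------|-------|-----------|----------|--------|-----------|--------------|
| count  | 50000.0   | 50000 | 50000.0   | 50000.0  | 50000  | 50000     | 50000.0      |
| unique | NaN       | 2     | NaN       | NaN      | 2      | 4         | NaN          |
| top    | NaN       | male  | NaN       | NaN      | no     | southeast | NaN          |
| freq   | NaN       | 25176 | NaN       | NaN      | 38976  | 14197     | NaN          |
| mean   | 39.46312  | NaN   | 30.713734 | 1.11376  | NaN    | NaN       | 13343.216363 |
| std    | 14.117142 | NaN   | 6.092727  | 1.212835 | NaN    | NaN       | 12131.222744 |
| min    | 18.0      | NaN   | 17.291    | 0.0      | NaN    | NaN       | 1137.5359    |
| 25%    | 27.0      | NaN   | 26.6      | 0.0      | NaN    | NaN       | 4694.4318    |
| 50%    | 40.0      | NaN   | 30.3      | 1.0      | NaN    | NaN       | 9399.232775  |
| 75%    | 51.0      | NaN   | 34.57     | 2.0      | NaN    | NaN       | 17340.746925 |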
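
To see how `describe(include='all')` mixes numeric and categorical statistics, here is a small sketch on made-up rows. The values in `toy` are invented for illustration and are not taken from `insurance.csv`; only the column names follow the lab dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from insurance.csv
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "smoker": ["yes", "no", "no", "no"],
})

summary = toy.describe(include="all")

# Numeric statistics are NaN for text columns, and the other way around
print(summary.loc["mean", "age"])               # 37.25
print(summary.loc["top", "smoker"])             # no
print(pd.isna(summary.loc["mean", "smoker"]))   # True

# value_counts(normalize=True) gives the share of each category
shares = toy["smoker"].value_counts(normalize=True)
print(shares["no"])  # 0.75
```

The same `value_counts(normalize=True)` call can be applied to `sex`, `smoker` and `region` in `df` to check how balanced each category is.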


## 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned = df.copy()

**[3.2]** Extract the column `charges` and save it into variable called `target`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
target = df_cleaned.pop('charges')

**[3.3]** Create 2 lists named `num_cols` and `cat_cols` containing respectively the names of the numerical and categorical columns

In [None]:
# Placeholder for student's code (Python code)

In [None]:
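# Solution:
num_cols = list(df_cleaned.select_dtypes('number').columns)
# Select the non-numeric columns directly so that their order stays the same on every run
cat_cols = list(df_cleaned.select_dtypes(exclude='number').columns)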
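
Steps 3.2 and 3.3 can be checked on a tiny frame before running them on the real data. This is a sketch with made-up values; the names `toy`, `toy_target`, `toy_num_cols` and `toy_cat_cols` are hypothetical and only mirror the lab variables.

```python
import pandas as pd

# Toy frame with made-up values that mimics the column types of the lab dataset
toy = pd.DataFrame({
    "age": [19, 33],
    "sex": ["female", "male"],
    "bmi": [27.9, 22.7],
    "charges": [100.0, 200.0],
})

# pop() removes the column from the frame in place and returns it as a Series
toy_target = toy.pop("charges")

toy_num_cols = list(toy.select_dtypes("number").columns)
toy_cat_cols = list(toy.select_dtypes(exclude="number").columns)

print(list(toy.columns))  # ['age', 'sex', 'bmi']
print(toy_num_cols)       # ['age', 'bmi']
print(toy_cat_cols)       # ['sex']
```

Because `pop` modifies the frame, the target column no longer appears in either list, which is why the lab works on the copy `df_cleaned` rather than on `df`.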

**[3.4]** Import StandardScaler and OneHotEncoder from sklearn.preprocessing

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from sklearn.preprocessing import StandardScaler, OneHotEncoder

**[3.5]** Instantiate the OneHotEncoder

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ohe = OneHotEncoder(sparse_output=False, drop='first')

**[3.6]** Fit and apply the OneHotEncoder on the categorical columns of `df_cleaned` and save the result in `features`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
features = ohe.fit_transform(df_cleaned[cat_cols])

**[3.7]** Convert `features` into a dataframe

In [None]:
# Placeholder for student's code (Python code)

In [None]:
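# Solution
# Name the encoded columns after the categories kept by the encoder
ohe_cat_cols = ohe.get_feature_names_out(cat_cols)
features = pd.DataFrame(features, columns=ohe_cat_cols)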

**[3.8]** Instantiate the StandardScaler

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
scaler = StandardScaler()

**[3.9]** Fit and apply the scaling on the numerical columns of `df_cleaned` and add the results into `features`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
features[num_cols] = scaler.fit_transform(df_cleaned[num_cols])
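
For reference, `StandardScaler` standardises each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample version (`ddof=1`). The NumPy sketch below reproduces that formula on made-up values.

```python
import numpy as np

# Made-up values for illustration
x = np.array([18.0, 27.0, 40.0, 51.0, 64.0])

mean = x.mean()
std = x.std()           # population standard deviation (ddof=0), as StandardScaler uses
z = (x - mean) / std    # standardised values

# After scaling, the column has mean 0 and standard deviation 1
print(np.isclose(z.mean(), 0.0), np.isclose(z.std(), 1.0))  # True True
```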

**[3.10]** Import dump from joblib



In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from joblib import dump

**[3.11]** Save the one-hot encoder and scaler into the folder `models` and call the files respectively `ohe.joblib` and `scaler.joblib`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
dump(ohe, '../models/ohe.joblib')
dump(scaler, '../models/scaler.joblib')
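
A saved object can be read back with `joblib.load`. The following sketch shows the round trip in a temporary folder; the dictionary `artifact` is a stand-in for the fitted `ohe` or `scaler`, and the file name is hypothetical.

```python
import tempfile
from pathlib import Path

from joblib import dump, load

# Stand-in object; in the lab this would be the fitted `ohe` or `scaler`
artifact = {"mean": 39.5, "columns": ["age", "bmi", "children"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "artifact.joblib"
    dump(artifact, str(path))      # serialise the object to disk
    restored = load(str(path))     # read it back

print(restored == artifact)  # True
```

Note that `dump` does not create missing folders, so the `models` directory has to exist before step 3.11 is run.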

## 4. Split Dataset

**[4.1]** Import train_test_split from sklearn.model_selection

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from sklearn.model_selection import train_test_split

**[4.2]** Split the data into training, validation and testing sets (hold out 20% for testing, then 20% of the remainder for validation)

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
X_data, X_test, y_data, y_test = train_test_split(features, target, test_size=0.2, random_state=8)
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)
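
Applying `test_size=0.2` twice yields a 64% / 16% / 20% split of the original rows. The sketch below confirms the arithmetic on row indices only; `rows` is a stand-in for the 50,000 rows reported by `df.shape`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices standing in for the 50,000 rows reported by df.shape
rows = np.arange(50_000)

# First split: 20% of all rows go to the test set
rest, test = train_test_split(rows, test_size=0.2, random_state=8)
# Second split: 20% of the remaining 80% go to the validation set
train, val = train_test_split(rest, test_size=0.2, random_state=8)

print(len(train), len(val), len(test))  # 32000 8000 10000
print(len(train) / len(rows), len(val) / len(rows), len(test) / len(rows))  # 0.64 0.16 0.2
```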

**[4.3]** Print the dimensions of `X_train`, `X_val`, `X_test`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(32000, 8)
(8000, 8)
(10000, 8)


**[4.4]** Print the dimensions of `y_train`, `y_val`, `y_test`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(32000,)
(8000,)
(10000,)


**[4.5]** Save the sets into the folder `data/processed`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_val.to_csv('../data/processed/X_val.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_val.to_csv('../data/processed/y_val.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
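
One thing to keep in mind when these files are loaded again: a target saved with `to_csv(..., index=False)` comes back from `pd.read_csv` as a one-column dataframe, not a Series. The sketch below illustrates this with made-up values and an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up target values with a shuffled index, as produced by train_test_split
y_toy = pd.Series([10.5, 20.0, 30.25], index=[7, 2, 5], name="charges")

buffer = io.StringIO()
y_toy.to_csv(buffer, index=False)    # index=False drops the shuffled row labels
buffer.seek(0)

restored = pd.read_csv(buffer)       # comes back as a one-column DataFrame
print(type(restored).__name__)       # DataFrame
print(restored["charges"].tolist())  # [10.5, 20.0, 30.25]
```

Selecting the column, as in `restored["charges"]`, recovers a Series for use with scikit-learn.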

## 5. Get Baseline Model

**[5.1]** Calculate the average of the target variable for the training set and save it into a variable called `pred_value`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
pred_value = y_train.mean()

**[5.2]** Generate a numpy array with the same dimensions as `y_train` that contains only the value saved in `pred_value`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
# Use y_train.shape so that the array is 1-D, like y_train
y_base = np.full(y_train.shape, pred_value)

**[5.5]** Import mean_squared_error and mean_absolute_error from sklearn.metrics

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

**[5.6]** Display the RMSE and MAE scores of this baseline model on the training set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
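# Solution:
# Take the square root of the MSE to get the RMSE; this avoids the `squared`
# argument, which is not available in recent scikit-learn releases.
# Both metrics expect (y_true, y_pred).
print(np.sqrt(mse(y_train, y_base)))
print(mae(y_train, y_base))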

12116.584822448176
9118.852804794265
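
To make the two scores concrete, here is a NumPy-only sketch of the same calculation on made-up values. It also shows why the baseline RMSE is close to the standard deviation of the target: for a mean predictor, the RMSE on the same data equals the population standard deviation.

```python
import numpy as np

# Made-up target values for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

baseline = np.full(y_true.shape, y_true.mean())   # every prediction is the mean (3.0)
errors = y_true - baseline

rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
mae = np.mean(np.abs(errors))          # mean absolute error

print(rmse)  # sqrt(2), about 1.414
print(mae)   # 1.2
print(np.isclose(rmse, y_true.std()))  # True
```

Any model trained later should beat these baseline scores to be worth keeping.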


## 6. Push Changes

**[6.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .

**[6.2]** Create a snapshot (commit) of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git commit -m "prepare data and baseline"

**[6.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git push

**[6.4]** Stop Jupyter Lab