# Lab work 1: Machine Learning Basics

This notebook builds on the first lecture of Foundations of Machine Learning. We'll focus on the preprocessing pipeline. The actual models will come later, but for now you'll see how each step gets us closer to a proper model.

Important note: the steps shown here are not always the most efficient or the most “industry-approved.” Their main purpose is pedagogical. So don't panic if something looks suboptimal—it's meant to be.

If you have questions (theoretical or practical), don't hesitate to bug your lecturer.

We will try to accurately predict the price of a diamond based on a [dataset](https://www.kaggle.com/datasets/shivam2503/diamonds). Let's first load the dataset.

In [None]:
import pandas as pd

df = pd.read_csv("NB1 - Diamonds.csv")
df.head()

Before diving into the dataset, notice that the column *Unnamed: 0* doesn't seem to carry any useful information.

**Task**: Use the [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method to remove the *Unnamed: 0* column.
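
As a hint, here is a minimal sketch of the `drop` call on a tiny hand-made dataframe. The dataframe `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset (made-up values, illustration only).
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "carat": [0.3, 0.5, 0.7],
    "price": [400, 1500, 2800],
})

# `columns=` names the column(s) to remove; `drop` returns a new dataframe
# and leaves the original untouched.
toy_clean = toy.drop(columns=["Unnamed: 0"])
print(toy_clean.columns.tolist())
```

Because `drop` returns a copy by default, remember to assign the result back if you want to keep it.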

Here are the columns we'll be working with:

* **price** : price in US dollars (\$326-\$18,823)
* **carat** : weight of the diamond (0.2-5.01)
* **cut** : quality of the cut (Fair, Good, Very Good, Premium, Ideal)
* **color** : diamond colour, from J (worst) to D (best)
* **clarity** : a measurement of how clear the diamond is from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (best)
* **x** : length in mm (0-10.74)
* **y** : width in mm (0-58.9)
* **z** : depth in mm (0-31.8)
* **depth** : total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y) (43-79)
* **table** : width of top of diamond relative to widest point (43-95)

We're all eager to jump into machine learning, so let's build our very first linear regressor!

## My first model!

We know that a linear regression only works with numerical inputs. So, in this case, we can use the columns *carat*, *x*, *y*, *z*, *depth* and *table* to predict the target column *price*. 

**Task**: From the dataframe *df*, extract a matrix *X* (the features) and a vector *y* (the target). Then use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split *X* and *y* into training and test sets.
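
One possible shape for this step, sketched on synthetic data. The dataframe `toy`, the 80/20 split and the `random_state` value are assumptions chosen for illustration, not requirements of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the diamonds dataframe (random values, illustration only).
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "table": rng.uniform(50, 65, n),
    "cut": rng.choice(["Good", "Ideal"], n),
})
toy["price"] = 4000 * toy["carat"] + rng.normal(0, 200, n)

# Keep only the numerical feature columns; the target goes in its own vector.
feature_columns = ["carat", "table"]
X_toy = toy[feature_columns]
y_toy = toy["price"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```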

Now that we have our training and test sets, it's time to train and evaluate! We'll measure performance with two metrics: [`RMSE`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error) and [`MAE`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error).


**Task**: Using the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class, fit a model on the training set. Then use the `predict` method to make predictions on the test set and print out performance for both metrics.
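
The fit / predict / evaluate loop could look like the sketch below, again on synthetic data. As an assumption for portability, RMSE is computed here as the square root of `mean_squared_error`, which also works on scikit-learn versions that predate `root_mean_squared_error`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: one feature, linear relationship plus Gaussian noise.
X_toy = rng.uniform(0.2, 2.0, size=(200, 1))
y_toy = 4000 * X_toy[:, 0] + rng.normal(0, 200, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on the training set only, then predict on unseen data.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE penalises large errors more heavily than MAE does.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f} | MAE: {mae:.2f}")
```

Both metrics are expressed in the unit of the target, so on the real data they read directly as errors in dollars.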

So... is our model actually any good?

**Task**: Interpret the results you just got, and then compute the performance of a very simple baseline model for comparison.
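
One common choice of baseline, assumed here for illustration, is a model that always predicts the mean of the training target. scikit-learn provides it as `DummyRegressor`. The numbers below are made up.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets; the features are ignored by the baseline, so zeros are enough.
y_train = np.array([400.0, 900.0, 1500.0, 5200.0])
y_test = np.array([700.0, 3300.0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# The baseline predicts the training mean for every observation.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
baseline_mae = mean_absolute_error(y_test, baseline_pred)
print(f"Baseline RMSE: {baseline_rmse:.2f} | MAE: {baseline_mae:.2f}")
```

A model is only worth keeping if it clearly beats this kind of baseline on the same test set.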

It could definitely be better. Let's take a step back and reflect:

1. We chose columns just based on their type. That means we ignored categorical data and didn't even check whether the values made sense.
2. We trained the model once and never checked for overfitting. Oh, and we forgot to scale the inputs too.

Maybe we rushed things a bit. Time to get back to the bread and butter of data science: data preparation.

## Data quality

**Task**: Use the [`isna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) method together with `sum` to check for missing values in the dataset, and interpret the result.
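
A small sketch of how the two calls chain, on a hand-made dataframe with a couple of holes (values are made up):

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one missing value in `carat` and one in `cut`.
toy = pd.DataFrame({
    "carat": [0.3, np.nan, 0.7],
    "cut": ["Ideal", "Good", None],
    "price": [400, 1500, 2800],
})

# `isna` gives a boolean dataframe; summing it counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```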

**Task**: Use the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method to examine the distributions of the numerical variables, and then interpret the results.

The minimum values for *x*, *y*, and *z* are 0. So apparently... we have some 2D diamonds?!

**Task**: Display all observations where at least one of the three dimension variables is equal to 0.

Clearly, these *flat diamonds* make no sense for our study: they're just data collection errors.

**Task**: Remove these observations, and report the dataset size before and after the cleanup.
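
Both steps (displaying, then removing) can share a single boolean mask. The sketch below uses a made-up dataframe `toy`; only the column names match the dataset.

```python
import pandas as pd

# Made-up observations; three of the four rows have at least one zero dimension.
toy = pd.DataFrame({
    "x": [4.1, 0.0, 5.2, 6.0],
    "y": [4.2, 5.0, 5.3, 0.0],
    "z": [2.5, 3.1, 0.0, 0.0],
    "price": [400, 1500, 2800, 3900],
})

dimensions = ["x", "y", "z"]

# True for every row where at least one of the three dimensions equals 0.
has_zero_dimension = (toy[dimensions] == 0).any(axis=1)
print(toy[has_zero_dimension])

# `~` negates the mask, keeping only the rows with valid dimensions.
toy_clean = toy[~has_zero_dimension]
print(f"{len(toy)} rows before, {len(toy_clean)} rows after")
```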

So far, we haven't really looked at any distributions. Let's explore the distribution of a column, say *carat*, and how it relates to the target.

**Task**: Build a function `explore_column` with the following parameters:
* *df*: the dataframe containing the columns of interest
* *column*: the name of the column you want to inspect
* *target_column*: the name of the target column

The function should display, side by side, a histogram of the column and a scatter plot showing its relationship with the target.
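
A possible skeleton for such a function, using plain matplotlib. The extra `bins` parameter and the returned figure are assumptions added for convenience; the task only asks for the three parameters listed above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def explore_column(df, column, target_column, bins=30):
    """Plot a histogram of `column` next to a scatter plot of `column` vs `target_column`."""
    fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(14, 5))

    # Left panel: distribution of the column on its own.
    ax_hist.hist(df[column], bins=bins)
    ax_hist.set_xlabel(column)
    ax_hist.set_ylabel("Count")
    ax_hist.set_title(f"Distribution of {column}")

    # Right panel: relationship between the column and the target.
    ax_scatter.scatter(df[column], df[target_column], alpha=0.3, s=10)
    ax_scatter.set_xlabel(column)
    ax_scatter.set_ylabel(target_column)
    ax_scatter.set_title(f"{target_column} vs {column}")

    fig.tight_layout()
    return fig


# Usage on synthetic data (random values, illustration only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"carat": rng.uniform(0.2, 2.0, 300)})
toy["price"] = 4000 * toy["carat"] + rng.normal(0, 200, 300)
fig = explore_column(toy, "carat", "price")
```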

**Task**: Use the previous function on the *y* column.

Looks like we have some outliers! These two points could seriously mess with our predictions because they don't follow the trend. To make this clearer, here's a small toy example illustrating the effect.

In [None]:
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="whitegrid")
from sklearn.linear_model import LinearRegression
%matplotlib inline

size = 100
sigma = 0.25
function_real = lambda x: x + 1  # ground-truth relationship

# Noisy samples around the real function.
# The `_toy` suffix avoids overwriting the notebook's target vector `y`.
x_toy = np.random.normal(size=size)
y_toy = function_real(x_toy) + sigma * np.random.normal(size=size)

# Turn one random point into an outlier by shifting its feature value.
offset = 3
random_index = np.random.randint(0, size)
x_toy[random_index] = x_toy[random_index] + offset

x_range = np.array([np.min(x_toy), np.max(x_toy)])

# Fit a linear regression on the contaminated data.
model = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)
function_learned = lambda x: model.coef_[0] * x + model.intercept_

plt.figure(figsize=(14, 6))
plt.plot(x_range, function_real(x_range), ls='--', alpha=0.8, color=sns.color_palette()[0], label="Real function to learn")
plt.plot(x_range, function_learned(x_range), alpha=0.8, color=sns.color_palette()[2], label="Function learned")
plt.scatter(x_toy, y_toy, alpha=0.8)
plt.scatter(x_toy[random_index], y_toy[random_index], color=sns.color_palette()[2], label="Outlier")

plt.title("Toy regression with outlier setup")
plt.ylabel("Target")
plt.xlabel("Feature")
plt.legend()
plt.show()

**Task**: Remove the outliers from the *y* column, then explore the other columns as needed. Make sure to print the number of observations before and after the cleanup.
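
One simple approach, assumed here, is to read a cutoff off the scatter plot and filter on it. The dataframe and the cutoff `Y_MAX` below are hypothetical; pick your own threshold from what you see in the real data.

```python
import pandas as pd

# Made-up observations with one obviously aberrant width.
toy = pd.DataFrame({
    "y": [4.2, 5.0, 5.3, 6.1, 40.0],
    "price": [400, 1500, 2800, 3900, 2100],
})

# Hypothetical cutoff chosen by eye for this toy data.
Y_MAX = 20

# Keep only the rows whose width is below the cutoff.
toy_clean = toy[toy["y"] < Y_MAX]
print(f"{len(toy)} rows before, {len(toy_clean)} rows after")
```

A hand-picked threshold is easy to justify when the outliers are few and far from the rest. For messier columns, a rule based on quantiles may be easier to defend.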

Next, we need to handle the categorical variables. A good way to explore them is with a violin plot.

In [None]:
def make_violin_plot(column, figsize=(12, 6)):
    plt.figure(figsize=figsize)
    sns.violinplot(data=df, x="price", y=column, inner=None)
    plt.title("Distribution of price as a function of %s" % column.capitalize())
    plt.show()

make_violin_plot("clarity")

After carefully exploring all the categorical variables, it's time to transform them into a format our model can use. One-Hot Encoding is a good choice here.

**Task**: Use the [`pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function to perform One-Hot Encoding. Don't forget to check the number of columns before and after the transformation.
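
To see what the transformation does, here is a sketch on a made-up dataframe with two categorical columns. Each category becomes its own indicator column, named `<column>_<category>`.

```python
import pandas as pd

# Made-up dataframe: one numerical column and two categorical ones.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9],
    "cut": ["Ideal", "Good", "Ideal", "Premium"],
    "color": ["E", "J", "E", "D"],
})

# `columns=` lists the categorical columns to encode; the others are left as-is.
toy_encoded = pd.get_dummies(toy, columns=["cut", "color"])

print(toy.shape[1], "columns before ->", toy_encoded.shape[1], "columns after")
print(toy_encoded.columns.tolist())
```

Here 3 columns become 7: `carat`, plus one indicator for each of the 3 cuts and 3 colors present in the data.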

How useful were all these preprocessing steps for our task? Let's find out by measuring performance again.

**Task**: Split the data into training and test sets, then train the model and evaluate its performance on the test set. Compare the results to your previous predictions and provide some commentary.

So far, we've only measured performance *once*. Now, we're going to use cross-validation with the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

Since we want to use RMSE and MAE with `cross_val_score`, we'll need to wrap them using [`make_scorer`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).

In [None]:
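from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, root_mean_squared_error
from sklearn.model_selection import cross_val_score

model = LinearRegression()
# make_scorer turns a metric function into a scorer usable by cross_val_score;
# cv=5 trains and evaluates the model on 5 different train/validation folds.
scores = cross_val_score(model, X, y, scoring=make_scorer(root_mean_squared_error), cv=5)
mean_scores = np.mean(scores)
std_scores = np.std(scores)
name = root_mean_squared_error.__name__
print(f"{name}: {mean_scores:.4f} (+/- {std_scores:.4f})")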

We want to understand how each part of our preprocessing affects prediction quality. To do this, we'll measure performance at each key step.

**Task**: Define a function named `train_predict` with the following parameters:
* *X*: feature matrix
* *y*: target vector
* *metric*: a performance metric
* *cv*: number of folds for cross-validation

The function should generalize the code from the previous cell so it can be reused for different preprocessing steps.
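
A possible starting point is sketched below on synthetic data. Returning the mean and standard deviation (in addition to printing them) and the default `cv=5` are assumptions, added so the results can be collected and compared later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score


def train_predict(X, y, metric, cv=5):
    """Cross-validate a linear regression and return (mean, std) of `metric` over the folds."""
    scores = cross_val_score(
        LinearRegression(), X, y, scoring=make_scorer(metric), cv=cv
    )
    mean_score, std_score = np.mean(scores), np.std(scores)
    print(f"{metric.__name__}: {mean_score:.4f} (+/- {std_score:.4f})")
    return mean_score, std_score


# Usage on synthetic data (random values, illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.2, 2.0, size=(200, 1))
y_toy = 4000 * X_toy[:, 0] + rng.normal(0, 200, size=200)

mean_mae, std_mae = train_predict(X_toy, y_toy, mean_absolute_error, cv=5)
```

Passing the metric as a function keeps the helper generic: any callable with the signature `metric(y_true, y_pred)` can be plugged in.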

**Task**: Write a cell that measures performance at each key step of the notebook.

## Going further

Exploring your model's predictions—especially the errors—is a crucial part of improving it. This helps you understand how the model reacts to different inputs and can guide you toward better preprocessing, feature engineering, or even model improvements.

**Task**: Inspect the errors of the model and reflect on what they reveal.
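
One way to start, sketched on synthetic data, is to put actual values, predictions and residuals side by side, then look at the largest errors first. The quadratic relationship below is an assumption chosen so that the linear model's errors show a visible pattern.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with a non-linear relationship that a linear model cannot capture.
X_toy = pd.DataFrame({"carat": rng.uniform(0.2, 2.0, 300)})
y_toy = 2000 * X_toy["carat"] ** 2 + rng.normal(0, 100, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# One row per test observation: features, truth, prediction and error.
errors = X_test.copy()
errors["actual"] = y_test
errors["predicted"] = model.predict(X_test)
errors["residual"] = errors["actual"] - errors["predicted"]
errors["abs_error"] = errors["residual"].abs()

# The observations the model gets most wrong.
worst = errors.nlargest(5, "abs_error")
print(worst)
```

Plotting `residual` against each feature (or against `predicted`) is a natural next step: a visible trend suggests the model is missing some structure in the data.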