# Developing an ML Model
## Finding data
We are going to predict the quality of a wine from its chemical properties. We will load the data we need from a .csv file hosted in the University of California, Irvine (UCI) repository into a dataframe. It contains 12 columns: the first 11 are physicochemical measurements of the wine and the 12th is its quality score. We will use this data for supervised learning: we will train an ML model to learn the relation between the last column of the dataframe and the 11 others. This way, given the 11 measurements for a wine sample, the model will be able to predict the quality of that sample.

In [1]:
# Pandas will help us to work with dataframes
import pandas as pd
# We get some wine database, because why not
csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
# We need to specify the ';' separator, because read_csv defaults to ','
data = pd.read_csv(csv_url, sep=";")
# Let's take a look at our lovely data
data.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
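To see why the separator matters, here is a minimal sketch that uses only the standard library csv module instead of pandas. The inline text is a simplified assumption about the file layout: it keeps three of the twelve columns and two of the rows shown above.

```python
import csv
import io

# Simplified stand-in for the downloaded file: three columns, two rows from the table above
raw = "fixed acidity;alcohol;quality\n7.4;9.4;5\n11.2;9.8;6\n"

# With the default ',' delimiter, each line is read as one single field
with_commas = list(csv.reader(io.StringIO(raw)))

# With ';' the three columns are separated as intended
with_semicolons = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

print(len(with_commas[0]))            # number of fields found with ','
print(with_semicolons[1]["quality"])  # quality of the second sample, still a string
```

The sep=";" argument of read_csv plays the same role as delimiter=";" here, and pandas additionally converts the values to numbers.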


We need to divide the dataset into two parts. The first will be used to train the model and the second to test it, that is, to check whether the model can predict the quality we already know each sample has.

In [2]:
# The sklearn library can split the data into train/test for us
from sklearn.model_selection import train_test_split
train, test = train_test_split(data)
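Conceptually, the split shuffles the rows and cuts them in two. The sketch below illustrates the idea with the standard library only. The helper split_rows is hypothetical and is not how scikit-learn implements train_test_split. The 25% test share mirrors the default that train_test_split uses when no size is given.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test part, the rest the train part
    return shuffled[n_test:], shuffled[:n_test]

# Twenty dummy sample ids stand in for the rows of the dataframe
samples = list(range(20))
train_part, test_part = split_rows(samples)
print(len(train_part), len(test_part))
```

Because the call in the cell above passes no random_state, each execution produces a different split, so the metrics reported further down will vary slightly from run to run.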

In both the train and test datasets, we need to specify which columns will be used as input for the prediction, and which column contains the desired output.

In [3]:
# The predicted column being quality, we split each set accordingly
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

## Training a Model
We train an ElasticNet model to make the prediction. This model has two parameters, alpha and l1_ratio, which we will initially set to an arbitrary value of 0.5.

In [4]:
# Arbitrary starting values for the two model parameters (0.5 each)
alpha = 0.5
l1_ratio = 0.5
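To make the role of the two parameters concrete, the sketch below computes the regularization term that ElasticNet adds to its squared-error loss, following the objective given in the scikit-learn documentation: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The function name and the example weights are hypothetical and chosen only for illustration.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Regularization term added to the loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)         # L1 norm of the weights
    l2_squared = sum(w * w for w in weights)  # squared L2 norm of the weights
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

# Illustrative weights, not taken from a fitted model
example_weights = [1.0, -2.0]
penalty = elastic_net_penalty(example_weights, alpha=0.5, l1_ratio=0.5)
print(penalty)
```

In short, alpha scales the overall strength of the penalty, while l1_ratio balances the L1 part (which pushes weights to exactly zero) against the L2 part (which shrinks them smoothly).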

We will also define some metrics, to evaluate the quality of our predictions.

In [6]:
# Import some evaluation metrics
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# And group them in one function
def eval_metrics(actual, pred):
    # RMSE: root-mean-square error
    rmse = np.sqrt(mean_squared_error(actual, pred))
    # MAE: mean absolute error
    mae = mean_absolute_error(actual, pred)
    # R²: coefficient of determination
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
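As a sanity check on what these three metrics measure, here is a small worked example that computes them by hand with the standard library. The four actual and predicted values are made up for illustration and do not come from the wine dataset.

```python
import math

# Made-up quality scores and predictions, for illustration only
actual = [5, 6, 5, 7]
predicted = [5.5, 5.5, 5.0, 6.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

# RMSE: square root of the mean squared error
rmse_by_hand = math.sqrt(sum(e * e for e in errors) / n)

# MAE: mean of the absolute errors
mae_by_hand = sum(abs(e) for e in errors) / n

# R2: 1 minus the ratio of residual to total sum of squares
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2_by_hand = 1 - ss_res / ss_tot

print(rmse_by_hand, mae_by_hand, r2_by_hand)
```

RMSE and MAE are expressed in the same unit as the quality score, with RMSE weighting large errors more heavily. R2 is unitless and compares the model against always predicting the mean.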

We can now train our model and see how well it has learned.

In [7]:
# MLflow will do all the magic
import mlflow
# We will use an ElasticNet model to predict wine quality
from sklearn.linear_model import ElasticNet
# Let's run it
with mlflow.start_run():
    # Train the model
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    # Get the predicted qualities of each wine sample
    predicted_qualities = lr.predict(test_x)
    # Compare with the actual qualities to evaluate the predictions
    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
  RMSE: 0.7518259553725545
  MAE: 0.6171724014322878
  R2: 0.13517797994265668
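To put an R2 of about 0.135 in context, it helps to know the two reference points of the scale. The sketch below, using made-up quality scores and a hypothetical helper r2, shows that a model that always predicts the mean of the actual values scores 0, while perfect predictions score 1.

```python
def r2(actual, predicted):
    """Coefficient of determination, computed by hand."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up quality scores, for illustration only
qualities = [5, 5, 6, 7, 5, 6]

# Baseline: always predict the mean of the actual values
baseline = [sum(qualities) / len(qualities)] * len(qualities)
baseline_r2 = r2(qualities, baseline)

# Upper reference point: predictions identical to the actual values
perfect_r2 = r2(qualities, qualities)

print(baseline_r2, perfect_r2)
```

Read this way, the run above does only slightly better than predicting the average quality for every wine, which motivates trying other parameter values.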


## Optimizing Models
Of course, the parameters of our model might not be optimal, so we want to test different combinations of values. To do so, we will use MLflow to log our trained models, their parameters and their performance for later comparison.

In [8]:
# First, we don't want ugly warnings during the demo
import warnings
warnings.filterwarnings('ignore')
# Import the sklearn flavor explicitly so that mlflow.sklearn.log_model is available
import mlflow.sklearn
# This time we are going to test a range of parameters
for alpha in np.arange(0, 1., 0.1):
    for l1_ratio in np.arange(0, 1., 0.1):
        # Same as before
        with mlflow.start_run():
            # Train + predict
            lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
            lr.fit(train_x, train_y)
            predicted_qualities = lr.predict(test_x)
            (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

            # This time we log, we do not print
            mlflow.log_param("alpha", alpha)
            mlflow.log_param("l1_ratio", l1_ratio)
            mlflow.log_metric("rmse", rmse)
            mlflow.log_metric("r2", r2)
            mlflow.log_metric("mae", mae)
    
            # Saving the model(s)
            mlflow.sklearn.log_model(lr, "model")
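It is worth spelling out which combinations the two nested loops visit. The sketch below rebuilds the grid with the standard library. The list comprehension is an assumed stand-in for np.arange(0, 1., 0.1), which starts at 0.0 and excludes the stop value 1.0.

```python
from itertools import product

# Stand-in for np.arange(0, 1., 0.1): 0.0, 0.1, ..., 0.9 (1.0 is excluded)
grid_values = [round(i * 0.1, 1) for i in range(10)]

# Every (alpha, l1_ratio) pair visited by the two nested loops
combinations = list(product(grid_values, grid_values))

print(len(combinations))                  # number of MLflow runs created
print(combinations[0], combinations[-1])  # first and last pair tried
```

Note that the grid includes alpha = 0, which means no regularization at all. The scikit-learn documentation advises against this setting for numerical reasons, which is likely one source of the warnings silenced at the top of the cell. Starting the alpha range at 0.1 would avoid it.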

MLflow lets us easily visualize the performance of our model for different parameter values.

In [None]:
import os
# Start the MLflow tracking UI; this call blocks until the server is stopped
os.system('mlflow ui')