## Defining Performance Metrics for Regression


It is difficult to measure the quality of a given model without quantifying its performance on the training and testing data. This is typically done with a performance metric, whether that is some type of error, a goodness-of-fit measure, or another useful measurement. For this project, you will calculate the coefficient of determination, R², to quantify your model's performance. The coefficient of determination is a useful statistic in regression analysis, as it often describes how "good" a model is at making predictions.

R² typically ranges from 0 to 1 and measures the proportion of the variance in the target variable that the model explains. For a linear regression with an intercept, evaluated on its training data, this equals the squared correlation between the predicted and actual values of the target variable. A model with an R² of 0 does no better than one that always predicts the mean of the target variable, whereas a model with an R² of 1 predicts the target variable perfectly. A value between 0 and 1 indicates what fraction of the variance in the target variable this model explains using the features. R² can also be negative, which indicates that the model performs worse than one that naively predicts the mean of the target variable.
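To make the definition concrete, here is a minimal sketch that computes R² by hand as 1 − SS_res / SS_tot with NumPy. The helper name `r2_by_hand` and the toy arrays are assumptions made up for illustration; they are not part of the project code.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # toy targets, made up for illustration

perfect = r2_by_hand(y_true, y_true)                                 # exact predictions
baseline = r2_by_hand(y_true, np.full_like(y_true, y_true.mean()))   # always predict the mean
reversed_ = r2_by_hand(y_true, y_true[::-1])                         # worse than predicting the mean

print(perfect, baseline, reversed_)  # 1.0 0.0 -3.0
```

The three cases line up with the three regimes: exact predictions give 1, the mean-only baseline gives 0, and predictions that are worse than the baseline give a negative score.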

For the `performance_metric` function in the code cell below, you will need to implement the following:

- Use `r2_score` from `sklearn.metrics` to perform a performance calculation between `y_test` and `y_pred`.
- Assign the performance score to the `score` variable.
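One possible shape for such a function is sketched below. Only the names `performance_metric`, `r2_score`, `y_test`, `y_pred`, and `score` come from the text above; the function signature, the return value, and the toy numbers are assumptions for illustration.

```python
from sklearn.metrics import r2_score

def performance_metric(y_test, y_pred):
    """Return the R² score between true and predicted target values."""
    score = r2_score(y_test, y_pred)
    return score

# Toy values, made up for illustration only
example_score = performance_metric([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(f"R2 score: {example_score:.3f}")  # R2 score: 0.949
```

Note that `r2_score` takes the true values first and the predictions second; swapping them generally changes the result.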

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- MAE is the easiest to understand, because it is the average absolute error.
- MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.
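The three formulas can be checked by hand with a few lines of NumPy. This is a small sketch on made-up numbers (the arrays below are assumptions, not the Startup data), chosen so that one prediction misses by a large amount.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # toy targets (made up)
y_pred = np.array([12.0, 18.0, 30.0, 50.0])  # toy predictions with one large miss

errors = y_true - y_pred        # [-2, 2, 0, -10]
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same units as y

print(mae, mse, rmse)
```

Here MAE is 3.5 and MSE is 27.0, so RMSE is about 5.2. The single error of 10 contributes 100 of the 108 total squared error, which is why RMSE ends up noticeably above MAE.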

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

path = "https://frenzy86.s3.eu-west-2.amazonaws.com/python/data/Startup.csv"
df = pd.read_csv(path)
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit
0,165349.2,136897.8,471784.1,192261.83
1,162597.7,151377.59,443898.53,191792.06
2,153441.51,101145.55,407934.54,191050.39
3,144372.41,118671.85,383199.62,182901.99
4,142107.34,91391.77,366168.42,166187.94
5,131876.9,99814.71,362861.36,156991.12
6,134615.46,147198.87,127716.82,156122.51
7,130298.13,145530.06,323876.68,155752.6
8,120542.52,148718.95,311613.29,152211.77
9,123334.88,108679.17,304981.62,149759.96


In [2]:
## 1 - Declare Features and target
X = df.drop(columns='Profit')
y = df['Profit']

In [3]:
## 2 - Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=667
)

In [4]:
## 3 - Create and train (fit) the model on the training set
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [5]:
## 4 - Generate predictions on the test set
y_pred = model.predict(X_test) #on Test set

In [11]:
## 5 - Measure the model's error
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

r2score = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of the MSE; recent scikit-learn releases no longer accept squared=False

print('R2_score: ', r2score)
print('MAE: ', mae)
print('MSE: ', mse)
print('RMSE: ', rmse)

R2_score:  0.9441590602423613
MAE:  6863.327578772785
MSE:  81932298.45331672
RMSE:  9051.64617367011


In [None]:
model.predict([[324,34,56]])[0]

46990.02802376018

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/python/savemodel.png" width=600>

## JOBLIB
<img src='https://frenzy86.s3.eu-west-2.amazonaws.com/python/joblib.png' width=2000>

In [7]:
import joblib

## to save a model
joblib.dump(model,'regression_test.pkl')

['regression_test.pkl']

In [None]:
## to load model
newmodel = joblib.load('regression_test.pkl')
newmodel

In [None]:
newmodel.predict([[324,34,56]])[0]

46990.02802376018
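The save/load round trip can also be verified programmatically rather than by comparing a single printed number. The sketch below is self-contained and uses a tiny synthetic dataset, a temporary directory, and the file name `demo_model.pkl`; all of these are assumptions for illustration and are separate from the `regression_test.pkl` file above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset (made up): y = 2*x0 + 3*x1 + 1
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = 2 * X_demo[:, 0] + 3 * X_demo[:, 1] + 1

demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(demo_model, demo_path)   # serialize the fitted estimator
    restored = joblib.load(demo_path)    # read it back from disk

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

A pickled model is generally only safe to load with the same library versions that saved it, and only from a source you trust, since loading a pickle can execute arbitrary code.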

## MLEM
<img src='https://frenzy86.s3.eu-west-2.amazonaws.com/python/mlem.png' width=600>

In [8]:
!pip install mlem -q


In [9]:
import mlem

In [10]:
#Save model

mlem.api.save(model,
              'model_', # model_.mlem
              sample_data = X_train #features
              )

MlemModel(location=Location(path='/content/model_.mlem', project=None, rev=None, uri='file:///content/model_.mlem', project_uri=None, fs=<fsspec.implementations.local.LocalFileSystem object at 0x7f6f0833fd30>), params={}, artifacts={'data': LocalArtifact(uri='model_', size=585, hash='c0dcc13560c49b97e561a2c2a4aad458')}, requirements=Requirements(__root__=[InstallableRequirement(module='sklearn', version='1.2.2', package_name='scikit-learn', extra_index=None, source_url=None, vcs=None, vcs_commit=None), InstallableRequirement(module='numpy', version='1.22.4', package_name=None, extra_index=None, source_url=None, vcs=None, vcs_commit=None), InstallableRequirement(module='pandas', version='1.5.3', package_name=None, extra_index=None, source_url=None, vcs=None, vcs_commit=None)]), processors_cache={'model': SklearnModel(model=LinearRegression(), io=SimplePickleIO(), methods={'predict': Signature(name='predict', args=[Argument(name='X', type_=DataFrameType(value=None, columns=['', 'R&D Spen

In [None]:
!cat model_.mlem

artifacts:
  data:
    hash: 6c8831c451bc8a2245fab14fb7b13c5f
    size: 585
    uri: model_
call_orders:
  predict:
  - - model
    - predict
object_type: model
processors:
  model:
    methods:
      predict:
        args:
        - name: X
          type_:
            columns:
            - ''
            - R&D Spend
            - Administration
            - Marketing Spend
            dtypes:
            - int64
            - float64
            - float64
            - float64
            index_cols:
            - ''
            type: dataframe
        name: predict
        returns:
          dtype: float64
          shape:
          - null
          type: ndarray
    type: sklearn
requirements:
- module: sklearn
  package_name: scikit-learn
  version: 1.2.2
- module: numpy
  version: 1.22.4
- module: pandas
  version: 1.5.3


In [None]:
## Load Model

new_model = mlem.api.load('model_.mlem')

new_model.predict(X_test)

array([113783.74732396, 161047.15549438,  95980.28965584, 164871.12705721,
        44330.77114766,  67888.27214267, 191528.21340132, 112251.03459869,
        73092.82547184, 116557.72221927])

In [None]:
new_model.predict([[1,1,1]])[0]

46723.24017641536