# Homework 2 - Linear Regression
For homework 2, you will apply linear regression to real energy datasets. This assignment is designed to provide hands-on practice.

MAKE YOUR OWN COPY OF THIS FILE BEFORE YOU START. 

Complete each task and submit your Jupyter notebook on Blackboard.

# Sections
- Linear Regression
- Train-Test Split
- Error Metric
- K-Fold Cross-Validation
- Real-World Applications
  - Energy
  - Ocean (MS Student Only)

## To-Do Lists
Look out for sections marked "# IMPLEMENT" and "# QUESTION"
- Undergrads: 5 Implement Blocks + 1 Question Block - 6 Points Total
- Masters: 7 Implement Blocks + 2 Question Blocks - 9 Points Total

Partial credit will be given.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

## [1] Linear Regression

In [None]:
# generate 1000 random points to use for example and your own testing
num_points = 1000

num_features = 5

# set random seed for reproducibility
# (hash() of a string is salted per Python process, so a fixed integer is used instead)
np.random.seed(461)

# create random data from random function (w*X)
true_weights = np.random.randn(num_features)+([3]+[0]*(num_features-1))

# randomly generate data according to normal distribution
X_full = np.random.randn(num_points, num_features)

# create random y from features as linear weighted sum, with some additional noise
y_full = X_full.dot(true_weights) + np.random.randn(num_points)

In [None]:
# print out the true weights
true_weights

In [None]:
# scatter plot - the first dimension with the target y 
# (visually examine correlation between feature and target)
plt.scatter(X_full[:,0], y_full)
plt.xlabel("Feature 0",size=14)
plt.ylabel("Target Y",size=14)
plt.show()

In [None]:
# example of fitting a linear regression model (using the one from sklearn)
# initialize model
example_model = linear_model.LinearRegression()

# fit model on some data (first 20 items)
example_model.fit(X_full[:20], y_full[:20])

# make predictions on other data
y_pred = example_model.predict(X_full)

# evaluate results
plt.scatter(y_pred, y_full)
plt.xlabel("Predicted Y",size=14)
plt.ylabel("True Y",size=14)
plt.show()
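
For intuition, the fit above can be reproduced without sklearn. The sketch below is only an illustration under stated assumptions: the names `rng`, `X_demo`, `w_demo`, and `y_demo` are hypothetical and not part of the assignment. It solves the ordinary least-squares problem with `np.linalg.lstsq` on synthetic data and prints the recovered weights, which should land close to the ones used to generate the targets.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
X_demo = rng.normal(size=(500, 3))        # 500 points, 3 features
w_demo = np.array([3.0, -1.0, 0.5])       # weights used to generate the targets
y_demo = X_demo @ w_demo + 0.1 * rng.normal(size=500)

# append a column of ones so the intercept is estimated as well
X_aug = np.hstack([X_demo, np.ones((500, 1))])
w_hat, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print("estimated weights:", w_hat[:-1])
print("estimated intercept:", w_hat[-1])
```

With more noise or fewer points, the estimates drift further from `w_demo`, which is the effect the 20-point fit above shows.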

## [2] Train-Test Split

In [None]:
def random_split(p: float, X: np.array, y: np.array):
    
    """ 
    
    Given a numpy feature matrix X of size NxM, where N is the number of points and M is 
    the number of features, and a target variable y of size Nx1, randomly split X and y
    by P, where P is the proportion of points in the training set and 1-P is the
    proportion of points in the test set.

    Try to use np.random.shuffle with the indices

    Output should be a list of numpy arrays [X_train, X_test, y_train, y_test]
    
    In this order,
    X_train of size N*P x M
    X_test of size N*(1-P) x M
    y_train of size N*P x 1
    y_test of size N*(1-P) x 1
    
    """
    
    # -------------------------------------------------------------------------
    # IMPLEMENT - 1 Point
    # -------------------------------------------------------------------------
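    # shuffle one index array and apply it to both X and y so that every row
    # stays paired with its target (shuffling X and y separately breaks the pairing)
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)

    train_size = round(X.shape[0] * p)
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    partitioned_arrs = [X_train, X_test, y_train, y_test]
    return partitioned_arrs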
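
The docstring's hint about shuffling indices matters because each row of X must stay paired with its own target. The sketch below is one possible illustration, not the required solution: `split_by_indices` and the toy arrays are hypothetical names. It uses a single permutation for both arrays and leaves the inputs unmodified.

```python
import numpy as np

def split_by_indices(p, X, y, seed=0):
    """Hypothetical helper: split X and y with one shared permutation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])      # one shuffled index array for both X and y
    n_train = int(round(X.shape[0] * p))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# toy data where each target equals its row's first feature, so pairing is easy to check
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = X_toy[:, 0].copy()
X_a, X_b, y_a, y_b = split_by_indices(0.8, X_toy, y_toy)
print(X_a.shape, X_b.shape, y_a.shape, y_b.shape)
```

Because the toy target is a copy of the first feature column, comparing `X_a[:, 0]` with `y_a` is a quick way to confirm that rows and targets were shuffled together.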

In [None]:
# example usage
print("X shape:", X_full.shape)
print("y shape:", y_full.shape)

# Here we use the function you implemented
train_p = 0.8
X_tr, X_te, y_tr, y_te = random_split(train_p, X_full, y_full)
print("shapes of output:", X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

In [None]:
# fit model with train-test split to get y_pred on the test set
model = linear_model.LinearRegression()

# fit with train data and labels
model.fit(X_tr, y_tr)
# apply to test data
y_pred = model.predict(X_te)

# evaluate results on test labels
plt.scatter(y_te, y_pred)
plt.xlabel("True Y")
plt.ylabel("Predicted Y")
plt.show()

## [3] Error Metric

In [None]:
def mean_squared_error(y_true: np.array, y_pred: np.array):
    
    """ 
    
    Computes mean squared error between input vectors y_true and y_pred. Both inputs
    are numpy arrays of size Nx1 where N is the number of points.

    Output should be a numpy float
    
    You should not be using sklearn's implementation here
    
    """

    # -------------------------------------------------------------------------
    # IMPLEMENT - 1 Point
    # -------------------------------------------------------------------------
    diffArr = np.subtract(y_true, y_pred)
    mse = (np.square(diffArr)).mean()
    return mse

In [None]:
from sklearn import metrics

# let us look at three different metrics: 
# your Mean Square Error, sklearn Mean Absolute Error, and R2
mse = mean_squared_error(y_te, y_pred)
mae = metrics.mean_absolute_error(y_te, y_pred)
rsq = metrics.r2_score(y_te, y_pred)
print(mse)
print(mae)
print(rsq)
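
To make the three metrics concrete, here is a small worked sketch in plain numpy. The arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration only; the formulas are the standard definitions of MSE, MAE, and R2.

```python
import numpy as np

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true_demo - y_pred_demo
mse_demo = np.mean(residuals ** 2)                          # mean squared error
mae_demo = np.mean(np.abs(residuals))                       # mean absolute error
ss_res = np.sum(residuals ** 2)                             # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot                             # coefficient of determination

print(mse_demo, mae_demo, r2_demo)
```

MSE and MAE are errors (lower is better), while R2 is a score (higher is better, with 1.0 meaning a perfect fit).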

## [4] K-Fold Cross-Validation

In [None]:
def cross_validate(model, X: np.array, y: np.array, k: int, metrics=[mean_squared_error]):
    
    """ 
    
    Given a model with feature matrix X of size NxM and target variable y
    of size Nx1 where N is the number of points and M is the number of features.

    Partition X and y into K partitions. Use K-fold cross-validation to train and 
    evaluate the model given a list of error metrics (either your own functions or existing ones). 

    metrics specifies a list of error metric functions (one or more of them), of length E.
    The default value for 'metrics' is to use the function mean_squared_error
    
    Remember: 
    train the model with the training set and evaluate the model on the test set.
    
    Output should be a list of size ExK where E is the number of error metrics
    and K is the number of partitions.

    Output Example:
    cross_validate = [[mse_cross_val_1, mse_cross_val_2, ..., mse_cross_val_k],
                     [other_metric_1,...],[other_metric_2,...]]
    
    """

    # -------------------------------------------------------------------------
    # IMPLEMENT - 1 Point
    # -------------------------------------------------------------------------
    # shuffle the indices once, then cut them into k (nearly) equal, disjoint folds
    indices = np.random.permutation(X.shape[0])
    folds = np.array_split(indices, k)
    cross_val_metrics = np.zeros((len(metrics), k))

    for fold_idx, test_idx in enumerate(folds):
        # every fold except the current one forms the training set
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != fold_idx])
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        # evaluate every metric on the held-out fold
        for e, metric in enumerate(metrics):
            cross_val_metrics[e, fold_idx] = metric(y[test_idx], y_pred)

    return cross_val_metrics
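
The defining property of K-fold cross-validation is that the K test sets are disjoint and together cover every point exactly once. The sketch below illustrates just the partitioning step; `kfold_indices` is a hypothetical helper name and the sizes are arbitrary.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Hypothetical helper: yield (train_idx, test_idx) for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # k nearly equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(kfold_indices(n=10, k=3))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Repeatedly calling a random train-test split K times is a different procedure: the test sets may overlap, and some points may never be tested.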

In [None]:
model = linear_model.LinearRegression()

# let us try your k-fold cross validation with three different error metrics
k = 3
cross_val_metrics = cross_validate(model, X_full, y_full, k, metrics=[mean_squared_error, metrics.mean_absolute_error, metrics.r2_score])

df_lr = pd.DataFrame({"Model":["Linear","Linear","Linear"],
                      "Error":["MSE", "MAE", "R2"],
                      "Mean":np.mean(cross_val_metrics, axis=1),
                      "Std":np.std(cross_val_metrics, axis=1)})

df_lr = df_lr.sort_values("Error").reset_index(drop=True)

df_lr

# note if you haven't yet implemented mean_squared_error, you can add any of the
# sklearn metrics from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

In [None]:
'''

Run 5-fold cross-validation, Ridge regression model (import it from sklearn) on the data
  - Show average error over the K folds for mean squared error, mean absolute error, r2 score ( a list of 3 values)
  - Show standard deviation over the K folds for the same 3 metrics (a list of 3 values)
  - Put these statistics in a pandas dataframe

'''

# -------------------------------------------------------------------------
# IMPLEMENT - 1 Point
# -------------------------------------------------------------------------
from sklearn.linear_model import Ridge
from sklearn import metrics

# use a separate name so the sklearn `metrics` module is not overwritten
metric_fns = [mean_squared_error, metrics.mean_absolute_error, metrics.r2_score]
# cross_validate fits the model on each training fold, so no fit on the full data is needed
model = Ridge()
k = 5

cvMetrics = cross_validate(model, X_full, y_full, k, metric_fns)
meansArr = np.mean(cvMetrics, axis=1)
sdArr = np.std(cvMetrics, axis=1)

pd.DataFrame({"Model":["Ridge","Ridge","Ridge"],
              "Error":["MSE", "MAE", "R2"],
              "Mean":meansArr,
              "Std":sdArr})

Unnamed: 0,Model,Error,Mean,Std
0,Ridge,MSE,0,0
1,Ridge,MAE,0,0
2,Ridge,R2,0,0


## [5] Real-World Application - Energy

The Commercial Buildings Energy Consumption Survey (CBECS) provides building characteristics information for the estimated 5.9 million U.S. commercial buildings in 2012. Building characteristics data tables include number of workers, ownership and occupancy, structural characteristics, energy sources and uses, energy-related building features, and more. For more information, see
- https://asu.pure.elsevier.com/en/publications/machine-learning-approaches-for-estimating-commercial-building-en


## Import Libraries

In [None]:
import pandas as pd    # pandas (for reading and handling data in dataframes)

import numpy as np     # matrix/linear algebra library

from scipy import stats # prob/stats library to get distribution information

import os              # platform independent filesystem manipulations

from collections import Counter

from sklearn import model_selection, kernel_ridge, linear_model, metrics, feature_selection, preprocessing

# plotting libraries
import matplotlib
import matplotlib.pyplot as plt
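# newer matplotlib releases name this style 'seaborn-v0_8-paper'; fall back to the old name otherwise
style_name = 'seaborn-v0_8-paper' if 'seaborn-v0_8-paper' in plt.style.available else 'seaborn-paper'
plt.style.use(style_name)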

np.random.seed(6174)

## Download Data

In [None]:
# download data into data/cbecs in the current directory
# run !ls data/cbecs to see the csv it downloaded
!wget --directory-prefix=data/cbecs/ -Nq https://www.eia.gov/consumption/commercial/data/2012/xls/2012_public_use_data_aug2016.csv

In [None]:
# declare directory to write results and file to read data from
CBECS_DATA_FN = "data/cbecs/2012_public_use_data_aug2016.csv"

In [None]:
def mkdir(path):
    try: 
        os.makedirs(path)
    except OSError:
        if not os.path.isdir(path):
            raise
        else:
            print("(%s) already exists" % (path))

mkdir("output/")
mkdir("output/trainedModels/")

## Parse and Format Data

### Read dataframe from downloaded file
This pandas code automatically parses the CSV. Try it out with datasets containing datetimes; it is pretty handy.


In [None]:
raw_df = pd.read_csv(CBECS_DATA_FN)

In [None]:
raw_df.describe()

### Select relevant columns
Let's get the relevant feature and target columns from the data and store them into a list

In [None]:
xIndices = []
yIndices = []

target_col = "MFBTU"
feature_cols = []

for col_ind, col_name in enumerate(raw_df.columns):
    if col_name == "MFBTU":
        yIndices.append(col_ind)
    elif "ZMFBTU" in col_name:
        xIndices.append(col_ind)
        feature_cols.append(col_name)
    elif col_name.startswith("Z"): # checked, every feature that starts with a Z is an accessory variable that tells whether another variable was imputed, etc.
        pass
    elif col_name.startswith("FINALWT"): # there is a FINALWT feature for every other variable; these are unnecessary
        pass
    elif col_ind >= 1051: # don't keep data after MFBTU
        pass
    else:
        xIndices.append(col_ind)
        feature_cols.append(col_name)
raw_df = raw_df[[target_col] + feature_cols]
print(raw_df.shape)

In [None]:
col_to_check = "ZMFBTU"
useful_rows = (raw_df[col_to_check] != 2) & (raw_df[col_to_check] != 9)
raw_df = raw_df[useful_rows]

In [None]:
# the NFLOOR variable takes some very high coded values (994 and 995)
# let's set these to 20 and 30 floors respectively
# alternatively we could consider a quantile scaler
# note that here we are changing the data!
# you can check that the above cell returns a different value now
# use .loc so the assignment modifies raw_df itself; chained indexing
# (raw_df[mask]["NFLOOR"] = ...) would only change a temporary copy
raw_df.loc[raw_df["NFLOOR"] == 994, "NFLOOR"] = 20
raw_df.loc[raw_df["NFLOOR"] == 995, "NFLOOR"] = 30
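
When recoding values in a pandas dataframe, selecting the rows and the column in one `.loc` call is the reliable pattern, because a chained expression such as `df[mask]["col"] = value` assigns to a temporary copy. The sketch below shows the `.loc` form on a toy frame; `demo_df` is a hypothetical stand-in for `raw_df`.

```python
import pandas as pd

# toy frame standing in for raw_df; 994 and 995 are the coded values mentioned above
demo_df = pd.DataFrame({"NFLOOR": [2, 994, 10, 995, 994]})

# a single .loc call selects rows and column together, so the assignment hits demo_df itself
demo_df.loc[demo_df["NFLOOR"] == 994, "NFLOOR"] = 20
demo_df.loc[demo_df["NFLOOR"] == 995, "NFLOOR"] = 30

print(demo_df["NFLOOR"].tolist())
```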

### Eliminate Features with > 25% missing data

In [None]:
raw_df.isna().sum()

In [None]:
bad_col_inds = (raw_df.isna().sum() / raw_df.shape[0]) > 0.25
raw_df = raw_df[raw_df.columns[~bad_col_inds]]

### Replace missing values with most frequently occurring value

In [None]:
# fill each column's missing values with that column's mode
# (run raw_df.mode? to see the help text for mode)
raw_df = raw_df.fillna(raw_df.mode().iloc[0])
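
The two cleaning steps above (dropping columns with more than 25% missing values, then filling the remaining gaps with the most frequent value) can be checked on a tiny example. This is a sketch with a made-up frame named `demo`; the column names and values are assumptions chosen so the outcome is easy to verify by eye.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, 2.0],        # 25% missing -> kept (the threshold is a strict >)
    "b": [np.nan, np.nan, 5.0, np.nan],  # 75% missing -> dropped
    "c": [7.0, 8.0, 8.0, 9.0],           # complete
})

missing_frac = demo.isna().mean()             # fraction of NaN per column
kept = demo.loc[:, missing_frac <= 0.25]      # drop columns above the threshold
filled = kept.fillna(kept.mode().iloc[0])     # fill NaN with each column's mode

print(filled)
```

`mode()` can return several rows when there are ties, which is why `.iloc[0]` picks the first mode of each column.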

### Read data values into matrices X, and Y

In [None]:
# take X and Y from the useful rows in the corresponding columns
X = raw_df[raw_df.columns[raw_df.columns != target_col]].values
Y = raw_df[target_col].values
# let's print out some summary info of X and y
print("features shape:", X.shape)
print("target shape:", Y.shape)

### Save pre-processed dataset

In [None]:
# print(X.shape)
# print(Y.shape)
feature_columns = np.array(raw_df.columns[raw_df.columns != target_col])
# print(feature_columns.shape)

np.save("output/cbecs_X_MFBTU.npy", X)
np.save("output/cbecs_Y_MFBTU.npy", Y)
np.save("output/cbecs_headers_MFBTU.npy", feature_columns)

numberOfSamples = X.shape[0]
numberOfFeatures = X.shape[1]

### Create reduced feature set

In [None]:
'''

"PBA": Principal building activity
"SQFT": Square footage
"CDD65": Cooling degree days (base 65)
"HDD65": Heating degree days (base 65)
"NFLOOR": Number of floors

'''

# features to keep
columnsToKeep = np.array(["PBA", "SQFT", "CDD65", "HDD65", "NFLOOR"])

newX = raw_df[columnsToKeep]
newY = raw_df[target_col]

# print(newX.shape)
# print(newY.shape)
# print(columnsToKeep.shape)

np.save("output/cbecs_reduced_X_MFBTU.npy", newX)
np.save("output/cbecs_reduced_Y_MFBTU.npy", newY)
np.save("output/cbecs_reduced_headers_MFBTU.npy", columnsToKeep)

## Model Fitting

### Helper Functions

In [None]:

pbaLabels = {
    1  : 'Vacant',
    2  : 'Administrative/professional office',
    3  : 'Bank/other financial',
    4  : 'Government office',
    5  : 'Medical office (non-diagnostic)',
    6  : 'Mixed-use office',
    7  : 'Other office',
    8  : 'Laboratory',
    9  : 'Distribution/shipping center',
    10 : 'Non-refrigerated warehouse',
    11 : 'Self-storage',
    12 : 'Convenience store',
    13 : 'Convenience store with gas station',
    14 : 'Grocery store/food market',
    15 : 'Other food sales',
    16 : 'Fire station/police station',
    17 : 'Other public order and safety',
    18 : 'Medical office (diagnostic)',
    19 : 'Clinic/other outpatient health',
    20 : 'Refrigerated warehouse',
    21 : 'Religious worship',
    22 : 'Entertainment/culture',
    23 : 'Library',
    24 : 'Recreation',
    25 : 'Social/meeting',
    26 : 'Other public assembly',
    27 : 'College/university',
    28 : 'Elementary/middle school',
    29 : 'High school',
    30 : 'Preschool/daycare',
    31 : 'Other classroom education',
    32 : 'Fast food',
    33 : 'Restaurant/cafeteria',
    34 : 'Other food service',
    35 : 'Hospital/inpatient health',
    36 : 'Nursing home/assisted living',
    37 : 'Dormitory/fraternity/sorority',
    38 : 'Hotel',
    39 : 'Motel or inn',
    40 : 'Other lodging',
    41 : 'Vehicle dealership/showroom',
    42 : 'Retail store',
    43 : 'Other retail',
    44 : 'Post office/postal center',
    45 : 'Repair shop',
    46 : 'Vehicle service/repair shop',
    47 : 'Vehicle storage/maintenance',
    48 : 'Other service',
    49 : 'Other',
    50 : 'Strip shopping mall',
    51 : 'Enclosed mall',
    52 : 'Courthouse/probation office',
    53 : 'Bar/pub/lounge',
    91 : 'Other'
}

def to_categorical(y, nb_classes=None):

    '''
    
    Convert class vector (integers from 0 to nb_classes) to binary class matrix, for use with categorical_crossentropy.

    # Arguments
        y: class vector to be converted into a matrix
        nb_classes: total number of classes

    # Returns
        A binary matrix representation of the input.
    
    '''
    y = np.array(y, dtype='int')
    if not nb_classes:
        nb_classes = np.max(y)+1
    Y = np.zeros((len(y), nb_classes))
    for i in range(len(y)):
        Y[i, y[i]] = 1.
    return Y

def doOneHot(classVals,uniqueVals=None,returnNames=False):

    oneHotClasses = classVals.copy()

    if uniqueVals is None:
        uniqueVals = sorted(list(set(classVals)))

    uniqueValsMap = {val:i for i,val in enumerate(uniqueVals)}
    for i in range(oneHotClasses.shape[0]):
        oneHotClasses[i] = uniqueValsMap[oneHotClasses[i]]
    oneHotClasses = to_categorical(oneHotClasses)
    
    if returnNames:
        return oneHotClasses, uniqueVals
    else:
        return oneHotClasses

def getDataset(datasetType=0,pbaOneHot=True):
    X,Y,columnNames = None,None,None

    if datasetType == 0: # all features
        X = np.load("output/cbecs_X_MFBTU.npy")
        Y = np.load("output/cbecs_Y_MFBTU.npy")
        columnNames = np.load("output/cbecs_headers_MFBTU.npy", allow_pickle=True)
    elif datasetType == 1:
        X = np.load("output/cbecs_reduced_X_MFBTU.npy")
        Y = np.load("output/cbecs_reduced_Y_MFBTU.npy")
        columnNames = np.load("output/cbecs_reduced_headers_MFBTU.npy")
    else:
        raise ValueError("Invalid datasetType")

    classVals = X[:,columnNames=="PBA"].copy().flatten()

    excludedColumnNames = ["PUBID","PBA","PBAPLUS","REGION","CENDIV"]
    excludedMask = (columnNames!=excludedColumnNames[0])
    for i in range(1,len(excludedColumnNames)):
        excludedMask = excludedMask & (columnNames!=excludedColumnNames[i])

    X = X[:,excludedMask]
    columnNames = columnNames[excludedMask]

    # do a 1-hot encoding of the PBA column and add the features to X
    if pbaOneHot:
        oneHotClasses,uniqueVals = doOneHot(classVals.copy(),returnNames=True)
        X = np.hstack([X,oneHotClasses])

        oneHotNames = []
        for val in uniqueVals:
            oneHotNames.append("PBA %s" % (pbaLabels[val]))
        columnNames = np.hstack([columnNames,oneHotNames])
    scaler = preprocessing.StandardScaler()
    X = scaler.fit_transform(X)

    return X,Y,columnNames,classVals
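
The one-hot step in the helpers above turns each categorical code (such as a PBA value) into a row with a single 1. The sketch below shows the same idea in a compact numpy form; it is an alternative illustration, not the notebook's own helper, and `codes` holds made-up example values.

```python
import numpy as np

codes = np.array([2, 14, 2, 38, 14])                  # example PBA codes
unique_codes, positions = np.unique(codes, return_inverse=True)
one_hot = np.eye(len(unique_codes))[positions]        # row i has a 1 in the column of codes[i]

print(unique_codes)
print(one_hot)
```

Mapping codes to consecutive positions first keeps the matrix narrow: three columns here instead of one column per possible code value.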

# Model prediction on energy data - Ridge, Lasso regression used here!
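
Ridge regression adds an L2 penalty on the weights to the least-squares objective, and Lasso adds an L1 penalty. Ridge has a closed-form solution, which the sketch below uses to show how a larger penalty shrinks the weights. This is an illustration under simplifying assumptions (no intercept term, synthetic data); `ridge_closed_form` is a hypothetical helper, not sklearn's implementation.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: ridge weights without an intercept term."""
    n_features = X.shape[1]
    # solve (X^T X + alpha * I) w = X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=50)

norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, norms[-1])
```

Lasso has no closed form in general and is solved iteratively; its L1 penalty can set some weights exactly to zero, which Ridge does not do.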

In [None]:
X,Y,columnNames,classVals = getDataset(0,pbaOneHot=True)

In [None]:

'''

Import Ridge regression and Lasso regression from sklearn 
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

Run 5-fold cross-validation, Ridge regression model on the data
  - Show average error for mean squared error, mean absolute error, r2 score ( a list of 3 values)
  - Show standard deviation over the K folds for the same 3 metrics (a list of 3 values)
  - Put these statistics in a pandas dataframe

Run 5-fold cross-validation, Lasso regression model on the data
  - Show average error for mean squared error, mean absolute error, r2 score ( a list of 3 values)
  - Show standard deviation over the K folds for the same 3 metrics (a list of 3 values)
  - Put these statistics in a pandas dataframe

Concatenate the two pandas dataframes
Print out the results

'''

# -------------------------------------------------------------------------
# IMPLEMENT - 1 Point
# -------------------------------------------------------------------------
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn import metrics

# use a separate name so the sklearn `metrics` module is not overwritten
metric_fns = [mean_squared_error, metrics.mean_absolute_error, metrics.r2_score]
# cross_validate fits each model on every training fold, so no prior fit is needed
model = Ridge()
modelAlt = Lasso()
k = 5  # the task asks for 5-fold cross-validation

cvMetrics = cross_validate(model, X, Y, k, metric_fns)
cvMetricsAlt = cross_validate(modelAlt, X, Y, k, metric_fns)
meansArr = np.mean(cvMetrics, axis=1)
sdArr = np.std(cvMetrics, axis=1)
meansArrAlt = np.mean(cvMetricsAlt, axis=1)
sdArrAlt = np.std(cvMetricsAlt, axis=1)

ridgedf = pd.DataFrame({"Model":["Ridge","Ridge","Ridge"],
                        "Error":["MSE", "MAE", "R2"],
                        "Mean":meansArr,
                        "Std":sdArr})
lassodf = pd.DataFrame({"Model":["Lasso","Lasso","Lasso"],
                        "Error":["MSE", "MAE", "R2"],
                        "Mean":meansArrAlt,
                        "Std":sdArrAlt})

frames = [ridgedf, lassodf]

# ignore_index gives the combined table a clean 0..5 index
result = pd.concat(frames, ignore_index=True)
print(result)

Unnamed: 0,Model,Error,Mean,Std
0,Ridge,MSE,0,0
1,Lasso,MSE,0,0
2,Ridge,MAE,0,0
3,Lasso,MAE,0,0
4,Ridge,R2,0,0
5,Lasso,R2,0,0


In [None]:
'''

Q: Which model is better?

'''

# -------------------------------------------------------------------------
# QUESTION - 1 Point
# -------------------------------------------------------------------------

# Your answer.

## [6] Real-World Application - Ocean (MS Student Only)

The CalCOFI dataset contains hydrographic data from 1949 to present. (https://www.kaggle.com/datasets/sohier/calcofi). 

Ocean.csv is generated by filtering bottle.csv from the CalCOFI dataset on Kaggle, which describes a set of features/measurements at different ocean measurement stations (only one point in time). We kept only (1) water temperature, (2) depth, and (3) specific chemicals, and (4) removed data records/stations with missing data.

In [None]:
!wget -Nq https://raw.githubusercontent.com/csci461/dataset/main/ocean.csv

In [None]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("ocean.csv")
df.head()

In [None]:
'''

Feature Description

"Depthm": Bottle depth in meters
"T_degC": Water temperature in degrees Celsius
"Salnty": Salinity (Practical Salinity Scale 1978)
"O2ml_L": Milliliters oxygen per liter of seawater
"ChlorA": Micrograms Chlorophyll-a per liter seawater, measured fluorometrically
"Phaeop": Micrograms Phaeopigment per liter seawater, measured fluorometrically
"PO4uM": Micromoles Phosphate per liter of seawater
"SiO3uM": Micromoles Silicate per liter of seawater
"NO2uM": Micromoles Nitrite per liter of seawater
"NO3uM": Micromoles Nitrate per liter of seawater
"NH3uM": Micromoles Ammonia per liter of seawater

'''

In [None]:
feature_list = ["Depthm","Salnty","O2ml_L","ChlorA","Phaeop","PO4uM","SiO3uM","NO2uM","NO3uM","NH3uM"]
target_variable = ["T_degC"]

In [None]:
'''

Use sklearn KFold, LinearRegression

Run 10-fold cross-validation, linear regression model on the ocean data
  - Use sklearn r2_score, mean_squared_error, mean_absolute_error
  - Show average error over K folds for mean squared error, mean absolute error, r2 score ( a list of 3 values)
  - Show standard deviation over the K folds for the same 3 metrics (a list of 3 values)
  - Put these statistics in a pandas dataframe
  - Show a predicted Y vs true Y scatter plot

'''

# -------------------------------------------------------------------------
# IMPLEMENT - 1 Point
# -------------------------------------------------------------------------
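
As a starting point for the task above, the sketch below shows how sklearn's `KFold` yields train and test indices for each fold. It runs on synthetic data rather than ocean.csv, and the names `X_demo`, `y_demo`, and `fold_mse` are hypothetical; only one metric is collected to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf.split(X_demo):
    # fit on nine folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_mse.append(mean_squared_error(y_demo[test_idx], model.predict(X_demo[test_idx])))

print(np.mean(fold_mse), np.std(fold_mse))
```

Collecting the held-out predictions from every fold in one array is one way to build the predicted-versus-true scatter plot the task asks for.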


In [None]:
'''

Q: Does the linear regression model perform worse at predicting low or high water temperatures
   (in degrees Celsius) based on the scatter plot?

'''

# -------------------------------------------------------------------------
# QUESTION - 1 Point
# -------------------------------------------------------------------------

# Your answer.

## Hyperparameter Tuning
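A 60%-20%-20% split can be produced with two calls to `train_test_split`: the first holds out 20% as the test set, and the second takes 25% of the remaining 80% as the validation set (0.25 × 0.8 = 0.2). The sketch below checks that arithmetic on made-up arrays; `X_demo` and `y_demo` are hypothetical stand-ins for the ocean features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# first split: hold out 20% of all points as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)
# second split: 25% of the remaining 80% equals 20% of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=123)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used to choose α, and the test set is touched only once at the end to report the final metrics.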

In [None]:
'''

Split the ocean dataset into training, validation, and test set (60%-20%-20%) with random_state = 123

Hint: use sklearn train_test_split twice

Run Lasso regression model on the training and validation set for different alpha values
α = [0.1,0.5,1,5,10,20]

Print a pandas dataframe containing the validation mean squared error, mean absolute error, r2 score for each α

Determine the best α based on mean squared error, mean absolute error, r2 score

Run Lasso regression model on the training and test set with this α

Show test mean squared error, mean absolute error, r2 score

'''

# -------------------------------------------------------------------------
# IMPLEMENT - 1 Point
# -------------------------------------------------------------------------

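# example of the expected layout
# (stored in its own variable so the ocean data in df is not overwritten)
results_example = pd.DataFrame({"Model":["Lasso"]*18,"Alpha":[0.1,0.1,0.1,0.5,0.5,0.5,1,1,1,5,5,5,10,10,10,20,20,20],
                                "Error":["MSE","MAE","R2"]*6,"Value":[0]*18})
results_example = results_example.sort_values(["Error","Alpha"]).reset_index(drop=True)
results_example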

Unnamed: 0,Model,Alpha,Error,Value
0,Lasso,0.1,MAE,0
1,Lasso,0.5,MAE,0
2,Lasso,1.0,MAE,0
3,Lasso,5.0,MAE,0
4,Lasso,10.0,MAE,0
5,Lasso,20.0,MAE,0
6,Lasso,0.1,MSE,0
7,Lasso,0.5,MSE,0
8,Lasso,1.0,MSE,0
9,Lasso,5.0,MSE,0
