## Intro to Scikit-learn

What are we going to cover?
* An end-to-end scikit-learn workflow
* Getting the data ready
* Choosing the right estimator/algorithm for our problem
* Fitting the model and making predictions
* Evaluating the model
* Improving the model
* Saving and loading a trained model
* Putting it all together!

## An end-to-end workflow

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Getting the data ready
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease

FileNotFoundError: [Errno 2] No such file or directory: 'data/heart-disease.csv'

In [None]:
# X = feature matrix (feature variables)
X = heart_disease.drop("target", axis=1)
# y = target variable (labels)
y = heart_disease["target"]

In [None]:
# Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()  # clf is short for classifier
# Keep the default hyperparameters for now
clf.get_params()

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
# Fit the model to the training data
clf.fit(X_train, y_train)

In [None]:
# Making predictions 
y_preds = clf.predict(X_test)
y_preds

In [None]:
y_test

In [None]:
# Evaluate the model
clf.score(X_train,y_train)

In [None]:
clf.score(X_test,y_test)

In [None]:
# Other evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test,y_preds))

In [None]:
confusion_matrix(y_test,y_preds)

In [None]:
accuracy_score(y_test,y_preds)

In [None]:
# Improve the model
# Try different values of n_estimators (a hyperparameter)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    print(f" [T] - Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")

In [None]:
# Save a model and load it
import pickle
# Use a context manager so the file is closed after writing
with open("../trained_models/random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [None]:
with open("../trained_models/random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

## 1. Getting our data ready
Three main things we have to do:
1. Split the data into features and labels (usually `X` and `y`)
2. Fill (impute) or disregard missing values
3. Convert non-numerical values to numerical values (feature encoding)
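The three steps above can be sketched on a tiny made-up table before we apply them to the real datasets. The column names and values here are hypothetical and only illustrate the order of operations, using pandas alone.

```python
import pandas as pd

# Hypothetical toy data: one categorical column, one numeric column with a gap
df = pd.DataFrame({
    "Colour": ["Red", "Blue", None, "Red"],
    "Odometer": [10000.0, None, 30000.0, 20000.0],
    "Price": [5000, 7000, 4000, 6000],
})

# 1. Split into features (X) and labels (y)
X = df.drop("Price", axis=1)
y = df["Price"]

# 2. Fill missing values (a constant for categories, the mean for numbers)
X["Colour"] = X["Colour"].fillna("missing")
X["Odometer"] = X["Odometer"].fillna(X["Odometer"].mean())

# 3. Convert non-numerical columns to numbers (one-hot encoding)
X_encoded = pd.get_dummies(X, columns=["Colour"])
print(X_encoded)
```

After step 3 every column is numeric and has no gaps, which is what scikit-learn estimators generally expect.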

In [None]:
heart_disease.head()

In [None]:
X = heart_disease.drop("target",axis=1)
X.head()

In [None]:
y = heart_disease["target"]
y.head()

In [None]:
#Splitting the data to training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

## 1.1 Make sure everything is numerical (car-sales)


In [None]:
car_sales  = pd.read_csv("data/scikit-learn-data/car-sales-extended.csv")
car_sales.head()

In [None]:
len(car_sales),car_sales.dtypes

In [None]:
#split X/y
X= car_sales.drop("Price",axis = 1)
y = car_sales["Price"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

In [None]:
# Build a machine learning model
# Predicting a number -> regressor
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
# Fitting raises a ValueError here because "Make" and "Colour" still contain strings
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
#Turning the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
categorical_features = ["Make","Colour","Doors"]  # "Doors" is treated as a category because it takes only a few distinct values
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                    remainder = "passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
X_df = pd.DataFrame(transformed_X)
X_df

In [None]:
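# Another way: pandas get_dummies
# Note: the numeric "Doors" column is left as-is unless it is first cast to a string or category
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]])
dummies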

In [None]:
#Refitting the model
np.random.seed(42)
X_train,X_test,y_train,y_test = train_test_split(transformed_X,
                                                 y,
                                                 test_size= 0.2)
model.fit(X_train,y_train)

In [None]:
# The score is fairly low; the model may need more informative features
model.score(X_test,y_test)

## 1.2 Missing values
Two main ways to deal with missing data:
1. Fill them with some value (imputation)
2. Remove the samples with missing data
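A minimal sketch of both options on a made-up table (the values are hypothetical). It fills a missing feature with the most common value and drops the row whose target is missing, which mirrors what we do with the car sales data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing feature value and one missing target
toy = pd.DataFrame({
    "Doors": [4.0, np.nan, 4.0, 3.0],
    "Price": [5000.0, 7000.0, np.nan, 6000.0],
})

# Option 1: imputation, fill the missing feature with the most common value
filled = toy.copy()
filled["Doors"] = filled["Doors"].fillna(filled["Doors"].mode()[0])

# Option 2: removal, drop rows whose target is missing
cleaned = filled.dropna(subset=["Price"])
print(len(toy), len(cleaned))
```

Filling keeps every row but injects a guess; removal loses rows but never invents a label, which is why it is the usual choice for a missing target.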

In [None]:
# import car sales missing data
car_sales_missing = pd.read_csv("data/scikit-learn-data/car-sales-extended-missing-data.csv")
car_sales_missing.head(),len(car_sales_missing)

In [None]:
car_sales_missing.isna().sum()

In [None]:
#Try converting to numbers
#Create X/y
X = car_sales_missing.drop("Price",axis= 1)
y = car_sales_missing["Price"]
#Conversion
categorical_features = ["Make","Colour","Doors"]  # "Doors" is treated as a category because it takes only a few distinct values
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                    remainder = "passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

### Option 1: Filling

In [None]:
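# Assign the result back instead of using inplace=True on a column,
# which newer pandas versions warn about and may not apply to the DataFrame
# Fill "Make"
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
# Fill "Colour"
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
# Fill "Odometer (KM)" with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(
    car_sales_missing["Odometer (KM)"].mean())
# Fill "Doors" with 4, the most common value
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)
car_sales_missing.isna().sum()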

In [None]:
# Option 2: Removal (appropriate for rows with a missing target value)
car_sales_missing.dropna(subset = ["Price"],inplace=True)
len(car_sales_missing),car_sales_missing.isna().sum()

In [None]:
#Creating X/y
X = car_sales_missing.drop("Price",axis = 1)
y= car_sales_missing["Price"]
#Conversion
categorical_features = ["Make","Colour","Doors"]  # "Doors" is treated as a category because it takes only a few distinct values
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                    remainder = "passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

### 1.2.2 Filling missing values with scikit-learn (previously done with pandas)

In [None]:
#Re reading
car_sales_missing = pd.read_csv("data/scikit-learn-data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

In [None]:
car_sales_missing.dropna(subset= ["Price"],inplace=True)
car_sales_missing.isna().sum()

In [None]:
#Split into X/y
X = car_sales_missing.drop("Price",axis = 1)
y = car_sales_missing["Price"]

In [None]:
#Sklearn cleaning/filling
from sklearn.impute import SimpleImputer #Filler
from sklearn.compose import ColumnTransformer
#Fill categorical values with "missing", and numerical with mean
cat_imputer = SimpleImputer(strategy= "constant",fill_value="missing")#cat = categorical
door_imputer = SimpleImputer(strategy= "constant",fill_value= 4) # Same reasoning as before
num_imputer = SimpleImputer(strategy= "mean")

##Define columns
cat_features= ["Make","Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer to fill the missing data
imputer = ColumnTransformer([
    ("cat_imputer",cat_imputer,cat_features),
    ("door_imputer",door_imputer,door_features),
    ("num_imputer",num_imputer,num_features)
])
#Transforming the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
car_sales_filled = pd.DataFrame(data = filled_X,
                               columns=["Make","Colours","Doors","Odometer (KM)"])
car_sales_filled.head(10)

In [None]:
car_sales_filled.isna().sum(),len(car_sales_filled)


In [None]:
#Re converting to numbers
#Split into X/y
X = car_sales_filled #Price already dropped 
y = car_sales_missing["Price"]
#Conversion
categorical_features = ["Make","Colours","Doors"]  # "Doors" is treated as a category because it takes only a few distinct values
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                    remainder = "passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
#Fitting the model 
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(transformed_X,
                                                 y,
                                                test_size= 0.2
                                                )
model = RandomForestRegressor(n_estimators= 100)
model.fit(X_train,y_train)
#Scoring
model.score(X_test,y_test)
# Still a low score (for regressors, score() returns R^2 rather than accuracy)
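One caveat with the cells above: the imputer and encoder are fitted on all of `X` before the train/test split, so information from the test rows leaks into preprocessing. A possible alternative, sketched here on made-up data, is to wrap the preprocessing and the model in a `Pipeline` and fit it on the training split only. The toy columns, values and step names are assumptions for illustration, not the car sales CSV.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the car sales data (all values are made up)
rng = np.random.default_rng(42)
n = 200
make = rng.choice(["Honda", "BMW", "Toyota"], size=n).astype(object)
odometer = rng.uniform(10_000, 200_000, size=n)
price = 30_000 - 0.1 * odometer + rng.normal(0, 500, size=n)
# Knock out some feature values to imitate missing data
make[::10] = np.nan
odometer[5::10] = np.nan

X = pd.DataFrame({"Make": make, "Odometer (KM)": odometer})
y = pd.Series(price, name="Price")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Categorical columns: fill with "missing", then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Numerical columns: fill with the mean
numeric = SimpleImputer(strategy="mean")

preprocess = ColumnTransformer([
    ("cat", categorical, ["Make"]),
    ("num", numeric, ["Odometer (KM)"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# The imputers and the encoder learn their values from the training split only
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)

# The mean used for filling comes from X_train, not from the full dataset
learned_mean = model.named_steps["preprocess"].named_transformers_["num"].statistics_[0]
print(r2, learned_mean)
```

The same fitted pipeline can then be scored, cross-validated or pickled as a single object.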

## 2. Choosing the right estimator
- Scikit-learn refers to machine learning models and algorithms as estimators
1. Classification problems: predicting a category (heart disease or not)
   - Classifiers are often abbreviated as `clf`
2. Regression problems: predicting a number (selling price of a car)

Cheat sheet:

<img src = "ml_map.png"/>

## 2.1 Picking a model for a regression problem (California housing dataset)

In [None]:
# Get california housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

In [None]:
#Turning this to a data frame
housing_df = pd.DataFrame(data = housing["data"],
                         columns= housing["feature_names"],
                         )
housing_df["target"] = housing["target"]  # median house value, in units of $100,000
housing_df.head()

In [None]:
#Import algorithm (We decided on ridge regression)
from sklearn.linear_model import Ridge
#Set up random seed,to split the same
np.random.seed(42)

#Create data
X = housing_df.drop("target",axis = 1)
y = housing_df["target"] #MedHouseVal

#Split the data to train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

# Instantiate and fit the model on the training set
model = Ridge()
model.fit(X_train,y_train)

In [None]:
#Score the model on the test set
model.score(X_test,y_test) #coefficient of determination, R^2,Higher is better

Possible improvement: try a different model.
- Ensemble models combine the predictions of several base estimators to improve generalization
- A random forest is an ensemble of many decision trees; it uses majority voting for classification and averaging for regression
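To make the two combination rules concrete, here is a small NumPy sketch with made-up tree outputs. Note that this is a simplification: scikit-learn's random forest classifier averages the trees' predicted class probabilities rather than counting hard votes, so the numbers below illustrate the idea, not the library internals.

```python
import numpy as np

# Hypothetical class predictions from three trees for four samples
tree_votes = np.array([
    [1, 0, 1, 1],   # tree 1
    [1, 1, 0, 1],   # tree 2
    [0, 0, 1, 1],   # tree 3
])
# Majority vote: a sample is labelled 1 when more than half of the trees say 1
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(majority)

# Hypothetical numeric predictions from three trees for two samples
tree_values = np.array([
    [2.0, 4.5],
    [2.4, 3.9],
    [1.9, 4.2],
])
# Regression: the ensemble prediction is the average over trees
average = tree_values.mean(axis=0)
print(average)
```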

## Trying RandomForestRegressor
(Takes a lot of time to run, so the cells below are commented out)

In [None]:
# #Import algorithm (We decided on RandomForestRegressor)
# from sklearn.ensemble import RandomForestRegressor
# #Set up random seed,to split the same
# np.random.seed(42)

# #Create data
# X = housing_df.drop("target",axis = 1)
# y = housing_df["target"] #MedHouseVal

# #Split the data to train and test
# X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

# # Instantiate and fit the model on the training set
# model = RandomForestRegressor(n_estimators= 100)
# model.fit(X_train,y_train)

In [None]:
# #Score the model on the test set (Need to research more metrics)
# model.score(X_test,y_test) #coefficient of determination, R^2,Higher is better

## 2.2 Picking a model for a classification problem (heart disease dataset)

In [None]:
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

In [None]:
len(heart_disease)

In [None]:
# Import algorithm (we decided on LinearSVC)
from sklearn.svm import LinearSVC
# Set up a random seed so the split is reproducible
np.random.seed(42)

# Create data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]  # labels

#Split the data to train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

# Instantiate and fit the model on the training set
clf = LinearSVC(dual= True,max_iter=1000)
clf.fit(X_train,y_train)

In [None]:
#Score
clf.score(X_test,y_test) # The lecture got about 47% accuracy; LinearSVC may fail to converge on unscaled features, so results can vary

- Trying `RandomForestClassifier` instead

In [None]:
#Import algorithm (We decided on Random forest classifier)
from sklearn.ensemble import RandomForestClassifier
#Set up random seed,to split the same
np.random.seed(42)

#Create data
X = heart_disease.drop("target",axis = 1)
y = heart_disease["target"]  # labels

#Split the data to train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

# Instantiate and fit the model on the training set
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)

In [None]:
#Score
clf.score(X_test,y_test) 

## Tidbit (general guideline)
1. If we have structured data (things in a data frame), we should use ensemble methods
2. If we have unstructured data (audio, text, images...), we should use deep learning or transfer learning

## 3. Fitting the model and using it to make predictions
### 3.1 Fitting the model to the data
Different names for:
- `X` = features, feature variables, data
- `y` = labels, targets, target variables

### Training the classifier for heart disease

In [None]:
#Import algorithm (We decided on Random forest classifier)
from sklearn.ensemble import RandomForestClassifier
#Set up random seed,to split the same
np.random.seed(42)

#Create data
X = heart_disease.drop("target",axis = 1)
y = heart_disease["target"]  # labels

#Split the data to train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

# Instantiate and fit the model on the training set
clf = RandomForestClassifier(n_estimators=100)
#Fitting the model to the data
clf.fit(X_train,y_train)

### 3.2 Making predictions with the model
Two main ways to make predictions:
  1. `predict()`
  2. `predict_proba()`

In [None]:
X_test

In [None]:
clf.predict(X_test) #predictions

In [None]:
np.array(y_test) #truth labels

In [None]:
y_preds = clf.predict(X_test)

In [None]:
#Scoring using mean
np.mean(y_preds == y_test)

In [None]:
#accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_preds)

## `predict_proba()`
Returns probability estimates for each classification label
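A minimal sketch of how the two methods relate, assuming a synthetic dataset from `make_classification` in place of the heart disease CSV. For a random forest, `predict()` returns the class whose predicted probability is highest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the heart disease dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

probs = clf.predict_proba(X_test[:5])   # shape (5, 2): one column per class
labels = clf.predict(X_test[:5])

# Each row of probabilities sums to 1, and the predicted label
# is the class with the highest probability in that row
from_probs = clf.classes_[np.argmax(probs, axis=1)]
print(probs)
print(labels, from_probs)
```

The probabilities are useful when we care about how confident the model is, for example a row like `[0.52, 0.48]` is a much weaker prediction than `[0.95, 0.05]`.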

In [None]:
clf.predict_proba(X_test[:5])

In [None]:
# Let's call predict() on the same data
clf.predict(X_test[:5])

`predict()` can also be used with regression models

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
#Create data
X = housing_df.drop("target",axis= 1)
y= housing_df["target"]
# split to train and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Create model instance
model = RandomForestRegressor(n_estimators= 100)
#Fit model
model.fit(X_train,y_train)
#Make predictions
y_preds = model.predict(X_test)

In [None]:
y_preds[:10]

In [None]:
np.array(y_test[:10])

In [None]:
len(y_test),len(y_preds)

In [None]:
#Evaluating according to mean_absolute_error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_preds)

In [None]:
housing_df["target"] 

## 4. Evaluating our models
Three ways to evaluate scikit-learn models:
1. The estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions (`sklearn.metrics`)

More info at : https://scikit-learn.org/stable/modules/model_evaluation.html
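As a quick orientation, the three approaches can be placed side by side. This sketch assumes synthetic data from `make_classification` rather than the heart disease CSV; for a classifier, `score()` and `accuracy_score` compute the same number on the same split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# 1. The estimator's built-in score() method (mean accuracy for classifiers)
via_score = clf.score(X_test, y_test)

# 2. The scoring parameter of a model selection tool such as cross_val_score
via_scoring = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

# 3. A problem-specific metric function from sklearn.metrics
via_metric = accuracy_score(y_test, clf.predict(X_test))

print(via_score, via_scoring.mean(), via_metric)
```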

### 4.1 The score method

In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
#Create data
X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]
#Split train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Create classifier instance
clf = RandomForestClassifier(n_estimators = 100)
#Fit model
clf.fit(X_train,y_train)

In [None]:
# Score on the training data
clf.score(X_train,y_train)

In [None]:
# Score on the test data
clf.score(X_test,y_test)

Let's use `score()` on our regression problem

In [None]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
#Create data
X = housing_df.drop("target",axis = 1)
y = housing_df["target"]
#Split train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Create model instance
model = RandomForestRegressor(n_estimators=2)
#Fitting model
model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)#n_estimators = 2

In [None]:
model = RandomForestRegressor(n_estimators=10)
#Fitting model
model.fit(X_train,y_train)
model.score(X_test,y_test)#n_estimators = 10

In [None]:
model = RandomForestRegressor(n_estimators=20)
#Fitting model
model.fit(X_train,y_train)
model.score(X_test,y_test)#n_estimators = 20

In [None]:
model = RandomForestRegressor(n_estimators=50)
#Fitting model
model.fit(X_train,y_train)
model.score(X_test,y_test)#n_estimators = 50

## 4.2 The `scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
#Create data
X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]
#Split train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Create classifier instance
clf = RandomForestClassifier(n_estimators = 100)
#Fit model
clf.fit(X_train,y_train)

In [None]:
clf.score(X_test,y_test)

In [None]:
cross_val_score(clf,X,y,cv = 5)
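To see what `cross_val_score` does with `cv = 5`, here is a sketch that reproduces it by hand on synthetic data. It assumes a deterministic `DecisionTreeClassifier` so both runs give identical numbers; for classifiers, an integer `cv` uses stratified folds without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a deterministic model (assumptions for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
clf = DecisionTreeClassifier(random_state=1)

# Manual version: train on four folds, score on the held-out fold, repeat five times
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Library version
auto_scores = cross_val_score(clf, X, y, cv=5)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Every sample ends up in a test fold exactly once, which is why the mean of these scores is a steadier estimate than a single train/test split.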

<img src = "Cross-validation-chart.png"/>

In [None]:
np.random.seed(42)
cross_validation_array = cross_val_score(clf,X,y,cv = 5) #Five fold
np.mean(cross_validation_array)

In [None]:
np.random.seed(42)
cross_validation_array = cross_val_score(clf,X,y,cv = 10) #Ten fold
np.mean(cross_validation_array)

In [None]:
np.random.seed(42)
cross_validation_array = cross_val_score(clf,X,y,cv = 20) #20 fold
np.mean(cross_validation_array)

In [None]:
# The scoring parameter is None by default
cross_validation_array = cross_val_score(clf,X,y,cv = 5,scoring= None)
# None means the estimator's default scorer is used; for classifiers this is mean accuracy

### We can pass our own scoring parameter
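A short sketch of swapping the metric by name, assuming synthetic binary data. The strings used here (`"accuracy"`, `"precision"`, `"recall"`, `"f1"`, `"roc_auc"`) are scorer names that scikit-learn accepts for the `scoring` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=7)
clf = RandomForestClassifier(n_estimators=50, random_state=7)

# Run the same 5-fold cross-validation with different metrics
results = {}
for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    scores = cross_val_score(clf, X, y, cv=5, scoring=metric)
    results[metric] = scores.mean()
    print(f"{metric}: {scores.mean():.3f}")
```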

### 4.2.1 Classification model evaluation metrics
1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report

### 4.2.1.1 Accuracy

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

X = heart_disease.drop("target",axis = 1)
y = heart_disease["target"]

clf = RandomForestClassifier(n_estimators=100)
cross_validation_array = cross_val_score(clf,X,y,cv = 5,scoring= None)
np.mean(cross_validation_array)

In [None]:
print(f"Heart disease classifier Cross validation score: {np.mean(cross_validation_array) *100:.2f} %")

### 4.2.1.2 Area under the receiver operating characteristic curve (AUC/ROC)

ROC curves compare a model's true positive rate (TPR) with its false positive rate (FPR) at different classification thresholds.
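Each point on a ROC curve comes from one threshold applied to the predicted probabilities. The sketch below computes the two rates by hand for made-up labels and probabilities; `rates_at` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])

def rates_at(threshold):
    """Return (tpr, fpr) when probabilities >= threshold are predicted as 1."""
    pred = (y_prob >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)   # share of actual positives that were caught
    fpr = fp / (fp + tn)   # share of actual negatives that were flagged
    return float(tpr), float(fpr)

for threshold in [0.95, 0.5, 0.3]:
    print(threshold, rates_at(threshold))
```

Lowering the threshold catches more positives (higher TPR) at the cost of more false alarms (higher or equal FPR), and `roc_curve` sweeps through the thresholds for us.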

In [None]:
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

X = heart_disease.drop("target",axis = 1)
y = heart_disease["target"]
#Split train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Classifier instance
clf = RandomForestClassifier(n_estimators=100)

#Fit model
clf.fit(X_train,y_train)

#Predict probabilities
y_probs = clf.predict_proba(X_test)
y_probs[:10]

In [None]:
#Split probs to positive and negative
y_probs_positive = y_probs[:,1]
y_probs_negative = y_probs[:,0]
y_probs_positive[:10],y_probs_negative[:10]

In [None]:
fpr, tpr,thresholds = roc_curve(y_test,y_probs_positive)

## Create a function for plotting ROC curves

In [None]:
import matplotlib.pyplot as plt

In [None]:
def plot_roc_curve(fpr, tpr):
    """
    Plots a ROC curve given the false positive rate (fpr)
    and true positive rate (tpr) of a model.
    """
    # Plot the ROC curve
    plt.plot(fpr, tpr, color="orange", label="ROC")
    # Plot the baseline with no predictive power (random guessing)
    plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Guessing")
    # Customize the plot
    plt.xlabel("False positive rate (fpr)")
    plt.ylabel("True positive rate (tpr)")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend()  # show the labels defined above
    plt.show()

In [None]:
fpr, tpr,thresholds = roc_curve(y_test,y_probs_positive)
plot_roc_curve(fpr,tpr)

In [None]:
#Area under curve scoring
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_probs_positive)

In [None]:
#Plotting perfect ROC curve and AUC 
fpr, tpr,thresholds = roc_curve(y_test,y_test)
plot_roc_curve(fpr,tpr)

In [None]:
roc_auc_score(y_test,y_test)

### 4.2.1.3 Confusion matrix

A confusion matrix is a quick way to compare the labels a model predicts with the actual
labels it was supposed to predict.
In essence, it gives us an idea of where the model got confused.
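For binary labels the four cells can be counted by hand. This sketch uses made-up labels and arranges the counts the way `confusion_matrix` does for labels 0 and 1: rows are actual labels, columns are predicted labels.

```python
import numpy as np

# Hypothetical actual and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])

tn = np.sum((y_true == 0) & (y_pred == 0))   # correctly predicted 0
fp = np.sum((y_true == 0) & (y_pred == 1))   # predicted 1, actually 0
fn = np.sum((y_true == 1) & (y_pred == 0))   # predicted 0, actually 1
tp = np.sum((y_true == 1) & (y_pred == 1))   # correctly predicted 1

# Rows = actual label, columns = predicted label
matrix = np.array([[tn, fp],
                   [fn, tp]])
print(matrix)
```

The diagonal holds the correct predictions; the off-diagonal cells show which kind of mistake the model makes more often.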

In [None]:
from sklearn.metrics import confusion_matrix
y_preds = clf.predict(X_test)
confusion_matrix(y_true= y_test,y_pred = y_preds)

In [None]:
# Visualize the confusion matrix with pd.crosstab()
pd.crosstab(y_test,y_preds,
           rownames=["Actual labels"],
           colnames=["Predicted labels"])

In [None]:
#Make a seaborn heatmap() with our confusion matrix
import seaborn as sns
#Set the font scale
sns.set(font_scale= 1.5)
#create a confusion matrix
conf_mat = confusion_matrix(y_true =y_test,y_pred=y_preds)
#Plot confusion
sns.heatmap(conf_mat)

The default heatmap is hard to read on its own; it needs annotations such as the count in each cell.

### Basic customization

1. Make predictions and produce a confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
y_preds = clf.predict(X_test)
confusion_matrix(y_true= y_test,y_pred = y_preds)

2. Viewing pandas crosstab

In [None]:
pd.crosstab(y_test,y_preds,
           rownames=["Actual Label"],
           colnames=["Predicted label"])

3. Plot the matrix using scikit-learn; we want a bright, colorful diagonal

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
# from_estimator makes the predictions itself; here it is given the ENTIRE dataset, including the training samples
ConfusionMatrixDisplay.from_estimator(estimator=clf,X=X,y=y)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true=y_test,
                                       y_pred=y_preds)

### 4.2.1.4 Classification report (Multiple metrics)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_preds))

In [None]:
#Where precision and recall become valuable
disease_true = np.zeros(10000)
disease_true[0] = 1 #Only one positive case
disease_preds = np.zeros(10000)#Model predicts everything to be zero

pd.DataFrame(classification_report(y_true=disease_true,y_pred=disease_preds,output_dict=True))
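The same imbalanced case can be worked through by hand to see why accuracy alone misleads. This sketch rebuilds the arrays from the cell above and computes accuracy and recall with NumPy only.

```python
import numpy as np

# Same setup as above: 10,000 samples, a single positive case
disease_true = np.zeros(10000)
disease_true[0] = 1
disease_preds = np.zeros(10000)   # the model predicts 0 for everyone

# Accuracy: share of predictions that match the truth
accuracy = np.mean(disease_true == disease_preds)

# Recall: share of actual positives that the model found
tp = np.sum((disease_preds == 1) & (disease_true == 1))
fn = np.sum((disease_preds == 0) & (disease_true == 1))
recall = tp / (tp + fn)

print(accuracy, recall)
```

The model is 99.99% accurate and still misses the only sick patient. Precision is undefined here because the model never predicts the positive class, so there is nothing to divide by.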

## 4.2.2 Evaluating a Regression model
* See "Regression metrics" in the scikit-learn documentation

- The ones we are going to cover:
  1. R^2 (coefficient of determination)
  2. Mean absolute error (MAE)
  3. Mean squared error (MSE)
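Before using the scikit-learn functions, it can help to see all three metrics written out with NumPy. The values below are made up; the formulas are the standard definitions of MAE, MSE and R^2.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])
errors = y_pred - y_true

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))

# Mean squared error: average squared error, punishes large errors more
mse = np.mean(errors ** 2)

# R^2: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

A model that always predicts the mean of `y_true` gets an R^2 of 0, a perfect model gets 1, and a model worse than the mean gets a negative value.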

In [None]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
#Create data
X = housing_df.drop("target",axis=1)
y = housing_df["target"]
#split data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#instancing model
model = RandomForestRegressor(n_estimators=100)
#fitting model
model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)#coefficient of determination (can be negative!!!)

In [None]:
# Save the California housing model
import pickle
# Use a context manager so the file is closed after writing
with open("models/cali_housing_random_forest_model.pkl", "wb") as f:
    pickle.dump(model, f)

In [None]:
with open("models/cali_housing_random_forest_model.pkl", "rb") as f:
    model = pickle.load(f)

## 4.2.2.1 R^2

In [None]:
from sklearn.metrics import r2_score
#Fill an array with y_test mean
y_test_mean = np.full(len(y_test),y_test.mean())
y_test_mean[:10]

In [None]:
r2_score(y_true= y_test,
        y_pred= y_test_mean) #Assuming the model predicted just the mean

In [None]:
#Actual trained model score
y_preds = model.predict(X_test)
r2_score(y_true= y_test,
        y_pred= y_preds)

## 4.2.2.2 Mean absolute error

In [None]:
from sklearn.metrics import mean_absolute_error as mae
y_preds = model.predict(X_test)
mae(y_test,y_preds)

MAE is the average absolute difference between the model's predictions and the true values, i.e. how far off the model is on average.

In [None]:
df = pd.DataFrame(data={"Actual values": y_test,
                       "Predicted values": y_preds})
df["Differences"] =df["Predicted values"]- df["Actual values"]
df.head()

In [None]:
df["Differences (Absolute)"] = np.abs(df["Differences"])
df.head()

In [None]:
df["Differences"].mean() ,df["Differences (Absolute)"].mean()