<h1 style="text-align:center">Introduction to Metrics</h1>

<div style="text-align:center;"><img src="https://images.unsplash.com/photo-1501516069922-a9982bd6f3bd?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80" /></div>

### About the Notebook:

**In this notebook, I want to address different evaluation metrics.**

In [None]:
# Data Processing
import numpy as np 
import pandas as pd 


# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Preparing the data

I would like to skip right to the point where the data is already prepared, without describing each step.
I am using the same approach as in this [kernel](https://www.kaggle.com/dietzschdaniel/my-simplistic-titanic-approach-0-79665) for Titanic.

In [None]:
# Loading the data
df_train = pd.read_csv("/kaggle/input/titanic/train.csv")
df_test = pd.read_csv("/kaggle/input/titanic/test.csv")


# Handling NaN values
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())

df_train['Cabin'] = df_train['Cabin'].fillna("Missing")
df_test['Cabin'] = df_test['Cabin'].fillna("Missing")

df_train = df_train.dropna()

df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].mean())

# Cleaning the data
df_train = df_train.drop(columns=['Name'], axis=1)
df_test = df_test.drop(columns=['Name'], axis=1)

sex_mapping = {
    'male': 0,
    'female': 1
}
# Assigning the mapped column directly so that it gets an integer dtype
df_train["Sex"] = df_train['Sex'].map(sex_mapping)
df_test["Sex"] = df_test['Sex'].map(sex_mapping)

df_train = df_train.drop(columns=['Ticket'], axis=1)
df_test = df_test.drop(columns=['Ticket'], axis=1)

df_train = df_train.drop(columns=['Cabin'], axis=1)
df_test = df_test.drop(columns=['Cabin'], axis=1)

df_train = pd.get_dummies(df_train, prefix_sep="__",
                              columns=['Embarked'])
df_test = pd.get_dummies(df_test, prefix_sep="__",
                              columns=['Embarked'])

Our data now looks as follows:

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
df_test.head()

In [None]:
df_test.shape

# Metrics

### Preparing the data for Modeling (Classification)

In [None]:
# Everything except the target variable
X = df_train.drop("Survived", axis=1)

# Target variable
y = df_train['Survived'].values

In [None]:
# Random seed for reproducibility
np.random.seed(42)

# Splitting the data into train & test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

For our model, I will be using `RandomForestClassifier()`.

In [None]:
# Setting up RandomForestClassifier()
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Predicting values
y_pred = rfc.predict(X_test)

## Classification Metrics

For classification problems, the most common metrics are:
* Accuracy
* Precision
* Recall
* F1 score
* Area under ROC curve (AUC)

At this point, I would like to introduce some vocabulary first.
* True Positive (tp): a result that correctly indicates that a given condition exists.
* False Positive (fp): a result that indicates a given condition exists when it does not.
* True Negative (tn): a result that correctly indicates that a condition does not hold.
* False Negative (fn): a result that wrongly indicates that a condition does not hold.
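
To make this vocabulary concrete, here is a minimal sketch that counts the four cases for a handful of made-up labels. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and not part of the Titanic data; 1 stands for "condition exists" and 0 for "condition does not exist".

```python
import numpy as np

# Hypothetical toy labels (1 = condition exists, 0 = it does not)
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Counting each case with boolean masks
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # predicted 1, truly 1
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # predicted 1, truly 0
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))  # predicted 0, truly 0
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # predicted 0, truly 1

print(f"tp={tp}, fp={fp}, tn={tn}, fn={fn}")
```

The four counts always add up to the number of samples, here 8.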

## Confusion Matrix

These values are usually presented in a **confusion matrix**.

<div style="text-align:center;"><img src="https://miro.medium.com/max/356/1*Z54JgbS4DUwWSknhDCvNTQ.png" /></div>

In [None]:
# Importing confusion_matrix
from sklearn.metrics import confusion_matrix

# Computing the confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
# Importing ConfusionMatrixDisplay
# (plot_confusion_matrix was removed in scikit-learn 1.2)
from sklearn.metrics import ConfusionMatrixDisplay

# Plotting the confusion matrix
ConfusionMatrixDisplay.from_estimator(rfc, X_test, y_test);

In this case we have 50 True Positives, 90 True Negatives, 19 False Positives and 19 False Negatives.

## Accuracy
Accuracy is the fraction of predictions our model got right.
It is defined by the number of correct predictions divided by the number of total predictions.

<div style="text-align:center;"><img src="https://miro.medium.com/max/373/1*yRa2inzTnyASJOre93ep3g.gif" /></div>

Accuracy can give us a *good first impression* of our model's performance. 

In [None]:
# Importing accuracy_score
from sklearn.metrics import accuracy_score

# Computing the accuracy_score
accuracy_score(y_test, y_pred)

In this case our model predicted about 78% of the samples correctly. However, accuracy does not always show how good our model is.

**For example:**  
If we have 100 samples, 90 passengers who survived and 10 who did not, and our model always predicts that a passenger survived, it would have an accuracy of 0.9. However, such a model would not provide much value, because it never identifies a passenger who did not survive. Therefore, we should also look at other metrics.
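
The example can be reproduced with a few lines. This is only a sketch with made-up labels (`y_true_imb` and `y_pred_imb` are hypothetical names, not Titanic data). Besides accuracy it also computes the recall for the "did not survive" class, which shows what accuracy hides here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 passengers survived (1), 10 did not (0)
y_true_imb = np.array([1] * 90 + [0] * 10)

# A "model" that always predicts survival
y_pred_imb = np.ones_like(y_true_imb)

acc = accuracy_score(y_true_imb, y_pred_imb)

# Recall for class 0: the share of non-survivors the model actually found
recall_not_survived = recall_score(y_true_imb, y_pred_imb, pos_label=0)

print(acc)                  # 0.9
print(recall_not_survived)  # 0.0
```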

## Precision
From [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score):
> The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

In [None]:
# Importing precision_score
from sklearn.metrics import precision_score

# Computing the precision_score
precision_score(y_test, y_pred)

## Recall
From [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score):
> The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

In [None]:
# Importing recall_score
from sklearn.metrics import recall_score

# Computing the recall_score
recall_score(y_test, y_pred)

## F1 score
From [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score):
> The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. 

The F1 score is defined as follows:
<div style="text-align:center;"><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9c94f59b68f5ae0dc92185906c7ec4214fd04e1e" /></div>

It should mainly be used if you seek a balance between precision and recall.

In [None]:
# Importing f1_score
from sklearn.metrics import f1_score

# Computing the f1_score
f1_score(y_test, y_pred)
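
As a worked example, the same metrics can be computed by hand from the confusion matrix counts reported earlier (50 True Positives, 90 True Negatives, 19 False Positives, 19 False Negatives). This is a plain-Python sketch of the formulas, not a replacement for the scikit-learn functions. The exact counts may differ if the notebook is re-run.

```python
# Counts taken from the confusion matrix reported above
tp, tn, fp, fn = 50, 90, 19, 19

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the number of False Positives equals the number of False Negatives here, precision, recall and F1 all come out the same.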

## Area under ROC curve (AUC)
From [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):
> A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. 

> The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

The True Positive Rate (TPR) is defined as: `TPR = TP / (TP + FN)`  
The False Positive Rate (FPR) is defined as: `FPR = FP / (TN + FP)`

If we calculate the area under this curve, we get the Area under the ROC Curve, or simply AUC.

In [None]:
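# Importing roc_curve
from sklearn.metrics import roc_curve

# Predicted probability of the positive class (survived);
# the ROC curve needs scores, not hard 0/1 labels, to vary the threshold
y_score = rfc.predict_proba(X_test)[:, 1]

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)

# Importing roc_auc_score
from sklearn.metrics import roc_auc_score

# Computing the roc_auc_score
roc_auc_score(y_test, y_score)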
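
To see how the curve and the area relate, here is a small sketch on four made-up samples (`y_true_toy` and `y_score_toy` are hypothetical values, not model output). The scores are probabilities rather than hard 0/1 labels, so that different thresholds produce different points on the curve. The area is computed once with the trapezoidal rule via `auc` and once directly via `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Hypothetical true labels and predicted scores for the positive class
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per threshold
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, y_score_toy)

# Area under the curve, computed in two ways
auc_trapezoid = auc(fpr_toy, tpr_toy)
auc_direct = roc_auc_score(y_true_toy, y_score_toy)

print(auc_trapezoid, auc_direct)  # 0.75 0.75
```

A value of 0.75 means that in 3 out of the 4 possible (positive, negative) pairs, the positive sample received the higher score.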

## Regression Metrics

In [None]:
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")

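# Filling numeric columns with their median
# (numeric_only=True is required in recent pandas versions)
train = train.fillna(train.median(numeric_only=True))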

train['Alley'] = train['Alley'].fillna('None')
train = train.drop(['Utilities'], axis=1)
train['MasVnrType'] = train['MasVnrType'].fillna('Missing')
train['BsmtQual'] = train['BsmtQual'].fillna('None')
train['BsmtCond'] = train['BsmtCond'].fillna('None')
train['BsmtExposure'] = train['BsmtExposure'].fillna('None')
train['BsmtFinType1'] = train['BsmtFinType1'].fillna('None')
train['BsmtFinType2'] = train['BsmtFinType2'].fillna('None')
train['Electrical'] = train['Electrical'].fillna('None')
train['FireplaceQu'] = train['FireplaceQu'].fillna('None')
train['GarageType'] = train['GarageType'].fillna('None')
train['GarageFinish'] = train['GarageFinish'].fillna('None')
train['GarageQual'] = train['GarageQual'].fillna('None')
train['GarageCond'] = train['GarageCond'].fillna('None')
train = train.drop(['PoolQC'], axis=1)
train['Fence'] = train['Fence'].fillna('None')
train['MiscFeature'] = train['MiscFeature'].fillna('None')
train['SaleType'] = train['SaleType'].fillna('None')

from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(train[c].values)) 
    train[c] = lbl.transform(list(train[c].values))
    
train = pd.get_dummies(train)



# Everything except the target variable
X = train.drop("SalePrice", axis=1)

# Target variable
y = train['SalePrice'].values


# Random seed for reproducibility
np.random.seed(42)

# Splitting the data into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)


# Setting up GradientBoostingRegressor()
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# Predicting values
y_pred = model.predict(X_test)

For this part we use the data from the [house price prediction competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
My approach is explained in this [kernel](https://www.kaggle.com/dietzschdaniel/my-house-price-prediction-approach).

For regression problems, the most common metrics are:
* Mean absolute error (MAE)
* Mean squared error (MSE)
* Root mean squared error (RMSE)
* Root mean squared logarithmic error (RMSLE)
* Mean percentage error (MPE)
* Mean absolute percentage error (MAPE)
* R^2

## Mean absolute error (MAE)
From [Wikipedia](https://en.wikipedia.org/wiki/Mean_absolute_error):
> In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement.

<div style="text-align:center;"><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/3ef87b78a9af65e308cf4aa9acf6f203efbdeded" /></div>

> yi is the prediction and xi is the true value.

MAE should be used when you do not want to penalize large errors too much. This [article](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) describes this in detail.

In [None]:
# Importing mean_absolute_error
from sklearn.metrics import mean_absolute_error

# Computing the mean_absolute_error
mean_absolute_error(y_test, y_pred)
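
The formula is short enough to compute by hand. The sketch below uses four made-up values (`y_true_toy` and `y_pred_toy` are hypothetical, not house prices) and checks the manual result against `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

# MAE: the mean of the absolute differences
mae_manual = np.mean(np.abs(y_true_toy - y_pred_toy))
mae_sklearn = mean_absolute_error(y_true_toy, y_pred_toy)

print(mae_manual, mae_sklearn)  # 0.5 0.5
```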

## Mean squared error (MSE) 
From [Wikipedia](https://en.wikipedia.org/wiki/Mean_squared_error):
> In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.

In [None]:
# Importing mean_squared_error
from sklearn.metrics import mean_squared_error

# Computing the mean_squared_error
mean_squared_error(y_test, y_pred)
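
In code, the "average squared difference" from the quote looks as follows. This is a sketch on made-up values (`y_true_toy` and `y_pred_toy` are hypothetical). Note that the result is expressed in squared units of the target, which is why it is harder to interpret directly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: the mean of the squared differences
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)

print(mse_manual, mse_sklearn)  # 0.375 0.375
```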

## Root mean squared error (RMSE)
From [Wikipedia](https://en.wikipedia.org/wiki/Root-mean-square_deviation):
> The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSD represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences.

The Root mean squared error (RMSE) is the square root of the mean of the squared differences between the predicted values and the true values, where the mean is taken over the number of predictions (n).

<img src="https://miro.medium.com/max/327/1*9hQVcasuwx5ddq_s3MFCyw.gif" />

Root Mean Squared Error should be used, when you want to penalize large errors. This [article](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) gives a good overview.

In [None]:
# Importing mean_squared_error
from sklearn.metrics import mean_squared_error

# Computing the root of mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred))
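
The difference between MAE and RMSE becomes visible once a single prediction is far off. The following sketch compares both metrics on made-up values. The helper functions `mae` and `rmse` and all arrays are hypothetical and only serve this illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_even = np.array([110.0, 190.0, 310.0, 390.0])     # every error is 10
y_pred_outlier = np.array([110.0, 190.0, 310.0, 300.0])  # one error is 100

# When all errors have the same size, both metrics agree
print(mae(y_true_toy, y_pred_even), rmse(y_true_toy, y_pred_even))

# With one large error, RMSE grows much faster than MAE
print(mae(y_true_toy, y_pred_outlier), rmse(y_true_toy, y_pred_outlier))
```

With equal errors both metrics are 10. With the single large error, MAE rises to 32.5 while RMSE rises to roughly 50.7.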

## Root mean squared logarithmic error (RMSLE)
From [this Kaggle Discussion](https://www.kaggle.com/c/ashrae-energy-prediction/discussion/113064):
> In case of RMSLE, you take the log of the predictions and actual values. So basically, what changes is the variance that you are measuring. I believe RMSLE is usually used when you don't want to penalize huge differences in the predicted and the actual values when both predicted and true values are huge numbers.

In [None]:
# Importing mean_squared_log_error
from sklearn.metrics import mean_squared_log_error

# Computing the root of mean_squared_log_error
np.sqrt(mean_squared_log_error(y_test, y_pred))
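
Because of the logarithm, RMSLE reacts to relative rather than absolute differences. The sketch below illustrates this with made-up numbers: a prediction that is 10% too high gives nearly the same RMSLE for a small and for a large true value. The helper `rmsle` is a hypothetical function written for this example. It uses `log(1 + x)`, which is the form scikit-learn uses in `mean_squared_log_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Both predictions are 10% too high, but on very different scales
small = rmsle(np.array([1000.0]), np.array([1100.0]))
large = rmsle(np.array([100000.0]), np.array([110000.0]))

# Cross-check against scikit-learn
small_sklearn = np.sqrt(mean_squared_log_error([1000.0], [1100.0]))

print(small, large, small_sklearn)
```

The absolute errors are 100 and 10,000, yet both RMSLE values are close to 0.095.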

## R^2
R^2 or R-squared is a popular metric for regression problems. It specifies how close the data is to the fitted regression line.
From [Wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination):
> In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

<div style="text-align:center;"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/R2values.svg/330px-R2values.svg.png" /></div>



The definition of the scores as explained by the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html):
> Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.


In [None]:
# Importing r2_score
from sklearn.metrics import r2_score

# Computing the r2_score
r2_score(y_test, y_pred)
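
To connect the score to its definition, here is a sketch that computes R^2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It also checks the statement that a constant model predicting the mean of y scores 0.0. All values are made up for illustration (`y_true_toy`, `y_pred_toy` and `y_pred_mean` are hypothetical).

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# A constant model that always predicts the mean of y
y_pred_mean = np.full_like(y_true_toy, y_true_toy.mean())
r2_constant = r2_score(y_true_toy, y_pred_mean)

print(round(r2_manual, 3), r2_constant)
```

The manual value of about 0.949 matches `r2_score(y_true_toy, y_pred_toy)`, and the constant model scores 0.0.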

# Work in Progress. To be continued..

**If you liked this notebook or found it helpful in any way, feel free to leave an upvote - That will keep me motivated :)**

**If you have any suggestions for improvement, leave a comment :)**