In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 4.2 Evaluating your models using the scoring parameter

The scoring parameter can be used with cross-validation tools such as [cross_val_score()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) or [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In [4]:
heart_df = pd.read_csv("data/heart-disease.csv")
heart_df.head() # classification dataset - supervised learning

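| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |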


In [7]:
from sklearn.model_selection import train_test_split

# Import cross_val_score from the model_selection module
from sklearn.model_selection import cross_val_score

# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier

# Set up a random seed for reproducibility: the seed fixes the sequence of
# random numbers, so dividing the dataset into train and test sets below
# gives the same split on every run
np.random.seed(42)

# Split the data into X (features/data) and y (target/labels)
X = heart_df.drop("target",axis=1)
y = heart_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Instantiate the model (it is fit on the training set in the next step)
clf = RandomForestClassifier()

# Call the fit method on the model and pass it training data
clf.fit(X_train,y_train)

RandomForestClassifier()

In [8]:
# Using score()
clf.score(X_test,y_test)

0.8524590163934426

In [12]:
# Using cross_val_score()
cross_val_score(clf,X,y,cv=5,scoring=None)

# Returns one score per fold; we take the mean of these values below

array([0.78688525, 0.86885246, 0.80327869, 0.78333333, 0.76666667])

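cross_val_score() returns an array of scores, whereas score() returns a single number.

The array has one entry per split, and the number of splits is controlled by a parameter called cv, which stands for cross-validation.

With cv=k, the data is split k different ways into training and test folds, and the model is fit and scored once per split. When cv isn't set, cross_val_score() uses 5-fold cross-validation by default, so it returns an array of 5 numbers.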
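To make the role of cv more concrete, here is a minimal sketch that reproduces cv=5 by hand. It rests on a few assumptions: the data is a synthetic stand-in from make_classification (not the heart disease dataset), the names X_demo, y_demo and demo_clf are made up for illustration, and the model is given a fixed random_state so that both runs are repeatable. For a classifier with an integer cv, scikit-learn uses stratified folds, which is why StratifiedKFold appears here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A fixed random_state makes every fit repeatable
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42)

# Manual 5-fold loop: fit a fresh copy on 4 folds, score it on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_clf = clone(demo_clf)
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_clf.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The same procedure in a single call
auto_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)

print(manual_scores)
print(auto_scores)
```

Each entry in either array is the accuracy on one held-out fold, so every row of the data is used for testing exactly once.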

In [14]:
# Compare the 2 methods
np.random.seed(42)

# Single training and test split with score()
clf_single_score = clf.score(X_test,y_test)

# Take the mean of the 5-fold cross-validation scores
clf_cross_valid_score = np.mean(cross_val_score(clf,X,y,cv=5,scoring=None))

# Compare the 2
clf_single_score,clf_cross_valid_score

# The single-split score depends on which rows landed in the test set,
# so the cross-validated mean is the preferred estimate

(0.8524590163934426, 0.8248087431693989)

Since we set cv=5 (5-fold cross-validation), we get back 5 different scores instead of 1.

Taking the mean of this array condenses the 5 scores into one number that depends less on any single train/test split than a lone score() result does.
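As a small illustration, the sketch below summarises the five fold scores printed by cross_val_score() earlier in this notebook. The variable names are made up for the example, and reporting the standard deviation next to the mean is an optional extra rather than something the notebook itself does.

```python
import numpy as np

# The five fold scores from the cross_val_score() output shown earlier
fold_scores = np.array([0.78688525, 0.86885246, 0.80327869, 0.78333333, 0.76666667])

mean_score = fold_scores.mean()  # one summary number for the model
std_score = fold_scores.std()    # how much the score varies between folds

print(f"mean accuracy:    {mean_score:.4f}")
print(f"std across folds: {std_score:.4f}")
print(f"min / max:        {fold_scores.min():.4f} / {fold_scores.max():.4f}")
```

The spread between the lowest and highest fold is a reminder of how much a single train/test split can move the score.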

---------------------------------------------------------

The scoring parameter is set to None by default.

If None, the estimator's default scorer is used.

For classifiers, the default scorer is mean accuracy.
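One way to check this is to compare scoring=None against scoring="accuracy" directly. This is a sketch under assumptions: it uses synthetic data from make_classification instead of the heart disease dataset, the names X_demo, y_demo and demo_clf are illustrative, and the classifier has a fixed random_state so both calls fit identical models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42)

# scoring=None falls back to the estimator's own score() method
default_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring=None)

# Asking for accuracy explicitly
accuracy_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")

print(default_scores)
print(accuracy_scores)
print(np.allclose(default_scores, accuracy_scores))
```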

-----------------------------------------------------------------

Different problems call for different evaluation metrics.

The [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) outlines a vast range of evaluation metrics for different problems, but let's have a look at a few.
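As a rough preview, the scoring parameter accepts metric names as strings. The sketch below tries four classification metrics on synthetic data; the dataset, the variable names and the choice of metrics are assumptions made for illustration, not results from the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (an assumption, not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42)

# Mean 5-fold score for a few classification metrics
results = {}
for metric in ["accuracy", "precision", "recall", "f1"]:
    scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring=metric)
    results[metric] = scores.mean()

for metric, value in results.items():
    print(f"{metric}: {value:.3f}")
```

Which metric matters most depends on the problem, for example recall when missing a positive case is costly.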