# Model selection

This notebook compares models. The following comparisons are made:
* Between two different regressors
* Between two different classifiers

In both cases, multiple data partition methods will be tested.

* [1 - Data partitioning](#partition)
* [2 - Comparing models](#compare)

First, let's load the libraries and data used throughout this notebook.

<b>1. Import the needed libraries for data handling (pandas as pd and numpy as np)</b>

In [1]:

import pandas as pd
import numpy as np

<b>2. Read the dataset containing the pm2.5 emission values, located in `Datasets/prsa.csv`, with the parameter `index_col=0`, and save it in a variable named `prsa`. This dataset will be used to test regression models.</b>

In [2]:
prsa = pd.read_csv('Datasets/prsa.csv',index_col=0)

In [3]:
prsa.head()

time,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,NE,NW,SE,cv
2010-01-02 00:00:00,129.0,-16,-4.0,1020.0,1.79,0,0,0,0,1,0
2010-01-02 01:00:00,148.0,-15,-4.0,1020.0,2.68,0,0,0,0,1,0
2010-01-02 02:00:00,159.0,-11,-5.0,1021.0,3.57,0,0,0,0,1,0
2010-01-02 03:00:00,181.0,-7,-5.0,1022.0,5.36,1,0,0,0,1,0
2010-01-02 04:00:00,138.0,-7,-5.0,1022.0,6.25,2,0,0,0,1,0


<b>3. Read the dataset located in `Datasets/diabetes.csv`, which will be used to test classification models, and save it in a variable named `diab`</b>

In [4]:
diab = pd.read_csv('Datasets/diabetes.csv')

In [5]:
diab.head()


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


<b>4. Create an object named `prsa_data` that will contain your independent features and another object named `pm25` that will contain your dependent feature/target from the `prsa` dataset (the _1st column_ in the dataset).</b>

In [6]:
prsa_data = prsa.iloc[:,1:]
pm25 = prsa.iloc[:,0]

In [7]:
prsa_data.head()

time,DEWP,TEMP,PRES,Iws,Is,Ir,NE,NW,SE,cv
2010-01-02 00:00:00,-16,-4.0,1020.0,1.79,0,0,0,0,1,0
2010-01-02 01:00:00,-15,-4.0,1020.0,2.68,0,0,0,0,1,0
2010-01-02 02:00:00,-11,-5.0,1021.0,3.57,0,0,0,0,1,0
2010-01-02 03:00:00,-7,-5.0,1022.0,5.36,1,0,0,0,1,0
2010-01-02 04:00:00,-7,-5.0,1022.0,6.25,2,0,0,0,1,0


<b>5. Create an object named `diab_data` that will contain your independent features and another object named `outcome` that will contain your dependent feature/target from the `diab` dataset (the _last column_ in the dataset).</b>

In [8]:
diab_data = diab.drop('Outcome',axis=1)
outcome = diab['Outcome']

In [9]:
diab_data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


<a class="anchor" id="partition">

## 1. Data Partition
</a>

There will be three partition methods tested:
* Train-test split
* K-fold cross validation
* Repeated K-fold cross validation

<b>1. From `sklearn.model_selection` import `train_test_split`, `KFold` and `RepeatedKFold`</b>

In [10]:

from sklearn.model_selection import train_test_split, KFold, RepeatedKFold

### 1.1. The train-test split

<b>2. Divide the `prsa_data` into `prsa_train` and `prsa_val`, the `pm25` into `pm25_train` and `pm25_val`, and define the following arguments: `test_size = 0.3`, `random_state = 15` and `shuffle = True`</b>

In [11]:
prsa_train, prsa_val, pm25_train, pm25_val = train_test_split(prsa_data, 
                                                    pm25, 
                                                    test_size = 0.3,
                                                    random_state = 15,
                                                    shuffle = True
                                                   )

<b>3. Divide the `diab_data` into `diab_train` and `diab_val`, the `outcome` into `outcome_train` and `outcome_val`, and define the following arguments: `test_size = 0.3`, `random_state = 15`, `shuffle = True` and `stratify = outcome` </b>

In [12]:
diab_train, diab_val, outcome_train, outcome_val = train_test_split(diab_data, 
                                                    outcome, 
                                                    test_size = 0.3,
                                                    random_state = 15,
                                                    shuffle = True,
                                                    stratify = outcome
                                                    )
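As a minimal sketch of what `stratify` does, the example below splits a synthetic, imbalanced target and compares the share of the positive class in each subset. The arrays `X_demo` and `y_demo` are made up for illustration and are not part of the notebook's datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (assumption: 70 zeros and 30 ones)
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=15,
    shuffle=True,
    stratify=y_demo,
)

# With stratify, both subsets keep roughly the same share of the positive class
print("Positive share in train:", y_tr.mean())
print("Positive share in validation:", y_va.mean())
```

Without `stratify`, the two shares could drift apart from one split to another, which matters most for small or imbalanced classification datasets.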



### 1.2. K-Fold Cross-Validation

<b>4. Create a `KFold` instance with 10 splits (`n_splits = 10`) and name it `kf`</b>

In [13]:
kf = KFold(n_splits = 10)
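To make the partition concrete, here is a small sketch on six toy samples (the array `X_demo` is an assumption for illustration). By default `KFold` does not shuffle, so each validation fold is a contiguous block of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(6).reshape(-1, 1)  # six toy samples
kf_demo = KFold(n_splits=3)           # shuffle=False by default

# Collect the validation indices of each fold
folds = [test_idx.tolist() for _, test_idx in kf_demo.split(X_demo)]
print(folds)  # [[0, 1], [2, 3], [4, 5]]
```

For data stored in time order, this means each fold validates on one continuous period, which is worth keeping in mind when reading the fold-by-fold scores.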

### 1.3. Repeated K-Fold

<b>5. Create a `RepeatedKFold` instance with 6 splits (`n_splits=6`), repeated 2 times (`n_repeats=2`), and name it `rkf`</b>

In [14]:
rkf = RepeatedKFold(n_splits = 6, n_repeats = 2)
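A short sketch of how many train/validation iterations this produces: `n_splits * n_repeats`. The `random_state=15` argument is an addition for reproducibility and is not used in the notebook's `rkf`. `RepeatedKFold` reshuffles the rows before each repetition.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.arange(12).reshape(-1, 1)  # twelve toy samples
rkf_demo = RepeatedKFold(n_splits=6, n_repeats=2, random_state=15)

# Count the train/validation pairs yielded by the splitter
n_iterations = sum(1 for _ in rkf_demo.split(X_demo))
print(n_iterations)              # 12
print(rkf_demo.get_n_splits())   # 12
```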

With the splits prepared, it is time to run the models.

<a class="anchor" id="compare">

## 2. Comparing models
</a>

With the partitions ready to be used, it's finally time to train and compare models:

* [2.1 - Regression](#reg)
* [2.2 - Classification](#cla)

But first, some preparation:


<b>1. Import `LinearRegression` and `LogisticRegression` from `sklearn.linear_model`, import `DecisionTreeRegressor` and `DecisionTreeClassifier` from `sklearn.tree`, and import `r2_score` from `sklearn.metrics`</b>

In [15]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import r2_score

<b>2. Create a function called `run_model`, that receives  data separated in independent features (`X`) and a dependent feature (`y`), and a model instance, and then returns the model fitted to the data.</b>

In [16]:
def run_model(X,y,model):
    return model.fit(X,y)

<b>3. Create a function called `eval_model_clf`, that receives data separated in independent features (`X`) and a dependent feature (`y`), and a fitted model instance, and then returns the model's mean accuracy score. This function will be used to evaluate classification models.</b>

In [17]:
def eval_model_clf(X,y, model):
    return model.score(X,y)
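For scikit-learn classifiers, `.score()` returns the mean accuracy, that is, the fraction of correctly predicted labels. The sketch below checks this against `accuracy_score` on toy data (`X_demo` and `y_demo` are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy one-feature dataset (assumption, not from the notebook)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf_demo = LogisticRegression().fit(X_demo, y_demo)

score_value = clf_demo.score(X_demo, y_demo)
manual_value = accuracy_score(y_demo, clf_demo.predict(X_demo))

# Both numbers are the same metric computed in two ways
print(np.isclose(score_value, manual_value))
```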

<b>4. Run the next cell to create a function named `avg_score_reg` that prints the average R² score for the train and the test sets and returns the individual scores of each split. Its parameters are the partition technique you are going to use, your independent variables, your dependent variable, and your model. This function will be used to evaluate regression models.</b>

In [18]:
def avg_score_reg(method,X,y,model):
    score_train = []
    score_test = []

    for train_index, test_index in method.split(X):
        
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        train_model = run_model(X_train, y_train, model)
        pred_train = train_model.predict(X_train)
        pred_test =  train_model.predict(X_test)
        value_train = r2_score(y_train, pred_train)
        value_test = r2_score(y_test, pred_test)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Train:', np.mean(score_train))
    print('Test:', np.mean(score_test))
    
    return score_train, score_test
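For comparison, scikit-learn's `cross_validate` can produce the same kind of per-split train and test scores. This is only a sketch on synthetic data (`X_demo`, `y_demo` and the coefficients are assumptions), not a replacement for the function above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (assumption, stands in for prsa_data / pm25)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

cv_results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5),
    scoring="r2",
    return_train_score=True,  # also report the score on each training fold
)

print("Train:", cv_results["train_score"].mean())
print("Test:", cv_results["test_score"].mean())
```

Writing the loop by hand, as in `avg_score_reg`, keeps every step visible, while `cross_validate` is shorter once the mechanics are understood.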

<b>5. Run the next cell to create a function named `avg_score_clf` that prints the average accuracy for the train and the test sets. Its parameters are the partition technique you are going to use, your independent variables, your dependent variable, and your model. This function will be used to evaluate classification models.</b>

In [19]:
def avg_score_clf(method,X,y,model):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        train_model = run_model(X_train, y_train, model)
        value_train = eval_model_clf(X_train, y_train, train_model)
        value_test = eval_model_clf(X_test,y_test, train_model)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Train:', np.mean(score_train))
    print('Test:', np.mean(score_test))

<a class="anchor" id="reg">

### 2.1. Regression models
</a>

In this step, you are going to fit a Linear Regression model and a Decision Tree Regressor model to your data, and compare their performances. <br>
First is the **Linear Regression:**

<b>1. Create an instance of `LinearRegression` and save it in the variable `lr`</b>

In [20]:
lr = LinearRegression()

<b>2. Fit your regression to the training data created with `train_test_split` (`prsa_train`,`pm25_train`), saving the result in `lr_fit`.</b>

In [21]:
lr_fit = lr.fit(prsa_train,pm25_train)

<b>3. Use the fitted model to get predictions for your validation data. Save your predictions in a variable.</b>

In [22]:
lr_pred = lr_fit.predict(prsa_val)

<b>4. Use the `r2_score()` function to evaluate your regression's performance on both training and validation data.</b>

In [23]:
r2_score(pm25_train, lr_fit.predict(prsa_train))

0.3340959757776575

In [24]:

r2_score(pm25_val, lr_pred)

0.3051191780119855
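As a reminder of what `r2_score` measures, the sketch below computes R² by hand as one minus the ratio of the residual sum of squares to the total sum of squares, using small made-up arrays. It also shows that R² becomes negative when predictions are worse than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions (assumption, for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                  # 0.975
print(r2_score(y_true, y_pred))   # 0.975

# Predictions that are worse than the mean give a negative R²
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)                     # -3.0
```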

<b>5. Check the performance of the LinearRegression you created by calling the function `avg_score_reg`, with `kf` as the partition technique.</b>

In [25]:
avg_score_reg(kf,prsa_data,pm25,lr)

Train: 0.32597118689356297
Test: 0.1712351941129954


([0.31803795373281074,
  0.3271824759210321,
  0.33017110847591746,
  0.34948048585780056,
  0.32969525163956126,
  0.33686111062274204,
  0.3234861984789016,
  0.30741042529638096,
  0.30858159959283005,
  0.3288052593176527],
 [0.3673885814290485,
  0.2717482525123691,
  0.20707708858834228,
  -0.5789615867877489,
  0.16352700090474082,
  0.112711308446384,
  0.32190817850089526,
  0.3169126615470309,
  0.23817874167511444,
  0.2918617143137777])

<b>6. Check the performance of the LinearRegression you created by calling the function `avg_score_reg`, with `rkf` as the partition technique.</b>

In [26]:
avg_score_reg(rkf,prsa_data,pm25,lr)

Train: 0.32507782216792314
Test: 0.3239338552977027


([0.3234570477085563,
  0.32321051015039737,
  0.3250125910802808,
  0.3296833976550102,
  0.32088487085199946,
  0.32819555108565757,
  0.32329971689992365,
  0.3265445544769018,
  0.3246894004341351,
  0.32263058555605206,
  0.3256787383530364,
  0.3276469017631267],
 [0.3324836769362417,
  0.33211619004903103,
  0.32378118059828986,
  0.3019035271554187,
  0.34405696959046317,
  0.30921313338770207,
  0.332741445462311,
  0.31688264450345627,
  0.32488089973954193,
  0.33717499391885264,
  0.32074104312636786,
  0.3112305591047564])

Now, time for the **Decision Tree Regressor:**

<b>7. Create an instance of `DecisionTreeRegressor` and save it in the variable `dtr`</b>

In [27]:
dtr = DecisionTreeRegressor()

<b>8. Fit your decision tree regressor to the training data created with `train_test_split` (`prsa_train`,`pm25_train`), saving the result in `dtr_fit`.</b>

In [28]:
dtr_fit = dtr.fit(prsa_train,pm25_train)

<b>9. Use the fitted model to get predictions for your validation data.</b>

In [29]:
dtr_pred = dtr_fit.predict(prsa_val)

In [30]:
dtr_pred_tr = dtr_fit.predict(prsa_train)

<b>10. Use the `r2_score()` function to evaluate your regressor's performance on both training and validation data.</b>

In [31]:
r2_score(pm25_train,dtr_pred_tr)

0.9943456608279894

In [32]:
r2_score(pm25_val,dtr_pred)

0.18322296827905882

<b>11. Check the performance of the decision tree regressor you created by calling the function `avg_score_reg`, with `kf` as the partition technique.</b>

In [33]:
avg_score_reg(kf,prsa_data,pm25,dtr)

Train: 0.991786136647194
Test: -0.8989103245676631


([0.9907068013151779,
  0.9926418251234539,
  0.9907525125815728,
  0.9912383703574097,
  0.9917476760752207,
  0.9939490738377288,
  0.9926087348084601,
  0.9916808726871117,
  0.9907594802401702,
  0.9917760194456336],
 [-0.04508187510243933,
  -1.8536037893698571,
  -1.9206085646320044,
  -3.288807774878862,
  -0.2819711576050741,
  -1.041930204235113,
  -0.6881216772154439,
  0.19983642996419526,
  -0.23322760753904426,
  0.16441297493701135])

<b>12. Check the performance of the decision tree regressor you created by calling the function `avg_score_reg`, with `rkf` as the partition technique.</b>

In [34]:
avg_score_reg(rkf,prsa_data,pm25,dtr)

Train: 0.99259585354229
Test: 0.20022414231435867


([0.991446974327295,
  0.9919404778352594,
  0.9907370435157689,
  0.9947565430663549,
  0.9908469626435799,
  0.996586703364913,
  0.9908619325745888,
  0.9915281156899135,
  0.993035393624832,
  0.9921117168016278,
  0.9944538143880488,
  0.9928445646752986],
 [0.09846962242554647,
  0.23766164977131332,
  0.10036706404551876,
  0.23864685422204357,
  0.24774066312340937,
  0.21123996950084867,
  0.3095579027654549,
  0.33664844464083477,
  0.15016588006702247,
  0.3102992739535819,
  -0.0014229564589558485,
  0.1633153397156858])

Which model is the best? With which partition technique?




<a class="anchor" id="cla">

### 2.2. Classification models
</a>

In this step, you are going to fit a Logistic Regression model and a Decision Tree Classifier model to your data, and compare their performances.

<b>1. Create an instance of `LogisticRegression` and save it in the variable `logr`</b>

In [35]:
logr = LogisticRegression()

<b>2. Fit your logistic regression to the training data created with `train_test_split` (`diab_train`,`outcome_train`), saving the result in `logr_fit`.</b>

In [36]:
logr_fit = logr.fit(diab_train,outcome_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
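
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both combined is shown below, on synthetic features with very different scales (`X_demo`, `y_demo` and the name `scaled_logr` are assumptions and not part of the notebook).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption, stands in for diab_data)
rng = np.random.default_rng(15)
X_demo = np.column_stack([
    rng.normal(100, 30, 200),   # large-scale feature
    rng.normal(0.5, 0.2, 200),  # small-scale feature
])
y_demo = (X_demo[:, 0] + 100 * X_demo[:, 1] > 150).astype(int)

# Scale the features, then fit a logistic regression with a higher iteration limit
scaled_logr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logr.fit(X_demo, y_demo)

train_accuracy = scaled_logr.score(X_demo, y_demo)
print(train_accuracy)
```

Because the scaler sits inside the pipeline, it is fitted only on the data passed to `.fit()`, so a pipeline like this could be handed to a cross-validation loop without leaking validation data into the scaling step.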


<b>3. Use the `.score()` method to evaluate your regression's performance, both on training data and validation data.</b>

In [37]:
print(logr_fit.score(diab_train,outcome_train))

0.7783985102420856


In [38]:
print(logr_fit.score(diab_val,outcome_val))

0.7662337662337663


<b>4. Check the performance of the logistic regression you created by calling the function `avg_score_clf`, with `kf` as the partition technique.</b>

In [39]:
avg_score_clf(kf,diab_data,outcome,logr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Train: 0.7838532996494985
Test: 0.7721291866028708



<b>5. Check the performance of the logistic regression you created by calling the function `avg_score_clf`, with `rkf` as the partition technique.</b>

In [40]:
avg_score_clf(rkf,diab_data,outcome,logr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Train: 0.7785156249999999
Test: 0.7747395833333334




Now, time for the **Decision Tree Classifier:**

<b>6. Create an instance of `DecisionTreeClassifier` and save it in the variable `dtc`</b>

In [41]:
dtc = DecisionTreeClassifier()

<b>7. Fit your decision tree classifier to the training data created with `train_test_split` (`diab_train`,`outcome_train`), saving the result in `dtc_fit`.</b>

In [42]:
dtc_fit = dtc.fit(diab_train,outcome_train)

<b>8. Use the `.score()` method to evaluate your classifier's performance, both on training data and validation data.</b>

In [43]:
dtc_fit.score(diab_train,outcome_train)

1.0

In [44]:
dtc_fit.score(diab_val,outcome_val)

0.6883116883116883

<b>9. Check the performance of the decision tree classifier you created by calling the function `avg_score_clf`, with `kf` as the partition technique.</b>

In [45]:
avg_score_clf(kf,diab_data,outcome,dtc)

Train: 1.0
Test: 0.7004613807245386


<b>10. Check the performance of the decision tree classifier you created by calling the function `avg_score_clf`, with `rkf` as the partition technique.</b>

In [46]:
avg_score_clf(rkf,diab_data,outcome,dtc)

Train: 1.0
Test: 0.703125


How do the scores compare? Which model is better?

### Important Note:

Please remember that getting better or worse scores with **k-fold cross-validation** than with the **hold-out method** does **not** mean that the model is inherently *"better with k-fold."* These are two different techniques for evaluating the **same model**.

- **K-fold cross-validation** is more robust because it evaluates the model across multiple subsets of the data, but it doesn't change the underlying model.
- **Hold-out validation** evaluates the model on a single split of the data, so the score depends on which observations happen to land in the validation set.

In both cases, the model being evaluated is the same (same algorithm and hyperparameters). What differs is how we estimate its performance. The model's actual performance on unseen data does not depend on the validation method used; only our estimate of it does. These validation techniques help you **choose the best model** and assess its stability, but they do not change the model's performance on new, unknown data.
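
The point can be illustrated with a small sketch on synthetic data (all names and values below are assumptions for illustration): the same `LinearRegression` is scored with several hold-out splits and with one shuffled 10-fold cross-validation. The estimator never changes, yet the reported R² does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data (assumption, not one of the notebook's datasets)
rng = np.random.default_rng(15)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=1.0, size=200)

model = LinearRegression()

# Hold-out: one score per split, which varies with the random seed
holdout_scores = []
for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    holdout_scores.append(model.fit(X_tr, y_tr).score(X_va, y_va))

# K-fold: ten scores from one partition, summarised by mean and spread
kfold_scores = cross_val_score(
    model, X_demo, y_demo, cv=KFold(n_splits=10, shuffle=True, random_state=15)
)

print("Hold-out R² per seed:", np.round(holdout_scores, 3))
print("K-fold R² mean and std:", round(kfold_scores.mean(), 3), round(kfold_scores.std(), 3))
```

The spread across hold-out seeds and across folds gives a feel for how much of a score difference is due to the split rather than to the model.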


### That's all!