<center>
    <h1 id='cross-validation' style='color:#7159c1'>🤖 Cross-Validation 🤖</h1>
    <i>Folding Datasets</i>
</center>

---

`Cross-Validation` splits a dataset into several folds (groups) and rotates which fold serves as the validation dataset, so the model is evaluated on every part of the data instead of on a single train/validation split. If a dataset is split into 5 folds, the model is trained and evaluated five times, each time using a different fold as the validation dataset and the remaining four folds for training.
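To make the fold rotation concrete, here is a minimal sketch on a toy array of 10 rows (an assumption for illustration, not the autos dataset). It prints which row indices land in the train and validation parts of each of the five folds:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows, purely illustrative (not the autos dataset)
X = np.arange(10).reshape(-1, 1)

k_folds = KFold(n_splits=5)

validation_rows = []
for fold, (train_idx, valid_idx) in enumerate(k_folds.split(X), start=1):
    print(f'- Fold {fold}: train={train_idx.tolist()} | validation={valid_idx.tolist()}')
    validation_rows.extend(valid_idx.tolist())

# Across the five folds, every row is used for validation exactly once
print(f'- Rows validated: {sorted(validation_rows)}')
```

Each fold holds out two rows here, and the held-out rows never overlap between folds.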

Cross-Validation can take a considerable amount of time on big datasets, since the model is refit once per fold. So, before using this technique, consider this:

> **For small datasets** - `where extra computational burden isn't a big deal, you should run cross-validation`;

> **For larger datasets** - `a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout`.

Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment yields roughly the same score, a single validation set is probably sufficient.
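One way to check whether the per-fold scores "seem close" is to look at their spread relative to their mean. The sketch below uses synthetic data from `make_regression` (an assumption, chosen only so the example is self-contained), and the `relative_spread` measure is an illustrative choice rather than an established rule:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negative MAE, so the sign is flipped
fold_scores = -1 * cross_val_score(
    model, X, y, cv=folds, scoring='neg_mean_absolute_error'
)

# Spread of the fold scores relative to their mean (illustrative measure)
relative_spread = fold_scores.std() / fold_scores.mean()

print(f'- MAE per Fold: {fold_scores.round(2)}')
print(f'- Mean MAE: {fold_scores.mean():.2f}')
print(f'- Relative Spread (std / mean): {relative_spread:.2%}')
```

A small relative spread suggests that a single validation set would tell roughly the same story. What counts as "small" depends on the problem.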

In [9]:
# ---- Imports ----
import pandas as pd # pip install pandas
from sklearn.ensemble import RandomForestRegressor # pip install scikit-learn
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut, LeavePOut, ShuffleSplit

# ---- Reading Dataset ----
autos_df = pd.read_csv('./datasets/autos.csv')

# ---- Dropping Categorical Features ----
categorical_features = [
    feature for feature in autos_df.columns
    if autos_df[feature].dtype == 'object'
]

autos_df = autos_df.loc[:, ~autos_df.columns.isin(categorical_features)].copy()

autos_df.head()

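| Unnamed: 0 | symboling | num_of_doors | wheel_base | length | width | height | curb_weight | num_of_cylinders | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | 2 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 4 | 130 | 3.47 | 2.68 | 9 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | 2 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 6 | 152 | 2.68 | 3.47 | 9 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 4 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 4 | 109 | 3.19 | 3.4 | 10 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 4 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 5 | 136 | 3.19 | 3.4 | 8 | 115 | 5500 | 18 | 22 | 17450 |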


In [13]:
# ---- Creating the Pipeline ----
pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer())
    , ('model', RandomForestRegressor(n_estimators=50, random_state=20242201))
])

# ---- Folds for Cross-Validation ----
k_folds = KFold(n_splits=5) # common k-fold
sk_folds = StratifiedKFold(n_splits=5) # keeps the class proportions in every fold; meant for classification targets
# with imbalanced classes, not for a continuous target such as price
loo = LeaveOneOut() # leaves a single row in the validation dataset (repeats the process until every row has been in
# the validation dataset once)
lpo = LeavePOut(p=2) # like LeaveOneOut, but with p rows in the validation dataset. Every combination of p rows is
# tried, so the number of splits explodes as p grows (p=50 is impractical on anything but a tiny dataset)
shuffle_split = ShuffleSplit(
    train_size=0.60
    , test_size=0.30
    , n_splits=5
) # sets the fraction of rows used for train and validation in each random split. The remaining 10% is left out of
# that split

# ---- Calculating Cross-Validation ----
scores = -1 * cross_val_score(
    pipeline
    , autos_df.loc[:, 'symboling':'highway_mpg']
    , autos_df.price
    , cv=k_folds # alternatives: loo, lpo, shuffle_split (sk_folds requires a classification target)
    , scoring='neg_mean_absolute_error'
)

print(f'- Average MAE Score Across Experiments: {scores.mean()}')

- Average MAE Score Across Experiments: 2596.1671999550153
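The splitters above differ a lot in how many times the model has to be fit. As a rough sketch, the snippet below asks each splitter how many splits it would produce on a toy array of 20 rows (an assumption for illustration). The 200-row figure at the end is a hypothetical dataset size, used only to show how quickly `LeavePOut` grows:

```python
from math import comb

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, ShuffleSplit

# Toy data: 20 rows, purely illustrative
X = np.zeros((20, 1))

splitters = {
    'KFold(n_splits=5)': KFold(n_splits=5),
    'LeaveOneOut()': LeaveOneOut(),
    'LeavePOut(p=2)': LeavePOut(p=2),
    'ShuffleSplit(n_splits=5)': ShuffleSplit(
        n_splits=5, train_size=0.60, test_size=0.30, random_state=0
    ),
}

# Number of train/validation splits, i.e. number of model fits
n_fits = {name: splitter.get_n_splits(X) for name, splitter in splitters.items()}

for name, fits in n_fits.items():
    print(f'- {name}: {fits} fits')

# LeavePOut tries every combination of p rows: C(n, p) splits
print(f'- LeavePOut(p=2) on 200 rows: {comb(200, 2)} fits')
print(f'- LeavePOut(p=50) on 200 rows: about {comb(200, 50):.1e} fits')
```

`KFold` and `ShuffleSplit` stay at a fixed number of fits, `LeaveOneOut` grows linearly with the number of rows, and `LeavePOut` grows combinatorially, which is why it is usually reserved for very small datasets and small `p`.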


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).