# Understanding `cross_val_score` in Machine Learning

The goal of `cross_val_score` is to evaluate the performance of a machine learning model using **cross-validation**. Specifically, it helps assess how well the model will perform on unseen data by splitting the dataset into multiple folds and training/testing the model on different subsets.

---

## Breakdown of What `cross_val_score` Does

### 1. Evaluating the Model's Performance
- `cross_val_score` helps evaluate the **generalization ability** of the model.
- By training and testing the model on different splits of the data, it provides a more accurate estimate of how well the model might perform on unseen data.
- This is crucial because using a single train-test split can lead to misleading conclusions if the dataset is not representative of the overall population or contains biases.

### 2. Cross-Validation Process
- Cross-validation splits the dataset into multiple parts (**folds**).
- Each fold is used as a test set once, while the remaining data is used for training.
- For example, with `cv=5`, the data is split into **5 parts**, and the model is trained and tested **5 times**: each time it is trained on 4 parts and tested on the remaining one.
- This gives a **more reliable estimate** of the model's performance across different subsets of the data.
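
The fold rotation described above can be made visible with `KFold`, which is the splitter scikit-learn uses for a regressor when `cv` is an integer. This is a minimal sketch on ten toy samples (the array `X` is illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, purely illustrative
X = np.arange(20).reshape(10, 2)

# Unshuffled 5-fold split: each fold holds 2 of the 10 samples
kf = KFold(n_splits=5)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_folds.append(test_idx)
    print(f"fold {fold}: train={train_idx}, test={test_idx}")

# Every sample lands in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
```

Each sample index appears in exactly one test fold, so every observation is used for testing once and for training four times.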

### 3. Score Calculation
- `cross_val_score` computes a **performance score** for each fold.
- In this case, we use `scoring='neg_mean_squared_error'`, which returns the **negative mean squared error (MSE)** for each fold.
- The **negative sign** is used because scikit-learn scorers follow a **higher-is-better** convention. Since a lower MSE is better, the MSE is negated so that a larger (less negative) score still means a better model.
- The final result is an array of scores, giving insight into how well the model performs across different data splits.
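
Turning the negated scores back into ordinary errors only needs a sign flip. The sketch below uses NumPy alone and reuses the five fold scores reported for the `alpha=100` run in this notebook; the variable names are illustrative.

```python
import numpy as np

# Per-fold scores as returned with scoring='neg_mean_squared_error'
# (values copied from the alpha=100 run in this notebook)
neg_mse = np.array([-9.32552967, -4.9449624, -11.39665242,
                    -7.0242106, -8.38562723])

mse_per_fold = -neg_mse                   # flip the sign to recover the MSE
mean_mse = mse_per_fold.mean()            # average MSE across the 5 folds
mean_rmse = np.sqrt(mse_per_fold).mean()  # RMSE is in the units of the target

print(mean_mse, mean_rmse)
```

Negating the scores and calling `abs()` on their mean give the same number here, because every fold score is negative.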

### 4. Hyperparameter Tuning
- Cross-validation helps in **hyperparameter tuning**, such as choosing the best **regularization strength (`alpha`)** for Ridge regression.
- By comparing the cross-validation scores for different `alpha` values, we can select the one that results in the lowest error.
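
One way to run this comparison is a simple loop over candidate values. The sketch below is an illustration under assumptions: the data come from `make_regression` rather than the advertising dataset, and the `alphas` grid is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (an assumption, not the notebook's dataset)
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=0)

alphas = [0.1, 1, 10, 100]  # hypothetical candidate grid
mean_mse = {}
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=5, scoring='neg_mean_squared_error')
    mean_mse[alpha] = -scores.mean()  # flip the sign to get a plain MSE

# Pick the alpha with the lowest cross-validated MSE
best_alpha = min(mean_mse, key=mean_mse.get)
print(mean_mse, best_alpha)
```

The held-out test set is not touched during this loop; it is reserved for the final evaluation of the chosen model.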

---

## Summary
- `cross_val_score` performs cross-validation to evaluate the model's performance across different subsets of data.
- This gives a more trustworthy estimate of how well the model **generalizes** to new data and reduces the risk of conclusions that are **overfit** to a single train-test split.

---

## Understanding `neg_MAE` (Negative Mean Absolute Error)

- **`neg_MAE`** is shorthand here for the negative of the Mean Absolute Error (MAE); the scikit-learn scoring string is `'neg_mean_absolute_error'`.
  - A **higher `neg_MAE`** (closer to zero) means the model's performance is **better** (because the actual MAE is smaller).
  - A **lower `neg_MAE`** (more negative) means the model's performance is **worse** (because the actual MAE is larger).

### Key Takeaway
- **Higher `neg_MAE`** (closer to zero) = Better performance.
- **Lower `neg_MAE`** (more negative) = Worse performance.

### Example
If the `neg_MAE` increases (e.g., from -0.5 to -0.3), the MAE has decreased (from 0.5 to 0.3), which indicates that the model's predictions are more accurate. This **is better**.
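
A small numeric check of how the sign convention orders two sets of predictions. The arrays are made up for illustration: one set is off by 0.3 everywhere, the other by 0.5.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([10.0, 12.0, 14.0, 16.0])
pred_close = y_true + 0.3   # every prediction is off by 0.3
pred_far = y_true + 0.5     # every prediction is off by 0.5

neg_mae_close = -mean_absolute_error(y_true, pred_close)
neg_mae_far = -mean_absolute_error(y_true, pred_far)

# The more accurate predictions have the higher (less negative) negative MAE
print(neg_mae_close, neg_mae_far)
```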

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('J:/Data science/ML/Pyhon for ML/3.1 UNZIP_ME_FOR_NOTEBOOKS_V4/DATA/Advertising.csv')

In [3]:
df.head()

|   | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [4]:
## CREATE X and y
x = df.drop('sales',axis=1)
y = df['sales']

# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)

# SCALE DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [5]:
from sklearn.linear_model import Ridge

In [6]:
model = Ridge(alpha=100)

In [9]:
from sklearn.model_selection import cross_val_score

In [10]:
scores = cross_val_score(model,x_train,y_train,
                         cv=5 , scoring = 'neg_mean_squared_error')

# Cross-Validation Process (Step-by-Step)

For each of the 5 folds:

## 1. Split the Data

The dataset (`x_train`, `y_train`) is divided into:

- **Training set**: 4 folds (80% of the data).
- **Validation set**: 1 fold (20% of the data).

## 2. Train the Model

The Ridge model is trained on the 4 training folds.

- The model learns the coefficients for the features while applying the L2 regularization penalty (`alpha=100`).

## 3. Evaluate the Model

- The trained model is used to predict the target variable (`y`) for the validation fold.
- The predictions are compared to the actual values in the validation fold using the **negative mean squared error** (`neg_mean_squared_error`).
  
  > **Note**: The negative mean squared error is used because scikit-learn scorers follow a higher-is-better convention, and MSE is a loss metric (lower is better). Negating it keeps "higher score = better model" true for every scorer.

## 4. Store the Score

- The computed negative MSE for the fold is stored in the `scores` array.
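
The four steps above can be written out by hand and compared with `cross_val_score`. This is a sketch under assumptions: synthetic data from `make_regression` stands in for `x_train` and `y_train`, and the splitter is an unshuffled `KFold`, which is what scikit-learn uses for a regressor when `cv=5`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for x_train / y_train (an assumption)
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=0)
model = Ridge(alpha=100)

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):       # 1. split the data
    fold_model = clone(model)                                # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])               # 2. train on 4 folds
    preds = fold_model.predict(X[val_idx])                   # 3. predict the held-out fold
    manual_scores.append(-mean_squared_error(y[val_idx], preds))  # 4. store negative MSE
manual_scores = np.array(manual_scores)

# The library call should give the same five numbers
reference = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print(manual_scores, reference)
```

Cloning the estimator for each fold mirrors how the library keeps the folds independent: no fitted state carries over from one fold to the next.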

In [11]:
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

In [12]:
abs(scores.mean())

8.215396464543607

In [13]:
sec_model = Ridge(alpha=1)

In [14]:
scores_sec = cross_val_score(sec_model , x_train , y_train ,
                             cv=5 , scoring='neg_mean_squared_error')

In [15]:
scores_sec

array([-3.15513238, -1.58086982, -5.40455562, -2.21654481, -4.36709384])

In [16]:
abs(scores_sec.mean())

3.344839296530695

In [17]:
sec_model.fit(x_train,y_train)

In [18]:
y_pred = sec_model.predict(x_test)

In [19]:
from sklearn.metrics import mean_squared_error

In [20]:
rmse = np.sqrt(mean_squared_error(y_test , y_pred))

In [21]:
rmse

1.5228334050147283

---

# Cross-Validation with `cross_validate`

The `cross_validate` function differs from `cross_val_score` in two ways:

- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit times and score times (and optionally training scores and fitted estimators) in addition to the test scores.

For single-metric evaluation, where the `scoring` parameter is a string, a callable, or `None`, the keys are:

    ['test_score', 'fit_time', 'score_time']

For multiple-metric evaluation, the return value is a dict with the following keys:

    ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']

`return_train_score` is `False` by default to save computation time. To also evaluate the scores on the training set, set it to `True`.
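
A minimal sketch of the returned keys when several metrics are requested together with training scores. The data are synthetic (`make_regression`), which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data (an assumption)
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=0)

results = cross_validate(
    Ridge(alpha=1), X, y, cv=5,
    scoring=['neg_mean_absolute_error', 'neg_mean_squared_error'],
    return_train_score=True,  # adds one 'train_<scorer_name>' entry per metric
)

# Each value is an array with one entry per fold
print(sorted(results))
```

Comparing the `train_` and `test_` entries for the same metric is a quick way to spot a model that fits the training folds much better than the held-out folds.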

In [22]:
from sklearn.model_selection import cross_validate

In [26]:
scores_one = cross_validate(model , x_train , y_train ,
                            cv=5 , scoring = ['neg_mean_absolute_error','neg_mean_squared_error','max_error'])

In [27]:
scores_one

{'fit_time': array([0.00099754, 0.00099754, 0.00099778, 0.00099707, 0.0009973 ]),
 'score_time': array([0.00199485, 0.00099659, 0.00199533, 0.00099754, 0.00099754]),
 'test_neg_mean_absolute_error': array([-2.31243044, -1.74653361, -2.56211701, -2.01873159, -2.27951906]),
 'test_neg_mean_squared_error': array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
         -8.38562723]),
 'test_max_error': array([ -6.44988486,  -5.58926073, -10.33914027,  -6.61950405,
         -7.75578515])}

In [28]:
pd.DataFrame(scores_one)

|   | fit_time | score_time | test_neg_mean_absolute_error | test_neg_mean_squared_error | test_max_error |
|---|---|---|---|---|---|
| 0 | 0.000998 | 0.001995 | -2.31243 | -9.32553 | -6.449885 |
| 1 | 0.000998 | 0.000997 | -1.746534 | -4.944962 | -5.589261 |
| 2 | 0.000998 | 0.001995 | -2.562117 | -11.396652 | -10.33914 |
| 3 | 0.000997 | 0.000998 | -2.018732 | -7.024211 | -6.619504 |
| 4 | 0.000997 | 0.000998 | -2.279519 | -8.385627 | -7.755785 |


In [29]:
pd.DataFrame(scores_one).mean()

fit_time                        0.000997
score_time                      0.001396
test_neg_mean_absolute_error   -2.183866
test_neg_mean_squared_error    -8.215396
test_max_error                 -7.350715
dtype: float64

In [32]:
scores_two = cross_validate(sec_model , x_train , y_train ,
                            cv=5 , scoring = ['neg_mean_absolute_error','neg_mean_squared_error','max_error'])

In [33]:
scores_two

{'fit_time': array([0.00199533, 0.0009973 , 0.00099778, 0.00099659, 0.00099683]),
 'score_time': array([0.00199437, 0.00199366, 0.00199533, 0.00099754, 0.00099778]),
 'test_neg_mean_absolute_error': array([-1.54711694, -1.02604449, -1.40079299, -1.15425141, -1.47022164]),
 'test_neg_mean_squared_error': array([-3.15513238, -1.58086982, -5.40455562, -2.21654481, -4.36709384]),
 'test_max_error': array([-3.08829958, -2.81744088, -9.35320917, -4.05585583, -6.49092188])}

In [34]:
pd.DataFrame(scores_two)

|   | fit_time | score_time | test_neg_mean_absolute_error | test_neg_mean_squared_error | test_max_error |
|---|---|---|---|---|---|
| 0 | 0.001995 | 0.001994 | -1.547117 | -3.155132 | -3.0883 |
| 1 | 0.000997 | 0.001994 | -1.026044 | -1.58087 | -2.817441 |
| 2 | 0.000998 | 0.001995 | -1.400793 | -5.404556 | -9.353209 |
| 3 | 0.000997 | 0.000998 | -1.154251 | -2.216545 | -4.055856 |
| 4 | 0.000997 | 0.000998 | -1.470222 | -4.367094 | -6.490922 |


In [35]:
pd.DataFrame(scores_two).mean()

fit_time                        0.001197
score_time                      0.001596
test_neg_mean_absolute_error   -1.319685
test_neg_mean_squared_error    -3.344839
test_max_error                 -5.161145
dtype: float64

In [36]:
sec_model.fit(x_train , y_train)

In [37]:
final_pred = sec_model.predict(x_test)

In [38]:
final_pred

array([15.73544249, 19.56177685, 11.47282584, 16.99614361,  9.19583919,
        7.06034338, 20.24078477, 17.27047482,  9.7997058 , 19.18969381,
       12.40827613, 13.88321006, 13.72330625, 21.24960621, 18.41451801,
       10.00739858, 15.54023734,  7.72694272,  7.59886443, 20.3595504 ,
        7.831815  , 18.21607253, 24.61611392, 22.77116018,  8.0117733 ,
       12.667102  , 21.40567156,  8.10250725, 12.43158049, 12.53481984,
       10.81678067, 19.21537816, 10.09192883,  6.76998079, 17.29636618,
        7.81497124,  9.28808588,  8.31202002, 10.6122371 , 10.6533735 ,
       13.05491413,  9.80364168, 10.24764859,  8.09836046, 11.58209801,
       10.10783927,  9.025001  , 16.24936342, 13.26025422, 20.77690029,
       12.51477346, 13.96784546, 17.53696507, 11.15686875, 12.57233878,
        5.56009018, 23.21824128, 12.62301353, 18.72931877, 15.18197827])

In [39]:
rmse = np.sqrt(mean_squared_error(y_test , final_pred))

In [40]:
rmse

1.5228334050147283