<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: Testing model performance
© ExploreAI Academy

In this exercise, we will use the train-test split technique to separate a dataset into training and testing sets and use them to train and test the performance of a model.

## Learning objectives

By the end of this exercise, you should be able to:
* Create separate train and test sets for both the predictor and response variables.
* Train a model on the training data.
* Assess and compare the model's performance on both the training and testing sets. 

## Exercises

In the exercises below, we will use the `crop_yield_dataset`, which consists of observations of `Temperature` (the independent variable) and the corresponding `Crop_Yield` (the dependent variable). Temperature is measured in degrees, and crop yield is measured in units specific to the crop being studied.

### Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [2]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/crop_yield_dataset.csv")
df.head(5)

| Unnamed: 0 | Temperature | Crop_Yield |
|---|---|---|
| 0 | 27.483571 | 58.922301 |
| 1 | 24.308678 | 44.07042 |
| 2 | 28.238443 | 63.490857 |
| 3 | 32.615149 | 58.221043 |
| 4 | 23.829233 | 50.592752 |


### Exercise 1

Separate the dataset we have loaded into features `X` and the response variable `y`.

In [None]:
y = df['']    # Insert the name of the response column
X = df[['']]  # Insert the name of the predictor column

### Exercise 2

Implement a train-test split where 80% of the observations will be used for training while the remaining 20% will be used for testing to create the following sets: `X_train`, `X_test`, `y_train`, and `y_test`.
Also, use a random state of `42`.

In [None]:
# Your solution here...

### Exercise 3

Train a linear regression model on the training set we have created in Exercise 2.

In [None]:
# Your solution here...

### Exercise 4

Assess the model's performance on the training set by calculating the **Mean Squared Error (MSE)** and **R-squared** metrics.

In [None]:
# Your solution here...

### Exercise 5

Assess the model's performance on the testing set by calculating the **Mean Squared Error (MSE)** and **R-squared** metrics.

In [None]:
# Your solution here...

> Compare the model's performance on the training and testing sets. Are there any observed differences?

## Solutions

### Exercise 1

In [None]:
# Split dataset into features and response variable
X = df[['Temperature']]
y = df['Crop_Yield']

`X` is a DataFrame containing only the `Temperature` column, which serves as the predictor variable.

`y` is a Series containing the `Crop_Yield` column, which serves as the response or target variable.

### Exercise 2

In [None]:
# Perform train-test split with 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We use the `train_test_split` function to divide the dataset into training and testing sets for both the features (X) and the response variable (y).

Setting `test_size=0.2` means that 20% of the data will be used for testing, while the remaining 80% will be used for training.

We set `random_state=42` so that the split is reproducible: running the cell again produces the same training and testing sets.
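The effect of these two parameters can be checked on a small stand-in dataset. The sketch below is only an illustration: `X_demo` and `y_demo` are hypothetical arrays of 100 observations, not the crop yield data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 observations of a single feature
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Same arguments as in the solution: 20% for testing, fixed random state
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Training shape:", X_tr.shape)
print("Testing shape:", X_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Same test set:", np.array_equal(X_te, X_te_again))
```

With 100 observations, the training set holds 80 rows and the testing set holds 20, and the repeated call returns an identical test set.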

### Exercise 3

In [None]:
# Create an instance of the LinearRegression class
lm = LinearRegression()

# Train the linear regression model
lm.fit(X_train, y_train)

We create a `LinearRegression` object and fit it to the training data only: the predictor values `X_train` and the corresponding response values `y_train`.
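To see what fitting produces, we can inspect the learned slope and intercept through the `coef_` and `intercept_` attributes. The sketch below uses hypothetical data that lies exactly on the line y = 2x + 1, so the fitted values are easy to check; `X_demo`, `y_demo`, and `demo_model` are illustrative names and not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

# Fit a linear regression model to the stand-in data
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# coef_ holds one slope per feature; intercept_ is the constant term
slope = demo_model.coef_[0]
intercept = demo_model.intercept_
print("Slope:", round(slope, 4))
print("Intercept:", round(intercept, 4))
```

The same attributes are available on `lm` after it has been fitted to `X_train` and `y_train`.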

### Exercise 4

In [None]:
# Generate predictions on the training set
y_train_pred = lm.predict(X_train)

# Calculate the Mean Squared Error (MSE)
train_mse = metrics.mean_squared_error(y_train, y_train_pred)

# Calculate the R-squared
train_r2 = metrics.r2_score(y_train, y_train_pred)

# Print the training MSE and R-squared score
print("Training MSE:", train_mse)
print("Training R-squared:", train_r2)

We first ask the trained linear regression model to generate predictions on the training set of predictors `X_train`.

We then pass the actual target values `y_train` and the predicted target values `y_train_pred` to the `metrics.mean_squared_error()` and the `metrics.r2_score()` functions to calculate the respective metrics.
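It may help to see what these two functions calculate. MSE is the average of the squared differences between actual and predicted values, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below works through both by hand on four hypothetical values (`y_true` and `y_pred` are made up for illustration).

```python
# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)

# Residual sum of squares: squared gaps between actual and predicted values
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Total sum of squares: squared gaps between actual values and their mean
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

# MSE is the average squared residual
mse = ss_res / n

# R-squared is the share of the variation in y_true explained by the predictions
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R-squared:", round(r2, 3))
```

Here the squared residuals sum to 1.5, so the MSE is 0.375, and the total sum of squares is 20, so R-squared is 0.925.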

### Exercise 5

In [None]:
# Generate predictions on the testing set
y_test_pred = lm.predict(X_test)

# Calculate the Mean Squared Error (MSE)
test_mse = metrics.mean_squared_error(y_test, y_test_pred)

# Calculate the R-squared
test_r2 = metrics.r2_score(y_test, y_test_pred)

# Print the testing MSE and R-squared score
print("Testing MSE:", test_mse)
print("Testing R-squared:", test_r2)

We first ask the trained linear regression model to generate predictions on the testing set of predictors `X_test`.

We then pass the actual target values `y_test` and the predicted target values `y_test_pred` to the `metrics.mean_squared_error()` and the `metrics.r2_score()` functions to calculate the respective metrics.

Once we have completed the exercise, it is worth experimenting with different random states (for example, `random_state=50`). How do the R-squared and MSE metrics change between the training and testing sets? Does the gap get smaller or larger? What does this suggest? Could the relatively small number of observations be affecting these metrics?
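One way to run this experiment is to loop over several random states and record the training and testing R-squared for each. The sketch below is a minimal version that assumes hypothetical noisy linear data (`X_demo`, `y_demo`) in place of the crop yield dataset; the chosen states, sample size, and noise level are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical noisy linear data standing in for the crop yield dataset
rng = np.random.default_rng(0)
X_demo = rng.uniform(15, 35, size=(50, 1))
y_demo = 2.0 * X_demo.ravel() + rng.normal(0, 5, size=50)

results = []
for state in [0, 42, 50]:
    # Re-split the data with a different random state each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=state
    )

    # Fit on the training portion only
    model = LinearRegression().fit(X_tr, y_tr)

    # Score the same model on both portions
    r2_tr = r2_score(y_tr, model.predict(X_tr))
    r2_te = r2_score(y_te, model.predict(X_te))
    results.append((state, r2_tr, r2_te))
    print(f"random_state={state}: train R2={r2_tr:.3f}, test R2={r2_te:.3f}")
```

To repeat the experiment on the exercise data, replace `X_demo` and `y_demo` with `X` and `y`. With only a handful of test observations, the testing score can move noticeably from one split to the next.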

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>