
**Objectives:**
* Create separate train and test sets for both the predictor and response variables.
* Train a model on the training data.
* Assess and compare the model's performance on both the training and testing sets.

**Preview:**
We will be using the crop_yield_dataset, which consists of observations of Temperature (the independent variable) and the corresponding Crop_Yield (the dependent variable). Temperature is measured in degrees, and crop yield is measured in units specific to the crop being studied.

**test_size**: A parameter of train_test_split. When given as a float, it is a value between 0 and 1: the proportion of our dataset that we want to be used as test data. A typical choice is 0.2 (20%).

**random_state**: A parameter of train_test_split that seeds the random selection of rows. The value itself is arbitrary, but using the same value each time guarantees that the same rows are picked for the test set whenever the split is carried out. The rows are still picked at random, yet the picks are repeatable, which makes it easier to compare model performance across iterations.
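To make the effect of these two parameters concrete, here is a minimal sketch. It uses a hypothetical ten-row toy array (X_demo, y_demo) rather than the crop yield dataset, so the names and values are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature (not the crop yield dataset)
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 2.0

# Same random_state twice: the same rows end up in the test set
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A different random_state will generally select different rows
_, X_test_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

print("Test rows (random_state=42, first call): ", X_test_a.ravel())
print("Test rows (random_state=42, second call):", X_test_b.ravel())
print("Test rows (random_state=7):              ", X_test_c.ravel())
print("Test set size:", len(X_test_a), "of", len(X_demo))
```

With test_size=0.2 and ten rows, two rows are held out for testing. The two calls that share a random_state return identical test rows.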

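**Mean squared error** is higher on the test set than on the training set (about 37.76 versus 22.15 in the outputs below), indicating less accurate predictions on data the model has not seen.

**R-squared** is lower on the test set than on the training set (about 0.72 versus 0.80), indicating that the model explains less of the variation in the test data.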

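These results point to a concept in machine learning model fitting known as **overfitting**. This is a phenomenon where there is:

* A discrepancy between the performance of the model on the train and test sets.
* An inability of the model to generalise to data it has not seen before.

The term comes from the fact that the model **fits the training data too well, or overfits** it, picking up patterns specific to those rows, and consequently fits the testing data less well. This should not be confused with **underfitting**, where a model fits even the training data poorly. With a small dataset, part of the gap can also come from which rows happen to land in the test set.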
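The two metrics discussed above can also be computed by hand, which may help clarify what they measure. The sketch below uses only NumPy. The helper names mse and r_squared and the four-value arrays are hypothetical and chosen for illustration. The formulas are the standard definitions, which is what metrics.mean_squared_error() and metrics.r2_score() compute with their default settings.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation around the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

print("MSE:", mse(y_true_demo, y_pred_demo))
print("R-squared:", r_squared(y_true_demo, y_pred_demo))
```

A lower MSE means the predictions sit closer to the actual values, while an R-squared closer to 1 means the model accounts for more of the variation in the response.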


**Import libraries and dataset**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/crop_yield_dataset.csv")
df.head(5)

Unnamed: 0,Temperature,Crop_Yield
0,27.483571,58.922301
1,24.308678,44.07042
2,28.238443,63.490857
3,32.615149,58.221043
4,23.829233,50.592752


Separate the dataset we have loaded into the features X and the response variable y.

X is a DataFrame containing only the Temperature column, which serves as the predictor variable.

y is a Series containing the Crop_Yield column, which serves as the response or target variable.

In [None]:
# Split dataset into features and response variable
X = df[['Temperature']]
y = df['Crop_Yield']
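The difference between the double and single brackets used above is easy to miss, so here is a small sketch of it. The three-row frame named demo is hypothetical and only mimics the column names of the real dataset.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the column names used above
demo = pd.DataFrame({
    "Temperature": [20.0, 25.0, 30.0],
    "Crop_Yield": [40.0, 50.0, 60.0],
})

X_demo = demo[["Temperature"]]  # list of column names -> DataFrame (2-D)
y_demo = demo["Crop_Yield"]     # single column name -> Series (1-D)

print(type(X_demo).__name__, X_demo.shape)
print(type(y_demo).__name__, y_demo.shape)
```

scikit-learn estimators expect the features as a two-dimensional structure, which is why X is selected with a list of column names even though it holds a single column.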

Perform train-test split with 80-20 ratio

We use the train_test_split function to divide the dataset into training and testing sets for both the features (X) and the response variable (y).

The 0.2 test_size indicates that 20% of the data will be used for testing, while the remaining 80% will be used for training.

We use a random_state of 42 to ensure reproducibility of the split.

In [None]:
# Perform train-test split with 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create an instance of the LinearRegression class

Train the linear regression model

We create a LinearRegression object and fit it to the training data only: the independent variable X_train and the dependent variable y_train.

In [None]:
# Create an instance of the LinearRegression class
lm = LinearRegression()

# Train the linear regression model
lm.fit(X_train, y_train)
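Fitting a linear regression amounts to estimating a slope and an intercept, which scikit-learn exposes as the coef_ and intercept_ attributes of the fitted model. The sketch below illustrates this on hypothetical data (X_demo, y_demo, lm_demo) that lies exactly on the line y = 2x + 1, so it is not a result from the crop yield dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1
X_demo = np.array([[10.0], [20.0], [30.0], [40.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

# Create and fit the model, as done above with X_train and y_train
lm_demo = LinearRegression()
lm_demo.fit(X_demo, y_demo)

# The fitted line is: prediction = intercept_ + coef_[0] * feature
print("Slope:", lm_demo.coef_[0])
print("Intercept:", lm_demo.intercept_)
```

On the real data, the same attributes on lm would give the estimated change in Crop_Yield per degree of Temperature and the predicted yield at a temperature of zero.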

Assess the model's performance on the training set by calculating the Mean Squared Error (MSE) and R-squared metrics.

We first ask the trained linear regression model to generate predictions on the training set of predictors X_train.

We then pass the actual target values y_train and the predicted target values y_train_pred to the metrics.mean_squared_error() and the metrics.r2_score() functions to calculate the respective metrics.

In [None]:
# Generate predictions on the training set
y_train_pred = lm.predict(X_train)

# Calculate the Mean Squared Error (MSE)
train_mse = metrics.mean_squared_error(y_train, y_train_pred)

# Calculate the R-squared
train_r2 = metrics.r2_score(y_train, y_train_pred)

# Print the training MSE and R-squared score
print("Training MSE:", train_mse)
print("Training R-squared:", train_r2)

Training MSE: 22.152323850480098
Training R-squared: 0.8025918031520605


Assess the model's performance on the testing set by calculating the Mean Squared Error (MSE) and R-squared metrics.

We first ask the trained linear regression model to generate predictions on the testing set of predictors X_test.

We then pass the actual target values y_test and the predicted target values y_test_pred to the metrics.mean_squared_error() and the metrics.r2_score() functions to calculate the respective metrics.

It is worth experimenting with different random states once we have completed the exercise (try random_state = 50). How do the R-squared and MSE metrics change between the test and training sets? Does the gap get smaller or larger? What does this suggest? Could the relatively small number of observations be affecting these metrics?

In [None]:
# Generate predictions on the testing set
y_test_pred = lm.predict(X_test)

# Calculate the Mean Squared Error (MSE)
test_mse = metrics.mean_squared_error(y_test, y_test_pred)

# Calculate the R-squared
test_r2 = metrics.r2_score(y_test, y_test_pred)

# Print the testing MSE and R-squared score
print("Testing MSE:", test_mse)
print("Testing R-squared:", test_r2)

Testing MSE: 37.75854546183867
Testing R-squared: 0.7167858892114612
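The experiment with different random states suggested earlier can be sketched as a loop. To keep the example self-contained, it generates hypothetical synthetic data (temperature, crop_yield) instead of downloading the crop yield dataset, so the numbers it prints are not comparable with the results above. The list of seeds is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a noisy linear relationship, 100 rows
rng = np.random.default_rng(0)
temperature = rng.normal(25.0, 5.0, size=100).reshape(-1, 1)
crop_yield = 2.0 * temperature.ravel() + rng.normal(0.0, 5.0, size=100)

gaps = []
for seed in [0, 1, 42, 50, 99]:
    # Re-split and re-fit for each random_state
    X_tr, X_te, y_tr, y_te = train_test_split(
        temperature, crop_yield, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)

    # Compare the error on the rows used for fitting with the error on held-out rows
    mse_tr = mean_squared_error(y_tr, model.predict(X_tr))
    mse_te = mean_squared_error(y_te, model.predict(X_te))
    gaps.append(mse_te - mse_tr)
    print(f"random_state={seed:>2}  train MSE={mse_tr:.2f}  test MSE={mse_te:.2f}  gap={mse_te - mse_tr:+.2f}")
```

If the gap varies noticeably from one seed to the next, or even changes sign, that suggests the difference between train and test metrics depends partly on which rows land in the test set, not only on the model itself.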
