# Model Evaluation and Validation

# Why Split Data into Training and Test Datasets?

## Introduction
When developing machine learning models, one of the key steps is to split the available data into training and test datasets. This practice is crucial for several reasons, which will be outlined below.

## Reasons for Splitting Data

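### 1. **Model Evaluation**
A held-out test set gives an estimate of how the model performs on data it did not see during fitting, which is the performance that matters once the model is in use.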

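### 2. **Overfitting Prevention**
Comparing training error with test error exposes overfitting: a model that has memorized noise in the training data scores well there but noticeably worse on the test set.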

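### 3. **Model Tuning**
Hyperparameters should be chosen on data that is separate from the final test set, usually a validation split taken from the training data, so that the test score remains an honest estimate.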

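### 4. **Cross-validation**
Repeating the split over several folds and averaging the scores makes the estimate less dependent on one particular partition of the data.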
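
With ordered data such as the CPI series used later in this notebook, a shuffled k-fold split would let a model train on observations that come after the ones it is scored on. One common alternative is an expanding-window split, sketched here with scikit-learn's `TimeSeriesSplit`. The toy array `X_toy` and the choice of `n_splits=3` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy ordered data: 12 time steps, one feature (illustrative only)
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    # In each fold, every training index precedes every test index
    print(f"fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each fold extends the training window forward in time and tests on the block that immediately follows it.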

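### 5. **Fair Comparison**
Scoring every candidate model on the same held-out data puts the candidates on equal footing, so differences in the metric reflect the models rather than the data they were scored on.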
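
A minimal sketch of a like-for-like comparison, assuming a synthetic random walk and two deliberately simple forecasters (neither is part of the original notebook). Both forecasters use only the training segment and are scored with the same metric on the same held-out segment. The names `toy_series`, `toy_train`, `toy_test`, and `toy_rmse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_series = np.cumsum(rng.normal(size=200))  # synthetic random walk

split = len(toy_series) // 2
toy_train, toy_test = toy_series[:split], toy_series[split:]

# Two simple candidates, both built from the training segment only
mean_forecast = np.full(len(toy_test), toy_train.mean())
last_value_forecast = np.full(len(toy_test), toy_train[-1])

def toy_rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Same test data and same metric for both candidates
scores = {
    "mean": toy_rmse(toy_test, mean_forecast),
    "last value": toy_rmse(toy_test, last_value_forecast),
}
print(scores)
```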

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasets/cpi-us/main/data/cpiai.csv")

print(df)
# Parse the 'Date' column as datetimes and use it as the index
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(inplace=True)
df.set_index('Date', inplace=True)
df.sort_index(inplace=True)

df = df.loc["1990-01-01":"2014-01-01"]

# Select the 'Inflation' column and take monthly means
# ('M' is the month-end alias; pandas 2.2 and later prefer 'ME')
cpi_monthly = df['Inflation'].resample('M').mean()

# Display the first few rows
print(cpi_monthly.head())

# First-difference the series (month-over-month change)
cpi_monthly = cpi_monthly.diff().dropna()

# Split chronologically at the midpoint: first half for training, second half for testing
split_point = round(len(cpi_monthly) / 2)
train = cpi_monthly.iloc[:split_point]
test = cpi_monthly.iloc[split_point:]

# Display the shapes of the train and test sets
print("Train set shape:", train.shape)
print("Test set shape:", test.shape)
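
The cell above cuts the series at its midpoint. The same chronological idea can be wrapped in a small helper with an adjustable training fraction, sketched here. The function name `chronological_split`, the default of 0.8, and the toy daily series are assumptions made for illustration.

```python
import pandas as pd

def chronological_split(series, train_fraction=0.8):
    """Split a time-indexed Series into an earlier (train) and a later (test) part."""
    ordered = series.sort_index()
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy daily series (illustrative data, not the CPI series)
toy = pd.Series(range(10), index=pd.date_range("2020-01-01", periods=10, freq="D"))
toy_train, toy_test = chronological_split(toy, train_fraction=0.8)

# The last training timestamp must precede the first test timestamp
assert toy_train.index.max() < toy_test.index.min()
print(len(toy_train), len(toy_test))
```

No shuffling takes place, so the test set always lies strictly after the training set in time.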


In [None]:
train = train.sort_index()
train.head()

In [None]:
train.plot()

In [None]:
len(test)

In [None]:
import statsmodels.api as sm

# Drop any missing values before fitting
train = train.dropna()

# Fit an ARMA(2, 1) model, written as ARIMA with order=(p, d, q) = (2, 0, 1)
ar_model = sm.tsa.ARIMA(train, order=(2, 0, 1))
ar_result = ar_model.fit()

print(ar_result.summary())

In [None]:
train.tail()

In [None]:
ar_result.forecast(3)

In [None]:
test.head()

In [None]:
from statsmodels.tsa.arima.model import ARIMA
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Drop missing values from the training data
train = train.dropna()

# Fit an ARIMA(2, 0, 1) model; with d = 0 this is an ARMA(2, 1)
arima_model = ARIMA(train, order=(2, 0, 1))
arima_result = arima_model.fit()

# Predictions
# Month-end frequency, matching the resample('M') step above
freq = 'M'

# Calculate the start and end points for predictions
start = train.index[-1] + pd.tseries.frequencies.to_offset(freq)
end = start + (len(test) - 1) * pd.tseries.frequencies.to_offset(freq)

predictions = arima_result.predict(start=start, end=end)

# Drop missing values from the test data, as was done for the training data
test_prepared = test.dropna()

# Compute MSE and RMSE
mse = mean_squared_error(test_prepared, predictions)
rmse = np.sqrt(mse)
# MAPE is left disabled: it divides by the actual values, which can be zero or close to zero in a differenced series
#mape = np.mean(np.abs((test_prepared - predictions) / test_prepared)) * 100

print("MSE:", mse)
print("RMSE:", rmse)
#print("MAPE:", mape)
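
Metric functions that take two arrays compare them position by position, so a forecast whose dates are shifted relative to the actuals is scored without any warning. A hedged sketch of one safeguard is to pair the two series on their shared timestamps before scoring. The toy values and the names `actual`, `predicted`, `paired`, and `aligned_mse` are hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=5, freq="D")
actual = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx, name="actual")
# Hypothetical forecast whose dates start one day later than the actuals
predicted = pd.Series([2.5, 2.5, 4.5, 4.5, 6.0], index=idx.shift(1), name="predicted")

# Keep only the timestamps present in both series
paired = pd.concat([actual, predicted], axis=1, join="inner")

aligned_mse = float(np.mean((paired["actual"] - paired["predicted"]) ** 2))
print(len(paired), aligned_mse)
```

Only the four overlapping days are scored; the unmatched first actual and last forecast are dropped.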

In [None]:
print(pd.concat([test_prepared, arima_result.forecast(len(test_prepared))], axis = 1))

# Understanding Error Metrics: MSE, RMSE, and MAPE

## Mean Squared Error (MSE)
- **Formula:** 
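  $ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 $
- **Interpretation:** $Y_i$ is the actual value, $\hat{Y}_i$ the predicted value, and $n$ the number of observations. Squaring penalizes large errors more heavily than small ones, and the result is in squared units of the target. Lower values indicate a better fit.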
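
A small worked example of the MSE formula with NumPy. The four actual and predicted values are made up for illustration.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])  # illustrative actual values
y_pred = np.array([110.0, 190.0, 310.0, 380.0])  # illustrative predictions

squared_errors = (y_true - y_pred) ** 2  # [100, 100, 100, 400]
mse_value = squared_errors.mean()        # (100 + 100 + 100 + 400) / 4
print(mse_value)  # 175.0
```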
  
## Root Mean Squared Error (RMSE)
- **Formula:** 
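  $ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2} $
- **Interpretation:** RMSE is the square root of MSE, which brings the error back to the same units as the target and makes it easier to read alongside the data. Lower values indicate a better fit.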
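
Because of the square root, RMSE scales linearly with the units of the data while MSE scales quadratically. The sketch below checks this by expressing the same made-up values in metres and in centimetres; the helper `mse_and_rmse` is a hypothetical name.

```python
import numpy as np

def mse_and_rmse(actual, predicted):
    """Return (MSE, RMSE) for two equal-length sequences."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    mse_val = float(np.mean(errors ** 2))
    return mse_val, float(np.sqrt(mse_val))

actual_m = [1.0, 2.0, 3.0]      # illustrative values in metres
predicted_m = [1.5, 2.0, 2.0]

mse_m, rmse_m = mse_and_rmse(actual_m, predicted_m)
# The same data expressed in centimetres (multiplied by 100)
mse_cm, rmse_cm = mse_and_rmse([v * 100 for v in actual_m],
                               [v * 100 for v in predicted_m])

print(rmse_cm / rmse_m)  # RMSE grows by the unit factor
print(mse_cm / mse_m)    # MSE grows by the unit factor squared
```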

## Mean Absolute Percent Error (MAPE)
- **Formula:** 
  $ \text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{Y_i - \hat{Y}_i}{Y_i} \right| $
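- **Interpretation:** MAPE expresses the average error as a percentage of the actual values, so lower values indicate a better fit. Because it is scale-independent, it is useful for comparing accuracy across datasets measured in different units. It is undefined when any $Y_i = 0$ and becomes unstable when actual values are close to zero.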
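
A minimal sketch of a MAPE calculation that tolerates zeros in the actual values. Skipping those observations is only one possible convention, chosen here as an assumption; the helper name `mape_ignoring_zeros` and the sample values are hypothetical.

```python
import numpy as np

def mape_ignoring_zeros(actual, predicted):
    """MAPE in percent over the observations whose actual value is non-zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    nonzero = actual != 0
    if not nonzero.any():
        return float("nan")  # undefined when every actual value is zero
    ratios = np.abs((actual[nonzero] - predicted[nonzero]) / actual[nonzero])
    return float(np.mean(ratios) * 100)

mape_plain = mape_ignoring_zeros([100, 200, 300, 400], [110, 190, 310, 380])
mape_with_zero = mape_ignoring_zeros([0, 200], [5, 190])  # the zero actual is skipped

print(mape_plain)
print(mape_with_zero)
```

Dropping observations changes what the metric measures, so the number of skipped points is worth reporting alongside the result.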