# EXERCISE 5

The file `ESE3_es5_dataset.csv` contains the values (in degrees Celsius) of the global temperature index measured from 1900 to 1997.

## Point 1
Identify the model for these data.

> ### Solution
> First, let's show the time series plot

In [None]:
#Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import statsmodels.api as sm

#Import the dataset
data = pd.read_csv('ESE3_es5_dataset.csv')

# Inspect the dataset
print(data.head())

In [None]:
# Now add a 'year' column to the dataset with values from 1900 to 1997
data['year'] = np.arange(1900, 1998)
print(data.head())

In [None]:
#Time series plot
plt.plot(data['year'], data['Ex5'], 'o-')
plt.title('Time series plot')
plt.xlabel('year')
plt.ylabel('temp')
plt.grid()
plt.show()


> There seems to be a trend in the process. Besides, autocorrelation may be present. Let's perform the runs test to check whether the data are random.

In [None]:
# Import the necessary libraries for the runs test
from statsmodels.sandbox.stats.runs import runstest_1samp

_, pval_runs = runstest_1samp(data['Ex5'], correction=False)
print('Runs test p-value = {:.3f}'.format(pval_runs))

if pval_runs<0.05:
    print('The null hypothesis is rejected: the process is not random')
else:
    print('The null hypothesis cannot be rejected: there is no evidence that the process is not random')
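
> To see what the runs test measures, here is a minimal hand-written sketch. It assumes the test is taken about the sample mean, uses the normal approximation, and applies no continuity correction. The function name `runs_test` and the toy series are hypothetical and are not part of `statsmodels` or of the dataset.

```python
import math

def runs_test(values):
    """Two-sided runs test about the mean (normal approximation, no continuity correction)."""
    mean = sum(values) / len(values)
    # Classify each observation as at/above (True) or below (False) the mean
    above = [v >= mean for v in values]
    n_pos = sum(above)
    n_neg = len(above) - n_pos
    n = n_pos + n_neg
    # A new run starts whenever the classification changes
    n_runs = 1 + sum(a != b for a, b in zip(above, above[1:]))
    # Expected number of runs and its variance if the sequence is random
    expected = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n**2 * (n - 1))
    z = (n_runs - expected) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return n_runs, z, p_value

# Hypothetical toy series with a steady upward trend (not the exercise data)
toy = [0.1 * t for t in range(20)]
n_runs, z, p_value = runs_test(toy)
print(n_runs, round(z, 2), p_value < 0.05)
```

> A trending series produces far fewer runs than expected, so the statistic is strongly negative and randomness is rejected.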


> Let's show ACF and PACF

In [None]:
#ACF and PACF
# Plot the acf and pacf using the statsmodels library
import statsmodels.graphics.tsaplots as sgt

fig, ax = plt.subplots(2, 1)
sgt.plot_acf(data['Ex5'], lags = int(len(data)/3), zero=False, ax=ax[0])
fig.subplots_adjust(hspace=0.5)
sgt.plot_pacf(data['Ex5'], lags = int(len(data)/3), zero=False, ax=ax[1], method = 'ywm')
plt.show()

> The process looks highly autocorrelated too. A trend and an AR(1) term are possible regressors. Let's perform stepwise regression.
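>
> In symbols, the full candidate model with both regressors can be written as follows. This assumes a linear trend in the year $t$ and a first-order autoregressive term; the symbols $\beta_i$ and $\varepsilon_t$ are notation introduced here for illustration.
> $$X_t = \beta_0 + \beta_1 t + \beta_2 X_{t-1} + \varepsilon_t$$
> Stepwise selection keeps only the terms whose p-values pass the entry and removal thresholds.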

<center>

![Diapositiva38.png](attachment:Diapositiva38.png)

![Diapositiva39.png](attachment:Diapositiva39.png)

![Diapositiva40.png](attachment:Diapositiva40.png)

![Diapositiva41.png](attachment:Diapositiva41.png)

![Diapositiva42.png](attachment:Diapositiva42.png)

![Diapositiva43.png](attachment:Diapositiva43.png)

![Diapositiva44.png](attachment:Diapositiva44.png)

![Diapositiva45.png](attachment:Diapositiva45.png)

</center>

> Prepare the variables for the stepwise regression.

In [None]:
# Add a column to the dataset with the lagged values
data['Ex5_lag1'] = data['Ex5'].shift(1)

# and split the dataset into regressors and target
X = data.iloc[1:, 1:3]
y = data.iloc[1:, 0]


> Use the `StepwiseRegression` class built into the `qda` library to perform the stepwise regression.

In [None]:
# Create a StepwiseRegression object using the qda library
import qda
stepwise = qda.StepwiseRegression(add_constant = True, direction = 'both', alpha_to_enter = 0.15, alpha_to_remove = 0.15)

# Fit the model
model = stepwise.fit(y, X)

> Print out the summary of the stepwise regression.

In [None]:
results = model.model_fit
qda.summary(results)

> Finally, let's check assumptions on the residuals.

In [None]:
#Check on residuals
residuals = results.resid
fits = results.fittedvalues

# Perform the Shapiro-Wilk test
_, pval_SW = stats.shapiro(residuals)
print('Shapiro-Wilk test p-value = %.3f' % pval_SW)

# Plot the residuals
fig, axs = plt.subplots(2, 2)
fig.suptitle('Residual Plots')
stats.probplot(residuals, dist="norm", plot=axs[0,0])
axs[0,0].set_title('Normal probability plot')
axs[0,1].scatter(fits, residuals)
axs[0,1].set_title('Versus Fits')
fig.subplots_adjust(hspace=0.5)
axs[1,0].hist(residuals)
axs[1,0].set_title('Histogram')
axs[1,1].plot(np.arange(1, len(residuals)+1), residuals, 'o-')
plt.show()

In [None]:
#RANDOMNESS OF RESIDUALS
_, pval_runs_res = runstest_1samp(residuals, correction=False)
print('Runs test p-value on the residuals = {:.3f}'.format(pval_runs_res))
fig, ax = plt.subplots(2, 1)
sgt.plot_acf(residuals, lags = int(len(data)/3), zero=False, ax=ax[0])
fig.subplots_adjust(hspace=0.5)
sgt.plot_pacf(residuals, lags = int(len(data)/3), zero=False, ax=ax[1], 
            method = 'ywm')
plt.show()

> The residuals are normal and random, but something looks off at lag 4.
> 
> Perform the Bartlett test at lag 4. Remember, 1 row was not used in the model, so $n = 98-1 = 97$. The autocorrelation at lag 4 is significant if:
> $$| r_4 | > \frac{z_{\alpha/2}}{\sqrt{n}} = \frac{1.96}{\sqrt{97}} = 0.199$$

In [None]:
# get the value of the autocorrelation function at lag 4
acf4 = sgt.acf(residuals, nlags=4, fft=False)[4]
print('The value of the autocorrelation function at lag 4 is {:.3f}'.format(acf4))
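
> The comparison behind the Bartlett test can be sketched without any library. The helpers `sample_acf` and `bartlett_limit` and the toy series are hypothetical names introduced for illustration; the sketch assumes the usual sample autocorrelation estimator and $z_{\alpha/2} = 1.96$.

```python
import math

def sample_acf(values, lag):
    """Sample autocorrelation r_k of a sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    # Total sum of squared deviations from the mean
    denom = sum((v - mean) ** 2 for v in values)
    # Sum of cross-products between the series and its lagged copy
    numer = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return numer / denom

def bartlett_limit(n, z=1.96):
    """Approximate two-sided limit z / sqrt(n) for one autocorrelation coefficient."""
    return z / math.sqrt(n)

# Limits for the sample sizes used in this exercise
print(round(bartlett_limit(97), 3), round(bartlett_limit(94), 3))

# Hypothetical toy series with period 4 (not the exercise residuals)
toy = [1.0, 0.0, -1.0, 0.0] * 6
r4 = sample_acf(toy, 4)
print(abs(r4) > bartlett_limit(len(toy)))
```

> A coefficient whose absolute value exceeds the limit is judged significantly different from zero at the 5% level.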

> Try adding a lag 4 term to the model and perform model selection again. 

In [None]:
# Add a lag4 term to the dataframe
data['Ex5_lag4'] = data['Ex5'].shift(4)

In [None]:
# Select the features and target
X = data.iloc[4:, 1:]
y = data.iloc[4:, 0]

# Fit the model
stepwise_2 = qda.StepwiseRegression(add_constant = True, direction = 'both', alpha_to_enter = 0.15, alpha_to_remove = 0.15)
model_2 = stepwise_2.fit(y,X)

> Print out the summary of the stepwise regression.

In [None]:
results_2 = model_2.model_fit
qda.summary(results_2)

> Let's check the residuals again.

In [None]:
#Check on residuals
residuals = results_2.resid
fits = results_2.fittedvalues

# Perform the Shapiro-Wilk test
_, pval_SW = stats.shapiro(residuals)
print('Shapiro-Wilk test p-value = %.3f' % pval_SW)

# Plot the residuals
fig, axs = plt.subplots(2, 2)
fig.suptitle('Residual Plots')
stats.probplot(residuals, dist="norm", plot=axs[0,0])
axs[0,0].set_title('Normal probability plot')
axs[0,1].scatter(fits, residuals)
axs[0,1].set_title('Versus Fits')
fig.subplots_adjust(hspace=0.5)
axs[1,0].hist(residuals)
axs[1,0].set_title('Histogram')
axs[1,1].plot(np.arange(1, len(residuals)+1), residuals, 'o-')
plt.show()

In [None]:
#RANDOMNESS OF RESIDUALS
_, pval_runs_res = runstest_1samp(residuals, correction=False)
print('Runs test p-value on the residuals = {:.3f}'.format(pval_runs_res))
fig, ax = plt.subplots(2, 1)
sgt.plot_acf(residuals, lags = int(len(data)/3), zero=False, ax=ax[0])
fig.subplots_adjust(hspace=0.5)
sgt.plot_pacf(residuals, lags = int(len(data)/3), zero=False, ax=ax[1], 
            method = 'ywm')
plt.show()

> The residuals are normal and random. Let's compute $|r_4|$ again for the Bartlett test at lag 4:

In [None]:
# get the value of the autocorrelation function at lag 4
acf4 = sgt.acf(residuals, nlags=4, fft=False)[4]
print('The value of the autocorrelation function at lag 4 is {:.3f}'.format(acf4))

> Remember, 4 rows were not used in the model, so $n = 98-4 = 94$. The autocorrelation at lag 4 would be significant if:
> $$| r_4 | > \frac{z_{\alpha/2}}{\sqrt{n}} = \frac{1.96}{\sqrt{94}} = 0.202$$
> $$| r_4 | = 0.033 < 0.202$$

> Now the autocorrelation at lag 4 is not significant. The assumptions on the residuals are met, so we can accept the model.

## Point 2

Given the model, what is the predicted global temperature for the year 1998 (with 95% probability)?

> ### Solution
> We use the last model to predict the global temperature of year 1998. Let's create a dataframe with the values of the regressors. 

> Check the order of the predictors in the model

In [None]:
print(results_2.params)

> Create a new dataframe with the values of the regressors you want to evaluate the model on.
>
> **Remember**: the order of the predictors may not correspond to the one in the original dataframe.
>
> *Hint*: use the `iat[]` accessor to read a single scalar from a Pandas dataframe by integer position.

In [None]:
# Create a dataframe with the new predictors
data_predict = pd.DataFrame({'const': [1], 'Ex5_lag1': [data['Ex5'].iat[-1]], 'year': [1998], 'Ex5_lag4': [data['Ex5'].iat[-4]]})
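
> The negative positions used in the dataframe above can be checked with a small sketch. It uses a plain Python list of years as a stand-in for the `Ex5` column; the variable names are hypothetical.

```python
# Stand-in for the 98 yearly observations, 1900 to 1997
years = list(range(1900, 1998))
target_year = 1998

# Position -1 is the last observation, which is the lag 1 regressor for 1998
lag1_year = years[-1]
# Position -4 is four years before the target, which is the lag 4 regressor
lag4_year = years[-4]

print(lag1_year, lag4_year)
```

> The lag 1 value therefore comes from 1997 and the lag 4 value from 1994, matching `iat[-1]` and `iat[-4]`.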

In [None]:
#predict the next value
prediction = results_2.predict(data_predict)
print('The predicted value of Ex5 is %.3f.' % (prediction[0]))

In [None]:
# Compute the fit, confidence intervals and prediction intervals
prediction_summary = results_2.get_prediction(data_predict).summary_frame(alpha=0.05)
print(prediction_summary)
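
> In the summary frame, `mean_ci_lower` and `mean_ci_upper` bound the confidence interval for the mean response, while `obs_ci_lower` and `obs_ci_upper` bound the prediction interval for a new observation, which is the interval asked for here. As a reminder, the standard $100(1-\alpha)\%$ prediction interval for a least-squares fit at a new regressor vector $x_0$ is given below (the notation is introduced here for illustration):
> $$\hat{y}_0 \pm t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + x_0^\top (X^\top X)^{-1} x_0\right)}$$
> where $n$ is the number of rows used in the fit, $p$ is the number of estimated coefficients (including the constant), $X$ is the regressor matrix and $\hat{\sigma}^2$ is the residual mean square.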