# **Machine Learning and Forecasting: Seminar 1**

Question 1: Read the dataset Lecture_1_Regression_Toy_Data.csv into Python. Answer the following questions.

1. Show the names of all the variables in the dataset and the first 5 observations.
2. Write Python code to create a Q-Q plot and check whether $y$ follows the normal distribution. Interpret the Q-Q plot.
3. Write Python code to perform a statistical test for goodness-of-fit, e.g., the Jarque Bera test or the omnibus test, and check whether $y$ follows the normal distribution.
4. Build a linear regression model on the dataset, with $y$ being the dependent variable and the others being the independent variables.
5. Carefully check the results of the F-test and the t-tests. Are they all significant? If not, rebuild the model until it satisfies the requirements of both the F-test and the t-tests.
6. What is your final conclusion from the above exercise?

In [None]:
import statsmodels.api as sm
import pandas as pd
from statsmodels.stats.stattools import jarque_bera, omni_normtest
import matplotlib.pyplot as plt
import scipy.stats as stats

### 1.1 Read data in, look at variables and look at first 5 rows

In [None]:
Regression_Toy_Data_df = pd.read_csv("Datasets/Lecture_1_Regression_Toy_Data.csv")
Regression_Toy_Data_df.info()
display(Regression_Toy_Data_df.head(5))

### 1.2 Create a Q-Q plot

In [None]:
stats.probplot(Regression_Toy_Data_df["y"], dist="norm", plot=plt)
plt.show()

y approximately follows a normal distribution, except for a few observations at the bottom left of the plot.

### 1.3 Perform a Jarque-Bera test of whether y follows a normal distribution; the output is (JB_statistic, p_value, skewness, kurtosis)

In [None]:
jbTest = jarque_bera(Regression_Toy_Data_df["y"])
jbTest

The p-value is greater than 0.05, so we fail to reject the null hypothesis that y is normally distributed.

In [None]:
# Perform an omnibus test of whether y follows a normal distribution; the output is (statistic, p_value)
omniTest = omni_normtest(Regression_Toy_Data_df["y"])
omniTest

The p-value is again greater than 0.05, so the omnibus test also fails to reject normality.
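The decision rule used for both tests can be written out explicitly. The following is a minimal sketch on synthetic data (an assumption, standing in for `Regression_Toy_Data_df["y"]`); the helper `normality_decision` is hypothetical and only illustrates how the returned tuples are unpacked and compared with the 0.05 level.

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera, omni_normtest

# Synthetic stand-in for the y column (assumption: the real column is numeric).
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=0.0, scale=1.0, size=200)

ALPHA = 0.05  # significance level used in this seminar


def normality_decision(p_value, alpha=ALPHA):
    """Hypothetical helper: turn a normality-test p-value into a verdict."""
    if p_value > alpha:
        return "fail to reject normality"
    return "reject normality"


# jarque_bera returns (JB statistic, p-value, skewness, kurtosis).
jb_stat, jb_pvalue, skewness, kurtosis = jarque_bera(y_demo)

# omni_normtest returns (statistic, p-value).
omni_stat, omni_pvalue = omni_normtest(y_demo)

print(f"Jarque-Bera: stat={jb_stat:.3f}, p={jb_pvalue:.3f} -> {normality_decision(jb_pvalue)}")
print(f"Omnibus:     stat={omni_stat:.3f}, p={omni_pvalue:.3f} -> {normality_decision(omni_pvalue)}")
```

Note that a large p-value does not prove normality; it only means the test found no evidence against it at the chosen level.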

### 1.4 Build a linear regression model with y as the dependent variable and the others as the independent variables

In [None]:
# Iteration 1
# define X and y
y = Regression_Toy_Data_df["y"]
X = Regression_Toy_Data_df.drop(columns=["y"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

### 1.5 The F-statistic is approximately 1, so we fail to reject the null hypothesis that all slope coefficients are zero; the model needs to be rebuilt.

In [None]:
# Iteration 2
# define X and y, remove variable with highest p value from t-test
y = Regression_Toy_Data_df["y"]
X = Regression_Toy_Data_df.drop(columns=["y", "X5"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 3
# define X and y, remove variable with highest p value from t-test
y = Regression_Toy_Data_df["y"]
X = Regression_Toy_Data_df.drop(columns=["y", "X5", "X1"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 4
# define X and y, remove variable with highest p value from t-test
y = Regression_Toy_Data_df["y"]
X = Regression_Toy_Data_df.drop(columns=["y", "X5", "X2", "X1"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 5
# define X and y, remove variable with highest p value from t-test
y = Regression_Toy_Data_df["y"]
X = Regression_Toy_Data_df.drop(columns=["y", "X5", "X2", "X1", "X6"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 6
# define X and y, remove variable with highest p value from t-test
y = Regression_Toy_Data_df["y"]
X = Regression_Toy_Data_df.drop(columns=["y", "X5", "X2", "X1", "X6", "X4"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

### 1.6 Conclusion: No model satisfies both the t-tests and the F-test, so different variables or new observations are needed. The model is insignificant.

Question 2: It is known that miles per gallon (MPG) and CO$_2$ emissions are inversely related: higher MPG means lower fuel consumption, which directly results in lower CO$_2$ emissions, making the car more economical and environmentally friendly. Read the dataset Miles_Per_Gallon.csv (downloaded from https://archive.ics.uci.edu/dataset/9/auto+mpg) into Python. Answer the following questions.

1. Show the names of all the variables in the dataset and the first 5 observations.
2. Build a linear regression model on the dataset, with $y$ being the dependent variable and the others being the independent variables. Check the model carefully. If needed, rebuild a model until the results of the F-test and the t-tests are significant. Record the values of the $R^2$, the AIC and the BIC, respectively.
3. Find the residuals of the final model. Write Python code to create a Q-Q plot and check whether the residuals follow the normal distribution. Interpret the Q-Q plot.
4. Write Python code to perform a goodness-of-fit test using the Jarque-Bera test and check whether the residuals follow the normal distribution.
5. Interpret each coefficient of the last model you have built.
6. Compare the values of the $R^2$, the AIC and the BIC. Find the optimal model among the models, which are called candidate models.
7. Show the confidence intervals of the coefficients in the final model and then the confidence intervals of all observations in the dataset.
8. Show the prediction intervals of the observations in the dataset.







### 2.1 Read in the data and observe the variables

In [None]:
Auto_MPG_Dataset_df = pd.read_csv("Datasets/Auto_MPG_Dataset.csv")

In [None]:
Auto_MPG_Dataset_df.info()
display(Auto_MPG_Dataset_df.head(5))

### 2.2 Build a linear regression model with Miles_Per_Gallon as the dependent variable and the others as the independent variables

In [None]:
# Iteration 1
# define X and y
y = Auto_MPG_Dataset_df["Miles_Per_Gallon"]
X = Auto_MPG_Dataset_df.drop(columns=["Miles_Per_Gallon"])
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 2 remove acceleration as p-value > 0.05
# define X and y
y = Auto_MPG_Dataset_df["Miles_Per_Gallon"]
X = Auto_MPG_Dataset_df.drop(columns=["Miles_Per_Gallon", "Acceleration"])
X = sm.add_constant(X) 

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 3 remove cylinders as p-value > 0.05
# define X and y
y = Auto_MPG_Dataset_df["Miles_Per_Gallon"]
X = Auto_MPG_Dataset_df.drop(columns=["Miles_Per_Gallon", "Acceleration", "Cylinders"])
X = sm.add_constant(X) 

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 4 remove displacement as p-value > 0.05
# define X and y
y = Auto_MPG_Dataset_df["Miles_Per_Gallon"]
X = Auto_MPG_Dataset_df.drop(columns=["Miles_Per_Gallon", "Acceleration", "Cylinders", "Displacement"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

In [None]:
# Iteration 5 remove horsepower as p-value > 0.05
# define X and y
y = Auto_MPG_Dataset_df["Miles_Per_Gallon"]
X = Auto_MPG_Dataset_df.drop(columns=["Miles_Per_Gallon", "Acceleration", "Cylinders", "Displacement", "Horsepower"]) 
X = sm.add_constant(X)

#to fit regression model
model = sm.OLS(y, X).fit() #where y is the dependent variable and X holds the independent variables plus a constant

#to view model summary
print(model.summary())

### 2.3 Find residuals of the final model

In [None]:
#to retrieve the residuals
residuals = model.resid

#to create a Q-Q plot to check whether the residuals are normally distributed;
#fit=True standardizes the residuals so that the 45-degree line is a valid reference
fig = sm.qqplot(residuals, line='45', fit=True)
plt.show()

#to output the confidence intervals (95% by default) for the model coefficients
conf_intervals = model.conf_int()
print(conf_intervals)

The residuals do not follow a normal distribution.

### 2.4 Perform a Jarque-Bera test of whether the residuals follow a normal distribution; the output is (JB_statistic, p_value, skewness, kurtosis)

In [None]:
jbTest = jarque_bera(residuals)
jbTest

The p-value is less than 0.05, so we reject the null hypothesis that the residuals are normally distributed.

### 2.5 Model interpretation:
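- The intercept of about -20 is the predicted mpg of a car with weight 0, model year 0 and origin 0. This has no physical meaning because it lies far outside the observed data: the model year is never earlier than when cars were first made, and the weight has a positive lower limit too.
- Increasing the weight by 1 unit (pounds in the original UCI data, rather than kg) decreases the predicted mpg by 0.0061, holding the other variables fixed.
- Increasing the model year by 1 increases the predicted mpg by 0.7885, holding the other variables fixed.
- Increasing origin by 1 increases the predicted mpg by 1.1560, holding the other variables fixed. Origin is a category code rather than a quantity, so this reading assumes equal steps between codes.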


## Wrap-up
- Build a linear regression model with a significant F-test and significant t-tests;
- Understand the difference between the $R^2$ and the adjusted $R^2$;
- Compare performance measures such as the AIC and the adjusted $R^2$ among your candidate models and then select the optimal model;
- Interpret the coefficients of your final model.
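
For reference, the standard definitions behind these measures are sketched below. The symbols are introduced here and are not taken from the seminar: $n$ is the number of observations, $p$ the number of predictors excluding the intercept, $k$ the number of estimated parameters, and $\hat{L}$ the maximised likelihood.

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}, \qquad
\text{AIC} = 2k - 2\ln\hat{L}, \qquad
\text{BIC} = k\ln n - 2\ln\hat{L}
$$

Adding a predictor never lowers $R^2$, whereas the adjusted $R^2$, the AIC and the BIC all penalise the extra parameter. A higher adjusted $R^2$ and a lower AIC or BIC indicate the preferred candidate.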

## Preparation for the Next Lecture
- To review the assumptions of multiple linear regression models.