<img src="images/Project_logos.png" width="500" height="300" align="center">

## Aims

This course will teach you some methods for creating statistical models from your data. 
Once you know the relationships between variables, you can use them to make predictions.


Prior knowledge of Python, NumPy, Pandas, Iris, and Matplotlib is assumed for this course.

## Table of Contents

* [Exploratory Data Analysis](#exploratory_analysis)
* [Simple Linear Regression](#simple_linear_regression)
* [Multiple Linear Regression](#multiple_linear_regression)
* [Poisson Distribution](#poisson_distribution)
* [Exercise 1](#exercise_1)

## Exploratory Data Analysis<a class="anchor" id="exploratory_analysis"></a>
The first step in building a statistical model is to understand your data and to generate hypotheses.

**What is the structure of the data?**
It is important to understand the structure of the data because different variables must be modelled using different techniques.
- Continuous data can take on any value within a defined range, for example temperature.
- Categorical data can only take on certain values, for example named states such as age group and gender. Count data, such as the number of wet days, is similarly restricted to discrete values. Categorical data that is ordered, such as high, medium, low, is termed `ordinal` and categorical data with no order, such as male, female, is termed `nominal`.

**How does the data vary within itself?**
There are two aspects to variability, both related to how the values in a dataset may change from measurement to measurement:
- calibration uncertainty: how much will the same measurement vary if taken again under the same conditions?
- natural uncertainty: how big is the range of possible values within the population or sample?

**How does the data vary between variables?**
To build a statistical model, one variable in a dataset must be related to another variable in the dataset. Understanding that covariation is key. It can be measured using quantitative methods (refer to [the correlation course](2.Correlation.ipynb)), but it can also be inferred by plotting one variable against another to identify positive, negative and non-linear relationships, as well as parts of the data that show no relationship.
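
As a minimal sketch of quantifying covariation, the snippet below computes the Pearson correlation coefficient with NumPy. The data is synthetic and the variable names are illustrative assumptions; nothing here is taken from the Heathrow dataset.

```python
import numpy as np

# Synthetic, illustrative data (not the Heathrow observations)
rng = np.random.default_rng(42)
max_temp = rng.uniform(0, 30, size=200)
# A variable constructed to rise with max_temp, plus random noise
min_temp = 0.6 * max_temp - 2 + rng.normal(0, 1.5, size=200)
# An unrelated variable for comparison
unrelated = rng.normal(0, 1, size=200)

# np.corrcoef returns the correlation matrix; the off-diagonal entry is Pearson's r
r_related = np.corrcoef(max_temp, min_temp)[0, 1]
r_unrelated = np.corrcoef(max_temp, unrelated)[0, 1]

print(f"r (max vs min temperature): {r_related:.2f}")
print(f"r (max temperature vs unrelated variable): {r_unrelated:.2f}")
```

A value of r close to 1 or -1 points to a strong linear relationship, while a value close to 0 points to little or no linear relationship. A scatter plot is still worth making, because r does not capture non-linear relationships.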

**Hypothesis generation and testing** 

Once you have examined your data, you can use it to make an educated guess about how the variables can be fitted into a statistical model that explains the data. The **null hypothesis** proposes that such a model will not be statistically significant; this is the default position. The **alternative hypothesis** proposes the opposite: that there is a statistically significant relationship in the model. The alternative hypothesis is what is tested with a statistical model when attempting to reject the null hypothesis. **Hypothesis testing** uses a portion of the data (known as the **training data**) to generate the statistical model, which is then applied to the remaining portion of the data (known as the **test data** or **validation data**) for **hypothesis confirmation**. Hypothesis testing and hypothesis confirmation should use different samples from the data.

**Terminology**

Due to the variability within the data itself, as described above, no statistical model will ever be 100% accurate. A good model will be able to simulate the **signal** in the data (the relationships between variables), but there will always be **noise** in the data that does not fit the model exactly. The differences between the observed values and the values predicted by the model are known as the **residuals**.

The measured variable that a statistical model is trying to explain is termed the **response variable**, and is also known as the **dependent variable** or the **target variable**. The measured variables that will be used to try to explain the response variable are known as the **predictor variables**, and are also known as **independent variables** or the **feature variables**.

## Simple Linear Regression<a class="anchor" id="simple_linear_regression"></a>
Linear regression is one of the most widely used statistical models. It takes the form:

$$\mathrm{y} = mx + c$$

where y is the response variable, x is the predictor variable, m is the gradient of the linear relationship between the predictor variable and the response variable, and c is the intercept (or the constant): the value of the response variable where the predictor variable is zero.

As an example, take the relationship between the minimum temperature and the maximum temperature at Heathrow. We can use an Ordinary Least Squares (OLS) approach to fit a linear regression model to the data, which enables us to estimate the minimum temperature from maximum temperature observations.
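
To make the OLS idea concrete, the sketch below estimates m and c for a single predictor from the closed-form expressions m = cov(x, y) / var(x) and c = mean(y) - m * mean(x), then cross-checks the result against `np.polyfit`. It uses synthetic data with a known gradient and intercept, which is an assumption made purely for illustration.

```python
import numpy as np

# Synthetic data with a known gradient (0.6) and intercept (-2)
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=500)
y = 0.6 * x - 2 + rng.normal(0, 1.5, size=500)

# Closed-form OLS estimates for a single predictor
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

# Cross-check against NumPy's least-squares polynomial fit of degree 1
m_check, c_check = np.polyfit(x, y, deg=1)

print(f"gradient m = {m:.3f}, intercept c = {c:.3f}")
print(f"np.polyfit gives m = {m_check:.3f}, c = {c_check:.3f}")
```

The estimates should land close to the values used to generate the data, with the gap shrinking as the sample grows or the noise falls.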

Start by splitting the data into a training set and a testing set:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load in the data from Heathrow
df_heathrow = pd.read_csv('data/heathrow_weather_station_data.csv')
df_heathrow = df_heathrow.dropna()

# Use 75% of the data for training and 25% of the data for testing
test_size = int(len(df_heathrow) * 0.25)
x_train, x_test, y_train, y_test = train_test_split(df_heathrow['Max_temperature'],
                                                    df_heathrow['Min_temperature'],
                                                    test_size=test_size)

There are a number of different Python packages that can be used for regression modelling including scipy, statsmodels and sklearn. Here we will use examples that incorporate aspects of them all.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Fit the simple linear regression to the training data
model = linregress(x_train, y_train)

# Plot the training data
fig, ax = plt.subplots()
ax.scatter(x_train, y_train, c='k')
ax.set_xlabel('Maximum temperature')
ax.set_ylabel('Minimum temperature')

# Evaluate the fitted line at the two ends of the observed range
x_vals = np.array([df_heathrow['Max_temperature'].min(), df_heathrow['Max_temperature'].max()])
predicted_vals = model.intercept + model.slope*x_vals

# Overlay the fitted line in red
ax.plot(x_vals, predicted_vals, 'r')
plt.show()

We can estimate how well our model is performing by calculating the R-squared value, a measure of the amount of the variance in the response variable that is explained by the model. 

In [None]:
print(f"R-squared: {model.rvalue**2:.6f}")

R-squared values range from 0 to 1, where higher values indicate a better fitting model. For example, an R-squared value of 0.8 would indicate that 80% of the variance of the response variable is explained by the model. 
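
As a hedged illustration of what that number measures, the sketch below computes R-squared directly as 1 - SS_res / SS_tot and confirms that, for a simple linear regression fitted by least squares, it matches the square of the correlation coefficient. The helper name `r_squared` and the synthetic data are assumptions for this example only.

```python
import numpy as np

def r_squared(observed, predicted):
    """Return 1 - SS_res / SS_tot for the given observations and predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)        # variation left unexplained
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot

# Synthetic example: fit a straight line, then score it on the same data
rng = np.random.default_rng(1)
x = rng.uniform(0, 30, size=300)
y = 0.6 * x - 2 + rng.normal(0, 1.5, size=300)
slope, intercept = np.polyfit(x, y, deg=1)

r2 = r_squared(y, slope * x + intercept)
r = np.corrcoef(x, y)[0, 1]
print(f"R-squared from residuals: {r2:.4f}")
print(f"Correlation coefficient squared: {r**2:.4f}")
```

When the same calculation is applied to held-out test data, the result can fall below 0 if the model predicts worse than simply using the mean of the observations.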

The R-squared value from our model therefore suggests there is a strong statistical relationship between the maximum and minimum temperatures at Heathrow, and we can use it to estimate the minimum temperature from a known maximum temperature.

For example, if the maximum temperature was 10°C, the minimum temperature estimate from our model is:

In [None]:
observed_maximum_temperature = 10
min_temperature_estimate = model.slope*observed_maximum_temperature + model.intercept
print(f'The minimum temperature estimate from this model is {min_temperature_estimate:.2f}\xb0C')

We can now estimate the minimum temperatures from our test data subset:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

y_predicted = model.intercept + model.slope*x_test
residuals = y_test - y_predicted
fig, ax = plt.subplots()
sns.regplot(x=y_predicted,y=residuals,ax=ax)
ax.set(ylabel='Residuals',xlabel='Fitted values')

We can see from the residuals that the model overestimates the observed minimum temperature for some datapoints and underestimates it for others.

One of the assumptions of a linear regression model is that the residuals are normally distributed. This doesn't mean that the response variable needs to come from a normal distribution, only that the residuals are normally distributed around zero, with some below and some above. We can use a **Q-Q plot**, or quantile-quantile plot, which compares the quantiles of the residuals to those expected from a normal distribution, to examine whether this assumption holds.

In [None]:
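import statsmodels.api as sm

# fit=True standardises the residuals so they are comparable with the 45-degree reference line
sm.qqplot(residuals, line='45', fit=True)
plt.show()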

If the two distributions matched perfectly, all the quantile points would lie along the red line. The way that the residuals in the plot move away from the red line for the lowest and highest quantiles means the lower tail of the distribution is more widely extended than would be expected from a true normal distribution, and the upper tail of the distribution is more compressed than would be expected from a true normal distribution.
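
To show what a Q-Q plot is built from, the sketch below computes the plotted points by hand: the sorted, standardised sample values against the matching quantiles of a standard normal distribution. The helper name `qq_points`, the plotting-position formula (i - 0.5) / n and the synthetic residuals are assumptions chosen for illustration, and library functions may use slightly different conventions.

```python
import numpy as np
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    # Plotting positions (i - 0.5) / n, one common convention among several
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Standardise so the points can be compared with the 45-degree line
    observed = (sample - sample.mean()) / sample.std(ddof=1)
    return theoretical, observed

# Synthetic residuals: one normal set and one strongly skewed set
rng = np.random.default_rng(2)
normal_resid = rng.normal(0, 2, size=1000)
skewed_resid = rng.exponential(2, size=1000)

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    theo, obs = qq_points(resid)
    departure = np.max(np.abs(obs - theo))
    print(f"{name}: largest departure from the 45-degree line = {departure:.2f}")
```

The skewed sample departs much further from the line than the normal sample, which is the pattern to look for in the tails of the plot.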

## Multiple Linear Regression<a class="anchor" id="multiple_linear_regression"></a>
Multiple linear regression is an extension of simple linear regression where instead of a single predictor variable, there are multiple predictor variables. It takes the form:

$$\mathrm{y} = m_{0}x_{0} + m_{1}x_{1} + m_{2}x_{2} + \dots + m_{n}x_{n} + c$$

where each mx term pairs a predictor variable with the gradient of the linear relationship between that predictor variable and the response variable, and c is a constant.
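
As a minimal sketch of how such a model can be fitted, the example below builds a design matrix with a column of ones (which carries the constant c) and one column per predictor, then solves the least-squares problem with `np.linalg.lstsq`. The two predictors and their true gradients are synthetic assumptions, not values from the Heathrow data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x1 = rng.uniform(0, 30, size=n)   # a temperature-like predictor
x2 = rng.uniform(0, 10, size=n)   # a precipitation-like predictor
# Response generated with known gradients (0.6, 0.2) and constant (-2)
y = 0.6 * x1 + 0.2 * x2 - 2 + rng.normal(0, 1.0, size=n)

# Design matrix: a column of ones for the constant, then one column per predictor
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution of X @ coeffs ~ y
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
c, m1, m2 = coeffs
print(f"c = {c:.2f}, m1 = {m1:.2f}, m2 = {m2:.2f}")
```

Without the column of ones the fitted surface would be forced through the origin, which is why the constant has to be included explicitly in the design matrix.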

We will use the example of the relationship between the minimum temperature and both the maximum temperature and precipitation at Heathrow, to see if we can improve on the simple linear regression model.

Start by splitting the data into a training set and a testing set:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load in the data from Heathrow
df_heathrow = pd.read_csv('data/heathrow_weather_station_data.csv')
df_heathrow = df_heathrow.dropna()

# Create a dataset containing the predictor variables
predictors = df_heathrow[['Max_temperature', 'Precipitation']]

# Create a dataset containing the response variable
response = df_heathrow['Min_temperature']

# Use 75% of the data for training and 25% of the data for testing
test_size = int(len(df_heathrow) * 0.25)
x_train, x_test, y_train, y_test = train_test_split(predictors,
                                                    response,
                                                    test_size=test_size)

And now fit the model:

In [None]:
import statsmodels.api as ols

# first we must add the constant
x_train = ols.add_constant(x_train)

model = ols.OLS(y_train, x_train).fit()
print(model.summary())

There is a lot of information in the model summary. Here we will only discuss some of the outputs.

We can see that the R-squared value is again high at 0.957, slightly higher than the R-squared value we obtained for the simple linear regression model. We can also see that the relationships between the minimum temperature and both the maximum temperature and precipitation are significant at the 95% confidence level (p-values are less than 0.05).

Now let's again estimate the minimum temperatures from our test data subset:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Add a constant to the test data
x_test = ols.add_constant(x_test)

# make the predictions
predicted_min_temperatures = model.predict(x_test)

# calculate the residuals
residuals = y_test - predicted_min_temperatures

# plot the residuals
fig, ax = plt.subplots()
sns.regplot(x=predicted_min_temperatures,y=residuals,ax=ax)
ax.set(ylabel='Residuals',xlabel='Fitted values')

From the residuals plot, we can't really see that the model is an improvement, but let's look at the Q-Q plot:

In [None]:
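import statsmodels.api as sm

# fit=True standardises the residuals so they are comparable with the 45-degree reference line
sm.qqplot(residuals, line='45', fit=True)
plt.show()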

Now we can see that the residuals are closer to what we would expect from a normal distribution.

Let's now try incorporating the month of the observations into the relationship as a categorical variable. We will start by creating new training and test data.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load in the data from Heathrow
df_heathrow = pd.read_csv('data/heathrow_weather_station_data.csv')
df_heathrow = df_heathrow.dropna()

# Make the month field a categorical field
df_heathrow['Month'] = df_heathrow['Month'].astype('category')

# Create a dataset containing the predictor variables
predictors = df_heathrow[['Max_temperature', 'Precipitation', 'Month']]

# Create a dataset containing the response variable
response = df_heathrow['Min_temperature']

# Use 75% of the data for training and 25% of the data for testing
test_size = int(len(df_heathrow) * 0.25)
x_train, x_test, y_train, y_test = train_test_split(predictors,
                                                    response,
                                                    test_size=test_size)

And now fit the model.

In [None]:
import statsmodels.api as ols

# first we must add the constant
x_train = ols.add_constant(x_train)

model = ols.OLS(y_train, x_train).fit()
print(model.summary())

We can see that the R-squared value has again increased, and that the relationship between the minimum temperature and our categorical month field is also significant at the 95% confidence level (p-value is less than 0.05).
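
The array-style `OLS(y, X)` interface does not itself expand a categorical column into separate indicator columns. One possible alternative, sketched here as an assumption rather than as the approach intended by this course, is to one-hot encode the month with `pd.get_dummies` before fitting, so that each month gets its own coefficient relative to a baseline month. The small data frame below is made up for illustration and only borrows the column names used above.

```python
import pandas as pd

# Small made-up frame; only the column names mirror the Heathrow data
df = pd.DataFrame({
    'Max_temperature': [8.1, 9.4, 12.3, 15.0, 18.7, 21.9],
    'Precipitation': [55.0, 40.2, 38.1, 45.6, 49.0, 44.3],
    'Month': [1, 2, 3, 1, 2, 3],
})
df['Month'] = df['Month'].astype('category')

# Expand Month into 0/1 indicator columns, dropping the first level as the baseline
predictors = pd.get_dummies(df, columns=['Month'], drop_first=True, dtype=float)

print(predictors.columns.tolist())
print(predictors.head())
```

With this encoding the model summary would report one coefficient and p-value per non-baseline month instead of a single value for the month field.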

Notice also the Adjusted R-squared value. The R-squared value will increase (or at least never decrease) as more predictor variables are added to the model, so it can be hard to know when you have **overfit** a model. The Adjusted R-squared, however, adjusts for the number of terms in the model, and its value will only increase if the addition of a new predictor variable improves the model fit by more than might be expected to occur by chance. The Adjusted R-squared value will decrease if a new predictor variable doesn't improve the model by a sufficient amount. In the above we can see that the Adjusted R-squared value is the same as the R-squared value, which suggests that the model has not been overfit.
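
The adjustment can be written as 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p is the number of predictor variables. The sketch below applies that standard formula; the function name and the example numbers (an R-squared of 0.957 with 100 observations) are illustrative assumptions rather than values computed from the Heathrow model.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared for a model that includes a constant."""
    dof = n_observations - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / dof

# Same R-squared, increasing numbers of predictors (illustrative numbers)
for p in (1, 3, 10):
    print(f"{p} predictors: adjusted R-squared = {adjusted_r_squared(0.957, 100, p):.4f}")
```

For a fixed R-squared, the adjusted value falls as predictors are added, so a new predictor has to raise R-squared by enough to offset that penalty.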

## Exercise 1<a class="anchor" id="exercise_1"></a>

Using any Python methods, create a multiple linear regression model that predicts rainfall at Heathrow from the other variables available.

In [None]:
import pandas as pd

# Load in the data from Heathrow
df_heathrow = pd.read_csv('data/heathrow_weather_station_data.csv')
df_heathrow = df_heathrow.dropna()

