# Regression template for Seaborn example datasets

## Task

We are going to build a linear regression model and evaluate its performance. For a regression we need:

- A dependent variable. This is the one we will predict. For a regression model this needs to be a continuous variable.
- One or more independent variables. These are the ones we are using to predict the dependent variable. For a regression all of these need to be numerical and at least some need to be continuous.

For a linear regression model, our hypothesis is that our dependent variable:

- Is influenced by each of the independent variables: a change in an independent variable affects the dependent variable
- Has a relationship with each independent variable that is linear
- Has a similar distribution to each of the independent variables
- Is influenced more strongly by each independent variable than the influence they have on each other

<a id='Contents'></a>
## Contents
In this notebook, we will:

- [Import](#import) packages and load in some data 
- [Prepare](#prepare) the data so we can explore it
- [Explore](#explore) the data and make our testable hypothesis
- [Split](#split) the data for test and train
- [Build](#build) the model 
- [Interpret](#interpret) the model results
- [Evaluate](#evaluate) the model predictions using the test data

<div class="alert alert-block alert-warning">
<b>Reminder:</b> <br>
You don't need to understand all the code here for now, just look for:
<ul>
<li> What we are trying to do with each code cell. How does it fit in our objective?
<li> What do the outputs of each code cell tell us? Are we reading too much into the results of each code cell?
<li> Some of the code cells will have parts that you will need to change to fit your data. You will be told what to change in the comment before the code cell.
</ul>
</div>

<a id="import"></a>
## Import packages and read data
[Back to Contents](#Contents)

Let's start by importing the Python packages we will need:
- [**pandas**](https://pandas.pydata.org/): a tabular data manipulation package
- [**seaborn**](https://seaborn.pydata.org/): a data visualisation package
- [**scikit-learn**](https://scikit-learn.org/): a model building package
- [**statsmodels.api**](https://www.statsmodels.org/): a model building package

In [None]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

We will be using one of these Seaborn [example datasets](https://github.com/mwaskom/seaborn-data):

- `taxis`: New York taxi journeys
- `titanic`: Records of details of passengers on the Titanic
- `tips`: Restaurant bills and tips
- `penguins`: Physical details of various penguins
- `iris`: Measurements of different Iris flowers

These can be loaded using the Seaborn function [`load_dataset`](https://seaborn.pydata.org/generated/seaborn.load_dataset.html). We can have a look at the first few rows using the method [`head`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).

In [None]:
data = sns.load_dataset('taxis')
data.head()

<a id="prepare"></a>
## Prepare the data
[Back to Contents](#Contents)

For this notebook we are assuming the data has been prepared before being loaded in. However, it is always important to check that you have what you expect.

We can look at what kind of data our table contains using the [`info`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method.

We should be asking ourselves:

- Does the data contain the columns I expect?
- Do the columns have the data type I would expect?
- Do any of the columns have missing values? Are these in any columns we intend to use? There cannot be any null values in the rows we plan to use in our model.

In [None]:
data.info()

If we have any columns with null values that we want to use in our model, we will need to drop those pieces of data. Pandas gives us the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method to do this. If you only need to drop null values from one column you can use the `subset` parameter to pass a list of the columns you wish to be checked for null values.

This piece of code has been commented out, so it will not run. To use this, remove the `#` from the start of the line and replace `'variable name'` with the column that you wish to be checked for null values. You can examine more than one column at a time by listing all the columns in the square brackets.

In [None]:
# data = data.dropna(subset=['variable name', ])
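
As a rough illustration of how the `subset` parameter changes which rows `dropna` removes, the sketch below uses a small made-up table. The column names `fare` and `tip` and all the values are hypothetical and only serve the example.

```python
import pandas as pd

# Made-up table: one missing value in 'fare' and one in 'tip'
example = pd.DataFrame({
    'fare': [10.0, None, 7.5, 12.0],
    'tip': [2.0, 1.0, None, 3.0],
})

# Without subset, any row containing a null value is dropped
dropped_any = example.dropna()

# With subset, only nulls in the listed columns cause a row to be dropped
dropped_fare = example.dropna(subset=['fare'])

print(len(example), len(dropped_any), len(dropped_fare))
```

Using `subset` keeps rows that are only missing values in columns we do not plan to use, so less data is thrown away.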

If one of the columns is only useful as a unique row identifier, we can use it as an index. We can set it using the [`set_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method. Replace `'ID column name'` with the column that you wish to set as the index. If you need this, remove the `#` to uncomment the code.

In [None]:
#data = data.set_index('ID column name')

---
<a id="explore"></a>
## Explore the data
[Back to Contents](#Contents)

Now that we have checked our data is clean, we can explore it. A good starting point is to look at the descriptive statistics and check that they seem reasonable. We can do so using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method.

We should be asking ourselves:

- Does the data have unexpected outliers (looking at the max and min values) that might suggest a data quality issue?
- Is the spread of the data what you would expect?

In [None]:
data.describe()

It is helpful to do these checks visually. Histograms will help us see the distribution of a variable and scatter plots will show us the relationship between variables. Seaborn allows us to plot these using the functions [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) for histograms and [`scatterplot`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) for scatter plots.

In [None]:
sns.histplot(data=data, x='dependent variable name');

In [None]:
sns.scatterplot(data=data, x='variable name 1', y='dependent variable name');

It can be helpful to plot all the histograms and scatter plots in one go. Seaborn allows us to do this using a [`pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html).

In [None]:
sns.pairplot(data);

The scatter plots help us to understand what kind of relationships there are between our variables and how strong they are. We can also quantify this using the Pandas method [`corr`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html). Correlation can only be calculated for numerical columns, so we ask Pandas to leave out the others.

In [None]:
data.corr(numeric_only=True)
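
A minimal sketch of what `corr` returns for a table that mixes numerical and text columns. The table and its column names (`distance`, `fare`, `colour`) are made up for illustration. With `numeric_only=True`, the text column is left out of the calculation.

```python
import pandas as pd

# Made-up table with two numerical columns and one text column
example = pd.DataFrame({
    'distance': [1.0, 2.0, 3.0, 4.0],
    'fare': [5.0, 7.0, 9.0, 11.0],
    'colour': ['yellow', 'green', 'yellow', 'green'],
})

# Only the numerical columns appear in the correlation table
corr_table = example.corr(numeric_only=True)
print(corr_table)
```

Here `fare` rises by exactly 2 for every unit of `distance`, so their correlation is 1. Real data will usually give values somewhere between -1 and 1.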

### Confirm your hypothesis

Now that you have explored your data and seen the relationships in it, you can pick the independent variables you want to use to predict your dependent variable. Remember, for your independent variables to be useful in predicting the dependent variable, they need to:

- Influence the dependent variable
- Have a linear relationship with the dependent variable
- Have a similar distribution to the dependent variable
- Influence the dependent variable more strongly than they influence each other

---
<a id="split"></a>
## Train and test split of data
[Back to Contents](#Contents)

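We can use the scikit-learn function [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to divide our data in two: 80% in `train_data` and 20% in `test_data`. Setting `random_state` means we get the same split every time we run the cell.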

In [None]:
train_data, test_data = train_test_split(data,
                                         test_size=0.2,
                                         random_state=42)
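
To see what the split does, here is a small sketch on a made-up ten-row table (the columns `x` and `y` are hypothetical). It checks the size of each part and that no row ends up in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with ten rows
example = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

# Same settings as above: 20% of rows go to the test set
example_train, example_test = train_test_split(example,
                                               test_size=0.2,
                                               random_state=42)

print(len(example_train), len(example_test))

# Rows that appear in both parts (there should be none)
overlap = set(example_train.index) & set(example_test.index)
print(overlap)
```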

---
<a id="build"></a>
## Build the model
[Back to Contents](#Contents)

Now that our data is in the right form, we can build our linear regression model.

Our predictor variables, the independent variables, are:

In [None]:
independent_variable_names = ['variable name 1', 'variable name 2']

Our predicted variable, the dependent variable, is:

In [None]:
dependent_variable_name = 'dependent variable name'

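We will use the statsmodels Ordinary Least Squares class [`OLS`](https://www.statsmodels.org/dev/regression.html) to fit the model. Statsmodels does not include an intercept by default, so we add a constant column with `add_constant` first.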

In [None]:
independent_vars = train_data[independent_variable_names]
independent_vars = sm.add_constant(independent_vars)

data_model = sm.OLS(train_data[dependent_variable_name], independent_vars).fit()
data_model.summary()

---
<a id="interpret"></a>
## Interpret the model results
[Back to Contents](#Contents)

Now we have built the model we can look at the model quality and what it tells us.

- The **R-squared** is a measure of how much of the variation in the dependent variable is explained by the independent variables.
- The **Adj. R-squared** is the R-squared adjusted for the number of independent variables in the model: it only increases when a new variable improves the model more than would be expected by chance. It is useful for comparing models with different numbers of independent variables.
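
The adjustment can be sketched with the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function name and the example numbers below are made up for illustration.

```python
def adjusted_r_squared(r_squared, n_observations, n_independent_vars):
    """Adjust R-squared for the number of independent variables in the model."""
    penalty = (n_observations - 1) / (n_observations - n_independent_vars - 1)
    return 1 - (1 - r_squared) * penalty


# Same R-squared and sample size, different numbers of independent variables
one_var = adjusted_r_squared(0.80, n_observations=50, n_independent_vars=1)
five_vars = adjusted_r_squared(0.80, n_observations=50, n_independent_vars=5)

print(round(one_var, 4), round(five_vars, 4))
```

For the same R-squared, the model with more independent variables gets the lower adjusted value, which is why this measure is helpful when deciding whether an extra variable is worth including.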

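The coefficients give us the equation for our linear regression model. With a single independent variable $X$ this is a straight line:

$$\hat{Y} = \text{const} + \text{coeff} \times X$$

With more than one independent variable, each variable adds its own term:

$$\hat{Y} = \text{const} + \text{coeff}_1 \times X_1 + \text{coeff}_2 \times X_2 + \dots$$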

How accurate are the estimated coefficients?
Do all of our independent variables contribute to the model?


---
<a id="evaluate"></a>
## Evaluate the model predictions using the test data
[Back to Contents](#Contents)

We can now calculate the model predictions using the `test_data`. The model has not seen these rows, so they give us a fairer picture of how well it predicts.

In [None]:
predictions = data_model.predict(sm.add_constant(test_data[independent_variable_names]))

A visual comparison between the predicted and actual values of the dependent variable will help us to gauge the quality of the predictions. If the predictions match the actual values, the points will all lie along a straight line with a gradient of 1, where each prediction equals the actual value.

In [None]:
# Generate the scatter plot
ax = sns.scatterplot(x=test_data[dependent_variable_name], y=predictions)
ax.set_ylabel(f'Model predictions for {dependent_variable_name}')

# Find the start and end of the perfect prediction line
dependent_min = test_data[dependent_variable_name].min()
dependent_max = test_data[dependent_variable_name].max()

# Plot the perfect prediction line, where prediction = actual value
sns.lineplot(x=[dependent_min, dependent_max], y=[dependent_min, dependent_max], color='k', ax=ax);

We can also use these predictions to calculate the mean squared error, using `mean_squared_error`, and the R-squared, using `r2_score`. How do the R-squared values compare for the training data and the test data? Is the model likely to be useful for predicting the population?

In [None]:
print("mean squared error \t", mean_squared_error(test_data[dependent_variable_name], predictions))
print("R-squared model \t", data_model.rsquared)
print("R-squared predictions \t", r2_score(test_data[dependent_variable_name], predictions))
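
To make these two measures less of a black box, here is a sketch that calculates them by hand in plain Python for four made-up actual and predicted values. It is only meant to show what is being calculated, not how scikit-learn implements it.

```python
# Made-up actual values and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

# Mean squared error: the average of the squared prediction errors
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(actual)

# R-squared: 1 minus (unexplained variation / total variation)
mean_actual = sum(actual) / len(actual)
total_variation = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - sum(squared_errors) / total_variation

print("mean squared error \t", mse)
print("R-squared \t\t", r_squared)
```

The mean squared error is in the squared units of the dependent variable, so it is mainly useful for comparing models on the same data. The R-squared has no units and is 1 for perfect predictions.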