<div class="row">
    <div class="column">
        <img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="Data Science Campus Logo"
             align="right" 
             width = "340"
             style="margin: 0px 60px"
             />
    </div>
    <div class="column">
        <img src="https://cdn.ons.gov.uk/assets/images/ons-logo/v2/ons-logo.svg"
             alt="ONS Logo"
             align="left" 
             width = "420"
             style="margin: 0px 30px"/>
    </div>
</div>


---

<center><h1><font size=6> Statistics for Data Science with Python </font></h1></center>
<center><h1><font size=7> Introduction </font></h1></center>

*By Dr. Laurie Baker and Dr. Daniel J. Lewis*


# Introduction

Learners will at this point be familiar with how to bring their data into Python and how to clean it.

    
<img src="../../images/tidydata_5.jpg"  width="800" height="800" alt="Analysis process: Wrangle, Visualise, Model.">
Image Credit: @AllisonHorst



<img src="../../images/serra_rio.jpg"  width="800" height="800" alt="The modelling process is like a road map. There are a series of stages.">
Image Credit: Serra Rio do Rastro. Source: Rosanetur


This course covers the next steps of exploring and analysing the results and is intended to give a road map of the modelling process: 


1. **Define the question:** decide what you want to learn from the data.

2. **Explore the data:** explore the variation and covariation in the data. Identify patterns through plotting, which can then be explored further using formal statistical testing. 

3. **Choose the model:** choose the mathematical description of the pattern you are trying to describe. 

4. **Fit parameters:** once you have defined your model, you can estimate the parameters (slope, intercept, ...). The parameters are effectively the answers to your questions from **1**. 

5. **Estimate confidence intervals/test hypotheses/select models:** measurements of uncertainty are necessary to contextualise your best-fit parameters. By quantifying the uncertainty in the fit of a model, you can estimate confidence limits for the parameters. You can then test your hypotheses both statistically and practically. For example: can we tell statistically whether bed nets make a difference to malaria? Is the difference large enough to make bed nets an effective intervention strategy?

Adapted from [*Ecological Models and Data with R*](https://ms.mcmaster.ca/~bolker/emdbook/book.pdf) by Ben Bolker. 
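
As a rough illustration of how these steps fit together, the sketch below walks through them on synthetic data. The data, the straight-line model, and the bootstrap confidence interval are all assumptions chosen for illustration; they are not methods prescribed by this course or by Bolker's book.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic data: y depends linearly on x, plus random noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

# Explore the data: how strongly do x and y covary?
correlation = np.corrcoef(x, y)[0, 1]

# Choose the model (a straight line) and fit its parameters
slope, intercept = np.polyfit(x, y, deg=1)

# Estimate uncertainty: bootstrap a 95% confidence interval for the slope
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))  # resample rows with replacement
    boot_slopes.append(np.polyfit(x[idx], y[idx], deg=1)[0])
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])

print(f"correlation: {correlation:.2f}")
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
print(f"95% bootstrap interval for the slope: ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval for the slope excludes zero, the data support a relationship between `x` and `y`; whether the slope is large enough to matter in practice is a separate, subject-matter question.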




## Course Description: 

This course introduces the basics of carrying out a statistical analysis in Python. It covers exploratory data analysis and constructing and interpreting linear and generalized linear models. 

## **Aims, Objectives and Intended Learning Outcomes:** 

<br>

### Chapter 1: Exploratory Data Analysis
By the end of Chapter 1, learners should know:

*   What is tidy data?
    *   What is a variable, value, and observation?
    *   Several Python commands to explore the structure of the data
    *   What is the difference between a continuous and categorical variable?
    *   What is variation and covariation?
*  Where Exploratory Data Analysis fits within data analysis?
    *   How to use plots to explore variation in 
        *	A continuous variable
        *	A categorical variable
    *   How to use plots to explore covariation between
        *	Two categorical variables
        *	Two continuous variables
        *	A categorical and continuous variable. 

### Chapter 2: Model Basics

By the end of Chapter 2, learners should know:

*   Model Basics
    *	 What is a model family and fitted model?
    *	 What is the difference between a response and an explanatory variable?
    
*   Model Construction
    *  How to construct a linear model in Python?
    *  What are the slope and intercept in a linear model?
    *  Picking out key information from the model table
    *  How to extract specific parameters from the model object.

*  Assessing Model Fit
    *	 How to inspect model residuals to assess model fit?
    *	 How to pick out key information from the table from a fitted model. 
    *  How to use Adjusted R-squared and AIC to compare models. 

### Chapter 3: Generalized Linear Models

By the end of Chapter 3, learners should know:

* What is probability? 

* What is a random variable?

* What a probability distribution is and how it differs for continuous vs. discrete random variables?
* Be familiar with several common probability distributions used to model variation in the response variable
  * Binomial
  * Normal
  * Poisson
  * Negative Binomial

* How to implement a generalized linear model in Python.


**Acknowledgements:** Many thanks to Dr. Paraskevi Pericleous for key initial work on this module. Many thanks to Dr. Daniel J. Lewis for preparing the GLM and Bayesian Practical. 


### Packages for this adventure

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # data plotting
import seaborn as sns # data visualisation and plotting
from sklearn import datasets # fetching iris dataset

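# Seaborn plot default configurations: white background and a custom figure size.
# Both are set in a single call because a separate sns.set(rc=...) call
# would reset the style back to seaborn's default.
sns.set(style="white", rc={'figure.figsize': (8.7, 6.27)})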

---
# Chapter 1: Exploratory Data Analysis


## Introduction

Exploratory data analysis is a fluid process and there is no single approach. It can be thought of as a process of hypothesis generation, data exploration, and formal statistical testing. It comes after the stage of importing and tidying your data.


In this section we will walk through organising your data, getting to know your data structure, and understanding variation and covariation within and between variables.  


For these exercises we will use the **iris** dataset, which consists of morphological measurements of three related species of iris flowers.

<img src="../../images/iris_classification.png"  width="800" height="800" alt="Iris Varieties.">
Image Credit:  Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties.

The data is provided as a built-in dataset in the `sklearn` package; we will convert it to a pandas dataframe. 


### Tidying your data

To conduct regression analyses in Python it is important to have your data in a format that is easy to work with. A good approach is to organise your data as tabular data. 

Tabular data is a set of values, where each `value` is placed in its own “cell”, each `variable` in its own column, and each `observation` in its own row. The book [*R for Data Science*](https://r4ds.had.co.nz/) by Garrett Grolemund and Hadley Wickham is a great resource when thinking about tidying data. Here, we use some of the definitions they set out in the book to describe tidy data. 

**Some definitions for tidy data**:

 * A `variable` is a quantity, quality, or property that you can measure.

 * A `value` is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

 * An `observation` is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable.

  * In Python, the *pandas* package makes it easy to format your data in a tabular format. 
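
To make these definitions concrete, here is a minimal sketch that reshapes a small "wide" table into tidy form with `DataFrame.melt`. The table `wide` and its values are hypothetical and are not part of the iris data used in this course.

```python
import pandas as pd

# Hypothetical untidy table: petal lengths of two flowers spread across columns
wide = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "flower_1": [1.4, 6.0],
    "flower_2": [1.3, 5.1],
})

# Tidy form: one row per observation, one column per variable
tidy = wide.melt(id_vars="species",
                 var_name="flower",
                 value_name="petal_length")

variable = tidy["petal_length"]         # a variable is a column
observation = tidy.loc[0]               # an observation is a row
value = tidy.loc[0, "petal_length"]     # a value is a single cell

print(tidy)
```

In the tidy version every measured petal length sits in its own cell, which is the layout that plotting and modelling functions generally expect.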
  

**Reading in the data**

In [None]:
# Define the sklearn_to_df function to convert an sklearn dataset to a pandas dataframe

def sklearn_to_df(sklearn_dataset):
    
    df = pd.DataFrame(sklearn_dataset.data, 
                      columns=sklearn_dataset.feature_names)
    
    df['target'] = pd.Categorical.from_codes(sklearn_dataset.target, 
                                             sklearn_dataset.target_names)
    return df
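
The `pd.Categorical.from_codes` call in the function above turns integer codes into labelled categories. A minimal sketch of that step in isolation, using hypothetical codes rather than the real iris targets:

```python
import numpy as np
import pandas as pd

# Hypothetical integer codes, in the style sklearn uses for its targets
codes = np.array([0, 0, 1, 2, 1])
names = np.array(["setosa", "versicolor", "virginica"])

# Each code is used as a position in the list of category names
labels = pd.Categorical.from_codes(codes, names)

print(list(labels))
print(labels.categories)
```

Storing the labels as a categorical keeps the species names readable while still recording the fixed set of possible values.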


In [None]:
# import and convert format of iris data from sklearn to a pandas dataframe
df_iris = sklearn_to_df(datasets.load_iris())

In [None]:
# look at the head of the dataframe to see the tidy data layout
df_iris.head()

**Exercise**

<img src="../../images/tidy_iris_example.png"  width="800" height="800" alt="Iris Varieties.">
Image Credit:  Suruchi Fialoke, October 13, 2016, Classification of Iris Varieties.

1. Look at the definitions for `value`, `observation`, and `variable`. Which of the coloured boxes in the image above corresponds to each definition?



## Getting to know your data

To get started, let's explore the following questions for our dataset. 

 1. What is the structure of the data?

 2. What type of variation occurs within my variables?

 3. What type of covariation occurs between my variables? 
 

### Data Structure and Data Summaries

One of the things we will wish to know about our variables is whether they are continuous or categorical. 

<img src="../../images/continuous_discrete.png"  width="700" height="700" alt="Continuous values include weight and height. Categorical variables are discrete things, like the number of octopus legs.">
Image Credit:  @AllisonHorst.


   * `Continuous variable`: a variable that can take on an unlimited number of values between the lowest and highest points of measurements.
        * e.g. speed, distance, height
        
        
  * `Categorical variable`: a variable that can take **one** of a limited subset of values. For example, if you have a dataset about a household then you will typically find variables like gender, marital status, and county.
      * In Python, categorical variables are usually stored as character strings or integers (e.g. 'M' and 'F' for male and female). 
      * Categorical variables are **nominal** if they have no order (e.g. 'Ghana' and 'Uruguay')
      * Categorical variables are **ordinal** if there is an order associated with them (e.g. 'low', 'medium', and 'high' referring to economic status).         
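
The sketch below shows one way these variable types can be represented in pandas, including the `category` data type for nominal and ordinal variables. The dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical household-style data
df = pd.DataFrame({
    "height_cm": [150.2, 162.5, 171.0],        # continuous
    "country": ["Ghana", "Uruguay", "Ghana"],  # nominal categorical
    "status": ["low", "high", "medium"],       # ordinal categorical
})

# Nominal: categories with no order
df["country"] = df["country"].astype("category")

# Ordinal: categories with an explicit order
df["status"] = pd.Categorical(df["status"],
                              categories=["low", "medium", "high"],
                              ordered=True)

print(df.dtypes)
print(df["status"].max())  # the order makes comparisons meaningful
```

Declaring the order explicitly matters because pandas would otherwise sort the labels alphabetically, which rarely matches the intended ranking.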


**Exercises** 

Run the following lines of code to answer the questions below

1. What are the dimensions of the dataframe?


In [None]:
print(df_iris.shape)

2. What are the first and last values of sepal length?


In [None]:
df_iris.head()

In [None]:
df_iris.tail()

3. Which variables are categorical or continuous variables? What are these data types called in Python?

In [None]:
df_iris.info()

4. Using the data summary, what are the minimum and maximum sepal lengths?

In [None]:
df_iris.describe()

5. What are the names of the columns?

In [None]:
df_iris.columns

Let's simplify the column names and make them more meaningful

In [None]:
df_iris = df_iris.rename(columns={'sepal length (cm)': 'sepal_length', 
                                  'sepal width (cm)': 'sepal_width', 
                                  'petal length (cm)': 'petal_length', 
                                  'petal width (cm)': 'petal_width',
                                  'target': 'species'})
df_iris.columns
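
When a dataframe has many columns, writing out each new name by hand becomes tedious. As an alternative sketch, the names can be cleaned programmatically with pandas string methods. The example uses a hypothetical empty dataframe with iris-style column names so that it runs on its own.

```python
import pandas as pd

# Hypothetical dataframe with column names in the style of the sklearn iris data
df = pd.DataFrame(columns=["sepal length (cm)", "petal width (cm)", "target"])

# Drop the unit suffix, then replace spaces with underscores
df.columns = (df.columns
                .str.replace(" (cm)", "", regex=False)
                .str.replace(" ", "_", regex=False))

print(list(df.columns))
```

Note that this approach leaves `target` unchanged, so a name that needs a different meaning (such as `species`) still has to be renamed explicitly.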

### Variation

Now that we know more about the structure of our data, we can explore the variation and covariation in the variables. Knowing the variation within variables and the covariation between them helps us to understand the spread of the data and to spot potential relationships that may give insight into modelling. 


<div class="alert alert-block alert-success">
<b><font size="4"> Terminology</font> </b> 
<p> 

  * `variation`: is the tendency of values of a variable to change from measurement to measurement. Variation can come in several forms: 
    - `measurement error` you may measure the same thing twice and get slightly different values.
    - `natural variation` is the term I use to refer to variation that is inherent in a population or sample (e.g. as humans we all have different heights; the way these values vary reflects the variation in the sample or population).

  * `covariation`: tendency of values of a variable to change with the values of another variable. 

</p>
</div>


Visualisation is a great initial tool to explore these relationships further.
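
Variation and covariation can also be summarised numerically. The sketch below uses the standard deviation as a measure of variation and the Pearson correlation as a measure of covariation. The data are synthetic and the column names `length` and `width` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical measurements: width tends to increase with length
length = rng.normal(5.0, 1.0, size=100)
width = 0.4 * length + rng.normal(0, 0.2, size=100)
df = pd.DataFrame({"length": length, "width": width})

# Variation: how much do the values of one variable spread out?
length_sd = df["length"].std()

# Covariation: how do two variables change together?
length_width_corr = df["length"].corr(df["width"])

print(f"standard deviation of length: {length_sd:.2f}")
print(f"correlation between length and width: {length_width_corr:.2f}")
```

Numerical summaries like these are compact, but they can hide structure such as multiple peaks or curved relationships, which is why plotting remains the first step.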

### Visualising Distributions

How you visualise your variables depends on whether the variable is `categorical` or `continuous`.


**A categorical variable** 

   * `Categorical or discrete variable`: a variable that can take on one of a limited, usually fixed number of possible values, assigning each value to a particular group or nominal category. 
        * e.g. sex, race, density: (high, medium, low)
        
To examine the distribution of a categorical variable, we can use a bar plot:

  * Bar plots are a useful tool for getting to know how many observations are within each group of a category. 

In [None]:
species_counts = sns.countplot(x="species", 
                               data=df_iris)
species_counts;

In this case, the bar chart shows that there is the same number of measurements for each species in the data set.
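
The counts behind a bar plot can also be read off as numbers with `value_counts`. A minimal sketch on a small hypothetical series (not the iris data):

```python
import pandas as pd

# Hypothetical categorical variable
species = pd.Series(["setosa", "setosa", "versicolor",
                     "virginica", "virginica", "virginica"],
                    name="species")

# Number of observations in each category
counts = species.value_counts()

# Share of observations in each category
proportions = species.value_counts(normalize=True)

print(counts)
print(proportions)
```

Checking the counts is a quick way to spot unbalanced groups, which can affect how later comparisons between categories are interpreted.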

**A continuous variable**

  * A `continuous variable` can take any of an infinite set of ordered values (e.g. numbers and date times). We can inspect the spread of the data using a density plot or box plot. 

In [None]:
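# sns.distplot is deprecated in recent seaborn releases, so sns.kdeplot is used
# to draw the shaded density curve (fill replaces the older shade option)
petal_length_all_density = sns.kdeplot(data=df_iris,
                                       x='petal_length',
                                       fill=True,
                                       linewidth=3)

petal_length_all_density.set(xlabel='Petal length', ylabel='Density')
petal_length_all_density;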

If we look at the distribution of petal length, something interesting seems to be happening. It appears that the distribution is *bimodal* meaning that there are two modes, in this case two maxima, in the data. 

Let's explore this further by plotting the data species by species.

In [None]:
# sns.distplot is deprecated in recent seaborn releases, so sns.kdeplot is used.
# Each call draws one shaded density curve on the same axes.

# Setosa
df_setosa = df_iris[df_iris.species == 'setosa']

petal_length_species = sns.kdeplot(x=df_setosa['petal_length'],
                                   label='setosa',
                                   fill=True,
                                   linewidth=3)

# Virginica
df_virginica = df_iris[df_iris.species == 'virginica']

petal_length_species = sns.kdeplot(x=df_virginica['petal_length'],
                                   label='virginica',
                                   fill=True,
                                   linewidth=3)

# Versicolor
df_versicolor = df_iris[df_iris.species == 'versicolor']

petal_length_species = sns.kdeplot(x=df_versicolor['petal_length'],
                                   label='versicolor',
                                   fill=True,
                                   linewidth=3)

petal_length_species.set(xlabel='Petal Length', ylabel='Density')
petal_length_species.legend()  # show which curve belongs to which species
petal_length_species;
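
The same idea can be checked numerically by summarising a variable within each group. The sketch below uses synthetic data with two hypothetical groups, `a` and `b`, to show how a pooled distribution with two peaks can arise from groups that each have a single peak.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical data: two groups with clearly different typical lengths
df = pd.DataFrame({
    "group": ["a"] * 50 + ["b"] * 50,
    "length": np.concatenate([rng.normal(1.5, 0.2, size=50),
                              rng.normal(5.0, 0.5, size=50)]),
})

# The pooled mean falls between the groups and describes neither of them well
pooled_mean = df["length"].mean()

# Summaries within each group reveal the separate centres
group_summary = df.groupby("group")["length"].agg(["mean", "std", "count"])

print(f"pooled mean: {pooled_mean:.2f}")
print(group_summary)
```

When a pooled density plot shows more than one peak, grouping by a categorical variable is often a sensible next step.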

### Covariation


**A continuous and categorical variable**

 * **Box plot of petal width by species**


In [None]:
petal_width_boxplot = sns.boxplot(data=df_iris, y='petal_width', x='species')
petal_width_boxplot;

* A box plot gives us a visual representation of the distribution of numeric data using quartiles. It can be a good way to see how the data is spread and to identify potential outliers. 
    * The line in the middle of the box shows the median (second quartile).
    * The edges of the box are the first and third quartiles (Q1 and Q3). The distance between them is the interquartile range (IQR), which covers the middle 50\% of the data (25\% to 75\%). 
    * The whiskers extend to the most extreme observations that lie within (Q1 - 1.5 x IQR) and (Q3 + 1.5 x IQR). Points beyond these limits are drawn individually as potential outliers.
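
The quantities drawn in a box plot can be computed directly. The following sketch does so with NumPy for a small hypothetical sample, using the 1.5 x IQR rule described above.

```python
import numpy as np

# Hypothetical sample with one unusually large value
values = np.array([0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 1.8])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which points are treated as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"potential outliers: {outliers}")
```

Different software may compute quartiles with slightly different interpolation rules, so the exact box edges can vary a little between tools.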

* **Violin plots of sepal length for each species**


In [None]:
sepal_length_violin = sns.violinplot(data=df_iris, 
                                     y="sepal_length", 
                                     x='species')
sepal_length_violin;

Violin plots are similar to box plots, but they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. 
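
To show what a kernel density estimator does, here is a minimal hand-written Gaussian version in NumPy. The function name `gaussian_kde_estimate`, the synthetic sample, and the fixed bandwidth of 0.3 are all assumptions for illustration; seaborn chooses its bandwidth automatically and its implementation differs in detail.

```python
import numpy as np

def gaussian_kde_estimate(samples, grid, bandwidth):
    """Average a Gaussian bump centred on each sample, evaluated on the grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(7)

# Hypothetical sample of a continuous variable
samples = rng.normal(5.8, 0.8, size=150)

# Evaluate the smoothed density on an evenly spaced grid
grid = np.linspace(0, 12, 601)
density = gaussian_kde_estimate(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over its range
area = density.sum() * (grid[1] - grid[0])
print(f"area under the estimated density: {area:.3f}")
```

A smaller bandwidth produces a bumpier curve that follows individual points, while a larger one produces a smoother curve that may hide real features.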

**Two continuous variables**

Plotting two continuous variables, we can see how they change in relation to each other. In these plots we are looking to see whether there is a:

* **positive relationship:** as one variable increases, the other also increases.

* **negative relationship:** as one variable increases, the other decreases.

* **no relationship:** there is no discernible pattern of change in one variable with the other.

* **non-linear relationship:** we may also be able to pick out other patterns, e.g. *polynomials*. 
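
The four cases above can be illustrated with the Pearson correlation coefficient on synthetic data. The data and the dictionary `relationships` are hypothetical. The point of the sketch is that a clear non-linear pattern can still give a correlation close to zero, which is one reason to plot the data rather than rely on a single number.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(-3, 3, 200)
noise = rng.normal(0, 0.3, size=x.size)

# Hypothetical examples of each type of relationship
relationships = {
    "positive": 2 * x + noise,
    "negative": -2 * x + noise,
    "none": rng.normal(0, 1, size=x.size),
    "non-linear": x ** 2 + noise,
}

# Pearson correlation only measures the strength of a straight-line relationship
correlations = {name: np.corrcoef(x, y)[0, 1]
                for name, y in relationships.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```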


In [None]:
iris_scatter_petal_length_sepal_width = sns.scatterplot(data=df_iris, 
                                                        x='petal_length', 
                                                        y='sepal_width', 
                                                        hue='species')
iris_scatter_petal_length_sepal_width;


<div class="alert alert-block alert-info">
<b><font size="4">Exercise:</font></b> 

<p> 

1. Make a box plot or a violin plot of sepal width by species. How does this box plot/violin plot compare to the earlier box plot/violin plot we made of petal width and sepal length?

</p> </div>


In [None]:
# Boxplot


<div class="alert alert-block alert-info">

<p> 
2. Make a scatterplot to visualise the relationship between petal length and sepal length coloured by species. What patterns can you pick out from the data?

</p> </div>


In [None]:
# Scatterplot

<div class="alert alert-block alert-info">

<p> 
3. Pairplots are a quick and useful way to summarise your dataset and to inspect the relationships between several variables simultaneously. Try running the following code to make a pairplot. What does this code do?

</p> </div>

In [None]:
iris_all_pairplot = sns.pairplot(data=df_iris, 
                                 hue="species", 
                                 diag_kind="kde")
iris_all_pairplot;

<div class="alert alert-block alert-success">
<b><font size="4"> Next Chapter: Model Basics</font> </b> 
<p> 
Exploratory Data Analysis is a useful tool to identify and pick out patterns to explore, but we need to confirm any results with statistical analyses.
</p>
</div>