# Data Science and Programming
# Week 5b


# Table of Contents
* [Introduction](#Introduction)
 * [Problem](#Problem)
 * [Importing the libraries and data](#Importing-the-libraries-and-data)
* [Exploring the data](#Exploring-the-data)
 * [Checkpoint 1](#Checkpoint-1)
* [Analysing the data](#Analysing-the-data)
 * [Plotting scatter diagrams](#Plotting-scatter-diagrams)
 * [Checkpoint 2](#Checkpoint-2)
 * [Adding a third categorical feature](#Adding-a-third-categorical-variable-to-a-scatter-diagram)
 * [Checkpoint 3](#Checkpoint-3)
 * [Adding a third numerical feature](#Adding-a-third-numerical-variable-to-a-scatter-diagram)
 * [Checkpoint 4](#Checkpoint-4)
 * [Hexagonal bin plots](#Hexagonal-bin-plots)
 * [Exploring the association between other variables and life expectancy](#Exploring-the-association-between-other-variables-and-life-expectancy)
* [Communicating the result](#Communicating-the-result)
 * [Checkpoint 5](#Checkpoint-5)
 
# Introduction 
This activity uses the Seaborn library in Python to create two-dimensional charts, such as scatter plots, which show the relationship between two numerical variables. You should have met scatter plots before, but you will also see how to add a third variable to them. More information about Seaborn is available at: https://seaborn.pydata.org/

The activity uses the data from the MEI large data set (number 4) which gives *demographic* information about countries.

## Problem

***Which features of countries are associated with longer life expectancy?***

To answer this you could create scatter plots to explore the links between other variables and life expectancy.

## Importing the libraries and data

> Run the code box below to import the libraries and load the data.

In [None]:
import pandas as pd
import seaborn as sns

# import the data and check it by viewing the first few rows
countries_data = pd.read_csv('mei-lds-4.csv')
countries_data.head()

# Exploring the data
You can explore the data by finding the shape of the data set with `shape` and displaying the data types with `info()`.

> Run the next two code cells to:
> * find the number of rows and columns in the data;
> * show the data types of the columns.

In [None]:
# the number of rows and columns
countries_data.shape

In [None]:
# the data types
countries_data.info()

You can use `describe()` to explore the values of `Life expectancy at birth 2010`.

> Run the code below to display some statistics about life expectancy.

In [None]:
# calculate statistics for Life expectancy at birth 2010
countries_data['Life expectancy at birth 2010'].describe()

> Add and run code below to calculate statistics for
> * `Life expectancy at birth 1960`
> * `birth rate per 1000`

In [None]:
# calculate statistics for Life expectancy at birth 1960

In [None]:
# calculate statistics for birth rate per 1000

It is useful to have a visual representation of life expectancy as a single variable before looking for an association with other variables.

> Run the code below to view boxplots of life expectancy for 2010 by region.

In [None]:
# create box plots of life expectancy in 2010, grouped by region
sns.catplot(data=countries_data, kind='box', x='Life expectancy at birth 2010', y='Region', aspect=2);

> Add and run code below to create box plots grouped by region for `birth rate per 1000`.


In [None]:
# create box plots of birth rate, grouped by region

## Checkpoint 1

> * Which region has the highest median life expectancy?
> * Which region has the largest variation in life expectancy?
> * Which regions have the lowest birth rates? Which regions have the highest birth rates?
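
Readings taken from box plots can be checked numerically. The sketch below is one possible approach, not part of the original activity: it uses a small made-up table (the regions `A` and `B` and their values are invented for illustration) and pandas `groupby()` to calculate the median and interquartile range of life expectancy for each region. The same pattern should work on `countries_data`, assuming the column names match those used above.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the MEI data set
toy = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Life expectancy at birth 2010': [70.0, 72.0, 74.0, 50.0, 60.0, 80.0],
})

# group the rows by region and select the life expectancy column
grouped = toy.groupby('Region')['Life expectancy at birth 2010']

# the median is the line inside each box; the IQR is the width of the box
medians = grouped.median()
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)

summary = pd.DataFrame({'median': medians, 'IQR': iqr})
print(summary)
```

Here region `B` has the wider box (larger interquartile range), while region `A` has the higher median.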

# Analysing the data

## Plotting scatter diagrams

You are probably used to drawing scatter diagrams to look for association between two variables. You can use Seaborn's `relplot()` (relational plot) command to create a scatter diagram. The format of the `relplot()` command should look very similar to `catplot()` and `displot()`.

> Run the code below to draw a scatter diagram of `Life expectancy at birth 2010` against `GDP per capita (US$)`.

In [None]:
# create a scatter plot of life expectancy against GDP
sns.relplot(data=countries_data, x='GDP per capita (US$)', y='Life expectancy at birth 2010', aspect=2);

> Add and run code below to:
> * Create a scatter plot of life expectancy against unemployment;
> * Create a scatter plot of life expectancy against birth rate.

In [None]:
# create a scatter plot of life expectancy against unemployment

In [None]:
# create a scatter plot of life expectancy against birth rate

## Checkpoint 2

> * Do countries with higher GDP have longer life expectancy on average?
> * Describe how life expectancy changes as GDP increases.
> * Is there a strong association between unemployment rate and life expectancy?
> * Describe how life expectancy changes as birth rate increases.

## Adding a third categorical variable to a scatter diagram

You can also format the colour of points in a scatter diagram based on a third categorical value.

> Run the code below to create a scatter plot of life expectancy against GDP, colour-coded by region.

In [None]:
# create a scatter plot of life expectancy against GDP, colour-coded by region
sns.relplot(data=countries_data, x='GDP per capita (US$)', y='Life expectancy at birth 2010', hue='Region', aspect=2);

> Add and run code below to create a scatter plot of life expectancy against birth rate, colour-coded by region.

In [None]:
# create a scatter plot of life expectancy against birth rate, colour-coded by region

## Checkpoint 3

> Describe any differences in the association between GDP and life expectancy and between birth rate and life expectancy in different regions.

## Adding a third numerical variable to a scatter diagram

A useful feature of scatter diagrams in Seaborn is the ability to format the points based on a third variable. You can assign a numerical variable to the `size` of the points, so that larger points represent larger values.

> Run the code below to plot a scatter diagram where the size of the points represents `physician density (physicians/1000 population)`.

In [None]:
# create a scatter plot of life expectancy against GDP, where the size is determined by physicians per 1000
# sizes=(30,150) sets the range of sizes to be used
sns.relplot(data=countries_data, x='GDP per capita (US$)', y='Life expectancy at birth 2010', size='physician density (physicians/1000 population)', sizes=(30, 150), aspect=2);

> Add and run code below to create a scatter diagram of life expectancy against birth rate, with size determined by GDP.

In [None]:
# create a scatter plot of life expectancy against birth rate, with size determined by GDP

## Checkpoint 4

> * How are GDP and physician density collectively linked to life expectancy?
> * Which countries have a long life expectancy but a relatively low GDP? Try filtering the data to work this out.
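
As a hint for the filtering step, the sketch below shows one way to keep only the rows that satisfy two conditions at once. It uses a small made-up table: the `Country` column name, the values and the two thresholds are all assumptions for illustration, so check the real column names with `info()` and choose your own cut-offs from the scatter plot.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the MEI data set
# 'Country' is a hypothetical column name
toy = pd.DataFrame({
    'Country': ['P', 'Q', 'R', 'S'],
    'GDP per capita (US$)': [2000, 45000, 3000, 30000],
    'Life expectancy at birth 2010': [78.0, 81.0, 55.0, 79.0],
})

# hypothetical thresholds: each line creates a True/False value for every row
long_life = toy['Life expectancy at birth 2010'] > 75
low_gdp = toy['GDP per capita (US$)'] < 5000

# combine the two conditions with & and keep only the rows where both are True
result = toy[long_life & low_gdp]
print(result)
```

Only country `P` meets both conditions in this toy table.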

## Hexagonal bin plots

When you have a very large data set, with thousands or even millions of data points, scatter plots can be misleading. This is because the points start landing on top of each other, making it hard to tell if there are many points in an area or just a few. To solve this, some charts 'bin' the data first, in much the same way a histogram does. Unlike a histogram, which shows frequency using the height of bars, a bin plot shows density using shading, with darker colours representing higher densities.
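
The idea of binning in two dimensions can be sketched with NumPy. The example below is an illustration only: it uses synthetic points rather than the MEI data, and square bins rather than hexagons, but the principle is the same. Each cell of the grid stores the number of points that fall inside it, and a bin plot then shades each cell according to that count.

```python
import numpy as np

# Synthetic points, invented for illustration: most lie close to (0, 0)
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 10_000)
y = rng.normal(0, 1, 10_000)

# count the points that fall in each cell of a 4-by-4 grid of square bins
# covering -4 to 4 on both axes (points outside this range are not counted)
counts, x_edges, y_edges = np.histogram2d(x, y, bins=4, range=[[-4, 4], [-4, 4]])

print(counts.astype(int))
```

The four central cells hold far more points than the cells at the corners, which is information that 10,000 overlapping dots on a scatter plot would hide.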

Seaborn can create a hexagonal bin plot using its `jointplot()` function with `kind='hex'`, which also shows a histogram of each of the two variables along the edges of the plot.

> Run the code below to create a hexagonal bin plot of life expectancy against GDP.

In [None]:
# create a hexagonal bin plot
sns.jointplot(data=countries_data, kind='hex', x='GDP per capita (US$)', y='Life expectancy at birth 2010');

> Add code below to create a hexagonal bin plot of life expectancy against physician density.

In [None]:
# create a hexagonal bin plot

## Exploring the association between other variables and life expectancy
> * Add code below to explore the association between other variables and life expectancy, using `hue` or `size` to add a third variable as appropriate.
> * You can create additional code boxes using the **+ Code** button.

In [None]:
# create a scatter plot

# Communicating the result
## Checkpoint 5

> Use your analysis to answer the problem: ***Which features of countries are associated with longer life expectancy?***