Please note:

1. This document is a template for your project code notebook, and gives some hints, but is not an exemplar
2. Retain the 4 top level headings ("Imports", "Data", "Exploratory Data Analysis", "Data Communication and Analysis")
3. Edit, remove or insert other cells as you wish 
4. Before submission, **restart and run the whole Notebook from start to finish**, so that the numbers by the code cells start at `[1]` and increase in order (`[2]`, `[3]`, ...)
5. Then save as PDF using **File→Save and Export Notebook as→PDF**


# Imports

You can import any packages you would like, but make sure you know what they are doing.

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
mpl.rcParams['font.size'] = 8    # Set the font size to 8pt
mpl.rcParams['figure.dpi'] = 200 # Set the dots per inch to a reasonable compromise between resolution and size

<span style="display: none">\newpage</span>

# Data

Exploratory Data Analysis of your chosen dataset, which should demonstrate that you have understood the structure of the data, identified anomalous or missing data, and used visualisation and descriptive statistics to explore possible relationships in the data. 

In [None]:
# Load data
simd = pd.read_csv('datasets/simd.csv')

<span style="display: none">\newpage</span>

# Exploratory Data Analysis

The visualisation(s), tables and any statistical analyses or applications of ML to the data, as presented in your report.

In [None]:
simd.head()

We can see that each row corresponds to a `Data_Zone`, and the columns contain various statistics for each zone.

In [None]:
simd.info()

It looks like the data is in a wide format. Some columns (e.g. `Income_rate`) are objects, even though we might expect them to be numbers. Judging by the output above, this is probably because they are expressed as percentage strings. Depending on what analysis we do, we might want to convert these columns to numeric ones in the Data section, in the next iteration of this notebook.
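
As a minimal sketch of such a conversion, the cell below uses a small made-up DataFrame rather than the real `simd` data. The sample values, the `'*'` placeholder for a suppressed value, and the new column name `Income_rate_num` are assumptions for illustration only; check what the real column actually contains before adapting it.

```python
import pandas as pd

# Hypothetical sample values standing in for a percentage column such as Income_rate
toy = pd.DataFrame({'Income_rate': ['12%', '7%', '0%', '*']})

# Strip the trailing '%' and convert to numbers.
# errors='coerce' turns anything that cannot be parsed (here '*') into NaN
toy['Income_rate_num'] = pd.to_numeric(
    toy['Income_rate'].str.rstrip('%'), errors='coerce'
)

print(toy.dtypes)  # Income_rate_num should now be a float column
print(toy)
```

Using `errors='coerce'` hides unparseable values as missing data, so it is worth counting the resulting NaNs before relying on the converted column.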

In [None]:
simd.describe()

We can see some summary stats for the numeric items above. We should check that the max and min values of the columns we're using make sense.
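
One possible way to make that check explicit is sketched below on a made-up DataFrame. The column names are borrowed from `simd`, but the values (including the deliberately impossible negative population and the missing drive time) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one impossible population and one missing drive time
toy = pd.DataFrame({
    'Total_population': [850, 912, -1, 760],
    'drive_GP': [3.2, np.nan, 5.1, 48.0],
})

# One row per column, showing its min, max and number of missing values
summary = toy.agg(['min', 'max']).T
summary['n_missing'] = toy.isna().sum()
print(summary)

# Pull out rows that break a simple plausibility rule (a population cannot be negative)
negative_rows = toy[toy['Total_population'] < 0]
print(negative_rows)
```

What counts as a plausible range depends on the variable, so the rule used here is only an example.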

In [None]:
sns.pairplot(data=simd)

A pairplot is often helpful in understanding numeric data. As this is in the EDA section, we don't mind if it's not easily legible, but you may want to reduce the number of columns shown so you can see what the labels are.

<span style="display: none">\newpage</span>

# Data Communication and Analysis

In this section, present the actual plots and figures used in your analysis.

In [None]:
# If you make the plot wider than 6in, it will have to be shrunk to fit in the document, and the font size will become too small
plt.figure(figsize=(6,4))
sns.scatterplot(data=simd, x='drive_GP', y='DRUG', s=10)
plt.xscale('log')
plt.xlabel('Drive to GP (minutes)')
plt.ylabel('Drug use')
plt.savefig('example1-large.png') # Save the figure as a PNG

In [None]:
plt.figure(figsize=(3,2))
sns.scatterplot(data=simd, x='drive_GP', y='DRUG', s=5)
plt.xscale('log')
plt.xlabel('Drive to GP (minutes)')
plt.ylabel('Drug use')
plt.tight_layout()
plt.savefig('example1-small.png') # Save the figure as a PNG

You can export tables to LaTeX with the pandas `Styler` object.

In [None]:
# This is just to make a smaller table to output
simd_output = simd.head().copy()
# Table headings are better without underscores - and LaTeX doesn't like underscores
simd_output.columns = simd_output.columns.str.replace('_', ' ')
# Save to a LaTeX file
(
    simd_output[['Data Zone', 'Council area', 'Total population', 'Income rate']]
        .style # Styler object
        .format(escape='latex') # Escape LaTeX characters
        .hide(axis=0) # Hide the index
        .to_latex('simd-table.tex', hrules=True)
)

In [None]:
simd.columns