## Protecting Life Data Preparation Script

- Version 1.2.1 - NHS Pycom Version Built - 22/05/2022
- Version 1.1.3 - Current Version Demo delivered to Divisional Management - 14/04/2022
- Version 1.1.2 - Abstract Submitted to BSG Conference 2022 - 25/02/2022
- Version 1.1.1 - Basic MVP Built - 23/02/2022

#### Authors:

1. Matt Stammers - Consultant Gastroenterologist and Data Scientist @ AXIS, UHS
2. Michael George - Data Engineering Lead @ AXIS, UHS

What this Script Does:
- Loads in the data prepared by the data preparation script
- Performs bulk descriptive statistical analysis on the data after dividing the cohort into two groups
- Performs further statistical analysis on the data
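
The grouped descriptive step listed above can be pictured with a minimal sketch. It is an illustration only, not the pipeline's actual code: the frame, the column names `group` and `age`, and the helper `iqr` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names "group" and "age" are illustrative only
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "age": [61, 70, 58, 45, 52, 49],
})

def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return np.percentile(x, 75) - np.percentile(x, 25)

# One row per group, one column per summary statistic
summary = df.groupby("group")["age"].agg(["count", "mean", "median", iqr])
print(summary)
```

Passing a list to `agg` mixes built-in reducers (named by string) with custom functions, which is one way bespoke statistics could be applied to each group in a single call.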

#### The first stage in an analytics pipeline is to load in the analytics packages - ideally keeping this as slim as possible

We have used only a small, simple selection of packages in this analytics pipeline, which makes it much easier to build upon later. Where possible we have coded the statistical functions ourselves to make the code even easier to understand.

In [1]:
# Import Key Packages

# Import base packages
import math
from datetime import datetime
from datetime import timedelta

# Import data manipulation packages
import numpy as np
import pandas as pd

# Import Statistical Packages

from scipy import stats
import statsmodels.api as sm

# Import Plotting Packages

import matplotlib.pyplot as plt
import seaborn as sns

#### Pandas settings adjustment

When performing analytics I typically adjust the base pandas display settings so that entire data frames are visible. This is optional: you can change the values below in any way you wish.

In [2]:
# Adjust settings to see entire frame:

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('mode.chained_assignment', 'warn')
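
If you would rather not change the settings for the whole session, a minimal sketch of a temporary alternative is shown here. It uses `pd.option_context`, which applies options only inside a `with` block; the frame `demo` is hypothetical and exists only to have something to render.

```python
import pandas as pd

# Hypothetical frame, present only so there is something to display
demo = pd.DataFrame({"value": range(5)})

rows_before = pd.get_option("display.max_rows")

# Options passed to option_context apply only inside the with-block
with pd.option_context("display.max_rows", 500, "display.max_columns", 500):
    rows_inside = pd.get_option("display.max_rows")
    rendered = demo.to_string()

# On leaving the block the previous value is restored automatically
rows_after = pd.get_option("display.max_rows")
print(rows_before, rows_inside, rows_after)
```

A single option can also be returned to its library default with `pd.reset_option("display.max_rows")`.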

#### Below is the function block. Here we define our own user-generated functions to help with learning

We have written these functions out ourselves for the sake of learning and clarity, although equivalents exist in other packages. The exception is anything provided by default in numpy and pandas, which we use directly because those implementations are highly reliable. If you spot a mistake please let us know!

In [3]:
# Statistical Functions

def IQR(x):
    """Computes Interquartile Range"""
    lb = stats.scoreatpercentile(x, 25) # Calculates 25th percentile
    ub = stats.scoreatpercentile(x, 75) # Calculates 75th percentile
    return ub - lb # Subtracts one from the other

def IQR_lower(x):
    """Computes Interquartile Range Lower Quartile"""
    lb = stats.scoreatpercentile(x, 25) # Calculates 25th percentile
    return lb

def IQR_upper(x):
    """Computes Interquartile Range Upper Quartile"""
    ub = stats.scoreatpercentile(x, 75) # Calculates 75th percentile
    return ub

def CI_lower(x):
    """Computes 95% Confidence Interval Lower Bound of the Mean"""
    alpha = 0.05                            # significance level = 5%
    degfree = len(x) - 1                    # degrees of freedom
    t = stats.t.ppf(1 - alpha/2, degfree)   # two-sided 95% t critical value
    s = np.std(x, ddof=1)                   # sample standard deviation (ddof=1; the numpy default is ddof=0)
    n = len(x)
    m = np.mean(x)
    return round(m - (t * s / np.sqrt(n)), 4)

def CI_upper(x):
    """Computes 95% Confidence Interval Upper Bound of the Mean"""
    alpha = 0.05                            # significance level = 5%
    degfree = len(x) - 1                    # degrees of freedom
    t = stats.t.ppf(1 - alpha/2, degfree)   # two-sided 95% t critical value
    s = np.std(x, ddof=1)                   # sample standard deviation (ddof=1; the numpy default is ddof=0)
    n = len(x)
    m = np.mean(x)
    return round(m + (t * s / np.sqrt(n)), 4)

# Note: the two median functions centre the t-based interval for the mean on the
# median. This is an approximation rather than an exact interval for the median.

def CI_lower_median(x):
    """Computes approximate 95% Confidence Interval Lower Bound of the Median"""
    alpha = 0.05                            # significance level = 5%
    degfree = len(x) - 1                    # degrees of freedom
    t = stats.t.ppf(1 - alpha/2, degfree)   # two-sided 95% t critical value
    s = np.std(x, ddof=1)                   # sample standard deviation (ddof=1; the numpy default is ddof=0)
    n = len(x)
    m = np.median(x)
    return round(m - (t * s / np.sqrt(n)), 4)

def CI_upper_median(x):
    """Computes approximate 95% Confidence Interval Upper Bound of the Median"""
    alpha = 0.05                            # significance level = 5%
    degfree = len(x) - 1                    # degrees of freedom
    t = stats.t.ppf(1 - alpha/2, degfree)   # two-sided 95% t critical value
    s = np.std(x, ddof=1)                   # sample standard deviation (ddof=1; the numpy default is ddof=0)
    n = len(x)
    m = np.median(x)
    return round(m + (t * s / np.sqrt(n)), 4)
    
# To calculate empirical cumulative distribution functions if required
    
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y
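
As a sanity check on hand-written helpers like those above, a short sketch can compare a manual 95% t-interval for the mean against `scipy.stats.t.interval` and read a proportion off an ECDF. The array `sample` is hypothetical data made up for illustration, and the manual interval here uses the sample standard deviation (`ddof=1`).

```python
import numpy as np
from scipy import stats

# Hypothetical sample, used only to illustrate the calculations
sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0])
n = len(sample)

# Manual 95% t-interval for the mean, using the sample SD (ddof=1)
t_crit = stats.t.ppf(0.975, n - 1)
half_width = t_crit * np.std(sample, ddof=1) / np.sqrt(n)
manual_ci = (np.mean(sample) - half_width, np.mean(sample) + half_width)

# The same interval from scipy; stats.sem uses ddof=1 by default
scipy_ci = stats.t.interval(0.95, n - 1, loc=np.mean(sample), scale=stats.sem(sample))
print(np.allclose(manual_ci, scipy_ci))

# ECDF: sorted values paired with cumulative proportions
x = np.sort(sample)
y = np.arange(1, n + 1) / n

# Proportion of observations less than or equal to 5
prop_le_5 = np.searchsorted(x, 5.0, side="right") / n
print(prop_le_5)
```

Agreement between the manual and library intervals is a quick way to catch a wrong `ddof` or a mistaken critical value in bespoke code.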

#### Load in the Datasets

Now we are ready to load in the datasets from the processing steps.