### Patient Blood Pressure Analysis

### Working with Structured Data
### Introduction to Data Science

### Last Updated: 12/14/2022

In [27]:
import numpy as np
import pandas as pd

In [49]:
# set path to datafile
PATH_TO_DATA = '../datasets/blood_pressure.csv'

### Source of Data

This is synthetic blood pressure patient data.

In [50]:
df = pd.read_csv(PATH_TO_DATA)
df

Unnamed: 0,patientid,date,bp_systolic,bp_diastolic
0,1,02/01/22,120,76.0
1,1,02/02/22,127,75.0
2,1,02/03/22,127,70.0
3,1,02/04/22,127,76.0
4,1,02/05/22,-20,74.0
...,...,...,...,...
94,4,02/04/22,125,76.0
95,4,02/05/22,125,72.0
96,4,02/06/22,127,76.0
97,4,02/07/22,125,76.0


Quickly check the datatypes and data summary.

In [51]:
df.dtypes

patientid         int64
date             object
bp_systolic       int64
bp_diastolic    float64
dtype: object

In [52]:
df.describe()

Unnamed: 0,patientid,bp_systolic,bp_diastolic
count,99.0,99.0,98.0
mean,2.59596,124.171717,75.214286
std,0.978537,16.744198,2.450542
min,1.0,-20.0,70.0
25%,2.0,122.0,73.25
50%,3.0,125.0,76.0
75%,3.0,128.0,77.0
max,4.0,200.0,79.0


How many unique patients are there?

In [53]:
df.patientid.nunique()

4

### Data Issue 1: Missingness

We notice the counts for bp_systolic and bp_diastolic are different. This suggests missing data.  
Let's check for it:

In [54]:
df[df.bp_diastolic.isnull()]

Unnamed: 0,patientid,date,bp_systolic,bp_diastolic
6,1,02/07/22,121,
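
A quick way to confirm which columns contain missing values is to count them per column. The sketch below uses a small made-up frame (`demo` is illustrative, not the notebook's data) so it runs on its own:

```python
import numpy as np
import pandas as pd

# Made-up readings for illustration only (not the notebook's data).
demo = pd.DataFrame({
    "bp_systolic":  [120, 127, 121],
    "bp_diastolic": [76.0, 75.0, np.nan],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

A column with a non-zero count here is the same column whose `count` in `describe()` falls short of the others.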


There are several ways to handle a missing value, such as dropping the record or imputing a replacement. Here, we will use the last value carried forward.  
The assumption is that the diastolic pressure hasn't changed since the previous reading.

In [55]:
df.ffill(inplace=True)

Checking for missing values again, we see that the record was imputed.

In [56]:
df[df.bp_diastolic.isnull()]

Unnamed: 0,patientid,date,bp_systolic,bp_diastolic
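
To make the trade-off between imputation methods concrete, here is a minimal sketch comparing three common options on a small made-up frame. The `demo` data and the choice of the per-patient median as an alternative are assumptions for illustration; only last value carried forward is the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up readings for one patient (not the notebook's data).
demo = pd.DataFrame({
    "patientid":    [1, 1, 1, 1],
    "bp_diastolic": [76.0, 70.0, np.nan, 74.0],
})

# Option 1: drop rows with a missing reading (loses the whole record).
dropped = demo.dropna(subset=["bp_diastolic"])

# Option 2: last value carried forward, the approach used above.
locf = demo["bp_diastolic"].ffill()

# Option 3: fill with the patient's own median reading.
patient_median = demo.groupby("patientid")["bp_diastolic"].transform("median")
median_filled = demo["bp_diastolic"].fillna(patient_median)

print(len(dropped))            # 3
print(locf.tolist())           # [76.0, 70.0, 70.0, 74.0]
print(median_filled.tolist())  # [76.0, 70.0, 74.0, 74.0]
```

Each option encodes a different assumption: dropping treats the record as uninformative, carrying forward assumes no change since the last visit, and the median assumes the patient's typical value.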


### Data Issue 2: Extreme and Non-Sensical Values

Looking at `bp_systolic`, we see a negative value (-20) and a very high value (200).  
Blood pressure cannot be negative, so the negative value must be an error. There are different options for handling this case.  
Here, we set a rule that replaces negative values with the previous measurement.

In [59]:
# treat negative readings as missing, then carry the previous reading forward
df['bp_systolic'] = df['bp_systolic'].mask(df['bp_systolic'] < 0).ffill()
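
One caveat with this rule: a plain forward fill can borrow the "previous measurement" from a different patient when the bad reading is a patient's first row. The sketch below shows a per-patient variant on made-up data (`demo` is illustrative and not the notebook's data; grouping by `patientid` is an assumption about the intended behavior):

```python
import numpy as np
import pandas as pd

# Made-up readings: patient 2's first value is negative.
demo = pd.DataFrame({
    "patientid":   [1, 1, 1, 2, 2],
    "bp_systolic": [120, 127, -20, -5, 125],
})

# Replace negative readings with NaN (the column becomes float).
demo["bp_systolic"] = demo["bp_systolic"].mask(demo["bp_systolic"] < 0)

# Forward fill within each patient, so values never cross patients.
demo["bp_systolic"] = demo.groupby("patientid")["bp_systolic"].ffill()

print(demo)
```

Patient 1's negative reading is replaced with 127, while patient 2's first reading stays missing because there is no earlier value for that patient to carry forward.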

---

For patient 2, the systolic measurement on 02/01/22 is 200, which is out of line with the other values.  
A reading this high is physiologically possible, though clinically concerning. Because it may be a genuine measurement, we choose to leave it in place.

In [48]:
df[df.patientid == 2]

Unnamed: 0,patientid,date,bp_systolic,bp_diastolic
13,2,01/15/22,127.0,79.0
14,2,01/16/22,128.0,78.0
15,2,01/17/22,123.0,76.0
16,2,01/18/22,128.0,71.0
17,2,01/19/22,123.0,75.0
18,2,01/20/22,129.0,73.0
19,2,01/21/22,128.0,77.0
20,2,01/22/22,127.0,74.0
21,2,01/23/22,126.0,73.0
22,2,01/24/22,121.0,74.0
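
Leaving an extreme value in place does not mean ignoring it. One option is to add a flag column so the reading can be reviewed later. This is a minimal sketch on made-up data; the `needs_review` column and the `REVIEW_THRESHOLD` cutoff of 180 are hypothetical names and values chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up readings (not the notebook's data).
demo = pd.DataFrame({
    "patientid":   [2, 2, 2, 2],
    "bp_systolic": [127.0, 200.0, 123.0, 128.0],
})

# Hypothetical review cutoff, chosen for illustration only.
REVIEW_THRESHOLD = 180

# Keep the value as recorded, but mark it for follow-up.
demo["needs_review"] = demo["bp_systolic"] >= REVIEW_THRESHOLD

print(demo[demo["needs_review"]])
```

The original measurement is untouched, and downstream analyses can choose whether to include or exclude flagged rows.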


### Conclusions

In general, we need to be thoughtful about missing data and extreme observations.  
For each variable, we should consider what each handling strategy assumes about the data and choose the one that fits best.

---