# Import Libraries

In [1]:
import pandas as pd

# Load Datasets

In [3]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')

# Data Quality issues with the treatments table

In [5]:
treatments.head(4)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35


Some values in the `hba1c_change` column are missing and stored as NaN. The `auralin` and `novodra` columns also have missing values, but they are recorded as '-', which `isna()` will not detect.

In [6]:
treatments.tail(4)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
276,john,teichelmann,-,49u - 49u,7.9,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36
279,samúel,guðbrandsson,53u - 56u,-,8.0,7.64,0.36


## Viewing random entries of the dataframe 

The `.sample()` method returns randomly chosen rows of the dataframe. With no arguments it returns a single row.

In [11]:
treatments.sample()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
18,addolorata,lombardi,-,49u - 46u,7.75,7.33,


Viewing multiple random entries

In [21]:
treatments.sample(5)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
41,mahmud,kadyrov,-,44u - 43u,7.53,7.11,0.92
147,krisztián,bakos,-,60u - 60u,7.75,7.29,0.96
143,nora,nyborg,55u - 59u,-,7.83,7.48,0.35
82,maya,isaksson,33u - 41u,-,7.66,7.17,
249,kang,mai,-,39u - 36u,7.78,7.45,0.33


To get the same random sample every time, set the `random_state` parameter of `sample()` to a fixed integer.

In [17]:
treatments.sample(5,random_state=2)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
157,asuna,morita,-,35u - 39u,7.58,7.25,0.33
7,eddie,archer,31u - 38u,-,7.89,7.55,0.34
99,abel,yonatan,-,38u - 39u,7.88,7.5,
13,gregor,bole,-,47u - 45u,7.61,7.16,0.95
112,olof,holm,39u - 52u,-,7.85,7.43,


## 1. Completeness

Completeness can be checked with the `.info()` method, which reports the non-null count of every column. It only detects values stored as NaN, so placeholder values such as '-' need a separate check.

In [22]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


The `hba1c_change` column has only 171 non-null values out of the expected 280, so 109 values are missing.
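
The missing count can also be computed directly instead of being read off `.info()`. The sketch below uses a small hypothetical `demo` dataframe built from the first four rows shown earlier; on the full table the same call would be expected to report 280 - 171 = 109 for `hba1c_change`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample modeled on treatments.head(4)
demo = pd.DataFrame({
    "hba1c_start": [7.63, 7.56, 7.68, 7.97],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62],
    "hba1c_change": [np.nan, 0.97, np.nan, 0.35],
})

# isna() marks NaN cells as True; summing the booleans counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```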

In [23]:
treatments.head(1)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,


The `auralin` and `novodra` columns also contain missing values, but these are labelled '-' rather than NaN, so `.info()` counts them as non-null.

## 2. Validity

The datatypes can be seen using the `.dtypes` attribute.

In [24]:
treatments.dtypes

given_name       object
surname          object
auralin          object
novodra          object
hba1c_start     float64
hba1c_end       float64
hba1c_change    float64
dtype: object

In [25]:
treatments['auralin']

0      41u - 48u
1              -
2              -
3      33u - 36u
4              -
         ...    
275    45u - 51u
276            -
277    23u - 36u
278    31u - 38u
279    53u - 56u
Name: auralin, Length: 280, dtype: object

The `auralin` and `novodra` columns have the `object` (string) datatype. Since they represent numeric doses, they should ideally be stored as integers or floats so the values are easier to work with.
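
One possible way to turn a dose string such as `41u - 48u` into numbers is sketched below. The column names `dose_start` and `dose_end` are assumptions made for this example and do not exist in the dataset.

```python
import pandas as pd

# Sample values copied from the auralin column
doses = pd.Series(["41u - 48u", "-", "33u - 36u"], name="auralin")

# Two capture groups: the digits before each "u"
parsed = doses.str.extract(r"(\d+)u\s*-\s*(\d+)u")
parsed.columns = ["dose_start", "dose_end"]  # hypothetical column names

# Rows holding only "-" do not match the pattern and stay missing;
# the nullable Int64 dtype keeps integers alongside missing values
parsed = parsed.apply(pd.to_numeric).astype("Int64")
print(parsed)
```

The nullable `Int64` dtype is used here because a plain `int64` column cannot hold missing values.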

## 3. Accuracy

The `.describe()` method can be used to check for outliers (to see if anything is out of range) as well as see if the values are calculated correctly

In [26]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


The largest `hba1c_change` should be around 0.5, yet the maximum is 0.99 and the 75th percentile is 0.92, which indicates an accuracy issue. One affected entry is the row with the maximum value:

In [41]:
# idxmax() returns an index label, so .loc is the matching accessor
treatments.loc[treatments['hba1c_change'].idxmax()]

given_name            laura
surname         ehrlichmann
auralin                   -
novodra           43u - 40u
hba1c_start            7.95
hba1c_end              7.46
hba1c_change           0.99
Name: 32, dtype: object

The change was calculated incorrectly: 7.95 - 7.46 = 0.49, not 0.99. The recorded value is off by 0.5, which confirms the accuracy problem.
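
To find every affected row rather than only the maximum, the change can be recomputed and compared with the recorded value. This is a minimal sketch on three rows copied from the outputs above; the 0.005 tolerance is an assumption chosen to absorb rounding to two decimals.

```python
import numpy as np
import pandas as pd

# Three rows copied from the outputs above
demo = pd.DataFrame({
    "surname": ["ehrlichmann", "gormanston", "richardson"],
    "hba1c_start": [7.95, 7.97, 7.56],
    "hba1c_end": [7.46, 7.62, 7.09],
    "hba1c_change": [0.99, 0.35, 0.97],
})

# Recompute the change from the start and end readings
expected = (demo["hba1c_start"] - demo["hba1c_end"]).round(2)

# Flag rows whose recorded change differs from the recomputed one
mismatch = ~np.isclose(demo["hba1c_change"], expected, atol=0.005)
suspect = demo.loc[mismatch, ["surname", "hba1c_change"]].assign(expected=expected[mismatch])
print(suspect)
```

In this sample both flagged rows are off by exactly 0.5, which hints at a systematic entry error rather than random noise.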

## 4. Validity

As mentioned previously, some missing values are stored as '-' instead of NaN. This is a validity issue, since missing values should be represented as NaN so that methods such as `isnull()` can detect them.

In [42]:
sum(treatments.auralin.isnull())

0

In [43]:
sum(treatments.novodra.isnull())

0

In [45]:
len(treatments[treatments['auralin']=='-']),len(treatments[treatments['novodra']=='-'])

(143, 137)
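
One way to make these placeholders visible to `isnull()` is to replace them with NaN. The sketch below works on a hypothetical four-row `demo` dataframe copied from the first rows of the table, and leaves the original data untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical sample modeled on treatments.head(4)
demo = pd.DataFrame({
    "auralin": ["41u - 48u", "-", "-", "33u - 36u"],
    "novodra": ["-", "40u - 45u", "39u - 36u", "-"],
})

# replace() matches whole cell values here, so the hyphen inside "41u - 48u" is kept
cleaned = demo.replace("-", np.nan)

# The placeholders are now counted as missing
missing_after = cleaned.isna().sum()
print(missing_after)
```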

# Data Quality issues with the patients table

## 5. Consistency

In [46]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


The table above shows a minimum weight of 48.8 pounds. That is not plausible for a patient in this dataset, especially when the mean weight is about 173 pounds. The `sort_values()` method lists the weights in ascending order, which shows how far this value sits from the rest.

In [48]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

The weight may have been recorded in kilograms instead of pounds. To check this, the expected weight in pounds can be calculated from the BMI and the height (in inches) as weight = BMI × height² / 703.

In [49]:
patients[patients.weight.min()==patients.weight]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691.0,United States,330-202-2145CamillaZaitseva@superrito.com,11/26/1938,48.8,63,19.1


In [54]:
# Select the row with the minimum weight once and reuse it
row = patients[patients.weight == patients.weight.min()]
weight = (row.height ** 2 * row.bmi) / 703

weight

210    107.834851
dtype: float64

Another way to confirm this is to convert the recorded weight from kilograms to pounds and then recalculate the BMI. If the result matches the BMI in the dataset, the weight was recorded in kilograms rather than pounds.

In [55]:
weight_lbs = patients[patients.weight.min()==patients.weight].weight*2.20462

height_in = patients[patients.weight.min()==patients.weight].height

bmi_check = 703 * weight_lbs/ (height_in**2)
bmi_check

210    19.055827
dtype: float64

In [56]:
patients[patients.weight.min()==patients.weight].bmi

210    19.1
Name: bmi, dtype: float64

Rounded to one decimal place, the values are equal (19.06 rounds to 19.1), so the weight was recorded in kilograms.
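
The same check can be written as a small function that compares both unit hypotheses side by side. The names `LB_PER_KG` and `bmi_imperial` are introduced here for illustration only; the input values are those of patient 211 shown above.

```python
LB_PER_KG = 2.20462  # pounds per kilogram

def bmi_imperial(weight_lb, height_in):
    """BMI from weight in pounds and height in inches."""
    return 703 * weight_lb / height_in ** 2

# Values recorded for patient 211
recorded_weight = 48.8
height_in = 63
recorded_bmi = 19.1

# Hypothesis 1: the recorded weight is already in pounds
bmi_if_pounds = bmi_imperial(recorded_weight, height_in)

# Hypothesis 2: the recorded weight is in kilograms
bmi_if_kilograms = bmi_imperial(recorded_weight * LB_PER_KG, height_in)

print(round(bmi_if_pounds, 2), round(bmi_if_kilograms, 2))
```

Only the kilogram hypothesis reproduces the recorded BMI once rounded to one decimal place.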

## 6. Uniqueness + Validity

The uniqueness of a column can be checked with the `value_counts()` method. Any value with a count greater than one appears in more than one row and may indicate duplicates.

In [57]:
patients.surname.value_counts()

surname
Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: count, Length: 466, dtype: int64

In [58]:
patients.address.value_counts()

address
123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: count, Length: 483, dtype: int64
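
Because `value_counts()` truncates long outputs, it can help to keep only the values that occur more than once. A minimal sketch on a hypothetical list of surnames:

```python
import pandas as pd

# Hypothetical surnames; the real column has 466 distinct values
surnames = pd.Series(["Doe", "Doe", "Doe", "Taylor", "Taylor", "Mata"], name="surname")

counts = surnames.value_counts()

# Keep only the values that appear in more than one row
repeated = counts[counts > 1]
print(repeated)
```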

Alternatively, the `.duplicated()` method can be used to flag duplicated rows. The `subset` parameter specifies which columns to compare when looking for duplicates. By default, only the second and later occurrences are flagged.

In [59]:
patients[patients.duplicated(subset=['surname','address'])]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962.0,United States,304-438-2648SandraCTaylor@dayrep.com,10/23/1960,206.1,64,35.4
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3


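There are multiple John Doe entries at 123 Main Street, New York, NY, with the ZIP code 12345 and the email johndoe@email.com. The Jakobsen, Taylor, and Gersten rows also share a surname and address with an earlier row, and their email addresses (Jakob, Sandra, Patrick) suggest the same patients were entered again under a nickname.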
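
The output above lists only the later copies of each duplicate. To see every copy, including the first occurrence, `keep=False` can be passed to `.duplicated()`. The rows below are hypothetical and only modeled on the output above; in particular, the given name of the first Jakobsen row is a guess.

```python
import pandas as pd

# Hypothetical rows modeled on the duplicated output
demo = pd.DataFrame({
    "given_name": ["John", "John", "Jakob", "Jake", "Camilla"],
    "surname": ["Doe", "Doe", "Jakobsen", "Jakobsen", "Zaitseva"],
    "address": ["123 Main Street", "123 Main Street", "648 Old Dear Lane",
                "648 Old Dear Lane", "4689 Briarhill Lane"],
})

# Default keep='first': only the second and later occurrences are flagged
later_copies = demo[demo.duplicated(subset=["surname", "address"])]

# keep=False: every row that has at least one duplicate is flagged
all_copies = demo[demo.duplicated(subset=["surname", "address"], keep=False)]
print(all_copies)
```

Seeing both rows of a pair side by side makes it easier to decide which record to keep.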