# Oral Insulin Phase II Clinical Trial Data

## Introduction

### DISCLAIMER: This data isn't 'REAL'

### Diabetes

Diabetes has become increasingly prevalent in the 21st century. Common symptoms include:
* unusual thirst
* frequent urination
* extreme fatigue

### Discovery of Insulin
In the 1920s, insulin was discovered by Frederick Banting. Most of the food we eat is turned into glucose, or sugar, for our bodies to use for energy. The pancreas, an organ near the stomach, makes a hormone called insulin to help glucose get into the cells of our bodies. When you have diabetes, your body either doesn't make enough insulin or can't use its own insulin as well as it should. This causes sugar to build up in the blood.

With Banting's discovery of insulin, pharmaceutical companies began large-scale production of insulin. Although it doesn't cure diabetes, it's one of the biggest discoveries in medicine. When it came, it was like a miracle.

### Challenges with Insulin
The default method of administration is injection by needle, multiple times a day. Insulin pumps are a more recent invention: insulin-delivering devices that are semi-permanently connected to a diabetic's body.

### The Future: Oral Insulin?
Wouldn't it be great if diabetics could take insulin orally? This is an active area of research, but historically the roadblock has been getting insulin through the stomach's thick lining.

### Our Dataset: Auralin and Novodra Trials
We will be looking at the phase two clinical trial data of 350 patients for a new **innovative oral insulin** called **Auralin** - a proprietary capsule that can solve this stomach lining problem.

Phase two trials are intended to:
>* Test the efficacy and the dose response of a drug
>* Identify adverse reactions

In this trial, half of the patients (175) are being treated with Auralin, and the other 175 are being treated with a popular injectable insulin called Novodra. By comparing key metrics between these two drugs, we can determine whether Auralin is effective.

## Data Wrangling
### Gather

In [1]:
import pandas as pd
import numpy as np
from urllib.request import urlretrieve

I couldn't get the data from Udacity's learning platform, so I'm borrowing it from someone's GitHub profile.

In [2]:
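# couldn't get the data from Udacity's learning platform, so I'm borrowing it from
# someone's GitHub profile; the download below is wrapped in a triple-quoted string
# so that it does not re-run (remove the surrounding quotes to fetch the files again)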
'''
advsreact_Url = 'https://raw.githubusercontent.com/Jatin1998/Data-Wrangle-and-Analyze-with-Phase2-Clinic-Trial-Data/main/adverse_reactions.csv'
patients_Url = 'https://raw.githubusercontent.com/Jatin1998/Data-Wrangle-and-Analyze-with-Phase2-Clinic-Trial-Data/main/patients.csv'
treatments_Url = 'https://raw.githubusercontent.com/Jatin1998/Data-Wrangle-and-Analyze-with-Phase2-Clinic-Trial-Data/main/treatments.csv'

urlretrieve(advsreact_Url, 'adverse_reactions.csv')
urlretrieve(patients_Url, 'patients.csv')
urlretrieve(treatments_Url, 'treatments.csv')
'''

In [3]:
# load data
patients = pd.read_csv('patients.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')
treatments = pd.read_csv('treatments.csv')

### Assess
This Auralin Phase II clinical trial dataset comes in three tables: `patients`, `treatments`, and `adverse_reactions`.

In the cells below, each column of each table in this clinical trial dataset is described. To see the table that goes hand in hand with these descriptions, display each table in its entirety by displaying the pandas DataFrame that it was gathered into. This task is the mechanical part of visual assessment in pandas.

In [4]:
patients

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


`patients` columns:
- **patient_id**: the unique identifier for each patient in the [Master Patient Index](https://en.wikipedia.org/wiki/Enterprise_master_patient_index) (i.e. patient database) of the pharmaceutical company that is producing Auralin
- **assigned_sex**: the assigned sex of each patient at birth (male or female)
- **given_name**: the given name (i.e. first name) of each patient
- **surname**: the surname (i.e. last name) of each patient
- **address**: the main address for each patient
- **city**: the corresponding city for the main address of each patient
- **state**: the corresponding state for the main address of each patient
- **zip_code**: the corresponding zip code for the main address of each patient
- **country**: the corresponding country for the main address of each patient (all United States for this clinical trial)
- **contact**: phone number and email information for each patient
- **birthdate**: the date of birth of each patient (month/day/year). The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is age >= 18 *(there is no maximum age because diabetes is a [growing problem](http://www.diabetes.co.uk/diabetes-and-the-elderly.html) among the elderly population)*
- **weight**: the weight of each patient in pounds (lbs)
- **height**: the height of each patient in inches (in)
- **bmi**: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup> where kg is a person's weight in kilograms and m<sup>2</sup> is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. *The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is 16 <= BMI <= 38.*
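
Because *weight* is stored in pounds and *height* in inches, the BMI formula needs a unit conversion before it can be applied to this table. The following is a minimal sketch, assuming the standard conversion factors (1 lb = 0.45359237 kg, 1 in = 0.0254 m); the helper name `bmi_from_imperial` is hypothetical and not part of the original notebook.

```python
LB_TO_KG = 0.45359237  # kilograms per pound
IN_TO_M = 0.0254       # metres per inch

def bmi_from_imperial(weight_lb, height_in):
    """Return BMI (kg/m^2) from a weight in pounds and a height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2

# Zoe Wellish (patient_id 1): 121.7 lbs, 66 in, recorded bmi 19.6
print(round(bmi_from_imperial(121.7, 66), 1))

# Tim Neudorf (patient_id 5): 192.3 lbs, recorded bmi 26.1;
# compare a height of 72 in with the recorded 27 in
print(round(bmi_from_imperial(192.3, 72), 1))
print(round(bmi_from_imperial(192.3, 27), 1))
```

Recomputing BMI this way is also a quick sanity check on the *height* and *weight* columns: a recorded BMI that disagrees with the recomputed one points to a data entry error in one of the inputs.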

In [6]:
treatments

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.20
276,john,teichelmann,-,49u - 49u,7.90,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36


350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn't enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment, measured over the 24th week of treatment (the number after the dash). Both are measured in units (short form 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start` - `hba1c_end`. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).
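
The noninferiority criterion can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a normal approximation (z = 1.96) with an unpooled standard error, the function name `noninferiority_upper_bound` is hypothetical, and the input values are made up for demonstration rather than taken from the trial. For small samples a t-based interval would be more appropriate.

```python
import numpy as np

def noninferiority_upper_bound(novodra_change, auralin_change, z=1.96):
    """Upper bound of an approximate 95% CI for mean(novodra) - mean(auralin)."""
    novodra_change = np.asarray(novodra_change, dtype=float)
    auralin_change = np.asarray(auralin_change, dtype=float)
    diff = novodra_change.mean() - auralin_change.mean()
    # standard error of the difference between two independent sample means
    se = np.sqrt(novodra_change.var(ddof=1) / novodra_change.size
                 + auralin_change.var(ddof=1) / auralin_change.size)
    return diff + z * se

# made-up HbA1c changes, for illustration only (not trial results)
novodra_demo = [0.47, 0.43, 0.32, 0.45, 0.48]
auralin_demo = [0.43, 0.35, 0.36, 0.39, 0.31]

upper = noninferiority_upper_bound(novodra_demo, auralin_demo)
# the criterion described above: the upper bound must be below 0.4
print(upper, upper < 0.4)
```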

In [7]:
adverse_reactions

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


`adverse_reactions` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated with Auralin and Novodra)
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated with Auralin and Novodra)
- **adverse_reaction**: the adverse reaction reported by the patient

Additional useful information:
- [Insulin resistance varies person to person](http://www.tudiabetes.org/forum/t/how-much-insulin-is-too-much-on-a-daily-basis/9804/5), which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This [diversity](https://www.clinicalleader.com/doc/an-fda-perspective-on-patient-diversity-in-clinical-trials-0001) is reflected in the `patients` table.
- Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a [similar debate](https://softwareengineering.stackexchange.com/questions/176582/is-there-an-excuse-for-short-variable-names) exists for variable names). The *auralin* and *novodra* column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

Next, we'll use **programmatic assessment** to identify issues.

In [8]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [9]:
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [10]:
patients.tail()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3


In [11]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [12]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [13]:
treatments.sample(5)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
184,chân,bùi,31u - 42u,-,7.53,7.18,0.35
111,mevla,grabljevec,-,51u - 55u,7.72,7.44,0.28
99,abel,yonatan,-,38u - 39u,7.88,7.5,
262,alex,crawford,51u - 62u,-,7.69,7.3,0.39
150,manuela,cindrić,55u - 66u,-,8.07,7.76,0.31


In [14]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [15]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [16]:
adverse_reactions.adverse_reaction.value_counts()

hypoglycemia                 19
injection site discomfort     6
headache                      3
cough                         2
throat irritation             2
nausea                        2
Name: adverse_reaction, dtype: int64

In [17]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes


In [18]:
patients.state.unique()

array(['California', 'Illinois', 'Nebraska', 'NJ', 'AL', 'Florida', 'NV',
       'CA', 'MO', 'New York', 'MI', 'TN', 'VA', 'OK', 'GA', 'MT', 'MA',
       'NY', 'NM', 'IL', 'LA', 'PA', 'CO', 'ME', 'WI', 'SD', 'MN', 'FL',
       'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD', 'AZ', 'TX',
       'NE', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH', 'OR',
       nan, 'VT', 'ID', 'DC', 'AR'], dtype=object)

In [19]:
len(patients.query('city == "New York"'))

18

In [20]:
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 466, dtype: int64

In [21]:
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 483, dtype: int64

In [22]:
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [23]:
treatments.auralin.isnull().sum()

0

In [24]:
treatments.novodra.isnull().sum()

0

In [25]:
treatments[['auralin', 'novodra']]

Unnamed: 0,auralin,novodra
0,41u - 48u,-
1,-,40u - 45u
2,-,39u - 36u
3,33u - 36u,-
4,-,33u - 29u
...,...,...
275,45u - 51u,-
276,-,49u - 49u
277,23u - 36u,-
278,31u - 38u,-


#### Issues
**Quality**

`patients`
* *zip_code* is a float not a string ✅
* *zip_code* needs to be 5 digits ✅
* Data entry error at *height* col for 'Tim Neudorf'. 27in instead of 72in ✅
* The same state is expressed in different ways (e.g. 'NY', 'New York') ✅
* Incorrect *given_name* for the patient with *patient_id* 9 ✅
* Missing demographic information (address - contact columns) ✅
* Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns) ✅
* Multiple phone number formats ✅
* Presence of John Doe data (corrupted data) ✅
* Multiple records for Jakobsen, Gersten, Taylor ✅
* kgs instead of lbs for Zaitseva weight ✅

`treatments`
* 'u' next to the start dose and end dose in the *auralin* and *novodra* columns ✅
* Lowercase names ✅
* NaN values in *hba1c_change* column ✅
* Inaccurate HbA1c changes (leading 4s mistaken as 9s) ✅
* 280 records instead of 350 records ✅
* Erroneous datatypes (auralin and novodra columns) ✅
* Nulls represented as dashes (-) in *auralin* and *novodra* columns ✅

`adverse_reactions`
* Lowercase names ✅

**Tidiness**

* *contact* column in `patients` should be split into *phone_number* and *email* ✅
* three variables in two columns in `treatments` (*treatment*, *start_dose*, and *end_dose*) ✅
* Just two tables are needed. The *adverse_reaction* column of the `adverse_reactions` table should be included in the `treatments` table while the `patients` table remains the same. ✅
* Duplicate columns (i.e. *given_name* and *surname*) in `treatments` should be removed. ✅
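
Two of the `patients` quality issues listed above (float zip codes that lose their leading zero, and states written both in full and as abbreviations) can be sketched on a toy frame. This is one possible approach, not necessarily the cleaning code used in this notebook. The `state_abbrev` mapping is a hypothetical partial lookup that covers only the full names visible in the output of `patients.state.unique()`.

```python
import pandas as pd

# toy frame with values copied from the patients output; it has no missing
# zip codes, whereas the real column has nulls that must be handled first
toy = pd.DataFrame({
    'state': ['California', 'NJ', 'New York', 'ME'],
    'zip_code': [92390.0, 7095.0, 12771.0, 3852.0],
})

# float -> int -> str, then left-pad with zeros to five characters
toy['zip_code'] = toy['zip_code'].astype(int).astype(str).str.zfill(5)

# hypothetical partial lookup: full state name -> two-letter abbreviation
state_abbrev = {'California': 'CA', 'New York': 'NY', 'Illinois': 'IL',
                'Florida': 'FL', 'Nebraska': 'NE'}
toy['state'] = toy['state'].replace(state_abbrev)

print(toy)
```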

### Clean

Addressing the identified completeness issues first:
> `patients`
> * Missing demographic information (address - contact columns)

> `treatments`
> * NaN values in *hba1c_change* column and inaccurate *hba1c_change* values (leading 4s mistaken as 9s)
> * 280 records instead of 350 records.

Unfortunately, nothing can be done about the missing demographic information because there's no way of accessing that information until those patients come back.

In [26]:
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

#### Missing Data

**`treatments`: Missing records (280 records instead of 350)**

***Define***
* Download the CSV file containing the missing records.
* After checking that the column names match, append the missing records to `treatments_clean`.

***Code***

In [27]:
#url = 'https://raw.githubusercontent.com/BaekKyunShin/Data-Analyst-Nanodegree/master/Project4-Treatment_test/treatments_cut.csv'

#urlretrieve(url, 'treatments_cut.csv')

In [28]:
# loading the missing data
treatments_cut = pd.read_csv('treatments_cut.csv')
treatments_cut.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,jožka,resanovič,22u - 30u,-,7.56,7.22,0.34
1,inunnguaq,heilmann,57u - 67u,-,7.85,7.45,
2,alwin,svensson,36u - 39u,-,7.78,7.34,
3,thể,lương,-,61u - 64u,7.64,7.22,0.92
4,amanda,ribeiro,36u - 44u,-,7.85,7.47,0.38


In [29]:
# ensuring column names are similar
treatments_clean.columns.difference(treatments_cut.columns)

Index([], dtype='object')
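
Note that `Index.difference` is one-directional: it only reports labels found in `treatments_clean` that are absent from `treatments_cut`. A sketch of a two-way check on toy frames follows, using `Index.symmetric_difference`; the frame names `left` and `right` and the `extra` column are placeholders for illustration.

```python
import pandas as pd

left = pd.DataFrame(columns=['given_name', 'surname', 'auralin'])
right = pd.DataFrame(columns=['given_name', 'surname', 'auralin', 'extra'])

# one-directional: columns in left that are missing from right
print(left.columns.difference(right.columns))

# two-directional: columns present in exactly one of the two frames
print(left.columns.symmetric_difference(right.columns))
```

An empty result from the symmetric difference means both frames have the same set of column names, which is the condition that matters before concatenating.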

In [30]:
# appending missing data to our treatments df
treatments_clean = pd.concat([treatments_clean, treatments_cut], ignore_index=True)

***Test***

It is expected that we'll have 350 records.

In [31]:
len(treatments_clean)

350

**`treatments`: NaN values in *hba1c_change* column and inaccurate *hba1c_change* values (leading 4s mistaken as 9s)**

***Define***

Before we handle the NaN values, let's look more closely at the 4s and 9s error.
* Since `hba1c_change` = `hba1c_start` - `hba1c_end`, do the element-wise subtraction and store it in a test column, `testColumn`
* Isolate the rows without NaNs
* Find the rows where `hba1c_change` != `testColumn` (this is actually an assessment step)
* Confirm the error (✅)
* Delete the current *hba1c_change* column
* Rename `testColumn` to *hba1c_change*. This handles the NaN values too

***Code***

In [32]:
treatments_clean['testColumn'] = treatments_clean['hba1c_start'] - treatments_clean['hba1c_end']
treatments_clean.testColumn = treatments_clean.testColumn.round(2)

In [33]:
treatment_mistakenrows = treatments_clean.dropna().query('hba1c_change != testColumn')

In [34]:
len(treatment_mistakenrows)

68

In [35]:
treatment_mistakenrows.sample(60).sort_index()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change,testColumn
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97,0.47
13,gregor,bole,-,47u - 45u,7.61,7.16,0.95,0.45
17,gina,cain,-,36u - 36u,7.88,7.4,0.98,0.48
25,benoît,bonami,-,44u - 43u,9.82,9.4,0.92,0.42
26,suhaim,rahal,-,49u - 47u,7.94,7.5,0.94,0.44
27,mizuki,iwata,-,45u - 46u,7.7,7.23,0.97,0.47
32,laura,ehrlichmann,-,43u - 40u,7.95,7.46,0.99,0.49
35,csaba,sági,-,35u - 29u,7.88,7.48,0.9,0.4
40,ásta,grímsdóttir,-,29u - 30u,7.62,7.16,0.96,0.46
41,mahmud,kadyrov,-,44u - 43u,7.53,7.11,0.92,0.42
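
In every row shown above, the recorded `hba1c_change` exceeds the recomputed value by exactly 0.5, which is what a leading 4 misread as a 9 would produce. A small sketch of that check follows, using four of the rows from the output above; the frame name `shown` is a placeholder.

```python
import pandas as pd

shown = pd.DataFrame({
    'hba1c_start': [7.56, 7.61, 7.88, 9.82],
    'hba1c_end': [7.09, 7.16, 7.40, 9.40],
    'hba1c_change': [0.97, 0.95, 0.98, 0.92],
})

# recompute the change and measure how far the recorded value is from it
recomputed = (shown['hba1c_start'] - shown['hba1c_end']).round(2)
offset = (shown['hba1c_change'] - recomputed).round(2)

# a single unique offset indicates a systematic error rather than random noise
print(offset.unique())
```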


In [36]:
# dropping the original hba1c_change column
treatments_clean.drop('hba1c_change', axis=1, inplace=True)

In [37]:
# rename testColumn as hba1c_change
treatments_clean.rename(columns={'testColumn': 'hba1c_change'}, inplace=True)

***Test***

In [38]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [39]:
treatments_clean.isnull().sum()

given_name      0
surname         0
auralin         0
novodra         0
hba1c_start     0
hba1c_end       0
hba1c_change    0
dtype: int64

In [40]:
# confirm that every hba1c_change equals hba1c_start - hba1c_end (rounded to 2 d.p.)
expected = (treatments_clean['hba1c_start'] - treatments_clean['hba1c_end']).round(2)
for i in treatments_clean.index[treatments_clean['hba1c_change'] != expected]:
    print("Error at index: ", i)

#### Tidiness

***contact* column in `patients` should be split into *phone_number* and *email***

**Define**

* Write regular expressions that capture the email and the phone number
* Test the patterns on a sample of the *contact* column
* Using `patients_clean.contact.str.extract` and the patterns, extract the emails and phone numbers
* Save the extracts to the *email* and *phone_no* columns
* Test to see if it worked
* Drop the *contact* column

**Code**

In [41]:
# regular expressions for capturing emails and phone numbers
rgx_email = r'([a-zA-Z][\w.+-]+@[\w-]+\.[\w.-]+[a-zA-Z])'
rgx_phoneNo =  r'(((\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4})'

In [42]:
# preliminary test
a = patients.contact.sample(15)
b = a.str.extract(rgx_email)
c = a.str.extract(rgx_phoneNo)

pd.DataFrame([a, b[0], c[0]]).T

Unnamed: 0,contact,0,0.1
383,TorbenMMortensen@einrot.com1 208 657 2473,TorbenMMortensen@einrot.com,1 208 657 2473
346,HelenLuwam@dayrep.com607-206-1483,HelenLuwam@dayrep.com,607-206-1483
394,903-939-1025AnjaMueller@gustr.com,AnjaMueller@gustr.com,903-939-1025
224,802-614-0812VeronikaJindrova@jourrapide.com,VeronikaJindrova@jourrapide.com,802-614-0812
299,208-826-1678LindaJLundy@gustr.com,LindaJLundy@gustr.com,208-826-1678
379,580-622-5674RovzanKishiev@armyspy.com,RovzanKishiev@armyspy.com,580-622-5674
441,559-693-6779LuisRibeiroSilva@jourrapide.com,LuisRibeiroSilva@jourrapide.com,559-693-6779
96,207-768-0477NasimSumaiyaSalib@einrot.com,NasimSumaiyaSalib@einrot.com,207-768-0477
416,631-664-4813DaniMuhsinAntoun@armyspy.com,DaniMuhsinAntoun@armyspy.com,631-664-4813
456,ZikoranaudodimmaChinedum@cuvox.de1 757 269 6500,ZikoranaudodimmaChinedum@cuvox.de,1 757 269 6500


In [43]:
# storing extracted emails and phone_no in new columns in patients_clean
patients_clean['email'] = patients_clean.contact.str.extract(rgx_email)[0]
patients_clean['phone_no'] = patients_clean.contact.str.extract(rgx_phoneNo)[0]
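
The `[0]` is needed because `rgx_phoneNo` contains three capturing groups, so `str.extract` returns one column per group and only the outermost group (column 0) holds the full number. The sketch below shows this on three contact strings taken from the outputs above. The non-capturing variant `rgx_phone_nc` is a hypothetical alternative, not the pattern used in this notebook.

```python
import pandas as pd

contacts = pd.Series([
    '951-719-9170ZoeWellish@superrito.com',
    'PamelaSHill@cuvox.de+1 (217) 569-3204',
    'TorbenMMortensen@einrot.com1 208 657 2473',
])

rgx_phoneNo = r'(((\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4})'
# same pattern with the inner groups made non-capturing, so only one group remains
rgx_phone_nc = r'((?:(?:\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4})'

# one column per capturing group
print(contacts.str.extract(rgx_phoneNo).shape)

# with a single group and expand=False, the result is already a Series
print(contacts.str.extract(rgx_phone_nc, expand=False))
```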

In [44]:
# rearrange columns
cols = ['patient_id', 'assigned_sex', 'given_name', 'surname', 'address',
        'city', 'state', 'zip_code', 'country', 'email', 'phone_no', 'contact', 
        'birthdate', 'weight', 'height', 'bmi']
patients_clean = patients_clean.loc[:,cols]

**Test**

In [45]:
patients_clean[['contact', 'email', 'phone_no']].sample(25)

Unnamed: 0,contact,email,phone_no
95,325-282-4087SargentRuais@jourrapide.com,SargentRuais@jourrapide.com,325-282-4087
127,360-482-2553LenaBaer@rhyta.com,LenaBaer@rhyta.com,360-482-2553
471,503-417-1995NadwahHawadahNaifeh@einrot.com,NadwahHawadahNaifeh@einrot.com,503-417-1995
263,JuliaAzevedoCarvalho@superrito.com+1 (212) 782...,JuliaAzevedoCarvalho@superrito.com,+1 (212) 782-4151
473,KateWilkinson@armyspy.com1 508 905 2371,KateWilkinson@armyspy.com,1 508 905 2371
337,208-830-2415LeonReynosoRendon@einrot.com,LeonReynosoRendon@einrot.com,208-830-2415
482,361-693-4960DiogoBarrosSouza@jourrapide.com,DiogoBarrosSouza@jourrapide.com,361-693-4960
239,228-378-1355KhalidJohnsrud@teleworm.us,KhalidJohnsrud@teleworm.us,228-378-1355
387,561-826-5683VallieSPrince@cuvox.de,VallieSPrince@cuvox.de,561-826-5683
231,StefanieHerman@fleckens.hu1 252 583 5410,StefanieHerman@fleckens.hu,1 252 583 5410


In [46]:
# dropping contact col
patients_clean.drop(labels='contact', axis=1, inplace=True)

In [47]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,ZoeWellish@superrito.com,951-719-9170,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de,+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,JaeMDebord@gustr.com,402-363-6804,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com,+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,TimNeudorf@cuvox.de,334-515-7487,2/18/1928,192.3,27,26.1


***Three variables in two columns in `treatments` (*treatment*, *start_dose*, and *end_dose*)***

**Define**
* Create a list of the columns to keep as identifiers, `id_vars`
* Using `pd.melt` with the identifiers as `id_vars`, `treatment` as the variable name, and `dosage` as the value name, melt the table.
* Split dosage column into start_dose and end_dose using str.split()
* Capture only the digits in each column.
* drop dosage column
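
The steps above can be tried on a toy table first. This is only a sketch: the two rows and the `wide`, `tidy`, and `doses` names are made up for illustration, and it pulls both doses out in one pass with named capture groups instead of splitting first.

```python
import pandas as pd

# Toy rows mimicking the wide layout of `treatments` (illustrative values)
wide = pd.DataFrame({
    'given_name': ['ann', 'bob'],
    'auralin': ['41u - 48u', '-'],
    'novodra': ['-', '40u - 45u'],
})

# One row per (patient, treatment) pair
tidy = pd.melt(wide, id_vars=['given_name'],
               var_name='treatment', value_name='dosage')

# Keep only the rows where a dose was actually recorded
tidy = tidy[tidy.dosage != '-'].reset_index(drop=True)

# Extract both numbers at once; the group names become column names
doses = tidy.dosage.str.extract(r'(?P<start_dose>\d+)u - (?P<end_dose>\d+)u')
tidy = pd.concat([tidy.drop(columns='dosage'), doses.astype(int)], axis=1)
print(tidy)
```

Filtering out the '-' placeholders right after the melt is a design choice for the toy case; the notebook keeps those rows until the patient IDs have been attached.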

**Code**

In [48]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [49]:
# identifiers
id_vars = ['given_name', 'surname','hba1c_start','hba1c_end', 'hba1c_change']

# melting the treatments_clean table
treatments_clean = pd.melt(treatments_clean, id_vars=id_vars,
                           var_name='treatment', value_name='dosage')

In [50]:
# splitting the columns by ' - '
start_dose = treatments_clean.dosage.str.split(' - ', expand=True)[0]
end_dose = treatments_clean.dosage.str.split(' - ', expand=True)[1]

# extracting the digits from each column
treatments_clean['start_dose'] = start_dose.str.extract(r'(\d+)', expand=False)
treatments_clean['end_dose'] = end_dose.str.extract(r'(\d+)', expand=False)

In [51]:
# converting dosage to float
treatments_clean.start_dose = treatments_clean.start_dose.astype(float)
treatments_clean.end_dose = treatments_clean.end_dose.astype(float)

**Test**

In [52]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    700 non-null    object 
 1   surname       700 non-null    object 
 2   hba1c_start   700 non-null    float64
 3   hba1c_end     700 non-null    float64
 4   hba1c_change  700 non-null    float64
 5   treatment     700 non-null    object 
 6   dosage        700 non-null    object 
 7   start_dose    350 non-null    float64
 8   end_dose      350 non-null    float64
dtypes: float64(5), object(4)
memory usage: 49.3+ KB


In [53]:
treatments_clean[['dosage','start_dose','end_dose']].dropna().sample(15)

Unnamed: 0,dosage,start_dose,end_dose
120,41u - 51u,41.0,51.0
430,37u - 35u,37.0,35.0
457,46u - 43u,46.0,43.0
122,50u - 60u,50.0,60.0
173,22u - 30u,22.0,30.0
542,31u - 36u,31.0,36.0
422,44u - 49u,44.0,49.0
355,42u - 44u,42.0,44.0
143,55u - 59u,55.0,59.0
262,51u - 62u,51.0,62.0


In [54]:
# drop dosage column
treatments_clean.drop('dosage', axis=1, inplace=True)

***The *adverse_reaction* column of the `adverse_reactions` table should be included in the `treatments` table***

**Define**
* Create *full_name* column in `adverse_reactions` and `treatments`.
* Merge `adverse_reactions` into `treatments` using an outer join on *full_name*.
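
Before running the join on the real tables, it can help to see what it does on a few rows. The sketch below uses made-up frames (`treatments_toy`, `reactions_toy`). The `indicator` and `validate` arguments are optional extras not used in this notebook: they flag rows that exist on only one side and fail early if a name appears twice in the reactions table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
treatments_toy = pd.DataFrame({'full_name': ['annlee', 'bobray', 'cyfox'],
                               'treatment': ['auralin', 'novodra', 'auralin']})
reactions_toy = pd.DataFrame({'full_name': ['bobray'],
                              'adverse_reaction': ['hypoglycemia']})

merged = pd.merge(left=treatments_toy, right=reactions_toy,
                  how='outer', on='full_name',
                  indicator=True,           # adds a `_merge` column
                  validate='many_to_one')   # each name at most once on the right

# Rows that came only from the reactions table would signal a name mismatch
unmatched = merged[merged['_merge'].astype(str) == 'right_only']
print(merged)
print('unmatched reactions:', len(unmatched))
```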

**Code**

In [55]:
# creating full_name column
treatments_clean['full_name'] = treatments_clean.given_name + treatments_clean.surname
adverse_reactions_clean['full_name'] = adverse_reactions_clean.given_name + adverse_reactions_clean.surname

In [56]:
# adding adverse_reaction col from adverse_reactions to treatments
treatments_clean = pd.merge(left=treatments_clean, 
                            right=adverse_reactions_clean.drop(['given_name', 'surname'], axis=1),
                            how='outer', on='full_name')

**Test**

In [57]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 700 entries, 0 to 699
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   given_name        700 non-null    object 
 1   surname           700 non-null    object 
 2   hba1c_start       700 non-null    float64
 3   hba1c_end         700 non-null    float64
 4   hba1c_change      700 non-null    float64
 5   treatment         700 non-null    object 
 6   start_dose        350 non-null    float64
 7   end_dose          350 non-null    float64
 8   full_name         700 non-null    object 
 9   adverse_reaction  70 non-null     object 
dtypes: float64(5), object(5)
memory usage: 60.2+ KB


**Duplicate columns (i.e. *given_name* and *surname*) in `treatments` should be removed.**

Generate *patient_id* col and drop *given_name* and *surname*

**Define**

* redo *full_name* in `treatments_clean` by adding lower(*given_name*) and lower(*surname*)
* Change 'Dsvid' to 'David' at patient_id == 9 in `patients_clean`
* generate *full_name* for `patients_clean` by adding lower(*given_name*) and lower(*surname*)
* Using a left merge, join `treatments_clean` and `patients_clean` on `full_name` 
* Drop *given_name* and *surname* in `treatments`
* Drop NA rows in any of the dose columns.
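
A small sketch of the name-to-ID lookup, assuming (as the plan above does) that the lowercased *given_name* + *surname* key is unique per patient. The toy frames are made up, and it uses `Series.map` rather than the left merge in the cells below, purely to show the idea in a few lines.

```python
import pandas as pd

# Hypothetical miniature tables
patients_toy = pd.DataFrame({'patient_id': [1, 2],
                             'given_name': ['Zoe', 'David'],
                             'surname': ['Wellish', 'Gustafsson']})
treatments_toy = pd.DataFrame({'given_name': ['david', 'zoe'],
                               'surname': ['gustafsson', 'wellish']})

# Build the lowercase key on the patients side and index the IDs by it
key = (patients_toy.given_name + patients_toy.surname).str.lower()
lookup = pd.Series(patients_toy.patient_id.values, index=key)

# Names in treatments are already lowercase, so the keys line up directly
treatments_toy['patient_id'] = (treatments_toy.given_name
                                + treatments_toy.surname).map(lookup)
print(treatments_toy)
```

Any name that fails to match comes back as NaN, which is the same symptom the `isna()` check below looks for.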

**Code**

In [58]:
patients_clean.query('patient_id == 9')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi
8,9,male,Dsvid,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,DavidGustafsson@armyspy.com,816-265-9578,3/6/1937,163.9,66,26.5


In [59]:
# changing 'Dsvid' to 'David' to prevent errors during merge
patients_clean.loc[patients_clean.patient_id == 9, 'given_name'] = 'David'
patients_clean.query('patient_id == 9')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,DavidGustafsson@armyspy.com,816-265-9578,3/6/1937,163.9,66,26.5


Creating full name column in `treatments_clean` and `patients_clean`

In [60]:
treatments_clean.full_name = treatments_clean.given_name.str.lower() + treatments_clean.surname.str.lower()

In [61]:
patients_clean['full_name'] = patients_clean.given_name.str.lower() + patients_clean.surname.str.lower()

In [62]:
# getting df to do a left merge with
patientID_merge = patients_clean[['patient_id','full_name']]

In [63]:
# left merge 
treatments_clean =  pd.merge(left=treatments_clean, right=patientID_merge,
                             how='left', on='full_name')

In [64]:
# making sure that all our rows have a `patient_id`
treatments_clean[treatments_clean.patient_id.isna()]

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,start_dose,end_dose,full_name,adverse_reaction,patient_id


In [65]:
# dropping unnecessary columns
treatments_clean.drop(['given_name', 'surname','full_name'], axis=1, inplace=True)

# reorder columns
treatments_clean = treatments_clean[['patient_id','treatment', 'hba1c_start', 'hba1c_end',
                                    'hba1c_change', 'start_dose', 'end_dose',
                                    'adverse_reaction']]

Dropping NA rows in *start_dose* column

In [66]:
treatments_clean = treatments_clean[treatments_clean.start_dose.notna()]
treatments_clean.reset_index(drop=True, inplace=True)

**Test**

In [67]:
treatments_clean.head(15)

Unnamed: 0,patient_id,treatment,hba1c_start,hba1c_end,hba1c_change,start_dose,end_dose,adverse_reaction
0,225,auralin,7.63,7.2,0.43,41.0,48.0,
1,94,novodra,7.56,7.09,0.47,40.0,45.0,hypoglycemia
2,64,novodra,7.68,7.25,0.43,39.0,36.0,
3,242,auralin,7.97,7.62,0.35,33.0,36.0,
4,57,novodra,7.78,7.46,0.32,33.0,29.0,
5,490,novodra,7.56,7.18,0.38,42.0,44.0,hypoglycemia
6,345,auralin,7.65,7.27,0.38,37.0,42.0,
7,276,auralin,7.89,7.55,0.34,31.0,38.0,
8,349,novodra,8.08,7.7,0.38,54.0,54.0,
9,15,auralin,7.76,7.37,0.39,30.0,36.0,


In [68]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   patient_id        350 non-null    int64  
 1   treatment         350 non-null    object 
 2   hba1c_start       350 non-null    float64
 3   hba1c_end         350 non-null    float64
 4   hba1c_change      350 non-null    float64
 5   start_dose        350 non-null    float64
 6   end_dose          350 non-null    float64
 7   adverse_reaction  35 non-null     object 
dtypes: float64(5), int64(1), object(2)
memory usage: 22.0+ KB


In [69]:
treatments_clean.patient_id.value_counts()

70     2
225    1
190    1
300    1
329    1
      ..
8      1
219    1
288    1
371    1
422    1
Name: patient_id, Length: 349, dtype: int64

Dropping duplicated rows

In [70]:
treatments_clean.drop_duplicates(inplace=True)
treatments_clean.reset_index(drop=True, inplace=True)

In [71]:
treatments_clean.patient_id.duplicated().sum()

0

In [72]:
treatments_clean.head(15)

Unnamed: 0,patient_id,treatment,hba1c_start,hba1c_end,hba1c_change,start_dose,end_dose,adverse_reaction
0,225,auralin,7.63,7.2,0.43,41.0,48.0,
1,94,novodra,7.56,7.09,0.47,40.0,45.0,hypoglycemia
2,64,novodra,7.68,7.25,0.43,39.0,36.0,
3,242,auralin,7.97,7.62,0.35,33.0,36.0,
4,57,novodra,7.78,7.46,0.32,33.0,29.0,
5,490,novodra,7.56,7.18,0.38,42.0,44.0,hypoglycemia
6,345,auralin,7.65,7.27,0.38,37.0,42.0,
7,276,auralin,7.89,7.55,0.34,31.0,38.0,
8,349,novodra,8.08,7.7,0.38,54.0,54.0,
9,15,auralin,7.76,7.37,0.39,30.0,36.0,


#### Quality

`patients`
* *zip_code* is a float not a string ✅
* *zip_code* needs to be 5 digits ✅
* Data entry error at *height* col for 'Tim Neudorf'. 27in instead of 72in ✅
* Similar states are expressed in different ways (e.g. 'NY', 'New York') ✅
* Typo in *given_name* ('Dsvid' instead of 'David') for the patient with *patient_id* 9 ✅
* Missing demographic information (address - contact columns) ✅
* Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns) ✅
* Multiple phone number formats ✅
* kgs instead of lbs for Zaitseva's weight ✅

***zip_code* is a float not a string and *zip_code* needs to be 5 digits**

***Define***

* Convert *zip_code* to string and slice off the trailing '.0'
* Left-pad with '0' to a width of 5 characters
* Replace the '0000n' values produced from missing zip codes with NaN
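
As an aside, the same result can be reached without the intermediate '0000n' strings. This is a sketch on three made-up values, not the approach used in the cells below. It relies on pandas' nullable `Int64` and `string` dtypes so that missing zip codes stay missing.

```python
import numpy as np
import pandas as pd

# Illustrative values: a 5-digit zip, a zip that lost its leading zero, a missing one
zips = pd.Series([92390.0, 7095.0, np.nan])

# Int64 drops the '.0' while keeping NaN as <NA>; zfill restores leading zeros
as_text = zips.astype('Int64').astype('string').str.zfill(5)
print(as_text.tolist())
```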

***Code***

In [73]:
# converting to string and slicing out the decimal points and decimals
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2]

In [74]:
# making sure the width of *zip_code* is 5 and filling missing with '0'
patients_clean.zip_code = patients_clean.zip_code.str.pad(width=5, fillchar='0')

In [75]:
patients_clean.zip_code[patients_clean.zip_code.str.contains('n')].head()

209    0000n
219    0000n
230    0000n
234    0000n
242    0000n
Name: zip_code, dtype: object

In [76]:
# replace '0000n' with np.nan
patients_clean['zip_code'] = patients_clean.zip_code.replace('0000n', np.nan)

***Test***

In [77]:
patients_clean.zip_code.head()

0    92390
1    61812
2    68467
3    07095
4    36303
Name: zip_code, dtype: object

**Data entry error at *height* col for 'Tim Neudorf'. 27in instead of 72in**

***Define***
* Find the row for 'Tim Neudorf'
* Replace height with 72in
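
One way to convince ourselves that 72in is the intended value is to recompute BMI from the recorded weight. The sketch below assumes the *bmi* column follows the standard imperial formula (703 × lb / in²). The helper name `bmi_imperial` is made up for this check.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from pounds and inches: 703 * lb / in^2."""
    return 703 * weight_lb / height_in ** 2

# Tim Neudorf's recorded weight is 192.3 lb and his recorded bmi is 26.1
print(round(bmi_imperial(192.3, 27), 1))  # height as entered
print(round(bmi_imperial(192.3, 72), 1))  # height with the digits swapped back
```

Only the 72in height reproduces the recorded BMI of 26.1, which supports treating 27 as a transposition.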

***Code***

In [78]:
patients_clean.query("given_name == 'Tim'")

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi,full_name
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,TimNeudorf@cuvox.de,334-515-7487,2/18/1928,192.3,27,26.1,timneudorf


In [79]:
patients_clean.iloc[4, 13] = 72

**Test**

In [80]:
patients_clean.iloc[4,:]

patient_id                         5
assigned_sex                    male
given_name                       Tim
surname                      Neudorf
address         1428 Turkey Pen Lane
city                          Dothan
state                             AL
zip_code                       36303
country                United States
email            TimNeudorf@cuvox.de
phone_no                334-515-7487
birthdate                  2/18/1928
weight                         192.3
height                            72
bmi                             26.1
full_name                 timneudorf
Name: 4, dtype: object

**Similar states are expressed in different ways (e.g. 'NY', 'New York')**

***Define***
* Find the states that are represented in full
* Apply a function that converts full state name to abbreviations
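
For comparison, the same mapping can be done without a row-wise function. This is a sketch on a made-up Series using `Series.replace` with a dict. The dictionary holds the same five full names found in this dataset.

```python
import numpy as np
import pandas as pd

state_abbrev = {'California': 'CA', 'Illinois': 'IL',
                'Nebraska': 'NE', 'Florida': 'FL',
                'New York': 'NY'}

# Illustrative mix of full names, abbreviations and a missing value
states = pd.Series(['California', 'NJ', 'New York', np.nan, 'NY'])

# Values found in the dict are swapped; everything else is left alone
cleaned = states.replace(state_abbrev)
print(cleaned.tolist())
```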


**Code**

In [81]:
patients_clean.state.unique()

array(['California', 'Illinois', 'Nebraska', 'NJ', 'AL', 'Florida', 'NV',
       'CA', 'MO', 'New York', 'MI', 'TN', 'VA', 'OK', 'GA', 'MT', 'MA',
       'NY', 'NM', 'IL', 'LA', 'PA', 'CO', 'ME', 'WI', 'SD', 'MN', 'FL',
       'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD', 'AZ', 'TX',
       'NE', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH', 'OR',
       nan, 'VT', 'ID', 'DC', 'AR'], dtype=object)

In [82]:
state_abbrev = {'California': 'CA', 'Illinois': 'IL', 
                'Nebraska': 'NE', 'Florida': 'FL',
                'New York': 'NY'}

In [83]:
# define a function that converts full state name to abbrev
def stateAbbrev(patients):
    # return the abbreviation for a spelled-out state; leave other values unchanged
    if patients['state'] in state_abbrev:
        return state_abbrev[patients['state']]
    return patients['state']

In [84]:
# apply function to state column
patients_clean['state'] = patients_clean.apply(stateAbbrev, axis=1)

***Test***

In [85]:
patients_clean.state.unique()

array(['CA', 'IL', 'NE', 'NJ', 'AL', 'FL', 'NV', 'MO', 'NY', 'MI', 'TN',
       'VA', 'OK', 'GA', 'MT', 'MA', 'NM', 'LA', 'PA', 'CO', 'ME', 'WI',
       'SD', 'MN', 'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD',
       'AZ', 'TX', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH',
       'OR', nan, 'VT', 'ID', 'DC', 'AR'], dtype=object)

**Erroneous datatypes (assigned sex, state, and birthdate columns)**

***Define***
* Convert *assigned_sex* and *state* to categorical dtype
* Convert birthdate to datetime dtype
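
A quick sketch of both conversions on made-up rows. Passing an explicit `format` to `pd.to_datetime` is optional and not what the cell below does. It is shown here on the assumption that every birthdate is written month/day/year, as in the rows displayed earlier.

```python
import pandas as pd

toy = pd.DataFrame({'assigned_sex': ['female', 'male', 'male'],
                    'birthdate': ['7/10/1976', '2/19/1980', '2/18/1928']})

# Low-cardinality text column -> category
toy['assigned_sex'] = toy.assigned_sex.astype('category')

# Explicit format avoids any day/month guessing
toy['birthdate'] = pd.to_datetime(toy.birthdate, format='%m/%d/%Y')
print(toy.dtypes)
```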

***Code***

In [86]:
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    object 
 8   country       491 non-null    object 
 9   email         491 non-null    object 
 10  phone_no      491 non-null    object 
 11  birthdate     503 non-null    object 
 12  weight        503 non-null    float64
 13  height        503 non-null    int64  
 14  bmi           503 non-null    float64
 15  full_name     503 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 63.0+ KB


In [87]:
patients_clean['assigned_sex'] = patients_clean.assigned_sex.astype('category')
patients_clean['state'] = patients_clean.state.astype('category')
patients_clean['birthdate'] = pd.to_datetime(patients_clean.birthdate)

***Test***

In [88]:
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    category      
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       491 non-null    object        
 5   city          491 non-null    object        
 6   state         491 non-null    category      
 7   zip_code      491 non-null    object        
 8   country       491 non-null    object        
 9   email         491 non-null    object        
 10  phone_no      491 non-null    object        
 11  birthdate     503 non-null    datetime64[ns]
 12  weight        503 non-null    float64       
 13  height        503 non-null    int64         
 14  bmi           503 non-null    float64       
 15  full_name     503 non-null    object    

**Multiple phone number formats**
Since these are US carrier phone numbers, every number can be normalized to the same 11-digit form.

***Define***
* Strip all special characters leaving only digits
* Ensure a length of 11 digits with the first being '1' for the country code.
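
The two steps can be sketched on the three formats seen in the data. The sample numbers are taken from rows shown earlier; the `digits` and `normalized` names are illustrative. Instead of padding, this variant prepends '1' only when a number does not already have 11 digits.

```python
import pandas as pd

phones = pd.Series(['951-719-9170', '+1 (217) 569-3204', '1 508 905 2371'])

# Strip everything that is not a digit
digits = phones.str.replace(r'\D+', '', regex=True)

# Prepend the country code only where it is missing
normalized = digits.where(digits.str.len() == 11, '1' + digits)

# Sanity check: every number is '1' followed by ten digits
print(normalized.tolist())
print(normalized.str.fullmatch(r'1\d{10}').all())
```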

***Code***

In [89]:
patients_clean['phone_no'] = patients_clean.phone_no.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')


***Test***

In [90]:
patients_clean.phone_no.head()

0    19517199170
1    12175693204
2    14023636804
3    17326368246
4    13345157487
Name: phone_no, dtype: object

**Remove unrecoverable 'John Doe' data**

***Define***
* Find out which rows have the *given_name* and *surname* == 'John Doe'
* Remove those rows

***Code***

In [91]:
patients_clean.query("full_name == 'johndoe'").index

Int64Index([215, 229, 237, 244, 251, 277], dtype='int64')

In [92]:
patients_clean.drop([215, 229, 237, 244, 251, 277], axis=0, inplace=True)

In [93]:
patients_clean.reset_index(drop=True, inplace=True)

***Test***

In [94]:
patients_clean.query("full_name == 'johndoe'")

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi,full_name


In [95]:
patients_clean.drop('full_name', axis=1, inplace=True)

**Multiple records for Jakobsen, Gersten, Taylor**

***Define***
* Get the *patient_id* of each duplicate record and confirm it isn't in `treatments_clean`
* Drop the duplicate rows
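
A toy illustration of why the address check has to ignore missing values: `duplicated()` treats repeated NaNs as duplicates, so patients without an address would be dropped by mistake. The four rows below are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'patient_id': [25, 30, 40, 41],
                    'given_name': ['Jakob', 'Jake', 'Ann', 'Bo'],
                    'address': ['648 Old Dear Lane', '648 Old Dear Lane',
                                np.nan, np.nan]})

# Second and later occurrences of a real address are flagged;
# the notna() guard protects rows whose address is simply missing
is_repeat = toy.address.duplicated(keep='first') & toy.address.notna()
deduped = toy[~is_repeat]
print(deduped)
```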

In [96]:
patients_clean[patients_clean.address.duplicated()].dropna()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,JakobCJakobsen@einrot.com,18458587707,1985-08-01,155.8,67,24.4
276,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,SandraCTaylor@dayrep.com,13044382648,1960-10-23,206.1,64,35.4
496,503,male,Pat,Gersten,2778 North Avenue,Burr,NE,68324,United States,PatrickGersten@rhyta.com,14028484923,1954-05-03,138.2,71,19.3


In [97]:
duped_address = ['648 Old Dear Lane', '2476 Fulton Street', '2778 North Avenue']
patients_clean.query('address in @duped_address')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,JakobCJakobsen@einrot.com,18458587707,1985-08-01,155.8,67,24.4
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,JakobCJakobsen@einrot.com,18458587707,1985-08-01,155.8,67,24.4
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324,United States,PatrickGersten@rhyta.com,14028484923,1954-05-03,138.2,71,19.3
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,SandraCTaylor@dayrep.com,13044382648,1960-10-23,206.1,64,35.4
276,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,SandraCTaylor@dayrep.com,13044382648,1960-10-23,206.1,64,35.4
496,503,male,Pat,Gersten,2778 North Avenue,Burr,NE,68324,United States,PatrickGersten@rhyta.com,14028484923,1954-05-03,138.2,71,19.3


In [98]:
duped_patientID = patients_clean.query('address in @duped_address')['patient_id'].values
treatments_clean.query('patient_id in @duped_patientID')

Unnamed: 0,patient_id,treatment,hba1c_start,hba1c_end,hba1c_change,start_dose,end_dose,adverse_reaction
228,132,auralin,7.84,7.49,0.35,51.0,58.0,
345,25,novodra,7.96,7.51,0.45,28.0,26.0,hypoglycemia


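Only patient_ids 25 and 132 appear in `treatments_clean`, and both belong to the first record of each patient, so we're good to drop the later dupes (30, 283 and 503).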

In [99]:
patients_clean = patients_clean[~((patients_clean.address.duplicated()) & patients_clean.address.notnull())]

***Test***

In [100]:
patients_clean.query('address in @duped_address')

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,email,phone_no,birthdate,weight,height,bmi
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,12771,United States,JakobCJakobsen@einrot.com,18458587707,1985-08-01,155.8,67,24.4
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,68324,United States,PatrickGersten@rhyta.com,14028484923,1954-05-03,138.2,71,19.3
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,25962,United States,SandraCTaylor@dayrep.com,13044382648,1960-10-23,206.1,64,35.4


**kgs instead of lbs for Zaitseva's weight**

***Define***

* Locate the row that holds the min value for weight
* Multiply the number by 2.20462 to get its equivalent in lbs
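
The conversion itself is a one-liner. The sketch below runs it on a made-up Series and uses `idxmin()` to find the row rather than typing the index by hand. The other two weights are illustrative.

```python
import pandas as pd

KG_TO_LB = 2.20462

# Illustrative weights; index 210 holds the value recorded in kg
weights = pd.Series([121.7, 48.8, 177.8], index=[0, 210, 2])

# Locate the minimum and convert it in place
row = weights.idxmin()
weights.loc[row] = round(weights.loc[row] * KG_TO_LB, 1)
print(row, weights.loc[row])
```

48.8 kg works out to about 107.6 lb, which is why index 210 no longer appears among the five smallest weights in the test output.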

***Code***

In [101]:
patients_clean.weight.sort_values().head()

210     48.8
453    102.1
329    102.7
74     103.2
311    106.0
Name: weight, dtype: float64

In [102]:
pounds = 48.8 * 2.20462
patients_clean.loc[210, 'weight'] = round(pounds,1)

***Test***

In [103]:
patients_clean.weight.sort_values().head()

453    102.1
329    102.7
74     103.2
311    106.0
171    106.5
Name: weight, dtype: float64

## FlashForward: Is Auralin Effective?

Check it out here https://youtu.be/rfMu3f9O9hQ