# Data Wrangling on the Auralin Dataset

The Auralin dataset is a fabricated clinical trial dataset, built to demonstrate a complete, systematic data wrangling process.

More information about the data:

- This dataset was constructed with the consultation of real doctors to ensure plausibility.
- This clinical trial data for an alternative insulin was inspired by, and closely mimics, this real [clinical trial for a new inhaled insulin called Afrezza](http://care.diabetesjournals.org/content/38/12/2266.long).
- The data quality issues in this dataset mimic real, [common data quality issues in healthcare data](http://media.hypersites.com/clients/1446/filemanager/Articles/DocCenter_Problem_with_data.pdf). These issues impact quality of care, patient registration, and revenue.
- The patients in this dataset were created using this [fake name generator](http://www.fakenamegenerator.com/order.php) and do not include real names, addresses, phone numbers, emails, etc.

---

## Assess

- Visual assessment: scroll through each table
- Programmatic assessment: inspect the tables with pandas methods

In [1]:
pwd

'/Users/alejandrosanz/Downloads'

In [2]:
import os 
os.chdir('projects_on_GitHub/data_wrangling/diabetes_data_wrangling')

In [4]:
import pandas as pd

# Load the three tables of the Auralin trial
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

pd.set_option('display.max_rows', None)

In [100]:
patients.head(10)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
5,6,male,Rafael,Costa,1140 Willis Avenue,Daytona Beach,Florida,32114.0,United States,386-334-5237RafaelCardosoCosta@gustr.com,8/31/1931,183.9,70,26.4
6,7,female,Mary,Adams,3145 Sheila Lane,Burbank,NV,84728.0,United States,775-533-5933MaryBAdams@einrot.com,11/19/1969,146.3,65,24.3
7,8,female,Xiuxiu,Chang,2687 Black Oak Hollow Road,Morgan Hill,CA,95037.0,United States,XiuxiuChang@einrot.com1 408 778 3236,8/13/1958,158.0,60,30.9
8,9,male,Dsvid,Gustafsson,1790 Nutter Street,Kansas City,MO,64105.0,United States,816-265-9578DavidGustafsson@armyspy.com,3/6/1937,163.9,66,26.5
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4


In [14]:
treatments

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.20
276,john,teichelmann,-,49u - 49u,7.90,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36


In [7]:
adverse_reactions

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


In [8]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [9]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [26]:

patients.surname.value_counts()

Doe                  6
Jakobsen             3
Taylor               3
Aranda               2
Dratchev             2
Liễu                 2
Hueber               2
Schiavone            2
Berg                 2
Kadyrov              2
Souza                2
Woźniak              2
Tạ                   2
Lương                2
Lâm                  2
Correia              2
Cabrera              2
Tucker               2
Gersten              2
Nilsen               2
Bùi                  2
Collins              2
Ogochukwu            2
Lund                 2
Batukayev            2
Silva                2
Johnson              2
Parker               2
Cindrić              2
Grímsdóttir          2
Kowalczyk            2
Ščančar              1
Maríasson            1
Heimisson            1
Petersen             1
Tománková            1
Bogolyubova          1
Nowakowski           1
Pecinová             1
Priest               1
Eldarkhanov          1
Nyborg               1
Knutsen              1
Jephcott   

In [27]:
patients.address.value_counts()

123 Main Street                  6
2778 North Avenue                2
2476 Fulton Street               2
648 Old Dear Lane                2
2121 Liberty Avenue              1
4155 Raccoon Run                 1
4105 Ferguson Street             1
4988 Lynn Street                 1
3488 Clair Street                1
570 Alpha Avenue                 1
1934 August Lane                 1
34 Hamill Avenue                 1
2886 Straford Park               1
1846 Joseph Street               1
2566 Ingram Street               1
3479 Elm Drive                   1
3781 Hamill Avenue               1
707 Goodwin Avenue               1
2687 Black Oak Hollow Road       1
1810 Hardesty Street             1
2531 Eastland Avenue             1
373 Fantages Way                 1
1962 Cabell Avenue               1
1027 Tenmile Road                1
4500 Myra Street                 1
2020 Gore Street                 1
4141 Davis Place                 1
3650 Graystone Lakes             1
4168 Coventry Court 

In [56]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
51     107.1
270    108.1
198    108.5
48     109.1
478    109.6
141    110.2
38     111.8
438    112.0
14     112.0
235    112.2
307    112.4
191    112.6
408    113.1
49     113.3
326    114.0
338    114.1
253    117.0
321    118.4
168    118.8
1      118.8
350    119.0
207    119.2
265    120.0
341    120.3
208    121.2
0      121.7
467    122.1
218    122.2
225    122.3
404    123.0
397    123.4
424    123.6
483    123.9
306    124.1
278    124.3
120    124.5
363    124.7
185    125.1
423    125.2
419    126.1
494    126.3
232    126.9
214    126.9
342    127.2
280    128.5
184    128.9
256    128.9
17     129.1
422    129.4
44     129.8
206    129.8
377    130.0
181    130.2
354    130.2
309    130.7
22     130.7
213    131.1
143    131.1
62     131.1
268    132.0
359    132.2
362    132.2
454    132.7
187    133.1
31     133.3
56     133.8
57     134.0
163    134.2
345    134.2
458    134.2
430    135.7

In [49]:
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [51]:
patients.address.duplicated().sum()

19
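
The count of 19 includes rows with a missing address, because `duplicated()` treats repeated NaN values as duplicates of one another. Using the outputs above: the 12 missing addresses contribute 11, "123 Main Street" (6 occurrences) contributes 5, and the three addresses that appear twice contribute 3. The sketch below uses a small toy Series (illustrative values, not the real table) to show one way to count only the non-missing duplicates; the names `all_dupes` and `real_dupes` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy addresses (illustrative only): one genuine repeat plus three missing values
address = pd.Series([
    "123 Main Street", "123 Main Street",
    np.nan, np.nan, np.nan,
    "2778 North Avenue",
])

# duplicated() flags every repeat after the first occurrence, NaN included
all_dupes = address.duplicated().sum()

# Restrict the flag to rows that actually have an address
real_dupes = (address.notna() & address.duplicated()).sum()

print(all_dupes, real_dupes)
```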

In [59]:
patients[patients[['given_name', 'surname']].duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [54]:
patients[patients.address.duplicated()][['given_name', 'surname']]

Unnamed: 0,given_name,surname
29,Jake,Jakobsen
219,Mỹ,Quynh
229,John,Doe
230,Elisabeth,Knudsen
234,Martina,Tománková
237,John,Doe
242,John,O'Brian
244,John,Doe
249,Benjamin,Mehler
251,John,Doe


In [31]:
treatments.auralin.isnull().sum()

0

In [32]:
treatments.novodra.isnull().sum()

0

In [99]:
# Find column names that appear in more than one of the three tables
all_columns = pd.Series(list(patients) +
                        list(treatments) +
                        list(adverse_reactions))

all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

In [96]:
# # Alternative: a set intersection finds the columns shared by all three tables
# a = set(patients)
# b = set(treatments)
# c = set(adverse_reactions)
# inters = a.intersection(b).intersection(c)
# inters

### Quality Issues

#### `patients` table
- `zip_code` is stored as a float instead of a string
- `zip_code` sometimes has only four digits (the leading zero is lost)
- Tim Neudorf's height is 27 in instead of 72 in
- `state` holds full names in some rows and abbreviations in others
- Given name misspelled as "Dsvid" (Dsvid Gustafsson, whose email reads DavidGustafsson)
- Missing demographic information (`address` through `contact` columns)
- Erroneous datatypes (`assigned_sex`, `state`, `zip_code`, and `birthdate` columns)
- Multiple phone number formats
- Default John Doe data
- Multiple records for Jakobsen, Gersten, and Taylor
- Zaitseva's weight is recorded in kg instead of lbs
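
As one illustration of the zip code issues, here is a minimal sketch that converts the float values to five-character strings and restores the lost leading zero. The sample values come from the `patients` preview above, with NaN standing in for a missing record; the helper name `fix_zip` is hypothetical.

```python
import pandas as pd

# Values from the preview above; None becomes NaN, as for patients with no address
zip_codes = pd.Series([92390.0, 7095.0, None])

def fix_zip(value):
    """Return a 5-character zip code string, or None when the value is missing."""
    if pd.isna(value):
        return None
    # Drop the ".0" by casting to int, then left-pad with zeros to 5 characters
    return str(int(value)).zfill(5)

zip_fixed = zip_codes.map(fix_zip)
print(zip_fixed.tolist())
```

Storing zip codes as strings keeps the leading zero, which a numeric dtype cannot represent.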


#### `treatments` table
- Missing HbA1c changes
- The letter "u" in the starting and ending doses for Auralin and Novodra
- Lowercase given names and surnames
- Missing records (280 instead of 350)
- Erroneous datatypes (`auralin` and `novodra` columns)
- Inaccurate HbA1c changes (leading 4s mistaken as 9s, e.g. 0.97 recorded where 7.56 - 7.09 = 0.47)
- Nulls represented as dashes (-) in `auralin` and `novodra`
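
Both the missing and the inaccurate HbA1c changes could be addressed by recomputing the column from `hba1c_start` and `hba1c_end`. The following is a sketch under that assumption, using three rows copied from the `treatments` preview above; the variable name `sample` is made up for this example.

```python
import numpy as np
import pandas as pd

# Rows 0, 1 and 3 from the treatments preview (names and dose columns omitted)
sample = pd.DataFrame({
    "hba1c_start": [7.63, 7.56, 7.97],
    "hba1c_end": [7.20, 7.09, 7.62],
    "hba1c_change": [np.nan, 0.97, 0.35],  # one missing, one with 4 read as 9
})

# Recompute the change; rounding removes floating-point noise
sample["hba1c_change"] = (sample["hba1c_start"] - sample["hba1c_end"]).round(2)
print(sample)
```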

#### `adverse_reactions` table
- lowercase given names and surnames



### Tidiness Issues

#### `patients` table
- The `contact` column should be split into separate phone and email columns
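
A minimal sketch of that split, using three `contact` values from the preview above. The regular expression is an assumption: it treats an email as starting with a letter, which holds for the rows shown but may not hold for every record. The column names `email` and `phone_number` are also assumptions.

```python
import pandas as pd

# Contact strings copied from the patients preview above
contact = pd.Series([
    "951-719-9170ZoeWellish@superrito.com",
    "PamelaSHill@cuvox.de+1 (217) 569-3204",
    "XiuxiuChang@einrot.com1 408 778 3236",
])

# Assumed email shape: local part starts with a letter, top-level domain is letters only
EMAIL = r"[A-Za-z][A-Za-z0-9._%+-]*@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

email = contact.str.extract(f"({EMAIL})", expand=False)

# Remove the email, strip every non-digit, and keep the last 10 digits
# (this drops a leading country code such as "1")
phone_number = (
    contact.str.replace(EMAIL, "", regex=True)
    .str.replace(r"\D", "", regex=True)
    .str[-10:]
)

print(pd.DataFrame({"email": email, "phone_number": phone_number}))
```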
 
#### `treatments` table
- Three variables are stored in two columns, `auralin` and `novodra` (they should be `treatment`, `start_dose`, and `end_dose`)
- `given_name` and `surname` are redundant (add an `id` column to connect with the `patients` table)
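
One possible way to reshape the dose columns is `melt` followed by a regular-expression extract. This is a sketch on two rows from the `treatments` preview above, not necessarily the approach used later in the notebook; the intermediate column name `dose` and the variable `tidy` are assumptions.

```python
import pandas as pd

# Rows 0 and 1 from the treatments preview (HbA1c columns omitted)
treatments_sample = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Stack the two drug columns into one "treatment" column
tidy = treatments_sample.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)

# A dash means the patient did not receive that drug
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)

# Pull the start and end doses out of strings like "41u - 48u"
doses = tidy["dose"].str.extract(r"(?P<start_dose>\d+)u\s*-\s*(?P<end_dose>\d+)u").astype(int)
tidy = tidy.drop(columns="dose").join(doses)

print(tidy)
```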
 
 
#### `adverse_reactions` table
- `given_name` and `surname` are redundant
- This does not need to be a separate table; it should be joined with the `treatments` table
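
The join could be sketched as a left merge on the two name columns, so that every treatment record is kept and patients without a reaction get NaN. The rows below are placeholders invented for this example and do not come from the dataset.

```python
import pandas as pd

# Placeholder rows (not taken from the dataset)
treatments_demo = pd.DataFrame({
    "given_name": ["ann", "bob"],
    "surname": ["example", "sample"],
    "hba1c_start": [7.9, 7.6],
})
reactions_demo = pd.DataFrame({
    "given_name": ["bob"],
    "surname": ["sample"],
    "adverse_reaction": ["hypoglycemia"],
})

# Left merge: keep all treatment rows, attach a reaction where one exists
merged = treatments_demo.merge(
    reactions_demo, on=["given_name", "surname"], how="left"
)
print(merged)
```

Both tables store names in lowercase, so the keys match as they are; the merge should run before the names are recapitalized, or after both tables receive the same treatment.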


## Clean

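**General Process**

For each issue identified above:
- Define: describe the cleaning step in words
- Code: translate the definition into code
- Test: verify that the code fixed the issue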
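
As a minimal sketch of one Define, Code, Test cycle, here is the height issue noted in the assessment, applied to two rows from the `patients` preview. The names `patients_clean` and `is_tim` are assumptions made for this example.

```python
import pandas as pd

# Two rows copied from the patients preview (other columns omitted)
patients_clean = pd.DataFrame({
    "given_name": ["Zoe", "Tim"],
    "surname": ["Wellish", "Neudorf"],
    "height": [66, 27],
})

# Define: Tim Neudorf's height was entered as 27 in; replace it with 72 in.

# Code
is_tim = (patients_clean["given_name"] == "Tim") & (patients_clean["surname"] == "Neudorf")
patients_clean.loc[is_tim, "height"] = 72

# Test
assert patients_clean.loc[is_tim, "height"].tolist() == [72]
print(patients_clean)
```

Working on a copy of the data, rather than the original table, keeps the raw values available for comparison.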