
In [None]:
# import necessary libraries for data computation and data analysis
import pandas as pd 
import numpy as np

# import necessary libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# suppress warnings (for example, deprecation warnings caused by library version differences)
import warnings
warnings.filterwarnings("ignore")

## Data Gathering / Data Processing

In [None]:
# Reading the dataset as csv
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

In [None]:
# read the first two rows of the dataset
patients.head(2)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2


In [None]:
# read the first two rows of the dataset
treatments.head(2)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97


In [None]:
# read the first two rows of the dataset
adverse_reactions.head(2)

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia


## Data Assessment
- checking for tidiness and quality issues

In [None]:
# patients programmatic assessment
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [None]:
# treatment programmatic assessment
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [None]:
#adverse_reactions programmatic assessment
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 944.0+ bytes


In [None]:
# Checking for duplicate column name in the three tables 
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

In [None]:
#how many columns are duplicated
all_columns.duplicated().sum()

4

In [None]:
# column list in patients dataset
list(patients)

['patient_id',
 'assigned_sex',
 'given_name',
 'surname',
 'address',
 'city',
 'state',
 'zip_code',
 'country',
 'contact',
 'birthdate',
 'weight',
 'height',
 'bmi']

In [None]:
# viewing Null data
patients[patients['address'].isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


In [None]:
# Summary Statistic of the patients data
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [None]:
# Summary Statistic of the treatments data
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [None]:
# Five random rows sampled from the patients data
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
324,325,male,Kong,Lei,4508 Goldcliff Circle,Washington,DC,20009.0,United States,202-238-2247KongLei@fleckens.hu,8/10/1937,165.2,72,22.4
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4
190,191,male,Regolo,Nucci,3595 Stuart Street,Gibsonia,PA,15044.0,United States,RegoloNucci@einrot.com+1 (724) 449-6928,9/15/1935,213.0,67,33.4
11,12,male,Abdul-Nur,Isa,1092 Farm Meadow Drive,Brentwood,TN,37027.0,United States,Abdul-NurMummarIsa@rhyta.com1 931 207 0839,2/3/1954,238.7,73,31.5
100,101,male,Isac,Berg,1497 Hidden Meadow Drive,Binford,ND,58416.0,United States,701-676-6301IsacBerg@cuvox.de,8/1/1995,137.9,66,22.3


In [None]:
# Viewing duplicate surname
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Ogochukwu      2
Tucker         2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 466, dtype: int64

In [None]:
# Confirming the duplicate using address
patients.address.value_counts()

123 Main Street             6
2778 North Avenue           2
2476 Fulton Street          2
648 Old Dear Lane           2
3094 Oral Lake Road         1
                           ..
1066 Goosetown Drive        1
4291 Patton Lane            1
4643 Reeves Street          1
174 Lost Creek Road         1
3652 Boone Crockett Lane    1
Name: address, Length: 483, dtype: int64

In [None]:
#Collecting all the duplicate address 
patients[patients.address.duplicated()].head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
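
Note that `duplicated()` treats missing values as equal to each other, which is why patients with no address show up in the output above. A minimal sketch of a stricter check follows, assuming we only care about addresses that are actually filled in. The toy frame reuses a few surnames and addresses from the outputs above; the variable names `toy`, `mask`, and `dupes` are illustrative only.

```python
import pandas as pd

# Toy subset: two repeated addresses, two missing addresses, one unique address
toy = pd.DataFrame({
    "surname": ["Jakobsen", "Jakobsen", "Doe", "Doe", "Quynh", "Knudsen", "Wellish"],
    "address": ["648 Old Dear Lane", "648 Old Dear Lane",
                "123 Main Street", "123 Main Street",
                None, None, "576 Brown Bear Drive"],
})

# keep=False marks every member of a duplicate group, not just the repeats;
# notna() excludes rows whose address is missing
mask = toy["address"].notna() & toy["address"].duplicated(keep=False)
dupes = toy[mask].sort_values("address")
print(dupes)
```

Sorting by address places the members of each duplicate group next to each other, which makes it easier to compare them by eye.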


In [None]:
# Sorting the weight column in ASC order
patients.weight.sort_values() 

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [None]:
# Converting Zaitseva's weight from kg to lbs and recomputing the body mass index (BMI)
# BMI measures a person's weight relative to their height: 703 * weight (lbs) / height (in)^2
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462
height_in = patients[patients.surname == 'Zaitseva'].height
bmi_check = 703 * weight_lbs / (height_in * height_in)
bmi_check

210    19.055827
dtype: float64

In [None]:
# Confirming that the recorded BMI for the patient with surname Zaitseva is 19.1
patients[patients.surname == 'Zaitseva'].bmi

210    19.1
Name: bmi, dtype: float64
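
The check above uses the imperial BMI formula, BMI = 703 × weight (lbs) / height (in)². Here is a minimal sketch of the same arithmetic as a standalone function. The name `bmi_from_kg` is hypothetical, and the 63 in height is an assumption chosen for illustration because it reproduces the value computed above; it is not read from the dataset here.

```python
KG_TO_LB = 2.20462  # same conversion factor used in the cell above


def bmi_from_kg(weight_kg, height_in):
    """Apply the imperial BMI formula to a weight given in kilograms."""
    weight_lb = weight_kg * KG_TO_LB
    return 703 * weight_lb / height_in ** 2


# 48.8 is the suspicious minimum of the weight column; 63 in is an assumed height
bmi = bmi_from_kg(48.8, 63)
print(round(bmi, 1))
```

If the recomputed value rounds to the recorded `bmi`, that supports the reading that the weight was entered in kilograms rather than pounds.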

In [None]:
# counting null values in the auralin column of the treatments table
sum(treatments.auralin.isnull())

0

In [None]:
# counting null values in the novodra column of the treatments table
sum(treatments.novodra.isnull())

0

# Data Wrangling
- This is the process of removing errors and combining complex datasets to make them more accessible and easier to analyze.

#### Quality
##### `patients` table
- Zip code is a float not a string
- Zip code has four digits sometimes
- Tim Neudorf height is 27 in instead of 72 in
- Full state names sometimes, abbreviations other times
- Given name typo: "Dsvid" Gustafsson (should be "David")
- Missing demographic information (address - contact columns) ***(can't clean)***
- Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns)
- Multiple phone number formats
- Default John Doe data
- Multiple records for Jakobsen, Gersten, Taylor
- kgs instead of lbs for Zaitseva weight

##### `treatments` table
- Missing HbA1c changes
- The letter 'u' in starting and ending doses for Auralin and Novodra
- Lowercase given names and surnames
- Missing records (280 instead of 350)
- Erroneous datatypes (auralin and novodra columns)
- Inaccurate HbA1c changes (leading 4s mistaken as 9s)
- Nulls represented as dashes (-) in auralin and novodra columns

##### `adverse_reactions` table
- Lowercase given names and surnames

#### Tidiness
- Contact column in `patients` table should be split into phone number and email
- Three variables in two columns in `treatments` table (treatment, start dose and end dose)
- Adverse reaction should be part of the `treatments` table
- Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables

In [None]:
# Making a copy of the dataset before cleaning the dataset
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

### Missing Data
<font color='red'>Complete the following two "Missing Data" **Define, Code, and Test** sequences after watching the *"Address Missing Data First"* video.</font>
#### `treatments`: Missing records (280 instead of 350)

##### Define
*Your definition here. Note: the missing `treatments` records are stored in a file named `treatments_cut.csv`, which you can see in this Jupyter Notebook's dashboard (click the **jupyter** logo in the top lefthand corner of this Notebook). Hint: [documentation page](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) for the function used in the solution.*

- Import the cut treatments into a DataFrame and concatenate it with the original treatments DataFrame.
- Missing HbA1c changes and inaccurate HbA1c changes (leading 4s mistaken as 9s)

In [None]:
treatments_cut = pd.read_csv('treatments_cut.csv')
# Concatenate the cut treatments with the original treatments,
# which brings the table from 280 rows to 350 rows
treatments_clean = pd.concat([treatments_clean, treatments_cut],
                             ignore_index=True)

In [None]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [None]:
# Confirm that the concatenated table now has 350 rows
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   auralin       350 non-null    object 
 3   novodra       350 non-null    object 
 4   hba1c_start   350 non-null    float64
 5   hba1c_end     350 non-null    float64
 6   hba1c_change  213 non-null    float64
dtypes: float64(3), object(4)
memory usage: 19.3+ KB


In [None]:
# Recalculating the hba1c_change column
treatments_clean.hba1c_change  = (treatments_clean.hba1c_start - treatments_clean.hba1c_end)
treatments_clean.hba1c_change.head()

0    0.43
1    0.47
2    0.43
3    0.35
4    0.32
Name: hba1c_change, dtype: float64
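
Recomputing the column also replaces recorded values that disagree with `hba1c_start - hba1c_end`. Here is a minimal sketch of how such disagreements could be flagged before overwriting, using two rows copied from the `treatments` output above. The 0.005 tolerance and the names `sample`, `recomputed`, and `mismatch` are assumptions for illustration.

```python
import pandas as pd

# Two rows copied from the treatments preview: the second records a change of 0.97
sample = pd.DataFrame({
    "hba1c_start": [7.97, 7.56],
    "hba1c_end": [7.62, 7.09],
    "hba1c_change": [0.35, 0.97],
})

# Recompute the change and round to the two decimals used in the source data
recomputed = (sample["hba1c_start"] - sample["hba1c_end"]).round(2)

# Flag rows where the recorded change differs from the recomputed one
mismatch = sample[(recomputed - sample["hba1c_change"]).abs() > 0.005]
print(mismatch)
```

In the flagged row the recomputed change is 0.47 while the recorded value is 0.97, the pattern described in the quality list as a leading 4 mistaken for a 9.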

### Test the code

In [None]:
# Inspect the last rows of the table to check the result
treatments_clean.tail()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
344,7.51,7.06,0.45,novodra,55,51,nausea,153
345,7.67,7.3,0.37,novodra,26,23,,420
346,9.21,8.8,0.41,novodra,22,23,injection site discomfort,336
347,7.96,7.51,0.45,novodra,28,26,hypoglycemia,25
348,7.68,7.21,0.47,novodra,42,44,injection site discomfort,477


#### `treatments`: Missing HbA1c changes and Inaccurate HbA1c changes (leading 4s mistaken as 9s)
*Note: the "Inaccurate HbA1c changes (leading 4s mistaken as 9s)" observation, which is an accuracy issue and not a completeness issue, is included in this header because it is also fixed by the cleaning operation that fixes the missing "Missing HbA1c changes" observation. Multiple observations in one **Define, Code, and Test** header occurs multiple times in this notebook.*

##### Define
*Your definition here.*
- The letter 'u' in starting and ending doses for Auralin and Novodra

### Tidiness

#### Contact column in `patients` table contains two variables: phone number and email

##### Define
*Your definition here. Hint 1: use regular expressions with pandas' [`str.extract` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html). Here is an amazing [regex tutorial](https://regexone.com/). Hint 2: [various phone number regex patterns](https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number). Hint 3: [email address regex pattern](http://emailregex.com/), which you might need to modify to distinguish the email from the phone number.*
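
To see how the two patterns behave, here is a small sketch that applies the phone and email expressions used in this notebook to two contact strings from the `patients` preview, using Python's built-in `re` module instead of pandas. The helper name `split_contact` is hypothetical.

```python
import re

# Same patterns the notebook passes to str.extract
PHONE = re.compile(r'([+]?[0-9]+[\s+]?[\(]?[\-]?[0-9]+[\)]?[\s+]?[0-9]+[\s+]?[\-]?[0-9]+)')
EMAIL = re.compile(r'([a-zA-Z][a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+)')


def split_contact(contact):
    """Return (phone, email) for one contact string; None where nothing matches."""
    phone = PHONE.search(contact)
    email = EMAIL.search(contact)
    return (phone.group(1) if phone else None,
            email.group(1) if email else None)


samples = [
    "951-719-9170ZoeWellish@superrito.com",   # phone first, then email
    "PamelaSHill@cuvox.de+1 (217) 569-3204",  # email first, then phone
]
results = [split_contact(s) for s in samples]
for phone, email in results:
    print(phone, "|", email)
```

Requiring the email to start with a letter is what keeps the trailing digits of a phone number from being absorbed into the address when the phone comes first.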

In [None]:
# Your cleaning code here

In [None]:

# Separate the phone number and the email address that share the contact column
patients_clean['phone'] = patients_clean['contact'].str.extract(r'([+]?[0-9]+[\s+]?[\(]?[\-]?[0-9]+[\)]?[\s+]?[0-9]+[\s+]?[\-]?[0-9]+)', expand=False)
patients_clean['e_mail'] = patients_clean['contact'].str.extract(r'([a-zA-Z][a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+)', expand=False)

In [None]:
# Drop the original contact column now that phone and e_mail hold its contents
patients_clean = patients_clean.drop(['contact'], axis=1)

In [None]:
patients_clean.tail()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,4/10/1959,181.1,72,24.6,207-477-0579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,3/26/1948,239.6,70,34.4,928-284-4492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,1/13/1971,171.2,67,26.8,816-223-6007,JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,2/13/1952,176.9,67,27.7,1 360 443 2060,ChidaluOnyekaozulu@jourrapide.com
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,5/3/1954,138.2,71,19.3,402-848-4923,PatrickGersten@rhyta.com


In [None]:
patients_clean.phone.sample(25)

289         504-441-7744
354         201-586-2848
395         336-677-8769
77          859-297-3368
193         203-933-3979
330    +1 (843) 368-5129
304         502-902-6188
4           334-515-7487
218         423-563-2014
184    +1 (386) 989-0019
284         774-219-3140
81          580-934-1141
69     +1 (773) 615-9328
142         516-740-5280
313         251-359-2088
131         304-438-2648
72          504-289-1386
109         601-699-4153
240    +1 (256) 615-5522
420         631-479-8171
13     +1 (205) 417-8095
489         925-283-5425
216         646-472-4758
384    +1 (605) 440-5492
84          516-720-5094
Name: phone, dtype: object

In [None]:

patients_clean.e_mail.sample(25)

204    LeixandreAlanisMadrigal@fleckens.hu
100                      IsacBerg@cuvox.de
267            BernardaCindric@teleworm.us
482        DiogoBarrosSouza@jourrapide.com
263     JuliaAzevedoCarvalho@superrito.com
119                   ChajaBouw@dayrep.com
298               ConnorHarold@fleckens.hu
294              AnnieJAllen@superrito.com
88                MariusHansen@teleworm.us
328                 AnjaHueber@teleworm.us
297             CsonkaBodor@jourrapide.com
173                MarijaGrubisic@cuvox.de
402             ManouckWubbels@armyspy.com
264                                    NaN
237                      johndoe@email.com
335                  LixueHsueh@dayrep.com
151           SatsitaBatukayev@teleworm.us
443               KajsaEidem@superrito.com
244                      johndoe@email.com
108              MarinaGlockner@dayrep.com
4                      TimNeudorf@cuvox.de
163        HawraSultanahTuma@superrito.com
13               AnenechiChidi@armyspy.com
432        

In [None]:
# Confirm that no emails start with an integer (regex didn't match for this)
patients_clean.e_mail.sort_values().head()

404               AaliyahRice@dayrep.com
11          Abdul-NurMummarIsa@rhyta.com
332                AbelEfrem@fleckens.hu
258              AbelYonatan@teleworm.us
305    AddolorataLombardi@jourrapide.com
Name: e_mail, dtype: object

In [None]:
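# Convert birthdate to a datetime dtype and store phone and zip_code as strings
# (note: astype('str') can turn missing values into the literal string 'nan',
# which is why phone shows 503 non-null in the info output further down)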
patients_clean['birthdate'] = pd.to_datetime(patients_clean['birthdate'])
patients_clean['phone'] = patients_clean['phone'].astype('str')
patients_clean['zip_code'] = patients_clean['zip_code'].astype('str')

In [None]:
# Normalize zip codes to five-digit strings
def fix_zip(series):
    # Keep the leading run of digits (drops the '.0' left over from the float dtype)
    s = series.astype(str).str.extract(r'(\d+)', expand=False)
    # Left-pad with zeros so every zip code has five digits
    return s.str.zfill(5)

In [None]:
#completing the zip_code column to be five digit all through
patients_clean['zip_code'] = fix_zip(patients_clean['zip_code'])
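
Here is a small standalone sketch of the zero-padding step, using two zip codes that appear as floats in the outputs above plus one missing value. The names `raw`, `digits`, and `zips` are illustrative, and the snippet is a simplified stand-in rather than the notebook's exact function.

```python
import numpy as np
import pandas as pd

# Zip codes as they arrive from the CSV: floats, with NaN for missing entries
raw = pd.Series([92390.0, 3852.0, np.nan])

# Keep only the leading digits, which drops the trailing '.0'
digits = raw.astype(str).str.extract(r'(\d+)', expand=False)

# Pad on the left with zeros up to five characters; missing values stay missing
zips = digits.str.zfill(5)
print(zips)
```

Storing zip codes as strings is what makes the leading zero representable at all; a numeric dtype would drop it again.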

In [None]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,1976-07-10,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,1967-04-03,118.8,66,19.2,+1 (217) 569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,1980-02-19,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095,United States,1951-07-26,220.9,70,31.7,+1 (732) 636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,1928-02-18,192.3,27,26.1,334-515-7487,TimNeudorf@cuvox.de


In [None]:
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    object        
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       491 non-null    object        
 5   city          491 non-null    object        
 6   state         491 non-null    object        
 7   zip_code      491 non-null    object        
 8   country       491 non-null    object        
 9   birthdate     503 non-null    datetime64[ns]
 10  weight        503 non-null    float64       
 11  height        503 non-null    int64         
 12  bmi           503 non-null    float64       
 13  phone         503 non-null    object        
 14  e_mail        491 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(

In [None]:
#patients_clean.query('e_mail == "NaN"', inplace=True)
patients_clean[patients_clean['e_mail'].isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
209,210,female,Lalita,Eldarkhanov,,,,,,1950-08-14,143.4,62,26.2,,
219,220,male,Mỹ,Quynh,,,,,,1978-04-09,237.8,69,35.1,,
230,231,female,Elisabeth,Knudsen,,,,,,1976-09-23,165.9,63,29.4,,
234,235,female,Martina,Tománková,,,,,,1936-04-07,199.5,65,33.2,,
242,243,male,John,O'Brian,,,,,,1957-02-25,205.3,74,26.4,,
249,250,male,Benjamin,Mehler,,,,,,1951-10-30,146.5,69,21.6,,
257,258,male,Jin,Kung,,,,,,1995-05-17,231.7,69,34.2,,
264,265,female,Wafiyyah,Asfour,,,,,,1989-11-03,158.6,63,28.1,,
269,270,female,Flavia,Fiorentino,,,,,,1937-10-09,175.2,61,33.1,,
278,279,female,Generosa,Cabán,,,,,,1962-12-16,124.3,69,18.4,,


In [None]:
# filling the null values with Unknown
patients_clean = patients_clean.fillna('Unknown')
patients_clean

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,1976-07-10,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,1967-04-03,118.8,66,19.2,+1 (217) 569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,1980-02-19,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,07095,United States,1951-07-26,220.9,70,31.7,+1 (732) 636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,1928-02-18,192.3,27,26.1,334-515-7487,TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,03852,United States,1959-04-10,181.1,72,24.6,207-477-0579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341,United States,1948-03-26,239.6,70,34.4,928-284-4492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110,United States,1971-01-13,171.2,67,26.8,816-223-6007,JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109,United States,1952-02-13,176.9,67,27.7,1 360 443 2060,ChidaluOnyekaozulu@jourrapide.com


In [None]:
# Now our patient dataset is clean
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    object        
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       503 non-null    object        
 5   city          503 non-null    object        
 6   state         503 non-null    object        
 7   zip_code      503 non-null    object        
 8   country       503 non-null    object        
 9   birthdate     503 non-null    datetime64[ns]
 10  weight        503 non-null    float64       
 11  height        503 non-null    int64         
 12  bmi           503 non-null    float64       
 13  phone         503 non-null    object        
 14  e_mail        503 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(

In [None]:
# Check whether any nulls remain in the e_mail column of the patients table (should be empty)
patients_clean[patients_clean['e_mail'].isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail


#### Three variables in two columns in `treatments` table (treatment, start dose and end dose)

##### Define
*Your definition here. Hint: use pandas' [melt function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) and [`str.split()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html). Here is an excellent [`melt` tutorial](https://deparkes.co.uk/2016/10/28/reshape-pandas-data-with-melt/).*

In [None]:
#Recalculating hba1c_change
treatments_clean.hba1c_change = (treatments_clean.hba1c_start - 
                                 treatments_clean.hba1c_end)
treatments_clean.hba1c_change

0      0.43
1      0.47
2      0.43
3      0.35
4      0.32
       ... 
345    0.34
346    0.45
347    0.30
348    0.47
349    0.46
Name: hba1c_change, Length: 350, dtype: float64

In [None]:
# Preview the table before reshaping
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [None]:
# Reshape to tidy form: melt the auralin and novodra columns into a
# treatment column (the drug name) and a dose column (its "start - end" value)
treatments_clean = pd.melt(treatments_clean, id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'],
                           var_name='treatment', value_name='dose')

# Keep only rows with an actual dose ("-" marks the treatment a patient did not receive)
treatments_clean = treatments_clean[treatments_clean.dose != "-"].copy()

# Split the dose column on " - " into start dose and end dose
treatments_clean[['dose_start', 'dose_end']] = treatments_clean['dose'].str.split(' - ', n=1, expand=True)

# Drop the dose column once the split is done
treatments_clean = treatments_clean.drop('dose', axis=1)

In [None]:
# Preview the reshaped table
treatments_clean.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end
0,veronika,jindrová,7.63,7.2,0.43,auralin,41u,48u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u
6,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u
7,eddie,archer,7.89,7.55,0.34,auralin,31u,38u
9,asia,woźniak,7.76,7.37,0.39,auralin,30u,36u


#### Adverse reaction should be part of the treatments table
##### Define
Merge the adverse_reaction column to the treatments table, joining on given name and surname.
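
A left join keeps every treatment row and fills `adverse_reaction` only where a name matches. Here is a minimal sketch on toy data, using pandas' `indicator=True` option to show which rows found a match. The `treatment` values in the toy frame are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy treatments table; the treatment values here are illustrative only
treat = pd.DataFrame({"given_name": ["veronika", "berta"],
                      "surname": ["jindrová", "napolitani"],
                      "treatment": ["auralin", "auralin"]})

# One adverse reaction row, copied from the adverse_reactions preview
reactions = pd.DataFrame({"given_name": ["berta"],
                          "surname": ["napolitani"],
                          "adverse_reaction": ["injection site discomfort"]})

# how='left' keeps all treatment rows; indicator adds a _merge column
merged = pd.merge(treat, reactions, on=["given_name", "surname"],
                  how="left", indicator=True)
print(merged)
```

Rows marked `left_only` are patients with no recorded adverse reaction, so their `adverse_reaction` value is missing after the merge.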

In [None]:
treatments_clean = pd.merge(treatments_clean, adverse_reactions_clean,
                            on=['given_name', 'surname'], how='left')

In [None]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 349
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   given_name        350 non-null    object 
 1   surname           350 non-null    object 
 2   hba1c_start       350 non-null    float64
 3   hba1c_end         350 non-null    float64
 4   hba1c_change      350 non-null    float64
 5   treatment         350 non-null    object 
 6   dose_start        350 non-null    object 
 7   dose_end          350 non-null    object 
 8   adverse_reaction  35 non-null     object 
dtypes: float64(3), object(6)
memory usage: 27.3+ KB


In [None]:
treatments_clean

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction
0,veronika,jindrová,7.63,7.20,0.43,auralin,41u,48u,
1,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u,
2,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u,
3,eddie,archer,7.89,7.55,0.34,auralin,31u,38u,
4,asia,woźniak,7.76,7.37,0.39,auralin,30u,36u,
...,...,...,...,...,...,...,...,...,...
345,christopher,woodward,7.51,7.06,0.45,novodra,55u,51u,nausea
346,maret,sultygov,7.67,7.30,0.37,novodra,26u,23u,
347,lixue,hsueh,9.21,8.80,0.41,novodra,22u,23u,injection site discomfort
348,jakob,jakobsen,7.96,7.51,0.45,novodra,28u,26u,hypoglycemia


#### Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables, and lowercase given names and surnames
### Define
Adverse reactions table is no longer needed so ignore that part. 
- Isolate the patient ID and names in the patients table, then convert these names to lower case to join with treatments. 
- Then drop the given name and surname columns in the treatments table (so these being lowercase isn't an issue anymore).

### Code

In [None]:
# Keep only the patient_id, given name, and surname from the patients table
id_names = patients_clean[['patient_id', 'given_name', 'surname']].copy()

# Lowercase the given name to match the treatments table
id_names['given_name'] = id_names['given_name'].str.lower()

# Lowercase the surname to match the treatments table
id_names['surname'] = id_names['surname'].str.lower()

# Attach patient_id to the treatments table by matching on given_name and surname
treatments_clean = pd.merge(treatments_clean, id_names, on=['given_name', 'surname'])
treatments_clean

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,veronika,jindrová,7.63,7.20,0.43,auralin,41u,48u,,225
1,skye,gormanston,7.97,7.62,0.35,auralin,33u,36u,,242
2,sophia,haugen,7.65,7.27,0.38,auralin,37u,42u,,345
3,eddie,archer,7.89,7.55,0.34,auralin,31u,38u,,276
4,asia,woźniak,7.76,7.37,0.39,auralin,30u,36u,,15
...,...,...,...,...,...,...,...,...,...,...
344,christopher,woodward,7.51,7.06,0.45,novodra,55u,51u,nausea,153
345,maret,sultygov,7.67,7.30,0.37,novodra,26u,23u,,420
346,lixue,hsueh,9.21,8.80,0.41,novodra,22u,23u,injection site discomfort,336
347,jakob,jakobsen,7.96,7.51,0.45,novodra,28u,26u,hypoglycemia,25


In [None]:
# Normalize the table by dropping given_name and surname from the treatments_clean DataFrame
treatments_clean = treatments_clean.drop(['given_name', 'surname'], axis=1)
treatments_clean

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change,treatment,dose_start,dose_end,adverse_reaction,patient_id
0,7.63,7.20,0.43,auralin,41u,48u,,225
1,7.97,7.62,0.35,auralin,33u,36u,,242
2,7.65,7.27,0.38,auralin,37u,42u,,345
3,7.89,7.55,0.34,auralin,31u,38u,,276
4,7.76,7.37,0.39,auralin,30u,36u,,15
...,...,...,...,...,...,...,...,...
344,7.51,7.06,0.45,novodra,55u,51u,nausea,153
345,7.67,7.30,0.37,novodra,26u,23u,,420
346,9.21,8.80,0.41,novodra,22u,23u,injection site discomfort,336
347,7.96,7.51,0.45,novodra,28u,26u,hypoglycemia,25


In [None]:
# Patient ID should be the only duplicate column
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))
all_columns[all_columns.duplicated()]

22    patient_id
dtype: object

### Quality
Zip code is a float not a string and Zip code has four digits sometimes.

### Define
Convert the zip code column's data type from a float to a string using astype, remove the '.0' using string slicing, and pad four digit zip codes with a leading 0.

### Code

In [None]:
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')
# Restore the NaN entries that the line above converted to the string '0000n'
patients_clean.zip_code = patients_clean.zip_code.replace('0000n', np.nan)

In [None]:
patients_clean.zip_code.head()

0    00923
1    00618
2    00684
3    00070
4    00363
Name: zip_code, dtype: object
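
One caveat with the slicing approach: it is not idempotent. Running the cell a second time strips two more characters (for example `'92390'` becomes `'923'` and is then padded to `'00923'`), which may be what the output above reflects. A minimal sketch of a version that can be rerun safely follows. The helper name `clean_zip` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample values: two five-digit zips, one four-digit zip, one missing entry
zips = pd.Series([92390.0, 61812.0, 7095.0, np.nan])

def clean_zip(value):
    """Return a 5-character zip string, or NaN when the value is missing."""
    if pd.isna(value):
        return np.nan
    # int(float(...)) drops a trailing '.0' whether the input is a float or a string
    return str(int(float(value))).zfill(5)

cleaned = zips.apply(clean_zip)
# Applying the helper again leaves already-clean values unchanged
cleaned_again = cleaned.apply(clean_zip)
print(cleaned.tolist())
```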

Tim Neudorf's height is 27 in instead of 72 in
### Define
Replace height for rows in the patients table that have a height of 27 in (there is only one) with 72 in.

### Code

In [None]:
patients_clean.height = patients_clean.height.replace(27, 72)

In [None]:
# Should be empty
patients_clean[patients_clean.height == 27]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail


In [None]:
# Confirm the replacement worked
patients_clean[patients_clean.surname == 'Neudorf']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,363,United States,1928-02-18,192.3,72,26.1,334-515-7487,TimNeudorf@cuvox.de


- Full state names sometimes, abbreviations other times
### Define
Apply a function that converts full state name to state abbreviation for California, New York, Illinois, Florida, and Nebraska.

### Code

In [None]:
# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
                'New York': 'NY',
                'Illinois': 'IL',
                'Florida': 'FL',
                'Nebraska': 'NE'}

# Function to apply
def abbreviate_state(patient):
    if patient['state'] in state_abbrev.keys():
        abbrev = state_abbrev[patient['state']]
        return abbrev
    else:
        return patient['state']
    
patients_clean['state'] = patients_clean.apply(abbreviate_state, axis=1)

In [None]:
patients_clean.state.value_counts()

CA         60
NY         47
TX         32
IL         24
FL         22
MA         22
PA         18
GA         15
OH         14
MI         13
OK         13
LA         13
NJ         12
Unknown    12
VA         11
WI         10
MS         10
IN          9
MN          9
TN          9
AL          9
WA          8
KY          8
NC          8
MO          7
NV          6
KS          6
ID          6
NE          6
SC          5
CT          5
IA          5
CO          4
ND          4
RI          4
ME          4
AR          4
AZ          4
SD          3
MD          3
WV          3
OR          3
DE          3
VT          2
MT          2
DC          2
WY          1
AK          1
NH          1
NM          1
Name: state, dtype: int64
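
The row-wise `apply` above works, but the same mapping can be expressed more compactly with `Series.replace`, which swaps only the values found in the dictionary and leaves everything else untouched. A short sketch on a hypothetical sample of the `state` column:

```python
import pandas as pd

state_abbrev = {'California': 'CA',
                'New York': 'NY',
                'Illinois': 'IL',
                'Florida': 'FL',
                'Nebraska': 'NE'}

# Sample mixing full names with values that are already abbreviated
states = pd.Series(['California', 'NY', 'Illinois', 'TX', 'Nebraska'])

# Values that are not dictionary keys pass through unchanged
abbreviated = states.replace(state_abbrev)
print(abbreviated.tolist())
```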

Given name misspelled as 'Dsvid' for David Gustafsson
### Define
Replace given name for rows in the patients table that have a given name of 'Dsvid' with 'David'.

### Code

In [None]:
#replacing the incorrect name with the correct one
patients_clean.given_name = patients_clean.given_name.replace('Dsvid', 'David')

In [None]:
#Confirming our changes
patients_clean[patients_clean.surname == 'Gustafsson']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,641,United States,1937-03-06,163.9,66,26.5,816-265-9578,DavidGustafsson@armyspy.com


- Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns), erroneous datatypes (auralin and novodra columns), and the letter 'u' in starting and ending doses for Auralin and Novodra
### Define
Convert assigned sex and state to categorical data types. Zip code data type was already addressed above. Convert birthdate to datetime data type. Strip the letter 'u' in start dose and end dose and convert those columns to data type integer.

### Code

In [None]:
# Convert assigned_sex and state to the category data type
patients_clean.assigned_sex = patients_clean.assigned_sex.astype('category')
patients_clean.state = patients_clean.state.astype('category')

# Convert birthdate to the datetime data type
patients_clean.birthdate = pd.to_datetime(patients_clean.birthdate)

# Strip the letter 'u' from the doses and convert them to integers
treatments_clean.dose_start = treatments_clean.dose_start.str.strip('u').astype(int)
treatments_clean.dose_end = treatments_clean.dose_end.str.strip('u').astype(int)

In [None]:
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    category      
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       503 non-null    object        
 5   city          503 non-null    object        
 6   state         503 non-null    category      
 7   zip_code      503 non-null    object        
 8   country       503 non-null    object        
 9   birthdate     503 non-null    datetime64[ns]
 10  weight        503 non-null    float64       
 11  height        503 non-null    int64         
 12  bmi           503 non-null    float64       
 13  phone         503 non-null    object        
 14  e_mail        503 non-null    object        
dtypes: category(2), datetime64[ns](1), float

In [None]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349 entries, 0 to 348
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   hba1c_start       349 non-null    float64
 1   hba1c_end         349 non-null    float64
 2   hba1c_change      349 non-null    float64
 3   treatment         349 non-null    object 
 4   dose_start        349 non-null    int64  
 5   dose_end          349 non-null    int64  
 6   adverse_reaction  35 non-null     object 
 7   patient_id        349 non-null    int64  
dtypes: float64(3), int64(3), object(2)
memory usage: 24.5+ KB
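
The three conversions can be checked in isolation on a small frame. This is only a sketch: the frame `demo` is hypothetical, and the explicit `format` string assumes the month/day/year layout seen in the raw `patients` table.

```python
import pandas as pd

# Hypothetical rows with the same raw formats as the source tables
demo = pd.DataFrame({
    'assigned_sex': ['female', 'male', 'female'],
    'birthdate': ['7/10/1976', '4/3/1967', '2/18/1928'],
    'dose_start': ['41u', '33u', '37u'],
})

# A column with few distinct values is a natural fit for the category dtype
demo['assigned_sex'] = demo['assigned_sex'].astype('category')

# An explicit format avoids ambiguity between month-first and day-first dates
demo['birthdate'] = pd.to_datetime(demo['birthdate'], format='%m/%d/%Y')

# rstrip removes only the trailing unit marker before the integer cast
demo['dose_start'] = demo['dose_start'].str.rstrip('u').astype(int)

print(demo.dtypes)
```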


Multiple phone number formats
### Define
- Strip all " ", "-", "(", ")", and "+" and store each number without any formatting. Pad the phone number with a 1 if the length of the number is 10 digits (we want country code).

### Code

In [None]:
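# Strip non-digits and pad 10-digit numbers with country code 1; assign to the phone column itself so the change is stored
patients_clean['phone'] = patients_clean.phone.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')

In [None]:
patients_clean.phone.head()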

0    19517199170
1    12175693204
2    14023636804
3    17326368246
4    13345157487
Name: phone, dtype: object
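
A quick way to confirm the normalization is to run it on one sample of each format and check that every result is 11 digits long. The sketch below uses three numbers that appear in this notebook. Note that `regex=True` is passed explicitly, because recent pandas versions treat the pattern in `str.replace` as a literal string by default.

```python
import pandas as pd

# One sample per phone format seen in the patients table
phones = pd.Series(['951-719-9170', '+1 (217) 569-3204', '1 979 203 0438'])

# Remove every non-digit character
digits = phones.str.replace(r'\D+', '', regex=True)

# Left-pad 10-digit numbers with the country code 1
normalized = digits.str.pad(11, fillchar='1')

# Every number should now consist of exactly 11 digits
all_valid = normalized.str.fullmatch(r'\d{11}').all()
print(normalized.tolist(), all_valid)
```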

Default John Doe data
### Define
Remove the non-recoverable John Doe records from the patients table.

### Code

In [None]:
patients_clean = patients_clean[patients_clean.surname != 'Doe']

In [None]:
# Should be no Doe records
patients_clean.surname.value_counts()

Jakobsen       3
Taylor         3
Aranda         2
Tucker         2
Souza          2
              ..
Casárez        1
Mata           1
Pospíšil       1
Rukavina       1
Onyekaozulu    1
Name: surname, Length: 465, dtype: int64

In [None]:
# Should be no 123 Main Street records
patients_clean.address.value_counts()

Unknown                     12
2778 North Avenue            2
2476 Fulton Street           2
648 Old Dear Lane            2
576 Brown Bear Drive         1
                            ..
1066 Goosetown Drive         1
4291 Patton Lane             1
4643 Reeves Street           1
174 Lost Creek Road          1
3652 Boone Crockett Lane     1
Name: address, Length: 483, dtype: int64

Multiple records for Jakobsen, Gersten, Taylor
### Define
- Remove the Jake Jakobsen, Pat Gersten, and Sandy Taylor rows from the patients table. These are nicknames, which also happen not to be in the treatments table (removing the wrong name would create a consistency issue between the patients and treatments tables). Each is the second occurrence of its duplicate. These are also the only occurrences of non-null duplicate addresses.

### Code

In [None]:
# tilde means not: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
patients_clean = patients_clean[~((patients_clean.address.duplicated()) & patients_clean.address.notnull())]

In [None]:
patients_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 483 entries, 0 to 501
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    483 non-null    int64         
 1   assigned_sex  483 non-null    category      
 2   given_name    483 non-null    object        
 3   surname       483 non-null    object        
 4   address       483 non-null    object        
 5   city          483 non-null    object        
 6   state         483 non-null    category      
 7   zip_code      483 non-null    object        
 8   country       483 non-null    object        
 9   birthdate     483 non-null    datetime64[ns]
 10  weight        483 non-null    float64       
 11  height        483 non-null    int64         
 12  bmi           483 non-null    float64       
 13  phone         483 non-null    object        
 14  e_mail        483 non-null    object        
dtypes: category(2), datetime64[ns](1), float

### Test

In [None]:
patients_clean[patients_clean.surname == 'Jakobsen']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,127,United States,1985-08-01,155.8,67,24.4,+1 (845) 858-7707,JakobCJakobsen@einrot.com
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,770,United States,1962-11-25,185.2,67,29.0,1 979 203 0438,KarenJakobsen@jourrapide.com


In [None]:
patients_clean[patients_clean.surname == 'Gersten']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,683,United States,1954-05-03,138.2,71,19.3,402-848-4923,PatrickGersten@rhyta.com


In [None]:
patients_clean[patients_clean.surname == 'Taylor']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,e_mail
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,259,United States,1960-10-23,206.1,64,35.4,304-438-2648,SandraCTaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,331,United States,1992-09-02,186.6,69,27.6,305-434-6299,RogelioJTaylor@teleworm.us
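
The filter used in this step relies on `duplicated()` marking only the second and later occurrences of a value, and on the `notnull()` guard keeping rows whose address is missing. A minimal sketch of that behaviour follows. The last two rows are hypothetical patients added to show the null case.

```python
import pandas as pd

demo = pd.DataFrame({
    'given_name': ['Jakob', 'Jake', 'Karen', 'Ann', 'Bo'],
    'surname': ['Jakobsen', 'Jakobsen', 'Jakobsen', 'Lee', 'Kim'],
    'address': ['648 Old Dear Lane', '648 Old Dear Lane',
                '1690 Fannie Street', None, None],
})

# True only for a repeated address that is not null
is_repeat = demo['address'].duplicated(keep='first') & demo['address'].notnull()

# The tilde inverts the mask, keeping first occurrences and null addresses
deduped = demo[~is_repeat]
print(deduped['given_name'].tolist())
```

Without the `notnull()` guard, the second row with a missing address would also be dropped, because `duplicated()` treats missing values as equal to each other.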


kgs instead of lbs for Zaitseva weight
### Define
- Use advanced indexing to isolate the row where the surname is Zaitseva and convert the entry in its weight field from kg to lbs.

### Code

In [None]:
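# Zaitseva's weight was recorded in kg and is the smallest value in the column
weight_kg = patients_clean.weight.min()
mask = patients_clean.surname == 'Zaitseva'
column_name = 'weight'
# Overwrite that single entry with its value in lbs
patients_clean.loc[mask, column_name] = weight_kg * 2.20462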

In [None]:
# 48.8 shouldn't be the lowest anymore
patients_clean.weight.sort_values()

459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
       ...  
61     244.9
144    244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 483, dtype: float64
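
Because the code above reads the value with `min()`, rerunning the cell would convert a different patient's weight. A variant that reads the value from the masked row itself avoids depending on the column minimum. This is a sketch on a hypothetical two-row frame, using the 48.8 kg figure mentioned in the test comment and Tim Neudorf's weight from earlier.

```python
import pandas as pd

KG_TO_LB = 2.20462

demo = pd.DataFrame({'surname': ['Zaitseva', 'Neudorf'],
                     'weight': [48.8, 192.3]})

# Select the row by surname and convert only that entry
mask = demo['surname'] == 'Zaitseva'
demo.loc[mask, 'weight'] = demo.loc[mask, 'weight'] * KG_TO_LB

print(demo)
```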

## Comparing Key Metrics
After assessing and cleaning the clinical trial data set, we are ready to determine how the proposed new oral insulin, Auralin, compares to the injectable insulin Novodra.

### Adverse Reactions
For Auralin to pass this Phase II clinical trial it must be deemed safe, and the adverse reaction data are encouraging.

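Adverse reaction counts by treatment:

| treatment | adverse_reaction | count |
|---|---|---|
| Auralin | cough | 1 |
| Auralin | headache | 1 |
| Auralin | hypoglycemia | 10 |
| Auralin | nausea | 1 |
| Auralin | throat irritation | 2 |
| Novodra | cough | 1 |
| Novodra | headache | 2 |
| Novodra | hypoglycemia | 10 |
| Novodra | injection site discomfort | 6 |
| Novodra | nausea | 1 |
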
The adverse reactions were originally in a standalone table; we joined them to the treatments table to make this analysis possible. Between the two drugs, Auralin and Novodra, the counts of each adverse reaction are quite similar. One exception is throat irritation for Auralin, which is expected because the oral insulin passes through the throat on its way to the stomach. The other is injection site discomfort for Novodra, a well-known adverse reaction to injectable insulin because of the needles. This is one of the reasons we want an oral insulin in the first place.
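
A table of counts like this one can be produced with a two-level `groupby`. The sketch below uses a small hypothetical frame, not the trial data. Rows with no recorded adverse reaction are excluded automatically, because `groupby` drops missing keys by default.

```python
import pandas as pd

# Hypothetical sample: one row per patient, None when no reaction was reported
demo = pd.DataFrame({
    'treatment': ['auralin', 'auralin', 'auralin',
                  'novodra', 'novodra', 'novodra'],
    'adverse_reaction': ['hypoglycemia', 'throat irritation', None,
                         'hypoglycemia', 'injection site discomfort',
                         'hypoglycemia'],
})

# Count rows for each (treatment, adverse_reaction) pair
counts = demo.groupby(['treatment', 'adverse_reaction']).size()
print(counts)
```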

These counts are clearer in the horizontal bar charts.

[Figure: Adverse Reaction Count Bar Charts for Auralin and Novodra]

### Pre-trial/Post-trial Mean Insulin Dose Change
Dose change is important because if the new oral insulin requires a higher dosage to be effective, the manufacturer might not bring this to market because it wouldn't be financially feasible.

The dosage information was hidden in two columns in the treatments table, auralin and novodra, with the start dose and end dose combined in each column and the treatment value in each column header. We converted this to a tidy format by melting the treatment variable down to its own column and splitting the doses into dose_start and dose_end. This allowed us to run a mean dose change analysis:

Mean Dosage Change

| treatment | mean dose change |
|---|---|
| Auralin | -8.325714 |
| Novodra | 0.377143 |

- Again, the results here are good for Auralin. Patients treated with Auralin required, on average, about 8 more units of insulin to establish a safe, steady blood sugar level, while Novodra patients required, on average, about 0.4 units less. (The values above are start dose minus end dose, so a negative number means the dose went up.)

- Auralin requiring 8 more units is expected because oral insulin has a tougher time getting into the bloodstream through the stomach lining, and eight more units isn't a big deal.

*Mean Insulin Unit Change for Auralin and Novodra*
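
The mean dose change can be sketched as a subtraction followed by a `groupby`. The sign convention (start dose minus end dose) is an assumption inferred from the reported values, and the four rows are taken from the treatments output shown earlier, so the means printed here are not the trial-wide figures.

```python
import pandas as pd

# Four rows from the cleaned treatments table, used only for illustration
demo = pd.DataFrame({
    'treatment': ['auralin', 'auralin', 'novodra', 'novodra'],
    'dose_start': [41, 33, 55, 26],
    'dose_end': [48, 36, 51, 23],
})

# Assumed convention: negative values mean the dose increased during the trial
demo['dose_change'] = demo['dose_start'] - demo['dose_end']

mean_change = demo.groupby('treatment')['dose_change'].mean()
print(mean_change)
```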


### HbA1c Change
HbA1c change is our key indicator for diabetes control. Most patients in this trial start around 7.9 percent, so if we can establish that Auralin causes a reduction in HbA1c similar to that of the current injectable insulin standard, that's a success. We can measure that through a confidence interval, but first we need to establish the difference in means.

### Before Cleaning
Before cleaning, Novodra had a massive advantage in HbA1c reduction: 0.71 compared to 0.35 for Auralin.

| treatment | mean HbA1c change |
|---|---|
| Auralin | 0.344872 |
| Novodra | 0.714731 |

*Pre-trial/Post-trial Mean HbA1c Change (Unclean Data)*

### After Cleaning
After cleaning, the difference is much smaller.

| treatment | mean HbA1c change |
|---|---|
| Auralin | 0.387657 |
| Novodra | 0.40491 |

*Pre-trial/Post-trial Mean HbA1c Change (Clean Data)*

These results are encouraging, but clinical trial results require more rigorous statistical analysis.

### Confidence Interval
A confidence interval is the range of values that a parameter is likely to fall in with a specific probability. We want the upper limit of the confidence interval for the difference in means to be less than 0.4: if it is, we can be highly confident that any gap in HbA1c reduction between the two treatments is smaller than 0.4.

- Before cleaning, the upper limit of the confidence interval is 0.43, which means that Auralin would not have passed the Phase II clinical trial. After cleaning, the HbA1c reductions are very similar and the upper limit of the confidence interval is 0.03.

| before_CI_upper_limit | after_CI_upper_limit |
|---|---|
| 0.43 | 0.03 |

0.03 is well below 0.4, which means that Auralin oral insulin is similarly effective to Novodra injectable insulin.
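
The notebook does not show how these limits were computed. One common approach, sketched here under stated assumptions, is a normal-approximation 95% interval for the difference between two independent sample means. The helper `diff_in_means_ci` is hypothetical, and the data are synthetic values generated for illustration, not the trial measurements.

```python
import numpy as np

def diff_in_means_ci(a, b, z=1.96):
    """Approximate CI for mean(a) - mean(b), assuming independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean() - b.mean()
    # Standard error of the difference, using unpooled sample variances
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return diff - z * se, diff + z * se

# Synthetic HbA1c reductions centred near the cleaned means reported above
rng = np.random.default_rng(0)
novodra = rng.normal(0.40, 0.05, size=175)
auralin = rng.normal(0.39, 0.05, size=175)

lower, upper = diff_in_means_ci(novodra, auralin)
print(f"95% CI for the difference in means: ({lower:.3f}, {upper:.3f})")
print("Upper limit below 0.4:", upper < 0.4)
```

For small samples a t-based interval would be more appropriate than the fixed `z=1.96` used here.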

### Good News!

- Our oral insulin, Auralin, passed Phase II clinical trials! This is a big deal because the probability of success for Phase II trials is 31%. A successful Phase II trial means we have a good chance of making it past Phase III and the regulatory review process to make it to market.

- If it does, this oral insulin would be an enormous breakthrough in treating Type I and Type II diabetes patients, as freedom from daily injections would liberate patients, reduce missed doses and therefore reduce irritating and sometimes serious complications from diabetes.

Great job assessing and cleaning this data!

### More Reading
Confidence intervals are an important measure of the validity of our results. Learn more about confidence intervals:

- Statistics Teaching Tools: Confidence Intervals
- Wikipedia: Confidence Interval

### New Term

| Term | Definition |
|---|---|
| Confidence Interval | A statistical term that refers to the range of values that a parameter will fall in with a specific probability |