# L3 Q8. Visual Assessment
This Auralin Phase II clinical trial dataset comes in three tables: `patients`, `treatments`, and `adverse_reactions`. Acquaint yourself with them through visual assessment below.

## Gather

In [655]:
import pandas as pd
import numpy as np

In [656]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')


## Assess

In [23]:
# Visual assessment
patients


Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


In [7]:
treatments


Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
275,albina,zetticci,45u - 51u,-,7.93,7.73,0.20
276,john,teichelmann,-,49u - 49u,7.90,7.58,
277,mathea,lillebø,23u - 36u,-,9.04,8.67,0.37
278,vallie,prince,31u - 38u,-,7.64,7.28,0.36


In [8]:
adverse_reactions


Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


## Conclusions

### Quality
- Patients table: zip_code is stored as a float, not a string (validity)
- Patients table: zip_code has a decimal point, e.g. 92390.0 (validity)
- Patients table: Tim Neudorf's height is 27 in instead of 72 in (validity)
- Patients table: state is inconsistent, with full names in some rows and abbreviations in others, e.g. New York / NY (consistency)
- Treatments table: NaN values in hba1c_change (completeness)
- Patients table: given_name for patient_id 9 (accuracy)
- Treatments table: Missing rows (completeness)
- Treatments table: Unit `u` stored next to the dose (validity)
- Treatments and adverse_reactions tables: Lowercase names (consistency)
- All tables: The column name given_name could be more explanatory, e.g. first_name (validity)

### Tidiness
- Treatment data is messy: it is spread across two different tables (but this can be good for analysis)

## Data Quality Dimensions
1. **Completeness**: Do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
2. **Validity**: We have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
3. **Accuracy**: Inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
4. **Consistency**: Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

# L3 Q14. Programmatic Assessment
These are the programmatic assessment methods in pandas that you will probably use most often:

* .head (DataFrame and Series)
* .tail (DataFrame and Series)
* .sample (DataFrame and Series)
* .info (DataFrame only)
* .describe (DataFrame and Series)
* .value_counts (Series only)
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)
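
As a quick illustration of the methods listed above, here is a minimal sketch on a toy DataFrame. The frame `toy` is an assumption for illustration only; its four rows reuse a few values from the `patients` output shown earlier.

```python
import pandas as pd

# Toy stand-in for the patients table (a few values copied from the output above)
toy = pd.DataFrame({
    'given_name': ['Zoe', 'Pamela', 'Jae', 'Tim'],
    'state': ['California', 'Illinois', 'Nebraska', 'AL'],
    'height': [66, 66, 71, 27],
})

first_rows = toy.head(2)                  # first two rows
summary = toy.describe()                  # count, mean, std, min, quartiles, max (numeric columns)
state_counts = toy.state.value_counts()   # frequency of each distinct value
short = toy[toy.height < 48]              # boolean indexing: implausibly small heights
tim_height = toy.loc[toy.given_name == 'Tim', 'height']  # .loc with a mask and a column label
last_row = toy.iloc[-1]                   # position-based selection

toy.info()                                # prints dtypes and non-null counts
```

Looking at `summary` and `short` is how an outlier such as the height of 27 tends to surface without scrolling through every row.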




In [31]:
patients.head()


Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [32]:
patients.tail()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3


In [37]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [34]:
treatments.sample(5)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
167,lamara,dratchev,49u - 58u,-,7.75,7.4,0.35
234,haylom,nebay,-,42u - 44u,7.62,7.22,0.9
261,caroline,shuler,-,50u - 54u,7.63,7.27,
128,david,gustafsson,-,33u - 34u,7.72,7.28,0.94
76,noriyuki,sakai,-,32u - 31u,7.58,7.16,0.92


In [35]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [39]:
# Nr of patients in New York City
(patients.city == 'New York').sum()

18

In [89]:
patients.surname.value_counts()
patients.patient_id[patients.surname == 'Doe']

215    216
229    230
237    238
244    245
251    252
277    278
Name: patient_id, dtype: int64

In [80]:
patients.address.value_counts()

123 Main Street          6
2476 Fulton Street       2
2778 North Avenue        2
648 Old Dear Lane        2
2645 Moore Avenue        1
                        ..
1251 Clarence Court      1
4040 Linda Street        1
4277 Mutton Town Road    1
206 Eagle Lane           1
3781 Hamill Avenue       1
Name: address, Length: 483, dtype: int64

In [87]:
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
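
Note that `duplicated()` treats missing values as equal to each other, which is why rows without any address appear in the output above. The sketch below, on a made-up toy frame, shows one way to restrict the check to rows that actually have an address and to list every copy with `keep=False`.

```python
import numpy as np
import pandas as pd

# Toy frame: names and addresses are illustrative only
toy = pd.DataFrame({
    'surname': ['Jakobsen', 'Jakobsen', 'Quynh', 'Knudsen', 'Doe', 'Doe'],
    'address': ['648 Old Dear Lane', '648 Old Dear Lane', np.nan, np.nan,
                '123 Main Street', '123 Main Street'],
})

# Default behaviour: the second NaN is flagged as a duplicate of the first
default_flags = toy.address.duplicated()

# Only consider rows with an address, and flag every copy (not just the later ones)
has_address = toy.address.notna()
all_copies = toy[has_address & toy.address.duplicated(keep=False)]
```

With `keep=False` both members of each duplicated pair are returned, which makes it easier to compare the records side by side.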


In [90]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [91]:
sum(treatments.auralin.isnull())

0

In [92]:
sum(treatments.novodra.isnull())

0

In [None]:
### Quality
- Patients table: zip_code is stored as a float, not a string (validity)
- Patients table: zip_code has a decimal point, e.g. 92390.0 (validity)
- Patients table: Tim Neudorf's height is 27 in instead of 72 in (validity)
- Patients table: state is inconsistent, with full names in some rows and abbreviations in others, e.g. New York / NY (consistency)
- Treatments table: NaN values in hba1c_change (completeness)
- Patients table: given_name for patient_id 9 (accuracy)
- Treatments table: Missing rows (completeness)
- Treatments table: Unit `u` stored next to the dose (validity)
- Treatments and adverse_reactions tables: Lowercase names (consistency)
- All tables: The column name given_name could be more explanatory, e.g. first_name (validity)
- Patients and treatments tables: Erroneous data types (assigned_sex, state, zip_code, birthdate, auralin and novodra columns)
- Treatments table: Inaccurate HbA1c changes, leading 4s mistaken as 9s (accuracy)
- Patients table: Multiple phone number formats
- Patients table: Default John Doe data
- Patients table: Duplicate records for Jakobsen, Gersten, Taylor
- Patients table: Inconsistency in weights, 48.8 is far below all other values
- Treatments table: Nulls in auralin and novodra represented as dashes

### Tidiness
- Patients table: Email and phone number stored in the same column (contact)
- Treatments table: auralin and novodra each hold two values (start dose and end dose) in one column
- Treatments table: Column headers auralin and novodra are actually values of a variable; create a treatment column


# L4 Q1. Programmatic Data Cleaning
- Define
- Code 
- Test

In [None]:
# Example (you can also repeat the define, code and test steps for each issue)
# gather
import pandas as pd 
df = pd.read_csv('animals.csv')

# assess
df.head() 

# clean
df_clean = df.copy()

# Remove 'bb' before every animal name using string slicing
df_clean['Animal'] = df_clean['Animal'].str[2:]
# Replace ! with . in body weight and brain weight columns
df_clean['Body weight (kg)'] = df_clean['Body weight (kg)'].str.replace('!', '.')
df_clean['Brain weight (g)'] = df_clean['Brain weight (g)'].str.replace('!', '.')

# test
df_clean.head()



## Cleaning Order
1. Address Missing Data
2. Address Tidiness Issues
3. Address Quality Issues

The very first thing to do before any cleaning occurs is to make a copy of each piece of data. All of the cleaning operations will be conducted on this copy so you can still view the original dirty and/or messy dataset later.

`df_clean = df.copy()` creates an independent copy of `df` in a separate memory location, so the cleaning can be performed on the copy without touching the original.
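
To make the difference between a copy and a second name for the same object concrete, here is a minimal sketch. The frame `df` and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'height': [66, 27]})

alias = df            # a second name for the same object, not a copy
df_clean = df.copy()  # an independent copy (deep by default)

# Clean the copy only
df_clean.loc[1, 'height'] = 72

print(df.loc[1, 'height'])        # the original still holds the dirty value
print(df_clean.loc[1, 'height'])  # the copy holds the cleaned value
print(alias is df)                # alias and df are the same object
```

Plain assignment (`alias = df`) would not protect the original, since any in-place change through `alias` would also change `df`.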

In [569]:
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()

### 1. Missing Data
Always address missing data first.


#### `treatments`: Missing records (280 instead of 350)

#### Define
Find the missing records and append to the treatments table.

#### Code

In [657]:
# find the extra treatments
df_extra = pd.read_csv('treatments_cut.csv')

In [106]:
df_extra.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,jožka,resanovič,22u - 30u,-,7.56,7.22,0.34
1,inunnguaq,heilmann,57u - 67u,-,7.85,7.45,
2,alwin,svensson,36u - 39u,-,7.78,7.34,
3,thể,lương,-,61u - 64u,7.64,7.22,0.92
4,amanda,ribeiro,36u - 44u,-,7.85,7.47,0.38


In [658]:
# Combine the two tables. DataFrame.append was removed in pandas 2.0, so use pd.concat.
# The original row labels are kept here; pass ignore_index=True to renumber them from 0.
treatments_clean = pd.concat([treatments_clean, df_extra])

#### Test

In [111]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [110]:
df_extra.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    70 non-null     object 
 1   surname       70 non-null     object 
 2   auralin       70 non-null     object 
 3   novodra       70 non-null     object 
 4   hba1c_start   70 non-null     float64
 5   hba1c_end     70 non-null     float64
 6   hba1c_change  42 non-null     float64
dtypes: float64(3), object(4)
memory usage: 4.0+ KB


In [113]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   auralin       350 non-null    object 
 3   novodra       350 non-null    object 
 4   hba1c_start   350 non-null    float64
 5   hba1c_end     350 non-null    float64
 6   hba1c_change  213 non-null    float64
dtypes: float64(3), object(4)
memory usage: 21.9+ KB


#### `treatments`: Missing HbA1c changes and Inaccurate HbA1c changes (leading 4s mistaken as 9s)

#### Define
Calculate the correct HbA1c change for all observations programmatically.

#### Code

In [659]:
treatments_clean.hba1c_change = treatments_clean.hba1c_start - treatments_clean.hba1c_end
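
Subtracting floats can leave tiny representation errors, so a recomputed change may not be exactly equal to a value such as 0.43. The following sketch, on three rows copied from the table above, rounds the result to two decimals and compares with a tolerance; the names `toy`, `expected` and `matches` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'hba1c_start': [7.63, 7.56, 7.68],
    'hba1c_end': [7.20, 7.09, 7.25],
})

# Recompute the change and round to the two decimals used in the source data
toy['hba1c_change'] = (toy.hba1c_start - toy.hba1c_end).round(2)

# Compare with a tolerance instead of == when testing float results
expected = np.array([0.43, 0.47, 0.43])
matches = np.isclose(toy.hba1c_change, expected)
```

Rounding is optional for the analysis itself, but it keeps the column consistent with the precision of `hba1c_start` and `hba1c_end`.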


#### Test

In [117]:
treatments_clean

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.20,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32
...,...,...,...,...,...,...,...
65,rovzan,kishiev,32u - 37u,-,7.75,7.41,0.34
66,jakob,jakobsen,-,28u - 26u,7.96,7.51,0.45
67,bernd,schneider,48u - 56u,-,7.74,7.44,0.30
68,berta,napolitani,-,42u - 44u,7.68,7.21,0.47


In [118]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   auralin       350 non-null    object 
 3   novodra       350 non-null    object 
 4   hba1c_start   350 non-null    float64
 5   hba1c_end     350 non-null    float64
 6   hba1c_change  350 non-null    float64
dtypes: float64(3), object(4)
memory usage: 21.9+ KB


### 2. Tidiness



#### Contact column in `patients` table contains two variables: phone number and email

In [119]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


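#### Define
Extract the phone number and the email address from `contact` into separate `phone_number` and `email` columns using regular expressions, then drop the `contact` column.

#### Code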

In [660]:
# Raw strings (r'...') keep the regex backslashes from being read as Python string escapes
patients_clean['phone_number'] = patients_clean.contact.str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)

# [a-zA-Z] to signify emails in this dataset all start and end with letters
patients_clean['email'] = patients_clean.contact.str.extract(r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])', expand=True)

# Note: axis=1 denotes that we are referring to a column, not a row.
# Run this cell only once: after contact is dropped, re-running it raises
# AttributeError: 'DataFrame' object has no attribute 'contact'
patients_clean = patients_clean.drop('contact', axis=1)
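
To see what the two patterns capture, here is a small sketch that applies them to three `contact` values taken from the `patients` output above. The Series `contact` and the variable names are assumptions for illustration; `expand=False` is used so that each result is a Series.

```python
import pandas as pd

# Three contact strings copied from the patients table shown earlier
contact = pd.Series([
    '951-719-9170ZoeWellish@superrito.com',
    'PamelaSHill@cuvox.de+1 (217) 569-3204',
    'ChidaluOnyekaozulu@jourrapide.com1 360 443 2060',
])

phone_pattern = r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})'
email_pattern = r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])'

# The phone and the email can appear in either order within the string
phone = contact.str.extract(phone_pattern, expand=False)
email = contact.str.extract(email_pattern, expand=False)
```

In the third row the country code `1` is not preceded by `+`, so the phone pattern skips it and captures `360 443 2060`, while the email pattern stops at the last letter of `com`.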

#### Test

In [319]:
# Confirm contact column is gone
list(patients_clean)

['patient_id',
 'assigned_sex',
 'given_name',
 'surname',
 'address',
 'city',
 'state',
 'zip_code',
 'country',
 'birthdate',
 'weight',
 'height',
 'bmi',
 'phone_number',
 'email']

In [320]:
patients_clean.phone_number.sample(25)

59          252 291 3898
353         478-676-7058
219                  NaN
377         217-485-5673
231         252 583 5410
278                  NaN
88          706-755-5723
432         979 203 0438
178         773-934-7423
383         208 657 2473
343         203-251-3573
223         208 897 3897
25          505-828-4955
288         831-427-4114
421         870-270-5502
459         619-710-6286
41          320-826-3340
408    +1 (989) 390-0285
309         256-872-9211
295         916 817 9960
124    +1 (508) 526-3432
384    +1 (605) 440-5492
200         856-655-5415
359         254-518-6365
289         504-441-7744
Name: phone_number, dtype: object

In [321]:
# Confirm that no emails start with an integer (regex didn't match for this)
patients_clean.email.sort_values().head()

404               AaliyahRice@dayrep.com
11          Abdul-NurMummarIsa@rhyta.com
332                AbelEfrem@fleckens.hu
258              AbelYonatan@teleworm.us
305    AddolorataLombardi@jourrapide.com
Name: email, dtype: object

#### Three variables in two columns in `treatments` table (treatment, start dose and end dose)

#### Define
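1. Create a `treatment` column from the `auralin` and `novodra` column names with the `pd.melt` function, keeping the dose strings in a `dose` column.
2. Drop the rows where `dose` is `-` (the treatment the patient did not receive).
3. Split `dose` into `start_dose` and `end_dose`, then drop `dose`.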
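
Before running this on the full table, the reshaping step can be sketched on a two-row toy frame. The frame `wide` reuses the first two treatment rows shown above; the name `tidy` is an assumption for illustration.

```python
import pandas as pd

wide = pd.DataFrame({
    'given_name': ['veronika', 'elliot'],
    'surname': ['jindrová', 'richardson'],
    'auralin': ['41u - 48u', '-'],
    'novodra': ['-', '40u - 45u'],
})

# The column headers auralin/novodra become values in a new 'treatment' column
tidy = pd.melt(wide, id_vars=['given_name', 'surname'],
               var_name='treatment', value_name='dose')

# Each patient now has two rows; keep only the treatment actually received
tidy = tidy[tidy.dose != '-'].reset_index(drop=True)
```

After the filter there is again one row per patient, now with the treatment name stored as data rather than as a column header.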

#### Code

In [121]:
treatments_clean.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,0.43
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [661]:
# create treatment column
test = pd.melt(treatments_clean, id_vars = ['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'], value_vars = ['auralin', 'novodra'], var_name = 'treatment', value_name = 'dose', ignore_index = False)
test.dtypes


given_name       object
surname          object
hba1c_start     float64
hba1c_end       float64
hba1c_change    float64
treatment        object
dose             object
dtype: object

In [662]:
# Melting gives two rows per patient; drop the row for the treatment not received (dose == '-')
test = test[test.dose != '-']

In [335]:
# test.head()
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 68
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   hba1c_start   350 non-null    float64
 3   hba1c_end     350 non-null    float64
 4   hba1c_change  350 non-null    float64
 5   treatment     350 non-null    object 
 6   dose          350 non-null    object 
dtypes: float64(3), object(4)
memory usage: 21.9+ KB


In [663]:
# Move columns
tret_col = test.pop('treatment') # Remove column and save in a variable

In [664]:
d_col = test.pop('dose')

In [494]:
test.head()

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,7.63,7.2,0.43
3,skye,gormanston,7.97,7.62,0.35
6,sophia,haugen,7.65,7.27,0.38
7,eddie,archer,7.89,7.55,0.34
9,asia,woźniak,7.76,7.37,0.39


In [665]:
# Insert columns in the right place
test.insert(2, 'treatment', tret_col)

In [496]:
test.head()

Unnamed: 0,given_name,surname,treatment,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,auralin,7.63,7.2,0.43
3,skye,gormanston,auralin,7.97,7.62,0.35
6,sophia,haugen,auralin,7.65,7.27,0.38
7,eddie,archer,auralin,7.89,7.55,0.34
9,asia,woźniak,auralin,7.76,7.37,0.39


In [666]:
test.insert(3, 'dose', d_col)

In [498]:
test.head()

Unnamed: 0,given_name,surname,treatment,dose,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,auralin,41u - 48u,7.63,7.2,0.43
3,skye,gormanston,auralin,33u - 36u,7.97,7.62,0.35
6,sophia,haugen,auralin,37u - 42u,7.65,7.27,0.38
7,eddie,archer,auralin,31u - 38u,7.89,7.55,0.34
9,asia,woźniak,auralin,30u - 36u,7.76,7.37,0.39


In [667]:
# Split start and end dose on ' - ' (including the spaces, so no whitespace is left over)
# and insert the two parts into the DataFrame
dose_parts = test.dose.str.split(' - ', expand=True)
test.insert(3, 'start_dose', dose_parts[0])
test.insert(4, 'end_dose', dose_parts[1])


In [346]:
test.head()

Unnamed: 0,given_name,surname,treatment,start_dose,end_dose,dose,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,auralin,41u,48u,41u - 48u,7.63,7.2,0.43
3,skye,gormanston,auralin,33u,36u,33u - 36u,7.97,7.62,0.35
6,sophia,haugen,auralin,37u,42u,37u - 42u,7.65,7.27,0.38
7,eddie,archer,auralin,31u,38u,31u - 38u,7.89,7.55,0.34
9,asia,woźniak,auralin,30u,36u,30u - 36u,7.76,7.37,0.39


In [668]:
# Remove dose
test.drop('dose', axis=1, inplace=True)

In [348]:
test.head()

Unnamed: 0,given_name,surname,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,auralin,41u,48u,7.63,7.2,0.43
3,skye,gormanston,auralin,33u,36u,7.97,7.62,0.35
6,sophia,haugen,auralin,37u,42u,7.65,7.27,0.38
7,eddie,archer,auralin,31u,38u,7.89,7.55,0.34
9,asia,woźniak,auralin,30u,36u,7.76,7.37,0.39


In [669]:
treatments_clean = test
treatments_clean

Unnamed: 0,given_name,surname,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change
0,,,auralin,,,7.63,7.20,0.43
1,,,auralin,,,7.97,7.62,0.35
2,,,auralin,,,7.65,7.27,0.38
3,,,auralin,,,7.89,7.55,0.34
4,,,auralin,,,7.76,7.37,0.39
...,...,...,...,...,...,...,...,...
58,christopher,woodward,novodra,55u,51u,7.51,7.06,0.45
60,maret,sultygov,novodra,26u,23u,7.67,7.30,0.37
64,lixue,hsueh,novodra,22u,23u,9.21,8.80,0.41
66,jakob,jakobsen,novodra,28u,26u,7.96,7.51,0.45


In [None]:
# They stop here... 
# I would like to convert start_dose and end_dose to integers...
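
One possible way to do that conversion is sketched below, assuming the dose strings look like `41u`, possibly with stray spaces. The toy frame and its values are for illustration only.

```python
import pandas as pd

# Toy frame with the same column names as treatments_clean; some values carry stray spaces
toy = pd.DataFrame({
    'start_dose': ['41u', '39u '],
    'end_dose': ['48u', ' 36u'],
})

for col in ['start_dose', 'end_dose']:
    # Trim whitespace, remove the trailing unit 'u', then cast the digits to int
    toy[col] = toy[col].str.strip().str.rstrip('u').astype(int)
```

If any value did not follow this format, `astype(int)` would raise a `ValueError`, which is a useful signal that the column needs another look.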

#### Test

In [353]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 68
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   treatment     350 non-null    object 
 3   start_dose    350 non-null    object 
 4   end_dose      350 non-null    object 
 5   hba1c_start   350 non-null    float64
 6   hba1c_end     350 non-null    float64
 7   hba1c_change  350 non-null    float64
dtypes: float64(3), object(5)
memory usage: 24.6+ KB


In [354]:
treatments_clean.head()

Unnamed: 0,given_name,surname,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,auralin,41u,48u,7.63,7.2,0.43
3,skye,gormanston,auralin,33u,36u,7.97,7.62,0.35
6,sophia,haugen,auralin,37u,42u,7.65,7.27,0.38
7,eddie,archer,auralin,31u,38u,7.89,7.55,0.34
9,asia,woźniak,auralin,30u,36u,7.76,7.37,0.39


#### Adverse reaction should be part of the `treatments` table

##### Define

In [356]:
adverse_reactions_clean.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


- Merge the adverse reactions table into the treatments table: add the adverse_reaction column to treatments with a left join on given_name and surname.

##### Code

In [670]:
treatments_clean = treatments_clean.merge(adverse_reactions_clean, how = 'left', on = ['given_name', 'surname'])


##### Test

In [361]:
treatments_clean

Unnamed: 0,given_name,surname,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change,adverse_reaction
0,veronika,jindrová,auralin,41u,48u,7.63,7.20,0.43,
1,skye,gormanston,auralin,33u,36u,7.97,7.62,0.35,
2,sophia,haugen,auralin,37u,42u,7.65,7.27,0.38,
3,eddie,archer,auralin,31u,38u,7.89,7.55,0.34,
4,asia,woźniak,auralin,30u,36u,7.76,7.37,0.39,
...,...,...,...,...,...,...,...,...,...
345,christopher,woodward,novodra,55u,51u,7.51,7.06,0.45,nausea
346,maret,sultygov,novodra,26u,23u,7.67,7.30,0.37,
347,lixue,hsueh,novodra,22u,23u,9.21,8.80,0.41,injection site discomfort
348,jakob,jakobsen,novodra,28u,26u,7.96,7.51,0.45,hypoglycemia


In [362]:
treatments_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 349
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   given_name        350 non-null    object 
 1   surname           350 non-null    object 
 2   treatment         350 non-null    object 
 3   start_dose        350 non-null    object 
 4   end_dose          350 non-null    object 
 5   hba1c_start       350 non-null    float64
 6   hba1c_end         350 non-null    float64
 7   hba1c_change      350 non-null    float64
 8   adverse_reaction  35 non-null     object 
dtypes: float64(3), object(6)
memory usage: 27.3+ KB


#### Given name and surname columns in `patients` table duplicated in `treatments` and `adverse_reactions` tables, and lowercase given names and surnames

##### Define
- Identify duplicates with .duplicated() and remove them with .drop_duplicates().
- Capitalize the first letter of given_name and surname.
- Add patient_id from the patients table and drop the name columns.

##### Code

In [671]:
# Remove duplicates
treatments_clean = treatments_clean.drop_duplicates()

In [672]:
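# Capitalize the first letter of given_name and surname to match the patients table.
# Caveat: str.capitalize() lowercases every other character, so spellings such as
# "de Keizer" or "O'Brian" in patients are not reproduced (they become "De keizer" and "O'brian").
treatments_clean.given_name = treatments_clean.given_name.str.capitalize()
treatments_clean.surname = treatments_clean.surname.str.capitalize()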


In [406]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,7/10/1976,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,4/3/1967,118.8,66,19.2,+1 (217) 569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,2/19/1980,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,7/26/1951,220.9,70,31.7,+1 (732) 636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,2/18/1928,192.3,27,26.1,334-515-7487,TimNeudorf@cuvox.de


In [673]:
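# Add the patient_id column. pd.merge defaults to an inner join,
# so treatment rows without a matching name in patients are dropped.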
id_names = patients_clean[['patient_id', 'given_name', 'surname']]
treatments_clean = pd.merge(treatments_clean, id_names, on = ['given_name', 'surname'])
treatments_clean = treatments_clean.drop(['given_name', 'surname'], axis = 1)



In [674]:
# Move patient_id to beginning of table
pat_id = treatments_clean.pop('patient_id')


In [675]:
treatments_clean.insert(0, 'patient_id', pat_id)
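
Because the join on names depends on exact spelling, it can help to check for unmatched rows. The sketch below is an alternative, not the method used above: it lowercases the join keys on the patients side and uses `indicator=True` to expose rows that found no match. The toy frames are assumptions for illustration; the names and patient ids come from the `patients` output, while the treatment values are made up.

```python
import pandas as pd

patients_toy = pd.DataFrame({
    'patient_id': [501, 1],
    'given_name': ['Jinke', 'Zoe'],
    'surname': ['de Keizer', 'Wellish'],
})
treatments_toy = pd.DataFrame({
    'given_name': ['jinke', 'zoe'],
    'surname': ['de keizer', 'wellish'],
    'treatment': ['auralin', 'novodra'],  # made-up values
})

# Build lowercase join keys on the patients side instead of re-capitalizing treatments
id_names = patients_toy[['patient_id', 'given_name', 'surname']].copy()
id_names['given_name'] = id_names.given_name.str.lower()
id_names['surname'] = id_names.surname.str.lower()

# how='left' keeps every treatment row; indicator=True adds a '_merge' column
merged = treatments_toy.merge(id_names, on=['given_name', 'surname'],
                              how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

# For comparison: capitalize() does not restore the original spelling
capitalized = treatments_toy.surname.str.capitalize()
```

An empty `unmatched` frame means every treatment row received a `patient_id`; with an inner join such rows would disappear silently.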

##### Test

In [389]:
treatments_clean.duplicated().sum()

0

In [428]:
treatments_clean.head()

Unnamed: 0,patient_id,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change,adverse_reaction
0,225,auralin,41u,48u,7.63,7.2,0.43,
1,242,auralin,33u,36u,7.97,7.62,0.35,
2,345,auralin,37u,42u,7.65,7.27,0.38,
3,276,auralin,31u,38u,7.89,7.55,0.34,
4,15,auralin,30u,36u,7.76,7.37,0.39,


In [510]:
# Patient ID should be the only duplicate column
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))
all_columns[all_columns.duplicated()]

15    patient_id
dtype: object

### 3. Quality

#### Zip code is a float not a string, and zip code sometimes has four digits

##### Define
- Convert zip_code to a string, remove the trailing '.0', and left-pad with zeros to five digits.

##### Code

In [511]:
# 12 na values
patients_clean.zip_code.isna().sum()

12

In [520]:
patients_clean.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,923,United States,7/10/1976,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,618,United States,4/3/1967,118.8,66,19.2,+1 (217) 569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,684,United States,2/19/1980,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,70,United States,7/26/1951,220.9,70,31.7,+1 (732) 636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,363,United States,2/18/1928,192.3,27,26.1,334-515-7487,TimNeudorf@cuvox.de


In [None]:
# Convert zip code float to str

In [676]:
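# Convert the float to str, strip the trailing '.0', and left-pad with zeros to 5 characters.
# Run once only: re-running strips two more characters each time (e.g. '92390' -> '00923').
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')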
patients_clean


Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,CA,00923,United States,1976-07-10,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,IL,00618,United States,1967-04-03,118.8,66,19.2,+1 (217) 569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,NE,00684,United States,1980-02-19,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,00070,United States,1951-07-26,220.9,70,31.7,+1 (732) 636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,00363,United States,1928-02-18,192.3,72,26.1,334-515-7487,TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,00038,United States,1959-04-10,181.1,72,24.6,207-477-0579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,00863,United States,1948-03-26,239.6,70,34.4,928-284-4492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,00641,United States,1971-01-13,171.2,67,26.8,816-223-6007,JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,00981,United States,1952-02-13,176.9,67,27.7,360 443 2060,ChidaluOnyekaozulu@jourrapide.com


In [677]:
# Restore the NaN entries that the code above turned into '0000n'
patients_clean.zip_code = patients_clean.zip_code.replace('0000n', np.nan)

In [591]:
patients_clean.zip_code.head() # expect 5-digit strings; re-running the conversion cell strips 2 more digits


0    92390
1    61812
2    68467
3    07095
4    36303
Name: zip_code, dtype: object
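
The float-to-string conversion above can be traced on a few toy values. This is a minimal sketch rather than the notebook's data: the `zips` Series is hypothetical, and whether a NaN is stringified to `'nan'` (and then becomes `'0000n'`) or simply stays missing is assumed to depend on the pandas version, so only the final result is checked.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two float zip codes and one missing value
zips = pd.Series([92390.0, 7095.0, np.nan])

# '92390.0' -> '92390'; '7095.0' -> '7095' -> '07095'
padded = zips.astype(str).str[:-2].str.pad(5, fillchar='0')

# A stringified NaN ('nan' -> 'n' -> '0000n') is turned back into NaN
restored = padded.replace('0000n', np.nan)
print(restored.tolist())
```

Because `.str[:-2]` drops the last two characters unconditionally, applying it to an already converted column removes real digits (`'92390'` becomes `'923'`, then `'00923'`), so the conversion should run only once.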

#### Tim Neudorf height is 27 in instead of 72 in

##### Define
- Replace Tim Neudorf's height of 27 in with 72 in

##### Code

In [678]:
# Replace the height
patients_clean.height = patients_clean.height.replace(27, 72)

##### Test

In [546]:
(patients_clean.height == 27).sum()

0
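
One way to sanity-check that 72 in is the right correction is to recompute BMI from weight and height. The sketch below uses the standard imperial BMI formula and the weight (192.3 lb) and BMI (26.1) recorded for Tim Neudorf in the table; the helper name `bmi_imperial` is made up for this illustration.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from pounds and inches (standard imperial formula)."""
    return 703 * weight_lb / height_in ** 2

recorded_bmi = 26.1  # Tim Neudorf's bmi value in the patients table

# Compare the BMI implied by each candidate height with the recorded one
bmi_if_72 = round(bmi_imperial(192.3, 72), 1)
bmi_if_27 = round(bmi_imperial(192.3, 27), 1)
print(bmi_if_72, bmi_if_27)
```

The recorded BMI agrees with a height of 72 in, while 27 in would imply a BMI far above 100, which supports treating 27 as a digit transposition.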

#### Full state names sometimes, abbreviations other times

##### Define

- Convert full state names to their two-letter abbreviations

##### Code

In [679]:
# Change state names
state_abbrev = {'California': 'CA', 'New York': 'NY', 'Illinois': 'IL',
                'Florida': 'FL', 'Nebraska': 'NE'}
patients_clean.state = patients_clean.state.replace(state_abbrev)


##### Test

In [596]:
patients_clean.state.value_counts()

CA    60
NY    47
TX    32
IL    24
FL    22
MA    22
PA    18
GA    15
OH    14
LA    13
MI    13
OK    13
NJ    12
VA    11
MS    10
WI    10
AL     9
IN     9
MN     9
TN     9
NC     8
WA     8
KY     8
MO     7
NV     6
NE     6
ID     6
KS     6
CT     5
IA     5
SC     5
RI     4
ND     4
AR     4
AZ     4
CO     4
ME     4
OR     3
MD     3
SD     3
DE     3
WV     3
VT     2
DC     2
MT     2
NH     1
WY     1
NM     1
AK     1
Name: state, dtype: int64
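
Reading through `value_counts()` works for about 50 values, but the same test can be automated. A minimal sketch, assuming a hypothetical `states` Series in which one full name was missed:

```python
import pandas as pd

# Hypothetical sample mixing cleaned and uncleaned values
states = pd.Series(['CA', 'NY', 'Nebraska', 'NJ'])

# Anything that is not exactly two uppercase letters still needs cleaning
is_abbrev = states.str.fullmatch(r'[A-Z]{2}')
leftovers = states[~is_abbrev]
print(leftovers.tolist())
```

An empty `leftovers` would indicate that every state value is already a two-letter code.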

#### Typo in given name: Dsvid Gustafsson

##### Define

- Correct Dsvid to David

##### Code

In [680]:
id_names.given_name = id_names.given_name.replace('Dsvid', 'David')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [681]:
patients_clean.given_name = patients_clean.given_name.replace('Dsvid', 'David')

##### Test

In [610]:
(patients_clean.given_name  == 'David').sum()

3

In [611]:
(patients_clean.given_name  == 'Dsvid').sum()

0

In [612]:
(id_names.given_name  == 'David').sum()

3

In [613]:
(id_names.given_name  == 'Dsvid').sum()

0
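
The warning printed in the Code step usually means that pandas cannot tell whether `id_names` is a view or a copy of another frame. The sketch below assumes `id_names` was built by selecting columns from `patients` (that step is not shown in this section); the small frame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in for the patients table
patients = pd.DataFrame({
    'patient_id': [1, 2],
    'given_name': ['Dsvid', 'Zoe'],
    'surname': ['Gustafsson', 'Wellish'],
})

# .copy() makes id_names an independent frame, so assigning to it is unambiguous
id_names = patients[['patient_id', 'given_name', 'surname']].copy()
id_names['given_name'] = id_names['given_name'].replace('Dsvid', 'David')
print(id_names.given_name.tolist())
```

With an explicit copy, the fix applies to `id_names` only, which is why the notebook repeats the replacement on `patients_clean` as well.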

#### Erroneous datatypes (assigned sex, state, zip_code, and birthdate columns), erroneous datatypes (auralin and novodra columns), and the letter 'u' in starting and ending doses for Auralin and Novodra

##### Define

- Assigned sex --> Categorical
- State --> Categorical
- Zip_code --> Categorical
- Birthdate --> Datetime
- Start dose and end dose --> remove the trailing 'u', then Integer

In [698]:
patients_clean.dtypes

patient_id               int64
assigned_sex          category
given_name              object
surname                 object
address                 object
city                    object
state                 category
zip_code              category
country                 object
birthdate       datetime64[ns]
weight                 float64
height                   int64
bmi                    float64
phone_number            object
email                   object
dtype: object

In [682]:
# Assigned sex
patients_clean.assigned_sex = patients_clean.assigned_sex.astype('category')


In [683]:
# State
patients_clean.state = patients_clean.state.astype('category')


In [684]:
# zip_code
patients_clean.zip_code = patients_clean.zip_code.astype('category')


In [685]:
# Birthdate
patients_clean.birthdate = pd.to_datetime(patients_clean.birthdate)
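
The conversions above can be tried on a couple of values first. A minimal sketch, assuming birthdates follow the M/D/YYYY style seen in the raw table; the two small Series are illustrative, not the real columns.

```python
import pandas as pd

# Birthdates in the M/D/YYYY style shown in the raw patients table
raw = pd.Series(['7/10/1976', '2/18/1928'])
births = pd.to_datetime(raw, format='%m/%d/%Y')

# Categorical conversion for a low-cardinality column
sex = pd.Series(['female', 'male', 'male']).astype('category')
print(births.dt.year.tolist(), list(sex.cat.categories))
```

Passing an explicit `format` is optional, but it makes the parse fail loudly if a date in a different layout slips in.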



In [689]:
# Remove 'u'
# treatments_clean.start_dose = treatments_clean.start_dose.str[:-2]
# treatments_clean.end_dose = treatments_clean.end_dose.str[:-1]


In [696]:
treatments_clean.head()

Unnamed: 0,patient_id,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change,adverse_reaction
0,468,auralin,22,30,7.56,7.22,0.34,
1,107,auralin,57,67,7.85,7.45,0.4,
2,438,auralin,36,39,7.78,7.34,0.44,
3,144,auralin,36,44,7.85,7.47,0.38,
4,260,auralin,30,35,7.53,7.12,0.41,


In [701]:
# start_dose end_dose
treatments_clean.start_dose = treatments_clean.start_dose.astype(int)
treatments_clean.end_dose = treatments_clean.end_dose.astype(int)


In [702]:
treatments_clean.dtypes

patient_id            int64
treatment            object
start_dose            int64
end_dose              int64
hba1c_start         float64
hba1c_end           float64
hba1c_change        float64
adverse_reaction     object
dtype: object
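
Removing the unit marker and casting can also be done without fixed-width slicing. The sketch below is an alternative to the commented-out `.str[:-2]` / `.str[:-1]` lines, not the notebook's method; the dose strings, including their stray whitespace, are hypothetical.

```python
import pandas as pd

# Hypothetical dose strings with a trailing unit marker and stray whitespace
doses = pd.DataFrame({'start_dose': ['41u ', '33u '],
                      'end_dose': [' 48u', ' 36u']})

for col in ['start_dose', 'end_dose']:
    # Strip whitespace, drop the trailing 'u', then cast to integer
    doses[col] = doses[col].str.strip().str.rstrip('u').astype(int)

print(doses)
```

Stripping by character rather than by position gives the same result whether or not a value carries surrounding whitespace.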

#### Multiple phone number formats


##### Define

Replace all non-numeric characters with ''. Left-pad 10-digit numbers with 1 as the country code.

In [705]:
patients_clean.phone_number.sample(25)

26          309-671-8852
186    +1 (401) 485-6384
440    +1 (573) 493-4748
107         240-322-1398
476         267-972-3749
487         785 229 1188
168         650-849-6900
267    +1 (785) 823-6728
392         630-252-5095
29     +1 (845) 858-7707
422         641-475-9654
377         217-485-5673
128         317 292 2394
310         913 322 9114
402         601-885-6550
44          207 861 4587
348    +1 (608) 527-1021
175         217 491 5261
96          207-768-0477
127         360-482-2553
328    +1 (612) 342-6065
293         408-834-4087
367         775-358-9076
253         321-287-0484
183         909-355-9418
Name: phone_number, dtype: object

##### Code

In [714]:
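# regex=True so that r'\D+' is treated as a pattern (runs of non-digits), not as literal text
patients_clean.phone_number = patients_clean.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')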


##### Test

In [715]:
patients_clean.phone_number

0      19517199170
1      12175693204
2      14023636804
3      17326368246
4      13345157487
          ...     
498    12074770579
499    19282844492
500    18162236007
501    13604432060
502    14028484923
Name: phone_number, Length: 503, dtype: object
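
The three formats seen in the sample can be run through the same two steps in isolation. A minimal sketch on hand-picked values from the sample output above:

```python
import pandas as pd

# One value for each format seen in the sample
phones = pd.Series(['951-719-9170', '+1 (217) 569-3204', '785 229 1188'])

# Keep digits only, then left-pad 10-digit numbers with the country code '1'
digits = phones.str.replace(r'\D+', '', regex=True)
normalized = digits.str.pad(11, fillchar='1')
print(normalized.tolist())
```

Padding acts as a country-code prefix only because every number here has 10 or 11 digits; a shorter number would receive several leading 1s.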

#### Default John Doe data

##### Define

Check whether the John Doe records are duplicates. If they are, keep only the first; if not, replace them with the proper names.

In [716]:
patients_clean[patients_clean.surname == 'Doe']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
215,216,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
229,230,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
237,238,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
244,245,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
251,252,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com
277,278,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com


In [None]:
id_names_clean = id_names.copy()

In [None]:
id_names_clean[id_names_clean.surname == 'Doe']

##### Code

In [723]:
patients_clean = patients_clean.drop(patients_clean[patients_clean.surname == 'Doe'].index[1:])

In [735]:
id_names_clean = id_names_clean.drop(id_names_clean[id_names_clean.surname == 'Doe'].index[1:])

##### Test

In [724]:
patients_clean[patients_clean.surname == 'Doe']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
215,216,male,John,Doe,123 Main Street,New York,NY,123,United States,1975-01-01,180.0,72,24.4,11234567890,johndoe@email.com


In [736]:
id_names_clean[id_names_clean.surname == 'Doe']

Unnamed: 0,patient_id,given_name,surname
215,216,John,Doe


In [731]:
(treatments_clean.patient_id == 237).sum()

0
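
When confirming that no treatment rows point at a removed record, the comparison has to use `patient_id` values rather than row index labels, which are off by one in this table (index 237 holds patient 238). A minimal sketch with hypothetical miniature tables:

```python
import pandas as pd

# Hypothetical miniature tables
patients = pd.DataFrame({
    'patient_id': [215, 216, 230, 238],
    'surname': ['Smith', 'Doe', 'Doe', 'Doe'],
})
treatments = pd.DataFrame({'patient_id': [215, 468]})

# Ids of every 'Doe' row except the first, which is the one kept
doe_ids = patients.loc[patients.surname == 'Doe', 'patient_id']
dropped_ids = doe_ids.iloc[1:]

# Count treatment rows that reference a dropped patient id
orphaned = treatments.patient_id.isin(dropped_ids).sum()
print(dropped_ids.tolist(), orphaned)
```

Checking all dropped ids at once with `isin` avoids testing them one by one.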

#### Multiple records for Jakobsen, Gersten, Taylor


##### Define

Remove the duplicate records entered under nicknames: Jake Jakobsen (id 30), Pat Gersten (id 503), and Sandy Taylor (id 283).

In [740]:
id_names_clean[id_names_clean.surname == 'Jakobsen'] # no duplicates

Unnamed: 0,patient_id,given_name,surname
24,25,Jakob,Jakobsen
29,30,Jake,Jakobsen
432,433,Karen,Jakobsen


In [748]:
id_names_clean[id_names_clean.surname == 'Gersten'] # no duplicates


Unnamed: 0,patient_id,given_name,surname
97,98,Patrick,Gersten
502,503,Pat,Gersten


In [750]:
id_names_clean[id_names_clean.surname == 'Taylor'] # no duplicates


Unnamed: 0,patient_id,given_name,surname
131,132,Sandra,Taylor
282,283,Sandy,Taylor
426,427,Rogelio,Taylor


In [784]:
patients_clean[patients_clean.surname == 'Jakobsen']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,127,United States,1985-08-01,155.8,67,24.4,18458587707,JakobCJakobsen@einrot.com
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,NY,127,United States,1985-08-01,155.8,67,24.4,18458587707,JakobCJakobsen@einrot.com
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,770,United States,1962-11-25,185.2,67,29.0,19792030438,KarenJakobsen@jourrapide.com


In [749]:
patients_clean[patients_clean.surname == 'Gersten']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,683,United States,1954-05-03,138.2,71,19.3,14028484923,PatrickGersten@rhyta.com
502,503,male,Pat,Gersten,2778 North Avenue,Burr,NE,683,United States,1954-05-03,138.2,71,19.3,14028484923,PatrickGersten@rhyta.com


In [751]:
patients_clean[patients_clean.surname == 'Taylor']


Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,259,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
282,283,female,Sandy,Taylor,2476 Fulton Street,Rainelle,WV,259,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,331,United States,1992-09-02,186.6,69,27.6,13054346299,RogelioJTaylor@teleworm.us


In [761]:
treatments_clean[treatments_clean.patient_id == 25] # keep

Unnamed: 0,patient_id,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change,adverse_reaction
67,25,novodra,28,26,7.96,7.51,0.45,hypoglycemia


In [765]:
treatments_clean[treatments_clean.patient_id == 503] # ?

Unnamed: 0,patient_id,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change,adverse_reaction


In [770]:
treatments_clean[treatments_clean.patient_id == 426] # ?

Unnamed: 0,patient_id,treatment,start_dose,end_dose,hba1c_start,hba1c_end,hba1c_change,adverse_reaction


##### Code

In [789]:
# tilde means not: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
patients_clean = patients_clean[~((patients_clean.address.duplicated()) & patients_clean.address.notnull())]

##### Test

In [791]:
patients_clean[patients_clean.surname == 'Jakobsen']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,NY,127,United States,1985-08-01,155.8,67,24.4,18458587707,JakobCJakobsen@einrot.com
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,770,United States,1962-11-25,185.2,67,29.0,19792030438,KarenJakobsen@jourrapide.com


In [792]:
patients_clean[patients_clean.surname == 'Gersten']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
97,98,male,Patrick,Gersten,2778 North Avenue,Burr,NE,683,United States,1954-05-03,138.2,71,19.3,14028484923,PatrickGersten@rhyta.com


In [793]:
patients_clean[patients_clean.surname == 'Taylor']

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone_number,email
131,132,female,Sandra,Taylor,2476 Fulton Street,Rainelle,WV,259,United States,1960-10-23,206.1,64,35.4,13044382648,SandraCTaylor@dayrep.com
426,427,male,Rogelio,Taylor,4064 Marigold Lane,Miami,FL,331,United States,1992-09-02,186.6,69,27.6,13054346299,RogelioJTaylor@teleworm.us
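
The filter in the Code step combines two conditions, and the `notnull()` part matters because `duplicated()` also flags repeated missing values. A minimal sketch on hypothetical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one nickname duplicate and two patients with no address
df = pd.DataFrame({
    'given_name': ['Jakob', 'Jake', 'Ann', 'Bob'],
    'address': ['648 Old Dear Lane', '648 Old Dear Lane', np.nan, np.nan],
})

# duplicated() flags second and later occurrences, including repeated NaN
is_repeat = df.address.duplicated()

# Requiring a non-null address keeps rows whose address is simply missing
kept = df[~(is_repeat & df.address.notnull())]
print(kept.given_name.tolist())
```

Without the `notnull()` condition, the second patient with a missing address would be dropped as if it were a duplicate.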


#### kgs instead of lbs for Zaitseva weight

##### Define
Convert Zaitseva's weight from kg to lbs.

##### Code

In [None]:
mask = patients_clean.surname == 'Zaitseva'  # select the row, then assign via .loc to avoid chained assignment
patients_clean.loc[mask, 'weight'] = patients_clean.loc[mask, 'weight'] * 2.2


In [None]:
'''
Given solution
weight_kg = patients_clean.weight.min()
mask = patients_clean.surname == 'Zaitseva'
column_name = 'weight'
patients_clean.loc[mask, column_name] = weight_kg * 2.20462
'''

##### Test

In [803]:
patients_clean.weight[patients_clean.surname == 'Zaitseva']

210    107.36
Name: weight, dtype: float64

In [804]:
patients_clean.weight.sort_values()

459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
       ...  
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 495, dtype: float64
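
The tested value 107.36 comes from the rounded factor 2.2; the given solution's factor 2.20462 gives a slightly higher weight. A minimal sketch on a hypothetical two-row frame, where 48.8 kg is the starting value implied by 107.36 / 2.2:

```python
import pandas as pd

# Hypothetical two-row frame; 48.8 kg is the value implied by 107.36 / 2.2
df = pd.DataFrame({'surname': ['Zaitseva', 'Wellish'],
                   'weight': [48.8, 121.7]})

mask = df.surname == 'Zaitseva'
# .loc assigns on the frame itself instead of a possible temporary copy
df.loc[mask, 'weight'] = df.loc[mask, 'weight'] * 2.20462
print(df.weight.round(2).tolist())
```

The two factors differ by about 0.2 lb here, which is small but visible at the one-decimal precision used in the `weight` column.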