<h2 ><i>Step 2: Data Wrangling/Munging</i></h2>

<p style = "margin-left: 25px; font-size: 15px">Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from a "raw" form into another format so that it is suitable for downstream purposes such as analytics.</p>

<ol style = "list-style-type: lower-alpha; font-size: 15px; font-weight: bold; font-style: oblique; line-height: 2">
    <li>Gathering Data</li>
    <li>Assessing Data</li>
    <li>Cleaning Data</li>
</ol>

<hr style = "border: 1px solid;">

***2b: Assessing Data***
<p style = "margin-left: 25px; font-size: 15px">In this step, you get to know the data more deeply. Before writing any cleaning code, you need a clear picture of what is wrong with it.</p>

<h2 ><i>Types of Unclean Data</i></h2>
<ul style = "font-size: 15px; line-height: 25px">
    <li> <span style = "font-weight: bold; font-style: oblique;">Dirty Data</span> (Data with quality issues): Dirty data is also known as low-quality data. Low-quality data has content issues.</li>
    <ul>
        <li>Duplicated Data</li>
        <li>Missing Data</li>
        <li>Corrupt Data</li>
        <li>Inaccurate Data</li>
        <li style = "list-style: none; padding-top: 10px">
            <p style = "font-weight: bold; font-style: oblique;">Data Quality Dimensions</p>
            <ul>
                <li>Completeness: is data missing or not?</li>
                <li>Validity: is data valid? (e.g., duplicate patient_id or negative height)</li>
                <li>Accuracy: data is valid but not accurate (e.g., an adult person's weight = 1 kg)</li>
                <li>Consistency: data is both valid and accurate but written differently (e.g., New Jersey and NJ)</li>
            </ul>
        </li>
    </ul>
    <br>
    <li> <span style = "font-weight: bold; font-style: oblique;">Messy Data</span> (Data with tidiness issues): Messy data, also known as untidy data. Untidy data has structural issues.<br> <span style = "font-weight: bold; font-style : oblique">Tidy data has the following properties:</span></li>
    <ul>
        <li>Each variable forms a column</li>
        <li>Each observation forms a row</li>
        <li>Each type of observational unit forms a table</li>
    </ul>
</ul>
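
<p style = "font-size: 15px; line-height: 25px">The four data quality dimensions above can each be probed with a short pandas check. The following is a minimal sketch on a hypothetical toy table: the column names, the values and the 30 kg plausibility threshold are assumptions made for illustration and are not part of the clinical trial dataset.</p>

```python
import pandas as pd

# Hypothetical toy table; the values are invented to trigger each kind of issue.
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "state": ["New Jersey", "NJ", "NY", None],
    "height": [66, -5, 70, 64],
    "weight_kg": [70.0, 82.5, 1.0, 58.0],
})

# Completeness: how many values are missing in each column?
missing_per_column = toy.isnull().sum()

# Validity: values that break a rule (repeated id, negative height).
duplicate_ids = toy["patient_id"].duplicated().sum()
negative_heights = (toy["height"] < 0).sum()

# Accuracy: valid but implausible values (an adult weighing 1 kg).
# The 30 kg cut-off is an arbitrary threshold chosen for this sketch.
implausible_weights = (toy["weight_kg"] < 30).sum()

# Consistency: the same thing written differently (full name vs abbreviation).
# Mixed string lengths hint that both styles are present.
state_name_lengths = toy["state"].str.len().value_counts()

print(missing_per_column)
print(duplicate_ids, negative_heights, implausible_weights)
print(state_name_lengths)
```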

<hr style = "border: 1px solid;">

<h2 ><i>Order of severity</i></h2>
<div style = "font-size: 20px; font-weight: bold; font-style: oblique;">
    Completeness > Validity > Accuracy > Consistency
</div>

<h2 style = "text-decoration: underline">Process of Data Wrangling/Munging</h2>

<h3><i>Step 1: Write a summary of your data</i></h3>
<hr style = "border: 1px solid;">
<h3><i>Step 2: Write column descriptions</i></h3>
<hr style = "border: 1px solid;">
<h3><i>Step 3: Add any additional information</i></h3>
<hr style = "border: 1px solid;">

<h3><i>Step 4: Data Assessment</i></h3>
<div style = "margin-left: 64px">
<h3><i>Types of Assessment</i></h3><br>
<div style = "font-size: 15px;">
<p>There are two types of assessment style:</p>
<ul style = "font-size: 15px; line-height: 25px">
    <li><span style = "font-weight: bold; font-style: oblique">Manual:</span> Looking through the data manually in Excel or another data viewing tool</li>
    <li><span style = "font-weight: bold; font-style: oblique">Programmatic:</span> Using pandas functions such as</li>
        <ul>
            <li>head() or tail()</li>
            <li>sample()</li>
            <li>info()</li>
            <li>isnull()</li>
            <li>duplicated()</li>
            <li>describe()</li>
        </ul>
</ul>
</div>

<div>
    <h3 style = "font-weight: bold; font-style: oblique;">Steps in Assessment</h3>
    <p style = "font-size: 15px;">There are two steps involved in assessment:</p>
    <ul style = "font-size: 15px; line-height: 25px">
        <li>Discover</li>
        <li>Document</li>
    </ul>
</div>
</div>
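
<p style = "font-size: 15px; line-height: 25px">As a quick illustration of the programmatic assessment functions listed above, here is a minimal sketch on a hypothetical toy table. The column names and values are invented for this example; only the pandas calls themselves are the ones named in the list.</p>

```python
import pandas as pd

# Hypothetical toy table standing in for a real dataset.
df = pd.DataFrame({
    "given_name": ["Ann", "Bob", "Bob", "Cy"],
    "surname": ["Lee", "Ray", "Ray", "Orr"],
    "weight": [150.2, 180.0, 180.0, None],
})

print(df.head(2))                      # first rows
print(df.tail(2))                      # last rows
print(df.sample(2, random_state=0))    # random rows, seeded for repeatability
df.info()                              # dtypes and non-null counts (prints directly)

null_counts = df.isnull().sum()        # missing values per column
n_duplicates = df.duplicated().sum()   # rows identical to an earlier row
summary = df.describe()                # statistics for the numeric columns

print(null_counts, n_duplicates, summary, sep="\n")
```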

<hr style = "border: 1px solid;">

<h3><i>Step 5: Data Cleaning</i></h3>
<div style = "margin-left: 64px">
<h3><i>Data Cleaning Order</i></h3>
<ol style = "font-size: 15px; line-height: 25px">
    <li>Quality => Completeness</li>
    <li>Tidiness</li>
    <li>Quality => Validity</li>
    <li>Quality => Accuracy</li>
    <li>Quality => Consistency</li>
</ol>

<div>
    <h3 style = "font-weight: bold; font-style: oblique;">Steps involved in Data Cleaning</h3>
    <ul style = "font-size: 15px; line-height: 25px">
        <li>Define</li>
        <li>Code</li>
        <li>Test</li>
    </ul>
</div>

<p style = "font-size: 20px">Note: Always create a copy of your dataset(s) before you start the cleaning process.</p>
</div>
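
<p style = "font-size: 15px; line-height: 25px">The Define, Code, Test cycle and the advice to work on a copy can be sketched as follows. The table and the cleaning rule are hypothetical and only illustrate the pattern.</p>

```python
import pandas as pd

# Hypothetical raw table; names are invented for this sketch.
raw = pd.DataFrame({"given_name": ["ann", "bob"], "height": [66, 70]})

# Work on a copy so the original stays available for comparison.
clean = raw.copy()

# Define: capitalise the first letter of given_name.
# Code:
clean["given_name"] = clean["given_name"].str.title()

# Test: the cleaned column is capitalised and the original is untouched.
assert clean["given_name"].tolist() == ["Ann", "Bob"]
assert raw["given_name"].tolist() == ["ann", "bob"]
```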

In [1]:
import pandas as pd
import numpy as np

In [2]:
patients = pd.read_csv("C:/Users/manth/OneDrive/Desktop/Coding/Python/Juypter Python/Teknowell/Datasets/patients.csv")
treatments = pd.read_csv("C:/Users/manth/OneDrive/Desktop/Coding/Python/Juypter Python/Teknowell/Datasets/treatments.csv")
treatments_cut = pd.read_csv("C:/Users/manth/OneDrive/Desktop/Coding/Python/Juypter Python/Teknowell/Datasets/treatments_cut.csv")
adverse_reactions = pd.read_csv("C:/Users/manth/OneDrive/Desktop/Coding/Python/Juypter Python/Teknowell/Datasets/adverse_reactions.csv")

### Step 1: Write a summary of your data

<p style = "font-size: 15px; line-height: 25px">This is a dataset about 500 patients, of which 350 participated in a clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.</p>

<p style = "font-size: 15px; line-height: 25px">All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After 4 weeks, which isn't enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:</p>
<ul style = "font-size: 15px; line-height: 25px">
    <li>175 patients switched to Auralin for 24 weeks</li>
    <li>175 patients continued using Novodra for 24 weeks</li>
</ul>
<p style = "font-size: 15px; line-height: 25px">Data about patients feeling adverse effects is also recorded.</p>

### Step 2: Write column descriptions

<h3 style = "color: orange; font-style: oblique">Table: patients</h3>

<ul style = "font-size: 15px; line-height: 25px">
    <li><span style = "font-weight: bold">patient_id:</span> the unique identifier for each patient in the Master Patient Index (i.e., patient database) of the pharmaceutical company that is producing Auralin</li>
     <li><span style = "font-weight: bold">assigned_sex:</span> the assigned sex of patient at birth (Male/ Female)</li>
    <li><span style = "font-weight: bold">given_name:</span> the given name (i.e., first name) of each patient</li>
    <li><span style = "font-weight: bold">surname:</span> the surname (i.e., last name) of each patient</li>
    <li><span style = "font-weight: bold">address:</span> the main address for each patient</li>
    <li><span style = "font-weight: bold">city:</span> the corresponding city for main address of each patient</li>
    <li><span style = "font-weight: bold">state:</span> the corresponding state for main address of each patient</li>
    <li><span style = "font-weight: bold">zip_code:</span> the corresponding zip code for main address of each patient</li>
    <li><span style = "font-weight: bold">country:</span> the corresponding country for main address of each patient (all United States for this clinical trial)</li>
    <li><span style = "font-weight: bold">contact:</span> phone number and email information for each patient</li>
    <li><span style = "font-weight: bold">birthdate:</span> the date of birth of each patient (month/day/year). The inclusion criterion for this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing problem among the elderly population)</li>
    <li><span style = "font-weight: bold">weight:</span> the weight of each patient in pounds (lbs)</li>
    <li><span style = "font-weight: bold">height:</span> the height of each patient in inches (in)</li>
    <li><span style = "font-weight: bold">BMI:</span> the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup>, where kg is the person's weight in kilograms and m<sup>2</sup> is their height in meters squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. The inclusion criterion for this clinical trial is 16 &lt;= BMI &lt;= 38</li>
</ul>
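
<p style = "font-size: 15px; line-height: 25px">Because weight is stored in pounds and height in inches, the BMI formula needs a unit conversion first. The sketch below works through it for one patient (121.7 lbs and 66 inches, the first row of the patients table, whose recorded bmi is 19.6). The helper names are hypothetical, and the inclusion check reads the criterion as 16 &lt;= BMI &lt;= 38, which is an interpretation consistent with the bmi range reported by <code>describe()</code> (17.1 to 37.7).</p>

```python
LB_TO_KG = 0.45359237   # kilograms per pound (exact by definition)
IN_TO_M = 0.0254        # meters per inch (exact by definition)

def bmi_from_imperial(weight_lb, height_in):
    """Return BMI (kg/m^2) from a weight in pounds and a height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2

def meets_bmi_criterion(bmi, low=16, high=38):
    """True when the BMI lies inside the assumed inclusion range."""
    return low <= bmi <= high

example_bmi = bmi_from_imperial(121.7, 66)
print(round(example_bmi, 1), meets_bmi_criterion(example_bmi))  # 19.6 True
```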

<h3 style = "color: orange; font-style: oblique">Table: treatments and treatments_cut</h3>

<ul style = "font-size: 15px; line-height: 25px">
    <li><span style = "font-weight: bold">given_name:</span> the given name of each patient in the Master Patient index that took part in clinical trial</li>
    <li><span style = "font-weight: bold">surname:</span> the surname of each patient in the Master Patient index that took part in clinical trial</li>
    <li><span style = "font-weight: bold">auralin:</span> the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) and the ending median daily dose of insulin at the end of the 24 weeks of treatment, measured over the 24th week of treatment (the number after the dash). Both are measured in units (short form 'u'), which is the international unit of measurement and the standard measurement for insulin</li>
    <li><span style = "font-weight: bold">novodra:</span> same as above, except for patients that continued treatment with Novodra</li>
    <li><span style = "font-weight: bold">hba1c_start:</span> the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The HbA1c test measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %</li>
    <li><span style = "font-weight: bold">hba1c_end:</span> the patient's HbA1c level at the end of the last week of treatment</li>
    <li><span style = "font-weight: bold">hba1c_change:</span> the change in the patient's HbA1c level from the start of treatment to the end, i.e., hba1c_start - hba1c_end. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c change for Novodra and Auralin (i.e., Novodra minus Auralin)</li>
</ul>
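
<p style = "font-size: 15px; line-height: 25px">The noninferiority rule in the hba1c_change description can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: it uses a normal approximation (z = 1.96) for the 95% confidence interval, the function name is hypothetical, and the HbA1c changes are invented numbers, not values from the trial. A real analysis might use a different interval method.</p>

```python
import numpy as np

def noninferiority_upper_bound(novodra_change, auralin_change, z=1.96):
    """Upper bound of an approximate 95% CI for mean(Novodra) - mean(Auralin)."""
    novodra_change = np.asarray(novodra_change, dtype=float)
    auralin_change = np.asarray(auralin_change, dtype=float)
    diff = novodra_change.mean() - auralin_change.mean()
    # Standard error of the difference between two independent means.
    se = np.sqrt(
        novodra_change.var(ddof=1) / novodra_change.size
        + auralin_change.var(ddof=1) / auralin_change.size
    )
    return diff + z * se

# Hypothetical HbA1c changes (in percentage points) for each arm.
novodra = [0.40, 0.45, 0.38, 0.42]
auralin = [0.37, 0.41, 0.35, 0.39]

upper = noninferiority_upper_bound(novodra, auralin)
is_noninferior = upper < 0.4   # threshold taken from the column description
print(upper, is_noninferior)
```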

<h3 style = "color: orange; font-style: oblique">Table: adverse_reactions</h3>

<ul style = "font-size: 15px; line-height: 25px">
    <li><span style = "font-weight: bold">given_name:</span> the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with either Auralin or Novodra)</li>
    <li><span style = "font-weight: bold">surname:</span> the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with either Auralin or Novodra)</li>
    <li><span style = "font-weight: bold">adverse_reaction:</span> the adverse reaction reported by the patient</li>
</ul>

### Step 3: Add any additional information

<p style = "font-size: 15px; line-height: 25px">Additional useful information:</p>
<ul style = "font-size: 15px; line-height: 25px">
    <li>Insulin resistance varies person to person, which is why both starting median daily dose and ending median daily dose are required i.e., to calculate change in dose.</li>
    <li>It is important to test drugs and medical products in the people they are meant to help. People of different ages, races, sexes and ethnic groups must be included in clinical trials. This is reflected in the patients table.</li>
</ul>

### Step 4: Data Assessment

In [3]:
with pd.ExcelWriter("clinical_trial.xlsx") as w:
    patients.to_excel(w, sheet_name = "patients", index = False)
    treatments.to_excel(w, sheet_name = "treatments", index = False)
    treatments_cut.to_excel(w, sheet_name = "treatments_cuts", index = False)
    adverse_reactions.to_excel(w, sheet_name = "adverse_reactions", index = False)

### Issues with Datasets

1. ***Dirty Data***

`Table: patients`
- patient_id = 9 has name misspelled "Disvid" instead of "David" `accuracy`
- state column sometimes contains the full name and sometimes the abbreviation `consistency`
- zip_code has entries with 4 digits `validity`
- data missing for 12 patients in columns address, city, state, zip_code, country, contact `completeness`
- Incorrect datatype assigned to assigned_sex, zip_code and birthdate `validity`
- duplicate entries by name "John Doe" `accuracy`
- patient_id = 211 has weight 48.8 pounds `accuracy`
- patient_id = 5 has height = 27 inches `accuracy`

`Table: treatments & treatments_cut`
- given_name and surname are in all lower case `consistency`
- remove 'u' from auralin and novodra columns `validity`
- replace '-' in auralin and novodra columns with NaN `validity`
- missing values in hba1c_change `completeness`
- one duplicate entry in treatments table by name "joseph day" `accuracy`
- some hba1c_change values are incorrectly calculated `accuracy`

`Table: adverse_reactions`
- given_name and surname are in all lower case `consistency`

<br>

2. ***Messy Data***

`Table: patients`
- contact column contains both phone and email

`Table: treatments and treatments_cut`
- auralin and novodra columns should be split into two columns, start dose and end dose
- merge both tables

`Table: adverse_reactions`
- This table should not exist independently
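
To make the tidiness issue in the treatments tables concrete, here is a minimal sketch of reshaping the auralin and novodra columns into one `drug` column plus numeric start and end doses. The two rows are illustrative, written in the same layout as the treatments table, and the regular expression is one possible way to read the "start - end" dose string rather than the only one.

```python
import pandas as pd

# Illustrative rows in the same layout as the treatments table.
wide = pd.DataFrame({
    "given_name": ["annie", "mackenzie"],
    "surname": ["allen", "mckay"],
    "auralin": ["36u - 42u", "-"],
    "novodra": ["-", "44u - 43u"],
})

# One row per patient and drug instead of one column per drug.
long_df = wide.melt(
    id_vars=["given_name", "surname"], var_name="drug", value_name="dose_range"
)

# Keep only the drug each patient actually received.
long_df = long_df[long_df["dose_range"] != "-"].reset_index(drop=True)

# Pull the two numbers out of strings such as "36u - 42u".
doses = long_df["dose_range"].str.extract(
    r"(?P<dose_start>\d+)u\s*-\s*(?P<dose_end>\d+)u"
).astype(int)

tidy = pd.concat([long_df.drop(columns="dose_range"), doses], axis=1)
print(tidy)
```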

In [4]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       491 non-null    object 
 5   city          491 non-null    object 
 6   state         491 non-null    object 
 7   zip_code      491 non-null    float64
 8   country       491 non-null    object 
 9   contact       491 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [5]:
patients[patients["address"].isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,04-09-1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,04-07-1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11-03-1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10-09-1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


In [6]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [7]:
treatments_cut.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    70 non-null     object 
 1   surname       70 non-null     object 
 2   auralin       70 non-null     object 
 3   novodra       70 non-null     object 
 4   hba1c_start   70 non-null     float64
 5   hba1c_end     70 non-null     float64
 6   hba1c_change  42 non-null     float64
dtypes: float64(3), object(4)
memory usage: 4.0+ KB


In [8]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 948.0+ bytes


In [9]:
patients.duplicated().sum()

0

In [10]:
patients.duplicated(subset = ["given_name", "surname"]).sum()

5

In [11]:
patients[patients.duplicated(subset = ["given_name", "surname"])]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,01-01-1975,180.0,72,24.4
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,01-01-1975,180.0,72,24.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,01-01-1975,180.0,72,24.4
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,01-01-1975,180.0,72,24.4
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,01-01-1975,180.0,72,24.4


In [12]:
treatments.duplicated().sum()

1

In [13]:
treatments[treatments.duplicated()]

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
136,joseph,day,29u - 36u,-,7.7,7.19,


In [14]:
treatments[treatments.duplicated(subset = ["given_name", "surname"])]

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
136,joseph,day,29u - 36u,-,7.7,7.19,


In [15]:
treatments_cut.duplicated().sum()

0

In [16]:
treatments_cut[treatments_cut.duplicated(subset = ["given_name", "surname"])]

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change


In [17]:
adverse_reactions.duplicated().sum()

0

In [18]:
adverse_reactions[adverse_reactions.duplicated(subset = ["given_name", "surname"])]

Unnamed: 0,given_name,surname,adverse_reaction


In [19]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [20]:
patients[patients["weight"] == 48.8]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
210,211,female,Camilla,Zaitseva,4689 Briarhill Lane,Wooster,OH,44691.0,United States,330-202-2145CamillaZaitseva@superrito.com,11/26/1938,48.8,63,19.1


In [21]:
patients[patients["height"] == 27]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [22]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [23]:
treatments.sort_values("hba1c_start", ascending = False)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
166,annie,allen,36u - 42u,-,9.95,9.58,0.37
75,mackenzie,mckay,-,44u - 43u,9.87,9.48,0.39
81,robert,wagner,43u - 49u,-,9.84,9.52,0.32
171,justyna,kowalczyk,24u - 34u,-,9.84,9.44,
25,benoît,bonami,-,44u - 43u,9.82,9.40,0.92
...,...,...,...,...,...,...,...
105,finlay,sheppard,-,31u - 30u,7.51,7.17,0.34
53,nasser,mansour,-,33u - 31u,7.51,7.06,0.95
126,jowita,wiśniewska,-,22u - 23u,7.50,7.08,0.92
270,mika,martinsson,34u - 43u,-,7.50,7.17,0.33


In [24]:
treatments.sort_values("hba1c_end", ascending = False)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
166,annie,allen,36u - 42u,-,9.95,9.58,0.37
81,robert,wagner,43u - 49u,-,9.84,9.52,0.32
75,mackenzie,mckay,-,44u - 43u,9.87,9.48,0.39
171,justyna,kowalczyk,24u - 34u,-,9.84,9.44,
192,valur,bjarkason,-,31u - 36u,9.71,9.41,0.30
...,...,...,...,...,...,...,...
86,ananías,enríquez,-,44u - 45u,7.58,7.07,0.51
53,nasser,mansour,-,33u - 31u,7.51,7.06,0.95
187,león,reynoso,-,38u - 40u,7.59,7.06,0.53
80,hideki,haraguchi,-,37u - 35u,7.59,7.05,0.54


In [25]:
treatments[(treatments["hba1c_change"] > treatments["hba1c_change"].quantile(0.5)) &
           (treatments["hba1c_change"] < treatments["hba1c_change"].quantile(0.75))].sort_values(
    "hba1c_change", ascending = False)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
88,yumena,nakayama,-,34u - 32u,7.76,7.35,0.91
102,bội,tạ,-,41u - 40u,7.73,7.32,0.91
191,monika,lončar,-,49u - 46u,9.13,8.72,0.91
153,nicole,zimmerman,-,59u - 56u,8.98,8.57,0.91
117,javier,moquin,-,30u - 32u,8.0,7.59,0.91
35,csaba,sági,-,35u - 29u,7.88,7.48,0.9
257,mathilde,nørgaard,-,27u - 28u,8.5,8.1,0.9
127,farizah,sleiman,-,50u - 50u,7.8,7.4,0.9
221,torben,mortensen,-,44u - 40u,7.8,7.4,0.9
234,haylom,nebay,-,42u - 44u,7.62,7.22,0.9


In [26]:
treatments_cut.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,70.0,70.0,42.0
mean,7.838,7.443143,0.51881
std,0.423007,0.418706,0.270719
min,7.51,7.02,0.28
25%,7.64,7.2325,0.34
50%,7.73,7.345,0.37
75%,7.86,7.4675,0.9075
max,9.91,9.46,0.97


### Step 5: Data Cleaning

In [27]:
patients_df = patients.copy()
treatments_df = treatments.copy()
treatments_cut_df = treatments_cut.copy()
adverse_reactions_df = adverse_reactions.copy()

### Define

- Replace all missing values of patients_df with "No Data"
- Subtract hba1c_end from hba1c_start to get hba1c_change in treatments_df and treatments_cut_df
- In patients_df use regex to separate email and phone
- Merge treatments_df and treatments_cut_df
- Split auralin and novodra into two columns (start dose and end dose)
- Merge adverse_reactions_df with treatments_df
- Convert zip_code column to string and add leading zero
- Convert birthdate to datetime object
- Replace "Dasvid" with "David"
- Drop duplicated values from patients_df
- Replace patient_id = 211 weight with 148.8 lbs instead of 48.8 lbs
- Replace patient_id = 5 height with 67 inches instead of 27 inches
- Capitalize given_name and surname in treatments_df
- Drop duplicate data in treatments_df by name "joseph day"
- Convert state names to abbreviations in patients_df
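
Two of the items above, converting birthdate to datetime and converting state names to abbreviations, can be sketched as follows. This is a minimal illustration on a hypothetical sample, not the notebook's own implementation. It assumes that dates written with dashes and with slashes are both month/day/year, as the column description states, and the lookup dictionary is deliberately partial and covers only the names used here.

```python
import pandas as pd

# Hypothetical sample mixing both date separators and both state styles.
sample = pd.DataFrame({
    "birthdate": ["8/14/1950", "04-09-1978", "11-03-1989"],
    "state": ["California", "NJ", "Nebraska"],
})

# Unify the separator first, then parse everything as month/day/year.
sample["birthdate"] = pd.to_datetime(
    sample["birthdate"].str.replace("-", "/", regex=False), format="%m/%d/%Y"
)

# Partial lookup table: only the full names that occur in this sample.
STATE_ABBREVIATIONS = {"California": "CA", "Nebraska": "NE", "New Jersey": "NJ"}

# Values that are already abbreviations are left unchanged by replace().
sample["state"] = sample["state"].replace(STATE_ABBREVIATIONS)

print(sample)
```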

In [28]:
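# Rows with missing address fields; these values cannot be recovered from the other columns
patients_df[patients_df["address"].isnull()]

# Note: filling with a string also turns the numeric zip_code column into object dtype
patients_df = patients_df.fillna("No Data")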

In [29]:
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       503 non-null    object 
 5   city          503 non-null    object 
 6   state         503 non-null    object 
 7   zip_code      503 non-null    object 
 8   country       503 non-null    object 
 9   contact       503 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(2), int64(2), object(10)
memory usage: 55.1+ KB


In [30]:
treatments_df["hba1c_change"] = treatments_df["hba1c_start"] - treatments_df["hba1c_end"]
treatments_cut_df["hba1c_change"] = treatments_cut_df["hba1c_start"] - treatments_cut_df["hba1c_end"]

In [31]:
treatments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  280 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [32]:
treatments_cut_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    70 non-null     object 
 1   surname       70 non-null     object 
 2   auralin       70 non-null     object 
 3   novodra       70 non-null     object 
 4   hba1c_start   70 non-null     float64
 5   hba1c_end     70 non-null     float64
 6   hba1c_change  70 non-null     float64
dtypes: float64(3), object(4)
memory usage: 4.0+ KB


In [33]:
patients_df["contact"][0]

'951-719-9170ZoeWellish@superrito.com'

In [34]:
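import re

def get_phone_email(txt):
    # Phone: drop letters and punctuation so that only the digits remain
    phone = re.sub("[a-zA-Z!@#$%^&*.+,]", "", txt).replace("-", "").replace("(", "").replace(")", "").replace(" ", "")
    # Email: drop digits and phone punctuation (this also strips any digits that belong to the address itself)
    email = re.sub("[0-9]", "", txt).replace("-", "").replace("()", "").replace("+", "").strip().lower()
    return phone, email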

In [35]:
get_phone_email(patients_df["contact"][0])

('9517199170', 'zoewellish@superrito.com')
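
Stripping every digit to obtain the email would also remove digits that are part of an address. As a hedged alternative, the sketch below first locates the email with a pattern and treats the remaining digits as the phone number. `EMAIL_PATTERN` and `split_contact` are hypothetical names, and the pattern is a simplification that assumes the email starts with a letter and ends in an alphabetic top-level domain.

```python
import re

# Simplified email pattern: starts with a letter, ends in an alphabetic TLD.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z][A-Za-z0-9._%+-]*@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

def split_contact(text):
    """Return (phone_digits, email) from a combined contact string."""
    match = EMAIL_PATTERN.search(text)
    if match is None:
        # No email found: keep whatever digits are present as the phone.
        return re.sub(r"\D", "", text), ""
    email = match.group(0).lower()
    # Everything outside the email match is treated as phone material.
    rest = text[:match.start()] + text[match.end():]
    phone = re.sub(r"\D", "", rest)
    return phone, email

print(split_contact("951-719-9170ZoeWellish@superrito.com"))
print(split_contact("johndoe@email.com1234567890"))
```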

In [36]:
patients_df["phone"] = patients_df["contact"].apply(lambda e: get_phone_email(e)[0])
patients_df["email"] = patients_df["contact"].apply(lambda e: get_phone_email(e)[1])

In [37]:
patients_df = patients_df.drop(columns = "contact")

In [38]:
def correct_phone(n):
    if len(n) == 11:
        return n[1:]
    elif len(n) == 0:
        return "No Number"
    else:
        return n

In [39]:
patients_df["phone"] = patients_df["phone"].apply(lambda e: correct_phone(e))

In [40]:
treatments_df = pd.concat([treatments_df, treatments_cut_df])

In [41]:
treatments_df = treatments_df.melt(id_vars = ["given_name", "surname", "hba1c_start", "hba1c_end", "hba1c_change"], var_name = "drug", value_name = "dose_range")

In [42]:
treatments_df = treatments_df[treatments_df["dose_range"] != "-"].reset_index(drop = True)

In [43]:
treatments_df["dose_start"] = treatments_df["dose_range"].str.split("u - ").str.get(0)
treatments_df["dose_end"] = treatments_df["dose_range"].str.split("u - ").str.get(1).str.replace("u", "")

In [44]:
treatments_df = treatments_df.drop(columns = "dose_range")

In [45]:
treatments_df[["dose_start", "dose_end"]] = treatments_df[["dose_start", "dose_end"]].astype(int)

In [46]:
treatments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   hba1c_start   350 non-null    float64
 3   hba1c_end     350 non-null    float64
 4   hba1c_change  350 non-null    float64
 5   drug          350 non-null    object 
 6   dose_start    350 non-null    int32  
 7   dose_end      350 non-null    int32  
dtypes: float64(3), int32(2), object(3)
memory usage: 19.3+ KB


In [47]:
treatments_df

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,drug,dose_start,dose_end
0,veronika,jindrová,7.63,7.20,0.43,auralin,41,48
1,skye,gormanston,7.97,7.62,0.35,auralin,33,36
2,sophia,haugen,7.65,7.27,0.38,auralin,37,42
3,eddie,archer,7.89,7.55,0.34,auralin,31,38
4,asia,woźniak,7.76,7.37,0.39,auralin,30,36
...,...,...,...,...,...,...,...,...
345,christopher,woodward,7.51,7.06,0.45,novodra,55,51
346,maret,sultygov,7.67,7.30,0.37,novodra,26,23
347,lixue,hsueh,9.21,8.80,0.41,novodra,22,23
348,jakob,jakobsen,7.96,7.51,0.45,novodra,28,26


In [48]:
treatments_df = treatments_df.merge(adverse_reactions_df, how = "left", on = ["given_name", "surname"])

In [49]:
treatments_df["adverse_reaction"] = treatments_df["adverse_reaction"].fillna("no reaction")

In [50]:
treatments_df

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,drug,dose_start,dose_end,adverse_reaction
0,veronika,jindrová,7.63,7.20,0.43,auralin,41,48,no reaction
1,skye,gormanston,7.97,7.62,0.35,auralin,33,36,no reaction
2,sophia,haugen,7.65,7.27,0.38,auralin,37,42,no reaction
3,eddie,archer,7.89,7.55,0.34,auralin,31,38,no reaction
4,asia,woźniak,7.76,7.37,0.39,auralin,30,36,no reaction
...,...,...,...,...,...,...,...,...,...
345,christopher,woodward,7.51,7.06,0.45,novodra,55,51,nausea
346,maret,sultygov,7.67,7.30,0.37,novodra,26,23,no reaction
347,lixue,hsueh,9.21,8.80,0.41,novodra,22,23,injection site discomfort
348,jakob,jakobsen,7.96,7.51,0.45,novodra,28,26,hypoglycemia


In [51]:
patients_df["zip_code"] = patients_df["zip_code"].apply(lambda e: str(int(e)) if e != "No Data" else e)

In [52]:
zip_index = patients_df[patients_df["zip_code"].str.len() == 4].index

In [53]:
city_state = list(zip(patients_df.iloc[zip_index, 5], patients_df.iloc[zip_index, 6]))

In [54]:
patients_df.iloc[zip_index]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095,United States,7/26/1951,220.9,70,31.7,7326368246,phanbaliem@jourrapide.com
20,21,female,Sofia,Karlsen,2931 Romano Street,Whitman,MA,2382,United States,9/24/1934,153.1,66,24.7,7814471763,sofiatkarlsen@teleworm.us
34,35,female,Mariana,Souza,577 Chipmunk Lane,Orrington,ME,4474,United States,03-06-1948,152.9,63,27.1,2078258634,marianagomessouza@superrito.com
38,39,female,Genet,Fesahaye,4649 Joanne Lane,Westborough,MA,1581,United States,01-11-1954,111.8,67,17.5,9784609060,genetfesahaye@armyspy.com
39,40,female,Ganimete,Ščančar,4105 Ferguson Street,Walpole,MA,2081,United States,10/25/1947,191.6,67,30.0,5084542027,ganimetescancar@cuvox.de
44,45,female,Blanka,Jurković,3165 Upton Avenue,Waterville,ME,4901,United States,1/26/1923,129.8,66,20.9,2078614587,blankajurkovic@superrito.com
53,54,male,Kwemtochukwu,Ogochukwu,2172 Lynn Street,Franklin,MA,2038,United States,6/30/1976,150.5,72,20.4,6173175055,kwemtochukwuogochukwu@einrot.com
54,55,female,Louise,Johnson,4984 Hampton Meadows,Burlington,MA,1803,United States,03-01-1931,141.0,62,25.8,9784071874,louisejohnson@rhyta.com
62,63,female,Firenze,Fodor,1786 Gerald L. Bates Drive,Belmont,MA,2178,United States,04-01-1943,131.1,60,25.6,6178835967,fodorfirenze@dayrep.com
67,68,male,Nebechi,Ekechukwu,2418 Smith Street,Marlboro,MA,1752,United States,01-11-1945,154.9,64,26.6,5088044850,nebechiekechukwu@teleworm.us


In [55]:
import zipcodes

zipcodes.filter_by(city = "Woodbridge", state = "NJ")[0]["zip_code"]

'07095'

In [56]:
for i in city_state:
    try:
        print(i, "-"*10, zipcodes.filter_by(city = i[0], state = i[1])[0]["zip_code"])
    except IndexError:  # no match found for this city/state pair
        print(i, "-"*10, np.nan)

('Woodbridge', 'NJ') ---------- 07095
('Whitman', 'MA') ---------- 02382
('Orrington', 'ME') ---------- 04474
('Westborough', 'MA') ---------- 01580
('Walpole', 'MA') ---------- 02081
('Waterville', 'ME') ---------- 04901
('Franklin', 'MA') ---------- 02038
('Burlington', 'MA') ---------- 01803
('Belmont', 'MA') ---------- 02478
('Marlboro', 'MA') ---------- nan
('Hartford', 'CT') ---------- 06101
('Foxboro', 'MA') ---------- 02035
('Mansfield', 'MA') ---------- 02048
('Mount Holly', 'NJ') ---------- 08060
('Presque Isle', 'ME') ---------- 04769
('Quincy', 'MA') ---------- 02169
('Boston', 'MA') ---------- 02108
('Bedford', 'MA') ---------- 01730
('Lowell', 'MA') ---------- 01850
('Jersey City', 'NJ') ---------- 07097
('Providence', 'RI') ---------- 02901
('West Haven', 'CT') ---------- 06516
('Hopewell  Mercer', 'NJ') ---------- nan
('Pennsauken', 'NJ') ---------- 08110
('Exeter', 'NH') ---------- 03833
('Hartford', 'VT') ---------- 05047
('Brattleboro', 'VT') ---------- 05301
('Bedfo

In [57]:
patients_df["zip_code"] = patients_df["zip_code"].apply(lambda e: "0" + e if len(e) == 4 else e)
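
As an alternative sketch, pandas' `Series.str.zfill` pads every string to a fixed width, so it also covers codes that are more than one digit short. The sample below is illustrative: the first three values appear in this notebook, while `"501"` is a made-up case.

```python
import pandas as pd

# Small sample of ZIP codes stored as strings, some missing leading zeros
zip_sample = pd.Series(["7095", "2382", "92390", "501"])

# zfill(5) left-pads with zeros up to five characters and leaves
# strings that are already five characters long unchanged
padded = zip_sample.str.zfill(5)
print(padded.tolist())
```

Unlike the length-4 check, this form needs no special case per length, but it assumes every value really is a US five-digit ZIP code.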

In [58]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,07-10-1976,121.7,66,19.6,9517199170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,04-03-1967,118.8,66,19.2,2175693204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,2/19/1980,177.8,71,24.8,4023636804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,07095,United States,7/26/1951,220.9,70,31.7,7326368246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,2/18/1928,192.3,27,26.1,3345157487,timneudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,03852,United States,04-10-1959,181.1,72,24.6,2074770579,mustafalindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341,United States,3/26/1948,239.6,70,34.4,9282844492,rumanbisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110,United States,1/13/1971,171.2,67,26.8,8162236007,jinkedekeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109,United States,2/13/1952,176.9,67,27.7,3604432060,chidaluonyekaozulu@jourrapide.com


In [59]:
patients_df["birthdate"] = pd.to_datetime(patients_df["birthdate"])
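
The `birthdate` column mixes two layouts (`7/26/1951` and `03-06-1948`). Depending on the pandas version, a single `pd.to_datetime` call may infer one layout and then warn or fail on the other, so one hedged option is to parse each layout explicitly and combine the results. This sketch assumes both layouts are month-first, which matches the parsed output shown in this notebook; the `"not a date"` entry is a made-up value that shows how failures surface.

```python
import pandas as pd

# Two real layouts from the birthdate column plus one deliberately invalid value
raw = pd.Series(["7/26/1951", "03-06-1948", "not a date"])

# Parse each layout separately; values that do not match become NaT
slash_dates = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
dash_dates = pd.to_datetime(raw, format="%m-%d-%Y", errors="coerce")

# Take the slash-style result where available, otherwise the dash-style one
parsed = slash_dates.fillna(dash_dates)
print(parsed)
```

Anything still `NaT` after both passes matched neither layout and can be inspected by hand.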

In [60]:
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   patient_id    503 non-null    int64         
 1   assigned_sex  503 non-null    object        
 2   given_name    503 non-null    object        
 3   surname       503 non-null    object        
 4   address       503 non-null    object        
 5   city          503 non-null    object        
 6   state         503 non-null    object        
 7   zip_code      503 non-null    object        
 8   country       503 non-null    object        
 9   birthdate     503 non-null    datetime64[ns]
 10  weight        503 non-null    float64       
 11  height        503 non-null    int64         
 12  bmi           503 non-null    float64       
 13  phone         503 non-null    object        
 14  email         503 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(

In [61]:
# .loc selects the rows and the column in one step, so the assignment modifies patients_df itself
patients_df.loc[patients_df["given_name"] == "Dsvid", "given_name"] = "David"
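
Chained indexing of the form `df[col][mask] = value` may write to a temporary object rather than to the DataFrame, which is why pandas recommends a single `.loc` call. A minimal sketch on a toy frame (the frame and its values are illustrative, not the patients table):

```python
import pandas as pd

# Toy frame standing in for patients_df
df = pd.DataFrame({"given_name": ["Dsvid", "Zoe", "Dsvid"],
                   "surname": ["A", "B", "C"]})

# Build the row mask once, then assign through one .loc call
mask = df["given_name"] == "Dsvid"
df.loc[mask, "given_name"] = "David"

print(df["given_name"].tolist())
```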


In [62]:
patients_df.head(20)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,1976-07-10,121.7,66,19.6,9517199170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,1967-04-03,118.8,66,19.2,2175693204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,1980-02-19,177.8,71,24.8,4023636804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,07095,United States,1951-07-26,220.9,70,31.7,7326368246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,1928-02-18,192.3,27,26.1,3345157487,timneudorf@cuvox.de
5,6,male,Rafael,Costa,1140 Willis Avenue,Daytona Beach,Florida,32114,United States,1931-08-31,183.9,70,26.4,3863345237,rafaelcardosocosta@gustr.com
6,7,female,Mary,Adams,3145 Sheila Lane,Burbank,NV,84728,United States,1969-11-19,146.3,65,24.3,7755335933,marybadams@einrot.com
7,8,female,Xiuxiu,Chang,2687 Black Oak Hollow Road,Morgan Hill,CA,95037,United States,1958-08-13,158.0,60,30.9,4087783236,xiuxiuchang@einrot.com
8,9,male,David,Gustafsson,1790 Nutter Street,Kansas City,MO,64105,United States,1937-03-06,163.9,66,26.5,8162659578,davidgustafsson@armyspy.com
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011,United States,1930-12-03,194.7,64,33.4,7187959124,sophiecabreraibarra@teleworm.us


In [63]:
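# Keep the first row for each (given_name, surname) pair; .copy() returns an independent DataFrame
patients_df = patients_df.drop_duplicates(subset = ["given_name", "surname"], ignore_index = True).copy()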
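
Dropping on name alone also removes distinct people who happen to share a name, so it can help to look at the duplicate groups first. A minimal sketch on a made-up frame (the names and weights are illustrative):

```python
import pandas as pd

# Hypothetical frame with one repeated (given_name, surname) pair
people = pd.DataFrame({
    "given_name": ["John", "Jane", "John", "Zoe"],
    "surname": ["Doe", "Roe", "Doe", "Wellish"],
    "weight": [180.0, 140.0, 175.0, 121.7],
})

# keep=False marks every member of a duplicate group, not only the later rows
dupes = people[people.duplicated(subset=["given_name", "surname"], keep=False)]
print(dupes)

# drop_duplicates keeps the first row of each group by default
deduped = people.drop_duplicates(subset=["given_name", "surname"], ignore_index=True)
print(len(people) - len(deduped), "row(s) removed")
```

If the rows in a group differ in other columns, as the weights do here, that is a cue to decide which record to keep rather than defaulting to the first.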

In [64]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390,United States,1976-07-10,121.7,66,19.6,9517199170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812,United States,1967-04-03,118.8,66,19.2,2175693204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467,United States,1980-02-19,177.8,71,24.8,4023636804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,07095,United States,1951-07-26,220.9,70,31.7,7326368246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,1928-02-18,192.3,27,26.1,3345157487,timneudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,03852,United States,1959-04-10,181.1,72,24.6,2074770579,mustafalindstrom@jourrapide.com
494,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341,United States,1948-03-26,239.6,70,34.4,9282844492,rumanbisliev@gustr.com
495,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110,United States,1971-01-13,171.2,67,26.8,8162236007,jinkedekeizer@teleworm.us
496,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109,United States,1952-02-13,176.9,67,27.7,3604432060,chidaluonyekaozulu@jourrapide.com


In [65]:
# Use one .loc call instead of chained indexing so the value is written to patients_df itself
patients_df.loc[patients_df["weight"] == 48.8, "weight"] = 148.8


In [66]:
patients_df.describe()

Unnamed: 0,patient_id,weight,height,bmi
count,498.0,498.0,498.0,498.0
mean,252.034137,173.56988,66.580321,27.514859
std,146.067474,33.636769,4.400313,5.293793
min,1.0,102.1,27.0,17.1
25%,125.25,148.9,63.0,23.225
50%,253.5,174.45,67.0,27.25
75%,378.75,199.725,69.0,31.8
max,503.0,255.9,79.0,37.7
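
A summary like this is easiest to use when checked against plausible ranges: the `height` minimum of 27 sits far below the 25th percentile of 63. A minimal sketch of flagging such rows, where the bounds `low` and `high` are assumptions chosen for illustration, not clinical limits:

```python
import pandas as pd

# Three rows taken from the patients table (patient_id and height only)
sample = pd.DataFrame({"patient_id": [4, 5, 6], "height": [70, 27, 70]})

# Assumed plausible range for adult height in this dataset's units
low, high = 48, 90

# between() is inclusive on both ends; ~ inverts it to select outliers
out_of_range = sample[~sample["height"].between(low, high)]
print(out_of_range)
```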


In [67]:
# Use one .loc call instead of chained indexing so the value is written to patients_df itself
patients_df.loc[patients_df["height"] == 27, "height"] = 67


In [68]:
patients_df.describe()

Unnamed: 0,patient_id,weight,height,bmi
count,498.0,498.0,498.0,498.0
mean,252.034137,173.56988,66.660643,27.514859
std,146.067474,33.636769,4.025484,5.293793
min,1.0,102.1,59.0,17.1
25%,125.25,148.9,63.0,23.225
50%,253.5,174.45,67.0,27.25
75%,378.75,199.725,69.0,31.8
max,503.0,255.9,79.0,37.7


In [69]:
treatments_df["given_name"] = treatments_df["given_name"].str.capitalize()
treatments_df["surname"] = treatments_df["surname"].str.capitalize()
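
Note that `str.capitalize` upper-cases only the first character of the whole string and lower-cases the rest, which matters for multi-word names. A small sketch comparing it with `str.title` on illustrative values (whether `treatments_df` contains such names is not shown here):

```python
import pandas as pd

# Illustrative lowercase names, one of them with two words
names = pd.Series(["veronika", "de keizer"])

# capitalize: first character upper, everything else lower
capitalized = names.str.capitalize()

# title: first character of every word upper
titled = names.str.title()

print(capitalized.tolist())
print(titled.tolist())
```

Neither method knows naming conventions (for example a lowercase "de" particle), so either result may need manual review.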

In [70]:
treatments_df

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,drug,dose_start,dose_end,adverse_reaction
0,Veronika,Jindrová,7.63,7.20,0.43,auralin,41,48,no reaction
1,Skye,Gormanston,7.97,7.62,0.35,auralin,33,36,no reaction
2,Sophia,Haugen,7.65,7.27,0.38,auralin,37,42,no reaction
3,Eddie,Archer,7.89,7.55,0.34,auralin,31,38,no reaction
4,Asia,Woźniak,7.76,7.37,0.39,auralin,30,36,no reaction
...,...,...,...,...,...,...,...,...,...
345,Christopher,Woodward,7.51,7.06,0.45,novodra,55,51,nausea
346,Maret,Sultygov,7.67,7.30,0.37,novodra,26,23,no reaction
347,Lixue,Hsueh,9.21,8.80,0.41,novodra,22,23,injection site discomfort
348,Jakob,Jakobsen,7.96,7.51,0.45,novodra,28,26,hypoglycemia


In [71]:
treatments_df = treatments_df.drop_duplicates(ignore_index = True)

In [77]:
patients_df.state.unique()

array(['California', 'Illinois', 'Nebraska', 'NJ', 'AL', 'Florida', 'NV',
       'CA', 'MO', 'New York', 'MI', 'TN', 'VA', 'OK', 'GA', 'MT', 'MA',
       'NY', 'NM', 'IL', 'LA', 'PA', 'CO', 'ME', 'WI', 'SD', 'MN', 'FL',
       'WY', 'OH', 'IA', 'NC', 'IN', 'CT', 'KY', 'DE', 'MD', 'AZ', 'TX',
       'NE', 'AK', 'ND', 'KS', 'MS', 'WA', 'SC', 'WV', 'RI', 'NH', 'OR',
       'No Data', 'VT', 'ID', 'DC', 'AR'], dtype=object)
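
The array mixes full state names, two-letter abbreviations, and a `'No Data'` placeholder. One way to list only the entries that need attention is to filter out everything that already looks like an abbreviation. A minimal sketch on a subset of the values above:

```python
import pandas as pd

# Subset of the values returned by patients_df.state.unique()
states = pd.Series(["California", "NJ", "AL", "New York", "No Data", "CA"])

# True where the whole string is exactly two uppercase letters
is_abbrev = states.str.fullmatch(r"[A-Z]{2}")

# Everything else is a full name or a placeholder
needs_fix = sorted(states[~is_abbrev].unique())
print(needs_fix)
```

This checks only the shape of the value; it does not confirm that a two-letter code is a real state.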

In [80]:
# Map full state names to their two-letter abbreviations
state_abbreviations = {"California": "CA", "Illinois": "IL", "Nebraska": "NE", "Florida": "FL", "New York": "NY"}
# assign() returns a new DataFrame with the updated column
patients_df = patients_df.assign(state = patients_df["state"].replace(state_abbreviations))


In [81]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,CA,92390,United States,1976-07-10,121.7,66,19.6,9517199170,zoewellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,IL,61812,United States,1967-04-03,118.8,66,19.2,2175693204,pamelashill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,NE,68467,United States,1980-02-19,177.8,71,24.8,4023636804,jaemdebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,07095,United States,1951-07-26,220.9,70,31.7,7326368246,phanbaliem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303,United States,1928-02-18,192.3,67,26.1,3345157487,timneudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,03852,United States,1959-04-10,181.1,72,24.6,2074770579,mustafalindstrom@jourrapide.com
494,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341,United States,1948-03-26,239.6,70,34.4,9282844492,rumanbisliev@gustr.com
495,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110,United States,1971-01-13,171.2,67,26.8,8162236007,jinkedekeizer@teleworm.us
496,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109,United States,1952-02-13,176.9,67,27.7,3604432060,chidaluonyekaozulu@jourrapide.com
