## Assessing the Data

1. Quality
    * Dirty data, also known as low quality data, has content issues such as missing, duplicate, or incorrect values.
2. Tidiness
    * Messy data, also known as untidy data. Untidy data has structural issues.
    * In tidy data, each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
    
Types of Assessment:
1. Visual
2. Programmatic: using functions to summarize the data

**Assessing** your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).

Assessing is the precursor to cleaning. You can't clean something that you don't know exists! In this lesson, you'll learn to identify and categorize common data quality and tidiness issues. 

This lesson will be structured as follows:

* You'll get motivated to assess (and later clean) the dataset for lessons 3 and 4: Phase II clinical trial data that compares the efficacy and safety of a new oral insulin to treat diabetes
* You'll learn to distinguish between dirty data and messy data
* You'll assess the data visually and programmatically to identify:
    * Data quality issues
    * Tidiness issues
* You'll learn about data quality dimensions and categorize each of the data quality issues identified above into its appropriate dimension

## Sources of Dirty Data

Dirty data = low quality data = content issues

There are lots of sources of dirty data. Basically, anytime humans are involved, there's going to be dirty data. There are lots of ways in which we touch data we work with.

* We're going to have user entry errors.
* In some situations we won't have any data coding standards, or where we do have standards they'll be poorly applied, causing problems in the resulting data.
* We might have to integrate data where different schemas have been used for the same type of item.
* We'll have legacy data systems, where data was coded when disk and memory constraints were much more restrictive than they are now. Over time systems evolve: needs change, and data changes.
* Some of our data won't have the unique identifiers it should.
* Other data will be lost in transformation from one format to another.
* And then, of course, there's always programmer error.
* And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, one that's not our fault.

## Sources of Messy Data
Messy data = untidy data = structural issues

Messy data is usually the result of poor data planning or a lack of awareness of the benefits of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). Fortunately, messy data is usually much easier to address than most of the sources of dirty data mentioned above.
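
As a rough illustration of a structural issue, the `auralin` and `novodra` columns of the `treatments` table used in this lesson each pack several variables (treatment, start dose, end dose) into one cell. The sketch below rebuilds two rows by hand and reshapes them with `melt`; the output column names `treatment`, `dose_start`, and `dose_end` are assumptions chosen for illustration, not names from the course.

```python
import pandas as pd

# Two hand-built rows shaped like the treatments table.
wide = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Melt the two drug columns into one `treatment` column and one `dose` column.
long = wide.melt(id_vars=["given_name", "surname"],
                 value_vars=["auralin", "novodra"],
                 var_name="treatment", value_name="dose")

# Drop the placeholder rows for the drug a patient did not take.
tidy = long[long["dose"] != "-"].reset_index(drop=True)

# Split the dose string into numeric start and end columns.
doses = tidy["dose"].str.extract(r"(\d+)u - (\d+)u").astype(int)
tidy["dose_start"] = doses[0]
tidy["dose_end"] = doses[1]
tidy = tidy.drop(columns="dose")
```

After this reshaping each row is one patient on one treatment, and each column holds a single variable.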

### Dataset: Oral Insulin Phase II Clinical Trial Data

**Discovery of insulin:** insulin helps glucose get into the cells of our bodies. Although it doesn't cure diabetes, it's one of the biggest discoveries in medicine.

**The future: oral insulin**

Phase II clinical trial data for a new, innovative oral insulin called Auralin.
* The makers of Auralin believe it will solve the stomach lining problem that has kept insulin from being taken orally
* The trial tests the efficacy and the dose response of the drug
* It identifies common short-term side effects and adverse reactions
* 350 patients are split into two groups:
    * 175 treated with Auralin
    * 175 treated with an injectable insulin, Novodra
* Comparing these two drugs, we can determine if Auralin is effective
* The key metric is the change in HbA1c level (`hba1c_change`)
* If we get, say, a 0.5% change, we have a major breakthrough in quality of life for diabetics all over the world
* The data could contain duplicate, missing, and inaccurate values

#### DISCLAIMER: This Data Isn't "Real"

Auralin and Novodra are *not* real insulin products. This clinical trial data was fabricated for the sake of this course. When assessing this data, the issues that you'll detect (and later clean) are meant to simulate real-world data quality and tidiness issues.

That said:

* This dataset was constructed with the consultation of real doctors to ensure plausibility.
* This clinical trial data for an alternative insulin was inspired by, and closely mimics, this real [clinical trial for a new inhaled insulin called Afrezza](http://care.diabetesjournals.org/content/38/12/2266.long).
* The data quality issues in this dataset mimic real, [common data quality issues in healthcare data](http://media.hypersites.com/clients/1446/filemanager/Articles/DocCenter_Problem_with_data.pdf). These issues impact quality of care, patient registration, and revenue.
* The patients in this dataset were created using this [fake name generator](https://www.fakenamegenerator.com/order.php) and do not include real names, addresses, phone numbers, emails, etc.

### Gather

In [1]:
import pandas as pd

In [2]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

### Assess

Visual assessment is often easier in a spreadsheet program such as Excel or Google Sheets, but some files are so large that spreadsheet programs crash when trying to open them.
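
When a file is too large for a spreadsheet, one option is to load only part of it with pandas. The sketch below uses the `nrows` and `chunksize` parameters of `pd.read_csv`; the in-memory CSV text is a small stand-in for a large file on disk, so the values in it are illustrative only.

```python
import io
import pandas as pd

# Stand-in for a large file; with a real file you would pass its path instead.
csv_text = (
    "patient_id,weight,height\n"
    "1,121.7,66\n"
    "2,118.8,66\n"
    "3,177.8,71\n"
    "4,220.9,70\n"
)

# Read only the first two rows for a quick visual check of columns and values.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Stream the file in fixed-size chunks and count rows without loading it all at once.
total_rows = sum(len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3))
```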

`patients` columns:
- **patient_id**: the unique identifier for each patient in the [Master Patient Index](https://en.wikipedia.org/wiki/Enterprise_master_patient_index) (i.e. patient database) of the pharmaceutical company that is producing Auralin
- **assigned_sex**: the assigned sex of each patient at birth (male or female)
- **given_name**: the given name (i.e. first name) of each patient
- **surname**: the surname (i.e. last name) of each patient
- **address**: the main address for each patient
- **city**: the corresponding city for the main address of each patient
- **state**: the corresponding state for the main address of each patient
- **zip_code**: the corresponding zip code for the main address of each patient
- **country**: the corresponding country for the main address of each patient (all United States for this clinical trial)
- **contact**: phone number and email information for each patient
- **birthdate**: the date of birth of each patient (month/day/year). The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is age >= 18 *(there is no maximum age because diabetes is a [growing problem](http://www.diabetes.co.uk/diabetes-and-the-elderly.html) among the elderly population)*
- **weight**: the weight of each patient in pounds (lbs)
- **height**: the height of each patient in inches (in)
- **bmi**: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is BMI = kg/m<sup>2</sup> where kg is a person's weight in kilograms and m<sup>2</sup> is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. *The [inclusion criterion](https://en.wikipedia.org/wiki/Inclusion_and_exclusion_criteria) for this clinical trial is 16 <= BMI <= 38.*
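
Because weight is stored in pounds and height in inches, the BMI formula needs a unit conversion first. The sketch below applies it to two rows of the `patients` table (patient 1 and patient 5); the helper name `bmi_from_imperial` is hypothetical. Recomputing BMI this way is one possible cross-check on the recorded height and weight.

```python
LB_TO_KG = 0.45359237  # pounds to kilograms
IN_TO_M = 0.0254       # inches to metres


def bmi_from_imperial(weight_lb, height_in):
    """Return BMI (kg/m^2) from a weight in pounds and a height in inches."""
    kg = weight_lb * LB_TO_KG
    m = height_in * IN_TO_M
    return kg / m ** 2


# Patient 1: 121.7 lb, 66 in, recorded bmi 19.6.
zoe = bmi_from_imperial(121.7, 66)

# Patient 5: 192.3 lb, recorded height 27 in, recorded bmi 26.1.
tim_as_recorded = bmi_from_imperial(192.3, 27)
tim_if_72 = bmi_from_imperial(192.3, 72)
```

For patient 5, the recorded BMI only matches when the height is taken as 72 inches, which suggests the 27 is a transposition rather than a true measurement.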

Our key metric is **hba1c_change**

350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before.  All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn't enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:
- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

`treatments` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial
- **auralin**: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) *and* the ending median daily dose of insulin at the end of the 24 weeks of treatment measured over the 24th week of treatment (the number after the dash). Both are measured in units (shortform 'u'), which is the [international unit](https://en.wikipedia.org/wiki/International_unit) of measurement and the standard measurement for insulin.
- **novodra**: same as above, except for patients that continued treatment with Novodra
- **hba1c_start**: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The [HbA1c test](https://depts.washington.edu/uwcoe/healthtopics/diabetes.html) measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- **hba1c_end**: the patient's HbA1c level at the end of the last week of treatment
- **hba1c_change**: the change in the patient's HbA1c level from the start of treatment to the end, i.e., `hba1c_start` - `hba1c_end`. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).
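
Since `hba1c_change` is defined as `hba1c_start` - `hba1c_end`, it can be recomputed and compared with the recorded column. The sketch below does this for the first five rows of the `treatments` table (the values are copied from the `treatments.head()` output in this document); the 0.005 tolerance is an assumption to absorb floating-point rounding.

```python
import pandas as pd

# First five rows of the treatments table, reduced to the relevant columns.
treatments_head = pd.DataFrame({
    "surname": ["jindrová", "richardson", "takenaka", "gormanston", "montez"],
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [None, 0.97, None, 0.35, 0.32],
})

# Recompute the change from its definition.
recomputed = (treatments_head["hba1c_start"] - treatments_head["hba1c_end"]).round(2)

# Completeness: rows where the recorded change is missing.
missing = treatments_head["hba1c_change"].isnull()

# Accuracy: rows where the recorded change disagrees with the recomputed one.
# Comparisons against NaN evaluate to False, so missing rows are not flagged here.
mismatch = (recomputed - treatments_head["hba1c_change"]).abs() > 0.005
```

In these five rows, two changes are missing and one recorded value (0.97) disagrees with the recomputed 0.47.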

`adverse_reactions` columns:
- **given_name**: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated with Auralin and Novodra)
- **surname**: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes both patients treated with Auralin and Novodra)
- **adverse_reaction**: the adverse reaction reported by the patient

Additional useful information:
- [Insulin resistance varies person to person](http://www.tudiabetes.org/forum/t/how-much-insulin-is-too-much-on-a-daily-basis/9804/5), which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different age, race, sex, and ethnic group must be included in clinical trials. This [diversity](https://www.clinicalleader.com/doc/an-fda-perspective-on-patient-diversity-in-clinical-trials-0001) is reflected in the `patients` table.
- Ensuring column names are descriptive enough is an important step in acquainting yourself with the data. 'Descriptive enough' is subjective. Ideally you want short column names (so they are easier to type and read in code form) but also fully descriptive. Length vs. descriptiveness is a tradeoff and common debate (a [similar debate](https://softwareengineering.stackexchange.com/questions/176582/is-there-an-excuse-for-short-variable-names) exists for variable names). The *auralin* and *novodra* column names are probably not descriptive enough, but you'll address that later so don't worry about that for now.

**More Information**

* [Stack Overflow: Is it a good idea to use an integer column for storing US ZIP codes in a database?](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas)
* [ABBYY: Optical Character Recognition](https://www.abbyy.com/en-ca/finereader/what-is-ocr/)
* [Hadley Wickham: Tidying messy datasets](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)

In [3]:
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [4]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [5]:
adverse_reactions.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


## Assessing versus Exploring

1. Assessing is part of the data wrangling process
2. Exploring is part of exploratory data analysis (EDA): removing outliers and creating new, more descriptive features from existing data (feature engineering)

In the context of this dataset, **assessing** is everything you just identified, like spotting:

* Missing HbA1c changes
* Poorly formatted zip codes (e.g., four digits and float data type instead of five digits and string or object data type)
* Multiple state formats (e.g., NY and New York)
* Incorrect patient height values (e.g., 27 inches instead of 72 inches)

**Assessing** is also identifying structural (tidiness) issues that make analysis difficult.

Discovering these data quality and tidiness issues (and later cleaning them) ensures that the analysis can be executed, which for this clinical trial data includes calculating average patient metrics (e.g. age, weight, height, and BMI) and calculating the confidence interval for the difference in HbA1c change means between Novodra and Auralin patients.
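
To make the confidence interval concrete, here is a minimal sketch using only the Python standard library. It uses a normal approximation for the difference of two means, which is an assumption and not necessarily the method used in the trial analysis. The function name `diff_in_means_ci` is hypothetical and the two input lists are made-up toy values, not trial data.

```python
import math
from statistics import NormalDist, mean, variance


def diff_in_means_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for mean(a) - mean(b)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean(a) - mean(b)
    # Standard error of the difference, using sample variances.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff - z * se, diff + z * se


# Made-up HbA1c changes for illustration only.
novodra_changes = [0.40, 0.35, 0.45, 0.38]
auralin_changes = [0.39, 0.33, 0.42, 0.36]

# Novodra minus Auralin, as in the noninferiority definition.
lower, upper = diff_in_means_ci(novodra_changes, auralin_changes)
noninferior = upper < 0.4
```

Missing or inaccurate `hba1c_change` values would feed directly into this calculation, which is why they need to be found during assessment.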

**Exploring**, in the context of this dataset, might be:

* Using summary statistics like `count` on the state column or `mean` on the weight column to see if patients from certain states or of certain weights are more likely to have diabetes, which we can use to exclude certain patients from the analysis and make it less biased

Exploring, in the context of a clinical trial, is less likely to happen given that clinical trials are expensive and involve extensive pre-planning. So exploring on this dataset would likely happen exclusively before the *treatments* and *adverse_reactions* tables were created, i.e., before the clinical trial was conducted.

## Data Quality Dimensions

Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

1. **Completeness**: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
2. **Validity**: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
3. **Accuracy**: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
4. **Consistency**: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

Regarding the other data quality research mentioned in the video, the additional dimensions are very specific cases of the four dimensions listed above. Example: currency, defined as the degree to which data is current with the world that it models. Currency can measure how up-to-date data is. Currency is a specific case of accuracy in the sense that out-of-date data is (usually) valid but wrong. In other words, our definition of accuracy can include currency.
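
One way to keep the four dimensions in mind is to pair each with a simple check. The sketch below does this on a small frame built from the five rows returned by `patients.head()`; the 48-inch height cutoff is an arbitrary assumption for illustration, and real checks would depend on the schema you define.

```python
import pandas as pd

# The five rows shown by patients.head(), reduced to the columns used here.
sample = pd.DataFrame({
    "state": ["California", "Illinois", "Nebraska", "NJ", "AL"],
    "zip_code": [92390.0, 61812.0, 68467.0, 7095.0, 36303.0],
    "height": [66, 66, 71, 70, 27],
})

# Completeness: count missing cells per column.
missing_per_column = sample.isnull().sum()

# Validity: a US zip code has five digits, so one that prints with fewer is suspect.
zip_digits = sample["zip_code"].astype(int).astype(str).str.len()
invalid_zip = sample[zip_digits != 5]

# Accuracy: values that fit the schema but look implausible (cutoff is an assumption).
implausible_height = sample[sample["height"] < 48]

# Consistency: a mix of two-letter abbreviations and full state names.
state_is_abbreviation = sample["state"].str.len() == 2
```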

**More Information:**
The inconsistent data quality dimension research mentioned in the video: [source 1 (PDF)](https://www.damauk.org/RWFilePub.php?&cat=403&dx=2&ob=3&rpn=catviewleafpublic403&id=106193), [source 2](http://www.informit.com/articles/article.aspx?p=399325&seqNum=3), [source 3](https://searchdatamanagement.techtarget.com/definition/data-quality), [and source 4](https://www.youtube.com/watch?v=dPsx8_Fcr-U)

### Quiz
Categorize the most recent four data quality issues you visually detected into their appropriate data quality dimensions. Reminder, they are:

* 'Dsvid' given name typo in the patients table: accuracy;
* 'u' next to start dose and end dose in the treatments table: validity;
* Lowercase given names and surnames in the treatments and adverse_reactions table: consistency;
* 280 records in the treatments table instead of 350: completeness.
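
The lowercase-name issue matters because names are the only link between the tables. The sketch below shows, on hand-built rows that mirror records in this dataset, how a merge on names finds nothing until the case is normalized. Lowercasing the `patients` names is just one possible normalization, chosen here for illustration.

```python
import pandas as pd

# Names are capitalized in patients but lowercase in adverse_reactions.
patients_sample = pd.DataFrame({
    "given_name": ["Flavia", "Zoe"],
    "surname": ["Fiorentino", "Wellish"],
})
reactions_sample = pd.DataFrame({
    "given_name": ["flavia"],
    "surname": ["fiorentino"],
    "adverse_reaction": ["cough"],
})

# A merge on the raw names finds no matches because the case differs.
naive = patients_sample.merge(reactions_sample, on=["given_name", "surname"])

# Lowercasing the patients names first lets the same record match.
normalized = patients_sample.assign(
    given_name=patients_sample["given_name"].str.lower(),
    surname=patients_sample["surname"].str.lower(),
)
matched = normalized.merge(reactions_sample, on=["given_name", "surname"])
```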

## Programmatic Assessment

Programmatic assessment means using functions and methods to reveal something about your data's quality and tidiness.

These are the programmatic assessment methods in pandas that you will probably use most often:

* `.head` (DataFrame and Series)
* `.tail` (DataFrame and Series)
* `.sample` (DataFrame and Series)
* `.info` (DataFrame only)
* `.describe` (DataFrame and Series)
* `.value_counts` (Series only)
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)

Try them out below and keep their results in mind. Some will come in handy later in the lesson.

Check out the [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/api.html) for detailed usage information.
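
The indexing and selection methods in the last bullet can be sketched on a tiny frame. The three rows below are taken from the `patients` table and reduced to three columns; the 150 lb threshold is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "given_name": ["Zoe", "Pamela", "Jae"],
    "state": ["California", "Illinois", "Nebraska"],
    "weight": [121.7, 118.8, 177.8],
})

# .iloc selects by integer position.
first_row = df.iloc[0]

# .loc selects by label; here, all rows and two named columns.
name_and_weight = df.loc[:, ["given_name", "weight"]]

# Bracket notation with a boolean mask keeps the rows where the mask is True.
heavy = df[df["weight"] > 150]

# .loc accepts a boolean mask and a column label together.
heavy_names = df.loc[df["weight"] > 150, "given_name"]
```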

In [6]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
patient_id      503 non-null int64
assigned_sex    503 non-null object
given_name      503 non-null object
surname         503 non-null object
address         491 non-null object
city            491 non-null object
state           491 non-null object
zip_code        491 non-null float64
country         491 non-null object
contact         491 non-null object
birthdate       503 non-null object
weight          503 non-null float64
height          503 non-null int64
bmi             503 non-null float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB


In [7]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
given_name          34 non-null object
surname             34 non-null object
adverse_reaction    34 non-null object
dtypes: object(3)
memory usage: 896.0+ bytes


Try `.head` and `.tail` on the `patients` table.

In [8]:
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [19]:
patients.tail()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7
502,503,male,Pat,Gersten,2778 North Avenue,Burr,Nebraska,68324.0,United States,PatrickGersten@rhyta.com402-848-4923,5/3/1954,138.2,71,19.3


Try `.sample` on the `treatments` table.

In [20]:
treatments.sample()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
36,chỉ,lâm,45u - 48u,-,7.68,7.24,


Try `.info` on the `treatments` table.

In [14]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
given_name      280 non-null object
surname         280 non-null object
auralin         280 non-null object
novodra         280 non-null object
hba1c_start     280 non-null float64
hba1c_end       280 non-null float64
hba1c_change    171 non-null float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


Try `.describe` on the `patients` table.

In [21]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


Try `.value_counts` on the *adverse_reaction* column of the `adverse_reactions` table.

In [22]:
adverse_reactions.adverse_reaction.value_counts()

hypoglycemia                 19
injection site discomfort     6
headache                      3
throat irritation             2
cough                         2
nausea                        2
Name: adverse_reaction, dtype: int64

Try selecting the records in the `patients` table for patients that are from the *city* New York.

In [46]:
patients[patients.city == "New York"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
9,10,female,Sophie,Cabrera,3303 Anmoore Road,New York,New York,10011.0,United States,SophieCabreraIbarra@teleworm.us1 718 795 9124,12/3/1930,194.7,64,33.4
35,36,female,Kamila,Pecinová,3558 Longview Avenue,New York,New York,10004.0,United States,718-501-0503KamilaPecinova@dayrep.com,12/23/1985,198.9,62,36.4
84,85,female,Nương,Vũ,465 Southern Street,New York,NY,10001.0,United States,VuCamNuong@fleckens.hu516-720-5094,2/1/1981,138.2,63,24.5
129,130,female,Rebecca,Jephcott,989 Wayback Lane,New York,NY,10004.0,United States,631-370-7406RebeccaJephcott@armyspy.com,8/1/1966,203.3,65,33.8
142,143,male,Finley,Chandler,2754 Westwood Avenue,New York,New York,10001.0,United States,516-740-5280FinleyChandler@dayrep.com,10/25/1936,150.9,70,21.6
152,153,male,Christopher,Woodward,3450 Southern Street,New York,NY,10004.0,United States,ChristopherWoodward@jourrapide.com+1 (516) 630...,9/4/1984,212.2,66,34.2
188,189,male,Søren,Sørensen,2397 Bell Street,New York,NY,10011.0,United States,SrenSrensen@superrito.com1 212 201 3108,12/31/1942,157.1,67,24.6
213,214,female,Onyemaechi,Onwughara,685 Duncan Avenue,New York,NY,10013.0,United States,917-622-9142OnyemaechiOnwughara@einrot.com,3/8/1989,131.1,69,19.4
215,216,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [47]:
len(patients[patients.city == "New York"])

18

There are 18 patients from the city of New York.
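
The same selection can be combined with `.value_counts` to surface the mixed state formats. The sketch below rebuilds only the `city` and `state` values of the ten New York rows displayed above, so its counts cover those ten rows rather than all 18.

```python
import pandas as pd

# City and state for the ten New York rows displayed above.
new_york_rows = pd.DataFrame({
    "city": ["New York"] * 10,
    "state": ["New York", "New York", "NY", "NY", "New York",
              "NY", "NY", "NY", "NY", "NY"],
})

# Count how each state is written among patients from the city of New York.
state_formats = new_york_rows.loc[new_york_rows["city"] == "New York", "state"].value_counts()
```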

In [53]:
# select the records with a missing (null) address
patients[patients.address.isnull()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
209,210,female,Lalita,Eldarkhanov,,,,,,,8/14/1950,143.4,62,26.2
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
257,258,male,Jin,Kung,,,,,,,5/17/1995,231.7,69,34.2
264,265,female,Wafiyyah,Asfour,,,,,,,11/3/1989,158.6,63,28.1
269,270,female,Flavia,Fiorentino,,,,,,,10/9/1937,175.2,61,33.1
278,279,female,Generosa,Cabán,,,,,,,12/16/1962,124.3,69,18.4


In [55]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [57]:
# returns a random sample of five records from the table
patients.sample(5)

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
310,311,male,Hugo,Collins,3214 Better Street,Lenexa,KS,66219.0,United States,HugoCollins@cuvox.de1 913 322 9114,2/3/1932,193.6,69,28.6
473,474,female,Kate,Wilkinson,664 Lyon Avenue,South Boston,MA,2127.0,United States,KateWilkinson@armyspy.com1 508 905 2371,7/18/1998,175.3,65,29.2
133,134,female,Elisabeth,Dimmen,3180 Still Pastures Drive,Columbia,SC,29210.0,United States,ElisabethDimmen@cuvox.de+1 (803) 465-3312,4/21/1985,199.5,65,33.2
77,78,female,Rut,Halldórsdóttir,1054 Zappia Drive,Lexington,KY,40507.0,United States,859-297-3368RutHalldorsdottir@einrot.com,5/9/1959,162.6,66,26.2
53,54,male,Kwemtochukwu,Ogochukwu,2172 Lynn Street,Franklin,MA,2038.0,United States,617-317-5055KwemtochukwuOgochukwu@einrot.com,6/30/1976,150.5,72,20.4


In [58]:
patients.surname.value_counts()

Doe              6
Taylor           3
Jakobsen         3
Woźniak          2
Nilsen           2
Hueber           2
Gersten          2
Cabrera          2
Grímsdóttir      2
Correia          2
Berg             2
Lâm              2
Kowalczyk        2
Tucker           2
Aranda           2
Batukayev        2
Lương            2
Souza            2
Silva            2
Dratchev         2
Liễu             2
Kadyrov          2
Lund             2
Collins          2
Parker           2
Bùi              2
Ogochukwu        2
Johnson          2
Cindrić          2
Tạ               2
                ..
Adonay           1
Mathiesen        1
Sørensen         1
Aličajić         1
Borgen           1
Radislav         1
Tuma             1
Freud            1
Tsukada          1
Gaber            1
Bakos            1
Laatikainen      1
Synek            1
House            1
Német            1
van der Lubbe    1
Touma            1
Thạch            1
Bidwill          1
Arsanukayev      1
MacDonald        1
Macleod     

In [64]:
patients[patients.surname == "Doe"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
215,216,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
277,278,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


In [72]:
# select rows whose address repeats an earlier row (missing addresses also count as repeats of one another)
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
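
Rows with a missing address show up in this output because `duplicated` treats missing values as equal to each other. The sketch below, on a small hand-built Series, shows that behaviour and one way to restrict the check to real addresses; `keep=False` is a standard `duplicated` parameter that flags every member of a repeated group.

```python
import pandas as pd

addresses = pd.Series([
    "123 Main Street", None, "123 Main Street", None, "576 Brown Bear Drive",
])

# Default: flag each repeat after the first occurrence (missing values included).
flag_default = addresses.duplicated()

# keep=False: flag every member of a repeated group.
flag_all = addresses.duplicated(keep=False)

# Ignore missing addresses so only real repeated addresses remain.
real_duplicates = addresses[addresses.notnull() & flag_all]
```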


In [9]:
patients.address.value_counts()

123 Main Street                  6
2778 North Avenue                2
648 Old Dear Lane                2
2476 Fulton Street               2
2967 Prudence Street             1
2121 Liberty Avenue              1
4278 Hart Country Lane           1
3538 Paul Wayne Haggerty Road    1
3084 Blue Spruce Lane            1
4178 Despard Street              1
2152 Heritage Road               1
3390 Hidden Meadow Drive         1
3964 Walnut Avenue               1
4851 Andy Street                 1
4476 Center Street               1
212 Tibbs Avenue                 1
2970 Forest Avenue               1
3141 Brentwood Drive             1
962 George Street                1
4192 Holly Street                1
547 Weekley Street               1
1270 Haul Road                   1
3634 Lyon Avenue                 1
3411 Pyramid Valley Road         1
4988 Lynn Street                 1
2687 Hinkle Deegan Lake Road     1
1540 Overlook Drive              1
2595 Feathers Hooves Drive       1
3300 Woodridge Lane 

In [68]:
patients[patients.surname == "Knudsen"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4


In [67]:
patients[patients.surname == "Jakobsen"]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
24,25,male,Jakob,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
432,433,female,Karen,Jakobsen,1690 Fannie Street,Houston,TX,77020.0,United States,KarenJakobsen@jourrapide.com1 979 203 0438,11/25/1962,185.2,67,29.0
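
Jakob and Jake Jakobsen differ only in `given_name`, so a check on the full row would not catch them. A sketch of a duplicate check on a subset of columns instead; the choice of key columns is an assumption, not something the dataset prescribes, and `patients_demo` is a hypothetical stand-in built from the rows above:

```python
import pandas as pd

# Toy stand-in built from the three Jakobsen rows shown above
patients_demo = pd.DataFrame({
    "patient_id": [25, 30, 433],
    "given_name": ["Jakob", "Jake", "Karen"],
    "surname": ["Jakobsen", "Jakobsen", "Jakobsen"],
    "address": ["648 Old Dear Lane", "648 Old Dear Lane", "1690 Fannie Street"],
    "birthdate": ["8/1/1985", "8/1/1985", "11/25/1962"],
})

# Assumed key: same surname, address and birthdate most likely means the same person
key_columns = ["surname", "address", "birthdate"]

# keep=False returns every member of a duplicate group, not just the later ones
likely_same_person = patients_demo[
    patients_demo.duplicated(subset=key_columns, keep=False)
]

print(likely_same_person[["patient_id", "given_name"]])
```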


In [61]:
# sort patient weight
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
51     107.1
270    108.1
198    108.5
48     109.1
478    109.6
141    110.2
38     111.8
438    112.0
14     112.0
235    112.2
307    112.4
191    112.6
408    113.1
49     113.3
326    114.0
338    114.1
253    117.0
321    118.4
168    118.8
1      118.8
350    119.0
207    119.2
265    120.0
341    120.3
       ...  
332    224.0
252    224.2
12     224.2
222    224.8
166    225.3
111    225.9
101    226.2
150    226.6
352    227.7
428    227.7
88     227.7
13     228.4
339    229.0
182    230.3
121    230.8
257    231.7
395    231.9
246    232.1
219    237.8
11     238.7
50     238.9
441    239.1
499    239.6
439    242.0
487    242.4
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

In [16]:
# Zaitseva's weight (48.8) looks like kilograms: convert it to pounds and recompute BMI
zaitseva = patients[patients.surname == 'Zaitseva']
weight_lbs = zaitseva.weight * 2.20462
height_in = zaitseva.height
# BMI from imperial units: 703 * weight (lb) / height (in)^2
bmi_check = 703 * weight_lbs / (height_in * height_in)

weight_lbs, bmi_check, zaitseva.bmi

(210    107.585456
 Name: weight, dtype: float64, 210    19.055827
 dtype: float64, 210    19.1
 Name: bmi, dtype: float64)
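
The recomputed BMI (19.06) matches the stored `bmi` (19.1) only after converting the weight to pounds, which supports the idea that this weight was recorded in kilograms. The same check can be sketched for many rows at once. The frame below is a hypothetical stand-in: the Doe and Jakobsen values come from the outputs above, while Zaitseva's height of 63 in is not shown in the output and is back-solved from the `bmi_check` result, so treat it as an assumption. The 0.1 tolerance is also an assumption, chosen to absorb rounding.

```python
import pandas as pd

# Hypothetical stand-in for a few `patients` rows
demo = pd.DataFrame({
    "surname": ["Doe", "Jakobsen", "Zaitseva"],
    "weight": [180.0, 155.8, 48.8],   # supposed to be pounds
    "height": [72, 67, 63],           # inches; 63 is back-solved, not from the source
    "bmi": [24.4, 24.4, 19.1],
})

# BMI from imperial units: 703 * weight (lb) / height (in)^2
expected_bmi = 703 * demo.weight / demo.height ** 2

# Rows where the stored BMI disagrees with the recomputed one
# are candidates for a unit problem in weight or height
mismatch = demo[(expected_bmi - demo.bmi).abs() > 0.1]

print(mismatch.surname.tolist())
```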

In [63]:
# check for null values
sum(treatments.auralin.isnull()), sum(treatments.novodra.isnull())

(0, 0)

In [18]:
treatments.head(1)

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
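
The null check above returns `(0, 0)`, yet the first row shows a dash in `novodra`. Missing doses are stored as the string `-`, which `isnull()` does not recognise. A sketch of counting them directly, on a hypothetical two-row stand-in (the first row is the one shown above, the second is made up for illustration):

```python
import pandas as pd

# Stand-in for `treatments`; the second row is hypothetical
treatments_demo = pd.DataFrame({
    "surname": ["jindrová", "hypothetical"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})
dose_cols = treatments_demo[["auralin", "novodra"]]

# isnull() sees ordinary strings, so it reports no missing values
nulls_before = dose_cols.isnull().sum()

# Count the placeholder explicitly
dash_counts = (dose_cols == "-").sum()

# mask() replaces the cells where the condition is True with NaN
nulls_after = dose_cols.mask(dose_cols == "-").isnull().sum()

print(nulls_before.tolist(), dash_counts.tolist(), nulls_after.tolist())
```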


In [25]:
# check for duplicate column names
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns[all_columns.duplicated()]

14    given_name
15       surname
21    given_name
22       surname
dtype: object

#### Quality

1. `treatments` table: 
    - Missing `hba1c_change` values
    - 'u' next to the start dose and end dose in the `auralin` and `novodra` columns
    - Lowercase given names and surnames
    - Missing records (280 records instead of 350)
    - Erroneous datatype (`auralin` and `novodra` should be integers)
    - Inaccurate `hba1c_change`: the maximum value is calculated incorrectly
    - Nulls represented as dashes (-) in the `auralin` and `novodra` columns
    
2. `patients` table: 
    - Erroneous datatypes (`zip_code` is a float but should be a string; `assigned_sex` and `state` are more appropriate as categorical; `birthdate` should be a datetime)
    - `zip_code` sometimes has only four digits
    - Tim Neudorf height is 27 in instead of 72 in
    - Full state names sometimes, abbreviations other times
    - Erroneous `given_name` for the patient with `patient_id` 9
    - Missing demographic information (address - contact columns)
    - Multiple phone formats
    - Default John Doe data
    - Multiple records for Jakobsen, Gersten and Taylor
    - Weight in kg instead of lbs for Zaitseva
    
3. `adverse_reactions` table:
    - Lowercase given names and surnames
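
To illustrate the datatype issues listed for the `patients` table, here is a rough sketch of what the target types could look like. It runs on a hypothetical stand-in frame: the first two zip codes and the birthdates appear in the outputs above, while the four-digit zip code is made up. Padding with a leading zero assumes the missing digit was a zero dropped when the column was read as a number.

```python
import pandas as pd

# Hypothetical stand-in for a few `patients` columns
demo = pd.DataFrame({
    "assigned_sex": ["male", "female", "male"],
    "zip_code": [12771.0, 77020.0, 2134.0],   # 2134.0 is a made-up four-digit example
    "birthdate": ["8/1/1985", "11/25/1962", "1/1/1975"],
})

# float -> int -> str drops the trailing ".0"; zfill restores an assumed leading zero.
# astype(int) fails on missing values, so nulls would need handling first.
demo["zip_code"] = demo.zip_code.astype(int).astype(str).str.zfill(5)

# A small, fixed set of labels fits the categorical dtype
demo["assigned_sex"] = demo.assigned_sex.astype("category")

# Birthdates in the outputs above are month/day/year
demo["birthdate"] = pd.to_datetime(demo.birthdate, format="%m/%d/%Y")

print(demo.dtypes)
print(demo.zip_code.tolist())
```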

#### Tidiness

1. `treatments` table: 
    - auralin and novodra columns should be split into three variables: `treatment` (auralin or novodra), `start_dose` and `end_dose`

2. `patients` table:
    - `contact` column should be split into phone and e-mail address 

3. `adverse_reactions` table:
    - `adverse_reaction` variable should be part of the `treatments` table
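
As a sketch of what the first tidiness issue means in practice, the two drug columns can be melted into a single `treatment` variable and the dose string split into `start_dose` and `end_dose`. This is one possible approach under stated assumptions, not the lesson's prescribed cleaning code. The frame is a hypothetical stand-in whose first row is the `treatments.head(1)` output and whose second row is made up, and the pattern assumes every dose looks like `41u - 48u`.

```python
import pandas as pd

# Stand-in for `treatments`; the second row is hypothetical
treatments_demo = pd.DataFrame({
    "given_name": ["veronika", "hypothetical"],
    "surname": ["jindrová", "patient"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# One row per (patient, drug) pair; the drug name becomes the `treatment` variable
tidy = treatments_demo.melt(id_vars=["given_name", "surname"],
                            value_vars=["auralin", "novodra"],
                            var_name="treatment", value_name="dose")

# A dash means the patient did not receive that drug
tidy = tidy[tidy.dose != "-"].reset_index(drop=True)

# Pull the two numbers out of "41u - 48u"
doses = tidy.dose.str.extract(r"(\d+)u - (\d+)u").astype(int)
tidy["start_dose"] = doses[0]
tidy["end_dose"] = doses[1]
tidy = tidy.drop(columns="dose")

print(tidy)
```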
    
We should have only two tables:
1. `patients`:
    - `patient_id`
    - `assigned_sex`
    - `given_name`
    - `surname`
    - `address`
    - `city`
    - `state`
    - `zip_code`
    - `country`
    - `contact`
    - `birthdate`
    - `weight`
    - `height`
    - `bmi`

2. `treatments`: 
    - `patient_id`
    - `treatment`
    - `start_dose`
    - `end_dose`
    - `hba1c_start`
    - `hba1c_end`
    - `adverse_reaction`
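
The `contact` column in the `patients` table still holds two variables, an e-mail address and a phone number. A rough sketch of how they could be separated with a regular expression follows. The three contact strings are copied from the outputs earlier in this section; the pattern assumes the e-mail always comes first and ends in an alphabetic top-level domain, which may not hold for every row of the real table.

```python
import pandas as pd

# Contact strings copied from the outputs earlier in this section
contact = pd.Series([
    "JakobCJakobsen@einrot.com+1 (845) 858-7707",
    "johndoe@email.com1234567890",
    "KarenJakobsen@jourrapide.com1 979 203 0438",
])

# Assumed layout: e-mail first, then everything else is the phone number
pattern = (r"^(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z]+)+)"
           r"(?P<phone>.*)$")
parts = contact.str.extract(pattern)

print(parts)
```

The phone numbers still come out in several formats, which is the separate quality issue noted above.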