### Data Analysis Process
1. Asking Questions
2. Data Wrangling

   a. Gathering Data
      - i. CSV files
      - ii. APIs
      - iii. Web Scraping
      - iv. Databases

   b. Assessing Data

   c. Cleaning Data
3. Exploratory Data Analysis
4. Drawing Conclusions
5. Communicating Results

### Summary of today's session
- You have to become Sherlock
- We will try to create a framework
- Frameworks may vary
- The goal is to give you an idea of the process

### 2. Data Wrangling
- Data Gathering
- Data Assessing
- Data Cleaning

### 2b. Data Assessing
In this step, the goal is to understand the data more deeply. Before applying any cleaning methods, you need a clear idea of what the data is about.

### Types of Unclean Data

There are 2 kinds of unclean data: dirty data and messy data.

**Dirty Data (data with quality issues): Dirty data, also known as low-quality data, has content issues such as:**

- Duplicated data
- Missing Data
- Corrupt Data
- Inaccurate Data
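
Two of these issues, duplicated and missing data, can be counted directly with pandas. The snippet below is a minimal sketch on a small made-up table (the `toy` frame and its values are hypothetical, not taken from the trial files); corrupt and inaccurate data usually need domain-specific checks instead.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values) with one duplicated row and one missing value
toy = pd.DataFrame({
    "given_name": ["zoe", "jae", "jae", "tim"],
    "surname": ["wellish", "debord", "debord", "neudorf"],
    "weight": [121.7, 177.8, 177.8, np.nan],
})

# Duplicated data: rows that repeat an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# Missing data: number of NaN values in each column
missing_per_column = toy.isnull().sum()

print("duplicated rows:", n_duplicates)
print(missing_per_column)
```

The same two calls can be run on `patients`, `treatments` and `adverse_reactions` once they are loaded.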

**Messy Data (data with tidiness issues): Messy data, also known as untidy data, has structural issues. Tidy data has the following properties:**

- Each variable forms a column
- Each observation forms a row
- Each observational unit forms a table
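
As an illustration of the first property, the sketch below reshapes a table in which one variable (which drug a patient received) is spread across two columns. The rows use the same layout as the `treatments` table in this notebook; the new column names `treatment` and `dose` are assumptions chosen for this example.

```python
import pandas as pd

# Three rows in the treatments layout; "-" marks the drug a patient did not take
messy = pd.DataFrame({
    "given_name": ["veronika", "elliot", "skye"],
    "surname": ["jindrová", "richardson", "gormanston"],
    "auralin": ["41u - 48u", "-", "33u - 36u"],
    "novodra": ["-", "40u - 45u", "-"],
})

# Turn the two drug columns into a "treatment" variable and a "dose" variable
tidy = messy.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)

# Drop the placeholder rows so that each row is one patient on one treatment
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)
print(tidy)
```

After the reshape, each variable has its own column and each remaining row is a single observation.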


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
# Folder that holds the four CSV files (adjust to your machine)
data_dir = r"D:\100-DAYS-OF-PYTHON-PROGRAMMING\100-DAYS-OF-PYTHON-PROGRAMMING\Data-Assessing-and-Cleaning\data-wrangling-master\data-wrangling-master"

patients = pd.read_csv(data_dir + r"\patients.csv")
treatments = pd.read_csv(data_dir + r"\treatments.csv")
adverse_reactions = pd.read_csv(data_dir + r"\adverse_reactions.csv")
treatments_cut = pd.read_csv(data_dir + r"\treatments_cut.csv")

In [5]:
# view datasets
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [6]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [7]:
treatments_cut.shape

(70, 7)

In [8]:
adverse_reactions

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort


### 1. Write a summary for your data
This is a dataset about 500 patients, of which 350 participated in a clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After 4 weeks, which isn’t enough time to capture all the change in HbA1c that can be attributed to the switch to Auralin or Novodra:

- 175 patients switched to Auralin for 24 weeks
- 175 patients continued using Novodra for 24 weeks

Data about patients who experienced adverse effects is also recorded.

### 2. Write Column descriptions
**Table -> patients:**
- patient_id: the unique identifier for each patient in the Master Patient Index (i.e. patient database) of the pharmaceutical company that is producing Auralin
- assigned_sex: the assigned sex of each patient at birth (male or female)
- given_name: the given name (i.e. first name) of each patient
- surname: the surname (i.e. last name) of each patient
- address: the main address for each patient
- city: the corresponding city for the main address of each patient
- state: the corresponding state for the main address of each patient
- zip_code: the corresponding zip code for the main address of each patient
- country: the corresponding country for the main address of each patient (all United States for this clinical trial)
- contact: phone number and email information for each patient
- birthdate: the date of birth of each patient (month/day/year). The inclusion criterion for this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing problem among the elderly population)
- weight: the weight of each patient in pounds (lbs)
- height: the height of each patient in inches (in)
- bmi: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a person's height and weight. The formula is $\mathrm{BMI} = \mathrm{kg}/\mathrm{m}^2$, where kg is a person's weight in kilograms and $\mathrm{m}^2$ is their height in metres squared. A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. The inclusion criterion for this clinical trial is 16 <= BMI <= 38.
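
Because weight is stored in pounds and height in inches, both have to be converted before the BMI formula applies. The sketch below recomputes BMI for the five rows shown in `patients.head()` and flags rows where the recorded value disagrees. The column names `bmi_recomputed` and `bmi_mismatch` and the 0.2 tolerance are assumptions made for this example.

```python
import pandas as pd

# Conversion factors: pounds to kilograms, inches to metres
KG_PER_LB = 0.45359237
M_PER_IN = 0.0254

# weight, height and bmi copied from the first five rows of patients.head()
sample = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "weight": [121.7, 118.8, 177.8, 220.9, 192.3],
    "height": [66, 66, 71, 70, 27],
    "bmi": [19.6, 19.2, 24.8, 31.7, 26.1],
})

kg = sample["weight"] * KG_PER_LB
m = sample["height"] * M_PER_IN
sample["bmi_recomputed"] = (kg / m**2).round(1)

# Flag rows where recorded and recomputed BMI differ by more than a rounding margin
# (the 0.2 tolerance is an arbitrary choice for this sketch)
sample["bmi_mismatch"] = (sample["bmi"] - sample["bmi_recomputed"]).abs() > 0.2

print(sample[["patient_id", "bmi", "bmi_recomputed", "bmi_mismatch"]])
```

Only patient 5 is flagged: a height of 27 inches cannot produce the recorded BMI of 26.1, which points to an inaccurate value in that row.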

**Table -> treatments and treatments_cut:**

- given_name: the given name of each patient in the Master Patient Index that took part in the clinical trial
- surname: the surname of each patient in the Master Patient Index that took part in the clinical trial
- auralin: the baseline median daily dose of insulin from the week prior to switching to Auralin (the number before the dash) and the ending median daily dose of insulin, measured over the 24th and final week of treatment (the number after the dash). Both are measured in units (short form 'u'), which is the international unit of measurement and the standard measurement for insulin.
- novodra: same as above, except for patients that continued treatment with Novodra
- hba1c_start: the patient's HbA1c level at the beginning of the first week of treatment. HbA1c stands for Hemoglobin A1c. The HbA1c test measures what the average blood sugar has been over the past three months. It is thus a powerful way to get an overall sense of how well diabetes has been controlled. Everyone with diabetes should have this test 2 to 4 times per year. Measured in %.
- hba1c_end: the patient's HbA1c level at the end of the last week of treatment
- hba1c_change: the change in the patient's HbA1c level from the start of treatment to the end, i.e., hba1c_start - hba1c_end. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).
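
The definition of `hba1c_change` gives a direct way to assess that column. The sketch below uses the five rows shown in `treatments.head()`; the helper columns `recomputed`, `is_missing` and `is_inconsistent` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Values copied from treatments.head(); NaN marks the empty hba1c_change cells
sample = pd.DataFrame({
    "given_name": ["veronika", "elliot", "yukitaka", "skye", "alissa"],
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [np.nan, 0.97, np.nan, 0.35, 0.32],
})

# Recompute the change from its definition: hba1c_start - hba1c_end
sample["recomputed"] = (sample["hba1c_start"] - sample["hba1c_end"]).round(2)

# Recorded values that are missing
sample["is_missing"] = sample["hba1c_change"].isnull()

# Recorded values that are present but disagree with the definition
sample["is_inconsistent"] = ~sample["is_missing"] & ~np.isclose(
    sample["hba1c_change"], sample["recomputed"]
)

print(sample[["given_name", "hba1c_change", "recomputed", "is_missing", "is_inconsistent"]])
```

In these five rows, two values are missing and one (0.97 recorded against 0.47 recomputed) is inconsistent, so the column has both missing and inaccurate data to document.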

**Table -> adverse_reactions**
- given_name: the given name of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with either Auralin or Novodra)
- surname: the surname of each patient in the Master Patient Index that took part in the clinical trial and had an adverse reaction (includes patients treated with either Auralin or Novodra)
- adverse_reaction: the adverse reaction reported by the patient

### 3. Add any additional information
**Additional useful information:**

- Insulin resistance varies person to person, which is why both starting median daily dose and ending median daily dose are required, i.e., to calculate change in dose.
- It is important to test drugs and medical products in the people they are meant to help. People of different ages, races, sexes, and ethnic groups must be included in clinical trials. This diversity is reflected in the patients table.
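
To calculate the change in dose, the start and end values first have to be pulled out of strings such as `41u - 48u`. The following is one possible sketch using a regular expression; the output column names `dose_start`, `dose_end` and `dose_change` are assumptions for this example, not columns in the original files.

```python
import pandas as pd

# Dose strings in the format used by the auralin and novodra columns
doses = pd.Series(["41u - 48u", "40u - 45u", "39u - 36u", "33u - 36u"], name="dose")

# Capture the number before the dash (start) and the number after it (end)
parsed = doses.str.extract(
    r"(?P<dose_start>\d+)u\s*-\s*(?P<dose_end>\d+)u"
).astype(int)

# Change in dose over the treatment period (end minus start)
parsed["dose_change"] = parsed["dose_end"] - parsed["dose_start"]

print(parsed)
```

Rows that hold only the `-` placeholder would not match the pattern and would need to be filtered out or handled before the `astype(int)` step.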

#### Types of Assessment
**There are 2 types of assessment styles**

- Manual - Looking through the data manually in Google Sheets
- Programmatic - By using pandas functions such as info(), describe() or sample()
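
A minimal sketch of the programmatic style is shown below. It runs on a small stand-in frame built from a few columns of `patients.head()` so that it works without the CSV files; with the real data, the same calls would be made on `patients` directly.

```python
import pandas as pd

# Small stand-in table with values copied from patients.head()
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "state": ["California", "Illinois", "Nebraska", "NJ", "AL"],
    "zip_code": [92390.0, 61812.0, 68467.0, 7095.0, 36303.0],
    "height": [66, 66, 71, 70, 27],
})

# info(): column names, non-null counts and dtypes (prints directly)
df.info()

# describe(): summary statistics for the numeric columns
summary = df.describe()
print(summary)

# sample(): a few random rows; random_state makes the draw repeatable
print(df.sample(3, random_state=42))
```

Even on these five rows the calls surface issues worth documenting: `zip_code` is stored as a float (so `7095.0` has lost its leading zero), `state` mixes full names with abbreviations, and the minimum `height` is 27 inches.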

#### Steps in Assessment
**There are 2 steps involved in assessment**

- Discover
- Document

In [11]:
# export data for manual assessment

with pd.ExcelWriter('clinical_trials.xlsx') as writer:
    patients.to_excel(writer, sheet_name='patients')
    treatments.to_excel(writer, sheet_name='treatments')
    treatments_cut.to_excel(writer, sheet_name='treatment_cut')
    adverse_reactions.to_excel(writer, sheet_name='adverse_reactions')