# Import Libraries

In [1]:
import pandas as pd

# Load Dataset

In [2]:
patients = pd.read_csv("patients.csv")
treatments = pd.read_csv("treatments.csv")
adverse_reactions = pd.read_csv("adverse_reactions.csv")

# Helpful techniques for visual and programmatic assessment

## Using `.head()`

To visually inspect the `adverse_reactions` table, use the `.head()` method, which returns the first five rows by default.

In [4]:
adverse_reactions.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


The same method can be applied to both the `treatments` and `patients` dataframes.

In [5]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [7]:
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


## Using `.info()`

The `.info()` method summarizes the columns of the `treatments` dataframe: their names, non-null counts and dtypes.

In [8]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


The same can be done for the `adverse_reactions` dataframe.

In [9]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 948.0+ bytes
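
The non-null counts reported by `.info()` can also be turned into explicit missing-value counts. The following is a minimal sketch on a small stand-in dataframe (three rows copied from the `treatments.head()` output above, with the name `sample` chosen for illustration); it assumes missing entries are stored as `NaN`.

```python
import numpy as np
import pandas as pd

# Stand-in for a few rows of the treatments table (illustrative subset)
sample = pd.DataFrame({
    "given_name": ["veronika", "elliot", "yukitaka"],
    "hba1c_start": [7.63, 7.56, 7.68],
    "hba1c_change": [np.nan, 0.97, np.nan],
})

# Count the missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Applied to the full table, `treatments.isnull().sum()` should report 109 missing `hba1c_change` values, since `.info()` shows 171 non-null entries out of 280.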


## Using `.shape`

The `.shape` attribute is useful for getting the dimensions of a dataframe.

In [10]:
adverse_reactions.shape

(34, 3)

In [11]:
treatments.shape

(280, 7)

In [13]:
patients.shape

(503, 14)

In each case the result is a tuple of the form `(rows, columns)`.

## Using `.columns`

The `.columns` attribute returns the labels of all columns.

In [14]:
adverse_reactions.columns

Index(['given_name', 'surname', 'adverse_reaction'], dtype='object')

In [15]:
treatments.columns

Index(['given_name', 'surname', 'auralin', 'novodra', 'hba1c_start',
       'hba1c_end', 'hba1c_change'],
      dtype='object')

## Using `.index`

The `.index` attribute is great for viewing the index of the data.

In [16]:
adverse_reactions.index

RangeIndex(start=0, stop=34, step=1)

# Assessing a single observational unit being stored in multiple tables

Viewing the columns of the `treatments` dataframe and the `adverse_reactions` dataframe shows that they share two columns, `given_name` and `surname`.

In [17]:
treatments.columns

Index(['given_name', 'surname', 'auralin', 'novodra', 'hba1c_start',
       'hba1c_end', 'hba1c_change'],
      dtype='object')

In [19]:
adverse_reactions.columns

Index(['given_name', 'surname', 'adverse_reaction'], dtype='object')

During cleaning, these two dataframes can be merged based on the common columns.
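
As a rough sketch of what that merge could look like, the example below joins two small stand-in dataframes on `given_name` and `surname`. The treatment rows are copied from the `.head()` output above, but the reaction row is made up for illustration, and the choice of a left join (keeping every treatment row) is an assumption rather than something the assessment prescribes.

```python
import pandas as pd

# Stand-in for a few rows of the treatments table
treatments_sample = pd.DataFrame({
    "given_name": ["veronika", "elliot", "skye"],
    "surname": ["jindrová", "richardson", "gormanston"],
    "hba1c_start": [7.63, 7.56, 7.97],
})

# Hypothetical reaction row, invented for this example
reactions_sample = pd.DataFrame({
    "given_name": ["elliot"],
    "surname": ["richardson"],
    "adverse_reaction": ["hypoglycemia"],
})

# Left join: keep every treatment row, attach a reaction where one exists
merged = pd.merge(
    treatments_sample,
    reactions_sample,
    on=["given_name", "surname"],
    how="left",
)
print(merged)
```

Patients without a recorded reaction end up with `NaN` in `adverse_reaction`, which keeps the row count of the treatments table unchanged.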

# Variables stored in both rows and columns

In the `auralin` and `novodra` columns, no patient has a dosage recorded for both medicines. The following code block should return an empty dataframe, confirming this assessment.

In [21]:
treatments[(treatments['auralin'] != '-') & (treatments['novodra'] != '-')]

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change


Using '-' to mark a medicine the patient did not take, and storing the start and end dosages in the same column, is inefficient. When cleaning the data, three columns will be used: `start_dosage`, `end_dosage` and `medicine_type` (auralin/novodra).
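
One possible way to reach that three-column layout is to melt the two medicine columns and then split the dosage string. This is a sketch on two rows copied from the `.head()` output, not the notebook's actual cleaning code; the names `sample`, `tidy` and the intermediate `dosage` column are illustrative, and it assumes every dosage follows the `"<start> - <end>"` pattern seen above.

```python
import pandas as pd

# Stand-in for two rows of the treatments table
sample = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Turn the two medicine columns into (medicine_type, dosage) rows
tidy = sample.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="medicine_type",
    value_name="dosage",
)

# Drop the placeholder rows for the medicine the patient did not take
tidy = tidy[tidy["dosage"] != "-"].reset_index(drop=True)

# Split "41u - 48u" into separate start and end dosage columns
tidy[["start_dosage", "end_dosage"]] = tidy["dosage"].str.split(" - ", expand=True)
tidy = tidy.drop(columns="dosage")
print(tidy)
```

The dosages are still strings with a trailing `u` at this point; converting them to numbers would be a separate cleaning step.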

# Multiple variables stored in a single column

The `contact` column in the `patients` dataframe holds both the phone number and the email of each patient. Sometimes the phone number comes first and other times it comes last. This will be tricky to clean solely with string manipulation.

In [22]:
patients['contact'].value_counts()

contact
johndoe@email.com1234567890                        6
PatrickGersten@rhyta.com402-848-4923               2
304-438-2648SandraCTaylor@dayrep.com               2
JakobCJakobsen@einrot.com+1 (845) 858-7707         2
PavelFilipek@rhyta.com1 952 431 5166               1
                                                  ..
CoralieAllaire@armyspy.com+1 (828) 586-5050        1
ChibuzoOkoli@einrot.com+1 (918) 971-5864           1
EllenRLuman@einrot.com920-849-0384                 1
LeVietThong@gustr.com+1 (612) 208-2965             1
ChidaluOnyekaozulu@jourrapide.com1 360 443 2060    1
Name: count, Length: 483, dtype: int64
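
Regular expressions are one way around the ordering problem, because each pattern can find its piece wherever it sits in the string. The sketch below runs on three `contact` values taken from the outputs above. Both patterns are assumptions fitted to the formats visible here (US-style phone numbers with an optional country code, emails that start and end with a letter) and would need checking against the full column.

```python
import pandas as pd

# Three contact strings copied from the outputs above
contacts = pd.Series([
    "951-719-9170ZoeWellish@superrito.com",
    "PamelaSHill@cuvox.de+1 (217) 569-3204",
    "PavelFilipek@rhyta.com1 952 431 5166",
])

# Assumed phone shape: optional country code, then 3-3-4 digits
# with optional space, dot or hyphen separators
PHONE_PATTERN = r"((?:\+?\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})"

# Assumed email shape: starts with a letter and ends with a letter,
# so digits of an adjacent phone number are not swallowed
EMAIL_PATTERN = r"([A-Za-z][A-Za-z0-9._+-]*@[A-Za-z0-9-]+\.[A-Za-z0-9.-]*[A-Za-z])"

# With a single capture group and expand=False, str.extract returns a Series
phone = contacts.str.extract(PHONE_PATTERN, expand=False)
email = contacts.str.extract(EMAIL_PATTERN, expand=False)

print(pd.DataFrame({"phone_number": phone, "email": email}))
```

Rows where either pattern fails to match come back as `NaN`, which makes it easy to list the contact strings that need a closer look.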