# D207 - Exploratory Data Analysis
***Desiree McElroy***


The dataset I chose is the medical dataset.

## **A1. Research Question**

The research question is whether hospital readmission rates differ by gender. Answering it helps identify which factors readmission depends on. The null hypothesis is that gender and readmission are independent of each other. This first step guides further investigation into relationships between patient demographics and hospital readmissions.

## **A2. Why is this beneficial?**

For stakeholders, this analysis first helps identify potential gender-based disparities in healthcare outcomes. If gender and readmission are associated, that could signal underlying issues such as differences in healthcare access or quality of care among genders, and addressing those disparities is essential for equitable healthcare delivery. Second, the analysis can inform targeted interventions and personalized care. By understanding how gender relates to readmission risk, hospitals can tailor interventions and support services to specific patient groups, which may improve patient outcomes and reduce healthcare costs. Finally, these insights help hospitals direct resources toward the areas of greatest need based on gender-specific readmission patterns. Overall, testing whether readmission is independent of gender supports fairness, healthcare quality, and efficient resource allocation within healthcare systems.

## **A3. Identify relevant data to research question.**


The two features needed for this analysis are the gender and readmission columns. The ReAdmis column (later renamed to readmission) is binary, indicating Yes or No for readmission. The gender column has three categories: female, male, and nonbinary.

In [78]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

In [56]:
# read in data

# columns = ['case_order', 'customer_id', 'interaction', 'unique_id', 'city', 
#            'state', 'county', 'zip', 'latitude', 'longitude', 'population', 'area', 
#            'timezone', 'job', 'children', 'age', 
#            'income', 'marital', 'gender', 'readmission', 'vitd_levels', 'doc_visits', 
#            'full_meals_eaten', 'vitd_supplement', 'soft_drink', 'initial_admin', 
#            'high_blood', 'stroke', 'complication_risk', 'overweight', 'arthritis', 
#            'diabetes', 'hyperlipidemia', 'backpain', 'anxiety', 'allergic_rhinitis',
#            'reflux_esophagitis', 'asthma', 'services_received', 'initial_days', 'total_charge', 
#            'additional_charges', 'item1', 'item2', 'item3', 'item4', 'item5', 'item6', 
#            'item7', 'item8']

df = pd.read_csv('medical_clean.csv')
df.head(3)

Unnamed: 0,CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,...,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
0,1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,...,3726.70286,17939.40342,3,3,2,2,4,3,3,4
1,2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,...,4193.190458,17612.99812,3,4,3,4,4,4,3,3
2,3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,...,2434.234222,17505.19246,2,4,4,4,3,4,3,3


**The following code is reused from my previous performance assessment, D206.**

In [58]:
# run through clean
# lowercase columns
df.columns = map(str.lower, df.columns)
    
# change timezone column entries before changing data type
tz_dict = {
    "America/Puerto_Rico" : "US - Puerto Rico",
    "America/New_York": "US - Eastern",
    "America/Detroit" : "US - Eastern",
    "America/Indiana/Indianapolis" : "US - Eastern",
    "America/Indiana/Vevay" : "US - Eastern",
    "America/Indiana/Vincennes" : "US - Eastern",
    "America/Kentucky/Louisville" : "US - Eastern",
    "America/Toronto" : "US - Eastern",
    "America/Indiana/Marengo" : "US - Eastern",
    "America/Indiana/Winamac" : "US - Eastern",
    "America/Chicago" : "US - Central", 
    "America/Menominee" : "US - Central",
    "America/Indiana/Knox" : "US - Central",
    "America/Indiana/Tell_City" : "US - Central",
    "America/North_Dakota/Beulah" : "US - Central",
    "America/North_Dakota/New_Salem" : "US - Central",
    "America/Denver" : "US - Mountain",
    "America/Boise" : "US - Mountain",
    "America/Phoenix" : "US - Arizona",
    "America/Los_Angeles" : "US - Pacific",
    "America/Nome" : "US - Alaskan",
    "America/Anchorage" : "US - Alaskan",
    "America/Sitka" : "US - Alaskan",
    "America/Yakutat" : "US - Alaskan",
    "America/Adak" : "US - Aleutian",
    "Pacific/Honolulu" : 'US - Hawaiian'
    }
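# assign the result back instead of using inplace=True on a column accessor,
# which may not update df under newer pandas copy-on-write behavior
df['timezone'] = df['timezone'].replace(tz_dict)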

# convert zip column to str, then fill 0s in entries
df.zip = df.zip.astype('str').str.zfill(5)

# changing datatypes
# change columns to boolean data type
# (complication_risk is left out here: it is multi-level rather than Yes/No
#  and is cast to category below)
to_bool = ['readmis', 'soft_drink', 'highblood', 'stroke',
           'overweight', 'arthritis', 'diabetes',
          'hyperlipidemia', 'backpain', 'anxiety', 'allergic_rhinitis',
          'reflux_esophagitis', 'asthma']
# make a copy to remove entries with 1,0 and change str entries to int
to_bool_copy = to_bool.copy()
to_bool_copy.remove('anxiety')
to_bool_copy.remove('overweight')
for col in to_bool_copy:
    df[col] = df[col].replace({'Yes':1, 'No':0}).astype(bool)

for col in to_bool:
    df[col] = df[col].astype('bool')

# round entries in columns to only have two decimal places
round_num = ['vitd_levels', 'totalcharge', 'additional_charges']
for col in round_num:
    df[col] = round(df[col], 2)

# change columns to integer data type
to_int = ['population', 'children', 'age','income',
         'initial_days']
for col in to_int:
    df[col] = df[col].astype('int32')

# change columns to categorical data type
to_cat = ['marital', 'gender', 'initial_admin', 'services',
          'item1', 'item2', 'item3', 'item4', 'item5', 
          'item6', 'item7', 'item8', 'timezone', 'state',
          'complication_risk']
for col in to_cat:
    df[col] = df[col].astype('category')
    
    
    
# make columns more readable  
columns = ['case_order', 'customer_id', 'interaction', 'unique_id', 'city', 
           'state', 'county', 'zip', 'latitude', 'longitude', 'population', 'area', 
           'timezone', 'job', 'children', 'age', 
           'income', 'marital', 'gender', 'readmission', 'vitd_levels', 'doc_visits', 
           'full_meals_eaten', 'vitd_supplement', 'soft_drink', 'initial_admin', 
           'high_blood', 'stroke', 'complication_risk', 'overweight', 'arthritis', 
           'diabetes', 'hyperlipidemia', 'backpain', 'anxiety', 'allergic_rhinitis',
           'reflux_esophagitis', 'asthma', 'services_received', 'initial_days', 'total_charge', 
           'additional_charges', 'item1', 'item2', 'item3', 'item4', 'item5', 'item6', 
           'item7', 'item8']

df.columns = columns
df.set_index('case_order', inplace=True)

In [55]:
df.T

case_order,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
customer_id,C412403,Z919181,F995323,A879973,C544523,S543885,E543302,K477307,Q870521,Z229385,...,M07341,L715446,T523588,Q117805,M583491,B863060,P712040,R778890,E344109,I569847
interaction,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,d2450b70-0337-4406-bdbb-bc1037f1734c,a2057123-abf5-4a2c-abad-8ffe33512562,1dec528d-eb34-4079-adce-0d7a40e82205,5885f56b-d6da-43a3-8760-83583af94266,e3b0a319-9e2e-4a23-8752-2fdc736c30f4,2fccb53e-bd9a-4eaa-a53c-9dfc0cb83f94,ab634508-dd8c-42e5-a4e4-d101a46f2431,67b386eb-1d04-4020-9474-542a09d304e3,5acd5dd3-f0ae-41c7-9540-cf3e4ecb2e27,...,9b73f4cb-3945-41c1-9a38-129fcecde3a0,a5492e46-bf07-4c9e-bd00-e96adba46557,8dfe0df1-bf7b-48d0-83e9-0bc22d86d168,ccc85472-5bd1-4389-8442-122a876b9000,15c2b4bb-2c36-41b2-b1e2-206144fae1dc,a25b594d-0328-486f-a9b9-0567eb0f9723,70711574-f7b1-4a17-b15f-48c54564b70f,1d79569d-8e0f-4180-a207-d67ee4527d26,f5a68e69-2a60-409b-a92f-ac0847b27db0,bc482c02-f8c9-4423-99de-3db5e62a18d5
unique_id,3a83ddb66e2ae73798bdf1d705dc0932,176354c5eef714957d486009feabf195,e19a0fa00aeda885b8a436757e889bc9,cd17d7b6d152cb6f23957346d11c3f07,d2f0425877b10ed6bb381f3e2579424a,03e447146d4a32e1aaf75727c3d1230c,e4884a42ba809df6a89ded6c97f460d4,5f78b8699d1aa9b950b562073f629ca2,e8e016144bfbe14974752d834f530e26,687e7ba1b80022c310fa2d4b00db199a,...,4f83c32e349fa29482f338ed25896f01,441b4934c2fdb97ce81fe317b4150a32,1a35db97b8b90ab318b90b708904d312,9612abd4b9a81c2fd596ec9adb232efa,b9dd180aa8894ecea6af33a46b22e015,39184dc28cc038871912ccc4500049e5,3cd124ccd43147404292e883bf9ec55c,41b770aeee97a5b9e7f69c906a8119d7,2bb491ef5b1beb1fed758cc6885c167a,95663a202338000abdf7e09311c2a8a1
city,Eva,Marianna,Sioux Falls,New Richland,West Point,Braggs,Thompson,Strasburg,Panama City,Paynesville,...,Crosby,Blunt,Columbus,Northvale,Fellsmere,Norlina,Milmay,Southside,Quinn,Coraopolis
state,AL,FL,SD,MN,VA,OK,OH,VA,FL,MN,...,MS,SD,OH,NJ,FL,NC,NJ,TN,SD,PA
county,Morgan,Jackson,Minnehaha,Waseca,King William,Muskogee,Geauga,Shenandoah,Bay,Stearns,...,Wilkinson,Hughes,Franklin,Bergen,Indian River,Warren,Atlantic,Montgomery,Pennington,Allegheny
zip,35621,32446,57110,56072,23181,74423,44086,22641,32404,56362,...,39633,57522,43203,07647,32948,27563,08340,37171,57775,15108
latitude,34.3496,30.84513,43.54321,43.89744,37.59894,35.67302,41.67511,39.08062,30.20097,45.40325,...,31.29102,44.47735,39.9731,41.00669,27.88942,36.42886,39.43609,36.36655,44.10354,40.49998
longitude,-86.72508,-85.22907,-96.63772,-93.51479,-76.88958,-95.1918,-81.05788,-78.3915,-85.5061,-94.71424,...,-91.18493,-99.99679,-82.96898,-73.94259,-80.73347,-78.23716,-74.87302,-87.29988,-102.0159,-80.19959
population,2951,11303,17125,2162,5287,981,2558,479,40029,5840,...,1236,552,8368,5412,7908,4762,1251,532,271,41524


## B1. Describe data analysis: $\chi^2$ Test

${H_0}$: There is no association between Gender and Readmission


${H_a}$: There is an association between Gender and Readmission
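
For reference, the test statistic compares each observed count $O_{ij}$ with the count $E_{ij}$ that would be expected if ${H_0}$ were true. The notation below is the standard textbook form rather than something defined elsewhere in this notebook; $R_i$, $C_j$, and $N$ stand for the row totals, column totals, and grand total of the contingency table.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{df} = (r - 1)(c - 1)
$$

With $r = 3$ gender categories and $c = 2$ readmission outcomes, the test has $(3 - 1)(2 - 1) = 2$ degrees of freedom.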

In [75]:
observed = pd.crosstab(df.gender, df.readmission)
observed

gender,readmission=False,readmission=True
Female,3205,1813
Male,2995,1773
Nonbinary,131,83
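
Because the research question is phrased in terms of readmission rates, it can help to view the same table as row-wise proportions. The sketch below rebuilds the counts shown above by hand (the column labels `not_readmitted` and `readmitted` are illustrative names, not columns in the dataset) and divides each row by its total. On the real data, `pd.crosstab(df.gender, df.readmission, normalize='index')` should give the same proportions.

```python
import pandas as pd

# Counts copied from the crosstab output above.
# Column labels are illustrative stand-ins for readmission == False / True.
counts = pd.DataFrame(
    {'not_readmitted': [3205, 2995, 131],
     'readmitted': [1813, 1773, 83]},
    index=['Female', 'Male', 'Nonbinary'],
)

# Divide every row by its row total to get within-gender proportions
rates = counts.div(counts.sum(axis=1), axis=0)

print(rates.round(4))
```

The readmitted share comes out to roughly 36% for female, 37% for male, and 39% for nonbinary patients, so the groups differ by only a few percentage points.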


## B2. Provide output and calculations

In [92]:
alpha = 0.01
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [93]:
print('Observed\n')
print(observed.values)
print('---\nExpected\n')
# astype(int) truncates the expected counts; this is for display only
print(expected.astype(int))
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed

[[3205 1813]
 [2995 1773]
 [ 131   83]]
---
Expected

[[3176 1841]
 [3018 1749]
 [ 135   78]]
---

chi^2 = 1.5858
p     = 0.4525
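
As a cross-check on `stats.chi2_contingency`, the statistic can be recomputed by hand from the observed counts. This is a minimal sketch, assuming the counts printed above; variable names such as `chi2_manual` are my own, and the closed-form p-value applies only because this table has exactly 2 degrees of freedom.

```python
import math
import numpy as np

# Observed counts copied from the output above
# rows: Female, Male, Nonbinary; columns: readmission False, True
observed = np.array([[3205, 1813],
                     [2995, 1773],
                     [131, 83]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n

# Sum of (observed - expected)^2 / expected over all six cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# With exactly 2 degrees of freedom the chi-square survival function
# reduces to exp(-x / 2), so no SciPy call is needed for this check.
p_manual = math.exp(-chi2_manual / 2)

print(f'chi^2 = {chi2_manual:.4f}')
print(f'dof   = {dof}')
print(f'p     = {p_manual:.4f}')
print(f'smallest expected count = {expected.min():.1f}')
```

These values should agree with the printed `chi^2 = 1.5858` and `p = 0.4525`. The untruncated expected counts also show that every cell is far above 5, the usual rule of thumb for the chi-square approximation.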


In [94]:
if p < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

We fail to reject the null hypothesis


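## B3. Justify why I chose this analysis technique.

I chose the chi-square test of independence because both variables are categorical: gender has three categories and readmission is binary. A t-test or ANOVA compares the means of a continuous variable, so neither fits this question.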

B.  Describe the data analysis by doing the following:

1.  Using one of the following techniques, write code (in either Python or R) to run the analysis of the data set:

    - chi-square

    - t-test

    - ANOVA

2.  Provide the output and the results of any calculations from the analysis you performed.

3.  Justify why you chose this analysis technique.

 

C.  Identify the distribution of two continuous variables and two categorical variables using univariate statistics from your cleaned and prepared data. 

Represent your findings in Part C, visually as part of your submission.
 

Note: To draw a graph or visualization, you may use one or a combination of the following:

- A spreadsheet program, such as Excel (*.xls)

- A graphics program, such as Paint (*.jpeg, *.gif)

- A word-processing program, such as Word (*.rtf) 

- A scanned hand-drawn graph (*.jpeg, *.gif)

 

D.  Identify the distribution of two continuous variables and two categorical variables using bivariate statistics from your cleaned and prepared data.

Represent your findings in Part D, visually as part of your submission.
 

Note: To draw a graph or visualization, you may use one or a combination of the following:

- A spreadsheet program, such as Excel (*.xls)

- A graphics program, such as Paint (*.jpeg, *.gif)

- A word-processing program, such as Word (*.rtf) 

- A scanned hand-drawn graph (*.jpeg, *.gif)

  

E.  Summarize the implications of your data analysis by doing the following:

1.  Discuss the results of the hypothesis test.

2.  Discuss the limitations of your data analysis.

3.  Recommend a course of action based on your results.

### Notes
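- Correlation measures the strength and direction of the linear relationship between two variables
- R squared is the proportion of variance in the dependent variable that is explained by the independent variable(s)
- Choosing a test:
    - CONTINUOUS > T-test compares the means of two groups, ANOVA compares the means of three or more groups.
    - DISCRETE (categorical) > $\chi^2$ test compares observed and expected frequencies rather than means.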


Unnamed: 0,CaseOrder,Customer_id,Interaction,UID,City,State,County,Zip,Lat,Lng,...,TotalCharge,Additional_charges,Item1,Item2,Item3,Item4,Item5,Item6,Item7,Item8
0,1,C412403,8cd49b13-f45a-4b47-a2bd-173ffa932c2f,3a83ddb66e2ae73798bdf1d705dc0932,Eva,AL,Morgan,35621,34.3496,-86.72508,...,3726.70286,17939.40342,3,3,2,2,4,3,3,4
1,2,Z919181,d2450b70-0337-4406-bdbb-bc1037f1734c,176354c5eef714957d486009feabf195,Marianna,FL,Jackson,32446,30.84513,-85.22907,...,4193.190458,17612.99812,3,4,3,4,4,4,3,3
2,3,F995323,a2057123-abf5-4a2c-abad-8ffe33512562,e19a0fa00aeda885b8a436757e889bc9,Sioux Falls,SD,Minnehaha,57110,43.54321,-96.63772,...,2434.234222,17505.19246,2,4,4,4,3,4,3,3
3,4,A879973,1dec528d-eb34-4079-adce-0d7a40e82205,cd17d7b6d152cb6f23957346d11c3f07,New Richland,MN,Waseca,56072,43.89744,-93.51479,...,2127.830423,12993.43735,3,5,5,3,4,5,5,5
4,5,C544523,5885f56b-d6da-43a3-8760-83583af94266,d2f0425877b10ed6bb381f3e2579424a,West Point,VA,King William,23181,37.59894,-76.88958,...,2113.073274,3716.525786,2,1,3,3,5,3,4,3


In [7]:
df.Zip.tail()

9995    27563
9996     8340
9997    37171
9998    57775
9999    15108
Name: Zip, dtype: int64

In [9]:
df.Gender.value_counts()

Female       5018
Male         4768
Nonbinary     214
Name: Gender, dtype: int64

In [5]:
df.dtypes

CaseOrder               int64
Customer_id            object
Interaction            object
UID                    object
City                   object
State                  object
County                 object
Zip                     int64
Lat                   float64
Lng                   float64
Population              int64
Area                   object
TimeZone               object
Job                    object
Children                int64
Age                     int64
Income                float64
Marital                object
Gender                 object
ReAdmis                object
VitD_levels           float64
Doc_visits              int64
Full_meals_eaten        int64
vitD_supp               int64
Soft_drink             object
Initial_admin          object
HighBlood              object
Stroke                 object
Complication_risk      object
Overweight             object
Arthritis              object
Diabetes               object
Hyperlipidemia         object
BackPain  

In [10]:
df.isnull().any().sum()

0

In [11]:
df.duplicated().any()

False