In advance of the first 24-hour virtual Women in Data Science (WiDS) Worldwide Conference on March 8, 2021, we invite you to build a team, hone your data science skills, and join us for the 4th Annual WiDS Datathon focused on social impact. Register now!

The WiDS Datathon 2021 focuses on patient health, with an emphasis on the chronic condition of diabetes, through data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative. Brought to you by the WiDS Worldwide team at Stanford, the West Big Data Innovation Hub, and the WiDS Datathon Committee, this year’s datathon is open until March 1, 2021. Winners will be announced at the WiDS Conference via livestream, reaching a community of 100,000+ data enthusiasts across more than 85 countries.

# Background #

Getting a rapid understanding of the context of a patient’s overall health has been particularly important during the COVID-19 pandemic as healthcare workers around the world struggle with hospitals overloaded by patients in critical condition. Intensive Care Units (ICUs) often lack verified medical histories for incoming patients. A patient in distress or a patient who is brought in confused or unresponsive may not be able to provide information about chronic conditions such as heart disease, injuries, or diabetes. Medical records may take days to transfer, especially for a patient from another medical provider or system.

Knowledge about chronic conditions such as diabetes can inform clinical decisions about patient care and ultimately improve patient outcomes. Learn more about the global scale of diabetes on our Datathon News page, and make sure you are subscribed to the WiDS Datathon Mailing List to receive the latest updates, tutorials, and articles.

# Overview
This year's challenge will focus on models to determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus. Using data from the first 24 hours of intensive care, individuals and teams will explore labeled training data for model development. Participants will then upload predictions for unlabeled data to Kaggle, and these predictions will be used to determine the public leaderboard rankings, as well as the final leaderboard revealed at the close of the competition.
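
Because the data uploaded to Kaggle is unlabeled, a local estimate of model quality has to come from the labeled training data. The sketch below is one hedged way to set that up: it holds out part of the labeled rows for validation. The 80/20 split, the fixed seed, and the synthetic values are assumptions made for illustration and are not part of the competition rules.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the labeled training data (all values are made up).
rng = np.random.default_rng(0)
labeled = pd.DataFrame({
    "encounter_id": np.arange(1000),
    "d1_glucose_max": rng.normal(170, 60, size=1000),
    "diabetes_mellitus": rng.integers(0, 2, size=1000),
})

# Hold out 20% of the labeled rows for local validation (assumed fraction).
train = labeled.sample(frac=0.8, random_state=0)
valid = labeled.drop(train.index)

print(len(train), len(valid))
```

A model would then be fit on `train` and checked on `valid` before predicting on the unlabeled data.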

Data analysis can be completed using your preferred tools. Tutorials, sample code, and other resources will be posted throughout the competition at [widsconference.org/datathon-resources](widsconference.org/datathon-resources) and on the Kaggle Discussion Forum.

The WiDS Datathon 2021 dataset is similar to the WiDS Datathon 2020 dataset, but prior experience is not needed. This year's WiDS Datathon will also feature a Phase 2 hosted by the National Science Foundation Big Data Innovation Hubs, to encourage research papers and further collaboration. Special thanks to Kaggle for supporting the suite of WiDS Datathon cash awards this year, totaling $10,000 USD.

# Who can participate
Everyone is invited to participate, from those new to data science to veterans of the field. For those who have never tried machine learning or worked with health data before, check out the series of tutorials and webinars from the WiDS Datathon Committee, workshop organizers, and community of participants to help you get started.

The WiDS Datathon aims to inspire women worldwide to learn more about data science, and to create a supportive environment for women to connect with others in their community who share their interests. Toward these ends, we open the datathon to individuals or teams of up to 4; at least half of each team must be women (individuals identifying as female participants). Participants can include students, faculty, and individuals with various roles in non-profit, academic, government, and industry organizations.

# Acknowledgements
The WiDS Datathon 2021 is a collaboration led by the WiDS Worldwide team at Stanford University, the West Big Data Innovation Hub, and the WiDS Datathon Committee. WiDS Datathon 2021 cash prizes are provided by Kaggle and the Excellence in Research Award is supported by the National Science Foundation under Grants 1916573, 1916481, and 1915774, as part of a network of Big Data Innovation Hubs. Special thanks to the MIT GOSSIS Initiative and the Harvard Data Privacy Lab.

# Summary of the goal

In [1]:
# Find out whether a patient admitted to an ICU has Diabetes Mellitus

import pandas as pd
import numpy as np

In [2]:
# Reading in the dataset

data = pd.read_csv('TrainingDataWiDS2021.csv')

In [3]:
data.head()

,Unnamed: 0,encounter_id,hospital_id,age,bmi,elective_surgery,ethnicity,gender,height,hospital_admit_source,...,h1_pao2fio2ratio_max,h1_pao2fio2ratio_min,aids,cirrhosis,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,diabetes_mellitus
0,1,214826,118,68.0,22.732803,0,Caucasian,M,180.3,Floor,...,,,0,0,0,0,0,0,0,1
1,2,246060,81,77.0,27.421875,0,Caucasian,F,160.0,Floor,...,51.0,51.0,0,0,0,0,0,0,0,1
2,3,276985,118,25.0,31.952749,0,Caucasian,F,172.7,Emergency Department,...,,,0,0,0,0,0,0,0,0
3,4,262220,118,81.0,22.635548,1,Caucasian,F,165.1,Operating Room,...,337.0,337.0,0,0,0,0,0,0,0,0
4,5,201746,33,19.0,,0,Caucasian,M,188.0,,...,,,0,0,0,0,0,0,0,0


In [4]:
data.shape

(130157, 181)

In [5]:
data.describe()

,Unnamed: 0,encounter_id,hospital_id,age,bmi,elective_surgery,height,icu_id,pre_icu_los_days,readmission_status,...,h1_pao2fio2ratio_max,h1_pao2fio2ratio_min,aids,cirrhosis,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,diabetes_mellitus
count,130157.0,130157.0,130157.0,125169.0,125667.0,130157.0,128080.0,130157.0,130157.0,130157.0,...,16760.0,16760.0,130157.0,130157.0,130157.0,130157.0,130157.0,130157.0,130157.0,130157.0
mean,65079.0,213000.856519,106.102131,61.995103,29.11026,0.18984,169.607219,662.428344,0.839933,0.0,...,247.525419,239.617358,0.00103,0.016081,0.013599,0.025669,0.007307,0.004187,0.020852,0.216285
std,37573.233831,38109.828146,63.482277,16.82288,8.262776,0.392176,10.833085,304.259843,2.485337,0.0,...,131.440167,128.562211,0.03207,0.125786,0.115819,0.158146,0.085166,0.064574,0.142888,0.411712
min,1.0,147000.0,1.0,0.0,14.844926,0.0,137.2,82.0,-0.25,0.0,...,42.0,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,32540.0,180001.0,49.0,52.0,23.598006,0.0,162.5,427.0,0.045833,0.0,...,144.0,138.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,65079.0,213014.0,112.0,64.0,27.564749,0.0,170.1,653.0,0.155556,0.0,...,228.125,218.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,97618.0,246002.0,165.0,75.0,32.803127,0.0,177.8,969.0,0.423611,0.0,...,333.0,324.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,130157.0,279000.0,204.0,89.0,67.81499,1.0,195.59,1111.0,175.627778,0.0,...,720.0,654.813793,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130157 entries, 0 to 130156
Columns: 181 entries, Unnamed: 0 to diabetes_mellitus
dtypes: float64(157), int64(18), object(6)
memory usage: 179.7+ MB


In [7]:
data.columns

Index(['Unnamed: 0', 'encounter_id', 'hospital_id', 'age', 'bmi',
       'elective_surgery', 'ethnicity', 'gender', 'height',
       'hospital_admit_source',
       ...
       'h1_pao2fio2ratio_max', 'h1_pao2fio2ratio_min', 'aids', 'cirrhosis',
       'hepatic_failure', 'immunosuppression', 'leukemia', 'lymphoma',
       'solid_tumor_with_metastasis', 'diabetes_mellitus'],
      dtype='object', length=181)

In [8]:
data.columns.value_counts()

sodium_apache            1
apache_2_diagnosis       1
ventilated_apache        1
diabetes_mellitus        1
d1_sysbp_invasive_min    1
                        ..
resprate_apache          1
h1_pao2fio2ratio_min     1
h1_bun_max               1
h1_potassium_min         1
wbc_apache               1
Length: 181, dtype: int64
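
Since `value_counts()` sorts by count in descending order and the first entry is 1, every column name appears exactly once. A more direct check for duplicate column names, sketched here on a toy frame with a deliberately repeated name (made-up data), is `Index.is_unique` together with `Index.duplicated()`.

```python
import pandas as pd

# Toy frame with a deliberately duplicated column name (made-up data).
toy = pd.DataFrame([[1, 2, 3]], columns=["age", "bmi", "age"])

# True only when every column name occurs once.
print(toy.columns.is_unique)

# Names that occur more than once (each repeat after the first is flagged).
duplicates = toy.columns[toy.columns.duplicated()].tolist()
print(duplicates)
```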

In [9]:
df = data.drop(['Unnamed: 0'], axis=1)
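
An `Unnamed: 0` column usually appears when a CSV was written with its index and then read back without declaring it. Assuming that is what happened here, passing `index_col=0` to `pd.read_csv` would avoid the extra column instead of dropping it afterwards. The sketch below demonstrates this on a tiny in-memory CSV with made-up rows, not on the competition file.

```python
import io
import pandas as pd

# Tiny CSV (made-up rows) shaped like the default output of DataFrame.to_csv():
# the first, unnamed column holds the saved index.
csv_text = ",encounter_id,age\n1,214826,68.0\n2,246060,77.0\n"

# Read without declaring the index: pandas names the blank header 'Unnamed: 0'.
as_read = pd.read_csv(io.StringIO(csv_text))
print(as_read.columns.tolist())

# Read with the first column as the index: no extra column to drop.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```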

In [10]:
df.head()

,encounter_id,hospital_id,age,bmi,elective_surgery,ethnicity,gender,height,hospital_admit_source,icu_admit_source,...,h1_pao2fio2ratio_max,h1_pao2fio2ratio_min,aids,cirrhosis,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,diabetes_mellitus
0,214826,118,68.0,22.732803,0,Caucasian,M,180.3,Floor,Floor,...,,,0,0,0,0,0,0,0,1
1,246060,81,77.0,27.421875,0,Caucasian,F,160.0,Floor,Floor,...,51.0,51.0,0,0,0,0,0,0,0,1
2,276985,118,25.0,31.952749,0,Caucasian,F,172.7,Emergency Department,Accident & Emergency,...,,,0,0,0,0,0,0,0,0
3,262220,118,81.0,22.635548,1,Caucasian,F,165.1,Operating Room,Operating Room / Recovery,...,337.0,337.0,0,0,0,0,0,0,0,0
4,201746,33,19.0,,0,Caucasian,M,188.0,,Accident & Emergency,...,,,0,0,0,0,0,0,0,0


In [11]:
df.dtypes

encounter_id                     int64
hospital_id                      int64
age                            float64
bmi                            float64
elective_surgery                 int64
                                ...   
immunosuppression                int64
leukemia                         int64
lymphoma                         int64
solid_tumor_with_metastasis      int64
diabetes_mellitus                int64
Length: 180, dtype: object

In [12]:
df.describe()

,encounter_id,hospital_id,age,bmi,elective_surgery,height,icu_id,pre_icu_los_days,readmission_status,weight,...,h1_pao2fio2ratio_max,h1_pao2fio2ratio_min,aids,cirrhosis,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,diabetes_mellitus
count,130157.0,130157.0,125169.0,125667.0,130157.0,128080.0,130157.0,130157.0,130157.0,126694.0,...,16760.0,16760.0,130157.0,130157.0,130157.0,130157.0,130157.0,130157.0,130157.0,130157.0
mean,213000.856519,106.102131,61.995103,29.11026,0.18984,169.607219,662.428344,0.839933,0.0,83.791104,...,247.525419,239.617358,0.00103,0.016081,0.013599,0.025669,0.007307,0.004187,0.020852,0.216285
std,38109.828146,63.482277,16.82288,8.262776,0.392176,10.833085,304.259843,2.485337,0.0,24.963063,...,131.440167,128.562211,0.03207,0.125786,0.115819,0.158146,0.085166,0.064574,0.142888,0.411712
min,147000.0,1.0,0.0,14.844926,0.0,137.2,82.0,-0.25,0.0,38.6,...,42.0,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,180001.0,49.0,52.0,23.598006,0.0,162.5,427.0,0.045833,0.0,66.5,...,144.0,138.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,213014.0,112.0,64.0,27.564749,0.0,170.1,653.0,0.155556,0.0,80.0,...,228.125,218.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,246002.0,165.0,75.0,32.803127,0.0,177.8,969.0,0.423611,0.0,96.8,...,333.0,324.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,279000.0,204.0,89.0,67.81499,1.0,195.59,1111.0,175.627778,0.0,186.0,...,720.0,654.813793,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
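
The mean of the 0/1 target `diabetes_mellitus` is 0.216285, which means about 21.6% of encounters are labeled positive, so the classes are imbalanced. One way to read the class shares directly is `value_counts(normalize=True)`; the sketch below uses a made-up ten-row target column rather than the real data.

```python
import pandas as pd

# Made-up target column: 2 positives out of 10 rows.
toy = pd.DataFrame({"diabetes_mellitus": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})

# Share of each class; for a 0/1 target the share of 1s equals the column mean.
balance = toy["diabetes_mellitus"].value_counts(normalize=True)
print(balance)
```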


In [20]:
print(df.columns.tolist())

['encounter_id', 'hospital_id', 'age', 'bmi', 'elective_surgery', 'ethnicity', 'gender', 'height', 'hospital_admit_source', 'icu_admit_source', 'icu_id', 'icu_stay_type', 'icu_type', 'pre_icu_los_days', 'readmission_status', 'weight', 'albumin_apache', 'apache_2_diagnosis', 'apache_3j_diagnosis', 'apache_post_operative', 'arf_apache', 'bilirubin_apache', 'bun_apache', 'creatinine_apache', 'fio2_apache', 'gcs_eyes_apache', 'gcs_motor_apache', 'gcs_unable_apache', 'gcs_verbal_apache', 'glucose_apache', 'heart_rate_apache', 'hematocrit_apache', 'intubated_apache', 'map_apache', 'paco2_apache', 'paco2_for_ph_apache', 'pao2_apache', 'ph_apache', 'resprate_apache', 'sodium_apache', 'temp_apache', 'urineoutput_apache', 'ventilated_apache', 'wbc_apache', 'd1_diasbp_invasive_max', 'd1_diasbp_invasive_min', 'd1_diasbp_max', 'd1_diasbp_min', 'd1_diasbp_noninvasive_max', 'd1_diasbp_noninvasive_min', 'd1_heartrate_max', 'd1_heartrate_min', 'd1_mbp_invasive_max', 'd1_mbp_invasive_min', 'd1_mbp_max',

In [None]:
""" Possible columns needed

age, bmi, height, gender, weight, d1_heartrate_max, d1_heartrate_min, d1_glucose_max, d1_glucose_min, haemoglobin, 
d1_hemaglobin_max, d1_hemaglobin_min, h1_glucose_max, h1_glucose_min, h1_hemaglobin_max, aids, cirrhosis, hepatic_failure, 
immunosuppression, leukemia, lymphoma, solid_tumor_with_metastasis, diabetes_mellitus

"""
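
A shortlist like the one above can be turned into a working subset by keeping only the names that actually exist in the DataFrame, which also surfaces any name that does not match a column. This is a minimal sketch on a made-up one-row frame; `candidates` is a shortened, illustrative version of the note, and `toy` stands in for `df`.

```python
import pandas as pd

# Shortened, illustrative version of the candidate list from the note above.
candidates = ["age", "bmi", "haemoglobin", "d1_glucose_max", "diabetes_mellitus"]

# Made-up one-row frame standing in for `df`.
toy = pd.DataFrame({
    "hospital_id": [118],
    "age": [68.0],
    "bmi": [22.7],
    "d1_glucose_max": [168.0],
    "diabetes_mellitus": [1],
})

# Split the candidates into names that exist and names that do not.
present = [c for c in candidates if c in toy.columns]
absent = [c for c in candidates if c not in toy.columns]

subset = toy[present]
print(subset.columns.tolist())
print(absent)
```

Checking `absent` before modeling catches spelling mismatches between a note and the real column names.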