# Approach

Notes:
WHO Diabetes Fact Sheet
* Diabetes is a chronic disease that occurs either when the pancreas **does not produce enough insulin or when the body cannot effectively use the insulin it produces**. Insulin is a hormone that regulates blood sugar. **Hyperglycaemia, or raised blood sugar**, is a common effect of uncontrolled diabetes and over time leads to serious damage to many of the body's systems, especially the nerves and blood vessels. 
* There are two types of diabetes: Type 1 and Type 2. Type 2 diabetes is more common and often results from excess body weight and physical inactivity, while Type 1 diabetes is independent of body size. Additionally, there is gestational diabetes, in which a woman without diabetes develops high blood sugar levels during pregnancy. Gestational diabetes usually resolves after birth, while the other two types require long-term treatment.
* Around 8.5% of the adult population is diagnosed with diabetes, independent of gender.
* Diabetes mellitus is characterized by high blood sugar levels over a prolonged period of time and is diagnosed by demonstrating any one of the following:
        * Fasting plasma glucose level ≥ 7.0 mmol/L (126 mg/dL)
        * Plasma glucose ≥ 11.1 mmol/L (200 mg/dL) two hours after a 75 gram oral glucose load as in a glucose tolerance test
        * Symptoms of high blood sugar and casual plasma glucose ≥ 11.1 mmol/L (200 mg/dL)
        * Glycated hemoglobin (HbA1C) ≥ 48 mmol/mol (≥ 6.5 DCCT %)
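
As a quick sanity check on the units in the criteria above, the sketch below converts the mmol/L glucose thresholds to mg/dL. The conversion factor follows from the molar mass of glucose (about 180.16 g/mol). The helper names `mmol_to_mgdl` and `meets_glucose_criterion` are hypothetical and only cover the two plain glucose thresholds, not the symptom-based or HbA1C criteria.

```python
# 1 mmol/L of glucose corresponds to about 18.016 mg/dL (molar mass ~180.16 g/mol).
MGDL_PER_MMOLL = 18.016


def mmol_to_mgdl(mmol_per_l: float) -> float:
    """Convert a glucose concentration from mmol/L to mg/dL."""
    return mmol_per_l * MGDL_PER_MMOLL


def meets_glucose_criterion(fasting_mgdl=None, ogtt_2h_mgdl=None) -> bool:
    """Hypothetical helper: True if the fasting or the 2-hour threshold is met."""
    fasting_hit = fasting_mgdl is not None and fasting_mgdl >= 126
    ogtt_hit = ogtt_2h_mgdl is not None and ogtt_2h_mgdl >= 200
    return fasting_hit or ogtt_hit


# The rounded conversions reproduce the mg/dL values quoted in the list.
fasting_threshold = round(mmol_to_mgdl(7.0))
ogtt_threshold = round(mmol_to_mgdl(11.1))
print(fasting_threshold, ogtt_threshold)
```

The same factor could be used later to check whether glucose columns in the data are plausibly recorded in mg/dL.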
        
Commonly known risk factors/comorbidities for developing diabetes


# Data cleaning and preparation (possible steps, to be verified what is really needed)

### Exploratory Data Analysis (EDA)
- Target correlates to BMI
- Glucose-related features are among the most important features for diabetes detection
- Target has minor correlation to blood pressure category (could indicate level of fitness)
- Check if target is unbalanced
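
The EDA checks above could be sketched roughly as follows. The frame is a toy stand-in for the training data, and the target column name `diabetes_mellitus` is an assumption that is not stated in these notes.

```python
import pandas as pd

# Toy stand-in for the training data; "diabetes_mellitus" as target name is an assumption.
df = pd.DataFrame({
    "bmi": [22.0, 31.5, 27.0, 35.2, 24.1, 29.8],
    "d1_glucose_max": [110, 240, 150, 300, 120, 135],
    "diabetes_mellitus": [0, 1, 0, 1, 0, 0],
})

# Share of each class: a strongly skewed split indicates an unbalanced target.
class_share = df["diabetes_mellitus"].value_counts(normalize=True)

# Pearson correlation of every feature with the target, strongest first.
features = df.drop(columns="diabetes_mellitus")
target_corr = features.corrwith(df["diabetes_mellitus"]).sort_values(ascending=False)

print(class_share)
print(target_corr)
```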

### Data Cleaning
- Drop identifiers (encounter_id, hospital_id, icu_id)
- ~Drop measurements taken in the first hour (h1...min/max) because high ratio of missing values and redundant with d1_...min/max columns~
- ~Drop duplicates in blood pressure columns (invasive/noninvasive)~
- Drop readmission status (readmission_status) because only one unique value
- Drop 30 rows in training data with implausible age (0)
- ~Drop paco2_for_ph_apache because it is identical to paco2_apache.~
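
A minimal sketch of the cleaning steps that are not struck out, run on a toy frame. The column names are the ones listed above. Detecting constant columns with `nunique` is one possible way to find `readmission_status`, not necessarily how it was found here.

```python
import pandas as pd

# Toy frame with the columns named in the list above.
df = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "hospital_id": [10, 10, 11, 12],
    "icu_id": [100, 101, 102, 103],
    "readmission_status": [0, 0, 0, 0],
    "age": [54.0, 0.0, 71.0, None],
    "bmi": [24.0, 30.1, 27.5, 22.2],
})

id_cols = ["encounter_id", "hospital_id", "icu_id"]

# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

cleaned = df.drop(columns=id_cols + constant_cols)

# Remove rows with an implausible age of 0; rows with a missing age are kept.
cleaned = cleaned[cleaned["age"] != 0].reset_index(drop=True)
print(cleaned)
```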

### Handling Missing Values
- Merge ...apache into d1_...max and then drop the ...apache column for albumin, bilirubin, bun, creatinine, glucose, hematocrit, sodium, wbc, and heart_rate
- Fill missing weight and height with the mean for the patient's gender
- ~Fill the missing values of the Oxygenation index d1_pao2fio2ratio_max with pao2_apache/fio2_apache.~
- Fill the missing values for gender with 'Unknown'
- Fill the missing values for ethnicity with the already available value 'Other/Unknown'.
- Use MICE to replace missing values in numerical columns
- Use SingleImputer to replace missing values in categorical variables, strategy "most frequent"
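
The steps above could look roughly like the sketch below, under several assumptions. "Merge" is read as filling gaps in `d1_..._max` from the matching `..._apache` value. scikit-learn's `IterativeImputer` stands in for MICE, and `SimpleImputer(strategy="most_frequent")` stands in for the `SingleImputer` mentioned above. Gender is filled before the group means are computed so that no rows are lost in the `groupby`. The `icu_type` column and all values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

df = pd.DataFrame({
    "gender": ["F", "M", np.nan, "F", "M", "F"],
    "ethnicity": ["Caucasian", np.nan, "Asian", "Caucasian", "Other/Unknown", "Caucasian"],
    "icu_type": ["MICU", "MICU", np.nan, "CCU", "MICU", "MICU"],
    "height": [165.0, 180.0, 170.0, np.nan, 178.0, 160.0],
    "weight": [60.0, np.nan, 75.0, 70.0, 90.0, 65.0],
    "glucose_apache": [150.0, 200.0, np.nan, 180.0, 130.0, 110.0],
    "d1_glucose_max": [np.nan, 210.0, 140.0, np.nan, 135.0, np.nan],
    "d1_creatinine_max": [1.0, np.nan, 0.8, 1.5, np.nan, 0.9],
})

# 1) Assumed meaning of "merge": fill gaps in d1_..._max from the apache value.
df["d1_glucose_max"] = df["d1_glucose_max"].fillna(df["glucose_apache"])
df = df.drop(columns="glucose_apache")

# 2) Fixed fill values for gender and ethnicity.
df["gender"] = df["gender"].fillna("Unknown")
df["ethnicity"] = df["ethnicity"].fillna("Other/Unknown")

# 3) Height and weight: mean within each gender group.
for col in ["height", "weight"]:
    df[col] = df[col].fillna(df.groupby("gender")[col].transform("mean"))

# 4) Remaining numerical gaps: MICE-style iterative imputation.
num_cols = ["height", "weight", "d1_glucose_max", "d1_creatinine_max"]
df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])

# 5) Remaining categorical gaps: most frequent value per column.
cat_cols = ["gender", "ethnicity", "icu_type"]
cat_values = df[cat_cols].to_numpy(dtype=object)
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(cat_values)

print(df)
```

When this is applied to the unlabeled data, the imputers would normally be fitted on the training data only and then reused via `transform`.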


### Check Correlations to avoid multicollinearity
- Remove columns in which more than 80% of the values are missing, and columns that have low correlation with the target variable.
- Remove columns that are visibly correlated with other columns to avoid multicollinearity.
- Drop the BMI column because it can be derived from weight and height and would add multicollinearity.
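
One way to implement the first two checks is sketched below on synthetic data. The 0.95 correlation cut-off is an assumption chosen for illustration, and the column names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
weight = rng.normal(80, 15, n)

df = pd.DataFrame({
    "weight": weight,
    "weight_copy": weight * 1.01 + rng.normal(0, 0.1, n),  # near-duplicate of weight
    "height": rng.normal(170, 10, n),
    "mostly_missing": np.where(np.arange(n) < 180, np.nan, 1.0),  # 90% missing
})

# 1) Drop columns where more than 80% of the values are missing.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.8].index)

# 2) For each highly correlated pair, drop the later column.
#    Only the upper triangle is inspected so each pair is considered once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]

reduced = df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```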

### Feature Engineering
- BMI Categories (underweight, healthy weight, overweight, obese)
- ~Range and Mean from d1 ... min/max~
- ~Blood pressure categories from systolic and diastolic blood pressure~
- One Hot Encode Categorical columns
- Groups of tests that were taken
- Flag whether an h1 measurement was present (presence only, not the value)
- BMI AND ethnicity
- hospital_admit_source, icu_admit_source: emergency yes or no 
- difference in glucose max d1, h1
- scores from mdscore
- has high risk diabetes comorbidity
- glucose & hemoglobin
- Flag when values are missing
- BMI classification
- labs + vitals: range, mean
- Blood pressure
- Glasgow Coma Scale score
- Incorrect lab values
- Are all units the same?
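
A few of the ideas above can be sketched as follows. The BMI cut-offs (18.5, 25, 30) are the commonly used adult thresholds, and all derived column names are hypothetical. With only min and max available per day, the "mean" is the midpoint of the two, not a true average of all measurements.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bmi": [17.0, 22.5, 27.3, 34.0],
    "ethnicity": ["Asian", "Caucasian", "Hispanic", "Caucasian"],
    "d1_glucose_min": [80.0, 90.0, 110.0, 150.0],
    "d1_glucose_max": [120.0, 140.0, 210.0, 320.0],
    "h1_glucose_max": [np.nan, 130.0, np.nan, 300.0],
})

# BMI categories; right=False makes each bin include its lower edge.
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, np.inf],
    right=False,
    labels=["underweight", "healthy weight", "overweight", "obese"],
)

# Range and midpoint from the daily min/max.
df["d1_glucose_range"] = df["d1_glucose_max"] - df["d1_glucose_min"]
df["d1_glucose_mean"] = (df["d1_glucose_max"] + df["d1_glucose_min"]) / 2

# Presence flag: was an h1 measurement taken at all?
df["h1_glucose_max_measured"] = df["h1_glucose_max"].notna().astype(int)

# Difference between the daily and first-hour maximum (NaN if h1 is missing).
df["glucose_max_d1_minus_h1"] = df["d1_glucose_max"] - df["h1_glucose_max"]

# One-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=["ethnicity", "bmi_category"])
print(encoded.columns.tolist())
```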

### Outlier removal

### Apply all preprocessing steps for unlabeled data
- Align dataset df with unlabel test data to make sure they have the same columns.
- Compare feature distributions between train and test
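
A minimal sketch of the alignment step, assuming the mismatch comes from one-hot columns that appear in only one of the two sets. Filling with 0 is only appropriate for such indicator columns. The target name `diabetes_mellitus` and the toy values are assumptions.

```python
import pandas as pd

train = pd.DataFrame({
    "age": [50, 60],
    "ethnicity_Asian": [1, 0],
    "ethnicity_Hispanic": [0, 1],
    "diabetes_mellitus": [0, 1],  # assumed target column name
})
test = pd.DataFrame({
    "age": [45],
    "ethnicity_Asian": [1],
    "ethnicity_Native American": [0],
})

feature_cols = [c for c in train.columns if c != "diabetes_mellitus"]

# Keep exactly the training features, in training order; columns that are
# missing in the test data are created and filled with 0.
test_aligned = test.reindex(columns=feature_cols, fill_value=0)

# Side-by-side means as a first, rough look at train/test distribution shift.
comparison = pd.DataFrame({
    "train_mean": train[feature_cols].mean(),
    "test_mean": test_aligned.mean(),
})
print(comparison)
```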


# Modeling
- train test split
- Performing Random Undersampling and then Oversampling to rebalance the dataset
- try different models, use cross validation
- make pipeline to compare parameters and use Gridsearch
- Try: Random Forest, LGBM, xgboost, deep learning 
- Feature selection 
- Build an ensemble
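
The plan above could start from a skeleton like the one below. It runs on synthetic data from `make_classification`, uses a Random Forest as the only model, and scores with ROC AUC. All of these are assumptions made for illustration. Resampling is done with `sklearn.utils.resample` on the training split only, and the sampling ratios are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, unbalanced stand-in data (about 80% class 0, 20% class 1).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random undersampling of the majority class, then random oversampling of the
# minority class until both classes have the same size.
majority = np.flatnonzero(y_train == 0)
minority = np.flatnonzero(y_train == 1)
majority_down = resample(majority, replace=False, n_samples=2 * len(minority), random_state=0)
minority_up = resample(minority, replace=True, n_samples=len(majority_down), random_state=0)
idx = np.concatenate([majority_down, minority_up])
X_bal, y_bal = X_train[idx], y_train[idx]

# Pipeline plus grid search with cross-validation. Scaling does not matter for
# tree models but keeps the pipeline reusable for scale-sensitive ones.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])
param_grid = {"model__n_estimators": [50, 100], "model__max_depth": [3, None]}
grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
grid.fit(X_bal, y_bal)

test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(grid.best_params_, round(test_auc, 3))
```

Because oversampling happens before cross-validation here, duplicated minority rows can land in both the training and validation folds, so the cross-validation scores are likely optimistic. The untouched test split gives the cleaner estimate.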


Step 1) build pipeline for testing

Step 2) feature engineering & co


# Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Import data

In [3]:
# non-processed data
train = pd.read_csv('/kaggle/input/widsdatathon2021/TrainingWiDS2021.csv')
test = pd.read_csv('/kaggle/input/widsdatathon2021/UnlabeledWiDS2021.csv')
dictionary = pd.read_csv('/kaggle/input/widsdatathon2021/DataDictionaryWiDS2021.csv')
data_types = dictionary[['Category', 'Variable Name','Data Type']].set_index('Variable Name')
data_types.head()