# Hand Selection
This notebook details the most appropriate features of this dataset, according to different scientific sources.

In [None]:
import pandas as pd

kidney_disease = pd.read_csv(r'Data/original_dataset.csv')
kidney_disease.head()

## Glomerular Filtration Rate (GFR) estimation
A patient is considered to have chronic kidney disease if their GFR (expressed in mL/min/1.73 m<sup>2</sup>) is below 60. GFR is defined as the sum of the filtration rates of all of the patient's functioning nephrons (the filtering units that make up the kidneys). When the dataset does not contain this value, an estimated GFR (eGFR) can be computed with the CKD-EPI equation (source: [A new equation to estimate glomerular filtration rate. Ann Intern Med 2009.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2763564/)). The formula is the following:  

**eGFR = 141 × min(Scr/κ, 1)<sup>α</sup> × max(Scr/κ, 1)<sup>-1.209</sup> × 0.993<sup>Age</sup> × 1.018 [if female] × 1.159 [if black]**  
- α: -0.329 for females and -0.411 for males
- κ: 0.7 for females and 0.9 for males
- Scr: serum creatinine  
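
As a quick sanity check of the formula above, here is a minimal sketch that evaluates it for a single made-up patient. The function name `egfr_ckd_epi_2009` and the patient values are illustrative assumptions, not part of the dataset; serum creatinine is assumed to be expressed in mg/dL, the unit used by this equation.

```python
def egfr_ckd_epi_2009(scr, age, female, black):
    """eGFR in mL/min/1.73 m^2 from the formula above (scr assumed in mg/dL)."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr / kappa
    value = 141 * min(ratio, 1) ** alpha * max(ratio, 1) ** -1.209 * 0.993 ** age
    if female:
        value *= 1.018
    if black:
        value *= 1.159
    return value

# Hypothetical patient: 50-year-old woman, not black, Scr = 1.0 mg/dL
example = egfr_ckd_epi_2009(scr=1.0, age=50, female=True, black=False)
print(round(example, 1))
```

For this hypothetical patient the estimate lands a little above the threshold of 60, so she would not be flagged as sick by this criterion alone.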

This estimate depends on four factors: serum creatinine, age, gender and ethnicity. Our dataset provides serum creatinine and age; gender and ethnicity are the two missing factors. Nonetheless, we can still use this estimate as a prediction model.  

To do so, for each patient, we compute the eGFR for every possible combination of the missing factors. If all of these estimates agree on whether the patient is sick, we have a prediction; otherwise the record is counted as unsure.
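
The agreement rule described above can be sketched on its own, independently of the formula. The helper name `consensus` and the eGFR values below are made up for illustration; only the threshold of 60 comes from the text.

```python
def consensus(egfr_values, threshold=60):
    """Return 'ckd', 'notckd' or 'unsure' from a list of candidate eGFR values."""
    below = [value < threshold for value in egfr_values]
    if all(below):
        return 'ckd'       # every estimate says the patient is sick
    if not any(below):
        return 'notckd'    # every estimate says the patient is healthy
    return 'unsure'        # the estimates disagree

# Made-up eGFR values for the four gender/ethnicity combinations
print(consensus([35.2, 38.0, 40.8, 44.1]))    # all below 60
print(consensus([92.5, 95.1, 107.2, 110.3]))  # all at or above 60
print(consensus([52.4, 58.9, 60.7, 68.3]))    # straddles 60
```

Only patients whose estimates straddle the threshold end up unsure, which is why the final accuracy can only be reported as a range.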

In [None]:
# sex: True for female, False for male
# ethnicity: True if the person is black, False otherwise
def gfr(sex: bool, ethnicity: bool, serum_creatinine: float, age: int) -> float:
    """Estimate the GFR (mL/min/1.73 m^2) with the formula above."""
    # A bool indexes the lists below as 0 (False) or 1 (True)
    k = [0.9, 0.7][sex]
    alpha = [-0.411, -0.329][sex]
    sex_rate = [1, 1.018][sex]
    ethnicity_rate = [1, 1.159][ethnicity]
    ratio = serum_creatinine / k
    return (141 * pow(min(ratio, 1), alpha) * pow(max(ratio, 1), -1.209)
            * pow(0.993, age) * sex_rate * ethnicity_rate)

if 'serum_creatinine' in kidney_disease.columns:
    kidney_disease = kidney_disease.rename(columns={'serum_creatinine': 'sc'})


calc_dict = kidney_disease[['age', 'sc', 'classification']].dropna().to_dict(orient='records')
correct_pred = 0
wrong_pred = 0
unsure = 0
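for pat in calc_dict:
    # eGFR below 60 for each (sex, ethnicity) combination; sex=True means female
    sick_fa = gfr(True, False, pat['sc'], pat['age']) < 60   # female, not black
    sick_ma = gfr(False, False, pat['sc'], pat['age']) < 60  # male, not black
    sick_fb = gfr(True, True, pat['sc'], pat['age']) < 60    # female, black
    sick_mb = gfr(False, True, pat['sc'], pat['age']) < 60   # male, black
    sick_all = [sick_fa, sick_ma, sick_fb, sick_mb]
    if True not in sick_all:
        # Every estimate is at or above 60: predicted healthy
        if pat['classification'] == 'notckd': correct_pred += 1
        else: wrong_pred += 1
    elif False not in sick_all:
        # Every estimate is below 60: predicted sick ('ckd\t' is a raw label with a trailing tab)
        if pat['classification'] in ['ckd', 'ckd\t']: correct_pred += 1
        else: wrong_pred += 1
    else:
        # The estimates disagree
        unsure += 1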

total_records = len(calc_dict)
print('Total non-null records :', total_records)
print('Correct predictions :', correct_pred)
print('Wrong predictions :', wrong_pred)
print('Unsure records :', unsure)
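# Lower bound: every unsure record counted as wrong; upper bound: every unsure record counted as correct
print('{:.3f} < Accuracy < {:.3f}'.format(correct_pred/total_records, (correct_pred+unsure)/total_records))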