<div align="center">
<font size="4"> Context  </font>  
</div> 

Reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress. The UN expects countries to end preventable deaths of newborns and children under 5 years of age by 2030, with all countries aiming to reduce under‑5 mortality to at least as low as 25 per 1,000 live births.

Parallel to the notion of child mortality is, of course, maternal mortality, which accounted for 295,000 deaths during and following pregnancy and childbirth in 2017. The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.

In light of the above, cardiotocograms (CTGs) are a simple and affordable way to assess fetal health, allowing healthcare professionals to take action to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading their response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more.

<div align="center">
<font size="4"> Data  </font>  
</div> 

This dataset contains **2126 records** of features extracted from cardiotocogram exams, which were then classified by three expert obstetricians into **3 classes**:

- Normal
- Suspect
- Pathological

The dataset is available [here](https://www.kaggle.com/andrewmvd/fetal-health-classification).

<h2 style=color:Teal align="left"> Table of Contents </h2>

### 1 Import packages
#### 1.1 Kaggle and other imports
#### 1.2 Visualization imports
#### 1.3 Import Scikit-Learn
#### 1.4 Import LightGBM
### 2 Configs
### 3 Dataset
#### 3.1 Class imbalance
#### 3.2 Data scaling
#### 3.3 Data split
#### 3.4 Class weights
#### 3.5 Validation set
### 4 Model
#### 4.1 Build model
#### 4.2 Fit model
#### 4.3 Feature importance
### 5 Grid search
#### 5.1 Grid parameters
#### 5.2 Fit model
#### 5.3 Save optimum parameters
#### 5.4 Fit tuned model
#### 5.5 Feature importance for tuned model


<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 1 Import packages </h1>

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 1.1 Kaggle and other imports </h1>

In [None]:
import os
import gc
import time
import tqdm
import pickle
import random

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 1.2 Visualization imports </h1>

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerBase
from matplotlib.text import Text

import warnings
warnings.filterwarnings('ignore')

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 1.3 Import Scikit-Learn </h1>

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.utils import class_weight
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 1.4 Import LightGBM </h1>

In [None]:
import lightgbm as lgb

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 2 Configs </h1>

In [None]:
SEED = 123
random.seed(SEED)
np.random.seed(SEED)  # also seed NumPy's global random number generator

TEST_SIZE = 0.20
VAL_SIZE = 0.15

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 3 Dataset </h1>

In [None]:
data = pd.read_csv("/kaggle/input/fetal-health-classification/fetal_health.csv")

In [None]:
data

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 3.1 Class imbalance </h1>

In [None]:
plt.figure(figsize=(6,5))
ax = sns.countplot(x = data['fetal_health'])

# fetal_health codes:
# 1 - Normal
# 2 - Suspect
# 3 - Pathological
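
The count plot shows the imbalance visually; a small table of counts and shares makes it easier to quote. The sketch below is only illustrative: `class_distribution`, `CLASS_NAMES` and the toy labels are hypothetical names and values, not part of the original notebook. In the notebook itself the function would be called on `data['fetal_health']` before the label encoding step.

```python
import pandas as pd

# Mapping of the raw label codes to class names (codes as listed in this notebook)
CLASS_NAMES = {1: "Normal", 2: "Suspect", 3: "Pathological"}


def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return the count and relative share of each class."""
    counts = labels.astype(int).value_counts().sort_index()
    table = pd.DataFrame({
        "count": counts,
        "share": (counts / counts.sum()).round(3),
    })
    table.index = table.index.map(CLASS_NAMES)
    return table


# Toy labels standing in for data['fetal_health'] (made-up values)
toy_labels = pd.Series([1.0] * 7 + [2.0] * 2 + [3.0] * 1, name="fetal_health")
summary = class_distribution(toy_labels)
print(summary)
```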

In [None]:
# Encode the labels 1.0/2.0/3.0 as the consecutive integers 0/1/2
label = LabelEncoder()
data['fetal_health'] = label.fit_transform(data['fetal_health'])

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 3.2 Data scaling </h1>

In [None]:
X = data.drop(columns='fetal_health')
y = data['fetal_health']

In [None]:
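# Scale every feature to [0, 1].
# Note: the scaler is fitted on the full dataset, i.e. before the train/test split,
# so the test-set value ranges influence the scaling.
X_scaled = MinMaxScaler().fit_transform(X.values)
X = pd.DataFrame(X_scaled, index=X.index, columns=X.columns)
X.head()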
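
Because the scaler here is fitted on all rows, the minimum and maximum of the test rows feed into the transformation. A common alternative is to fit the scaler on the training rows only and then apply it to the test rows. The sketch below shows that order of operations on synthetic data; `X_demo`, `y_demo` and the other `_demo`-style names are assumptions made for illustration. Tree-based models such as LightGBM split on thresholds and are generally insensitive to this kind of monotonic rescaling, so the practical effect in this notebook is likely small.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

SEED = 123

# Synthetic stand-in for the CTG feature matrix and labels (made-up data)
rng = np.random.default_rng(SEED)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)

# 1. Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.20, random_state=SEED
)

# 2. Fit the scaler on the training rows only
scaler = MinMaxScaler().fit(X_tr)

# 3. Apply the same fitted transformation to both parts
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # may fall slightly outside [0, 1]

print(X_tr_scaled.min(), X_tr_scaled.max())
```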

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 3.3 Data split </h1>

In [None]:
print('\nData split:\nTest size: {}\nVal  size: {}\n'.format(TEST_SIZE,VAL_SIZE))

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, 
                                                    test_size = TEST_SIZE, 
                                                    random_state=SEED)

print('TRAIN: {} & {}'.format(X_train.shape, y_train.shape))
print('TEST:  {} & {}'.format(X_test.shape, y_test.shape))

val_len = int(X_train.shape[0]*VAL_SIZE)

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 3.4 Class weights </h1>

In [None]:
# Calculate class weights from sklearn
# (recent scikit-learn versions require keyword arguments here)
class_weight_array = class_weight.compute_class_weight(class_weight='balanced',
                                                       classes=np.unique(y_train),
                                                       y=y_train)
print('\nClass weights: {}'.format(class_weight_array))

# Class weights as a dictionary {class label: weight} for LightGBM
keys = [0, 1, 2]
class_weight_dict = dict(zip(keys, class_weight_array))
print('\nClass weights dict: {}'.format(class_weight_dict))
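
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * count_of_that_class)`, so rarer classes get proportionally larger weights. The sketch below reproduces this by hand on made-up labels (`y_demo` is a hypothetical stand-in for `y_train`) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels standing in for y_train
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * samples per class)
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# Same quantity computed by scikit-learn
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(manual_weights)
print(sklearn_weights)
```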

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 3.5 Validation set </h1>

In [None]:
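# Hold out the first val_len rows of the (already shuffled) training split for validation
X_val = X_train[:val_len]
y_val = y_train.iloc[:val_len]  # positional slice; y_train is a pandas Series

X_train_cut = X_train[val_len:]
y_train_cut = y_train.iloc[val_len:]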
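
Slicing off the first `val_len` rows works because `train_test_split` has already shuffled the data, but it does not guarantee that the validation set keeps the class proportions. One possible alternative, sketched here on synthetic data with hypothetical `_demo` names, is a second stratified call to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 123
VAL_SIZE = 0.15

# Synthetic stand-in for X_train / y_train (made-up data)
rng = np.random.default_rng(SEED)
X_train_demo = rng.normal(size=(200, 4))
y_train_demo = np.array([0] * 140 + [1] * 40 + [2] * 20)

# Stratified hold-out: the validation set mirrors the class proportions
X_fit_demo, X_val_demo, y_fit_demo, y_val_demo = train_test_split(
    X_train_demo, y_train_demo,
    stratify=y_train_demo,
    test_size=VAL_SIZE,
    random_state=SEED,
)

print(np.bincount(y_fit_demo), np.bincount(y_val_demo))
```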

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4 Model </h1>

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.1 Build model </h1>

In [None]:
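# LightGBM 4.0 removed the early_stopping_rounds and verbose arguments of fit();
# the equivalent callbacks below are available from LightGBM 3.3 onward
fit_params={"eval_metric" : 'auc_mu', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            'feature_name': 'auto',
            'callbacks': [lgb.early_stopping(stopping_rounds=10),
                          lgb.log_evaluation(period=100)]
           }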

In [None]:
clf = lgb.LGBMClassifier(boosting_type = 'gbdt',
                         objective='multiclass',
                         metric='auc_mu', 
                         class_weight=class_weight_dict,
                         n_estimators=100,
                         num_leaves= 31,
                         colsample_bytree=1.0,
                         subsample=1.0,
                         learning_rate=0.1,
                         max_depth=-1, 
                         random_state=SEED, 
                         #silent=True, 
                         #n_jobs=4, 
)

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.2 Fit model </h1>

In [None]:
clf.fit(X_train_cut, y_train_cut, **fit_params)

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.3 Feature importance </h1>

In [None]:
plt.figure(figsize=(15,9))

plt.title('LightGBM feature importances', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)

feat_imp = pd.Series(clf.feature_importances_, index=X.columns)
feat_imp.nlargest(34).plot(kind='barh', color='tab:orange')

plt.tight_layout()

plt.savefig('lgbm_kfold_feature_importances_default.png')

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 5 Grid search </h1>

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 5.1 Grid parameters </h1>

In [None]:
# Parameter grid that GridSearchCV will search to find the optimum hyperparameters
param_grid = {
    'class_weight': [class_weight_dict],
    'boosting_type': ['gbdt', 'dart'],
    'objective': ['multiclass'],
    'metric': ['auc_mu'],
    'n_estimators': [100, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [50, 100, 200],
    'colsample_bytree': [0.6, 0.8],
    'subsample': [0.7, 0.8],
    'max_depth': [5, 10, 50],
    'random_state': [SEED]
}
# 432 parameter combinations x 5 folds = 2160 fits; avoid making the grid much larger
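
The number of fits can be checked programmatically with scikit-learn's `ParameterGrid`, which enumerates every combination in a grid. The sketch below copies the grid from the cell above; `demo_class_weights` is a placeholder dictionary standing in for `class_weight_dict`, and `N_FOLDS` is an illustrative name for the `cv=5` setting.

```python
from sklearn.model_selection import ParameterGrid

SEED = 123
N_FOLDS = 5

# Placeholder for class_weight_dict (made-up values; only its presence matters here)
demo_class_weights = {0: 0.5, 1: 1.5, 2: 3.0}

param_grid_demo = {
    'class_weight': [demo_class_weights],
    'boosting_type': ['gbdt', 'dart'],
    'objective': ['multiclass'],
    'metric': ['auc_mu'],
    'n_estimators': [100, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [50, 100, 200],
    'colsample_bytree': [0.6, 0.8],
    'subsample': [0.7, 0.8],
    'max_depth': [5, 10, 50],
    'random_state': [SEED],
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid_demo))
n_fits = n_candidates * N_FOLDS
print(n_candidates, n_fits)
```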

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 5.2 Fit model</h1>

In [None]:
lgbm = lgb.LGBMClassifier() 

#lgbm.fit(X_train_cut, y_train_cut)

In [None]:
%%time

# Grid search with 5-fold cross-validation
lgbm_cv = GridSearchCV(lgbm, 
                       param_grid, 
                       cv=5, 
                       n_jobs=None, 
                       verbose=0) # set to 2 for long output

lgbm_cv.fit(X_train_cut, y_train_cut)
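
Without a `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which for classifiers is plain accuracy; the `'metric': ['auc_mu']` entry in the grid only sets LightGBM's evaluation metric. On imbalanced classes, an imbalance-aware score may be a better selection criterion. The sketch below shows the idea with `scoring='balanced_accuracy'` and an explicitly stratified, shuffled 5-fold split. It uses synthetic data and a `DecisionTreeClassifier` as a stand-in estimator so that it runs quickly; these choices and the small `demo_grid` are assumptions for illustration, and the same `scoring` and `cv` arguments could be passed alongside `lgb.LGBMClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

SEED = 123

# Synthetic, imbalanced 3-class data standing in for the CTG features (made-up data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=SEED,
)

# Stand-in estimator and a deliberately tiny grid (illustrative only)
estimator = DecisionTreeClassifier(class_weight="balanced", random_state=SEED)
demo_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 5]}

# Stratified folds keep the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Rank candidates by the mean per-class recall instead of plain accuracy
search = GridSearchCV(estimator, demo_grid, scoring="balanced_accuracy", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```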

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 5.3 Save optimum parameters </h1>

In [None]:
# Print optimum parameters
print('\nGrid search optimum parameters:\n{}'.format(lgbm_cv.best_params_))
# Save the model parameters
filename = 'model_lgbm_kfold_gridsearch_best_params.pickle'
with open(filename, 'wb') as file:
    pickle.dump(lgbm_cv.best_params_, file)
# Load the model parameters
with open(filename, 'rb') as file:
    best_params = pickle.load(file)
print('\nGrid search optimum parameters [loaded]:\n{}'.format(best_params))

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 5.4 Fit tuned model</h1>

In [None]:
# Tuned model with optimum parameters
clf_tuned = lgb.LGBMClassifier(**best_params) # == lgbm_cv.best_params_

# Fit model
clf_tuned.fit(X_train_cut, y_train_cut, **fit_params)

# Predict
y_test_pred_tuned = clf_tuned.predict(X_test) 
# Compute the accuracy of the test-set predictions and round the result
acc_tuned = round(accuracy_score(y_test, y_test_pred_tuned), 4) 
print('Accuracy: {:.2f}%'.format(acc_tuned*100))
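
Accuracy alone can look good on imbalanced data even when the minority classes are poorly predicted. A per-class view may be worth adding. The sketch below contrasts accuracy with balanced accuracy and a confusion matrix on made-up labels; `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and `y_test_pred_tuned`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels: 0 = Normal, 1 = Suspect, 2 = Pathological
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
# One of the two Suspect cases is misclassified as Normal
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2])

acc = accuracy_score(y_true_demo, y_pred_demo)
# Balanced accuracy is the mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print('Accuracy:          {:.4f}'.format(acc))
print('Balanced accuracy: {:.4f}'.format(bal_acc))
print(cm)
```

In this toy case a single error in a two-sample class barely moves accuracy but lowers balanced accuracy noticeably, which is the behaviour that matters when Suspect and Pathological cases are rare.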

<h1 style="background-color:LightSeaGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 5.5 Feature importance for tuned model</h1>

In [None]:
plt.figure(figsize=(15,9))
plt.title('LightGBM feature importances [tuned] | acc = {:.2f}%'.format(acc_tuned*100), fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
feat_imp = pd.Series(clf_tuned.feature_importances_, index=X.columns)
feat_imp.nlargest(50).plot(kind='barh', color='tab:orange')
plt.tight_layout()
plt.savefig('lgbm_5fold_feature_importances_tuned.png')

In [None]:
lgbm_cv.cv_results_

In [None]:
best_feat_10 = list(feat_imp.nlargest(10).index)
best_feat_10