## TL;DR
*Train a model to predict INACBG codes from ICD9 and ICD10 codes using an XGBoost multiclass classifier.*

**Input data:** 27,500 unique (ICD9, ICD10, INACBG) combinations, deduplicated from 278,785 raw rows

**Algorithm:** XGBoost (`multi:softmax`)

**Test data:** 20% of the 27,500 rows (`test_size=0.2`)

**Accuracy:** 0.3989

**Model file:** ./predict-inacbg/model/03_RF_XG_27k.joblib

In [19]:
import pandas as pd
df = pd.read_csv('ai_inacbg.csv', sep=';')

nama_model = './predict-inacbg/model/03_RF_XG_27k.joblib'
nama_encoders_icd9 = './predict-inacbg/encoders/03_RF_XG_27k_icd9_encoder.joblib'
nama_encoders_icd10 = './predict-inacbg/encoders/03_RF_XG_27k_icd10_encoder.joblib'
nama_encoders_inacbg = './predict-inacbg/encoders/03_RF_XG_27k_inacbg_encoder.joblib'

## Explore Data

In [2]:
print("Dataset Shape:", df.shape)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nUnique Values per Column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

# Display sample of data
print("\nSample Data:")
df.head()

Dataset Shape: (278785, 10)

Missing Values:
ID                  0
Tanggal             0
RegID           71085
SEPID               1
INACBG          77440
ICD10          159585
ICD9           103427
INACBG_Desc     77440
ICD10_Desc     159586
ICD9_Desc      103427
dtype: int64

Unique Values per Column:
ID: 278785 unique values
Tanggal: 154321 unique values
RegID: 89244 unique values
SEPID: 104825 unique values
INACBG: 588 unique values
ICD10: 4420 unique values
ICD9: 916 unique values
INACBG_Desc: 588 unique values
ICD10_Desc: 4369 unique values
ICD9_Desc: 919 unique values

Sample Data:


| | ID | Tanggal | RegID | SEPID | INACBG | ICD10 | ICD9 | INACBG_Desc | ICD10_Desc | ICD9_Desc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-09-01 02:07:14 | P230900003 | 0901R0030923V000001 | Q-5-42-0 | | 8907.0 | PENYAKIT AKUT KECIL LAIN-LAIN | | Consultation, described as comprehensive |
| 1 | 2 | 2023-09-01 02:07:14 | P230900003 | 0901R0030923V000001 | Q-5-42-0 | | 9922.0 | PENYAKIT AKUT KECIL LAIN-LAIN | | Injection of other anti-infective |
| 2 | 3 | 2023-09-01 09:52:27 | P230900494 | 0901R0030923V000428 | Q-5-44-0 | N30 | 8907.0 | PENYAKIT KRONIS KECIL LAIN-LAIN | Cystitis | Consultation, described as comprehensive |
| 3 | 4 | 2023-09-01 09:52:27 | P230900494 | 0901R0030923V000428 | Q-5-44-0 | N30.9 | | PENYAKIT KRONIS KECIL LAIN-LAIN | Cystitis, unspecified | |
| 4 | 5 | 2023-09-01 13:34:31 | P230900609 | 0901R0030923V001192 | Q-5-44-0 | C11 | 8907.0 | PENYAKIT KRONIS KECIL LAIN-LAIN | Malignant neoplasm of nasopharynx | Consultation, described as comprehensive |


### Take only necessary fields

In [3]:
df = df[['ICD9', 'ICD10', 'INACBG']]
df.head()

| | ICD9 | ICD10 | INACBG |
|---|---|---|---|
| 0 | 8907.0 | | Q-5-42-0 |
| 1 | 9922.0 | | Q-5-42-0 |
| 2 | 8907.0 | N30 | Q-5-44-0 |
| 3 | | N30.9 | Q-5-44-0 |
| 4 | 8907.0 | C11 | Q-5-44-0 |
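
The `ICD9` column shows up as floats (`8907.0`) because pandas infers a numeric dtype and promotes it to `float64` once missing values are present. A minimal sketch of reading the codes as strings instead, assuming the CSV stores them as plain digit strings (the sample output suggests this, but the source does not state it). The inline rows, including the `0309` code with a leading zero and the `A-0-00-0` label, are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same ';'-separated layout as ai_inacbg.csv
sample_csv = "ICD9;ICD10;INACBG\n0309;;A-0-00-0\n;N30.9;Q-5-44-0\n8907;N30;Q-5-44-0\n"

# Default inference: numeric codes plus a missing value become float64
inferred = pd.read_csv(io.StringIO(sample_csv), sep=';')

# Forcing str keeps the codes exactly as written, including leading zeros
as_text = pd.read_csv(io.StringIO(sample_csv), sep=';', dtype={'ICD9': str})

print(inferred['ICD9'].tolist())
print(as_text['ICD9'].tolist())
```

Reading the column as text would change the categories seen by the label encoder (for example `'8907'` instead of `'8907.0'`), so the model and encoders would have to be refitted together.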


## Clean Data

In [4]:
df_clean = df.copy()
# Print initial row count
print(f"Initial row count: {len(df_clean)}")

# 1. Remove rows where INACBG is null
rows_before = len(df_clean)
df_clean = df_clean.dropna(subset=['INACBG'])
rows_removed = rows_before - len(df_clean)
print(f"Rows removed due to null INACBG: {rows_removed}")

# 2. Remove rows where both ICD9 and ICD10 are null
rows_before = len(df_clean)
df_clean = df_clean.dropna(subset=['ICD9', 'ICD10'], how='all')
rows_removed = rows_before - len(df_clean)
print(f"Rows removed due to both ICD9 and ICD10 being null: {rows_removed}")

# Normalize string columns: strip surrounding whitespace and upper-case the codes
string_columns = df_clean.select_dtypes(include=['object']).columns
for col in string_columns:
    df_clean[col] = df_clean[col].str.strip().str.upper()


print(f"Final row count: {len(df_clean)}")

Initial row count: 278785
Rows removed due to null INACBG: 77440
Rows removed due to both ICD9 and ICD10 being null: 81
Final row count: 201264


In [5]:
print("\nDataset Shape After Cleaning:", df_clean.shape)
print("\nMissing Values After Cleaning:")
print(df_clean.isnull().sum())

# Analyze remaining data
print("\nSummary of cleaned data:")
print("Number of unique INACBG codes:", df_clean['INACBG'].nunique())
print("Number of unique ICD9 codes:", df_clean['ICD9'].nunique())
print("Number of unique ICD10 codes:", df_clean['ICD10'].nunique())

# Check for rows with only ICD9 or only ICD10
only_icd9 = df_clean['ICD9'].notna() & df_clean['ICD10'].isna()
only_icd10 = df_clean['ICD10'].notna() & df_clean['ICD9'].isna()

print(f"\nRows with only ICD9: {sum(only_icd9)}")
print(f"Rows with only ICD10: {sum(only_icd10)}")
print(f"Rows with both ICD9 and ICD10: {sum(df_clean['ICD9'].notna() & df_clean['ICD10'].notna())}")



Dataset Shape After Cleaning: (201264, 3)

Missing Values After Cleaning:
ICD9      32009
ICD10     84679
INACBG        0
dtype: int64

Summary of cleaned data:
Number of unique INACBG codes: 587
Number of unique ICD9 codes: 902
Number of unique ICD10 codes: 4409

Rows with only ICD9: 84679
Rows with only ICD10: 32009
Rows with both ICD9 and ICD10: 84576


### Keep only unique (ICD9, ICD10, INACBG) combinations

In [6]:
df_clean = df_clean.drop_duplicates(subset=['ICD9', 'ICD10', 'INACBG'], keep='first')
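
Deduplicating on all three columns keeps rows where the same (ICD9, ICD10) pair maps to different INACBG codes, and a model that sees only those two features can get at most one of them right. The sketch below estimates that ceiling for a given frame. `lookup_ceiling` is a hypothetical helper, not part of the notebook, and the toy rows are illustrative only:

```python
import pandas as pd

def lookup_ceiling(df):
    """Share of rows that a perfect per-pair lookup table would label correctly."""
    # One key per row; astype(str) maps NaN to 'nan', as the notebook's encoding does
    pair = df[['ICD9', 'ICD10']].astype(str).agg('|'.join, axis=1)
    # Count rows per (pair, INACBG), then keep the most frequent INACBG per pair
    counts = df.groupby([pair, df['INACBG']]).size()
    best_per_pair = counts.groupby(level=0).max()
    return best_per_pair.sum() / len(df)

# Illustrative rows: the first two share a feature pair but differ in INACBG
toy = pd.DataFrame({
    'ICD9': [8907.0, 8907.0, 8907.0, float('nan')],
    'ICD10': [float('nan'), float('nan'), 'N30', 'N30.9'],
    'INACBG': ['Q-5-42-0', 'Q-5-44-0', 'Q-5-44-0', 'Q-5-44-0'],
})
ceiling = lookup_ceiling(toy)
print(ceiling)
```

This is an in-sample figure computed on the frame it is given, so it bounds training accuracy rather than predicting test accuracy. Calling `lookup_ceiling(df_clean)` after the deduplication step would show how much of the data is ambiguous given these two features alone.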

### Data Check After Clean

In [7]:
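# Note: with inplace=False the returned frame is discarded, so df_clean keeps its
# original index (hence "0 to 278771" in the output). Assign the result to reset it.
df_clean.reset_index(drop=True, inplace=False)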
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27500 entries, 0 to 278771
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ICD9    24461 non-null  float64
 1   ICD10   22711 non-null  object 
 2   INACBG  27500 non-null  object 
dtypes: float64(1), object(2)
memory usage: 859.4+ KB
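
With 587 INACBG classes spread over 27,500 rows, some classes may have very few examples, and a class that occurs only once can end up on only one side of a random train/test split. A rough way to check this, sketched with a hypothetical `class_support` helper and made-up labels:

```python
import pandas as pd

def class_support(labels):
    """Summarize how many rows each class has."""
    counts = pd.Series(labels).value_counts()
    return {
        'classes': int(counts.size),
        'singletons': int((counts == 1).sum()),  # classes seen exactly once
        'median_rows_per_class': float(counts.median()),
    }

# Illustrative labels only
toy_labels = ['Q-5-44-0', 'Q-5-44-0', 'Q-5-44-0', 'Q-5-42-0', 'U-3-13-0']
summary = class_support(toy_labels)
print(summary)
```

On the notebook's data the call would be `class_support(df_clean['INACBG'])`.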


## Data Prep

In [8]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb

def prepare_xgboost(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # One label encoder per categorical column
    le_icd9 = LabelEncoder()
    le_icd10 = LabelEncoder()
    le_inacbg = LabelEncoder()

    # Encode ICD9, ICD10, and INACBG as integers.
    # astype(str) turns missing values into the category 'nan'
    # and float ICD9 codes into strings such as '8907.0'.
    df['ICD9_encoded'] = le_icd9.fit_transform(df['ICD9'].astype(str))
    df['ICD10_encoded'] = le_icd10.fit_transform(df['ICD10'].astype(str))
    df['INACBG_encoded'] = le_inacbg.fit_transform(df['INACBG'].astype(str))

    # Separate features and target
    X = df[['ICD9_encoded', 'ICD10_encoded']]
    y = df['INACBG_encoded']

    # Hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Convert the training split to a DMatrix
    dtrain = xgb.DMatrix(X_train, label=y_train)

    # XGBoost parameters
    params = {
        'objective': 'multi:softmax',  # Multiclass objective; predict() returns the class index
        'num_class': len(np.unique(y)),  # Number of classes in the target variable
        'max_depth': 6,  # Maximum depth of a tree
        'learning_rate': 0.1,  # Learning rate (step size shrinkage)
        'eval_metric': 'merror'  # Multiclass classification error rate
    }

    # Train the model
    model = xgb.train(params, dtrain, num_boost_round=100)

    return model, X_test, y_test, le_icd9, le_icd10, le_inacbg

def predict_inacbg(model, X_test, le_icd9, le_icd10):
    dtest = xgb.DMatrix(X_test)
    predictions = model.predict(dtest)
    return predictions

# Evaluate the model
def evaluate_model(y_test, predictions):
    from sklearn.metrics import accuracy_score

    accuracy = accuracy_score(y_test, predictions)
    print("Model accuracy:", accuracy)

In [9]:
# Train the model and prepare the data
model, X_test, y_test, le_icd9, le_icd10, le_inacbg = prepare_xgboost(df_clean)

In [10]:
# Make predictions with the trained model
predictions = predict_inacbg(model, X_test, le_icd9, le_icd10)

# Evaluate the model
evaluate_model(y_test, predictions)

Model accuracy: 0.39890909090909094
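
The score above counts only the single predicted class, because `multi:softmax` returns one class index per row. If the model were trained with XGBoost's `multi:softprob` objective instead, `predict` would return per-class probabilities, and it would be possible to ask how often the true INACBG code is among the top few candidates. A minimal NumPy sketch of that metric follows. `top_k_accuracy` is a hypothetical helper and the probability matrix is made up:

```python
import numpy as np

def top_k_accuracy(proba, y_true, k=3):
    """Fraction of rows whose true class index is among the k most probable classes."""
    proba = np.asarray(proba)
    y_true = np.asarray(y_true)
    # Column indices of the k largest probabilities in each row
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Illustrative probabilities for 3 rows and 3 classes
proba = [[0.1, 0.6, 0.3],
         [0.5, 0.2, 0.3],
         [0.2, 0.3, 0.5]]
y_true = [2, 1, 2]

top1 = top_k_accuracy(proba, y_true, k=1)
top2 = top_k_accuracy(proba, y_true, k=2)
print(top1, top2)
```

With `k=1` this reduces to the plain accuracy reported by `evaluate_model`.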


## Dump Model

In [17]:
import joblib
import os
import numpy as np
from sklearn.preprocessing import LabelEncoder

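def save_model(model, le_icd9, le_icd10, le_inacbg):
    """
    Save the trained XGBoost model and the label encoders with joblib.

    Output paths come from the module-level variables nama_model,
    nama_encoders_icd9, nama_encoders_icd10, and nama_encoders_inacbg.

    Parameters:
    - model: Trained xgboost.Booster returned by prepare_xgboost
    - le_icd9: LabelEncoder for ICD9
    - le_icd10: LabelEncoder for ICD10
    - le_inacbg: LabelEncoder for INACBG
    """
    # Make sure the target directories exist before writing
    for path in (nama_model, nama_encoders_icd9, nama_encoders_icd10, nama_encoders_inacbg):
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)

    # Save the XGBoost model
    joblib.dump(model, nama_model)

    # Save label encoders
    joblib.dump(le_icd9, nama_encoders_icd9)
    joblib.dump(le_icd10, nama_encoders_icd10)
    joblib.dump(le_inacbg, nama_encoders_inacbg)

    print("Model and label encoders saved successfully.")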

In [20]:
save_model(model, le_icd9, le_icd10, le_inacbg)

Model and label encoders saved successfully.


## Predict Test

In [24]:
def load_model():
    """
    Load the saved XGBoost model and label encoders with joblib.

    Paths are read from the module-level variables nama_model,
    nama_encoders_icd9, nama_encoders_icd10, and nama_encoders_inacbg.

    Returns:
    - model: Loaded xgboost.Booster
    - le_icd9: Loaded LabelEncoder for ICD9
    - le_icd10: Loaded LabelEncoder for ICD10
    - le_inacbg: Loaded LabelEncoder for INACBG
    """
    # Load the XGBoost model
    model = joblib.load(nama_model)

    # Load label encoders
    le_icd9 = joblib.load(nama_encoders_icd9)
    le_icd10 = joblib.load(nama_encoders_icd10)
    le_inacbg = joblib.load(nama_encoders_inacbg)

    return model, le_icd9, le_icd10, le_inacbg

import xgboost as xgb

def predict_inacbg_single(icd9, icd10, model, le_icd9, le_icd10, le_inacbg):
    # Encode input ICD9 and ICD10
    icd9_encoded = le_icd9.transform([str(icd9)])[0]
    icd10_encoded = le_icd10.transform([str(icd10)])[0]
    
    # Prepare input for XGBoost prediction with feature names
    input_data = xgb.DMatrix(
        [[icd9_encoded, icd10_encoded]], 
        feature_names=['ICD9_encoded', 'ICD10_encoded']
    )
    
    # Predict using XGBoost Booster method
    prediction = model.predict(input_data)[0]
    
    # Get original INACBG code
    predicted_inacbg = le_inacbg.inverse_transform([int(prediction)])[0]
    
    return predicted_inacbg, int(prediction)

In [22]:
df.head(10)

| | ID | Tanggal | RegID | SEPID | INACBG | ICD10 | ICD9 | INACBG_Desc | ICD10_Desc | ICD9_Desc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-09-01 02:07:14 | P230900003 | 0901R0030923V000001 | Q-5-42-0 | | 8907.0 | PENYAKIT AKUT KECIL LAIN-LAIN | | Consultation, described as comprehensive |
| 1 | 2 | 2023-09-01 02:07:14 | P230900003 | 0901R0030923V000001 | Q-5-42-0 | | 9922.0 | PENYAKIT AKUT KECIL LAIN-LAIN | | Injection of other anti-infective |
| 2 | 3 | 2023-09-01 09:52:27 | P230900494 | 0901R0030923V000428 | Q-5-44-0 | N30 | 8907.0 | PENYAKIT KRONIS KECIL LAIN-LAIN | Cystitis | Consultation, described as comprehensive |
| 3 | 4 | 2023-09-01 09:52:27 | P230900494 | 0901R0030923V000428 | Q-5-44-0 | N30.9 | | PENYAKIT KRONIS KECIL LAIN-LAIN | Cystitis, unspecified | |
| 4 | 5 | 2023-09-01 13:34:31 | P230900609 | 0901R0030923V001192 | Q-5-44-0 | C11 | 8907.0 | PENYAKIT KRONIS KECIL LAIN-LAIN | Malignant neoplasm of nasopharynx | Consultation, described as comprehensive |
| 5 | 6 | 2023-09-01 13:34:31 | P230900609 | 0901R0030923V001192 | Q-5-44-0 | C11.9 | | PENYAKIT KRONIS KECIL LAIN-LAIN | Nasopharynx, unspecified | |
| 6 | 7 | 2023-09-01 13:35:10 | P230900617 | 0901R0030923V001195 | Q-5-44-0 | C31 | 8907.0 | PENYAKIT KRONIS KECIL LAIN-LAIN | Malignant neoplasm of accessory sinuses | Consultation, described as comprehensive |
| 7 | 8 | 2023-09-01 13:35:10 | P230900617 | 0901R0030923V001195 | Q-5-44-0 | C31.9 | 9059.0 | PENYAKIT KRONIS KECIL LAIN-LAIN | Accessory sinus, unspecified | Microscopic examination of blood, Other micros... |
| 8 | 9 | 2023-09-01 13:36:27 | P230901119 | 0901R0030923V001199 | U-3-13-0 | | 2219.0 | PROSEDUR DIAGNOSTIK LAIN-LAIN PADA TELINGA, HI... | | Other diagnostic procedures on nasal sinuses |
| 9 | 10 | 2023-09-01 13:36:27 | P230901119 | 0901R0030923V001199 | U-3-13-0 | J02.9 | 8907.0 | PROSEDUR DIAGNOSTIK LAIN-LAIN PADA TELINGA, HI... | Acute pharyngitis, unspecified | Consultation, described as comprehensive |


In [25]:
loaded_model, loaded_le_icd9, loaded_le_icd10, loaded_le_inacbg = load_model()

In [26]:
result, encoded_result = predict_inacbg_single('8907.0', 'J02.9', 
    loaded_model, 
    loaded_le_icd9, 
    loaded_le_icd10, 
    loaded_le_inacbg)
print(f"Predicted INACBG: {result}")

Predicted INACBG: Q-5-44-0
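
`LabelEncoder.transform` raises a `ValueError` for a value it did not see during fitting, so `predict_inacbg_single` fails on any ICD code that is absent from the training data. One possible guard is sketched below with a hypothetical `encode_or_none` helper. The encoder here is fitted on three example strings rather than loaded from disk:

```python
from sklearn.preprocessing import LabelEncoder

def encode_or_none(encoder, value):
    """Return the integer code for value, or None if the encoder never saw it."""
    value = str(value)
    if value not in set(encoder.classes_):
        return None
    return int(encoder.transform([value])[0])

# Stand-in encoder; the notebook's encoders are fitted on str(ICD9) such as '8907.0'
demo_encoder = LabelEncoder().fit(['8907.0', '9922.0', 'nan'])

known = encode_or_none(demo_encoder, '8907.0')
unknown = encode_or_none(demo_encoder, '1234.0')
print(known, unknown)
```

A caller could check for `None` and report the unknown code instead of letting the exception propagate. What to predict in that case (for example, falling back to the `'nan'` category) is a design decision the source does not cover.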
