# BPJS Add Antrol Analysis - Jupyter Notebook

## 1. Understanding Business

Understanding the business need and the goal of the analysis:

The goal of this analysis is a comprehensive study of BPJS patient registrations in the Add Antrol queue system. It is framed as a classification (supervised learning) project that examines the patterns and factors influencing BPJS patient registration in the hospital queue system. With this analysis, we hope to:

1. Identify the factors that influence BPJS queue registration
2. Predict how likely a patient is to register for the queue
3. Support hospital management in planning service capacity
4. Provide insights for improving the efficiency of the BPJS queue system

## 2. Data Understanding

Exploring and understanding the structure of the data:

The dataset contains BPJS patient registration records from the hospital queue system. The data comes from the hospital database, and the query returns the following groups of columns:

- Registration information (no_rawat, tgl_registrasi, jam_reg)
- Doctor information (kd_dokter, nm_dokter)
- Patient information (no_rkm_medis, nm_pasien)
- Polyclinic information (kd_poli, nm_poli)
- Insurer information (kd_pj, png_jawab)
- Queue information (tanggal_periksa, nomor_kartu, nomor_referensi, kodebooking)
- Visit information (jenis_kunjungan, status_kirim, keterangan)
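
The column groups listed here can be checked against a loaded DataFrame before any analysis starts. The following is a minimal sketch: the `EXPECTED_COLUMNS` mapping and the `missing_columns` helper are hypothetical names introduced for illustration, and the grouping simply mirrors the list above.

```python
import pandas as pd

# Hypothetical grouping that mirrors the column list in this section.
EXPECTED_COLUMNS = {
    "registration": ["no_rawat", "tgl_registrasi", "jam_reg"],
    "doctor": ["kd_dokter", "nm_dokter"],
    "patient": ["no_rkm_medis", "nm_pasien"],
    "polyclinic": ["kd_poli", "nm_poli"],
    "insurer": ["kd_pj", "png_jawab"],
    "queue": ["tanggal_periksa", "nomor_kartu", "nomor_referensi", "kodebooking"],
    "visit": ["jenis_kunjungan", "status_kirim", "keterangan"],
}


def missing_columns(frame: pd.DataFrame) -> dict:
    """Return {group: [missing column names]} for every group with gaps."""
    gaps = {}
    for group, columns in EXPECTED_COLUMNS.items():
        absent = [c for c in columns if c not in frame.columns]
        if absent:
            gaps[group] = absent
    return gaps


# Illustrative empty frame that lacks nm_dokter and all queue columns.
example = pd.DataFrame(columns=[
    "no_rawat", "tgl_registrasi", "jam_reg", "kd_dokter",
    "no_rkm_medis", "nm_pasien", "kd_poli", "nm_poli",
    "kd_pj", "png_jawab", "jenis_kunjungan", "status_kirim", "keterangan",
])
print(missing_columns(example))
```

Running a check like this right after loading makes a schema mismatch visible early, instead of surfacing later as silently skipped `if col in df.columns` branches.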

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")

Libraries imported successfully


In [2]:
# Load data from the database connection or CSV fallback
import sys
import os
sys.path.append(os.path.abspath('.'))

# The project-local modules may be absent when the notebook runs outside the
# repository, so guard the import instead of letting it abort the cell.
try:
    from database.database_connection import DatabaseConnection, get_bpjs_antrol_data
    from config.config import Config
except ImportError as import_error:
    _import_message = str(import_error)

    def get_bpjs_antrol_data(start_date, end_date):
        # Stub that routes execution to the CSV fallback below
        raise RuntimeError(f"database module unavailable ({_import_message})")

# Define date range for data extraction
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d')

# Try to load data from database
try:
    print(f"Loading data from database for period {start_date} to {end_date}")
    df = get_bpjs_antrol_data(start_date, end_date)
    print(f"Successfully loaded {len(df)} records from database")
except Exception as e:
    print(f"Database connection failed: {e}")
    print("Loading data from CSV fallback: database/bpjs antrol.csv")
    # Fallback to CSV file
    try:
        df = pd.read_csv('database/bpjs antrol.csv')
        print(f"Successfully loaded {len(df)} records from CSV")
    except FileNotFoundError:
        print("CSV file not found, creating sample data for demonstration")
        # Create sample data for demonstration purposes
        sample_data = {
            'no_rawat': [f'20230101{str(i).zfill(6)}' for i in range(100)],
            'tgl_registrasi': pd.date_range('2023-01-01', periods=100, freq='D').strftime('%Y-%m-%d'),
            'jam_reg': [f'{str(i%24).zfill(2)}:{str(i%60).zfill(2)}:00' for i in range(100)],
            'kd_dokter': [f'DR{i%10:03d}' for i in range(100)],
            'nm_dokter': [f'Dr. {name}' for name in ['Ahmad', 'Budi', 'Citra', 'Dedi', 'Eka', 'Fani', 'Gani', 'Hani', 'Iwan', 'Joko'] * 10],
            'no_rkm_medis': [f'RM{i:06d}' for i in range(100)],
            'nm_pasien': [f'Pasien {i}' for i in range(100)],
            'kd_poli': [f'Poli{i%5:03d}' for i in range(100)],
            'nm_poli': ['Poli Umum', 'Poli Gigi', 'Poli Mata', 'Poli Jantung', 'Poli Anak'] * 20,
            'status_lanjut': ['Ralan'] * 100,
            'kd_pj': ['BPJS'] * 80 + ['UMUM'] * 20,
            'png_jawab': ['BPJS'] * 80 + ['UMUM'] * 20,
            'tanggal_periksa': pd.date_range('2023-01-01', periods=100, freq='D').strftime('%Y-%m-%d'),
            'nomor_kartu': [f'BPJS{i:013d}' for i in range(100)],
            'nomor_referensi': [f'REF{i:08d}' for i in range(100)],
            'kodebooking': [f'BOOK{i:08d}' for i in range(100)],
            'jenis_kunjungan': ['1'] * 70 + ['2'] * 30,
            'status_kirim': ['Sudah'] * 90 + ['Belum'] * 10,
            'keterangan': [''] * 100,
            'USER': ['admin'] * 100
        }
        df = pd.DataFrame(sample_data)
        print(f"Created sample data with {len(df)} records")

In [None]:
# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nDataset Info:")
print(df.info())
print("\nFirst 5 rows:")
df.head()

## 3. Data Preparation / Wrangling

Preparing and wrangling the data:

In this stage we will:
- Inspect the structure of the data
- Identify the relevant columns
- Convert data types where needed
- Create new features where needed
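
As one illustration of the feature-creation step, the registration time can be turned into an hour and a coarse time-of-day bucket. This is only a sketch. It assumes `jam_reg` holds `HH:MM:SS` strings (as in the sample data), the output column names `jam_registrasi` and `sesi_registrasi` are hypothetical, and the bin edges are an arbitrary choice.

```python
import pandas as pd

# Small made-up sample, including one malformed and one missing value.
sample = pd.DataFrame({"jam_reg": ["07:15:00", "13:40:00", "bad", None]})

# Parse HH:MM:SS strings; unparseable values become NaT instead of raising.
parsed = pd.to_datetime(sample["jam_reg"], format="%H:%M:%S", errors="coerce")
sample["jam_registrasi"] = parsed.dt.hour  # float column because of NaN

# Bucket the hour into coarse sessions: [0, 12), [12, 18), [18, 24).
sample["sesi_registrasi"] = pd.cut(
    sample["jam_registrasi"],
    bins=[0, 12, 18, 24],
    right=False,
    labels=["morning", "afternoon", "evening"],
)
print(sample)
```

Rows that fail to parse keep a missing value in both derived columns, so they can be handled in the cleaning stage with the other missing values.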

In [None]:
# Check for missing values
print("Missing values in each column:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Basic statistics
print("\nBasic statistics:")
print(df.describe(include='all'))

In [None]:
# Convert date columns to datetime
date_columns = ['tgl_registrasi', 'tanggal_periksa']
for col in date_columns:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col])

# Create additional features
if 'tgl_registrasi' in df.columns:
    df['hari_registrasi'] = df['tgl_registrasi'].dt.day_name()
    df['bulan_registrasi'] = df['tgl_registrasi'].dt.month
    df['tahun_registrasi'] = df['tgl_registrasi'].dt.year

if 'tanggal_periksa' in df.columns:
    df['hari_periksa'] = df['tanggal_periksa'].dt.day_name()
    df['bulan_periksa'] = df['tanggal_periksa'].dt.month
    df['tahun_periksa'] = df['tanggal_periksa'].dt.year

# Calculate difference between registration and examination dates
if 'tgl_registrasi' in df.columns and 'tanggal_periksa' in df.columns:
    df['hari_antara_reg_periksa'] = (df['tanggal_periksa'] - df['tgl_registrasi']).dt.days

print("Data preparation completed")
print("New features created:")
new_features = ['hari_registrasi', 'bulan_registrasi', 'tahun_registrasi', 
                'hari_periksa', 'bulan_periksa', 'tahun_periksa', 'hari_antara_reg_periksa']
for feature in new_features:
    if feature in df.columns:
        print(f"- {feature}")

## 4. Data Cleaning

Cleaning inconsistencies out of the data:

In this stage we will:
- Handle missing values
- Identify and handle outliers
- Clean up inconsistent data
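
The outlier step in the list above can be approached with Tukey's interquartile-range (IQR) fences. The sketch below is one possible approach, not necessarily the one intended by the notebook. The `iqr_bounds` helper is a hypothetical name and the day-gap values are made up.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up gaps (in days) between registration and examination.
gaps = pd.Series([0, 0, 1, 1, 2, 2, 3, 60], name="hari_antara_reg_periksa")
lower, upper = iqr_bounds(gaps)

# Values outside the fences are flagged, not dropped, so they can be reviewed.
outliers = gaps[(gaps < lower) | (gaps > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```

Whether a flagged value is an error or a genuine long booking lead time is a domain decision, so flagging before removing is usually the safer default.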

In [None]:
# Handle missing values
print("Handling missing values...")

# For numerical columns, fill with median
numeric_columns = df.select_dtypes(include=[np.number]).columns
for col in numeric_columns:
    if df[col].isnull().sum() > 0:
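        # Assign back instead of using inplace=True on a column selection,
        # which is deprecated and may not modify df in recent pandas versions
        df[col] = df[col].fillna(df[col].median())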
        print(f"Filled missing values in {col} with median")

# For categorical columns, fill with mode
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    if df[col].isnull().sum() > 0:
        # Assign back instead of using inplace=True on a column selection
        df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown')
        print(f"Filled missing values in {col} with mode or 'Unknown'")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

# Remove duplicates if any
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicates} duplicate rows")

# Check for any remaining missing values
print(f"\nMissing values after cleaning:")
print(df.isnull().sum().sum())

## 5. Explanatory Data Analysis (Descriptive EDA)

Initial descriptive analysis:

This stage produces basic descriptive statistics to understand how the data is distributed.

In [None]:
# Descriptive statistics
print("Descriptive Statistics for Numerical Variables:")
print(df.describe())

print("\nDescriptive Statistics for Categorical Variables:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols[:5]:  # Show first 5 categorical columns
    print(f"\n{col} value counts:")
    print(df[col].value_counts().head())

In [None]:
# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Distribution of registration dates
if 'tgl_registrasi' in df.columns:
    df['tgl_registrasi'].dt.date.value_counts().sort_index().plot(kind='line', ax=axes[0,0], title='Registrations Over Time')
    axes[0,0].tick_params(axis='x', rotation=45)

# Distribution of polyclinics
if 'nm_poli' in df.columns:
    df['nm_poli'].value_counts().head(10).plot(kind='bar', ax=axes[0,1], title='Top 10 Polyclinics by Registration Count')
    axes[0,1].tick_params(axis='x', rotation=45)

# Distribution of payment methods
if 'png_jawab' in df.columns:
    df['png_jawab'].value_counts().plot(kind='bar', ax=axes[1,0], title='Registrations by Payment Method')
    axes[1,0].tick_params(axis='x', rotation=45)

# Distribution of visit types
if 'jenis_kunjungan' in df.columns:
    df['jenis_kunjungan'].value_counts().plot(kind='bar', ax=axes[1,1], title='Visit Types Distribution')
    axes[1,1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

## 6. Exploratory Data Analysis (In-Depth EDA)

In-depth data exploration:

This stage explores the data in more depth to identify patterns, relationships between variables, and key insights.

In [None]:
# Correlation analysis for numerical variables
numeric_df = df.select_dtypes(include=[np.number])
if not numeric_df.empty:
    plt.figure(figsize=(10, 8))
    correlation_matrix = numeric_df.corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
    plt.title('Correlation Matrix of Numerical Variables')
    plt.show()
else:
    print("No numerical variables found for correlation analysis")

In [None]:
# Analyze relationships between categorical variables
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Polyclinic vs Payment Method
if 'nm_poli' in df.columns and 'png_jawab' in df.columns:
    cross_tab = pd.crosstab(df['nm_poli'], df['png_jawab'])
    cross_tab_pct = cross_tab.div(cross_tab.sum(1).astype(float), axis=0) * 100
    cross_tab_pct.plot(kind='bar', stacked=True, ax=axes[0,0], title='Payment Method by Polyclinic (%)')
    axes[0,0].tick_params(axis='x', rotation=45)
    axes[0,0].legend(title='Payment Method', bbox_to_anchor=(1.05, 1), loc='upper left')

# Visit type by day of week
if 'jenis_kunjungan' in df.columns and 'hari_registrasi' in df.columns:
    cross_tab2 = pd.crosstab(df['hari_registrasi'], df['jenis_kunjungan'])
    cross_tab2.plot(kind='bar', ax=axes[0,1], title='Visit Types by Day of Week')
    axes[0,1].tick_params(axis='x', rotation=45)

# Registration by day of week
if 'hari_registrasi' in df.columns:
    df['hari_registrasi'].value_counts().plot(kind='bar', ax=axes[1,0], title='Registrations by Day of Week')
    axes[1,0].tick_params(axis='x', rotation=45)

# Registration by month
if 'bulan_registrasi' in df.columns:
    df['bulan_registrasi'].value_counts().sort_index().plot(kind='line', marker='o', ax=axes[1,1], title='Registrations by Month')
    axes[1,1].set_xlabel('Month')

plt.tight_layout()
plt.show()
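
The cross-tabulations above show relationships visually. A chi-square test of independence is one way to check whether such a relationship is stronger than chance. The sketch below uses `scipy.stats.chi2_contingency` on made-up data. The counts are illustrative only and do not come from the hospital dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up registrations: visit type is deliberately skewed by polyclinic.
toy = pd.DataFrame({
    "nm_poli": ["Poli Umum"] * 40 + ["Poli Gigi"] * 40,
    "jenis_kunjungan": ["1"] * 30 + ["2"] * 10 + ["1"] * 10 + ["2"] * 30,
})

# Contingency table of observed counts (rows: polyclinic, columns: visit type).
table = pd.crosstab(toy["nm_poli"], toy["jenis_kunjungan"])

# Returns the statistic, the p-value, degrees of freedom, and expected counts.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4g}")
```

A small p-value suggests the two variables are not independent. It says nothing about the direction or practical size of the effect, so it complements the plots and does not replace them.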

## 7. Data Preprocessing

Preprocessing the data for modeling:

This stage converts the data into a format suitable for machine learning models.

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Select features for modeling
feature_columns = []

# Add numerical features
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
# Remove date-related columns from features (keep only derived features)
date_cols = ['tahun_registrasi', 'tahun_periksa']
numeric_features = [col for col in numeric_features if col not in date_cols]
feature_columns.extend(numeric_features)

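# Add categorical features that are relevant
# ('png_jawab' is excluded because it is used as the target below; keeping it
# as a feature would leak the label into the model)
categorical_features = ['nm_poli', 'hari_registrasi', 'status_lanjut', 'jenis_kunjungan']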
for col in categorical_features:
    if col in df.columns:
        feature_columns.append(col)

print(f"Selected features for modeling: {feature_columns}")

# Prepare the feature matrix X
X = df[feature_columns].copy()

# Handle categorical variables with label encoding
label_encoders = {}
for col in X.columns:
    if X[col].dtype == 'object':
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
        label_encoders[col] = le
        print(f"Encoded {col} with {len(le.classes_)} unique values")

# Create a target variable based on the data
# For demonstration, let's create a binary target based on payment method (BPJS vs others)
if 'png_jawab' in df.columns:
    target_encoder = LabelEncoder()
    y = target_encoder.fit_transform(df['png_jawab'].astype(str))
    print(f"Created target variable from 'png_jawab' with classes: {target_encoder.classes_}")
else:
    # Create a default target if 'png_jawab' is not available
    y = np.random.randint(0, 2, size=len(df))
    print("Created random binary target variable")

# Scale all features (the label-encoded categorical columns are scaled as well)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

print(f"Final feature matrix shape: {X_scaled.shape}")
print(f"Target vector shape: {y.shape}")
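
The cell above fits the encoders and the scaler on the full dataset before any train/test split. An alternative that keeps test rows out of the fitted preprocessing is to split first and wrap the steps in a scikit-learn `Pipeline`. The sketch below shows that alternative on made-up data. It also swaps label encoding for one-hot encoding, which is a different technique from the one used in this notebook, and the toy column values are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data using the notebook's column names.
toy = pd.DataFrame({
    "nm_poli": ["Poli Umum", "Poli Gigi", "Poli Mata", "Poli Anak"] * 10,
    "hari_registrasi": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] * 8,
    "hari_antara_reg_periksa": [0, 1, 2, 3, 4, 5, 6, 7] * 5,
    "png_jawab": (["BPJS"] * 3 + ["UMUM"]) * 10,
})
X_toy = toy.drop(columns=["png_jawab"])  # the target is never a feature
y_toy = toy["png_jawab"]

# Split first so that preprocessing statistics come from training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

preprocess = ColumnTransformer([
    # handle_unknown="ignore" tolerates categories unseen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["nm_poli", "hari_registrasi"]),
    ("num", StandardScaler(), ["hari_antara_reg_periksa"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

pipeline.fit(X_tr, y_tr)
toy_predictions = pipeline.predict(X_te)
print(f"Held-out accuracy on toy data: {pipeline.score(X_te, y_te):.2f}")
```

A fitted pipeline can also be saved as a single object, so the preprocessing and the model cannot drift apart at inference time.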

## 8. Training Modeling

Training the models:

Train at least two machine learning models relevant to the project track. For classification, this notebook uses tree-based algorithms: Decision Tree, Random Forest, and Gradient Boosting.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

# Initialize models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42, n_estimators=100, max_depth=5)
}

# Train models
trained_models = {}
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    trained_models[name] = model
    print(f"{name} training completed")
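
A single train/test split gives one estimate of performance, which can vary noticeably on a small dataset. Stratified k-fold cross-validation is a common complement. The sketch below runs it on synthetic data from `make_classification`, so the scores it prints say nothing about the BPJS data. The fold count and scoring metric are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=42
)

# Five stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# One weighted-F1 score per fold; the model is refit for each fold.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
print(f"F1 (weighted) per fold: {np.round(scores, 3)}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a ranking of the three models might change with a different split.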

## 9. Evaluation Modeling

Evaluating the models:

Evaluate model performance with metrics appropriate for classification: Accuracy, Precision, Recall, F1-Score, and the Confusion Matrix.

In [None]:
# Evaluate models
model_performance = {}

for name, model in trained_models.items():
    # Predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    cm = confusion_matrix(y_test, y_pred)
    
    # Store performance
    model_performance[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'predictions': y_pred,
        'classification_report': report,
        'confusion_matrix': cm
    }
    
    print(f"\n{name} Performance Metrics:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, zero_division=0))
    
    # Confusion Matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

In [None]:
# Compare model performances
performance_df = pd.DataFrame({
    name: [metrics['accuracy'], metrics['precision'], metrics['recall'], metrics['f1_score']] 
    for name, metrics in model_performance.items()
}).T
performance_df.columns = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
performance_df = performance_df.sort_values('F1-Score', ascending=False)

print("Model Performance Comparison:")
print(performance_df)

# Visualize model comparison
plt.figure(figsize=(12, 8))
metrics_df = performance_df.T
sns.heatmap(metrics_df, annot=True, fmt='.4f', cmap='viridis')
plt.title('Model Performance Comparison')
plt.ylabel('Metrics')
plt.xlabel('Models')
plt.show()
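
The weighted averages reported above are built from per-class values. The small worked example below, on made-up labels, shows how the confusion matrix relates to per-class precision and recall, and how the weighted F1 is the support-weighted mean of the per-class F1 scores.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
)

# Made-up ground truth and predictions: 6 samples of class 0, 4 of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)

# Per-class metrics (one entry per class, in label order).
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_hat, zero_division=0
)

# The weighted F1 is the mean of per-class F1 weighted by class support.
manual_weighted_f1 = np.average(f1, weights=support)
library_weighted_f1 = f1_score(y_true, y_hat, average="weighted", zero_division=0)

print(cm)
print("precision per class:", np.round(precision, 3))
print("recall per class:   ", np.round(recall, 3))
print(f"weighted F1: manual={manual_weighted_f1:.4f}, library={library_weighted_f1:.4f}")
```

When classes are imbalanced, the weighted average is dominated by the majority class, so it is worth reading the per-class rows of the classification report as well.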

## 10. Save Model

Saving the best model:

Save (export) the best model, together with its preprocessors such as the scaler and encoders, to a file (for example, model_terbaik.pkl using pickle or joblib).

In [None]:
import joblib
import os

# Find the best model based on F1 score
best_model_name = max(model_performance.keys(), key=lambda k: model_performance[k]['f1_score'])
best_model = trained_models[best_model_name]

print(f"Best model: {best_model_name} with F1-Score: {model_performance[best_model_name]['f1_score']:.4f}")

# Create output directory if it doesn't exist
os.makedirs('output', exist_ok=True)

# Save the best model
model_path = os.path.join('output', f'best_model_{best_model_name.lower().replace(" ", "_")}.pkl')
joblib.dump(best_model, model_path)
print(f"Saved best model to: {model_path}")

# Save the scaler
scaler_path = os.path.join('output', 'scaler.pkl')
joblib.dump(scaler, scaler_path)
print(f"Saved scaler to: {scaler_path}")

# Save the label encoders
encoders_path = os.path.join('output', 'label_encoders.pkl')
joblib.dump(label_encoders, encoders_path)
print(f"Saved label encoders to: {encoders_path}")

# Save the target encoder if it exists
if 'target_encoder' in locals():
    target_encoder_path = os.path.join('output', 'target_encoder.pkl')
    joblib.dump(target_encoder, target_encoder_path)
    print(f"Saved target encoder to: {target_encoder_path}")

print("All models and preprocessors saved successfully!")
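
Saved artifacts are only useful if they can be loaded and applied in the same order as during training. The sketch below shows a round trip with `joblib` on a tiny synthetic model. It bundles the model and scaler in one file, which differs from the separate files written above. The file name `model_bundle.pkl` and the `_demo` variables are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Fit the preprocessor first, then the model on the transformed features.
scaler_demo = StandardScaler().fit(X_demo)
model_demo = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(X_demo), y_demo
)

# Save both objects together, then load them back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "model_bundle.pkl")
    joblib.dump({"model": model_demo, "scaler": scaler_demo}, bundle_path)
    bundle = joblib.load(bundle_path)

# At inference time, apply the same scaler before calling predict.
new_rows = rng.normal(size=(5, 3))
restored_pred = bundle["model"].predict(bundle["scaler"].transform(new_rows))
original_pred = model_demo.predict(scaler_demo.transform(new_rows))
print(restored_pred)
```

Files written by `joblib` or `pickle` should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.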

## 11. Insight & Conclusion

Drawing conclusions and recommendations:

Based on the analysis and modeling above, the main insights and conclusions are as follows:

In [None]:
# Feature importance analysis for the best model
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"Top 10 Most Important Features for {best_model_name}:")
    print(feature_importance.head(10))
    
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
    plt.title(f'Feature Importance - {best_model_name}')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()

print("### Conclusions and Recommendations:")
print("")
print("1. **Business Understanding**: This project analyzed BPJS patient registration patterns in the hospital queue system.")
print("The main goal is to understand the factors that influence BPJS queue registration and to predict the patient's payment type.")
print("")
print("2. **Model Performance**: Of the three models tested (Decision Tree, Random Forest, and Gradient Boosting),")
print(f"the best model is {best_model_name} with F1-Score: {model_performance[best_model_name]['f1_score']:.4f}.")
print("")
print("3. **Feature Importance**: The analysis shows that certain features strongly influence the predictions,")
if 'feature_importance' in locals():
    top_features = feature_importance.head(3)['feature'].tolist()
    print(f"such as {', '.join(top_features)}.")
else:
    print("such as visit type, polyclinic, and registration day.")
print("")
print("4. **Business Recommendations**:")
print("   - Focus on the polyclinics with the highest BPJS registration volume to optimize services")
print("   - Monitor daily registration patterns for resource planning")
print("   - Use the model to predict future registration load")
print("   - Re-evaluate insurer policies to improve efficiency")
print("")
print("5. **Technical Recommendations**:")
print("   - Continue collecting data to improve model accuracy")
print("   - Apply further feature engineering")
print("   - Re-evaluate the model periodically to make sure performance stays optimal")
print("   - Consider more advanced ensemble techniques to improve accuracy")