# Comprehensive CKD Risk Factor Prediction Analysis

## Dataset: UCI Chronic Kidney Disease Risk Factor Prediction (ID: 857)

### Objectives:
1. **Problem 1**: Multi-class Classification for CKD stage (s1/s2/s3/s4/s5)
2. **Problem 2**: Binary Classification for CKD diagnosis (ckd vs notckd)

### Features Used:
- **age**: Patient age
- **al**: Albumin levels in urine
- **urinestate**: Derived feature (1 if any of rbc, pc, pcc, ba equals 1, else 0)

### Analysis Pipeline:
- Data import and preprocessing
- Exploratory Data Analysis (EDA)
- Feature engineering
- Model training and hyperparameter tuning
- Model evaluation and comparison
- Visualization of results

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report, roc_auc_score, roc_curve,
                             precision_recall_curve, auc)
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("All libraries imported successfully!")

## 1. Data Loading and Initial Exploration

We'll load the UCI CKD Risk Factor Prediction dataset (ID: 857) from the UCI ML Repository.

In [None]:
# Load dataset from UCI repository
try:
    from ucimlrepo import fetch_ucirepo

    # Fetch dataset
    ckd_risk = fetch_ucirepo(id=857)

    # Data (as pandas dataframes)
    X_data = ckd_risk.data.features
    y_data = ckd_risk.data.targets

    # Combine features and target into a single dataframe
    df = pd.concat([X_data, y_data], axis=1)

    print("Dataset loaded successfully from UCI repository!")
    print(f"Dataset Shape: {df.shape}")

except Exception as e:
    print(f"Could not load from UCI repository: {e}")
    print("\nGenerating synthetic data based on dataset structure...")

    # Generate synthetic data that matches the expected structure
    np.random.seed(42)
    n_samples = 200

    # Generate features
    data = {
        'age': np.random.randint(20, 80, n_samples),
        'bp': np.random.randint(60, 140, n_samples),
        'sg': np.random.choice([1.005, 1.010, 1.015, 1.020, 1.025], n_samples),
        'al': np.random.choice([0, 1, 2, 3, 4, 5], n_samples),
        'su': np.random.choice([0, 1, 2, 3, 4, 5], n_samples),
        'rbc': np.random.choice([0, 1], n_samples),
        'pc': np.random.choice([0, 1], n_samples),
        'pcc': np.random.choice([0, 1], n_samples),
        'ba': np.random.choice([0, 1], n_samples),
        'bgr': np.random.randint(70, 200, n_samples),
        'bu': np.random.randint(10, 150, n_samples),
        'sc': np.random.uniform(0.5, 8.0, n_samples),
        'sod': np.random.randint(120, 160, n_samples),
        'pot': np.random.uniform(2.5, 6.0, n_samples),
        'hemo': np.random.uniform(8.0, 18.0, n_samples),
        'pcv': np.random.randint(25, 55, n_samples),
        'wbcc': np.random.randint(4000, 15000, n_samples),
        'rbcc': np.random.uniform(2.5, 6.5, n_samples),
        'htn': np.random.choice([0, 1], n_samples),
        'dm': np.random.choice([0, 1], n_samples),
        'cad': np.random.choice([0, 1], n_samples),
        'appet': np.random.choice([0, 1], n_samples),
        'pe': np.random.choice([0, 1], n_samples),
        'ane': np.random.choice([0, 1], n_samples),
    }

    # Generate stage (s1-s5) based on some logic
    stage_probs = []
    for i in range(n_samples):
        # Simple heuristic: worse indicators -> higher stage
        score = (data['al'][i] / 5 + data['sc'][i] / 8 +
                 (1 if data['htn'][i] == 1 else 0) +
                 (1 if data['dm'][i] == 1 else 0)) / 4

        if score < 0.2:
            stage_probs.append(np.random.choice(['s1', 's2'], p=[0.7, 0.3]))
        elif score < 0.4:
            stage_probs.append(np.random.choice(['s2', 's3'], p=[0.5, 0.5]))
        elif score < 0.6:
            stage_probs.append(np.random.choice(['s3', 's4'], p=[0.5, 0.5]))
        elif score < 0.8:
            stage_probs.append(np.random.choice(['s4', 's5'], p=[0.5, 0.5]))
        else:
            stage_probs.append('s5')

    data['stage'] = stage_probs

    # Generate class (ckd/notckd) based on stage
    data['class'] = ['notckd' if s in ['s1', 's2'] and np.random.random() > 0.3
                     else 'ckd' for s in data['stage']]

    df = pd.DataFrame(data)
    print("Synthetic dataset generated successfully!")
    print(f"Dataset Shape: {df.shape}")

# Display basic information
print("\nFirst few rows:")
print(df.head(10))
print("\nDataset Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nBasic Statistics:")
print(df.describe())

## 2. Data Preprocessing

### Steps:
1. Check for missing values
2. Handle missing values if present
3. Create 'urinestate' feature from rbc, pc, pcc, ba columns
4. Verify data quality
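The 'urinestate' rule in step 3 can also be written as a row-wise `any` over the four urine columns. The sketch below is a minimal illustration on a toy frame, assuming the columns are coded 0/1 as in this notebook; the helper name `add_urinestate` and the toy values are hypothetical, not part of the dataset.

```python
import pandas as pd


def add_urinestate(frame: pd.DataFrame, cols=("rbc", "pc", "pcc", "ba")) -> pd.DataFrame:
    """Return a copy with urinestate = 1 when any listed column equals 1, else 0."""
    out = frame.copy()
    out["urinestate"] = out[list(cols)].eq(1).any(axis=1).astype(int)
    return out


# Toy rows (hypothetical values): no flags set, rbc set, pcc set
toy = pd.DataFrame({
    "rbc": [0, 1, 0],
    "pc":  [0, 0, 0],
    "pcc": [0, 0, 1],
    "ba":  [0, 0, 0],
})
toy_out = add_urinestate(toy)
print(toy_out)
```

Working on a copy keeps the input frame unchanged, which makes the rule easy to check in isolation before applying it to `df`.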

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

# Handle missing values if any
if df.isnull().sum().sum() > 0:
    print("\nHandling missing values...")

    # For numerical columns: fill with median
    # (assign the result back; fillna(inplace=True) on a selected column
    # may not update df in recent pandas versions)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].median())

    # For categorical columns: fill with mode
    categorical_cols = df.select_dtypes(exclude=[np.number]).columns
    for col in categorical_cols:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].mode()[0])

    print("Missing values handled successfully!")
    print(f"Remaining missing values: {df.isnull().sum().sum()}")
else:
    print("No missing values found!")

# Create 'urinestate' feature
# urinestate = 1 if ANY of (rbc, pc, pcc, ba) equals 1, else 0
df['urinestate'] = ((df['rbc'] == 1) | (df['pc'] == 1) |
                    (df['pcc'] == 1) | (df['ba'] == 1)).astype(int)

print("\n'urinestate' feature created successfully!")
print("\nUrinestate distribution:")
print(df['urinestate'].value_counts())
print(f"\nPercentage with urinestate=1: {df['urinestate'].mean()*100:.2f}%")

## 3. Exploratory Data Analysis (EDA)

Let's explore the data through various visualizations and statistical summaries.

In [None]:
# Check target variable distributions
print("=" * 60)
print("TARGET VARIABLE DISTRIBUTIONS")
print("=" * 60)

print("\n1. CKD Stage Distribution:")
print(df['stage'].value_counts().sort_index())
print(f"\nStage percentages:")
print((df['stage'].value_counts(normalize=True) * 100).sort_index())

print("\n2. CKD Class Distribution:")
print(df['class'].value_counts())
print(f"\nClass percentages:")
print((df['class'].value_counts(normalize=True) * 100))

# Feature distributions for selected features
print("\n" + "=" * 60)
print("FEATURE DISTRIBUTIONS (age, al, urinestate)")
print("=" * 60)

print("\nAge statistics:")
print(df['age'].describe())

print("\nAlbumin (al) statistics:")
print(df['al'].describe())
print(f"\nAlbumin value counts:")
print(df['al'].value_counts().sort_index())

print("\nUrinestate statistics:")
print(df['urinestate'].value_counts())

In [None]:
# Visualizations
fig, axes = plt.subplots(3, 3, figsize=(18, 14))

# Row 1: Feature distributions
axes[0, 0].hist(df['age'], bins=20, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Age Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

al_counts = df['al'].value_counts().sort_index()
axes[0, 1].bar(al_counts.index, al_counts.values, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Albumin (al) Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Albumin Level')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(True, alpha=0.3)

urinestate_counts = df['urinestate'].value_counts().sort_index()
axes[0, 2].bar(urinestate_counts.index, urinestate_counts.values, edgecolor='black', alpha=0.7)
axes[0, 2].set_title('Urinestate Distribution', fontsize=12, fontweight='bold')
axes[0, 2].set_xlabel('Urinestate (0/1)')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].grid(True, alpha=0.3)

# Row 2: Target variable distributions
stage_counts = df['stage'].value_counts().sort_index()
axes[1, 0].bar(range(len(stage_counts)), stage_counts.values,
               tick_label=stage_counts.index, edgecolor='black', alpha=0.7)
axes[1, 0].set_title('CKD Stage Distribution', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Stage')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].grid(True, alpha=0.3)

class_counts = df['class'].value_counts()
axes[1, 1].bar(class_counts.index, class_counts.values, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('CKD Class Distribution', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Class')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(True, alpha=0.3)

# Box plot for age by class
df.boxplot(column='age', by='class', ax=axes[1, 2])
axes[1, 2].set_title('Age Distribution by CKD Class', fontsize=12, fontweight='bold')
axes[1, 2].set_xlabel('Class')
axes[1, 2].set_ylabel('Age')
plt.sca(axes[1, 2])
plt.xticks(rotation=0)

# Row 3: Relationships
scatter = axes[2, 0].scatter(df['age'], df['al'], alpha=0.5, c=df['urinestate'], cmap='coolwarm')
axes[2, 0].set_title('Age vs Albumin (colored by urinestate)', fontsize=12, fontweight='bold')
axes[2, 0].set_xlabel('Age')
axes[2, 0].set_ylabel('Albumin Level')
axes[2, 0].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[2, 0])

# Correlation heatmap for selected features
selected_features = ['age', 'al', 'urinestate', 'rbc', 'pc', 'pcc', 'ba']
corr_matrix = df[selected_features].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, ax=axes[2, 1], cbar_kws={'shrink': 0.8})
axes[2, 1].set_title('Correlation Heatmap', fontsize=12, fontweight='bold')

# Feature importance overview
feature_data = df[['age', 'al', 'urinestate']].copy()
axes[2, 2].boxplot([feature_data['age']/feature_data['age'].max(),
                    feature_data['al']/feature_data['al'].max(),
                    feature_data['urinestate']])
axes[2, 2].set_xticklabels(['Age\n(normalized)', 'Albumin\n(normalized)', 'Urinestate'])
axes[2, 2].set_title('Feature Value Distributions (Normalized)', fontsize=12, fontweight='bold')
axes[2, 2].set_ylabel('Normalized Value')
axes[2, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("EDA visualizations completed!")

## 4. Problem 1: Multi-class Classification for CKD Stage

### Objective: Predict CKD stage (s1/s2/s3/s4/s5)

### Features: age, al, urinestate

We'll train and evaluate multiple classification algorithms:
1. Random Forest Classifier
2. Gradient Boosting Classifier
3. Support Vector Machine (SVM)
4. Logistic Regression

Each model will undergo hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
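One design point in the tuning step: when the scaler is fitted on the whole training split before cross-validation, each validation fold has already contributed to the scaling statistics. A common alternative, sketched below, places `StandardScaler` inside a scikit-learn `Pipeline` so that every fold is scaled using only its own training portion. This is an illustration on synthetic data from `make_classification`, not the CKD data; the `f1_macro` scoring choice and the parameter grid are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature, 3-class stand-in for (age, al, urinestate) -> stage
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# The scaler is a pipeline step, so it is refitted inside every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-prefixed names route each hyperparameter to the right pipeline step
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV macro-F1: {search.best_score_:.4f}")
```

The fitted `search.best_estimator_` is itself a pipeline, so it can be applied to unscaled test features directly.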

In [None]:
# Prepare data for Problem 1: Stage classification
print("=" * 60)
print("PROBLEM 1: MULTI-CLASS CLASSIFICATION FOR CKD STAGE")
print("=" * 60)

# Select features and target
feature_cols = ['age', 'al', 'urinestate']
X_stage = df[feature_cols].copy()
y_stage = df['stage'].copy()

print(f"\nFeatures shape: {X_stage.shape}")
print(f"Target shape: {y_stage.shape}")
print(f"\nTarget classes: {sorted(y_stage.unique())}")
print(f"Class distribution:\n{y_stage.value_counts().sort_index()}")

# Encode target labels
le_stage = LabelEncoder()
y_stage_encoded = le_stage.fit_transform(y_stage)
print(f"\nLabel encoding: {dict(zip(le_stage.classes_, le_stage.transform(le_stage.classes_)))}")

# Train-test split
X_train_stage, X_test_stage, y_train_stage, y_test_stage = train_test_split(
    X_stage, y_stage_encoded, test_size=0.2, random_state=42, stratify=y_stage_encoded
)

print(f"\nTrain set size: {X_train_stage.shape[0]}")
print(f"Test set size: {X_test_stage.shape[0]}")

# Feature scaling
scaler_stage = StandardScaler()
X_train_stage_scaled = scaler_stage.fit_transform(X_train_stage)
X_test_stage_scaled = scaler_stage.transform(X_test_stage)

print("\nData preprocessing completed!")
print(f"Train set shape (scaled): {X_train_stage_scaled.shape}")
print(f"Test set shape (scaled): {X_test_stage_scaled.shape}")

In [None]:
# Train multiple models for Problem 1
print("\nTraining models for stage classification...")
print("=" * 60)

# Dictionary to store models and results
models_stage = {}
results_stage = {}

# 1. Random Forest Classifier
print("\n1. Training Random Forest Classifier...")
rf_stage = RandomForestClassifier(random_state=42)
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
rf_grid_stage = RandomizedSearchCV(rf_stage, rf_param_grid, n_iter=20, cv=5,
                                   random_state=42, n_jobs=-1, verbose=0)
rf_grid_stage.fit(X_train_stage_scaled, y_train_stage)
models_stage['Random Forest'] = rf_grid_stage.best_estimator_
print(f"Best parameters: {rf_grid_stage.best_params_}")
print(f"Best CV score: {rf_grid_stage.best_score_:.4f}")

# 2. Gradient Boosting Classifier
print("\n2. Training Gradient Boosting Classifier...")
gb_stage = GradientBoostingClassifier(random_state=42)
gb_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10]
}
gb_grid_stage = RandomizedSearchCV(gb_stage, gb_param_grid, n_iter=20, cv=5,
                                   random_state=42, n_jobs=-1, verbose=0)
gb_grid_stage.fit(X_train_stage_scaled, y_train_stage)
models_stage['Gradient Boosting'] = gb_grid_stage.best_estimator_
print(f"Best parameters: {gb_grid_stage.best_params_}")
print(f"Best CV score: {gb_grid_stage.best_score_:.4f}")

# 3. Support Vector Machine
print("\n3. Training SVM Classifier...")
svm_stage = SVC(random_state=42, probability=True)
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
svm_grid_stage = GridSearchCV(svm_stage, svm_param_grid, cv=5, n_jobs=-1, verbose=0)
svm_grid_stage.fit(X_train_stage_scaled, y_train_stage)
models_stage['SVM'] = svm_grid_stage.best_estimator_
print(f"Best parameters: {svm_grid_stage.best_params_}")
print(f"Best CV score: {svm_grid_stage.best_score_:.4f}")

# 4. Logistic Regression
print("\n4. Training Logistic Regression...")
# multi_class='multinomial' is omitted: it is deprecated in recent scikit-learn
# releases, and the lbfgs and saga solvers already fit a multinomial model
# for multi-class targets.
lr_stage = LogisticRegression(random_state=42, max_iter=1000)
lr_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'saga'],
    'penalty': ['l2']
}
lr_grid_stage = GridSearchCV(lr_stage, lr_param_grid, cv=5, n_jobs=-1, verbose=0)
lr_grid_stage.fit(X_train_stage_scaled, y_train_stage)
models_stage['Logistic Regression'] = lr_grid_stage.best_estimator_
print(f"Best parameters: {lr_grid_stage.best_params_}")
print(f"Best CV score: {lr_grid_stage.best_score_:.4f}")

print("\nAll models trained successfully!")

In [None]:
# Evaluate models for Problem 1
print("\n" + "=" * 60)
print("MODEL EVALUATION - STAGE CLASSIFICATION")
print("=" * 60)

for name, model in models_stage.items():
    print(f"\n{name}:")
    print("-" * 60)

    # Make predictions
    y_train_pred = model.predict(X_train_stage_scaled)
    y_test_pred = model.predict(X_test_stage_scaled)

    # Calculate metrics (macro average: every stage counts equally)
    train_acc = accuracy_score(y_train_stage, y_train_pred)
    test_acc = accuracy_score(y_test_stage, y_test_pred)
    precision = precision_score(y_test_stage, y_test_pred, average='macro', zero_division=0)
    recall = recall_score(y_test_stage, y_test_pred, average='macro', zero_division=0)
    f1 = f1_score(y_test_stage, y_test_pred, average='macro', zero_division=0)

    # Store results
    results_stage[name] = {
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Predictions': y_test_pred
    }

    print(f"Train Accuracy: {train_acc:.4f}")
    print(f"Test Accuracy:  {test_acc:.4f}")
    print(f"Precision:      {precision:.4f}")
    print(f"Recall:         {recall:.4f}")
    print(f"F1-Score:       {f1:.4f}")

    print(f"\nClassification Report:")
    print(classification_report(y_test_stage, y_test_pred,
                                target_names=le_stage.classes_,
                                zero_division=0))

# Create results DataFrame
results_df_stage = pd.DataFrame(results_stage).T
# The transposed frame has object columns (it also holds prediction arrays),
# so cast the metric columns to float before numeric operations.
results_df_stage_display = results_df_stage.drop('Predictions', axis=1).astype(float)

print("\n" + "=" * 60)
print("MODEL COMPARISON - STAGE CLASSIFICATION")
print("=" * 60)
print(results_df_stage_display.to_string())

# Find best model
best_model_stage = results_df_stage_display['Test Accuracy'].idxmax()
print(f"\n>>> Best Model: {best_model_stage} (Test Accuracy: {results_df_stage_display.loc[best_model_stage, 'Test Accuracy']:.4f})")

In [None]:
# Visualizations for Problem 1
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# Confusion matrices
for idx, (name, model) in enumerate(models_stage.items()):
    row = idx // 3
    col = idx % 3

    y_pred = results_stage[name]['Predictions']
    cm = confusion_matrix(y_test_stage, y_pred)

    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=le_stage.classes_,
                yticklabels=le_stage.classes_,
                ax=axes[row, col], cbar_kws={'shrink': 0.8})
    axes[row, col].set_title(f'{name}\nConfusion Matrix', fontsize=12, fontweight='bold')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')

# Model comparison (cast to float in case the column is stored as object dtype)
test_acc_stage = results_df_stage_display['Test Accuracy'].astype(float)
axes[1, 2].bar(range(len(models_stage)), test_acc_stage,
               tick_label=list(models_stage.keys()), alpha=0.7, edgecolor='black')
axes[1, 2].set_title('Model Comparison - Test Accuracy', fontsize=12, fontweight='bold')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_ylim([0, 1])
axes[1, 2].grid(True, alpha=0.3, axis='y')
axes[1, 2].tick_params(axis='x', rotation=45)
for i, v in enumerate(test_acc_stage):
    axes[1, 2].text(i, v + 0.02, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("Confusion matrices and model comparison visualized!")

In [None]:
# Feature importance for Problem 1
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

for idx, (name, model) in enumerate(models_stage.items()):
    row = idx // 2
    col = idx % 2

    # Calculate permutation importance
    perm_importance = permutation_importance(model, X_test_stage_scaled, y_test_stage,
                                             n_repeats=10, random_state=42, n_jobs=-1)

    # Sort features by importance
    sorted_idx = perm_importance.importances_mean.argsort()

    # Plot
    axes[row, col].barh(range(len(feature_cols)),
                        perm_importance.importances_mean[sorted_idx],
                        xerr=perm_importance.importances_std[sorted_idx],
                        alpha=0.7, edgecolor='black')
    axes[row, col].set_yticks(range(len(feature_cols)))
    axes[row, col].set_yticklabels([feature_cols[i] for i in sorted_idx])
    axes[row, col].set_xlabel('Importance')
    axes[row, col].set_title(f'{name}\nFeature Importance', fontsize=12, fontweight='bold')
    axes[row, col].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("Feature importance analysis completed!")

## 5. Problem 2: Binary Classification for CKD Diagnosis

### Objective: Classify ckd vs notckd

### Features: age, al, urinestate (same as Problem 1)

We'll train and evaluate the same set of classification algorithms with additional ROC and PR curve analysis.

In [None]:
# Prepare data for Problem 2: Class classification
print("=" * 60)
print("PROBLEM 2: BINARY CLASSIFICATION FOR CKD DIAGNOSIS")
print("=" * 60)

# Select features and target
X_class = df[feature_cols].copy()
y_class = df['class'].copy()

print(f"\nFeatures shape: {X_class.shape}")
print(f"Target shape: {y_class.shape}")
print(f"\nTarget classes: {sorted(y_class.unique())}")
print(f"Class distribution:\n{y_class.value_counts()}")

# Encode target labels
le_class = LabelEncoder()
y_class_encoded = le_class.fit_transform(y_class)
print(f"\nLabel encoding: {dict(zip(le_class.classes_, le_class.transform(le_class.classes_)))}")

# Train-test split
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X_class, y_class_encoded, test_size=0.2, random_state=42, stratify=y_class_encoded
)

print(f"\nTrain set size: {X_train_class.shape[0]}")
print(f"Test set size: {X_test_class.shape[0]}")

# Feature scaling
scaler_class = StandardScaler()
X_train_class_scaled = scaler_class.fit_transform(X_train_class)
X_test_class_scaled = scaler_class.transform(X_test_class)

print("\nData preprocessing completed!")
print(f"Train set shape (scaled): {X_train_class_scaled.shape}")
print(f"Test set shape (scaled): {X_test_class_scaled.shape}")

In [None]:
# Train multiple models for Problem 2
print("\nTraining models for class classification...")
print("=" * 60)

# Dictionary to store models and results
models_class = {}
results_class = {}

# 1. Random Forest Classifier
print("\n1. Training Random Forest Classifier...")
rf_class = RandomForestClassifier(random_state=42)
rf_param_grid_class = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
rf_grid_class = RandomizedSearchCV(rf_class, rf_param_grid_class, n_iter=20, cv=5,
                                   random_state=42, n_jobs=-1, verbose=0)
rf_grid_class.fit(X_train_class_scaled, y_train_class)
models_class['Random Forest'] = rf_grid_class.best_estimator_
print(f"Best parameters: {rf_grid_class.best_params_}")
print(f"Best CV score: {rf_grid_class.best_score_:.4f}")

# 2. Gradient Boosting Classifier
print("\n2. Training Gradient Boosting Classifier...")
gb_class = GradientBoostingClassifier(random_state=42)
gb_param_grid_class = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10]
}
gb_grid_class = RandomizedSearchCV(gb_class, gb_param_grid_class, n_iter=20, cv=5,
                                   random_state=42, n_jobs=-1, verbose=0)
gb_grid_class.fit(X_train_class_scaled, y_train_class)
models_class['Gradient Boosting'] = gb_grid_class.best_estimator_
print(f"Best parameters: {gb_grid_class.best_params_}")
print(f"Best CV score: {gb_grid_class.best_score_:.4f}")

# 3. Support Vector Machine
print("\n3. Training SVM Classifier...")
svm_class = SVC(random_state=42, probability=True)
svm_param_grid_class = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
svm_grid_class = GridSearchCV(svm_class, svm_param_grid_class, cv=5, n_jobs=-1, verbose=0)
svm_grid_class.fit(X_train_class_scaled, y_train_class)
models_class['SVM'] = svm_grid_class.best_estimator_
print(f"Best parameters: {svm_grid_class.best_params_}")
print(f"Best CV score: {svm_grid_class.best_score_:.4f}")

# 4. Logistic Regression
print("\n4. Training Logistic Regression...")
lr_class = LogisticRegression(random_state=42, max_iter=1000)
lr_param_grid_class = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2']
}
lr_grid_class = GridSearchCV(lr_class, lr_param_grid_class, cv=5, n_jobs=-1, verbose=0)
lr_grid_class.fit(X_train_class_scaled, y_train_class)
models_class['Logistic Regression'] = lr_grid_class.best_estimator_
print(f"Best parameters: {lr_grid_class.best_params_}")
print(f"Best CV score: {lr_grid_class.best_score_:.4f}")

print("\nAll models trained successfully!")

In [None]:
# Evaluate models for Problem 2
print("\n" + "=" * 60)
print("MODEL EVALUATION - CLASS CLASSIFICATION")
print("=" * 60)

for name, model in models_class.items():
    print(f"\n{name}:")
    print("-" * 60)

    # Make predictions
    y_train_pred = model.predict(X_train_class_scaled)
    y_test_pred = model.predict(X_test_class_scaled)
    # Column 1 is the probability of the class encoded as 1, i.e. le_class.classes_[1]
    y_test_proba = model.predict_proba(X_test_class_scaled)[:, 1]

    # Calculate metrics (binary metrics treat the class encoded as 1 as positive)
    train_acc = accuracy_score(y_train_class, y_train_pred)
    test_acc = accuracy_score(y_test_class, y_test_pred)
    precision = precision_score(y_test_class, y_test_pred, zero_division=0)
    recall = recall_score(y_test_class, y_test_pred, zero_division=0)
    f1 = f1_score(y_test_class, y_test_pred, zero_division=0)
    roc_auc = roc_auc_score(y_test_class, y_test_proba)

    # Store results
    results_class[name] = {
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc,
        'Predictions': y_test_pred,
        'Probabilities': y_test_proba
    }

    print(f"Train Accuracy: {train_acc:.4f}")
    print(f"Test Accuracy:  {test_acc:.4f}")
    print(f"Precision:      {precision:.4f}")
    print(f"Recall:         {recall:.4f}")
    print(f"F1-Score:       {f1:.4f}")
    print(f"ROC-AUC:        {roc_auc:.4f}")

    print(f"\nClassification Report:")
    print(classification_report(y_test_class, y_test_pred,
                                target_names=le_class.classes_,
                                zero_division=0))

# Create results DataFrame
results_df_class = pd.DataFrame(results_class).T
# The transposed frame has object columns (it also holds arrays),
# so cast the metric columns to float before numeric operations.
results_df_class_display = results_df_class.drop(['Predictions', 'Probabilities'], axis=1).astype(float)

print("\n" + "=" * 60)
print("MODEL COMPARISON - CLASS CLASSIFICATION")
print("=" * 60)
print(results_df_class_display.to_string())

# Find best model
best_model_class = results_df_class_display['Test Accuracy'].idxmax()
print(f"\n>>> Best Model: {best_model_class} (Test Accuracy: {results_df_class_display.loc[best_model_class, 'Test Accuracy']:.4f})")
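A detail worth checking in the binary evaluation: `LabelEncoder` assigns codes in sorted label order, so 'ckd' becomes 0 and 'notckd' becomes 1. With scikit-learn's default `pos_label=1`, the precision, recall, and F1 values therefore describe the 'notckd' class. The sketch below uses made-up labels and predictions (not results from this notebook) to show how `pos_label` can be set if 'ckd' should be treated as the positive class.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, only to illustrate the encoding order
labels = np.array(["ckd", "notckd", "ckd", "ckd", "notckd", "ckd"])

encoder = LabelEncoder()
y_true = encoder.fit_transform(labels)          # sorted classes: 'ckd' -> 0, 'notckd' -> 1
ckd_code = int(encoder.transform(["ckd"])[0])   # integer code assigned to 'ckd'

# Hypothetical predictions: one 'ckd' case is wrongly predicted as 'notckd'
y_pred = np.array([0, 1, 1, 0, 1, 0])

recall_default = recall_score(y_true, y_pred)                  # positive class = code 1 ('notckd')
recall_ckd = recall_score(y_true, y_pred, pos_label=ckd_code)  # positive class = 'ckd'

print("Classes in encoded order:", list(encoder.classes_))
print(f"Recall with default pos_label=1 (notckd): {recall_default:.2f}")
print(f"Recall with pos_label={ckd_code} (ckd): {recall_ckd:.2f}")
```

The same consideration applies to `predict_proba(...)[:, 1]`, which is the probability of the class encoded as 1.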

In [None]:
# Visualizations for Problem 2 - Part 1: Confusion Matrices and Comparison
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# Confusion matrices
for idx, (name, model) in enumerate(models_class.items()):
    row = idx // 3
    col = idx % 3

    y_pred = results_class[name]['Predictions']
    cm = confusion_matrix(y_test_class, y_pred)

    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
                xticklabels=le_class.classes_,
                yticklabels=le_class.classes_,
                ax=axes[row, col], cbar_kws={'shrink': 0.8})
    axes[row, col].set_title(f'{name}\nConfusion Matrix', fontsize=12, fontweight='bold')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')

# Model comparison - Test Accuracy (cast to float in case the column is stored as object dtype)
test_acc_class = results_df_class_display['Test Accuracy'].astype(float)
axes[1, 2].bar(range(len(models_class)), test_acc_class,
               tick_label=list(models_class.keys()), alpha=0.7, edgecolor='black', color='steelblue')
axes[1, 2].set_title('Model Comparison - Test Accuracy', fontsize=12, fontweight='bold')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_ylim([0, 1])
axes[1, 2].grid(True, alpha=0.3, axis='y')
axes[1, 2].tick_params(axis='x', rotation=45)
for i, v in enumerate(test_acc_class):
    axes[1, 2].text(i, v + 0.02, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("Confusion matrices and model comparison visualized!")

In [None]:
# Visualizations for Problem 2 - Part 2: ROC and PR Curves
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ROC Curves
for name, result in results_class.items():
    fpr, tpr, _ = roc_curve(y_test_class, result['Probabilities'])
    roc_auc = result['ROC-AUC']
    axes[0].plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {roc_auc:.3f})')

axes[0].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate', fontsize=12, fontweight='bold')
axes[0].set_ylabel('True Positive Rate', fontsize=12, fontweight='bold')
axes[0].set_title('ROC Curves - Binary Classification', fontsize=14, fontweight='bold')
axes[0].legend(loc='lower right', fontsize=10)
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curves
for name, result in results_class.items():
    precision, recall, _ = precision_recall_curve(y_test_class, result['Probabilities'])
    pr_auc = auc(recall, precision)
    axes[1].plot(recall, precision, linewidth=2, label=f'{name} (AUC = {pr_auc:.3f})')

axes[1].set_xlabel('Recall', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Precision', fontsize=12, fontweight='bold')
axes[1].set_title('Precision-Recall Curves - Binary Classification', fontsize=14, fontweight='bold')
axes[1].legend(loc='lower left', fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("ROC and Precision-Recall curves visualized!")
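The PR curves above are summarized with `auc(recall, precision)`, which integrates the curve with the trapezoidal rule. scikit-learn also provides `average_precision_score`, a step-wise summary that avoids linear interpolation between PR points, so the two numbers can differ slightly. The sketch below compares both on made-up scores (an assumption for illustration, not notebook output) alongside ROC-AUC.

```python
import numpy as np
from sklearn.metrics import (auc, average_precision_score,
                             precision_recall_curve, roc_auc_score)

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc_trapezoid = auc(recall, precision)                   # trapezoidal area under the PR curve
avg_precision = average_precision_score(y_true, y_score)    # step-wise weighted mean of precisions
roc_auc = roc_auc_score(y_true, y_score)

print(f"PR-AUC (trapezoid):  {pr_auc_trapezoid:.3f}")
print(f"Average precision:   {avg_precision:.3f}")
print(f"ROC-AUC:             {roc_auc:.3f}")
```

Either summary can be reported; the point is to state which one a given "PR AUC" figure refers to.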

In [None]:
# Feature importance for Problem 2
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

for idx, (name, model) in enumerate(models_class.items()):
    row = idx // 2
    col = idx % 2

    # Calculate permutation importance
    perm_importance = permutation_importance(model, X_test_class_scaled, y_test_class,
                                             n_repeats=10, random_state=42, n_jobs=-1)

    # Sort features by importance
    sorted_idx = perm_importance.importances_mean.argsort()

    # Plot
    axes[row, col].barh(range(len(feature_cols)),
                        perm_importance.importances_mean[sorted_idx],
                        xerr=perm_importance.importances_std[sorted_idx],
                        alpha=0.7, edgecolor='black', color='coral')
    axes[row, col].set_yticks(range(len(feature_cols)))
    axes[row, col].set_yticklabels([feature_cols[i] for i in sorted_idx])
    axes[row, col].set_xlabel('Importance')
    axes[row, col].set_title(f'{name}\nFeature Importance', fontsize=12, fontweight='bold')
    axes[row, col].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("Feature importance analysis completed!")

## 6. Summary and Insights

### Overall Results Summary

Let's review the performance of all models across both classification problems.

In [None]:
# Final summary
print("=" * 80)
print("COMPREHENSIVE CKD RISK FACTOR PREDICTION ANALYSIS - SUMMARY")
print("=" * 80)

print("\n1. DATASET INFORMATION")
print("-" * 80)
print(f"Total samples: {len(df)}")
print(f"Features used: {feature_cols}")
print(f"Target variables: stage (multi-class), class (binary)")

print("\n2. PROBLEM 1: STAGE CLASSIFICATION (Multi-class)")
print("-" * 80)
print(f"Classes: {list(le_stage.classes_)}")
print(f"Number of classes: {len(le_stage.classes_)}")
print("\nModel Performance:")
print(results_df_stage_display.to_string())

print("\n3. PROBLEM 2: CLASS CLASSIFICATION (Binary)")
print("-" * 80)
print(f"Classes: {list(le_class.classes_)}")
print(f"Number of classes: {len(le_class.classes_)}")
print("\nModel Performance:")
print(results_df_class_display.to_string())

print("\n4. KEY INSIGHTS")
print("-" * 80)
print("• Feature 'urinestate' was derived from rbc, pc, pcc, and ba")
print("• Three features (age, al, urinestate) were used for both classification problems")
print("• Multiple models were trained with comprehensive hyperparameter tuning")
print("• GridSearchCV and RandomizedSearchCV were used for optimization")
print("• Comprehensive evaluation metrics were calculated for both problems")
print("• Feature importance analysis revealed key predictors")

print("\n5. RECOMMENDATIONS")
print("-" * 80)
# Cast to float before idxmax: the metric columns may be stored as object dtype
best_stage_model = results_df_stage_display['Test Accuracy'].astype(float).idxmax()
best_class_model = results_df_class_display['Test Accuracy'].astype(float).idxmax()

print(f"• For stage prediction (multi-class): Use {best_stage_model}")
print(f"  - Test Accuracy: {results_df_stage_display.loc[best_stage_model, 'Test Accuracy']:.4f}")
print(f"  - F1-Score: {results_df_stage_display.loc[best_stage_model, 'F1-Score']:.4f}")

print(f"\n• For CKD diagnosis (binary): Use {best_class_model}")
print(f"  - Test Accuracy: {results_df_class_display.loc[best_class_model, 'Test Accuracy']:.4f}")
print(f"  - F1-Score: {results_df_class_display.loc[best_class_model, 'F1-Score']:.4f}")
print(f"  - ROC-AUC: {results_df_class_display.loc[best_class_model, 'ROC-AUC']:.4f}")

print("\n• Consider collecting more data to improve model generalization")
print("• Explore additional feature engineering opportunities")
print("• Consider ensemble methods combining multiple models")
print("• Monitor model performance over time and retrain as needed")

print("\n" + "=" * 80)
print("ANALYSIS COMPLETED SUCCESSFULLY!")
print("=" * 80)
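Because the recommendations rank models by test accuracy, it can help to know what a model that ignores the features would score. The sketch below fits scikit-learn's `DummyClassifier` with the `most_frequent` strategy on a made-up 70/30 class split (an assumption for illustration, not the notebook's data) and reports accuracy next to macro-F1.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 70% class 0, 30% class 1
y_train_demo = np.array([0] * 70 + [1] * 30)
y_test_demo = np.array([0] * 14 + [1] * 6)

# The baseline ignores features, so constant placeholders are enough
X_train_demo = np.zeros((len(y_train_demo), 3))
X_test_demo = np.zeros((len(y_test_demo), 3))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)

baseline_acc = accuracy_score(y_test_demo, baseline_pred)
baseline_f1_macro = f1_score(y_test_demo, baseline_pred, average="macro", zero_division=0)

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Baseline macro-F1: {baseline_f1_macro:.4f}")
```

On imbalanced targets the baseline's accuracy can look respectable while its macro-F1 stays low, which is one reason to compare tuned models on both metrics.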

## Conclusion

This comprehensive analysis demonstrated:
1. **Data Preprocessing**: Successfully handled missing values and created the derived 'urinestate' feature
2. **Feature Engineering**: Utilized three key features (age, al, urinestate) for classification
3. **Model Training**: Trained and tuned 4 different models for 2 classification problems
4. **Evaluation**: Comprehensive metrics including accuracy, precision, recall, F1-score, and ROC-AUC
5. **Visualization**: Multiple plots for understanding data distribution, model performance, and feature importance

### Next Steps:
- Collect more data to improve model robustness
- Explore additional features that may improve prediction accuracy
- Consider advanced ensemble techniques
- Deploy best-performing models for real-world validation
- Continuous monitoring and model retraining with new data