## Construction of Basic Models and Classification

The main task was binary classification (predicting the occurrence of the disease). The training process proceeded as follows:
* **Preprocessing:** Applied *One-Hot Encoding* for categorical variables and standardization (*StandardScaler*) for numerical variables.
* **Data Split:** Training set (80%) and Test set (20%) with stratification.
* **Model:** Initially, Logistic Regression and Random Forest were used.
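
The steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the synthetic data, the column names (`age`, `bmi`, `gender`), the helper `make_preprocess`, and the `random_state` values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; the real dataset and its column names are not shown here.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.normal(50, 12, n),
    "bmi": rng.normal(27, 4, n),
    "gender": rng.choice(["F", "M"], n),
})
y = (X["bmi"] + rng.normal(0, 3, n) > 29).astype(int)


def make_preprocess():
    """Build a fresh preprocessor: one-hot for categorical, scaling for numerical."""
    return ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
        ("num", StandardScaler(), ["age", "bmi"]),
    ])


# 80/20 split, stratified on the target so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "LR": Pipeline([("prep", make_preprocess()),
                    ("clf", LogisticRegression(max_iter=1000))]),
    "RF": Pipeline([("prep", make_preprocess()),
                    ("clf", RandomForestClassifier(random_state=42))]),
}

# Fit each pipeline on the training set and record test-set accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```

Wrapping the preprocessing in a `Pipeline` means the encoder and scaler are fitted on the training set only, which avoids leaking test-set statistics into the model.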

### Model Performance Comparison

Two models were evaluated: Random Forest and Logistic Regression. The table below compares key quality metrics; Precision, Recall, and F1-Score refer to the positive class (1, Diabetic).

| Metric | Random Forest (RF) | Logistic Regression (LR) |
| :--- | :---: | :---: |
| **Accuracy** | **0.901** | 0.894 |
| **ROC AUC** | **0.948** | 0.942 |
| **Precision (Class 1)** | **0.867** | 0.866 |
| **Recall (Class 1)** | **0.797** | 0.770 |
| **F1-Score (Class 1)** | **0.830** | 0.815 |

The **Random Forest** model achieved slightly better results across all key indicators.
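
As a quick consistency check, the F1-Score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name below is hypothetical; the inputs are the rounded values from the table, so the result can differ from the reported F1 in the third decimal place.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded class-1 values taken from the comparison table.
rf_f1 = f1_from_precision_recall(0.867, 0.797)
lr_f1 = f1_from_precision_recall(0.866, 0.770)

print(f"RF F1 ~ {rf_f1:.3f}, LR F1 ~ {lr_f1:.3f}")
```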

### Detailed Classification Report

| Model | Class | Precision | Recall | F1-Score |
| :--- | :---: | :---: | :---: | :---: |
| **Random Forest** | 0 (Healthy) | 0.91 | 0.95 | 0.93 |
| | 1 (Diabetic) | 0.87 | 0.80 | 0.83 |
| **Logistic Regression** | 0 (Healthy) | 0.90 | 0.95 | 0.93 |
| | 1 (Diabetic) | 0.87 | 0.77 | 0.82 |

---

## Experiments and Training Process Optimization

To achieve the highest possible predictive quality, a series of experiments was conducted to examine the impact of preprocessing, feature selection, and validation strategies on the final results.

### Threshold Tuning

In medical diagnostics, the standard threshold of 0.5 is often not optimal because the cost of missing a disease (False Negative) is higher than the cost of a false alarm. An analysis was conducted for the standard threshold of **0.5** and a lowered threshold of **0.3**.

| Threshold | Accuracy | Precision | Recall | F1-Score |
| :---: | :---: | :---: | :---: | :---: |
| 0.5 | **0.905** | **0.874** | 0.801 | **0.836** |
| **0.3** | 0.890 | 0.782 | **0.881** | 0.829 |

**Conclusion:** Lowering the threshold to **0.3** increased sensitivity (Recall) from 80.1% to **88.1%**, which is crucial in a medical decision support system. The cost is lower Precision (0.782 instead of 0.874), meaning more false alarms.
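
The effect of moving the decision threshold can be illustrated with a small sketch. The labels and probabilities below are made-up toy values, and `metrics_at_threshold` is a hypothetical helper; with a fitted scikit-learn classifier the probabilities would typically come from `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np


def metrics_at_threshold(y_true, proba, threshold):
    """Precision and recall for class 1 when predicting 1 if proba >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(proba) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


# Toy data: four positive and four negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.1, 0.05]

at_05 = metrics_at_threshold(y_true, proba, 0.5)
at_03 = metrics_at_threshold(y_true, proba, 0.3)
```

In this toy case, lowering the threshold catches one more positive case (higher recall) but also flags one more healthy case (lower precision), which is the same trade-off as in the table.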

### Strategy for Handling Errors in Categorical Data

Two strategies were tested for fixing invalid values in the `smoking` and `drinking` columns:
1.  **DROP:** Removing rows with invalid values.
2.  **CLIP:** Clipping values to the allowed range.

| Strategy | Accuracy | Precision | Recall | ROC AUC |
| :--- | :---: | :---: | :---: | :---: |
| DROP | **0.925** | 0.667 | 0.771 | 0.948 |
| **CLIP** | 0.890 | **0.782** | **0.881** | **0.955** |

**Conclusion:** The **CLIP** strategy was chosen. Although DROP gave higher accuracy (0.925), it reduced both recall (0.771 vs. 0.881) and precision (0.667 vs. 0.782).
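
A minimal sketch of the two strategies with pandas is shown below. It assumes the columns hold integer category codes with a valid range of 0 to 2; that range, the `ALLOWED` mapping, the `fix_categorical` helper, and the toy data are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical valid code ranges (inclusive) for each column.
ALLOWED = {"smoking": (0, 2), "drinking": (0, 2)}


def fix_categorical(df: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Return a copy of df with invalid codes either clipped or dropped."""
    out = df.copy()
    if strategy == "clip":
        # Pull out-of-range codes back to the nearest allowed value.
        for col, (lo, hi) in ALLOWED.items():
            out[col] = out[col].clip(lower=lo, upper=hi)
        return out
    if strategy == "drop":
        # Keep only rows where every checked column is within range.
        valid = pd.Series(True, index=out.index)
        for col, (lo, hi) in ALLOWED.items():
            valid &= out[col].between(lo, hi)
        return out[valid]
    raise ValueError(f"unknown strategy: {strategy!r}")


df = pd.DataFrame({
    "smoking": [0, 1, 5, 2],
    "drinking": [-1, 0, 1, 2],
    "diabetes": [0, 1, 1, 0],
})

dropped = fix_categorical(df, "drop")
clipped = fix_categorical(df, "clip")
```

DROP shrinks the dataset and can discard positive cases along with the invalid values, whereas CLIP keeps every row; this may partly explain the difference in recall, although the experiment above does not isolate the cause.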

### Impact of Removing Outliers

The impact of removing anomalies detected using the IQR method was examined.

| Dataset | Accuracy | Recall | F1-Score | ROC AUC |
| :--- | :---: | :---: | :---: | :---: |
| Full (With Outliers) | 0.890 | **0.881** | **0.816** | **0.944** |
| Without Outliers | **0.894** | 0.588 | 0.603 | 0.891 |

**Conclusion:** Removing outliers led to a drastic drop in Recall (from 88.1% to 58.8%). A decision was made **not to remove** anomalies, as extreme values likely represent strong predictors of the disease.
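
For reference, IQR-based outlier detection can be sketched as below. The multiplier `k=1.5` is the conventional choice and is an assumption here, as are the helper name and the toy `glucose` values; the report does not state which multiplier or columns were used.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy measurements with one extreme value.
glucose = np.array([85, 90, 92, 95, 98, 100, 105, 110, 250])
mask = iqr_outlier_mask(glucose)
without_outliers = glucose[~mask]
```

In the toy data the single extreme reading is flagged. In a diabetes dataset such extreme readings may belong to exactly the patients the model should detect, which is consistent with the recall drop observed above.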

### Feature Selection

| Feature Set | Accuracy | Recall | F1-Score | ROC AUC |
| :--- | :---: | :---: | :---: | :---: |
| All Features | 0.890 | 0.881 | 0.828 | 0.954 |
| Top 15 Features | 0.887 | 0.885 | 0.826 | 0.954 |
| Top 9 Features | 0.882 | 0.877 | 0.817 | 0.946 |
| Top 6 Features | 0.880 | 0.877 | 0.816 | 0.944 |

**Conclusion:** It was decided to keep **all features**: reducing the set brought no gain in ROC AUC (the top 15 features only match the full set at 0.954), and smaller subsets lowered it.
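
One way to run such a comparison is to rank features by Random Forest importance and retrain on the top-k subset. The report does not say how the features were ranked, so importance-based ranking is an assumption here, as are the synthetic data from `make_classification` and all parameter values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with 18 features, standing in for the real dataset.
X, y = make_classification(n_samples=600, n_features=18, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Rank features by importance, using a model fitted on the training set only.
ranker = RandomForestClassifier(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
order = np.argsort(ranker.feature_importances_)[::-1]

# Retrain on each top-k subset and record test-set ROC AUC.
auc_by_k = {}
for k in (18, 15, 9, 6):
    cols = order[:k]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, cols], y_train)
    proba = model.predict_proba(X_test[:, cols])[:, 1]
    auc_by_k[k] = roc_auc_score(y_test, proba)
```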

---

## Model Tuning and Validation Strategy

### Hyperparameter Optimization

The Random Forest was configured with the following hyperparameters:

* `n_estimators`: 300
* `max_depth`: 20
* `class_weight`: 'balanced' (impact was negligible)

### Validation: Hold-out vs. Cross-Validation

| Validation Method | Recall | Std Dev | Precision | ROC AUC |
| :--- | :---: | :---: | :---: | :---: |
| Hold-out (Thr=0.3) | **0.881** | - | 0.774 | 0.955 |
| Hold-out (Thr=0.5) | 0.813 | - | 0.909 | 0.955 |
| 5-Fold CV (Thr=0.5) | 0.824 | +/- 0.022 | **0.886** | **0.963** |

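The results suggest that the model is stable and not noticeably overfitted: the hold-out recall at threshold 0.5 (0.813) lies within one standard deviation of the 5-fold CV mean (0.824 +/- 0.022), and the ROC AUC values are close (0.955 vs. 0.963).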
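
A stratified 5-fold cross-validation with the hyperparameters listed above could look like the sketch below. The synthetic data, the class ratio, and the `random_state` values are assumptions; the built-in `recall` and `precision` scorers use the model's default predictions, which corresponds to a threshold of 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced stand-in data with 18 features.
X, y = make_classification(n_samples=500, n_features=18, weights=[0.7, 0.3],
                           random_state=0)

model = RandomForestClassifier(
    n_estimators=300, max_depth=20, class_weight="balanced", random_state=0
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv,
                         scoring=["recall", "precision", "roc_auc"])

recall_mean = results["test_recall"].mean()
recall_std = results["test_recall"].std()
print(f"Recall: {recall_mean:.3f} +/- {recall_std:.3f}")
```

Reporting the standard deviation across folds alongside the mean gives a rough sense of how much a single hold-out estimate might vary.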

### Final Results - Model Card

| Parameter / Metric | Value / Description |
| :--- | :--- |
| **Model** | **Random Forest** |
| Number of Trees | 300 |
| Max Depth | 20 |
| Threshold | **0.3** |
| **Data Strategy** | |
| Categorical Fix | CLIP |
| Outliers | Kept (Not removed) |
| Features | All (18) |
| **Results (Test Set)** | |
| **Recall (Sensitivity)** | **0.881** |
| Precision | 0.782 |
| F1-Score | 0.829 |
| Accuracy | 0.890 |
| ROC AUC | 0.955 |

## Final Summary

The final **Random Forest** model shows high diagnostic effectiveness on the test set:
* It achieved a **ROC AUC of approximately 0.95**.
* **Sensitivity (Recall) reached approximately 88%** at a decision threshold of 0.3, meaning the model detects most diabetic patients.
* Aggressive data cleaning (removing IQR-detected outliers) proved detrimental to model quality for this problem: recall fell from 0.881 to 0.588.