# Fork 1 – Baseline Logistic Regression

This notebook presents a baseline Logistic Regression model for classifying **Cardiomegaly** from heart and lung features.
It is intended as a clean, professional, reproducible report for KN AI MED.

## Contents
1. Problem Overview
2. Data Description
3. Preprocessing
4. Model Overview
5. Evaluation Metrics
6. Visualization
7. Conclusions

## 1. Problem Overview
- **Task:** Binary classification of **Cardiomegaly** based on heart and lung features.
- **Target:** `Cardiomegaly` (0 = absent, 1 = present)
- **Features:**
  - Heart width, Lung width
  - CTR – Cardiothoracic Ratio
  - Inertia tensor components: xx, yy, xy, normalized_diff
  - Inscribed circle radius
  - Polygon Area Ratio
  - Heart perimeter, Heart area
  - Lung area
- **Baseline Model:** Logistic Regression
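
For orientation, CTR is conventionally the widest horizontal heart diameter divided by the widest internal thoracic diameter on a frontal chest radiograph, and a value above roughly 0.5 is the usual screening threshold for cardiomegaly. The sketch below assumes the dataset's heart width and lung width play those two roles; that mapping, the function name `cardiothoracic_ratio`, and the sample values are assumptions, not something the data description confirms.

```python
def cardiothoracic_ratio(heart_width: float, thoracic_width: float) -> float:
    """Return heart width divided by thoracic width (both in the same unit)."""
    if thoracic_width <= 0:
        raise ValueError("thoracic_width must be positive")
    return heart_width / thoracic_width


# Hypothetical measurements in millimetres, not taken from task_data.csv.
ctr = cardiothoracic_ratio(heart_width=150.0, thoracic_width=300.0)
print(f"CTR = {ctr:.2f}")
```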

## 2. Data Description
- Data source: `task_data.csv`
- Number of samples: [insert number]
- Columns include both numeric features and the target column.
- Some numeric columns use commas as decimal separators – these need conversion.
- Missing values should be handled (impute or drop) before modeling.
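
A minimal sketch of the cleaning steps listed above, using a tiny in-memory frame instead of `task_data.csv`. The helper name `commas_to_floats`, the sample values, and the choice to drop incomplete rows (rather than impute) are illustrative assumptions.

```python
import pandas as pd


def commas_to_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which text columns such as '0,52' are parsed as floats."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Swap the decimal comma for a dot; unparseable entries become NaN.
            as_text = out[col].astype(str).str.replace(",", ".", regex=False)
            out[col] = pd.to_numeric(as_text, errors="coerce")
    return out


# Hypothetical rows; the real column names and values may differ.
raw = pd.DataFrame(
    {
        "CTR": ["0,52", "0,47", None],
        "Heart width": [172, 150, 161],
        "Cardiomegaly": [1, 0, 1],
    }
)

clean = commas_to_floats(raw)
missing_per_column = clean.isna().sum()  # inspect before deciding how to handle gaps
clean = clean.dropna()                   # simplest option: drop incomplete rows
print(missing_per_column)
```

If the file's field delimiter is not a comma, `pd.read_csv` can also parse decimal commas at load time through its `decimal=","` parameter.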

## 3. Preprocessing
- Convert numeric columns with commas to floats.
- Separate features (`X`) and target (`y`).
- Split data into training (80%) and testing (20%) sets.
- Apply standard scaling to numeric features, fitting the scaler on the training set only so that test-set statistics do not leak into the model.
- Optional: Check the class balance; if the classes are imbalanced, use a stratified split and consider class weighting.
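
The split and scaling steps above could look roughly like this. Synthetic data from `make_classification` stands in for the real features (12 columns, to match the feature list), and `random_state=42` is an arbitrary choice for reproducibility; both are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; replace with the cleaned task data.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=5,
    weights=[0.3, 0.7], random_state=42,
)

# 80/20 split; stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("positive rate (train):", y_train.mean())
print("positive rate (test): ", y_test.mean())
```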

## 4. Model Overview
- **Baseline Logistic Regression**:
  - Simple linear model for binary classification.
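  - Hyperparameters: `max_iter=1000` to give the solver enough iterations to converge.
  - Regularization type (L1/L2) and strength can be tuned if needed; note that scikit-learn's `LogisticRegression` applies L2 regularization by default.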
- Model pipeline includes scaling and classifier for reproducibility.
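
A sketch of such a pipeline with scikit-learn, again on synthetic stand-in data (an assumption). Wrapping the scaler and the classifier together means the scaler is refit on whatever training data the pipeline receives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling and classification in one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```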

## 5. Evaluation Metrics
- **Accuracy:** Overall fraction of correct predictions.
- **Confusion Matrix:** Counts of True Positives, True Negatives, False Positives, False Negatives.
- **Precision / Recall / F1-Score:**
  - Precision: Proportion of positive predictions that are correct.
  - Recall: Proportion of actual positives correctly predicted.
  - F1-Score: Harmonic mean of precision and recall.
- **ROC Curve / AUC:** Performance across classification thresholds; an area under the curve (AUC) of 0.5 corresponds to random ranking and 1.0 to perfect separation.
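
To make the definitions above concrete, here is a small worked example on ten hand-made labels and scores (hypothetical values, not model output). With a 0.5 threshold it yields 5 true positives, 3 true negatives, 1 false positive, and 1 false negative.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)

# Hypothetical ground truth and predicted probabilities for class 1.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.2, 0.6, 0.1, 0.85, 0.3, 0.75])
y_pred = (y_score >= 0.5).astype(int)  # default decision threshold

# Rows are true classes, columns are predicted classes (order: 0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 8/10
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 5/6
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 5/6
f1 = f1_score(y_true, y_pred)                # harmonic mean = 5/6
auc = roc_auc_score(y_true, y_score)         # uses scores, not hard labels

print(tn, fp, fn, tp, accuracy, precision, recall, f1, auc)
```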

## 6. Visualization
- **Confusion Matrix Heatmap:** Easily interpret misclassifications.
- **ROC Curve:** Visualize trade-off between true positive rate and false positive rate.
- Optional: Feature importance (coefficients) visualization for Logistic Regression.

## 7. Conclusions
- Baseline Logistic Regression provides a reference performance.
- Class 1 is predicted better than Class 0, which may indicate class imbalance.
- Cross-validation recommended to obtain more stable metrics.
- Next steps / Fork 2:
  - Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
  - Ensemble models (Random Forest, VotingClassifier)
  - Feature selection and importance analysis
  - Visualizations for KN-ready presentation
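
As a starting point for the cross-validation recommended above, the following sketch scores the baseline pipeline with stratified 5-fold cross-validation. The synthetic data, the fold count, and the `roc_auc` scoring choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=5, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a steadier estimate than a single 80/20 split, which is useful when comparing against the Fork 2 models.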