In machine learning, imbalanced data occurs when one class (the majority) heavily outnumbers the other (the minority).
This often leads to biased models that predict the majority class more accurately, while failing to identify the minority class — which is usually of greater interest (e.g., fraud detection, churn, claims).
This project demonstrates handling imbalanced datasets and building a robust classification model using Python (pandas, scikit-learn, seaborn, matplotlib).
Data Exploration
- Load and inspect the insurance claims dataset (58,592 rows × 41 features).
- Check for missing values, feature types, and target distribution.
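The inspection steps above can be sketched as follows. The real dataset has 58,592 rows × 41 features; the small synthetic frame here (column names taken from the project, values randomly generated) is just a stand-in so the calls are runnable:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance claims data (column names from the
# project; values are random -- the real dataset is 58,592 rows x 41 features).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subscription_length": rng.uniform(0, 14, 1000),
    "vehicle_age": rng.uniform(0, 20, 1000),
    "customer_age": rng.integers(18, 80, 1000),
    "claim_status": rng.choice([0, 1], size=1000, p=[0.94, 0.06]),
})

# Basic inspection: shape, dtypes, missing values, and target distribution
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df["claim_status"].value_counts(normalize=True))
```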
Exploratory Data Analysis (EDA)
- Visualized class imbalance (`claim_status` variable).
- Analyzed distributions of numerical features like `subscription_length`, `vehicle_age`, `customer_age`.
- Explored categorical variables such as `region_code`, `segment`, `fuel_type`.
Handling Imbalance
- Applied oversampling (`resample`) to the minority class.
- Balanced dataset: 54,844 samples in each class.
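With `sklearn.utils.resample`, oversampling means sampling the minority class with replacement until it matches the majority count. A sketch on synthetic data (column names assumed from the project):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "claim_status": rng.choice([0, 1], size=1000, p=[0.94, 0.06]),
})

majority = df[df["claim_status"] == 0]
minority = df[df["claim_status"] == 1]

# Upsample the minority class with replacement to match the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["claim_status"].value_counts())
```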
Feature Selection
- Used Random Forest feature importance to identify key predictors: `subscription_length`, `customer_age`, `vehicle_age`, `region_density`, etc.
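Feature importance from a fitted Random Forest can be read straight off `feature_importances_`. A sketch with the project's feature names but a synthetic target (constructed to correlate with `subscription_length` purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = pd.DataFrame({
    "subscription_length": rng.uniform(0, 14, 500),
    "customer_age": rng.integers(18, 80, 500),
    "vehicle_age": rng.uniform(0, 20, 500),
    "region_density": rng.integers(100, 50000, 500),
})
# Synthetic target correlated with subscription_length, for illustration only
y = (X["subscription_length"] + rng.normal(0, 2, 500) > 7).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```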
Model Training
- Trained a Random Forest Classifier on the oversampled data.
- Evaluation metrics: Precision, Recall, F1-score, Accuracy.
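The train-and-evaluate step might look like the sketch below, using a synthetic balanced dataset in place of the oversampled claims data. One caveat worth noting: to avoid leaking duplicated minority rows into the test set, oversampling is generally best applied after the train/test split, to the training set only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic balanced data standing in for the oversampled claims dataset
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.5, 0.5], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Per-class precision, recall, and F1 alongside overall accuracy
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```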
Model Performance
- Achieved 99% accuracy with balanced precision and recall; F1-score of 0.99 for both classes.
- High recall for the minority class (claims), ensuring effective claim detection.
- Random Forest proved robust for handling mixed categorical and numerical features.
Tech Stack
- Python: pandas, NumPy, Matplotlib, Seaborn, scikit-learn
- Model: Random Forest Classifier
- Resampling: `sklearn.utils.resample`
Key Takeaways
- Imbalanced datasets require special treatment (oversampling, undersampling, or algorithmic approaches such as class weighting).
- Metrics like Precision, Recall, F1-score, and AUROC are more reliable than Accuracy on imbalanced data.
- Random Forest combined with oversampling achieved strong results on this dataset.
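The point about Accuracy being unreliable is easy to demonstrate: on a 94%-negative dataset, a model that always predicts "no claim" scores 94% accuracy while catching zero claims.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 94% negatives, 6% positives -- a "model" that always predicts "no claim"
y_true = np.array([0] * 940 + [1] * 60)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))              # 0.94 -- looks great
print(recall_score(y_true, y_pred))                # 0.0  -- misses every claim
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0
```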