Exploratory Data Analysis & predicting medical insurance cost with machine learning.

Luckywijaya/Medical-Insurance-Cost-Prediction

Medical-Insurance-Cost-Prediction

Data source: https://www.kaggle.com/mirichoi0218/insurance

Project Overview

• Explored the dataset with Exploratory Data Analysis to extract insights
• Performed data processing and feature engineering to prepare the data for modeling
• Built a model to predict insurance cost from the features

Exploratory Data Analysis

• The sex and region features are almost evenly balanced, while most people are non-smokers and most fall in the obese weight category

• People who smoke and have a BMI above 30 tend to have higher medical costs

• Older people who smoke have more expensive charges

• People who smoke and are obese have the highest average charges compared to the other groups
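
The last observation above can be checked with a simple group-by. The snippet below is a minimal sketch, not the repository's notebook code: the rows are made up for illustration, the column names (`bmi`, `smoker`, `charges`) are assumed to match the Kaggle file, and obesity is assumed to mean a BMI of 30 or more.

```python
import pandas as pd

# Hypothetical sample rows; in practice, load the Kaggle CSV instead,
# e.g. df = pd.read_csv("insurance.csv") (file name assumed).
df = pd.DataFrame({
    "age":     [19, 45, 52, 30, 61, 27],
    "bmi":     [27.9, 33.1, 35.2, 24.5, 29.0, 31.4],
    "smoker":  ["yes", "no", "yes", "no", "yes", "no"],
    "charges": [16884.9, 8000.0, 42000.0, 3500.0, 28000.0, 4200.0],
})

# Label each person as obese or not, assuming a BMI cut-off of 30.
df["obesity"] = df["bmi"].ge(30).map({True: "obese", False: "not obese"})

# Average charges for every smoker/obesity combination, highest first.
avg_charges = (
    df.groupby(["smoker", "obesity"])["charges"]
      .mean()
      .sort_values(ascending=False)
)
print(avg_charges)
```

On the real dataset, the same `groupby` call would show whether the smoker and obese group sits at the top, as the plot suggests.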

Data Processing

• Check missing values - there are none
• Check duplicate values - there is 1 duplicate row, which is removed
• Feature engineering - create a new column weight_status based on the BMI score
• Feature transformation
  ◦ Encode sex, region, & weight_status
  ◦ Ordinal-encode smoker
• Modeling
  ◦ Separate the target & features
  ◦ Split the data into train & test sets
  ◦ Model with the Linear Regression, Random Forest, Decision Tree, Ridge, & Lasso algorithms
  ◦ Find the best algorithm
  ◦ Tune the hyperparameters
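
The processing steps above could look roughly like the following sketch. Several details are assumptions rather than the repository's actual choices: the sample rows are hypothetical, the `weight_status` bins use the common BMI cut-offs (18.5, 25, 30), the categorical features are one-hot encoded with `pd.get_dummies`, and the split uses a 20% test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows with the Kaggle dataset's column names (assumed).
# The last row repeats the first one to mimic the single duplicate.
df = pd.DataFrame({
    "age":      [19, 18, 28, 33, 32, 46, 60, 19],
    "sex":      ["female", "male", "male", "male",
                 "male", "female", "female", "female"],
    "bmi":      [27.9, 33.8, 33.0, 22.7, 17.5, 33.4, 25.8, 27.9],
    "children": [0, 1, 3, 0, 0, 1, 0, 0],
    "smoker":   ["yes", "no", "no", "no", "no", "no", "yes", "yes"],
    "region":   ["southwest", "southeast", "southeast", "northwest",
                 "northwest", "southeast", "northeast", "southwest"],
    "charges":  [16885.0, 1726.0, 4449.0, 21984.0,
                 3867.0, 8241.0, 28923.0, 16885.0],
})

# Check missing values, then drop exact duplicate rows.
missing = int(df.isna().sum().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Feature engineering: derive weight_status from BMI (assumed cut-offs).
df["weight_status"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,  # intervals are [low, high), so BMI 30 counts as obese
)

# Ordinal encoding for smoker, one-hot encoding for the other categoricals.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
encoded = pd.get_dummies(df, columns=["sex", "region", "weight_status"])

# Separate target and features, then split into train and test sets.
X = encoded.drop(columns="charges")
y = encoded["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(missing, X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible, so later model comparisons run on the same rows.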

Model Evaluation

| Score | LinearRegression | DecisionTree | RandomForest | Ridge |
|---|---|---|---|---|
| MAE | 4305.20 | 2798.83 | 2608.55 | 4311.10 |
| RMSE | 6209.88 | 6067.50 | 4841.88 | 6238.13 |
| R2 | 0.77 | 0.78 | 0.78 | 0.86 |
| Train Accuracy | 0.74 | 1.0 | 0.97 | 0.74 |
| Test Accuracy | 0.77 | 0.78 | 0.86 | 0.77 |
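
For reference, scores like these can be produced with scikit-learn as sketched below. This is an illustration under stated assumptions, not the repository's code: the data is synthetic (generated with NumPy, so the numbers will not match the table), all models use default hyperparameters, and "Train Accuracy" and "Test Accuracy" are assumed to be the regressors' `score()` values, which is R2 for scikit-learn regressors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance data (hypothetical relationship).
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(random_state=42),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": float(mean_absolute_error(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": float(r2_score(y_test, pred)),
        # For regressors, score() returns R2 on the given data.
        "train_score": float(model.score(X_train, y_train)),
        "test_score": float(model.score(X_test, y_test)),
    }

for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

A large gap between `train_score` and `test_score` is the usual sign of overfitting, which is why both are reported alongside the error metrics.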

Conclusion

Based on the predictive modeling, Linear Regression is selected as the best algorithm, with an MAE of 4305.20, an RMSE of 6209.88, and an R2 of 0.77. Its train and test accuracy (0.74 and 0.77) are close, which indicates a good fit, whereas Decision Tree (1.0 vs 0.78) and Random Forest (0.97 vs 0.86) show a wider gap between train and test accuracy.