Predicting the cost of treatment and insurance from personal health data using linear regression.
This is my first notebook. It performs Exploratory Data Analysis (EDA) and linear regression on personal health data. Any feedback and constructive criticism are appreciated. The personal health data is hosted on Kaggle. Link: https://www.kaggle.com/mirichoi0218/insurance
- Kaggle Dataset
- Jupyter notebook
- Python: numpy, pandas, matplotlib, and sklearn packages
Once we import the data using read_csv, we use head() to sample the first few rows and identify which columns are numerical and which are categorical.
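A minimal sketch of this loading step follows. It reads a few made-up rows shaped like the dataset so that it runs without the download. The sex and children columns and the file name insurance.csv are assumptions about the Kaggle file, and all values are illustrative only.

```python
import io

import pandas as pd

# Made-up rows in the assumed layout of the Kaggle file (values are illustrative).
csv_text = """age,sex,bmi,children,smoker,region,charges
21,female,26.5,0,yes,southwest,17000.0
45,male,31.2,2,no,southeast,8200.0
33,male,24.8,1,no,northwest,4700.0
58,female,35.1,0,yes,northeast,43000.0
"""

# With the real download this would be pd.read_csv("insurance.csv") (file name assumed).
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first rows of the table
print(data.dtypes)   # text columns are categorical, numeric columns are numerical
```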
We then collect basic descriptive statistics using describe() to understand what the data looks like and what it is trying to tell us.
data.describe()
We then split the data into numerical and categorical data.
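One possible way to do this split is with select_dtypes, sketched below on a small made-up table. The notebook may have done it differently. The sex and children columns are assumed, and the values are illustrative.

```python
import pandas as pd

# Small made-up table in the assumed shape of the dataset.
data = pd.DataFrame({
    "age": [21, 45, 33],
    "sex": ["female", "male", "male"],
    "bmi": [26.5, 31.2, 24.8],
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [17000.0, 8200.0, 4700.0],
})

# Numeric dtypes go to one frame, everything else (text labels) to the other.
numerical = data.select_dtypes(include="number")
categorical = data.select_dtypes(exclude="number")

print(list(numerical.columns))
print(list(categorical.columns))
```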
We then convert the categorical data into numerical data using one-hot encoding.
One-hot encoding is a technique where we replace categorical data with binary digits. Each categorical column is split into as many new columns as it has distinct values. For every row, the column matching its value is given a '1' and the other columns are given a '0'.
We apply one-hot encoding using get_dummies().
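To make the encoding concrete, here is a small sketch on made-up rows. The dtype=int argument is my own choice so the output shows 1 and 0 rather than True and False. The column names follow the smoker_yes and region_southeast names used in this README.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numerical column.
data = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southeast", "northwest", "southeast"],
    "charges": [17000.0, 8200.0, 4700.0],
})

# Each listed column is replaced by one 0/1 column per distinct value,
# named <column>_<value>, for example smoker_yes and region_southeast.
encoded = pd.get_dummies(data, columns=["smoker", "region"], dtype=int)

print(encoded)
```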
We then compute the correlation between features and use a heat map to explore the trends.
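The sketch below shows one way to build such a heat map with pandas and matplotlib only. The data is synthetic (generated so that smoking dominates the charges), so the numbers will not match the real dataset. The color map and the Agg backend line are my own choices.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt

# Synthetic stand-in for the encoded data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
encoded = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker_yes": rng.integers(0, 2, n),
})
encoded["charges"] = (250 * encoded["age"] + 300 * encoded["bmi"]
                      + 24000 * encoded["smoker_yes"] + rng.normal(0, 3000, n))

# Pairwise Pearson correlation between all columns.
corr = encoded.corr()

# Draw the matrix as a colored grid with labeled axes.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ticks = np.arange(len(corr.columns))
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(ticks)
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="correlation")
fig.tight_layout()

print(corr["charges"].sort_values(ascending=False))
```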
From the heat map we can make the following observations:
- Strong correlation between charges and smoker_yes.
- Weak correlation between charges and age.
- Weak correlation between charges and bmi.
- Weak correlation between bmi and region_southeast.

Since the values for the weak correlations are less than 0.5, we treat them as insignificant and drop them.
We then explore the trend between charges and smoker_yes, and use graphs to find the range of treatment charges across patients.
From the graph, we can see that charges start at around 1122, where a high number of patients are concentrated, and reach a maximum of 63770.
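As a rough sketch of this kind of plot, the snippet below overlays histograms of charges for smokers and non-smokers and prints the range. The data is synthetic, so the minimum and maximum will differ from the values quoted above. The bin count and labels are my own choices.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt

# Synthetic stand-in: right-skewed charges, shifted upward for smokers.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"smoker_yes": rng.integers(0, 2, n)})
df["charges"] = rng.gamma(2.0, 4000.0, n) + 23000 * df["smoker_yes"]

# One histogram per group on shared axes.
fig, ax = plt.subplots()
for flag, label in [(0, "non-smoker"), (1, "smoker")]:
    ax.hist(df.loc[df["smoker_yes"] == flag, "charges"], bins=30, alpha=0.6, label=label)
ax.set_xlabel("charges")
ax.set_ylabel("number of patients")
ax.legend()

# Range of charges and the average per group.
means = df.groupby("smoker_yes")["charges"].mean()
print(df["charges"].min(), df["charges"].max())
print(means)
```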
We then predict the patient charges using the other features. We build a linear regression model after importing the package sklearn.linear_model. We split the dataset into a training set and a test set, holding out 30% of the data for testing with test_size=0.3.
We take every column except charges as the predictor variables and charges as the target variable.
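The split described above can be sketched as follows, on synthetic data. The random_state value and the variable names X, y, X_train, and so on are assumptions for illustration. Only test_size=0.3 comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data (not the real dataset).
rng = np.random.default_rng(0)
n = 100
encoded = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker_yes": rng.integers(0, 2, n),
})
encoded["charges"] = (250 * encoded["age"] + 24000 * encoded["smoker_yes"]
                      + rng.normal(0, 3000, n))

# Predictors are every column except the target.
X = encoded.drop(columns=["charges"])
y = encoded["charges"]

# Hold out 30% of the rows for testing; random_state fixes the shuffle (assumed value).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```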
We fit the linear regression model on the training set using fit(). This part is called model fitting. We then check the prediction score on both the training and test sets using score(), which for a linear regression returns the coefficient of determination R2. It comes out to be 79%, which is pretty decent I would say.
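A minimal sketch of fitting and scoring is shown below, assuming the model is LinearRegression from sklearn.linear_model. The data is synthetic, so the scores it prints are not the notebook's 79%. Note that the model is fitted on the training rows only and then scored on both sets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and target (not the real dataset).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker_yes": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 24000 * X["smoker_yes"] + rng.normal(0, 3000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training set only; the test set stays unseen until scoring.
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R2 for regressors.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train R2: {train_score:.3f}, test R2: {test_score:.3f}")
```

Similar scores on the two sets are the pattern to look for. A training score far above the test score would point to overfitting.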
To evaluate our linear regression, we use R2 and mean squared error.
On evaluating our model, it showed an R2 of about 80% on the test data.
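The two metrics can be computed with sklearn.metrics, as in the sketch below. The data is synthetic, so the printed values are not the notebook's results. The square root at the end is an extra I added because it puts the error back in the units of charges.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and target (not the real dataset).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "smoker_yes": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 24000 * X["smoker_yes"] + rng.normal(0, 3000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Compute both metrics on both splits so they can be compared side by side.
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
r2_train, r2_test = r2_score(y_train, pred_train), r2_score(y_test, pred_test)
mse_train = mean_squared_error(y_train, pred_train)
mse_test = mean_squared_error(y_test, pred_test)

print(f"R2   train: {r2_train:.3f}  test: {r2_test:.3f}")
print(f"MSE  train: {mse_train:.0f}  test: {mse_test:.0f}")
print(f"RMSE test: {np.sqrt(mse_test):.0f}")  # same units as charges
```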
From the figure, the R2 and mean squared error values for the training and test data closely match, which suggests the model is not overfitting. From this we conclude that our model is appropriate for predicting patient charges based on their personal health data.
This project is licensed under the MIT License. See License for more details.