Forecasting Healthcare Costs

Predicting the cost of treatment and insurance using regression by leveraging personal health data.

Motivation

This is my first notebook. In it, I perform Exploratory Data Analysis (EDA) and linear regression on personal health data. Any feedback and constructive criticism is appreciated. The personal health data is hosted on Kaggle. Link: https://www.kaggle.com/mirichoi0218/insurance

Components

Kaggle Dataset
Jupyter notebook
Python: numpy, pandas, matplotlib, scikit-learn packages

Model Implementation

1. Import Data

Once we import the data using read_csv(), we use head() to look at the first few rows. From this sample we try to identify which columns are numerical and which are categorical.
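The import step could look roughly like the sketch below. To keep it runnable without the download, it reads a few invented rows from an in-memory string; the file name insurance.csv in the comment is an assumption about how the Kaggle file is saved locally, and the column set is limited to the columns this README mentions.

```python
import io

import pandas as pd

# Invented toy rows standing in for the Kaggle CSV (not real records).
csv_text = """age,bmi,smoker,region,charges
23,24.5,no,northeast,2400.0
45,31.2,yes,southeast,39000.0
36,28.0,no,southwest,5600.0
52,33.1,no,northwest,11000.0
29,22.8,yes,southeast,18500.0
61,29.4,no,northeast,13500.0
"""

# With the real file this would be something like pd.read_csv("insurance.csv")
# (assumed file name).
data = pd.read_csv(io.StringIO(csv_text))

# Look at the first rows and the inferred column types.
print(data.head())
print(data.dtypes)
```

Columns that pandas reads as text (here smoker and region) are the categorical candidates; the rest are numerical.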

2. Data Preprocessing

We proceed to collect basic descriptive stats using describe(). We try to understand what the data looks like and what it is trying to tell us.

data.describe()

We then split the data into numerical and categorical data.
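One possible way to do this split is with select_dtypes(), sketched below on invented toy rows. The notebook may do the split differently; this only illustrates the idea.

```python
import pandas as pd

# Invented toy rows (not real records from the dataset).
data = pd.DataFrame({
    "age": [23, 45, 36, 52],
    "bmi": [24.5, 31.2, 28.0, 33.1],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["northeast", "southeast", "southwest", "northwest"],
    "charges": [2400.0, 39000.0, 5600.0, 11000.0],
})

# Numeric dtypes go to one frame, everything else to the other.
numerical = data.select_dtypes(include="number")
categorical = data.select_dtypes(exclude="number")

print(list(numerical.columns))
print(list(categorical.columns))
```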

We then convert the categorical data into numerical data using one-hot encoding.

One-hot encoding replaces a categorical column with binary indicator columns, one per distinct value in the original column. For each row, the column matching that row's category gets a '1' and the other columns get a '0'.

In pandas, we apply one-hot encoding with get_dummies().
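A minimal sketch of get_dummies() on invented toy rows is shown below. The dtype=int argument is an assumption made here so the indicators print as 0 and 1; whether the notebook passes extra options such as drop_first is not stated.

```python
import pandas as pd

# Invented toy rows (not real records from the dataset).
data = pd.DataFrame({
    "age": [23, 45, 36, 29],
    "smoker": ["no", "yes", "no", "yes"],
    "region": ["northeast", "southeast", "southwest", "southeast"],
})

# Each categorical column becomes one indicator column per distinct value,
# named <column>_<value>, for example smoker_yes or region_southeast.
encoded = pd.get_dummies(data, columns=["smoker", "region"], dtype=int)

print(encoded)
```

Exactly one indicator per original column is 1 in every row, which is what makes the encoding "one-hot".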

3. Exploratory Data Analysis (EDA)

We then compute the correlation between features and plot it as a heat map to explore the trends.
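The sketch below shows one way to build such a heat map with plain matplotlib, assuming Pearson correlation from DataFrame.corr(). The rows are invented, so the numbers it prints are not the ones from the real dataset, and the notebook may use a different plotting helper.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented toy rows, already one-hot encoded (not real records).
encoded = pd.DataFrame({
    "age": [23, 45, 36, 52, 29, 61],
    "bmi": [24.5, 31.2, 28.0, 33.1, 22.8, 29.4],
    "smoker_yes": [0, 1, 0, 0, 1, 0],
    "charges": [2400.0, 39000.0, 5600.0, 11000.0, 18500.0, 13500.0],
})

# Pairwise Pearson correlation between all columns.
corr = encoded.corr()

# Draw the matrix as a colored grid with the value written in each cell.
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```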

From the heat map, we can make the following observations:

Strong correlation between charges and smoker_yes.
Weak correlation between charges and age.
Weak correlation between charges and bmi.
Weak correlation between bmi and region_southeast.

Since the values for the weak correlations are all below 0.5, we treat them as insignificant and drop them.
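The 0.5 cutoff described above can be applied programmatically, as in this sketch. It uses invented rows and compares the absolute correlation with charges against the threshold; the variable names are hypothetical and the resulting groups will differ from those on the real data.

```python
import pandas as pd

# Invented toy rows, already one-hot encoded (not real records).
encoded = pd.DataFrame({
    "age": [23, 45, 36, 52, 29, 61],
    "bmi": [24.5, 31.2, 28.0, 33.1, 22.8, 29.4],
    "smoker_yes": [0, 1, 0, 0, 1, 0],
    "charges": [2400.0, 39000.0, 5600.0, 11000.0, 18500.0, 13500.0],
})

threshold = 0.5  # cutoff taken from the text above

# Correlation of every feature with the target, excluding the target itself.
corr_with_charges = encoded.corr()["charges"].drop("charges")

# Split features by the absolute value of their correlation.
strong = corr_with_charges[corr_with_charges.abs() >= threshold]
weak = corr_with_charges[corr_with_charges.abs() < threshold]

print("strong:", strong.round(2).to_dict())
print("weak:", weak.round(2).to_dict())
```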

Next, we explore the trend between charges and smoker_yes, and use graphs to find the range of treatment charges across patients.

From the graph, we can see that the minimum charges are around 1122, with a high number of patients near that end, and that the maximum is 63770.
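The same range can be read off numerically, as sketched below on invented rows (so the minimum and maximum here are not the 1122 and 63770 reported above). The histogram counts are one way to see where most patients fall.

```python
import numpy as np
import pandas as pd

# Invented toy rows (not real records from the dataset).
data = pd.DataFrame({
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
    "charges": [2400.0, 39000.0, 5600.0, 11000.0, 18500.0, 13500.0],
})

# Overall range of the charges column.
print(data["charges"].min(), data["charges"].max())

# Range and mean of charges for smokers and non-smokers.
summary = data.groupby("smoker")["charges"].agg(["min", "max", "mean"])
print(summary)

# Number of patients per charge bin, i.e. the data behind a histogram.
counts, bin_edges = np.histogram(data["charges"], bins=4)
print(counts, bin_edges)
```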

4. Model Building

We then predict the patient charges using the other features. We build a linear regression model after importing the package sklearn.linear_model. We take the predictor variables as every column except charges, and the target variable as charges. We split the data set into a training set and a test set, using 30% of the data for testing with test_size=0.3.

We fit the linear regression model on the training set using fit(). This step is called model fitting. We then check the score on both the training and test sets using score(), which for linear regression returns R². It comes out to be 79%, which is pretty decent I would say.
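The steps above can be sketched as follows. The data here is synthetic (randomly generated with a made-up relationship), so the scores it prints are unrelated to the 79% reported above; random_state is an assumption added only to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset (made-up relationship).
rng = np.random.default_rng(0)
n = 200
encoded = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 5, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
encoded["charges"] = (
    250 * encoded["age"]
    + 300 * encoded["bmi"]
    + 20000 * encoded["smoker_yes"]
    + rng.normal(0, 3000, size=n)
)

# Predictors are every column except the target.
X = encoded.drop(columns="charges")
y = encoded["charges"]

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training set only, then score (R²) on both sets.
model = LinearRegression()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train R²: {train_score:.3f}, test R²: {test_score:.3f}")
```

Fitting only on the training rows keeps the test score an honest estimate of how the model handles unseen patients.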

5. Model Evaluation

To evaluate our linear regression, we use R² and mean squared error.

On evaluating our model, it achieved a score of 80% on the test data.

From the figure, the R² and mean squared error values for the training and test data closely match. This suggests the model is not overfitting and is appropriate for predicting patient charges based on personal health data.
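A comparison like the one described above can be computed with r2_score and mean_squared_error from sklearn.metrics, as in this sketch. The data is synthetic, so the printed values only illustrate the mechanics and are not the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (made-up coefficients plus noise).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([4.0, -2.0, 1.5]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compute both metrics on the training and the test split.
metrics = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    predictions = model.predict(X_part)
    metrics[name] = {
        "r2": r2_score(y_part, predictions),
        "mse": mean_squared_error(y_part, predictions),
    }
    print(f"{name}: R² = {metrics[name]['r2']:.3f}, MSE = {metrics[name]['mse']:.3f}")
```

If the training metrics were much better than the test metrics, that gap would point to overfitting; similar values on both splits are the encouraging case.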

License

This project is under the MIT License. See License for more details.
