# Predicting Health Insurance Costs: A Linear Regression Project in Python and R









## Objective
This project explores how demographic and behavioral factors influence individual medical insurance costs, using linear regression models in both Python and R to predict charges.

The project is completed in two parts:
- **Part I**: Interactive analysis and modeling using Jupyter Notebooks.
- **Part II**: Standalone scripts that take command-line arguments and generate outputs automatically.




### Introduction

 Understanding the factors that drive medical insurance costs is essential for improving healthcare planning, affordability, and risk assessment. Insurance charges are often influenced by a combination of lifestyle, demographic, and clinical variables—such as age, BMI, smoking status, and region of residence. In the United States, rising healthcare costs have made predictive models increasingly relevant for insurers and policymakers alike.

In this project, we build linear regression models using Python and R to predict an individual’s insurance charges based on real-world health and demographic data. By analyzing features such as age, BMI, number of children, gender, smoking status, and region, the goal is to explore how these variables contribute to healthcare costs—and to compare modeling approaches across two programming environments.




## Dataset

### About the Dataset

The dataset used in this project is adapted from *Machine Learning with R* by Brett Lantz and has been cleaned and formatted for educational use. It contains health insurance information for 1,338 individuals and includes seven key features related to demographics, health behaviors, and insurance charges.

### Dataset Features

- **age**: Age of the primary beneficiary  
- **sex**: Gender of the policyholder (male/female)  
- **bmi**: Body Mass Index (kg/m²), a measure of body fat based on height and weight  
- **children**: Number of dependents covered by the insurance plan  
- **smoker**: Smoking status (yes/no)  
- **region**: Residential region in the U.S. (northeast, southeast, southwest, northwest)  
- **charges**: Individual medical insurance costs billed (in USD)  

This dataset provides a realistic basis for exploring how lifestyle and demographic factors influence health insurance costs, making it ideal for linear regression analysis and predictive modeling.



## Python Exploratory Data Analysis

```python
# Exploratory Data Analysis (Python)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('regression_data.csv')

# --- Data Overview ---
df.info()  # info() prints its report directly and returns None
print(df.describe())

# --- Distribution Plots ---

# Age distribution
plt.figure(figsize=(8,4))
sns.histplot(df['age'], bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# BMI distribution
plt.figure(figsize=(8,4))
sns.histplot(df['bmi'], bins=30, kde=True)
plt.title('BMI Distribution')
plt.xlabel('BMI')
plt.ylabel('Count')
plt.show()

# --- Categorical Variable Analysis ---

# Count of smokers vs non-smokers
plt.figure(figsize=(6,4))
sns.countplot(x='smoker', data=df)
plt.title('Count of Smokers vs Non-Smokers')
plt.xlabel('Smoker')
plt.ylabel('Count')
plt.show()

# Charges by smoking status (boxplot)
plt.figure(figsize=(6,6))
sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Insurance Charges by Smoking Status')
plt.xlabel('Smoker')
plt.ylabel('Charges (USD)')
plt.show()

# --- Relationships Between Variables ---

# Scatter plot: Age vs Charges, colored by smoker
plt.figure(figsize=(8,5))
sns.scatterplot(x='age', y='charges', hue='smoker', data=df)
plt.title('Age vs Charges by Smoking Status')
plt.xlabel('Age')
plt.ylabel('Charges (USD)')
plt.show()

# Scatter plot: BMI vs Charges, colored by smoker
plt.figure(figsize=(8,5))
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df)
plt.title('BMI vs Charges by Smoking Status')
plt.xlabel('BMI')
plt.ylabel('Charges (USD)')
plt.show()

# --- Correlation Matrix ---

# Pairwise correlations among the numeric columns only
plt.figure(figsize=(6,5))
corr = df[['age', 'bmi', 'children', 'charges']].corr()
sns.heatmap(corr, annot=True)  # annot=True writes each coefficient in its cell
plt.title('Correlation Heatmap')
plt.show()
```




### Distribution of Age  
![Distribution of Age](images/age_distribution.png)


### Correlation Heatmap  
![Correlation Heatmap](images/correlation_heatmap.png)


## R Exploratory Data Analysis

 ![insurance_charges_vs_bmi.png](attachment:insurance_charges_vs_bmi.png) ![insurance_charges_by_smoker.png](attachment:insurance_charges_by_smoker.png) ![insurance_charges_vs_age.png](attachment:insurance_charges_vs_age.png)


### Encoding Categorical Variables

Linear regression models require numerical inputs. Hence, all categorical features were transformed into binary (dummy) variables using one-hot encoding, with the first category dropped to avoid multicollinearity.

### Transformed columns
Some variables in the dataset represent categories rather than numerical values; these are called categorical variables. In this dataset they are:

- **sex**: gender of the insurance policyholder (male or female)
- **smoker**: smoking status (yes or no)
- **region**: residential region (northeast, southeast, southwest, northwest)

To use these variables in statistical modeling, especially linear regression, they are converted into factors in R, which `lm()` then expands into dummy variables automatically. This process is called encoding categorical variables.

##### Why this matters: Encoding categorical data allows the linear regression algorithm to interpret the relationships between categories and the target variable numerically.
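
As a minimal sketch of the one-hot encoding step on the Python side, the snippet below applies `pandas.get_dummies` with `drop_first=True` to a small hand-made frame. The four sample rows and the names `sample` and `encoded` are illustrative assumptions, not the project's data or necessarily the exact code used in the notebook.

```python
import pandas as pd

# Hypothetical four-row sample with the same categorical columns as the dataset
sample = pd.DataFrame({
    "age": [19, 33, 52, 45],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
})

# drop_first=True omits the first (alphabetical) level of each column,
# which then acts as the baseline category in the regression
encoded = pd.get_dummies(
    sample,
    columns=["sex", "smoker", "region"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

With these rows, `female`, `no`, and `northeast` are the dropped baseline levels, so the remaining dummy columns are named `sex_male`, `smoker_yes`, `region_northwest`, `region_southeast`, and `region_southwest`.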

## Feature Selection
Features were selected based on domain relevance and exploratory data analysis. Only variables with potential predictive value for insurance charges were retained.

Included features:

- Age
- BMI
- Number of children
- One-hot encoded categories for sex, smoking status, and region

##### Why this matters: Reducing the number of irrelevant or redundant features helps prevent overfitting and ensures the model captures meaningful patterns.

## Handling Missing Values
Missing values were handled carefully to preserve data integrity.

- Rows with missing `bmi` values were dropped.
- Categorical columns such as `smoker` were encoded after addressing missing entries (if any).

##### Why this matters: Missing data can skew results or reduce model accuracy, so addressing it upfront helps maintain a reliable dataset.
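
The missing-value check and row drop described in this section could look roughly like the sketch below. The three-row frame and the names `sample`, `missing_counts`, and `cleaned` are made up for illustration; the real dataset may contain no missing values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; one bmi value is deliberately missing
sample = pd.DataFrame({
    "age": [25, 40, 61],
    "bmi": [22.5, np.nan, 30.1],
    "smoker": ["no", "yes", "no"],
})

# Count missing values per column before deciding how to handle them
missing_counts = sample.isna().sum()

# Drop only the rows where bmi is missing, then renumber the index
cleaned = sample.dropna(subset=["bmi"]).reset_index(drop=True)

print(missing_counts)
print(len(sample), "->", len(cleaned))
```

Restricting `dropna` to `subset=["bmi"]` keeps rows that are complete in the column of interest even if other columns have gaps.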

## Models and Results

### Linear Regression
After exploring the dataset, linear regression models were built in both Python and R to predict health insurance charges based on demographic and behavioral factors such as age, BMI, and smoking status. The models help quantify the relationships and identify key predictors impacting insurance costs.
##### Python Implementation
The Python model uses the statsmodels library.

1. Categorical variables (like smoker) were converted into numerical binary format.

2. Multiple predictors (age, bmi, smoker) were included.

3. The model summary includes coefficients, R-squared, and statistical significance for each variable.
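
The steps above can be sketched on toy data. The project itself uses statsmodels; this example instead solves the same ordinary least squares problem with `numpy.linalg.lstsq` so that it needs only NumPy. The data are synthetic, and the "true" coefficients used to generate them are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (illustrative only, not the insurance data)
age = rng.integers(18, 65, size=n).astype(float)
bmi = rng.normal(30.0, 6.0, size=n)
smoker = (rng.random(n) < 0.2).astype(float)  # binary 0/1 encoding

# Made-up relationship plus noise, used only to generate toy charges
noise = rng.normal(0.0, 500.0, size=n)
charges = 1000.0 + 250.0 * age + 300.0 * bmi + 20000.0 * smoker + noise

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), age, bmi, smoker])

# Ordinary least squares: minimise ||X @ beta - charges||^2
beta, *_ = np.linalg.lstsq(X, charges, rcond=None)

# R-squared = 1 - SS_res / SS_tot
fitted = X @ beta
ss_res = np.sum((charges - fitted) ** 2)
ss_tot = np.sum((charges - charges.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

for name, value in zip(["intercept", "age", "bmi", "smoker"], beta):
    print(f"{name:>9}: {value:10.2f}")
print(f"R-squared: {r_squared:.4f}")
```

The fitted coefficients should land close to the values used to generate the data. A statsmodels summary reports the same estimates along with standard errors and p-values for each term.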

### Python Model Results
![image.png](attachment:image.png)



 | Feature           | Coefficient  |
| ----------------- | ------------ |
| smoker\_yes       | 23651.128856 |
| region\_southwest | -809.799354  |
| region\_southeast | -657.864297  |
| children          | 425.278784   |
| region\_northwest | -370.677326  |
| bmi               | 337.092552   |
| age               | 256.975706   |
| sex\_male         | -18.591692   |

### R Model Results
| Metric                   | Value     |
| ------------------------ | --------- |
| Residual standard error  | 4846      |
| Degrees of freedom       | 1328      |
| Multiple R-squared       | 0.8409    |
| Adjusted R-squared       | 0.8398    |
| F-statistic (9, 1328 DF) | 780       |
| F-statistic p-value      | < 2.2e-16 |
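
The figures in this table are linked by standard formulas, so they can be cross-checked with a short calculation. The sketch below assumes the model was fitted to all 1,338 rows of the dataset, which together with 1,328 residual degrees of freedom implies 9 model terms besides the intercept.

```python
# Values reported in the R summary table
n = 1338                  # observations (assumed: every row in the dataset)
resid_df = 1328           # residual degrees of freedom
p = n - 1 - resid_df      # model terms excluding the intercept
r2 = 0.8409               # multiple R-squared

# Adjusted R-squared penalises R-squared for the number of terms
adj_r2 = 1 - (1 - r2) * (n - 1) / resid_df

# Overall F-statistic on (p, resid_df) degrees of freedom
f_stat = (r2 / p) / ((1 - r2) / resid_df)

print(p, round(adj_r2, 4), round(f_stat))  # 9 0.8398 780
```

Both derived values agree with the reported adjusted R-squared and F-statistic, which suggests the table is internally consistent.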


### Model Summary

### R Model Summary (Simple Linear Regression: Charges predicted by Age)
The simple linear regression model in R examined the relationship between insurance charges and age. The analysis showed that age is a statistically significant predictor: for each additional year of age, insurance charges increase by approximately $258. The model explains about 9% of the variation in charges, indicating that while age is an important factor, many other variables contribute to insurance costs. The residual standard error was around $11,560, reflecting typical prediction errors.
### Python Model Summary (Multiple Linear Regression: Charges predicted by smoking status, region, children, BMI, age, and sex)
The multiple linear regression model in Python incorporated several predictors simultaneously. Smoking status had the largest effect, increasing charges by over $23,600 for smokers compared to non-smokers. Regional effects varied, with residents in the Southwest, Southeast, and Northwest regions associated with lower charges relative to the baseline region. Each additional child increased charges by about $425. Higher BMI and increasing age also contributed positively, increasing charges by roughly $337 per BMI unit and $257 per year of age, respectively. Being male was associated with a small decrease in charges, about $19 less, holding other factors constant.
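
To make the "holding other factors constant" reading concrete, the sketch below reuses the coefficients from the Python results table to compare pairs of hypothetical people. The intercept is not reported in that table, so the helper `charge_difference` (an illustrative name) computes only differences between two profiles, where the intercept cancels out.

```python
# Coefficients copied from the Python results table (intercept not reported there)
coef = {
    "smoker_yes": 23651.128856,
    "region_southwest": -809.799354,
    "region_southeast": -657.864297,
    "children": 425.278784,
    "region_northwest": -370.677326,
    "bmi": 337.092552,
    "age": 256.975706,
    "sex_male": -18.591692,
}


def charge_difference(profile_a, profile_b):
    """Predicted charges of profile_a minus profile_b; the intercept cancels."""
    keys = set(profile_a) | set(profile_b)
    return sum(coef[k] * (profile_a.get(k, 0) - profile_b.get(k, 0)) for k in keys)


# Two hypothetical people, identical except for smoking status
smoker = {"age": 40, "bmi": 30, "children": 2, "smoker_yes": 1}
non_smoker = {"age": 40, "bmi": 30, "children": 2, "smoker_yes": 0}
print(round(charge_difference(smoker, non_smoker), 2))  # 23651.13

# Ten additional years of age, everything else equal
older = {"age": 50, "bmi": 30}
younger = {"age": 40, "bmi": 30}
print(round(charge_difference(older, younger), 2))  # 2569.76
```

Features left out of a profile are treated as 0, which for the dummy columns corresponds to the baseline category.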

```r
library(ggplot2)

# Assumes `insurance` is a data frame that has already been loaded

# Plot 1: distribution of insurance charges
plot_charges <- ggplot(insurance, aes(x = charges)) +
  geom_histogram(fill = "lightblue", bins = 30) +
  theme_minimal() +
  labs(title = "Distribution of Insurance Charges")

ggsave("plot_charges.png", plot = plot_charges, width = 6, height = 4, dpi = 300)

# Plot 2: charges by smoking status
plot_smoker <- ggplot(insurance, aes(x = smoker, y = charges, fill = smoker)) +
  geom_boxplot() +
  labs(title = "Charges by Smoking Status") +
  theme_minimal()

ggsave("plot_smoker.png", plot = plot_smoker, width = 6, height = 4, dpi = 300)
```



## Repository Details

This repository contains analysis of insurance charges using linear regression models in both R and Python. The project explores how demographic and behavioral factors such as age, BMI, smoking status, and region influence individual medical insurance costs.

| File/Folder                           | Description                                             |
| ------------------------------------- | ------------------------------------------------------- |
| `insurance.csv`                       | Cleaned dataset used for analysis                       |
| `linear_regression_r.R`               | R script for simple linear regression (`charges ~ age`) |
| `insurance_analysis.ipynb`            | Python notebook for multiple linear regression          |
| `model_summary.md`                    | Summary of results from both R and Python models        |
| `data visualizations/`                | Folder containing saved visualizations (.png, .emf)     |
| ├── `insurance_charges_vs_age.png`    | Scatter plot of charges vs. age                         |
| ├── `insurance_charges_vs_bmi.png`    | Scatter plot of charges vs. BMI                         |
| └── `insurance_charges_by_smoker.png` | Charges vs. BMI colored by smoking status               |
| `README.md`                           | Project overview and usage instructions                 |



### Dataset

This dataset is inspired by Machine Learning with R by Brett Lantz. While the book provides valuable insights into machine learning in R, the associated datasets are not readily available online unless purchased or accessed through an account with Packt Publishing. This version of the dataset has been cleaned and formatted to match the book’s examples and is derived from public domain sources.


### How to Use This Repo

1. Clone or download the repository to your local machine
2. Use the provided cleaned dataset (insurance.csv) to reproduce the regression analyses
3. Run the R script or Python notebook for data exploration, model training, and evaluation
4. Review the saved visualizations in the `data visualizations/` folder for additional insights

### Technologies and Libraries
For Python:
- Python 3.x
- pandas, numpy
- scikit-learn
- matplotlib, seaborn

For R:
- Base R
- readr, ggplot2 (for data import and plotting)
- stats package (built-in, used for linear modeling)



## Author

**Kehinde Soetan**  
Medical Humanities & Computational Social Sciences Enthusiast

- 📧 Email: kehindesoetan3@gmail.com  
- 🌐 LinkedIn: [linkedin.com/in/kehindesoetan](https://linkedin.com/in/kehindesoetan)  

Feel free to reach out for collaborations or discussions related to health data analysis, medical humanities, and computational social sciences.


## License

- Database: Open Database
- Contents: Open Database Content License (ODC-ODbL)
