# U.S. Medical Insurance Costs

## Table of contents
### 1. Introduction
- Project Overview
- Objectives
- Data Source
### 2. Data Exploration and Preprocessing
- Loading the Dataset
- Initial Data Inspection
- Data Cleaning and Preprocessing
- Data Visualization
  - Age Distribution
  - Charges by Smoker Status
  - Summary Statistics
  - Categorical Variable Analysis
  - Gender Distribution
  - Smoker vs. Non-Smoker Analysis
### 3. Data Analysis
- Correlation Analysis
  - Correlation Matrix
  - Visualizing Relationships
- Hypothesis Testing
  - Impact of Smoking on Charges
- Regression Analysis
  - Predicting Charges with BMI
- Clustering
  - K-Means Clustering
- Geospatial Analysis (if applicable)
  - Regional Differences in Charges
### 4. Insights and Conclusion
- Key Findings
- Implications
- Recommendations
- Limitations of the Analysis
- Future Work
### 5. Appendices
- Code Snippets and Notebooks (if provided separately)
- Data Dictionary
- References

# 1. Introduction

### Project Overview
This project examines what drives health insurance premiums. Health insurance is a vital safety net for accessing medical care, but its cost is not arbitrary. The analysis looks at how age, gender, BMI, family size, and smoking habits relate to insurance charges. Smoking, for instance, leads to higher premiums because of its health risks.

Family size can also affect costs, and higher BMI values often lead to increased premiums. Geographic location matters too, as healthcare costs and access vary by region.

Using the 'insurance.csv' dataset, this project examines the relationships between these factors and insurance charges, aiming to provide insight into how premiums are determined and how pricing affects individuals.

### Objectives
This project aims to explore the factors influencing health insurance premiums. We'll delve into the variables that insurers consider when determining premiums, such as age, gender, BMI, family size, smoking habits, and geographic location. By analyzing the 'insurance.csv' dataset, we seek to uncover hidden patterns and provide insights into how insurance premiums are calculated.

### Data Source

The primary data source for this project is the 'insurance.csv' dataset, which contains information about individuals' health insurance details. This dataset serves as the foundation for our analysis, allowing us to draw insights into the determinants of health insurance premiums.

# 2. Data Exploration and Preprocessing

### Loading the Dataset

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('insurance.csv')

### Initial Data Inspection

In [None]:
# Check the size of data
df.shape

The dataset has 1338 rows and 7 columns.

In [None]:
# Observe the data
df.head(5)


In [None]:
df.columns

In [None]:
# Check the data type of each column
df.info()

As we can see, the dataset is complete: every column has a non-null value in every row. This simplifies the correlation analysis later.

In [None]:
# Select numeric columns
numeric = list(df.select_dtypes(include=[np.number]).columns.values)
print('These columns are stored as numeric types: ' + str(numeric))

# Select non-numeric columns (exclude, rather than include, the numeric dtypes)
columns_object = list(df.select_dtypes(exclude=[np.number]).columns.values)
print('These columns are stored as object (string) types: ' + str(columns_object))

In [None]:
# Inspect for missing values
missing_values = df.isnull().sum()
print(missing_values)

### Data Cleaning and Preprocessing

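The dataset has no missing values, so no imputation is needed. The categorical columns ('sex', 'smoker', 'region') still need to be converted to numerical values before they can be used in numerical analysis.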
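
One possible way to do that conversion is sketched below. It is an illustration only, not necessarily the encoding this project settles on. The `sample` frame and its values are made up, and the category labels ('yes'/'no', 'female'/'male', the four region names) are assumptions about what 'insurance.csv' contains.

```python
import pandas as pd

# Made-up rows for illustration; the labels are assumed, not read from insurance.csv
sample = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female'],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
    'charges': [100.0, 200.0, 300.0, 400.0],
})

encoded = sample.copy()

# Two-level columns: map each label to 0 or 1
encoded['sex'] = encoded['sex'].map({'female': 0, 'male': 1})
encoded['smoker'] = encoded['smoker'].map({'no': 0, 'yes': 1})

# Multi-level column: one-hot encode, dropping the first level
# (alphabetically 'northeast') so the indicator columns are not redundant
encoded = pd.get_dummies(encoded, columns=['region'], drop_first=True, dtype=int)

print(encoded)
```

To apply the same steps to the real data, replace `sample` with `df`, after confirming the actual labels with `df['smoker'].unique()` and similar calls.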

### Data Visualization

  - Age Distribution

  - Charges by Smoker Status

  - Summary Statistics

  - Categorical Variable Analysis

  - Gender Distribution

  - Smoker vs. Non-Smoker Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# Visualize the distribution of 'age' using a histogram
sns.histplot(data=df, x='age', kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


In [None]:
# Visualize the distribution of 'bmi' using a histogram
sns.histplot(data=df, x='bmi', kde=True)
plt.title('BMI Distribution')
plt.xlabel('BMI')
plt.ylabel('Count')
plt.show()


In [None]:
# Visualize the distribution of 'children' using a histogram
sns.histplot(data=df, x='children', discrete=True)
plt.title('Children Distribution')
plt.xlabel('Number of Children')
plt.ylabel('Count')
plt.show()


In [None]:
# Visualize the distribution of 'charges' using a histogram
sns.histplot(data=df, x='charges', kde=True)
plt.title('Insurance Charges Distribution')
plt.xlabel('Charges')
plt.ylabel('Count')
plt.show()


In [None]:
# Create bar plots for categorical variables: 'sex', 'smoker', 'region'
categorical_vars = ['sex', 'smoker', 'region']

for var in categorical_vars:
    sns.countplot(data=df, x=var)
    plt.title(f'{var} Distribution')
    plt.xlabel(var)
    plt.ylabel('Count')
    plt.show()


In [None]:
# Create boxplots for numeric variables: 'age', 'bmi', 'children', 'charges'
numeric_vars = ['age', 'bmi', 'children', 'charges']

for var in numeric_vars:
    sns.boxplot(data=df, y=var)
    plt.title(f'{var} Boxplot')
    plt.ylabel(var)
    plt.show()


In [None]:
# Create a scatterplot matrix for numeric variables
sns.pairplot(df[numeric_vars])
plt.show()


In [None]:
# Create a heatmap to visualize correlations between numeric variables
correlation_matrix = df[numeric_vars].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


In [None]:
# Summary statistics.
df.describe()
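
The outline lists "Charges by Smoker Status", which `df.describe()` alone does not cover. A grouped summary is one quick way to compare the two groups. The sketch below uses a made-up `sample` frame so it runs on its own; the 'yes'/'no' labels are an assumption about the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; swap in `df` from the notebook
sample = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes', 'no'],
    'charges': [30000.0, 4000.0, 6000.0, 34000.0, 5000.0],
})

# Count, mean, and median of charges within each smoker group
summary = sample.groupby('smoker')['charges'].agg(['count', 'mean', 'median'])
print(summary)
```

With the real data, `df.groupby('smoker')['charges'].agg(['count', 'mean', 'median'])` gives the same table, and `sns.boxplot(data=df, x='smoker', y='charges')` would show the same comparison graphically.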

# 3. Data Analysis
- Correlation Analysis
  - Correlation Matrix
  - Visualizing Relationships
- Hypothesis Testing
  - Impact of Smoking on Charges
- Regression Analysis
  - Predicting Charges with BMI
- Clustering
  - K-Means Clustering
- Geospatial Analysis (if applicable)
  - Regional Differences in Charges
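
As a starting point for the "Predicting Charges with BMI" item, a simple least-squares line can be fitted with NumPy alone. This is a minimal sketch under stated assumptions, not the project's final model. The `bmi` and `charges` arrays are synthetic and exactly linear by construction, so the fit quality shown here says nothing about the real data.

```python
import numpy as np

# Synthetic values for illustration only; in the notebook use
# df['bmi'] and df['charges'] instead
bmi = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
charges = 400.0 * bmi + 1000.0  # exactly linear by construction

# Fit charges = slope * bmi + intercept by ordinary least squares
slope, intercept = np.polyfit(bmi, charges, deg=1)

# Coefficient of determination (R^2) of the fitted line
predicted = slope * bmi + intercept
ss_res = np.sum((charges - predicted) ** 2)
ss_tot = np.sum((charges - charges.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f'slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}')
```

On real data the R^2 value indicates how much of the variation in charges a single BMI term explains, which helps decide whether further predictors such as smoker status are needed.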

# 4. Insights and Conclusion
- Key Findings
- Implications
- Recommendations
- Limitations of the Analysis
- Future Work

# 5. Appendices
- Code Snippets and Notebooks (if provided separately)
- Data Dictionary
- References