# U.S. Medical Insurance

By Jia Bloom | Date: 6/17/23



In this project, a CSV file containing information about medical insurance costs will be investigated using **Python**. This dataset was pulled from [Kaggle](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset) to be used as a practice dataset for data analysis.

The goal of the project will be to conduct **descriptive** and **exploratory** analysis on the dataset to gain insights into the patients and their insurance costs, specifically how a patient's demographic information impacts their insurance costs.

`insurance.csv` contains 7 columns where each row is a patient:


*   `age`: patient's age
*   `sex`: patient's sex
*   `bmi`: patient's body mass index (BMI)
*   `children`: number of children the patient has
*   `smoker`: whether the patient smokes (`yes` or `no`)
*   `region`: patient's U.S. geographical region
*   `charges`: amount the patient is charged per year for medical insurance



## Setting Up Environment and Loading Data

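The first step is to import the relevant Python libraries: `csv`, `statistics`, and `Counter` from the standard library, plus `NumPy` and the `stats` module from SciPy for the regression analysis later on.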

In [None]:
import csv
import statistics
from collections import Counter
import numpy as np
from scipy import stats

Next, the `csv` library will be used to load the CSV file.

In [None]:
# empty lists for the data to fill
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

# opens the file
with open("/content/insurance.csv", newline="") as insurance_data:
    insurance_reader = csv.DictReader(insurance_data, delimiter = ",")
    for row in insurance_reader:
        age.append(row["age"])
        sex.append(row["sex"])
        bmi.append(row["bmi"])
        children.append(row["children"])
        smoker.append(row["smoker"])
        region.append(row["region"])
        charges.append(row["charges"])
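
One consequence of loading the file this way is that `csv.DictReader` returns every field as a string, which is why the cleaning step that follows converts types. The sketch below illustrates this on a small in-memory sample; the two rows are made up for illustration and are not taken from `insurance.csv`.

```python
import csv
import io

# Hypothetical two-row sample laid out like insurance.csv (values are invented)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,northwest,3000.00\n"
    "40,male,31.2,2,yes,southeast,25000.00\n"
)

sample_rows = list(csv.DictReader(sample_csv))

# Every field arrives as text, including the numeric columns
first_age = sample_rows[0]["age"]
print(repr(first_age), type(first_age).__name__)  # '25' str
```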

## Data Cleaning

In [None]:
# standardizes capitalization for strings
sex = [x.lower() for x in sex]
smoker = [x.lower() for x in smoker]
region = [x.lower() for x in region]

# clears any extra spaces / whitespace from the data
age = [x.strip() for x in age]
sex = [x.strip() for x in sex]
bmi = [x.strip() for x in bmi]
children = [x.strip() for x in children]
smoker = [x.strip() for x in smoker]
region = [x.strip() for x in region]
charges = [x.strip() for x in charges]

# converts list items into correct data type
age = [int(x) for x in age]
bmi = [float(x) for x in bmi]
children = [int(x) for x in children]
smoker = [x == "yes" for x in smoker]
charges = [float(x) for x in charges]


Note: The data types of `sex` and `region` don't need to be changed because they are meant to be strings.
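
The cleaning steps above can be sketched on a few made-up values to show their effect. The helper name `clean_text` and the raw values are hypothetical and only serve the illustration.

```python
def clean_text(values):
    """Strip surrounding whitespace and lowercase each string."""
    return [v.strip().lower() for v in values]

# Hypothetical raw values with stray spaces and mixed capitalization
raw_smoker = [" Yes", "no ", "YES"]
raw_bmi = [" 27.9", "33.77 "]

# Text is normalized first so that the comparison with "yes" is reliable
clean_smoker = [v == "yes" for v in clean_text(raw_smoker)]
clean_bmi = [float(v.strip()) for v in raw_bmi]

print(clean_smoker)  # [True, False, True]
print(clean_bmi)     # [27.9, 33.77]
```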

## Descriptive Analysis

The next step is to produce summary statistics for the variables in the dataset.

Numerical variables will be summarized with these statistics:

*   Minimum
*   Maximum
*   Mean
*   Standard Deviation
*   Median



Categorical variables will be summarized with these statistics:

*   Frequency / Count
*   Relative Percentages
*   Mode



#### Numerical Variables

The following code creates a dictionary where the keys are the variables in the dataset, and the values are the data types. This will be used later.

In [None]:
data_types = {
    "age": "integer",
    "sex": "string",
    "bmi": "float",
    "children": "integer",
    "smoker": "boolean",
    "region": "string",
    "charges": "float"
}


The following code creates a dictionary of summary statistics for each numerical variable of the dataset.

**Inputs**: `values` (data to summarize), `list_name` (name of the list), `dictionary` (dictionary to insert the results into)

**Output**: none; `dictionary` is updated in place so that the keys are the variable / column names, and the values are dictionaries of the associated summary statistics

In [None]:
# function to create a dictionary of summary statistics
def create_summary_dict(values, list_name, dictionary):
    num_of_rows = len(values)
    minimum = round(min(values), 2)
    maximum = round(max(values), 2)
    mean = round(sum(values) / len(values), 2)
    # statistics.stdev is the sample standard deviation (divides by n - 1)
    stdev = round(statistics.stdev(values), 2)
    median = round(statistics.median(values), 2)
    dictionary.update({list_name: {"Count": num_of_rows,
                                   "Minimum": minimum,
                                   "Maximum": maximum,
                                   "Mean": mean,
                                   "Standard Deviation": stdev,
                                   "Median": median}})

In [None]:
# calls function to summarize numerical variables
num_sum = {}
create_summary_dict(age, "age", num_sum)
create_summary_dict(bmi, "bmi", num_sum)
create_summary_dict(children, "children", num_sum)
create_summary_dict(charges, "charges", num_sum)

print(num_sum)

{'age': {'Count': 1338, 'Minimum': 18, 'Maximum': 64, 'Mean': 39.21, 'Standard Deviation': 14.05, 'Median': 39.0}, 'bmi': {'Count': 1338, 'Minimum': 15.96, 'Maximum': 53.13, 'Mean': 30.66, 'Standard Deviation': 6.1, 'Median': 30.4}, 'children': {'Count': 1338, 'Minimum': 0, 'Maximum': 5, 'Mean': 1.09, 'Standard Deviation': 1.21, 'Median': 1.0}, 'charges': {'Count': 1338, 'Minimum': 1121.87, 'Maximum': 63770.43, 'Mean': 13270.42, 'Standard Deviation': 12110.01, 'Median': 9382.03}}


The following code looks up one variable in the `num_sum` dictionary and prints its summary statistics with relevant context.

**Input**: `var` (name of a variable that is a key in both `data_types` and `num_sum`)

**Output**: printed message outlining the summary statistics for that variable

In [None]:
# define function
def print_sum(var):
    data_type = data_types.get(var)
    summary = num_sum[var]
    # "an" before a vowel sound ("an integer"), "a" otherwise ("a float")
    article = "an" if data_type == "integer" else "a"
    msg = """{var} is {article} {data_type} with {num_of_rows} observations.
              The data ranges from {minimum} to {maximum}.
              The mean is {mean} with a standard deviation of {stdev}.
              The median is {median}""".format(
                      var=var.title(), article=article, data_type=data_type,
                      num_of_rows=summary["Count"],
                      minimum=summary["Minimum"], maximum=summary["Maximum"],
                      mean=summary["Mean"], stdev=summary["Standard Deviation"],
                      median=summary["Median"])
    print(msg)

In [None]:
# calls function to print summary statistics
print_sum("age")
print_sum("bmi")
print_sum("children")
print_sum("charges")

Age is an integer with 1338 observations.
              The data ranges from 18 to 64.
              The mean is 39.21 with a standard deviation of 14.05.
              The median is 39.0
Bmi is a float with 1338 observations.
              The data ranges from 15.96 to 53.13.
              The mean is 30.66 with a standard deviation of 6.1.
              The median is 30.4
Children is an integer with 1338 observations.
              The data ranges from 0 to 5.
              The mean is 1.09 with a standard deviation of 1.21.
              The median is 1.0
Charges is a float with 1338 observations.
              The data ranges from 1121.87 to 63770.43.
              The mean is 13270.42 with a standard deviation of 12110.01.
              The median is 9382.03
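
A note on the standard deviations reported above: `create_summary_dict` calls `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1) rather than the population version. The sketch below contrasts it with `statistics.pstdev` on a made-up toy sample; with 1,338 observations the two would differ only slightly.

```python
import statistics

# Hypothetical toy sample, not taken from insurance.csv
toy = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(toy)       # divides by n - 1
population_sd = statistics.pstdev(toy)  # divides by n

print(round(sample_sd, 2), round(population_sd, 2))  # 2.14 2.0
```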


#### Categorical Variables

The following code defines a function that fills a dictionary where the keys are the unique categories and each value is a dictionary holding the raw count and that category's share of the total, stored as a proportion between 0 and 1.

**Inputs**: `values` (list of categorical data to summarize), `dictionary` (empty dictionary to receive the count and proportion data)

**Output**: none; `dictionary` is updated in place with every unique value in the list as a key, mapped to how often that value occurs and its relative frequency (under the `"Percent"` key)

In [None]:
# define function
def count_var(values, dictionary):
    count_dict = Counter(values)
    total_counts = sum(count_dict.values())
    for key, count in count_dict.items():
        # "Percent" holds a proportion (0 to 1), not a value out of 100
        percentage = count / total_counts
        dictionary.update({key: {"Count": count, "Percent": percentage}})

In [None]:
# calls function
sex_count = {}
count_var(sex, sex_count)
print("categorical summary of sex: ", sex_count)
smoker_count = {}
count_var(smoker, smoker_count)
print("categorical summary of smoker status: ", smoker_count)
region_count = {}
count_var(region, region_count)
print("categorical summary of region: ", region_count)

categorical summary of sex:  {'female': {'Count': 662, 'Percent': 0.4947683109118087}, 'male': {'Count': 676, 'Percent': 0.5052316890881914}}
categorical summary of smoker status:  {True: {'Count': 274, 'Percent': 0.20478325859491778}, False: {'Count': 1064, 'Percent': 0.7952167414050823}}
categorical summary of region:  {'southwest': {'Count': 325, 'Percent': 0.2428998505231689}, 'southeast': {'Count': 364, 'Percent': 0.27204783258594917}, 'northwest': {'Count': 325, 'Percent': 0.2428998505231689}, 'northeast': {'Count': 324, 'Percent': 0.242152466367713}}
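
The values stored under `"Percent"` are proportions with many decimal places, which are hard to read at a glance. As a presentation-only sketch, they could be formatted with Python's `%` format specifier; the toy list and the name `percent_labels` below are hypothetical.

```python
from collections import Counter

# Hypothetical toy data standing in for one categorical column
toy_region = ["southeast", "southwest", "southeast", "northeast"]

counts = Counter(toy_region)
total = sum(counts.values())

# The ".1%" specifier multiplies by 100 and appends a percent sign
percent_labels = {key: f"{count / total:.1%}" for key, count in counts.items()}
print(percent_labels)
# {'southeast': '50.0%', 'southwest': '25.0%', 'northeast': '25.0%'}
```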


The following code defines a function that finds the mode in a list of data. If several values tie for the highest count, the first one encountered is returned.

**Input**: `values` (data to find the mode of)

**Output**: the mode of the list

In [None]:
# define function
def find_mode(values):
    count_dict = Counter(values)
    mode_value = max(count_dict.values())
    # return the first key whose count matches the highest count
    for key, value in count_dict.items():
        if mode_value == value:
            return key

In [None]:
# calls function to return modes
print("The mode of sex is " + find_mode(sex))
print("The mode of smoker is " + str(find_mode(smoker)))
print("The mode of region is " + find_mode(region))

The mode of sex is male
The mode of smoker is False
The mode of region is southeast
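
Because `find_mode` returns a single value, a tie between categories would go unnoticed. None of the three variables here has a tie, but as a hedged aside, the standard library's `statistics.multimode` (Python 3.8+) returns every value that shares the highest count. The toy list below is made up.

```python
import statistics

# Hypothetical toy data with a two-way tie between "a" and "b"
toy = ["a", "b", "a", "b", "c"]

tied_modes = statistics.multimode(toy)
print(tied_modes)  # ['a', 'b']
```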


## Exploratory Analysis

The next step is to explore relationships between variables in the dataset. This will be divided into two sections based on variable type, and the analysis will focus on the relationship between a patient's demographics and how much they're charged for medical insurance.


*   Numerical variables
*   Categorical variables


#### Numerical Variables

As a reminder, the numerical variables in this dataset are the following:


*   Age
*   BMI
*   Number of Children
*   Yearly Medical Insurance Cost

The analysis will mainly focus on how the first three variables relate to Yearly Medical Insurance Cost. The initial hypothesis is that each of them has a positive linear relationship with Yearly Medical Insurance Cost. To test this, **NumPy** and SciPy's **`stats`** module will be used to fit a simple linear regression and report the correlation coefficient, R-squared, and p-value.

The following code defines a function that returns a dictionary of linear regression statistics: slope, intercept, correlation coefficient, r-squared, p-value, and standard error.

**Inputs**: `list_x` (list of predictor values), `list_y` (list of response values to regress on `list_x`)

**Output**: dictionary that contains the slope, intercept, correlation coefficient, r-squared, p-value, and standard error of the slope from regressing `list_y` on `list_x`. All values are rounded to four decimal places, so a p-value below 0.00005 is reported as `0.0`.

In [None]:
# defines a function that returns linear correlation stats
def regression(list_x, list_y):
  x = np.array(list_x)
  y = np.array(list_y)
  slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
  regression_dict = {"Slope": slope,
                     "Intercept": intercept,
                     "Correlation Coefficient": r_value,
                     "r-squared": (r_value**2),
                     "p-value": p_value,
                     "Standard Error": std_err}
  rounded_regression = {}
  for key, value in regression_dict.items():
    rounded_value = round(value, 4)
    rounded_regression.update({key: rounded_value})
  return rounded_regression
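
To make the slope, intercept, and correlation coefficient more concrete, here is a minimal pure-Python sketch of the textbook least-squares formulas for a single predictor. It is an illustration under stated assumptions, not SciPy's implementation; the function name `simple_linear_fit` and the toy data are hypothetical.

```python
import math

def simple_linear_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sums of cross-products and squared deviations from the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation coefficient
    return slope, intercept, r

# Hypothetical toy data, not taken from insurance.csv
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

toy_slope, toy_intercept, toy_r = simple_linear_fit(toy_x, toy_y)
print(round(toy_slope, 4), round(toy_intercept, 4),
      round(toy_r, 4), round(toy_r ** 2, 4))  # 0.6 2.2 0.7746 0.6
```

In a simple regression with one predictor, r-squared is just the square of the correlation coefficient, which is how the `regression` function above derives it.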

In [None]:
# calls function to print the correlations
age_to_charges = regression(age, charges)
bmi_to_charges = regression(bmi, charges)
children_to_charges = regression(children, charges)
print("correlation between age & charges: ", age_to_charges)
print("correlation between BMI & charges: ", bmi_to_charges)
print("correlation between number of children & charges: ", children_to_charges)

correlation between age & charges:  {'Slope': 257.7226, 'Intercept': 3165.885, 'Correlation Coefficient': 0.299, 'r-squared': 0.0894, 'p-value': 0.0, 'Standard Error': 22.5024}
correlation between BMI & charges:  {'Slope': 393.873, 'Intercept': 1192.9372, 'Correlation Coefficient': 0.1983, 'r-squared': 0.0393, 'p-value': 0.0, 'Standard Error': 53.2507}
correlation between number of children & charges:  {'Slope': 683.0894, 'Intercept': 12522.4955, 'Correlation Coefficient': 0.068, 'r-squared': 0.0046, 'p-value': 0.0129, 'Standard Error': 274.2018}
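
The p-values of `0.0` above are an effect of rounding to four decimal places, not exact zeros. A small sketch of how scientific notation would preserve such a value; the number used is hypothetical and chosen only to show the effect.

```python
# Hypothetical p-value, chosen only to illustrate the rounding effect
tiny_p = 3.2e-29

rounded_p = round(tiny_p, 4)
formatted_p = f"{tiny_p:.2e}"

print(rounded_p)    # 0.0
print(formatted_p)  # 3.20e-29
```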


#### Categorical Variables

The following code adapts `create_summary_dict` so that it takes only the data as an argument and returns the summary dictionary instead of updating one that is passed in.

**Input**: `values` (list of data to summarize)

**Output**: dictionary containing summary statistics of the list (count, minimum, maximum, mean, standard deviation, and median)

In [None]:
# modified create_summary_dict to only take one argument
def summarize(values):
    return {"Count": len(values),
            "Minimum": round(min(values), 2),
            "Maximum": round(max(values), 2),
            "Mean": round(sum(values) / len(values), 2),
            "Standard Deviation": round(statistics.stdev(values), 2),
            "Median": round(statistics.median(values), 2)}


The following code groups data by categories, and then finds summary statistics on the charges associated with each category. For example, this function can find the average charge for men vs. the average charge for women.

**Inputs**: `cat_var` (list of categorical data to group by), `charges` (list of numerical data to summarize, in the same row order as `cat_var`)

**Output**: dictionary that contains a summary of the charges (count, minimum, maximum, mean, standard deviation, and median) grouped by a categorical variable

In [None]:
# define function to summarize charges by category
def sum_by_cat(cat_var, charges):
    # set() gives the unique categories; their order is not guaranteed
    for category in set(cat_var):
        # keep the charges whose row belongs to this category
        list_charges = [charge for value, charge in zip(cat_var, charges)
                        if value == category]
        print("charges for", category, ": ", summarize(list_charges))

In [None]:
print("Charges by sex:")
sum_by_cat(sex, charges)

Charges by sex:
charges for female :  {'Count': 662, 'Minimum': 1607.51, 'Maximum': 63770.43, 'Mean': 12569.58, 'Standard Deviation': 11128.7, 'Median': 9412.96}
charges for male :  {'Count': 676, 'Minimum': 1121.87, 'Maximum': 62592.87, 'Mean': 13956.75, 'Standard Deviation': 12971.03, 'Median': 9369.62}


In [None]:
print("Charges by smoker status:")
sum_by_cat(smoker, charges)

Charges by smoker status:
charges for False :  {'Count': 1064, 'Minimum': 1121.87, 'Maximum': 36910.61, 'Mean': 8434.27, 'Standard Deviation': 5993.78, 'Median': 7345.41}
charges for True :  {'Count': 274, 'Minimum': 12829.46, 'Maximum': 63770.43, 'Mean': 32050.23, 'Standard Deviation': 11541.55, 'Median': 34456.35}


In [None]:
print("Charges by region:")
sum_by_cat(region, charges)

Charges by region:
charges for southeast :  {'Count': 364, 'Minimum': 1121.87, 'Maximum': 63770.43, 'Mean': 14735.41, 'Standard Deviation': 13971.1, 'Median': 9294.13}
charges for northwest :  {'Count': 325, 'Minimum': 1621.34, 'Maximum': 60021.4, 'Mean': 12417.58, 'Standard Deviation': 11072.28, 'Median': 8965.8}
charges for southwest :  {'Count': 325, 'Minimum': 1241.57, 'Maximum': 52590.83, 'Mean': 12346.94, 'Standard Deviation': 11557.18, 'Median': 8798.59}
charges for northeast :  {'Count': 324, 'Minimum': 1694.8, 'Maximum': 58571.07, 'Mean': 13406.38, 'Standard Deviation': 11255.8, 'Median': 10057.65}
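
`sum_by_cat` scans the full list once per category. As an alternative sketch (not the approach used in this project), the values can be grouped in a single pass with `collections.defaultdict`. The helper name `group_values` and the toy data are hypothetical.

```python
from collections import defaultdict
import statistics

def group_values(categories, values):
    """Collect values into lists keyed by category in a single pass."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return dict(groups)

# Hypothetical toy data, not taken from insurance.csv
toy_smoker = [True, False, True, False]
toy_charges = [30000.0, 8000.0, 34000.0, 9000.0]

toy_groups = group_values(toy_smoker, toy_charges)
toy_means = {key: statistics.mean(vals) for key, vals in toy_groups.items()}
print(toy_means)  # {True: 32000.0, False: 8500.0}
```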


## Conclusions

#### Descriptive Analysis Conclusions


*   Each variable has an **equal number of observations**: 1,338
*   All numerical variables (except charges) have **similar means and medians**
    *   Charges have a higher mean (USD 13,270) than median (USD 9,382), indicating that there is likely a **right skew in the charges data**
*   There are **slightly more men** than women in the dataset
*   The large majority of people **do not smoke** (about 80%)
*   There are 4 regions in the dataset with the **Southeast** being the most represented
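
As a rough check on the skew noted above, Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, can be computed from the reported summary of charges. This is only a quick indicator under that formula, not a full distributional analysis; the variable names are illustrative.

```python
# Values copied from the charges summary reported earlier
mean_charges = 13270.42
median_charges = 9382.03
stdev_charges = 12110.01

# Positive values point to a right skew, values near 0 to symmetry
skew_estimate = 3 * (mean_charges - median_charges) / stdev_charges
print(round(skew_estimate, 2))  # 0.96
```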



#### Exploratory Analysis Conclusions


*   Age, BMI, and number of children are all **positively correlated** with charges, though only weakly (correlation coefficients of 0.30, 0.20, and 0.07, respectively)
*   Based on **correlation coefficients**, **age** has the strongest relationship to charges, and **number of children** has the weakest relationship to charges
*   **Women** have a slightly lower average charge but a slightly higher median charge than men
*   On average, **smokers** are charged approximately **USD 23,616 more** per year than their non-smoking counterparts for medical insurance
*   The **Southeast** has the **highest** average charges, and the **Southwest** has the **lowest** average charges
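
The smoker gap quoted above follows directly from the two group means reported in the exploratory analysis. The short calculation below reproduces it and also expresses it as a ratio; the variable names are illustrative.

```python
# Mean charges copied from the smoker-status summary reported earlier
mean_smoker = 32050.23
mean_non_smoker = 8434.27

difference = mean_smoker - mean_non_smoker
ratio = mean_smoker / mean_non_smoker
print(round(difference, 2), round(ratio, 2))  # 23615.96 3.8
```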



## Future Work and Improvements



This analysis deliberately avoided libraries such as **Pandas**, whose **DataFrames** would have streamlined the descriptive analysis. The purpose of the project was to showcase Python fundamentals and data analysis using simple lists, dictionaries, and a few NumPy arrays.


In future, similar projects could incorporate **Pandas** for more efficient data analysis and **Matplotlib** to create visualizations of the dataset.