In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import folium
import json

In [2]:
# Define path
path = r'/Users/Shikongo/Healthcare Cost Analysis and Prediction/'

In [3]:
# Open the GeoJSON file and parse it into a dictionary.
# The context manager closes the file once it has been read.
with open(path + 'geoJSON_us_regions copy.json') as f:
    region_geo = json.load(f)

In [4]:
region_geo = r'/Users/Shikongo/Healthcare Cost Analysis and Prediction/geoJSON_us_regions.json'

In [5]:
region_geo

'/Users/Shikongo/Healthcare Cost Analysis and Prediction/geoJSON_us_regions.json'
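
The `key_on` value passed to folium later in this notebook has to match a property that actually exists in the GeoJSON file, so it can help to list the feature names first. This is a minimal sketch, assuming the file is a standard FeatureCollection whose features carry a `name` property (as `key_on = 'feature.properties.name'` implies). `geojson_feature_names` is a hypothetical helper, and the demo writes a tiny stand-in file rather than reading the real regions file:

```python
import json
import os
import tempfile


def geojson_feature_names(geojson_path):
    """Return the `properties.name` value of every feature in a GeoJSON file."""
    # The context manager closes the file even if parsing fails.
    with open(geojson_path) as f:
        geo = json.load(f)
    return [feature["properties"]["name"] for feature in geo["features"]]


# Tiny stand-in FeatureCollection (hypothetical data, not the real regions file).
demo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Midwest"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "South"}, "geometry": None},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_regions.json")
    with open(demo_path, "w") as f:
        json.dump(demo, f)
    names = geojson_feature_names(demo_path)

print(names)
```

Pointing the helper at the real file would show whether the names are spelled exactly like the values in the `Region` column.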

In [7]:
# Load the cleaned insurance data (the empty encoding argument is dropped so pandas uses its default)
df_insurance = pd.read_csv(os.path.join(path, 'Healthcare Clean Data', ' Clean Data', 'insurance2.csv'))

In [8]:
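# Strip dollar signs and thousands separators, then convert to float.
# The raw string keeps `\$` as a regex escape instead of an invalid Python string escape.
df_insurance['Charges'] = df_insurance['Charges'].replace(r'[\$,]', '', regex=True).astype(float)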
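
To see what the cleaning step does, here is a small sketch on made-up values (the strings below are hypothetical, not rows from `insurance2.csv`):

```python
import pandas as pd

# Hypothetical sample values in the formats the cleaning step is meant to handle.
raw = pd.Series(["$16,884.92", "1725.55", "$4,449.46"])

# Remove every "$" and "," character, then convert the result to float.
clean = raw.replace(r"[\$,]", "", regex=True).astype(float)

print(clean.tolist())
```

Values that are already plain numbers pass through unchanged, which is why the step is safe to rerun on a column that has been cleaned before.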

In [9]:
df_insurance.head()

Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Age,Sex,BMI,Children,Smoker,Region,Charges
0,0,0,0,0,0,0,19,female,27.9,0,yes,southwest,16884.924
1,1,1,1,1,1,1,18,male,33.77,1,no,southeast,1725.5523
2,2,2,2,2,2,2,28,male,33.0,3,no,southeast,4449.462
3,3,3,3,3,3,3,33,male,22.705,0,no,northwest,21984.47061
4,4,4,4,4,4,4,32,male,28.88,0,no,northwest,3866.8552


In [10]:
# Create new variable with values 1 = Female & 0 = Male
df_insurance['Sex_Dummy'] = pd.get_dummies(df_insurance['Sex'])['female']

In [11]:
# Create new variable with values 1 = Smoker & 0 = Non-smoker
df_insurance['Smoker_Dummy'] = pd.get_dummies(df_insurance['Smoker'])['yes']
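
Depending on the pandas version, `pd.get_dummies` may return boolean columns (True/False) rather than 1/0. A minimal sketch of an alternative that always yields integers, shown on a made-up three-row frame (the rows are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "Sex": ["female", "male", "male"],
    "Smoker": ["yes", "no", "no"],
})

# Comparing against the category of interest and casting to int gives an
# explicit 0/1 column regardless of the pandas version.
demo["Sex_Dummy"] = (demo["Sex"] == "female").astype(int)     # 1 = female, 0 = male
demo["Smoker_Dummy"] = (demo["Smoker"] == "yes").astype(int)  # 1 = smoker, 0 = non-smoker

print(demo)
```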

In [12]:
# Rename the regions so they match the region names in the GeoJSON file (needed for the map)
df_insurance = df_insurance.replace(['northeast', 'northwest', 'southeast', 'southwest'], ['Northeast', 'Midwest', 'South', 'West'])

In [13]:
df_insurance['Region'] = df_insurance['Region'].replace(['northeast', 'northwest', 'southeast', 'southwest'], ['Northeast', 'Midwest', 'South', 'West'])
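
The same renaming can be written with a dictionary, which makes each old/new pair explicit, and the result can then be checked against the GeoJSON names. This is a sketch on a made-up frame. `geo_names` is a hypothetical set typed in by hand here; in practice it would be read from the GeoJSON features.

```python
import pandas as pd

# Mapping used in this notebook, written as a dict so each pair is explicit.
region_map = {
    "northeast": "Northeast",
    "northwest": "Midwest",
    "southeast": "South",
    "southwest": "West",
}

demo = pd.DataFrame({"Region": ["southwest", "southeast", "northwest", "northeast"]})
demo["Region"] = demo["Region"].replace(region_map)

# Hypothetical set of names; any region not in it would be left blank on the map.
geo_names = {"Northeast", "Midwest", "South", "West"}
unmatched = set(demo["Region"]) - geo_names

print(demo["Region"].tolist())
print(unmatched)
```

An empty `unmatched` set means every row can be joined to a map feature.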


In [15]:
# Aggregate the number of children per region, before using the folium library.
Children_per_region = df_insurance.groupby('Region')['Children'].sum().reset_index()

In [16]:
print(Children_per_region)

      Region  Children
0    Midwest       373
1  Northeast       339
2      South       382
3       West       371
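
A regional total depends on how many policyholders each region has, so it may be worth looking at children per policyholder alongside the sum. A minimal sketch on made-up rows (hypothetical values, not the insurance data):

```python
import pandas as pd

demo = pd.DataFrame({
    "Region": ["South", "South", "South", "West", "West"],
    "Children": [1, 0, 2, 2, 2],
})

# Named aggregation: total children, number of rows, and children per row.
summary = (
    demo.groupby("Region")["Children"]
        .agg(Total="sum", Policyholders="count", Per_Policyholder="mean")
        .reset_index()
)

print(summary)
```

In this toy example the South has fewer children in total per row than the West, even though it has more rows, which is the kind of difference a plain sum can hide.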


In [17]:
df_insurance = df_insurance.drop(['Unnamed: 0.1', 'Unnamed: 0'], axis=1)
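
Listing the leftover index columns by name only removes the ones that are spelled out. As a sketch of an alternative, the columns can be selected by prefix, shown here on a made-up frame (hypothetical rows):

```python
import pandas as pd

demo = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "Age": [19, 18],
    "Charges": [16884.924, 1725.5523],
})

# Keep every column whose name does not start with "Unnamed".
cleaned = demo.loc[:, ~demo.columns.str.startswith("Unnamed")]

print(cleaned.columns.tolist())
```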

In [18]:
df_insurance.head()

Unnamed: 0,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Age,Sex,BMI,Children,Smoker,Region,Charges,Sex_Dummy,Smoker_Dummy
0,0,0,0,19,female,27.9,0,yes,West,16884.924,1,1
1,1,1,1,18,male,33.77,1,no,South,1725.5523,0,0
2,2,2,2,28,male,33.0,3,no,South,4449.462,0,0
3,3,3,3,33,male,22.705,0,no,Midwest,21984.47061,0,0
4,4,4,4,32,male,28.88,0,no,Midwest,3866.8552,0,0


Discuss the results and what they mean.
Does the analysis answer any of your existing research questions?

The choropleth map displays regional variations in healthcare charges: the Midwest has the highest average insurance charges (24,556), while the lowest average charge among the other regions is 1,630. This addresses my "Funneling question" about regional variations in healthcare charges.

As for my "Clarifying Question" about which factors significantly impact healthcare charges, the map alone does not provide that answer. It visually presents regional disparities in healthcare charges, but it does not reveal the underlying factors causing them.

To understand the factors impacting healthcare charges, I would need to perform additional analysis, for example a regression, to identify which of the variables mentioned earlier (age, BMI, smoking status and number of children) are most strongly associated with healthcare charges.
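
As a rough illustration of what such a regression could look like, here is a minimal ordinary least squares sketch using NumPy only. The data are synthetic and generated from coefficients I chose myself, so the output says nothing about the real insurance data; it only shows the mechanics of fitting charges on age and smoking status.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (hypothetical values, not the insurance data).
age = rng.integers(18, 65, size=n).astype(float)
smoker = rng.integers(0, 2, size=n).astype(float)

# Synthetic charges built from known coefficients plus random noise.
charges = 2000.0 + 250.0 * age + 20000.0 * smoker + rng.normal(0, 500.0, size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), age, smoker])
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)

for name, value in zip(["intercept", "age", "smoker"], coef):
    print(f"{name}: {value:.1f}")
```

On the real data the same design matrix could be built from `Age`, `BMI`, `Children` and `Smoker_Dummy`, and the size of each fitted coefficient would indicate how strongly that variable is associated with charges.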


Does the analysis lead you to any new research questions?
The choropleth map provides valuable information about regional variations, but determining the factors that influence healthcare charges requires further analysis, which leads to new research questions.

New research questions:
What demographic and lifestyle factors are driving the higher healthcare charges in the Midwest region compared to other regions?
How do factors like smoking prevalence, average age, or obesity rates differ across regions and impact healthcare costs?
Analyzing the relationships between these variables and healthcare charges can lead to a more comprehensive understanding of the factors at play.
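
One way to start on these questions is a per-region profile of the candidate factors. This is a sketch on made-up rows (hypothetical values); the column names follow the ones used in this notebook, and the mean of `Smoker_Dummy` is used as the share of smokers.

```python
import pandas as pd

demo = pd.DataFrame({
    "Region": ["Midwest", "Midwest", "South", "South"],
    "Age": [30, 40, 20, 60],
    "BMI": [22.0, 28.0, 31.0, 35.0],
    "Smoker_Dummy": [0, 1, 0, 0],
    "Charges": [4000.0, 22000.0, 2000.0, 12000.0],
})

# One row per region: average age and BMI, share of smokers, average charges.
profile = demo.groupby("Region").agg(
    Mean_Age=("Age", "mean"),
    Mean_BMI=("BMI", "mean"),
    Smoker_Rate=("Smoker_Dummy", "mean"),
    Mean_Charges=("Charges", "mean"),
)

print(profile)
```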

In [19]:
# Base map centered on the continental US ("m" avoids shadowing Python's built-in map())
m = folium.Map(location = [40, -95], zoom_start = 4)

# Shade each region by its total number of children
folium.Choropleth(
    geo_data = region_geo, 
    data = Children_per_region,
    columns = ['Region', 'Children'],
    key_on = 'feature.properties.name',
    fill_color = 'YlOrBr', fill_opacity=0.6, line_opacity=0.1,
    legend_name = "Children").add_to(m)

folium.LayerControl().add_to(m)

m

Discuss the results and what they mean. Does the analysis answer any of your existing research questions?

The choropleth map relates to the following questions:
Question 1: What factors significantly impact healthcare charges?
Funneling questions:
Are there regional variations in healthcare charges?
In terms of regional disparities, the choropleth map displays the number of children covered by insurance in each region rather than healthcare charges themselves. According to the aggregated table above, the South (382) and the Midwest (373) have the most covered children, closely followed by the West (371), while the Northeast has noticeably fewer (339). These differences suggest that geographic location may play a role in healthcare costs, but confirming that requires comparing charges across regions directly.


Clarifying Question 2:
How is healthcare cost distribution related to the number of children covered by insurance?
Funneling questions:
Is there a correlation between the number of children and healthcare charges?
Do individuals with more children tend to have higher or lower healthcare costs?
Regarding the relationship between healthcare cost distribution and the number of children covered, the data suggests a possible correlation: regions with a larger number of covered children appear to exhibit slightly elevated healthcare charges. Because the map plots only the number of children, this relationship still needs to be checked against the charges themselves. It also raises questions about the specific factors contributing to these regional differences. For example, are there differences in healthcare infrastructure, socioeconomic factors, or policy implementations that could explain the observed patterns?
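
The two funneling questions can be checked at the level of individual policyholders with a correlation coefficient and a group mean. A minimal sketch on made-up rows (hypothetical values, chosen so that charges rise with the number of children):

```python
import pandas as pd

demo = pd.DataFrame({
    "Children": [0, 1, 2, 3, 0, 1],
    "Charges": [3000.0, 4500.0, 4800.0, 6500.0, 3200.0, 3900.0],
})

# Pearson correlation between number of children and charges.
r = demo["Children"].corr(demo["Charges"])

# Average charges for each number of children.
mean_by_children = demo.groupby("Children")["Charges"].mean()

print(round(r, 3))
print(mean_by_children)
```

A correlation near zero on the real data would mean the regional pattern on the map is not explained by the number of children alone.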

The analysis prompts new research questions such as:

What specific factors in the Midwest and West regions contribute to the higher healthcare charges for individuals with children?
Are there regional policies or healthcare practices that explain the variations in charges?
How do socioeconomic factors in different regions impact the relationship between the number of children covered and healthcare charges?

In conclusion, the analysis not only provides insights into the existing research questions but also stimulates further inquiries into the underlying factors driving regional variations in healthcare charges and their correlation with the number of covered children.


In [20]:
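# Export without the index so no extra "Unnamed: 0" column appears when the file is read back
df_insurance.to_csv(os.path.join(path, 'Healthcare Clean Data', ' Clean Data', 'insurance_Supervised.csv'), index=False)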
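
The `Unnamed: 0` columns seen earlier in this notebook are what pandas produces when a file saved with its index is read back in. A small sketch of the round trip, using an in-memory buffer and a made-up two-row frame instead of a real file:

```python
import io

import pandas as pd

demo = pd.DataFrame({"Age": [19, 18], "Charges": [16884.924, 1725.5523]})

# By default to_csv writes the index as a first column without a header.
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False writes only the real columns.
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Each save-and-reload cycle with the default setting adds one more such column, which would explain the `Unnamed: 0.1` to `Unnamed: 0.6` variants in the data.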