<font size="+4" color='#274472'><b> Exploratory Data Analysis of Medicare and Medicaid with the Bokeh Library 📚 </b></font>

<div style="border-radius:10px; border:#5885AF solid; padding: 15px; background-color: #C3E0E5; font-size:100%; text-align:left">
   <h3 ><b>Intro</b></h3>
    <font color='#274472'><b>What is Bokeh?</b></font><br>
Bokeh is a Python library for creating interactive visualizations in web browsers. It lets you build interactive plots, dashboards, and applications in Python.<br><br>
    
The notebook explores a dataset of Medicare and Medicaid claims from 2003 onwards, compiled by CMS. It covers categories such as Inpatient and Outpatient claims, with indicators computed by CDC's Division for Heart Disease and Stroke Prevention. The dataset feeds into the National Cardiovascular Disease Surveillance System, which aims to provide a comprehensive view of cardiovascular diseases (CVDs) and their risk factors. It is organized by location and indicator, which allows trend analysis and stratification by sex and race/ethnicity and helps surface public health disparities. Using visualizations such as line plots and bar plots, the notebook examines CVD prevalence and how it varies across demographics and regions, guided by the following questions:<br>
    <ol>
        <li> What is the trend of heart disease prevalence over the years in the United States?</li>
        <li>Which year had the highest prevalence of heart disease?</li>
        <li>Which state has the highest average prevalence of heart disease?</li>
        <li>How does the prevalence of heart disease differ between males and females?</li>
        <li>What is the distribution of heart disease prevalence across different categories?</li>
        <li>What is the trend of heart disease prevalence among different racial/ethnic groups?</li>
        <li>What is the confidence interval for the prevalence of heart disease in each state?</li>
        <li>How does the prevalence of heart disease vary between different data sources?</li>
    </ol>
</div>

<div style="border-radius:10px; border:#EF7C8E solid; padding: 15px; font-size:100%; text-align:left">
‚úçüèª Since it's already pre-installed in Kaggle let's go ahead and start by importing the necessary modules
</div>

In [2]:
from bokeh.plotting import figure, output_file, show, output_notebook
import pandas as pd
import numpy as np

ModuleNotFoundError: No module named 'bokeh'
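The `ModuleNotFoundError` above suggests Bokeh was not importable in the environment where this cell last ran. A minimal sketch of checking for a package before importing it, using only the standard library; the helper name `is_installed` is hypothetical and chosen for illustration:

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("bokeh"):
    print("bokeh is available")
else:
    # Outside Kaggle, installing it is typically done with: pip install bokeh
    print("bokeh is missing; install it before running the plotting cells")
```

Running a check like this first makes the failure mode explicit instead of surfacing as a traceback in the import cell.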

In [2]:
medi = pd.read_csv("/kaggle/input/medicare-and-medicaid/CMS.csv")
medi.head(5)

Unnamed: 0,RowId,YearStart,LocationAbbr,LocationDesc,DataSource,PriorityArea1,PriorityArea2,PriorityArea3,PriorityArea4,Class,...,Break_Out_Category,Break_Out,ClassId,TopicId,QuestionId,Data_Value_TypeID,BreakOutCategoryId,BreakOutId,LocationId,GeoLocation
0,,2016,US,United States,Medicare,,,,,Cardiovascular Diseases,...,Race,Unknown,C1,T1,MD101,Crude,BOC04,RAC08,59,
1,,2017,US,United States,Medicare,,,,,Cardiovascular Diseases,...,Race,Unknown,C1,T1,MD101,Crude,BOC04,RAC08,59,
2,,2018,US,United States,Medicare,,,,,Cardiovascular Diseases,...,Race,Unknown,C1,T1,MD101,Crude,BOC04,RAC08,59,
3,,2019,US,United States,Medicare,,,,,Cardiovascular Diseases,...,Gender,Male,C1,T1,MD101,Crude,BOC02,GEN01,59,
4,,2020,US,United States,Medicare,,,,,Cardiovascular Diseases,...,Gender,Male,C1,T1,MD101,Crude,BOC02,GEN01,59,


<div style=" display: flex; gap: 10px;">
      <img src="https://i.pinimg.com/736x/6e/7f/fd/6e7ffdcaa00c1fa9694388d01f870ec4.jpg" alt="Small Image" style="width: 100px; height: auto;">
   </div> 
<b>Let's inspect the data further</b>

In [3]:
medi.shape

(33454, 30)

In [4]:
medi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33454 entries, 0 to 33453
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   RowId                       0 non-null      float64
 1   YearStart                   33454 non-null  int64  
 2   LocationAbbr                33454 non-null  object 
 3   LocationDesc                33454 non-null  object 
 4   DataSource                  33454 non-null  object 
 5   PriorityArea1               0 non-null      float64
 6   PriorityArea2               0 non-null      float64
 7   PriorityArea3               0 non-null      float64
 8   PriorityArea4               0 non-null      float64
 9   Class                       33454 non-null  object 
 10  Topic                       33454 non-null  object 
 11  Question                    33454 non-null  object 
 12  Data_Value_Type             33454 non-null  object 
 13  Data_Value_Unit             334

<div style="border-radius:10px; border:#EF7C8E solid; padding: 15px; font-size:100%; text-align:left">
‚úçüèª From this we can conclude that <b>RowId, PriorityArea1, PriorityArea2, PriorityArea3, PriorityArea4, Data_Value_Footnote_Symbol, Data_Value_Footnote</b> have less significance in the dataset because the whole column is null
</div>
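Rather than reading the column list off the `info()` output by hand, the entirely-null columns can be found programmatically. A minimal sketch on a toy frame standing in for `medi` (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `medi`; the values are illustrative only.
toy = pd.DataFrame({
    "RowId": [np.nan, np.nan, np.nan],
    "YearStart": [2016, 2017, 2018],
    "GeoLocation": ["a", None, "c"],
})

# A column qualifies only if every one of its values is null.
all_null_cols = toy.columns[toy.isna().all()].tolist()

# Drop those columns without mutating the original frame.
cleaned = toy.drop(columns=all_null_cols)
print(all_null_cols)
```

`toy.dropna(axis=1, how="all")` gives the same cleaned frame in one call; the explicit list is handy when you want to report which columns were removed.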

In [5]:
features = ['RowId', 'PriorityArea1', 'PriorityArea2', 'PriorityArea3', 'PriorityArea4', 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote']
medi.drop(features, axis=1, inplace=True)
medi.columns

Index(['YearStart', 'LocationAbbr', 'LocationDesc', 'DataSource', 'Class',
       'Topic', 'Question', 'Data_Value_Type', 'Data_Value_Unit', 'Data_Value',
       'Data_Value_Alt', 'Low_Confidence_Limit', 'High_Confidence_Limit',
       'Break_Out_Category', 'Break_Out', 'ClassId', 'TopicId', 'QuestionId',
       'Data_Value_TypeID', 'BreakOutCategoryId', 'BreakOutId', 'LocationId',
       'GeoLocation'],
      dtype='object')

In [6]:
medi.isnull().sum()

YearStart                  0
LocationAbbr               0
LocationDesc               0
DataSource                 0
Class                      0
Topic                      0
Question                   0
Data_Value_Type            0
Data_Value_Unit            0
Data_Value                 0
Data_Value_Alt             0
Low_Confidence_Limit       0
High_Confidence_Limit      0
Break_Out_Category         0
Break_Out                  0
ClassId                    0
TopicId                    0
QuestionId                 0
Data_Value_TypeID          0
BreakOutCategoryId         0
BreakOutId                 0
LocationId                 0
GeoLocation              720
dtype: int64

<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>1 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>What is the trend of heart disease prevalence over the years in the United States?</b></h4>
            </div>
        </div>
    </div>

In [7]:
medi['Topic'].unique()

array(['Major Cardiovascular Disease',
       'Diseases of the Heart (Heart Disease)', 'Stroke', 'Heart Failure',
       'Acute Myocardial Infarction (Heart Attack)',
       'Coronary Heart Disease'], dtype=object)

<div style="border-radius:10px; border:#EF7C8E solid; padding: 15px; font-size:100%; text-align:left">
🎯 My focus here will be the Major Cardiovascular Disease topic, which the rest of the notebook refers to as heart disease.
</div>

In [8]:
#filter based on topic
heart_disease_data = medi[medi['Topic'] == 'Major Cardiovascular Disease']

#group by year
mean_prevalence_by_year = heart_disease_data.groupby('YearStart')['Data_Value'].mean()

#output in the notebook
output_notebook()
#output in the file
output_file("heart_disease_trend.html")

p = figure(title="Trend of Heart Disease Prevalence Over Years", x_axis_label='Year', y_axis_label='Prevalence')
p.line(mean_prevalence_by_year.index, mean_prevalence_by_year.values, line_width=2)
show(p)

<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>2 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>Which year had the highest prevalence of heart disease?</b></h4>
            </div>
        </div>
    </div>

In [9]:
top_3_year = mean_prevalence_by_year.nlargest(3)
top_3_year

YearStart
2017    2236.404909
2016    2235.418017
2019    2212.455864
Name: Data_Value, dtype: float64

In [10]:
#year with highest mean prevalence
highest_prevalence_year = mean_prevalence_by_year.idxmax()
highest_prevalence_year_value = mean_prevalence_by_year.max()

print(f"The year with the highest average prevalence of heart disease is {highest_prevalence_year} with a prevalence of {highest_prevalence_year_value:.2f}")

The year with the highest average prevalence of heart disease is 2017 with a prevalence of 2236.40


<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>3 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>Which state has the highest average prevalence of heart disease?</b></h4>
            </div>
        </div>
    </div>

In [11]:
mean_prevalence_by_state = heart_disease_data.groupby('LocationDesc')['Data_Value'].mean().reset_index()

output_notebook()
p = figure(x_range=mean_prevalence_by_state['LocationDesc'], width=800,
           title="Mean Prevalence of Heart Disease by State", toolbar_location=None, tools="")

p.vbar(x='LocationDesc', top='Data_Value', width=0.9, source=mean_prevalence_by_state,
       line_color='white', fill_color="#81B622")

#rotate the x-axis labels
p.xaxis.major_label_orientation = "vertical"

p.xaxis.axis_label = "State"
p.yaxis.axis_label = "Mean Prevalence"
p.title.align = "center"
p.title.text_font_size = "16px"

show(p)

In [12]:
#state with top 3 highest mean prevalence
top_3_state = heart_disease_data.groupby('LocationDesc')['Data_Value'].mean().nlargest(3)
top_3_state

LocationDesc
Michigan         2640.294444
West Virginia    2519.355556
Ohio             2454.620370
Name: Data_Value, dtype: float64

In [13]:
#state with highest mean prevalence
highest_prevalence_state = top_3_state.idxmax()
highest_prevalence_value = top_3_state.max()

print(f"The state with the highest average prevalence of heart disease is {highest_prevalence_state} with a prevalence of {highest_prevalence_value:.2f}")

The state with the highest average prevalence of heart disease is Michigan with a prevalence of 2640.29
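The `head()` output earlier shows rows with `LocationAbbr` equal to `US`, so a ranking grouped on `LocationDesc` may include the national aggregate alongside the states. A minimal sketch, on made-up toy data, of excluding those rows before ranking; whether other non-state locations exist in the real file is not checked here:

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "LocationAbbr": ["US", "MI", "MI", "OH"],
    "LocationDesc": ["United States", "Michigan", "Michigan", "Ohio"],
    "Data_Value": [2200.0, 2600.0, 2700.0, 2450.0],
})

# Keep state-level rows only by dropping the national aggregate.
states_only = toy[toy["LocationAbbr"] != "US"]

# Mean per state, highest first.
mean_by_state = (
    states_only.groupby("LocationDesc")["Data_Value"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_by_state.head(3))
```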


<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>4 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>How does the prevalence of heart disease differ between males and females?</b></h4>
            </div>
        </div>
    </div>

In [14]:
medi['Break_Out'].unique()

array(['Unknown', 'Male', 'Non-Hispanic Asian', 'Non-Hispanic White',
       'Hispanic', '75+', 'Other', 'Non-Hispanic Black', 'Female',
       'Overall'], dtype=object)

In [15]:
#filter for male and female populations
male_data = heart_disease_data[heart_disease_data['Break_Out'] == 'Male']
female_data = heart_disease_data[heart_disease_data['Break_Out'] == 'Female']

#mean prevalence for males and females
mean_prevalence_male = male_data['Data_Value'].mean()
mean_prevalence_female = female_data['Data_Value'].mean()

print(f"The average prevalence of heart disease is {mean_prevalence_male:.2f} for males and {mean_prevalence_female:.2f} for females.")

The average prevalence of heart disease is 2483.32 for males and 2004.94 for females.


In [16]:
output_notebook()

p = figure(title="Mean Prevalence of Heart Disease for Males and Females")

#percentage of prevalence for males and females
total_prevalence = mean_prevalence_male + mean_prevalence_female
male_percentage = (mean_prevalence_male / total_prevalence) * 100
female_percentage = (mean_prevalence_female / total_prevalence) * 100

#pie slices for males and females
p.wedge(x=0, y=1, radius=0.4, start_angle=0, end_angle=male_percentage/100*2*np.pi, color="#3D550C", legend_label="Male")
p.wedge(x=0, y=1, radius=0.4, start_angle=male_percentage/100*2*np.pi, end_angle=2*np.pi, color="#81B622", legend_label="Female")

#hide the axes
p.axis.visible = False

show(p)
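Each slice's angular extent is its share of the total multiplied by 2π. A small sketch of that arithmetic with NumPy, using the two means printed above as inputs; the helper name `wedge_angles` is hypothetical and not part of Bokeh:

```python
import numpy as np


def wedge_angles(values, start=0.0):
    """Return (start_angles, end_angles) in radians for consecutive pie slices."""
    values = np.asarray(values, dtype=float)
    # Angular width of each slice: share of the total times a full turn.
    spans = values / values.sum() * 2 * np.pi
    # Each slice ends where the running total of widths reaches.
    ends = start + np.cumsum(spans)
    starts = ends - spans
    return starts, ends


starts, ends = wedge_angles([2483.32, 2004.94])
print(np.degrees(ends - starts).round(1))  # slice widths in degrees
```

The resulting arrays can be passed as `start_angle` and `end_angle` to a wedge glyph, which avoids repeating the percentage expression for every slice.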

<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>5 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>What is the distribution of heart disease prevalence across different categories?</b></h4>
            </div>
        </div>
    </div>

In [17]:
bo_categories = heart_disease_data['Break_Out_Category'].unique()
bo_categories

array(['Race', 'Gender', 'Age', 'Overall'], dtype=object)

In [18]:
#mean prevalence for each break-out category
mean_prevalence_by_bo_category = {}
for bo_category in bo_categories:
    bo_category_data = heart_disease_data[heart_disease_data['Break_Out_Category'] == bo_category]
    mean_prevalence_by_bo_category[bo_category] = bo_category_data['Data_Value'].mean()

print("Mean Prevalence of Heart Disease Across Break-Out Categories:")
for bo_category, mean_prevalence in mean_prevalence_by_bo_category.items():
    print(f"{bo_category}: {mean_prevalence:.2f}")

Mean Prevalence of Heart Disease Across Break-Out Categories:
Race: 1907.98
Gender: 2244.13
Age: 3125.23
Overall: 2296.86
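The filter-and-average loop is equivalent to a single `groupby`. A minimal sketch on toy data (values are made up), assuming the same `Break_Out_Category` and `Data_Value` column names:

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "Break_Out_Category": ["Race", "Race", "Gender", "Gender", "Age", "Overall"],
    "Data_Value": [10.0, 20.0, 30.0, 50.0, 60.0, 25.0],
})

# One mean per category, computed in a single pass.
mean_by_category = toy.groupby("Break_Out_Category")["Data_Value"].mean()
print(mean_by_category)
```

Note that `groupby` sorts the categories alphabetically by default, so the order differs from the order of `unique()`; pass `sort=False` to keep first-appearance order.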


In [19]:
from bokeh.models import ColumnDataSource

output_notebook()

#total prevalence
total_prevalence = sum(mean_prevalence_by_bo_category.values())

#percentage of prevalence for each break-out category
percentages = [(value / total_prevalence) * 100 for value in mean_prevalence_by_bo_category.values()]
p = figure(title="Mean Prevalence of Heart Disease Across Different Break-Out Categories", toolbar_location=None,
           tools="hover", tooltips="@bo_category: @percent{0.2f}%")

#color palette
colors = ['#E7F2F8','#74BDCB','#FFA384','#EFE7BC']

#start and end angle of each slice, beginning at the top of the circle
end_angles = np.pi / 2 + np.cumsum(percentages) / 100 * 2 * np.pi
start_angles = np.concatenate(([np.pi / 2], end_angles[:-1]))

#the hover tooltip reads the bo_category and percent columns from this source
source = ColumnDataSource(data=dict(bo_category=list(mean_prevalence_by_bo_category.keys()), percent=percentages,
                                    start_angle=start_angles, end_angle=end_angles,
                                    color=colors[:len(percentages)]))
p.wedge(x=0, y=0, radius=1, start_angle='start_angle', end_angle='end_angle', color='color',
        legend_field='bo_category', line_color='white', source=source)

p.axis.visible = False

show(p)

<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>6 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>What is the trend of heart disease prevalence among different racial/ethnic groups?</b></h4>
            </div>
        </div>
    </div>

In [20]:
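#keep only the break-outs that belong to the Race category
racial_groups = heart_disease_data.loc[heart_disease_data['Break_Out_Category'] == 'Race', 'Break_Out'].unique()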

In [21]:
mean_prevalence_by_year_race = {}
for racial_group in racial_groups:
    racial_group_data = heart_disease_data[heart_disease_data['Break_Out'] == racial_group]
    mean_prevalence_by_year_race[racial_group] = racial_group_data.groupby('YearStart')['Data_Value'].mean()


output_notebook()

color_palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

p = figure(title="Trend of Heart Disease Prevalence Among Racial/Ethnic Groups Over Years", x_axis_label='Year', y_axis_label='Prevalence', width=800)
for i, (racial_group, prevalence_by_year) in enumerate(mean_prevalence_by_year_race.items()):
    color = color_palette[i % len(color_palette)]  # Ensure cycling through colors if there are more racial groups than colors
    p.line(prevalence_by_year.index, prevalence_by_year.values, legend_label=racial_group, line_width=2, color=color)

#legend location
p.legend.location = "top_left"
show(p)
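The per-group loop can also be expressed as a single pivot. A minimal sketch on toy data (values are made up) that keeps only rows whose `Break_Out_Category` is `Race` and then computes the yearly mean per group:

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "YearStart": [2016, 2016, 2017, 2017, 2016, 2017],
    "Break_Out_Category": ["Race", "Race", "Race", "Race", "Gender", "Gender"],
    "Break_Out": ["Hispanic", "Other", "Hispanic", "Other", "Male", "Male"],
    "Data_Value": [100.0, 80.0, 110.0, 90.0, 200.0, 210.0],
})

# Restrict to race/ethnicity rows so gender and age groups do not leak in.
race_only = toy[toy["Break_Out_Category"] == "Race"]

# Rows are years, columns are groups, cells are mean values.
trend = race_only.pivot_table(
    index="YearStart", columns="Break_Out", values="Data_Value", aggfunc="mean"
)
print(trend)
```

Each column of `trend` is then one line on the plot, with `trend.index` as the shared x values.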

<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>7 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>What is the confidence interval for the prevalence of heart disease in each state?</b></h4>
            </div>
        </div>
    </div>

In [22]:
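#mean lower and upper confidence limits for each state
#selecting the two limit columns first keeps the grouping column out of apply
confidence_intervals_by_state = heart_disease_data.groupby('LocationDesc')[['Low_Confidence_Limit', 'High_Confidence_Limit']].apply(lambda x: (x['Low_Confidence_Limit'].mean(), x['High_Confidence_Limit'].mean()))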

print("Confidence Intervals for Heart Disease Prevalence in Each State:")
for state, (low, high) in confidence_intervals_by_state.items():
    print(f"{state}: [{low:.2f}, {high:.2f}]")


Confidence Intervals for Heart Disease Prevalence in Each State:
Alabama: [2091.47, 2562.27]
Alaska: [1479.35, 2081.80]
Arizona: [1697.81, 1921.08]
Arkansas: [2221.68, 2703.36]
California: [1902.31, 1981.77]
Colorado: [1484.12, 1761.18]
Connecticut: [2113.80, 2461.68]
Delaware: [1833.28, 2392.83]
Florida: [2364.87, 2506.81]
Georgia: [1944.80, 2168.39]
Hawaii: [1317.53, 1900.67]
Idaho: [1363.87, 2048.68]
Illinois: [2235.81, 2393.66]
Indiana: [2183.09, 2526.24]
Iowa: [1825.70, 2314.98]
Kansas: [2043.47, 2438.22]
Kentucky: [2116.97, 2647.80]
Louisiana: [2205.07, 2620.82]
Maine: [1667.25, 2880.02]
Maryland: [1783.35, 1959.38]
Massachusetts: [2159.96, 2390.82]
Michigan: [2512.80, 2782.65]
Minnesota: [2041.42, 2493.06]
Mississippi: [2119.51, 2731.19]
Missouri: [2204.26, 2607.26]
Montana: [1471.45, 2516.52]
Nebraska: [1819.40, 2357.37]
Nevada: [1972.96, 2272.63]
New Hampshire: [1819.03, 2645.97]
New Jersey: [2242.19, 2408.67]
New Mexico: [1558.04, 1967.78]
New York: [2231.62, 2360.18]
North C



In [23]:
output_notebook()

#state names and confidence intervals
states = list(confidence_intervals_by_state.keys())
low_means = [confidence_intervals_by_state[state][0] for state in states]
high_means = [confidence_intervals_by_state[state][1] for state in states]


p = figure(title="Mean Low and High Confidence Limits by State", x_range=states, y_axis_label='Mean', width=800)
#mean low confidence limits
p.line(states, low_means, legend_label="Mean Low Confidence Limit", line_width=2, color='blue')
#mean high confidence limits
p.line(states, high_means, legend_label="Mean High Confidence Limit", line_width=2, color='red')
#rotate the x-axis labels for better readability
p.xaxis.major_label_orientation = "vertical"

show(p)
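Averaging the reported limits summarizes the typical bounds per state; it is not itself a confidence interval for the state-level mean. A minimal sketch, on made-up toy data, of computing the same per-state averages as a DataFrame and adding the interval width, which is often easier to compare than two separate lines:

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "LocationDesc": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Low_Confidence_Limit": [2000.0, 2200.0, 1400.0, 1500.0],
    "High_Confidence_Limit": [2500.0, 2700.0, 2000.0, 2100.0],
})

limit_cols = ["Low_Confidence_Limit", "High_Confidence_Limit"]

# Mean of each limit per state, one row per state.
mean_limits = toy.groupby("LocationDesc")[limit_cols].mean()

# Width of the averaged interval; wider values indicate less precise estimates.
mean_limits["width"] = (
    mean_limits["High_Confidence_Limit"] - mean_limits["Low_Confidence_Limit"]
)
print(mean_limits)
```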

<div style="border-radius: 10px; border: #5885AF solid; padding: 15px; font-size: 100%; text-align: left;
            background-image: url('https://images.unsplash.com/photo-1579154392128-bf8c7ebee541?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'); background-size: cover;
            position: relative;">
        <div style="display: flex; flex-direction: row;">
            <div style="width: 8%;">
                <font size="+2" color="#FFFFFF"><b>8 •</b></font>
            </div>
            <div style="width: 92%;">
                <h4 style="color:#C3E0E5"><b>How does the prevalence of heart disease vary between different data sources?</b></h4>
            </div>
        </div>
    </div>

In [24]:
medi['DataSource'].unique()

array(['Medicare'], dtype=object)
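Since the output above lists a single source, a between-source comparison is not possible with this file. A small sketch, on made-up toy data, of guarding the comparison so it only runs when more than one source is present:

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "DataSource": ["Medicare", "Medicare", "Medicare"],
    "Data_Value": [2100.0, 2300.0, 2200.0],
})

n_sources = toy["DataSource"].nunique()
if n_sources > 1:
    # With several sources, compare their mean values side by side.
    print(toy.groupby("DataSource")["Data_Value"].mean())
else:
    only_source = toy["DataSource"].iloc[0]
    print(f"Only one data source ({only_source}); nothing to compare.")
```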