# Chronic Disease Indicators Exploratory Data Analysis Project

## **Exploratory Data Analysis (EDA)**

### Installing necessary packages and loading the data

In [1]:
!pip install pyspark
!pip install dash plotly

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=f33cf447fdb674cf7f86e4bb714d5ee7690b6e44fbe0a2539017d531b518e068
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
Collecting dash
  Downloading dash-2.14.2-py3-none-any.whl (10.2 MB)
Collecting dash-html-components==2.0.0 (from dash)
  Downloading dash_html_components

In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
type(spark)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import dash
from dash import dcc, html
import plotly.express as px
import plotly.graph_objs as go
from scipy import stats
import tkinter as tk
import seaborn as sns

#### Understanding the data:

In [3]:
! [ ! -e "$(basename us_chronic_disease_indicators.csv)" ] && wget  https://storage.googleapis.com/mbcc/datasets/us_chronic_disease_indicators.csv
ps_df = spark.read.csv('us_chronic_disease_indicators.csv',
                      header = True,
                      inferSchema = True)

--2023-12-13 00:59:00--  https://storage.googleapis.com/mbcc/datasets/us_chronic_disease_indicators.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.213.207, 173.194.215.207, 173.194.216.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.213.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 208030973 (198M) [text/csv]
Saving to: ‘us_chronic_disease_indicators.csv’


2023-12-13 00:59:02 (148 MB/s) - ‘us_chronic_disease_indicators.csv’ saved [208030973/208030973]



In [4]:
print(ps_df.columns)

['yearstart', 'yearend', 'locationabbr', 'locationdesc', 'datasource', 'topic', 'question', 'datavalueunit', 'datavaluetype', 'datavalue', 'lowconfidencelimit', 'highconfidencelimit', 'stratificationcategory1', 'stratification1', 'geolocation', 'topicid', 'questionid', 'datavaluetypeid', 'stratificationcategoryid1', 'stratificationid1']


In [5]:
ps_df.show(5)

+---------+-------+------------+-------------+----------+--------------------+--------------------+-------------+----------------+---------+------------------+-------------------+-----------------------+--------------------+--------------------+-------+----------+---------------+-------------------------+-----------------+
|yearstart|yearend|locationabbr| locationdesc|datasource|               topic|            question|datavalueunit|   datavaluetype|datavalue|lowconfidencelimit|highconfidencelimit|stratificationcategory1|     stratification1|         geolocation|topicid|questionid|datavaluetypeid|stratificationcategoryid1|stratificationid1|
+---------+-------+------------+-------------+----------+--------------------+--------------------+-------------+----------------+---------+------------------+-------------------+-----------------------+--------------------+--------------------+-------+----------+---------------+-------------------------+-----------------+
|     2010|   2010|      

In [6]:
ps_df.printSchema()

root
 |-- yearstart: integer (nullable = true)
 |-- yearend: integer (nullable = true)
 |-- locationabbr: string (nullable = true)
 |-- locationdesc: string (nullable = true)
 |-- datasource: string (nullable = true)
 |-- topic: string (nullable = true)
 |-- question: string (nullable = true)
 |-- datavalueunit: string (nullable = true)
 |-- datavaluetype: string (nullable = true)
 |-- datavalue: double (nullable = true)
 |-- lowconfidencelimit: double (nullable = true)
 |-- highconfidencelimit: double (nullable = true)
 |-- stratificationcategory1: string (nullable = true)
 |-- stratification1: string (nullable = true)
 |-- geolocation: string (nullable = true)
 |-- topicid: string (nullable = true)
 |-- questionid: string (nullable = true)
 |-- datavaluetypeid: string (nullable = true)
 |-- stratificationcategoryid1: string (nullable = true)
 |-- stratificationid1: string (nullable = true)



In [7]:
print(f'There are: {ps_df.count()} rows and {len(ps_df.columns)} columns in the CDI dataframe')

There are: 804578 rows and 20 columns in the CDI dataframe


In [8]:
#Summary statistics of the data value column
ps_df.describe("datavalue").show(5)

+-------+------------------+
|summary|         datavalue|
+-------+------------------+
|  count|            804578|
|   mean|1005.3254076298603|
| stddev| 18804.32508158481|
|    min|               0.0|
|    max|         2925456.0|
+-------+------------------+



### Data Cleaning and exploration

In [9]:
#Checking for duplicate rows in the dataframe
ps_df_before = ps_df.count()
ps_df_no_dupes = ps_df.drop_duplicates()
ps_df_after = ps_df_no_dupes.count()
if ps_df_before == ps_df_after:
    print("No duplicates found.")
else:
    print(f"Duplicates found and removed. Number of duplicates: {ps_df_before - ps_df_after}")

No duplicates found.


In [10]:
#Checking for missing values and reporting results
from pyspark.sql import functions as F  # alias avoids shadowing Python's built-in sum
ps_df.select(*(F.sum(F.col(c).isNull().cast("int")).alias(c) for c in ps_df.columns)).show()

+---------+-------+------------+------------+----------+-----+--------+-------------+-------------+---------+------------------+-------------------+-----------------------+---------------+-----------+-------+----------+---------------+-------------------------+-----------------+
|yearstart|yearend|locationabbr|locationdesc|datasource|topic|question|datavalueunit|datavaluetype|datavalue|lowconfidencelimit|highconfidencelimit|stratificationcategory1|stratification1|geolocation|topicid|questionid|datavaluetypeid|stratificationcategoryid1|stratificationid1|
+---------+-------+------------+------------+----------+-----+--------+-------------+-------------+---------+------------------+-------------------+-----------------------+---------------+-----------+-------+----------+---------------+-------------------------+-----------------+
|        0|      0|           0|           0|         0|    0|       0|       105238|            0|        0|            122198|             122198|            
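
Once the data is in a pandas DataFrame, the same per-column null count can be sketched more compactly. The toy frame below is made up for illustration and only borrows three column names from the dataset.

```python
import pandas as pd

# Toy frame standing in for the CDI data; the values are invented.
toy = pd.DataFrame({
    "datavalue": [1.0, 2.5, 3.0],
    "datavalueunit": ["%", None, None],
    "lowconfidencelimit": [0.5, None, 2.0],
})

# isna() marks missing cells; sum() then counts them column by column.
null_counts = toy.isna().sum()
print(null_counts)
```

Dividing `null_counts` by `len(toy)` would give the share of missing values per column, which is often easier to compare than raw counts.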

In [11]:
#converting geolocation columns into two new latlong columns
from pyspark.sql.functions import col, count, split
from pyspark.sql.functions import regexp_extract

ps_df = ps_df.withColumn('latitude', regexp_extract(ps_df['GeoLocation'], r'\(([^ ]+)', 1).cast('double'))
ps_df = ps_df.withColumn('longitude', regexp_extract(ps_df['GeoLocation'], r'([^ ]+)\)', 1).cast('double'))
ps_df.show(5)

+---------+-------+------------+-------------+----------+--------------------+--------------------+-------------+----------------+---------+------------------+-------------------+-----------------------+--------------------+--------------------+-------+----------+---------------+-------------------------+-----------------+-------------------+------------------+
|yearstart|yearend|locationabbr| locationdesc|datasource|               topic|            question|datavalueunit|   datavaluetype|datavalue|lowconfidencelimit|highconfidencelimit|stratificationcategory1|     stratification1|         geolocation|topicid|questionid|datavaluetypeid|stratificationcategoryid1|stratificationid1|           latitude|         longitude|
+---------+-------+------------+-------------+----------+--------------------+--------------------+-------------+----------------+---------+------------------+-------------------+-----------------------+--------------------+--------------------+-------+----------+--------
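
To see what the two regular expressions capture, here is a small sketch using Python's `re` module. The sample string is hypothetical: the real layout of the `geolocation` column is not visible in the truncated output, so it is worth checking which coordinate comes first before trusting the `latitude` and `longitude` names.

```python
import re

# Hypothetical geolocation string; the real column's layout is an assumption.
sample = "POINT (-86.63186 32.84057)"

# Same patterns as the notebook cell:
# the first token after "(" and the last token before ")".
first = re.search(r'\(([^ ]+)', sample).group(1)
last = re.search(r'([^ ]+)\)', sample).group(1)

first_value, last_value = float(first), float(last)
print(first_value, last_value)
```

If the column really follows a "POINT (longitude latitude)" layout like this sample, the first pattern would return the longitude, so the two new column names would need to be swapped.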

In [12]:
# Casting the quantitative columns to float (inferSchema already read them as double, so this narrows them to 32-bit)
ps_df.printSchema()
columns_to_convert = ['datavalue', 'lowconfidencelimit', 'highconfidencelimit']
for column in columns_to_convert:
    ps_df = ps_df.withColumn(column, col(column).cast('float'))
ps_df.printSchema()

root
 |-- yearstart: integer (nullable = true)
 |-- yearend: integer (nullable = true)
 |-- locationabbr: string (nullable = true)
 |-- locationdesc: string (nullable = true)
 |-- datasource: string (nullable = true)
 |-- topic: string (nullable = true)
 |-- question: string (nullable = true)
 |-- datavalueunit: string (nullable = true)
 |-- datavaluetype: string (nullable = true)
 |-- datavalue: double (nullable = true)
 |-- lowconfidencelimit: double (nullable = true)
 |-- highconfidencelimit: double (nullable = true)
 |-- stratificationcategory1: string (nullable = true)
 |-- stratification1: string (nullable = true)
 |-- geolocation: string (nullable = true)
 |-- topicid: string (nullable = true)
 |-- questionid: string (nullable = true)
 |-- datavaluetypeid: string (nullable = true)
 |-- stratificationcategoryid1: string (nullable = true)
 |-- stratificationid1: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)

root
 |-- yea

In [13]:
#unique values of datavaluetype
ps_df.select("datavaluetype").distinct().show()

+--------------------+
|       datavaluetype|
+--------------------+
|              Number|
|   Age-adjusted Rate|
|             Percent|
|                Mean|
|Average Annual Nu...|
|              Median|
|          Crude Rate|
|Average Annual Cr...|
|          US Dollars|
|Age-adjusted Prev...|
|          Prevalence|
|Average Annual Ag...|
|   Age-adjusted Mean|
|Adjusted by age, ...|
|    Crude Prevalence|
|Per capita alcoho...|
+--------------------+



In [14]:
# Count of diseases
ps_df.groupBy("topic").count().orderBy("count").show()

+--------------------+------+
|               topic| count|
+--------------------+------+
|          Disability|  3239|
| Reproductive Health|  5510|
|        Immunization|  8949|
|       Mental Health| 10716|
|         Oral Health| 16945|
|Chronic Kidney Di...| 18555|
|        Older Adults| 19251|
|             Tobacco| 36670|
|              Asthma| 39846|
|             Alcohol| 42969|
|           Arthritis| 54809|
|Overarching Condi...| 60950|
|Nutrition, Physic...| 63165|
|            Diabetes| 84631|
|Chronic Obstructi...| 94584|
|Cardiovascular Di...|113167|
|              Cancer|130622|
+--------------------+------+



In [15]:
# Grouping by datavaluetype then getting summary statistics for datavalue column
from pyspark.sql import functions as F  # alias avoids shadowing Python's built-in min and max

ps_df.groupBy("datavaluetype").agg(
    F.count("datavalue").alias("count_datavalue"),
    F.mean("datavalue").alias("mean_datavalue"),
    F.min("datavalue").alias("min_datavalue"),
    F.max("datavalue").alias("max_datavalue"),
    F.stddev("datavalue").alias("stddev_datavalue")
).show()

+--------------------+---------------+------------------+-------------+-------------+-------------------+
|       datavaluetype|count_datavalue|    mean_datavalue|min_datavalue|max_datavalue|   stddev_datavalue|
+--------------------+---------------+------------------+-------------+-------------+-------------------+
|              Number|          71811| 8431.119324339037|         0.41|    2925456.0|  55574.83814154899|
|   Age-adjusted Rate|          68895|117.74391724228177|         0.06|       2942.5| 167.74804105776954|
|             Percent|           2543| 38.80876132452108|          0.0|        100.0|  28.30905649403973|
|                Mean|          18429|  5.21849801837735|          0.6|         96.0|  7.808346067146053|
|Average Annual Nu...|          40519|4060.7426639354376|          3.0|    1736608.0|  37750.77214714233|
|              Median|           5941| 1.195371152259669|          0.4|          2.3|0.30447357867434355|
|          Crude Rate|          68895|118.9616

In [16]:
#grouping by topic and question
ps_df.groupBy("topic", "question").count().show()

+--------------------+--------------------+-----+
|               topic|            question|count|
+--------------------+--------------------+-----+
|              Cancer|Cancer of the lun...| 8757|
|Cardiovascular Di...|Hospitalization f...|11235|
|Overarching Condi...|Current health ca...| 2611|
|Chronic Obstructi...|Hospitalization f...| 3912|
|Nutrition, Physic...|Overweight or obe...| 2236|
|Cardiovascular Di...|Influenza vaccina...| 4939|
|Chronic Obstructi...|Prevalence of chr...| 6839|
|Chronic Kidney Di...|Mortality with en...|10989|
|         Oral Health|Visits to dentist...| 3959|
|             Alcohol|Amount of alcohol...|  204|
|              Asthma|Hospitalizations ...| 3507|
|Cardiovascular Di...|Awareness of high...| 1032|
|             Alcohol|Binge drinking pr...| 8084|
|Overarching Condi...|High school compl...| 1590|
|            Diabetes|Mortality with di...| 5829|
|            Diabetes|Prevalence of ges...|   52|
|              Cancer|Papanicolaou smea...| 2318|


In [17]:
# Finding the topic with the most rows (treated here as the "most prominent" disease)
disease_counts = (
    ps_df.groupBy("topic")
    .count()
    .orderBy(col("count").desc())
)
most_prominent_disease = disease_counts.first()
print("Most Prominent Disease:", most_prominent_disease["topic"])
print("Count:", most_prominent_disease["count"])

Most Prominent Disease: Cancer
Count: 130622


In [18]:
#Checking for the highest frequency question in the data set
highest_frequency_question = (
    ps_df.groupBy("question")
    .count()
    .orderBy(col("count").desc())
    .select("question")
    .first()
    .question
)
print("Highest Frequency Question:", highest_frequency_question)

Highest Frequency Question: Hospitalization for chronic obstructive pulmonary disease as any diagnosis among Medicare-eligible persons aged >= 65 years


### EDA Visualizations

#### Loading the modules for visualization and switching from pyspark dataframe to pandas dataframe

In [None]:

from dash.dependencies import Input, Output
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

In [21]:
#Switch from pyspark df to pandas df before doing the visualizations
pd_df = ps_df.toPandas()
df = pd_df

What is the overall distribution of gender in the dataset?

In [None]:
#Dash version of the chart showing the prevalence of diseases by gender - Tosin

# Filter data for Gender
gender_df = pd_df[pd_df['stratificationcategory1'] == 'Gender']

# Calculate count of topics by gender
gender_prevalence = gender_df.groupby('stratification1')['topic'].count()

# Create Dash application
app = dash.Dash(__name__)

# Layout of the Dash app
app.layout = html.Div([
    html.H1('Prevalence of Diseases by Gender'),
    dcc.Graph(
        id='gender-prevalence-bar',
        figure={
            'data': [
                {'x': gender_prevalence.index.tolist(), 'y': gender_prevalence.values.tolist(),
                 'type': 'bar', 'name': 'Gender Distribution',
                 'marker': {'color': ['red', 'darkblue']}}
            ],
            # The key must be 'layout' for the title and axis labels to be applied
            'layout': {
                'title': 'Gender Distribution throughout the Dataset',
                'xaxis': {'title': 'Gender'},
                'yaxis': {'title': 'Occurrence Count'}
            }
        }
    )
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)


The chart shows how many records each gender contributes to the dataset. The representation is balanced, which indicates that the dataset is not skewed towards one gender and provides a fair basis for gender-related analyses.
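
The balance claim can be checked numerically by looking at each gender's share of rows. The sketch below uses a made-up toy frame that borrows only the column names from the dataset, so the shares it prints say nothing about the real data.

```python
import pandas as pd

# Toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "stratificationcategory1": ["Gender", "Gender", "Gender", "Gender", "Overall"],
    "stratification1": ["Female", "Male", "Female", "Male", "Overall"],
})

# Keep the gender strata, then compute each value's share of the remaining rows.
gender_rows = toy[toy["stratificationcategory1"] == "Gender"]
shares = gender_rows["stratification1"].value_counts(normalize=True)
print(shares)
```

Shares close to 0.5 for each gender would support calling the split balanced.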

How is the data distributed across different racial and ethnic groups?

In [None]:
#Bar chart of the distribution of diseases by Race

# Filter data for required columns and exclude specific values in 'stratification1'
df_plot = pd_df[['yearend', 'locationdesc', 'topic', 'question', 'datavalue', 'stratification1']]
exclude_values = ['Female', 'Male', 'Overall']
df_plot = df_plot[~df_plot['stratification1'].isin(exclude_values)]

# Group by 'stratification1' and count occurrences
race_prevalence = df_plot.groupby('stratification1')['datavalue'].count()

# Create Dash application
app = dash.Dash(__name__)

# Layout of the Dash app with the bar chart
app.layout = html.Div([
    html.H1('Distribution by Race'),
    dcc.Graph(
        id='race-distribution-bar',
        figure={
            'data': [go.Bar(x=race_prevalence.index, y=race_prevalence.values, marker_color='violet')],
            'layout': {
                'title': 'Distribution by Race',
                'xaxis': {'title': 'Race/Ethnicity'},
                'yaxis': {'title': 'Distribution'}
            }
        }
    )
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)

Each bar represents a specific racial or ethnic category, and the height of the bar indicates the count or distribution of diseases within that group. The chart provides a visual comparison, allowing for the identification of patterns or disparities in disease occurrence among various racial and ethnic categories.

What are the patterns and trends regarding cancer over the years?

In [None]:
#Scatterplot showing the relationship between years and cancer data values

# Filter for the 'Cancer' topic
cancer_df = pd_df[pd_df['topic'] == 'Cancer']

# Create Dash application
app = dash.Dash(__name__)

# Layout of the Dash app with the scatter plot for Cancer topic
app.layout = html.Div([
    html.H1('Relationship between Years and Cancer'),

    dcc.Graph(
        id='scatter-plot-cancer',
        # y is set explicitly; without it Plotly Express plots the row index
        figure=px.scatter(cancer_df, x='yearstart', y='datavalue',
                          hover_data=['locationdesc', 'question'],
                          title='Relationship between Years and Cancer',
                          labels={'yearstart': 'Year', 'datavalue': 'Data Value',
                                  'locationdesc': 'Location', 'question': 'Question'},
                          trendline='ols', trendline_color_override='red')
    )
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)

Each point on the scatterplot represents one cancer record in the dataset, with the x-axis showing the start year and the y-axis showing the reported data value; location and question appear on hover. The red trendline is an ordinary least squares fit that gives an overview of the potential trend over the years. Because the cancer records may mix several data value types (see the `datavaluetype` listing above), the trend is easier to interpret after filtering to a single type.
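
To make the idea of a trendline concrete, here is a minimal sketch of an ordinary least squares line fit with NumPy. The yearly values are invented for illustration and are not taken from the dataset; centering the years is a design choice that keeps the fit numerically well behaved.

```python
import numpy as np

# Made-up yearly values (not from the CDI data), one value per year.
years = np.array([2010, 2011, 2012, 2013, 2014], dtype=float)
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Center the years so the intercept is the fitted value at the mean year.
centered = years - years.mean()

# A degree-1 polyfit is an ordinary least squares line:
# values ~ slope * centered + intercept.
slope, intercept = np.polyfit(centered, values, deg=1)
print(f"slope per year: {slope:.2f}, value at mean year: {intercept:.2f}")
```

A positive slope means the fitted value rises with each year; its size only has meaning when all rows share the same unit.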

What is the geographic distribution of obesity in the US?

In [None]:

#Heat Map for Obesity among adults aged >= 18 years Rate by Location and Year
import pandas as pd
import plotly.graph_objects as go
import dash
from dash import dcc, html

# Filter the dataset based on specified criteria.
# The 'Overall' stratum and a single data value type keep one value per location/year cell
# (assumption: this question is reported as 'Crude Prevalence' in the dataset).
filtered_df = pd_df[
    (pd_df['topic'] == 'Nutrition, Physical Activity, and Weight Status') &
    (pd_df['question'] == 'Obesity among adults aged >= 18 years') &
    (pd_df['stratification1'] == 'Overall') &
    (pd_df['datavaluetype'] == 'Crude Prevalence')
]

# Create a Heatmap
heatmap_fig = go.Figure(data=go.Heatmap(
    x=filtered_df['locationdesc'],
    y=filtered_df['yearstart'],
    z=filtered_df['datavalue'],
    colorscale='Viridis',
    colorbar=dict(title='Obesity Rate'),
))

# Layout Settings
heatmap_fig.update_layout(
    title=dict(
        text='<b>Obesity among adults aged >= 18 years Rate by Location and Year</b>',
        x=0.5,  # Center the title
        y=0.9,  # Adjust the title position
        xanchor='center',
        yanchor='top',
        font=dict(color='red', size=14)  # Set the title color and size
    ),
    xaxis_title=dict(
        text='<b>Location</b>',
        font=dict(color='blue', size=12)  # Set x-axis label color and size
    ),
    yaxis_title=dict(
        text='<b>Year</b>',
        font=dict(color='green', size=12)  # Set y-axis label color and size
    )
)

# Initialize the Dash app
app = dash.Dash(__name__)

# Define the app layout
app.layout = html.Div([
    html.H1('Obesity Heatmap'),
    dcc.Graph(figure=heatmap_fig)
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)

Each cell in the map represents the obesity rate for a specific combination of location and year, with color intensity indicating the rate.
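
A heatmap of this kind is easiest to reason about as a year-by-location grid. As a rough sketch, `pivot_table` can build that grid; the rows below are invented, and averaging with `aggfunc="mean"` is an assumption about how duplicate location/year rows should be combined.

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "locationdesc": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "yearstart": [2011, 2012, 2011, 2012],
    "datavalue": [32.0, 33.0, 27.5, 25.5],
})

# Rows become years, columns become locations, cells hold the (mean) data value.
grid = toy.pivot_table(index="yearstart", columns="locationdesc",
                       values="datavalue", aggfunc="mean")
print(grid)
```

The resulting `grid.values`, `grid.columns` and `grid.index` map directly onto the `z`, `x` and `y` arguments of a heatmap trace.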

In [None]:
#Viz - Bubble plot (Diabetes)

# Filtering the dataset based on specified criteria;
# sorting by year keeps the animation frames in chronological order
diabetes_df = pd_df[
    (pd_df['topic'] == 'Diabetes') &
    (pd_df['stratification1'] == 'Female') &
    (pd_df['datavaluetype'] == 'Crude Prevalence')
].sort_values('yearstart')

#Dash application
app = dash.Dash(__name__)

# Layout of the Dash app
app.layout = html.Div([
    html.H1('Bubble Chart for Diabetes Rate Over the Years in Females by Location'),

    dcc.Graph(
        id='diabetes-bubble-chart',
        figure=px.scatter(
            diabetes_df,
            x='locationdesc',
            y='yearstart',
            size='datavalue',
            color='datavalue',
            hover_name='locationdesc',
            animation_frame='yearstart',
            animation_group='locationdesc',
            title='Diabetes Rate Over the Years in Females by Location',
            labels={'datavalue': 'Diabetes Rate'},
            size_max=50,
            color_continuous_scale='Viridis',
            category_orders={'locationdesc': sorted(diabetes_df['locationdesc'].unique())},
            range_y=[diabetes_df['yearstart'].min(), diabetes_df['yearstart'].max()],  # Set the y-axis range
        ).update_layout(
            margin=dict(l=150, r=150),  # Increase the left and right margins
            xaxis=dict(categoryorder='total ascending')  # Adjust the category order
        )
    )
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)

**Animated box plot for selected questions**

In [None]:

# Dash and Plotly were installed at the top of the notebook, so no reinstall is needed
import plotly.express as px
import dash
from dash import dcc, html  # replaces the deprecated dash_core_components / dash_html_components packages
from dash.dependencies import Input, Output
import pandas as pd


# Load the dataset
url = 'https://storage.googleapis.com/mbcc/datasets/us_chronic_disease_indicators.csv'
df = pd.read_csv(url)

questions = ['Soda consumption among high school students',
             'Obesity among high school students',
             'Overweight or obesity among high school students']

# Filter the dataset for the specified questions and exclude certain datavalueunits
filtered_df = df[(df['question'].isin(questions)) & ~df['datavalueunit'].isin(['cases per 100,000', 'Number'])]

# Create an animated box plot using Plotly Express
fig = px.box(filtered_df, x='question', y='datavalue', color='locationabbr',
             animation_frame='yearstart', animation_group='locationabbr',
             title='Animated Box Plot for Selected Questions',
             labels={'datavalue': 'Data Value', 'locationabbr': 'Location'},
             category_orders={'locationabbr': sorted(filtered_df['locationabbr'].unique())},  # Specify the order of locations
             )

# Customize hover information (x holds the question, y holds the data value)
fig.update_traces(hovertemplate='Question: %{x}<br>Data Value: %{y}')

# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    dcc.Graph(id='animated-box-plot', figure=fig),
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)

In [None]:
#Distribution of Diseases per Year
# Create Dash application
app = dash.Dash(__name__)

# Count records per year and topic so the chart has an explicit y value
topic_counts = pd_df.groupby(['yearstart', 'topic']).size().reset_index(name='count')

# Create a stacked area chart showing distribution of diseases for each year
fig = px.area(topic_counts, x='yearstart', y='count', color='topic',
              labels={"yearstart": "Year", "topic": "Diseases", "count": "Count"},
              title="Distribution of Diseases Over Years", template="plotly")

# Layout of the Dash app
app.layout = html.Div([
    html.H1('Distribution of Diseases Over Years'),
    dcc.Graph(figure=fig)
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)


**Scatter plot for: Interactive Mental Health**

In [None]:

import numpy as np
import pandas as pd
import plotly.express as px
import dash
from dash import dcc, html  # replaces the deprecated dash_core_components / dash_html_components packages
from dash.dependencies import Input, Output
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error


# Load your dataset
url = 'https://storage.googleapis.com/mbcc/datasets/us_chronic_disease_indicators.csv'
df = pd.read_csv(url)

# Filter the dataset for mental health and recent mentally unhealthy days among adults aged >= 18 years
filtered_df = df[(df['topic'] == 'Mental Health') &
                 (df['question'] == 'Recent mentally unhealthy days among adults aged >= 18 years') &
                 (df['datavaluetype'] == 'Mean')].copy()  # Make a copy to avoid SettingWithCopyWarning

# Create a new column 'index' as a placeholder for the x-axis
filtered_df['index'] = range(len(filtered_df))

# Choose a gender (Overall, Female, Male)
selected_gender = 'Overall'

# Filter the DataFrame based on the selected gender
filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

# Perform polynomial regression
X = filtered_data[['yearstart']]
y = filtered_data['datavalue']

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.1, random_state=42)

# Check if there are enough data points for linear regression
if len(X_train) > 1:
    # Model fitting and prediction
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Cross-validation predictions
    y_pred_cv = cross_val_predict(model, X_poly, y, cv=5)

    # Calculate mean squared error
    mse_cv = mean_squared_error(y, y_pred_cv)

    # Dash app initialization
    app = dash.Dash(__name__)

    # Layout of the app
    app.layout = html.Div(children=[
        html.H1("Interactive Mental Health Scatter Plot"),

        dcc.Graph(id='scatter-plot'),

        dcc.Dropdown(
            id='gender-dropdown',
            options=[
                {'label': 'Overall', 'value': 'Overall'},
                {'label': 'Female', 'value': 'Female'},
                {'label': 'Male', 'value': 'Male'},
            ],
            value='Overall',  # Set the default value to 'Overall'
            style={'width': '50%'}
        )
    ])

    # Callback to update scatter plot based on the selected gender
    @app.callback(
        Output('scatter-plot', 'figure'),
        [Input('gender-dropdown', 'value')]
    )
    def update_scatter_plot(selected_gender):
        # Filter the DataFrame based on the selected gender
        filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

        # Create scatter plot with regression line
        fig = px.scatter(
            filtered_data,
            x='yearstart',
            y='datavalue',
            color='locationdesc',  # Color by location
            hover_name='locationdesc',
            labels={'datavalue': 'Recent Mentally Unhealthy Days', 'yearstart': 'Year'},
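            # Plotly titles use HTML, so <br> (not \n) produces the line break
            title=f'Scatter Plot: Mental Health Prediction ({selected_gender})<br>Mean Squared Error (CV): {mse_cv:.2f}'
        )

        # Add polynomial regression line to the plot.
        # Note: the model and the CV error were fit once on the 'Overall' stratum above,
        # so the same line is drawn for every dropdown selection.
        x_range = np.linspace(X['yearstart'].min(), X['yearstart'].max(), 100).reshape(-1, 1)
        x_range_poly = poly.transform(x_range)
        y_range_pred = model.predict(x_range_poly)
        fig.add_scatter(x=x_range.flatten(), y=y_range_pred, mode='lines', name='Polynomial Regression Line')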

        return fig

    # Run the app
    if __name__ == '__main__':
        app.run_server(debug=True)
else:
    print("Insufficient data points for linear regression.")
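
The cell above combines a degree-2 polynomial fit with 5-fold cross-validation. As a rough, NumPy-only sketch of that idea, the example below fits a quadratic on four folds and scores it on the held-out fold. The data is synthetic, and shuffling the rows before splitting is a choice made here; it is not what `cross_val_predict` does with an integer `cv`, which splits the rows in their stored order.

```python
import numpy as np

# Synthetic data: a known quadratic plus a little noise (not CDI data).
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 0.5 * x + 0.25 * x**2 + rng.normal(scale=0.1, size=x.size)

# Shuffle the row indices once, then split them into 5 folds.
indices = rng.permutation(x.size)
folds = np.array_split(indices, 5)

squared_errors = []
for held_out in folds:
    # Train on every row that is not in the held-out fold.
    train = np.setdiff1d(indices, held_out)
    coeffs = np.polyfit(x[train], y[train], deg=2)
    # Score the fitted quadratic on the held-out rows only.
    predictions = np.polyval(coeffs, x[held_out])
    squared_errors.extend((y[held_out] - predictions) ** 2)

cv_mse = float(np.mean(squared_errors))
print(f"cross-validated MSE: {cv_mse:.4f}")
```

Because every row is predicted by a model that never saw it, `cv_mse` estimates out-of-sample error rather than goodness of fit on the training rows.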

**Bar Graph: Incidence of treated end-stage renal disease vs Incidence attributed to diabetes**

In [None]:

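#Bar Graph: Incidence of treated end-stage renal disease vs Incidence attributed to diabetes
import pandas as pd
import plotly.express as px
import dash
from dash import dcc, html  # replaces the deprecated dash_html_components package
from dash.dependencies import Input, Output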

# Load your dataset
url = 'https://storage.googleapis.com/mbcc/datasets/us_chronic_disease_indicators.csv'
df = pd.read_csv(url)

# Convert 'yearstart' to numeric during data loading
df['yearstart'] = pd.to_numeric(df['yearstart'], errors='coerce')

# Filter the data for Chronic Kidney Disease
filtered_df = df[df['topic'] == 'Chronic Kidney Disease']

# Filter data for the specified questions
questions = ['Incidence of treated end-stage renal disease',
             'Incidence of treated end-stage renal disease attributed to diabetes']
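# .copy() gives an independent frame, so the in-place scaling below does not trigger SettingWithCopyWarning
filtered_questions_df = filtered_df[filtered_df['question'].isin(questions)].copy()

# Set 'datavalue' to cases per 1,000,000
filtered_questions_df['datavalue'] *= 1000000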

# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    html.H1(
        'Incidence of treated end-stage renal disease vs. Incidence attributed to diabetes ',
        style={'textAlign': 'center', 'color': 'purple'}
    ),
    html.Label('Select Location:'),
    dcc.Dropdown(
        id='location-dropdown',
        options=[{'label': location, 'value': location} for location in filtered_questions_df['locationdesc'].unique()],
        value='United States',  # Set the initial value to the United States
    ),
    dcc.Graph(id='bar-chart'),
    html.Div(id='correlation-output'),
    html.Div(id='percentage-output')
])

# Callback to update the bar chart, correlation, and percentage based on user input
@app.callback(
    [Output('bar-chart', 'figure'),
     Output('correlation-output', 'children'),
     Output('percentage-output', 'children')],
    [Input('location-dropdown', 'value')]
)
def update_bar_chart(selected_location):
    # Filter data based on the selected location
    location_data = filtered_questions_df[filtered_questions_df['locationdesc'] == selected_location].copy()  # copy so the column assignment below is safe

    # Check if the filtered data is empty
    if location_data.empty:
        fig = px.bar(title=f'No Data for {selected_location}')
        correlation_output = ''
        percentage_output = ''
    else:
        # Convert 'yearstart' to numeric to avoid callback errors
        location_data['yearstart'] = pd.to_numeric(location_data['yearstart'], errors='coerce')

        # Create a bar chart
        fig = px.bar(
            location_data,
            x='yearstart',
            y='datavalue',
            color='question',
            labels={'datavalue': 'Cases per 1,000,000'},
            height=600,
        )

        # Update the layout for better visualization
        fig.update_layout(
            xaxis_title='Year',
            yaxis_title='Cases per 1,000,000',
            legend_title='Question',
            height=600,
            margin=dict(l=0, r=0, b=0, t=40),
            template='plotly_dark'
        )

        # Set tick positions and labels on the x-axis
        fig.update_xaxes(tickvals=location_data['yearstart'].unique(), ticktext=location_data['yearstart'].unique())

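        # Calculate correlation between the two questions, matched by year.
        # Pivot so each question becomes its own column; correlating 'datavalue' with itself would always give 1.
        pivoted = location_data.pivot_table(index='yearstart', columns='question', values='datavalue', aggfunc='mean').dropna()
        if pivoted.shape[1] == 2 and len(pivoted) > 1:
            correlation = pivoted.iloc[:, 0].corr(pivoted.iloc[:, 1])
            correlation_output = f'Correlation between the two questions: {correlation:.2f}'
        else:
            correlation_output = 'Not enough data to calculate the correlation.'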

        # Calculate the percentage of 'Incidence of treated end-stage renal disease attributed to diabetes'
        # with respect to the total 'Incidence of treated end-stage renal disease'
        total_incidence = location_data[location_data['question'] == 'Incidence of treated end-stage renal disease']['datavalue'].sum()
        total_diabetes_incidence = location_data[location_data['question'] == 'Incidence of treated end-stage renal disease attributed to diabetes']['datavalue'].sum()

        # Check for potential division by zero
        if total_incidence != 0:
            percentage = (total_diabetes_incidence / total_incidence) * 100
            percentage_output = f'Percentage of Incidence attributed to diabetes: {percentage:.2f}%'
        else:
            percentage_output = 'Total Incidence is zero, cannot calculate percentage.'

        print(percentage_output)

    return fig, correlation_output, percentage_output


# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)
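
The correlation shown in the dashboard above compares the two questions year by year. A minimal sketch of that calculation outside Dash follows, assuming the data is in the same long format (one row per year and question). The `toy` DataFrame, its values, and the helper name `question_correlation` are hypothetical and only illustrate the reshaping step.

```python
import pandas as pd

# Hypothetical toy data in the same long format as the notebook's DataFrame
toy = pd.DataFrame({
    'yearstart': [2010, 2011, 2012, 2013] * 2,
    'question': ['Incidence of treated end-stage renal disease'] * 4
                + ['Incidence of treated end-stage renal disease attributed to diabetes'] * 4,
    'datavalue': [350.0, 360.0, 355.0, 370.0, 150.0, 158.0, 154.0, 165.0],
})

def question_correlation(long_df):
    """Pearson correlation between the two questions, matched by year."""
    # One row per year, one column per question
    wide = long_df.pivot_table(index='yearstart', columns='question',
                               values='datavalue', aggfunc='mean').dropna()
    if wide.shape[1] != 2 or len(wide) < 2:
        return None  # not enough paired observations
    return wide.iloc[:, 0].corr(wide.iloc[:, 1])

r = question_correlation(toy)
print(f'Correlation: {r:.2f}')
```

Returning `None` when fewer than two paired years exist lets the caller show a message instead of a misleading number.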

**Bar Chart for Topic entries over years by location**

In [None]:
#Bar Chart for Topic entries over years by location

# Create a Dash app

app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([

    html.H1("Topic Entries Over Years by Location"),
    html.Label('Select Location:'),

    dcc.Dropdown(
        id='location-dropdown',
        options=[{'label': location, 'value': location} for location in df['locationdesc'].unique()],
        value='Michigan'  # Default selected location
    ),

    dcc.Graph(id='bar-chart'),
])

# Define callback to update bar chart based on selected location
@app.callback(
    Output('bar-chart', 'figure'),
    [Input('location-dropdown', 'value')]
)
def update_bar_chart(selected_location):
    filtered_df = df[df['locationdesc'] == selected_location]

    # Group by year and count entries for each topic
    topic_entries = filtered_df.groupby(['yearstart', 'topic']).size().reset_index(name='entry_count')

    # Create a bar chart using Plotly Express with one discrete color per topic
    # ('topic' is categorical, so a continuous color scale would have no effect)
    fig = px.bar(
        topic_entries,
        x='yearstart',
        y='entry_count',
        color='topic',
        labels={'entry_count': 'Entry Count', 'yearstart': 'Year'},
        title=f'Topic Entries Over Years for {selected_location}',
    )

    return fig

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)
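
The grouping step inside the callback above can be checked on its own before wiring it into Dash. The sketch below assumes the same column names as the dataset (`locationdesc`, `yearstart`, `topic`). The `toy` rows and the helper name `topic_entry_counts` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    'locationdesc': ['Michigan'] * 5 + ['Ohio'],
    'yearstart': [2010, 2010, 2010, 2011, 2011, 2010],
    'topic': ['Alcohol', 'Alcohol', 'Cancer', 'Cancer', 'Diabetes', 'Alcohol'],
})

def topic_entry_counts(df, location):
    """Count dataset rows per (year, topic) for one location."""
    subset = df[df['locationdesc'] == location]
    return (subset.groupby(['yearstart', 'topic'])
                  .size()
                  .reset_index(name='entry_count'))

counts = topic_entry_counts(toy, 'Michigan')
print(counts)
```

The result has one row per year and topic, which is the long format `px.bar` expects when `color='topic'` is used.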

**What is the prevalence of each disease over the years?**

In [None]:

#What is the prevalence of each disease over the years?

# Create a new column 'year_range' by combining 'yearstart' and 'yearend'
df['year_range'] = df['yearstart'].astype(str) + '-' + df['yearend'].astype(str)

# Split 'year_range' into 'year_start' and 'year_end'
df[['year_start', 'year_end']] = df['year_range'].str.split('-', expand=True)

# Group by 'year_start' and count the occurrences of each disease
grouped_df = df.groupby(['year_start', 'topic'])['topic'].count().unstack().fillna(0).reset_index()

# Create Dash app
app = dash.Dash(__name__)

# Define app layout
app.layout = html.Div([
    dcc.Graph(
        id='trends-chart',
        figure={}
    ),
])

# Callback to update the chart based on user interaction
@app.callback(
    Output('trends-chart', 'figure'),
    [Input('trends-chart', 'relayoutData')]
)
def update_chart(relayout_data):
    try:
        # Plot the trends using a line chart
        fig = px.line(grouped_df, x='year_start', y=grouped_df.columns[1:], title="Trends of Diseases Over the Years",
                      labels={'value': 'Number of Occurrences', 'year_start': 'Year Start'},
                      template="plotly_dark", width=1200, height=600)

        return fig

    except Exception as e:
        print(f"Error: {str(e)}")
        return px.scatter(title=f"Error: {str(e)}", template="plotly_dark", width=1200, height=600)

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)
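
The wide table that feeds the line chart above counts dataset rows per topic and year, which is a measure of how many entries exist rather than of disease prevalence itself. A small sketch of that reshaping follows, with `pd.crosstab` as an equivalent shortcut and a legend order based on total counts. The `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    'yearstart': [2010, 2010, 2011, 2011, 2011],
    'topic': ['Alcohol', 'Cancer', 'Cancer', 'Cancer', 'Diabetes'],
})

# Same pattern as the notebook: count rows, then spread topics into columns
wide = (toy.groupby(['yearstart', 'topic'])['topic']
           .count()
           .unstack()
           .fillna(0))

# pd.crosstab builds the same table in one call, with zeros already filled in
alt = pd.crosstab(toy['yearstart'], toy['topic'])

# Topics ordered from most to fewest entries, e.g. for sorting a legend
legend_order = wide.sum(axis=0).sort_values(ascending=False).index.tolist()
print(wide)
print(legend_order)
```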


**Which states have the highest and lowest prevalence of binge drinking among women aged 18-44?**

In [None]:
#Geospatial Map: Binge drinking prevalence among women aged 18-44 years across U.S. States
# Specify the question and topic
question = 'Binge drinking prevalence among women aged 18-44 years'
topic = 'Alcohol'

# Filter the dataset for the specified question and topic
filtered_df = df[(df['topic'] == topic) & (df['question'] == question)]

# Find the highest prevalence state and its value
max_row = filtered_df.loc[filtered_df['datavalue'].idxmax()]
highest_state = max_row['locationabbr']
highest_value = max_row['datavalue']

# Create a Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    dcc.Graph(
        id='choropleth-map',
        figure=px.choropleth(
            filtered_df,
            locations='locationabbr',
            locationmode='USA-states',
            color='datavalue',
            animation_frame='yearstart',
            color_continuous_scale='Viridis',
            title=f'{question} Across U.S. States',
            labels={'datavalue': 'Prevalence'},
            scope='usa',
        ).update_geos(projection_type='albers usa')
    ),
    html.P(f'Highest prevalence in {highest_state} with a value of {highest_value:.2f}'),
])

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)


Clusters of high or low prevalence on the map indicate potential hotspots or coldspots. These regional variations point to possible public health disparities and can help public health officials, researchers, and policymakers target interventions to reduce binge drinking among women aged 18-44.
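
To answer the highest-and-lowest question numerically rather than by eye, the filtered data can be averaged per state and sorted. This is a rough sketch under the assumption that averaging across years is an acceptable summary. The `toy` values are invented and are not taken from the dataset.

```python
import pandas as pd

# Invented prevalence values; only the column names mirror the real dataset
toy = pd.DataFrame({
    'locationabbr': ['WI', 'WI', 'UT', 'UT', 'MI', 'MI'],
    'yearstart': [2012, 2013, 2012, 2013, 2012, 2013],
    'datavalue': [24.0, 26.0, 9.0, 11.0, 18.0, 20.0],
})

# Mean prevalence per state across all available years, highest first
mean_by_state = (toy.dropna(subset=['datavalue'])
                    .groupby('locationabbr')['datavalue']
                    .mean()
                    .sort_values(ascending=False))

highest_state = mean_by_state.index[0]
lowest_state = mean_by_state.index[-1]
print(f'Highest: {highest_state} ({mean_by_state.iloc[0]:.1f})')
print(f'Lowest: {lowest_state} ({mean_by_state.iloc[-1]:.1f})')
```

With the real data, the same steps would be applied to `filtered_df`, ideally after restricting it to a single `datavaluetype` and stratification so that the averaged values are comparable.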

**Bar chart showing count of diseases over time**


In [None]:
"""# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    html.H1("Disease Variation Over Years"),

    dcc.Dropdown(
        id='disease-dropdown',
        options=[{'label': disease, 'value': disease} for disease in pd_df['topic'].unique()],
        value=pd_df['topic'].unique()[0],
        multi=False,
        style={'width': '50%'}
    ),

    dcc.Checklist(
        id='gender-checklist',
        options=[{'label': gender, 'value': gender} for gender in pd_df['stratification1'].unique()],
        value=pd_df['stratification1'].unique(),
        inline=True
    ),

    dcc.Graph(id='disease-variation-plot'),
])

# Define callback to update the plot based on user input
@app.callback(
    Output('disease-variation-plot', 'figure'),
    [Input('disease-dropdown', 'value'),
     Input('gender-checklist', 'value')]
)
def update_plot(selected_disease, selected_genders):
    filtered_df = pd_df[(pd_df['topic'] == selected_disease) & (pd_df['stratification1'].isin(selected_genders))]

    grouped_df = filtered_df.groupby(['stratification1', 'yearstart']).size().reset_index(name='Count')

    fig = px.bar(
        grouped_df,
        x='yearstart',
        y='Count',
        color='stratification1',
        barmode='group',  # Set barmode to 'group' for grouped bar chart
        labels={'yearstart': 'Year', 'Count': 'Count'},
        title=f'{selected_disease} Variation Over Years',
        template='plotly_dark',  # Set the template to 'plotly_dark'
    )

    return fig

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)"""

###MODULES AND DASHBOARD

In [19]:
import base64
from io import BytesIO

import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import tkinter as tk
from wordcloud import WordCloud

In [22]:
#FINAL DASHBOARD
text_to_generate_wordcloud = ' '.join(df['topic'].dropna().astype(str))  # skip missing topics so join() never sees NaN

# Create a WordCloud object with specified parameters
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_to_generate_wordcloud)

# Convert Word Cloud image to base64 for Dash
wordcloud_image_bytes = BytesIO()
wordcloud.to_image().save(wordcloud_image_bytes, format='PNG')
wordcloud_base64 = base64.b64encode(wordcloud_image_bytes.getvalue()).decode('utf-8')

# Create a new column 'year_range' by combining 'yearstart' and 'yearend'
df['year_range'] = df['yearstart'].astype(str) + '-' + df['yearend'].astype(str)

# Filter data for the 'Cancer' topic, where 'datavaluetype' is 'Average Annual Number', and 'stratification1' is 'Overall'
cancer_df = df[(df['topic'] == 'Cancer') & (df['datavaluetype'] == 'Average Annual Number') & (df['stratification1'] == 'Overall')]

# Filter the dataset based on specified criteria
filtered_df7 = df[
    (df['topic'] == 'Nutrition, Physical Activity, and Weight Status') &
    (df['question'] == 'Obesity among adults aged >= 18 years')
]

#DASH 4 FILTERS
# Filter the data for the specified conditions (considering cancer and US)
filtered_df = df[(df['topic'] == 'Cancer') & (df['locationabbr'] == 'US')]

# Define the allowed values for 'stratification1'
allowed_stratifications = ['Overall', 'Male', 'Female']

# Get unique values for questions
questions = sorted(filtered_df['question'].unique())
locations = sorted(filtered_df['locationabbr'].unique())

#DASH 5 FILTERS
# Filter data for Alcohol and Crude Prevalence
alchohol_df = df[(df['topic'] == 'Alcohol') & (df['datavaluetype'] == 'Crude Prevalence')]

# Create a dropdown for selecting the year
year_options = [{'label': str(year), 'value': year} for year in alchohol_df['yearstart'].unique()]

# Create a dropdown for selecting the location
location_options = [{'label': location, 'value': location} for location in alchohol_df['locationdesc'].unique()]

#DASH 6 FILTERS
# Filter data for the 'Cardiovascular Disease' topic and where 'datavaluetype' is 'Number'
cardiovascular_df = df[(df['topic'] == 'Cardiovascular Disease') & (df['datavaluetype'] == 'Number')]

# Create a sunburst chart
fig_sunburst = px.sunburst(
    cardiovascular_df,
    path=['question', 'yearstart'],
    values='datavalue',
    labels={'datavalue': 'Data Value'},
    height=600,
    width=800,
    color='datavalue',
    color_continuous_scale='Viridis'
)

# Modify the layout
fig_sunburst.update_layout(
    title_font=dict(color='purple'),  # Set title color to purple
    template='plotly_dark',  # Set the template to Plotly Dark
)

#DASH 7 FILTERS
# Filter data specifically for alcohol-related indicators (Binge drinking prevalence) and cardiovascular disease-related indicators (High cholesterol prevalence) among adults aged 18 years and older.
alcohol2_df = df[(df['topic'] == 'Alcohol') & (df['question'] == 'Binge drinking prevalence among adults aged >= 18 years') & (df['datavaluetype'] == 'Crude Prevalence') & (df['stratification1'] == 'Overall')]
cvd_df = df[(df['topic'] == 'Cardiovascular Disease') & (df['question'] == 'High cholesterol prevalence among adults aged >= 18 years') & (df['datavaluetype'] == 'Crude Prevalence') & (df['stratification1'] == 'Overall')]
combined_df = pd.concat([alcohol2_df, cvd_df])


#DASH 8 FILTERS
# Filter the dataset based on specified criteria
diabetes_df = df[
    (df['topic'] == 'Diabetes') &
    (df['stratification1'] == 'Female') &
    (df['datavaluetype'] == 'Crude Prevalence')]

#DASH 9 FILTERS
# Filter data for Adults with diagnosed diabetes aged >= 18 years who have taken a diabetes self-management course
adults_df = df[
    (df['question'] == 'Adults with diagnosed diabetes aged >= 18 years who have taken a diabetes self-management course') &
    (df['datavaluetype'] == 'Crude Prevalence')]

# Create a Heatmap
heatmap_fig = go.Figure(data=go.Heatmap(
    x=filtered_df7['locationdesc'],
    y=filtered_df7['yearstart'],
    z=filtered_df7['datavalue'],
    colorscale='Viridis',
    colorbar=dict(title='Obesity Rate'),
))
#DASH 10 FILTERS
# Filter data for Heavy drinking among adults aged >= 18 years with 'Overall' stratification1
heavy_drinking_adults = df[(df['topic'] == 'Alcohol') &
                           (df['question'] == 'Heavy drinking among adults aged >= 18 years') &
                           (df['datavaluetype'] == 'Crude Prevalence') &
                           (df['stratification1'] == 'Overall')]

# Filter data for Heavy drinking among women aged 18-44 years with 'Overall' stratification1
heavy_drinking_women = df[(df['topic'] == 'Alcohol') &
                          (df['question'] == 'Heavy drinking among women aged 18-44 years') &
                          (df['datavaluetype'] == 'Crude Prevalence') &
                          (df['stratification1'] == 'Overall')]

# Merge the two datasets on the common columns
merged_df1 = pd.merge(heavy_drinking_adults, heavy_drinking_women, on=['locationdesc', 'yearstart', 'locationabbr'], suffixes=('_total', '_women'))

# Calculate the percentage of heavy drinking among women aged 18-44 years out of heavy drinking among adults aged >= 18 years
merged_df1['percentage'] = (merged_df1['datavalue_women'] / merged_df1['datavalue_total']) * 100

# Layout Settings
heatmap_fig.update_layout(
    title=dict(
        text='<b>Obesity among adults aged >= 18 years Rate by Location and Year</b>',
        x=0.5,  # Center the title
        y=0.9,  # Adjust the title position
        xanchor='center',
        yanchor='top',
        font=dict(color='purple', size=14)  # Set the title color to purple and size
    ),
    xaxis_title=dict(
        text='<b>Location</b>',
        font=dict(color='blue', size=12)  # Set x-axis label color and size
    ),
    yaxis_title=dict(
        text='<b>Year</b>',
        font=dict(color='green', size=12)  # Set y-axis label color and size
    ),
)

#DASH 11 FILTERS
# Specify the questions
questions = ['Soda consumption among high school students',
             'Obesity among high school students',
             'Overweight or obesity among high school students']

# Filter the dataset for the specified questions and exclude certain datavalueunits
filtered_df3 = df[(df['question'].isin(questions)) & ~df['datavalueunit'].isin(['cases per 100,000', 'Number'])]

# Create an animated box plot using Plotly Express with the dark template
fig = px.box(filtered_df3, x='question', y='datavalue', color='locationabbr',
             animation_frame='yearstart', animation_group='locationabbr',
             labels={'datavalue': 'Data Value', 'locationabbr': 'Location'},
             category_orders={'question': questions},  # Specify the order of questions
             height=600,  # Increase the height of the box plot
             template='plotly_dark'  # Use the dark template
             )

# Customize hover information (x holds the question, y holds the data value)
fig.update_traces(hovertemplate='Question: %{x}<br>Data Value: %{y}')

#DASH 12 FILTERS
# Specify the questions
questions = ['Soda consumption among high school students',
             'Overweight or obesity among high school students']

# Filter the dataset for the specified questions and exclude certain datavalueunits
specq_df = df[(df['question'].isin(questions)) & ~df['datavalueunit'].isin(['cases per 100,000', 'Number'])]

# Remove 'yearstart' and 'yearend' columns from the DataFrame
specq_df = specq_df.drop(['yearstart', 'yearend'], axis=1)

# Get unique questions for labels
unique_questions = specq_df['question'].unique()

# Get unique locations for dropdown options
locations = specq_df['locationdesc'].unique()

# Define a dictionary to map each location to a specific color scale
color_scales = {
    'Location 1': 'Viridis',
    'Location 2': 'Reds',
    # Add more locations and color scales as needed
}


# Create a stacked bar chart
fig = px.bar(
    cancer_df,
    x='yearstart',
    y='datavalue',
    color='question',
    labels={'datavalue': 'Average Annual Number', 'yearstart': 'Year'},
    height=600,
    width=800,
    category_orders={"question": cancer_df['question'].unique()},
    color_discrete_sequence=px.colors.qualitative.Set1
)

# Modify the layout: horizontal legend above the plot, with extra top margin for the title
fig.update_layout(
    barmode='stack',
    legend=dict(orientation='h', yanchor='top', y=1.02, xanchor='right', x=1),
    xaxis=dict(tickmode='array', tickvals=cancer_df['yearstart'].unique(), ticktext=cancer_df['yearstart'].unique()),
    margin=dict(t=100),  # Adjust the top margin for space between title and plot
    template='plotly_dark'  # Set the template to Plotly Dark
)

# Create a list of location options for the dropdown
location_options = [{'label': location, 'value': location} for location in cancer_df['locationdesc'].unique()]

# Split 'year_range' into 'year_start' and 'year_end'
df[['year_start', 'year_end']] = df['year_range'].str.split('-', expand=True)

# Group by 'year_start' and count the occurrences of each disease
grouped_df = df.groupby(['year_start', 'topic'])['topic'].count().unstack().fillna(0).reset_index()

# Get the order of legends based on the sum of occurrences
legend_order = (
    grouped_df.loc[:, grouped_df.columns != 'year_start']
    .apply(pd.to_numeric, errors='coerce')  # Convert non-numeric columns to numeric
    .sum(axis=0)
    .sort_values(ascending=False)
    .index
)

# Convert 'yearstart' to numeric (non-numeric values become NaN)
df['yearstart'] = pd.to_numeric(df['yearstart'], errors='coerce')

# Filter the data for Chronic Kidney Disease
filtered_df8 = df[df['topic'] == 'Chronic Kidney Disease']

# Filter data for the specified questions and set stratification to 'Overall'
questions = ['Incidence of treated end-stage renal disease',
             'Incidence of treated end-stage renal disease attributed to diabetes']
filtered_questions_df = filtered_df8[(filtered_df8['question'].isin(questions)) &
                                     (filtered_df8['stratification1'] == 'Overall') &
                                     (filtered_df8['stratificationcategory1'] == 'Overall')].copy()  # copy so the scaling below does not touch a view

# Set 'datavalue' to cases per 1,000,000
filtered_questions_df['datavalue'] *= 1000000

# Aggregate data to avoid multiple values for the same year
aggregated_data = filtered_questions_df.groupby(['locationdesc', 'yearstart', 'question']).agg({'datavalue': 'sum'}).reset_index()
# Model filters
# Filter the dataset for mental health and recent mentally unhealthy days among adults aged >= 18 years
filtered_dfM = df[(df['topic'] == 'Mental Health') &
                 (df['question'] == 'Recent mentally unhealthy days among adults aged >= 18 years') &
                 (df['datavaluetype'] == 'Mean')].copy()  # Make a copy to avoid SettingWithCopyWarning

# Create a new column 'index' as a placeholder for the x-axis
filtered_dfM['index'] = range(len(filtered_dfM))

# Choose a gender (Overall, Female, Male)
selected_gender = 'Overall'

# Filter the DataFrame based on the selected gender
filtered_dataM = filtered_dfM[filtered_dfM['stratification1'] == selected_gender].copy()

# Perform polynomial regression
X = filtered_dataM[['yearstart']]
y = filtered_dataM['datavalue']

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.1, random_state=42)

# Check that there are enough data points for 5-fold cross-validation
if len(X_poly) >= 5:
    # Model fitting and prediction
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Cross-validation predictions
    y_pred_cv = cross_val_predict(model, X_poly, y, cv=5)

    # Calculate mean squared error
    mse_cv = mean_squared_error(y, y_pred_cv)

    # Calculate residuals and remove outliers
    residuals = y - y_pred_cv
    threshold = 2  # Adjust the threshold as needed
    outlier_mask = np.abs(residuals) < threshold
    filtered_dataM = filtered_dataM[outlier_mask]




# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    #first Section
    html.H1("What are the Most Frequently Occurring Topics in US Chronic Disease Indicators?",
            style={'text-align': 'center', 'color': 'purple'}),

    dcc.Graph(
        id='wordcloud-graph',
        figure={
            'data': [],
            'layout': go.Layout(
                images=[go.layout.Image(
                    source=f'data:image/png;base64,{wordcloud_base64}',
                    x=0,
                    y=1,
                    xref="paper",
                    yref="paper",
                    sizex=1,
                    sizey=1,
                    sizing="contain",
                    layer="below"
                )],
                xaxis={'visible': False},
                yaxis={'visible': False},
                template='plotly_dark'
            )
        }
    ),

    html.P(
        "The Word Cloud provides a visual representation of the most frequently occurring topics in the dataset. "
        "Larger words in the Word Cloud come from topics that appear more frequently in the 'topic' column of the dataset. "
        "This visualization aids in quickly identifying major themes and can guide further analysis and exploration of specific chronic disease topics.",
        style={'color': 'purple', 'text-align': 'center'}
    ),
    #Second Section
    html.H1("Trends of Diseases Over the Years", style={'text-align': 'center', 'margin-top': '20px', 'margin-bottom': '20px', 'color': 'purple', 'font-weight': 'bold'}),  # Title for the dashboard

    dcc.Graph(
        id='trends-chart',
        figure={},
        config={'displayModeBar': False}  # Hide the mode bar to prevent chart resizing
    ),

    html.Div([
        html.P("What Have Been the Trends of Diseases Over the Years?", style={'color': 'purple', 'font-weight': 'bold'}),
        html.Ul([
            html.Li("The chart visualizes trends of diseases over the years based on available data."),
            html.Li("Each line represents a disease, showing its occurrences across different years."),
            html.Li("Users can explore and analyze the patterns of diseases to identify potential trends or outbreaks."),
            html.Li("The legend order is based on the sum of occurrences, allowing easy identification of prevalent diseases."),
        ], style={'color': 'purple'})
    ], style={'text-align': 'left', 'margin': '20px'}),

    #third section
    html.H1(
        'How has the Average Annual Number of Cancer Cases changed over the years, and what variations exist across different locations and questions?',
        style={'text-align': 'left', 'font-weight': 'bold', 'color': 'purple'}
    ),

    dcc.Dropdown(
        id='location-dropdown-6',
        options=location_options,
        value=cancer_df['locationdesc'].iloc[0],
        style={'width': '50%', 'margin': 'auto'}
    ),
    dcc.Graph(
        id='bar-chart3',
        figure=fig
    ),

    html.P(
        "Yearly Trends: The stacked bar chart allows users to observe the yearly trends in the average annual number of cancer cases. "
        "The color-coded bars represent different questions, enabling quick identification of patterns and changes over time.",
        style={'color': 'purple', 'text-align': 'left'}
    ),
    html.P(
        "Location-based Comparison: Users can select a specific location using the dropdown menu, enabling a focused analysis of cancer cases in a particular area. "
        "This feature is valuable for comparing cancer trends across different regions and identifying geographical variations.",
        style={'color': 'purple', 'text-align': 'left'}
    ),
    html.P(
        "Question-specific Insights: By exploring different questions, users can gain insights into how specific aspects of cancer are changing over time. "
        "The color legend aids in distinguishing between different question categories within each year.",
        style={'color': 'purple', 'text-align': 'left'}
    ),

    html.H1("EXPLORATION OF HEALTH CONCERNS IN THE DATASET"),


    # 5th plot - DASH NO 5
    html.H1("How is the distribution of related concerns in Alcohol?", style={'text-align': 'center', 'color': 'purple'}),

    dcc.Dropdown(
        id='year-dropdown2',
        options=year_options,
        value=alchohol_df['yearstart'].min(),
        style={'width': '50%', 'margin': 'auto'}
    ),

    dcc.Dropdown(
        id='location-dropdown2',
        options=location_options,
        value=alchohol_df['locationdesc'].iloc[0],
        style={'width': '50%', 'margin': 'auto'}
    ),

    dcc.Graph(
        id='bar-chart2',
        figure={}
    ),

    html.P("Interactive Exploration: Users can interactively explore how the distribution of concerns related to Alcohol varies across different questions for a specific year and location.",
           style={'color': 'purple', 'text-align': 'center'}),

    # 6TH PLOT - DASH NO 6
    html.H1("How can hierarchical data on Cardiovascular Disease questions be visualized and explored?",
            style={'text-align': 'center', 'color': 'purple'}),
    dcc.Graph(
        id='sunburst-chart1',
        figure=fig_sunburst
    ),

    html.P(
        "Users can explore this hierarchical structure to gain insights into the distribution and relationships of data values within the Cardiovascular Disease dataset.",
        style={'color': 'purple'}
    ),

    html.Hr(),

    html.H3("Follow-up Questions:", style={'color': 'purple'}),

    # An ordered list renders each question on its own line; '\n' inside html.P is collapsed by the browser
    html.Ol([
        html.Li("What visual elements or features are used to represent the hierarchy in the data?"),
        html.Li("How can users interact with the visualization to navigate through the hierarchical data?"),
        html.Li("Why is it important to visualize questions hierarchically in the context of Cardiovascular Disease data?"),
        html.Li("Are there specific insights or trends that users can derive from exploring the hierarchical data?"),
    ], style={'color': 'purple'}),

    # 7TH PLOT - DASH NO 7
    html.H1("Relationship between Alcohol and Cardiovascular Disease Indicators Over the Years",
            style={'color': 'purple'}),

    # Scatter plot
    dcc.Graph(
        id='scatter-plot7',
        figure=px.scatter(
            combined_df,
            x='yearstart',
            y='datavalue',
            color='topic',
            labels={'datavalue': 'Crude Prevalence'},
            template='plotly_dark'
        ),
        style={'color': 'purple'}
    ),

    # Correlation coefficient
    html.Div(id='correlation-output7', style={'text-align': 'center', 'color': 'purple'}),

    # Location Dropdown
    dcc.Dropdown(
        id='location-dropdown7',
        options=[{'label': location, 'value': location} for location in combined_df['locationabbr'].unique()],
        value=combined_df['locationabbr'].iloc[0],
        multi=False,
        style={'width': '50%', 'margin': 'auto'}
    ),

    # Additional Information
    html.P("Filtering data specifically for alcohol-related indicators (Binge drinking prevalence) and cardiovascular disease-related indicators (High cholesterol prevalence) among adults aged 18 years and older.",
           style={'color': 'purple'}),
    html.P("Correlation Analysis: Calculating the correlation coefficient between the prevalence values of the selected alcohol and cardiovascular disease indicators over the years.",
           style={'color': 'purple'}),
    html.P("Dynamic Correlation Display: The correlation coefficient is dynamically displayed below the plot. As the user selects different locations from the dropdown, the correlation coefficient updates accordingly, providing insights into how the relationship varies across different locations.",
           style={'color': 'purple'}),

    # 8TH PLOT - DASH NO 8
    html.H1(
        "What is the Diabetes rate over the years in females by location?",
        style={'color': 'purple', 'text-align': 'center'}
    ),
    dcc.Graph(
        id='bubble-chart8',
        figure={}
    ),
    html.P(
        "The chart's X-axis represents the locations, the Y-axis represents the years, "
        "and the size and color of the bubbles indicate the diabetes rate. "
        "When a user clicks on a bubble (representing a specific location and year), "
        "the chart updates to show detailed information for that specific data point. "
        "This allows users to drill down into the data for a more detailed analysis.",
        style={'color': 'purple'}
    ),

    # 9TH PLOT - DASH NO 9
    html.H1("How many adults with diagnosed diabetes take a proactive approach to managing their health by completing a diabetes self-management course?", style={'text-align': 'center', 'color': 'purple'}),

    dcc.Graph(
        id='choropleth-map9',
        figure={}
    ),

    html.P(
        "In summary, participation in a diabetes self-management course is a positive indicator of individuals taking steps to educate themselves about their condition, enhance their self-care skills, and make informed decisions. "
        "It reflects a proactive and empowered approach to managing diabetes, contributing to improved overall awareness and well-being among those with diagnosed diabetes.",
        style={'color': 'purple', 'text-align': 'center'}
    ),

    # 10TH PLOT - DASH NO 10
    html.H1("Percentage of Heavy Drinking Among Women Aged 18-44 Years Out of Heavy Drinking Among Adults Aged >= 18 Years",
            style={'color': 'purple', 'font-weight': 'bold'}),

    dcc.Dropdown(
        id='location-dropdown10',
        options=[{'label': location, 'value': location} for location in locations],
        value='Arizona',
        style={'width': '50%'}
    ),

    dcc.Graph(
        id='bar-chart10'
    ),

    html.P("Gender-Specific Insights: Identifying the percentage of heavy drinking among women specifically helps in understanding the prevalence of this behavior within the female population.",
           style={'color': 'purple'}),
    html.P("It allows for a targeted analysis of women in the reproductive age group (18-44 years), which may have implications for maternal and child health.",
           style={'color': 'purple'}),


    # 11TH PLOT - DASH NO 11
    html.H1("How do health indicators among high school students vary over time and across different locations?",
            style={'color': 'purple'}),

    dcc.Graph(id='animated-box-plot11', figure=fig),

    html.P("The goal is to compare the distributions of soda consumption, obesity, and overweight or obesity "
           "among high school students across locations and over time.",
           style={'color': 'purple'}),

    # 12TH PLOT - DASH NO 12
     html.H1("Is there any correlation between Soda consumption among high school students and Overweight or obesity among high school students?"),

    html.Div([
        dcc.Dropdown(
            id='location-dropdown12',
            options=[{'label': location, 'value': location} for location in locations],
            value=locations[0],
            style={'width': '50%', 'color': '#7FDBFF'}
        ),
        dcc.Graph(id='correlation-heatmap12', style={'color': '#7FDBFF'}),
        dcc.Markdown(
            '''
            **As you select a location from the dropdown, the correlation heatmap will be updated to show the correlation between Soda consumption among high school students and Overweight or obesity among high school students.**
            '''
        ),
    ]),


    #Thirteen Section
    html.H1("What is the obesity rate among adults aged 18 years and older, and How does it vary by location and year?",
            style={'color': 'purple'}),
    dcc.Graph(figure=heatmap_fig),
    html.Div([
        html.P("The heat map visualizes the obesity rate among adults aged 18 years and older in different locations and years.",
               style={'color': 'purple'}),
        html.P("The map provides a visual representation of how the obesity rate changes over different locations and years, allowing for easy identification of trends and variations in the data.",
               style={'color': 'purple'}),
    ]),

    # Fourteen Section
    html.H1(
        'Incidence of treated end-stage renal disease vs. Incidence attributed to diabetes',
        style={'textAlign': 'center', 'color': 'purple'}
    ),
    html.Label('Select Location:'),
    dcc.Dropdown(
        id='location-dropdown',
        options=[{'label': location, 'value': location} for location in aggregated_data['locationdesc'].unique()],
        value='United States',
    ),
    dcc.Graph(id='bar-chart'),
    html.Div(id='correlation-output'),
    html.Div(id='percentage-output'),
    html.Div(id='additional-text', style={'color': 'purple', 'margin-top': '20px'}),

    # Fifteen Section
    html.H1("What are the Top 5 Disease Entries Over Years by Location?", style={'color': 'purple'}),
    html.Label('Select Location:'),
    dcc.Dropdown(
        id='location-dropdown-2',
        options=[{'label': location, 'value': location} for location in aggregated_data['locationdesc'].unique()],
        value='United States',
    ),
    dcc.Graph(id='bar-chart-2'),
    html.Div(id='text-output', style={'color': 'purple', 'margin-top': '20px'}),

     # Sixteen Section
    html.H1("What are the Bottom 5 Topic Entries Over Years by Location?", style={'color': 'purple'}),
    html.Label('Select Location:'),
    dcc.Dropdown(
        id='location-dropdown-3',
        options=[{'label': location, 'value': location} for location in df['locationdesc'].unique()],
        value='Michigan'
    ),
    dcc.Graph(id='bar-chart-3'),
    html.Div(id='bottom-text', style={'color': 'purple', 'padding-top': '20px'}),

    #Seventeen Section
     html.H1("Exploring the Relationship Between Disease and Gender/Race", style={'color': 'purple'}),

    dcc.Dropdown(
        id='disease-dropdown',
        options=[{'label': disease, 'value': disease} for disease in df['topic'].unique()],
        value=df['topic'].unique()[0],
        multi=False,
        style={'width': '50%'}),
    dcc.Checklist(
        id='gender-checklist',
        options=[{'label': gender, 'value': gender} for gender in df['stratification1'].unique()],
        value=df['stratification1'].unique().tolist(),
        inline=True
    ),

    dcc.Graph(id='disease-variation-plot'),

    html.Div([
        html.P("The graph shows how the number of entries for the selected disease varies by gender or race over the years, which helps reveal whether some diseases are gender- or race-specific.", style={'color': 'purple'}),
    ]),
    #Eighteen Section
    html.H1("Relationship Between Mammography use among women aged 50-74 years and Cancer of the female breast, mortality",
            style={'text-align': 'center', 'color': 'purple'}),
    html.Label('Select Location:'),
    dcc.Dropdown(
        id='location-dropdown-5',
        options=[{'label': location, 'value': location} for location in df['locationdesc'].unique()],
        value=df['locationdesc'].iloc[0],
        style={'width': '300px', 'margin-bottom': '10px'}
    ),

    dcc.Graph(
        id='sunburst-chart',
        figure={}
    ),

    html.Div([
        html.P("Early Detection of Breast Cancer:",
               style={'color': 'purple', 'font-weight': 'bold'}),
        html.P(
            "Mammography is effective in detecting breast cancer at an early, more treatable stage. When breast cancer is identified in its early stages, treatment options are generally more successful, and the likelihood of survival increases."),
        html.P("Reducing Mortality through Early Treatment:",
               style={'color': 'purple', 'font-weight': 'bold'}),
        html.P(
            "Early detection through mammography allows for prompt initiation of treatment, such as surgery, radiation therapy, or chemotherapy. This timely intervention can reduce the risk of cancer spreading to other parts of the body, ultimately decreasing breast cancer mortality rates.")
    ], style={'margin': '20px', 'padding': '20px', 'background-color': '#f0f0f0', 'border-radius': '10px'}),

    #Nineteen Section
    html.H1("US Chronic Disease Indicators Explorer", style={'text-align': 'center', 'color': 'purple'}),
   html.Label('Select Topic:'),
   dcc.Dropdown(
       id='topic-dropdown',
       options=[{'label': topic, 'value': topic} for topic in df['topic'].unique()],
       value=df['topic'].iloc[0]
   ),
   html.Label('Select Question:'),
   dcc.Dropdown(id='question-dropdown'),
   html.Label('Select Stratification1:'),
   dcc.Dropdown(id='stratification1-dropdown'),
   html.Label('Select Stratification Category1:'),
   dcc.Dropdown(id='stratificationcategory1-dropdown'),
   dcc.Graph(
       id='choropleth-map',
       figure={}
   ),
   html.P(id='highest-prevalence-output'),
   html.Div([
       html.Ul([
           html.Li("Users can dynamically select a health topic, a specific question, and various demographic stratifications to investigate.", style={'color': 'purple'}),
           html.Li("The choropleth map visualizes the prevalence of the selected health indicator across U.S. states over time.", style={'color': 'purple'}),
           html.Li("Additionally, the program highlights the state with the highest prevalence, providing valuable insights into regional variations and trends in chronic diseases.", style={'color': 'purple'}),
           html.Li("This tool is valuable for public health professionals, policymakers, and researchers to analyze and understand the distribution of chronic diseases and make informed decisions.", style={'color': 'purple'})
       ])
   ]),

    #Model Selection
    html.H1("Interactive Mental Health Prediction Model", style={'color': 'purple'}),
        html.Div([
            dcc.Graph(id='scatter-plot20'),
            dcc.Dropdown(
                id='gender-dropdown20',
                options=[
                    {'label': 'Overall', 'value': 'Overall'},
                    {'label': 'Female', 'value': 'Female'},
                    {'label': 'Male', 'value': 'Male'},
                ],
                value='Overall',
                style={'width': '50%'}
            ),
            dcc.RangeSlider(
                id='future-years-dropdown20',
                min=int(df['yearstart'].min()),
                max=int(df['yearstart'].max()) + 10,
                step=1,
                # Label every selectable year, including the 10 future years
                marks={year: str(year) for year in range(int(df['yearstart'].min()), int(df['yearstart'].max()) + 11)},
                value=[int(df['yearstart'].max()) - 1, int(df['yearstart'].max())]
            ),
            html.Div(id='selected-year-output20')
        ]),
        html.Div([
            html.P("In our case, linear regression was the simplest and most straightforward choice. "
                   "It represents the relationship between the input variable and the target variable as a linear equation. "
                   "It also had the lowest MSE: 0.28, compared with 0.29 for ridge regression "
                   "and 0.37 for lasso regression.",
                   style={'color': 'black'})
        ])

       ])

# Callback 2 to update the chart based on user interaction
@app.callback(
    Output('trends-chart', 'figure'),
    [Input('trends-chart', 'relayoutData')]
)
def update_chart(relayout_data):
    try:
        # Plot the trends using a line chart with custom legend order
        fig = px.line(grouped_df, x='year_start', y=legend_order,
                      labels={'value': 'Number of Occurrences', 'year_start': 'Years'},
                      template="plotly_dark", width=1200, height=600)

        # Adjust the position of the legend to avoid overlapping with the graph
        fig.update_layout(legend=dict(orientation='h', yanchor='bottom', y=1.1, xanchor='right', x=1))

        # Increase the top margin to create space between the title and the top of the graph
        fig.update_layout(margin=dict(t=150))  # Adjust the top margin as needed

        return fig

    except Exception as e:
        print(f"Error: {str(e)}")
        return px.scatter(title=f"Error: {str(e)}", template="plotly_dark", width=1200, height=600)

# Callback 3 to update the chart based on the selected location
@app.callback(
    Output('bar-chart3', 'figure'),
    [Input('location-dropdown-6', 'value')]
)
def update_cancer_chart(selected_location):
    # Filter data for the selected location
    filtered_data = cancer_df[cancer_df['locationdesc'] == selected_location]

    # Update the figure with the filtered data
    updated_fig = px.bar(
        filtered_data,
        x='yearstart',
        y='datavalue',
        color='question',
        labels={'datavalue': 'Average Annual Number', 'yearstart': 'Year'},
        height=600,
        width=800,
        category_orders={"question": filtered_data['question'].unique()},
        color_discrete_sequence=px.colors.qualitative.Set1
    )

    # Update the layout
    updated_fig.update_layout(
        barmode='stack',
        legend=dict(orientation='h', yanchor='top', y=1.02, xanchor='right', x=1),
        xaxis=dict(tickmode='array', tickvals=filtered_data['yearstart'].unique(), ticktext=filtered_data['yearstart'].unique()),
        margin=dict(t=100),
        template='plotly_dark'
    )

    return updated_fig

#CALLBACKS FOR DASH 5
# Callback to update the bar chart based on the selected year and location
@app.callback(
    Output('bar-chart2', 'figure'),
    [Input('year-dropdown2', 'value'),
     Input('location-dropdown2', 'value')]
)
def update_bar_chart(selected_year, selected_location):
    try:
        # Filter data for the selected year and location
        filtered_data = alchohol_df[(alchohol_df['yearstart'] == selected_year) & (alchohol_df['locationdesc'] == selected_location)]

        # Aggregate data by summing up 'datavalue' for each 'question'
        aggregated_df = filtered_data.groupby(['question'])['datavalue'].sum().reset_index()

        # Sort the dataframe in descending order based on 'datavalue'
        sorted_df = aggregated_df.sort_values(by='datavalue', ascending=False)

        # Create a multicolor horizontal bar chart with questions on the y-axis
        fig = px.bar(
            sorted_df,
            y='question',
            x='datavalue',
            color='datavalue',
            height=600,
            color_continuous_scale=px.colors.sequential.Viridis,
            labels={'question': 'Questions', 'datavalue': 'Data Values'},
            template='plotly_dark'
        )

        # Adjust legend position
        fig.update_layout(legend=dict(yanchor="top", y=1.2, xanchor="left", x=0.01))

        return fig

    except Exception as e:
        print(f"Error: {str(e)}")
        return px.scatter(title=f"Error: {str(e)}", template="plotly_dark", width=1200, height=600)

# NO CALLBACKS FOR DASH 6

# CALLBACKS FOR DASH 7
# Callback to update correlation coefficient based on selected location
@app.callback(
    Output('correlation-output7', 'children'),
    [Input('location-dropdown7', 'value')]
)
def update_correlation(location):
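    filtered_df2 = combined_df[combined_df['locationabbr'] == location]
    # A correlation coefficient needs at least two observations
    if len(filtered_df2) < 2:
        return "Correlation Coefficient: not enough data for this location"
    correlation_coefficient = np.corrcoef(filtered_df2['datavalue'], filtered_df2['yearstart'])[0, 1]
    return f"Correlation Coefficient: {correlation_coefficient:.2f}"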

# CALLBACKS FOR DASH 8
# Callback to update the Bubble Chart based on user input
@app.callback(
    Output('bubble-chart8', 'figure'),
    [Input('bubble-chart8', 'clickData')]
)
def update_bubble_chart(click_data):
    try:
        # If click data is available, filter the dataframe for the clicked location and year.
        # The chart plots locations on the x-axis and years on the y-axis, so the clicked
        # point carries the location in 'x' and the year in 'y'.
        if click_data is not None and click_data.get('points'):
            point = click_data['points'][0]
            location = point['x']
            year = point['y']
            diabetes_data = diabetes_df[(diabetes_df['locationdesc'] == location) & (diabetes_df['yearstart'] == year)]
        else:
            # If no click data, show the entire dataset
            diabetes_data = diabetes_df

        # Create the Bubble Chart using Plotly Express
        fig = px.scatter(
            diabetes_data,
            x='locationdesc',
            y='yearstart',
            size='datavalue',
            color='datavalue',
            hover_name='locationdesc',
            animation_frame='yearstart',
            animation_group='locationdesc',
            labels={'datavalue': 'Diabetes Rate'},
            size_max=50,
            color_continuous_scale='Viridis',
            category_orders={'locationdesc': sorted(diabetes_df['locationdesc'].unique())},
            range_y=[diabetes_df['yearstart'].min(), diabetes_df['yearstart'].max()],
        )

        # Adjust the distance on the x-axis and set the template to Plotly Dark
        fig.update_layout(
            margin=dict(l=150, r=150),  # Increase the left and right margins
            xaxis=dict(categoryorder='total ascending'),  # Adjust the category order
            template='plotly_dark'  # Set the template to Plotly Dark
        )

        return fig

    except Exception as e:
        print(f"Error: {str(e)}")
        return px.scatter(title=f"Error: {str(e)}", template="plotly_dark", width=1200, height=600)

# CALLBACKS FOR DASH 9
# Callback to render the choropleth map (it re-runs whenever the map layout changes, e.g. on pan or zoom)
@app.callback(
    Output('choropleth-map9', 'figure'),
    [Input('choropleth-map9', 'relayoutData')]
)
def update_choropleth_map(relayoutData):
    try:
        # Create a choropleth map using Plotly Express
        fig = px.choropleth(
            adults_df,
            locations='locationabbr',
            locationmode="USA-states",
            color='datavalue',
            color_continuous_scale="Viridis",
            range_color=(adults_df['datavalue'].min(), adults_df['datavalue'].max()),
            labels={'datavalue': 'Prevalence'},
            title='',  # Remove the title
            template='plotly_dark'
        )

        # Update the layout for better visualization
        fig.update_layout(
            geo=dict(scope='usa'),
            height=600,
            margin=dict(l=0, r=0, b=0, t=40),
            template='plotly_dark'
        )

        return fig

    except Exception as e:
        print(f"Error: {str(e)}")
        return px.scatter(title=f"Error: {str(e)}", template="plotly_dark", width=1200, height=600)

# CALLBACKS FOR DASH 10
# Define callback to update the graph based on location selection
@app.callback(
    Output('bar-chart10', 'figure'),
    [Input('location-dropdown10', 'value')]
)
def update_graph(selected_location):
    try:
        # Filter data for the selected location
        selected_location_data = merged_df1[merged_df1['locationdesc'] == selected_location]

        # Create a bar chart with the Plotly dark template
        fig = px.bar(selected_location_data, x='yearstart', y='percentage', title='Percentage of Heavy Drinking Among Women Aged 18-44 Years',
                     labels={'percentage': 'Percentage'}, height=400, template='plotly_dark')

        return fig
    except Exception as e:
        # Log the error and leave the current figure unchanged
        print(f"Error: {e}")
        raise dash.exceptions.PreventUpdate

# NO CALLBACKS FOR DASH 11
# CALLBACKS FOR DASH 12
# Callback to update the heatmap based on user input
@app.callback(
    Output('correlation-heatmap12', 'figure'),
    Input('location-dropdown12', 'value')
)
def update_heatmap(selected_location):
    # Filter DataFrame based on selected location
    location_filtered_df = specq_df[specq_df['locationdesc'] == selected_location]

    # Compute the correlation matrix for the selected location
    # (the data is already filtered to one location, so groupby yields a single group)
    normalized_correlation_matrix = location_filtered_df.groupby('locationdesc').corr().droplevel(0)

    # Get the color scale for the selected location
    selected_color_scale = color_scales.get(selected_location, 'Viridis')  # Default to 'Viridis' if not found

    # Create heatmap trace with text annotations
    heatmap_trace = go.Heatmap(
        z=normalized_correlation_matrix.values,
        x=unique_questions,
        y=unique_questions,
        colorscale=selected_color_scale,  # Set the color scale for the selected location
        text=[[f'{correlation:.2f}' for correlation in row] for row in normalized_correlation_matrix.values],  # Display correlation values as text
        hoverinfo='text',  # Show text on hover
    )

    # Layout settings
    layout = go.Layout(
        title=f'Correlation Heatmap for {selected_location}',
        xaxis=dict(title='Questions', tickangle=0, automargin=True),  # Keep x-axis labels horizontal for readability
        yaxis=dict(title='Questions', automargin=True),
        paper_bgcolor='#111111',  # Plot background color in Plotly Dark color
        plot_bgcolor='#111111',  # Plot area background color in Plotly Dark color
        font=dict(color='#7FDBFF'),  # Font color for axis labels
        xaxis_showgrid=True,  # Show grid for x-axis
        yaxis_showgrid=True,  # Show grid for y-axis
    )

    # Create figure
    fig = go.Figure(data=[heatmap_trace], layout=layout)

    return fig



# Callback 14 to update the first bar chart, correlation, and percentage based on user input
@app.callback(
    [Output('bar-chart', 'figure'),
     Output('correlation-output', 'children'),
     Output('percentage-output', 'children'),
     Output('additional-text', 'children')],
    [Input('location-dropdown', 'value')]
)
def update_renal_bar_chart(selected_location):
    # Filter aggregated data based on the selected location
    location_data = aggregated_data[aggregated_data['locationdesc'] == selected_location]

    # Check if the filtered data is empty
    if location_data.empty:
        fig = px.bar(title=f'No Data for {selected_location}')
        correlation_output = ''
        percentage_output = ''
        additional_text = ''
    else:
        # Create a bar chart
        fig = px.bar(
            location_data,
            x='yearstart',
            y='datavalue',
            color='question',
            labels={'datavalue': 'Cases per 1,000,000'},
            height=600,
        )

        # Update the layout for better visualization
        fig.update_layout(
            xaxis_title='Year',
            yaxis_title='Cases per 1,000,000',
            legend_title='Question',
            height=600,
            margin=dict(l=0, r=0, b=0, t=40),
            template='plotly_dark'
        )

        # Set tick positions and labels on the x-axis
        fig.update_xaxes(tickvals=location_data['yearstart'].unique(), ticktext=location_data['yearstart'].unique())

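        # Calculate the correlation between the two questions across years.
        # Pivot so each question becomes a column indexed by year, then correlate the columns.
        question_total = 'Incidence of treated end-stage renal disease'
        question_diabetes = 'Incidence of treated end-stage renal disease attributed to diabetes'
        by_year = location_data.pivot_table(index='yearstart', columns='question', values='datavalue', aggfunc='mean')
        if {question_total, question_diabetes} <= set(by_year.columns):
            correlation = by_year[question_total].corr(by_year[question_diabetes])
            correlation_output = f'Correlation between the two questions: {correlation:.2f}'
        else:
            correlation_output = 'Correlation between the two questions: not available'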

        # Calculate the percentage of 'Incidence of treated end-stage renal disease attributed to diabetes'
        # with respect to the total 'Incidence of treated end-stage renal disease'
        total_incidence = location_data[location_data['question'] == 'Incidence of treated end-stage renal disease']['datavalue'].sum()
        total_diabetes_incidence = location_data[location_data['question'] == 'Incidence of treated end-stage renal disease attributed to diabetes']['datavalue'].sum()

        # Check for potential division by zero
        if total_incidence != 0:
            percentage = (total_diabetes_incidence / total_incidence) * 100
            percentage_output = f'Percentage of Incidence attributed to diabetes: {percentage:.2f}%'

            # Additional text describing the selected location
            if percentage > 40:
                additional_text = f'In {selected_location}, the percentage of incidence attributed to diabetes is above 40%.'
            else:
                additional_text = f'In {selected_location}, the percentage of incidence attributed to diabetes is 40% or below.'
        else:
            percentage_output = 'Total Incidence is zero, cannot calculate percentage.'
            additional_text = ''
    return fig, correlation_output, percentage_output, additional_text

# Callback 15 to update the second bar chart and text based on user input
@app.callback(
    [Output('bar-chart-2', 'figure'),
     Output('text-output', 'children')],
    [Input('location-dropdown-2', 'value')]
)

def update_bar_chart_2(selected_location):
    # Restrict to 2008-2021, then to the selected location
    year_filtered_df = df[(df['yearstart'] >= 2008) & (df['yearstart'] <= 2021)]
    filtered_df = year_filtered_df[year_filtered_df['locationdesc'] == selected_location]
    # Group by year and count entries for each topic
    topic_entries = filtered_df.groupby(['yearstart', 'topic']).size().reset_index(name='entry_count')

    # Get the top 5 topics based on entry count
    top_topics = topic_entries.groupby('topic')['entry_count'].sum().nlargest(5).index
    filtered_topic_entries = topic_entries[topic_entries['topic'].isin(top_topics)]

    # Create a bar chart using Plotly Express with the Plotly Dark template
    fig = px.bar(
        filtered_topic_entries,
        x='yearstart',
        y='entry_count',
        color='topic',
        labels={'entry_count': 'Entry Count', 'yearstart': 'Year'},
        color_continuous_scale='Viridis',  # Use a different color scale
        template='plotly_dark',  # Use the Plotly Dark template
    )

    # Text for the bottom of the graph
    text_output = """
    Identifying Dominant Health Issues: The bar chart makes it easy to identify the 5 topics with the
    highest entry counts for the selected location. This shows which health issues have the most
    reported entries in the dataset between 2008 and 2021.
    """

    return fig, text_output

# Callback to update the third bar chart (bottom 5 topics) and text based on user input
@app.callback(
    [Output('bar-chart-3', 'figure'),
     Output('bottom-text', 'children')],
    [Input('location-dropdown-3', 'value')]
)

def update_bar_chart_3(selected_location):
    filtered_df = df[df['locationdesc'] == selected_location]

    # Group by year and count entries for each topic
    topic_entries = filtered_df.groupby(['yearstart', 'topic']).size().reset_index(name='entry_count')

    # Get the bottom 5 topics based on entry count
    bottom_topics = topic_entries.groupby('topic')['entry_count'].sum().nsmallest(5).index
    filtered_topic_entries = topic_entries[topic_entries['topic'].isin(bottom_topics)]

    # Create a bar chart using Plotly Express with the Plotly Dark template
    fig = px.bar(
        filtered_topic_entries,
        x='yearstart',
        y='entry_count',
        color='topic',
        labels={'entry_count': 'Entry Count', 'yearstart': 'Year'},
        color_continuous_scale='Viridis',  # Use a different color scale
        template='plotly_dark',  # Use the Plotly Dark template
    )

    # Additional text for the bottom of the graph
    bottom_text = '''
        The "Bottom 5 Topic Entries" visualization complements the "Top 5" visualization
        by showing the health topics with the fewest entries in the selected location over the years.
        Knowing the least-reported health issues allows for targeted interventions or resource allocation to address specific health concerns
        that may be overlooked compared to more common health issues.
    '''

    return fig, bottom_text
# Callback 16 to update the fourth bar chart based on user input
@app.callback(
    Output('disease-variation-plot', 'figure'),
    [Input('disease-dropdown', 'value'),
     Input('gender-checklist', 'value')]
)
def update_plot(selected_disease, selected_genders):
    filtered_df = df[(df['topic'] == selected_disease) & (df['stratification1'].isin(selected_genders))]

    grouped_df = filtered_df.groupby(['stratification1', 'yearstart']).size().reset_index(name='Count')

    fig = px.bar(
        grouped_df,
        x='yearstart',
        y='Count',
        color='stratification1',
        barmode='group',  # Set barmode to 'group' for grouped bar chart
        labels={'yearstart': 'Year', 'Count': 'Count'},
        title=f'{selected_disease} Variation Over Years',
        template='plotly_dark',  # Set the template to 'plotly_dark'
    )

    return fig
# Callback 17 to update the sunburst chart based on the selected location and questions
@app.callback(
    Output('sunburst-chart', 'figure'),
    [Input('location-dropdown-5', 'value')]
)
def update_sunburst_chart(selected_location):
    # Short display labels for the two questions of interest
    question_labels = {
        "Mammography use among women aged 50-74 years": "Mammography use",
        "Cancer of the female breast, mortality": "Cancer of female breast",
    }

    # Filter data for the selected location, the Cancer topic, and the two questions
    mask = (
        (df['topic'] == 'Cancer')
        & (df['locationdesc'] == selected_location)
        & (df['question'].isin(list(question_labels)))
    )
    filtered_df = df[mask].copy()

    # Rename the questions before plotting; overriding the trace labels afterwards
    # would mislabel the chart, because the trace holds one label per segment
    filtered_df['question'] = filtered_df['question'].map(question_labels)

    # Create a sunburst chart
    fig = px.sunburst(filtered_df, path=['topic', 'question', 'locationdesc'], values='datavalue', color='question')

    # Customize the layout
    fig.update_layout(
        template='plotly_dark'
    )

    # Show each segment's label and its percentage of the current root segment
    fig.update_traces(textinfo='label+percent entry')

    return fig

# Callback 18 to update dropdown options based on selected topic
@app.callback(
   [Output('question-dropdown', 'options'),
    Output('question-dropdown', 'value'),
    Output('stratification1-dropdown', 'options'),
    Output('stratification1-dropdown', 'value'),
    Output('stratificationcategory1-dropdown', 'options'),
    Output('stratificationcategory1-dropdown', 'value')],
   [Input('topic-dropdown', 'value')]
)
def update_dropdowns(selected_topic):
   # Filter DataFrame based on the selected topic
   filtered_df = df[df['topic'] == selected_topic]

   # Update options for question dropdown
   question_options = [{'label': question, 'value': question} for question in filtered_df['question'].unique()]
   default_question = question_options[0]['value'] if question_options else None

   # Update options for stratification1 dropdown
   stratification1_options = [{'label': strat, 'value': strat} for strat in filtered_df['stratification1'].unique()]
   default_stratification1 = stratification1_options[0]['value'] if stratification1_options else None

   # Update options for stratificationcategory1 dropdown
   stratificationcategory1_options = [{'label': strat_cat, 'value': strat_cat} for strat_cat in
                                     filtered_df['stratificationcategory1'].unique()]
   default_stratificationcategory1 = stratificationcategory1_options[0]['value'] if stratificationcategory1_options else None

   return question_options, default_question, stratification1_options, default_stratification1, \
          stratificationcategory1_options, default_stratificationcategory1

# Callback 19 to update the choropleth map and highest-prevalence output based on dropdown selections
@app.callback(
   [Output('choropleth-map', 'figure'),
    Output('highest-prevalence-output', 'children')],
   [Input('topic-dropdown', 'value'),
    Input('question-dropdown', 'value'),
    Input('stratification1-dropdown', 'value'),
    Input('stratificationcategory1-dropdown', 'value')]
)
def update_choropleth_map(selected_topic, selected_question, selected_stratification1, selected_stratificationcategory1):
   try:
       # Filter the dataset based on dropdown selections
       filtered_df = df[(df['topic'] == selected_topic) &
                        (df['question'] == selected_question) &
                        (df['stratification1'] == selected_stratification1) &
                        (df['stratificationcategory1'] == selected_stratificationcategory1)]

       # Prevent unnecessary updates if the filtered dataframe is empty
       if filtered_df.empty:
           raise dash.exceptions.PreventUpdate

       # Find the state with the highest prevalence and its corresponding value
       highest_state = filtered_df.loc[filtered_df['datavalue'].idxmax(), 'locationdesc']
       highest_value = filtered_df['datavalue'].max()
       highest_year = filtered_df.loc[filtered_df['datavalue'].idxmax(), 'yearstart']

       # Create choropleth map
       choropleth_map = px.choropleth(
           filtered_df,
           locations='locationabbr',
           locationmode='USA-states',
           color='datavalue',
           animation_frame='yearstart',
           color_continuous_scale='Viridis',
           title=f'{selected_question} Across U.S. States',
           labels={'datavalue': 'Prevalence'},
           scope='usa',
       ).update_geos(projection_type='albers usa')

       return choropleth_map, f'Highest prevalence in {highest_state} ({highest_year}) with a value of {highest_value:.2f}'

   except Exception as e:
       print(f"Error in callback: {str(e)}")
       import traceback
       traceback.print_exc()
       raise dash.exceptions.PreventUpdate

# Callback to update scatter plot and output based on the selected gender and future years
@app.callback(
    [Output('scatter-plot20', 'figure'),
     Output('selected-year-output20', 'children')],
    [Input('gender-dropdown20', 'value'),
     Input('future-years-dropdown20', 'value')]
)
def update_scatter_plot(selected_gender, future_years_range):
        # Filter the DataFrame based on the selected gender
        filtered_dataM = filtered_dfM[filtered_dfM['stratification1'] == selected_gender].copy()

        # Extend X range for future years
        future_years = list(range(future_years_range[0], future_years_range[1] + 1))
        X_future = poly.transform(np.array(future_years).reshape(-1, 1))

        # Predict for future years
        y_future_pred = model.predict(X_future)

        # Create scatter plot with regression line
        fig = px.scatter(
            filtered_dataM,
            x='yearstart',
            y='datavalue',
            color='locationdesc',  # Color by location
            hover_name='locationdesc',
            labels={'datavalue': 'Recent Mentally Unhealthy Days', 'yearstart': 'Year'},
            # Plotly titles ignore '\n'; '<br>' is needed for a line break
            title=f'Scatter Plot: Mental Health Prediction ({selected_gender})<br>Mean Squared Error (CV): {mse_cv:.2f}'
        )

        # Add polynomial regression line to the plot
        x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
        x_range_poly = poly.transform(x_range)
        y_range_pred = model.predict(x_range_poly)
        fig.add_scatter(x=x_range.flatten(), y=y_range_pred, mode='lines', name='Polynomial Regression Line')

        # Add predictions for future years to the plot with text annotations
        fig.add_scatter(x=future_years, y=y_future_pred, mode='markers', name='Future Predictions',
                        marker=dict(color='black', size=10))

        # Format prediction values for output
        prediction_output = [html.P(f'Prediction for {year}: {prediction:.2f}') for year, prediction in
                             zip(future_years, y_future_pred)]

        return fig, prediction_output


# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




## **Predictive Analytics**

The following models attempt to predict the future frequency of recent mentally unhealthy days among adults aged 18 years and older, an important health concern we chose to explore within the CDI data, based on historical trends and gender. Each model is fitted on degree-2 polynomial features of the year and evaluated with 5-fold cross-validation and mean squared error (MSE). The model with the lowest MSE, which turned out to be linear regression, is the one displayed on the dashboard.
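
The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: it runs on synthetic data rather than the CDI dataset, the `cv_mse` helper and the `alpha` values are hypothetical, and it wraps the polynomial features and scaling in a pipeline with shuffled folds, which differs from the cells below.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the (yearstart, datavalue) pairs; NOT the CDI data.
rng = np.random.default_rng(42)
years = np.repeat(np.arange(2011, 2021), 20).reshape(-1, 1).astype(float)
values = 3.5 + 0.05 * (years.ravel() - 2011) + rng.normal(0, 0.5, len(years))


def cv_mse(estimator, X, y, n_splits=5):
    """Cross-validated MSE of a degree-2 polynomial pipeline (hypothetical helper)."""
    # Fitting the scaler inside the pipeline keeps each held-out fold unseen
    pipeline = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), estimator)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    y_pred = cross_val_predict(pipeline, X, y, cv=cv)
    return mean_squared_error(y, y_pred)


# The alpha values are illustrative assumptions, not tuned settings
candidates = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
}

mse_by_model = {name: cv_mse(est, years, values) for name, est in candidates.items()}
best_model = min(mse_by_model, key=mse_by_model.get)

for name, mse in mse_by_model.items():
    print(f'{name}: {mse:.3f}')
print('Lowest cross-validated MSE:', best_model)
```

Collecting the scores in a dictionary like `mse_by_model` would also let a comparison plot read its values directly instead of relying on numbers typed in by hand.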

###**Linear regression model**

In [None]:
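# Imports repeated here so this cell can run on its own
from dash import Input, Output
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Filter the dataset for mental health and recent mentally unhealthy days among adults aged >= 18 years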
filtered_df = df[(df['topic'] == 'Mental Health') &
                 (df['question'] == 'Recent mentally unhealthy days among adults aged >= 18 years') &
                 (df['datavaluetype'] == 'Mean')].copy()  # Make a copy to avoid SettingWithCopyWarning

# Create a new column 'index' as a placeholder for the x-axis
filtered_df['index'] = range(len(filtered_df))

# Choose a gender (Overall, Female, Male)
selected_gender = 'Overall'

# Filter the DataFrame based on the selected gender
filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

# Perform polynomial regression
X = filtered_data[['yearstart']]
y = filtered_data['datavalue']

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.1, random_state=42)

# Check if there are enough data points for linear regression
if len(X_train) > 1:
    # Model fitting and prediction
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Cross-validation predictions
    y_pred_cv = cross_val_predict(model, X_poly, y, cv=5)

    # Calculate mean squared error
    mse_cv = mean_squared_error(y, y_pred_cv)

    # Calculate residuals and remove outliers
    residuals = y - y_pred_cv
    threshold = 2  # Adjust the threshold as needed
    outlier_mask = np.abs(residuals) < threshold
    filtered_data = filtered_data[outlier_mask]

    # Dash app initialization
    app = dash.Dash(__name__)

    # Layout of the app
    app.layout = html.Div(children=[
        html.H1("Interactive Mental Health Prediction Model", style={'color': 'purple'}),
        html.Div([
            dcc.Graph(id='scatter-plot'),
            dcc.Dropdown(
                id='gender-dropdown',
                options=[
                    {'label': 'Overall', 'value': 'Overall'},
                    {'label': 'Female', 'value': 'Female'},
                    {'label': 'Male', 'value': 'Male'},
                ],
                value='Overall',  # Set the default value to 'Overall'
                style={'width': '50%'}
            ),
            dcc.RangeSlider(
                id='future-years-dropdown',
                min=df['yearstart'].min(),
                max=df['yearstart'].max() + 10,
                step=1,
                marks={year: str(year) for year in range(df['yearstart'].min(), df['yearstart'].max() + 1)},
                value=[df['yearstart'].max() - 1, df['yearstart'].max()]  # Set the initial range to the last two years
            ),
            html.Div(id='selected-year-output')
        ]),
        html.Div([
            html.P("In our case, we found that Linear regression would be simple and straightforward. "
                   "The relationship between the input variable and target variable is represented as a linear equation. "
                   "Also, it has the lowest MSE value. MSE value for the linear regression model is: 0.28. "
                   "While MSE values for ridge and lasso regression are 0.29 and 0.37 respectively.",
                   style={'color': 'black'})
        ])
    ])

    # Callback to update scatter plot and output based on the selected gender and future years
    @app.callback(
        [Output('scatter-plot', 'figure'),
         Output('selected-year-output', 'children')],
        [Input('gender-dropdown', 'value'),
         Input('future-years-dropdown', 'value')]
    )
    def update_scatter_plot(selected_gender, future_years_range):
        # Filter the DataFrame based on the selected gender
        filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

        # Extend X range for future years
        future_years = list(range(future_years_range[0], future_years_range[1] + 1))
        X_future = poly.transform(np.array(future_years).reshape(-1, 1))

        # Predict for future years
        y_future_pred = model.predict(X_future)

        # Create scatter plot with regression line
        fig = px.scatter(
            filtered_data,
            x='yearstart',
            y='datavalue',
            color='locationdesc',  # Color by location
            hover_name='locationdesc',
            labels={'datavalue': 'Recent Mentally Unhealthy Days', 'yearstart': 'Year'},
            # Plotly titles ignore '\n'; '<br>' is needed for a line break
            title=f'Scatter Plot: Mental Health Prediction ({selected_gender})<br>Mean Squared Error (CV): {mse_cv:.2f}'
        )

        # Add polynomial regression line to the plot
        x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
        x_range_poly = poly.transform(x_range)
        y_range_pred = model.predict(x_range_poly)
        fig.add_scatter(x=x_range.flatten(), y=y_range_pred, mode='lines', name='Polynomial Regression Line')

        # Add predictions for future years to the plot with text annotations
        fig.add_scatter(x=future_years, y=y_future_pred, mode='markers', name='Future Predictions',
                        marker=dict(color='black', size=10))

        # Format prediction values for output
        prediction_output = [html.P(f'Prediction for {year}: {prediction:.2f}') for year, prediction in
                             zip(future_years, y_future_pred)]

        return fig, prediction_output

    # Run the app
    if __name__ == '__main__':
        app.run_server(debug=True)
else:
    print("Insufficient data points for linear regression.")



###**Ridge regression model**

In [None]:
# Filter the dataset for mental health and recent mentally unhealthy days among adults aged >= 18 years
filtered_df = df[(df['topic'] == 'Mental Health') &
                 (df['question'] == 'Recent mentally unhealthy days among adults aged >= 18 years') &
                 (df['datavaluetype'] == 'Mean')].copy()  # Make a copy to avoid SettingWithCopyWarning

# Create a new column 'index' as a placeholder for the x-axis
filtered_df['index'] = range(len(filtered_df))

# Choose a gender (Overall, Female, Male)
selected_gender = 'Overall'

# Filter the DataFrame based on the selected gender
filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

# Perform Ridge Regression
X = filtered_data[['yearstart']]
y = filtered_data['datavalue']

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Standardize features
scaler = StandardScaler()
X_poly_scaled = scaler.fit_transform(X_poly)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_poly_scaled, y, test_size=0.1, random_state=42)

# Check if there are enough data points for Ridge Regression
if len(X_train) > 1:
    # Model fitting and prediction
    ridge_model = Ridge(alpha=1.0)  # You can adjust the regularization strength (alpha) if needed
    ridge_model.fit(X_train, y_train)

    # Cross-validation predictions
    y_pred_cv = cross_val_predict(ridge_model, X_poly_scaled, y, cv=5)

    # Calculate mean squared error
    mse_cv = mean_squared_error(y, y_pred_cv)

    # Dash app initialization
    app = dash.Dash(__name__)

    # Layout of the app
    app.layout = html.Div(children=[
        html.H1("Interactive Mental Health prediction(Ridge Model)"),
        dcc.Graph(id='scatter-plot'),
        dcc.Dropdown(
            id='gender-dropdown',
            options=[
                {'label': 'Overall', 'value': 'Overall'},
                {'label': 'Female', 'value': 'Female'},
                {'label': 'Male', 'value': 'Male'},
            ],
            value='Overall',  # Set the default value to 'Overall'
            style={'width': '50%'}
        ),
        dcc.RangeSlider(
            id='future-years-dropdown',
            min=df['yearstart'].min(),
            max=df['yearstart'].max() + 10,  # Extend the range for future years
            step=1,
            marks={year: str(year) for year in range(df['yearstart'].min(), df['yearstart'].max() + 1)},
            value=[df['yearstart'].max() - 1, df['yearstart'].max()]  # Set the initial range to the last two years
        ),
        html.Div(id='selected-year-output')  # Output div for selected year prediction
    ])

    # Callback to update scatter plot and output based on the selected gender and future years
    @app.callback(
        [Output('scatter-plot', 'figure'),
         Output('selected-year-output', 'children')],
        [Input('gender-dropdown', 'value'),
         Input('future-years-dropdown', 'value')]
    )
    def update_scatter_plot(selected_gender, future_years_range):
        # Filter the DataFrame based on the selected gender
        filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

        # Extend X range for future years
        future_years = list(range(future_years_range[0], future_years_range[1] + 1))
        X_future = poly.transform(np.array(future_years).reshape(-1, 1))
        X_future_scaled = scaler.transform(X_future)

        # Predict for future years
        y_future_pred = ridge_model.predict(X_future_scaled)

        # Create scatter plot with regression line
        fig = px.scatter(
            filtered_data,
            x='yearstart',
            y='datavalue',
            color='locationdesc',  # Color by location
            hover_name='locationdesc',
            labels={'datavalue': 'Recent Mentally Unhealthy Days', 'yearstart': 'Year'},
            # Plotly titles ignore '\n'; '<br>' is needed for a line break
            title=f'Scatter Plot: Mental Health Prediction ({selected_gender})<br>Mean Squared Error (CV): {mse_cv:.2f}'
        )

        # Add polynomial regression line to the plot
        x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
        x_range_poly = poly.transform(x_range)
        x_range_scaled = scaler.transform(x_range_poly)
        y_range_pred = ridge_model.predict(x_range_scaled)
        fig.add_scatter(x=x_range.flatten(), y=y_range_pred, mode='lines', name='Ridge Regression Line')

        # Add predictions for future years to the plot with text annotations
        fig.add_scatter(x=future_years, y=y_future_pred, mode='markers', name='Future Predictions', marker=dict(color='black', size=10))

        # Format prediction values for output
        prediction_output = [html.P(f'Prediction for {year}: {prediction:.2f}') for year, prediction in zip(future_years, y_future_pred)]

        return fig, prediction_output

    # Run the app
    if __name__ == '__main__':
        app.run_server(debug=True)
else:
    print("Insufficient data points for Ridge Regression.")






###**Lasso regression model**

In [None]:
# Filter the dataset for mental health and recent mentally unhealthy days among adults aged >= 18 years
filtered_df = df[(df['topic'] == 'Mental Health') &
                 (df['question'] == 'Recent mentally unhealthy days among adults aged >= 18 years') &
                 (df['datavaluetype'] == 'Mean')].copy()  # Make a copy to avoid SettingWithCopyWarning

# Create a new column 'index' as a placeholder for the x-axis
filtered_df['index'] = range(len(filtered_df))

# Choose a gender (Overall, Female, Male)
selected_gender = 'Overall'

# Filter the DataFrame based on the selected gender
filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

# Perform Lasso Regression
X = filtered_data[['yearstart']]
y = filtered_data['datavalue']

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Standardize features
scaler = StandardScaler()
X_poly_scaled = scaler.fit_transform(X_poly)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_poly_scaled, y, test_size=0.1, random_state=42)

# Imported here in case Lasso was not imported earlier in the notebook
from sklearn.linear_model import Lasso

# Check if there are enough data points for Lasso Regression
if len(X_train) > 1:
    # Model fitting and prediction
    lasso_model = Lasso(alpha=1.0)  # You can adjust the regularization strength (alpha) if needed
    lasso_model.fit(X_train, y_train)

    # Cross-validation predictions
    y_pred_cv = cross_val_predict(lasso_model, X_poly_scaled, y, cv=5)

    # Calculate mean squared error
    mse_cv = mean_squared_error(y, y_pred_cv)

    # Dash app initialization
    app = dash.Dash(__name__)

    # Layout of the app
    app.layout = html.Div(children=[
        html.H1("Interactive Mental Health prediction (Lasso Model)"),
        dcc.Graph(id='scatter-plot'),
        dcc.Dropdown(
            id='gender-dropdown',
            options=[
                {'label': 'Overall', 'value': 'Overall'},
                {'label': 'Female', 'value': 'Female'},
                {'label': 'Male', 'value': 'Male'},
            ],
            value='Overall',  # Set the default value to 'Overall'
            style={'width': '50%'}
        ),
        dcc.RangeSlider(
            id='future-years-dropdown',
            min=df['yearstart'].min(),
            max=df['yearstart'].max() + 10,  # Extend the range for future years
            step=1,
            marks={year: str(year) for year in range(df['yearstart'].min(), df['yearstart'].max() + 1)},
            value=[df['yearstart'].max() - 1, df['yearstart'].max()]  # Set the initial range to the last two years
        ),
        html.Div(id='selected-year-output')  # Output div for selected year prediction
    ])

    # Callback to update scatter plot and output based on the selected gender and future years
    @app.callback(
        [Output('scatter-plot', 'figure'),
         Output('selected-year-output', 'children')],
        [Input('gender-dropdown', 'value'),
         Input('future-years-dropdown', 'value')]
    )
    def update_scatter_plot(selected_gender, future_years_range):
        # Filter the DataFrame based on the selected gender
        filtered_data = filtered_df[filtered_df['stratification1'] == selected_gender].copy()

        # Extend X range for future years
        future_years = list(range(future_years_range[0], future_years_range[1] + 1))
        X_future = poly.transform(np.array(future_years).reshape(-1, 1))
        X_future_scaled = scaler.transform(X_future)

        # Predict for future years
        y_future_pred = lasso_model.predict(X_future_scaled)

        # Create scatter plot with regression line
        fig = px.scatter(
            filtered_data,
            x='yearstart',
            y='datavalue',
            color='locationdesc',  # Color by location
            hover_name='locationdesc',
            labels={'datavalue': 'Recent Mentally Unhealthy Days', 'yearstart': 'Year'},
            # Plotly titles ignore '\n'; '<br>' is needed for a line break
            title=f'Scatter Plot: Mental Health Prediction ({selected_gender})<br>Mean Squared Error (CV): {mse_cv:.2f}'
        )

        # Add Lasso regression line to the plot
        x_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
        x_range_poly = poly.transform(x_range)
        x_range_scaled = scaler.transform(x_range_poly)
        y_range_pred = lasso_model.predict(x_range_scaled)
        fig.add_scatter(x=x_range.flatten(), y=y_range_pred, mode='lines', name='Lasso Regression Line')

        # Add predictions for future years to the plot with text annotations
        fig.add_scatter(x=future_years, y=y_future_pred, mode='markers', name='Future Predictions', marker=dict(color='black', size=10))

        # Format prediction values for output
        prediction_output = [html.P(f'Prediction for {year}: {prediction:.2f}') for year, prediction in zip(future_years, y_future_pred)]

        return fig, prediction_output

    # Run the app
    if __name__ == '__main__':
        app.run_server(debug=True)
else:
    print("Insufficient data points for Lasso Regression.")





###**MSE COMPARISON PLOT**

In [None]:
import matplotlib.pyplot as plt

mse_values = [0.28, 0.29, 0.37]

model_names = ['Linear Regression', 'Ridge Regression', 'Lasso Regression']

# Plotting the MSE values for each model
plt.figure(figsize=(8, 6))
plt.bar(model_names, mse_values, color='skyblue')
plt.title('Comparison of Mean Squared Error (MSE) for Different Models')
plt.xlabel('Models')
plt.ylabel('Mean Squared Error (MSE)')
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()


##**Overall project conclusion and Insights**

**Dataset Overview**

The dataset covers a broad spectrum of health-related topics, encompassing diseases, demographics, and geospatial information.
The visualizations throughout our dashboard help uncover patterns, trends, and disparities in health outcomes.
The balanced gender distribution and varied racial representation contribute to the dataset's comprehensiveness.
The consistent thread throughout these analyses is the suggestion that heightened awareness, whether through educational initiatives or targeted campaigns, correlates with positive health outcomes. These findings emphasize the pivotal role of public health initiatives and the need for continuous efforts to promote awareness and empower individuals to prioritize their well-being.

        

**Our process**

We began by cleaning the data and exploring its overall nature with the pyspark module, which we chose for its ability to handle larger datasets. Once the cleaning and data understanding steps were complete, we switched from the pyspark dataframe to a pandas dataframe to create the visualizations. Then, with the help of matplotlib, seaborn, plotly express and other modules, we created visualizations to further elucidate trends in the data based on a variety of metrics.

We explored the disease trends in the data using line plots, wordclouds and bar charts, then explored some of the questions asked in the dataframe as they relate to each disease to get an idea of the health concerns it covers. We performed several other explorations of the dataset as a function of race, ethnicity, gender, age, location and datavalue, and found interesting insights with each visualization. Our 5 major points of focus when exploring this dataset were identifying trends in the data over the years, health concerns across different groups, impacts of certain conditions as a function of race/ethnicity and gender, correlation between variables in the dataset, and possible useful insights that could be drawn from the data.

Below, we have added our overall insights from the data exploration and the visualizations present in our Dashboard.

#### Early Detection and Awareness
Promoting early detection through regular screenings and increasing awareness about the importance of proactive health monitoring can significantly contribute to improved health outcomes and a reduction in mortality rates associated with chronic diseases. Examples include mammography screening for breast cancer and self-management programs for diabetes.

#### Impact of Alcohol on Health
Analysis also highlights the multifaceted impact of alcohol on health, emphasizing the importance of moderation and responsible drinking to mitigate associated risks and promote overall well-being. Individuals should be aware of the potential health consequences and seek guidance from healthcare professionals if concerns arise.

#### High School Lifestyle and Obesity
There is a recognized link between lifestyle factors established during high school and the risk of obesity later in life.
Several aspects of high school life can contribute to either healthy or unhealthy lifestyle choices, influencing long-term health outcomes, including the risk of obesity. The availability of nutritious food options in school cafeterias, the presence of vending machines offering healthy snacks, and the promotion of balanced meals can impact students' dietary choices.