## Research Question

"How do chronic disease prevalence and health disparities vary by socioeconomic factors (e.g., income inequality, unemployment) across U.S. states?"

## Problem Statement 

Chronic diseases affect socioeconomic groups unequally across U.S. states. Factors such as income inequality, employment status, and healthcare access contribute to these disparities. This study aims to analyze trends in chronic disease prevalence and identify correlations with socioeconomic indicators. We hypothesize that states with higher income inequality and unemployment will exhibit higher chronic disease rates and lower access to healthcare.

## Data Description 

Source: https://catalog.data.gov/dataset/u-s-chronic-disease-indicators

The U.S. Chronic Disease Indicators (CDI) dataset is provided by the CDC’s Division of Population Health. It consists of 309,215 records covering 115 key health indicators related to chronic diseases across all U.S. states and territories. This dataset is designed to provide a standardized approach for collecting, analyzing, and reporting chronic disease data to support public health decision-making.

## Import the libraries

In [12]:
# Import the libraries
import numpy as np                  # Scientific Computing
import pandas as pd                 # Data Analysis
import matplotlib.pyplot as plt     # Plotting
import seaborn as sns               # Statistical Data Visualization

# Let's make sure pandas returns all the rows and columns for the dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Force pandas to display full numbers instead of scientific notation
# pd.options.display.float_format = '{:.0f}'.format

# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')

### Read the dataset

In [13]:
# Read the dataset (pd.read_csv already returns a DataFrame,
# so no separate pd.DataFrame(...) call is needed)
us_chronic_desease = pd.read_csv('U.S._Chronic_Disease_Indicators.csv')


### Understanding the Dataset

- Check the first rows of the DataFrame with the `.head()` method.
- Running the cell above displays nothing, because the result is only assigned to a variable.
- Evaluating the DataFrame's name would display the data, but instead of printing every row we use `.head()` to preview the first few rows.
- Call `.head()` after the `read_csv` cell has run, as in the cell below.

The number inside the parentheses sets how many rows are shown.
If the parentheses are left blank, the first five rows are shown.
If we write 7 inside the parentheses, the first 7 rows of the dataframe are shown.

In [15]:
# Display the first seven rows of the Dataframe
us_chronic_desease.head(7)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,DataValue,DataValueAlt,DataValueFootnoteSymbol,DataValueFootnote,LowConfidenceLimit,HighConfidenceLimit,StratificationCategory1,Stratification1,StratificationCategory2,Stratification2,StratificationCategory3,Stratification3,Geolocation,LocationID,TopicID,QuestionID,ResponseID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,2019,2019,AR,Arkansas,BRFSS,Diabetes,Diabetes among adults,,%,Crude Prevalence,13.6,13.6,,,12.1,15.4,Sex,Male,,,,,POINT (-92.27449074299966 34.74865012400045),5,DIA,DIA01,,CRDPREV,SEX,SEXM,,,,
1,2019,2019,ID,Idaho,BRFSS,Diabetes,Diabetes among adults,,%,Crude Prevalence,10.6,10.6,,,9.1,12.2,Sex,Male,,,,,POINT (-114.3637300419997 43.682630005000476),16,DIA,DIA01,,CRDPREV,SEX,SEXM,,,,
2,2019,2019,IN,Indiana,YRBSS,Sleep,Short sleep duration among high school students,,%,Crude Prevalence,,,*,No data available,,,Grade,Grade 12,,,,,POINT (-86.14996019399968 39.766910452000445),18,SLEP,SLP02,,CRDPREV,GRADE,GRD12,,,,
3,2019,2019,IA,Iowa,NVSS,Asthma,"Asthma mortality among all people, underlying ...",,Number,Number,54.0,54.0,,,,,Overall,Overall,,,,,POINT (-93.81649055599968 42.46940091300047),19,AST,AST01,,NMBR,OVERALL,OVR,,,,
4,2019,2019,IA,Iowa,BRFSS,Asthma,Current asthma among adults,,%,Crude Prevalence,10.3,10.3,,,9.1,11.7,Age,Age 18-44,,,,,POINT (-93.81649055599968 42.46940091300047),19,AST,AST02,,CRDPREV,AGE,AGE1844,,,,
5,2019,2019,IA,Iowa,NVSS,Diabetes,"Diabetes mortality among all people, underlyin...",,Number,Number,54.0,54.0,,,,,Age,Age 0-44,,,,,POINT (-93.81649055599968 42.46940091300047),19,DIA,DIA03,,NMBR,AGE,AGE0_44,,,,
6,2019,2019,IA,Iowa,BRFSS,Health Status,Recent activity limitation among adults,,Number,Crude Mean,2.3,2.3,,,2.1,2.5,Sex,Female,,,,,POINT (-93.81649055599968 42.46940091300047),19,HEA,HEA04,,CRDMEAN,SEX,SEXF,,,,


Checking the last rows of the DataFrame with the `.tail()` method
There is also a method to see the last n rows.
The method is called `.tail()`.
The same rule applies here: if the parentheses are left blank, n defaults to 5; if we write 7 inside the parentheses, the last 7 rows of the dataframe are shown.

In [17]:
# Display the last seven rows of the dataframe
# Syntax: DataFrame.tail(qty)
us_chronic_desease.tail(7)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,DataValue,DataValueAlt,DataValueFootnoteSymbol,DataValueFootnote,LowConfidenceLimit,HighConfidenceLimit,StratificationCategory1,Stratification1,StratificationCategory2,Stratification2,StratificationCategory3,Stratification3,Geolocation,LocationID,TopicID,QuestionID,ResponseID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
309208,2022,2022,VT,Vermont,BRFSS,Sleep,Short sleep duration among adults,,%,Crude Prevalence,26.5,26.5,,,16.8,39.3,Race/Ethnicity,"Asian, non-Hispanic",,,,,POINT (-72.51764079099962 43.62538123900049),50,SLEP,SLP03,,CRDPREV,RACE,ASN,,,,
309209,2022,2022,VI,Virgin Islands,BRFSS,Immunization,Influenza vaccination among adults,,%,Crude Prevalence,34.2,34.2,,,24.2,45.8,Age,Age >=65,,,,,POINT (-64.896335 18.335765),78,IMM,IMM01,,CRDPREV,AGE,AGE65P,,,,
309210,2022,2022,VI,Virgin Islands,BRFSS,Tobacco,Quit attempts in the past year among adult cur...,,%,Age-adjusted Prevalence,,,#,No data available for this indicator because t...,,,Race/Ethnicity,"American Indian or Alaska Native, non-Hispanic",,,,,POINT (-64.896335 18.335765),78,TOB,TOB06,,AGEADJPREV,RACE,AIAN,,,,
309211,2022,2022,WV,West Virginia,BRFSS,Chronic Obstructive Pulmonary Disease,Chronic obstructive pulmonary disease among ad...,,%,Crude Prevalence,14.0,14.0,,,12.8,15.2,Overall,Overall,,,,,POINT (-80.71264013499967 38.66551020200046),54,COPD,COPD01,,CRDPREV,OVERALL,OVR,,,,
309212,2022,2022,WI,Wisconsin,BRFSS,Immunization,Pneumococcal vaccination among adults aged 65 ...,,%,Crude Prevalence,64.2,64.2,,,52.2,74.6,Race/Ethnicity,"Black, non-Hispanic",,,,,POINT (-89.81637074199966 44.39319117400049),55,IMM,IMM04,,CRDPREV,RACE,BLK,,,,
309213,2022,2022,VT,Vermont,BRFSS,Social Determinants of Health,Lack of health insurance among adults aged 18-...,,%,Crude Prevalence,,,****,Data suppressed; denominator < 50 or relative ...,,,Race/Ethnicity,"Hawaiian or Pacific Islander, non-Hispanic",,,,,POINT (-72.51764079099962 43.62538123900049),50,SDOH,SDH09,,CRDPREV,RACE,HAPI,,,,
309214,2022,2022,WA,Washington,BRFSS,Alcohol,Binge drinking prevalence among adults,,%,Age-adjusted Prevalence,19.2,19.2,,,18.3,20.2,Sex,Male,,,,,POINT (-120.47001078999972 47.52227862900048),53,ALC,ALC06,,AGEADJPREV,SEX,SEXM,,,,


In [18]:
# display the dimensions of the data
# This is the number of rows and columns in the data
# Syntax: DataFrame.shape
us_chronic_desease.shape

(309215, 34)

In [19]:
309215*34

10513310

* State the shape of the dataframe:
  - The dataframe has 309215 rows.
  - The dataframe has 34 columns.
  - There are 10513310 total data points (rows × columns) in the dataset.
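
The rows × columns arithmetic above can also be read directly from the DataFrame. The following is a minimal sketch on a small, hypothetical frame (`toy` and its values are illustrative stand-ins, not the CDI data); it assumes only that `DataFrame.size` returns the total number of cells.

```python
import pandas as pd

# Hypothetical toy frame standing in for the CDI data (illustration only)
toy = pd.DataFrame({
    "LocationAbbr": ["AR", "ID", "IA"],
    "DataValue": [13.6, 10.6, 10.3],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# .size is the total number of cells, i.e. rows * columns
total_cells = toy.size

print(f"{n_rows} rows x {n_cols} columns = {total_cells} data points")
```

Applied to the real dataframe, `us_chronic_desease.size` should give the same figure as multiplying the two values in `.shape` by hand.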

### Data Types
* The main data types in pandas dataframes are `object`, `float64`, `int64`, `bool` and `datetime64`. To understand each attribute of the data, it is always good to know the data type of each column.

##### `.info()` method
* This method prints information about a DataFrame including the index `dtype` and column dtypes, non-null values and memory usage.

In [20]:
# Let's check the basic information about the dataset
# Syntax: DataFrame.info()
us_chronic_desease.info(show_counts=True, verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309215 entries, 0 to 309214
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   YearStart                  309215 non-null  int64  
 1   YearEnd                    309215 non-null  int64  
 2   LocationAbbr               309215 non-null  object 
 3   LocationDesc               309215 non-null  object 
 4   DataSource                 309215 non-null  object 
 5   Topic                      309215 non-null  object 
 6   Question                   309215 non-null  object 
 7   Response                   0 non-null       float64
 8   DataValueUnit              309215 non-null  object 
 9   DataValueType              309215 non-null  object 
 10  DataValue                  209196 non-null  float64
 11  DataValueAlt               209196 non-null  float64
 12  DataValueFootnoteSymbol    101716 non-null  object 
 13  DataValueFootnote          10

### Observations of the Data Set
Describe the dataset.
- There are 309215 rows and 34 columns.
- The data types are integer (`int64`), `object`, and float (`float64`).

Based on the number of expected data points, and those listed by the `.info()` method, how many missing (null) values are there?
- There are 3953604 missing (null) values.

The number of non-null values does not match the total number expected based on the number of rows in the dataframe. This indicates the presence of missing values that will need further investigation.
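
One way to check a missing-value count like this without adding up the `.info()` output by hand is to count nulls directly. The sketch below uses a small, hypothetical frame (`toy` is illustrative and not the CDI data) and the standard `isna()` and `notna()` methods; the same calls could be applied to `us_chronic_desease`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values (illustration only)
toy = pd.DataFrame({
    "LocationAbbr": ["AR", "ID", "IN", "IA"],
    "DataValue": [13.6, 10.6, np.nan, 54.0],
    "Response": [np.nan, np.nan, np.nan, np.nan],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Total missing values across the whole frame
total_missing = int(missing_per_column.sum())

# Cross-check: expected cells minus non-null cells equals the missing count
expected_cells = toy.size
non_null_cells = int(toy.notna().sum().sum())

print(missing_per_column)
print("Total missing:", total_missing)
print("Expected - non-null:", expected_cells - non_null_cells)
```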



# Data Cleaning:


## Data Analysis and Preparation
Analyzing and preparing the data for statistical modeling involves several steps:
1. Check the dimensions of the dataframe to determine the number of rows and columns. This increases understanding of the data structure and size.
2. Check the data types and ensure they are correct. Refer to the data definitions to validate the results. For example, dates including year, month, and day should be converted from integers or strings to `datetime` for ease of use in time series analysis.
3. Update data types based on the business definition, changing them as required.
4. Using Python and NumPy methods, examine the summary statistics. These show the scale of the data, including the minimum and maximum of each range, and give an early indication of skewness, kurtosis, and possible outliers.
5. Check for missing values that may cause noise in machine learning models and aberrations in visualizations.
6. Study the correlation between data variables for key insights and future feature engineering.
7. Detect outliers that may contribute to skewness in the data or add to the kurtosis, affecting the data ranges.
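
As a rough illustration of steps 2 through 7, the sketch below runs each check on a small, hypothetical frame. The `toy` data, the `YearStartDate` column name, and the 1.5 × IQR outlier rule are assumptions chosen for the example; they are not taken from the analysis itself.

```python
import pandas as pd

# Hypothetical toy frame borrowing a few CDI column names (illustration only)
toy = pd.DataFrame({
    "YearStart": [2019, 2019, 2020, 2021, 2022, 2022],
    "DataValue": [13.6, 10.6, 10.3, 14.0, 19.2, 64.2],
    "LowConfidenceLimit": [12.1, 9.1, 9.1, 12.8, 18.3, 52.2],
})

# Steps 2-3: inspect dtypes, then convert the integer year to datetime
print(toy.dtypes)
toy["YearStartDate"] = pd.to_datetime(toy["YearStart"].astype(str), format="%Y")

# Step 4: summary statistics (count, mean, std, min, quartiles, max)
numeric_cols = ["DataValue", "LowConfidenceLimit"]
summary = toy[numeric_cols].describe()

# Step 5: missing values per column
missing = toy.isna().sum()

# Step 6: pairwise Pearson correlation between numeric columns
corr = toy[numeric_cols].corr()

# Step 7: flag outliers with the 1.5 * IQR rule (one common convention)
q1 = toy["DataValue"].quantile(0.25)
q3 = toy["DataValue"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["DataValue"] < q1 - 1.5 * iqr) | (toy["DataValue"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(summary)
print(missing)
print(corr)
print(outliers)
```

The IQR rule is only one of several outlier conventions; a z-score threshold or a domain-specific cutoff may suit the CDI indicators better.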

### Headers Update and Map New Column

- To aid in analysis and visualization, add new columns as needed. For instance, instead of an entire state name, add a column that holds the two-letter abbreviation for the state.
- The list of headers requires standardization; the headers will be updated to ensure uniformity.
- Update ALL CAPS or all lowercase headers to the appropriate case. These can be updated to Title Case.

In [30]:
# Let's create a list of the columns in the dataset
# Use the variable = DataFrame.columns method
chroniccols=us_chronic_desease.columns
chroniccols

Index(['YearStart', 'YearEnd', 'LocationAbbr', 'LocationDesc', 'DataSource',
       'Topic', 'Question', 'Response', 'DataValueUnit', 'DataValueType',
       'DataValue', 'DataValueAlt', 'DataValueFootnoteSymbol',
       'DataValueFootnote', 'LowConfidenceLimit', 'HighConfidenceLimit',
       'StratificationCategory1', 'Stratification1', 'StratificationCategory2',
       'Stratification2', 'StratificationCategory3', 'Stratification3',
       'Geolocation', 'LocationID', 'TopicID', 'QuestionID', 'ResponseID',
       'DataValueTypeID', 'StratificationCategoryID1', 'StratificationID1',
       'StratificationCategoryID2', 'StratificationID2',
       'StratificationCategoryID3', 'StratificationID3'],
      dtype='object')

In [None]:
# Let's Update the Headers for Syntax Consistency
# Syntax: df = df.rename(columns={'currentColumnName':'newColumnName', 'nextCurrentColumnName':'nextNewColumnName'})


# Let's view the new columns and update the variable
# Pass the columns to the variable: Use the variable = DataFrame.columns method

# Call the variable to see the contents
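
The cell above is left as a template. A minimal sketch of the rename pattern it describes, run on a small hypothetical frame, might look like the following. The new header names `StateAbbr` and `StateName` are illustrative assumptions, not names chosen by this analysis.

```python
import pandas as pd

# Hypothetical toy frame borrowing three CDI headers (illustration only)
toy = pd.DataFrame({
    "LocationAbbr": ["AR", "ID"],
    "LocationDesc": ["Arkansas", "Idaho"],
    "DataValue": [13.6, 10.6],
})

# Map current header -> new header; these new names are assumptions
new_names = {
    "LocationAbbr": "StateAbbr",
    "LocationDesc": "StateName",
}

# rename returns a new DataFrame, so assign the result back
toy = toy.rename(columns=new_names)

# Pass the updated columns to a variable and view them
toy_cols = toy.columns
print(list(toy_cols))
```

Headers missing from the mapping (here `DataValue`) are left unchanged, so the mapping only needs to list the columns being renamed.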