## Objective

Monitor and analyze chronic disease trends: the U.S. Chronic Disease Indicators (CDI) dataset is primarily intended for monitoring and analyzing trends in chronic diseases across the United States.


## Research questions

* Trends in Chronic Disease Prevalence: What are the trends in the prevalence of chronic diseases over the years covered by the dataset? Are certain chronic diseases becoming more or less common?

* Geographical Variations: Are there significant geographical variations in the prevalence of specific chronic diseases or health outcomes? Do certain regions have higher or lower rates of chronic conditions?

* Temporal Changes: Have there been significant changes in the rates of chronic diseases over the years in different locations? Are there any notable patterns or fluctuations?

* Health Disparities: Do health disparities exist among different demographic groups (stratification categories)? For example, do certain age groups, genders, or racial/ethnic groups have higher rates of chronic diseases?

* Mortality Trends: Are there trends in mortality rates related to specific chronic diseases? Have mortality rates improved or worsened over time?



In [3]:
# import the libraries

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings("ignore")

In [4]:
# Read the file
df = pd.read_csv("U.S._Chronic_Disease_Indicators__CDI_.csv")

In [5]:
# Find out the number of columns and rows
df.shape

(1185676, 34)

In [6]:
df.head(10)

| | YearStart | YearEnd | LocationAbbr | LocationDesc | DataSource | Topic | Question | Response | DataValueUnit | DataValueType | ... | LocationID | TopicID | QuestionID | DataValueTypeID | StratificationCategoryID1 | StratificationID1 | StratificationCategoryID2 | StratificationID2 | StratificationCategoryID3 | StratificationID3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 2010 | OR | Oregon | NVSS | Cardiovascular Disease | Mortality from heart failure | | | Number | ... | 41 | CVD | CVD1_4 | NMBR | RACE | AIAN | | | | |
| 1 | 2019 | 2019 | AZ | Arizona | YRBSS | Alcohol | Alcohol use among youth | | % | Crude Prevalence | ... | 4 | ALC | ALC1_1 | CRDPREV | GENDER | GENF | | | | |
| 2 | 2019 | 2019 | OH | Ohio | YRBSS | Alcohol | Alcohol use among youth | | % | Crude Prevalence | ... | 39 | ALC | ALC1_1 | CRDPREV | GENDER | GENM | | | | |
| 3 | 2019 | 2019 | US | United States | YRBSS | Alcohol | Alcohol use among youth | | % | Crude Prevalence | ... | 59 | ALC | ALC1_1 | CRDPREV | RACE | ASN | | | | |
| 4 | 2015 | 2015 | VI | Virgin Islands | YRBSS | Alcohol | Alcohol use among youth | | % | Crude Prevalence | ... | 78 | ALC | ALC1_1 | CRDPREV | GENDER | GENM | | | | |
| 5 | 2020 | 2020 | AL | Alabama | PRAMS | Alcohol | Alcohol use before pregnancy | | % | Crude Prevalence | ... | 1 | ALC | ALC1_2 | CRDPREV | RACE | WHT | | | | |
| 6 | 2015 | 2015 | DE | Delaware | PRAMS | Alcohol | Alcohol use before pregnancy | | % | Crude Prevalence | ... | 10 | ALC | ALC1_2 | CRDPREV | OVERALL | OVR | | | | |
| 7 | 2019 | 2019 | FL | Florida | PRAMS | Alcohol | Alcohol use before pregnancy | | % | Crude Prevalence | ... | 12 | ALC | ALC1_2 | CRDPREV | OVERALL | OVR | | | | |
| 8 | 2018 | 2018 | KS | Kansas | PRAMS | Alcohol | Alcohol use before pregnancy | | % | Crude Prevalence | ... | 20 | ALC | ALC1_2 | CRDPREV | OVERALL | OVR | | | | |
| 9 | 2013 | 2013 | MS | Mississippi | PRAMS | Alcohol | Alcohol use before pregnancy | | % | Crude Prevalence | ... | 28 | ALC | ALC1_2 | CRDPREV | OVERALL | OVR | | | | |


In [7]:
df.info() # find out the data type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1185676 entries, 0 to 1185675
Data columns (total 34 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   YearStart                  1185676 non-null  int64  
 1   YearEnd                    1185676 non-null  int64  
 2   LocationAbbr               1185676 non-null  object 
 3   LocationDesc               1185676 non-null  object 
 4   DataSource                 1185676 non-null  object 
 5   Topic                      1185676 non-null  object 
 6   Question                   1185676 non-null  object 
 7   Response                   0 non-null        float64
 8   DataValueUnit              1033553 non-null  object 
 9   DataValueType              1185676 non-null  object 
 10  DataValue                  806942 non-null   object 
 11  DataValueAlt               804578 non-null   float64
 12  DataValueFootnoteSymbol    393710 non-null   object 
 13  DatavalueFoo

In [8]:
# Describing the data
df.describe()

| | YearStart | YearEnd | Response | DataValueAlt | LowConfidenceLimit | HighConfidenceLimit | StratificationCategory2 | Stratification2 | StratificationCategory3 | Stratification3 | ResponseID | LocationID | StratificationCategoryID2 | StratificationID2 | StratificationCategoryID3 | StratificationID3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1185676.0 | 1185676.0 | 0.0 | 804578.0 | 682380.0 | 682380.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1185676.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 2015.103 | 2015.643 | | 1005.325 | 50.264623 | 61.873881 | | | | | | 30.78907 | | | | |
| std | 3.320259 | 3.001197 | | 18804.33 | 89.004848 | 100.104303 | | | | | | 17.50972 | | | | |
| min | 2001.0 | 2001.0 | | 0.0 | 0.0 | 0.0 | | | | | | 1.0 | | | | |
| 25% | 2013.0 | 2013.0 | | 16.1 | 11.0 | 16.3 | | | | | | 17.0 | | | | |
| 50% | 2015.0 | 2016.0 | | 40.0 | 28.5 | 41.0 | | | | | | 30.0 | | | | |
| 75% | 2018.0 | 2018.0 | | 76.0 | 56.3 | 71.1 | | | | | | 45.0 | | | | |
| max | 2021.0 | 2021.0 | | 2925456.0 | 2541.6 | 3530.5 | | | | | | 78.0 | | | | |


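Many of the columns in this dataset are not useful for the analysis. Several of them, such as Response and StratificationCategory2, contain no values at all, so the unwanted columns need to be removed.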
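One possible way to spot removable columns is to look for those with no non-null values, which is what the zero counts in the summary above suggest for Response and the second and third stratification columns. The sketch below is only an illustration: `sample` is a small hypothetical stand-in for `df`, and its DataValueAlt numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CDI frame; the values are illustrative only.
sample = pd.DataFrame({
    "YearStart": [2010, 2019, 2019],
    "LocationDesc": ["Oregon", "Arizona", "Ohio"],
    "Response": [np.nan, np.nan, np.nan],
    "StratificationCategory2": [np.nan, np.nan, np.nan],
    "DataValueAlt": [12.0, np.nan, 30.5],
})

# Columns in which every value is missing carry no information.
empty_cols = sample.columns[sample.isna().all()].tolist()

# Drop only the all-empty columns; partially filled columns are kept.
trimmed = sample.dropna(axis=1, how="all")

print(empty_cols)
print(trimmed.columns.tolist())
```

Applied to the full frame, the same `dropna(axis=1, how="all")` call would remove only the columns that are entirely empty. Columns that hold data but are not needed for the analysis still have to be dropped by name.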

## Important columns

* YearStart: the starting year of the data for a particular record (integer).

* YearEnd: the ending year of the data for a particular record (integer).

* LocationDesc: the full name of the state or territory where the data was collected, or "United States" for national records (string).

* Topic: the general topic area of the data, such as "Cancer" or "Diabetes" (string).

* Question: the specific question or aspect of health being measured, e.g. "Alcohol use among youth" or "Alcohol use before pregnancy" (string).

* DataValue: the actual data value for a particular record. The value is numeric in meaning, but df.info() reports the column as object, so it has to be converted before any numeric analysis.

* StratificationCategory1: a general category for the stratification variable, e.g. "Race/Ethnicity" (string).

* Stratification1: a specific category within the general stratification category (string).
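The column selection described above can be sketched as follows. This is a minimal example under stated assumptions: `sample` is a hypothetical miniature of `df`, its DataValue and stratification entries are invented for illustration, and `important_cols` is a name introduced here. Because `df.info()` lists DataValue as object, the sketch also converts it with `pd.to_numeric`, turning anything unparseable into NaN.

```python
import pandas as pd

# Hypothetical sample mirroring the CDI column names; DataValue and
# stratification entries are made up for illustration.
sample = pd.DataFrame({
    "YearStart": [2019, 2019, 2015],
    "YearEnd": [2019, 2019, 2015],
    "LocationAbbr": ["AZ", "OH", "DE"],
    "LocationDesc": ["Arizona", "Ohio", "Delaware"],
    "Topic": ["Alcohol", "Alcohol", "Alcohol"],
    "Question": [
        "Alcohol use among youth",
        "Alcohol use among youth",
        "Alcohol use before pregnancy",
    ],
    "DataValue": ["25.1", None, "not available"],
    "StratificationCategory1": ["Gender", "Gender", "Overall"],
    "Stratification1": ["Female", "Male", "Overall"],
})

# The columns listed in this section (the list name is an assumption).
important_cols = [
    "YearStart", "YearEnd", "LocationDesc", "Topic", "Question",
    "DataValue", "StratificationCategory1", "Stratification1",
]

# Keep only those columns; copy so the original frame stays untouched.
subset = sample[important_cols].copy()

# Convert DataValue to a numeric dtype; non-numeric text becomes NaN.
subset["DataValue"] = pd.to_numeric(subset["DataValue"], errors="coerce")

print(subset.dtypes["DataValue"])
print(subset["DataValue"].isna().sum())
```

Selecting by name rather than dropping by name keeps the code short when only a handful of the 34 columns are wanted. The trade-off is that any column added to the analysis later has to be appended to the list explicitly.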
