# US CDC Data API Extraction

The objective of this file is to obtain smoking information about a particular state by using the U.S. CDC Data API. The end goal is to enter a state and retrieve its associated information.

- API Usage Tutorial:
https://dev.socrata.com/foundry/chronicdata.cdc.gov/wsas-xwh5

- U.S. CDC Data:
https://chronicdata.cdc.gov/Survey-Data/Behavioral-Risk-Factor-Data-Tobacco-Use-2011-to-pr/wsas-xwh5

- Variables Description and Methodology:
https://chronicdata.cdc.gov/Survey-Data/Behavior-Risk-Factor-Surveillance-System-BRFSS-Glo/5amh-5sx3


We will use the Behavioral Risk Factor Data: Tobacco Use (2011 to present). The following is a brief description from their website:

*The BRFSS is a continuous, state-based surveillance system that collects information about modifiable risk factors for chronic diseases and other leading causes of death. The data for the STATE System were extracted from the annual BRFSS surveys from participating states. Tobacco topics included are cigarette and e-cigarette use prevalence by demographics, cigarette and e-cigarette use frequency, and quit attempts.*

*For 2011 data and forward, a random-digit dialing system was used to select samples of adults in households with landline or cellular telephones. The sample represented adults from each state who were civilian, aged 18 years or older and not institutionalized.*

Using the "Socrata Query Language" (SoQL), we can specify which rows we want from the dataset. For now, I will request the following:
- Year = 2018, as this is the most recent year in the dataset

- Topic Description = Cigarette Use (Adults), as we want to look at cigarette use among adults

- Age = All Ages

- Gender = Overall

- Race = All Races

- Education = All Grades

- Measure Description = Current Smoking
    - This is defined as: Persons who reported ever smoking at least 100 cigarettes and who currently smoke every day or on some days. Respondents who answered "don't know" or who refused to answer were excluded from the analysis, as were respondents with missing current smoking information.
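The criteria above can also be written as a single SoQL `where` expression. The following is a minimal sketch, not the query this notebook actually runs: the helpers `soql_literal` and `build_where` are hypothetical, and the lowercase field names (including `year`) are assumed from the column names the API returns.

```python
def soql_literal(value):
    """Render a Python value as a SoQL literal; strings are single-quoted."""
    if isinstance(value, str):
        # SoQL escapes a single quote inside a string by doubling it.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def build_where(filters):
    """Join `column = value` pairs with AND, in insertion order."""
    return " AND ".join(
        f"{column} = {soql_literal(value)}" for column, value in filters.items()
    )


# Field names are assumptions based on the lowercase columns in the API output.
filters = {
    "year": 2018,
    "topicdesc": "Cigarette Use (Adults)",
    "age": "All Ages",
    "gender": "Overall",
    "race": "All Races",
    "education": "All Grades",
    "measuredesc": "Current Smoking",
}

where_clause = build_where(filters)
print(where_clause)
```

With sodapy, a string like this could be passed as the `where` keyword argument of `client.get` instead of separate keyword filters.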

In [7]:
#Importing necessary packages
import pandas as pd
from sodapy import Socrata
pd.set_option('display.max_columns', 500)

In [2]:
# Placeholder credentials: replace these with the key ID, key secret, and app token requested from the Socrata website
keyID = 'keyID'
keySecret = 'keySecret'
app_token = 'apptoken'
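To keep real credentials out of a shared notebook, one option is to read them from environment variables. This is only a sketch: the variable names `SOCRATA_KEY_ID`, `SOCRATA_KEY_SECRET`, and `SOCRATA_APP_TOKEN` and the helper `load_credentials` are hypothetical, not something Socrata or sodapy defines.

```python
import os


def load_credentials(env=os.environ):
    """Return (key_id, key_secret, app_token); unset variables come back as None."""
    # The variable names below are illustrative assumptions.
    return (
        env.get("SOCRATA_KEY_ID"),
        env.get("SOCRATA_KEY_SECRET"),
        env.get("SOCRATA_APP_TOKEN"),
    )


keyID, keySecret, app_token = load_credentials()
```

The three values can then be passed to `Socrata(...)` exactly as in the next cell.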

In [3]:
client = Socrata("chronicdata.cdc.gov", username = keyID, password = keySecret, app_token = app_token)

# This is where I input my search criteria
results = client.get(
    "wsas-xwh5",
    content_type='json',
    YEAR=2018,
    TopicDesc='Cigarette Use (Adults)',
    age='All Ages',
    Race='All Races',
    Gender='Overall',
    Education='All Grades',
    MeasureDesc='Current Smoking',
    limit=2000,
)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
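JSON responses from Socrata often carry numeric fields as strings, so it may be worth converting them before doing any arithmetic. The sketch below works on the list of dictionaries that `client.get` returns. It assumes the four fields named in `NUMERIC_FIELDS` should be numeric, and the helpers `to_number` and `coerce_numeric` are hypothetical. The sample row reuses the Pennsylvania values shown in this notebook's output.

```python
NUMERIC_FIELDS = (
    "sample_size",
    "data_value",
    "low_confidence_limit",
    "high_confidence_limit",
)


def to_number(value):
    """Return value as a float, or None when it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def coerce_numeric(records, fields=NUMERIC_FIELDS):
    """Return new dicts with the given fields converted to float; other fields are untouched."""
    return [
        {key: (to_number(val) if key in fields else val) for key, val in row.items()}
        for row in records
    ]


# One record shaped like the API output, with numbers as strings (an assumption).
sample = [
    {
        "locationabbr": "PA",
        "sample_size": "5954",
        "data_value": "17.0",
        "low_confidence_limit": "15.7",
        "high_confidence_limit": "18.3",
    }
]
clean = coerce_numeric(sample)
```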

In [9]:
# Keeping only the necessary columns
df_smoking = results_df[[
    'locationabbr', 'locationdesc', 'age', 'education', 'gender', 'race',
    'topicdesc', 'sample_size', 'measuredesc', 'data_value_type',
    'data_value', 'low_confidence_limit', 'high_confidence_limit',
]]

In [10]:
df_smoking.head()

| | locationabbr | locationdesc | age | education | gender | race | topicdesc | sample_size | measuredesc | data_value_type | data_value | low_confidence_limit | high_confidence_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PA | Pennsylvania | All Ages | All Grades | Overall | All Races | Cigarette Use (Adults) | 5954 | Current Smoking | Percentage | 17.0 | 15.7 | 18.3 |
| 1 | MS | Mississippi | All Ages | All Grades | Overall | All Races | Cigarette Use (Adults) | 5674 | Current Smoking | Percentage | 20.5 | 19.0 | 22.0 |
| 2 | GU | Guam | All Ages | All Grades | Overall | All Races | Cigarette Use (Adults) | 1561 | Current Smoking | Percentage | 21.9 | 18.7 | 25.1 |
| 3 | TX | Texas | All Ages | All Grades | Overall | All Races | Cigarette Use (Adults) | 10697 | Current Smoking | Percentage | 14.4 | 12.9 | 15.9 |
| 4 | AZ | Arizona | All Ages | All Grades | Overall | All Races | Cigarette Use (Adults) | 7758 | Current Smoking | Percentage | 14.0 | 12.7 | 15.3 |
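The stated end goal is to enter a state and retrieve its information. A minimal sketch of that lookup follows, assuming the rows are available as a list of dictionaries. The helper `find_state` is hypothetical, and the sample rows copy two entries from the output above.

```python
def find_state(records, state):
    """Return rows whose abbreviation or full name matches `state`, ignoring case."""
    key = state.strip().lower()
    return [
        row
        for row in records
        if key in (row["locationabbr"].lower(), row["locationdesc"].lower())
    ]


# Two rows taken from the output above.
sample_rows = [
    {"locationabbr": "PA", "locationdesc": "Pennsylvania", "data_value": 17.0,
     "low_confidence_limit": 15.7, "high_confidence_limit": 18.3},
    {"locationabbr": "TX", "locationdesc": "Texas", "data_value": 14.4,
     "low_confidence_limit": 12.9, "high_confidence_limit": 15.9},
]

texas = find_state(sample_rows, "texas")
print(texas)
```

To run the same lookup on the DataFrame, the rows can be obtained with `df_smoking.to_dict("records")`.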
