---
title: "Exploring income and physical activity disparities in the US"
subtitle: "INFO 511 - Fall 2024 - Final Project"
author: 
  - name: "The Outliers"
    affiliations:
      - name: "School of Information, University of Arizona"
description: "Using a dataset from the CDC of nutrition, physical activity, and obesity records, we will explore the possibility of a relationship between income and physical activity and examine its direction and strength."
format:
   html:
    code-tools: true
    code-overflow: wrap
    embed-resources: true
editor: visual
execute:
  warning: false
  echo: false
jupyter: python3
---


## Introduction

Understanding the relationship between socioeconomic status and health behaviors is necessary for addressing disparities in public health outcomes. Our project asks whether higher-income populations consistently have more time for physical activity than lower-income populations, using a dataset from the Centers for Disease Control and Prevention (CDC). The dataset comes from the Behavioral Risk Factor Surveillance System (BRFSS) and was collected through phone surveys conducted between 2011 and 2023. The full dataset covers physical activity, nutrition, and obesity trends among U.S. residents aged 18 and older. For this project, we focus on the survey questions related to physical activity. The data is stratified by age, education, gender, income, and race/ethnicity.

## Research Question

**Do higher-income populations have more time for physical activity than lower-income populations?**

We hypothesize that they do: higher-income populations have more time for physical activity, so the share of a population that engages in physical activity should rise with income level (a positive relationship).

## Data

Dataset: [Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System](https://chronicdata.cdc.gov/Nutrition-Physical-Activity-and-Obesity/Nutrition-Physical-Activity-and-Obesity-Behavioral/hn4x-zwk7/about_data)

This dataset is hosted by the Centers for Disease Control and Prevention (CDC) and was obtained from the Behavioral Risk Factor Surveillance System, a CDC project consisting of health-related phone surveys. The original dataset consists of 104,000 rows and 33 columns. Descriptions of all columns are available at the link above. Each row corresponds to one combination of year, state, survey question, and stratification group, and reports the percentage of respondents in that group who meet the condition stated in the question. Data_Value contains that percentage. The stratification categories are Age Range, Education, Gender, Income, Race/Ethnicity, and Total. The dataset includes observations for the years 2011-2023. Values are not reported for groups with insufficient sample sizes.

The main columns of interest for our research question are:

-   YearStart and YearEnd: The start and end years of data collection. The two values are the same in every row.
-   LocationAbbr: The abbreviation of the state or territory where the data was collected.
-   Topic: The topic that the measured variable falls under. For our research question, we are interested in the topic "Physical Activity - Behavior".
-   Question: What is being measured. Within "Physical Activity - Behavior" there are five questions:
    -   Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week
    -   Percent of adults who engage in no leisure-time physical activity
    -   Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
    -   Percent of adults who engage in muscle-strengthening activities on 2 or more days a week
    -   Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
-   Data_Value: The value measured by the survey. For the questions above, it is a percentage.
-   StratificationCategory1: The variable the row is stratified by. Depending on the value in this column, the row has a value in the matching column ("Race/Ethnicity", "Age(years)", "Income", etc.). For our research question we are interested in the levels of Income, such as 'Less than \$15,000', '\$35,000 - \$49,999', etc.
-   Income: The income bracket for rows stratified by income.
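
To illustrate how the stratification columns described above fit together, here is a minimal sketch on a hypothetical four-row table. The column names come from the dataset, but the rows and the `Data_Value` numbers are placeholders invented for the example, not values from the CDC data.

```python
import pandas as pd

# Hypothetical toy table; Data_Value entries are placeholders, not real results.
toy = pd.DataFrame({
    "StratificationCategory1": ["Income", "Income", "Gender", "Total"],
    "Income": ["Less than $15,000", "$75,000 or greater", None, None],
    "Gender": [None, None, "Female", None],
    "Total": [None, None, None, "Total"],
    "Data_Value": [1.0, 2.0, 3.0, 4.0],
})

# Only rows stratified by income carry a value in the Income column.
income_rows = toy.loc[
    toy["StratificationCategory1"] == "Income", ["Income", "Data_Value"]
]
print(income_rows)
```

Filtering on StratificationCategory1 first means the other stratification columns are empty in the remaining rows and can be dropped safely.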


In [None]:
import pandas as pd
data = pd.read_csv('/data/Nutrition.csv')
data.head()

In [None]:
data.info()

### Data Cleaning and Wrangling, EDA

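The columns YearStart and YearEnd always contain the same values, so one of them can be dropped. The first cell below returns the rows where they differ (none are expected), and the second drops YearEnd.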


In [None]:
data.loc[data['YearStart'] != data['YearEnd']]

In [None]:
activity = data.drop(columns='YearEnd')

We are only interested in rows for questions related to physical activity, and within those, only in rows stratified by income.


In [None]:
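# keep only the physical activity questions
activity = activity.loc[activity['Topic'] == 'Physical Activity - Behavior']
# drop the stratification columns we do not use, plus GeoLocation
activity = activity.drop(columns=["Total", "Education", "Age(years)", "Gender", "Race/Ethnicity", "GeoLocation"])
# keep only rows stratified by income
activity = activity[activity['StratificationCategory1'] == "Income"]
activity.info()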

In [None]:
# looking for missing values
cols_with_nulls = []
for col in activity.columns:
    if activity[col].isna().sum() > 0:
        cols_with_nulls.append(col)

print("Columns with 1 or more missing values:")
for i in cols_with_nulls:
    print(i)
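
The same check can be written without an explicit loop. The sketch below runs on a small hypothetical frame (the columns `a`, `b`, and `c` are invented for illustration) and uses `DataFrame.isna().any()` to select the column labels that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" is complete, "b" and "c" each have one missing value.
toy = pd.DataFrame({"a": [1, 2], "b": [np.nan, 1.0], "c": ["x", None]})

# isna().any() yields one boolean per column; use it to index the column labels.
cols_with_nulls = toy.columns[toy.isna().any()].tolist()
print(cols_with_nulls)  # ['b', 'c']
```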

####  Data_Value_Unit

In [None]:
# unique values in Data_Value_Unit
print(activity['Data_Value_Unit'].unique())

The values in this column do not seem to correspond to its name, which may indicate a data entry error.

#### Removing nulls from Data_Value

In [None]:
# removing null values
activity = activity.dropna(subset=['Data_Value'])

#### Questions

In [None]:
# unique survey questions remaining after filtering
questions = activity['Question'].unique()
print(questions)

#### Encoding income as a numeric value
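Treating income as a numeric variable can be useful for regression. We map each income bracket to its lower bound in thousands of dollars and exclude rows where income was not reported.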

In [None]:
# maps each income range to its lower bound, in thousands of dollars
income_dict = {
    'Less than $15,000': 0,
    '$15,000 - $24,999': 15,
    '$25,000 - $34,999': 25,
    '$35,000 - $49,999': 35,
    '$50,000 - $74,999': 50,
    '$75,000 or greater': 75,
}

# removes rows where income was not reported; copy so we can add a column safely
activity_clean = activity.loc[activity['Income'] != 'Data not reported'].copy()

# creates numeric column of income based on mappings
activity_clean['numeric_income'] = activity_clean['Income'].map(income_dict)
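
As a rough sketch of why the numeric encoding helps, the example below applies the same mapping to a hypothetical toy table and fits a straight line with `numpy.polyfit`. The `Data_Value` numbers are synthetic and constructed to lie exactly on a line, so the fitted slope says nothing about the real CDC data; it only shows the mechanics. The sketch also checks that no income label was left unmapped, since `Series.map` returns NaN for keys missing from the dictionary.

```python
import numpy as np
import pandas as pd

income_dict = {
    'Less than $15,000': 0,
    '$15,000 - $24,999': 15,
    '$25,000 - $34,999': 25,
    '$35,000 - $49,999': 35,
    '$50,000 - $74,999': 50,
    '$75,000 or greater': 75,
}

# Synthetic placeholder values (10 + 0.1 * lower bound), NOT survey results.
toy = pd.DataFrame({
    "Income": list(income_dict) + ["Data not reported"],
    "Data_Value": [10.0, 11.5, 12.5, 13.5, 15.0, 17.5, 12.0],
})

# Same cleaning steps as above: drop unreported income, then encode.
toy = toy.loc[toy["Income"] != "Data not reported"].copy()
toy["numeric_income"] = toy["Income"].map(income_dict)

# map() leaves NaN for labels missing from the dictionary, so verify coverage.
assert toy["numeric_income"].notna().all()

# Least-squares line: Data_Value ~ slope * numeric_income + intercept
slope, intercept = np.polyfit(
    toy["numeric_income"].to_numpy(), toy["Data_Value"].to_numpy(), deg=1
)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With this encoding, the slope is read as the change in percentage points per additional \$1,000 of the bracket's lower bound. Because the brackets are unevenly spaced and the top bracket is open-ended, this is an approximation of income rather than a measurement of it.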