Skip to content

02 Data Management

Serena Kim edited this page Mar 3, 2024 · 3 revisions

Read data

In this tutorial, we will use

  1. CDC/ATSDR's SVI Data: Raw Data
  2. US Census' "Presence and Types of Internet Subscriptions in Household Data (B28002)": Raw Data

Import the csv files:

Subset datasets

From the SVI dataset, we can select only the variables we will use. To do this, we need the dplyr library:

install.packages("dplyr")
library(dplyr)
nc_svi <- NorthCarolina_county %>% select(FIPS, EP_POV150, EP_UNEMP, EP_AGE65, EP_NOHSDP)
  • %>% (pipe operator): This operator is used for chaining operations together in a sequence. It takes the data from the left-hand side and passes it as the first argument to the function on the right-hand side.
  • select: This function is part of the dplyr package(library), and it's used to select specific columns from a dataframe.

Now, you can delete the original dataframe:

rm(NorthCarolina_county)

We will do the same for the US Census' internet data:

nc_internet <- internet_data %>% select("Geography", "Geographic Area Name", "Estimate!!Total:" , "Estimate!!Total:!!With an Internet subscription", "Estimate!!Total:!!With an Internet subscription!!Broadband of any type")

Now we can remove the original dataframe:

rm(internet_data)

Creating a new variable by extracting substrings from strings

To successfully merge the datasets, it's important to ensure that the dataframes have a common variable name. FIPS, which stands for Federal Information Processing Standards, is a set of standards developed by the United States federal government for encoding various data elements, including geographic codes, to ensure consistency and accuracy in data processing and exchange. While CDC data has the FIPS column, the census data has the Geography column instead. This Geography column's last five characters represent the FIPS code, which corresponds to the CDC dataset. Therefore, we will create a new variable named FIPS in the US Census Dataset:

nc_internet$FIPS <- substr(nc_internet$Geography, 
                             nchar(nc_internet$Geography)-4, 
                             nchar(nc_internet$Geography))

Merging two datasets

As they both are county-level datasets in North Carolina, we don't have to worry about choosing the merging methods (e.g., inner, outer, left, right). Simply merge two dataframe by using:

merged_data <- merge(nc_svi, nc_internet, by = "FIPS")

Creating a new column based on the values of other columns

We want to compare the proportion of households with broadband and internet access across the counties. Therefore, let's create two new columns.

To calculate the % of households with broadband access:

merged_data$ep_broadband = merged_data$`Estimate!!Total:!!With an Internet subscription!!Broadband of any type` / merged_data$`Estimate!!Total:` * 100

To calculate the % of households with any types of internet access:

merged_data$ep_internet = merged_data$`Estimate!!Total:!!With an Internet subscription` / merged_data$`Estimate!!Total:` * 100

Removing the columns that are not needed

merged_data <- merged_data %>% select(-`Geography`, -`Estimate!!Total:!!With an Internet subscription!!Broadband of any type`, -`Estimate!!Total:`, -`Estimate!!Total:!!With an Internet subscription` )

Creating a scatter plot: Poverty against broadband access

plot(merged_data$EP_POV150, merged_data$ep_broadband,
     xlab = "population living below 150% of the poverty line", ylab = "households with broadband")

To fit a linear line:

lm_fit <- lm(ep_broadband ~ EP_POV150, data = merged_data)
abline(lm_fit, col = "red")

Now, it's your turn. Create a scatter plot of your choice.

Rename a column (field name)

If you want to call pr_broadband proportion_broadband now, you can use;

library(dplyr)
merged_data<- merged_data %>% rename(proportion_broadband = pr_broadband)