-
Notifications
You must be signed in to change notification settings - Fork 0
02 Data Management
In this tutorial, we will use
- CDC/ATSDR's SVI Data: Raw Data
- US Census' "Presence and Types of Internet Subscriptions in Household Data (B28002)": Raw Data
Import the csv files:
From the SVI dataset, we can select only the variables we will use. To do this, we need the dplyr
library:
install.packages("dplyr")
library(dplyr)
nc_svi <- NorthCarolina_county %>% select(FIPS, EP_POV150, EP_UNEMP, EP_AGE65, EP_NOHSDP)
-
%>%
(pipe operator): This operator is used for chaining operations together in a sequence. It takes the data from the left-hand side and passes it as the first argument to the function on the right-hand side. -
select
: This function is part of thedplyr
package(library), and it's used to select specific columns from a dataframe.
Now, you can delete the original dataframe:
rm(NorthCarolina_county)
We will do the same for the US Census' internet data:
nc_internet <- internet_data %>% select("Geography", "Geographic Area Name", "Estimate!!Total:" , "Estimate!!Total:!!With an Internet subscription", "Estimate!!Total:!!With an Internet subscription!!Broadband of any type")
Now we can remove the original dataframe:
rm(internet_data)
To successfully merge the datasets, it's important to ensure that the dataframes have a common variable name. FIPS
, which stands for Federal Information Processing Standards, is a set of standards developed by the United States federal government for encoding various data elements, including geographic codes, to ensure consistency and accuracy in data processing and exchange. While CDC data has the FIPS
column, the census data has the Geography column instead. This Geography column's last five characters represent the FIPS
code, which corresponds to the CDC dataset. Therefore, we will create a new variable named FIPS
in the US Census Dataset:
nc_internet$FIPS <- substr(nc_internet$Geography,
nchar(nc_internet$Geography)-4,
nchar(nc_internet$Geography))
As they both are county-level datasets in North Carolina, we don't have to worry about choosing the merging methods (e.g., inner, outer, left, right). Simply merge two dataframe by using:
merged_data <- merge(nc_svi, nc_internet, by = "FIPS")
We want to compare the proportion of households with broadband and internet access across the counties. Therefore, let's create two new columns.
To calculate the % of households with broadband access:
merged_data$ep_broadband = merged_data$`Estimate!!Total:!!With an Internet subscription!!Broadband of any type` / merged_data$`Estimate!!Total:` * 100
To calculate the % of households with any types of internet access:
merged_data$ep_internet = merged_data$`Estimate!!Total:!!With an Internet subscription` / merged_data$`Estimate!!Total:` * 100
merged_data <- merged_data %>% select(-`Geography`, -`Estimate!!Total:!!With an Internet subscription!!Broadband of any type`, -`Estimate!!Total:`, -`Estimate!!Total:!!With an Internet subscription` )
plot(merged_data$EP_POV150, merged_data$ep_broadband,
xlab = "population living below 150% of the poverty line", ylab = "households with broadband")
To fit a linear line:
lm_fit <- lm(ep_broadband ~ EP_POV150, data = merged_data)
abline(lm_fit, col = "red")
Now, it's your turn. Create a scatter plot of your choice.
If you want to call pr_broadband
proportion_broadband
now, you can use;
library(dplyr)
merged_data<- merged_data %>% rename(proportion_broadband = pr_broadband)