
01 The Basics

Serena Kim edited this page Feb 29, 2024 · 2 revisions

1.1 What is R?

R is a programming language and environment specifically designed for statistical computing and data analysis. R provides a wide range of statistical and mathematical functions, making it a powerful tool for tasks like data modeling, hypothesis testing, regression analysis, and more.

R is open-source software, which means it's freely available to anyone. This has contributed to its widespread adoption and the development of a large and active community of users and package developers.

R excels at data manipulation and visualization. It allows you to import, clean, and transform data easily. With packages like ggplot2, it offers highly customizable and publication-quality data visualization capabilities.

1.2 Getting Started with R

We will use a cloud version of R at https://posit.cloud/ during the class. If you already have R installed on your machine, feel free to use that instead.

Sign up for an account at the Posit Cloud (formerly RStudio Cloud) sign-up page:

Click the Sign Up button on the bottom-right to start with the free version. Note that a free account is limited to 25 projects.

Input your email, a password, as well as your first and last name.

Once you've signed up, open Posit Cloud for the first time and create a new project.

You should now see the R console in your cloud project.

1.3 Installing and using packages in R

Packages are the "units" of reproducible R code. People in the R community have created packages to keep track of the R functions that they write and reuse. Packages offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking your code, and sample data sets.

install.packages() downloads a package from a package repository (by default CRAN, the Comprehensive R Archive Network) and installs it on your machine. You only need to do this once per machine or cloud project.

library() will make the commands in the packages you downloaded available to you in the current R session. A new session starts each time R starts and continues until that instance of R is terminated.

install.packages("tidyverse")

Once the process is completed, you can load the tidyverse library with the library() function. To load the core tidyverse, type library(tidyverse) and press Enter or Return.
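Because installing happens once but loading happens in every session, it can help to check whether a package is already installed before deciding what to do. The snippet below is a minimal sketch of that check; the variable names pkg and installed are illustrative and not part of the original lesson.

```r
# Name of the package we want to use in this session.
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise.
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  # Already installed: attach it for the current session.
  library(pkg, character.only = TRUE)
} else {
  # Not installed yet: remind ourselves to install it once.
  message("Run install.packages(\"", pkg, "\") first, then library(", pkg, ").")
}
```

This sketch deliberately does not install anything on its own, so it is safe to run repeatedly.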

1.4 Importing Dataset to R

We will import (read) a dataset from the CDC's Social Vulnerability Index (SVI) into R.

You can download the dataset on this page.

Select Year == 2020, Geography == North Carolina, Geography Type == Counties, and File Type == CSV File (table data).

Select "Choose File" and click "OK".

Then select "Import".

You should now see a screen with all variables, observations, and values of the data.
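The point-and-click import can also be done in code with read.csv(). The sketch below writes a tiny stand-in CSV with made-up values to a temporary file so that it runs anywhere; the name svi_demo and the values are hypothetical. With the real download, you would pass the path of your own CSV file to read.csv() and assign the result to NorthCarolina_county.

```r
# Stand-in for the downloaded file: a tiny CSV with made-up values.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "COUNTY,EP_UNINSUR",
  "County A,10.5",
  "County B,14.0",
  "County C,12.5"
), csv_path)

# Read the CSV into a data frame (one row per county, one column per variable).
svi_demo <- read.csv(csv_path)

# Inspect what was imported: column types, then the first rows.
str(svi_demo)
head(svi_demo)
```

Checking str() right after an import is a quick way to confirm that numeric columns were read as numbers rather than text.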

1.5 The proportion of people who don't have health insurance

The variable EP_UNINSUR is the estimated percentage of people in each county who don't have health insurance (a proportion expressed on a 0 to 100 scale).

Let's find out the average proportion of people who don't have health insurance for counties in North Carolina.

average_ep_uninsur <- mean(NorthCarolina_county$EP_UNINSUR)
  • NorthCarolina_county$EP_UNINSUR selects the "EP_UNINSUR" column from your data frame.
  • mean() calculates the average of the selected column, and <- assigns the result to the variable average_ep_uninsur.

Now print the newly created value average_ep_uninsur:

average_ep_uninsur
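One thing to watch for: mean() returns NA if the column contains any missing values. The SVI documentation describes -999 as a "no data" code; assuming that code could appear in this column, a minimal sketch of handling it looks like this. The vector ep_uninsur_demo holds made-up values, not real SVI data.

```r
# Made-up values; -999 stands in for a "no data" code.
ep_uninsur_demo <- c(10.5, 14.0, -999, 12.5)

# Recode the sentinel value as a proper missing value.
ep_uninsur_demo[ep_uninsur_demo == -999] <- NA

# Without na.rm, a single NA makes the whole result NA.
mean_with_na <- mean(ep_uninsur_demo)

# na.rm = TRUE drops missing values before averaging.
mean_clean <- mean(ep_uninsur_demo, na.rm = TRUE)

mean_with_na
mean_clean
```

The same na.rm = TRUE argument works with max(), min(), and sum().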

Let's find out the highest EP_UNINSUR value:

max_ep_uninsur <- max(NorthCarolina_county$EP_UNINSUR)
max_ep_uninsur
  • max_ep_uninsur stores the maximum value in the "EP_UNINSUR" column using the max() function.

If you are wondering which county has the highest EP_UNINSUR value:

max_county <- NorthCarolina_county$COUNTY[which.max(NorthCarolina_county$EP_UNINSUR)]
max_county
  • which.max(NorthCarolina_county$EP_UNINSUR) finds the index (row position) of the maximum value in the "EP_UNINSUR" column.
  • NorthCarolina_county$COUNTY extracts the "COUNTY" column from the data frame.
  • [] subsets the "COUNTY" column with the index returned by which.max(), giving you the COUNTY that corresponds to the maximum "EP_UNINSUR" value.
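The lookup above can be tried step by step on a small example. This is a sketch on a made-up data frame (demo, with hypothetical counties and values), not the real SVI data. Note that which.max() returns only the first position when several rows tie for the maximum.

```r
# Made-up data frame with the same two column names as the SVI table.
demo <- data.frame(
  COUNTY = c("County A", "County B", "County C"),
  EP_UNINSUR = c(10.5, 14.0, 12.5),
  stringsAsFactors = FALSE
)

# Step 1: position of the largest value.
idx <- which.max(demo$EP_UNINSUR)

# Step 2: use that position to pick the matching county name.
top_county <- demo$COUNTY[idx]
top_county

# The same index can pull the whole row instead of just the name.
demo[idx, ]

# To get every county tied for the maximum, compare against max() directly.
tied <- demo$COUNTY[demo$EP_UNINSUR == max(demo$EP_UNINSUR)]
tied
```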

1.6 Histogram

hist(NorthCarolina_county$EP_UNINSUR)

You can use the hist() function to create a histogram.

hist() also accepts optional arguments that set the title, axis labels, fill color, border color, and so on:

hist(NorthCarolina_county$EP_UNINSUR, 
     main = "Histogram of EP_UNINSUR in North Carolina Counties",
     xlab = "EP_UNINSUR Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black"
)
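Another commonly used argument is breaks, which controls where the bins start and end. The sketch below uses a made-up vector (x_demo) and sets plot = FALSE so that hist() returns the bin counts instead of drawing, which makes it easy to see how values are grouped.

```r
# Made-up values standing in for a column such as EP_UNINSUR.
x_demo <- c(8, 9.5, 10, 10.5, 11, 12, 12.5, 13, 14, 18)

# Bins are (5, 10], (10, 15], and (15, 20]; by default each bin
# includes its right edge, so the value 10 falls in the first bin.
h <- hist(x_demo, breaks = seq(5, 20, by = 5), plot = FALSE)

h$breaks  # bin edges: 5 10 15 20
h$counts  # values per bin: 3 6 1
```

Dropping plot = FALSE draws the histogram with the same bins, and the styling arguments shown earlier (main, xlab, col, border) can be combined with breaks.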