01 The Basics
R is a programming language and environment specifically designed for statistical computing and data analysis. R provides a wide range of statistical and mathematical functions, making it a powerful tool for tasks like data modeling, hypothesis testing, regression analysis, and more.
R is open-source software, which means it's freely available to anyone. This has contributed to its widespread adoption and the development of a large and active community of users and package developers.
R excels at data manipulation and visualization. It allows you to import, clean, and transform data easily. With packages like ggplot2, it offers highly customizable, publication-quality data visualization capabilities.
We will use a cloud version of R on https://posit.cloud/ during the class. Feel free to use the R software if you have installed it on your machine already.
Sign up for an account at the RStudio Cloud sign-up page:
Click the Sign Up button on the bottom-right to start with the free version. Limitation of the free version: a free account can have at most 25 projects.
Input your email, a password, as well as your first and last name.
Once you've signed up, open RStudio Cloud for the first time and create a new project.
You should now see the R console in your cloud project.
Packages are the "units" of reproducible R code. People in the R community have created packages to keep track of the R functions that they write and reuse. Packages offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking your code, and sample data sets.
install.packages() downloads a package from a repository (CRAN by default) and installs it on your machine.
library() makes the functions in an installed package available to you in the current R session. A new session starts each time R starts and continues until that instance of R is terminated.
install.packages("tidyverse")
Once the process is completed, you can load the tidyverse with the library() function. To load the core tidyverse, type library(tidyverse) and press Enter or Return.
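Because install.packages() only needs to run once per machine while library() is needed in every session, it can help to install a package only when it is missing. The sketch below is one way to do that; ensure_package is a hypothetical helper name introduced here for illustration, not part of R or the tidyverse.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) when pkg is missing.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded after the check.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers a download.
ok <- ensure_package("stats")
ok
```

In class you would call ensure_package("tidyverse") and then library(tidyverse) as usual.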
We will import (read) a dataset from CDC's Social Vulnerability Index (SVI) to our R system.
You can download the dataset on this page.
Select Year == 2020, Geography == North Carolina, Geography Type == Counties, and File Type == CSV File (table data).
Select "Choose File" and click "OK".
And select "Import"
And you should see a screen with all variables, observations, and values of the data.
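The same import can also be done from code rather than through the dialog. The sketch below uses base R's read.csv(). So that it runs on its own, it writes a tiny stand-in CSV to a temporary file first; the two rows and their values are made up for illustration and are not real SVI data. With the real download, you would pass the path of your uploaded file instead.

```r
# Toy stand-in for the downloaded SVI file (made-up values, illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("COUNTY,EP_UNINSUR",
             "County A,10.5",
             "County B,14.2"),
           csv_path)

# read.csv() returns a data frame with one column per CSV field.
NorthCarolina_county <- read.csv(csv_path)

str(NorthCarolina_county)   # column names and types
head(NorthCarolina_county)  # first few rows
```

Naming the object NorthCarolina_county keeps it consistent with the commands in the rest of this section.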
The variable EP_UNINSUR indicates the estimated proportion of people who don't have health insurance.
Let's find out the average proportion of people who don't have health insurance for counties in North Carolina.
average_ep_uninsur <- mean(NorthCarolina_county$EP_UNINSUR)
- NorthCarolina_county$EP_UNINSUR selects the "EP_UNINSUR" column from your data frame.
- mean() calculates the average of the selected column, and <- assigns the result to the variable average_ep_uninsur.
Now print the newly created value average_ep_uninsur by typing its name:
average_ep_uninsur
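One thing to watch for when averaging SVI columns: the SVI files mark missing estimates with the placeholder -999, which would drag a mean far below its true value. The sketch below assumes that convention and uses a made-up vector rather than the real column. It replaces the placeholder with NA and tells mean() to skip missing values.

```r
# Toy values (not real SVI data); -999 stands in for a missing estimate.
ep_uninsur <- c(10, -999, 12, 8)

# The placeholder is treated as a real number and distorts the average.
mean_raw <- mean(ep_uninsur)

# Replace the placeholder with NA, then ignore NA values when averaging.
ep_clean <- replace(ep_uninsur, ep_uninsur == -999, NA)
mean_clean <- mean(ep_clean, na.rm = TRUE)

mean_raw
mean_clean
```

If the column has no -999 entries, the cleaned mean equals the plain mean, so the extra step is harmless.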
Let's find out the highest EP_UNINSUR value:
max_ep_uninsur <- max(NorthCarolina_county$EP_UNINSUR)
max_ep_uninsur
- max_ep_uninsur stores the maximum value in the "EP_UNINSUR" column using the max() function.
If you are wondering which county has the highest EP_UNINSUR value:
max_county <- NorthCarolina_county$COUNTY[which.max(NorthCarolina_county$EP_UNINSUR)]
max_county
- which.max(NorthCarolina_county$EP_UNINSUR) finds the index of the maximum value in the "EP_UNINSUR" column.
- NorthCarolina_county$COUNTY extracts the "COUNTY" column from the data frame.
- [] is used to subset the "COUNTY" column with the index found by which.max(), giving you the COUNTY that corresponds to the maximum "EP_UNINSUR" value.
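The steps above can be tried on a small made-up data frame. The county names and values here are placeholders for illustration, not SVI data. The sketch also shows that which.max() returns only the first position when several rows share the maximum. A logical comparison against max() returns all of them.

```r
# Toy data frame (made-up values); two rows tie for the maximum.
toy <- data.frame(
  COUNTY = c("County A", "County B", "County C"),
  EP_UNINSUR = c(10, 14, 14),
  stringsAsFactors = FALSE
)

# which.max() gives the position of the first maximum only.
first_max <- toy$COUNTY[which.max(toy$EP_UNINSUR)]

# Comparing every value against max() keeps all tied rows.
all_max <- toy$COUNTY[toy$EP_UNINSUR == max(toy$EP_UNINSUR)]

first_max
all_max
```

For a continuous estimate such as EP_UNINSUR, exact ties are unlikely, so which.max() is usually enough.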
1.6 Histogram
hist(NorthCarolina_county$EP_UNINSUR)
You can use the hist() function to create a histogram.
You can pass additional arguments to set the title, axis labels, colors, and so on:
hist(NorthCarolina_county$EP_UNINSUR,
main = "Histogram of EP_UNINSUR in North Carolina Counties",
xlab = "EP_UNINSUR Value",
ylab = "Frequency",
col = "lightblue",
border = "black"
)
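Another argument worth knowing is breaks, which controls where the bins start and end. The sketch below uses a made-up vector rather than the real column, and passes plot = FALSE so that hist() returns the bin counts instead of drawing. Drop plot = FALSE to see the chart.

```r
# Toy values (not real SVI data).
x <- c(6, 8, 11, 12, 13, 17, 22)

# Explicit bin edges every 5 units from 0 to 30.
# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, breaks = seq(0, 30, by = 5), plot = FALSE)

h$breaks  # bin edges
h$counts  # number of values falling in each bin
```

The counts always add up to the number of values, which is a quick check that the bin edges cover the whole range of the data.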