# Open Source Statistical Analysis with R

## Goals

R is an open source programming language for statistical analysis. In this session, we’ll show you how to get started with R: how to load data into R and how to plot simple graphs. We’ll also introduce you to RStudio, a free integrated development environment (IDE) for R. You won’t leave as an R expert, but you’ll learn enough to get started on your data analysis journey.

## Prerequisites

```
install.packages("ggplot2")
library(ggplot2)

install.packages("RCurl")
library(RCurl)
```

## R is a Programming Language

• R is a programming language, not a 'point-and-click' statistical application
• RStudio provides an integrated development environment (IDE) for R, making R more user-friendly
• People use R in lots of different ways
  • from evaluating simple statistical functions in a REPL (an interactive read-eval-print loop)
  • to developing interactive web applications with Shiny
• The combination of R & RStudio makes it possible to become productive by learning a few functions and then to develop expertise over time as necessary
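
As a minimal sketch of the REPL end of that spectrum, the lines below can be typed one at a time at the R console (or in the RStudio Console pane). They use the built-in `women` dataset that also appears in the exercises; the variable name `heights` is only illustrative.

```r
# Pull one column out of the built-in 'women' data frame
heights <- women$height

# Evaluate a few simple statistical functions
mean(heights)     # arithmetic mean
median(heights)   # middle value
sd(heights)       # standard deviation
summary(heights)  # minimum, quartiles, mean, and maximum
```

Each line prints its result immediately, which is what makes the console convenient for quick exploration.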

## R Exercises

### Average Heights and Weights for American Women

This practice dataset of the average heights and weights for American women (ages 30-39) comes built in with the R programming language.

```
# Load the ggplot2 graphing library
library(ggplot2)

# Assign the dataset to a variable
averages <- women

# Explore the dataset
str(averages)
View(averages)

# Plot the dataset
ggplot(averages, aes(x=height, y=weight)) + geom_point()

# Plot the dataset with a trend line (linear regression)
ggplot(averages, aes(x=height, y=weight)) + geom_point() + stat_smooth(method = "lm")
```
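
The trend line drawn by `stat_smooth(method = "lm")` is an ordinary least-squares fit. A hedged sketch of how to inspect the same kind of model directly with base R's `lm()` follows; the names `fit` and `predicted` are illustrative and not part of the original exercise.

```r
# Fit weight as a linear function of height on the built-in dataset
fit <- lm(weight ~ height, data = women)

# Intercept and slope of the fitted line
coef(fit)

# Full model summary, including R-squared and p-values
summary(fit)

# Predicted weight for a height of 65 inches
predicted <- predict(fit, newdata = data.frame(height = 65))
predicted
```

The slope reported by `coef(fit)` is the change in average weight per additional inch of height, which is the steepness of the line in the plot.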

### Lower Secondary School Age Population in the USA

This dataset from the United Nations on Quandl contains the population (all genders) of lower secondary school age children, roughly middle school age, in the United States.

```
# Load the required libraries
library(RCurl)
library(ggplot2)

# Load the dataset directly from Quandl & read CSV into data.frame
csv <- getURL("https://www.quandl.com/api/v1/datasets/UN/UIS_LOWERSECONDARYSCHOOLAGEPOPULATION__ALLGENDERS_USA.csv")
kids <- read.csv(text = csv)

# Explore the dataset
str(kids)
View(kids)

# Plot the dataset
ggplot(kids, aes(x=Year, y=Number)) + geom_point()

# Plot the dataset with cleaner x axis and title
ggplot(kids, aes(x=Year, y=Number)) + geom_point() + theme(axis.text.x = element_text(angle = 90)) + ggtitle("Lower Secondary School Age Population")
```
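
Because `getURL()` returns the whole response as a single character string, `read.csv()` needs its `text` argument rather than a file name to parse it. A minimal sketch with a small inline string standing in for the download (the values and the names `csv_text` and `sample_kids` are made up for illustration and are not the UN figures):

```r
# A stand-in for the string that getURL() would return (hypothetical values)
csv_text <- "Year,Number\n2001-12-31,100\n2002-12-31,110\n2003-12-31,125"

# Parse the CSV text into a data.frame
sample_kids <- read.csv(text = csv_text)

# Inspect the result: one row per CSV line, one column per field
str(sample_kids)
nrow(sample_kids)
```

Trying the parsing step on a short string like this is a quick way to check column names and types before plotting.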

### New Private Housing Units Authorized By Building Permit for Tennessee

This dataset from the Federal Reserve on Quandl contains data on new private housing units authorized by building permit for Tennessee.

```
# Load required libraries
library(ggplot2)
library(RCurl)

# Get dataset directly from Quandl
csv <- getURL("https://www.quandl.com/api/v1/datasets/FRED/TNBPPRIVSA.csv")
permits <- read.csv(text = csv)

# Explore dataset
View(permits)
str(permits)

# Make a simple scatter plot
ggplot(permits, aes(x=Date, y=Value)) + geom_point()

# Edit the dates in the dataset using the strptime function
# Thanks to http://stackoverflow.com/questions/20967445/plotting-historical-data-with-missing-values/20969623#20969623
permits$Year <- strptime(as.character(permits$Date), "%Y-%m-%d")
permits$Year <- format(permits$Year, "%Y")

# Make another simple scatter plot
ggplot(permits, aes(x=Year, y=Value)) + geom_point()

# Switch to a boxplot
ggplot(permits, aes(x=Year, y=Value)) + geom_boxplot() + ggtitle("New Private Housing Units Authorized By Building Permit for Tennessee")
```
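
The `strptime()` and `format()` steps in this exercise parse each date string and then keep only the four-digit year, which is what lets the boxplot group values by year. A small sketch on made-up dates follows (the vector `dates` is hypothetical); `as.Date()` is shown as an alternative, on the assumption that the strings carry no time of day.

```r
# Hypothetical date strings in the same "%Y-%m-%d" layout
dates <- c("2013-11-01", "2013-12-01", "2014-01-01")

# Parse the strings, then format the result as a year
parsed <- strptime(dates, "%Y-%m-%d")
years <- format(parsed, "%Y")
years

# Alternative: as.Date() handles date-only strings directly
years_alt <- format(as.Date(dates, format = "%Y-%m-%d"), "%Y")
identical(years, years_alt)
```

Either way the result is a character vector of years, so ggplot2 treats it as a discrete variable and draws one box per year.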

### ARL Library Investment Index

This dataset from the Association of Research Libraries (ARL) contains key information about academic library budgets and staffing. An Excel (XLS) file is available here, but we will be working with a converted CSV file on your desktop.

```
# Load required libraries
library(ggplot2)
library(scales)

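# Load dataset from CSV
# (assumption: file.choose() opens a dialog so you can pick the converted CSV on your desktop)
arl <- read.csv(file.choose())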

# Explore dataset
str(arl)
View(arl)

# Remove columns we do not want for our analysis
arl <- arl[,-c(1,2,3,4,5)]
View(arl)

# Remove row of extraneous data
arl <- arl[-116,]
View(arl)

# Changes names of columns for easier access
names(arl)[c(1:5)] <- c("Institution", "Total", "Salaries", "Material", "Staff")
View(arl)

# Create a simple scatter plot
ggplot(arl, aes(x=Staff, y=Salaries)) + geom_point()

# Convert wages from factor (discrete variable) to numeric (continuous variable)
wages <- arl$Salaries
wages <- unlist(wages)
wages <- gsub(",","",wages)
wages <- as.numeric(wages)
arl$Wages <- wages

# Create a simple scatter plot (with trend line)
ggplot(arl, aes(x=Staff, y=Wages)) + geom_point()
ggplot(arl, aes(x=Staff, y=Wages)) + geom_point() + stat_smooth(method="lm")

# Make the y-axis labels less cluttered
ggplot(arl, aes(x=Staff, y=Wages)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(labels = comma)

# Add a title
ggplot(arl, aes(x=Staff, y=Wages)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(labels = comma) + ggtitle("ARL Salaries")

# Highlight Vanderbilt on the plot
# Thanks to http://stackoverflow.com/questions/14351608/color-one-point-and-add-an-annotation-in-ggplot2/14351684#14351684

# First, create a subset of the data with only Vanderbilt
Vandy <- subset(arl, Institution == "VANDERBILT")
View(Vandy)

# Then, create a scatter plot with a highlighted point for Vanderbilt
ggplot(arl, aes(x=Staff, y=Wages)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(labels = comma) + ggtitle("ARL Salaries") + xlab("All Staff") + ylab("Professional Salaries") + geom_point(data=Vandy, colour="red")
```
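
The conversion step in this exercise works because `gsub()` strips the thousands separators before `as.numeric()` is applied; without that step the text values cannot be read as numbers. A minimal sketch on made-up figures (the vector `raw` is hypothetical and not taken from the ARL file):

```r
# Hypothetical salary figures stored as text with thousands separators
raw <- factor(c("1,250,000", "980,500", "2,003,750"))

# Converting the text directly yields NA (R also issues a warning)
suppressWarnings(as.numeric(as.character(raw)))

# Remove the commas first, then convert
clean <- as.numeric(gsub(",", "", raw))
clean
```

Once the column is numeric, ggplot2 places it on a continuous axis, which is what allows the trend line and the `comma` label formatting to work.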

### Next Steps with R
