# Data Analysis and Visualization with R

## Workshop Summary and Contact Information

**Summary:** R is a free and powerful programming language that is commonly used by researchers in both qualitative and quantitative disciplines. R provides a nearly comprehensive and still expanding set of research and data analysis tools. This workshop explores the power of R for data analysis and visualization, with a focus on hands-on exercises. No programming experience is required, but a basic understanding of programming and statistics is beneficial.

**Contact:**   
Email: AskData@uc.edu  
Location: 240 Braunstein Hall (GMP Library)  
Research & Data Services Website: https://libraries.uc.edu/research-teaching-support/research-data-services.html  
GitHub: https://github.com/RAJohansen/UCL_Workshops  
Twitter: https://twitter.com/johansen_phd

### Section I: Brief Introduction to R

##### 1. R for basic calculation

In [1]:
sin(pi*15)/100

##### 2. R Objects & Assignment
R stores values and objects so they can be reused throughout a calculation or script.\
**HINT:** In RStudio, Alt + - (Alt and the minus key) is a keyboard shortcut for the assignment operator `<-`.

In [None]:
x <- 1 + 2
y <- x + 1
y

#### 3. Understanding functions & Getting Help in R
General recipe for functions:

In [2]:
# General pattern (pseudocode, not meant to be run as-is):
# function_name(argument_1 = value_1,
#               argument_2 = value_2)


Suppose we want to create a series of numbers from 1 to 100 in steps of 2. Luckily, many functions are already available in base R (and many more are available from packages, which we will discuss later).\
\
Since we are just learning R, I will tell you that the function is called `seq()`.\
The first thing I do when using a new function is look at its documentation. You can use `?` to open the R documentation for a function.

**HINT: Scroll to the bottom of the help page for workable examples.**

In [None]:
?seq()
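
Once the help page has been read, the series described above can be built directly. The sketch below is one way to do it with the `from`, `to`, and `by` arguments of `seq()`; the object name `odd_numbers` is only an illustrative choice.

```r
# Create the numbers from 1 to 100 in steps of 2
odd_numbers <- seq(from = 1, to = 100, by = 2)

head(odd_numbers)     # look at the first few values
length(odd_numbers)   # how many values were generated
```

Naming the arguments makes the call easier to read than `seq(1, 100, 2)`, although both forms give the same result.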

**HINT: If you can't remember the exact name of a function, type the first few letters and press Tab to see the possible completions.**

In [None]:
# Type "me" and then press the Tab key to list the objects and functions that start with "me"

Additionally, if you are not sure what the function is called, try a fuzzy search with `apropos()`, which lists every object whose name contains the text you give it.

In [None]:
apropos("mea") 

### Section II: Exploring the Tidyverse!

#### Load the tidyverse and gapminder packages

In [None]:
require("tidyverse")
require("gapminder")

#### Explore the Tidyverse 
https://www.tidyverse.org/

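R packages only have to be installed once, but they must be loaded in every new session.\
`require()` tries to load a package and returns `FALSE` (with a warning) if the package is not installed; it does not install anything. Use `install.packages()` once to install a package, then `library()` or `require()` to load it.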
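
If a script should check for its packages before loading them, one possible pattern is sketched below. The helper name `missing_packages()` is hypothetical (it is not part of R or the tidyverse), and the install line is left commented out so nothing is downloaded unless you choose to run it.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "gapminder")
to_install <- missing_packages(needed)
to_install   # empty if everything is already installed

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

`requireNamespace()` only checks whether a package can be loaded; it does not attach it, so `library()` or `require()` is still needed afterwards.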

#### 1. Basic Data Exploration
In this section we will use the gapminder data set
https://www.gapminder.org/

##### Let's assign this data to an object called "gapminder"

In [None]:
# Let's assign this data to an object called "gapminder"
gapminder <- gapminder

##### View our table

In [None]:
View(gapminder)

##### Lists the variables 

In [None]:
names(gapminder)

##### Let's examine the structure of the data
This becomes very useful when we visualize or analyze data, because we must make sure each variable is stored as the appropriate data type.

In [None]:
str(gapminder)

##### Statistical summary of the data

In [None]:
summary(gapminder)

#### 2. Exploring our data further
**HINT: Understanding how data is indexed is crucial for R programming. Data frames are indexed as [row, column].**

##### Let's look at column 2

In [None]:
gapminder[,2]

##### Let's look at row 5

In [None]:
gapminder[5,]

##### Selecting a single cell (row 5 column 3)

In [None]:
gapminder[5,3]

Based on this idea, we can make more complicated selections. Let's take the first ten observations and look at the variables country (1), continent (2), year (3), and pop (5).

In [None]:
gapminder[1:10,c(1:3, 5)]
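
Column positions shift whenever a data set gains or loses a variable, so selecting columns by name is often more robust than selecting by number. The sketch below uses the built-in `mtcars` data only so that it runs without any extra packages; the same pattern should apply to gapminder, for example `gapminder[1:10, c("country", "continent", "year", "pop")]`.

```r
# Rows 1 to 3, with two columns selected by name instead of position
first_rows <- mtcars[1:3, c("mpg", "cyl")]
first_rows

# A single column can be extracted as a plain vector with $ or [[ ]]
identical(mtcars$mpg, mtcars[["mpg"]])
```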

##### What if we want to know the highest gdpPercap?

In [None]:
max(gapminder$gdpPercap)

##### Let's find the row number of the observation with the highest gdpPercap
Then show all columns for that row

In [None]:
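max_row <- which.max(gapminder$gdpPercap)  # row index of the highest gdpPercap
max_row
gapminder[max_row,]  # use the stored index instead of typing the row number by hand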

#### The filter verb
The filter verb is used to look at a subset of a data set.\
Typically you combine filter with the pipe operator `%>%`, which passes the data on its left to the function on its right.

Use the filter verb to find the data for the US

In [None]:
gapminder %>% 
  filter(country == "United States")

##### Multiple conditions
Use filter to return the US for only the year 2007

In [None]:
gapminder %>% 
  filter(year == 2007, country == "United States")

#### The arrange verb 
Used for sorting data in ascending or descending order

##### Ascending Order
Use the arrange verb to sort the data in ascending order by GDP per capita

In [None]:
gapminder %>% 
  arrange(gdpPercap)

##### Descending order

In [None]:
gapminder %>% 
  arrange(desc(gdpPercap))

##### Combining verbs
Use filter and arrange to return the results for 2007 in ascending order by GDP per capita

In [None]:
gapminder %>% 
  filter(year == 2007) %>% 
  arrange(gdpPercap)

#### The mutate verb
Change or Add variables to a data set

##### Change a variable

In [None]:
gapminder %>% 
  mutate(pop = pop/1000000)

##### Add a new variable called gdp

In [None]:
gapminder %>% 
  mutate(gdp = gdpPercap * pop)

#### Combine all three verbs 

In [None]:
gapminder %>% 
  mutate(gdp = gdpPercap * pop) %>% 
  filter(year == 2007) %>% 
  arrange(desc(gdp))

#### The Summarize Verb 
##### Summarize entire data set

In [None]:
gapminder %>% 
  summarize(meanLifeExp = mean(lifeExp))

##### What if we want to return the mean life exp just for 2007

In [None]:
gapminder %>% 
  filter(year == 2007) %>% 
  summarize(meanLifeExp = mean(lifeExp))

##### Creating multiple Summaries

In [None]:
gapminder %>% 
  filter(year == 2007) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))

**HINT: What data type is pop? Use str(gapminder)**
##### Convert pop to a numeric data type instead of an integer

In [None]:
gapminder$pop <- as.numeric(gapminder$pop)

In [None]:
gapminder %>% 
  filter(year == 2007) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))
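
The conversion above matters because R integers have an upper limit (`.Machine$integer.max`, about 2.1 billion), and a sum of integers that exceeds it returns `NA` with a warning. The sketch below reproduces that behavior with two made-up values rather than the gapminder data; the object names are illustrative.

```r
# Two integers whose total exceeds the largest value an integer can hold
big <- c(.Machine$integer.max, 1L)

int_total <- suppressWarnings(sum(big))  # integer sum overflows and becomes NA
num_total <- sum(as.numeric(big))        # the same sum as double precision succeeds

int_total
num_total
```

World population totals are well above the integer limit, which is the likely reason `sum(pop)` fails until `pop` is stored as numeric.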

#### The group_by Verb 
The group_by verb is useful for creating aggregated groups, especially when combined with the summarize function

##### Summarize by each unique year

In [None]:
gapminder %>% 
  group_by(year) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))

##### Summarize data from 2007 by continent

In [None]:
gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))

##### What if we want to summarize by continent over all years?
**HINT: Simply add an additional argument to the group_by verb**

In [None]:
gapminder %>% 
  group_by(year, continent) %>% 
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(pop))
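
When summarizing over more than one grouping variable, recent versions of dplyr print a message noting that the result is still grouped. The sketch below assumes dplyr 1.0.0 or later, where `summarize()` accepts a `.groups` argument, and uses the built-in `mtcars` data with illustrative column names (`meanMpg`, `nCars`) so that it does not depend on gapminder.

```r
library(dplyr)

# Group by two variables, then drop the grouping from the result
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(meanMpg = mean(mpg),
            nCars = n(),
            .groups = "drop")
by_cyl_am
```

Dropping the groups returns an ordinary table, which avoids surprises if further verbs are added to the pipeline later.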

#### Section II Task
Answer the following questions using the mtcars dataset

In [None]:
mtcars <- mtcars
#View()
#str()
#names()

##### Find the median mpg & wt for each group of cylinders 

In [None]:
mtcars %>% 
  group_by() %>% 
  summarize( = median(),
             = median())

### Section III: Data Visualization

Useful resources for using base plot in R: \
https://www.harding.edu/fmccown/r/ \
https://www.statmethods.net/graphs/index.html

#### 1. Default Plot

In [None]:
plot(mtcars$mpg)

#### 2. Dotchart

In [None]:
dotchart(mtcars$mpg)

##### Adding details and labels to a Simple Dotplot

In [None]:
dotchart(mtcars$mpg,
         labels=row.names(mtcars),
         main="Gas Mileage for Car Models", 
         xlab="Miles Per Gallon")

#### 3. Histogram

In [None]:
hist(mtcars$mpg)

##### Add color and explore bin sizes

In [None]:
hist(mtcars$mpg, breaks=5, col="red")
hist(mtcars$mpg, breaks=10, col="red")
hist(mtcars$mpg, breaks=15, col="red")

#### 4. Kernel Density Plot
First you need to save the density of the data you want to an R object\
Then plot that object using plot()

In [None]:
d <- density(mtcars$mpg) # returns the density data 
plot(d) # plots the results

#### 5. Barplot

In [None]:
barplot(mtcars$cyl)

**HINT:** First create an object called "counts" that holds the number of cars in each cylinder group\
Then use the barplot() function on the object counts

In [None]:
counts <- table(mtcars$cyl)
barplot(counts)
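
It can help to inspect the table before plotting it, since `barplot()` draws one bar per entry. A small sketch follows; `shares` is an illustrative name, and `prop.table()` is used here only to show the same counts as proportions.

```r
counts <- table(mtcars$cyl)
counts          # number of cars for each distinct cylinder value
names(counts)   # the labels barplot() prints under each bar

# The same information expressed as a share of all cars
shares <- round(prop.table(counts), 2)
shares
```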

##### Add Chart Title and Axes

In [None]:
barplot(counts, 
        main="Car Distribution", 
        xlab="Number of Cylinders")

##### Converting a Bar chart into a Stacked Bar

In [None]:
counts <- table(mtcars$cyl, mtcars$gear)
barplot(counts,
        main="Car Distribution by Cylinders and Gears",
        xlab="Number of Gears",
        col = c("darkred","darkblue","orange"),
        legend = rownames(counts))

#### 6. Box Plots

In [None]:
boxplot(mtcars$mpg~mtcars$cyl)

In [None]:
boxplot(mpg~cyl,
        data=mtcars,
        main="Car Mileage Data", 
        xlab="Number of Cylinders",
        ylab="Miles Per Gallon")

#### 7. Pie Charts

In [None]:
slices <- table(mtcars$cyl)
lbls <- c("Four", "Six", "Eight")
pie(slices,
    labels = lbls,
    main="Pie Chart of mtcars Cylinders")

#### 8. Scatterplot
##### Simple Scatterplot

In [None]:
plot(mtcars$wt,mtcars$mpg)

In [None]:
plot(mtcars$wt, mtcars$mpg,
     main="Scatterplot Example", 
     xlab="Car Weight ",
     ylab="Miles Per Gallon ",
     pch=19)

##### Add linear regression line
The regression formula is written as (y ~ x): the response first, then the predictor

In [None]:
plot(mtcars$wt, mtcars$mpg,
     main="Scatterplot Example", 
     xlab="Car Weight ",
     ylab="Miles Per Gallon ",
     pch=19)
# Draw the fitted regression line on top of the existing plot
abline(lm(mpg ~ wt, data = mtcars), col="red")

#### 9. Line Graphs

In [None]:
lines <- c(1:2,4,7,5,8,10,7)
plot(lines)

In [None]:
plot(lines, type="o", col="blue")

In [None]:
plot(lines, type="o", col="blue",
     main="My Line Graph")

#### 10. Get Inspired!!! 
##### Use the Iris data set to make a scatterplot matrix

In [None]:
data("iris")
pairs(iris[1:4]) #only quantitative variables

##### Explore pastel theme in RColorBrewer (Might not work in Jupyter)

In [None]:
require("RColorBrewer")
display.brewer.pal(3,"Pastel1") #display colorpalette

##### Use the function below to modify the scatterplot matrix
Put histograms on the diagonal (from the "pairs" help page)

In [None]:
panel.hist  <- function(x,...)
{
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0,1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, ...)
}

# Create a fancy scatterplot matrix

pairs(iris[1:4],
      panel = panel.smooth,
      main = "Scatterplot Matrix for Iris Data Using pairs Function",
      diag.panel = panel.hist,
      pch = 16,
      col = brewer.pal(3, "Pastel1")[unclass(iris$Species)])

# jpeg('C:/temp/My_Awesome_Plot.jpg')
# Run your fancy scatter plot matrix code here!
#dev.off()
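
The commented lines above outline how a plot is saved: open a file device, run the plotting code, then close the device. The sketch below follows the same steps but swaps in `pdf()` for `jpeg()`, because `pdf()` is available in every R build while `jpeg()` depends on how R was compiled, and it writes to a temporary file instead of a fixed path. The file location and plot title are illustrative.

```r
out_file <- tempfile(fileext = ".pdf")  # illustrative output location

pdf(out_file)                           # open the file device
pairs(iris[1:4], main = "Iris scatterplot matrix")
dev.off()                               # close the device so the file is written

file.exists(out_file)
```

Forgetting `dev.off()` is a common reason for empty or unreadable output files, since the plot is only written out when the device closes.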


### Section IV: Data Analysis
#### 1. Basic Stats with R
Statistics are used to summarize data!\
We use stats because it is difficult to memorize and decipher raw numbers

##### **Example 1: Average daily car traffic for a week**

In [None]:
total <- sum(5,16,15,16,13,20,25)
days <- 7
total/days
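
The same average can be computed without counting the days by hand. A short sketch, storing the seven daily counts from the example in a vector called `traffic` (an illustrative name) and letting `mean()` and `length()` do the work:

```r
# Daily car counts for one week, taken from the example above
traffic <- c(5, 16, 15, 16, 13, 20, 25)

mean(traffic)                   # average computed in one step
sum(traffic) / length(traffic)  # the manual calculation, for comparison
```

Keeping the values in a vector also makes it possible to reuse them with other functions such as `median()` or `sd()`.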

##### Two basic types of Statistics
**Descriptive Stats:** Use data to describe the characteristics of a group\
**Inferential Stats:** Use data to make predictions or draw conclusions

#### 2. Calculating descriptive statistics 
One variable vs. the Entire Data set

In [None]:
summary(mtcars$mpg)

In [None]:
summary(mtcars)

Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum (the output is not labeled)\
**Hint:** These five numbers are the values a boxplot draws (when the data have no outliers)

In [None]:
fivenum(mtcars$mpg)
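
The link between the five-number summary and a boxplot can be checked directly. The sketch below compares `fivenum()` with the `stats` element returned by `boxplot.stats()`, which holds the whisker ends, hinges, and median that a boxplot would draw; the two agree for `mtcars$mpg` because that variable has no points beyond the whiskers.

```r
five <- fivenum(mtcars$mpg)              # min, lower hinge, median, upper hinge, max
box  <- boxplot.stats(mtcars$mpg)$stats  # values used to draw the boxplot

five
box
all.equal(five, box)
```

For a variable with outliers, the first and last values would differ, because the whiskers stop at the most extreme point that is not flagged as an outlier.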

##### Alternative Descriptive Stats using the psych package
vars, n, mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se

In [None]:
#install.packages("psych")
library(psych)
describe(mtcars)  #vars, n, mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se

##### Alternative Descriptive Stats using the pastecs package

In [None]:
#install.packages("pastecs")
library(pastecs)
?stat.desc()
stat.desc(mtcars)

#### 3. Analyzing data by groups
For this section we will use the iris dataset

In [None]:
data(iris)
View(iris)
mean(iris$Petal.Width) #mean of all observation's petal.width

##### Split the data file and repeat analysis using "aggregate"
Allowing for the comparison of means by group

In [None]:
aggregate(iris$Petal.Width ~ iris$Species, FUN = mean) # ~ means a function of...
means <- aggregate(iris$Petal.Width ~ iris$Species, FUN = mean)
plot(means)

**Hint:** The mean petal width differs clearly between species (a formal test would be needed to call the difference statistically significant)

##### Conducting multiple calculations at once
**Hint:** The results do not keep the column headers so you need to remember the order you wrote them

In [None]:
aggregate(cbind(iris$Petal.Width, iris$Petal.Length)~ iris$Species, FUN = mean)
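
The column headers are lost above because the formula refers to `iris$...` expressions. One way to keep them, sketched below, is to use bare column names together with the `data` argument of `aggregate()`; the object name `species_means` is illustrative.

```r
# Bare column names plus data = iris keep the original headers in the output
species_means <- aggregate(cbind(Petal.Width, Petal.Length) ~ Species,
                           data = iris,
                           FUN = mean)
species_means
names(species_means)
```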

#### 4. Calculating Correlations

##### Create a correlation matrix

In [None]:
mtcars <- mtcars
cor(mtcars)

##### Simplify the matrix to increase readability 
We can use the round() function to wrap the cor() function

In [None]:
round(cor(mtcars), 2)

##### Correlate One pair of variables at a time
Derives r, a hypothesis test, and a confidence interval\
Uses Pearson's product-moment correlation

In [None]:
cor.test(mtcars$mpg, mtcars$wt)

##### Graphical Check of bivariate regression

In [None]:
hist(mtcars$mpg)
hist(mtcars$wt)
plot(mtcars$wt, mtcars$mpg)
abline(lm(mpg ~ wt, data = mtcars))  # add the fitted line after the plot is drawn

#### 5. Creating a Linear regression model
**Correlation:** the strength of the association between two variables\
**Regression:** a function that can be used to predict the values of one variable from another

##### Create a LM for miles per gallon & weight from mtcars

In [None]:
reg1 <- lm(mpg~wt, data = mtcars)
reg1

In [None]:
summary(reg1)

A statistically significant slope suggests that wt is a useful predictor of mpg.\
The variable wt accounts for about 0.75, or 75%, of the variation in mpg (the R-squared value).
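
The numbers discussed above can also be pulled out of the model object rather than read off the printed summary. The sketch below refits the same model under the illustrative name `reg`, extracts R-squared, checks it against the squared correlation, and predicts mpg for a hypothetical car with `wt = 3` (in `mtcars`, weight is recorded in units of 1000 lbs).

```r
reg <- lm(mpg ~ wt, data = mtcars)

# Proportion of the variation in mpg explained by wt
r_squared <- summary(reg)$r.squared
r_squared

# In a one-predictor regression, R-squared equals the squared correlation
all.equal(r_squared, cor(mtcars$mpg, mtcars$wt)^2)

# Predicted mpg for a hypothetical car weighing 3000 lbs
predicted <- predict(reg, newdata = data.frame(wt = 3))
predicted
```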

#### 6. Calculate Multiple Regression

**Hint:** Saving models as an R object allows for the extraction of additional information from model

##### Use Six Predictors to model mpg

In [None]:
reg1 <- lm(mpg ~ cyl + disp + hp + wt + gear + carb, 
           data = mtcars)
reg1

##### Extract model details

In [None]:
summary(reg1)
anova(reg1)
coef(reg1)
confint(reg1) #Confidence intervals for coefficients
resid(reg1)
hist(residuals(reg1)) #histogram of the residuals