<a href="https://colab.research.google.com/github/cs432-websci-fall20/assignments/blob/master/432_Week_03_InfoVis_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 432/532 R Tutorial

These commands can also be run locally using RStudio.

R is popular and well documented, so a web search for almost any task will usually turn up a close example.

## R Basics

Using R as a calculator:

In [None]:
1 / 200 * 30
(59 + 73 + 2) / 3
sin(pi/2)

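**Important:** In this tutorial, variable assignment is done with `<-`. (`=` also assigns at the top level, but by convention it is reserved for naming arguments in function calls.)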

In [None]:
x <- 3*4
x
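As a small sketch of the two operators (the variable names `y`, `same`, and `m` are only for illustration), `=` behaves like `<-` at the top level but names an argument when it appears inside a function call:

```r
x <- 3 * 4                 # conventional assignment
y = 3 * 4                  # also assigns when used at the top level
same <- identical(x, y)    # TRUE: both hold 12

# Inside a call, `=` names an argument and does not touch the variable x
m <- mean(x = c(1, 2, 3))  # m is 2; x is still 12
```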

Vectors are important in R.  Indexing looks similar to Python's list notation, but R indices start at 1, both ends of a range are included in the result, and both ends must be written out.

In [None]:
b1<-c(1,2,3,4,5,6)
b1

In [None]:
b2<-1:10
b2
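b2[4]     # fourth element (indexing starts at 1)
b2[0:4]   # index 0 selects nothing, so this returns elements 1 through 4
b2[4:6]   # elements 4, 5, and 6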

In [None]:
# This fails with a syntax error: R has no open-ended ranges, so write b2[1:4]
b2[:4]
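The indexing rules can be sketched on a fresh vector (here called `v`, a name chosen only for this example). Negative indices and `head()`/`tail()` are common ways to get the effect of Python's open-ended slices:

```r
v <- 1:10

first      <- v[1]        # R is 1-based, so this is the first element
first_four <- v[1:4]      # both ends of the range are included
drop_first <- v[-1]       # a negative index drops that position
also_four  <- head(v, 4)  # first four elements without writing a range
last_three <- tail(v, 3)  # last three elements
```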

## Working with Data

`read.csv()`
 * reads data as table and converts it to data frame
 * can read local file or file on the web
 * specify separator with `sep=`
 * specify whether the first row contains column titles with `header=`
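To try these options without a file, `read.csv()` can also parse a string passed through the `text=` argument. This is only a sketch: the inline data and the names `raw`, `with_header`, and `no_header` are made up for illustration.

```r
# Hypothetical tab-separated data with a header row
raw <- "id\tval_A\n1\t10\n2\t20\n3\t30"

# header = TRUE: the first row becomes the column names
with_header <- read.csv(text = raw, sep = "\t", header = TRUE)

# header = FALSE: the first row is treated as data and columns are named V1, V2
no_header <- read.csv(text = raw, sep = "\t", header = FALSE)

str(with_header)
str(no_header)
```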

In [None]:
mydata1 <- read.csv("https://www.cs.odu.edu/~mweigle/courses/cs795/mklein-IntroR/lecture/input1.dat", sep="\t", header=F)
mydata1

You can also read in a data file that has column names (use `header=T`); this example uses `read.table()`, the more general function behind `read.csv()`. Columns are then addressable by `var$colname`.

In [None]:
mydata2<-read.table("https://www.cs.odu.edu/~mweigle/courses/cs795/mklein-IntroR/lecture/input2.dat",sep="\t",header=T)
mydata2

Accessing columns -- using names or using column number

`var[row,col]`

In [None]:
mydata2$id
mydata2$val_A
mydata2$val_B
mydata2[,2]

Accessing an individual cell (row 3 in column val_A) -- two different ways

In [None]:
mydata2$val_A[3]
mydata2[3,2]

Accessing a row (row 3)

In [None]:
mydata2[3,]
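The same access patterns can be tried on a small data frame built by hand. The values below are invented and only the column names mirror `mydata2`. The last line shows a logical row filter, which is the form used later to subset datasets for plotting.

```r
# Hypothetical data with the same column names as mydata2
df <- data.frame(id = 1:4,
                 val_A = c(10, 20, 30, 40),
                 val_B = c(5, 6, 7, 8))

cell_by_name   <- df$val_A[3]         # column by name, then element 3
cell_by_index  <- df[3, 2]            # row 3, column 2
cell_by_label  <- df[3, "val_A"]      # row 3, column selected by name
row_three      <- df[3, ]             # leaving the column blank keeps all columns
big_rows       <- df[df$val_A > 15, ] # keep only rows where the condition is TRUE
```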

What if you have missing data and want to perform some mathematical functions?  

Use `NA` in place of the missing data and pass the `na.rm=T` option to functions so that they skip the missing values.

In [None]:
d1<-c(1:3,rep(NA,4),8:10)
d1

In [None]:
mean(d1)

In [None]:
mean(d1,na.rm=T)
median(d1,na.rm=T)
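As a short sketch of what is going on (using a copy of the vector named `d`, a name chosen for this example), `is.na()` locates the missing values, and any arithmetic that touches an `NA` returns `NA` unless the missing values are removed first:

```r
d <- c(1:3, rep(NA, 4), 8:10)

n_missing  <- sum(is.na(d))           # is.na() gives TRUE/FALSE; sum() counts the TRUEs
plain_mean <- mean(d)                 # NA, because at least one value is missing
clean_mean <- mean(d, na.rm = TRUE)   # mean of 1, 2, 3, 8, 9, 10
observed   <- d[!is.na(d)]            # the non-missing values only
```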

## Plotting with ggplot2

First we load the ggplot2 library. 

Then we're going to use some of the datasets included with the ggplot2 library.  You can see the list of these with the `data(package="ggplot2")` command. For each dataset, use `?datasetName` to get more information about the dataset.

In [None]:
library(ggplot2)
theme_set(theme_bw())  # selects a black and white theme
library(scales)        # allows us to format axes labels with commas
options(scipen=999)    # prevent using scientific notation

In [None]:
data(package="ggplot2")

In [None]:
?midwest

In [None]:
data("midwest", package = "ggplot2")

In [None]:
head(midwest, 6)

### Scatterplot

Here's a basic scatterplot, showing the percentage of college educated (mapped to the y-axis) vs. the total population (mapped to the x-axis) in each county in Ohio (`state=="OH"`).

Notice the notation used to subset the dataset inside the `ggplot()` function.  `midwest$state` refers to the `state` column in the midwest dataset.

In [None]:
gg <- ggplot(midwest[midwest$state=="OH",], aes(x=poptotal, y=percollege)) + 
  geom_point()
plot(gg)

Because we saved the basic chart in a variable, we can reuse it and add options.  We add `scale_x_continuous(label=comma)` so that the numbers are comma-formatted and specify the chart labels.

In [None]:
gg + 
  scale_x_continuous(label=comma) + 
  labs(y="% College Educated", 
       x="Population", 
       caption = "Source: midwest")

### Bar Chart

For our bar chart, let's look at the total population in each state.  We can sum `poptotal` in each county.

In [None]:
state_pop <- aggregate(midwest$poptotal, by=list(midwest$state), FUN=sum)
state_pop

Then we'll change the column labels to something reasonable.

In [None]:
colnames(state_pop) <- c("state", "poptotal") 
state_pop
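One possible alternative is the formula interface to `aggregate()`, which keeps the original column names so no renaming step is needed. The sketch below uses a tiny invented `counties` data frame in place of `midwest`; the numbers are not real populations.

```r
# Hypothetical county-level data (values are made up)
counties <- data.frame(state = c("OH", "OH", "IL", "IL", "WI"),
                       poptotal = c(100, 200, 300, 50, 70))

# "poptotal ~ state" reads as: summarize poptotal, grouped by state
by_state <- aggregate(poptotal ~ state, data = counties, FUN = sum)
by_state
```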

We use `geom_bar()` to create a bar chart.  We want the chart to directly show the values in the table, so we use `stat="identity"`.  

We can also specify the width of the bars and a single color for all the bars. Note that this is not mapping an attribute to color, but coloring all bars regardless of value, since `fill` is set outside of the `aes()` function.

In [None]:
ggplot(state_pop, aes(x=state, y=poptotal)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") + 
  scale_y_continuous(label=comma) + 
  labs(y="Total Population", caption="Source: midwest")

Next, let's sort this by population in descending order.

In [None]:
ordered <- state_pop[order(-state_pop$poptotal),]
ordered$state <- factor(ordered$state, levels = ordered$state)  # to retain the order in plot
ordered
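The `factor()` step matters because a plot places discrete values in the order of their factor levels, and by default those levels are sorted alphabetically. A minimal sketch with invented numbers (the data frame `pop` and its values are hypothetical):

```r
# Hypothetical totals, chosen so that descending order differs from alphabetical
pop <- data.frame(state = c("IL", "OH", "WI"),
                  poptotal = c(350, 300, 500))

default_levels <- levels(factor(pop$state))   # alphabetical: IL, OH, WI

sorted <- pop[order(-pop$poptotal), ]         # rows in descending order of poptotal
sorted$state <- factor(sorted$state, levels = sorted$state)
sorted_levels <- levels(sorted$state)         # now follows the sorted rows: WI, IL, OH
```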

In [None]:
ggplot(ordered, aes(x=state, y=poptotal)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") + 
  scale_y_continuous(label=comma) + 
  labs(y="Total Population", caption="Source: midwest")

And then turn it sideways for a horizontal bar chart by just switching the x and y axes.

In [None]:
ggplot(ordered, aes(y=state, x=poptotal)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") + 
  scale_x_continuous(label=comma) + 
  labs(x="Total Population", caption="Source: midwest")

### Line Chart

For a line chart, we need an ordered (but not necessarily quantitative) value for the x-axis.  Usually this is something like time.  So we need to load a different dataset.  The economics dataset is a time series dataset with various economic indicators from 1967-2015.

In [None]:
?economics

In [None]:
head(economics)

We use `geom_line()` to create the line chart.

In [None]:
gg <- ggplot(economics, aes(x=date)) + 
  geom_line(aes(y=unemploy)) + 
  scale_y_continuous(label=comma) + 
  labs(y="Number unemployed (thousands)",
    caption="Source: Economics")
plot(gg)

We can add points to the line chart just by adding a `geom_point()`.

In [None]:
gg + geom_point(aes(y=unemploy))

### Scatterplot Matrix

The simplest way to plot a scatterplot matrix is with the standard R function `pairs()` (not a part of ggplot2).  Instead of plotting the data in the diagonals, it lists the attribute name.

The example here goes back to the midwest dataset, selects only columns 4:6, and sets the point mark to a filled dot (19).

In [None]:
pairs(midwest[,4:6], pch=19)

### Histogram

For the histogram, we show the distribution of population per county.  Note that we're limiting this to counties that have fewer than 1 M people, because a few very large counties (in particular Cook County, IL, which includes Chicago and has > 5 M people) would otherwise skew the histogram.

To create the histogram, we use the `geom_histogram()` function.  `binwidth` sets the size of each histogram bin (this one is set to 10,000).

In [None]:
ggplot(midwest[midwest$poptotal < 1000000,], aes(poptotal)) + 
  geom_histogram(binwidth=10000) +
  scale_x_continuous(label=comma) +
  labs(x = "Population per County", caption="source: midwest")
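To see what binning with a width of 10,000 does to the data, here is a rough sketch in base R using `cut()` and `table()` on a few invented county populations. The bin edges here start at 0, which is an assumption for this example; `geom_histogram()` may place its edges differently by default.

```r
# Hypothetical county populations
pops <- c(2500, 9800, 12000, 18500, 27000, 31000, 55000)

# Assign each value to a 10,000-wide bin: (0, 10000], (10000, 20000], ...
bins <- cut(pops, breaks = seq(0, 60000, by = 10000), dig.lab = 6)

# Count how many values fall in each bin; these counts are the bar heights
counts <- table(bins)
counts
```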

### Boxplot

We're again looking at the total population by county in the midwest (and again, only for counties with < 1M people).  This time, we'll use boxplots (`geom_boxplot()`) and create a separate boxplot for each state.  

In [None]:
ggplot(midwest[midwest$poptotal < 1000000,], aes(x=state, y=poptotal)) + 
  scale_y_continuous(label=comma) +
  geom_boxplot()
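As a reminder of what each box summarizes, the quartiles and the usual 1.5 x IQR outlier rule can be computed directly in base R. This is a sketch on invented values; `pops` and the other names are hypothetical.

```r
# Hypothetical values with one very large observation
pops <- c(10, 20, 30, 40, 50, 60, 70, 80, 1000)

q <- quantile(pops, c(0.25, 0.5, 0.75))   # lower quartile, median, upper quartile
iqr <- q[3] - q[1]                        # height of the box
upper_fence <- q[3] + 1.5 * iqr           # values above this are drawn as outlier points

outliers <- boxplot.stats(pops)$out       # base R helper that applies the same rule
```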

### Empirical CDF (ECDF)

For the empirical CDF (ECDF), we just have to take the histogram code and replace `geom_histogram()` with `stat_ecdf()`. 

In [None]:
ggplot(midwest[midwest$poptotal < 1000000,], aes(poptotal)) + 
  stat_ecdf() +
  scale_x_continuous(label=comma) +
  labs(x = "Population per County", y = "Cumulative Distribution", caption="source: midwest")
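The quantity on the y-axis is the fraction of observations less than or equal to each x value. Base R's `ecdf()` returns that function directly, which can help when reading values off the plot. A small sketch with invented data (`pops` and `Fn` are names chosen for this example):

```r
# Hypothetical values
pops <- c(5, 10, 10, 20, 55)

Fn <- ecdf(pops)   # Fn is itself a function of x

below_min <- Fn(4)    # 0: no values are <= 4
at_ten    <- Fn(10)   # 3 of 5 values are <= 10
at_twenty <- Fn(20)   # 4 of 5 values are <= 20
at_max    <- Fn(55)   # 1: all values are <= 55
```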

## Output R Datasets to CSV

So that we can use the same data in our Python examples, let's output the `midwest` and `economics` datasets to CSV files that we can load into our Python notebook.

In [None]:
write.csv(midwest,"midwest.csv", row.names = TRUE)

In [None]:
write.csv(economics,"economics.csv", row.names = TRUE)
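Note that `row.names = TRUE` writes the row numbers as an extra, unnamed first column. The sketch below, using a small invented data frame and temporary files, shows the difference when the file is read back; whether to keep that column is a judgment call that depends on how the CSV will be used.

```r
# Hypothetical data frame standing in for a real dataset
df <- data.frame(state = c("IL", "OH"), poptotal = c(350, 300))

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(df, with_rows, row.names = TRUE)      # adds a leading row-number column
write.csv(df, without_rows, row.names = FALSE)  # writes only the data columns

names_with    <- names(read.csv(with_rows))     # "X", "state", "poptotal"
names_without <- names(read.csv(without_rows))  # "state", "poptotal"
```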

Now we can refresh our file list and download these files locally (or save them to Google Drive) so that we can load them in our Python notebook later.