## Useful R codeblocks discovered through UQ DATA7001 - Intro to Data Science
### David Ainscough

### For new installs of R/RStudio/etc - get the tidyverse package. It includes:
- ggplot2, for data visualisation.
- dplyr, for data manipulation.
- tidyr, for data tidying.
- readr, for data import.
- purrr, for functional programming.
- tibble, for tibbles, a modern re-imagining of data frames.
- stringr, for strings.
- forcats, for factors.

In [1]:
# install.packages("tidyverse")

### Per session/notebook - initialise needed libraries

In [2]:
# Either
library(readr)
library(dplyr)

# Or
# library(tidyverse)
# which loads the two above plus the rest of the core tidyverse, so it is slightly more wasteful.

"package 'readr' was built under R version 3.4.4"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



### Import some CSVs to work with

In almost every case, it's worth doing the import through RStudio's GUI file import feature first: check that the columns are mapped correctly and manually exclude any columns that aren't needed. You can then copy the R code that RStudio generates into the notebook so the import is easily repeatable. If you need to code it manually, some samples are:

In [6]:
# Either import the dataset as-is
# data_frame_1 <- read_csv("Directory/filename.csv")

# Or import excluding some columns (if known up-front)
# data_frame_1 <- read_csv("Directory/filename.csv", col_types = cols(col_to_exclude = col_skip(), second_col_to_exclude = col_skip()))
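A minimal sketch of the column-skipping import, assuming readr is installed. The file contents and column names (`stop_id`, `stop_name`, `notes`) are made up for illustration and written to a temporary file so the example runs on its own.

```r
library(readr)

# Hypothetical toy data, written to a temporary file so nothing real is needed
csv_path <- tempfile(fileext = ".csv")
writeLines(c("stop_id,stop_name,notes",
             "1,Central,a",
             "2,Roma Street,b"), csv_path)

# Skip the 'notes' column at import time; the other column types are guessed as usual
stops <- read_csv(csv_path, col_types = cols(notes = col_skip()))

names(stops)
```

Skipping at import time means the unwanted column never takes up memory, which can matter for large files.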

To join different CSVs together on a common reference (like stop_ID for GTFS data, SA2Code for ABS data, etc):

In [15]:
# For an inner join (only keep rows that are represented in both dataframes)
# joined_df <- inner_join(data_frame_1, data_frame_2, by = "stop_id")

# merge() from base R also does an inner join by default; add all = TRUE to keep all rows (ie. a full outer join)
# joined_df <- merge(data_frame_1, data_frame_2, by = "SA2Code", all = TRUE)
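To check how `merge()` treats unmatched rows, here is a small base R sketch on two made-up data frames (the `population` and `median_age` columns are hypothetical). It compares the default behaviour with `all = TRUE` and `all.x = TRUE`.

```r
# Two toy data frames that share only SA2Code 2 and 3
left  <- data.frame(SA2Code = c(1, 2, 3), population = c(100, 200, 300))
right <- data.frame(SA2Code = c(2, 3, 4), median_age = c(30, 40, 50))

inner_rows <- merge(left, right, by = "SA2Code")               # default: matching rows only
full_rows  <- merge(left, right, by = "SA2Code", all = TRUE)   # full outer join
left_rows  <- merge(left, right, by = "SA2Code", all.x = TRUE) # keep every row of 'left'

nrow(inner_rows)
nrow(full_rows)
nrow(left_rows)
```

Rows without a match on the other side get `NA` in the columns that come from that side, so it's worth checking for NAs after an outer join.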

Depending on the dataset/s, you'll probably want to do some aggregation to get the data to a level where it's going to be actually useful.

In [16]:
# result_df <- data_frame_1 %>% group_by(var1, var2, var3) %>% summarise(var_to_count = sum(var_to_count))
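A worked version of the aggregation pattern on toy data, assuming dplyr is installed. The `trips` data frame and its columns are invented for the example.

```r
library(dplyr)

# Hypothetical boardings data: several rows per suburb/mode combination
trips <- data.frame(
  suburb    = c("A", "A", "B", "B", "B"),
  mode      = c("bus", "bus", "bus", "train", "train"),
  boardings = c(10, 5, 7, 3, 2)
)

# One row per suburb/mode, with boardings summed within each group
totals <- trips %>%
  group_by(suburb, mode) %>%
  summarise(boardings = sum(boardings)) %>%
  ungroup()

totals
```

The `ungroup()` at the end is optional, but it avoids surprises if you keep piping the result into other dplyr verbs.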

In [17]:
# Remove rows with missing values
# data_frame_1 <- data_frame_1[complete.cases(data_frame_1), ]

In [18]:
# add a new column based on existing columns - can do whatever with as many variables as you need
# data_frame_1$new_column_name <- with(data_frame_1, var1 + var2)

In [19]:
# write out data frame as a CSV to do more stuff with (usually import into Tableau for VEDA, plotting, whatever)
# write.csv(data_frame_1,"filename.csv", row.names = FALSE)

In [20]:
# subset the data based on matching values
# new_df <- subset(data_frame_1, column_name == "value_to_match") 

In [21]:
# Remove columns
# data_frame_1$column_name <- NULL

In [24]:
# Do something to all rows that match a variable - in this example, set a different variable to 5
# data_frame_1$variable_to_change[data_frame_1$variable_to_match == "True"] <- 5

In [25]:
# Change the name of a column
# colnames(data_frame_1)[colnames(data_frame_1) == "current_name"] <- "new_name"

In [26]:
# To turn off R's tendency to represent very small and very large numbers in scientific notation
# options(scipen=999)

# To restore to defaults
# options(scipen=0)
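A quick way to see what `scipen` changes, using `format()` on a very large and a very small number. This sketch saves the previous setting and restores it afterwards instead of assuming the default.

```r
# Start from the default penalty and remember whatever was set before
old <- options(scipen = 0)
sci_big   <- format(1e10)    # scientific notation
sci_small <- format(0.00001)

# A large penalty makes R prefer fixed notation
options(scipen = 999)
fixed_big   <- format(1e10)
fixed_small <- format(0.00001)

# Put the original setting back
options(old)

c(sci_big, fixed_big, sci_small, fixed_small)
```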

In [27]:
# Another way of removing rows with NAs
# data_frame_1 <- na.omit(data_frame_1)
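Both NA-removal approaches drop a row if any column is missing. The sketch below, on a made-up data frame, shows that they agree, and shows one way to require only a single column to be present when that is all you care about.

```r
# Toy data: row 2 is missing 'x', row 3 is missing 'y'
df <- data.frame(id = 1:4,
                 x  = c(1, NA, 3, 4),
                 y  = c("a", "b", NA, "d"))

# Both of these drop any row with an NA in any column (rows 2 and 3 here)
via_complete_cases <- df[complete.cases(df), ]
via_na_omit        <- na.omit(df)

# Only require 'x' to be present, so row 3 (NA in 'y' only) is kept
x_present <- df[!is.na(df$x), ]

via_complete_cases$id
x_present$id
```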

In [28]:
# If we only need unique entries in a dataframe - unique() compares whole rows, not just the first variable
# data_frame_1 <- unique(data_frame_1)
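A small check of what `unique()` does on a data frame, using made-up data. It also shows `duplicated()` on a single column, which is one way to keep only the first row per value of that column.

```r
# Toy data: rows 1 and 2 are identical; row 3 differs only in 'value'
df <- data.frame(suburb = c("A", "A", "A", "B"),
                 value  = c(1, 1, 2, 1))

# Whole-row comparison: only the exact duplicate (row 2) is dropped
unique_rows <- unique(df)

# Uniqueness on one column: keep the first row seen for each suburb
first_per_suburb <- df[!duplicated(df$suburb), ]

nrow(unique_rows)
nrow(first_per_suburb)
```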

In [29]:
# To order the dataframe by a specific variable (ascending)
# data_frame_1 <- data_frame_1[order(data_frame_1$variable_to_order_on),]
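The same `order()` idiom also handles descending order and ordering on more than one variable. A sketch on invented data:

```r
df <- data.frame(suburb = c("B", "A", "C", "A"),
                 value  = c(2, 5, 2, 1))

# Descending on a single variable
by_value_desc <- df[order(df$value, decreasing = TRUE), ]

# Ascending on suburb, then descending on value within each suburb
# (negating works because 'value' is numeric)
by_suburb_then_value <- df[order(df$suburb, -df$value), ]

by_value_desc$value
by_suburb_then_value
```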

In [30]:
# K-means clustering code example
# Choose which variables to use in clustering (also keep a unique identifier column, which is not used in the clustering itself)
# data_cluster <- data_frame_1[,c("var_1", "var_2", "var_3", "var_4", "var_5")]

# Actually perform the clustering
# kmeans starts from randomly chosen centres, so set a seed first if you want repeatable results
# set.seed(123)
# 2:5 selects which columns to cluster on (in this example, column 1 was the suburb name, so I clustered on cols 2,3,4,5)
# 5 is the number of clusters to return - experiment with different numbers to see which gives best results
# cluster_results <- kmeans(data_cluster[,2:5], 5)

# Assign the relevant cluster number to each row in the dataframe
# data_cluster$cluster <- as.factor(cluster_results$cluster)

# Example on how to examine each cluster
# indiv_cluster <- subset(data_cluster, cluster == "1")
# head(indiv_cluster)
# summary(indiv_cluster)

# Do a scatterplot matrix to visualise the clustering
# First line is just setting which colours to use - make sure you have as many colours as you have clusters!
# my_cols <- c("#00AFBB", "#E7B800", "#FC4E07","#204dcf", "#19e02c" )  
# pairs(data_cluster[,2:5], col = my_cols[data_cluster$cluster])
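One common way to "experiment with different numbers" of clusters is the elbow heuristic: run `kmeans()` for a range of k and compare the total within-cluster sum of squares. The sketch below is only an illustration, not part of the course material. It uses R's built-in `iris` data as a stand-in for your own data frame, scales the variables first (an assumption that suits variables on different units), and picks `nstart = 10` and the seed arbitrarily.

```r
# Fix the seed because kmeans starts from random centres
set.seed(42)

# Stand-in data: the four numeric columns of the built-in iris dataset, standardised
data_scaled <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(data_scaled, centers = k, nstart = 10)$tot.withinss
})

data.frame(k = ks, total_within_ss = wss)

# Optional: plot the curve and look for the "elbow" where it flattens out
# plot(ks, wss, type = "b", xlab = "Number of clusters", ylab = "Total within-cluster SS")
```

The point where adding another cluster stops reducing the sum of squares by much is a reasonable starting value for k, though it is a judgment call and not a definitive answer.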