## Lecture 2 
(Refer to the [main lecture notes](https://pages.github.ubc.ca/MDS-2022-23/DSCI_523_r-prog_students/README.html) for details.)

There are 3 key themes to this lecture:

1. Key data types in R

2. Using base R to subset objects

3. Tidyverse code style

First, let's load the packages we need:

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


*Note: if you have to install an R package that exists on CRAN, the command is: `install.packages("PACKAGE_NAME")`.*

In [None]:
?pivot_longer

The cell above opens the help page for `pivot_longer`; pressing Tab while typing a call also lists its arguments. For example:

In [2]:
table4a |>
  pivot_longer(cols = `1999`:`2000`, names_to = "year", values_to = "value_shere")

country,year,value_shere
<chr>,<chr>,<int>
Afghanistan,1999,745
Afghanistan,2000,2666
Brazil,1999,37737
Brazil,2000,80488
China,1999,212258
China,2000,213766


And then let's limit the output of data frames in Jupyter to 10 lines:

In [3]:
options(repr.matrix.max.rows = 10)

In [5]:
library(gapminder)
gapminder <- gapminder |>
  mutate(tot_gdp = pop * gdpPercap) |>
  arrange(desc(tot_gdp))

In [6]:
head(gapminder)

country,continent,year,lifeExp,pop,gdpPercap,tot_gdp
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>,<dbl>
United States,Americas,2007,78.242,301139947,42951.653,12934460000000.0
United States,Americas,2002,77.31,287675526,39097.1,11247280000000.0
United States,Americas,1997,76.81,272911760,35767.433,9761353000000.0
United States,Americas,1992,76.09,256894189,32003.932,8221624000000.0
United States,Americas,1987,75.02,242803533,29884.35,7256026000000.0
China,Asia,2007,72.961,1318683096,4959.115,6539501000000.0


## 1. Clicker

## 1. Key data types in R

The simplest object in R is a vector of length 1; this is the closest thing R has to a scalar (see the key data types image):

In [7]:
my_string <- "hello"

In [8]:
my_string

In [9]:
typeof(my_string)

In [11]:
my_number <- 2
my_number
typeof(my_number)

In [14]:
# An integer vector: the L suffix makes the literal an integer instead of a double
simple_vector <- c(3L)

You can learn about R objects using `typeof` and `class`:

In [15]:
typeof(simple_vector)

In [16]:
class(simple_vector)
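
As a small illustration of how the two functions differ, the sketch below compares them on a double and an integer. The variable names `a_double` and `an_integer` are made up for this example and are not part of the lecture code.

```r
# Hypothetical example values, not part of the lecture data.
a_double <- 2
an_integer <- 2L

typeof(a_double)    # "double": how R stores the value internally
class(a_double)     # "numeric": the class used for method dispatch
typeof(an_integer)  # "integer"
class(an_integer)   # "integer"
```

For a plain number without the `L` suffix, `typeof` and `class` give different answers, which is why it helps to check both.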

Vectors must be of homogeneous type; if you mix types when creating one, R coerces the elements to a common type for you. (Note that R indexes from 1.)

In [17]:
a_vector <- c("word", 1, TRUE)
a_vector

In [18]:
mixed_vec <- c(55, TRUE, 1L, NULL)

In [19]:
typeof(mixed_vec)

In [20]:
class(a_vector)

In [21]:
class(a_vector[1])

In [22]:
class(a_vector[2])

In [23]:
class(a_vector[3])
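
The coercion rule can be checked directly. The sketch below uses only base R and suggests that mixing types moves the whole vector to the most flexible type present (logical, then integer, then double, then character), and that `NULL` contributes no element at all.

```r
typeof(c(TRUE, 1L))             # "integer"
typeof(c(TRUE, 1L, 2.5))        # "double"
typeof(c(TRUE, 1L, 2.5, "a"))   # "character"

# Every element of a_vector is converted to a string.
a_vector <- c("word", 1, TRUE)
a_vector                        # "word" "1" "TRUE"

# NULL is dropped, so the result has length 3 and type double.
mixed_vec <- c(55, TRUE, 1L, NULL)
length(mixed_vec)               # 3
typeof(mixed_vec)               # "double"
```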

## 2. Clicker

Lists are also collections of elements, but are a slightly more sophisticated object. They can contain elements of heterogeneous types:

In [26]:
a_vec <- c("word", 1, TRUE)
a_vec

In [27]:
a_list <- list("word", 1, TRUE)
a_list

In [28]:
a_list[[3]]

In [29]:
class(a_list)

In [30]:
class(a_list[[1]])

In [31]:
class(a_list[[2]])

In [32]:
class(a_list[[3]])
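
To see the contrast with atomic vectors in one place, here is a minimal sketch in base R. The nested list `a_nested_list` is a made-up example, added only to show that a list element can itself be a list.

```r
a_list <- list("word", 1, TRUE)

# Each list element keeps its own class.
sapply(a_list, class)          # "character" "numeric" "logical"

# The same values in an atomic vector are all coerced to character.
class(c("word", 1, TRUE))      # "character"

# Hypothetical nested list: the second element is itself a list.
a_nested_list <- list(1, list("a", TRUE))
class(a_nested_list[[2]])      # "list"
```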

Data frames are special kinds of lists with the following requirements:
- the elements must be vectors (or list columns, more on that later); we call these columns
- the elements (i.e., columns) must have names
- the elements (i.e., columns) must be of the same length

In [33]:
a_dataframe <- data.frame(
  words = c("word", "another word"),
  numbers = c(1, 2),
  logicals = c(TRUE, FALSE)
)

a_dataframe

words,numbers,logicals
<chr>,<dbl>,<lgl>
word,1,TRUE
another word,2,FALSE


In [34]:
typeof(a_dataframe)

In [35]:
class(a_dataframe)
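
The "special kind of list" idea can be checked with a few base R calls. This is a minimal sketch; the columns `x` and `y` in the failing call are invented for illustration. Note that `data.frame()` recycles a shorter column when its length divides the longer one evenly, so lengths 2 and 3 are used here to trigger the error.

```r
a_dataframe <- data.frame(
  words = c("word", "another word"),
  numbers = c(1, 2),
  logicals = c(TRUE, FALSE)
)

is.list(a_dataframe)   # TRUE: a data frame is built on a list
length(a_dataframe)    # 3, the number of columns
names(a_dataframe)     # "words" "numbers" "logicals"

# Columns whose lengths cannot be reconciled are rejected;
# tryCatch() returns the error message instead of stopping.
length_error <- tryCatch(
  data.frame(x = 1:2, y = 1:3),
  error = function(e) conditionMessage(e)
)
length_error
```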

Tibbles come from the `tidyverse` and are a special flavour of data frame that offers:
- the ability to have grouped rows
- more predictable behaviour when subsetting a single column
- a nicer print method in RStudio

We will mostly work with tibbles in R in MDS.

In [36]:
a_tibble <- tibble(
  words = c("word", "another word"),
  numbers = c(1, 2),
  logicals = c(TRUE, FALSE)
)

a_tibble

words,numbers,logicals
<chr>,<dbl>,<lgl>
word,1,TRUE
another word,2,FALSE


In [37]:
typeof(a_tibble)

In [38]:
class(a_tibble)
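
A short sketch of what these two calls report, assuming the `tibble` package is installed (it is attached by `library(tidyverse)`): a tibble is still a list underneath and still a data frame, with extra classes layered on top.

```r
library(tibble)

a_tibble <- tibble(
  words = c("word", "another word"),
  numbers = c(1, 2),
  logicals = c(TRUE, FALSE)
)

typeof(a_tibble)          # "list"
class(a_tibble)           # "tbl_df" "tbl" "data.frame"
is.data.frame(a_tibble)   # TRUE: functions expecting a data frame accept a tibble
```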

## 2. Using base R to subset objects

There are 3 operators that can be used when subsetting data frames (and lists, as data frames are a special kind of list): 

- `[`
- `$` 
- `[[`

`[` usually returns an object of the same type as the one you subset (here, a tibble in and a tibble out):

In [39]:
a_tibble[ , 2]

numbers
<dbl>
1
2


The exception is a single column selected from a base R data frame, which `[` simplifies to a vector:

In [40]:
a_dataframe[ , 2]
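
If you want a base R data frame to keep its structure in this case, one option is the `drop = FALSE` argument of `[`. The sketch below rebuilds `a_dataframe` so it runs on its own.

```r
a_dataframe <- data.frame(
  words = c("word", "another word"),
  numbers = c(1, 2),
  logicals = c(TRUE, FALSE)
)

class(a_dataframe[, 2])                 # "numeric": simplified to a vector
class(a_dataframe[, 2, drop = FALSE])   # "data.frame": structure kept
class(a_dataframe[, 2:3])               # "data.frame": two columns, nothing to simplify
class(a_dataframe[2])                   # "data.frame": list-style subsetting never simplifies
```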

`$` returns the named element with a layer of structure removed, so a column comes back as a plain vector:

In [41]:
a_tibble$numbers

Note, the element must have a name for this to work!

`[[` returns the element with a layer of structure removed:

In [42]:
a_tibble[[2]]
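
The three operators can be compared side by side on a plain list. This is a minimal base R sketch; `a_named_list` and `an_unnamed_list` are names invented for the example.

```r
# Hypothetical named list for illustration.
a_named_list <- list(words = "word", numbers = 1, logicals = TRUE)

class(a_named_list["numbers"])     # "list": `[` keeps the list wrapper
class(a_named_list[["numbers"]])   # "numeric": `[[` removes one layer
identical(a_named_list$numbers, a_named_list[["numbers"]])   # TRUE

# Without names, `$` has nothing to match and returns NULL,
# while `[[` still works by position.
an_unnamed_list <- list("word", 1, TRUE)
is.null(an_unnamed_list$numbers)   # TRUE
an_unnamed_list[[2]]               # 1
```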

## 3. Tidyverse code style

Let's fix some of the style issues together:

In [None]:
library(scatterplot3d)

# Generates toy dataset to play around with clustering 

##Function to create toy 3 dimensional dataset


make_toy_data <- function(means, stdevs, n) {
  ##Returns a dataframe containing a 3D toy dataset 
  ##
  ##Arguments: 
  ##
  ##  means  <- a dataframe of 3 columns giving the means for each dimension of each cluster. Each row is a cluster.
  ##  stdevs <- a dataframe of 3 columns giving the standard deviations for each dimension of each cluster. Each row is a cluster.
  ##  n <- a value specifying the number of datapoints for each cluster
  
  ##get how many clusters are wanted
  cluster_n <- dim(means)[1]
  
  ##make an empty dataframe to hold the toy data
  dim1 <- rep(0, n * cluster_n)
  dim2 <- rep(0, n * cluster_n)
  dim3 <- rep(0, n * cluster_n)
  
  ##for each cluster, generate toy data and save them in dim1
  previous_n  <- 0
  for (i in 1:cluster_n) {
    temp <- rnorm(n, means[i, 1], stdevs[i, 1])
    dim1[(previous_n + 1):((previous_n) + n)]  <- temp
    previous_n  <- n + previous_n
  }
  
  ##for each cluster, generate toy data and save them in dim2
  previous_n  <- 0
  for (i in 1:cluster_n) {
    temp <- rnorm(n, means[i, 2], stdevs[i, 2])
    dim2[(previous_n + 1):((previous_n) + n)]  <- temp
    previous_n  <- n + previous_n
  }
  
  ##for each cluster, generate toy data and save them in dim3
  previous_n  <- 0
  for (i in 1:cluster_n) {
    temp <- rnorm(n, means[i, 3], stdevs[i, 3])
    dim3[(previous_n + 1):((previous_n) + n)]  <- temp
    previous_n  <- n + previous_n
  }
  
  data.frame(dim1, dim2, dim3)

}


##define means for toy dataset clusters
dim1_mean <- c(2, 50, 10)
dim2_mean <- c(8, 6, 13)
dim3_mean <- c(2, 2.5, 40)
cluster_means <- data.frame(dim1_mean, dim2_mean, dim3_mean)

##define standard deviations for toy dataset clusters
dim1_stdev <- c(5, 1, 1)
dim2_stdev <- c(5, 1, 1)
dim3_stdev <- c(5, 1, 1)
cluster_stdevs <- data.frame(dim1_stdev, dim2_stdev, dim3_stdev)

##define number of samples wanted for each cluster
cluster_n <- 200

##make toy dataset using parameters specified above
my_toy_df <- make_toy_data(cluster_means, cluster_stdevs, cluster_n)

##save toy dataset to a .csv using comma as a delimiter
write.table(my_toy_df, 
            "toy_data.csv", 
            sep = ",", 
            row.names = FALSE, 
            col.names = FALSE, 
            quote = FALSE, 
            append = FALSE)

# Change the data frame with training data to a matrix
#not scaled and centred but as a matrix
my_toy_df_matrix <- as.matrix(my_toy_df)
##plot unclustered data
scatterplot3d(my_toy_df_matrix)

par(mfrow = c(1,3))
plot(my_toy_df_matrix[ , 1], my_toy_df_matrix[ , 2])
plot(my_toy_df_matrix[ , 2], my_toy_df_matrix[ , 3])
plot(my_toy_df_matrix[ , 3], my_toy_df_matrix[ , 1])
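
For comparison, here is one possible restyled version of the data-generating function. It is a sketch, not an official solution: the helper `simulate_dim` is a name invented here, and replacing the three near-identical loops with a helper is a design choice that goes slightly beyond pure formatting. The styling choices (single `#` comments, one space around `<-`, `seq_len()` instead of `1:n`, one argument per line in long calls) are meant to follow the tidyverse style guide.

```r
# Simulate a 3D toy data set made of normally distributed clusters.
#
# means, stdevs: data frames with 3 columns (one per dimension);
#   each row describes one cluster.
# n: number of points to draw for each cluster.
#
# Returns a data frame with columns dim1, dim2, dim3 and
# n * nrow(means) rows.
make_toy_data <- function(means, stdevs, n) {
  cluster_n <- nrow(means)

  # Hypothetical helper: draw n values per cluster for one dimension
  # and concatenate them cluster by cluster.
  simulate_dim <- function(dim) {
    unlist(lapply(seq_len(cluster_n), function(i) {
      rnorm(n, mean = means[i, dim], sd = stdevs[i, dim])
    }))
  }

  data.frame(
    dim1 = simulate_dim(1),
    dim2 = simulate_dim(2),
    dim3 = simulate_dim(3)
  )
}

# Same cluster parameters as in the cell above.
cluster_means <- data.frame(
  dim1_mean = c(2, 50, 10),
  dim2_mean = c(8, 6, 13),
  dim3_mean = c(2, 2.5, 40)
)
cluster_stdevs <- data.frame(
  dim1_stdev = c(5, 1, 1),
  dim2_stdev = c(5, 1, 1),
  dim3_stdev = c(5, 1, 1)
)

my_toy_df <- make_toy_data(cluster_means, cluster_stdevs, n = 200)
dim(my_toy_df)   # 600 3
```

Pulling the repeated loop into a helper means a change to the sampling logic only has to be made once.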