## Lecture 2 
(You should refer to [main lecture notes] for details)

There are 3 key themes to this lecture:

1. Key data types in R

2. Using base R to subset objects

3. Tidyverse code style

NOTE: This course is meant to be done in a flipped classroom style. This means that you should watch the pre-recorded lecture videos before coming to class. Class time will be spent going through some highlights of the lecture and working on group problem solving using worksheets. I aim to give you around 30 minutes to work on the worksheets; during this time, the TAs and I will be available to help with any questions you have.

# iClicker 1: 

How do you feel about the R programming language?

A) I LOVE R, it's my favorite language!

B) I think it's pretty good!

C) It's just OK.

D) Not a fan, it perplexes me!

## 1. Key data types in R

The simplest object in R is a vector of length 1; this is the closest thing R has to a scalar (see the key data types image).

Here we create an object named `my_string`:

In [4]:
my_string <- "hello"

In [5]:
my_string

In [3]:
typeof(my_string)

# iClicker 2: What is the data type of `my_number`?

```R
my_number <- 42
```

A) character

B) numeric

C) integer

D) double



In [8]:
my_number <- 2L
my_number
typeof(my_number)

In [10]:
# Here is an integer type; the L suffix tells R to store the number as an integer
simple_vector <- c(3L)

You can learn about R objects using `typeof` and `class`:

In [11]:
typeof(simple_vector)

In [12]:
class(simple_vector)
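
As a small side-by-side sketch of the difference (the object names `int_value` and `dbl_value` are made up for this example): `typeof` reports how R stores the object, while `class` reports how R treats it.

```R
# typeof() reports the internal storage type; class() reports how R treats the object
int_value <- 3L    # the L suffix requests an integer
dbl_value <- 3     # a bare number literal is stored as a double

typeof(int_value)  # "integer"
class(int_value)   # "integer"

typeof(dbl_value)  # "double"
class(dbl_value)   # "numeric"

typeof("hello")    # "character"
typeof(TRUE)       # "logical"
```

Note that for doubles the two functions disagree: the storage type is `"double"`, but the class is `"numeric"`.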

# iClicker: What does `a_vector` contain after running the following code?

```R
a_vector <- c("word", 1, TRUE)
a_vector[2] <- 2
```

A) "word" 2 TRUE

B) "word" 2 1

C) "word" 1 2

D) "word" 1 TRUE


> Note: Under R's coercion rules, every element of this vector is converted to character, so none of the values keeps its original type.
> The result is `"word" "2" "TRUE"`.

In [1]:
a_vector <- c("word", 1, TRUE)
a_vector[2] <- 2

In [15]:
a_vector

Vectors must be of a homogeneous type; if you don't ensure that yourself when creating them, R coerces the elements for you. (Also note that R counts from 1.)

# Coercion

Hierarchy for coercion (a mixed vector is coerced to the leftmost type that it contains):

> character → double → integer → logical
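
To see the hierarchy in action, here is a minimal sketch that mixes two neighbouring types at a time (the names `lgl_int`, `int_dbl`, and `dbl_chr` are made up for this example):

```R
# Each mixed vector is coerced to the most flexible type it contains
lgl_int <- c(TRUE, 1L)     # logical + integer
int_dbl <- c(1L, 2.5)      # integer + double
dbl_chr <- c(2.5, "word")  # double + character

typeof(lgl_int)  # "integer"
typeof(int_dbl)  # "double"
typeof(dbl_chr)  # "character"

lgl_int          # TRUE has become the integer 1
dbl_chr          # 2.5 has become the string "2.5"
```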

In [17]:
a_vector <- c("word", 1, TRUE)
a_vector

In [18]:
mixed_vec <- c(55, TRUE, 1L, NULL)

In [19]:
typeof(mixed_vec)

In [20]:
class(a_vector)

In [21]:
class(a_vector[1])

In [22]:
class(a_vector[2])

In [23]:
class(a_vector[3])

## iClicker 2: 
What type is the following mixed vector coerced to? 

***`mixed_vec <- c(55, FALSE, 1L, NA, "NA")`***

A) integer

B) double

C) character

D) logical

In [21]:
mixed_vec <- c(55, FALSE, 1L, NA, NA, "hello")
mixed_vec
typeof(mixed_vec)

In [25]:
length("hello")

In [24]:
typeof(c("hello", "world"))

In [20]:
typeof(NaN)
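
The quiz vectors above contain `NA`, `"NA"`, and `NULL`, which are easy to mix up. A short sketch of how each behaves, using only base R:

```R
typeof(NA)          # "logical": a bare NA is logical, the least flexible type
typeof(NaN)         # "double"
typeof(c(1L, NA))   # "integer": NA adapts to the other elements
typeof(c(1.5, NA))  # "double"
typeof(c("a", NA))  # "character"

# The string "NA" is ordinary text, not a missing value
is.na(c(NA, "NA"))  # TRUE FALSE

# NULL has length 0, so c() drops it
length(c(55, TRUE, 1L, NULL))  # 3
```

Because a bare `NA` sits at the bottom of the hierarchy, it never changes the type a vector is coerced to, whereas the string `"NA"` forces everything to character.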

Lists are also collections of elements, but they are a slightly more sophisticated object. They can contain elements of heterogeneous types:

In [26]:
a_vec <- c("word", 1, TRUE)
a_vec

In [27]:
a_list <- list("word", 1, TRUE)
a_list

In [29]:
a_list[3]

In [30]:
a_list[[3]]

In [29]:
class(a_list)

In [30]:
class(a_list[[1]])

In [31]:
class(a_list[[2]])

In [32]:
class(a_list[[3]])
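
The difference between `[` and `[[` on a list can be sketched as follows (`example_list`, `single_bracket`, and `double_bracket` are names made up for this example):

```R
example_list <- list("word", 1, TRUE)

single_bracket <- example_list[3]    # a list of length 1 that contains TRUE
double_bracket <- example_list[[3]]  # the logical value TRUE itself

class(single_bracket)  # "list"
class(double_bracket)  # "logical"

# Because [[ returns the element itself, it can be used directly in computations
example_list[[2]] + 1  # 2
```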

Data frames are a special kind of list with the following requirements:
- the elements must be vectors (or list columns, more on that later); we call these columns
- the elements (i.e., columns) must have names
- the elements (i.e., columns) must be of the same length
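
These requirements can be checked directly. The sketch below uses made-up names (`example_df`, `bad_df`) and wraps the failing call in `tryCatch` so that the code runs to the end; the message string is written for this example and is not R's own error text.

```R
example_df <- data.frame(words = c("word", "another word"),
                         numbers = c(1, 2))

is.list(example_df)  # TRUE: a data frame is a list underneath
names(example_df)    # "words" "numbers"
lengths(example_df)  # every column has length 2

# Columns whose lengths differ and cannot be recycled are rejected
bad_df <- tryCatch(
  data.frame(words = c("a", "b", "c"), numbers = c(1, 2)),
  error = function(e) "error: columns differ in length"
)
bad_df
```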

In [31]:
a_dataframe <- data.frame(words = c("word", "another word"),
                          numbers = c(1, 2),
                          logicals = c(TRUE, FALSE))

a_dataframe

| words | numbers | logicals |
|---|---|---|
| `<chr>` | `<dbl>` | `<lgl>` |
| word | 1 | TRUE |
| another word | 2 | FALSE |


In [35]:
class(1L)

In [32]:
typeof(a_dataframe)

In [33]:
class(a_dataframe)

Tibbles come from the `tidyverse` and are a special flavour of data frame that has:
- ability to have grouped rows
- more predictable subsetting behaviour when subsetting a single column
- a nicer print method in RStudio

We will mostly work with tibbles in R in MDS.

In [40]:
library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.1     v tibble    3.2.1
v lubridate 1.9.3     v tidyr     1.3.1
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [41]:
a_tibble <- tibble(words = c("word", "another word"),
                   numbers = c(1, 2),
                   logicals = c(TRUE, FALSE))

a_tibble

| words | numbers | logicals |
|---|---|---|
| `<chr>` | `<dbl>` | `<lgl>` |
| word | 1 | TRUE |
| another word | 2 | FALSE |


In [42]:
typeof(a_tibble)

In [43]:
class(a_tibble)

## 2. Using base R to subset objects

There are 3 operators that can be used when subsetting data frames (and lists, as data frames are a special kind of list): 

- `[`
- `$` 
- `[[`

`[` usually returns the same type of object as the one being subset:

In [45]:
a_tibble[ , 2]

| numbers |
|---|
| `<dbl>` |
| 1 |
| 2 |


The exception is selecting a single column of a base R data frame, where `[` returns a vector instead of a data frame:

In [46]:
a_dataframe[ , 2]

`$` returns the element with a layer of structure removed:

In [41]:
a_tibble$numbers

`[[` returns the element with a layer of structure removed:

In [42]:
a_tibble[[2]]
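
The three operators can be compared side by side on a base R data frame. This is a minimal sketch with a made-up object, `example_df`; it also shows `drop = FALSE`, a base R argument that is not covered above and that makes `[` keep the data frame structure for a single column.

```R
example_df <- data.frame(words = c("word", "another word"),
                         numbers = c(1, 2))

class(example_df["numbers"])                  # "data.frame": structure is kept
class(example_df[, "numbers"])                # "numeric": the single-column exception
class(example_df[, "numbers", drop = FALSE])  # "data.frame": structure is kept
class(example_df$numbers)                     # "numeric": one layer removed
class(example_df[["numbers"]])                # "numeric": one layer removed
```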

## 3. Tidyverse code style

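Let's fix some of the style issues together. The first cell below contains the original code; the second cell contains the same code after its naming and spacing have been cleaned up: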

In [None]:
library(scatterplot3d)

# Generates toy dataset to play around with clustering 

##Function to create toy 3 dimensional dataset


make_toy_data <- function(means, stdevs, n) {
  ##Returns a dataframe containing a 3D toy dataset 
  ##
  ##Arguments: 
  ##
  ##  means  <- a dataframe of 3 columns giving the means for each dimension of each cluster. Each row is a cluster.
  ##  stdevs <- a dataframe of 3 columns giving the standard deviations for each dimension of each cluster. Each row is a cluster.
  ##  n <- a value specifying the number of datapoints for each cluster
  
  ##get how many clusters are wanted
  cluster.n <- dim(means)[1]
  
  ##make an empty dataframe to hold the toy data
  dim1 <- rep(0, n * cluster.n)
  dim2 <- rep(0, n*cluster.n)
  dim3 <- rep(0, n*cluster.n)
  
  ##for each cluster, generate toy data and save them in dim1
  previous.n<-0
  for (i in 1:cluster.n) {
    temp <- rnorm(n,means[i, 1],stdevs[i, 1])
    dim1[(previous.n + 1):((previous.n) + n)]  <- temp
    previous.n  <- n + previous.n
  }
  
  ##for each cluster, generate toy data and save them in dim2
  previous.n<-0
  for (i in 1:cluster.n) {
    temp <- rnorm(n, means[i, 2], stdevs[i, 2])
    dim2[(previous.n + 1):((previous.n) + n)]  <- temp
    previous.n  <- n + previous.n
  }
  
  ##for each cluster, generate toy data and save them in dim3
  previous.n  <- 0
  for (i in 1:cluster.n) {
    temp <- rnorm(n, means[i, 3], stdevs[i, 3])
    dim3[(previous.n + 1):((previous.n) + n)]  <- temp
    previous.n  <- n + previous.n
  }
  
  data.frame(dim1,dim2,dim3)

}


##define means for toy dataset clusters
dim1_mean <- c(2, 50, 10)
dim2_mean <- c(8, 6, 13)
dim3_mean <- c(2, 2.5, 40)
cluster_means <- data.frame(dim1_mean, dim2_mean, dim3_mean)

##define standard deviations for toy dataset clusters
dim1_stdev <- c(5, 1, 1)
dim2_stdev <- c(5, 1, 1)
dim3_stdev <- c(5, 1, 1)
cluster_stdevs <- data.frame(dim1_stdev, dim2_stdev, dim3_stdev)

##define number of samples wanted for each cluster
cluster.n <- 200

##make toy dataset using parameters specified above
my_toy_df <- make_toy_data(cluster_means, cluster_stdevs, cluster.n)

##save toy dataset to a .csv using comma as a delimiter
write.table(my_toy_df, 
            "toy_data.csv", 
            sep = ",", 
            row.names = FALSE, 
            col.names = FALSE, 
            quote = FALSE, 
            append = FALSE)

# Change the data frame with training data to a matrix
#not scaled and centred but as a matrix
my_toy_df_matrix <- as.matrix(my_toy_df)
##plot unclustered data
scatterplot3d(my_toy_df_matrix)

par(mfrow = c(1,3))
plot (my_toy_df_matrix[ , 1], my_toy_df_matrix[ , 2])
plot (my_toy_df_matrix[ , 2], my_toy_df_matrix[ , 3])
plot (my_toy_df_matrix[ , 3], my_toy_df_matrix[ , 1])

In [None]:
library(scatterplot3d)

# Generates a toy dataset to play around with clustering

# Function to create a toy 3-dimensional dataset


make_toy_data <- function(means, stdevs, n) {
  # Returns a data frame containing a 3D toy dataset
  #
  # Arguments:
  #
  #   means  <- a data frame of 3 columns giving the means for each dimension of each cluster. Each row is a cluster.
  #   stdevs <- a data frame of 3 columns giving the standard deviations for each dimension of each cluster. Each row is a cluster.
  #   n      <- a value specifying the number of data points for each cluster

  # get how many clusters are wanted
  cluster_n <- nrow(means)

  # pre-allocate one vector per dimension to hold the toy data
  dim1 <- rep(0, n * cluster_n)
  dim2 <- rep(0, n * cluster_n)
  dim3 <- rep(0, n * cluster_n)

  # for each cluster, generate toy data and save them in dim1
  previous_n <- 0
  for (i in 1:cluster_n) {
    temp <- rnorm(n, means[i, 1], stdevs[i, 1])
    dim1[(previous_n + 1):(previous_n + n)] <- temp
    previous_n <- n + previous_n
  }

  # for each cluster, generate toy data and save them in dim2
  previous_n <- 0
  for (i in 1:cluster_n) {
    temp <- rnorm(n, means[i, 2], stdevs[i, 2])
    dim2[(previous_n + 1):(previous_n + n)] <- temp
    previous_n <- n + previous_n
  }

  # for each cluster, generate toy data and save them in dim3
  previous_n <- 0
  for (i in 1:cluster_n) {
    temp <- rnorm(n, means[i, 3], stdevs[i, 3])
    dim3[(previous_n + 1):(previous_n + n)] <- temp
    previous_n <- n + previous_n
  }

  data.frame(dim1, dim2, dim3)
}


##define means for toy dataset clusters
dim1_mean <- c(2, 50, 10)
dim2_mean <- c(8, 6, 13)
dim3_mean <- c(2, 2.5, 40)
cluster_means <- data.frame(dim1_mean, dim2_mean, dim3_mean)

##define standard deviations for toy dataset clusters
dim1_stdev <- c(5, 1, 1)
dim2_stdev <- c(5, 1, 1)
dim3_stdev <- c(5, 1, 1)
cluster_stdevs <- data.frame(dim1_stdev, dim2_stdev, dim3_stdev)

##define number of samples wanted for each cluster
cluster_n <- 200

##make toy dataset using parameters specified above
my_toy_df <- make_toy_data(cluster_means, cluster_stdevs, cluster_n)

# save toy dataset to a .csv using comma as a delimiter
write.table(my_toy_df, 
            "toy_data.csv", 
            sep = ",", 
            row.names = FALSE, 
            col.names = FALSE, 
            quote = FALSE, 
            append = FALSE)

# Change the data frame with training data to a matrix
# (not scaled and centred, just converted to a matrix)
my_toy_df_matrix <- as.matrix(my_toy_df)

# plot unclustered data
scatterplot3d(my_toy_df_matrix)

par(mfrow = c(1, 3))
plot(my_toy_df_matrix[, 1], my_toy_df_matrix[, 2])
plot(my_toy_df_matrix[, 2], my_toy_df_matrix[, 3])
plot(my_toy_df_matrix[, 3], my_toy_df_matrix[, 1])
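
Beyond naming and spacing, the three near-identical loops are themselves a readability problem. The following is one possible further refactor, offered as a sketch rather than as part of the lecture code. The names `simulate_dimension` and `make_toy_data_vectorised`, and the seed value, are made up for this example. It relies on `rnorm()` accepting vectors for its `mean` and `sd` arguments.

```R
# Hypothetical refactor: one helper replaces the three copy-pasted loops
simulate_dimension <- function(dim_means, dim_stdevs, n) {
  # rep(..., each = n) lines up one mean and one sd with every data point,
  # so the first n values belong to cluster 1, the next n to cluster 2, etc.
  rnorm(
    n * length(dim_means),
    mean = rep(dim_means, each = n),
    sd = rep(dim_stdevs, each = n)
  )
}

make_toy_data_vectorised <- function(means, stdevs, n) {
  data.frame(
    dim1 = simulate_dimension(means[[1]], stdevs[[1]], n),
    dim2 = simulate_dimension(means[[2]], stdevs[[2]], n),
    dim3 = simulate_dimension(means[[3]], stdevs[[3]], n)
  )
}

set.seed(2024)  # arbitrary seed so the example is reproducible
toy_means <- data.frame(dim1_mean = c(2, 50, 10),
                        dim2_mean = c(8, 6, 13),
                        dim3_mean = c(2, 2.5, 40))
toy_stdevs <- data.frame(dim1_stdev = c(5, 1, 1),
                         dim2_stdev = c(5, 1, 1),
                         dim3_stdev = c(5, 1, 1))
toy_df <- make_toy_data_vectorised(toy_means, toy_stdevs, n = 200)

dim(toy_df)  # 600 rows (3 clusters x 200 points) and 3 columns
```

Naming the columns inside `data.frame()` and passing `n = 200` by name also avoids the need for a global variable that shares its name with a variable inside the function.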