## Lecture 2 
(You should refer to [main lecture notes] for details)

There are 3 key themes to this lecture:

1. Key data types in R

2. Using base R to subset objects

3. Tidyverse code style

NOTE: This course is taught in a flipped-classroom style, so you should watch the pre-recorded lecture videos before coming to class. Class time will be spent going through some highlights of the lecture and working on group problem solving using worksheets. I aim to give you around 30 minutes to work on the worksheets; during this time the TAs and I will be available to help with any questions you have.

## 1. Key data types in R

The simplest object in R is a vector of length 1; this is the closest thing R has to a scalar (see the key data types image):
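A minimal sketch of creating such an object; the string value assigned here is an illustrative assumption, not taken from the lecture:

```r
# A character vector of length 1 (the value is illustrative)
my_string <- "Hello, R!"

# Even a single value is a vector, so it has a length
length(my_string)
```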

Here we created an object named my_string. 

You can learn about R objects using `typeof` and `class`:
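As a small sketch of how these two functions differ, assuming two example objects whose values are made up for illustration:

```r
my_string <- "Hello, R!"  # illustrative value
typeof(my_string)  # "character"
class(my_string)   # "character"

my_number <- 3.14  # hypothetical example object
typeof(my_number)  # "double": how R stores the value internally
class(my_number)   # "numeric": how R treats the object
```

`typeof` reports the internal storage type, while `class` reports the attribute that functions use to decide how to handle the object, so the two can disagree.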

Vectors must be of a homogeneous type. If you mix types when creating a vector, R coerces the elements to a common type for you. Note also that R counts from 1, not 0:
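The coercion and 1-based indexing can be sketched as follows; the vector names and values are illustrative assumptions:

```r
# Mixing numbers, strings and logicals: everything becomes character
mixed <- c(1, "two", TRUE)
mixed          # "1" "two" "TRUE"
typeof(mixed)  # "character"

# Mixing integer, double and logical: everything becomes double
numbers <- c(1L, 2.5, TRUE)
typeof(numbers)  # "double"

# Indexing starts at 1, so this is the first element
mixed[1]  # "1"
```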

Lists are also collections of elements, but are a slightly more sophisticated object. They can contain elements of heterogeneous types:
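A minimal sketch of a list holding elements of different types; the element names and values are made up for illustration:

```r
# Each element keeps its own type and length
my_list <- list(name = "Ada", scores = c(90, 85), passed = TRUE)

# Compact overview of the structure
str(my_list)

# Type of each element
sapply(my_list, typeof)
```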

Data frames are a special kind of list with the following requirements:
- the elements must be vectors (or list columns - more on that later) - we call these columns
- the elements (i.e., columns) must have names
- the elements (i.e., columns) must be of the same length
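These requirements can be checked on a small example; the data frame below is a hypothetical one built only for illustration:

```r
df <- data.frame(
  id = 1:3,
  label = c("a", "b", "c")
)

typeof(df)   # "list": a data frame is stored as a list
class(df)    # "data.frame"
names(df)    # "id" "label"
lengths(df)  # every column has length 3

# Columns whose lengths cannot be reconciled raise an error
result <- try(data.frame(id = 1:3, label = c("a", "b")), silent = TRUE)
class(result)  # "try-error"
```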

Tibbles are from the `tidyverse` and are a special flavour of data frame that has:
- the ability to have grouped rows
- more predictable subsetting behaviour when subsetting a single column
- a nicer print method in RStudio
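The subsetting difference can be sketched like this, assuming the `tibble` package is installed; the example data are made up:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

# Selecting a single column with `[`:
class(df[, "x"])   # "integer": the data frame dropped to a vector
class(tbl[, "x"])  # "tbl_df" "tbl" "data.frame": still a tibble
```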

We will mostly work with tibbles in R in MDS.

## 2. Using base R to subset objects

There are 3 operators that can be used when subsetting data frames (and lists, as data frames are a special kind of list): 

- `[`
- `$` 
- `[[`

`[` usually* returns the same type of object:
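A short sketch of this behaviour on a list and a data frame, both hypothetical examples:

```r
my_list <- list(a = 1, b = "two", c = TRUE)
class(my_list[1])            # "list"
class(my_list[c("a", "b")])  # "list"

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df[1:2, ])  # "data.frame"
class(df["x"])    # "data.frame"
```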

The exception to this is the special case of a single column of a data frame...
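A minimal sketch of that special case, using a made-up data frame:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Row-and-column indexing with a single column drops to a vector
class(df[, "x"])  # "integer"

# drop = FALSE keeps the data frame structure
class(df[, "x", drop = FALSE])  # "data.frame"
```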

`$` returns the element with a layer of structure removed.

Note, the element must have a name for this to work!
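As a sketch, with hypothetical objects created only for illustration:

```r
my_list <- list(a = 1, b = "two")
my_list$b         # "two"
class(my_list$b)  # "character": the list layer is gone

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df$x  # 1 2 3, a plain integer vector

# Without names there is nothing for `$` to match
unnamed <- list(1, "two")
unnamed$b  # NULL
```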

`[[` returns the element with a layer of structure removed:
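A minimal sketch comparing `[[` with `[`, using made-up objects:

```r
my_list <- list(a = 1, b = "two")

my_list[[2]]    # "two", selected by position
my_list[["b"]]  # "two", selected by name

class(my_list[2])    # "list": `[` keeps the list layer
class(my_list[[2]])  # "character": `[[` removes it

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df[["x"]]  # 1 2 3

# Unlike `$`, `[[` also works by position when elements have no names
unnamed <- list(1, "two")
unnamed[[2]]  # "two"
```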

## 3. Tidyverse code style

Let's fix some of the style issues together:

library(scatterplot3d)

# Generates toy dataset to play around with clustering 

##Function to create toy 3 dimensional dataset


make_toy_data <- function(means, stdevs, n) {
  ##Returns a dataframe containing a 3D toy dataset 
  ##
  ##Arguments: 
  ##
  ##  means  <- a dataframe of 3 columns giving the means for each dimension of each cluster. Each row is a cluster.
  ##  stdevs <- a dataframe of 3 columns giving the standard deviations for each dimension of each cluster. Each row is a cluster.
  ##  n <- a value specifying the number of datapoints for each cluster
  ##  dist  <- a vector specifying the type of distribution you would like each cluster data to be
  
  ##get how many clusters are wanted
  cluster.n<-dim(means)[1]
  
  ##make an empty dataframe to hold the toy data
  dim1 <- rep(0, n*cluster.n)
  dim2 <- rep(0, n*cluster.n)
  dim3 <- rep(0, n*cluster.n)
  
  ##for each cluster, generate toy data and save them in dim1
  previous.n<-0
  for (i in 1:cluster.n) {
    temp <- rnorm(n,means[i, 1],stdevs[i, 1])
    dim1[(previous.n + 1):((previous.n) + n)]  <- temp
    previous.n  <- n + previous.n
  }
  
  ##for each cluster, generate toy data and save them in dim2
  previous.n<-0
  for (i in 1:cluster.n) {
    temp <- rnorm(n, means[i, 2], stdevs[i, 2])
    dim2[(previous.n + 1):((previous.n) + n)]  <- temp
    previous.n  <- n + previous.n
  }
  
  ##for each cluster, generate toy data and save them in dim3
  previous.n  <- 0
  for (i in 1:cluster.n) {
    temp <- rnorm(n, means[i, 3], stdevs[i, 3])
    dim3[(previous.n + 1):((previous.n) + n)]  <- temp
    previous.n  <- n + previous.n
  }
  
  data.frame(dim1,dim2,dim3)

}


##define means for toy dataset clusters
dim1_mean <- c(2, 50, 10)
dim2_mean <- c(8, 6, 13)
dim3_mean <- c(2, 2.5, 40)
cluster_means <- data.frame(dim1_mean, dim2_mean, dim3_mean)

##define standard deviations for toy dataset clusters
dim1_stdev <- c(5, 1, 1)
dim2_stdev <- c(5, 1, 1)
dim3_stdev <- c(5, 1, 1)
cluster_stdevs <- data.frame(dim1_stdev, dim2_stdev, dim3_stdev)

##define number of samples wanted for each cluster
cluster.n <- 200

##make toy dataset using parameters specified above
my_toy_df <- make_toy_data(cluster_means, cluster_stdevs, cluster.n)

##save toy dataset to a .csv using comma as a delimiter
write.table(my_toy_df, 
            "toy_data.csv", 
            sep = ",", 
            row.names = FALSE, 
            col.names = FALSE, 
            quote = FALSE, 
            append = FALSE)

# Change the data frame with training data to a matrix
#not scaled and centred but as a matrix
my_toy_df_matrix <- as.matrix(my_toy_df)
##plot unclustered data
scatterplot3d(my_toy_df_matrix)

par(mfrow = c(1,3))
plot (my_toy_df_matrix[ , 1], my_toy_df_matrix[ , 2])
plot (my_toy_df_matrix[ , 2], my_toy_df_matrix[ , 3])
plot (my_toy_df_matrix[ , 3], my_toy_df_matrix[ , 1])
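
One possible restyled version of `make_toy_data`, offered as a sketch rather than the definitive answer. It assumes common tidyverse style conventions: snake_case names instead of dots, spaces around `<-` and after commas, single-`#` comments, and roxygen-style documentation. The helper `simulate_dim` and the name `n_clusters` are introduced here for illustration and are not in the code above. The `dist` entry is left out of the documentation because the function takes no such argument.

```r
#' Create a toy 3-dimensional dataset
#'
#' @param means Data frame with 3 columns giving the mean of each dimension;
#'   each row is a cluster.
#' @param stdevs Data frame with 3 columns giving the standard deviation of
#'   each dimension; each row is a cluster.
#' @param n Number of data points to generate per cluster.
#'
#' @return A data frame with columns dim1, dim2 and dim3.
make_toy_data <- function(means, stdevs, n) {
  n_clusters <- nrow(means)

  # Draw n values per cluster for dimension j and stack them cluster by
  # cluster (hypothetical helper replacing the three repeated loops)
  simulate_dim <- function(j) {
    unlist(lapply(
      seq_len(n_clusters),
      function(i) rnorm(n, means[i, j], stdevs[i, j])
    ))
  }

  data.frame(
    dim1 = simulate_dim(1),
    dim2 = simulate_dim(2),
    dim3 = simulate_dim(3)
  )
}

# Usage with small, made-up parameters
cluster_means <- data.frame(dim1_mean = c(2, 50), dim2_mean = c(8, 6), dim3_mean = c(2, 40))
cluster_stdevs <- data.frame(dim1_stdev = c(5, 1), dim2_stdev = c(5, 1), dim3_stdev = c(5, 1))
toy_df <- make_toy_data(cluster_means, cluster_stdevs, n = 5)
dim(toy_df)
```

Moving the repeated loop into one helper removes the copy-pasted blocks, so a change to how values are drawn only has to be made in one place.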