# Introduction to R

## Overview

R is one of the most widely used open source tools for statistical computing. It has strong community support, and the community has contributed a large number of packages.

This Prac teaches the basics of R, which will serve as the foundation for future Pracs. The R topics covered in this Prac include the following:
* data structures
* filtering and data quality operations
* loops and conditional statements
* commonly used statistical functions
* basic plotting

## Instructions
<strong> This Prac is assessed, and the tasks are based on the demos taken during lab. </strong>

Complete the tasks given. 
* <strong>Please note that your code will be re-executed on new dataset files during marking. The new datasets will have the same column names as the current datasets, but the order of the columns may differ. The rows, and the number of rows, will also change. Your code should give the correct answer for the new datasets as well. <u>So please don't hard-code answers, row indexes, or column indexes.</u> To reference a column, use its name instead of its index.</strong>

Additional Notes
* Variable names and strings are case-sensitive in R
* Any updates regarding the Pracs will be posted on <strong>Piazza</strong>
* Your code will be tested on Version 4.x.x of R

## Submission
After completing the tasks, download the notebook as an R file to your local system. This can be done by "File > Download as > R". Rename the downloaded R file with your "UQ Login-id" - for example, `s4477608.R`
* Upload the renamed file to Jupyter, in the same folder as this notebook, that is, inside the "Prac1" folder.
* Submit the downloaded file to BB

<strong> The R file uploaded to Jupyter will be assessed automatically. It is therefore very important to upload the R file to the correct location in Jupyter, with the correct name (as mentioned above); otherwise 0 will be awarded for this Prac. </strong>

## 1. Data Quality 
Imported data often contains inconsistencies and quality issues. The following tasks give examples of some of these issues.


Data can be imported from multiple sources such as CSV files, JSON files, and relational databases. We will be dealing with simple CSV files. A sample task and a sample solution for importing a CSV file are shown below.

|<center> Sample TASK - 1</center>|
| ---- |
| Import data from "unclean1.csv" using `read.csv` function. Store the data into a variable called "clean1". Import data from iris1.csv into a variable called "d1". Note: first row of "iris1.csv" is the header|

In [1]:
# Place your answer here (Sample Solution)
clean1 = read.csv("datasets/unclean1.csv", header=FALSE)
d1 = read.csv("datasets/iris1.csv", header=TRUE)

### Data Inconsistency

|<center>TASK</center>|
| ---- |
| The last row of "clean1" dataframe contains the column header. Assign the column header using the last row, and also remove the last row from the data frame. Store the result inside "clean1". |

In [2]:
# Place your answer here
end_row = clean1[nrow(clean1),]
colnames(clean1) = c(t(end_row))
clean1 = clean1[-nrow(clean1),]

### Data Type Mismatch
When data is imported, the data types assigned by default might not match the underlying data.
You can check the structure of the data using the `str` function, and assign the correct data types using functions like `as.numeric`, `as.character`, `as.factor`, `as.integer`, ...
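
As a minimal sketch of this check-then-convert workflow, the example below uses a small made-up data frame (`toy` and its values are hypothetical, not part of the Prac datasets) in which numbers were read as text.

```r
# Toy data frame (hypothetical values): numbers stored as text
toy = data.frame(height = c("1.5", "2.0", "3.25"), group = c("a", "b", "a"))
str(toy)                               # both columns are reported as chr in R 4.x

toy$height = as.numeric(toy$height)    # convert text to numbers
toy$group = as.factor(toy$group)       # convert text to a categorical factor
str(toy)                               # height is now num, group is a Factor
```

Running `str` before and after the conversion is a quick way to confirm that each column ended up with the intended type.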

|<center>TASK</center>|
| ---- |
| Make a copy of "d1" data frame, and name it "datatype1". Sepal.Length, Sepal.Width, Petal.Length, Petal.Width columns should be of type numeric, and Species column should be of type character. Check if the columns of "datatype1" match their respective data types, fix the same if not correct.  |

In [3]:
# Place your answer here
datatype1 = d1
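# Convert by column name, because the column order may change during marking
datatype1$Species = as.character(datatype1$Species)
for (col in c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width')) {
    datatype1[[col]] = as.numeric(datatype1[[col]])
}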
is.numeric(datatype1$Sepal.Length)
is.numeric(datatype1$Sepal.Width)
is.numeric(datatype1$Petal.Length)
is.numeric(datatype1$Petal.Width)
is.character(datatype1$Species)

### Handling NA's for Arithmetic Operation
Arithmetic operations such as mean and sum return `NA` when the underlying data has missing values. There are many ways to handle missingness; ignoring the NA's is one of them.
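
The effect of missing values on arithmetic can be illustrated with a short sketch. The vector `x` below is a hypothetical example, not taken from the Prac datasets.

```r
x = c(2, 4, NA, 6)        # hypothetical values with one missing entry
mean(x)                   # NA, because the missing value propagates
mean(x, na.rm = TRUE)     # 4, the mean of 2, 4 and 6
sum(x, na.rm = TRUE)      # 12
```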

|<center>TASK</center>|
| ---- |
| Find the mean of "Petal.Width" column of "clean1". Note: consider using `na.rm` parameter of mean function to handle the NA's of this column. Store the result inside "sum_narm" |

In [4]:
# Place your answer here
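# Convert via character so the values are parsed as numbers even if the column was read as text
type_character = as.character(clean1[,'Petal.Width'])
type_numeric = as.numeric(type_character)
# na.rm = TRUE drops the NA's before averaging
sum_narm = mean(type_numeric, na.rm=TRUE)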

### Data Missingness
There are several kinds of missingness in R. The most common are `NA`, `NULL`, and empty strings.
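
The three kinds of missingness behave differently, which the following sketch tries to show on a hypothetical character vector (`values` and `is_missing` are illustrative names only).

```r
values = c("a", NA, "", "d")   # hypothetical character vector
is.na(values)                  # FALSE TRUE FALSE FALSE: only NA is flagged
values == ""                   # FALSE NA TRUE FALSE: comparing NA yields NA

# Combine both checks to flag NA and empty strings together
is_missing = is.na(values) | values == ""
is_missing                     # FALSE TRUE TRUE FALSE

length(c(1, NULL, 3))          # 2: NULL vanishes when combined into a vector
is.null(NULL)                  # TRUE
```

Because `NULL` disappears when placed in a vector, it cannot occur inside a data frame column; in practice, `NA` and empty strings are the cases to check for in imported data.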

|<center>TASK</center>|
| ---- |
| Make a copy of "d1" data frame, and name it "clean2". Remove the rows of "clean2" which contain missing values (NA, NULL, and empty strings) in any of their columns. |

In [5]:
# Place your answer here
clean2 = d1
clean2[clean2 == ""] = NA  # treat empty strings as missing; complete.cases only detects NA
clean2 = clean2[complete.cases(clean2),]

### Combining Data From JSON and CSV
Data often needs to be combined from various sources such as CSV and JSON. You can combine data frames loaded from different sources using the `rbind` function, provided they have the same structure.
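
A small sketch of `rbind` on two hypothetical data frames (`a` and `b` are made up for illustration) shows what "same structure" means in practice: the column names must agree, although their order may differ.

```r
# Two hypothetical data frames with the same columns in a different order
a = data.frame(id = 1:2, label = c("x", "y"))
b = data.frame(label = "z", id = 3L)

combined = rbind(a, b)   # columns of b are matched to those of a by name
combined
nrow(combined)           # 3
```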

|<center>TASK</center>|
| ---- |
| Combine contents of "iris3.json" and "iris4.csv" using `rbind` function, and store the resultant dataframe inside "rbind1". Note: you can import `JSON` data using `read_json` function of `jsonlite` package |

In [6]:
# Place your answer here
library(jsonlite)
iris3 = read_json('datasets/iris3.json', simplifyVector=TRUE)
iris4 = read.csv('datasets/iris4.csv', header=TRUE)
rbind1 = rbind(iris3,iris4)

### Filtering Data

|<center>TASK</center>|
| ---- |
| Store the first 5 rows and the last 10 rows of "d1" inside "select1"|

In [7]:
# Place your answer here
first_five=d1[1:5,]
last_ten=d1[(nrow(d1)-9):nrow(d1),]
select1=rbind(first_five,last_ten)

|<center>TASK</center>|
| ---- |
| Make a copy of "d1" data frame, and name it "select2". Remove all rows of "select2" whose Species is versicolor. |

In [8]:
# Place your answer here
select2 = subset(d1, Species != 'versicolor')

|<center>TASK</center>|
| ---- |
| Make a copy of "d1" data frame, and name it "select3". Set "Petal.Length" column values of "select3" data frame to 0 whose Species is setosa. |

In [9]:
# Place your answer here
select3 = d1
# which() skips rows where Species is NA, which logical indexing would reject
select3[which(select3$Species == 'setosa'), 'Petal.Length'] = 0

## 2. Factors
All categorical features should be stored as `factor` in R. Each category should be a `level` of that factor.
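
The sketch below, using a hypothetical `colours` vector, illustrates why declaring the levels matters: a category that never occurs is still reported, with a count of 0.

```r
colours = c("red", "blue", "red")   # hypothetical data; "green" never occurs
table(colours)                      # counts only blue and red

# Declaring all levels keeps the unused category in the result
colour_factor = factor(colours, levels = c("red", "green", "blue"))
counts = table(colour_factor)       # red 2, green 0, blue 1
counts["green"]                     # 0
```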

|<center>TASK</center>|
| ---- |
| Given that there are 3 distinct species of iris - "setosa", "versicolor", "virginica", find the count of occurrence of each Species in "select2" using `table` function. Store the resultant count inside "count1" vector. Name of each element of the vector should be its corresponding Species. Hint: you have to convert the column into `factor` and mention the `levels` before you perform the count. |

In [10]:
# Place your answer here
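# Declare all three levels so that species with no rows are counted as 0
count1 = table(factor(select2$Species, levels = c('setosa', 'versicolor', 'virginica')))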

## 3. User Defined Function

User-defined functions let you package a custom operation so that it can be reused.
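
As a minimal illustration, the hypothetical function `range_width` below takes one argument and returns a single value.

```r
# Hypothetical helper: width of the range of a numeric vector
range_width = function(x) {
    return(max(x) - min(x))
}

range_width(c(3, 9, 4))   # 6
```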

|<center>TASK</center>|
| ---- |
| Create a function called "fun1" which takes two matrices as input parameters (x1,x2). The function should return the sum of the determinants of the matrices.|

In [11]:
# Place your answer here
fun1 = function(x1, x2){
    sum_matrix = det(x1) +det(x2)
    return (sum_matrix)
}

## 4. Common/Important Functions and Operations 

### Basic Arithmetic Operation

|<center>TASK</center>|
| ---- |
| Find the sum of "Petal.Length" column of "d1". Store the result inside "sum1" vector |

In [12]:
# Place your answer here
sum1 = sum(d1[['Petal.Length']])

### Apply Function
The `apply` function can be used to perform operations on the rows or columns of a data frame or matrix.
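
The second argument of `apply` selects the direction: 1 for rows, 2 for columns. A short sketch on a hypothetical matrix `m`:

```r
m = matrix(1:6, nrow = 2)                  # hypothetical 2 x 3 matrix, filled by column
apply(m, 1, sum)                           # row sums: 9 12
apply(m, 2, sum)                           # column sums: 3 7 11
apply(m, 2, function(x) max(x) - min(x))   # per-column range: 1 1 1
```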

|<center>TASK</center>|
| ---- |
| Find the "sum/median" (SUM divided by MEDIAN) of all the numeric columns of "d1" using `apply` function. Store the result inside "apply1". Note: you have to use `is.numeric` to find if a column is numeric or not. |

In [13]:
# Place your answer here
# Select numeric columns by type instead of by position
numeric_cols = sapply(d1, is.numeric)
apply1 = apply(d1[, numeric_cols, drop = FALSE], 2, function(x){ sum(x)/median(x) })

### Aggregate Function
The `aggregate` function is useful when we want to compute a summary separately for each category of the data.
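
A minimal sketch of grouping with `aggregate`, on a hypothetical `scores` data frame (the names and values are made up for illustration):

```r
scores = data.frame(team = c("a", "b", "a", "b"), points = c(1, 2, 3, 4))

# One row per team, with the points summed within each team
totals = aggregate(scores["points"], by = list(team = scores$team), FUN = sum)
totals   # team a has 4 points, team b has 6
```

Naming the element of the `by` list (here `team`) gives the grouping column a readable name in the result.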


|<center>TASK</center>|
| ---- |
| Find the sum of "Petal.Length","Petal.Width", "Sepal.Length", "Sepal.Width" of "d1" for each Species using `aggregate` function. The result should be a data frame where the column names are the names of the numeric columns of "d1" and the row names are the names of the Species. Store the result inside "agg1" |

In [14]:
# Place your answer here
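# Columns are selected by name; the species names become the row names
agg1 = aggregate(d1[, c('Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width')], by = list(Species = d1$Species), FUN = sum)
agg1 = data.frame(agg1[, -1], row.names = as.character(agg1$Species))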

|<center>TASK</center>|
| ---- |
| Load "script1.R" using `source` function.<strong> You will be making use of the objects created inside this script in subsequent tasks. Please note that the contents of the objects inside this script will be changed during marking. So please don't hard-code your answers.</strong>  |

In [15]:
# Place your answer here
source("script1.R")

## 5. Lists

A list can store objects of any type. It is often used to return multiple objects from a function.
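
The sketch below shows one way a function might return several results in a list. The function `describe` and its element names are hypothetical.

```r
# Hypothetical function returning several results at once
describe = function(x) {
    return(list(values = x, total = sum(x), size = length(x)))
}

result = describe(c(2, 4, 6))
result$total     # 12
result[[3]]      # 3, elements can also be accessed by position
```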

|<center>TASK</center>|
| ---- |
| Create a function called "fun1" which takes two matrices as input parameters (x1,x2). It should return a list where the 1st element of the list is the first matrix, the 2nd element is the second matrix, and the 3rd element is the vector which contains the sum of the determinants of the two matrices.  |

In [16]:
# Place your answer here
fun1 = function(x1, x2){
    sum_matrix = det(x1) +det(x2)
    out = list(x1, x2,sum_matrix)
    return (out)
}

### lapply & sapply
lapply and sapply can be used to iterate through elements of the list to perform some operation.
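
The main difference between the two is the shape of the result, as this sketch on a hypothetical list `items` tries to show: `lapply` always returns a list, while `sapply` simplifies the result to a vector when it can.

```r
items = list(a = 1:3, b = c(10, 20))   # hypothetical list

as_list = lapply(items, sum)     # a list: $a is 6, $b is 30
as_vector = sapply(items, sum)   # a named numeric vector: a = 6, b = 30
```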

|<center>TASK</center>|
| ---- |
| Find the sum of the determinants of the matrices present inside "list1" (loaded from "script1.R") using `lapply`. Store the result inside vector "lapply1". Perform the same operation using `sapply` and store the result inside "sapply1" variable. Note: you have to check whether an element of the list is a matrix or not.  |

In [17]:
# Place your answer here
for(i in 1:length(list1)){print(is.matrix(list1[[i]]))}
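# Use det() for matrices and 0 for everything else, so no list position is hard-coded
det_or_zero = function(x){ if(is.matrix(x)) det(x) else 0 }
lapply1 = sum(unlist(lapply(list1, det_or_zero)))
sapply1 = sum(sapply(list1, det_or_zero))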

[1] FALSE
[1] TRUE
[1] FALSE


## 6. Loops

Loops can be used to 
* execute tasks "n" number of times
* iterate through elements of an object
* and so on
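
As a rough illustration of both loop forms, the sketch below sums the numeric elements of a hypothetical mixed list `values`, once with `for` and once with `while`.

```r
values = list(4, "skip", 6)   # hypothetical mixed list

# for: iterate directly over the elements
total_for = 0
for (v in values) {
    if (is.numeric(v)) total_for = total_for + v
}

# while: iterate with an explicit index that must be advanced by hand
total_while = 0
i = 1
while (i <= length(values)) {
    if (is.numeric(values[[i]])) total_while = total_while + values[[i]]
    i = i + 1
}

c(total_for, total_while)   # 10 10
```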

|<center>TASK</center>|
| ---- |
| Find the sum of the determinants of the matrices present inside "list1" (loaded from "script1.R") using `for` loop. Store the result inside vector "for1". Perform the same operation using `while` loop. Store the result inside "while1". Note: you have to check whether an element of the list is a matrix or not.  |

In [19]:
# Place your answer here
for1 = 0
for(i in seq_along(list1)){  # visit every index, not only the last one
    if(is.matrix(list1[[i]])){ 
        for1 = for1 + det(list1[[i]])
    }
}

while1 = 0
i = 1
while(i <= length(list1)){
    if(is.matrix(list1[[i]])){
        while1 = while1 + det(list1[[i]])
    }
    i = i + 1
}

## 7. Sampling

Sampling is the process of selecting only a subset of the data for analysis. This may be due to
* the data being too large to process
* challenges in collecting the entire data set


There are different types of sampling, such as
* Simple Random Sampling
* Weighted Sampling
* Stratified Sampling
* Systematic Sampling
* ...
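
To make some of these types concrete, here is a small sketch on a hypothetical vector of row ids. The seed, the sizes, and the weights are arbitrary choices for illustration.

```r
set.seed(1)                               # arbitrary seed so the draws repeat
ids = 1:20                                # hypothetical row ids

# Simple random sampling: every id is equally likely, no repeats
simple = sample(ids, 5)

# Weighted sampling: here larger ids are proportionally more likely
weighted = sample(ids, 5, replace = TRUE, prob = ids)

# Systematic sampling: every 4th id, starting from the 2nd
systematic = ids[seq(2, length(ids), by = 4)]
systematic                                # 2 6 10 14 18
```

The values in `prob` do not need to sum to 1; `sample` treats them as relative weights.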

The tasks below are based on sampling.

### Simple Random Sampling

|<center>TASK</center>|
| ---- |
| Sample 80 rows `without replacement` from "d1". Store the result inside "sample1" data frame. Perform the same operation `with replacement` and store the result inside "sample2" data frame. Set the seed (using `set.seed` function) to 55 for both the cases. Note: perform sample operation using `sample` function  |

In [20]:
# Place your answer here
set.seed(55)
sample1 = d1[sample(nrow(d1), 80, replace = FALSE),]
set.seed(55)  # reset the seed so that both cases start from seed 55
sample2 = d1[sample(nrow(d1), 80, replace = TRUE),]

### Weighted Sampling

|<center>TASK</center>|
| ---- |
| Sample with replacement, 80 data points from "d1", such that versicolor data points have twice the weight as that of setosa, and virginica data points have twice the weight as that of versicolor. Combine the sampled rows inside "sample3" data frame. Set the seed to 55. Note: perform sample operation using `sample` function, and weights can be given using `prob` parameter of sample function |

In [21]:
# Place your answer here
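set.seed(55)
# Look up weights by species name; as.integer() only works when Species is a factor
species_weights = c(setosa = 1, versicolor = 2, virginica = 4)
row_weights = species_weights[as.character(d1[['Species']])]
row_weights[is.na(row_weights)] = 0  # rows with an unrecognised Species are never sampled
sample3 = d1[sample(nrow(d1), 80, replace = TRUE, prob = row_weights),]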

### Repeated Weighted Sampling

|<center>TASK</center>|
| ---- |
| Perform the above weighted sampling for 100 iterations, finding the mean of "Sepal.Length" for each iteration. Store the means inside "weighted_iterative" vector. The seed for each iteration is the iteration number itself. |

In [22]:
# Place your answer here
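weighted_iterative = numeric(100)  # a numeric vector, as the task asks, not a list
species_weights = c(setosa = 1, versicolor = 2, virginica = 4)
row_weights = species_weights[as.character(d1[['Species']])]
row_weights[is.na(row_weights)] = 0
for (i in 1:100) {
    set.seed(i)
    # 80 rows per iteration, matching the weighted sampling task above
    iteration_sample = d1[sample(nrow(d1), 80, replace = TRUE, prob = row_weights),]
    weighted_iterative[i] = mean(iteration_sample$Sepal.Length)
}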

### Stratified Sampling

|<center>TASK</center>|
| ---- |
| Sample with replacement - 10 values from each of the three Species of "d1" data. Store the sampled data inside "stratified_setosa", "stratified_versicolor", and "stratified_virginica" data frames respectively. Set the seed to 55 for sampling. Perform sample operation using `sample` function, and use `prob` parameter of sample function to sample from a particular species alone. Hint: all data points with a corresponding weight of 0 will not be sampled. |

In [23]:
# Place your answer here
set.seed(55)
# %in% gives weight 1 to the chosen species and 0 to every other row (including NA)
stratified_setosa = d1[sample(nrow(d1),10, replace = TRUE, prob = as.numeric(d1[['Species']] %in% 'setosa')),]
stratified_versicolor = d1[sample(nrow(d1),10, replace = TRUE, prob = as.numeric(d1[['Species']] %in% 'versicolor')),]
stratified_virginica = d1[sample(nrow(d1),10, replace = TRUE, prob = as.numeric(d1[['Species']] %in% 'virginica')),]


### Repeated Stratified Sampling

|<center>TASK</center>|
| ---- |
| Perform the above stratified sampling for 100 iterations, finding the mean of "Sepal.Length" for each iteration - for each species. The seed for each iteration is the iteration number itself. Store the sampled means inside "setosa_mean", "versicolor_mean", "virginica_mean" vectors respectively. Note: after 100 iterations, there will be 100 values in each of the vectors |

In [24]:
# Place your answer here
# Numeric vectors, as the task asks, rather than lists
setosa_mean = numeric(100)
versicolor_mean = numeric(100)
virginica_mean = numeric(100)
for (i in 1:100) {
    set.seed(i)
    # %in% gives weight 1 to the chosen species and 0 to every other row
    stratified_setosa = d1[sample(nrow(d1),10, replace = TRUE, prob = as.numeric(d1[['Species']] %in% 'setosa')),]
    stratified_versicolor = d1[sample(nrow(d1),10, replace = TRUE, prob = as.numeric(d1[['Species']] %in% 'versicolor')),]
    stratified_virginica = d1[sample(nrow(d1),10, replace = TRUE, prob = as.numeric(d1[['Species']] %in% 'virginica')),]

    setosa_mean[i] = mean(stratified_setosa$Sepal.Length)
    versicolor_mean[i] = mean(stratified_versicolor$Sepal.Length)
    virginica_mean[i] = mean(stratified_virginica$Sepal.Length)
}