# Introduction to R

## Overview

R is one of the most widely used open source tools for statistical computing. It has strong community support, and the community has contributed a large number of packages.

This Prac teaches the basics of R, which will serve as a foundation for future Pracs. The R topics covered in this Prac include the following:
* data structures
* filtering and data quality operations
* loops and conditional statements
* commonly used statistical functions
* basic plotting

## Instructions
<strong> This Prac is assessed, and the tasks are based on the demos given in labs, lectures, and tutorials, and on their materials. </strong>

Complete the tasks given. 
* <strong>Please note that your code will be re-executed on new dataset files during marking. The new datasets will have the same column names as the current datasets, but the order of the columns may differ. The rows, and the number of rows, will also change. Your code should give the correct answers for the new datasets as well. <u>So please don't hard-code answers, row indexes, or column indexes.</u> To reference a column, use its name instead of its index.</strong>

* <strong>Make sure the submitted R script has no syntax errors. A script that contains syntax errors will be awarded zero for this Prac. Tutors will show you how to identify syntax errors in your script.</strong>

Additional Note
* Variable names and strings are case-sensitive in R
* Use of any packages to achieve the objective is strictly prohibited - unless explicitly mentioned in the question to do so
* Any updates regarding the Pracs will be posted on <strong>Piazza</strong>
* Your code will be tested on Version 4.x.x of R

## Submission
After completing the tasks, download the notebook as an R file to your local system. This can be done by "File > Download as > R". Rename the downloaded R file with your "UQ Login-id" - for example, `s4477608.r`
* Upload the downloaded file to jupyter in the same folder as that of this notebook, that is, inside "Prac1" folder.
* Submit the downloaded file to BB

<strong> The R file uploaded to jupyter will be assessed automatically. It is therefore very important to upload the R file to the correct location in jupyter, with the correct name (as mentioned above). Otherwise, 0 will be awarded for this Prac. </strong>

## 1. Data Quality 
Imported data often contains inconsistencies and quality issues. The following tasks give examples of some of these issues.


Data can be imported from multiple sources such as CSV files, JSON files, and relational databases. We will be dealing with simple CSV files. A sample task, with a sample solution that imports a CSV file, is shown below.

|<center> Sample TASK - 1</center>|
| ---- |
| Import data from "unclean1.csv" using `read.csv` function. Store the data into a variable called "unclean1". <br>Import data from "iris1.csv" into a variable called "d1". First row of "iris1.csv" is the column header of "d1" dataframe (Use `header` argument of `read.csv` to assign the same). <br> String columns should NOT be imported as factors for both the cases (use `stringsAsFactors` argument of `read.csv` to achieve the same). |

In [None]:
# Place your answer here (Sample Solution)
unclean1 = read.csv("datasets/unclean1.csv", header=FALSE, stringsAsFactors=FALSE)
d1 = read.csv("datasets/iris1.csv", header=TRUE, stringsAsFactors=FALSE)

### Data Inconsistency

|<center>TASK</center>|
| ---- |
| Make a copy of "unclean1" dataframe, and name it "clean1". <br> The last row of "clean1" dataframe contains the column header. Assign "clean1" column header using its last row, and also remove its last row (from the "clean1" data frame). <br> Reminder: please don't hard-code the row number of the last row. |

In [None]:
# Place your answer here
clean1 = data.frame(unclean1) # Creates a copy of the "unclean1" data frame
colnames(clean1) = as.character(unlist(tail(clean1, 1))) # Uses the last row (as a character vector) for the column headers
clean1 = head(clean1, -1) # head(x, -1) keeps every row except the last one

### Data Type Mismatch
When data is imported, the data types assigned by default might not match the underlying data.
You can check the structure of the data using the `str` function, and assign the correct data types using functions such as `as.numeric`, `as.character`, `as.factor`, and `as.integer`.
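
The check-then-convert workflow described above can be sketched as follows. This is only an illustration on a hypothetical toy data frame called `toy` (not one of the Prac datasets), in which a numeric column has been read in as text.

```r
# Hypothetical toy data: every column was read in as character
toy <- data.frame(
  Sepal.Length = c("5.1", "4.9", "abc"),
  Species      = c("setosa", "setosa", "virginica"),
  stringsAsFactors = FALSE
)
str(toy)  # both columns are reported as chr

# Convert by column name; values that cannot be parsed as numbers become NA
# (R normally warns about this, the warning is silenced here for brevity)
toy$Sepal.Length <- suppressWarnings(as.numeric(toy$Sepal.Length))
str(toy)  # Sepal.Length is now num, Species is still chr
```

Note that a failed conversion does not stop the script. It silently introduces `NA` values, which is one reason missing values need to be handled afterwards.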

|<center>TASK</center>|
| ---- |
| Make a copy of "clean1" data frame, and name it "datatype1". Sepal.Length, Sepal.Width, Petal.Length, Petal.Width columns should be of type numeric, and Species column should be of type character. Check if the columns of "datatype1" match their respective data types, fix the same if not correct.  |

In [2]:
# Place your answer here
datatype1 = data.frame(clean1) # Make a copy of "clean1" named "datatype1"
str(datatype1) # Check the data types of the columns
num_cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") # Refer to columns by name, since the column order may change
datatype1[num_cols] = lapply(datatype1[num_cols], as.numeric) # Coerce the four measurement columns to numeric
datatype1$Species = as.character(datatype1$Species) # Species should be of type character
str(datatype1) # Confirm that the data types are now correct

# Alternatively, convert each column explicitly
datatype1$Sepal.Length = as.numeric(datatype1$Sepal.Length)
datatype1$Sepal.Width = as.numeric(datatype1$Sepal.Width)
datatype1$Petal.Length = as.numeric(datatype1$Petal.Length)
datatype1$Petal.Width = as.numeric(datatype1$Petal.Width)



### Handling NA's for Arithmetic Operation
Arithmetic operations such as mean and sum return `NA` when the underlying data has missing values. There are many ways to handle missingness, and ignoring the NA's is one of them.
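
A small sketch of this behaviour, using a made-up vector `x` that is not part of the Prac data:

```r
# Hypothetical vector with one missing value
x <- c(1, 2, NA, 4)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # mean of the remaining values 1, 2 and 4
sum(x, na.rm = TRUE)   # 7
```

With `na.rm = TRUE` the missing values are dropped before the calculation, so the mean is taken over three values rather than four.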

|<center>TASK</center>|
| ---- |
| Find the mean of "Petal.Width" column of "datatype1". Note: consider using `na.rm` parameter of mean function to handle the NA's of this column. Store the result inside "mean_narm" |

In [None]:
# Place your answer here
mean_narm = mean(datatype1$Petal.Width, na.rm=TRUE)

### Data Missingness
R has several ways of representing missing data. The most common are `NA`, `NULL`, and empty strings (`""`).
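
The following sketch shows how these kinds of missingness behave differently. The vector `v` and the name `is_missing` are hypothetical and chosen only for illustration.

```r
# Hypothetical character vector with an NA and an empty string
v <- c("a", NA, "", "d")

is.na(v)   # TRUE only for the NA element
v == ""    # TRUE for the empty string, NA for the NA element

# Flag both kinds of missing value at once
is_missing <- is.na(v) | v == ""

is.null(NULL)          # TRUE
length(c(1, NULL, 3))  # 2, because NULL simply disappears from a vector
```

Because an empty string is an ordinary value rather than `NA`, functions such as `na.omit` will not remove it unless it is converted to `NA` first.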

|<center>TASK</center>|
| ---- |
| Make a copy of "datatype1" data frame, and name it "clean2". Remove the rows of "clean2" that contain missing values (NA, empty strings) in any of their columns. |

In [None]:
# Place your answer here
clean2 = data.frame(datatype1) # Make copy of "datatype1" named "clean2" 
clean2[clean2==""] = NA # Converts empty strings to NA
clean2 = na.omit(clean2) # Removes all rows which include NA

### Combining Data From JSON and CSV
In most situations, data needs to be combined from various sources such as CSV and JSON files. You can combine data frames loaded from different sources using the `rbind` function, provided they have the same structure (the same column names).
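
As a minimal sketch, assuming two hypothetical data frames `a` and `b` that share column names but list them in a different order, `rbind` can be used like this:

```r
# Two hypothetical data frames with the same columns in a different order
a <- data.frame(Species = "setosa", Petal.Width = 0.2,
                stringsAsFactors = FALSE)
b <- data.frame(Petal.Width = c(1.3, 2.1),
                Species = c("versicolor", "virginica"),
                stringsAsFactors = FALSE)

# For data frames, rbind matches columns by name and keeps the
# column order of the first argument
combined <- rbind(a, b)
combined
```

If the column names differ between the two data frames, `rbind` stops with an error instead of guessing how to align them.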

|<center>TASK</center>|
| ---- |
| Combine contents of "iris3.json" and "iris4.csv" using `rbind` function, and store the resultant dataframe inside "rbind1". <br> Import `JSON` data using `read_json` function of `jsonlite` package. Use `simplifyVector` argument of `read_json` function to read json correctly as a dataframe |

In [None]:
# Place your answer here
iris3 = jsonlite::read_json("datasets/iris3.json", simplifyVector=TRUE) # Load data from json file as "iris3"
iris4 = read.csv("datasets/iris4.csv", header=TRUE, stringsAsFactors=FALSE) # Load data from csv file as "iris4"
rbind1 = rbind(iris3, iris4) # Combine data frames "iris3" and "iris4"

### Filtering Data

|<center>TASK</center>|
| ---- |
| Store the first 5 rows and the last 10 rows of "d1" inside "select1"|

In [None]:
# Place your answer here
select1 = rbind(head(d1,5),tail(d1,10))

|<center>TASK</center>|
| ---- |
| Make a copy of "d1" data frame, and name it "select2". Remove all rows of "select2" whose Species is versicolor. |

In [None]:
# Place your answer here
select2 = data.frame(d1)
select2 = select2[!grepl("versicolor", select2$Species), ] # Uses grepl to keep all rows whose Species does not contain "versicolor"

# Alternatively, and more precisely (grepl matches substrings, whereas != compares whole strings):
select2 = select2[select2$Species != "versicolor", ]

|<center>TASK</center>|
| ---- |
| Make a copy of "d1" data frame, and name it "select3". Set "Petal.Length" column values of "select3" data frame to 0 whose Species is setosa. |

In [None]:
# Place your answer here
select3 = data.frame(d1)
select3$Petal.Length[select3$Species == "setosa"] = 0 # Sets Petal.Length to 0 for every row whose Species is exactly "setosa"

## 2. Factors
All categorical features should be stored as `factor` in R. Each category should be a `level` of that factor.
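
The role of levels can be sketched with a small made-up vector (the name `species` and its contents are hypothetical). Declaring the levels explicitly keeps categories that happen to be absent from the data.

```r
# Hypothetical character vector in which "versicolor" never occurs
species <- c("setosa", "virginica", "setosa")

# Without explicit levels, only the categories present become levels
table(factor(species))

# With explicit levels, the absent category is kept with a count of 0
counts <- table(factor(species,
                       levels = c("setosa", "versicolor", "virginica")))
counts["versicolor"]  # count is 0
```

This matters whenever a category has been filtered out of the data but should still appear in a summary.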

|<center>TASK</center>|
| ---- |
| Given that there are 3 distinct species of iris - "setosa", "versicolor", "virginica", find the count of occurrence of each Species in "select2" using `table` function. Store the resultant count inside "count1" vector. Name of each element of the vector should be its corresponding Species. Hint: you have to convert the column into `factor` and mention the `levels` before you perform the count. |

In [None]:
# Place your answer here
select2$Species = factor(select2$Species, levels=c("setosa", "versicolor", "virginica")) # Convert "Species" to a factor with all three levels, so a species with no rows is still counted (as 0)
count1 = table(select2$Species) # table tallies the rows in each level, and names each count by its Species

## 3. User Defined Function

User-defined functions let you package a custom operation so that it can be reused.
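
A minimal sketch of a user-defined function follows. The function name `range_width` is hypothetical and unrelated to the tasks in this Prac.

```r
# Hypothetical helper: width of the range of a numeric vector, ignoring NAs
range_width <- function(x) {
  # The value of the last expression is returned
  max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
}

range_width(c(3, NA, 10, 7))  # 7
```

An explicit `return(...)` is optional in R, since a function returns the value of its last evaluated expression.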

|<center>TASK</center>|
| ---- |
| Create a function called "fun1" which takes 2 square matrices as input parameters (x1,x2). The function should return the sum of the determinants of the matrices. Note: you DON'T have to check if "x1" and "x2" are square matrices|

In [None]:
# Place your answer here
fun1 <- function(x1,x2){
  return(det(x1)+det(x2)) # Returns the sum of the two determinants
}

## 4. Common/Important Functions and Operations 

### Basic Arithmetic Operation

|<center>TASK</center>|
| ---- |
| Find the sum of "Petal.Length" column of "d1". Store the result inside the "sum1" vector |

In [None]:
# Place your answer here
sum1 = sum(d1$Petal.Length) # Uses the sum function to sum over column, we have left na.rm as FALSE by default since there are no NA

### Apply Function
The `apply` function performs an operation on each row or each column of a data frame or matrix.
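
As an illustrative sketch on a hypothetical matrix `m`, the second argument of `apply` selects rows (`1`) or columns (`2`), and the third is the function to run on each of them:

```r
# Hypothetical 2 x 3 matrix, filled column by column:
# row 1 is 1 3 5, row 2 is 2 4 6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # one value per row: 9 12

# An anonymous function can be used for custom operations per column
col_ranges <- apply(m, 2, function(col) max(col) - min(col))  # 1 1 1
```

The same call works on a data frame, provided all the selected columns are numeric.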

|<center>TASK</center>|
| ---- |
| Find the "sum/median" (SUM divided by MEDIAN) of all the numeric columns of "d1" using `apply` function. Store the result inside "apply1". Note: you have to use `is.numeric` to find if a column is numeric or not. |

In [None]:
# Place your answer here
num_d1 = d1[sapply(d1, is.numeric)] # sapply(d1, is.numeric) flags which columns are numeric
apply1 = apply(num_d1, 2, function(col) sum(col) / median(col)) # For each numeric column, divide the sum by the median

### Aggregate Function
The `aggregate` function is useful when we want to compute a summary separately for each category of the data.
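
A minimal sketch of a per-category summary, assuming a hypothetical data frame `scores` with one grouping column and one numeric column:

```r
# Hypothetical data: two groups with two values each
scores <- data.frame(
  group = c("a", "b", "a", "b"),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# The grouping variable is passed as a (named) list;
# FUN is applied to the values of each group separately
totals <- aggregate(scores["value"], by = list(group = scores$group), FUN = sum)
totals
```

The result is a data frame with one row per group. The grouping values appear as an ordinary column, not as row names.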


|<center>TASK</center>|
| ---- |
| Find the sum of "Petal.Length","Petal.Width", "Sepal.Length", "Sepal.Width" of "d1" for each Species using `aggregate` function. The result should be a data frame where the column names are the names of the numeric columns of "d1" and the row names are the names of the Species. Store the result inside "agg1" |

In [None]:
# Place your answer here
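agg1 = aggregate(d1[sapply(d1,is.numeric)], list(Species=d1$Species), sum) # aggregate expects the grouping variable wrapped in a list
rownames(agg1) = agg1$Species # Use the Species names as row names, as the task requires
agg1$Species = NULL # Drop the grouping column so that only the numeric columns remain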

|<center>TASK</center>|
| ---- |
| Load "script1.R" using `source` function.<strong> You will be making use of the objects created inside this script in subsequent tasks. Please note that the contents of the objects inside this script will be changed during marking. So please don't hard-code your answers.</strong>  |

In [None]:
# Place your answer here
source("script1.R") # Loads the objects defined in "script1.R"

## 5. Lists

A list can store objects of any type. Lists are mainly used to return multiple objects from a function.
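
Returning several objects at once can be sketched as follows. The function `summarise_vec` and its element names are hypothetical.

```r
# Hypothetical function that returns three different objects in one list
summarise_vec <- function(x) {
  list(values = x, total = sum(x), average = mean(x))
}

res <- summarise_vec(c(2, 4, 6))
res$total         # 12, accessed by name
res[["average"]]  # 4, accessed by name with double brackets
res[[1]]          # the original vector, accessed by position
```

Double brackets (`[[ ]]`) extract the element itself, while single brackets (`[ ]`) return a shorter list.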

|<center>TASK</center>|
| ---- |
| Create a function called "fun2" which takes 2 matrices as input parameters (x1,x2). It should return a list where the 1st element of the list is the first matrix, the 2nd element is the second matrix, and the 3rd element is a vector containing the sum of the determinants of the two matrices.  |

In [None]:
# Place your answer here
fun2 <- function(x1,x2) {
  return(list(x1,x2,fun1(x1,x2))) # Reuses fun1 (defined above) to get the sum of the determinants of the two square matrices
}

### lapply & sapply
`lapply` and `sapply` iterate through the elements of a list and perform an operation on each one. `lapply` always returns a list, while `sapply` simplifies the result to a vector where possible.
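
The difference between the two can be sketched on a hypothetical list `mixed` that holds matrices alongside other objects. This toy list is an assumption for illustration and is not the "list1" object from the Prac script.

```r
# Hypothetical list: two square matrices mixed with other objects
mixed <- list(matrix(c(2, 0, 0, 3), nrow = 2), "text", 42, diag(2))

# Find out which elements are matrices before doing matrix operations
is_mat <- sapply(mixed, is.matrix)

dets_list <- lapply(mixed[is_mat], det)  # a list of determinants
dets_vec  <- sapply(mixed[is_mat], det)  # a numeric vector of determinants

sum(dets_vec)          # determinants are 6 and 1
sum(unlist(dets_list)) # a list must be flattened before summing
```

Filtering with `is.matrix` first avoids calling `det` on elements such as strings, which would stop the script with an error.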

|<center>TASK</center>|
| ---- |
| Find the sum of determinant of matrices present inside the "list1" (loaded from "script1.R") using `lapply`. Store the result inside "lapply1" vector. <br> Perform the same operation using `sapply` and store the result inside "sapply1" vector. <br> Note: you have to check if an element of "list1" is matrix or not. That is, "list1" might contain 10 elements, but only 4 of them might be a matrix, which needs to be checked. |

In [None]:
# Place your answer here


## 6. Loops

Loops can be used to 
* execute tasks "n" number of times
* iterate through elements of an object
* and so on
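
Both loop styles can be sketched on a hypothetical list `values`, adding up only the numeric elements. All names here are made up for illustration.

```r
# Hypothetical list with numeric and non-numeric elements
values <- list(1, "skip me", 5, TRUE, 10)

# for loop: iterate directly over the elements
total_for <- 0
for (item in values) {
  if (is.numeric(item)) {
    total_for <- total_for + item
  }
}

# while loop: iterate by index until the end of the list is reached
total_while <- 0
i <- 1
while (i <= length(values)) {
  if (is.numeric(values[[i]])) {
    total_while <- total_while + values[[i]]
  }
  i <- i + 1  # without this increment the loop would never end
}
```

Both loops give the same total. The `for` version is shorter because R handles the index, whereas the `while` version has to manage it explicitly.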

|<center>TASK</center>|
| ---- |
| Find the sum of determinant of matrices present inside the "list1" (loaded from "script1.R") using `for` loop. Store the result inside vector "for1". Perform the same operation using `while` loop. Store the result inside "while1" vector. Note: you have to check if an element of the list is matrix or not.  |

In [None]:
# Place your answer here


## 7. Sampling

Sampling is the process of selecting only a subset of the data for analysis. This may be due to
* the data being too large to process
* challenges in collecting the entire dataset


There are different types of sampling, such as
* Simple Random Sampling
* Weighted Sampling
* Stratified Sampling
* Systematic Sampling
* ...

The tasks below are based on sampling.
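
As a rough sketch of how the `sample` function supports these sampling types, the example below draws row indexes for a hypothetical data frame `toy_df`. The seed value and the weights are arbitrary choices for illustration.

```r
# Hypothetical data frame with 10 rows
toy_df <- data.frame(id = 1:10, value = seq(10, 100, by = 10))
n <- nrow(toy_df)

# Simple random sampling without replacement: no row can repeat
set.seed(1)
idx_without <- sample(seq_len(n), size = 5)

# Simple random sampling with replacement: rows may repeat
set.seed(1)
idx_with <- sample(seq_len(n), size = 5, replace = TRUE)

# Weighted sampling: prob gives one weight per row,
# and rows with weight 0 are never selected
weights <- c(rep(0, 5), rep(1, 3), rep(2, 2))
set.seed(1)
idx_weighted <- sample(seq_len(n), size = 20, replace = TRUE, prob = weights)

# Use the sampled indexes to pick the rows
sampled_rows <- toy_df[idx_weighted, ]
```

Calling `set.seed` before each call to `sample` makes the draw reproducible. The weights in `prob` do not need to sum to 1, because `sample` normalises them internally.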

### Simple Random Sampling

|<center>TASK</center>|
| ---- |
| Sample 80 rows `without replacement` from "d1". Store the result inside "sample1" data frame. Perform the same operation `with replacement` and store the result inside "sample2" data frame. Set the seed (using `set.seed` function) to 55 for both the cases. Note: perform sample operation using `sample` function  |

In [None]:
# Place your answer here


### Weighted Sampling

|<center>TASK</center>|
| ---- |
| Sample with replacement, 80 data points from "d1", such that versicolor data points have twice the weight of setosa data points, and virginica data points have twice the weight of versicolor data points. Combine the sampled rows inside "sample3" data frame. Set the seed to 55. Note: perform the sample operation using the `sample` function, and give the weights using the `prob` parameter of the sample function |

In [None]:
# Place your answer here


### Repeated Weighted Sampling

|<center>TASK</center>|
| ---- |
| Perform the above weighted sampling for 100 iterations, finding the mean of "Sepal.Length" for each iteration. Store the means inside "weighted_iterative" vector. The seed for each iteration is the iteration number itself. |

In [None]:
# Place your answer here


### Stratified Sampling

|<center>TASK</center>|
| ---- |
| Sample with replacement - 10 values from each of the three Species of "d1" data. Store the sampled data inside "stratified_setosa", "stratified_versicolor", and "stratified_virginica" data frames respectively. Set the seed to 55 for sampling. Perform sample operation using `sample` function, and use `prob` parameter of sample function to sample from a particular species alone. Hint: all data points with a corresponding weight of 0 will not be sampled. |

In [None]:
# Place your answer here


### Repeated Stratified Sampling

|<center>TASK</center>|
| ---- |
| Perform the above stratified sampling for 100 iterations, finding the mean of "Sepal.Length" for each iteration - for each species. The seed for each iteration is the iteration number itself. Store the sampled means inside "setosa_mean", "versicolor_mean", "virginica_mean" vectors respectively. Note: after 100 iterations, there will be 100 values in each of the vectors |

In [None]:
# Place your answer here


In [None]:
print("This Line gets printed if there is no major error, when Kernel -> Restart & Run All")