# R workshop
authors: 
* Christian Wanschers 
* Bastiaan Verheul
* Nino van Alphen
* Sara Eftekhar Azam


# Introduction
In this workshop, we are going to teach you the basics of R and its statistical capabilities!

## Loading packages
First, we have to install some packages.
The packages we are going to use are:

* `tidyverse`: A collection of R packages designed for data science. It includes some example datasets too.
* `ggplot2`: Used for creating elegant data visualizations.
* `dplyr`: Provides a set of functions for data manipulation tasks such as filtering, grouping, summarizing, and mutating data frames.
* `magrittr`: Provides the pipe operator `%>%`, which allows you to chain together multiple operations.

All packages can be installed by running the commands below, either in an .R file or in a code block:

In [None]:
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("magrittr")

After installing the packages, we load them with `library()` so that we can use them later:

In [None]:
library(tidyverse)
library(ggplot2)
library(dplyr)
library(magrittr)
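
To make the idea of chaining operations concrete, here is a minimal sketch of the pipe operator on a small made-up vector. The variable names `values`, `nested`, and `piped` are only illustrative.

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested calls read inside-out: sqrt() runs first, then sum()
nested <- sum(sqrt(values))

# The same computation with the pipe reads left to right
piped <- values %>%
  sqrt() %>%
  sum()

print(piped)
```

Both versions compute the same number; the piped version simply lists the steps in the order in which they happen.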

## Loading data and initial exploration
The packages we are using ship with a variety of useful datasets. To list the datasets in a package, run the following code snippet:

In [None]:
data(package = "ggplot2")

We will be using the `Bachelor_Degrees_Majors.csv` dataset, which contains information regarding the enrollment of college students in various states across the US. We can import the dataset with the following code:


In [None]:
data <- read.csv("Bachelor_Degrees_Majors.csv")

Let's show the first few rows using the `head()` function:

In [None]:
head(data)

The count columns are read as character strings because their values contain commas as thousands separators. We need to remove the commas and convert these columns to the numeric type. We can do this with the following code:


In [None]:
numeric_cols <- c("Bachelor.s.Degree.Holders", "Science.and.Engineering", "Science.and.Engineering.Related.Fields", "Business", "Education", "Arts..Humanities.and.Others")
data[, numeric_cols] <- lapply(data[, numeric_cols], function(x) as.numeric(gsub(",", "", x)))
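
The comma removal can be tried out on a small vector first. This is a minimal sketch with made-up values (`raw_counts` is hypothetical and not taken from the dataset):

```r
# Hypothetical values formatted like the count columns in the CSV
raw_counts <- c("1,234", "56,789", "100")

# Without cleaning, values containing commas become NA (R also emits a warning)
uncleaned <- suppressWarnings(as.numeric(raw_counts))

# gsub() removes every comma, after which the conversion succeeds
cleaned <- as.numeric(gsub(",", "", raw_counts))

print(uncleaned)
print(cleaned)
```

The first result contains `NA` for every value with a comma, which is why the commas have to be stripped before calling `as.numeric()`.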

Why not just copy and paste the column names from the CSV file? Because of the naming convention used in R. A syntactically valid name may contain:
* letters
* numbers
* dots
* underscores

It does **NOT** allow **spaces** or **apostrophes**; each of them is automagically replaced with a dot.

When you read a .csv file, R will automagically convert the column headers to syntactically valid names!
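
The conversion can be reproduced with base R's `make.names()`, which `read.csv()` applies to the headers when `check.names = TRUE` (the default). The header spellings below are an assumption about what the CSV file contains:

```r
# Column headers as they might appear in the CSV file (assumed spelling)
csv_headers <- c("Bachelor's Degree Holders",
                 "Science and Engineering",
                 "Arts, Humanities and Others")

# make.names() replaces each invalid character with a dot
valid_names <- make.names(csv_headers)
print(valid_names)
```

Note that the comma and the space in "Arts, Humanities" each become a dot, which explains the double dot in `Arts..Humanities.and.Others`.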

The dataset contains rows for `Male`, `Female`, and `Total`. `Total` is redundant, so we remove the rows where `Sex` is `Total`.

In [None]:
# Rows with Sex == "Total" are the sum of the Male and Female rows.
# We remove them so they are not counted twice in our plots.
data <- data %>%
  filter(Sex != "Total")

In [None]:
# View the first few rows of the dataset
head(data)

In [None]:
# Display the structure of the dataset
str(data)

In [None]:
# Summary statistics for numerical variables
summary(data)

# Demo 1: EDA

## Group, filter, and plot data
In R, grouping is similar to utilizing the `GROUP BY` clause in SQL, while filtering is akin to applying conditions with the `WHERE` clause in SQL. 
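
As a rough illustration of the SQL analogy, the sketch below runs a filter and a grouped summary on a tiny made-up data frame. The data frame `toy` and its columns are hypothetical; the comment shows the approximately equivalent SQL query.

```r
library(dplyr)

# Hypothetical toy data, not taken from the workshop dataset
toy <- data.frame(
  sex       = c("Female", "Female", "Female", "Male"),
  age_group = c("18 to 24", "18 to 24", "25 to 39", "18 to 24"),
  count     = c(10, 30, 50, 100)
)

# Roughly: SELECT age_group, AVG(count) AS mean_count
#          FROM toy WHERE sex = 'Female' GROUP BY age_group
toy_summary <- toy %>%
  filter(sex == "Female") %>%      # WHERE
  group_by(age_group) %>%          # GROUP BY
  summarise(mean_count = mean(count))

print(toy_summary)
```

The result has one row per age group, with the male row excluded before the averages are computed.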

Creating plots in R is straightforward and user-friendly. 

For our current task, we will create a visualization comparing `Science and Engineering` with `Business` using the `ggplot2` package. It lets us show a third variable in the same plot by mapping it to color.

Let's look at the result of the code below:

In [None]:
ggplot(data, aes(x = Science.and.Engineering, y = Business, color = Age.Group)) +
  geom_point() +  # Add points for each data point
  scale_color_discrete(name = "Age Group") +  # Custom color legend title
  labs(x = "Science and Engineering", y = "Business") +  # Labels for x and y axes
  ggtitle("Scatter Plot of Science and Engineering vs. Business")  # Title for the plot

What did the code do? 
1. `ggplot()` is called with `data`, and `aes()` maps the two columns to the X and Y axes. `Age.Group` is used as the color differentiator.
2. A few layers and settings are added to the `ggplot()` call with `+`:
   * `geom_point()`: Creates the scatter plot by drawing one point per row
   * `scale_color_discrete()`: Sets the title of the color legend
   * `labs()`: Sets custom labels for the axes
   * `ggtitle()`: Sets the title of the whole graph

# Assignment 1: EDA
Using the code above as an example, create a plot of diamond carat versus price from the `diamonds` dataset in `ggplot2`, and show the cut as color (or create any other plot of your choice).

The example tasks below show how to group, filter, and plot data.

## Task 1
Compare the average `Science and Engineering` count across age groups.
Only consider female students.
Plot a bar chart to show the results. 

Let's do this with the pipe operator `%>%`:

In [None]:
task1 <- data %>% 
  filter(Sex == "Female") %>%
  group_by(Age.Group) %>%
  summarise(mean_sci = mean(Science.and.Engineering))

ggplot(task1, aes(x = Age.Group, y = mean_sci)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Age Group", y = "Mean Science and Engineering") +
  ggtitle("Mean Science and Engineering by Age Group")

## Task 2
Plot `Business` against `Education` in a scatter plot for male students in the age group 25 to 39.


In [None]:
task2 <- data %>%
  filter(Age.Group == "25 to 39", Sex == "Male")
ggplot(task2, aes(x = Business, y = Education)) +
  geom_point() +  # Add points
  labs(x = "Business", y = "Education") +
  ggtitle("Business vs Education")

# Demo 2: Regression

We are going to analyse the relationship between the number of `Bachelor's Degree Holders` and the number of `Science and Engineering` degree holders using a linear regression model. 

First select the columns to be used in the model:

In [None]:
# Select relevant columns
selected_data <- data[, c("Bachelor.s.Degree.Holders", "Science.and.Engineering")]

Make sure the selection is stored as a data frame using `as.data.frame()`:

In [None]:
selected_data <- as.data.frame(selected_data)

Split the data into a training set (70%) and a test set (30%):

In [None]:
set.seed(123)  # For reproducibility
sample_index <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
train_data <- data[sample_index, ]
test_data <- data[-sample_index, ]
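
To see what the split does, the same steps can be run on a tiny made-up data frame. This is only a sketch; `toy` and the `toy_*` names are hypothetical.

```r
set.seed(123)  # For reproducibility

# Hypothetical toy data frame with 10 rows
toy <- data.frame(id = 1:10, value = (1:10)^2)

# Draw 70% of the row indices at random, without replacement
train_size <- floor(0.7 * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), size = train_size)

toy_train <- toy[train_index, ]   # rows that were sampled
toy_test  <- toy[-train_index, ]  # all remaining rows

print(nrow(toy_train))
print(nrow(toy_test))
```

Negative indexing with `-train_index` selects every row that was not sampled, so the two sets never share a row.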

Now fit the actual model:

In [None]:
model <- lm(Bachelor.s.Degree.Holders ~ Science.and.Engineering, data = train_data)

Just like that? Yep, just like that. The `lm()` function determines which model algorithm is used: `lm` stands for "linear model". In the formula, the response goes on the left of `~` and the predictor on the right.
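
A minimal sketch of how `lm()` behaves, using made-up data that follows a known line exactly, so the fitted coefficients can be checked by eye. The names `toy`, `toy_model`, and `toy_prediction` are hypothetical.

```r
# Hypothetical data that follows y = 3 + 2 * x exactly
toy <- data.frame(x = 1:10)
toy$y <- 3 + 2 * toy$x

# Formula syntax: response on the left of ~, predictor(s) on the right
toy_model <- lm(y ~ x, data = toy)

# Estimated intercept and slope
print(coef(toy_model))

# Predict the response for a new, unseen x value
new_point <- data.frame(x = 11)
toy_prediction <- predict(toy_model, newdata = new_point)
print(toy_prediction)
```

The model recovers an intercept of about 3 and a slope of about 2, and therefore predicts roughly 25 for `x = 11`.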

We can show the performance of the model:

In [None]:
summary(model)

The output of `summary()` shows that the model scored an adjusted $R^2$ of 0.9209.
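
For readers who want to know where that number comes from, the sketch below recomputes $R^2$ and adjusted $R^2$ by hand. It uses R's built-in `mtcars` dataset rather than the workshop data, so the values differ from the one reported above.

```r
# Fit a simple linear model on R's built-in mtcars dataset (not the workshop data)
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

n <- nrow(mtcars)  # number of observations
p <- 1             # number of predictors (wt)

# R^2 = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - rss / tss

# Adjusted R^2 penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(manual = adj_r2, from_summary = fit_summary$adj.r.squared))
```

The manual value matches the one stored in `summary(fit)$adj.r.squared`.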

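We can optionally calculate the mean squared error (MSE) on the test set:

In [None]:
# Predict the target for the held-out test rows
predictions <- predict(model, newdata = test_data)
mse <- mean((predictions - test_data$Bachelor.s.Degree.Holders)^2)
print(paste("Mean Squared Error:", mse))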
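
The MSE is the average of the squared differences between predicted and actual values. A small worked sketch with made-up numbers (all `toy_*` names are hypothetical):

```r
# Hypothetical actual and predicted values
toy_actual    <- c(10, 20, 30, 40)
toy_predicted <- c(12, 18, 30, 44)

toy_errors <- toy_predicted - toy_actual  # 2, -2, 0, 4
toy_mse    <- mean(toy_errors^2)          # (4 + 4 + 0 + 16) / 4
toy_rmse   <- sqrt(toy_mse)               # same unit as the target variable

print(toy_mse)
print(toy_rmse)
```

Because the errors are squared, the MSE is expressed in squared units; taking the square root (RMSE) brings it back to the unit of the target variable.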

Using `ggplot`, we can also draw a scatter plot of the two columns and include the regression line in one go:

In [None]:
# Scatter plot with regression line
ggplot(selected_data, aes(x = Science.and.Engineering, y = Bachelor.s.Degree.Holders)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Simple Linear Regression Analysis",
       x = "Science and Engineering Degree Holders",
       y = "Bachelor's Degree Holders")

# Assignment 2: Regression

Instead of using a linear regression model, it is now up to you to use another regression model. Let's try Random Forest regression this time. Feel free to add more blocks as required.

In [None]:
# Install the required package
install.packages("randomForest")

In [None]:
# Import the randomForest library

In [None]:
# Pick 2 different columns from the dataset and convert them to numeric values

In [None]:
# Select the columns for regression and store in a variable

In [None]:
# Train the RandomForest model

# Demo 3: Classification

In this section, we will demonstrate how to perform classification using a simple decision tree classifier. We will predict a category derived from a field of study (here, whether the `Science and Engineering` count is high or low) based on the number of bachelor's degree holders. 

First, we need to install and load the required package `rpart` for creating decision trees.

In [None]:
install.packages("rpart")
library(rpart)

Next, let's prepare our dataset. We will use the `data` object that was created earlier. We will select relevant columns and create a new categorical column for classification. For simplicity, we will create a binary classification problem by categorizing the `Science and Engineering` field into 'High' and 'Low' based on the median value.

In [None]:
# Create a binary classification target variable
median_sci_eng <- median(data$Science.and.Engineering, na.rm = TRUE)
data$SciEng_Category <- ifelse(data$Science.and.Engineering >= median_sci_eng, "High", "Low")

# Convert the new column to a factor
data$SciEng_Category <- as.factor(data$SciEng_Category)
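
The median split can be checked on a handful of made-up numbers. This is only a sketch; `counts`, `cutoff`, and `category` are hypothetical names.

```r
# Hypothetical counts
counts <- c(10, 50, 30, 90, 70)

cutoff <- median(counts)  # middle value of the sorted counts

# Values at or above the median are labelled "High", the rest "Low"
category <- factor(ifelse(counts >= cutoff, "High", "Low"))

print(cutoff)
print(table(category))
```

Because the comparison uses `>=`, a value equal to the median lands in "High", so the two classes are not always exactly the same size.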

Let's display the first few rows of the modified dataset to ensure our new column has been added correctly.

In [None]:
head(data)

Next, let's split the dataset into training and testing sets. 70% of the data will be used for training and the remaining 30% for testing.

In [None]:
set.seed(123)  # For reproducibility
sample_index <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
train_data <- data[sample_index, ]
test_data <- data[-sample_index, ]

We can now train the decision tree classifier using the `rpart` function, with `SciEng_Category` as the target variable and `Bachelor's Degree Holders` as the predictor.

In [None]:
# Train the decision tree classifier
model <- rpart(SciEng_Category ~ Bachelor.s.Degree.Holders, data = train_data, method = "class")

Let's visualize the decision tree.

In [None]:
install.packages("rpart.plot")  # Separate package, not installed together with rpart
library(rpart.plot)
rpart.plot(model)

Now, let's make predictions on the test set and evaluate the model's performance.

In [None]:
# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")

# Evaluate the model's performance
confusion_matrix <- table(Predicted = predictions, Actual = test_data$SciEng_Category)
print(confusion_matrix)

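The confusion matrix shows how well the model performs: rows are the predicted classes, columns are the actual classes, and the counts on the diagonal are the correct predictions.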

To calculate the accuracy of the model, we can use the following code:

In [None]:
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
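
To see how the diagonal of the confusion matrix turns into an accuracy, here is a small sketch with made-up labels (all names are hypothetical):

```r
# Hypothetical predicted and actual labels; fixed levels keep the table square
levels_used <- c("High", "Low")
toy_predicted <- factor(c("High", "High", "Low", "Low", "High"), levels = levels_used)
toy_actual    <- factor(c("High", "Low",  "Low", "Low", "High"), levels = levels_used)

toy_matrix <- table(Predicted = toy_predicted, Actual = toy_actual)
print(toy_matrix)

# Correct predictions sit on the diagonal
toy_accuracy <- sum(diag(toy_matrix)) / sum(toy_matrix)
print(toy_accuracy)
```

Four of the five predictions match the actual label, so the accuracy is 0.8. The single off-diagonal count is the case that was predicted "High" but was actually "Low".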

We have successfully built and evaluated a decision tree classifier.

# Assignment 3: Classification

Using the code from the demo as an example, create a classification model using a different algorithm, such as k-Nearest Neighbors (k-NN) or Random Forest. Choose another categorical target variable and appropriate predictor variables from the dataset. Feel free to add more code blocks.

In [None]:
# Install the required package for k-NN
install.packages("class")

In [None]:
# Import the class library for k-NN

In [None]:
# Prepare the dataset for k-NN classification

# Demo 4: Clustering

R can also do clustering. We are going to show you how to run `HDBSCAN`!

It requires just one extra package. The code below installs and loads the required `dbscan` package:

In [None]:
install.packages("dbscan")
library(dbscan)

Since we already prepared the dataset, we could skip the preprocessing, the column selection, and storing the result as a data frame. Just to be sure, here is the complete code to do all of that in one block:

In [None]:
# Convert necessary columns to numeric
data$Science.and.Engineering <- as.numeric(gsub(",", "", data$Science.and.Engineering))
data$Business <- as.numeric(gsub(",", "", data$Business))
data <- na.omit(data)  # Remove any rows with NA values

# Select columns for clustering
selected_data <- data[, c("Science.and.Engineering", "Business")]

Run the actual `HDBSCAN`:

In [None]:
hdbscan_result <- hdbscan(selected_data, minPts = 5)

Print the result:

In [None]:
print(hdbscan_result)

Add the cluster labels to the data (in `dbscan`, cluster `0` marks noise points):

In [None]:
data$cluster <- as.factor(hdbscan_result$cluster)

Plot the result using `ggplot()` as before:

In [None]:
# Plot the clusters
ggplot(data, aes(x = Science.and.Engineering, y = Business, color = cluster)) +
  geom_point() +
  labs(title = "HDBSCAN Clustering of Science and Engineering vs. Business",
       x = "Science and Engineering",
       y = "Business") +
  scale_color_discrete(name = "Cluster") +
  theme_minimal()

# Assignment 4: Clustering (Optional)