# Classifying Student Knowledge Levels Using Metrics of Academic Rigor: A Proposal

DSCI 100 010 W2023T2, Group 01 (Thomas Hesselbo, Debon Lee, Coe McGrath, Celine Xi)

### Introduction

Predicting a student’s knowledge level in a given subject is of pedagogical importance, as it would allow an instructor to identify students who need additional assistance. Accordingly, our project attempts to build a classifier for a student’s level of knowledge using their performance and study habits in a current subject and its prerequisite subjects. To accomplish this, we use the “User Knowledge Modeling” dataset from the UC Irvine Machine Learning Repository. The data set contains the variables STG, SCG, STR, LPR, PEG, and UNS. User knowledge (UNS), the class we aim to predict, takes one of four values: “very low”, “low”, “middle”, or “high”. STG and STR are the study times for current material and prerequisite material, respectively. SCG is the extent of repetition undergone by the student for the current subject. PEG is performance on exams for the current subject. Finally, LPR is the student’s knowledge in the prerequisite courses. The five predictor variables (STG, SCG, STR, LPR, PEG) are scaled from 0 to 1, and the data come pre-split into training and test sets.

### Preliminary Data Exploration

#### Loading Packages & Setting the Environment

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(readxl)
library(RColorBrewer)
options(repr.matrix.max.rows = 6)

#### Loading the Excel Data

In [None]:
# URL of the Excel workbook containing the dataset
url <- "https://github.com/TheABoss/DSCI-100-2023T2-Project-Group-001/raw/main/data/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls"

# Temporary file to hold the downloaded workbook
temp_xls <- tempfile(fileext = ".xls")

# Download the workbook; mode = "wb" keeps the binary file intact on Windows
download.file(url, destfile = temp_xls, mode = "wb")

# Read the training data (sheet 2) as a tibble
know_train <- read_excel(temp_xls, sheet = 2, range = "A1:F259")

# Read the test data (sheet 3) as a tibble
know_test <- read_excel(temp_xls, sheet = 3, range = "A1:F146")

#### Data Previewing, Cleaning, & Wrangling

In [None]:
# Preview both tibbles to confirm that the sheets were read correctly
know_train
know_test

In [None]:
# Check each tibble for missing values (TRUE means at least one NA is present)
any(is.na(know_train))
any(is.na(know_test))

The loaded data are already tidy (each row is a single observation, each column is a single variable, and each cell holds a single value), so no cleaning or wrangling was performed.
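Beyond tidiness, two quick sanity checks may be worth running before modelling: that every predictor really lies between 0 and 1, and that the UNS labels are spelled the same way in both splits. The sketch below is an illustration only. `check_split` is a hypothetical helper that is not part of the notebook, and the toy data frames stand in for `know_train` and `know_test` with made-up values, including a deliberately mismatched label.

```r
# Hypothetical helper (not part of the original notebook): report whether all
# predictors fall in [0, 1] and list the class labels found in a split.
check_split <- function(df, predictors = c("STG", "SCG", "STR", "LPR", "PEG")) {
  ranges <- sapply(df[predictors], range)  # 2 x p matrix: row 1 = min, row 2 = max
  list(
    in_unit_interval = all(ranges[1, ] >= 0 & ranges[2, ] <= 1),
    labels = sort(unique(df$UNS))
  )
}

# Toy stand-ins for know_train and know_test (values are made up)
toy_train <- data.frame(
  STG = c(0.00, 0.10, 0.30),
  SCG = c(0.20, 0.40, 0.60),
  STR = c(0.10, 0.50, 0.90),
  LPR = c(0.30, 0.30, 0.70),
  PEG = c(0.05, 0.50, 0.95),
  UNS = c("very_low", "Low", "High")
)
# Same rows, but with one label spelled differently on purpose
toy_test <- transform(toy_train, UNS = c("Very Low", "Low", "High"))

train_check <- check_split(toy_train)
test_check <- check_split(toy_test)

# Labels that appear in one split but not in the other
label_mismatch <- union(
  setdiff(train_check$labels, test_check$labels),
  setdiff(test_check$labels, train_check$labels)
)
label_mismatch
```

In the notebook itself, the same helper could be called as `check_split(know_train)` and `check_split(know_test)`. A non-empty `label_mismatch` would mean the labels need to be harmonized before the classifier is evaluated on the test set.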

#### Data Summary

##### Table

In [None]:
# Mean of each attribute and number of observations per knowledge level
know_train_avg <- know_train |>
    group_by(UNS) |>
    summarize(avgSTG = mean(STG),
              avgSCG = mean(SCG),
              avgSTR = mean(STR),
              avgLPR = mean(LPR),
              avgPEG = mean(PEG),
              Observations = n()) |>
    arrange(desc(avgSTG))
know_train_avg

##### Plot

In [None]:
# Pivot the training data to long format (one row per observation and attribute),
# then compute the mean and standard deviation of each attribute per knowledge level
reorder <- know_train |>
    pivot_longer(STG:PEG, names_to = "type", values_to = "value") |>
    group_by(UNS, type) |>
    summarise(avg = mean(value),
              sd = sd(value),
              .groups = "drop")

# Order the knowledge levels from very low to high (display labels are set in the plot)
reorder$UNS <- factor(reorder$UNS, levels = c("very_low", "Low", "Middle", "High"))

options(repr.plot.width = 10, repr.plot.height = 7)
# Bar chart of the average scaled value of each attribute, faceted by attribute
know_train_avg_plot <- ggplot(reorder, aes(x = UNS,
                                           y = avg,
                                           fill = UNS)) +
                        geom_col(position = "dodge") +
                        labs(x = "Knowledge Level",
                             y = "Average Scaled Value",
                             title = "Average Scaled Value of Each Attribute\n by Knowledge Level",
                             fill = "Knowledge Level") +
                        facet_grid(. ~ type) +
                        scale_x_discrete(labels = c("Very Low",
                                                    "Low",
                                                    "Middle",
                                                    "High")) +
                        scale_fill_brewer(palette = "Paired",
                                          labels = c("Very Low",
                                                     "Low",
                                                     "Middle",
                                                     "High")) +
                        theme(text = element_text(size = 16),
                              axis.text.x = element_text(angle = 50, hjust = 1))
know_train_avg_plot

The above graph summarizes the data set effectively, as it shows that the averages of the predictors PEG, SCG, STG, and STR increase with knowledge level (i.e., the average PEG of “very low” observations is lower than that of “low” observations, which is in turn lower than that of “middle” observations, and so on).

### Methods

To conduct our analysis, we will build a classifier using the K-nearest neighbours (KNN) algorithm. The predictors will be STG, SCG, STR, and PEG. We selected these predictors based on the above graph, as their averages show a roughly increasing trend as knowledge level moves from “very low” to “low” to “middle” to “high”. By the same logic, LPR does not show this trend: its average values for “very low” and “middle” observations are similar, as are those for “low” and “high”. In other words, based on the graph, LPR does not appear to distinguish between knowledge levels, so we exclude it as a predictor. We will present our results through two visualizations: a line graph of the estimated accuracy across different values of k, and a confusion matrix for the final classifier evaluated on the test data.
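As a rough illustration of how this workflow could look in tidymodels, the sketch below tunes k, plots estimated accuracy against k, and builds a confusion matrix. Several things here are assumptions rather than decisions made in this proposal: accuracy is estimated with 5-fold cross-validation, the grid of k values is arbitrary, the seed is arbitrary, and the data are simulated with only two classes so that the block runs on its own. It also assumes the kknn package is installed. In the actual analysis, `know_train` and `know_test` would take the place of the simulated split.

```r
library(tidymodels)  # the "kknn" engine also requires the kknn package

set.seed(2023)  # arbitrary seed, used only to make the sketch reproducible

# Simulated stand-in for the real data (values are made up)
n <- 120
toy <- tibble(STG = runif(n), SCG = runif(n), STR = runif(n), PEG = runif(n)) |>
  mutate(UNS = factor(if_else(PEG > 0.5, "High", "Low")))

toy_split <- initial_split(toy, prop = 0.75, strata = UNS)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Predictors are already on a 0-1 scale; scaling and centring is kept as a safeguard
knn_recipe <- recipe(UNS ~ STG + SCG + STR + PEG, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the number of neighbours left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Assumption: 5-fold cross-validation over an arbitrary grid of odd k values
folds <- vfold_cv(toy_train, v = 5, strata = UNS)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Line graph of estimated accuracy against k
accuracy_vs_k <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbours (k)", y = "Estimated accuracy")

# Refit with the best k and evaluate on the held-out rows
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = toy_train)

knn_conf_mat <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test) |>
  conf_mat(truth = UNS, estimate = .pred_class)
```

Printing `accuracy_vs_k` and `knn_conf_mat` would display the two outputs described above. Tuning only on the training rows keeps the test set untouched until the final evaluation.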

### Expected Outcomes and Significance

We expect that our classifier will achieve a higher accuracy than the majority classifier, as well as high recall and precision for each knowledge level. If these expectations hold, instructors could use the classifier to identify students in need of assistance. Future work could improve the classifier by investigating other predictors of a student’s knowledge level in a given subject. We could also investigate models other than KNN, as they may better predict a student’s level of understanding.
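To make these evaluation criteria concrete, the following base R sketch computes the majority-classifier baseline, overall accuracy, and per-class precision and recall from a confusion table. The true and predicted labels are made up for illustration and are not results from our classifier. For simplicity, the baseline is computed on the same labels, whereas in practice the majority class would be taken from the training data.

```r
# Hypothetical true and predicted labels (made up for illustration)
levels_uns <- c("very_low", "Low", "Middle", "High")
truth <- factor(c("Low", "Low", "Low", "Middle", "Middle", "High", "High", "very_low"),
                levels = levels_uns)
pred <- factor(c("Low", "Low", "Middle", "Middle", "Middle", "High", "Low", "very_low"),
               levels = levels_uns)

# Majority classifier: always predicts the most common class,
# so its accuracy equals the proportion of that class
majority_accuracy <- max(table(truth)) / length(truth)

# Confusion table with predictions in rows and true labels in columns
conf <- table(predicted = pred, truth = truth)

# Overall accuracy: correct predictions lie on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# Precision: of the observations predicted as a class, the share that truly belong to it
precision <- diag(conf) / rowSums(conf)

# Recall: of the observations truly in a class, the share that were predicted as it
recall <- diag(conf) / colSums(conf)

round(rbind(precision, recall), 2)
c(majority = majority_accuracy, classifier = accuracy)
```

With four classes, reporting precision and recall per class shows whether the classifier struggles with any particular knowledge level, which a single accuracy figure can hide.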