# **Students' Knowledge Status**

**Group: Liam Brennan, Eva He, Li-Kun Lin, Steve He** 

## Introduction:

The objective of any class should be to increase students' understanding of the subject. Students with a strong understanding of class material will enter the workforce with the tools they need to succeed in their fields. According to Papanikolaou, a professor at the School of Pedagogical & Technological Education, “learners' knowledge level is used as valuable information to represent learners' current state” (Papanikolaou et al., 2002). Thus, we will build a KNN model to classify students' knowledge level based on five quantitative variables taken from Kahraman et al.'s “User Knowledge Modeling” dataset.

**Question**: Can we predict a student's knowledge level based on their study habits and exam performance using the KNN classification algorithm?

Data Set information: 
The dataset was collected by Kahraman et al. The weighting system and the quantitative measurements for the variables were developed using Kahraman's rule-based system, which assigns quantitative values (ratings) to students' performance on certain academic parameters (Kahraman et al., 2013). The parameters are:


STG: Study time rating (0-1), the amount of time spent studying Electrical DC Machines.

SCG: Repetition rating (0-1), the amount of problems and material the student worked on, for example worksheets and tutorials.

PEG: Exam performance rating for the subject (0-1), in this case exam performance in the Electrical DC Machines course.

STR: Study time rating for related subjects (0-1), the amount of time the student spent studying related topics.

LPR: Exam performance rating in related subjects (0-1), exam performance on related material or background information.

UNS: Student understanding level. Based on the weighting system Kahraman uses in his rule-based system paper, it is classified as “Very Low”, “Low”, “Middle”, or “High” understanding of Electrical DC Machines (Kahraman et al., 2013).

## Methods & Results:

### 1. Loading all the packages for data analysis
This analysis uses the KNN classification algorithm to predict the knowledge level (High, Middle, Low, or Very Low) of students. First, all the packages needed to run this algorithm are loaded into R.

In [None]:
install.packages("themis")
install.packages("kknn")
install.packages("cowplot")
library(kknn)
library(purrr)
library(tidyverse)
library(repr)
library(tidymodels)
library(themis)
library(cowplot)
options(repr.matrix.max.rows = 6)

### 2. Reading the data from the web and preparing it for analysis

#### (A) Loading the data
After loading the data, we convert the variable we want to predict, UNS, to a factor using the function `as.factor`.

In [None]:
set.seed(1234) 
options(repr.plot.height = 5, repr.plot.width = 6)

##Loading data
url <- "https://github.com/JackyLinllk/ubc_dsci100_assignment/raw/main/data/Data_User_Modeling_Dataset_full.csv"
knowledge_data<-read_csv(url)|>
    select(STG:UNS)|>
    mutate(UNS = as.factor(UNS))
knowledge_data

##### Table 1: Dataset Of Students' Knowledge Status
    This table represents the students' knowledge status about the subject of Electrical DC Machines, encompassing key metrics such as Study Time Rating (STG), Repetition Rating (SCG), Study Time Rating of related subjects (STR), Exam Performance Rating in related subjects (LPR), Exam Performance Rating of the subject (PEG), and their overall Knowledge Level (UNS).

#### (B) NA-values
We then checked for NA values and found that the last row contains NA values across all the variables, so we simply deleted the last row of the dataset.

In [None]:
##Check NA rows
na_row= which(!complete.cases(knowledge_data))
print(na_row)

##Delete the NA row of the data
knowledge_data= knowledge_data[-nrow(knowledge_data),]
knowledge_data

##### Table 2: Dataset Of Students' Knowledge Status with removed NA-Values
    Data set after the NA-Values are removed, with the same variables(STG SCG STR LPR PEG UNS) as table 1
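
The NA check above can also be sketched in base R on a small made-up data frame. The name `toy` and its values are illustrative assumptions, not the project data. Filtering on `complete.cases` removes incomplete rows wherever they occur, so it does not rely on the NA row being the last one.

```r
# Hypothetical toy data: the last row is entirely NA, as in the raw file
toy <- data.frame(
  STG = c(0.10, 0.25, NA),
  PEG = c(0.90, 0.33, NA),
  UNS = c("High", "Low", NA)
)

# Row indices that contain at least one missing value
na_rows <- which(!complete.cases(toy))
print(na_rows)

# Keep only the complete rows, wherever the incomplete ones are located
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

Whether to drop by position or by completeness is a design choice; the position-based version used in this report works here because the only incomplete row is the last one.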

#### (C) Preliminary Analysis and Final Prep
We check the tidiness of the data in Table 2 against three criteria: (1) each row is a single observation, (2) each column is a single variable, and (3) each value is in a single cell.

After confirming that the data is tidy, we use `group_by` with `summarize` to find the number of observations in each factor level and the corresponding proportion (Table 3).

We found that in the categorical variable UNS, the factor levels “very_low” and “Very Low” should be the same level. Thus, we apply `mutate` with `fct_recode` to merge the two factor levels into one.

On top of that, we found that the classes are imbalanced, meaning the knowledge levels do not have equal numbers of observations. Thus, we apply `recipe` with `step_upsample` to rebalance the rarer classes, notably “Very Low”, by oversampling. Then we again use `group_by` with `summarize` to check that every class now has the same number of observations.

In [None]:
## Creating the Summary table and checking for Proportion 
class_prop = knowledge_data|>
  group_by(UNS)|>
  summarize(count = n(),
            percentage= count/nrow(knowledge_data))
class_prop


##### Table 3: Summary Statistics of The number of Observations Knowledge Levels
    
    This table displays summary statistics for each knowledge level(UNS), including the count of observations(count) and the corresponding proportion(percentage).

In [None]:
## Typo in the dataset: "very_low" and "Very Low" should be the same factor level
knowledge_data= knowledge_data|>
  mutate(UNS = fct_recode(UNS, "Very Low" =  "very_low"))

##Balancing the data
ups_recipe= recipe(UNS~. ,data=knowledge_data)|>
  step_upsample(UNS, over_ratio=1, skip=F)|>
  prep()
upsampled_knowledge=bake(ups_recipe, knowledge_data)

##Checking the balance
upsampled_knowledge|>
  group_by(UNS)|>
  summarize(n=n())


##### Table 5: Summary Statistics of The number of Observations Knowledge Levels After Balancing the Data
    This table presents summary statistics for each knowledge level (UNS) with the updated count of observations following the upsampling process.
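
As a rough illustration of what oversampling to a ratio of 1 does, the following base R sketch resamples each smaller class with replacement until it matches the largest class. The toy counts and the names `toy`, `target`, and `upsampled` are assumptions for illustration; this is not the `themis` implementation, only the general idea.

```r
set.seed(1234)

# Hypothetical imbalanced labels: 6 "Low", 4 "Middle", 2 "Very Low"
toy <- data.frame(
  id  = 1:12,
  UNS = factor(rep(c("Low", "Middle", "Very Low"), times = c(6, 4, 2)))
)

# Target size per class: the size of the largest class (ratio of 1)
target <- max(table(toy$UNS))

# For each class, keep all original rows and draw extra rows with replacement
upsampled <- do.call(rbind, lapply(split(toy, toy$UNS), function(d) {
  extra <- d[sample(nrow(d), target - nrow(d), replace = TRUE), ]
  rbind(d, extra)
}))

# Every class now has the same count; minority rows appear more than once
print(table(upsampled$UNS))
```

Note that the added rows are exact copies of existing observations, so the `id` column of the smaller classes contains repeats after this step.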

### 3. Splitting the data into training (70%) and testing (30%) sets

#### A) Splitting the data 
We split the data with the `initial_split` function to create two subsets, a training set and a testing set. Inside `initial_split`, we set the `strata` argument to the categorical variable UNS and `prop` to 0.70. The `training` and `testing` functions then extract two data frames containing 70% and 30% of the observations, respectively.

In [None]:
##Split the data into training set and testing set
knowledge_data_split <- initial_split(upsampled_knowledge, prop = 0.70, strata = UNS)  
knowledge_data_split_train <- training(knowledge_data_split)
knowledge_data_split_test <- testing(knowledge_data_split)

knowledge_data_split_train

##### Table 6: The Training set of Student's Knowledge 
    This table represents the training set we're utilizing for model training, encompassing variables such as STG SCG STR LPR PEG UNS(see intro for more info). 

In [None]:
knowledge_data_split_test

##### Table 7: The Testing set of Student's Knowledge 
    This table represents the testing set we're utilizing to evaluate our models with, encompassing variables such as STG SCG STR LPR PEG UNS(see intro for more info). 
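
The idea behind a stratified 70/30 split can be sketched in base R as follows. The toy data and the names `toy`, `train_ids`, `toy_train`, and `toy_test` are illustrative assumptions, and the sampling details of `initial_split` may differ; the point is only that the 70% is drawn separately within each class.

```r
set.seed(1234)

# Hypothetical balanced labels: 10 rows in each of two classes
toy <- data.frame(
  id  = 1:20,
  UNS = factor(rep(c("High", "Low"), each = 10))
)

# Within each class, pick 70% of the row ids for training
train_ids <- unlist(lapply(split(toy$id, toy$UNS), function(ids) {
  ids[sample.int(length(ids), size = round(0.70 * length(ids)))]
}))

toy_train <- toy[toy$id %in% train_ids, ]
toy_test  <- toy[!toy$id %in% train_ids, ]

# Both subsets keep the original class proportions
print(table(toy_train$UNS))
print(table(toy_test$UNS))
```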

#### B) Preliminary Analysis on the Training set 
We created a summary table of the number of observations in each knowledge level using `group_by` on UNS and `summarize`, to check whether the knowledge levels are balanced. We also created plots that compare each of the other predictors against PEG, with PEG on the y-axis and the other variable on the x-axis. We used PEG as the common base for comparison across the plots.

In [None]:
# creating the Summary table
data_training_summary <- knowledge_data_split_train|>
    group_by(UNS)|>
    summarize(count = n())
data_training_summary

##### Table 7: Summary Statistics of The number of Observations Knowledge Levels of Training data
    This table represents the summary statistics of the number of observations(count) in each knowledge level(UNS).

We see that the count of each Knowledge level is the same, and we won't have to worry about it being unbalanced.

Next we created the plots

In [None]:
options(repr.plot.height = 8, repr.plot.width = 15)
# creating plot for comparing PEG with LPR
data_plot_LPR_PEG <- ggplot(knowledge_data_split_train, aes(x= LPR,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Exam performance Rating \n in Related Subject (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Exam performance in Related Subject")+
    theme(text = element_text(size = 15))

# creating plot for comparing PEG with STG
data_plot_STG_PEG <- ggplot(knowledge_data_split_train, aes(x= STG,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Study time rating (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Study time rating")+
    theme(text = element_text(size = 15))

# creating plot for comparing PEG with SCG
data_plot_SCG_PEG <- ggplot(knowledge_data_split_train, aes(x= SCG,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Repetition rating (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Repetition rating ")+
    theme(text = element_text(size = 15))

# creating plot for comparing PEG with STR
data_plot_STR_PEG <- ggplot(knowledge_data_split_train, aes(x= STR,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Study time rating \n in Related Subject (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Study time rating in Related Subject")+
    theme(text = element_text(size = 15))


# putting all the plots into one
compare_plot <- plot_grid(data_plot_LPR_PEG, data_plot_STG_PEG, data_plot_SCG_PEG, data_plot_STR_PEG, ncol = 2)
compare_plot

##### Plot 1: The Relationship Between the Variables  
    This plot visualizes the relationships of PEG (Y-axis) with all other variables (excluding UNS), Colored by knowledge level.

It is clear from Plot 1 that PEG combined with any of the other variables can distinguish between the different knowledge levels. Some combinations separate the levels better than others, such as "Exam Performance vs Exam performance in Related Subject", but generally speaking all of them work. Therefore, any combination of PEG with the other variables can be used as predictors.

### 4. Parameter selection: Finding the best K value
Our next step is to find the K value and select the predictor variables that maximize the accuracy of our model.

First, we apply the `nearest_neighbor`, `set_engine`, and `set_mode` functions to create a model specification. Inside the `nearest_neighbor` function, the argument `weight_func` is set to `"rectangular"`, which means each of the K neighbors gets an equal vote. For the `neighbors` argument, `tune()` tells the framework to try different values of K rather than fixing one (Timbers et al., 2023).

To select K, the number of neighbors, we use 5-fold cross-validation on the training set. In other words, we split our training data into 5 folds; each fold is used once as the validation set while the remaining 4 folds are used for training.

For cross-validation, we use the `vfold_cv` function to split the training set into 5 folds. Finally, we create a tibble with a `neighbors` column and use the `seq` function to set the K values to odd numbers (1, 3, 5, ..., 101). We avoid even numbers because each neighbor is equally weighted, so an even K makes tied votes more likely (Timbers et al., 2023).
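
The fold assignment and the grid of candidate K values described above can be sketched in base R. The row count `n_obs` and the names `fold_id` and `k_grid` are assumptions for illustration, and this unstratified assignment is simpler than what `vfold_cv` does with `strata`.

```r
set.seed(1234)

n_obs   <- 23   # hypothetical number of training rows
n_folds <- 5

# Shuffle fold labels 1..5 so each row lands in exactly one fold
fold_id <- sample(rep(seq_len(n_folds), length.out = n_obs))

# Each fold is held out once for validation; the rest is used to fit
for (f in seq_len(n_folds)) {
  cat("fold", f, ": validate on", sum(fold_id == f),
      "rows, train on", sum(fold_id != f), "rows\n")
}

# Candidate K values: odd numbers from 1 to 101
k_grid <- seq(1, 101, by = 2)
print(length(k_grid))
```

The accuracy reported for each K is then the mean of the five validation accuracies, one per held-out fold.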

To select the predictor variables for the recipe, we relied on the research article by Kahraman et al. (2013), "The development of an intuitive knowledge classifier and the modeling of domain-dependent data". The article states that, in predicting user knowledge levels, the most useful variables to consider are the study time rating (STG), the repetition rating (SCG), the exam performance rating of the subject (PEG), the study time rating of related subjects (STR), and the exam performance rating in related subjects (LPR); in other words, all five variables (Kahraman et al., 2013). Plot 1 also shows that distinctions between the knowledge levels can be made with all of these variables.

Since KNN classification uses the Euclidean distance between points, it is very sensitive to the scale of each variable. Thus, we standardize all chosen predictors so that no variable dominates the distance simply because of its units. We standardize the variables in the recipe with `step_center(all_predictors())` and `step_scale(all_predictors())`.
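
To see why scale matters for Euclidean distance, here is a small base R sketch with two made-up predictors on very different scales. The column names `hours` and `score` and all values are assumptions for illustration; `scale()` centers and scales each column, which mirrors what the two recipe steps do.

```r
# Hypothetical predictors on very different scales
toy <- cbind(
  hours = c(1, 2, 3, 4),          # small range
  score = c(100, 300, 200, 400)   # large range
)

# Euclidean distance between rows 1 and 2 on the raw scale:
# dominated by `score` because its units are larger
raw_dist <- sqrt(sum((toy[1, ] - toy[2, ])^2))

# Center each column to mean 0 and scale to standard deviation 1
toy_std <- scale(toy)

# The same distance after standardization: both columns now contribute
std_dist <- sqrt(sum((toy_std[1, ] - toy_std[2, ])^2))

print(raw_dist)
print(std_dist)
```

In this dataset all predictors are already ratings between 0 and 1, so standardizing changes less than in the toy example, but it still puts every predictor on a common footing.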

Finally, we put everything into a workflow to chain all the steps together and obtain the accuracy for each K value.
 


In [None]:

##Finding the k value for best accuracy
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

##Choosing all the variables as predictors, and standardize it
data_recipe <- recipe(UNS ~. , data = knowledge_data_split_train)|>
  step_center(all_predictors())|>
  step_scale(all_predictors())

training_vfold <-  vfold_cv(knowledge_data_split_train, v=5, strata = UNS)

k_value = 101
K <- tibble(neighbors = seq(1,k_value,2))

knn_result <- workflow() |>
  add_recipe(data_recipe) |>
  add_model(knn_tune)|>
  tune_grid(resamples = training_vfold, grid = K) |>
  collect_metrics()|>
  filter(.metric == "accuracy")
knn_result

##### Table 8: Accuracy of different K values  
    This table represents the Accuracy of different K values from 1 to 101, Advancing by 2. 

### 5. Visualizing the optimal K-value

We used the `ggplot` function to create a line graph that helps visualize how the accuracy estimate changes with K. Surprisingly, the highest accuracy occurs at K = 1. Thus, we choose K = 1 as our optimal K value.

In [None]:
##Scatter plot on the accuracy and number of neighbors
options(repr.plot.height = 8, repr.plot.width = 8)
cross_val_plot <- ggplot(knn_result, aes(x=neighbors, y= mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  ggtitle(label= "KNN Accuracy versus Number of Neighbors")
cross_val_plot

##### Plot 2: KNN Accuracy versus Number of Neighbors
    This plot is the visualization of Table 8, with the accuracy estimate on the y-axis and the number of neighbors on the x-axis.
    
    Notably, the highest accuracy came from K = 1, so we will use K = 1 as the optimal K value.

### 6. Creating the model with the optimal K-value

We chose the K value based on Plot 2, where the highest accuracy occurs at K = 1.

In [None]:
##Building the final model with K = 1 and fitting it to the training set
knn_best_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")


knn_fit <- workflow() |>
  add_recipe(data_recipe) |>
  add_model(knn_best_tune)|>
  fit(knowledge_data_split_train)


### 7. Predicting on the testing set and evaluating the model

#### A) Predicting on the testing set
We first used the model to predict the knowledge levels of the testing set (Table 9).


In [None]:
## Predicting the UNS of testing data set
knowledge_predictions= knn_fit|>
  predict(knowledge_data_split_test)|>
  bind_cols(knowledge_data_split_test)
knowledge_predictions


##### Table 9: Predicted knowledge levels
    This table represents the predicted knowledge level generated by the model on the testing set (Table 7). It includes the same variables as in the testing set (STG, SCG, STR, LPR, PEG, UNS), along with an additional variable, .pred_class, representing the knowledge level assigned by our model.

#### B)  Evaluating the model

Next, we calculated the corresponding accuracy and set up the confusion matrix, setting the truth to the actual Knowledge level(UNS),
and comparing it to what the model predicted(.pred_class).
 $$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$


In [None]:
knowledge_metrics= knowledge_predictions|>
  metrics(truth= UNS, estimate = .pred_class)|>
  filter(.metric== "accuracy")
knowledge_metrics

##### Table 10: Accuracy of the model
    This table represents the accuracy of our model, with .estimate showing the estimated accuracy, i.e. number of correct predictions / total number of predictions.
    
    The value of .estimate shows that our model has an estimated accuracy on the testing set of ~92.9%, which is pretty good.

In [None]:
knowledge_conf_mat= knowledge_predictions|>
  conf_mat(truth= UNS, estimate = .pred_class)
knowledge_conf_mat

##### Table 11: Confusion matrix 
    This table shows the true and predicted number of observations for each knowledge level.
    Each row corresponds to a predicted knowledge level, and each column corresponds to a true knowledge level.
    
    Looking at the table, the highest number of missed predictions came from the "Low" class,
    with a total of 8 missed predictions. All the other knowledge levels had only one missed prediction.
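
The way accuracy and missed predictions are read off a confusion matrix can be sketched in base R. The 3-class matrix `cm` below uses made-up counts for illustration, not the project results; rows are predictions and columns are the truth, matching the layout described above.

```r
# Hypothetical 3-class confusion matrix: rows are predictions, columns are truth
# (the counts are made up for illustration, not the project results)
cm <- matrix(
  c(9, 1, 0,
    1, 7, 1,
    0, 2, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("High", "Low", "Middle"),
                  Truth      = c("High", "Low", "Middle"))
)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Missed predictions per true class: column total minus the diagonal entry
missed <- colSums(cm) - diag(cm)

print(accuracy)
print(missed)
```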

#### C) Visualization of result 
We decided to visualize the results using bar graphs. Specifically, we compare the number of true observations for each knowledge level with the number of predicted observations for that level. We first used the `group_by` and `summarize` functions to count the number of observations in each knowledge level, for the predicted (`count`) and true (`count2`) cases. Then we combined the two tables into one (`count3`) and added a new column, `pred_or_true`, to identify each row as "True" or "Pred". The purpose of the new column is to color the bars in the plots. Finally, we created one bar plot per knowledge level comparing the true class with the predicted class, e.g., pred_High vs High, pred_Low vs Low, and so on, and used `plot_grid` to put them beside each other.


In [None]:
options(repr.plot.width = 11, repr.plot.height = 9)

#Number of Predicted cases 
count <- knowledge_predictions|>
    group_by(.pred_class) |>
    summarize(count = n())|>
    rename(UNS = .pred_class)|>
    mutate(UNS = fct_recode(UNS, pred_High = "High", pred_Low = "Low", pred_Middle = "Middle", pred_Very_low = "Very Low"))

#Number of True cases 
count2 <- knowledge_predictions|>
    group_by(UNS) |>
    summarize(count = n())

#Combining the Number of Predicted cases with Number of True cases 
count3 <- rbind(count, count2)|>
    mutate(pred_or_true = UNS)|>
     mutate(pred_or_true = fct_recode(UNS, Pred = "pred_High", Pred = "pred_Low", Pred = "pred_Middle", Pred = "pred_Very_low",
                                     True = "High",
                                     True = "Low",
                                     True = "Middle",
                                     True = "Very Low"))

# creating plot for comparing Predicted vs True for Knowledge level High
Compare_High <- count3|>
    filter(UNS == "pred_High" | UNS== "High") |>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted High vs True High")

# creating plot for comparing Predicted vs True for Knowledge level Middle
Compare_Middle <- count3|>
    filter(UNS == "pred_Middle" | UNS== "Middle") |>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted Middle vs True Middle")

# creating plot for comparing Predicted vs True for Knowledge level Low
Compare_Low <- count3|>
    filter(UNS== "pred_Low" | UNS== "Low")|>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted Low vs True Low")
# creating plot for comparing Predicted vs True for Knowledge level Very Low
Compare_Very_Low <- count3|>
    filter(UNS== "pred_Very_low" | UNS== "Very Low")|>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted Very Low vs True Very Low")

# putting the plots beside each other 
combine_plot <- plot_grid(Compare_High, Compare_Middle,Compare_Low, Compare_Very_Low, ncol = 2)
combine_plot

##### Plot 3: Predicted vs True
    This plot compares the amounts of true (blue) with the predicted number (red) of each class level. The four plots represent the four different knowledge levels, starting with 'High' at the top left and ending with 'Very Low' at the bottom right.

Again, we see that the highest number of missed predictions of the knowledge level came from the 'Low' class.

## Discussion:
 
Using 5-fold cross-validation, we found the optimal value of the hyperparameter K to be 1 (Plot 2), and the testing set shows that our KNN model has an accuracy of approximately 93% (Table 10). Therefore, our model's predictions tend to agree with the labeled knowledge level of each student (Plot 3). Thus, for this data set, we can conclude that our model is a valid estimator for classifying students' knowledge levels.

Our model's predictions do not show much discrepancy in accuracy across the four categories (Table 11). However, it should be noted that the model is least accurate in predicting the category labeled “Low”, with 8 missed predictions out of a total of 40 observations, while each of the other categories, including “High”, has only one missed prediction (Table 11). In other words, our testing data reveals a 20% mislabeling rate for the “Low” category (Kahraman et al., 2013). Although 8 is not a large number, the cause of this difference is worth future investigation. One possible reason might be the upsampling, which raised the rare class (“Very Low”) from 50 to 129 observations, approximately 258% of its original count (Table 5). However, the calculation behind the upsampling is beyond the scope of this course.

Since our KNN model uses multiple predictor variables, another technique we could apply to improve its accuracy is forward selection. Forward selection adds predictors one at a time and keeps only those that improve the estimated accuracy, which removes variables that contribute little and may help the model generalize. Thus, we would possibly have a better estimate for the general case. Using this method, we could also gain insight into which of the five variables has the most influence on an individual's knowledge level in this dataset, which was a question originally proposed in our hypothesis.
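
A minimal sketch of greedy forward selection for a KNN classifier is shown below. It is an illustration under several assumptions, not the procedure used in this report: it uses the built-in `iris` data as a stand-in, leave-one-out accuracy from `knn.cv` in the `class` package instead of 5-fold cross-validation, and a fixed K of 5. The names `loo_accuracy`, `selected`, `remaining`, and `history` are hypothetical.

```r
library(class)  # recommended package bundled with standard R installations

set.seed(1234)

# Stand-in data: the built-in iris set, standardized (not the project data)
x <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out accuracy of a KNN classifier on a given set of columns
loo_accuracy <- function(cols, k = 5) {
  pred <- knn.cv(train = x[, cols, drop = FALSE], cl = y, k = k)
  mean(pred == y)
}

selected  <- character(0)
remaining <- colnames(x)
history   <- data.frame(n_predictors = integer(0),
                        added = character(0),
                        accuracy = numeric(0))

# Greedy loop: at each step add the predictor that gives the best accuracy
while (length(remaining) > 0) {
  scores <- sapply(remaining, function(v) loo_accuracy(c(selected, v)))
  best   <- names(which.max(scores))
  selected  <- c(selected, best)
  remaining <- setdiff(remaining, best)
  history <- rbind(history,
                   data.frame(n_predictors = length(selected),
                              added = best,
                              accuracy = max(scores)))
}

print(history)
```

Reading `history` from top to bottom shows the order in which predictors were added; one would typically stop at the row after which accuracy no longer improves.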

Overall, the model seems valid, with relatively high accuracy; however, we are aware that it may be overfitting. Since our hyperparameter K is 1, each new observation is simply assigned the class of its single nearest neighbor. In addition, because the upsampling was applied before the data was split, copies of the same observation can appear in both the training and testing sets, which may favor K = 1 and inflate the test accuracy. This might work extremely well on this specific dataset, but we do not have a valid reason to say the model generalizes. Thus, we cannot guarantee the model will work well on new observations with more noise.
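
The overfitting concern with K = 1 can be illustrated with a small sketch on synthetic data. The data, the noisy labelling rule, and the comparison value K = 15 are all assumptions for illustration; `knn` comes from the `class` package. When a 1-nearest-neighbor model predicts rows it has already seen, each row is its own nearest neighbor, so the accuracy on those rows is perfect even when the labels are noisy.

```r
library(class)  # recommended package bundled with standard R installations

set.seed(1234)

# Hypothetical data: two noisy, overlapping classes in two dimensions
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n) > 0, "High", "Low"))

# Predict the training rows themselves with K = 1 and K = 15
pred_k1  <- knn(train = x, test = x, cl = y, k = 1)
pred_k15 <- knn(train = x, test = x, cl = y, k = 15)

# With K = 1 each point is its own nearest neighbour, so accuracy on
# already-seen rows is perfect regardless of how noisy the labels are
acc_k1  <- mean(pred_k1 == y)
acc_k15 <- mean(pred_k15 == y)
print(c(k1 = acc_k1, k15 = acc_k15))
```

This is why accuracy measured on rows the model has effectively already seen says little about how a K = 1 model behaves on genuinely new observations.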

This model could have valuable real-life applications in the current education system. Compared to the traditional letter-grade system, the system used in this dataset has more criteria, which brings a more diverse perspective to assessing student learning. Our algorithm could help bring this system into practical use by assigning the category automatically instead of relying on human assignment (educators, authorities, etc.), which reduces subjective bias. This could also allow the grading system to be applied across different educational institutions, since the category assignment would be consistent and the influence of discrepancies between individual graders on the result could be minimized.
 
Looking at the potential real-life applications of this model leads to future questions such as: How can we adapt the model to predict the knowledge level for a data set that incorporates more variables? This question is worth investigating, as the criteria for this grading system could be modified or extended to fit the educational needs of different institutions.


References: 

Kahraman, H., Colak, I., & Sagiroglu, S. (2013). User Knowledge Modeling. UCI Machine Learning Repository. https://doi.org/10.24432/C5231X

Kahraman, H. T., Sagiroglu, S., & Colak, I. (2013). The development of an intuitive knowledge classifier and the modeling of domain-dependent data. Knowledge-Based Systems, 37, 283–295. https://doi.org/10.1016/j.knosys.2012.08.009

Timbers, T., Campbell, T., & Lee, M. (2023). Data Science: A First Introduction. datasciencebook.ca. https://datasciencebook.ca/index.html

Papanikolaou, K. A., Grigoriadou, M., Magoulas, G. D., & Kornilakis, H. (2002). Towards new forms of knowledge communication: the adaptive dimension of a web-based learning environment. Computers & Education, 39(4), 333–360. https://doi.org/10.1016/s0360-1315(02)00067-2