### **Students' Knowledge Status**

**Group: Liam Brennan, Eva He, Li-Kun Lin, Steve He** 

Introduction:
The objective of any class should be to increase students' understanding of the subject. Students who understand the class material well will enter the workforce with the tools they need to succeed in their fields. We will classify students' understanding of class material based on five quantitative variables taken from Kahraman et al.'s User Knowledge Modeling dataset.
The question our project tries to answer is: Can we predict a student's knowledge level from five academic parameters using the KNN classification algorithm?

Data Set information: 
The dataset was collected by Kahraman et al. The weighting system and the quantitative measurements for the variables were developed using Kahraman's rule-based system, which gives quantitative values (ratings) to students' performance on certain academic parameters. The parameters:


STG: Study time rating (0-1), the amount of time spent studying Electrical DC Machines.

SCG: Repetition rating (0-1), the amount of problems and material the student worked on, for example worksheets and tutorials.

PEG: Exam performance rating in the subject (0-1), in this case the exam performance in the Electrical DC Machines course.

STR: Study time rating for related subjects (0-1), the amount of time the student spent studying related topics.

LPR: Exam performance rating in related subjects (0-1), the exam performance on related material or background information.

UNS (the class to predict): Student understanding level, based on the weighting system Kahraman uses in the rule-based system paper, classified as “Very Low”, “Low”, “Middle”, or “High” understanding of Electrical DC Machines.

Methods & Results:

### 1. Loading the packages for data analysis
This analysis uses the KNN classification algorithm to predict the knowledge level (High, Middle, Low, or Very Low) of students. First, all the packages needed to run this algorithm are loaded into R.

In [None]:
install.packages("themis")
install.packages("kknn")
install.packages("cowplot")
library(kknn)
library(purrr)
library(tidyverse)
library(repr)
library(tidymodels)
library(themis)
library(cowplot)
options(repr.matrix.max.rows = 6)

### 2. Reading in the data from the web and preparing it for analysis

#### (A) Loading the data
After loading the data, we convert the variable we want to predict, UNS, to a factor using `as.factor`.

In [None]:
set.seed(1234) 
options(repr.plot.height = 5, repr.plot.width = 6)

##Loading data
url <- "https://github.com/JackyLinllk/ubc_dsci100_assignment/raw/main/data/Data_User_Modeling_Dataset_full.csv"
knowledge_data<-read_csv(url)|>
    select(STG:UNS)|>
    mutate(UNS = as.factor(UNS))
knowledge_data

##### Table 1: Dataset of Students' Knowledge Status
    This table shows the students' knowledge status for the subject of Electrical DC Machines. 
    For the meaning of each column, refer back to "Data Set information" in the introduction.

    The dataset was obtained from: 
    Kahraman, Hamdi, Colak, Ilhami, and Sagiroglu, Seref. (2013). 
    User Knowledge Modeling. UCI Machine Learning Repository. https://doi.org/10.24432/C5231X.

#### (B) NA-values
We then check for NA values and find that the last row contains NA values across all the variables, so we simply delete the last row of the dataset.

In [None]:
##Check NA rows
na_row= which(!complete.cases(knowledge_data))
print(na_row)

##Delete the NA row of the data
knowledge_data= knowledge_data[-nrow(knowledge_data),]
knowledge_data

##### Table 2: Dataset Of Students' Knowledge Status with removed NA-Values
 Data set after the NA-Values are removed
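
Dropping the last row works here because the only incomplete row is the final one. As a hedged alternative that does not depend on row position, the sketch below filters on `complete.cases()` instead. The data frame `toy` is a hypothetical stand-in, not the project data.

```r
# Hypothetical toy data: the last row is entirely NA, as in the raw CSV.
toy <- data.frame(
  STG = c(0.10, 0.25, NA),
  PEG = c(0.80, 0.30, NA),
  UNS = c("High", "Low", NA)
)

# Row indices that contain at least one NA.
na_rows <- which(!complete.cases(toy))

# Keep only complete rows, wherever the incomplete ones are located.
toy_clean <- toy[complete.cases(toy), ]
toy_clean
```

The same filter would apply unchanged if an incomplete row appeared in the middle of the data.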

#### (C) Preliminary Analysis and Final Prep 
We check the tidiness of the data in Table 2 against three criteria: (1) each row is a single observation, (2) each column is a single variable, and (3) each value is in a single cell. 


After confirming that the data is tidy, we use `group_by` with `summarize` to count the observations in each factor level and compute the corresponding proportions (Table 3).

The categorical variable UNS contains two factor levels, “very_low” and “Very Low”, that should be a single level. We therefore apply `mutate` with `fct_recode` to merge the two levels into one.

The summary also shows that the classes are imbalanced, meaning the knowledge levels are not equally represented. We therefore use `recipe` with `step_upsample` to rebalance the rare classes, in particular “Very Low”, by oversampling. Then we use `group_by` with `summarize` again to check that every class now has the same number of observations.

In [None]:
## Creating the Summary table and checking for Proportion 
class_prop = knowledge_data|>
  group_by(UNS)|>
  summarize(count = n(),
            percentage= count/nrow(knowledge_data))
class_prop


##### Table 3: Summary Statistics of the Number of Observations per Knowledge Level
    This table shows the number of observations in
    each factor level with the corresponding proportion.

In [None]:
## Typo in the dataset: very_low and Very Low should be the same level
knowledge_data= knowledge_data|>
  mutate(UNS = fct_recode(UNS, "Very Low" =  "very_low"))

## Balancing the data: upsample each class to the size of the largest one
ups_recipe= recipe(UNS~. ,data=knowledge_data)|>
  step_upsample(UNS, over_ratio=1, skip=FALSE)|>
  prep()
upsampled_knowledge=bake(ups_recipe, knowledge_data)

##Checking the balance
upsampled_knowledge|>
  group_by(UNS)|>
  summarize(n=n())


##### Table 5: Summary Statistics of the Number of Observations per Knowledge Level After Balancing the Data
    This table shows the summary statistics after upsampling to balance the data.
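
To make the effect of oversampling concrete, the base-R sketch below illustrates the idea behind upsampling with a ratio of 1: rows of each smaller class are resampled with replacement until every class is as large as the largest one. The data frame `toy` and the helper `upsample_to_max` are hypothetical names used only for illustration, and this is not how the themis package implements `step_upsample` internally.

```r
set.seed(1)

# Hypothetical imbalanced data: 6 "High", 4 "Low", 2 "Very Low".
toy <- data.frame(
  x   = seq_len(12) / 12,
  UNS = factor(rep(c("High", "Low", "Very Low"), times = c(6, 4, 2)))
)

# Resample rows within each class (with replacement) up to the largest class size.
upsample_to_max <- function(df, class_col) {
  target <- max(table(df[[class_col]]))
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  idx <- unlist(lapply(idx_by_class, function(i) {
    extra <- target - length(i)
    # Keep every original row; add resampled copies only where needed.
    c(i, i[sample.int(length(i), extra, replace = TRUE)])
  }))
  df[idx, , drop = FALSE]
}

balanced <- upsample_to_max(toy, "UNS")
table(balanced$UNS)
```

Because the added rows are exact copies of existing ones, the balanced data contains duplicated observations.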

### 3. Splitting the data into training (70%) and testing (30%) sets

#### (A) Splitting the data 
We split the data with the `initial_split` function to create two subsets, a training set and a testing set.
Inside `initial_split`, we set the `strata` argument to the categorical variable UNS. The
`training` and `testing` functions then extract the two data frames, which hold
70% and 30% of the observations respectively.

In [None]:
##Split the data into training set and testing set
knowledge_data_split <- initial_split(upsampled_knowledge, prop = 0.70, strata = UNS)  
knowledge_data_split_train <- training(knowledge_data_split)
knowledge_data_split_test <- testing(knowledge_data_split)

knowledge_data_split_train

##### Table 6: The Training Set of Students' Knowledge 
    This table shows the training set we use to train our models.

In [None]:
knowledge_data_split_test

##### Table 7: The Testing Set of Students' Knowledge 
    This table shows the testing set we use to evaluate our models.
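
The `strata` argument keeps the class proportions roughly equal in both subsets. A rough base-R sketch of a stratified 70/30 split is given below. The data frame `toy` and the helper `stratified_split` are hypothetical, and `initial_split` may handle strata and rounding differently.

```r
set.seed(1234)

# Hypothetical balanced data: 10 rows for each of 4 classes.
toy <- data.frame(
  id  = 1:40,
  UNS = factor(rep(c("High", "Low", "Middle", "Very Low"), each = 10))
)

# Sample a proportion `prop` of the row indices within each class for training.
stratified_split <- function(df, class_col, prop = 0.70) {
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  train_idx <- unlist(lapply(idx_by_class, function(i) {
    i[sample.int(length(i), size = round(prop * length(i)))]
  }))
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

parts <- stratified_split(toy, "UNS")
table(parts$train$UNS)  # 7 rows per class
table(parts$test$UNS)   # 3 rows per class
```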

#### (B) Preliminary Analysis on the Training Set 
We carry out the preliminary analysis by creating a table (using `group_by` and `summarize`) that counts the observations in each knowledge level (Table 7), and 
by creating plots that compare exam performance (PEG) against each of the other variables.

The summary table (Table 7) shows that our attempt at balancing the data worked.
Plot 1 shows that the variables we plotted can help distinguish between the different knowledge levels.

In [None]:
data_training_summary <- knowledge_data_split_train|>
    group_by(UNS)|>
    summarize(count = n())
data_training_summary

##### Table 7: Summary Statistics of the Number of Observations per Knowledge Level in the Training Data
    This table shows the number of observations in each knowledge level.

In [None]:
options(repr.plot.width = 16, repr.plot.height = 10)
library(GGally)  # ggpairs() comes from GGally, which tidyverse does not attach
data_pairs <- knowledge_data_split_train |> 
     select(STG:PEG)|>
     ggpairs(aes(alpha = 0.05)) +
     theme(text = element_text(size = 20))
data_pairs

In [None]:
options(repr.plot.height = 18, repr.plot.width = 8)
data_plot_LPR_PEG <- ggplot(knowledge_data_split_train, aes(x= LPR,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Exam performance Rating \n in Related Subject (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Exam performance in Related Subject")+
    theme(text = element_text(size = 15))


data_plot_STG_PEG <- ggplot(knowledge_data_split_train, aes(x= STG,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Study time rating (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Study time rating")+
    theme(text = element_text(size = 15))


data_plot_SCG_PEG <- ggplot(knowledge_data_split_train, aes(x= SCG,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Repetition rating (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Repetition rating ")+
    theme(text = element_text(size = 15))


data_plot_STR_PEG <- ggplot(knowledge_data_split_train, aes(x= STR,y= PEG, color = UNS))+
    geom_point() +
    labs(x="Study time rating \n in Related Subject (0-1)", y= "Exam performance Rating \n in Subject (0-1)", color = "Knowledge level")+
    ggtitle("Exam Performance vs Study time rating in Related Subject")+
    theme(text = element_text(size = 15))



plot_grid(data_plot_LPR_PEG, data_plot_STG_PEG, data_plot_SCG_PEG, data_plot_STR_PEG, ncol = 1)

##### Plot 1: The Relationship Between the Variables  
    This plot visualizes the relationship between exam performance (PEG) and each of the other variables. The plots suggest that the variables can indeed help distinguish between the different knowledge levels.

#### 4. Parameter selection: Finding the best K value
Our next step is to find the best K value and select the predictor variables that maximize the accuracy of our model.

First, we use the `nearest_neighbor`, `set_engine`, and `set_mode`
functions to create a model specification. Inside `nearest_neighbor`, the argument
`weight_func` is set to `"rectangular"`, which means each of the K neighbors counts equally. For
the `neighbors` argument, `tune()` tells the framework to try different values
of K.

To select K, the number of neighbors, we use 5-fold cross-validation on the training set. In other words, we split our
training data into 5 folds, and each fold is used once as the validation set while the other four are used for training. 
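
The fold mechanics can be sketched in base R as follows. The training-set size `n_rows` is a hypothetical value, and unlike `vfold_cv` with `strata`, this simplified version does not stratify by class.

```r
set.seed(1234)

n_rows <- 23   # hypothetical training-set size
v <- 5         # number of folds

# Assign each row to one of v folds, as evenly as possible, in random order.
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

# In round j, fold j is the validation set and the other folds are used for training.
for (j in seq_len(v)) {
  validation_rows <- which(fold_id == j)
  analysis_rows   <- which(fold_id != j)
  cat("Round", j, ":", length(analysis_rows), "training rows,",
      length(validation_rows), "validation rows\n")
}
```

The accuracy reported for each K is then the mean over the five rounds.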

For cross-validation, we use the `vfold_cv` function to split the training set into 5 folds.
Finally, we create a tibble of `neighbors` values and use the `seq` function to
set the K values to odd numbers (1, 3, 5, ..., n). We avoid even
numbers because each neighbor is equally weighted, so an even K
makes tied votes more likely.
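
A small sketch of the unweighted vote shows where ties come from. The neighbor labels and the helper `vote` are hypothetical and only illustrate the counting step, not the kknn implementation.

```r
# Labels of the nearest neighbours of one hypothetical query point, closest first.
neighbour_labels <- c("Low", "Middle", "Low", "High")

# Unweighted ("rectangular") vote: count labels among the k nearest neighbours.
vote <- function(labels, k) {
  counts <- table(labels[seq_len(k)])
  winners <- names(counts)[counts == max(counts)]
  list(winners = winners, tie = length(winners) > 1)
}

vote(neighbour_labels, k = 2)  # "Low" and "Middle" each get one vote: a tie
vote(neighbour_labels, k = 3)  # "Low" wins with two of three votes
```

With more than two classes, an odd K reduces ties but does not rule them out, for example when three neighbors all carry different labels.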

To select the predictor variables for the recipe, we relied on the research article this dataset came from. The article mentions that the most useful variables for predicting user knowledge levels are the study time rating (STG), the repetition rating (SCG), the exam performance rating in the subject (PEG), the study time rating for related subjects (STR), and the exam performance rating in related subjects (LPR), which is all of the variables (ref). Plot 1 also shows that distinctions between the knowledge levels can be made with all of these variables. 

KNN classification uses the Euclidean distance between points, so it is very sensitive
to differences in scale. We therefore standardize all chosen
variables so that no single predictor dominates the distance calculation. We
do this in the recipe with `step_center(all_predictors())` and
`step_scale(all_predictors())`.
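
The sketch below illustrates why scale matters for Euclidean distance. The data frame `toy` and its two columns are hypothetical and deliberately on very different scales (the project's own ratings all lie between 0 and 1). Base R's `scale()` performs the same centering and scaling that the two recipe steps describe.

```r
# Two hypothetical predictors on very different scales.
toy <- data.frame(
  rating = c(0.10, 0.90, 0.12, 0.50),   # on a 0-1 scale
  score  = c(500, 505, 560, 400)        # on a scale of hundreds
)

# Euclidean distances between all pairs of rows, before and after standardizing.
raw_dist    <- as.matrix(dist(toy))
scaled_dist <- as.matrix(dist(scale(toy)))   # centre to mean 0, scale to sd 1

# Nearest neighbour of row 1 (excluding itself) under each distance.
nearest_raw    <- which.min(raw_dist[1, -1]) + 1     # row 2: score dominates
nearest_scaled <- which.min(scaled_dist[1, -1]) + 1  # row 3: both columns count
```

On the raw data the large-valued column decides the neighbor almost alone, while after standardizing both columns contribute comparably.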

Finally, we put everything into a workflow to chain all the steps together and get the
accuracy for the different K values. 
 


In [None]:

##Finding the k value for best accuracy
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

##Choosing all the variables as predictors, and standardize it
data_recipe <- recipe(UNS ~. , data = knowledge_data_split_train)|>
  step_center(all_predictors())|>
  step_scale(all_predictors())

training_vfold <-  vfold_cv(knowledge_data_split_train, v=5, strata = UNS)

k_value = 101
K <- tibble(neighbors = seq(1,k_value,2))

knn_result <- workflow() |>
  add_recipe(data_recipe) |>
  add_model(knn_tune)|>
  tune_grid(resamples = training_vfold, grid = K) |>
  collect_metrics()|>
  filter(.metric == "accuracy")
knn_result

##### Table 8: Accuracy of different K values  
    This table shows the accuracy for K values from 1 to 101, in steps of 2.

#### 5. Visualizing the optimal K-value

We use `ggplot` to create a line graph that
helps visualize how the accuracy changes with K. Surprisingly,
K = 1 gives the highest accuracy for the model. Thus, we choose K = 1 as
our optimal K value.

In [None]:
##Line plot of accuracy against number of neighbors
options(repr.plot.height = 8, repr.plot.width = 8)
cross_val_plot <- ggplot(knn_result, aes(x=neighbors, y= mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  ggtitle(label= "KNN Accuracy versus Number of Neighbors")
cross_val_plot

##### Plot 2: KNN Accuracy versus Number of Neighbors
    This plot is a visualization of Table 8. Notably, the highest accuracy came from K = 1.

#### 6. Creating the model with the optimal K-value

We chose the K value based on Plot 2, where the highest accuracy occurs at K = 1.

In [None]:
##Building and fitting the final model with the optimal K value
knn_best_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")


knn_fit <- workflow() |>
  add_recipe(data_recipe) |>
  add_model(knn_best_tune)|>
  fit(knowledge_data_split_train)


#### 7. Predicting the model on the testing data set, and evaluating the model

We first use the model to predict the knowledge levels of the testing set (Table 9).


In [None]:
## Predicting the UNS of testing data set
knowledge_predictions= knn_fit|>
  predict(knowledge_data_split_test)|>
  bind_cols(knowledge_data_split_test)
knowledge_predictions


##### Table 9: Predicted knowledge levels
    This table shows the knowledge level predicted by the model alongside the original testing set (Table 7). 

Next, we calculate the accuracy and set up the confusion matrix, setting the truth to the actual knowledge level (UNS)
and comparing it to what the model predicted (.pred_class).
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
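
As a worked instance of this formula, the sketch below computes accuracy by hand on six made-up labels. The vectors `truth` and `predicted` are hypothetical and are not results from our model.

```r
# Hypothetical true and predicted labels for six observations.
truth     <- c("High", "Low", "Low", "Middle", "Very Low", "High")
predicted <- c("High", "Low", "Middle", "Middle", "Very Low", "High")

# Accuracy = number of correct predictions / total number of predictions.
accuracy <- sum(predicted == truth) / length(truth)
accuracy  # 5 of 6 predictions are correct
```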


In [None]:
knowledge_metrics= knowledge_predictions|>
  metrics(truth= UNS, estimate = .pred_class)|>
  filter(.metric== "accuracy")
knowledge_metrics

##### Table 10: Accuracy of the model
    This table shows the accuracy of our model, i.e. the number of correct predictions divided by the total number of predictions.
    The value of the .estimate variable shows that our model has an estimated accuracy on the testing set of ~92.9%.

In [None]:
knowledge_conf_mat= knowledge_predictions|>
  conf_mat(truth= UNS, estimate = .pred_class)
knowledge_conf_mat

##### Table 11: Confusion matrix 
    This table compares the predicted and true number of observations in each knowledge level.
    The row labels (left column) give the predicted knowledge level, and the column labels (top row) give the true knowledge level.
    Looking at the table, the highest number of missed predictions came from the "Low" class, 
    with a total of 8 missed predictions. Each of the other knowledge levels had only one missed prediction.
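
Reading missed predictions off a confusion matrix can be sketched with base R's `table()`. The labels below are hypothetical, and the layout (rows are predictions, columns are the truth) is chosen to match the one printed by `conf_mat`.

```r
# Hypothetical labels; levels are fixed so every class appears in the matrix.
lv <- c("High", "Low", "Middle", "Very Low")
truth <- factor(
  c("High", "High", "Low", "Low", "Low", "Middle", "Very Low"), levels = lv)
predicted <- factor(
  c("High", "High", "Low", "Middle", "Very Low", "Middle", "Very Low"), levels = lv)

# Rows are predictions and columns are the truth.
cm <- table(Prediction = predicted, Truth = truth)

# Correct predictions sit on the diagonal; the rest of each column
# are the missed predictions for that true class.
missed_per_class <- colSums(cm) - diag(cm)
missed_per_class
```

In this toy example only the "Low" column has off-diagonal entries, so it is the only class with missed predictions.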

In [None]:
options(repr.plot.width= 11, repr.plot.height = 9)

#Number of Predicted cases 
count <- knowledge_predictions|>
    group_by(.pred_class) |>
    summarize(count = n())|>
    rename(UNS = .pred_class)|>
    mutate(UNS = fct_recode(UNS, pred_High = "High", pred_Low = "Low", pred_Middle = "Middle", pred_Very_low = "Very Low"))

#Number of True cases 
count2 <- knowledge_predictions|>
    group_by(UNS) |>
    summarize(count = n())

## Combining the Number of Predicted cases with Number of True cases 
count3 <- rbind(count, count2)|>
    mutate(pred_or_true = UNS)|>
     mutate(pred_or_true = fct_recode(UNS, Pred = "pred_High", Pred = "pred_Low", Pred = "pred_Middle", Pred = "pred_Very_low",
                                     True = "High",
                                     True = "Low",
                                     True = "Middle",
                                     True = "Very Low"))

Compare_High <- count3|>
    filter(UNS == "pred_High" | UNS== "High") |>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted High vs True High")

Compare_Middle <- count3|>
    filter(UNS == "pred_Middle" | UNS== "Middle") |>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted Middle vs True Middle")


Compare_Low <- count3|>
    filter(UNS== "pred_Low" | UNS== "Low")|>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted Low vs True Low")

Compare_Very_Low <- count3|>
    filter(UNS== "pred_Very_low" | UNS== "Very Low")|>
    ggplot(aes(x= UNS, y= count, fill=pred_or_true))+
    geom_bar(stat = "identity") +
    xlab("Knowledge level") +
    ylab("Number of Observations")+
    labs(fill = "Prediction vs True")+
    ggtitle("Predicted Very Low vs True Very Low")

plot_grid(Compare_High, Compare_Middle,Compare_Low, Compare_Very_Low, ncol = 2)

##### Plot 3: Predicted vs True
    This plot compares the true number of observations with the predicted number for each knowledge level. 


Discussion:
 
Using the optimal K value (K = 1) for the KNN algorithm, our model predicted the knowledge level for each student in the testing set. The accuracy is 93% (rounded to the nearest percent). Overall, our model's predictions tend to agree with the labeled category of student performance. Therefore, we conclude that our model can accurately predict student performance in this data set.
 
Our model's predictions do not show much discrepancy in accuracy across the four categories. However, it should be noted that our model has its highest accuracy for the category labeled “High”, with only one missed prediction, and is less accurate for the category labeled “Low”, with 8 missed predictions out of a total of 39 observations (Table 11). Although 8 missed predictions is not a large number, the cause of this difference is worth investigating in the future.
 
Overall, the result of this analysis is what we anticipated in our hypothesis. With 93% accuracy, most of the predictions match their labeled category in this dataset. However, our KNN-based analysis does not provide enough information to compare the importance of the different variables for the labeled categories.
 
This model could have valuable real-life applications in the current education system. Compared to the traditional letter-grade system, the system used in this dataset has more criteria, which brings a more diverse perspective to assessing student learning. Our algorithm could help bring this system into practical use by assigning the category automatically instead of relying on human assignment (by educators, authorities, etc.), which avoids subjective bias. This could also allow the grading system to be applied across different educational institutions, because the category assignment is universal and objective and the influence of discrepancies between individual graders on the result could be minimized. 
 
The potential real-life application of this model leads to future questions such as: How can we adapt the model to predict the knowledge level for a data set that incorporates more variables? This question is worth investigating because the criteria for this grading system could be modified or extended to fit the educational needs of different institutions.

Another future question is which of the five variables has the most influence on an individual's knowledge level in this dataset. This question was originally proposed in our hypothesis; however, this analysis focused on the overall categorical prediction rather than on comparing the importance of the five variables. In a future investigation, we could evaluate the relevance of each variable to the knowledge level using forward selection, which searches for a good subset of relevant variables (predictors). By comparing the accuracy of the five single-predictor models, we can tell which predictor gives higher accuracy and is therefore more useful for predicting the knowledge level. If any variable is found to be largely irrelevant to predicting the knowledge level, it could be removed, which might improve the accuracy of the model. 
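
The first step of such a comparison, scoring each predictor on its own, might look like the sketch below. It is only an illustration under several assumptions: it uses R's built-in `iris` data as a stand-in (not the knowledge dataset), the helper `single_predictor_accuracy` is hypothetical, K = 5 is an arbitrary choice, and accuracy is estimated with leave-one-out cross-validation via `knn.cv` from the `class` package rather than with the tidymodels workflow used in this report.

```r
library(class)  # ships with standard R installations; provides knn.cv()

# Stand-in data: the built-in iris set, NOT the knowledge dataset used in this report.
predictors <- scale(iris[, 1:4])
labels <- iris$Species

# Leave-one-out accuracy of a KNN classifier that uses a single predictor.
single_predictor_accuracy <- function(col, k = 5) {
  pred <- knn.cv(train = predictors[, col, drop = FALSE], cl = labels, k = k)
  mean(pred == labels)
}

set.seed(1234)  # knn.cv breaks voting ties at random
acc <- sapply(colnames(predictors), single_predictor_accuracy)
sort(acc, decreasing = TRUE)
```

Forward selection would continue from the best single predictor, adding one variable at a time for as long as the estimated accuracy improves.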



References:

Kahraman, Hamdi, Colak, Ilhami, and Sagiroglu, Seref. (2013). User Knowledge Modeling. UCI Machine Learning Repository. https://doi.org/10.24432/C5231X.
