We will use the library function to load the tidyverse, tidymodels, repr, and readxl packages into R.

In [1]:
library(repr)
library(readxl)
library(tidyverse)
library(tidymodels)
set.seed(1)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

# Title: User Knowledge Data

# Introduction
For our Data Science 100 group project, we use a dataset on students' knowledge status for the subject of Electrical DC Machines and examine how certain predictors are associated with it. The dataset's authors classified each student's knowledge with a knowledge classifier they describe as a “hybrid ML technique of k-NN and meta-heuristic exploring methods”, which builds on the k-nearest neighbor algorithm. The dataset records several attributes for each student: study time, repetition, exam performance, and the resulting knowledge level. The knowledge classifier measures the distance between students based on their data and the values of their knowledge weights, and these dissimilarities determine each student's knowledge class.

# Question
How strongly is knowledge level associated with study time, repetition, and exam performance?


# The Dataset
The dataset we use can be downloaded from this link (https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls). The data we need are in sheet 2 ("Training Data") of the workbook.

The sheet is untidy because it contains descriptive text alongside the data. We use the “select” function to keep only the variable columns (STG through UNS), which drops the description and leaves a tidy data frame.
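As a minimal sketch of that step, the toy tibble below stands in for the raw sheet. The values are made up, and the column names `notes_1` and `notes_2` are hypothetical placeholders for the description columns, not the real headers in the workbook. Selecting the range `STG:UNS` keeps only the six variables.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the raw sheet (made-up values); notes_1 and notes_2 are
# hypothetical names for the description columns, not the real headers.
raw_sheet <- tibble(
    STG = c(0.00, 0.08, 0.06),
    SCG = c(0.00, 0.08, 0.06),
    STR = c(0.00, 0.10, 0.05),
    LPR = c(0.00, 0.24, 0.25),
    PEG = c(0.00, 0.90, 0.33),
    UNS = c("very_low", "High", "Low"),
    notes_1 = c(NA, NA, NA),
    notes_2 = c("Attribute Information:", "STG (...)", NA)
)

# Keep every column from STG through UNS, dropping the description columns
tidy_sheet <- raw_sheet |>
    select(STG:UNS)

colnames(tidy_sheet)
```

Selecting by range rather than by dropping named columns means the code does not depend on what the description columns happen to be called.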


# Preliminary exploratory data analysis:

In [2]:
# STG (The degree of study time for goal object materials)
# SCG (The degree of repetition number of user for goal object materials)
# STR (The degree of study time of user for related objects with goal object)
# LPR (The exam performance of user for related objects with goal object)
# PEG (The exam performance of user for goal objects)
# UNS (The knowledge level of user)

# Demonstrate that the dataset can be read from the web into R.
# read_excel() needs a local file, so we download the workbook first.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls"
path <- "Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls"
if (!file.exists(path)) {
    download.file(url, destfile = path, mode = "wb")
}
user_knowledge <- read_excel(path, sheet = 2)

# Clean and wrangle your data into a tidy format:
# keep only the variable columns and drop the description
user_knowledge_wrangle <- user_knowledge |>
    select(STG:UNS)

# Training set: sheet 2, with the class label converted to a factor
user_knowledge_training <- user_knowledge_wrangle |>
    mutate(UNS = as_factor(UNS))
user_knowledge_training

# Testing set. Assumption: sheet 3 of the workbook holds the test data
# (the original cell reused the training sheet here).
user_knowledge_testing <- read_excel(path, sheet = 3) |>
    select(STG:UNS) |>
    mutate(UNS = as_factor(UNS))
user_knowledge_testing


In [None]:
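# Using only training data, summarize the mean of each predictor
# and the number of observations per knowledge level
user_summary <- user_knowledge_training |>
    group_by(UNS) |>
    summarize(mean_of_STG = mean(STG),
              mean_of_SCG = mean(SCG),
              mean_of_STR = mean(STR),
              mean_of_LPR = mean(LPR),
              mean_of_PEG = mean(PEG),
              number_of_UNS = n())
user_summary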

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
STG_vs_UNS_hist <- user_knowledge_training|>
    ggplot(aes(x=STG, fill=UNS))+
    geom_histogram(bins=45, alpha=0.6, position="identity")+
    labs(x="STG (The degree of study time)", y="Count", title = "STG Distribution", caption="Figure 1", fill = "User Knowledge level")+
    theme(text = element_text(size = 20))
STG_vs_UNS_hist

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
SCG_hist <- user_knowledge_training|>
    ggplot(aes(x=SCG, fill=UNS))+
    geom_histogram(bins=45, alpha=0.6, position="identity")+
    labs(x="SCG (The degree of repetition number)", y="Count", title = "SCG Distribution", caption="Figure 2", fill = "User Knowledge level")+
    theme(text = element_text(size = 20))
SCG_hist

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
STR_hist <- user_knowledge_training|>
    ggplot(aes(x=STR, fill=UNS))+
    geom_histogram(bins=45, alpha=0.6, position="identity")+
    labs(x="STR (The degree of study time for related objects)", y="Count", title = "STR Distribution", caption="Figure 3", fill = "User Knowledge level")+
    theme(text = element_text(size = 20))
STR_hist

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
LPR_hist <- user_knowledge_training|>
    ggplot(aes(x=LPR, fill=UNS))+
    geom_histogram(bins=45, alpha=0.6, position="identity")+
    labs(x="LPR (The exam performance for related objects)", y="Count", title = "LPR Distribution", caption="Figure 4", fill = "User Knowledge level")+
    theme(text = element_text(size = 20))
LPR_hist

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
PEG_hist <- user_knowledge_training|>
    ggplot(aes(x=PEG, fill=UNS))+
    geom_histogram(bins=45, alpha=0.6, position="identity")+
    labs(x="PEG (The exam performance for goal objects)", y="Count", title = "PEG Distribution", caption="Figure 5", fill = "User Knowledge level")+
    theme(text = element_text(size = 20))
PEG_hist

# Methods
After visualizing the histogram of each variable, colored by UNS (the knowledge level of the users), we chose the following variables to build a KNN classification model that predicts user knowledge level:
- UNS (The knowledge level of user)
- PEG (The exam performance of user for goal objects)

The KNN classification model will need to be trained and tuned on the training set and then evaluated on the testing set before we use it to predict the knowledge level of users. Comparing the histograms of the variables, Figure 5 (PEG Distribution) shows the least overlap between knowledge levels, which suggests that PEG is more strongly associated with UNS than the other variables are. Therefore, we will use PEG as the predictor in order to build a more effective model.
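The train, tune, and evaluate steps could look roughly like the sketch below. It is not the project's final analysis: it runs on synthetic stand-in data (`toy_training` and `toy_testing`, hypothetical names with made-up values and only two classes) so that it is self-contained, and it assumes the kknn package is installed. The grid of k values and the 5-fold cross-validation are illustrative choices.

```r
library(tidymodels)

set.seed(1)

# Synthetic stand-in data (made-up values): PEG in [0, 1] and a two-level
# label that depends on PEG. Swap in the real training/testing sets.
make_toy <- function(n) {
    peg <- runif(n)
    tibble(
        PEG = peg,
        UNS = factor(if_else(peg + rnorm(n, sd = 0.1) > 0.5, "High", "Low"),
                     levels = c("Low", "High"))
    )
}
toy_training <- make_toy(200)
toy_testing <- make_toy(80)

# Recipe: predict UNS from PEG, standardizing the predictor
knn_recipe <- recipe(UNS ~ PEG, data = toy_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Model specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training set over a grid of k values
folds <- vfold_cv(toy_training, v = 5, strata = UNS)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy
best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit with the chosen k and evaluate once on the held-out test set
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(final_spec) |>
    fit(data = toy_training)

test_accuracy <- predict(final_fit, toy_testing) |>
    bind_cols(toy_testing) |>
    accuracy(truth = UNS, estimate = .pred_class)
test_accuracy
```

Tuning k with cross-validation on the training set keeps the testing set untouched until the final evaluation, so the reported accuracy is not biased by the choice of k.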

How to visualize:
One way we will visualize the results is with histograms. These help us analyze which predictors are relevant to user knowledge by showing how the values of each predictor are distributed across the knowledge levels.
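One possible way to compare the predictors side by side, sketched here on synthetic data, is to reshape the data to long format and draw one histogram panel per predictor. The tibble `toy_data` is a hypothetical stand-in with made-up values that only reuses the dataset's column names; the bin count is an arbitrary choice.

```r
library(tidyverse)

set.seed(1)

# Synthetic stand-in with the dataset's column names (values are made up)
toy_data <- tibble(
    STG = runif(120), SCG = runif(120), STR = runif(120),
    LPR = runif(120), PEG = runif(120)
) |>
    mutate(UNS = factor(if_else(PEG > 0.5, "High", "Low")))

# Reshape so that each row holds one (predictor, value) pair
toy_long <- toy_data |>
    pivot_longer(cols = STG:PEG, names_to = "predictor", values_to = "value")

# One histogram panel per predictor, colored by knowledge level
predictor_hists <- toy_long |>
    ggplot(aes(x = value, fill = UNS)) +
    geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
    facet_wrap(vars(predictor)) +
    labs(x = "Predictor value", y = "Count", fill = "User knowledge level")
predictor_hists
```

Putting the panels on shared axes makes it easier to see which predictor separates the knowledge levels with the least overlap.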

# Expected Outcomes and Significance
### Expect to find
- Given the user knowledge data, we expect to determine the most effective predictor of students' knowledge of Electrical DC Machines when tested.
### Impact of findings
- Our findings will identify the predictor that is most strongly associated with students' knowledge. This information could then be applied to areas of study other than Electrical DC Machines.
### Future questions
- Future questions could look at other predictors, including ones not in this dataset, and their association with user knowledge of Electrical DC Machines. For example, how strongly is room temperature associated with user knowledge when the same study methods are used?