# Predicting Knowledge Level in Electrical DC Machines: A Data Analysis of Study Time and Exam Performance

In [220]:
# update the altair on UBC's open jupyter to 5.2
# run this every time the server is restarted
# pip install -U altair

In [221]:
# Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")
np.random.seed(1)

## Introduction 
In this analysis, we use the “User Knowledge Modeling” dataset created by Hamdi Kahraman, Ilhami Colak and Seref Sagiroglu, which records students’ knowledge level in the subject of Electrical DC Machines. Our primary objective is to predict a student’s knowledge level from two key factors: study time and exam performance.

To address this question, we selected the variables we need from the dataset: “knowledge level”, “study time” and “exam performance”. The dataset is loaded into Python and wrangled into a tidy format for analysis. The following sections present the initial exploratory data analysis: several tables give a general overview of the data, and a visualization illustrates trends and patterns. Together, these components aim to provide a better understanding of the dataset we are using.


In [222]:
url = "https://drive.usercontent.google.com/download?id=1RoSQfR5gGaMZuM8KSyjbkNSWuHJ6LpXD&export=download&authuser=0&confirm=t&uuid=059f3b74-cd6e-4c50-90ad-f3f527a0ef70&at=APZUnTXBmCv1FCzVJOuNrbHUAbwg:1701902444840"

# import from two sheets and combine into one dataframe
data_training_sheet = pd.read_excel(url, sheet_name="Training_Data")
data_testing_sheet = pd.read_excel(url, sheet_name="Test_Data")
data = pd.concat([data_training_sheet, data_testing_sheet])

## Methods 
To choose the predictor variables, we first picked pairs of the five candidate variables at random and drew scatter plots of the training data, coloring the points by knowledge level. With study time and repetition time as axes, points of different colors were mixed together, which suggests that this pair is a poor basis for prediction. After several attempts, we found that with study time and exam performance as axes, the colors separate with clear boundaries. We therefore decided to use these two variables to classify the test data.
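The visual comparison of variable pairs could be complemented with a numeric check. The sketch below is a hypothetical illustration and not part of the original analysis: it scores every pair of features by cross-validated K-nearest neighbors accuracy. The data are synthetic stand-ins (columns `a`, `b`, `c`, and `label` are made up), and the helper name `score_feature_pairs` and the choices `k=5` and `cv=5` are assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): the label depends only on "c"
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 1, size=(200, 3)), columns=["a", "b", "c"])
demo["label"] = np.where(demo["c"] > 0.5, "High", "Low")


def score_feature_pairs(df, features, target, k=5, cv=5):
    """Return the mean cross-validated accuracy for every pair of features."""
    rows = []
    for pair in combinations(features, 2):
        # Standardize, then classify with K-nearest neighbors
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        scores = cross_val_score(model, df[list(pair)], df[target], cv=cv)
        rows.append({"features": pair, "mean_accuracy": scores.mean()})
    return pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)


pair_scores = score_feature_pairs(demo, ["a", "b", "c"], "label")
print(pair_scores)
```

On this synthetic data, the pairs that include the informative column `c` should rank above the pair `("a", "b")`. On the real data, the same idea would apply to the five candidate columns of the training set only.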

Our next step is to standardize the values of STG (study time) and PEG (exam performance), use cross-validation within the training data to select the best number of neighbors k, use the resulting model to predict the classes of the testing data, and finally evaluate the model's performance.
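K-nearest neighbors classifies by distance, so features measured on different scales would otherwise contribute unequally. As a minimal sketch with made-up values (not taken from the dataset), the block below shows that `StandardScaler` subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single feature column
values = np.array([[0.1], [0.3], [0.5], [0.9]])

# np.asarray keeps the comparison valid even if sklearn is configured
# to return pandas output
scaled = np.asarray(StandardScaler().fit_transform(values))

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1.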

One way to visualize our results is to compare two scatter plots of the testing data: one colored by the actual classification and the other by our predictions. We can then examine how the wrongly predicted points are distributed, to see whether they are outliers within their class or whether there are other reasons.
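One possible way to isolate the wrongly predicted points is to add a boolean column that compares the true and predicted labels. The sketch below uses a small hypothetical results table with made-up values; the column names `true` and `predicted` mirror those used later in this notebook, while `correct` is an assumed name.

```python
import pandas as pd

# Hypothetical results table (made-up values, not the notebook's output)
results = pd.DataFrame({
    "Study Time": [0.10, 0.45, 0.80, 0.30],
    "Exam Performance": [0.20, 0.55, 0.90, 0.28],
    "true": ["Low", "Middle", "High", "Low"],
    "predicted": ["Low", "High", "High", "very_low"],
})

# Flag each row by whether the prediction matches the true label
results["correct"] = results["true"] == results["predicted"]

# Keep only the misclassified rows for closer inspection
misclassified = results[~results["correct"]]
print(misclassified)
```

The `correct` column could also be mapped to the shape or opacity of the points in a scatter plot, so that misclassified points stand out.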


In [223]:
# Drops text comment columns 
data = data.drop(
    columns=["Attribute Information:", "Unnamed: 6", "Unnamed: 7"]
)

# rename columns to make them more readable
data = data.rename(
    columns={
        "STG": "Study Time",
        "SCG": "Repetition Time",
        "STR": "Study Time for Related Objects",
        "LPR": "Exam Performance for Related Objects",
        "PEG": "Exam Performance",
        " UNS": "Knowledge Level"
    }
)

# the original, complete, data set
data

index,Study Time,Repetition Time,Study Time for Related Objects,Exam Performance for Related Objects,Exam Performance,Knowledge Level
0,0.00,0.00,0.00,0.00,0.00,very_low
1,0.08,0.08,0.10,0.24,0.90,High
2,0.06,0.06,0.05,0.25,0.33,Low
3,0.10,0.10,0.15,0.65,0.30,Middle
4,0.08,0.08,0.08,0.98,0.24,Low
...,...,...,...,...,...,...
140,0.90,0.78,0.62,0.32,0.89,High
141,0.85,0.82,0.66,0.83,0.83,High
142,0.56,0.60,0.77,0.13,0.32,Low
143,0.66,0.68,0.81,0.57,0.57,Middle


In [224]:
data["Knowledge Level"].value_counts()

Low         129
Middle      122
High        102
very_low     50
Name: Knowledge Level, dtype: int64

In [225]:
# split data into training and testing sets
data_training, data_testing = train_test_split(
    data,
    test_size=0.25,
    stratify=data["Knowledge Level"]
)

In [226]:
data_training.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 116 to 89
Data columns (total 6 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Study Time                            302 non-null    float64
 1   Repetition Time                       302 non-null    float64
 2   Study Time for Related Objects        302 non-null    float64
 3   Exam Performance for Related Objects  302 non-null    float64
 4   Exam Performance                      302 non-null    float64
 5   Knowledge Level                       302 non-null    object 
dtypes: float64(5), object(1)
memory usage: 16.5+ KB


In [227]:
data_testing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101 entries, 48 to 78
Data columns (total 6 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Study Time                            101 non-null    float64
 1   Repetition Time                       101 non-null    float64
 2   Study Time for Related Objects        101 non-null    float64
 3   Exam Performance for Related Objects  101 non-null    float64
 4   Exam Performance                      101 non-null    float64
 5   Knowledge Level                       101 non-null    object 
dtypes: float64(5), object(1)
memory usage: 5.5+ KB


In [228]:
# create a scatterplot of the training data to visualize how knowledge level
# relates to study time and exam performance
alt.Chart(
    data_training,
    title="Exam Performance vs Study Time by Knowledge Level"
).mark_point().encode(
    x="Study Time",
    y="Exam Performance",
    color="Knowledge Level"
)

In [229]:
# create a table to show the mean and standard deviation of each level of knowledge
data_training.groupby("Knowledge Level").agg(["mean", "std"])

Knowledge Level,Study Time (mean),Study Time (std),Repetition Time (mean),Repetition Time (std),Study Time for Related Objects (mean),Study Time for Related Objects (std),Exam Performance for Related Objects (mean),Exam Performance for Related Objects (std),Exam Performance (mean),Exam Performance (std)
High,0.451289,0.247563,0.434289,0.250629,0.497829,0.263056,0.565,0.284246,0.795,0.11514
Low,0.322216,0.179193,0.31766,0.188765,0.439175,0.255692,0.443505,0.224425,0.254876,0.068197
Middle,0.351187,0.199856,0.367879,0.199484,0.481868,0.22528,0.371978,0.255357,0.536593,0.134298
very_low,0.262632,0.187158,0.258947,0.193332,0.345526,0.219504,0.276605,0.190398,0.093158,0.054876


In [230]:
# show the number of students in each level of knowledge
data_training["Knowledge Level"].value_counts()

Low         97
Middle      91
High        76
very_low    38
Name: Knowledge Level, dtype: int64

In [231]:
# check for missing values (True means the value is present)
data_training.notna()

index,Study Time,Repetition Time,Study Time for Related Objects,Exam Performance for Related Objects,Exam Performance,Knowledge Level
116,True,True,True,True,True,True
148,True,True,True,True,True,True
39,True,True,True,True,True,True
220,True,True,True,True,True,True
257,True,True,True,True,True,True
...,...,...,...,...,...,...
17,True,True,True,True,True,True
12,True,True,True,True,True,True
64,True,True,True,True,True,True
132,True,True,True,True,True,True


## Expected Outcomes and Significance 
We expect to find that students who study a lot and perform well on exams will have a high knowledge level.

These findings could motivate students to study harder and set new goals. That, in turn, could encourage better study habits and stronger time management skills in the future. The expected outcomes could also encourage teachers to focus on effective teaching methods.


## Future Questions 
- Is there a point beyond which very high study time and exam performance are associated with a lower knowledge level?
- How effective are traditional exams in measuring a student's knowledge?
- How does the pressure created by the link between exam performance and knowledge level affect a student's mental health?


---

In [232]:
# set up preprocessor, classifier (with no argument), and pipeline
prep = make_column_transformer(
    (StandardScaler(), ["Study Time", "Exam Performance"])
)
knn = KNeighborsClassifier()
x = data_training[["Study Time", "Exam Performance"]]
y = data_training["Knowledge Level"]
pipe = make_pipeline(prep, knn)

In [233]:
# cross validation to find the best k
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 50),
}

tune_grid = GridSearchCV(
    estimator=pipe,
    param_grid=parameter_grid,
    cv=10
)

accuracies_grid = pd.DataFrame(
    tune_grid.fit(
        X = x,
        y = y
    ).cv_results_
)

accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)

accuracy_vs_k = alt.Chart(accuracies_grid, title="Accuracy vs K").mark_line(point=True).encode(
    x=alt.X("n_neighbors")
        .title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

best_k = tune_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_k_df = pd.DataFrame({
    "best_k" : [best_k],
    "best_acc" : [tune_grid.best_score_]
})

# marks the best k
best_k_plot = alt.Chart(best_k_df).mark_point(filled=True, size=50).encode(
    x = "best_k",
    y = "best_acc",
    color = alt.value("red")
)

accuracy_vs_k + best_k_plot

In [234]:
# build the model with the best k
print("The best k selected is:", best_k)
knn_best = KNeighborsClassifier(n_neighbors=best_k)
pipe_best = make_pipeline(prep, knn_best)
pipe_best_fit = pipe_best.fit(X=x, y=y)

The best k selected is: 7


In [235]:
# use the model to predict the test set
predicted_test = data_testing.assign(
    true = data_testing["Knowledge Level"],
    predicted = pipe_best_fit.predict(data_testing[["Study Time", "Exam Performance"]])
)
confusion_matrix = pd.crosstab(
    predicted_test["true"],
    predicted_test["predicted"]
)
confusion_matrix

true (rows) / predicted (columns),High,Low,Middle,very_low
High,22,0,4,0
Low,0,29,0,3
Middle,2,6,23,0
very_low,0,2,0,10


In [236]:
# prediction accuracy score
pred_acc = pipe_best_fit.score(data_testing[["Study Time", "Exam Performance"]], data_testing["Knowledge Level"])
print("The estimated accuracy of the model on the testing set is:", pred_acc)

The estimated accuracy of the model on the testing set is: 0.8316831683168316
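The overall accuracy can be recovered from the confusion matrix, which also allows a per-class breakdown. The sketch below re-enters the counts printed in the confusion matrix output by hand (rows are true labels, columns are predicted labels); the variable names are assumptions, and in the notebook itself the existing `confusion_matrix` dataframe could be used instead.

```python
import numpy as np
import pandas as pd

labels = ["High", "Low", "Middle", "very_low"]

# Counts copied from the confusion matrix output (rows: true, columns: predicted)
cm = pd.DataFrame(
    [[22, 0, 4, 0],
     [0, 29, 0, 3],
     [2, 6, 23, 0],
     [0, 2, 0, 10]],
    index=labels,
    columns=labels,
)

# Correct predictions sit on the diagonal
correct = pd.Series(np.diag(cm), index=labels)

accuracy = correct.sum() / cm.to_numpy().sum()  # 84 correct out of 101
recall = correct / cm.sum(axis=1)               # share of each true class recovered
precision = correct / cm.sum(axis=0)            # share of each predicted class that is right

print(round(accuracy, 4))
print(recall.round(3))
print(precision.round(3))
```

The diagonal sums to 84 of 101 test observations, which matches the reported accuracy of about 0.83. The per-class values show where the errors concentrate: the "Middle" class has the lowest recall (23 of 31), mostly because 6 of its observations are predicted as "Low".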


In [237]:
# plot the predicted testing set 
pred_chart = alt.Chart(predicted_test, title="Predicted Categories").mark_point().encode(
    x = alt.X("Study Time").title("Study Time"),
    y = alt.Y("Exam Performance").title("Exam Performance"),
    color = alt.Color("predicted").title("Knowledge Level")
)

# plot the true values for the testing set
test_true_chart = alt.Chart(predicted_test, title="True Categories").mark_point().encode(
    x = alt.X("Study Time").title("Study Time"),
    y = alt.Y("Exam Performance").title("Exam Performance"),
    color = alt.Color("true").title("Knowledge Level")
)

# plot the charts side by side
test_true_chart | pred_chart