# Predicting Knowledge Level in Electrical DC Machines: A Data Analysis of Study Time and Exam Performance

In [24]:
# Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier  # knowledge level is a class label, not a number
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")
np.random.seed(1)

## Introduction 
In this data analysis, we use the “User Knowledge Modeling” dataset created by Hamdi Kahraman, Ilhami Colak and Seref Sagiroglu. The dataset records students’ knowledge level in the subject of Electrical DC Machines, an electrical engineering subject. Our primary objective is to predict a student’s knowledge level from two key factors: study time and exam performance.

To address this question, we selected the necessary variables from the dataset: “knowledge level”, “study time” and “exam performance”. The dataset has been loaded into Python, then wrangled and cleaned so that it is tidy for our analysis. The following sections present the initial exploratory data analysis. Several tables show the structure and a general overview of the data, and a visualization illustrates its trends and patterns. Together, these components aim to provide a better understanding of the dataset we are using.

In [25]:
url = "https://drive.usercontent.google.com/download?id=1Px4pE2Xf1TEGfYV3ChaoRRRS0YZbKbX_&export=download&authuser=0&confirm=t&uuid=7c9d6e2b-f34f-423d-ad4f-f386faaa47d4&at=APZUnTUUdkEofob3B1bEEFJ0HcHq:1698615817259"

# import from two sheets and combine into one dataframe
data_training_sheet = pd.read_excel(url, sheet_name="Training_Data")
data_testing_sheet = pd.read_excel(url, sheet_name="Test_Data")
data = pd.concat([data_training_sheet, data_testing_sheet])

## Methods 
After reading the data, we first combined the original training and testing sheets into one data frame, completed the cleaning and wrangling, and then re-divided the data into a training set and a testing set. We did this rather than keep the original split so that we could set the proportion of the two sets ourselves and ensure that rows are assigned to them at random.
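The combine-then-split step can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual cells. The `ignore_index=True` and `stratify` arguments are assumptions added here: the first gives each stacked row a unique index label, and the second keeps class proportions similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two small made-up sheets standing in for Training_Data and Test_Data.
sheet_a = pd.DataFrame({
    "STG": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80],
    "PEG": [0.05, 0.10, 0.30, 0.35, 0.55, 0.60, 0.85, 0.90],
    "UNS": ["Low", "Low", "Low", "Low", "High", "High", "High", "High"],
})
sheet_b = pd.DataFrame({
    "STG": [0.15, 0.25, 0.65, 0.75],
    "PEG": [0.08, 0.28, 0.58, 0.88],
    "UNS": ["Low", "Low", "High", "High"],
})

# ignore_index=True relabels the stacked rows 0..n-1 so no index value repeats.
combined = pd.concat([sheet_a, sheet_b], ignore_index=True)

# stratify keeps each class's share roughly equal in the two sets (assumed option).
train, test = train_test_split(
    combined,
    test_size=0.25,
    random_state=111,
    stratify=combined["UNS"],
)
```

Whether stratification is appropriate depends on how the classes are labelled in the real data, so treat it as an option to consider rather than a required step.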

We then drew scatter plots of the training data using randomly chosen pairs of the five predictors (STG, SCG, STR, LPR, PEG) as axes, coloring each point by its UNS class. With STG and SCG as axes, points from different UNS classes were thoroughly mixed, suggesting that this pair is not suitable for prediction. After several attempts, we found that with STG and PEG as axes the UNS classes separate along clear boundaries. We therefore decided to use STG (Study Time) and PEG (Exam Performance) to classify the test data.

Our next steps are to standardize the values of STG and PEG, use cross-validation on the training data to select the best k (the number of neighbors), predict the classes of the testing data with that k, and finally evaluate the performance of our model.
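The planned steps could look roughly like the sketch below. It runs on synthetic two-class data generated for illustration only, so the numbers it produces say nothing about the real dataset. The grid of k values (1 to 15), the 5-fold cross-validation, and the accuracy metric are all assumptions rather than choices stated above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two classes separated by exam performance.
rng = np.random.default_rng(1)
n = 60
low = pd.DataFrame({
    "Study Time": rng.uniform(0.0, 1.0, n),
    "Exam Performance": rng.uniform(0.0, 0.3, n),
    "Knowledge Level": "Low",
})
high = pd.DataFrame({
    "Study Time": rng.uniform(0.0, 1.0, n),
    "Exam Performance": rng.uniform(0.7, 1.0, n),
    "Knowledge Level": "High",
})
toy = pd.concat([low, high], ignore_index=True)

train, test = train_test_split(
    toy, test_size=0.25, random_state=111, stratify=toy["Knowledge Level"]
)
features = ["Study Time", "Exam Performance"]
X_train, y_train = train[features], train["Knowledge Level"]
X_test, y_test = test[features], test["Knowledge Level"]

# Standardize the predictors, then classify with K-nearest neighbors.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Cross-validate on the training set only to choose k (assumed grid: 1..15).
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 16))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
# GridSearchCV refits the best pipeline on all training data before scoring.
test_accuracy = grid.score(X_test, y_test)
```

Putting the scaler inside the pipeline means each cross-validation fold is standardized using only its own training portion, which keeps the testing data out of model selection.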

One way to visualize our results is to compare two scatter plots of the testing data: one colored by actual class and the other by predicted class. We can then examine where the misclassified points fall, to see whether they are outliers within their class or whether something else explains the incorrect predictions.
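Before plotting, it may help to flag the misclassified points in a table. The sketch below uses five hypothetical students. The `Predicted` and `Correct` column names are illustrative and do not come from the dataset.

```python
import pandas as pd

# Hypothetical actual and predicted labels for five test-set students.
results = pd.DataFrame({
    "Study Time": [0.10, 0.40, 0.55, 0.70, 0.90],
    "Exam Performance": [0.20, 0.30, 0.50, 0.85, 0.88],
    "Knowledge Level": ["Low", "Low", "Middle", "High", "High"],
    "Predicted": ["Low", "Middle", "Middle", "High", "High"],
})

# Mark each row as correctly or incorrectly predicted.
results["Correct"] = results["Knowledge Level"] == results["Predicted"]

# Keep only the rows the model got wrong, for closer inspection.
misclassified = results[~results["Correct"]]

# Cross-tabulate actual class (rows) against predicted class (columns).
confusion = pd.crosstab(results["Knowledge Level"], results["Predicted"])
```

A table like `results` could then be passed to `alt.Chart` twice, once with `color="Knowledge Level"` and once with `color="Predicted"`, to produce the two scatter plots described above.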

In [26]:
# Drops comment columns 
data = data.drop(
    columns=["Attribute Information:", "Unnamed: 6", "Unnamed: 7"]
)

In [27]:
# rename columns to make them more readable
data = data.rename(
    columns={
        "STG": "Study Time",
        "SCG": "Repetition Time",
        "STR": "Study Time for Related Objects",
        "LPR": "Exam Performance for Related Objects",
        "PEG": "Exam Performance",
        " UNS": "Knowledge Level"
    }
)
data

| | Study Time | Repetition Time | Study Time for Related Objects | Exam Performance for Related Objects | Exam Performance | Knowledge Level |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | very_low |
| 1 | 0.08 | 0.08 | 0.10 | 0.24 | 0.90 | High |
| 2 | 0.06 | 0.06 | 0.05 | 0.25 | 0.33 | Low |
| 3 | 0.10 | 0.10 | 0.15 | 0.65 | 0.30 | Middle |
| 4 | 0.08 | 0.08 | 0.08 | 0.98 | 0.24 | Low |
| ... | ... | ... | ... | ... | ... | ... |
| 140 | 0.90 | 0.78 | 0.62 | 0.32 | 0.89 | High |
| 141 | 0.85 | 0.82 | 0.66 | 0.83 | 0.83 | High |
| 142 | 0.56 | 0.60 | 0.77 | 0.13 | 0.32 | Low |
| 143 | 0.66 | 0.68 | 0.81 | 0.57 | 0.57 | Middle |


In [28]:
# split data into training and testing sets
data_training, data_testing = train_test_split(
    data,
    test_size=0.25,
    random_state=111
)

In [29]:
data_training.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 130 to 82
Data columns (total 6 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Study Time                            302 non-null    float64
 1   Repetition Time                       302 non-null    float64
 2   Study Time for Related Objects        302 non-null    float64
 3   Exam Performance for Related Objects  302 non-null    float64
 4   Exam Performance                      302 non-null    float64
 5   Knowledge Level                       302 non-null    object 
dtypes: float64(5), object(1)
memory usage: 16.5+ KB


In [30]:
data_testing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101 entries, 137 to 65
Data columns (total 6 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Study Time                            101 non-null    float64
 1   Repetition Time                       101 non-null    float64
 2   Study Time for Related Objects        101 non-null    float64
 3   Exam Performance for Related Objects  101 non-null    float64
 4   Exam Performance                      101 non-null    float64
 5   Knowledge Level                       101 non-null    object 
dtypes: float64(5), object(1)
memory usage: 5.5+ KB


In [31]:
# create a scatterplot of the data to visualize the relationship 
# between study time and exam performance and the knowledge level of the student
alt.Chart(data_training).mark_point().encode(
    x="Study Time",
    y="Exam Performance",
    color="Knowledge Level"
)


In [32]:
# create a table to show the mean and standard deviation of each level of knowledge
data_training.groupby("Knowledge Level").agg(["mean", "std"])

| Knowledge Level | Study Time (mean) | Study Time (std) | Repetition Time (mean) | Repetition Time (std) | Study Time for Related Objects (mean) | Study Time for Related Objects (std) | Exam Performance for Related Objects (mean) | Exam Performance for Related Objects (std) | Exam Performance (mean) | Exam Performance (std) |
|---|---|---|---|---|---|---|---|---|---|---|
| High | 0.400293 | 0.237109 | 0.428413 | 0.241783 | 0.534333 | 0.256321 | 0.531467 | 0.273727 | 0.804667 | 0.108271 |
| Low | 0.325896 | 0.18385 | 0.324354 | 0.182445 | 0.409375 | 0.252006 | 0.464375 | 0.230555 | 0.24875 | 0.073588 |
| Middle | 0.374656 | 0.208222 | 0.373215 | 0.211954 | 0.492581 | 0.234613 | 0.384516 | 0.248861 | 0.529785 | 0.133904 |
| Very Low | 0.224091 | 0.165462 | 0.321364 | 0.181194 | 0.293182 | 0.199365 | 0.185045 | 0.137006 | 0.100909 | 0.059674 |
| very_low | 0.321813 | 0.195123 | 0.204063 | 0.144167 | 0.395625 | 0.194695 | 0.37125 | 0.198926 | 0.085625 | 0.062072 |


In [33]:
# show the number of students in each level of knowledge
data_training["Knowledge Level"].value_counts()

Low         96
Middle      93
High        75
Very Low    22
very_low    16
Name: Knowledge Level, dtype: int64
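The counts above list both `Very Low` and `very_low`. If these are two spellings of the same class (an assumption, since the data itself does not say so), they could be merged before modeling so that the classifier does not treat them as separate classes. A minimal sketch on a made-up label column:

```python
import pandas as pd

# Made-up labels that mix the two spellings seen in the counts above.
labels = pd.Series(
    ["Low", "very_low", "Very Low", "High", "Middle", "very_low"],
    name="Knowledge Level",
)

# Map the underscore spelling onto the other one (assumes they mean the same class).
cleaned = labels.replace({"very_low": "Very Low"})

# Recount to confirm only one spelling remains.
counts = cleaned.value_counts()
```

The same `replace` call could be applied to the `Knowledge Level` column of the combined data frame before it is split into training and testing sets.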

## Expected Outcomes and Significance 
- What do you expect to find?  
We expect to find that students with high exam performance have a high knowledge level, and that knowledge level falls as exam performance decreases.

- What impact could such findings have?  
Such findings could motivate students to study harder and set new goals. This in turn could help students develop better study habits, leading to better time management skills in the future. These findings could also encourage teachers to focus on effective teaching methods.


## Future Questions 
- Is there a point of diminishing returns beyond which very high study time and exam performance are associated with a lower knowledge level?
- How effective are traditional exams in truly measuring a student's knowledge and understanding? 
- What are the long-term effects of the pressure associated with the correlation between exam performance and knowledge level on a student's mental health? 
- Do high exam scores always correlate with the ability to apply knowledge in real-world scenarios?