There are two CSV files containing student response data for the years 2021 and 2022. The columns of the CSV files are:

student_id
question_id
ability - The ability score of the student.
difficulty - The difficulty of the question.
answered_correctly - Whether the student answered the question correctly.

Analyse the CSV files to answer the following questions:

How did the students' ability to answer the questions change?
Did the questions get more difficult or easier?
Can you create a model that predicts whether a student will answer a question correctly?
Note down any other observations you may have about the data.

In [1]:
import pandas as pd

# Load the CSV files
data_2021 = pd.read_csv('student_responses_2021.csv')
data_2022 = pd.read_csv('/mnt/data/student_responses_2022.csv')

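# Display summary info and the first few rows of each file to understand the data structure and content.
# Note: DataFrame.info() prints its summary and returns None, so the tuple shown below contains None in its place.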
data_2021_info = data_2021.info()
data_2021_head = data_2021.head()

data_2022_info = data_2022.info()
data_2022_head = data_2022.head()

data_2021_info, data_2021_head, data_2022_info, data_2022_head

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   student_id          45000 non-null  int64  
 1   question_id         45000 non-null  int64  
 2   ability             44100 non-null  float64
 3   difficulty          44100 non-null  float64
 4   answered_correctly  45000 non-null  bool   
dtypes: bool(1), float64(2), int64(2)
memory usage: 1.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   student_id          50000 non-null  int64  
 1   question_id         50000 non-null  int64  
 2   ability             49000 non-null  float64
 3   difficulty          49000 non-null  float64
 4   answered_correctly  50000 non-null  bool   
dtypes: bool(1), float64(2), int64(2)
m

(None,
    student_id  question_id   ability  difficulty  answered_correctly
 0         967           40 -0.320326    1.572793               False
 1         135            2 -2.897524   -1.920777               False
 2         553           18 -1.207800   -0.482901               False
 3         564           29 -2.012802    0.360224               False
 4         449           42  0.332158    1.211066               False,
 None,
    student_id  question_id   ability  difficulty  answered_correctly
 0        1078           22  1.023237   -0.280792                True
 1        1980           37 -2.243018    1.430540               False
 2        1240           14  3.453396   -0.872714                True
 3        1329           33  1.750230    0.816172                True
 4        1995           17  2.288340   -0.558914                True)

Observations from Summary Statistics
Student Ability:

In 2021, the mean ability score was approximately -0.042, whereas in 2022 it rose to about 0.330. This suggests that students' ability scores improved from 2021 to 2022.
The standard deviation also increased, from 2.17 in 2021 to 2.45 in 2022, indicating a wider spread of ability scores in 2022. The maximum rose from 5.89 to 9.96, so a few very high values may account for part of that increase.

Question Difficulty:

The mean difficulty score shifted from -0.057 in 2021 to 0.117 in 2022, suggesting that questions became slightly harder on average in 2022.
The standard deviation of difficulty increased slightly, from 1.05 to 1.17, so question difficulty was somewhat more spread out in 2022.

Next, I'll examine how well ability and difficulty predict whether a student will answer a question correctly, using a logistic regression model.


In [2]:
# Calculate summary statistics for 'ability' and 'difficulty' columns for both years

ability_2021_stats = data_2021['ability'].describe()
difficulty_2021_stats = data_2021['difficulty'].describe()

ability_2022_stats = data_2022['ability'].describe()
difficulty_2022_stats = data_2022['difficulty'].describe()

ability_2021_stats, difficulty_2021_stats, ability_2022_stats, difficulty_2022_stats

(count    44100.000000
 mean        -0.041824
 std          2.167235
 min         -5.751698
 25%         -1.507034
 50%         -0.068255
 75%          1.442855
 max          5.889554
 Name: ability, dtype: float64,
 count    44100.000000
 mean        -0.057131
 std          1.053968
 min         -2.157522
 25%         -0.870725
 50%         -0.147794
 75%          0.904278
 max          1.774446
 Name: difficulty, dtype: float64,
 count    49000.000000
 mean         0.330433
 std          2.448007
 min         -5.653869
 25%         -1.047968
 50%          0.101803
 75%          1.704314
 max          9.957734
 Name: ability, dtype: float64,
 count    49000.000000
 mean         0.116709
 std          1.171172
 min         -1.955096
 25%         -0.728539
 50%         -0.007172
 75%          1.199790
 max          2.194765
 Name: difficulty, dtype: float64)
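
The describe() calls above weight every response row equally, so a student who answered many questions counts many times in the mean. A minimal sketch of a per-student and per-question comparison follows. It assumes that ability is constant for a given student_id and difficulty for a given question_id within a year, which the source does not confirm. The toy frames and the helper name yearly_summary are made up for illustration and stand in for data_2021 and data_2022.

```python
import pandas as pd


def yearly_summary(df: pd.DataFrame) -> pd.Series:
    """Summarise one year of responses (hypothetical helper)."""
    # Collapse to one value per student and per question so that
    # repeated responses do not weight the means.
    per_student = df.groupby("student_id")["ability"].mean()
    per_question = df.groupby("question_id")["difficulty"].mean()
    return pd.Series({
        "mean_ability": per_student.mean(),
        "mean_difficulty": per_question.mean(),
        "share_correct": df["answered_correctly"].mean(),
    })


# Toy data standing in for the real CSV files (values are invented).
toy_2021 = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "question_id": [10, 11, 10, 11],
    "ability": [-1.0, -1.0, 1.0, 1.0],
    "difficulty": [0.0, 0.5, 0.0, 0.5],
    "answered_correctly": [False, False, True, True],
})
toy_2022 = pd.DataFrame({
    "student_id": [3, 3, 4, 4],
    "question_id": [10, 11, 10, 11],
    "ability": [0.0, 0.0, 2.0, 2.0],
    "difficulty": [0.5, 1.0, 0.5, 1.0],
    "answered_correctly": [False, False, True, True],
})

# Side-by-side table with the year-over-year change in each summary value.
comparison = pd.DataFrame({
    "2021": yearly_summary(toy_2021),
    "2022": yearly_summary(toy_2022),
})
comparison["change"] = comparison["2022"] - comparison["2021"]
print(comparison)
```

Including share_correct alongside the two means helps separate "students scored higher on the ability scale" from "students answered more questions correctly", which need not move together if difficulty also rose.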

To create a model that predicts if a student will answer a question correctly based on attributes like the student's ability and question difficulty, we’ll go through the following steps:

Data Preparation: Load and preprocess the data.
Feature Engineering: Create features that capture relevant information.
Model Selection and Training: Train a model to predict the probability of a correct answer.
Evaluation: Evaluate model performance on a test set.

In [9]:
# Concatenate both datasets for a larger training set
data = pd.concat([data_2021, data_2022], ignore_index=True)

# Check for missing values
print(data.isnull().sum())

# Drop rows with missing values, if any
data.dropna(inplace=True)


student_id               0
question_id              0
ability               1900
difficulty            1900
answered_correctly       0
dtype: int64
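
The counts above do not show whether ability and difficulty are missing in the same rows, which determines how many rows dropna() actually removes. A small sketch of how that could be checked, using an invented toy frame in place of the concatenated data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the concatenated `data` (values are invented).
toy = pd.DataFrame({
    "ability": [0.1, np.nan, 0.3, np.nan, 0.5],
    "difficulty": [np.nan, np.nan, 0.2, 0.4, 0.6],
    "answered_correctly": [True, False, True, False, False],
})

missing_ability = toy["ability"].isna()
missing_difficulty = toy["difficulty"].isna()

# Cross-tabulate the two missingness flags to see how much they overlap.
overlap = pd.crosstab(missing_ability, missing_difficulty)
print(overlap)

# Rows that dropna() would remove: missing in at least one column.
any_missing = missing_ability | missing_difficulty
rows_dropped = int(any_missing.sum())
print(rows_dropped)

# Rough check that dropping rows does not shift the target:
# share of correct answers with and without missing values.
print(toy.groupby(any_missing)["answered_correctly"].mean())
```

If the share of correct answers differs noticeably between the two groups, dropping the incomplete rows could bias the model, and imputation might be worth considering instead.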


Feature Engineering
No new features are created here. We use the two existing columns, ability and difficulty, as features and standardize them to zero mean and unit variance, which suits the regularized logistic regression used in the next step.

In [10]:
from sklearn.preprocessing import StandardScaler

# Select features and target variable
X = data[['ability', 'difficulty']]
y = data['answered_correctly']

# Scale features for better model performance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


Model Selection and Training
We’ll use a simple model, such as Logistic Regression, to start with. Logistic Regression is suitable as it can output the probability of a student answering correctly.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
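
In the cells above the scaler is fitted on all rows before the split, so test-set statistics leak into the preprocessing. With two features and this many rows the effect is probably negligible, but a scikit-learn Pipeline avoids it by fitting the scaler on the training rows only. The sketch below runs on synthetic data. The data-generating rule is an assumption made only so the example is self-contained, not a claim about the real files.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumed distributions).
rng = np.random.default_rng(42)
n = 2000
ability = rng.normal(0.0, 2.0, n)
difficulty = rng.normal(0.0, 1.0, n)
X_toy = np.column_stack([ability, difficulty])
# Assumed noisy rule: higher ability relative to difficulty makes a
# correct answer more likely.
y_toy = (ability - difficulty + rng.logistic(0.0, 1.0, n)) > 0

# Split first, so the test rows never influence preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when scoring the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 3))
```

The same pipeline object can be passed to cross_val_score, in which case the scaler is refitted inside every fold.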


Evaluation
We evaluate the model on the held-out test set using accuracy, ROC AUC, and the confusion matrix, all of which are standard for binary classification.

In [12]:
# Calculating evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)

# Displaying results
accuracy, roc_auc, conf_matrix


(0.9996164383561644,
 0.9999999519201118,
 array([[8852,    7],
        [   0, 9391]]))
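
Reading the confusion matrix with scikit-learn's convention (rows are true labels, columns are predictions), the model makes 7 false-positive errors and no false-negative errors. Scores this close to perfect usually mean the target is almost a deterministic function of the features. The sample rows printed earlier are all consistent with answered_correctly being True exactly when ability exceeds difficulty, though that is only a guess from ten rows. A sketch of how the guess could be tested, using a toy frame in place of data (the first five rows are rounded from the head() output above, the last row is invented to show an exception):

```python
import pandas as pd

# Toy frame standing in for `data`.
toy = pd.DataFrame({
    "ability": [-0.32, -2.90, 0.33, 1.02, 3.45, 0.50],
    "difficulty": [1.57, -1.92, 1.21, -0.28, -0.87, 0.40],
    "answered_correctly": [False, False, False, True, True, False],
})

# Hypothesised rule (an assumption, not stated in the source):
# a student answers correctly when ability exceeds difficulty.
rule_pred = toy["ability"] > toy["difficulty"]

# Share of rows where the rule agrees with the recorded answer.
rule_accuracy = (rule_pred == toy["answered_correctly"]).mean()
print(rule_accuracy)

# Rows that break the rule are worth inspecting individually.
exceptions = toy[rule_pred != toy["answered_correctly"]]
print(exceptions)
```

If the rule matches nearly every row in the real data, the logistic regression is mostly recovering that threshold, and the rule itself would be a simpler baseline to report next to the model.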