# Skillbuilder Documentation

**This notebook walks through the skillbuilder data and the preprocessing steps that lead up to fitting the torchbkt model to the data. It also serves as step-by-step documentation of the datahelper package.**

In [126]:
import numpy as np
import networkx as nx
import pickle

from collections import Counter

from datahelper.importer import SkillbuilderImporter
from datahelper.utils import get_unique_skills
from datahelper.preprocess import remove_null, remove_rare_skills, blockify, prepare_fitting
from datahelper.graphcut import get_skill_graph, Blocks, split_blocks
from datahelper.postprocess import BlockParams

import torch
from torchbkt import *

## Import Skillbuilder-Data

The **SkillbuilderImporter** imports the relevant columns of the skillbuilder data and creates the skill_name columns (skill_name0 to skill_name3), which store the skills of each row separately instead of concatenated as in skill_id.
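
As a rough illustration of what "unconcatenated" means here, the sketch below splits an underscore-joined skill_id into up to four separate fields and pads the rest with None. The helper name `split_skill_id` and the fixed width of four are assumptions derived from the skill_name0 to skill_name3 columns; this is not the SkillbuilderImporter implementation.

```python
def split_skill_id(skill_id, width=4):
    """Split an underscore-joined skill_id into `width` fields, padded with None.

    Hypothetical helper for illustration only.
    """
    if skill_id is None:
        return [None] * width
    parts = skill_id.split("_")
    if len(parts) > width:
        raise ValueError(f"expected at most {width} skills, got {len(parts)}")
    return parts + [None] * (width - len(parts))


print(split_skill_id("9_15"))        # ['9', '15', None, None]
print(split_skill_id("2_37_48_77"))  # ['2', '37', '48', '77']
```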

In [127]:
path = "data/skill_builder_data_corrected_collapsed.csv"
importer = SkillbuilderImporter()
skillbuilder = importer(path, ['problem_id'])
skillbuilder

Unnamed: 0,order_id,user_id,problem_id,correct,skill_id,skill_name0,skill_name1,skill_name2,skill_name3
222489,20224085,73963,76429,0,297,297,,,
222490,20224095,73963,76430,1,297,297,,,
222491,20224113,73963,76431,1,297,297,,,
222492,20224123,73963,76432,1,297,297,,,
222493,20224142,73963,76433,0,297,297,,,
...,...,...,...,...,...,...,...,...,...
19379,38310198,96282,135605,1,9_15,9,15,,
19380,38310199,96282,135607,1,9_15,9,15,,
19381,38310200,96282,135601,1,9_15,9,15,,
19382,38310201,96282,135602,1,9_15,9,15,,


## General Information about Data Columns

Not all columns are used in our project; everything in **bold** is relevant and used in some way.
- **order_id: These IDs are chronological and refer to the ID of the original problem log.**
- assignment_id: Two different assignments can have the same sequence id. Each assignment is specific to a single teacher/class.
- **user_id: The ID of the student doing the problem.**
- assistment_id: The ID of the ASSISTment. An ASSISTment consists of one or more problems.
- **problem_id: The ID of the problem.**
- original:
    - 1 = Main problem
    - 0 = Scaffolding problem
- **correct**
    - 1 = Correct on the first attempt
    - 0 = Incorrect on the first attempt, or asked for help.
    - This column is often the target for prediction
- attempt_count: Number of student attempts on this problem.
- ms_first_response: The time in milliseconds for the student's first response.
- tutor_mode: tutor, test mode, pretest, or posttest
- answer_type: 
    - choose_1: Multiple choice (radio buttons)
    - algebra: Math evaluated string (text box)
    - fill_in: Simple string-compared answer (text box)
    - open_response: Records student answer, but their response is always marked correct

- sequence_id: The content id of the problem set. Different assignments that assign the same problem set will have the same sequence id.
- student_class_id: The class ID.
- position: Assignment position on the class assignments page.
- type:
    - Linear - Student completes all problems in a predetermined order.
    - Random - Student completes all problems, but each student is presented with the problems in a different random order.
    - Mastery - Random order; and students must "master" the problem set by getting a certain number of questions (3 by default) correct in a row before being able to continue.
- base_sequence_id: This is to account for if a sequence has been copied. This will point to the original copy, or be the same as sequence_id if it hasn't been copied.
- **skill_id: ID of the skill associated with the problem. Different skills for the same data record are in the same row, separated with underscore.**
- skill_name: Skill name associated with the problem. (Open question: how this column looks when a problem has multiple skills.)

- teacher_id: The ID of the teacher who assigned the problem.

- school_id: The ID of the school where the problem was assigned.

- hint_count: Number of hints the student requested on this problem.

- hint_total: Number of possible hints on this problem.

- overlap_time: The time in milliseconds for the student's overlap time.

- template_id: The template ID of the ASSISTment. ASSISTments with the same template ID have similar questions.

- answer_id: The answer ID for multi-choice questions.

- answer_text: The answer text for fill-in questions.

- first_action: The type of first action: attempt or ask for a hint.

- bottom_hint: Whether or not the student asks for all hints.

- opportunity: The number of opportunities the student has to practice on this skill. For the non skill builder dataset, opportunities for different skills of the same data record are in the same row, separated with comma.

- opportunity_original: The number of opportunities the student has to practice on this skill counting only original problems. For the non skill builder dataset, original opportunities for different skills of the same data record are in the same row, separated with comma.

## General Analysis

We first remove rows with a missing skill_id, i.e. rows for which we do not know which skills were involved.
We then do some exploratory data analysis to get a better overview of the dataset.

In [128]:
# MISSING VALUES
print(skillbuilder.isna().sum())
skillbuilder = remove_null(skillbuilder)
skillbuilder

order_id            0
user_id             0
problem_id          0
correct             0
skill_id        63755
skill_name0     63755
skill_name1    299823
skill_name2    340680
skill_name3    345181
dtype: int64


Unnamed: 0,order_id,user_id,problem_id,correct,skill_id,skill_name0,skill_name1,skill_name2,skill_name3
222489,20224085,73963,76429,0,297,297,,,
222490,20224095,73963,76430,1,297,297,,,
222491,20224113,73963,76431,1,297,297,,,
222492,20224123,73963,76432,1,297,297,,,
222493,20224142,73963,76433,0,297,297,,,
...,...,...,...,...,...,...,...,...,...
19379,38310198,96282,135605,1,9_15,9,15,,
19380,38310199,96282,135607,1,9_15,9,15,,
19381,38310200,96282,135601,1,9_15,9,15,,
19382,38310201,96282,135602,1,9_15,9,15,,


In [129]:
# GENERAL
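uniques = np.unique(skillbuilder["skill_id"])
# Each skill_id joins one or more skills with "_"; split them to count individual skills.
skill_sets = [set(x.split("_")) for x in uniques]
flat_list = set(item for skill_set in skill_sets for item in skill_set)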
stud = np.unique(skillbuilder["user_id"])
num_stud = len(stud)
num_entries = len(skillbuilder)

print("Imported data entries:", num_entries)
print("Number of students:", num_stud)
print("Number of different exercise types:", len(uniques))
print("Number of different exercises:", len(np.unique(skillbuilder["problem_id"])))
print("Number of different skills:", len(flat_list))

Imported data entries: 283105
Number of students: 4163
Number of different exercise types: 149
Number of different exercises: 17751
Number of different skills: 123


In [130]:
# STUDENTS
df_stud = skillbuilder.groupby('user_id').size().reset_index()
df_stud.columns = ['Student', 'Number Exercises']

print("Average exercises per student:", num_entries / num_stud)
print("Students that filled most exercises:\n", df_stud.sort_values(by=["Number Exercises"], ascending=False).head(10))
print("Number of students that filled at least 100 exercises:", len(df_stud[df_stud["Number Exercises"] >= 100]))

Average exercises per student: 68.00504443910641
Students that filled most exercises:
       Student  Number Exercises
748     79021              1061
161     75169              1029
697     78970              1014
757     79032               986
706     78979               984
4107    96244               970
4137    96274               963
131     71881               934
714     78987               921
705     78978               882
Number of students that filled at least 100 exercises: 702


In [131]:
# SKILLS
counter_dict = Counter(skillbuilder["skill_id"])
print('Exercise type counts:\n', sorted(counter_dict.items(), key=lambda x: -x[1]))

skills, counts = get_unique_skills(skillbuilder, return_counts=True)
skill_counts = dict(zip(skills, counts))
print('Skill counts:\n', sorted(skill_counts.items(), key=lambda x: -x[1]))

Exercise type counts:
 [('311', 24253), ('47', 18739), ('277', 12741), ('280', 11334), ('312', 8115), ('79', 8068), ('279', 7058), ('27', 6590), ('18', 6557), ('50', 6117), ('77', 6109), ('67', 5547), ('74_92', 5501), ('325', 5398), ('49', 4895), ('310', 4659), ('46', 4434), ('278', 4320), ('70', 4263), ('16_17', 4073), ('61', 4029), ('17', 3890), ('11', 3864), ('65', 3256), ('81', 3183), ('309', 3072), ('76', 3050), ('297', 2978), ('86', 2947), ('85', 2900), ('83', 2813), ('75', 2576), ('49_50', 2422), ('1_13', 2212), ('74_85', 2168), ('58_85', 2059), ('51', 2007), ('276', 1970), ('11_70', 1945), ('5_375', 1894), ('82', 1888), ('8', 1859), ('58', 1816), ('2_70', 1807), ('4', 1804), ('368', 1792), ('40', 1769), ('294', 1760), ('308', 1706), ('63_75', 1584), ('2_37_48_77', 1582), ('10_13', 1551), ('94', 1533), ('51_53_75', 1525), ('25', 1524), ('301', 1480), ('2_37_70', 1474), ('13', 1451), ('101', 1332), ('1_15', 1312), ('92', 1301), ('303', 1285), ('9_12', 1244), ('39', 1149), ('35_46

## Remove Rare Skills

It is not useful to look at skills for which we have only a very small number of exercises. Therefore, we delete skills with fewer than threshold=30 occurrences. Rows that also include other skills are kept (without the rare skill); rows featuring only rare skills are removed from the dataset completely.
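
The filtering rule can be sketched in plain Python as follows. This is a minimal illustration on lists of skills, assuming that an "occurrence" is one row containing the skill; the function name `drop_rare_skills` is hypothetical and the actual `remove_rare_skills` in datahelper operates on the dataframe and may differ in detail.

```python
from collections import Counter


def drop_rare_skills(rows, threshold=30):
    """Filter a list of per-row skill lists (hypothetical helper).

    Skills occurring in fewer than `threshold` rows are removed from every
    row; rows left without any skill are dropped entirely.
    """
    counts = Counter(skill for skills in rows for skill in set(skills))
    kept_rows = []
    for skills in rows:
        frequent = [s for s in skills if counts[s] >= threshold]
        if frequent:  # rows consisting only of rare skills disappear
            kept_rows.append(frequent)
    return kept_rows


rows = [["9", "15"], ["9"], ["15", "42"], ["7"]]
print(drop_rare_skills(rows, threshold=2))  # [['9', '15'], ['9'], ['15']]
```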

In [132]:
skillbuilder = remove_rare_skills(skillbuilder, threshold=30)
skillbuilder

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['skill_id'] = ['_'.join(filter(None, x)) for x in zip(df['skill_name0'], df['skill_name1'], df['skill_name2'], df['skill_name3'])]


Unnamed: 0,order_id,user_id,problem_id,correct,skill_id,skill_name0,skill_name1,skill_name2,skill_name3
222489,20224085,73963,76429,0,297,297,,,
222490,20224095,73963,76430,1,297,297,,,
222491,20224113,73963,76431,1,297,297,,,
222492,20224123,73963,76432,1,297,297,,,
222493,20224142,73963,76433,0,297,297,,,
...,...,...,...,...,...,...,...,...,...
19379,38310198,96282,135605,1,9_15,9,15,,
19380,38310199,96282,135607,1,9_15,9,15,,
19381,38310200,96282,135601,1,9_15,9,15,,
19382,38310201,96282,135602,1,9_15,9,15,,


## Skill Graph and Blocks

Next we create a skill graph. Its nodes are given by all skills in the dataset. An edge between two skills exists if the two skills co-occur in at least one row of the dataframe. The edge weights are determined by the number of co-occurrences. By finding all connected components of this skill graph, we can find blocks of skills that co-occur in the data.
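
The construction described above can be sketched without any dependencies. The notebook itself relies on `get_skill_graph` and `nx.connected_components`; the two functions below are hypothetical stand-ins that only illustrate the idea on a toy list of rows.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_edges(rows):
    """Count, for every unordered skill pair, the rows in which both appear."""
    weights = Counter()
    for skills in rows:
        for a, b in combinations(sorted(set(skills)), 2):
            weights[(a, b)] += 1
    return weights


def connected_components(nodes, edges):
    """Group nodes into components with an iterative depth-first search."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


rows = [["9", "15"], ["9", "15"], ["1", "15"], ["297"]]
nodes = sorted({s for skills in rows for s in skills})
edges = cooccurrence_edges(rows)          # edge weights = co-occurrence counts
components = connected_components(nodes, edges)
print(edges)
print(components)
```

Skill 297 never co-occurs with another skill, so it ends up as a block of its own, while 1, 9 and 15 are linked through shared rows and form one block.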

In [133]:
# SKILL COMBINATIONS AS GRAPH
skill_graph = get_skill_graph(skillbuilder)
blocks = list(nx.connected_components(skill_graph))
blocks = Blocks(blocks)

print("Number of blocks:", len(blocks))
print("Maximum length of a block", max(len(y) for y in blocks))
print("Blocks of skills present in the data:", blocks)

Number of blocks: 75
Maximum length of a block 15
Blocks of skills present in the data: <datahelper.graphcut.Blocks object at 0x7ff269d4bb80>


## Split Blocks

Since the number of states in our BlockBKT model grows exponentially with the number of skills in a block, we have to make sure that no block contains too many skills. After setting a threshold for the maximum block size, we iteratively split the largest block with spectral clustering (considering the edge weights via the block's affinity matrix) until every block has at most the desired maximum size.
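
To get a feeling for why the block size has to be capped, assume that every skill in a block is either mastered or not mastered, so that a block with k skills has 2^k joint states. This binary-state assumption is the usual BKT setup and is not spelled out in this notebook; the function name is hypothetical.

```python
def n_joint_states(n_skills):
    """Number of joint mastery states, assuming one binary state per skill."""
    return 2 ** n_skills


# 15 is the largest block found above, 5 is the threshold used below.
for k in (1, 5, 15):
    print(f"{k:2d} skills -> {n_joint_states(k)} states")
```

Under this assumption, the largest unsplit block would need 32768 states, while a block of at most 5 skills needs no more than 32.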

In [134]:
threshold = 5
random_state = 2020
assign_labels = 'kmeans'

split_blocks(blocks, skill_graph, threshold, random_state, assign_labels)
blocks.blocks_

[{'101'},
 {'105', '97', '99'},
 {'110'},
 {'16', '17'},
 {'163'},
 {'165'},
 {'166'},
 {'173', '190', '193', '221'},
 {'203'},
 {'204'},
 {'21', '22'},
 {'217'},
 {'25'},
 {'26'},
 {'27'},
 {'276'},
 {'277'},
 {'278', '34', '69', '81'},
 {'279'},
 {'280'},
 {'290'},
 {'292', '293'},
 {'294'},
 {'295'},
 {'296'},
 {'297'},
 {'298'},
 {'299'},
 {'301'},
 {'303'},
 {'307'},
 {'308'},
 {'309'},
 {'310'},
 {'311'},
 {'312'},
 {'314'},
 {'317'},
 {'32'},
 {'322'},
 {'324'},
 {'325'},
 {'333'},
 {'343'},
 {'346'},
 {'350'},
 {'356'},
 {'362'},
 {'365'},
 {'367'},
 {'368'},
 {'371'},
 {'375', '5'},
 {'378'},
 {'39'},
 {'4'},
 {'40'},
 {'42'},
 {'49', '50'},
 {'54'},
 {'61'},
 {'65'},
 {'67'},
 {'76'},
 {'8'},
 {'80'},
 {'82'},
 {'83'},
 {'84'},
 {'86'},
 {'91'},
 {'94'},
 {'96'},
 {'323', '74', '92'},
 {'1', '13', '15', '9'},
 {'24', '35', '46'},
 {'51', '53', '63', '75'},
 {'58', '85'},
 {'11', '70'},
 {'10', '12', '14', '18', '64'},
 {'104'},
 {'2', '37', '48', '77', '79'},
 {'47'}]

## Split Dataframe (according to Blocks)

After having found blocks of reasonable size, we can extract the relevant rows of the skillbuilder data for every block, thereby creating a separate dataframe for every block. Note that some rows can be present in multiple dataframes (with non-overlapping skills each time) due to the block splitting. The following exploration shows that some blocks contain very little data.
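
A minimal sketch of this extraction step on plain dictionaries is shown below. The helper `rows_for_block` and the record layout are assumptions for illustration; `blockify` itself works on the dataframe. The two blocks are taken from the split result above.

```python
def rows_for_block(rows, block):
    """Keep rows that touch `block`, restricted to the block's skills."""
    selected = []
    for record in rows:
        in_block = [s for s in record["skills"] if s in block]
        if in_block:
            selected.append({**record, "skills": in_block})
    return selected


rows = [
    {"order_id": 1, "skills": ["2", "37", "70"]},
    {"order_id": 2, "skills": ["70"]},
    {"order_id": 3, "skills": ["297"]},
]
blocks = [{"2", "37", "48", "77", "79"}, {"11", "70"}]
per_block = [rows_for_block(rows, block) for block in blocks]

for block_rows in per_block:
    print(block_rows)
```

The row with order_id 1 shows up in both results, once with skills 2 and 37 and once with skill 70, which is the duplication across dataframes mentioned above.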

In [135]:
block_dfs = blockify(skillbuilder, blocks)
block_dfs[1]

Unnamed: 0,order_id,user_id,problem_id,correct,skill_id,skill_name0,skill_name1,skill_name2
0,20660999,79280,75256,0,105,105,,
1,20661295,79272,75242,1,105,105,,
2,20661440,79272,75218,1,105,105,,
3,20663132,79290,75252,0,105,105,,
4,20663448,79287,75256,1,105,105,,
...,...,...,...,...,...,...,...,...
712,38304282,96239,93321,1,105,105,,
713,38304897,96225,75268,1,105,105,,
714,38306277,96225,75222,1,105,105,,
715,38309006,96282,75238,1,105,105,,


In [136]:
block_lengths = [df.shape[0] for df in block_dfs]
sorted(block_lengths)

[32,
 32,
 33,
 36,
 47,
 87,
 89,
 90,
 91,
 102,
 108,
 109,
 115,
 117,
 213,
 234,
 237,
 278,
 280,
 286,
 288,
 305,
 353,
 389,
 392,
 398,
 456,
 459,
 491,
 495,
 632,
 646,
 671,
 717,
 843,
 877,
 926,
 949,
 951,
 1149,
 1183,
 1285,
 1332,
 1480,
 1524,
 1533,
 1706,
 1760,
 1769,
 1792,
 1804,
 1859,
 1888,
 1895,
 1970,
 2813,
 2947,
 2978,
 3050,
 3072,
 3256,
 4029,
 4659,
 5398,
 5547,
 6145,
 6611,
 7058,
 7717,
 8115,
 8757,
 8980,
 9772,
 10417,
 11334,
 12450,
 12741,
 13361,
 13434,
 16323,
 18742,
 20434,
 24253]

In [137]:
block_users = [df['user_id'].nunique() for df in block_dfs]
sorted(block_users)

[5,
 6,
 6,
 8,
 11,
 13,
 14,
 15,
 20,
 22,
 28,
 29,
 30,
 33,
 34,
 39,
 41,
 41,
 75,
 87,
 88,
 95,
 135,
 140,
 147,
 155,
 167,
 168,
 176,
 184,
 194,
 202,
 206,
 215,
 223,
 229,
 229,
 230,
 233,
 256,
 264,
 264,
 264,
 264,
 269,
 270,
 274,
 282,
 283,
 283,
 304,
 307,
 318,
 333,
 345,
 346,
 348,
 354,
 367,
 412,
 458,
 483,
 525,
 527,
 619,
 625,
 651,
 664,
 724,
 783,
 900,
 900,
 961,
 999,
 1063,
 1087,
 1161,
 1164,
 1203,
 1225,
 1226,
 1263,
 1353]

In [138]:
exercise_user_ratio = [n / u for (n, u) in zip(block_lengths, block_users)]
sorted(exercise_user_ratio)

[1.0,
 1.0666666666666667,
 1.2105263157894737,
 1.222707423580786,
 1.6122448979591837,
 1.6206896551724137,
 1.7023809523809523,
 1.9857142857142858,
 2.1707317073170733,
 2.257425742574257,
 2.2857142857142856,
 2.302325581395349,
 2.3333333333333335,
 2.5416666666666665,
 2.6079545454545454,
 2.684057971014493,
 2.715909090909091,
 2.84,
 2.8536585365853657,
 2.9893617021276597,
 3.0679611650485437,
 3.2058823529411766,
 3.310344827586207,
 3.3275109170305677,
 3.637037037037037,
 3.763948497854077,
 4.152838427947598,
 4.167741935483871,
 4.188679245283019,
 4.2269736842105265,
 4.255605381165919,
 4.3522727272727275,
 4.420454545454546,
 4.430635838150289,
 4.481060606060606,
 4.5,
 4.626506024096385,
 4.760904684975768,
 5.229681978798586,
 5.251412429378531,
 5.564917127071824,
 5.660919540229885,
 5.837133550488599,
 6.0,
 6.3146997929606625,
 6.91015625,
 6.992592592592593,
 7.084291187739463,
 7.15668202764977,
 7.842222222222222,
 7.843478260869565,
 8.376940133037694,
 8.4

## Preprocess Dataframes for Fitting

Before fitting, we rename the skill_name columns and, most importantly, binary-encode them: each block dataframe gets one 0/1 column per skill in the block. The binary format is required by the BlockBKT implementation in torchbkt.
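
The encoding can be illustrated with a small multi-hot helper. The name `multi_hot` is hypothetical and the column order is simply the order of the given skill list; `prepare_fitting` may order or name things differently.

```python
def multi_hot(skills, block_skills):
    """Return one 0/1 flag per skill in `block_skills`, in that order."""
    present = set(skills)
    return [1 if s in present else 0 for s in block_skills]


block_skills = ["105", "97", "99"]
print(multi_hot(["105"], block_skills))       # [1, 0, 0]
print(multi_hot(["97", "99"], block_skills))  # [0, 1, 1]
```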

In [139]:
block_dfs = [prepare_fitting(bdf, blk) for (bdf, blk) in zip(block_dfs, blocks)]
block_dfs[1]

Unnamed: 0,order_id,user_id,problem_id,correct,skill_id,skill_name105,skill_name97,skill_name99
0,20660999,79280,75256,0,105,1,0,0
1,20661295,79272,75242,1,105,1,0,0
2,20661440,79272,75218,1,105,1,0,0
3,20663132,79290,75252,0,105,1,0,0
4,20663448,79287,75256,1,105,1,0,0
...,...,...,...,...,...,...,...,...
712,38304282,96239,93321,1,105,1,0,0
713,38304897,96225,75268,1,105,1,0,0
714,38306277,96225,75222,1,105,1,0,0
715,38309006,96282,75238,1,105,1,0,0


## Fitting (all blocks)

We can now fit all blocks with the BlocksTrainer wrapper class.

In [140]:
lr = 0.01
max_batch_size = 8
n_steps = 1000
step_size = 5
gamma = 0.1
delta = 0
omicron = 0
weighted = False

n_splits = 5
verbose = True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [141]:
# Set to True to refit all blocks; otherwise the pickled trainer is loaded from disk.
newfit = False

if newfit:
    blocks_trainer = BlocksTrainer(lr, max_batch_size, n_steps, step_size, gamma, delta, omicron, weighted)
    blocks_trainer.fit(blocks, block_dfs, n_splits, verbose)
    with open('output/torchbkt/blocks_trainer_01.pkl', 'wb') as out:
        pickle.dump(blocks_trainer, out)
else:
    with open('output/torchbkt/blocks_trainer_01.pkl', 'rb') as f:
        blocks_trainer = pickle.load(f)

## Postprocessing

After fitting, we extract the fitted BKT parameters and prepare them for use in the reinforcement learning part.
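
For orientation, the sketch below shows how the four parameter types (l0, transition, slip, guess) are used in textbook single-skill BKT. This is the standard formulation and an assumption about how to read the parameters; the BlockBKT model in torchbkt couples several skills and may differ. The example values are those reported for skill 101 in `block_params.df_`.

```python
def p_correct(p_mastery, slip, guess):
    """Probability of a correct answer given the current mastery estimate."""
    return p_mastery * (1 - slip) + (1 - p_mastery) * guess


def bkt_update(p_mastery, correct, transition, slip, guess):
    """One step of standard BKT: condition on the answer, then apply learning."""
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    return posterior + (1 - posterior) * transition


# Parameters of skill 101; l0 is the initial mastery probability.
l0, transition, slip, guess = 0.347626, 0.365774, 0.265478, 0.450528
after_correct = bkt_update(l0, True, transition, slip, guess)
after_wrong = bkt_update(l0, False, transition, slip, guess)
print(p_correct(l0, slip, guess), after_correct, after_wrong)
```

A correct answer raises the mastery estimate more than an incorrect one, and the transition term adds the chance of learning during the step in both cases.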

In [142]:
# Reuses the newfit flag from the fitting cell above.
if newfit:
    block_params = BlockParams(blocks_trainer.models, blocks)
    with open('output/torchbkt/block_params.pkl', 'wb') as out:
        pickle.dump(block_params, out)
else:
    with open('output/torchbkt/block_params.pkl', 'rb') as f:
        block_params = pickle.load(f)

In [143]:
block_params.df_

Unnamed: 0,skill,block,l0,transition,slip,guess
0,101,0,0.347626,0.365774,0.265478,0.450528
1,105,1,0.739401,0.298232,0.247958,0.463367
2,97,1,0.426309,0.433488,0.247958,0.463367
3,99,1,0.429324,0.647841,0.247958,0.463367
4,110,2,0.521834,0.537199,0.190724,0.751754
...,...,...,...,...,...,...
111,37,81,0.814972,0.296093,0.321292,0.273310
112,48,81,0.933529,0.604706,0.321292,0.273310
113,77,81,0.491410,0.062976,0.321292,0.273310
114,79,81,0.620547,0.062871,0.321292,0.273310


In [144]:
block_params.dict_

{'l0': [0.3476256728172302,
  0.7394005060195923,
  0.42630892992019653,
  0.42932432889938354,
  0.5218336582183838,
  0.6348925828933716,
  0.5346208810806274,
  0.5568701028823853,
  0.3863653540611267,
  0.5130554437637329,
  0.5732694268226624,
  0.5691039562225342,
  0.5551505088806152,
  0.5702091455459595,
  0.3658803403377533,
  0.4903888702392578,
  0.3427667021751404,
  0.6128500699996948,
  0.4810587465763092,
  0.5388386845588684,
  0.3150107264518738,
  0.6116716265678406,
  0.5532068014144897,
  0.3203809857368469,
  0.6864196062088013,
  0.6475683450698853,
  0.5950950384140015,
  0.4881376326084137,
  0.3280208706855774,
  0.6186196208000183,
  0.5426253080368042,
  0.6037954092025757,
  0.4307428002357483,
  0.5247283577919006,
  0.45992785692214966,
  0.5330442786216736,
  0.3991314172744751,
  0.4492711126804352,
  0.4311312735080719,
  0.3933926224708557,
  0.4084225594997406,
  0.2587231695652008,
  0.2993544936180115,
  0.4213736057281494,
  0.6064741611480713,
 