This file is a first exploration of the Assistments dataset. Note that the two files without a year label are from the 2009-2010 school year. Full explanations of the column headings are [here](https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data).

The corrected version of the data can be found [here](https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010).

In [1]:
import pandas as pd
# set low_memory=False to deal with mixed types in a column
skill_2015 = pd.read_csv('2015_100_skill_builders_main_problems.csv', low_memory=False)
non_skill_2010 = pd.read_csv('non_skill_builder_data_new.csv', low_memory=False)
skill_2010 = pd.read_csv('skill_builder_data.csv', low_memory=False)
skill_corrected_2010 = pd.read_csv('skill_builder_data_corrected.csv', low_memory=False)

print('shape of skill_2015: %s' % str(skill_2015.shape))
print('shape of non_skill_2010: %s' % str(non_skill_2010.shape))
print('shape of skill_2010: %s' % str(skill_2010.shape))
print('shape of skill_corrected_2010: %s' % str(skill_corrected_2010.shape))

shape of skill_2015: (708631, 4)
shape of non_skill_2010: (603128, 30)
shape of skill_2010: (525534, 30)
shape of skill_corrected_2010: (401756, 30)


In [2]:
%%bash
ls
echo ''
head -n 5 2015_100_skill_builders_main_problems.csv
echo ''
head -n 5 non_skill_builder_data_new.csv
echo ''
head -n 5 skill_builder_data.csv

0_initial_explore.ipynb
1_preprocessing.ipynb
2015_100_skill_builders_main_problems.csv
assistment_dirty_for_dkt.csv
assistment_for_dkt.csv
assistments_dirty.txt
non_skill_builder_data_new.csv
saving_during_set12
saving_during_set13
saving_during_set14
skill_builder_data.csv
skill_builder_data_corrected.csv
skill_hashed_to_original.pickle
skill_original_to_hashed.pickle
student_hashed_to_original.pickle
student_original_to_hashed.pickle
test_prediction_result_set_10
test_prediction_save_set_11

user_id,log_id,sequence_id,correct
50121,167478035,7014,0
50121,167478043,7014,1
50121,167478053,7014,1
50121,167478069,7014,1

order_id,assignment_id,user_id,assistment_id,problem_id,original,correct,attempt_count,ms_first_response,tutor_mode,answer_type,sequence_id,student_class_id,position,type,base_sequence_id,skill_id,skill_name,teacher_id,school_id,hint_count,hint_total,overlap_time,template_id,answer_id,answer_text,first_action,bottom_hint,opportunity,opportunity_original
20223588,2

Some important points about the data, according to its documentation:

* If answer_type == "open_response", the response is always marked correct.
* problem_set_type describes the problem order: Linear (all problems, predetermined order); Random (all problems, random order); Mastery (random order; the student must answer a set number of questions, 3 by default, correctly in a row to continue).
* skill_id: the skill associated with the problem. In skill builder data, a multi-skill problem produces duplicate records, one per skill; in non-skill builder data, the skills for one record share a single row, separated by commas.
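
The comma-separated layout of the non-skill builder file can be brought into the one-row-per-skill layout of the skill builder file. The following is a minimal sketch on made-up rows; the toy values and the name `long_form` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy rows in the non-skill-builder layout (values are made up):
# several skills for one record share a row, separated by commas.
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "skill_id": ["27,311", "47", None],
    "correct": [1, 0, 1],
})

# Split each string into a list, then emit one row per (record, skill) pair.
# Rows without a skill stay as a single row with a missing skill_id.
long_form = (
    toy.assign(skill_id=toy["skill_id"].str.split(","))
       .explode("skill_id")
       .reset_index(drop=True)
)
print(long_form)
```

After this step the skill ids are still strings, so a numeric conversion would be needed before comparing them with the float-valued skill_id column of the skill builder file.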

The *Deep Knowledge Tracing* paper uses the 2010 skill builder dataset, so we look at that file more closely. The header is:

order_id,assignment_id,user_id,assistment_id,problem_id,original,correct,
attempt_count,ms_first_response,tutor_mode,answer_type,sequence_id,
student_class_id,position,type,base_sequence_id,skill_id,skill_name,
teacher_id,school_id,hint_count,hint_total,overlap_time,template_id,
answer_id,answer_text,first_action,bottom_hint,opportunity,
opportunity_original

In [3]:
student_count = skill_corrected_2010['user_id'].value_counts()
print("%d students" % len(student_count))
print(student_count.describe())
print("student %d did the most problems" % student_count.index[0])

4217 students
count    4217.000000
mean       95.270571
std       192.971604
min         1.000000
25%        10.000000
50%        26.000000
75%        84.000000
max      1606.000000
Name: user_id, dtype: float64
student 78978 did the most problems


In [4]:
skill_count = skill_corrected_2010['skill_id'].value_counts()
print("%d skills" % len(skill_count))
print(skill_count.describe())
print(skill_count.index)

123 skills
count      123.000000
mean      2747.975610
std       3742.686524
min          1.000000
25%        279.000000
50%       1447.000000
75%       3993.000000
max      24253.000000
Name: skill_id, dtype: float64
Float64Index([311.0,  47.0, 277.0, 280.0,  70.0,  79.0,  50.0, 312.0,  17.0,
               77.0,
              ...
              365.0, 343.0, 356.0, 321.0, 340.0, 331.0, 348.0, 334.0,  43.0,
              102.0],
             dtype='float64', length=123)


In [5]:
import numpy as np
problem_count = skill_corrected_2010['problem_id'].value_counts()
print("%d problems" % len(problem_count))
print(problem_count.describe())
print("%d original problems" % np.sum(skill_corrected_2010['original']))  # counts rows with original == 1 (about 82% of all rows), not distinct problems

26688 problems
count    26688.000000
mean        15.053807
std         21.058209
min          1.000000
25%          3.000000
50%          8.000000
75%         18.000000
max        272.000000
Name: problem_id, dtype: float64
328291 original problems


In [6]:
order_count = skill_corrected_2010['order_id'].value_counts()    # basically unique
print("%d orders" % len(order_count))
print(order_count.describe())

346860 orders
count    346860.000000
mean          1.158266
std           0.433837
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           4.000000
Name: order_id, dtype: float64


In [7]:
# duplicated order ids arise because, in Assistments, a question
# associated with multiple skills is stored as multiple records
# that differ only in skill_id
duplicate_order = order_count.index[0]
duplicate = skill_corrected_2010[skill_corrected_2010['order_id'] == duplicate_order]
print(duplicate.iloc[0, :])
print(duplicate.iloc[1, :])
print(duplicate.iloc[2, :])
print(duplicate.iloc[3, :])

order_id                                                 35023691
assignment_id                                              280922
user_id                                                     90996
assistment_id                                               89416
problem_id                                                 166381
original                                                        1
correct                                                         0
attempt_count                                                  13
ms_first_response                                           69011
tutor_mode                                                  tutor
answer_type                                               algebra
sequence_id                                                 10445
student_class_id                                            13583
position                                                       15
type                                               MasterySection
base_seque
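
If one row per answer is preferred, the duplicated records can be collapsed so that each order_id keeps a list of its skills. This is a sketch on made-up rows, assuming that duplicates really differ only in skill_id; `key_cols` and `collapsed` are hypothetical names.

```python
import pandas as pd

# Toy rows in the skill-builder layout (values are made up):
# order 1 is tagged with two skills, so it appears twice.
toy = pd.DataFrame({
    "order_id": [1, 1, 2],
    "user_id":  [7, 7, 7],
    "correct":  [0, 0, 1],
    "skill_id": [27.0, 311.0, 47.0],
})

# Group on the columns that identify one answer and gather the skills.
key_cols = ["order_id", "user_id", "correct"]
collapsed = toy.groupby(key_cols)["skill_id"].agg(list).reset_index()
print(collapsed)
```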

In [8]:
one_student = skill_corrected_2010[skill_corrected_2010['user_id'] == 78978]
print(one_student.shape)
skill_counts = one_student['skill_id'].value_counts()
print("learned %d skills" % len(skill_counts))
print("most practiced skill: %d" % skill_counts.index[0])
one_skill_student = one_student[one_student['skill_id'] == 27]
# sequence ~ assignment; problem ~ order
# problem and order are unique when fixing student and skill
print('\nsequence:')
print(one_skill_student['sequence_id'].value_counts().describe())
print('\nproblem:')
print(one_skill_student['problem_id'].value_counts().describe())
print('\nassignment:')
print(one_skill_student['assignment_id'].value_counts().describe())
print('\norder:')
print(one_skill_student['order_id'].value_counts().describe())
one_skill_student_problem = one_skill_student[one_skill_student['problem_id'] == one_skill_student['problem_id'].value_counts().index[0]]
print(one_skill_student_problem.shape)

(1606, 30)
learned 76 skills
most practiced skill: 27

sequence:
count     7.000000
mean      9.142857
std       3.579040
min       4.000000
25%       8.000000
50%       9.000000
75%       9.500000
max      16.000000
Name: sequence_id, dtype: float64

problem:
count    64.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
Name: problem_id, dtype: float64

assignment:
count     7.000000
mean      9.142857
std       3.579040
min       4.000000
25%       8.000000
50%       9.000000
75%       9.500000
max      16.000000
Name: assignment_id, dtype: float64

order:
count    64.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
Name: order_id, dtype: float64
(1, 30)


In [9]:
# it's all mastery section
set_type_count = skill_corrected_2010['type'].value_counts()
print(set_type_count)

MasterySection    401756
Name: type, dtype: int64


In [10]:
# see how mastery section works
one_skill_student_summary = one_skill_student[['order_id', 'user_id', 'assignment_id', 'sequence_id', 'skill_id', 'correct']]
print(one_skill_student_summary)

       order_id  user_id  assignment_id  sequence_id  skill_id  correct
73133  28876887    78978         271256         6471      27.0        0
73134  28876932    78978         271256         6471      27.0        0
73135  28876968    78978         271256         6471      27.0        0
73136  28876979    78978         271256         6471      27.0        0
73137  28876993    78978         271256         6471      27.0        0
73138  28877011    78978         271256         6471      27.0        1
73139  28877022    78978         271256         6471      27.0        1
73140  28916274    78978         271256         6471      27.0        1
73141  31568941    78978         271248         6408      27.0        1
73142  31568968    78978         271248         6408      27.0        0
73143  31569015    78978         271248         6408      27.0        0
73144  31569036    78978         271248         6408      27.0        0
73145  31569090    78978         271248         6408      27.0  

In [11]:
# assignments and sequence
assignment_count = skill_corrected_2010['assignment_id'].value_counts()
print(assignment_count.describe())
sequence_count = skill_corrected_2010['sequence_id'].value_counts()
print(sequence_count.describe())

count    3521.000000
mean      114.102812
std       190.128480
min         1.000000
25%        12.000000
50%        55.000000
75%       138.000000
max      3346.000000
Name: assignment_id, dtype: float64
count      677.000000
mean       593.435746
std        971.534666
min          1.000000
25%         88.000000
50%        244.000000
75%        704.000000
max      10550.000000
Name: sequence_id, dtype: float64


Note that only sequences 6408 and 6464 are completed to mastery (3 correct answers in a row); most sequences are not.
Several sequences seem to have lowered their mastery threshold to 2 correct answers in a row. Even after mastering a skill, a student may still fail in another assignment on the same skill.
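
One way to check the mastery observation is to compute the longest run of consecutive correct answers per sequence. The sketch below uses made-up rows and assumes that sorting by order_id gives the order in which the answers were submitted; `longest_correct_streak` is a hypothetical helper, not part of the dataset tooling.

```python
import pandas as pd

def longest_correct_streak(correct):
    """Return the longest run of consecutive 1s in a 0/1 sequence."""
    best = current = 0
    for value in correct:
        current = current + 1 if value == 1 else 0
        best = max(best, current)
    return best

# Toy log for one student and one skill (values are made up).
toy = pd.DataFrame({
    "sequence_id": [10, 10, 10, 10, 20, 20, 20],
    "order_id":    [1, 2, 3, 4, 5, 6, 7],
    "correct":     [0, 1, 1, 1, 1, 0, 1],
})

# Assumption: order_id increases with time, so sorting restores the order.
streaks = (
    toy.sort_values("order_id")
       .groupby("sequence_id")["correct"]
       .apply(longest_correct_streak)
)
reached_mastery = streaks >= 3  # default threshold of 3 in a row
print(reached_mastery)
```

Replacing the threshold 3 with 2 would flag the sequences that appear to use the lowered standard.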

Although a sequence does not necessarily map to a single assignment across the whole dataset, it appears to do so within each individual student's records.
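
The relation between sequences and assignments can be checked by counting distinct assignment ids per sequence, first over all rows and then within each student. A minimal sketch on made-up rows, with `overall` and `per_student` as illustrative names:

```python
import pandas as pd

# Toy rows (values are made up): sequence 10 is assigned twice,
# once to each student, under different assignment ids.
toy = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "sequence_id":   [10, 10, 20, 10, 10],
    "assignment_id": [100, 100, 200, 300, 300],
})

# Distinct assignments per sequence over the whole table.
overall = toy.groupby("sequence_id")["assignment_id"].nunique()
# The same count, restricted to each student.
per_student = toy.groupby(["user_id", "sequence_id"])["assignment_id"].nunique()

print(overall.max())
print(per_student.max())
```

A maximum above 1 in `overall` together with a maximum of 1 in `per_student` would match the pattern described above.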

### Here are some data-quality problems discovered during pre-processing:
* The original log is already grouped by skill. It is **NOT** chronological.
* Some records/problems are not associated with a skill. Those entries (around 60,000) should be discarded.
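
Both problems can be handled in a short cleaning step. The sketch below runs on made-up rows and assumes, as the dataset documentation suggests, that order_id increases over time; it is not the preprocessing actually used in this project.

```python
import numpy as np
import pandas as pd

# Toy rows (values are made up): grouped by skill rather than by time,
# with one record that has no skill.
toy = pd.DataFrame({
    "order_id": [30, 10, 40, 20],
    "user_id":  [1, 1, 2, 1],
    "skill_id": [27.0, 311.0, np.nan, 27.0],
    "correct":  [1, 0, 1, 0],
})

cleaned = (
    toy.dropna(subset=["skill_id"])            # discard records without a skill
       .sort_values(["user_id", "order_id"])   # restore per-student time order
       .reset_index(drop=True)
)
print(cleaned)
```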

In [12]:
anomaly_student = skill_2010[skill_2010['user_id'] == 77899]
print(anomaly_student.shape)
skill_counts = anomaly_student['skill_id'].value_counts()
print("learned %d skills" % len(skill_counts))
print("most practiced skill: %d" % skill_counts.index[0])
anomaly_one_skill_student = anomaly_student[anomaly_student['skill_id'] == 279]
print('\nsequence:')
print(anomaly_one_skill_student['sequence_id'].value_counts().describe())
print('\nproblem:')
print(anomaly_one_skill_student['problem_id'].value_counts().describe())
print('\nassignment:')
print(anomaly_one_skill_student['assignment_id'].value_counts().describe())
print('\norder:')
print(anomaly_one_skill_student['order_id'].value_counts().describe())

(8214, 30)
learned 6 skills
most practiced skill: 279

sequence:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: sequence_id, dtype: float64

problem:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: problem_id, dtype: float64

assignment:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: assignment_id, dtype: float64

order:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: order_id, dtype: float64
