# Pre-processing of Assistments

This notebook contains scripts for pre-processing the Assistments data. While writing it, I referred to the published DKT code (though its authors didn't publish their data pre-processing code).

### The Theano version of DKT's dataset suffers from several problems:
* Stale, uncorrected Assistments data.
* Records without a skill label are treated as a new skill.
* Records are not in chronological order.

First, let's map the student and skill ids to consecutive integers, ordered by how often each id appears (descending, so id 0 appears the most).
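
As a minimal sketch of this frequency-based re-indexing, the toy example below uses made-up skill ids (not taken from the dataset) and builds both lookup directions. The names `toy_skill_ids`, `hashed_to_original` and `original_to_hashed` are only illustrative.

```python
import pandas as pd

# Made-up skill ids: 12 appears three times, 7 twice, 40 once.
toy_skill_ids = pd.Series([12, 7, 12, 40, 7, 12])

# value_counts() sorts by descending frequency,
# so the position in its index can serve as the new id.
counts = toy_skill_ids.value_counts()
hashed_to_original = [int(v) for v in counts.index]
original_to_hashed = {orig: new for new, orig in enumerate(hashed_to_original)}

print(hashed_to_original)   # [12, 7, 40]
print(original_to_hashed)   # {12: 0, 7: 1, 40: 2}
```

The toy data avoids ties on purpose: the relative order of ids with equal counts should not be relied upon.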

In [14]:
# libraries
import pandas as pd
import numpy as np
import pickle
import random

In [15]:
# test sorting a dataframe
df = pd.DataFrame([['a', 2], ['b', 1], ['c', 3]])
df.columns = ['name', 'age']
print(df)
df.sort_values(by='age', inplace=True)
print(df)

  name  age
0    a    2
1    b    1
2    c    3
  name  age
1    b    1
0    a    2
2    c    3


In [16]:
assitments = pd.read_csv('skill_builder_data_corrected.csv')
print(assitments.shape)
assitments.dropna(subset=["skill_id"], inplace=True)
#assitments = assitments[np.isfinite(assitments['skill_id'])]   # does the same thing
assitments.dropna(subset=["user_id"], inplace=True)
assitments.sort_values(by='order_id', inplace=True)
print(assitments.shape)
print(assitments.columns)

(401756, 30)
(338001, 30)
Index([u'order_id', u'assignment_id', u'user_id', u'assistment_id',
       u'problem_id', u'original', u'correct', u'attempt_count',
       u'ms_first_response', u'tutor_mode', u'answer_type', u'sequence_id',
       u'student_class_id', u'position', u'type', u'base_sequence_id',
       u'skill_id', u'skill_name', u'teacher_id', u'school_id', u'hint_count',
       u'hint_total', u'overlap_time', u'template_id', u'answer_id',
       u'answer_text', u'first_action', u'bottom_hint', u'opportunity',
       u'opportunity_original'],
      dtype='object')
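
The two shapes printed above show that 401756 - 338001 = 63755 rows (roughly 16%) were removed. A quick way to see which column is responsible is to count the missing values before dropping anything. The sketch below does this on a made-up frame; `toy`, `missing_per_column` and `cleaned` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up records: two of them lack a skill label.
toy = pd.DataFrame({
    'order_id': [3, 1, 2, 4],
    'user_id':  [10, 10, 11, 11],
    'skill_id': [5.0, np.nan, 7.0, np.nan],
})

# Count missing values per column before dropping rows.
missing_per_column = toy[['skill_id', 'user_id']].isnull().sum()
print(missing_per_column)

# Same cleaning steps as the cell above, on the toy frame.
cleaned = toy.dropna(subset=['skill_id', 'user_id']).sort_values(by='order_id')
print(cleaned)
```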


In [17]:
# skill 0 appears the most
skills = assitments['skill_id'].value_counts()
skill_hashed_to_original = skills.index.astype(int)
skill_original_to_hashed = {orig: i for i, orig in enumerate(skill_hashed_to_original)}
# pickles are written in binary mode; the with-blocks close the files
with open('skill_hashed_to_original.pickle', 'wb') as f:
    pickle.dump(skill_hashed_to_original, f)
with open('skill_original_to_hashed.pickle', 'wb') as f:
    pickle.dump(skill_original_to_hashed, f)

# student 0 appears the most
students = assitments['user_id'].value_counts()
print("Now we have %d students" % len(students))
student_hashed_to_original = students.index.astype(int)
student_original_to_hashed = {orig: i for i, orig in enumerate(student_hashed_to_original)}
with open('student_hashed_to_original.pickle', 'wb') as f:
    pickle.dump(student_hashed_to_original, f)
with open('student_original_to_hashed.pickle', 'wb') as f:
    pickle.dump(student_original_to_hashed, f)

Now we have 4163 students


Now we are ready to generate our data file. To train a basic DKT, we only need student_id, skill_id and correct. Other fields, including sequence and assignment ids, are not necessary, though they may be useful in the future.

*If we take elapsed time into consideration (to better simulate forgetting a skill), will DKT perform better?*

Our output file should be grouped by student (i.e., the sequence of each student's actions). Within each student, the actions should follow order_id rather than being grouped by skill.
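
The layout described above can be sketched on a made-up frame: sort once by order_id, then group by student without re-sorting, so each student's rows keep their chronological order. `toy`, `toy_sorted` and `sequences` are illustrative names only.

```python
import pandas as pd

# Made-up interaction log for two students.
toy = pd.DataFrame({
    'order_id': [5, 1, 4, 2, 3],
    'user_id':  [10, 20, 10, 10, 20],
    'skill_id': [1, 2, 1, 3, 2],
    'correct':  [1, 0, 0, 1, 1],
})

# Sort chronologically first; groupby keeps the row order inside each group.
toy_sorted = toy.sort_values(by='order_id')
sequences = {}
for user, group in toy_sorted.groupby('user_id', sort=False):
    sequences[int(user)] = list(zip(group['skill_id'].tolist(),
                                    group['correct'].tolist()))

print(sequences[10])   # [(3, 1), (1, 0), (1, 1)]
print(sequences[20])   # [(2, 0), (2, 1)]
```

Note that student 10's sequence interleaves skills 3 and 1 in the order they were attempted, which is the behaviour we want.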

In [18]:
# output file
student_list = list(student_original_to_hashed.keys())
random.shuffle(student_list)
record_cnt = 0
# the with-block flushes and closes the file once we are done writing
with open('assistment_for_dkt.csv', 'w') as output:
    output.write('student,skill,correct\n')
    for student in student_list:
        # rows keep the order_id ordering applied to `assitments` earlier
        student_sequence = assitments[assitments['user_id'] == student]
        student_id = student_original_to_hashed[student]
        skill_list = student_sequence['skill_id'].values
        correct_list = student_sequence['correct'].values
        for i in range(len(student_sequence)):
            output.write('%d,%d,%d\n' % (student_id,
                                         skill_original_to_hashed[int(skill_list[i])],
                                         int(correct_list[i])))
            record_cnt += 1
print("wrote %d records" % record_cnt)

wrote 338001 records


Now we have a cleaned dataset with the following improvements:
* Derived from the corrected data on the website.
* All records with an unlabeled skill are discarded.
* Sorted by order_id, i.e. chronologically.

Remaining issue:
* Problems related to multiple skills are treated as a sequence of separate problems, each with a different (single) skill.

Actually this is not that big of a drawback, and it is arguably reasonable.
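
To gauge how much of the data is affected, one could count the responses that occur under more than one skill. The sketch below assumes that a multi-skill problem shows up as several rows sharing one order_id with different skill_id values. That assumption should be checked against the real file. The frame and the names `skills_per_response`, `multi_skill_ids` and `share` are made up for illustration.

```python
import pandas as pd

# Made-up log: order_id 101 is tagged with two skills (5 and 9).
toy = pd.DataFrame({
    'order_id': [100, 101, 101, 102],
    'user_id':  [1, 1, 1, 1],
    'skill_id': [5, 5, 9, 9],
    'correct':  [1, 0, 0, 1],
})

# Number of distinct skills attached to each response.
skills_per_response = toy.groupby('order_id')['skill_id'].nunique()
multi_skill_ids = skills_per_response[skills_per_response > 1].index.tolist()

# Fraction of rows that belong to a multi-skill response.
share = float(toy['order_id'].isin(multi_skill_ids).mean())

print(multi_skill_ids)   # [101]
print(share)             # 0.5
```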

In [19]:
# padding: time window
student_counts = assitments['user_id'].value_counts()
print(student_counts.describe())
print(np.sum(student_counts))

count    4163.000000
mean       81.191689
std       162.160104
min         1.000000
25%         9.000000
50%        23.000000
75%        69.000000
max      1295.000000
Name: user_id, dtype: float64
338001
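
The statistics above show very uneven sequence lengths (median 23, maximum 1295), which is why some form of windowing and padding is needed before batching. One possible approach, given here only as a sketch, cuts each student's sequence into fixed-length windows and pads the last one. The helper `split_into_windows`, the window length and the pad value of -1 are all hypothetical choices, not something this notebook or the DKT code prescribes.

```python
def split_into_windows(sequence, window, pad_value=-1):
    """Cut a sequence into chunks of length `window`, padding the last chunk."""
    chunks = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        # Fill the final, shorter chunk up to the window length.
        chunk += [pad_value] * (window - len(chunk))
        chunks.append(chunk)
    return chunks

# A made-up sequence of five correctness flags, window length 3.
windows = split_into_windows([1, 0, 1, 1, 0], window=3)
print(windows)   # [[1, 0, 1], [1, 0, -1]]
```

A padded position must later be masked out of the loss, otherwise the model is trained on the filler value.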


In [3]:
# converting Mozer's dirty data to our format to test
# the with-block closes both files, including the input file
with open("assistments_dirty.txt") as dirty_file, \
        open("assistment_dirty_for_dkt.csv", "w") as dirty_target:
    dirty_target.write("student,skill,correct\n")
    for line in dirty_file:
        fields = line.split()
        dirty_target.write(fields[0] + "," + fields[1] + "," + fields[2] + "\n")