# Analysis of Beta Test Results

We are conducting a beta test of an instrument scanning course for the flight school at Seneca College, in collaboration with the college's Centre for the Development of Open Source Technology (CDOT).

## Load the results

In [35]:
import cPickle as pickle
import numpy as np
results = pickle.load(open('beta_test_results.p','rb'))
print type(results)

<type 'dict'>
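
The cell above is Python 2 (`cPickle` and the `print` statement). Below is a minimal sketch of the same load under Python 3, assuming the pickle file was written by Python 2. `load_results` is a hypothetical helper name, not something from the original notebook. The `encoding='latin1'` argument is the setting the Python documentation calls for when unpickling `datetime` instances that were pickled by Python 2.

```python
import pickle


def load_results(path):
    """Load a Python 2 pickle under Python 3 (hypothetical helper)."""
    # The context manager closes the file even if unpickling fails.
    with open(path, 'rb') as f:
        # encoding='latin1' lets Python 3 decode Python 2 str and datetime data.
        return pickle.load(f, encoding='latin1')
```

Usage would be `results = load_results('beta_test_results.p')`, after which `type(results)` should again be a dict.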


The results are stored in a Python dictionary, a type of key-value store. It has two top-level keys:

In [30]:
results.keys()

['student_ids', 'student_states']

**`student_ids`** is a list of all Student IDs, while **`student_states`** indicates each student's participation history in the course.

## How many participants?

In [31]:
print 'There are {0} participants in this beta test so far.'.format(len(results['student_ids']))

There are 41 participants in this beta test so far.


#### Create an index of student IDs

In [40]:
student_ids = results['student_ids']

In [42]:
type(student_ids)

list

The `student_ids` variable is a list containing every participant's ID code.

#### How is the results data organized?

In [50]:
student_states = results['student_states']

In [52]:
type(student_states)

list

In [53]:
len(student_states)

41

In [54]:
type(student_states[0])

dict

In [56]:
student_states[0].keys()

['avg_attempts',
 'student_id',
 'avg_time_took',
 'problems',
 'chapters',
 'avg_grade']

The other key in the `results` data store is `student_states`, which I've assigned to a variable of the same name. It is a list of Python dicts, one per student. Each student's dict has six keys: `avg_attempts`, `student_id`, `avg_time_took`, `problems`, `chapters`, and `avg_grade`. Most of these values are precomputed by the Django webapp when it reads and presents the MySQL data from the edX platform. This analysis should eventually be encapsulated in a data analysis package, with its results returned to the webapp.

#### Explore the structure of one entry

In [58]:
student_states[0]['student_id']

u'10'

In [60]:
student_states[0]['avg_attempts']

2.0

In [61]:
student_states[0]['avg_time_took']

datetime.timedelta(46, 71392)

In [62]:
student_states[0]['avg_grade']

66

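These first few values are basic data types, except for `avg_time_took`, which is a `datetime.timedelta` object that is easily converted into different time representations. For example, its `seconds` attribute returns the seconds component of the duration (the 46 whole days are not included):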

In [63]:
time = student_states[0]['avg_time_took']

In [64]:
time.seconds

71392
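
Because `seconds` leaves out the days, it understates the duration. A short sketch of the difference, using the same `timedelta(46, 71392)` value shown in the output above; `total_seconds()` is the standard method for the full duration.

```python
from datetime import timedelta

# Same value as student_states[0]['avg_time_took'] in the output above.
avg_time = timedelta(46, 71392)

seconds_part = avg_time.seconds    # seconds component only, days excluded
total = avg_time.total_seconds()   # 46 days plus 71392 seconds, in seconds
days = total / 86400.0             # the same duration expressed in days

print('seconds part: {0}, total seconds: {1}, days: {2:.2f}'.format(seconds_part, total, days))
```

For this student the full average is nearly 47 days, which is worth keeping in mind when judging whether the stored timedeltas really measure time spent on a question.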

Let's consider the other key values.

In [80]:
student_states[0]['problems']

[{'attempts': 2,
  'grade': 100,
  'problem_code': u'0699c429e09a44d5a9a64d3b3a355262',
  'time_took': datetime.timedelta(28, 78174)},
 {'attempts': 2,
  'grade': 0,
  'problem_code': u'536c887e89994c0fbac914545be5994f',
  'time_took': datetime.timedelta(24, 69131)},
 {'attempts': 2,
  'grade': 100,
  'problem_code': u'a4074a917acb447bbbad55b8f85be99d',
  'time_took': datetime.timedelta(86, 66871)}]

In [111]:
from pprint import pprint as pp
pp(student_states[0]['chapters'])

[{'created': datetime.datetime(2014, 12, 17, 17, 27, 21, tzinfo=<UTC>),
  'module_id': u'd3493fd85173401c97ccaea5a568ba13',
  'sequentials': [{'created': datetime.datetime(2014, 12, 17, 17, 27, 21, tzinfo=<UTC>),
                   'module_id': u'b34ab8e5baea471183123a6c8dc3065b',
                   'problems': [{'attempts': 2,
                                 'grade': 100,
                                 'problem_code': u'0699c429e09a44d5a9a64d3b3a355262',
                                 'time_took': datetime.timedelta(28, 78174)},
                                {'attempts': 2,
                                 'grade': 0,
                                 'problem_code': u'536c887e89994c0fbac914545be5994f',
                                 'time_took': datetime.timedelta(24, 69131)},
                                {'attempts': 2,
                                 'grade': 100,
                                 'problem_code': u'a4074a917acb447bbbad55b8f85be99d',
                     

The `problems` and `chapters` keys are more complicated. Each holds a list of Python dicts, which use hashes to identify the modules, snaps, and problems found in the course. The data under the `problems` key appears again under the `chapters` key, so it is redundant for the purposes of this analysis. We will focus on the data contained under `chapters`. Let's simplify things a little and create a dictionary that maps hashes to names:

In [104]:
# Find all modules in edX course ('chapters')
modules = []
for i in range(len(student_states)):
    chaps = student_states[i]['chapters']
    for j in range(len(chaps)):
        modules.append(chaps[j]['module_id'])
        
# Convert to set and back to list to keep only unique entries
modules = list(set(modules))

print modules

[u'd3493fd85173401c97ccaea5a568ba13', u'd60e238bd95b4f04973a7880861cc939', u'9214d00733ef411a856cac381b053e28', u'0e6bce40f0c84517a60b5fe55a082b86']
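
As an aside, the same collection of unique module IDs can be written as a set comprehension that iterates over the items directly instead of over indices. This is only a sketch: the `student_states` list below is made-up sample data that mimics the structure seen above, with shortened placeholder IDs.

```python
# Made-up sample mimicking the structure of student_states (IDs are placeholders).
student_states = [
    {'student_id': u'10',
     'chapters': [{'module_id': u'mod-a', 'sequentials': []}]},
    {'student_id': u'11',
     'chapters': [{'module_id': u'mod-a', 'sequentials': []},
                  {'module_id': u'mod-b', 'sequentials': []}]},
]

# The set comprehension drops duplicates; sorted() gives a stable, repeatable order.
modules = sorted({chap['module_id']
                  for state in student_states
                  for chap in state['chapters']})

print(modules)
```

Sorting is optional, but it keeps the printed order the same from run to run, which a plain `list(set(...))` does not guarantee.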


In [119]:
module_name = {}
module_name['d3493fd85173401c97ccaea5a568ba13'] = 'Module 1'
module_name['9214d00733ef411a856cac381b053e28'] = 'Module_EdwardTest'
module_name['d60e238bd95b4f04973a7880861cc939'] = 'DELETED'
module_name['0e6bce40f0c84517a60b5fe55a082b86'] = 'DELETED'

module_hash = {}
module_hash['Module 1'] = 'd3493fd85173401c97ccaea5a568ba13'

All of these are submodules of the Instrument Scanning Course. The only module we care about is 'Module 1', which is where the main parts of the course are located. 'Module_EdwardTest' was presumably created for testing purposes and never removed. The other two modules appear to have been deleted, because their pages are no longer accessible.

To proceed, we will fully ignore everything except the Module 1 hash.

We also see that every module has a key called `sequentials`. Each sequential is a sub-module (the pre-test, the post-test, or an individual snap course) and is in turn subdivided into the problems taken as part of that sub-module. Let's create dictionaries to translate the hashes.

In [120]:
submodules = []
problems = []
for i in range(len(student_states)):
    chaps = student_states[i]['chapters']
    for j in range(len(chaps)):
        if chaps[j]['module_id'] == module_hash['Module 1']:
            # The sequentials
            seqs = chaps[j]['sequentials']
            for k in range(len(seqs)):
                submodules.append(seqs[k]['module_id'])

submodules = list(set(submodules))
pp(submodules)

[u'd35b4afe52684239852e03c115ade6c1',
 u'2685be71892e4608b7cc2d9402d26a6a',
 u'ed7af9e0761247899a14209f37eeb0d6',
 u'b34ab8e5baea471183123a6c8dc3065b',
 u'a0f96824f75448299218d7a62279e86d',
 u'6da197eeb30948a3b6b428168246a001',
 u'd6936c7004a649fa961189413b455ee9',
 u'232aea2870ef46439ddf6e1e0f15475c']


Now just as we did for the modules, let's create dictionaries for translating from hashes to names and back.

In [121]:
submodule_name = {}

submodule_name['d35b4afe52684239852e03c115ade6c1'] = 'DELETED'
submodule_name['2685be71892e4608b7cc2d9402d26a6a'] = 'Snap 3'
submodule_name['ed7af9e0761247899a14209f37eeb0d6'] = 'Post-Test'
submodule_name['b34ab8e5baea471183123a6c8dc3065b'] = 'Snap 1'
submodule_name['a0f96824f75448299218d7a62279e86d'] = 'Snap 2'
submodule_name['6da197eeb30948a3b6b428168246a001'] = 'DELETED'
submodule_name['d6936c7004a649fa961189413b455ee9'] = 'Pre-Test'
submodule_name['232aea2870ef46439ddf6e1e0f15475c'] = 'DELETED'

submodule_hash = {}
submodule_hash['Pre-Test']  = 'd6936c7004a649fa961189413b455ee9'
submodule_hash['Post-Test'] = 'ed7af9e0761247899a14209f37eeb0d6'
submodule_hash['Snap 1']    = 'b34ab8e5baea471183123a6c8dc3065b'
submodule_hash['Snap 2']    = 'a0f96824f75448299218d7a62279e86d'
submodule_hash['Snap 3']    = '2685be71892e4608b7cc2d9402d26a6a'

Again, we see that during the course creation process, several database entries were made for snaps that were ultimately deleted, yet the entries remain in the database tables. We will want to ignore these deleted entries during the analysis, as they could degrade the performance of our analytics code or simply clutter the data dashboards.
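
The reverse mapping does not have to be typed by hand. A minimal sketch that derives `submodule_hash` from `submodule_name` and drops the `'DELETED'` entries at the same time, using the hashes and names listed above:

```python
submodule_name = {
    'd35b4afe52684239852e03c115ade6c1': 'DELETED',
    '2685be71892e4608b7cc2d9402d26a6a': 'Snap 3',
    'ed7af9e0761247899a14209f37eeb0d6': 'Post-Test',
    'b34ab8e5baea471183123a6c8dc3065b': 'Snap 1',
    'a0f96824f75448299218d7a62279e86d': 'Snap 2',
    '6da197eeb30948a3b6b428168246a001': 'DELETED',
    'd6936c7004a649fa961189413b455ee9': 'Pre-Test',
    '232aea2870ef46439ddf6e1e0f15475c': 'DELETED',
}

# Invert hash -> name into name -> hash, skipping deleted sub-modules.
# 'DELETED' is shared by several hashes, so it could not be inverted anyway.
submodule_hash = {name: h
                  for h, name in submodule_name.items()
                  if name != 'DELETED'}

print(sorted(submodule_hash))
```

Deriving one dictionary from the other keeps the two in sync if a name is later corrected.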

## Who completed the post-test?

In [140]:
for i in range(len(student_states)):
    chaps = student_states[i]['chapters']
    for j in range(len(chaps)):
        if chaps[j]['module_id'] == module_hash['Module 1']:
            # The sequentials
            seqs = chaps[j]['sequentials']
            for k in range(len(seqs)):
                # Did the student attempt the post-test, and try at least 1 problem?
                if seqs[k]['module_id'] == submodule_hash['Post-Test'] and len(seqs[k]['problems']) != 0:
                    print 'Student {0} completed {1}/{2} Post-Test Questions'.format(student_states[i]['student_id'],len(seqs[k]['problems']),4)

Student 18 completed 4/4 Post-Test Questions
Student 15 completed 1/4 Post-Test Questions
Student 34 completed 4/4 Post-Test Questions
Student 39 completed 4/4 Post-Test Questions
Student 42 completed 4/4 Post-Test Questions
Student 44 completed 4/4 Post-Test Questions
Student 45 completed 2/4 Post-Test Questions
Student 46 completed 4/4 Post-Test Questions
Student 48 completed 4/4 Post-Test Questions
Student 49 completed 4/4 Post-Test Questions
Student 51 completed 4/4 Post-Test Questions
Student 53 completed 2/4 Post-Test Questions
Student 54 completed 4/4 Post-Test Questions
Student 56 completed 4/4 Post-Test Questions
Student 57 completed 4/4 Post-Test Questions
Student 60 completed 4/4 Post-Test Questions
Student 62 completed 4/4 Post-Test Questions
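
The same nested loop is needed for every sub-module, so it may be worth factoring out. The sketch below is one possible way to do that. `iter_sequentials` and `problems_attempted` are hypothetical helper names, and `sample_states` is made-up data with placeholder IDs rather than the real hashes.

```python
def iter_sequentials(student_states, module_id):
    """Yield (student_id, sequential) for each sequential in the given module."""
    for state in student_states:
        for chap in state['chapters']:
            if chap['module_id'] != module_id:
                continue  # ignore test and deleted modules
            for seq in chap['sequentials']:
                yield state['student_id'], seq


def problems_attempted(student_states, module_id, submodule_id):
    """Map student_id -> number of problems recorded in one sub-module."""
    counts = {}
    for student_id, seq in iter_sequentials(student_states, module_id):
        # Skip students with no recorded problems, as the loop above does.
        if seq['module_id'] == submodule_id and seq['problems']:
            counts[student_id] = len(seq['problems'])
    return counts


# Made-up sample data; the IDs are placeholders, not real course hashes.
sample_states = [
    {'student_id': u'18', 'chapters': [
        {'module_id': 'module-1', 'sequentials': [
            {'module_id': 'post-test',
             'problems': [{'problem_code': 'p1'}, {'problem_code': 'p2'}]},
            {'module_id': 'pre-test', 'problems': []},
        ]},
    ]},
    {'student_id': u'15', 'chapters': [
        {'module_id': 'other-module', 'sequentials': [
            {'module_id': 'post-test', 'problems': [{'problem_code': 'p1'}]},
        ]},
    ]},
]

counts = problems_attempted(sample_states, 'module-1', 'post-test')
for student_id, n in sorted(counts.items()):
    print('Student {0} completed {1} Post-Test Questions'.format(student_id, n))
```

With the real data, the call would pass `module_hash['Module 1']` and `submodule_hash['Post-Test']` (or `'Pre-Test'`) instead of the placeholder IDs.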


## What about the pre-test?

In [146]:
for i in range(len(student_states)):
    chaps = student_states[i]['chapters']
    for j in range(len(chaps)):
        if chaps[j]['module_id'] == module_hash['Module 1']:
            # The sequentials
            seqs = chaps[j]['sequentials']
            for k in range(len(seqs)):
                # Did the student attempt the pre-test, and try at least 1 problem?
                if seqs[k]['module_id'] == submodule_hash['Pre-Test'] and len(seqs[k]['problems']) != 0:
                    print 'Student {0} completed {1} Pre-Test Questions:'.format(student_states[i]['student_id'],len(seqs[k]['problems']))
                    for ii in range(len(seqs[k]['problems'])):
                        print '   ',seqs[k]['problems'][ii]['problem_code']

Student 18 completed 8 Pre-Test Questions:
    4dfc44e8dd404b59a3bc70f87ea157f7
    5fb5f6305e3c4d0f9bc3182904e9f06f
    745f507e754e4e319d42fc536924cc81
    7b5a1ecf40a4437491b5ad0f48dddb24
    a4f868d5b2634db49b4c94cf00fe4468
    cacd7cbce27647488d1f8aa8e857e354
    f8650db847534f48b78492169b0e0db8
    ffba9cbbf79e4227bfd36a155cb7901f
Student 25 completed 10 Pre-Test Questions:
    0699c429e09a44d5a9a64d3b3a355262
    30acd92d067c4ec1a74e671b6ab40e78
    4c9cf201b9b944dea74be08887110bae
    6bc1dafc91e04a8894aaef39d00bc257
    720d08631e2748a281576a625c991679
    77becc09573d4728be27f09df98ca593
    7f269b6788b847139036ec67767c042c
    7f3cc3bd60e24bd2a8bdfd7a42b4b827
    c18a372dd551460a9ecaf37a8c0192dd
    ed85b008d308469ca20943f9bc0b865e
Student 31 completed 4 Pre-Test Questions:
    5177b036b3454a49855a1063b4c06d7c
    536c887e89994c0fbac914545be5994f
    a4074a917acb447bbbad55b8f85be99d
    cacd7cbce27647488d1f8aa8e857e354
Student 32 completed 8 Pre-Test Questions:
    4dfc44e8d

### Why are there so many individual problem_codes for the pre-test?

Does it have to do with the drag-and-drop elements? Does each element have its own problem code?
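
One way to start investigating is to count how many times each `problem_code` occurs across students within a sub-module. This is only a sketch: `problem_code_counts` is a hypothetical helper and `sample_states` is made-up data with placeholder IDs. It does not answer the drag-and-drop question by itself, but it shows which codes are shared and which are seen only once.

```python
from collections import Counter


def problem_code_counts(student_states, module_id, submodule_id):
    """Count occurrences of each problem_code in one sub-module, across students."""
    counts = Counter()
    for state in student_states:
        for chap in state['chapters']:
            if chap['module_id'] != module_id:
                continue
            for seq in chap['sequentials']:
                if seq['module_id'] == submodule_id:
                    counts.update(p['problem_code'] for p in seq['problems'])
    return counts


# Made-up sample data; the IDs and codes are placeholders.
sample_states = [
    {'student_id': u'18', 'chapters': [{'module_id': 'module-1', 'sequentials': [
        {'module_id': 'pre-test',
         'problems': [{'problem_code': 'aaa'}, {'problem_code': 'bbb'}]}]}]},
    {'student_id': u'25', 'chapters': [{'module_id': 'module-1', 'sequentials': [
        {'module_id': 'pre-test',
         'problems': [{'problem_code': 'bbb'}, {'problem_code': 'ccc'}]}]}]},
]

code_counts = problem_code_counts(sample_states, 'module-1', 'pre-test')
for code, n in code_counts.most_common():
    print('{0}: {1}'.format(code, n))
```

Comparing the number of distinct codes with the number of questions on the pre-test page would then show how far the two diverge.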

## Next steps

- Create hash dictionaries for all the problem IDs: Snap-1-1, Snap-1-2, PreT-1, PosT-1, etc.
- Create a dataframe organized as:

    `student id | PreT-1 Score | PreT-1 Time | PreT-1 Attempts | ... | PosT-4 Score | PosT Final Score |`
    
    `u'10'      | 100          | 250 seconds | 2               | ... | etc.`
    
    
- How are the timedeltas computed? (they seem really long)
- Is there a better way to calculate timedeltas? We want them to reflect the amount of time spent on a slide looking at a question
- Calculate the PosT Final Score as the mean of all the problems on the Post-Test
- Normalize the data in the dataframe
- Divide the dataframe into a vector y containing the PosT Final Score, and a matrix X containing everything else.
- Run X, y through a basic regression test in scikit-learn, e.g. Decision Trees, Random Forests
- Run cross-validation testing
- Make recommendations for better data collection (we have fewer than 50 useable datapoints)
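
The dataframe and PosT Final Score steps above could begin with something like the following sketch, which uses only the standard library. Everything named here is an assumption for illustration: `build_rows` and `post_test_final_score` are hypothetical helpers, the `problem_name` dictionary (problem hash to short name) does not exist yet, and `sample_states` is made-up data with placeholder IDs.

```python
from datetime import timedelta


def build_rows(student_states, module_id, problem_name):
    """Return one flat dict per student with Score, Time and Attempts columns."""
    rows = []
    for state in student_states:
        row = {'student id': state['student_id']}
        for chap in state['chapters']:
            if chap['module_id'] != module_id:
                continue
            for seq in chap['sequentials']:
                for prob in seq['problems']:
                    name = problem_name.get(prob['problem_code'])
                    if name is None:
                        continue  # unknown or deleted problem
                    row[name + ' Score'] = prob['grade']
                    row[name + ' Time'] = prob['time_took'].total_seconds()
                    row[name + ' Attempts'] = prob['attempts']
        rows.append(row)
    return rows


def post_test_final_score(row, prefix='PosT-'):
    """Mean of the post-test problem scores in a row, or None if there are none."""
    scores = [v for k, v in row.items()
              if k.startswith(prefix) and k.endswith(' Score')]
    return sum(scores) / float(len(scores)) if scores else None


# Hypothetical hash -> name mapping and made-up sample data.
problem_name = {'code-a': 'PreT-1', 'code-b': 'PosT-1', 'code-c': 'PosT-2'}
sample_states = [{
    'student_id': u'10',
    'chapters': [{'module_id': 'module-1', 'sequentials': [
        {'module_id': 'pre-test', 'problems': [
            {'problem_code': 'code-a', 'grade': 100, 'attempts': 2,
             'time_took': timedelta(0, 250)}]},
        {'module_id': 'post-test', 'problems': [
            {'problem_code': 'code-b', 'grade': 100, 'attempts': 1,
             'time_took': timedelta(0, 90)},
            {'problem_code': 'code-c', 'grade': 0, 'attempts': 3,
             'time_took': timedelta(0, 120)}]},
    ]}],
}]

rows = build_rows(sample_states, 'module-1', problem_name)
for row in rows:
    row['PosT Final Score'] = post_test_final_score(row)
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame`, which would fill in missing values for problems a student never attempted. Normalization, the X/y split, and the regression itself are left out of this sketch.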
