## Codebook

### LMS Data Codebook

[Code Book](https://docs.google.com/document/d/e/2PACX-1vRvzalQuIUm5m4jHhFz1iPp0fCyeDWDbKgERQMs35CbftDKmgiKJb8WCZX8xhmi5jQqn7IQfpgs_jvV/pub)

### Survey Features

* anon
    * student ID
* course_name_number 
    * unique course identifier, e.g., 'History 21'
* section_num
    * primary section number of rated course, e.g., '004'
* secondary_section_number
    * all secondary section numbers of the rated course, stored as the string form of a list (e.g., "['101', '103']")
        * the string is parsed back into a list with `ast.literal_eval()`.
* n_prereqs
    * number of prerequisites of rated course
* n_satisfied_prereqs_2021_Spring
    * number of prerequisites that were satisfied by courses taken in the Spring 2021 semester
* n_satisfied_prereqs_all_past_semesters
    * number of prerequisites that were satisfied by courses taken in any past semester
* credit_hours
    * number of credit hours of the course
* avg_grade
    * average course GPA in Spring 2021 semester
* grade_std
    * standard deviation of the course GPA in the Spring 2021 semester
* percentage_of_non_letter_grades
    * percentage of non-letter grades among all grades in Spring 2021
* percentage_of_pass_or_satisfactory_among_non_letter_grades
    * percentage of pass or satisfactory grades among the non-letter grades counted in `percentage_of_non_letter_grades`
* major
    * the major of the student, e.g., Business Administration
* avg_gpa
    * average GPA of the student
* avg_major_gpa
    * average major GPA of the student
* tl1
    * survey rating for question 1 (time load)
* tl2
    * survey rating for question 2 (time load)
* tl_manage
    * survey rating for question 3 (time load)
* me
    * survey rating for question 4 (mental effort)
* me_manage
    * survey rating for question 5 (mental effort)
* ps
    * survey rating for question 6 (psychological stress)
* ps_manage
    * survey rating for question 7 (psychological stress)
* tl1_diff
    * `tl1 - tl_manage`
* tl2_diff
    * `tl2 - tl_manage`
* me_diff
    * `me - me_manage`
* ps_diff
    * `ps - ps_manage`
* tl_importance
    * general importance the student attributes to time load when choosing courses (question at the end of the survey)
* me_importance
    * general importance the student attributes to mental effort when choosing courses (question at the end of the survey)
* ps_importance
    * general importance the student attributes to psychological stress when choosing courses (question at the end of the survey)
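
The list-valued and derived columns described above can be reconstructed as follows. This is a minimal sketch on a hand-made two-row frame: the values are illustrative and not taken from the survey. It assumes that `secondary_section_number` holds a valid Python list literal and that the `*_diff` columns are plain differences, as the codebook states.

```python
import ast

import pandas as pd

# Two illustrative rows; the values are made up for this sketch.
survey = pd.DataFrame({
    "secondary_section_number": ["['101', '103']", "[]"],
    "tl1": [4, 2],
    "tl2": [3, 5],
    "tl_manage": [2, 4],
    "me": [5, 1],
    "me_manage": [3, 3],
    "ps": [2, 4],
    "ps_manage": [2, 1],
})

# Turn the stringified list back into a real Python list.
survey["secondary_section_number"] = survey["secondary_section_number"].apply(ast.literal_eval)

# Recompute the difference features from the raw ratings.
pairs = [("tl1", "tl_manage"), ("tl2", "tl_manage"), ("me", "me_manage"), ("ps", "ps_manage")]
for rating, manage in pairs:
    survey[f"{rating}_diff"] = survey[rating] - survey[manage]

print(survey[["secondary_section_number", "tl1_diff", "tl2_diff", "me_diff", "ps_diff"]])
```

`ast.literal_eval()` only accepts Python literals, so it is a safer choice than `eval()` for strings read from a CSV file.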

## Code to Create Data

*Disclaimer: The randomization is not always logically coherent; e.g., a student may have different majors across rows. This does not, however, affect the feature engineering process.*
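
To see how often this kind of incoherence occurs, one option is to count the distinct majors recorded per student. The snippet below is a small sketch on an illustrative frame (the rows are invented for the example); it relies only on the `anon` and `major` columns from the codebook.

```python
import pandas as pd

# Illustrative frame: student 0 appears with two different majors.
survey = pd.DataFrame({
    "anon": [0, 0, 1, 1],
    "major": [
        "Business Administration",
        "L&S Data Science",
        "L&S Data Science",
        "L&S Data Science",
    ],
})

# Number of distinct majors recorded for each student ID.
majors_per_student = survey.groupby("anon")["major"].nunique()

# Student IDs whose rows disagree about the major.
inconsistent = majors_per_student[majors_per_student > 1].index.tolist()
print(inconsistent)
```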

In [1]:
import random 
import string
import pandas as pd
import numpy as np
from datetime import timedelta

## Determine Sample Size

In [2]:
# 100 courses, each with 3 sections = 300 Canvas courses
# each Canvas course with 50 students
# each Canvas course with 10 assignments; 10 submissions and 10 submission comments per assignment
# each Canvas course with 100 discussion entries
# 3 assignment overrides (e.g., deadline changes), on average, per Canvas course
# 250 student survey raters with 5 responses (i.e., course ratings) each

n_courses = 100
n_sections = 3
n_students = 50
n_discussion_entries = 100
n_assignments = 10
n_submissions = 10
n_submission_comments = 10
n_overrides = 3
n_raters = 250
n_responses = 5
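
As a quick arithmetic check, these constants imply the row counts below for each generated table. The sketch repeats the constants so that it runs on its own; the `expected_rows` dictionary is a hypothetical helper, and its formulas mirror the `range(...)` and `n_rows` expressions used in the data-creation cell.

```python
n_courses = 100
n_sections = 3
n_students = 50
n_discussion_entries = 100
n_assignments = 10
n_submissions = 10
n_submission_comments = 10
n_overrides = 3
n_raters = 250
n_responses = 5

n_canvas_courses = n_courses * n_sections

# Expected number of rows per generated CSV file.
expected_rows = {
    "course_section": n_canvas_courses,
    "enrollments": n_canvas_courses * n_students,
    "discussion_entry": n_canvas_courses * n_discussion_entries,
    "assignments": n_canvas_courses * n_assignments,
    "submissions": n_canvas_courses * n_assignments * n_submissions,
    "submission_comments": n_canvas_courses * n_assignments * n_submission_comments,
    "assignments_overrides": n_canvas_courses * n_overrides,
    "survey_data": n_raters * n_responses,
}

for name, rows in expected_rows.items():
    print(f"{name}: {rows:,}")
```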

## Helper Functions

In [3]:
def random_string(length):
    return ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(length))

def sample_k_random_from_list(k: int, items):
    # Draws k items with replacement; items may be a list, an array, or a default-indexed Series.
    return [random.choice(items) for _ in range(k)]

def random_datetime(start=pd.to_datetime('2021-01-18 00:00:00.000'), end=pd.to_datetime('2021-05-13 23:59:59.999')):
    """
    Return a random datetime, formatted as a string, between two datetime
    objects. The default timeframe is the UC Berkeley Spring 2021 semester.
    """
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = random.randrange(int_delta)
    return str(start + timedelta(seconds=random_second))
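
The helpers above draw from unseeded generators, so every run produces different data. If repeatable output is wanted, one option is to pass an explicitly seeded `random.Random` instance. The function `seeded_random_datetime` below is a hypothetical variant written for this sketch, not part of the original notebook; it reuses the same arithmetic as `random_datetime`.

```python
import random
from datetime import timedelta

import pandas as pd

START = pd.to_datetime("2021-01-18 00:00:00.000")
END = pd.to_datetime("2021-05-13 23:59:59.999")


def seeded_random_datetime(rng: random.Random, start=START, end=END) -> str:
    """Hypothetical seeded counterpart of random_datetime (same arithmetic)."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    return str(start + timedelta(seconds=rng.randrange(int_delta)))


# Two generators with the same seed yield the same timestamp string.
first = seeded_random_datetime(random.Random(0))
second = seeded_random_datetime(random.Random(0))
print(first == second)

# The sampled timestamp always falls inside the requested window.
print(START <= pd.to_datetime(first) < END)
```

Note that `random_string` uses `random.SystemRandom()`, which cannot be seeded, so it would need the same treatment to become repeatable.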

## Creating Data

In [4]:
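import os
os.makedirs('./example_data', exist_ok=True)  # to_csv() does not create missing directories

course_section = pd.DataFrame()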
course_section['canvas_course_global_id'] = list(range(n_courses*n_sections))
#course_section['course_subject_name_number'] = sorted([random_string(10) for _ in range(n_courses)]*n_sections)
course_section['course_subject_name_number'] = sample_k_random_from_list(n_courses*n_sections, ['Sociology 1', 'Political Science 103', 'Physics 112'])
course_section['section_num'] = [1,2,3]*n_courses
course_section.to_csv('./example_data/course_section.csv', index=False)

n_canvas_courses = n_courses * n_sections
enrollments = pd.DataFrame()
enrollments['course_id'] = sorted(list(range(n_canvas_courses))*n_students)
enrollments['user_id'] = list(range(n_canvas_courses*n_students))
enrollments['enrollment_role_type'] = sample_k_random_from_list(n_canvas_courses*n_students, ['StudentEnrollment', 'TeacherEnrollment', 'TaEnrollment'])
enrollments['enrollment_state'] = sample_k_random_from_list(n_canvas_courses*n_students, ['active', 'deleted', 'completed'])
enrollments['enrollment_updated_at'] = [random_datetime(pd.to_datetime('2021-03-18 00:00:00.000'), pd.to_datetime('2021-05-13 23:59:59.999')) for _ in range(n_canvas_courses*n_students)]
enrollments.to_csv('./example_data/enrollments.csv', index=False)

discussion_entry = pd.DataFrame()
discussion_entry['discussion_entry_id'] = list(range(n_canvas_courses*n_discussion_entries))
discussion_entry['parent_discussion_entry_id'] = sample_k_random_from_list(n_canvas_courses*n_discussion_entries, discussion_entry['discussion_entry_id'])
discussion_entry['depth'] = sample_k_random_from_list(n_canvas_courses*n_discussion_entries, [1,2,3])
discussion_entry['message_length'] = sample_k_random_from_list(n_canvas_courses*n_discussion_entries, np.linspace(1,10,1000))
discussion_entry['course_id'] = sample_k_random_from_list(n_canvas_courses*n_discussion_entries, enrollments['course_id'])
discussion_entry['user_id'] = sample_k_random_from_list(n_canvas_courses*n_discussion_entries, enrollments['user_id'])
discussion_entry['created_at'] = [random_datetime() for _ in range(n_canvas_courses*n_discussion_entries)]
discussion_entry.to_csv('./example_data/discussion_entry.csv', index=False)

assignments = pd.DataFrame()
assignments['course_id'] = sample_k_random_from_list(n_canvas_courses*n_assignments, course_section['canvas_course_global_id'])
assignments['assignment_id'] = list(range(n_canvas_courses*n_assignments))
assignments['asn_unlock_at'] = [random_datetime(pd.to_datetime('2021-01-18 00:00:00.000'), pd.to_datetime('2021-03-18 00:00:00.000')) for _ in range(n_canvas_courses*n_assignments)]
assignments['asn_due_at'] = [random_datetime(pd.to_datetime('2021-03-18 00:00:00.000'), pd.to_datetime('2021-05-13 23:59:59.999')) for _ in range(n_canvas_courses*n_assignments)]
assignments['grading_type'] = sample_k_random_from_list(n_canvas_courses*n_assignments, ['points', 'pass_fail', 'not_graded', 'percent', 'letter_grade'])
assignments['workflow_state'] = sample_k_random_from_list(n_canvas_courses*n_assignments, ['published', 'deleted', 'unpublished'])
assignments.to_csv('./example_data/assignments.csv', index=False)

n_rows = n_canvas_courses*n_assignments*n_submissions
submissions = pd.DataFrame()
submissions['submission_id'] = list(range(n_rows))
submissions['course_id'] = sorted(list(range(n_canvas_courses))*n_assignments*n_submissions)
submissions['user_id'] = sample_k_random_from_list(n_rows, enrollments['user_id'])
submissions['assignment_id'] = sample_k_random_from_list(n_rows, assignments['assignment_id'])
submissions['submitted_at'] = [random_datetime(pd.to_datetime('2021-01-18 00:00:00.000'), pd.to_datetime('2021-03-18 00:00:00.000')) for _ in range(n_rows)]
submissions.to_csv('./example_data/submissions.csv', index=False)

n_rows = n_canvas_courses*n_assignments*n_submission_comments
submission_comments = pd.DataFrame()
submission_comments['submission_id'] = sample_k_random_from_list(n_rows, submissions['submission_id'])
submission_comments['course_id'] = sample_k_random_from_list(n_rows, enrollments['course_id'])
submission_comments['author_id'] = sample_k_random_from_list(n_rows, enrollments['user_id'])
submission_comments['message_size_bytes'] = sample_k_random_from_list(n_rows, np.linspace(1,10,1000))
submission_comments.to_csv('./example_data/submission_comments.csv', index=False)

n_rows = n_canvas_courses*n_overrides
assignments_overrides = pd.DataFrame()
assignments_overrides['assignment_id'] = sample_k_random_from_list(n_rows, assignments['assignment_id'])
assignments_overrides['updated_at'] = [random_datetime() for _ in range(n_rows)]
assignments_overrides['due_at'] = [random_datetime() for _ in range(n_rows)]
assignments_overrides['unlock_at'] = [random_datetime() for _ in range(n_rows)]
assignments_overrides.to_csv('./example_data/assignments_overrides.csv', index=False)

n_rows = n_raters * n_responses
survey_data = pd.DataFrame()
survey_data['anon'] = sorted(list(range(n_raters))*n_responses)
survey_data['course_name_number'] = sample_k_random_from_list(n_rows, course_section['course_subject_name_number'])
survey_data['section_num'] = sample_k_random_from_list(n_rows, course_section['section_num'])
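# Cast to plain int so the stored string stays parseable by ast.literal_eval() (NumPy 2 prints scalars as np.int64(...))
survey_data['secondary_section_number'] = [str([int(s) for s in sample_k_random_from_list(2, course_section['section_num'])]) for _ in range(n_rows)]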
survey_data['n_prereqs'] = sample_k_random_from_list(n_rows, list(range(6)))
survey_data['n_satisfied_prereqs_2021_Spring'] = sample_k_random_from_list(n_rows, list(range(6)))
survey_data['n_satisfied_prereqs_all_past_semesters'] = sample_k_random_from_list(n_rows, list(range(6)))
survey_data['credit_hours'] = sample_k_random_from_list(n_rows, list(range(1,5)))
survey_data['avg_grade'] = sample_k_random_from_list(n_rows, np.linspace(1,4,100))
survey_data['grade_std'] = sample_k_random_from_list(n_rows, np.linspace(1,2,100))
survey_data['percentage_of_non_letter_grades'] = sample_k_random_from_list(n_rows, np.linspace(0,1,100))
survey_data['percentage_of_pass_or_satisfactory_among_non_letter_grades'] = sample_k_random_from_list(n_rows, np.linspace(0,1,100))
survey_data['tl_importance'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['me_importance'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['ps_importance'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['major'] = sample_k_random_from_list(n_rows, ['Business Administration', 'L&S Data Science'])
survey_data['avg_gpa'] = sample_k_random_from_list(n_rows, np.linspace(1,4,100))
survey_data['avg_major_gpa'] = sample_k_random_from_list(n_rows, np.linspace(1,4,100))
survey_data['tl1'] = sample_k_random_from_list(n_rows, list(range(1,7)))
survey_data['tl2'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['tl_manage'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['me'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['me_manage'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['ps'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['ps_manage'] = sample_k_random_from_list(n_rows, list(range(1,6)))
survey_data['tl1_diff'] = sample_k_random_from_list(n_rows, list(range(-5,6)))
survey_data['tl2_diff'] = sample_k_random_from_list(n_rows, list(range(-4,5)))
survey_data['me_diff'] = sample_k_random_from_list(n_rows, list(range(-4,5)))
survey_data['ps_diff'] = sample_k_random_from_list(n_rows, list(range(-4,5)))

survey_data.to_csv('./example_data/survey_data.csv', index=False)
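
Because every column is sampled independently, foreign keys in the generated tables do not always agree with each other. For example, a submission's `course_id` need not match the `course_id` of its assignment. The sketch below shows one way to surface such rows with a merge. The two small frames are made-up stand-ins for `assignments.csv` and `submissions.csv`, not the generated data.

```python
import pandas as pd

# Tiny stand-ins for the generated tables (values invented for this sketch).
assignments = pd.DataFrame({
    "assignment_id": [0, 1, 2],
    "course_id": [0, 0, 1],
})
submissions = pd.DataFrame({
    "submission_id": [0, 1, 2],
    "assignment_id": [0, 2, 1],
    "course_id": [0, 1, 1],
})

# Join each submission to its assignment and compare the two course IDs.
merged = submissions.merge(
    assignments,
    on="assignment_id",
    suffixes=("_submission", "_assignment"),
)
mismatch = merged[merged["course_id_submission"] != merged["course_id_assignment"]]
print(mismatch["submission_id"].tolist())
```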