# Students data

## Data description

enrollments.csv:

Data about a random subset of Data Analyst Nanodegree students who complete
their first project and a random subset of students who do not.

Columns:
    - account_key:    A unique identifier for the account of the student who
                      enrolled.

    - status:         The enrollment status of the student at the time the data
                      was collected. Possible values are 'canceled' and
                      'current'.

    - join_date:      The date the student enrolled.

    - cancel_date:    The date the student canceled, or blank if the student has
                      not yet canceled.

    - days_to_cancel: The number of days between join_date and cancel_date, or
                      blank if the student has not yet canceled.

    - is_udacity:     True if the account is a Udacity test account, False
                      otherwise.

    - is_canceled:    True if the student had canceled this enrollment at the
                      time the data was collected, False otherwise.
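
As a quick illustration of how `days_to_cancel` relates to `join_date` and `cancel_date`, here is a minimal sketch. The helper name `days_between` is hypothetical and not part of the dataset or the notebook; the dates come from the first enrollment record printed in the "Read data" section.

```python
from datetime import date

def days_between(join_date, cancel_date):
    """Return whole days from join_date to cancel_date, or None if not canceled."""
    if cancel_date is None:
        return None
    return (cancel_date - join_date).days

# Dates from the first enrollment record (account_key 448).
print(days_between(date(2014, 11, 10), date(2015, 1, 14)))  # 65
# A student who has not canceled has no days_to_cancel value.
print(days_between(date(2015, 2, 25), None))  # None
```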

-------------------------------------------------------------------------------

daily_engagement.csv:

Data about engagement within Data Analyst Nanodegree courses for each student in
the enrollment table on each day they were enrolled. Includes a record even if
there was no engagement that day. Includes engagement data from both the
supporting courses for the Nanodegree program, and the corresponding freely
available courses with the same content.

Columns:
    - acct:                  A unique identifier for the account of the student
                             whose engagement data this is.

    - utc_date:              The date for which the data was collected.

    - num_courses_visited:   The total number of Data Analyst Nanodegree courses
                             the student visited for at least 2 minutes on this
                             day. Nanodegree courses and freely available
                             courses with the same content are counted
                             separately.

    - total_minutes_visited: The total number of minutes the student spent
                             taking Data Analyst Nanodegree courses on this day.

    - lessons_completed:     The total number of lessons within Data Analyst
                             Nanodegree courses the student completed on this
                             day.

    - projects_completed:    The total number of Data Analyst Nanodegree
                             projects the student completed on this day.
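
Because the table holds a record even for days without engagement, counting rows is not the same as counting active days. The sketch below shows one way to separate the two. It assumes rows whose values are already converted to numbers and whose account column is named `account_key` (as done in the "Convert data types" section); the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def summarize_engagement(rows):
    """Return {account_key: (active_days, total_minutes)} for converted rows."""
    active_days = defaultdict(int)
    minutes = defaultdict(float)
    for row in rows:
        key = row["account_key"]
        minutes[key] += row["total_minutes_visited"]
        # A day counts as active only if at least one course was visited.
        if row["num_courses_visited"] > 0:
            active_days[key] += 1
    return {key: (active_days[key], minutes[key]) for key in minutes}

# Hypothetical sample rows, not taken from the real file.
rows = [
    {"account_key": 0, "num_courses_visited": 1, "total_minutes_visited": 10.5},
    {"account_key": 0, "num_courses_visited": 0, "total_minutes_visited": 0.0},
    {"account_key": 0, "num_courses_visited": 2, "total_minutes_visited": 4.5},
    {"account_key": 1, "num_courses_visited": 0, "total_minutes_visited": 0.0},
]
summary = summarize_engagement(rows)
print(summary)  # {0: (2, 15.0), 1: (0, 0.0)}
```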

-------------------------------------------------------------------------------

project_submissions.csv:

Data about submissions for Data Analyst Nanodegree projects for each student in
the enrollment table.

Columns:
    - creation_date:    The date the project was submitted.

    - completion_date:  The date the project was evaluated.

    - assigned_rating:  This column has 5 possible values:
                        blank - Project has not yet been evaluated.
                        INCOMPLETE - Project did not meet specifications.
                        PASSED - Project met specifications.
                        DISTINCTION - Project exceeded specifications.
                        UNGRADED - The submission could not be evaluated
                                   (e.g. contained a corrupted file)

    - account_key:      A unique identifier for the account of the student who
                        submitted the project.

    - lesson_key:       A unique identifier for the project that was submitted.

    - processing_state: This column has 2 possible values:
                        CREATED - Project has been submitted but not evaluated.
                        EVALUATED - Project has been evaluated.
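
A small sketch of how the `assigned_rating` values might be tallied. Treating `PASSED` and `DISTINCTION` as "passing" is an assumption based on the descriptions above, and the function names and sample rows are hypothetical.

```python
from collections import Counter

# Assumption: a submission passes if it met or exceeded specifications.
PASSING_RATINGS = {"PASSED", "DISTINCTION"}

def rating_counts(submissions):
    """Count submissions per rating, labeling blank ratings explicitly."""
    return Counter(s["assigned_rating"] or "(blank)" for s in submissions)

def passing_accounts(submissions):
    """Return the set of account keys with at least one passing submission."""
    return {
        s["account_key"]
        for s in submissions
        if s["assigned_rating"] in PASSING_RATINGS
    }

# Hypothetical sample rows, not taken from the real file.
submissions = [
    {"account_key": 1, "assigned_rating": "INCOMPLETE"},
    {"account_key": 1, "assigned_rating": "PASSED"},
    {"account_key": 2, "assigned_rating": ""},
    {"account_key": 3, "assigned_rating": "DISTINCTION"},
]
counts = rating_counts(submissions)
passed = passing_accounts(submissions)
print(counts["(blank)"], sorted(passed))  # 1 [1, 3]
```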

-------------------------------------------------------------------------------

daily_engagement_full.csv:

Similar to daily_engagement.csv, but with engagement further broken down by
course and with more columns available. This file is about 500 megabytes, which
is why the smaller daily_engagement.csv file was created. This dataset is
optional; it is not needed to complete the course.

In addition to the following columns, this table also contains all the same
columns as daily_engagement.csv, except with has_visited instead of
num_courses_visited.

Columns:
    - registration_date:  Date the account was registered.

    - subscription_start: Date paid subscription for the account started.

    - course_key:         Course in which activity is recorded.

    - sibling_key:        Free course with the same content as course_key.
                          If course_key is a free course, course_key and
                          sibling_key are the same.

    - course_title:       Title of the course.

    - has_visited:        1 if the student visited this course for at least 2
                          minutes on this day.
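
Since this file is large, it may be preferable to stream it row by row instead of loading it into a list. The sketch below sums `has_visited` per student and day, which should approximate `num_courses_visited` from the smaller file. It assumes the full file uses the column names `acct`, `utc_date` and `has_visited` and stores `has_visited` as a numeric string; the sample CSV text and its `course_key` values are invented for the demonstration.

```python
import csv
import io
from collections import Counter

def courses_visited_per_day(csv_file):
    """Stream rows and count visited courses per (acct, utc_date)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        key = (row["acct"], row["utc_date"])
        # Blank values are treated as 0; "1.0" style strings become 1.
        counts[key] += int(float(row["has_visited"] or 0))
    return counts

# Hypothetical sample standing in for daily_engagement_full.csv.
sample_csv = io.StringIO(
    "acct,utc_date,course_key,has_visited\n"
    "0,2015-01-09,course_a,1.0\n"
    "0,2015-01-09,course_b,1.0\n"
    "0,2015-01-10,course_a,0.0\n"
)
counts = courses_visited_per_day(sample_csv)
print(counts[("0", "2015-01-09")], counts[("0", "2015-01-10")])  # 2 0
```

With the real file, an open file object (for example `open("daily_engagement_full.csv", newline="")`) can be passed in place of the `StringIO` sample, so only one row is held in memory at a time.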



## Read data

In [78]:
import csv

def open_csv(file_name):
    with open(file_name, "rt") as f:
        reader = csv.DictReader(f)
        return list(reader)

enrollments = open_csv("./enrollments.csv")
daily_engagement = open_csv("./daily_engagement.csv")
project_submissions = open_csv("./project_submissions.csv")

In [79]:
enrollments[0]

OrderedDict([('account_key', '448'),
             ('status', 'canceled'),
             ('join_date', '2014-11-10'),
             ('cancel_date', '2015-01-14'),
             ('days_to_cancel', '65'),
             ('is_udacity', 'True'),
             ('is_canceled', 'True')])

In [80]:
daily_engagement[0]

OrderedDict([('acct', '0'),
             ('utc_date', '2015-01-09'),
             ('num_courses_visited', '1.0'),
             ('total_minutes_visited', '11.6793745'),
             ('lessons_completed', '0.0'),
             ('projects_completed', '0.0')])

In [81]:
project_submissions[0]

OrderedDict([('creation_date', '2015-01-14'),
             ('completion_date', '2015-01-16'),
             ('assigned_rating', 'UNGRADED'),
             ('account_key', '256'),
             ('lesson_key', '3176718735'),
             ('processing_state', 'EVALUATED')])

## Convert data types

In [82]:
from datetime import datetime as dt

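def parse_date(date):
    # Blank strings mean the value is missing (e.g. no cancel_date yet).
    if date == "":
        return None
    else:
        return dt.strptime(date, "%Y-%m-%d")

def parse_int(i):
    # Blank strings mean the value is missing.
    if i == "":
        return None
    else:
        return int(i)

def parse_float(f):
    # Blank strings mean the value is missing.
    if f == "":
        return None
    else:
        return float(f)

def parse_bool(b):
    # The CSV stores booleans as the strings 'True' and 'False'.
    return b == 'True'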

In [83]:
for enrollment in enrollments:
    enrollment["account_key"] = parse_int(enrollment["account_key"])
    enrollment["join_date"] = parse_date(enrollment["join_date"])
    enrollment["cancel_date"] = parse_date(enrollment["cancel_date"])
    enrollment["days_to_cancel"] = parse_int(enrollment["days_to_cancel"])
    enrollment["is_udacity"] = parse_bool(enrollment["is_udacity"])
    enrollment["is_canceled"] = parse_bool(enrollment["is_canceled"])
    
for engagement in daily_engagement:
    # Rename 'acct' to 'account_key' so all three tables share the same key name.
    engagement["account_key"] = parse_int(engagement["acct"])
    del engagement["acct"]
    engagement["utc_date"] = parse_date(engagement["utc_date"])
    # Counts are stored as strings like '1.0', so parse as float first, then truncate to int.
    engagement["num_courses_visited"] = int(parse_float(engagement["num_courses_visited"]))
    engagement["total_minutes_visited"] = parse_float(engagement["total_minutes_visited"])
    engagement["lessons_completed"] = int(parse_float(engagement["lessons_completed"]))
    engagement["projects_completed"] = int(parse_float(engagement["projects_completed"]))
    
for submission in project_submissions:
    submission["creation_date"] = parse_date(submission["creation_date"])
    submission["completion_date"] = parse_date(submission["completion_date"])
    submission["account_key"] = parse_int(submission["account_key"])
    submission["lesson_key"] = parse_int(submission["lesson_key"])
    

In [84]:
enrollments[0]

OrderedDict([('account_key', 448),
             ('status', 'canceled'),
             ('join_date', datetime.datetime(2014, 11, 10, 0, 0)),
             ('cancel_date', datetime.datetime(2015, 1, 14, 0, 0)),
             ('days_to_cancel', 65),
             ('is_udacity', True),
             ('is_canceled', True)])

In [85]:
daily_engagement[0]

OrderedDict([('utc_date', datetime.datetime(2015, 1, 9, 0, 0)),
             ('num_courses_visited', 1),
             ('total_minutes_visited', 11.6793745),
             ('lessons_completed', 0),
             ('projects_completed', 0),
             ('account_key', 0)])

In [86]:
project_submissions[0]

OrderedDict([('creation_date', datetime.datetime(2015, 1, 14, 0, 0)),
             ('completion_date', datetime.datetime(2015, 1, 16, 0, 0)),
             ('assigned_rating', 'UNGRADED'),
             ('account_key', 256),
             ('lesson_key', 3176718735),
             ('processing_state', 'EVALUATED')])
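
The per-column conversions above could also be expressed as a table that maps column names to parsers. This is only an alternative sketch, not the approach the notebook uses: `ENROLLMENT_PARSERS` and `convert_row` are hypothetical names, and the parsers are redefined compactly so the block runs on its own.

```python
from datetime import datetime

def parse_date(text):
    return datetime.strptime(text, "%Y-%m-%d") if text else None

def parse_int(text):
    return int(text) if text else None

def parse_bool(text):
    return text == "True"

# Hypothetical mapping of enrollment columns to their parsers.
ENROLLMENT_PARSERS = {
    "account_key": parse_int,
    "join_date": parse_date,
    "cancel_date": parse_date,
    "days_to_cancel": parse_int,
    "is_udacity": parse_bool,
    "is_canceled": parse_bool,
}

def convert_row(row, parsers):
    """Return a copy of row with each listed column parsed; others unchanged."""
    converted = dict(row)
    for column, parse in parsers.items():
        converted[column] = parse(row[column])
    return converted

# The raw first enrollment record as read from enrollments.csv.
raw = {
    "account_key": "448",
    "status": "canceled",
    "join_date": "2014-11-10",
    "cancel_date": "2015-01-14",
    "days_to_cancel": "65",
    "is_udacity": "True",
    "is_canceled": "True",
}
converted = convert_row(raw, ENROLLMENT_PARSERS)
print(converted["account_key"], converted["days_to_cancel"])  # 448 65
```

Unlike the loops above, this version returns a new dictionary and leaves the original row untouched, which makes it safe to re-run a cell.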

## Question phase

* How long does it take students to submit a project?
* How do students who complete their projects differ from those who don't?
* How much time do students spend taking classes?
* How does time spent relate to lessons and projects completed?
* How does engagement change?
* How many times do students submit a project?

## Data cleanup

In [87]:
# Let's count unique students

def get_unique_students(data):
    # Collect every distinct account_key that appears in the table.
    unique_students = set()
    for item in data:
        unique_students.add(item["account_key"])
    return unique_students

unique_enrollment_students = get_unique_students(enrollments)
unique_engagement_students = get_unique_students(daily_engagement)
unique_submission_students = get_unique_students(project_submissions)

print("Number of unique students in enrollment: ", len(unique_enrollment_students))
print("Number of unique students in engagement: ", len(unique_engagement_students))
print("Number of unique students in submissions: ", len(unique_submission_students))

Number of unique students in enrollment:  1302
Number of unique students in engagement:  1237
Number of unique students in submissions:  743
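
The enrollment table contains more unique students (1302) than the engagement table (1237). Because the unique keys are stored in sets, a set difference is one way to list the accounts behind that gap. The sketch below uses a hypothetical helper name and small illustrative sets rather than the real data.

```python
def accounts_missing_from(reference, other):
    """Return the account keys that are in reference but not in other."""
    return reference - other

# Illustrative sets only; the real ones come from the three tables.
enrolled = {448, 654, 871, 1219}
engaged = {448}

missing = accounts_missing_from(enrolled, engaged)
print(sorted(missing))  # [654, 871, 1219]
print(len(missing))     # 3
```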


### Find all enrollments without any engagements

In [88]:
for enrollment in enrollments:
    if enrollment["account_key"] not in unique_engagement_students:
        print(enrollment)

OrderedDict([('account_key', 1219), ('status', 'canceled'), ('join_date', datetime.datetime(2014, 11, 12, 0, 0)), ('cancel_date', datetime.datetime(2014, 11, 12, 0, 0)), ('days_to_cancel', 0), ('is_udacity', False), ('is_canceled', True)])
OrderedDict([('account_key', 871), ('status', 'canceled'), ('join_date', datetime.datetime(2014, 11, 13, 0, 0)), ('cancel_date', datetime.datetime(2014, 11, 13, 0, 0)), ('days_to_cancel', 0), ('is_udacity', False), ('is_canceled', True)])
OrderedDict([('account_key', 1218), ('status', 'canceled'), ('join_date', datetime.datetime(2014, 11, 15, 0, 0)), ('cancel_date', datetime.datetime(2014, 11, 15, 0, 0)), ('days_to_cancel', 0), ('is_udacity', False), ('is_canceled', True)])
OrderedDict([('account_key', 654), ('status', 'canceled'), ('join_date', datetime.datetime(2014, 12, 4, 0, 0)), ('cancel_date', datetime.datetime(2014, 12, 4, 0, 0)), ('days_to_cancel', 0), ('is_udacity', False), ('is_canceled', True)])
OrderedDict([('account_key', 654), ('status'

### Find problem students

In [89]:
count = 0
for enrollment in enrollments:
    student = enrollment["account_key"]
    if student not in unique_engagement_students and \
        enrollment["join_date"] != enrollment["cancel_date"]:
        count += 1
        print(enrollment)

OrderedDict([('account_key', 1304), ('status', 'canceled'), ('join_date', datetime.datetime(2015, 1, 10, 0, 0)), ('cancel_date', datetime.datetime(2015, 3, 10, 0, 0)), ('days_to_cancel', 59), ('is_udacity', True), ('is_canceled', True)])
OrderedDict([('account_key', 1304), ('status', 'canceled'), ('join_date', datetime.datetime(2015, 3, 10, 0, 0)), ('cancel_date', datetime.datetime(2015, 6, 17, 0, 0)), ('days_to_cancel', 99), ('is_udacity', True), ('is_canceled', True)])
OrderedDict([('account_key', 1101), ('status', 'current'), ('join_date', datetime.datetime(2015, 2, 25, 0, 0)), ('cancel_date', None), ('days_to_cancel', None), ('is_udacity', True), ('is_canceled', False)])


In [90]:
count

3

### These problem students are Udacity test accounts; let's remove them

In [91]:
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment["is_udacity"]:
        udacity_test_accounts.add(enrollment["account_key"])
len(udacity_test_accounts)

6

In [94]:
def remove_udacity_accounts(data):
    data_without_udacity = []
    for item in data:
        if item["account_key"] not in udacity_test_accounts:
            data_without_udacity.append(item)
    return data_without_udacity

In [99]:
non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagements = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

print(len(non_udacity_enrollments))
print(len(non_udacity_engagements))
print(len(non_udacity_submissions))

1622
136240
3642
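
After filtering, it can be worth checking that no test account is left in any table. The following is a minimal sketch of such a sanity check on made-up rows; `find_test_accounts` and `remove_accounts` are hypothetical names that mirror the logic of the cells above using comprehensions.

```python
def find_test_accounts(enrollments):
    """Return account keys flagged as Udacity test accounts."""
    return {e["account_key"] for e in enrollments if e["is_udacity"]}

def remove_accounts(rows, excluded_keys):
    """Return only the rows whose account_key is not excluded."""
    return [row for row in rows if row["account_key"] not in excluded_keys]

# Hypothetical sample rows, not taken from the real files.
enrollments = [
    {"account_key": 1, "is_udacity": False},
    {"account_key": 2, "is_udacity": True},
    {"account_key": 2, "is_udacity": True},
]
engagement = [{"account_key": 1}, {"account_key": 2}, {"account_key": 1}]

test_accounts = find_test_accounts(enrollments)
clean_enrollments = remove_accounts(enrollments, test_accounts)
clean_engagement = remove_accounts(engagement, test_accounts)

# Sanity check: no remaining row belongs to a test account.
assert not any(
    row["account_key"] in test_accounts
    for row in clean_enrollments + clean_engagement
)
print(len(clean_enrollments), len(clean_engagement))  # 1 2
```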
