Before we get started, a couple of reminders to keep in mind when using IPython notebooks:

- A number inside the brackets to the left of a code cell shows that the cell has been run in the current session, and in what order.
- When you start a new notebook session, make sure you run all of the cells up to the point where you last left off. Even if the output is still visible from when you ran the cells in your previous session, the kernel starts in a fresh state, so you'll need to reload the data, etc. in a new session.
- The previous point is useful to keep in mind if your answers do not match what is expected in the lesson's quizzes. Try reloading the data and running all of the processing steps one by one, in order, to make sure that you are working with the same variables and data that are expected at each quiz stage.


## Load Data from CSVs

Import the unicodecsv module.
Load the enrollment, engagement, and submission data, using one 'with' block per file.
Then print the first row of each.
(First method)

In [1]:
# The 'rb' mode is important.
#   'r' means to open the file for reading.
#   'b' means binary mode: the file is read as raw bytes, which unicodecsv then decodes.

# Using DictReader means that each row read will be a dictionary, keyed by the column headers in the data.

# Using 'with' means that everything that accesses the file must be indented beneath it,
# and the file is closed automatically afterward.

# Import the unicodecsv module.
import unicodecsv

# Put file locations into variables.
enrollments_filename = '/Users/bburns/Desktop/Analytics/Data Sets/enrollments.csv'
engagement_filename = '/Users/bburns/Desktop/Analytics/Data Sets/daily_engagement.csv'
submissions_filename = '/Users/bburns/Desktop/Analytics/Data Sets/project_submissions.csv'

# Load the enrollment, engagement, and submission data, one 'with' block per file.
# Print the first row of each once its 'with' block has exited.

with open(enrollments_filename, 'rb') as f:
    reader = unicodecsv.DictReader(f)
    enrollments = list(reader)
print(enrollments[0])
    
with open(engagement_filename, 'rb') as f:
    reader = unicodecsv.DictReader(f)
    daily_engagement = list(reader)
print(daily_engagement[0])
 
with open(submissions_filename, 'rb') as f:
    reader = unicodecsv.DictReader(f)
    project_submissions = list(reader)
print(project_submissions[0])

{'days_to_cancel': '65', 'cancel_date': '2015-01-14', 'is_canceled': 'True', 'account_key': '448', 'status': 'canceled', 'is_udacity': 'True', 'join_date': '2014-11-10'}
{'acct': '0', 'lessons_completed': '0.0', 'num_courses_visited': '1.0', 'total_minutes_visited': '11.6793745', 'utc_date': '2015-01-09', 'projects_completed': '0.0'}
{'processing_state': 'EVALUATED', 'account_key': '256', 'creation_date': '2015-01-14', 'assigned_rating': 'UNGRADED', 'lesson_key': '3176718735', 'completion_date': '2015-01-16'}


Method 2 - Write a function to do the above.

The only things that changed between loading the three files above into lists of dicts were the variable name and the file name.
We'll write a function called "read_csv" that takes the filename as an input and returns the list of rows.


In [None]:
# Import the unicodecsv module.
import unicodecsv

# Define the function.
def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

# Read each file directly into a list of dictionaries.
enrollments = read_csv('/Users/bburns/Desktop/Analytics/Data Sets/enrollments.csv')
# Named daily_engagement so that it matches the variable used in the cells below.
daily_engagement = read_csv('/Users/bburns/Desktop/Analytics/Data Sets/daily_engagement.csv')
project_submissions = read_csv('/Users/bburns/Desktop/Analytics/Data Sets/project_submissions.csv')

print(enrollments[0])
print(daily_engagement[0])
print(project_submissions[0])
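
As a side note, in Python 3 the standard-library csv module reads Unicode text on its own, so the same loader can be sketched without unicodecsv. This is a minimal sketch, not the method used in this notebook: the name read_csv_py3 and the sample file contents are made up for illustration, and it assumes the files are UTF-8 encoded.

```python
import csv
import os
import tempfile

def read_csv_py3(filename):
    """Return the rows of a CSV file as a list of dictionaries (Python 3 only)."""
    # Text mode with newline='' is what the csv module documentation recommends.
    # The utf-8 encoding is an assumption about the data files.
    with open(filename, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))

# Demonstration on a throwaway file with made-up contents.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    tmp.write('account_key,status\n448,canceled\n')
    sample_path = tmp.name

sample_rows = read_csv_py3(sample_path)
os.remove(sample_path)
print(sample_rows[0]['account_key'])
```

Note that this variant opens the file in text mode ('r'), whereas unicodecsv expects binary mode ('rb').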


## Fixing Data Types

CSV files bring everything in as a string, so we'll usually need to convert the data types.
This is done by defining functions (parse_date, parse_maybe_int) and then looping over the data to apply them.
Sometimes an if/else is needed, as with the dates here, so that the function returns None when the field is empty.


In [2]:
from datetime import datetime as dt

# Takes a date as a string, and returns a Python datetime object. 
# If there is no date given, returns None
def parse_date(date):
    if date == '':
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')
    
# Takes a string which is either an empty string or represents an integer,
# and returns an int or None.
def parse_maybe_int(i):
    if i == '':
        return None
    else:
        return int(i)

# Clean up the data types in the enrollments table
# Comparing is_canceled and is_udacity to the string 'True' turns each one from a string into a boolean.
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['join_date'] = parse_date(enrollment['join_date'])
    
enrollments[0]

{'account_key': '448',
 'cancel_date': datetime.datetime(2015, 1, 14, 0, 0),
 'days_to_cancel': 65,
 'is_canceled': True,
 'is_udacity': True,
 'join_date': datetime.datetime(2014, 11, 10, 0, 0),
 'status': 'canceled'}

In [3]:
# Clean up the data types in the engagement table
for engagement_record in daily_engagement:
    #The values in the source have a decimal point and would fail if converted directly from string to int.
    #Therefore, they must be converted to float first, then to int.
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    
daily_engagement[0]

{'acct': '0',
 'lessons_completed': 0,
 'num_courses_visited': 1,
 'projects_completed': 0,
 'total_minutes_visited': 11.6793745,
 'utc_date': datetime.datetime(2015, 1, 9, 0, 0)}

In [4]:
# Clean up the data types in the submissions table
for submission in project_submissions:
    submission['completion_date'] = parse_date(submission['completion_date'])
    submission['creation_date'] = parse_date(submission['creation_date'])

project_submissions[0]

{'account_key': '256',
 'assigned_rating': 'UNGRADED',
 'completion_date': datetime.datetime(2015, 1, 16, 0, 0),
 'creation_date': datetime.datetime(2015, 1, 14, 0, 0),
 'lesson_key': '3176718735',
 'processing_state': 'EVALUATED'}

Note that the cells above change the contents of our data variables in place. If you run them more than once in the same session, an error will occur, because parse_date will receive a value that has already been converted instead of a string.
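
One way to make a conversion cell safe to re-run is to let the parser pass through values that are already converted. The sketch below is an illustration only: parse_date_safe is a hypothetical name, not part of the lesson code, and the sample record is made up.

```python
from datetime import datetime as dt

def parse_date_safe(date):
    """Convert a 'YYYY-MM-DD' string to a datetime, leaving converted values alone."""
    if date is None or isinstance(date, dt):
        return date  # already converted by an earlier run of the cell
    if date == '':
        return None
    return dt.strptime(date, '%Y-%m-%d')

# Made-up record; the loop simulates running the conversion cell twice.
record = {'join_date': '2014-11-10', 'cancel_date': ''}
for _ in range(2):
    record['join_date'] = parse_date_safe(record['join_date'])
    record['cancel_date'] = parse_date_safe(record['cancel_date'])
print(record)
```

The alternative, which needs no extra code, is to reload the data from the CSV files before running the conversion cells again.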

## Investigating the Data

In [6]:
## Find the total number of rows and the number of unique students (account keys)
## in each table.

#Unique values: METHOD ONE
    #create an empty set.  Sets only store unique values.
    #Loop through enrollments and add each account_key from each enrollment to the set.
    #Check len of set.

    
# Enrollments 
enrollment_num_rows = len(enrollments)

unique_enrolled_students = set()
for enrollment in enrollments:
    unique_enrolled_students.add(enrollment['account_key'])

enrollment_num_unique_students = len(unique_enrolled_students)
print(enrollment_num_rows)
print(enrollment_num_unique_students)


# Engagements
engagement_num_rows = len(daily_engagement)

unique_engaged_students = set()
for engagement in daily_engagement:
    unique_engaged_students.add(engagement['acct'])

engagement_num_unique_students = len(unique_engaged_students)
print(engagement_num_rows)
print(engagement_num_unique_students)


# Submissions
submission_num_rows = len(project_submissions)

unique_submissions_students = set()
for submission in project_submissions:
    unique_submissions_students.add(submission['account_key'])

submission_num_unique_students = len(unique_submissions_students)
print(submission_num_rows)
print(submission_num_unique_students)


1640
1302
136240
1237
3642
743


In [None]:
#Unique values: METHOD TWO

    #Here, we build a set of the account_key values from our list of dictionaries (enrollments)
     #using a generator expression, and count it with len(), all in one statement.
'''
enrollment_num_rows = len(enrollments)
enrollment_num_unique_students = len(set(x['account_key'] for x in enrollments))  

print(enrollment_num_rows)
print(enrollment_num_unique_students)

'''
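
Since the same count is needed for all three tables, the one-line version can also be wrapped in a small helper. This is a sketch under assumptions: get_unique_students is a hypothetical name, and the key parameter is there because the engagement table may store the account under a different column name than the other two tables.

```python
def get_unique_students(data, key='account_key'):
    """Return the set of distinct values stored under `key` in a list of dicts."""
    return {row[key] for row in data}

# Made-up rows: two enrollments for one student, one for another.
sample = [{'account_key': '448'}, {'account_key': '448'}, {'account_key': '700'}]
unique_sample = get_unique_students(sample)
print(len(sample), len(unique_sample))
```

For a table keyed by another column name, the call would look like get_unique_students(table, key='acct').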

## Problems in the Data

In [8]:
## Rename the "acct" column in the daily_engagement table to "account_key".

    #Loop over the table and store each record's 'acct' value under a new key named 'account_key'.
    #Then delete the old 'acct' key.
for engagement_record in daily_engagement:
    engagement_record['account_key'] = engagement_record['acct']
    del engagement_record['acct']
    

In [9]:
    #Verify it worked
daily_engagement[0]['account_key']

'0'
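
The rename can also be written with dict.pop, which removes the old key and returns its value in one step. The sketch below uses a made-up record, and the membership check is an addition that makes the loop harmless to run a second time.

```python
# Made-up record standing in for one row of the engagement table.
sample_records = [{'acct': '0', 'lessons_completed': 0}]

for _ in range(2):  # run twice to show that repeating the rename is harmless
    for record in sample_records:
        if 'acct' in record:  # skip records that were already renamed
            record['account_key'] = record.pop('acct')

print(sample_records[0])
```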

## Missing Engagement Records

In [10]:
#####################################
#                 4                 #
#####################################

## Find any one enrollment where the student is missing from the daily engagement table.
## Output that enrollment.


## Create a loop that will...
for enrollment in enrollments:  # Loop over the enrollments table.
    student = enrollment['account_key']  # Get the account key for each enrollment.
    if student not in unique_engaged_students:  # Check whether the student is missing from the set of engaged students we created earlier.
        print(enrollment)  # If the student is missing, print the record.
        break  # We only want to find one, so break out of the loop.

{'days_to_cancel': 0, 'cancel_date': datetime.datetime(2014, 11, 12, 0, 0), 'is_canceled': True, 'account_key': '1219', 'status': 'canceled', 'is_udacity': False, 'join_date': datetime.datetime(2014, 11, 12, 0, 0)}


## Checking for More Problem Records

In [11]:
#####################################
#                 5                 #
#####################################

## Find the number of enrolled students missing from the engagement table that remain, if any.

surprising_records = 0  # Create a variable to count the records we find.

for enrollment in enrollments:  # Loop over the enrollments table.
    student = enrollment['account_key']  # Get the account key for each enrollment.
    days_enrolled = enrollment['days_to_cancel']  # Get the number of days enrolled for each enrollment.
    if student not in unique_engaged_students:  # Check whether the student is missing from the engagement table.
        if days_enrolled != 0:  # Skip students who canceled on the day they joined (0 days enrolled).
            surprising_records += 1  # Increment the counter by one.
            print(surprising_records)  # Print the running count of missing enrolled students.
            print(enrollment)  # Print the record itself (there are only 3).


1
{'days_to_cancel': 59, 'cancel_date': datetime.datetime(2015, 3, 10, 0, 0), 'is_canceled': True, 'account_key': '1304', 'status': 'canceled', 'is_udacity': True, 'join_date': datetime.datetime(2015, 1, 10, 0, 0)}
2
{'days_to_cancel': 99, 'cancel_date': datetime.datetime(2015, 6, 17, 0, 0), 'is_canceled': True, 'account_key': '1304', 'status': 'canceled', 'is_udacity': True, 'join_date': datetime.datetime(2015, 3, 10, 0, 0)}
3
{'days_to_cancel': None, 'cancel_date': None, 'is_canceled': False, 'account_key': '1101', 'status': 'current', 'is_udacity': True, 'join_date': datetime.datetime(2015, 2, 25, 0, 0)}


In [12]:
## This gives us 3 records that we know are test records because is_udacity is True.
## We want to remove all test records from consideration, so we will store their account keys in a set.
## We find there are 6 test accounts in total.

udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])

len(udacity_test_accounts)



6

In [13]:
## Next, define a function to remove Udacity test accounts, since we will be removing them from multiple data sets.
def remove_udacity_accounts(data):
    non_udacity_data = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_data.append(data_point)
    return non_udacity_data

In [14]:
## Apply the function to each data set to remove the test records

non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

# See how many records are left in each set.
print (len(non_udacity_enrollments))
print (len(non_udacity_engagement))
print (len(non_udacity_submissions))


1622
135656
3634


## Tracking Down the Remaining Problems

In [15]:
# Create a set of the account keys for all Udacity test accounts
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)

6

In [16]:
# Given some data with an account_key field, removes any records corresponding to Udacity test accounts
def remove_udacity_accounts(data):
    non_udacity_data = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_data.append(data_point)
    return non_udacity_data

In [17]:
# Remove Udacity test accounts from all three tables
non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

print (len(non_udacity_enrollments))
print (len(non_udacity_engagement))
print (len(non_udacity_submissions))

1622
135656
3634


## Refining the Question

In [33]:
#####################################
#                 6                 #
#####################################

## Create a dictionary named paid_students containing all students who either
## haven't canceled yet or who remained enrolled for more than 7 days. The keys
## should be account keys, and the values should be the date the student enrolled.

# The inner if statement makes sure that if a student enrolled more than once,
# we keep only the most recent enrollment date.

paid_students = {}

for enrollment in non_udacity_enrollments:  # Loop over the enrollments with test accounts removed.
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:  # Keep students who haven't canceled or who stayed more than 7 days.
        account_key = enrollment['account_key']  # Grab the account key when we find one.
        enrollment_date = enrollment['join_date']  # Grab the enrollment date for that record.

        # Store the date if the student is new to the dictionary or this enrollment is more recent.
        if account_key not in paid_students or enrollment_date > paid_students[account_key]:
            paid_students[account_key] = enrollment_date
        
    
# See how many records that gave us
len(paid_students)

995
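
To see how the two conditions interact, here is the same logic wrapped in a function and run on a few made-up enrollments. The function name find_paid_students, the trial_days parameter, and the sample records are all assumptions for illustration.

```python
from datetime import datetime

def find_paid_students(enrollments, trial_days=7):
    """Map account_key -> most recent join_date for students who paid."""
    paid = {}
    for enrollment in enrollments:
        # 'or' short-circuits: days_to_cancel (None for students who have not
        # canceled) is only compared when is_canceled is True.
        if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > trial_days:
            key, joined = enrollment['account_key'], enrollment['join_date']
            if key not in paid or joined > paid[key]:
                paid[key] = joined
    return paid

# Made-up enrollments covering each case.
sample = [
    {'account_key': '1', 'is_canceled': True, 'days_to_cancel': 3,
     'join_date': datetime(2015, 1, 1)},   # canceled during the trial
    {'account_key': '1', 'is_canceled': False, 'days_to_cancel': None,
     'join_date': datetime(2015, 3, 1)},   # re-enrolled, still active
    {'account_key': '2', 'is_canceled': True, 'days_to_cancel': 30,
     'join_date': datetime(2015, 2, 1)},   # canceled after the trial
    {'account_key': '3', 'is_canceled': True, 'days_to_cancel': 0,
     'join_date': datetime(2015, 2, 5)},   # canceled on the first day
]
paid_sample = find_paid_students(sample)
print(sorted(paid_sample))
```

Student '1' appears once, with the later join date, and student '3' is excluded.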

## Getting Data from First Week

In [61]:
# Takes a student's join date and the date of a specific engagement record,
# and returns True if that engagement record happened within one week
# of the student joining.

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7 and time_delta.days >= 0
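
A quick boundary check helps confirm what "within one week" means here. The sketch repeats the function so that it runs on its own; the join date and the offsets are made up.

```python
from datetime import datetime, timedelta

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7

join = datetime(2015, 1, 9)
# Offsets in days relative to the join date: the day before, the join day,
# the last day of the first week, and the first day of the second week.
checks = {offset: within_one_week(join, join + timedelta(days=offset))
          for offset in (-1, 0, 6, 7)}
print(checks)
```

Days 0 through 6 count as the first week. Engagement dated before the join date, which can happen for students who enrolled more than once, is excluded by the lower bound.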

In [62]:
#####################################
#                 7                 #
#####################################

## Create a list of rows from the engagement table including only rows where
## the student is one of the paid students you just found, and the date is within
## one week of the student's join date.


# Write a function to remove students who canceled during the free trial.
def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data
        

In [63]:
# Call function on all of the data sets to remove those canceled students and save results in new variables.

paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagement)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)

# See how many rows remain in each data set.
print (len(paid_enrollments))
print (len(paid_engagement))
print (len(paid_submissions))

1293
134549
3618


In [64]:
# Create an empty list to store records we find.
paid_engagement_in_first_week = []

# Use a for loop to...
for engagement_record in paid_engagement:  # Loop over the paid_engagement table.
    account_key = engagement_record['account_key']  # Get the account_key.
    join_date = paid_students[account_key]  # Get the student's join_date from the dictionary we created earlier.
    engagement_record_date = engagement_record['utc_date']  # Save the date of the engagement record.

    # Apply our within_one_week function to the two dates we just collected.
    if within_one_week(join_date, engagement_record_date):  # If the function returns True...
        paid_engagement_in_first_week.append(engagement_record)  # Add the record to our new list.

# See how many we have.
len(paid_engagement_in_first_week) 

6919

## Exploring Student Engagement
#### Find the average number of minutes students spend in the classroom during their first week.

In [65]:
from collections import defaultdict

# Create a dictionary of engagement grouped by student.
# The keys are account keys, and the values are lists of engagement records.

engagement_by_account = defaultdict(list)  # Create a dictionary whose values are lists.
                                            # defaultdict(list) calls list() to create the value for any key that is missing.
                                            # Looking up a key that isn't there therefore gives an empty list instead of raising a KeyError.
    
for engagement_record in paid_engagement_in_first_week:  # Loop over records in the first week paid engagement list.
    account_key = engagement_record['account_key']  # Get the account key for the engagement_record.
    engagement_by_account[account_key].append(engagement_record) # Look up the list of engagement records for that account_key.
                                                                # Append engagement record to list of engagement records for that account_key

In [66]:
# Create a dictionary with the total minutes each student spent in the classroom during the first week.
# The keys are account keys, and the values are numbers (total minutes).
# This gives us the total minutes for each account_key, rather than a list of the engagements and minutes for each.

total_minutes_by_account = {}  # Create a dictionary to store the total minutes for each student.

for account_key, engagement_for_student in engagement_by_account.items():  # items() gives us each key together with its value (the list of records).
    total_minutes = 0  # Initialize the number of minutes as 0.
    for engagement_record in engagement_for_student:  # Loop over each engagement record in the student's list.
        total_minutes += engagement_record['total_minutes_visited']  # Add up the minutes from each engagement record.
    total_minutes_by_account[account_key] = total_minutes  # Store the total number of minutes for the student.
    

In [67]:
import numpy as np

# Summarize the data about minutes spent in the classroom
total_minutes = list(total_minutes_by_account.values())
print('Mean:', np.mean(total_minutes))
print('Standard deviation:', np.std(total_minutes))
print('Minimum:', np.min(total_minutes))
print('Maximum:', np.max(total_minutes))

Mean: 306.708326753
Standard deviation: 412.996933409
Minimum: 0.0
Maximum: 3564.7332645


## Debugging Data Analysis Code

In [None]:
#####################################
#                 8                 #
#####################################

## Go through a similar process as before to see if there is a problem.
## Locate at least one surprising piece of data, output it, and take a look at it.

## Lessons Completed in First Week

In [None]:
#####################################
#                 9                 #
#####################################

## Adapt the code above to find the mean, standard deviation, minimum, and maximum for
## the number of lessons completed by each student during the first week. Try creating
## one or more functions to re-use the code above.
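
One possible shape for those reusable functions is sketched below. The names group_data, sum_grouped_items, and describe are hypothetical, the sample records are made up, and the standard-library statistics module stands in for NumPy so that the sketch runs on its own. statistics.pstdev is the population standard deviation, which is what np.std computes by default.

```python
from collections import defaultdict
import statistics

def group_data(data, key_name):
    """Group a list of dicts into {key value: [records with that value]}."""
    grouped = defaultdict(list)
    for row in data:
        grouped[row[key_name]].append(row)
    return grouped

def sum_grouped_items(grouped, field_name):
    """Total one numeric field for each group."""
    return {key: sum(row[field_name] for row in rows)
            for key, rows in grouped.items()}

def describe(values):
    """Return the mean, population standard deviation, minimum, and maximum."""
    values = list(values)
    return {'mean': statistics.mean(values),
            'std': statistics.pstdev(values),
            'min': min(values),
            'max': max(values)}

# Made-up first-week engagement records.
sample = [
    {'account_key': 'a', 'lessons_completed': 2},
    {'account_key': 'a', 'lessons_completed': 1},
    {'account_key': 'b', 'lessons_completed': 0},
    {'account_key': 'c', 'lessons_completed': 3},
]
lessons = sum_grouped_items(group_data(sample, 'account_key'), 'lessons_completed')
summary = describe(lessons.values())
print(summary)
```

Passing 'total_minutes_visited' instead of 'lessons_completed' would reproduce the minutes calculation from the cells above.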

## Number of Visits in First Week

In [None]:
######################################
#                 10                 #
######################################

## Find the mean, standard deviation, minimum, and maximum for the number of
## days each student visits the classroom during the first week.
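
A possible approach is sketched below. It assumes that a student counts as having visited on a given day when that day's num_courses_visited is greater than 0; the function name and the sample records are made up.

```python
from collections import defaultdict

def days_visited_by_account(engagement_records):
    """Count, per account, the records where at least one course was visited."""
    days = defaultdict(int)
    for record in engagement_records:
        # Adding 0 still creates the key, so students with no visits appear with a count of 0.
        days[record['account_key']] += 1 if record['num_courses_visited'] > 0 else 0
    return dict(days)

# Made-up first-week records: one per student per day.
sample = [
    {'account_key': 'a', 'num_courses_visited': 2},
    {'account_key': 'a', 'num_courses_visited': 0},
    {'account_key': 'a', 'num_courses_visited': 1},
    {'account_key': 'b', 'num_courses_visited': 0},
]
visits = days_visited_by_account(sample)
print(visits)
```

Summing num_courses_visited directly would count courses rather than days, which is why the sketch adds 1 per qualifying record.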

## Splitting out Passing Students

In [None]:
######################################
#                 11                 #
######################################

## Create two lists of engagement data for paid students in the first week.
## The first list should contain data for students who eventually pass the
## subway project, and the second list should contain data for students
## who do not.

subway_project_lesson_keys = ['746169184', '3176718735']

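passing_engagement = []      # TODO: engagement records for students who pass the subway project
non_passing_engagement = []  # TODO: engagement records for students who do not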
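
One way to approach the split is sketched below. It rests on an assumption that should be checked against the data: that a passing submission has an assigned_rating of 'PASSED' or 'DISTINCTION'. The function name and the sample records are made up.

```python
def split_by_passing(engagement_records, submissions, lesson_keys,
                     passing_ratings=('PASSED', 'DISTINCTION')):
    """Split engagement records by whether the student passed one of the lessons."""
    # passing_ratings is an assumption about how passing grades are labeled.
    passing_accounts = {
        submission['account_key']
        for submission in submissions
        if submission['lesson_key'] in lesson_keys
        and submission['assigned_rating'] in passing_ratings
    }
    passing, non_passing = [], []
    for record in engagement_records:
        if record['account_key'] in passing_accounts:
            passing.append(record)
        else:
            non_passing.append(record)
    return passing, non_passing

# Made-up submissions and engagement records.
sample_submissions = [
    {'account_key': 'a', 'lesson_key': '746169184', 'assigned_rating': 'PASSED'},
    {'account_key': 'b', 'lesson_key': '3176718735', 'assigned_rating': 'INCOMPLETE'},
    {'account_key': 'c', 'lesson_key': '999', 'assigned_rating': 'PASSED'},
]
sample_engagement = [{'account_key': key} for key in ('a', 'a', 'b', 'c', 'd')]

passing_sample, non_passing_sample = split_by_passing(
    sample_engagement, sample_submissions, ['746169184', '3176718735'])
print(len(passing_sample), len(non_passing_sample))
```

Student 'c' lands in the non-passing group because the passed lesson is not one of the subway project keys, and student 'd' because there is no submission at all.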

## Comparing the Two Student Groups

In [None]:
######################################
#                 12                 #
######################################

## Compute some metrics you're interested in and see how they differ for
## students who pass the subway project vs. students who don't. A good
## starting point would be the metrics we looked at earlier (minutes spent
## in the classroom, lessons completed, and days visited).

## Making Histograms

In [None]:
######################################
#                 13                 #
######################################

## Make histograms of the three metrics we looked at earlier for both
## students who passed the subway project and students who didn't. You
## might also want to make histograms of any other metrics you examined.

## Improving Plots and Sharing Findings

In [None]:
######################################
#                 14                 #
######################################

## Make a more polished version of at least one of your visualizations
## from earlier. Try importing the seaborn library to make the visualization
## look better, adding axis labels and a title, and changing one or more
## arguments to the hist() function.