# Actions, Steps, Watching Sessions and Cleaning

Note: this notebook has been updated to handle the new "Scrubbing" action (equivalent to "Jump") introduced in the June 2015 data format. It works on both full-length and ClassX videos.

## Define directories, data files, and actions with formatting.

First, we specify the course name and the name of the video that we want to process.

In [1]:
#coursename = 'CS229'
#videoname = 'CS22901Dec2014'

#coursename = 'CS107_ClassX'
#videoname = 'CS10706Jan2014_Introduction_to_C_Nuances'
#videoname = 'CS10707Feb2014_Load_Effective_Address_and_Move_abuses'
#videoname = 'CS10721Feb2014_Compilation_Tool_Chain'

coursename = 'CS110_ClassX'
#videoname = 'CS11002Apr2014_Copying_in_Unix'
#videoname = 'CS11002Apr2014_Unix_Builtins'
videoname = 'CS11002Apr2014_Full'

Here we import pandas, a package for data analysis, along with other relevant packages.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import datetime
import math
%matplotlib inline

Next, we specify the main directory and create the output directories. The "big" directory contains outputs that require further processing. The "small" directory contains outputs that require little or no processing.

In [3]:
#Main directory contains data (this is my local machine)
#mainDir = 'C:/Users/Tee/Documents/Active/EducationProject/SEOL/VideoAnalytics/main' #(Tee's ICME local machine)
mainDir = 'C:/Users/Admin/Dropbox/Active/EducationProject/SEOL/VideoAnalytics/main' #(Tee's dropbox machine)
#set up output directory
OutputBigDirectory = mainDir + '/output/big/'+ coursename +'/'+ videoname 
OutputSmallDirectory = mainDir + '/output/small/'+ coursename +'/'+ videoname
if not os.path.exists(OutputBigDirectory):
    os.makedirs(OutputBigDirectory)
if not os.path.exists(OutputSmallDirectory):
    os.makedirs(OutputSmallDirectory)
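
The directory layout above can also be built with `os.path.join`, which avoids hand-written separators. The following is a minimal sketch, not the notebook's own code: the helper name `make_output_dirs` is hypothetical, and a temporary directory stands in for the real main directory.

```python
import os
import tempfile

def make_output_dirs(main_dir, coursename, videoname):
    """Create and return the (big, small) output directories for one video."""
    dirs = []
    for size in ('big', 'small'):
        path = os.path.join(main_dir, 'output', size, coursename, videoname)
        if not os.path.exists(path):
            os.makedirs(path)
        dirs.append(path)
    return tuple(dirs)

# A temporary directory stands in for the real main directory (assumption).
demo_main = tempfile.mkdtemp()
big_dir, small_dir = make_output_dirs(demo_main, 'CS110_ClassX', 'CS11002Apr2014_Full')
print(big_dir)
print(small_dir)
```

Calling the helper a second time with the same arguments is harmless, because existing directories are left untouched.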

Read the corresponding raw CSV file. The data folder should be in the main directory.

In [4]:
#We want HashedUser, UserType, VideoActionTypeName to be string for sorting purpose
raw = pd.read_csv(mainDir+'/data/'+ coursename + '/' + videoname +'.csv',
                  dtype={'HashedUser':'object','UserType':'object','VideoActionTypeName':'object'})

In [5]:
raw[:1]

Unnamed: 0,VideoLogId,VideoActionLogId,LogDate,DepartmentName,DepartmentPrefix,CourseNumber,QuarterName,QuarterYear,CourseName,LectureDate,HashedUser,UserType,Duration,ActionTime,StartTime,EndTime,VideoActionTypeName,Speed,Resolution
0,54,281,2014-06-20 19:15:38.563,Computer Science,CS,110,Spring,2014,Principles of Computer Systems,2014-04-02,2096691145,WEBAUTH,2926,2014-06-20 19:15:39.000,-1,-1,Loaded,1,720


The raw data frame has a number of columns. Several of them hold general information about the course and video, which is the same across all rows. We keep this general information in another data frame called 'Essential'.

In [74]:
Essential = pd.DataFrame({'Key':['Course Name'],'Value':[coursename]})
Essential.loc[Essential.shape[0]] = ['Video Name',videoname]
Essential.loc[Essential.shape[0]] = ['Quarter',raw.QuarterName[0] + ' ' + str(raw.QuarterYear[0])]
Essential.loc[Essential.shape[0]] = ['Lecture Date',raw.LectureDate[0]]
Essential.loc[Essential.shape[0]] = ['Duration[sec]',raw.Duration[0]]
Essential.loc[Essential.shape[0]] = ['Duration[min]',int(math.ceil(raw.Duration[0]/60))]

#Identify the maximum second and minute marker
MaxSecondMarker = raw.Duration[0]
MaxMinuteMarker = MaxSecondMarker/60+1

#select only relevant columns
KeepColumns = ['VideoLogId','VideoActionLogId','HashedUser','UserType',\
               'ActionTime','StartTime','EndTime','VideoActionTypeName','Speed','Resolution']
raw = raw[KeepColumns]
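
As a quick check of the marker arithmetic, the sketch below recomputes the minute quantities for the 2926-second duration shown in the raw data. It uses explicit floor division (`//`) and a float divisor so that it behaves the same under Python 2 and Python 3. The lower-case variable names are illustrative and are not part of the notebook.

```python
import math

duration_sec = 2926  # Duration of CS11002Apr2014_Full, taken from the raw data above

# Floor division plus one gives the largest minute marker that can appear.
max_minute_marker = duration_sec // 60 + 1

# A true ceiling needs a float divisor; with integer division the ceil has no effect.
duration_min_ceil = int(math.ceil(duration_sec / 60.0))
duration_min_floor = duration_sec // 60

print(max_minute_marker, duration_min_ceil, duration_min_floor)
```

The `Duration[min]` value of 48 reported in the summary table corresponds to the floor value, which is what Python 2 integer division produces inside `math.ceil`.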

In the raw data frame, each row represents one $\textbf{action}$. There are several kinds of actions:

1) Actions done by users ($\texttt{VideoActionTypeName}$ = 'play,pause,jump,speedchange,resolutionchange').

2) Actions done by the system at the beginning and end of the video ($\texttt{VideoActionTypeName}$ = 'loaded, end') or during the video ($\texttt{VideoActionTypeName}$ = an integer indicating the minute mark that playback passes).

3) Actions derived by the system from previous actions ($\texttt{VideoActionTypeName}$  = 'Watching').  

We remove the third kind of action and rename all integers in $\texttt{VideoActionTypeName}$ to 'MinuteMarker'.

$\texttt{StartTime}$ and $\texttt{EndTime}$ indicate where an action starts and ends in the video timeline. The action 'Jump' has both $\texttt{StartTime}$ and $\texttt{EndTime}$ recorded. For other actions, $\texttt{EndTime}$ is set to -1. For the sake of further processing, we define $\texttt{EndTime}$ for those other actions to be equal to $\texttt{StartTime}$.

The new data format (June 2015) adds an action type called 'Scrubbing'. It is just a jump.

In [75]:
#deselect data with VideoActionTypeName 'Watching'
raw = raw.loc[raw.VideoActionTypeName != 'Watching']

#Convert Scrubbing to Jump
raw.loc[raw.VideoActionTypeName == 'Scrubbing',('VideoActionTypeName')] = 'Jump'

#convert ActionTime (aka. user timestamp) into datetime format
raw.ActionTime = raw.ActionTime.map(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S.%f"))
#convert 1,2,..., MaxMinuteMarker as 'MinuteMarker'
#raw.VideoActionTypeName[raw.VideoActionTypeName.map(lambda x: x in map(str, range(1,MaxMinuteMarker+1)))] = 'MinuteMarker'
raw.loc[raw.VideoActionTypeName.map(lambda x: x in map(str, range(1,MaxMinuteMarker+1))),('VideoActionTypeName')] = 'MinuteMarker'

#add EndTime to single video timestamp entry
raw.loc[raw.VideoActionTypeName != 'Jump',('EndTime')] = raw.loc[raw.VideoActionTypeName != 'Jump',('StartTime')]
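
To make the effect of these normalization steps concrete, here is a minimal sketch on a toy action log. The rows, the `toy` data frame, and the `max_minute_marker` value are made up for illustration, and the sketch assumes a reasonably recent pandas. It uses `isin` for the minute-marker test, which is a swapped-in alternative to the `map`/`lambda` approach used in the cell above.

```python
import pandas as pd

# Toy action log; the values are invented for illustration only.
toy = pd.DataFrame({
    'VideoActionTypeName': ['Loaded', 'Play', '1', 'Scrubbing', 'Watching', 'Pause'],
    'StartTime': [-1, 0, 60, 75, 0, 200],
    'EndTime':   [-1, -1, -1, 200, 75, -1],
})
max_minute_marker = 49  # hypothetical upper bound for minute markers

# 1) Drop system-derived 'Watching' rows.
toy = toy.loc[toy.VideoActionTypeName != 'Watching'].copy()

# 2) Treat 'Scrubbing' as 'Jump'.
toy.loc[toy.VideoActionTypeName == 'Scrubbing', 'VideoActionTypeName'] = 'Jump'

# 3) Rename integer labels 1..max_minute_marker to 'MinuteMarker'.
marker_labels = set(str(m) for m in range(1, max_minute_marker + 1))
is_marker = toy.VideoActionTypeName.isin(marker_labels)
toy.loc[is_marker, 'VideoActionTypeName'] = 'MinuteMarker'

# 4) For every non-jump action, set EndTime equal to StartTime.
not_jump = toy.VideoActionTypeName != 'Jump'
toy.loc[not_jump, 'EndTime'] = toy.loc[not_jump, 'StartTime']

print(toy)
```

After these steps only the 'Jump' row keeps a distinct `EndTime`; every other row has `EndTime` equal to `StartTime`.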

It is useful to sort the data by user and then by $\texttt{ActionTime}$ (the timestamp in the user timeline). We record the number of users in the raw data in the Essential data frame.

In [76]:
#sort data by HashedUser then by ActionTime
raw.sort(['HashedUser', 'ActionTime'],inplace=True)

#create a list of users
HashedUserList = sorted(list(set(raw.HashedUser)))

Essential.loc[Essential.shape[0]] = ['Number of raw users',len(HashedUserList)]
Essential.loc[Essential.shape[0]] = ['Number of raw WEBAUTH users',len(set(raw[raw.UserType=='WEBAUTH'].HashedUser))]
Essential.loc[Essential.shape[0]] = ['Number of raw ONECE users',len(set(raw[raw.UserType=='ONECE'].HashedUser))]

## Define steps

Now that we have a stream of actions sorted by user and then by timestamp in the user timeline, we define a $\textbf{step}$ as the interval between two consecutive actions. Step $i$ lies between actions $i$ and $i+1$ and is stored in the same row as action $i$.

For each step, we can define the step length in the user timeline and in the video timeline. Note that these two measures are not the same. For example, between the actions 'Pause' and 'Play', the step length in the user timeline is likely to be greater than zero, while the step length in the video timeline is zero.

For each step, we can also define a step type. The basic types are 'TransitionToNewVideoLogId', 'Watch', 'Break', and 'StepAfterEndOfVideo'. 'TransitionToNewVideoLogId' indicates a change of video log ID; the log ID changes every time the user logs in to the page again. A 'Watch' step needs to pass the consistent time flow criterion: the step length in the video timeline should equal the step length in the user timeline multiplied by the playback speed. We allow a 30-second difference for possible signal delay.

We also have more sophisticated types. 'Concurrent' indicates that two actions occur at the same user time. 'WatchTooLong' indicates a long period of user inactivity while the video is playing. 'BreakTooLong' indicates a long period of user inactivity during a break.
$\texttt{LongTime} = 1800 \text{ sec} = 30 \text{ min}$ is the threshold for inactivity.

Finally, we have 'Unexplained' steps. These are steps that we cannot categorize with simple heuristics.

In [77]:
raw.reset_index(inplace=True)
raw.drop('index', axis=1, inplace=True)

In [78]:
LongTime = 1800
raw['StepType'] = 'NaN'
raw['UserStepLength'] = 0
raw['VideoStepLength'] = 0

OnOffSignal = 1
WatchingStreakTime = 0
for i in range(0,raw.shape[0]-1):         
    #on-off signal can be triggered by these actions. This signal is used to distinguish watch and pause step type.
    if raw.loc[i,'VideoActionTypeName'] == 'Play':
        OnOffSignal = 1
    if raw.loc[i,'VideoActionTypeName'] == 'Loaded':
        OnOffSignal = 1
    if raw.loc[i,'VideoActionTypeName'] == 'Pause':
        OnOffSignal = 0
    
    #measure step length in user timeline and videotimeline
    raw.loc[i,'UserStepLength'] = (raw.loc[i+1,'ActionTime']-raw.loc[i,'ActionTime']).total_seconds()
    raw.loc[i,'VideoStepLength'] = (raw.loc[i+1,'StartTime']-raw.loc[i,'EndTime'])   
    
    #calculate watching streak to consider too long watch
    if raw.loc[i,'VideoActionTypeName'] == 'MinuteMarker':
        WatchingStreakTime = WatchingStreakTime + raw.loc[i,'UserStepLength']
    else:
        WatchingStreakTime = 0
        
    if raw.loc[i,'UserStepLength'] == 0:
        raw.loc[i,'StepType'] = 'Concurrent'
    elif (raw.loc[i,'VideoLogId'] != raw.loc[i+1,'VideoLogId'])|(raw.loc[i,'HashedUser'] != raw.loc[i+1,'HashedUser']):
        raw.loc[i,'StepType'] = 'TransitionToNewVideoLogId'
        OnOffSignal = 1
    elif raw.loc[i,'VideoActionTypeName'] == 'End':
        raw.loc[i,'StepType'] = 'StepAfterEndOfVideo'
    elif OnOffSignal == 0:
        raw.loc[i,'StepType'] = 'Break'
    elif WatchingStreakTime > LongTime:
        raw.loc[i,'StepType'] = 'WatchTooLong'
        WatchingStreakTime = 0
    else:
        if((raw.loc[i,'VideoStepLength'] >= 0) &
         (abs(raw.loc[i,'Speed']*raw.loc[i,'UserStepLength'] - raw.loc[i,'VideoStepLength']) <= 30)): 
            raw.loc[i,'StepType'] ='Watch'            
        else:
            raw.loc[i,'StepType'] ='Unexplained'

#define step type for last entry
raw.loc[raw.shape[0]-1,'StepType'] = 'TransitionToNewVideoLogId'

#define step type for break too long
raw.loc[(raw.StepType == 'Break') & (raw.UserStepLength > LongTime),'StepType'] = 'BreakTooLong'
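
The consistent time flow criterion used for 'Watch' steps can be isolated as a small function. This is a sketch for illustration only: the function name `is_consistent_watch` and its parameter names are hypothetical, and the 30-second tolerance is the one stated in the text.

```python
def is_consistent_watch(user_step_sec, video_step_sec, speed, tolerance_sec=30):
    """Return True if a step passes the consistent time flow criterion."""
    if video_step_sec < 0:
        # The video position moved backwards, so this cannot be plain watching.
        return False
    expected_video_sec = speed * user_step_sec
    return abs(expected_video_sec - video_step_sec) <= tolerance_sec

# 60 s of user time at 1.5x speed should advance the video by about 90 s.
print(is_consistent_watch(60, 90, 1.5))   # True
# 60 s of user time cannot explain a 300 s advance at normal speed.
print(is_consistent_watch(60, 300, 1.0))  # False
# A negative video step length is never a 'Watch' step.
print(is_consistent_watch(60, -20, 1.0))  # False
```

In the loop above, a step that fails this test falls through to 'Unexplained'.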

Here we collect some statistics about steps.

In [79]:
Essential.loc[Essential.shape[0]] = ['Number of raw actions', raw.shape[0]]
Essential.loc[Essential.shape[0]] = ['Number of watch too long instances', sum(raw.StepType=='WatchTooLong')]
Essential.loc[Essential.shape[0]] = ['Number of break too long instances', sum(raw.StepType=='BreakTooLong')]          
[raw.shape[0],sum(raw.StepType=='Unexplained'),sum(raw.StepType=='Concurrent'),sum(raw.StepType=='WatchTooLong'),
 sum(raw.StepType=='BreakTooLong')]

[216, 57, 0, 1, 0]

Now we would like to explain the 'Unexplained' steps. We know that odd behavior can appear around the action 'Jump'. This is likely due to scrubbing, where the timestamp of an action may not be registered properly. So we define another step type, 'CloseToJump', and assign it to an unexplained step if the action at the head of the step occurs within 30 seconds of a 'Jump' action by the same user in the same video log.

In [80]:
UnexplainedIndex = [i for i, elem in enumerate(raw.StepType=='Unexplained') if elem]
JumpSubset = raw[raw.VideoActionTypeName=='Jump']
for i in UnexplainedIndex:
    TimeMark = raw.loc[i,'ActionTime']
    HashedUserMark = raw.loc[i,'HashedUser']
    VideoLogIdMark = raw.loc[i,'VideoLogId']
    TimeList = JumpSubset.ActionTime[(JumpSubset.HashedUser == HashedUserMark) & (JumpSubset.VideoLogId == VideoLogIdMark)]
    for j in TimeList:
        if abs((j - TimeMark).total_seconds())<30:
            raw.loc[i,'StepType'] = 'CloseToJump'
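
The proximity test at the heart of this cell can be sketched on its own with plain `datetime` objects. The helper name `is_close_to_jump` and the sample timestamps are assumptions made for illustration; the strict 30-second window mirrors the comparison in the cell above.

```python
import datetime

def is_close_to_jump(action_time, jump_times, window_sec=30):
    """True if action_time is strictly within window_sec of any jump time."""
    return any(abs((t - action_time).total_seconds()) < window_sec
               for t in jump_times)

# Invented timestamps: one jump, one action 12 s later, one action 5 min later.
jumps = [datetime.datetime(2014, 6, 20, 19, 20, 0)]
near = datetime.datetime(2014, 6, 20, 19, 20, 12)
far = datetime.datetime(2014, 6, 20, 19, 25, 0)

print(is_close_to_jump(near, jumps))  # True
print(is_close_to_jump(far, jumps))   # False
```

In the notebook, `jump_times` corresponds to the jump timestamps restricted to the same user and the same video log ID.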

Here we collect some statistics about 'CloseToJump' and truly 'Unexplained' steps.

In [81]:
Essential.loc[Essential.shape[0]] = ['Number of close to jump steps', sum(raw.StepType=='CloseToJump')]
Essential.loc[Essential.shape[0]] = ['Number of unexplained steps', sum(raw.StepType=='Unexplained')]
[raw.shape[0],sum(raw.StepType=='Unexplained'),sum(raw.StepType=='CloseToJump')]

[216, 57, 0]

## Define cuts for watching sessions and assign session numbers

Following Guo, Kim and Rubin, 2014 (the "6-Minute rule" paper), we define cuts according to three heuristics: end of video, end of video log ID, and 30 minutes of inactivity ('WatchTooLong' and 'BreakTooLong'). These cuts are made right at the steps. Note that a cut on 'WatchTooLong' drops one minute's worth of watching. We can check later whether this heuristic affects our results.

In [82]:
CutKeywords = ['TransitionToNewVideoLogId', 'StepAfterEndOfVideo', 'BreakTooLong', 'WatchTooLong']

Now we assign watching session numbers to all timestamps: a generic session number and a session number by user. The first watching session overall is labelled 0. The session number by user restarts for each user: each user's first watching session is labelled 0, the next one 1, and so on.

In [83]:
raw['SessionNumber'] = 0
raw['SessionNumberByUser'] = 0
CutPoints = [i for i, elem in enumerate(raw.StepType.map(lambda x: x in CutKeywords)) if elem]
NumSessions = len(CutPoints)
j = 0
for i in range(1,NumSessions):
    raw.loc[(CutPoints[i-1]+1):(CutPoints[i]+1),'SessionNumber'] = i
    if raw.loc[CutPoints[i-1]+1,'HashedUser'] == raw.loc[CutPoints[i-1],'HashedUser']:
        j = j+1
    else:
        j = 0
    raw.loc[(CutPoints[i-1]+1):(CutPoints[i]+1),'SessionNumberByUser'] = j
#Don't need end point modification since last entry is always a cut point.
#raw.SessionNumber.iloc[raw.shape[0]-1] = i
#raw.SessionNumberByUser.iloc[raw.shape[0]-1] = j
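
The numbering rule can be restated in a single pass over the rows: a row opens a new session whenever the previous row's step type is a cut. The sketch below works on plain lists and is meant only to illustrate that rule; the function name `assign_sessions` and the toy inputs are hypothetical, and rows are assumed to be sorted by user and then by time, as in the notebook.

```python
def assign_sessions(step_types, users, cut_keywords):
    """Return (session_numbers, session_numbers_by_user) for sorted action rows."""
    session_numbers, by_user = [], []
    session, user_session = 0, 0
    for i in range(len(step_types)):
        if i > 0 and step_types[i - 1] in cut_keywords:
            # The previous row closed a session, so this row opens a new one.
            session += 1
            # Restart the per-user counter when the user changes.
            user_session = user_session + 1 if users[i] == users[i - 1] else 0
        session_numbers.append(session)
        by_user.append(user_session)
    return session_numbers, by_user

cuts = ['TransitionToNewVideoLogId', 'StepAfterEndOfVideo', 'BreakTooLong', 'WatchTooLong']
toy_steps = ['Watch', 'BreakTooLong', 'Watch', 'TransitionToNewVideoLogId',
             'Watch', 'TransitionToNewVideoLogId']
toy_users = ['a', 'a', 'a', 'a', 'b', 'b']

sessions, sessions_by_user = assign_sessions(toy_steps, toy_users, cuts)
print(sessions)          # [0, 0, 1, 1, 2, 2]
print(sessions_by_user)  # [0, 0, 1, 1, 0, 0]
```

Note that the row carrying the cut itself still belongs to the session it closes.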

Here we collect some statistics about raw watching sessions.

In [84]:
Essential.loc[Essential.shape[0]] = ['Number of raw watching sessions', NumSessions]
[len(HashedUserList),NumSessions]

[8, 16]

## Watching sessions summary and clean data selection 

Here we summarize some statistics for each watching session.

In [85]:
SessionSummary = pd.DataFrame({'SessionNumber': range(0,NumSessions)})
SessionSummary['SessionNumberByUser'] = [0] + list(raw.loc[CutPoints[1:(len(CutPoints)+1)],'SessionNumberByUser'])
SessionSummary['HashedUser'] = 'NaN'
SessionSummary['UserType'] = 'NaN'
SessionSummary['CutType'] = 'NaN'
SessionSummary['HasConcurrent'] = False
SessionSummary['HasCloseToJump'] = False
SessionSummary['HasUnexplained'] = False
SessionSummary['StartTimeStamp'] = 'NaN'
SessionSummary['EndTimeStamp'] = 'NaN'
SessionSummary['TimestampDiff'] = 0
SessionSummary['TotalUserWatchTime'] = 0
SessionSummary['TotalVideoWatchTime'] = 0
SessionSummary['SessionStartVideoTime'] = 0
SessionSummary['SessionEndVideoTime'] = 0
SessionSummary['NumJumps'] = 0
SessionSummary['NumPauses'] = 0

for i in range(0,NumSessions):
    #take an independent, re-indexed copy of this session's rows (avoids SettingWithCopyWarning)
    d = raw.loc[raw.SessionNumber==i].reset_index(drop=True)
    SessionSummary.loc[i,'HashedUser'] = d.loc[0,'HashedUser']
    
    SessionSummary.loc[i,'UserType'] = d.loc[0,'UserType']
    SessionSummary.loc[i,'CutType'] = d.loc[d.shape[0]-1,'StepType']
    SessionSummary.loc[i,'HasConcurrent'] = 'Concurrent' in list(d.StepType)
    SessionSummary.loc[i,'HasCloseToJump'] = 'CloseToJump' in list(d.StepType)
    SessionSummary.loc[i,'HasUnexplained'] = 'Unexplained' in list(d.StepType)
    SessionSummary.loc[i,'StartTimeStamp'] = d.loc[0,'ActionTime']
    SessionSummary.loc[i,'EndTimeStamp'] = d.loc[d.shape[0]-1,'ActionTime']
    SessionSummary.loc[i,'TimestampDiff'] = (SessionSummary.loc[i,'EndTimeStamp']- SessionSummary.loc[i,'StartTimeStamp']).total_seconds()
    SessionSummary.loc[i,'TotalUserWatchTime'] = sum(d[d.StepType=='Watch'].UserStepLength)
    SessionSummary.loc[i,'TotalVideoWatchTime'] = sum(d[d.StepType=='Watch'].VideoStepLength)
    SessionSummary.loc[i,'SessionStartVideoTime'] = d.loc[0,'StartTime']
    SessionSummary.loc[i,'SessionEndVideoTime'] = d.loc[d.shape[0]-1,'EndTime']
    SessionSummary.loc[i,'NumJumps'] = sum(d.VideoActionTypeName=='Jump')
    SessionSummary.loc[i,'NumPauses'] = sum(d.VideoActionTypeName=='Pause')


Observe that there are several accidental "watching sessions": following Guo, Kim and Rubin (2014), these are sessions whose timestamp difference is less than 5 seconds.

In [86]:
Essential.loc[Essential.shape[0]] = ['Number of accidental watching sessions', sum(SessionSummary.TimestampDiff<5)]
[float(sum(SessionSummary.TimestampDiff>=5)),  SessionSummary.shape[0]]#/SessionSummary.shape[0]

[8.0, 16]

We would like to throw those sessions away and re-enumerate session number by user accordingly.

In [87]:
#5 seconds throw away
raw = raw[raw.SessionNumber.map(lambda x: x in list(SessionSummary.SessionNumber[SessionSummary.TimestampDiff>=5]))]
SessionSummary = SessionSummary[SessionSummary.TimestampDiff>=5]
#reindex

In [88]:
raw.reset_index(inplace=True)
raw.drop('index', axis=1, inplace=True)
SessionSummary.reset_index(inplace=True)
SessionSummary.drop('index', axis=1, inplace=True)
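
The 5-second filter and the re-indexing can be tried out on toy data. The sketch below is an illustration under assumptions: the `summary` and `actions` data frames and their values are invented, and `isin` with `reset_index(drop=True)` is used as a compact alternative to the `map`/`lambda` filter and the two-step re-indexing above.

```python
import pandas as pd

# Invented session summary: sessions 1 and 2 last less than 5 seconds.
summary = pd.DataFrame({'SessionNumber': [0, 1, 2, 3],
                        'TimestampDiff': [120.0, 2.0, 0.0, 45.0]})
# Invented action log referring to those sessions.
actions = pd.DataFrame({'SessionNumber': [0, 0, 1, 2, 3, 3],
                        'VideoActionTypeName': ['Loaded', 'Play', 'Loaded',
                                                'Loaded', 'Loaded', 'Pause']})

min_session_sec = 5  # threshold from the text above
kept = summary.loc[summary.TimestampDiff >= min_session_sec, 'SessionNumber']

# Keep only actions of surviving sessions and re-index both frames from 0.
actions = actions[actions.SessionNumber.isin(kept)].reset_index(drop=True)
summary = summary[summary.TimestampDiff >= min_session_sec].reset_index(drop=True)

print(actions)
print(summary)
```

The surviving sessions still carry their old numbers (0 and 3 here), which is why a re-enumeration step is needed afterwards.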

In [89]:
#re-enumerate session number by user in raw data frame
previous = raw.loc[0,'SessionNumber']
raw.loc[0,'SessionNumber'] = 0
j = 0
k = 0
for i in range(1,raw.shape[0]):
    if raw.loc[i,'SessionNumber'] != previous:
        previous = raw.loc[i,'SessionNumber']
        j = j+1
        if raw.loc[i,'HashedUser'] == raw.loc[i-1,'HashedUser']:
            k = k+1
        else:
            k = 0
    raw.loc[i,'SessionNumber'] = j
    raw.loc[i,'SessionNumberByUser'] = k

In [90]:
#re-enumerate session number by user in session summary data frame
previous = SessionSummary.loc[0,'SessionNumber']
SessionSummary.loc[0,'SessionNumber'] = 0
j = 0
k = 0
for i in range(1,SessionSummary.shape[0]):
    if SessionSummary.loc[i,'SessionNumber'] != previous:
        previous = SessionSummary.loc[i,'SessionNumber']
        j = j+1
        if SessionSummary.loc[i,'HashedUser'] == SessionSummary.loc[i-1,'HashedUser']:
            k = k+1
        else:
            k = 0
    SessionSummary.loc[i,'SessionNumber'] = j
    SessionSummary.loc[i,'SessionNumberByUser'] = k
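
For comparison, the same re-enumeration can be expressed without explicit loops. This is a sketch of an alternative technique, not the notebook's method: it uses `pd.factorize` for the global numbering and a dense rank within each user for the per-user numbering. The toy data frame and the `New...` column names are hypothetical, and the sketch assumes rows are sorted by user and time so that session numbers increase within each user.

```python
import pandas as pd

# Invented data: gaps in SessionNumber are left over from removed sessions.
df = pd.DataFrame({'HashedUser':    ['a', 'a', 'a', 'b', 'b', 'c'],
                   'SessionNumber': [0, 0, 3, 4, 4, 7]})

# Global numbering: consecutive codes in order of first appearance.
df['NewSessionNumber'] = pd.factorize(df['SessionNumber'])[0]

# Per-user numbering: dense rank of the session number within each user, from 0.
dense_rank = df.groupby('HashedUser')['SessionNumber'].rank(method='dense')
df['NewSessionNumberByUser'] = dense_rank.astype(int) - 1

print(df)
```

Both columns start at 0, and the per-user column restarts whenever the user changes.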

Here we look at the number of steps and sessions with unwanted behaviors.

In [91]:
[raw.shape[0],sum(raw.StepType=='Unexplained'),sum(raw.StepType=='Concurrent'),sum(raw.StepType=='WatchTooLong'),
 sum(raw.StepType=='BreakTooLong')]

[207, 57, 0, 1, 0]

In [92]:
[SessionSummary.shape[0], sum(SessionSummary.HasConcurrent==True),sum(SessionSummary.HasCloseToJump==True),\
               sum(SessionSummary.HasUnexplained==True),sum(SessionSummary.TimestampDiff>10800)]

[8, 0, 0, 4, 0]

From the session summary data frame, we may call a watching session 'dirty' if it has concurrent steps, close-to-jump steps, or unexplained steps. To retain more data, we decide to use $\textbf{only unexplained steps}$ as the criterion for dirty watching sessions.

We also remove watching sessions with a $\textbf{too long user timestamp difference}$ (due to possible logic failure from missing signals). We set the threshold to $3 \text{ hr} = 10800 \text{ sec}$.

We have two choices for cleaning the data: throw away just the dirty watching sessions, or throw away all data of users who have dirty watching sessions. Here I decide to throw away $\textbf{all data of users who have dirty watching sessions}$. This makes the analysis of a user's first watching session versus later watching sessions easier.

From this, we can now define 'DirtyHashedUserList', 'CleanSessionSummary', and 'CleanData'.

In [93]:
#DirtyHashedUserList = list(set(SessionSummary[(SessionSummary.HasConcurrent==True)|(SessionSummary.HasCloseToJump==True)|
#               (SessionSummary.HasUnexplained==True)|(SessionSummary.TimestampDiff>10800)].HashedUser))
DirtyHashedUserList = list(set(SessionSummary[(SessionSummary.HasUnexplained==True)|(SessionSummary.TimestampDiff>10800)].HashedUser))

#keep only rows whose user is not in DirtyHashedUserList
CleanSessionSummary = SessionSummary[~SessionSummary.HashedUser.isin(DirtyHashedUserList)]
CleanData = raw[~raw.HashedUser.isin(DirtyHashedUserList)]
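
The user-level cleaning rule can be checked on a small example. The sketch below is illustrative only: the `summary` data frame and its values are invented, and the 10800-second threshold is the one stated in the text.

```python
import pandas as pd

# Invented session summary for three users.
summary = pd.DataFrame({
    'HashedUser':     ['u1', 'u1', 'u2', 'u3'],
    'HasUnexplained': [False, True, False, False],
    'TimestampDiff':  [300.0, 900.0, 12000.0, 600.0],
})
max_session_sec = 10800  # 3 hours, as stated above

# A session is dirty if it has unexplained steps or lasts longer than 3 hours.
is_dirty = summary.HasUnexplained | (summary.TimestampDiff > max_session_sec)
dirty_users = set(summary.loc[is_dirty, 'HashedUser'])

# Drop every session of a dirty user, including that user's clean sessions.
clean_summary = summary[~summary.HashedUser.isin(dirty_users)]

print(sorted(dirty_users))
print(clean_summary)
```

Here `u1` loses both sessions even though only one of them is dirty, which is the intended effect of cleaning at the user level.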


Here we summarize information about clean data that we will use later.

In [94]:
[len(set(CleanData[CleanData.UserType == 'WEBAUTH'].HashedUser)),\
 len(set(CleanData[CleanData.UserType == 'ONECE'].HashedUser)),\
 len(set(CleanSessionSummary[CleanSessionSummary.UserType=='WEBAUTH'].HashedUser)),\
 len(set(CleanSessionSummary[CleanSessionSummary.UserType=='ONECE'].HashedUser)),\
 len(set(CleanSessionSummary.HashedUser))]

[3, 1, 3, 1, 4]

In [95]:
Essential.loc[Essential.shape[0]] = ['Number of users after cleaning', len(set(CleanData.HashedUser))]
Essential.loc[Essential.shape[0]] = ['Number of WEBAUTH users after cleaning',\
                                     len(set(CleanData[CleanData.UserType == 'WEBAUTH'].HashedUser))]
Essential.loc[Essential.shape[0]] = ['Number of ONECE users after cleaning',\
                                     len(set(CleanData[CleanData.UserType == 'ONECE'].HashedUser))]
Essential.loc[Essential.shape[0]] = ['Number of actions after cleaning', CleanData.shape[0]]
Essential.loc[Essential.shape[0]] = ['Number of sessions after cleaning', CleanSessionSummary.shape[0]]
[len(set(CleanSessionSummary.HashedUser)),CleanSessionSummary.shape[0]]

[4, 4]

In [96]:
Essential

Unnamed: 0,Key,Value
0,Course Name,CS110_ClassX
1,Video Name,CS11002Apr2014_Full
2,Quarter,Spring 2014
3,Lecture Date,2014-04-02
4,Duration[sec],2926
5,Duration[min],48
6,Number of raw users,8
7,Number of raw WEBAUTH users,6
8,Number of raw ONECE users,1
9,Number of raw actions,216


Here we save the clean data, the clean session summary, and the essential information.

In [400]:
CleanData.to_csv(OutputBigDirectory+"/CleanData.csv",index=False)
CleanSessionSummary.to_csv(OutputBigDirectory+"/CleanSessionSummary.csv",index=False)
Essential.to_csv(OutputSmallDirectory+"/CleaningSummary.csv",index=False)

## Wrap up

We wrap this up as a function $\texttt{CleanData(coursename,videoname)}$, assuming that the relevant video data file is in place. Once the function is called, the CleanData, CleanSessionSummary, and Essential data frames are written to the output directories.

In [401]:
def CleanData( coursename, videoname ):

    #insert all codes above here

    #parenthesized print works under both Python 2 and Python 3
    print(coursename + ":" + videoname + " is cleaned :)")
    return