In this notebook, we will finally create predictive features using the logs we cleaned in notebook 2.2. Our focus, for now, will be prediction using an aggregate non-temporal representation of each student.

We start by importing the logs and the remaining tables that we consider relevant for feature engineering and extraction.

#### 1. Importing the relevant packages, setting global variables and importing the relevant files

In [None]:
#import libs
import pandas as pd
import numpy as np
from pandas.tseries.offsets import *

#viz related tools
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.colors import LogNorm, Normalize
from matplotlib.ticker import MaxNLocator
import matplotlib as mpl
from matplotlib import cm
import seaborn as sns

#tqdm to monitor progress
from tqdm.notebook import tqdm, trange
tqdm.pandas(desc="Progress")

#time related features
from datetime import timedelta
from copy import copy, deepcopy

#starting with other tools
sns.set()

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

#to save
import xlsxwriter

In [None]:
#global variables that may come in handy
#course threshold sets the % duration that will be considered (1 = 100%)
duration_threshold = [0.1, 0.25, 0.33, 0.5, 1]

#colors for visualizations
nova_ims_colors = ['#BFD72F', '#5C666C']

#standard color for student aggregates
student_color = '#474838'

#standard color for course aggregates
course_color = '#1B3D2F'

#standard continuous colormap
standard_cmap = 'viridis_r'

#Function designed to deal with multiindex and flatten it
def flattenHierarchicalCol(col,sep = '_'):
    '''converts multiindex columns into single index columns while retaining the hierarchical components'''
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

In [None]:
#loading student log data 
student_logs = pd.read_csv('../Data/Modeling Stage/R_Gonz_cleaned_logs.csv', 
                           dtype = {
                                   'id': object,
                                   'itemid': object,
                                   'userid': object,
                                   'course': object,
                                   'cmid': object,
                                   },
                                   parse_dates = ['time'],).drop(['Unnamed: 0', 'id', 'url', 'info'], axis = 1).dropna(how = 'all', axis = 1) #logs

#loading support table
support_table = pd.read_csv('../Data/R_Gonz_support_table.csv', 
                           dtype = {
                                   'assign_id': object,
                                   'courseid': object,
                                   'userid': object,
                                   }, 
                            parse_dates = ['sup_time', 'startdate']).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1)

#loading the class list (course durations)
class_list = pd.read_csv('../Data/Modeling Stage/R_Gonz_class_duration.csv', 
                         dtype = {
                                   'course': object,                                   
                                   },
                        parse_dates = ['Start Date','End Date', 'cuttoff_point']).drop('Unnamed: 0', axis = 1).rename(columns = {'cuttoff_point' : 'Week before start'})

#targets tables 
targets_table = pd.read_csv('../Data/Modeling Stage/R_Gonz_targets_table.csv',
                           dtype = {
                                   'userid': object,
                                   'courseid': object,
                                   },)

We'll start with the general verification of the different datasets we've imported. 

**Starting with the targets table, which includes all valid student-course logs with Final-Grade.**

In [None]:
#get info
targets_table.info()

In [None]:
targets_table.describe(include = 'all', datetime_is_numeric = True)

In [None]:
targets_table.rename(columns = {'courseid' : 'course'}, inplace = True)

Then, we repeat the same for the list of courses and their respective start and end dates. The number of students attending each course is the number found in the logs. We will need to make further cuts.

In [None]:
class_list.info()

In [None]:
class_list.describe(include = 'all', datetime_is_numeric = True)

In [None]:
class_list

We still note a significant presence of courses with small numbers of students. The first step we will take is the removal of all courses with fewer than 25 attending students.

In [None]:
class_list = class_list[class_list['Users per course'] >= 25]

#updating student logs
student_logs = student_logs[student_logs['course'].isin(class_list['course'])]


#additionally updating targets_table
targets_table = targets_table[targets_table['course'].isin(class_list['course'])]
class_list.describe(include = 'all', datetime_is_numeric = True)

We'll follow up by taking a closer look at the logs we cleaned in the previous section.

In [None]:
student_logs.info()

In [None]:
student_logs.describe(include = 'all', datetime_is_numeric = True)

In [None]:
student_logs

In [None]:
#then we plot a histogram of the number of students per course; courses with fewer than 25 students were already removed above
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(16, 10)}, font_scale=2)
hist4 = sns.histplot(data=class_list, x='Users per course', kde=True, color= student_color, binwidth = 5,)

fig = hist4.get_figure()
fig.savefig('../Images/hist4_students_per_course_bin_5.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, hist4


Likewise, courses with an abnormally high number of attending students for a face-to-face context (over 200) deserve attention. We will look more closely at those courses.

In [None]:
#create df only with high affluence courses
most_affluent_courses = class_list[class_list['Users per course'] >= 200]

#separate logs accordingly
high_attendance_logs = student_logs[student_logs['course'].isin(most_affluent_courses['course'])]
high_attendance_logs

In [None]:
high_attendance_logs.describe(include = 'all', datetime_is_numeric = True)

We can plot the weekly interactions of these courses.

In [None]:
#Then, when it comes to logs, we aggregate by week
grouped_data = high_attendance_logs.groupby([pd.Grouper(key='time', freq='W'), 'course']).agg({
                                                                             'action': 'count',
                                                                             }).reset_index().sort_values('time')
#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['course'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan).reset_index().rename(columns = {'course' : 'Course'})

#now, we will sort the courses according to the starting date
grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index('Course', drop = True)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions per course
heat4 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= 1,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat4.get_figure()
fig.savefig('../Images/highest_attendance_weekly_clicks_heat4.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat4

While most courses are very clearly restricted to their semester, there are courses that have interactions occurring across the entire year. 

For these courses, we just want to understand whether all students are interacting continuously or whether we are dealing with different cohorts of students. As such, we will look more deeply at the following courses:

3022, 3069 and 3151

In [None]:
year_long_high_attendance = ['3022.0', '3069.0', '3151.0']

#Then, when it comes to logs, we aggregate by week
grouped_data = high_attendance_logs[high_attendance_logs['course'].isin(year_long_high_attendance)].groupby([pd.Grouper(key='time', freq='W'), 'course', 'userid']).agg({
                                                                             'action': 'count',
                                                                             }).reset_index().sort_values('time')
#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['course', 'userid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan).reset_index().rename(columns = {'course' : 'Course'})

#now, we will sort the courses according to the starting date
grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Course', 'userid'], drop = True)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions per student
heat5 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= 0,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat5.get_figure()
fig.savefig('../Images/high_attend_yearlong_weekly_heat5.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat5, grouped_data, high_attendance_logs, year_long_high_attendance

After consideration, we find that the student interactions seem to be consistent with the course duration. 

We note, however, that the accesses to course 4923 seem to be inconsistent at best. We will monitor this course (and others) in the following steps. For now, we will proceed with the analysis over targets and support table.

**1. First, we filter by our current list of valid courses.**

In [None]:
#pairwise representation of the different targets
g = sns.PairGrid(targets_table, diag_sharey=False, corner=True)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.add_legend()

In [None]:
#a larger overlook at the different courses
targets_table.groupby('course').agg({
                                    'userid' : 'count', 
                                    'Grade Mandatory' : ['min', 'mean', 'max'],
                                    'Grade Optional' : ['min', 'mean', 'max'],                                    
                                    'final_mark' : ['min', 'mean', 'max'],
                                    }).describe(include = 'all')

#### Finally, we will take a look at the support table we have and repeat the same steps performed thus far

In [None]:
#separate logs accordingly
support_table = support_table[support_table['assign_id'].isin(student_logs['cmid'])].rename(columns = {
                                                                                            'courseid' : 'course'
                                                                                            })
#filter student logs appropriately
student_logs = student_logs[student_logs['course'].isin(support_table['course'])].reset_index(drop = True)

#get info
support_table.info()

In [None]:
support_table.describe(include = 'all', datetime_is_numeric = True)

In [None]:
support_table

In [None]:
#a larger overlook at the different courses
support_table.groupby('course').agg({
                                    'userid' : 'count',
                                    'assign_id' : 'count', 
                                    'mandatory_status' : 'mean',
                                    'delivered' : 'mean',                                    
                                    'assignment_mark' : 'mean',
                                    }).describe(include = 'all')

In [None]:
#pairwise representation of the support table variables
g = sns.PairGrid(support_table, diag_sharey=False, corner=True)
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.add_legend()

**Going forward**.

After this preliminary look, we will go forward with extracting features from the Moodle logs. 

In this notebook, we will consider a static non-temporal representation that considers each student-course pair as a row. We will, however, construct different datasets - 1 for each relevant timestep. 

We will rely on features that are regular presences in the literature. Some of these features may appear in more than one work:

**From Macfadyen et al. (2010)**

Count related features:
- Discussion messages posted, 
- Online Sessions, 
- File views,
- Assessments finished, 
- Assessments started, 
- Replies to discussion messages, 
- Mail messages sent, 
- Assignments submitted, 
- Discussion messages read, 
- Web link views

Time related features:
- Total time online, 
- Time spent on assignments,


**From Romero et al. (2013)**

Number Accesses to:
- Assignments done,
- Quizzes passed,
- Quizzes failed,
- Forum messages posted, 
- Forum messages read,

Time related features:
- Total time on assignments, 
- Total time on quizzes, 
- Total time on forums

**From Gasevic et al. (2016)**

Number of Accesses of the following variables:
- course logins,
- forum,
- resources,
- Turnitin file submission,
- assignments,
- book,
- quizzes, 
- feedback,
- lessons,
- virtual classroom
- chat,

- etc...

**From Conijn et al. (2017)**

Click count related features:
- Clicks,
- Online sessions, 
- Course page views,
- Resources viewed,
- Links viewed, 
- Discussion post views,
- Content page views,
- Quizzes,
- Quizzes passed,
- Assignments submitted, 
- Wiki edits,
- Wiki views,

Time related features:
- Total time online,
- Largest period of inactivity,
- Time until first action, 
- Average session time,

Performance related features:
- Average assignment grade,

**Chen and Cui (2020)**

Click count related features:
- Total clicks, 
- Clicks on campus, 
- Online sessions,
- Clicks during weekdays,
- Clicks on weekend,
- Assignments, 
- File,
- Forum,
- Overview Report,
- Quiz,
- System, 
- User Report

Time related features
- Total time of online sessions, 
- Mean duration of online sessions, 
- SD of time between sessions, 
- Total time on Quiz, 
- Total time on File, 
- SD of time on File, 

Other statistics
- Ratio between on-campus and off-campus clicks

**Nuno Rosário Thesis**
- number of forum messages read, 
- number of forum messages posted, 
- number of pages,
- number of clicks, 
- number of submissions,
- number of files accessed,

As stated, some of the features are calculated across multiple works - and these only address the course level. They are not designed specifically for course-agnostic purposes.

First, **we will split the logs by course and, for each student, calculate the features we intend to use.** These are calculated via aggregate operations, starting with the most common ones (those that appear most often in our literature).

**But before that**, we will look at different columns of our logs and, when appropriate, keep the different values they may take. The first and most immediate correction concerns the IP: 127.0.0.1 hints at a local (on-campus) connection, while other IPs suggest an off-campus connection. 

In [None]:
#converts ip to an on_campus flag (1 = on campus, 0 = off campus)
student_logs['on_campus'] = np.where(student_logs['ip'] == '127.0.0.1', 1, 0)
student_logs.drop('ip', axis = 1, inplace = True)

Secondly, we take a look at the actions and the modules. Here, we can find the most common actions and modules.

We are familiar with the most common features: course, resources, assignments, quiz, forums, etc...

There are, however, other labels that are rarely used. We can start by grouping the less common modules together in a way that, at least intuitively, makes sense.

In [None]:
#other pages with unclear meaning will be grouped together
other_modules = ['oublog', 'data', 'bigbluebuttonbn', 'nanogong', 'role', 'notes', 'calendar', 'recordingsbn', 'bookmark']

#converts discussion points to forum or, alternatively,groups other elements to other category
student_logs['module'] = np.where(student_logs['module'] == 'discussion', 'forum',
                                  np.where(student_logs['module'] == 'imscp', 'resource', #imscp is what allows content packages to be posted
                                  np.where(student_logs['module'] == 'glossary', 'course', #usually, glossaries refer to course
                                  np.where(student_logs['module'].isin(other_modules), 'others', student_logs['module']))))

del other_modules

Likewise, we will need to take a look at the different actions in order to understand how common these may be. 

Again, we will look at different actions and see how we can group them together in a way that, at least, makes intuitive sense. There is use in keeping the distinction between different types of view.
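As a side note, this kind of regrouping can also be expressed as a dictionary lookup instead of nested `np.where` calls. The sketch below is only an illustration on a toy Series, not the approach used in this notebook: the names `groups`, `lookup` and `toy_actions` are hypothetical, the group contents are abbreviated, and the module-dependent split of the `view` action is not covered.

```python
import pandas as pd

# hypothetical, abbreviated groups: broad label -> raw actions it absorbs
groups = {
    'update': ['edit post', 'update post'],
    'addition': ['add post', 'add entry'],
}

# invert the groups into a flat raw-action -> label lookup
lookup = {raw: label for label, raws in groups.items() for raw in raws}

toy_actions = pd.Series(['edit post', 'add entry', 'view', 'update post'])

# map known actions to their label; actions not in the lookup keep their original value
regrouped = toy_actions.map(lookup).fillna(toy_actions)
print(regrouped.tolist())
```

A flat lookup makes it easier to spot an action that was accidentally listed twice or assigned to two groups, since the dictionary keeps only one label per raw action.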

In [None]:
#updates and edits related to changes made to presented information:
update_related = ['edit post', 'edit override',  'edit report', 'update mod',
                  'update submission', 'update entry', 'editsection', 'update switch phase',
               'update assessment',   'update post', 'editquestions', 'edit']

#additions 
addition = ['add assessment', 'add category', 'add item', 'add submission', 'add page', 'add mod', 'add entry', 'add post', 'add discussion',
            'add', 'save report','add comment']

#deletion
deletion = ['delete override', 'record delete', 'delete entry', 'delete', 'delete post',
            'delete discussion', 'delete attempt','delete mod',]

#likewise, we will also join other moderation related tasks together,
moderation = ['restore', 'save', 'unlock submission', 'stop tracking', 'assign', 'lock submission', 'usage report',
              'start tracking', 'view subscribers', 'grade submission', 'grant extension', 'manualgrade']

#other reporting actions
report = ['report live', 'report participation', 'report', 'report log', 'report outline', 'view report', 'user report']

#messages/files
messages = ['message saved', 'message sent', 'files']

#other actions
other_actions = ['preview', 'unsubscribe', 'download all submissions', 'mark read', 'subscribeall', 'diff', 'comment', 
                 'unsubscribeall', 'search', 'history', 'map', 'subscribe', 'submissioncopied', 'comments']

#smaller view commands to join main view
small_view = ['view entry', 'view edit']

#regroups the raw actions into the broader categories defined above
student_logs['action'] = np.where(student_logs['action'].isin(update_related), 'update', #edit list
                                  np.where(student_logs['action'].isin(addition), 'addition', #addition list
                                  np.where(student_logs['action'].isin(deletion), 'delete', #deletion list
                                  np.where(student_logs['action'].isin(moderation), 'other admin actions', #other admin actions
                                  np.where(student_logs['action'].isin(report), 'report', #reporting related
                                  np.where(student_logs['action'].isin(other_actions), 'other actions', #other actions
                                  np.where(student_logs['action'].isin(messages), 'messages and files', #messages
                                  np.where(student_logs['action'].isin(small_view), 'view others', #main view files
                                  np.where(student_logs['action'] == 'choose again','choose', #choose again to choose
                                  
                                  #finishing with splitting the view command to different subgroups according to the module - will make it easier later
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'assign'),'view assignment',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'choice'),'view choice',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'course'),'view course',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'folder'),'view folder',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'glossary'),'view glossary',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'others'),'view others',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'page'),'view page',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'questionnaire'),'view questionnaire',
                                  np.where(((student_logs['action'] == 'view') | (student_logs['action'] == 'view all')) & (student_logs['module'] == 'resource'),'view resource',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'url'),'view url',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'user'),'view user',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'wiki'),'view wiki',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'workshop'),'view workshop',
                                  np.where((student_logs['action'] == 'view') & (student_logs['module'] == 'quiz'),'view quiz',                             
                                  np.where((student_logs['action'] == 'view forums') & (student_logs['module'] == 'forum'),'view forum',
                                  np.where((student_logs['action'] == 'submit for grading') & (student_logs['module'] == 'assign'),'submit',
                                  np.where((student_logs['action'] == 'view submit assignment form') & (student_logs['module'] == 'assign'),'view assignment',
                                  student_logs['action']))))))))))))))))))))))))))

#we finish by deleting the lists we've created
del update_related, addition, deletion, moderation, report, messages, other_actions, small_view 

In [None]:
#verify the resulting module/action pairings
with pd.option_context('display.max_rows', None,):
     display(student_logs.groupby(['module', 'action']).size().to_frame())

We have addressed the most obvious possible aggregations. Now, we will go forward with our intended feature extraction and selection.

For this step, we will create 5 distinct dicts of dataframes. Each dict refers to a certain course duration threshold.
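To make the threshold logic concrete, here is a minimal sketch on toy data: the cutoff is the start date plus a fraction of the course duration, moved forward to the next Friday when it does not already fall on one. The helper `snap_to_friday` and the frame `toy_courses` are hypothetical and only illustrate the idea. The day-of-week arithmetic is an alternative formulation, not the `Week(weekday=4)` offset used in the cell below.

```python
import pandas as pd

def snap_to_friday(dates):
    """Move each date forward to the next Friday; Fridays are left unchanged."""
    days_ahead = (4 - dates.dt.dayofweek) % 7  # Monday = 0, Friday = 4
    return dates + pd.to_timedelta(days_ahead, unit='D')

# toy course list (assumed values, for illustration only)
toy_courses = pd.DataFrame({
    'course': ['A', 'B'],
    'Start Date': pd.to_datetime(['2021-01-04', '2021-01-01']),
    'Course duration days': [100, 70],
})

fraction = 0.1  # 10% of the course duration

# raw cutoff: start date plus the chosen share of the duration, truncated to the day
raw_cutoff = (toy_courses['Start Date']
              + pd.to_timedelta(toy_courses['Course duration days'] * fraction, unit='D')).dt.normalize()

# final cutoff: aligned to a Friday
toy_courses['Date_threshold_10'] = snap_to_friday(raw_cutoff)
print(toy_courses)
```

In this toy case, course A's raw cutoff (a Thursday) moves one day forward, while course B's raw cutoff already falls on a Friday and stays as it is.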

In [None]:
#number of days
days = {}

#additionally, we will look at our estimated course duration
for i in tqdm(duration_threshold):
    #create, for each desired threshold, the appropriate cutoff date 
    class_list[f'Date_threshold_{int(i*100)}'] = pd.to_datetime((class_list['Start Date'] + pd.to_timedelta(class_list['Course duration days'] * i, unit = 'Days')).dt.date)
    
    #setting up duration threshold to be on friday -> reason being that it will be easier to 
    class_list[f'Date_threshold_{int(i*100)}'] = class_list[f'Date_threshold_{int(i*100)}'].where( class_list[f'Date_threshold_{int(i*100)}'] == (( class_list[f'Date_threshold_{int(i*100)}'] + Week(weekday=4)) - Week()), class_list[f'Date_threshold_{int(i*100)}'] + Week(weekday=4))
    
    #storing the date threshold and the week before start, to calculate features relative to course duration
    days[f'Date_threshold_{int(i*100)}'] = deepcopy(class_list.filter(['course', 'Start Date', 'Week before start', 'End Date',
                                                                        f'Date_threshold_{int(i*100)}']))
    
    days[f'Date_threshold_{int(i*100)}']['Number of days'] = (days[f'Date_threshold_{int(i*100)}'][f'Date_threshold_{int(i*100)}'] - days[f'Date_threshold_{int(i*100)}'][f'Week before start']).dt.days
#then, we will create a dictionary of dictionaries, each main dictionary storing a version of the logs
logs_dict = {}
assignment_dict = {}

for i in tqdm(duration_threshold):
    #create, for each desired threshold, a different dictionary of dataframes wherein we will perform the different operations
    print(f'Date_threshold_{int(i*100)}\n' +
          f'Logs')
    logs_dict[f'Date_threshold_{int(i*100)}'] = {course: student_logs.loc[student_logs['course'] == course].reset_index(drop = True) for course in tqdm(student_logs['course'].unique())}
    
    #Assignments
    print(f'Assignments')
    assignment_dict[f'Date_threshold_{int(i*100)}'] = {course: support_table.loc[support_table['course'] == course].reset_index(drop = True) for course in tqdm(support_table['course'].unique())}

Now, we have a nested dictionary with different dataframes inside it. We will use this data structure to perform most of the operations we are interested in.

**First, we will add, to each dataframe, a column with the corresponding threshold date**

After this cleaning procedure, we will add the different columns referring to our features of interest. These will be:
1. Number of assignments submitted, 
2. Number of online sessions,
3. Discussion messages read,
4. Resource views, 
5. Assessments started,
6. Total time online,
7. Assignment views,
8. Average duration of session,
9. Messages posted,
10. Clicks on Forum, 
11. Clicks on Folder
12. On-campus clicks,
13. On-campus/off-campus clicks,
14. Total number of clicks
15. Number of links viewed
16. Largest period of inactivity
17. Average clicks per day
18. Average clicks per session
19. The start date of the first 10 sessions (relative to the entire course duration)
20. % of Submissions made in the period,
21. % of clicks made in the period,

To check difference between inclusion and not inclusion

22. Average grade of assignments (optional)

A double loop is not very efficient but, to the best of my ability, it is the most obvious way to perform these operations. 
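Several of these features depend on online sessions. The sketch below illustrates, on toy data, the session rule applied in the loop that follows: a student's first click, or any click arriving more than 40 minutes after the previous one, starts a new session. The function `add_sessions` and the frame `toy_logs` are hypothetical, and the accumulated time is kept in seconds for simplicity.

```python
import pandas as pd

def add_sessions(logs, gap_minutes=40):
    """Number each student's sessions and accumulate the time spent within each session."""
    logs = logs.sort_values(['userid', 'time']).reset_index(drop=True)

    # time elapsed since the student's previous click (NaT for the first click)
    t_diff = logs.groupby('userid')['time'].diff()

    # a new session starts at the first click or after a long pause
    new_session = t_diff.isna() | (t_diff > pd.Timedelta(minutes=gap_minutes))
    logs['session'] = new_session.astype(int).groupby(logs['userid']).cumsum()

    # gaps that open a session do not count as time spent online
    seconds = t_diff.where(~new_session, pd.Timedelta(0)).dt.total_seconds()
    logs['session_cumul_seconds'] = seconds.groupby([logs['userid'], logs['session']]).cumsum()
    return logs

toy_logs = pd.DataFrame({
    'userid': ['a', 'a', 'a', 'a', 'b'],
    'time': pd.to_datetime(['2021-01-04 10:00', '2021-01-04 10:10',
                            '2021-01-04 11:30', '2021-01-04 11:35',
                            '2021-01-04 09:00']),
})

sessions = add_sessions(toy_logs)
print(sessions)
```

Sorting by student and time before the cumulative sums matters: the session counter follows row order, so unsorted logs would yield inconsistent session numbers.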

In [None]:
#for each intended course duration threshold
for i in tqdm(logs_dict):
    #start with creating a dictionary of course and intended cutoff date
    cut = class_list.set_index('course').to_dict()[i] 
    
    #for each dataframe
    for j in tqdm(logs_dict[i]):
        #where the course is the same as in the class_list, get the corresponding value of the appropriate column,
        logs_dict[i][j]['Date Threshold'] = logs_dict[i][j]['course'].map(cut)
        logs_dict[i][j] = logs_dict[i][j][logs_dict[i][j]['time'] <= logs_dict[i][j]['Date Threshold']].reset_index(drop = True).drop('Date Threshold', axis = 1)
        
        #doing the same for the assignments list
        try:
            assignment_dict[i][j]['Date Threshold'] = assignment_dict[i][j]['course'].map(cut)
            assignment_dict[i][j] = assignment_dict[i][j][assignment_dict[i][j]['sup_time'] <= assignment_dict[i][j]['Date Threshold']].reset_index(drop = True).drop('Date Threshold', axis = 1)
        
        except KeyError:
            #courses without an entry in the assignments dictionary are skipped here; their log features are still computed below
            pass

        #calculates the difference between previous within group row and current
        logs_dict[i][j]['t_diff'] = logs_dict[i][j].sort_values(['userid', 'time']).groupby('userid')['time'].diff()
        
        #will need to ignore dataframes that have no rows
        if len(logs_dict[i][j]) > 0:
            #the nans correspond to the first interaction made by each student - also signaling the start of the first session
            logs_dict[i][j]['session'] = np.where(logs_dict[i][j]['t_diff'].isna(), 1, #the first session is started by nans
                                             np.where(logs_dict[i][j]['t_diff'] > pd.to_timedelta(40, unit = 'minutes'), 1, #also identify the starting point of new sessions
                                                      0))
            
            #then, we cumulative sum all in-group members 
            logs_dict[i][j]['session'] = logs_dict[i][j].groupby('userid')['session'].transform(pd.Series.cumsum)
            
            #before finishing this step, we will calculate the accumulated duration of a session
            logs_dict[i][j]['mask'] = np.where(logs_dict[i][j]['t_diff'].isna(), 0, #the first session is started by nans
                                      np.where(logs_dict[i][j]['t_diff'] > pd.to_timedelta(40, unit = 'minutes'), 0, #also identify the starting point of new sessions
                                      logs_dict[i][j]['t_diff'].dt.total_seconds()))
            #fillnas in t_diff
            logs_dict[i][j]['t_diff'] = logs_dict[i][j]['t_diff'].fillna(pd.to_timedelta(0))

            #then, we cumulative sum all in-group members 
            logs_dict[i][j]['session_cumul_time'] = logs_dict[i][j].groupby(['userid','session'])['mask'].transform(pd.Series.cumsum)
            logs_dict[i][j]['session_cumul_time'] = pd.to_timedelta(logs_dict[i][j]['session_cumul_time'], unit = 'seconds')
            #drop mask
            logs_dict[i][j].drop('mask', axis = 1, inplace = True)
        
        else:
            logs_dict[i][j] = pd.DataFrame(columns=['time', 'userid', 'course', 'module', 'cmid', 'action', 'on_campus', 't_diff',
                                                    'session', 'session_cumul_time'])
            continue

In [None]:
#create backup of logs dict, we will need it for later
#backup = deepcopy(logs_dict)

In [None]:
#logs_dict = deepcopy(backup)

Now, we'll go forward with the creation of the features using these datasets. We will do it, using groupby commands.

After this cleaning procedure, we will add the different columns referring to our features of interest.

**We cannot perform all steps at once, unfortunately.** (at least not in a capacity I can manage)

We will need to create multiple dfs to ensure that all features are accounted for:

1. We start with features that relate to raw aggregate counts of clicks and sessions - a general set of features, 
2. We continue by computing features related with time -  total time online and average duration of session,
3. Then, we go into finer grained features using specific pairs of modules and actions.
4. Then, we finish by merging these features with the final mark we have previously calculated and the average grade of assignments delivered up to the threshold date.
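As an illustration of the first step, the sketch below computes a few general count features for each course-student pair on toy data. It uses pandas named aggregation, which yields flat column names directly. The frame `toy_logs` and the output column names are hypothetical, and `share_on_campus` is an assumed variant of the on-campus/off-campus ratio that avoids dividing by zero when a student has no off-campus clicks.

```python
import pandas as pd

# toy logs with the columns used by the general features (assumed values)
toy_logs = pd.DataFrame({
    'course': ['c1'] * 5,
    'userid': ['a', 'a', 'a', 'b', 'b'],
    'action': ['view course', 'view resource', 'submit', 'view course', 'view forum'],
    'session': [1, 1, 2, 1, 1],
    'on_campus': [1, 0, 0, 1, 1],
})

# one row per course-student pair, with named aggregation for flat column names
general = toy_logs.groupby(['course', 'userid']).agg(
    N_clicks=('action', 'count'),           # total number of clicks
    N_sessions=('session', 'nunique'),      # number of distinct sessions
    clicks_on_campus=('on_campus', 'sum'),  # clicks made on campus
).reset_index()

# share of clicks made on campus (assumed alternative to the on/off-campus ratio)
general['share_on_campus'] = general['clicks_on_campus'] / general['N_clicks']
print(general)
```

The time-related and module-specific features can then be computed in separate frames and merged back on `course` and `userid`.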

In [None]:
#we will need to perform the same double loop we have done before
for i in tqdm(logs_dict):
     
    for j in tqdm(logs_dict[i]):    
        #as it is very difficult to compute everything in a single groupby, we will need to create multiple placeholders
        #placeholder 1 - general features
        general_features = logs_dict[i][j].groupby(['course', 'userid']).agg(
                                                {'action' : [('N_clicks','count')], #number of clicks
                                                'session' : [('N_sessions', 'nunique')], #number of sessions
                                                 'on_campus' : [('Clicks on Campus', np.sum)], #number of clicks on campus
                                                 't_diff' : [('Largest_period_of Inactivity' , 'max')], #largest period of inactivity 
                                                })
        
        #the second group will deal with session related time features
        session_features = logs_dict[i][j].groupby(['userid', 'session'])['session_cumul_time'].max().to_frame().reset_index() #the accumulated time up to last click of each session identifies the duration 
        
        #now, we get to our intended features
        session_features = session_features.groupby(['userid']).agg(
                                                {'session_cumul_time' : [np.sum, #The total time online is the sum of the time spent in all sessions 
                                                                        np.mean], #mean duration across all sessions made by the student
                                                 })
        
        #start of each session
        start_of_sessions = logs_dict[i][j].groupby(['userid', 'session']).agg({
                                                                                'time' : 'min',
                                                                                }).reset_index()
        
        #we will get the start dates of the first 10 sessions
        start_of_sessions = pd.pivot_table(start_of_sessions, index = 'userid', columns = 'session', values = 'time',
                            aggfunc = 'min').reset_index()
        
        # We reindex the pivot to contain all columns we intend to have, even if they are not present
        start_of_sessions = start_of_sessions.reindex(columns = ['userid', 1, 2, 3, 4, 5,
                                                                6, 7, 8, 9, 10], fill_value = np.nan).rename(
                                                                    columns = {1 : 'Start of Session 1 (%)', 
                                                                               2 : 'Start of Session 2 (%)',
                                                                               3 : 'Start of Session 3 (%)',
                                                                               4 : 'Start of Session 4 (%)',
                                                                               5 : 'Start of Session 5 (%)',
                                                                               6 : 'Start of Session 6 (%)', 
                                                                               7 : 'Start of Session 7 (%)',
                                                                               8 : 'Start of Session 8 (%)',
                                                                               9 : 'Start of Session 9 (%)',
                                                                               10 : 'Start of Session 10 (%)',
                                                                              })
        
        #the third relies on clicks of multiple types and modules. An elegant way to deal with these is a pivot_table of the counts
        pivot = pd.pivot_table(logs_dict[i][j], index = 'userid', 
                              columns = ['module', 'action'],
                              values='cmid',
                              aggfunc = 'count')
        
        #applies the function that removes multiindex
        pivot.columns = pivot.columns.map(flattenHierarchicalCol)
        pivot.reset_index(inplace = True)
        
        #now, we filter the pivot table to only keep the features that we are interested in - specifically, the counts
        pivot = pivot.reindex(columns = [
                               'userid',
                               'assign_submit', #Number of assignments submitted
                               'resource_view resource', #resource views,
                               'assign_view assignment', #view assignment
                               'forum_view discussion', #view discussion,
                               'quiz_attempt', #quizzes started
                               'forum_addition', #forum messages posted
                                'folder_view all', #view all folders
                                ])
        
        #rename the columns we kept to readable feature names
        pivot = pivot.rename(columns = {'forum_addition' : 'Forum posts', 
                                            'forum_view discussion' : 'Discussions viewed',
                                            'assign_submit' : 'Assignments submitted', 
                                            'resource_view resource' : 'Resources viewed',
                                            'quiz_attempt' : 'Quizzes started',  
                                            'assign_view assignment' : 'Assignments viewed'
                                           })
        
        #the fourth counts clicks per module, regardless of the action. Again, we use a pivot_table of the counts
        pivot_1 = pd.pivot_table(logs_dict[i][j], index = 'userid', 
                              columns = 'module',
                              values='cmid',
                              aggfunc = 'count').reset_index()
        
        #now, we filter the pivot table to only keep the features that we are interested in - specifically, the counts
        pivot_1 = pivot_1.filter([
                               'userid',
                               'forum',
                                'url',
                                'folder',
                                'course', 
                                ], 
                               ).rename(columns = {'forum' : 'Clicks on forum',
                                                    'url': 'Links viewed',
                                                   'folder' : 'Clicks on folder',
                                                   'course' : 'Clicks on course',
                                                  })

        #applies the function that removes multiindex
        general_features.columns = general_features.columns.map(flattenHierarchicalCol)
        general_features.reset_index(inplace = True)
        
        #same for session features
        session_features.columns = session_features.columns.map(flattenHierarchicalCol)
        session_features.reset_index(inplace = True)

        #merging the timestamps that mark the start of the first 10 sessions
        session_features = pd.merge(session_features, start_of_sessions, on = 'userid')
        
        #we finish this section by wrapping everything together
        general_features = pd.merge(general_features, session_features, on = 'userid', how = 'inner')
        general_features.rename(columns = {'session_cumul_time_sum': 'Total time online (min)',
                                           'session_cumul_time_mean': 'Average session duration (min)',
                                           'action_N_clicks': 'Number of clicks',
                                           'session_N_sessions': 'Number of sessions',
                                           'on_campus_Clicks on Campus': 'Clicks on campus',
                                           't_diff_Largest_period_of Inactivity': 'Largest period of inactivity (h)',
                                          }, inplace = True)
        
        #merge features from pivot_table
        pivot = pd.merge(pivot_1, pivot, on = 'userid', how = 'inner')
        
        #joining assignment grades
        assignment_pivot = pd.pivot_table(assignment_dict[i][j], index = 'userid', 
                              columns = 'assign_id',
                              values='assignment_mark',
                              aggfunc = np.sum)
        
        #drop assignments that either were not delivered or received grade = 0  
        assignment_pivot = assignment_pivot.dropna(axis=1, how='all')
        
        #now, we stack these together 
        assignment_pivot = assignment_pivot.stack().reset_index().rename(columns = {0 : 'Average grade of assignments'})
        
        #join all together to get the corresponding dataframe
        logs_dict[i][j] = pd.merge(general_features, pivot, on = 'userid', how = 'inner')
        
        #calculating on-campus/off campus ratio
        logs_dict[i][j]['On/off campus click ratio'] = np.where((logs_dict[i][j]['Number of clicks'] - logs_dict[i][j]['Clicks on campus']) > 0,
                                                                logs_dict[i][j]['Clicks on campus'] / (logs_dict[i][j]['Number of clicks'] - logs_dict[i][j]['Clicks on campus']),
                                                                logs_dict[i][j]['Clicks on campus']) # we consider 1 click off campus to avoid dividing by 0 
        
        #Merge here with days - this will allow us to get first action
        logs_dict[i][j] = logs_dict[i][j].merge(days[i], on = ['course'])
        
        #additional features to compute
        logs_dict[i][j]['Clicks per day'] = logs_dict[i][j]['Number of clicks'] / logs_dict[i][j]['Number of days'] #clicks per day
        logs_dict[i][j]['Clicks per session'] = logs_dict[i][j]['Number of clicks'] / logs_dict[i][j]['Number of sessions'] #clicks per session
        
        #now computing other features in relative terms:
        logs_dict[i][j]['Clicks (% of course total)'] = logs_dict[i][j]['Number of clicks'] / logs_dict[i][j]['Number of clicks'].sum()
        logs_dict[i][j]['Submissions (% of course total)'] = logs_dict[i][j]['Assignments submitted'] / logs_dict[i][j]['Assignments submitted'].sum() #results in NaN when the course has no submissions
        
        #joining final grade for target
        logs_dict[i][j] = logs_dict[i][j].merge(targets_table.filter(['course', 'userid', 'final_mark']), on = ['course', 'userid'], how = 'right')
        
        #changing column dtype to reduce space
        logs_dict[i][j][logs_dict[i][j].select_dtypes(np.float64).columns] = logs_dict[i][j].select_dtypes(np.float64).astype(np.float32)
        logs_dict[i][j][logs_dict[i][j].select_dtypes(np.int32).columns] = logs_dict[i][j].select_dtypes(np.int32).astype(np.int16)
        #joining with assignments
        if len(assignment_pivot) > 0:
                    
            #now, get each student's mean grade across the assignments with a recorded mark
            assignment_pivot = assignment_pivot.groupby(['userid'])['Average grade of assignments'].mean().reset_index()
            
            #and merge with final result
            logs_dict[i][j] = logs_dict[i][j].merge(assignment_pivot, on = 'userid', how = 'left')
            
        #clean unnecessary dfs
        del pivot, pivot_1, general_features, session_features, assignment_pivot
    
    #after the end of the loops:
    logs_dict[i] = pd.concat(logs_dict[i], ignore_index=True)
    logs_dict[i] = logs_dict[i].sort_values(by = ['course', 'userid', 'final_mark']).reset_index(drop = True)

To account for registered students who only access Moodle later in the course, we will make an additional, but necessary, adaptation.

We will start by looking at the complete set of valid students/courses in our 100% dataset. From these, we get the indexes of the rows that are valid (i.e. have a valid click count at 100% duration) and retain only those rows in every threshold dataset.
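
The filtering logic can be illustrated on toy data. In this sketch, the two frames (`full_course`, `early`) and their values are hypothetical; the one assumption carried over from the notebook is that every threshold dataset has the same row order, so positions found in the 100% dataset can be reused on the others.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for two thresholds, rows aligned one-to-one
full_course = pd.DataFrame({
    'course': [10, 10, 10],
    'userid': [1, 2, 3],
    'Number of clicks': [40.0, np.nan, 15.0],   # student 2 never logged in
})
early = pd.DataFrame({
    'course': [10, 10, 10],
    'userid': [1, 2, 3],
    'Number of clicks': [8.0, np.nan, np.nan],  # student 3 only starts later
})

# Positions of the rows with activity by the end of the course
rows_to_keep = full_course.index[full_course['Number of clicks'].notna()]

# Apply the same positional selection to every threshold
full_course = full_course.iloc[rows_to_keep].reset_index(drop=True)
early = early.iloc[rows_to_keep].reset_index(drop=True)

print(early)
```

Student 3 stays in the early dataset with a missing click count, which is the intended behaviour: the student is valid for the course but had not acted yet at that threshold.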

In [None]:
#we gather the index number of valid rows in the 100% df
rows_to_keep = logs_dict['Date_threshold_100'][~logs_dict['Date_threshold_100']['Number of clicks'].isna()].index
columns_copy = ['course', 'userid']

#first filter the date threshold for the entire course
logs_dict['Date_threshold_100'] = logs_dict['Date_threshold_100'].iloc[rows_to_keep, :].reset_index(drop = True)

#Convert timedelta format to numbered format - minutes
logs_dict['Date_threshold_100']['Total time online (min)'], logs_dict['Date_threshold_100']['Average session duration (min)'] = logs_dict['Date_threshold_100']['Total time online (min)'].dt.total_seconds() / 60, logs_dict['Date_threshold_100']['Average session duration (min)'].dt.total_seconds() / 60
logs_dict['Date_threshold_100']['Largest period of inactivity (h)'] = logs_dict['Date_threshold_100']['Largest period of inactivity (h)'].dt.total_seconds() // 60 / 60
#then slice the remaining thresholds accordingly
for i in tqdm(list(logs_dict.keys())[:-1]):
    logs_dict[i] = logs_dict[i].iloc[rows_to_keep, :].reset_index(drop = True)
    #we will need to keep some columns this way - students that made no action prior to the threshold
    logs_dict[i][columns_copy] = deepcopy(logs_dict['Date_threshold_100'][columns_copy])
    
    #Convert timedelta format to numbered format - minutes
    logs_dict[i]['Total time online (min)'], logs_dict[i]['Average session duration (min)'] = logs_dict[i]['Total time online (min)'].dt.total_seconds() / 60, logs_dict[i]['Average session duration (min)'].dt.total_seconds() / 60
    logs_dict[i]['Largest period of inactivity (h)'] = logs_dict[i]['Largest period of inactivity (h)'].dt.total_seconds() // 60 / 60

### Now, we calculate the remaining features

Specifically, these are features related to the time each session starts: the percentage of the course duration that had elapsed at the start of each of the first 10 sessions (logins).
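
As a worked example of this normalisation, the sketch below converts two session start timestamps into a percentage of the course duration. The dates are invented and the `toy` frame is hypothetical; only the column names `Start Date`, `End Date` and `Start of Session 1 (%)` match those used in this notebook.

```python
import pandas as pd

# Hypothetical course lasting 100 days, with two session start timestamps
toy = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-01-01', '2021-01-01']),
    'End Date': pd.to_datetime(['2021-04-11', '2021-04-11']),
    'Start of Session 1 (%)': pd.to_datetime(['2021-01-11 09:30',
                                              '2021-02-20 18:00']),
})

# Both differences are truncated to whole days by .dt.days
course_days = (toy['End Date'] - toy['Start Date']).dt.days
elapsed_days = (toy['Start of Session 1 (%)'] - toy['Start Date']).dt.days

# Whole days elapsed since the course start, as a share of its length
toy['Start of Session 1 (%)'] = elapsed_days / course_days * 100
print(toy['Start of Session 1 (%)'].tolist())
```

Because `.dt.days` discards the time of day, two sessions on the same calendar day receive the same percentage.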

In [None]:
#create new features related to the different sessions start time relative to the start of the course
for i in tqdm(logs_dict):
    
    #columns mentioning start of session
    for k in start_of_sessions.columns[1:]:
        logs_dict[i][k] =  ((logs_dict[i][k] - logs_dict[i]['Start Date']).dt.days / (logs_dict[i]['End Date'] - logs_dict[i]['Start Date']).dt.days) * 100
        
    logs_dict[i].drop(['Start Date', 'End Date', f'{i}', 'Week before start'], axis = 1, inplace = True)

In [None]:
logs_dict['Date_threshold_10'].describe().T

#### Almost Done.

We will finish the Feature Extraction Stage momentarily. Before we do, we need to save all dfs in an easily accessible Excel File.

In [None]:
writer = pd.ExcelWriter('../Data/Modeling Stage/R_gonz_Non_temporal_Datasets.xlsx', engine='xlsxwriter')

#now loop through and put each on a specific sheet
for sheet, frame in  logs_dict.items(): 
    frame.to_excel(writer, sheet_name = sheet)

#critical last step - closing the writer saves the file to disk
writer.close()

#also saving additional info on class list
class_list.to_csv('../Data/Modeling Stage/R_Gonz_updated_classlist.csv')