## Notebook 2.2. Understanding and Preprocessing of Moodle Logs

This is, in effect, the first notebook that is part of the thesis work proper. In it, we take the original student log file and perform the manipulations needed to obtain a dataset with the potential to be useful.

We will use this notebook to filter the Moodle logs to only include the courses of our interest and estimate course duration.

#### 1. A small overview of the logs and each column

The logs record interactions with the Moodle LMS:

    - Each interaction with the LMS is recorded sequentially:
        When is the action performed,
        What is the nature of the interaction,
        Where is the actor when the action is performed,
        Who performed the interaction,
        In the context of which course page,
        What is the specific link,
                
    - Each user is uniquely identified by the userID,
    - Each course is uniquely identified by the courseID,
    - Each specific interaction is recorded -> action performed and clicked url, 
    - Each click is timestamped,
    - The actor's IP is recorded,

A brief description of each column follows:

##### component
An identifier of the component,

##### TStamp	
A timestamp of the event,

##### userid
Unique numerical identifier of user -> be it student, faculty or other,

##### ip
IP address used by the user when interacting with the LMS,

##### course
Unique numerical identifier of a course,

##### objecttable
meaning unclear at the moment - to check with other Moodle Sources,

##### action
categorization of nature of the interaction

##### target
category of the page the student is accessing,

##### cd_discip
The identifier of the course in the other institutional software


#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in the csv files that were exported in the previous notebook. 

To minimize unnecessary steps, as we import these csv files we will immediately apply the following to each dataset:
1. The first unnamed column is removed,
2. All columns that are entirely made of missing values are removed - we have detected some,
3. All numerical columns that are immediately recognized as categorical (or likely to be categorical) are declared as categoricals right away - this does not rule out converting other features to objects upon further assessment,
4. All features that display no null values and have a single value are promptly removed as well,
5. No preprocessing of time-related features is performed at this stage, because the features related to time may require further assessment.
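The removal rules in the list above (items 1, 2 and 4) can be sketched as a small helper. This is only an illustration on made-up data: the function name `drop_uninformative_columns` and the toy frame are assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def drop_uninformative_columns(df):
    """Drop the unnamed index column, all-missing columns and constant columns."""
    # rule 1: drop the leftover index column written by a previous export, if present
    df = df.drop(columns=[c for c in df.columns if str(c).startswith('Unnamed')])
    # rule 2: drop columns made entirely of missing values
    df = df.dropna(how='all', axis=1)
    # rule 4: drop columns with no nulls and a single repeated value
    constant = [c for c in df.columns
                if df[c].notna().all() and df[c].nunique() == 1]
    return df.drop(columns=constant)

# hypothetical toy frame
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'userid': [10.0, 11.0, 12.0],
    'empty': [np.nan, np.nan, np.nan],
    'constant': ['x', 'x', 'x'],
})
cleaned = drop_uninformative_columns(toy)
print(list(cleaned.columns))
```

Only `userid` survives in this toy case, since every other column falls under one of the three rules.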

In [1]:
#import libs
import pandas as pd
import numpy as np
from pandas.tseries.offsets import *
import re
from copy import copy, deepcopy

#viz related tools
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.colors import LogNorm, Normalize
from matplotlib.ticker import MaxNLocator
import matplotlib as mpl
from matplotlib import cm

import seaborn as sns
from tqdm.notebook import tqdm, trange
tqdm.pandas(desc="Progress")

sns.set()
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
#additionally, we will also preemptively define some global variables that may come in handy

#colors for visualizations
nova_ims_colors = ['#BFD72F', '#5C666C']

#standard color for student aggregates
student_color = '#474838'

#standard color for course aggregates
course_color = '#1B3D2F'

#standard continuous colormap
standard_cmap = 'viridis_r'

In [None]:
#loading student log data 
student_logs = pd.concat(pd.read_excel('../Data/Nova_IMS_logs_Moodle.xlsx', sheet_name = None,
                           dtype = {
                                   'userid': float,
                                   'courseid': object,
                           },
                           parse_dates = ['TStamp'], #pd.datetime is not a valid dtype, so timestamps are parsed here
                           )).drop(['eventname', 'CourseShortname', 'startdate', 'enddate'], axis = 1).dropna(how = 'all', axis = 1) #logs

#other tables with support information
support_table = pd.read_csv('../Data/Nova_IMS_support_table.csv',
                             dtype = {
                                 'cd_curso' : object,
                                 'courseid' : float,
                                 'userid' : float,
                                 'assign_id': object,
                             }, parse_dates = ['startdate', 'end_date']).drop('Unnamed: 0', axis = 1)

#after checking, we note that time and stime refer to the same date and differ by 1 hour; hence, we will only keep the time column
#additionally, we will make the immediate conversion of time
student_logs = student_logs.rename(columns = {
                    'TStamp': 'time', #readjusting names to match other information I already have
                    'courseid': 'course', #moodle courseid
                    'cd_discip' : 'courseid', #netpa course id
                    }).reset_index(drop = True).sort_values(by = 'time')

student_logs['userid'] = student_logs['userid'].astype(object)
support_table['courseid'] = support_table['courseid'].astype(object)
support_table['userid'] = support_table['userid'].astype(object)
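The loading cell relies on `pd.read_excel(..., sheet_name = None)` returning a dictionary with one DataFrame per sheet, which `pd.concat` then stacks. A minimal sketch of that behaviour, using a hand-built dictionary as a stand-in for the workbook (sheet names and values are made up):

```python
import pandas as pd

# stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    'Sheet1': pd.DataFrame({'userid': [1.0, 2.0],
                            'TStamp': pd.to_datetime(['2020-09-02', '2020-09-01'])}),
    'Sheet2': pd.DataFrame({'userid': [3.0],
                            'TStamp': pd.to_datetime(['2020-08-31'])}),
}

# concatenating a dict uses its keys as an outer index level
stacked = pd.concat(sheets)

# dropping that index and sorting yields a single chronological log
logs = (stacked.reset_index(drop=True)
               .rename(columns={'TStamp': 'time'})
               .sort_values(by='time'))
print(logs)
```

Because the sheet name ends up in the index, `reset_index(drop = True)` discards it; if the sheet of origin matters, it would have to be kept as a column instead.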

### We start by taking a preliminary look at the logs

In [None]:
student_logs.info()

In [None]:
student_logs.describe(include ='all', datetime_is_numeric = True).T

In [None]:
student_logs.head()

I am unable to convert courseid to a single numeric id, which hints that some of its entries combine different courses. We've identified 2 instances:

1. 100012-100013
2. 200032-400007

In the support table, all 4 courses are represented. We can get the students attending each individual course and make the proper assignment.

In [None]:
#use this cell to write any additional piece of code that may be required

### And follow-up by looking at the support table

In [None]:
support_table.info()

In [None]:
support_table.describe(include ='all', datetime_is_numeric = True).T

In [None]:
support_table.head()

Correcting instances where the logs recorded two courseids in a single entry

In [None]:
#getting the list of students enrolled in each course
students_course_1 = support_table[support_table['courseid'] == 100012.0]['userid'].unique()
students_course_2 = support_table[support_table['courseid'] == 100013.0]['userid'].unique()
students_course_3 = support_table[support_table['courseid'] == 200032.0]['userid'].unique()
students_course_4 = support_table[support_table['courseid'] == 400007.0]['userid'].unique()

In [None]:
#assigning each combined entry to a single courseid, based on the student's enrolment
student_logs['courseid'] = np.where(student_logs['courseid'] == '100012-100013',
                                np.where(student_logs['userid'].isin(students_course_1),
                                   100012.0, #course 1
                                   100013.0), #course 2,    
                                np.where(student_logs['courseid'] == '200032-400007',
                                  np.where(student_logs['userid'].isin(students_course_3),
                                   200032.0, #course 3
                                   400007.0), #course 4,
                                student_logs['courseid'] #remain the same in all others      
                                           ))

#converting to float to get the .0 and back to object again
student_logs['courseid'] = student_logs['courseid'].astype(float)
student_logs['courseid'] = student_logs['courseid'].astype(object)

del students_course_1, students_course_2, students_course_3, students_course_4
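As an alternative reading of the nested `np.where` above, the same assignment can be sketched as a lookup against the support table. The helper `resolve_combined` and the toy frames are hypothetical. Note one deliberate difference: a student found in neither candidate course gets a missing value here, instead of defaulting to the second course.

```python
import pandas as pd

# hypothetical toy data
toy_logs = pd.DataFrame({
    'userid': [1, 2, 3, 4],
    'courseid': ['100012-100013', '100012-100013', '200032-400007', '200001'],
})
toy_support = pd.DataFrame({
    'userid': [1, 2, 3],
    'courseid': [100012.0, 100013.0, 400007.0],
})

def resolve_combined(logs, support, combined, candidates):
    """Replace a combined id with the candidate course the user is enrolled in."""
    enrolment = (support[support['courseid'].isin(candidates)]
                 .drop_duplicates('userid')
                 .set_index('userid')['courseid'])
    mask = logs['courseid'] == combined
    out = logs.copy()
    out['courseid'] = out['courseid'].astype(object)  # allow mixed str/float values
    out.loc[mask, 'courseid'] = out.loc[mask, 'userid'].map(enrolment)
    return out

resolved = resolve_combined(toy_logs, toy_support, '100012-100013', [100012.0, 100013.0])
resolved = resolve_combined(resolved, toy_support, '200032-400007', [200032.0, 400007.0])
resolved['courseid'] = resolved['courseid'].astype(float)
print(resolved)
```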

In [None]:
#use this cell to write any additional piece of code that may be required

### Goal 1: 

One of the first things to do is to consider the set of students and courses we intend to use. We have, from our support table, a list of the courses and students that we are interested in.

Unlike in the situation of R. Gonz, we have to account for semesters: different courses that share the same internal reference in Netpa have different course reference codes on Moodle.

We need to start by making sure that we have a reliable way to synchronize both databases, so as to avoid joining together students attending different versions of a course.

A first, preliminary approach is to retain only logs from courses for which we have records. We will then pair up the logs and see how they match up to programid-semester-courseid. We would expect a reasonable match between both.

In [None]:
#We start by filtering by all courses that are in our support table
course_array = support_table['courseid'].unique()

#and by all students that are in our support table
students = support_table['userid'].unique()

#then, we keep logs of the courses of interest   
student_logs = student_logs[student_logs['courseid'].isin(course_array)].sort_values(by = 'time')

#and the students
student_logs = student_logs[student_logs['userid'].isin(students)].reset_index(drop = True)

#and get the complete list of students interacting with the system - graded or not
student_courses = student_logs.filter(['courseid', 'course', 'userid']).drop_duplicates().reset_index(drop = True)

#take a look at the sliced dataset
student_logs.describe(include ='all', datetime_is_numeric = True).T

From this filtering process, we get **4 400 020 recorded interactions**, performed by **2 140** unique students in the context of **231 curricular units**.
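Figures like the ones quoted above can be recomputed directly from the filtered frame. A minimal sketch on made-up data (the toy frames and the `summary` dictionary are illustrative only):

```python
import pandas as pd

# hypothetical toy data
toy_logs = pd.DataFrame({
    'userid':   [1, 1, 2, 3, 4],
    'courseid': [10.0, 20.0, 10.0, 30.0, 10.0],
})
toy_support = pd.DataFrame({'userid': [1, 2, 3],
                            'courseid': [10.0, 10.0, 20.0]})

# keep only logs whose course and student both appear in the support table
keep = (toy_logs['courseid'].isin(toy_support['courseid'].unique())
        & toy_logs['userid'].isin(toy_support['userid'].unique()))
filtered = toy_logs[keep].reset_index(drop=True)

summary = {
    'interactions': len(filtered),
    'students': filtered['userid'].nunique(),
    'courses': filtered['courseid'].nunique(),
}
print(summary)
```

Note that the two `isin` checks are independent: a log row is kept when its course and its student each appear somewhere in the support table, not necessarily together.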

We can remove courses such as Research Methodologies, Thesis and the doctoral course Experimental Design. It also seems that some curricular units appear more than once simply because distinct classes of the same unit were registered separately.

We will treat different courses differently. The following courses will be removed outright:
1. Research Methodologies - 200163.0,
2. Experimental Design - 200086.0, 
3. Thesis - 200131.0
4. Dissertação - 200040.0
5. Thesis follow-up - 200050.0
6. Thesis Seminars - 200263.0
7. Methodology of Legal Research - 200250.0
8. Research Seminar - 300005.0

In [None]:
#with pd.option_context('display.max_rows', None,):
#    print(student_logs[['courseid', 'course', 'CourseFullname']].value_counts())

In [None]:
#getting a list of courses to eliminate
courseid_to_eliminate = [200086.0, 
                         200163.0,
                         200131.0,
                         200040.0,
                         200150.0, 
                         200263.0,
                         200250.0,
                         300005.0,
                        ]

#adapt student_logs and support_table to match courses to eliminate
student_logs = student_logs[~student_logs['courseid'].isin(courseid_to_eliminate)]
support_table = support_table[~support_table['courseid'].isin(courseid_to_eliminate)]
student_courses = student_courses[~student_courses['courseid'].isin(courseid_to_eliminate)]

At this point, we have to deal, to the best of our ability, with mismatches between courseid and course: instances where a specific courseid refers to more than one course.

These will be relevant if and when a student attends multiple courses within a courseid, because in these instances we will need an additional identifier to establish a verifiable association between course (the finer-grained resolution) and courseid, the Netpa reference we have.

We can start by counting the number of courses each courseid-student pair is registered to. We have no option other than to verify each courseid individually and deal with it in the most appropriate manner.

In [None]:
#we create a pivot_table
student_courses_piv = pd.pivot_table(student_courses, index = ['courseid', 'userid'], values = 'course',
                                aggfunc = 'count')

#and only keep courseid-student pairs for whom there is more than 1 occurrence of course
student_courses_piv = student_courses_piv[student_courses_piv['course'] > 1]
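The same count can be sketched with a `groupby` and `nunique`, which may read more directly than the pivot table. The toy frame and Moodle ids below are made up:

```python
import pandas as pd

# hypothetical toy data: one row per (courseid, course, userid) combination
toy_courses = pd.DataFrame({
    'courseid': [100008.0, 100008.0, 200013.0, 200013.0],
    'userid':   [1, 1, 2, 3],
    'course':   [501, 642, 700, 700],  # Moodle ids (made up)
})

# number of distinct Moodle courses behind each courseid-student pair
per_pair = toy_courses.groupby(['courseid', 'userid'])['course'].nunique()

# pairs mapped to more than one Moodle course need manual verification
ambiguous = per_pair[per_pair > 1]
print(ambiguous)
```

On a frame that was already de-duplicated on these three columns, `nunique` and `count` give the same result.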

In [None]:
#with pd.option_context('display.max_rows', None,): #uncomment to see 
#    display(student_courses_piv)

**The following Courseids have the same student attending in different semesters:**
1. **Course 100008.0**,
2. **Course 100010.0**
3. **Course 200070.0**
4. **Course 200014.0**

Each version is easily mergeable with Netpa data via a semester-courseid pairing.

**The following Courseids have the same student displaying activity in versions of the same NetPa course within a semester:**

1. **Course 200197.0** - 1 student in these conditions,
2. **Course 200195.0** - Multiple students
3. **Course 200193.0** - 1 student, 
4. **Course 200166.0** - a couple of students
5. **Course 200165.0** - 1 student, 
6. **Course 200146.0** - different classes, same student,
7. **Course 200049.0** - 1 student
8. **Course 200013.0** - 5 students, 
9. **Course 200012.0** - 1 student

In this instance, we can treat the unique courseid pairing as sufficient for identification of the course. It seems that the different courses result from registration in different classes of the same version of the course.

**The following courses need to be verified more thoroughly:**
1. **Course 200194.0** -> different versions occur (T3 and T4), with different classes being registered in T4
2. **Course 200170.0** -> no id on the semester - will need to cross with support table
3. **Course 200167.0** -> no id on semester, will need to check further -> probably different programs are in the mix -> 1488 is S2

### Section for additional verification using Support Table

This small section will assist us in the decision on how to deal with 3 courses.
Courses to verify:

1. **Course 200194.0** -> different versions occur (T3 and T4), with different classes being registered in T4
2. **Course 200170.0** -> no id on the semester - will need to cross with support table
3. **Course 200167.0** -> no id on semester, will need to check further -> probably different programs are in the mix -> 1488 is S2

In [None]:
#courses
to_verify = [
            200194.0,
            200170.0, 
            200167.0
            ]

#filtering support table 
verification = support_table[support_table['courseid'].isin(to_verify)]
verification[['cd_curso', 'nm_curso_pt', 'courseid', 'semestre', 'ds_discip_pt']].drop_duplicates(
    subset = ['cd_curso', 'courseid', 'semestre']).sort_values(by = 'nm_curso_pt')

We see, from the data in the support table, that there are multiple instances of different programs sharing similar curricular units, with each program having its own version of the unit.

Therefore, we will be required to perform the merger at the finest level of resolution we can: Course ID, Semester and UserID.

Additionally, we see that most of the courses have the semester indication in their name. We can use this knowledge to extract the semester and store it in a column.

Then, we will need to take care of duplicates that may surface:

In [None]:
#we get all nonduplicate rows 
synch_df = student_logs.filter(['courseid', 'userid', 'course', 'CourseFullname']).drop_duplicates(subset = ['course', 'courseid', 'userid'],
                                                                                        keep = 'first') #will allow us to understand when the first student interaction occurs

#extracts an S or T followed by a digit - not perfect, but workable
synch_df['semester'] = synch_df['CourseFullname'].str.extract(pat = r'([ST]\d)') #matches a capital S or T followed by a digit (raw string avoids an invalid escape)
synch_df.describe(include = 'all', datetime_is_numeric = True)
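To see why the pattern is "not perfect", it helps to run it on a few names. The course names below are invented, and the stricter pattern is only a possible refinement that assumes terms are always written as a standalone `S1`-`S4` or `T1`-`T4` token:

```python
import pandas as pd

# made-up course names
names = pd.Series([
    'Data Mining S1 2020/21',
    'Statistics T3',
    'Programming',
    'ITS2 Networks T1',
])

# pattern used in the notebook: first capital S or T followed by a digit
loose = names.str.extract(r'([ST]\d)', expand=False)

# assumed refinement: require word boundaries around the term code
strict = names.str.extract(r'\b([ST][1-4])\b', expand=False)

print(pd.DataFrame({'name': names, 'loose': loose, 'strict': strict}))
```

In the last name the loose pattern picks up `S2` from inside `ITS2`, while the bounded pattern returns `T1`. Names without any term code yield a missing value under both patterns.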

In [None]:
#now, we get to perform an inner merge - we are not expecting an increase in rows
synch_df = pd.merge(synch_df, support_table.filter(['cd_curso','semestre', 'courseid', 'nm_curso_pt', 'ds_discip_pt', 'userid']).drop_duplicates(), on = ['courseid', 'userid'])

#previous step will definitely generate immediate duplicates -> multiple courses for same course id
synch_df

Now that we have correctly managed to perform the merger, we must take into account the fact that we were not able to merge semester-wise.

We will find semestral mismatches (i.e. Semester and Semestre have different values) of 2 kinds:

1. Courses whose Moodle name has no indication of semester -> making the column derived from Moodle a NaN,
    
2. Courses that have the same students attending in the first semester and second semester version of the course,
    - This occurs on courseids 100008.0, 100010.0 and 200070.0 (course 200014.0 did not come through the filtering process)

In [None]:
#this cell allows us to verify which courses (according to Moodle) are registered to a semester differently to NetPA
semester_mismatch =  synch_df[synch_df['semester'] != synch_df['semestre']]['course'].unique()
synch_df[synch_df['course'].isin(semester_mismatch)].drop_duplicates(['course', 'semester', 'semestre']).sort_values('courseid')

In [None]:
#we can use np.where to make the necessary adjustments
synch_df['semester'] = np.where(synch_df['semester'].isna(),
                                synch_df['semestre'],  #fill nas with the netpa semester reference
                                np.where(synch_df['semester'] != synch_df['semestre'],  #in other instances of difference
                                         'delete this row', #we understand the duplication to be a by product of the merger
                                         synch_df['semester'])
                               )
#this way, we ensure only the correct semester-courseid pairings is kept                               
synch_df = synch_df[~(synch_df['semester'] == 'delete this row')].reset_index(drop = True)
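The sentinel-based adjustment above can also be expressed without a placeholder string, by filling the gaps first and then keeping only the rows where both sources agree. A sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'course':   [501, 642, 642, 700],
    'semester': ['S1', 'S2', 'S2', np.nan],  # parsed from the Moodle name
    'semestre': ['S1', 'S1', 'S2', 'T3'],    # NetPA reference
})

# fill missing Moodle semesters with the NetPA value
filled = toy['semester'].fillna(toy['semestre'])

# keep only rows where the (filled) Moodle semester matches NetPA
reconciled = (toy.assign(semester=filled)[filled == toy['semestre']]
                 .reset_index(drop=True))
print(reconciled)
```

The second row (Moodle `S2` against NetPA `S1`) is dropped as a by-product of the merge, which mirrors what the `'delete this row'` marker achieves.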

We expect all of this filtering and cleaning to allow us, except in instances where different classes of the same course are registered differently on Moodle, to draw a 1-to-1 correspondence between Moodle's course-userid pairing and Netpa's courseid-semester-userid grouping.

**Course-userid pairs generate no duplicates**, but **Programid-Semester-CourseID-userid generate 222 duplicate rows**.

We'll need to check whether any duplicate user entries refer to:
1. same curricular unit and different class,
2. Different course entirely with the same courseid. 

**Uncomment next cells to verify duplicates.**

In [None]:
# with pd.option_context('display.max_rows', None,): #uncomment to see 
#     display(synch_df[synch_df.duplicated(subset = ['cd_curso', 'courseid', 'semester', 'userid'], keep = False)].sort_values('courseid'))

In [None]:
#synch_df[synch_df.duplicated(subset = ['cd_curso', 'courseid', 'semester', 'userid'], keep = False)]['CourseFullname'].value_counts()

In [None]:
#course-user pairs generate no duplicates
#with pd.option_context('display.max_rows', None,): #uncomment to see 
#     display(synch_df[synch_df.duplicated(subset = ['course', 'userid'], keep = False)].sort_values('courseid'))
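The commented checks above rely on `duplicated(..., keep = False)`, which flags every member of a duplicated group rather than only the repeats. A small sketch on invented rows, where one student is registered in two Moodle pages of the same curricular unit:

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'cd_curso': ['A', 'A', 'B'],
    'courseid': [200195.0, 200195.0, 200195.0],
    'semester': ['S1', 'S1', 'S1'],
    'userid':   [1, 1, 2],
    'course':   [810, 811, 810],  # two Moodle pages (e.g. TP1 vs TP2)
})

sis_key = ['cd_curso', 'courseid', 'semester', 'userid']

# duplicates on the SIS key: both rows of student 1 are flagged
sis_dupes = toy[toy.duplicated(subset=sis_key, keep=False)]

# duplicates on the Moodle key: none in this toy case
moodle_dupes = toy[toy.duplicated(subset=['course', 'userid'], keep=False)]

print(len(sis_dupes), len(moodle_dupes))
```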

#### Going for mergers

These results suggest that duplicate entries using programid-semester-courseid-userid pairing result from the fact that, in the moment of registration on Moodle, different classes (TP1 vs TP2 e.g.) of the same curricular unit were registered differently. 

**Now, we are confident in performing an inner merge between our current synchronization of Moodle and SIS data.**

To do this, we will go back to the logs and map each course-userid pairing to the respective courseid-programid-semester key. We'll do that by performing an inner merge between synch_df and the logs.

In [None]:
#filter
synch_df = synch_df.filter(['course', 'userid', 'courseid', 'cd_curso', 'semestre', 'nm_curso_pt', 'ds_discip_pt'])

#merge
student_logs = pd.merge(student_logs.drop('courseid', axis = 1), synch_df, on = ['course', 'userid'])
student_logs.drop('course', axis = 1, inplace = True)

#and update complete list of students interacting with the system - graded or not
student_courses = student_logs.filter(['cd_curso', 'courseid', 'semestre', 'userid']).drop_duplicates().reset_index(drop = True)

#likewise, we can now update the support table to only contain students that are present in the logs
support_table = pd.merge(support_table, student_courses, on= ['cd_curso', 'courseid', 'semestre', 'userid'])

#clearing space
del synch_df

#check our updated dataset
student_logs.describe(include = 'all', datetime_is_numeric = True)
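Since the merge is expected to attach exactly one SIS key to each log row, one possible safeguard is the `validate` argument of `pd.merge`, which raises if the right-hand keys are not unique. The frames below are made up and the check is a suggestion, not something the notebook currently does:

```python
import pandas as pd

# hypothetical toy data
logs = pd.DataFrame({'course': [501, 501, 642],
                     'userid': [1, 1, 2],
                     'action': ['viewed', 'submitted', 'viewed']})
mapping = pd.DataFrame({'course': [501, 642],
                        'userid': [1, 2],
                        'courseid': [100008.0, 100010.0],
                        'semestre': ['S1', 'S2']})

# many log rows per (course, userid), exactly one mapping row per key
merged = pd.merge(logs, mapping, on=['course', 'userid'],
                  how='inner', validate='many_to_one')

# a mapping with a repeated key makes the same call fail loudly
bad_mapping = pd.concat([mapping, mapping.iloc[[0]]])
try:
    pd.merge(logs, bad_mapping, on=['course', 'userid'], validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```

With the check in place, an unexpected duplicate in the mapping stops the notebook instead of silently multiplying log rows.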

#### Small visualization: Weekly clicks per course
We know that conditions vary wildly from course to course.
To understand the data more thoroughly, we can look at how the weekly clicks of each course evolve over time.

In [None]:
#first, we sort the courses by the start date and map each program-semester-course key to its start date
sorting_hat = support_table[['cd_curso', 'semestre', 'courseid', 'startdate']].drop_duplicates().sort_values(by = 'startdate').reset_index(drop = True)
sorting_hat = sorting_hat.set_index(['cd_curso', 'semestre', 'courseid']).to_dict()['startdate'] 

#second, we map each program-semester-course key to its end date
ending_hat = support_table[['cd_curso', 'semestre', 'courseid', 'end_date']].drop_duplicates().reset_index(drop = True)
ending_hat = ending_hat.set_index(['cd_curso', 'semestre', 'courseid']).to_dict()['end_date'] 

#Then, when it comes to logs, we aggregate by week
data_grouper = student_logs.groupby([pd.Grouper(key='time', freq='W'), 'cd_curso', 'semestre', 'courseid']).agg({
                                                                             'action': 'count',
                                                                             }).reset_index().sort_values('time')


#Weekly Interactions overall
grouped_data = deepcopy(data_grouper)

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T
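The weekly aggregation used above can be seen in isolation on a few invented log rows. With `freq='W'`, pandas bins by weeks ending on Sunday and labels each bin with that Sunday:

```python
import pandas as pd

# hypothetical toy data
toy_logs = pd.DataFrame({
    'time': pd.to_datetime(['2020-09-14', '2020-09-16', '2020-09-22', '2020-09-15']),
    'courseid': [100008.0, 100008.0, 100008.0, 200013.0],
    'action': ['viewed', 'viewed', 'submitted', 'viewed'],
})

# count interactions per course and per week
weekly = (toy_logs
          .groupby([pd.Grouper(key='time', freq='W'), 'courseid'])['action']
          .count()
          .reset_index())

# one row per course, one column per week; weeks without activity stay missing
heat = weekly.pivot_table(index='courseid', columns='time',
                          values='action', aggfunc='sum')
print(heat)
```

Course 200013.0 has no clicks in the second week, so that cell is missing rather than zero, which is what lets a logarithmic colour scale skip it in the heatmap.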

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for all courses (log colour scale)
heat1 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat1.get_figure()
fig.savefig('../Images/NovaIMS_exploratory_course_weekly_clicks_heatmap.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat1

At first glance, the interaction patterns seem to be consistent with the Semester duration. We will need to take a closer look at each Semester type separately.

**Starting with Semester 1**

In [None]:
grouped_data = deepcopy(data_grouper[data_grouper['semestre'] == 'S1'])

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for S1 courses
heat2 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat2.get_figure()
fig.savefig('../Images/NovaIMS_S1_weekly_interactions.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat2
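The pivot-and-sort steps are the same for every term, so they could be wrapped in a single helper. The function `weekly_pivot` below is a hypothetical refactoring sketch on made-up data; it keeps the column names used in this notebook but omits the integer conversion of the course id.

```python
import pandas as pd

def weekly_pivot(data_grouper, start_dates, term=None):
    """Pivot weekly counts into a course-by-week table sorted by start date."""
    keys = ['cd_curso', 'semestre', 'courseid']
    data = data_grouper if term is None else data_grouper[data_grouper['semestre'] == term]
    data = data.assign(**{'Date (week)': data['time'].astype(str)})
    pivot = data.pivot_table(index=keys, columns='Date (week)',
                             values='action', aggfunc='sum')
    # order rows by course start date, then discard the helper column
    pivot['sort'] = [start_dates[key] for key in pivot.index]
    pivot = pivot.sort_values('sort').drop(columns='sort')
    return pivot.rename_axis(index=['Program', 'Semester', 'Course'])

# hypothetical toy data
toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-09-20', '2020-09-27', '2020-09-20', '2021-02-21']),
    'cd_curso': ['A', 'A', 'B', 'A'],
    'semestre': ['S1', 'S1', 'S1', 'S2'],
    'courseid': [1.0, 1.0, 2.0, 3.0],
    'action': [5, 3, 7, 2],
})
start_dates = {
    ('A', 'S1', 1.0): pd.Timestamp('2020-09-14'),
    ('B', 'S1', 2.0): pd.Timestamp('2020-09-07'),
    ('A', 'S2', 3.0): pd.Timestamp('2021-02-15'),
}

s1 = weekly_pivot(toy, start_dates, term='S1')
overall = weekly_pivot(toy, start_dates)
print(s1)
```

Each per-term cell would then reduce to one call such as `weekly_pivot(data_grouper, sorting_hat, term='T1')` followed by the plotting code.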

**Trimester 1**

In [None]:
grouped_data = deepcopy(data_grouper[data_grouper['semestre'] == 'T1'])

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for T1 courses
heat3 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat3.get_figure()
fig.savefig('../Images/NovaIMS_T1_weekly_interactions.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat3

**Trimester 2**

In [None]:
grouped_data = deepcopy(data_grouper[data_grouper['semestre'] == 'T2'])

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for T2 courses
heat4 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat4.get_figure()
fig.savefig('../Images/NovaIMS_T2_weekly_interactions.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat4

**Semester 2**

In [None]:
grouped_data = deepcopy(data_grouper[data_grouper['semestre'] == 'S2'])

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for S2 courses
heat5 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat5.get_figure()
fig.savefig('../Images/NovaIMS_S2_weekly_interactions.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat5

**Trimester 3**

In [None]:
grouped_data = deepcopy(data_grouper[data_grouper['semestre'] == 'T3'])

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for T3 courses
heat6 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat6.get_figure()
fig.savefig('../Images/NovaIMS_T3_weekly_interactions.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat6

**Trimester 4**

In [None]:
grouped_data = deepcopy(data_grouper[data_grouper['semestre'] == 'T4'])

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we plot the heatmap of weekly interactions for T4
heat7 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat7.get_figure()
fig.savefig('../Images/NovaIMS_T4_weekly_interactions.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat7, grouped_data

**Additionally**, we can make some observations that may come in handy in the future:

a. How many students are attending each course,

b. How many courses is each student attending,

This knowledge will allow us to make additional filtering decisions to enhance our sample.
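
As a minimal sketch of these two counts, the example below uses a small hypothetical enrollment table (`toy_enrollments` is not part of the original data) with one row per student-course pair. It assumes each pair appears only once; if duplicates were possible, `nunique()` would likely be the safer aggregation.

```python
import pandas as pd

# Hypothetical enrollment table: one row per (student, course) pair.
toy_enrollments = pd.DataFrame({
    'userid':   [1, 1, 2, 3, 3, 3],
    'courseid': [10, 20, 10, 10, 20, 30],
})

# a. Number of users registered in each course.
users_per_course = toy_enrollments.groupby('courseid')['userid'].count()

# b. Number of courses each user is registered in.
courses_per_user = toy_enrollments.groupby('userid')['courseid'].count()

print(users_per_course.to_dict())   # {10: 3, 20: 2, 30: 1}
print(courses_per_user.to_dict())   # {1: 2, 2: 1, 3: 3}
```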

In [None]:
#we can compute the number of students attending each course, and the number of courses each student is attending
class_list = student_courses.groupby(['cd_curso', 'semestre', 'courseid'])['userid'].count().to_frame().rename(columns = {'userid' : 'Users per course'})
enrollment_size = student_courses.groupby('userid')['courseid'].count().to_frame().rename(columns = {'courseid' : 'Courses per User'})

**A. How many students are attending each course?**

In [None]:
#setup
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(16, 10)}, font_scale=2)

#a number of students per course
#student_courses.rename(columns = {'userid' : 'Students per course'}, inplace = True)

#then we plot a histogram with all courses; we are not interested in keeping courses with fewer than 10 students
hist1 = sns.histplot(data=class_list, x='Users per course', kde=True, color= student_color, binwidth = 5,)

fig = hist1.get_figure()
fig.savefig('../Images/Nova_IMS_hist1_students_per_course_bin_5.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, hist1

**There is a very significant number of courses with between 1 and 10 students**

**B. In how many courses is each student enrolled?**

In [None]:
#then we plot a histogram with the number of courses in which each student is enrolled
hist2 = sns.histplot(data= enrollment_size, 
        x='Courses per User', color= course_color, discrete = True, fill = True)

fig = hist2.get_figure()
fig.savefig('../Images/Nova_IMS_hist2_courses_per_student course_bin_1.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, hist2

Depending on the course in question, the number of registered users ranges from 1 to almost 175.
Additionally, we can see that a significant number of students (over 150) attend a single course.

To some extent, most courses have some degree of interaction with Moodle, no matter how small.

We can see that interactions with a course usually start as the starting date approaches, regardless of the semester in question. Accesses usually continue past the end-of-course date, likely to consult some educational content.

Additionally, course interactions seem, at least on a preliminary level, to be consistent with the split between semesters and trimesters.

### So, what's next?

Well, we are going back to the original script. We are going to complete our course list. We'll need to add the start date, end date, course duration, course size and, finally, 

In [None]:
#start date
class_list['Start Date'] = class_list.index.to_series().map(sorting_hat)

#end date, moved forward to the next Friday unless it already falls on a Friday
class_list['End Date'] = class_list.index.to_series().map(ending_hat)
class_list['End Date'] = class_list['End Date'].where( class_list['End Date'] == (( class_list['End Date'] + Week(weekday=4) ) - Week()), class_list['End Date'] + Week(weekday=4))

#additionally, we will look at our estimated course duration
class_list['Course duration days'] = class_list['End Date'] - class_list['Start Date']
class_list
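
The end-date adjustment in the cell above can be illustrated on two hypothetical dates, a Wednesday and a Friday. This is only a sketch: it assumes `Week` is the pandas offset from `pandas.tseries.offsets`, and the dates are made up for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import Week

# Hypothetical end dates: 2021-06-16 is a Wednesday, 2021-06-18 is a Friday.
end_dates = pd.Series(pd.to_datetime(['2021-06-16', '2021-06-18']))

# Adding Week(weekday=4) moves a date to the next Friday. A date that is
# already a Friday would jump a full week, so it is kept unchanged instead.
next_friday = end_dates + Week(weekday=4)
already_friday = end_dates == (next_friday - Week())
rolled = end_dates.where(already_friday, next_friday)

print(rolled.dt.strftime('%Y-%m-%d').tolist())  # ['2021-06-18', '2021-06-18']
```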

**Now, we will finish our work by removing all logs that fall outside the following boundaries.**

We will build 2 cutoff points:

1. Lower bound: one week before the start date of the course,
2. Upper bound: the perceived end of the course.
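
The two cutoff points can be sketched on a hypothetical course list as follows. The table `toy_courses` and the column names `lower_bound` and `upper_bound` are assumptions made for illustration only; the notebook itself stores these values under different names.

```python
import pandas as pd

# Hypothetical course list with estimated start and end dates.
toy_courses = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-02-22', '2021-03-01']),
    'End Date':   pd.to_datetime(['2021-06-18', '2021-04-30']),
}, index=pd.Index([10, 20], name='courseid'))

# 1. Lower bound: one week before the start date of the course.
toy_courses['lower_bound'] = toy_courses['Start Date'] - pd.to_timedelta(1, unit='W')

# 2. Upper bound: the perceived end of the course.
toy_courses['upper_bound'] = toy_courses['End Date']

print(toy_courses['lower_bound'].dt.strftime('%Y-%m-%d').tolist())
# ['2021-02-15', '2021-02-22']
```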

In [None]:
#a new look into class list
class_list['cuttoff_point'] = pd.to_datetime((class_list['Start Date'] - pd.to_timedelta(1, unit = 'W')).dt.date)

#convert to date
class_list['Start Date'] = pd.to_datetime(class_list['Start Date'].dt.date)
class_list['End Date'] = pd.to_datetime(class_list['End Date'].dt.date)
class_list['Course duration days'] = class_list['End Date'] - class_list['Start Date']
class_list['Course duration days'] = (class_list['Course duration days'].dt.total_seconds() // 3600 // 24) + 1

#we will create a new dict with the cutoff point (one week before the start date) of each course
cuttoff_point = class_list.to_dict()['cuttoff_point'] 

#we'll create new columns that will signal whether we are within our course boundaries or not
student_logs.set_index(['cd_curso', 'semestre', 'courseid'], drop = True, inplace = True)

student_logs['start_bound'] = student_logs.index.map(cuttoff_point)
student_logs['end_bound'] = student_logs.index.map(ending_hat)

#convert to date
student_logs['start_bound'] = pd.to_datetime(student_logs['start_bound'].dt.date)
student_logs['end_bound'] = pd.to_datetime(student_logs['end_bound'].dt.date)

**Now, we only keep the rows whose dates fall between the start and end bounds (inclusive).**
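
The filtering rule can be sketched on a few hypothetical log rows. To stay independent of the pandas version, this sketch writes the inclusive range check as two explicit comparisons, which should be equivalent to `Series.between` with both ends included; `toy_logs` and its dates are made up for illustration.

```python
import pandas as pd

# Hypothetical logs for a single course, with per-row bounds already attached.
toy_logs = pd.DataFrame({
    'courseid': [10, 10, 10, 10],
    'time': pd.to_datetime(['2021-02-01', '2021-02-15', '2021-05-03', '2021-07-01']),
})
toy_logs['start_bound'] = pd.Timestamp('2021-02-15')
toy_logs['end_bound'] = pd.Timestamp('2021-06-18')

# Keep rows with start_bound <= time <= end_bound (both ends inclusive).
inside = (toy_logs['time'] >= toy_logs['start_bound']) & (toy_logs['time'] <= toy_logs['end_bound'])
kept = toy_logs[inside].reset_index(drop=True)

print(kept['time'].dt.strftime('%Y-%m-%d').tolist())  # ['2021-02-15', '2021-05-03']
```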

In [None]:
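#inclusive = 'both' replaces the boolean inclusive = True, which is no longer accepted in recent pandas versions
student_logs = student_logs[student_logs['time'].between(student_logs['start_bound'], student_logs['end_bound'], inclusive = 'both')].reset_index()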
student_logs

**After finishing, we will now take a new look at the weekly interactions.**

We are expecting a cleaner view at the weekly interactions performed by students in the context of their courses.
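
As a rough sketch of the weekly aggregation and pivoting used to build the heatmap input, the example below works on a hypothetical `toy_logs` table. It formats the week labels with `strftime` rather than `astype(str)` purely to keep the labels predictable; with `freq='W'`, pandas labels each week by the Sunday on which it ends.

```python
import pandas as pd

# Hypothetical click logs: three interactions in course 10, one in course 20.
toy_logs = pd.DataFrame({
    'courseid': [10, 10, 10, 20],
    'time': pd.to_datetime(['2021-03-01', '2021-03-03', '2021-03-10', '2021-03-02']),
    'action': ['viewed', 'viewed', 'submitted', 'viewed'],
})

# Count interactions per week and course.
weekly = (toy_logs
          .groupby([pd.Grouper(key='time', freq='W'), 'courseid'])['action']
          .count()
          .reset_index())
weekly['Date (week)'] = weekly['time'].dt.strftime('%Y-%m-%d')

# One row per course, one column per week: the shape expected by sns.heatmap.
heatmap_input = weekly.pivot_table(index='courseid', columns='Date (week)',
                                   values='action', aggfunc='sum')

print(heatmap_input.loc[10, '2021-03-07'])  # 2 interactions in the week ending 2021-03-07
```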

In [None]:
#Then, when it comes to logs, we aggregate by week
grouped_data = student_logs.groupby([pd.Grouper(key='time', freq='W'), 'cd_curso', 'semestre', 'courseid']).agg({
                                                                             'action': 'count',
                                                                             }).reset_index().sort_values('time')

#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index =['cd_curso', 'semestre', 'courseid'], 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan)

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data.index.map(sorting_hat)
grouped_data = grouped_data.reset_index().rename(columns = {'courseid': 'Course',
                                                            'cd_curso': 'Program',
                                                            'semestre': 'Semester',
                                                           })

grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index(['Program', 'Semester', 'Course'], drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
#here, we plot the heatmap of the cleaned weekly interactions
heat8 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat8.get_figure()
fig.savefig('../Images/Nova_IMS_cleaned_weekly_clicks.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat8

We finish the notebook by saving the cleaned logs and the list of the courses with which we will be going forward in our analysis. 

A very important factor to take into account is that, as our targets, we will only have access to the student-course pairs that we were able to identify in our targets table, which are the same as the ones present in our support_table.

It is, therefore, wise to perform a last filtering step before going forward.

In [None]:
#save tables 
class_list.to_csv('../Data/Modeling Stage/NovaIMS_class_duration.csv') 

student_logs.drop(['start_bound', 'end_bound'], axis = 1).to_csv('../Data/Modeling Stage/NovaIMS_cleaned_logs.csv')

#### Done

From here on out, we will continue with feature engineering and extraction for modeling purposes in Notebooks 3.