## Notebook 2.2. Understanding and Preprocessing of Moodle Logs

For all intents and purposes, this is the first real notebook of the thesis work. In it, we take the original student log file and perform the manipulations needed to obtain a dataset with the potential to be useful.

We will use this notebook to filter the Moodle logs to only include the courses of our interest and estimate course duration.

#### 1. A small overview of the logs and each column

The logs presented here record interactions with the Moodle LMS:

    - Each interaction with the LMS is recorded sequentially:
        When is the action performed,
        What is the nature of the interaction,
        Where is the actor when the action is performed,
        Who performed the interaction,
        In the context of which course page,
        What is the specific link,
                
    - Each user is uniquely identified by the userID,
    - Each course is uniquely identified by the courseID,
    - Each specific interaction is recorded -> action performed and clicked url, 
    - Each click is timestamped,
    - The actor's IP is recorded,

A brief description of each column follows:

##### component
An identifier of the component,

##### TStamp	
A timestamp of the event,

##### userid
Unique numerical identifier of user -> be it student, faculty or other,

##### ip
IP address used by the user when interacting with the LMS,

##### course
Unique numerical identifier of a course,

##### objecttable
meaning unclear at the moment - to check with other Moodle Sources,

##### action
categorization of nature of the interaction

##### target
category of the page the student is accessing,

##### cd_discip
The identifier of the course in the other institutional software


#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in the csv files that were exported in the previous notebook. 

In order to minimize unnecessary steps, as we import these csv files we will immediately, for each dataset:
1. Remove the first unnamed column,
2. Remove all columns that are entirely made of missing values - we have detected some,
3. Declare as categorical all numerical columns that are immediately recognized as categorical (or likely to be categorical) - other features may still be converted to objects upon further assessment,
4. Remove all features that display no null values and have a single value,
5. Leave time-related features unprocessed at this stage, as they may require further assessment.
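
The cleanup steps listed above can be sketched as a small helper. This is only an illustration under stated assumptions: `basic_cleanup` and the toy frame are hypothetical names that do not appear in the notebook, and step 3 (declaring categoricals) is left out because it depends on column-by-column judgment.

```python
import numpy as np
import pandas as pd


def basic_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying cleanup steps 1, 2 and 4 from the list above."""
    # step 1: drop the unnamed index column left over from a previous export, if present
    df = df.drop(columns=[c for c in df.columns if str(c).startswith('Unnamed')])
    # step 2: drop columns made entirely of missing values
    df = df.dropna(how='all', axis=1)
    # step 4: drop columns that have no nulls and a single distinct value
    constant = [c for c in df.columns if df[c].notna().all() and df[c].nunique() == 1]
    return df.drop(columns=constant)


# toy data, for illustration only
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'userid': [10.0, 11.0, 12.0],
    'empty': [np.nan, np.nan, np.nan],
    'constant': ['x', 'x', 'x'],
})
cleaned = basic_cleanup(toy)
print(cleaned.columns.tolist())
```

Because the helper takes and returns a DataFrame, it could also be chained with `DataFrame.pipe` right after each `read_csv` call.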

In [1]:
#import libs
import pandas as pd
import numpy as np
from pandas.tseries.offsets import *
import re

#viz related tools
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.colors import LogNorm, Normalize
from matplotlib.ticker import MaxNLocator
import matplotlib as mpl
from matplotlib import cm

import seaborn as sns
from tqdm.notebook import tqdm, trange
tqdm.pandas(desc="Progress")

sns.set()
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
#additionally, we will also preemptively define some global variables that may come in handy

#colors for visualizations
nova_ims_colors = ['#BFD72F', '#5C666C']

#standard color for student aggregates
student_color = '#474838'

#standard color for course aggregates
course_color = '#1B3D2F'

#standard continuous colormap
standard_cmap = 'viridis_r'

In [3]:
#loading student log data 
student_logs = pd.concat(pd.read_excel('../Data/Nova_IMS_logs_Moodle.xlsx', sheet_name = None,
                           dtype = {
                                   'userid': float,
                                   'courseid': object,
                           },
                           parse_dates = ['TStamp'], #parse the timestamp column as datetime
                           )).drop(['eventname', 'CourseShortname', 'startdate', 'enddate'], axis = 1).dropna(how = 'all', axis = 1) #logs

#other tables with support information
support_table = pd.read_csv('../Data/Nova_IMS_support_table.csv',
                             dtype = {
                                 'cd_curso' : object,
                                 'courseid' : float,
                                 'userid' : float,
                                 'assign_id': object,
                             }, parse_dates = ['startdate', 'end_date']).drop('Unnamed: 0', axis = 1)

#after checking, we note that time and stime report to the same date and differ in 1 hour, hence, we will only keep the time column
#additionally, we will make the immediate conversion of time
student_logs = student_logs.rename(columns = {
                    'TStamp': 'time', #readjusting names to match other information I already have
                    'courseid': 'course', #moodle courseid
                    'cd_discip' : 'courseid', #netpa course id
                    }).reset_index(drop = True).sort_values(by = 'time')

student_logs['userid'] = student_logs['userid'].astype(object)
support_table['courseid'] = support_table['courseid'].astype(object)
support_table['userid'] = support_table['userid'].astype(object)

### We start by taking a preliminary look at the logs

In [4]:
student_logs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4829394 entries, 117552 to 4773753
Data columns (total 9 columns):
 #   Column          Dtype         
---  ------          -----         
 0   component       object        
 1   action          object        
 2   target          object        
 3   objecttable     object        
 4   userid          object        
 5   course          object        
 6   time            datetime64[ns]
 7   CourseFullname  object        
 8   courseid        object        
dtypes: datetime64[ns](1), object(8)
memory usage: 368.5+ MB


In [5]:
student_logs.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
component,4829394.0,38.0,core,1928779.0,NaT,NaT,NaT,NaT,NaT,NaT
action,4829394.0,43.0,viewed,4439966.0,NaT,NaT,NaT,NaT,NaT,NaT
target,4829394.0,90.0,course,1749544.0,NaT,NaT,NaT,NaT,NaT,NaT
objecttable,2921683.0,64.0,quiz_attempts,962760.0,NaT,NaT,NaT,NaT,NaT,NaT
userid,4829394.0,2631.0,24.0,33174.0,NaT,NaT,NaT,NaT,NaT,NaT
course,4829394.0,340.0,1219.0,105791.0,NaT,NaT,NaT,NaT,NaT,NaT
time,4829394.0,,,,2021-01-31 04:35:53.171790592,2020-08-26 11:51:00,2020-11-12 22:32:00,2021-01-22 20:56:00,2021-04-19 18:18:00,2021-07-30 22:02:00
CourseFullname,4829394.0,341.0,202021 - Marketing Digital e Comércio Eletró...,105791.0,NaT,NaT,NaT,NaT,NaT,NaT
courseid,4816821.0,272.0,200165.0,129264.0,NaT,NaT,NaT,NaT,NaT,NaT


In [6]:
student_logs.head()

Unnamed: 0,component,action,target,objecttable,userid,course,time,CourseFullname,courseid
117552,mod_forum,viewed,course_module,forum,4867.0,1214,2020-08-26 11:51:00,202021 - Inteligência Económica e Competitiv...,400033
201301,mod_forum,viewed,course_module,forum,4867.0,1242,2020-08-26 11:51:00,202021 - Metodologias e Técnicas de Análise ...,400037
201300,core,viewed,course,,4867.0,1242,2020-08-26 11:51:00,202021 - Metodologias e Técnicas de Análise ...,400037
117551,core,viewed,course,,4867.0,1214,2020-08-26 11:51:00,202021 - Inteligência Económica e Competitiv...,400033
282504,core,viewed,course,,4867.0,1284,2020-08-26 11:51:00,202021 - Técnicas Analíticas Estruturadas pa...,400032


courseid cannot be converted directly to a numeric type, which hints that some entries combine different courses. We've identified 2 instances:

1. 100012-100013
2. 200032-400007

In the support table, all 4 courses are represented. We can get the students attending each individual course and make the proper assignment.
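
One way to surface such composite entries is to look for values that are present but fail numeric parsing. The sketch below runs on a toy stand-in for the raw column; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# toy stand-in for the raw courseid column (values are illustrative)
raw_courseid = pd.Series(['400033', '100012-100013', '200032-400007', '400037', None])

# entries that cannot be parsed become NaN instead of raising an error
as_number = pd.to_numeric(raw_courseid, errors='coerce')

# composite values: present in the raw data, yet not parseable as a single number
composite = raw_courseid[raw_courseid.notna() & as_number.isna()]
print(composite.unique().tolist())
```

Genuinely missing values are excluded by the `notna()` mask, so only the malformed entries are listed.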

In [7]:
#use this cell to write any additional piece of code that may be required

### And follow-up by looking at the support table

In [8]:
support_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40450 entries, 0 to 40449
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   cd_curso         40450 non-null  object        
 1   semestre         40450 non-null  object        
 2   courseid         40450 non-null  object        
 3   userid           40450 non-null  object        
 4   assignment_mark  37611 non-null  float64       
 5   assign_id        37611 non-null  object        
 6   nm_curso_pt      40450 non-null  object        
 7   ds_discip_pt     40450 non-null  object        
 8   end_date         40450 non-null  datetime64[ns]
 9   startdate        40450 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(1), object(7)
memory usage: 3.1+ MB


In [9]:
support_table.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
cd_curso,40450.0,28.0,9434,8175.0,,,,,,,
semestre,40450.0,6.0,S1,16882.0,,,,,,,
courseid,40450.0,251.0,200071.0,1465.0,,,,,,,
userid,40450.0,2305.0,3248.0,56.0,,,,,,,
assignment_mark,37611.0,,,,14.511157,0.0,12.7,15.88,17.7,20.0,4.548028
assign_id,37611.0,1781.0,1553,179.0,,,,,,,
nm_curso_pt,40450.0,28.0,Mestrado em Gestão de Informação,8175.0,,,,,,,
ds_discip_pt,40450.0,239.0,Gestão do Conhecimento,1465.0,,,,,,,
end_date,40450.0,,,,2021-03-24 13:34:54.496909568,2020-10-30 00:00:00,2021-01-15 00:00:00,2021-01-22 00:00:00,2021-06-18 00:00:00,2021-06-18 00:00:00,
startdate,40450.0,,,,2020-11-26 10:38:45.715698176,2020-09-07 00:00:00,2020-09-07 00:00:00,2020-11-02 00:00:00,2021-02-08 00:00:00,2021-04-12 00:00:00,


In [10]:
support_table.head()

Unnamed: 0,cd_curso,semestre,courseid,userid,assignment_mark,assign_id,nm_curso_pt,ds_discip_pt,end_date,startdate
0,8259,S1,100001.0,1544.0,,,Licenciatura em Sistemas e Tecnologias de Info...,Álgebra Linear,2021-01-15,2020-09-07
1,8259,S1,100001.0,1556.0,13.9,0.0,Licenciatura em Sistemas e Tecnologias de Info...,Álgebra Linear,2021-01-15,2020-09-07
2,8259,S1,100001.0,1556.0,8.7,1.0,Licenciatura em Sistemas e Tecnologias de Info...,Álgebra Linear,2021-01-15,2020-09-07
3,9155,S1,100001.0,1564.0,16.8,2.0,Licenciatura em Gestão de Informação,Álgebra Linear,2021-01-15,2020-09-07
4,9155,S1,100001.0,1564.0,7.3,3.0,Licenciatura em Gestão de Informação,Álgebra Linear,2021-01-15,2020-09-07


Correcting instances where the logs recorded two course IDs in a single entry:

In [11]:
#getting the list of students attending each course
students_course_1 = support_table[support_table['courseid'] == 100012.0]['userid'].unique()
students_course_2 = support_table[support_table['courseid'] == 100013.0]['userid'].unique()
students_course_3 = support_table[support_table['courseid'] == 200032.0]['userid'].unique()
students_course_4 = support_table[support_table['courseid'] == 400007.0]['userid'].unique()

In [12]:
#converting 
student_logs['courseid'] = np.where(student_logs['courseid'] == '100012-100013',
                                np.where(student_logs['userid'].isin(students_course_1),
                                   100012.0, #course 1
                                   100013.0), #course 2,    
                                np.where(student_logs['courseid'] == '200032-400007',
                                  np.where(student_logs['userid'].isin(students_course_3),
                                   200032.0, #course 3
                                   400007.0), #course 4,
                                student_logs['courseid'] #remain the same in all others      
                                           ))

#converting to float to get the .0 and back to object again
student_logs['courseid'] = student_logs['courseid'].astype(float)
student_logs['courseid'] = student_logs['courseid'].astype(object)

del students_course_1, students_course_2, students_course_3, students_course_4

Additionally, we see that most of the courses have the semester indication in their name. We can use this knowledge to extract the semester and store it in a column.

In [13]:
#extracts an S or T followed by a digit - not perfect, but workable
student_logs['semester'] = student_logs['CourseFullname'].str.extract(pat = r'([ST]\d)') #matches a capital S or T followed by a digit
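
To see why the pattern is workable but not perfect, the sketch below applies it to a few course names. The third name is hypothetical, and the anchored pattern is only an assumption about where the semester tag sits; neither comes from the notebook.

```python
import pandas as pd

names = pd.Series([
    '202021 - Aprendizagem Automática - S1',
    'Programming for Data Science',   # no semester tag in the name
    'AWS S3 Storage Basics - T2',     # hypothetical name with a misleading "S3"
])

# pattern used in the notebook: first capital S or T followed by a digit
loose = names.str.extract(r'([ST]\d)', expand=False)

# hypothetical stricter variant: only accept the tag when it closes the name
anchored = names.str.extract(r'-\s*([ST]\d)\s*$', expand=False)

print(pd.DataFrame({'name': names, 'loose': loose, 'anchored': anchored}))
```

The loose pattern picks up `S3` from the hypothetical third name, while the anchored variant returns `T2`. Names without a tag yield a missing value in both cases.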

In [14]:
#use this cell to write any additional piece of code that may be required

### Goal 1: 

One of the first things to do is to settle on the set of students and courses we intend to use. Our support table provides the list of courses and students that we are interested in.

Unlike in the situation of Riestra González, we have to account for semesters: different courses that share the same internal reference in Netpa have different course reference codes on Moodle.

We need to start by making sure that we have a reliable way to synchronize both databases, so as to avoid joining together students attending different versions of a course.

A first, preliminary approach is to retain only the logs from courses for which we have records. We will then pair the logs with the support table and see how they match up to programid-semester-courseid. We would expect a reasonable match between both.
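
Before filtering, it can help to quantify how well the two sources overlap. The sketch below pairs them on courseid-userid only (a simplification of the programid-semester-courseid matching mentioned above) and uses toy frames with illustrative values.

```python
import pandas as pd

# toy courseid-userid pairs seen in the logs and in the support table
logs_pairs = pd.DataFrame({'courseid': [100001.0, 100001.0, 200165.0],
                           'userid': [1544.0, 9999.0, 5916.0]})
support_pairs = pd.DataFrame({'courseid': [100001.0, 200165.0, 200071.0],
                              'userid': [1544.0, 5916.0, 3248.0]})

# an outer merge with indicator=True labels each pair by where it was found
coverage = logs_pairs.merge(support_pairs, on=['courseid', 'userid'],
                            how='outer', indicator=True)
print(coverage['_merge'].value_counts())
```

Pairs labelled `left_only` have logs but no support record, and pairs labelled `right_only` have a support record but no logged activity.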

In [15]:
#We start by filtering by all courses that are in our support table
course_array = support_table['courseid'].unique()

#and do the same for all students that are in our support table
students = support_table['userid'].unique()

#then, we keep logs of the courses of interest   
student_logs = student_logs[student_logs['courseid'].isin(course_array)].sort_values(by = 'time')

#and the students
student_logs = student_logs[student_logs['userid'].isin(students)].reset_index(drop = True)

#and get the complete list of students interacting with the system - graded or not
student_courses = student_logs.filter(['courseid', 'course', 'userid']).drop_duplicates().reset_index(drop = True)

#take a look at the sliced dataset
student_logs.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
component,4400020.0,27.0,core,1696151.0,NaT,NaT,NaT,NaT,NaT,NaT
action,4400020.0,26.0,viewed,4151426.0,NaT,NaT,NaT,NaT,NaT,NaT
target,4400020.0,60.0,course,1625929.0,NaT,NaT,NaT,NaT,NaT,NaT
objecttable,2659376.0,43.0,quiz_attempts,928527.0,NaT,NaT,NaT,NaT,NaT,NaT
userid,4400020.0,2241.0,6826.0,11601.0,NaT,NaT,NaT,NaT,NaT,NaT
course,4400020.0,288.0,1219.0,101481.0,NaT,NaT,NaT,NaT,NaT,NaT
time,4400020.0,,,,2021-01-31 02:42:29.972207360,2020-08-26 11:51:00,2020-11-12 18:46:00,2021-01-22 16:21:00,2021-04-19 19:02:00,2021-07-30 22:02:00
CourseFullname,4400020.0,290.0,202021 - Marketing Digital e Comércio Eletró...,101481.0,NaT,NaT,NaT,NaT,NaT,NaT
courseid,4400020.0,231.0,200165.0,121009.0,NaT,NaT,NaT,NaT,NaT,NaT
semester,3981573.0,6.0,S1,1835499.0,NaT,NaT,NaT,NaT,NaT,NaT


From this filtering process, we get **4 400 020 recorded interactions**, performed by **2 241** unique students in the context of **231 curricular units**.

We can remove courses such as Research Methodologies, Thesis and the doctoral discipline Experimental Design. It also seems that some curricular units appear as several distinct Moodle courses that are simply different classes of the same unit.

We will treat different courses differently. The following courses will be removed outright:
1. Research Methodologies - 200163.0,
2. Experimental Design - 200086.0, 
3. Thesis - 200131.0
4. Dissertação - 200040.0
5. Thesis follow-up - 200050.0
6. Thesis Seminars - 200263.0
7. Methodology of Legal Research - 200250.0
8. Research Seminar - 300005.0

In [16]:
with pd.option_context('display.max_rows', None,):
    print(student_logs[['courseid', 'course', 'CourseFullname']].value_counts())

courseid  course  CourseFullname                                                                               
200196.0  1219    202021 - Marketing Digital e Comércio Eletrónico - Turma TP1, TP2, TP3 e TPDAYTIME - S1          101481
200165.0  1507    Métodos Descritivos de Data Mining - Turma TP1 e TP2 - S1                                         90041
200179.0  1239    202021 - Aprendizagem Automática - S1                                                             87682
200178.0  1278    202021 - Estatística para a Ciência de Dados - S1                                                 71292
200175.0  1246    202021 - Data Mining - S1                                                                         65977
400082.0  1259    202021 - Digital Analytics - S1                                                                   64544
200211.0  1318    Programming for Data Science                                                                      64176
200187.0  1177    MSI 2020/21 - Ma

In [17]:
#getting a list of courses to eliminate
courseid_to_eliminate = [200086.0, 
                         200163.0,
                         200131.0,
                         200040.0,
                         200150.0, 
                         200263.0,
                         200250.0,
                         300005.0,
                        ]

#adapt student_logs and support_table to match courses to eliminate
student_logs = student_logs[~student_logs['courseid'].isin(courseid_to_eliminate)]
support_table = support_table[~support_table['courseid'].isin(courseid_to_eliminate)]
student_courses = student_courses[~student_courses['courseid'].isin(courseid_to_eliminate)]

At this point, we have to deal, to the best of our ability, with mismatches between courseid and course: instances where a specific courseid refers to more than one course.

These will be relevant if and when we have a student attending multiple courses within a courseid, because in these instances we will need an additional identifier to establish a verifiable association between course (the finer-grained resolution) and courseid (the Netpa reference we have).

We can start by counting the number of courses each courseid-student pair is registered to. We have no option other than to verify each courseid individually and deal with it in the most appropriate manner.

In [18]:
#we create a pivot_table
student_courses_piv = pd.pivot_table(student_courses, index = ['courseid', 'userid'], values = 'course',
                                aggfunc = 'count')

#and only keep courseid-student pairs for whom there is more than 1 occurrence of course
student_courses_piv = student_courses_piv[student_courses_piv['course'] > 1]
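
An equivalent way to obtain the same count is a `groupby` with `nunique`, which does not rely on the pairs having been deduplicated beforehand. A minimal sketch on toy data (the ids are illustrative):

```python
import pandas as pd

# toy courseid / Moodle course / student triples
student_courses_toy = pd.DataFrame({
    'courseid': [200195.0, 200195.0, 200165.0],
    'course': [1506, 1507, 1506],
    'userid': [5916.0, 5916.0, 4118.0],
})

# number of distinct Moodle courses per courseid-student pair
n_courses = student_courses_toy.groupby(['courseid', 'userid'])['course'].nunique()

# pairs that span more than one Moodle course
multi = n_courses[n_courses > 1]
print(multi)
```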

In [19]:
#with pd.option_context('display.max_rows', None,): #uncomment to see 
#    display(student_courses_piv)

**The following Courseids have the same student attending in different semesters:**
1. **Course 100008.0**,
2. **Course 100010.0**
3. **Course 200070.0**
4. **Course 200014.0**

Each version is easily mergeable with Netpa data via a semester-courseid pairing.

**The following Courseids have the same student displaying activity in versions of the same NetPa course within a semester:**

1. **Course 200197.0** - 1 student in these conditions,
2. **Course 200195.0** - Multiple students
3. **Course 200193.0** - 1 student, 
4. **Course 200166.0** - a couple of students
5. **Course 200165.0** - 1 student, 
6. **Course 200146.0** - different classes, same student,
7. **Course 200049.0** - 1 student
8. **Course 200013.0** - 5 students, 
9. **Course 200012.0** - 1 student

In this instance, we can treat the unique courseid pairing as sufficient for identification of the course. It seems that the different courses result from registration in different classes of the same version of the course.

**The following courses need to be verified more thoroughly:**
1. **Course 200194.0** -> different versions occur (T3 and T4), with different classes being registered in T4
2. **Course 200170.0** -> no id on the semester - will need to cross with support table
3. **Course 200167.0** -> no id on semester, will need to check further -> probably different programs are in the mix -> 1488 is S2

### Section for additional verification using Support Table

This small section will assist us in the decision on how to deal with 3 courses.
Courses to verify:

1. **Course 200194.0** -> different versions occur (T3 and T4), with different classes being registered in T4
2. **Course 200170.0** -> no id on the semester - will need to cross with support table
3. **Course 200167.0** -> no id on semester, will need to check further -> probably different programs are in the mix -> 1488 is S2

In [20]:
#courses
to_verify = [
            200194.0,
            200170.0, 
            200167.0
            ]

#filtering support table 
verification = support_table[support_table['courseid'].isin(to_verify)]
verification[['cd_curso', 'nm_curso_pt', 'courseid', 'semestre', 'ds_discip_pt']].drop_duplicates(
    subset = ['cd_curso', 'courseid', 'semestre']).sort_values(by = 'nm_curso_pt')

Unnamed: 0,cd_curso,nm_curso_pt,courseid,semestre,ds_discip_pt
27713,9435,Mestrado em Data-Driven Marketing,200170.0,S2,Consumer Behavior Insights
39304,9435,Mestrado em Data-Driven Marketing,200194.0,T4,Transformação Digital
38904,4281,Mestrado em Estatística e Gestão de Informação,200194.0,T4,Transformação Digital
27695,4281,Mestrado em Estatística e Gestão de Informação,200170.0,S2,Consumer Behavior Insights
27111,4281,Mestrado em Estatística e Gestão de Informação,200167.0,S2,Big Data Analytics
26934,9434,Mestrado em Gestão de Informação,200167.0,S2,Big Data Analytics
27677,9434,Mestrado em Gestão de Informação,200170.0,S2,Consumer Behavior Insights
38907,9434,Mestrado em Gestão de Informação,200194.0,T4,Transformação Digital
26938,7512,Mestrado em Métodos Analíticos Avançados,200167.0,S2,Big Data Analytics
38101,7512,Mestrado em Métodos Analíticos Avançados,200194.0,T3,Transformação Digital


We see, from the data in the support table, that there are multiple instances of different programs sharing a curricular unit, each with its own version of that unit.

Therefore, we will be required to perform the merger at the level of resolution we can: That is using Course ID, Semester and UserID.
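
A merge at that resolution could look like the sketch below. It assumes the semester extracted from the course name (`semester`) is comparable to the support table's `semestre` column, and it uses toy rows with illustrative ids.

```python
import pandas as pd

# toy first-interaction rows from the logs (ids are illustrative)
logs_first = pd.DataFrame({
    'courseid': [200194.0, 200194.0],
    'userid': [4118.0, 4118.0],
    'semester': ['T3', 'T4'],
    'course': [1400, 1401],
})

# toy support-table row: the student is enrolled in the T4 version only
support = pd.DataFrame({
    'courseid': [200194.0],
    'userid': [4118.0],
    'semestre': ['T4'],
    'cd_curso': ['9434'],
})

merged = logs_first.merge(
    support.rename(columns={'semestre': 'semester'}),  # align the key name
    on=['courseid', 'semester', 'userid'],
    how='inner',
    validate='many_to_one',  # raises if the support keys are not unique
)
print(merged)
```

Only the T4 row survives, and `validate` makes pandas raise an error instead of silently multiplying rows if the support table holds repeated keys.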

Then, we will need to take care of duplicates that may surface:

In [21]:
#we get all nonduplicate rows 
synch_df = student_logs.filter(['courseid', 'userid', 'course', 'time', 'CourseFullname', 'semester']).drop_duplicates(subset = ['course', 'courseid', 'semester', 'userid'],
                                                                                        keep = 'first') #will allow us to understand when the first student interaction occurs
synch_df

Unnamed: 0,courseid,userid,course,time,CourseFullname,semester
0,400033.0,4867.0,1214,2020-08-26 11:51:00,202021 - Inteligência Económica e Competitiv...,S1
1,400037.0,4867.0,1242,2020-08-26 11:51:00,202021 - Metodologias e Técnicas de Análise ...,S1
4,400032.0,4867.0,1284,2020-08-26 11:51:00,202021 - Técnicas Analíticas Estruturadas pa...,S1
5,400035.0,4867.0,1266,2020-08-26 11:51:00,202021 - Dinâmicas Regionais de Segurança e ...,S1
8,200165.0,5916.0,1506,2020-08-26 12:01:00,202021 - Métodos Descritivos de Data Mining -...,S1
...,...,...,...,...,...,...
4138451,300033.0,5756.0,1368,2021-06-16 15:09:00,202021 - Teste de teoria com modelos de equaç...,S2
4148433,200013.0,6029.0,1490,2021-06-16 18:31:00,202021 - Business Intelligence II - Turma TPDA...,S2
4253333,100027.0,4705.0,1357,2021-06-22 16:31:00,202021 - Computação II - S2,S2
4399013,200202.0,4889.0,1402,2021-07-25 01:20:00,202021 - Big Data para Marketing - S2,S2


In [22]:
#now, we perform an inner merge - we are not expecting an increase in rows
synch_df = pd.merge(synch_df, support_table.filter(['cd_curso','semestre', 'courseid', 'nm_curso_pt', 'ds_discip_pt', 'userid']).drop_duplicates(), on = ['courseid', 'userid'])

In [23]:
#previous step will definitely generate immediate duplicates -> multiple courses for same course id
synch_df

Unnamed: 0,courseid,userid,course,time,CourseFullname,semester,cd_curso,semestre,nm_curso_pt,ds_discip_pt
0,400033.0,4867.0,1214,2020-08-26 11:51:00,202021 - Inteligência Económica e Competitiv...,S1,4964,S1,Pós-Graduação em Gestão de Informações e Segur...,Inteligência Económica e Competitiva
1,400037.0,4867.0,1242,2020-08-26 11:51:00,202021 - Metodologias e Técnicas de Análise ...,S1,4964,S1,Pós-Graduação em Gestão de Informações e Segur...,Metodologias e Técnicas de Análise e de Prospe...
2,400032.0,4867.0,1284,2020-08-26 11:51:00,202021 - Técnicas Analíticas Estruturadas pa...,S1,4964,S1,Pós-Graduação em Gestão de Informações e Segur...,Técnicas Analíticas Estruturadas para Análise ...
3,400035.0,4867.0,1266,2020-08-26 11:51:00,202021 - Dinâmicas Regionais de Segurança e ...,S1,4964,S1,Pós-Graduação em Gestão de Informações e Segur...,Dinâmicas Regionais de Segurança e Defesa
4,200165.0,5916.0,1506,2020-08-26 12:01:00,202021 - Métodos Descritivos de Data Mining -...,S1,9434,S1,Mestrado em Gestão de Informação,Métodos Descritivos de Data Mining
...,...,...,...,...,...,...,...,...,...,...
14666,100014.0,3288.0,1355,2021-05-22 19:20:00,202021 - Base de Dados II - S2,S2,9155,S2,Licenciatura em Gestão de Informação,Base de Dados II
14667,400106.0,6588.0,1401,2021-05-24 18:32:00,Big Data for Finance,,4974,S2,Pós-Graduação em Data Science for Finance,Big Data for Finance
14668,400106.0,6512.0,1401,2021-05-24 18:45:00,Big Data for Finance,,4974,S2,Pós-Graduação em Data Science for Finance,Big Data for Finance
14669,200167.0,4118.0,1512,2021-06-13 09:36:00,Big Data Analytics (night),,9434,S2,Mestrado em Gestão de Informação,Big Data Analytics


In [24]:
synch_df[synch_df.duplicated()]

Unnamed: 0,courseid,userid,course,time,CourseFullname,semester,cd_curso,semestre,nm_curso_pt,ds_discip_pt


In [None]:
synch_df[synch_df['courseid'] == 100014.0]

In [None]:
#we check how many programs, Moodle courses and students each courseid-semester pair refers to
course_to_courseid = synch_df.groupby(['courseid', 'semestre']).agg({
                                                'cd_curso' : 'nunique',
                                                'course' : 'nunique',
                                                'userid' : 'nunique'
                                                }).sort_values(['semestre'], ascending=False).reset_index()

In [None]:
support_table[support_table.duplicated()]

In [None]:
#store them together
more_than_one = course_to_courseid[course_to_courseid['course'] > 1]['courseid'].tolist() # 83

#we find those logs
logs_multiple_courses = student_logs[student_logs['courseid'].isin(more_than_one)].copy() #copy, so the assignment below does not act on a slice
logs_multiple_courses['courseid'] = logs_multiple_courses['courseid'].astype(object)
logs_multiple_courses.describe(include = 'all', datetime_is_numeric = True)

In [None]:
logs_multiple_courses

In [None]:
student_logs

In [None]:
more_than_one

In [None]:
logs_2_semesters

In [None]:
synch_df.describe(include = 'all', datetime_is_numeric = True)

We will be able to go further in our filtering efforts, but, beforehand, we may create some exploratory visualizations that will be able to assist us.

#### Small visualization: Weekly clicks per course
We know that conditions vary widely from course to course.
To understand the data more thoroughly, we can look at how the clicks of each course vary through time.

In [None]:
#first, we sort the courses by the start date. Then, we'll get the index of each 
sorting_hat = support_table[['courseid', 'startdate']].drop_duplicates().sort_values(by = 'startdate').reset_index(drop = True)
sorting_hat.reset_index(inplace = True)
sorting_hat = sorting_hat.set_index('courseid').to_dict()['startdate'] 

#Then, when it comes to logs, we aggregate by week
grouped_data = student_logs.groupby([pd.Grouper(key='time', freq='W'), 'course']).agg({
                                                                             'action': 'count',
                                                                             }).reset_index().sort_values('time')
#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index ='course', 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan).reset_index().rename(columns = {'course' : 'Course'})

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data['Course'].map(sorting_hat)
grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally we create the pivot_table that we will use to create our heatmap
grouped_data = grouped_data.set_index('Course', drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T
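
In this notebook `courseid` holds the Netpa reference while `course` holds the Moodle identifier, so a start-date lookup built from the support table is keyed by `courseid`. The sketch below shows one possible way to carry start dates over to Moodle courses through the courseid-course pairs. It runs on toy data, and taking the earliest date when a Moodle course matches several start dates is an assumption, not a rule stated in the notebook.

```python
import pandas as pd

# toy support-table extract: Netpa courseid and start date
support_toy = pd.DataFrame({
    'courseid': [200165.0, 200165.0, 100027.0],
    'startdate': pd.to_datetime(['2020-09-07', '2020-09-07', '2021-02-08']),
})

# toy courseid-to-Moodle-course pairs, as held in student_courses
student_courses_toy = pd.DataFrame({
    'courseid': [200165.0, 200165.0, 100027.0],
    'course': [1506, 1507, 1357],
})

# start date per Moodle course (assumption: keep the earliest when several match)
course_start = (
    student_courses_toy[['courseid', 'course']].drop_duplicates()
    .merge(support_toy[['courseid', 'startdate']].drop_duplicates(), on='courseid')
    .groupby('course')['startdate'].min()
)
print(course_start)
```

The resulting Series is indexed by the Moodle `course` id, so it can be passed to `Series.map` on any column holding Moodle course ids.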

In [None]:
grouped_data

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we are plotting the first
heat1 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat1.get_figure()
fig.savefig('../Images/NovaIMS_exploratory_course_weekly_clicks_heatmap.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat1

**Additionally**, we can make some observations that may come in handy in the future:

a. How many students are attending each course,

b. How many courses is each student attending,

This knowledge will allow us to make additional filtering decisions to enhance our sample.

In [None]:
#we can compute the number of students attending each course, and the number of courses each student is attending
class_list = student_courses.groupby('course')['userid'].count().to_frame().rename(columns = {'userid' : 'Users per course'})
enrollment_size = student_courses.groupby('userid')['course'].count().to_frame().rename(columns = {'course' : 'Courses per User'})

**A. How many students are attending each course?**

In [None]:
#setup
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(16, 10)}, font_scale=2)

#a number of students per course
#student_courses.rename(columns = {'userid' : 'Students per course'}, inplace = True)

#then we plot a histogram of all courses; we are not interested in keeping courses with fewer than 10 students
hist1 = sns.histplot(data=class_list, x='Users per course', kde=True, color= student_color, binwidth = 5,)

fig = hist1.get_figure()
fig.savefig('../Images/hist1_students_per_course_bin_5.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, hist1

**There is a very significant number of courses with between 1 and 10 students**


**B. In how many courses is each student enrolled?**

In [None]:
#then we plot a histogram of the number of courses per student
hist2 = sns.histplot(data= enrollment_size, 
        x='Courses per User', color= course_color, discrete = True, fill = True)

fig = hist2.get_figure()
fig.savefig('../Images/hist2_courses_per_student course_bin_1.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, hist2

Depending on the course in question, it is possible for it to have 1 registered user vs almost 1400.
Additionally, we can also see that there is a significant number of students attending a single course (over 6000).

To some extent, most courses have some degree of interaction no matter how small.

We can see that many of the interactions start in late August, early September - the start of the school year. 
When it comes to different disciplines, we can also discern the most common starting points:
- Early September
- End of January/Start of February

Additionally, many courses seem, at least on a preliminary level, to be consistent with the split between semesters and trimesters.

In [None]:
course_dict = {course: student_logs.loc[student_logs['course'] == course][['time']] for course in tqdm(course_array)}

With the dictionary in place, we can now repeat, for every course, what was done by the authors of the Riestra González paper.

For that, we will create a function. We choose to do it this way because we can use the pipe method.

In [None]:
#start date
class_list['Start Date'] = class_list.index.to_series().map(sorting_hat)

#end date
class_list['End Date'] = class_list.index.to_series().map(course_dict)
class_list['End Date'] = class_list['End Date'].where( class_list['End Date'] == (( class_list['End Date'] + Week(weekday=4) ) - Week()), class_list['End Date'] + Week(weekday=4))

#additionally, we will look at our estimated course duration
class_list['Course duration days'] = class_list['End Date'] - class_list['Start Date']
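
The End Date adjustment above can be read as "keep the date if it is already a Friday, otherwise move it forward to the next Friday". The sketch below checks that reading on two hand-picked dates. The helper `roll_to_friday` and the sample dates are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd
from pandas.tseries.offsets import Week

friday = Week(weekday=4)  # weekly offset anchored on Friday (Monday = 0)

def roll_to_friday(ts):
    """Hypothetical helper: return ts if it is a Friday, else the next Friday."""
    return friday.rollforward(ts)

wednesday = pd.Timestamp('2014-12-17')
a_friday = pd.Timestamp('2014-12-19')

rolled_wednesday = roll_to_friday(wednesday)
rolled_friday = roll_to_friday(a_friday)

# condition used in the notebook: a date is kept when adding the anchored
# week and then subtracting a plain week returns the same date
kept_as_is = (a_friday + friday) - Week() == a_friday
```

Under this reading, every estimated end date lands on a Friday, which makes the course durations comparable across courses.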

Before finishing, we will take an additional look at our list of courses and their duration.

In situations where the course duration is very small, we may disregard the courses outright. Courses at NOVA IMS are either annual, semestral or trimestral. We will therefore add the following criteria for exclusion:

1. Start Date occurs before 24th of August 2014 - our threshold.
2. Courses with a small duration (below 4 weeks)
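
As a minimal sketch of how these two criteria combine, the toy table below mimics the layout of `class_list`. The course ids and dates are made up for illustration, and the duration test uses `<=` to mirror the comparison in the filtering cell.

```python
import pandas as pd

# toy stand-in for class_list: one row per course (values are invented)
toy = pd.DataFrame(
    {
        'Start Date': pd.to_datetime(['2014-08-01', '2014-09-08', '2014-09-15']),
        'End Date': pd.to_datetime(['2014-12-19', '2014-09-26', '2015-01-30']),
    },
    index=pd.Index([101, 102, 103], name='course'),
)
toy['Course duration days'] = toy['End Date'] - toy['Start Date']

threshold = pd.Timestamp('2014-08-24')   # criterion 1
min_duration = pd.Timedelta(weeks=4)     # criterion 2

too_early = toy['Start Date'] < threshold
too_short = toy['Course duration days'] <= min_duration

kept = toy[~too_early & ~too_short]
```

Here course 101 starts before the threshold, course 102 lasts only 18 days, and only course 103 is kept.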

In [None]:
#1 Filter by Start Date - we lose 3 courses
class_list = class_list[class_list['Start Date'] >= '2014-08-24']

#2 Filter by duration - we remove all courses with a duration equal to or below 4 weeks
small_duration = class_list[class_list['Course duration days'] <= pd.to_timedelta(4, unit = 'W')].sort_values(by = 'Course duration days').reset_index()
class_list = class_list[~class_list.index.isin(small_duration['course'])]

#we can now filter our student logs, thereby removing all of these unnecessary courses
student_logs = student_logs[student_logs['course'].isin(class_list.index)]

In [None]:
student_logs

We note that many of the discarded disciplines have, in practice, logs that are consistent with the very small estimated duration of the course. 

It is nonetheless important to highlight the existence of courses whose logs are not consistent with the calculated duration. After verification, we found that, while the end date was consistent with the logs, the start date was not.

In these instances, we opted to remove the courses as there is no way to confirm the exact start date.

In [None]:
#additional manipulations required to plot interactions
small_duration['course'] = pd.to_numeric(small_duration['course']).astype(int)

#finally, we index by course so that we can select these courses from the heatmap data
small_duration = small_duration.set_index('course', drop = True)

In [None]:
sns.set_theme(context='paper', style='whitegrid', font='Calibri', rc={"figure.figsize":(20, 12)}, font_scale=2)

#here, we are plotting the weekly interactions of the discarded short-duration courses
heat2 = sns.heatmap(grouped_data[grouped_data.index.isin(small_duration.index.to_list())], robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat2.get_figure()
fig.savefig('../Images/weekly_clicks_small_duration_heatmap.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat2

**Now, we will finish our work by removing all logs that fall outside the following boundaries.**

We will build 2 cutoff points:

1. One week before the start date of the course: earlier logs are removed, 
2. The perceived end of the course: later logs are removed.
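
The effect of the two cutoff points can be sketched on a toy log table. The course id, dates and the `start` and `end` dictionaries below are assumptions made for illustration; the explicit comparisons keep rows on either boundary, i.e. both bounds are inclusive.

```python
import pandas as pd

# toy logs for a single course (values are invented)
logs = pd.DataFrame({
    'course': [1, 1, 1, 1],
    'time': pd.to_datetime(['2014-08-20', '2014-09-03', '2014-12-10', '2015-01-15']),
})

# hypothetical estimated start and end dates per course
start = {1: pd.Timestamp('2014-09-08')}
end = {1: pd.Timestamp('2014-12-19')}

# first cutoff: one week before the start date; second cutoff: the end date
logs['start_bound'] = logs['course'].map(start) - pd.Timedelta(weeks=1)
logs['end_bound'] = logs['course'].map(end)

inside = logs[(logs['time'] >= logs['start_bound']) & (logs['time'] <= logs['end_bound'])]
```

In this example the click on 20 August falls before the one-week grace period and the click on 15 January falls after the end date, so only the two clicks in between are kept.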

In [None]:
#we add the cutoff point (one week before the start date) to the class list
class_list['cuttoff_point'] = pd.to_datetime((class_list['Start Date'] - pd.to_timedelta(1, unit = 'W')).dt.date)

#convert to date
class_list['Start Date'] = pd.to_datetime(class_list['Start Date'].dt.date)
class_list['End Date'] = pd.to_datetime(class_list['End Date'].dt.date)
class_list['Course duration days'] = class_list['End Date'] - class_list['Start Date']
class_list.reset_index(inplace = True)

#we will create a new dict with the cutoff point of each course
cuttoff_point = class_list.set_index('course').to_dict()['cuttoff_point'] 

#we'll create new columns with the start and end bounds of the course of each log row
student_logs['start_bound'] = student_logs['course'].map(cuttoff_point)
student_logs['end_bound'] = student_logs['course'].map(course_dict)

#convert to date
student_logs['start_bound'] = pd.to_datetime(student_logs['start_bound'].dt.date)
student_logs['end_bound'] = pd.to_datetime(student_logs['end_bound'].dt.date)

**Now, we only keep the rows whose timestamp falls between the start and end bounds.**

In [None]:
student_logs = student_logs[student_logs['time'].between(student_logs['start_bound'], student_logs['end_bound'], inclusive = 'both')].reset_index(drop = True)
student_logs

**After finishing, we will now take a new look at the weekly interactions.**

We expect a cleaner view of the weekly interactions performed by students in the context of their courses.
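
Before running the aggregation on the full logs, it may help to see what weekly grouping and pivoting produce on a handful of rows. The sketch below uses invented clicks for two courses. With `freq='W'`, pandas labels each bin by the Sunday that closes the week.

```python
import pandas as pd

# toy logs: three clicks for course 1 and one click for course 2 (invented)
logs = pd.DataFrame({
    'course': [1, 1, 1, 2],
    'time': pd.to_datetime(['2014-09-08', '2014-09-10', '2014-09-16', '2014-09-09']),
    'action': ['viewed', 'viewed', 'viewed', 'viewed'],
})

# count the interactions per week and course
weekly = (logs.groupby([pd.Grouper(key='time', freq='W'), 'course'])['action']
              .count()
              .reset_index())

# one row per course, one column per week: the layout used by the heatmap
heat = weekly.pivot_table(index='course', columns='time', values='action', aggfunc='sum')

first_week = pd.Timestamp('2014-09-14')
second_week = pd.Timestamp('2014-09-21')
```

Course 1 has two clicks in the week ending 14 September and one in the week ending 21 September, while course 2 only appears in the first week.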

In [None]:
#Then, when it comes to logs, we aggregate by week
grouped_data = student_logs.groupby([pd.Grouper(key='time', freq='W'), 'course']).agg({
                                                                             'action': 'count',
                                                                             }).reset_index().sort_values('time')
#change for better reading
grouped_data['Date (week)'] = grouped_data['time'].astype(str)

#creating pivot table to create heatmap
grouped_data = grouped_data.pivot_table(index ='course', 
                       columns = 'Date (week)',
                        values = 'action', 
                       aggfunc =np.sum,
                        fill_value=np.nan).reset_index().rename(columns = {'course' : 'Course'})

#now, we will sort the courses according to the starting date
grouped_data['sort'] = grouped_data['Course'].map(sorting_hat)
grouped_data['Course'] = pd.to_numeric(grouped_data['Course']).astype(int)

#finally, we index by course and sort by start date to obtain the table that we will use to create our heatmap
grouped_data = grouped_data.set_index('Course', drop = True).sort_values('sort').drop('sort', axis = 1)
grouped_data.T.describe(include = 'all').T

In [None]:
#here, we are plotting the weekly interactions after cleaning
heat3 = sns.heatmap(grouped_data, robust=True, norm=LogNorm(), xticklabels = 2, yticklabels= False,
            cmap = standard_cmap, cbar_kws={'label': 'Weekly interactions'})

fig = heat3.get_figure()
fig.savefig('../Images/cleaned_weekly_clicks_heat3.png', transparent=True, dpi=300)

#delete to remove from memory
del fig, heat3

We finish the notebook by saving the cleaned logs and the list of the courses with which we will be going forward in our analysis. 

A very important factor to take into account is that, as targets, we will only have access to the student-course pairs that we were able to identify in our targets table, which are the same as the ones present in our support_table.

It is, therefore, wise to perform a last filtering step before going forward.
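
A minimal sketch of this filtering step, on invented data, shows why the support table is deduplicated first: an inner merge keeps only the log rows whose (course, userid) pair is known, and duplicate pairs on the right-hand side would otherwise duplicate log rows.

```python
import pandas as pd

# toy logs and support table (values are invented)
logs = pd.DataFrame({
    'course': [1, 1, 2, 2],
    'userid': [10, 11, 10, 12],
    'action': ['viewed', 'viewed', 'viewed', 'viewed'],
})
support = pd.DataFrame({
    'courseid': [1, 1, 2],
    'userid': [10, 10, 12],   # the pair (1, 10) appears twice
})

# unique student-course pairs, renamed to match the logs
pairs = (support[['courseid', 'userid']]
         .drop_duplicates()
         .rename(columns={'courseid': 'course'}))

# keep only the log rows whose pair exists in the support table
filtered = logs.merge(pairs, on=['course', 'userid'], how='inner')
kept_pairs = set(zip(filtered['course'], filtered['userid']))
```

Only the pairs (1, 10) and (2, 12) survive, and each log row appears exactly once.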

In [None]:
#get all unique student course pairs present in the support table
support_table = support_table[['courseid', 'userid']].drop_duplicates().rename(columns = {'courseid': 'course'})

#merge with our current, cleaned, logs - will only include courses that we have not expressly excluded thus far
student_logs = student_logs.merge(support_table, on = ['course','userid'], how = 'inner').reset_index(drop = True)

In [None]:
student_logs

Finally, we will update the students-per-course count based on the final logs. This ensures that we are working with the most up-to-date information.

In [None]:
#We start by updating the relevant student_course list
student_courses = student_logs.filter(['course', 'userid']).drop_duplicates().reset_index(drop = True)

#then, we compute the updated number of students attending each course
new_class_list = student_courses.groupby('course')['userid'].count().to_frame().rename(columns = {'userid' : 'Users per course'}).reset_index()

#we will create a new dict with the updated number of users per course
new_class_list = new_class_list.set_index('course').to_dict()['Users per course']

#we update the users per course; courses that are no longer present in the logs get NaN
class_list['Users per course'] = class_list['course'].map(new_class_list)

In [None]:
class_list['Course duration days'] = (class_list['Course duration days'].dt.total_seconds() // 3600 // 24) + 1

In [None]:
class_list

In [None]:
#save tables 
class_list.to_csv('../Data/Modeling Stage/R_Gonz_class_duration.csv') 

student_logs.drop(['start_bound', 'end_bound'], axis = 1).to_csv('../Data/Modeling Stage/R_Gonz_cleaned_logs.csv')

#### Done

From here on out, we will continue with feature engineering and extraction for modeling purposes in Notebooks 3.