## Notebook 2.1 Data Understanding and Preprocessing of Support Tables

This is the second working notebook of the thesis. In it, we look at the support tables that are part of the original NOVA IMS database.

#### 1. We are familiarized with the general structure of the logs

Before going further, we should assess the remaining tables presented in the database. 

Recall, **logs record interactions with the system, and we are looking for ways to determine whether these interactions can help educators identify at-risk and high-performing students.**

Thus, to make the most of the logs, we will need to perform different segmentations, and we will likely need to perform some filtering.

### To do that, we will take a look at all tables

We will look at all tables and all columns to make a preliminary assessment of the utility of the available elements.
In general, these are support elements that will be used sparingly, as most of the relevant information is present in the logs.

The observation of each table relies on the same chain of commands:

- info -> to observe the non-null count and datatype of each column,
- describe -> to return the most notable descriptive statistics of each column.

The observation of each table ends with a look at the raw data (at least the visible rows).
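As a minimal sketch of this chain of commands, the cell below applies it to a small made-up table. The table `toy` and its values are hypothetical and only stand in for one of the support tables.

```python
import pandas as pd

# Hypothetical toy table standing in for a support table (values are made up)
toy = pd.DataFrame({
    "courseid": ["A", "A", "B"],
    "userid": ["u1", "u2", "u1"],
    "notaFinal": [12.0, None, 17.0],
})

# info -> non-null count and datatype of each column
toy.info()

# describe -> descriptive statistics, transposed so each column becomes a row
summary = toy.describe(include="all").T
print(summary)

# finally, a look at the raw data (first visible rows)
print(toy.head())
```

The transposed summary keeps one row per column, which is easier to read when a table has many columns.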

#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in an Excel file.

In [1]:
#import libs
import pandas as pd
import numpy as np
from pandas.tseries.offsets import *

import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.notebook import tqdm, trange
tqdm.pandas(desc="Progress")

sns.set()

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
#other tables with support information
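support_table = pd.read_excel('../Data/Nova_IMS_logs_Moodle_cursos.xlsx', sheet_name = None,
                             dtype = {
                                 'cd_lectivo' : object,
                                 'cd_curso' : object,
                                 'cd_Discip': object,
                                 'cd_discip': object,
                                 'userId': object,
                             })

#dates are parsed explicitly after loading, since a datetime type cannot be passed reliably through the dtype mapping
support_table['datasExames']['dataExame'] = pd.to_datetime(support_table['datasExames']['dataExame'])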

support_table['logs_Moodle_cursos'].rename(columns = {
                                 'cd_Discip' : 'courseid',
                                 'userId' : 'userid',
                             }, inplace = True)

In [None]:
#use this cell to write any additional piece of code that may be required

We can see that the support table dict is composed of 2 distinct tables.

Student performance is, in general, measured by the student's grade. So... how do we measure grades?
In our data, we have access to different grades - Exam/assignment and Final Grades.

In [None]:
support_table['logs_Moodle_cursos'].info()

In [None]:
support_table['logs_Moodle_cursos'].describe(include = 'all', datetime_is_numeric = True).T

We can start by removing the descriptive identifiers (names) of courses and programs. These will provide no additional information for our analysis. Additionally, as all of the logs refer to the same school year, the column that references it will also be removed.

In [None]:
#cd_lectivo is a single-value column - the other dropped columns are similarly uninformative
support_table['logs_Moodle_cursos'].drop(['cd_lectivo', 'nm_curso_pt', 'nm_ramo', 'ds_discip_pt'], axis = 1, inplace = True)

Before going forward, let us take a closer look at some columns - specifically the unique values each of these columns may take:
1. semestre: will provide valuable insights into how to split the data,
2. statusAvaliacao: tells us which phase a grade refers to,
3. statusEpoca: refers to the grading status of a particular item,
4. statusFinal.

In [None]:
#store columns of interest in list
columns_of_interest = ['semestre', 'statusAvaliacao', 'statusEpoca', 'statusFinal']

#print value counts of each of them:
for i in tqdm(columns_of_interest):
    print(f'Unique Values of Column {i}: \n')
    print(support_table['logs_Moodle_cursos'][i].value_counts())
    print('\n')

We now know a lot of different things.

**Next, we have the datasExames table**

This table stores important information concerning the different curricular units and their exam dates. 

In [None]:
support_table['datasExames'].info()

In [None]:
support_table['datasExames'].describe(include = 'all', datetime_is_numeric = True).T

In [None]:
#cd_lectivo is a single-value column, so we drop it and rename cd_discip to courseid
support_table['datasExames'] = support_table['datasExames'].drop(['cd_lectivo'], axis = 1).rename(columns = {'cd_discip': 'courseid'})
support_table['datasExames']

#### Before going forward, we have to consider the following:

In this instance, all courses have grades, be it an exam grade or a final grade. As a first indicator, we will not consider grade improvements.

These results come from students who have effectively completed the curricular unit. Likewise, we will not consider special season grades, because these exams are only accessible to students who fulfill very strict conditions.

In this instance, it does not make sense to distinguish between mandatory and optional assignments. We will keep the different assignments listed in the hope that we can associate each assignment with a specific timestamp.
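The exclusion of improvement and special season grades can be sketched on a toy table as follows. The rows and grades are made up, and only two of the excluded labels are used for brevity.

```python
import pandas as pd

# Hypothetical grade records (made-up values)
grades = pd.DataFrame({
    "userid": ["u1", "u1", "u2", "u3"],
    "statusAvaliacao": ["Normal", "Melhoria (1ª Época)", "Época Especial", "Recurso"],
    "notaAvaliacao": [14, 16, 11, 10],
})

# labels we do not want to keep (a short, illustrative subset)
excluded = ["Melhoria (1ª Época)", "Época Especial"]

# ~isin keeps every row whose label is NOT in the excluded list
kept = grades[~grades["statusAvaliacao"].isin(excluded)].reset_index(drop=True)
print(kept)
```

Note that `isin` matches labels exactly, so differences in capitalisation or accents between labels have to be listed explicitly.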

Additionally, we can already start to address course duration:
1. We need a start date and an end date.
2. Moodle logs usually have a start date for the course: it is unclear, at this moment, whether the date presented therein reflects the actual start date or the course registry date. Regardless, it may be possible for us to use the semester denomination to make a reasonable inference for duration using the weeks of start and finish.
3. The end date is given by the Normal exam date.

In [None]:
#we create a list of items to remove from the logs
invalid_keys = ['Melhoria - Nota Parcial 1', #partial for improv
                'Melhoria - Nota Parcial 2', #partial 2
                'Melhoria - Nota Parcial 3', #partial 3          
                'Melhoria - Nota Parcial 4', #...
                'Melhoria - Nota Parcial 5',          
                'Melhoria - Nota Parcial 6',          
                'Época Especial', #special season
                'Estatuto especial - Exame 1', #extraordinarily special season exam        
                'Estatuto Especial - Exame 2', #extraordinarily special season exam 2 
                'Melhoria (1ª Época)', #improv season 1
                'Melhoria (2ª época)', #improv season 2
                'Creditação', #credits with grade season
               ]

#we remove the entries from grades
support_table['logs_Moodle_cursos'] = support_table['logs_Moodle_cursos'][~(support_table['logs_Moodle_cursos']['statusAvaliacao'].isin(invalid_keys))].reset_index(drop = True)

#for course duration, we will only care about the date of the first season - as it is more indicative of duration than other options
normal_season = ['Exame Época Final Normal', 'Normal']

#and only keep normal season dates
support_table['datasExames'] = support_table['datasExames'][support_table['datasExames']['epocaAvaliacao'].isin(normal_season)].reset_index(drop = True)

#merge the two labels used for the normal season into a single one
support_table['datasExames']['epocaAvaliacao'] = np.where(support_table['datasExames']['epocaAvaliacao'] == 'Exame Época Final Normal',
                                                             'Normal', #getting rid of multiclassification
                                                             support_table['datasExames']['epocaAvaliacao'])

#sort and remove duplicated rows - the result is assigned back so that it takes effect
support_table['datasExames'] = support_table['datasExames'].sort_values(by = ['semestre', 'courseid', 'dataExame']).drop_duplicates().reset_index(drop = True)

In [None]:
support_table['datasExames']

**Our main target will be the exam grade**, because it is not directly computed from the grades of the different assignments, whereas finalgrade results from these.

In [None]:
#first, make distinction between passes and fails 
support_table['logs_Moodle_cursos']['statusEpoca'] = np.where(support_table['logs_Moodle_cursos']['statusEpoca'] != 'Aprovado',
                                                             'Reprovado', #getting rid of multiclassification
                                                             support_table['logs_Moodle_cursos']['statusEpoca'])

#then, merge the two labels used for the normal season exam into a single one
support_table['logs_Moodle_cursos']['statusAvaliacao'] = np.where(support_table['logs_Moodle_cursos']['statusAvaliacao'] == 'Exame Época Final Normal',
                                                             'Normal', #getting rid of multiclassification
                                                             support_table['logs_Moodle_cursos']['statusAvaliacao'])

support_table['logs_Moodle_cursos'] = support_table['logs_Moodle_cursos'].sort_values(by = ['semestre', 'courseid', 'userid', 'statusAvaliacao', 'notaAvaliacao'])
support_table['logs_Moodle_cursos'].drop_duplicates(subset = ['semestre', 'courseid', 'userid', 'statusAvaliacao'], inplace = True)

support_table['logs_Moodle_cursos']

In [None]:
exames = ['Normal', 'Recurso']

#first, split exams from remaining rows
exams = support_table['logs_Moodle_cursos'][support_table['logs_Moodle_cursos']['statusAvaliacao'].isin(exames)].filter(['courseid', 'semestre','userid', 
                                                                                                            'statusAvaliacao', 'notaAvaliacao',
                                                                                                           'notaFinal'])

#then, create a pivot table with the normal + recurso season exam grades 
exam_pivot = pd.pivot(exams, index = ['courseid', 'semestre','userid'], columns = 'statusAvaliacao', 
                 values = 'notaAvaliacao')

exam_pivot.dropna(how = 'all', inplace = True)

# We look at the exams at Normal and Recurso Season

We have 1341 student/course pairs that have taken the 2nd season exam.
We also have 329 students who have not taken the 1st season exam.

We can take a simple approach - fill all nans in the Normal exam with the Recurso grade.

For that, we will check whether both dfs are the same.
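A minimal sketch of this fill strategy on made-up data is shown below. The table `exams_toy` is hypothetical. Note that, under this approach, a student with both grades keeps the Normal grade.

```python
import pandas as pd

# Hypothetical exam records: u1 took both seasons, u2 only Recurso, u3 only Normal
exams_toy = pd.DataFrame({
    "courseid": ["A", "A", "A", "A"],
    "userid": ["u1", "u1", "u2", "u3"],
    "statusAvaliacao": ["Normal", "Recurso", "Recurso", "Normal"],
    "notaAvaliacao": [8.0, 12.0, 11.0, 15.0],
})

# one row per course/student pair, one column per exam season
pivot = exams_toy.pivot(index=["courseid", "userid"],
                        columns="statusAvaliacao",
                        values="notaAvaliacao")

# fill missing Normal grades with the Recurso grade
exam_mark = pivot["Normal"].fillna(pivot["Recurso"]).rename("exam_mark")
print(exam_mark)
```

If the Recurso grade should take precedence instead, the two columns can be swapped in the `fillna` call.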

In [None]:
#This cell is a relic from a previous formulation of the problem

# #we need to look for instances of nans in normal and not nas in recurso
# normal_nans = exam_pivot[exam_pivot['Normal'].isna()]
# valid_recurso = exam_pivot[exam_pivot['Recurso'].notna()]

# print(f'Are both dataframes exactly the same? \n' +
#       f'Answer: {normal_nans.equals(valid_recurso)}.')

# #del normal_nans, valid_recurso

**In light of the previous result**, we will look to join both columns together and get a valid targets_table - with exam grade.

In [None]:
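#fill nas in Normal with the Recurso grade - the result is assigned back instead of relying on an in-place update of a column selection
exam_pivot['Normal'] = exam_pivot['Normal'].fillna(exam_pivot['Recurso'])

#drop the Recurso column and rename the remaining column to describe what it refers to - an exam_mark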
exam_pivot = exam_pivot.drop('Recurso', axis = 1).rename(columns = {'Normal': 'exam_mark'})

In [None]:
#now we remove useless columns and rename nota_final to final_mark
exams = exams.drop(['statusAvaliacao', 'notaAvaliacao'], axis = 1).rename(columns = {'notaFinal' : 'final_mark'})
exams = pd.merge(exams, exam_pivot, on = ['courseid', 'semestre', 'userid'], how = 'inner')

#we finish by dropping the rows that have no final_mark
exams.dropna(inplace = True)
exams.describe(include = 'all', datetime_is_numeric = True)

In [None]:
exams

**We now have a table with both of our potential targets**.

Now, we need to go back to our main support table. In this table, we now deal with the remaining assignments - that is, the ones that are identified as N.

In [None]:
#now, keep the remaining rows - everything that is not an exam
assignments = support_table['logs_Moodle_cursos'][~(support_table['logs_Moodle_cursos']['statusAvaliacao'].isin(exames))].filter(['courseid', 'semestre','userid', 
                                                                                                            'statusAvaliacao', 'cd_final','notaAvaliacao'])
assignments

In [None]:
exams.info()

In [None]:
support_table['datasExames'].drop_duplicates(inplace = True)
support_table['datasExames'][support_table['datasExames'].duplicated(subset = ['courseid'], keep = False)]

In [None]:
exames

In [None]:
support_table['logs_Moodle_cursos'][support_table['logs_Moodle_cursos']['statusAvaliacao'] == 'Exame Época Final Normal']

**We will only touch the course Start Date in the next notebook.**

For now, we will make the following concessions concerning end-date:

1. For every Semester class: S1, S2, T1, T2, T3 and T4, we will find the mean date (which corresponds to the mean of the dates of the normal exam).

2. All courses in each semester class will have an end-date that is equal to the Friday of the week in question.

This will ensure that all courses in the class have the same duration - which will ease our work later.
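The two concessions above can be sketched on toy data as follows. The exam dates are made up, and the modulo arithmetic is only one possible way to move a date to the Friday of its week (a date that already falls on a Friday is left unchanged).

```python
import pandas as pd

# Hypothetical normal season exam dates for two semester classes
exam_dates = pd.DataFrame({
    "semestre": ["S1", "S1", "S2"],
    "dataExame": pd.to_datetime(["2021-01-11", "2021-01-13", "2021-06-18"]),
})

# 1. mean exam date of each semester class, truncated to midnight
mean_date = (exam_dates.groupby("semestre")["dataExame"]
             .transform(lambda s: s.mean())
             .dt.normalize())

# 2. number of days until Friday (weekday 4); 0 if the date is already a Friday
days_to_friday = (4 - mean_date.dt.weekday) % 7
exam_dates["End Date"] = mean_date + pd.to_timedelta(days_to_friday, unit="D")
print(exam_dates)
```

In this toy case, the S1 mean date is a Tuesday and is moved to the following Friday, while the S2 date is already a Friday.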

In [None]:
#for each semester class, we compute the mean date of the normal season exams
support_table['datasExames']['End Date'] = pd.to_datetime((support_table['datasExames'].groupby('semestre')['dataExame'].transform(pd.Series.mean)).dt.date)

#setting up duration threshold to be on friday -> weekday 4
support_table['datasExames']['End Date'] = support_table['datasExames']['End Date'].where( support_table['datasExames']['End Date'] == (( support_table['datasExames']['End Date'] + Week(weekday=4)) - Week()), support_table['datasExames']['End Date'] + Week(weekday=4))

#this also allows us to remove the pesky columns that serve no additional purpose
support_table['datasExames'].drop(['epocaAvaliacao', 'dataExame'], axis = 1, inplace = True)
support_table['datasExames']

In [None]:
#'dataExame' was dropped in the previous cell, so the end date cannot be recomputed here - we only inspect the stored result
support_table['datasExames']

#### 3. To business

The information stored in these tables is pivotal for our work with the logs. Setting aside other potential insights that may arise from this data, we are, for the most part, interested in 3 things:

1. Identify the student population - implicitly achieved via
2. Compute Student Performance - our target
3. Get course duration - or find a way to compute it - for the courses that we will take forward.

We have been discussing continuously that we want to, in some capacity, predict student performance. As we do not have access to the final grades, we will need to infer them from graded Moodle assignments. The first, and almost immediate, observation is that we can only use courses that use Moodle in this capacity, which will reduce the number of courses we have to work with.

We will follow the formula adopted by the authors of the Riestra-González paper:

#### Student Performance and Course Duration

The authors got to student performance and course duration by performing inner joins across multiple tables and filtered across different conditions:

course_mod_table,
grades_table,
grade_item_table

We will replicate their steps and, hopefully, reach support tables that return comparable results. The first step is to remove the rows that are unnecessary for us. We can only construct a solution for items that are graded and for which we have the means to estimate the course duration.

Thus, in the grades_table, we will only keep rows that fulfill the following prerequisite:
1. Have a valid final grade.

The second phase will be to perform inner joins of the different tables:
1. course_mod_table with grade_item_table on iteminstance and courseid
2. grade_item_table.id with grades_table.itemid
3. The merge of the previous 2 merged tables
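The chain of inner joins can be sketched with three tiny made-up tables. The table and column names below follow the ones used in this notebook, but the contents are hypothetical, and the item identifier is assumed to be already named itemid in both tables.

```python
import pandas as pd

# Hypothetical miniature versions of the three Moodle tables
course_mod = pd.DataFrame({"courseid": [1, 1, 2], "iteminstance": [10, 11, 20]})
grade_item = pd.DataFrame({"itemid": [100, 101, 102],
                           "courseid": [1, 1, 3],
                           "iteminstance": [10, 11, 30]})
grades = pd.DataFrame({"itemid": [100, 100, 101, 102],
                       "userid": ["u1", "u2", "u1", "u3"],
                       "finalgrade": [15.0, 12.0, None, 9.0]})

# prerequisite: only keep rows with a valid final grade
grades = grades.dropna(subset=["finalgrade"])

# join 1: course modules with grade items, on iteminstance and courseid
items = pd.merge(course_mod, grade_item, on=["iteminstance", "courseid"], how="inner")

# join 2: the result with the grades, on the grade item identifier
graded = pd.merge(items, grades, on="itemid", how="inner")
print(graded)
```

Because every join is an inner join, items without a matching course module and grades without a matching item are silently dropped, which is why the row counts should be checked after each step.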

In [None]:
# #Step 1, removing all rows that have no interest to us

# grades_table.dropna(subset = ['finalgrade','timecreated', 'timemodified'], inplace = True)

#Step 2: Create a temporary table that associates courses and assignments
placeholder_1 = pd.merge(course_mod_table, grade_item_table, on=['iteminstance','courseid'], how='inner')

#Step 3: Create a second temporary table that associates grades with assignments
placeholder_2 = pd.merge(placeholder_1, grades_table, on ='itemid', how='inner')

#Step 4: copy the merged result and store the most recent of the two timestamps of each grade
support_table = placeholder_2.copy()
support_table['sup_time'] = np.where(support_table['timecreated'] > support_table['timemodified'],
                                support_table['timecreated'], support_table['timemodified'])

#Step 5: only keep graded items, which means nonzero max grades
support_table = support_table[(support_table['rawgrademax'] > 0) & (support_table['sup_time'] >= '2014-08-24')]

del placeholder_1, placeholder_2

**As a final step, we will store the start date of each course - as it will provide us with the means to, further down the line, perform the inference for course duration.**

In [None]:
#only keep rows worth merging - this cell can only be run once
course_table = course_table.filter(['id', 'startdate']).rename(columns = {'id': 'courseid'})

#perform inner join between support table and courses with grades
support_table = pd.merge(support_table, course_table, on = 'courseid', how = 'inner')

#only keep the final result
support_table = support_table[(support_table['startdate'] <= support_table['sup_time']) & (support_table['startdate'].dt.year >= 2014)].filter(['assign_id', 'courseid', 'startdate', 'userid', 'finalgrade', 
                                      'rawgrademax', 'sup_time'])

#we will, by default, consider the start date to be Monday
support_table['startdate'] = support_table['startdate'].dt.to_period('W').dt.start_time

In [None]:
support_table.describe(include = 'all', datetime_is_numeric = True).T

We will finish this section by filtering the features to keep and, afterward, export the support table to use with the LMS logs. 

#### Section 2: 

By now, we know, generally:

- all courses that had graded assignments (i.e. whose max assignment grade was not 0) - courseid,
- all students that were registered in the curricular unit - userid,
- if a student delivered an assignment or not and the assignment's grade - finalgrade.

This information is especially useful because it will assist us in the proper filtering of the moodle activity logs and, additionally, assist us in the achievement of valuable information needed for the project: course duration and target.

#### 1. Course duration: 

We have the start date for each course. The authors of the original paper inferred the course end to occur at the 95% log threshold. That is to say, 5% of the logs were registered after the end of the course. We have no way to obtain a better estimate, so we will accept this assumption. When we deal with the logs, we will be able to calculate the end-of-course date.
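A minimal sketch of the 95% threshold, assuming it is read as the 0.95 quantile of the log timestamps of each course, could look as follows. The log table is made up, and the exact procedure of the original authors may differ.

```python
import pandas as pd

# Hypothetical logs: one interaction per day, over 21 days, for a single course
logs = pd.DataFrame({
    "courseid": ["A"] * 21,
    "time": pd.date_range("2021-01-01", periods=21, freq="D"),
})

# for each course, the timestamp below which 95% of the logs fall
end_dates = pd.Series({
    courseid: times.quantile(0.95)
    for courseid, times in logs.groupby("courseid")["time"]
})
print(end_dates)

# share of logs registered after the inferred end of course
share_after = (logs["time"] > end_dates["A"]).mean()
print(share_after)
```

With this reading, roughly 5% of the interactions of each course are treated as happening after the course has ended.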

#### 2. Targets - finalgrade:

The authors calculated the final grade as a construct computed from the assignment grades. Again, we will accept the authors' method for this.

**First**: to classify whether different assignments were mandatory or not

The authors of the paper made a split between mandatory and optional assignments. In their view, any assignment whose submission rate (relative to the number of students attending the course) is 40% or under is considered an optional assignment.
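The classification rule can be sketched on a toy course with 5 students and 3 assignments. The data is made up, and the number of attending students is assumed here to be the number of distinct students with at least one record in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical submissions: a1 delivered by 5 students, a2 by 2 and a3 by 3
submissions = pd.DataFrame({
    "courseid": ["A"] * 10,
    "assign_id": ["a1"] * 5 + ["a2"] * 2 + ["a3"] * 3,
    "userid": ["u1", "u2", "u3", "u4", "u5", "u1", "u2", "u1", "u2", "u3"],
})

# number of students attending each course
registered = submissions.groupby("courseid")["userid"].nunique()

# number of students delivering each assignment
per_assignment = (submissions.groupby(["courseid", "assign_id"])["userid"]
                  .nunique().rename("submitted").reset_index())

# submission rate relative to the number of attending students
per_assignment["registered_students"] = per_assignment["courseid"].map(registered)
per_assignment["submission_rate"] = (per_assignment["submitted"]
                                     / per_assignment["registered_students"])

# over 40% -> mandatory (1); 40% or under -> optional (0)
per_assignment["mandatory_status"] = np.where(per_assignment["submission_rate"] > 0.4, 1, 0)
print(per_assignment)
```

The assignment a2 sits exactly at the 40% boundary and is therefore classified as optional.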

1. We will need to know which assignments are optional and which are mandatory. We have, from our support table, the ability to list the courses and the students that are attending each course. From here, we can get the number of students attending each course.

In [None]:
# we get to create a pivot-table that associates students and the courses they are attending
student_list = pd.pivot_table(support_table, index='userid', columns = 'courseid', values = 'assign_id',
                    aggfunc='count')

# we use the describe command to get the course-level aggregate statistics
# count -> number of students attending, mean -> average number of assignment records per student
student_count = student_list.describe(include = 'all').T.sort_values(by = 'count', ascending = False)['count'].reset_index()

#from here, we can create a dict that associates each course to the number of students attending the course
student_count = student_count.set_index('courseid').to_dict()['count']

2. We can partially repeat the steps performed in the previous pivot-table and make the optional/mandatory classification of each assignment.

In [None]:
# we create a pivot-table that associates assignments with the courses they belong to
assign_number = pd.pivot_table(support_table.dropna(), index= 'userid', columns = ['courseid', 'assign_id'], values = 'finalgrade',
                    aggfunc='count')

# we use the describe command to get the assignment-level aggregate statistics
# count -> number of students delivering the assignment
assign_number = assign_number.describe(include = 'all').T.sort_values(by = 'count', ascending = False)['count'].reset_index()

#from here, we can add a column with the number of students attending the course
assign_number['registered_students'] = assign_number['courseid'].map(student_count)

#then, we can calculate the percentage of assignments delivered relative to the number of attending students
assign_number['%_submissions'] = assign_number['count'] / assign_number['registered_students']

#finally, we classify each assignment as mandatory vs non-mandatory (over 40% submission rates)
assign_number['mandatory_status'] = np.where(assign_number['%_submissions'] > 0.4, 1, 0)

#from here, we can create a dict that associates each assignment with its mandatory status
mandatory_status = assign_number.set_index('assign_id').to_dict()['mandatory_status']

#we can now map the mandatory vs non-mandatory status of each assignment
support_table['mandatory_status'] = support_table['assign_id'].map(mandatory_status)

del assign_number, student_count, mandatory_status

We have now assigned the mandatory status to the different assignments. We will not use this knowledge immediately, but we will need it later, as it allows us to perform new computations.

**3. Now, we can clean unnecessary assignments and courses. We can now perform the following operations:**

1. identify whether the students made the delivery of the assignment or not - nans vs non nans

2. give every nan the classification of 0.

3. verify whether any courses have an average finalgrade of 0. By extension, every course that only has assignments with a mean finalgrade of 0 will also be excluded.

4. Another variant we considered was the removal of all assignments with average finalgrade = 0. Ultimately, we opted to keep these in courses where there are assignments with finalgrade > 0.

5. Additionally, we considered the removal of assignments whose average finalgrade is equal to the rawgrademax. However, we opted to keep these records.
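The first three operations can be sketched on made-up data as follows. Course B only has grades of 0 or missing deliveries, so it is the one we expect to be excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical assignment grades: NaN means the assignment was not delivered
toy = pd.DataFrame({
    "courseid": ["A", "A", "B", "B"],
    "userid": ["u1", "u2", "u1", "u2"],
    "finalgrade": [15.0, np.nan, 0.0, np.nan],
})

# 1. delivered or not - non nans vs nans
toy["delivered"] = toy["finalgrade"].notna().astype(int)

# 2. every nan receives the grade 0
toy["finalgrade"] = toy["finalgrade"].fillna(0)

# 3. exclude courses whose average finalgrade is 0
course_mean = toy.groupby("courseid")["finalgrade"].mean()
courses_keep = course_mean[course_mean > 0].index
toy = toy[toy["courseid"].isin(courses_keep)]
print(toy)
```

The delivered flag has to be computed before the nans are filled, otherwise a missing delivery and a delivered assignment graded 0 can no longer be told apart.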

In [None]:
# check whether the assignment was delivered by the student or not
support_table['delivered'] = np.where(support_table['finalgrade'].isna(), 0, 1)

#now, we fill the remaining nas (finalgrade included) with 0
support_table.fillna(0, inplace = True)

#as a final note, we can now verify which courses we can exclude
#criteria 1: if all assignments have average grade 0, the course can be excluded
assignments_keep = support_table.groupby(['courseid']).agg({
                                                    'userid': 'count',
                                                    'finalgrade' : 'mean',
                                                    'rawgrademax' : 'mean',
                                                    },
                                                    )

#now we select the courses whose average finalgrade is above 0 and store them in a list
assignments_keep = list(assignments_keep[assignments_keep['finalgrade'] > 0].reset_index()['courseid'])

#so, we keep the courses that have a positive average finalgrade
support_table = support_table[support_table['courseid'].isin(assignments_keep)]

del assignments_keep

#check
support_table

In [None]:
support_table.describe(include = 'all', datetime_is_numeric = True).T

**4. Before finishing this notebook, there is still one thing we need to do:**

So far, we have the students registered in each course and their results in graded assignments.

The authors of the Riestra-González paper used the following equation to calculate the target, with several different possible values between 0 and 1 for $\alpha$:

$$\hat{Y} = 10\left(\alpha \frac{\sum \text{mandatory assignment marks}}{\text{number of mandatory assignments}} + (1 - \alpha) \frac{\sum \text{optional assignment marks}}{\text{number of optional assignments}}\right)$$

We will now calculate our results for final marks using the same value used by the authors of the Riestra-González paper:

$\alpha = 0.5$

In order to do that, we will start by normalizing all grades relative to their maximum possible grade, so that they are on a 0 to 1 scale.

Then, we will compute the final mark, which will allow us to compute the targets for our classifiers.
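As a worked example of the equation, the sketch below computes the final mark of a single hypothetical student. The function `final_mark` and the grades are made up for illustration, and it assumes the student has at least one mandatory and one optional assignment.

```python
def final_mark(mandatory, optional, alpha=0.5):
    """Final mark on a 0-10 scale from (grade, max_grade) pairs.

    Illustrative helper: assumes both lists are non-empty.
    """
    # normalize every grade to a 0 to 1 scale
    mandatory_marks = [grade / max_grade for grade, max_grade in mandatory]
    optional_marks = [grade / max_grade for grade, max_grade in optional]

    # average mark of each group of assignments
    mandatory_mean = sum(mandatory_marks) / len(mandatory_marks)
    optional_mean = sum(optional_marks) / len(optional_marks)

    # weighted combination, rescaled to 0-10
    return 10 * (alpha * mandatory_mean + (1 - alpha) * optional_mean)


# two mandatory assignments (16/20 and 6/10) and one optional assignment (50/50)
result = final_mark(mandatory=[(16, 20), (6, 10)], optional=[(50, 50)])
print(round(result, 2))
```

Here the mandatory average is 0.7 and the optional average is 1.0, so with $\alpha = 0.5$ the final mark is $10 \times (0.35 + 0.5) = 8.5$.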

In [None]:
#we will start by defining a function that will assist us in dealing with the multiindex

def flattenHierarchicalCol(col,sep = '_'):
    '''converts multiindex columns into single index columns while retaining the hierarchical components'''
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

In [None]:
#step 1: current grade scales vary between 0.7 and 1001. The fastest way to account for this is to divide each grade by its maximum possible grade
support_table['assignment_mark'] = support_table['finalgrade'] / support_table['rawgrademax']

#step 2: For every student and every course, we can obtain the sum and number of both mandatory and optional assignments:
grade_estimation = support_table.groupby(['courseid', 'userid', 'mandatory_status']).agg({
                                                                                    'assignment_mark' : ['sum', 'count'],
                                                                                        })

#applies the function that removes multiindex
grade_estimation.columns = grade_estimation.columns.map(flattenHierarchicalCol)
grade_estimation.reset_index(inplace = True)

#now we can create an optional and a mandatory sum column for each sum 
grade_estimation['Optional'] = 0.5 * np.where(grade_estimation['mandatory_status'] == 0, #1 - alpha = 0.5,  
                                              grade_estimation['assignment_mark_sum'] / grade_estimation['assignment_mark_count'],
                                             np.nan)

grade_estimation['Mandatory'] = 0.5 * np.where(grade_estimation['mandatory_status'] == 1, # alpha = 0.5,  
                                              grade_estimation['assignment_mark_sum'] / grade_estimation['assignment_mark_count'],
                                              np.nan)

#for all intents and purposes, we can now remove the columns assignment_mark_sum
grade_estimation.drop(['assignment_mark_sum', 'assignment_mark_count'], axis = 1, inplace = True)

#we can now create a new pivot_table that perfectly arranges our intended result
targets_table = grade_estimation.pivot_table(index=['courseid','userid'], 
                                         columns=['mandatory_status'],
                                         values=['Optional', 'Mandatory'],aggfunc='sum')

#next we remove the columns that we do not want to keep, final result being: for each discipline, the optional and the mandatory grades
targets_table.columns = targets_table.columns.set_levels(['optional','mandatory'], level=1)
targets_table.columns = targets_table.columns.map(flattenHierarchicalCol)
targets_table.drop(['Mandatory_optional', 'Optional_mandatory'], axis = 1, inplace = True)

#next, we sum the columns and multiply by 10:
targets_table['final_mark'] = 10 * (targets_table['Mandatory_mandatory'].fillna(0) + targets_table['Optional_optional'].fillna(0))

#if there are no optional assignments on the course, we will double the weight of the mandatory assignments
targets_table['final_mark'] = np.where(targets_table['Optional_optional'].isna(), 2 * targets_table['final_mark'],
                                                                                  targets_table['final_mark'])

#if there are no mandatory assignments on the course, we will double the weight of the optional assignments
targets_table['final_mark'] = np.where(targets_table['Mandatory_mandatory'].isna(), 2 * targets_table['final_mark'],
                                                                                  targets_table['final_mark'])

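#before finishing this cell, we rename the grade columns and keep the final mark
#note: dropna and reset_index act on a temporary copy here, so targets_table keeps all its rows and its (courseid, userid) index
targets_table.dropna(subset = ['Mandatory_mandatory']).reset_index(inplace = True)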
targets_table = targets_table.rename(columns = {'Mandatory_mandatory': 'Grade Mandatory', 'Optional_optional' : 'Grade Optional'})

del grade_estimation

We started the last cell with the normalization of the mark of each assignment. 

The cell finishes with a dataframe containing each curricular unit, each student attending it and the final mark of the student according to the formula we had placed previously.

In [None]:
targets_table.dropna(how = 'all', inplace = True)
targets_table.describe(include = 'all')

We now have 2 distinct tables that will be invaluable for our work in future notebooks.

targets_table stores every final mark obtained by each student attending the different courses of the university:

- From final_mark, we finally are able to calculate our target variables:
    - We can label students as at-risk or as overachievers depending on their mark,


- From support_table, we will need more robust sets of information to be used for feature extraction and engineering:
    - The startdate of each course,
    - The individual mark of each assignment and at what time the assignment was delivered,
    - The mandatory status of an assignment and whether it was or not delivered by the student in question.

In [None]:
support_table.describe(include = 'all', datetime_is_numeric = True)

In [None]:
#save tables 
targets_table.to_csv('../Data/Modeling Stage/Nova_IMS_targets_table.csv') 

support_table.drop(['finalgrade', 'rawgrademax'], axis = 1).to_csv('../Data/Nova_IMS_support_table.csv')

#### Done for now

In notebook 2.2. we will rely on the activity logs and our support table to perform the necessary filtering and preprocessing of the data in order to make it compliant with our necessities. 