# ***NOTEBOOK FOR `UNDERSTANDING` THE `OULAD` DATA***
***
- ***We need to check the tables, their relationships, and which tables matter for our analysis; we cannot use all of them.***

## **Creating connection with the database**

In [5]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
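def get_project_root(project_name="ranojoy_data_analytics_projects"):
    """Return the directory named `project_name`, searching upwards, then downwards."""
    current_path = Path.cwd()
    # Walk up from the working directory. Comparing names exactly avoids an
    # endless loop when project_name only matches part of a directory name.
    for path in (current_path, *current_path.parents):
        if path.name == project_name:
            return path
    # Otherwise look for the project folder below the working directory.
    for path in current_path.rglob(project_name):
        if path.is_dir():
            return path
    raise FileNotFoundError(f"Could not find project: {project_name}")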

repo_root = get_project_root()
data_path = repo_root / "Beyond Oulad - Students Dropout"

sys.path.append(str(data_path))

import src.functions as F

In [7]:
engine = F.get_engine()

In [8]:
engine

Engine(mysql+pymysql://root:***@localhost:3306/oulad_university_dataset)

## **Checking all the tables**

In [9]:
tables = F.sql("show tables in oulad_university_dataset",engine)
tables = [c[0] for c in tables.values]
tables

['assessments',
 'courses',
 'studentassessment',
 'studentinfo',
 'studentregistration',
 'studentvle',
 'vle']

In [25]:
for name in tables:
    print(f'Table: {name}')
    query = f''' select * from {name}
            limit 5;
        '''
    display(F.sql(query,engine))
    print("_"*50)

Table: assessments


Unnamed: 0,code_module,code_presentation,id_assessment,assessment_type,date,weight
0,AAA,2013J,1752,TMA,19.0,10.0
1,AAA,2013J,1753,TMA,54.0,20.0
2,AAA,2013J,1754,TMA,117.0,20.0
3,AAA,2013J,1755,TMA,166.0,20.0
4,AAA,2013J,1756,TMA,215.0,30.0


__________________________________________________
Table: courses


Unnamed: 0,code_module,code_presentation,module_presentation_length
0,AAA,2013J,268
1,AAA,2014J,269
2,BBB,2013J,268
3,BBB,2014J,262
4,BBB,2013B,240


__________________________________________________
Table: studentassessment


Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,1752,11391,18,0,78.0
1,1752,28400,22,0,70.0
2,1752,31604,17,0,72.0
3,1752,32885,26,0,69.0
4,1752,38053,19,0,79.0


__________________________________________________
Table: studentinfo


Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass


__________________________________________________
Table: studentregistration


Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


__________________________________________________
Table: studentvle


Unnamed: 0,code_module,code_presentation,id_student,id_site,date,sum_click
0,AAA,2013J,28400,546652,-10,4
1,AAA,2013J,28400,546652,-10,1
2,AAA,2013J,28400,546652,-10,1
3,AAA,2013J,28400,546614,-10,11
4,AAA,2013J,28400,546714,-10,1


__________________________________________________
Table: vle


Unnamed: 0,id_site,code_module,code_presentation,activity_type,week_from,week_to
0,546943,AAA,2013J,resource,,
1,546712,AAA,2013J,oucontent,,
2,546998,AAA,2013J,resource,,
3,546888,AAA,2013J,url,,
4,547035,AAA,2013J,resource,,


__________________________________________________


In [11]:
row_counts = {}
for name in tables:
    # Each count query returns a one-cell DataFrame; keep just the number
    query = f"select count(*) as total_rows from {name};"
    row_counts[name] = F.sql(query).iloc[0, 0]
pd.DataFrame.from_dict(row_counts, orient='index', columns=['total_rows'])

Unnamed: 0,total_rows
assessments,206
courses,22
studentassessment,173912
studentinfo,32593
studentregistration,32593
studentvle,10655280
vle,6364


In [12]:
F.select('assessments')

Unnamed: 0,code_module,code_presentation,id_assessment,assessment_type,date,weight
0,AAA,2013J,1752,TMA,19.0,10.0
1,AAA,2013J,1753,TMA,54.0,20.0
2,AAA,2013J,1754,TMA,117.0,20.0
3,AAA,2013J,1755,TMA,166.0,20.0
4,AAA,2013J,1756,TMA,215.0,30.0


In [13]:
F.sql('select count(distinct id_assessment) as total_assessments from assessments')

Unnamed: 0,total_assessments
0,206


***`id_assessment` is the primary key of the `assessments` table (206 distinct values in 206 rows)***

***`assessments` holds the details of each assessment for every `module` and `presentation`***

In [14]:
F.select('courses')

Unnamed: 0,code_module,code_presentation,module_presentation_length
0,AAA,2013J,268
1,AAA,2014J,269
2,BBB,2013J,268
3,BBB,2014J,262
4,BBB,2013B,240


In [15]:
F.sql('select count(distinct code_module) as total_modules,count(distinct code_presentation) as total_presentations from courses')   

Unnamed: 0,total_modules,total_presentations
0,7,4


***`code_module` and `code_presentation` together form the composite primary key of the `courses` table***

***`courses` holds the length of each module `presentation`***
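Counting distinct modules and distinct presentations separately does not by itself show that the pair is unique. A minimal sketch of a direct check follows. It assumes an in-memory SQLite table as a stand-in for the MySQL `courses` table, loaded with only the five rows shown above, and `is_unique_key` is a hypothetical helper that is not part of `src.functions`.

```python
import sqlite3


def is_unique_key(conn, table, columns):
    """Return True when no two rows of `table` share the same values in `columns`."""
    cols = ", ".join(columns)
    total = conn.execute(f"select count(*) from {table}").fetchone()[0]
    distinct = conn.execute(
        f"select count(*) from (select distinct {cols} from {table}) as t"
    ).fetchone()[0]
    return total == distinct


# In-memory stand-in for `courses`, filled with the five sample rows (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table courses ("
    "code_module text, code_presentation text, module_presentation_length integer)"
)
conn.executemany(
    "insert into courses values (?, ?, ?)",
    [
        ("AAA", "2013J", 268),
        ("AAA", "2014J", 269),
        ("BBB", "2013J", 268),
        ("BBB", "2014J", 262),
        ("BBB", "2013B", 240),
    ],
)

# A single column repeats, the pair does not.
print(is_unique_key(conn, "courses", ["code_module"]))
print(is_unique_key(conn, "courses", ["code_module", "code_presentation"]))
```

The same comparison of `count(*)` against the number of distinct column combinations can be run on any of the other tables to test a candidate key.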

In [16]:
F.select('vle')

Unnamed: 0,id_site,code_module,code_presentation,activity_type,week_from,week_to
0,546943,AAA,2013J,resource,,
1,546712,AAA,2013J,oucontent,,
2,546998,AAA,2013J,resource,,
3,546888,AAA,2013J,url,,
4,547035,AAA,2013J,resource,,


In [17]:
F.sql('select count(distinct id_site) as total_id, count(*) as total_rows from vle')

Unnamed: 0,total_id,total_rows
0,6364,6364


***`id_site` is the primary key of the `vle` table***

***`vle` describes each `id_site`: the module and presentation it belongs to, its activity type and its week range***

In [18]:
F.select('studentassessment')

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,1752,11391,18,0,78.0
1,1752,28400,22,0,70.0
2,1752,31604,17,0,72.0
3,1752,32885,26,0,69.0
4,1752,38053,19,0,79.0


In [19]:
query = '''
    select count(distinct id_assessment) as total_assessments,
    count(distinct id_student) as total_student,
    count(*) as total_rows from studentassessment
'''
F.sql(query)

Unnamed: 0,total_assessments,total_student,total_rows
0,188,23369,173912


In [20]:
F.sql('select count(*) as `count` from studentassessment group by id_assessment,id_student order by count desc').head()

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1


***The `studentassessment` table has a composite primary key made of `id_assessment` and `id_student`***

***`studentassessment` holds one row per student per assessment taken, with the submission date, the banked flag and the score***
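The counts above show 188 distinct assessments in `studentassessment` against 206 in `assessments`, so some assessments have no recorded submission. One way to list them is an anti-join, sketched here on an in-memory SQLite stand-in. The first two submissions are taken from the sample output; assessment `9999` and its type are made up purely so that the query has something to find.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table assessments (id_assessment integer primary key, assessment_type text);
    create table studentassessment (id_assessment integer, id_student integer, score real);

    -- 1752 comes from the sample output; 9999 is a made-up assessment with no submissions
    insert into assessments values (1752, 'TMA'), (9999, 'Exam');
    insert into studentassessment values (1752, 11391, 78.0), (1752, 28400, 70.0);
    """
)

# Keep the assessments for which the left join finds no matching submission.
query = """
    select a.id_assessment
    from assessments a
    left join studentassessment sa on sa.id_assessment = a.id_assessment
    where sa.id_assessment is null
    order by a.id_assessment
"""
missing = [row[0] for row in conn.execute(query)]
print(missing)
```

Against the real database the same query text could be passed to `F.sql`, assuming that helper accepts arbitrary select statements as it does elsewhere in this notebook.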

In [21]:
F.select('studentinfo')

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass


***`code_module`, `code_presentation` and `id_student` together form the primary key of the `studentinfo` table***

***`studentinfo` holds the demographic details of each student and their final result in each presentation***

In [22]:
F.select('studentregistration')

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


In [23]:
F.sql('select count(distinct id_student) as total_students, count(*) as total_rows from studentregistration')

Unnamed: 0,total_students,total_rows
0,28785,32593


***`studentregistration` has a composite primary key made of 3 columns: `id_student`, `code_module`, `code_presentation`***

***`studentregistration` records when each student registered for each presentation of each module, and when they unregistered, if they did***
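A blank `date_unregistration` in the sample output appears to mean that the student never unregistered. The sketch below turns that into an explicit flag. It assumes blanks are stored as NULL and that both dates are day offsets from the start of the presentation (negative meaning before the start), and it uses an in-memory SQLite table with the five sample rows instead of the MySQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table studentregistration ("
    "code_module text, code_presentation text, id_student integer, "
    "date_registration real, date_unregistration real)"
)
# The five sample rows; None stands for the blank cells (assumed to be NULL).
conn.executemany(
    "insert into studentregistration values (?, ?, ?, ?, ?)",
    [
        ("AAA", "2013J", 11391, -159.0, None),
        ("AAA", "2013J", 28400, -53.0, None),
        ("AAA", "2013J", 30268, -92.0, 12.0),
        ("AAA", "2013J", 31604, -52.0, None),
        ("AAA", "2013J", 32885, -176.0, None),
    ],
)

# Flag students who unregistered and compute how long they stayed registered.
query = """
    select id_student,
           date_unregistration is not null as unregistered,
           date_unregistration - date_registration as days_registered
    from studentregistration
    order by id_student
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In this sample only student 30268 is flagged, which matches the `Withdrawn` result shown for that student in `studentinfo`.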

In [24]:
F.select('studentvle')

Unnamed: 0,code_module,code_presentation,id_student,id_site,date,sum_click
0,AAA,2013J,28400,546652,-10,4
1,AAA,2013J,28400,546652,-10,1
2,AAA,2013J,28400,546652,-10,1
3,AAA,2013J,28400,546614,-10,11
4,AAA,2013J,28400,546714,-10,1


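***`studentvle` has no obvious primary key: `code_module`, `code_presentation`, `id_student` and `id_site` identify who used what, but the sample above repeats the same combination (even on the same `date`), so these columns are not unique***

***`studentvle` records which student, in which module and presentation, used which VLE item on which day, and with how many clicks***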
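The five sample rows of `studentvle` already contain the same student, site and date more than once, so it is worth checking whether any column combination identifies a row. A minimal sketch of such a check, again on an in-memory SQLite stand-in loaded with only the sample rows (an assumption, not the full 10.6 million row table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table studentvle ("
    "code_module text, code_presentation text, id_student integer, "
    "id_site integer, date integer, sum_click integer)"
)
# The five sample rows shown above.
conn.executemany(
    "insert into studentvle values (?, ?, ?, ?, ?, ?)",
    [
        ("AAA", "2013J", 28400, 546652, -10, 4),
        ("AAA", "2013J", 28400, 546652, -10, 1),
        ("AAA", "2013J", 28400, 546652, -10, 1),
        ("AAA", "2013J", 28400, 546614, -10, 11),
        ("AAA", "2013J", 28400, 546714, -10, 1),
    ],
)

# Combinations that occur more than once, with their row count and total clicks.
query = """
    select code_module, code_presentation, id_student, id_site, date,
           count(*) as n_rows, sum(sum_click) as clicks
    from studentvle
    group by code_module, code_presentation, id_student, id_site, date
    having count(*) > 1
"""
dupes = conn.execute(query).fetchall()
print(dupes)
```

If the full table behaves like the sample, aggregating `sum_click` per student, site and day, as the `clicks` column does here, gives one row per combination for later analysis.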

***


### **IN SUMMARY, THE UNDERSTANDING OF THE DATA IS** ->


- **`studentinfo` has the information of -> each student and their final result in each presentation**

- **`assessments` has the information of -> which module and presentation has which assessment, and the details of that assessment**

- **`courses` has the information of -> just the length of each unique presentation**

- **`vle` has the information of -> which module and presentation each VLE item belongs to, and the details of that item**

- **`studentassessment` has the information of -> each student's attempt at each assessment, with the score and the submission date**

- **`studentvle` has the information of -> which student used which VLE item in which module presentation, on which day, and with how many clicks**

- **`studentregistration` has the information of -> which student registered for which module presentation, when, and whether they later unregistered**
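The keys identified above suggest how the tables relate: `studentinfo` links to `assessments` through `code_module` and `code_presentation`, and `studentassessment` links the two through `id_assessment` and `id_student`. The sketch below joins them on an in-memory SQLite stand-in. It keeps only a few columns and the sample rows shown earlier, so it illustrates the relationships under those assumptions and is not the final analysis query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table studentinfo (code_module text, code_presentation text,
                              id_student integer, final_result text);
    create table assessments (code_module text, code_presentation text,
                              id_assessment integer, assessment_type text);
    create table studentassessment (id_assessment integer, id_student integer, score real);

    -- Sample rows copied from the outputs above (column subset only)
    insert into studentinfo values
        ('AAA', '2013J', 11391, 'Pass'), ('AAA', '2013J', 28400, 'Pass'),
        ('AAA', '2013J', 30268, 'Withdrawn'), ('AAA', '2013J', 31604, 'Pass'),
        ('AAA', '2013J', 32885, 'Pass');
    insert into assessments values ('AAA', '2013J', 1752, 'TMA');
    insert into studentassessment values
        (1752, 11391, 78.0), (1752, 28400, 70.0), (1752, 31604, 72.0), (1752, 32885, 69.0);
    """
)

# One row per student per assessment of their presentation; score is NULL
# when the student has no submission for that assessment.
query = """
    select si.id_student, si.final_result, a.id_assessment, sa.score
    from studentinfo si
    left join assessments a
           on a.code_module = si.code_module
          and a.code_presentation = si.code_presentation
    left join studentassessment sa
           on sa.id_assessment = a.id_assessment
          and sa.id_student = si.id_student
    order by si.id_student
"""
joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

Left joins keep students without any submission, such as the withdrawn student 30268 in this sample, which matters when the target of the analysis is dropout.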