# Load and strip the original data

Original data was exported from the TU Graz online system by Susanne Voller between August and October 2021.  

Data samples stored in the folder `data/original/data_samples` were supplied to the section "Quality Management, Evaluation & Reporting" of TU Graz, which is responsible for data protection issues. Based on these samples, the data was cleared for publication under the condition that student and lecturer identifiers (`student_id` and `lecturer_id`) are hashed.  

In the following script, the raw files exported from the TU Graz online database are read in and stripped of unnecessary columns. Student and lecturer identifiers are encoded with a salted hash function, and the original identifiers are replaced by the hashed identifiers. The stripped and hashed data is stored in the folder `data/raw` for further cleaning and imputation, which is performed in the script `clean_data.ipynb`. 

In [1]:
import pandas as pd
from os.path import join
from bcrypt import gensalt, hashpw

# parallelisation functionality
from multiprocess import Pool
import psutil
from tqdm import tqdm

In [7]:
src = "../../data/original/data_exports"
dst = "../../data/raw/"

In [3]:
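# Hash a student or lecturer ID with the salt generated below. The same salt
# is reused for every file, so an ID maps to the same hash in all tables.
def hash_id(ID):
    hashed_id = hashpw(str(ID).encode('utf-8'), salt=salt)
    # The first 29 characters of a bcrypt hash hold the version, cost factor
    # and salt; only the remaining digest is kept.
    return hashed_id[29:].decode()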

salt = gensalt()
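
Why the salt is generated once and reused in every cell can be illustrated with a small standalone sketch. It uses `hashlib.blake2b` from the standard library as a stand-in for bcrypt (an assumption made here to keep the example fast and dependency-free; the notebook itself uses `bcrypt.hashpw`). The helper name `pseudonymise` and the example ID are hypothetical.

```python
import hashlib
import os

def pseudonymise(raw_id, salt):
    """Return a salted hex digest of an ID (blake2b stand-in for bcrypt)."""
    digest = hashlib.blake2b(
        str(raw_id).encode("utf-8"), salt=salt, digest_size=16
    )
    return digest.hexdigest()

# One salt shared by all tables, analogous to the single gensalt() call.
shared_salt = os.urandom(16)
other_salt = os.urandom(16)

# Hypothetical ID that appears in two different exported files.
in_students = pseudonymise(12345, shared_salt)
in_enrollment = pseudonymise(12345, shared_salt)

# Same ID and same salt give the same pseudonym, so tables stay joinable.
same_salt_matches = in_students == in_enrollment
# A different salt gives an unrelated pseudonym for the same ID.
new_salt_matches = in_students == pseudonymise(12345, other_salt)
```

If the salt were regenerated between cells, the hashed `student_id` values in different output files could no longer be matched against each other.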

## Students

Exported file from database: `Studiendaten.csv`.  
Sample stored in `/original/data_samples/Studiendaten.csv`.

**Original data fields & actions:**
* `ST_PERSON_NR`: rename to `student_id` & hash
* `STUDIENIDENTIFIKATOR`: rename to `study_id`
* `STUDIENBEZEICHNUNG`: rename to `study_name`
* `SEMESTERANZAHL`: rename to `term_number`

In [6]:
# List of studies for every student. A student can have more than one study,
# which will show up as separate entries (row) for the same student_id. Each
# study also has a term number, i.e. the number of semesters the student has
# been enrolled in the given study.
df = pd.read_csv(join(src, 'Studiendaten.csv'), encoding='latin_1')
df = df.rename(columns={
    'ST_PERSON_NR':'student_id', # unique student identifier
    'STUDIENIDENTIFIKATOR':'study_id', # unique study identifier
    'STUDIENBEZEICHNUNG':'study_name', # (german) name of the study
    'SEMESTERANZAHL':'term_number' # number of terms a student has been enrolled
})

# hash student IDs with the given salt; imap (unlike imap_unordered) yields
# results in input order, so the hashes line up with the rows of df
hashed_IDs = []
with Pool(16) as pool:
    for hashed_ID in tqdm(
            pool.imap(func=hash_id, iterable=df["student_id"]),
            total=len(df["student_id"])
        ):
        hashed_IDs.append(hashed_ID)
    
df["student_id"] = hashed_IDs   
df.to_csv(join(dst, "students.csv"), index=False)

100%|██████████| 24475/24475 [08:02<00:00, 50.74it/s]
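
Because the hashed identifiers are assigned back to the DataFrame column by position, the order in which the pool returns results matters. The following sketch contrasts `imap` and `imap_unordered`. It uses the standard library's `multiprocessing.pool.ThreadPool`, which offers the same two methods, instead of `multiprocess.Pool` (an assumption made to keep the example self-contained); `slow_identity` is a hypothetical stand-in for `hash_id`.

```python
import time
from multiprocessing.pool import ThreadPool

def slow_identity(x):
    # Earlier items sleep longer, so later items tend to finish first.
    time.sleep(0.05 * (3 - x))
    return x

items = [0, 1, 2, 3]

with ThreadPool(4) as pool:
    # imap yields results in the order of the input iterable.
    ordered = list(pool.imap(slow_identity, items))
    # imap_unordered yields the same values, but in completion order.
    unordered = list(pool.imap_unordered(slow_identity, items))
```

Only the ordered variant is safe when the results are written back row by row.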


## Lecturers

Exported file from database: `Bedienstete_mit_DV_an_Org.csv`  
Sample stored in `/original/data_samples/Bedienstete_mit_DV_an_Org.csv`.  

**Original data fields & actions:**
* `PERSON_NR`: rename to `lecturer_id` & hash
* `TUG_NEW.PUORG.GETNAME(A.ORG_NR)`: rename to `organisation_name`
* `ORG_NR`: drop

In [8]:
# Mapping of lecturers to organisations (institute, faculty). A lecturer can
# be associated with more than one organisation.
df = pd.read_csv(join(src, 'Bedienstete_mit_DV_an_Org.csv'),
                            encoding='latin_1')
df = df.rename(columns={
    'PERSON_NR':'lecturer_id', # unique lecturer id
    'TUG_NEW.PUORG.GETNAME(A.ORG_NR)':'organisation_name' # German org name
})

df = df.drop(columns=[
    "ORG_NR" # organisation ID, not needed
])

# hash lecturer IDs with the given salt; imap (unlike imap_unordered) yields
# results in input order, so the hashes line up with the rows of df
hashed_IDs = []
with Pool(16) as pool:
    for hashed_ID in tqdm(
            pool.imap(func=hash_id, iterable=df["lecturer_id"]),
            total=len(df["lecturer_id"])
        ):
        hashed_IDs.append(hashed_ID)
    
df["lecturer_id"] = hashed_IDs
df.to_csv(join(dst, "lecturers.csv"), index=False)

100%|██████████| 5707/5707 [01:52<00:00, 50.82it/s]


## Courses

Exported file from database: `LV_cleaned.csv`.  
Sample stored in `original/data_samples/LV_cleaned.csv`.   

**Original data fields & actions:**
* `STP_SP_NR`: rename to `course_id`
* `STP_SP_TITEL_ENGL`: rename to `course_name`
* `STP_LV_ART_KURZ`: rename to `course_type`, provide dictionary
* `STP_SP_LVNR`: drop
* `SJ_NAME`: drop
* `SEMESTER_KB`: drop
* `STP_SP_TITEL`: drop
* `STP_SP_SST`: drop
* `STP_LV_ART_NAME`: drop
* `BETREUENDE_ORG_NR`: drop
* `BETREUENDE_ORG_NAME`: drop
* `Unnamed: 11`: drop

In [9]:
# List of lectures with information about their type, their name, their module
# (this is only relevant for how studies are composed at TU Graz) and the 
# organisational unit (institute, faculty) which is responsible for the lecture.

# The list of lectures was manually cleaned, since for some rows, the entries 
# starting from the column STP_LV_ART_KURZ were shifted to the right by one 
# column. The originally exported file is "/original/data_exports/LV.csv".

df = pd.read_csv(join(src, 'LV_cleaned.csv'), encoding="utf-8")
df = df.rename(columns={
    'STP_SP_NR':'course_id', # unique course id
    'STP_SP_TITEL_ENGL':'course_name', # english lecture name
    'STP_LV_ART_KURZ':'course_type', # type of the lecture (tutorial, lab, ...)
})
df = df.drop(columns=[
    "SJ_NAME",
    "SEMESTER_KB",
    "STP_SP_SST",
    "STP_LV_ART_NAME",
    "STP_SP_TITEL",
    'STP_SP_LVNR',
    'BETREUENDE_ORG_NR',
    'BETREUENDE_ORG_NAME',
    "Unnamed: 11"
])
df.to_csv(join(dst, "courses.csv"), index=False)
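
Most exports in this notebook are read with `encoding='latin_1'`, whereas `LV_cleaned.csv` is read as UTF-8. The sketch below, which uses only built-in string methods, illustrates why the codec has to match the file: German umlauts are stored differently in the two encodings. The example word is taken from one of the file names and is used purely for illustration.

```python
# "Prüfungen" contains an umlaut, which Latin-1 stores as the single byte 0xFC.
latin1_bytes = "Prüfungen".encode("latin_1")

# Decoding with the matching codec restores the text.
decoded_ok = latin1_bytes.decode("latin_1")

# Decoding the same bytes as UTF-8 fails, because 0xFC is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```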

## Course enrollment by students

Exported file from database: `Studierende_pro_LV_mit_Idf.csv`.  
Sample stored in `original/data_samples/Studierende_pro_LV_mit_Idf.csv`.

**Original data fields & actions:**
* `ST_PERSON_NR`: rename to `student_id` & hash
* `STUDIENIDENTIFIKATOR`: rename to `study_id`
* `STP_SP_NR`: rename to `course_id`
* `LV_GRP_NR`: rename to `group_id`

In [10]:
# List of enrolled courses of the WiSe 2019/20 for every student. A course
# can have several groups (for example for tutorial parts). The group
# identifier is also listed for every student. It is not completely unique
# as there are a number of overlapping groups (for example same time, 
# different rooms). These are disambiguated at a later point in the data
# cleaning process.
# The data also includes the identifier of the study through which the 
# student enrolled in a given lecture. 
df = pd.read_csv(join(src, 'Studierende_pro_LV_mit_Idf.csv'))
df = df.rename(columns={
    'ST_PERSON_NR':'student_id', # unique student identifier
    'STUDIENIDENTIFIKATOR':'study_id', # unique study identifier
    'STP_SP_NR':'course_id', # unique course identifier
    'LV_GRP_NR':'group_id', # (almost) unique group identifier 
    }) 

# hash student IDs with the given salt; imap (unlike imap_unordered) yields
# results in input order, so the hashes line up with the rows of df
hashed_IDs = []
with Pool(16) as pool:
    for hashed_ID in tqdm(
            pool.imap(func=hash_id, iterable=df["student_id"]),
            total=len(df["student_id"])
        ):
        hashed_IDs.append(hashed_ID)
    
df["student_id"] = hashed_IDs   
df.to_csv(join(dst, "course_enrollment.csv"), index=False)

100%|██████████| 85505/85505 [28:50<00:00, 49.41it/s]
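
The enrollment table has one row per student and course, so the same `student_id` is hashed many times. A possible optimisation, not used in the original notebook, is to hash every distinct ID once and map the result back onto the rows. The sketch below assumes pandas is available and uses a truncated SHA-256 digest as a fast, hypothetical stand-in for the bcrypt-based `hash_id`; the example table is made up.

```python
import hashlib
import pandas as pd

def slow_hash(raw_id):
    # Stand-in for the bcrypt-based hash_id; any deterministic function works.
    return hashlib.sha256(str(raw_id).encode("utf-8")).hexdigest()[:12]

# Hypothetical enrollment table in which student IDs repeat across rows.
df_demo = pd.DataFrame({
    "student_id": [7, 7, 9, 7, 9],
    "course_id": [101, 102, 101, 103, 104],
})

# Hash each distinct ID once ...
unique_ids = df_demo["student_id"].unique()
lookup = {raw_id: slow_hash(raw_id) for raw_id in unique_ids}

# ... then map the hashes back onto every row, preserving row order.
df_demo["student_id"] = df_demo["student_id"].map(lookup)
```

With this approach the number of expensive hash calls equals the number of distinct students rather than the number of rows.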


## Exam enrollment by students

Exported file from database: `Prüfungen-2.csv`.  
Sample stored in `/original/data_exports/Prüfungen-2.csv`.

**Original data fields & actions:**
* `PV_TERM_NR`: rename to `exam_id`
* `ST_PERSON_NR`: rename to `student_id` & hash
* `STUDIENIDENTIFIKATOR`: rename to `study_id`
* `STP_SP_NR`: rename to `course_id`
* `PRUEFUNGSDATUM`: drop

In [11]:
df = pd.read_csv(join(src, 'Prüfungen-2.csv'), encoding='latin_1')
df = df.rename(columns={
    'PV_TERM_NR':'exam_id', # unique exam ID
    'ST_PERSON_NR':'student_id', # unique student ID
    'STUDIENIDENTIFIKATOR':'study_id', # unique study ID
    'STP_SP_NR':'course_id'}) # unique course ID

df = df.drop(columns=[
    'PRUEFUNGSDATUM'
])

# hash student IDs with the given salt; imap (unlike imap_unordered) yields
# results in input order, so the hashes line up with the rows of df
hashed_IDs = []
with Pool(12) as pool:
    for hashed_ID in tqdm(
            pool.imap(func=hash_id, iterable=df["student_id"]),
            total=len(df["student_id"])
        ):
        hashed_IDs.append(hashed_ID)

df["student_id"] = hashed_IDs
df.to_csv(join(dst, "exam_enrollment.csv"), index=False)

100%|██████████| 57815/57815 [23:45<00:00, 40.57it/s]


## Course supervision

Exported file from database: `Lehrende.csv`.  
Sample stored in `/original/data_samples/Lehrende.csv`.

**Original data fields & actions:**
* `PERSON_NR`: rename to `lecturer_id` & hash
* `STP_SP_NR`: rename to `course_id`
* `LV_GRP_NR`: rename to `group_id`

In [12]:
# List of lecturers who are responsible for courses and groups within
# courses. As in the list of students, the group_id is disambiguated
# later in the data cleaning process.
df = pd.read_csv(join(src, 'Lehrende.csv'))
df = df.rename(columns={
    'PERSON_NR':'lecturer_id', # unique lecturer id
    'STP_SP_NR':'course_id', # unique lecture id
    'LV_GRP_NR':'group_id'# (almost) unique group id
})

# hash lecturer IDs with the given salt; imap (unlike imap_unordered) yields
# results in input order, so the hashes line up with the rows of df
hashed_IDs = []
with Pool(12) as pool:
    for hashed_ID in tqdm(
            pool.imap(func=hash_id, iterable=df["lecturer_id"]),
            total=len(df["lecturer_id"])
        ):
        hashed_IDs.append(hashed_ID)
    
df["lecturer_id"] = hashed_IDs
df.to_csv(join(dst, "course_supervision.csv"), index=False)

100%|██████████| 13212/13212 [05:21<00:00, 41.12it/s]


## Exam supervision

Exported file from database: `Prüfungstermine_mit_Räumen.csv`.  
Sample stored in `/original/data_samples/Prüfungstermine_mit_Räumen.csv`.

**Original data fields & actions:**
* `PV_TERM_NR`: rename to `exam_id`
* `PERSON_NR`: rename to `lecturer_id` & hash
* `STP_SP_NR`: rename to `course_id`
* `DATUM`: drop
* `BEGINNZEIT`: drop
* `ENDEZEIT`: drop
* `RES_NR`: drop

In [13]:
df = pd.read_csv(join(src, 'Prüfungstermine_mit_Räumen.csv'), encoding='latin_1',
                     parse_dates=['DATUM'], dayfirst=True)
df = df.rename(columns={
    'PV_TERM_NR':'exam_id', # unique exam ID
    'PERSON_NR':'lecturer_id', # unique lecturer ID
    'STP_SP_NR':'course_id'}) # unique course ID

df = df.drop(columns=[
    'DATUM',
    'BEGINNZEIT',
    'ENDEZEIT',
    'RES_NR',
])

# hash lecturer IDs with the given salt; imap (unlike imap_unordered) yields
# results in input order, so the hashes line up with the rows of df
hashed_IDs = []
with Pool(12) as pool:
    for hashed_ID in tqdm(
            pool.imap(func=hash_id, iterable=df["lecturer_id"]),
            total=len(df["lecturer_id"])
        ):
        hashed_IDs.append(hashed_ID)
    
df["lecturer_id"] = hashed_IDs
df.to_csv(join(dst, "exam_supervision.csv"), index=False)

100%|██████████| 5633/5633 [02:11<00:00, 42.77it/s]


## Course dates and rooms

Exported file from database: `Termine_mit_LV_Bezug.csv`.  
Sample stored in `/original/data_samples/Termine_mit_LV_Bezug.csv`.

**Original data fields & actions:**
* `RES_NR`: rename to `room_id`
* `DATUM_AM`: rename to `date`
* `ZEIT_VON`: rename to `start_time`
* `ZEIT_BIS`: rename to `end_time`
* `STP_SP_NR`: rename to `course_id`
* `LV_GRP_NR`: rename to `group_id`

In [14]:
# events (start time, end time, room) for every course and group in WiSe 2019/20
df = pd.read_csv(join(src, 'Termine_mit_LV_Bezug.csv'),
                parse_dates=['DATUM_AM', 'ZEIT_VON', 'ZEIT_BIS'], dayfirst=True)
df = df.rename(columns={
    'RES_NR':'room_id', # unique room id
    'DATUM_AM':'date', # date
    'ZEIT_VON':'start_time', # start time
    'ZEIT_BIS':'end_time', # end time
    'STP_SP_NR':'course_id', # unique lecture id
    'LV_GRP_NR':'group_id'# (almost) unique group id
})
df.to_csv(join(dst, "course_dates.csv"), index=False)
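
The `dayfirst=True` argument controls how ambiguous date strings are interpreted. The sketch below shows the effect with `pd.to_datetime`; the day.month.year strings are hypothetical, since the exact date format of the export is not shown here.

```python
import pandas as pd

# Hypothetical date strings in day.month.year order.
dates = pd.Series(["03.10.2019", "04.11.2019"])

# With dayfirst=True the first number is read as the day, not the month.
parsed = pd.to_datetime(dates, dayfirst=True)

first = parsed.iloc[0]
second = parsed.iloc[1]
```

Without `dayfirst=True`, a string such as `03.10.2019` could be read as 10 March instead of 3 October.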

## Exam dates and rooms

Exported file from database: `Prüfungstermine_mit_Räumen.csv`.  
Sample stored in `/original/data_samples/Prüfungstermine_mit_Räumen.csv`.

**Original data fields & actions:**
* `PV_TERM_NR`: rename to `exam_id`
* `DATUM`: rename to `date`
* `BEGINNZEIT`: rename to `start_time`
* `ENDEZEIT`: rename to `end_time`
* `RES_NR`: rename to `room_id`
* `STP_SP_NR`: rename to `course_id`
* `PERSON_NR`: drop

In [15]:
df = pd.read_csv(join(src, 'Prüfungstermine_mit_Räumen.csv'), encoding='latin_1',
                     parse_dates=['DATUM'], dayfirst=True)
df = df.rename(columns={
    'PV_TERM_NR':'exam_id', # unique exam ID
    'DATUM':'date', # date of the exam
    'BEGINNZEIT':'start_time', # start time of the exam
    'ENDEZEIT':'end_time', # end time of the exam
    'RES_NR':'room_id', # unique room ID
    'STP_SP_NR':'course_id'}) # unique course ID

df = df.drop(columns=[
    'PERSON_NR',
])

# PERSON_NR was dropped above, so no identifiers remain that need hashing
df.to_csv(join(dst, "exam_dates.csv"), index=False)

## Rooms

Exported file from database: `Räume_cleaned.csv`.  
Sample stored in `/original/data_samples/Räume_cleaned.csv`.

**Original data fields & actions:**
* `RES_NR`: rename to `room_id`
* `RAUM_SITZPLAETZE`: rename to `seats`
* `QUADRATMETER`: rename to `area`
* `RAUM_GEBAEUDE_BEREICH_NAME`: rename to `campus`
* `STRASSE`: rename to `address`
* `PLZ`: rename to `postal_code`
* `ORT`: rename to `city`
* `RAUM_CODE`: drop
* `RAUM_ZUSATZBEZEICHNUNG`: drop

In [None]:
# List of rooms and information about them (number of seats, square meters).
# TU Graz has three campuses: Alte Technik, Neue Technik and Inffeldgasse. The
# mapping of every room to a campus is also stored.

# Information for rooms outside TU Graz premises was missing. Jana Lasser 
# manually searched for and filled in room information for rooms at Uni Graz 
# and added the information to the file /data/raw/Räume.csv. The original file 
# is stored in /data/original/data_exports/Räume.csv. These rooms are excluded

df = pd.read_csv(join(src, 'Räume_cleaned.csv'), 
                    encoding='latin_1')
df = df.rename(columns={
    'RES_NR':'room_id', # unique room id
    'RAUM_SITZPLAETZE':'seats', # number of seats in the room
    'QUADRATMETER':'area', # number of square meters in the room
    'RAUM_GEBAEUDE_BEREICH_NAME':'campus', # campus where the room is located
    'STRASSE':'address', # address (street & number)
    'PLZ':'postal_code', # post code (always 8010)
    'ORT':'city', # city (always Graz)
})
df = df.drop(columns=[
    "RAUM_CODE",
    "RAUM_ZUSATZBEZEICHNUNG"
])
df.to_csv(join(dst, "rooms.csv"), index=False)