In [1211]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import re

In [1212]:
df = pd.read_csv("ODI-2025.csv", sep=';') #I added sep = ';' because by defalut pandas
#is trying to read the file assuming comma as the delimiter, and we have ;
df.head() #printing the first 5 rows of the table 

Unnamed: 0,Timestamp,What programme are you in?,Have you taken a course on machine learning?,Have you taken a course on information retrieval?,Have you taken a course on statistics?,Have you taken a course on databases?,What is your gender?,I have used ChatGPT to help me with some of my study assignments,When is your birthday (date)?,How many students do you estimate there are in the room?,What is your stress level (0-100)?,How many hours per week do you do sports (in whole hours)?,Give a random number,Time you went to bed Yesterday,What makes a good day for you (1)?,What makes a good day for you (2)?
0,4-1-2025 12:17:07,MSc Artificial Intelligence,yes,unknown,mu,ja,male,yes,01-01-1888,400,78,0,928393.0,00:00,Food,Travel
1,4-1-2025 12:17:14,Artificial Intelligence,yes,1,sigma,ja,female,yes,31/01/2002,321,1000,2,31.416,12:30,sun,coffee
2,4-1-2025 12:17:17,Econometrics,yes,1,mu,ja,male,not willing to say,September,200,101,4,5.0,0:30,Zonnetje,Aperol
3,4-1-2025 12:17:21,Econometrics - Data Science,yes,0,mu,nee,male,yes,17/10/2003,350,60,6,37.0,23:00,Sun,Sun
4,4-1-2025 12:17:24,Bioinformatics’s & Systems Biology,yes,1,mu,ja,male,yes,19/04/2000,500,50,8,1.0,12,-,-


As we can notice, the survey took place on first of april. Students were asked various questions, such as the program they are enrolled in, whether or not they took any machine learning, information retrieval or databases courses. Some personal information was asked as well, such as gender, date of birth, stress level, hours per week they spend on sports, time they went to bed and what makes them happy. Let us dive into exploring each column separately, to understand the data better. 

In [1213]:
df.columns # Here we print all the possible column names, provided in the data set, in order to prevent typos. 

Index(['Timestamp', 'What programme are you in?',
       'Have you taken a course on machine learning?',
       'Have you taken a course on information retrieval?',
       'Have you taken a course on statistics?',
       'Have you taken a course on databases?', 'What is your gender?',
       'I have used ChatGPT to help me with some of my study assignments ',
       'When is your birthday (date)?',
       'How many students do you estimate there are in the room?',
       'What is your stress level (0-100)?',
       'How many hours per week do you do sports (in whole hours)? ',
       'Give a random number', 'Time you went to bed Yesterday',
       'What makes a good day for you (1)?',
       'What makes a good day for you (2)?'],
      dtype='object')

In [1214]:
programs = df['What programme are you in?'].unique()
print(programs)

['MSc Artificial Intelligence' 'Artificial Intelligence ' 'Econometrics'
 'Econometrics - Data Science' 'Bioinformatics’s & Systems Biology'
 'computer science' 'Masters in AI' 'Ms cs' 'AI'
 'Computer Science (joint degree)' 'NPN'
 'Masters Artificial Intelligence ' 'Computer science' 'Computer Science'
 'Artificial Intelligence' 'Master AI'
 'Bioinformatics and systems biology ' 'Master’s Business Analytics'
 'Artificial intelligence ' 'Masters in Artificial Intelligence'
 'Computational Science ' 'Master econometrics' 'AI master'
 'Artificial Intelligences' 'Msc AI' 'Master Artificial Intelligence'
 'Computational Science' 'MSc in Finance and Technology'
 'Master Computer Science' 'Human Language Technology' 'Master’s in AI'
 'Quantitative Finance' 'Business analytics '
 'Econometrics and operations research ' 'Master Artifical Intelligence'
 'Security ' 'MSc Computer Science SEG' 'Business Analytics'
 'Bioinformatics' 'CS' 'AI masters' 'Computational science'
 'Econometrics ' 'Maste

As we can notice, there are some implicit duplicates in data, for example, it is clear that 'computer science' and 'Computer science' and 'Computer Science MSc' correspond to the same degree. 
We can start cleaning this data by putting all the letters to lower case, and checking how many duplicates were eliminated. First lets print the number of unique elements before cleaning.

In [1215]:
num_b4_cleaning = print(len(programs)) 

118


So, we see that there are for now 118 distinct programs that students are enrolled to. Lets try to get rid of implicit duplicates. Firstly, we will put all the data to lower case and delete the extra space between words. 

As we can notice, there are more implicit duplicates. Such as 'ai', 'ai master', artificial intellegence etc. We can of course merge it in one. 

In [1216]:
def group_program(name):
    name = name.strip().lower().replace('&', 'and').replace('-', ' ') # Here we make the study name lower case and replace & with "and"

    if 'artificial' in name or 'ai' in name or 'intelligence' in name:
        return 'artificial intelligence'
    elif 'computer science' in name or 'comp sci' in name:
        return 'computer science'
    elif 'bio' in name:
        return 'bioinformatics and systems biology'
    elif 'business analytics' in name or 'ba' in name:
        return 'business analytics'
    elif  'fintech' in name or 'finance' in name and 'technology' in name:
        return 'finance and technology'
    elif 'green' in name:
        return 'software engineering and green it'
    elif 'econometrics and operations research' in name or 'eor' in name:
        return  'econometrics and operations research'
    elif 'computational science' in name:
        return 'computational science'
    elif 'econometrics   data science' in name:
        return 'econometrics and data science'
    elif 'human language technology' in name:
        return 'human language technology'
    else:
        return name 
    
df['What programme are you in?']= df['What programme are you in?'].apply(group_program)

print(df['What programme are you in?'].value_counts())


What programme are you in?
artificial intelligence                                    100
computer science                                            36
bioinformatics and systems biology                          21
computational science                                       17
business analytics                                          17
econometrics                                                 9
cs                                                           6
finance and technology                                       6
econometrics and operations research                         5
econometrics and data science                                5
master econometrics                                          2
big data engineering                                         2
human language technology                                    2
software engineering and green it                            2
msc econometrics                                             2
security                    

In [1217]:
print(len(df['What programme are you in?'].unique()))

27


Now let us examine the answers to the whole boolean questions. 

In [1218]:
machine_learning = df['Have you taken a course on machine learning?'].unique()
print(machine_learning)

['yes' 'no' 'unknown']


In [1219]:
information_retrieval = df['Have you taken a course on information retrieval?'].unique()
print(information_retrieval)

['unknown' '1' '0']


In [1220]:
statistics = df['Have you taken a course on statistics?'].unique()
print(statistics)

['mu' 'sigma' 'unknown']


In [1221]:
database = df['Have you taken a course on databases?'].unique()
print(database)

['ja' 'nee' 'unknown']


In [1222]:
chatgpt = df['I have used ChatGPT to help me with some of my study assignments '].unique()
print(chatgpt)

['yes' 'not willing to say' 'no']


Suggestion is to replace all positive values with yes and negative with no, not willing to say, we will replace with unknown. 

In [1223]:
df['Have you taken a course on information retrieval?'] = df['Have you taken a course on information retrieval?'].replace({
    '1': 'yes',
    '0': 'no',
    'not willing to say': 'unknown'
})

df['Have you taken a course on statistics?'] = df['Have you taken a course on statistics?'].replace({
    'mu': 'no',
    'sigma': 'yes'  
})
df['Have you taken a course on databases?'] = df['Have you taken a course on databases?'].replace({
    'ja': 'yes',
    'nee': 'no',
})
df['I have used ChatGPT to help me with some of my study assignments '] = df['I have used ChatGPT to help me with some of my study assignments '].replace({
    'not willing to say' : 'unkown'
})

Now lets fix the column with the birthday date. Firstly, we notice that sometimes people use DD-MM-YYYY for the date format, but sometimes they can also use DD.MM.YYYY or DD/MM/YYYY. Let us start by fixing this small issue.  Also later on, we decided to change here some texted months to their numbers, e.g January or June.

In [1224]:
df['When is your birthday (date)?'] = df[ 'When is your birthday (date)?'].astype(str).str.strip()
replacements = {
    '.': '-', 
    '/' : '-',
    ' ': '-', 
    'January' : '01', 
    'augustus' : '08',
    'Dec' : '12', 
    'October' : '10', 
    'juni' : '06', 
    'June' : '06', 
    'July' : '07',
    'September' : '09',
    '070' : '07' #We have noticed that one of the dates has a typo, eventho the whole date seems normal,
     # but the month is clearly mistyped and should be july, so 07
    

}
for old, new in replacements.items():
    df['When is your birthday (date)?'] = df['When is your birthday (date)?'].str.replace(old, new, regex=False)
print(df['When is your birthday (date)?'].to_list())

['01-01-1888', '31-01-2002', '09', '17-10-2003', '19-04-2000', 'Tomorrow', '25-10-1999', '1-april', '29-01-2001', '01082000', '19-10-1999', '1-1-1999', '01012000', '06-15-2001', '01', '27-02-2001', '10-05-1982', '16-12-1998', '23-06-2002', '10-08-2000', '30-12-2003', '11-August', 'Idk', '04-19-2000', '19-07-2003', '19-February', '19-05-2000', '09-14', '1999', '11-12-2001', '24-01-1999', '29-07-2000', '24-de-Diciembre', '01-06-2000', '11-11-00', '05-11-1997', '27-11-2002', '20-07-2001', '2000', '23-12-2002', '16-03-2002', '09-05-2002', '18-05-2003', '11-11-2002', '2000', '26', '29th-09-2001', '21-11-2002', '12ember-14th', '2001-09-16', '1997', '16-08-1996', '69-69-2069', '26-11-1998', '23-maart', '14-09-2000', '-', '20-05-2001', '28-12-1999', '01-16th', '23-05', '19-07-1997', '10-12-1994', '20-06', '30-09-2002', '01-01-1900', '23', '15-02', '13-03-2002', '08-10-2001', '21-10-2001', '10-12-1999', '21-06-2001', '28-02-2001', '28-06-2002', '10112000', '08031998', '01-9th', '28-05', '26-10-

Now, we have noticed that some of the students put their date of birth in the format DDMMYYYY or YYYYMMDD. Let us try to fix it. 

In [1225]:
# Ensure strings and clean whitespace
col = 'When is your birthday (date)?'
df[col] = df[col].astype(str).str.strip()

# Mask: match 8-digit strings only
is_eight_digits = df[col].str.match(r'^\d{8}$')

# Convert only those entries
converted = pd.to_datetime(df.loc[is_eight_digits, col], format='%d%m%Y', errors='coerce')

# Format to DD-MM-YYYY strings (skip NaT)
df.loc[is_eight_digits & converted.notna(), col] = converted.dt.strftime('%d-%m-%Y')

# Second type of 8-digit strings (starts with year)
is_yyyymmdd = df[col].str.match(r'^\d{8}$') & df[col].str.startswith(('19', '20'))

converted2 = pd.to_datetime(df.loc[is_yyyymmdd, col], format='%Y%m%d', errors='coerce')
df.loc[is_yyyymmdd & converted2.notna(), col] = converted2.dt.strftime('%d-%m-%Y')


In [1226]:
# Here we will fix the dates of the type 1-1-1999. 
col = 'When is your birthday (date)?'

# Try parsing everything that looks like a date with dashes (e.g. 1-1-1999)
mask_dashed = df[col].str.match(r'^\d{1,2}-\d{1,2}-\d{4}$')

# Convert and reformat with zero-padding
df.loc[mask_dashed, col] = pd.to_datetime(df.loc[mask_dashed, col], dayfirst=True, errors='coerce') \
                              .dt.strftime('%d-%m-%Y')


In [1227]:
print(df['When is your birthday (date)?'].to_list())

['01-01-1888', '31-01-2002', '09', '17-10-2003', '19-04-2000', 'Tomorrow', '25-10-1999', '1-april', '29-01-2001', '01-08-2000', '19-10-1999', '01-01-1999', '01-01-2000', nan, '01', '27-02-2001', '10-05-1982', '16-12-1998', '23-06-2002', '10-08-2000', '30-12-2003', '11-August', 'Idk', nan, '19-07-2003', '19-February', '19-05-2000', '09-14', '1999', '11-12-2001', '24-01-1999', '29-07-2000', '24-de-Diciembre', '01-06-2000', '11-11-00', '05-11-1997', '27-11-2002', '20-07-2001', '2000', '23-12-2002', '16-03-2002', '09-05-2002', '18-05-2003', '11-11-2002', '2000', '26', '29th-09-2001', '21-11-2002', '12ember-14th', '2001-09-16', '1997', '16-08-1996', nan, '26-11-1998', '23-maart', '14-09-2000', '-', '20-05-2001', '28-12-1999', '01-16th', '23-05', '19-07-1997', '10-12-1994', '20-06', '30-09-2002', '01-01-1900', '23', '15-02', '13-03-2002', '08-10-2001', '21-10-2001', '10-12-1999', '21-06-2001', '28-02-2001', '28-06-2002', '10-11-2000', '08-03-1998', '01-9th', '28-05', '26-10-2001', '21-07-200

In [1228]:
# Create a new parsed column
df['birthday_parsed'] = pd.NaT  # start with all NaT

# ddmmyyyy
df.loc[is_ddmmyyyy, 'birthday_parsed'] = pd.to_datetime(
    df.loc[is_ddmmyyyy, col], format='%d%m%Y', errors='coerce'
)

# yyyymmdd
df.loc[is_yyyymmdd, 'birthday_parsed'] = pd.to_datetime(
    df.loc[is_yyyymmdd, col], format='%Y%m%d', errors='coerce'
)
print(df['When is your birthday (date)?'].to_list())

# Delete unnecerary column
df.drop(columns=['birthday_parsed'], inplace=True)


['01-01-1888', '31-01-2002', '09', '17-10-2003', '19-04-2000', 'Tomorrow', '25-10-1999', '1-april', '29-01-2001', '01-08-2000', '19-10-1999', '01-01-1999', '01-01-2000', nan, '01', '27-02-2001', '10-05-1982', '16-12-1998', '23-06-2002', '10-08-2000', '30-12-2003', '11-August', 'Idk', nan, '19-07-2003', '19-February', '19-05-2000', '09-14', '1999', '11-12-2001', '24-01-1999', '29-07-2000', '24-de-Diciembre', '01-06-2000', '11-11-00', '05-11-1997', '27-11-2002', '20-07-2001', '2000', '23-12-2002', '16-03-2002', '09-05-2002', '18-05-2003', '11-11-2002', '2000', '26', '29th-09-2001', '21-11-2002', '12ember-14th', '2001-09-16', '1997', '16-08-1996', nan, '26-11-1998', '23-maart', '14-09-2000', '-', '20-05-2001', '28-12-1999', '01-16th', '23-05', '19-07-1997', '10-12-1994', '20-06', '30-09-2002', '01-01-1900', '23', '15-02', '13-03-2002', '08-10-2001', '21-10-2001', '10-12-1999', '21-06-2001', '28-02-2001', '28-06-2002', '10-11-2000', '08-03-1998', '01-9th', '28-05', '26-10-2001', '21-07-200

Now, lets move on to the stress-level column. 

In [1229]:
df['What is your stress level (0-100)?'] = pd.to_numeric(df['What is your stress level (0-100)?'], errors='coerce')
min_stress = df['What is your stress level (0-100)?'].min()
max_stress = df['What is your stress level (0-100)?'].max()
mean_stress =  df['What is your stress level (0-100)?'].mean()
print(min_stress)
print(max_stress)
print(mean_stress)

-10000.0
2.14748365e+18
9023460714286198.0


As we can notice, many records were given outside of the possible interval.  (DECIDE WHAT TO DO W IT (DELETE EVERYTHING OTSIDE 0-100? REPLACE OUTSIDERES WITH POSSIBLE MIN/MAX))

Let's move on to gender.

In [1230]:
print(df['What is your gender?'].unique())

['male' 'female' 'gender fluid' 'not willing to answer' 'intersex'
 'non-binary' 'other']


Nothing wrong here. We will move on to the time people go to sleep.

In [1231]:
# Printing all distinct values
print(df['Time you went to bed Yesterday'].unique())


['00:00' '12:30' '0:30' '23:00' '12' '5am' '12:00' '12am' '10:37' '0200'
 '11 pm' '23h45' '9 am' '9' '2 am' '01.00' '23.30' '23:16' '2' '1:00'
 '23.00' '1 am' '12:30 PM' '23:30' '00:30' '0.30' '1:00 am' '23:57'
 '05:00' '4:00' '22:30' '00:45' '01:30' '23:40' '3' '2am' '1am' '2:00'
 '23:59' '01:00' '04:00' '23-00' '4am' '00.30' '2.30' '1 AM' '12:00 pm'
 '22.00' '5' '11' '12:30am' '2300' '2 pm' '23' '23:55' '23:45' '1:30am'
 '12:34' '1' '00:40' '1:30' '12:45' '11:35' '23:25' '21:45' '7pm'
 '11:33 PM' '22:40' 'Midnight' '3AM' '03:00' '3am' '01:23' '8' '00:31'
 '3:54' '3 AM x)' '02:00' '11:30pm' '22:00' '5:00am' '12.30'
 'around midnight' '23u30' '1.22am' '0:00' '3:00 ' '3:00' '4' '1:03 '
 '00:10' '1:37' '11:00' '12.00' '00:30 AM' '00.15' '11:34' '5 AM' '00:33'
 '00:15' '4:30' '22:45' '9:30' '23:15' '02:15' '21.30' '12.30am' '21:30'
 '10' '00:54' '1743502757' '0 AD ']


In [1232]:
# Normalize text
col = 'Time you went to bed Yesterday'
df[col] = df[col].astype(str).str.strip().str.lower().str.replace(' ', '', regex=False)

# Now lets do some visible replacements.
replacements = {
    'midnight': '00:00',
    'aroundmidnight': '00:00',
    'noon': '12:00',
    'a.m.': 'am',
    'p.m.': 'pm',
    'amx': 'am',
    '0ad': '',
}

for old, new in replacements.items():
    df[col] = df[col].str.replace(old, new, regex=False)

# Parse 12-hour (AM/PM) format
parsed = pd.to_datetime(df[col], format='%I:%M%p', errors='coerce')

#  Fill in unparsed with 24-hour format
mask_unparsed = parsed.isna()
parsed_24 = pd.to_datetime(df.loc[mask_unparsed, col], format='%H:%M', errors='coerce')
parsed.loc[mask_unparsed] = parsed_24

# Format back to HH:MM and write to same column
df[col] = parsed.dt.strftime('%H:%M')

#Print the rsult 
print(df[col].unique())


['00:00' '12:30' '00:30' '23:00' nan '12:00' '10:37' '23:16' '01:00'
 '23:30' '23:57' '05:00' '04:00' '22:30' '00:45' '01:30' '23:40' '02:00'
 '23:59' '23:55' '23:45' '12:34' '00:40' '12:45' '11:35' '23:25' '21:45'
 '23:33' '22:40' '03:00' '01:23' '00:31' '03:54' '22:00' '01:03' '00:10'
 '01:37' '11:00' '11:34' '00:33' '00:15' '04:30' '22:45' '09:30' '23:15'
 '02:15' '21:30' '00:54']


In [1233]:
df.head(10)

Unnamed: 0,Timestamp,What programme are you in?,Have you taken a course on machine learning?,Have you taken a course on information retrieval?,Have you taken a course on statistics?,Have you taken a course on databases?,What is your gender?,I have used ChatGPT to help me with some of my study assignments,When is your birthday (date)?,How many students do you estimate there are in the room?,What is your stress level (0-100)?,How many hours per week do you do sports (in whole hours)?,Give a random number,Time you went to bed Yesterday,What makes a good day for you (1)?,What makes a good day for you (2)?
0,4-1-2025 12:17:07,artificial intelligence,yes,unknown,no,yes,male,yes,01-01-1888,400,78.0,0,928393.0,00:00,Food,Travel
1,4-1-2025 12:17:14,artificial intelligence,yes,yes,yes,yes,female,yes,31-01-2002,321,1000.0,2,31.416,12:30,sun,coffee
2,4-1-2025 12:17:17,econometrics,yes,yes,no,yes,male,unkown,09,200,101.0,4,5.0,00:30,Zonnetje,Aperol
3,4-1-2025 12:17:21,econometrics and data science,yes,no,no,no,male,yes,17-10-2003,350,60.0,6,37.0,23:00,Sun,Sun
4,4-1-2025 12:17:24,bioinformatics and systems biology,yes,yes,no,yes,male,yes,19-04-2000,500,50.0,8,1.0,,-,-
5,4-1-2025 12:17:26,econometrics,no,unknown,no,no,gender fluid,unkown,Tomorrow,467,,8,0.0,,Chocolate,Bitterballen
6,4-1-2025 12:17:26,econometrics,no,no,no,yes,female,yes,25-10-1999,500,60.0,4,4.0,12:00,Sun,Good food
7,4-1-2025 12:17:27,computer science,yes,no,yes,yes,male,unkown,1-april,400,30.0,1,6656678.0,,good food,good grade
8,4-1-2025 12:17:28,artificial intelligence,no,no,no,yes,female,yes,29-01-2001,500,60.0,3,888.0,10:37,Sun1,Sun2
9,4-1-2025 12:17:30,ms cs,no,no,no,yes,gender fluid,unkown,01-08-2000,200,70.0,1,4.204204204204204e+17,,Work done good,Then smoke weed
