The file “studentInfo.csv” contains the enrollment and deregistration dates of students. If a student has not yet deregistered, the deregistration date is None. All dates use the format dd/mm/yyyy. The column names are Dutch: “studentnummer” is the student number, “inschrijfdatum” is the enrollment date and “uitschrijfdatum” is the deregistration date.

Read the file into a pandas DataFrame.

In [7]:
import pandas as pd

filepath = "studentInfo.csv"
# The fields in this file are separated by semicolons, not commas.
data = pd.read_csv(filepath, sep=';')

Show the first rows to verify that the file was read correctly.

In [8]:
data.head()

Unnamed: 0,studentnummer,inschrijfdatum,uitschrijfdatum
0,r012345,11/09/2023,
1,r124589,10/05/2023,01/12/2022
2,r457899,10/10/2023,24/31/2024
3,r012345,11/09/2023,22/12/2023
4,r024589,10/18/2023,100/12/2023
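
The “Unnamed: 0” column in the output suggests that the file stores a row index in its first, unnamed column. Assuming that is the case, the following minimal sketch passes `index_col=0` to `pd.read_csv` so that this column becomes the index instead of a regular column. The inline sample is made up to mirror the rows shown above and only stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the rows shown above; the real file may differ.
sample_csv = """;studentnummer;inschrijfdatum;uitschrijfdatum
0;r012345;11/09/2023;
1;r124589;10/05/2023;01/12/2022
2;r457899;10/10/2023;24/31/2024
"""

# index_col=0 uses the first (unnamed) column as the row index
# instead of loading it as a regular "Unnamed: 0" column.
sample = pd.read_csv(io.StringIO(sample_csv), sep=';', index_col=0)

print(sample.columns.tolist())
# Number of rows without a deregistration date.
print(sample['uitschrijfdatum'].isna().sum())
```

Whether to keep the stored index is a matter of taste; the rest of the notebook works either way.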


Delete all rows with an invalid date in the “deregistration date” column. None also counts as an invalid date.
Show the first rows again to check that this was successful.

In [18]:
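import re
import numpy as np

# dd/mm/yyyy: day 01-31, month 01-12, four-digit year.
# This checks the pattern only, not whether the day exists in that month.
date_format = re.compile(r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/\d{4}$')

def check_date_format(value):
    """Return value if it matches dd/mm/yyyy, otherwise NaN."""
    if date_format.match(str(value)):
        return value
    else:
        return np.nan

# Replace invalid dates (including missing ones) with NaN ...
data['uitschrijfdatum'] = data['uitschrijfdatum'].apply(check_date_format)

# ... and drop every row that contains a NaN.
data = data.dropna()

data.head()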

Unnamed: 0,studentnummer,inschrijfdatum,uitschrijfdatum
1,r124589,10/05/2023,01/12/2022
3,r012345,11/09/2023,22/12/2023
6,r112345,21/09/2023,01/01/2024
7,r124589,10/08/2023,01/12/2023
8,r457899,10/10/2023,24/01/2024
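
The regular expression only checks the shape of each date part, so an impossible date such as 31/02/2023 would still pass. As an alternative sketch (not the approach used in the cell above), `pd.to_datetime` with `errors='coerce'` turns every value that is not a real calendar date in the given format into `NaT`. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; None stands for a missing deregistration date.
raw = pd.Series(['01/12/2022', '24/31/2024', '100/12/2023', '31/02/2023', None])

# errors='coerce' converts anything that cannot be parsed with the format into NaT.
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

# Keep only the entries that parsed to a real date.
valid = raw[parsed.notna()]
print(valid.tolist())
```

Only the first value survives here, because the other four are either missing or not real dates.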


Delete all rows with an invalid date in the “enrollment date” column.

In [19]:
# Reuse the same format check for the enrollment date.
data['inschrijfdatum'] = data['inschrijfdatum'].apply(check_date_format)

data = data.dropna()

data.head()

Unnamed: 0,studentnummer,inschrijfdatum,uitschrijfdatum
1,r124589,10/05/2023,01/12/2022
3,r012345,11/09/2023,22/12/2023
6,r112345,21/09/2023,01/01/2024
7,r124589,10/08/2023,01/12/2023
8,r457899,10/10/2023,24/01/2024
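
Note that `dropna()` without arguments removes a row when any of its columns is missing, not only the date columns. If that is not intended, the `subset` parameter restricts the check to the listed columns. A minimal sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 0 lacks a student number, row 1 lacks an enrollment date.
df = pd.DataFrame({
    'studentnummer': [np.nan, 'r124589', 'r457899'],
    'inschrijfdatum': ['11/09/2023', np.nan, '10/10/2023'],
    'uitschrijfdatum': ['22/12/2023', '01/12/2023', '24/01/2024'],
})

# Without arguments, a missing value in any column drops the row.
rows_any = len(df.dropna())

# With subset, only the listed columns are checked.
rows_dates = len(df.dropna(subset=['inschrijfdatum', 'uitschrijfdatum']))

print(rows_any, rows_dates)
```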


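Create a new column “number of days” that contains the number of days between the enrollment and deregistration dates. In the code below this column is called number_of_dates.

In [26]:
# Convert both date columns from dd/mm/yyyy strings to datetime values.
data['inschrijfdatum'] = pd.to_datetime(data['inschrijfdatum'], format='%d/%m/%Y')
data['uitschrijfdatum'] = pd.to_datetime(data['uitschrijfdatum'], format='%d/%m/%Y')

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
delta = data['uitschrijfdatum'] - data['inschrijfdatum']
data['number_of_dates'] = delta.dt.days

data.head()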


Unnamed: 0,studentnummer,inschrijfdatum,uitschrijfdatum,number_of_dates
1,r124589,2023-05-10,2022-12-01,-160
3,r012345,2023-09-11,2023-12-22,102
6,r112345,2023-09-21,2024-01-01,102
7,r124589,2023-08-10,2023-12-01,113
8,r457899,2023-10-10,2024-01-24,106


Delete all rows for which the “number of days” column has a negative value.

In [28]:
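def valid_number_of_dates(value):
    """Return value if it is not negative, otherwise NaN."""
    # Zero days is not negative, so it is kept.
    if value >= 0:
        return value
    else:
        return np.nan

# Replace negative day counts with NaN, then drop those rows.
# Introducing NaN turns the column into floats (102 becomes 102.0).
data['number_of_dates'] = data['number_of_dates'].apply(valid_number_of_dates)

data = data.dropna()

data.head()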

Unnamed: 0,studentnummer,inschrijfdatum,uitschrijfdatum,number_of_dates
3,r012345,2023-09-11,2023-12-22,102.0
6,r112345,2023-09-21,2024-01-01,102.0
7,r124589,2023-08-10,2023-12-01,113.0
8,r457899,2023-10-10,2024-01-24,106.0
9,r012345,2023-09-11,2023-12-22,102.0
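
The day counts above are shown as floats (102.0) because the intermediate NaN values force a float column. As an alternative sketch, a boolean mask filters the rows directly and leaves the integer dtype untouched. The rows are made up; the comparison `>= 0` keeps a zero-day enrollment, since zero is not negative.

```python
import pandas as pd

# Hypothetical day counts mirroring the values shown above.
df = pd.DataFrame({
    'studentnummer': ['r124589', 'r012345', 'r112345'],
    'number_of_dates': [-160, 102, 0],
})

# Keep the rows whose day count is not negative; no NaN is introduced,
# so the column keeps its integer dtype.
filtered = df[df['number_of_dates'] >= 0]

print(filtered['number_of_dates'].tolist())
print(filtered['number_of_dates'].dtype)
```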


Provide a visual overview in the form of a histogram that shows how many students were enrolled for each number of days.
Example: 

![output.png contains an image of the expected output. A histogram is shown with "aantal dagen ingeschreven" on the x-axis and "aantal studenten" on the y-axis.](output.png)
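
One possible way to produce such a histogram with matplotlib is sketched below. How the expected figure was actually made is not shown, so the bin choice (one bin per whole day) is an assumption, and the `days` series is made-up data standing in for `data['number_of_dates']`. The axis labels are taken from the description of the expected output.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical day counts standing in for data['number_of_dates'].
days = pd.Series([102, 102, 113, 106, 102])

fig, ax = plt.subplots()

# Assumption: one bin per whole day, so each bar counts the students
# with exactly that number of enrolled days.
bins = np.arange(days.min(), days.max() + 2)
counts, edges, _ = ax.hist(days, bins=bins)

ax.set_xlabel("aantal dagen ingeschreven")
ax.set_ylabel("aantal studenten")
```

In a notebook the figure is rendered at the end of the cell; in a plain script, call `plt.show()` afterwards.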