<html>
            <div style="font-size:7pt">
            This notebook may contain text, code and images generated by artificial intelligence.
            Used model: gpt-4o-2024-05-13,
            vision model: gpt-4o-2024-05-13,
            endpoint: None,
            bia-bob version: 0.18.0.
            It is good scientific practice to check the code and results it produces carefully.
            <a href="https://github.com/haesleinhuepf/bia-bob">Read more about code generation using bia-bob</a>.
            </div>
            </html>

# Curate and Convert Trainings

This notebook processes a CSV file of training records. It loads the data, cleans and filters it, and converts the result to a LaTeX-formatted string for document preparation.

In [1]:
import pandas as pd
import numpy as np



In [2]:
filename = 'trainings-example.csv'

## Load the CSV file into a dataframe

In [3]:
df = pd.read_csv(filename, header=1)

df.head(10)

Unnamed: 0.1,Unnamed: 0,Start/Ende,Titel,#TN,Verantwortl.,Uni,link,Unnamed: 7
0,,laufend,Datenanalyse,,Matthias Peters,UL,,
1,,11/23/23,Imaging,15.0,Maria Schmidt,UL,,
2,,8/28/23,Image Analysis Training School,25.0,Robert Lange,TUD,,
3,,09.10.2023-02.02.2024,Studienpraktikum,1.0,Nikolaus Rode,UL,,
4,,laufend,Trainings zu Datenanalyse und KI/ML,7.0,Thomas Bergweich,UL,,
5,,5/24/23,Training Data Visualization,10.0,Jan Baum,UL,,
6,,6/1/23,Training Data Visualization @ DataWeek 2023,20.0,Jan Baum,UL,,
7,,10/1/24,Tag der offenen Tür (ML Training),100.0,Johannes Haus,UL,,
8,,,,,,,,
9,,4/27/24,"“Cultivating Training”, Online Webinar",60.0,Robert Hund,UL,https://github.com/,
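
The `header=1` argument tells `read_csv` to skip the first line of the file and take the column names from the second line. The layout of `trainings-example.csv` is not shown here, so the sketch below assumes a hypothetical file whose first line is a title row. The data in it is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file content: a title line, then the real header row, then data
raw = io.StringIO(
    "Trainings overview,,\n"
    "Titel,#TN,Uni\n"
    "Imaging,15,UL\n"
    "Datenanalyse,,UL\n"
)

# header=1 uses the second line (index 1) as column names and skips the line above it
df_demo = pd.read_csv(raw, header=1)
print(df_demo.columns.tolist())
print(len(df_demo))
```

If the header row sits on a different line, the value passed to `header` has to change accordingly.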


## Standardize column names

In [4]:
# Rename both columns in a single call
df.rename(columns={"#TN": "Num_students", "Verantwortl.": "Betreuer_aus_scadsai"}, inplace=True)

# Verify the column renaming
df.head()

Unnamed: 0.1,Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni,link,Unnamed: 7
0,,laufend,Datenanalyse,,Matthias Peters,UL,,
1,,11/23/23,Imaging,15.0,Maria Schmidt,UL,,
2,,8/28/23,Image Analysis Training School,25.0,Robert Lange,TUD,,
3,,09.10.2023-02.02.2024,Studienpraktikum,1.0,Nikolaus Rode,UL,,
4,,laufend,Trainings zu Datenanalyse und KI/ML,7.0,Thomas Bergweich,UL,,


## Clean the Start/Ende column

In [5]:
def keep_first_date(date_str):
    # For ranges such as "09.10.2023-02.02.2024", keep only the start date
    return str(date_str).split('-')[0].strip() if '-' in str(date_str) else date_str

# Apply the custom function to the column
df["Start/Ende"] = df["Start/Ende"].apply(keep_first_date)

# Replace "laufend" (ongoing) with NaN; assigning the result back avoids the
# chained-assignment FutureWarning that inplace=True raises on a column selection
df["Start/Ende"] = df["Start/Ende"].replace("laufend", np.nan)

# Expand a trailing two-digit year "/23" to "/2023"
df["Start/Ende"] = df["Start/Ende"].str.replace(r"/23$", "/2023", regex=True)

# Verify the changes
df.head()


Unnamed: 0.1,Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni,link,Unnamed: 7
0,,,Datenanalyse,,Matthias Peters,UL,,
1,,11/23/2023,Imaging,15.0,Maria Schmidt,UL,,
2,,8/28/2023,Image Analysis Training School,25.0,Robert Lange,TUD,,
3,,09.10.2023,Studienpraktikum,1.0,Nikolaus Rode,UL,,
4,,,Trainings zu Datenanalyse und KI/ML,7.0,Thomas Bergweich,UL,,
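
After this step the column still mixes two notations: `11/23/2023` and `09.10.2023`. The sketch below shows one possible way to turn both into real timestamps. It assumes that slash-separated values are month/day/year and dot-separated values are day.month.year, which the source does not state. The helper name `parse_mixed_dates` is hypothetical.

```python
import pandas as pd


def parse_mixed_dates(values: pd.Series) -> pd.Series:
    """Parse dates written either as m/d/Y or as d.m.Y (assumed formats)."""
    # Each call returns NaT wherever its format does not match
    slash_dates = pd.to_datetime(values, format="%m/%d/%Y", errors="coerce")
    dot_dates = pd.to_datetime(values, format="%d.%m.%Y", errors="coerce")
    # Take the slash-style result and fill its gaps from the dot-style result
    return slash_dates.fillna(dot_dates)


dates = pd.Series(["11/23/2023", "8/28/2023", "09.10.2023", None])
parsed = parse_mixed_dates(dates)
print(parsed)
```

With timestamps in place, a year filter could use `parsed.dt.year` instead of matching text.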


## Replace missing or non-numeric entries in `Num_students` with zero

In [6]:
df['Num_students'] = pd.to_numeric(df['Num_students'], errors='coerce').fillna(0)

df.head(10)

Unnamed: 0.1,Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni,link,Unnamed: 7
0,,,Datenanalyse,0.0,Matthias Peters,UL,,
1,,11/23/2023,Imaging,15.0,Maria Schmidt,UL,,
2,,8/28/2023,Image Analysis Training School,25.0,Robert Lange,TUD,,
3,,09.10.2023,Studienpraktikum,1.0,Nikolaus Rode,UL,,
4,,,Trainings zu Datenanalyse und KI/ML,7.0,Thomas Bergweich,UL,,
5,,5/24/2023,Training Data Visualization,10.0,Jan Baum,UL,,
6,,6/1/2023,Training Data Visualization @ DataWeek 2023,20.0,Jan Baum,UL,,
7,,10/1/24,Tag der offenen Tür (ML Training),100.0,Johannes Haus,UL,,
8,,,,0.0,,,,
9,,4/27/24,"“Cultivating Training”, Online Webinar",60.0,Robert Hund,UL,https://github.com/,
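
As a small illustration of what `errors='coerce'` does, the sketch below runs the same call chain on a made-up series. Anything that cannot be read as a number becomes NaN and is then replaced by 0.

```python
import pandas as pd

# Made-up values: a numeric string, a non-numeric string, a missing value, a float string
raw_counts = pd.Series(["15", "n/a", None, "7.0"])

# errors="coerce" turns unparseable entries into NaN instead of raising an error
numeric_counts = pd.to_numeric(raw_counts, errors="coerce").fillna(0)
print(numeric_counts.tolist())
```

Note that after `fillna(0)` an unknown attendee count can no longer be told apart from a genuine zero.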


## Convert `Num_students` to integers

In [7]:
df['Num_students'] = df['Num_students'].astype(int)

df.head(10)

Unnamed: 0.1,Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni,link,Unnamed: 7
0,,,Datenanalyse,0,Matthias Peters,UL,,
1,,11/23/2023,Imaging,15,Maria Schmidt,UL,,
2,,8/28/2023,Image Analysis Training School,25,Robert Lange,TUD,,
3,,09.10.2023,Studienpraktikum,1,Nikolaus Rode,UL,,
4,,,Trainings zu Datenanalyse und KI/ML,7,Thomas Bergweich,UL,,
5,,5/24/2023,Training Data Visualization,10,Jan Baum,UL,,
6,,6/1/2023,Training Data Visualization @ DataWeek 2023,20,Jan Baum,UL,,
7,,10/1/24,Tag der offenen Tür (ML Training),100,Johannes Haus,UL,,
8,,,,0,,,,
9,,4/27/24,"“Cultivating Training”, Online Webinar",60,Robert Hund,UL,https://github.com/,
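
If unknown counts should stay distinguishable from zero, pandas' nullable `Int64` dtype is one alternative to filling with 0 before casting. This is a sketch on made-up data, not what the notebook does.

```python
import pandas as pd

# Made-up values with one non-numeric and one missing entry
raw_counts = pd.Series(["15", "n/a", None, "7"])

# Nullable integers: missing entries become <NA> rather than 0
nullable_counts = pd.to_numeric(raw_counts, errors="coerce").astype("Int64")
print(nullable_counts)
print(nullable_counts.sum())  # missing entries are skipped by default
```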


## Select specific columns

In [8]:
df = df[["Start/Ende", "Titel", "Num_students", "Betreuer_aus_scadsai", "Uni"]]
df.head(10)

Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni
0,,Datenanalyse,0,Matthias Peters,UL
1,11/23/2023,Imaging,15,Maria Schmidt,UL
2,8/28/2023,Image Analysis Training School,25,Robert Lange,TUD
3,09.10.2023,Studienpraktikum,1,Nikolaus Rode,UL
4,,Trainings zu Datenanalyse und KI/ML,7,Thomas Bergweich,UL
5,5/24/2023,Training Data Visualization,10,Jan Baum,UL
6,6/1/2023,Training Data Visualization @ DataWeek 2023,20,Jan Baum,UL
7,10/1/24,Tag der offenen Tür (ML Training),100,Johannes Haus,UL
8,,,0,,
9,4/27/24,"“Cultivating Training”, Online Webinar",60,Robert Hund,UL


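## Filter out empty lines

Note that `dropna(how='all')` only drops rows in which every column is missing. Row 8 therefore stays in the output below, because its `Num_students` value was already filled with 0.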

In [9]:
df = df.dropna(how='all')
df.head(10)

Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni
0,,Datenanalyse,0,Matthias Peters,UL
1,11/23/2023,Imaging,15,Maria Schmidt,UL
2,8/28/2023,Image Analysis Training School,25,Robert Lange,TUD
3,09.10.2023,Studienpraktikum,1,Nikolaus Rode,UL
4,,Trainings zu Datenanalyse und KI/ML,7,Thomas Bergweich,UL
5,5/24/2023,Training Data Visualization,10,Jan Baum,UL
6,6/1/2023,Training Data Visualization @ DataWeek 2023,20,Jan Baum,UL
7,10/1/24,Tag der offenen Tür (ML Training),100,Johannes Haus,UL
8,,,0,,
9,4/27/24,"“Cultivating Training”, Online Webinar",60,Robert Hund,UL
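
`dropna(how='all')` keeps a row as soon as one column holds a value, so a filled-in 0 is enough to keep an otherwise empty row. The sketch below, on made-up data, restricts the check to selected columns via the `subset` parameter.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Titel": ["Imaging", np.nan],
    "Uni": ["UL", np.nan],
    "Num_students": [15, 0],  # the 0 stands in for a previously missing value
})

# Checking all columns keeps the second row because Num_students is not missing
kept_all = demo.dropna(how="all")

# Checking only the text columns drops it
kept_text = demo.dropna(how="all", subset=["Titel", "Uni"])

print(len(kept_all), len(kept_text))
```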


## Filter rows containing '2023' in the 'Start/Ende' column

In [10]:
df = df[df['Start/Ende'].str.contains("2023", na=False)]
df.head(10)

Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni
1,11/23/2023,Imaging,15,Maria Schmidt,UL
2,8/28/2023,Image Analysis Training School,25,Robert Lange,TUD
3,09.10.2023,Studienpraktikum,1,Nikolaus Rode,UL
5,5/24/2023,Training Data Visualization,10,Jan Baum,UL
6,6/1/2023,Training Data Visualization @ DataWeek 2023,20,Jan Baum,UL
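
The `na=False` argument matters here because the column contains missing values. Without it, `str.contains` returns NaN for those rows, and such a mask cannot be used directly for boolean indexing. A minimal sketch on made-up values:

```python
import pandas as pd

dates = pd.Series(["11/23/2023", None, "10/1/24"])

# na=False maps missing entries to False, so the mask is purely boolean
mask = dates.str.contains("2023", na=False)
print(mask.tolist())

filtered = dates[mask]
print(filtered.tolist())
```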


## Filter out trainings with attendees < 4

In [11]:
df = df[df['Num_students'] >= 4]
df.head(10)

Unnamed: 0,Start/Ende,Titel,Num_students,Betreuer_aus_scadsai,Uni
1,11/23/2023,Imaging,15,Maria Schmidt,UL
2,8/28/2023,Image Analysis Training School,25,Robert Lange,TUD
5,5/24/2023,Training Data Visualization,10,Jan Baum,UL
6,6/1/2023,Training Data Visualization @ DataWeek 2023,20,Jan Baum,UL


## Rename columns for print

In [12]:
df.rename(columns={
    "Start/Ende": "Datum",
    "Betreuer_aus_scadsai": "Verantw.",
    "Num_students": "# Stud."
}, inplace=True)
df.head(10)

Unnamed: 0,Datum,Titel,# Stud.,Verantw.,Uni
1,11/23/2023,Imaging,15,Maria Schmidt,UL
2,8/28/2023,Image Analysis Training School,25,Robert Lange,TUD
5,5/24/2023,Training Data Visualization,10,Jan Baum,UL
6,6/1/2023,Training Data Visualization @ DataWeek 2023,20,Jan Baum,UL


## Convert the dataframe to a LaTeX formatted string and save to a .tex file

In [13]:
# escape=True escapes LaTeX special characters such as '#' in the column names
latex_string = df.to_latex(index=False, escape=True)

latex_string = latex_string.replace("\\begin{tabular}{llrll}", "\\begin{longtable}{|p{.12\\textwidth}|p{.38\\textwidth}|p{.04\\textwidth}|p{.2\\textwidth}|p{.08\\textwidth}|}")
latex_string = latex_string.replace("\\end{tabular}", "\\end{longtable}")
latex_string = latex_string.replace("\\toprule", "\\hline")
latex_string = latex_string.replace("\\midrule", "\\hline")
latex_string = latex_string.replace("\\bottomrule", "\\hline")
latex_string = latex_string.replace("\\\\", "\\\\ \\hline")

with open(filename.replace(".csv", ".tex"), 'w', encoding='utf-8') as f:
    f.write(latex_string)

## Print the LaTeX formatted string

In [14]:
print(latex_string)

\begin{longtable}{|p{.12\textwidth}|p{.38\textwidth}|p{.04\textwidth}|p{.2\textwidth}|p{.08\textwidth}|}
\hline
Datum & Titel & \# Stud. & Verantw. & Uni \\ \hline
\hline
11/23/2023 & Imaging & 15 & Maria Schmidt & UL \\ \hline
8/28/2023 & Image Analysis Training School & 25 & Robert Lange & TUD \\ \hline
5/24/2023 & Training Data Visualization & 10 & Jan Baum & UL \\ \hline
6/1/2023 & Training Data Visualization @ DataWeek 2023 & 20 & Jan Baum & UL \\ \hline
\hline
\end{longtable}
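
Column names or titles that contain LaTeX special characters such as `#`, `&` or `%` must be escaped before the table can be compiled. The helper below is a hypothetical, pandas-independent sketch of what such escaping involves. The name `escape_latex` and the replacement table are illustrative assumptions, not part of the notebook.

```python
# Replacement for each LaTeX special character (illustrative mapping)
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    # Working per character avoids escaping a backslash that an earlier
    # replacement has just inserted
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


header = escape_latex("# Stud.")
title = escape_latex("R&D 50%")
print(header)
print(title)
```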

