# Lecture Exercise 01 - Chem 273  *solution*
## Reading Files

**1) Motivation**

The goal of this exercise is to benchmark different tools for reading files in different formats. It also serves as a Python warm-up for the course.

<br>

**2) Preparation**

To measure the runtime of a function call, we will use a *decorator*, imported from the local module *my_timer*:

In [75]:
from my_timer import my_timer 
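
The module *my_timer* is a local file that is not shown in this notebook. As a rough sketch of what such a decorator could look like (an assumption, not the actual course implementation), the version below wraps a function, measures the wall-clock time with `time.perf_counter` and prints it in the same `Total runtime: ... seconds` format that appears in the outputs of the solution.

```python
import functools
import time


def my_timer(func):
    """Hypothetical stand-in for the course's my_timer decorator."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()            # high-resolution clock
        result = func(*args, **kwargs)         # run the decorated function
        elapsed = time.perf_counter() - start  # elapsed wall-clock seconds
        print(f"Total runtime: {elapsed} seconds")
        return result                          # pass the result through unchanged

    return wrapper
```

Any function defined with `@my_timer` on the line above its `def` then prints its runtime each time it is called and still returns its normal result.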

In the next step we want to read the following data files:<br>
<br>
*Data_set_0.xlsx*<br>
*Data_set_0.csv*<br>
*Data_set_0.txt*<br>
<br>
All three files have exactly the same content but differ in format. We now load the required libraries *pandas*, *dask* and *polars*:

In [77]:
import pandas as pd
import dask.dataframe as dd
import polars as pl
#run pip install dask and/or pip install polars if needed!
#reading .xlsx files with pandas additionally requires openpyxl

<br>

**3) Exercise**

Write a short function using *def* that reads a data file of a given format with a specific library. Apply the decorator as follows:

In [None]:
#@my_timer
#def My_Function(input1, input2, ...)

What difference in runtime do you measure?<br>
To get the data frame functionality you are used to from *pandas*, a data frame generated with another library, such as *polars* or *dask*, sometimes has to be converted into a *pandas* data frame, for example via:

In [None]:
#df = df.to_pandas()    # polars -> pandas
#df = df.compute()      # dask   -> pandas

How much time does the conversion require? Do we still gain time?
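
One way to approach these two questions is to time the conversion separately from the read. The sketch below is only an assumption about how one might do this: `to_pandas_timed` is a hypothetical helper that calls `.compute()` on dask-like objects and `.to_pandas()` on polars-like objects, and returns the converted frame together with the elapsed time.

```python
import time


def to_pandas_timed(df):
    """Convert a dask or polars data frame to pandas and time the conversion.

    Hypothetical helper: the library is detected by duck typing, so neither
    dask nor polars has to be imported here.
    """
    start = time.perf_counter()
    if hasattr(df, "compute"):        # dask: this triggers the actual computation
        result = df.compute()
    elif hasattr(df, "to_pandas"):    # polars: converts the data to pandas
        result = df.to_pandas()
    else:                             # assumed to be a pandas frame already
        result = df
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `df, seconds = to_pandas_timed(pl.read_csv('Data_set_0.csv'))` would give the conversion time for polars. Adding it to the read time gives the total to compare against `pd.read_csv`.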

<br>

**4) Solution**

Here is a simple example:

In [79]:
@my_timer
def ReadWithPandasCSV(filename: str = 'Data_set_0.csv') -> pd.DataFrame:
    return pd.read_csv(filename)

@my_timer
def ReadWithDaskCSV(filename: str = 'Data_set_0.csv')   -> dd.DataFrame:
    return dd.read_csv(filename)

In [81]:
dfPandasCSV = ReadWithPandasCSV()
dfDaskCSV   = ReadWithDaskCSV()

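Total runtime: 4.546999999991385 seconds
Total runtime: 0.0 seconds

Note: `dd.read_csv` is lazy. It only sets up the read and loads the data when `.compute()` is called, so the dask time above does not include reading the full file and is not directly comparable to the pandas time.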


<br>

The same code, but with the library selected dynamically:

In [83]:
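@my_timer
def ReadWithAnyToolCSV(filename: str = 'Data_set_0.csv', my_tool: str = 'pd'):
    # Look up the imported library by its alias ('pd', 'dd' or 'pl').
    # The result is that library's own data frame type, not always pandas.
    read_csv = globals()[my_tool].read_csv
    return read_csv(filename)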

In [85]:
dfPandasCSV = ReadWithAnyToolCSV()                     #pandas as default
dfDaskCSV   = ReadWithAnyToolCSV(my_tool = 'dd')       #dask
dfPolarsCSV = ReadWithAnyToolCSV(my_tool = 'pl')       #polars

Total runtime: 11.48399999999674 seconds
Total runtime: 0.0159999999741558 seconds
Total runtime: 1.2030000000086147 seconds
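
The `globals()[my_tool]` lookup works only if the library was imported under exactly that alias, and it accepts any other global name as well. A more explicit alternative, sketched here as an assumption rather than as part of the original solution, is a small mapping from alias to module name. `TOOL_MODULES` and `get_read_csv` are hypothetical names.

```python
import importlib

# Alias -> module name; the aliases mirror the imports used in this notebook.
TOOL_MODULES = {
    "pd": "pandas",
    "dd": "dask.dataframe",
    "pl": "polars",
}


def get_read_csv(my_tool: str = "pd"):
    """Return the read_csv function of the selected library (hypothetical helper)."""
    if my_tool not in TOOL_MODULES:
        raise ValueError(
            f"Unknown tool {my_tool!r}; expected one of {sorted(TOOL_MODULES)}"
        )
    # Import the library on demand, so unused libraries need not be installed.
    module = importlib.import_module(TOOL_MODULES[my_tool])
    return getattr(module, "read_csv")
```

With this helper, `get_read_csv('pl')('Data_set_0.csv')` would read the file with polars, and a misspelled alias fails with a clear error message instead of a `KeyError` from `globals()`.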


<br>

Both tool and method selected dynamically:

In [87]:
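@my_timer
def ReadWithAnyToolAnyMethod(filename: str = 'Data_set_0.csv', my_tool: str = 'pd', my_method: str = 'read_csv'):
    # The result is the selected library's own data frame type, not always pandas.
    tool   = globals()[my_tool]          # imported library, looked up by alias
    method = getattr(tool, my_method)    # reader function, looked up by name
    return method(filename)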

In [89]:
dfPandasCSV  = ReadWithAnyToolAnyMethod()                                                                #pandas as default, read csv
dfPandasXLSX = ReadWithAnyToolAnyMethod(filename = 'Data_set_0.xlsx', my_method = 'read_excel')          #pandas as default, read xlsx

dfDaskCSV    = ReadWithAnyToolAnyMethod(my_tool = 'dd')                                                  #dask, read csv

dfPolarsCSV  = ReadWithAnyToolAnyMethod(my_tool = 'pl')                                                  #polars, read csv
dfPolarsXLSX = ReadWithAnyToolAnyMethod(filename = 'Data_set_0.xlsx', my_tool = 'pl', my_method = 'read_excel')   #polars, read xlsx

Total runtime: 10.030999999988126 seconds
Total runtime: 366.7970000000205 seconds
Total runtime: 0.0 seconds
Total runtime: 0.26600000000325963 seconds
Total runtime: 24.625 seconds
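
The pandas CSV read took about 4.5 s, 11.5 s and 10.0 s in the three runs shown in this solution, so a single measurement is noisy. A possible refinement, sketched here under the assumption that repeating each read is affordable, is to run a reader several times and report the minimum, median and maximum. `time_repeated` is a hypothetical helper and not part of the exercise.

```python
import statistics
import time


def time_repeated(func, *args, repeats: int = 3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times and summarize the timings."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)                        # result is discarded
        timings.append(time.perf_counter() - start)  # seconds for this run
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

For example, `time_repeated(pd.read_csv, 'Data_set_0.csv', repeats=5)` would time five pandas reads. Passing the undecorated library function avoids the extra printout from `my_timer`.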
