**Notes on operating the interactive presentation**

- The lecture presentation is designed as an interactive [RISE](https://rise.readthedocs.io/en/latest/) presentation.

- To start the interactive presentation, please open the URL https://tinyurl.com/fs-plv2 in your browser.
<!-- or scan the QR code -> but then it cannot be operated? -->
- It may take a little while for the presentation to start.
- If the presentation does not start automatically and you see a Jupyter notebook, please reload the page.
- The interactive presentation can be controlled with the navigation arrows or as follows:
    - `Space`: Next slide.
    - `Shift + Space`: Previous slide.
    - `Ctrl + Enter`: Load an interactive plot.
    - `Alt + r`: Enable or disable the presentation.
    - If an interactive plot is not positioned correctly after running a code cell, disable and re-enable the presentation by pressing `Alt + r` twice.

<div align="center" style="font-size:60px;">
Trial lecture for the Professorship of Applied Mathematics with a focus on Statistical Learning
<br><br>
Data Science Projects
<br><br>
Dr. Fabian Spanhel
</div>
    
<div align="left" style="font-size:16px;">
</div>

**How has data science changed in recent years?**
- From on-premise to cloud.

- From POC to production.
- No more unicorns: the data science role is being split into multiple specialized roles.

    -> ML & Data Engineering roles have emerged.
- In general, software engineering skills and MLOps have become more important.
- ...






See also ["**Is Data Scientist Still the Sexiest Job of the 21st Century?**"](https://hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century)

**The goal of this course is to address the changes that have occurred in recent years and provide students with skills that might gain them a competitive advantage.**

**Data Science Project Workflow**
<img src="./figures/ds_project_0.png" alt="Data Science Projects" style="width: 1600px;"/>

**Data Science Project Workflow**
<img src="./figures/ds_project_1.png" alt="Data Science Projects" style="width: 1600px;"/>

**Data Science Project Workflow**
<img src="./figures/ds_project_2.png" alt="Data Science Projects" style="width: 1600px;"/>

**Data Science Project Workflow**
<img src="./figures/ds_project_3.png" alt="Data Science Projects" style="width: 1600px;"/>

**AWS Cloud services**

- Interacting with cloud services is an in-demand skill for data scientists and will become even more important.

- We will cover the fundamental services of AWS in this course and learn how to
    - Store and retrieve data, such as files and backups, using **S3**.
    - Leverage scalable cloud-based compute capacity using **EC2**.
    - Build, train, and deploy ML models with **SageMaker**.
- Optional: Glue & Athena, Lambda functions...
- We will also learn how to interact with various AWS services from Python using the **boto3** package.
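
As a rough illustration of what storing and retrieving a file with **boto3** can look like, here is a minimal sketch. The helper names, bucket name, and file paths are hypothetical and not part of the course material; only `boto3.client("s3")`, `upload_file`, and `download_file` are actual boto3 calls. The client is passed in as an argument so the logic can be tried without AWS credentials.

```python
def upload_and_download(s3_client, bucket, key, local_path, download_path):
    """Upload a local file to S3 and download it again under a new name."""
    # boto3 signature: upload_file(Filename, Bucket, Key)
    s3_client.upload_file(local_path, bucket, key)
    # boto3 signature: download_file(Bucket, Key, Filename)
    s3_client.download_file(bucket, key, download_path)


def make_s3_client():
    """Create a real S3 client (needs boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    return boto3.client("s3")


# Hypothetical usage (assumes the bucket exists and credentials are set up):
# upload_and_download(
#     make_s3_client(),
#     bucket="my-course-bucket",
#     key="data/train.csv",
#     local_path="train.csv",
#     download_path="train_copy.csv",
# )
```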

**We will be using AWS for the hands-on project in this course!**

**Methodology: Model tuning for time series** 

- $K$-fold cross-validation is most commonly used to tune the hyperparameters of a model.
- It is typically based on the assumption of iid data.
- How to do cross-validation when we have temporal dependence?

- $K$-fold cross-validation can be used in [special cases](https://www.sciencedirect.com/science/article/abs/pii/S0167947317302384).
- In general, cross-validation must account for the dependence in the data to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).
- Data leakage is a [serious problem in academia and industry](https://doi.org/10.1016/j.patter.2023.100804).


**Model tuning with time series cross-validation**
<div style="display: flex; align-items: left;">
  <div style="flex: 1; padding: 30px; font-size: 25px;">
    <ol style="margin: 200; padding: 0">
      <li>Model training: A predictive model is trained on training data of length $T_{train}$.</li>
      <li>Validation: The model is scored on validation data of length $T_{val}$.</li>
      <li>Shifting: The end of the training data is moved forward to $T_{train} + T_{val}$.</li>
      <li>Repeat Steps 1 to 3 until no data is left for validation.</li>
      <li>Performance evaluation: Aggregate validation scores.</li>
    </ol>
  </div>
  <div style="flex: 2; padding: 0px;">
      <img src="./figures/ts_split.png" alt="Time Series Cross Validation" style="width: 700px;"/>
  </div>
</div>
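
The steps above can be sketched as a small index generator. This is only an illustration under simple assumptions: an expanding training window, a fixed validation length, and integer positions instead of dates. The function name `expanding_window_splits` and its parameters are hypothetical.

```python
def expanding_window_splits(n_obs, t_train, t_val):
    """Yield (train_indices, val_indices) pairs for an expanding window.

    n_obs:   total number of observations, ordered in time
    t_train: length of the first training window
    t_val:   length of each validation window
    """
    train_end = t_train
    # Stop as soon as a full validation window no longer fits.
    while train_end + t_val <= n_obs:
        train_idx = list(range(0, train_end))                  # Step 1: train on these
        val_idx = list(range(train_end, train_end + t_val))    # Step 2: score on these
        yield train_idx, val_idx
        train_end += t_val                                     # Step 3: shift the end


splits = list(expanding_window_splits(n_obs=10, t_train=4, t_val=2))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Every validation index lies strictly after all training indices of the same fold, which is what prevents leakage from the future. scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window splitter.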

Questions:
<ul style="margin: 200; padding: 0; font-size: 25px; margin-top: -20px">
  <li>How to specify ($T_{train}$, $T_{val}$) or the resulting folds?</li>
  <li>How to optimize the training length? Could we use $(w_tY_t)_{t=1,\ldots,T_{train}}$ as training data, where the weights $(w_t)_{t=1,\ldots,T_{train}}$ are increasing in $t$ and are chosen via cross-validation?</li>
  <li>Should the aggregation of validation scores be weighted equally, or should the results of validation sets closer to today be weighted more heavily?</li>
  <li>How to actually split the data into folds? </li>
</ul>
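
To make the question about weighting validation scores concrete, here is one possible scheme, sketched under assumptions: geometric weights that favor the most recent folds. The function name `aggregate_scores` and the `decay` parameter are hypothetical, and other weighting schemes are equally conceivable.

```python
def aggregate_scores(scores, decay=1.0):
    """Weighted mean of fold scores, ordered from oldest to newest fold.

    decay=1.0 weights all folds equally; 0 < decay < 1 down-weights
    older folds geometrically, so the newest fold always has weight 1.
    """
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


fold_scores = [1.0, 2.0, 3.0]                        # made-up validation errors
equal = aggregate_scores(fold_scores)                # plain average
recent = aggregate_scores(fold_scores, decay=0.5)    # newest fold counts most
print(equal, recent)
```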
 
<!-- To the best of my knowledge there is no package available that covers important practical cases 
(split w.r.t. date, a set of time series, groups)
-->
<span style="font-size: 30px; margin-top: -20px">**These questions will be investigated with a hands-on project in this course!**</span>

**MLOps: Managing the ML lifecycle with MLflow**


[MLflow](https://mlflow.org/) is an open-source platform for managing the machine learning lifecycle with the following features:
1. **Tracking**: MLflow tracks experiments, metrics, parameters, and artifacts for easy comparison and result reproducibility.
2. **Registry**: MLflow's model registry organizes and versions models for collaboration and governance.
3. **Models**: MLflow offers a model management component for packaging models in a standard format and deploying them to various platforms.
4. **UI and API**: MLflow offers a web-based UI and REST API for interactive exploration and programmatic access.

Integrated into Databricks; 15.5k stars on GitHub (October 2023).

**We will use MLflow for the project in this course to manage the machine learning lifecycle!**

**MLflow Demo**

The following code snippet uses `mlflow.autolog` to automatically track the training of a `RandomForestRegressor` on the diabetes dataset.

In [None]:
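import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Enable automatic logging of parameters, training metrics, and the fitted model.
mlflow.autolog()
db = load_diabetes()


def run():
    # Hold out a test set (default split: 75% train, 25% test).
    X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)
    rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
    # With autologging enabled, fit() creates an MLflow run and logs to it.
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)


run()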

We now start the MLflow tracking server.

In [None]:
!mlflow ui   

Then open the [MLflow user interface](http://127.0.0.1:5000). The link only works if you are running this presentation on your local machine.

**Building Apps with Streamlit**
- Streamlit turns data scripts into shareable web apps in minutes and offers the following features:
    - **Rapid Development**: Simplifies web app creation with minimal Python code, ideal for non-web developers.
    - **Code reuse**: Seamlessly integrates with data science libraries (e.g., Pandas, Matplotlib) for code reuse.
    - **Interactive Widgets & Real-time Updates**: Offers various widgets for easy data and model interaction and enables dynamic data visualization.

- Trusted by over 80% of Fortune 50 companies and integrated in Snowflake. 27.8k stars on GitHub (October 2023).
<!--- Not the right tool for complex interfaces and/or nested state.-->
- Examples: [Analytics Dashboard](https://shamiraty-streamlit-dashboard-descriptive-analytics-home-5ks7sm.streamlit.app/), [MathGPT](https://mathgpt.streamlit.app/).

**We will be using Streamlit to build an app for the machine learning model of our project!**

**Learning objectives** 
<br>

Students will experience the entire **Data Science workflow**, from defining the task to serving the model via an application or dashboard, through projects. 
<div style="display: flex; align-items: center; margin-top: -40px;">
  <div style="flex: 3; padding: 40px; font-size: 35px;">
    <ul style="list-style-type: disc; font-size: 30px; margin: 0; padding: 0;">
        <li>Deal with common problems when working with <strong>tabular data</strong>.</li>
        <li>Differentiate between <strong>prediction</strong> and <strong>causal inference</strong> tasks.</li>
        <li>Utilize <strong>cloud services</strong> for model training.</li>
        <li>Tackle model tuning in the presence of <strong>temporal dependence</strong> and perform multi-step forecasts.</li>
        <li><strong>Track</strong> models with MLflow.</li>
        <li><strong>Deploy</strong> machine learning models using Streamlit.</li>
    </ul>
  </div>
  <div style="flex: 0.1; padding: 50px;">
     <img src="./figures/ds_project_3.png" alt="Data Science Projects"/>
  </div>
</div>


**References**

Bergmeir, C., Hyndman, R. J., Koo, B. "A note on the validity of cross-validation for evaluating autoregressive time series prediction". Computational Statistics & Data Analysis, Volume 120, 2018, Pages 70-83. https://doi.org/10.1016/j.csda.2017.11.003

Guts, Y. "Target Leaking in Machine Learning". AI Ukraine Conference 2018. https://www.youtube.com/watch?v=dWhdWxgt5SU

Kapoor, S., Narayanan, A. "Leakage and the reproducibility crisis in machine-learning-based science". Patterns, 100804, August 2023. https://doi.org/10.1016/j.patter.2023.100804


# Appendix

**Methodology: Multi-step forecasts**

- In practice, one often has to provide multi-step forecasts $\big(\text{Pred}_t[Y_{t+h}]\big)_{h=1, \ldots, H}$.
- How can we obtain this sequence of forecasts?


- Recall: If $Y_t = aY_{t-1} + U_t$ then $\text{Pred}_t[Y_{t+h}] = a\text{Pred}_t[Y_{t+h-1}]$.


- What if $Y_t = f(Y_{t-1}) + U_t$?

- Can we use $\text{Pred}_t[Y_{t+h}] = f\left(\text{Pred}_t[Y_{t+h-1}]\right)$?
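
The plug-in recursion from the last bullet can be sketched in a few lines. The function name `recursive_forecast` and the numbers are hypothetical. For the linear AR(1) case the recursion reproduces $a^h Y_t$ exactly; for a nonlinear $f$ it is in general only an approximation if $\text{Pred}_t$ denotes the conditional mean, because the mean of $f(Y)$ usually differs from $f$ applied to the mean of $Y$.

```python
def recursive_forecast(f, y_t, horizon):
    """Iterate Pred_t[Y_{t+h}] = f(Pred_t[Y_{t+h-1}]) for h = 1, ..., horizon."""
    preds = []
    prev = y_t  # Pred_t[Y_t] is simply the last observed value
    for _ in range(horizon):
        prev = f(prev)
        preds.append(prev)
    return preds


a = 0.5
ar1_preds = recursive_forecast(lambda y: a * y, y_t=8.0, horizon=3)
print(ar1_preds)  # equals [a**h * 8.0 for h in (1, 2, 3)]
```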

**Multi-step forecasts with features**


- What if $Y_t = f(Y_{t-1}, X_{t-1}) + U_t$ and we don't know $(X_{t+h})_{h=1,\ldots, H}$?
- If $Y_t = aY_{t-1} + bX_{t-1} + U_t$, we have that

    $\text{Pred}_t[Y_{t+1}] = aY_{t} + bX_{t},$

    but because we don't know $\text{Pred}_t[X_{t+1}]$, we cannot compute

    $\text{Pred}_t[Y_{t+2}] = a\text{Pred}_t[Y_{t+1}] + b\text{Pred}_t[X_{t+1}].$

- Possible solutions:
    1. Set up a model for $X_t$, e.g., $X_t = r(X_{t-1}, Y_{t-1}) + V_t$, to get the **indirect forecast**
    
        $\text{Pred}_t[Y_{t+h}] = f(\text{Pred}_t[Y_{t+h-1}], \text{Pred}_t[X_{t+h-1}])$
    2. For each forecast horizon $h$, set up a model $Y_{t+h} = f_h(Y_{t}, X_{t}) + U_{t,h}$ to get the **direct forecast** 
    
        $\text{Pred}_t[Y_{t+h}] = f_h(Y_{t}, X_{t})$

**Multi-step forecasts with features: Possible solutions**

1. Get the **indirect forecast**

    $\text{Pred}_t[Y_{t+h}] = f(\text{Pred}_t[Y_{t+h-1}], \text{Pred}_t[X_{t+h-1}])$
2. Get the **direct forecast**

    $\text{Pred}_t[Y_{t+h}] = f_h(Y_{t}, X_{t})$
- Note that the number of models required by approach 1 grows with the number of features, whereas the number required by approach 2 grows with the number of forecast horizons $H$.
- How to tune the corresponding models of 1. and 2.?
- How can we handle the data to do direct and indirect forecasts?
- Which approach is better?
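
To make the two approaches concrete, the following sketch compares them in a noise-free linear toy setting. All coefficients and function names are hypothetical, and the feature model $X_{t+1} = cX_t$ is an assumption chosen so the results can be checked by hand. Here each direct model maps the last observed pair $(Y_t, X_t)$ straight to the $h$-step forecast.

```python
def indirect_forecast(f, r, y_t, x_t, horizon):
    """Roll the target model f and the feature model r forward together."""
    preds = []
    y_prev, x_prev = y_t, x_t
    for _ in range(horizon):
        y_next = f(y_prev, x_prev)   # Pred_t[Y_{t+h}]
        x_next = r(x_prev, y_prev)   # Pred_t[X_{t+h}]
        preds.append(y_next)
        y_prev, x_prev = y_next, x_next
    return preds


def direct_forecast(models, y_t, x_t):
    """Apply one model per horizon to the last observed values."""
    return [f_h(y_t, x_t) for f_h in models]


a, b, c = 0.5, 1.0, 0.5  # hypothetical coefficients


def f(y, x):
    return a * y + b * x   # target model


def r(x, y):
    return c * x           # feature model (assumed, ignores y)


# In the linear case the horizon-specific models follow from substitution.
direct_models = [
    lambda y, x: a * y + b * x,
    lambda y, x: a**2 * y + (a * b + b * c) * x,
    lambda y, x: a**3 * y + (a**2 * b + a * b * c + b * c**2) * x,
]

indirect = indirect_forecast(f, r, y_t=4.0, x_t=2.0, horizon=3)
direct = direct_forecast(direct_models, y_t=4.0, x_t=2.0)
print(indirect, direct)
```

In this toy case both approaches coincide. With estimated or nonlinear models they generally differ, which is what makes the comparison worth investigating.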

**These questions will be investigated with a hands-on project in this course!**