# Complete workflow testing
This notebook walks through the full workflow, from raw data to a trained model, in preparation for running it on the preferred MLOps tool stack.
The relevant components are DVC for data versioning, MLflow for experiment tracking and model registry, and Prefect for workflow orchestration.
Everything in this notebook is tailored to the customer segmentation project of a small car repair shop.

## Preparation

## Extract
The extraction phase consists of:
- merging the raw text files
- converting the merged text file to a single CSV file
- loading the CSV file into a pandas DataFrame
- processing the data (header name conversion, deleting unnecessary columns, normalizing, etc.)
- writing the final DataFrame to a Parquet file

All steps are logged as MLflow runs with the relevant metadata and artifacts.
Each step is written as a Python function so that it can later be moved into a corresponding Python script with little effort.

### Define common variables
These variables will become parameters of the final Python scripts.

In [15]:
import os
raw_data_path = 'data'
tmp_file_path = '/tmp'
raw_data_merged_file_name = 'data_merged.TXT'
raw_data_encoding = 'iso8859_15'
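
Since these values are meant to become script parameters, one possible way to expose them is through `argparse`. The following is only a sketch: the function name `build_parser` and the flag names are assumptions and not part of the notebook. The defaults are the values defined in the cell above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for the extraction scripts; flag names are assumptions."""
    parser = argparse.ArgumentParser(description="Merge the raw text files.")
    parser.add_argument("--raw-data-path", default="data",
                        help="Directory containing the raw text files.")
    parser.add_argument("--tmp-file-path", default="/tmp",
                        help="Directory for intermediate files.")
    parser.add_argument("--raw-data-merged-file-name", default="data_merged.TXT",
                        help="File name of the merged raw data.")
    parser.add_argument("--raw-data-encoding", default="iso8859_15",
                        help="Encoding used to read the raw text files.")
    return parser


# Parsing an empty argument list yields the defaults used in this notebook.
args = build_parser().parse_args([])
print(args.raw_data_path, args.raw_data_encoding)
```

`argparse` maps each flag to an attribute with underscores, so `--raw-data-path` becomes `args.raw_data_path`, matching the variable names used in the notebook.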

### Merging the raw text files

In [16]:
raw_files = os.listdir(raw_data_path)
raw_files

['2012_08-2016_07.TXT',
 '2020_08-2024_07.TXT',
 '2010_08-2012_07.TXT',
 '2016_08-2020_07.TXT']

In [17]:
import time

start_time = time.time()
merged_file_path = os.path.join(tmp_file_path, raw_data_merged_file_name)
# Note: the merged file is written with the platform's default encoding.
with open(merged_file_path, 'w') as merged:
    for idx, file in enumerate(raw_files):
        with open(os.path.join(raw_data_path, file), 'r', encoding=raw_data_encoding) as current_raw_file:
            lines = current_raw_file.readlines()
            # Keep the first line (the header) only from the first file; skip it in all others.
            merged.writelines(lines if idx == 0 else lines[1:])
print(f"Execution of merging took: {time.time() - start_time}s")

Execution of merging took: 0.21520328521728516s
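
In line with the plan to turn each step into a Python function, the merge cell could be wrapped roughly as follows. This is a sketch under assumptions: the name `merge_raw_files` and its parameters are hypothetical, and every raw file is assumed to start with the same single header line. Unlike the cell above, the sketch sorts the file names so the merge order is reproducible, writes the output with an explicit encoding, and makes sure each file ends with a newline before the next one is appended.

```python
import os


def merge_raw_files(raw_data_path, merged_file_path, encoding="iso8859_15"):
    """Merge all files in raw_data_path into one file, keeping a single header.

    Hypothetical helper, not part of the original notebook.
    Returns the number of lines written.
    """
    # Sort the names so the merge order does not depend on os.listdir().
    raw_files = sorted(os.listdir(raw_data_path))
    lines_written = 0
    with open(merged_file_path, "w", encoding=encoding) as merged:
        for idx, file_name in enumerate(raw_files):
            file_path = os.path.join(raw_data_path, file_name)
            with open(file_path, "r", encoding=encoding) as raw_file:
                lines = raw_file.readlines()
            # Assumption: the first line of every file is the header,
            # so it is dropped from all files except the first.
            if idx > 0:
                lines = lines[1:]
            # Terminate the last line so consecutive files do not run together.
            if lines and not lines[-1].endswith("\n"):
                lines[-1] += "\n"
            merged.writelines(lines)
            lines_written += len(lines)
    return lines_written
```

With the variables defined earlier, a call such as `merge_raw_files(raw_data_path, os.path.join(tmp_file_path, raw_data_merged_file_name), raw_data_encoding)` would perform the same merge as the cell above, apart from the differences noted. The returned line count is one candidate for the metadata logged with the step.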
