<a href="https://colab.research.google.com/github/EvgeniyStrizhak/My-master-s-thesis/blob/main/Data_pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data pre-processing

This notebook downloads files from the Bundesbank [database](https://www.bundesbank.de/), merges and processes them, and uploads the result to a GitHub repository as features.csv, which holds the features for a model.
Descriptions of all features and their sources are [here](https://github.com/EvgeniyStrizhak/My-master-s-thesis/blob/a5d137148423b66a3dbeb3045b7e776677ed5838/row_datasets/files_info.json).

Target data is downloaded manually from www.investing.com and stored in my repository. The notebook then downloads it from there, processes it, and uploads it to GitHub as target.csv.

This pipeline allows you to flexibly configure and select any indicators from the database. To do this, simply enter the file metadata in files_info.json.
This pipeline can run automatically on a schedule.
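
As a rough sketch of what such metadata might look like: the check below relies only on the three keys that this notebook reads from each entry (`file_name`, `url`, `period`). The sample URLs are placeholders, the set of allowed periods is an assumption, and the `validate_entries` helper is hypothetical, not part of the notebook.

```python
import json

# Hypothetical sample of files_info.json content; the URLs are placeholders.
SAMPLE_FILES_INFO = """
[
  {"file_name": "gdp.csv", "url": "https://example.org/gdp.csv", "period": "quarter"},
  {"file_name": "unemployment_rate.csv", "url": "https://example.org/unemployment.csv", "period": "month"}
]
"""

REQUIRED_KEYS = {"file_name", "url", "period"}
ALLOWED_PERIODS = {"quarter", "month"}  # assumption: only these two values occur


def validate_entries(entries):
    """Return a list of human-readable problems found in the metadata entries."""
    problems = []
    for position, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {position}: missing keys {sorted(missing)}")
        elif entry["period"] not in ALLOWED_PERIODS:
            problems.append(f"entry {position}: unknown period {entry['period']!r}")
    return problems


entries = json.loads(SAMPLE_FILES_INFO)
print(validate_entries(entries))  # an empty list means every entry is usable
```

Running a check like this before the download loop would surface a malformed entry early instead of failing halfway through the merge.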

Indicators such as GDP are calculated once a quarter, while the forecast is made for each month. For such indicators, the quarterly value is therefore repeated for every month of the quarter.
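
A minimal sketch of this repetition, assuming the same merge keys that the notebook uses (`year` and `quarter` for quarterly files). The two small frames are built by hand from a few values in the features table printed further down; they are not downloaded.

```python
import pandas as pd

# Monthly indicator: one row per month.
monthly = pd.DataFrame({
    "year": [2005] * 5,
    "quarter": [1, 1, 1, 2, 2],
    "month": [1, 2, 3, 4, 5],
    "production_sector": [87.1, 85.8, 86.1, 87.4, 86.5],
})

# Quarterly indicator: one row per quarter, no month column.
quarterly = pd.DataFrame({
    "year": [2005, 2005],
    "quarter": [1, 2],
    "gdp": [83.4, 83.86],
})

# Merging on (year, quarter) copies each quarterly value to every month of that quarter.
merged = monthly.merge(quarterly, on=["year", "quarter"], how="outer")
print(merged.sort_values("month"))
```

Because the quarterly frame has no `month` column, every monthly row with a matching `(year, quarter)` pair receives the same `gdp` value.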

The function **process_dataset(url, file_name, period)** contains all data-cleaning operations.

The target value contains aggregated monthly average prices.

In [3]:
import requests
import json
import base64
import pandas as pd
from google.colab import drive

In [4]:
#url of a json file in my repo that contains all necessary urls from the Bundesbank database
API_DATA = 'https://raw.githubusercontent.com/EvgeniyStrizhak/My-master-s-thesis/refs/heads/main/row_datasets/files_info.json'
#target url is in my repo
TARGET_URL = 'https://raw.githubusercontent.com/EvgeniyStrizhak/My-master-s-thesis/refs/heads/main/row_datasets/DAX%20Historical%20Data.csv'
#json file that contains my GitHub token for pushing files to the repo
CONFIG_PATH = "/content/drive/My Drive/config.json"
#repo owner
REPO_OWNER = "EvgeniyStrizhak"
#repo name
REPO_NAME = "My-master-s-thesis"

In [5]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


## Data pipeline

In [6]:
#download data from a url and save it locally
def get_data(url, file_name):
    response = requests.get(url)

    #check if the request was successful
    if response.status_code == 200:
        #save the file locally
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"File downloaded successfully: {file_name}")
        return file_name
    else:
        print(f"Failed to download file: {file_name} Status code: {response.status_code}")
        return None

In [7]:
#read a locally saved csv file into a dataframe
def read_csv(file_name):
    return pd.read_csv(file_name)

#delete descriptions and comments from the file
def filter_rows(df):
    df = df.dropna(subset=['Unnamed: 0'])
    return df.iloc[10:]

#the column Unnamed: 0 contains the year and the quarter (or month) separated by a - delimiter
def split_date_column(df, period):
    if period == 'quarter':
        return df.assign(
                year=df['Unnamed: 0'].str.split('-', expand=True)[0],
                quarter=df['Unnamed: 0'].str.split('-', expand=True)[1].str.replace('Q', '')
                )
    else:
        df = df.assign(
                year=df['Unnamed: 0'].str.split('-', expand=True)[0],
                month=df['Unnamed: 0'].str.split('-', expand=True)[1],
                )
        #for monthly data a quarter column is added for further merging
        df['quarter'] = pd.to_datetime(df['month'], format='%m').dt.quarter
        return df

#drop unnecessary columns: the raw date column and the flags column
def drop_unnecessary_columns(df):
    return df.drop(columns=['Unnamed: 0', df.columns[2]])

#rename the feature column to the file name (without the .csv extension)
def rename_columns(df, file_name):
    columns = df.columns
    file_name = file_name.replace('.csv', '')
    return df.rename(columns={columns[0]: file_name})

#convert year, quarter and month to integers
def convert_dates_to_int(df, period):
    if period == 'month':
        df = df.assign(
                    month=df['month'].astype('int')
                    )
    return df.assign(year=df['year'].astype('int'),
                    quarter=df['quarter'].astype('int')
                    )

#select the period in which all features are available
def filter_by_year(df):
    return df.query('year >= 2005 and not (year == 2024 and quarter == 4) and year != 2025')

def reset_index(df):
    df.reset_index(drop=True, inplace=True)
    return df
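
The date handling in `split_date_column` and `convert_dates_to_int` can be illustrated on single labels without pandas. The helper name `parse_period_label` is hypothetical, and the label formats (`2005-Q1` for quarterly files, `2005-05` for monthly files) are assumptions inferred from the splitting logic above.

```python
def parse_period_label(label, period):
    """Split a label such as '2005-Q1' or '2005-05' into integer date parts."""
    year_text, part_text = label.split("-")
    if period == "quarter":
        # Quarterly labels carry the quarter directly, prefixed with 'Q'.
        return {"year": int(year_text), "quarter": int(part_text.replace("Q", ""))}
    month = int(part_text)
    # Months 1-3 map to quarter 1, months 4-6 to quarter 2, and so on.
    return {"year": int(year_text), "quarter": (month - 1) // 3 + 1, "month": month}


print(parse_period_label("2005-Q1", "quarter"))
print(parse_period_label("2005-05", "month"))
```

The integer formula for the quarter gives the same result as `pd.to_datetime(..., format='%m').dt.quarter` in the pandas version.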

In [8]:
#combine the functions into one pipeline
def process_dataset(url, file_name, period):
    df = pd.read_csv(get_data(url, file_name))
    return (df
            .pipe(filter_rows)
            .pipe(split_date_column, period)
            .pipe(drop_unnecessary_columns)
            .pipe(rename_columns, file_name)
            .pipe(convert_dates_to_int, period)
            .pipe(filter_by_year)
            .pipe(reset_index)
            )
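
For readers unfamiliar with `DataFrame.pipe`, here is a small sketch of how the chain passes data along: each step receives the dataframe as its first argument, and any extra arguments follow it. The two step functions and the sample frame are made up for illustration and are not part of the pipeline.

```python
import pandas as pd


def add_quarter(df):
    """Derive the quarter from an integer month column."""
    return df.assign(quarter=(df["month"] - 1) // 3 + 1)


def rename_value(df, new_name):
    """Rename the generic 'value' column to a feature name."""
    return df.rename(columns={"value": new_name})


raw = pd.DataFrame({"month": [1, 4, 7], "value": [1.0, 2.0, 3.0]})

# Equivalent to rename_value(add_quarter(raw), "gdp"), but readable top to bottom.
result = raw.pipe(add_quarter).pipe(rename_value, "gdp")
print(result)
```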

In [9]:
#upload a dataset to the GitHub repository (creates the file or updates an existing one)
def upload_file(df, file_path, github_file_path):
    #save locally to csv
    df.to_csv(file_path, index=False)
    #generate url
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/contents/{github_file_path}"
    headers = {"Authorization": f"token {github_token}", "Accept": "application/vnd.github.v3+json"}
    #check whether the file already exists in the repository
    response_sha = requests.get(url, headers=headers)

    #build the body of the API request
    data = {
        "message": "Added CSV file",
        "content": "",
        "branch": "main"
    }

    if response_sha.status_code == 200:
        #updating an existing file requires its current SHA
        data["sha"] = response_sha.json()["sha"]
    elif response_sha.status_code != 404:
        #404 means the file does not exist yet and will be created; anything else is an error
        raise RuntimeError(f"Error getting SHA: {response_sha.json()}")

    #read the file as bytes and encode it as Base64
    with open(file_path, "rb") as file:
        data['content'] = base64.b64encode(file.read()).decode("utf-8")

    #upload file to GitHub
    response = requests.put(url, json=data, headers=headers)

    if response.status_code == 201:
        print("A new file has been uploaded succesfully")
    elif response.status_code == 200:
        print("An existing file has been updated succesfully")
    else:
        print(f"Error: {response.status_code} - {response.json()}")
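
The request body used by `upload_file` can be examined on its own, without any network call. The sketch below is a hypothetical helper (`build_contents_payload` is not part of the notebook) that assembles the same fields: a commit message, the Base64-encoded file content, the branch, and the blob SHA when an existing file is being updated. The sample CSV bytes and the SHA string are placeholders.

```python
import base64


def build_contents_payload(file_bytes, message, branch="main", sha=None):
    """Build the JSON body for a PUT request to the GitHub contents endpoint."""
    payload = {
        "message": message,
        # The API expects the file content as a Base64-encoded string.
        "content": base64.b64encode(file_bytes).decode("utf-8"),
        "branch": branch,
    }
    if sha is not None:
        # The SHA of the current version is included when updating an existing file.
        payload["sha"] = sha
    return payload


csv_bytes = b"year,quarter,month,production_sector\n2005,1,1,87.1\n"

new_file = build_contents_payload(csv_bytes, "Added CSV file")
update = build_contents_payload(csv_bytes, "Added CSV file", sha="placeholder-sha")

print(sorted(new_file))  # keys sent when the file does not exist yet
print(sorted(update))    # keys sent when the file is updated
```

Decoding `new_file["content"]` with `base64.b64decode` returns the original bytes, which is a quick way to confirm that the encoding step is lossless.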

## Feature uploading and processing

In [10]:
#load the GitHub token from the config file on Google Drive
with open(CONFIG_PATH) as f:
    config = json.load(f)
    github_token = config.get("GITHUB_TOKEN")

if github_token:
    print("GitHub token loaded successfully!")
else:
    print("Error: GitHub token not found!")

GitHub token loaded successfully!


In [11]:
#download the metadata for the API queries from the repository
get_data(API_DATA, 'api_data.json')
with open('api_data.json', 'r') as json_file:
    data_dictionary = json.load(json_file)

File downloaded successfully: api_data.json


In [12]:
#create an empty dataframe to merge the downloaded datasets into
df = pd.DataFrame({'year': [], 'quarter': [], 'month': []})

#loop over the metadata from the json file to download and process every file
for item in data_dictionary:
    file_path = item['file_name']
    print(file_path)
    #a quarterly file has no month column, so it is merged on year and quarter
    #and its value is the same for every month of the quarter
    if item['period'] == 'quarter':
        keys = ['year', 'quarter']
    else:
        keys = ['year', 'quarter', 'month']
    df = df.merge(process_dataset(item['url'], file_path, item['period']), on=keys, how='outer')

production_sector.csv
File downloaded successfully: production_sector.csv
orders-received.csv
File downloaded successfully: orders-received.csv
unemployment_rate.csv
File downloaded successfully: unemployment_rate.csv
labour_costs.csv
File downloaded successfully: labour_costs.csv
consumer_prices.csv
File downloaded successfully: consumer_prices.csv
economy's_price competitiveness.csv
File downloaded successfully: economy's_price competitiveness.csv
balance_of_payments.csv
File downloaded successfully: balance_of_payments.csv
gdp.csv
File downloaded successfully: gdp.csv
interest_rate.csv
File downloaded successfully: interest_rate.csv


In [13]:
df.head()

| | year | quarter | month | production_sector | orders-received | unemployment_rate | labour_costs | consumer_prices | economy's_price competitiveness | balance_of_payments | gdp | interest_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005 | 1 | 1 | 87.1 | 73.6 | 11.6 | 70.3 | 84.5 | 97.0 | -19072.388 | 83.4 | 2.0 |
| 1 | 2005 | 1 | 2 | 85.8 | 72.3 | 11.9 | 70.3 | 84.6 | 97.0 | 12614.579 | 83.4 | 2.0 |
| 2 | 2005 | 1 | 3 | 86.1 | 73.7 | 12.1 | 70.3 | 84.8 | 97.0 | 21713.763 | 83.4 | 2.0 |
| 3 | 2005 | 2 | 4 | 87.4 | 73.2 | 11.9 | 70.3 | 85.0 | 95.6 | 17593.788 | 83.86 | 2.0 |
| 4 | 2005 | 2 | 5 | 86.5 | 73.6 | 11.9 | 70.3 | 85.0 | 95.6 | 6893.84 | 83.86 | 2.0 |


In [14]:
github_file_path = 'row_datasets/features.csv'
file_path = 'features.csv'
if df.isna().sum().sum() == 0:
    upload_file(df, file_path, github_file_path)
else:
    print("There are missing values in the DataFrame.")

An existing file has been updated succesfully
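
When the completeness check fails, it helps to know which columns are incomplete. The sketch below is a hypothetical helper (`missing_report` is not part of the notebook) that counts missing values per column; the sample frame is deliberately incomplete and made up for illustration.

```python
import pandas as pd


def missing_report(df):
    """Return the number of missing values per column, listing only incomplete columns."""
    counts = df.isna().sum()
    return counts[counts > 0].to_dict()


# Made-up sample with one missing gdp value.
sample = pd.DataFrame({
    "year": [2005, 2005, 2005],
    "month": [1, 2, 3],
    "gdp": [83.4, None, 83.4],
})

print(missing_report(sample))
```

An empty dictionary corresponds to the case in which the notebook proceeds with the upload.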


## Target processing

In [15]:
#download the target data from GitHub
target = get_data(TARGET_URL, 'target.csv')

File downloaded successfully: target.csv


In [16]:
target = pd.read_csv('target.csv')

In [17]:
target.head()

| | Date | Price | Open | High | Low | Vol. | Change % |
|---|---|---|---|---|---|---|---|
| 0 | 02/01/2025 | 21787.9 | 21269.5 | 21945.57 | 21258.0 | 349.99M | 0.26% |
| 1 | 01/01/2025 | 21732.05 | 19923.07 | 21800.52 | 19833.82 | 1.42B | 9.16% |
| 2 | 12/01/2024 | 19909.14 | 19586.17 | 20522.82 | 19568.5 | 1.11B | 1.44% |
| 3 | 11/01/2024 | 19626.45 | 19093.99 | 19640.15 | 18812.53 | 1.42B | 2.88% |
| 4 | 10/01/2024 | 19077.54 | 19409.39 | 19674.68 | 18911.72 | 1.22B | -1.28% |


In [18]:
target.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      241 non-null    object
 1   Price     241 non-null    object
 2   Open      241 non-null    object
 3   High      241 non-null    object
 4   Low       241 non-null    object
 5   Vol.      241 non-null    object
 6   Change %  241 non-null    object
dtypes: object(7)
memory usage: 13.3+ KB


In [19]:
#convert the necessary columns to proper data types and extract year and month
target['Date'] = pd.to_datetime(target['Date'], format='%m/%d/%Y')
target['month'] = target['Date'].dt.month
target['year'] = target['Date'].dt.year
target['Price'] = target['Price'].str.replace(',', '').astype('float')

In [20]:
#select the necessary dates
target = target.query('year >= 2005 and not (year == 2024 and month in (10, 11, 12)) and year != 2025')

In [21]:
#final target grouped by year and month
target_table = target.groupby(['year', 'month'])['Price'].mean().reset_index()
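
The parsing and aggregation steps above can be mirrored in plain Python, which makes the logic easy to check by hand. This is a sketch under assumptions: `monthly_average` is a hypothetical helper, and the three sample rows are made up rather than taken from the DAX file.

```python
from collections import defaultdict
from datetime import datetime


def monthly_average(rows):
    """Average prices per (year, month) from (date_text, price_text) pairs."""
    grouped = defaultdict(list)
    for date_text, price_text in rows:
        date = datetime.strptime(date_text, "%m/%d/%Y")
        # Prices may use a comma as thousands separator, so it is removed first.
        grouped[(date.year, date.month)].append(float(price_text.replace(",", "")))
    return {key: sum(values) / len(values) for key, values in sorted(grouped.items())}


# Made-up sample rows in the same date and price format as the source file.
sample_rows = [
    ("09/02/2024", "1,000.00"),
    ("09/03/2024", "1,100.00"),
    ("08/30/2024", "900.00"),
]

averages = monthly_average(sample_rows)
print(averages)
```

With one row per month, as in the downloaded file, the mean of each group is simply that row's price.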

In [22]:
#upload data
github_file_path = 'row_datasets/target.csv'
file_path = 'target.csv'
upload_file(target_table, file_path, github_file_path)

An existing file has been updated succesfully
