<a href="https://colab.research.google.com/github/EvgeniyStrizhak/My-master-s-thesis/blob/main/Data_pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data pre-processing

This notebook downloads files from the Bundesbank [database](https://www.bundesbank.de/), merges and processes them, and uploads the result to a GitHub repository as model features in features.csv.
Descriptions of all features and their sources are [here](https://github.com/EvgeniyStrizhak/My-master-s-thesis/blob/a5d137148423b66a3dbeb3045b7e776677ed5838/row_datasets/files_info.json).

Target data is downloaded manually from www.investing.com. The notebook then downloads it from my repository, processes it, and uploads it to GitHub as a target.csv file.

This pipeline allows you to flexibly configure and select any indicators from the database. To do this, simply enter the file metadata in files_info.json.
This pipeline can run automatically on a schedule.
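
Judging by the keys that the merge loop later in this notebook reads (`file_name`, `url`, `period`), an entry in files_info.json might look like the sketch below. The URLs are placeholders rather than real Bundesbank links, and `validate_entries` is a hypothetical helper that is not part of the notebook.

```python
import json

# Hypothetical example entries; the URLs are placeholders, not real Bundesbank links.
EXAMPLE_JSON = """
[
  {"file_name": "gdp.csv", "url": "https://example.org/gdp.csv", "period": "quarter"},
  {"file_name": "interest_rate.csv", "url": "https://example.org/interest_rate.csv", "period": "month"}
]
"""

REQUIRED_KEYS = {"file_name", "url", "period"}
ALLOWED_PERIODS = {"quarter", "month"}


def validate_entries(entries):
    """Return a list of problems; an empty list means the metadata looks usable."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        elif entry["period"] not in ALLOWED_PERIODS:
            problems.append(f"entry {i}: unknown period {entry['period']!r}")
    return problems


entries = json.loads(EXAMPLE_JSON)
print(validate_entries(entries))
```

Running a check like this before the download loop would surface a mistyped key or period early, instead of failing halfway through the merge.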

Indicators such as GDP are published once a quarter, while the forecast is monthly. For such indicators, the quarterly value is therefore repeated for each month of the quarter.
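
A minimal sketch of how a quarterly value ends up on every month of its quarter, assuming both frames carry `year` and `quarter` columns. The frames and their values are toy data for illustration.

```python
import pandas as pd

# Toy monthly and quarterly frames; the values are illustrative only.
monthly = pd.DataFrame({
    "year": [2005] * 6,
    "quarter": [1, 1, 1, 2, 2, 2],
    "month": [1, 2, 3, 4, 5, 6],
    "interest_rate": [2.0] * 6,
})
quarterly = pd.DataFrame({
    "year": [2005, 2005],
    "quarter": [1, 2],
    "gdp": [83.4, 83.86],
})

# Merging on (year, quarter) copies each quarterly value onto all months of that quarter.
merged = monthly.merge(quarterly, on=["year", "quarter"], how="left")
print(merged[["month", "gdp"]])
```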

The function **process_dataset(url, file_name, period)** contains all data cleaning operations.

The target value is the DAX closing price on the last trading day of each month.

In [168]:
import requests
import json
from sklearn.pipeline import Pipeline
import pandas as pd
from google.colab import drive
import base64
import yfinance as yf

In [169]:
# URL of a JSON file in my repo that contains all necessary URLs from the Bundesbank database
API_DATA = 'https://raw.githubusercontent.com/EvgeniyStrizhak/My-master-s-thesis/refs/heads/main/row_datasets/files_info.json'
# the target URL is in my repo
TARGET_URL = 'https://raw.githubusercontent.com/EvgeniyStrizhak/My-master-s-thesis/refs/heads/main/row_datasets/DAX%20Historical%20Data.csv'
# JSON file containing my GitHub token for pushing files to the repo
CONFIG_PATH = "/content/drive/My Drive/config.json"
# repo owner
REPO_OWNER = "EvgeniyStrizhak"
# repo name
REPO_NAME = "My-master-s-thesis"
EXPORT_DATA_URL = "https://www-genesis.destatis.de/datenbank/online/statistic/51000/table/51000-0002#modal=web-service-api"

In [170]:
QUERY = 'year >= 2005 and not (year == 2025 and quarter in (2, 3, 4))'
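
To see what this filter keeps, the sketch below applies the same expression to a toy frame. The rows are made up, and only the `year` and `quarter` columns matter here.

```python
import pandas as pd

QUERY = 'year >= 2005 and not (year == 2025 and quarter in (2, 3, 4))'

# Toy rows covering each side of the filter.
toy = pd.DataFrame({
    "year":    [2004, 2005, 2024, 2025, 2025],
    "quarter": [4,    1,    4,    1,    2],
})

# Rows before 2005 and rows from 2025 Q2 onwards are dropped.
kept = toy.query(QUERY)
print(kept)
```

Only the 2005 Q1, 2024 Q4 and 2025 Q1 rows survive, which matches the January 2005 to March 2025 range visible in the outputs further down.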

In [171]:
# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data pipeline

In [172]:
# download a file from a URL and save it locally
def get_data(url, file_name):
    response = requests.get(url)

    # check whether the request was successful
    if response.status_code == 200:
        # save the file locally
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"File downloaded successfully: {file_name}")
        return file_name
    else:
        print(f"Failed to download file: {file_name} Status code: {response.status_code}")
        return None

In [173]:
def parse_yahoo_data(ticker_text, period_text):
    ticker = yf.Ticker(ticker_text)
    df = ticker.history(period=period_text)

    # convert the index to datetime and extract month, year and quarter
    df['Date'] = pd.to_datetime(df.index)
    df['month'] = df['Date'].dt.month
    df['year'] = df['Date'].dt.year
    df['quarter'] = df['Date'].dt.quarter
    df = df.drop(columns=['Open', 'High', 'Low', 'Volume', 'Dividends', 'Stock Splits'])
    df.rename(columns={'Close': ticker_text}, inplace=True)
    df.reset_index(drop=True, inplace=True)
    df = df.query(QUERY)
    df_grouped = df.groupby(['year', 'month']).tail(1).reset_index(drop=True)
    df_grouped = df_grouped.drop(columns=['Date'])
    return df_grouped
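
The `groupby(...).tail(1)` step above keeps the last row of each month, so the function returns month-end closes rather than monthly means. A small sketch on made-up daily prices shows the difference. It assumes the rows are sorted by date, as they are in the frame returned by `ticker.history`.

```python
import pandas as pd

# Toy daily closes; the prices are made up for illustration.
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-03", "2005-01-31", "2005-02-01", "2005-02-28"]),
    "Close": [100.0, 110.0, 111.0, 105.0],
})
daily["year"] = daily["Date"].dt.year
daily["month"] = daily["Date"].dt.month

# tail(1) keeps the last row of each (year, month) group: the month-end close.
month_end = daily.groupby(["year", "month"]).tail(1).reset_index(drop=True)

# An alternative, if a monthly average is wanted instead of the month-end value.
month_mean = daily.groupby(["year", "month"], as_index=False)["Close"].mean()

print(month_end[["year", "month", "Close"]])
print(month_mean)
```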

In [174]:
# read a CSV file into a DataFrame
def read_csv(file_name):
    return pd.read_csv(file_name)

# remove descriptions and comments from the file
def filter_rows(df):
    df = df.dropna(subset=['Unnamed: 0'])
    return df.iloc[10:]

# the column 'Unnamed: 0' contains the year and the quarter (or month) separated by '-'
def split_date_column(df, period):
    if period == 'quarter':
        return df.assign(
                year=df['Unnamed: 0'].str.split('-', expand=True)[0],
                quarter=df['Unnamed: 0'].str.split('-', expand=True)[1].str.replace('Q', '')
                )
    else:
        df = df.assign(
                year=df['Unnamed: 0'].str.split('-', expand=True)[0],
                month=df['Unnamed: 0'].str.split('-', expand=True)[1],
                )
        # for monthly data, add a quarter column for later merging
        df['quarter'] = pd.to_datetime(df['month'], format='%m').dt.quarter
        return df

# drop unnecessary columns: the raw date column and the flags column
def drop_unnecessary_columns(df):
    return df.drop(columns=['Unnamed: 0', df.columns[2]])

# rename the feature column to the file name (without the .csv extension)
def rename_columns(df, file_name):
    columns = df.columns
    file_name = file_name.replace('.csv', '')
    return df.rename(columns={columns[0]: file_name})

# convert year, quarter and month to integers
def convert_dates_to_int(df, period):
    if period == 'month':
        df = df.assign(
                    month=df['month'].astype('int')
                    )
    return df.assign(year=df['year'].astype('int'),
                    quarter=df['quarter'].astype('int')
                    )

#selecting a period where all features are available
def filter_by_year(df):
    return df.query(QUERY)

def reset_index(df):
    df.reset_index(drop=True, inplace=True)
    return df
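
The date handling above can be tried on a toy frame. The sketch assumes the raw date strings look like `2005-Q1` for quarterly files and `2005-01` for monthly files, which is what the split on `-` and the removal of `Q` suggest. The values are made up. For the monthly case it derives the quarter with integer arithmetic, a swapped-in alternative to the `pd.to_datetime(..., format='%m')` call used in `split_date_column`.

```python
import pandas as pd

# Toy quarterly frame mimicking the date column after the header rows are removed.
raw_q = pd.DataFrame({"Unnamed: 0": ["2005-Q1", "2005-Q2"], "value": ["83.4", "83.86"]})
parts_q = raw_q["Unnamed: 0"].str.split("-", expand=True)
quarterly = raw_q.assign(
    year=parts_q[0].astype(int),
    quarter=parts_q[1].str.replace("Q", "", regex=False).astype(int),
)

# Toy monthly frame with the assumed 'YYYY-MM' format.
raw_m = pd.DataFrame({"Unnamed: 0": ["2005-01", "2005-04"], "value": ["2.0", "2.0"]})
parts_m = raw_m["Unnamed: 0"].str.split("-", expand=True)
monthly = raw_m.assign(year=parts_m[0].astype(int), month=parts_m[1].astype(int))
# Integer arithmetic maps months 1-3 to quarter 1, 4-6 to quarter 2, and so on.
monthly["quarter"] = (monthly["month"] - 1) // 3 + 1

print(quarterly)
print(monthly)
```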

In [175]:
# combine the functions into one pipeline
def process_dataset(url, file_name, period):
    df = pd.read_csv(get_data(url, file_name))
    return (df
            .pipe(filter_rows)
            .pipe(split_date_column, period)
            .pipe(drop_unnecessary_columns)
            .pipe(rename_columns, file_name)
            .pipe(convert_dates_to_int, period)
            .pipe(filter_by_year)
            .pipe(reset_index)
            )

In [176]:
# upload the updated dataset to the GitHub repository
def upload_file(df, file_path, github_file_path):
    # save locally as CSV
    df.to_csv(file_path, index=False)
    # build the contents API URL
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/contents/{github_file_path}"
    headers = {"Authorization": f"token {github_token}", "Accept": "application/vnd.github.v3+json"}
    # check whether the file already exists in the repo
    response_sha = requests.get(url, headers=headers)

    # build the request body
    data = {
        "message": "Added CSV file",
        "content": "",
        "branch": "main"
    }

    if response_sha.status_code == 200:
        # updating an existing file requires its current SHA
        data["sha"] = response_sha.json()["sha"]
    elif response_sha.status_code != 404:
        # 404 means the file does not exist yet and will be created; anything else is an error
        raise RuntimeError(f"Error getting SHA: {response_sha.json()}")

    # read the file as bytes and base64-encode it
    with open(file_path, "rb") as file:
        data['content'] = base64.b64encode(file.read()).decode("utf-8")

    # upload the file to GitHub
    response = requests.put(url, json=data, headers=headers)

    if response.status_code == 201:
        print("A new file has been uploaded successfully")
    elif response.status_code == 200:
        print("An existing file has been updated successfully")
    else:
        print(f"Error: {response.status_code} - {response.json()}")
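
The request body sent to the GitHub contents endpoint can be built and inspected without any network call. The sketch below uses a hypothetical helper, `build_contents_payload`, which is not part of the notebook. It relies on the documented behaviour that `sha` is needed only when an existing file is updated.

```python
import base64


def build_contents_payload(csv_bytes, message, branch="main", sha=None):
    """Build the JSON body for a PUT to the GitHub contents endpoint."""
    payload = {
        "message": message,
        # The API expects the file content as a base64-encoded string.
        "content": base64.b64encode(csv_bytes).decode("utf-8"),
        "branch": branch,
    }
    # The blob SHA is required only when updating an existing file.
    if sha is not None:
        payload["sha"] = sha
    return payload


# The SHA below is a placeholder, not a real blob SHA.
new_file = build_contents_payload(b"year,month\n2005,1\n", "Add CSV file")
update = build_contents_payload(b"year,month\n2005,1\n", "Update CSV file", sha="abc123")
print(sorted(new_file), sorted(update))
```

Keeping the payload construction separate from the HTTP call makes the create and update cases easy to check in isolation.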

## Feature uploading and processing

In [177]:
oil_price = parse_yahoo_data("CL=F", "21y")

In [178]:
oil_price.head()

Unnamed: 0,CL=F,month,year,quarter
0,48.200001,1,2005,1
1,51.75,2,2005,1
2,55.400002,3,2005,1
3,49.720001,4,2005,2
4,51.970001,5,2005,2


In [179]:
oil_price.tail()

Unnamed: 0,CL=F,month,year,quarter
238,68.0,11,2024,4
239,71.720001,12,2024,4
240,72.529999,1,2025,1
241,69.760002,2,2025,1
242,71.480003,3,2025,1


In [180]:
oil_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243 entries, 0 to 242
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CL=F     243 non-null    float64
 1   month    243 non-null    int32  
 2   year     243 non-null    int32  
 3   quarter  243 non-null    int32  
dtypes: float64(1), int32(3)
memory usage: 4.9 KB


In [181]:
# load the GitHub token from Google Drive
with open(CONFIG_PATH) as f:
    config = json.load(f)
    github_token = config.get("GITHUB_TOKEN")

if github_token:
    print("GitHub token loaded successfully!")
else:
    print("Error: GitHub token not found!")

GitHub token loaded successfully!


In [182]:
#download data for api queries from repository
get_data(API_DATA, 'api_data.json')
with open('api_data.json', 'r') as json_file:
    data_dictionary = json.load(json_file)

File downloaded successfully: api_data.json


In [183]:
# create an empty DataFrame to merge the downloaded datasets into
df = pd.DataFrame({'year': [], 'quarter': [], 'month':[]})

# loop over the metadata from the JSON file to download and process every file
for item in data_dictionary:
    file_path = item['file_name']
    print(file_path)
    # quarterly files are merged on (year, quarter), so each month of a quarter gets the same value
    if item['period'] == 'quarter':
        keys = ['year', 'quarter']
    else:
        keys = ['year', 'quarter', 'month']
    df = df.merge(process_dataset(item['url'], file_path, item['period']), on = keys, how = 'outer')

production_sector.csv
File downloaded successfully: production_sector.csv
gdp.csv
File downloaded successfully: gdp.csv
consumer_prices.csv
File downloaded successfully: consumer_prices.csv
industrial_production_index.csv
File downloaded successfully: industrial_production_index.csv
interest_rate.csv
File downloaded successfully: interest_rate.csv
economy's_price competitiveness.csv
File downloaded successfully: economy's_price competitiveness.csv
unemployment_rate.csv
File downloaded successfully: unemployment_rate.csv
labour_costs.csv
File downloaded successfully: labour_costs.csv
mutual_funds_sales.csv
File downloaded successfully: mutual_funds_sales.csv
orders-received.csv
File downloaded successfully: orders-received.csv
balance_of_payments.csv
File downloaded successfully: balance_of_payments.csv
shares_sale.csv
File downloaded successfully: shares_sale.csv


In [184]:
df = df.merge(oil_price, on = ['year', 'month'], how = 'outer')
df = df.drop(columns=['quarter_y'])
df = df.rename(columns={'quarter_x': 'quarter'})
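
The `quarter_x` and `quarter_y` columns appear because both frames carry a `quarter` column that is not among the merge keys. A toy sketch of the effect, together with one possible alternative that drops the redundant column before merging (frames and values are illustrative):

```python
import pandas as pd

features = pd.DataFrame({"year": [2005, 2005], "quarter": [1, 1], "month": [1, 2], "gdp": [83.4, 83.4]})
oil = pd.DataFrame({"year": [2005, 2005], "quarter": [1, 1], "month": [1, 2], "CL=F": [48.2, 51.75]})

# Merging on (year, month) while both frames carry 'quarter' yields quarter_x and quarter_y.
clash = features.merge(oil, on=["year", "month"], how="outer")

# Dropping the redundant column first avoids the suffixes.
clean = features.merge(oil.drop(columns=["quarter"]), on=["year", "month"], how="outer")

print(list(clash.columns))
print(list(clean.columns))
```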

In [185]:
df.head()

Unnamed: 0,year,quarter,month,production_sector,gdp,consumer_prices,industrial_production_index,interest_rate,economy's_price competitiveness,unemployment_rate,labour_costs,mutual_funds_sales,orders-received,balance_of_payments,shares_sale,CL=F
0,2005,1,1,87.1,83.4,84.5,75.8,2.0,97.0,11.6,70.3,11818,73.6,-19072.388,-4426,48.200001
1,2005,1,2,85.8,83.4,84.6,75.9,2.0,97.0,11.9,70.3,5485,72.3,12614.579,4658,51.75
2,2005,1,3,86.1,83.4,84.8,76.2,2.0,97.0,12.1,70.3,8491,73.7,21713.763,1723,55.400002
3,2005,2,4,87.4,83.86,85.0,76.5,2.0,95.6,11.9,70.3,4295,73.2,17593.788,-3212,49.720001
4,2005,2,5,86.5,83.86,85.0,76.4,2.0,95.6,11.9,70.3,3684,73.6,6893.84,3144,51.970001


In [186]:
df.tail(10)

Unnamed: 0,year,quarter,month,production_sector,gdp,consumer_prices,industrial_production_index,interest_rate,economy's_price competitiveness,unemployment_rate,labour_costs,mutual_funds_sales,orders-received,balance_of_payments,shares_sale,CL=F
233,2024,2,6,93.4,104.55,129.0,127.6,4.25,94.2,6.0,115.5,10128,86.3,9981.221,-1502,81.540001
234,2024,3,7,91.5,104.66,129.3,127.8,4.25,94.0,6.0,115.0,11075,87.8,43811.841,3370,77.910004
235,2024,3,8,92.8,104.66,129.3,128.2,4.25,94.0,6.0,115.0,7802,83.0,5211.833,-2514,73.550003
236,2024,3,9,91.5,104.66,129.4,127.6,3.65,94.0,6.0,115.0,6286,88.8,38985.925,7403,68.169998
237,2024,4,10,91.2,104.45,129.9,127.8,3.4,93.7,6.1,116.5,18134,88.1,3958.485,6559,69.260002
238,2024,4,11,92.2,104.45,129.9,128.7,3.4,93.7,6.1,116.5,16622,84.3,26974.131,-2898,68.0
239,2024,4,12,90.9,104.45,130.3,128.6,3.15,93.7,6.1,116.5,27208,89.0,44493.679,-3134,71.720001
240,2025,1,1,92.1,104.88,130.6,128.2,3.15,93.4,6.2,,25562,84.1,13390.318,7644,72.529999
241,2025,1,2,91.2,104.88,131.0,127.9,2.9,93.4,6.2,,20919,84.1,11.893,6871,69.760002
242,2025,1,3,93.3,104.88,131.2,127.0,2.65,93.4,6.3,,12407,87.0,59973.157,-5327,71.480003


In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243 entries, 0 to 242
Data columns (total 16 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   year                             243 non-null    int64  
 1   quarter                          243 non-null    int64  
 2   month                            243 non-null    int64  
 3   production_sector                243 non-null    object 
 4   gdp                              243 non-null    object 
 5   consumer_prices                  243 non-null    object 
 6   industrial_production_index      243 non-null    object 
 7   interest_rate                    243 non-null    object 
 8   economy's_price competitiveness  243 non-null    object 
 9   unemployment_rate                243 non-null    object 
 10  labour_costs                     240 non-null    object 
 11  mutual_funds_sales               243 non-null    object 
 12  orders-received       

In [188]:
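# labour_costs is not yet available for Q1 2025; fill the gap with the last known value (Q4 2024: 116.5)
df['labour_costs'] = df['labour_costs'].fillna(116.5)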
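
The constant 116.5 equals the last available value (Q4 2024 in the table above). A forward fill would achieve the same without hard-coding the number. This is a sketch on toy data, assuming the rows are sorted chronologically.

```python
import pandas as pd

# Toy frame with the two most recent months still missing.
toy = pd.DataFrame({
    "year": [2024, 2024, 2025, 2025],
    "month": [11, 12, 1, 2],
    "labour_costs": [116.5, 116.5, None, None],
})

# Forward fill carries the last observed value into the missing months.
toy["labour_costs"] = toy["labour_costs"].ffill()
print(toy)
```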

In [189]:
github_file_path = 'row_datasets/features.csv'
file_path = 'features.csv'
if df.isna().sum().sum() == 0:
    upload_file(df, file_path, github_file_path)
else:
    print("There are missing values in the DataFrame.")

An existing file has been updated successfully


## Target processing

In [190]:
target = parse_yahoo_data('^GDAXI', "21y")

In [191]:
target.head()

Unnamed: 0,^GDAXI,month,year,quarter
0,4254.850098,1,2005,1
1,4350.490234,2,2005,1
2,4348.77002,3,2005,1
3,4184.839844,4,2005,2
4,4460.629883,5,2005,2


In [192]:
target.tail()

Unnamed: 0,^GDAXI,month,year,quarter
238,19626.449219,11,2024,4
239,19909.140625,12,2024,4
240,21732.050781,1,2025,1
241,22551.429688,2,2025,1
242,22163.490234,3,2025,1


In [193]:
target.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243 entries, 0 to 242
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ^GDAXI   243 non-null    float64
 1   month    243 non-null    int32  
 2   year     243 non-null    int32  
 3   quarter  243 non-null    int32  
dtypes: float64(1), int32(3)
memory usage: 4.9 KB


In [194]:
#upload data
github_file_path = 'row_datasets/target.csv'
file_path = 'target.csv'
upload_file(target, file_path, github_file_path)

An existing file has been updated successfully
