<a href="https://colab.research.google.com/github/Thukyd/TradeRepublic_Portfolio/blob/master/TradeRepublic_PDF_Converter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **0.4 | TradeRepublic Portfolio** 
# *PDF Converter for Portfolio Performance & Investing.com*
# a) Description
This tool helps you keep track of your transactions in TradeRepublic (TR). TR lets you export every order as a PDF; the tool extracts these PDFs into sheets. For testing purposes, import and export currently work only via G-Drive. Alternatives are on the to-do list.


## Output A: Master Sheet
The master sheet lists all previous transactions from the PDFs. 

## Output B: Delta Sheet
The delta sheet contains all new transactions that have been added since the script was last executed. These are intended to be imported into Portfolio Performance and Investing.com. Delta sheets are dated.

## Output C: Portfolio Sheet
The portfolio sheet contains an overview of all open positions with basic information such as the average purchase price. The three fields "Stop Kurs", "Limit Kurs" and "Strategie" are for notes. They are kept on each execution of the script as long as the corresponding position is still in the portfolio.
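
How the notes survive a re-run can be illustrated with a left merge on the ISIN. This is a minimal sketch, not the notebook's exact code: the ISINs, note values and frame names are made up for illustration.

```python
import pandas as pd

# Hypothetical existing portfolio sheet with manually entered notes.
existing = pd.DataFrame({
    "Isin": ["US0000000001", "DE0000000002"],
    "Stop Kurs": ["90,00", ""],
    "Limit Kurs": ["150,00", ""],
    "Strategie": ["hold", "sell on rebound"],
})

# Freshly calculated portfolio: one position kept, one new, one no longer held.
recalculated = pd.DataFrame({
    "Isin": ["US0000000001", "FR0000000003"],
    "Stück": [10, 5],
})

note_columns = ["Stop Kurs", "Limit Kurs", "Strategie"]
# A left merge keeps every recalculated position and pulls in notes where the ISIN matches.
merged = recalculated.merge(existing[["Isin"] + note_columns], on="Isin", how="left")
# Positions without earlier notes get empty cells instead of NaN.
merged[note_columns] = merged[note_columns].fillna("")
print(merged)
```

The position that left the portfolio drops out together with its notes, while the new position starts with empty note fields.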

## Scope 
*   optimized for Google Colab (https://colab.research.google.com/)
*   export optimized for "Portfolio Performance" (https://www.portfolio-performance.info/) and "Investing.com" (https://de.investing.com/)
*   for further information about the import at "Investing.com", see: https://www.investing-support.com/hc/en-us/articles/360000265217-Import-Portfolio-Holdings 


# b) Configuration
## G-Drive Path
*   TradeRepublic statements are imported via Google Drive. All PDF files should be in a single folder. You need to configure the path to your G-Drive folder before usage (see: "Configurations - to be defined by user")

# c) Options to customize
- for different data sources, see: https://colab.research.google.com/notebooks/io.ipynb
- in order to create different data structures, take a look at "Examples for extracted fields". 

# d) Further Information
## Handling of cost statements
*   "Kosten des Wertpapierkaufs/verkaufs" are considered.
*   "Kosten während der Haltedauer (pro Jahr)" are not extracted and therefore do not appear in the sheets. 

## Deposits & Withdrawals
* deposits to and withdrawals from the depot are not recorded in PDF format. Therefore they are not taken into account and must be entered by hand if necessary.

## "Order" in Trade Republic documents
* each TradeRepublic document has an "order" number. It is extracted and stored in the field "Order ID", where it serves to prevent duplicate entries.  
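
A minimal sketch of how an order number can prevent duplicates, assuming the column is called "Order ID". The IDs and names below are invented sample data.

```python
import pandas as pd

# Orders already stored in the master sheet (one name was edited by hand).
master = pd.DataFrame({
    "Order ID": ["a1", "b2"],
    "Wertpapiername": ["Alibaba (edited by hand)", "Example AG"],
})
# Orders extracted from all PDFs in the folder, including one new order.
extracted = pd.DataFrame({
    "Order ID": ["a1", "b2", "c3"],
    "Wertpapiername": ["Alibaba", "Example AG", "Sample Inc"],
})

# keep=False drops every Order ID that occurs in both frames,
# so only orders that were not seen before remain.
delta = pd.concat([master, extracted]).drop_duplicates(subset=["Order ID"], keep=False)
print(delta)
```

Because only the order number is compared, manual edits in other columns of the master sheet do not make an old order look new. Note that this is a symmetric difference: a row that exists only in the master sheet would also remain.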

# e) Backlog
## Open
* create sheets for portfolio performance in .csv format
* offer alternatives to G-Drive import/export

## To be fixed
- ...

# f) Changelog 
## 0.1
* extract G-Drive folder of TradeRepublic PDFs
* create data structure (for Portfolio Performance or other purposes)
* generate master sheet of all transactions
* generate delta sheet for new transactions (base for TradeRepublic import)

## 0.2
- fixed | sort extracted transactions by date
- fixed | double entries of table labels in delta sheet

## 0.3
- add ticker symbol via finnhub api (requires your own api key)
- fixed | calculate purchase price
- add values to sheet ("Kaufkurs", "Gebühren")
- data structure for Investing.com import

## 0.4
- full refactoring (new class structure)
- add current portfolio as output option
- allow data enrichment of extracted data
- note already extracted PDFs

In [None]:

#@markdown ### PDF Source
input_source = "G-Drive" #@param ["G-Drive"] {allow-input: true}

#@markdown ### GDrive Folder
# GDRIVE
# path to gdrive folder with your TradeRepublic pdfs

# e.g. "/content/drive/My Drive/MY_TR_FOLDER/"
path_gdrive = "/content/drive/My Drive/MY_TR_FOLDER/" #@param {type:"string"}

#@markdown ### Output Destination

output_destination = "Google-Sheets" #@param ["Google-Sheets"] {allow-input: true}

#@markdown ### Data Enrichment by OnvistaAPI (optional) 
# formatted name of asset
# ticker symbol
# WKN
# asset type
  
onvista = True #@param {type:"boolean"}

#@markdown ### Optimized for Investing.com or Portfolio Performance

export_optimized_for = "Portfolio Performance" #@param ["Portfolio Performance", "Investing.com"] {allow-input: true}


#@markdown ### Create overview table of current portfolio (next to Delta & Master)

portfolio_sheet = True #@param {type:"boolean"}


config = {
    "input_source" : input_source,
    "output_destination" : output_destination,
    "path_gdrive" : path_gdrive,
    "onvista" : onvista,
    "export_optimized_for" : export_optimized_for,
    "portfolio_sheet" : portfolio_sheet
}


# Helper Methods
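
The helper class in this section converts between German-formatted number strings such as "20.000,00" and floats. As a standalone illustration, here is a minimal sketch with hypothetical function names. Unlike the class, it removes all thousands separators up front instead of counting the dots.

```python
def german_to_float(text):
    """Parse a German-formatted number such as '20.000,00'."""
    # remove thousands separators first, then turn the decimal comma into a point
    return float(text.replace(".", "").replace(",", "."))


def float_to_german(number, digits=2):
    """Format a float with '.' as thousands separator and ',' as decimal mark."""
    english = f"{number:,.{digits}f}"  # English-style grouping, e.g. '20,000.00'
    # swap the two separators via a temporary placeholder
    return english.replace(",", "X").replace(".", ",").replace("X", ".")


print(german_to_float("20.000,00"))
print(float_to_german(1234567.891))
```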


In [None]:
class HelperMethods:
  def __init__(self):
    # requirement for get_date_today(), convert_str_to_date()
    from datetime import date, datetime
    self.date = date
    self.datetime = datetime
    # pandas
    import pandas as pd
    self.pd = pd


  def get_date_today(self):
    today = self.date.today()
    return today.strftime("%Y-%m-%d")

  def convert_str_to_date(self, input_string):
    output = self.datetime.strptime(input_string, "%d.%m.%Y").date()
    return output

  def convert_date_to_str(self, input_date):
    output = input_date.strftime("%d.%m.%Y")
    return output

  def convert_german_decimal_to_float(self, german_decimal_str):
    decimal_point = german_decimal_str.replace(",", ".") # input: 20.000,00 ; output: 20.000.00
    result = decimal_point.replace(".", "", decimal_point.count(".") -1) # input: 20.000.00 ; output: 20000.00
    return float(result)

  def convert_float_to_german_decimal(self, number):
    # build format string
    format_str = "{{:,.{}f}}".format(2)
    # make number string
    number_str = format_str.format(number)
    # swap the separators: "," => "." and "." => ","
    # https://stackoverflow.com/questions/55616520/is-there-a-simple-and-preferred-way-of-german-number-string-formatting-in-python/55616843
    return number_str.replace(',', 'X').replace('.', ',').replace('X', '.')

  def calculate_weighted_average(self, dataframe, column_for_average, column_for_weight):
    a = dataframe[column_for_average]
    w = dataframe[column_for_weight]
    # convert german decimal to float
    a1 = a.apply( lambda x : self.convert_german_decimal_to_float(x) )
    # calculate weighted average
    weighted_average = (a1 * w).sum() / w.sum()
    # convert float to german decimal 
    weighted_average = self.convert_float_to_german_decimal(weighted_average)
    return weighted_average

  def abs_sum_dataframe_with_german_decimals(self, dataframe_column):
    df = dataframe_column.apply( lambda x : self.convert_german_decimal_to_float(x) )
    df = abs( df.sum() ) # absolute sum 
    return self.convert_float_to_german_decimal(df)

  def get_diff_between_2_dataframes(self, df1, df2, which=None):
    """
      Find Rows Which Are Not common Between Two dataframes - by subset "Order ID"
        https://kanoki.org/2019/07/04/pandas-difference-between-two-dataframes/

      Remember: 
        If "Order ID" was not used as a subset to filter, 
        it would no longer be possible to edit and complete the sheet (notes, WKN or other missing information).
        All changes would be overwritten with the next execution of the script.
    """
    return self.pd.concat([df1,df2]).drop_duplicates(subset=["Order ID"], keep=False)

  def sort_dataframe_column_by_label(self, my_dataframe, label):
    return my_dataframe.reindex(columns=label)

  def sort_dataframe_rows_by_date(self, my_dataframe):
    # parse German dates (DD.MM.YYYY) explicitly so that day and month are never swapped;
    # values that cannot be parsed become NaT
    my_dataframe["Datum"] = self.pd.to_datetime(my_dataframe["Datum"], format="%d.%m.%Y", errors="coerce")
    # drop rows without a valid date and sort by date
    my_dataframe = my_dataframe.dropna(subset=["Datum"]).sort_values(by = "Datum")
    # convert datetime back to a German date string
    my_dataframe["Datum"] = my_dataframe["Datum"].dt.strftime("%d.%m.%Y")
    return my_dataframe
  
  # helper method to assign the proper sign for each transaction
  def get_transaction_sign(self, amount, order_type):
    if order_type == "Kauf" or order_type == "Buy":  
      return amount
    else:  
      return -amount

# Input
- defines the source of the PDFs to be extracted

In [None]:
class Input:
  def __init__(self, config):
    self.config = config
    if self.config["input_source"] == "G-Drive":
      self.connect_google_drive()

  def connect_google_drive(self):
    """option to use google drive as source"""
    # Load Google Drive helper
    from google.colab import drive
    # This will prompt for authorization
    drive.mount("/content/drive")

# Sheet Source
- defines the source of the previously created sheets

In [None]:
class Sheet_Source:
  def __init__(self, config):
    # install gspread
    !pip install --upgrade --quiet gspread
    import gspread
    # pandas is needed to read the sheet into a dataframe
    import pandas as pd
    self.pd = pd
    # authenticate gsheet access
    from google.colab import auth
    auth.authenticate_user()
    from oauth2client.client import GoogleCredentials
    # get access credentials for gsheet
    self.gc = gspread.authorize(GoogleCredentials.get_application_default())
    self.config = config
    # g sheet as output & source of existing sheets
    if self.config["output_destination"] == "Google-Sheets":
      self.get_extracted_pdf()

  def get_extracted_pdf(self):
    try: # sheet exists => load the file names that were extracted in earlier runs
      spreadsheet = self.gc.open("TradeRepublic_Extracted_Files")
      worksheet_extracted = spreadsheet.sheet1 # get worksheet for document
      extracted_sheet = self.pd.DataFrame(worksheet_extracted.get_all_records())
      # ExtractPdfs expects a plain list of file names
      # assumption: the file names are stored in the first column of the sheet
      self.config["list_extracted_pdfs"] = extracted_sheet.iloc[:, 0].tolist()
    except Exception: # sheet is missing or empty => nothing has been extracted yet
      self.config["list_extracted_pdfs"] = []

# Extract PDFs
- performs the extraction of the PDFs
- only previously unknown documents are extracted
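
The "only previously unknown documents" rule can be sketched independently of Colab and G-Drive. The function name and file names below are hypothetical, and the demo uses empty placeholder files instead of real statements.

```python
import glob
import os
import tempfile


def find_new_pdfs(folder, already_extracted):
    """Return PDF paths in `folder` that are not listed in `already_extracted`."""
    known = set(already_extracted)
    new_files = []
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        if path in known:
            continue  # extracted in an earlier run
        new_files.append(path)
    return new_files


# demo with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for name in ("order_1.pdf", "order_2.pdf", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    known = [os.path.join(folder, "order_1.pdf")]
    new_pdfs = [os.path.basename(p) for p in find_new_pdfs(folder, known)]
    print(new_pdfs)
```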

In [None]:
class ExtractPdfs:
  def __init__(self, config):
    self.config = config
    # PyPDF2 for PDF extraction
    !pip install PyPDF2
    import PyPDF2
    # requirement for pdf-folder extraction
    import os
    import glob

    self.PyPDF2 = PyPDF2
    self.glob = glob
    self.os = os

    # config G-Drive
    if self.config["input_source"] == "G-Drive":
      # reject an empty path or the untouched placeholder from the configuration cell
      if self.config["path_gdrive"] in ("", "/content/drive/My Drive/MY_TR_FOLDER/"):
        print("Please provide the Google Drive path for your TradeRepublic PDFs folder")
      else:
        self.path = self.config["path_gdrive"]
    
  def start(self):
    """
      Checks for relevant PDFs in the configured folder (self.path) and extracts the full text of the first page

        Output
        ------
        raw_text_list : list
          A list of all raw pdf extractions - one item is one pdf
          [
            "Wertpapierordenummer...",
            ...
          ]
    """
    raw_text_list = []
    for filename in self.glob.glob(self.os.path.join(self.path, '*.pdf')):
      with open(filename, 'rb') as fin: # open in readonly mode

        # extract only pdfs which are not extracted already
        if filename in self.config["list_extracted_pdfs"]:
          print("already extracted", filename)
          continue
        else:
          self.config["list_extracted_pdfs"].append(filename)

        # read and extract pdf infos
        pdf_reader = self.PyPDF2.PdfFileReader(fin)
        # extract first page
        extr_page = pdf_reader.getPage(0)
        text = extr_page.extractText()

        # relevance criteria
        if ")." in filename: # try to find pdf duplicates - e.g. "filename (1).pdf" instead of "filename.pdf"
            #print("Processing PDF | DUPLICATE   | \"WERTPAPIERABRECHNUNGÜBERSICHT\" |  " + filename)
            continue
        elif "WERTPAPIERABRECHNUNGÜBERSICHT" in text:  # all non-duplicate security order statements
            #print("Processing PDF | RELEVANT    | \"WERTPAPIERABRECHNUNGÜBERSICHT\" |  " + filename)
            pass
        else: # non-duplicate, irrelevant files
            #print("Processing PDF | IRRELEVANT  |                            |  " + filename)
            continue
      
        raw_text_list.append(text)

    return raw_text_list

# Process Text
- processes the extracted PDFs into a temporary data structure 
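
The extraction relies on keywords that surround each value in the raw text. The following minimal sketch shows the idea with regular-expression lookarounds. The sample string and the ISIN are invented for illustration and do not reproduce the exact layout of a real statement.

```python
import re

# Invented sample text; a real statement will look different.
sample = "ORDER abcd-1234 AUSFÜHRUNG DATUM16.06.2020 ISIN: US0000000001 10 Stk."

# value between two keywords: lookbehind for the start, lookahead for the end
order_id = re.search(r"(?<=ORDER)(.*?)(?=AUSFÜHRUNG)", sample).group(1).strip()
# date directly after the keyword; the dots are escaped so they match literally
date = re.search(r"DATUM(\d{2}\.\d{2}\.\d{4})", sample).group(1)
# ISIN: two letters, nine alphanumeric characters, one check digit
isin = re.search(r"(?<=ISIN: )([A-Z]{2}[0-9A-Z]{9}[0-9])", sample).group(1)
# number of shares sits between the ISIN and the "Stk." marker
amount = re.search(re.escape(isin) + r"\s*(.*?)(?= Stk\.)", sample).group(1)

print(order_id, date, isin, amount)
```

Raw strings (`r"..."`) keep backslashes such as `\d` intact, and `re.escape` protects extracted values that are reused inside another pattern.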

In [None]:
class ProcessText:
  def __init__(self):
    # Regex for text processing
    import re
    self.re = re
    # HelperMethods()
    self.HelperMethods = HelperMethods()
  
  def get_list_of_order_dicts(self, raw_text_list):
    """
      Processes raw text of pdfs and returns a record of orders
        Parameters
        ----------
        raw_text_list : list
          A list of all raw pdf extractions - one item is one pdf
          [
            "Wertpapierordenummer...",
            "..."
          ]

        Output
        ------
        list_of_order_dicts : list
          A list of dictionaries of all orders. The order is determined by the date of the transaction.

          [
            {
              'Datum': '16.06.2020',
              'Wertpapiername': 'Alibaba',
              ...
            },
            {...}
          ]
    """
    list_of_order_dicts = []
    for raw_text in raw_text_list:
      # simplify processing in some cases by splitting pdf in header & body
      split_criteria = "WERTPAPIERABRECHNUNGÜBERSICHT"
      text_header, text_body = raw_text.split(split_criteria)

      current_order = {
      "Datum" : self.order_date(text_header),
      "Isin" : self.isin(text_body),
      "Typ": self.type(text_body),
      "Wert" : self.value(text_body),
      "Buchungswährung" : self.currency(text_body), 
      "Stück" : self.amount(text_body),
      "Wertpapiername" : self.name(text_body),
      "Order ID" : self.order_id(text_header),
      "Gebühren" : self.costs(text_body)
      }
      list_of_order_dicts.append(current_order)
    return list_of_order_dicts

  
  def order_date(self, text_header):
    """extracts the document date of header (transaction date)"""
    # raw string keeps "\d" intact; the dots are escaped so they only match a literal "."
    pattern = r"DATUM\d{2}\.\d{2}\.\d{4}"
    transaction_date = self.re.findall(pattern, text_header)
    date = self.re.split("DATUM", transaction_date[0])
    return date[1]
  
  def order_id(self, text_header):
    """extracts the internal order id by TradeRepublic"""
    pattern = "(?<=ORDER)(.*?)(?=AUSFÜHRUNG)"
    id = self.re.findall(pattern, text_header)
    return id[0]
  
  
  def isin(self, text_body):
    """TradeRepublic uses ISIN only - ticker symbol or WKN have to be added in later processing"""
    isin_pattern = "(?<=ISIN: )(BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|BR|XS|FI|GR|IS|RU|LB|PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|SK|KR|SI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU)([0-9A-Z]{9})([0-9])"
    matches = self.re.findall(isin_pattern, text_body )
    item = "".join(matches[0])
    return item

  def name(self, text_body):
    pattern = "(?<=KURSBETRAG)(.*?)(?=ISIN)"
    name = self.re.findall(pattern, text_body)
    return name[0]

  def costs(self, text_body):
    """
      extracts costs for order in TradeRepublic
      - running costs (TER etc.) won't be considered
      - takes first appearance of costs in pdf
      - buy operation => pdf consists of buying costs, running costs & selling costs | only buying costs considered
      - sell operation => pdf consists of selling costs | selling costs considered
    """
    pattern = "(?<=Fremdkostenzuschlag)(.*?)(?= )"
    costs = self.re.findall(pattern, text_body)
    return costs[0]

  def currency(self, text_body):
    # raw string keeps "\S" intact
    pattern = r"(?<= )(\S*)(?=BUCHUNGVERRECHNUNGSKONTO)"
    currency = self.re.findall(pattern, text_body)
    return currency[0]

  def amount(self, text_body):
    ISIN = self.isin(text_body)
    # escape the ISIN and the dot of "Stk." so both are matched literally
    pattern = "(?<=" + self.re.escape(ISIN) + r")(.*?)(?= Stk\.)"
    amount = self.re.findall(pattern, text_body)
    return amount[0]

  def type(self, text_body):
    pattern = "(Kauf|Verkauf)"
    order_type = self.re.findall(pattern, text_body)
    return order_type[0]

  def value(self, text_body):
    currency = self.currency(text_body)
    pattern = "(?<=GESAMT)(.*)(?= "+ currency +"ABRECHNUNGPOSITIONBETRAG)"
    value = self.re.findall(pattern, text_body)
    return value[0]

# Enrich Data
- enriches the data structure with additional information from third-party sources or calculations.
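
A compact sketch of the calculated part of the enrichment: price per share, translated order type and signed order value, depending on the export target. The function names are hypothetical, the third-party lookup is left out, and the price formatting is simplified (no thousands separator).

```python
def parse_german(text):
    """Parse a German decimal string such as '1.000,00' into a float."""
    return float(text.replace(".", "").replace(",", "."))


def enrich(order, export_target):
    """Return a copy of `order` with derived fields for the chosen export target."""
    enriched = dict(order)
    # price per share = total value / number of shares
    price = parse_german(order["Wert"]) / float(order["Stück"])
    enriched["Kaufkurs"] = f"{price:.2f}".replace(".", ",")
    if export_target == "Investing.com":
        # Investing.com expects English order types
        enriched["Typ"] = {"Kauf": "Buy", "Verkauf": "Sell"}.get(order["Typ"], order["Typ"])
    elif export_target == "Portfolio Performance" and order["Typ"] == "Kauf":
        # Portfolio Performance expects buys as negative values
        enriched["Wert"] = "-" + order["Wert"]
    return enriched


order = {"Typ": "Kauf", "Wert": "1.000,00", "Stück": "8"}
pp = enrich(order, "Portfolio Performance")
inv = enrich(order, "Investing.com")
print(pp)
print(inv)
```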

In [None]:
class EnrichData:
  def __init__(self, config, list_of_order_dicts):
    import requests
    self.requests = requests
    self.list_of_order_dicts = list_of_order_dicts
    # HelperMethods()
    self.HelperMethods = HelperMethods()
    self.config = config

  def start_enrichment_by_config(self):
    for order_dict in self.list_of_order_dicts:
      # add order price for investing.com import
      order_dict["Kaufkurs"] = self.order_price_for_investing_com(order_dict["Wert"], order_dict["Stück"])

      # overwrite name, add WKN, ticker symbol and asset type by onvista api
      if self.config["onvista"] == True:
        self.information_by_onvista_api(order_dict)

      # Investing.com | translate order to english
      if self.config["export_optimized_for"] == "Investing.com":
        order_dict["Typ"] = self.order_type_for_investing_com(order_dict["Typ"])

      # Portfolio Performance | add sign to order value
      if self.config["export_optimized_for"] == "Portfolio Performance":
        order_dict["Wert"] = self.order_sign_for_portfolio_performance(order_dict["Typ"], order_dict["Wert"])

    return self.list_of_order_dicts

  def information_by_onvista_api(self, order_dict):
    """
      Enriches stock information by onvista
        Parameters
        ----------
        order_dict : dict
          {
              'Datum': '16.06.2020',
              'Wertpapiername': 'Alibaba',
              ...
          }
          
        Change in order_dict
        ------
        Wertpapiername : str
          Name of Stock
        Ticker-Symbol : str
          International Symbol (crucial for stock data)
        WKN : str
          Wertpapierkennnummer 
        Vermögenswert : str
          Type of Asset

    """
    query = order_dict["Isin"] 
    r = self.requests.get("https://www.onvista.de/onvista/boxes/assetSearch.json?doSubmit=Suchen&portfolioName=&searchValue=" + query)
    result = r.json()["onvista"]["results"]["asset"][0]

    order_dict["Wertpapiername"] = result["shortname"]
    order_dict["Ticker-Symbol"] = result["symbol"]
    order_dict["WKN"] = result["nsin"] # WKIN
    order_dict["Vermögenswert"] = result["assettype"]
    
  def order_type_for_investing_com(self, order_type):
    """
      The import at Investing.com needs "Buy" & "Sell" as input instead of "Kauf" & "Verkauf".
        Parameters
        ----------
        order_type : str
          "Kauf"/"Verkauf" statement

        Output
        ------
        order_type : str
          "Buy"/"Sell" statement - used in new field!
    """
    if order_type == "Kauf":
      order_type = "Buy"
    if order_type == "Verkauf":
      order_type = "Sell"
    return order_type

  def order_price_for_investing_com(self, order_value, order_amount):
    """
      Calculates the price per share, which the import at Investing.com needs.
        Parameters
        ----------
        order_value : str
          Total value of the order as German decimal
        order_amount : str
          Number of shares in the order

        Output
        ------
        order_price : str
          Price per share as German decimal
    """
    # transform string to float
    float_value = self.HelperMethods.convert_german_decimal_to_float(order_value)
    float_amount = float(order_amount)
    # calculate price of single stock
    calculated_price = float_value / float_amount
    # convert calculated float into string again
    order_price = self.HelperMethods.convert_float_to_german_decimal(calculated_price)
    return order_price


  def order_sign_for_portfolio_performance(self, order_type, order_value):
    """
      In TradeRepublic statements all total values for buy & sell are positive numbers.
      For the conversion to Portfolio Performance (PP) this value has to take the type of transaction into account.
      It needs a sign: negative numbers for "buy" and positive numbers for "sell".
        Parameters
        ----------
        order_type : str
          "Kauf"/"Verkauf" statement
        order_value : str
          Total value of order

        Output
        ------
        order_value : str
          Total value of order with sign
    """
    if order_type == "Kauf":
      return "-" + order_value
    else: 
      return order_value

# Data Integration
- prepares the export to the target source
- extends the master sheet
- determines the delta sheet (new transactions)
- determines the current portfolio and the shares of the positions held in it
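
How the current holdings of a single stock can be derived from its order history is sketched below with invented sample orders: a running total of signed share counts, plus a mask that keeps only the orders placed since the position was last sold completely.

```python
import pandas as pd

# Invented order history for one stock: bought twice, sold out, bought again twice.
orders = pd.DataFrame({
    "Typ":   ["Kauf", "Kauf", "Verkauf", "Kauf", "Kauf"],
    "Stück": [5, 5, 10, 3, 2],
})

# buys add shares, sells remove them
signed = orders["Stück"].where(orders["Typ"] == "Kauf", -orders["Stück"])
orders["total_holdings"] = signed.cumsum()

# walk backwards from the newest order; the mask stays True
# until the first zero balance is reached
since_last_sellout = orders["total_holdings"][::-1].ne(0).cummin()[::-1]
open_orders = orders[since_last_sellout]
print(open_orders)
```

Only the two most recent buys remain, so an average purchase price based on `open_orders` ignores the position that was closed earlier.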

In [None]:
class DataIntegration:
  def __init__(self, config, list_of_dicts_extracted_data):
    # read/write config
    self.config = config
    # provides extracted & enriched data - list of dictionaries
    self.extracted_list = list_of_dicts_extracted_data
    # HelperMethods()
    self.HelperMethods = HelperMethods()
    # pandas
    import pandas as pd
    self.pd = pd

    # CONFIGURATIONS
    self.master_and_delta_lables = ["Datum", "Typ", "Wertpapiername", "Stück", "Kaufkurs", "Wert", "Isin", "WKN", "Ticker-Symbol", "Gebühren", "Notiz", "Vermögenswert", "Steuern", "Buchungswährung", "Order ID", "Icon"]

    # do you need a portfolio sheet next to master & delta sheet
    if self.config["portfolio_sheet"] == True:
      self.portfolio_lables = ["Datum", "Isin", "WKN", "Ticker-Symbol", "Vermögenswert", "Wertpapiername", "Buchungswährung", "Kaufpreis", "Stück", "Ø Kaufkurs", "Stop Kurs", "Limit Kurs", "Strategie"]

    # g sheet as output 
    if self.config["output_destination"] == "Google-Sheets":
      self.init_gsheet()
    
  def init_gsheet(self):
    # install gspread
    !pip install --upgrade --quiet gspread
    import gspread
    # authenticate gsheet access
    from google.colab import auth
    auth.authenticate_user()
    from oauth2client.client import GoogleCredentials
    # get access credentials for gsheet
    self.gc = gspread.authorize(GoogleCredentials.get_application_default())
  
  def get_dataframes_for_gsheet_export(self):
    """
    Returns rows which need to be added to delta & master sheet
      checks which entries are new & puts them into dataframe for export
    
      Output
        ------
        data_update_sheets : dict
          Dictionary of dataframes which can be exported to master and delta sheet, eg.
          {
            "master" : DataframeForMasterUpdate,
            "delta" : DataframeForDeltaUpdate,
            "portfolio" : ""
          }
    """
    # check if sheets exist (write result in config) & get existing data from sheets
    dataframe_master = self.get_gsheet_master()
    dataframe_delta = self.get_gsheet_delta()
    data_update_sheets = {
          "master" : dataframe_master,
          "delta" : dataframe_delta
      }

    # append portfolio if needed
    if self.config["portfolio_sheet"] == True:
      dataframe_portfolio = self.get_gsheet_portfolio(dataframe_master)
      # config for sheet update
      data_update_sheets["portfolio"] = dataframe_portfolio

    # return the data for exports
    return data_update_sheets

  def get_extracted_data(self):
    """Converts extracted & enriched data (list of dictionaries) into a pandas dataframe"""
    extracted_data_dataframe = self.pd.DataFrame(self.extracted_list)
    return extracted_data_dataframe

  def get_gsheet_master(self):
    # get new data from pdfs
    extracted_data = self.get_extracted_data()

    # check if master sheet exists
    try: # yes => need to check which data is new and append 
      spreadsheet = self.gc.open("TradeRepublic_Master")
      worksheet_master = spreadsheet.sheet1 # get worksheet for document
      master_sheet = self.pd.DataFrame(worksheet_master.get_all_records())

      # determine which extracted orders are not yet in the master sheet
      new_dataframe = self.HelperMethods.get_diff_between_2_dataframes(master_sheet, extracted_data)
      # delta => save difference for delta sheet
      self.new_delta = new_dataframe

      # append the new data to master sheet (DataFrame.append was removed in pandas 2.0)
      new_dataframe = self.pd.concat([master_sheet, new_dataframe])
      # update config file
      self.config["master_sheet_exists"] = True
    except Exception: # no => the extracted data becomes the master sheet
      new_dataframe = extracted_data
      # delta => save difference for delta sheet
      self.new_delta = extracted_data
      self.config["master_sheet_exists"] = False
    
    # sort columns by label
    dataframe_master = self.HelperMethods.sort_dataframe_column_by_label(new_dataframe, self.master_and_delta_lables)
    #remove "NAN" 
    dataframe_master.fillna('', inplace=True)
    # sort rows (by date)
    dataframe_master  = self.HelperMethods.sort_dataframe_rows_by_date(dataframe_master)

    return dataframe_master

  def get_gsheet_delta(self):
    # get new data from pdfs
    extracted_data = self.get_extracted_data()
    # check if delta sheet exists
    today = HelperMethods().get_date_today()
    try: # yes => only add data that is not yet in the current sheet
      spreadsheet = self.gc.open("TradeRepublic_Delta_" + today)
      worksheet_delta = spreadsheet.sheet1
      delta_sheet = self.pd.DataFrame(worksheet_delta.get_all_records())
      # was created in get_gsheet_master and represents the difference between the extracted data and the master sheet => the delta!
      new_delta = self.new_delta
      # append the new data to delta sheet (DataFrame.append was removed in pandas 2.0)
      dataframe_delta = self.pd.concat([delta_sheet, new_delta])

      # update config file
      self.config["delta_sheet_exists"] = True
    except:
      # was created in get_gsheet_master and represents the difference between the extracted data and the master sheet => the delta!
      dataframe_delta = self.new_delta
      self.config["delta_sheet_exists"] = False
    # sort column by label
    dataframe_delta = self.HelperMethods.sort_dataframe_column_by_label(dataframe_delta, self.master_and_delta_lables)
    #remove "NAN" 
    dataframe_delta.fillna('', inplace=True)
    # sort rows (by date)
    dataframe_delta  = self.HelperMethods.sort_dataframe_rows_by_date(dataframe_delta)
    return dataframe_delta

  def get_gsheet_portfolio(self, dataframe_master):
    try:
      # does the sheet already exist?
      spreadsheet = self.gc.open("TradeRepublic_Portfolio")
      worksheet_portfolio = spreadsheet.sheet1 # get worksheet for document
      exisiting_portfolio = self.pd.DataFrame(worksheet_portfolio.get_all_records())
      # prepare current portfolio
      new_portfolio = self.current_portfolio_by_master_sheet(dataframe_master)
      # merge extracted & new portfolio
      dataframe_portfolio = new_portfolio.merge(exisiting_portfolio[["Isin", "Stop Kurs", "Limit Kurs", "Strategie"]], on='Isin', how="left")
      dataframe_portfolio = dataframe_portfolio.drop(columns=["Stop Kurs_x", "Limit Kurs_x", "Strategie_x"])
      dataframe_portfolio = dataframe_portfolio.rename(columns={"Stop Kurs_y" : "Stop Kurs", "Limit Kurs_y" : "Limit Kurs", "Strategie_y": "Strategie"})
      dataframe_portfolio.fillna('', inplace=True)
      self.config["portfolio_sheet_exists"] = True
       
    except: 
      # portfolio does not exist yet, so it can be created from scratch
      dataframe_portfolio = self.current_portfolio_by_master_sheet(dataframe_master)
      dataframe_portfolio.fillna('', inplace=True)
      self.config["portfolio_sheet_exists"] = False

    return dataframe_portfolio

  def current_portfolio_by_master_sheet(self, dataframe_all_orders):
    """
      Calculates the amount of stock in the current portfolio and its average purchase price 
          Parameters
          ----------
          dataframe_all_orders : pd.dataframe
            all orders

          Output
          ------
          dataframe_portfolio : pd.dataframe
            current holdings with key information such as average purchase price etc.
    """
    current_portfolio = self.pd.DataFrame( columns= self.portfolio_lables) # empty dataframe with correct labeling

    # filter down unique stocks by isin
    unique_stocks = dataframe_all_orders.drop_duplicates(subset=['Isin'])
    # loop through unique stocks and hand over to calculate the current holding in the portfolio
    for index, row in unique_stocks.iterrows():
      current_isin = row["Isin"]
      # filter down orders of current isin; copy so that helper columns can be added without touching the master data
      orders_for_current_isin = dataframe_all_orders [ dataframe_all_orders['Isin'] == current_isin ].copy()
      # calculate the current number of shares in the portfolio
      portfolio_current_stock = self.calculate_stock_in_portfolio( orders_for_current_isin )
      current_portfolio = self.pd.concat( [current_portfolio, portfolio_current_stock] ) # append to the portfolio dataframe
    return current_portfolio


  def summarize_portfolio_orders(self, dataframe_orders):
    """
     Calculates on the basis of transactions how many shares are currently in the portfolio, their total value and average purchase price
      Parameters
      ----------
      dataframe_orders : pd.dataframe
        transaction for a single share value

      Output
      ------
      dataframe_stock_in_portfolio : pd.dataframe
        returns a single row in the "self.portfolio_lables" indexed dataframe
    """
    
    # calculate weighted average for purchase price
    weighted_average = self.HelperMethods.calculate_weighted_average(dataframe_orders, "Kaufkurs", "Stück")
    # get the total amount of stocks
    order_amount = dataframe_orders["total_holdings"].iloc[-1] # cell in last row
    # sum value and turn into positive number "Gezahlter Wert" ~ or something similar
    order_value = self.HelperMethods.abs_sum_dataframe_with_german_decimals(dataframe_orders["Wert"])

    # copy the last row so that the assignments below do not write into a view of the orders
    dataframe_stock_in_portfolio = dataframe_orders.tail(1).copy()
    dataframe_stock_in_portfolio["Stück"] = order_amount
    dataframe_stock_in_portfolio["Ø Kaufkurs"] = weighted_average
    dataframe_stock_in_portfolio["Kaufpreis"] = order_value
    # reindex with portfolio labels
    dataframe_stock_in_portfolio = dataframe_stock_in_portfolio.reindex(self.portfolio_lables, axis="columns")
    return dataframe_stock_in_portfolio

  def calculate_stock_in_portfolio(self, dataframe_orders_single_stock):
    """
      Checks the amount of shares left for a single stock

        Parameters
        ----------
        dataframe_orders_single_stock : pd.dataframe
          contains all orders for a specific stock

        Output
        ------
        "" : pd.dataframe or None
          returns dataframe for total holdings of a single stock
            if none left => return None
            if shares left & never completely sold => return calculated remaining shares
            if shares left & completely sold in between => filter down to the relevant orders, return calculated remaining shares  
          
    """
    # "change of stock amount" add the sign to the amount of stocks for each order; sell 5 => -5; buy 5 => 5;
    dataframe_orders_single_stock["Stück"] = dataframe_orders_single_stock["Stück"].astype(int) 
    dataframe_orders_single_stock["change_of_stock_amount"] = dataframe_orders_single_stock.apply( lambda row : self.HelperMethods.get_transaction_sign( row["Stück"] , row["Typ"] ) , axis=1)

    # sum up current stocks in portfolio
    dataframe_orders_single_stock["change_of_stock_amount"] = dataframe_orders_single_stock["change_of_stock_amount"].astype(float)
    dataframe_orders_single_stock["total_holdings"] = dataframe_orders_single_stock["change_of_stock_amount"].cumsum()
    
    
    # the last row holds the current number of shares in the portfolio
    last_row = dataframe_orders_single_stock.tail(1)
    if 0 in last_row["total_holdings"].values:
      # no shares are left => drop the stock from the portfolio
      return

    else:
      # Check whether the position was sold out at some point and later bought again.
      if 0 not in dataframe_orders_single_stock["total_holdings"].values:
        # a) The stock was never sold completely => all orders are relevant.
        orders_for_current_portfolio = self.summarize_portfolio_orders(dataframe_orders_single_stock)
        return orders_for_current_portfolio

      else:
        # b) The stock was sold completely in between => keep only the orders
        #    placed after the last complete sale.
        # Reverse "total_holdings", then use Series.ne + Series.cummin to build a
        # boolean mask that is True only for the rows after the last zero.
        boolean_mask = dataframe_orders_single_stock.loc[::-1, "total_holdings"].ne(0).cummin()[::-1]
        # Use the mask to filter the rows of the dataframe.
        last_remaining = dataframe_orders_single_stock[boolean_mask]

        # Calculate the current holdings from the relevant orders only.
        relevant_orders_for_current_portfolio = self.summarize_portfolio_orders(last_remaining)
        return relevant_orders_for_current_portfolio
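
The reverse `cummin` mask used above is easier to follow on a tiny example. The following is a minimal sketch, not the notebook's own code: the order history is made up, and the `"Kauf"`/`"Verkauf"` labels and the sign mapping are assumed stand-ins for whatever `HelperMethods.get_transaction_sign` actually does.

```python
import pandas as pd

# Hypothetical order history for one stock: buy 5, sell 5, buy 3, buy 2.
orders = pd.DataFrame({
    "Typ": ["Kauf", "Verkauf", "Kauf", "Kauf"],
    "Stück": [5, 5, 3, 2],
})

# Signed change per order (assumed stand-in for get_transaction_sign).
sign = orders["Typ"].map({"Kauf": 1, "Verkauf": -1})
orders["change_of_stock_amount"] = orders["Stück"] * sign

# Running total of shares held after each order.
orders["total_holdings"] = orders["change_of_stock_amount"].cumsum()

# Walk the running total backwards: the mask stays True until the first
# zero is met, i.e. it marks exactly the orders after the last sell-off.
mask = orders["total_holdings"][::-1].ne(0).cummin()[::-1].astype(bool)
since_last_sell_off = orders[mask]

print(since_last_sell_off)
```

In this sketch only the last two buys survive the filter, so an average purchase price would be computed from them alone and not from the position that was closed earlier.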

# Output
- exports the list of extracted PDFs, the master sheet, the delta sheet and, if enabled, the portfolio sheet to Google Sheets
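
Each sheet is written as a list of rows: one header row followed by the value rows. The sketch below shows that conversion on its own, without touching Google Sheets. The helper name `dataframe_to_rows` and the sample data are hypothetical, and replacing missing values with empty strings is an assumption added here because `NaN` cannot be serialized into a JSON request.

```python
import pandas as pd


def dataframe_to_rows(dataframe):
    """Return [header_row, *value_rows], with missing values replaced by ''."""
    # Cast to object first so that empty strings can sit next to numbers.
    cleaned = dataframe.astype(object).where(dataframe.notna(), "")
    return [cleaned.columns.tolist()] + cleaned.values.tolist()


# Made-up portfolio rows; "Stop Preis" is one of the free-text note columns.
frame = pd.DataFrame({
    "Name": ["Example AG", "Sample SE"],
    "Stop Preis": [12.5, None],
})

rows = dataframe_to_rows(frame)
print(rows)
```

A list built this way has the same shape as the argument the `update_*` methods below pass to `worksheet.update(...)`.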

In [None]:
class Output:
  def __init__(self, config, dataframe):
    # Install gspread
    !pip install --upgrade --quiet gspread
    import gspread
    # authenticate Google Sheets access
    from google.colab import auth
    auth.authenticate_user()
    # get the default credentials of the authenticated user
    # (oauth2client is deprecated; google.auth is what current Colab examples use)
    from google.auth import default
    credentials, _ = default()
    self.gc = gspread.authorize(credentials)
    # pandas
    import pandas as pd
    self.pd = pd

    self.config = config

    # update list of already extracted pdfs
    self.update_extracted_pdf()

    self.update_master(dataframe["master"])
    self.update_delta(dataframe["delta"])
    if self.config["portfolio_sheet"]:
      self.update_portfolio(dataframe["portfolio"])


  def update_extracted_pdf(self):
    extracted_pdfs_dataframe = self.pd.DataFrame(self.config["list_extracted_pdfs"])
    
    try:
      spreadsheet = self.gc.open("TradeRepublic_Extracted_Files")
      worksheet_extracted = spreadsheet.sheet1 # get worksheet for document
      worksheet_extracted.clear() # clear the old worksheet

    except Exception:
      # the spreadsheet could not be opened (e.g. it does not exist yet) => create it
      spreadsheet = self.gc.create("TradeRepublic_Extracted_Files")
      worksheet_extracted = spreadsheet.sheet1 # get worksheet for document

    worksheet_extracted.update(
        extracted_pdfs_dataframe.values.tolist() # value
    )


  def update_master(self, master_data):

    master_update_dataframe = self.pd.DataFrame(master_data)

    if self.config["master_sheet_exists"]:
      spreadsheet = self.gc.open("TradeRepublic_Master")
      worksheet_master = spreadsheet.sheet1 # get worksheet for document
      worksheet_master.clear() # clear the old worksheet

    else:
      spreadsheet = self.gc.create("TradeRepublic_Master")
      worksheet_master = spreadsheet.sheet1 # get worksheet for document

    worksheet_master.update(
        [master_update_dataframe.columns.values.tolist()] # label
        +
        master_update_dataframe.values.tolist() # value
    )

  def update_delta(self, delta_data):
    delta_update_dataframe = self.pd.DataFrame(delta_data)
    today = HelperMethods().get_date_today()

    if self.config["delta_sheet_exists"]:
      spreadsheet = self.gc.open("TradeRepublic_Delta_" + today)
      worksheet_delta = spreadsheet.sheet1 # get worksheet for document
      worksheet_delta.clear() # clear the old worksheet

    else:
      spreadsheet = self.gc.create("TradeRepublic_Delta_" + today)
      worksheet_delta = spreadsheet.sheet1 # get worksheet for document

    worksheet_delta.update(
        [delta_update_dataframe.columns.values.tolist()] # label
        +
        delta_update_dataframe.values.tolist() # value
    )

  def update_portfolio(self, portfolio_data):
    portfolio_update_dataframe = self.pd.DataFrame(portfolio_data)

    if self.config["portfolio_sheet_exists"]:
      spreadsheet = self.gc.open("TradeRepublic_Portfolio")
      worksheet_portfolio = spreadsheet.sheet1 # get worksheet for document
      worksheet_portfolio.clear() # clear the old worksheet

    else:
      spreadsheet = self.gc.create("TradeRepublic_Portfolio")
      worksheet_portfolio = spreadsheet.sheet1 # get worksheet for document

    worksheet_portfolio.update(
        [portfolio_update_dataframe.columns.values.tolist()] # label
        +
        portfolio_update_dataframe.values.tolist() # value
    )

# Script Execution

In [None]:
# a) define input source
Input(config)

# b) define sheet source
Sheet_Source(config)

# c) extract pdfs
raw_text_list = ExtractPdfs(config).start()

# d) process text
list_of_order_dicts = ProcessText().get_list_of_order_dicts(raw_text_list)

# e) enrich data
enriched_list_of_order_dicts = EnrichData(config, list_of_order_dicts).start_enrichment_by_config()

# f) integrate data
dict_data_for_export = DataIntegration(config, enriched_list_of_order_dicts).get_dataframes_for_gsheet_export()

# g) export sheets
Output(config, dict_data_for_export)

# Execution results



In [None]:
dict_data_for_export["master"]

In [None]:
dict_data_for_export["delta"]

In [None]:
dict_data_for_export["portfolio"]