# Getting the securities lending open positions from the Brazilian stock exchange

**What ?** Securities lending open positions refer to the active loans in which securities, such as stocks or bonds, have been lent by one party (the lender) to another (the borrower) in exchange for collateral. These open positions indicate that the borrower has temporary possession of the securities and has yet to return them to the lender. The Brazilian stock exchange, B3, provides daily data on all open positions, including quantities, volumes, and average prices, grouped by ticker and type of register [[see glossary here]](https://www.b3.com.br/data/files/BF/F1/58/15/391EA810E9C1AAA8AC094EA8/Glossario%20_%20Posicao_Em_Aberto.pdf).

**Why ?** Securities lending open positions data is crucial for financial market analysis: it serves as a market sentiment indicator, offers insight into supply and demand, and supports risk management and price movement predictions.

**How ?** The open position data is available in a comprehensive PDF document titled [Daily Market Bulletin](https://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/boletim-diario/boletim-diario-do-mercado/), which runs to several hundred pages (the June 2024 bulletins processed below exceed 700 pages) and includes various unstructured market data. To extract the open position tables, the PDF documents were manually downloaded from the B3 website. Scraping code was then developed using the **pdfplumber** Python library to parse the textual data and convert it into a structured dataframe. Finally, the dataframe is saved into a local SQLite database for further analysis.

<img src="https://lh3.googleusercontent.com/d/1hHJx0qCjbFOMg4hjF7CK7dEei-xUPaNF" alt="securities_lending_open_position_icon" width="300" align="center">


# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
import re
import fitz  # PyMuPDF
import pdfplumber
pd.set_option('display.float_format', '{:.2f}'.format)

import sqlite3
import requests
import zipfile
import PyPDF2

# Defining parameters

In [2]:
script_directory = os.getcwd() #getting the script directory path
directory_path = os.path.join(script_directory,'temp_file/202406') # Define the file path within the subfolder
initial_tag = 'Empréstimos de Ativos – Posição em Aberto' # string phrase that marks the beginning of the table
final_tag = 'Empréstimos de Ativos – Empréstimos Registrados' # string phrase that marks the end of the table
initial_word_tag = 'Aberto' 
final_word_tag = 'Registrados'
table_columns = ['date_dt','ticker','isin','company_orfund','type','market','balance_qty','average_price','balance_reais'] #columns to be extracted


# Defining Functions

### Extracting the list of downloaded PDF file names to be scraped

In [3]:
def func_get_file_names(directory_path):
    pattern = re.compile(r'^BDI_00_\d{8}\.pdf$') # daily bulletin file name pattern (dot escaped so it matches a literal ".")
    matching_files = [f for f in os.listdir(directory_path) if pattern.match(f)]
    return matching_files 
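
The eight digits in each bulletin name encode the reference date (for example `BDI_00_20240603.pdf` in the run shown further down). As a minimal sketch, assuming the digits follow the `YYYYMMDD` layout that those names suggest, the hypothetical helpers `parse_bulletin_date` and `sort_bulletins_by_date` below extract that date so the files can be processed in chronological order. They are not part of the original notebook.

```python
import re
from datetime import datetime

# Same naming pattern as above, with the dot escaped and the date captured
BULLETIN_PATTERN = re.compile(r'^BDI_00_(\d{8})\.pdf$')


def parse_bulletin_date(file_name):
    """Return the date encoded in a bulletin file name, or None if it does not match.

    Assumes the eight digits are laid out as YYYYMMDD.
    """
    match = BULLETIN_PATTERN.match(file_name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), '%Y%m%d').date()
    except ValueError:
        # Eight digits that do not form a valid calendar date
        return None


def sort_bulletins_by_date(file_names):
    """Keep only valid bulletin names and sort them chronologically."""
    dated = [(parse_bulletin_date(name), name) for name in file_names]
    return [name for day, name in sorted(pair for pair in dated if pair[0] is not None)]
```

Since `os.listdir` returns names in arbitrary order, sorting this way makes the order of the appended dataframe predictable.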

### Find the initial and final pages of the table by their tags

In [4]:
def func_find_string_in_pdf(pdf_path, search_string):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    pages_with_string = []
    # Iterate through each page
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        text = page.get_text()

        # Check if the search string is in the text of the current page
        if search_string in text:
            pages_with_string.append(page_num + 1)  # Pages are 1-indexed
    
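    pdf_document.close()  # release the file handle once all pages were read

    if not pages_with_string:
        print("\n\n**** The tag phrase was not found in this document! ****\n\n")
        return None  # callers that apply max() to the result must handle this case
    return pages_with_string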
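
Because the function returns nothing when the tag is missing, a caller that applies `max()` directly to the result would fail on such a file. The sketch below shows one way to guard that step. The helper name `resolve_page_range` is hypothetical, and taking the last occurrence of each tag mirrors the `max()` calls used later in this notebook.

```python
def resolve_page_range(initial_pages, final_pages):
    """Return (first_page, last_page) for the table, or None if it cannot be located.

    Each argument is the list of 1-indexed pages on which a tag was found,
    or None / an empty list when the tag is absent.
    """
    if not initial_pages or not final_pages:
        return None  # at least one tag is missing from the document

    first_page = max(initial_pages)  # last occurrence of the opening tag
    last_page = max(final_pages)     # last occurrence of the closing tag

    if last_page < first_page:
        return None  # tags found in an unexpected order
    return first_page, last_page
```

A file for which this returns `None` can be skipped and logged instead of stopping the whole loop.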

### Getting the table from the page range and parsing data into a dataframe

##### Hey, this was a very hard function to build. Send me a message on [LinkedIn](https://www.linkedin.com/in/lucas-aragao-santos/) and I will be glad to share it with you

In [5]:
from my_functions import func_btb_open_pos_extract_tables

### Processing and cleaning the scraped table into a final dataframe

In [6]:
def func_clean_and_convert_to_float(value):
    try:
        # Remove leading and trailing whitespaces
        cleaned_value = value.strip()
        # Remove any internal white spaces
        cleaned_value = cleaned_value.replace(' ', '')  
        # Check if the cleaned value is a valid float
        if re.match(r'^-?\d+(\.\d+)?$', cleaned_value):
            return float(cleaned_value)
        else:
            # Handle cases where the value is not a valid float
            return None
    except Exception as e:
        # Handle unexpected errors
        return None


def func_process_dataframe(df):
    
    df.replace({'-': np.nan, '': np.nan}, inplace=True) # normalizing null values
    df.dropna(subset=['ticker'], inplace = True) # drop rows that carry no useful information
    # changing data types 
    df['balance_qty'] = df['balance_qty'].str.replace('.', '', regex=False)
    df['average_price'] = df['average_price'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    df['balance_reais'] = df['balance_reais'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    
    df['balance_qty'] = df['balance_qty'].apply(func_clean_and_convert_to_float)
    df['average_price'] = df['average_price'].apply(func_clean_and_convert_to_float)
    df['balance_reais'] = df['balance_reais'].apply(func_clean_and_convert_to_float)
    
    # remapping market labels that were garbled during scraping
    market_remap = {'NDe+1g. Eletrônica':'Neg. Eletrônica/D+1',
                    'NDe+0g. Eletrônica':'Neg. Eletrônica/D+0',
                    '1 NDe+1g. Eletrônica':'Neg. Eletrônica/D+1',
                    '2 NDe+1g. Eletrônica':'Neg. Eletrônica/D+1',
                    '1 NDe+0g. Eletrônica' :'Neg. Eletrônica/D+0',
                    '2 NDe+0g. Eletrônica':'Neg. Eletrônica/D+0',
                    '1 Registro':'Registro',
                    '2 Registro':'Registro',
                    'Registro':'Registro',
                    'Total':'Total',
                    'ERFiTexFnada': 'ETF Renda Fixa'}
    
    df['market'] = df['market'].apply(lambda x: market_remap.get(x, x))
    
    # adding a processing timestamp column to record when the scraping process ran
    df['proc_datedt'] = datetime.now().replace(microsecond=0) 

    print('Dataframe was processed and cleaned ...')
    return df
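
The bulletin writes numbers in the Brazilian convention, with `.` as the thousands separator and `,` as the decimal mark, which is why the separators are swapped before conversion. A minimal standalone sketch of that conversion for a single string follows, using a hypothetical helper named `parse_brazilian_number` that is not part of the original code:

```python
import re

# Digits with optional "." thousands groups and an optional "," decimal part
_BR_NUMBER = re.compile(r'^-?\d{1,3}(\.\d{3})*(,\d+)?$|^-?\d+(,\d+)?$')


def parse_brazilian_number(text):
    """Convert a string such as '139.780,46' to a float; return None if it is not numeric."""
    if not isinstance(text, str):
        return None
    cleaned = text.strip().replace(' ', '')
    if not _BR_NUMBER.match(cleaned):
        return None
    # Drop thousands separators, then turn the decimal comma into a dot
    return float(cleaned.replace('.', '').replace(',', '.'))
```

Validating the separators before removing them means that a value such as `'1.5'`, which is not valid in this convention, is rejected rather than silently read as 15.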

## Calling functions

In [7]:
downloaded_files = func_get_file_names(directory_path) #getting the files names downloaded
df_append = pd.DataFrame() #init the final appended dataframe

for file_name in downloaded_files: #loop over all .pdf daily bulletin manually downloaded from B3 website

    pdf_path = os.path.join(directory_path, file_name) # creating the complete file path to the pdf document
    print('\n---------------------- Extracting from file: {} ----------------------\n'.format(file_name))
    
    ################## 1 - Search the .pdf file for the pages where the desired table begins and ends
    
    initial_page_tag = max(func_find_string_in_pdf(pdf_path, initial_tag)) # finding the initial page tag
    final_page_tag = max(func_find_string_in_pdf(pdf_path, final_tag)) # finding the final page tag

    print(f'The string "{initial_tag}" was last found on the following page: {initial_page_tag} \n')
    print(f'The string "{final_tag}" was last found on the following page: {final_page_tag} \n')
    print('\n--------------- Step 1 completed ----------------\n')

    ################## 2 - Scraping the data tables from the previously defined page range
    df_data = func_btb_open_pos_extract_tables(pdf_path,
                                initial_page_tag,
                                final_page_tag,
                                initial_word_tag,
                                final_word_tag,
                                table_columns)
    print('\n--------------- Step 2 completed ----------------\n')

    ################## 3  - Processing and cleaning dataframe
    df_final = func_process_dataframe(df_data)
    print('\n--------------- Step 3 completed ----------------\n')

    #append each scraped daily dataframe to the final df
    df_append = pd.concat([df_append,df_final], ignore_index = True)
    
    
# Displaying dataframe appended
print(df_append.info())
df_append.head()


---------------------- Extracting from file: BDI_00_20240603.pdf ----------------------

The string "Empréstimos de Ativos – Posição em Aberto" was last found on the following page: 602 

The string "Empréstimos de Ativos – Empréstimos Registrados" was last found on the following page: 647 


--------------- Step 1 completed ----------------

Starting scrapping tables from page ...
602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 
--------------- Step 2 completed ----------------

Dataframe was processed and cleaned ...

--------------- Step 3 completed ----------------


---------------------- Extracting from file: BDI_00_20240604.pdf ----------------------

The string "Empréstimos de Ativos – Posição em Aberto" was last found on the following page: 604 

The string "Empréstimos de Ativos – Empréstimos Registrados" was last found on the following page: 64

670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 
--------------- Step 2 completed ----------------

Dataframe was processed and cleaned ...

--------------- Step 3 completed ----------------


---------------------- Extracting from file: BDI_00_20240619.pdf ----------------------

The string "Empréstimos de Ativos – Posição em Aberto" was last found on the following page: 679 

The string "Empréstimos de Ativos – Empréstimos Registrados" was last found on the following page: 722 


--------------- Step 1 completed ----------------

Starting scrapping tables from page ...
679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 
--------------- Step 2 completed ----------------

Dataframe was processed and cleaned ...

--------------- Step 

| | date_dt | ticker | isin | company_orfund | type | market | balance_qty | average_price | balance_reais | doc_page | proc_datedt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 03/06/2024 | 5GTK11 | BR5GTKCTF000 | IÍNNVDEICSET OIE ETF BLUESTAR 5G COM INDEX FDO | CI | Neg. Eletrônica/D+1 | 1543.0 | 90.59 | 139780.46 | 602 | 2024-06-25 09:02:42 |
| 1 | 03/06/2024 | 5GTK11 | BR5GTKCTF000 | IÍNNVDEICSET OIE ETF BLUESTAR 5G COM INDEX FDO | CI | Registro | 73.0 | 98.94 | 7222.44 | 602 | 2024-06-25 09:02:42 |
| 2 | 03/06/2024 | 5GTK11 | BR5GTKCTF000 | IÍNNVDEICSET OIE ETF BLUESTAR 5G COM INDEX FDO | CI | Total | 1616.0 | | 147002.9 | 602 | 2024-06-25 09:02:42 |
| 3 | 03/06/2024 | A1AP34 | BRA1APBDR001 | ADVANCE AUTO PARTS INC | DRN | Registro | 106.0 | 20.73 | 2197.04 | 602 | 2024-06-25 09:02:42 |
| 4 | 03/06/2024 | A1AP34 | BRA1APBDR001 | ADVANCE AUTO PARTS INC | DRN | Total | 106.0 | | 2197.04 | 602 | 2024-06-25 09:02:42 |


## Data quality

In [9]:
#importing data quality functions
from my_functions import func_dq_integer_as_decimal # check for any non-integer (decimal) value in an integer column
from my_functions import func_dq_date_format # check that the date column is in the DD/MM/YYYY format

In [13]:
func_dq_integer_as_decimal(df_append,'balance_qty')

Check ok: Any integer as decimal detected


In [20]:
func_dq_date_format(df_append,'date_dt')

Check ok: date column is in the right format 'DD/MM/YYYY'
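
The sample rows above suggest one more consistency check: for each date and ticker, the `Total` row should equal the sum of the other market rows (for `5GTK11`, 1543 + 73 = 1616). The sketch below is a hypothetical addition, not one of the functions imported from `my_functions`. It works on plain dictionaries so it can be read on its own, and it assumes that every non-`Total` row of a ticker contributes to the total and that the checked column has no missing values.

```python
from collections import defaultdict


def find_total_mismatches(rows, column='balance_qty', tolerance=0.01):
    """Return the (date_dt, ticker) keys whose 'Total' row differs from the sum of the other rows."""
    partial_sums = defaultdict(float)
    totals = {}
    for row in rows:
        key = (row['date_dt'], row['ticker'])
        if row['market'] == 'Total':
            totals[key] = row[column]
        else:
            partial_sums[key] += row[column]

    # A key is flagged when its reported total and computed sum disagree
    return sorted(key for key, total in totals.items()
                  if abs(total - partial_sums[key]) > tolerance)


# Rows copied from the sample output shown earlier
sample_rows = [
    {'date_dt': '03/06/2024', 'ticker': '5GTK11', 'market': 'Neg. Eletrônica/D+1', 'balance_qty': 1543.0},
    {'date_dt': '03/06/2024', 'ticker': '5GTK11', 'market': 'Registro', 'balance_qty': 73.0},
    {'date_dt': '03/06/2024', 'ticker': '5GTK11', 'market': 'Total', 'balance_qty': 1616.0},
    {'date_dt': '03/06/2024', 'ticker': 'A1AP34', 'market': 'Registro', 'balance_qty': 106.0},
    {'date_dt': '03/06/2024', 'ticker': 'A1AP34', 'market': 'Total', 'balance_qty': 106.0},
]
mismatches = find_total_mismatches(sample_rows)  # empty list for these rows
```

With the real dataframe, `df_append.to_dict('records')` would provide the same list-of-dictionaries shape.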


## Write the dataframe into a local SQLite database

In [22]:
conn = sqlite3.connect('D:/finance_data/finance_database.db')

df_append.to_sql('B3_securities_lending_open_pos',conn,if_exists='append',index=False)

37091
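
Because the table is written with `if_exists='append'`, running the notebook twice over the same bulletins would store the same rows twice. One way to guard against that, sketched with the standard `sqlite3` module and a hypothetical helper `dates_already_loaded`, is to read the dates already present in the table and drop them from the dataframe before appending. The example uses an in-memory database with a reduced two-column table so it runs on its own.

```python
import sqlite3


def dates_already_loaded(conn, table_name):
    """Return the set of date_dt values stored in the table, or an empty set if the table is missing."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    if exists is None:
        return set()
    # The table name cannot be bound as a parameter, so it is quoted as an identifier
    quoted = '"' + table_name.replace('"', '""') + '"'
    rows = conn.execute(f'SELECT DISTINCT date_dt FROM {quoted}').fetchall()
    return {row[0] for row in rows}


# Demonstration on an in-memory database (illustrative data only)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE B3_securities_lending_open_pos (date_dt TEXT, ticker TEXT)')
demo_conn.execute("INSERT INTO B3_securities_lending_open_pos VALUES ('03/06/2024', '5GTK11')")
loaded = dates_already_loaded(demo_conn, 'B3_securities_lending_open_pos')
demo_conn.close()
```

With the set in hand, `df_append[~df_append['date_dt'].isin(loaded)]` keeps only the rows for dates that are not yet stored.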