# Getting closed trade-by-trade data from the Brazilian stock exchange

**What?** The Brazilian stock exchange, B3, publishes a daily report on its website listing every trade closed on the previous market day. For each trade, the dataset gives the asset, the quantity, the price, the brokers on both sides, and the second at which the trade was closed.

**Why?** This dataset is highly useful for plotting price performance as an intraday series. It supports pattern recognition and other analysis, and gives insight into how each asset trades.

**How?** The last 20 market days of trading data are available as .zip files on the B3 website
[(see this link)](https://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/cotacoes/cotacoes/). The files are downloaded manually to a local temporary folder. This notebook shows how to read the .csv file inside each downloaded .zip and how to clean and transform the data according to this guide [(link here)](https://www.b3.com.br/data/files/14/42/28/31/FEC4A8103234E0A8AC094EA8/Glossario_NegociosListados_PT.pdf). Finally, the data is loaded into a local SQLite database, where it can be used for further analysis.

<img src="https://lh3.googleusercontent.com/d/1e-hu9egDMB2j2ZoRQLKXd0qd0GTLuXmL" alt="texto_alternativo" width="400" align="center">

## Import Libraries

In [3]:
import pandas as pd
import numpy as np
import os
import re

import sqlite3
import requests
import zipfile

#### Find the most recent date already loaded into the local SQLite database

In [16]:
conn = sqlite3.connect(os.getenv('MY_FINANCE_DB_PATH')+'/finance_database.db')
cursor = conn.cursor()
cursor.execute('''SELECT DataReferencia
                    FROM B3_trade_by_trade 
                    GROUP BY DataReferencia''') # this table was previously created to hold the trade by trade data

rows = cursor.fetchall()
columns = [description[0] for description in cursor.description]

df_dt = pd.DataFrame(rows, columns=columns)
conn.close()

df_dt['DataReferencia'].sort_values(ascending = False).head(3)

19    2024-05-21
18    2024-05-20
17    2024-05-17
Name: DataReferencia, dtype: object
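
The same lookup can also be done in a single query. The sketch below is a minimal illustration, not the notebook's own method: the helper name `last_loaded_date` is hypothetical, and an in-memory database with made-up rows stands in for `finance_database.db`. It assumes `DataReferencia` is stored as ISO `YYYY-MM-DD` text, as the output above suggests, so that `MAX` on the text returns the latest date.

```python
import sqlite3

import pandas as pd


def last_loaded_date(conn, table="B3_trade_by_trade"):
    """Return the most recent DataReferencia in `table`, or None if it is empty."""
    # Table names cannot be bound as SQL parameters, so `table` must be trusted.
    query = f"SELECT MAX(DataReferencia) AS last_date FROM {table}"
    result = pd.read_sql_query(query, conn)
    return result["last_date"].iloc[0]


# Demo on an in-memory database with illustrative rows (not real B3 data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE B3_trade_by_trade (DataReferencia TEXT, CodigoInstrumento TEXT)"
)
conn.executemany(
    "INSERT INTO B3_trade_by_trade VALUES (?, ?)",
    [("2024-05-17", "PETR4"), ("2024-05-20", "PETR4"), ("2024-05-21", "VALE3")],
)
latest = last_loaded_date(conn)
conn.close()
print(latest)
```

Letting SQLite compute the maximum avoids pulling every distinct date into pandas just to sort it.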

#### Look for new files manually downloaded from the B3 website into a local folder

In [17]:
file_path = os.path.join('temp_files') # Define the file path within the subfolder
all_files = os.listdir(file_path)
zip_files_with_paths = [os.path.join(file_path, file) for file in all_files if file.endswith('.zip')]

zip_files_with_paths

['temp_files\\22-05-2024_NEGOCIOSAVISTA.zip',
 'temp_files\\23-05-2024_NEGOCIOSAVISTA.zip',
 'temp_files\\24-05-2024_NEGOCIOSAVISTA.zip']
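
The cell above picks up every .zip file in the folder, so files that were already loaded have to be removed by hand. One possible way to skip them is to compare the date embedded in each file name with the last date found in the database. The sketch below assumes the `DD-MM-YYYY_NEGOCIOSAVISTA.zip` naming seen in the listing above; the helper `files_newer_than` and the example paths are hypothetical.

```python
import os
import re
from datetime import date, datetime

# Pattern assumed from the file names listed above: DD-MM-YYYY_NEGOCIOSAVISTA.zip
FILE_DATE_PATTERN = re.compile(r"(\d{2}-\d{2}-\d{4})_NEGOCIOSAVISTA\.zip$")


def files_newer_than(paths, last_loaded):
    """Keep only the zip files whose name carries a date after `last_loaded`."""
    selected = []
    for path in paths:
        match = FILE_DATE_PATTERN.search(os.path.basename(path))
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        file_date = datetime.strptime(match.group(1), "%d-%m-%Y").date()
        if file_date > last_loaded:
            selected.append((file_date, path))
    # Sort chronologically so the files are loaded in market-day order.
    return [path for _, path in sorted(selected)]


example_paths = [
    os.path.join("temp_files", "23-05-2024_NEGOCIOSAVISTA.zip"),
    os.path.join("temp_files", "21-05-2024_NEGOCIOSAVISTA.zip"),
    os.path.join("temp_files", "22-05-2024_NEGOCIOSAVISTA.zip"),
]
new_files = files_newer_than(example_paths, date(2024, 5, 21))
print(new_files)
```

Sorting by the parsed date rather than by file name matters here, because `DD-MM-YYYY` names do not sort chronologically across months.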

#### Extract, Transform and Load dataset

In [18]:
# Read the trade-by-trade .csv file inside each .zip downloaded from B3
####################################################################################################
df_app = pd.DataFrame()
csv_file_name_list = []

for zip_file_name in zip_files_with_paths: #zip_files_with_paths[::-1][6:8]

    with zipfile.ZipFile(zip_file_name, 'r') as zip_file:
        
        csv_file_name = zip_file.namelist()[0]
        
        df_app = pd.read_csv(zip_file.open(csv_file_name),sep = ";", encoding = "UTF-8",low_memory=False, dtype=str)

# Data cleaning: changing data types
####################################################################################################
    def interpret_timestamp(timestamp_str): # convert the digit string holding the trade time (e.g. '100353907') into HH:MM:SS format
        hours = timestamp_str[:2]
        minutes = timestamp_str[2:4]
        seconds = timestamp_str[4:6]
        timestamp = f"{hours}:{minutes}:{seconds}"
        return timestamp

    # Apply interpret_timestamp to every row of the 'HoraFechamento' column
    df_app['ClosedHour'] = df_app['HoraFechamento'].apply(interpret_timestamp)

    # Convert combined strings to datetime objects
    combined_datetime = df_app['DataNegocio'] + ' ' + df_app['ClosedHour']
    df_app['ClosedDateTime'] = pd.to_datetime(combined_datetime)

    # adjust the data types of the price and quantity of each trade
    df_app['PrecoNegocio'] = df_app['PrecoNegocio'].str.replace(',', '.').astype(float)
    df_app['QuantidadeNegociada'] = df_app['QuantidadeNegociada'].str.replace(',', '.').astype(float)


# write the dataframe into the SQLite database
####################################################################################################
    conn = sqlite3.connect(os.getenv('MY_FINANCE_DB_PATH')+'/finance_database.db')

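    df_app.to_sql('B3_trade_by_trade',conn,if_exists='append',index=False)
    conn.close() # release the database connection before reading the next file
    
    # print the name of the file just loaded and its first row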
    print(csv_file_name)
    print(df_app.head(1))
    del df_app 
    df_app = pd.DataFrame()
    

22-05-2024_NEGOCIOSAVISTA.txt
23-05-2024_NEGOCIOSAVISTA.txt
24-05-2024_NEGOCIOSAVISTA.txt
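
The row-by-row timestamp conversion in the cell above can also be written with vectorized string operations. The sketch below is one possible variant under two assumptions: that `HoraFechamento` follows an `HHMMSSmmm` layout, as a value such as `100353907` suggests, and that a leading zero may be missing for trades before 10:00, which is why the string is left-padded to 9 digits. The helper `parse_closed_datetime` and the sample rows are hypothetical. Like the original cell, it discards the milliseconds.

```python
import pandas as pd


def parse_closed_datetime(df):
    """Return a datetime Series built from DataNegocio and HoraFechamento."""
    # Assumption: HoraFechamento is HHMMSSmmm; left-pad to 9 digits in case a
    # leading zero was dropped for trades closed before 10:00.
    padded = df["HoraFechamento"].str.zfill(9)
    clock = padded.str[:2] + ":" + padded.str[2:4] + ":" + padded.str[4:6]
    return pd.to_datetime(
        df["DataNegocio"] + " " + clock, format="%Y-%m-%d %H:%M:%S"
    )


# Illustrative rows, not real B3 data.
sample = pd.DataFrame(
    {
        "DataNegocio": ["2024-04-30", "2024-04-30"],
        "HoraFechamento": ["100353907", "95959123"],
    }
)
sample["ClosedDateTime"] = parse_closed_datetime(sample)
print(sample)
```

Passing an explicit `format` to `pd.to_datetime` avoids per-row format inference, which tends to matter on files with millions of trades.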


#### Reading a sample from the SQLite database

In [5]:
# Show a small sample of the data
conn = sqlite3.connect(os.getenv('MY_FINANCE_DB_PATH')+'/finance_database.db')
cursor = conn.cursor()
cursor.execute('''SELECT *
            FROM B3_trade_by_trade
            WHERE CodigoInstrumento = 'PETR4' and DataReferencia = '2024-04-30' ''') # read a specific ticker and date as an example
rows = cursor.fetchall()
columns = [description[0] for description in cursor.description]

df_sample = pd.DataFrame(rows, columns=columns)
conn.close()

df_sample.head()

|  | DataReferencia | CodigoInstrumento | AcaoAtualizacao | PrecoNegocio | QuantidadeNegociada | HoraFechamento | CodigoIdentificadorNegocio | TipoSessaoPregao | DataNegocio | CodigoParticipanteComprador | CodigoParticipanteVendedor | ClosedHour | ClosedDateTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-04-30 | PETR4 | 0 | 42.0 | 100.0 | 100353907 | 10 | 1 | 2024-04-30 | 238 | 8 | 10:03:53 | 2024-04-30 10:03:53 |
| 1 | 2024-04-30 | PETR4 | 0 | 42.0 | 6300.0 | 100353907 | 20 | 1 | 2024-04-30 | 77 | 8 | 10:03:53 | 2024-04-30 10:03:53 |
| 2 | 2024-04-30 | PETR4 | 0 | 42.0 | 100.0 | 100353907 | 30 | 1 | 2024-04-30 | 4090 | 8 | 10:03:53 | 2024-04-30 10:03:53 |
| 3 | 2024-04-30 | PETR4 | 0 | 42.0 | 100.0 | 100353907 | 40 | 1 | 2024-04-30 | 90 | 8 | 10:03:53 | 2024-04-30 10:03:53 |
| 4 | 2024-04-30 | PETR4 | 0 | 42.0 | 900.0 | 100353907 | 50 | 1 | 2024-04-30 | 3 | 8 | 10:03:53 | 2024-04-30 10:03:53 |
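
As an illustration of the intraday analysis this table makes possible, the sketch below aggregates trades into one-minute bars with open, high, low, close, volume and a volume-weighted average price (VWAP). It is only a sketch: the trades are made-up values, not real B3 data; the column names follow the sample above; and the one-minute interval and the `Notional` helper column are arbitrary choices for the example.

```python
import pandas as pd

# Illustrative trades, not real B3 data; column names follow the table above.
trades = pd.DataFrame(
    {
        "ClosedDateTime": pd.to_datetime(
            [
                "2024-04-30 10:03:53",
                "2024-04-30 10:03:58",
                "2024-04-30 10:04:10",
                "2024-04-30 10:04:40",
            ]
        ),
        "PrecoNegocio": [42.00, 42.10, 42.05, 41.95],
        "QuantidadeNegociada": [100.0, 300.0, 200.0, 200.0],
    }
)

# Traded value per trade, needed for the volume-weighted average price (VWAP).
trades["Notional"] = trades["PrecoNegocio"] * trades["QuantidadeNegociada"]

# Group trades into one-minute buckets keyed by their closing time.
by_minute = trades.set_index("ClosedDateTime").resample("1min")
bars = by_minute["PrecoNegocio"].ohlc()
bars["volume"] = by_minute["QuantidadeNegociada"].sum()
bars["vwap"] = by_minute["Notional"].sum() / bars["volume"]
print(bars)
```

With data read back from SQLite, `ClosedDateTime` arrives as text and would first need `pd.to_datetime` before it can be used as the resampling index.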
