# Extracting Data
Investment funds in Brazil must be registered with the Brazilian Securities and Exchange Commission (CVM). CVM publishes data on each fund's daily returns, benchmark index, fund type, issuer, manager, and other related information. All of this data is open to the public on the CVM [website](https://dados.cvm.gov.br/group/fundos-de-investimento):
- Fund returns: one database each for daily, monthly, quarterly, and annual data
- Register info: fund name, manager, issuer, type, open/closed
- Statement of Income: database linking funds to their statements of income
- Investor profile: who owns the funds' quotas (other businesses, retirement funds, individual investors, professional investors)
- Performance metrics: how return is calculated, risk according to managers, collateral

For this study, I'm interested in the daily returns and the register info. This way I can map funds by their features and follow their performance over time. The advantage of using daily data is that I can aggregate it into monthly, quarterly, and annual figures.
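
As a minimal sketch of that aggregation, the block below compounds daily returns into monthly and quarterly returns. The quota series is synthetic and the names (`quota`, `compound`) are illustrative assumptions, not CVM column names.

```python
import numpy as np
import pandas as pd

# Synthetic daily quota values for one hypothetical fund (not CVM data)
dates = pd.bdate_range('2024-01-01', '2024-06-28')
rng = np.random.default_rng(0)
daily_moves = rng.normal(0.0004, 0.002, len(dates))
quota = pd.Series(100 * np.cumprod(1 + daily_moves), index=dates, name='quota')

# Daily return is the percentage change of the quota value
daily_return = quota.pct_change().dropna()

def compound(returns):
    """Compound a series of simple returns into one period return."""
    return (1 + returns).prod() - 1

# Group the daily returns by calendar month / quarter and compound each group
monthly_return = daily_return.groupby(daily_return.index.to_period('M')).apply(compound)
quarterly_return = daily_return.groupby(daily_return.index.to_period('Q')).apply(compound)

print(monthly_return)
print(quarterly_return)
```

Compounding (rather than summing) keeps the monthly figures consistent with the daily ones: chaining the monthly returns gives the same total return as chaining every daily return.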

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import zipfile
import os
from bcb import currency, sgs, Expectativas
import yfinance
from datetime import datetime, timedelta

# Financial funds data

In [2]:
# Creating parameters to download data
## Date parameters to match CVM files
years = ['2024','2023','2022','2021','2020']
legacy = ['2020']

# Zero-padded month strings ('01' ... '12') to match CVM file names
months = range(1,13)
month_list = [f'{m:02d}' for m in months]

To collect the data, I'll request the zip files containing the [daily returns of financial funds](https://dados.cvm.gov.br/dataset/fi-doc-inf_diario) directly from the CVM website. The main issue here is that there are two types of repositories: (1) daily data organized in monthly zip files, covering the current year back to three years ago (Y-3), and (2) one yearly zip file for each older year. So in my study 2024, 2023, 2022, and 2021 have monthly zip files, while [2020](https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/HIST/) has a single zip file with 12 csv files inside.

To handle this, the download loop just needs to check whether a year is considered old/legacy by CVM. I created a legacy list containing 2020 earlier, so I can use it in the loop now.

URL with daily return data
URL model: dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/**DADOS/inf_diario_fi_202307**.zip
- replace the date in the URL ('202307') with the year and month from the loop

URL for legacy data
URL legacy model: dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/**DADOS/HIST/inf_diario_fi_2000**.zip
- replace only the year 2000
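
The two URL models above can be wrapped in a small helper. This is only a sketch: the function name `build_daily_return_url` and its `legacy_years` parameter are hypothetical, and the default legacy set simply mirrors the legacy list used in this study.

```python
BASE_URL = 'https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS'

def build_daily_return_url(year, month=None, legacy_years=('2020',)):
    """Return the zip URL for one period, following the two URL models above.

    legacy_years is an assumption: it lists the years treated as legacy
    in this study, not an official CVM rule.
    """
    year = str(year)
    if year in legacy_years:
        # Legacy data: one zip per year inside the HIST directory
        return f'{BASE_URL}/HIST/inf_diario_fi_{year}.zip'
    if month is None:
        raise ValueError('month is required for non-legacy years')
    # Recent data: one zip per year and zero-padded month
    return f'{BASE_URL}/inf_diario_fi_{year}{int(month):02d}.zip'

print(build_daily_return_url(2023, 7))
print(build_daily_return_url(2020))
```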

In [3]:
# Create empty dataframes to store return data
cvm_daily_return = pd.DataFrame()
cvm_legacy_return = pd.DataFrame()

# Create loop to download data from 2020 to July 2024
## I'll collect data for each year in our years list
for yyyy in years:
    try:
        if yyyy in legacy:  # Data older than Y-3 is treated as history and kept in a separate directory; I call it legacy
            daily_return_url = f'https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/HIST/inf_diario_fi_{yyyy}.zip'
            download_url = requests.get(daily_return_url)
            zip_filename = f'inf_diario_fi_{yyyy}.zip' 
            with open(zip_filename,'wb') as zip_ref:
                zip_ref.write(download_url.content)
            with zipfile.ZipFile(zip_filename, 'r') as cvm_zip:
                legacy_csv = [pd.read_csv(cvm_zip.open(f), sep=';') for f in cvm_zip.namelist()]
                cvm_legacy_return = pd.concat(legacy_csv)
            os.remove(zip_filename)
        else:
            for mm in month_list:   # Data from Y-3 onward is published per year and month, so loop over every element of month_list
                daily_return_url = f'https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_{yyyy}{mm}.zip'
                download_url = requests.get(daily_return_url) 
                zip_filename = f'inf_diario_fi_{yyyy}{mm}.zip' 
                with open(zip_filename, 'wb') as zip_ref:
                    zip_ref.write(download_url.content)
                with zipfile.ZipFile(zip_filename) as cvm_zip:
                    for file_name in cvm_zip.namelist():
                        if file_name.endswith('.csv'):
                            with cvm_zip.open(file_name) as cvm_csv:
                                cvm_daily_return_temp = pd.read_csv(cvm_csv, sep=';')
                                cvm_daily_return = pd.concat([cvm_daily_return, cvm_daily_return_temp])
                os.remove(zip_filename)
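    except (requests.RequestException, zipfile.BadZipFile):
        pass    # Keep going when a file is missing, e.g. months of the current year not yet published (YYYYMM+1)

# Append the legacy data once, after every year has been processed
cvm_daily_return = pd.concat([cvm_daily_return, cvm_legacy_return])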

# Delete unused variables to clean memory
del cvm_csv, cvm_zip, download_url, file_name, legacy, mm, yyyy
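
An alternative to writing each zip to disk and deleting it afterwards is to open the downloaded bytes directly in memory. The sketch below assumes a hypothetical helper, `read_csvs_from_zip_bytes`, and demonstrates it on a small zip built in memory, not on CVM data.

```python
import io
import zipfile

import pandas as pd

def read_csvs_from_zip_bytes(content, sep=';'):
    """Read every csv inside a zip archive held in memory into one DataFrame."""
    frames = []
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            if name.endswith('.csv'):
                with archive.open(name) as csv_file:
                    frames.append(pd.read_csv(csv_file, sep=sep))
    if not frames:
        return pd.DataFrame()
    # Concatenate once at the end; this avoids re-copying a growing DataFrame
    return pd.concat(frames, ignore_index=True)

# Demo: build a tiny zip in memory (made-up columns and values)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo_zip:
    demo_zip.writestr('part_1.csv', 'fund_id;quota\nA;1.00\nB;2.00\n')
    demo_zip.writestr('part_2.csv', 'fund_id;quota\nA;1.01\n')
    demo_zip.writestr('readme.txt', 'ignored')

demo = read_csvs_from_zip_bytes(buffer.getvalue())
print(demo)
```

In the download loop, the same helper could be called as `read_csvs_from_zip_bytes(requests.get(daily_return_url).content)`, which removes the need for `os.remove`.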

In [4]:
# Save data into csv file to be cleaned in the next step
cvm_daily_return.to_csv('cvm_daily_return.csv')

# Register data
The next step is to identify the funds. Investors know them by name, not by register number. I'll also need to check their status to see whether each fund is still active. Here I face a situation similar to the return data: there is [current](https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi.csv) and [legacy](https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi_hist.zip) data.

In [5]:
# Open current register file
cvm_register = pd.read_csv('https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi.csv', sep=';', encoding='latin-1')

# Open legacy register file
## Issue: legacy has several csvs in one zip file
register_url = 'https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi_hist.zip'
download_register = requests.get(register_url)
register_zip = 'cad_fi_hist.zip'
with open(register_zip,'wb') as reg_ref:
    reg_ref.write(download_register.content)
with zipfile.ZipFile(register_zip,'r') as reg_zip:   # separate name so the file-name variable is not overwritten
    reg_csv_lag = [pd.read_csv(reg_zip.open(g), sep=';', encoding='latin-1') for g in reg_zip.namelist()]
    legacy_register = pd.concat(reg_csv_lag, axis=0, ignore_index=True)
os.remove('cad_fi_hist.zip')


In [6]:
# Save data into csv file to be cleaned in the next step
cvm_register.to_csv('cvm_register.csv')
legacy_register.to_csv('legacy_register.csv')
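
Once both datasets are saved, attaching names and status to the returns comes down to a join on the fund identifier. The sketch below uses made-up data and hypothetical column names (`fund_id`, `fund_name`, `status`); the real CVM column names need to be checked in the files.

```python
import pandas as pd

# Made-up return data keyed by a hypothetical fund identifier
returns = pd.DataFrame({
    'fund_id': ['A', 'A', 'B', 'C'],
    'date': pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-02', '2024-01-02']),
    'quota': [1.00, 1.01, 2.00, 3.00],
})

# Made-up register data: one row per fund
register = pd.DataFrame({
    'fund_id': ['A', 'B'],
    'fund_name': ['Alpha Fund', 'Beta Fund'],
    'status': ['active', 'closed'],
})

# Left join keeps every return row; validate guards against duplicated register rows
labelled = returns.merge(register, on='fund_id', how='left',
                         validate='many_to_one', indicator=True)

# Funds present in the returns but missing from the register
unmatched = labelled.loc[labelled['_merge'] == 'left_only', 'fund_id'].unique()

# Keep only the funds that are still active
active_returns = labelled[labelled['status'] == 'active']

print(unmatched)
print(active_returns)
```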

# Macro economic data

Now I'll collect macroeconomic data that influences fund performance (i.e. returns). To do so, I'll use an API developed by Brazil's Central Bank (BACEN) to collect historical data and projections for the following economic indicators:
- Risk-free rate: [SELIC](https://www.bcb.gov.br/controleinflacao/taxaselic)
- Inflation rate: Extended National Consumer Price Index [IPCA](https://www.ibge.gov.br/explica/inflacao.php) (*Índice Nacional de Preços ao Consumidor Amplo*)

**Why am I using SELIC as the risk-free rate? What about [CDI](https://borainvestir.b3.com.br/tipos-de-investimentos/taxa-do-cdi-o-que-e-como-impacta-seus-investimentos/) (Interbank Deposit Certificate)?**
- SELIC and CDI values shouldn't differ much. CDI is the rate financial institutions charge when lending money to each other with only 24 hours to pay back. This makes the operation extremely safe, given the short period and the size/type of borrower. Many financial assets use CDI as their benchmark, and it tracks SELIC very closely (usually sitting marginally below it). However, since we are focusing on **financial funds**, some of them invest in government bonds, which use SELIC.

**Where are you getting all this data?**
- I'll use the [python-bcb](https://pypi.org/project/python-bcb/) package, a Python interface to BACEN's open data. I'll get the time series data ([SGS](https://wilsonfreitas.github.io/python-bcb/sgs.html)) and currency values (currency).

**The BACEN (BCB) API is confusing! How do I get a time series code? For example, what is SELIC's ID?**
- Type the rate you want into the search bar, and BACEN will take you to a page listing every series ID available. For example, SELIC has more than 5 [IDs](https://www3.bcb.gov.br/sgspub/localizarseries/localizarSeries.do?method=prepararTelaLocalizarSeries) depending on how you want the data.

## Get historical data
IMPORTANT DISCLAIMER: SELIC (SERIES 432) IS AN ANNUAL RATE (a.a%)

IMPORTANT DISCLAIMER: IPCA (SERIES 433) IS A MONTHLY RATE (a.m%)

In [4]:
# Get risk-free and inflation historical data
current_rf = sgs.get({'selic': 432}, start='2019-12-01')
current_ipca = sgs.get({'IPCA': 433}, start='2019-12-01')

# Write csv
current_rf.to_csv('current_rf.csv')
current_ipca.to_csv('current_ipca.csv')
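
Because annual (a.a%) and monthly (a.m%) rates cannot be compared directly, one of them has to be converted to the other's period by compounding. The helpers below are a minimal sketch with hypothetical names; the 252 business days per year used for the daily conversion is an assumption based on the usual Brazilian market convention.

```python
def annual_to_monthly(annual_pct):
    """Convert an annual rate in % to the equivalent compounded monthly rate in %."""
    return ((1 + annual_pct / 100) ** (1 / 12) - 1) * 100

def monthly_to_annual(monthly_pct):
    """Convert a monthly rate in % to the equivalent compounded annual rate in %."""
    return ((1 + monthly_pct / 100) ** 12 - 1) * 100

def annual_to_daily(annual_pct, business_days=252):
    """Convert an annual rate in % to a daily rate in %.

    business_days=252 is an assumption (common Brazilian convention).
    """
    return ((1 + annual_pct / 100) ** (1 / business_days) - 1) * 100

# Illustrative values, not actual SELIC or IPCA readings
print(f'{annual_to_monthly(12.0):.4f}% a.m.')
print(f'{monthly_to_annual(1.0):.4f}% a.a.')
print(f'{annual_to_daily(12.0):.6f}% per business day')
```

Dividing the annual rate by 12 would ignore compounding, so the conversion uses the 12th root instead.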

# Market data
To collect market data, I'll use the Yahoo Finance API to get historical data on the US dollar (using the USD/BRL exchange rate) and the Brazilian stock market index (IBOVESPA).

**Why are you using only the dollar as the exchange rate?**
- The dollar is the reference currency for our economy. When it moves, other currencies such as the euro tend to move in the same direction (though not with the same intensity).

In [57]:
# Download dollar and IBOVESPA values based on their tickers
dolar = yfinance.download('BRL=X', start='2019-12-31', end='2024-07-31')
ibov = yfinance.download('^bvsp', start='2019-12-31', end='2024-07-31')

dolar.to_csv('dolar_mkt.csv')
ibov.to_csv('ibov_mkt.csv')

