# Brazilian Investment Funds Overview

This project main objective is to map investment fund market in Brazil. Investment funds can trade in different types of assets, like financial products (stocks, options, debentures), real state or buying part of other funds. Each fund raises capital by offering quotas, this can be an ongoing process (open fund) or only one time offer (closed fund).

The funds are created and managed by a financial institution (manager). Managers will also take care of all the paper work involving fund rules, which index should investors use as benchmark and general structure so a bank can issue the fund. The bank role is so ensure the fund is legit, has the paperwork in order and distribute the financial sheets so potential investor can buy a quota. This is a very efficient way to poll resources and diversify the risk for all parties involved.

Learn more in ANBIMA [PDF](https://www.anbima.com.br/data/files/D7/B6/AD/5E/369EC8104606BDC8B82BA2A8/CPA-10-Cap5.pdf).

Now that we know what is an investment fund, and how is its structure, let's be curious about it:
- How many funds are in the market?
- What are the main assets traded? Financial? Real State? Credit?
- Who are the big managers?
- Who are the main issuers?
- How was the market for the last years?

## Importing libraries

In [1]:
import pandas as pd # manipulating data
import numpy as np  # basic math operations
import matplotlib.pyplot as plt # graphs
import seaborn as sns   #graphs
import requests # request files on Brazilian Securities and Exchange Comission (CVM)
import zipfile  # unzip CVM files
import os   # manipulate disk files

## Downloading data
Investment Funds must be registered in Brazilian Securities and Exchange Comission (CVM). CVM has funds data on their daily returns, benchmark index, type of fund, issuer, manager and other informations related to it. All data is open to the public in CVM [website](https://dados.cvm.gov.br/group/fundos-de-investimento):
- Funds return: one database for daily, monthly, quarterly, anual
- Register info: funds name, manager, issuer, type, open/close
- Statement of Income: database with funds link to their state of income
- Investors profile: who owns funds quotas (other business, retirement funds, individual investors, professional investors)
- Performance metrics: how to calculate return, risk accordin to managers, collateral

For this study, I'm interested in the daily returns and register info. This way I can map funds by their features and follow their performance in time. The advantage to use daily data is that I can transform daily info into month, quarter and annual.

In [2]:
# Creating parameters to download data
## Date paramenters to match CVM files
years = ['2024','2023','2022','2021','2020']    # Creating a five year window so we can see the end of pandemic and current government
legacy = ['2020']   # Creating legacy list, CVM moves old data to another URL/directory

months = range(1,13)    # Crete month list from Jan(01) to Dec(12)
month_list = []     # List must be a string bc I'll add each emelento to a url request to CVM

for i in months:    # Transform each integer element into a string element
    if i<10:    # For months with only one digit, we need to add zero (0) before to match the csv file
        i = str('0'+str(i))
    else:   # Months with two digits only need to be converted to string
        i = str(i)
    month_list.append(i)    # Append each string to the month list

To collect the data, I'll request directly from CVM website the zipfiles containing the [daily returns of financial funds](https://dados.cvm.gov.br/dataset/fi-doc-inf_diario). The main issue here, is that we have two types of repositories: (1) daily data organized in monthly zip files from the current year to 3 years ago (Y-3), (2) yearly zip file for older data. So in my study 2024,2023,2022 and 2021 will have monthly zip files, while [2020](https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/HIST/) will have only one zip file with 12 csv files inside.

To solve it, I'll just need to create a loop to identify which year is considered old/legacy by CVM. Previously I created a legacy list with 2020 so I can use it in the loop now.

URL with daily return data
URL model: dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/**DADOS/inf_diario_fi_202307**.zip
- replace the date on the URL '202307' with the loop

URL for legacy data
URL legacy model: dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/**DADOS/HIST/inf_diario_fi_2000**.zip
- replace only the year 2000

In [3]:
cvm_daily_return = pd.DataFrame()   # Create an empty dataframe for daily returns
cvm_legacy_return = pd.DataFrame()  # Creawte an empty dataframe for legacy data

# Create loop to download data from 2020 to July 2024
## I'll collect data for each year in our years list
for yyyy in years:
    try:
        if yyyy in legacy:  # Y-3 data is considered history and moved to a different directory, I'll call it legacy
            daily_return_url = f'https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/HIST/inf_diario_fi_{yyyy}.zip'
            download_url = requests.get(daily_return_url)   # Access URL with zipfile, else would be just a string on our local machine
            zip_filename = f'inf_diario_fi_{yyyy}.zip'  # Save zipfile name, only asks for year to identify it. All 12 month csv are inside it.
            with open(zip_filename,'wb') as zip_ref:    # Open zipfile from CVM url. Using 'wb' because I need to Write ('w') the file in Binary('b')
                zip_ref.write(download_url.content)     # Write/save zipfile on local disk. I used '.content' to write/save the files withing the URL zip
            with zipfile.ZipFile(zip_filename, 'r') as cvm_zip: # Legacy Zip has 12 csv files, so I'll just read and concatenate all
                legacy_csv = [pd.read_csv(cvm_zip.open(f), sep=';') for f in cvm_zip.namelist()]
                cvm_legacy_return = pd.concat(legacy_csv)
            os.remove(zip_filename)  # Delete zipfile from disk so I can keep a clean directory
        else:
            for mm in month_list:   # Y-3< data is called by year and month. So we need to run all month_list elements
                daily_return_url = f'https://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_{yyyy}{mm}.zip'
                download_url = requests.get(daily_return_url)   # Access URL with zipfile
                zip_filename = f'inf_diario_fi_{yyyy}{mm}.zip'  # Save zipfile name, it's important to name it here bc I'll delete the zipfile after
                with open(zip_filename, 'wb') as zip_ref:   # Open zipfile from CVM url
                    zip_ref.write(download_url.content)     # Write/save zipfile on local disk. I used '.content' to write/save the files withing the URL zip
                with zipfile.ZipFile(zip_filename) as cvm_zip:  # Manipulate recent saved zipfile on our disk
                    for file_name in cvm_zip.namelist():    # Look for all files within zipfile
                        if file_name.endswith('.csv'):      # If a file ends with '.csv' I'll open and read it with pandas
                            with cvm_zip.open(file_name) as cvm_csv:    # I'll create a temporary obj, so we can concatenate it with our main dataframe
                                cvm_daily_return_temp = pd.read_csv(cvm_csv, sep=';')   # Read csv, but Brazilian data is separated with ';'
                                cvm_daily_return = pd.concat([cvm_daily_return, cvm_daily_return_temp]) # Concat main dataframe with temporary
                os.remove(zip_filename)  # Delete zipfile from disk so I can keep a clean directory
        cvm_daily_return = pd.concat([cvm_daily_return,cvm_legacy_return])  # Concatenate current and legacy data
    except:
        pass    # Avoid stopping the process in case it looks for current year and months yet to come (YYYYMM+1)

# cvm_daily_return    # Observe we get all monthly data from 2024 to 2000
# Delete unused variables to clean memory
del daily_return_url,zip_filename, legacy_csv,cvm_legacy_return, daily_return_url, download_url, cvm_daily_return_temp

In [21]:
# Transform 'DT_COMPTC' in datetime variable
cvm_daily_return['DT_COMPTC'] = pd.to_datetime(cvm_daily_return['DT_COMPTC'] , format='ISO8601')
# cvm_daily_return.info()

In [26]:
# To calculate the funds return, we need the first and last day of the month
cvm_daily_return['year_month'] = cvm_daily_return['DT_COMPTC'].dt.to_period('M')    # Support columns with year-month
# Create separate dataframes to first and last day
first_day  = cvm_daily_return.groupby(['year_month', 'CNPJ_FUNDO']).min().reset_index() # Sort by fund and date
last_day = cvm_daily_return.groupby(['year_month', 'CNPJ_FUNDO']).max().reset_index()

In [27]:
# Concat first and last day dataframes
cvm_return = pd.concat([first_day, last_day]).sort_values(by='DT_COMPTC').reset_index(drop=True)
# cvm_return.sort_values(['CNPJ_FUNDO','DT_COMPTC'])
# I'll also keep onlye data I need
cvm_return.drop(columns=['year_month','TP_FUNDO','CAPTC_DIA','RESG_DIA'], inplace=True)


In [29]:
# cvm_daily_return.groupby(['year_month', 'CNPJ_FUNDO']).min()
cvm_return['DT_COMPTC'].unique()

# Clean memory space
del cvm_daily_return, first_day, last_day

<DatetimeArray>
['2020-01-01 00:00:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00',
 '2020-01-06 00:00:00', '2020-01-07 00:00:00', '2020-01-08 00:00:00',
 '2020-01-09 00:00:00', '2020-01-10 00:00:00', '2020-01-13 00:00:00',
 '2020-01-14 00:00:00',
 ...
 '2024-07-22 00:00:00', '2024-07-23 00:00:00', '2024-07-24 00:00:00',
 '2024-07-25 00:00:00', '2024-07-26 00:00:00', '2024-07-29 00:00:00',
 '2024-07-30 00:00:00', '2024-07-31 00:00:00', '2024-08-01 00:00:00',
 '2024-08-02 00:00:00']
Length: 1166, dtype: datetime64[ns]

Python show not all funds have the first register at the first day of the month. Is this possible? Yes.
- A fund only register the value of its assets once it's approved by CVM. Which means our starting date isn't the same for all funds, will depend on when their documents were processed.

Is this bad for this analysis? No.
- Since I'm calculating the return (assets final value - assets start value), the number between those day won't bias the data. I can expect a small variation in value with smaller the window between those.

### Register data
The next step is to identify the funds. Investors know it by their name, not their register number. I'll also need to check their status, to see if the fund still active. Here I face a similar situation from return data, [current](https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi.csv) and [legacy](https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi_hist.zip) data.

In [4]:
# Open current register file
cvm_register = pd.read_csv('https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi.csv', sep=';', encoding='latin-1')

# Open legacy register file
## Issue: legacy has several csvs in one zip file
register_url = 'https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi_hist.zip'    # Find legacy url
download_register = requests.get(register_url)  # Reach url online
register_zip = 'cad_fi_hist.zip'    # Create an object with zipfile name so we can use it on next steps
with open(register_zip,'wb') as reg_ref:    # Open zipfile to write it on disk
    reg_ref.write(download_register.content)
with zipfile.ZipFile(register_zip,'r') as register_zip: # Read zip file
    reg_csv_lag = [pd.read_csv(register_zip.open(g), sep=';', encoding='latin-1') for g in register_zip.namelist()] # Read each csv and put in all together in one object as a list
    legacy_register = pd.concat(reg_csv_lag, axis=0, ignore_index=True) # Read all lists into one dataframe
os.remove('cad_fi_hist.zip')    # Delete legacy refister zip file from our disk

  cvm_register = pd.read_csv('https://dados.cvm.gov.br/dados/FI/CAD/DADOS/cad_fi.csv', sep=';', encoding='latin-1')


In [5]:
# Check if headers from current and legacy register dataframes are the same
## Print the shape of each dataframe to see if they are the same
if cvm_register.shape[1] > legacy_register.shape[1]:    # Analyze the shape(0_row, 1_col). We are interested in cols so shape[1]
    cur_lag = cvm_register.shape[1] - legacy_register.shape[1]
    print('Current register dataframe has',cur_lag, 'more columns')
else:
    lag_cur = legacy_register.shape[1] - cvm_register.shape[1] 
    print('Legacy register dataframe has',lag_cur,'more columns')
print('CVM register shape:',cvm_register.shape) # General idea of shape for current register dataframe
print('Legacy register shape:',legacy_register.shape)# General idea of shape for legacy register dataframe
print('\n')

# Find which columns are convergent and divergent between them
reg_cur_col = cvm_register.columns      # Get columns names for current register dataframe
reg_lag_col = legacy_register.columns   # Get columns names for legacy register dataframe

# Create an object for coluimns in common and columns in legacy and not current
common_cols = reg_cur_col.intersection(reg_lag_col)
cur_not_lag = reg_cur_col.difference(reg_lag_col)
lag_not_cur = reg_lag_col.difference(reg_cur_col)

print(common_cols.nunique(),'columns in COMMON between current and legacy:')
print(common_cols)
print('\n')
print(cur_not_lag.nunique(),'columns in CURRENT register dataframe and not in LEGACY:')
print(cur_not_lag)
print('\n')
print(lag_not_cur.nunique(),'columns in LEGACY register dataframe and not in CURRENT:')
print(lag_not_cur)

Legacy register dataframe has 23 more columns
CVM register shape: (79612, 41)
Legacy register shape: (1836914, 64)


29 columns in COMMON between current and legacy:
Index(['CNPJ_FUNDO', 'DENOM_SOCIAL', 'DT_REG', 'SIT', 'DT_INI_SIT',
       'DT_INI_EXERC', 'DT_FIM_EXERC', 'CLASSE', 'DT_INI_CLASSE',
       'RENTAB_FUNDO', 'CONDOM', 'FUNDO_COTAS', 'FUNDO_EXCLUSIVO',
       'TRIB_LPRAZO', 'PUBLICO_ALVO', 'TAXA_ADM', 'INF_TAXA_ADM', 'DIRETOR',
       'CNPJ_ADMIN', 'ADMIN', 'PF_PJ_GESTOR', 'CPF_CNPJ_GESTOR', 'GESTOR',
       'CNPJ_AUDITOR', 'AUDITOR', 'CNPJ_CUSTODIANTE', 'CUSTODIANTE',
       'CNPJ_CONTROLADOR', 'CONTROLADOR'],
      dtype='object')


12 columns in CURRENT register dataframe and not in LEGACY:
Index(['CD_CVM', 'CLASSE_ANBIMA', 'DT_CANCEL', 'DT_CONST', 'DT_INI_ATIV',
       'DT_PATRIM_LIQ', 'ENTID_INVEST', 'INF_TAXA_PERFM', 'INVEST_CEMPR_EXTER',
       'TAXA_PERFM', 'TP_FUNDO', 'VL_PATRIM_LIQ'],
      dtype='object')


35 columns in LEGACY register dataframe and not in CURRE

#### Comparing dataframes
On this step I noticed a difference between columns within current and legacy register dataframe, and why is that?
- According to CVM notes, they changed the infos required from funds along the years. This can happen due a law change or CVM don't see the need to ask for that information (e.g: legacy  'DT_INI_TAXA_ADM' has the date for when administration fee was charged)
- Some headers changed their name. Due to operational reasons, CVM data managers change the name of the columns to fit their system (e.g: legacy 'VL_TAXA_PERFM' and current 'TAXA_PERFM'. Both them have the performance rate for managers based on funds gains)
- New columns addition. With the law change, CVM may ask for new information (e.g: current 'CLASSE_ANBIMA')

Based on those difference, I'll work with the columns they have in common. In total, there are **29 in common columns** I can work with, but not all data in needed.

**Columns in common between current and legacy register:**

*'CNPJ_FUNDO', 'DENOM_SOCIAL', 'DT_REG', 'SIT', 'DT_INI_SIT',
'DT_INI_EXERC', 'DT_FIM_EXERC', 'CLASSE', 'DT_INI_CLASSE',
'RENTAB_FUNDO', 'CONDOM', 'FUNDO_COTAS', 'FUNDO_EXCLUSIVO',
'TRIB_LPRAZO', 'PUBLICO_ALVO', 'TAXA_ADM', 'INF_TAXA_ADM', 'DIRETOR',
'CNPJ_ADMIN', 'ADMIN', 'PF_PJ_GESTOR', 'CPF_CNPJ_GESTOR', 'GESTOR',
'CNPJ_AUDITOR', 'AUDITOR', 'CNPJ_CUSTODIANTE', 'CUSTODIANTE',
'CNPJ_CONTROLADOR', 'CONTROLADOR'*

**The data I'll need to identify investment funds are:**

*'CNPJ_FUNDO', 'DENOM_SOCIAL', 'DT_REG', 'SIT','CLASSE', 'DT_INI_CLASSE', 'CONDOM', 'FUNDO_COTAS', 'FUNDO_EXCLUSIVO','CPF_CNPJ_GESTOR', 'GESTOR','CNPJ_AUDITOR', 'AUDITOR', 'CNPJ_CUSTODIANTE', 'CUSTODIANTE',*

Based on CVMs [dictionary](https://dados.cvm.gov.br/dados/FI/CAD/META/meta_cad_fi.txt), I'll use the following cols:
| Column | Description|
| --- | ---|
|CNPJ_FUNDO| Investment fund register code |
|DENOM_SOCIAL| Investment fund name |
|DT_REG| Register date|
|SIT| Situation (Active, Deactive)|
|CLASSE| Type of assets|
|DT_INI_CLASSE| Date from when assets were purchased|
|CONDOM| Open/Close fund|
|FUNDO_COTAS| If fund has quotas or not|
|FUNDO_EXCLUSIVO| Exclusive fund|
|CPF_CNPJ_GESTOR| Manager register code|
|GESTOR| Manager name|
|CNPJ_AUDITOR| Audit firm register code|
|AUDITOR| Audit firm name|
|CNPJ_CUSTODIANTE| Issuer register code|
|CUSTODIANTE| Issuer name|

These columns will give me an idea of each fund structue. Those columns will tell me what type of assets each fund is working with, when they started trading those, who is the manager choosing the assets and who is issuing the quotas. And why does it matter?

I can analyze funds performance over the years and identify if a manager has better results than the others. Funds features (such as open/closed or exclusive/not) may indicate better performing funds due to private information or access to better assets. Issuer can hold a specific type of asset or only issue for a certain type of investor, and therefore have a different performance. By keeping audit firm data, I can point who are the big firms working with investment funds or a specific type of asset.

In [6]:
# Merging current and legacy register dataframes based on the columns I selected
main_cols = ['CNPJ_FUNDO', 'DENOM_SOCIAL', 'DT_REG', 'SIT','CLASSE', 'DT_INI_CLASSE', 'CONDOM', 'FUNDO_COTAS', 'FUNDO_EXCLUSIVO','CPF_CNPJ_GESTOR', 'GESTOR','CNPJ_AUDITOR', 'AUDITOR', 'CNPJ_CUSTODIANTE', 'CUSTODIANTE']
cvm_complete_reg = cvm_register[main_cols].copy()   # Copu current register dataframe data
pd.concat([cvm_register,legacy_register], join='inner')   # Concat based on columns in common
cvm_complete_reg.shape  # Analyze the object shape

(79612, 16)

In [7]:
# Check data type
print(cvm_complete_reg.info())  # .info() to see how variables are stored. Observe that date variables are stored as string
# Convert 'DT_REG' and 'DT_INI_CLASSE' in datetime variables
cvm_complete_reg['DT_REG'] = pd.to_datetime(cvm_complete_reg['DT_REG'], format='ISO8601')   # ISO8601 sets date format to Year/month/day
cvm_complete_reg['DT_INI_CLASSE'] = pd.to_datetime(cvm_complete_reg['DT_INI_CLASSE'], format='ISO8601')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79612 entries, 0 to 79611
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   CNPJ_FUNDO        79612 non-null  object
 1   DENOM_SOCIAL      79612 non-null  object
 2   DT_REG            79612 non-null  object
 3   SIT               79612 non-null  object
 4   CLASSE            66306 non-null  object
 5   DT_INI_CLASSE     66306 non-null  object
 6   RENTAB_FUNDO      50204 non-null  object
 7   CONDOM            65958 non-null  object
 8   FUNDO_COTAS       66312 non-null  object
 9   FUNDO_EXCLUSIVO   55845 non-null  object
 10  CPF_CNPJ_GESTOR   52466 non-null  object
 11  GESTOR            52466 non-null  object
 12  CNPJ_AUDITOR      51736 non-null  object
 13  AUDITOR           51736 non-null  object
 14  CNPJ_CUSTODIANTE  50913 non-null  object
 15  CUSTODIANTE       50913 non-null  object
dtypes: object(16)
memory usage: 9.7+ MB
None


In [8]:
# Check final result
print(cvm_complete_reg.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79612 entries, 0 to 79611
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   CNPJ_FUNDO        79612 non-null  object        
 1   DENOM_SOCIAL      79612 non-null  object        
 2   DT_REG            79612 non-null  datetime64[ns]
 3   SIT               79612 non-null  object        
 4   CLASSE            66306 non-null  object        
 5   DT_INI_CLASSE     66306 non-null  datetime64[ns]
 6   RENTAB_FUNDO      50204 non-null  object        
 7   CONDOM            65958 non-null  object        
 8   FUNDO_COTAS       66312 non-null  object        
 9   FUNDO_EXCLUSIVO   55845 non-null  object        
 10  CPF_CNPJ_GESTOR   52466 non-null  object        
 11  GESTOR            52466 non-null  object        
 12  CNPJ_AUDITOR      51736 non-null  object        
 13  AUDITOR           51736 non-null  object        
 14  CNPJ_CUSTODIANTE  5091

In [9]:
# Check to see if I got the current data right
cvm_complete_reg.sort_values(by=['DT_REG'], inplace=True,ascending=False)
cvm_complete_reg.head()

Unnamed: 0,CNPJ_FUNDO,DENOM_SOCIAL,DT_REG,SIT,CLASSE,DT_INI_CLASSE,RENTAB_FUNDO,CONDOM,FUNDO_COTAS,FUNDO_EXCLUSIVO,CPF_CNPJ_GESTOR,GESTOR,CNPJ_AUDITOR,AUDITOR,CNPJ_CUSTODIANTE,CUSTODIANTE
60935,56.251.883/0001-67,ENG CAPITAL FUNDO DE INVESTIMENTO FINANCEIRO R...,2024-08-03,FASE PRÉ-OPERACIONAL,Fundo de Renda Fixa,2024-08-03,Não se aplica,Aberto,N,N,09.630.188/0001-26,PLURAL INVESTIMENTOS GESTÃO DE RECURSOS LTDA.,49.928.567/0001-11,DELOITTE TOUCHE TOHMATSU AUDITORES INDEPENDENT...,45.246.410/0001-55,BANCO GENIAL S.A.
77380,56.248.822/0001-40,4UM FUNDO DE INVESTIMENTO EM PARTICIPAÇÕES EM ...,2024-08-03,FASE PRÉ-OPERACIONAL,FIP IE,2024-08-02,,Fechado,N,,03.983.856/0001-12,4UM GESTÃO DE RECURSOS LTDA.,57.755.217/0001-29,KPMG AUDITORES INDEPENDENTES LTDA.,39.669.186/0001-01,HEMERA DISTRIBUIDORA DE TITULOS E VALORES MOBI...
60933,56.237.607/0001-44,SANTANDER PB 160 FUNDO DE INVESTIMENTO FINANCE...,2024-08-02,FASE PRÉ-OPERACIONAL,Fundo Multimercado,2024-08-01,DI de um dia,Aberto,N,N,03.502.968/0001-04,SANTANDER DISTRIBUIDORA DE TÍTULOS E VALORES M...,,,62.318.407/0001-19,S3 CACEIS BRASIL DISTRIBUIDORA DE TITULOS E VA...
60932,56.237.501/0001-40,SANTANDER SAM 175 FUNDO DE INVESTIMENTO FINANC...,2024-08-02,FASE PRÉ-OPERACIONAL,Fundo Multimercado,2024-08-02,DI de um dia,Aberto,N,N,10.231.177/0001-52,SANTANDER BRASIL GESTÃO DE RECURSOS LTDA,,,62.318.407/0001-19,S3 CACEIS BRASIL DISTRIBUIDORA DE TITULOS E VA...
66489,56.237.113/0001-60,GOLDFISH FUNDO DE INVESTIMENTO EM DIREITOS CRE...,2024-08-02,FASE PRÉ-OPERACIONAL,FIDC,2024-08-02,,Fechado,N,,48.954.141/0001-70,REAG INSTITUCIONAL GESTÃO DE ATIVOS LTDA.,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,34.829.992/0001-86,REAG DISTRIBUIDORA DE TITULOS E VALORES MOBILI...


In [10]:
# Check to see if I got the legacy data right
cvm_complete_reg.tail()

               CNPJ_FUNDO                                       DENOM_SOCIAL  \
78444  42.468.488/0001-26                               BOZANO SIMONSEN FMIA   
78480  47.220.660/0001-41  BANERJ AÇÕES - FUNDO DE INVESTIMENTO EM COTAS ...   
61292  47.220.660/0001-41  BANERJ AÇÕES - FUNDO DE INVESTIMENTO EM COTAS ...   
78441  42.468.421/0001-91                                    F UNIBANCO FMIA   
78443  42.468.454/0001-31                          F CRESCINCO UNIBANCO FMIA   

          DT_REG        SIT CLASSE DT_INI_CLASSE RENTAB_FUNDO CONDOM  \
78444 1969-01-10  CANCELADA    NaN           NaT          NaN    NaN   
78480 1963-07-01  CANCELADA    NaN           NaT          NaN    NaN   
61292 1963-07-01  CANCELADA    NaN           NaT          NaN    NaN   
78441 1961-04-21  CANCELADA    NaN           NaT          NaN    NaN   
78443 1957-01-18  CANCELADA    NaN           NaT          NaN    NaN   

      FUNDO_COTAS FUNDO_EXCLUSIVO CPF_CNPJ_GESTOR GESTOR CNPJ_AUDITOR AUDITOR  \
78444

In [11]:
## For our tail data, we have really old data from 1950. I won't need data before 2020
cvm_complete_reg = cvm_complete_reg[cvm_complete_reg['DT_REG'].dt.year >= 2020]

In [12]:
# Check to see if I got the legacy data right
# cvm_complete_reg.tail()

Unnamed: 0,CNPJ_FUNDO,DENOM_SOCIAL,DT_REG,SIT,CLASSE,DT_INI_CLASSE,RENTAB_FUNDO,CONDOM,FUNDO_COTAS,FUNDO_EXCLUSIVO,CPF_CNPJ_GESTOR,GESTOR,CNPJ_AUDITOR,AUDITOR,CNPJ_CUSTODIANTE,CUSTODIANTE
63758,34.633.504/0001-60,KINEA INFRA V - FUNDO INCENTIVADO DE INVESTIME...,2020-01-03,CANCELADA,FIDC,2019-08-05,,Aberto,N,,08.604.187/0001-44,KINEA INVESTIMENTOS LTDA.,61.562.112/0001-20,PRICEWATERHOUSECOOPERS AUDITORES INDEPENDENTES...,60.701.190/0001-04,ITAU UNIBANCO S.A.
41471,35.557.596/0001-00,CGI P FUNDO DE INVESTIMENTO EM COTAS DE FUNDOS...,2020-01-03,CANCELADA,Fundo Multimercado,2020-01-02,,Aberto,S,N,13.344.438/0001-39,PACIFICO GESTÃO DE RECURSOS LTDA,57.755.217/0001-29,KPMG AUDITORES INDEPENDENTES LTDA.,42.272.526/0001-70,BNY MELLON BANCO S.A.
40642,34.658.753/0001-00,DAYCOVAL BOLSA AMERICANA USD BDR-AÇÕES FUNDO D...,2020-01-03,EM FUNCIONAMENTO NORMAL,Fundo de Ações,2019-08-08,Não se aplica,Aberto,N,N,72.027.832/0001-02,DAYCOVAL ASSET MANAGEMENT ADMINISTRACAO DE REC...,61.366.936/0001-25,ERNST & YOUNG AUDITORES INDEPENDENTES S/S LTDA.,62.232.889/0001-90,BANCO DAYCOVAL S.A.
41109,35.136.864/0001-10,EUQUEROINVESTIR CBI HEDGE FUNDO DE INVESTIMENT...,2020-01-02,CANCELADA,Fundo Multimercado,2019-09-11,DI de um dia,Aberto,N,N,,,,,,
72224,15.798.220/0001-80,ORCHID FUNDO DE INVESTIMENTO IMOBILIÁRIO,2020-01-02,EM FUNCIONAMENTO NORMAL,FII,2020-01-02,,,N,,23.863.529/0001-34,REAG ADMINISTRADORA DE RECURSOS LTDA.,15.454.120/0001-36,MGI ASSURANCE AUDITORES INDEPENDENTES SS,,


In [13]:
# Check for duplicates in register file
cvm_complete_reg.duplicated(['CNPJ_FUNDO']).value_counts()  # .duplicate() assings True to duplicated elements
## .value_counts() counts all row with True (duplicate) and False (unique) values
### This means I have 1304 duplicated values in my register dataframe, why?

False    24360
True      1304
Name: count, dtype: int64

In [14]:
duplicated_register = cvm_complete_reg[cvm_complete_reg.duplicated(subset='CNPJ_FUNDO', keep=False)]    # .duplicate() by default keeps the first entry, buy since I want to know who are the duplicated funds, I need to set the parameter to FALSE so I can identify them
duplicated_register.sort_values(by=['CNPJ_FUNDO'], inplace=True)
duplicated_register.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  duplicated_register.sort_values(by=['CNPJ_FUNDO'], inplace=True)


Unnamed: 0,CNPJ_FUNDO,DENOM_SOCIAL,DT_REG,SIT,CLASSE,DT_INI_CLASSE,RENTAB_FUNDO,CONDOM,FUNDO_COTAS,FUNDO_EXCLUSIVO,CPF_CNPJ_GESTOR,GESTOR,CNPJ_AUDITOR,AUDITOR,CNPJ_CUSTODIANTE,CUSTODIANTE
25362,17.031.327/0001-23,BRASFOR FUNDO DE INVESTIMENTO EM COTAS DE FUND...,2022-08-19,CANCELADA,Fundo Multimercado,2022-08-23,OUTROS,Fechado,N,N,09.204.714/0001-96,PETRA CAPITAL GESTÃO DE INVESTIMENTOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,11.758.741/0001-52,BANCO FINAXIS S.A.
62291,17.031.327/0001-23,BRASFOR FUNDO DE INVESTIMENTO EM COTAS DE FUND...,2023-05-09,EM FUNCIONAMENTO NORMAL,FICFIDC-NP,2023-05-09,,Fechado,S,,09.204.714/0001-96,PETRA CAPITAL GESTÃO DE INVESTIMENTOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,11.758.741/0001-52,BANCO FINAXIS S.A.
26392,17.545.987/0001-22,MURA FUNDO DE INVESTIMENTO EM COTAS DE FUNDOS ...,2020-05-06,CANCELADA,Fundo Multimercado,2020-05-06,OUTROS,Fechado,S,N,28.529.686/0001-21,WNT GESTORA DE RECURSOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,62.285.390/0001-40,SINGULARE CORRETORA DE TITULOS E VALORES MOBIL...
62336,17.545.987/0001-22,MURA FUNDO DE INVESTIMENTO EM COTAS DE FUNDOS ...,2023-06-12,EM FUNCIONAMENTO NORMAL,FICFIDC-NP,2023-06-12,,Fechado,S,,28.529.686/0001-21,WNT GESTORA DE RECURSOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,62.285.390/0001-40,SINGULARE CORRETORA DE TITULOS E VALORES MOBIL...
62359,18.151.682/0001-07,STONE FUNDO DE INVESTIMENTO EM COTAS DE FUNDO ...,2024-03-04,EM FUNCIONAMENTO NORMAL,FIC FIDC,2024-03-04,,Fechado,S,,09.121.454/0001-95,TERCON INVESTIMENTOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,62.285.390/0001-40,SINGULARE CORRETORA DE TITULOS E VALORES MOBIL...
27151,18.151.682/0001-07,STONE FUNDO DE INVESTIMENTO EM COTAS DE FUNDO ...,2020-04-02,CANCELADA,Fundo Multimercado,2020-04-02,OUTROS,Fechado,S,N,09.121.454/0001-95,TERCON INVESTIMENTOS LTDA,05.452.311/0001-05,CONFIANCE AUDITORES INDEPENDENTES,62.285.390/0001-40,SINGULARE CORRETORA DE TITULOS E VALORES MOBIL...
62472,20.209.230/0001-72,MBM FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓ...,2022-11-23,EM FUNCIONAMENTO NORMAL,FIDC-NP,2022-11-17,,Aberto,N,,16.707.841/0001-73,TYR GESTÃO DE RECURSOS LTDA.,03.156.926/0001-69,SÊNIOR AUDITORES INDEPENDENTES S/S,15.489.568/0001-95,INTRA INVESTIMENTOS DTVM LTDA
72351,20.209.230/0001-72,MBM FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓ...,2022-10-03,CANCELADA,FII,2022-10-03,,Fechado,N,,35.541.359/0001-50,INTRA BLACK INVESTIMENTOS GESTÃO DE RECURSOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,15.489.568/0001-95,INTRA INVESTIMENTOS DTVM LTDA
62474,20.250.623/0001-20,MA MÁQUINAS FUNDO DE INVESTIMENTO MULTIMERCADO...,2022-12-07,CANCELADA,FIDC,2014-03-06,,Fechado,N,,09.204.714/0001-96,PETRA CAPITAL GESTÃO DE INVESTIMENTOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,11.758.741/0001-52,BANCO FINAXIS S.A.
29156,20.250.623/0001-20,MA MÁQUINAS FUNDO DE INVESTIMENTO MULTIMERCADO...,2023-01-02,EM FUNCIONAMENTO NORMAL,Fundo Multimercado,2023-01-13,OUTROS,Fechado,N,N,09.204.714/0001-96,PETRA CAPITAL GESTÃO DE INVESTIMENTOS LTDA,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,11.758.741/0001-52,BANCO FINAXIS S.A.


My duplicate analysis show an interesting behavior. Usually, analysist drop/delete duplicates in their datasets. The reason is becaus duplicates usually are viewed as an error. However, when we see what is happening in our register dataframe, I see that funds are registered two times (at least): first when the fund is opened to business and the second when the fund closes its activities.

Based on this behavior, I can't drop duplicates because in this case it is a feature rather than an error. Let's see if there are duplicated values even if the fund still in business.

In [15]:
fund_active = duplicated_register[~duplicated_register['SIT'].str.contains('CANCELADA')]
fund_active = fund_active[fund_active.duplicated(subset='CNPJ_FUNDO', keep=False)]
fund_active.head(10)

Unnamed: 0,CNPJ_FUNDO,DENOM_SOCIAL,DT_REG,SIT,CLASSE,DT_INI_CLASSE,RENTAB_FUNDO,CONDOM,FUNDO_COTAS,FUNDO_EXCLUSIVO,CPF_CNPJ_GESTOR,GESTOR,CNPJ_AUDITOR,AUDITOR,CNPJ_CUSTODIANTE,CUSTODIANTE
75282,25.333.547/0001-30,JNC I FUNDO DE INVESTIMENTO EM PARTICIPAÇÕES -...,2023-12-12,EM FUNCIONAMENTO NORMAL,FIP Multi,2023-12-12,,Fechado,N,,23.863.529/0001-34,REAG ADMINISTRADORA DE RECURSOS LTDA.,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,34.829.992/0001-86,REAG DISTRIBUIDORA DE TITULOS E VALORES MOBILI...
75281,25.333.547/0001-30,JNC I FUNDO DE INVESTIMENTO EM PARTICIPAÇÕES -...,2023-12-12,EM ANÁLISE,FIP Multi,2023-12-12,,Fechado,N,,,,,,,
36543,30.102.340/0001-94,SANTANDER GO PREV GLOBAL EQUITY ESG REAIS 20 M...,2021-11-17,FASE PRÉ-OPERACIONAL,Fundo Multimercado,2021-11-17,Não se aplica,Aberto,S,N,87.376.109/0001-06,ZURICH SANTANDER BRASIL SEGUROS E PREVIDENCIA ...,61.562.112/0001-20,PRICEWATERHOUSECOOPERS AUDITORES INDEPENDENTES...,62.318.407/0001-19,S3 CACEIS BRASIL DISTRIBUIDORA DE TITULOS E VA...
36542,30.102.340/0001-94,SANTANDER GO PREV GLOBAL EQUITY ESG REAIS 20 M...,2021-11-17,FASE PRÉ-OPERACIONAL,Fundo Multimercado,2021-11-17,Não se aplica,Aberto,S,N,10.231.177/0001-52,SANTANDER BRASIL GESTÃO DE RECURSOS LTDA,61.562.112/0001-20,PRICEWATERHOUSECOOPERS AUDITORES INDEPENDENTES...,62.318.407/0001-19,S3 CACEIS BRASIL DISTRIBUIDORA DE TITULOS E VA...
63531,32.287.668/0001-58,ARKOS FUNDO DE INVESTIMENTO EM COTAS DE FUNDOS...,2023-12-29,EM FUNCIONAMENTO NORMAL,FIC FIDC,2023-12-29,,Fechado,S,,40.297.139/0001-63,H2 KAPITAL S.A.,07.037.795/0001-51,AUDIFACTOR AUDITORES INDEPENDENTES S/S LTDA,33.886.862/0001-12,"MASTER S/A CORRETORA DE CAMBIO, TITULOS E VALO..."
63532,32.287.668/0001-58,ARKOS FUNDO DE INVESTIMENTO EM COTAS DE FUNDOS...,2023-12-29,EM FUNCIONAMENTO NORMAL,FIC FIDC,2023-12-29,,Fechado,S,,40.297.139/0001-63,H2 KAPITAL S.A.,19.280.834/0001-26,NEXT AUDITORES INDEPENDENTES S/S LTDA.,33.886.862/0001-12,"MASTER S/A CORRETORA DE CAMBIO, TITULOS E VALO..."
63541,32.302.296/0001-91,FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓRIOS...,2021-05-24,EM FUNCIONAMENTO NORMAL,FIDC,2022-09-15,,Fechado,N,,17.254.708/0001-71,SOLIS INVESTIMENTOS LTDA,16.549.480/0001-84,RSM BRASIL AUDITORES INDEPENDENTES LTDA.,39.669.186/0001-01,HEMERA DISTRIBUIDORA DE TITULOS E VALORES MOBI...
63542,32.302.296/0001-91,FUNDO DE INVESTIMENTO EM DIREITOS CREDITÓRIOS...,2021-05-24,EM FUNCIONAMENTO NORMAL,FIDC,2022-09-15,,Fechado,N,,48.089.509/0001-89,EXT CAPITAL LTDA.,16.549.480/0001-84,RSM BRASIL AUDITORES INDEPENDENTES LTDA.,39.669.186/0001-01,HEMERA DISTRIBUIDORA DE TITULOS E VALORES MOBI...
39134,32.891.432/0001-26,ZEVER II PREVIDENCIÁRIO FUNDO DE INVESTIMENTO ...,2020-04-07,EM FUNCIONAMENTO NORMAL,Fundo Multimercado,2020-04-06,DI de um dia,Aberto,N,S,09.262.533/0001-16,JGP GESTÃO PATRIMONIAL LTDA,57.755.217/0001-29,KPMG AUDITORES INDEPENDENTES LTDA.,42.272.526/0001-70,BNY MELLON BANCO S.A.
39135,32.891.432/0001-26,ZEVER II PREVIDENCIÁRIO FUNDO DE INVESTIMENTO ...,2020-04-07,EM FUNCIONAMENTO NORMAL,Fundo Multimercado,2020-04-06,DI de um dia,Aberto,N,S,15.289.957/0001-77,XP ADVISORY GESTAO DE RECURSOS LTDA,57.755.217/0001-29,KPMG AUDITORES INDEPENDENTES LTDA.,42.272.526/0001-70,BNY MELLON BANCO S.A.


This results show that a fund can be registered twice even if it still active. However, the condition of how the funding is running might be different. A fund can still be operational, but in a specific state, such as in procress of registering, on hold for analysis, or even starting the process of liquidation (selling assets to end the fund).

Based on these condition, I need to know what can I expect to see withing the status ('SIT') feature.

In [16]:
# Find unique labels for each situation that are not canceled (CANCELADA)
fund_active['SIT'].unique()
# Delete variables to clean memory space
del fund_active,duplicated_register

array(['EM FUNCIONAMENTO NORMAL', 'EM ANÁLISE', 'FASE PRÉ-OPERACIONAL',
       'LIQUIDAÇÃO'], dtype=object)

Status feature ('SIT') has 4 unique labels:
- 'EM ANÁLISE': CVM is analyzing the documents
- 'EM FUNCIONAMENTO NORMAL': operating business as usual
- 'FASE PRÉ-OPERACIONAL': 1 phase before opening to business
- 'LIQUIDAÇÃO': in process of selling assets to end the fund

These classes are important to tell us the story of each fund. So a fund can be registered twice even if it's not stopped its activities.

But status ('SIT') is not the only feature that can explain duplicated registers. Look for example what happend with fund 32.891.432/0001-26. All its features are the same, except for a change Manager ('GESTOR'), which means the fund at some point in time changed institutions managing their assets and had to issue a new registration.

### Merging datasets: daily returns + register data
Now it's time to merge our dataframes, return base and register data. Due to my duplicate analysis, I need to pay attention to how to merge data. For instance, a fund can have two different managers institution based on the date.

So my keys to merge the data will be the funds registration number ('CNPJ_FUNDO') and the register data ('DT_REG'). Naturally, another issue will appear: return data is in a daily frequency and register data is ponctual. This means I'll have a series of empty rows for the features coming from register dataframe. I'll fix it by coping each fund previous row, this means: since register occurs only when the fund is created or has some change, it's fair to use the data on the register/change day to further dates until the fund dissapears from our database (meaning they are ceased their operations).

In [19]:
# Make the keys the same
# cvm_complete_reg.info()
cvm_complete_reg.rename(columns={'DT_REG':'DT_COMPTC'}, inplace=True)

In [None]:
funds_df.sort_values(by='CNPJ_FUNDO', axis=1)

Unnamed: 0,TP_FUNDO,CNPJ_FUNDO,DT_COMPTC,VL_TOTAL,VL_QUOTA,VL_PATRIM_LIQ,CAPTC_DIA,RESG_DIA,NR_COTST,DENOM_SOCIAL,...,RENTAB_FUNDO,CONDOM,FUNDO_COTAS,FUNDO_EXCLUSIVO,CPF_CNPJ_GESTOR,GESTOR,CNPJ_AUDITOR,AUDITOR,CNPJ_CUSTODIANTE,CUSTODIANTE
0,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
1,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
2,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
3,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
4,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25828883,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
25828884,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
25828885,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
25828886,True,True,True,True,True,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
