# Introduction: Preprocessing for Feature Engineering on Financial Assets

Preprocessing financial data is a fundamental step in preparing datasets for analysis and modeling. A crucial part of that process is feature engineering, in which new columns are created to better represent the nuances and patterns in the data.

Among the columns created here, the most important are those related to returns, price ranges, and performance indicators. Analyzing the benchmark's cumulative return profile alongside the asset's daily returns and intraday range gives insight into the historical behavior of financial assets. Adding the Relative Strength Index (RSI) and its day-over-day change provides a momentum signal that can help capture trends and reversals in the market.

In addition, lagged (time-step) features give a dynamic perspective, considering not only the current values but also how they changed over specific periods. These columns enrich the dataset and provide a solid basis for time-series analysis and predictive modeling.

In finance, feature quality is vital: it directly affects a machine learning model's ability to identify patterns and make accurate predictions. Careful column creation during preprocessing is therefore a crucial step in the search for meaningful insights in financial markets.

In [1]:
# Libraries
import numpy as np
import pandas as pd
from ta.momentum import RSIIndicator

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import nbformat

import yfinance as yf
import datetime

import os

# Collecting the data

In [2]:
# Data extraction parameters
start_date = "2017-01-01"
end_date = datetime.datetime.now().strftime("%Y-%m-%d")
symbol = "BTC-USD"

start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")

In [3]:

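class Preprocessing:
    """Download OHLCV data and derive basic return columns."""

    def __init__(self, symbol, start_date, end_date):
        self.df = self._extract_data(symbol, start_date, end_date)
        self._structure_df()
        self._calculate_benchmark_returns()

    def _extract_data(self, symbol, start_date, end_date):
        data = yf.download(symbol, start=start_date, end=end_date)
        # Copy the slice so the column assignments below do not hit a view
        data = data[["Open", "High", "Low", "Close", "Volume"]].copy()
        return data

    def _structure_df(self):
        # Daily simple return of the close
        self.df["Returns"] = self.df["Close"].pct_change()
        # Intraday high/low range as a fraction of the low
        self.df["Range"] = self.df["High"] / self.df["Low"] - 1
        # Cumulative compounded return since the first downloaded close
        self.df["Equity Curve"] = np.cumprod(1 + self.df["Returns"]) - 1
        # Drops the first row, whose return is undefined
        self.df.dropna(inplace=True)

    def _calculate_benchmark_returns(self):
        # Buy-and-hold cumulative return; pct_change() runs after dropna(),
        # so the first remaining row is NaN
        self.df["Bench_C_Rets"] = np.cumprod(1 + self.df["Close"].pct_change()) - 1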


In [4]:
extraction = Preprocessing(symbol, start_date, end_date)
df = extraction.df
df

[*********************100%%**********************]  1 of 1 completed


Date,Open,High,Low,Close,Volume,Returns,Range,Equity Curve,Bench_C_Rets
2017-01-02,998.617004,1031.390015,996.702026,1021.750000,222184992,0.023464,0.034803,0.023464,
2017-01-03,1021.599976,1044.079956,1021.599976,1043.839966,185168000,0.021620,0.022005,0.045591,0.021620
2017-01-04,1044.400024,1159.420044,1044.400024,1154.729980,344945984,0.106233,0.110130,0.156667,0.130149
2017-01-05,1156.729980,1191.099976,910.416992,1013.380005,510199008,-0.122410,0.308302,0.015080,-0.008192
2017-01-06,1014.239990,1046.810059,883.943970,902.200989,351876000,-0.109711,0.184249,-0.096285,-0.117004
...,...,...,...,...,...,...,...,...,...
2023-11-08,35419.476562,35994.417969,35147.800781,35655.277344,17295394918,0.005973,0.024087,34.715100,33.896283
2023-11-09,35633.632812,37926.257812,35592.101562,36693.125000,37762672382,0.029108,0.065581,35.754689,34.912038
2023-11-10,36702.250000,37493.800781,36362.753906,37313.968750,22711265155,0.016920,0.031105,36.376574,35.519666
2023-11-11,37310.070312,37407.093750,36773.667969,37138.050781,13924272142,-0.004715,0.017225,36.200361,35.347493
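
The `Equity Curve` and `Bench_C_Rets` columns above differ because they are anchored to different starting prices: the first is computed before the initial row is dropped, the second afterward. The following is a minimal sketch of that effect, assuming made-up prices; the `prices` values and the variable names are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
prices = pd.Series([100.0, 110.0, 121.0, 108.9])

returns = prices.pct_change()
# Cumulative return anchored to the first price (the leading NaN is skipped)
equity_curve = (1 + returns).cumprod() - 1

# Drop the first row, then recompute returns: the anchor moves to the second price
trimmed = prices[returns.notna()]
bench = (1 + trimmed.pct_change()).cumprod() - 1

comparison = pd.DataFrame({"equity_curve": equity_curve, "bench": bench}).dropna()
print(comparison)
```

In this sketch the two series are linked by the first day's return: `1 + equity_curve` equals `(1 + bench)` multiplied by one plus that first return.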


In [5]:
# Create interactive charts with Plotly
fig = make_subplots(rows=3, cols=1, subplot_titles=["Benchmark Returns Profile", "Returns Profile", "Range Profile"])

# Chart 1: Benchmark Returns Profile
fig.add_trace(go.Scatter(x=df.index, y=df["Bench_C_Rets"], mode="lines", name="Benchmark"),
              row=1, col=1)

# Chart 2: Returns Profile
fig.add_trace(go.Scatter(x=df.index, y=df["Returns"], mode="lines", name="Returns"),
              row=2, col=1)

# Chart 3: Range Profile
fig.add_trace(go.Scatter(x=df.index, y=df["Range"], mode="lines", name="Range"),
              row=3, col=1)

# Layout settings
fig.update_layout(title_text="Returns and Range Profile",
                  showlegend=True,
                  xaxis=dict(title="Date"),
                  yaxis=dict(title="Value"),
                  height=800)

# Place the legend horizontally below the charts
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=-0.1,
    xanchor="left",
    x=0
))

# Show the interactive chart
fig.show()

# Feature Engineering

In [6]:
# Create a copy
df_fe = df.copy()

In [7]:
# Add RSI (14-day window)
rsi = RSIIndicator(close=df_fe["Close"], window=14).rsi()
df_fe["RSI"] = rsi
# Day-over-day RSI ratio (values near 1 mean little change)
df_fe["RSI_Ret"] = df_fe["RSI"] / df_fe["RSI"].shift(1)
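
For readers who want to see what an RSI calculation involves, here is a rough sketch of a Wilder-style RSI in plain pandas. It is an assumption about the general technique, not a reproduction of the `ta` library's internals, and the function name `wilder_rsi` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def wilder_rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Illustrative RSI using exponential smoothing with alpha = 1 / window."""
    delta = close.diff()
    # Split price changes into gains and losses (both non-negative)
    gains = delta.where(delta > 0, 0.0)
    losses = (-delta).where(delta < 0, 0.0)
    avg_gain = gains.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = losses.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss
    # RSI is bounded between 0 and 100
    return 100 - 100 / (1 + rs)

# Hypothetical random-walk prices, for illustration only
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(0, 1, size=200).cumsum())
rsi = wilder_rsi(close)
print(rsi.describe())
```

Because the indicator is bounded, the ratio stored in `RSI_Ret` can swing sharply when the previous RSI value is small.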

In [8]:
# Add moving averages
df_fe["MA_12"] = df_fe["Close"].rolling(window=12).mean()
df_fe["MA_21"] = df_fe["Close"].rolling(window=21).mean()

In [9]:
# Add day of the week (0 = Monday, 6 = Sunday)
df_fe["DOW"] = df_fe.index.dayofweek

In [10]:
# Rolling 30-day sum of daily returns
df_fe["Roll_Rets"] = df_fe["Returns"].rolling(window=30).sum()
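
A rolling sum of simple returns only approximates the compounded return over the window. As a hedged illustration, the sketch below compares the two on synthetic returns; the data, the variable names, and the log-return approach are assumptions added here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily simple returns, for illustration only
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.002, 0.04, size=120))

window = 30
# Same calculation as Roll_Rets: a plain sum of simple returns
rolling_sum = returns.rolling(window=window).sum()
# Compounded return over the window, via a rolling sum of log returns
rolling_compound = np.expm1(np.log1p(returns).rolling(window=window).sum())

gap = (rolling_sum - rolling_compound).abs().max()
print(f"Largest absolute gap over the sample: {gap:.4f}")
```

Whether the gap matters depends on the volatility of the asset and on how the feature is used downstream.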

In [11]:
# 30-day rolling average of Range
df_fe["Avg_Range"] = df_fe["Range"].rolling(window=30).mean()

In [12]:
# Add lagged features (time steps)
t_steps = [1, 2]
t_features = ["Returns", "Range", "RSI_Ret"]
for ts in t_steps:
    for tf in t_features:
        df_fe[f"{tf}_T{ts}"] = df_fe[tf].shift(ts)
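
To make the effect of `shift` concrete, here is a small sketch on a toy frame. The values and the `toy` name are hypothetical; only the `shift` pattern mirrors the loop above.

```python
import pandas as pd

# Hypothetical values on consecutive days, for illustration only
toy = pd.DataFrame(
    {"Returns": [0.01, -0.02, 0.03, 0.00]},
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

for step in (1, 2):
    # shift(step) moves each value forward in time by `step` rows
    toy[f"Returns_T{step}"] = toy["Returns"].shift(step)

print(toy)
```

Each row then carries only values from earlier dates in its lagged columns, and the first `step` rows are left as NaN.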

In [13]:
df_fe

Date,Open,High,Low,Close,Volume,Returns,Range,Equity Curve,Bench_C_Rets,RSI,...,MA_21,DOW,Roll_Rets,Avg_Range,Returns_T1,Range_T1,RSI_Ret_T1,Returns_T2,Range_T2,RSI_Ret_T2
2017-01-02,998.617004,1031.390015,996.702026,1021.750000,222184992,0.023464,0.034803,0.023464,,,...,,0,,,,,,,,
2017-01-03,1021.599976,1044.079956,1021.599976,1043.839966,185168000,0.021620,0.022005,0.045591,0.021620,,...,,1,,,0.023464,0.034803,,,,
2017-01-04,1044.400024,1159.420044,1044.400024,1154.729980,344945984,0.106233,0.110130,0.156667,0.130149,,...,,2,,,0.021620,0.022005,,0.023464,0.034803,
2017-01-05,1156.729980,1191.099976,910.416992,1013.380005,510199008,-0.122410,0.308302,0.015080,-0.008192,,...,,3,,,0.106233,0.110130,,0.021620,0.022005,
2017-01-06,1014.239990,1046.810059,883.943970,902.200989,351876000,-0.109711,0.184249,-0.096285,-0.117004,,...,,4,,,-0.122410,0.308302,,0.106233,0.110130,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-11-08,35419.476562,35994.417969,35147.800781,35655.277344,17295394918,0.005973,0.024087,34.715100,33.896283,79.056775,...,33668.851283,2,0.264974,0.032208,0.011593,0.038980,1.023620,-0.000342,0.014977,0.997738
2023-11-09,35633.632812,37926.257812,35592.101562,36693.125000,37762672382,0.029108,0.065581,35.754689,34.912038,82.787702,...,34048.533110,3,0.301067,0.033888,0.005973,0.024087,1.011473,0.011593,0.038980,1.023620
2023-11-10,36702.250000,37493.800781,36362.753906,37313.968750,22711265155,0.016920,0.031105,36.376574,35.519666,84.559697,...,34411.914993,4,0.336887,0.033779,0.029108,0.065581,1.047193,0.005973,0.024087,1.011473
2023-11-11,37310.070312,37407.093750,36773.667969,37138.050781,13924272142,-0.004715,0.017225,36.200361,35.347493,81.984170,...,34755.707310,5,0.336508,0.033898,0.016920,0.031105,1.021404,0.029108,0.065581,1.047193


# Feature Engineering - Feature Scaling

In [14]:
# Adjust for stationarity (convert Open, High, Low and Volume to percentage changes)
df_fs = df_fe.copy()
df_fs[["Open", "High", "Low", "Volume"]] = df_fs[["Open", "High", "Low", "Volume"]].pct_change()
df_fs

Date,Open,High,Low,Close,Volume,Returns,Range,Equity Curve,Bench_C_Rets,RSI,...,MA_21,DOW,Roll_Rets,Avg_Range,Returns_T1,Range_T1,RSI_Ret_T1,Returns_T2,Range_T2,RSI_Ret_T2
2017-01-02,,,,1021.750000,,0.023464,0.034803,0.023464,,,...,,0,,,,,,,,
2017-01-03,0.023015,0.012304,0.024980,1043.839966,-0.166604,0.021620,0.022005,0.045591,0.021620,,...,,1,,,0.023464,0.034803,,,,
2017-01-04,0.022318,0.110471,0.022318,1154.729980,0.862881,0.106233,0.110130,0.156667,0.130149,,...,,2,,,0.021620,0.022005,,0.023464,0.034803,
2017-01-05,0.107555,0.027324,-0.128287,1013.380005,0.479069,-0.122410,0.308302,0.015080,-0.008192,,...,,3,,,0.106233,0.110130,,0.021620,0.022005,
2017-01-06,-0.123183,-0.121140,-0.029078,902.200989,-0.310316,-0.109711,0.184249,-0.096285,-0.117004,,...,,4,,,-0.122410,0.308302,,0.106233,0.110130,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-11-08,0.010605,0.002842,0.017426,35655.277344,-0.081729,0.005973,0.024087,34.715100,33.896283,79.056775,...,33668.851283,2,0.264974,0.032208,0.011593,0.038980,1.023620,-0.000342,0.014977,0.997738
2023-11-09,0.006046,0.053671,0.012641,36693.125000,1.183395,0.029108,0.065581,35.754689,34.912038,82.787702,...,34048.533110,3,0.301067,0.033888,0.005973,0.024087,1.011473,0.011593,0.038980,1.023620
2023-11-10,0.029989,-0.011403,0.021652,37313.968750,-0.398579,0.016920,0.031105,36.376574,35.519666,84.559697,...,34411.914993,4,0.336887,0.033779,0.029108,0.065581,1.047193,0.005973,0.024087,1.011473
2023-11-11,0.016561,-0.002313,0.011300,37138.050781,-0.386900,-0.004715,0.017225,36.200361,35.347493,81.984170,...,34755.707310,5,0.336508,0.033898,0.016920,0.031105,1.021404,0.029108,0.065581,1.047193


# Making the data usable by Machine Learning models

In [15]:
# Drop rows with missing values, then confirm that none remain
df_fs.dropna(inplace=True)
print(df_fs.isnull().values.any())

False


In [16]:
# Check for infinite values
dfobj = df_fs.isin([np.inf, -np.inf])  # Boolean mask of infinite entries
count = dfobj.values.sum()  # Sum the mask directly; np.isinf on booleans is always False
count

0
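
Infinite values are worth checking for because `pct_change` divides by the previous value, so a zero in the source series produces `inf`. The sketch below shows this on a made-up volume series; the data and the replace-then-drop treatment are illustrative assumptions, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical volume series containing a zero, for illustration only
volume = pd.Series([100.0, 0.0, 50.0, 75.0])
change = volume.pct_change()  # 50 / 0 - 1 evaluates to inf

n_inf = int(np.isinf(change).sum())
print(n_inf)

# One possible treatment: mark infinities as missing, then drop them
cleaned = change.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned)
```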

In [17]:
# Check for non-numeric (object) columns
df_fs.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2477 entries, 2017-01-31 to 2023-11-12
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          2477 non-null   float64
 1   High          2477 non-null   float64
 2   Low           2477 non-null   float64
 3   Close         2477 non-null   float64
 4   Volume        2477 non-null   float64
 5   Returns       2477 non-null   float64
 6   Range         2477 non-null   float64
 7   Equity Curve  2477 non-null   float64
 8   Bench_C_Rets  2477 non-null   float64
 9   RSI           2477 non-null   float64
 10  RSI_Ret       2477 non-null   float64
 11  MA_12         2477 non-null   float64
 12  MA_21         2477 non-null   float64
 13  DOW           2477 non-null   int32  
 14  Roll_Rets     2477 non-null   float64
 15  Avg_Range     2477 non-null   float64
 16  Returns_T1    2477 non-null   float64
 17  Range_T1      2477 non-null   float64
 18  RSI_Ret_T1

In [18]:
# Summary statistics
df_fs.describe()

Statistic,Open,High,Low,Close,Volume,Returns,Range,Equity Curve,Bench_C_Rets,RSI,...,MA_21,DOW,Roll_Rets,Avg_Range,Returns_T1,Range_T1,RSI_Ret_T1,Returns_T2,Range_T2,RSI_Ret_T2
count,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,...,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0,2477.0
mean,0.002247,0.002069,0.00226,19022.872149,0.039373,0.002247,0.050254,18.054789,17.617932,52.917075,...,18882.348302,3.001211,0.066242,0.050288,0.002249,0.05025,1.005496,0.00225,0.050245,1.005505
std,0.038711,0.034131,0.038804,15989.658405,0.310731,0.038716,0.042986,16.016486,15.649286,14.496038,...,15901.411628,1.999899,0.236674,0.024088,0.038716,0.04299,0.105744,0.038715,0.042994,0.105742
min,-0.365924,-0.263712,-0.364062,937.52002,-0.869188,-0.371695,0.003596,-0.060907,-0.082437,9.920239,...,885.613522,0.0,-0.844981,0.015756,-0.371695,0.003596,0.453659,-0.371695,0.003596,0.453659
25%,-0.013938,-0.011876,-0.011208,6865.493164,-0.132732,-0.014226,0.023351,5.877012,5.719347,42.810547,...,6830.622373,1.0,-0.082169,0.033344,-0.014226,0.023351,0.954779,-0.014226,0.023351,0.954779
50%,0.001248,-0.000246,0.002731,11246.348633,-0.006773,0.001285,0.038768,10.265218,10.006948,51.655496,...,10898.574777,3.0,0.037724,0.045027,0.001285,0.038768,1.003914,0.001285,0.038768,1.003914
75%,0.018574,0.014499,0.017035,28904.623047,0.159775,0.018269,0.061375,27.953119,27.28933,62.63232,...,28507.946987,5.0,0.218027,0.059741,0.018269,0.061375,1.046938,0.018269,0.061375,1.046938
max,0.250461,0.245708,0.247892,67566.828125,5.439003,0.252472,0.631387,66.680192,65.128533,94.302215,...,63016.876488,6.0,1.154731,0.138628,0.252472,0.631387,1.926078,0.252472,0.631387,1.926078


# Saving the data

In [19]:
# Directory where the CSV file will be saved
output_directory = "data"

try:
    # Create the directory if it does not exist
    os.makedirs(output_directory, exist_ok=True)

    # Save the DataFrame
    df_fs.to_csv(f"{output_directory}/{symbol}.csv")

    print(f"DataFrame saved to: {output_directory}/{symbol}.csv")
except Exception as e:
    print(f"Error saving the DataFrame: {e}")

DataFrame saved to: data/BTC-USD.csv
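
When the saved file is read back later, the date index has to be restored explicitly. The following is a minimal round-trip sketch, assuming a small stand-in frame and a temporary directory; the `frame` contents and the `example.csv` name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for df_fs, for illustration only
frame = pd.DataFrame(
    {"Close": [100.0, 101.5], "DOW": [0, 1]},
    index=pd.DatetimeIndex(["2023-01-02", "2023-01-03"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    frame.to_csv(path)
    # index_col and parse_dates restore the DatetimeIndex on load
    reloaded = pd.read_csv(path, index_col="Date", parse_dates=True)

print(reloaded.index.dtype)
```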


# Final considerations

Preprocessing is not merely a technical step; it is a strategy for uncovering patterns and building solid predictive models. Careful feature selection and a deep understanding of the financial domain are both crucial. When creating features, quality matters more than quantity, and that is fundamental to making informed decisions in such a dynamic market.