# PETR4.SA Trend Prediction Using Machine Learning

## Objective
Develop a machine learning model to predict the trend (↑ or ↓) of the PETR4.SA financial time series, with a minimum accuracy of 75% on the test set (the last 30 trading days).
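
To put the 75% target in perspective, the sketch below works out how many of the 30 test predictions must be correct and how likely a coin-flip predictor would be to reach that count by luck. It assumes independent days with a 50% chance for each direction, which is a simplification and not something the notebook establishes; the function names are illustrative.

```python
from math import ceil, comb

def hits_needed(n_days: int, target_accuracy: float) -> int:
    """Smallest number of correct predictions that reaches the target accuracy."""
    return ceil(n_days * target_accuracy)

def chance_probability(n_days: int, min_hits: int, p: float = 0.5) -> float:
    """P(at least min_hits correct out of n_days) for a random guesser with hit rate p."""
    return sum(
        comb(n_days, k) * p**k * (1 - p) ** (n_days - k)
        for k in range(min_hits, n_days + 1)
    )

needed = hits_needed(30, 0.75)           # 23 of 30
p_luck = chance_probability(30, needed)  # binomial upper tail
print(f"Hits needed: {needed}, probability by pure chance: {p_luck:.4f}")
```

With only 30 test points, each additional hit or miss moves the accuracy by about 3.3 percentage points, so results near the threshold should be read with caution.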

## Project Outline
1. **Library Imports**
2. **Data Loading and Exploration**
3. **Data Preprocessing**
4. **Feature Engineering**
5. **Train/Test Split**
6. **Training Multiple Models**
7. **Model Evaluation**
8. **Predictions and Visualization of Results**

---

## 1. Library Imports

In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Financial data download
import yfinance as yf

# Machine learning libraries
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline

# Technical analysis library (optional: manual fallbacks are used when it is missing)
try:
    import talib
except ImportError:
    talib = None

# Settings
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

## 2. Data Loading and Exploration

We load historical PETR4.SA data with the yfinance library and explore its structure.
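
Before exploring downloaded prices, a few structural checks can catch common problems early. The sketch below is one possible set of checks, run on hypothetical sample rows rather than real PETR4.SA data; `check_ohlcv` and the column names `Open`, `High`, `Low`, `Close`, `Volume` are assumptions about the frame's layout.

```python
import pandas as pd

def check_ohlcv(frame: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in an OHLCV frame."""
    problems = []
    if not frame.index.is_monotonic_increasing:
        problems.append("index is not sorted by date")
    if frame.index.has_duplicates:
        problems.append("index contains duplicate dates")
    if (frame["High"] < frame["Low"]).any():
        problems.append("some rows have High below Low")
    if frame.isnull().any().any():
        problems.append("missing values present")
    return problems

# Hypothetical sample rows standing in for downloaded data
sample = pd.DataFrame(
    {"Open": [10.0, 10.5, 10.05], "High": [10.8, 10.9, 10.1],
     "Low": [9.9, 10.3, 10.0], "Close": [10.5, 10.4, 10.05],
     "Volume": [1000, 1200, 900]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
clean_report = check_ohlcv(sample)

# Corrupt one value to show what a failing check looks like
broken = sample.copy()
broken.loc[broken.index[0], "High"] = 9.0
broken_report = check_ohlcv(broken)
print(clean_report, broken_report)
```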

In [None]:
# Define the analysis period (2 years of historical data)
end_date = datetime.now()
start_date = end_date - timedelta(days=730)  # ~2 years

# Load PETR4.SA data
ticker = "PETR4.SA"
print(f"Loading {ticker} data from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

data = yf.download(ticker, start=start_date, end=end_date, progress=False)

# Recent yfinance versions may return MultiIndex columns (field, ticker); keep only the field names
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)

print(f"\n📊 Data loaded: {len(data)} records")
print(f"Period: {data.index[0].strftime('%Y-%m-%d')} to {data.index[-1].strftime('%Y-%m-%d')}")

# Show the first and last rows
print("\n🔍 First 5 rows:")
display(data.head())

print("\n🔍 Last 5 rows:")
display(data.tail())

# Basic information about the data (info() prints directly and returns None)
print("\n📈 Data info:")
data.info()

print("\n📊 Descriptive statistics:")
display(data.describe())

In [None]:
# Check for missing values
print("🔍 Checking for missing values:")
missing_values = data.isnull().sum()
print(missing_values)

if missing_values.sum() > 0:
    print("\n⚠️ Missing values found. They will need to be handled.")
else:
    print("\n✅ No missing values found!")

# Initial price visualization
fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=('Closing Price', 'Volume', 'High-Low', 'Close-Open'),
                    vertical_spacing=0.08)

# Closing price
fig.add_trace(go.Scatter(x=data.index, y=data['Close'], name='Close', line=dict(color='blue')), row=1, col=1)

# Volume
fig.add_trace(go.Scatter(x=data.index, y=data['Volume'], name='Volume', line=dict(color='orange')), row=1, col=2)

# High-Low
fig.add_trace(go.Scatter(x=data.index, y=data['High'] - data['Low'], name='High-Low', line=dict(color='green')), row=2, col=1)

# Close-Open
fig.add_trace(go.Scatter(x=data.index, y=data['Close'] - data['Open'], name='Close-Open', line=dict(color='red')), row=2, col=2)

fig.update_layout(height=600, title_text="📈 Exploratory Analysis - PETR4.SA", showlegend=False)
fig.show()

## 3. Data Preprocessing

In this section, we clean the data and prepare it for analysis.
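
When preparing a direction label, one way to make sure the target is not already contained in same-day features is to label each day with the move of the following day. The sketch below illustrates that idea on hypothetical closing prices; `next_day_direction` is an illustrative helper, not part of any library.

```python
import pandas as pd

def next_day_direction(close: pd.Series) -> pd.Series:
    """Label each day 1 if the following close is higher, 0 otherwise.

    The final day has no following close, so it is dropped.
    """
    label = (close.shift(-1) > close).astype(int)
    return label.iloc[:-1]

# Hypothetical closing prices
close = pd.Series(
    [30.0, 30.5, 30.2, 30.2, 31.0],
    index=pd.date_range("2024-01-02", periods=5, freq="B"),
)
labels = next_day_direction(close)
print(labels.tolist())
```

An unchanged close counts as "down" here because the comparison is strict; whether flat days deserve their own treatment is a modeling choice.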

In [None]:
# Work on a copy of the data
df = data.copy()

# Normalize column names (handles both flat and MultiIndex columns returned by yfinance)
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)
df.columns = [str(col).lower().replace(' ', '_') for col in df.columns]

# Drop missing rows (if any)
df = df.dropna()

# Target: 1 if the next trading day's close is above today's close, 0 otherwise.
# Labeling with the next day's move keeps same-day information (already present
# in features such as return_1d) from leaking into the target.
df['price_change'] = df['close'].pct_change()
df['target'] = (df['close'].shift(-1) > df['close']).astype(int)

# Drop the first row (NaN from pct_change) and the last row (no next-day close yet)
df = df.dropna().iloc[:-1]

print(f"📊 Data after preprocessing: {len(df)} records")
print("\n🎯 Target distribution:")
target_dist = df['target'].value_counts().sort_index()
print(f"Down (0): {target_dist[0]} ({target_dist[0]/len(df)*100:.1f}%)")
print(f"Up (1): {target_dist[1]} ({target_dist[1]/len(df)*100:.1f}%)")

# Visualize the target distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Bar chart (sorted by class so red = down, green = up)
target_dist.plot(kind='bar', ax=ax1, color=['red', 'green'])
ax1.set_title('Trend Distribution')
ax1.set_xlabel('Trend (0=Down, 1=Up)')
ax1.set_ylabel('Frequency')
ax1.tick_params(axis='x', rotation=0)

# Trend over time
ax2.plot(df.index, df['target'], alpha=0.7, color='purple')
ax2.set_title('Trend Over Time')
ax2.set_xlabel('Date')
ax2.set_ylabel('Trend (0=Down, 1=Up)')

plt.tight_layout()
plt.show()

display(df.head())

## 4. Feature Engineering

We build technical indicators and other features that may help predict the trend.
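
As an illustration of how one such indicator is computed, the sketch below implements RSI with simple rolling means of gains and losses. This is the simple-moving-average variant, not Wilder's smoothed version, so its values can differ from those of a technical-analysis library; `rsi_sma` and the two test series are illustrative.

```python
import pandas as pd

def rsi_sma(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI computed with simple rolling means of gains and losses."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()     # average upward move
    loss = (-delta).clip(lower=0).rolling(window).mean()  # average downward move
    rs = gain / loss                                      # inf when there are no losses
    return 100 - 100 / (1 + rs)

rising = pd.Series(range(1, 31), dtype=float)    # only gains -> RSI at the top of its range
falling = rising[::-1].reset_index(drop=True)    # only losses -> RSI at the bottom
rsi_up = rsi_sma(rising).iloc[-1]
rsi_down = rsi_sma(falling).iloc[-1]
print(rsi_up, rsi_down)
```

The first `window` values are NaN because the rolling mean needs a full window, which is why rows are dropped after feature creation.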

In [None]:
def create_technical_features(df):
    """
    Build technical features for financial time series analysis.
    """
    df_features = df.copy()
    
    # ========== BASIC FEATURES ==========
    # Percentage returns
    for period in [1, 2, 3, 5, 10]:
        df_features[f'return_{period}d'] = df_features['close'].pct_change(period)
    
    # Volatility (standard deviation of returns)
    for period in [5, 10, 20]:
        df_features[f'volatility_{period}d'] = df_features['price_change'].rolling(period).std()
    
    # ========== MOVING AVERAGES ==========
    for period in [5, 10, 20, 50]:
        df_features[f'sma_{period}'] = df_features['close'].rolling(period).mean()
        df_features[f'price_sma_{period}_ratio'] = df_features['close'] / df_features[f'sma_{period}']
    
    # Exponential moving averages
    for period in [12, 26]:
        df_features[f'ema_{period}'] = df_features['close'].ewm(span=period).mean()
    
    # ========== TECHNICAL INDICATORS WITH TA-LIB ==========
    try:
        # RSI (Relative Strength Index)
        df_features['rsi_14'] = talib.RSI(df_features['close'].values, timeperiod=14)
        
        # MACD
        macd, macd_signal, macd_hist = talib.MACD(df_features['close'].values)
        df_features['macd'] = macd
        df_features['macd_signal'] = macd_signal
        df_features['macd_histogram'] = macd_hist
        
        # Bollinger Bands
        bb_upper, bb_middle, bb_lower = talib.BBANDS(df_features['close'].values)
        df_features['bb_upper'] = bb_upper
        df_features['bb_lower'] = bb_lower
        df_features['bb_position'] = (df_features['close'] - bb_lower) / (bb_upper - bb_lower)
        
        # Stochastic Oscillator
        stoch_k, stoch_d = talib.STOCH(df_features['high'].values, df_features['low'].values, df_features['close'].values)
        df_features['stoch_k'] = stoch_k
        df_features['stoch_d'] = stoch_d
        
        # Williams %R
        df_features['williams_r'] = talib.WILLR(df_features['high'].values, df_features['low'].values, df_features['close'].values)
        
        print("✅ Technical indicators (TA-Lib) created successfully!")
    except Exception:
        print("⚠️ TA-Lib unavailable. Computing indicators manually...")
        
        # Manual RSI (simple rolling means of gains and losses)
        delta = df_features['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
        rs = gain / loss
        df_features['rsi_14'] = 100 - (100 / (1 + rs))
        
        # Manual MACD
        ema_12 = df_features['close'].ewm(span=12).mean()
        ema_26 = df_features['close'].ewm(span=26).mean()
        df_features['macd'] = ema_12 - ema_26
        df_features['macd_signal'] = df_features['macd'].ewm(span=9).mean()
        df_features['macd_histogram'] = df_features['macd'] - df_features['macd_signal']
    
    # ========== VOLUME FEATURES ==========
    df_features['volume_sma_10'] = df_features['volume'].rolling(10).mean()
    df_features['volume_ratio'] = df_features['volume'] / df_features['volume_sma_10']
    
    # ========== PRICE PATTERN FEATURES ==========
    # High-Low range
    df_features['hl_range'] = (df_features['high'] - df_features['low']) / df_features['close']
    
    # Open-Close range
    df_features['oc_range'] = (df_features['close'] - df_features['open']) / df_features['open']
    
    # Position of the close within the day's range
    df_features['close_position'] = (df_features['close'] - df_features['low']) / (df_features['high'] - df_features['low'])
    
    # ========== CALENDAR FEATURES ==========
    df_features['day_of_week'] = df_features.index.dayofweek
    df_features['month'] = df_features.index.month
    df_features['quarter'] = df_features.index.quarter
    
    return df_features

# Build the features
print("🔨 Building technical features...")
df_with_features = create_technical_features(df)

print(f"✅ Features created! Total columns: {len(df_with_features.columns)}")
print(f"📊 Data shape: {df_with_features.shape}")

# Drop rows with NaN or infinite values (indicator warm-up periods, zero-range days)
df_clean = df_with_features.replace([np.inf, -np.inf], np.nan).dropna()
print(f"📊 Clean data: {len(df_clean)} records")

# List some of the created features
feature_columns = [col for col in df_clean.columns if col not in ['open', 'high', 'low', 'close', 'adj_close', 'volume', 'price_change', 'target']]
print(f"\n🎯 Features created ({len(feature_columns)}):")
for i, feat in enumerate(feature_columns[:10]):
    print(f"{i+1:2d}. {feat}")
if len(feature_columns) > 10:
    print(f"    ... and {len(feature_columns) - 10} more features")

In [None]:
# Analyze feature correlation with the target, using only the rows that precede
# the 30-day test window so that feature selection does not see test data
train_window = df_clean.iloc[:-30]
correlations = (
    train_window[feature_columns + ['target']]
    .corr()['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)

print("🎯 Top 15 features most correlated with the target:")
print(correlations.head(15))

# Plot the correlations
plt.figure(figsize=(12, 8))
top_features = correlations.head(15)
plt.barh(range(len(top_features)), top_features.values)
plt.yticks(range(len(top_features)), top_features.index)
plt.xlabel('Absolute Correlation with Target')
plt.title('Top 15 Features by Correlation with Target')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# Select features for the model (constant columns yield NaN correlations and are skipped)
selected_features = correlations.dropna().head(20).index.tolist()  # Top 20 features
print(f"\n✅ Selected {len(selected_features)} features for the model")

## 5. Train/Test Split

We split the data chronologically, holding out the last 30 trading days as the test set.
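
A chronological hold-out can be expressed as a small helper that slices off the final rows. The sketch below uses randomly generated data purely to demonstrate the mechanics; `split_last_n` and the demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def split_last_n(features: pd.DataFrame, target: pd.Series, n_test: int = 30):
    """Hold out the final n_test rows; everything earlier is training data."""
    if n_test <= 0 or n_test >= len(features):
        raise ValueError("n_test must be between 1 and len(features) - 1")
    return (features.iloc[:-n_test], features.iloc[-n_test:],
            target.iloc[:-n_test], target.iloc[-n_test:])

# Hypothetical data: 100 business days with one random feature and a random label
dates = pd.date_range("2024-01-01", periods=100, freq="B")
rng = np.random.default_rng(0)
features = pd.DataFrame({"feature": rng.normal(size=100)}, index=dates)
target = pd.Series(rng.integers(0, 2, size=100), index=dates)

X_tr, X_te, y_tr, y_te = split_last_n(features, target)
print(len(X_tr), len(X_te), X_tr.index.max() < X_te.index.min())
```

Slicing by position rather than shuffling guarantees that every training date precedes every test date.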

In [None]:
# Prepare data for machine learning
X = df_clean[selected_features]
y = df_clean['target']

print("📊 Data preparation:")
print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")

# Temporal split: the last 30 trading days form the test set
split_date = df_clean.index[-30]  # first date of the test window
print(f"\n📅 Split date: {split_date.strftime('%Y-%m-%d')}")

# Train and test sets
X_train = X[X.index < split_date]
X_test = X[X.index >= split_date]
y_train = y[y.index < split_date]
y_test = y[y.index >= split_date]

print("\n📊 Data split:")
print(f"Train: {len(X_train)} records ({X_train.index[0].strftime('%Y-%m-%d')} to {X_train.index[-1].strftime('%Y-%m-%d')})")
print(f"Test:  {len(X_test)} records ({X_test.index[0].strftime('%Y-%m-%d')} to {X_test.index[-1].strftime('%Y-%m-%d')})")

# Check the target distribution in each set
print("\n🎯 Target distribution:")
print(f"Train - Down: {(y_train==0).sum()} ({(y_train==0).mean()*100:.1f}%), Up: {(y_train==1).sum()} ({(y_train==1).mean()*100:.1f}%)")
print(f"Test  - Down: {(y_test==0).sum()} ({(y_test==0).mean()*100:.1f}%), Up: {(y_test==1).sum()} ({(y_test==1).mean()*100:.1f}%)")

# Visualize the split
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))

# Prices with the split marked
ax1.plot(df_clean.index, df_clean['close'], label='Closing Price', alpha=0.8)
ax1.axvline(x=split_date, color='red', linestyle='--', alpha=0.8, label='Train/Test Split')
ax1.fill_between(X_train.index, df_clean.loc[X_train.index, 'close'].min(), 
                df_clean.loc[X_train.index, 'close'].max(), alpha=0.2, color='blue', label='Train')
ax1.fill_between(X_test.index, df_clean.loc[X_test.index, 'close'].min(), 
                df_clean.loc[X_test.index, 'close'].max(), alpha=0.2, color='red', label='Test')
ax1.set_title('Train/Test Split - PETR4.SA Prices')
ax1.set_ylabel('Price (R$)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Target with the split marked
ax2.plot(df_clean.index, df_clean['target'], alpha=0.7, color='purple', label='Target')
ax2.axvline(x=split_date, color='red', linestyle='--', alpha=0.8, label='Train/Test Split')
ax2.fill_between(X_train.index, -0.1, 1.1, alpha=0.2, color='blue', label='Train')
ax2.fill_between(X_test.index, -0.1, 1.1, alpha=0.2, color='red', label='Test')
ax2.set_title('Target (Trend) - Train/Test Split')
ax2.set_ylabel('Trend (0=Down, 1=Up)')
ax2.set_xlabel('Date')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Training Multiple Models

We train several machine learning algorithms and compare them.
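
When comparing models on time-ordered data, cross-validation folds that only train on earlier rows give a fairer estimate than ordinary k-fold. The sketch below shows scikit-learn's `TimeSeriesSplit` with a scaling pipeline on synthetic data; the data, the pipeline, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4))  # synthetic features
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)  # synthetic labels

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

splitter = TimeSeriesSplit(n_splits=5)
fold_order_ok = all(
    train_idx.max() < test_idx.min()   # training rows always precede validation rows
    for train_idx, test_idx in splitter.split(X_demo)
)

scores = cross_val_score(pipeline, X_demo, y_demo, cv=splitter, scoring="accuracy")
print(f"Folds ordered in time: {fold_order_ok}; mean accuracy = {scores.mean():.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so validation rows never influence the scaling statistics.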

In [None]:
# Time-ordered cross-validation: each fold trains only on rows that precede its validation rows
from sklearn.model_selection import TimeSeriesSplit

# Build pipelines with feature scaling
models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
    ]),
    
    'Gradient Boosting': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', GradientBoostingClassifier(n_estimators=100, random_state=42))
    ]),
    
    'SVM': Pipeline([
        ('scaler', RobustScaler()),
        ('classifier', SVC(probability=True, random_state=42))
    ])
}

# Train and evaluate each model
results = {}
trained_models = {}
cv_splitter = TimeSeriesSplit(n_splits=5)

print("🚀 Starting model training...\n")

for name, model in models.items():
    print(f"📚 Training {name}...")
    
    # Training
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    y_pred_proba_test = model.predict_proba(X_test)[:, 1]
    
    # Metrics
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)
    
    # Cross-validation on the training set (time-ordered folds)
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv_splitter, scoring='accuracy')
    
    results[name] = {
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'predictions': y_pred_test,
        'probabilities': y_pred_proba_test
    }
    
    trained_models[name] = model
    
    print(f"   ✅ Train accuracy: {train_acc:.4f}")
    print(f"   ✅ Test accuracy:  {test_acc:.4f}")
    print(f"   ✅ CV (mean±std):  {cv_scores.mean():.4f}±{cv_scores.std():.4f}")
    print()

print("🎉 Training complete!")

In [None]:
# Build an ensemble (soft-voting classifier) from three of the base models
from sklearn.model_selection import TimeSeriesSplit

print("🤝 Building ensemble model...")

# Base models for the ensemble (VotingClassifier clones and refits them)
ensemble_models = [
    ('rf', trained_models['Random Forest']),
    ('gb', trained_models['Gradient Boosting']),
    ('lr', trained_models['Logistic Regression'])
]

ensemble = VotingClassifier(estimators=ensemble_models, voting='soft')
ensemble.fit(X_train, y_train)

# Evaluate the ensemble
y_pred_ensemble_train = ensemble.predict(X_train)
y_pred_ensemble_test = ensemble.predict(X_test)
y_pred_ensemble_proba = ensemble.predict_proba(X_test)[:, 1]

ensemble_train_acc = accuracy_score(y_train, y_pred_ensemble_train)
ensemble_test_acc = accuracy_score(y_test, y_pred_ensemble_test)
ensemble_cv_scores = cross_val_score(ensemble, X_train, y_train, cv=TimeSeriesSplit(n_splits=5), scoring='accuracy')

results['Ensemble'] = {
    'train_accuracy': ensemble_train_acc,
    'test_accuracy': ensemble_test_acc,
    'cv_mean': ensemble_cv_scores.mean(),
    'cv_std': ensemble_cv_scores.std(),
    'predictions': y_pred_ensemble_test,
    'probabilities': y_pred_ensemble_proba
}

trained_models['Ensemble'] = ensemble

print(f"✅ Ensemble - Train accuracy: {ensemble_train_acc:.4f}")
print(f"✅ Ensemble - Test accuracy:  {ensemble_test_acc:.4f}")
print(f"✅ Ensemble - CV (mean±std):  {ensemble_cv_scores.mean():.4f}±{ensemble_cv_scores.std():.4f}")

## 7. Model Evaluation

We compare the performance of all models and identify the best one.
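
Accuracy figures are easier to interpret next to a trivial reference. The sketch below computes the accuracy of always predicting the most frequent training class, a baseline any useful model should beat; the helper name and the toy labels are hypothetical.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most frequent training class."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(y_test == majority))

# Hypothetical labels: 1 = up, 0 = down
y_train_demo = [1, 1, 0, 1, 0, 1, 1, 0]
y_test_demo = [1, 0, 1, 1, 0]
baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```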

In [None]:
# Build a DataFrame with the results
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Train Accuracy': [results[model]['train_accuracy'] for model in results.keys()],
    'Test Accuracy': [results[model]['test_accuracy'] for model in results.keys()],
    'CV Mean': [results[model]['cv_mean'] for model in results.keys()],
    'CV Std': [results[model]['cv_std'] for model in results.keys()]
})

# Sort by test accuracy
results_df = results_df.sort_values('Test Accuracy', ascending=False)

print("📊 MODEL COMPARISON RESULTS")
print("=" * 60)
display(results_df.round(4))

# Identify the best model
best_model_name = results_df.iloc[0]['Model']
best_accuracy = results_df.iloc[0]['Test Accuracy']

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"📈 Test accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")

# Check whether the 75% criterion is met
if best_accuracy >= 0.75:
    print("✅ GOAL REACHED! Accuracy ≥ 75%")
else:
    print(f"❌ Goal not reached. Current accuracy: {best_accuracy*100:.2f}% (target: 75%)")
    print("💡 Suggestions for improvement:")
    print("   - Add more features")
    print("   - Tune hyperparameters")
    print("   - Use more historical data")
    print("   - Apply feature selection techniques")

In [None]:
# Comparative visualization of the models
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Accuracy comparison
x_pos = np.arange(len(results_df))
ax1.bar(x_pos - 0.2, results_df['Train Accuracy'], 0.4, label='Train', alpha=0.8)
ax1.bar(x_pos + 0.2, results_df['Test Accuracy'], 0.4, label='Test', alpha=0.8)
ax1.axhline(y=0.75, color='red', linestyle='--', alpha=0.7, label='75% Target')
ax1.set_xlabel('Models')
ax1.set_ylabel('Accuracy')
ax1.set_title('Accuracy Comparison')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(results_df['Model'], rotation=45)
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Cross-validation scores
ax2.errorbar(x_pos, results_df['CV Mean'], yerr=results_df['CV Std'], 
            fmt='o', capsize=5, capthick=2, markersize=8)
ax2.axhline(y=0.75, color='red', linestyle='--', alpha=0.7, label='75% Target')
ax2.set_xlabel('Models')
ax2.set_ylabel('CV Accuracy')
ax2.set_title('Cross-Validation (5 folds)')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(results_df['Model'], rotation=45)
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. Confusion matrix of the best model
best_predictions = results[best_model_name]['predictions']
cm = confusion_matrix(y_test, best_predictions)
sns.heatmap(cm, annot=True, fmt='d', ax=ax3, cmap='Blues',
           xticklabels=['Down', 'Up'], yticklabels=['Down', 'Up'])
ax3.set_title(f'Confusion Matrix - {best_model_name}')
ax3.set_xlabel('Predicted')
ax3.set_ylabel('Actual')

# 4. Distribution of predicted probabilities
best_probabilities = results[best_model_name]['probabilities']
ax4.hist(best_probabilities[(y_test == 0).values], bins=20, alpha=0.7, label='Actual Down', color='red')
ax4.hist(best_probabilities[(y_test == 1).values], bins=20, alpha=0.7, label='Actual Up', color='green')
ax4.axvline(x=0.5, color='black', linestyle='--', alpha=0.7, label='Threshold 0.5')
ax4.set_xlabel('Predicted Probability')
ax4.set_ylabel('Frequency')
ax4.set_title(f'Probability Distribution - {best_model_name}')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Detailed report for the best model
print(f"\n📋 DETAILED REPORT - {best_model_name}")
print("=" * 50)
print(classification_report(y_test, best_predictions, target_names=['Down', 'Up']))

## 8. Predictions and Visualization of Results

We visualize the predictions of the best model and analyze the results.
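
One useful view of the predictions is how accuracy evolves as test days accumulate. The sketch below computes a running (expanding) accuracy with pandas on hypothetical actual and predicted directions; the two series are made up for illustration.

```python
import pandas as pd

actual = pd.Series([1, 0, 1, 1, 0, 1])      # hypothetical real directions
predicted = pd.Series([1, 1, 1, 0, 0, 1])   # hypothetical model output

hits = (actual == predicted).astype(int)        # 1 when the prediction was right
cumulative_accuracy = hits.expanding().mean()   # accuracy over all days seen so far
print(cumulative_accuracy.round(3).tolist())
```

Early values swing widely because they rest on very few predictions; the last value equals the overall accuracy.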

In [None]:
# Prepare data for visualization
test_dates = X_test.index
test_prices = df_clean.loc[test_dates, 'close']
actual_trends = y_test.values
predicted_trends = results[best_model_name]['predictions']
predicted_probabilities = results[best_model_name]['probabilities']

# Build a DataFrame for the analysis
analysis_df = pd.DataFrame({
    'Date': test_dates,
    'Price': test_prices.values,
    'Actual_Trend': actual_trends,
    'Predicted_Trend': predicted_trends,
    'Probability': predicted_probabilities,
    'Correct': actual_trends == predicted_trends
})

print("📊 PREDICTION ANALYSIS - Last 30 trading days")
print(f"Period: {test_dates[0].strftime('%Y-%m-%d')} to {test_dates[-1].strftime('%Y-%m-%d')}")
print("\n🎯 Results:")
print(f"Total predictions: {len(analysis_df)}")
print(f"Hits: {analysis_df['Correct'].sum()} ({analysis_df['Correct'].mean()*100:.1f}%)")
print(f"Misses: {(~analysis_df['Correct']).sum()} ({(~analysis_df['Correct']).mean()*100:.1f}%)")

# Analysis by type of movement
subidas_reais = analysis_df[analysis_df['Actual_Trend'] == 1]
descidas_reais = analysis_df[analysis_df['Actual_Trend'] == 0]

print("\n📈 Up moves (actual):")
print(f"Total: {len(subidas_reais)}")
print(f"Hits: {subidas_reais['Correct'].sum()} ({subidas_reais['Correct'].mean()*100:.1f}%)")

print("\n📉 Down moves (actual):")
print(f"Total: {len(descidas_reais)}")
print(f"Hits: {descidas_reais['Correct'].sum()} ({descidas_reais['Correct'].mean()*100:.1f}%)")

display(analysis_df.head(10))

In [None]:
# Comprehensive visualization of the results
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=(
        'Prices and Predictions', 'Prediction Probabilities',
        'Hits vs Misses', 'Actual vs Predicted Direction',
        'Accuracy by Week', 'Cumulative Performance'
    ),
    specs=[[{"secondary_y": True}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]],
    vertical_spacing=0.08
)

# 1. Prices and predictions
fig.add_trace(
    go.Scatter(x=analysis_df['Date'], y=analysis_df['Price'], 
              name='PETR4 Price', line=dict(color='blue')),
    row=1, col=1
)

# Add markers for hits and misses
acertos = analysis_df[analysis_df['Correct']]
erros = analysis_df[~analysis_df['Correct']]

fig.add_trace(
    go.Scatter(x=acertos['Date'], y=acertos['Price'], 
              mode='markers', name='Hits', 
              marker=dict(color='green', size=8, symbol='circle')),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=erros['Date'], y=erros['Price'], 
              mode='markers', name='Misses', 
              marker=dict(color='red', size=8, symbol='x')),
    row=1, col=1
)

# 2. Probabilities
fig.add_trace(
    go.Scatter(x=analysis_df['Date'], y=analysis_df['Probability'], 
              name='Prob. Up', line=dict(color='purple')),
    row=1, col=2
)
fig.add_hline(y=0.5, line_dash="dash", line_color="black", row=1, col=2)

# 3. Hits vs misses over time
fig.add_trace(
    go.Scatter(x=analysis_df['Date'], y=analysis_df['Correct'].astype(int), 
              mode='markers+lines', name='Hits (1) vs Misses (0)',
              line=dict(color='orange')),
    row=2, col=1
)

# 4. Direction encoded as +1 (up) / -1 (down)
returns_real = np.where(analysis_df['Actual_Trend'] == 1, 1, -1)
returns_pred = np.where(analysis_df['Predicted_Trend'] == 1, 1, -1)

fig.add_trace(
    go.Scatter(x=analysis_df['Date'], y=returns_real, 
              name='Actual Direction', line=dict(color='blue')),
    row=2, col=2
)
fig.add_trace(
    go.Scatter(x=analysis_df['Date'], y=returns_pred, 
              name='Predicted Direction', line=dict(color='red', dash='dash')),
    row=2, col=2
)

# 5. Accuracy by ISO week (Series.dt.week was removed in pandas 2.0)
accuracy_by_week = analysis_df.groupby(analysis_df['Date'].dt.isocalendar().week)['Correct'].mean()
fig.add_trace(
    go.Bar(x=accuracy_by_week.index, y=accuracy_by_week.values, 
          name='Weekly Accuracy'),
    row=3, col=1
)

# 6. Cumulative performance
cumulative_accuracy = analysis_df['Correct'].astype(int).expanding().mean()
fig.add_trace(
    go.Scatter(x=analysis_df['Date'], y=cumulative_accuracy, 
              name='Cumulative Accuracy', line=dict(color='green')),
    row=3, col=2
)
fig.add_hline(y=0.75, line_dash="dash", line_color="red", row=3, col=2)

# Configure layout
fig.update_layout(
    height=1000,
    title_text=f"📊 Full Analysis - {best_model_name} (Accuracy: {best_accuracy:.2%})",
    showlegend=True
)

# Configure axes
fig.update_xaxes(title_text="ISO Week", row=3, col=1)
fig.update_xaxes(title_text="Date", row=3, col=2)
fig.update_yaxes(title_text="Price (R$)", row=1, col=1)
fig.update_yaxes(title_text="Probability", row=1, col=2)
fig.update_yaxes(title_text="Hit (1) / Miss (0)", row=2, col=1)
fig.update_yaxes(title_text="Direction (+1 / -1)", row=2, col=2)
fig.update_yaxes(title_text="Accuracy", row=3, col=1)
fig.update_yaxes(title_text="Cumulative Accuracy", row=3, col=2)

fig.show()

In [None]:
# Feature importance analysis (tree-based models only)
if best_model_name in ('Random Forest', 'Gradient Boosting', 'Ensemble'):
    # Extract feature importances
    if best_model_name == 'Ensemble':
        # For the ensemble, use the Random Forest as a reference
        feature_importance = trained_models['Random Forest']['classifier'].feature_importances_
    else:
        feature_importance = trained_models[best_model_name]['classifier'].feature_importances_
    
    # Build a DataFrame of importances
    importance_df = pd.DataFrame({
        'Feature': selected_features,
        'Importance': feature_importance
    }).sort_values('Importance', ascending=False)
    
    print("🎯 TOP 10 MOST IMPORTANT FEATURES:")
    display(importance_df.head(10))
    
    # Plot the importances
    plt.figure(figsize=(12, 8))
    top_features = importance_df.head(15)
    plt.barh(range(len(top_features)), top_features['Importance'])
    plt.yticks(range(len(top_features)), top_features['Feature'])
    plt.xlabel('Importance')
    plt.title(f'Top 15 Features - {best_model_name}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

# Final summary
print("\n" + "="*60)
print("🏆 FINAL PROJECT SUMMARY")
print("="*60)
print(f"📊 Data analyzed: {len(df_clean)} records")
print(f"🎯 Test period: {len(y_test)} trading days (the last 30)")
print(f"🤖 Best model: {best_model_name}")
print(f"📈 Accuracy obtained: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
print("🎯 Accuracy target: 75%")

if best_accuracy >= 0.75:
    print("✅ SUCCESS: Target reached!")
    print("💰 The model can be used to support trading decisions")
else:
    print(f"❌ Target not reached (gap: {(0.75 - best_accuracy)*100:.2f} percentage points)")
    print("💡 Improvements are recommended for future iterations")

print(f"\n📝 Features used: {len(selected_features)}")
print(f"⏰ Analysis period: {df_clean.index[0].strftime('%Y-%m-%d')} to {df_clean.index[-1].strftime('%Y-%m-%d')}")
print("="*60)

## 📋 Conclusions and Next Steps

### Main Results
- **Model developed**: Trend prediction system for PETR4.SA
- **Data used**: ~2 years of historical data
- **Features created**: Technical indicators, moving averages, volatility, etc.
- **Models tested**: Logistic Regression, Random Forest, Gradient Boosting, SVM, Ensemble
- **Test period**: Last 30 trading days of data

### Methodology Applied
1. ✅ **Data collection**: yfinance for historical PETR4.SA data
2. ✅ **Feature engineering**: Creation of 20+ technical indicators
3. ✅ **Temporal split**: Train/test split that respects chronological order
4. ✅ **Multiple models**: Comparison of different algorithms
5. ✅ **Ensemble learning**: Soft-voting combination of Random Forest, Gradient Boosting, and Logistic Regression
6. ✅ **Rigorous evaluation**: Full metrics and visualizations

### Next Steps (if needed)
- **Hyperparameter optimization**: More detailed GridSearch
- **More features**: Macroeconomic data, market sentiment
- **Advanced models**: LSTM, Transformer for time series
- **Validation**: Walk-forward validation for temporal robustness
- **Risk management**: Incorporate stop-loss and take-profit
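
The hyperparameter search and time-aware validation mentioned above can be combined. The sketch below runs scikit-learn's `GridSearchCV` with `TimeSeriesSplit` folds on synthetic data; the parameter grid, data, and variable names are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 5))                       # synthetic features
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)   # synthetic labels

param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=4),   # folds never train on later rows
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Tuning should use only the training window, so that the final 30-day test set remains untouched until the last evaluation.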

### ⚠️ Important Notices
- This model is for educational and research purposes
- Financial markets are unpredictable and involve risk
- Always consult qualified professionals before investing
- Past performance does not guarantee future results