# 🗺️ Architecture of the Ticker ↔ CNPJ Mapping System (Aurum)

This document describes the logic and technical design of the subsystem that builds the "Golden Record" linking B3 assets (tickers) to the official CVM registration data (CNPJs).

---

## 🎯 The Objective
Financial systems face a classic data-disconnection problem:
1.  **Price World (B3):** Operates on tickers (e.g. `VALE3`), but provides neither the CNPJ nor a clean corporate name (Razão Social).
2.  **Fundamentals World (CVM):** Operates on CNPJs (e.g. `33.592.510/0001-54`), but has no knowledge of tickers.

This system solves the problem by building an **Automated Bridge** that combines data enrichment with fuzzy matching of company names.
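
The name-matching idea can be sketched as follows. This is a minimal illustration, not the project's `ticker_extractor.py`: it assumes Python's standard `difflib` as the similarity measure (the source does not say which fuzzy library is used), and `normalize_name` and `best_match` are hypothetical helpers. The cut-off of 85 mirrors the `FUZZY_THRESHOLD` value that appears in the notebook configuration.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Trailing "S.A.", "S/A", "SA" and similar suffixes carry no matching signal.
_SUFFIX = re.compile(r"\s+S[./]?\s?A\.?$")


def normalize_name(name: str) -> str:
    """Uppercase, drop the corporate suffix, and collapse punctuation."""
    name = _SUFFIX.sub("", name.upper().strip())
    name = re.sub(r"[^\w ]", " ", name)
    return " ".join(name.split())


def best_match(
    trade_name: str,
    cvm_names: Dict[str, str],  # official name -> CNPJ (assumed shape)
    threshold: float = 85.0,
) -> Optional[Tuple[str, str, float]]:
    """Return (official_name, cnpj, score) for the best candidate, or None."""
    query = normalize_name(trade_name)
    best: Optional[Tuple[str, str, float]] = None
    for official, cnpj in cvm_names.items():
        score = 100 * SequenceMatcher(None, query, normalize_name(official)).ratio()
        if best is None or score > best[2]:
            best = (official, cnpj, score)
    # Only accept the best candidate if it clears the threshold.
    if best is not None and best[2] >= threshold:
        return best
    return None


cvm = {
    "VALE S.A.": "33.592.510/0001-54",
    "AMBEV S.A.": "07.526.557/0001-00",
}
print(best_match("Vale", cvm))
print(best_match("Petrobras", cvm))
```

String similarity alone can score unrelated companies highly when their names share generic words, so a match above the threshold is a candidate to review rather than a guarantee.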

---

## 🔄 Process Flowchart

The diagram below shows the data flow, from raw collection on the internet to the generation of the master Parquet file.

```mermaid
graph TD
    %% External sources
    subgraph "Phase 1: Collection and External Intelligence"
        B3_API["📡 B3 API<br/>(IndexProxy)"] -->|Tickers| SCRIPT_AUTO
        YAHOO["🔍 Yahoo Finance"] -->|Trade Names| SCRIPT_AUTO
        CVM_WEB["🏛️ CVM Open Data"] -->|Official CNPJs| SCRIPT_AUTO

        SCRIPT_AUTO("🐍 ticker_extractor.py")

        SCRIPT_AUTO -->|Fuzzy Match| MATCH_LOGIC{"Cross-Match<br/>Name x Name"}
        MATCH_LOGIC -->|Success| CSV_AUTO["📄 mapa_ticker_cnpj_automatizado.csv"]
    end

    %% Internal pipeline
    subgraph "Phase 2: Aurum Pipeline (ETL)"
        CSV_AUTO --> EXT_AUTO["🧩 AutomatedMapExtractor"]
        REF_MANUAL["📝 manual_reference.json"] --> EXT_MANUAL["🧩 ManualOverrideExtractor"]

        EXT_AUTO --> PIPELINE("⚙️ create_map.py")
        EXT_MANUAL --> PIPELINE

        PIPELINE -->|Validation| VALIDATORS{"validators.py<br/>Checks CNPJ Digits"}

        VALIDATORS -->|Approved| MASTER_DB[("🗄️ ticker_cnpj_master.parquet")]
        VALIDATORS -->|Log| REPORT["📊 quality_report.txt"]
    end

    %% Styling
    style MASTER_DB fill:#d4edda,stroke:#28a745,stroke-width:2px
    style SCRIPT_AUTO fill:#e2e3e5,stroke:#333,stroke-width:2px
    style CSV_AUTO fill:#fff3cd,stroke:#ffc107,stroke-width:2px
```

In [36]:
import pandas as pd
import numpy as np
import json
import logging
import hashlib
import sys
from pathlib import Path
from typing import Dict, List, Optional
from datetime import datetime
from dataclasses import dataclass, field
import re

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)],
    force=True
)
logger = logging.getLogger("AurumMapper")

In [37]:
@dataclass
class MappingConfig:
    """Centralized configuration, adapted for a notebook that lives in a subfolder"""

    BASE_DIR: Path = Path.cwd().parent 
    
    DATA_DIR: Path = BASE_DIR / "data" 
    
    CVM_DIR: Path = DATA_DIR / "cvm"
    HISTORICAL_DIR: Path = DATA_DIR / "historical"

    OUTPUT_DIR: Path = DATA_DIR / "mapping"
    MASTER_FILE: Path = OUTPUT_DIR / "ticker_cnpj_master.parquet"
    REFERENCE_FILE: Path = OUTPUT_DIR / "manual_reference.json"
    AUDIT_LOG: Path = OUTPUT_DIR / "audit_log.json"
    QUALITY_REPORT: Path = OUTPUT_DIR / "quality_report.txt"
    VERSIONS_DIR: Path = OUTPUT_DIR / "versions"

    AUTOMATED_MAP_FILE: Path = DATA_DIR / "dados_mapeamento" / "mapa_ticker_cnpj_automatizado.csv"
    
    TICKERS_FILE: Path = DATA_DIR / "tickers_ibrx100_full.parquet"
    FUNDAMENTALS_FILE: Path = CVM_DIR / "final" / "fundamentals_wide.parquet"

    FUZZY_THRESHOLD: int = 85
    MIN_CONFIDENCE: int = 70

    KNOWN_UNITS: List[str] = field(default_factory=lambda: ['BPAC11', 'ENGI11', 'IGTI11', 'TAEE11', 'SANB11'])
    TICKER_HISTORY: Dict[str, str] = field(default_factory=lambda: {'ITUB3': 'ITUB4', 'BBDC3': 'BBDC4'})
    SAME_CNPJ_GROUPS: List[List[str]] = field(default_factory=lambda: [['BBDC3', 'BBDC4'], ['PETR3', 'PETR4']])

    def __post_init__(self):
        self.OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
        self.VERSIONS_DIR.mkdir(parents=True, exist_ok=True)

        print(f"📍 Notebook location: {Path.cwd()}")
        print(f"📂 Computed data directory: {self.DATA_DIR}")

        if self.AUTOMATED_MAP_FILE.exists():
            print(f"✅ FILE FOUND: {self.AUTOMATED_MAP_FILE.name}")
        else:
            print(f"❌ FILE NOT FOUND AT: {self.AUTOMATED_MAP_FILE}")
            print("   Check that the 'dados_mapeamento' folder is actually inside 'aurum/data'")

    def get_version_path(self, version: str) -> Path:
        return self.VERSIONS_DIR / f"ticker_cnpj_master_v{version}.parquet"

config = MappingConfig()

📍 Notebook location: c:\Users\kaike\projeto_aurum\aurum\mapeadores
📂 Computed data directory: c:\Users\kaike\projeto_aurum\aurum\data
✅ FILE FOUND: mapa_ticker_cnpj_automatizado.csv


In [38]:
from typing import Any, Dict  # `Any` is required by the return annotations below


class CNPJValidator:
    @staticmethod
    def clean(cnpj: str) -> str:
        # Keep digits only
        return re.sub(r'\D', '', str(cnpj))

    @staticmethod
    def format(cnpj: str) -> str:
        clean = CNPJValidator.clean(cnpj)
        if len(clean) != 14:
            return cnpj  # not 14 digits: return the input unchanged
        return f"{clean[:2]}.{clean[2:5]}.{clean[5:8]}/{clean[8:12]}-{clean[12:]}"

    @staticmethod
    def validate(cnpj: str) -> Dict[str, Any]:
        # Only the length is checked here; the check digits are not verified
        clean_cnpj = CNPJValidator.clean(cnpj)
        if len(clean_cnpj) != 14:
            return {'valid': False, 'reason': 'Incorrect length', 'formatted': cnpj}
        return {'valid': True, 'reason': 'OK', 'formatted': CNPJValidator.format(clean_cnpj)}

def validate_cnpj(cnpj: str) -> Dict[str, Any]:
    return CNPJValidator.validate(cnpj)

def validate_ticker(ticker: str) -> Dict[str, Any]:
    # Always reports the ticker as valid; only strips the Yahoo-style '.SA' suffix
    return {'valid': True, 'ticker_simple': ticker.replace('.SA', '')}
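
`CNPJValidator.validate` checks only that the value has 14 digits. If the check digits should also be verified, the widely used modulo-11 rule for numeric CNPJs could be added along these lines. This is a sketch under that assumption, not code from the project; `has_valid_check_digits` and `_check_digit` are hypothetical names.

```python
import re
from typing import List

# Weights applied to the first 12 digits (first check digit) and to the
# first 13 digits (second check digit).
_WEIGHTS_1 = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
_WEIGHTS_2 = [6] + _WEIGHTS_1


def _check_digit(digits: List[int], weights: List[int]) -> int:
    """Modulo-11 check digit: remainders 0 and 1 map to 0, others to 11 - r."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def has_valid_check_digits(cnpj: str) -> bool:
    """True when a 14-digit numeric CNPJ carries consistent check digits."""
    digits = [int(c) for c in re.sub(r"\D", "", str(cnpj))]
    # Reject wrong lengths and trivial sequences such as 00000000000000.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    first = _check_digit(digits[:12], _WEIGHTS_1)
    second = _check_digit(digits[:12] + [first], _WEIGHTS_2)
    return digits[12] == first and digits[13] == second


print(has_valid_check_digits("33.592.510/0001-54"))
print(has_valid_check_digits("33.592.510/0001-55"))
```

A check like this catches typos and truncated values, but it cannot tell whether a well-formed CNPJ belongs to the right company.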

In [39]:
class AutomatedMapExtractor:
    def __init__(self):
        self.map_file = config.AUTOMATED_MAP_FILE

    def extract(self) -> pd.DataFrame:
        logger.info(f"Loading automated map from: {self.map_file}")

        if not self.map_file.exists():
            logger.error(f"❌ File not found: {self.map_file}")
            return pd.DataFrame()

        try:
            # Read every column as text so CNPJs keep their leading zeros
            df = pd.read_csv(self.map_file, sep=';', dtype=str)

            df['ticker_simple'] = df['ticker'].str.replace('.SA', '', regex=False).str.strip().str.upper()

            # Normalize each CNPJ; missing or invalid values become None
            valid_cnpjs = []
            for cnpj in df['CNPJ']:
                if pd.isna(cnpj):
                    valid_cnpjs.append(None)
                    continue
                res = validate_cnpj(str(cnpj))
                valid_cnpjs.append(res['formatted'] if res['valid'] else None)

            df['CNPJ_CIA'] = valid_cnpjs
            df['DENOM_CIA'] = df['nome_oficial_cvm']

            # Keep only rows with a usable CNPJ
            df_valid = df.dropna(subset=['CNPJ_CIA']).copy()
            df_valid['fonte'] = 'AutomatedMap'

            logger.info(f"✅ {len(df_valid)} pairs loaded.")
            return df_valid
        except Exception as e:
            logger.error(f"Error reading map: {e}")
            return pd.DataFrame()

class FundamentalsWideExtractor:
    """Thin wrapper: returns the full automated map."""
    def __init__(self):
        self.automap = AutomatedMapExtractor()

    def extract(self):
        return self.automap.extract()

class CVMExtractor:
    """Unique (CNPJ, official name) pairs taken from the automated map."""
    def __init__(self):
        self.automap = AutomatedMapExtractor()

    def extract(self):
        df = self.automap.extract()
        return df[['CNPJ_CIA', 'DENOM_CIA']].drop_duplicates() if not df.empty else df

class B3Extractor:
    """Ticker list taken from the automated map, with the '.SA' suffix restored."""
    def __init__(self):
        self.automap = AutomatedMapExtractor()

    def extract(self):
        df = self.automap.extract()
        if df.empty:
            return df
        df['ticker'] = df['ticker_simple'] + '.SA'
        df['ticker_valido'] = True
        return df[['ticker', 'ticker_simple', 'ticker_valido']]

class ManualOverrideExtractor:
    def __init__(self):
        self.reference_file = config.REFERENCE_FILE

    def load(self):
        if not self.reference_file.exists():
            return {}
        try:
            with open(self.reference_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        except (OSError, ValueError) as e:
            # ValueError covers json.JSONDecodeError and decoding errors
            logger.warning(f"Could not read manual overrides: {e}")
            return {}

class HistoricalExtractor:
    """Placeholder: not used by the pipeline in this notebook."""
    def __init__(self):
        self.historical_dir = config.HISTORICAL_DIR

    def get_available_tickers(self):
        return []

    def check_recent_data(self, ticker, days=30):
        return True
In [40]:
class MatchingEngine:
    def __init__(self, df_cvm, df_b3, manual_overrides=None, df_fundamentals_wide=None):
        self.manual_overrides = manual_overrides or {}
        self.df_fundamentals_wide = df_fundamentals_wide
        
        # Initialize both lookups so `match` is safe even without fundamentals data
        self.fundamentals_map = {}
        self.name_map = {}
        if self.df_fundamentals_wide is not None and not self.df_fundamentals_wide.empty:
            self.fundamentals_map = dict(zip(
                self.df_fundamentals_wide['ticker_simple'], 
                self.df_fundamentals_wide['CNPJ_CIA']
            ))
            self.name_map = dict(zip(
                self.df_fundamentals_wide['ticker_simple'], 
                self.df_fundamentals_wide['DENOM_CIA']
            ))
            logger.info(f"🔥 Engine loaded with {len(self.fundamentals_map)} direct links.")

    def match(self, ticker_simple):
        if ticker_simple in self.manual_overrides:
            ov = self.manual_overrides[ticker_simple]
            return {'cnpj': ov['cnpj'], 'razao_social': ov['razao_social'], 'confidence': 100, 'method': 'manual'}

        if ticker_simple in self.fundamentals_map:
            return {
                'cnpj': self.fundamentals_map[ticker_simple],
                'razao_social': self.name_map.get(ticker_simple, ''),
                'confidence': 99, 
                'method': 'automated_map'
            }
            
        return {'cnpj': None, 'razao_social': None, 'confidence': 0, 'method': 'no_match'}

    def match_batch(self, tickers: List[str]) -> pd.DataFrame:
        results = []
        for ticker in tickers:
            res = self.match(ticker)
            res['ticker_simple'] = ticker
            results.append(res)
        return pd.DataFrame(results)

def create_matching_engine(df_cvm, df_b3, manual_overrides, df_fundamentals_wide):
    return MatchingEngine(df_cvm, df_b3, manual_overrides, df_fundamentals_wide)
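
`MatchingEngine.match` reads each manual override as a dict with `cnpj` and `razao_social` keys, keyed by the simple ticker, and gives it priority over the automated map. Assuming `manual_reference.json` follows exactly that shape (the file itself is not shown here), an entry could be written, read back, and resolved as in the sketch below. The helper names are hypothetical, and the `VALE3` pairing reuses the examples from the objective section.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional

Override = Dict[str, str]  # {"cnpj": ..., "razao_social": ...}


def save_overrides(path: Path, overrides: Dict[str, Override]) -> None:
    """Write overrides as UTF-8 JSON, keeping accented characters readable."""
    path.write_text(json.dumps(overrides, ensure_ascii=False, indent=2), encoding="utf-8")


def load_overrides(path: Path) -> Dict[str, Override]:
    """Return the overrides, or an empty dict when the file is absent."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def resolve(ticker: str, overrides: Dict[str, Override], automated: Dict[str, str]) -> Optional[str]:
    """Same priority order as `match`: manual override first, then automated map."""
    if ticker in overrides:
        return overrides[ticker]["cnpj"]
    return automated.get(ticker)


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "manual_reference.json"
    save_overrides(ref, {"VALE3": {"cnpj": "33.592.510/0001-54", "razao_social": "VALE S.A."}})
    loaded = load_overrides(ref)

# A deliberately wrong automated entry shows that the manual value wins.
automated_map = {"VALE3": "00.000.000/0000-00", "ABEV3": "07.526.557/0001-00"}
print(resolve("VALE3", loaded, automated_map))
print(resolve("ABEV3", loaded, automated_map))
```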

In [41]:
class TickerCNPJMapper:
    def __init__(self, version: str = "1.0.0"):
        self.version = version
        self.timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        
        self.b3_extractor = B3Extractor()
        self.cvm_extractor = CVMExtractor()
        self.manual_extractor = ManualOverrideExtractor()
        self.fundamentals_wide_extractor = FundamentalsWideExtractor()

        self.df_master = None

    def run_full_pipeline(self, save_versions=True):
        logger.info("🚀 STARTING PIPELINE IN JUPYTER")
        
        logger.info("[1/3] Extracting data...")
        df_fundamentals = self.fundamentals_wide_extractor.extract()
        df_b3 = self.b3_extractor.extract()
        df_cvm = self.cvm_extractor.extract()
        manual_overrides = self.manual_extractor.load()
        
        if df_b3.empty:
            logger.error("❌ B3 tickers are empty. Check the automated map file.")
            return False

        logger.info("[2/3] Running matching...")
        engine = create_matching_engine(df_cvm, df_b3, manual_overrides, df_fundamentals)
        
        tickers = df_b3['ticker_simple'].tolist()
        df_matches = engine.match_batch(tickers)
        
        # Attach the match result to each ticker row
        self.df_master = df_b3.merge(
            df_matches[['ticker_simple', 'cnpj', 'razao_social', 'confidence', 'method']],
            on='ticker_simple', how='left'
        )
        self.df_master.rename(columns={'cnpj': 'CNPJ_CIA', 'razao_social': 'DENOM_CIA'}, inplace=True)
        
        logger.info("[3/3] Saving results...")
        
        self.df_master.to_parquet(config.MASTER_FILE, index=False)
        self.df_master.to_csv(config.MASTER_FILE.with_suffix('.csv'), index=False, sep=';')
        
        # Write the versioned copy only when requested
        if save_versions:
            v_file = config.get_version_path(self.version)
            self.df_master.to_parquet(v_file, index=False)
        
        logger.info(f"✅ Success! File saved to: {config.MASTER_FILE}")
        return True
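
The flowchart lists `quality_report.txt` as an output and `MappingConfig` defines `QUALITY_REPORT`, but the pipeline above does not write that file. One possible shape for such a report is sketched below. `build_quality_report` is a hypothetical helper operating on plain dicts, and the `XXXX3` row is an invented placeholder that illustrates an unmatched ticker.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def build_quality_report(rows: Iterable[Dict[str, Optional[object]]], min_confidence: int = 70) -> str:
    """Summarize match methods, unmatched tickers, and low-confidence matches."""
    rows = list(rows)
    by_method = Counter(str(r["method"]) for r in rows)
    unmatched = [str(r["ticker_simple"]) for r in rows if r.get("CNPJ_CIA") is None]
    low_conf = [
        str(r["ticker_simple"])
        for r in rows
        if r.get("CNPJ_CIA") is not None and int(r["confidence"]) < min_confidence
    ]
    lines = [f"Total tickers: {len(rows)}"]
    for method, count in sorted(by_method.items()):
        lines.append(f"  {method}: {count}")
    lines.append(f"Unmatched: {', '.join(unmatched) or 'none'}")
    lines.append(f"Below confidence {min_confidence}: {', '.join(low_conf) or 'none'}")
    return "\n".join(lines)


sample_rows = [
    {"ticker_simple": "ALOS3", "CNPJ_CIA": "05.878.397/0001-32", "confidence": 99, "method": "automated_map"},
    {"ticker_simple": "XXXX3", "CNPJ_CIA": None, "confidence": 0, "method": "no_match"},  # placeholder row
]
report = build_quality_report(sample_rows)
print(report)
```

With the real DataFrame, the rows could come from `mapper.df_master.to_dict("records")`, after normalizing any missing values to `None`, and the text could be written to `config.QUALITY_REPORT`.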

In [42]:
if __name__ == "__main__":
    mapper = TickerCNPJMapper(version="1.0.0_nb")
    mapper.run_full_pipeline()
    
    if mapper.df_master is not None:
        display(mapper.df_master.head(10))

2025-12-11 19:29:56,411 - INFO - 🚀 STARTING PIPELINE IN JUPYTER
2025-12-11 19:29:56,412 - INFO - [1/3] Extracting data...
2025-12-11 19:29:56,413 - INFO - Loading automated map from: c:\Users\kaike\projeto_aurum\aurum\data\dados_mapeamento\mapa_ticker_cnpj_automatizado.csv
2025-12-11 19:29:56,426 - INFO - ✅ 96 pairs loaded.
2025-12-11 19:29:56,428 - INFO - Loading automated map from: c:\Users\kaike\projeto_aurum\aurum\data\dados_mapeamento\mapa_ticker_cnpj_automatizado.csv
2025-12-11 19:29:56,442 - INFO - ✅ 96 pairs loaded.
2025-12-11 19:29:56,450 - INFO - Loading automated map from: c:\Users\kaike\projeto_aurum\aurum\data\dados_mapeamento\mapa_ticker_cnpj_automatizado.csv
2025-12-11 19:29:56,487 - INFO - ✅ 96 pairs loaded.
2025-12-11 19:29:56,493 - INFO - [2/3] Running matching...
2025-12-11 19:29:56,498 - INFO - 🔥 Engine loaded with 96 direct links.
2025-12-11 19:29:56,513 - INFO - [3/3] Saving results...
2025-12-11 19:29:56,547 - INF

| | ticker | ticker_simple | ticker_valido | CNPJ_CIA | DENOM_CIA | confidence | method |
|---|---|---|---|---|---|---|---|
| 0 | ALOS3.SA | ALOS3 | True | 05.878.397/0001-32 | ALLOS S.A. | 99 | automated_map |
| 1 | ABEV3.SA | ABEV3 | True | 07.526.557/0001-00 | AMBEV S.A. | 99 | automated_map |
| 2 | ANIM3.SA | ANIM3 | True | 60.651.809/0001-05 | SUZANO HOLDING S.A. | 99 | automated_map |
| 3 | ASAI3.SA | ASAI3 | True | 06.057.223/0001-71 | SENDAS DISTRIBUIDORA S.A. | 99 | automated_map |
| 4 | AURE3.SA | AURE3 | True | 28.594.234/0001-23 | AUREN ENERGIA S.A | 99 | automated_map |
| 5 | AXIA3.SA | AXIA3 | True | 07.659.538/0001-51 | DINAMICA ENERGIA S/A | 99 | automated_map |
| 6 | AXIA6.SA | AXIA6 | True | 07.659.538/0001-51 | DINAMICA ENERGIA S/A | 99 | automated_map |
| 7 | AZZA3.SA | AZZA3 | True | 16.590.234/0001-76 | AZZAS 2154 S.A. | 99 | automated_map |
| 8 | B3SA3.SA | B3SA3 | True | 09.346.601/0001-25 | B3 S.A. - BRASIL, BOLSA, BALCÃO | 99 | automated_map |
| 9 | BBSE3.SA | BBSE3 | True | 17.344.597/0001-94 | BB SEGURIDADE PARTICIPAÇÕES S.A. | 99 | automated_map |
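
The run log above shows the automated map CSV being loaded three times in a single run, once per extractor that wraps `AutomatedMapExtractor`. One possible way to read it only once is to memoize the loader. The sketch below uses only the standard library (`csv` instead of pandas) so that it stays self-contained; `load_map_rows` is a hypothetical name, and the `;` delimiter and the `ticker` / `CNPJ` column names follow the notebook code.

```python
import csv
import tempfile
from functools import lru_cache
from pathlib import Path
from typing import Dict, Tuple


@lru_cache(maxsize=None)
def load_map_rows(path: str) -> Tuple[Dict[str, str], ...]:
    """Read the ';'-separated map once per path; later calls hit the cache."""
    with open(path, newline="", encoding="utf-8") as handle:
        return tuple(csv.DictReader(handle, delimiter=";"))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "map.csv"
    sample.write_text("ticker;CNPJ\nVALE3.SA;33.592.510/0001-54\n", encoding="utf-8")
    first = load_map_rows(str(sample))
    second = load_map_rows(str(sample))  # served from the cache, no file read

print(first is second, load_map_rows.cache_info().hits)
```

Cached results are shared between callers, so with DataFrames each extractor would need to work on a `.copy()` before adding or renaming columns.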
