# Considerations

## Defining the Training and Test Periods
To make the evaluation more robust and reduce the risk of overfitting, we decided to use data from 2020 through 2025, split as follows:

Training data: 2020, 2021, and 2022

Test data: 2023, 2024, and 2025

Initially, we trained the model on 2024 data and tested it on 2025, which gave an accuracy of 90%. However, we suspected that this result could reflect overfitting, since the model was trained on the year immediately before the test year.

To check this hypothesis, we widened the training window to 2020 through 2022 and the test window to 2023 through 2025. Even so, the model's performance stayed close to 90%, which suggests that the initial result was not an artifact of the narrow split and that the model generalizes to years it did not see during training.
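
The text does not say how the 90% figure was computed for a regression target. One plausible reading, shown here purely as an assumption, is the share of rows where the prediction rounded to the nearest integer equals the observed count. The helper name `exact_match_rate` and the toy values are hypothetical and do not come from the project data.

```python
import numpy as np

def exact_match_rate(y_true, y_pred):
    """Share of rows whose prediction, rounded to the nearest integer, equals the observed count."""
    y_true = np.asarray(y_true)
    y_pred = np.rint(np.asarray(y_pred, dtype=float)).astype(int)
    return float(np.mean(y_true == y_pred))

# Toy values for illustration only (not taken from the project data).
real = [1, 1, 2, 1, 3]
previsto = [1.01, 1.09, 1.40, 1.00, 2.80]

rate = exact_match_rate(real, previsto)
print(f"{rate:.0%}")  # 4 of the 5 rows match after rounding
```

In the notebook, the same helper could be applied to `predictions['qtd_acidentes']` and `predictions['prediction_label']` once the out-of-time predictions exist.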


## Dataset Structure
The final dataset used to train the model consists of:

Target variable: qtd_acidentes, the number of accidents per highway (BR), kilometer marker (KM), and month.

Infractions: total number of infractions on the same BR/KM in the same month.

Holiday: indicates whether the day of the accident was a holiday.

Time variables: month and year.

The goal was to enrich the dataset with contextual variables that help the model capture seasonal, behavioral, and operational patterns associated with accident risk.
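
As a rough sketch of how a table with this shape could be assembled, the example below counts accidents per BR, KM, year, and month and then joins the monthly infraction totals. The columns `br`, `ano`, and `qtd_acidentes` appear in the notebook; `km`, `mes`, `data`, and `qtd_infracoes`, along with all the values, are hypothetical names chosen for illustration. The actual gold-layer pipeline may differ.

```python
import pandas as pd

# Hypothetical raw records: one row per accident.
acidentes = pd.DataFrame({
    "br": [101, 101, 101, 116],
    "km": [10, 10, 10, 55],
    "data": pd.to_datetime(["2022-01-03", "2022-01-20", "2022-02-11", "2022-01-07"]),
})

# Hypothetical infraction totals, already aggregated per segment and month.
infracoes = pd.DataFrame({
    "br": [101, 101, 116],
    "km": [10, 10, 55],
    "ano": [2022, 2022, 2022],
    "mes": [1, 2, 1],
    "qtd_infracoes": [40, 25, 12],
})

# Derive the time variables from the accident date.
acidentes["ano"] = acidentes["data"].dt.year
acidentes["mes"] = acidentes["data"].dt.month

# Count accidents per segment and month, then attach the infraction totals.
chaves = ["br", "km", "ano", "mes"]
mensal = (
    acidentes.groupby(chaves)
    .size()
    .reset_index(name="qtd_acidentes")
    .merge(infracoes, on=chaves, how="left")
)
print(mensal)
```

A left join keeps every segment and month that had an accident even when no infraction total is available for it.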


## PyCaret
We chose this library because it makes training straightforward, compares several models to identify the best one, and makes the results easy to interpret.

## Other Observations
We round the prediction down to a whole number to make it easier to read, since a fractional number of accidents does not make sense.
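
The notebook applies `np.floor` in a later cell, so a prediction such as 1.99 becomes 1. The short comparison below, with made-up values, shows how rounding down differs from rounding to the nearest integer. Which of the two is preferable depends on how the forecasts are used, and this sketch does not settle that.

```python
import numpy as np

# Made-up predictions for illustration only.
previsto = np.array([1.01, 1.49, 1.50, 1.99, 2.50])

arredondado_baixo = np.floor(previsto).astype(int)    # always rounds down
arredondado_proximo = np.rint(previsto).astype(int)   # nearest integer, ties go to the even value

print(arredondado_baixo)    # [1 1 1 1 2]
print(arredondado_proximo)  # [1 1 2 2 2]
```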

In [None]:
# Install the libraries again
!pip install pycaret



In [None]:
import pandas as pd

# Load the dataset produced by our gold layer into a DataFrame
df = pd.read_csv("/content/ml_acidentes_mensal_full.csv", sep=',', low_memory=False)

In [None]:
# Set up PyCaret
from pycaret.regression import setup, compare_models, predict_model

setup(data=df, target='qtd_acidentes', session_id=42)

# Train and compare the candidate models
best_model = compare_models()

# Predict on the holdout (test) split created by setup
predictions = predict_model(best_model)

# Build a DataFrame with actual vs. predicted values
resultado = predictions[['qtd_acidentes', 'prediction_label']]
resultado.columns = ['Real', 'Previsto']

# Show the first results
print(resultado.head(20))

|   | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | qtd_acidentes |
| 2 | Target type | Regression |
| 3 | Original data shape | (317323, 15) |
| 4 | Transformed data shape | (317323, 15) |
| 5 | Transformed train set shape | (222126, 15) |
| 6 | Transformed test set shape | (95197, 15) |
| 7 | Numeric features | 12 |
| 8 | Categorical features | 2 |
| 9 | Rows with missing values | 0.3% |


|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| et | Extra Trees Regressor | 0.0464 | 0.0292 | 0.1709 | 0.857 | 0.0608 | 0.0325 | 53.03 |
| lightgbm | Light Gradient Boosting Machine | 0.0519 | 0.0301 | 0.1735 | 0.8526 | 0.0594 | 0.0341 | 10.718 |
| xgboost | Extreme Gradient Boosting | 0.053 | 0.0307 | 0.1752 | 0.8498 | 0.06 | 0.0347 | 2.539 |
| rf | Random Forest Regressor | 0.0566 | 0.0354 | 0.1881 | 0.8268 | 0.066 | 0.039 | 84.344 |
| gbr | Gradient Boosting Regressor | 0.0756 | 0.0437 | 0.209 | 0.7862 | 0.0696 | 0.0501 | 28.215 |
| dt | Decision Tree Regressor | 0.0543 | 0.073 | 0.2701 | 0.6428 | 0.0938 | 0.0369 | 1.818 |
| lar | Least Angle Regression | 0.2047 | 0.1718 | 0.4144 | 0.1599 | 0.1427 | 0.1489 | 0.696 |
| br | Bayesian Ridge | 0.2046 | 0.1718 | 0.4144 | 0.1599 | 0.1427 | 0.1489 | 0.902 |
| ridge | Ridge Regression | 0.2047 | 0.1718 | 0.4144 | 0.1599 | 0.1427 | 0.1489 | 0.807 |
| lr | Linear Regression | 0.2047 | 0.1718 | 0.4144 | 0.1599 | 0.1427 | 0.1489 | 1.676 |



|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Extra Trees Regressor | 0.0466 | 0.0301 | 0.1734 | 0.8562 | 0.0609 | 0.0325 |


        Real  Previsto
247852     1      1.01
204155     1      1.01
99272      1      1.01
173096     1      1.00
300199     1      1.00
278576     1      1.09
292585     1      1.00
154839     1      1.00
90908      1      1.00
75297      1      1.00
114566     1      1.00
286555     1      1.00
58041      1      1.00
88514      1      1.00
151365     1      1.04
49644      1      1.00
184300     1      1.00
105310     1      1.00
33140      1      1.00
73552      1      1.02


In [None]:
import pandas as pd
from pycaret.regression import setup, compare_models, predict_model

# Split the data by year: 2020-2022 for training, 2023-2025 for out-of-time testing
df_historica = df[df['ano'].isin([2020, 2021, 2022])].copy()
df_atual = df[df['ano'].isin([2023, 2024, 2025])].copy()

# Set up the environment with the 2020-2022 data
setup(data=df_historica, target='qtd_acidentes', session_id=123)

# Train the candidates and keep the best model
best_model = compare_models()

# Predict on the 2023-2025 data
predictions = predict_model(best_model, data=df_atual)

# Show the results
predictions[['qtd_acidentes', 'prediction_label']].head()

|   | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | qtd_acidentes |
| 2 | Target type | Regression |
| 3 | Original data shape | (173578, 15) |
| 4 | Transformed data shape | (173578, 15) |
| 5 | Transformed train set shape | (121504, 15) |
| 6 | Transformed test set shape | (52074, 15) |
| 7 | Numeric features | 12 |
| 8 | Categorical features | 2 |
| 9 | Rows with missing values | 0.3% |


|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| et | Extra Trees Regressor | 0.0464 | 0.0283 | 0.1682 | 0.8401 | 0.0609 | 0.0328 | 22.086 |
| lightgbm | Light Gradient Boosting Machine | 0.0503 | 0.0288 | 0.1696 | 0.8374 | 0.0588 | 0.0332 | 5.585 |
| xgboost | Extreme Gradient Boosting | 0.0529 | 0.0292 | 0.1706 | 0.8355 | 0.06 | 0.0355 | 1.398 |
| rf | Random Forest Regressor | 0.0584 | 0.0357 | 0.1888 | 0.7988 | 0.0675 | 0.0406 | 37.163 |
| gbr | Gradient Boosting Regressor | 0.069 | 0.0374 | 0.1933 | 0.789 | 0.0665 | 0.0466 | 13.495 |
| dt | Decision Tree Regressor | 0.0562 | 0.0735 | 0.2711 | 0.5847 | 0.0962 | 0.0386 | 0.84 |
| lr | Linear Regression | 0.1888 | 0.1536 | 0.3919 | 0.1332 | 0.1371 | 0.1375 | 0.488 |
| ridge | Ridge Regression | 0.1888 | 0.1536 | 0.3919 | 0.1332 | 0.1371 | 0.1375 | 0.387 |
| br | Bayesian Ridge | 0.1888 | 0.1536 | 0.3919 | 0.1332 | 0.1371 | 0.1375 | 0.443 |
| lar | Least Angle Regression | 0.189 | 0.1537 | 0.3919 | 0.133 | 0.1371 | 0.1377 | 0.402 |



|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Extra Trees Regressor | 0.0516 | 0.0365 | 0.191 | 0.8489 | 0.0645 | 0.0348 |


|   | qtd_acidentes | prediction_label |
|---|---|---|
| 173578 | 2 | 1.01 |
| 173579 | 1 | 1.02 |
| 173580 | 1 | 1.03 |
| 173581 | 1 | 1.04 |
| 173582 | 1 | 1.02 |
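
PyCaret reports MAE, RMSE, and R2 for these out-of-time predictions. As a cross-check, the same quantities can be recomputed from their standard definitions. The sketch below assumes nothing beyond NumPy. The helper name `regression_metrics` and the toy values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    erro = y_true - y_pred
    mae = float(np.mean(np.abs(erro)))
    rmse = float(np.sqrt(np.mean(erro ** 2)))
    ss_res = float(np.sum(erro ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Toy values for illustration only (not taken from the project data).
metricas = regression_metrics([2, 1, 1, 1, 3], [1.5, 1.0, 1.0, 1.5, 3.0])
print(metricas)
```

In the notebook, the inputs would be `predictions['qtd_acidentes']` and `predictions['prediction_label']`. Note that R2 is undefined when every observed value is identical, because the total sum of squares is then zero.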


In [None]:
import numpy as np
import pandas as pd

# Make sure 'prediction_label' is numeric
predictions['prediction_label'] = pd.to_numeric(predictions['prediction_label'], errors='coerce')

# Treat the highway identifier 'br' as a string
predictions['br'] = predictions['br'].astype(str)

# Truncate the mean age of those involved to a whole number
predictions['media_idade_envolvidos'] = np.floor(predictions['media_idade_envolvidos']).astype(int)

# Add a column with the prediction rounded down to an integer
predictions['pred_arredondado_baixo'] = np.floor(predictions['prediction_label']).astype(int)

# Inspect the new column next to the raw prediction
predictions[['prediction_label', 'pred_arredondado_baixo']].head()


|   | prediction_label | pred_arredondado_baixo |
|---|---|---|
| 173578 | 1.01 | 1 |
| 173579 | 1.02 | 1 |
| 173580 | 1.03 | 1 |
| 173581 | 1.04 | 1 |
| 173582 | 1.02 | 1 |


In [None]:
from google.colab import files

# Output file name
filename = 'previsoes_modelo_full.csv'

# Write the predictions with a header row and without the index column
predictions.to_csv(filename, index=False, encoding='utf-8')

# Trigger the download in Colab
files.download(filename)
