<a href="https://colab.research.google.com/github/TiagoIesbick/dashboard-etl/blob/main/budget_forecast.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center">
  <img src="https://prefeitura.poa.br/sites/default/files/usu_img/sites/previmpa/Capa-Previmpa-20anos.jpg" width="250" />
</p>
<p align="center">
    BUDGET FORECAST
</p>
<br>
<p align="center">
    <em>Built with the software and tools below.</em>
</p>
<p align="center">
    <img src="https://img.shields.io/badge/Python-3776AB.svg?style=flat&logo=Python&logoColor=white" alt="Python">
    <img src="https://img.shields.io/badge/Colab-F9AB00.svg?style=flat&logo=googlecolab&logoColor=525252" alt="Colab">
    <img src="https://img.shields.io/badge/Pandas-2C2D72.svg?style=flat&logo=pandas&logoColor=white" alt="Pandas">
</p>

---

## 🚀 Getting Started

### ✅ Requirement

* **UPO institutional Google account, signed in in the browser**

### 📊 Input data

1. With the UPO Google account signed in in your browser, open the link below:<br>
<a href="https://drive.google.com/drive/folders/1I4yh03z7R8FN28WSsUfagERF4V_m7Wzf?usp=sharing">https://drive.google.com/drive/folders/1I4yh03z7R8FN28WSsUfagERF4V_m7Wzf?usp=sharing</a>
2. Check the **last modified date** of the files `df_exp.csv` and `df_rev.csv`;
3. If the files are out of date, run the notebook `dashboard_data.ipynb`, available both in the `Compartilhados comigo` (Shared with me) folder of the UPO Drive and at the link below:<br>
<a href="https://colab.research.google.com/drive/1zvq8RRNMpHCoQJTARksJfUKrpjdkkPDa?usp=sharing">https://colab.research.google.com/drive/1zvq8RRNMpHCoQJTARksJfUKrpjdkkPDa?usp=sharing</a>

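The modification dates can also be checked from a Colab cell once Drive is mounted. The sketch below is only an illustration: `last_modified` is a hypothetical helper (not part of this notebook), and `DATA_DIR` assumes the same mounted path the notebook reads its CSV files from.

```python
from datetime import datetime
from pathlib import Path

# Assumed location: the Drive folder as mounted by this notebook in Colab.
DATA_DIR = Path("/content/drive/MyDrive/Dashboard_data/final_data")


def last_modified(paths):
    """Map each file name to its last-modified datetime, or None if the file is missing."""
    result = {}
    for path in map(Path, paths):
        result[path.name] = (
            datetime.fromtimestamp(path.stat().st_mtime) if path.exists() else None
        )
    return result


# Prints one line per input file; a missing file usually means Drive is not mounted.
for name, stamp in last_modified([DATA_DIR / "df_exp.csv", DATA_DIR / "df_rev.csv"]).items():
    print(f"{name}: {stamp or 'not found (is Drive mounted?)'}")
```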
---

## ▶️ Running budget_forecast.ipynb

1. Run this notebook via `Runtime > Run all` (`Ambiente de execução > Executar tudo` in the Portuguese interface) or press `Ctrl+F9`;
2. When prompted, grant Colab the requested **permissions**;
3. The run is complete when every cell shows the ✔️ icon <small>(estimated time: 30 to 40 minutes)</small>;
4. To check the generated data, download the Excel workbook `rev_exp_forecast.xlsx` from the Colab virtual folder <small>(folder icon 📁 on the left side of this screen)</small>.

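As an alternative to downloading the workbook, it can be inspected in a new cell. This is a minimal sketch under assumptions: the path `/content/rev_exp_forecast.xlsx` assumes the file is written to Colab's working directory, and `summarize_sheets` is a hypothetical helper that only reports the size of each sheet. pandas needs an Excel engine such as openpyxl to read `.xlsx` files.

```python
import os

import pandas as pd

# Assumed location: Colab's default working directory.
XLSX_PATH = "/content/rev_exp_forecast.xlsx"


def summarize_sheets(sheets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return one row per sheet with its row and column counts."""
    return pd.DataFrame(
        [
            {"sheet": name, "rows": len(df), "columns": df.shape[1]}
            for name, df in sheets.items()
        ]
    )


if os.path.exists(XLSX_PATH):
    # sheet_name=None loads every sheet into a dict keyed by sheet name.
    print(summarize_sheets(pd.read_excel(XLSX_PATH, sheet_name=None)))
else:
    print(f"{XLSX_PATH} not found; run the notebook first.")
```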
---
<p align="center">Developed with 💜 by Tiago Iesbick</p>

In [None]:
import logging
from prophet import Prophet
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
from google.colab import drive


# mount drive
drive.mount('/content/drive')


# cleaning budget unit 7001
def clean_7001(df: pd.DataFrame) -> pd.DataFrame:
  df.loc[df['Proj/Ativ'] == 2870, 'Proj/Ativ'] = 4396
  df.loc[df['Proj/Ativ'].isin([2872, 1507]), ['Proj/Ativ', 'Elemento']] = 4471, 339040
  df.loc[df['Proj/Ativ'].isin([2873, 2532]), 'Proj/Ativ'] = 4413
  df.loc[df['Proj/Ativ'].isin([1505, 1503, 1373, 1506]), 'Proj/Ativ'] = 2529
  df.loc[(df['Proj/Ativ'] == 2681) & (df['Elemento'] == 319192), 'Elemento'] = 319113
  df.loc[(df['Proj/Ativ'] == 2529) & (df['Elemento'] == 449092), 'Elemento'] = 449051
  df.loc[(df['Proj/Ativ'] == 9071) & (df['Elemento'] == 319091), 'Elemento'] = 339091
  df.loc[(df['Proj/Ativ'] == 9071) & (df['Elemento'].isin([339092, 339147])), 'Elemento'] = 339047
  df.loc[~((df['Proj/Ativ'] == 9071) & (df['Vinc. Orçam.'] == 1)), 'Vinc. Orçam.'] = 6069
  df.loc[(df['Proj/Ativ'] == 2529) & (df['Elemento'].isin([319011, 319016, 319092, 319094, 339036, 339046, 339049])), 'Proj/Ativ'] = 4396
  df.loc[(df['Proj/Ativ'] == 2529) & (df['Elemento'] == 319013), 'Proj/Ativ'] = 2680
  return df


# changing the elements 339001, 339003, 339091, 339092, 332001
def change_elements(df: pd.DataFrame) -> pd.DataFrame:
  df.loc[df['Elemento'] == 339001, 'Elemento'] = 319001
  df.loc[df['Elemento'] == 339003, 'Elemento'] = 319003
  df.loc[df['Elemento'] == 339091, 'Elemento'] = 319091
  df.loc[(df['Elemento'] == 339092) & (~df['Proj/Ativ'].isin([9075, 9077])), 'Elemento'] = 319092
  df.loc[df['Elemento'] == 332001, 'Elemento'] = 319086
  return df


# filling empty cells after the first filled cell in a column with 0
def fill_zero(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns.difference(['T', 'Comp.pagto.']):
        first_valid = df[col].first_valid_index()
        if first_valid is not None:
            df.loc[first_valid:, col] = df.loc[first_valid:, col].fillna(0)
    return df


# creating moving average dataframes
def moving_averages(df: pd.DataFrame, window: int) -> pd.DataFrame:
    df_ma = df.copy()
    value_cols = df_ma.columns.difference(['T', 'Comp.pagto.'])
    df_ma.loc[:, value_cols] = df_ma.loc[:, value_cols].rolling(window).mean()
    df_ma.dropna(axis=1, how='all', inplace=True)
    return df_ma


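# building a Prophet model: the built-in seasonalities are disabled and replaced by
# custom yearly (period 365.25, Fourier order 10) and monthly (period 30.5, order 5)
# components; Prophet and cmdstanpy logging is limited to warnings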
def build_prophet_model() -> Prophet:
  model = Prophet(
      yearly_seasonality=False,
      weekly_seasonality=False,
      daily_seasonality=False
  )
  model.add_seasonality(name='yearly', period=365.25, fourier_order=10)
  model.add_seasonality(name='monthly', period=30.5, fourier_order=5)

  logging.getLogger('prophet').setLevel(logging.WARNING)
  logging.getLogger('cmdstanpy').setLevel(logging.WARNING)
  return model


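# min-max normalizes R², RMSE and MAE and combines them into a weighted score
# (0.5 / 0.25 / 0.25); the error metrics are inverted so that higher is always better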
def normalizes_calculates_score(df: pd.DataFrame) -> pd.DataFrame:
  scaler = MinMaxScaler()
  df['r2_norm'] = scaler.fit_transform(df[['R²']])
  df['rmse_norm'] = 1 - scaler.fit_transform(df[['RMSE']])
  df['mae_norm'] = 1 - scaler.fit_transform(df[['MAE']])

  df['score'] = (
      0.5 * df['r2_norm'] +
      0.25 * df['rmse_norm'] +
      0.25 * df['mae_norm']
  )
  return df


# returns True when a forecast should be rejected: any yearly value is negative, or any
# year-over-year change exceeds 30% (up or down); the ratio is skipped when the previous value is 0
def check_negative_and_jump(forecast_years: dict[int, float]) -> bool:
  values = list(forecast_years.values())
  has_negative = any(v < 0 for v in values)
  has_large_jump = any(
      values[i-1] != 0 and abs((values[i] / values[i-1]) - 1) > 0.3
      for i in range(1, len(values))
  )
  return has_negative or has_large_jump


# expense columns that will be readjusted due to mass segregation
mass_segregation_cols_exp = {
    '7002-2736-319003-6049', '7002-2738-319003-6049', '7002-2740-319003-6049', '7002-2742-319003-6049',
    '7002-2744-319003-6049', '7002-2747-319003-6049', '7002-2752-319003-6049', '7002-2754-319003-6049',
    '7002-2756-319003-6049', '7003-2760-319003-6050', '7003-2762-319003-6050', '7003-2764-319003-6050',
    '7003-2766-319003-6050', '7003-2768-319003-6050', '7003-2771-319003-6050', '7003-2776-319003-6050',
    '7003-2778-319003-6050', '7003-2780-319003-6050'
}


# revenue columns that will be readjusted due to mass segregation
mass_segregation_cols_rev = {
    'Contr.do Servidor Civil - Pensionistas - Plano em Capitalização - CMPA-6050', 'Contr.do Servidor Civil - Pensionistas - Plano em Capitalização - DMAE-6050',
    'Contr.do Serv. Civil-Pensionistas-Plano em Capitalização-Centralizada-6050', 'Contr.do Servidor Civil - Pensionistas - Plano em Capitalização - FASC-6050',
    'Contr.do Serv. Civil - Pensionistas - Plano em Capitalização - DEMHAB-6050', 'Contr.do Servidor Civil - Pensionistas - Plano em Capitalização - DMLU-6050',
    'Contr.do Serv.Civil -Pensionistas - Plano em Repartição - Centralizada-6049', 'Contr.do Servidor Civil - Pensionistas - Plano em Repartição - CMPA-6049',
    'Contr.do Servidor Civil - Pensionistas - Plano em Repartição - DMAE-6049', 'Contr.do Servidor Civil - Pensionistas - Plano em Repartição - DMLU-6049',
    'Contr.do Servidor Civil - Pensionistas - Plano em Repartição - DEMHAB-6049', 'Contr.do Servidor Civil - Pensionistas - Plano em Repartição - FASC-6049'
}


# revenue source map
source_map = {
    '6069': '1.802.069.001',
    '6049': '1.801.049.001',
    '6050': '1.800.050.001',
    '1': '1.500.001.001'
}


# preparing data for Excel
def prepare_forecast_sheet(df: pd.DataFrame, coming_years: list[int], is_revenue: bool = True) -> pd.DataFrame:
  if is_revenue:
    df_prep = df.loc[~df.index.str.startswith('7'), coming_years].copy()
    split_cols = df_prep.index.str.rsplit('-', n=1)
    df_prep['nome_rubrica'] = split_cols.str[0].str.strip()
    df_prep['fonte'] = split_cols.str[1].str.strip()
  else:
    df_prep = df.loc[df.index.str.startswith('7'), coming_years].copy()
    df_prep.reset_index(inplace=True)
    df_prep[['uni_orc', 'subacao', 'elemento', 'fonte']] = df_prep['index'].str.split('-', n=3, expand=True)

  df_prep['fonte'] = df_prep['fonte'].map(source_map)
  df_prep = df_prep.loc[~(df_prep[coming_years] == 0).all(axis=1)]

  return df_prep


# selecting the best model
def choose_models(df: pd.DataFrame, rename_level: bool = True) -> pd.DataFrame:
  df_normalized = df.groupby('Allocation').apply(normalizes_calculates_score, include_groups=False)
  df_normalized.reset_index(inplace=True)
  chosen_models = (
      df_normalized
      .sort_values('score', ascending=False)
      .groupby('Allocation')
      .first()
  )
  if rename_level:
    chosen_models.rename(columns={'level_1': 'Type'}, inplace=True)
  chosen_models = chosen_models.loc[:, ~chosen_models.columns.map(lambda x: isinstance(x, str) and x.startswith('level'))]
  return chosen_models


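# running models: for every allocation column, fits four linear trend variants
# (lin-lin, log-log, lin-log, log-lin) and Prophet, dropping candidates whose yearly
# forecasts are negative or change by more than 30% from one year to the next;
# `ma` is None for raw data, or '12' / '36' when `df` holds moving averages
def run_models(df: pd.DataFrame, df_pred: pd.DataFrame, years: list[int], X_prev: pd.DataFrame, start: pd.Period, months: list[pd.Timestamp], month_diff: int, ma: str | None = None) -> pd.DataFrame: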
  df_models = pd.DataFrame({
      'Allocation': pd.Series(dtype='object'),
      'Model': pd.Series(dtype='object'),
      'R²': pd.Series(dtype='float'),
      'RMSE': pd.Series(dtype='float'),
      'MAE': pd.Series(dtype='float'),
      'Forecast': pd.Series(dtype='object'),
      **{year: pd.Series(dtype='float') for year in years}
  })

  # running linear models
  def run_linear_models(X: np.ndarray | pd.DataFrame, y: np.ndarray | pd.Series, df_aux: pd.DataFrame, model: str, col: str) -> pd.DataFrame:
    lr_model = LinearRegression()
    lr_model.fit(X, y)
    forecast = lr_model.predict(X)
    r2 = r2_score(y, forecast)
    RMSE = np.sqrt(mean_squared_error(y, forecast))
    MAE = mean_absolute_error(y, forecast)

    if model == 'lin-lin':
      y_pred = lr_model.predict(X_prev.values)
    elif model == 'log-log':
      y_pred = np.exp(lr_model.predict(np.log(X_prev).values))
    elif model == 'lin-log':
      y_pred = np.exp(lr_model.predict(X_prev.values))
    elif model == 'log-lin':
      y_pred = lr_model.predict(np.log(X_prev).values)
    else:
      raise ValueError(f"Unknown model type: {model}")

    forecast_df = pd.DataFrame({'date': months, 'y_pred': y_pred})
    forecast_df['year'] = forecast_df['date'].dt.year
    forecast_df['month'] = forecast_df['date'].dt.month

    if ma:
      # a moving-average value in December covers the window ending that month,
      # so multiplying it by the window length recovers the window total
      december_forecasts = forecast_df[forecast_df['month'] == 12]
      if ma == '12':
        # a 12-month window ending in December is the calendar-year total
        forecast_years = {year: val * 12 for year, val in zip(december_forecasts['year'], december_forecasts['y_pred'])}
      elif ma == '36':
        # a 36-month window spans three calendar years; subtract the two previous years
        # (actual history when available, earlier forecasts otherwise)
        forecast_years = {}
        for i, (year, current_val) in enumerate(zip(december_forecasts['year'], december_forecasts['y_pred'])):
          current_total = current_val * 36
          if i == 0:
            past_2y_sum = df_pred.loc[df_pred['Comp.pagto.'].dt.year.isin([year-1, year-2]), col].sum()
            forecast_years[year] = current_total - past_2y_sum
          elif i == 1:
            past_2y_sum = forecast_years[year-1] + df_pred.loc[df_pred['Comp.pagto.'].dt.year == year-2, col].sum()
            forecast_years[year] = current_total - past_2y_sum
          else:
            past_2y_sum = forecast_years[year-1] + forecast_years[year-2]
            forecast_years[year] = current_total - past_2y_sum
    else:
      forecast_years = forecast_df.groupby('year')['y_pred'].sum().to_dict()
      if start.year in forecast_years:
        forecast_years[start.year] += df.loc[df['Comp.pagto.'].dt.year == start.year, col].sum()

    if check_negative_and_jump(forecast_years):
      return df_aux

    model_dict = {'Allocation': col, 'Model': model, 'R²': r2, 'RMSE': RMSE, 'MAE': MAE, 'Forecast': [forecast_df]}
    model_dict.update(forecast_years)
    df_aux = pd.concat([df_aux, pd.DataFrame(model_dict, index=[0])], ignore_index=True)
    return df_aux


  # running prophet model
  def run_prophet_model(df_aux: pd.DataFrame, col:str) -> pd.DataFrame:
    df_prophet = df[['Comp.pagto.', col]].copy()
    df_prophet.rename(columns={'Comp.pagto.': 'ds', col: 'y'}, inplace=True)
    df_prophet.dropna(inplace=True)

    if len(df_prophet) < 3:
      return df_aux

    df_prophet['ds'] = df_prophet['ds'].dt.to_timestamp()

    prophet_model = build_prophet_model()
    prophet_model.fit(df_prophet)
    future = prophet_model.make_future_dataframe(periods=month_diff, freq='MS')
    forecast = prophet_model.predict(future)

    forecast_train = forecast[forecast['ds'].isin(df_prophet['ds'])]
    r2 = r2_score(df_prophet['y'], forecast_train['yhat'])
    RMSE = np.sqrt(mean_squared_error(df_prophet['y'], forecast_train['yhat']))
    MAE = mean_absolute_error(df_prophet['y'], forecast_train['yhat'])

    forecast_future = forecast[forecast['ds'] > df_prophet['ds'].max()]

    if ma:
      december_forecasts = forecast_future[forecast_future['ds'].dt.month == 12]
      if ma == '12':
        forecast_years = {year: val * 12 for year, val in zip(december_forecasts['ds'].dt.year, december_forecasts['yhat'])}
      elif ma == '36':
        forecast_years = {}
        for i, (year, current_val) in enumerate(zip(december_forecasts['ds'].dt.year, december_forecasts['yhat'])):
          current_total = current_val * 36
          if i == 0:
            past_2y_sum = df_pred.loc[df_pred['Comp.pagto.'].dt.year.isin([year-1, year-2]), col].sum()
            forecast_years[year] = current_total - past_2y_sum
          elif i == 1:
            past_2y_sum = forecast_years[year-1] + df_pred.loc[df_pred['Comp.pagto.'].dt.year == year-2, col].sum()
            forecast_years[year] = current_total - past_2y_sum
          else:
            past_2y_sum = forecast_years[year-1] + forecast_years[year-2]
            forecast_years[year] = current_total - past_2y_sum
    else:
      forecast_years = forecast_future.groupby(forecast_future['ds'].dt.year)['yhat'].sum().to_dict()
      if start.year in forecast_years:
        partial_actual_sum = df_prophet[df_prophet['ds'].dt.year == start.year]['y'].sum()
        forecast_years[start.year] += partial_actual_sum

    if check_negative_and_jump(forecast_years):
      return df_aux

    model_dict = {'Allocation': col, 'Model': 'prophet', 'R²': r2, 'RMSE': RMSE, 'MAE': MAE, 'Forecast': [forecast]}
    model_dict.update(forecast_years)
    df_aux = pd.concat([df_aux, pd.DataFrame(model_dict, index=[0])], ignore_index=True)
    return df_aux


  for col in df.columns.difference(['T', 'Comp.pagto.']):
    first_valid = df[col].first_valid_index()

    if first_valid is not None:
      X = df['T'][first_valid:].values.reshape(-1,1)
      y = df[col][first_valid:]

      if len(y) < 3 or y.eq(0).sum() / len(y) > 0.8:
        continue

      df_aux = pd.DataFrame()

      # lin-lin
      df_aux = run_linear_models(X, y, df_aux, 'lin-lin', col)

      # log-log
      if (y > 0).all():
        df_aux = run_linear_models(np.log(X), np.log(y), df_aux, 'log-log', col)

      # lin-log
      if (y > 0).all():
        df_aux = run_linear_models(X, np.log(y), df_aux, 'lin-log', col)

      # log-lin
      df_aux = run_linear_models(np.log(X), y, df_aux, 'log-lin', col)

      # prophet
      df_aux = run_prophet_model(df_aux, col)

      if df_aux.empty:
        continue

      df_models = pd.concat([df_models, df_aux], ignore_index=True)

  return df_models


# running simple linear models
def run_simple_linear_model(X: np.ndarray | pd.DataFrame, y: np.ndarray | pd.Series, model: str, col: str, df_models: pd.DataFrame) -> pd.DataFrame:
  lr_model = LinearRegression()
  lr_model.fit(X, y)
  forecast = lr_model.predict(X)
  r2 = r2_score(y, forecast)
  RMSE = np.sqrt(mean_squared_error(y, forecast))
  MAE = mean_absolute_error(y, forecast)

  years_array = np.array(years).reshape(-1, 1)

  if model == 'lin-lin':
    y_pred = lr_model.predict(years_array)
  elif model == 'log-log':
    y_pred = np.exp(lr_model.predict(np.log(years_array)))
  elif model == 'lin-log':
    y_pred = np.exp(lr_model.predict(years_array))
  elif model == 'log-lin':
    y_pred = lr_model.predict(np.log(years_array))
  else:
    raise ValueError(f"Unknown model type: {model}")

  forecast_years = {year: val for year, val in zip(years, y_pred)}
  model_dict = {'Allocation': col, 'Model': model, 'R²': r2, 'RMSE': RMSE, 'MAE': MAE }
  model_dict.update(forecast_years)

  df_models = pd.concat([df_models, pd.DataFrame(model_dict, index=[0])], ignore_index=True)

  return df_models


# getting expense data
df_exp = pd.read_csv(r'/content/drive/MyDrive/Dashboard_data/final_data/df_exp.csv', sep=';', parse_dates=['Comp.pagto.'])


# getting budget settlement data from 2021
df_2021 = pd.read_excel(r'/content/drive/MyDrive/Dashboard_data/utils/consolidated_settlements_2021.xls', parse_dates=['Compet.Liq.'])
df_2021['Compet.Liq.'] = pd.to_datetime(df_2021['Compet.Liq.'], dayfirst=True)
df_2021 = df_2021.loc[
    (df_2021['Compet.Estorno'] == '21/12/2021') & (df_2021['Unid.Orçam.'] == 7002),
    ['Compet.Liq.', 'Unid.Orçam.', 'Proj/Ativ', 'Rubrica', 'Vinc.Orçam.', 'Val. Liquidado']
    ].copy().rename(columns={
        'Compet.Liq.': 'Comp.pagto.',
        'Unid.Orçam.': 'Unid. Orçam.',
        'Vinc.Orçam.': 'Vinc. Orçam.',
        'Val. Liquidado': 'Result. pago'
    })
df_2021['Elemento'] = df_2021['Rubrica'].astype(str).str[:6].astype(int)


# deleting specific 2021 data to replace it with budget settlement data
df_exp = df_exp[~((df_exp['Comp.pagto.'] == '2021-12-29') & (df_exp['Unid. Orçam.'] == 7002))]
df_exp = pd.concat([df_exp, df_2021], ignore_index=True)


# correcting data
df_exp.loc[
    (df_exp['Proj/Ativ'] == 2529) & (df_exp['Rubrica'] == 339036040000),
    ['Unid. Orçam.', 'Proj/Ativ', 'Elemento', 'Rubrica', 'Vinc. Orçam.']
] = 7003, 9075, 339039, 339039030000, 6050
df_exp.loc[
    (df_exp['Proj/Ativ'] == 9042) & (df_exp['Comp.pagto.'].dt.year > 2010),
    ['Unid. Orçam.', 'Proj/Ativ', 'Elemento', 'Rubrica']
] = 7002, 9076, 319086, 319086010000
df_exp.loc[df_exp['Elemento'] == 339086, 'Elemento'] = 319086


# dropping the 'Rubrica' column
df_exp.drop('Rubrica', axis=1, inplace=True)


# selecting and cleaning data from budget unit 7001
df_7001 = clean_7001(df_exp.loc[
    (df_exp['Unid. Orçam.'] == 7001) &
    (df_exp['Vinc. Orçam.'].isin([400, 1, 6050, 6069])) &
    (df_exp['Comp.pagto.'].dt.year > 2011) # period prior to GPREV removed
].copy())


# selecting and cleaning data from budget unit 7002
df_7002 = change_elements(df_exp.loc[
    (df_exp['Unid. Orçam.'] == 7002) &
    (~df_exp['Proj/Ativ'].isin([2737, 2739, 2741, 2743, 2745, 2746, 2748, 2750, 2753, 2755, 2757, 2759]))
].copy())
df_7002['Vinc. Orçam.'] = 6049


# selecting and cleaning data from budget unit 7003
df_7003 = change_elements(df_exp.loc[
    (df_exp['Unid. Orçam.'] == 7003) &
    (~df_exp['Proj/Ativ'].isin([2761, 2763, 2765, 2767, 2769, 2770, 2772, 2774, 2777, 2779, 2781, 2783]))
].copy())
df_7003['Vinc. Orçam.'] = 6050


# getting revenue data
df_rev = pd.read_csv(r'/content/drive/MyDrive/Dashboard_data/final_data/df_rev.csv', sep=';', parse_dates=['Data'])


# cleaning revenue data
df_rev.drop(columns=['origem', 'tipo'], inplace=True)
df_rev = df_rev[(df_rev['vinculo'].isin([6050, 6069, 6049, 400])) & (df_rev['Data'].dt.year > 2017)]
df_rev.loc[df_rev['vinculo'] == 400, 'vinculo'] = 6049
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Compensações Financ entre o Regime Geral e os RPPS'),
    'nome_rubrica'] = 'Compensações Financ entre o Regime Geral e os RPPS-Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6050) &
    (df_rev['nome_rubrica'] == 'Compensações Financ entre o Regime Geral e os RPPS'),
    'nome_rubrica'] = 'Comp. Financ. entre o Regime Geral e os RPPS - Plano em Capitalização'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Cedido'),
    'nome_rubrica'] = 'Contr.do Servidor Civil Ativo - Cedido - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6050) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Cedido'),
    'nome_rubrica'] = 'Contr.do Servidor Civil Ativo - Cedido - Plano em Capitalização'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Cedido - Multas e Juros'),
    'nome_rubrica'] = 'Contr.Serv.Civil Ativo - Cedido - Multas e Juros - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Cedido - Dív.At.- Multas e Juros'),
    'nome_rubrica'] = 'Contr.Serv.Ativo Cedido-Dív.Ativa-Multas e Juros - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Cedido - Dívida Ativa'),
    'nome_rubrica'] = 'Contr.Serv.Civil Ativo - Cedido - Dívida Ativa - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6050) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Cedido - Multas e Juros'),
    'nome_rubrica'] = 'Contr.Serv.Civil Ativo-Cedido-Multas e Juros - Plano em Capitalização'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Contr.Patronal - Serv. Afastados - Plano em Repartição'),
    'nome_rubrica'] = 'Contr.Patronal - Servidores Afastados - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Afastado'),
    'nome_rubrica'] = 'Contr.do Servidor Civil Ativo - Afastado - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6050) &
    (df_rev['nome_rubrica'] == 'Contr.do Servidor Civil Ativo - Afastado'),
    'nome_rubrica'] = 'Contr.do Servidor Civil Ativo - Afastado - Plano em Capitalização'
df_rev.loc[
    (df_rev['vinculo'] == 6069) &
    (df_rev['nome_rubrica'] == 'Taxa de Administração - INTRA ORCAMENTÄRIA'),
    'nome_rubrica'] = 'Aluguéis Diversos - Taxa de Administração - INTRA ORCAMENTÄRIA'
df_rev.loc[
    (df_rev['vinculo'] == 6069) &
    (df_rev['nome_rubrica'] == 'Restituições de Servidores - RPPS'),
    'nome_rubrica'] = 'Restituições de Servidores - Taxa de Administração do RPPS'
df_rev.loc[
    (df_rev['vinculo'] == 6069) &
    (df_rev['nome_rubrica'].isin(['Restituições Diversas - RPPS', 'Outras Restituições Diversas'])),
    'nome_rubrica'] = 'Restituições Diversas - Taxa de Administração do RPPS'
df_rev.loc[
    (df_rev['vinculo'] == 6069) &
    (df_rev['nome_rubrica'].isin(['Restituições de Sobra de Adiantamento de Numerário - RPPS'])),
    'nome_rubrica'] = 'Sobra de Adiantamento de Numerário - Taxa de Administração do RPPS'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Restituições de Servidores - RPPS'),
    'nome_rubrica'] = 'Restituições de Servidores - RPPS - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6049) &
    (df_rev['nome_rubrica'] == 'Restituições Diversas - RPPS'),
    'nome_rubrica'] = 'Restituições Diversas - RPPS - Plano em Repartição'
df_rev.loc[
    (df_rev['vinculo'] == 6050) &
    (df_rev['nome_rubrica'] == 'Restituições de Servidores - RPPS'),
    'nome_rubrica'] = 'Restituições de Servidores - RPPS - Plano em Capitalização'
df_rev.loc[
    (df_rev['vinculo'] == 6050) &
    (df_rev['nome_rubrica'].isin(['Restituições Diversas - RPPS', 'Restituições Diversas - Taxa de Administração do RPPS', 'Outras Restituições Diversas'])),
    'nome_rubrica'] = 'Restituições Diversas - RPPS - Plano em Capitalização'
df_rev = df_rev[~((df_rev['nome_rubrica'].str.lower().str.contains('patr')) & (df_rev['nome_rubrica'].str.lower().str.contains('inativo')))]
df_rev = df_rev[~df_rev['nome_rubrica'].str.contains('637|750|805|supl', case=False, regex=True)]


# preparing to predict expense data
df_pred_exp = pd.concat([df_7001, df_7002, df_7003], ignore_index=True)
df_pred_exp['col'] = df_pred_exp['Unid. Orçam.'].astype(str) + '-' + df_pred_exp['Proj/Ativ'].astype(str) + '-' + df_pred_exp['Elemento'].astype(str) + '-' + df_pred_exp['Vinc. Orçam.'].astype(str)
df_pred_exp = df_pred_exp[~((df_pred_exp['col'].isin(mass_segregation_cols_exp)) & (df_pred_exp['Comp.pagto.'] < '2022-05'))]


# preparing to predict revenue data
df_pred_rev = df_rev.copy().rename(columns={'Data': 'Comp.pagto.', 'valor_arrecadado': 'Result. pago'})
df_pred_rev['col'] = df_pred_rev['nome_rubrica'] + '-' + df_pred_rev['vinculo'].astype(str)
df_pred_rev = df_pred_rev[~((df_pred_rev['col'].isin(mass_segregation_cols_rev)) & (df_pred_rev['Comp.pagto.'] < '2022-05'))]
df_pred_rev = df_pred_rev[~((df_pred_rev['col'].str.contains('pensionista|inativo|926/2021', case=False, regex=True)) & (df_pred_rev['Comp.pagto.'] < '2022-01'))]
df_pred_rev = df_pred_rev[~((df_pred_rev['col'].str.contains('contr', case=False, regex=False)) & (df_pred_rev['vinculo'] == 6069) & (df_pred_rev['Comp.pagto.'].dt.year < 2022))]


# preparing to predict expenditure and revenue data together
df_pred = pd.concat([df_pred_exp, df_pred_rev], ignore_index=True)
df_pred['Comp.pagto.'] = df_pred['Comp.pagto.'].dt.to_period('M')
df_pred = df_pred[['Comp.pagto.', 'col', 'Result. pago']].groupby(['Comp.pagto.', 'col'], as_index=False).sum()
df_pred = df_pred.pivot(index='Comp.pagto.', columns='col', values='Result. pago')
df_pred = df_pred.iloc[:-1]
df_pred.reset_index(inplace=True)
df_pred['T'] = np.arange(1, len(df_pred)+1)
df_pred = fill_zero(df_pred)


# prediction interval
start = df_pred['Comp.pagto.'].max()
current_year = start.year if start.month != 12 else start.year + 1
target = pd.Period(f'{current_year + 4}-12', freq='M')
month_diff = (target - start).n
last_T = df_pred['T'].max()
X_prev = pd.DataFrame({'T': [last_T + val for val in range(1, month_diff + 1)]})
years = list(range(current_year, current_year+5))
months = [start.to_timestamp() + pd.DateOffset(months=i) for i in range(1, month_diff + 1)]


# Run models separately
models_raw = run_models(df_pred, df_pred, years, X_prev, start, months, month_diff)
models_ma12 = run_models(moving_averages(df_pred, 12), df_pred, years, X_prev, start, months, month_diff, '12')
models_ma36 = run_models(moving_averages(df_pred, 36), df_pred, years, X_prev, start, months, month_diff, '36')


# Combined results
combined = pd.concat([models_raw, models_ma12, models_ma36], keys=['raw', 'ma12', 'ma36'])


# chosen models
chosen_models = choose_models(combined)


# getting the history after cleaning
df_pred_year = (
    df_pred.drop(columns='T')
    .assign(year=df_pred['Comp.pagto.'].dt.year)
    .drop(columns='Comp.pagto.')
    .groupby('year')
    .sum()
    .T
)


# if the history ends before December, the last year is partial: its column is renamed
# to '<year> until <month>' and left out of the three-year average used further below
last_year_col = start.year
last_cols = df_pred_year.columns[-3:]
if start.month != 12:
  last_year_col = f"{start.year} until {start.strftime('%B')}"
  df_pred_year.rename(columns={start.year: last_year_col}, inplace=True)
  last_cols = df_pred_year.columns.difference([last_year_col])[-3:]


# combining history with forecasting
df_past_future = pd.merge(df_pred_year, chosen_models[years], left_index=True, right_index=True, how='outer')


# calculating the average of the last three years for the columns not predicted by the models
df_past_future.loc[df_pred_year.index.difference(chosen_models.index), years] = df_pred_year.loc[df_pred_year.index.difference(chosen_models.index), last_cols].mean(axis=1).clip(lower=0)


# calculating the average of the last three years for refund and financial compensation general regime columns
df_past_future.loc[
    df_past_future.index.str.contains('restituições|regime geral|adiantamento', case=False, regex=True),
    years] = df_past_future.loc[
        df_past_future.index.str.contains('restituições|regime geral|adiantamento', case=False, regex=True),
        last_cols].mean(axis=1).clip(lower=0)


# adjustments by rules
if start.month != 12:
  # define conditions and multipliers
  rules = [
      {
          'pattern': '319001|319003|319011|319013|319113|319092|319094|339047',
          'starts_with': '7',
          'multiplier': 12,
          'pattern_negate': True,
          'threshold_ratio': 1.05,
          'use_custom_ratio': False
      },
      {
          'pattern': '319001|319003|319011|319013|319113',
          'starts_with': None,
          'multiplier': 13,
          'pattern_negate': False,
          'threshold_ratio': 1.05,
          'use_custom_ratio': False
      },
      {
          'pattern': '319011',
          'starts_with': '7001',
          'multiplier': 13,
          'pattern_negate': False,
          'threshold_ratio': 1.05,
          'use_custom_ratio': True
      },
      {
          'pattern': '319001|319003',
          'starts_with': '7002',
          'multiplier': 13,
          'pattern_negate': False,
          'threshold_ratio': 1.05,
          'use_custom_ratio': True
      },
      {
          'pattern': '319001|319003',
          'starts_with': '7003',
          'multiplier': 13,
          'pattern_negate': False,
          'threshold_ratio': 1.3,
          'use_custom_ratio': True
      },
  ]

  for rule in rules:
      # pattern matching condition
      pattern_condition = ~df_past_future.index.str.contains(rule['pattern'], regex=True) if rule['pattern_negate'] else df_past_future.index.str.contains(rule['pattern'], regex=True)

      # optional: startswith condition
      startswith_condition = df_past_future.index.str.startswith(rule['starts_with']) if rule['starts_with'] else True

      # base projection
      base_proj = df_past_future[last_year_col] / start.month * rule['multiplier']

      # final condition
      final_condition = (df_past_future[current_year] > base_proj * rule['threshold_ratio']) if rule['use_custom_ratio'] else (base_proj > df_past_future[current_year] * rule['threshold_ratio'])

      # full condition
      condition = pattern_condition & startswith_condition & final_condition

      # ratio to scale other years
      ratio = base_proj * rule['threshold_ratio'] / df_past_future[current_year] if rule['use_custom_ratio'] else base_proj / df_past_future[current_year]

      # assign values
      df_past_future.loc[condition, years[0]] = base_proj * rule['threshold_ratio'] if rule['use_custom_ratio'] else base_proj
      for year in years[1:]:
          df_past_future.loc[condition, year] = df_past_future[year] * ratio


# the forecast for the first forecast year must not fall below the amount already recorded
# in the last history column (the year to date, or the last full year when history ends in December)
df_past_future.loc[df_past_future[current_year] < df_past_future[last_year_col], current_year] = df_past_future[last_year_col]


# calculate social charges
charges = {
    'centralizada': '2736|2740|2742|2744|2747|2749|2751|2758|9074|9076',
    'demhab': '2752',
    'dmae': '2754',
    'dmlu': '2756',
    'fasc': '2738'
}

for k,v in charges.items():

  mask_k_charge = (
      df_past_future.index.str.contains(k, case=False, regex=False) &
      df_past_future.index.str.contains('encargo', case=False, regex=False)
  )

  sum_group = df_past_future.loc[
      df_past_future.index.str.contains(v, case=False, regex=True),
      years
  ].sum()

  if k != 'centralizada':
    sum_deduction = df_past_future.loc[
        df_past_future.index.str.contains(k, case=False, regex=False) &
        df_past_future.index.str.contains('6049', case=False, regex=False) &
        ~df_past_future.index.str.contains('encargo', case=False, regex=False),
        years
    ].sum()
  else:
    sum_deduction = df_past_future.loc[
        ~df_past_future.index.str.contains('demhab|dmae|dmlu|fasc|encargo', case=False, regex=True) &
        ~df_past_future.index.str.startswith('7') &
        df_past_future.index.str.contains('6049', case=False, regex=False),
        years
    ].sum()

  final_values = (sum_group - sum_deduction).clip(lower=0)

  df_past_future.loc[mask_k_charge, years] = [final_values.values] * mask_k_charge.sum()


# calculate PASEP treasury
df_past_future.loc['7001-9071-339047-1', years] = df_past_future.loc[(df_past_future.index.str.contains('encargo', case=False, regex=False)), years].sum() * 0.01


# getting the contribution wages
df_contrib = pd.read_excel(r'/content/drive/MyDrive/Dashboard_data/utils/contribution_wages-2018_2024.xlsx')


# predicting contribution wages
df_models_contrib = pd.DataFrame({
      'Allocation': pd.Series(dtype='object'),
      'Model': pd.Series(dtype='object'),
      'R²': pd.Series(dtype='float'),
      'RMSE': pd.Series(dtype='float'),
      'MAE': pd.Series(dtype='float'),
      **{year: pd.Series(dtype='float') for year in years}
  })

for col in df_contrib.drop(columns='Comp.pagto.'):
  X = df_contrib['Comp.pagto.'].values.reshape(-1,1)
  y = df_contrib[col]

  # lin-lin
  df_models_contrib = run_simple_linear_model(X, y, 'lin-lin', col, df_models_contrib)

  # log-log
  df_models_contrib = run_simple_linear_model(np.log(X), np.log(y), 'log-log', col, df_models_contrib)

  # lin-log
  df_models_contrib = run_simple_linear_model(X, np.log(y), 'lin-log', col, df_models_contrib)

  # log-lin
  df_models_contrib = run_simple_linear_model(np.log(X), y, 'log-lin', col, df_models_contrib)

chosen_contrib_models = choose_models(df_models_contrib, rename_level=False)


# calculating the management fee on the contribution salaries of city hall companies
mask = (
    df_pred_rev['col'].str.contains(r'^Contr\.\s*Patronal.+6069$', regex=True) &
    (df_pred_rev['Comp.pagto.'] == start.to_timestamp())
)
df_contrib_by_co = df_pred_rev.loc[
      mask, ['Comp.pagto.', 'col', 'Result. pago']
    ].groupby(['Comp.pagto.', 'col'], as_index=False).sum()

# management fee rate applied to the contribution wage base (2.4%)
MANAGEMENT_FEE_RATE = 0.024

for tipo in ['Capitalização', 'Repartição']:
    mask_tipo = df_contrib_by_co['col'].str.contains(tipo)
    mask_chosen_models = chosen_contrib_models.index.str.contains(tipo)
    total = df_contrib_by_co.loc[mask_tipo, 'Result. pago'].sum()

    # compute each company's share of the total paid for this regime
    df_contrib_by_co.loc[mask_tipo, 'proportion'] = df_contrib_by_co.loc[mask_tipo, 'Result. pago'] / total

    # get base value from previous year and multiply it by the management fee rate
    contrib_base = df_contrib.loc[
        df_contrib['Comp.pagto.'] == current_year - 1,
        df_contrib.columns.str.contains(tipo)
    ].squeeze() * MANAGEMENT_FEE_RATE

    # assign the current year's management fee
    df_contrib_by_co.loc[mask_tipo, current_year] = (
        contrib_base * df_contrib_by_co.loc[mask_tipo, 'proportion'].values
    )

    # assign the management fee for the remaining years:
    # each year's fee is based on the predicted wages of the year before it
    df_contrib_by_co.loc[mask_tipo, years[1:]] = (
        MANAGEMENT_FEE_RATE *
        df_contrib_by_co.loc[mask_tipo, 'proportion'].values[:, None] *
        chosen_contrib_models.loc[mask_chosen_models, years[:-1]].values
    )

# assign the management fee into original dataframe
df_contrib_by_co.set_index('col', inplace=True)
assert df_contrib_by_co.index.isin(df_past_future.index).all(), "Some contrib rows are not in past_future!"
df_past_future.loc[df_contrib_by_co.index, years] = df_contrib_by_co[years].values


# calculating the RPPS capitalization regime reserve:
# sum of the '6050' rows whose index does not start with '7', minus the '6050' rows whose index does
mask_6050 = df_past_future.index.str.contains('6050')
mask_starts_7 = df_past_future.index.str.startswith('7')
df_past_future.loc['7003-9998-999999-6050', years] = (
    df_past_future.loc[mask_6050 & ~mask_starts_7, years].sum()
    - df_past_future.loc[mask_6050 & mask_starts_7, years].sum()
)


# generating an Excel spreadsheet with the calculated data for the coming years
# keep the original chronological order of `years` (a set would not guarantee column order)
coming_years = [year for year in years if year != current_year]
df_rev_forecast = prepare_forecast_sheet(df_past_future, coming_years)
df_exp_forecast = prepare_forecast_sheet(df_past_future, coming_years, is_revenue=False)

with pd.ExcelWriter('rev_exp_forecast.xlsx') as writer:
    df_rev_forecast[['fonte', 'nome_rubrica'] + coming_years].sort_values(by=['fonte', 'nome_rubrica']).to_excel(writer, sheet_name='Receita Prevista', index=False)
    df_exp_forecast[['subacao', 'fonte'] + coming_years].groupby(by=['subacao', 'fonte']).sum().to_excel(writer, sheet_name='Despesa Prevista')
    df_past_future.to_excel(writer, sheet_name='hist + prev')

