# Data Preprocessing and Data Cleaning

# Introduction

**AEGEA'S PROBLEM**

Aegea faces fraud in water consumption, which affects both revenue and service quality. Fraud occurs through tampering with water meters and through clandestine connections, causing economic losses and damage to infrastructure.






**OBJECTIVE**

Estimate the probability that a consumption pattern is fraudulent, considering historical consumption data as a whole and, where necessary, the influence of exogenous variables such as macroeconomic, climatic, and geographic indicators, among others.

**DATA PREPROCESSING**

Preprocessing and cleaning are crucial steps in preparing data for analysis and modeling, especially in machine learning and data science. They ensure the data is in a suitable format, free of inconsistencies and noise that could degrade the results. This ranges from removing missing and duplicated values to normalizing and scaling variables, so that the data can be fed to algorithms efficiently and accurately.

# 1) Setup

Setup prepares the environment for use: installing libraries and configuring any other required settings, so that the tasks in this notebook can run.

## 1.1) Connecting to Drive

In [245]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1.2) Installing the Libraries


In [246]:
!pip install geopy



In [247]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

In [248]:
import pandas as pd

In [249]:
import numpy as np

## 1.3) Reading the CSV Files

Two datasets were provided: one whose main information is customer consumption (BASE_CONSUMO) and another recording which customers have already had fraud identified (BASE_FRAUDES).

**FRAUD DATASET**

In [250]:
BASE_FRAUDES = pd.read_csv('/content/drive/MyDrive/Módulo 11/Colab/Dados/FRAUDES_HIST.csv', delimiter=';')

In [251]:
BASE_FRAUDES

Unnamed: 0.1,Unnamed: 0,TIPOOS,ANOOS,IDOSP,ANOMES,MATRICULA,OS,SERVICO,DESCRICAO,COD_GRUPO,...,DS_SERVICO_SOLICITADO,FL_EXECUTADO,NM_TIPO_EXECUCAO,DT_LIMITE_EXECUCAO,DT_SERVICO,DT_FECHAMENTO,PARECER_EXECUCAO,FL_PROGRAMACAO_AUTOMATICA,NMCOMUNIDADE,AREAATUACAO
0,0,Desdobro,2023.0,230101031796,11/23,17229588,1031796,110013,IRREGULARIDADE IDENTIFICADA,,...,VISTORIA DE IRREGULARIDADE - IMPEDIMENTO DE AC...,1.0,activityCompleted,2024-03-11 23:59:59,2023-11-06 00:00:00,2023-11-06 11:07:02,HD interno,,,
1,1,Desdobro,2024.0,240100141765,02/24,17804014,141765,110013,IRREGULARIDADE IDENTIFICADA,,...,VISTORIA DE IRREGULARIDADE,1.0,activityCompleted,2024-06-24 23:59:59,2024-02-13 00:00:00,2024-02-13 15:59:01,421,,,
2,2,Desdobro,2024.0,240100021314,01/24,17234771,21314,110013,IRREGULARIDADE IDENTIFICADA,,...,VISTORIA DE IRREGULARIDADE IDENTIFICADA - LEITURA,1.0,activityCompleted,2024-04-22 23:59:59,2024-01-08 00:00:00,2024-01-08 15:45:46,413,,,
3,3,Desdobro,2023.0,230101217142,12/23,17837656,1217142,110013,IRREGULARIDADE IDENTIFICADA,,...,VISTORIA DE IRREGULARIDADE SUSPEITA - LEITURISTA,1.0,activityCompleted,2024-05-08 23:59:59,2024-01-03 00:00:00,2024-01-03 15:40:12,No local ligação cortada e violado no cavalete...,,,
4,4,Desdobro,2024.0,240100077627,01/24,17722316,77627,110013,IRREGULARIDADE IDENTIFICADA,,...,VISTORIA DE IRREGULARIDADE - IMPEDIMENTO DE AC...,1.0,activityCompleted,2024-05-31 23:59:59,2024-01-24 00:00:00,2024-01-24 08:53:18,421,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225992,225992,OsOnline,2020.0,200100034631,01/20,17865556,100034631,2039,IRREGULARIDADE IDENTIFICADA,4.0,...,VERIFICACAO DE IRREGULARIDADE,1.0,Executado,2020-02-03 23:59:59,2020-01-14 14:00:07,2020-01-14 14:42:56,405 violada cavalete,1.0,,
225993,225993,OsOnline,2022.0,220100065876,01/22,17511826,100065876,110013,IRREGULARIDADE IDENTIFICADA,4.0,...,VISTORIA DE IRREGULARIDADE - DENUNCIA,1.0,Executado,2022-02-04 23:59:59,2022-01-21 10:38:13,2022-01-21 10:38:23,421 ligação cortada com hidrometro interno em ...,0.0,,
225994,225994,OsOnline,2022.0,220100450752,05/22,17920302,100450752,110013,IRREGULARIDADE IDENTIFICADA,4.0,...,VISTORIA DE IRREGULARIDADE IDENTIFICADA - LEITURA,1.0,Executado,2022-10-15 23:59:59,2022-05-10 16:51:43,2022-05-10 16:52:03,"Violada no ramal,passando agua pelo HD.. .",0.0,,
225995,225995,OsOnline,2022.0,220101353952,12/22,17801545,101353952,110013,IRREGULARIDADE IDENTIFICADA,4.0,...,VISTORIA POS CORTE,1.0,Executado,2023-01-24 23:59:59,2022-12-30 11:52:39,2022-12-30 12:22:23,Violada cavalete...parece estar desabitado,0.0,,


**CONSUMPTION DATASET**

In [252]:
BASE_CONSUMO = pd.read_csv('/content/drive/MyDrive/Módulo 11/Colab/Dados/CONSUMO_2022.csv', delimiter=';')



In [253]:
BASE_CONSUMO

Unnamed: 0.1,Unnamed: 0,EMP_CODIGO,REFERENCIA,COD_GRUPO,COD_SETOR_COMERCIAL,NUM_QUADRA,COD_ROTA_LEITURA,MATRICULA,SEQ_RESPONSAVEL,ECO_RESIDENCIAL,...,DSC_SIMULTANEA,VOLUME_ESTIMADO,VOLUME_ESTIMADO_ACUM,FATURADO_MEDIA,COD_LEITURA_INT,STA_TROCA,EXCECAO,STA_ACEITA_LEITURA,COD_LATITUDE,COD_LONGITUDE
0,0,2.0,2022-02-01,6.0,42.0,156.0,27.0,17224682.0,123755.0,0.0,...,10-ISENTA - NAO IMPRESSA,0.0,0.0,MEDIA,107.0,N,Normal,N,-20.493049,-54.669201
1,1,2.0,2022-05-01,50.0,1.0,56.0,1.0,17386534.0,183125.0,1.0,...,02-CAIXA CORREIO,0.0,0.0,,900.0,N,Normal,S,-20.986346,-54.508939
2,2,2.0,2022-10-01,11.0,34.0,1.0,16.0,17908273.0,1198248.0,1.0,...,02-CAIXA CORREIO,0.0,0.0,,900.0,N,Normal,S,-20.429032,-54.642100
3,3,2.0,2022-08-01,10.0,29.0,118.0,13.0,17270471.0,143366.0,1.0,...,02-CAIXA CORREIO,0.0,0.0,,900.0,N,Normal,S,-20.432093,-54.588228
4,4,2.0,2022-11-01,11.0,6.0,166.0,3.0,17086970.0,345391.0,0.0,...,59-RETIDA - LIGACAO CORTADA,0.0,0.0,,905.0,S,Normal,S,-20.470276,-54.629742
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4492143,4492143,2.0,2022-03-01,20.0,46.0,858.0,40.0,17977627.0,1162382.0,1.0,...,00-NAO ENTREGUE,1.0,1.0,MEDIA,107.0,N,Normal,N,-20.508637,-54.647928
4492144,4492144,2.0,2022-03-01,5.0,95.0,181.0,13.0,17972199.0,1128269.0,1.0,...,04-FIXADA AO PORTAO,1.0,1.0,MEDIA,107.0,N,Normal,N,-20.547528,-54.622010
4492145,4492145,2.0,2022-03-01,5.0,95.0,131.0,13.0,17970540.0,1118284.0,1.0,...,04-FIXADA AO PORTAO,1.0,1.0,MEDIA,107.0,N,Normal,N,-20.546470,-54.622029
4492146,4492146,2.0,2022-03-01,9.0,7.0,248.0,6.0,17963137.0,1059794.0,1.0,...,02-CAIXA CORREIO,0.0,1.0,MEDIA,107.0,N,Normal,N,-20.469654,-54.610963


## 1.4) Data Selection

### 1.4.1) Variable Selection

The following variables were selected from the fraud dataset:



*   MATRICULA: Connection for which fraudulent behavior was recorded.
*   DESCRICAO: Type of service performed that identified the fraud.
*   DATACONCLUSAO: Date on which the fraud was identified.



In [254]:
DF_BASE_FRAUDES = BASE_FRAUDES[['MATRICULA', 'DESCRICAO', 'DATACONCLUSAO']]

The following variables were selected from the consumption dataset:

*   MATRICULA: Numeric identifier of the connection at the customer's premises.
*   DAT_LEITURA: Date of the meter reading.
*   CONS_MEDIDO: Measured consumption in m³.
*   CATEGORIA: Customer category (Residential, Industrial, Commercial, or Public).
*   COD_LATITUDE and COD_LONGITUDE: Coordinates (latitude and longitude) of the connection.


In [255]:
DF_BASE_CONSUMO = BASE_CONSUMO[['MATRICULA', 'DAT_LEITURA', 'CONS_MEDIDO', 'CATEGORIA', 'COD_LATITUDE', 'COD_LONGITUDE']]

### 1.4.2) Value Selection

Initially, to focus on larger frauds, only connections in the industrial, commercial, and public categories are selected.

**INPUT**

In [256]:
DF_BASE_CONSUMO['CATEGORIA']

Unnamed: 0,CATEGORIA
0,COMERCIAL
1,RESIDENCIAL
2,RESIDENCIAL
3,RESIDENCIAL
4,COMERCIAL
...,...
4492143,RESIDENCIAL
4492144,RESIDENCIAL
4492145,RESIDENCIAL
4492146,RESIDENCIAL


**FUNCTION**

This function filters the rows of the DataFrame (df) based on specific values in a categorical column (col). Only rows whose value in that column matches one of the entries in the list values are kept.

In [257]:
def selecionarCategoria(df, col, values):

    DF_FILTRADO = df[df[col].isin(values)]

    return DF_FILTRADO
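
As a rough illustration of the boolean-mask pattern used by `selecionarCategoria`, the sketch below applies the same `isin` filter to a small toy table. The rows are invented for the example and are not taken from BASE_CONSUMO; the trailing `.copy()` is an optional addition, not part of the original function.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "MATRICULA": [1, 2, 3, 4],
    "CATEGORIA": ["COMERCIAL", "RESIDENCIAL", "INDUSTRIAL", "PUBLICA"],
})

# Keep only rows whose category appears in the allowed list.
mask = toy["CATEGORIA"].isin(["INDUSTRIAL", "COMERCIAL", "PUBLICA"])

# .copy() returns an independent DataFrame, so later column
# assignments do not refer back to the original table.
filtered = toy[mask].copy()

print(filtered["MATRICULA"].tolist())
```

Note that the filter keeps the original row labels, which is why the index in the notebook output is no longer contiguous after filtering.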

**OUTPUT**

In [258]:
DF_BASE_CONSUMO = selecionarCategoria(DF_BASE_CONSUMO, 'CATEGORIA', ['INDUSTRIAL', 'COMERCIAL', 'PUBLICA'])

In [259]:
DF_BASE_CONSUMO['CATEGORIA']

Unnamed: 0,CATEGORIA
0,COMERCIAL
4,COMERCIAL
9,COMERCIAL
18,COMERCIAL
27,COMERCIAL
...,...
4492094,COMERCIAL
4492107,COMERCIAL
4492108,COMERCIAL
4492112,COMERCIAL


# 2) General Treatment of the Datasets

## 2.1) Type Conversion

The columns in the provided datasets have different data types:

1.   **float64** : 64-bit floating-point numbers (double precision, roughly 15 to 17 significant decimal digits);
2.   **object** : Generic Python objects; in the pandas version used here, this is how text columns (strings) are stored;
3.   **int64** : 64-bit signed integers.

The data types found in the fraud dataset are shown below.

In [260]:
DF_BASE_FRAUDES.dtypes

Unnamed: 0,0
MATRICULA,int64
DESCRICAO,object
DATACONCLUSAO,object


The data types found in the consumption dataset are shown below.

In [261]:
DF_BASE_CONSUMO.dtypes

Unnamed: 0,0
MATRICULA,float64
DAT_LEITURA,object
CONS_MEDIDO,float64
CATEGORIA,object
COD_LATITUDE,float64
COD_LONGITUDE,float64


### 2.1.1) Date Conversion

The function below converts the dates to day-month-year format.

**INPUT**

In [262]:
DF_BASE_FRAUDES['DATACONCLUSAO']

Unnamed: 0,DATACONCLUSAO
0,2023-11-06 11:06:00
1,2024-02-13 15:58:00
2,2024-01-08 15:45:00
3,2024-01-03 15:39:00
4,2024-01-24 08:52:00
...,...
225992,2020-01-14 14:00:07
225993,2022-01-21 10:38:13
225994,2022-05-10 16:51:43
225995,2022-12-30 11:52:39


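**FUNCTION**

This function formats the date column (col) of the DataFrame (df) as a string in dd-mm-yyyy format. The time of day is discarded, and the result is a string column rather than a datetime column.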

In [263]:
def formatarData(df, col):

    df[col] = pd.to_datetime(df[col]).dt.strftime('%d-%m-%Y')

    return df

**OUTPUT**

In [264]:
DF_BASE_FRAUDES_TRATADA = formatarData(DF_BASE_FRAUDES, 'DATACONCLUSAO')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = pd.to_datetime(df[col]).dt.strftime('%d-%m-%Y')
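
The warning above is pandas' SettingWithCopyWarning: the function assigns to a column of a DataFrame that was itself obtained by selecting columns from another DataFrame, so pandas cannot tell whether the assignment is meant to reach the original. One common way to make the intent explicit is to take a `.copy()` of the selection before modifying it. The sketch below shows this on toy data; the `OUTRA` column and the variable names are hypothetical, and this is only one possible approach, not the notebook's own code.

```python
import pandas as pd

raw = pd.DataFrame({
    "MATRICULA": [1, 2],
    "DATACONCLUSAO": ["2023-11-06 11:06:00", "2024-02-13 15:58:00"],
    "OUTRA": ["a", "b"],  # hypothetical extra column
})

# An explicit copy makes the subset an independent DataFrame,
# so assigning to one of its columns is unambiguous.
subset = raw[["MATRICULA", "DATACONCLUSAO"]].copy()

# Same formatting step as formatarData: parse, then render as dd-mm-yyyy text.
subset["DATACONCLUSAO"] = (
    pd.to_datetime(subset["DATACONCLUSAO"]).dt.strftime("%d-%m-%Y")
)

print(subset["DATACONCLUSAO"].tolist())
```

With the copy in place, the original table keeps its raw timestamps while the subset holds the formatted strings.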


In [265]:
DF_BASE_FRAUDES_TRATADA

Unnamed: 0,MATRICULA,DESCRICAO,DATACONCLUSAO
0,17229588,IRREGULARIDADE IDENTIFICADA,06-11-2023
1,17804014,IRREGULARIDADE IDENTIFICADA,13-02-2024
2,17234771,IRREGULARIDADE IDENTIFICADA,08-01-2024
3,17837656,IRREGULARIDADE IDENTIFICADA,03-01-2024
4,17722316,IRREGULARIDADE IDENTIFICADA,24-01-2024
...,...,...,...
225992,17865556,IRREGULARIDADE IDENTIFICADA,14-01-2020
225993,17511826,IRREGULARIDADE IDENTIFICADA,21-01-2022
225994,17920302,IRREGULARIDADE IDENTIFICADA,10-05-2022
225995,17801545,IRREGULARIDADE IDENTIFICADA,30-12-2022


### 2.1.2) Integer Conversion

MATRICULA is stored as float in the consumption dataset and as integer in the fraud dataset. The function below standardizes the column to integer so that both datasets use the same type for this key.

**INPUT**

In [266]:
DF_BASE_CONSUMO['MATRICULA']

Unnamed: 0,MATRICULA
0,17224682.0
4,17086970.0
9,17600903.0
18,17701434.0
27,17798530.0
...,...
4492094,17959364.0
4492107,17197853.0
4492108,17105115.0
4492112,17132023.0


**FUNCTION**

This function converts the values of a column (col) of the DataFrame (df) to the int type.

In [267]:
def converterInteiro(df, col):

    df[col] = df[col].astype(int)

    return df
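
A caveat worth keeping in mind: `astype(int)` cannot represent missing values, so the conversion raises an error if the column contains any NaN. Nothing in the notebook indicates that MATRICULA has gaps, so the sketch below is purely illustrative. It uses invented values and shows two possible ways to handle the situation: dropping the incomplete rows, or using pandas' nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

ids = pd.Series([17224682.0, np.nan, 17086970.0], name="MATRICULA")

# Plain astype(int) fails when a NaN is present.
try:
    ids.astype(int)
    raised = False
except (ValueError, TypeError):
    raised = True

# Option A: drop rows without an identifier, then cast.
ids_clean = ids.dropna().astype("int64")

# Option B: keep the gaps by using the nullable integer dtype.
ids_nullable = ids.astype("Int64")

print(raised, ids_clean.tolist(), ids_nullable.isna().sum())
```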

**OUTPUT**

In [268]:
DF_BASE_CONSUMO_TRATADA = converterInteiro(DF_BASE_CONSUMO, 'MATRICULA')

In [269]:
DF_BASE_CONSUMO_TRATADA['MATRICULA']

Unnamed: 0,MATRICULA
0,17224682
4,17086970
9,17600903
18,17701434
27,17798530
...,...
4492094,17959364
4492107,17197853
4492108,17105115
4492112,17132023


## 2.2) Converting Categorical Variables

The function below replaces the categorical values with numeric codes.

**INPUT**

In [270]:
DF_BASE_CONSUMO_TRATADA['CATEGORIA']

Unnamed: 0,CATEGORIA
0,COMERCIAL
4,COMERCIAL
9,COMERCIAL
18,COMERCIAL
27,COMERCIAL
...,...
4492094,COMERCIAL
4492107,COMERCIAL
4492108,COMERCIAL
4492112,COMERCIAL


**FUNCTION**

This function converts the categorical values of a column (col) of the DataFrame (df) to numeric values using a predefined mapping.

In [271]:
def converterNumericamente(df, col):
    # Mapping dictionary: category label -> numeric code
    mapping = {
        'RESIDENCIAL': 0,
        'COMERCIAL': 1,
        'PUBLICA': 2,
        'INDUSTRIAL': 3
    }

    df[col] = df[col].map(mapping)

    return df
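
One property of `Series.map` with a dictionary is that any label missing from the dictionary becomes NaN without raising an error. The sketch below, on invented data, shows a simple way to list such labels after mapping. The label `"RURAL"` is made up for the example and does not come from the dataset.

```python
import pandas as pd

mapping = {"RESIDENCIAL": 0, "COMERCIAL": 1, "PUBLICA": 2, "INDUSTRIAL": 3}

# "RURAL" is a hypothetical label that the mapping does not cover.
cats = pd.Series(["COMERCIAL", "PUBLICA", "RURAL"])

codes = cats.map(mapping)

# Labels that were not found in the mapping show up as NaN in codes.
unmapped = cats[codes.isna()].unique().tolist()

print(codes.tolist(), unmapped)
```

A check like this can be run once after the conversion to confirm that every category received a code.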

**OUTPUT**

In [272]:
DF_BASE_CONSUMO_TRATADA = converterNumericamente(DF_BASE_CONSUMO_TRATADA, 'CATEGORIA')

In [273]:
DF_BASE_CONSUMO_TRATADA['CATEGORIA']

Unnamed: 0,CATEGORIA
0,1
4,1
9,1
18,1
27,1
...,...
4492094,1
4492107,1
4492108,1
4492112,1


# 3) Building the New Dataset

## 3.1) New Fraud Dataset

### 3.1.1) Count Fraud Frequency

The function below counts how many fraud records each MATRICULA has.

**INPUT**

In [274]:
DF_BASE_FRAUDES_TRATADA

Unnamed: 0,MATRICULA,DESCRICAO,DATACONCLUSAO
0,17229588,IRREGULARIDADE IDENTIFICADA,06-11-2023
1,17804014,IRREGULARIDADE IDENTIFICADA,13-02-2024
2,17234771,IRREGULARIDADE IDENTIFICADA,08-01-2024
3,17837656,IRREGULARIDADE IDENTIFICADA,03-01-2024
4,17722316,IRREGULARIDADE IDENTIFICADA,24-01-2024
...,...,...,...
225992,17865556,IRREGULARIDADE IDENTIFICADA,14-01-2020
225993,17511826,IRREGULARIDADE IDENTIFICADA,21-01-2022
225994,17920302,IRREGULARIDADE IDENTIFICADA,10-05-2022
225995,17801545,IRREGULARIDADE IDENTIFICADA,30-12-2022


**FUNCTION**

This function counts how many times each value appears in the MATRICULA column (col_matricula) of the DataFrame (df) and stores that count in a new column named CONTAGEM_MATRICULA.

In [275]:
def contabilizarFrequencia(df, col_matricula):

    df['CONTAGEM_MATRICULA'] = df.groupby(col_matricula)[col_matricula].transform('count')

    return df

**OUTPUT**

In [276]:
NOVA_BASE_FRAUDES = contabilizarFrequencia(DF_BASE_FRAUDES_TRATADA, 'MATRICULA')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['CONTAGEM_MATRICULA'] = df.groupby(col_matricula)[col_matricula].transform('count')


In [277]:
NOVA_BASE_FRAUDES

Unnamed: 0,MATRICULA,DESCRICAO,DATACONCLUSAO,CONTAGEM_MATRICULA
0,17229588,IRREGULARIDADE IDENTIFICADA,06-11-2023,1
1,17804014,IRREGULARIDADE IDENTIFICADA,13-02-2024,10
2,17234771,IRREGULARIDADE IDENTIFICADA,08-01-2024,6
3,17837656,IRREGULARIDADE IDENTIFICADA,03-01-2024,1
4,17722316,IRREGULARIDADE IDENTIFICADA,24-01-2024,2
...,...,...,...,...
225992,17865556,IRREGULARIDADE IDENTIFICADA,14-01-2020,3
225993,17511826,IRREGULARIDADE IDENTIFICADA,21-01-2022,10
225994,17920302,IRREGULARIDADE IDENTIFICADA,10-05-2022,8
225995,17801545,IRREGULARIDADE IDENTIFICADA,30-12-2022,1


### 3.1.2) Remove Duplicate Records

The function below removes duplicated MATRICULA values.

**INPUT**

In [278]:
NOVA_BASE_FRAUDES['MATRICULA']

Unnamed: 0,MATRICULA
0,17229588
1,17804014
2,17234771
3,17837656
4,17722316
...,...
225992,17865556
225993,17511826
225994,17920302
225995,17801545


**FUNCTION**

This function removes duplicated rows from a DataFrame based on the MATRICULA column (col_matricula), keeping only the first occurrence of each MATRICULA.

In [279]:
def excluirMatriculasRepetidas(df, col_matricula):

    df_sem_duplicatas = df.drop_duplicates(subset=[col_matricula], keep='first')

    return df_sem_duplicatas
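
To make the two steps concrete (counting occurrences per MATRICULA, then keeping one row per MATRICULA), here is a small sketch on invented identifiers. It also cross-checks the result against `value_counts`, which is an alternative way to obtain the same counts.

```python
import pandas as pd

# Invented identifiers: 10 appears three times, 20 twice, 30 once.
toy = pd.DataFrame({"MATRICULA": [10, 20, 10, 30, 10, 20]})

# Step 1: attach the per-identifier count to every row.
toy["CONTAGEM_MATRICULA"] = (
    toy.groupby("MATRICULA")["MATRICULA"].transform("count")
)

# Step 2: keep the first row for each identifier.
dedup = toy.drop_duplicates(subset=["MATRICULA"], keep="first")

# Cross-check: value_counts yields the same numbers.
counts = toy["MATRICULA"].value_counts()

print(dedup.to_dict("list"))
```

Because the count is computed before de-duplication, the surviving row for each identifier still carries the total number of records.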

**OUTPUT**

In [280]:
NOVA_BASE_FRAUDES = excluirMatriculasRepetidas(NOVA_BASE_FRAUDES, 'MATRICULA')

In [281]:
NOVA_BASE_FRAUDES['MATRICULA']

Unnamed: 0,MATRICULA
0,17229588
1,17804014
2,17234771
3,17837656
4,17722316
...,...
225986,17821300
225989,17276350
225990,17108653
225995,17801545


### 3.1.3) Create the FRAUDADOR Column

The function below flags the MATRICULA values for which fraud was identified.

**INPUT**

In [282]:
NOVA_BASE_FRAUDES

Unnamed: 0,MATRICULA,DESCRICAO,DATACONCLUSAO,CONTAGEM_MATRICULA
0,17229588,IRREGULARIDADE IDENTIFICADA,06-11-2023,1
1,17804014,IRREGULARIDADE IDENTIFICADA,13-02-2024,10
2,17234771,IRREGULARIDADE IDENTIFICADA,08-01-2024,6
3,17837656,IRREGULARIDADE IDENTIFICADA,03-01-2024,1
4,17722316,IRREGULARIDADE IDENTIFICADA,24-01-2024,2
...,...,...,...,...
225986,17821300,IRREGULARIDADE IDENTIFICADA,08-01-2020,1
225989,17276350,IRREGULARIDADE IDENTIFICADA,06-01-2020,1
225990,17108653,IRREGULARIDADE IDENTIFICADA,06-01-2020,1
225995,17801545,IRREGULARIDADE IDENTIFICADA,30-12-2022,1


**FUNCTION**

This function adds a new column to the DataFrame (df), named as given in nome_coluna, and fills every row with the value 1. Since every row in the fraud dataset corresponds to an identified fraud, all entries are marked as fraudsters.

In [283]:
def definirFraudador(df, nome_coluna):

    df[nome_coluna] = 1

    return df

**OUTPUT**

In [284]:
NOVA_BASE_FRAUDES = definirFraudador(NOVA_BASE_FRAUDES, 'FRAUDADOR')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[nome_coluna] = 1


In [285]:
NOVA_BASE_FRAUDES

Unnamed: 0,MATRICULA,DESCRICAO,DATACONCLUSAO,CONTAGEM_MATRICULA,FRAUDADOR
0,17229588,IRREGULARIDADE IDENTIFICADA,06-11-2023,1,1
1,17804014,IRREGULARIDADE IDENTIFICADA,13-02-2024,10,1
2,17234771,IRREGULARIDADE IDENTIFICADA,08-01-2024,6,1
3,17837656,IRREGULARIDADE IDENTIFICADA,03-01-2024,1,1
4,17722316,IRREGULARIDADE IDENTIFICADA,24-01-2024,2,1
...,...,...,...,...,...
225986,17821300,IRREGULARIDADE IDENTIFICADA,08-01-2020,1,1
225989,17276350,IRREGULARIDADE IDENTIFICADA,06-01-2020,1,1
225990,17108653,IRREGULARIDADE IDENTIFICADA,06-01-2020,1,1
225995,17801545,IRREGULARIDADE IDENTIFICADA,30-12-2022,1,1


### 3.1.4) Select Variables

In [286]:
NOVA_BASE_FRAUDES = NOVA_BASE_FRAUDES[['MATRICULA', 'CONTAGEM_MATRICULA', 'FRAUDADOR']]

In [287]:
NOVA_BASE_FRAUDES

Unnamed: 0,MATRICULA,CONTAGEM_MATRICULA,FRAUDADOR
0,17229588,1,1
1,17804014,10,1
2,17234771,6,1
3,17837656,1,1
4,17722316,2,1
...,...,...,...
225986,17821300,1,1
225989,17276350,1,1
225990,17108653,1,1
225995,17801545,1,1


## 3.2) New Consumption Dataset

### 3.2.1) Select Year and Month

The function below groups the consumption readings by period (year and month).

**INPUT**

In [288]:
DF_BASE_CONSUMO_TRATADA['DAT_LEITURA']

Unnamed: 0,DAT_LEITURA
0,2022-02-08
4,2022-11-14
9,2022-02-10
18,2022-02-26
27,2022-02-10
...,...
4492094,2022-03-18
4492107,2022-03-10
4492108,2022-03-12
4492112,2022-03-18


**FUNCTION**

This function extracts the year and month from a date column of the DataFrame, creating a new column named ANOMES in YYYYMM format.

In [289]:
def extrairAnomes(df, col):

    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d')

    df['ANOMES'] = df[col].dt.strftime('%Y%m')

    return df
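
As a small worked example of the year-month extraction, the sketch below applies the same two steps to two invented reading dates. It also shows a hypothetical integer variant (`ANOMES_INT`, not part of the notebook), which some may find handier for sorting or arithmetic.

```python
import pandas as pd

toy = pd.DataFrame({"DAT_LEITURA": ["2022-02-08", "2022-11-14"]})

# Parse the text dates, then render year and month as a YYYYMM string.
toy["DAT_LEITURA"] = pd.to_datetime(toy["DAT_LEITURA"], format="%Y-%m-%d")
toy["ANOMES"] = toy["DAT_LEITURA"].dt.strftime("%Y%m")

# Hypothetical alternative: the same key as an integer.
toy["ANOMES_INT"] = toy["DAT_LEITURA"].dt.year * 100 + toy["DAT_LEITURA"].dt.month

print(toy[["ANOMES", "ANOMES_INT"]].to_dict("list"))
```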

**OUTPUT**

In [290]:
NOVA_BASE_CONSUMO = extrairAnomes(DF_BASE_CONSUMO_TRATADA, 'DAT_LEITURA')

In [291]:
NOVA_BASE_CONSUMO['ANOMES']

Unnamed: 0,ANOMES
0,202202
4,202211
9,202202
18,202202
27,202202
...,...
4492094,202203
4492107,202203
4492108,202203
4492112,202203


### 3.2.2) Variable Selection

In [292]:
NOVA_BASE_CONSUMO

Unnamed: 0,MATRICULA,DAT_LEITURA,CONS_MEDIDO,CATEGORIA,COD_LATITUDE,COD_LONGITUDE,ANOMES
0,17224682,2022-02-08,0.0,1,-20.493049,-54.669201,202202
4,17086970,2022-11-14,6.0,1,-20.470276,-54.629742,202211
9,17600903,2022-02-10,0.0,1,-20.460644,-54.620770,202202
18,17701434,2022-02-26,0.0,1,-20.476194,-54.662932,202202
27,17798530,2022-02-10,0.0,1,-20.494225,-54.631010,202202
...,...,...,...,...,...,...,...
4492094,17959364,2022-03-18,0.0,1,-20.465770,-54.615450,202203
4492107,17197853,2022-03-10,0.0,1,-20.449548,-54.616009,202203
4492108,17105115,2022-03-12,0.0,1,-20.479587,-54.612049,202203
4492112,17132023,2022-03-18,0.0,1,-20.464203,-54.615575,202203


In [293]:
BASE_VARIACAO_CONSUMO = NOVA_BASE_CONSUMO[['MATRICULA', 'CONS_MEDIDO', 'ANOMES']]

In [294]:
BASE_VARIACAO_CONSUMO

Unnamed: 0,MATRICULA,CONS_MEDIDO,ANOMES
0,17224682,0.0,202202
4,17086970,6.0,202211
9,17600903,0.0,202202
18,17701434,0.0,202202
27,17798530,0.0,202202
...,...,...,...
4492094,17959364,0.0,202203
4492107,17197853,0.0,202203
4492108,17105115,0.0,202203
4492112,17132023,0.0,202203


In [295]:
BASE_GERAL_CONSUMO = NOVA_BASE_CONSUMO[['MATRICULA', 'CATEGORIA', 'COD_LATITUDE', 'COD_LONGITUDE']]

In [296]:
BASE_GERAL_CONSUMO

Unnamed: 0,MATRICULA,CATEGORIA,COD_LATITUDE,COD_LONGITUDE
0,17224682,1,-20.493049,-54.669201
4,17086970,1,-20.470276,-54.629742
9,17600903,1,-20.460644,-54.620770
18,17701434,1,-20.476194,-54.662932
27,17798530,1,-20.494225,-54.631010
...,...,...,...,...
4492094,17959364,1,-20.465770,-54.615450
4492107,17197853,1,-20.449548,-54.616009
4492108,17105115,1,-20.479587,-54.612049
4492112,17132023,1,-20.464203,-54.615575


### 3.2.3) Remove Duplicate Records

The step below removes duplicated MATRICULA values.

**INPUT**

In [297]:
BASE_GERAL_CONSUMO['MATRICULA']

Unnamed: 0,MATRICULA
0,17224682
4,17086970
9,17600903
18,17701434
27,17798530
...,...
4492094,17959364
4492107,17197853
4492108,17105115
4492112,17132023


**FUNCTION**

The function `excluirMatriculasRepetidas` (defined earlier) is reused to remove the repeated MATRICULA values.

**OUTPUT**

In [298]:
BASE_GERAL_CONSUMO = excluirMatriculasRepetidas(BASE_GERAL_CONSUMO, 'MATRICULA')

In [299]:
BASE_GERAL_CONSUMO['MATRICULA']

Unnamed: 0,MATRICULA
0,17224682
4,17086970
9,17600903
18,17701434
27,17798530
...,...
4141746,17844213
4163861,17845017
4178019,17981603
4222419,17821379


### 3.2.4) Monthly Consumption Variation

This step builds a monthly consumption time series for each MATRICULA.

**INPUT**

In [300]:
BASE_VARIACAO_CONSUMO

Unnamed: 0,MATRICULA,CONS_MEDIDO,ANOMES
0,17224682,0.0,202202
4,17086970,6.0,202211
9,17600903,0.0,202202
18,17701434,0.0,202202
27,17798530,0.0,202202
...,...,...,...
4492094,17959364,0.0,202203
4492107,17197853,0.0,202203
4492108,17105115,0.0,202203
4492112,17132023,0.0,202203


**FUNCTION**

This function reshapes the DataFrame so that each MATRICULA has a single row, with consumption spread across one column per month (each column is named CONSUMO_[ANOMES]). When a MATRICULA has more than one reading in the same month, the readings are summed.

In [301]:
def organizarConsumoPorMatricula(df, col_matricula, col_consumo, col_anomes):

    df_pivot = df.pivot_table(index=col_matricula, columns=col_anomes, values=col_consumo, aggfunc='sum')

    # Rename the columns to CONSUMO_<ANOMES>
    df_pivot.columns = [f'CONSUMO_{int(col)}' for col in df_pivot.columns]

    df_pivot = df_pivot.reset_index()

    return df_pivot
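
The long-to-wide reshape can be seen on a tiny invented table. In the sketch below, identifier 1 has two readings in 202202 (which get summed) and identifier 2 has no reading in 202201 (which becomes NaN). The numbers are illustrative only.

```python
import pandas as pd

long_df = pd.DataFrame({
    "MATRICULA": [1, 1, 1, 2],
    "ANOMES": ["202201", "202202", "202202", "202202"],
    "CONS_MEDIDO": [5.0, 3.0, 4.0, 7.0],
})

# One row per MATRICULA, one column per month, summing repeated readings.
wide = long_df.pivot_table(
    index="MATRICULA", columns="ANOMES", values="CONS_MEDIDO", aggfunc="sum"
)
wide.columns = [f"CONSUMO_{int(col)}" for col in wide.columns]
wide = wide.reset_index()

print(wide)
```

Months with no reading stay as NaN at this stage, which matches the empty cells visible in the notebook output for recently created connections.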

**OUTPUT**

In [302]:
BASE_VARIACAO_CONSUMO = organizarConsumoPorMatricula(BASE_VARIACAO_CONSUMO, 'MATRICULA', 'CONS_MEDIDO', 'ANOMES')

In [303]:
BASE_VARIACAO_CONSUMO

Unnamed: 0,MATRICULA,CONSUMO_202201,CONSUMO_202202,CONSUMO_202203,CONSUMO_202204,CONSUMO_202205,CONSUMO_202206,CONSUMO_202207,CONSUMO_202208,CONSUMO_202209,CONSUMO_202210,CONSUMO_202211,CONSUMO_202212
0,17075331,11.0,0.0,20.0,9.0,7.0,8.0,7.0,9.0,6.0,0.0,0.0,26.0
1,17075333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,17075335,6.0,2.0,2.0,5.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0
3,17075337,2.0,3.0,2.0,1.0,7.0,3.0,4.0,4.0,6.0,5.0,5.0,4.0
4,17075338,85.0,14.0,13.0,18.0,19.0,11.0,11.0,13.0,13.0,12.0,8.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
33089,17985258,,,,,,,,,,,,1.0
33090,17985328,,,,,,,,,,,,0.0
33091,17985331,,,,,,,,,,,,0.0
33092,17985332,,,,,,,,,,,,0.0


### 3.2.5) Statistical Variables

The function below computes summary statistics of consumption.

**INPUT**

In [304]:
BASE_VARIACAO_CONSUMO

Unnamed: 0,MATRICULA,CONSUMO_202201,CONSUMO_202202,CONSUMO_202203,CONSUMO_202204,CONSUMO_202205,CONSUMO_202206,CONSUMO_202207,CONSUMO_202208,CONSUMO_202209,CONSUMO_202210,CONSUMO_202211,CONSUMO_202212
0,17075331,11.0,0.0,20.0,9.0,7.0,8.0,7.0,9.0,6.0,0.0,0.0,26.0
1,17075333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,17075335,6.0,2.0,2.0,5.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0
3,17075337,2.0,3.0,2.0,1.0,7.0,3.0,4.0,4.0,6.0,5.0,5.0,4.0
4,17075338,85.0,14.0,13.0,18.0,19.0,11.0,11.0,13.0,13.0,12.0,8.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
33089,17985258,,,,,,,,,,,,1.0
33090,17985328,,,,,,,,,,,,0.0
33091,17985331,,,,,,,,,,,,0.0
33092,17985332,,,,,,,,,,,,0.0


**FUNCTION**

This function computes the mean, standard deviation, minimum, and maximum consumption for each MATRICULA in the DataFrame, based on the consumption columns organized by ANOMES. The col_matricula parameter is accepted but not used in the function body.

In [305]:
def calcularEstatisticasDeConsumo(df, col_matricula):

    colunas_consumo = [col for col in df.columns if col.startswith('CONSUMO_')]

    # Mean consumption across the monthly columns (NaN months are skipped)
    df['MEDIA_CONSUMO'] = df[colunas_consumo].mean(axis=1, skipna=True).round(2)

    # Sample standard deviation of consumption (NaN when only one month is available)
    df['DESVIO_PADRAO_CONSUMO'] = df[colunas_consumo].std(axis=1, skipna=True).round(2)

    # Minimum monthly consumption
    df['CONSUMO_MINIMO'] = df[colunas_consumo].min(axis=1, skipna=True)

    # Maximum monthly consumption
    df['CONSUMO_MAXIMO'] = df[colunas_consumo].max(axis=1, skipna=True)

    return df
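
The NaN values that appear in DESVIO_PADRAO_CONSUMO for connections with a single month of data follow from pandas' default of computing the sample standard deviation (`ddof=1`), which is undefined for one observation. The sketch below reproduces this on invented numbers and, for comparison only, also computes the population standard deviation (`ddof=0`), which is 0 in that case.

```python
import pandas as pd

wide = pd.DataFrame({
    "MATRICULA": [1, 2],
    "CONSUMO_202211": [2.0, float("nan")],  # identifier 2 has no reading here
    "CONSUMO_202212": [4.0, 1.0],
})

cols = [c for c in wide.columns if c.startswith("CONSUMO_")]

mean = wide[cols].mean(axis=1)            # NaN months are skipped
std_sample = wide[cols].std(axis=1)       # default ddof=1
std_pop = wide[cols].std(axis=1, ddof=0)  # population variant, for comparison

print(mean.tolist(), std_sample.tolist(), std_pop.tolist())
```

Which variant is appropriate depends on how the feature is used downstream; the notebook keeps the default and later replaces the resulting NaN values with 0.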

**OUTPUT**

In [306]:
BASE_VARIACAO_CONSUMO = calcularEstatisticasDeConsumo(BASE_VARIACAO_CONSUMO, 'MATRICULA')

In [307]:
BASE_VARIACAO_CONSUMO[['MEDIA_CONSUMO', 'DESVIO_PADRAO_CONSUMO', 'CONSUMO_MINIMO', 'CONSUMO_MAXIMO']]

Unnamed: 0,MEDIA_CONSUMO,DESVIO_PADRAO_CONSUMO,CONSUMO_MINIMO,CONSUMO_MAXIMO
0,8.58,7.82,0.0,26.0
1,0.00,0.00,0.0,0.0
2,1.58,2.02,0.0,6.0
3,3.83,1.75,1.0,7.0
4,18.92,21.04,8.0,85.0
...,...,...,...,...
33089,1.00,,1.0,1.0
33090,0.00,,0.0,0.0
33091,0.00,,0.0,0.0
33092,0.00,,0.0,0.0


## 3.3) Final Dataset

### 3.3.1) Joining the Datasets

The function below joins the consumption datasets with the fraud dataset.

**INPUT**

Dataset 1: Consumption Variation per MATRICULA

In [308]:
BASE_VARIACAO_CONSUMO

Unnamed: 0,MATRICULA,CONSUMO_202201,CONSUMO_202202,CONSUMO_202203,CONSUMO_202204,CONSUMO_202205,CONSUMO_202206,CONSUMO_202207,CONSUMO_202208,CONSUMO_202209,CONSUMO_202210,CONSUMO_202211,CONSUMO_202212,MEDIA_CONSUMO,DESVIO_PADRAO_CONSUMO,CONSUMO_MINIMO,CONSUMO_MAXIMO
0,17075331,11.0,0.0,20.0,9.0,7.0,8.0,7.0,9.0,6.0,0.0,0.0,26.0,8.58,7.82,0.0,26.0
1,17075333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
2,17075335,6.0,2.0,2.0,5.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,1.58,2.02,0.0,6.0
3,17075337,2.0,3.0,2.0,1.0,7.0,3.0,4.0,4.0,6.0,5.0,5.0,4.0,3.83,1.75,1.0,7.0
4,17075338,85.0,14.0,13.0,18.0,19.0,11.0,11.0,13.0,13.0,12.0,8.0,10.0,18.92,21.04,8.0,85.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33089,17985258,,,,,,,,,,,,1.0,1.00,,1.0,1.0
33090,17985328,,,,,,,,,,,,0.0,0.00,,0.0,0.0
33091,17985331,,,,,,,,,,,,0.0,0.00,,0.0,0.0
33092,17985332,,,,,,,,,,,,0.0,0.00,,0.0,0.0


Dataset 2: General Information per MATRICULA

In [309]:
BASE_GERAL_CONSUMO

Unnamed: 0,MATRICULA,CATEGORIA,COD_LATITUDE,COD_LONGITUDE
0,17224682,1,-20.493049,-54.669201
4,17086970,1,-20.470276,-54.629742
9,17600903,1,-20.460644,-54.620770
18,17701434,1,-20.476194,-54.662932
27,17798530,1,-20.494225,-54.631010
...,...,...,...,...
4141746,17844213,1,-20.431949,-54.653290
4163861,17845017,1,-20.523607,-54.666714
4178019,17981603,1,-20.549096,-54.622830
4222419,17821379,1,-20.532288,-54.671618


Dataset 3: Fraud Information per MATRICULA

In [310]:
NOVA_BASE_FRAUDES

Unnamed: 0,MATRICULA,CONTAGEM_MATRICULA,FRAUDADOR
0,17229588,1,1
1,17804014,10,1
2,17234771,6,1
3,17837656,1,1
4,17722316,2,1
...,...,...,...
225986,17821300,1,1
225989,17276350,1,1
225990,17108653,1,1
225995,17801545,1,1


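**FUNCTION**

This function joins three DataFrames (df1, df2, df3) on a common key column, here the MATRICULA column (col_matricula). df1 and df2 are combined with an outer join, and the result is then left-joined with df3, so MATRICULA values that appear only in df3 are not added to the result.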

In [311]:
def unirBases(df1, df2, df3, col_matricula):
    # Outer join of datasets 1 and 2
    df_merged = pd.merge(df1, df2, on=col_matricula, how='outer')

    # Left join of the result with dataset 3
    df_merged = pd.merge(df_merged, df3, on=col_matricula, how='left')

    return df_merged
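
The effect of the two joins can be traced on three tiny invented tables. In the sketch below, identifier 3 has a fraud record and identifier 99 exists only in the fraud table. The `indicator=True` argument is an optional addition for inspection: it adds a `_merge` column showing where each row came from.

```python
import pandas as pd

consumo = pd.DataFrame({"MATRICULA": [1, 2, 3], "MEDIA_CONSUMO": [8.58, 0.0, 1.58]})
geral = pd.DataFrame({"MATRICULA": [1, 2, 3], "CATEGORIA": [1, 1, 2]})
fraudes = pd.DataFrame({
    "MATRICULA": [3, 99],
    "CONTAGEM_MATRICULA": [2, 1],
    "FRAUDADOR": [1, 1],
})

# Outer join keeps every identifier found in either consumption table.
merged = pd.merge(consumo, geral, on="MATRICULA", how="outer")

# Left join keeps only identifiers already present; 99 is dropped,
# and rows without a fraud record get NaN in the fraud columns.
merged = pd.merge(merged, fraudes, on="MATRICULA", how="left", indicator=True)

print(merged)
```

The NaN values in CONTAGEM_MATRICULA and FRAUDADOR for non-matching rows are what turn those columns into float, as seen in the notebook output.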

**OUTPUT**

In [312]:
BASE_FINAL = unirBases(BASE_VARIACAO_CONSUMO, BASE_GERAL_CONSUMO, NOVA_BASE_FRAUDES, 'MATRICULA')

In [313]:
BASE_FINAL

Unnamed: 0,MATRICULA,CONSUMO_202201,CONSUMO_202202,CONSUMO_202203,CONSUMO_202204,CONSUMO_202205,CONSUMO_202206,CONSUMO_202207,CONSUMO_202208,CONSUMO_202209,...,CONSUMO_202212,MEDIA_CONSUMO,DESVIO_PADRAO_CONSUMO,CONSUMO_MINIMO,CONSUMO_MAXIMO,CATEGORIA,COD_LATITUDE,COD_LONGITUDE,CONTAGEM_MATRICULA,FRAUDADOR
0,17075331,11.0,0.0,20.0,9.0,7.0,8.0,7.0,9.0,6.0,...,26.0,8.58,7.82,0.0,26.0,1,-20.452847,-54.599802,,
1,17075333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.0,1,-20.452863,-54.599894,,
2,17075335,6.0,2.0,2.0,5.0,0.0,1.0,2.0,1.0,0.0,...,0.0,1.58,2.02,0.0,6.0,1,-20.452938,-54.600345,1.0,1.0
3,17075337,2.0,3.0,2.0,1.0,7.0,3.0,4.0,4.0,6.0,...,4.0,3.83,1.75,1.0,7.0,1,-20.452982,-54.600607,,
4,17075338,85.0,14.0,13.0,18.0,19.0,11.0,11.0,13.0,13.0,...,10.0,18.92,21.04,8.0,85.0,1,-20.453001,-54.600722,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33089,17985258,,,,,,,,,,...,1.0,1.00,,1.0,1.0,1,-20.536574,-54.637979,1.0,1.0
33090,17985328,,,,,,,,,,...,0.0,0.00,,0.0,0.0,1,-20.458942,-54.685955,,
33091,17985331,,,,,,,,,,...,0.0,0.00,,0.0,0.0,1,-20.458953,-54.685934,,
33092,17985332,,,,,,,,,,...,0.0,0.00,,0.0,0.0,1,-20.458971,-54.685898,,


### 3.3.2) Replacing NaN Values

The function below replaces missing values with 0.

**INPUT**

In [314]:
BASE_FINAL

Unnamed: 0,MATRICULA,CONSUMO_202201,CONSUMO_202202,CONSUMO_202203,CONSUMO_202204,CONSUMO_202205,CONSUMO_202206,CONSUMO_202207,CONSUMO_202208,CONSUMO_202209,...,CONSUMO_202212,MEDIA_CONSUMO,DESVIO_PADRAO_CONSUMO,CONSUMO_MINIMO,CONSUMO_MAXIMO,CATEGORIA,COD_LATITUDE,COD_LONGITUDE,CONTAGEM_MATRICULA,FRAUDADOR
0,17075331,11.0,0.0,20.0,9.0,7.0,8.0,7.0,9.0,6.0,...,26.0,8.58,7.82,0.0,26.0,1,-20.452847,-54.599802,,
1,17075333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.0,1,-20.452863,-54.599894,,
2,17075335,6.0,2.0,2.0,5.0,0.0,1.0,2.0,1.0,0.0,...,0.0,1.58,2.02,0.0,6.0,1,-20.452938,-54.600345,1.0,1.0
3,17075337,2.0,3.0,2.0,1.0,7.0,3.0,4.0,4.0,6.0,...,4.0,3.83,1.75,1.0,7.0,1,-20.452982,-54.600607,,
4,17075338,85.0,14.0,13.0,18.0,19.0,11.0,11.0,13.0,13.0,...,10.0,18.92,21.04,8.0,85.0,1,-20.453001,-54.600722,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33089,17985258,,,,,,,,,,...,1.0,1.00,,1.0,1.0,1,-20.536574,-54.637979,1.0,1.0
33090,17985328,,,,,,,,,,...,0.0,0.00,,0.0,0.0,1,-20.458942,-54.685955,,
33091,17985331,,,,,,,,,,...,0.0,0.00,,0.0,0.0,1,-20.458953,-54.685934,,
33092,17985332,,,,,,,,,,...,0.0,0.00,,0.0,0.0,1,-20.458971,-54.685898,,


**FUNCTION**

This function replaces every NaN value in the DataFrame (df) with 0. The replacement value is fixed inside the function and is not a parameter.

In [315]:
def substituirValoresNaN(df):

    return df.fillna(0)
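
Filling the whole DataFrame with 0 also turns months with no reading into months with zero consumption. If that distinction ever needs to be preserved, `fillna` also accepts a dictionary that restricts the replacement to chosen columns. The sketch below is a hypothetical variant on invented data, not the approach used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "CONSUMO_202201": [float("nan"), 5.0],
    "CONTAGEM_MATRICULA": [float("nan"), 2.0],
    "FRAUDADOR": [float("nan"), 1.0],
})

# Fill only the fraud-related columns; consumption gaps stay as NaN.
filled = toy.fillna({"CONTAGEM_MATRICULA": 0, "FRAUDADOR": 0})

# With no NaN left in these two columns, they can be cast to integer.
filled = filled.astype({"CONTAGEM_MATRICULA": "int64", "FRAUDADOR": "int64"})

print(filled)
```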

**OUTPUT**

In [316]:
BASE_FINAL = substituirValoresNaN(BASE_FINAL)

In addition, CONTAGEM_MATRICULA and FRAUDADOR are stored as float (a side effect of the NaN values introduced by the left join) and must be converted to integer.

In [317]:
BASE_FINAL = converterInteiro(BASE_FINAL, 'CONTAGEM_MATRICULA')

In [318]:
BASE_FINAL = converterInteiro(BASE_FINAL, 'FRAUDADOR')

In [319]:
BASE_FINAL

Unnamed: 0,MATRICULA,CONSUMO_202201,CONSUMO_202202,CONSUMO_202203,CONSUMO_202204,CONSUMO_202205,CONSUMO_202206,CONSUMO_202207,CONSUMO_202208,CONSUMO_202209,...,CONSUMO_202212,MEDIA_CONSUMO,DESVIO_PADRAO_CONSUMO,CONSUMO_MINIMO,CONSUMO_MAXIMO,CATEGORIA,COD_LATITUDE,COD_LONGITUDE,CONTAGEM_MATRICULA,FRAUDADOR
0,17075331,11.0,0.0,20.0,9.0,7.0,8.0,7.0,9.0,6.0,...,26.0,8.58,7.82,0.0,26.0,1,-20.452847,-54.599802,0,0
1,17075333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.0,1,-20.452863,-54.599894,0,0
2,17075335,6.0,2.0,2.0,5.0,0.0,1.0,2.0,1.0,0.0,...,0.0,1.58,2.02,0.0,6.0,1,-20.452938,-54.600345,1,1
3,17075337,2.0,3.0,2.0,1.0,7.0,3.0,4.0,4.0,6.0,...,4.0,3.83,1.75,1.0,7.0,1,-20.452982,-54.600607,0,0
4,17075338,85.0,14.0,13.0,18.0,19.0,11.0,11.0,13.0,13.0,...,10.0,18.92,21.04,8.0,85.0,1,-20.453001,-54.600722,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33089,17985258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.00,0.00,1.0,1.0,1,-20.536574,-54.637979,1,1
33090,17985328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.0,1,-20.458942,-54.685955,0,0
33091,17985331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.0,1,-20.458953,-54.685934,0,0
33092,17985332,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.0,1,-20.458971,-54.685898,0,0


### 3.3.3) Export the DataFrame to CSV

The function below exports the DataFrame (df) to a CSV file.

**INPUT**

In [320]:
BASE_VERSAO_FINAL = BASE_FINAL[['MATRICULA', 'MEDIA_CONSUMO', 'DESVIO_PADRAO_CONSUMO', 'CONSUMO_MINIMO', 'CONSUMO_MAXIMO', 'CATEGORIA', 'CONTAGEM_MATRICULA', 'FRAUDADOR']]

In [321]:
BASE_VERSAO_FINAL

Unnamed: 0,MATRICULA,MEDIA_CONSUMO,DESVIO_PADRAO_CONSUMO,CONSUMO_MINIMO,CONSUMO_MAXIMO,CATEGORIA,CONTAGEM_MATRICULA,FRAUDADOR
0,17075331,8.58,7.82,0.0,26.0,1,0,0
1,17075333,0.00,0.00,0.0,0.0,1,0,0
2,17075335,1.58,2.02,0.0,6.0,1,1,1
3,17075337,3.83,1.75,1.0,7.0,1,0,0
4,17075338,18.92,21.04,8.0,85.0,1,0,0
...,...,...,...,...,...,...,...,...
33089,17985258,1.00,0.00,1.0,1.0,1,1,1
33090,17985328,0.00,0.00,0.0,0.0,1,0,0
33091,17985331,0.00,0.00,0.0,0.0,1,0,0
33092,17985332,0.00,0.00,0.0,0.0,1,0,0


**FUNCTION**

This function exports the DataFrame (df) to a CSV file named nome_arquivo, without writing the row index.

In [322]:
def exportarCSV(df, nome_arquivo):

    df.to_csv(nome_arquivo, index=False)
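
A quick way to confirm that an export like this is readable is to write the file and load it back. The sketch below does so with an invented two-row table in a temporary directory, so nothing is written to the working folder. The temporary path is an assumption of the example, not the location used in the notebook.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"MATRICULA": [17075331, 17075333], "FRAUDADOR": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "DADOS_PROCESSADOS.csv")

    # index=False keeps the row labels out of the file.
    toy.to_csv(path, index=False)

    # Read the file back to check that columns and values survive.
    restored = pd.read_csv(path)

print(restored)
```

Since `to_csv` is called here with the default comma separator, the file can be read back without the `delimiter=';'` argument that the source files required.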

**OUTPUT**

In [323]:
nome_arquivo = 'DADOS_PROCESSADOS.csv'

In [324]:
exportarCSV(BASE_VERSAO_FINAL, nome_arquivo)

In [325]:
print(f'Dados exportados para {nome_arquivo} com sucesso')

Dados exportados para DADOS_PROCESSADOS.csv com sucesso
