## Exploring the Data

In [1]:
import os
import polars as pl
from transform_data import *

Let's read the data corresponding to a single state ('AC' in this case) to explore it.

In [2]:
csv_files_dir = '../data/csv_files'
file = 'AC.csv'
file_path = os.path.join(csv_files_dir, file)
df = read_csv(file_path, separator = ',')

Let's check the different types of 'ClassInfraFisica' present.

In [3]:
df.group_by('ClassInfraFisica').count()

ClassInfraFisica,count
str,u32
"""Rooftop""",129
"""Streetlevel""",15
"""Greenfield""",2532
"""Streetlevel """,9
"""Ran Sharing""",39
,6108
"""Indoor""",7


'Streetlevel ' (with a trailing space) shows up as a separate category from 'Streetlevel'. Let's fix the typo:

In [4]:
df = replace_values(df, 'ClassInfraFisica', 'Streetlevel ', 'Streetlevel')

In [5]:
df.group_by('ClassInfraFisica').agg(pl.col('ClassInfraFisica').count().alias('Count'))

ClassInfraFisica,Count
str,u32
,6108
"""Indoor""",7
"""Greenfield""",2532
"""Rooftop""",129
"""Streetlevel""",24
"""Ran Sharing""",39


Now we have to check whether each station (identified by the column 'NumEstacao') corresponds to a single type of 'ClassInfraFisica', which is the behavior we assume.

In [6]:
count_unique_values = df.group_by('NumEstacao').agg(pl.col('ClassInfraFisica').n_unique().alias('unique_count'))
instances_with_different_values = count_unique_values.filter(count_unique_values['unique_count'] > 1)
print(instances_with_different_values)

shape: (162, 2)
┌────────────┬──────────────┐
│ NumEstacao ┆ unique_count │
│ ---        ┆ ---          │
│ i64        ┆ u32          │
╞════════════╪══════════════╡
│ 690407084  ┆ 2            │
│ 692271724  ┆ 2            │
│ 686180232  ┆ 2            │
│ 693153008  ┆ 2            │
│ …          ┆ …            │
│ 691365415  ┆ 2            │
│ 696304279  ┆ 2            │
│ 441592007  ┆ 2            │
│ 1011149831 ┆ 2            │
└────────────┴──────────────┘


We found 162 stations that have two distinct 'ClassInfraFisica' values. If we check further, though, we'll see that, at least for the station inspected here, the second value is just 'null' paired with a real type, which does not compromise the assumption.

In [7]:
df.filter(df['NumEstacao'] == 699785804)[['NumEstacao', 'ClassInfraFisica']]

NumEstacao,ClassInfraFisica
i64,str
699785804,
699785804,
699785804,
699785804,"""Greenfield"""
699785804,"""Greenfield"""
699785804,"""Greenfield"""
699785804,"""Greenfield"""
699785804,"""Greenfield"""
699785804,"""Greenfield"""
699785804,


That's as far as I can go with polars. Let's move to pandas.

In [23]:
df = df.to_pandas()

Let's group by 'NumEstacao', keeping all the information present in the rows: a column becomes a set of values, or a single value if all rows of the station agree.

In [24]:
def set_aggregation(x):
    # Collapse a group to its single value when all rows agree;
    # otherwise keep every distinct value as a set.
    if len(set(x)) == 1:
        return x.iloc[0]
    return set(x)

dfg = df.groupby('NumEstacao').agg(set_aggregation)
dfg

NumEstacao,Status.state,NomeEntidade,NumFistel,NumServico,NumAto,EnderecoEstacao,EndComplemento,SiglaUf,CodMunicipio,DesignacaoEmissao,...,Latitude,Longitude,CodDebitoTFI,DataLicenciamento,DataPrimeiroLicenciamento,NumRede,_id,DataValidade,NumFistelAssociado,NomeEntidadeAssociado
19453,LIC-LIC-01,EMPRESA BRASILEIRA DE INFRA-ESTRUTURA AEROPORT...,11030016470,19,655742007.0,"BR 364, KM 18 - COA s/n AEROPORTO SBRB",,AC,1200401,16K0F3E,...,-9.9925,-67.804167,A,2016-03-03,2001-10-09,705,"{4d469248e6c2d201, 4d469248e6c2d1fd, 4d469248e...",2027-07-16,,
19461,LIC-LIC-01,EMPRESA BRASILEIRA DE INFRA-ESTRUTURA AEROPORT...,11030016470,19,18652020.0,"BR 364, KM 18 - COE s/n AEROPORTO SBRB",,AC,1200401,16K0F3E,...,-9.992778,-67.804444,A,2016-03-03,2001-10-09,705,"{4d469248e6c2d20e, 4d469248e6c2d208, 4d469248e...",2027-07-16,,
19470,LIC-LIC-01,EMPRESA BRASILEIRA DE INFRA-ESTRUTURA AEROPORT...,11030016470,19,18652020.0,"BR 364, KM 18 - SCI s/n AEROPORTO SBRB",,AC,1200401,16K0F3E,...,-9.992778,-67.804722,A,2016-03-03,2001-10-09,705,"{4d469248e6c2d211, 4d469248e6c2d219, 4d469248e...",2027-07-16,,
64300,LIC-LIC-01,TELEFONICA BRASIL S.A.,50409146366,10,"{105802021.0, 48422023.0, 59072012.0, 97432014...","RUA FLORIANO PEIXOTO, 358",,AC,1200401,"{5M00G9W, 100MG7W, 10M0G7W, 200KG7W, 5M00G7W, ...",...,-9.972025,-67.813556,G,2023-08-05,2000-11-27,,"{e18bd248beab29dc, e18bd248beab29dd, aadc95f8f...",2024-07-21,,
356271,LIC-LIC-01,CENTRAIS ELETRICAS DO NORTE DO BRASIL S/A,11030012059,19,623882006.0,AV NACOES UNIDAS S/N USINA 11 .,,AC,1200401,16K0F3E,...,-9.8,-67.8,A,2006-12-06,2003-05-06,26,"{4d469248e6b8b63f, 4d469248e6b8b63b, 4d469248e...",2036-08-30,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1015180580,LIC-LIC-01,TELEFONICA BRASIL S.A.,50417179405,19,68862019.0,BR 317,FAZENDA CAPIXABA,AC,1200179,40M0D7W,...,-10.469784,-67.753207,C,2023-08-23,2023-08-23,,"{7bbb95f8073b0fd5, 7bbb95f8073b0fd6}",2039-02-08,,
1015180598,LIC-LIC-01,TELEFONICA BRASIL S.A.,50417179405,19,68862019.0,BR 317,FAZENDA JABORANDI,AC,1200708,40M0D7W,...,-10.656322,-68.096036,C,2023-08-23,2023-08-23,,"{7bbb95f8073b0fd8, 7bbb95f8073b0fd9, 7bbb95f80...",2039-02-08,,
1015180601,LIC-LIC-01,TELEFONICA BRASIL S.A.,50417179405,19,68862019.0,,,AC,1200203,29M6D7W,...,-7.588361,-72.751278,C,2023-08-23,2023-08-23,,"{7bbb95f8073b0fdc, 7bbb95f8073b0fdb}",2039-02-08,,
1015180610,LIC-LIC-01,TELEFONICA BRASIL S.A.,50417179405,19,68862019.0,BR 307 KM 459,,AC,1200203,29M6D7W,...,-7.5795,-72.714222,C,2023-08-23,2023-08-23,,7bbb95f8073b0fdd,2039-02-08,,
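One caveat with `set(x)` is that missing values count as members of the set, so a station whose 'ClassInfraFisica' is a real type on some rows and null on others ends up as a set rather than a single value. Assuming nulls should simply be ignored (an assumption, not something the notebook states), a null-aware variant could look like the sketch below. The helper name `collapse_non_null` and the toy frame are hypothetical.

```python
import pandas as pd

def collapse_non_null(values):
    # Distinct values of the group, ignoring missing entries.
    distinct = values.dropna().unique()
    if len(distinct) == 0:
        return None          # the group holds only missing values
    if len(distinct) == 1:
        return distinct[0]   # all non-missing rows agree
    return set(distinct)     # genuinely different values

# Toy data, illustrative only.
toy = pd.DataFrame(
    {
        "NumEstacao": [1, 1, 2, 2],
        "ClassInfraFisica": ["Greenfield", None, "Rooftop", "Rooftop"],
        "NumAto": [10.0, 20.0, 30.0, 30.0],
    }
)

result = toy.groupby("NumEstacao").agg(collapse_non_null)
print(result)
```

With this variant, station 1 keeps 'Greenfield' as a plain value, while its two different 'NumAto' values are still kept as a set.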


In [20]:
print(f"Percentage of initial rows kept when grouping = {len(dfg)*100/len(df):.2f}%")

Percentage of initial rows kept when grouping = 16.97%


In [25]:
dfg.head()

NumEstacao,Status.state,NomeEntidade,NumFistel,NumServico,NumAto,EnderecoEstacao,EndComplemento,SiglaUf,CodMunicipio,DesignacaoEmissao,...,Latitude,Longitude,CodDebitoTFI,DataLicenciamento,DataPrimeiroLicenciamento,NumRede,_id,DataValidade,NumFistelAssociado,NomeEntidadeAssociado
19453,LIC-LIC-01,EMPRESA BRASILEIRA DE INFRA-ESTRUTURA AEROPORT...,11030016470,19,655742007.0,"BR 364, KM 18 - COA s/n AEROPORTO SBRB",,AC,1200401,16K0F3E,...,-9.9925,-67.804167,A,2016-03-03,2001-10-09,705.0,"{4d469248e6c2d201, 4d469248e6c2d1fd, 4d469248e...",2027-07-16,,
19461,LIC-LIC-01,EMPRESA BRASILEIRA DE INFRA-ESTRUTURA AEROPORT...,11030016470,19,18652020.0,"BR 364, KM 18 - COE s/n AEROPORTO SBRB",,AC,1200401,16K0F3E,...,-9.992778,-67.804444,A,2016-03-03,2001-10-09,705.0,"{4d469248e6c2d20e, 4d469248e6c2d208, 4d469248e...",2027-07-16,,
19470,LIC-LIC-01,EMPRESA BRASILEIRA DE INFRA-ESTRUTURA AEROPORT...,11030016470,19,18652020.0,"BR 364, KM 18 - SCI s/n AEROPORTO SBRB",,AC,1200401,16K0F3E,...,-9.992778,-67.804722,A,2016-03-03,2001-10-09,705.0,"{4d469248e6c2d211, 4d469248e6c2d219, 4d469248e...",2027-07-16,,
64300,LIC-LIC-01,TELEFONICA BRASIL S.A.,50409146366,10,"{105802021.0, 48422023.0, 59072012.0, 97432014...","RUA FLORIANO PEIXOTO, 358",,AC,1200401,"{5M00G9W, 100MG7W, 10M0G7W, 200KG7W, 5M00G7W, ...",...,-9.972025,-67.813556,G,2023-08-05,2000-11-27,,"{e18bd248beab29dc, e18bd248beab29dd, aadc95f8f...",2024-07-21,,
356271,LIC-LIC-01,CENTRAIS ELETRICAS DO NORTE DO BRASIL S/A,11030012059,19,623882006.0,AV NACOES UNIDAS S/N USINA 11 .,,AC,1200401,16K0F3E,...,-9.8,-67.8,A,2006-12-06,2003-05-06,26.0,"{4d469248e6b8b63f, 4d469248e6b8b63b, 4d469248e...",2036-08-30,,


Now we have a dataframe with all the original information but only about 17% of the rows, each of which represents a single station.

## Dropping Irrelevant Columns

Some columns clearly carry no useful information for classifying the stations, so we can get rid of them.

In [None]:
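# Pass the labels via `columns=`; by default `drop` looks for them among the row labels.
# The address column appears as 'EnderecoEstacao' in the preview above, so that name is assumed here.
dfg = dfg.drop(columns=['Status.state', 'NumFistel', 'EnderecoEstacao'])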