Importing the dataset and libraries.

In [1]:
import pandas as pd
import numpy as np

inspections = pd.read_csv('Data/sedec_vistorias.csv',sep=";")

Taking a look at the dataset.

In [2]:
inspections.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15772 entries, 0 to 15771
Data columns (total 10 columns):
ano                      15772 non-null int64
mes                      15772 non-null object
avaliador                15772 non-null object
vistoria_data            15505 non-null object
vistoria_risco           15772 non-null object
vistoria_localidade      15772 non-null object
vistoria_rpa_codigo      15772 non-null object
vistoria_microrregiao    15772 non-null object
vistoria_setor           15772 non-null object
processo_numero          15772 non-null int64
dtypes: int64(2), object(8)
memory usage: 739.4+ KB


In [3]:
inspections.head()

Unnamed: 0,ano,mes,avaliador,vistoria_data,vistoria_risco,vistoria_localidade,vistoria_rpa_codigo,vistoria_microrregiao,vistoria_setor,processo_numero
0,2012,12-dezembro,Engenheiro - Área Morro,2012-12-13 14:34:07,R3 Alto,ALTO DA JAQUEIRA - Jordão,6,6.2,6-SUL,8008472812
1,2012,12-dezembro,Engenheiro - Área Morro,2012-12-20 11:59:13,R2 Médio,Jardim Teresópoles,4,4.3,4-NORDESTE,8008985512
2,2012,12-dezembro,Engenheiro - Área Morro,2012-12-21 14:51:29,R4 Muito Alto,UR-05,6,6.2,6-SUL,8008835312
3,2012,12-dezembro,Engenheiro - Área Morro,2012-12-28 10:09:20,R4 Muito Alto,ALTO DA BRASILEIRA,3,3.3,3-NOROESTE,8006628012
4,2012,12-dezembro,Engenheiro - Área Morro,2012-12-13 10:50:13,Não informado,UR-02,6,6.2,6-SUL,8008455612


In [4]:
inspections['vistoria_risco'].value_counts()

Não informado    7757
R3 Alto          4770
R2 Médio         1899
R4 Muito Alto    1022
R1 Baixo          324
Name: vistoria_risco, dtype: int64

In [5]:
inspections['vistoria_localidade'].value_counts()

Não informada             6069
LAGOA ENCANTADA            691
JD. MONTE VERDE            283
UR 07                      229
JORDAO ALTO                208
                          ... 
ALTO NSª SRª DE FATIMA       1
CORREGO DA IMBAUBA           1
CGO SAO DOMINGOS SAVIO       1
ILCA MACHADO                 1
ur-03                        1
Name: vistoria_localidade, Length: 1217, dtype: int64

In [6]:
inspections['ano'].value_counts()

2013    6983
2012    2818
2014    1967
2015    1183
2017     779
2016     682
2018     667
2019     426
0        267
Name: ano, dtype: int64
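
The 267 rows with ano equal to 0 match the number of missing vistoria_data values reported by info() (15772 - 15505 = 267). The counts alone don't prove these are the same rows, so here is a minimal sketch of how to check it. The toy frame and its dates are made up for illustration; on the real data the same two masks would be built from inspections.

```python
import pandas as pd

# Toy stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'ano': [2012, 0, 2013, 0],
    'vistoria_data': ['2012-12-13 14:34:07', None, '2013-01-05 09:00:00', None],
})

missing_date = toy['vistoria_data'].isna()  # True where the inspection date is missing
zero_year = toy['ano'].eq(0)                # True where the year is the placeholder 0

# Cross-tabulate the two masks; off-diagonal counts would mean the rows differ.
print(pd.crosstab(zero_year, missing_date))
```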

We are going to drop every column we don't need. Most of them describe the inspection process and its log, so they aren't related to landslides. We also drop the date and location columns, since we'll get those values from a different dataset by cross-referencing the ID.

In [7]:
inspections = inspections.drop(columns=['ano', 'mes', 'vistoria_data', 'avaliador',
                                        'vistoria_localidade', 'vistoria_rpa_codigo',
                                        'vistoria_setor', 'vistoria_microrregiao'])
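
Since only two columns survive this step, an equivalent option is to select the columns to keep instead of listing the ones to drop. A minimal sketch on a toy frame (the frame holds only a subset of the real columns, and the names toy, kept and dropped are ours):

```python
import pandas as pd

# Toy frame with a subset of the original columns.
toy = pd.DataFrame({
    'ano': [2012, 2012],
    'avaliador': ['Engenheiro - Área Morro', 'Engenheiro - Área Morro'],
    'vistoria_risco': ['R3 Alto', 'R2 Médio'],
    'processo_numero': [8008472812, 8008985512],
})

# Keep only what is needed downstream; .copy() gives an independent frame.
kept = toy[['vistoria_risco', 'processo_numero']].copy()

# Dropping the unwanted columns leads to the same frame.
dropped = toy.drop(columns=['ano', 'avaliador'])
print(kept.equals(dropped))
```

Selecting by name fails loudly with a KeyError if an expected column is missing, which can be useful when the source file changes.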

In [8]:
inspections.head()

Unnamed: 0,vistoria_risco,processo_numero
0,R3 Alto,8008472812
1,R2 Médio,8008985512
2,R4 Muito Alto,8008835312
3,R4 Muito Alto,8006628012
4,Não informado,8008455612


Renaming the columns for readability. Each row of this dataset describes an inspection conducted at a location; the feature that relates to our problem is the risk level assigned by the inspector at the scene.

In [9]:
inspections = inspections.rename(columns={'vistoria_risco':'risk','processo_numero':'ID'})
inspections.head()

Unnamed: 0,risk,ID
0,R3 Alto,8008472812
1,R2 Médio,8008985512
2,R4 Muito Alto,8008835312
3,R4 Muito Alto,8006628012
4,Não informado,8008455612


We will encode the risk column as integers:
4 - Very high.
3 - High.
2 - Medium.
1 - Low.
0 - Not informed.

In [10]:
mapping = {'Não informado': 0, 'R1 Baixo': 1, 'R2 Médio': 2, 'R3 Alto': 3, 'R4 Muito Alto': 4}

# Encode only the risk column; labels missing from the mapping are left unchanged.
inspections['risk'] = inspections['risk'].map(lambda s: mapping.get(s, s))
inspections.head()

Unnamed: 0,risk,ID
0,3,8008472812
1,2,8008985512
2,4,8008835312
3,4,8006628012
4,0,8008455612
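
One way to confirm that every label was covered by the mapping is to pass the dictionary directly to Series.map, which turns unmapped labels into NaN, and then list the labels that ended up as NaN. A minimal sketch, assuming a toy risk column; 'R5 Desconhecido' is a made-up label that does not appear in the real data:

```python
import pandas as pd

mapping = {'Não informado': 0, 'R1 Baixo': 1, 'R2 Médio': 2, 'R3 Alto': 3, 'R4 Muito Alto': 4}

# Toy risk column; the last label is hypothetical and absent from the mapping.
risk = pd.Series(['R3 Alto', 'Não informado', 'R5 Desconhecido'])

encoded = risk.map(mapping)                        # unmapped labels become NaN
unmapped = risk[encoded.isna()].unique().tolist()  # labels the mapping did not cover
print(unmapped)
```

An empty list means the encoding is complete and the column can safely be treated as integer.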


Finally, some IDs appear more than once. For each ID we keep only the entry with the highest risk, so a 0 (not informed) entry is dropped whenever the same ID also has an informed risk.

In [11]:
inspections = inspections.sort_values('risk', ascending=False).drop_duplicates('ID').sort_index()
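
To see why this chain keeps the highest risk per ID, here is a small sketch on made-up data (the IDs and risks are illustrative only). Sorting by risk in descending order puts the highest-risk row of each ID first, and drop_duplicates keeps the first occurrence by default.

```python
import pandas as pd

# Toy data: ID 101 and ID 102 each appear twice with different risks.
toy = pd.DataFrame({'risk': [0, 3, 2, 0, 4], 'ID': [101, 101, 102, 103, 102]})

deduped = (
    toy.sort_values('risk', ascending=False)  # highest risk first
       .drop_duplicates('ID')                 # keep the first (highest-risk) row per ID
       .sort_index()                          # restore the original row order
)
print(deduped)
```

ID 101 keeps its risk 3 row, ID 102 keeps its risk 4 row, and ID 103 keeps its only row even though its risk is 0. If two rows of the same ID tie on risk, either one may be kept, which makes no difference here because the two remaining columns are then identical.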

Taking a look at the value counts of the risk column after removing the duplicates.

In [12]:
inspections['risk'].value_counts()

3    4720
0    2757
2    1878
4    1018
1     323
Name: risk, dtype: int64

In [13]:
inspections.to_csv(path_or_buf='inspections_prepared.csv')
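
Note that to_csv writes the index as an unnamed first column by default. When the file is loaded again, index_col=0 reads that column back as the index rather than as an extra 'Unnamed: 0' column. A minimal sketch of the round trip, using an in-memory buffer in place of the real file and a toy frame with the same two columns:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by drop_duplicates.
toy = pd.DataFrame(
    {'risk': [3, 0, 4], 'ID': [8008472812, 8008455612, 8008835312]},
    index=[0, 4, 2],
)

buffer = io.StringIO()  # stands in for 'inspections_prepared.csv'
toy.to_csv(buffer)      # the index is written as the first column
buffer.seek(0)

restored = pd.read_csv(buffer, index_col=0)  # first column becomes the index again
print(restored.equals(toy))
```

If the original row numbers are not needed downstream, passing index=False to to_csv is a simpler alternative.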