# Preprocessing

Analyzing the dataset revealed a few opportunities for improvement before sending it to training. The following sections prepare the dataset accordingly.

## Importing the Dataset

In [2]:
import pandas as pd

df = pd.read_csv('../data/dataset.csv')

## Removing Duplicate Data

Some columns duplicate information that a corresponding `ID` column already captures. This applies to names and descriptions: for example, `COLETA_DESC` carries the same information as `COLETA_ID`.
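Before dropping a description column, it can be worth confirming that the `ID` column really determines it. The following is a minimal sketch on toy data, not part of the original notebook: the helper name `desc_is_determined_by_id` and the description values are made up for illustration.

```python
import pandas as pd


def desc_is_determined_by_id(frame: pd.DataFrame, id_col: str, desc_col: str) -> bool:
    """Return True when every ID value appears with exactly one description."""
    # Keep each distinct (ID, description) pair once.
    pairs = frame[[id_col, desc_col]].drop_duplicates()
    # If an ID shows up in more than one pair, it has conflicting descriptions.
    return bool(pairs[id_col].is_unique)


# Toy data standing in for the real dataset (values are hypothetical).
toy = pd.DataFrame({
    'COLETA_ID': [1, 1, 2, 3],
    'COLETA_DESC': ['desc A', 'desc A', 'desc B', 'desc C'],
})

print(desc_is_determined_by_id(toy, 'COLETA_ID', 'COLETA_DESC'))
```

If the check returns `False` for a pair on the real data, the description column holds information the `ID` column does not, and dropping it would lose data.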

The weight columns `PES_PESOINI`, `PES_PESOFIM` and `PES_PESOUTIL` are also redundant, since `PES_PESOUTIL = PES_PESOFIM - PES_PESOINI`. `PES_PESOINI` is the weight of the vehicle itself, which follows from `TPVEICULO_DESC`, so we can drop `PES_PESOINI` and `PES_PESOFIM` and keep only `PES_PESOUTIL`.
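The identity above can be verified on the data before the two columns are dropped. This is a minimal sketch on toy data, assuming the three weight columns are numeric and free of missing values; the helper name `net_weight_is_consistent` and the weights are hypothetical.

```python
import pandas as pd


def net_weight_is_consistent(frame: pd.DataFrame) -> bool:
    """Check that PES_PESOUTIL == PES_PESOFIM - PES_PESOINI on every row."""
    expected = frame['PES_PESOFIM'] - frame['PES_PESOINI']
    return bool((frame['PES_PESOUTIL'] == expected).all())


# Toy rows standing in for the real dataset (weights are made up).
toy = pd.DataFrame({
    'PES_PESOINI': [12000, 11500],
    'PES_PESOFIM': [21640, 20600],
    'PES_PESOUTIL': [9640, 9100],
})

print(net_weight_is_consistent(toy))
```

On the real `df`, this check has to run before the drop cell, since `PES_PESOINI` and `PES_PESOFIM` no longer exist afterwards.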


In [5]:
DROP_LIST = ['LOCDESCARREGO_DESC', 'EMP_NOME', 'PES_PESOFIM', 'PES_PESOINI', 'COLETA_DESC', 'ESPECCOLETA_DESC', 'LOCAL_NOME', 'ROTA_DESC']

df.drop(DROP_LIST, axis=1, inplace=True)

## Making Adjustments

In [None]:
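# Map the circuit-type labels to integer codes, only in the column that holds them.
df['TPCIRCUITO_DESC'] = df['TPCIRCUITO_DESC'].replace({'-REA': 0, 'CIRCUITO': 1, 'PONTUAL': 2})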

## Applying LabelEncoder on Categorical Columns

In [6]:
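from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer codes.
# fit_transform re-fits the encoder on every call, so one encoder per column
# keeps each mapping available for inverse_transform later.
CATEGORICAL_COLUMNS = ['TPVEICULO_DESC', 'TPCIRCUITO_DESC', 'ROTA_ID']
encoders = {}
for column in CATEGORICAL_COLUMNS:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])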
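Since `fit_transform` re-fits a `LabelEncoder` each time it is called, an encoder only remembers the classes of the last column it saw. The sketch below, on toy data with made-up values, shows one way to keep a separate encoder per column so that the integer codes can be turned back into the original labels; the `encoders` dictionary is an illustrative choice, not something the original notebook defines.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the real dataset (values are hypothetical).
toy = pd.DataFrame({
    'TPVEICULO_DESC': ['truck', 'van', 'truck'],
    'ROTA_ID': [144, 7, 144],
})

# One fitted encoder per column, stored by column name.
encoders = {}
for column in ['TPVEICULO_DESC', 'ROTA_ID']:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# Recover the original labels from the integer codes.
decoded = encoders['TPVEICULO_DESC'].inverse_transform(toy['TPVEICULO_DESC'].to_numpy())
print(list(decoded))
```

Keeping the fitted encoders around also makes it possible to apply the same mapping to new data with `transform`, as long as the new data contains no unseen labels.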


## Our Results

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119801 entries, 0 to 119800
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   PES_ID            119801 non-null  int64  
 1   LOCDESCARREGO_ID  119801 non-null  int64  
 2   EMP_ID            119801 non-null  int64  
 3   ROTA_ID           119801 non-null  int64  
 4   TPVEICULO_DESC    119801 non-null  int64  
 5   PES_PESOUTIL      119801 non-null  int64  
 6   COLETA_ID         119801 non-null  int64  
 7   ESPECCOLETA_ID    119801 non-null  int64  
 8   PERCUSSO_I        119801 non-null  float64
 9   LOCAL_ID          119801 non-null  int64  
 10  TPCIRCUITO_DESC   119801 non-null  object 
 11  DATETIME_INI      119801 non-null  object 
 12  DATETIME_FIM      119801 non-null  object 
dtypes: float64(1), int64(9), object(3)
memory usage: 11.9+ MB


In [8]:
df.head()

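|   | PES_ID | LOCDESCARREGO_ID | EMP_ID | ROTA_ID | TPVEICULO_DESC | PES_PESOUTIL | COLETA_ID | ESPECCOLETA_ID | PERCUSSO_I | LOCAL_ID | TPCIRCUITO_DESC | DATETIME_INI | DATETIME_FIM |
|---|--------|------------------|--------|---------|----------------|--------------|-----------|----------------|------------|----------|-----------------|--------------|--------------|
| 0 | 2490322 | 7 | 708 | 144 | 3 | 9640 | 1 | 1 | 2156.0 | 205 | -REA | 2014-12-31 23:57:01 | 2015-01-01 00:12:06 |
| 1 | 2489495 | 7 | 708 | 144 | 3 | 9100 | 1 | 1 | 2156.0 | 205 | -REA | 2014-12-30 01:27:00 | 2014-12-30 01:42:32 |
| 2 | 2488707 | 7 | 708 | 144 | 3 | 3480 | 1 | 1 | 2156.0 | 205 | -REA | 2014-12-27 04:47:05 | 2014-12-27 05:01:09 |
| 3 | 2488660 | 7 | 708 | 144 | 3 | 9370 | 1 | 1 | 2156.0 | 205 | -REA | 2014-12-27 01:29:54 | 2014-12-27 01:43:34 |
| 4 | 2488114 | 7 | 708 | 144 | 3 | 9460 | 1 | 1 | 2156.0 | 205 | -REA | 2014-12-25 07:10:07 | 2014-12-25 07:23:23 |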


## Exporting

In [None]:
df.to_csv('../data/preprocessed.csv', index=False)