In [1]:
import pandas as pd


Download the roubo (robbery) and furto (theft) data:

- The data can be downloaded [here](http://www.ssp.sp.gov.br/transparenciassp/).
- The downloads are really slow, so I have also stored the raw data [here](https://drive.google.com/drive/folders/1L3rXeIPOtuK1NYG2zsV9eEUeSvX3ojVt).
- The downloaded data then has to be converted before it can be opened. I did this with Numbers by exporting each file to CSV, and some files still have encoding errors. The processed files are stored [here](https://drive.google.com/drive/folders/17M7w22fJwLGpGwIVuJTNGqMOVt8RkwDc).

From here on, we load the data as it is available after this last step.
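The encoding errors mentioned above can be probed before loading a file. The following is a minimal sketch, not part of the original workflow: `first_working_encoding` and its candidate list are hypothetical names chosen for illustration, and it assumes the raw bytes of a file have already been read.

```python
def first_working_encoding(raw, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# latin-1 maps every byte to a character, so it never raises; keep it last
sample = "ROUBO DE VEÍCULOS".encode("latin-1")
print(first_working_encoding(sample))
```

Because `latin-1` always decodes successfully, a file that loads with it may still contain wrong characters, so a visual check of a few accented words is still worthwhile.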

---

## 1. Reading the data

Read the roubo and furto data.

In [2]:
# Initialize the data
all_data = {'roubo': dict(), 'furto': dict()}

# Try to parse each file
for data_type in ['roubo', 'furto']:
    for year in ['2020', '2021']:
        for month in range(1,13):
            
            # Try to load the data
            try:
                data = pd.read_csv(f"data/processed_roubo_e_furto/DadosBO_{year}_{str(month)}({data_type.upper()} DE VEÍCULOS).csv", 
                                   sep=';', 
                                   encoding='latin-1')
                
                print(f'Usable data for {data_type}, {year}/{str(month)} with {data.shape[0]} records')
                
                # Append it
                all_data[data_type][year+'_'+str(month)] = data
            
            except Exception:
                # Skip months whose file is missing or cannot be parsed
                pass

Usable data for roubo, 2020/12 with 8525 records
Usable data for roubo, 2021/1 with 6770 records
Usable data for roubo, 2021/2 with 6251 records
Usable data for roubo, 2021/4 with 5703 records
Usable data for roubo, 2021/5 with 5776 records
Usable data for roubo, 2021/6 with 5336 records
Usable data for roubo, 2021/7 with 5363 records
Usable data for roubo, 2021/8 with 5953 records
Usable data for roubo, 2021/9 with 6181 records
Usable data for roubo, 2021/10 with 6713 records
Usable data for furto, 2020/8 with 6755 records
Usable data for furto, 2020/9 with 7109 records
Usable data for furto, 2020/10 with 7788 records
Usable data for furto, 2020/12 with 7034 records
Usable data for furto, 2021/1 with 7516 records
Usable data for furto, 2021/2 with 7897 records
Usable data for furto, 2021/3 with 8192 records
Usable data for furto, 2021/4 with 7665 records
Usable data for furto, 2021/6 with 7937 records
Usable data for furto, 2021/10 with 9680 records
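Since the loading loop skips unreadable files silently, it can help to list which months are absent. This is a small sketch under the assumption that keys follow the `year_month` pattern used in `all_data`; `missing_periods` is a hypothetical helper, not part of the original notebook.

```python
def missing_periods(loaded_keys, years=("2020", "2021"), months=range(1, 13)):
    """List the 'year_month' keys that were expected but never loaded."""
    expected = [f"{year}_{month}" for year in years for month in months]
    loaded = set(loaded_keys)
    return [key for key in expected if key not in loaded]


# Toy example: check July to October 2020 against three loaded keys
print(missing_periods(["2020_8", "2020_9", "2021_10"],
                      years=("2020",), months=range(7, 11)))
```

In the notebook it could be called as `missing_periods(all_data['furto'].keys())` to see the gaps for one incident type.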


Check that data formats are consistent across tables

In [3]:
all_formats = []
    
for data_type in ['roubo', 'furto']:
    for date, data in all_data[data_type].items():
        all_formats.append({'data_type': data_type,
                            'date': date,
                            'format': list(data.columns)})

all_formats = pd.DataFrame(all_formats)
display(all_formats.head())

print('Number of unique formats:', all_formats['format'].astype(str).nunique())


| | data_type | date | format |
|---|---|---|---|
| 0 | roubo | 2020_12 | [ANO_BO, NUM_BO, NUMERO_BOLETIM, BO_INICIADO, ... |
| 1 | roubo | 2021_1 | [ANO_BO, NUM_BO, NUMERO_BOLETIM, BO_INICIADO, ... |
| 2 | roubo | 2021_2 | [ANO_BO, NUM_BO, NUMERO_BOLETIM, BO_INICIADO, ... |
| 3 | roubo | 2021_4 | [ANO_BO, NUM_BO, NUMERO_BOLETIM, BO_INICIADO, ... |
| 4 | roubo | 2021_5 | [ANO_BO, NUM_BO, NUMERO_BOLETIM, BO_INICIADO, ... |


Number of unique formats: 1
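Had the check reported more than one format, the next question would be which columns differ. The sketch below is one possible way to answer it; `column_differences` is a hypothetical helper, and unlike the string comparison above it ignores column order.

```python
def column_differences(columns_by_table):
    """Map each table label to the columns it lacks or adds relative to the first table."""
    labels = list(columns_by_table)
    if not labels:
        return {}
    reference = list(columns_by_table[labels[0]])
    differences = {}
    for label in labels[1:]:
        columns = list(columns_by_table[label])
        missing = [c for c in reference if c not in columns]
        extra = [c for c in columns if c not in reference]
        if missing or extra:
            differences[label] = {"missing": missing, "extra": extra}
    return differences


# Toy example with made-up labels; the second table lacks one column
toy = {
    "roubo_2021_1": ["ANO_BO", "NUM_BO", "NUMERO_BOLETIM"],
    "furto_2021_1": ["ANO_BO", "NUM_BO"],
}
print(column_differences(toy))
```

With the notebook's data, the input could be built as `{f"{t}_{d}": df.columns for t in all_data for d, df in all_data[t].items()}`.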


There is a single unique format, so we can merge everything into one table!

In [4]:
# Tag each monthly table with its incident type and file date
frames = []
for data_type in ['roubo', 'furto']:
    for date, data in all_data[data_type].items():
        tagged = data.copy()
        tagged.insert(0, 'incident_type', data_type)
        tagged.insert(1, 'file_download_date', date)
        frames.append(tagged)

# Stack all tables into a single dataframe
final_data = pd.concat(frames, axis=0)

Compare the number of records with the counts printed during loading.

In [5]:
final_data[(final_data['incident_type']=='roubo') & 
           (final_data['file_download_date']=='2021_6')].shape[0]

5336

In [6]:
final_data[(final_data['incident_type']=='furto') & 
           (final_data['file_download_date']=='2021_6')].shape[0]

7937
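The two spot checks above can be extended to every file at once. The following is a sketch assuming the `incident_type` and `file_download_date` columns and the nested `all_data` dictionary used in this notebook; `count_mismatches` is a hypothetical helper, shown here on toy data.

```python
import pandas as pd


def count_mismatches(final_data, all_data):
    """Return (incident_type, date, loaded, merged) for every count that differs."""
    merged_counts = (final_data
                     .groupby(["incident_type", "file_download_date"])
                     .size()
                     .to_dict())
    mismatches = []
    for data_type, tables in all_data.items():
        for date, table in tables.items():
            merged = int(merged_counts.get((data_type, date), 0))
            if merged != len(table):
                mismatches.append((data_type, date, len(table), merged))
    return mismatches


# Toy data standing in for the real tables
toy_all = {"roubo": {"2021_6": pd.DataFrame({"ANO_BO": [2021, 2021]})},
           "furto": {"2021_6": pd.DataFrame({"ANO_BO": [2021]})}}
toy_final = pd.DataFrame({
    "incident_type": ["roubo", "roubo", "furto"],
    "file_download_date": ["2021_6", "2021_6", "2021_6"],
    "ANO_BO": [2021, 2021, 2021],
})
print(count_mismatches(toy_final, toy_all))
```

An empty list means every loaded table kept all of its rows in the merged dataframe.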

---
## 2. Creating a table in Postgres with this info