# Introduction

This Data Science project explores socioeconomic conditions across neighborhoods in Buenos Aires using open geospatial and administrative datasets. Our goal is to classify areas by quality of life, infrastructure, and social indicators, and to present the results using interactive visualizations (e.g., Folium maps).


In this notebook we will collect and transform data mainly retrieved from the official open data portal of Buenos Aires: [data.buenosaires.gob.ar](https://data.buenosaires.gob.ar/). 


We will use the following datasets:
- delitos_2021, delitos_2022, delitos_2023 – crime records for 2021, 2022, and 2023
- caba_pob_barrios_2010 – population census data per neighborhood
- barrios – location and geographical boundaries of the city's neighborhoods
- establecimientos_educativos_WGS84 – georeferenced list of public and private educational institutions
- hospitales – geolocated data on hospitals in the city
- espacio_verde_publico – geographic location of the city's public green spaces
- estaciones-de-subte – geographic dataset of subway stations in the city
- mapa_de_ruido_diurno / mapa_de_ruido_nocturno – noise level assessments (daytime and nighttime)
- [Liste der informellen Siedlungen in Buenos Aires](https://de.wikipedia.org/wiki/Liste_der_informellen_Siedlungen_in_Buenos_Aires) (German Wikipedia) – informal settlements (slums) by neighborhood; unofficial source, not part of the open data portal

**We will also transform raw data into meaningful features** describing neighborhoods.  
These features will include:

- Population and total area
- Crime indicators (e.g., homicides, injuries, property crimes, threats)
- Density of slum presence
- Number of hospitals and subway stations
- Noise levels (daytime and nighttime)
- Number of educational institutions (private and state)
- Number and area of green zones (total and with recreational facilities)
- Share of green zones relative to neighborhood area

# Data Collection & Exploration

## 1. Crimes Data

### 1.1 Data Loading

In [521]:
import pandas as pd

df_crimes_2023 = pd.read_csv('../data/delitos_2023.csv', sep=';')
df_crimes_2022 = pd.read_csv('../data/delitos_2022.csv', sep=',')
df_crimes_2021 = pd.read_excel('../data/delitos_2021.xlsx')

In [522]:
# Drop the unique record identifiers from the crime datasets
df_crimes_2021 = df_crimes_2021.drop(columns=['id-mapa'])
df_crimes_2022 = df_crimes_2022.drop(columns=['id-mapa'])
df_crimes_2023 = df_crimes_2023.drop(columns=['id-sum'])

In [523]:
crimes_total = pd.concat([df_crimes_2021, df_crimes_2022, df_crimes_2023], ignore_index=True)
crimes_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410341 entries, 0 to 410340
Data columns (total 14 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   anio      410341 non-null  int64  
 1   mes       410341 non-null  object 
 2   dia       410341 non-null  object 
 3   fecha     410341 non-null  object 
 4   franja    408554 non-null  float64
 5   tipo      410341 non-null  object 
 6   subtipo   410341 non-null  object 
 7   uso_arma  410341 non-null  object 
 8   uso_moto  410341 non-null  object 
 9   barrio    404684 non-null  object 
 10  comuna    404698 non-null  object 
 11  latitud   402558 non-null  object 
 12  longitud  402558 non-null  object 
 13  cantidad  410341 non-null  int64  
dtypes: float64(1), int64(2), object(11)
memory usage: 43.8+ MB


### 1.2 Data Cleaning

In [524]:
# 'comuna' is kept for now (a drop() call without assignment has no effect);
# it is dropped later, just before the monthly aggregation
crimes_total.rename(columns={
    'anio' : 'year',
    'mes'  : 'month',
    'dia'  : 'day_of_week',
    'fecha': 'date',
    'franja': 'hour',
    'tipo'    : 'crime_type',
    'subtipo' : 'crime_subtype',
    'uso_arma' : 'using_arm',
    'uso_moto' : 'using_motorbike',
    'barrio': 'neighborhood',
    'latitud': 'latitude',
    'longitud': 'longitude',
    'cantidad' : 'amount'
}, inplace=True)
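Note that `comuna` still appears in the outputs that follow. pandas methods such as `drop` and `rename` return a new DataFrame unless `inplace=True` is passed, so a column is only removed when the result is assigned back. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame (made-up values) standing in for the crime data
toy = pd.DataFrame({"anio": [2021, 2022], "comuna": ["1", "2"]})

# Without assignment the original frame is untouched
toy.drop(columns=["comuna"])
still_there = "comuna" in toy.columns

# Assigning the result (or passing inplace=True) actually removes the column
toy_clean = toy.drop(columns=["comuna"]).rename(columns={"anio": "year"})
print(still_there, list(toy_clean.columns))
```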

In [525]:
crimes_total['date'] = pd.to_datetime(crimes_total['date'])
crimes_total['month'] = crimes_total['date'].dt.month
crimes_total['day_of_week'] = crimes_total['date'].dt.day_of_week

In [526]:
bin_format = {'NO': 0, 'SI': 1}
crimes_total['using_arm'] = crimes_total['using_arm'].map(bin_format)
crimes_total['using_motorbike'] = crimes_total['using_motorbike'].map(bin_format)

In [527]:
crimes_total['crime_subtype'].unique()

array(['Hurto total', 'Robo total', 'Hurto automotor', 'Robo automotor',
       'Lesiones por siniestros viales', 'Homicidios dolosos',
       'Femicidios', 'Lesiones Dolosas', 'Amenazas',
       'Muertes por siniestros viales', 'Homicidio Doloso'], dtype=object)

In [528]:
subcategory_translation = {
    'Hurto total': 'Total Theft',
    'Robo total': 'Total Robbery',
    'Hurto automotor': 'Vehicle Theft',
    'Robo automotor': 'Vehicle Robbery',
    'Lesiones por siniestros viales': 'Injuries from Traffic Accidents',
    'Homicidios dolosos': 'Intentional Homicides',
    'Femicidios': 'Femicides',
    'Lesiones Dolosas': 'Intentional Injuries',
    'Amenazas': 'Threats',
    'Muertes por siniestros viales': 'Deaths from Traffic Accidents',
    'Homicidio Doloso': 'Intentional Homicide'
}

crimes_total['crime_subtype'] = crimes_total['crime_subtype'].map(subcategory_translation)
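`Series.map` with a dictionary returns NaN for every value that is not a key, so a label missing from the translation table would silently become a missing value. The null counts further down show that this did not happen here, but a quick check before mapping makes that explicit. A sketch on made-up labels (`'Categoria nueva'` is a hypothetical value that does not occur in the data):

```python
import pandas as pd

# Toy series: the last label is deliberately missing from the mapping
subtypes = pd.Series(["Hurto total", "Amenazas", "Categoria nueva"])
translation = {"Hurto total": "Total Theft", "Amenazas": "Threats"}

# Labels present in the data but absent from the dictionary
unmapped = sorted(set(subtypes.dropna()) - set(translation))

# Unmapped labels turn into NaN after map()
translated = subtypes.map(translation)
print(unmapped, int(translated.isna().sum()))
```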

In [529]:
crimes_total['crime_type'].unique()

array(['Hurto', 'Robo', 'Vialidad', 'Homicidios', 'Lesiones', 'Amenazas'],
      dtype=object)

In [530]:
category_translation = {
    'Hurto': 'Theft',
    'Robo': 'Robbery',
    'Vialidad': 'Traffic Incidents',
    'Homicidios': 'Homicides',
    'Lesiones': 'Injuries',
    'Amenazas': 'Threats'
}

crimes_total['crime_type'] = crimes_total['crime_type'].map(category_translation)

In [531]:
crimes_total.isnull().sum()

year                  0
month                 0
day_of_week           0
date                  0
hour               1787
crime_type            0
crime_subtype         0
using_arm             0
using_motorbike       0
neighborhood       5657
comuna             5643
latitude           7783
longitude          7783
amount                0
dtype: int64

In [532]:
crimes_total = crimes_total.dropna(subset=['latitude', 'neighborhood'], how='all')
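With `how='all'`, a row is dropped only when both `latitude` and `neighborhood` are missing; rows that still carry one of the two are kept, because the neighborhood can later be recovered from the coordinates. A small sketch of the difference from `how='any'`, on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "neighborhood": ["PALERMO", None, None],
    "latitude": [None, "-34.60", None],
})

# how='all': drop a row only if every column in `subset` is missing
kept_all = toy.dropna(subset=["latitude", "neighborhood"], how="all")

# how='any': drop a row if at least one column in `subset` is missing
kept_any = toy.dropna(subset=["latitude", "neighborhood"], how="any")
print(len(kept_all), len(kept_any))
```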

In [533]:
crimes_total = crimes_total[
    ~((crimes_total['neighborhood'] == "Sin geo") & (crimes_total['latitude'].isna()))
]
crimes_total.info()

<class 'pandas.core.frame.DataFrame'>
Index: 404641 entries, 0 to 410340
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   year             404641 non-null  int64         
 1   month            404641 non-null  int32         
 2   day_of_week      404641 non-null  int32         
 3   date             404641 non-null  datetime64[ns]
 4   hour             403144 non-null  float64       
 5   crime_type       404641 non-null  object        
 6   crime_subtype    404641 non-null  object        
 7   using_arm        404641 non-null  int64         
 8   using_motorbike  404641 non-null  int64         
 9   neighborhood     404336 non-null  object        
 10  comuna           404344 non-null  object        
 11  latitude         402558 non-null  object        
 12  longitude        402558 non-null  object        
 13  amount           404641 non-null  int64         
dtypes: datetime64[ns](1), flo

In [534]:
crimes_total.isnull().sum()

year                  0
month                 0
day_of_week           0
date                  0
hour               1497
crime_type            0
crime_subtype         0
using_arm             0
using_motorbike       0
neighborhood        305
comuna              297
latitude           2083
longitude          2083
amount                0
dtype: int64

## 2. Population Data

In [535]:
df_population = pd.read_csv('../data/caba_pob_barrios_2010.csv', sep=',')
df_population.head(3)

Unnamed: 0,BARRIO,POBLACION
0,AGRONOMIA,13912
1,ALMAGRO,131699
2,BALVANERA,138926


In [536]:
df_population['BARRIO'].nunique()

48

In [537]:
df_population.rename(columns={
    'BARRIO': 'neighborhood',
    'POBLACION' : 'population'
}, inplace=True)

## 3. Location and Geographical Boundaries of the City's Neighborhoods

In [541]:
df_neighborhood = pd.read_csv('../data/barrios.csv', sep=',')

df_neighborhood.head(3)

Unnamed: 0,id,objeto,nombre,comuna,perimetro_,area_metro,geometry
0,1,BARRIO,Agronomia,15,6556.17,2122169.34,POLYGON ((-58.475888981732986 -34.591723461272...
1,2,BARRIO,Almagro,5,8537.9,4050752.25,POLYGON ((-58.416002854915654 -34.597854231564...
2,3,BARRIO,Balvanera,3,8375.82,4342280.27,POLYGON ((-58.392934155259674 -34.599636447011...


In [542]:
df_neighborhood = df_neighborhood.drop(columns={'objeto', 'id'})
df_neighborhood.rename(columns={
    'nombre': 'neighborhood',
    'comuna' : 'commune',
    'perimetro_' : 'perimeter_neib',
    'area_metro' : 'area_neib'
}, inplace=True)
df_neighborhood.head(3)

Unnamed: 0,neighborhood,commune,perimeter_neib,area_neib,geometry
0,Agronomia,15,6556.17,2122169.34,POLYGON ((-58.475888981732986 -34.591723461272...
1,Almagro,5,8537.9,4050752.25,POLYGON ((-58.416002854915654 -34.597854231564...
2,Balvanera,3,8375.82,4342280.27,POLYGON ((-58.392934155259674 -34.599636447011...


In [543]:
df_neighborhood['neighborhood'] = df_neighborhood['neighborhood'].str.upper()

## 4. Educational Institutions Data

### 4.1 Data Loading

In [544]:
df_education = pd.read_csv('../data/establecimientos_educativos_WGS84.csv', sep=';')

### 4.2 Data preparation

In [545]:
df_education = df_education.drop(columns={'id', 'cui', 'cueanexo', 'cue', 'anexo', 'telefono', 'email', 'codpost', 'estado',
                           'web_megcba', 'nombre_est', 'area_progr', 'de', 'dom_edific', 'dom_establ', 'depfun', 'depfun_otr',
                           'nivel', 'nivelmodal', 'Unnamed: 29', 'Unnamed: 30', 'longitud', 'latitud'})

In [546]:
import re

def extract_coords(geom_str):
    match = re.search(r'\(\((-?\d+\.\d+)\s+(-?\d+\.\d+)\)\)', geom_str)
    if match:
        lon = float(match.group(1))
        lat = float(match.group(2))
        return lon, lat
    return None, None

df_education[['institution_long', 'institution_lat']] = df_education['WKT'].apply(
    lambda s: pd.Series(extract_coords(s))
)
df_education = df_education.drop(columns={'WKT', 'tipest', 'tipest_abr', 'comuna'})
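The regular expression in `extract_coords` expects two signed decimal numbers inside double parentheses. The sketch below applies the same pattern to a sample string; the `MULTIPOINT ((lon lat))` layout is an assumption about the `WKT` column, and both the `parse_point` name and the `isinstance` guard for non-string values are additions for illustration, not part of the original function.

```python
import re

# Same pattern as in extract_coords: two signed decimals inside double parentheses
COORD_PATTERN = re.compile(r'\(\((-?\d+\.\d+)\s+(-?\d+\.\d+)\)\)')

def parse_point(geom_str):
    """Return (lon, lat) as floats, or (None, None) if the text does not match."""
    if not isinstance(geom_str, str):  # e.g. NaN in the WKT column
        return None, None
    match = COORD_PATTERN.search(geom_str)
    if match is None:
        return None, None
    return float(match.group(1)), float(match.group(2))

sample = "MULTIPOINT ((-58.428063 -34.660773))"  # assumed format
print(parse_point(sample))
print(parse_point("POINT (-58.4 -34.6)"))   # single parentheses: no match
print(parse_point(float("nan")))            # non-string input
```

Rows that do not match end up with missing coordinates, which is consistent with the few null values in `institution_long` and `institution_lat` below.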

In [547]:
df_education.rename(columns={
    'barrio': 'neighborhood',
    'nivmod' : 'institution_level',
    'nombre_abr' : 'institution_name'
}, inplace=True)

df_education['neighborhood'] = df_education['neighborhood'].str.upper()
df_education.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3045 entries, 0 to 3044
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sector             3045 non-null   int64  
 1   institution_name   3045 non-null   object 
 2   institution_level  3044 non-null   object 
 3   neighborhood       3044 non-null   object 
 4   institution_long   3042 non-null   float64
 5   institution_lat    3042 non-null   float64
dtypes: float64(2), int64(1), object(3)
memory usage: 142.9+ KB


In [549]:
df_education = df_education.dropna(subset=['institution_lat', 'neighborhood'], how='all')
df_education.head(3)

Unnamed: 0,sector,institution_name,institution_level,neighborhood,institution_long,institution_lat
0,1,EPjs 09,PriCom,VILLA SOLDATI,-58.428063,-34.660773
1,1,EI 06/19,IniCom,FLORES,-58.432877,-34.648141
2,1,EPjs 16,PriCom,PALERMO,-58.408777,-34.585651


## 5. Hospitals Data

In [550]:
df_hospitals = pd.read_csv('../data/hospitales.csv', sep=',')

In [551]:
df_hospitals = df_hospitals.drop(columns={'tel', 'com', 'fax', 'web', 'nam', 'esp', 'dir', 'gna', 'ate', 'sag', 'geometry', 'fna'})

In [552]:
df_hospitals.rename(columns={
    'bar': 'neighborhood',
    'gna_sym' : 'hospital_type'
}, inplace=True)

df_hospitals['neighborhood'] = df_hospitals['neighborhood'].str.upper()
df_hospitals = df_hospitals.groupby(['neighborhood'], as_index=False)['hospital_type'].count()
df_hospitals.rename(columns={'hospital_type': 'amount_of_hospitals'}, inplace=True)
df_hospitals.head(3)

Unnamed: 0,neighborhood,amount_of_hospitals
0,ALMAGRO,1
1,BALVANERA,1
2,BARRACAS,6


## 6. Green Zones Data

In [553]:
df_green_spase = pd.read_csv('../data/espacio_verde_publico.csv', sep=',')

In [554]:
df_green_spase = df_green_spase.drop(columns={'nombre', 'nom_mapa', 'ubicacion', 'clasificac', 'apadrinada', 'decreto', 'fecha_decr', 'ordenanza_', 
                                              'fecha_orde', 'boletin_of', 'fecha_bole', 'observacio'})

df_green_spase.rename(columns={
    'barrio': 'neighborhood',
    'comuna' : 'commune',
    'perimetro' : 'zone_perimeter',
    'area' : 'zone_area'
}, inplace=True)

df_green_spase['neighborhood'] = df_green_spase['neighborhood'].str.upper()

In [555]:
df_green_spase['tiene_pati'] = df_green_spase['tiene_pati'].map(bin_format)

In [556]:
df_green_spase.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2144 entries, 0 to 2143
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              2144 non-null   int64  
 1   neighborhood    2144 non-null   object 
 2   commune         2144 non-null   int64  
 3   tiene_pati      2124 non-null   float64
 4   zone_area       2144 non-null   float64
 5   zone_perimeter  2144 non-null   float64
 6   geometry        2144 non-null   object 
dtypes: float64(3), int64(2), object(2)
memory usage: 117.4+ KB


## 7. Subway Data

In [557]:
df_subway = pd.read_excel('../data/estaciones-de-subte.xlsx')

In [558]:
df_subway.rename(columns={
    'long': 'long_station',
    'lat' : 'lat_station',
    'linea' : 'line',
    'estacion' : 'station'
}, inplace=True)

In [379]:
df_subway.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   long_station  90 non-null     float64
 1   lat_station   90 non-null     float64
 2   id            90 non-null     int64  
 3   station       90 non-null     object 
 4   line          90 non-null     object 
dtypes: float64(2), int64(1), object(2)
memory usage: 3.6+ KB


## 8. Presence of Slums in the City

In [560]:
url = "https://de.wikipedia.org/wiki/Liste_der_informellen_Siedlungen_in_Buenos_Aires"

tables = pd.read_html(url)

df_slums = tables[0]
df_slums.head()

Unnamed: 0,Name,Begrenzung,Stadtteil
0,Bajo AU7,"Parque de La Ciudad, Avenida Roca, Avenida Lac...",Villa Soldati
1,Calacita,"Barros Pazos, Lacarra, B.y Ordoñez, Laguna",Villa Soldati
2,Carrilo 2,"Mariano Acosta, Castañares, Lacarra",Villa Soldati
3,Piletones,"Lacarra, B. Pazos, Parque Indoamericano, Lago ...",Villa Soldati
4,Villa 1-11-14,"Avenida P. Moreno, Varela, Club DAOM, Riestra,...",Flores


In [561]:
df_slums.rename(columns={
    'Stadtteil' : 'neighborhood',
    'Name' : 'slum'
}, inplace=True)
df_slums = df_slums.drop(columns={"Begrenzung"})
df_slums['neighborhood'] = df_slums['neighborhood'].str.upper()

In [562]:
slums_count = df_slums.groupby('neighborhood', as_index=False)['slum'].count()
slums_count

Unnamed: 0,neighborhood,slum
0,BARRACAS,2
1,FLORES,1
2,NUEVA POMPEYA,1
3,PARQUE AVELLANEDA,1
4,PARQUE CHACABUCO,1
5,PUERTO MADERO,1
6,RETIRO,2
7,VILLA LUGANO,5
8,VILLA RIACHUELO,1
9,VILLA SOLDATI,5


## 9. Noise Map

In [563]:
df_noise_map_afternoon = pd.read_csv('../data/mapa_de_ruido_diurno.csv', sep = ',')
df_noise_map_night = pd.read_csv('../data/mapa_de_ruido_nocturno.csv', sep = ',')

In [564]:
df_noise_map = pd.concat([df_noise_map_afternoon, df_noise_map_night], ignore_index=True)
df_noise_map.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316 entries, 0 to 315
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   wkt                       316 non-null    object 
 1   limite_inferior_rango_db  316 non-null    float64
 2   limite_superior_rango_db  316 non-null    float64
 3   comuna                    316 non-null    int64  
 4   leyenda                   316 non-null    object 
 5   rango_db                  316 non-null    object 
 6   color                     316 non-null    object 
 7   periodo                   316 non-null    object 
dtypes: float64(2), int64(1), object(5)
memory usage: 19.9+ KB


In [565]:
df_noise_map['range_db_mean'] = (df_noise_map['limite_inferior_rango_db'] + df_noise_map['limite_superior_rango_db'])/2
df_noise_map = df_noise_map.drop(columns={'limite_inferior_rango_db', 'limite_superior_rango_db', 'leyenda', 'color'})

In [566]:
trans_period = {"Nocturno" : "Night", "Diurno" : "Day"}
df_noise_map['periodo'] = df_noise_map['periodo'].map(trans_period)

In [567]:
# 'comuna' is kept for now (a drop() call without assignment has no effect);
# it is dropped after the spatial overlay
df_noise_map.rename(columns={
    'wkt': 'noise_coord',
    'rango_db' : 'range_db',
    'periodo' : 'period'
}, inplace=True)
df_noise_map.head(3)

Unnamed: 0,noise_coord,comuna,range_db,period,range_db_mean
0,MULTIPOLYGON (((-58.3714231269834 -34.57855223...,1,30-35,Day,32.5
1,MULTIPOLYGON (((-58.3714231269834 -34.57855223...,1,35-40,Day,37.5
2,MULTIPOLYGON (((-58.3714231117057 -34.57856621...,1,40-45,Day,42.5


# Adding Neighborhoods to Uploaded Data

### 1. Making Geopandas Dataset from Neighborhood data

In [568]:
import geopandas as gpd
from shapely import wkt

barrios = gpd.read_file("https://cdn.buenosaires.gob.ar/datosabiertos/datasets/ministerio-de-educacion/barrios/barrios.geojson")  # or barrios.shp
barrios = barrios.to_crs(epsg=4326)

In [569]:
barrios = barrios.drop(columns={'objeto', 'id'})
barrios.rename(columns={
    'nombre': 'neighborhood',
    'comuna' : 'commune',
    'perimetro_' : 'perimeter_neib',
    'area_metro' : 'area_neib'
}, inplace=True)
barrios['neighborhood'] = barrios['neighborhood'].str.upper()

### 2. Adding Neighborhood to Subway Data

In [571]:
from shapely.geometry import Point
geometry = [Point(xy) for xy in zip(df_subway['long_station'], df_subway['lat_station'])]
gdf_subway = gpd.GeoDataFrame(df_subway, geometry=geometry, crs='EPSG:4326')  # WGS84

barrios = barrios.to_crs('EPSG:4326')

gdf_subway_with_neigh = gpd.sjoin(gdf_subway, barrios[['neighborhood', 'geometry']], how='left', predicate='within')
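Conceptually, `gpd.sjoin(..., predicate='within')` attaches to each point the attributes of the polygon that contains it, and `how='left'` keeps points that fall in no polygon. The pure-Python sketch below illustrates that idea with a ray-casting test. It is not how geopandas implements the join, and the two rectangles are hypothetical, not real neighborhood boundaries.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular "neighborhoods" as (lon, lat) vertices
toy_barrios = {
    "WEST": [(-58.50, -34.65), (-58.45, -34.65), (-58.45, -34.60), (-58.50, -34.60)],
    "EAST": [(-58.45, -34.65), (-58.40, -34.65), (-58.40, -34.60), (-58.45, -34.60)],
}

def assign_neighborhood(lon, lat):
    """Return the name of the first polygon containing the point, else None."""
    for name, ring in toy_barrios.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None  # plays the role of NaN after a left join

print(assign_neighborhood(-58.42, -34.62), assign_neighborhood(-58.30, -34.62))
```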

In [572]:
gdf_subway_with_neigh = gdf_subway_with_neigh.drop(columns={'long_station', 'lat_station', 'geometry'})

In [573]:
gdf_subway_with_neigh = gdf_subway_with_neigh.groupby('neighborhood', as_index=False)['station'].count()
# Note: the count column keeps the name 'station' in the outputs below
# (a rename() call whose result is not assigned has no effect)
gdf_subway_with_neigh.head(3)

Unnamed: 0,neighborhood,station
0,ALMAGRO,5
1,BALVANERA,12
2,BELGRANO,2


### 3. Adding Neighborhood to Noise Map Data

In [574]:
# Step 1 - Convert the string to shapely geometry
df_noise_map['geometry'] = df_noise_map['noise_coord'].apply(wkt.loads)

# Step 2 - Create a GeoDataFrame
gdf_noise = gpd.GeoDataFrame(df_noise_map.drop(columns='noise_coord'), geometry='geometry', crs='EPSG:4326')

# Reproject to a CRS with meter units (for area); note that EPSG:3857 (Web Mercator) is not an equal-area projection
gdf_noise = gdf_noise.to_crs(epsg=3857)
barrios = barrios.to_crs(epsg=3857)

# Step 3 - spatial intersection: intersection of areas and noise zones
noise_inter_area = gpd.overlay(gdf_noise, barrios[['neighborhood', 'area_neib', 'geometry']], how='intersection')

# Step 4 - Calculate the intersection area
noise_inter_area['inter_area'] = noise_inter_area.geometry.area
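One caveat about areas computed in EPSG:3857: Web Mercator stretches distances by roughly 1/cos(latitude), so areas are inflated by about 1/cos²(latitude). The sketch below estimates that factor for Buenos Aires under a spherical approximation; the latitude of -34.6° is a rounded assumption. If `area_neib` from the source dataset is a true ground area (also an assumption), ratios such as `inter_area / area_neib` would be overstated by about this factor.

```python
import math

def web_mercator_area_factor(lat_deg):
    """Approximate ratio of Web Mercator area to true area at a latitude (spherical model)."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

factor = web_mercator_area_factor(-34.6)  # approximate latitude of Buenos Aires
print(round(factor, 2))
```

Computing both areas in the same projection, or reprojecting to a local projected CRS (for example the one suggested by `GeoDataFrame.estimate_utm_crs()`), would avoid mixing the two scales.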

In [398]:
# Calculate the average sound level in decibels for a neighborhood during the day and night

noise_inter_area['mea_db_per_area'] = noise_inter_area['range_db_mean']*noise_inter_area['inter_area']/noise_inter_area['area_neib']
noise_inter_area_clean = noise_inter_area.drop(columns={'range_db', 'comuna', 'range_db_mean','inter_area'})
noise_inter_area_clean = noise_inter_area_clean.groupby(['neighborhood', 'period'], as_index=False)['mea_db_per_area'].sum()

Unnamed: 0,neighborhood,period,mea_db_per_area
0,AGRONOMIA,Day,22.211583
1,AGRONOMIA,Night,20.322316
2,ALMAGRO,Day,28.458668
3,ALMAGRO,Night,26.236751
4,BALVANERA,Day,27.401132
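The value computed above is the sum of each zone's mid-range level weighted by the share of the neighborhood area that the zone covers. It equals an area-weighted mean only when the noise zones cover the whole neighborhood; any uncovered area effectively counts as 0 dB. Whether that applies to this dataset is not established here. The toy example below (made-up numbers) shows the difference between normalizing by the neighborhood area and by the covered area:

```python
import pandas as pd

# Made-up intersection pieces for one neighborhood of area 100
pieces = pd.DataFrame({
    "neighborhood": ["TOY", "TOY"],
    "range_db_mean": [52.5, 62.5],
    "inter_area": [30.0, 50.0],   # zones cover 80 of 100 area units
    "area_neib": [100.0, 100.0],
})

# Formula used in the notebook: sum of level * (piece area / neighborhood area)
pieces["weighted"] = pieces["range_db_mean"] * pieces["inter_area"] / pieces["area_neib"]
by_total_area = pieces.groupby("neighborhood")["weighted"].sum()

# Alternative normalization: divide by the area actually covered by noise zones
covered = pieces.groupby("neighborhood")["inter_area"].sum()
level_times_area = (pieces["range_db_mean"] * pieces["inter_area"]).groupby(pieces["neighborhood"]).sum()
by_covered_area = level_times_area / covered

print(by_total_area["TOY"], by_covered_area["TOY"])
```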


In [575]:
# Pivot the data so that the average daytime and nighttime noise levels (dB) become separate columns

noise_inter_area_pivot = noise_inter_area_clean.pivot(index='neighborhood', columns='period', values='mea_db_per_area').reset_index()
noise_inter_area_pivot.columns = ['neighborhood', 'day_noise', 'night_noise']
noise_inter_area_pivot.head(3)

Unnamed: 0,neighborhood,day_noise,night_noise
0,AGRONOMIA,22.211583,20.322316
1,ALMAGRO,28.458668,26.236751
2,BALVANERA,27.401132,25.515022


### 4. Adding Neighborhood to Crimes Data

In [400]:
crimes_total['neighborhood'].unique()

array(['VELEZ SARSFIELD', 'MONTE CASTRO', 'FLORESTA', 'VILLA LURO',
       'VILLA DEL PARQUE', 'VILLA SANTA RITA', 'PALERMO', 'VILLA CRESPO',
       'VILLA SOLDATI', 'FLORES', 'VILLA GRAL. MITRE', 'RECOLETA',
       'SAN TELMO', 'CABALLITO', 'PATERNAL', 'NUEVA POMPEYA',
       'PARQUE PATRICIOS', 'SAN NICOLAS', 'ALMAGRO', 'BALVANERA',
       'AGRONOMIA', 'RETIRO', 'CONSTITUCION', 'SAAVEDRA', 'VILLA URQUIZA',
       'COGHLAN', 'BELGRANO', 'VILLA PUEYRREDON', 'SAN CRISTOBAL',
       'NUÃ‘EZ', 'VILLA ORTUZAR', 'COLEGIALES', 'VERSALLES', 'MONSERRAT',
       'CHACARITA', 'BOEDO', 'LINIERS', 'VILLA DEVOTO', 'BARRACAS',
       'VILLA LUGANO', 'PARQUE CHACABUCO', 'PUERTO MADERO', 'BOCA',
       'PARQUE AVELLANEDA', 'MATADEROS', 'VILLA RIACHUELO', 'VILLA REAL',
       'PARQUE CHAS', nan, 'NUÑEZ', 'SD', 'LA BOCA', 'CONTITUCIÓN',
       'RODRIGO BUENO', 'AV BOEDO', 'NO ESPECIFICADA',
       'GREGORIO DE LAFERRERE', 'FLORIDA', 'BERNAL', 'DOCK SUD', '0',
       'SANTA MARÍA', 'BANFIELD OESTE', 'VIL

In [576]:
# Filter out missing and suspicious neighborhood values

crimes_total_fin = crimes_total[crimes_total['neighborhood'].notna()]
crimes_total_fin = crimes_total_fin[crimes_total_fin['neighborhood'] != '0']

In [577]:
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
crimes_total_zero = crimes_total[crimes_total['neighborhood'] == '0'].copy()

In [578]:
crimes_total_zero['latitude'] = crimes_total_zero['latitude'].astype(float)
crimes_total_zero['longitude'] = crimes_total_zero['longitude'].astype(float)


Because fewer than 0.1% of all records have a missing neighborhood value, we exclude them from the study; their effect on the results is negligible.

In [579]:
crimes_total_infin = crimes_total[crimes_total['neighborhood'].isna()]
perc = 100*crimes_total_infin.shape[0]/crimes_total.shape[0]
print(round(perc, 3))

0.075


Let's explore the records whose neighborhood is recorded as '0'.

In [580]:
crimes_total_zero['longitude'].min(), crimes_total_zero['longitude'].max(), crimes_total_zero['latitude'].min(), crimes_total_zero['latitude'].max()

(-583961199.0, -58374817.0, -346148368.0, -34573171.0)

Locations in Buenos Aires have a longitude of about -58 and a latitude of about -34. The values above, however, are too large by factors of a million to ten million, which suggests that the decimal point was lost. Let's rescale them.

In [581]:
# Divide by 10 until every longitude is back above -100
while (crimes_total_zero['longitude'] < -100).any():
    crimes_total_zero.loc[crimes_total_zero['longitude'] < -100, 'longitude'] /= 10

In [582]:
# Same rescaling for latitude
while (crimes_total_zero['latitude'] < -100).any():
    crimes_total_zero.loc[crimes_total_zero['latitude'] < -100, 'latitude'] /= 10

In [583]:
crimes_total_zero['longitude'].min(), crimes_total_zero['longitude'].max(), crimes_total_zero['latitude'].min(), crimes_total_zero['latitude'].max()

(-58.515837, -58.37481700000001, -34.678989, -34.573171)
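The rescaling logic can be checked on single values. The helper below is a hypothetical scalar version of the loops above (the name `rescale_coordinate` is ours). It relies on the assumption that every valid coordinate in the city has exactly two digits before the decimal point, which holds for values near -58 and -34.

```python
def rescale_coordinate(value, lower=-100.0):
    """Divide by 10 until the value is no longer below `lower` (hypothetical helper)."""
    while value < lower:
        value /= 10
    return value

# The two extreme longitudes from the output above, plus an already valid value
for raw in (-583961199.0, -58374817.0, -58.5):
    print(raw, "->", rescale_coordinate(raw))
```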

Now we can spatially join these records with the `barrios` GeoDataFrame to obtain the actual neighborhood names in place of the zeros.

In [584]:
geometry = [Point(xy) for xy in zip(crimes_total_zero['longitude'], crimes_total_zero['latitude'])]
gcrimes_total_zero = gpd.GeoDataFrame(crimes_total_zero, geometry=geometry, crs='EPSG:4326')  # WGS84

barrios = barrios.to_crs('EPSG:4326')

gcrimes_total_zero_with_neigh = gpd.sjoin(gcrimes_total_zero, barrios[['neighborhood', 'geometry']], how='left', predicate='within')


In [585]:
gcrimes_total_zero_with_neigh = gcrimes_total_zero_with_neigh.drop(columns={'neighborhood_left', 'geometry', 'index_right'})
gcrimes_total_zero_with_neigh = gcrimes_total_zero_with_neigh.rename(columns={'neighborhood_right' : 'neighborhood'})


In [586]:
crimes_fin = pd.concat([crimes_total_fin, gcrimes_total_zero_with_neigh], ignore_index=True)
crimes_fin = crimes_fin.drop(columns={'day_of_week', 'date', 'hour', 'using_arm', 'using_motorbike', 'comuna', 'latitude', 'longitude', 'crime_subtype'})

Not all neighborhood names in the crime dataset match those in the Buenos Aires neighborhoods dataset. We map them manually, using Google Maps to look up the ambiguous ones.

In [588]:
replace_map = {
    'NUÃ‘EZ': 'NUÑEZ',
    'CONTITUCIÓN': 'CONSTITUCION',
    'LA BOCA': 'BOCA',
    'NUNEZ': 'NUÑEZ',
    'VILLA GENERAL MITRE': 'VILLA GRAL. MITRE',
    'VILLA ´PUEYRREDON': 'VILLA PUEYRREDON',
    'AV BOEDO': 'BOEDO',
    'SD': None,  # Unable to find out neighborhood
    'NO ESPECIFICADA': None, # Unable to find out neighborhood
    'RODRIGO BUENO': 'PUERTO MADERO',
    'FLORIDA': None, # The neighborhood extends beyond the city limits
    'GREGORIO DE LAFERRERE': None, # The neighborhood extends beyond the city limits
    'BERNAL': None, # The neighborhood extends beyond the city limits
    'DOCK SUD': None, # The neighborhood extends beyond the city limits
    'SANTA MARÍA': None, # The neighborhood extends beyond the city limits
    'BANFIELD OESTE': None, # The neighborhood extends beyond the city limits
    'CASEROS': None, # The neighborhood extends beyond the city limits
}

In [589]:
crimes_fin['neighborhood'] = crimes_fin['neighborhood'].replace(replace_map)
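Entries such as `'NUÃ‘EZ'` look like UTF-8 text that was decoded as Windows-1252. Under that assumption, this kind of damage can also be reversed programmatically instead of being listed by hand. The helper below is a hypothetical sketch (`fix_mojibake` is not part of the notebook); it leaves a string unchanged when the round trip fails.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were decoded as Windows-1252; return text unchanged on failure."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

garbled = "NU\u00c3\u2018EZ"  # displayed as 'NUÃ‘EZ'
print(fix_mojibake(garbled), fix_mojibake("NU\u00d1EZ"), fix_mojibake("PALERMO"))
```

Misspellings such as `'CONTITUCIÓN'` and aliases such as `'LA BOCA'` still need the manual mapping.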

Next, we group the crime types into broader categories and exclude traffic incidents.

In [590]:
repl_crimes = {
    'Theft' : 'property_crime',
    'Robbery' : 'property_crime',
    'Injuries' : 'injuries_crime',
    'Homicides' : 'homicides_crime',
    'Threats' : 'threats_crime'
}
crimes_fin = crimes_fin[crimes_fin['crime_type'] != 'Traffic Incidents']
crimes_fin['crime_type'] = crimes_fin['crime_type'].replace(repl_crimes)

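Now we will calculate the average monthly number of crimes of each type per neighborhood. Only months with at least one recorded crime of a given type appear in the grouped data, so the mean is taken over those months, not over every calendar month in the period.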

In [591]:
crimes_fin_grpd = crimes_fin.groupby(['neighborhood', 'year', 'month', 'crime_type'], as_index=False)['amount'].sum()
crimes_fin_mean = crimes_fin_grpd.groupby(['neighborhood', 'crime_type'], as_index=False)['amount'].mean()
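A consequence of the two-step `groupby` is that a month with no crime of a given type contributes no row, so it is excluded from the mean instead of counting as zero. For rare crime types this raises the average. The toy example below uses made-up numbers, and `n_months` is a hypothetical window length:

```python
import pandas as pd

# Made-up monthly totals: homicides were recorded in only 2 of 4 months
monthly = pd.DataFrame({
    "neighborhood": ["TOY"] * 2,
    "year": [2021, 2021],
    "month": [1, 3],
    "crime_type": ["homicides_crime"] * 2,
    "amount": [1, 2],
})

# Mean over months that appear in the data (what the two groupby calls compute)
mean_observed = monthly["amount"].mean()

# Mean over all months in the window, counting absent months as zero
n_months = 4  # hypothetical window length
mean_all = monthly["amount"].sum() / n_months
print(mean_observed, mean_all)
```

Which definition is preferable depends on how the feature is used later; dividing the total by the number of months in the study period is the variant that treats quiet months as zeros.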

In [592]:
crimes_per_month = crimes_fin_mean.pivot(index='neighborhood', columns='crime_type', values='amount').reset_index().fillna(0).rename_axis(None, axis=1)

crimes_per_month.head(3)

Unnamed: 0,neighborhood,homicides_crime,injuries_crime,property_crime,threats_crime
0,AGRONOMIA,0.0,2.642857,29.055556,2.133333
1,ALMAGRO,1.25,24.444444,355.972222,24.25
2,BALVANERA,1.538462,52.777778,597.944444,37.472222


### 5. Adding Neighborhood to Educational Institutions Data

In [593]:
df_education['neighborhood'].unique()

array(['VILLA SOLDATI', 'FLORES', 'PALERMO', 'NUÃ‘EZ', 'MONTE CASTRO',
       'BELGRANO', 'COGHLAN', 'CABALLITO', 'VILLA CRESPO',
       'VILLA GRAL. MITRE', 'RETIRO', 'RECOLETA', 'VILLA URQUIZA',
       'CONSTITUCION', 'NUEVA POMPEYA', 'LINIERS', 'ALMAGRO',
       'VILLA LUGANO', 'PARQUE CHACABUCO', 'PARQUE AVELLANEDA',
       'SAAVEDRA', 'PARQUE PATRICIOS', 'VILLA SANTA RITA', 'PATERNAL',
       'BALVANERA', 'VILLA DEL PARQUE', 'VILLA DEVOTO', 'VILLA REAL',
       'MATADEROS', 'MONTSERRAT', 'FLORESTA', 'AGRONOMIA', 'BOEDO',
       'SAN CRISTOBAL', 'VILLA ORTUZAR', 'PARQUE CHAS', 'SAN NICOLAS',
       'SAN TELMO', 'BOCA', 'VILLA LURO', 'VERSALLES', 'VELEZ SARSFIELD',
       'VILLA PUEYRREDON', 'BARRACAS', 'CHACARITA', 'COLEGIALES',
       'VILLA RIACHUELO', nan, 'PUERTO MADERO', '6', '10', '3'],
      dtype=object)

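As the list above shows, not every neighborhood value is usable: some are mis-encoded ('NUÃ‘EZ'), some are spelled differently from the crime data ('MONTSERRAT' versus 'MONSERRAT'), and some are numbers ('6', '10', '3') or missing. We therefore assign neighborhoods from the coordinates with a spatial join.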

In [594]:
geometry = [Point(xy) for xy in zip(df_education['institution_long'], df_education['institution_lat'])]
gdf_education = gpd.GeoDataFrame(df_education, geometry=geometry, crs='EPSG:4326')  # WGS84

barrios = barrios.to_crs('EPSG:4326')

gdf_education_with_neigh = gpd.sjoin(gdf_education, barrios[['neighborhood', 'geometry']], how='left', predicate='within')

Records that could not be matched to any neighborhood polygon cannot be used, so we drop them.

In [595]:
gdf_education_with_neigh = gdf_education_with_neigh[gdf_education_with_neigh['neighborhood_right'].notna()]
gdf_education_with_neigh = gdf_education_with_neigh.drop(columns={'geometry', 'institution_long', 'institution_lat', 'index_right', 'neighborhood_left'})
gdf_education_with_neigh = gdf_education_with_neigh.rename(columns={'neighborhood_right' : 'neighborhood'})

In [596]:
education_grpd = gdf_education_with_neigh.groupby(['neighborhood', 'sector'], as_index=False)['institution_name'].count()

In [597]:
education_grpd['sector'] = education_grpd['sector'].replace({1 : 'state_institution', 2 : 'private_institution'})
education_grpd_pv = education_grpd.pivot(index='neighborhood', columns='sector', values='institution_name').reset_index().fillna(0).rename_axis(None, axis=1)

education_grpd_pv.head()

Unnamed: 0,neighborhood,private_institution,state_institution
0,AGRONOMIA,6,16
1,ALMAGRO,73,70
2,BALVANERA,74,100
3,BARRACAS,30,86
4,BELGRANO,69,37



### 6. Adding Neighborhood to Green Zones Data

In [598]:
# (1) Loading and converting geometry
df_green_spase['geometry'] = df_green_spase['geometry'].apply(lambda x: wkt.loads(x) if isinstance(x, str) else x)

# (2) Create a GeoDataFrame and convert it to meters
gdf_green_spase = gpd.GeoDataFrame(df_green_spase, geometry='geometry', crs='EPSG:4326')
gdf_green_spase = gdf_green_spase.to_crs(epsg=3857)

# (3) Compute the area of each zone
gdf_green_spase['zone_area'] = gdf_green_spase.geometry.area

# (4) Reproject the neighborhoods to the same CRS
barrios = barrios.to_crs(epsg=3857)

# (5) Spatial intersection
green_spase_inter_area = gpd.overlay(
    gdf_green_spase,
    barrios[['neighborhood', 'area_neib', 'geometry']],
    how='intersection'
)

# (6) Intersection area
green_spase_inter_area['inter_area'] = green_spase_inter_area.geometry.area

In [599]:
green_spase_inter_area = green_spase_inter_area.drop(columns={'neighborhood_1', 'commune', 'geometry'})
green_spase_inter_area = green_spase_inter_area.rename(columns={'neighborhood_2': 'neighborhood'})

In [600]:
green_spase_inter_area['pati'] = 0
green_spase_inter_area.loc[
    (green_spase_inter_area['tiene_pati'] == 1.0),
    'pati'
] = 1
green_spase_inter_area = green_spase_inter_area.drop(columns={'tiene_pati'})

In [601]:
green_spase_sum_per_neib = (
    green_spase_inter_area
    .groupby(['neighborhood', 'area_neib'], as_index=False)
    .agg(total_green_area=('inter_area', 'sum'))
)

green_spase_amount_per_neib = (
    green_spase_inter_area
    .groupby(['neighborhood'], as_index=False)
    .agg(n_green_zones=('id', 'count'))
)

patio_amount_per_neib = (
    green_spase_inter_area
    .groupby(['neighborhood'], as_index=False)
    .agg(n_green_zones_with_pati=('pati', 'sum'))
)

mean_green_spase_per_neib = (
    green_spase_inter_area
    .groupby(['neighborhood'], as_index=False)
    .agg(mean_green_area=('inter_area', 'mean'))
)


In [602]:
green_spase_merged_1 = pd.merge(green_spase_sum_per_neib, green_spase_amount_per_neib, on='neighborhood', how='left')
green_spase_merged_2 = pd.merge(green_spase_merged_1, patio_amount_per_neib, on='neighborhood', how='left')
green_spase_final = pd.merge(green_spase_merged_2, mean_green_spase_per_neib, on='neighborhood', how='left')
green_spase_final['green_area_pct'] = 100 * green_spase_final['total_green_area'] / green_spase_final['area_neib']
green_spase_final = green_spase_final.drop(columns=['area_neib'])
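
The four group-bys and three merges above could also be written as a single named aggregation. The following is a minimal sketch on a toy stand-in for `green_spase_inter_area`; the frame `toy_zones` and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for green_spase_inter_area (invented values)
toy_zones = pd.DataFrame({
    "neighborhood": ["A", "A", "B"],
    "area_neib": [100.0, 100.0, 200.0],
    "id": [1, 2, 3],
    "inter_area": [10.0, 30.0, 50.0],
    "pati": [1, 0, 0],
})

green_summary = (
    toy_zones
    .groupby(["neighborhood", "area_neib"], as_index=False)
    .agg(
        total_green_area=("inter_area", "sum"),
        n_green_zones=("id", "count"),
        n_green_zones_with_pati=("pati", "sum"),
        mean_green_area=("inter_area", "mean"),
    )
)
green_summary["green_area_pct"] = (
    100 * green_summary["total_green_area"] / green_summary["area_neib"]
)
green_summary = green_summary.drop(columns=["area_neib"])
print(green_summary)
```

Computing every statistic in one pass keeps the features aligned by construction, so no intermediate merges are needed.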

# Final Data Merging

Merging Neighborhood, Population, Slums and Hospital Datasets

In [512]:
merged_np = pd.merge(df_neighborhood, df_population, on='neighborhood', how='left')
merged_nps = pd.merge(merged_np, slums_count, on='neighborhood', how='left')
merged_npsh = pd.merge(merged_nps, df_hospitals, on='neighborhood', how='left')

Merging the Result with the Subway, Noise and Crimes Datasets

In [513]:
merged_npshw = pd.merge(merged_npsh, gdf_subway_with_neigh, on='neighborhood', how='left')
merged_npshwns = pd.merge(merged_npshw, noise_inter_area_pivot, on='neighborhood', how='left')
merged_all_crimes = pd.merge(merged_npshwns, crimes_per_month, on='neighborhood', how='left')

Merging the Result with the Education and Green Space Datasets

In [603]:
merged_all_education = pd.merge(merged_all_crimes, education_grpd_pv, on='neighborhood', how='left')
data_final = pd.merge(merged_all_education, green_spase_final, on='neighborhood', how='left')
data_final.head()

Unnamed: 0,neighborhood,commune,perimeter_neib,area_neib,geometry,population,slum,amount_of_hospitals,station,day_noise,...,injuries_crime,property_crime,threats_crime,private_institution,state_institution,total_green_area,n_green_zones,n_green_zones_with_pati,mean_green_area,green_area_pct
0,AGRONOMIA,15,6556.17,2122169.34,POLYGON ((-58.475888981732986 -34.591723461272...,13912,,,,22.211583,...,2.642857,29.055556,2.133333,6,16,6092.036,11,0,553.821419,0.287066
1,ALMAGRO,5,8537.9,4050752.25,POLYGON ((-58.416002854915654 -34.597854231564...,131699,,1.0,5.0,28.458668,...,24.444444,355.972222,24.25,73,70,15445.74,12,4,1287.14497,0.381305
2,BALVANERA,3,8375.82,4342280.27,POLYGON ((-58.392934155259674 -34.599636447011...,138926,,1.0,12.0,27.401132,...,52.777778,597.944444,37.472222,74,100,46252.71,9,6,5139.189604,1.065171
3,BARRACAS,4,12844.17,7954579.06,POLYGON ((-58.3706620577617 -34.62949214687238...,89452,2.0,6.0,,32.513541,...,43.027778,286.055556,43.944444,30,86,396828.1,101,17,3928.991535,4.988676
4,BELGRANO,13,20443.29,8025458.65,POLYGON ((-58.45056826999142 -34.5356114803969...,126267,,1.0,2.0,26.384386,...,17.666667,291.055556,16.527778,69,37,1588319.0,66,13,24065.44228,19.791008
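
Each left merge above assumes that the right-hand table has at most one row per neighborhood; a duplicated key would silently duplicate rows in the result. A minimal sketch of guarding against that with the `validate` argument of `pd.merge`, using invented toy frames and a hypothetical helper `left_merge`:

```python
from functools import reduce

import pandas as pd

# Toy frames (invented values) standing in for the per-neighborhood tables
base = pd.DataFrame({"neighborhood": ["A", "B", "C"]})
hospitals = pd.DataFrame({"neighborhood": ["A", "B"], "amount_of_hospitals": [1, 6]})
stations = pd.DataFrame({"neighborhood": ["B"], "station": [5]})

def left_merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Left-join on 'neighborhood', failing loudly if keys are duplicated."""
    return pd.merge(left, right, on="neighborhood", how="left", validate="one_to_one")

# Fold all feature tables into the base table, one merge at a time
merged = reduce(left_merge, [hospitals, stations], base)
print(merged)

# A table with a duplicated key raises pandas.errors.MergeError
duplicated = pd.DataFrame({"neighborhood": ["A", "A"], "x": [1, 2]})
try:
    left_merge(base, duplicated)
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

With this check in place, the row count of the merged table stays equal to the number of neighborhoods in the base table.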


In [604]:
# Checking missing values

data_final.isna().sum()

neighborhood                0
commune                     0
perimeter_neib              0
area_neib                   0
geometry                    0
population                  0
slum                       38
amount_of_hospitals        29
station                    28
day_noise                   0
night_noise                 0
homicides_crime             0
injuries_crime              0
property_crime              0
threats_crime               0
private_institution         0
state_institution           0
total_green_area            0
n_green_zones               0
n_green_zones_with_pati     0
mean_green_area             0
green_area_pct              0
dtype: int64

In [605]:
# Missing values occur only in count columns (slum, amount_of_hospitals, station),
# where a missing value means the left merge found no matching record, so we fill them with 0.

data_final = data_final.fillna(0)
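
A blanket `fillna(0)` would also zero out any gap that later appears in a measured column such as `day_noise`. A minimal sketch that restricts the fill to the three count columns reported by the missing-value check; the frame `toy_final` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_final (invented values)
toy_final = pd.DataFrame({
    "neighborhood": ["A", "B"],
    "slum": [np.nan, 2.0],
    "amount_of_hospitals": [1.0, np.nan],
    "station": [np.nan, np.nan],
    "day_noise": [22.2, np.nan],
})

# Only these columns are counts, where "missing" means "none found"
count_cols = ["slum", "amount_of_hospitals", "station"]
toy_final[count_cols] = toy_final[count_cols].fillna(0).astype(int)
print(toy_final)
```

Here the gap in `day_noise` stays visible as `NaN` instead of being read as a noise level of zero.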

In [606]:
# Save the final version of the data to a CSV file

data_final.to_csv('../data/Full_ba_neighborhood_data.csv', index=False)

As a result, we obtained a clean dataset covering 48 neighborhoods with 22 columns (including the neighborhood name and geometry), ready for feature engineering and modeling.