# Combining data sources on a geographic level

1) Map data on IRIS is extracted from here: https://geoservices.ign.fr/contoursiris

2) All other geo identifiers from here: https://geoservices.ign.fr/adminexpress

3) Netmob data from: https://netmob2023challenge.networks.imdea.org

### Import packages

First, we load the required packages.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
from tqdm import tqdm
import json

### Load IRIS- and commune-level data for relevant identifiers

In [None]:
map_iris = gpd.read_file("../external_data/CONTOURS-IRIS_2-1__SHP__FRA_2019-01-01/CONTOURS-IRIS/1_DONNEES_LIVRAISON_2020-01-00139/CONTOURS-IRIS_2-1_SHP_LAMB93_FXX-2019/CONTOURS-IRIS.shp")

In [None]:
map_com = gpd.read_file("../external_data/ADMIN-EXPRESS-COG_2-0__SHP__FRA_2019-09-24/ADMIN-EXPRESS-COG/1_DONNEES_LIVRAISON_2019-09-24/ADE-COG_2-0_SHP_WGS84_FR/COMMUNE.shp")

In [None]:
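# Count the ADMIN-EXPRESS communes that have no IRIS contour.
# This has to run on the raw ADMIN-EXPRESS table: once map_com is rebuilt
# from map_iris in the next cell, the count is always 0.
map_com[~map_com['INSEE_COM'].isin(map_iris['INSEE_COM'].unique())]['INSEE_COM'].unique().size

In [None]:
# Build commune shapes by dissolving the IRIS polygons, then attach the ADMIN-EXPRESS attributes
map_com = (map_iris
           .dissolve(by = 'INSEE_COM')
           .reset_index()
           .merge(map_com[['INSEE_COM', 'CODE_EPCI', 'INSEE_DEP', 'INSEE_REG', 'POPULATION']], on = 'INSEE_COM', how = 'left'))

In [None]:
# The dissolved shapes should already be in the IRIS CRS; this call only makes that explicit
map_com = map_com.to_crs(map_iris.crs)

Most of the communes missing from the IRIS contours are most likely in French overseas territories, so we still use the commune-level shapes for now.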
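To see which commune codes are affected rather than only how many, one option is an anti-join with pandas. The sketch below uses two small made-up tables in place of the ADMIN-EXPRESS and IRIS sources; the helper name `missing_communes` and the example codes are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two sources; the codes are illustrative only.
com_codes = pd.DataFrame({"INSEE_COM": ["01001", "69123", "97101"]})
iris_codes = pd.DataFrame({"INSEE_COM": ["01001", "69123", "69123"]})


def missing_communes(com, iris, key="INSEE_COM"):
    """Return the rows of `com` whose key has no match in `iris` (anti-join)."""
    merged = com.merge(
        iris[[key]].drop_duplicates(),  # one row per commune code
        on=key,
        how="left",
        indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
    )
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


missing = missing_communes(com_codes, iris_codes)
print(missing["INSEE_COM"].tolist())
```

Applied to the real tables, the same pattern would return the unmatched rows themselves, which makes it possible to check whether they are in fact overseas communes.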

### Combine urban areas included in the Netmob dataset with geographic identifiers in France

Defining names of urban areas under study

In [None]:
cities = ['Bordeaux', 'Clermont-Ferrand', 'Dijon', 'Grenoble', 'Lille',
          'Lyon', 'Mans', 'Marseille', 'Metz', 'Montpellier',
          'Nancy', 'Nantes', 'Nice', 'Orleans', 'Paris',
          'Rennes', 'Saint-Etienne', 'Strasbourg', 'Toulouse', 'Tours']

Creating a crosswalk between the different geographic identifiers (tile_id, CODE_IRIS, code_com)

There are several ways to assign tiles to communes:

1) Mapping the tiles as polygons: a tile can match every commune it touches, so communes just outside the margins of the urban area are included in many cases
2) Mapping the tiles by their centroids: a tile matches only the commune that contains its centroid, so communes just outside the margins are included in only a few cases
3) Mapping the tiles to their closest neighbor

We go with option 2).
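The difference between the three options can be illustrated on a toy example. The sketch below is not taken from the notebook: the two square "communes", the two tiles, and all variable names are made up, and using `gpd.sjoin_nearest` on centroids is only one possible reading of option 3.

```python
import geopandas as gpd
from shapely.geometry import box

# Two toy communes side by side, in Lambert-93 (EPSG:2154, a projected CRS in metres).
communes = gpd.GeoDataFrame(
    {"code_com": ["A", "B"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:2154",
)
# Tile 1 straddles the A/B border (mostly in A); tile 2 lies outside both communes.
tiles = gpd.GeoDataFrame(
    {"tile_id": [1, 2]},
    geometry=[box(8, 4, 11, 6), box(21, 4, 23, 6)],
    crs="EPSG:2154",
)

# Option 1: polygon join, a tile matches every commune it intersects.
by_polygon = gpd.sjoin(tiles, communes, predicate="intersects", how="inner")

# Option 2: centroid join, a tile matches only the commune containing its centroid.
centroids = tiles.copy()
centroids["geometry"] = tiles.geometry.centroid
by_centroid = gpd.sjoin(centroids, communes, predicate="within", how="inner")

# Option 3 (one interpretation): each tile centroid goes to its nearest commune.
by_nearest = gpd.sjoin_nearest(centroids, communes, how="inner")

for name, result in [("polygon", by_polygon), ("centroid", by_centroid), ("nearest", by_nearest)]:
    print(name, sorted(zip(result["tile_id"], result["code_com"])))
```

In this toy setup the polygon join assigns the straddling tile to both communes, the centroid join assigns it to one, and only the nearest join also assigns the tile that lies outside every commune.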

In [None]:
frames = []

for city_str in tqdm(cities):

    # Tile grid of the urban area, reprojected to the CRS of the IRIS map
    shape = gpd.read_file(f'../Data/Netmob/{city_str}/{city_str}.geojson')
    shape = shape.to_crs(map_iris.crs)

    # Keep the tile polygons aside and join on the tile centroids (option 2)
    shape['poly'] = shape["geometry"]
    shape['geometry'] = shape["geometry"].centroid

    tmp = gpd.sjoin(shape,
                    map_com[['INSEE_COM', 'INSEE_DEP', 'INSEE_REG', 'CODE_EPCI', 'POPULATION', 'NOM_COM', 'geometry']],
                    predicate = 'within', how = "inner")

    tmp['cities'] = city_str

    # Restore the tile polygons as the geometry of the result
    tmp['geometry'] = tmp["poly"]

    tmp = (tmp
           .drop(columns=['index_right', 'poly'])
           .rename(columns={"INSEE_COM": "code_com", 'INSEE_DEP': "code_dep", 'INSEE_REG': "code_reg", 'CODE_EPCI': "code_epci", 'NOM_COM': "name_com", 'POPULATION': "pop"}))

    frames.append(tmp)

# Concatenate once instead of growing the frame inside the loop
x_walk = pd.concat(frames)

In [None]:
x_walk = x_walk.drop_duplicates(['tile_id', 'cities'])
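Before dropping duplicates it can be useful to look at which tiles matched more than one commune, since `drop_duplicates` silently keeps the first match. The sketch below runs on a small made-up crosswalk; the rows and the names `toy`, `conflicts`, and `deduped` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical crosswalk rows: tile 2 of Lyon matched two communes.
toy = pd.DataFrame({
    "tile_id": [1, 2, 2, 1],
    "cities": ["Lyon", "Lyon", "Lyon", "Paris"],
    "code_com": ["69123", "69123", "69266", "75056"],
})

key = ["tile_id", "cities"]

# keep=False flags every row of a duplicated group, not only the later ones.
conflicts = toy[toy.duplicated(key, keep=False)]
print(conflicts)

# Same call as in the notebook: the first match per (tile_id, cities) is kept.
deduped = toy.drop_duplicates(key)

# After deduplication each (tile_id, cities) pair appears exactly once.
assert not deduped.duplicated(key).any()
```

Note that the same `tile_id` can appear in several cities, which is why the key includes both columns.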

In [None]:
x_walk.shape

In [None]:
x_walk.head()

Save the crosswalk as a CSV file and the tiles with their geometries as a GeoPackage

In [None]:
x_walk[['tile_id', 'code_com', 'name_com', 'code_epci', 'cities', 'code_dep', 'code_reg']].to_csv("../midsave/xwalk.csv", index=False)
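One thing to watch when reading the CSV back: INSEE codes are identifiers, not numbers, and pandas infers an integer type when every value in a column looks numeric, which drops leading zeros. The sketch below shows the effect on an in-memory CSV with illustrative codes; passing `dtype` for the code columns is a suggestion, not something the original notebook does.

```python
import io

import pandas as pd

# Small in-memory stand-in for xwalk.csv; the codes are illustrative only.
csv_text = "tile_id,code_com,code_dep\n1,01001,01\n2,01004,01\n"

# Default type inference turns the codes into integers and loses the leading zero.
as_default = pd.read_csv(io.StringIO(csv_text))

# Reading the code columns as strings keeps them intact.
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"code_com": str, "code_dep": str},
)

print(as_default["code_com"].tolist())
print(as_text["code_com"].tolist())
```

The same `dtype` argument would apply to `code_epci` and `code_reg` if they are later used as merge keys.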

In [None]:
x_walk.to_file("../midsave/map_crime.gpkg", layer='tiles', driver="GPKG")