# Notebook1: Preprocessing
In this notebook:  
We preprocess the provided polygonal dataset and build a networkx graph of the city of Milan from it.  
Intersections are the nodes of the network, and the roads connecting them are the edges.  
The edge attributes assigned are length, width, and betweenness.  
We save the final graph to a .gml file, along with the preprocessed polygonal dataset (.gpkg) and the edgelist (.csv).  
For simulations of bike lane networks on this network, see notebook 2: Simulations.ipynb.  
For analysis of the simulations and of the network, see notebook 3: Network_Analysis.ipynb.  

Steps:
1) read raw .shp file as geodataframe
2) remove and rename columns, assign IDs
3) dissolve adjacent roads
4) calculate lengths and widths
5) create edgelist dataframe
6) create graph and calculate edge betweenness
7) refine the betweenness calculation
8) save data


In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import contextily as cx
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig
import utils

<module 'utils' from '/data/home/basilone/sony/Milan_Graphs_Basilone/utils.py'>

## Read file

In [None]:
vehicle_path = 'MILANO/AC_VEI_AC_VEI_SUP_SR.shp'

In [3]:
gdf1 = gpd.read_file(vehicle_path)

In [4]:
gdf1.head()

Unnamed: 0,NOME,SUBREGID,CLASSREF,AC_VEI_FON,AC_VEI_LIV,AC_VEI_SED,AC_VEI_ZON,geometry
0,VIA TOMMASO GULLI,AC_VEI_SR_29386,AC_VEI_10701,1,2,1,4,"POLYGON Z ((510264.228 5034645.372 121.18, 510..."
1,VIA ODOARDO TABACCHI,AC_VEI_SR_22800,AC_VEI_9928,1,2,1,4,"POLYGON Z ((514556.007 5032723.024 113.789, 51..."
2,VIA GIAN CARLO CASTELBARCO,AC_VEI_SR_22801,AC_VEI_9942,1,2,1,4,"POLYGON Z ((514555.816 5032740.825 114.222, 51..."
3,VIA ODOARDO TABACCHI,AC_VEI_SR_22802,AC_VEI_9928,1,2,1,4,"POLYGON Z ((514552.787 5032740.493 114.214, 51..."
4,VIA ROBERTO SARFATTI,AC_VEI_SR_22803,AC_VEI_9918,1,2,1,4,"POLYGON Z ((514645.162 5032753.152 114.286, 51..."


The columns are:
- NOME: name of the region. For a street it is "VIA ...", but it can also be "Parcheggio" or something else.
- SUBREGID: unique identifier of the polygon.
- AC_VEI_FON: either 01 (paved street) or 02 (unpaved).
- AC_VEI_LIV: either 01 (in an underpass) or 02 (not in an underpass).
- AC_VEI_SED: 01 (street level), 02 (on a bridge of sorts), 03 (in a gallery), or 04 (on a dam).
- AC_VEI_ZON: type of region (road, roundabout, parking lot, etc.). We use it below to filter the data.

## Initial cleanup

In [5]:
#initial cleaning 

gdf1.drop(['AC_VEI_FON', 'CLASSREF'], axis = 1, inplace = True)
gdf1.rename(columns={'SUBREGID':'ID', 'NOME': 'NAME', 'AC_VEI_ZON': 'TYPE',
                     'AC_VEI_SED': 'PLACEMENT', 'AC_VEI_LIV': 'LEVEL'}, inplace = True)

In [6]:
def rename(df, column1, column2):
    mapping1 = {
        "01": "street level",
        "02": "bridge",
        "03": "gallery",
        "04": "dam"
    }

    mapping2 = {
        "01": "underpass",
        "02": "not underpass"
    }
    df[column1] = df[column1].map(mapping1).fillna(df[column1])
    df[column2] = df[column2].map(mapping2).fillna(df[column2])
    df.ID = "ID_" + df.ID.astype(str)
    return df

def transform_id(id_str):
    return 'ID_' + id_str.split('_')[-1]



In [7]:
gdf1 = rename(gdf1,"PLACEMENT", "LEVEL")
gdf1['ID'] = gdf1['ID'].apply(transform_id)

In [8]:
# filter out everything that is not a road or an intersection
# portions of road (e.g not intersections or parking lots) start with 01 in TYPE
# intersections, squares, and roundabouts start with 02 in TYPE

pattern = ('01','02')
gdf1 = gdf1.loc[gdf1.TYPE.str.startswith(pattern)].copy()

In [9]:
# change crs to italian one.
# OSM_crs = 3857
ITALY_crs = 6875
gdf1.to_crs(epsg=ITALY_crs, inplace = True)



In [10]:
gdf1.head()

Unnamed: 0,NAME,ID,LEVEL,PLACEMENT,TYPE,geometry
10,VIA LORENTEGGIO,ID_22809,not underpass,street level,205,"POLYGON Z ((6776304.852 5031392.755 117.813, 6..."
11,VIA LORENTEGGIO,ID_22810,not underpass,street level,205,"POLYGON Z ((6775647.989 5031111.851 117.016, 6..."
12,VIA LORENTEGGIO,ID_22811,not underpass,street level,205,"POLYGON Z ((6777349.132 5031825.257 117.638, 6..."
13,VIA LORENTEGGIO,ID_22812,not underpass,street level,205,"POLYGON Z ((6776583.882 5031489.752 117.567, 6..."
14,VIA LORENTEGGIO,ID_22813,not underpass,street level,205,"POLYGON Z ((6774677.188 5030729.219 115.947, 6..."


### Removing motorways, ringroads
This is specific to this dataset, and unfortunately difficult to generalize at this level. After extensive testing by hand, we find that it is crucial to remove these roads in order to obtain a usable network without problems.

In [11]:
gdf = gdf1.copy()
gdf = gdf.loc[~gdf['NAME'].str.contains('TANGENZIALE', regex = False)] #removing tangenziali
gdf = gdf.loc[~gdf['NAME'].str.contains('VIA DEL MARE', regex = False)] #removing more useless roads
gdf = gdf.loc[~(gdf['NAME'].str.contains('STRADA SENZA', regex = False)) | (gdf['NAME'].str.contains('324', regex = False))]
gdf = gdf.loc[~gdf['NAME'].str.contains('STRADA_SENZA', regex = False)]

## Dissolve adjacent roads

Each road geometry (TYPE 01) must touch two intersection geometries (TYPE 02) so that it can be assigned as the edge between those two intersection nodes.  
Right now the dataset contains many chains of adjacent road geometries, which must be dissolved. The dissolver function detects chains of adjacent geometries and dissolves each chain into one geometry.  
Testing shows that, for this dataset, it makes sense to treat geometries less than 0.1 m apart as touching.  
Finally, before dissolving all the adjacent road groups, we must handle some problematic cases manually (namely bridges and overpasses).
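
The implementation of `utils.dissolver_tot` is not shown in this notebook, so the following is only a minimal sketch of the idea of grouping pieces into chains with a distance tolerance. It assumes, for illustration, that each piece is reduced to a hypothetical axis-aligned bounding box `(minx, miny, maxx, maxy)`; the names `box_gap` and `group_chains` are made up for this example, and the real function works on the polygon geometries.

```python
from itertools import combinations

TOLERANCE = 0.1  # metres; the threshold quoted above


def box_gap(a, b):
    """Distance between two axis-aligned boxes (minx, miny, maxx, maxy); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx ** 2 + dy ** 2) ** 0.5


def group_chains(boxes, tol=TOLERANCE):
    """Group boxes into chains in which neighbours are closer than `tol` (union-find)."""
    parent = {name: name for name in boxes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge the groups of every pair of pieces that count as touching
    for a, b in combinations(boxes, 2):
        if box_gap(boxes[a], boxes[b]) < tol:
            parent[find(a)] = find(b)

    groups = {}
    for name in boxes:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical road pieces: r1 and r2 are 0.05 m apart, r2 and r3 touch, r4 is 5 m away
boxes = {
    "r1": (0.0, 0.0, 10.0, 5.0),
    "r2": (10.05, 0.0, 20.0, 5.0),
    "r3": (20.0, 0.0, 30.0, 5.0),
    "r4": (35.0, 0.0, 45.0, 5.0),
}
chains = group_chains(boxes)
print(chains)  # [['r1', 'r2', 'r3'], ['r4']]
```

Each resulting chain would then be dissolved into a single geometry. Note that r1 and r3 end up in the same chain even though they are 10 m apart, because r2 links them.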

### Exception handling for single streets that cause problems
First, Cavalcavia Schiavoni and Piazza Maggi.

In [12]:
#make piazza maggi into one block
temp = gdf1[gdf1.NAME == 'PIAZZA GIAN ANTONIO MAGGI']
temp = temp.dissolve(by = 'NAME', as_index = False)

gdf = gdf[gdf['NAME'] != 'PIAZZA GIAN ANTONIO MAGGI']
gdf = pd.concat([gdf,temp])
gdf.reset_index(inplace = True, drop = True)

#then dull some edges
dull_ids = ['ID_30921', 'ID_35262', 'ID_30485', 'ID_30132', 'ID_30208',
            'ID_30210', 'ID_30211', 'ID_30212', 'ID_30206']
mask = gdf['ID'].isin(dull_ids)
gdf.loc[mask, 'geometry'] = gdf.loc[mask, 'geometry'].buffer(-0.8, join_style=1).buffer(0.3, join_style = 1) #dulls corners

#and bulge some corners
bulge_ids = ['ID_25090', 'ID_25093', 'ID_25092', 'ID_32131', 'ID_32132',
             'ID_29939', 'ID_24774', 'ID_24648']
mask = gdf['ID'].isin(bulge_ids)
gdf.loc[mask, 'geometry'] = gdf.loc[mask, 'geometry'].buffer(0.5, join_style = 1)

Next, Cavalcavia Renato Serra leads to some problems

In [13]:
#cavalcavia renato serra
remove_ids = ['ID_3474', 'ID_3473', 'ID_2922', 'ID_2921', 'ID_3853', 'ID_3852',
              'ID_3461', 'ID_3459', 'ID_3040', 'ID_2994', 'ID_1839', 'ID_1840']
gdf = gdf[~gdf.ID.isin(remove_ids)]
#dull some edges
mask = gdf['ID'].isin(['ID_1845', 'ID_1843'])
gdf.loc[mask, 'geometry'] = gdf.loc[mask, 'geometry'].buffer(-0.8, join_style=1).buffer(0.3, join_style = 1) #dulls corners
#and add bulge to the intersections
mask = gdf['ID'].isin(['ID_544', 'ID_573'])
gdf.loc[mask, 'geometry'] = gdf.loc[mask, 'geometry'].buffer(0.5, join_style = 1) #bulges the intersections

Next, Sopraelevata Monteceneri.


In [14]:
gdf = gdf[gdf.ID != 'ID_3850']
gdf = gdf[gdf.ID != 'ID_3851']

Now, Via Rubattino.

In [15]:
#dull some edges
mask = gdf['ID'].isin(['ID_19697', 'ID_19700', 'ID_19701'])
gdf.loc[mask, 'geometry'] = gdf.loc[mask, 'geometry'].buffer(-0.8, join_style=1).buffer(0.3, join_style = 1) #dulls corners
#and add bulge to the intersections
mask = gdf['ID'].isin(['ID_9534', 'ID_9404'])
gdf.loc[mask, 'geometry'] = gdf.loc[mask, 'geometry'].buffer(0.5, join_style = 1) #bulges the intersections

In [16]:
dissolved_gdf = utils.dissolver_tot(gdf)

## Calculate lengths and widths

### Average width calculation: 
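We approximate each polygon as a rectangle of length $l$ and width $w$, with area $A$ and perimeter $P$:  
$A = lw$  
$P = 2l+2w$  
Substituting $l = \frac{A}{w}$ into the perimeter gives  

$P = 2\frac{A}{w}+2w$  
so  
$w^2 -\frac{P}{2}w+A = 0$  
The two roots of this quadratic are the sides of the rectangle. By convention the smaller root is the width and the larger is the length:  
$w = \frac{P - \sqrt{P^2 - 16A}}{4}$, $l = \frac{P + \sqrt{P^2 - 16A}}{4}$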
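
The quadratic above can be solved directly. The following is a minimal sketch of that calculation, not the code of `utils.widths_and_lengths` (which is not shown here). The function name `rect_width_length` is hypothetical, and clamping a negative discriminant to zero is an assumption of this sketch.

```python
import math


def rect_width_length(area, perimeter):
    """Return (width, length) of the rectangle with the given area and perimeter.

    Solves w^2 - (P/2) w + A = 0: the smaller root is taken as the width,
    the larger root as the length.
    """
    half_p = perimeter / 2
    disc = half_p ** 2 - 4 * area
    # For compact shapes (e.g. a near-circular roundabout) no rectangle has
    # this area and perimeter, so the discriminant is negative. Clamping it
    # to zero (i.e. returning a square) is an assumption of this sketch.
    disc = max(disc, 0.0)
    root = math.sqrt(disc)
    width = (half_p - root) / 2
    length = (half_p + root) / 2
    return width, length


# A 10 m x 30 m rectangle: area 300 m^2, perimeter 80 m
width, length = rect_width_length(area=300.0, perimeter=80.0)
print(width, length)  # 10.0 30.0
```

For polygons that are long and thin, such as most road segments, the rectangle approximation is reasonable; for squares and roundabouts it is only a rough estimate.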

In [17]:
dissolved_gdf = utils.widths_and_lengths(dissolved_gdf)

In [None]:
ints, roads = utils.ints_and_roads(dissolved_gdf)
roads.width.describe()

## Create edgelist

We make an edgelist of connected intersections, and assign each edge an ID, a width, and a length.

In [19]:
edgelist = utils.make_edgelist(dissolved_gdf)

### Length adjustments
We would like our edges to also account for the length of the intersections, since intersections are "collapsed" into point objects in the graph.  
To each edge we add a share of the length of the two intersections (nodes) it connects.  
The share of a node is its length divided by the number of times the node appears in the 'from' column (for the edge's origin) or in the 'to' column (for the edge's destination) of the edgelist.  
That is to say, a node that appears many times will probably be a square or roundabout, so we spread its length across its edges instead of adding all of it to each one.
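
As a small worked example of this adjustment, here is a sketch with plain Python containers instead of the dataframes used in the next cell. The node names and all the lengths are hypothetical.

```python
from collections import Counter

# Hypothetical toy edgelist: (from, to, road length in metres)
edges = [("A", "B", 100.0), ("A", "C", 50.0)]
# Hypothetical intersection "lengths" in metres
node_length = {"A": 20.0, "B": 10.0, "C": 0.0}

# How often each node appears as origin and as destination
from_counts = Counter(u for u, _, _ in edges)
to_counts = Counter(v for _, v, _ in edges)

# Each edge gets its own length plus a share of both end nodes
adjusted = [
    (u, v, length
     + node_length.get(u, 0) / from_counts[u]
     + node_length.get(v, 0) / to_counts[v])
    for u, v, length in edges
]
print(adjusted)  # [('A', 'B', 120.0), ('A', 'C', 60.0)]
```

Node A appears twice in the 'from' column, so each of its edges receives half of its 20 m; B and C appear once in the 'to' column, so their full length goes to their single edge.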

In [20]:
# Create a dictionary for quick lookup of lengths in dissolved_gdf
length_dict = dissolved_gdf.set_index('ID')['length'].to_dict()

# Count occurrences of 'from' and 'to' in edgelist
from_counts = edgelist['from'].value_counts().to_dict()
to_counts = edgelist['to'].value_counts().to_dict()

# Update the length in edgelist
edgelist['length'] = edgelist.apply(lambda row: row['length'] + length_dict.get(row['from'], 0) / from_counts.get(row['from'], 1) 
                                    + length_dict.get(row['to'], 0) / to_counts.get(row['to'], 1), axis=1)
# NB each edge receives a share of the length of its two end nodes.
# The share is the node length divided by the number of times the node appears in the 'from' (origin) or 'to' (destination) column of the edgelist

In [None]:
# One example of a complex intersection can be seen here: 
# Its "length" is the base of the rectangle with the same area and perimeter as it,
# so the approximation is not perfect, but it is better than not doing anything.

dissolved_gdf[dissolved_gdf.ID == 'ID_6477'].explore()

## Create Graphs and assign attributes

In [22]:
# Create a MultiGraph from the edgelist
G = nx.from_pandas_edgelist(edgelist, 'from', 'to', edge_attr=["width", "length", "ID"] , create_using=nx.MultiGraph())
# Store connected components in a list and sort them by size
Gcc = sorted(nx.connected_components(G), key = len, reverse = True)
 
print("Graph:\n"
      "\tNumber of components:",len(Gcc),
      "\n\tlargest:",  len(Gcc[0]), "nodes.",
      "\n\tsecond largest:" , len(Gcc[1]),"nodes.",
      "\n\tthird largest:" ,len(Gcc[2]), "nodes.",
      "\n\ttotal:", len(G), "nodes."
     )

Graph:
	Number of components: 12 
	largest: 7093 nodes. 
	second largest: 9 nodes. 
	third largest: 5 nodes. 
	total: 7118 nodes.


In [23]:
utils.calc_pos(dissolved_gdf, G);

### Graph plot:

We plot the graph network below. NB isolated nodes have been given a self loop to make them more visible in the plot. The loop has no geographical meaning.

In [None]:
fig, ax = plt.subplots()
utils.plot_nodes(G, save = False, ax = ax)
plt.show()

The large black clusters are all the regions where tangenziali are present.  The number of the nodes in the regions is correct, but all of them seem to be connected by these extremely long and identical edges.

### Removing nodes that are not in Gcc

In [25]:
G = G.subgraph(Gcc[0]).copy() #deep copy of subgraph of only largest connected component

## Graph Attributes

In [26]:
deg, cc, bc = utils.add_node_attributes(G)

In [None]:
utils.hist_attribute(G, 'width', nbins = 20)

## Fixing betweenness problems
Right now, our graph has many edges with 0 betweenness. This is not a calculation error: it is due to the way lengths were calculated and the graph was built.  
We weight our edges by length, so when calculating betweenness we measure shortest paths in real distance, not topological distance. This means there are two ways in which an edge can have 0 betweenness:  
1) it is a multiedge, and another edge between the same two nodes is shorter
2) there is a path through other nodes that is shorter than the direct edge from A to B (A-->C-->B < A-->B)

The first case can be dealt with by collapsing multiedges into single edges, with length averaged over the multiedges and width summed. This corresponds to a large street with more than one carriageway, which can be counted as one wide edge or as several narrower ones.  
The second case can only happen if the edge between A and B is not a straight line. Furthermore, we count the length of intersections in our shortest paths, but in a heuristic way, so A-->C-->B < A-->B can also happen if node C is a particularly long and complex intersection.  
We partially solve this problem in the following section.
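
To illustrate case 2, here is a minimal sketch on a hypothetical three-node graph. It uses networkx's `edge_betweenness_centrality` rather than the `utils` helpers of this notebook, and the node names and lengths are made up.

```python
import networkx as nx

G_toy = nx.Graph()
G_toy.add_edge("A", "B", length=10.0)  # direct but long, e.g. a curved road
G_toy.add_edge("A", "C", length=3.0)
G_toy.add_edge("C", "B", length=3.0)   # the detour A-C-B measures 6 < 10

bc_toy = nx.edge_betweenness_centrality(G_toy, weight="length", normalized=True)
# Edge keys may be stored in either orientation, so normalise them for lookup
bc_toy = {frozenset(edge): value for edge, value in bc_toy.items()}

print(bc_toy[frozenset(("A", "B"))])  # 0.0: no shortest path uses the direct edge
print(bc_toy[frozenset(("A", "C"))])
```

Every shortest path between A and B goes through C, so the direct edge is never used and its betweenness is 0, even though it is a perfectly valid street.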


In [28]:
#collapse multigraph edges
edgelist2 = edgelist.groupby(['from', 'to']).agg({
   'width': 'sum', 
   'length': 'mean', 
   'ID': 'first'}).reset_index()

G2 = nx.from_pandas_edgelist(edgelist2, 'from', 'to', edge_attr=["width", "length", "ID"] , create_using=nx.MultiGraph())
Gcc2 = sorted(nx.connected_components(G2), key = len, reverse = True)
utils.calc_pos(dissolved_gdf, G2);
G2 = G2.subgraph(Gcc2[0]).copy() 

In [29]:
def check_betweenness(G, bc_type = 'bc_edges'):
    """Plot and return the edgelist of edges whose betweenness centrality is 0."""
    edgelist = nx.to_pandas_edgelist(G, edge_key = 'key')

    # keep only the edges with 0 betweenness and build their subgraph
    zero_bc = edgelist[edgelist[bc_type] == 0]
    zero_bc_edges = set(map(tuple, zero_bc[['source','target','key']].values))
    sub_G = G.edge_subgraph(zero_bc_edges).copy()
    print('number of edges with betweenness centrality == 0:', len(zero_bc))
    print('length of streets with 0 betweenness centrality:', zero_bc['length'].sum())
    fig, ax = plt.subplots(figsize=(10, 10))
    utils.plot_edges(sub_G, ax = ax, color = 'blue')
    plt.show()
    return zero_bc

In [30]:
#take all node pairs in the edgelist, calculate the shortest path and its length, and compare it to the length of the direct edge

def compare_shortest_path(G, edgelist):
    """For each edge in `edgelist`, compare its length with the shortest weighted path between its end nodes."""
    columns = ['shortest_path_length', 'direct_edge_length', 'shortest_path', 'direct_edge', 'key']
    records = []
    for _, row in edgelist.iterrows():
        origin, destination = row['source'], row['target']
        # Dijkstra returns the path length and the node sequence in one call
        path_length, path = nx.single_source_dijkstra(G, origin, destination, weight = 'length')
        records.append({
            'shortest_path_length': path_length,
            'direct_edge_length': row['length'],
            'shortest_path': path,
            'direct_edge': (origin, destination),
            'key': row['key'],
        })
    comparison_df = pd.DataFrame(records, columns = columns)
    # how much longer the direct edge is than the shortest path, absolute and relative
    comparison_df['length_diff'] = comparison_df['direct_edge_length'] - comparison_df['shortest_path_length']
    comparison_df['fraction_diff'] = comparison_df['length_diff']/comparison_df['direct_edge_length']
    return comparison_df

In [31]:
def hist_comparison(comparison_df):
    plt.hist(comparison_df['length_diff'], bins=50, color='blue', alpha=0.7, edgecolor='black')
    plt.xlabel('Length Difference')
    plt.ylabel('Frequency')
    plt.title('Histogram of Length Differences')
    plt.show()

In [32]:
def shortest_path_visualizer(comparison_df, row_idx):
   #VISUALIZATION TOOL, BUT DOESN'T ACTUALLY SOLVE ANY PROBLEMS
   #take a single shortest path and direct edge, plot them over a basemap
   
   
   row = comparison_df.iloc[row_idx]
   direct_edge = row['direct_edge']
   #create subgraphs of shortest path and direct edge
   short_G = G.subgraph(row['shortest_path']).copy()
   OD_G = G.subgraph(direct_edge).copy()
   #also get surroundings of shortest path
   #k = within_dist(dissolved_gdf[dissolved_gdf.ID.isin(row['shortest_path'])],row['shortest_path_length']/4 , dissolved_gdf)
   fig, ax = plt.subplots(figsize=(10, 10))
   #k.plot(color = 'grey', alpha = 0.4, ax = ax)
   utils.plot_edges(short_G, color = 'green', ax = ax, alpha = 0.6)
   utils.plot_edges(OD_G, color = 'red', ax = ax, alpha = 0.6)
   plt.show()
   

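For every edge with 0 betweenness, if the difference between the direct edge length and the shortest path length is less than threshold % of the direct edge length, we assign the shortest path length to the direct edge, so that the direct edge is no longer strictly longer than the detour.  
We then recalculate betweenness on the graph and see whether it changes.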
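
The rule can be sketched for a single edge as follows. The function name `adjusted_length` and the example values are hypothetical; the notebook's own `length_adjuster` applies the same rule to a whole dataframe.

```python
def adjusted_length(direct, shortest, threshold_pct):
    """Return the edge length to use for the betweenness calculation.

    If the detour is shorter than the direct edge by less than
    `threshold_pct` percent of the direct length, use the detour length;
    otherwise keep the direct length.
    """
    fraction_diff = (direct - shortest) / direct
    return shortest if fraction_diff < threshold_pct / 100 else direct


# Hypothetical lengths in metres, with a 10% threshold
print(adjusted_length(direct=100.0, shortest=95.0, threshold_pct=10))  # 95.0
print(adjusted_length(direct=100.0, shortest=60.0, threshold_pct=10))  # 100.0
```

In the first case the detour is only 5% shorter, which is within what the heuristic intersection lengths could explain, so the edge is adjusted. In the second case the detour is 40% shorter, so the edge keeps its length and its 0 betweenness.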


In [33]:
def length_adjuster(comparison_df, threshold):
    """Add an 'adjusted_length' column to comparison_df (modified in place).

    If the shortest path is shorter than the direct edge by less than
    `threshold` percent of the direct edge length, the edge is assigned the
    shortest path length; otherwise it keeps its own length.
    """
    use_shortest = comparison_df['fraction_diff'] < threshold/100
    comparison_df['adjusted_length'] = np.where(use_shortest,
                                                comparison_df['shortest_path_length'],
                                                comparison_df['direct_edge_length'])



In [34]:
def graph_length_update(G, comparison_df):
    """Store the lengths used for betweenness in the 'length_for_bc' edge attribute of G."""
    length_dict = {(row['direct_edge'][0], row['direct_edge'][1], row['key']): row['adjusted_length']
                   for _, row in comparison_df.iterrows()}
    # start from the original lengths, then overwrite the adjusted edges
    nx.set_edge_attributes(G, nx.get_edge_attributes(G, 'length'), 'length_for_bc')
    nx.set_edge_attributes(G, length_dict, 'length_for_bc')

# NB: edges are identified by (source, target, key), so G must be a multigraph.
# Update the same graph that comparison_df was computed from (here G2), not the original G.



In [35]:
def add_edge_betweenness_new(G, weight = 'length_for_bc'):
    _, possible_weights = utils.list_attributes(G)
    if weight not in possible_weights:
        print('weight must be one of', possible_weights)
        
    g = ig.Graph.from_networkx(G)
    n = g.vcount()
    temp = list(np.array(g.edge_betweenness(weights = weight))*(2/(n*(n-1)))) #use normalization of networkx formula
    g.es['bc_edges_new'] = temp
    H = g.to_networkx(create_using= nx.MultiGraph)
    # for u, v, attr in H.edges(data=True): #i need to delete this extra attribute created when going to and from igraph
    #         del attr['_igraph_index']
    bc_edges = nx.get_edge_attributes(H, 'bc_edges_new')
    nx.set_edge_attributes(G, bc_edges,'bc_edges_new')
    return bc_edges

In [None]:
utils.add_edge_betweenness(G2);
temp2 = check_betweenness(G2)

In [37]:
comparison_df = compare_shortest_path(G2, temp2)

In [None]:
hist_comparison(comparison_df)

In [None]:
utils.hist_attribute(G2, 'bc_edges', nbins = 20)

In [40]:
length_adjuster(comparison_df, 10) 
graph_length_update(G2, comparison_df)
add_edge_betweenness_new(G2);

Now we check what changed.

In [None]:
utils.hist_attribute(G2, 'bc_edges_new', nbins = 20)

In [None]:
temp = check_betweenness(G2, bc_type = 'bc_edges_new')

In [None]:
# use the edges that still have 0 betweenness after the adjustment (temp), not the old ones (temp2)
comparison_df2 = compare_shortest_path(G2, temp)
hist_comparison(comparison_df2)

In [None]:
utils.plot_edge_attribute(G2, 'bc_edges_new', color_or_width = 'width')

With the new betweenness method, very few streets have 0 betweenness, and we start to capture some of the structural characteristics of Milan just by looking at betweenness (like the connection to Cesano Boscone and the ring-like structure).

## Saving preprocessed data
The base network is now complete.  
All that's left to do is:
- save the preprocessed geodataframe as a .gpkg file
- save the pandas edgelist as a .csv file
- save the graph as a .gml file. The graph is all we really need to go on, but it's good to checkpoint the rest, too.

In [45]:
pwd

'/data/home/basilone/sony/Milan_Graphs_Basilone'

In [46]:
import os

# Create paths
base_dir = 'MILANO/dataset_vehicles_preprocessed'
gdf_path = os.path.join(base_dir, 'dissolved_roads.gpkg')
edgelist_path = os.path.join(base_dir, 'edgelist.csv')
graph_path = os.path.join(base_dir, 'base_network.gml')
# Create the directory if it does not exist yet
os.makedirs(base_dir, exist_ok=True)

# Save files
dissolved_gdf.to_file(gdf_path, driver='GPKG')
edgelist.to_csv(edgelist_path, index=False)
nx.write_gml(G2, graph_path)