# World Data League 2021
## Notebook Template

This notebook is one of the mandatory deliverables when you submit your solution (alongside the video pitch). Its structure follows the WDL evaluation criteria and it has dedicated cells where you can add descriptions. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work.

The notebook must:

*   💻 have all the code that you want the jury to evaluate
*   🧱 follow the predefined structure
*   📄 have markdown descriptions where you find necessary
*   👀 be saved with all the output that you want the jury to see
*   🏃‍♂️ be runnable


## External links and resources
Paste here all the links to external resources that are necessary to understand and run your code. Add descriptions to make it clear how to use them during evaluation.

1. Risk profile of streets: https://wdl-data.fra1.digitaloceanspaces.com/pse/m_risk_prfile.zip
2. Excel explaining the categorical features: https://wdl-data.fra1.digitaloceanspaces.com/pse/Dictionary_Risk_Profiles.xlsx
3. OSM map: https://download.bbbike.org/osm/extract/planet_-9.89,38.265_-8.309,39.136.osm.pbf

## Introduction


When it comes to road safety, Portugal has one of the less impressive records in Europe. However, authorities have been taking steps to improve the statistics, with fatalities dropping by 40% since 2010. Despite this, more than 400 people lost their lives in road accidents in 2017 and more than 40,000 were injured.

A **European report** underlined these characteristics:
- In Portugal, relatively many moped riders, lorry and truck occupants died in road accidents compared to the EU average.
- Portugal has a somewhat higher share of male road fatalities than the EU average.
- Fatalities in built-up areas, during daylight and while raining are overrepresented in Portugal.
- The number of speeding tickets relative to population in Portugal is much lower than the EU average.

Furthermore, our analysis showed that there are three environments where the pavement properties significantly, yet distinctly, influence the occurrence of accidents:

1. Rural environment with a heavy presence of urban characteristics
2. Environment characterized by a considerable predominance of intersections in a rural environment
3. Environment with curved segments, high longitudinal gradients and **average speed higher than the tolerable speed**




**HYPOTHESIS**:


## Development
Start coding here! 👩‍💻

Don't hesitate to create markdown cells to include descriptions of your work where you see fit, as well as commenting your code.

We know that you know exactly where to start when it comes to crunching data and building models, but don't forget that WDL is all about social impact...so take that into consideration as well.

### IMPORTING PACKAGES

In [3]:
# GENERAL
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd

# LOADING DATA
import requests
import os
import shutil
from io import BytesIO
import osmium
import fiona
import json


# GEOSPATIAL DATA
from shapely.geometry import Point, LineString, MultiPoint
from shapely.geometry import shape 


# PLOTTING DATA
from folium import Map, CircleMarker, Vega, Popup, Marker, PolyLine, Icon, Choropleth, LayerControl
from folium.plugins import MarkerCluster, HeatMap, BeautifyIcon
from folium.features import ColorLine, GeoJsonPopup, GeoJsonTooltip
from folium.map import FeatureGroup
import shapely
import matplotlib
from ipywidgets import interact
import seaborn as sns

# STATS
import math
import stats

# ML
import scipy
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

### Extracting data

We decided to start by cleaning and analysing the dataset provided by the ***WDL*** team: a shapefile containing 34,678 different road segments. Each of these road segments is characterized by information on traffic intensity, velocity and the environment in which it lies.

In [4]:
## testing remote opening of files
# constructor of google download links
dl_construct = 'https://drive.google.com/uc?export=download&id='
# id from share link google drive
file_id = '1m2BpnJ-NXqlqFW8gYnC_fEI2PLTrXz1r'
geo_df = gpd.read_file(f'{dl_construct}{file_id}')
geo_df.shape

(34678, 10)

In [5]:
def first_df(geo):
    df = pd.DataFrame(geo).drop(columns='Link_ID')
    df_ren = df.rename(columns={
                        'Daily_Aver':'Daily_Average_Traffic_Intensity',
                        'Average_Ve':'Average_Velocity_of_Vehicle_Traffic',
                        'Median_of_':'Median_of_velocity_of_Vehicle_Traffic',
                        'First_Quar': 'FirstQuartil_of_velocity_of_Vehicle_Traffic',
                        'Third_Quar': 'ThirdQuartil_of_velocity_of_Vehicle_Traffic'
                    })
    return df_ren
    
df = first_df(geo_df) 

To better understand our data and to avoid errors during our analysis, we first inspect it with general statistics.
To be able to trust our analysis, we have to clean the dataset first.

### Remove outliers: 

In [6]:
df.describe()

| | linkid | Daily_Average_Traffic_Intensity | Average_Velocity_of_Vehicle_Traffic | Median_of_velocity_of_Vehicle_Traffic | FirstQuartil_of_velocity_of_Vehicle_Traffic | ThirdQuartil_of_velocity_of_Vehicle_Traffic | Func_Class | Speed_Cat |
|---|---|---|---|---|---|---|---|---|
| count | 34678.0 | 34678.0 | 34678.0 | 34678.0 | 34678.0 | 34678.0 | 34678.0 | 34678.0 |
| mean | 895820600.0 | 3340.417942 | 56.816834 | 56.463409 | 43.822041 | 68.091844 | 2.684613 | 4.904781 |
| std | 235591000.0 | 2725.873982 | 51.98367 | 26.240876 | 24.442204 | 30.985191 | 0.538658 | 1.520568 |
| min | 80216820.0 | 14.435864 | -401.703724 | 1.0 | -392.5 | 1.0 | 1.0 | 2.0 |
| 25% | 736483200.0 | 1903.398108 | 38.315321 | 38.25 | 26.0 | 48.0 | 2.0 | 4.0 |
| 50% | 906737700.0 | 2644.529317 | 49.966126 | 50.0 | 38.875 | 60.333333 | 3.0 | 6.0 |
| 75% | 1154997000.0 | 3897.886608 | 69.511585 | 71.0 | 56.0 | 85.0 | 3.0 | 6.0 |
| max | 1223731000.0 | 49309.806935 | 6357.022296 | 1326.25 | 143.0 | 2605.0 | 3.0 | 7.0 |


- The velocity columns report values in km/h, yet many of the minimum and maximum values shown by ```describe``` do not make sense (negative speeds, or speeds of several thousand km/h).
- We need to treat these values as **outliers**.

In [7]:
#removing outliers:
def rm_out(df):
    for i in df.columns.drop(['linkid', 'Daily_Average_Traffic_Intensity','geometry']):
        lb = 0
        ub = 180
#         print(lb, ub)
        df[i] = df[i].mask(df[i] < lb) 
        df[i] = df[i].mask(df[i] > ub) 
    return df

data = rm_out(df)
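As a quick illustration of what the masking in ```rm_out``` does, the sketch below applies the same 0 to 180 km/h bounds to a small made-up series (the values are hypothetical, not taken from the data set). Out-of-range entries become ```NaN``` rather than being dropped, so the number of rows stays the same while the number of non-missing values shrinks.

```python
import pandas as pd

# Hypothetical velocities in km/h; two of them mimic implausible extremes
speeds = pd.Series([38.3, -401.7, 69.5, 6357.0, 179.7])

lower_bound, upper_bound = 0, 180

# mask() replaces values where the condition is True with NaN and keeps the rest
cleaned = speeds.mask((speeds < lower_bound) | (speeds > upper_bound))

print(cleaned.isna().sum())  # values flagged as outliers
print(cleaned.count())       # non-missing values that remain
print(len(cleaned))          # row count is unchanged
```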

**If we run ```describe``` again, we see that the velocity columns now contain only feasible values.**

In [8]:
data.describe()

| | linkid | Daily_Average_Traffic_Intensity | Average_Velocity_of_Vehicle_Traffic | Median_of_velocity_of_Vehicle_Traffic | FirstQuartil_of_velocity_of_Vehicle_Traffic | ThirdQuartil_of_velocity_of_Vehicle_Traffic | Func_Class | Speed_Cat |
|---|---|---|---|---|---|---|---|---|
| count | 34678.0 | 34678.0 | 34633.0 | 34675.0 | 34677.0 | 34674.0 | 34678.0 | 34678.0 |
| mean | 895820600.0 | 3340.417942 | 56.112805 | 56.402803 | 43.834624 | 67.959767 | 2.684613 | 4.904781 |
| std | 235591000.0 | 2725.873982 | 24.346245 | 25.054113 | 24.329987 | 26.706718 | 0.538658 | 1.520568 |
| min | 80216820.0 | 14.435864 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| 25% | 736483200.0 | 1903.398108 | 38.317003 | 38.25 | 26.0 | 48.0 | 2.0 | 4.0 |
| 50% | 906737700.0 | 2644.529317 | 49.961538 | 50.0 | 38.875 | 60.333333 | 3.0 | 6.0 |
| 75% | 1154997000.0 | 3897.886608 | 69.447459 | 71.0 | 56.0 | 85.0 | 3.0 | 6.0 |
| max | 1223731000.0 | 49309.806935 | 179.691892 | 143.25 | 143.0 | 164.0 | 3.0 | 7.0 |


### Handling duplicates:

In [9]:
len(data) == len(data.drop_duplicates())

True

There are ***no duplicates*** in our dataset: NO ACTION NEEDED
    

### Handling missing values:

In [10]:
data.isnull().sum().sort_values(ascending=False) , f'Total of data points : {data.shape[0]}'

(Average_Velocity_of_Vehicle_Traffic            45
 ThirdQuartil_of_velocity_of_Vehicle_Traffic     4
 Median_of_velocity_of_Vehicle_Traffic           3
 FirstQuartil_of_velocity_of_Vehicle_Traffic     1
 linkid                                          0
 Daily_Average_Traffic_Intensity                 0
 Func_Class                                      0
 Speed_Cat                                       0
 geometry                                        0
 dtype: int64,
 'Total of data points : 34678')

As we can see, the highest number of missing values in any column is 45, out of a total of 34,678 rows.
- The missing values in Average Velocity and the ones in Speed Difference Mean are the same (one column is created from the other).
- The rows with missing values in the other columns can simply be dropped.

**As we are handling data regarding AVERAGE velocity, we can substitute the missing values with the mean of the corresponding column.**

In [11]:
def handling_missing(data):
    imputer = SimpleImputer()
    data['Average_Velocity_of_Vehicle_Traffic']=imputer.fit_transform(data[['Average_Velocity_of_Vehicle_Traffic']])
    return data.dropna()
data = handling_missing(data)

In [12]:
data.isnull().sum()

linkid                                         0
Daily_Average_Traffic_Intensity                0
Average_Velocity_of_Vehicle_Traffic            0
Median_of_velocity_of_Vehicle_Traffic          0
FirstQuartil_of_velocity_of_Vehicle_Traffic    0
ThirdQuartil_of_velocity_of_Vehicle_Traffic    0
Func_Class                                     0
Speed_Cat                                      0
geometry                                       0
dtype: int64

### Feature creation:

Looking at our data, we now need to find a target to use in our model in the next step.
The most common causes of accidents are:
- Over Speeding.
- Drunken Driving.
- Distractions to Driver.
- Red Light Jumping.
- Avoiding Safety Gears like Seat belts and Helmets.
- Non-adherence to lane driving and overtaking in a wrong manner.

The leading cause is always **over speeding**, which can also be connected with any of the other causes above.
For this reason we decided to investigate the information regarding velocity and use it as our target:

- Speed_Cat (described in the excel below)
- Average Velocity of Vehicle Traffic 
- Median of velocity of Vehicle Traffic

From the information contained in the Excel file, we will create a dictionary that maps each type of street to the maximum velocity allowed on it.

In [13]:
dl_construct = 'https://drive.google.com/uc?export=download&id='
# id from share link google drive
file_id = '1m2BpnJ-NXqlqFW8gYnC_fEI2PLTrXz1r'
request = requests.get(f'{dl_construct}{file_id}').content
memory = BytesIO(request)



In [14]:
speed_explanation = pd.read_excel(memory, sheet_name='SpeedCat')
speed_explanation

ValueError: File is not a recognized excel file
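The ```ValueError``` above suggests that the downloaded bytes are not an Excel workbook; note that the ```file_id``` in this cell is the same one used earlier for the shapefile. An ```.xlsx``` file is a ZIP-based Office Open XML package, so a cheap check before calling ```pd.read_excel``` can make this kind of failure explicit. The sketch below is only an illustration: ```looks_like_xlsx``` is a hypothetical helper and both payloads are made up.

```python
import zipfile
from io import BytesIO


def looks_like_xlsx(content: bytes) -> bool:
    """Hypothetical helper: True if the bytes look like an Office Open XML package."""
    buffer = BytesIO(content)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        # Office Open XML packages carry this part at the archive root
        return "[Content_Types].xml" in archive.namelist()


# Made-up payloads: an HTML page as a stand-in for a non-Excel response,
# and a minimal ZIP archive containing the expected part
html_payload = b"<!DOCTYPE html><html><body>Not a workbook</body></html>"

zip_buffer = BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
zip_payload = zip_buffer.getvalue()

print(looks_like_xlsx(html_payload))
print(looks_like_xlsx(zip_payload))
```

The check only inspects the container format; it does not guarantee that the workbook has a ```SpeedCat``` sheet.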

From the table above we can create a dictionary:
1. Count the values per category
2. Translate each speed range into an actual number

In [None]:
data.Speed_Cat.value_counts()

**No mapping is needed for labels 1 and 8**

In [None]:
max_speed_dict = {2:130,3:100,4:90,5:70,6:50,7:30}
def target_creation(data):
    data['Max_speed'] = data['Speed_Cat'].map(max_speed_dict)
    data['Speed_Diff_Mean'] = data['Max_speed'] - data['Average_Velocity_of_Vehicle_Traffic']
    data['Speed_Diff_Median'] = data['Max_speed'] - data['Median_of_velocity_of_Vehicle_Traffic']
    return data
data = target_creation(data)

In [None]:
cat_list = {x:data[data['Max_speed']==x] for x in max_speed_dict.values()}
def speed_dist(cat_list):
    fig, axs = plt.subplots(3, 2, figsize=(15, 15))
    fig.suptitle('Categorical Distributions', size=20)
#     fig.tight_layout()
    for c, i in enumerate(cat_list.items()):
        speedlim = i[0]
        plt.subplot(3, 2, c+1)
#         plt.set_xlabel('Average Speed')
        ax = i[1].Average_Velocity_of_Vehicle_Traffic.hist(bins=30)
        ylim = i[1].Average_Velocity_of_Vehicle_Traffic.value_counts(bins=30).max()
        ax.set_title(f'Speed Category {i[0]} kph', size=13)
        ax.set_xlabel('Average Speed')
        ax.axvline(speedlim, color='r', linestyle='--')
#         print(f'done {speedlim}, {ylim/2}, {speedlim}')
        ax.text(x=speedlim+7, y=float(ylim-ylim/4), s=f'Speed Limit: {speedlim}')
        
speed_dist(cat_list)

*Note*<br>
We can see that most over speeding takes place on roads with lower speed limits, such as *50 kph and 30 kph.*

**Over speeding behavior can be extracted by the deltas between the road's speed category and its actual average speed observations**

In [None]:
data[['Speed_Cat','Max_speed', 'Speed_Diff_Mean','Speed_Diff_Median']].head(10)

We believe that speed is the main factor that decreases the safety of a street.
**Our first target will be the difference between the max speed and the mean velocity**
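To make the sign convention of this target concrete: ```Speed_Diff_Mean``` is computed as ```Max_speed``` minus the average velocity, so a negative value means the average velocity exceeds the limit. The sketch below reuses the ```max_speed_dict``` mapping from above on three hypothetical segments (the velocities are made up).

```python
# Speed category -> speed limit in km/h, as in max_speed_dict above
max_speed_dict = {2: 130, 3: 100, 4: 90, 5: 70, 6: 50, 7: 30}

# Hypothetical segments: (Speed_Cat, average velocity in km/h)
segments = [(6, 62.0), (2, 110.0), (7, 30.0)]

# Same formula as Speed_Diff_Mean: limit minus observed average
results = [(cat, max_speed_dict[cat] - velocity) for cat, velocity in segments]

for speed_cat, speed_diff_mean in results:
    label = "over the limit" if speed_diff_mean < 0 else "within the limit"
    print(speed_cat, speed_diff_mean, label)
```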

### Scaling features:

We are now ready to scale our dataframe so that the numerical features share a common range.

1. We need to separate numerical and categorical columns.
2. We are going to use the Min-Max Scaling method for the numerical ones: it is commonly used with distance-based algorithms such as k-means, which is one of the possible analyses we are taking into consideration.
3. For the categorical ones we'll use the OneHotEncoding method (it creates a separate column for each label in each category).

We could also perform this step all at once, but it is important for us to know which columns belong to each of the classes inside a categorical feature.
**To do so we need to process each categorical feature separately.**

In [None]:
def scaling_numerical(data):
    numerical = data.columns.drop(['geometry','linkid','Speed_Cat', 'Func_Class'])
    scaler = MinMaxScaler()
    data_scaled = data.copy()
    for column in numerical:
        scaler.fit(data_scaled[[column]])
        data_scaled[column]=scaler.transform(data_scaled[[column]]) 
    return data_scaled
data_scaled = scaling_numerical(data)
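For reference, min-max scaling maps each value to ```(x - min) / (max - min)```, so every scaled column ends up in the range 0 to 1. The plain-Python sketch below shows the arithmetic on made-up numbers; it is an illustration of the formula, not the scikit-learn implementation, and the handling of a constant column is an assumption of this sketch.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Constant column: this sketch maps every value to 0.0
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]


# Hypothetical feature values
feature = [10.0, 20.0, 40.0, 50.0]
scaled = min_max_scale(feature)
print(scaled)
```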

1. **BEFORE SCALING**

In [None]:
data.drop(columns=['geometry','linkid','Speed_Cat', 'Func_Class']).head(1)

2. **AFTER SCALING**

In [None]:
data_scaled.drop(columns=['geometry','linkid','Speed_Cat', 'Func_Class']).head(1)

**Working with the categorical features the first thing we need to do is to understand the distribution within the labels**

In [None]:
data.Func_Class.value_counts() , data.Speed_Cat.value_counts()

- Functional Class has just 3 possible labels for a street, which we can understand better by looking at the Excel file

In [None]:
func_explanation = pd.read_excel('wdl_dict/Dictionary_Risk_Profiles.xlsx', sheet_name='Func_Class')
for i,el in enumerate(func_explanation['Description']):
    print(f'Class n.{i+1} : {el} \n')

With this new and deeper understanding of the distribution and the meaning of the categories (*NB: regarding speed_cat we can look back at point **1.1.5 "Feature creation"** to get this information)* we can now progress with our transformations.  

In [None]:
def scaling_categorical(data):
    ohe = OneHotEncoder(sparse = False)
    ohe.fit(data[['Func_Class']])
    func_encoded = ohe.transform(data[['Func_Class']])
    data["func_1"],data["func_2"],data['func_3'] = func_encoded.T
    ohe = OneHotEncoder(sparse = False)
    ohe.fit(data[['Speed_Cat']])
    speed_encoded = ohe.transform(data[['Speed_Cat']])
    data["speed_2"],data["speed_3"],data["speed_4"],\
    data["speed_5"], data["speed_6"], data["speed_7"]= speed_encoded.T
    return data 

In [None]:
data_scaled = scaling_categorical(data_scaled)
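The column assignment in ```scaling_categorical``` relies on the encoder emitting one column per label, ordered by the sorted label values. The toy re-implementation below (plain Python, hypothetical input, not the scikit-learn code) shows how labels map to indicator columns under that assumption.

```python
def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1.0 if value == category else 0.0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical Func_Class labels for four road segments
func_class = [3, 1, 2, 3]
categories, encoded = one_hot(func_class)
columns = [f"func_{category}" for category in categories]

print(columns)
for row in encoded:
    print(row)
```

If a label were absent from the data, the encoder would produce fewer columns and the fixed tuple unpacking into ```func_1```, ```func_2```, ```func_3``` would raise an error, so deriving the column names from the categories is a safer pattern.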

### Preprocessed Dataframe: 

Here we have the first dataframe, the one given by the challenge, completely ready for the model.

In [None]:
data_scaled.head()

### Feature engineering

We assume that ***over speeding*** is the main reason for road hazards. Over speeding behavior can be extracted by the deltas between the road's speed category and its actual average speed observations as processed in column ```Speed_Diff_Mean```. <br>

Over speeding can be influenced, among other things, by the road's environment [Source](https://www.tandfonline.com/doi/abs/10.1080/014416499295420). People choose their speed not only according to speed limits but also according to their assessment of the road's quality and the surrounding environment.<br>

Therefore we chose to gather more information about POIs, amenities and public buildings in the surroundings of the provided road segments. These can be acquired from OSM sources.

### Scaling data set to Lisbon

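We restrict the data set to the Lisbon area by keeping only the road segments that lie within a circular buffer around the city centre.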

In [None]:
# Transforming pandas df to geopandas df
geo_df = gpd.GeoDataFrame(data_scaled)
geo_df.geometry[0].type

In [None]:
## Filtering only lisbon data inside the circle of 38.72526068747401, -9.142352617846093 with buffer '1'
circle_lisbon = Point(-9.142352617846093, 38.72526068747401).buffer(1)
geo_lis = geo_df[geo_df.geometry.within(circle_lisbon)]
# no immediate usage of this pd.DataFrame
df_lis = pd.DataFrame(geo_lis).drop(columns=['geometry', 'linkid'])
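One thing worth keeping in mind about ```buffer(1)```: assuming the geometries are stored as longitude/latitude in degrees (which the ```Point``` coordinates suggest, but which is not stated explicitly), the buffer distance is also in degrees, not kilometres. The sketch below gives a rough conversion at Lisbon's latitude; the constant of about 111.32 km per degree is a spherical approximation chosen for this illustration.

```python
import math

# Rough length of one degree of latitude in km (spherical approximation, assumed)
KM_PER_DEGREE_LAT = 111.32


def degrees_to_km(buffer_deg, latitude_deg):
    """Approximate north-south and east-west extent in km of a buffer given in degrees."""
    north_south = buffer_deg * KM_PER_DEGREE_LAT
    # Meridians converge towards the poles, so a degree of longitude shrinks with cos(latitude)
    east_west = buffer_deg * KM_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg))
    return north_south, east_west


lisbon_lat = 38.72526068747401
ns_km, ew_km = degrees_to_km(1, lisbon_lat)
print(round(ns_km, 1), round(ew_km, 1))
```

Under these assumptions the filter covers a wide region around Lisbon rather than the city alone, and the circle in degrees corresponds to an ellipse on the ground.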

In [None]:
print(f'The new data set has {df_lis.shape[0]} rows as opposed to the original set with {geo_df.shape[0]} rows')

### Loading OSM Maps

In [None]:
%%bash
wget https://download.bbbike.org/osm/extract/planet_-9.89,38.265_-8.309,39.136.osm.pbf \
    --quiet -O map_data/Lisbon.osm.pbf

In [None]:
!ogrinfo map_data/Lisbon.osm.pbf

In [None]:
%%bash
ogr2ogr -f "GPKG" \
     map_data/lisbon_polygons.gpkg \
     map_data/Lisbon.osm.pbf \
    -nlt POLYGONS \
    -nln polygons

In [None]:
#Read data
# about 3 mins
layer_file = "map_data/lisbon_polygons.gpkg"
collection = list(fiona.open(layer_file,'r'))
df1 = pd.DataFrame(collection)

#Check Geometry
def isvalid(geom):
    # shape() raises if the record cannot be converted to a shapely geometry
    try:
        shape(geom)
        return 1
    except Exception:
        return 0

df1['isvalid'] = df1['geometry'].apply(lambda x: isvalid(x))
df1 = df1[df1['isvalid'] == 1]
collection = json.loads(df1.to_json(orient='records'))

#Convert to geodataframe
gdf_lis_poly = gpd.GeoDataFrame.from_features(collection)

In [None]:
gdf_lis_poly

In [None]:
poi_gdf = gdf_lis_poly.copy()

### Loading POIs from pre-processed OSM file

In [None]:
print(f'The data set of POIs in the Lisbon region has {poi_gdf.shape[0]} individual points which can be merged with our data set.' )

In [None]:
poi_gdf.geometry.type.value_counts()

***Note***<br>
For now we will only be focussing on the geometrical points in the OSM data, not on polygons or line strings.

In [None]:
# filtering down to shapely.geometry.Points
gdf_points = poi_gdf[poi_gdf['geometry'].type == 'Point'].reset_index()
gdf_points.columns

**Note**<br>
The points provided are categorized and stored in many columns. We will shrink this information to one column and fill it with all the important information about the point. <br>
Some points do not provide any information. Those ones will be dropped. 

In [None]:
# reducing geo_df columns, only leaving one valid column
def new_desc(geo):
    geo['desc_points'] = None
    # columns to be taken into consideration
    lst_cols = [  'amenity', 
                  'barrier', 
                  'building', 
                  'highway', 
                  'landuse', 
                  'man_made', 
                  'natural', 
                  'office']
    for c, row in geo.iterrows():
#         concat_name = [f'feat_{i}_{row[i]}' for i in lst_cols if row[i] == row[i]]
        concat_name = [f'feat_{i}_{row[i]}' for i in lst_cols if row[i] is not None]
        if len(concat_name) > 0:
            geo.at[c, 'desc_points'] = concat_name[0]
        else: 
            geo.at[c, 'desc_points'] = None
        print(f'done: {c}')
        
    
    geo = geo[['geometry', 'desc_points']]
    # drop empty descriptions
    geo = geo.dropna(subset=['desc_points'])
    geo= geo.reset_index(drop=True)
    
    return geo

In [None]:
# applying cleaning function to geo df
gdf_points_clean = new_desc(gdf_points)
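To illustrate the naming scheme used by ```new_desc```: each point is labelled ```feat_<column>_<value>``` using the first non-empty column in the order of ```lst_cols```, and points without any value are dropped. The sketch below reproduces that logic on plain dictionaries; the rows and the helper ```describe_point``` are hypothetical, and only a subset of the columns is used.

```python
# Subset of the columns considered in new_desc, in the same order
lst_cols = ["amenity", "barrier", "building", "highway"]


def describe_point(row):
    """Hypothetical helper: label a point by its first non-empty column."""
    names = [f"feat_{col}_{row[col]}" for col in lst_cols if row.get(col) is not None]
    return names[0] if names else None


# Made-up points: one tag, two tags, and no tag at all
rows = [
    {"amenity": None, "barrier": "toll_booth", "building": None, "highway": None},
    {"amenity": "cafe", "barrier": None, "building": "yes", "highway": None},
    {"amenity": None, "barrier": None, "building": None, "highway": None},
]

descriptions = [describe_point(row) for row in rows]
kept = [d for d in descriptions if d is not None]
print(descriptions)
print(kept)
```

Note that the second point loses its ```building``` tag: only the first match is kept, so the column order in ```lst_cols``` acts as a priority list.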

In [None]:
gdf_points_clean.head(5)
# only two columns are left => geometry and name of point

### Re-transforming point's names into columns

To prepare the dataset of points for the merge with the general data set we need to convert the unique feature names back into columns. In total we have **96** feature columns.

In [None]:
# encoding all unique values
encoder = OneHotEncoder()
enc_df = encoder.fit_transform(gdf_points_clean[['desc_points']])

In [None]:
# reapplying column names
enc_gdf_points = gpd.GeoDataFrame(enc_df.toarray(), columns=encoder.categories_[0])
enc_gdf_points = enc_gdf_points.join(gdf_points_clean)

**Note**<br>
We need the ```desc_points``` column for later plotting.

In [None]:
enc_gdf_points.max()

### Merging Points with Road segments

In order to merge the points with the provided road segments we need to buffer the LineStrings of the roads, turning them into small Polygons that overlap with the POIs around the road. We then use a spatial join with the ```intersects``` predicate to keep only the points which are in the vicinity of the road segments.

In [None]:
# create a gdf with buffered road segments
geo_lis_buf = geo_lis.copy()
# allowing certain buffer to road segments to "catch" the points. buffer=.0005 seems to be visually adequate.
geo_lis_buf['geometry'] = geo_lis_buf.geometry.buffer(.0005)

In [None]:
# joining both geo dfs
joint_gpd = gpd.sjoin(enc_gdf_points, geo_lis_buf, how="inner", op='intersects')

In [None]:
print(f'We have {joint_gpd.shape[0]} intersecting points with our road segments.')

**Note**<br>
Now, we want to regroup the GDF back to our initial granularity: the road segments with unique ```linkid``` values.

In [None]:
# building the aggregation dictionary for the .groupby method
columns = joint_gpd.columns
agg = {i:'max' for i in columns if 'feat' in i}
agg['geometry'] = lambda x: list(x)
agg['desc_points'] = lambda x: list(x)
# all added features should not be summed per road segment, they will be scaled down to 1 item max.
# all geometries and point descriptions should be listed

**Note**<br>
We **do not want** to aggregate the POIs as summed values, as this would add a high bias to the model. E.g. traffic lights or toll booths can appear more than once on one road segment. The model would misjudge their function and meaning if they weren't scaled down **to one unit** per segment. 

In [None]:
# regrouping by linkid
grouped_gpd = joint_gpd.groupby('linkid').agg(agg)
# renaming the 'geometry' column so that the gdf won't be confused later
grouped_gpd = grouped_gpd.rename(columns={'geometry':'points'})
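The difference between aggregating with ```max``` and with ```sum``` can be seen on a small example. The sketch below uses a made-up join result (the ```linkid``` values and feature columns are hypothetical) and compares the presence flags produced by ```max``` with the counts produced by ```sum```.

```python
import pandas as pd

# Hypothetical join result: one row per (road segment, nearby point)
joined = pd.DataFrame({
    "linkid": [101, 101, 101, 202],
    "feat_highway_traffic_signals": [1.0, 1.0, 0.0, 0.0],
    "feat_barrier_toll_booth": [0.0, 0.0, 1.0, 0.0],
})

feature_cols = [col for col in joined.columns if "feat" in col]

# max: 0/1 flag telling whether the feature occurs at least once on the segment
presence = joined.groupby("linkid")[feature_cols].max()
# sum: how many points of that kind were found on the segment
counts = joined.groupby("linkid")[feature_cols].sum()

print(presence)
print(counts)
```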

**Note**<br>
Only road segments which contained one or more points will be left in the gdf. That's the nature of the inner ```sjoin```.

In [None]:
geo_df_lis = geo_lis.merge(grouped_gpd, left_on='linkid', right_index=True)
geo_df_lis['point_count'] = geo_df_lis['points'].apply(lambda x: len(x))

In [None]:
# have a glance at the merged df
pd.set_option('display.max_columns', None)
geo_df_lis.head(3)

In [None]:
geo_df_lis = geo_df_lis.sort_values(by='linkid')

In [None]:
geo_df_lis[geo_df_lis['linkid']==80264692]

### Testplots for the merged data

In [None]:
plot_amount = 200

- Reorder the coordinates of each point into (lat, lon) so that the folium Marker can read them (```point.coords.xy[1][0]```, ```point.coords.xy[0][0]```)
- Only display ONE POI per name on each segment: duplicate names are overwritten. Example: 'toll booths'

In [None]:
lst_points = []
for c, row in geo_df_lis.iterrows():
    name_ix = {name:c for c, name in enumerate(row.desc_points)}
    name_point = {name:(row.points[c].coords.xy[1][0], row.points[c].coords.xy[0][0]) for name, c in name_ix.items()}
    lst_points.append(name_point)
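The loop above does two things per road segment: it keeps a single point per name and it swaps the coordinate order. The plain-Python sketch below mirrors both steps on made-up data, using tuples in (lon, lat) order as a stand-in for the stored point geometries.

```python
# Hypothetical POIs of one road segment, coordinates as (lon, lat)
desc_points = ["feat_barrier_toll_booth", "feat_highway_crossing", "feat_barrier_toll_booth"]
points_lon_lat = [(-9.1601, 38.7402), (-9.1588, 38.7410), (-9.1579, 38.7415)]

# A dict keeps one entry per name; a later duplicate overwrites the earlier index
name_to_index = {name: index for index, name in enumerate(desc_points)}

# folium markers expect (lat, lon), so the order is swapped here
name_to_lat_lon = {
    name: (points_lon_lat[index][1], points_lon_lat[index][0])
    for name, index in name_to_index.items()
}

print(name_to_index)
print(name_to_lat_lon)
```

Because later entries win, the marker for a duplicated name is placed at the last matching point of the segment.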

In [None]:
map_lis_buf = geo_df_lis.copy()
map_lis_buf['geometry'] = map_lis_buf.geometry.buffer(.0005)

In [None]:
# init map
m = Map([38.74288, -9.16624], zoom_start=13)

# unpacking list of lists containing points, mapping them to their names
## POINTS
marker_cluster = MarkerCluster(name='Points')
# unpacking single points out of list of dictionaries and assigning to marker_cluster
for pairs in lst_points[:plot_amount]:
    popups = [Popup(f'<p><b>Name:</b></p> <p>{a}</p>', max_width=200) for a in pairs.keys()]
    markers = [Marker(coord, popup=popups[c]).add_to(marker_cluster) for c, coord in enumerate(pairs.values())]

print('built points')

## BUFFERED ROADS (enable via layer control)
buf_layer = FeatureGroup(name="roads_buf", show=False)
roads_buf = Choropleth(geo_data=map_lis_buf.head(plot_amount)[['linkid', 'Speed_Diff_Mean', 'geometry']],
                       data=None,
                       highlight=True,
                      ).geojson.add_child(GeoJsonTooltip(['linkid']))


## ROADS
roads = Choropleth(geo_data=geo_df_lis.head(plot_amount).geometry,
                          data=None, 
                          name="roads", 
                          show=True)

print('built roads')

marker_cluster.add_to(m)
roads_buf.add_to(buf_layer)
buf_layer.add_to(m)
roads.add_to(m)

LayerControl().add_to(m)
print('activate buffered roads via layer control on upper right corner')
print(f'this is only a subset of the whole df with {plot_amount} rows.')

m

**Note**<br>
The displayed points all seem to be within the boundaries of the buffered road segments 🛣. The ```sjoin``` with ```intersects``` is working.<br>
NOT ALL points are displayed. That would take up too much memory. 

### Modeling

**Note**<br>
We strip the df down to the relevant features only. Geometries only serve for plotting on a map.<br>
Avg. velocity and road cat can be removed due to feature creation in 1.4.6.

In [None]:
final_data = pd.DataFrame(geo_df_lis.copy().drop(columns=['Average_Velocity_of_Vehicle_Traffic', 
                                                          'Median_of_velocity_of_Vehicle_Traffic',
                                                          'FirstQuartil_of_velocity_of_Vehicle_Traffic',
                                                          'ThirdQuartil_of_velocity_of_Vehicle_Traffic',
                                                          'Speed_Diff_Median', ##### CHECK IT OUT. ONE HAS TO GO
                                                          'Func_Class',
                                                          'Speed_Cat',
                                                          'Max_speed',
                                                          'geometry',
                                                          'points', 
                                                          'desc_points', 
                                                          'point_count']))
print(final_data.columns)

## Conclusions

### Scalability and Impact
Tell us how applicable and scalable your solution is if you were to implement it in a city. Identify possible limitations and measure the potential social impact of your solution.

### Future Work
Now picture the following scenario: imagine you could have access to any type of data that could help you solve this challenge even better. What would that data be and how would it improve your solution? 🚀