# Loading the clean dataset

Import the required libraries.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("cleaned_ds.csv", low_memory=False)
df.head()

Unnamed: 0.1,Unnamed: 0,accident_reference,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,accident_severity,number_of_vehicles,number_of_casualties,...,pedestrian_crossing_physical_facilities,light_conditions,weather_conditions,road_surface_conditions,special_conditions_at_site,carriageway_hazards,urban_or_rural_area,did_police_officer_attend_scene_of_accident,trunk_road_flag,lsoa_of_accident_location
0,0,10000005,519310.0,188730.0,-0.279323,51.584754,Metropolitan Police,Slight,2,1,...,No physical crossing facilities within 50 metres,Darkness - lights unlit,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01000543
1,1,10000006,551920.0,174560.0,0.184928,51.449595,Metropolitan Police,Slight,1,1,...,No physical crossing facilities within 50 metres,Darkness - lights lit,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01000375
2,2,10000008,505930.0,183850.0,-0.473837,51.543563,Metropolitan Police,Slight,1,1,...,No physical crossing facilities within 50 metres,Darkness - lights lit,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01033725
3,3,10000016,527770.0,168930.0,-0.164442,51.404958,Metropolitan Police,Slight,1,1,...,No physical crossing facilities within 50 metres,Daylight,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01003379
4,4,10000018,510740.0,177230.0,-0.40658,51.483139,Metropolitan Police,Slight,2,1,...,No physical crossing facilities within 50 metres,Daylight,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01002583
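The preview shows two columns named Unnamed: 0.1 and Unnamed: 0. Their values mirror the row index, which suggests (an assumption, the source does not confirm it) that they are leftover indices written by earlier saves of the CSV. A minimal sketch of how such columns could be dropped, using a small hypothetical frame called sample in place of the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset, with two leftover index columns
sample = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "accident_reference": ["10000005", "10000006"],
    "accident_severity": ["Slight", "Slight"],
})

# Boolean mask over the column labels that start with "Unnamed"
unnamed = sample.columns.str.startswith("Unnamed")

# Keep only the columns that are not flagged by the mask
sample = sample.loc[:, ~unnamed]
print(list(sample.columns))
```

The same two lines should apply to df unchanged, provided no genuine feature name starts with "Unnamed".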


# Adding a new feature

## City name using the longitude and latitude

We will use reverse geocoding to retrieve an address from each pair of coordinates. Ref: https://towardsdatascience.com/reverse-geocoding-in-python-a915acf29eb6
Many sources on the internet suggest using the Geopy library. However, Geopy relies on external web services to resolve addresses, so it would send one network request for each entry in our dataset, which is a waste of time and resources.
It's better to look for an offline alternative for the reverse geocoding.

We will use a library called reverse_geocoder, as referenced here: https://github.com/thampiman/reverse-geocoder

In [10]:
%pip install reverse_geocoder

Collecting reverse_geocoder
  Downloading reverse_geocoder-1.5.1.tar.gz (2.2 MB)
Using legacy 'setup.py install' for reverse-geocoder, since package 'wheel' is not installed.
Installing collected packages: reverse-geocoder
    Running setup.py install for reverse-geocoder: started
    Running setup.py install for reverse-geocoder: finished with status 'done'
Successfully installed reverse-geocoder-1.5.1
Note: you may need to restart the kernel to use updated packages.



Let's look at a sample output of the query.

In [13]:
import reverse_geocoder as rg

coordinates = [(51.5214588,-0.1729636)]
results = rg.search(coordinates) # default mode = 2
print(results)

[{'lat': '51.51116', 'lon': '-0.18426', 'name': 'Bayswater', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}]


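Each result is a dictionary describing the nearest known place to the query point: name is the place itself, admin1 and admin2 are the administrative regions that contain it, and cc is the country code. Note that lat and lon are the coordinates of the matched place, not of the query. We are mainly interested in the city (county) name, which is admin2, containing "Greater London" in this case.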
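To make the difference between the query point and the matched place concrete, here is a small sketch that reuses the query and the result printed above and measures the distance between them. The haversine_km helper is hypothetical (it is not part of reverse_geocoder), and the Earth radius of 6371 km is the usual spherical approximation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius, spherical approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Query point and result copied from the sample output above
query = (51.5214588, -0.1729636)
match = {'lat': '51.51116', 'lon': '-0.18426', 'name': 'Bayswater',
         'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}

# The library returns lat/lon as strings, so convert them before doing arithmetic
distance_km = haversine_km(query[0], query[1], float(match['lat']), float(match['lon']))
print(f"{match['name']} ({match['admin2']}) is about {distance_km:.1f} km from the query point")
```

A gap of this size is harmless for a county-level feature, but it is a reminder that name is the nearest known place rather than an exact address.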

Let's get the addresses for all rows of the dataset.

In [18]:
coordinates = list(zip(df.latitude, df.longitude))
results = rg.search(coordinates) # default mode = 2
print(results[:10])

[{'lat': '51.6128', 'lon': '-0.27539', 'name': 'Edgware', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.48315', 'lon': '0.17484', 'name': 'Erith', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.5489', 'lon': '-0.48211', 'name': 'Uxbridge', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.40322', 'lon': '-0.16831', 'name': 'Mitcham', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.51579', 'lon': '-0.4234', 'name': 'Hayes', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.38333', 'lon': '-0.1', 'name': 'Croydon', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.51116', 'lon': '-0.18426', 'name': 'Bayswater', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.6', 'lon': '-0.21667', 'name': 'Hendon', 'admin1': 'England', 'admin2': 'Greater London', 'cc': 'GB'}, {'lat': '51.50094', 'lon': '-0.19175', 'name': 'Kensi
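The batch query assumes that every row has both a latitude and a longitude. The source does not say whether the cleaned dataset still contains missing coordinates, so the following is only a defensive sketch: it builds the coordinate list from complete rows and aligns the results back by index. The frame sample and the list fake_admin2 are hypothetical stand-ins for df and for the admin2 values that rg.search would return.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; the second row has no latitude
sample = pd.DataFrame({
    "latitude": [51.584754, None, 51.543563],
    "longitude": [-0.279323, 0.184928, -0.473837],
})

# True for rows where both coordinates are present
has_coords = sample[["latitude", "longitude"]].notna().all(axis=1)

# Only complete rows are turned into (lat, lon) tuples for the geocoder
coordinates = list(zip(sample.loc[has_coords, "latitude"],
                       sample.loc[has_coords, "longitude"]))

# Stand-in for [res["admin2"] for res in rg.search(coordinates)]
fake_admin2 = ["Greater London", "Greater London"]

# Assigning a Series aligns on the index, so skipped rows end up as NaN
sample["city"] = pd.Series(fake_admin2, index=sample.index[has_coords])
print(sample)
```

If has_coords is True for every row of the real dataset, this reduces to the plain zip used above.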

Extract the city names from the "admin2" field of each result.

In [29]:
# Bracket assignment is required to create a new column;
# df.city = ... only works if a "city" column already exists.
df["city"] = [res["admin2"] for res in results]
df["city"].value_counts()

Greater London      24269
Kent                 4986
Surrey               3736
Essex                3190
Hampshire            3143
                    ...  
Blaenau Gwent          70
Shetland Islands       26
Orkney Islands         25
Eilean Siar            25
Isles of Scilly         1
Name: city, Length: 174, dtype: int64
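One detail worth keeping in mind when adding the city column: pandas creates a new column only through bracket (or assign) syntax. Attribute-style assignment to a name that is not yet a column sets a plain Python attribute on the object and emits a UserWarning. The sketch below shows the difference on two small hypothetical frames (attr_df and item_df); the warning is silenced only to keep the output short.

```python
import warnings

import pandas as pd

cities = ["Greater London", "Kent"]

# Attribute assignment on a new name: no column is created
attr_df = pd.DataFrame({"latitude": [51.58, 51.27]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # pandas warns about this pattern
    attr_df.city = cities
print("city" in attr_df.columns)  # False

# Bracket assignment: the column is created as expected
item_df = pd.DataFrame({"latitude": [51.58, 51.27]})
item_df["city"] = cities
print("city" in item_df.columns)  # True
```

Attribute assignment does update the column once it exists, which is why re-running a notebook cell out of order can hide the problem.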

In [28]:
df.head()

Unnamed: 0.1,Unnamed: 0,accident_reference,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,accident_severity,number_of_vehicles,number_of_casualties,...,light_conditions,weather_conditions,road_surface_conditions,special_conditions_at_site,carriageway_hazards,urban_or_rural_area,did_police_officer_attend_scene_of_accident,trunk_road_flag,lsoa_of_accident_location,city
0,0,10000005,519310.0,188730.0,-0.279323,51.584754,Metropolitan Police,Slight,2,1,...,Darkness - lights unlit,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01000543,Greater London
1,1,10000006,551920.0,174560.0,0.184928,51.449595,Metropolitan Police,Slight,1,1,...,Darkness - lights lit,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01000375,Greater London
2,2,10000008,505930.0,183850.0,-0.473837,51.543563,Metropolitan Police,Slight,1,1,...,Darkness - lights lit,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01033725,Greater London
3,3,10000016,527770.0,168930.0,-0.164442,51.404958,Metropolitan Police,Slight,1,1,...,Daylight,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01003379,Greater London
4,4,10000018,510740.0,177230.0,-0.40658,51.483139,Metropolitan Police,Slight,2,1,...,Daylight,Fine no high winds,Dry,,,Urban,Yes,Non-trunk,E01002583,Greater London


# Encode the city name using label encoding (and add to lookup table)
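This step is not implemented in the notebook yet. As a rough sketch of one way to do it, pandas.factorize can assign an integer code to each distinct city and return the distinct values in code order, which is enough to build a lookup table. The frame sample, the column name city_encoded, and the layout of the lookup table (feature, value, code) are all assumptions, since the source does not show the existing lookup table.

```python
import pandas as pd

# Hypothetical stand-in for df with the new city column
sample = pd.DataFrame({
    "city": ["Greater London", "Kent", "Surrey", "Kent", "Greater London"],
})

# sort=True makes the codes follow the alphabetical order of the city names
codes, uniques = pd.factorize(sample["city"], sort=True)
sample["city_encoded"] = codes

# Assumed lookup layout: one row per (feature, original value, code)
lookup = pd.DataFrame({
    "feature": "city",
    "value": uniques,
    "code": range(len(uniques)),
})

print(sample)
print(lookup)
```

The lookup rows can then be appended to an existing table with pd.concat, assuming it uses compatible columns.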

# Ask two questions about the feature and visualize the answers