# Overview

This notebook uses the Google Maps snap-to-road API to determine which households (hhs) in the polling station dataset are on a road and which are not. I then compare the demographics of the hhs on the road with those of the hhs off the road.

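Steps
1. Import the cleaned polling station data.
2. Use the snap-to-road API to get the nearest point on the road for each hh.
3. Compare hhs on the road with those off the road.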

In [1]:
import pandas as pd 
import geopandas as gpd
from pathlib import Path
import folium
from shapely import Point, Polygon
from pin_drop_sampling2.utils import get_nearest_point_on_road
from tqdm import tqdm
tqdm.pandas()

DB_DIR = Path.home() / 'IDinsight Dropbox' / 'Random walk testing' 
INPUT_DIR = DB_DIR / '01_Raw data' / '05_Voter roll hh validation'

## 1. Import voter roll datasets

In [3]:
# import voter roll hh listing data 
hhs = gpd.read_parquet(INPUT_DIR / 'voter_rolls_clean.parquet')
borders = gpd.read_parquet(INPUT_DIR / 'voter_rolls_clean_borders.parquet')
rooftops = gpd.read_parquet(INPUT_DIR / 'voter_roll_rooftops.parquet')

## 2. Get nearest point on road for each hh

In [5]:
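# look up the nearest point on a road for each household centroid (one lookup per row, so this is slow)
hhs['nearest_point_on_road'] = hhs.progress_apply(lambda x: get_nearest_point_on_road(x.geometry.centroid), axis=1)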

100%|██████████| 2495/2495 [05:08<00:00,  8.08it/s]
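
The implementation of `get_nearest_point_on_road` in `pin_drop_sampling2.utils` is not shown in this notebook. As a rough sketch of what such a lookup could involve, the block below builds a request to the Google Maps Roads API `nearestRoads` endpoint and parses the response using only the standard library. The function names, the choice of `nearestRoads` over `snapToRoads`, and the `api_key` parameter are all assumptions for illustration, not the project's actual code.

```python
import json
import urllib.parse
import urllib.request

# Public endpoint of the Google Maps Roads API "nearest roads" method.
NEAREST_ROADS_URL = "https://roads.googleapis.com/v1/nearestRoads"


def build_request_url(lat, lng, api_key):
    """Build the request URL for a single (lat, lng) point."""
    query = urllib.parse.urlencode({"points": f"{lat},{lng}", "key": api_key})
    return f"{NEAREST_ROADS_URL}?{query}"


def parse_nearest_road(payload):
    """Return (lat, lng) of the first snapped point, or None if no road was found."""
    snapped = payload.get("snappedPoints", [])
    if not snapped:
        return None
    location = snapped[0]["location"]
    return location["latitude"], location["longitude"]


def nearest_point_on_road(lat, lng, api_key):
    """Hypothetical stand-in for get_nearest_point_on_road (one HTTP request per call)."""
    url = build_request_url(lat, lng, api_key)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_nearest_road(json.load(resp))
```

Returning `None` when the response contains no `snappedPoints` is one way a missing value could end up in `nearest_point_on_road`, which section 3 relies on to identify hhs off the road.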


In [7]:
# save the dataset since it takes a while to get the nearest point on the road
hhs.to_parquet(INPUT_DIR / 'voter_rolls_clean_w_road_point.parquet')
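
Because the lookup takes about five minutes, it may help to reload the saved file on later runs rather than repeating the lookup. The helper below is a minimal sketch of that compute-once pattern; `load_or_compute` and its arguments are hypothetical names, not part of this notebook or of `pin_drop_sampling2`.

```python
from pathlib import Path


def load_or_compute(path, compute, load, save):
    """Return the cached result at `path` if it exists; otherwise compute and save it."""
    path = Path(path)
    if path.exists():
        # Cache hit: skip the slow computation entirely.
        return load(path)
    result = compute()
    save(result, path)
    return result
```

In this notebook, `load` could be `gpd.read_parquet`, `save` could be `lambda df, p: df.to_parquet(p)`, and `compute` could be a function that runs the `progress_apply` step from section 2 and returns `hhs`.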

## 3. Compare hhs on road vs those not on road

In [6]:
# hhs with no nearest road point are treated as off the road
share_off_road = hhs['nearest_point_on_road'].isna().sum()/len(hhs)
print(f'{share_off_road:.2%} of households are off the road')



0.20% of households are off the road
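
The comparison of the two groups can be sketched as follows: label each hh by whether a road point was returned, then summarise a demographic variable per group. The toy data, the `hh_size` column, and the `group` labels are made up for illustration; the real dataset's demographic columns are not shown in this section.

```python
import pandas as pd

# Toy data: column names and values are assumptions, not the real dataset.
toy = pd.DataFrame({
    "nearest_point_on_road": ["POINT (0 0)", None, "POINT (1 1)", "POINT (2 2)"],
    "hh_size": [4, 6, 3, 5],
})

# An hh counts as on the road when the lookup returned a point.
on_road = toy["nearest_point_on_road"].notna()
toy["group"] = on_road.map({True: "on_road", False: "off_road"})

# Share off the road: the mean of a boolean mask equals sum / len.
share_off_road = (~on_road).mean()

# Compare a demographic variable across the two groups.
summary = toy.groupby("group")["hh_size"].agg(["count", "mean"])

print(f"{share_off_road:.2%} of households are off the road")
print(summary)
```

With 0.20% of 2,495 hhs off the road (roughly five hhs), the off-road group is very small, so group means for it should be read with caution.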
