# Extract POIs
The data come from OpenStreetMap and are loaded into PostgreSQL with osm2pgsql 1.9.1 (x64).

`osm2pgsql -d osm_sweden -U postgres -W -H localhost -P 5433 -S D:\mobi-social-segregation-se\src\MobiSegInsightsSE.etl.1.0\osm2pgsql-1.9.1-x64\osm2pgsql-bin\default.style D:\mobi-social-segregation-se\dbs\geo\sweden-latest.osm.pbf`

The command below extracts POIs from the OSM data for Sweden, using the flex output with the `pois.lua` configuration.
`osm2pgsql -d osm_sweden -U postgres -W -H localhost -P 5433 -S D:\mobi-social-segregation-se\src\MobiSegInsightsSE.etl.1.0\osm2pgsql-1.9.1-x64\osm2pgsql-bin\flex-config\pois.lua -O flex D:\mobi-social-segregation-se\dbs\geo\sweden-latest.osm.pbf`
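The two imports differ only in the style file and the `-O flex` switch. A minimal sketch of assembling the argument list in Python, assuming shortened placeholder paths and a hypothetical helper name `build_osm2pgsql_cmd` (neither is part of the original notebook). The command is only built and printed here, not executed.

```python
import shlex

def build_osm2pgsql_cmd(pbf_path, style_path, database="osm_sweden",
                        user="postgres", host="localhost", port=5433,
                        flex=False):
    """Return the osm2pgsql argument list for a classic or a flex import."""
    cmd = ["osm2pgsql", "-d", database, "-U", user, "-W",
           "-H", host, "-P", str(port), "-S", str(style_path)]
    if flex:
        # The flex output takes a Lua configuration instead of a .style file.
        cmd += ["-O", "flex"]
    cmd.append(str(pbf_path))
    return cmd

# Placeholder paths; the notebook uses absolute paths on drive D:.
cmd = build_osm2pgsql_cmd("sweden-latest.osm.pbf", "flex-config/pois.lua", flex=True)
print(shlex.join(cmd))
```

Because `-W` asks for the password interactively, the list would need a terminal if it were passed to `subprocess.run`.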


In [2]:
%load_ext autoreload
%autoreload 2
%cd D:\mobi-social-segregation-se

D:\mobi-social-segregation-se


In [3]:
# Load libs
import pandas as pd
import os
os.environ['USE_PYGEOS'] = '0'
import openai
import preprocess
import geopandas as gpd
from tqdm.notebook import tqdm
import sqlalchemy
import time
from sklearn.neighbors import KDTree

In [4]:
openai.api_key = preprocess.keys_manager['openai']['key']

In [5]:
# Data location
user = preprocess.keys_manager['database']['user']
password = preprocess.keys_manager['database']['password']
port = preprocess.keys_manager['database']['port']
db_name = preprocess.keys_manager['database']['name']
engine = sqlalchemy.create_engine(f'postgresql://{user}:{password}@localhost:{port}/{db_name}?gssencmode=disable')

In [6]:
# Data location for OSM data of Sweden (Aug 28, 2023)
db_name_osm = preprocess.keys_manager['osmdb']['name']
engine_osm = sqlalchemy.create_engine(f'postgresql://{user}:{password}@localhost:{port}/{db_name_osm}?gssencmode=disable')

## 1. POIs

In [6]:
# Get pois from database
gdf_pois = gpd.GeoDataFrame.from_postgis(sql="""SELECT osm_id, "class", subclass, geom FROM pois;""", con=engine_osm)
gdf_pois.head()

| | osm_id | class | subclass | geom |
|---|---|---|---|---|
| 0 | 3045039500 | amenity | ferry_terminal | POINT (2676667.928 8256984.521) |
| 1 | 9451269159 | amenity | ferry_terminal | POINT (2554819.461 8359838.918) |
| 2 | 7737172429 | amenity | toilets | POINT (2131122.284 8359495.233) |
| 3 | 8968332391 | amenity | toilets | POINT (2131131.234 8359523.443) |
| 4 | 1147753712 | tourism | chalet | POINT (2122190.759 8374453.104) |


In [7]:
df_pois_tp = gdf_pois.groupby(['class', 'subclass']).size().to_frame(name='count').reset_index()

In [8]:
len(df_pois_tp)

1238

In [16]:
df_pois_tp

| | class | subclass | count |
|---|---|---|---|
| 0 | amenity | Select | 1 |
| 1 | amenity | abandoned_fuel | 1 |
| 2 | amenity | ambulance | 1 |
| 3 | amenity | animal_boarding | 24 |
| 4 | amenity | animal_breeding | 13 |
| ... | ... | ... | ... |
| 1233 | tourism | viewpoint | 1997 |
| 1234 | tourism | waterfall | 3 |
| 1235 | tourism | wilderness_hut | 396 |
| 1236 | tourism | yes | 76 |


In [40]:
tags = [x for x in df_pois_tp.subclass.values]
tags[4]

'animal_breeding'

### 1.1 Define broader categories
The categories below were derived by ChatGPT from the full list of tags.

In [20]:
categories_str = "- Healthcare\n- Financial Services\n- Food and Drink\n- Entertainment\n- Education\n- Recreation\n- Transportation\n- Religious Places\n- Emergency Services\n- Artisan Workshops\n- Automotive Services\n- Craft\n- Historic\n- Sports and Activities\n- Outdoor Recreation\n- Leisure\n- Office\n- Groceries and Food\n- Fashion and Accessories\n- Home and Living\n- Health and Beauty\n- Indoor Sports\n- Sports\n- Tourism"
categories = categories_str.split('\n')
len(categories)

24
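Splitting `categories_str` on newlines keeps the leading `- ` of every bullet, so the list entries are not bare category names. A small sketch of stripping the marker, using a shortened copy of the string; the name `categories_clean` is introduced here only for illustration.

```python
# Shortened copy of the bullet list used in the notebook.
categories_str = "- Healthcare\n- Financial Services\n- Food and Drink"

# Splitting on newlines keeps the leading "- " of each bullet.
raw = categories_str.split("\n")

# Drop the bullet marker to get bare category names.
categories_clean = [c[2:] if c.startswith("- ") else c for c in raw]
print(categories_clean)
```

The bare names are convenient later for checking that a model answer is one of the allowed categories.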

### 1.2 Relabel POIs with the new categories based on subclass
GPT-4 assigns each subclass to one of the categories above.

In [49]:
def poi_category(x):
    flag = 0
    while flag != 1:
        try:
            response = openai.ChatCompletion.create(
              model="gpt-4",
              messages=[
                {
                  "role": "system",
                  "content": f"You will be presented with points of interest tags from OpenStreetMap and your job is to provide the most suitable tag from the following list. Choose ONLY from the list of tags provided here:\n\n{categories_str}"
                },
                {
                  "role": "user",
                  "content": x
                }
              ],
              temperature=0,
              max_tokens=1024,
              top_p=1,
              frequency_penalty=0,
              presence_penalty=0
            )
            flag = 1
            cate = response.choices[0].message.content
        except Exception:  # retry on API errors; a bare except would also swallow KeyboardInterrupt
            time.sleep(1)
    return cate
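The loop above retries forever with a fixed one-second pause. One possible alternative is a bounded retry with exponential backoff. The sketch below is generic and does not call the OpenAI API: `call_with_retry`, `max_attempts`, `base_delay`, and the stand-in function `flaky` are all hypothetical names.

```python
import time

def call_with_retry(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on Exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)

# Stand-in for the API call: fails twice, then succeeds.
calls = {"n": 0}

def flaky(tag):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return f"label for {tag}"

delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, "bench", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast and makes the waiting schedule easy to inspect.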

In [50]:
tqdm.pandas()
df_pois_tp.loc[:, 'category'] = df_pois_tp.loc[:, 'subclass'].progress_apply(lambda x: poi_category(x))


In [51]:
df_pois_tp.to_csv('results/pois/tag_category.csv', index=False)
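Two inexpensive safeguards can help before the manual check: label each distinct subclass only once (useful if a value such as `yes` occurs under more than one class), and flag answers that are not in the allowed list. A minimal sketch with a fake labeler in place of GPT-4; `label_unique`, `fake_labeler`, and the sample answers are assumptions made for illustration.

```python
def label_unique(tags, labeler, allowed):
    """Label each distinct tag once and collect answers outside the allowed set."""
    cache = {}
    invalid = {}
    for tag in tags:
        if tag in cache:
            continue  # already labeled, no second call
        answer = labeler(tag).strip()
        if answer.startswith("- "):
            answer = answer[2:]  # the model may echo the bullet marker
        cache[tag] = answer
        if answer not in allowed:
            invalid[tag] = answer
    return [cache[t] for t in tags], invalid

allowed = {"Food and Drink", "Tourism", "Leisure"}
fake_answers = {"restaurant": "- Food and Drink", "chalet": "Tourism", "yes": "Unknown"}
n_calls = []

def fake_labeler(tag):
    n_calls.append(tag)
    return fake_answers[tag]

labels, invalid = label_unique(["restaurant", "yes", "chalet", "yes"], fake_labeler, allowed)
print(labels, invalid, len(n_calls))
```

The `invalid` dictionary gives a short list of subclasses to look at first during the manual review.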

### 1.3 Manually check every re-labeled class-subclass
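This step cleans up some tags and decides which POIs to exclude. The manual decision is recorded in the `Keep` column of `results/pois/tag_category.csv` (1 = keep, 0 = exclude).

Only POIs where people stay for a long duration are kept. For example, emergency points and transportation tolls are removed.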

In [52]:
df_pois_tp = pd.read_csv('results/pois/tag_category.csv')
df_pois_tp.head()

| | class | subclass | count | category | Keep |
|---|---|---|---|---|---|
| 0 | amenity | bench | 19100 | Outdoor Recreation | 1 |
| 1 | amenity | shelter | 13586 | Outdoor Recreation | 1 |
| 2 | amenity | parking | 12141 | Transportation | 0 |
| 3 | amenity | restaurant | 9399 | Food and Drink | 1 |
| 4 | amenity | waste_basket | 9247 | Home and Living | 0 |


In [59]:
df_pois_tp = df_pois_tp.loc[df_pois_tp['Keep'] == 1, :]

In [62]:
df_pois_tp_r = df_pois_tp.groupby(['class', 'category'])['count'].sum().reset_index()
df_pois_tp_r.to_clipboard(index=False)

### 1.4 Merge some classes

The classes tourism and historic are merged into tourism, leisure and sport into leisure, and craft and office into office; categories within these merged classes are not distinguished.

For amenity and shop, the categories are kept distinct.
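The merging rule can be sketched as a small function. This is an assumption read off the text and the tags visible in the outputs (for example `Education (a)`, `Groceries and Food (s)`, `Tourism`), not the content of the spreadsheet itself; `CLASS_GROUP`, `SUFFIX`, and `make_tag` are hypothetical names.

```python
# Merged group for each OSM class, as described in the text above.
CLASS_GROUP = {
    "tourism": "tourism", "historic": "tourism",
    "leisure": "leisure", "sport": "leisure",
    "craft": "office", "office": "office",
    "amenity": "amenity", "shop": "shop",
}
# Suffixes as they appear in the notebook outputs.
SUFFIX = {"amenity": "(a)", "shop": "(s)"}

def make_tag(group, category):
    """Assumed rule: suffixed category for amenity/shop, otherwise the group name."""
    if group in SUFFIX:
        return f"{category} {SUFFIX[group]}"
    return group.capitalize()

print(make_tag(CLASS_GROUP["amenity"], "Education"))
print(make_tag(CLASS_GROUP["historic"], "Leisure"))
```

The spreadsheet is assigned row by row and can deviate from this default, for instance where a craft row with category Automotive Services is placed in the amenity group.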

In [69]:
df_pois_tp_red = pd.read_excel('results/pois/tags_reduced.xlsx', sheet_name='Sheet1')
df_pois_tp_red.head()

| | class | category | category_s | Tag | count |
|---|---|---|---|---|---|
| 0 | craft | Automotive Services | amenity | Automotive Services (a) | 9 |
| 1 | sport | Automotive Services | amenity | Automotive Services (a) | 10 |
| 2 | amenity | Education | amenity | Education (a) | 4166 |
| 3 | leisure | Education | amenity | Education (a) | 7 |
| 4 | office | Education | amenity | Education (a) | 115 |


### 1.5 Add final tag to POI data

In [74]:
gdf_pois = pd.merge(gdf_pois, df_pois_tp[['class', 'subclass', 'category']], how='inner')
gdf_pois = pd.merge(gdf_pois, df_pois_tp_red[['class', 'category', 'Tag']], how='inner')
gdf_pois.head()

| | osm_id | class | subclass | geom | category | Tag |
|---|---|---|---|---|---|---|
| 0 | 1147753712 | tourism | chalet | POINT (2122190.759 8374453.104) | Leisure | Tourism |
| 1 | 1147753708 | tourism | chalet | POINT (2122450.167 8374415.363) | Leisure | Tourism |
| 2 | 1147753710 | tourism | chalet | POINT (2122528.139 8374307.503) | Leisure | Tourism |
| 3 | 1030180338 | tourism | chalet | POINT (2121644.423 8374204.607) | Leisure | Tourism |
| 4 | 10796713818 | tourism | chalet | POINT (2127318.931 8366983.437) | Leisure | Tourism |


In [76]:
gdf_pois.to_crs(4326).to_postgis("pois", schema="built_env", con=engine)

## 2. Find nearest POIs
Search radius = 300 m

In [12]:
gdf_pois = gpd.GeoDataFrame.from_postgis(sql="""SELECT osm_id, "Tag", geom FROM built_env.pois;""", con=engine)
gdf_pois = gdf_pois.to_crs(3006)
gdf_pois.loc[:, 'y'] = gdf_pois.geom.y
gdf_pois.loc[:, 'x'] = gdf_pois.geom.x
gdf_pois.head()

| | osm_id | Tag | geom | y | x |
|---|---|---|---|---|---|
| 0 | 1147753712 | Tourism | POINT (727361.542 6645721.136) | 6645721.0 | 727361.542224 |
| 1 | 1147753708 | Tourism | POINT (727492.967 6645710.224) | 6645710.0 | 727492.967193 |
| 2 | 1147753710 | Tourism | POINT (727535.447 6645658.562) | 6645659.0 | 727535.446635 |
| 3 | 1030180338 | Tourism | POINT (727094.846 6645579.711) | 6645580.0 | 727094.845826 |
| 4 | 10796713818 | Tourism | POINT (730169.443 6642133.974) | 6642134.0 | 730169.442766 |
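The POIs are reprojected to EPSG:3006 (SWEREF99 TM) so that coordinates are in metres and a 300 m radius is meaningful. A rough illustration of why degrees would not work, using a spherical Earth approximation; the latitude 59.6 and the helper `metres_per_degree` are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation

def metres_per_degree(lat_deg):
    """Approximate ground length of one degree of latitude and of longitude."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    per_deg_lng = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lng

lat_m, lng_m = metres_per_degree(59.6)  # an illustrative latitude in Sweden
# The same 300 m corresponds to different spans in degrees along each axis.
print(300 / lat_m, 300 / lng_m)
```

At this latitude a degree of longitude covers roughly half the ground distance of a degree of latitude, so a Euclidean radius in degrees would be stretched east to west.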


### 2.1 Load stops

In [63]:
df_stops = pd.read_sql(sql=f"""SELECT uid, lat, lng, wt_total, time_span, deso
                               FROM segregation.mobi_seg_deso_raw
                               WHERE weekday=1 AND holiday=0
                               LIMIT 100000;""",
                       con=engine)
gdf_stops = preprocess.df2gdf_point(df_stops, 'lng', 'lat', crs=4326, drop=False)

Add home label to the stops

In [64]:
df_home = pd.read_sql(sql=f"""SELECT uid, lat, lng
                               FROM home_p;""",
                       con=engine)
df_home.loc[:, 'home'] = 1
gdf_stops = pd.merge(gdf_stops, df_home, on=['uid', 'lat', 'lng'], how='left')
gdf_stops.loc[:, 'home'] = gdf_stops['home'].fillna(0)  # fill only the home flag, not every column
gdf_stops.head()

| | uid | lat | lng | wt_total | time_span | deso | geometry | home |
|---|---|---|---|---|---|---|---|---|
| 0 | bd2788bc-9d0d-41fe-9634-a2415a2e2dda | 59.633734 | 17.852059 | 1414.426069 | {1,8,39,48} | 0191C1130 | POINT (17.85206 59.63373) | 1.0 |
| 1 | bd27dca3-162e-46d4-9fb1-321b4f5669f9 | 55.956591 | 13.552623 | 0.666484 | {19,25} | 1267C1060 | POINT (13.55262 55.95659) | 0.0 |
| 2 | bd27dca3-162e-46d4-9fb1-321b4f5669f9 | 55.960339 | 13.537677 | 0.178617 | {1,1,48,48} | 1267C1060 | POINT (13.53768 55.96034) | 0.0 |
| 3 | bd27df66-987e-459d-9784-e4aa8b1e572e | 57.69426 | 11.962191 | 220.768197 | {37,42} | 1480C2020 | POINT (11.96219 57.69426) | 0.0 |
| 4 | bd281654-6af5-4644-adac-6e550bceef16 | 57.270991 | 16.456686 | 584.566352 | {1,4,46,48} | 0882C1110 | POINT (16.45669 57.27099) | 1.0 |


In [65]:
gdf_stops.groupby('home').size()

home
0.0    59872
1.0    40128
dtype: int64

### 2.2 Create a KD tree of all POIs and find the nearest POI within 300 m radius

In [66]:
gdf_stops = gdf_stops.to_crs(3006)
gdf_stops.loc[:, 'y'] = gdf_stops.geometry.y
gdf_stops.loc[:, 'x'] = gdf_stops.geometry.x

In [67]:
tree = KDTree(gdf_pois[["y", "x"]], metric="euclidean")

In [68]:
ind, dist = tree.query_radius(gdf_stops[["y", "x"]].to_records(index=False).tolist(),
                              r=300, return_distance=True, count_only=False, sort_results=True)

In [69]:
gdf_stops.loc[:, 'poi_num'] = [len(x) for x in ind]
gdf_stops.loc[gdf_stops.poi_num > 0, 'osm_id'] = [gdf_pois['osm_id'].iloc[x[0]] for x in ind if len(x) > 0]  # the tree returns row positions, so index by position
gdf_stops.loc[gdf_stops.poi_num > 0, 'dist'] = [x[0] for x in dist if len(x) > 0]
gdf_stops = pd.merge(gdf_stops, gdf_pois[['osm_id', 'Tag']], on='osm_id', how='left')
gdf_stops.head()

| | uid | lat | lng | wt_total | time_span | deso | geometry | home | y | x | poi_num | osm_id | dist | Tag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bd2788bc-9d0d-41fe-9634-a2415a2e2dda | 59.633734 | 17.852059 | 1414.426069 | {1,8,39,48} | 0191C1130 | POINT (660803.908 6614076.370) | 1.0 | 6614076.0 | 660803.907837 | 1 | 10894220000.0 | 82.119888 | Outdoor Recreation (a) |
| 1 | bd27dca3-162e-46d4-9fb1-321b4f5669f9 | 55.956591 | 13.552623 | 0.666484 | {19,25} | 1267C1060 | POINT (409632.693 6202194.145) | 0.0 | 6202194.0 | 409632.692826 | 0 | | | |
| 2 | bd27dca3-162e-46d4-9fb1-321b4f5669f9 | 55.960339 | 13.537677 | 0.178617 | {1,1,48,48} | 1267C1060 | POINT (408708.421 6202630.865) | 0.0 | 6202631.0 | 408708.421217 | 3 | 4084671000.0 | 206.109678 | Tourism |
| 3 | bd27df66-987e-459d-9784-e4aa8b1e572e | 57.69426 | 11.962191 | 220.768197 | {37,42} | 1480C2020 | POINT (318945.944 6398730.265) | 0.0 | 6398730.0 | 318945.943512 | 91 | 10606160000.0 | 39.262814 | Office |
| 4 | bd281654-6af5-4644-adac-6e550bceef16 | 57.270991 | 16.456686 | 584.566352 | {1,4,46,48} | 0882C1110 | POINT (587842.456 6348491.535) | 1.0 | 6348492.0 | 587842.456407 | 1 | 6738977000.0 | 102.336302 | Groceries and Food (s) |
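The three derived columns (`poi_num`, `osm_id`, `dist`) can be spot-checked on a handful of stops with a plain brute-force search. The sketch below is a reference implementation in pure Python, not the KD-tree used above, and it assumes projected coordinates in metres; `nearest_within` and the toy points are hypothetical.

```python
import math

def nearest_within(stops, pois, radius):
    """For each stop, return (count, nearest_id, distance) over POIs within radius."""
    results = []
    for sx, sy in stops:
        hits = []
        for poi_id, px, py in pois:
            d = math.hypot(px - sx, py - sy)
            if d <= radius:
                hits.append((d, poi_id))
        if hits:
            d, poi_id = min(hits)  # smallest distance first
            results.append((len(hits), poi_id, d))
        else:
            results.append((0, None, None))  # no POI inside the radius
    return results

# Toy data: (id, x, y) for POIs and (x, y) for stops, all in metres.
pois = [("a", 0.0, 0.0), ("b", 100.0, 0.0), ("c", 1000.0, 1000.0)]
stops = [(90.0, 0.0), (5000.0, 5000.0)]
out = nearest_within(stops, pois, radius=300)
print(out)
```

The brute-force version scales with the number of stops times the number of POIs, so it is only suitable for small samples; the KD-tree remains the practical choice for the full data.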


In [71]:
print("Share of non-home stops with a nearby POI:")
len(gdf_stops.loc[(gdf_stops.home == 0) & (~gdf_stops.Tag.isna()), :]) / \
len(gdf_stops.loc[gdf_stops.home == 0, :])

Share of non-home stops with a nearby POI:


0.757248797434527