# Extract POIs for the random move model
The data are from OpenStreetMap and were loaded into PostgreSQL using osm2pgsql 1.9.1 (x64).

`osm2pgsql -d osm_sweden -U postgres -W -H localhost -P 5433 -S D:\mobi-social-segregation-se\src\MobiSegInsightsSE.etl.1.0\osm2pgsql-1.9.1-x64\osm2pgsql-bin\default.style D:\mobi-social-segregation-se\dbs\geo\sweden-latest.osm.pbf`

The following command extracts POIs from the OSM data for Sweden.
`osm2pgsql -d osm_sweden -U postgres -W -H localhost -P 5433 -S D:\mobi-social-segregation-se\src\MobiSegInsightsSE.etl.1.0\osm2pgsql-1.9.1-x64\osm2pgsql-bin\flex-config\pois.lua -O flex D:\mobi-social-segregation-se\dbs\geo\sweden-latest.osm.pbf`


In [1]:
%load_ext autoreload
%autoreload 2
%cd D:\mobi-social-segregation-se

D:\mobi-social-segregation-se


In [2]:
# Load libs
import pandas as pd
import os
os.environ['USE_PYGEOS'] = '0'
import openai
import preprocess
import geopandas as gpd
from tqdm.notebook import tqdm
import sqlalchemy
import time
import numpy as np
from sklearn.neighbors import KDTree

In [3]:
openai.api_key = preprocess.keys_manager['openai']['key']

In [4]:
# Data location
user = preprocess.keys_manager['database']['user']
password = preprocess.keys_manager['database']['password']
port = preprocess.keys_manager['database']['port']
db_name = preprocess.keys_manager['database']['name']
engine = sqlalchemy.create_engine(f'postgresql://{user}:{password}@localhost:{port}/{db_name}?gssencmode=disable')

In [5]:
# Data location for OSM data of Sweden (Aug 28, 2023)
db_name_osm = preprocess.keys_manager['osmdb']['name']
engine_osm = sqlalchemy.create_engine(f'postgresql://{user}:{password}@localhost:{port}/{db_name_osm}?gssencmode=disable')

## 1. POIs

In [6]:
# Get pois from database
gdf_pois = gpd.GeoDataFrame.from_postgis(sql="""SELECT osm_id, "class", subclass, geom FROM pois;""", con=engine_osm)
gdf_pois.head()

Unnamed: 0,osm_id,class,subclass,geom
0,3045039500,amenity,ferry_terminal,POINT (2676667.928 8256984.521)
1,9451269159,amenity,ferry_terminal,POINT (2554819.461 8359838.918)
2,7737172429,amenity,toilets,POINT (2131122.284 8359495.233)
3,8968332391,amenity,toilets,POINT (2131131.234 8359523.443)
4,1147753712,tourism,chalet,POINT (2122190.759 8374453.104)


In [7]:
df_pois_tp = gdf_pois.groupby(['class', 'subclass']).size().to_frame(name='count').reset_index()

In [8]:
len(df_pois_tp)

1238

In [16]:
df_pois_tp

Unnamed: 0,class,subclass,count
0,amenity,Select,1
1,amenity,abandoned_fuel,1
2,amenity,ambulance,1
3,amenity,animal_boarding,24
4,amenity,animal_breeding,13
...,...,...,...
1233,tourism,viewpoint,1997
1234,tourism,waterfall,3
1235,tourism,wilderness_hut,396
1236,tourism,yes,76


In [40]:
tags = [x for x in df_pois_tp.subclass.values]
tags[4]

'animal_breeding'

### 1.1 Define broader categories
The broader categories below were proposed by ChatGPT after it was shown all the tags.

In [20]:
categories_str = "- Healthcare\n- Financial Services\n- Food and Drink\n- Entertainment\n- Education\n- Recreation\n- Transportation\n- Religious Places\n- Emergency Services\n- Artisan Workshops\n- Automotive Services\n- Craft\n- Historic\n- Sports and Activities\n- Outdoor Recreation\n- Leisure\n- Office\n- Groceries and Food\n- Fashion and Accessories\n- Home and Living\n- Health and Beauty\n- Indoor Sports\n- Sports\n- Tourism"
categories = categories_str.split('\n')
len(categories)

24

### 1.2 Relabel POIs with the new categories based on subclass
Each subclass is assigned to one of the categories above by GPT-4.

In [49]:
def poi_category(x):
    flag = 0
    while flag != 1:
        try:
            response = openai.ChatCompletion.create(
              model="gpt-4",
              messages=[
                {
                  "role": "system",
                  "content": f"You will be presented with points of interest tags from OpenStreetMap and your job is to provide the most suitable tag from the following list. Choose ONLY from the list of tags provided here:\n\n{categories_str}"
                },
                {
                  "role": "user",
                  "content": x
                }
              ],
              temperature=0,
              max_tokens=1024,
              top_p=1,
              frequency_penalty=0,
              presence_penalty=0
            )
            flag = 1
            cate = response.choices[0].message.content
        except Exception:
            # Wait briefly and retry; note that this loop retries indefinitely on persistent errors
            time.sleep(1)
    return cate
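
The retry loop above repeats until the request succeeds. As a minimal sketch of a bounded alternative, the helper below retries a callable a fixed number of times with exponential backoff. The name `call_with_retry`, its parameters, and the stand-in function `flaky` are hypothetical and not part of the original notebook or of the OpenAI library.

```python
import time

def call_with_retry(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

# Demonstration with a stand-in function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "Food and Drink"

delays = []  # collect the requested pauses instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
```

In the notebook, `func` could be a small wrapper around the API request for one tag, so that a persistent failure stops the run instead of looping forever.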

In [50]:
tqdm.pandas()
df_pois_tp.loc[:, 'category'] = df_pois_tp.loc[:, 'subclass'].progress_apply(lambda x: poi_category(x))

  0%|          | 0/1238 [00:00<?, ?it/s]
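
The model is instructed to answer only from the list, but its reply is free text, and `categories_str` writes each category with a leading "- " bullet. The sketch below shows one way the replies could be checked before saving them. The helpers `strip_bullet` and `normalize_label` are hypothetical, and the category list is shortened for illustration.

```python
# Shortened category list for illustration; the notebook uses 24 entries
categories_str = "- Healthcare\n- Food and Drink\n- Tourism"

def strip_bullet(text):
    """Remove surrounding whitespace and a leading '- ' bullet, if present."""
    text = text.strip()
    return text[2:] if text.startswith("- ") else text

# Set of allowed labels without the bullet prefix
allowed = {strip_bullet(line) for line in categories_str.split("\n")}

def normalize_label(raw, allowed_labels):
    """Return the matching allowed label, or None when the reply is off-list."""
    label = strip_bullet(raw)
    return label if label in allowed_labels else None

replies = ["- Tourism", "Food and Drink", "Parks"]
checked = [normalize_label(r, allowed) for r in replies]
```

Replies that come back as `None` are the ones that would need attention in the manual check that follows.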

In [51]:
df_pois_tp.to_csv('results/pois/tag_category.csv', index=False)

### 1.3 Manually check every re-labeled class-subclass
In this step, some tags were cleaned up and it was decided which POIs to exclude. The decision is recorded in the `Keep` column.

Only POIs where people typically stay for a longer time are included. For example, emergency points and transportation toll points are removed.

In [52]:
df_pois_tp = pd.read_csv('results/pois/tag_category.csv')
df_pois_tp.head()

Unnamed: 0,class,subclass,count,category,Keep
0,amenity,bench,19100,Outdoor Recreation,1
1,amenity,shelter,13586,Outdoor Recreation,1
2,amenity,parking,12141,Transportation,0
3,amenity,restaurant,9399,Food and Drink,1
4,amenity,waste_basket,9247,Home and Living,0


In [59]:
df_pois_tp = df_pois_tp.loc[df_pois_tp['Keep'] == 1, :]

In [62]:
df_pois_tp_r = df_pois_tp.groupby(['class', 'category'])['count'].sum().reset_index()
df_pois_tp_r.to_clipboard(index=False)

### 1.4 Merge some classes

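The following classes are merged and not distinguished: tourism and historic are treated as tourism; leisure and sport as leisure; craft and office as office.

The amenity and shop classes are kept distinct and are marked by the suffixes (a) and (s) in the final tag.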

In [69]:
df_pois_tp_red = pd.read_excel('results/pois/tags_reduced.xlsx', sheet_name='Sheet1')
df_pois_tp_red.head()

Unnamed: 0,class,category,category_s,Tag,count
0,craft,Automotive Services,amenity,Automotive Services (a),9
1,sport,Automotive Services,amenity,Automotive Services (a),10
2,amenity,Education,amenity,Education (a),4166
3,leisure,Education,amenity,Education (a),7
4,office,Education,amenity,Education (a),115


### 1.5 Add final tag to POI data

In [74]:
gdf_pois = pd.merge(gdf_pois, df_pois_tp[['class', 'subclass', 'category']], how='inner')
gdf_pois = pd.merge(gdf_pois, df_pois_tp_red[['class', 'category', 'Tag']], how='inner')
gdf_pois.head()

Unnamed: 0,osm_id,class,subclass,geom,category,Tag
0,1147753712,tourism,chalet,POINT (2122190.759 8374453.104),Leisure,Tourism
1,1147753708,tourism,chalet,POINT (2122450.167 8374415.363),Leisure,Tourism
2,1147753710,tourism,chalet,POINT (2122528.139 8374307.503),Leisure,Tourism
3,1030180338,tourism,chalet,POINT (2121644.423 8374204.607),Leisure,Tourism
4,10796713818,tourism,chalet,POINT (2127318.931 8366983.437),Leisure,Tourism


In [76]:
gdf_pois.to_crs(4326).to_postgis("pois", schema="built_env", con=engine)

In [73]:
gdf_pois.groupby('Tag').size()

Tag
Artisan Workshops                849
Automotive Services (a)           19
Automotive Services (s)         1987
Craft                           1875
Education (a)                   4288
Education (s)                    281
Entertainment (s)                255
Fashion and Accessories (s)     3320
Financial Services (a)           831
Financial Services (s)           323
Food and Drink (a)             19871
Food and Drink (s)              1978
Groceries and Food (a)          1903
Groceries and Food (s)          6271
Health and Beauty (a)            652
Health and Beauty (s)           3709
Healthcare (a)                  3623
Healthcare (s)                   848
Home and Living                 3205
Leisure                        21947
Office                          6180
Office (s)                       129
Outdoor Recreation (a)         34254
Outdoor Recreation (s)           470
Recreation (a)                  1157
Recreation (s)                    29
Religious Places (a)            45

## 2. Find nearest POIs
Search radius = 300 m

In [6]:
gdf_pois = gpd.GeoDataFrame.from_postgis(sql="""SELECT osm_id, "Tag", geom FROM built_env.pois;""", con=engine)
gdf_pois = gdf_pois.to_crs(3006)
gdf_pois.loc[:, 'y'] = gdf_pois.geom.y
gdf_pois.loc[:, 'x'] = gdf_pois.geom.x
gdf_pois.head()

Unnamed: 0,osm_id,Tag,geom,y,x
0,1147753712,Tourism,POINT (727361.542 6645721.136),6645721.0,727361.542224
1,1147753708,Tourism,POINT (727492.967 6645710.224),6645710.0,727492.967193
2,1147753710,Tourism,POINT (727535.447 6645658.562),6645659.0,727535.446635
3,1030180338,Tourism,POINT (727094.846 6645579.711),6645580.0,727094.845826
4,10796713818,Tourism,POINT (730169.443 6642133.974),6642134.0,730169.442766


### 2.1 Load stops

In [7]:
df_stops = pd.read_sql(sql=f"""SELECT uid, lat, lng, wt_total, time_span, deso
                               FROM segregation.mobi_seg_deso_raw
                               WHERE weekday=1 AND holiday=0;""",
                       con=engine)
gdf_stops = preprocess.df2gdf_point(df_stops, 'lng', 'lat', crs=4326, drop=False)

Add home label to the stops

In [8]:
df_home = pd.read_sql(sql=f"""SELECT uid, lat, lng
                               FROM home_p;""",
                       con=engine)
df_home.loc[:, 'home'] = 1
gdf_stops = pd.merge(gdf_stops, df_home, on=['uid', 'lat', 'lng'], how='left')
gdf_stops = gdf_stops.fillna(0)
gdf_stops.head()

Unnamed: 0,uid,lat,lng,wt_total,time_span,deso,geometry,home
0,e340267f-5a0f-418f-be9d-6236e4879a67,60.72,15.8,8.72093,"{26,32}",2080A0040,POINT (15.80000 60.72000),0.0
1,e340267f-5a0f-418f-be9d-6236e4879a67,60.72,15.84,8.72093,"{38,44}",2080A0040,POINT (15.84000 60.72000),0.0
2,e3404d31-9797-4da0-8f0c-5056028c9c28,57.9887,15.637943,92.59565,"{21,23}",0513C1020,POINT (15.63794 57.98870),0.0
3,e3404d31-9797-4da0-8f0c-5056028c9c28,57.992429,15.637004,345.507142,"{42,48}",0513C1030,POINT (15.63700 57.99243),0.0
4,e3404d31-9797-4da0-8f0c-5056028c9c28,57.985806,15.633148,345.507142,"{33,39}",0513C1030,POINT (15.63315 57.98581),0.0


In [9]:
gdf_stops.groupby('home').size()

home
0.0    8085161
1.0    5407966
dtype: int64

### 2.2 Create a KD tree of all POIs and find the nearest POI within 300 m radius

In [10]:
gdf_stops = gdf_stops.to_crs(3006)
gdf_stops.loc[:, 'y'] = gdf_stops.geometry.y
gdf_stops.loc[:, 'x'] = gdf_stops.geometry.x

In [14]:
print(len(gdf_stops))
gdf_stops.replace([np.inf, -np.inf], np.nan, inplace=True)
gdf_stops.dropna(subset=["x", "y"], how="any", inplace=True)
print("After processing infinite values", len(gdf_stops))

13493127
After processing infinite values 13493102


In [15]:
tree = KDTree(gdf_pois[["y", "x"]], metric="euclidean")

In [16]:
ind, dist = tree.query_radius(gdf_stops[["y", "x"]].to_records(index=False).tolist(),
                              r=300, return_distance=True, count_only=False, sort_results=True)
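
To make the result of the radius query concrete, here is a brute-force sketch in plain Python that returns the same kind of result: for each stop, the positions of the POIs within the radius and their distances, nearest first. It is only a reference for spot checks on small samples, not a replacement for the KD tree. The function `radius_search` and the toy coordinates are assumptions made for illustration.

```python
import math

def radius_search(points, queries, r):
    """For each query, return indices and distances of points within r, nearest first."""
    all_ind, all_dist = [], []
    for qy, qx in queries:
        hits = []
        for i, (py, px) in enumerate(points):
            d = math.hypot(py - qy, px - qx)  # Euclidean distance in metres
            if d <= r:
                hits.append((d, i))
        hits.sort()  # nearest first, like sort_results=True
        all_ind.append([i for _, i in hits])
        all_dist.append([d for d, _ in hits])
    return all_ind, all_dist

# Toy data in a projected CRS, given as (y, x) in metres
pois = [(0.0, 0.0), (0.0, 100.0), (0.0, 500.0)]
stops = [(0.0, 90.0), (1000.0, 1000.0)]

ind_toy, dist_toy = radius_search(pois, stops, r=300)
nearest = [i[0] if i else None for i in ind_toy]  # nearest POI per stop, if any
```

The first stop has two POIs within 300 m and the nearest one is at position 1; the second stop has none, which corresponds to `poi_num == 0` in the notebook.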

In [17]:
gdf_stops.loc[:, 'poi_num'] = [len(x) for x in ind]
gdf_stops.loc[gdf_stops.poi_num > 0, 'osm_id'] = [gdf_pois.loc[x[0], 'osm_id'] for x in ind if len(x) > 0]
gdf_stops.loc[gdf_stops.poi_num > 0, 'dist'] = [x[0] for x in dist if len(x) > 0]
gdf_stops = pd.merge(gdf_stops, gdf_pois[['osm_id', 'Tag']], on='osm_id', how='left')
gdf_stops.head()

Unnamed: 0,uid,lat,lng,wt_total,time_span,deso,geometry,home,y,x,poi_num,osm_id,dist,Tag
0,e340267f-5a0f-418f-be9d-6236e4879a67,60.72,15.8,8.72093,"{26,32}",2080A0040,POINT (543648.267 6731866.130),0.0,6731866.0,543648.26724,0,,,
1,e340267f-5a0f-418f-be9d-6236e4879a67,60.72,15.84,8.72093,"{38,44}",2080A0040,POINT (545830.601 6731893.375),0.0,6731893.0,545830.601035,0,,,
2,e3404d31-9797-4da0-8f0c-5056028c9c28,57.9887,15.637943,92.59565,"{21,23}",0513C1020,POINT (537719.873 6427630.170),0.0,6427630.0,537719.872857,11,2960104000.0,161.285543,Groceries and Food (s)
3,e3404d31-9797-4da0-8f0c-5056028c9c28,57.992429,15.637004,345.507142,"{42,48}",0513C1030,POINT (537660.426 6428044.751),0.0,6428045.0,537660.426178,3,108015800.0,92.782114,Healthcare (a)
4,e3404d31-9797-4da0-8f0c-5056028c9c28,57.985806,15.633148,345.507142,"{33,39}",0513C1030,POINT (537439.352 6427305.303),0.0,6427305.0,537439.352145,15,3309674000.0,112.890025,Office


In [18]:
print("Share of non-home stops with a nearby POI:")
len(gdf_stops.loc[(gdf_stops.home == 0) & (~gdf_stops.Tag.isna()), :]) / \
len(gdf_stops.loc[gdf_stops.home == 0, :])

Share of non-home stops with a nearby POI:


0.7530857173601014

## 3. Randomly shift non-home stops
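Each non-home stop with a nearby POI is shifted to a similar POI within a 1 km radius of that POI, excluding candidates that are 30 m away or closer.

Step 1. If any candidate POI has the same Tag, select one of them at random.
Step 2. If Step 1 fails, select one with the same category in the other class, i.e., amenity (a) versus shop (s).
Step 3. If Step 2 fails and the stop's Tag is Office or Craft, select one tagged Office or Craft.
Step 4. If Step 3 fails and both Tags belong to shops, i.e., (s) or Shop, select one of those.
Step 5. If all steps fail, no shifted POI is assigned to the stop (the fields are set to NaN).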
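
The matching rules can be sketched as a standalone function that mirrors the `tag_categorization` logic used in `poi2nearby`. The name `match_level` is hypothetical, and the example tags are taken from the tag counts shown earlier.

```python
def match_level(stop_tag, candidate_tag):
    """Return 1-4 for the step at which the candidate matches, or 0 for no match."""
    # Step 1: identical tag
    if candidate_tag == stop_tag:
        return 1
    # Step 2: same category, different class suffix, e.g. "(a)" versus "(s)"
    if "(" in stop_tag and "(" in candidate_tag:
        if stop_tag.split(" (")[0] == candidate_tag.split(" (")[0]:
            return 2
    # Step 3: Office and Craft are treated as interchangeable
    if stop_tag in ("Office", "Craft") and candidate_tag in ("Office", "Craft"):
        return 3
    # Step 4: both are shop tags
    if "(s)" in stop_tag or stop_tag == "Shop":
        if "(s)" in candidate_tag or candidate_tag == "Shop":
            return 4
    return 0

examples = [
    ("Healthcare (a)", "Healthcare (a)"),
    ("Healthcare (a)", "Healthcare (s)"),
    ("Office", "Craft"),
    ("Healthcare (s)", "Food and Drink (s)"),
    ("Leisure", "Healthcare (a)"),
]
levels = [match_level(s, c) for s, c in examples]
```

A lower non-zero level means a closer match, so candidates at level 1 are preferred over level 2, and so on.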

In [19]:
shift_radius = 1000 # m
shift_radius_lower = 30 # m

def poi2nearby(row):
    X = gdf_pois.loc[gdf_pois.osm_id==row['osm_id'], 'x'].values[0]
    Y = gdf_pois.loc[gdf_pois.osm_id==row['osm_id'], 'y'].values[0]
    ind, dist = tree.query_radius([(Y, X)], r=shift_radius,
                                  return_distance=True, count_only=False, sort_results=True)
    ind = ind[0]
    dist = dist[0]
    def tag_categorization(x):
        if x == row['Tag']:
            return 1
        if ('(' in row['Tag']) and ('(' in x):
            if row['Tag'].split(' (')[0] == x.split(' (')[0]:
                return 2
        if (row['Tag'] in ('Office', 'Craft')) and (x in ('Office', 'Craft')):
            return 3
        if ('(s)' in row['Tag']) or (row['Tag'] == 'Shop'):
            if ('(s)' in x) or (x == 'Shop'):
                return 4
        return 0

    if len(ind) > 0:
        df = pd.DataFrame()
        df.loc[:, "id"] = range(0, len(dist))
        df.loc[:, "dist"] = dist
        df.loc[:, "tag"] = [gdf_pois.loc[x, 'Tag'] for x in ind]
        df.loc[:, "osm_id"] = [gdf_pois.loc[x, 'osm_id'] for x in ind]
        # Exclude POIs too close
        df = df.loc[df.dist > shift_radius_lower, :]
        if len(df) > 0:
            df.loc[:, "tag_cat"] = df.loc[:, "tag"].apply(lambda x: tag_categorization(x))
            if df.loc[:, "tag_cat"].sum() > 0:
                # sample 1 POI following the conditions 1, 2, 3, and 4
                for cat in (1, 2, 3, 4):
                    df2sample = df.loc[df.tag_cat == cat, :]
                    if len(df2sample) > 0:
                        poi_pool = len(df2sample)
                        df2sample = df2sample.sample(1)
                        osm_id_s = df2sample.osm_id.values[0]
                        Tag_s = df2sample.tag.values[0]
                        dist2poi = df2sample.dist.values[0]
                        break
            else:
                osm_id_s = np.nan
                Tag_s = np.nan
                dist2poi = np.nan
                poi_pool = 0
        else:
            osm_id_s = np.nan
            Tag_s = np.nan
            dist2poi = np.nan
            poi_pool = 0
        return pd.Series([osm_id_s, Tag_s, dist2poi, poi_pool], index=['osm_id_s', 'Tag_s', 'dist2poi', 'poi_pool'])
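
The selection inside `poi2nearby` can be summarised in a few lines of plain Python: drop candidates outside the distance band, keep the best available match level, and draw one candidate at random. The sketch below assumes each candidate is a tuple `(osm_id, dist_m, level)`; the function `sample_shift_target` and the seeded `random.Random` are hypothetical additions. The notebook itself calls `sample(1)` without a fixed seed, so its draws are not reproducible across runs.

```python
import random

def sample_shift_target(candidates, lower=30.0, upper=1000.0, rng=None):
    """Pick one candidate at the best match level; return (candidate, pool_size)."""
    rng = rng or random.Random()
    # Keep candidates farther than `lower`, within `upper`, and with a match level > 0
    eligible = [c for c in candidates if lower < c[1] <= upper and c[2] > 0]
    if not eligible:
        return None, 0
    best = min(c[2] for c in eligible)  # lowest level = closest match
    pool = [c for c in eligible if c[2] == best]
    return rng.choice(pool), len(pool)

# (osm_id, distance in metres, match level); the ids are made up
candidates = [(1, 10.0, 1), (2, 200.0, 2), (3, 400.0, 2), (4, 900.0, 4)]

target, pool_size = sample_shift_target(candidates, rng=random.Random(0))
no_target, empty_pool = sample_shift_target([(1, 10.0, 1)])
```

Here the identical-tag candidate is discarded because it lies within 30 m, so the draw is made from the two level-2 candidates and `pool_size` plays the role of `poi_pool`.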

In [20]:
stops2shift = gdf_stops.loc[(gdf_stops.home == 0) & (~gdf_stops.Tag.isna()), :]
stops2keep = gdf_stops.loc[~((gdf_stops.home == 0) & (~gdf_stops.Tag.isna())), :]

In [21]:
test = False
if test:
    stops2shift = stops2shift.sample(1000)

In [22]:
tqdm.pandas()
shifted = stops2shift.progress_apply(poi2nearby, axis=1)
stops2shift = pd.concat([stops2shift, shifted], axis=1)

  0%|          | 0/6088875 [00:00<?, ?it/s]

In [23]:
gdf_stops = pd.concat([stops2shift, stops2keep])
gdf_stops.iloc[0]

uid               e3404d31-9797-4da0-8f0c-5056028c9c28
lat                                            57.9887
lng                                          15.637943
wt_total                                      92.59565
time_span                                      {21,23}
deso                                         0513C1020
geometry     POINT (537719.8728565789 6427630.1701051)
home                                               0.0
y                                       6427630.170105
x                                        537719.872857
poi_num                                             11
osm_id                                    2960104041.0
dist                                        161.285543
Tag                             Groceries and Food (s)
osm_id_s                                  2960104042.0
Tag_s                           Groceries and Food (s)
dist2poi                                    361.109035
poi_pool                                           3.0
Name: 2, d

### 3.1 Find the corresponding DeSO zones for the shifted stays

In [24]:
gdf_deso = gpd.GeoDataFrame.from_postgis(sql="""SELECT deso AS deso_s, geom FROM zones;""", con=engine)
gdf_pois = gpd.sjoin(gdf_pois, gdf_deso)
gdf_pois.head()

Unnamed: 0,osm_id,Tag,geom,y,x,index_right,deso_s
0,1147753712,Tourism,POINT (727361.542 6645721.136),6645721.0,727361.542224,1245,0188A0180
1,1147753708,Tourism,POINT (727492.967 6645710.224),6645710.0,727492.967193,1245,0188A0180
2,1147753710,Tourism,POINT (727535.447 6645658.562),6645659.0,727535.446635,1245,0188A0180
3,1030180338,Tourism,POINT (727094.846 6645579.711),6645580.0,727094.845826,1245,0188A0180
4,10796713818,Tourism,POINT (730169.443 6642133.974),6642134.0,730169.442766,1245,0188A0180


In [25]:
gdf_stops = pd.merge(gdf_stops, gdf_pois[['osm_id', 'deso_s']].rename(columns={'osm_id': 'osm_id_s'}),
                     on='osm_id_s', how='left')
gdf_stops.iloc[0]

uid               e3404d31-9797-4da0-8f0c-5056028c9c28
lat                                            57.9887
lng                                          15.637943
wt_total                                      92.59565
time_span                                      {21,23}
deso                                         0513C1020
geometry     POINT (537719.8728565789 6427630.1701051)
home                                               0.0
y                                       6427630.170105
x                                        537719.872857
poi_num                                             11
osm_id                                    2960104041.0
dist                                        161.285543
Tag                             Groceries and Food (s)
osm_id_s                                  2960104042.0
Tag_s                           Groceries and Food (s)
dist2poi                                    361.109035
poi_pool                                           3.0
deso_s    

In [26]:
print("Share of non-home stops with a shifted POI:")
len(gdf_stops.loc[(gdf_stops.home == 0) & (gdf_stops.poi_pool > 0), :]) / \
len(gdf_stops.loc[gdf_stops.home == 0, :])

Share of non-home stops with a shifted POI:


0.6923986460789063

In [27]:
print("Share of non-home stops with a shifted POI in a different DeSO zone:")
len(gdf_stops.loc[(gdf_stops.home == 0) & (gdf_stops.poi_pool > 0) & (gdf_stops.deso_s != gdf_stops.deso), :]) / \
len(gdf_stops.loc[(gdf_stops.home == 0) & (gdf_stops.poi_pool > 0), :])

Share of non-home stops with a shifted POI in a different DeSO zone:


0.6703247335180229

In [28]:
gdf_stops.drop(columns=['x', 'y', 'geometry']).\
    to_sql('mobi_seg_deso_raw_poi_w1h0', engine, schema='segregation', index=False,
           method='multi', if_exists='append', chunksize=10000)

13493223

In [29]:
print("Number of individuals that have at least one shifted stay in a different DeSO zone:")
gdf_stops.loc[(gdf_stops.home == 0) & (gdf_stops.poi_pool > 0) & (gdf_stops.deso_s != gdf_stops.deso), 'uid'].nunique()

Number of individuals that have at least one shifted stay in a different DeSO zone:


256248