# CSC 481 Project 

Ayan Patel, Robert Hensley

* [Gowalla Dataset](https://snap.stanford.edu/data/loc-gowalla.html)



---



In [5]:
# !pip install snap-stanford
!pip install python-igraph
!pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/ab/97/25def417bf5db4cc6b89b47a56961b893d4ee4fec0c335f5b9476a8ff153/geopy-1.22.0-py2.py3-none-any.whl (113kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-1.22.0


In [2]:
# import snap ???
import igraph
from igraph import *

import pandas as pd

import time
from datetime import datetime

import geopy
from geopy.geocoders import Nominatim

## Loading Data

* this can takes 3 - 5 minutes to fully load all dataframes

In [3]:
data_path = 'data/'

# creating dataframes
edges = pd.read_csv(data_path + 'Gowalla_edges.txt', sep='\t', names=['u1', 'u2'])
totalCheckins = pd.read_csv(data_path + 'Gowalla_totalCheckins.txt', sep='\t',
                            names=['user', 'time', 'lat', 'long', 'locid'])

totalCheckins['time'] = totalCheckins['time'].map(lambda x: datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ"))

In [7]:
totalCheckins

Unnamed: 0,user,time,lat,long,locid
0,0,2010-10-19 23:55:27,30.235909,-97.795140,22847
1,0,2010-10-18 22:17:43,30.269103,-97.749395,420315
2,0,2010-10-17 23:42:03,30.255731,-97.763386,316637
3,0,2010-10-17 19:26:05,30.263418,-97.757597,16516
4,0,2010-10-16 18:50:42,30.274292,-97.740523,5535878
...,...,...,...,...,...
6442887,196578,2010-06-11 13:32:26,51.742988,-0.488065,906885
6442888,196578,2010-06-11 13:26:45,51.746492,-0.490780,965121
6442889,196578,2010-06-11 13:26:34,51.741916,-0.496729,1174322
6442890,196585,2010-10-08 21:01:49,50.105516,8.571525,471724


In [8]:
edges

Unnamed: 0,u1,u2
0,0,1
1,0,2
2,0,3
3,0,4
4,0,5
...,...,...
1900649,196586,196539
1900650,196587,196540
1900651,196588,196540
1900652,196589,196547


### User Notes

* userCheckins appears to cover only 1245 users
  * this makes me wonder if the userIDs match the edges userIDs (most likely not)
  * therefore it might be better to create our own graphs based off the checkin data and ignore the edges data
    * this means we cannot make the assumptions that two users are friends 
    * we should test to see if our predicted meetups (same location at around the same time) match the meetups in the edges file
  * also notice that user ID's appear to range from 0 - 1440, but there are only 1245 rows, implying some missing users


* we can get an idea of where the users are generally located by gather the mean of their latitude on longitudinal locations
* other user attributes to collect:
  * earliest time checked in
  * latest time checked in 
    * this shows the general timespan the user has been on the app 

In [9]:
totalCheckins['user'].max()

196585

In [5]:
users = totalCheckins.groupby('user').mean()[['lat', 'long']]
users

Unnamed: 0_level_0,lat,long
user,Unnamed: 1_level_1,Unnamed: 2_level_1
0,33.558308,-97.894601
1,47.204338,4.499703
2,35.659617,-120.016716
4,36.800202,-124.714027
5,32.290069,-96.123008
...,...,...
196544,-25.433409,-49.281533
196561,37.528650,-122.004623
196577,51.514905,-0.081277
196578,51.744557,-0.478051


* look at how many checkins the average user makes
  * this seems almost too big?
  * I assume this is a collection of users classified as very active on Gowalla

In [11]:
# average amounts of visits per user
totalCheckins.shape[0] / users.shape[0]

60.16221566503567

### Place Notes

* same grouping process as users
* for the scope of this project, we might want to only focus on places with like 5 or more visits

In [6]:
places = totalCheckins.groupby('locid').agg(['mean', 'count'])[['lat', 'long']]

places.columns = ['lat_mean', 'count_drop', 'long_mean', 'visits']
places = places.drop(columns = ['count_drop'])

#### Geolocation Data (geopy)

* our queries use the [Nominatim API](https://nominatim.org/release-docs/develop/api/Overview/) by default

Let's try to find the country all this data is from. We will query location information using [geopy](https://geopy.readthedocs.io/en/stable/). Below is a sample query.

In [14]:
from geopy.geocoders import Nominatim

geolocator = Nominatim()

location = geolocator.reverse("39.052318, -94.607499")

  This is separate from the ipykernel package so we can avoid doing imports until


In [15]:
location.raw

{'place_id': 251278814,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'way',
 'osm_id': 552493654,
 'lat': '39.052405',
 'lon': '-94.607436',
 'display_name': '4154, State Line Road, Country Club Plaza, Kansas City, Wyandotte County, Missouri, 66103, United States of America',
 'address': {'house_number': '4154',
  'road': 'State Line Road',
  'suburb': 'Country Club Plaza',
  'city': 'Kansas City',
  'county': 'Wyandotte County',
  'state': 'Missouri',
  'postcode': '66103',
  'country': 'United States of America',
  'country_code': 'us'},
 'boundingbox': ['39.052355', '39.052455', '-94.607486', '-94.607386']}

In [16]:
location.raw['address']['country']

'United States of America'

This gives the following important data:
* name 'address > tourism'
* city 'address > city'
* country 'address > country'

However this data lacks:
* context of the location (is it a pub, park, etc.)
    * we can get this information by querying OpenStreetMap using it's place_id, 

Given the limited time for this project, I feel like city and country information should be informative enough.

**Naming Conventions**

* subregions vary between different countries 
    * i.e. states and counties are not found in Germany, boroughs are found in Germany but not the US
    * this means the easiest thing to analyze right now are places < cities < countries
    * this problem can be resolved by focusing on one Country at a time
* the above problem unfortunately also occurs for city names
    * citities can also be called "towns, villages, municipalities, etc.)

In [20]:
location.raw

place = {"osm_place_id"}

{'place_id': 251278814,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'way',
 'osm_id': 552493654,
 'lat': '39.052405',
 'lon': '-94.607436',
 'display_name': '4154, State Line Road, Country Club Plaza, Kansas City, Wyandotte County, Missouri, 66103, United States of America',
 'address': {'house_number': '4154',
  'road': 'State Line Road',
  'suburb': 'Country Club Plaza',
  'city': 'Kansas City',
  'county': 'Wyandotte County',
  'state': 'Missouri',
  'postcode': '66103',
  'country': 'United States of America',
  'country_code': 'us'},
 'boundingbox': ['39.052355', '39.052455', '-94.607486', '-94.607386']}

In [88]:
places

Unnamed: 0_level_0,lat_mean,long_mean,visits
locid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8904,39.052318,-94.607499,12
8932,32.927662,-97.254356,16
8936,39.053318,-94.591995,12
8938,39.052824,-94.590311,130
8947,37.331880,-122.029631,570
...,...,...,...
5975123,38.020788,-7.874773,1
5976149,12.939186,100.882264,1
5976173,13.668828,100.644486,1
5977211,4.888548,114.838631,1


In [103]:
location = geolocator.reverse("39.052318, -94.607499")

In [105]:
location.raw

{'place_id': 251278814,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'way',
 'osm_id': 552493654,
 'lat': '39.052405',
 'lon': '-94.607436',
 'display_name': '4154, State Line Road, Country Club Plaza, Kansas City, Wyandotte County, Missouri, 66103, United States of America',
 'address': {'house_number': '4154',
  'road': 'State Line Road',
  'suburb': 'Country Club Plaza',
  'city': 'Kansas City',
  'county': 'Wyandotte County',
  'state': 'Missouri',
  'postcode': '66103',
  'country': 'United States of America',
  'country_code': 'us'},
 'boundingbox': ['39.052355', '39.052455', '-94.607486', '-94.607386']}

In [7]:
from geopy.geocoders import Nominatim

geolocator = Nominatim()

sorted_places = places.sort_values('visits', ascending=False).reset_index()

  This is separate from the ipykernel package so we can avoid doing imports until


In [8]:
sorted_places

Unnamed: 0,locid,lat_mean,long_mean,visits
0,55033,59.330158,18.058079,5811
1,19542,37.616356,-122.386150,5662
2,9410,30.201557,-97.667127,4713
3,10259,32.897462,-97.040348,4083
4,58725,59.650051,17.932262,3476
...,...,...,...,...
1280964,1071645,8.038478,98.815826,1
1280965,1071647,42.553000,-83.175745,1
1280966,1071650,34.614540,-92.498446,1
1280967,1071652,29.756889,-95.365663,1


In [None]:
import json

location_json = {}
location_json['places'] = []

# choose place to start writing files
i = 1000
chunk = 1000
places_temp = sorted_places[i:]

for index, row in places_temp.iterrows():
    
    location = geolocator.reverse(str(row["lat_mean"]) + ", " + str(row["long_mean"])).raw
    
    location_json['places'].append({'locid' : row['locid'], 'osm_json' : location})
    
    if index % chunk == 0 and index != i:
        
        with open('data/json/' + str(index) + '.json', 'w') as outfile:
            json.dump(location_json, outfile)
        
        print(str(index) + ' index JSON written')
        
        location_json['places'] = []

1000 index JSON written


* we will be analyzing the top 10000 locations in the dataset by number of visits

In [12]:
places_top10 = sorted_places[:10000]

places_top10

Unnamed: 0,locid,lat_mean,long_mean,visits
0,55033,59.330158,18.058079,5811
1,19542,37.616356,-122.386150,5662
2,9410,30.201557,-97.667127,4713
3,10259,32.897462,-97.040348,4083
4,58725,59.650051,17.932262,3476
...,...,...,...,...
9995,230547,36.005830,-115.084607,56
9996,45183,35.655814,-97.470548,56
9997,164356,48.101030,11.645453,56
9998,234171,63.825514,20.261439,56


In [17]:
from os import listdir

listdir('data/json/')

['.ipynb_checkpoints',
 '10000.json',
 '2000.json',
 '3000.json',
 '4000.json',
 '5000.json',
 '6000.json',
 '7000.json',
 '8000.json',
 '9000.json']

In [38]:
import json

data_path = 'data/json/'
start = 2000

location_append = []


while start <= 10000:
    
    if j.endswith('.json'):
        with open(data_path + str(start) + '.json', 'r') as f:
            batch = json.load(f)

    places = batch['places']
    
    for location in places:
        
        # cities
        try:
            city = location['osm_json']['address']['city']
        except:
            pass

        try:
            city = location['osm_json']['address']['town']
        except:
            pass

        try:
            city = location['osm_json']['address']['village']
        except:
            pass

        # skip entities with no city information
        if city == "":
            print("no city")
            print(location)
    
        # countries
        country = ""
        country_code = ""
        
        try:
            country = location['osm_json']['address']['country']
        except:
            pass
        
        try:
            country_code = location['osm_json']['address']['country_code']
        except:
            pass
        
        if country == "":
            print("no country")
            print(location)
        
        location_append.append({
            'locid_check' : location['locid'], 
            'osm_place_id' : location['osm_json']['place_id'],
            'display_name' : location['osm_json']['display_name'],
            'city' : city,
            'country' : country,
            'country_code' : country_code
        })
    
    start += 1000

no country
{'locid': 17100.0, 'osm_json': {'place_id': 283373754, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'node', 'osm_id': 3815077900, 'lat': '0', 'lon': '0', 'display_name': 'Soul Buoy', 'address': {'man_made': 'Soul Buoy'}, 'boundingbox': ['-5.0E-5', '5.0E-5', '-5.0E-5', '5.0E-5']}}


In [44]:
append_df = pd.DataFrame(location_append)[:-1]

append_df

Unnamed: 0,locid_check,osm_place_id,display_name,city,country,country_code
0,55033.0,235973447,"Pennfäktaren, Norrmalm, Norrmalms stadsdelsomr...",Stockholm,Sverige,se
1,19542.0,235227144,"Domestic Garage, Zone F/G, Lomita Park, San Ma...",Stockholm,United States of America,us
2,9410.0,235768752,"Austin-Bergstrom International Airport (AUS), ...",Austin,United States of America,us
3,10259.0,186874070,"FAA Tower, International Parkway, Grapevine, T...",Grapevine,United States of America,us
4,58725.0,105773201,"Stockholm-Arlanda flygplats, 273, Starrmossen,...",Grapevine,Sverige,se
...,...,...,...,...,...,...
9995,230547.0,257735126,"2265, West Horizon Ridge Parkway, Macdonald Ra...",Henderson,United States of America,us
9996,45183.0,145877854,"Thatcher Hall, East Hurd Street, Edmond, Oklah...",Edmond,United States of America,us
9997,164356.0,235795942,"Bezirksteil Neuperlach, Stadtbezirk 16 Ramersd...",München,Deutschland,de
9998,234171.0,46570017,"Ginatricot, Rådhustorget, Centrum, Centrala st...",Umeå,Sverige,se


In [48]:
append_df = pd.DataFrame(location_append)[:-1]
places_top10_desc = places_top10.join(append_df)

places_top10_desc

Unnamed: 0,locid,lat_mean,long_mean,visits,locid_check,osm_place_id,display_name,city,country,country_code
0,55033,59.330158,18.058079,5811,55033.0,235973447,"Pennfäktaren, Norrmalm, Norrmalms stadsdelsomr...",Stockholm,Sverige,se
1,19542,37.616356,-122.386150,5662,19542.0,235227144,"Domestic Garage, Zone F/G, Lomita Park, San Ma...",Stockholm,United States of America,us
2,9410,30.201557,-97.667127,4713,9410.0,235768752,"Austin-Bergstrom International Airport (AUS), ...",Austin,United States of America,us
3,10259,32.897462,-97.040348,4083,10259.0,186874070,"FAA Tower, International Parkway, Grapevine, T...",Grapevine,United States of America,us
4,58725,59.650051,17.932262,3476,58725.0,105773201,"Stockholm-Arlanda flygplats, 273, Starrmossen,...",Grapevine,Sverige,se
...,...,...,...,...,...,...,...,...,...,...
9995,230547,36.005830,-115.084607,56,230547.0,257735126,"2265, West Horizon Ridge Parkway, Macdonald Ra...",Henderson,United States of America,us
9996,45183,35.655814,-97.470548,56,45183.0,145877854,"Thatcher Hall, East Hurd Street, Edmond, Oklah...",Edmond,United States of America,us
9997,164356,48.101030,11.645453,56,164356.0,235795942,"Bezirksteil Neuperlach, Stadtbezirk 16 Ramersd...",München,Deutschland,de
9998,234171,63.825514,20.261439,56,234171.0,46570017,"Ginatricot, Rådhustorget, Centrum, Centrala st...",Umeå,Sverige,se


In [49]:
places_top10_desc.to_csv('data/places_top10_desc.csv')

In [60]:
# top cities

places_top10_desc['city'].value_counts()

Austin                1183
Stockholm              816
San Francisco          517
Göteborg               295
Dallas                 250
                      ... 
Lower Paia               1
East Hertfordshire       1
Noordwijk                1
Fort Collins             1
Edwardsville             1
Name: city, Length: 1529, dtype: int64

In [61]:
places_top10_desc['country'].value_counts()

United States of America       5894
Sverige                        1954
Deutschland                     412
United Kingdom                  338
Norge                           269
Canada                          126
Nederland                       120
Saudi Arabia / السعودية         111
België - Belgique - Belgien      99
ประเทศไทย                        81
日本 (Japan)                       79
Switzerland                      69
Australia                        65
France                           43
Česká republika                  36
Portugal                         33
Österreich                       29
Danmark                          27
Italia                           25
Malaysia                         24
Singapore                        24
España                           19
China 中国                         14
Magyarország                     14
Philippines                      12
Brasil                           10
Indonesia                         8
Suomi                       

#### Centrality of Cities

* we would like to understand more about which locations are more influential in social networks
* listing the most visted places as shown above gives us an idea of which locations are most popular
    * however we are more interested in which places are most influential in terms of number of connections made

## Graph Representations

* having some issues importing Stanford Snap (conflicts with another library called snap)
* I'm going to try iGraph in the meantime

### Limitations

* the max userID will be the number of edges required in the graph (assumption)
* using the maximum number of nodes in the graph (196591) takes a long time to generate the graph
  * this I would assume makes the algorithms for these graphs tricky to run
  * I could also be adding the edges in an unefficient way (explore ways of generating graphs)
    * might be better to add edges in batches instead of one edge at a time (explore this)

#### Ways to set limits
* we can limit the groups to the first n number of users (example only nodes 0 - 1000)
  * however under the assumption that userIDs are added based on the time the user adds themselves to the network, then there will be a lot of graphs that aren't connected because they are in geographically seperate locations
* a better soltuion would be to take a group of n people in close proximety of one another
  * we could find the average latitude and longitude of a users and record this as the area they are generally in
  * we'll then randomly select a user and select the n amount of people closest to this user (ranking by the distance between their average locations)


In [17]:
edgesG = Graph()
total_edges = max(edges['u1'].max(), edges['u2'].max())
total_edges

196590

In [18]:
edgesG.add_vertices(total_edges)

In [0]:
# this takes too long; there's got to be a better way!

for i in range(len(edges)):
  edgesG.add_edge(edges.loc[i, "u1"], edges.loc[i, "u2"])

In [26]:
#Faster way to make graph from df tuples
eG = igraph.Graph.TupleList(edges.itertuples(index=False), directed=False, weights=False)
eG.summary()

'IGRAPH UN-- 196591 1900654 -- \n+ attr: name (v)'

In [42]:
#Finds degrees of each vertex and the max degree and its vertex - popularity
eG_degrees = eG.indegree()
print(max(eG_degrees))
eG_max_degree_index = eG_degrees.index(max(eG_degrees))
eG.vs[eG_max_degree_index]

29460


igraph.Vertex(<igraph.Graph object at 0x0000022A8E1D25E8>, 307, {'name': 307})

In [53]:
user_locids = totalCheckins[['user', 'locid']]
user_locids['locid'] = user_locids['locid'].apply(lambda x: 'L' + str(x))
user_locids

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,user,locid
0,0,L22847
1,0,L420315
2,0,L316637
3,0,L16516
4,0,L5535878
...,...,...
6442887,196578,L906885
6442888,196578,L965121
6442889,196578,L1174322
6442890,196585,L471724


In [54]:
user_loc_G = igraph.Graph.TupleList(user_locids.itertuples(index=False), directed=False, weights=False)
user_loc_G.summary()

'IGRAPH UN-- 1388061 6442892 -- \n+ attr: name (v)'

## Map APIs

* important for querying specific information about locations people visit
* data visualization is not an essential part of the project, but is worth exploring if data visualization is easy

### Notes About OSM
* [OpenStreetMap Article](https://towardsdatascience.com/loading-data-from-openstreetmap-with-python-and-the-overpass-api-513882a27fd0)

* We will be using the Overpass API to get information about our locations
  * [Overpass Querying Language](https://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_QL)
  * [Site for Testing Overpass Queries](http://overpass-turbo.eu/)


* **node**: a specific lat, long point
* **relation**: a region (typically for buildings that require multiple polygons
  * probably will be using this for our project (need a way to query if a user's location is in a certain relation)
  * includes tags which can helps us 

In [0]:
# example of an overpass query that collects all the biergartens in germany
# there are a lot of biergartens in Germany lol; so the query will take long to run 
# notice that it includes every OpenStreetMap datatype (node, way, rel); maybe there are biergarten streets? (way)

import requests
import json

overpass_url = "http://overpass-api.de/api/interpreter"
overpass_query = """
[out:json];
area["ISO3166-1"="DE"][admin_level=2];
(node["amenity"="biergarten"](area);
 way["amenity"="biergarten"](area);
 rel["amenity"="biergarten"](area);
);
out center;
"""
response = requests.get(overpass_url, 
                        params={'data': overpass_query})
data = response.json()

In [0]:
# notice how descriptive the API is for relations

data['elements'][5:10]

[{'id': 27318009,
  'lat': 52.4200885,
  'lon': 13.1763456,
  'tags': {'addr:city': 'Berlin',
   'addr:housenumber': '260',
   'addr:postcode': '14109',
   'addr:street': 'Kronprinzessinnenweg',
   'amenity': 'biergarten',
   'contact:email': 'info@loretta-berlin.de',
   'contact:fax': '+49 30 80105334',
   'contact:phone': '+49 30 80105333',
   'contact:website': 'https://www.loretta-wannsee.de/biergarten/',
   'name': 'Loretta',
   'opening_hours': '12:00+',
   'wheelchair': 'limited'},
  'type': 'node'},
 {'id': 27352197,
  'lat': 50.3638373,
  'lon': 7.5769021,
  'tags': {'amenity': 'biergarten', 'wheelchair': 'limited'},
  'type': 'node'},
 {'id': 27787909,
  'lat': 52.9822191,
  'lon': 8.845254,
  'tags': {'amenity': 'biergarten',
   'name': 'Schwarzbiergarten',
   'wheelchair': 'limited'},
  'type': 'node'},
 {'id': 29812167,
  'lat': 49.4814353,
  'lon': 10.993033,
  'tags': {'amenity': 'biergarten',
   'toilets:wheelchair': 'no',
   'wheelchair': 'limited'},
  'type': 'node'},

# Sources

A collection of sources to include in the final paper:

* [OpenStreetMap Article](https://towardsdatascience.com/loading-data-from-openstreetmap-with-python-and-the-overpass-api-513882a27fd0)