# CSC 481 Project 

Ayan Patel, Robert Hensley

* [Gowalla Dataset](https://snap.stanford.edu/data/loc-gowalla.html)



---



In [0]:
# !pip install snap-stanford
!pip install python-igraph

Collecting python-igraph
  Downloading https://files.pythonhosted.org/packages/8b/74/24a1afbf3abaf1d5f393b668192888d04091d1a6d106319661cd4af05406/python_igraph-0.8.2-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Collecting texttable>=1.6.2
  Downloading https://files.pythonhosted.org/packages/ec/b1/8a1c659ce288bf771d5b1c7cae318ada466f73bd0e16df8d86f27a2a3ee7/texttable-1.6.2-py2.py3-none-any.whl
Installing collected packages: texttable, python-igraph
Successfully installed python-igraph-0.8.2 texttable-1.6.2


In [0]:
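# snap-stanford is not imported: it conflicts with another library called snap
import igraph
from igraph import Graph  # only Graph is used below, so avoid the wildcard import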

import pandas as pd

import time
from datetime import datetime

## Loading Data

* use the following path with your Google Drive
* it can take 3 to 5 minutes to fully load all dataframes

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

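# relative to Colab's default working directory (/content); the trailing slash is
# needed because the file names are appended directly to this string
data_path = 'drive/My Drive/CSC 481 Project/data/'

# creating dataframes
edges = pd.read_csv(data_path + 'Gowalla_edges.txt', sep='\t', names=['u1', 'u2'])
totalCheckins = pd.read_csv(data_path + 'Gowalla_totalCheckins.txt', sep='\t',
                            names=['user', 'time', 'lat', 'long', 'locid'])

# vectorized timestamp parsing; typically much faster than calling strptime per row
totalCheckins['time'] = pd.to_datetime(totalCheckins['time'], format="%Y-%m-%dT%H:%M:%SZ")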

Mounted at /content/drive


In [0]:
!ls 'drive/My Drive/CSC 481 Project/data'

Gowalla_edges.txt  Gowalla_totalCheckins.txt


In [0]:
totalCheckins

| | user | time | lat | long | locid |
|---|---|---|---|---|---|
| 0 | 0 | 2010-10-19 23:55:27 | 30.235909 | -97.795140 | 22847 |
| 1 | 0 | 2010-10-18 22:17:43 | 30.269103 | -97.749395 | 420315 |
| 2 | 0 | 2010-10-17 23:42:03 | 30.255731 | -97.763386 | 316637 |
| 3 | 0 | 2010-10-17 19:26:05 | 30.263418 | -97.757597 | 16516 |
| 4 | 0 | 2010-10-16 18:50:42 | 30.274292 | -97.740523 | 5535878 |
| ... | ... | ... | ... | ... | ... |
| 6442887 | 196578 | 2010-06-11 13:32:26 | 51.742988 | -0.488065 | 906885 |
| 6442888 | 196578 | 2010-06-11 13:26:45 | 51.746492 | -0.490780 | 965121 |
| 6442889 | 196578 | 2010-06-11 13:26:34 | 51.741916 | -0.496729 | 1174322 |
| 6442890 | 196585 | 2010-10-08 21:01:49 | 50.105516 | 8.571525 | 471724 |


In [0]:
edges

| | u1 | u2 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 4 |
| 4 | 0 | 5 |
| ... | ... | ... |
| 1900649 | 196586 | 196539 |
| 1900650 | 196587 | 196540 |
| 1900651 | 196588 | 196540 |
| 1900652 | 196589 | 196547 |


### User Notes

* userCheckins appears to cover only 1245 users
  * this makes me wonder whether its userIDs match the userIDs in the edges file (most likely not)
  * therefore it might be better to build our own graphs from the checkin data and ignore the edges data
    * this means we cannot assume that two users are friends
    * we should test whether our predicted meetups (same location at around the same time) match the user pairs in the edges file
  * also notice that user IDs appear to range from 0 to 1440, but there are only 1245 rows, implying some users are missing
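
The meetup test mentioned above could be sketched roughly as follows. This is only a minimal illustration using the standard library: the function name `predicted_meetups`, the one-hour window, and the toy checkins and edges are all assumptions made for the example, not part of the dataset or of any library.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def predicted_meetups(checkins, window=timedelta(hours=1)):
    """Return the set of user pairs seen at the same locid within `window`.

    `checkins` is an iterable of (user, time, locid) tuples.
    Pairs are stored as (smaller_id, larger_id) so direction does not matter.
    """
    by_place = defaultdict(list)
    for user, time, locid in checkins:
        by_place[locid].append((time, user))

    pairs = set()
    for visits in by_place.values():
        visits.sort()  # chronological order within one location
        for i, (t1, u1) in enumerate(visits):
            for t2, u2 in visits[i + 1:]:
                if t2 - t1 > window:
                    break  # later visits are even further away in time
                if u1 != u2:
                    pairs.add((min(u1, u2), max(u1, u2)))
    return pairs


# toy data (made up for illustration)
checkins = [
    (0, datetime(2010, 10, 19, 23, 55), 22847),
    (1, datetime(2010, 10, 19, 23, 58), 22847),
    (2, datetime(2010, 10, 20, 6, 0), 22847),
    (3, datetime(2010, 10, 19, 23, 56), 420315),
    (4, datetime(2010, 10, 19, 23, 59), 420315),
]
raw_edges = [(0, 1), (1, 0), (0, 2), (2, 0)]

# normalize the edge list the same way as the predicted pairs
friend_edges = {(min(u, v), max(u, v)) for u, v in raw_edges}

meetups = predicted_meetups(checkins)
confirmed = meetups & friend_edges    # predicted meetups that are also edges
unconfirmed = meetups - friend_edges  # predicted meetups with no matching edge
```

The share of `confirmed` pairs among all `meetups` gives a rough idea of how well co-location predicts an edge. On the full data the window size would need tuning, and very busy locations may need special handling because the inner loop grows with the number of visits inside the window.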


* we can get an idea of where users are generally located by taking the mean of their checkin latitudes and longitudes
* other user attributes to collect:
  * earliest time checked in
  * latest time checked in
    * together these show the general timespan the user has been on the app
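
The attributes listed above can be collected in a single pass with pandas named aggregation. The sketch below runs on a tiny made-up frame that reuses the column names of `totalCheckins`; the output column names (`first_checkin`, `last_checkin`, `n_checkins`, `active_span`) are hypothetical choices for this example.

```python
import pandas as pd

# toy stand-in for totalCheckins (values made up, column names as in the notebook)
checkins = pd.DataFrame({
    'user': [0, 0, 1],
    'time': pd.to_datetime(['2010-10-19 23:55:27',
                            '2010-10-16 18:50:42',
                            '2010-06-11 13:32:26']),
    'lat': [30.0, 32.0, 51.5],
    'long': [-97.0, -99.0, -0.5],
})

# one row per user: mean location plus first and last checkin
user_stats = checkins.groupby('user').agg(
    lat=('lat', 'mean'),
    long=('long', 'mean'),
    first_checkin=('time', 'min'),
    last_checkin=('time', 'max'),
    n_checkins=('time', 'size'),
)

# how long each user has been active in the data
user_stats['active_span'] = user_stats['last_checkin'] - user_stats['first_checkin']
```

Note that a plain mean of longitudes can be misleading for users whose checkins straddle the 180° meridian; for this exploratory step that is assumed to be rare enough to ignore.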

In [0]:
totalCheckins['user'].max()

196585

In [0]:
# select the coordinate columns first so only those are averaged
users = totalCheckins.groupby('user')[['lat', 'long']].mean()
users

| user | lat | long |
|---|---|---|
| 0 | 33.558308 | -97.894601 |
| 1 | 47.204338 | 4.499703 |
| 2 | 35.659617 | -120.016716 |
| 4 | 36.800202 | -124.714027 |
| 5 | 32.290069 | -96.123008 |
| ... | ... | ... |
| 196544 | -25.433409 | -49.281533 |
| 196561 | 37.528650 | -122.004623 |
| 196577 | 51.514905 | -0.081277 |
| 196578 | 51.744557 | -0.478051 |


* look at how many checkins the average user makes
  * the result seems almost too large?
  * I assume this is a collection of users classified as very active on Gowalla

In [0]:
# average number of checkins per user
totalCheckins.shape[0] / users.shape[0]

338.13253012048193
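
A mean like the one above can be pulled up by a handful of very active users, so it may help to compare it with the median. The sketch below uses only the standard library on a made-up user column; in the notebook the per-user counts could instead come from `totalCheckins['user'].value_counts()`.

```python
from collections import Counter
from statistics import mean, median

# toy user column: one very active user and three occasional ones (made up)
user_column = [0] * 9 + [1, 2, 3]

# number of checkins per user
checkins_per_user = Counter(user_column)
counts = list(checkins_per_user.values())

mean_checkins = mean(counts)      # same quantity as total rows / number of users
median_checkins = median(counts)  # less sensitive to a few heavy users
```

If the median is far below the mean, the "too large" average is mostly a reflection of a few heavy users rather than of typical behavior.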

### Place Notes

* same grouping process as users
* maybe also include beginning and end of times visited?
  * probably not necessary for this dataset

In [0]:
# select the coordinate columns first so only those are averaged
places = totalCheckins.groupby('locid')[['lat', 'long']].mean()

In [0]:
places

| locid | lat | long |
|---|---|---|
| 8904 | 39.052318 | -94.607499 |
| 8932 | 32.927662 | -97.254356 |
| 8936 | 39.053318 | -94.591995 |
| 8938 | 39.052824 | -94.590311 |
| 8947 | 37.331880 | -122.029631 |
| ... | ... | ... |
| 5975123 | 38.020788 | -7.874773 |
| 5976149 | 12.939186 | 100.882264 |
| 5976173 | 13.668828 | 100.644486 |
| 5977211 | 4.888548 | 114.838631 |


Let's try to find out which countries this data comes from. We will query location information using [geopy](https://geopy.readthedocs.io/en/stable/). Below is a sample query.

In [0]:
from geopy.geocoders import Nominatim

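# newer geopy versions require an explicit user_agent; the name here is a placeholder
geolocator = Nominatim(user_agent="csc481-gowalla-project")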

location = geolocator.reverse("52.509669, 13.376294")



In [0]:
location.raw

{'address': {'borough': 'Mitte',
  'city': 'Berlin',
  'country': 'Deutschland',
  'country_code': 'de',
  'postcode': '10785',
  'quarter': 'Botschaftsviertel',
  'road': 'Bellevuestraße',
  'suburb': 'Tiergarten',
  'tourism': 'Potsdamer Platz'},
 'boundingbox': ['52.5082999', '52.5100374', '13.3750548', '13.3769528'],
 'display_name': 'Potsdamer Platz, Bellevuestraße, Botschaftsviertel, Tiergarten, Mitte, Berlin, 10785, Deutschland',
 'lat': '52.5098014',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'lon': '13.375589791291057',
 'osm_id': 3200536,
 'osm_type': 'relation',
 'place_id': 235599123}

In [0]:
location.raw['address']['country']

'Deutschland'

* run the query below offline because it takes too much time in Google Colab
  * assuming it's a resource limitation of Colab
  * it could also be inefficient to append to each row while querying

In [0]:
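geolocator = Nominatim(user_agent="csc481-gowalla-project")  # placeholder app name

# rows yielded by iterrows() are copies, so collect the results and assign the column once
loc_json = []
for index, row in places.iterrows():
  loc_json.append(geolocator.reverse(str(row["lat"]) + ", " + str(row["long"])).raw)
places['loc_json'] = loc_json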



GeocoderTimedOut: ignored
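
The timeout above is probably caused by sending requests back to back; the public Nominatim service asks for at most one request per second. A rough way to cope is to deduplicate nearby coordinates, pause between calls, and retry on failure. The sketch below uses only the standard library, and the function name `reverse_all` and all of its parameters are hypothetical. In the notebook, `reverse` would be `geolocator.reverse`, and the broad `except` could be narrowed to `geopy.exc.GeocoderTimedOut`. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which may be a simpler option.

```python
import time


def reverse_all(coords, reverse, delay_seconds=1.0, max_retries=3, precision=3):
    """Reverse-geocode (lat, lon) pairs with throttling, retries and a cache.

    `reverse` is any callable taking a "lat, lon" string.
    Coordinates are rounded to `precision` decimals so nearby points share one lookup.
    Returns one result per input pair (None if every retry failed).
    """
    cache = {}
    results = []
    for lat, lon in coords:
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            for attempt in range(max_retries):
                try:
                    cache[key] = reverse(f"{key[0]}, {key[1]}")
                    break
                except Exception:
                    # back off a little longer after each failed attempt
                    time.sleep(delay_seconds * (attempt + 1))
            else:
                cache[key] = None  # give up on this coordinate
            time.sleep(delay_seconds)  # stay under the rate limit
        results.append(cache[key])
    return results


# stand-in for geolocator.reverse so the example runs without network access
calls = []


def fake_reverse(query):
    calls.append(query)
    return {"query": query}


coords = [(52.509669, 13.376294), (52.509670, 13.376295), (30.235909, -97.795140)]
results = reverse_all(coords, fake_reverse, delay_seconds=0)
```

Even with throttling, one lookup per second is slow for a table with many distinct places, so rounding more aggressively or geocoding only a sample may be necessary.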

## Graph Representations

* having some issues importing Stanford Snap (conflicts with another library called snap)
* I'm going to try iGraph in the meantime

### Limitations

* the max userID plus one will be the number of vertices required in the graph (assumption: IDs start at 0)
* using the maximum number of nodes in the graph (196591) takes a long time to generate the graph
  * I would assume this makes the algorithms for these graphs tricky to run
  * I could also be adding the edges in an inefficient way (explore ways of generating graphs)
    * might be better to add edges in batches instead of one edge at a time (explore this)

#### Ways to set limits
* we can limit the groups to the first n users (for example, only nodes 0 - 1000)
  * however, if userIDs are assigned in the order users join the network, many of the resulting graphs won't be connected, because the users are in geographically separate locations
* a better solution would be to take a group of n people in close proximity to one another
  * we could find the average latitude and longitude of each user and record this as the area they are generally in
  * we'll then randomly select a user and select the n people closest to this user (ranked by the distance between their average locations)
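
The proximity-based selection described above could look roughly like this. It is a sketch using only the standard library: the helper names `haversine_km` and `nearest_users` are hypothetical, great-circle (haversine) distance is one reasonable choice of distance rather than a requirement, and the mean locations are copied from a few rows of the `users` table shown earlier.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(min(1.0, math.sqrt(a)))


def nearest_users(mean_locations, seed_user, n):
    """Return the n users whose mean location is closest to seed_user's."""
    seed_lat, seed_lon = mean_locations[seed_user]
    others = [u for u in mean_locations if u != seed_user]
    others.sort(key=lambda u: haversine_km(seed_lat, seed_lon, *mean_locations[u]))
    return others[:n]


# user -> (mean lat, mean long), taken from the users table
mean_locations = {
    0: (33.558308, -97.894601),
    1: (47.204338, 4.499703),
    2: (35.659617, -120.016716),
    5: (32.290069, -96.123008),
    196577: (51.514905, -0.081277),
}

seed_user = random.choice(list(mean_locations))  # randomly selected starting user
group = nearest_users(mean_locations, seed_user, n=2)
```

Sorting every user by distance is fine for a one-off selection; if many groups are needed, a spatial index would avoid recomputing all distances each time.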


In [0]:
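edgesG = Graph()
# largest user ID appearing in either column of the edge list
max_user_id = int(max(edges['u1'].max(), edges['u2'].max()))
max_user_id

196590

In [0]:
# IDs start at 0, so the graph needs max_user_id + 1 vertices
edgesG.add_vertices(max_user_id + 1)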

In [0]:
# adding all edges in one batch avoids the overhead of calling add_edge once per row
edgesG.add_edges(list(zip(edges['u1'].tolist(), edges['u2'].tolist())))

## Map APIs

* important for querying specific information about locations people visit
* data visualization is not an essential part of the project, but is worth exploring if data visualization is easy

### Notes About OSM
* [OpenStreetMap Article](https://towardsdatascience.com/loading-data-from-openstreetmap-with-python-and-the-overpass-api-513882a27fd0)

* We will be using the Overpass API to get information about our locations
  * [Overpass Querying Language](https://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_QL)
  * [Site for Testing Overpass Queries](http://overpass-turbo.eu/)


* **node**: a specific lat, long point
* **relation**: a region (typically for buildings that require multiple polygons)
  * we will probably be using this for our project (need a way to query if a user's location is in a certain relation)
  * includes tags which can help us
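
One possible way to ask which regions contain a user's location is the Overpass QL `is_in(lat,lon)` statement, and the `around` filter can list tagged nodes near a point. The sketch below only builds the query strings and defines a small standard-library helper for sending them; the function names are hypothetical, the radius is an arbitrary example value, and the exact output options may need adjusting after trying the queries in overpass-turbo.

```python
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "http://overpass-api.de/api/interpreter"


def build_is_in_query(lat, lon):
    """Overpass QL asking for the areas that contain the point (lat, lon)."""
    return f"[out:json];is_in({lat},{lon});out;"


def build_nearby_amenities_query(lat, lon, radius_m=50):
    """Overpass QL asking for nodes tagged `amenity` within radius_m meters of the point."""
    return f'[out:json];node["amenity"](around:{radius_m},{lat},{lon});out;'


def run_query(query):
    """Send a query to the Overpass API and return the decoded JSON (needs network access)."""
    url = OVERPASS_URL + "?" + urllib.parse.urlencode({"data": query})
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


# coordinates of the sample geopy query (Potsdamer Platz, Berlin)
is_in_query = build_is_in_query(52.509669, 13.376294)
nearby_query = build_nearby_amenities_query(52.509669, 13.376294)
# data = run_query(is_in_query)  # uncomment to actually call the API
```

Running one such query per checkin would be far too slow for the full dataset, so this would only be practical for per-place lookups or a sample.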

In [0]:
# example of an overpass query that collects all the biergartens in germany
# there are a lot of biergartens in Germany lol; so the query will take long to run 
# notice that it includes every OpenStreetMap datatype (node, way, rel); maybe there are biergarten streets? (way)

import requests
import json

overpass_url = "http://overpass-api.de/api/interpreter"
overpass_query = """
[out:json];
area["ISO3166-1"="DE"][admin_level=2];
(node["amenity"="biergarten"](area);
 way["amenity"="biergarten"](area);
 rel["amenity"="biergarten"](area);
);
out center;
"""
response = requests.get(overpass_url, 
                        params={'data': overpass_query})
data = response.json()

In [0]:
# notice how descriptive the tags are (the elements shown here happen to be nodes)

data['elements'][5:10]

[{'id': 27318009,
  'lat': 52.4200885,
  'lon': 13.1763456,
  'tags': {'addr:city': 'Berlin',
   'addr:housenumber': '260',
   'addr:postcode': '14109',
   'addr:street': 'Kronprinzessinnenweg',
   'amenity': 'biergarten',
   'contact:email': 'info@loretta-berlin.de',
   'contact:fax': '+49 30 80105334',
   'contact:phone': '+49 30 80105333',
   'contact:website': 'https://www.loretta-wannsee.de/biergarten/',
   'name': 'Loretta',
   'opening_hours': '12:00+',
   'wheelchair': 'limited'},
  'type': 'node'},
 {'id': 27352197,
  'lat': 50.3638373,
  'lon': 7.5769021,
  'tags': {'amenity': 'biergarten', 'wheelchair': 'limited'},
  'type': 'node'},
 {'id': 27787909,
  'lat': 52.9822191,
  'lon': 8.845254,
  'tags': {'amenity': 'biergarten',
   'name': 'Schwarzbiergarten',
   'wheelchair': 'limited'},
  'type': 'node'},
 {'id': 29812167,
  'lat': 49.4814353,
  'lon': 10.993033,
  'tags': {'amenity': 'biergarten',
   'toilets:wheelchair': 'no',
   'wheelchair': 'limited'},
  'type': 'node'},
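
Because the query ends with `out center;`, nodes carry `lat`/`lon` directly while ways and relations carry them under a `center` key. A small helper can flatten both cases, as sketched below. The helper name `element_location` is hypothetical; the first two elements are abbreviated from the output above, and the third is a made-up way included only to show the `center` case.

```python
from collections import Counter


def element_location(element):
    """Return (lat, lon) for a node, or the center of a way/relation, else None."""
    if 'lat' in element and 'lon' in element:
        return element['lat'], element['lon']
    center = element.get('center')
    if center:
        return center['lat'], center['lon']
    return None


elements = [
    {'id': 27318009, 'type': 'node', 'lat': 52.4200885, 'lon': 13.1763456,
     'tags': {'amenity': 'biergarten', 'name': 'Loretta'}},
    {'id': 27352197, 'type': 'node', 'lat': 50.3638373, 'lon': 7.5769021,
     'tags': {'amenity': 'biergarten', 'wheelchair': 'limited'}},
    # hypothetical way, made up to illustrate the `center` field
    {'id': 1, 'type': 'way', 'center': {'lat': 48.0, 'lon': 11.0},
     'tags': {'amenity': 'biergarten', 'name': 'Example Biergarten'}},
]

# how many elements of each OpenStreetMap type came back
type_counts = Counter(e['type'] for e in elements)

# (name, location) pairs; name is None when the element has no name tag
rows = [(e.get('tags', {}).get('name'), element_location(e)) for e in elements]
```

In the notebook, `elements` would be `data['elements']`, and `rows` could be passed to `pd.DataFrame` for further analysis.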

# Sources

A collection of sources to include in the final paper:

* [OpenStreetMap Article](https://towardsdatascience.com/loading-data-from-openstreetmap-with-python-and-the-overpass-api-513882a27fd0)