# CSC 481 Final Presentation

Ayan Patel, Robert Hensley

* [Gowalla Dataset](https://snap.stanford.edu/data/loc-gowalla.html)



---



In [1]:
# Graph Libraries
import igraph
from igraph import *

# Map Libraries
import geopy
from geopy.geocoders import Nominatim

# Other Libraries
import pandas as pd
import itertools
import json
import time
from datetime import datetime
from collections import Counter

## Loading Data

* the cells below clean the text data given
* this can take 3 to 5 minutes to fully load all dataframes

In [2]:
data_path = 'data/'

# creating dataframes
edges = pd.read_csv(data_path + 'Gowalla_edges.txt', sep='\t', names=['u1', 'u2'])
totalCheckins = pd.read_csv(data_path + 'Gowalla_totalCheckins.txt', sep='\t',
                            names=['user', 'time', 'lat', 'long', 'locid'])

totalCheckins['time'] = totalCheckins['time'].map(lambda x: datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ"))

In [3]:
totalCheckins

Unnamed: 0,user,time,lat,long,locid
0,0,2010-10-19 23:55:27,30.235909,-97.795140,22847
1,0,2010-10-18 22:17:43,30.269103,-97.749395,420315
2,0,2010-10-17 23:42:03,30.255731,-97.763386,316637
3,0,2010-10-17 19:26:05,30.263418,-97.757597,16516
4,0,2010-10-16 18:50:42,30.274292,-97.740523,5535878
...,...,...,...,...,...
6442887,196578,2010-06-11 13:32:26,51.742988,-0.488065,906885
6442888,196578,2010-06-11 13:26:45,51.746492,-0.490780,965121
6442889,196578,2010-06-11 13:26:34,51.741916,-0.496729,1174322
6442890,196585,2010-10-08 21:01:49,50.105516,8.571525,471724


In [4]:
edges

Unnamed: 0,u1,u2
0,0,1
1,0,2
2,0,3
3,0,4
4,0,5
...,...,...
1900649,196586,196539
1900650,196587,196540
1900651,196588,196540
1900652,196589,196547


### User Notes

* we can get an idea of where each user is generally located by taking the mean of their check-in latitudes and longitudes
* other user attributes to collect:
  * earliest time checked in
  * latest time checked in 
    * this shows the general timespan the user has been on the app 
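A minimal sketch of the per-user attributes listed above (mean location, earliest and latest check-in), using only the standard library and a few made-up records shaped like the check-in file. The function name `summarize_users` and the sample rows are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict
from datetime import datetime

# Toy check-in records shaped like the dataset: (user, time, lat, long).
checkins = [
    (0, "2010-10-19T23:55:27Z", 30.0, -97.0),
    (0, "2010-10-17T19:26:05Z", 32.0, -99.0),
    (1, "2010-06-11T13:32:26Z", 51.5, -0.5),
]

def summarize_users(records):
    """Return {user: {'lat', 'long', 'first', 'last'}} from raw check-in tuples."""
    grouped = defaultdict(list)
    for user, stamp, lat, lon in records:
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        grouped[user].append((when, lat, lon))

    summary = {}
    for user, rows in grouped.items():
        times = [r[0] for r in rows]
        summary[user] = {
            "lat": sum(r[1] for r in rows) / len(rows),   # mean latitude
            "long": sum(r[2] for r in rows) / len(rows),  # mean longitude
            "first": min(times),                          # earliest check-in
            "last": max(times),                           # latest check-in
        }
    return summary

user_summary = summarize_users(checkins)
```

The difference `last - first` then gives the rough time span a user has been active on the app.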

In [5]:
totalCheckins['user'].max()

196585

In [6]:
# average only the coordinate columns; means of time and locid are not meaningful
users = totalCheckins.groupby('user')[['lat', 'long']].mean()
users

user,lat,long
0,33.558308,-97.894601
1,47.204338,4.499703
2,35.659617,-120.016716
4,36.800202,-124.714027
5,32.290069,-96.123008
...,...,...
196544,-25.433409,-49.281533
196561,37.528650,-122.004623
196577,51.514905,-0.081277
196578,51.744557,-0.478051


In [7]:
# average number of check-ins per user
totalCheckins.shape[0] / users.shape[0]

60.16221566503567

### Place Notes

* same grouping process as users
* for the scope of this project, we might want to focus only on places with 5 or more visits
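A small sketch of the visit threshold mentioned above, counting check-ins per location and keeping only places with at least 5 visits. The sample IDs and the name `MIN_VISITS` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical location IDs, one entry per check-in.
checkin_locids = [22847, 22847, 22847, 22847, 22847, 420315, 420315, 16516]

MIN_VISITS = 5  # threshold suggested in the notes above

visit_counts = Counter(checkin_locids)
# keep only the locations that reach the threshold
popular_places = {loc: n for loc, n in visit_counts.items() if n >= MIN_VISITS}
```

On the `places` dataframe built below, the same filter would presumably be `places[places['visits'] >= 5]`.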

In [8]:
# named aggregation: mean coordinates plus the number of check-ins per location
places = totalCheckins.groupby('locid').agg(
    lat_mean=('lat', 'mean'),
    long_mean=('long', 'mean'),
    visits=('long', 'count'),
)

In [9]:
places.sort_values('visits')

locid,lat_mean,long_mean,visits
5977757,13.447479,101.019928,1
951900,50.718459,-1.877455,1
951901,41.543670,-83.582070,1
3733285,50.084312,18.212821,1
3733273,33.621132,-112.185371,1
...,...,...,...
58725,59.650051,17.932262,3476
10259,32.897462,-97.040348,4083
9410,30.201557,-97.667127,4713
19542,37.616356,-122.386150,5662


#### Geolocation Data (geopy)

* our queries use the [Nominatim API](https://nominatim.org/release-docs/develop/api/Overview/) by default

Let's try to find the country all this data is from. We will query location information using [geopy](https://geopy.readthedocs.io/en/stable/). Below is a sample query.

This gives the following important data:
* name 'address > tourism'
* city 'address > city'
* country 'address > country'

However this data lacks:
* context of the location (is it a pub, park, etc.)
    * we can get this information by querying OpenStreetMap using its place_id

Given the limited time for this project, I feel like city and country information should be informative enough.

**Naming Conventions**

* subregions vary between countries
    * e.g. states and counties are not found in Germany, while boroughs are found in Germany but not the US
    * this means the easiest hierarchy to analyze right now is places < cities < countries
    * this problem can be resolved by focusing on one country at a time
* the same problem unfortunately also occurs for city names
    * cities can also be called "towns", "villages", "municipalities", etc.
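One way to cope with the varying city labels described above is to try several address keys in order. This is a sketch under assumptions: the key list `CITY_KEYS` and the helper `extract_city` are hypothetical, and the second sample address is made up.

```python
# Address keys to try, in priority order (an assumed list, extend as needed).
CITY_KEYS = ("city", "town", "village", "municipality")

def extract_city(address):
    """Return the first city-like value found in a Nominatim-style address dict."""
    for key in CITY_KEYS:
        if address.get(key):
            return address[key]
    return ""  # no city-like key present

us_address = {"city": "Kansas City", "state": "Kansas", "country": "United States"}
small_place = {"village": "Exampleville", "country": "Germany"}  # made-up entry
```

Returning an empty string for missing cities keeps the later "no city" check simple.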

In [10]:
from geopy.geocoders import Nominatim

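# Nominatim's usage policy requires an identifying user agent;
# "csc481-gowalla" is a placeholder name, substitute your own
geolocator = Nominatim(user_agent="csc481-gowalla")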

location = geolocator.reverse("39.052318, -94.607499")



In [11]:
location.raw

{'place_id': 58063045,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'node',
 'osm_id': 4987894319,
 'lat': '39.0522322',
 'lon': '-94.6074116',
 'display_name': 'State Line at 42nd, State Line Road, Volker, Country Club Plaza, Kansas City, Wyandotte County, Kansas, 66160, United States',
 'address': {'highway': 'State Line at 42nd',
  'road': 'State Line Road',
  'neighbourhood': 'Volker',
  'suburb': 'Country Club Plaza',
  'city': 'Kansas City',
  'county': 'Wyandotte County',
  'state': 'Kansas',
  'postcode': '66160',
  'country': 'United States',
  'country_code': 'us'},
 'boundingbox': ['39.0521822', '39.0522822', '-94.6074616', '-94.6073616']}

In [12]:
location.raw['address']['country']

'United States'

**Goal: Querying the Data**

* now that we have a way of querying information, we would like to extract information for our *places* dataset using the latitude and longitude coordinates
* the two code cells below loop through the places in descending order of visits
* the loop retrieves the full JSON object for each location from OpenStreetMap and writes the results to files in chunks of 1000 places
    * we have chosen to limit the number of places analyzed to the top 10,000 because querying places through a free API takes a long time
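The chunked write described above can be sketched without touching the network. Here `fake_reverse` is a made-up stand-in for the real reverse-geocoding call, and `query_in_chunks` is a hypothetical helper. Unlike the notebook cell, this version also flushes a final partial chunk.

```python
import json
import os
import tempfile

def fake_reverse(lat, lon):
    """Stand-in for geolocator.reverse(...).raw; returns a made-up payload."""
    return {"lat": str(lat), "lon": str(lon), "address": {"country": "Nowhere"}}

def query_in_chunks(rows, out_dir, chunk=1000, lookup=fake_reverse):
    """Look up each (locid, lat, lon) row and write one JSON file per chunk."""
    written = []
    batch = []
    for position, (locid, lat, lon) in enumerate(rows, start=1):
        batch.append({"locid": locid, "osm_json": lookup(lat, lon)})
        # flush on a full chunk, and once more for a final partial chunk
        if position % chunk == 0 or position == len(rows):
            path = os.path.join(out_dir, f"{position}.json")
            with open(path, "w") as outfile:
                json.dump({"places": batch}, outfile)
            written.append(path)
            batch = []
    return written

rows = [(i, 30.0 + i, -97.0) for i in range(5)]
out_dir = tempfile.mkdtemp()
files = query_in_chunks(rows, out_dir, chunk=2)
```

Writing in chunks means an interrupted run only loses the current batch, and the loop can be restarted from the last file written.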

In [13]:
sorted_places = places.sort_values('visits', ascending=False).reset_index()
sorted_places[:10000]

Unnamed: 0,locid,lat_mean,long_mean,visits
0,55033,59.330158,18.058079,5811
1,19542,37.616356,-122.386150,5662
2,9410,30.201557,-97.667127,4713
3,10259,32.897462,-97.040348,4083
4,58725,59.650051,17.932262,3476
...,...,...,...,...
9995,230547,36.005830,-115.084607,56
9996,45183,35.655814,-97.470548,56
9997,164356,48.101030,11.645453,56
9998,234171,63.825514,20.261439,56


In [None]:
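# placeholder user agent (required by Nominatim's usage policy); substitute your own
geolocator = Nominatim(user_agent="csc481-gowalla")

location_json = {}
location_json['places'] = []

# choose place to start writing files
i = 1000
chunk = 1000
# stop at index 10000 so the final chunk (10000.json) is written and the loop
# does not run over every remaining place
places_temp = sorted_places[i:10001]

for index, row in places_temp.iterrows():

    location = geolocator.reverse(str(row["lat_mean"]) + ", " + str(row["long_mean"])).raw
    time.sleep(1)  # stay within the public API's limit of one request per second

    location_json['places'].append({'locid' : row['locid'], 'osm_json' : location})

    if index % chunk == 0 and index != i:

        with open('data/json/' + str(index) + '.json', 'w') as outfile:
            json.dump(location_json, outfile)

        print(str(index) + ' index JSON written')

        location_json['places'] = []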

**Cleaning the Data**

* now that we have the data, we would like to append the information to our original dataframe
* new information includes:
    * the city of the location
    * the country of the location (including country code)
    * the name of the location
    
* the code cells below show how the JSON data collected from the OSM API is added to the places dataframe (for the top 10,000 locations)

In [None]:
import json

data_path = 'data/json/'
start = 2000

location_append = []

while start <= 10000:

    with open(data_path + str(start) + '.json', 'r') as f:
        batch = json.load(f)

    # iterate over batch['places'] directly so the `places` dataframe is not overwritten
    for location in batch['places']:

        address = location['osm_json'].get('address', {})

        # cities: reset for every location, falling back to town, then village
        city = address.get('city') or address.get('town') or address.get('village') or ""

        # log (but keep) entities with no city information
        if city == "":
            print("no city")
            print(location)

        # countries
        country = address.get('country', "")
        country_code = address.get('country_code', "")

        if country == "":
            print("no country")
            print(location)

        location_append.append({
            'locid_check' : location['locid'],
            'osm_place_id' : location['osm_json']['place_id'],
            'display_name' : location['osm_json']['display_name'],
            'city' : city,
            'country' : country,
            'country_code' : country_code
        })

    start += 1000

In [None]:
append_df = pd.DataFrame(location_append)[:-1]

append_df

In [None]:
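# places_top10 is not defined in the cells shown here; it is assumed to be the
# top-10,000 slice of sorted_places, with an index that lines up with append_df
places_top10_desc = places_top10.join(append_df)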

places_top10_desc

In [None]:
places_top10_desc.to_csv('data/places_top10_desc.csv')

In [None]:
# top cities
places_top10_desc['city'].value_counts()

In [None]:
# top countries
places_top10_desc['country'].value_counts()

**Connecting Top 10K Places with Users**

* we would like to see which users visited the top 10,000 locations (the `top10` prefix in variable names refers to this top 10K set)
* this can be done with a simple filtering query
* finally, we combine these check-ins with the queried location information using a simple join on `locid`


We will use this data later when looking at centrality of cities and countries in the graph representations section of our report.

* I will also be creating unique city IDs and country IDs for the graph representations of cities and countries (these will be their vertex IDs)
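The ID assignment mentioned above can be sketched with plain Python. The city list here is made up; the `dict.fromkeys` idiom is the same one the notebook cell below relies on.

```python
# One entry per check-in (made-up sample).
cities = ["Austin", "Stockholm", "Austin", "San Francisco", "Stockholm"]

# dict.fromkeys drops duplicates while keeping first-seen order
unique_cities = list(dict.fromkeys(cities))
city_ids = {city: idx for idx, city in enumerate(unique_cities)}

# translate every check-in row to its vertex ID
city_id_column = [city_ids[c] for c in cities]
```

Because the IDs are consecutive integers starting at 0, they can be used directly as igraph vertex indices.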

In [None]:
top10_checkins = totalCheckins[totalCheckins.locid.isin(places_top10['locid'])]
top10_checkins

In [None]:
top10_checkins_desc = top10_checkins.merge(places_top10_desc, on='locid')

unique_cities = list(dict.fromkeys(list(top10_checkins_desc['city'])))
unique_countries = list(dict.fromkeys(list(top10_checkins_desc['country'])))

top10_checkins_desc = top10_checkins_desc.merge(pd.DataFrame({'city' : unique_cities, 'city_id' : range(len(unique_cities))}), on='city')
top10_checkins_desc = top10_checkins_desc.merge(pd.DataFrame({'country' : unique_countries, 'country_id' : range(len(unique_countries))}), on='country')

top10_checkins_desc[['user', 'locid', 'city', 'city_id', 'country', 'country_id']]

## Graph Representations

### Limitations

* the max user ID plus one will be the number of vertices required in the graph (assumption: user IDs start at 0)
* using the maximum number of nodes in the graph (196591) takes a long time to generate the graph
  * this, I would assume, makes the algorithms for these graphs slow to run
  * I could also be adding the edges in an inefficient way (explore ways of generating graphs)
    * it might be better to add edges in batches instead of one edge at a time (explore this)

#### Ways to set limits
* we can limit the graph to the first n users (for example, only nodes 0 - 1000)
  * however, under the assumption that user IDs are assigned in the order users join the network, the resulting graph will have many disconnected pieces because the users are in geographically separate locations
* a better solution would be to take a group of n people in close proximity to one another
  * we could find the average latitude and longitude of each user and record this as the area they are generally in
  * we'll then randomly select a user and select the n people closest to this user (ranking by the distance between their average locations)
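A sketch of the proximity-based selection described above, using the haversine great-circle distance between users' mean locations. The helper names are hypothetical, the seed user is fixed rather than random so the example is repeatable, and the four coordinates are rounded from the `users` table earlier in the notebook.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

def nearest_users(mean_locations, seed, n):
    """Return the n user IDs whose mean location is closest to the seed user's."""
    origin = mean_locations[seed]
    others = [u for u in mean_locations if u != seed]
    return sorted(others, key=lambda u: haversine_km(origin, mean_locations[u]))[:n]

# mean locations for users 0, 1, 2 and 5, rounded from the users table
mean_locations = {
    0: (33.56, -97.89),
    1: (47.20, 4.50),
    2: (35.66, -120.02),
    5: (32.29, -96.12),
}
closest = nearest_users(mean_locations, seed=0, n=2)
```

Sorting every other user by distance is fine for a sketch; for all ~100K users a spatial index would likely be a better fit.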


In [None]:
edgesG = Graph()
# despite the name, this is the vertex count: IDs start at 0, so we need max ID + 1
total_edges = max(edges['u1'].max(), edges['u2'].max()) + 1
total_edges

In [None]:
edgesG.add_vertices(total_edges)

In [None]:
# this takes too long; there's got to be a better way!

for i in range(len(edges)):
  edgesG.add_edge(edges.loc[i, "u1"], edges.loc[i, "u2"])

### Creating graph of user friendship

In [None]:
#Faster way to make graph from df tuples
eG = igraph.Graph.TupleList(edges.itertuples(index=False), directed=False, weights=False)
eG.summary()

Finding the degree of each user and identifying the user with the maximum degree (the graph is undirected, so indegree and degree coincide)

In [None]:
#Finds degrees of each vertex and the max degree and its vertex - popularity
eG_degrees = eG.indegree()
print(max(eG_degrees))
eG_max_degree_index = eG_degrees.index(max(eG_degrees))
eG.vs[eG_max_degree_index]

Finding all the neighbors of the user with maximum degree, in other words all of the popular user's friends

In [None]:
eG.neighbors(eG_max_degree_index)[:10]

### Creating graph of users that visited the top 10k places

In [None]:
top10_eG = igraph.Graph.TupleList(top10_edges.itertuples(index=False), directed=False, weights=False)
top10_eG.summary()

Identifying the user with the maximum degree (the most popular user)

In [None]:
top10_eG_degrees = top10_eG.indegree()
print(max(top10_eG_degrees))
top10_eG_max_degree_index = top10_eG_degrees.index(max(top10_eG_degrees))
top10_eG.vs[top10_eG_max_degree_index]

Using PageRank to find the most popular user (note that it matches the method above)

In [None]:
top10_eG_PR = top10_eG.pagerank()
print(max(top10_eG_PR))
print(top10_eG_PR.index(max(top10_eG_PR)))
top10_eG_PR[:10]
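For intuition on why PageRank and degree tend to pick the same user in an undirected friendship graph, here is a plain power-iteration sketch on a toy graph. It is not igraph's implementation; the function, the damping value of 0.85, and the toy adjacency are assumptions, and the sketch requires every node to have at least one neighbor.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank for an undirected graph given as {node: [neighbors]}.

    Assumes every node has at least one neighbor (no isolated vertices).
    """
    nodes = list(adjacency)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # every node starts each round with the teleportation share
        new_rank = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for v in nodes:
            # a node splits its damped rank evenly among its neighbors
            share = damping * rank[v] / len(adjacency[v])
            for w in adjacency[v]:
                new_rank[w] += share
        rank = new_rank
    return rank

# toy friendship graph: user 0 is friends with everyone, 1 and 2 are also friends
friends = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
scores = pagerank(friends)
top_by_pagerank = max(scores, key=scores.get)
top_by_degree = max(friends, key=lambda v: len(friends[v]))
```

In an undirected graph a random walk visits nodes in proportion to their degree, so the two rankings usually agree at the top; damping can shift the exact scores.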

Finding all connected components, or clusters, of the graph, and the largest cluster

In [None]:
top10_eG_clusters = top10_eG.components()
print("Num of Clusters:", len(top10_eG_clusters))
top10_eG_giant_cluster = top10_eG_clusters.giant()
print("Num of Vertices in Largest cluster:", top10_eG_giant_cluster.vcount())
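What the component search above computes can be illustrated with a breadth-first search over a toy edge list. This is an illustrative sketch, not igraph's code; `connected_components` and `toy_edges` are made-up names.

```python
from collections import deque

def connected_components(edge_list):
    """Return a list of components (lists of nodes) for an undirected edge list."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # breadth-first search collects everything reachable from `start`
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

toy_edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 8)]
clusters = connected_components(toy_edges)
giant = max(clusters, key=len)  # the largest cluster
```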

In [None]:
top10_eG.degree_distribution()

In [None]:
top10_eG.omega() #size of largest clique

In [None]:
top5_eG = igraph.Graph.TupleList(top5_edges.itertuples(index=False), directed=False, weights=False)
top5_eG.summary()

In [None]:
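# map igraph's internal vertex indices back to user IDs; keep both orientations so tuple lookups match
names = top5_eG.vs['name']
top5_eG_edges = {(names[a], names[b]) for s, t in top5_eG.get_edgelist() for a, b in ((s, t), (t, s))}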

### Creating a graph between users and locations they have visited

In [None]:
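# copy so the assignment below modifies a new frame, not a view of totalCheckins
user_locids = totalCheckins[['user', 'locid']].copy()
# the 'L' prefix keeps location vertex names distinct from user IDs
user_locids['locid'] = user_locids['locid'].apply(lambda x: 'L' + str(x))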
user_locids

In [None]:
user_loc_G = igraph.Graph.TupleList(user_locids.itertuples(index=False), directed=False, weights=False)
user_loc_G.summary()

### Creating df of all users that visited top10 locations

In [None]:
user_locid = totalCheckins[['user', 'locid']]
top10_user_locid = user_locid[user_locid.locid.isin(places_top10['locid'])]

In [None]:
top10_user_locid_groups = top10_user_locid.groupby('locid')['user'].apply(list).reset_index(name='users')
top10_user_locid_groups

In [None]:
# takes time to run, but gets all pairs of users in each list so that we can make edges between users who visited the same location
top10_user_locid_combo = top10_user_locid_groups['users'].apply(lambda x: list(itertools.combinations(x, 2))).reset_index(name='user_combos')
top10_user_locid_combo

### Graphs of Cities and Countries of Top 10K Locations

For this portion of our graph analysis, we will be creating graphs of the cities and countries of the top 10,000 locations.

* we will connect locations based on the number of users who have visited both places
* for example, let's observe user 1003
* they have visited Stockholm, Bloomington and Minneapolis
    * this means that this user contributes a connection between each pair of these cities

In [None]:
top10_checkins_desc[top10_checkins_desc['user'] == 1003][['user', 'locid', 'city', 'city_id', 'country', 'country_id']]

**Goal**

* we will create a weighted graph of connected cities, where each user who visited both cities adds one point of weight to the edge between them
* we will do the same weighted graph for countries
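A sketch of the weighting idea above, assuming each user contributes at most one point per city pair. The user-to-city mapping is made up. Note that the cells below add one parallel edge per user instead of storing an explicit weight; this sketch collapses those into counts.

```python
from collections import Counter
from itertools import combinations

# hypothetical {user: city_ids visited}; sets collapse repeat visits
cities_by_user = {
    1003: {0, 1, 2},
    1004: {0, 1},
    1005: {2},
}

edge_weights = Counter()
for visited in cities_by_user.values():
    # sort so (0, 1) and (1, 0) count as the same undirected edge
    for pair in combinations(sorted(visited), 2):
        edge_weights[pair] += 1
```

The keys of `edge_weights` could serve as an edge list and the values as a `weight` edge attribute when building the graph.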

**Cities**

In [None]:
cities_visited = top10_checkins_desc.groupby('user')['city_id'].apply(list).apply(dict.fromkeys).apply(list)
cities_visited

In [None]:
from collections import Counter

visits = []

for user in cities_visited:
    visits.extend(list(itertools.combinations(user, 2)))

In [None]:
visits[-10:]

In [None]:
top10_cities_g = Graph()

# this is the number of vertices in the system
# add 1 because vertices start at zero
v = top10_checkins_desc['city_id'].max()
top10_cities_g.add_vertices(v + 1)
top10_cities_g.add_edges(visits)

In [None]:
summary(top10_cities_g)

In [None]:
top10_cities_g

In [None]:
top10_cities_g_PR = top10_cities_g.pagerank()
print(max(top10_cities_g_PR))
print(top10_cities_g_PR.index(max(top10_cities_g_PR)))

In [None]:
list(top10_checkins_desc[top10_checkins_desc['city_id'] == 9]['city'])[0]

In [None]:
top10_checkins_desc.city.value_counts()

**Countries**

In [None]:
countries_visited = top10_checkins_desc.groupby('user')['country_id'].apply(list).apply(dict.fromkeys).apply(list)
countries_visited

In [None]:
country_visits = []

for user in countries_visited:
    country_visits.extend(list(itertools.combinations(user, 2)))

In [None]:
country_visits[:10]

In [None]:
top10_countries_g = Graph()

# this is the number of vertices in the system
# add 1 because vertices start at zero
v = top10_checkins_desc['country_id'].max()
top10_countries_g.add_vertices(v + 1)
top10_countries_g.add_edges(country_visits)

In [None]:
summary(top10_countries_g)

In [None]:
top10_countries_g_PR = top10_countries_g.pagerank()
print(max(top10_countries_g_PR))
print(top10_countries_g_PR.index(max(top10_countries_g_PR)))

In [None]:
list(top10_checkins_desc[top10_checkins_desc['country_id'] == 5]['country'])[0]

In [None]:
top10_checkins_desc.country.value_counts()

### Grouping checkins by time and location

In [None]:
time_loc = totalCheckins[['time', 'user', 'locid']]
time_loc = time_loc[time_loc.locid.isin(places_top10['locid'])].copy()  # copy to avoid assigning into a view
time_loc['time'] = time_loc['time'].dt.floor('30min') # rounding timestamp down to the start of its 30-minute window
time_loc.groupby(['locid', 'time']).agg(['count'])

The following shows a list of users that were at a location at the same time

In [None]:
time_loc5 = totalCheckins[['time', 'user', 'locid']]
time_loc5 = time_loc5[time_loc5.locid.isin(places_top5['locid'])].copy()  # copy to avoid assigning into a view
time_loc5['time'] = time_loc5['time'].dt.floor('60min') # rounding timestamp down to the start of the hour
time_loc5_groups = time_loc5.groupby(['locid', 'time'])['user'].apply(list).reset_index(name='users')
time_loc5_groups = time_loc5_groups[time_loc5_groups['users'].apply(lambda x: len(x) > 2)]  # keep groups of 3 or more users
time_loc5_groups

In [None]:
time_loc5_groups_combo = time_loc5_groups['users'].apply(lambda x: list(itertools.combinations(x, 2))).reset_index(name='user_combos')
time_loc5_groups_combo

In [None]:
time_loc5_groups_combo[:10]['user_combos'].apply(lambda x: sum(el in top5_eG_edges for el in x)).reset_index(name='friends')
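The co-presence check in the cells above can be sketched end to end on made-up data: floor each timestamp to the hour, group users by location and hour, then count how many co-present pairs are friends. Storing friendships as `frozenset` pairs is an assumption of this sketch that makes the lookup independent of pair order.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations

def floor_to_hour(ts):
    """Round a timestamp down to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# made-up check-ins: (user, locid, timestamp)
checkins = [
    (1, 9410, datetime(2010, 10, 19, 23, 5)),
    (2, 9410, datetime(2010, 10, 19, 23, 40)),
    (3, 9410, datetime(2010, 10, 19, 23, 59)),
    (4, 9410, datetime(2010, 10, 20, 0, 10)),
]
friendships = {frozenset((1, 2)), frozenset((3, 4))}  # undirected, order-free

# users present at each (location, hour) slot
present = defaultdict(set)
for user, locid, ts in checkins:
    present[(locid, floor_to_hour(ts))].add(user)

# number of friend pairs among the users sharing each slot
friends_together = {
    key: sum(frozenset(pair) in friendships for pair in combinations(sorted(users), 2))
    for key, users in present.items()
}
```

Users 3 and 4 are friends but fall into different hourly slots, so they are not counted; a fixed-window floor misses pairs that straddle a boundary.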

Finding the most popular time for people to visit a particular location

In [None]:
time_loc['time'] = time_loc['time'].dt.time
time_loc.groupby(['locid', 'time']).agg(['count'])

The following shows the most common time for people to visit each location, as well as which user visited that location the most

In [None]:
time_loc.groupby(['locid']).agg(lambda x:x.value_counts().index[0])
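The "most common value per location" aggregation above can be sketched with `Counter.most_common`. The visit list is made up, and ties are broken by first appearance in this sketch.

```python
from collections import Counter, defaultdict
from datetime import time

# made-up (locid, time slot) pairs, one per check-in
visits = [
    (9410, time(12, 0)),
    (9410, time(12, 0)),
    (9410, time(18, 30)),
    (19542, time(8, 0)),
]

# count how often each time slot occurs at each location
by_place = defaultdict(Counter)
for locid, slot in visits:
    by_place[locid][slot] += 1

# most frequent slot per location
busiest = {locid: counts.most_common(1)[0][0] for locid, counts in by_place.items()}
```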