# CSC 481 Final Presentation

Ayan Patel, Robert Hensley

* [Gowalla Dataset](https://snap.stanford.edu/data/loc-gowalla.html)



---



In [6]:
# Graph Libraries
import igraph
from igraph import *

# Map Libraries
import geopy
from geopy.geocoders import Nominatim

# Other Libraries
import pandas as pd
import itertools
import json
import time
from datetime import datetime
from collections import Counter

## Loading Data

* the cells below clean the text data given
* this can takes 3 - 5 minutes to fully load all dataframes

In [5]:
data_path = 'data/'

# creating dataframes
edges = pd.read_csv(data_path + 'Gowalla_edges.txt', sep='\t', names=['u1', 'u2'])
totalCheckins = pd.read_csv(data_path + 'Gowalla_totalCheckins.txt', sep='\t',
                            names=['user', 'time', 'lat', 'long', 'locid'])

totalCheckins['time'] = totalCheckins['time'].map(lambda x: datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ"))

In [7]:
totalCheckins

Unnamed: 0,user,time,lat,long,locid
0,0,2010-10-19 23:55:27,30.235909,-97.795140,22847
1,0,2010-10-18 22:17:43,30.269103,-97.749395,420315
2,0,2010-10-17 23:42:03,30.255731,-97.763386,316637
3,0,2010-10-17 19:26:05,30.263418,-97.757597,16516
4,0,2010-10-16 18:50:42,30.274292,-97.740523,5535878
...,...,...,...,...,...
6442887,196578,2010-06-11 13:32:26,51.742988,-0.488065,906885
6442888,196578,2010-06-11 13:26:45,51.746492,-0.490780,965121
6442889,196578,2010-06-11 13:26:34,51.741916,-0.496729,1174322
6442890,196585,2010-10-08 21:01:49,50.105516,8.571525,471724


In [5]:
edges

Unnamed: 0,u1,u2
0,0,1
1,0,2
2,0,3
3,0,4
4,0,5
...,...,...
1900649,196586,196539
1900650,196587,196540
1900651,196588,196540
1900652,196589,196547


### User Notes

* we can get an idea of where the users are generally located by gather the mean of their latitude on longitudinal locations
* other user attributes to collect:
  * earliest time checked in
  * latest time checked in 
    * this shows the general timespan the user has been on the app 

In [9]:
totalCheckins['user'].max()

196585

In [8]:
users = totalCheckins.groupby('user').mean()[['lat', 'long']]
users

Unnamed: 0_level_0,lat,long
user,Unnamed: 1_level_1,Unnamed: 2_level_1
0,33.558308,-97.894601
1,47.204338,4.499703
2,35.659617,-120.016716
4,36.800202,-124.714027
5,32.290069,-96.123008
...,...,...
196544,-25.433409,-49.281533
196561,37.528650,-122.004623
196577,51.514905,-0.081277
196578,51.744557,-0.478051


In [11]:
# average amounts of visits per user
totalCheckins.shape[0] / users.shape[0]

60.16221566503567

### Place Notes

* same grouping process as users
* for the scope of this project, we might want to only focus on places with like 5 or more visits

In [9]:
places = totalCheckins.groupby('locid').agg(['mean', 'count'])[['lat', 'long']]

places.columns = ['lat_mean', 'count_drop', 'long_mean', 'visits']
places = places.drop(columns = ['count_drop'])

In [5]:
places.sort_values('visits')

Unnamed: 0_level_0,lat_mean,long_mean,visits
locid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5977757,13.447479,101.019928,1
951900,50.718459,-1.877455,1
951901,41.543670,-83.582070,1
3733285,50.084312,18.212821,1
3733273,33.621132,-112.185371,1
...,...,...,...
58725,59.650051,17.932262,3476
10259,32.897462,-97.040348,4083
9410,30.201557,-97.667127,4713
19542,37.616356,-122.386150,5662


#### Geolocation Data (geopy)

* our queries use the [Nominatim API](https://nominatim.org/release-docs/develop/api/Overview/) by default

Let's try to find the country all this data is from. We will query location information using [geopy](https://geopy.readthedocs.io/en/stable/). Below is a sample query.

This gives the following important data:
* name 'address > tourism'
* city 'address > city'
* country 'address > country'

However this data lacks:
* context of the location (is it a pub, park, etc.)
    * we can get this information by querying OpenStreetMap using it's place_id, 

Given the limited time for this project, I feel like city and country information should be informative enough.

**Naming Conventions**

* subregions vary between different countries 
    * i.e. states and counties are not found in Germany, boroughs are found in Germany but not the US
    * this means the easiest thing to analyze right now are places < cities < countries
    * this problem can be resolved by focusing on one Country at a time
* the above problem unfortunately also occurs for city names
    * citities can also be called "towns, villages, municipalities, etc.)

In [14]:
from geopy.geocoders import Nominatim

geolocator = Nominatim()

location = geolocator.reverse("39.052318, -94.607499")

  This is separate from the ipykernel package so we can avoid doing imports until


In [15]:
location.raw

{'place_id': 251278814,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'way',
 'osm_id': 552493654,
 'lat': '39.052405',
 'lon': '-94.607436',
 'display_name': '4154, State Line Road, Country Club Plaza, Kansas City, Wyandotte County, Missouri, 66103, United States of America',
 'address': {'house_number': '4154',
  'road': 'State Line Road',
  'suburb': 'Country Club Plaza',
  'city': 'Kansas City',
  'county': 'Wyandotte County',
  'state': 'Missouri',
  'postcode': '66103',
  'country': 'United States of America',
  'country_code': 'us'},
 'boundingbox': ['39.052355', '39.052455', '-94.607486', '-94.607386']}

In [16]:
location.raw['address']['country']

'United States of America'

**Goal: Querying the Data**

* now that we have a way of querying information, we would like to extract information from our *places* dataset using the latitude and longitude coordinates
* the two code cells below show us looping through the data in order of how many visits a certain location receives
* this extracts the full JSON object of the location from OpenStreetMaps and stores it in files in chunks of 1000 places
    * we have chosen to limit the number of places analyzed to the top 10,000 because it takes a long time to query the places using a free API

In [12]:
sorted_places = places.sort_values('visits', ascending=False).reset_index()
sorted_places[:10000]

Unnamed: 0,locid,lat_mean,long_mean,visits
0,55033,59.330158,18.058079,5811
1,19542,37.616356,-122.386150,5662
2,9410,30.201557,-97.667127,4713
3,10259,32.897462,-97.040348,4083
4,58725,59.650051,17.932262,3476
...,...,...,...,...
9995,230547,36.005830,-115.084607,56
9996,45183,35.655814,-97.470548,56
9997,164356,48.101030,11.645453,56
9998,234171,63.825514,20.261439,56


In [None]:
geolocator = Nominatim()

location_json = {}
location_json['places'] = []

# choose place to start writing files
i = 1000
chunk = 1000
places_temp = sorted_places[i:]

for index, row in places_temp.iterrows():
    
    location = geolocator.reverse(str(row["lat_mean"]) + ", " + str(row["long_mean"])).raw
    
    location_json['places'].append({'locid' : row['locid'], 'osm_json' : location})
    
    if index % chunk == 0 and index != i:
        
        with open('data/json/' + str(index) + '.json', 'w') as outfile:
            json.dump(location_json, outfile)
        
        print(str(index) + ' index JSON written')
        
        location_json['places'] = []

**Cleaning the Data**

* now that we have the data, we would like append the information to our original data frame
* new information includes:
    * the city of the location
    * the country of the location (including country code)
    * the name of the location
    
* the code cells below show how the JSON data collected from the OSM api is added to the places dataframe (for the top 10000 locations)

In [14]:
import json

data_path = 'data/json/'
start = 2000

location_append = []


while start <= 10000:
    
    with open(data_path + str(start) + '.json', 'r') as f:
        batch = json.load(f)

    places = batch['places']
    
    for location in places:
        
        # cities
        try:
            city = location['osm_json']['address']['city']
        except:
            pass

        try:
            city = location['osm_json']['address']['town']
        except:
            pass

        try:
            city = location['osm_json']['address']['village']
        except:
            pass

        # skip entities with no city information
        if city == "":
            print("no city")
            print(location)
    
        # countries
        country = ""
        country_code = ""
        
        try:
            country = location['osm_json']['address']['country']
        except:
            pass
        
        try:
            country_code = location['osm_json']['address']['country_code']
        except:
            pass
        
        if country == "":
            print("no country")
            print(location)
        
        location_append.append({
            'locid_check' : location['locid'], 
            'osm_place_id' : location['osm_json']['place_id'],
            'display_name' : location['osm_json']['display_name'],
            'city' : city,
            'country' : country,
            'country_code' : country_code
        })
    
    start += 1000

no country
{'locid': 17100.0, 'osm_json': {'place_id': 283373754, 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'osm_type': 'node', 'osm_id': 3815077900, 'lat': '0', 'lon': '0', 'display_name': 'Soul Buoy', 'address': {'man_made': 'Soul Buoy'}, 'boundingbox': ['-5.0E-5', '5.0E-5', '-5.0E-5', '5.0E-5']}}


In [15]:
append_df = pd.DataFrame(location_append)[:-1]

append_df

Unnamed: 0,locid_check,osm_place_id,display_name,city,country,country_code
0,55033.0,235973447,"Pennfäktaren, Norrmalm, Norrmalms stadsdelsomr...",Stockholm,Sverige,se
1,19542.0,235227144,"Domestic Garage, Zone F/G, Lomita Park, San Ma...",Stockholm,United States of America,us
2,9410.0,235768752,"Austin-Bergstrom International Airport (AUS), ...",Austin,United States of America,us
3,10259.0,186874070,"FAA Tower, International Parkway, Grapevine, T...",Grapevine,United States of America,us
4,58725.0,105773201,"Stockholm-Arlanda flygplats, 273, Starrmossen,...",Grapevine,Sverige,se
...,...,...,...,...,...,...
9995,230547.0,257735126,"2265, West Horizon Ridge Parkway, Macdonald Ra...",Henderson,United States of America,us
9996,45183.0,145877854,"Thatcher Hall, East Hurd Street, Edmond, Oklah...",Edmond,United States of America,us
9997,164356.0,235795942,"Bezirksteil Neuperlach, Stadtbezirk 16 Ramersd...",München,Deutschland,de
9998,234171.0,46570017,"Ginatricot, Rådhustorget, Centrum, Centrala st...",Umeå,Sverige,se


In [17]:
append_df = pd.DataFrame(location_append)[:-1]
places_top10_desc = places_top10.join(append_df)

places_top10_desc

Unnamed: 0,locid,lat_mean,long_mean,visits,locid_check,osm_place_id,display_name,city,country,country_code
0,55033,59.330158,18.058079,5811,55033.0,235973447,"Pennfäktaren, Norrmalm, Norrmalms stadsdelsomr...",Stockholm,Sverige,se
1,19542,37.616356,-122.386150,5662,19542.0,235227144,"Domestic Garage, Zone F/G, Lomita Park, San Ma...",Stockholm,United States of America,us
2,9410,30.201557,-97.667127,4713,9410.0,235768752,"Austin-Bergstrom International Airport (AUS), ...",Austin,United States of America,us
3,10259,32.897462,-97.040348,4083,10259.0,186874070,"FAA Tower, International Parkway, Grapevine, T...",Grapevine,United States of America,us
4,58725,59.650051,17.932262,3476,58725.0,105773201,"Stockholm-Arlanda flygplats, 273, Starrmossen,...",Grapevine,Sverige,se
...,...,...,...,...,...,...,...,...,...,...
9995,230547,36.005830,-115.084607,56,230547.0,257735126,"2265, West Horizon Ridge Parkway, Macdonald Ra...",Henderson,United States of America,us
9996,45183,35.655814,-97.470548,56,45183.0,145877854,"Thatcher Hall, East Hurd Street, Edmond, Oklah...",Edmond,United States of America,us
9997,164356,48.101030,11.645453,56,164356.0,235795942,"Bezirksteil Neuperlach, Stadtbezirk 16 Ramersd...",München,Deutschland,de
9998,234171,63.825514,20.261439,56,234171.0,46570017,"Ginatricot, Rådhustorget, Centrum, Centrala st...",Umeå,Sverige,se


In [None]:
places_top10_desc.to_csv('data/places_top10_desc.csv')

In [18]:
# top cities
places_top10_desc['city'].value_counts()

Austin               1183
Stockholm             816
San Francisco         517
Göteborg              295
Dallas                250
                     ... 
Englewood               1
Sankt Wolfgang          1
Cypress                 1
Puerto de la Cruz       1
Kalaoa CDP              1
Name: city, Length: 1529, dtype: int64

In [19]:
# top countries
places_top10_desc['country'].value_counts()

United States of America       5894
Sverige                        1954
Deutschland                     412
United Kingdom                  338
Norge                           269
Canada                          126
Nederland                       120
Saudi Arabia / السعودية         111
België - Belgique - Belgien      99
ประเทศไทย                        81
日本 (Japan)                       79
Switzerland                      69
Australia                        65
France                           43
Česká republika                  36
Portugal                         33
Österreich                       29
Danmark                          27
Italia                           25
Malaysia                         24
Singapore                        24
España                           19
Magyarország                     14
China 中国                         14
Philippines                      12
Brasil                           10
Indonesia                         8
Suomi                       

**Connecting Top 10 Places with Users**

* we would like to see which users visited the top 10 locations
* this can be done with a simple filtering query
* finally, combine the top 10 places users with the queried information using a simple left join


We will use this data later when looking at centrality of cities and countries in the graph representations section of our report.

* I also will be creating unique city id's and country id's for graph representations of cities and countries (these will be their vertex ID's)

In [131]:
top10_checkins = totalCheckins[totalCheckins.locid.isin(places_top10['locid'])]
top10_checkins

Unnamed: 0,user,time,lat,long,locid
0,0,2010-10-19 23:55:27,30.235909,-97.795140,22847
1,0,2010-10-18 22:17:43,30.269103,-97.749395,420315
2,0,2010-10-17 23:42:03,30.255731,-97.763386,316637
3,0,2010-10-17 19:26:05,30.263418,-97.757597,16516
4,0,2010-10-16 18:50:42,30.274292,-97.740523,5535878
...,...,...,...,...,...
6442635,196528,2010-10-11 01:42:02,13.766876,100.570672,761001
6442643,196528,2010-10-04 11:55:05,13.766876,100.570672,761001
6442650,196528,2010-09-30 02:33:25,13.689897,100.748320,23519
6442698,196541,2010-09-24 11:26:52,51.496105,-0.171834,23876


In [140]:
top10_checkins_desc = top10_checkins.merge(places_top10_desc, on='locid')

unique_cities = list(dict.fromkeys(list(top10_checkins_desc['city'])))
unique_countries = list(dict.fromkeys(list(top10_checkins_desc['country'])))

top10_checkins_desc = top10_checkins_desc.merge(pd.DataFrame({'city' : unique_cities, 'city_id' : range(len(unique_cities))}), on='city')
top10_checkins_desc = top10_checkins_desc.merge(pd.DataFrame({'country' : unique_countries, 'country_id' : range(len(unique_countries))}), on='country')

top10_checkins_desc[['user', 'locid', 'city', 'city_id', 'country', 'country_id']]

Unnamed: 0,user,locid,city,city_id,country,country_id
0,0,22847,Austin,0,United States of America,0
1,31,22847,Austin,0,United States of America,0
2,66,22847,Austin,0,United States of America,0
3,350,22847,Austin,0,United States of America,0
4,387,22847,Austin,0,United States of America,0
...,...,...,...,...,...,...
1250040,108841,1505587,Las Condes,1483,Chile,51
1250041,108841,1505587,Las Condes,1483,Chile,51
1250042,108841,1505587,Las Condes,1483,Chile,51
1250043,108841,1505587,Las Condes,1483,Chile,51


## Graph Representations

### Limitations

* the max userID will be the number of edges required in the graph (assumption)
* using the maximum number of nodes in the graph (196591) takes a long time to generate the graph
  * this I would assume makes the algorithms for these graphs tricky to run
  * I could also be adding the edges in an unefficient way (explore ways of generating graphs)
    * might be better to add edges in batches instead of one edge at a time (explore this)

#### Ways to set limits
* we can limit the groups to the first n number of users (example only nodes 0 - 1000)
  * however under the assumption that userIDs are added based on the time the user adds themselves to the network, then there will be a lot of graphs that aren't connected because they are in geographically seperate locations
* a better soltuion would be to take a group of n people in close proximety of one another
  * we could find the average latitude and longitude of a users and record this as the area they are generally in
  * we'll then randomly select a user and select the n amount of people closest to this user (ranking by the distance between their average locations)


In [6]:
edgesG = Graph()
total_edges = max(edges['u1'].max(), edges['u2'].max())
total_edges

196590

In [18]:
edgesG.add_vertices(total_edges)

In [0]:
# this takes too long; there's got to be a better way!

for i in range(len(edges)):
  edgesG.add_edge(edges.loc[i, "u1"], edges.loc[i, "u2"])

### Creating graph of user friendship

In [6]:
#Faster way to make graph from df tuples
eG = igraph.Graph.TupleList(edges.itertuples(index=False), directed=False, weights=False)
eG.summary()

'IGRAPH UN-- 196591 1900654 -- \n+ attr: name (v)'

Finding the indegree of each user and identifying the user with maximum degree

In [7]:
#Finds degrees of each vertex and the max degree and its vertex - popularity
eG_degrees = eG.indegree()
print(max(eG_degrees))
eG_max_degree_index = eG_degrees.index(max(eG_degrees))
eG.vs[eG_max_degree_index]

29460


igraph.Vertex(<igraph.Graph object at 0x000001B08ABCAE58>, 307, {'name': 307})

Finding all the neighbors of the user with maximum degree, in other words all of the popular user's friends

In [8]:
eG.neighbors(eG_max_degree_index)

[0,
 0,
 1,
 1,
 2,
 2,
 3,
 3,
 5,
 5,
 8,
 8,
 9,
 9,
 10,
 10,
 11,
 11,
 12,
 12,
 14,
 14,
 17,
 17,
 18,
 18,
 20,
 20,
 21,
 21,
 22,
 22,
 24,
 24,
 30,
 30,
 35,
 35,
 37,
 37,
 39,
 39,
 43,
 43,
 44,
 44,
 46,
 46,
 49,
 49,
 50,
 50,
 53,
 53,
 54,
 54,
 55,
 55,
 60,
 60,
 61,
 61,
 65,
 65,
 67,
 67,
 68,
 68,
 69,
 69,
 71,
 71,
 72,
 72,
 73,
 73,
 75,
 75,
 80,
 80,
 82,
 82,
 90,
 90,
 91,
 91,
 94,
 94,
 96,
 96,
 112,
 112,
 113,
 113,
 114,
 114,
 115,
 115,
 117,
 117,
 119,
 119,
 123,
 123,
 140,
 140,
 141,
 141,
 143,
 143,
 145,
 145,
 153,
 153,
 158,
 158,
 168,
 168,
 174,
 174,
 178,
 178,
 180,
 180,
 182,
 182,
 184,
 184,
 188,
 188,
 194,
 194,
 198,
 198,
 202,
 202,
 205,
 205,
 207,
 207,
 209,
 209,
 210,
 210,
 211,
 211,
 215,
 215,
 216,
 216,
 218,
 218,
 219,
 219,
 220,
 220,
 223,
 223,
 228,
 228,
 236,
 236,
 237,
 237,
 241,
 241,
 242,
 242,
 243,
 243,
 244,
 244,
 245,
 245,
 248,
 248,
 261,
 261,
 266,
 266,
 269,
 269,
 279,
 279,


### Creating graph of users that visited the top 10k places

In [11]:
top10_eG = igraph.Graph.TupleList(top10_edges.itertuples(index=False), directed=False, weights=False)
top10_eG.summary()

'IGRAPH UN-- 63954 633948 -- \n+ attr: name (v)'

Identifying the user with max degree, most popular 

In [10]:
top10_eG_degrees = top10_eG.indegree()
print(max(top10_eG_degrees))
top10_eG_max_degree_index = top10_eG_degrees.index(max(top10_eG_degrees))
top10_eG.vs[top10_eG_max_degree_index]

14630


igraph.Vertex(<igraph.Graph object at 0x000001CCA9533E58>, 268, {'name': 307})

Using pagerank to find the most popular user - note it matches above method

In [24]:
top10_eG_PR = top10_eG.pagerank()
print(max(top10_eG_PR))
print(top10_eG_PR.index(max(top10_eG_PR)))
top10_eG_PR

0.01109515802806296
268


[0.0005601428985739857,
 0.0012451564315959104,
 3.9544888079393464e-05,
 0.0005378165257966492,
 4.1839343459879194e-05,
 4.509137461765956e-05,
 3.139457580824911e-05,
 2.2394456969758714e-05,
 1.8308983076667677e-05,
 3.7075088254714646e-05,
 4.2564765231561685e-05,
 0.00012220473017581946,
 3.504858739399901e-05,
 3.077570251426852e-05,
 0.00011980908394106588,
 3.0447555293858315e-05,
 0.00014091226344829672,
 0.00025192650719313024,
 0.00021114524289245172,
 1.1705708578519792e-05,
 1.585730333712413e-05,
 2.4330927487731063e-05,
 4.911319873694109e-05,
 6.216401646213416e-05,
 1.986162431136007e-05,
 5.836984111221423e-05,
 3.6793734815685755e-05,
 0.00011446439056066688,
 3.3207052781472134e-05,
 8.262801306450721e-05,
 4.069785166706043e-05,
 2.4867214998666827e-05,
 2.456672176997244e-05,
 0.0002279537320801742,
 0.00018039844014468582,
 0.00010946901292076022,
 4.259099614362942e-05,
 0.00024576391209893444,
 8.500463345363405e-06,
 2.593635760059841e-05,
 2.1832071695049115

Finding all connected components, or clusters, of the graph, and the largest cluster

In [22]:
top10_eG_clusters = top10_eG.components()
print("Num of Clusters:", len(top10_eG_clusters))
top10_eG_giant_cluster = top10_eG_clusters.giant()
print("Num of Vertices in Largest cluster:", top10_eG_giant_cluster.vcount())

Num of Clusters: 683
Num of Vertices in Largest cluster: 62302


In [13]:
top10_eG.degree_distribution()

<igraph.statistics.Histogram at 0x1cca95ab5c8>

In [14]:
top10_eG.omega() #size of largest clique

20

In [13]:
top5_eG = igraph.Graph.TupleList(top5_edges.itertuples(index=False), directed=False, weights=False)
top5_eG.summary()

'IGRAPH UN-- 57260 570846 -- \n+ attr: name (v)'

In [14]:
top5_eG_edges = [edge.tuple for edge in top5_eG.es]

### Creating a graph between users and locations they have visited

In [26]:
user_locids = totalCheckins[['user', 'locid']]
user_locids['locid'] = user_locids['locid'].apply(lambda x: 'L' + str(x))
user_locids

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,user,locid
0,0,L22847
1,0,L420315
2,0,L316637
3,0,L16516
4,0,L5535878
...,...,...
6442887,196578,L906885
6442888,196578,L965121
6442889,196578,L1174322
6442890,196585,L471724


In [27]:
user_loc_G = igraph.Graph.TupleList(user_locids.itertuples(index=False), directed=False, weights=False)
user_loc_G.summary()

'IGRAPH UN-- 1388061 6442892 -- \n+ attr: name (v)'

### Creating df of all users that visited top10 locations

In [41]:
user_locid = totalCheckins[['user', 'locid']]
top10_user_locid = user_locid[user_locid.locid.isin(places_top10['locid'])]

In [42]:
top10_user_locid_groups = top10_user_locid.groupby('locid')['user'].apply(list).reset_index(name='users')
top10_user_locid_groups

Unnamed: 0,locid,users
0,8938,"[0, 65, 71, 107, 107, 107, 228, 256, 256, 256,..."
1,8947,"[24, 99, 119, 124, 130, 130, 130, 130, 130, 13..."
2,8956,"[138, 138, 138, 138, 138, 350, 486, 559, 560, ..."
3,8961,"[350, 350, 560, 1759, 2274, 2274, 2274, 2274, ..."
4,8964,"[0, 71, 75, 107, 107, 107, 107, 107, 107, 107,..."
...,...,...
9995,4997564,"[32, 41, 212, 225, 242, 252, 267, 267, 342, 35..."
9996,4998006,"[41, 97, 112, 212, 225, 242, 350, 401, 438, 47..."
9997,4998370,"[41, 97, 180, 212, 225, 225, 242, 350, 401, 47..."
9998,5341244,"[557, 3567, 4011, 4011, 4580, 6609, 7492, 7532..."


In [None]:
#takes time to run but gets all combinations of users in each list that we can make edges for visitng same location
top10_user_locid_combo = top10_user_locid_groups['users'].apply(lambda x: list(itertools.combinations(x, 2))).reset_index(name='user_combos')
top10_user_locid_combo

### Graphs of Cities and Countries of Top 10K Locations

For this portion of our graph analaysis, we will be creating graphs of the cities and countries of the top 10,000 locations

* we will connect locations based on the number of times users who have visited one place visit another
* for example let's observe user 1003
* they have visited Stockholm, Bloomington and Minneapolis
    * this means that each city has a connection to each other from the user

In [146]:
top10_checkins_desc[top10_checkins_desc['user'] == 1003][['user', 'locid', 'city', 'city_id', 'country', 'country_id']]

Unnamed: 0,user,locid,city,city_id,country,country_id
233445,1003,19542,Stockholm,9,United States of America,0
529642,1003,10202,Bloomington,83,United States of America,0
663573,1003,99066,Minneapolis,265,United States of America,0


**Goal**

* we will create a weighted graph of connected cities, a user's connection between two cities will count as one weighted point for each city
* we will do the same weighted graph for countries

**Cities**

In [145]:
cities_visited = top10_checkins_desc.groupby('user')['city_id'].apply(list).apply(dict.fromkeys).apply(list)
cities_visited

user
0         [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2                           [9, 10, 16, 17, 18, 19, 20, 21]
4                       [9, 10, 22, 23, 24, 25, 26, 27, 28]
5                                            [0, 1, 29, 30]
7                                                   [0, 31]
                                ...                        
196508                                                [191]
196524                                                  [2]
196528                                           [267, 538]
196541                                                [201]
196561                                                [533]
Name: city_id, Length: 69323, dtype: object

In [147]:
from collections import Counter

visits = []

for user in cities_visited:
    visits.extend(list(itertools.combinations(user, 2)))

In [174]:
visits[-10:]

[(22, 519),
 (301, 519),
 (134, 832),
 (134, 1083),
 (832, 1083),
 (189, 204),
 (115, 274),
 (115, 449),
 (274, 449),
 (267, 538)]

In [180]:
top10_cities_g = Graph()

# this is the number of vertices in the system
# add 1 because vertices start at zero
v = top10_checkins_desc['city_id'].max()
top10_cities_g.add_vertices(v + 1)
top10_cities_g.add_edges(visits)

In [182]:
summary(top10_cities_g)

IGRAPH U--- 1529 723950 -- 


In [None]:
top10_cities_g

In [185]:
top10_cities_g_PR = top10_cities_g.pagerank()
print(max(top10_cities_g_PR))
print(top10_cities_g_PR.index(max(top10_cities_g_PR)))

0.034297258567479734
9


In [203]:
list(top10_checkins_desc[top10_checkins_desc['city_id'] == 9]['city'])[0]

'Stockholm'

In [216]:
top10_checkins_desc.city.value_counts()

Austin           204672
Stockholm        119977
San Francisco     62111
Göteborg          34150
Dallas            28667
                  ...  
Katwijk              56
Atlantic City        56
Toronto              56
Hoboken              56
Schortens            56
Name: city, Length: 1529, dtype: int64

**Countries**

In [205]:
countries_visited = top10_checkins_desc.groupby('user')['country_id'].apply(list).apply(dict.fromkeys).apply(list)
cities_visited

user
0         [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2                           [9, 10, 16, 17, 18, 19, 20, 21]
4                       [9, 10, 22, 23, 24, 25, 26, 27, 28]
5                                            [0, 1, 29, 30]
7                                                   [0, 31]
                                ...                        
196508                                                [191]
196524                                                  [2]
196528                                           [267, 538]
196541                                                [201]
196561                                                [533]
Name: city_id, Length: 69323, dtype: object

In [206]:
country_visits = []

for user in countries_visited:
    country_visits.extend(list(itertools.combinations(user, 2)))

In [207]:
country_visits[:10]

[(0, 1),
 (0, 2),
 (1, 2),
 (0, 3),
 (0, 4),
 (0, 5),
 (0, 3),
 (0, 2),
 (3, 2),
 (0, 6)]

In [208]:
top10_countries_g = Graph()

# this is the number of vertices in the system
# add 1 because vertices start at zero
v = top10_checkins_desc['country_id'].max()
top10_countries_g.add_vertices(v + 1)
top10_countries_g.add_edges(country_visits)

In [211]:
summary(top10_countries_g)

IGRAPH U--- 53 14363 -- 


In [212]:
top10_countries_g_PR = top10_countries_g.pagerank()
print(max(top10_countries_g_PR))
print(top10_countries_g_PR.index(max(top10_countries_g_PR)))

0.04225539187193717
5


In [213]:
list(top10_checkins_desc[top10_checkins_desc['country_id'] == 5]['country'])[0]

'France'

In [215]:
top10_checkins_desc.country.value_counts()

United States of America       763696
Sverige                        249147
Deutschland                     47577
United Kingdom                  39537
Norge                           30463
Canada                          13504
Nederland                       12861
Saudi Arabia / السعودية         11374
België - Belgique - Belgien     10958
日本 (Japan)                       9896
ประเทศไทย                        9527
Switzerland                      7397
Australia                        6612
France                           4377
Danmark                          3841
Česká republika                  3176
Österreich                       2830
Portugal                         2736
Italia                           2687
Singapore                        2191
Malaysia                         2048
España                           1761
China 中国                         1532
Magyarország                     1215
Brasil                           1135
Philippines                       969
Suomi       

### Grouping checkins by time and location

In [38]:
time_loc = totalCheckins[['time', 'user', 'locid']]
time_loc = time_loc[time_loc.locid.isin(places_top10['locid'])]
time_loc['time'] = time_loc['time'].dt.floor('30T') #rounding timestamp to nearest 30 minutes
time_loc.groupby(['locid', 'time']).agg(['count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,user
Unnamed: 0_level_1,Unnamed: 1_level_1,count
locid,time,Unnamed: 2_level_2
8938,2009-08-29 02:00:00,1
8938,2009-09-26 13:00:00,1
8938,2009-09-27 01:30:00,1
8938,2009-09-27 23:00:00,1
8938,2009-10-06 22:00:00,1
...,...,...
5535878,2010-10-17 19:30:00,2
5535878,2010-10-17 20:30:00,3
5535878,2010-10-17 21:00:00,2
5535878,2010-10-17 21:30:00,1


The following shows a list of users that where at a location at the same time

In [19]:
time_loc5 = totalCheckins[['time', 'user', 'locid']]
time_loc5 = time_loc5[time_loc5.locid.isin(places_top5['locid'])]
time_loc5['time'] = time_loc5['time'].dt.floor('60T') #rounding timestamp to nearest hour
time_loc5_groups = time_loc5.groupby(['locid', 'time'])['user'].apply(list).reset_index(name='users')
time_loc5_groups = time_loc5_groups[time_loc5_groups['users'].apply(lambda x: len(x) > 2)]
time_loc5_groups

Unnamed: 0,locid,time,users
161,8947,2009-11-17 17:00:00,"[285, 487, 518]"
195,8947,2009-12-16 21:00:00,"[494, 494, 23349]"
229,8947,2010-01-27 17:00:00,"[130, 28602, 125329]"
283,8947,2010-03-06 00:00:00,"[6927, 125329, 127250]"
349,8947,2010-04-22 18:00:00,"[519, 5461, 11637]"
...,...,...,...
788693,5535878,2010-10-17 17:00:00,"[33053, 33711, 70217, 84630, 155582]"
788694,5535878,2010-10-17 18:00:00,"[716, 4726, 6701, 21757, 50446]"
788695,5535878,2010-10-17 19:00:00,"[5711, 10803, 103076]"
788696,5535878,2010-10-17 20:00:00,"[3155, 4229, 11238]"


In [20]:
time_loc5_groups_combo = time_loc5_groups['users'].apply(lambda x: list(itertools.combinations(x, 2))).reset_index(name='user_combos')
time_loc5_groups_combo

Unnamed: 0,index,user_combos
0,161,"[(285, 487), (285, 518), (487, 518)]"
1,195,"[(494, 494), (494, 23349), (494, 23349)]"
2,229,"[(130, 28602), (130, 125329), (28602, 125329)]"
3,283,"[(6927, 125329), (6927, 127250), (125329, 1272..."
4,349,"[(519, 5461), (519, 11637), (5461, 11637)]"
...,...,...
18502,788693,"[(33053, 33711), (33053, 70217), (33053, 84630..."
18503,788694,"[(716, 4726), (716, 6701), (716, 21757), (716,..."
18504,788695,"[(5711, 10803), (5711, 103076), (10803, 103076)]"
18505,788696,"[(3155, 4229), (3155, 11238), (4229, 11238)]"


In [24]:
time_loc5_groups_combo[:10]['user_combos'].apply(lambda x: sum(el in top5_eG_edges for el in x)).reset_index(name='friends')

Unnamed: 0,index,friends
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0
5,5,1
6,6,0
7,7,0
8,8,0
9,9,0


Finding the popular time for people to visit a particular location

In [32]:
time_loc['time'] = time_loc['time'].dt.time
time_loc.groupby(['locid', 'time']).agg(['count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,user
Unnamed: 0_level_1,Unnamed: 1_level_1,count
locid,time,Unnamed: 2_level_2
8938,00:00:00,2
8938,00:30:00,4
8938,01:00:00,2
8938,01:30:00,2
8938,02:00:00,1
...,...,...
5535878,21:00:00,2
5535878,21:30:00,1
5535878,22:00:00,2
5535878,23:00:00,1


The following shows the most common time for people to visit that location as well as what user visited that location the most

In [34]:
time_loc.groupby(['locid']).agg(lambda x:x.value_counts().index[0])

Unnamed: 0_level_0,time,user
locid,Unnamed: 1_level_1,Unnamed: 2_level_1
8938,18:00:00,488
8947,19:30:00,518
8956,19:30:00,138
8961,17:00:00,4678
8964,13:00:00,406
...,...,...
4997564,18:00:00,25344
4998006,00:30:00,956
4998370,19:00:00,16356
5341244,23:00:00,91430
