This notebook preprocesses the profile information from the Pokec social network dataset, available at: https://snap.stanford.edu/data/soc-Pokec.html

In this notebook we:
1. Read in the profile information.
2. Reduce the data down to a subset of the attributes.
3. Use a geocoder to add lat/lon coordinates associated with region names.

Created on: 21/12/19

In [1]:
import pandas as pd
from geopy.geocoders import Nominatim
import numpy as np
import time
import tqdm

In [2]:
path_to_profiles = 'Data/raw_data/soc-pokec-profiles.txt'
path_to_edge_list = 'Data/raw_data/soc-pokec-relationships.txt'
column_name_data = 'Data/raw_data/pokec_column_names.txt'

Read in the profile information. Only the following columns are loaded, identified by their position in the file:
- 0 : user
- 3 : gender
- 4 : region
- 7 : age

Note: trying to load every column of the file at once will usually lead to a memory error.

In [3]:
columns_of_interest=[0,3,4,7]
data = pd.read_csv(path_to_profiles, sep='\t',header=None,usecols=columns_of_interest)
data.columns = ['user_id','gender','region','age']
data

,user_id,gender,region,age
0,1,1.0,"zilinsky kraj, zilina",26.0
1,2,0.0,"zilinsky kraj, kysucke nove mesto",0.0
2,16,1.0,"zilinsky kraj, kysucke nove mesto",23.0
3,3,1.0,"bratislavsky kraj, bratislava - karlova ves",29.0
4,4,0.0,"banskobystricky kraj, brezno",26.0
...,...,...,...,...
1632798,1632799,0.0,"banskobystricky kraj, revuca",23.0
1632799,1632800,1.0,"trenciansky kraj, myjava",33.0
1632800,1632801,1.0,"kosicky kraj, kosice - okolie",0.0
1632801,1632802,1.0,"bratislavsky kraj, bratislava - karlova ves",19.0


# Add lat/lon coordinates

To visualise the data we need lat/lon coordinates for each region. We obtain these with the Nominatim geocoder from the geopy library. Two kinds of region name need special handling:

- Names containing 'zahranicie' (Slovak for 'abroad') are not geocoded.
- Locations in the Czech Republic ('ceska republika, cz') are geocoded using only the part of the name after ' - '.

Coordinates are set to NaN for 'zahranicie' entries, for missing regions, and for any name the geocoder cannot resolve.
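
The screening step can be sketched as a small pure function that decides what, if anything, to send to the geocoder. It mirrors the branching in the geocoding cell that follows: missing regions and 'zahranicie' entries are skipped, and Czech entries are looked up by the part of the name after ' - '. The name `region_to_query` is hypothetical and the function is an illustration, not the notebook's own code.

```python
def region_to_query(region):
    """Return the string to geocode for a region, or None to skip it."""
    # Missing regions arrive from pandas as float NaN, not as strings.
    if not isinstance(region, str):
        return None
    # 'zahranicie' marks users located abroad; there is nothing to look up.
    if "zahranicie" in region:
        return None
    # Czech entries look like 'ceska republika, cz - olomoucky kraj';
    # keep only the part after the ' - ' separator.
    if "ceska republika, cz" in region:
        parts = region.split(" - ")
        return parts[1] if len(parts) > 1 else None
    return region

print(region_to_query("ceska republika, cz - olomoucky kraj"))  # olomoucky kraj
print(region_to_query("zahranicie, zahranicie - usa"))          # None
print(region_to_query("zilinsky kraj, zilina"))                 # zilinsky kraj, zilina
```

Keeping this decision separate from the network call makes the rules easy to check without contacting the geocoding service.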

In [4]:
#geolocator = Nominatim(user_agent="specify_your_app_name_here")
#https://github.com/geopy/geopy/issues/314 - suggests changing the name of user agent.
geolocator = Nominatim(user_agent="abcd")

# Get the list of unique locations so that we do not call the API more often than necessary:
locations = list(data['region'])
location_set = list(set(locations))
print("There are {} unique locations".format(len(location_set)))
num_locations = len(locations)


location_coords = { } #dictionary to store the locations in
q = 0 
for k in tqdm.tqdm_notebook(location_set) :
    q = q + 1
    print(k)
    try:
        if 'zahranicie' in k :
            print("Fails due to 'outlands' at line {}".format(q))
            location_coords[k] = [float('NaN'),float('NaN')]
        elif 'ceska republika, cz' in k :
            print("In ceska republica")
            loc_name = k.split(' - ')[1]
            print("town = " + loc_name )
            print("")
            loc = geolocator.geocode(loc_name,timeout=10)
            location_coords[k] = [loc.latitude,loc.longitude]
        else :
            loc = geolocator.geocode(k,timeout=10)
            location_coords[k] = [loc.latitude,loc.longitude]
        
    
    except TypeError:
        print(f"Failed due to type of {k}")
        location_coords[k] = [float('NaN'),float('NaN')]
        
    except AttributeError :
        print(f"Failed due to attribute error with k = {k}")
        location_coords[k] = [float('NaN'),float('NaN')]
        
    time.sleep(2.0)  # pause between requests so the geocoding service is not overloaded

# Look up the coordinates of each user's region:
locations = list(data['region'])
user_lat = [ ]
user_lon = [ ]
for loc_name in locations : 
    user_lat.append(location_coords[loc_name][0])
    user_lon.append(location_coords[loc_name][1])

There are 188 unique locations



nan
Failed due to type of nan
kosicky kraj, kosice - zapad
zilinsky kraj, zilina
presovsky kraj, kezmarok
trnavsky kraj, senica
kosicky kraj, medzev
banskobystricky kraj, zarnovica
bratislavsky kraj, bratislava - vajnory
banskobystricky kraj, velky krtis
zahranicie, bratislava - stare mesto
Fails due to 'outlands' at line 10
ceska republika, cz - olomoucky kraj
In ceska republica
town = olomoucky kraj

trnavsky kraj, hlohovec
banskobystricky kraj, jelsava
nitriansky kraj, komarno
kosicky kraj, brezova pod bradlom
Failed due to attribute error with k = kosicky kraj, brezova pod bradlom
kosicky kraj, spisske vlachy
presovsky kraj, spisska stara ves
presovsky kraj, medzilaborce
zahranicie, zahranicie - usa
Fails due to 'outlands' at line 19
nitriansky kraj, tlmace
presovsky kraj, svit
trenciansky kraj, ilava
presovsky kraj, lipany
banskobystricky kraj, zvolen
trnavsky kraj, samorin
bratislavsky kraj, bratislava - petrzalka
bratislavsky kraj, bratislava - karlova ves
presovsky kraj, stropk
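
The pattern used in this cell, geocoding each unique region once and then fanning the results back out to every user, can be separated from the network call. Below is a minimal sketch in which `geocode_unique`, `Location` and the stand-in `fake_geocode` are hypothetical names. In the notebook the callable would be `geolocator.geocode` and the pause would be nonzero. The coordinates in the stand-in are rounded, illustrative values.

```python
import math
import time
from collections import namedtuple

NAN_PAIR = (float("nan"), float("nan"))

def geocode_unique(regions, geocode, pause=0.0):
    """Geocode each distinct region once; return one (lat, lon) per input."""
    cache = {}
    coords = []
    for region in regions:
        if not isinstance(region, str):
            coords.append(NAN_PAIR)  # missing region, nothing to look up
            continue
        if region not in cache:
            result = geocode(region)
            # geopy geocoders return None when nothing is found.
            if result is None:
                cache[region] = NAN_PAIR
            else:
                cache[region] = (result.latitude, result.longitude)
            time.sleep(pause)  # only pause after a real lookup
        coords.append(cache[region])
    return coords

# Stand-in for a geocoder, so the sketch runs without network access.
Location = namedtuple("Location", ["latitude", "longitude"])
calls = []

def fake_geocode(query):
    calls.append(query)
    known = {"zilinsky kraj, zilina": Location(49.22, 18.74)}
    return known.get(query)

regions = ["zilinsky kraj, zilina", float("nan"), "zilinsky kraj, zilina", "nowhere"]
coords = geocode_unique(regions, fake_geocode)
print(coords)
print(len(calls))  # one lookup per distinct string region
```

Passing the geocoder in as an argument is a design choice made for the sketch: it lets the caching and NaN handling be exercised without hitting the Nominatim service.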

In [5]:
data['user_lat'] = user_lat
data['user_lon'] = user_lon
data

,user_id,gender,region,age,user_lat,user_lon
0,1,1.0,"zilinsky kraj, zilina",26.0,49.223467,18.739314
1,2,0.0,"zilinsky kraj, kysucke nove mesto",0.0,49.299918,18.786508
2,16,1.0,"zilinsky kraj, kysucke nove mesto",23.0,49.299918,18.786508
3,3,1.0,"bratislavsky kraj, bratislava - karlova ves",29.0,48.159240,17.052677
4,4,0.0,"banskobystricky kraj, brezno",26.0,48.805335,19.640961
...,...,...,...,...,...,...
1632798,1632799,0.0,"banskobystricky kraj, revuca",23.0,48.683333,20.116667
1632799,1632800,1.0,"trenciansky kraj, myjava",33.0,48.754081,17.576513
1632800,1632801,1.0,"kosicky kraj, kosice - okolie",0.0,48.698393,20.973902
1632801,1632802,1.0,"bratislavsky kraj, bratislava - karlova ves",19.0,48.159240,17.052677


# Save the data

In [6]:
data.to_csv("Data/reduced_pokec_profiles.csv")
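
By default `to_csv` writes the row index as an extra, unnamed first column, which appears as `Unnamed: 0` when the file is read back with `read_csv`. If that column is not wanted, passing `index=False` avoids it. A small sketch using in-memory buffers and made-up rows (the variable names are illustrative):

```python
import io
import pandas as pd

demo = pd.DataFrame({"user_id": [1, 2], "age": [26.0, 0.0]})

# Default behaviour: the index is written and comes back as 'Unnamed: 0'.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)     # ['Unnamed: 0', 'user_id', 'age']

# With index=False only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)  # ['user_id', 'age']
```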

In [7]:
fraction_missing = data['user_lat'].isnull().mean()
print(f"Missing lats = {100*fraction_missing:.2f} %")

Missing lats = 9.98 %
