# Exploring Wikipedia clickstream data: the English Wiki in December 2018    

## Graph modeling

### 1. Introduction  
This notebook walks through the graph modeling steps for the English Wikipedia clickstream dataset for December 2018. It is part of my project on Wikipedia usage patterns across language domains and over time.  


### 2. Notebook setup  

#### Imports

In [1]:
import pandas as pd
import numpy as np

import os

import csv

from time import sleep
from timeit import default_timer as timer

# custom general helper functions for this project
import custom_utils as cu
import importlib

In [49]:
# reload imports as needed
importlib.reload(cu);

#### Settings

In [2]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 3)
pd.options.display.float_format = '{:.2f}'.format

#### Helper functions

#### Get and check the data

In [5]:
# Get the cleaned tsv data onto AWS from my local machine
# Run from terminal on local:
# scp ~/projects/wiki_graph/data/clickstream-enwiki-2018-12_clean.tsv arinai@3.87.124.217:/home/arinai/data/
# This took about 15 min.
# Try getting the raw data with wget from Wikimedia datadump and then cleaning it next time to see if it's faster.

In [10]:
# Print the head of the tsv file
!head ../data/clickstream-enwiki-2018-12_clean.tsv

# I move the file to neo4j's import folder later
# !head /var/lib/neo4j/import/clickstream-enwiki-2018-12_clean.tsv

other-empty	2019_Horizon_League_Baseball_Tournament	external	16
other-search	ForeverAtLast	external	40
other-empty	ForeverAtLast	external	85
First_Families_of_Pakistan	Jehangir_Wadia	link	19
The_Lawrence_School,_Sanawar	Jehangir_Wadia	link	36
Wadia_family	Jehangir_Wadia	link	715
other-search	Jehangir_Wadia	external	967
Ness_Wadia	Jehangir_Wadia	link	494
other-empty	Jehangir_Wadia	external	638
GoAir	Jehangir_Wadia	link	1191


In [5]:
# Read the cleaned up EN clickstream tsv file into pandas
filepath = "../data/clickstream-enwiki-2018-12_clean.tsv"
df = pd.read_csv(filepath, sep='\t', names=['prev', 'curr', 'type', 'n'])

In [6]:
# Some titles were parsed as missing values (false NaNs, e.g. the article "NaN"); replace them with the string "NaN"
df['prev'] = df['prev'].fillna('NaN')
df['curr'] = df['curr'].fillna('NaN')
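
Titles such as `NaN` or `NA` are legitimate article names, but `pd.read_csv` treats those strings as missing by default. The sketch below assumes the same four-column TSV layout and shows how `keep_default_na=False` keeps such titles as strings at read time. The three-row in-memory sample is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the same prev/curr/type/n layout as the clickstream file
sample = (
    "other-search\tNaN\texternal\t40\n"
    "NA\tNaN\tlink\t12\n"
    "GoAir\tJehangir_Wadia\tlink\t1191\n"
)
cols = ["prev", "curr", "type", "n"]

# Default parsing turns the titles "NaN" and "NA" into missing values
df_default = pd.read_csv(io.StringIO(sample), sep="\t", names=cols)

# keep_default_na=False leaves them as ordinary strings
df_strings = pd.read_csv(
    io.StringIO(sample), sep="\t", names=cols, keep_default_na=False
)

print(df_default["curr"].isna().sum())
print(df_strings["curr"].tolist())
```

Reading the titles as strings avoids having to guess afterwards which literal (`NaN`, `NA`, `null`, ...) a missing value originally was.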

### Data prep

In [7]:
external_edges = df[df.type == "external"]
internal_edges = df[df.type != "external"]

In [9]:
st = timer()
ext_nodes = set(external_edges["curr"])

cu.printRunTime(st)

Runtime: 0.04 min



In [10]:
len(ext_nodes)

5163109

In [11]:
list(ext_nodes)[:10]

['Kakapo_(album)',
 "When_You're_in_Love_with_a_Beautiful_Woman",
 'Théodore_Aubert',
 'Instituto_de_Historia_de_Cuba',
 '2706',
 'Renouf_Island',
 'Ceyreste',
 'Antoine_Omer_Talon',
 'La_Possession_(film)',
 'Priyanka_Bakaya']

In [43]:
def external_traffic(df, article, external_traffic_type):
    traffic = df.loc[(df.curr == article) & (df.prev == external_traffic_type), 'n'].values
    if len(traffic):
        return traffic[0].item()
    else:
        return None

In [13]:
ext_nodes_list = list(ext_nodes)

In [29]:
None

In [44]:
external_traffic(external_edges, 'Renouf_Island', "other-other")
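
`external_traffic` filters the whole data frame on every call, which gets expensive when it is called five times for each of several million articles. One possible alternative, sketched here on a small sample, is to pivot the external rows once so that each article becomes a row and each referer type a column. The sample rows are illustrative (the `other-external` row is made up), and the sketch assumes each `(prev, curr)` pair occurs at most once, so summing is equivalent to taking the first match.

```python
import pandas as pd

# Hypothetical sample of external rows (same columns as external_edges)
external_sample = pd.DataFrame(
    [
        ("other-empty", "ForeverAtLast", "external", 85),
        ("other-search", "ForeverAtLast", "external", 40),
        ("other-empty", "Jehangir_Wadia", "external", 638),
        ("other-search", "Jehangir_Wadia", "external", 967),
        ("other-external", "Jehangir_Wadia", "external", 12),  # made-up row
    ],
    columns=["prev", "curr", "type", "n"],
)

# Node property names used for Article nodes in this notebook
property_names = {
    "other-external": "external_website_traffic",
    "other-internal": "other_wikimedia_traffic",
    "other-search": "external_search_traffic",
    "other-empty": "empty_referer_traffic",
    "other-other": "unknown_external_traffic",
}

# One pass over the data: rows are articles, columns are referer types.
# Missing (article, referer type) combinations come out as NaN.
wide = (
    external_sample
    .pivot_table(index="curr", columns="prev", values="n", aggfunc="sum")
    .rename(columns=property_names)
)

print(wide)
```

The resulting table could be written out with `wide.to_csv(..., sep="\t")` instead of looking up every value article by article.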

### 3. Setting up Neo4j on AWS EC2 Ubuntu

#### Install Neo4j community edition server  
- For an Ubuntu EC2 instance on AWS, follow the Debian installation instructions [here](https://neo4j.com/docs/operations-manual/current/installation/linux/debian/). 
- [Run neo4j as a service using `systemctl`](https://neo4j.com/docs/operations-manual/current/installation/linux/systemd/#linux-service-control):  
  - To start neo4j: `systemctl start neo4j`  
  - To stop neo4j: `systemctl stop neo4j`

- Run neo4j and tunnel into the neo4j browser:
  - Run neo4j: `systemctl start neo4j`
  - Tunnel into the browser, e.g.: `ssh -NfL localhost:7474:localhost:7474 -L localhost:7687:localhost:7687 arinai@3.87.124.217`
  - In a browser, open `http://localhost:7474/browser/`
  - On first login, both the username and password are `neo4j`. After the initial login there will be a prompt to set a new password.

### 4. Load the data into Neo4j

#### Connect

In [3]:
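# Note: `authenticate` and `Graph.data` (used below) come from py2neo v3; later major versions removed them.
from py2neo import authenticate, Graph, Node, Relationship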

In [4]:
# To avoid typing neo4j password into the notebook each time,
# I'm saving it in a separate file and reading it in with the helper function below.
def read_n4jpass():
    """Reads neo4j connection credentials from .n4jpass file in current folder.
    Expects one key=value pair per line, ignores blank lines and comments, e.g.:
    # comments here
    user=neo4j
    password=secretStuff123
    """

    path = os.path.join(os.getcwd(), '.n4jpass')

    with open(path, 'r') as f:
        lines = f.readlines()

    d = {}
    for line in lines:
        line = line.strip()
        # skip blank lines and comment lines
        if line and not line.startswith('#'):
            # split on the first '=' only, so values may contain '='
            k, v = line.split('=', 1)
            d[k] = v

    return d

In [5]:
n4j_cred = read_n4jpass()

In [6]:
# set up authentication parameters
authenticate("localhost:7474", n4j_cred["user"], n4j_cred["password"])

In [7]:
# connect to authenticated graph database
graph = Graph("http://localhost:7474/db/data/")

#### Set indices  
Neo4j does not add constraints or indexes on node properties automatically. Since article titles in Wikipedia are unique, let's add a uniqueness constraint on `Article.title`. The constraint also creates an index on that property, which speeds up the `MERGE` and `MATCH` lookups in the import queries.

In [20]:
graph.run("CREATE CONSTRAINT ON (a:Article) ASSERT a.title IS UNIQUE;");

In [68]:
# list all indices in the db
r = graph.data('CALL db.indexes;')
pd.DataFrame(r)

| | description | failureMessage | id | indexName | progress | properties | provider | state | tokenNames | type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INDEX ON :Article(title) | | 1 | index_1 | 100.0 | [title] | {'version': '1.0', 'key': 'native-btree'} | ONLINE | [Article] | node_unique_property |


In [22]:
# Read the data from ipython

In [57]:
external_edges[external_edges.curr == "Kakapo_(album)"]

| | prev | curr | type | n |
|---|---|---|---|---|
| 5383438 | other-empty | Kakapo_(album) | external | 13 |


In [55]:
for counter, value in enumerate([1,2,3]):
    print(counter, value)

0 1
1 2
2 3


In [71]:
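# Create the external nodes one at a time, with one transaction per node.
# This was far too slow and was interrupted by hand (see the output below).
start_time = timer()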

ext_nodes_count = len(ext_nodes_list)

last_i = 0

for i, article in enumerate(ext_nodes_list):
    tx = graph.begin()

    a = Node("Article", title=article)
    a["external_website_traffic"] = external_traffic(external_edges, article, "other-external")
    a["other_wikimedia_traffic"] = external_traffic(external_edges, article, "other-internal")
    a["external_search_traffic"] = external_traffic(external_edges, article, "other-search")
    a["empty_referer_traffic"] = external_traffic(external_edges, article, "other-empty")
    a["unknown_external_traffic"] = external_traffic(external_edges, article, "other-other")
    
    tx.create(a)
    tx.commit()
    
    last_i = i
    
    if i % 1000000 == 0:
        print(str(round(i/ext_nodes_count * 100)) +'% done')
        print("elapsed time:", (timer() - start_time)/60, "min\n")
        

cu.printRunTime(start_time)

0% done
elapsed time: 0.07833123079999496 min



KeyboardInterrupt: 

In [72]:
last_i

289
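
The loop above stopped after 290 nodes, in part because every node got its own transaction. One common way to cut that overhead is to send nodes in batches and let Cypher `UNWIND` each batch. The sketch below only shows the batching logic; the `batched` helper, the query text, and the node maps are illustrative assumptions, and the `graph.run` call is left commented out because it needs a live connection. This notebook ended up using `LOAD CSV` in cypher-shell instead.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable (hypothetical helper)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical Cypher for one batch; {rows} uses the same parameter
# syntax as the other queries in this notebook.
BATCH_QUERY = """
UNWIND {rows} AS row
CREATE (a:Article)
SET a = row
"""

# Made-up property maps standing in for the real article data
nodes = [{"title": f"Article_{i}", "empty_referer_traffic": i} for i in range(2500)]

batch_sizes = []
for chunk in batched(nodes, 1000):
    # With a live connection this would be something like:
    # graph.run(BATCH_QUERY, rows=chunk)
    batch_sizes.append(len(chunk))

print(batch_sizes)
```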

In [24]:
test = Node("Person", name="Alice")

In [25]:
test

(alice:Person {name:"Alice"})

In [26]:
test["age"] = 20

In [27]:
test

(alice:Person {age:20,name:"Alice"})

In [30]:
test['wt'] = None

In [31]:
test

(alice:Person {age:20,name:"Alice"})

In [None]:

# Example transaction: create two nodes and a relationship between them
tx = graph.begin()
a = Node("Person", name="Alice")
tx.create(a)
b = Node("Person", name="Bob")
ab = Relationship(a, "KNOWS", b)
tx.create(ab)
tx.commit()

In [None]:
start_time = timer()
with open("../data/external_nodes.tsv","w") as tsvfile:
    wr = csv.writer(tsvfile, delimiter='\t')
    
    # header
    wr.writerow(["title", 
                 "external_website_traffic", 
                 "other_wikimedia_traffic", 
                 "external_search_traffic", 
                 "empty_referer_traffic", 
                 "unknown_external_traffic"])
    #rows
    for article in ext_nodes_list:
        wr.writerow([article, 
                     external_traffic(external_edges, article, "other-external"), 
                     external_traffic(external_edges, article, "other-internal"), 
                     external_traffic(external_edges, article, "other-search"), 
                     external_traffic(external_edges, article, "other-empty"), 
                     external_traffic(external_edges, article, "other-other")])
        

cu.printRunTime(start_time)

In [None]:

# Same example transaction, followed by a check that the relationship exists
tx = graph.begin()
a = Node("Person", name="Alice")
tx.create(a)
b = Node("Person", name="Bob")
ab = Relationship(a, "KNOWS", b)
tx.create(ab)
tx.commit()
graph.exists(ab)

In [None]:
graph.exists(ab)

#### Read in the data

By default, neo4j only imports data from its own `import` subfolder (on Ubuntu, `/var/lib/neo4j/import`) or from a URL, so move the data into that folder first. Alternatively, the restriction can be lifted by commenting out the `dbms.directories.import` setting in `neo4j.conf`.

In [73]:
filename = "clickstream-enwiki-2018-12_clean.tsv"

In [74]:
fp = 'file:///' + filename
    
query_test = """
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    return row
    limit 10
    ;
    """

test = graph.data(query_test, myfilepath=str(fp))
pd.DataFrame(test)

| | row |
|---|---|
| 0 | [other-empty, 2019_Horizon_League_Baseball_Tou... |
| 1 | [other-search, ForeverAtLast, external, 40] |
| 2 | [other-empty, ForeverAtLast, external, 85] |
| 3 | [First_Families_of_Pakistan, Jehangir_Wadia, l... |
| 4 | [The_Lawrence_School,_Sanawar, Jehangir_Wadia,... |
| 5 | [Wadia_family, Jehangir_Wadia, link, 715] |
| 6 | [other-search, Jehangir_Wadia, external, 967] |
| 7 | [Ness_Wadia, Jehangir_Wadia, link, 494] |
| 8 | [other-empty, Jehangir_Wadia, external, 638] |
| 9 | [GoAir, Jehangir_Wadia, link, 1191] |


In [75]:
query_test = """
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    return row[0], row[1], split(row[0], '-')[1], toInteger(row[3])
    limit 10
    ;
    """

test = graph.data(query_test, myfilepath=str(fp))
pd.DataFrame(test)

| | `row[0]` | `row[1]` | `split(row[0], '-')[1]` | `toInteger(row[3])` |
|---|---|---|---|---|
| 0 | other-empty | 2019_Horizon_League_Baseball_Tournament | empty | 16 |
| 1 | other-search | ForeverAtLast | search | 40 |
| 2 | other-empty | ForeverAtLast | empty | 85 |
| 3 | First_Families_of_Pakistan | Jehangir_Wadia | | 19 |
| 4 | The_Lawrence_School,_Sanawar | Jehangir_Wadia | | 36 |
| 5 | Wadia_family | Jehangir_Wadia | | 715 |
| 6 | other-search | Jehangir_Wadia | search | 967 |
| 7 | Ness_Wadia | Jehangir_Wadia | | 494 |
| 8 | other-empty | Jehangir_Wadia | empty | 638 |
| 9 | GoAir | Jehangir_Wadia | | 1191 |


In [None]:
# cypher-shell queries

###############
### NODES  ####
###############

### External reference type #################################################

USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[0] = 'other-empty' 
    CREATE (n:Article { title: row[1], empty_referer_traffic:  toInteger(row[3]) })
    ;
# the above ran for about 5 min in terminal cypher-shell on AWS
# Added 5093433 nodes, Set 10186866 properties, Added 5093433 labels


USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[0] = 'other-external' 
    MERGE (n:Article { title: row[1] })
    ON CREATE SET n.external_website_traffic = toInteger(row[3])
    ON MATCH SET n.external_website_traffic = toInteger(row[3])
    ;
# the above ran for about 2 min in terminal cypher-shell on AWS
# Added 726 nodes, Set 788037 properties, Added 726 labels


USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[0] = 'other-other' 
    MERGE (n:Article { title: row[1] })
    ON CREATE SET n.unknown_external_traffic = toInteger(row[3])
    ON MATCH SET n.unknown_external_traffic = toInteger(row[3])
    ;
# the above ran for about 2 min in terminal cypher-shell on AWS
# Added 31 nodes, Set 374952 properties, Added 31 labels


USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[0] = 'other-internal' 
    MERGE (n:Article { title: row[1] })
    ON CREATE SET n.other_wikimedia_traffic = toInteger(row[3])
    ON MATCH SET n.other_wikimedia_traffic = toInteger(row[3])
    ;
# the above ran for about 2 min in terminal cypher-shell on AWS
# Added 3454 nodes, Set 1352346 properties, Added 3454 labels


USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[0] = 'other-search' 
    MERGE (n:Article { title: row[1] })
    ON CREATE SET n.external_search_traffic = toInteger(row[3])
    ON MATCH SET n.external_search_traffic = toInteger(row[3])
    ;
# the above ran for about 3.5 min in terminal cypher-shell on AWS
# Added 65465 nodes, Set 3451890 properties, Added 65465 labels



### Non-external reference type #################################################

USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] <> 'external' 
    MERGE (n1:Article { title: row[0] })
    MERGE (n2:Article { title: row[1] })
    ;
# the above ran for about 13 min in terminal cypher-shell on AWS
# Added 22590 nodes, Set 22590 properties, Added 22590 labels




###############
### EDGES  ####
###############


### LINK_TO relationships #################################################
USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] = 'link' 
    MATCH (n1:Article { title: row[0] })
    MATCH (n2:Article { title: row[1] })
    CREATE (n1)-[r:LINK_TO { traffic: toInteger(row[3]) }]->(n2)
    ;
# the above ran for about 26 min in terminal cypher-shell on AWS
# Created 17851574 relationships, Set 17851574 properties


### SEARCH_FOR relationships ##############################################
USING PERIODIC COMMIT
    LOAD CSV FROM "file:///clickstream-enwiki-2018-12_clean.tsv" AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] = 'other' 
    MATCH (n1:Article { title: row[0] })
    MATCH (n2:Article { title: row[1] })
    CREATE (n1)-[r:SEARCH_FOR { traffic: toInteger(row[3]) }]->(n2)
    ;

# the above ran for about 14 min in terminal cypher-shell on AWS
# Created 1005180 relationships, Set 1005180 properties


# Total runtime is about 70 min in cypher-shell on AWS. Every other import method I tried was much slower.
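
The counts reported in the comments above can be cross-checked against the pandas side of the notebook. A small sanity check, using only numbers copied from the outputs in this notebook (the variable names are illustrative):

```python
# Nodes added by each external import step, as reported by cypher-shell
external_nodes_added = {
    "other-empty": 5093433,
    "other-external": 726,
    "other-other": 31,
    "other-internal": 3454,
    "other-search": 65465,
}
# Nodes added by the non-external step (articles seen only in internal rows)
internal_only_nodes_added = 22590

# len(ext_nodes) from the data prep section
ext_nodes_in_pandas = 5163109

total_external = sum(external_nodes_added.values())
total_nodes = total_external + internal_only_nodes_added

# Approximate per-step runtimes in minutes, in the order the queries ran
step_minutes = [5, 2, 2, 2, 3.5, 13, 26, 14]

print("external nodes match pandas:", total_external == ext_nodes_in_pandas)
print("total Article nodes:", total_nodes)
print("total runtime (min):", sum(step_minutes))
```

The five external steps together add exactly as many nodes as there are distinct `curr` values among the external rows, which suggests none were dropped or duplicated during the import.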

In [None]:
## Cypher queries for inspecting data in the neo4j browser (screenshots saved on desktop for now)

# visualize db schema (`call db.schema()` has been deprecated)
call db.schema.visualization()


# looking into Main_Page single-hop graphs
# Main_Page out links
match p=(a:Article)-[r:LINK_TO]->()
where a.title="Main_Page"
return p
limit 200

# Main_Page in links
match p=(a:Article)<-[r:LINK_TO]-()
where a.title="Main_Page"
return p
limit 200


# Main_Page out searches
match p=(a:Article)-[r:SEARCH_FOR]->()
where a.title="Main_Page"
return p
limit 200

# Main_Page in searches
match p=(a:Article)<-[r:SEARCH_FOR]-()
where a.title="Main_Page"
return p
limit 200


# Conclusion:
# Keep link traffic to Main_Page, but drop link traffic from Main_Page and search traffic to/from Main_Page.
# Save the dropped Main_Page traffic values on the nodes.

# cypher-shell query to remove Main_Page out-link-traffic
MATCH (mp:Article {title: "Main_Page"})-[r:LINK_TO]->(a:Article)
SET a.link_traffic_from_main_page = r.traffic
DELETE r;
# 0 rows available after 141 ms, consumed after another 0 ms
# Deleted 73 relationships, Set 73 properties

# cypher-shell query to remove Main_Page out-search-traffic
MATCH (mp:Article {title: "Main_Page"})-[r:SEARCH_FOR]->(a:Article)
SET a.search_traffic_from_main_page = r.traffic
DELETE r;
# 0 rows available after 30816 ms, consumed after another 0 ms
# Deleted 257794 relationships, Set 257794 properties

# cypher-shell query to remove Main_Page in-search-traffic
MATCH (mp:Article {title: "Main_Page"})<-[r:SEARCH_FOR]-(a:Article)
SET a.search_traffic_to_main_page = r.traffic
DELETE r;
# 0 rows available after 1427 ms, consumed after another 0 ms
# Deleted 110009 relationships, Set 110009 properties


In [None]:

# looking into Hyphen-minus single-hop search subgraphs

# out searches
match p=(a:Article)-[r:SEARCH_FOR]->()
where a.title="Hyphen-minus"
return p
limit 200
# The only out-search from "Hyphen-minus" was to Main_Page, which was deleted in the queries above

# in searches
match p=(a:Article)<-[r:SEARCH_FOR]-()
where a.title="Hyphen-minus"
return p
limit 200


# Conclusion:
# Drop in-search traffic to "Hyphen-minus", 
# don't really need to save the dropped "Hyphen-minus" traffic values on nodes,
# but doing it just in case it looks interesting later.

# cypher-shell query to remove "Hyphen-minus" in-search-traffic
MATCH (hm:Article {title: "Hyphen-minus"})<-[r:SEARCH_FOR]-(a:Article)
SET a.search_traffic_to_hyphen_minus = r.traffic
DELETE r;
# 0 rows available after 1569 ms, consumed after another 0 ms
# Deleted 127457 relationships, Set 127457 properties

In [None]:

# looking into Undefined single-hop search subgraphs

# out searches
match p=(a:Article)-[r:SEARCH_FOR]->()
where a.title="Undefined"
return p
limit 200
# The only out-search from "Undefined" was to Main_Page, which was deleted in the queries above

# in searches
match p=(a:Article)<-[r:SEARCH_FOR]-()
where a.title="Undefined"
return p
limit 200


# Conclusion:
# Drop in-search traffic to "Undefined", 
# don't really need to save the dropped "Undefined" traffic values on nodes,
# but doing it just in case it looks interesting later.

# cypher-shell query to remove "Undefined" in-search-traffic
MATCH (u:Article {title: "Undefined"})<-[r:SEARCH_FOR]-(a:Article)
SET a.search_traffic_to_undefined = r.traffic
DELETE r;
# 0 rows available after 37 ms, consumed after another 0 ms
# Deleted 241 relationships, Set 241 properties

In [None]:
# sample final graph paths in neo4j browser
MATCH p=()-[]->() RETURN p LIMIT 300

# looks good, we're done with graph modeling.
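
As a final bookkeeping step, the relationship counts left in the graph can be tallied from the cypher-shell messages recorded above. This is a sketch that assumes no relationships were created or deleted outside the queries logged in this notebook; the variable names are illustrative.

```python
# Relationships created during import, as reported by cypher-shell
created = {"LINK_TO": 17851574, "SEARCH_FOR": 1005180}

# Relationships deleted during cleanup, as reported by cypher-shell
deleted = {
    "LINK_TO": {"from Main_Page": 73},
    "SEARCH_FOR": {
        "from Main_Page": 257794,
        "to Main_Page": 110009,
        "to Hyphen-minus": 127457,
        "to Undefined": 241,
    },
}

# Remaining relationships per type = created minus all logged deletions
remaining = {
    rel_type: created[rel_type] - sum(deleted[rel_type].values())
    for rel_type in created
}

print(remaining)
print("total remaining:", sum(remaining.values()))
```

Under that assumption, roughly half of the original `SEARCH_FOR` relationships are removed by the cleanup, while `LINK_TO` is almost untouched.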