# Exploring Wikipedia clickstream data: the English Wiki in December 2018    

## Graph modeling

### 1. Introduction  
This notebook contains the graph modeling steps for the English Wikipedia clickstream dataset for December 2018. It is part of my project on Wikipedia usage patterns across language domains and over time.  


### 2. Notebook setup  

#### Imports

In [3]:
import pandas as pd
import numpy as np

import os

import csv

from time import sleep
from timeit import default_timer as timer

# custom general helper functions for this project
import custom_utils as cu
import importlib

In [4]:
# reload imports as needed
importlib.reload(cu);

#### Settings

In [9]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 3)
pd.options.display.float_format = '{:.2f}'.format

#### Helper functions

#### Get and check the data

In [5]:
# Get the cleaned tsv data onto AWS from my local machine
# Run from terminal on local:
# scp ~/projects/wiki_graph/data/clickstream-enwiki-2018-12_clean.tsv arinai@3.87.124.217:/home/arinai/data/
# This took about 15 min.
# Try getting the raw data with wget from Wikimedia datadump and then cleaning it next time to see if it's faster.

In [10]:
# Print the head of the tsv file
!head ../data/wiki_clickstream/clickstream-enwiki-2018-12_clean.tsv

# I move the file to neo4j's import folder later
# !head /var/lib/neo4j/import/clickstream-enwiki-2018-12_clean.tsv

other-empty	2019_Horizon_League_Baseball_Tournament	external	16
other-search	ForeverAtLast	external	40
other-empty	ForeverAtLast	external	85
First_Families_of_Pakistan	Jehangir_Wadia	link	19
The_Lawrence_School,_Sanawar	Jehangir_Wadia	link	36
Wadia_family	Jehangir_Wadia	link	715
other-search	Jehangir_Wadia	external	967
Ness_Wadia	Jehangir_Wadia	link	494
other-empty	Jehangir_Wadia	external	638
GoAir	Jehangir_Wadia	link	1191


In [5]:
# Read the cleaned up EN clickstream tsv file into pandas
filepath = "../data/clickstream-enwiki-2018-12_clean.tsv"
df = pd.read_csv(filepath, sep='\t', names=['prev', 'curr', 'type', 'n'])

In [6]:
# Replace the false missing value NaNs with string "NaN"s
df['prev'] = df['prev'].fillna('NaN')
df['curr'] = df['curr'].fillna('NaN')
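
The false NaNs above appear because `read_csv` treats certain strings (for example `NaN` or `NA`) as missing values by default, and a title can be exactly such a string. If more than one such title occurs in the data, `fillna('NaN')` would merge them into a single name. A possible alternative, sketched here on a made-up two-row sample, is to pass `keep_default_na=False` so that the titles are read literally:

```python
import io
import pandas as pd

# Made-up sample in the clickstream layout: prev, curr, type, n
sample = "other-search\tNaN\texternal\t40\nNaN\tNA\tlink\t12\n"
cols = ['prev', 'curr', 'type', 'n']

# Default parsing: the strings 'NaN' and 'NA' are read as missing values
default_df = pd.read_csv(io.StringIO(sample), sep='\t', names=cols)

# keep_default_na=False (with no custom na_values) keeps the titles as literal strings
literal_df = pd.read_csv(io.StringIO(sample), sep='\t', names=cols,
                         keep_default_na=False)

print(default_df['curr'].isna().sum())  # titles lost to missing-value parsing
print(literal_df['curr'].tolist())      # titles kept as strings
```

With this option no `fillna` step should be needed, though it is worth checking that the count column `n` still parses as integers on the real file.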

### Data prep

In [7]:
external_edges = df[df.type == "external"]
internal_edges = df[df.type != "external"]

In [9]:
st = timer()
ext_nodes = set(external_edges["curr"])

cu.printRunTime(st)

Runtime: 0.04 min



In [10]:
len(ext_nodes)

5163109

In [11]:
list(ext_nodes)[:10]

['Kakapo_(album)',
 "When_You're_in_Love_with_a_Beautiful_Woman",
 'Théodore_Aubert',
 'Instituto_de_Historia_de_Cuba',
 '2706',
 'Renouf_Island',
 'Ceyreste',
 'Antoine_Omer_Talon',
 'La_Possession_(film)',
 'Priyanka_Bakaya']

In [43]:
def external_traffic(df, article, external_traffic_type):
    traffic = df.loc[(df.curr == article) & (df.prev == external_traffic_type), 'n'].values
    if len(traffic):
        return traffic[0].item()
    else:
        return None
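
`external_traffic` filters the whole DataFrame on every call, which gets expensive when it is called several times for each of millions of articles. One possible alternative, sketched below on made-up rows, is to build a dictionary keyed by `(curr, prev)` once and look counts up from it. The names `traffic_lookup` and `external_traffic_fast` are illustrative and not part of this notebook, and the sketch assumes each `(curr, prev)` pair occurs at most once (with duplicates, the dictionary keeps the last row rather than the first).

```python
import pandas as pd

# Made-up rows in the same shape as external_edges
sample_edges = pd.DataFrame({
    'prev': ['other-empty', 'other-search', 'other-empty'],
    'curr': ['ForeverAtLast', 'ForeverAtLast', 'Jehangir_Wadia'],
    'type': ['external'] * 3,
    'n': [85, 40, 638],
})

# Build the lookup once: {(article, referer type): count}
traffic_lookup = {
    (curr, prev): int(n)
    for curr, prev, n in zip(sample_edges['curr'],
                             sample_edges['prev'],
                             sample_edges['n'])
}

def external_traffic_fast(lookup, article, external_traffic_type):
    """Return the count for (article, referer type), or None if there is no such row."""
    return lookup.get((article, external_traffic_type))

print(external_traffic_fast(traffic_lookup, 'ForeverAtLast', 'other-search'))
print(external_traffic_fast(traffic_lookup, 'ForeverAtLast', 'other-other'))
```

Each lookup is then a constant-time dictionary access instead of a scan over all external rows.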

In [13]:
ext_nodes_list = list(ext_nodes)

In [29]:
None

In [44]:
external_traffic(external_edges, 'Renouf_Island', "other-other")

### 3. Setting up Neo4j on AWS EC2 Ubuntu

#### Install Neo4j community edition server  
- For an Ubuntu EC2 instance on AWS, follow the Debian installation instructions [here](https://neo4j.com/docs/operations-manual/current/installation/linux/debian/). 
- [Run neo4j as a service using `systemctl`](https://neo4j.com/docs/operations-manual/current/installation/linux/systemd/#linux-service-control):  
  - To start neo4j: `systemctl start neo4j`  
  - To stop neo4j: `systemctl stop neo4j`

- Run neo4j and tunnel into the neo4j browser:
  - Run neo4j: `systemctl start neo4j`
  - Tunnel into the browser from the local machine, e.g.: `ssh -NfL localhost:7474:localhost:7474 -L localhost:7687:localhost:7687 arinai@3.87.124.217`
  - In a local browser, open `http://localhost:7474/browser/`
  - On first login, both the username and password are `neo4j`. After the initial login there will be a prompt to set a new password.

### 4. Load the data into Neo4j

#### Connect

In [23]:
from py2neo import authenticate, Graph, Node, Relationship

In [15]:
# To avoid typing neo4j password into the notebook each time,
# I'm saving it in a separate file and reading it in with the helper function below.
def read_n4jpass():
    """Reads neo4j connection credentials from .n4jpass file in current folder.
    Expects one key=value pair per line, ignores blank lines and comments, e.g.:
    # comments here
    user=neo4j
    password=secretStuff123
    """
    
    # build the path portably instead of concatenating strings
    filepath = os.path.join(os.getcwd(), '.n4jpass')
    
    with open(filepath, 'r') as f:
        lines = f.readlines()

    d = {}
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            # split on the first '=' only, so a password containing '=' stays intact
            k, v = line.split('=', 1)
            d[k] = v

    return d

In [16]:
n4j_cred = read_n4jpass()

In [17]:
# set up authentication parameters
authenticate("localhost:7474", n4j_cred["user"], n4j_cred["password"])

In [18]:
# connect to authenticated graph database
graph = Graph("http://localhost:7474/db/data/")

In [19]:
# run a test query
test = graph.data("call db.schema()")
pd.DataFrame(test)

Unnamed: 0,nodes,relationships
0,[],[]


#### Set indices  
Neo4j does not create any indexes or constraints on its own. Since Wikipedia article titles are unique, let's add a uniqueness constraint on `title`. The constraint also creates an index on that property, which makes the `MERGE` lookups in the import queries run faster.

In [20]:
graph.run("CREATE CONSTRAINT ON (a:Article) ASSERT a.title IS UNIQUE;");

In [21]:
# list all indices in the db
r = graph.data('CALL db.indexes;')
pd.DataFrame(r)

Unnamed: 0,description,failureMessage,id,indexName,progress,properties,provider,state,tokenNames,type
0,INDEX ON :Article(title),,1,index_1,100.0,[title],"{'version': '1.0', 'key': 'native-btree'}",ONLINE,[Article],node_unique_property


In [22]:
# Read the data from ipython

In [38]:
external_edges[external_edges.curr == "Kakapo_(album)"]

Unnamed: 0,prev,curr,type,n
5383438,other-empty,Kakapo_(album),external,13


In [47]:
for counter, value in enumerate([1,2,3]):
    print(counter, value)

0 1
1 2
2 3


In [46]:
start_time = timer()

ext_nodes_count = len(ext_nodes_list)

for i, article in enumerate(ext_nodes_list):
    tx = graph.begin()

    a = Node("Article", title=article)
    a["external_website_traffic"] = external_traffic(external_edges, article, "other-external")
    a["other_wikimedia_traffic"] = external_traffic(external_edges, article, "other-internal")
    a["external_search_traffic"] = external_traffic(external_edges, article, "other-search")
    a["empty_referer_traffic"] = external_traffic(external_edges, article, "other-empty")
    a["unknown_external_traffic"] = external_traffic(external_edges, article, "other-other")
    
    tx.create(a)
    tx.commit()
    
    if i % 1000000 == 0:
        print(f'{round(i/ext_nodes_count * 100)}% done')
        print("elapsed time:", (timer() - start_time)/60, "min\n")
        

cu.printRunTime(start_time)

Runtime: 0.16 min
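
The loop above opens and commits one transaction per article. A common way to reduce that overhead is to commit many nodes per transaction. The sketch below shows only the batching logic: `batched` and `create_in_batches` are hypothetical helpers, `make_node` stands in for the `Node(...)` construction above, and `graph` is assumed to expose `begin()` returning a transaction with `create()` and `commit()`, as used in this notebook. The batch size of 10000 is an arbitrary starting point, not a tuned value.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def create_in_batches(graph, items, make_node, batch_size=10000):
    """Create one node per item, committing once per batch instead of once per node.

    Returns the number of commits performed.
    """
    n_commits = 0
    for batch in batched(items, batch_size):
        tx = graph.begin()
        for item in batch:
            tx.create(make_node(item))
        tx.commit()
        n_commits += 1
    return n_commits

print(list(batched(range(5), 2)))
```

In this notebook the call might look like `create_in_batches(graph, ext_nodes_list, lambda title: Node("Article", title=title))`, with the traffic properties set inside `make_node`.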



In [24]:
test = Node("Person", name="Alice")

In [25]:
test

(alice:Person {name:"Alice"})

In [26]:
test["age"] = 20

In [27]:
test

(alice:Person {age:20,name:"Alice"})

In [30]:
test['wt'] = None

In [31]:
test

(alice:Person {age:20,name:"Alice"})

In [None]:

# create two nodes and a relationship between them in a single transaction
tx = graph.begin()
a = Node("Person", name="Alice")
tx.create(a)
b = Node("Person", name="Bob")
ab = Relationship(a, "KNOWS", b)
tx.create(ab)
tx.commit()

In [None]:
start_time = timer()
with open("../data/external_nodes.tsv","w") as tsvfile:
    wr = csv.writer(tsvfile, delimiter='\t')
    
    # header
    wr.writerow(["title", 
                 "external_website_traffic", 
                 "other_wikimedia_traffic", 
                 "external_search_traffic", 
                 "empty_referer_traffic", 
                 "unknown_external_traffic"])
    #rows
    for article in ext_nodes_list:
        wr.writerow([article, 
                     external_traffic(external_edges, article, "other-external"), 
                     external_traffic(external_edges, article, "other-internal"), 
                     external_traffic(external_edges, article, "other-search"), 
                     external_traffic(external_edges, article, "other-empty"), 
                     external_traffic(external_edges, article, "other-other")])
        

cu.printRunTime(start_time)
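
The row-by-row loop above calls `external_traffic` five times per article. Assuming each `(curr, prev)` pair occurs at most once in `external_edges` (otherwise `pivot` raises an error), the same wide table could be produced in one pass with a pivot. This is a minimal sketch on made-up rows; the column mapping mirrors the header written above, and `column_names` and `wide` are illustrative names.

```python
import io
import pandas as pd

# Made-up rows in the same shape as external_edges
sample_edges = pd.DataFrame({
    'prev': ['other-empty', 'other-search', 'other-empty'],
    'curr': ['ForeverAtLast', 'ForeverAtLast', 'Jehangir_Wadia'],
    'type': ['external'] * 3,
    'n': [85, 40, 638],
})

# Referer type -> output column, in the same order as the header above
column_names = {
    'other-external': 'external_website_traffic',
    'other-internal': 'other_wikimedia_traffic',
    'other-search': 'external_search_traffic',
    'other-empty': 'empty_referer_traffic',
    'other-other': 'unknown_external_traffic',
}

wide = (
    sample_edges
    .pivot(index='curr', columns='prev', values='n')
    .reindex(columns=list(column_names))  # add referer types absent from the data
    .rename(columns=column_names)
    .astype('Int64')                      # nullable integers keep counts whole
    .reset_index()
    .rename(columns={'curr': 'title'})
)

# Write to an in-memory buffer here; a file path would work the same way
buffer = io.StringIO()
wide.to_csv(buffer, sep='\t', index=False)
print(buffer.getvalue())
```

Missing referer types are written as empty fields, matching what `csv.writer` does with `None`.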

In [None]:

# same as above, then check that the relationship exists in the database
tx = graph.begin()
a = Node("Person", name="Alice")
tx.create(a)
b = Node("Person", name="Bob")
ab = Relationship(a, "KNOWS", b)
tx.create(ab)
tx.commit()
graph.exists(ab)

In [None]:
graph.exists(ab)

#### Read in the data

By default, neo4j only loads local files from its own `import` subfolder (on Ubuntu, `/var/lib/neo4j/import`); it can also load from a URL. Make sure to move the data file into the import folder first. Alternatively, the restriction can be lifted by commenting out the `dbms.directories.import` setting in `neo4j.conf`.

In [71]:
filename = "clickstream-enwiki-2018-12_clean.tsv"

In [72]:
fp = 'file:///' + filename
    
query_test = """
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    return row
    limit 10
    ;
    """

test = graph.data(query_test, myfilepath=str(fp))
pd.DataFrame(test)

Unnamed: 0,row
0,"[other-empty, 2019_Horizon_League_Baseball_Tou..."
1,"[other-search, ForeverAtLast, external, 40]"
2,"[other-empty, ForeverAtLast, external, 85]"
3,"[First_Families_of_Pakistan, Jehangir_Wadia, l..."
4,"[The_Lawrence_School,_Sanawar, Jehangir_Wadia,..."
5,"[Wadia_family, Jehangir_Wadia, link, 715]"
6,"[other-search, Jehangir_Wadia, external, 967]"
7,"[Ness_Wadia, Jehangir_Wadia, link, 494]"
8,"[other-empty, Jehangir_Wadia, external, 638]"
9,"[GoAir, Jehangir_Wadia, link, 1191]"


In [73]:
query_test = """
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    return row[0], row[1], split(row[0], '-')[1], toInteger(row[3])
    limit 10
    ;
    """

test = graph.data(query_test, myfilepath=str(fp))
pd.DataFrame(test)

Unnamed: 0,row[0],row[1],"split(row[0], '-')[1]",toInteger(row[3])
0,other-empty,2019_Horizon_League_Baseball_Tournament,empty,16
1,other-search,ForeverAtLast,search,40
2,other-empty,ForeverAtLast,empty,85
3,First_Families_of_Pakistan,Jehangir_Wadia,,19
4,"The_Lawrence_School,_Sanawar",Jehangir_Wadia,,36
5,Wadia_family,Jehangir_Wadia,,715
6,other-search,Jehangir_Wadia,search,967
7,Ness_Wadia,Jehangir_Wadia,,494
8,other-empty,Jehangir_Wadia,empty,638
9,GoAir,Jehangir_Wadia,,1191


In [75]:
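# Let's load article-to-article links and 'other' referer types (all referers here are Wiki articles)
# (it's faster to load nodes and relationships separately for large data files)
# Note: ON CREATE SET only fires when the MERGE creates the node, so an article that already
# exists from an earlier row keeps whatever external_source / external_traffic it had, if any.
# nodes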
query_nodes = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    FOREACH (_ IN CASE WHEN row[2] <> 'external' THEN [1] else [] end |
        MERGE (n1:Article { title: row[0] })
        MERGE (n2:Article { title: row[1] })
    )
    FOREACH (_ IN CASE WHEN row[2] = 'external' THEN [1] else [] end |
        MERGE (n3:Article { title: row[1] })
            ON CREATE SET
                n3.external_source = split(row[0], '-')[1],
                n3.external_traffic = toInteger(row[3])
    )
    ;
    """

graph.run(query_nodes, myfilepath=str(fp))

KeyboardInterrupt: 

In [None]:
# started running the above at about 8:30pm on Sun Feb 10.

In [76]:
from timeit import default_timer as timer

In [77]:
# edges
# MATCH is not allowed inside FOREACH (only updating clauses are), so the endpoint
# nodes are looked up with MERGE, which reuses an existing node or creates a missing one
query_edges = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] <> 'external'
    FOREACH (_ IN CASE WHEN row[2] = 'link' THEN [1] else [] end |
        MERGE (n1:Article { title: row[0] })
        MERGE (n2:Article { title: row[1] })
        MERGE (n1)-[r:LINK_TO]->(n2)
            ON CREATE SET
                r.traffic = toInteger(row[3])
    )
    FOREACH (_ IN CASE WHEN row[2] = 'other' THEN [1] else [] end |
        MERGE (n3:Article { title: row[0] })
        MERGE (n4:Article { title: row[1] })
        MERGE (n3)-[r2:SEARCH_TO]->(n4)
            ON CREATE SET
                r2.traffic = toInteger(row[3])
    )
    ;
    """

start_time = timer()
graph.run(query_edges, myfilepath=str(fp))
end_time = timer()
print("runtime:", (end_time - start_time)/60, "min\n")

KeyboardInterrupt: 
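
Both import queries above filter rows with conditional `FOREACH` clauses while reading the full file. Since it is faster to load nodes and relationships separately, one option is to pre-split the data in pandas so that each `LOAD CSV` pass reads only the rows it needs. The sketch below uses made-up rows; `nodes`, `link_edges`, and `other_edges` are illustrative names, and whether this actually shortens the import on the full dataset is untested here.

```python
import pandas as pd

# Made-up rows in the clickstream layout
sample = pd.DataFrame({
    'prev': ['other-empty', 'Wadia_family', 'GoAir', 'Ness_Wadia'],
    'curr': ['Jehangir_Wadia', 'Jehangir_Wadia', 'Jehangir_Wadia', 'GoAir'],
    'type': ['external', 'link', 'link', 'other'],
    'n': [638, 715, 1191, 20],
})

internal = sample[sample['type'] != 'external']

# Every article that appears as a target, plus every article that is an internal source
titles = pd.concat([sample['curr'], internal['prev']]).drop_duplicates().sort_values()
nodes = pd.DataFrame({'title': titles.to_numpy()})

# One edge table per relationship type
link_edges = internal.loc[internal['type'] == 'link', ['prev', 'curr', 'n']]
other_edges = internal.loc[internal['type'] == 'other', ['prev', 'curr', 'n']]

# Each table could then be written with to_csv(..., sep='\t', index=False)
print(nodes)
print(link_edges)
print(other_edges)
```

With separate files, the node pass reduces to a single `MERGE` per row and each edge pass to two lookups plus one relationship `MERGE`, with no `CASE` filtering.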