# Exploring Wikipedia clickstream data: the English Wiki in December 2018    

## Graph modeling

### 1. Introduction  
This notebook covers the graph modeling steps for the English Wikipedia clickstream dataset for December 2018. It is part of my project on Wikipedia usage patterns across language domains and over time.  


### 2. Notebook setup  

#### Imports

In [7]:
import pandas as pd
import numpy as np

In [8]:
import os

#### Settings

In [9]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 3)
# note: float_format takes precedence over display.precision when floats are displayed
pd.options.display.float_format = '{:.2f}'.format

#### Helper functions

#### Get and check the data

In [5]:
# Get the cleaned tsv data onto AWS from my local machine
# Run from terminal on local:
# scp ~/projects/wiki_graph/data/clickstream-enwiki-2018-12_clean.tsv arinai@3.87.124.217:/home/arinai/data/wiki_clickstream

# This took about 15 min.
# Try getting the raw data with wget from Wikimedia datadump and then cleaning it next time to see if it's faster.

In [10]:
# Print the head of the tsv file
!head ../data/wiki_clickstream/clickstream-enwiki-2018-12_clean.tsv

# I move the file to neo4j's import folder later
# !head /var/lib/neo4j/import/clickstream-enwiki-2018-12_clean.tsv

other-empty	2019_Horizon_League_Baseball_Tournament	external	16
other-search	ForeverAtLast	external	40
other-empty	ForeverAtLast	external	85
First_Families_of_Pakistan	Jehangir_Wadia	link	19
The_Lawrence_School,_Sanawar	Jehangir_Wadia	link	36
Wadia_family	Jehangir_Wadia	link	715
other-search	Jehangir_Wadia	external	967
Ness_Wadia	Jehangir_Wadia	link	494
other-empty	Jehangir_Wadia	external	638
GoAir	Jehangir_Wadia	link	1191
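
Each row of the cleaned file holds four tab-separated fields. A minimal sketch of parsing one row in plain Python, assuming the column names `prev`, `curr`, `type` and `n` (the names used for the dataframe later in this notebook); `ClickRow` and `parse_row` are hypothetical helpers for illustration, not part of the project code:

```python
from typing import NamedTuple


class ClickRow(NamedTuple):
    prev: str   # referer: an article title or an 'other-*' source
    curr: str   # requested article title
    type: str   # 'link', 'external' or 'other'
    n: int      # number of times this (referer, article) pair occurred


def parse_row(line: str) -> ClickRow:
    """Split one tab-separated clickstream line into typed fields."""
    prev, curr, link_type, n = line.rstrip("\n").split("\t")
    return ClickRow(prev, curr, link_type, int(n))


row = parse_row("Wadia_family\tJehangir_Wadia\tlink\t715\n")
print(row)
```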


### 3. Setting up Neo4j on AWS EC2 Ubuntu

#### Install Neo4j community edition server  
- For an Ubuntu EC2 instance on AWS, follow the Debian installation instructions [here](https://neo4j.com/docs/operations-manual/current/installation/linux/debian/). 
- [Run neo4j as a service using `systemctl`](https://neo4j.com/docs/operations-manual/current/installation/linux/systemd/#linux-service-control):  
  - To start neo4j: `systemctl start neo4j`  
  - To stop neo4j: `systemctl stop neo4j`

- Run neo4j and tunnel into the neo4j browser:
  - Start neo4j: `systemctl start neo4j`
  - Open an ssh tunnel for the HTTP (7474) and Bolt (7687) ports, e.g.: `ssh -NfL localhost:7474:localhost:7474 -L localhost:7687:localhost:7687 arinai@3.87.124.217`
  - In a browser, open `http://localhost:7474/browser/`
  - On first login, both the username and password are `neo4j`. After the initial login there will be a prompt to set a new password. 
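
Connection errors later in the notebook are easier to diagnose if the forwarded ports are checked first. A minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and it only tests that something accepts TCP connections on the port, not that neo4j itself is healthy:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 7474 is the HTTP (browser) port and 7687 the Bolt port forwarded by the ssh tunnel
for port in (7474, 7687):
    print(port, "open" if port_is_open("localhost", port) else "closed")
```

If either port shows as closed, restart the tunnel or the neo4j service before running any queries.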

### 4. Load the data into Neo4j

#### Connect

In [8]:
from py2neo import authenticate, Graph

In [14]:
# To avoid typing neo4j password into the notebook each time,
# I'm saving it in a separate file and reading it in with the helper function below.
def read_n4jpass():
    """Reads neo4j connection credentials from .n4jpass file in current folder.
    Expects one key=value pair per line, ignores blank lines and comments, e.g.:
    # comments here
    user=neo4j
    password=secretStuff123
    """
    with open(os.path.join(os.getcwd(), '.n4jpass'), 'r') as f:
        lines = f.readlines()

    d = {}
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            # split on the first '=' only, so that passwords may contain '='
            k, v = line.split('=', 1)
            d[k] = v

    return d

In [15]:
n4j_cred = read_n4jpass()

In [17]:
# set up authentication parameters
authenticate("localhost:7474", n4j_cred["user"], n4j_cred["password"])

In [18]:
# connect to authenticated graph database
graph = Graph("http://localhost:7474/db/data/")

In [26]:
# run a test query
test = graph.data("call db.schema()")
pd.DataFrame(test)

Unnamed: 0,nodes,relationships
0,[],[]


#### Set indices  
Neo4j does not enforce a schema by default: constraints and indexes exist only if you create them. Since article titles in Wikipedia are unique, let's add a uniqueness constraint on `Article.title`. The constraint is backed by an index, which speeds up the `MERGE` lookups in the import queries.

In [39]:
graph.run("CREATE CONSTRAINT ON (a:Article) ASSERT a.title IS UNIQUE;")

<py2neo.database.Cursor at 0x7f175df97ac8>

In [40]:
# list all indices in the db
r = graph.data('CALL db.indexes;')
pd.DataFrame(r)

Unnamed: 0,description,failureMessage,id,indexName,progress,properties,provider,state,tokenNames,type
0,INDEX ON :Article(title),,3,index_3,100.0,[title],"{'version': '1.0', 'key': 'native-btree'}",ONLINE,[Article],node_unique_property


#### Read in the data

By default, neo4j only imports data from its own `import` subfolder (on Ubuntu, `/var/lib/neo4j/import`) or from a URL, so move the data into the neo4j import folder first. Alternatively, this restriction can be lifted by commenting out the `dbms.directories.import` setting in `neo4j.conf`.
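
A minimal sketch of checking that the file is actually in the import folder before building the `file:///` URL for `LOAD CSV`. `import_url` is a hypothetical helper; the default folder is the Ubuntu location mentioned above, and the check only works when the notebook runs on the same machine as neo4j:

```python
import os


def import_url(filename: str, import_dir: str = "/var/lib/neo4j/import") -> str:
    """Return the file:/// URL for a file that exists in neo4j's import folder."""
    path = os.path.join(import_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; move the file into the import folder first"
        )
    # with the default configuration, LOAD CSV resolves file:/// URLs
    # relative to the import folder, so only the file name is needed
    return "file:///" + filename


# usage (raises FileNotFoundError if the file has not been moved yet):
# fp = import_url("clickstream-enwiki-2018-12_clean.tsv")
```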

In [33]:
filename = "clickstream-enwiki-2018-12_clean.tsv"

In [42]:
fp = 'file:///' + filename
    
query_test = """
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    return row
    limit 10
    ;
    """

test = graph.data(query_test, myfilepath=str(fp))
pd.DataFrame(test)

ProtocolError: Unable to connect to localhost on port 7687 - is the server running?

In [41]:
query_test = """
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    return row[0], row[1], split(row[0], '-')[1], toInteger(row[3])
    limit 10
    ;
    """

test = graph.data(query_test, myfilepath=str(fp))
pd.DataFrame(test)

ProtocolError: Server closed connection

In [None]:
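# Load the nodes first. Rows of type 'link' and 'other' have a Wikipedia article as referer;
# rows of type 'external' have an 'other-*' source instead, stored here as properties of the target article.
# (For large data files it's faster to load nodes and relationships in separate passes.)
# nodes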
query_nodes = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    FOREACH (_ IN CASE WHEN row[2] <> 'external' THEN [1] else [] end |
        MERGE (n1:Article { title: row[0] })
        MERGE (n2:Article { title: row[1] })
    )
    FOREACH (_ IN CASE WHEN row[2] = 'external' THEN [1] else [] end |
        MERGE (n3:Article { title: row[1] })
            ON CREATE SET
                n3.external_source = split(row[0], '-')[1],
                n3.external_traffic = toInteger(row[3])
    )
    ;
    """

graph.run(query_nodes, myfilepath=str(fp))
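
The `FOREACH (_ IN CASE WHEN ... THEN [1] ELSE [] END | ...)` construct works as an if-statement: the `CASE` yields a one-element list when the condition holds, so the body runs once, and an empty list otherwise. The sketch below mimics the node-loading logic of the query on a plain Python dict, to make the effect of `MERGE ... ON CREATE SET` easier to see. `merge_node` and `load_nodes` are hypothetical helpers for illustration only and say nothing about how neo4j executes the query:

```python
def merge_node(nodes: dict, title: str, on_create: dict = None) -> dict:
    """Mimic MERGE ... ON CREATE SET: create the node only if it is missing."""
    if title not in nodes:
        nodes[title] = dict(on_create or {})
    return nodes[title]


def load_nodes(rows):
    """Apply the node-loading logic of the query to (prev, curr, type, n) rows."""
    nodes = {}
    for prev, curr, link_type, n in rows:
        if link_type != "external":
            # first FOREACH: both ends of the link are articles
            merge_node(nodes, prev)
            merge_node(nodes, curr)
        else:
            # second FOREACH: only the target is an article
            merge_node(nodes, curr, {
                "external_source": prev.split("-")[1],
                "external_traffic": int(n),
            })
    return nodes


rows = [
    ("other-search", "Jehangir_Wadia", "external", "967"),
    ("GoAir", "Jehangir_Wadia", "link", "1191"),
    ("other-empty", "Jehangir_Wadia", "external", "638"),
]
nodes = load_nodes(rows)
print(nodes)
```

In this sketch the second external row for the same article changes nothing, because the node already exists: only the first external source seen is kept. Modeling external sources as separate nodes with their own relationships avoids that loss.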

In [None]:
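def import_wiki_to_neo(filename, language, language_code):
    """Loads a cleaned clickstream tsv file into neo4j in four passes:
    internal nodes, internal relationships, external nodes, external relationships.
    """
    # note: by default, neo4j only imports data from its own import subfolder or from a url,
    # so make sure to place the cleaned data into the neo4j import folder.
    fp = 'file:///' + filename

    # nodes - internal
    # assumption: articles use the same node pattern as in the external-nodes query below
    query_nodes_internal = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] <> 'external'
    MERGE (n1:Article {title: row[0], language_code: {lang_code} })
        ON CREATE SET
            n1.language = {lang}
    MERGE (n2:Article {title: row[1], language_code: {lang_code} })
        ON CREATE SET
            n2.language = {lang}
    ;
    """
    graph.run(query_nodes_internal, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))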

    # relationships - internal
    query_relationships_internal = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] <> 'external'
    MATCH (n1:Article { title: row[0] })
    MATCH (n2:Article { title: row[1] })
    MERGE (n1)-[r:REFERRED_TO]->(n2)
        ON CREATE SET
            r.type = row[2],
            r.count = toInteger(row[3])
    ;
    """
    graph.run(query_relationships_internal, myfilepath=str(fp))

    # nodes - external
    query_nodes_external = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] = 'external'
    MERGE (n1:ExternalSource {source_type: row[0], language_code: {lang_code} })
        ON CREATE SET
            n1.language = {lang},
            n1.description = CASE 
                WHEN row[0] = 'other-internal' THEN 'a page from any other Wikimedia project (not Wikipedia)'
                WHEN row[0] = 'other-search' THEN 'an external search engine'
                WHEN row[0] = 'other-external' THEN 'any other external site (not search engine)'
                WHEN row[0] = 'other-empty' THEN 'an empty referer'
                WHEN row[0] = 'other-other' THEN 'anything else (catch-all)'
                ELSE '' END

    MERGE (n2:Article {title: row[1], language_code: {lang_code} })
        ON CREATE SET
            n2.language = {lang}
    ;
    """
    graph.run(query_nodes_external, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))

    # relationships - external
    query_relationships_external = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] = 'external'
    MATCH (n1:ExternalSource {source_type: row[0], language_code: {lang_code} })
    MATCH (n2:Article {title: row[1], language_code: {lang_code} })
    MERGE (n1)-[r:REFERRED_TO]->(n2)
        ON CREATE SET
            r.type = row[2],
            r.count = toInteger(row[3])
    ;
    """
    graph.run(query_relationships_external, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))

In [4]:
# pandas parsed the article title "NaN" as a missing value; restore it as the string "NaN"
df['prev'] = df['prev'].fillna('NaN')
df['curr'] = df['curr'].fillna('NaN')
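
The false NaNs appear because pandas treats strings such as `NaN`, `NA` and `null` as missing values by default, although they can be legitimate article titles. A sketch of avoiding the problem at read time instead, assuming a tab-separated file without a header row and the column names used in this notebook. It reads a tiny in-memory sample with made-up rows rather than the real file:

```python
import io

import pandas as pd

# tiny in-memory stand-in for the tsv file (hypothetical rows, for illustration only)
sample = "other-search\tNaN\texternal\t10\nNaN\tnull\tlink\t5\n"

df_sample = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["prev", "curr", "type", "n"],
    keep_default_na=False,  # do not treat 'NaN', 'NA', 'null', ... as missing
    na_values=[],           # no custom missing-value markers either
    quoting=3,              # csv.QUOTE_NONE: titles may contain quote characters
)
print(df_sample)
```

With these options the title columns stay plain strings, so no `fillna` step is needed and titles such as `null` are not collapsed into `NaN`.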

In [5]:
df.sort_values("n", ascending=False, inplace=True)

In [6]:
df.head(10)

Unnamed: 0,prev,curr,type,n
7908180,other-empty,Main_Page,external,492341152
772288,other-external,Hyphen-minus,external,17676430
765759,other-empty,Hyphen-minus,external,15498618
7831692,other-internal,Main_Page,external,8826536
21252458,other-search,George_H._W._Bush,external,4576854
3488262,other-empty,XHamster,external,4281194
23951757,other-search,Jason_Momoa,external,3538068
18403529,other-search,2.0_(film),external,3475113
1830737,other-search,Bird_Box_(film),external,3251996
7897786,other-search,Main_Page,external,3020671
