# Importing Wikipedia clickstream data into a neo4j graph database

## Data

In 2015, Wikipedia started releasing datasets of clickstream counts to Wikipedia articles for research purposes. The project has evolved over time, and as of April 2018 it takes the form of monthly automatic releases in 11 Wikipedia language domains: English, German, French, Spanish, Russian, Italian, Japanese, Portuguese, Polish, Chinese and Persian.

The clickstream datasets contain counts of times someone went to a Wikipedia article from either another Wikipedia article or from some other webpage. The destination Wikipedia article here is called a resource, and the webpage from which the user went to the resource is called a referer. Referers can be another Wikipedia article from the same language domain, a Wikimedia page that is not part of that Wikipedia language domain, a search engine, or some other webpage. If the referer is a Wikipedia article from the same language domain, then the article name is given. If not, then a general referer category is given.

**Some examples of the data records:**
1. _Article-to-article data record_  
In the English Wikipedia, the article 'Suki Waterhouse' links to the article 'List of Divergent characters'. In March 2018, users went from the 'Suki Waterhouse' article to the 'List of Divergent characters' article 86 times. This clickstream activity between the two articles is recorded in the 2018-03 English Wikipedia clickstream data release as:  

referer | resource | reference type | count
--- | --- | --- | ---  
Suki_Waterhouse | List_of_Divergent_characters | link | 86

2. _External-source-to-article data record_  
In the English Wikipedia, the article 'Bureau of Investigative Journalism' was visited 18 times during March 2018 from some other Wikimedia page that is not part of the English Wikipedia. This clickstream activity is recorded in the 2018-03 English Wikipedia clickstream data release as:  

referer | resource | reference type | count
--- | --- | --- | ---  
other-internal | Bureau_of_Investigative_Journalism | external | 18

---

The Wikipedia clickstream data shows how users get to Wikipedia articles per language domain. It is a weighted network of articles and external sources, weighted by the number of times users went to a given Wikipedia article from either another Wikipedia article or some other webpage.
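To make the record format concrete, here is a minimal sketch that parses tab-separated clickstream rows into a weighted edge mapping. It assumes the four-column layout shown in the examples above (referer, resource, reference type, count). The function name `parse_clickstream_rows` is hypothetical, and the inline sample simply repeats the two example records.

```python
import csv
import io

def parse_clickstream_rows(lines):
    """Map (referer, resource) pairs to their reference type and click count."""
    edges = {}
    # QUOTE_NONE: treat quote characters in article titles as ordinary text
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for referer, resource, ref_type, count in reader:
        edges[(referer, resource)] = {"type": ref_type, "count": int(count)}
    return edges

# The two example records from above, as they would appear in the TSV file
sample = io.StringIO(
    "Suki_Waterhouse\tList_of_Divergent_characters\tlink\t86\n"
    "other-internal\tBureau_of_Investigative_Journalism\texternal\t18\n"
)
edges = parse_clickstream_rows(sample)
print(edges[("Suki_Waterhouse", "List_of_Divergent_characters")])
```

Each key is a directed edge of the network and the `count` value is its weight.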


**Reference for the raw data:**
- Data source/download: [Wikipedia clickstream data dumps](https://dumps.wikimedia.org/other/clickstream/)
- Data description: [Research:Wikipedia clickstream](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) 
- The data is released and maintained by the Wikimedia Foundation's [Analytics Engineering team](https://wikitech.wikimedia.org/wiki/Analytics)

### Data subset for this project

For this project I'm using the March 2018 Wikipedia clickstream data release, and all of the 11 language domains provided. I'm restricting the data subset to March 2018 for two reasons. First, the way the Wikipedia clickstream data is processed has evolved significantly over time, and the number of language domains included in the releases has changed as well. The changes in data processing are likely to affect the data analysis results, so I am using the March 2018 subset in order to avoid that. Second, the Wikipedia clickstream data is quite large, and the March 2018 subset gives me enough data for an initial exploration. 

**Note on data size**  
The English Wikipedia clickstream data file for March 2018 unzips to 1.3G. The second largest March 2018 data file is for the German Wikipedia clickstream, and it unzips to 219M. The data files for the other language domains are much smaller.

------
## Data import

Since the Wikipedia clickstream data is a weighted network of articles and other webpages, we can use a graph database to store this data as connections between webpages. Once we have this data in a graph database, we can leverage its structure to investigate the data with network analysis techniques.

For this project, I am using the [neo4j](https://neo4j.com/) graph database, which I've set up on an [AWS](https://aws.amazon.com/) cloud server. In the cells below, I'm connecting to my neo4j database instance from a Jupyter notebook that I am running on the same AWS server.


### Download the data
Download the data files from the [2018-03 Wikipedia clickstream release](https://dumps.wikimedia.org/other/clickstream/2018-03/).

### Clean the data for import
Neo4j's `LOAD CSV` treats double quotes as field-quoting characters, even if another field terminator (such as a tab) is explicitly specified. Backslashes in the data are also problematic, since a backslash acts as an escape character. Checking the raw data for these special characters and escaping them with a backslash solves this problem.
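The cleaning step could look roughly like the sketch below. It assumes the characters that need escaping are double quotes and backslashes, as described above. The function names and the sample row are made up for illustration and are not the script actually used for this project.

```python
def escape_for_neo4j(line):
    """Backslash-escape backslashes and double quotes in one raw TSV line."""
    # Escape backslashes first so the ones added for quotes are not doubled.
    return line.replace("\\", "\\\\").replace('"', '\\"')

def clean_file(src_path, dst_path):
    """Write an escaped copy of src_path to dst_path, line by line."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(escape_for_neo4j(line))

# Made-up row with a quoted title, for illustration only
print(escape_for_neo4j('"Some_Title"\tAnother_Title\tlink\t10'))
```

Processing the file line by line keeps memory use flat, which matters for the 1.3G English file.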

### Decide on a data model

There are many ways to model connected data as a graph. The goal is to create a graph database model that is easy to understand, and that provides fast traversal of relevant data. These goals are highly dependent on the business use case for the data.  

For this project, I am interested in connections between Wikipedia articles, the role of external sources as referers, and differences and similarities between the Wikipedia article networks across different language domains. So I am going to model the Wikipedia clickstream data in the following way:
1. nodes
   - Wikipedia article
   - external reference source (i.e. not a Wikipedia article in the same language)
   - both of the above node types will have language properties
2. relationship
   - 'REFERRED_TO'
   - articles can refer to articles
   - external sources can refer to articles
   - this relationship will have 2 properties:
     - count of the number of times users went from the referring webpage to the referred article in the data subset
     - reference type: link (Wikipedia article links to article), external (non-Wikipedia referer), other (referer is a Wikipedia article, but the user did not click on a link to get to the referred article)
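As a rough illustration of this model, the sketch below maps a single clickstream row onto a source node, a relationship, and a target node. The label and property names (`Article`, `ExternalSource`, `title`, `source_type`, `language_code`, `REFERRED_TO`) follow the import queries later in this document. The function `model_row` and the dictionary layout are hypothetical and only meant to show the mapping.

```python
def model_row(referer, resource, ref_type, count, language_code):
    """Describe one clickstream row as (source node, relationship, target node)."""
    if ref_type == "external":
        # Referer is a general category such as 'other-search'
        source = {"label": "ExternalSource", "source_type": referer,
                  "language_code": language_code}
    else:
        # 'link' and 'other' rows both have a Wikipedia article as referer
        source = {"label": "Article", "title": referer,
                  "language_code": language_code}
    target = {"label": "Article", "title": resource,
              "language_code": language_code}
    relationship = {"name": "REFERRED_TO", "type": ref_type, "count": int(count)}
    return source, relationship, target

source, rel, target = model_row(
    "other-internal", "Bureau_of_Investigative_Journalism", "external", "18", "EN"
)
print(source["label"], rel["name"], target["label"])
```

Only the referer side changes with the reference type. The referred page is always an article node.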

### Connect to neo4j

In [1]:
from py2neo import authenticate, Graph

In [2]:
# set up authentication parameters
authenticate("localhost:7474", "my_neo4j_user", "my_neo4j_password")

In [6]:
# connect to authenticated graph database
graph = Graph("http://localhost:7474/db/data/")

### Set indices
Unlike in relational databases, there are no built-in constraints or automatic indexing in neo4j. Let's add some indexes for our data, so that the queries below run faster.

In [8]:
graph.run('CREATE INDEX ON :Article(title);')
graph.run('CREATE INDEX ON :Article(title, language_code);')
graph.run('CREATE INDEX ON :ExternalSource(source_type);')
graph.run('CREATE INDEX ON :ExternalSource(source_type, language_code);')
graph.run('CREATE INDEX ON :Article(language_code);')
graph.run('CREATE INDEX ON :ExternalSource(language_code);')

<py2neo.database.Cursor at 0x7efbf49ed710>

In [17]:
# list all indices in the db
r = graph.data('CALL db.indexes;')

In [19]:
import pandas as pd

In [21]:
rpd = pd.DataFrame(r)
rpd

index | description | label | properties | provider | state | type
--- | --- | --- | --- | --- | --- | ---
0 | INDEX ON :Article(language_code) | Article | [language_code] | {'version': '1.0', 'key': 'lucene+native'} | ONLINE | node_label_property
1 | INDEX ON :Article(title) | Article | [title] | {'version': '1.0', 'key': 'lucene+native'} | ONLINE | node_label_property
2 | INDEX ON :Article(title, language_code) | Article | [title, language_code] | {'version': '1.0', 'key': 'lucene+native'} | ONLINE | node_label_property
3 | INDEX ON :ExternalSource(language_code) | ExternalSource | [language_code] | {'version': '1.0', 'key': 'lucene+native'} | ONLINE | node_label_property
4 | INDEX ON :ExternalSource(source_type) | ExternalSource | [source_type] | {'version': '1.0', 'key': 'lucene+native'} | ONLINE | node_label_property
5 | INDEX ON :ExternalSource(source_type, language... | ExternalSource | [source_type, language_code] | {'version': '1.0', 'key': 'lucene+native'} | ONLINE | node_label_property


### Run the import queries

In [5]:
def import_wiki_to_neo(language, language_code):
    # file path
    language_code = language_code.upper()
    language_code_fp = language_code.lower()
    # note: by default, neo4j only imports data from its own import subfolder or from a url,
    # so make sure to place the cleaned data into the neo4j import folder.
    fp = 'file:///wikipedia_clickstream/clickstream-' + language_code_fp + 'wiki-2018-03_m.tsv'
    
    # Let's load article-to-article links and 'other' referrer types (all referrers here are Wiki articles)
    # (it's faster to load nodes and relationships separately for large data files)
    # nodes - internal
    query_nodes_internal = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] <> 'external'
    MERGE (n1:Article {title: row[0], language_code: {lang_code} })
        ON CREATE SET
            n1.language = {lang} 
    MERGE (n2:Article {title: row[1], language_code: {lang_code} })
        ON CREATE SET
            n2.language = {lang}
    ;
    """
    graph.run(query_nodes_internal, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))

    # relationships - internal
    query_relationships_internal = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] <> 'external'
    MATCH (n1:Article {title: row[0], language_code: {lang_code} })
    MATCH (n2:Article {title: row[1], language_code: {lang_code} })
    MERGE (n1)-[r:REFERRED_TO]->(n2)
        ON CREATE SET
            r.type = row[2],
            r.count = toInteger(row[3])
    ;
    """
    graph.run(query_relationships_internal, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))

    # nodes - external
    query_nodes_external = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] = 'external'
    MERGE (n1:ExternalSource {source_type: row[0], language_code: {lang_code} })
        ON CREATE SET
            n1.language = {lang},
            n1.description = CASE 
                WHEN row[0] = 'other-internal' THEN 'a page from any other Wikimedia project (not Wikipedia)'
                WHEN row[0] = 'other-search' THEN 'an external search engine'
                WHEN row[0] = 'other-external' THEN 'any other external site (not search engine)'
                WHEN row[0] = 'other-empty' THEN 'an empty referer'
                WHEN row[0] = 'other-other' THEN 'anything else (catch-all)'
                ELSE '' END

    MERGE (n2:Article {title: row[1], language_code: {lang_code} })
        ON CREATE SET
            n2.language = {lang}
    ;
    """
    graph.run(query_nodes_external, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))

    # relationships - external
    query_relationships_external = """
    USING PERIODIC COMMIT
    LOAD CSV FROM {myfilepath} AS row
    FIELDTERMINATOR '\t'
    WITH row
    WHERE row[2] = 'external'
    MATCH (n1:ExternalSource {source_type: row[0], language_code: {lang_code} })
    MATCH (n2:Article {title: row[1], language_code: {lang_code} })
    MERGE (n1)-[r:REFERRED_TO]->(n2)
        ON CREATE SET
            r.type = row[2],
            r.count = toInteger(row[3])
    ;
    """
    graph.run(query_relationships_external, myfilepath=str(fp), lang=str(language), lang_code=str(language_code))


In [None]:
# These larger imports were faster to run in cypher-shell as cypher queries
# import_wiki_to_neo('German', 'DE')
# import_wiki_to_neo('English', 'EN')
# import_wiki_to_neo('Spanish', 'ES')
# import_wiki_to_neo('Persian', 'FA')
# import_wiki_to_neo('Russian', 'RU')

In [27]:
import_wiki_to_neo('French', 'FR')

In [6]:
import_wiki_to_neo('Italian', 'IT')

In [7]:
import_wiki_to_neo('Japanese', 'JA')

In [8]:
import_wiki_to_neo('Polish', 'PL')

In [9]:
import_wiki_to_neo('Portuguese', 'PT')

In [10]:
import_wiki_to_neo('Chinese', 'ZH')
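The per-language calls above all follow the same file naming pattern, so the path logic can be checked on its own before starting a long import. The sketch below rebuilds the file URL from `import_wiki_to_neo` for all 11 language domains. The helper name `import_file_url` and the `LANGUAGES` mapping are hypothetical conveniences, not part of the original notebook.

```python
# Language codes and names as used in the import calls above
LANGUAGES = {
    "DE": "German", "EN": "English", "ES": "Spanish", "FA": "Persian",
    "RU": "Russian", "FR": "French", "IT": "Italian", "JA": "Japanese",
    "PL": "Polish", "PT": "Portuguese", "ZH": "Chinese",
}

def import_file_url(language_code, release="2018-03"):
    """Build the file URL that import_wiki_to_neo passes to LOAD CSV."""
    return ("file:///wikipedia_clickstream/clickstream-"
            + language_code.lower() + "wiki-" + release + "_m.tsv")

for code, language in sorted(LANGUAGES.items()):
    print(language, import_file_url(code))
```

Printing the URLs first makes it easy to confirm that every cleaned file is present in the neo4j import folder.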

### Viewing the graph database schema

To view the database schema, run `CALL db.schema()` in the neo4j browser.  
Here's the Wikipedia clickstream db schema:

![Wikipedia clickstream database schema diagram](graph_db_schema.png "Wikipedia clickstream database schema")


The schema diagram shows the types of nodes (things or entities) and edges (relationships) in the database.  
  
This schema diagram shows that we have two kinds of nodes: articles and external sources. Both kinds of nodes can have a directed relationship to article nodes, and that relationship is a reference from one node to the other.  

#### Now that we have our Wikipedia clickstream data loaded in a graph database, we can explore the data in the neo4j browser.
Here is the data hairball I get when I run the following cypher query in the neo4j browser  
`MATCH p=()-[r:REFERRED_TO]->(n:Article {language_code:'EN'}) RETURN p LIMIT 25`.  

This query returns an arbitrary selection of 25 references to English Wikipedia articles (without an `ORDER BY` clause, neo4j does not guarantee which 25 it picks). We see more than 25 relationships here, because in addition to the 25 references queried, the neo4j browser displays all of the other relationships among the nodes involved in the 25 references. The numbers on the links are the reference counts. For example, users went to the 'Cara Delevingne' Wikipedia article from a search engine 186,252 times in March 2018 (bottom right).

![Wikipedia clickstream subgraph](english_wiki_25_rels_graph.png "English Wikipedia clickstream subgraph")
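Why the picture contains more relationships than the query limit can be illustrated with a small pure-Python sketch. It mimics the behavior described above on a toy graph with made-up node names. It is not neo4j code, and the function `induced_edges` is hypothetical.

```python
def induced_edges(all_edges, selected_edges):
    """Return every edge whose two endpoints both appear in selected_edges."""
    nodes = {node for edge in selected_edges for node in edge}
    return [(src, dst) for src, dst in all_edges if src in nodes and dst in nodes]

# Toy graph with made-up node names
all_edges = [("A", "B"), ("C", "B"), ("A", "C"), ("C", "A"), ("D", "A")]
selected = all_edges[:2]   # stands in for a query result with LIMIT 2
shown = induced_edges(all_edges, selected)
print(len(selected), len(shown))
```

The two selected edges involve three nodes, and the extra edges among those three nodes are drawn as well, while the edge from the unrelated node `D` is left out.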

From this subgraph, we can see that the clickstream data is highly interconnected. We can inspect a handful of nodes more closely in the neo4j browser on a case-by-case basis, but to get a sense of what's going on in this graph as a whole, we need to use network analysis techniques.  
We've loaded disjoint networks of Wikipedia articles for 11 language domains. Each one is a highly interconnected data hairball, like the example above. How can we compare them and quantify their differences?