# AiDM 2019 Group 26: Assignment 3: Structure of Wikipedia Links

## Part 1: Data preprocessing (prep)

Auke Bruinsma, s1594443 and Simon van Wageningen, s2317079.

### 1. Imports and loading the data.

Before exploring the dataset or computing descriptive statistics, we preprocess the raw data so that it contains only the information we need, in a format that is convenient to work with.

In [1]:
# Import the required packages.
import pandas as pd
import numpy as np

In [2]:
# Load the data, only use two columns.
initial_data = pd.read_csv("wikilink_graph.2004-03-01.csv", sep = "\t", usecols = ['page_id_from', 'page_id_to'])

In [3]:
# The dataframe currently looks like this.
initial_data.head()

|   | page_id_from | page_id_to |
|---|---|---|
| 0 | 12 | 34568 |
| 1 | 12 | 35416 |
| 2 | 12 | 34569 |
| 3 | 12 | 34699 |
| 4 | 12 | 34700 |


### 2. Convert to consecutive integers.

In [4]:
# Get the unique ids in page_id_from (pages with at least one outgoing link) and count them.
unique_from = np.unique(initial_data[['page_id_from']])
print('Number of unique outgoing pages: {0}'.format(len(unique_from)))

Number of unique outgoing pages: 230824


In [5]:
# Get the unique ids in page_id_to (pages with at least one incoming link) and count them.
unique_to = np.unique(initial_data[['page_id_to']])
print('Number of unique incoming pages: {0}'.format(len(unique_to)))

Number of unique incoming pages: 192038


In [6]:
# First we must get all unique ids, from the _from and _to columns ...
# ... concatenate the arrays then get the unique again.
all_ids = np.concatenate((unique_from, unique_to), axis = 0)
unique_total = np.unique(all_ids)

In [7]:
# In total there are this many unique ids in the entire graph.
print('Number of unique total pages: {0}'.format(len(unique_total)))

Number of unique total pages: 248193
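
The three counts above are tied together by inclusion-exclusion: the total equals the two column counts minus the number of ids that occur in both columns. For the printed values this gives 230824 + 192038 - 248193 = 174669 pages that have both outgoing and incoming links. A minimal sketch of that check on a toy edge list (the ids below are made up for illustration and are not taken from the dataset):

```python
import numpy as np

# Toy edge list with hypothetical ids.
page_id_from = np.array([12, 12, 25, 39, 39])
page_id_to = np.array([25, 39, 39, 290, 12])

unique_from = np.unique(page_id_from)
unique_to = np.unique(page_id_to)

# Ids that occur in both columns, and ids that occur in either column.
in_both = np.intersect1d(unique_from, unique_to)
in_either = np.union1d(unique_from, unique_to)

# Inclusion-exclusion: size of the union = sum of sizes - size of the intersection.
assert len(in_either) == len(unique_from) + len(unique_to) - len(in_both)
print(len(unique_from), len(unique_to), len(in_both), len(in_either))
```

`np.union1d` returns the same sorted array as concatenating the two unique arrays and calling `np.unique` again, so it could stand in for the two-step approach used in this notebook.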


In [8]:
# Create a lookup dataframe that maps every unique page id (from either column) to a new consecutive integer id.
temp_df = pd.DataFrame(data = {"page_id" : unique_total, "new_page_id" : np.arange(len(unique_total))})

In [9]:
# temp_df now acts as a lookup table between the original and the new ids, which lets us merge in both directions.
temp_df.head()

|   | page_id | new_page_id |
|---|---|---|
| 0 | 12 | 0 |
| 1 | 25 | 1 |
| 2 | 39 | 2 |
| 3 | 290 | 3 |
| 4 | 303 | 4 |
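
Because the unique ids are sorted, the new id of a page is simply its position in that sorted array. As an alternative to a lookup dataframe, the same relabelling could be sketched with `np.searchsorted`. This is only an illustration on a toy edge list with made-up ids, and it relies on the assumption that every id being looked up is present in the sorted array:

```python
import numpy as np
import pandas as pd

# Toy edge list with hypothetical ids.
edges = pd.DataFrame({"page_id_from": [12, 12, 290, 39],
                      "page_id_to": [39, 290, 12, 25]})

# Sorted array of every id that appears in either column.
unique_total = np.unique(edges[["page_id_from", "page_id_to"]].to_numpy())

# The position of each id in the sorted array is its new consecutive id.
relabelled = pd.DataFrame({
    "page_id_from": np.searchsorted(unique_total, edges["page_id_from"].to_numpy()),
    "page_id_to": np.searchsorted(unique_total, edges["page_id_to"].to_numpy()),
})
print(relabelled)
```

This avoids building intermediate merged dataframes, at the cost of being less explicit than a key table that can be inspected and stored.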


In [10]:
# Now we merge twice, once on page_id_from, once on page_id_to.
data = pd.merge(initial_data, temp_df, how = "left", left_on = "page_id_from", right_on = 'page_id')
data = pd.merge(data, temp_df, how = "left", left_on = "page_id_to", right_on = "page_id")

In [11]:
# Keep only the two new id columns, rename them and cast to int.
# Chaining avoids renaming in place on a slice of the merged dataframe.
data = data[['new_page_id_x', 'new_page_id_y']].rename(
    columns = {'new_page_id_x' : 'page_id_from', 'new_page_id_y' : 'page_id_to'}
).astype(int)

In [12]:
# The final data now looks like this!
data.head()

|   | page_id_from | page_id_to |
|---|---|---|
| 0 | 0 | 18381 |
| 1 | 0 | 19179 |
| 2 | 0 | 18382 |
| 3 | 0 | 18501 |
| 4 | 0 | 18502 |
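
Since both merges are left merges, an id without a match in the lookup table would surface as a missing value rather than as an error at merge time. A small sanity check on the relabelled edge list can make such problems visible. The helper `check_relabelled` below is hypothetical and not part of the original notebook; it is a sketch that assumes the new ids should cover 0 up to the number of pages minus one:

```python
import numpy as np
import pandas as pd

def check_relabelled(edges: pd.DataFrame, n_pages: int) -> None:
    """Hypothetical helper: raise if the relabelled edge list is inconsistent."""
    # A left merge leaves NaN wherever a key had no match.
    if edges.isna().any().any():
        raise ValueError("some page ids were not matched during the merge")
    values = edges.to_numpy()
    # New ids must lie in the range 0 .. n_pages - 1.
    if values.min() < 0 or values.max() >= n_pages:
        raise ValueError("new page ids fall outside the expected range")
    # Every new id should occur at least once in the edge list.
    if len(np.unique(values)) != n_pages:
        raise ValueError("new page ids are not consecutive")

# Toy relabelled edge list with four pages (ids 0 to 3).
toy = pd.DataFrame({"page_id_from": [0, 0, 3, 2], "page_id_to": [2, 3, 0, 1]})
check_relabelled(toy, n_pages=4)  # passes silently
```

In this notebook the call would look like `check_relabelled(data, len(unique_total))`.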


In [13]:
# Write to hard disk.
data.to_csv('preprocessed_data.csv', index = False, header = True)

### 3. Return to the original numbering.

In [14]:
# To recover the original ids, repeat the two merges in the other direction, this time matching on new_page_id.
old = pd.merge(data, temp_df, how = "left", left_on = "page_id_from", right_on = 'new_page_id')
old = pd.merge(old, temp_df, how = "left", left_on = "page_id_to", right_on = "new_page_id")
# Keep only the original id columns, rename them and cast to int.
old = old[['page_id_x', 'page_id_y']].rename(
    columns = {'page_id_x' : 'page_id_from', 'page_id_y' : 'page_id_to'}
).astype(int)

In [15]:
# This looks very much like our initial data.
old.head()

|   | page_id_from | page_id_to |
|---|---|---|
| 0 | 12 | 34568 |
| 1 | 12 | 35416 |
| 2 | 12 | 34569 |
| 3 | 12 | 34699 |
| 4 | 12 | 34700 |


In [16]:
# Check that the restored dataframe is exactly equal to the initial data.
print(old.equals(initial_data.astype(int)))

True
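
The reverse mapping can also be sketched without merges: a new id is the position of the original id in the sorted array of unique ids, so indexing that array with the new ids recovers the originals. The example below uses made-up ids and is an illustration under that assumption, not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical sorted array of original ids and a toy relabelled edge list.
unique_total = np.array([12, 25, 39, 290])
relabelled = pd.DataFrame({"page_id_from": [0, 0, 3, 2],
                           "page_id_to": [2, 3, 0, 1]})

# Indexing the sorted id array with the new ids undoes the relabelling.
restored = pd.DataFrame({
    "page_id_from": unique_total[relabelled["page_id_from"].to_numpy()],
    "page_id_to": unique_total[relabelled["page_id_to"].to_numpy()],
})
print(restored)
```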
