## **Advances in Data Mining**

Stephan van der Putten | (s1528459) | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 3**
This assignment analyses the Wikipedia links in the `wikilink_graph.2004-03-01.csv` file and runs PageRank on them. The assignment is split into four subtasks, each with its own dedicated `.ipynb` file. See each file for details on what that notebook accomplishes.

Note: all implementations are based on the assignment guidelines, the helper files provided, and the documentation of the functions used.

#### **Data Preprocessing**
This notebook preprocesses the data in the given `.csv` file and prepares it for the other subtasks. The processed data is also saved to disk.
___

## Load necessary libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy

## Load the `\t`-separated data into pandas

In [9]:
pd_wiki = pd.read_csv('wikilink_graph.2004-03-01.csv', usecols=(0,2), delimiter='\t', dtype=np.int64, skiprows=0)
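
To make the effect of `usecols` and the tab delimiter concrete, here is a minimal sketch on an in-memory sample. The rows, the page titles, and the names of the two title columns are hypothetical and only stand in for the real file; the sketch assumes the file has a header row with the id columns at positions 0 and 2.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: the title columns and all values are made up
# for illustration and are not taken from the real dataset.
sample = (
    "page_id_from\tpage_title_from\tpage_id_to\tpage_title_to\n"
    "12\tA\t34568\tB\n"
    "12\tA\t35416\tC\n"
    "25\tD\t12\tA\n"
)

# Keep only columns 0 and 2 (the two id columns) and parse them as int64.
toy = pd.read_csv(io.StringIO(sample), sep="\t", usecols=[0, 2], dtype=np.int64)
print(toy)
```

Skipping the title columns at read time avoids loading strings that the graph construction never uses.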

## Find all unique nodes

In [10]:
index = np.unique(pd_wiki.iloc[:,0])
index2 = np.unique(pd_wiki.iloc[:,1])

# categories: every unique page id, whether it occurs as a source or a target
categories = np.unique(np.append(index, index2))

## Convert page_ids into consecutive integers
We treat each node as a category and use `Categorical` from pandas to translate the page ids into integer codes, which are stored in the columns `index` and `index2`.

In [11]:
pd_wiki['index'] = pd.Categorical(pd_wiki.iloc[:,0], categories=categories).codes
pd_wiki['index2'] = pd.Categorical(pd_wiki.iloc[:,1], categories=categories).codes
pd_wiki.head()

|   | page_id_from | page_id_to | index | index2 |
|---|---|---|---|---|
| 0 | 12 | 34568 | 0 | 18381 |
| 1 | 12 | 35416 | 0 | 19179 |
| 2 | 12 | 34569 | 0 | 18382 |
| 3 | 12 | 34699 | 0 | 18501 |
| 4 | 12 | 34700 | 0 | 18502 |
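
The mapping from page ids to consecutive codes can be illustrated on a toy edge list. The ids below are hypothetical; the sketch also shows that, because `np.unique` returns a sorted array, `np.searchsorted` gives the same codes, which is an alternative and not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical page ids for four links (source -> target).
src = np.array([12, 12, 500, 40])
dst = np.array([500, 40, 12, 99])

# Sorted array of every id that occurs on either side of a link.
nodes = np.unique(np.append(src, dst))

# The code of an id is its position in `nodes`.
src_codes = pd.Categorical(src, categories=nodes).codes
dst_codes = pd.Categorical(dst, categories=nodes).codes

# Alternative with plain NumPy: position of each id in the sorted array.
src_alt = np.searchsorted(nodes, src)

print(nodes)      # [ 12  40  99 500]
print(src_codes)  # [0 0 3 1]
print(dst_codes)  # [3 1 0 2]
```

Using one shared `categories` array for both columns is what guarantees that a page gets the same code whether it appears as a source or as a target.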


## Convert to numpy

In [13]:
dataset = pd_wiki.iloc[:, [2,3]].to_numpy()
print(dataset)

[[     0  18381]
 [     0  19179]
 [     0  18382]
 ...
 [248190 102031]
 [248191 241638]
 [248192 120406]]


## Create a sparse matrix representation
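* `c`: Column indices, the target node of each link (`r` -> `c`)
* `r`: Row indices, the source node of each link (`r` -> `c`)
* `d`: The values to store, a one for every link
* `max_c`: The number of unique nodes N; the matrix size is N*N
* `sparse_matrix`: Scipy sparse matrix representation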

In [14]:
from scipy.sparse import csr_matrix

c = dataset[:,1]
r = dataset[:,0]
d = np.ones(len(c))
max_c = len(categories)
sparse_matrix = csr_matrix((d,(r,c)), shape=(max_c, max_c), dtype=np.int32)
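
As a small illustration of the `(data, (row, col))` constructor, the sketch below builds a 3×3 adjacency matrix from a hypothetical edge list. It also shows that repeated `(row, col)` pairs are summed, so a link listed twice would be stored as 2; whether the real file contains such repeats is not checked here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical links: 0->1, 0->2, and 2->0 listed twice.
rows = np.array([0, 0, 2, 2])
cols = np.array([1, 2, 0, 0])
vals = np.ones(len(rows), dtype=np.int32)

adj = csr_matrix((vals, (rows, cols)), shape=(3, 3))
counts = adj.toarray()  # the repeated link 2->0 is stored as 2

# Optional: reduce link counts to a 0/1 adjacency matrix.
binary = adj.copy()
binary.data[:] = 1

print(counts)
print(binary.toarray())
```

Row `i` of the matrix then holds the outgoing links of node `i`, which is the orientation the bullet list above describes.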

## Further preprocessing: diagonal
Set the diagonal to 0 (no self-loops) using the LIL format, then eliminate the explicit zeros to save space in the CSR representation.

In [21]:
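# tolil() and tocsr() return new matrices, so their results must be assigned
lil = sparse_matrix.tolil()
lil.setdiag(0)
sparse_matrix = lil.tocsr()
sparse_matrix.eliminate_zeros()
sparse_matrix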

<248193x248193 sparse matrix of type '<class 'numpy.int32'>'
	with 3168510 stored elements in Compressed Sparse Row format>
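
Because `tolil()` returns a new matrix rather than a view, an edit made on its unassigned result is lost. The following sketch on a hypothetical 2×2 matrix shows the difference and one way to remove self-loops so that the change sticks.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical graph with two self-loops (0->0, 1->1) and one link 0->1.
m = csr_matrix(np.array([[1, 1], [0, 1]], dtype=np.int32))

# This edits a temporary LIL copy; `m` itself is unchanged.
m.tolil().setdiag(0)
diag_after_temp_edit = m.diagonal().tolist()

# Assign the converted matrix, edit it, and convert back.
lil = m.tolil()
lil.setdiag(0)
cleaned = lil.tocsr()
cleaned.eliminate_zeros()  # drop any explicitly stored zeros

print(diag_after_temp_edit)
print(cleaned.toarray())
print(cleaned.nnz)
```

Comparing the number of stored elements before and after this step is a quick way to check how many self-loops were actually removed.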

## Small preview of the sparse matrix
As you can see, the top-left 10×10 block of the matrix has only zeros on its diagonal (no self-loops), and the node with index 9 points to the node with index 8 (9 -> 8) in our dataset.

In [20]:
sparse_matrix[:10,:10].todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)

## Save the sparse matrix as prep_data
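The matrix is saved in CSR format, because `scipy.sparse.save_npz` supports only the csc, csr, bsr, dia and coo formats and cannot write a LIL matrix. The LIL representation seems to give the lowest `nbytes` (see the last cell), but `.data.nbytes` on a LIL matrix counts only the outer object array, one 8-byte reference per row (248193 × 8 = 1985544), and not the stored links themselves. See our other notebooks, which continue to use `prep_data`.

In [22]:
# writes prep_data.npz; the matrix must be in a format save_npz supports (here CSR)
scipy.sparse.save_npz('prep_data', sparse_matrix)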
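
A minimal round-trip sketch, assuming a small hypothetical matrix and a temporary directory instead of the real `prep_data` file, shows how a later notebook could load the saved matrix again with `scipy.sparse.load_npz`.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Hypothetical 3-node cycle: 0->1, 1->2, 2->0.
adj = sp.csr_matrix(np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=np.int32))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prep_data_demo")
    sp.save_npz(path, adj)                # the .npz extension is appended
    loaded = sp.load_npz(path + ".npz")   # loading needs the full file name

# The loaded matrix keeps its format and contents.
same = (adj != loaded).nnz == 0
print(loaded.format, same)
```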

In [34]:
sparse_matrix.tolil().data.nbytes

1985544
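
To put this number in context, the sketch below uses a hypothetical 4×4 matrix to show what `.data.nbytes` measures for a LIL matrix, next to a CSR total that adds up all three CSR arrays. It is a rough comparison under these assumptions, not a full memory profile.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 4x4 matrix with one stored value per row.
adj = sp.csr_matrix(np.eye(4, dtype=np.int32))
lil = adj.tolil()

# For LIL, `data` is an object array holding one Python list per row,
# so nbytes counts only the references, not the lists or their values.
lil_outer_bytes = lil.data.nbytes

# For CSR, the stored values, column indices and row pointers are
# three plain NumPy arrays whose sizes can simply be added up.
csr_total_bytes = adj.data.nbytes + adj.indices.nbytes + adj.indptr.nbytes

print(lil.data.dtype, lil_outer_bytes, csr_total_bytes)
```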