## **Advances in Data Mining**

Stephan van der Putten | (s1528459) | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 3**
This assignment is concerned with analyzing the Wikipedia links given in the `wikilink_graph.2004-03-01.csv` file and executing PageRank on them. To do this, the assignment is split into four subtasks, each with its own dedicated `.ipynb` file. See each file for details on what that notebook accomplishes.

Note: all implementations are based on the assignment guidelines, the helper files provided, and the documentation of the functions used.

#### **PageRank Algorithm (Improved)**
This notebook executes the PageRank algorithm using the improved storage method and algorithm as presented in the lecture (see also slide 18 of the instructional slideset). Additionally, some basic analysis is performed and the results are compared to the "Sparse" implementation of PageRank.
___

### **Helper Functions**
This section contains functions which aid and simplify the code.
___
The following snippet handles all imports.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import scipy.sparse
from scipy.sparse import csr_matrix
from collections import defaultdict

The class `PageRank` is our implementation of the PageRank algorithm.

In [None]:
class PageRank:
    def __init__(self):
        # Placeholder: the class body is not implemented yet in this notebook.
        pass

The `load_prepped_data` function is responsible for retrieving the data prepped by `prep.ipynb` and loading it for exploratory data analysis.

In order to do this the function uses the following parameters:
  * `filename` - the name of the file containing the prepped data [default = `prep_data.npz`]
  
Additionally, it returns the following value:
  * `data` - a SciPy sparse matrix representing the prepped data  

In [2]:
def load_prepped_data(filename='prep_data.npz'):
    # Load the sparse matrix stored in .npz format by prep.ipynb
    data = scipy.sparse.load_npz(filename)
    return data
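
As a minimal sketch of the storage round trip that `load_prepped_data` relies on, the following cell saves a small sparse matrix and loads it again. It assumes that `prep.ipynb` writes its output with `scipy.sparse.save_npz`; the toy matrix and the file name `toy_data.npz` are purely illustrative.

```python
import os
import tempfile

import numpy as np
import scipy.sparse
from scipy.sparse import csr_matrix

# Illustrative 4x4 link matrix (assumption: columns are sources, rows are destinations)
toy = csr_matrix(np.array([[0, 0, 0, 1],
                           [0, 0, 1, 1],
                           [1, 0, 0, 0],
                           [1, 0, 1, 0]], dtype=np.int32))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_data.npz')   # hypothetical file name
    scipy.sparse.save_npz(path, toy)           # write the matrix in .npz format
    loaded = scipy.sparse.load_npz(path)       # read it back, as load_prepped_data does

# The sparse format and the stored entries survive the round trip
print(loaded.format, loaded.shape, loaded.nnz)
```

`load_npz` restores the matrix in the sparse format it was saved in, so no conversion is needed after loading.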

The following snippet triggers data loading.

In [3]:
data = load_prepped_data()

The `convert_to_custom_format` function is responsible for converting the transition matrix from a sparse matrix representation to the custom format specified in slide 17 of the instructional slideset. It is assumed that in the sparse matrix each (nonempty) column represents a source node.

In order to do this the function uses the following parameters:
  * `data` - the data as a sparse matrix
  
Additionally, it returns the following value:
  * `converted` - an object array with one row `[source, out-degree, destinations]` per source node  

In [8]:
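def convert_to_custom_format(data):
    # nonzero() returns (row_indices, column_indices);
    # columns are source nodes and rows are destination nodes
    rows, cols = data.nonzero()
    dictionary = defaultdict(list)

    # Group the destination nodes by their source node
    for source, destination in zip(cols, rows):
        dictionary[source].append(destination)

    # One entry per source: [source, out-degree, array of destinations]
    np_degree = [[s, len(d), np.array(d)] for s, d in dictionary.items()]

    # dtype=object keeps the ragged destination arrays as objects;
    # recent NumPy versions raise an error for ragged input without it
    return np.array(np_degree, dtype=object)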
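
To illustrate how rows in this `[source, out-degree, destinations]` format could drive the rank update, here is a minimal sketch of a power iteration. It is not the notebook's final implementation: the damping factor `beta`, the tolerance `tol`, the iteration cap `max_iter`, and the uniform redistribution of rank lost through nodes without outgoing links are all assumptions made for this example.

```python
import numpy as np

def pagerank_step(rows, r_old, beta=0.85):
    """One update over rows of (source, out_degree, destinations)."""
    n = r_old.shape[0]
    r_new = np.zeros(n)
    for source, degree, destinations in rows:
        # Each source passes an equal, damped share of its rank to its destinations
        r_new[destinations] += beta * r_old[source] / degree
    # Assumption: mass lost to teleporting and dead ends is spread uniformly
    r_new += (1.0 - r_new.sum()) / n
    return r_new

def pagerank(rows, n, beta=0.85, tol=1e-10, max_iter=100):
    """Iterate until the L1 change drops below tol or max_iter is reached."""
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = pagerank_step(rows, r, beta)
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy graph with 4 nodes; node 1 has no outgoing links and therefore no row
rows = [
    (3, 2, np.array([0, 1])),
    (2, 2, np.array([1, 3])),
    (0, 2, np.array([2, 3])),
]
ranks = pagerank(rows, n=4)
print(ranks, ranks.sum())
```

Only the rank vectors are indexed by node; the link structure is read one row at a time, which is the point of storing the out-degree next to the destination list.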

In [134]:
# TODO TEMPORARY
%timeit convert_to_custom_format(simple_data)

88.5 µs ± 4.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [10]:
%time convert_to_custom_format(data)

CPU times: user 2.9 s, sys: 52.9 ms, total: 2.95 s
Wall time: 1.89 s


array([[234, 75,
        array([     0,   2667,   3099,   4923,   6424,   8248,   8270,   9045,
        10479,  13208,  15800,  16386,  17933,  18031,  18126,  18373,
        18408,  18416,  18649,  24851,  26678,  36013,  41131,  50953,
        63777,  75177,  90801,  93361,  96939,  98374,  98864,  99999,
       108340, 110082, 111258, 111417, 114051, 114066, 115513, 116711,
       118275, 122305, 122711, 126312, 127886, 130324, 134003, 136380,
       150296, 169859, 176293, 181813, 182193, 185349, 190904, 193669,
       194106, 197593, 198215, 201113, 204705, 212808, 213280, 213744,
       215591, 216670, 219581, 223713, 226522, 232564, 239792, 240206,
       240638, 241549, 242326], dtype=int32)],
       [240, 36,
        array([     0,   2576,   3938,   6570,   7754,   9488,   9614,  10717,
        11755,  11838,  12210,  13120,  16300,  16301,  16372,  17260,
        27990,  29872,  40859,  53585,  93309, 125582, 125585, 126643,
       126714, 128227, 128388, 132797, 136491, 2143

In [11]:
test = convert_to_custom_format(data)
test.shape

(192038, 3)
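
The converted array has 192038 rows, while the matrix summary at the end of this notebook reports 248193 columns. The conversion only emits a row for a column that contains at least one nonzero entry, so the remaining 248193 - 192038 = 56155 columns are empty, presumably pages without outgoing links. A small sketch of how such columns could be counted directly, shown on an illustrative toy matrix rather than on `data`:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative matrix: column 1 is empty, so node 1 has no outgoing links
toy = csr_matrix(np.array([[0, 0, 0, 1],
                           [0, 0, 1, 1],
                           [1, 0, 0, 0],
                           [1, 0, 1, 0]], dtype=np.int32))

# In CSC format, consecutive indptr entries delimit the stored entries of each column
out_degree = np.diff(toy.tocsc().indptr)
dead_ends = np.flatnonzero(out_degree == 0)

print(out_degree)   # stored entries per column (source node)
print(dead_ends)    # indices of columns without any stored entry
```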

### REMOVE EVERYTHING BELOW THIS

In [83]:
# TODO REMOVE ME I AM TEMPORARY!
simple_data = load_prepped_data('simple_data.npz')
print(simple_data.todense())

[[0 0 0 1]
 [0 0 1 1]
 [1 0 0 0]
 [1 0 1 0]]
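
As a worked example, the matrix printed above can be converted by hand using the same grouping logic as `convert_to_custom_format`. The sketch below rebuilds the matrix inline so that it runs on its own; the variable names are illustrative.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix

simple = csr_matrix(np.array([[0, 0, 0, 1],
                              [0, 0, 1, 1],
                              [1, 0, 0, 0],
                              [1, 0, 1, 0]], dtype=np.int32))

# nonzero() returns (row_indices, column_indices); columns are the source nodes
rows, cols = simple.nonzero()
adjacency = defaultdict(list)
for source, destination in zip(cols, rows):
    adjacency[int(source)].append(int(destination))

for source in sorted(adjacency):
    print(source, len(adjacency[source]), adjacency[source])
# 0 2 [2, 3]
# 2 2 [1, 3]
# 3 2 [0, 1]
```

Node 1 does not appear because its column is empty, so it has no outgoing links.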


In [147]:
data

<248193x248193 sparse matrix of type '<class 'numpy.int32'>'
	with 3170614 stored elements in Compressed Sparse Row format>