## **Advances in Data Mining**

Stephan van der Putten | s1528459 | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 3**
This assignment analyzes the Wikipedia links given in the `wikilink_graph.2004-03-01.csv` file and executes PageRank on them. To do this, the assignment is split into four subtasks, each with its own dedicated `.ipynb` file. See each file for details on what that notebook accomplishes.

Note that all implementations are based on the assignment guidelines, the helper files provided, and the documentation of the functions used.

#### **Data Preprocessing**
This notebook is responsible for preprocessing the data in the given `.csv` file and preparing it for the other subtasks. Additionally, the processed data is stored on the hard disk.
___

### **Helper Functions**
This section contains functions which aid and simplify the code.
___
The following snippet handles all imports.

In [2]:
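import sys # needed for sys.argv in the command line entry point
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
from scipy.sparse import csr_matrix
from IPython.display import display # built in inside notebooks, explicit import for use elsewhere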

### **DataPreprocessing Class**

This section contains the class and its functions, which execute the various steps of data preprocessing. Due to the limitations of `.ipynb` files, the functions are described first and the implementation is shown afterwards.

___
The `__init__` function initializes the class.

In order to do this the function uses the following (optional) parameters:
  * `raw_dataset` - the raw link data [default: None]
  * `sparse_matrix` - a sparse matrix representation of the link data [default: None]
  * `categories` - the unique set of categories (pages) in the link data [default: None] 
  * `categorical_dataset` - the raw dataset renumbered according to `categories` [default: None]
___
The `load` function loads data and stores it in `raw_dataset`.

In order to do this the function uses the following parameters:
  * `path` - the location of the data to load [default: `wikilink_graph.2004-03-01.csv`]
  * `delimiter` - the delimiter to use in the file [default: `\t`]
  * `datatype` - the datatype to load the data as [default: `np.int64`]
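
The effect of these parameters can be illustrated with a minimal sketch. It assumes the file has four tab-separated columns, with the ids in the first and third column; the sample rows and the `page_title_*` column names are hypothetical and only serve the illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real file; titles are made up.
sample = (
    "page_id_from\tpage_title_from\tpage_id_to\tpage_title_to\n"
    "12\tPageA\t34568\tPageB\n"
    "12\tPageA\t35416\tPageC\n"
    "25\tPageD\t12\tPageA\n"
)

# Keep only the two id columns (positions 0 and 2) and load them as int64.
links = pd.read_csv(io.StringIO(sample), usecols=[0, 2], delimiter="\t", dtype=np.int64)
print(links)
```

Because the title columns are excluded by `usecols`, the integer `dtype` only has to fit the id columns.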
___
The `determine_categories` function extracts the unique categories (i.e. pages) from the data.
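
As a small sketch of this step, using made-up page ids rather than the real data, the unique pages can be collected from both columns with `np.unique`, which also returns them sorted:

```python
import numpy as np

# Made-up source and destination ids for illustration.
sources = np.array([12, 12, 25, 40])
destinations = np.array([34568, 25, 12, 40])

# A page counts as a category whether it appears as a source or a destination.
categories = np.unique(np.concatenate([sources, destinations]))
print(categories.tolist())  # [12, 25, 40, 34568]
```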
___
The `renumber_pages` function ensures that all page ids are represented as consecutive integers by replacing each id with its position in `categories`. In doing so it initializes the `categorical_dataset`.
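
A minimal sketch of the renumbering, again with made-up ids: `pd.Categorical(...).codes` maps each value to its index in the given `categories`.

```python
import numpy as np
import pandas as pd

# Made-up categories (sorted unique page ids) and a column of raw ids.
categories = np.array([12, 25, 40, 34568])
page_ids = pd.Series([12, 34568, 25, 12])

# Each id is replaced by its position in `categories`.
codes = pd.Categorical(page_ids, categories=categories).codes
print(codes.tolist())  # [0, 3, 1, 0]
```

A value that does not occur in `categories` would receive the code `-1`, which is why the categories are determined from both columns first.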
___
The `convert_to_sparse_matrix` function transforms the dataset into a sparse matrix such that a 1 in row A, column B signifies that page A links to page B (the implementation uses the source index as row and the destination index as column).

In order to do this the function uses the following parameter:
  * `datatype` - the datatype to use to store the values [default: `np.int32`]
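
The construction can be sketched on a tiny, made-up edge list. The sketch follows the implementation in the class, which uses the source index as the row and the destination index as the column; the edge values are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up renumbered edges (source, destination); (0, 1) appears twice on purpose.
edges = np.array([[0, 1], [0, 1], [0, 2], [2, 2], [1, 0]])
row = edges[:, 0]
col = edges[:, 1]
data = np.ones(len(edges), dtype=np.int32)

n_pages = 3
adjacency = csr_matrix((data, (row, col)), shape=(n_pages, n_pages), dtype=np.int32)
print(adjacency.toarray())
```

Repeated (row, column) pairs are summed during construction, so the duplicated edge yields a 2 rather than a 1. If the input may contain duplicate links and a strictly binary matrix is needed, the pairs would have to be deduplicated first.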
___
The `eliminate_self_loops` function removes all self loops (page A links to page A) from the dataset.
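
A minimal sketch of this step on a made-up 3x3 matrix: the diagonal is zeroed on a LIL copy, and `eliminate_zeros` then drops any explicitly stored zeros after converting back to CSR.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up adjacency matrix with self loops at (0, 0) and (2, 2).
adjacency = csr_matrix(np.array([[1, 1, 0],
                                 [0, 0, 1],
                                 [1, 0, 1]], dtype=np.int32))

lil = adjacency.tolil()      # LIL supports cheap changes to the sparsity structure
lil.setdiag(0)               # remove every entry on the diagonal
cleaned = lil.tocsr()
cleaned.eliminate_zeros()    # drop any zeros that are still stored explicitly

print(cleaned.toarray())
print(cleaned.nnz)  # 3
```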
___
The `save` function stores the data in `sparse_matrix` at the specified location.

In order to do this the function uses the following parameter:
  * `path` - the storage location of the data [default: `prep_data`]
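
As a sketch of how the stored file can be read back by the other notebooks, the round trip below writes a small made-up matrix to a temporary directory (the directory and matrix are assumptions for illustration). Note that `save_npz` appends the `.npz` extension when the given file name lacks it, so the file must be loaded under that full name.

```python
import os
import tempfile

import numpy as np
import scipy.sparse
from scipy.sparse import csr_matrix

adjacency = csr_matrix(np.array([[0, 1], [1, 0]], dtype=np.int32))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prep_data")
    scipy.sparse.save_npz(path, adjacency)              # written as prep_data.npz
    restored = scipy.sparse.load_npz(path + ".npz")     # load with the full name

print(restored.toarray())
```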

In [58]:
class DataPreprocessing():
    def __init__(self,raw_dataset=None,sparse_matrix=None,categories=None,categorical_dataset=None):
        self.raw_dataset = raw_dataset
        self.sparse_matrix = sparse_matrix
        self.categories = categories
        self.categorical_dataset = categorical_dataset
        
    def load(self, path='wikilink_graph.2004-03-01.csv',delimiter='\t',datatype=np.int64):
        self.raw_dataset = pd.read_csv(path,usecols=(0,2),delimiter=delimiter,dtype=datatype,skiprows=0)
        
    def determine_categories(self):
        source_index = np.unique(self.raw_dataset.iloc[:,0])
        destination_index = np.unique(self.raw_dataset.iloc[:,1])
        self.categories = np.unique(np.append(source_index, destination_index))
        
    def renumber_pages(self):
        self.raw_dataset['source_index'] = pd.Categorical(self.raw_dataset.iloc[:,0], categories=self.categories).codes
        self.raw_dataset['destination_index'] = pd.Categorical(self.raw_dataset.iloc[:,1], categories=self.categories).codes
        display(self.raw_dataset.head()) #verify if results are what we expect so far
        self.categorical_dataset = self.raw_dataset.iloc[:, [2,3]].to_numpy()
        display(self.categorical_dataset) #verify if results are what we expect so far

    def convert_to_sparse_matrix(self,datatype=np.int32):
        col =  self.categorical_dataset[:,1]
        row =  self.categorical_dataset[:,0]
        data = np.ones(len(col))
        max_len = len(self.categories)
        self.sparse_matrix = csr_matrix((data,(row,col)), shape=(max_len, max_len), dtype=datatype)
        display(self.sparse_matrix[:10,:10].todense()) #verify if results are what we expect so far
        
    def eliminate_self_loops(self):
        lil_matrix = self.sparse_matrix.tolil()
        lil_matrix.setdiag(0) # documentation suggests setdiag to be executed on lil instead of csr
        self.sparse_matrix = lil_matrix.tocsr()
        self.sparse_matrix.eliminate_zeros()
        display(self.sparse_matrix[:10,:10].todense()) #verify if results are what we expect so far
        
    def save(self,path='prep_data'):
        scipy.sparse.save_npz(path, self.sparse_matrix) # use the given path; `.npz` is appended if missing

### **Program Execution**
This section is concerned with parsing the input arguments and determining the execution flow of the program.
___
The `main` function is responsible for the main flow of the program: it loads the link data and runs the preprocessing steps in order.

In order to do this the function uses the following parameter:
  * `path` - the location of the link data file [default: `wikilink_graph.2004-03-01.csv`]

In [61]:
def main(path = 'wikilink_graph.2004-03-01.csv'):
    prep = DataPreprocessing()
    prep.load(path)
    
    prep.determine_categories()
    prep.renumber_pages()
    
    prep.convert_to_sparse_matrix()
    prep.eliminate_self_loops()
    
    prep.save()

The following snippet starts the program and passes the command line argument to the `main` function.

The following command line argument is expected:
  * `path` - the location of the `wikilink_graph.2004-03-01.csv` file
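
Reading `sys.argv[1]` directly raises an `IndexError` when no argument is supplied. One possible way to guard against this, sketched here with a hypothetical helper `resolve_path` that is not part of the notebook, is to fall back to the default file name:

```python
import sys

DEFAULT_PATH = "wikilink_graph.2004-03-01.csv"

def resolve_path(argv, default=DEFAULT_PATH):
    """Return the first command line argument, or the default when none is given."""
    return argv[1] if len(argv) > 1 else default

print(resolve_path(["prep.py"]))               # wikilink_graph.2004-03-01.csv
print(resolve_path(["prep.py", "links.csv"]))  # links.csv

# In a script this would be used as: main(path=resolve_path(sys.argv))
```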

In [None]:
if __name__ == "__main__":
    filepath = sys.argv[1]
    main(path=filepath)

The following snippet triggers the manual execution of the program.

In [64]:
main()

| | page_id_from | page_id_to | source_index | destination_index |
|---|---|---|---|---|
| 0 | 12 | 34568 | 0 | 18381 |
| 1 | 12 | 35416 | 0 | 19179 |
| 2 | 12 | 34569 | 0 | 18382 |
| 3 | 12 | 34699 | 0 | 18501 |
| 4 | 12 | 34700 | 0 | 18502 |


array([[     0,  18381],
       [     0,  19179],
       [     0,  18382],
       ...,
       [248190, 102031],
       [248191, 241638],
       [248192, 120406]])

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int32)