## **Advances in Data Mining**

Stephan van der Putten | s1528459 | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 3**
This assignment analyzes the Wikipedia links given in the `wikilink_graph.2004-03-01.csv` file and executes PageRank on them. To do this, the assignment is split into four subtasks, each with its own dedicated `.ipynb` file. See each file for details on what that notebook accomplishes.

Note that all implementations are based on the assignment guidelines, the helper files provided, and the documentation of the functions used.

#### **Exploratory Data Analysis**
This notebook performs an exploratory analysis of the dataset. This includes some analysis of the nodes and edges as well as an estimate of the system requirements for executing the PageRank algorithm.
___

The following snippet handles all imports.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy

The `load_prepped_data` function is responsible for retrieving the data prepped by `prep.ipynb` and loading it for exploratory data analysis.

In order to do this the function uses the following parameters:
  * `filename` - the name of the file containing the prepped data [default = `prepped_data.npy`]
  
Additionally, it returns the following values:
  * `data` - an array representing the prepped data  

In [10]:
def load_prepped_data(filename = 'prepped_data.npy'):
    data = np.load(filename)
    return data

The following snippet triggers the data loading.

In [30]:
data = load_prepped_data()
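
The node and edge counts, together with a rough memory estimate for PageRank, could be obtained along the following lines. This is a minimal sketch rather than the notebook's own analysis: the helper name `estimate_memory`, the 8-byte value size, and the 4-byte index size are assumptions, and the sparse figure assumes a CSR-style layout with no duplicate edges.

```python
import numpy as np

def estimate_memory(edges, bytes_per_value=8, bytes_per_index=4):
    """Return (n_nodes, n_edges, dense_bytes, sparse_bytes) for an edge list.

    `edges` is assumed to be an (E, 2) array of [source, target] pairs.
    """
    edges = np.asarray(edges)
    n_nodes = int(np.unique(edges).size)   # distinct node ids in either column
    n_edges = int(edges.shape[0])          # one row per edge
    # A dense transition matrix stores one value per node pair.
    dense_bytes = n_nodes ** 2 * bytes_per_value
    # CSR stores one value and one column index per edge,
    # plus a row-pointer array of length n_nodes + 1.
    sparse_bytes = (n_edges * (bytes_per_value + bytes_per_index)
                    + (n_nodes + 1) * bytes_per_index)
    return n_nodes, n_edges, dense_bytes, sparse_bytes

toy_edges = [[0, 2], [0, 3], [2, 1], [2, 3]]
n_nodes, n_edges, dense_bytes, sparse_bytes = estimate_memory(toy_edges)
print(n_nodes, n_edges, dense_bytes, sparse_bytes)
```

Because the dense estimate grows quadratically in the number of nodes while the sparse estimate grows roughly linearly in the number of edges, the comparison indicates which representation is feasible on a given machine.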

#### **Dead Ends**
This section determines how many dead ends there are. A dead end is a node that has no outgoing edges.
___
The `compute_dead_ends_set` function parses the edges and computes the set of all nodes that are dead ends. For a node to appear in the dataset it must have at least one outgoing edge or one incoming edge. By definition a dead end has no outgoing edges, so it must appear as the target of at least one edge. Thus, the set of nodes with incoming edges minus the set of nodes with outgoing edges is the set of dead ends.

In order to do this the function uses the following parameters:
  * `data` - the prepped data
  
Additionally, it returns the following values:
  * `dead_ends` - a set of all the dead ends [in consecutive numbering]

In [48]:
def compute_dead_ends_set(data):
    # Column 0 holds the source node of each edge, column 1 the target node.
    with_outgoing_edges = set(data[:,0])
    with_incoming_edges = set(data[:,1])
    # Nodes that are linked to but never link out are dead ends.
    dead_ends = with_incoming_edges - with_outgoing_edges
    return dead_ends
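
The same set difference can also be expressed directly in NumPy, which avoids building Python sets for large edge lists. The following is a minimal sketch, not the notebook's implementation: the function name `compute_dead_ends_array` is hypothetical, and the input is assumed to be an (E, 2) array of [source, target] pairs.

```python
import numpy as np

def compute_dead_ends_array(data):
    """Return a sorted array of nodes that have incoming but no outgoing edges."""
    data = np.asarray(data)
    # np.setdiff1d returns the sorted unique values of the first array
    # that do not occur in the second array.
    return np.setdiff1d(data[:, 1], data[:, 0])

edges = np.array([[0, 2], [0, 3], [2, 1], [2, 3]])
print(compute_dead_ends_array(edges))  # prints [1 3]
```

Here nodes 0 and 2 have outgoing edges, while nodes 1 and 3 only appear as targets, so they are the dead ends.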

The `analyze_dead_ends` function analyzes the edge list and prints some information on the dead ends in the graph.

In order to do this the function uses the following parameters:
  * `data` - the prepped data

In [42]:
def analyze_dead_ends(data):
    dead_ends = compute_dead_ends_set(data)
    count_dead_ends = len(dead_ends)
    if count_dead_ends == 0:
        print("There are no dead ends.")
        return
    # TODO convert consecutive numbering to original number
    if count_dead_ends == 1:
        print("There is 1 dead end.")
        print("The following node is classified as a dead end.")
    else:
        print("There are "+str(count_dead_ends)+" dead ends.")
        print("The following set of nodes are classified as dead ends.")
    print(dead_ends)

The following snippet triggers the dead end analysis.

In [49]:
analyze_dead_ends(data)

There are 2 dead ends.
The following set of nodes are classified as dead ends.
{1, 3}


In [44]:
# TODO: remove this temporary cell. It overwrites prepped_data.npy with a toy edge list for testing.
no_dead_ends = [[0,2],[1,2],[0,3],[2,1],[2,3],[3,1],[3,0]]
one_dead_end = [[0,2],[0,3],[2,1],[2,3],[3,1],[3,0]]
two_dead_ends = [[0,2],[0,3],[2,1],[2,3]]
temp_data = two_dead_ends
np.save('prepped_data',temp_data)
data = load_prepped_data()
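
The three toy edge lists above are named after the number of dead ends they should contain, which makes them usable as a quick sanity check. A minimal sketch, which redefines the set-difference logic locally so that it runs on its own (the helper name `count_dead_ends` is hypothetical):

```python
import numpy as np

def count_dead_ends(edges):
    """Count nodes that appear as a target but never as a source."""
    edges = np.asarray(edges)
    sources = set(edges[:, 0].tolist())
    targets = set(edges[:, 1].tolist())
    return len(targets - sources)

# Expected dead-end count mapped to the corresponding toy edge list.
toy_graphs = {
    0: [[0, 2], [1, 2], [0, 3], [2, 1], [2, 3], [3, 1], [3, 0]],
    1: [[0, 2], [0, 3], [2, 1], [2, 3], [3, 1], [3, 0]],
    2: [[0, 2], [0, 3], [2, 1], [2, 3]],
}

for expected, edges in toy_graphs.items():
    assert count_dead_ends(edges) == expected
print("All toy graphs give the expected number of dead ends.")
```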