<a href="https://colab.research.google.com/github/IGARDS/structured_artificial/blob/main/notebooks/structured_artificial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create Artificial Structured Datasets

The purpose of this notebook is to provide illustrative examples of artificial ranking datasets. Feel free to use it as is or to modify it. For questions, suggestions, and discussions, please visit: https://github.com/IGARDS/RPLib/discussions.

## Overview
This notebook contains several functions designed to construct structured graphs
with increasing noise. The motivation for these datasets is to study how rankability measures correlate with the structure and noise level of the data.

In [None]:
#@title Double click to show/hide code

from IPython.display import display, Markdown, Latex

!apt install libgraphviz-dev
!pip install git+https://github.com/IGARDS/ranking_toolbox.git --upgrade
!pip install git+https://github.com/IGARDS/RPLib.git --upgrade

from pyrplib.artificial import *


## Single use examples

### EMPTY + NOISE

In [None]:
help(emptyplusnoise)

Help on function emptyplusnoise in module pyrplib.artificial:

emptyplusnoise(n, percentnoise, low=0, high=3)
    EMPTY + NOISE
    
    Function starts with an empty graph and adds some amount of noise.
    
    Input: n = number of rows/cols in D matrix
      percentnoise = integer between 1 and n^2 representing the
      percentage of noise to add to D hillside, e.g., 
      if percentnoise = 10, then 10% of the n^2 elements will be noise
    
    Example: 'D = emptyplusnoise(6,20)' creates a 6 by 6 matrix with 20% noise
      added to the empty graph



In [None]:
emptyplusnoise(5,20,2,4)

Unnamed: 0,0,1,2,3,4
0,0,0,0,4,2
1,0,0,0,0,0
2,4,0,0,0,0
3,0,3,0,0,0
4,0,0,0,3,0
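The output above has five nonzero off-diagonal entries (20% of the 25 cells), each between `low` and `high`. A minimal sketch consistent with that behaviour follows. It is not the `pyrplib` implementation: the function name, the `seed` parameter, the rule that noise never lands on the diagonal, and the use of plain lists instead of a pandas DataFrame are all assumptions made for illustration.

```python
import random

def empty_plus_noise_sketch(n, percentnoise, low=0, high=3, seed=None):
    """Hypothetical stand-in for emptyplusnoise, using plain lists."""
    rng = random.Random(seed)
    D = [[0] * n for _ in range(n)]  # empty graph: no links at all
    # Assumption: noise is only placed off the diagonal
    off_diagonal = [(i, j) for i in range(n) for j in range(n) if i != j]
    # Assumption: percentnoise is a percentage of all n^2 entries
    k = min(len(off_diagonal), (percentnoise * n * n) // 100)
    for i, j in rng.sample(off_diagonal, k):
        D[i][j] = rng.randint(low, high)  # both ends inclusive
    return D

D = empty_plus_noise_sketch(5, 20, low=2, high=4, seed=0)
for row in D:
    print(row)
```

Note that with `low=0` some of the sampled cells would receive a weight of zero, so fewer links would be visible than the percentage suggests.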


### HILLSIDE + NOISE

In [None]:
help(hillsideplusnoise)

Help on function hillsideplusnoise in module pyrplib.artificial:

hillsideplusnoise(n, percentnoise, low=1, high=5)
    HILLSIDE + NOISE
    
    Starts with a perfect hillside graph and then randomly perturbs the matrix at user specified percentage. 
    
    Input: n = number of rows/cols in D matrix
            percentnoise = integer between 1 and n^2 representing the
                          percentage of noise to add to D hillside, e.g., 
                          if percentnoise = 10, then 10% of the n^2
                          elements will be noise
    Example: 'D = hillsideplusnoise(6,20)' creates a 6 by 6 matrix with 20% noise
                added to the hillside graph



In [None]:
hillsideplusnoise(6, 50)

Unnamed: 0,0,1,2,3,4,5
0,0,3,2,3,4,1
1,4,0,1,2,2,2
2,5,1,0,2,2,3
3,1,0,0,0,1,2
4,0,0,0,0,0,1
5,5,3,0,0,3,0
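Rows 3 and 4 of the output above look unperturbed and follow the pattern `D[i][j] = j - i` above the diagonal. Assuming that pattern is the perfect hillside, the sketch below builds it and checks the hillside conditions as they are assumed here: zeros on and below the diagonal, rows increasing to the right, and columns decreasing toward the diagonal. Both function names are hypothetical and plain lists are used instead of a DataFrame.

```python
def perfect_hillside_sketch(n):
    """Upper-triangular matrix with D[i][j] = j - i above the diagonal."""
    return [[j - i if j > i else 0 for j in range(n)] for i in range(n)]

def is_hillside(D):
    """Check the assumed hillside conditions on a square list-of-lists matrix."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            if j <= i:
                if D[i][j] != 0:  # nothing on or below the diagonal
                    return False
                continue
            if j + 1 < n and D[i][j] >= D[i][j + 1]:  # rows increase to the right
                return False
            if i + 1 < j and D[i][j] <= D[i + 1][j]:  # columns decrease going down
                return False
    return True

for row in perfect_hillside_sketch(6):
    print(row)
print(is_hillside(perfect_hillside_sketch(6)))
```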


### DOM + NOISE

In [None]:
help(domplusnoise)

Help on function domplusnoise in module pyrplib.artificial:

domplusnoise(n, percentnoise, low=0, high=1)
    function creates a dominance graph and adds noise. 
    
    Input: n = number of rows/cols in D matrix
            percentnoise = integer between 1 and n^2 representing the
                          percentage of noise to add to D domgraph, e.g., 
                          if percentnoise = 10, then 10% of the n^2
                          elements will be noise
    Example: 'D = domplusnoise(6,20)' creates a 6 by 6 matrix with 20% noise
                  added to the dominance graph



In [None]:
domplusnoise(5, 20,high=5)

Unnamed: 0,0,1,2,3,4
0,0,2,5,5,5
1,0,0,5,5,5
2,0,0,0,0,3
3,0,0,0,0,5
4,0,0,2,0,0
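To see how much noise was applied, the output above can be compared with a noise-free dominance matrix. The sketch below assumes that the perfect dominance graph is strictly upper triangular with every link weighted `high` (5 here), which is what the unperturbed entries above suggest. The helper names are hypothetical.

```python
def perfect_dominance_sketch(n, weight=1):
    """Every item beats all lower-ranked items: constant weight above the diagonal."""
    return [[weight if j > i else 0 for j in range(n)] for i in range(n)]

def count_differences(A, B):
    """Number of cells in which two equally sized matrices disagree."""
    return sum(1 for ra, rb in zip(A, B) for a, b in zip(ra, rb) if a != b)

def count_below_diagonal(D):
    """Number of nonzero entries below the diagonal (links against the ordering)."""
    return sum(1 for i, row in enumerate(D) for j, v in enumerate(row) if j < i and v != 0)

# The matrix printed above by domplusnoise(5, 20, high=5)
noisy = [
    [0, 2, 5, 5, 5],
    [0, 0, 5, 5, 5],
    [0, 0, 0, 0, 3],
    [0, 0, 0, 0, 5],
    [0, 0, 2, 0, 0],
]
print(count_differences(noisy, perfect_dominance_sketch(5, weight=5)))  # 4
print(count_below_diagonal(noisy))  # 1
```

Fewer cells differ than the 20% requested, which would be expected if some noise values happen to coincide with the original entry.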


### WEAK DOM + NOISE

In [None]:
help(weakdomplusnoise)

Help on function weakdomplusnoise in module pyrplib.artificial:

weakdomplusnoise(n, percentnoise, low=0, high=1)
    function creates a weak dominance graph and adds noise. 
    
    Input: n = number of rows/cols in D matrix
            percentnoise = integer between 1 and n^2 representing the
                          percentage of noise to add to D domgraph, e.g., 
                          if percentnoise = 10, then 10% of the n^2
                          elements will be noise
    Example: 'D = weakdomplusnoise(6,20)' creates a 6 by 6 matrix with 20% noise
                  added to the dominance graph



In [None]:
weakdomplusnoise(5, 20,high=5)

Unnamed: 0,0,1,2,3,4
0,0,5,0,0,4
1,0,0,4,0,0
2,0,0,0,5,4
3,0,0,0,0,1
4,0,0,0,0,0


### UNWEIGHTED

In [None]:
help(unweighted)

Help on function unweighted in module pyrplib.artificial:

unweighted(D)
    CONVERT TO UNWEIGHTED
    
    Function returns an unweighted version of D



In [None]:
D = emptyplusnoise(5, 20,low=2,high=5)
print("Weighted:")
display(D)
print("Unweighted")
unweighted(D)

Weighted:


Unnamed: 0,0,1,2,3,4
0,0,0,0,0,5
1,0,0,0,4,0
2,0,0,0,0,0
3,4,0,0,0,4
4,0,0,4,0,0


Unweighted


Unnamed: 0,0,1,2,3,4
0,0,0,0,0,1
1,0,0,0,1,0
2,0,0,0,0,0
3,1,0,0,0,1
4,0,0,1,0,0
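Judging by the two outputs above, the conversion replaces every nonzero weight with 1 and leaves zeros alone. A minimal sketch of that mapping on plain lists (the function name is hypothetical, and this is not the `pyrplib` source):

```python
def unweighted_sketch(D):
    """Return a copy of D in which every nonzero entry becomes 1."""
    return [[1 if v != 0 else 0 for v in row] for row in D]

# The weighted matrix printed above
weighted = [
    [0, 0, 0, 0, 5],
    [0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0],
    [4, 0, 0, 0, 4],
    [0, 0, 4, 0, 0],
]
for row in unweighted_sketch(weighted):
    print(row)
```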


### REMOVE LINKS

In [None]:
help(removelinks)

Help on function removelinks in module pyrplib.artificial:

removelinks(D, percent)
    CONVERT TO UNWEIGHTED
    
    Function returns a modified version of D with percent of nonzero links removed



In [None]:
D = domplusnoise(5, 20,high=5)
print("Original:")
display(D)
print("With links removed")
removelinks(D,50)

Original:


Unnamed: 0,0,1,2,3,4
0,0,5,5,5,5
1,5,0,5,5,2
2,0,0,0,5,5
3,0,4,4,0,5
4,5,0,0,0,0


With links removed


Unnamed: 0,0,1,2,3,4
0,0,5,0,0,0
1,0,0,5,5,0
2,0,0,0,5,5
3,0,4,4,0,0
4,0,0,0,0,0
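The original matrix above has 14 nonzero links and the result has 7, matching `percent = 50`. The sketch below reproduces that counting rule on plain lists. It assumes the links to drop are chosen uniformly at random and that the count is rounded down; the function name and `seed` parameter are hypothetical.

```python
import random

def remove_links_sketch(D, percent, seed=None):
    """Return a copy of D with `percent` percent of its nonzero links set to 0."""
    rng = random.Random(seed)
    result = [row[:] for row in D]  # leave the input untouched
    links = [(i, j) for i, row in enumerate(D) for j, v in enumerate(row) if v != 0]
    k = (percent * len(links)) // 100  # assumption: round down
    for i, j in rng.sample(links, k):
        result[i][j] = 0
    return result

# The "Original" matrix printed above
original = [
    [0, 5, 5, 5, 5],
    [5, 0, 5, 5, 2],
    [0, 0, 0, 5, 5],
    [0, 4, 4, 0, 5],
    [5, 0, 0, 0, 0],
]
thinned = remove_links_sketch(original, 50, seed=1)
print(sum(1 for row in thinned for v in row if v != 0))  # 7
```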


### DEFINE MULTIPLE OPTIMAL SOLUTIONS

In [None]:
help(addmossimple)

Help on function addmossimple in module pyrplib.artificial:

addmossimple(D, start_index, end_index)
    For a binary matrix D, create simple multiple optimal solutions in the range of teams specified.
    Indices are inclusive.



In [None]:
D = domplusnoise(20,0)
addmossimple(D,0,4)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
3,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
5,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1
6,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1
7,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1
8,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1
9,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1
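In the output above, teams 0 through 4 are linked to each other in both directions, which is presumably what makes several orderings of them equally good. A sketch of that construction on plain lists follows; the function name is hypothetical and the behaviour is inferred from the output, not taken from the `pyrplib` source.

```python
def add_mos_simple_sketch(D, start_index, end_index):
    """Link every pair of teams in [start_index, end_index] in both directions."""
    result = [row[:] for row in D]
    for i in range(start_index, end_index + 1):  # indices are inclusive
        for j in range(start_index, end_index + 1):
            if i != j:
                result[i][j] = 1
    return result

# A 6 x 6 binary dominance matrix: ones strictly above the diagonal
n = 6
base = [[1 if j > i else 0 for j in range(n)] for i in range(n)]
for row in add_mos_simple_sketch(base, 0, 2):
    print(row)
```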


### Cyclical

In [None]:
import pandas as pd
import numpy as np

def cyclic(n):
    """
    Create a simple cycle D matrix of size n x n.
    """
    D=pd.DataFrame(np.zeros((n,n)),dtype=int) # initialize D as an empty graph 
    for i in range(n-1):
        D.iloc[i,i+1] = 1
    D.iloc[n-1,0] = 1
    return D

In [None]:
cyclic(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


## Visualizations

In [None]:
!pip install nx_altair # not available on colab by default

Collecting nx_altair
  Downloading nx_altair-0.1.6-py3-none-any.whl (7.9 kB)
Installing collected packages: nx-altair
Successfully installed nx-altair-0.1.6


In [None]:
import networkx as nx
import nx_altair as nxa
def DGraph(G, pos=None):
  if not isinstance(G, nx.DiGraph):
    G = nx.DiGraph(G)

  if pos is None:
    pos = nx.drawing.layout.spring_layout(G, seed=7)
  
  nx.set_node_attributes(G, {val: val for val in list(G.nodes)}, 'labels')
  return nxa.draw_networkx(G, pos, edge_color = 'weight', node_label='labels', 
                           node_color = 'white', width=4.0, arrow_width=4,
                           edge_tooltip=['source', 'target', 'weight'])

DGraph(hillsideplusnoise(6, 0))


In [None]:
import networkx as nx
import nx_altair as nxa
def DGraph(G, pos=None):
  if not isinstance(G, nx.DiGraph):
    G = nx.DiGraph(G)

  if pos is None:
    pos = nx.drawing.layout.circular_layout(G)
  
  nx.set_node_attributes(G, {val: val for val in list(G.nodes)}, 'labels')
  return nxa.draw_networkx(G, pos, edge_color = 'weight', node_label='labels', 
                           node_color = 'white', width=4.0, arrow_width=4,
                           edge_tooltip=['source', 'target', 'weight'])

DGraph(cyclic(10))
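`DGraph` hands the D matrix to `nx.DiGraph`, which treats it as a weighted adjacency matrix: a nonzero entry in row `i`, column `j` becomes a link from `i` to `j`. The sketch below lists those links as `(source, target, weight)` triples, the same three fields used for the edge tooltip. It works on plain lists and the helper name is hypothetical.

```python
def edges_from_matrix(D):
    """List the (source, target, weight) triples for the nonzero entries of D."""
    return [(i, j, w) for i, row in enumerate(D) for j, w in enumerate(row) if w != 0]

# A 3-node cycle: 0 -> 1 -> 2 -> 0
D = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
print(edges_from_matrix(D))  # [(0, 1, 1), (1, 2, 1), (2, 0, 1)]
```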

## Creating datasets
### Customizing a ``create`` function
You can write your own create function and define your own options. After that, you can use the provided scaffold to create many artificial dominance matrices.

We provide a few sample functions in pyrplib. These are shown below.

In [None]:
import inspect

print(inspect.getsource(example_create))
print(inspect.getsource(example_get_create_options))
print(inspect.getsource(example_create2))
print(inspect.getsource(example_get_create_options2))
print(inspect.getsource(example_create3))
print(inspect.getsource(example_get_create_options3))

def example_create(options=example_get_create_options()):
    """
    Example create function. These functions must return a dominance (D) matrix that is a pandas dataframe. 
    Options is a dictionary. There is one required key/value which is the number_of_rows_columns. 
    It may also have additional arguments.
    """
    assert type(options) == dict
    r = pd.Series(np.arange(options['number_of_rows_columns'],0,-1))

    D = domfromranking(options['number_of_rows_columns'],r,options['num_games'],
                              upset_func=lambda r1,r2: (abs(r1-r2) <= options['threshold']) and np.random.uniform() > 0.5)
    return D

def example_get_create_options():
    """
    Example set of options to be paired with example_create function.
    """
    return {
        "number_matrices":10,
        "number_of_rows_columns": 20,
        "threshold":3,
        "num_games":1000
    }

def example_create2(options=example_get_create_options2()):
    """
    Example create function. T
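Following the contract stated in the docstring above (an options dictionary with a `number_of_rows_columns` key, and a pandas DataFrame as the return value), a custom pair of functions might look like the sketch below. The names `my_create` and `my_get_create_options`, and the `max_weight` and `seed` options, are hypothetical and not part of `pyrplib`.

```python
import numpy as np
import pandas as pd

def my_get_create_options():
    """Hypothetical default options to pair with my_create."""
    return {
        "number_matrices": 5,
        "number_of_rows_columns": 6,
        "max_weight": 3,  # assumption: largest link weight to draw
        "seed": 0,        # assumption: makes the sketch reproducible
    }

def my_create(options):
    """Return a dominance matrix with random weights strictly above the diagonal."""
    assert isinstance(options, dict)
    n = options["number_of_rows_columns"]
    rng = np.random.default_rng(options.get("seed"))
    # Draw integer weights in [1, max_weight], then keep only the upper triangle
    weights = rng.integers(1, options["max_weight"] + 1, size=(n, n))
    return pd.DataFrame(np.triu(weights, k=1))

print(my_create(my_get_create_options()))
```

A function written this way would then be handed to `create_dataset` together with its options, in the same way the sample functions are used later in this notebook.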

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Example: DOM + NOISE

In [None]:
#@title Parameters
number_matrices =  30#@param {type:"integer"}

number_of_rows_columns =  20#@param {type:"integer"}

low =  0#@param {type:"integer"}

high = 5 #@param {type:"integer"}

percentage = 90 #@param {type:"slider", min:1, max:100, step:1}

percent_links_to_remove = 10 #@param {type:"slider", min:1, max:100, step:1}

make_unweighted = False #@param {type:"boolean"}

output_directory = "structured_artificial" #@param {type:"string"}


#### Creating the matrices

In [None]:
options = {
        "number_matrices":number_matrices,
        "number_of_rows_columns": number_of_rows_columns,
        "low":low,
        "high":high,
        "percentage":percentage,
        "percent_links_to_remove":percent_links_to_remove,
        "make_unweighted":make_unweighted,
    }
    
dataset = create_dataset(example_create3,options)
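Because the options come from form fields, a quick sanity check before generating many matrices can catch typos early. The sketch below is one hypothetical way to do that; `validate_options_sketch` is not part of `pyrplib`, and the rules it enforces (positive sizes, percentages within 0 to 100, `low` not above `high`) are assumptions about what sensible values look like.

```python
def validate_options_sketch(options):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    for key in ("number_matrices", "number_of_rows_columns"):
        value = options.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer")
    for key in ("percentage", "percent_links_to_remove"):
        if key in options and not 0 <= options[key] <= 100:
            problems.append(f"{key} must be between 0 and 100")
    if "low" in options and "high" in options and options["low"] > options["high"]:
        problems.append("low must not exceed high")
    return problems

example = {
    "number_matrices": 30,
    "number_of_rows_columns": 20,
    "low": 0,
    "high": 5,
    "percentage": 90,
    "percent_links_to_remove": 10,
    "make_unweighted": False,
}
print(validate_options_sketch(example))  # []
```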

#### Storing dataset in Google Drive (optional)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from datetime import datetime
import os

output_dir = f"/content/drive/MyDrive/{output_directory}"

# Create the output directory on first use
if not os.path.exists(output_dir):
  os.mkdir(output_dir)

# Timestamp the filename so repeated runs do not overwrite each other
date = datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p")

filename = f"dataset_{date}.json"

dataset.to_json(f"{output_dir}/{filename}")

In [None]:
!ls {output_dir}/{filename}

/content/drive/MyDrive/structured_artificial/dataset_2022_04_06-10:33:33_PM.json


Your dataset should now be available in JSON format in your Google Drive. Please consider sharing it with the research community by requesting inclusion in https://igards.github.io/RPLib/. If interested, please [upload your dataset here](https://forms.gle/iHyfy5popvzaQ5Jd6).

When you are uploading your dataset, you might find the following output useful for documentation purposes.

In [None]:
print(dataset['Create code'])

def example_create3(options):
    """
    Example create function. These functions must return a dominance (D) matrix that is a pandas dataframe. 
    Options is a dictionary. There is one required key/value which is the number_of_rows_columns. 
    It may also have additional arguments.
    """
    assert type(options) == dict
    D = domplusnoise(options['number_of_rows_columns'],options['percentage'],options['low'],options['high'])
    if options['percent_links_to_remove'] > 0:
        D = removelinks(D,options['percent_links_to_remove'])
    if options['make_unweighted']:
        D = unweighted(D)
    return D



In [None]:
import json
print(json.dumps(options, indent=4, sort_keys=True))

{
    "high": 5,
    "low": 0,
    "make_unweighted": false,
    "number_matrices": 30,
    "number_of_rows_columns": 20,
    "percent_links_to_remove": 10,
    "percentage": 90
}
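If it helps with documentation, the same options can be written to a small JSON file next to the dataset so that the two travel together. This is only a sketch: the helper name, the `_options.json` suffix, and the example filename are hypothetical, and a temporary directory stands in for the Google Drive folder.

```python
import json
import os
import tempfile

def save_options_sketch(options, directory, dataset_filename):
    """Write options to '<dataset name>_options.json' in directory and return the path."""
    base, _ = os.path.splitext(dataset_filename)
    sidecar = os.path.join(directory, f"{base}_options.json")
    with open(sidecar, "w") as f:
        json.dump(options, f, indent=4, sort_keys=True)
    return sidecar

example_options = {"high": 5, "low": 0, "number_matrices": 30}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_options_sketch(example_options, tmp, "dataset_example.json")
    with open(saved) as f:
        print(json.load(f) == example_options)  # True
```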


### Example: Threshold

In [None]:
#@title Parameters
number_matrices =  30#@param {type:"integer"}

number_of_rows_columns =  20#@param {type:"integer"}

threshold =  5#@param {type:"integer"}

num_games =  1000#@param {type:"integer"}

output_directory = "structured_artificial" #@param {type:"string"}


#### Creating the matrices

In [None]:
options = {
        "number_matrices":number_matrices,
        "number_of_rows_columns": number_of_rows_columns,
        "threshold":threshold,
        "num_games":num_games
}
    
dataset = create_dataset(example_create,options)

#### Storing dataset in Google Drive (optional)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
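from datetime import datetime
import os

output_dir = f"/content/drive/MyDrive/{output_directory}"

# Create the output directory on first use
if not os.path.exists(output_dir):
  os.mkdir(output_dir)

# Timestamp the filename so repeated runs do not overwrite each other
date = datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p")

# Include the threshold so datasets from different settings are easy to tell apart
filename = f"dataset_{date}_{threshold}.json"

dataset.to_json(f"{output_dir}/{filename}")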

In [None]:
!ls {output_dir}/{filename}

/content/drive/MyDrive/structured_artificial/dataset_2022_04_06-10:41:35_PM_5.json


Your dataset should now be available in JSON format in your Google Drive. Please consider sharing it with the research community by requesting inclusion in https://igards.github.io/RPLib/. If interested, please [upload your dataset here](https://forms.gle/iHyfy5popvzaQ5Jd6).

When you are uploading your dataset, you might find the following output useful for documentation purposes.

In [None]:
print(dataset['Create code'])

def example_create(options=example_get_create_options()):
    """
    Example create function. These functions must return a dominance (D) matrix that is a pandas dataframe. 
    Options is a dictionary. There is one required key/value which is the number_of_rows_columns. 
    It may also have additional arguments.
    """
    assert type(options) == dict
    r = pd.Series(np.arange(options['number_of_rows_columns'],0,-1))

    D = domfromranking(options['number_of_rows_columns'],r,options['num_games'],
                              upset_func=lambda r1,r2: (abs(r1-r2) <= options['threshold']) and np.random.uniform() > 0.5)
    return D
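The `upset_func` in the code above returns True only when the two rank values differ by at most `threshold`, and even then only about half of the time. The sketch below isolates that rule so its effect can be seen directly. It uses the standard `random` module instead of `np.random`, the helper name is hypothetical, and what `domfromranking` does with the result is not shown here.

```python
import random

def make_upset_func(threshold, rng):
    """Build an upset rule: possible only for close ranks, then roughly a coin flip."""
    def upset(r1, r2):
        return abs(r1 - r2) <= threshold and rng.random() > 0.5
    return upset

rng = random.Random(0)
upset = make_upset_func(threshold=5, rng=rng)

# Count upsets over 1000 simulated games for a close pair and a distant pair
close = sum(upset(10, 8) for _ in range(1000))   # gap of 2, within the threshold
far = sum(upset(20, 1) for _ in range(1000))     # gap of 19, beyond the threshold
print(close, far)
```

The distant pair never produces an upset, while the close pair does so in roughly half of the games, so a larger `threshold` lets upsets reach further down the ranking.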



In [None]:
import json
print(json.dumps(options, indent=4, sort_keys=True))

{
    "num_games": 1000,
    "number_matrices": 30,
    "number_of_rows_columns": 20,
    "threshold": 5
}
