## 0. Import Packages 

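Import the required packages. `genfromtxt` reads delimited text files into NumPy arrays, which we later convert to Python sets for the set calculations. The `from __future__ import division` statement makes `/` perform true (floating-point) division under Python 2; it has no effect under Python 3.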

In [20]:
# Must come first: enables true division for `/` on Python 2 (no effect on Python 3).
from __future__ import division

import numpy as np
import pandas as pd
from numpy import genfromtxt

## 1. Load gene sets

Depending on the format of your data, you may need different `usecols` or `skip_header` arguments. The goal is a set of integers giving the indices of the genes of interest in the original dataset. In this example, we use the genes selected by each of the two algorithms on the preprocessed GSE102698 dataset, without any subsampling.

In [21]:
# Read the second column (usecols=1), skipping one header line.
dpfGenes = genfromtxt("s1_selected_genes_dpf.csv", delimiter=",", usecols=1, skip_header=1)

In [22]:
# Read the first column (usecols=0); no header lines are skipped.
nvrGenes = genfromtxt("s1_selected_genes_nvr.csv", delimiter=",", usecols=0, skip_header=0)
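
To see how `usecols` and `skip_header` change what is read, here is a minimal sketch on small in-memory tables. The contents and the names `with_header`, `no_header`, `genes_a`, and `genes_b` are made up for illustration and are not the files used in this notebook.

```python
from io import StringIO

import numpy as np

# Hypothetical contents: a header row, then (row label, gene index) pairs.
with_header = StringIO("row,gene_index\n1,12\n2,47\n3,105\n")
# Hypothetical contents: no header, gene indices in the only column.
no_header = StringIO("3\n47\n105\n200\n")

# Skip the header line and keep only the second column.
genes_a = np.genfromtxt(with_header, delimiter=",", usecols=1, skip_header=1)
# Nothing to skip; keep the first column.
genes_b = np.genfromtxt(no_header, delimiter=",", usecols=0, skip_header=0)

print(genes_a)  # values come back as floats by default
print(genes_b)
```

Because `genfromtxt` returns floats by default, the values still need an integer conversion before they can be used as indices.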

Because some algorithms only have R implementations, we have to account for minor differences in indexing conventions. R output, for example, may index from 1 instead of 0. Since we analyze these gene sets in Python, make the appropriate adjustment: here we subtract 1 from the R output (the `dpf` gene list).

In [23]:
dpfSet=set(dpfGenes.astype(int)-1)

In [24]:
nvrSet=set(nvrGenes.astype(int))
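
The two conversions above could be wrapped in one helper that also guards against applying the offset to the wrong file. This is only a sketch: `to_zero_based_set` and its `one_based` flag are hypothetical names, and the input arrays are made-up values.

```python
import numpy as np


def to_zero_based_set(indices, one_based=False):
    """Convert an array of gene indices to a set of 0-based Python ints."""
    idx = np.asarray(indices).astype(int)
    if one_based:
        idx = idx - 1  # shift 1-based (R-style) indices to 0-based
    # A negative value means the offset was applied to 0-based data.
    if idx.size and idx.min() < 0:
        raise ValueError("negative index after conversion; check the indexing base")
    return set(idx.tolist())


r_output = np.array([1.0, 13.0, 48.0])    # hypothetical 1-based indices from R
py_output = np.array([0.0, 12.0, 100.0])  # hypothetical 0-based indices from Python

r_set = to_zero_based_set(r_output, one_based=True)
py_set = to_zero_based_set(py_output)
print(sorted(r_set))
print(sorted(py_set))
```

The check only catches an index of 0 in data declared as 1-based; it cannot detect a missing offset, so it is still worth confirming the convention of each tool.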

## 2. Calculate Jaccard Index

The Jaccard index, or intersection over union, is the metric we use to quantify set similarity. It is defined as the size of the intersection divided by the size of the union of the two sets of interest: $J(A, B) = |A \cap B| / |A \cup B|$. It ranges from 0 (no shared elements) to 1 (identical sets).

In [11]:
len(dpfSet & nvrSet) / len(dpfSet | nvrSet)

0.7046413502109705
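
For repeated comparisons, the same calculation could be written as a small function. This is a sketch assuming Python 3 (or true division enabled as in section 0); `jaccard_index` is a hypothetical helper name, and the example sets are made up. Raising an error for two empty sets is a choice made here, since the index is undefined in that case.

```python
def jaccard_index(a, b):
    """Return |a & b| / |a | b| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined.
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(a & b) / len(union)


# Made-up example: 2 shared elements out of 5 distinct elements in total.
example = jaccard_index({0, 1, 2, 3}, {2, 3, 4})
print(example)  # 0.4
```

Calling `jaccard_index(dpfSet, nvrSet)` would be equivalent to the expression in the cell above.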