# Calculating Semantic Relatedness using Wikipedia

* **Armin Sajadi** - Faculty of Computer Science
* **Dr. Evangelos Milios** - Faculty of Computer Science
* **Dr. Vlado Kešelj** - Faculty of Computer Science
* **Dr. Jeannette C.M. Janssen** - Mathematics & Statistics

This is a simple, step-by-step explanation of calculating semantic relatedness using Wikipedia. We start with preprocessing and building the API; the approach is explained in the following papers:

* Armin Sajadi, Evangelos E. Milios, Vlado Keselj, Jeannette C. M. Janssen, "Domain-Specific Semantic Relatedness from Wikipedia Structure: A Case Study in Biomedical Text", CICLing (1) 2015: 347-360 [(bib)](http://dblp.uni-trier.de/rec/bibtex/conf/cicling/SajadiMKJ15) [(pdf)](http://link.springer.com/chapter/10.1007%2F978-3-319-18111-0_26)

* Armin Sajadi, "Graph-Based Domain-Specific Semantic Relatedness from Wikipedia", Canadian AI 2014, LNAI 8436, pp. 381–386, 2014 [(bib)](../resrc/caai14.bib) [(pdf)](http://link.springer.com/chapter/10.1007%2F978-3-319-06483-3_42#)

### Public Resources
* Web service: (http://ares.research.cs.dal.ca/~sajadi/wikisim)
* Source Code: (https://github.com/asajadi/wikisim)




# Preprocessing

The first step is to download the Wikipedia database dumps and import them into MySQL. We preprocess the SQL dumps for several reasons:

* The tables are huge and contain many columns and rows we do not use. Removing the unnecessary information, which includes unused columns (such as timestamps and view counts of pages or categories) and all the information about talk pages, media files or user draft pages, can dramatically decrease the size of the tables.

* Forming **synonym rings**. We extend the concept of a synonym ring to Wikipedia (similar to what is called a synset in WordNet). In Wikipedia, a redirection stands for equivalence, for example Car --> Automobile. But it is not always this easy, and you can find all sorts of weird redirections, like:

![](../resrc/sr.jpg)

   We iterate through the redirections and remove cycles, dangling redirections and all chains. This process forms clusters of redirections around main pages. Then we go through all other tables (pagelinks and categorylinks) and replace any redirected page with its main article. The result is much neater and makes the rest of the process faster.


* We remove garbage: links to non-existing pages, self links, mismatching namespaces, and many other inconsistencies (you can find the details in the source code).

* We apply some strategic changes. For example, instead of the source id --> destination title format of pagelinks, we use source id --> destination id, which is faster and preferable for our case.
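
The redirect cleanup and link normalization described above can be sketched in a few lines of Python. This is only an illustration on toy data, not the Java parser used in the project: the function names `resolve_redirects` and `clean_links`, the title-keyed dictionaries and the sample pages are all assumptions made for the example.

```python
def resolve_redirects(redirects, main_pages):
    """Map every redirect page to the main article it finally reaches.

    redirects:  dict {redirect_title: target_title} (hypothetical toy input)
    main_pages: set of titles that are real articles
    Redirects that loop or end at a missing page are dropped.
    """
    resolved = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until it leaves the redirect table or repeats a page.
        while target in redirects and target not in seen:
            seen.add(target)
            target = redirects[target]
        if target in main_pages:
            resolved[start] = target
    return resolved


def clean_links(links, resolved, main_pages):
    """Replace redirected endpoints by their main article and drop
    self links and links to pages that do not exist."""
    cleaned = set()
    for src, dst in links:
        src = resolved.get(src, src)
        dst = resolved.get(dst, dst)
        if src in main_pages and dst in main_pages and src != dst:
            cleaned.add((src, dst))
    return cleaned


main_pages = {"Automobile", "Engine"}
redirects = {
    "Car": "Automobile",         # plain redirect
    "Cars": "Car",               # chain: Cars -> Car -> Automobile
    "A": "B", "B": "A",          # cycle
    "Motorcar": "Missing_page",  # dangling redirect
}
resolved = resolve_redirects(redirects, main_pages)
links = [("Engine", "Cars"), ("Car", "Automobile"), ("Engine", "Missing_page")]
cleaned = clean_links(links, resolved, main_pages)
print(resolved)
print(cleaned)
```

In this toy input, `Car` and `Cars` both collapse onto `Automobile`, the `A`/`B` cycle and the dangling `Motorcar` redirect are discarded, and only the link from `Engine` to `Automobile` survives.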

To complete this step, download and run the parser (written in Java) that prunes these files. You can run the following cells, but due to a known bug in IPython, you cannot see bash progress messages until the job is finished. A better option is to simply run the script [preparation_scripts/preprocess.sh](preparation_scripts/preprocess.sh) from bash and skip the rest of this section.

## Downloading
The files will be downloaded to the default `~/Downloads/wikidumps` directory.

In [None]:
%%system
#Downloading the datasets, it might take a while, and make sure the destination exists

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz            \
	-P ~/Downloads/wikidumps
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pagelinks.sql.gz      \
	 -P ~/Downloads/wikidumps
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz       \
	 -P ~/Downloads/wikidumps
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz       \
	-P ~/Downloads/wikidumps
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz \
	-P ~/Downloads/wikidumps

## Uncompress dump files

In [None]:
%%system
gunzip -c ~/Downloads/wikidumps/enwiki-latest-page.sql.gz		\
		> ~/Downloads/wikidumps/enwiki-latest-page.sql 		
gunzip -c ~/Downloads/wikidumps/enwiki-latest-pagelinks.sql.gz 	\
		> ~/Downloads/wikidumps/enwiki-latest-pagelinks.sql	
gunzip -c ~/Downloads/wikidumps/enwiki-latest-redirect.sql.gz		\
		> ~/Downloads/wikidumps/enwiki-latest-redirect.sql	
gunzip -c ~/Downloads/wikidumps/enwiki-latest-category.sql.gz		\
		> ~/Downloads/wikidumps/enwiki-latest-category.sql	
gunzip -c ~/Downloads/wikidumps/enwiki-latest-categorylinks.sql.gz	\
		> ~/Downloads/wikidumps/enwiki-latest-categorylinks.sql	

## Preprocessing
The following Java program does the preprocessing and creates the processed tables (ending in `main.sql`) and several log files of the errors.

*Note*: you might need to recompile (`javac ProcessSQLDumps.java`) 

%%system
java ProcessSQLDumps ~/Downloads/wikidumps

## Importing to mysql
Running the following cell sets some MySQL variables for maximum performance (if you have enough physical memory) and starts the import. Replace `<user>` and `<pass>` with the actual MySQL user name and password.

In [None]:
%%system

mysql -u <user> -p<pass> -e 'set global key_buffer_size=4G;'
mysql -u <user> -p<pass> -e 'set global bulk_insert_buffer_size=1G;'
mysql -u <user> -p<pass> -e 'set global query_cache_size = 4G;'
mysql -u <user> -p<pass> -e 'set global query_cache_limit = 4G;'
mysql -u <user> -p<pass> -e 'set global tmp_table_size = 4G;'

mysql -u <user> -p<pass> -e 'CREATE SCHEMA `enwikilast` DEFAULT CHARACTER SET binary;'
./importall  ~/Downloads/wikidumps last <MYSQLROOTPASSWORD>

# Wikipedia Interface
This is the main interface to the Wikipedia database. Given a page, it provides basic functions that return its:
* id or title
* synonym ring
* linkage
* in or out neighborhood. 
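
As a rough illustration of the neighborhood representation, the sketch below builds the vector of ids and the edge list for a toy link table held in memory. It assumes a plain list of `(from_id, to_id)` pairs instead of the MySQL `pagelinks` table; the helper `toy_neighbors` and the sample ids are hypothetical.

```python
DIR_IN, DIR_OUT = 0, 1   # same values as the constants in wikipedia.py


def toy_neighbors(links, wid, direction):
    """Return (ids, edges) for the in- or out-neighborhood of wid.

    ids:   the page itself followed by its neighbors
    edges: [[row, col], ...] using positions in ids, not Wikipedia ids
    """
    if direction == DIR_IN:
        linked = [src for src, dst in links if dst == wid]
    else:
        linked = [dst for src, dst in links if src == wid]
    # The page itself comes first, then its neighbors without duplicates.
    ids = []
    for page in [wid] + linked:
        if page not in ids:
            ids.append(page)
    id2row = {page: row for row, page in enumerate(ids)}
    # Keep only the links whose two ends both lie inside the neighborhood.
    edges = [[id2row[s], id2row[d]] for s, d in links
             if s in id2row and d in id2row]
    return ids, edges


links = [(10, 20), (10, 30), (20, 30), (30, 10), (40, 10)]
out_ids, out_edges = toy_neighbors(links, 10, DIR_OUT)
in_ids, in_edges = toy_neighbors(links, 10, DIR_IN)
print(out_ids, out_edges)
print(in_ids, in_edges)
```

Renumbering the pages from zero keeps the edge list compact, so it can be turned into a small square adjacency matrix regardless of how large the original Wikipedia ids are.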

In [None]:
%%writefile wikipedia.py 
# uncomment

# A General Class to interact with Wiki datasets
import MySQLdb
import sys;
import os
import scipy as sp
from collections import defaultdict
import cPickle as pickle

from utils import * # uncomment


DIR_IN=0;
DIR_OUT=1;
DIR_BOTH=2;
_db = MySQLdb.connect(host="127.0.0.1",port=3306,user='root',passwd="emilios",db="enwiki20140102")
_cursor = _db.cursor()
WIKI_SIZE = 10216236;

def close():
    global _db, _cursor;
    # close the connection before dropping the references to it
    _cursor.close();
    _db.close();
    _cursor=_db=None;
def reopen():
    global _db, _cursor;
    if _db is None:
        _db = MySQLdb.connect(host="127.0.0.1",port=3306,user='root',passwd="emilios",db="enwiki20140102")
        _cursor = _db.cursor()
        

def id2title(wid):
    """ Returns the title for a given id

    Args: 
        wid: Wikipedia id       
    Returns: 
        The title of the page
    """
    title=None;

    _cursor.execute("""SELECT * FROM `page` where page_id = %s""", (wid,))
    row= _cursor.fetchone();
    if row is not None:
        title=row[2];          
    return title;

def ids2title(wids):
    """ Returns the titles for given list of wikipedia ids 

    Args: 
        wids: A list of Wikipedia ids          
    Returns: 
        The list of titles
    """

    wid_list = [str(wid) for wid in wids] ;
    order = ','.join(['page_id'] + wid_list) ;
    wid_str = ",".join(wid_list)
    query = "SELECT page_title FROM `page` where page_id in ({0}) order by field ({1})" \
    .format(wid_str, order);
    _cursor.execute(query);
    rows = _cursor.fetchall();
    if rows:
        rows = tuple(r[0] for r in rows)
    return rows;


def title2id(title):
    """ Returns the id for a given title

    Args: 
        title: the title of the page (a redirect is resolved to its target)
    Returns: 
        The Wikipedia id of the page
    """        
    wid=None;

    _cursor.execute("""SELECT * FROM `page` where page_title=%s and page_namespace=0""", (title,))
    row= _cursor.fetchone();
    if row is not None:
        wid = getredir_id(row[0]) if row[3] else row[0];
    return wid;

def getredir_id(wid):
    """ Returns the target of a redirected page 

    Args:
        wid: wikipedia id of the page
    Returns:
        The id of the target page
    """
    rid=None

    _cursor.execute("""select * from redirect where rd_from=%s;""", (wid,));
    row= _cursor.fetchone();
    if row is not None:
        rid=row[1]
    return rid 


def getredir_title(wid):
    """ Returns the target title of a redirected page 

    Args:
        wid: wikipedia id of the page
    Returns:
        The title of the target page
    """
    
    title=None;
    _cursor.execute(""" select page_title from redirect INNER JOIN page
                  on redirect.rd_to = page.page_id 
                  where redirect.rd_from =%s;""", (wid,));
    row=_cursor.fetchone()
    if row is not  None:
        title=row[0];
    return title;

def synonymring_titles(wid):
    """ Returns the synonym ring of a page

    Example: synonymring_titles('USA')={('U.S.A', 'US', 'United_States_of_America', ...)}

    Args:
        wid: the wikipedia id
    Returns:
        all the titles in its synonym ring
    """

    tid = getredir_id(wid);
    if tid is not None:
        wid = tid;
    _cursor.execute("""(select page_title from page where page_id=%s) union 
                 (select page_title from redirect INNER JOIN page
                    on redirect.rd_from = page.page_id 
                    where redirect.rd_to =%s);""", (wid,wid));
    rows=_cursor.fetchall();
    if rows:
        rows = tuple(r[0] for r in rows)
    return rows;

def _getlinkedpages_query(id, direction):
    query="(SELECT {0} as lid FROM pagelinks where ({1} = {2}))"
    if direction == DIR_IN:
        query=query.format("pl_from","pl_to",id);
    elif direction == DIR_OUT:
        query=query.format("pl_to","pl_from",id);
    return query;

def getlinkedpages(wid,direction):
    """ Returns the linkage for a node

    Args:
        id: the wikipedia id
        direction: 0 for in, 1 for out, 2 for all
    Returns:
        The list of the ids of the linked pages
    """
    _cursor.execute(_getlinkedpages_query(wid, direction));
    rows =_cursor.fetchall()
    if rows:
        rows = tuple(r[0] for r in rows)
    return rows

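def e2i(*iters):
    """ Maps the ids found in several iterables to consecutive indices

    Args:
        iters: one or more iterables of Wikipedia ids
    Returns:
        The list of unique ids and a dict from each id to its index in that list
    """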
    elist=[];
    edict=dict();
    last=0;    
    for wid in itertools.chain(*iters):
        if wid not in edict:
            edict[wid]=last;
            elist.append(wid);
            last +=1; 
    return elist, edict;

def getneighbors(wid, direction):
    """ Returns the neighborhood for a node

    Args:
        id: the wikipedia id
        direction: 0 for in, 1 for out, 2 for all
    Returns:
        The vector of ids, and the 2d array sparse representation of the graph, in the form of
        array([[row1,col1],[row2, col2]]). This form is flexible for general use or be converted to scipy.sparse 
        formats
    """
    log('getneighbors started, wid = %s, direction = %s', wid, direction)
    if id2title(wid) is None:
        return (), sp.array([])
    
    idsquery = """(select  {0} as lid) union {1}""".format(wid,_getlinkedpages_query(wid,direction));

    _cursor.execute(idsquery);
    sys.stdout.flush()


    rows = _cursor.fetchall();
    neighids = tuple(r[0] for r in rows);
    
    id2row = dict(zip(neighids, range(len(neighids))))
    sys.stdout.flush()

    neighbquery=  """select lid,pl_to as n_l_to from
                     ({0}) a  inner join
                     pagelinks on lid=pl_from""".format(idsquery);

    links=_cursor.execute(neighbquery);
    sys.stdout.flush()

    links = _cursor.fetchall();
    
    #links = tuple((id2row(u), id2row(v)) for u, v in links if (u in id2row) and (v in id2row));
    links = sp.array([[id2row[u], id2row[v]] for u, v in links if (u in id2row) and (v in id2row)]);
    sys.stdout.flush()
    log('getneighbors finished')
    return (neighids,links)

def clearcache():
    _cursor.execute("delete  from pagelinksorderedin");
    _cursor.execute("delete  from pagelinksorderedout");

def checkcache(wid, direction):
    log('checkcache started, wid = %s, direction = %s', wid, direction)
    
    em=None
    
    if direction == DIR_IN: 
        tablename = 'pagelinksorderedin';
        colname = 'in_neighb'
    elif direction == DIR_OUT: 
        tablename = 'pagelinksorderedout';
        colname = 'out_neighb';
    query =    """select {0} from {1} where cache_id={2}""".format(colname, tablename, wid)
    _cursor.execute(query);
    row = _cursor.fetchone();
    if row is not None:
        em=defaultdict(int, pickle.loads(row[0]))
    log('checkcache finished')
    return em


def cachescores(wid, em, direction):
    log('cachescores started, wid = %s, direction = %s', wid, direction)

    if direction == DIR_IN: 
        tablename = 'pagelinksorderedin';
        colname = 'in_neighb'

    elif direction == DIR_OUT: 
        tablename = 'pagelinksorderedout';
        colname = 'out_neighb';
        
    idscstr=pickle.dumps(em, pickle.HIGHEST_PROTOCOL);
    _cursor.execute("""insert into %s values (%s,'%s');""" %(tablename, wid, _db.escape_string(idscstr)));
    
    
    log('cachescores finished')


# Utils
Some small helper functions for reporting purposes.

In [None]:
%%writefile utils.py 
# uncomment
import itertools
import scipy as sp
import os

import datetime

def readds(url):    
    data = sp.genfromtxt(url, dtype=None)
    return data

def logres(outfile, instr, *params):
    outstr = instr % params;
    with open(outfile, 'a') as f:
        f.write(str(datetime.datetime.now()) + "\t" + outstr + '\n');          
        
def log(instr, *params):
    logres(logfile, instr, *params)

logfile='log.txt';
if not os.path.exists(logfile):
    log('log created') 
    os.chmod(logfile, 0o777)    
    
def timeformat(sec):
    return datetime.timedelta(seconds=sec)
    

# Fast [Reversed] Pagerank Implementation

Here we have the actual implementation of PageRank. Two implementations are provided, both inspired by the sparse fast solutions given in **Cleve Moler**'s book, [*Numerical Computing with MATLAB*](http://www.mathworks.com/moler/index_ncm.html). The power method is much faster, with enough precision for our task. Our benchmarks show that this implementation is orders of magnitude faster than the networkx implementation.

The input is a 2D array in which each row is an edge of the graph, e.g. [[a,b], [c,d]], where a and b are node numbers. 
(To calculate the real PageRank instead of the reversed one, call the functions with `reverse=False`, which transposes the adjacency matrix.)
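
To make the input format and the power method concrete, here is a minimal pure-Python sketch that works directly on the edge list. It is not the sparse scipy implementation of this notebook: the names `pagerank_edges` and `reversed_pagerank_edges`, the tolerance, and the uniform redistribution of rank from nodes without out-links are assumptions chosen for readability.

```python
def pagerank_edges(edges, n, p=0.85, tol=1e-10, max_iter=200):
    """Power-method PageRank for edges [[src, dst], ...] over nodes 0..n-1."""
    out = [[] for _ in range(n)]
    for src, dst in edges:
        out[src].append(dst)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(x[i] for i in range(n) if not out[i])
        new = [(1.0 - p) / n + p * dangling / n] * n
        for src in range(n):
            for dst in out[src]:
                new[dst] += p * x[src] / len(out[src])
        delta = sum(abs(a - b) for a, b in zip(new, x))
        x = new
        if delta < tol:
            break
    return x


def reversed_pagerank_edges(edges, n, **kwargs):
    """Reversed PageRank: ordinary PageRank with every edge flipped."""
    return pagerank_edges([[dst, src] for src, dst in edges], n, **kwargs)


edges = [[2, 0], [1, 2], [2, 1]]   # toy graph with three nodes
scores = pagerank_edges(edges, 3)
rscores = reversed_pagerank_edges(edges, 3)
print(scores)
print(rscores)
```

In this toy graph node 2 links to both other nodes, so it receives the highest reversed score, while node 0, which links to nothing, is left with only the teleportation share.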

In [None]:
%%writefile pagerank.py 
# uncomment

# Two implementations of PageRank
import time
import scipy as sp
import scipy.sparse as sprs
import scipy.spatial
import scipy.sparse.linalg 
#from scipy.sparse.linalg import spsolve
#import networkx as nx
#import numpy as np;
#example 1


from utils import * # uncomment

def create_csr(Z):
    """ Creates a csr representation from the 2d array representation of a graph

    Args:
        Z: input graph in the form of a 2d array, such as [[2,0], [1,2], [2,1]]
    Returns:
        The adjacency matrix of the graph as a scipy.sparse csr matrix
    
    each row of the array is an edge of the graph [[a,b], [c,d]], a and b are the node numbers. 

    """   
    rows = Z[:,0];
    cols = Z[:,1];
    n = max(max(rows), max(cols))+1;
    G=sprs.csr_matrix((sp.ones(rows.shape),(rows,cols)), shape=(n,n));
    return G

def pagerank_sparse(G, p=0.85, personalize=None, reverse=False):
    """ Calculates pagerank given a csr graph
    
    Args:
        G: a csr graph.
        p: damping factor
        personalize: if not None, should be an array with the size of the nodes
                    containing probability distributions. It will be normalized automatically
        reverse: If true, returns the reversed-pagerank 
        
    Returns:
        Pagerank Scores for the nodes
     
    """
    log('pagerank_sparse started')

    if not reverse:
        G=G.T;

    n,n=G.shape
    c=sp.asarray(G.sum(axis=0)).reshape(-1)
    r=sp.asarray(G.sum(axis=1)).reshape(-1)

    k=c.nonzero()[0]

    D=sprs.csr_matrix((1/c[k],(k,k)),shape=(n,n))

    if personalize is None:
        e=sp.ones((n,1))
    else:
        e = personalize/sum(personalize);
        
    I=sprs.eye(n)
    X1 = sprs.linalg.spsolve((I - p*G.dot(D)), e);

    X1=X1/sum(X1)
    log('pagerank_sparse finished')
    return X1
def pagerank_sparse_power(G, p=0.85, max_iter = 100, personalize=None, reverse=False):
    """ Calculates pagerank given a csr graph
    
    Args:
        G: a csr graph.
        p: damping factor
        max_iter: maximum number of iterations
        personalize: if not None, should be an array with the size of the nodes
                    containing probability distributions. It will be normalized automatically
        reverse: If true, returns the reversed-pagerank 
        
    Returns:
        Pagerank Scores for the nodes
     
    """
    log('pagerank_sparse_power started')
    
    if not reverse: 
        G=G.T;

    n,n=G.shape
    c=sp.asarray(G.sum(axis=0)).reshape(-1)
    r=sp.asarray(G.sum(axis=1)).reshape(-1)

    k=c.nonzero()[0]

    D=sprs.csr_matrix((1/c[k],(k,k)),shape=(n,n))

    if personalize is None:
        e=sp.ones((n,1))
    else:
        e = personalize/sum(personalize);
        
    z = (((1-p)*(c!=0) + (c==0))/n)[sp.newaxis,:]
    G = p*G.dot(D)
    x = e/n
    oldx = sp.zeros((n,1));
    
    iteration = 0
    start = time.time()    
    while sp.linalg.norm(x-oldx) > 0.01:
        oldx = x
        x = G.dot(x) + e.dot(z.dot(x))
        iteration += 1
        if iteration >= max_iter:
            break;
    x = x/sum(x)
    
    log('pagerank_sparse_power finished')
    return x.reshape(-1) 



## Calculating Semantic Relatedness
The idea is to get the neighborhood graph for each concept and calculate the similarity by embedding each graph into a vector and then computing the cosine similarity between the vectors. 

The process can be illustrated like this:
![](../resrc/alg.jpg)
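
The final comparison step can be sketched as follows, under the assumption that each concept embedding is a sparse vector stored as a dictionary from neighbor id to score. The function `cosine_similarity` and the toy embeddings are hypothetical and only illustrate the idea; they are not the scipy-based code of this notebook.

```python
import math


def cosine_similarity(em1, em2):
    """Cosine similarity of two sparse vectors stored as {node_id: score}."""
    # Ids missing from one embedding count as zero in that embedding.
    dot = sum(score * em2.get(node, 0.0) for node, score in em1.items())
    norm1 = math.sqrt(sum(s * s for s in em1.values()))
    norm2 = math.sqrt(sum(s * s for s in em2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


em_a = {1: 0.5, 2: 0.3, 3: 0.2}   # toy embedding of concept A
em_b = {2: 0.6, 3: 0.1, 4: 0.3}   # shares neighbors 2 and 3 with A
em_c = {5: 1.0}                   # shares nothing with A

print(cosine_similarity(em_a, em_b))
print(cosine_similarity(em_a, em_c))
```

Two concepts with disjoint neighborhoods get a similarity of zero, and the score grows as they share more highly ranked neighbors.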

In [None]:
%%writefile calcsim.py 
# uncomment
from __future__ import division

#Calculating Relatedness
from wikipedia import * # uncomment
from pagerank import * # uncomment
from utils import * # uncomment

from collections import defaultdict
from scipy import stats
import json
import math

def _unify_ids_scores(*id_sc_tuple):
    uids, id2in = e2i(*(ids for ids, _ in id_sc_tuple));
    
    uscs=tuple();            
    for ids,scs in id_sc_tuple:
        scs_u=sp.zeros(len(id2in))
        scs_u[[id2in[wid] for wid in ids]] = scs;            
        uscs += (scs_u,)                
    return uids, uscs       


def concept_embedding(wid, direction):
    """ Calculates concept embedding to be used in relatedness
    
    Args:
        wid: wikipedia id
        direction: 0 for in, 1 for out, 2 for all
        
    Returns:
        A dict mapping the neighbor ids to their scores, or None if no neighborhood is found
        
    """
    log('concept_embedding started, wid = %s, direction = %s', wid, direction)

    if direction == DIR_IN or direction==DIR_OUT:
        em = _concept_embedding_io(wid, direction)
    if direction == DIR_BOTH:
        em = _concept_embedding_both(wid, direction)
    log('concept_embedding finished')
    return em
    
def _concept_embedding_io(wid, direction):
    cached_em = checkcache(wid, direction);
    if cached_em is not None:
        log('found in cache, wid = %s, direction = %s', wid, direction)
        return cached_em;

    (ids, links) = getneighbors(wid, direction);
    if not ids:
        return None;
    scores = pagerank_sparse(create_csr(links), reverse=True)
     
    em=defaultdict(int,zip(ids, scores));    
    cachescores(wid, em, direction);
    return em
            

def _concept_embedding_both(wid, direction):            
        in_em = _concept_embedding_io(wid, DIR_IN);
        out_em = _concept_embedding_io(wid, DIR_OUT )
        if (in_em is None) or (out_em is None):
            return None;
        
        ids=list(set(in_em.keys()).union(out_em.keys()))
        in_sc=[in_em[wid] for wid in ids]
        out_sc=[out_em[wid] for wid in ids]               
        scores=([(x+y)/2 for x,y in zip(in_sc, out_sc)])

        return defaultdict(int,zip(ids, scores))

def getsim_wlm(id1, id2):
    """ Calculates wlm (ngd) similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_IN))
    in2 = set(getlinkedpages(id2, DIR_IN))
    f1 = len(in1)
    f2 = len(in2)
    f12=len(in1.intersection(in2))
    # check for empty sets first, otherwise log(0) would be evaluated
    if (f1==0) or (f2==0) or (f12==0):
        return 0;
    dist = (sp.log(max(f1,f2))-sp.log(f12))/(sp.log(WIKI_SIZE)-sp.log(min(f1,f2)));
    sim = 1-dist if dist <=1 else 0
    return sim

def getsim_cocit(id1, id2):
    """ Calculates co-citation similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_IN))
    in2 = set(getlinkedpages(id2, DIR_IN))
    f1 = len(in1)
    f2 = len(in2)
    if (f1==0) or (f2==0):
        return 0;
    
    f12=len(in1.intersection(in2))
    sim = (f12)/(f1+f2-f12);
    return sim


def getsim_coup(id1, id2):
    """ Calculates coupling similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_OUT))
    in2 = set(getlinkedpages(id2, DIR_OUT))
    f1 = len(in1)
    f2 = len(in2)
    if (f1==0) or (f2==0):
        return 0;
    
    f12=len(in1.intersection(in2))
    sim = (f12)/(f1+f2-f12);
    return sim

def getsim_ams(id1, id2):
    """ Calculates Amsler similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_IN))
    out1 = set(getlinkedpages(id1, DIR_OUT))
    link1 = in1.union(out1)
    
    in2 = set(getlinkedpages(id2, DIR_IN))
    out2 = set(getlinkedpages(id2, DIR_OUT))
    link2 = in2.union(out2)
    
    f1 = len(link1)
    f2 = len(link2)
    if (f1==0) or (f2==0):
        return 0;
    
    f12=len(link1.intersection(link2))
    sim = (f12)/(f1+f2-f12);
    return sim




    
    
def getsim_emb(id1,id2, direction):
    """ Calculates the similarity between two concepts
    Arg:
        id1, id2: the two concepts
        direction: 0 for in, 1 for out, 2 for all
        
    Returns:
        The similarity score
    """
    em1 = concept_embedding(id1, direction);
    em2 = concept_embedding(id2, direction);
    if (em1 is None) or (em2 is None):
        return None;
    
    ids=list(set(em1.keys()).union(em2.keys()))
    sc1=[em1[wid] for wid in ids]
    sc2=[em2[wid] for wid in ids]               
    
    return 1-sp.spatial.distance.cosine(sp.array(sc1),sp.array(sc2));

def getsim(id1,id2, method, direction=None):
    """ Calculates well-known similarity metrics between two concepts 
    Arg:
        id1, id2: the two concepts 
        method:
            wlm: Wikipedia-Miner method
            cocit: cocitation
            coup: coupling
            ams: amsler
            rvspagerank: embedding based similarity (in our case, 
                 reversed-page rank method)
    Returns:
        The similarity score        
    """
    if method=='rvspagerank':
        return getsim_emb(id1,id2, direction)
    if method=='wlm':
        return getsim_wlm(id1,id2)
    if method=='cocit':
        return getsim_cocit(id1,id2)
    if method=='coup':
        return getsim_coup(id1,id2)
    if method=='ams':
        return getsim_ams(id1,id2)

    
def getsim_file(infilename, outfilename, method='rvspagerank', direction=None):
    """ Batched (file) similarity.
    
    Args: 
        infilename: tsv file in the format of pair1    pair2   [goldstandard]
        outfilename: tsv file in the format of pair1    pair2   similarity
        direction: 0 for in, 1 for out, 2 for all
    Returns:
        vector of scores, and Spearmans's correlation if goldstandard is given
    """
    log('getsim_file started: %s -> %s', infilename, outfilename)
    outfile = open(outfilename, 'w');
    dsdata=readds(infilename);
    gs=[];
    scores=[];
    #scores=[1-spatial.distance.cosine(vectors[row[0]],vectors[row[1]]) if (row[0] in vectors) and  (row[1] in vectors) else 0 for row in dsdata]
    spcorr=None;
    for row in dsdata:   
        log('processing %s, %s', row[0], row[1])
        if (row[0]=='None') or (row[1]=='None'):
            continue;
        if len(row)>2: 
            gs.append(row[2]);
            
        wid1 = title2id(row[0])
        wid2 = title2id(row[1])
        if (wid1 is None) or (wid2 is None):
            sim=0;
        else:
            sim=getsim(wid1, wid2, method, direction);
        outfile.write("\t".join([str(row[0]), str(row[1]), str(sim)])+'\n')
        scores.append(sim)
    outfile.close();
    if gs:
        spcorr = sp.stats.spearmanr(scores, gs);
    log('getsim_file finished')
    return scores, spcorr

def conceptrep(wid, direction, get_titles=True, cutoff=None):
    """ Finds a representation for a concept
    
        Concept Representation is a vector of concepts with their score
    Arg:
        wid: Wikipedia id
        direction: 0 for in, 1 for out, 2 for all
        get_titles: include titles in the embedding (not needed for mere calculations)
        cutoff: the first top cutoff dimensions (None for all)
        
    Returns:
        A dict mapping each id to its score, or to a (title, score) pair if
        get_titles is True. Returns None if no embedding is found.
    """
    
    log('conceptrep started, wid = %s, direction = %s', wid, direction)
    
    em=concept_embedding(wid, direction);    
    if em is None:
        return None;
    ids = em.keys();
    if cutoff is not None:
        ids = sorted(em.keys(), key=lambda k: em[k], reverse=True)
        ids=ids[:cutoff]
        em=defaultdict(int, {wid:em[wid] for wid in ids})
        
    if get_titles:
        em=defaultdict(int, {wid:(title, em[wid]) for wid,title in zip(ids,ids2title(ids))})
    log ('conceptrep finished')
    return em
    

def getembed_file(infilename, outfilename, direction, get_titles=False, cutoff=None):
    """ Batched (file) concept representation.
    
    Args: 
        infilename: tsv file whose first column holds the page titles
        outfilename: tsv file in the format of title    embedding (as json)
        direction: 0 for in, 1 for out, 2 for all
        get_titles: include titles in the embedding (not needed for mere calculations)
        cutoff: the first top cutoff dimensions (None for all)        

    """
    
    log('getembed_file started: %s -> %s', infilename, outfilename)
    outfile = open(outfilename, 'w');
    dsdata=readds(infilename);
    scores=[];
    for row in dsdata:        
        wid = title2id(row[0])
        if wid is None:
            em='';
        else:
            em=conceptrep(wid, direction, get_titles, cutoff)
        outfile.write(row[0]+"\t"+json.dumps(em)+"\n")
    outfile.close();
    log('getembed_file finished')



In [None]:
%load_ext autoreload
%autoreload 2
#%aimport calcsim

%aimport wikipedia

from wikipedia import * # uncomment
from calcsim import *   # uncomment
# Examples
reopen()
direction = DIR_OUT

page_title1 = 'Abortion' 
print ('page_title: ', page_title1)

page_id1 = title2id(page_title1)
print ("id: ", page_id1)

sr1 = synonymring_titles(page_id1)
print ("synonym ring: %s\n " % str(sr1[:5]))

rep1=conceptrep(page_id1, direction,  get_titles=True, cutoff=5)
print ("Concept Representation:  %s\n" % json.dumps(rep1))

print ("\n")

page_title2 = 'Miscarriage' 
print ('page_title: ', page_title2)

page_id2 = title2id(page_title2)
print ("id: ", page_id2)

sr2 = synonymring_titles(page_id2)
print ("synonym ring: %s\n " % str(sr2[:5]))

rep2=conceptrep(page_id2, direction,  get_titles=True, cutoff=5)
print ("Concept Representation: %s\n" % json.dumps(rep2))



sim = getsim(page_id1, page_id2,'rvspagerank',DIR_IN)
print ("similarity", sim)



In [None]:
%load_ext autoreload
%autoreload

from calcsim import *

import json
from IPython.display import Javascript

cre1 = conceptrep(title2id('Tehran'), DIR_OUT, get_titles=True, cutoff=200);
cre2 = conceptrep(title2id('Sanandaj'), DIR_OUT, get_titles=True, cutoff=200);


#runs arbitrary javascript, client-side
Javascript("""
           window.vizObj1={};window.vizObj2={};
           """.format(json.dumps(cre1), json.dumps(cre2)))


In [None]:
%%javascript

require.config({
    paths: {
        d3:'//129.173.212.50/~sajadi/wikisim/js/d3',
        d3_cloud:'//129.173.212.50/~sajadi/wikisim/js/d3.layout.cloud',
        simple_draw:'//129.173.212.50/~sajadi/wikisim/js/simpledraw'

    }
});

In [None]:
%%javascript

function createWords(cp){

    var titles=[];
    var scores=[];

    for (var key in cp){ 
        if (cp.hasOwnProperty(key)) {
            titles.push(cp[key][0])
            scores.push(cp[key][1])
        }
    }
    var sum = scores.reduce(function(a, b) {return a + b;});
    var min = Math.min.apply(null, scores)
    var max = Math.max.apply(null, scores)
    
    scores=scores.map(function(a){return (a/sum)*90+20});
    var words=[];
    for (var i = 0; i<titles.length; i++) {
        words.push({"text":titles[i], "size": scores[i]})
    }
    return words;
}

var words1=createWords(window.vizObj1);
//element.text(JSON.stringify(words1));
var words2=createWords(window.vizObj2);
require(['d3','d3_cloud', 'simple_draw'], function(d3,d3_cloud, simple_draw){
    $("#chart1").remove();
    element.append("<div id='chart1' style='width:49%; height:500px; float:left; border-style:solid'> </div>");
    simpledraw(words1, '#chart1');
    
    $("#chart2").remove();
    element.append("<div id='chart2' style='width:49%; margin-left:2%; height:500px; float:left; border-style:solid'> </div>");
    simpledraw(words2, '#chart2');    
    
});    
    
