# Tutorial NLP for MIR
## Entity Linking and Entity Similarity
We are going to use Elvis, a Python wrapper around several Entity Linking systems. More info at http://github.com/sergiooramas/elvis

To install elvis, run: pip install elvis

In [None]:
import elvis
import pprint
import json

From elvis you can use Babelfy, Tagme and DBpedia Spotlight. To use Babelfy or Tagme you need an API key. To use DBpedia Spotlight you can install a local server.

You can obtain a BabelNet API key at http://babelnet.org/register

and a Tagme API key at https://tagme.d4science.org/tagme/tagme_help.html

DBpedia Spotlight can be downloaded from https://github.com/dbpedia-spotlight/dbpedia-spotlight

The API keys of Tagme and Babelfy can be configured with the methods set_babelfy_key("key") and set_tagme_key("key")

The endpoint of the Spotlight server can be set with the method set_spotlight_endpoint("url")

In [None]:
elvis.set_babelfy_key("")
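
To avoid pasting a key directly into the notebook, it can be read from an environment variable. The following is only a minimal sketch: read_api_key is a hypothetical helper (not part of elvis), and the variable names BABELFY_KEY and TAGME_KEY are assumptions. The commented lines show where the elvis setters described above would be called.

```python
import os

def read_api_key(var_name, environ=None):
    """Return the API key stored in the environment variable var_name.

    Hypothetical helper, not part of elvis.
    Raises ValueError when the variable is missing or empty.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, "").strip()
    if not key:
        raise ValueError("Environment variable %s is not set" % var_name)
    return key

# Assumed variable names; export them in the shell before starting the notebook.
# elvis.set_babelfy_key(read_api_key("BABELFY_KEY"))
# elvis.set_tagme_key(read_api_key("TAGME_KEY"))
```

Failing early with a ValueError makes a missing key visible before any request is sent to the linking service.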

With elvis we can process an entire folder of text files, using the method process_folder(tool, path_to_folder)

The tool can be one of 'babelfy', 'spotlight' or 'tagme'

A tool can also be called directly on an array of sentences with the methods babelfy(sentences), tagme(sentences) and spotlight(sentences)

In [None]:
sentences = []
sentences.append("Madonna Louise Ciccone is an American singer.")

linked_text = elvis.babelfy(sentences)

pp = pprint.PrettyPrinter(indent=2)
pp.pprint(linked_text)

We are going to process a folder. The path of the folder should be absolute.

We are going to process a dataset of 262 artist biographies from the Semantic Artist Similarity dataset http://mtg.upf.edu/download/datasets/semantic-similarity

The method process_folder creates a directory "entities/tool/dataset_name"; in this case it will create the folder "entities/babelfy/mirex_biographies".

In [None]:
tutorial_folder = '/Users/soramas/dev/nlp-tutorial/'
dataset_folder = tutorial_folder + 'sas_dataset/mirex_biographies'
output_folder = tutorial_folder + 'entities/mirex_biographies/babelfy'

# To use this method you need the English tokenizer of NLTK. You can install it from the Python interpreter:
# import nltk
# nltk.download()
elvis.process_folder('babelfy',dataset_folder,output_folder)
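
Since the folder paths should be absolute, it can help to build them with os.path rather than by string concatenation. This is a minimal sketch: build_paths is a hypothetical helper, and the folder layout it produces simply mirrors the one used in the cell above.

```python
import os

def build_paths(tutorial_folder, dataset_name, tool):
    """Return absolute (dataset_folder, output_folder) paths.

    Hypothetical helper mirroring the layout used in this tutorial:
    <tutorial_folder>/sas_dataset/<dataset_name> for the input texts and
    <tutorial_folder>/entities/<dataset_name>/<tool> for the output.
    """
    # expanduser resolves "~", abspath resolves relative paths
    base = os.path.abspath(os.path.expanduser(tutorial_folder))
    dataset_path = os.path.join(base, "sas_dataset", dataset_name)
    output_path = os.path.join(base, "entities", dataset_name, tool)
    return dataset_path, output_path

example_dataset, example_output = build_paths(
    "nlp-tutorial", "mirex_biographies", "babelfy")
print(example_dataset)
print(example_output)
```

The two returned values could then be passed to process_folder in place of the concatenated strings.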

Elvis provides a method to compute the similarity between artists based on the entities identified in their biographies. The method is called compute_similarity(entities_folder) and receives the folder where the extracted entities are located. It returns a numpy similarity matrix and a list of the artists in the matrix. A second method, top_n(similarity_matrix, artists_index, n), then returns the n most similar artists for every artist.

In [None]:
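entities_folder = tutorial_folder + 'entities/mirex_biographies/babelfy'
# compute_similarity returns the similarity matrix and the list of artists that indexes it
similarity_matrix, artists_index = elvis.compute_similarity(entities_folder)
# For every artist, keep the 5 most similar artists
top = elvis.top_n(similarity_matrix, artists_index, n=5)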
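
To give an intuition of what entity-based similarity means, here is a small self-contained sketch. It is not the measure elvis implements (this tutorial does not specify it); it swaps in plain Jaccard overlap between sets of entities, and the function names and toy entity sets are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of entity identifiers."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def jaccard_matrix(entities_by_artist):
    """Return (matrix, index): pairwise Jaccard scores as nested lists,
    plus the sorted artist ids that label the rows and columns."""
    index = sorted(entities_by_artist)
    matrix = [[jaccard(entities_by_artist[a], entities_by_artist[b])
               for b in index] for a in index]
    return matrix, index

def top_similar(matrix, index, n):
    """For each row, return the n most similar other artists, best first."""
    result = []
    for i, row in enumerate(matrix):
        others = [j for j in range(len(index)) if j != i]
        others.sort(key=lambda j: row[j], reverse=True)
        result.append([index[j] for j in others[:n]])
    return result

# Toy, made-up entity sets (not taken from the dataset).
toy = {
    "artist_a": {"Rock_music", "Guitar", "London"},
    "artist_b": {"Rock_music", "Guitar", "New_York"},
    "artist_c": {"Jazz", "Saxophone", "New_York"},
}
toy_matrix, toy_index = jaccard_matrix(toy)
print(top_similar(toy_matrix, toy_index, n=1))
# prints [['artist_b'], ['artist_a'], ['artist_b']]
```

Artists sharing more linked entities end up closer to each other, which is the idea behind ranking artists by the entities found in their biographies.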

The filenames of the original text files are used as entity names. In this dataset each text filename corresponds to the MusicBrainz ID (MBID) of the artist. The dataset comes with a mapping from MusicBrainz IDs to artist names, so we create a dictionary to convert MBIDs into artist names.

In [None]:
mbid2name = dict()
# Each line holds three tab-separated fields: MBID, artist name and URI
with open(tutorial_folder + 'sas_dataset/mb2uri_mirex.tsv') as f:
    for line in f:
        mbid, name, uri = line.strip().split('\t')
        mbid2name[mbid] = name
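
The loop above raises a ValueError on any line that does not contain exactly three tab-separated fields. A more tolerant variant can be sketched with the csv module. Both the helper name load_mbid_names and the sample rows below are made up for illustration; only the three-column layout (MBID, name, URI) comes from the cell above.

```python
import csv
import io

def load_mbid_names(tsv_file):
    """Map the first column (MBID) to the second column (artist name).

    Rows with fewer than two columns are skipped instead of raising.
    """
    mapping = {}
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping

# Made-up rows in the assumed three-column layout: mbid, name, uri.
sample = io.StringIO(
    u"mbid-0001\tExample Artist\thttp://example.org/a\n"
    u"malformed line without tabs\n"
    u"mbid-0002\tAnother Artist\thttp://example.org/b\n"
)
names = load_mbid_names(sample)
print(names["mbid-0002"])
```

With a real file, the function would be called on the object returned by open() instead of the in-memory sample.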

Using this dictionary and the top_n list of most similar artists, we print the top 5 most similar artists for a subset of the dataset (the first 15 artists).

In [None]:
for index, similars in enumerate(top[:15]):
    # print() with a single formatted string works in both Python 2 and Python 3
    print("%s: %s" % (mbid2name[artists_index[index]], ", ".join(mbid2name[s] for s in similars)))


### Output Homogenizing
Elvis provides a method to homogenize the output of different Entity Linking systems and to add some missing semantic information from DBpedia. The method is called homogenize(tool, entities_folder). The entities to be homogenized should be in the folder entities_folder + tool. It takes some time, but the process can be sped up by storing a local copy of the DBpedia files used during homogenization.

In [None]:
dataset_folder = tutorial_folder + 'sample_text/'
output_folder = tutorial_folder + 'entities/sample_text/babelfy/'
elvis.process_folder('babelfy',dataset_folder,output_folder)

In [None]:
all_entities_folder = tutorial_folder + 'entities/sample_text/'
elvis.homogenize('babelfy',all_entities_folder)

In [None]:
with open(all_entities_folder + "babelfy_h/madonna.json") as f:
    homogenized = json.load(f)
pp.pprint(homogenized[0]['entities'])