# List datasets

In [1]:
import requests
from collections import namedtuple

In [2]:
p = requests.get('http://datahub.io/api/rest/dataset')

In [3]:
p.status_code

200

In [4]:
lis = p.json()

In [5]:
lis[:10]

[u'1033-prog',
 u'10leading-discharges-alive-and-dead-by-principal-diagnosis-type-of-hospital-nationality-and-sex-2009',
 u'12',
 u'13',
 u'1790-2010-historical-population-york-county-virginia',
 u'1855spanishrailways',
 u'1902-norfolk-virginia-geopdf',
 u'1921-newport-news-virginia-geopdf',
 u'1944-norfolk-south-virginia-geopdf',
 u'1948-norfolk-south-virginia-geopdf']

Filter for ieee

In [6]:
[x for x in lis if 'ieee' in x]

[u'rkb-explorer-ieee', u'twc-ieeevis']

In [7]:
ieee = _[0]

# Get dataset information

In [8]:
name = ieee
p2 = requests.get('http://datahub.io/api/rest/dataset/{}'.format(name))

In [9]:
p2.status_code

200

In [10]:
dataset = p2.json()

In [11]:
dataset.keys()

[u'owner_org',
 u'maintainer',
 u'private',
 u'maintainer_email',
 u'num_tags',
 u'id',
 u'metadata_created',
 u'relationships',
 u'metadata_modified',
 u'author',
 u'author_email',
 u'isopen',
 u'download_url',
 u'state',
 u'version',
 u'license_id',
 u'type',
 u'resources',
 u'num_resources',
 u'tags',
 u'title',
 u'tracking_summary',
 u'groups',
 u'name',
 u'license',
 u'notes_rendered',
 u'url',
 u'ckan_url',
 u'notes',
 u'license_title',
 u'ratings_average',
 u'extras',
 u'ratings_count',
 u'organization',
 u'revision_id']

It may be worth storing these fields so that cached data can be invalidated

In [12]:
dataset[u'metadata_created'], dataset[u'metadata_modified'], dataset['revision_id']

(u'2010-08-23T13:45:00.836912',
 u'2014-11-01T14:23:47.128757',
 u'8c840c52-2231-44ac-8c46-1a7f8c980b87')
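
A minimal sketch of how these fields could drive cache invalidation, assuming a previously fetched copy of the dataset metadata is stored somewhere. The helper `is_stale` and the `cached`/`unchanged`/`changed` dicts are hypothetical names for illustration, not part of the datahub API.

```python
def is_stale(cached, fresh):
    """Return True when the cached metadata no longer matches the fresh copy.

    Both arguments are dicts shaped like the JSON returned for a dataset.
    Hypothetical helper: only 'revision_id' and 'metadata_modified' are compared.
    """
    if cached is None:
        return True  # nothing stored yet, so the data must be fetched
    keys = ('revision_id', 'metadata_modified')
    return any(cached.get(k) != fresh.get(k) for k in keys)


# Values copied from the output above.
cached = {'revision_id': '8c840c52-2231-44ac-8c46-1a7f8c980b87',
          'metadata_modified': '2014-11-01T14:23:47.128757'}
unchanged = dict(cached)
# Made-up revision id, used only to simulate an update.
changed = dict(cached, revision_id='hypothetical-newer-revision')

print(is_stale(cached, unchanged))
print(is_stale(cached, changed))
```

Comparing both fields is a conservative choice: either one changing is treated as a reason to refetch.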

## The ID is another access point

The ID can also be used in http://datahub.io/api/rest/dataset/{}

Apparently, API 3 uses 'id' in relationships, but the REST API does not. It seems safe to use only 'name' on datahub.
To link in RDF, the right choice is the namespace, found in extras/namespace, but I am not sure we have that guarantee (since 'extras' is optional), so we can also use 'url'.

In [13]:
dataset['id']

u'feec8014-10d0-47c1-9f4d-eed33dc68d83'

In [14]:
dataset['name']

u'rkb-explorer-ieee'

In [15]:
dataset['url']

u'http://ieee.rkbexplorer.com'

In [16]:
dataset['extras']['namespace']

u'http://ieee.rkbexplorer.com/id/'
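
The fallback from extras/namespace to 'url' could be sketched as follows. `rdf_base` is a hypothetical helper, and it assumes that 'extras' may be missing or may lack a 'namespace' key, in which case 'url' is used instead.

```python
def rdf_base(dataset):
    """Pick the URI to use when linking a dataset in RDF.

    Prefers extras['namespace'] and falls back to 'url'. Returns None when
    neither is present. Hypothetical helper, not part of the datahub API.
    """
    extras = dataset.get('extras') or {}
    return extras.get('namespace') or dataset.get('url')


# Values copied from the cells above.
with_namespace = {'url': 'http://ieee.rkbexplorer.com',
                  'extras': {'namespace': 'http://ieee.rkbexplorer.com/id/'}}
without_extras = {'url': 'http://ieee.rkbexplorer.com'}

print(rdf_base(with_namespace))
print(rdf_base(without_extras))
```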

## Get links from "Additional Info"

We need to check whether this duplicates the information in 'relationships'

In [17]:
Link = namedtuple('Link', 'name count')
additional_info = dataset['extras']
is_link = lambda x: x.startswith('links:')
link_tuple = lambda k, v: Link(k[6:], int(v))

links = [link_tuple(k, v) for k, v in additional_info.items() if is_link(k)]
links

[Link(name=u'rkb-explorer-oai', count=417),
 Link(name=u'rkb-explorer-citeseer', count=1182),
 Link(name=u'rkb-explorer-southampton', count=7),
 Link(name=u'rkb-explorer-rae2001', count=17),
 Link(name=u'rkb-explorer-acm', count=2949),
 Link(name=u'rkb-explorer-risks', count=3),
 Link(name=u'rkb-explorer-newcastle', count=73),
 Link(name=u'rkb-explorer-nsf', count=1),
 Link(name=u'rkb-explorer-curriculum', count=2),
 Link(name=u'rkb-explorer-eprints', count=643),
 Link(name=u'rkb-explorer-dblp', count=5867),
 Link(name=u'rkb-explorer-wiki', count=9),
 Link(name=u'rkb-explorer-kisti', count=516),
 Link(name=u'rkb-explorer-ibm', count=29),
 Link(name=u'rkb-explorer-ulm', count=5),
 Link(name=u'rkb-explorer-dotac', count=50),
 Link(name=u'rkb-explorer-roma', count=3),
 Link(name=u'rkb-explorer-pisa', count=18),
 Link(name=u'rkb-explorer-resex', count=6),
 Link(name=u'rkb-explorer-laas', count=97)]

## Get links from the relationships properties

In [18]:
Relationship = namedtuple('Relationship', 'comment type object subject')
as_object = dataset['relationships']

relation = lambda x: Relationship(x['comment'], x['type'],
                                  x['object'], x['subject'])
relationships = list(map(relation, as_object))
relationships

[Relationship(comment=u'30', type=u'linked_from', object=u'rkb-explorer-ibm', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'561', type=u'linked_from', object=u'rkb-explorer-kisti', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'6', type=u'linked_from', object=u'rkb-explorer-resex', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'mika_i_zika', type=u'dependency_of', object=u'rkb-explorer-irit', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'1', type=u'links_to', object=u'rkb-explorer-nsf', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'1182', type=u'links_to', object=u'rkb-explorer-citeseer', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'17', type=u'links_to', object=u'rkb-explorer-rae2001', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'18', type=u'links_to', object=u'rkb-explorer-pisa', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'2', type=u'links_to', object=u'rkb-explorer-curriculum', subject=u'rkb-explorer-ieee'),
 

Ignore 'linked_from'

In principle, we can fetch only the relationships of type 'links_to'.
Later we need to check whether other types matter (and understand what 'dependency_of' means).
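
One way to see which relationship types occur, and how often, is to count them. This is only a sketch: `sample` holds a few rows copied from the output above, not the full list.

```python
from collections import Counter, namedtuple

Relationship = namedtuple('Relationship', 'comment type object subject')

# A few rows copied from the output above, used here as sample data.
sample = [
    Relationship('30', 'linked_from', 'rkb-explorer-ibm', 'rkb-explorer-ieee'),
    Relationship('mika_i_zika', 'dependency_of', 'rkb-explorer-irit', 'rkb-explorer-ieee'),
    Relationship('1', 'links_to', 'rkb-explorer-nsf', 'rkb-explorer-ieee'),
    Relationship('17', 'links_to', 'rkb-explorer-rae2001', 'rkb-explorer-ieee'),
]

# Count how many relationships of each type are present.
type_counts = Counter(r.type for r in sample)
print(type_counts)
```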

Getting links_to


In [19]:
[r for r in relationships if r.type == 'links_to']

[Relationship(comment=u'1', type=u'links_to', object=u'rkb-explorer-nsf', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'1182', type=u'links_to', object=u'rkb-explorer-citeseer', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'17', type=u'links_to', object=u'rkb-explorer-rae2001', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'18', type=u'links_to', object=u'rkb-explorer-pisa', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'2', type=u'links_to', object=u'rkb-explorer-curriculum', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'2949', type=u'links_to', object=u'rkb-explorer-acm', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'3', type=u'links_to', object=u'rkb-explorer-roma', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'3', type=u'links_to', object=u'rkb-explorer-risks', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'417', type=u'links_to', object=u'rkb-explorer-oai', subject=u'rkb-explorer-ieee'),
 Relationship(comment=u'5
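
To check whether the 'links:' entries in 'extras' repeat what 'relationships' already says, the two sources can be compared by name and count. The sketch below assumes that the 'comment' of a 'links_to' relationship holds the link count (which is what the output above suggests, but is not confirmed). `compare_links` is a hypothetical helper, and the sample data is a trimmed subset of the values shown earlier.

```python
def compare_links(extras, relationships):
    """Compare 'links:<name>' counts in extras with 'links_to' relationships.

    Returns (only_in_extras, only_in_relationships, mismatched). The first
    two are sorted lists of dataset names; mismatched maps a name to the
    pair (extras_count, relationship_count).
    """
    prefix = 'links:'
    from_extras = {k[len(prefix):]: int(v)
                   for k, v in extras.items() if k.startswith(prefix)}
    from_rels = {}
    for rel in relationships:
        comment = rel.get('comment') or ''
        # Assumption: the count is stored as a string in 'comment'.
        # Non-numeric comments are skipped.
        if rel['type'] == 'links_to' and comment.isdigit():
            from_rels[rel['object']] = int(comment)
    only_extras = sorted(set(from_extras) - set(from_rels))
    only_rels = sorted(set(from_rels) - set(from_extras))
    mismatched = {name: (from_extras[name], from_rels[name])
                  for name in set(from_extras) & set(from_rels)
                  if from_extras[name] != from_rels[name]}
    return only_extras, only_rels, mismatched


# Trimmed sample built from values shown earlier in this notebook.
extras = {'namespace': 'http://ieee.rkbexplorer.com/id/',
          'links:rkb-explorer-nsf': '1',
          'links:rkb-explorer-acm': '2949'}
relationships = [
    {'comment': '1', 'type': 'links_to', 'object': 'rkb-explorer-nsf'},
    {'comment': '2949', 'type': 'links_to', 'object': 'rkb-explorer-acm'},
    {'comment': '30', 'type': 'linked_from', 'object': 'rkb-explorer-ibm'},
]

result = compare_links(extras, relationships)
print(result)
```

If all three parts of the result come back empty for the full data, the 'extras' links would be redundant with the 'links_to' relationships.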

## Get resources

In [20]:
Resource = namedtuple('Resource', 'id description format url')

resource = lambda x: Resource(x['id'], x['description'], 
                              x['format'], x['url'])
resources = list(map(resource, dataset['resources']))
resources

[Resource(id=u'a21ae558-a78c-4de0-aeb3-065dff8b216a', description=u'SPARQL endpoint', format=u'api/sparql', url=u'http://ieee.rkbexplorer.com/sparql/'),
 Resource(id=u'326f75fa-6a3d-4fa7-9f4c-40beb671600a', description=u'XML Sitemap', format=u'meta/sitemap', url=u'http://ieee.rkbexplorer.com/sitemap.xml'),
 Resource(id=u'3614a6bc-96c3-40a8-85e9-c80356a5f4da', description=u'voiD file', format=u'meta/void', url=u'http://ieee.rkbexplorer.com/models/void.ttl'),
 Resource(id=u'79633a5f-b262-4003-900a-240171f31bcc', description=u'Example (RDF/XML)', format=u'example/rdf+xml', url=u'http://ieee.rkbexplorer.com/id/person-21757c2767705194600b55ff6b0ef692-1e427d6bbb6d2bb2aa5434059d6c58f4'),
 Resource(id=u'55544720-70b9-4aae-b055-ef6111515514', description=u'', format=u'application/rdf+xml', url=u'http://ieee.rkbexplorer.com/models/dump.tgz')]

Get VoID

In [21]:
[x for x in resources if x.format == 'meta/void'][0]

Resource(id=u'3614a6bc-96c3-40a8-85e9-c80356a5f4da', description=u'voiD file', format=u'meta/void', url=u'http://ieee.rkbexplorer.com/models/void.ttl')
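
Indexing with `[0]` raises an IndexError when a dataset has no resource in the requested format. A more tolerant lookup could look like the sketch below; `first_with_format` is a hypothetical helper, and the sample list contains two of the resources shown above.

```python
from collections import namedtuple

Resource = namedtuple('Resource', 'id description format url')


def first_with_format(resources, fmt):
    """Return the first resource whose format equals fmt, or None if absent."""
    return next((r for r in resources if r.format == fmt), None)


# Two of the resources listed above, used as sample data.
resources = [
    Resource('a21ae558-a78c-4de0-aeb3-065dff8b216a', 'SPARQL endpoint',
             'api/sparql', 'http://ieee.rkbexplorer.com/sparql/'),
    Resource('3614a6bc-96c3-40a8-85e9-c80356a5f4da', 'voiD file',
             'meta/void', 'http://ieee.rkbexplorer.com/models/void.ttl'),
]

void = first_with_format(resources, 'meta/void')
missing = first_with_format(resources, 'text/turtle')
print(void.url if void else 'no voiD resource')
print(missing)
```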

In [22]:
download = requests.get(_.url)

ConnectionError: ('Connection aborted.', gaierror(11001, 'getaddrinfo failed'))

Get RDF

In [23]:
[x for x in resources if x.format == 'application/rdf+xml'][0]

Resource(id=u'55544720-70b9-4aae-b055-ef6111515514', description=u'', format=u'application/rdf+xml', url=u'http://ieee.rkbexplorer.com/models/dump.tgz')

In [24]:
download = requests.get(_.url)

ConnectionError: ('Connection aborted.', gaierror(11001, 'getaddrinfo failed'))

### The ieee-rkbexplorer links are offline =/
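
Since hosts can be unreachable, a crawler over many datasets would probably need to tolerate failed downloads instead of stopping at the first one. The sketch below is one possible approach: `try_download` and `offline_fetch` are hypothetical names, and `fetch` is assumed to behave like `requests.get`. The exceptions raised by requests derive from IOError, so catching IOError also covers `requests.ConnectionError`.

```python
def try_download(url, fetch, timeout=10):
    """Fetch url with the given callable; return the response or None on failure.

    `fetch` is expected to behave like requests.get (hypothetical parameter).
    """
    try:
        response = fetch(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP error statuses as failures too
    except IOError as exc:
        print('download failed for {}: {}'.format(url, exc))
        return None
    return response


def offline_fetch(url, timeout=None):
    # Stand-in for requests.get that simulates the DNS failure seen above.
    raise IOError('getaddrinfo failed')


outcome = try_download('http://ieee.rkbexplorer.com/models/void.ttl', offline_fetch)
print(outcome)
```

With requests imported, the real call would be `try_download(url, requests.get)`.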