# FORCE 11
## Exploring the data in SHARE

An introduction to SHARE queries and the current state of SHARE data.

### Names are difficult to disambiguate

SHARE attempts to break names into Given, Family, and Additional pieces. The [SHARE person schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml#L42) also includes fields for `email` and `affiliation`, plus a `sameAs` field for links to other identifiers, such as ORCIDs.
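
As a rough illustration of why this split is hard, the sketch below breaks a display name into the three pieces. The helper `split_name` and its output keys are hypothetical, not SHARE code, and it assumes names arrive either as "Family, Given Additional" or as "Given Additional Family".

```python
def split_name(name):
    """Naively split a display name into given, additional, and family parts.

    Hypothetical helper, not SHARE code. Assumes the name is written as
    'Family, Given Additional' or as 'Given Additional Family'.
    """
    if ',' in name:
        # Comma form: everything before the comma is the family name.
        family, _, rest = name.partition(',')
        parts = rest.split()
    else:
        # No comma: treat the last word as the family name.
        parts = name.split()
        family = parts.pop() if parts else ''
    given = parts[0] if parts else ''
    additional = ' '.join(parts[1:])
    return {
        'given': given,
        'additional': additional,
        'family': family.strip(),
    }


print(split_name('Vsevolozhskaya, Olga A.'))
print(split_name('Olga A. Vsevolozhskaya'))
```

Both calls yield the same three pieces, but the heuristic breaks as soon as a family name has more than one word or a source uses a different ordering, which is exactly the ambiguity SHARE has to deal with.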

Let's run a query to show the different names that appear in the 5 most recent documents in SHARE.

In [1]:
import furl
import requests


def query_share(size, query=None):
    """Return SHARE search results as a dict, sorted by providerUpdatedDateTime."""
    SHARE_API = 'https://osf.io/api/v1/share/search/'
    search_url = furl.furl(SHARE_API)
    search_url.args['size'] = size
    search_url.args['sort'] = 'providerUpdatedDateTime'
    if query:
        search_url.args['q'] = query
    return requests.get(search_url.url).json()

def print_title_contributors(results):
    """Print each result's title, then the names of its contributors."""
    for result in results['results']:
        print(result['title'].encode('utf-8'))
        print('~~~~~~~~~')
        for name in result['contributors']:
            print(name['name'])
        print('-------------------------------------------')
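
To see what kind of URL such a query requests without touching the network, here is a minimal sketch using only the Python 3 standard library. The helper `build_search_url` is hypothetical, and `furl` may percent-encode the query differently, so treat the printed URL as equivalent rather than identical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

SHARE_API = 'https://osf.io/api/v1/share/search/'


def build_search_url(size, query=None):
    """Build the search URL without sending a request (hypothetical helper)."""
    params = [('size', size), ('sort', 'providerUpdatedDateTime')]
    if query:
        params.append(('q', query))
    return '{}?{}'.format(SHARE_API, urlencode(params))


url = build_search_url(3, 'contributors.sameAs:*orcid*')
print(url)
# Round-trip the query string to confirm the parameters survive encoding.
print(parse_qs(urlparse(url).query))
```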

In [2]:
results = query_share(5)
print_title_contributors(results)

Statistical Interpretation including the APPROPRIATE Statistical Tests
~~~~~~~~~
Vsevolozhskaya, Olga A.
-------------------------------------------
Estimated Probability of Becoming Alcohol Dependent: Extending a Multiparametric Approach
~~~~~~~~~
Vsevolozhskaya, Olga A.
Anthony, James C.
-------------------------------------------
Rethinking Technical Services: New Frameworks, New Skill Sets, New Tool, New Roles
~~~~~~~~~
Eden, Brad
-------------------------------------------
Partnerships and New Roles in 21st-Century Academic Libraries: Collaborating, Embedding, and Cross-Training for the Future
~~~~~~~~~
Eden, Brad
-------------------------------------------
Creating Research Infrastructures in 21st-Century Academic Libraries: Conceiving, Funding, and Building New Facilities and Staff
~~~~~~~~~
Eden, Brad
-------------------------------------------


#### Names don't always show up in the same format

Let's choose a name, try it with and without the middle initial and in different orderings, and compare how many results each form returns.

In [3]:
from sharepa import ShareSearch

def search_a_name(name):
    """Build a search that matches `name` as a phrase in contributors.name."""
    name_search = ShareSearch()
    name_search = name_search.query(
        {
            "bool": {
                "should": [
                    {
                        "match": {
                            "contributors.name": {
                                "query": name,
                                "operator": "and",
                                "type": "phrase"
                            }
                        }
                    }
                ]
            }
        }
    )

    return name_search

def print_name_results(name):
    """Print how many documents list `name` as a contributor."""
    search = search_a_name(name)
    count = search.count()  # fetch the count once and reuse it
    if count == 1:
        print('There is {} document with the contributor {}'.format(count, name))
    else:
        print('There are {} documents with the contributor {}'.format(count, name))

In [4]:
print_name_results('Martone, Maryann E.')
print_name_results('Martone, Maryann')
print_name_results('Maryann Martone')
print_name_results('Maryann E Martone')

There are 8 documents with the contributor Martone, Maryann E.
There are 10 documents with the contributor Martone, Maryann
There are 2 documents with the contributor Maryann Martone
There are 26 documents with the contributor Maryann E Martone
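
One way to treat these variants as the same person is to reduce each one to a normalized key before comparing. The sketch below is a hypothetical helper, not part of SHARE or sharepa; it assumes names are written "Family, Given ..." or "Given ... Family" and deliberately ignores middle names and initials.

```python
import re
from collections import defaultdict


def _clean(text):
    """Lowercase and strip punctuation such as the period after an initial."""
    return re.sub(r'[^\w\s-]', '', text).strip().lower()


def name_key(name):
    """Reduce a display name to a (family, given) key for rough grouping."""
    if ',' in name:
        family, _, rest = name.partition(',')
        given_parts = rest.split()
    else:
        parts = name.split()
        family = parts[-1] if parts else ''
        given_parts = parts[:-1]
    given = given_parts[0] if given_parts else ''
    return (_clean(family), _clean(given))


variants = [
    'Martone, Maryann E.',
    'Martone, Maryann',
    'Maryann Martone',
    'Maryann E Martone',
]

# Group the variants by their normalized key.
groups = defaultdict(list)
for variant in variants:
    groups[name_key(variant)].append(variant)

for key, names in groups.items():
    print(key, len(names))
```

All four variants collapse to the single key `('martone', 'maryann')`. The same shortcut would also merge two different people who share a given and family name, so a key like this is a starting point for grouping, not a disambiguation.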


## Identifiers are Difficult to Identify

Here's a query that finds every document in which at least one contributor has an ORCID.

In [5]:
recent_results = query_share(3, 'contributors.sameAs:*orcid*')

print('There are {} results'.format(recent_results['count']))
print('----------')
for result in recent_results['results']:
    print(result['title'].encode('utf-8'))
    print(result['shareProperties']['source'])
    print('~~~~~~~~~')
    for name in result['contributors']:
        print('{} - {}'.format(name['name'].encode('utf-8'), name['sameAs']))
    print('-------------------------------------------')

There are 29413 results
----------
An arithmetic Zariski pair of line arrangements with non-isomorphic fundamental group
crossref
~~~~~~~~~
Enrique Artal Bartolo - [u'http://orcid.org/0000-0002-8276-5116']
José Ignacio Cogolludo-Agustín - [u'http://orcid.org/0000-0003-1820-6755']
Benoît Guerville-Ballé - []
Miguel Marco-Buzunáriz - [u'http://orcid.org/0000-0002-6750-8971']
-------------------------------------------
Discovering biomarkers for antidepressant response: protocol from the Canadian biomarker integration network in depression (CAN-BIND) and clinical characteristics of the first patient cohort
crossref
~~~~~~~~~
Raymond W. Lam - []
Roumen Milev - []
Susan Rotzinger - []
Ana C. Andreazza - []
Pierre Blier - []
Colleen Brenner - []
Zafiris J. Daskalakis - []
Moyez Dharsee - []
Jonathan Downar - []
Kenneth R. Evans - []
Faranak Farzan - []
Jane A. Foster - []
Benicio N. Frey - []
Joseph Geraci - []
Peter Giacobbe - []
Harriet E. Feilotter - []
Geoffrey B. Hall - []
Kate L. Harkn
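
The output shows that matching documents can still contain contributors with an empty `sameAs` list. A minimal sketch of pulling the ORCID iDs out of a result, assuming each contributor is a dict with an optional `sameAs` list of URLs as printed above; the helper `orcids_for` is hypothetical, and the sample data copies the first result.

```python
import re

# ORCID iDs are four hyphenated groups of four characters; the last may be 'X'.
ORCID_PATTERN = re.compile(r'orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])')


def orcids_for(contributor):
    """Return the ORCID iDs found in a contributor's sameAs list."""
    found = []
    for link in contributor.get('sameAs', []):
        match = ORCID_PATTERN.search(link)
        if match:
            found.append(match.group(1))
    return found


# Sample shaped like the first result printed above.
contributors = [
    {'name': 'Enrique Artal Bartolo',
     'sameAs': ['http://orcid.org/0000-0002-8276-5116']},
    {'name': 'José Ignacio Cogolludo-Agustín',
     'sameAs': ['http://orcid.org/0000-0003-1820-6755']},
    {'name': 'Benoît Guerville-Ballé', 'sameAs': []},
    {'name': 'Miguel Marco-Buzunáriz',
     'sameAs': ['http://orcid.org/0000-0002-6750-8971']},
]

with_orcid = [c for c in contributors if orcids_for(c)]
print('{} of {} contributors have an ORCID'.format(
    len(with_orcid), len(contributors)))
```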

### Try the same for another form of identifier - an email address

In [6]:
# Contributors with Email Addresses

results = query_share(3, 'contributors.email:*')

print('There are {} results'.format(results['count']))
print('----------')
for result in results['results']:
    print(result['title'].encode('utf-8'))
    print(result['shareProperties']['source'])
    print('~~~~~~~~~')
    for name in result['contributors']:
        print('{} - {}'.format(name['name'].encode('utf-8'), name.get('email')))
    print('-------------------------------------------')


There are 52582 results
----------
Data from: Contemporary and historic factors influence differently genetic differentiation and diversity in a tropical palm
dataone
~~~~~~~~~
Galetti, Mauro - None
Ribeiro, Milton C. - None
Carvalho, Carolina C. - carolina.carvalho@ymail.com
Collevatti, Rosane G. - None
Côrtes, Marina C. - None
-------------------------------------------
Data from: Asthma-like symptoms in homeless children in the Greater Paris area in 2013: prevalence, associated factors and utilization of healthcare services in the ENFAMS survey
dataone
~~~~~~~~~
Marguet, Christophe - None
Lefeuvre, Delphine - lefeuvredel@gmail.com
Delmas, Marie-Christine - None
Vandentorren, Stéphanie - None
Chauvin, Pierre - None
-------------------------------------------
Data from: Trans-species variation in Dmrt1 is associated with sex determination in four European tree-frog species
dataone
~~~~~~~~~
Brelsford, Alan - alan.brelsford@unil.ch
Perrin, Nicolas - None
Dufresnes, Christophe - None
--

## No shared taxonomy for subjects or explicit document types - manuscripts, data, figures, etc.

One of our developers is working on this right now!

The Dublin Core (DC) "Type" field is an excellent place to start. However, the values that appear in this field vary widely.

"Element Description: The nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the DCMIType vocabulary)."

In [7]:
import pandas as pd
from sharepa import ShareSearch, basic_search
from sharepa.helpers import pretty_print

type_search = ShareSearch()
total_documents = basic_search.count()

type_search.aggs.bucket(
    'typeTermFilter',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations
    field='otherProperties.properties.type',
    exclude="of|and|or",
    size=50,
)

type_results_executed = type_search.execute()

type_results = type_results_executed.aggregations.typeTermFilter.to_dict()['buckets']

type_dataframe = pd.DataFrame(type_results)
type_dataframe['percent'] = (type_dataframe['doc_count'] / total_documents)*100

In [8]:
type_dataframe

|   | doc_count | key | percent |
|---|-----------|-----|---------|
| 0 | 1328663 | article | 21.783339 |
| 1 | 1167877 | text | 19.147264 |
| 2 | 1066645 | journal | 17.487572 |
| 3 | 195793 | paper | 3.210013 |
| 4 | 189537 | book | 3.107446 |
| 5 | 186536 | figure | 3.058245 |
| 6 | 185722 | dataset | 3.044899 |
| 7 | 117738 | info:eu | 1.930306 |
| 8 | 117738 | repo | 1.930306 |
| 9 | 117738 | semantics | 1.930306 |
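
The buckets above appear to be individual words rather than whole type values, so one document can be counted in several of them. One possible way to tame the variation is to map each raw type value onto a DCMI Type term. The sketch below is an assumption-heavy illustration: the twelve vocabulary terms are the real DCMI Type Vocabulary, but `KEYWORD_TO_DCMI`, `to_dcmi_type`, and the sample raw values are hypothetical.

```python
import re

# The twelve terms of the DCMI Type Vocabulary, keyed by lowercase form.
DCMI_TYPES = {
    term.lower(): term
    for term in (
        'Collection', 'Dataset', 'Event', 'Image', 'InteractiveResource',
        'MovingImage', 'PhysicalObject', 'Service', 'Software', 'Sound',
        'StillImage', 'Text',
    )
}

# Hypothetical keyword-to-term mapping, chosen for illustration only.
KEYWORD_TO_DCMI = {
    'article': 'Text',
    'journal': 'Text',
    'paper': 'Text',
    'book': 'Text',
    'figure': 'StillImage',
}


def to_dcmi_type(raw_type):
    """Map a free-text type value to a DCMI Type term, or None if unknown."""
    lowered = raw_type.strip().lower()
    # An exact vocabulary term (in any letter case) wins outright.
    if lowered in DCMI_TYPES:
        return DCMI_TYPES[lowered]
    # Otherwise return the term for the first recognized word.
    for word in re.findall(r'[a-z]+', lowered):
        if word in DCMI_TYPES:
            return DCMI_TYPES[word]
        if word in KEYWORD_TO_DCMI:
            return KEYWORD_TO_DCMI[word]
    return None


# Made-up raw values in the spirit of the buckets above.
for raw in ['Journal Article', 'text', 'info:eu-repo/semantics/article',
            'Dataset', 'figure', 'poster']:
    print('{!r} -> {}'.format(raw, to_dcmi_type(raw)))
```

Values the mapping does not recognize come back as `None`, which keeps unmapped types visible instead of silently forcing them into a category.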


## Documents with Descriptions

Are abstracts copyrightable?

Abstracts, summaries and descriptions are not always made available. 

In [9]:
# How many documents have descriptions?
from __future__ import division

results = query_share(10, 'description:*')
percent = (results['count']/total_documents)*100

print('There are {} results'.format(results['count']))
print('{}/{} or {:.2f}% of results have descriptions'.format(
    results['count'], total_documents, percent
))

There are 3339942 results
3339942/6099446 or 54.76% of results have descriptions
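
The percentage is plain arithmetic on the two counts printed above. On Python 2, `/` between two integers truncates unless `from __future__ import division` is in effect, so a small helper that forces floating-point division avoids the problem either way. The function `percent_of` is a hypothetical name for this sketch.

```python
def percent_of(part, total):
    """Return part as a percentage of total, using floating-point division."""
    if total == 0:
        return 0.0
    return 100.0 * part / total


with_descriptions = 3339942
total_documents = 6099446
print('{:.2f}%'.format(percent_of(with_descriptions, total_documents)))
```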
