# Making Data Count

## Introduction

In November and December of 2014, we conducted a pair of online surveys of researchers and data managers (i.e., database or repository staff), asking questions about data sharing, discovery, and metrics.

## Setup

In [None]:
%matplotlib inline
from IPython.display import display
from math import sqrt
from textwrap import wrap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab
import scipy as sp
import scipy.stats as sps
import seaborn as sns

#pylab.rcParams['figure.figsize'] = (14.0, 6.0)
sns.set_style("white", 
              {'font.sans-serif': ['Helvetica', 'Liberation Sans',
                                   'Bitstream Vera Sans', 'sans-serif'],
               'axes.linewidth': 0,
               'xtick.direction': 'in',
               'xtick.major.size': 8.0})

LABEL_WIDTH = 25
sns.axes_style()

Scheme to consolidate subdisciplines

In [None]:
DISCIPLINE_MAP = {'-Anthropology' : 'Social science',
                  '-Archaeology' : 'Archaeology',
                  '-Area studies' : 'Social science',
                  '-Economics' : 'Social science',
                  '-Political science' : 'Social science',
                  '-Psychology' : 'Social science',
                  '-Sociology' : 'Social science',
                  '-Astronomy' : 'Space science',
                  '-Astrophysics' : 'Space science',
                  '-Environmental Science' : 'Environmental science',
                  '-Geology' : 'Earth science',
                  '-Oceanography' : 'Environmental science',
                  '-Planetary science' : 'Earth science',
                  '-Biochemistry' : 'Biology',
                  '-Bioinformatics' : 'Biology',
                  '-Biology' : 'Biology',
                  '-Evolutionary Biology' : 'Biology',
                  '-Neurobiology' : 'Biology',
                  'Social science' : 'Social science',
                  'Space science' : 'Space science',
                  'Earth science' : 'Earth science',
                  'Life science' : 'Biology',
                  'Chemistry' : 'Physical science',
                  'Physics' : 'Physical science',
                  'Computer science' : 'Computer science',
                  'Mathematics' : 'Mathematics',
                  'Information science' : 'Information science',
                  'Other' : 'Other'}
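
As a small illustration of how the scheme is applied, the sketch below consolidates a few raw responses with a plain dictionary lookup. `SAMPLE_MAP` and `consolidate` are hypothetical names used only here; the notebook itself applies the full `DISCIPLINE_MAP` with `Series.map`. Returning `None` for unmapped values is an assumption of this sketch.

```python
# Excerpt of the mapping above (hypothetical name; the notebook uses DISCIPLINE_MAP).
SAMPLE_MAP = {'-Anthropology': 'Social science',
              '-Oceanography': 'Environmental science',
              'Life science': 'Biology',
              'Physics': 'Physical science'}

def consolidate(responses, mapping):
    """Return the consolidated discipline for each raw response (None if unmapped)."""
    return [mapping.get(response) for response in responses]

# Made-up raw answers, including one that is not in the mapping.
raw = ['-Anthropology', 'Life science', 'Physics', 'Not a listed discipline']
consolidated = consolidate(raw, SAMPLE_MAP)
print(consolidated)
```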

Misc. utilities

In [None]:
def tuple_normalize(counts, n_responses):
    # return a tuple (not a lazy map object) so the result can be indexed later
    return tuple(float(x) / n_responses * 100 for x in counts)

def interval_to_error(confidence_interval, center):
    """
    confidence_interval (tuple): low, high relative to the origin
    center: (int, float): measured value (e.g., mean)
    returns the ci as a tuple relative to the center (i.e., minus, plus the mean)
    """
    low = center - confidence_interval[0]
    high = confidence_interval[1] - center
    #return tuple(map(lambda x: (float(x) - center), confidence_interval))
    return (low, high)
    
def split_interval(interval):
    """
    split a confidence interval tuple and return as a 2-element Series
    """
    return pd.Series(interval, index= ['low','high'])
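
The conversion in `interval_to_error` matters because matplotlib's `errorbar` expects distances from the plotted point rather than absolute bounds. A minimal worked example follows; the numbers are hypothetical and the function is repeated only so the snippet runs on its own.

```python
def interval_to_error(confidence_interval, center):
    """Convert an absolute (low, high) interval to distances below and above center."""
    low = center - confidence_interval[0]
    high = confidence_interval[1] - center
    return (low, high)

# Hypothetical numbers: an estimate of 60% with an interval from 52% to 70%.
error = interval_to_error((52.0, 70.0), 60.0)
print(error)  # (8.0, 10.0)
```

The result reads as "8 points below and 10 points above the estimate", which is the form passed to `xerr` in the graphing functions.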

#### Bootstrap functions

In [None]:
def bootstrap_percentile_ci(data, n_samples=100000, alpha=0.05, stat_function=np.sum):
    """
    Calculates a percentile bootstrap confidence interval for a statistic of the data
    (by default, the count of True values) and returns a tuple (low, high)
    
    data: (numpy array) of values to resample (e.g., bools)
    n_samples: (int) number of times to resample
    alpha: (float) 1 - desired confidence level
    stat_function: (callable) statistic to compute on each resample
    """
    n_responses = len(data)
    # get n_samples resampled arrays (of length n_responses) of valid indices for data
    indices = np.random.randint(0, n_responses, (n_samples, n_responses))
    
    # generate sorted array of desired stat in each resampled array
    stats = [stat_function(data[x]) for x in indices]
    stats.sort()

    # return stats at the alpha/2 and 1 - alpha/2 percentiles (2.5 and 97.5 for alpha=0.05)
    return (stats[int((alpha/2.0)*n_samples)], stats[int((1-alpha/2.0)*n_samples)])


def bootstrap_basic_ci(data, n_samples=100000, alpha=0.05, stat_function=np.sum):
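    """
    Calculates a basic bootstrap confidence interval for a statistic of the data
    (by default, the count of True values) and returns a tuple (low, high)
    
    data: (numpy array) of values to resample (e.g., bools)
    n_samples: (int) number of times to resample
    alpha: (float) 1 - desired confidence level
    stat_function: (callable) statistic to compute on each resample
    """
    double_observed = 2 * stat_function(data)
    low, high = bootstrap_percentile_ci(data, n_samples=n_samples, alpha=alpha, stat_function=stat_function)

    # reflect the percentile bounds around the observed value: (2*observed - high, 2*observed - low)
    return (double_observed - high, double_observed - low)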
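
To make the relationship between the two intervals concrete, here is a dependency-free sketch that uses only the standard library. The names `percentile_ci` and `basic_ci`, the fixed seed, the small resample count, and the toy data are all assumptions chosen for illustration; the notebook's own functions use NumPy and 100,000 resamples. The basic interval reflects the percentile bounds around the observed statistic, giving `(2 * observed - q_high, 2 * observed - q_low)`.

```python
import random

def percentile_ci(data, n_samples=2000, alpha=0.05, stat_function=sum, rng=None):
    """Percentile bootstrap: quantiles of the statistic over resamples of data."""
    rng = rng or random.Random()
    n = len(data)
    stats = sorted(stat_function([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_samples))
    low_index = int((alpha / 2.0) * n_samples)
    high_index = min(int((1 - alpha / 2.0) * n_samples), n_samples - 1)
    return (stats[low_index], stats[high_index])

def basic_ci(data, n_samples=2000, alpha=0.05, stat_function=sum, rng=None):
    """Basic bootstrap: reflect the percentile bounds around the observed value."""
    observed = stat_function(data)
    q_low, q_high = percentile_ci(data, n_samples, alpha, stat_function, rng)
    return (2 * observed - q_high, 2 * observed - q_low)

# Toy data: 30 of 40 hypothetical respondents ticked a box.
answers = [True] * 30 + [False] * 10
observed = sum(answers)

# The same seed is used for both calls so the two intervals share resamples.
p_low, p_high = percentile_ci(answers, rng=random.Random(0))
b_low, b_high = basic_ci(answers, rng=random.Random(0))
print(observed, (p_low, p_high), (b_low, b_high))
```

When the bootstrap distribution is symmetric around the observed value the two intervals coincide; they differ only when it is skewed.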

#### Graphing functions

In [None]:
def apply_cdl_style(fig):
    fig.set_ylabel('')
    sns.despine(ax=fig, left=True)
    # get rid of weird dashed line
    fig.lines[0].set_visible(False) 
    
    #set font sizes
    fig.tick_params(axis='x', width=2, labelsize=14, color='#808080')
    fig.tick_params(axis='y', labelsize=16)
    
    return fig


In [None]:
def graph_likert(questions, answers, interval_color='#808080'):
    
    collected_counts = pd.DataFrame(index=answers)
    stats = pd.DataFrame(index=questions.columns,columns=['mean', 'ci'])

    # set up dict for conversion from likert scale (e.g., 1-5) to 0-100%
    number_of_answers = len(answers) 
    answer_to_value = dict(zip(answers, np.arange(number_of_answers)/float(number_of_answers - 1)*100)) 

    for column in questions.columns:
        collected_counts[column] = questions[column].value_counts().dropna()

        #scale responses to go from 0 to 100
        likert_values = questions[column].dropna().map(answer_to_value)

        #calculate mean and 95% confidence interval
        stats['mean'].loc[column] = likert_values.mean() 
        stats['ci'].loc[column] = bootstrap_basic_ci(np.array(likert_values), stat_function=np.mean)
        
    #sort stats and collected_counts by the mean   
    stats = stats.sort_index(axis=0, by='mean', ascending=True)
    collected_counts = collected_counts.T.reindex(index=stats.index)
    collected_counts = collected_counts.div(collected_counts.sum(1).astype(float)/100, axis = 0)
    
    #convert absolute interval values to distance below and above the observed value
    for index in stats.index.values:
        stats['ci'].loc[index] = interval_to_error(stats['ci'].loc[index], stats['mean'].loc[index])
    
    #split interval tuples into 2 element Series
    stats['ci'] = stats['ci'].apply(split_interval)
    
    collected_counts.index = [ '\n'.join(wrap(i, LABEL_WIDTH)) for i in collected_counts.index ]
    
    #plot percentages of each response
    fig = collected_counts.plot(kind='barh', stacked=True, grid=False, 
                                color=sns.color_palette("Blues", len(collected_counts.columns)),
                                xlim = (0,100), edgecolor='w', linewidth=2) 
    
    # plot mean and 95% confidence interval
    fig.plot(stats['mean'], np.arange(len(stats)), marker='o', color='w',axes=fig, 
             markersize=25, markeredgewidth=0, linewidth=0)
    
    fig.errorbar(stats['mean'].as_matrix(), np.arange(len(stats)), xerr=stats['ci'],
                 fmt='none', ecolor=interval_color, alpha=0.65, elinewidth=2, capsize=12, capthick=2)
    
    fig.legend(bbox_to_anchor=(0., -0.02, 1., -0.03), loc='upper left', ncol=number_of_answers, mode="expand",
                    borderaxespad=0., fontsize=14)
    
    apply_cdl_style(fig)
    
    fig.get_figure().set_size_inches(14.0, 2 * len(collected_counts.index))

    return fig , collected_counts
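
The scaling step inside `graph_likert` can be seen in isolation in the sketch below: ordered answers are spread evenly from 0 to 100, and the mean is taken over answered questions only. `likert_to_percent_scale` is a hypothetical helper and the responses are made up; only the answer labels come from this notebook.

```python
def likert_to_percent_scale(answers):
    """Map ordered answers to evenly spaced values from 0 to 100."""
    steps = len(answers) - 1
    return {answer: 100.0 * position / steps
            for position, answer in enumerate(answers)}

FREQUENCIES = ['Never', 'Occasionally', 'Often']
scale = likert_to_percent_scale(FREQUENCIES)

# Hypothetical responses; None stands in for a skipped question.
responses = ['Often', 'Never', None, 'Occasionally', 'Often']
values = [scale[r] for r in responses if r is not None]
mean_score = sum(values) / len(values)
print(scale, mean_score)
```

With three answers the scale is 0, 50, 100, so a mean of 62.5 sits a little above "Occasionally". This is the quantity drawn as the white marker on each stacked bar.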

In [None]:
def graph_checkbox(question, bar_color='#08519c', interval_color='#808080'):
    #split_checkbox = responses[question].dropna()

    # checkbox_responses== DataFrame of bools where index=individual respondents, columns=answer choices
    #checkbox_responses = expand_checkbox(split_checkbox, answers)
    checkbox_responses = question.applymap(pd.notnull)
    
    # sum checked boxes in each column; response_counts== Series with values=sums, index=answer choices
    response_counts = checkbox_responses.sum()
    
    # resample and sum from each column to bootstrap a confidence interval
    # count_confidence_intervals== Series with values= tuples (low, high), index=answer choices
    count_confidence_intervals = checkbox_responses.apply(lambda x: bootstrap_basic_ci(np.array(x)))
        
    #normalize response_counts to percentage of total respondents to the question and sort
    response_counts = response_counts.apply(lambda x: float(x) / len(checkbox_responses) * 100)
    response_counts.sort(ascending=True)
    
    #normalize confidence intervals to percentages and sort
    count_confidence_intervals = count_confidence_intervals.apply(tuple_normalize, args=([len(checkbox_responses)]))
    count_confidence_intervals = count_confidence_intervals.reindex(index=response_counts.index)

    
    #convert absolute interval values to distance below and above the observed value
    for index in count_confidence_intervals.index.values:
        count_confidence_intervals.loc[index] = interval_to_error(count_confidence_intervals.loc[index], 
                                                                  response_counts.loc[index])
            
    #split interval tuples into 2 element Series
    count_confidence_intervals = count_confidence_intervals.apply(split_interval)
    
    response_counts.index = [ '\n'.join(wrap(i, LABEL_WIDTH)) for i in response_counts.index ]

    
    fig = response_counts.plot(kind='barh', 
                               color=bar_color,
                               edgecolor='w', grid=False, xlim=(0,100), fontsize=14)
    
    fig.errorbar(response_counts.as_matrix(), np.arange(len(response_counts)), 
                 xerr=count_confidence_intervals.T.as_matrix(),
                 fmt='none', ecolor=interval_color, alpha=0.65, elinewidth=2, capsize=12, capthick=2)
    
    
    apply_cdl_style(fig)

    fig.get_figure().set_size_inches(14.0, 2 * len(response_counts.index))
    
    return fig, response_counts, count_confidence_intervals
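
The counting and normalizing in `graph_checkbox` amount to the following, shown here without pandas. The row layout (a ticked box holds its label, an unticked box is absent) is an assumption about the survey export, and `checkbox_percentages` is a hypothetical helper. As in the function above, the denominator is every respondent in the table, including those who ticked nothing.

```python
def checkbox_percentages(rows, choices):
    """Percentage of respondents who ticked each choice, sorted ascending."""
    n_respondents = len(rows)
    percents = {}
    for choice in choices:
        ticked = sum(1 for row in rows if row.get(choice) is not None)
        percents[choice] = 100.0 * ticked / n_respondents
    return sorted(percents.items(), key=lambda item: item[1])

# Hypothetical export for four respondents (not survey data).
rows = [{'Downloads': 'Downloads', 'Citations': 'Citations'},
        {'Downloads': 'Downloads'},
        {'Downloads': 'Downloads'},
        {}]
result = checkbox_percentages(rows, ['Downloads', 'Citations'])
print(result)
```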

# Researchers

In [None]:
r_responses = pd.read_csv('MDC_Researchers.tsv', 
                          sep='\t',
                          header=[0,1], 
                          index_col=0,
                          tupleize_cols=True)

# drop blank rows (ignoring 1st column, which is required)
r_responses.dropna(axis=0, how='all', subset=r_responses.columns[1:], inplace=True)

r_responses.columns = pd.MultiIndex.from_tuples(r_responses.columns, names=['question', 'subquestion'])
r_responses.sort_index(axis=1, inplace=True)

r_responses['Which best describes your discipline?', 'Response'] = (
    r_responses['Which best describes your discipline?', 'Response'].map(DISCIPLINE_MAP, na_action='ignore'))


### Demographics

In [None]:
DEMOGRAPHICS = ['Which best describes your employer/institution?', 
                'Where is your employer/institution located?', 
                'What is the highest degree you hold?', 
                'Which best describes your role?', 
                'Which best describes your discipline?']

print("N= " + str(len(r_responses.index)))

for column in DEMOGRAPHICS:
    count = r_responses[column, 'Response'].value_counts()
    percentages = 100 * count.apply(lambda x: float(x) / count.sum())
    display(pd.DataFrame([count, percentages], index=['count', 'percent']).T)

A total of 247 respondents completed the researcher survey. Most (78%) are employed by academic institutions. The United States (57%) and United Kingdom (14%) are easily the best-represented countries. We received responses from across the academic career spectrum: principal investigators, post-docs, and grad students are all well represented. Biology is the most popular domain (53%), but environmental (17%) and social (10%) science are also well represented.


## Discovery

### How frequently do you use data from public sources to accomplish each of the following? 

In [None]:
FREQUENCIES = ['Never', 'Occasionally', 'Often']
sources_graph, sources_table = graph_likert(r_responses['How frequently do you use data from public sources to accomplish each of the following?<br>(e.g., data in a public database or journal article supplemental material)'], 
                               FREQUENCIES)
plt.savefig('mdc_use.svg')

Survey respondents most frequently reused public data to support their own data collection, either before (to inform data collection) or after (to support interpretation). However, the share who "often" reused data to reach their main conclusions (27.8%) is still substantial.

In [None]:
reuse_data = r_responses['How frequently do you use data from public sources to accomplish each of the following?<br>(e.g., data in a public database or journal article supplemental material)']
never_reuse_count = reuse_data.apply(lambda x: x == 'Never').all(axis=1).value_counts()
never_reuse_percent = never_reuse_count.apply(lambda x: 100 * float(x) / never_reuse_count.sum())
print(str(never_reuse_count[True]) + " (" + str(never_reuse_percent[True]) + 
      "%) respondents answered 'Never' to all options.")

often_reuse_count = reuse_data.apply(lambda x: x == 'Often').any(axis=1).value_counts()
often_reuse_percent = often_reuse_count.apply(lambda x: 100 * float(x) / often_reuse_count.sum())
print(str(often_reuse_count[True]) + " (" + str(often_reuse_percent[True]) + 
      "%) respondents answered 'Often' to at least one option.")


### When looking for public data to use, how likely you are to search in each of the following ways?

In [None]:
LIKELIHOODS = ['No chance', 'Possible', 'Definitely']
graph_likert(r_responses['When looking for public data to use, how likely you are to search in each of the following ways?<br><em>If you never use external data, please answer hypothetically.</em>'], 
             LIKELIHOODS) 
plt.savefig('mdc_discovery.svg')

Most respondents search for data in multiple ways: through links from the literature, disciplinary databases, and general-purpose internet search. More personal reliance on other researchers (via social media or discussion forums) is much less common.

In [None]:
search_method = r_responses['When looking for public data to use, how likely you are to search in each of the following ways?<br><em>If you never use external data, please answer hypothetically.</em>']
definite_methods = search_method.apply(lambda x : x == 'Definitely').sum(axis=1).value_counts().sort_index()
definite_methods = definite_methods.apply(lambda x: float(x) / definite_methods.sum() * 100)
dm_fig = definite_methods.plot(kind='bar', color='#08519c', edgecolor='w', grid=False, ylim=(0,100), fontsize=16, rot=0,
                               title='How many methods would you "definitely" use?')
sns.despine(ax=dm_fig)
print(str(definite_methods[2:].sum()) + '% "definitely" use 2 or more methods')

A majority (63%) of respondents use multiple methods to search for data.

## Evaluation

### Please rank the importance of the following for estimating a dataset's quality

In [None]:
IMPORTANCE_LEVELS = ['5 (Least important)', '4', '3', '2', '1 (Most important)']
graph_likert(r_responses["Please rank the importance of the following for estimating a dataset's quality (e.g., when deciding whether to download it).<br><em>Please rank items from most to least important.</em>"], 
             IMPORTANCE_LEVELS)

Data quality was most frequently estimated through thoroughness of the associated documentation. Somewhat surprisingly, reuse was the least important indicator.

## Data Use

### What proportion of your data have you shared in each of the following ways?

In [None]:
PROPORTIONS = ['None', 'Some of it', 'Most/all of it']

"""
Responses to "Any means at all" generally make no sense (e.g., 44 of the 48 respondents who said they shared none of their 
data by any means also said that they shared some or most of their data through one of the other means). 
Consequently, I am dropping that question from the analysis.
"""

sharing_method = r_responses['What proportion of your data have you shared in each of the following ways?'].copy()
sharing_method.drop('Any means at all', axis=1, inplace=True)

graph_likert(sharing_method, PROPORTIONS)

More respondents (90%) have shared data by email than by any other method. However, respondents who shared most or all of their data were more likely to do so via a database or repository (24%).

In [None]:
#sharing_method = responses['What proportion of your data have you shared in each of the following ways?']
shared_all_data = sharing_method.apply(lambda x : x == 'Most/all of it').any(axis=1)
fig, table = graph_likert(sharing_method[shared_all_data], PROPORTIONS)

print(str(shared_all_data.sum()) + ' (' + (str(100. * shared_all_data.sum()/shared_all_data.size)) + 
      '%) shared "most/all" of their data by some means')

In [None]:
table

In [None]:
display(r_responses.info(verbose=True))

In [None]:
shared_no_data = sharing_method.apply(lambda x : x == 'None').all(axis=1)
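# fraction of all respondents (N = 247) who shared no data by any listed method
shared_no_data.sum() / float(len(shared_no_data))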
#sharing_method[sharing_method['Any means at all'] == 'None']

### How interested you would be to know each of the following about users of your data?

In [None]:
INTEREST_LEVELS = ['5 (Least interesting)', '4', '3', '2', '1 (Most interesting)']
graph_likert(r_responses['How interested you would be to know each of the following about <strong>users<br></strong> of your data?<br><em>Please rank items from most to least interesting.</em>'], 
             INTEREST_LEVELS)


### How interested you would be to know each of the following about how your data is used?

In [None]:
graph_likert(r_responses['How interested you would be to know each of the following about <strong>how<br></strong> your data is used?<br><em>Please rank items from most to least interesting.</em>'], 
             INTEREST_LEVELS)

## Impact

### How interested you would be to know each of the following about the impact of your data?

In [None]:
INTEREST_LEVELS_FOUR = ['4 (Least interesting)', '3', '2', '1 (Most interesting)']
graph_likert(r_responses['How interested you would be to know each of the following about the impact of your data?<br><em>Please rank items from most to least interesting.</em>'], 
             INTEREST_LEVELS_FOUR)
plt.savefig('mdc_impact.svg')

Citations remain the preferred currency of academic credit, as the first choice of 85.5% of respondents. Download count is a clear second choice (by 64.5%). Links and landing page views were both much less popular.

# Data managers

In [None]:
dm_responses = pd.read_csv('MDC_Managers.tsv', 
                        sep='\t',
                        header=[0,1], 
                        index_col=0,
                        tupleize_cols=True)

# drop blank rows (ignoring 1st column, which is required)
dm_responses.dropna(axis=0, how='all', subset=dm_responses.columns[1:], inplace=True)

dm_responses.columns = pd.MultiIndex.from_tuples(dm_responses.columns, names=['question', 'subquestion'])
dm_responses.sort_index(axis=1, inplace=True)

dm_responses['Which best describes your discipline?', 'Response'] = (
    dm_responses['Which best describes your discipline?', 'Response'].map(DISCIPLINE_MAP, na_action='ignore'))


## Demographics

In [None]:
DEMOGRAPHICS = ['Which best describes your employer/institution?', 
                'Where is your employer/institution located?', 
                'What is the highest degree you hold?', 
                'Which best describes your discipline?']

print("N= " + str(len(dm_responses.index)))

for column in DEMOGRAPHICS:
    count = dm_responses[column, 'Response'].value_counts()
    percentages = 100 * count.apply(lambda x: float(x) / count.sum())
    display(pd.DataFrame([count, percentages], index=['count', 'percent']).T)

## Current Practice

### What metrics / statistics do you currently track?

In [None]:
graph_checkbox(dm_responses['What metrics / statistics do you currently track?'])

In [None]:
tracked_metrics = dm_responses['What metrics / statistics do you currently track?'].copy()
tracked_metrics.drop('Other (please specify)', axis=1, inplace=True)
tracked_fig, tracked_table, tracked_interval = graph_checkbox(tracked_metrics)

In [None]:
tracked_table

A large majority (84.9%) of repositories track downloads. Relatively few track citations to datasets (23.3%) or to the repository as a whole (21.9%).

### What metrics / statistics do you currently expose? (e.g., on landing pages or via API)

In [None]:
graph_checkbox(dm_responses['What metrics / statistics do you currently expose?<br>(e.g., on landing pages or via API)'])

In [None]:
exposed_metrics = dm_responses['What metrics / statistics do you currently expose?<br>(e.g., on landing pages or via API)'].copy()
exposed_metrics.drop('Other (please specify)', axis=1, inplace=True)
exposed_fig, exposed_table, exposed_interval = graph_checkbox(exposed_metrics, bar_color='#6baed6')

Most repositories do not expose any metrics/statistics. Of those that do, downloads (30.1%) and views (23.3%) are the most common. Every metric is collected by a substantial number of repositories that do not expose it: 64.5% of the repositories that track downloads don't expose them, and 52.9% of the repositories that track citations to individual datasets don't expose them.
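
One way a "tracked but not exposed" share could be computed is sketched below. `share_not_exposed` is a hypothetical helper and the five-repository example is toy data, not survey responses. The last lines are a rough check against the rounded percentages quoted above, under the assumption that every repository exposing downloads also tracks them.

```python
def share_not_exposed(tracked, exposed):
    """Percent of tracking repositories that do not expose the metric."""
    trackers = [e for t, e in zip(tracked, exposed) if t]
    hidden = sum(1 for e in trackers if not e)
    return 100.0 * hidden / len(trackers)

# Toy data for five hypothetical repositories (not survey responses).
tracks_downloads = [True, True, True, True, False]
exposes_downloads = [True, False, False, False, False]
toy_share = share_not_exposed(tracks_downloads, exposes_downloads)

# Rough check from the rounded percentages in the text, assuming every
# repository that exposes downloads also tracks them.
approx_share = 100.0 * (1 - 30.1 / 84.9)
print(toy_share, round(approx_share, 1))
```

Under that assumption the rounded figures reproduce the 64.5% reported for downloads.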

In [None]:
exposed_table

In [None]:
exposed_interval

In [None]:
combined_metrics = {'tracked' : tracked_table,
                    'exposed' : exposed_table}
combined_df = pd.DataFrame(combined_metrics)
fig= combined_df.plot(kind='barh', color=sns.color_palette("Blues", 2), edgecolor='w', grid=False, xlim=(0,100), fontsize=14)
fig.legend(loc=4,frameon=False, fontsize=16.)

fig.get_figure().set_size_inches(14.0, 2 * len(tracked_table.index))

fig.errorbar(tracked_table.as_matrix(), np.arange(len(tracked_interval)) + 0.125, 
             xerr=tracked_interval.T.as_matrix(),
             fmt='none', ecolor='#808080', alpha=0.65, elinewidth=2, capsize=6, capthick=2)

fig.errorbar(exposed_table.as_matrix(), np.arange(len(exposed_interval)) - 0.125, 
             xerr=exposed_interval.T.as_matrix(),
             fmt='none', ecolor='#808080', alpha=0.65, elinewidth=2, capsize=6, capthick=2)
apply_cdl_style(fig)

plt.savefig('mdc_practice.svg')

In [None]:
exposed_interval

In [None]:
exposed_table

### What information do you require data users to supply?

In [None]:
graph_checkbox(dm_responses['What information do you require data users to supply?'])

Responding repositories were evenly split (at 46.6%) between requiring data users to supply their real name and requiring no information at all.

In [None]:
user_info = dm_responses['What information do you require data users to supply?']
real_name_required = user_info['Real name'].apply(lambda x: x == 'Real name')
graph_checkbox(user_info[real_name_required])

### How do you currently use the information you collect?

In [None]:
graph_checkbox(dm_responses['How do you currently use the information you collect?'])

The most common (56.2%) application for data metrics is as evidence of the value of the repository to funders. However, many respondents also use them to inform internal decisions: 46.6% to set priorities and 38.4% to inform collection development.

In [None]:
INTERNAL_USES = [u'Inform collection development', u'Inform deaccession decisions', 
                 u'Set levels of service', u'Set priorities']

internal_use = dm_responses['How do you currently use the information you collect?'][INTERNAL_USES].any(axis=1).value_counts()
internal_use.apply(lambda x: float(x) / internal_use.sum())

65.8% of respondents use the information they collect to inform at least one internal decision.

## Data Use

### How interested you would be to know each of the following about users of the data you hold?

In [None]:
graph_likert(dm_responses['How interested you would be to know each of the following about <strong>users</strong> of the data you hold?<br><em>Please rank items from most to least interesting.</em>'], 
             INTEREST_LEVELS)

There was a clear interest in knowing what disciplines or research communities the repository is serving; 67.3% chose it as the most interesting option. All the others were a roughly similar mix, except, perhaps, for clickstreams, which were the first choice of 16.4% of repository managers.

### How interested you would be to know each of the following about how the data you hold is used?

In [None]:
graph_likert(dm_responses['How interested you would be to know each of the following about <strong>how</strong> the data you hold is used?<br><em>Please rank items from most to least interesting.</em>'],
             INTEREST_LEVELS)

### How interested you would be to know each of the following about the impact of the data you hold?

In [None]:
INTEREST_LEVELS_SEVEN = ['7 (Least interesting)', '6', '5', '4', '3', '2', '1 (Most interesting)']
graph_likert(dm_responses['How interested you would be to know each of the following about the <strong>impact</strong> of the data you hold?<br><em>Please rank items from most to least interesting.</em>'], 
             INTEREST_LEVELS_SEVEN)

As was the case with researchers, citations are the most desirable measure of impact, as the first choice of 61.1% of respondents. In contrast to this purely scholarly measure, "real-world" impact was the second most desirable, but was the first choice of a much smaller percentage (23.2%) of respondents.

## Conclusions

Directly relevant to metrics:

* Citations remain the preferred currency of scholarly prestige. Citation counts are the most valued measure of impact by both researchers and repository staff.

* As a measure of impact, download counts are securely in second place in researchers' minds, and many more repositories currently track downloads than citations (85% vs. 23%). In the near term, downloads are an attractive metric in terms of availability.

* Many repositories collect stats that they don't expose.

Researcher data use and discovery behavior:

* Researchers are using other people's data primarily in a supporting role, either before or after their own data collection.

* Researchers look for data in multiple ways: the literature, disciplinary databases, and general-purpose search are each used by a majority of researchers.



