# The Benefit of Data Sharing
Godwin et al. (2025) reviewed the open-science practices in recent **visual search** literature.<br>
They graciously provided the dataset of articles they reviewed ([link](https://osf.io/5tmey/overview)), allowing others to explore the data further.<br>
Here, we analyze their dataset to examine whether sharing data is associated with increased citations. We rely on Godwin et al.'s classification of articles into five categories based on their data-sharing practices:
1. No data shared
2. Per-subject data
3. Per-trial data
4. Per-fixation data
5. Data reported as shared by Godwin et al. (2025), but at an unclear level of detail

We use the [OpenAlex](https://openalex.org/) API to retrieve citation counts for each of their articles, and compare the [FWCI](https://help.openalex.org/hc/en-us/articles/24735753007895-Field-Weighted-Citation-Impact-FWCI) and [MNCS](https://open.leidenranking.com/information/indicators) between these sharing categories.
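Both metrics normalize raw citation counts against an expected value for comparable publications. As a rough illustration of the idea behind MNCS, the toy sketch below divides each article's citation count by the mean citation count of articles from the same field and year, then averages the normalized scores. The data and column names are made up, and this is not necessarily the computation used in `fetch_metadata`, which is not shown in this notebook.

```python
import pandas as pd

# Made-up citation counts for five articles in two hypothetical fields.
toy = pd.DataFrame({
    "field": ["psych", "psych", "psych", "neuro", "neuro"],
    "year": [2020, 2020, 2020, 2020, 2020],
    "citations": [2, 4, 6, 10, 30],
})

# Expected citations: mean of all articles sharing the same field and year.
expected = toy.groupby(["field", "year"])["citations"].transform("mean")

# Per-article normalized score; 1.0 means "cited as often as expected".
toy["normalized_score"] = toy["citations"] / expected

# MNCS of the whole set: the mean of the normalized scores.
mncs = toy["normalized_score"].mean()
print(toy)
print(f"MNCS of the toy set: {mncs:.2f}")
```

A score above 1 means an article is cited more often than the average article in its reference set, which makes scores comparable across fields with different citation habits.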

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from load_data import *
from fetch_metadata import fetch_all_metadata

In [2]:
FONT_FAMILY = "sans-serif"
TITLE_FONT = dict(family=FONT_FAMILY, size=20, color="black")
AXIS_TITLE_FONT = dict(family=FONT_FAMILY, size=16, color="black")
AXIS_TICK_FONT = dict(family=FONT_FAMILY, size=14, color="black")
LEGEND_FONT = dict(family=FONT_FAMILY, size=14, color="black")

## Prepare Data
### Load Godwin et al. (2025) Dataset

In [3]:
godwin = load_godwin2025()
godwin_subset = (
    godwin
    # selection criteria from Godwin et al. (2025):
    .loc[lambda df: df["YEAR_PUBLISHED"].between(2017, 2022)]
    .loc[lambda df: df["IS_PRIMARY_RESEARCH_HUMAN"] == "YES"]
    .loc[lambda df: df["IS_VISUAL_SEARCH"] == "YES"]
    .loc[lambda df: df["IS_EYE_TRACKING"] == "YES"]
    .drop(columns=[
        col for col in godwin.columns if col not in [
            "PAPER_LINK",
            "CLAIMED_TO_SHARE", "SHARING_LINK", "ACTUALLY_SHARED",
            "BY_FIXATION", "BY_TRIAL", "BY_PPT", "data_sharing_class"
        ]
    ])
)

### Fetch Metadata & Citation Counts from OpenAlex

In [4]:
metadata = fetch_all_metadata(godwin_subset["PAPER_LINK"].tolist(), sleep_period=0.01, verbose=True)
combined = (
    metadata
    .dropna(subset=["DOI"])     # drop entries with unsuccessful metadata fetch
    .drop(columns=[             # drop per-year citation counts
        col for col in metadata.columns if col.startswith("Citations20")
    ])
    .merge(godwin_subset, on="PAPER_LINK")
    .drop_duplicates(subset="DOI")
    .loc[lambda df: df["IsRetracted"] == False]   # drop retracted articles
    .assign(
        is_sharing=lambda df: df["data_sharing_class"] != "NONE"
    )
)
combined.columns

Error fetching metadata for link https://jov.arvojournals.org/article.aspx?articleid=2770293: Failed to fetch URL content: https://jov.arvojournals.org/article.aspx?articleid=2770293
Error fetching metadata for link 10.1177/1747021820945604: 404 Client Error: Not Found for url: https://api.openalex.org/works/doi%3A10.1177%2F1747021820945604
Error fetching metadata for link https://jov.arvojournals.org/article.aspx?articleid=2657481: Failed to fetch URL content: https://jov.arvojournals.org/article.aspx?articleid=2657481
Error fetching metadata for link 10.1037/xap0000235.: 404 Client Error: Not Found for url: https://api.openalex.org/works/doi%3A10.1037%2Fxap0000235.
Error fetching metadata for link https://www.mdpi.com/2076-3425/11/3/283: Failed to fetch URL content: https://www.mdpi.com/2076-3425/11/3/283
Error fetching metadata for link https://psycnet.apa.org/record/2019-10303-001: Failed to fetch URL content: https://psycnet.apa.org/record/2019-10303-001
Error fetching metadata for link https://www.sciencedirect.com/science/article/pii/S2451958821000750: Failed to fetch URL content: https://www.sciencedirect.com/science/article/pii/S2451958821000750
Error fetching metadata for link https://www.biorxiv.org/content/10.1101/2021.02.05.429946v1: 404 Client Error: Not Found for url: https://api.openalex.org/works/doi%3A10.1101%2F2021.02.05.429946v1
Error fetching metadata for link https://jov.arvojournals.org/article.aspx?articleid=2718934: Failed to fetch URL content: https://jov.arvojournals.org/article.aspx?articleid=2718934
Error fetching metadata for link https://www.sciencedirect.com/science/article/pii/S0010945221000149?via%3Dihub: Failed to fetch URL content: https://www.sciencedirect.com/science/article/pii/S0010945221000149?via%3Dihub
Error fetching metadata for link https://www.sciencedirect.com/science/article/abs/pii/S0010027721003589: Failed to fetch URL content: https://www.sciencedirect.com/science/article/abs/pii/S0010027721003589
Error fetching metadata for link https://www.sciencedirect.com/science/article/abs/pii/S0010027719303233?via%3Dihub: Failed to fetch URL content: https://www.sciencedirect.com/science/article/abs/pii/S0010027719303233?via%3Dihub
Error fetching metadata for link 10.16910/jemr.10.1.5: 404 Client Error: Not Found for url: https://api.openalex.org/works/doi%3A10.16910%2Fjemr.10.1.5

100%|██████████| 251/251 [03:14<00:00,  1.29it/s]


Index(['PAPER_LINK', 'DOI', 'OpenAlexID', 'LastUpdate', 'PublicationType',
       'PublicationYear', 'PublicationDate', 'Topics',
       'FieldWeightedCitationIndex', 'IsRetracted', 'TotalCitations', 'Error',
       'Pub2UpdateTime', 'MeanNormalizedCitationScore', 'CLAIMED_TO_SHARE',
       'SHARING_LINK', 'ACTUALLY_SHARED', 'BY_FIXATION', 'BY_TRIAL', 'BY_PPT',
       'data_sharing_class', 'is_sharing'],
      dtype='object')
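Two of the 404 errors in the log above involve DOI strings that look malformed: `10.1037/xap0000235.` ends with a period, and the bioRxiv link yields a DOI with a `v1` version suffix. A minimal sketch of how such strings could be tidied before the lookup is shown below. `normalize_doi` is a hypothetical helper, not part of `fetch_metadata`, and whether the cleaned DOIs would actually resolve in OpenAlex has not been verified here.

```python
import re

def normalize_doi(raw: str) -> str:
    """Hypothetical helper: tidy a DOI string before querying an API."""
    doi = raw.strip()
    # Drop a resolver prefix such as https://doi.org/ if present.
    doi = re.sub(r"^https?://(?:dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
    # Drop trailing punctuation left over from copy-pasting a citation.
    doi = doi.rstrip(".,;")
    # bioRxiv URLs carry a version suffix (e.g. "v1") that is not part of the DOI.
    if doi.startswith("10.1101/"):
        doi = re.sub(r"v\d+$", "", doi)
    # DOIs are case-insensitive, so lowercase for consistent de-duplication.
    return doi.lower()

print(normalize_doi("10.1037/xap0000235."))
print(normalize_doi("10.1101/2021.02.05.429946v1"))
```

The remaining failures are publisher landing pages that could not be fetched at all, so they would need a different remedy, such as looking up the DOI manually.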

## Analysis

In [14]:
def calculate_h_index(citations: pd.Series) -> int:
    """Calculates h-index from a Series of citation counts."""
    if citations.empty:
        return 0
    # Sort descending (highest citations first)
    sorted_cits = sorted(citations.dropna().astype(int), reverse=True)
    # Count how many papers have citations >= their rank
    return sum(c >= i + 1 for i, c in enumerate(sorted_cits))
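
A quick sanity check of the h-index logic on toy citation counts (the numbers are made up for illustration). The function is repeated in compact form so that the snippet runs on its own.

```python
import pandas as pd

def calculate_h_index(citations: pd.Series) -> int:
    """h-index: the largest h such that h papers have at least h citations each."""
    # Sort descending, ignoring missing values.
    sorted_cits = sorted(citations.dropna().astype(int), reverse=True)
    # With a descending sort, "citations >= rank" holds exactly for ranks 1..h.
    return sum(c >= rank for rank, c in enumerate(sorted_cits, start=1))

toy = pd.Series([10, 8, 5, 4, 3, None])
print(calculate_h_index(toy))
```

Note that the h-index of a group can never exceed the number of articles in it, so it is not comparable across sharing classes of very different sizes.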

In [18]:
sharing_class_order = {"NONE": 0, "PPT": 1, "TRIAL": 2, "FIXATION": 3, "UNKNOWN": 4}
metrics_per_sharing_class = (
    combined[[
        "data_sharing_class", "FieldWeightedCitationIndex", "MeanNormalizedCitationScore", "TotalCitations"
    ]]
    .groupby("data_sharing_class")
    .agg({
        "TotalCitations": [calculate_h_index],
        "FieldWeightedCitationIndex": ["count", "median", "mean", "std",],
        "MeanNormalizedCitationScore": ["median", "mean", "std",],
    })
    .sort_index(key=lambda idx: idx.map(lambda cls: sharing_class_order.get(cls)))
    .reset_index(drop=False)
)
metrics_per_sharing_class.columns = metrics_per_sharing_class.columns.map({
    ("FieldWeightedCitationIndex", "count"): "count",
    ("FieldWeightedCitationIndex", "median"): "fwci_median",
    ("FieldWeightedCitationIndex", "mean"): "fwci_mean",
    ("FieldWeightedCitationIndex", "std"): "fwci_std",
    ("MeanNormalizedCitationScore", "median"): "mncs_median",
    ("MeanNormalizedCitationScore", "mean"): "mncs_mean",
    ("MeanNormalizedCitationScore", "std"): "mncs_std",
})
metrics_per_sharing_class.columns = ["data_sharing_class", "h_index"] + list(metrics_per_sharing_class.columns[2:])
metrics_per_sharing_class

| | data_sharing_class | h_index | count | fwci_median | fwci_mean | fwci_std | mncs_median | mncs_mean | mncs_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NONE | 30 | 149 | 1.129736 | 1.535285 | 1.353306 | 0.726415 | 0.949733 | 0.838379 |
| 1 | PPT | 14 | 23 | 1.737731 | 2.076712 | 1.494861 | 0.851117 | 1.11397 | 1.012093 |
| 2 | TRIAL | 11 | 25 | 1.022195 | 1.588806 | 1.501962 | 0.60794 | 0.954374 | 0.946497 |
| 3 | FIXATION | 16 | 31 | 1.63184 | 2.328929 | 1.942799 | 0.933962 | 1.222685 | 0.913544 |
| 4 | UNKNOWN | 4 | 4 | 1.007085 | 1.398109 | 1.346757 | 0.60794 | 0.876309 | 0.870967 |


### (1) Compare Sharing and Non-Sharing Articles
#### (1A) Field-Weighted Citation Impact (FWCI)

In [6]:
fwci_pair_test = stats.mannwhitneyu(
    combined.loc[~combined["is_sharing"], "FieldWeightedCitationIndex"],
    combined.loc[combined["is_sharing"], "FieldWeightedCitationIndex"],
    alternative="less"
)
fwci_pair_test

MannwhitneyuResult(statistic=np.float64(5325.5), pvalue=np.float64(0.040063423852231626))
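
The direction of a one-sided test is easy to get backwards. In `scipy.stats.mannwhitneyu(x, y, alternative="less")`, the alternative hypothesis is that the distribution underlying `x` is stochastically less than the one underlying `y`. Passing the non-sharing articles first therefore tests whether they tend to score lower than sharing articles. A toy sketch with made-up values:

```python
from scipy import stats

non_sharing = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up scores
sharing = [6.0, 7.0, 8.0, 9.0, 10.0]      # made-up scores, all higher

# H1: non_sharing is stochastically less than sharing (matches the toy data).
less = stats.mannwhitneyu(non_sharing, sharing, alternative="less")
# H1: non_sharing is stochastically greater than sharing (contradicts the toy data).
greater = stats.mannwhitneyu(non_sharing, sharing, alternative="greater")

print(f"U = {less.statistic}, p(less) = {less.pvalue:.4f}, p(greater) = {greater.pvalue:.4f}")
```

The reported statistic is the U value of the first sample, which is why it is small when the first sample holds the lower values.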

#### (1B) Mean Normalized Citation Score (MNCS)

In [7]:
mncs_pair_test = stats.mannwhitneyu(
    combined.loc[~combined["is_sharing"], "MeanNormalizedCitationScore"],
    combined.loc[combined["is_sharing"], "MeanNormalizedCitationScore"],
    alternative="less"
)
mncs_pair_test

MannwhitneyuResult(statistic=np.float64(5710.5), pvalue=np.float64(0.16745452190581744))

#### Visualizing Share/Non-Share Comparison

In [8]:
plot_data = (
    combined
    .melt(
        id_vars=['is_sharing'],
        value_vars=['FieldWeightedCitationIndex', 'MeanNormalizedCitationScore'],
        var_name='name',
        value_name='value'
    )
    .replace({
        'FieldWeightedCitationIndex': 'FWCI',
        'MeanNormalizedCitationScore': 'MNCS',
    })
)

pairwise_fig = go.Figure()
for is_share in plot_data["is_sharing"].unique():
    subset = plot_data[plot_data["is_sharing"] == is_share]
    subset_name = 'Sharing' if is_share else 'Not Sharing'
    pairwise_fig.add_trace(go.Violin(
        x=subset['name'], y=subset['value'],
        name=subset_name, legendgroup=subset_name, scalegroup=subset_name,
        side='positive' if is_share else 'negative',
        opacity=1.0 if is_share else 0.75,
        fillcolor='lightgrey', line_color="gray", width=1.0, spanmode="hard",
        points=False, pointpos=0, jitter=0.5,
        box=dict(visible=False, width=0.5, line=dict(color="gray")),
        meanline=dict(visible=False, color='gray'),
    ))

for i, metric in enumerate(['FWCI', 'MNCS']):
    pairwise_fig.add_vline(x=metric, line=dict(color='black', dash='dash'))
    pairwise_fig.add_annotation(
        x=metric, xanchor="center", xref="x",
        y=1.04 if metric=="FWCI" else 1.06, yanchor="middle", yref="paper",
        text="*" if metric=="FWCI" else "n.s.",
        font=TITLE_FONT,
        font_size=30 if metric=="FWCI" else 20,
        showarrow=False,
    )
    pairwise_fig.add_shape(
        type="line",
        x0=i - 0.1, x1=i + 0.1,
        y0=1.025, y1=1.025,
        yref="paper",
        line=dict(color="black", width=1),
    )

pairwise_fig.update_layout(
    width=800, height=400,
    title=dict(
        text="Citation Impact Distribution: Sharing vs. Non-Sharing",
        font=TITLE_FONT,
        x=0.5, xanchor="center", y=0.95, yanchor="top"
    ),
    violinmode='overlay',
    violingap=0,
    xaxis=dict(
        title=dict(text="Metric", font=AXIS_TITLE_FONT, standoff=5),
        tickfont=AXIS_TICK_FONT,
        zeroline=False,
    ),
    yaxis=dict(
        title=dict(text="Score", font=AXIS_TITLE_FONT, standoff=5),
        tickfont=AXIS_TICK_FONT,
        zeroline=False,
    ),
    legend=dict(
        title=dict(text="Data Sharing", font=AXIS_TITLE_FONT),
        font=LEGEND_FONT,
        x=0.5, xanchor="center", y=0.95, yanchor="top",
        bgcolor='rgba(255,255,255,0.5)',
        bordercolor='black', borderwidth=1,
    ),
    margin=dict(t=75, b=50, l=50, r=25, pad=0),
    template="plotly_white"
)

pairwise_fig.show()

### (2) Compare Type of Data Sharing
#### (2A) Field-Weighted Citation Impact (FWCI)

In [9]:
fwci_sharegroup_test = stats.kruskal(*[
    combined.loc[combined["data_sharing_class"] == sharing_class, "FieldWeightedCitationIndex"].dropna()
    for sharing_class in ["FIXATION", "TRIAL", "PPT"]
])
fwci_sharegroup_test

KruskalResult(statistic=np.float64(2.900136104034142), pvalue=np.float64(0.23455432565569215))

#### (2B) Mean Normalized Citation Score (MNCS)

In [10]:
mncs_sharegroup_test = stats.kruskal(*[
    combined.loc[combined["data_sharing_class"] == sharing_class, "MeanNormalizedCitationScore"].dropna()
    for sharing_class in ["FIXATION", "TRIAL", "PPT"]
])
mncs_sharegroup_test

KruskalResult(statistic=np.float64(2.505653765320045), pvalue=np.float64(0.2856960251079464))
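
The Kruskal-Wallis tests above report only the H statistic and its p-value. One commonly used effect size for this test is epsilon-squared, $\epsilon^2 = H / (n - 1)$, where $n$ is the total number of observations. It is not part of the analysis above. The sketch below uses made-up values and a hypothetical helper, `kruskal_epsilon_squared`.

```python
from scipy import stats

def kruskal_epsilon_squared(*groups) -> float:
    """Epsilon-squared effect size for a Kruskal-Wallis test: H / (n - 1)."""
    h_stat, _ = stats.kruskal(*groups)
    n_total = sum(len(g) for g in groups)
    return h_stat / (n_total - 1)

# Made-up scores for three hypothetical groups.
toy_groups = (
    [1.2, 0.8, 2.5, 1.9],
    [0.5, 1.1, 0.9, 1.4],
    [2.2, 3.1, 1.7, 2.8],
)
eps_sq = kruskal_epsilon_squared(*toy_groups)
print(f"epsilon-squared: {eps_sq:.3f}")
```

Values near 0 indicate that group membership explains little of the variation in ranks, and values near 1 indicate nearly complete separation of the groups.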

#### Visualizing Share-Type Comparison

In [11]:
share_groups = {"FIXATION": "#66c2a5", "TRIAL": "#fc8d62", "PPT": "#8da0cb"}
metrics = {"FieldWeightedCitationIndex": "FWCI", "MeanNormalizedCitationScore": "MNCS",}
plot_data = (
    combined
    .loc[combined["data_sharing_class"].isin(share_groups.keys())]
    .melt(
        id_vars=['data_sharing_class'],
        value_vars=list(metrics.keys()),
        var_name='name',
        value_name='value'
    )
    .replace(metrics)
)

sharegroup_fig = make_subplots(
    rows=1, cols=len(metrics), column_titles=list(metrics.values()),
    shared_yaxes=True, horizontal_spacing=0.05,
)
annotation_height = plot_data["value"].max()
for c, (metric, metric_label) in enumerate(metrics.items(), start=1):
    for sharing_class, color in share_groups.items():
        subset = plot_data[
            (plot_data["data_sharing_class"] == sharing_class) &
            (plot_data["name"] == metric_label)
        ]
        sharegroup_fig.add_trace(
            row=1, col=c,
            trace=go.Violin(
                x=subset['data_sharing_class'], y=subset['value'],
                name=sharing_class, legendgroup=sharing_class, scalegroup=sharing_class,
                side='positive', points='all', pointpos=-0.25, jitter=0.2,
                fillcolor=color, line=dict(color=color, width=1.5),
                width=1.0, spanmode="hard",
                showlegend=c==1,
                box=dict(visible=False, width=0.5, line=dict(color="gray")),
                meanline=dict(visible=False, color='gray'),
            ))
    sharegroup_fig.update_annotations(
        selector={"text": metric_label},
        font=TITLE_FONT,
        x=-0.25, xanchor="left", xref="x" if c==1 else "x2",
        y=0.95, yanchor="bottom", yref="paper",
    )
    sharegroup_fig.add_annotation(
        row=1, col=c,
        x=1, xanchor="center", xref="x",
        y=annotation_height, yanchor="bottom", yref="y",
        text="n.s.", font=TITLE_FONT,
        showarrow=False,
    )
    sharegroup_fig.add_shape(
        row=1, col=c,
        type="line",
        x0=0.75, x1=1.25,
        y0=annotation_height, y1=annotation_height,
        yref="y",
        line=dict(color="black", width=1),
    )
    sharegroup_fig.update_xaxes(
        row=1, col=c,
        title=dict(text="Type of Shared Data", font=AXIS_TITLE_FONT, standoff=10),
        tickfont=AXIS_TICK_FONT,
        zeroline=False,
    )
    sharegroup_fig.update_yaxes(
        row=1, col=c,
        tickfont=AXIS_TICK_FONT,
        zeroline=False,
    )

sharegroup_fig.update_layout(
    width=1000, height=400,
    title=dict(
        text="Citation Impact by Type of Data Sharing",
        font=TITLE_FONT,
        x=0.5, xanchor="center", y=0.95, yanchor="top"
    ),
    yaxis=dict(
        title=dict(text="Score", font=AXIS_TITLE_FONT, standoff=5),
    ),
    legend=dict(
        visible=False,
        orientation="v",
        title=dict(text="Data Sharing", font=AXIS_TITLE_FONT),
        font=LEGEND_FONT,
        x=0.5, xanchor="center", y=1.0, yanchor="top",
        bgcolor='rgba(255,255,255,0.5)',
        bordercolor='black', borderwidth=1,
    ),
    margin=dict(t=75, b=50, l=50, r=25, pad=0),
    template="plotly_white"
)
sharegroup_fig.show()

## Citation Counts over Time
### (1) Regress each Sharing Group Separately

In [12]:
all_share_groups = {**share_groups, "UNKNOWN": "red", "NONE": "gray"}
citation_counts_fig = go.Figure()
for group, color in all_share_groups.items():
    subset = combined[combined["data_sharing_class"] == group]
    symbol = 'circle' if group != "NONE" else 'x'
    citation_counts_fig.add_trace(go.Scatter(
        name=group, legendgroup=group,
        x=subset["Pub2UpdateTime"] / pd.Timedelta(weeks=1), y=subset["TotalCitations"],
        mode="markers", marker=dict(color=color, size=8, symbol=symbol),
    ))
    # fit log-log trendline per group
    x_log = np.log(subset["Pub2UpdateTime"] / pd.Timedelta(weeks=1))
    y_log = np.log(subset['TotalCitations'] + 1)  # add 1 to avoid log(0)
    slope, intercept = np.polyfit(x_log, y_log, deg=1)
    x_logspace = np.linspace(x_log.min(), x_log.max(), 100)
    y_logpred = intercept + slope * x_logspace
    y_pred = np.exp(y_logpred) - 1  # invert log transform
    x_orig = np.exp(x_logspace)
    citation_counts_fig.add_trace(go.Scatter(
        name=f"Trendline ({group})",
        x=x_orig, y=y_pred, mode="lines", line=dict(color=color, width=2, dash='dot'),
        showlegend=False,
    ))
    print("Group: {0: <12}".format(group) + f"Slope: {slope:.3f},\tIntercept: {intercept:.2f}")

# add general log-log trendline
x_log = np.log(combined["Pub2UpdateTime"] / pd.Timedelta(weeks=1))
y_log = np.log(combined['TotalCitations'] + 1)
slope, intercept = np.polyfit(x_log, y_log, deg=1)
x_logspace = np.linspace(x_log.min(), x_log.max(), 100)
y_logpred = intercept + slope * x_logspace
y_pred = np.exp(y_logpred) - 1
x_orig = np.exp(x_logspace)
citation_counts_fig.add_trace(go.Scatter(
    name="Trendline (all data)",
    x=x_orig, y=y_pred, mode="lines", line=dict(color='black', width=2, dash='dash'),
))
print("{0: <19}".format("All Data:") + f"Slope: {slope:.3f},\tIntercept: {intercept:.2f}")

citation_counts_fig.update_layout(
    width=800, height=300,
    title=dict(
        text="Citation Counts over Time Since Publication", font=TITLE_FONT,
        x=0.5, xanchor="center", y=0.95, yanchor="top",
    ),
    xaxis=dict(
        title=dict(text="Weeks Since Publication", font=AXIS_TITLE_FONT, standoff=10),
        tickfont=AXIS_TICK_FONT,
        # type="log",
        zeroline=False,
    ),
    yaxis=dict(
        title=dict(text="Total Citation Counts", font=AXIS_TITLE_FONT, standoff=5),
        tickfont=AXIS_TICK_FONT,
        # type="log",
        zeroline=True,
    ),
    legend=dict(
        orientation="h", bgcolor='rgba(0,0,0,0)',
        x=0.5, xanchor="center", y=0.96, yanchor="bottom",
    ),
    margin=dict(t=50, b=10, l=10, r=10, pad=0),
)
citation_counts_fig

Group: FIXATION    Slope: 1.500,	Intercept: -5.69
Group: TRIAL       Slope: 0.783,	Intercept: -1.90
Group: PPT         Slope: 1.360,	Intercept: -4.97
Group: UNKNOWN     Slope: 1.243,	Intercept: -4.47
Group: NONE        Slope: 0.916,	Intercept: -2.63
All Data:          Slope: 1.018,	Intercept: -3.16
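
One way to read these slopes: in a log-log fit, doubling the time since publication multiplies the fitted value of citations + 1 by $2^{\text{slope}}$. The sketch below applies this to the slopes printed above. `slopes` and `doubling_factor` are names introduced here for illustration.

```python
# Slopes copied from the printed output above.
slopes = {
    "FIXATION": 1.500, "TRIAL": 0.783, "PPT": 1.360,
    "UNKNOWN": 1.243, "NONE": 0.916, "All Data": 1.018,
}

def doubling_factor(slope: float) -> float:
    """Factor by which the fitted (citations + 1) grows when time doubles."""
    return 2.0 ** slope

for group, slope in slopes.items():
    print(f"{group: <10} x{doubling_factor(slope):.2f} per doubling of weeks")
```

These per-group fits are descriptive only. The UNKNOWN fit rests on just four articles, and no test of differences between slopes is reported here.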


### (2) Quantify the Benefit of Data Sharing
We want to quantify the citation benefit of data sharing over time.<br>
To do so, we first fit a log-log regression model to all data points, predicting citation counts from time since publication: $\log(\text{Citations} + 1) = \beta_0 + \beta_1 \log(\text{Weeks Since Publication}) + \epsilon$ (adding 1 avoids $\log(0)$ for uncited articles).<br>
Then, we compute the residuals (i.e., the difference between observed and predicted log citation counts) for each article, and compare the residuals between sharing and non-sharing articles and across quartiles of publication time.
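
Written out for a single article (the symbols $c_i$, $w_i$, $\hat{c}_i$, and $r_i$ are introduced here for illustration, and the $+1$ offset follows the code below), the residual of article $i$ with $c_i$ citations and $w_i$ weeks since publication is

$$r_i = \log(c_i + 1) - \left(\hat\beta_0 + \hat\beta_1 \log w_i\right), \qquad \frac{c_i + 1}{\hat{c}_i + 1} = e^{r_i},$$

where $\hat{c}_i + 1 = e^{\hat\beta_0} \, w_i^{\hat\beta_1}$ is the fitted value. A residual of $r_i = 0.1$ therefore corresponds to roughly 10% more (offset) citations than expected for an article of that age, since $e^{0.1} \approx 1.105$.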

In [19]:
# fit log-log model to all data
x_log = np.log(combined["Pub2UpdateTime"] / pd.Timedelta(weeks=1))
y_log = np.log(combined['TotalCitations'] + 1)  # add 1 to avoid log(0)
slope, intercept = np.polyfit(x_log, y_log, deg=1)

# compute residuals
ylog_pred = intercept + slope * x_log
log_residuals = (y_log - ylog_pred).rename("citation_log_residuals")
combined_with_residuals = pd.concat([combined, log_residuals], axis=1)

# compare residuals between sharing and non-sharing articles
non_sharing_log_residuals = combined_with_residuals.loc[
    ~combined_with_residuals["is_sharing"], "citation_log_residuals"
]
sharing_log_residuals = combined_with_residuals.loc[
    combined_with_residuals["is_sharing"], "citation_log_residuals"
]
residuals_pair_test = stats.mannwhitneyu(
    non_sharing_log_residuals, sharing_log_residuals, alternative="less"
)

# calculate effect size (rank-biserial correlation / CLES)
n1, n2 = len(non_sharing_log_residuals), len(sharing_log_residuals)
U_nonshare = residuals_pair_test.statistic
U_share = n1 * n2 - U_nonshare
rank_biserial_corr = (U_share - U_nonshare) / (n1 * n2)
cles = U_share / (n1 * n2)

# calculate summary statistics
residuals_summary_stats = (
    combined_with_residuals
    .groupby("is_sharing")["citation_log_residuals"]
    .agg(["count", "median", "mean", "std"])
    .reset_index()
    .rename(columns={"is_sharing": "data_sharing"})
)

# show results
print(residuals_pair_test)
print(f"Effect Sizes:\t\tRBC: {rank_biserial_corr:.3f}\tCLES: {cles:.3f}")
residuals_summary_stats

MannwhitneyuResult(statistic=np.float64(5494.0), pvalue=np.float64(0.07985514886414097))
Effect Sizes:		RBC: 0.112	CLES: 0.556


| | data_sharing | count | median | mean | std |
|---|---|---|---|---|---|
| 0 | False | 149 | -0.00892 | -0.058082 | 0.839696 |
| 1 | True | 83 | 0.136716 | 0.104267 | 0.855465 |
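
The reported effect sizes can be re-derived from the U statistic and the group sizes alone. The sketch below repeats that arithmetic with the values printed above. `effect_sizes_from_u` is a name introduced here for illustration.

```python
def effect_sizes_from_u(u_first: float, n_first: int, n_second: int) -> tuple[float, float]:
    """Rank-biserial correlation and CLES in favor of the second sample."""
    n_pairs = n_first * n_second        # number of (first, second) pairs
    u_second = n_pairs - u_first        # U statistic of the second sample
    rbc = (u_second - u_first) / n_pairs
    cles = u_second / n_pairs
    return rbc, cles

# U of the non-sharing group and the group sizes, copied from the output above.
rbc, cles = effect_sizes_from_u(u_first=5494.0, n_first=149, n_second=83)
print(f"RBC: {rbc:.3f}  CLES: {cles:.3f}")
```

A CLES of 0.556 means that in about 56% of all (sharing, non-sharing) article pairs the sharing article has the larger residual, with ties counted as half. The two measures carry the same information, since RBC = 2 × CLES − 1.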
