This project aims to address the significant challenge of low citation rates and limited readership for journal articles, which has far-reaching implications for the scholarly community and the broader research landscape.

BJEnrik/bigdata-cloudcomputing-from-publication-to-citation

Executive Summary

In today's competitive academic landscape, optimizing research efforts and ensuring the impact of scholarly work is crucial. However, a significant number of journal articles remain uncited and unnoticed, highlighting the need for a system to enhance research visibility and reach.

Our project aims to understand what makes a publication impactful and worthy of citation. By uncovering the factors that contribute to research impact, we can develop strategies to increase the visibility and recognition of scholarly articles.

We are driven by the need to address the challenges of research impact and bridge the gap between published research and its actual influence. Leveraging machine learning techniques and data analysis, we aim to develop a predictive model that can identify the factors influencing the citation and readership of scholarly articles. This model will help researchers and institutions prioritize their efforts by focusing on articles with a higher potential for citation and readership.

To achieve this, we are using the April 2023 Public Data File from Crossref, a comprehensive dataset containing scholarly metadata from various publications. This dataset provides standardized access to scholarly information for analysis and research purposes.

During the analysis, we compared the performance of three models: random forest, gradient boosting trees, and multilayer perceptrons. The gradient boosting trees model performed best, with a train accuracy of 84.92% and a test accuracy of 75.21%. The test accuracy indicates that the model generalizes reasonably well to unseen data, although the gap of nearly ten percentage points from the train accuracy suggests some overfitting.

For researchers in stable fields, staying updated with emerging fields and exploring interdisciplinary collaborations can open new opportunities. Researchers seeking funding should encourage interdisciplinary collaborations, while those in the field of Data Science can leverage their expertise to contribute to the growing demand in the field.

In conclusion, our project addresses the challenges of research impact by developing a predictive model that identifies the factors influencing citation and readership. We provide valuable insights into the determinants of article impact, empowering researchers to make informed decisions about their publishing strategies. This leads to increased visibility, recognition, and wider dissemination of their work.


Problem Statement

Our project aims to address the significant challenge of low citation rates and limited readership for journal articles, which has far-reaching implications for the scholarly community and the broader research landscape. What are the factors that contribute to higher citation rates and readership? By answering this question, our study has the potential to unlock the hidden value of research, optimize research efforts, and foster a more efficient and impactful research ecosystem. How do these factors affect research policy? How can they empower researchers to facilitate collaboration and accelerate the advancement of knowledge in any field of study?

Overall, we want to answer one question: what determines the impact and citation-worthiness of a research publication, which is crucial for a researcher's recognition and access to opportunities? In answering it, we attempt to address the challenge of low citation rates and limited readership, their implications for the scholarly community and the research landscape, and the hidden value of research that can drive policy changes in collaboration, knowledge sharing, and accelerated advancement.


Motivation

In today's highly competitive and multidisciplinary [5] academic landscape, where time and resources are scarce, it becomes crucial to optimize research efforts and ensure the impactful dissemination of scholarly work. Surprisingly, studies have shown that a significant portion of journal articles, approximately 90% [1][2], remain uncited [3], while around 50% go largely unnoticed by anyone beyond their authors and editors. This alarming trend highlights the need for a robust and efficient system to enhance the visibility and reach of research outputs.

The motivation behind our project lies in addressing these challenges and bridging the gap between the vast amount of published research and its actual impact.

By leveraging machine learning techniques and data analysis, we aim to develop a predictive model that can identify the factors influencing the citation [4] and readership of scholarly articles. This model will help researchers and institutions prioritize their efforts by focusing on articles with a higher potential for citation and readership. By providing valuable insights into the determinants of article impact, we hope to empower researchers to make informed decisions about their publishing strategies, ultimately leading to increased visibility, recognition, and the wider dissemination of their work.


Data Source

Original Data Source

The original data source for the April 2023 Public Data File is Crossref.

Crossref is a non-profit organization that provides DOIs and comprehensive metadata for academic and scholarly content. It collaborates with publishers, institutions, and researchers to ensure the accurate and reliable dissemination of scholarly information.

April 2023 Public Data File from Crossref

The April 2023 Public Data File from Crossref is a comprehensive dataset that contains scholarly metadata from various publications. It includes bibliographic information such as titles, authors, publication dates, DOIs (Digital Object Identifiers), abstracts, and citation data. This dataset aims to provide researchers, developers, and stakeholders with standardized access to scholarly information for analysis and research purposes.

Download Details

The dataset can be downloaded from Academic Torrents.

Academic Torrents is a platform dedicated to sharing and distributing large scientific datasets. It collaborates with researchers, institutions, and organizations to make academic data openly accessible, fostering data-driven research and promoting collaboration among scholars worldwide.

Crossref Unified Resource API

A Resource API is also available for the April 2023 Public Data File, allowing efficient retrieval and linking of scholarly metadata from a vast network of publications.

The API enables researchers to access and integrate the dataset seamlessly into their research workflows, facilitating data-driven analyses and investigations across a wide range of scholarly materials.

Dataset Summary

  • A data file of the public elements from Crossref’s 143.5+ million metadata records.
  • Bibliographic information such as titles, authors, publication dates, DOIs, abstracts, and citation data.
  • Details about the journal or publication venue, such as the journal name, ISSN (International Standard Serial Number), publisher, and volume/issue numbers.
  • Subject terms or phrases that describe the main topics covered in the publication.
  • Access information: whether the full text of the publication is freely accessible or requires a subscription or purchase.
  • Details about the funding sources that supported the research or publication.
  • Licensing terms and usage rights associated with the publication.

AWS S3 Bucket

The data was sourced from Academic Torrents and securely stored in an S3 bucket owned by Capstone Team One. Initially downloaded as JSON files, the data was subsequently converted to Parquet files to take advantage of the columnar format for more efficient storage and analysis.

Amazon Resource Name (ARN)

arn:aws:s3:::s3bucketemr-sydney

AWS Region

ap-southeast-2

AWS CLI Access (AWS Account Required)

aws s3 ls s3://s3bucketemr-sydney/

Figure 1. AWS EMR Cluster.
AWS EMR Cluster used for the development of this project.


Raw Crossref Data

The raw Crossref dataset (see descriptions in Table 1) is a highly nested dataset with multiple layers of variables. In this representation, we will focus on the root variables, providing a glimpse into the structure of the dataset.

| Variable Name | Data Type | Description |
| --- | --- | --- |
| DOI | string | String representing the Digital Object Identifier for the research paper. |
| ISBN | array | Array of International Standard Book Numbers associated with the research paper. |
| ISSN | array | Array of International Standard Serial Numbers associated with the research paper. |
| URL | string | String representing the URL of the research paper. |
| abstract | string | String containing the abstract or summary of the research paper. |
| accepted | struct | Struct containing information about the acceptance status of the research paper. |
| alternative-id | array | Array of alternative identifiers associated with the research paper. |
| approved | string | Struct containing information about the approval status of the research paper. |
| archive | array | Array of archive names where the research paper is stored. |
| article-number | string | String representing the article number of the research paper. |
| assertion | array | Array of assertions made in the research paper. |
| author | array | Array of authors associated with the research paper. |
| award | string | String representing any award received by the research paper. |
| award-start | struct | Struct containing information about the start of the award associated with the research paper. |
| chair | array | Array of chairs or committee members associated with the research paper. |
| clinical-trial-number | array | Array of clinical trial numbers related to the research paper. |
| container-title | array | Array of container titles where the research paper is published. |
| content-created | struct | Struct containing information about the content creation of the research paper. |
| content-domain | struct | Struct containing information about the content domain of the research paper. |
| content-updated | struct | Struct containing information about the content update of the research paper. |
| created | struct | Struct containing information about the creation of the research paper. |
| degree | array | Array of degrees associated with the research paper. |
| deposited | struct | Struct containing information about the deposition of the research paper. |
| description | string | String containing a description or additional details about the research paper. |
| edition-number | string | String representing the edition number of the research paper. |
| editor | array | Array of editors associated with the research paper. |
| event | struct | Struct containing information about the event associated with the research paper. |
| funder | array | Array of funders or funding organizations associated with the research paper. |
| group-title | string | String representing the group title of the research paper. |
| indexed | struct | Struct containing information about the indexing of the research paper. |
| institution | array | Array of institutions associated with the research paper. |
| is-referenced-by-count | long | Long representing the count of references made to the research paper. |
| isbn-type | array | Array of ISBN types associated with the research paper. |
| issn-type | array | Array of ISSN types associated with the research paper. |
| issue | string | String representing the issue number of the research paper. |
| issued | struct | Struct containing information about the issuance of the research paper. |
| journal-issue | struct | Struct containing information about the journal issue of the research paper. |
| language | string | String representing the language of the research paper. |
| license | array | Array of licenses associated with the research paper. |
| link | array | Array of links associated with the research paper. |
| member | string | String representing the member or membership status associated with the research paper. |
| original-title | array | Array of original titles associated with the research paper. |
| page | string | String representing the page numbers of the research paper. |
| part-number | string | String representing the part number of the research paper. |
| posted | struct | Struct containing information about the posting of the research paper. |
| prefix | string | String representing the prefix associated with the research paper. |
| project | array | Array of projects associated with the research paper. |
| published | struct | Struct containing information about the publication of the research paper. |
| published-online | struct | Struct containing information about the online publication of the research paper. |
| published-other | struct | Struct containing information about other types of publication of the research paper. |
| published-print | struct | Struct containing information about the print publication of the research paper. |
| publisher | string | String representing the publisher of the research paper. |
| publisher-location | string | String representing the location of the publisher of the research paper. |
| reference | array | Array of references cited in the research paper. |
| reference-count | long | Long representing the count of references in the research paper. |
| references-count | long | Long representing the count of references associated with the research paper. |
| relation | struct | Struct containing information about the relation or relationship of the research paper. |
| resource | struct | Struct containing information about the resource associated with the research paper. |
| review | struct | Struct containing information about the review process of the research paper. |
| score | double | Double representing the score or rating assigned to the research paper. |
| short-container-title | array | Array of abbreviated container titles where the research paper is published. |
| short-title | array | Array of abbreviated titles associated with the research paper. |
| source | string | String representing the source of the research paper. |
| standards-body | struct | Struct containing information about the standards body associated with the research paper. |
| subject | array | Array of subjects or topics associated with the research paper. |
| subtitle | array | Array of subtitles associated with the research paper. |
| subtype | string | String representing the subtype of the research paper. |
| title | array | Array of titles associated with the research paper. |
| translator | array | Array of translators associated with the research paper. |
| type | string | String representing the type or category of the research paper. |
| update-policy | string | String representing the update policy associated with the research paper. |
| update-to | array | Array of updates associated with the research paper. |
| volume | string | String representing the volume number of the research paper. |

Table 1. Crossref Raw Data Description.
Raw Crossref dataset description.

Final Crossref Data

The final Crossref dataset contains information related to research papers and their attributes. The dataset is structured as shown in Table 2.

| Pillar | Variable Name | Data Type | Description |
| --- | --- | --- | --- |
| Diversity & Quality offered | title | string | Title of the paper |
| Diversity & Quality offered | title_token_length | integer | Number of tokens in the title |
| Diversity & Quality offered | type | string | Type of publication |
| Diversity & Quality offered | abstract | string | Abstract of the paper |
| Diversity & Quality offered | subject_list | array | Domains/fields of the study |
| Diversity & Quality offered | num_subjects | integer | Number of domains/fields of the study |
| Financial Resources | num_funder_awards | integer | Number of awarded funds |
| Financial Resources | num_funders | integer | Number of funders |
| Network/Connections | institution_name | string | Name of the institution that the author is affiliated with |
| Network/Connections | institution_place | string | Location of the institution |
| Network/Connections | num_authors | integer | Number of authors |
| Network/Connections | affiliation_list | array | Name of each affiliation |
| Network/Connections | num_affiliations | integer | Number of affiliations the authors have based on affiliation_id |
| Timeliness & Accessibility | published_year | long | Year when the paper was published |
| Timeliness & Accessibility | published_month | long | Month when the paper was published |
| Timeliness & Accessibility | published_day | long | Day when the paper was published |
| Timeliness & Accessibility | issn_type | string | Type of the International Standard Serial Number (ISSN), which identifies a serial publication such as a journal |
| Opportunity | is-referenced-by-count | long | Number of times the paper was referenced by another |
| Opportunity | cited_pct | double | Citation percentage of the paper |
| Opportunity | opportunity_index | integer | Target label, 0 or 1. 0: upper quantile of cited_pct; 1: lower quantile of cited_pct |

Table 2. Final Crossref Dataset Description.
Project's final dataset description.

The final dataset for this project consists of four main pillars: diversity and quality, financial resources, network and connections, and timeliness and accessibility.

Under the diversity and quality pillar, we have features related to the content of the publications, including the title, domains, subjects, and abstracts of the studies.

The financial resources pillar encompasses features related to funding, such as the number of funding awards for the study and the number of funders.

The network and connections pillar focuses on the affiliations associated with the study, including the authors, institutions, and locations involved.

The timeliness and accessibility pillar includes metadata about the publication, such as the publication date (year, month, and day) and the ISSN type.

Additionally, there is a group of features under the opportunity pillar, which serves as the target variable for this study. These features include the citations count, citations percentage, and the opportunity index, which will be discussed further in subsequent sections.
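
As a rough illustration of how a binary target like opportunity_index could be derived from cited_pct, the sketch below splits papers at a quantile cutoff. The cutoff (the median by default), the helper name `label_opportunity`, and the sample values are assumptions made for illustration; this section does not state the cutoff the project actually uses. The 0/1 coding follows Table 2: 0 for the upper quantile, 1 for the lower.

```python
from statistics import quantiles


def label_opportunity(cited_pct_values, n_quantiles=2):
    """Return 0 for values above the top cut point, 1 otherwise.

    Assumption: the cutoff is the highest cut point when the data is split
    into n_quantiles groups (the median when n_quantiles=2).
    """
    cut_points = quantiles(cited_pct_values, n=n_quantiles, method="inclusive")
    cutoff = cut_points[-1]
    return [0 if value > cutoff else 1 for value in cited_pct_values]


# Hypothetical cited_pct values, not taken from the dataset.
sample_cited_pct = [0.0, 0.1, 0.4, 0.9, 2.5, 7.0]
labels = label_opportunity(sample_cited_pct)
print(labels)
```

Passing `n_quantiles=4` would instead label only the top quartile as 0.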

Overall, this dataset provides valuable information about the diversity and quality of research papers, financial resources allocated to them, network connections of the authors, timeliness and accessibility factors, and the opportunity level of the papers.
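
To make the mapping from the nested raw records to the flat features in Table 2 more concrete, here is a minimal sketch that derives a few of the count features from one record. The function name `extract_features`, the sample record, whitespace tokenization for title_token_length, and counting affiliations by name (rather than by affiliation_id) are all assumptions for illustration; the project's actual pipeline may differ.

```python
def extract_features(record):
    """Flatten one nested Crossref-style record into a few Table 2 features."""
    titles = record.get("title") or []
    title = titles[0] if titles else ""
    authors = record.get("author") or []
    funders = record.get("funder") or []
    subjects = record.get("subject") or []
    # Collect distinct affiliation names across all authors (assumption:
    # names stand in for the affiliation_id mentioned in Table 2).
    affiliations = sorted({
        aff.get("name")
        for author in authors
        for aff in author.get("affiliation") or []
        if aff.get("name")
    })
    return {
        "title": title,
        "title_token_length": len(title.split()),  # whitespace tokens
        "subject_list": subjects,
        "num_subjects": len(subjects),
        "num_funders": len(funders),
        "num_funder_awards": sum(len(f.get("award") or []) for f in funders),
        "num_authors": len(authors),
        "affiliation_list": affiliations,
        "num_affiliations": len(affiliations),
        "is-referenced-by-count": record.get("is-referenced-by-count", 0),
    }


# Hypothetical record shaped like the root variables in Table 1.
sample_record = {
    "title": ["A hypothetical study of citation impact"],
    "author": [
        {"given": "Ana", "family": "Cruz",
         "affiliation": [{"name": "University A"}]},
        {"given": "Ben", "family": "Lim",
         "affiliation": [{"name": "University A"}, {"name": "Institute B"}]},
    ],
    "funder": [{"name": "Example Foundation", "award": ["G-001", "G-002"]}],
    "subject": ["Library and Information Sciences"],
    "is-referenced-by-count": 12,
}
features = extract_features(sample_record)
print(features)
```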


Assumptions and Design Constraints

Proportionality Assumption

  • It is assumed that the proportionality of the target variable in the sampled data reflects the proportionality of the original data. This assumption allows for generalization of the model's performance on the entire dataset based on the sampled data.
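
One way to check, or enforce, this assumption is stratified sampling, where the same fraction is drawn from each class. The sketch below is an illustration only: the source does not say how the sample was drawn, and the helper names, the 10% fraction, and the 300/700 class split are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, label_key, fraction, seed=42):
    """Draw the same fraction from every class so proportions are preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


def class_proportions(rows, label_key):
    """Return the share of rows that fall in each class."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical data: 300 rows of class 0 and 700 rows of class 1.
rows = [{"id": i, "opportunity_index": 0 if i < 300 else 1} for i in range(1000)]
sample = stratified_sample(rows, "opportunity_index", fraction=0.1)

full_props = class_proportions(rows, "opportunity_index")
sample_props = class_proportions(sample, "opportunity_index")
print(full_props, sample_props)
```

Comparing `full_props` with `sample_props` gives a quick sanity check that the sampled target distribution matches the original.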

Modularity

  • The machine learning pipeline is designed with modularity in mind. This allows for flexibility in modifying and reusing specific stages of the pipeline without running the entire pipeline repeatedly.

Memory Efficiency

  • To mitigate potential memory errors, the pipeline is optimized to minimize unnecessary re-computation. By storing intermediate results in separate variables, such as the train_data and test_data, the pipeline can be executed without rerunning the entire process.

Hyperparameter Tuning

  • Hyperparameter tuning is performed separately for each model. This approach promotes modularity and reduces the risk of memory errors by focusing on optimizing individual models rather than running all models as a single pipeline.

Usable Features

  • Features are selected based on their relevance to the prediction task. The design ensures that each pillar is represented by at least two features, which helps capture the essence of each pillar in the analysis.

Baseline Comparison

  • The proportional chance criterion (PCC) is established as a baseline to compare the model's performance. To consider the predictions as good, the accuracy of the model needs to exceed 1.25 times the PCC.
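
The baseline can be sketched as follows. The PCC is the sum of squared class proportions, and a model counts as good when its accuracy exceeds 1.25 times that value. The class counts below are hypothetical (a balanced two-class split); the actual class distribution is not given in this section, so the resulting threshold is illustrative only.

```python
def proportional_chance_criterion(class_counts):
    """PCC: sum of squared class proportions."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


def beats_baseline(accuracy, class_counts, factor=1.25):
    """True when accuracy exceeds factor * PCC."""
    return accuracy > factor * proportional_chance_criterion(class_counts)


# Hypothetical balanced classes; replace with the real target counts.
example_counts = [500, 500]
pcc = proportional_chance_criterion(example_counts)
threshold = 1.25 * pcc
print(f"PCC={pcc:.4f}, threshold={threshold:.4f}")

# 0.7521 is the gradient boosting test accuracy reported in the summary.
print(beats_baseline(0.7521, example_counts))
```

Under this hypothetical balanced split, the threshold is 0.625, which the reported test accuracy of 75.21% would clear.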

Performance Evaluation

  • The performance of each model is evaluated using various metrics, including accuracy, precision, recall, and F1-score. These metrics provide insights into the model's predictive capabilities and its ability to correctly classify instances.
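
For reference, the four metrics can be computed from predicted and true labels as in the sketch below. It treats one class as the positive class (class 1 here, an assumption); the project may instead use weighted or per-class averages, which this section does not specify. The label vectors are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```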

Feature Importance

  • For tree-based models (random forest and gradient boosting trees), the feature importance is extracted to identify the most influential features. This analysis helps understand the contribution of each feature towards the model's predictions.

Neural Network Model

  • A multilayer perceptron (MLP) model with two hidden layers is employed. Hyperparameter tuning is performed by varying parameters such as maxIter and blockSize to optimize the model's performance.
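
A tuning grid over maxIter and blockSize might be enumerated as in the sketch below. The parameter names mirror those of Spark ML's MultilayerPerceptronClassifier, which the names in this section suggest but do not confirm; the helper `mlp_param_grid`, the feature count, the hidden layer sizes, and the candidate values are all hypothetical.

```python
from itertools import product


def mlp_param_grid(num_features, hidden_sizes, max_iters, block_sizes,
                   num_classes=2):
    """Build one parameter dict per (maxIter, blockSize) combination."""
    # Layer sizes: input layer, the hidden layers, then the output layer.
    layers = [num_features, *hidden_sizes, num_classes]
    return [
        {"layers": list(layers), "maxIter": max_iter, "blockSize": block_size}
        for max_iter, block_size in product(max_iters, block_sizes)
    ]


# Hypothetical setup: 10 input features and two hidden layers (8 and 4 units).
grid = mlp_param_grid(
    num_features=10,
    hidden_sizes=(8, 4),
    max_iters=[50, 100],
    block_sizes=[64, 128],
)
for params in grid:
    print(params)
```

Each dict could then be used to configure and fit one candidate model, keeping the tuning of this model separate from the others.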

Interpretability

  • The weights of the MLP model are extracted to provide feature importance, allowing for interpretability of the model's predictions.
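
One common heuristic for turning MLP weights into feature importance is to sum the absolute first-layer weights attached to each input feature and normalize the totals. The sketch below shows that heuristic only; this section does not say how the project converts weights into importances, and the weight matrix and feature selection here are made up for illustration.

```python
def input_weight_importance(first_layer_weights, feature_names):
    """Rank features by their share of absolute first-layer weight.

    first_layer_weights[i][j] is the weight from input feature i
    to hidden unit j.
    """
    raw = [sum(abs(w) for w in row) for row in first_layer_weights]
    total = sum(raw)
    if total == 0:
        return [(name, 0.0) for name in feature_names]
    scores = {name: value / total for name, value in zip(feature_names, raw)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights for three input features and two hidden units.
weights = [
    [0.5, -1.5],   # num_authors
    [0.25, 0.25],  # num_funders
    [-1.0, 0.5],   # num_subjects
]
names = ["num_authors", "num_funders", "num_subjects"]
ranking = input_weight_importance(weights, names)
print(ranking)
```

Scores from this heuristic are only a rough guide, since later layers can amplify or cancel the effect of first-layer weights.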

Comparison with Existing Models

  • The performance of the MLP model is compared against the results obtained from the random forest and gradient boosting trees models. This comparison helps assess whether the MLP model provides better accuracy and predictive capabilities.

These assumptions and design constraints provide the foundation for developing and evaluating the machine learning models. They consider memory efficiency, modularity, feature relevance, interpretability, and performance evaluation to ensure accurate and efficient predictions based on the provided information.


Read the full content in the accompanying notebook.
