In today's competitive academic landscape, optimizing research efforts and ensuring the impact of scholarly work is crucial. However, a significant number of journal articles remain uncited and unnoticed, highlighting the need for a system to enhance research visibility and reach.
Our project aims to understand what makes a publication impactful and worthy of citation. By uncovering the factors that contribute to research impact, we can develop strategies to increase the visibility and recognition of scholarly articles.
We are driven by the need to address the challenges of research impact and bridge the gap between published research and its actual influence. Leveraging machine learning techniques and data analysis, we aim to develop a predictive model that can identify the factors influencing the citation and readership of scholarly articles. This model will help researchers and institutions prioritize their efforts by focusing on articles with a higher potential for citation and readership.
To achieve this, we are using the April 2023 Public Data File from Crossref, a comprehensive dataset containing scholarly metadata from various publications. This dataset provides standardized access to scholarly information for analysis and research purposes.
During the analysis, we compared the performance of three models: random forest, gradient boosting trees, and a multilayer perceptron. The gradient boosting trees model performed best, with a train accuracy of 84.92% and a test accuracy of 75.21%. The gap of roughly ten percentage points suggests some overfitting, but the test accuracy indicates that the model retains useful predictive power on unseen data.
For researchers in stable fields, staying updated with emerging fields and exploring interdisciplinary collaborations can open new opportunities. Researchers seeking funding should encourage interdisciplinary collaborations, while those in the field of Data Science can leverage their expertise to contribute to the growing demand in the field.
In conclusion, our project addresses the challenges of research impact by developing a predictive model that identifies the factors influencing citation and readership. We provide valuable insights into the determinants of article impact, empowering researchers to make informed decisions about their publishing strategies. This leads to increased visibility, recognition, and wider dissemination of their work.
Our project aims to address the significant challenge of low citation rates and limited readership for journal articles, which has far-reaching implications for the scholarly community and the broader research landscape. What are the factors that contribute to higher citation rates and readership? By answering this very question, our study has the potential to unlock the hidden value of research, optimize research efforts, and foster a more efficient and impactful research ecosystem. How do these factors affect the policies when it comes to research? How can these factors empower researchers in facilitating collaboration and in accelerating advancement in knowledge, in any field of study?
Overall, we want to answer one question: what determines the impact and citation-worthiness of a research publication, both of which are crucial for a researcher's recognition and access to opportunities? In answering it, we aim to address the challenge of low citation rates and limited readership, their implications for the scholarly community and the research landscape, and the hidden value of research that could drive policy changes in collaboration, knowledge sharing, and accelerated advancement.
In today's highly competitive and multidisciplinary [5] academic landscape, where time and resources are scarce, it becomes crucial to optimize research efforts and ensure the impactful dissemination of scholarly work. Surprisingly, studies have shown that a significant portion of journal articles, approximately 90% [1][2], remain uncited [3], while around 50% go largely unnoticed by anyone beyond their authors and editors. This alarming trend highlights the need for a robust and efficient system to enhance the visibility and reach of research outputs.
The motivation behind our project lies in addressing these challenges and bridging the gap between the vast amount of published research and its actual impact.
By leveraging machine learning techniques and data analysis, we aim to develop a predictive model that can identify the factors influencing the citation [4] and readership of scholarly articles. This model will help researchers and institutions prioritize their efforts by focusing on articles with a higher potential for citation and readership. By providing valuable insights into the determinants of article impact, we hope to empower researchers to make informed decisions about their publishing strategies, ultimately leading to increased visibility, recognition, and the wider dissemination of their work.
Original Data Source
The original data source for the April 2023 Public Data File is Crossref.
Crossref is a non-profit organization that provides DOIs and comprehensive metadata for academic and scholarly content. They collaborate with publishers, institutions, and researchers to ensure the accurate and reliable dissemination of scholarly
April 2023 Public Data File from Crossref
The April 2023 Public Data File from Crossref is a comprehensive dataset that contains scholarly metadata from various publications. It includes bibliographic information such as titles, authors, publication dates, DOIs (Digital Object Identifiers), abstracts, and citation data. This dataset aims to provide researchers, developers, and stakeholders with standardized access to scholarly information for analysis and research purposes.
Download Details
The dataset can be downloaded from Academic Torrents.
Academic Torrents is a platform dedicated to sharing and distributing large scientific datasets. It collaborates with researchers, institutions, and organizations to make academic data openly accessible, fostering data-driven research and promoting collaboration among scholars worldwide.
Crossref Unified Resource API
A Resource API is also available for the April 2023 Public Data File, allowing efficient retrieval and linking of scholarly metadata from a vast network of publications.
The API enables researchers to access and integrate the dataset seamlessly into their research workflows, facilitating data-driven analyses and investigations across a wide range of scholarly materials.
Dataset Summary
- Contains the public elements of Crossref’s 143.5+ million metadata records.
- Includes bibliographic information such as titles, authors, publication dates, DOIs, abstracts, and citation data.
- Includes details about the journal or publication venue, such as the journal name, ISSN (International Standard Serial Number), publisher, and volume/issue numbers.
- Includes terms or phrases that describe the main topics or subjects covered in the publication.
- Includes information on whether the full text of the publication is freely accessible or requires a subscription or purchase.
- Includes details about the funding sources that supported the research or publication.
- Includes information about the licensing terms and usage rights associated with the publication.
AWS S3 Bucket
The data was sourced from Academic Torrents and stored in an S3 bucket owned by Capstone Team One. It was initially downloaded as JSON files and then converted to Parquet files, whose columnar format optimizes storage and analysis.
Amazon Resource Name (ARN)
arn:aws:s3:::s3bucketemr-sydney
AWS Region
ap-southeast-2
AWS CLI Access (AWS Account Required)
aws s3 ls s3://s3bucketemr-sydney/
Figure 1. AWS EMR Cluster.
AWS EMR Cluster used for the development of this project.
The raw Crossref dataset (see descriptions in Table 1) is a highly nested dataset with multiple layers of variables. In this representation, we will focus on the root variables, providing a glimpse into the structure of the dataset.
Variable Name | Data Type | Description |
---|---|---|
DOI | string | String representing the Digital Object Identifier for the research paper. |
ISBN | array | Array of International Standard Book Numbers associated with the research paper. |
ISSN | array | Array of International Standard Serial Numbers associated with the research paper. |
URL | string | String representing the URL of the research paper. |
abstract | string | String containing the abstract or summary of the research paper. |
accepted | struct | Struct containing information about the acceptance status of the research paper. |
alternative-id | array | Array of alternative identifiers associated with the research paper. |
approved | string | Struct containing information about the approval status of the research paper. |
archive | array | Array of archive names where the research paper is stored. |
article-number | string | String representing the article number of the research paper. |
assertion | array | Array of assertions made in the research paper. |
author | array | Array of authors associated with the research paper. |
award | string | String representing any award received by the research paper. |
award-start | struct | Struct containing information about the start of the award associated with the research paper. |
chair | array | Array of chairs or committee members associated with the research paper. |
clinical-trial-number | array | Array of clinical trial numbers related to the research paper. |
container-title | array | Array of container titles where the research paper is published. |
content-created | struct | Struct containing information about the content creation of the research paper. |
content-domain | struct | Struct containing information about the content domain of the research paper. |
content-updated | struct | Struct containing information about the content update of the research paper. |
created | struct | Struct containing information about the creation of the research paper. |
degree | array | Array of degrees associated with the research paper. |
deposited | struct | Struct containing information about the deposition of the research paper. |
description | string | String containing a description or additional details about the research paper. |
edition-number | string | String representing the edition number of the research paper. |
editor | array | Array of editors associated with the research paper. |
event | struct | Struct containing information about the event associated with the research paper. |
funder | array | Array of funders or funding organizations associated with the research paper. |
group-title | string | String representing the group title of the research paper. |
indexed | struct | Struct containing information about the indexing of the research paper. |
institution | array | Array of institutions associated with the research paper. |
is-referenced-by-count | long | Long representing the count of references made to the research paper. |
isbn-type | array | Array of ISBN types associated with the research paper. |
issn-type | array | Array of ISSN types associated with the research paper. |
issue | string | String representing the issue number of the research paper. |
issued | struct | Struct containing information about the issuance of the research paper. |
journal-issue | struct | Struct containing information about the journal issue of the research paper. |
language | string | String representing the language of the research paper. |
license | array | Array of licenses associated with the research paper. |
link | array | Array of links associated with the research paper. |
member | string | String representing the member or membership status associated with the research paper. |
original-title | array | Array of original titles associated with the research paper. |
page | string | String representing the page numbers of the research paper. |
part-number | string | String representing the part number of the research paper. |
posted | struct | Struct containing information about the posting of the research paper. |
prefix | string | String representing the prefix associated with the research paper. |
project | array | Array of projects associated with the research paper. |
published | struct | Struct containing information about the publication of the research paper. |
published-online | struct | Struct containing information about the online publication of the research paper. |
published-other | struct | Struct containing information about other types of publication of the research paper. |
published-print | struct | Struct containing information about the print publication of the research paper. |
publisher | string | String representing the publisher of the research paper. |
publisher-location | string | String representing the location of the publisher of the research paper. |
reference | array | Array of references cited in the research paper. |
reference-count | long | Long representing the count of references in the research paper. |
references-count | long | Long representing the count of references associated with the research paper. |
relation | struct | Struct containing information about the relation or relationship of the research paper. |
resource | struct | Struct containing information about the resource associated with the research paper. |
review | struct | Struct containing information about the review process of the research paper. |
score | double | Double representing the score or rating assigned to the research paper. |
short-container-title | array | Array of abbreviated container titles where the research paper is published. |
short-title | array | Array of abbreviated titles associated with the research paper. |
source | string | String representing the source of the research paper. |
standards-body | struct | Struct containing information about the standards body associated with the research paper. |
subject | array | Array of subjects or topics associated with the research paper. |
subtitle | array | Array of subtitles associated with the research paper. |
subtype | string | String representing the subtype of the research paper. |
title | array | Array of titles associated with the research paper. |
translator | array | Array of translators associated with the research paper. |
type | string | String representing the type or category of the research paper. |
update-policy | string | String representing the update policy associated with the research paper. |
update-to | array | Array of updates associated with the research paper. |
volume | string | String representing the volume number of the research paper. |
Table 1. Crossref Raw Data Description.
Raw Crossref dataset description.
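As a rough illustration of how the root-level view in Table 1 can be produced, the sketch below maps each top-level key of one parsed JSON record to a type label in the style of the table. The helper names `spark_like_type` and `root_schema` and the sample record are hypothetical, invented for this example; the project's actual schema was presumably inferred by its own tooling, which this section does not describe.

```python
import json


def spark_like_type(value):
    """Map a parsed JSON value to a type label in the style of Table 1."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "null"


def root_schema(record):
    """Return {root variable name: type label}, ignoring nested layers."""
    return {key: spark_like_type(value) for key, value in sorted(record.items())}


# Hypothetical record, invented for illustration only.
sample = json.loads("""
{
  "DOI": "10.0000/example",
  "title": ["An example title"],
  "author": [{"given": "A.", "family": "Author"}],
  "issued": {"date-parts": [[2023, 4, 1]]},
  "is-referenced-by-count": 3,
  "score": 0.0
}
""")

schema = root_schema(sample)
for name, type_label in schema.items():
    print(f"{name} | {type_label}")
```

Only the outermost layer is inspected, which mirrors the focus on root variables; nested fields such as the entries of `author` would need a recursive walk.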
The final Crossref dataset contains information related to research papers and their attributes. The dataset is structured as shown in Table 2.
Pillar | Variable Name | Data Type | Description |
---|---|---|---|
Diversity & Quality offered | title | string | Title of the paper |
Diversity & Quality offered | title_token_length | integer | Number of tokens in the title of the paper |
Diversity & Quality offered | type | string | Type of publication |
Diversity & Quality offered | abstract | string | Abstract of the paper |
Diversity & Quality offered | subject_list | array | Domains/fields of the study |
Diversity & Quality offered | num_subjects | integer | Number of domains/fields of the study |
Financial Resources | num_funder_awards | integer | Number of awarded funds |
Financial Resources | num_funders | integer | Number of funders |
Network/Connections | institution_name | string | Name of the institution that the author is affiliated with |
Network/Connections | institution_place | string | Location of the institution |
Network/Connections | num_authors | integer | Number of authors |
Network/Connections | affiliation_list | array | Name of each affiliation |
Network/Connections | num_affiliations | integer | Number of affiliations the authors have based on affiliation_id |
Timeliness & Accessibility | published_year | long | Year when the paper was published |
Timeliness & Accessibility | published_month | long | Month when the paper was published |
Timeliness & Accessibility | published_day | long | Day when the paper was published |
Timeliness & Accessibility | issn_type | string | Type of the International Standard Serial Number (ISSN), which identifies a serial publication such as a journal |
Opportunity | is-referenced-by-count | long | Number of times the paper was referenced by another |
Opportunity | cited_pct | double | Citation percentage of the paper |
Opportunity | opportunity_index | integer | Target label, 0 or 1. 0: upper quantile of cited_pct, 1: lower quantile of cited_pct |
Table 2. Final Crossref Dataset Description.
Project's final dataset description.
The final dataset for this project consists of four main pillars: diversity and quality, financial resources, network and connections, and timeliness and accessibility.
Under the diversity and quality pillar, we have features related to the content of the publications, including the title, domains, subjects, and abstracts of the studies.
The financial resources pillar encompasses features related to funding, such as the number of funder awards and the number of funders.
The network and connections pillar focuses on the affiliations associated with the study, including the authors, institutions, and locations involved.
The timeliness and accessibility pillar includes metadata about the publication, such as the publication date and the ISSN type.
Additionally, there is a group of features under the opportunity pillar, which serves as the target for this study. These features include the citation count, the citation percentage, and the opportunity index, which will be discussed further in subsequent sections.
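To make the opportunity index concrete, here is a minimal sketch of a binary label derived from the citation percentage. The cut-off is an assumption: this section does not state which quantile separates the upper group from the lower group, so the median is used as a stand-in. The function name `opportunity_index` and the sample values are hypothetical.

```python
from statistics import median


def opportunity_index(cited_pct_values, threshold=None):
    """Label each paper 0 (upper part of cited_pct) or 1 (lower part).

    ASSUMPTION: the median is used as the cut-off unless a threshold is
    supplied; the project's actual quantile is not given in this section.
    """
    if threshold is None:
        threshold = median(cited_pct_values)
    return [0 if value > threshold else 1 for value in cited_pct_values]


# Hypothetical citation percentages, invented for illustration only.
cited_pct = [0.0, 0.5, 2.0, 10.0, 35.0, 80.0]
labels = opportunity_index(cited_pct)
print(labels)
```

Passing an explicit `threshold` lets the same helper reproduce a different quantile split once the real cut-off is known.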
Overall, this dataset provides valuable information about the diversity and quality of research papers, financial resources allocated to them, network connections of the authors, timeliness and accessibility factors, and the opportunity level of the papers.
Proportionality Assumption
It is assumed that the proportionality of the target variable in the sampled data reflects the proportionality of the original data. This assumption allows for generalization of the model's performance on the entire dataset based on the sampled data.
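The proportionality assumption can be checked, or made to hold by construction, with a stratified sample. The sketch below is one way to do that in plain Python; it is not confirmed to be the sampling method used in this project, and the 70/30 class split, the 10% fraction, and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices, sampled separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        k = round(len(indices) * fraction)  # per-class sample size
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def proportions(labels):
    """Return the share of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical target column with a 70/30 class split.
population = [0] * 700 + [1] * 300
sample_idx = stratified_sample(population, fraction=0.1)
sample_labels = [population[i] for i in sample_idx]

print(proportions(population))
print(proportions(sample_labels))
```

Comparing the two printed dictionaries is the check itself: if the sample was drawn some other way, a large difference between them would signal that the assumption does not hold.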
Modularity
The machine learning pipeline is designed with modularity in mind. This allows for flexibility in modifying and reusing specific stages of the pipeline without running the entire pipeline repeatedly.
Memory Efficiency
To mitigate potential memory errors, the pipeline is optimized to minimize unnecessary re-computation. By storing intermediate results in separate variables, such as the train_data and test_data, the pipeline can be executed without rerunning the entire process.
Hyperparameter Tuning
Hyperparameter tuning is performed separately for each model. This approach promotes modularity and reduces the risk of memory errors by focusing on optimizing individual models rather than running all models as a single pipeline.
Usable Features
Features are selected based on their relevance to the prediction task. The design ensures that each pillar is represented by at least two features, which helps capture the essence of each pillar in the analysis.
Baseline Comparison
The proportional chance criterion (PCC) is established as a baseline to compare the model's performance. To consider the predictions as good, the accuracy of the model needs to exceed 1.25 times the PCC.
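For reference, the proportional chance criterion is the sum of the squared class proportions, and the acceptance rule multiplies it by 1.25. The sketch below works through that arithmetic. The balanced label list is hypothetical, since the class balance is not stated in this section, and the helper names are illustrative.

```python
from collections import Counter


def proportional_chance_criterion(labels):
    """PCC: sum over classes of (class share) squared."""
    counts = Counter(labels)
    total = len(labels)
    return sum((count / total) ** 2 for count in counts.values())


def beats_baseline(accuracy, labels, factor=1.25):
    """True when accuracy exceeds factor times the PCC."""
    return accuracy > factor * proportional_chance_criterion(labels)


# Hypothetical balanced target; the real class balance may differ.
labels = [0] * 50 + [1] * 50
pcc = proportional_chance_criterion(labels)

print(pcc)                             # 0.5
print(1.25 * pcc)                      # 0.625
print(beats_baseline(0.7521, labels))  # True
```

Under the balanced-class assumption the bar is 62.5% accuracy; with imbalanced classes the PCC rises, so the bar rises with it.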
Performance Evaluation
The performance of each model is evaluated using various metrics, including accuracy, precision, recall, and F1-score. These metrics provide insights into the model's predictive capabilities and its ability to correctly classify instances.
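The four metrics follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them for a binary target; the label vectors are hypothetical, and treating class 1 as the positive class is an assumption made for the example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The guards against empty denominators return 0.0 rather than raising, which keeps the helper usable when a model predicts only one class.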
Feature Importance
For tree-based models (random forest and gradient boosting trees), the feature importance is extracted to identify the most influential features. This analysis helps understand the contribution of each feature towards the model's predictions.
Neural Network Model
A multilayer perceptron (MLP) model with two hidden layers is employed. Hyperparameter tuning is performed by varying parameters such as maxIter and blockSize to optimize the model's performance.
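The tuning loop itself can be sketched independently of any training library. In the example below, `maxIter` and `blockSize` are the parameter names mentioned above, while the candidate values, the `grid_search` helper, and the `toy_score` function are hypothetical. `toy_score` is only a stand-in for training the MLP and returning a validation score.

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Try every parameter combination; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the project's actual grid is not given.
mlp_grid = {"maxIter": [50, 100, 200], "blockSize": [64, 128]}


def toy_score(params):
    """Stand-in for 'train the MLP with params, return validation accuracy'."""
    return (1.0
            - abs(params["maxIter"] - 100) / 1000
            - abs(params["blockSize"] - 128) / 10000)


best_params, best_score = grid_search(mlp_grid, toy_score)
print(best_params, best_score)
```

Because the scoring function is passed in as a callable, the same loop can be reused for each model separately, in line with the modular design described earlier.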
Interpretability
The weights of the MLP model are extracted to provide feature importance, allowing for interpretability of the model's predictions.
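This section does not say how the weights are turned into importance scores. One simple heuristic, shown here purely as an assumption, is to sum the absolute first-layer weights leaving each input feature and normalize the totals to shares. The weight values are hypothetical; the feature names are taken from Table 2.

```python
def first_layer_importance(weights, feature_names):
    """Rank features by their share of absolute first-layer weight.

    weights[i][j] is the weight from input feature i to hidden unit j.
    ASSUMPTION: this heuristic is illustrative, not the project's method.
    """
    totals = [sum(abs(w) for w in row) for row in weights]
    grand_total = sum(totals)
    shares = [t / grand_total if grand_total else 0.0 for t in totals]
    return sorted(zip(feature_names, shares),
                  key=lambda item: item[1], reverse=True)


features = ["num_authors", "num_funders", "num_subjects"]
# Hypothetical weights for 3 input features and 2 hidden units.
weights = [
    [0.5, -1.5],
    [0.25, 0.25],
    [-1.0, 0.5],
]

ranking = first_layer_importance(weights, features)
for name, share in ranking:
    print(f"{name}: {share:.3f}")
```

A heuristic like this ignores the later layers and is sensitive to feature scaling, so it is best read as a rough ranking rather than a precise attribution.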
Comparison with Existing Models
The performance of the MLP model is compared against the results obtained from the random forest and gradient boosting trees models. This comparison helps assess whether the MLP model provides better accuracy and predictive capabilities.
These assumptions and design constraints provide the foundation for developing and evaluating the machine learning models. They consider memory efficiency, modularity, feature relevance, interpretability, and performance evaluation to ensure accurate and efficient predictions based on the provided information.
Read the full content in the accompanying notebook.