ClassifiedText return an error #84

Open
massimoaria opened this issue Nov 2, 2023 · 2 comments

massimoaria commented Nov 2, 2023

When I run a classification model on the text below (the abstract of a scientific paper), textClassify() returns an error. With other abstracts the function works as expected.

Here is a reprex of the issue:

library(text)
#> This is text (version 1.1.1).
#> Text is new and still rapidly improving.
#>                
#> Newer versions may have improved functions and updated defaults to reflect current understandings of the state-of-the-art.
#>                Please send us feedback based on your experience.
#> 
#> Please note that defaults has changed in the textEmbed-functions since last version; see help(textEmbed) or www.r-text.org for more details.

x <- "OBJECTIVE TO ASSESS THE OVERALL TRENDS IN THE DEVELOPMENT AND CITATION IMPACT OF HIGH-IMPACT PAPERS IN NURSING RESEARCH WORLDWIDE TO GAIN INSIGHT INTO THE FOCUS AREAS OF NURSING RESEARCH. BACKGROUND BIBLIOMETRIC METHOD IS PROVED TO BE EFFECTIVE IN ANALYSING THE PAPERS' CHARACTERISTICS, AND IT GAINED CONSIDERABLE INTEREST FROM THE SCIENTIFIC COMMUNITY IN RECENT YEARS. AN ANALYSIS OF THE CHARACTERISTICS AND INTRINSIC PATTERNS OF HIGH-IMPACT PAPERS IN NURSING RESEARCH WILL PROVIDE AN OBJECTIVE REFLECTION OF THE RESEARCH HOT SPOTS. NURSING MANAGERS CAN POINTEDLY INCREASE FUNDING AMOUNT AND STRENGTHEN RESEARCH COOPERATION IN ORDER TO PUT THE SCIENTIFIC RESULTS INTO MANAGEMENT PRACTICE. METHODS BIBLIOMETRIC METHODS AND VISUALIZATION SOFTWARE WERE USED TO COMPREHENSIVELY ANALYSE HIGH-IMPACT PAPERS IN NURSING RESEARCH IN TERMS OF DEVELOPMENT TRENDS, COUNTRIES/REGIONS, DISTRIBUTION OF SUBJECT AREAS, RESEARCH INSTITUTES, COLLABORATIVE NETWORKS AND SUBJECT TERMS. RESULTS THERE WERE 6,886 PAPERS BETWEEN 2008 AND 2018. THE NUMBER OF PAPERS INCREASED FROM 528 IN 2008 TO 723 IN 2015, AND THEN REMAINED ABOVE 600 IN 2016 AND 2017. THESE PAPERS WERE MAINLY DISTRIBUTED IN NURSING, ONCOLOGY, PAEDIATRICS, GYNAECOLOGY, TEACHING AND EDUCATION, AND CARDIAC AND CARDIOVASCULAR SYSTEMS AND WERE CITED BY 128,845 PAPERS THAT CAME FROM 89 WEB OF SCIENCE SUBJECT AREAS. PAPERS IN NURSING RESEARCH ACCOUNTED FOR THE LARGEST SHARE OF THESE CITATIONS. THE TOP FIVE COUNTRIES IN THE WORLD IN TERMS OF THE NUMBER OF HIGH-IMPACT PAPERS WERE THE UNITED STATES, AUSTRALIA, THE UNITED KINGDOM, CANADA AND SWEDEN. THE RESEARCH INSTITUTIONS WITH THE HIGHEST NUMBER OF HIGH-IMPACT PAPERS WORLDWIDE WERE THE UNIVERSITY OF CALIFORNIA SYSTEM, THE UNIVERSITY OF PENNSYLVANIA, THE UNIVERSITY OF NORTH CAROLINA, THE UNIVERSITY OF LONDON AND THE UNIVERSITY OF TECHNOLOGY SYDNEY. 
IN THIS DATA SET, IT WAS SHOWN THAT RESEARCH COLLABORATIVE CIRCLES HAVE BEEN FORMED IN THE UNITED STATES, AUSTRALIA, CANADA AND EUROPE; THE SUBJECT-TERM ANALYSIS INDICATED THAT 'WOMEN' AND 'STUDENTS' HAVE ALWAYS BEEN HIGH-INTEREST POPULATIONS FOR HIGH-IMPACT PAPERS AND THAT CANCER IS STILL ONE OF THE GREATEST THREATS TO HUMAN HEALTH. FURTHERMORE, THE SUBJECT TERMS OF HIGH-IMPACT PAPERS IN NURSING RESEARCH HAVE GRADUALLY EVOLVED FROM 'DISEASE' AND 'THERAPY' TO 'SYMPTOMS'. CONCLUSION IN RECENT YEARS, THE NUMBER OF HIGH-IMPACT PAPERS PUBLISHED EACH YEAR IN NURSING RESEARCH HAS GROWN OVER TIME. NURSING HAS BEEN SHOWN TO BE A HIGHLY SPECIALIZED SUBJECT, AND THE MAJORITY OF ITS HIGH-IMPACT PAPERS HAVE BEEN PUBLISHED BY RESEARCH INSTITUTIONS. ALTHOUGH CROSS-REGIONAL COLLABORATIONS ARE BEGINNING TO EMERGE, THERE IS MUCH ROOM FOR IMPROVEMENT IN THIS REGARD. FINALLY, WOMEN, STUDENTS, CANCER AND SYMPTOMATIC CARE ARE THE CURRENT FOCUS AREAS IN NURSING RESEARCH. IMPLICATIONS FOR NURSING MANAGEMENT THIS STUDY INFORMS NURSING MANAGERS WITHIN THE NURSING RESEARCH FIELD ABOUT SUBJECT AREAS, COLLABORATIVE NETWORKS AND HOT TOPICS. IT IS BENEFICIAL TO PAY ATTENTION TO STUDIES, MANAGE SCIENTIFIC OUTPUTS, ALLOCATE RESOURCES, SEEK COOPERATION AND IMPROVE THE WORK EFFICIENCY OF SCIENTIFIC RESEARCH MANAGEMENT."

classifiedText <- textClassify(
  x,
  model = "distilbert-base-uncased-finetuned-sst-2-english",
  device = "cpu",
  tokenizer_parallelism = TRUE,
  logging_level = "error",
  return_incorrect_results = FALSE,
  return_all_scores = FALSE,
  function_to_apply = "none",
  set_seed = 202208
)

Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  RuntimeError: The size of tensor a (582) must match the size of tensor b (512) at non-singleton dimension 1
Run `reticulate::py_last_error()` for details.

Created on 2023-11-02 with reprex v2.0.2

@massimoaria
Author

The issue also appears with other models, e.g. "SamLowe/roberta-base-go_emotions", "nlptown/bert-base-multilingual-uncased-sentiment", and "roberta-large-mnli".

@CarlViggo
Contributor

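Models like BERT and RoBERTa accept a maximum sequence length of 512 tokens. According to the error message, this abstract tokenizes to 582 tokens, which is why the tensor sizes do not match. Check out this thread for more info: https://discuss.huggingface.co/t/longformer-and-sentiment-analysis/9416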
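
One possible workaround, as a rough sketch rather than an official fix: split the text into shorter pieces before classifying, so that each piece stays under the 512-token limit. The helper `chunk_by_words()` below is hypothetical and not part of the text package. It assumes that whitespace-separated word count is a usable proxy for token count, which is only approximate because these models use subword tokenizers, so the default budget of 250 words is deliberately conservative.

```r
# Hypothetical helper (not part of the text package): split a long string
# into chunks of at most `max_words` whitespace-separated words.
# Assumption: word count only approximates the model's token count, so the
# budget is kept well below the 512-token limit.
chunk_by_words <- function(x, max_words = 250) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  if (length(words) == 0) {
    return(character(0))
  }
  # Words 1..max_words go to chunk 1, the next max_words to chunk 2, etc.
  chunk_id <- ceiling(seq_along(words) / max_words)
  unname(vapply(split(words, chunk_id), paste, character(1), collapse = " "))
}

# Small demonstration on synthetic input (600 words -> 3 chunks).
demo_text <- paste(rep("word", 600), collapse = " ")
demo_chunks <- chunk_by_words(demo_text, max_words = 250)
print(lengths(strsplit(demo_chunks, " ")))

# With the abstract from the reprex, each chunk could then be classified
# separately (requires the text package and its Python backend):
# chunks  <- chunk_by_words(x)
# results <- lapply(chunks, function(chunk) {
#   textClassify(chunk, model = "distilbert-base-uncased-finetuned-sst-2-english")
# })
```

This produces one classification per chunk. How to combine the per-chunk labels into a single result for the whole abstract is a separate decision that this sketch does not make.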
