# Knowledge Discovery and Pattern Extraction - A3 - Mat. 257832
The objective of this notebook is to try various strategies for extracting knowledge from the "Openreview" dataset (https://github.com/Seafoodair/Openreview) by combining traditional ML with LLMs. In particular, we focus on the "2020" dataset and analyze the papers' reviews to find ways in which authors can improve their own papers.

We mainly showcase two strategies. In the first version, we use some basic NLP techniques to preprocess the dataset and obtain a list of n-grams, which we give as input to the LLM so that it can infer information useful for our purpose. In the second version, we divide the dataset into sentences and represent them as embeddings; we then cluster these representations to extract topics using specific libraries, namely SBERT (https://github.com/UKPLab/sentence-transformers) and BERTopic (https://github.com/MaartenGr/BERTopic). We use this second approach to show a more complex solution to the problem at hand and to test whether topic-based models can be used to infer knowledge nuggets from a more limited context.

In both cases we use a cleaned-up version of the dataset. This reduces the dataset's size, and thus the complexity and duration of the various steps in both strategies. Moreover, after inspecting the original dataset, we have reason to believe that it contains words with no semantic meaning (such as "a", "an", "the", "or", "is") and content that is not useful for inferring knowledge (e.g., the summary of the paper's contents), both of which we can disregard without losing information. This is why we do not showcase a method that uses the entire original dataset. Throughout the notebook, we also try to consider the number of LLM tokens that each approach would use.
Also, we ignore the naive approach to this exercise, which would be to delegate all the tasks to the LLM, for two reasons: <br>
    1. Budget: if the dataset is big enough, this would consume a lot of tokens and require a non-negligible budget for the task <br>
    2. Hallucinations: LLMs work better on a limited amount of data: if ours is too big, performance degrades and the overall results get worse, leading to an unacceptable waste of resources.

# First version - Basic NLP techniques
## Cleaning up the dataset
The first thing we need to do is understand how to clean the dataset in a way that preserves the information we care about. We of course want to automate this process: doing it manually would make our task redundant, since we would be able to infer the knowledge we are seeking by ourselves. The first step is to eliminate the words that carry very little semantic value by using a stop word list, so as to reduce the dataset's size while keeping intact the core concepts expressed in the reviews. For this purpose, we use NLTK's stop words for the English language, extended with a list of words and punctuation marks that have very little semantic content but appear frequently throughout the dataset. Note that this process could be fine-tuned to obtain a better cleaned dataset; for simplicity's sake, we do not do this in the notebook.

## Reducing the dataset's size while maintaining context: Bigram approach
Next, we need to ask ourselves whether we want to keep the relationship between a review and its paper. We have decided not to keep it, because our objective is to find patterns in the reviews that tell us which aspects an author needs to pay attention to while writing a paper, not to find ways to improve one specific paper. Thus, we could transform the review set into a frequency list of the most frequently used words, with a threshold that keeps more technical or rare words out of the final list. On the one hand, this would significantly shrink the input to the LLM, so that we would use far fewer tokens than by feeding it even the cleaned-up review list. On the other hand, by completely disregarding the context we may lose a considerable amount of information, which may make the LLM produce nonsensical output or, in any case, drastically reduce the quality of its answer. A good compromise is to use n-grams, which keep a limited context and thereby preserve some of the information that a plain frequency list would lose. For this reason, we compute the bigrams of every review and then their frequencies, applying a threshold to reduce the size of the input passed to the LLM: we assume that bigrams whose frequency falls below the threshold carry the least useful information and can be discarded. We then save the output to a file and give it as input to the chosen LLM, in this case ChatGPT (we use the chat interface rather than the API because we do not have an API key).
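
To make the trade-off between a plain frequency list and bigrams concrete, here is a minimal sketch on toy data. The reviews below are hypothetical and already cleaned, and `count_ngrams` is an illustrative helper written with the standard library only, not part of NLTK or of the pipeline used in this notebook.

```python
from collections import Counter

# Toy, already-cleaned reviews (hypothetical data, not taken from the dataset).
toy_reviews = [
    ["paper", "well", "written", "limited", "novelty"],
    ["method", "limited", "novelty", "experiments", "limited"],
    ["paper", "well", "written", "experiments", "convincing"],
]

def count_ngrams(token_lists, n):
    """Count n-grams review by review, so that no n-gram spans two reviews."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

unigram_counts = count_ngrams(toy_reviews, 1)
bigram_counts = count_ngrams(toy_reviews, 2)

# Keep only the bigrams whose frequency reaches the threshold.
toy_threshold = 2
frequent_bigrams = {bg: c for bg, c in bigram_counts.items() if c >= toy_threshold}

# "limited" alone is frequent but ambiguous; ("limited", "novelty") says what is limited.
print(unigram_counts[("limited",)])
print(frequent_bigrams)
```

In this toy case the word "limited" is frequent on its own, but only the bigram tells us that reviewers are talking about novelty, which is the kind of context we want the LLM to see.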

In [2]:
#install the required libraries
%pip install pandas
%pip install wget
%pip install openpyxl
%pip install nltk

import pandas as pd
import wget
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt_tab')
nltk.download('stopwords')

#get the dataset
!wget https://github.com/Seafoodair/Openreview/raw/refs/heads/master/data/ICLR%20data/tp_2020conference.xlsx

print("Installed necessary dependencies!")

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/nyasatte/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nyasatte/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


--2025-05-05 15:06:01--  https://github.com/Seafoodair/Openreview/raw/refs/heads/master/data/ICLR%20data/tp_2020conference.xlsx
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Seafoodair/Openreview/refs/heads/master/data/ICLR%20data/tp_2020conference.xlsx [following]
--2025-05-05 15:06:01--  https://raw.githubusercontent.com/Seafoodair/Openreview/refs/heads/master/data/ICLR%20data/tp_2020conference.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10997810 (10M) [application/octet-stream]
Saving to: ‘tp_2020conference.xlsx.3’


2025-05-05 15:06:05 (3,39 MB/s) - 

In [3]:
from nltk.util import bigrams

#load the dataset and extract the review column
df = pd.read_excel("tp_2020conference.xlsx", sheet_name='tp_2020conference', usecols="R", engine='openpyxl')
#transform the Dataframe into a list
reviews = df["review"].tolist()
#tokenize it and lower it
tokenized_reviews = [word_tokenize(i.lower()) for i in reviews]
#print(tokenized_reviews[1])
#declare the stop words list
stop_words = set(stopwords.words('english'))
#add to the stop words the residue introduced while reading the file. We also include punctuation and words like "et al" that are only used to reference other papers
stop_words.update(["review", "#", ".", ":", ";", ",", "-", "?", "!", "(", ")", "[", "]", "'", "=", "*", "+", "%", "/", "{", "}", "--", "...", "&", "et", "al"])
#filter the reviews, throwing away the words included in the stop words list
filtered_reviews = [[word for word in review if word not in stop_words] for review in tokenized_reviews]
#create bigrams to keep a bit of context
bgs = [list(nltk.bigrams(sublist)) for sublist in filtered_reviews]
#now we flatten the bigrams' list to compute the frequency
all_bigrams = [bg for sublist in bgs for bg in sublist]
#we compute the frequency of each bigram
fdist = nltk.FreqDist(all_bigrams)
#we set a threshold and discard all the bigrams that appear fewer than `threshold` times
threshold = 4
filtered_bigrams = [(k, v) for k, v in fdist.items() if v >= threshold]
#we save the bigrams to a file, one "bigram frequency" entry per line
#(the `with` block closes the file automatically)
with open('out.txt', 'w') as output:
    output.write('\n'.join('%s %s' % x for x in filtered_bigrams))
print("Bigrams ready!")

Bigrams ready!
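
Since we want to keep an eye on the number of LLM tokens each approach consumes, a rough estimate can be computed before sending the file. The sketch below assumes the common rule of thumb of about 4 characters per token for English text; this ratio is an assumption, not a measurement, and a file full of quotes and parentheses like ours will likely use more tokens than the estimate suggests. `estimate_tokens` is a hypothetical helper, not part of any library.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate: characters divided by an assumed average token length."""
    return math.ceil(len(text) / chars_per_token)

# Hypothetical lines in the same "n-gram frequency" format written to out.txt.
sample_text = "\n".join([
    "('main', 'contribution') 10",
    "('well', 'written') 20",
])
print(estimate_tokens(sample_text))

# On the real file one could run, for example:
# with open('out.txt') as f:
#     print(estimate_tokens(f.read()))
```

For an exact count, the tokenizer of the chosen model would have to be used instead of this heuristic.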


After generating the input file, we gave it to ChatGPT with the following prompt:
**"The following text file contains bigrams associated with their relative frequency. The following bigrams come from a series of reviews about scientific papers. Your task is to extrapolate information regarding how could an author write better papers using *only* the data provided in the text file."** <br>
The following is the answer that ChatGPT provided, given the text file generated above:
***
Based on the bigram frequency data from the reviews, we can extract several practical suggestions for authors aiming to write better scientific papers:

✅ 1. Clearly Present the Main Contribution

    Bigrams: ('main', 'contribution'), ('contribution', 'paper'), ('paper', 'presents'), ('paper', 'focuses')

    Insight: Reviewers value clarity in the main idea or novelty of the paper. Authors should explicitly state what their core contribution is and why it matters.

✅ 2. Write Clearly and Structure Well

    Bigrams: ('well', 'written'), ('well', 'structured'), ('easy', 'follow'), ('clear', 'explanation')

    Insight: Good papers are easy to read and have a logical flow. Spend time on writing style, clarity, and document organization.

✅ 3. Provide Strong Experimental Evidence

    Bigrams: ('empirical', 'study'), ('ablation', 'study'), ('convincing', 'experiments'), ('experiments', 'show'), ('experimental', 'results')

    Insight: Papers that support claims with thorough and diverse experiments are more convincing. Ablation studies, baselines, and robust empirical comparisons are expected.

✅ 4. Discuss Limitations and Scope Honestly

    Bigrams: ('limited', 'novelty'), ('main', 'concern'), ('concern', 'paper'), ('would', 'helpful')

    Insight: Acknowledging the limitations of your approach demonstrates scientific maturity and often leads to constructive feedback rather than rejection.

✅ 5. Ensure Theoretical Soundness

    Bigrams: ('theoretical', 'analysis'), ('theoretical', 'results'), ('provide', 'theoretical')

    Insight: Especially in ML and related fields, solid theoretical backing adds significant weight. Even empirical papers benefit from theoretical justification.

✅ 6. Compare Against Strong Baselines

    Bigrams: ('compare', 'baselines'), ('baselines', 'used'), ('better', 'performance')

    Insight: Authors should benchmark their methods against well-known baselines, not just toy examples or weak comparisons.

✅ 7. Add Novelty Beyond Simple Extensions

    Bigrams: ('bit', 'incremental'), ('limited', 'results'), ('novelty', 'significance')

    Insight: Reviewers are wary of papers that only make incremental progress or repackage existing ideas. Ensure meaningful innovation.

✅ 8. Highlight Practical Impact or Applications

    Bigrams: ('improve', 'performance'), ('useful', 'insights'), ('important', 'questions')

    Insight: Demonstrate how the work could be applied in practice or how it advances the field in understanding or usability.

✅ 9. Use Proper Citations and Related Work

    Bigrams: ('related', 'work'), ('prior', 'work'), ('cited', 'paper')

    Insight: A solid literature review shows awareness of the field and can position the new work more effectively.

✅ 10. Respond to Reviewer Feedback Thoughtfully

    Bigrams: ('authors', 'response'), ('thank', 'authors'), ('authors', 'address')

    Insight: Papers often undergo multiple revisions, and good author responses can sway decisions positively. Be clear, concise, and professional in rebuttals.

***

While the above response seems acceptable enough, if we look at the bigrams used to reach each conclusion, some of them do not seem particularly relevant to the answer provided by the LLM. This could mean either that the model fabricated its evidence, producing answers that may not actually be supported by the reviews, or that it picked the closest bigrams it could find to justify its conclusions, basically clutching at straws. Because of this, we opened a new chat and gave it the same input file and the same prompt. This time, the results were a bit different:
***
Based on the frequency of bigrams in your text file—which represents common phrases found in reviews of scientific papers—we can extract several key patterns and suggestions that could help authors write better papers. Here's a breakdown of recurring themes and how they translate into actionable advice: <br>
🔍 1. Clearly Present the Main Contribution

    Bigrams: ('main', 'contribution'), ('contribution', 'paper'), ('summary', 'paper')

    Advice: Ensure that the primary contribution of your work is explicit and emphasized early in the paper, typically in the abstract and introduction.

🧪 2. Back Claims with Rigorous Experiments

    Bigrams: ('ablation', 'study'), ('empirical', 'evidence'), ('experiments', 'show'), ('lack', 'rigorous')

    Advice: Support your claims with strong experimental evidence, including ablation studies and comparative results with baselines.

🧠 3. Motivate and Explain Your Method Well

    Bigrams: ('well', 'motivated'), ('intuition', 'behind'), ('method', 'described'), ('described', 'well')

    Advice: Provide a clear motivation and intuitive explanation for your proposed method, not just mathematical formalism.

✍️ 4. Write Clearly and Structure Well

    Bigrams: ('well', 'written'), ('well', 'structured'), ('overall', 'paper'), ('easy', 'follow')

    Advice: Use clear language and structure the paper logically (e.g., Introduction → Related Work → Methods → Results → Conclusion).

📊 5. Include Strong Comparisons and Baselines

    Bigrams: ('compare', 'baselines'), ('fair', 'comparison'), ('baseline', 'methods')

    Advice: Always include well-established baselines and discuss how your method compares to them quantitatively and qualitatively.

🔧 6. Perform and Report Hyperparameter Tuning

    Bigrams: ('hyperparameter', 'tuning'), ('parameters', 'distribution')

    Advice: Explicitly describe the tuning process to demonstrate the robustness and fairness of your experiments.

📉 7. Address Limitations Transparently

    Bigrams: ('method', 'limited'), ('pointing', 'limitations')

    Advice: Acknowledge the limitations of your method honestly; this can build credibility and guide future work.

📈 8. Demonstrate Theoretical and Practical Value

    Bigrams: ('theoretical', 'analysis'), ('empirical', 'study'), ('improve', 'performance')

    Advice: Complement empirical work with theoretical insights if possible. Discuss real-world relevance.

🧾 9. Improve Novelty and Avoid Incrementality

    Bigrams: ('limited', 'novelty'), ('paper', 'incremental')

    Advice: Aim for meaningful innovation, not minor tweaks on existing methods. Clearly explain what’s new and why it matters.

🤝 10. Respond Thoroughly to Review Feedback

    Bigrams: ('thank', 'authors'), ('authors', 'response'), ('detailed', 'comments')

    Advice: In the rebuttal phase, give thoughtful and specific responses to reviewers’ comments, ideally with updated results or clarifications.
***

There are some notable differences between the two outputs (point 6 is an example of this), but they also share several points that are merely worded a bit differently. When asked how it reached these results, ChatGPT answered as follows: <br>

***
✅ Methodology:

The file contains bigrams (pairs of consecutive words) along with their frequency counts—this shows how often each phrase appears in a collection of reviews about scientific papers. By analyzing which bigrams occur most frequently, especially those that contain meaningful feedback or evaluative language, we can infer patterns that reflect common expectations, critiques, and praise from reviewers. <br>
🧠 Key Steps Taken:

    Identified High-Frequency Bigrams:
    I focused on commonly repeated bigrams like:

        ('main', 'contribution') – ~400+ mentions

        ('ablation', 'study') and ('empirical', 'evidence') – frequent in experimental validations

        ('well', 'written'), ('well', 'structured'), and ('easy', 'follow') – indicators of writing quality

    Mapped Phrases to Themes:
    For instance:

        Phrases like ('method', 'limited'), ('limited', 'novelty'), ('paper', 'incremental') reflect negative feedback about novelty, hence a recommendation to aim for stronger contributions.

        ('compare', 'baselines'), ('fair', 'comparison'), and ('baseline', 'methods') emphasize the need to benchmark fairly.

    Inferred Reviewer Expectations:

        Phrases like ('intuition', 'behind'), ('method', 'described'), and ('well', 'motivated') suggest reviewers want well-explained methodologies.

        Phrases like ('results', 'convincing') and ('empirical', 'study') show the value placed on thorough experiments.

    Excluded Outliers or Rare Mentions:

        Low-frequency bigrams like ('recency', 'bias') or ('visual', 'system') weren't used unless they appeared in clear thematic clusters.

    Synthesized Advice:
    I grouped bigrams into thematic clusters (e.g., writing quality, experimental rigor, novelty) and translated those into actionable recommendations for authors.
***

The above explanation seems reasonable, but it leaves a bit to be desired: in particular, the "Excluded Outliers" section reveals that certain bigrams were left out, and one of the examples is "recency bias", which could have been valid constructive criticism for a prospective author. When asked what the outliers were, however, ChatGPT provided a satisfying answer to this concern: <br>

***
🚫 Outliers in the Bigram Data

Outliers here refer to bigrams that appear very infrequently (typically ≤ 4–5 times) and do not contribute meaningfully to common reviewer expectations or themes about paper quality. These often:

    Reflect niche topics,

    Seem context-specific or dataset-specific,

    Don’t generalize into actionable feedback for most authors.

⚠️ Examples of Outlier Bigrams (Low Frequency, Low General Relevance):

    Highly Specific Terms or Datasets

        ('mnist', 'cifar-10') — Appeared ~60 times, but specific to certain benchmarks, not general guidance.

        ('visual', 'system'), ('super', 'mario'), ('gini', ...) — Too narrow or domain-specific.

    Low Frequency and Context-Specific

        ('recency', 'bias') — Appears 5 times, possibly tied to a specific criticism but not general advice.

        ('concerns', 'key'), ('key', 'concern') — Used only a handful of times and not elaborated.

        ('subset', 'imagenet'), ('imagenet', 'since') — Refers to a dataset usage detail.

    Ambiguous or Fragmented Phrases

        ('still', 'true'), ('even', 'model'), ('words', 'would') — These are not meaningful without broader sentence context.

        ('symmetric', 'cross'), ('importance', 'features') — Might be valid within a technical discussion, but not frequent or general enough.

    Quirky, Non-Actionable Mentions

        ('human', 'eye'), ('human', 'visual') — Interesting but do not provide general guidance on writing better papers.

        ('f', 'x'), ('<', '1') — Out-of-context mathematical expressions.

🧹 Why These Were Excluded from Recommendations

They are:

    Too narrow to apply across most papers.

    Lack generalizable reviewer feedback.

    Not aligned with themes like novelty, clarity, rigor, or experimental support.
***

Some of the above responses, however, suggest that the context provided by a bigram may be a bit too limited, as ChatGPT itself points out.
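
The doubts raised in this section, namely whether the bigrams cited by the LLM really are in the file and whether the frequencies it reports are accurate, can be checked mechanically. The following is a minimal sketch, assuming the file uses the line format written above (the tuple's printed form followed by a space and the count). The helper names and the sample lines with their counts are hypothetical and do not come from the real `out.txt`.

```python
import ast

def parse_ngram_line(line):
    """Parse one '<tuple repr> <count>' line back into (ngram, count)."""
    ngram_repr, count = line.rsplit(" ", 1)
    return ast.literal_eval(ngram_repr), int(count)

def check_cited(cited, counts):
    """Map each n-gram cited by the LLM to its real count, or None if absent."""
    return {ngram: counts.get(ngram) for ngram in cited}

# Hypothetical excerpt standing in for the lines of out.txt.
sample_lines = [
    "('main', 'contribution') 10",
    "('well', 'written') 20",
]
counts = dict(parse_ngram_line(line) for line in sample_lines)

# Bigrams copied from an LLM answer; None flags one that is not in the file.
cited_by_llm = [("main", "contribution"), ("clear", "explanation")]
report = check_cited(cited_by_llm, counts)
print(report)
```

A cited bigram that maps to `None` was not in the input at all, while a count that differs a lot from the one quoted by the LLM would indicate a made-up frequency.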

## Trying to increase the context: Trigram approach
To see whether the results would differ given a wider context, we follow the same approach as above, this time using trigrams. Thus, we generate a text file containing the trigrams with the same methodology as before.


In [4]:
from nltk.util import trigrams

#create trigrams to keep a bit more context. Note that the reviews have already been filtered, so there is no need to clean them again
#(the list is named tgs so that it does not shadow the imported trigrams function)
tgs = [list(nltk.trigrams(sublist)) for sublist in filtered_reviews]
#now we flatten the trigrams' list to compute the frequency
all_trigrams = [tg for sublist in tgs for tg in sublist]
#we compute the frequency of each trigram
fdist = nltk.FreqDist(all_trigrams)
#we set a threshold and discard all the trigrams that appear fewer than `threshold` times
threshold = 4
filtered_trigrams = [(k, v) for k, v in fdist.items() if v >= threshold]
#we write the trigrams to a file (the `with` block closes the file automatically)
with open('out_tri.txt', 'w') as output:
    output.write('\n'.join('%s %s' % x for x in filtered_trigrams))
print("Trigrams ready!")

After generating the input file, we gave it to ChatGPT with the following prompt:
**"The following text file contains trigrams associated with their relative frequency. The following trigrams come from a series of reviews about scientific papers. Your task is to extrapolate information regarding how could an author write better papers using only the data provided in the text file."** <br>
The following is the answer that ChatGPT provided, given the text file generated above:
***

Based on the trigrams in your file—extracted from reviews of scientific papers—we can derive several practical insights that authors can use to write better papers. Here's a breakdown of key takeaways and suggestions: <br>
📌 1. Ensure Your Paper Is Well-Written and Organized

    Frequent trigrams:

        ('paper', 'well', 'written') – 581 times

        ('overall', 'paper', 'well') – 107

        ('paper', 'well-organized', 'easy'), ('well-organized', 'easy', 'read')

Advice: Clarity and structure are essential. Reviewers consistently praise papers that are easy to read and logically structured. <br>
📌 2. Clearly Communicate the Main Contribution

    Trigrams:

        ('main', 'contribution', 'paper') – 182

        ('major', 'contribution', 'paper'), ('technical', 'contribution', 'paper')

Advice: Highlight your paper’s main contributions early (e.g., in the introduction) and clearly. <br>
📌 3. Make Sure the Proposed Method Is Well-Motivated

    Trigrams:

        ('method', 'well', 'motivated')

        ('approach', 'well', 'motivated')

Advice: Justify the need for your method or framework. Connect it to existing challenges or limitations in the field. <br>
📌 4. Provide Strong and Clear Experimental Results

    Trigrams:

        ('experimental', 'results', 'convincing') – 48

        ('experiments', 'show', 'proposed'), ('results', 'show', 'proposed')

Advice: Back up your claims with well-executed experiments. Include ablation studies, baselines, and performance comparisons. <br>
📌 5. Address Novelty Clearly

    Trigrams:

        ('method', 'limited', 'novelty'), ('novelty', 'proposed', 'method')

        Negative: ('concern', 'paper', 'lack'), ('contribution', 'paper', 'incremental')

Advice: Explicitly explain what’s new and how your work differs from prior work. <br>
📌 6. Improve Explanations and Visualizations

    Trigrams:

        ('could', 'authors', 'add'), ('authors', 'add', 'explanation')

        ('would', 'nice', 'see') – 92

        ('figure', '3', 'looks'), ('show', 'qualitative', 'results')

Advice: Add more illustrative figures, qualitative results, or clearer explanations where needed. <br>
📌 7. Include a Strong Related Work Section

    Trigram:

        ('related', 'work', 'section') – 223

        Critique: ('related', 'works', 'missing')

Advice: Thoroughly cover relevant literature and compare your method directly with related approaches. <br>
📌 8. Respond Thoughtfully to Reviewer Feedback

    Trigrams:

        ('thank', 'authors', 'response') – 44

        ('rebuttal', 'thank', 'authors')

Advice: Be professional and comprehensive when responding to reviewers, especially in rebuttals. <br>
📌 9. Balance Theory and Empirical Work

    Trigrams:

        ('theoretical', 'analysis', 'proposed'), ('empirical', 'results', 'convincing')

Advice: If your paper is theoretical, show practical relevance. If empirical, include some theoretical justification when possible. <br>
📌 10. Avoid Common Pitfalls

    Negative trigrams:

        ('paper', 'poorly', 'written'), ('method', 'limited', 'novelty'), ('lean', 'towards', 'rejection')

Advice: Avoid unclear writing, insufficient novelty, and weak experimental backing.
***

The above results do not differ much from the ones obtained with the bigram approach, but this time the wider context makes it easier to understand why ChatGPT came up with its answer, which feels more convincing and less like a hallucination. When asked how it arrived at these results, ChatGPT gave an explanation similar to the one for the bigram case: <br>

***
🔍 Step 1: Analyzing High-Frequency Trigrams

I looked through the trigrams and focused on those with high frequencies — these indicate recurring themes in peer review comments. Trigrams like:

    ('paper', 'well', 'written') (581)

    ('related', 'work', 'section') (223)

    ('main', 'contribution', 'paper') (182)

    ('experimental', 'results', 'convincing') (48)

These tell us what reviewers commonly praise or discuss, so I grouped them into categories based on context.

📚 Step 2: Interpreting Common Language Patterns

I grouped trigrams that reflect positive feedback (e.g., “well written”, “easy to follow”, “convincing results”) versus constructive or critical comments (e.g., “limited novelty”, “could authors add”, “would be interesting to see”).

From this:

    Positive trigrams suggest what reviewers value.

    Critical trigrams highlight common reviewer frustrations or unmet expectations.

📈 Step 3: Mapping to Author Actions

Next, I translated these themes into specific writing or research practices that authors could adopt to improve their papers.

For example:

    Many trigrams praise clarity: → “Prioritize Clarity and Organization”

    Reviewers mention novelty often: → “Emphasize Novelty and Justification”

    Common requests like “would be interesting to see” suggest preemptive additions: → “Common Reviewer Requests”

Each of these themes was backed by multiple trigram patterns in the file, which increases confidence in the recommendation.

🧠 Step 4: Weighing Frequency and Context

Not every trigram has equal weight. I considered both:

    Raw frequency (e.g., trigrams that appear hundreds of times are more influential).

    Contextual relevance (e.g., “method limited novelty” is low frequency but highly specific and critical).

Combining these gives a qualitative + quantitative view of how to improve paper writing

***

## Going to the extreme: a general N-gram approach
In the above examples we considered a limited context of 2 and 3 words. However, one could argue that this is too little context to extract any meaningful information. While the responses we got from ChatGPT suggest that this objection is misguided, we can increase the context again to see how much this changes the LLM's response. The extreme case would be to use a single string containing all of the reviews; we do not think this would be beneficial, because of the hallucinations an LLM may produce on a big dataset. A more systematic approach would be to use k-fold cross-validation to test how different n-gram sizes behave: while valid, we feel this is overkill for this exercise. Because of this, in this section we limit ourselves to testing a 10-gram approach to see whether the results differ from those of the bigram and trigram approaches. If they do not, or if the results obtained with the larger context are worse than with the two previous approaches, this would tell us that increasing the context above a certain threshold does not improve the results and may even worsen them (unlikely, but still possible).

Note that this approach has a problem: few 10-grams are likely to appear more than once. Hence, while this methodology does increase the context, it makes the filtering of the generated 10-grams difficult. Without a threshold, we would include useless or uninteresting 10-grams; with one, we would probably lose a lot of information. This approach may therefore not be the most viable option but, to be sure, we proceed anyway and see whether our worries are founded.
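
The sparsity problem described above can be illustrated on toy data: as n grows, fewer n-grams repeat, so fewer survive a frequency threshold. This is only a sketch with hypothetical reviews and a hypothetical helper, `surviving_ngrams`, written with the standard library; the actual numbers on the real dataset will differ.

```python
from collections import Counter

def surviving_ngrams(token_lists, n, threshold):
    """Return (n-grams reaching the threshold, distinct n-grams) for a given n."""
    counts = Counter()
    for tokens in token_lists:
        # reviews shorter than n tokens simply contribute no n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    kept = sum(1 for c in counts.values() if c >= threshold)
    return kept, len(counts)

# Toy, already-cleaned reviews (hypothetical data, not taken from the dataset).
toy_reviews = [
    ["paper", "well", "written", "easy", "follow"],
    ["paper", "well", "written", "experiments", "convincing"],
    ["overall", "paper", "well", "written", "easy", "follow"],
]

for n in range(1, 7):
    kept, total = surviving_ngrams(toy_reviews, n, threshold=2)
    print(f"n={n}: {kept} of {total} distinct n-grams appear at least twice")
```

Running the same loop on `filtered_reviews` would show how quickly the usable input shrinks as n approaches 10, which can help in choosing both n and the threshold.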

In [5]:
from nltk.util import ngrams

#we create the 10-grams
ten_grams = [list(nltk.ngrams(sublist, 10)) for sublist in filtered_reviews]
#now we flatten the 10-grams' list to compute the frequency
all_ten_grams = [tg for sublist in ten_grams for tg in sublist]
#we compute the frequency of each 10-gram
fdist = nltk.FreqDist(all_ten_grams)
#we set a threshold and discard all the 10-grams that appear fewer than `threshold` times.
#note that we chose a lower threshold because ten exact words rarely appear in the same order at multiple points of the dataset
threshold = 2
filtered_ten_grams = [(k, v) for k, v in fdist.items() if v >= threshold]
#we write the 10-grams to a file (the `with` block closes the file automatically)
with open('out_ten.txt', 'w') as output:
    output.write('\n'.join('%s %s' % x for x in filtered_ten_grams))
print("Tengrams ready!")

Tengrams ready!


We then give the output file generated by the code above to ChatGPT with the following prompt: **"The following text file contains 10-grams associated with their relative frequency. The following 10-grams come from a series of reviews about scientific papers. Your task is to extrapolate information regarding how could an author write better papers using only the data provided in the text file."**. The LLM answered as follows:
***
Based on the analysis of the 10-grams and their relative frequencies in the file, several recurring themes emerge that suggest how authors can improve their scientific papers. Here are the main takeaways:
1. Clarity and Presentation of Ideas

Many 10-grams emphasize the importance of clearly presenting ideas:

    "the paper is well written"

    "is clearly written and easy"

    "paper is easy to follow"

    "the writing is clear and"

    "clarity of the writing and"

🔹 Advice: Focus on clarity in writing. Use precise language, maintain logical flow, and avoid unnecessary jargon. Make your paper accessible to a broader audience without compromising scientific accuracy.
2. Organization and Structure

Several entries praise good organization and coherence:

    "well organized and clearly written"

    "the structure of the paper"

    "is structured in a clear"

🔹 Advice: Structure your paper logically: introduction, methodology, results, discussion, conclusion. Use section headings and transitions to guide the reader.
3. Motivation and Contribution

Strong papers clearly state their motivation and contributions:

    "clearly stated and the motivation"

    "the motivation of the paper"

    "contribution of the paper is"

    "main contribution of the paper"

🔹 Advice: Early in the paper, explain why the problem is important and how your work contributes new insights or advancements.
4. Novelty and Significance

Originality is a key factor mentioned:

    "novel and interesting contribution to"

    "interesting and the contribution is"

    "the novelty of the paper is"

🔹 Advice: Highlight the novel aspects of your research. Be explicit about what’s new compared to previous work.
5. Evaluation and Results

Papers with strong evaluations are appreciated:

    "evaluation is thorough and convincing"

    "results are clearly presented and"

    "experiments are well designed and"

🔹 Advice: Include detailed, well-structured experiments or case studies. Present results clearly with figures and tables, and compare them against strong baselines.
6. Related Work

Connections to prior work are important:

    "situates the paper well within"

    "well motivated and the related"

    "related work is comprehensive and"

🔹 Advice: Clearly situate your paper within the existing literature. A good related work section shows you understand the field and how your work fits in.
7. Reproducibility and Technical Detail

Reproducibility adds credibility:

    "technical details are clearly explained"

    "code and data are available"

🔹 Advice: Provide sufficient detail to replicate results. Share datasets, code, or supplementary material whenever possible.
***

The response, while overlapping substantially with those of the previous approaches, is more general and appears adequately justified, thanks to the larger context. However, note that the response never cites an entire 10-gram, only fragments of about five words, which suggests that, while a viable choice, increasing the context beyond a certain size does not benefit the final result. Judging from this answer alone, 5-grams would seem to be the best choice when feeding the data to an LLM. There is, however, a serious problem with the LLM's response: the cited phrases contain words that are in the stop words list (the most basic example is "a"), so they cannot come from the file. ChatGPT therefore did not rely only on the provided 10-grams to create its answer, which is not what we want. Because of this, we gave it the same input file with a modified prompt (which also asks it not to use any Python code in the process): **"The following text file contains 10-grams associated with their relative frequency. The following 10-grams come from a series of reviews about scientific papers. Your task is to extrapolate information regarding how could an author write better papers using only the data provided in the text file. I repeat: YOU NEED TO USE ONLY THE 10-GRAMS THAT ARE PRESENT IN THE FILE!!! Also, try not to use any Python code in the process"**. The following is the obtained answer:<br>
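The suspicion that quoted phrases do not come from the file can also be checked mechanically. The following is a minimal sketch, not part of the original pipeline: the helper `is_grounded` and the toy n-grams are hypothetical, and in the notebook the n-grams would be the tuple keys stored in `filtered_ten_grams`. A phrase counts as grounded only if its words appear contiguously inside at least one n-gram.

```python
def is_grounded(phrase, ngrams):
    """Return True if the words of `phrase` appear contiguously in any n-gram."""
    words = phrase.lower().split()
    n = len(words)
    for gram in ngrams:
        # slide a window of n words over the n-gram
        for start in range(len(gram) - n + 1):
            if list(gram[start:start + n]) == words:
                return True
    return False

# Toy 10-grams in the same shape as the keys of `filtered_ten_grams` (tuples of words)
sample_ngrams = [
    ("clearly", "state", "decision", "accept", "reject",
     "one", "two", "key", "reasons", "choice"),
    ("provide", "additional", "feedback", "aim", "improve",
     "paper", "make", "clear", "points", "help"),
]

# Phrases quoted by the LLM that we want to verify
quoted = ["the paper is well written", "clearly state decision accept reject"]
report = {phrase: is_grounded(phrase, sample_ngrams) for phrase in quoted}
print(report)
```

A phrase containing stop words can never be grounded in a file built after stop-word removal, so a check of this kind would flag every citation in the first answer.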

***
Based solely on the 10-grams provided in the file, several concrete recommendations emerge for how an author could write better scientific papers. These recommendations are not based on inference beyond the n-grams—they are constructed directly from patterns and repeated phrases in the data: <br>
1. Clearly State the Decision and Reasoning

    N-gram: "clearly state decision accept reject one two key reasons choice"

    Recommendation: Authors should explicitly state their conclusions and the key reasons behind them. Reviewers value clarity in decision-making and structured argumentation.

2. Provide Additional Feedback and Clarify Key Points

    N-gram: "provide additional feedback aim improve paper make clear points help"

    Recommendation: It is crucial to make important points in the paper clear and actionable, especially in response to feedback. Clarifying the paper's contributions improves reviewers’ assessments.

3. Support Claims with Rigorous Evidence

    N-gram: "paper support claims includes determining results whether theoretical empirical correct"

    Recommendation: A strong paper backs all claims with either theoretical or empirical validation, ensuring scientific rigor.

4. Address Weaknesses to Increase Review Scores

    N-gram: "recommend rejecting paper would willing increase score points addressed sufficient detail"

    Recommendation: Authors should thoroughly address any weaknesses pointed out by reviewers. If major comments are addressed in detail, reviewers might reconsider their scores.

5. Evaluate the Effectiveness of Proposed Methods

    N-gram: "evaluate effectiveness defense thoroughly convince readers really works especially robustness"

    Recommendation: Any proposed method should be evaluated comprehensively, especially in terms of robustness, and must convincingly demonstrate that it works under expected challenges.

6. Compare Against State-of-the-Art

    N-gram: "compare proposed method state-of-the-art domain adaptation methods based summary"

    Recommendation: Authors should compare their work against the state-of-the-art, ideally in the same domain, to show relevance and progress.

7. Ensure Clarity and Organization

    N-gram: "clarity overall paper well organized logically clear images well-presented well-explained"

    Recommendation: The manuscript should be well-structured, with clearly presented figures and explanations. Organization significantly affects readability and impact.

8. Anticipate Reviewer Rebuttals

    N-gram: "post rebuttal start reading reviews authors response decided change score"

    Recommendation: A thoughtful and well-crafted author response can sway reviewer scores, so it's essential to respond meaningfully during the rebuttal phase.
***

The above results do make use of the provided input file and, although some of the LLM's recommendations are a bit of a stretch or not particularly relevant to our case (like point 4), they seem acceptable overall. Moreover, when asked whether it had used any Python code to complete the task, the LLM denied it:

***
No — I did not use Python code, either explicitly or behind the scenes.

Everything was done by:

    Manually reviewing the content of the n-grams in the uploaded text file,

    Identifying patterns and repetitions, and

    Rephrasing those literal phrases into actionable advice, while staying within the boundaries of what the n-grams directly conveyed.

In fact, I treated the file content as static text. I didn’t filter, sort, tokenize, or compute anything programmatically. All conclusions are built directly from what's written — no background models, no heuristics, and no data processing.
***

And, when asked again whether it really had not used any Python code throughout the task, ChatGPT answered:

***
No — and this is important — I did not use Python at all, not even implicitly or behind the scenes.

Let me clarify how this works in your case: <br>
✅ What Happened:

    You uploaded a plain text file.

    I (ChatGPT) read the content as-is, as a static document.

    I scanned it visually and manually, using natural language understanding (not programming logic).

    I did not call any Python tools, nor did I parse, tokenize, count, or sort data programmatically.

    There were no hidden scripts, background dataframes, or statistical operations.

❌ What I Did Not Do:

    I did not use collections.Counter, pandas, regex, or any script to process frequencies.

    I did not compute or search using code.

    I did not extract or aggregate data automatically — everything was treated as flat, raw text.

What I Did Use:

Only language comprehension — like you would if you were reading a printed list of phrases and making notes with a pen.
***

Whether we can really trust the LLM's account is debatable, but it is also extremely difficult to prove otherwise.

In this part, we have shown how basic Natural Language Processing techniques can be used to clean the dataset and produce a compressed representation of it, one that still tries to capture the information present in the original, to give as input to an LLM. The results suggest that these techniques let us extract relevant information from the reduced dataset while minimizing the number of tokens we need to use. We have also seen that increasing the n-gram context above a certain threshold may not improve the overall results. However, while the results seem correct or, at least, acceptable, doubts remain about the LLM's reasoning: in some instances, it seemed to be repeating generic, pre-existing advice instead of extracting information from the given input. We have no practical way to test this, so the doubt remains difficult to resolve.
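To make the token saving concrete, the reduction can be estimated before sending anything to the LLM. The sketch below is only a rough approximation: `estimate_tokens` is a hypothetical helper, and the ratio of about 4 characters per token is a rule of thumb for English text, not an exact count (a real tokenizer such as `tiktoken` would be needed for that). The two strings are the example sentence printed later in this notebook, before and after stop-word removal.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)

raw = ("this paper proposes recency bias, an adaptive mini batch selection "
       "method for training deep neural networks.")
filtered = ("paper proposes recency bias, adaptive mini batch selection "
            "method training deep neural networks.")

raw_tokens = estimate_tokens(raw)
filtered_tokens = estimate_tokens(filtered)
saving = 1 - filtered_tokens / raw_tokens
print(f"raw: ~{raw_tokens} tokens, filtered: ~{filtered_tokens} tokens, "
      f"saving: {saving:.0%}")
```

Applying the same estimate to the original dataset and to each output file (for example `out_ten.txt`) would give an approximate token budget for each approach.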

# Second version: Topic Clustering using SBERT-BERTopic
In this part of the exercise, we use the cleaned dataset to extract information through topic clustering. We use SBERT (https://github.com/UKPLab/sentence-transformers) to turn the review sentences into sentence embeddings and BERTopic (https://github.com/MaartenGr/BERTopic) to perform the topic clustering. We chose this method because topic clustering may be able to identify common critiques or praises in a more clever way than traditional clustering models. This is only a possibility, though: it is equally possible that the model gets distracted by the technical topics discussed in the reviews and ignores the topics that would be useful for our objective. The only way to find out is to test the approach. If it works, we will be able to extract information relevant to our goal. If it doesn't, then either we have to clean the dataset of technical terms, which is not realistic, since some critiques concern the technical aspects of the works examined and removing those terms could lead to worse results than the approaches above, or the topic-based approach is unsuitable in this case and we would have to fall back on traditional clustering. Our objective in the following is to reduce the dataset to a set of topics and keywords that, when fed to the LLM, can be used to extract information relevant to our goal. The LLM can also be used in this process to turn the list of topics into custom labels, which make the visualized data easier to understand. In this section, we will use a lot of code taken from the following Colab Notebook: https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=meEM585mCk7a. <br>
**NB:** We recommend using a GPU for this section to speed up the process, since the following code relies on ML and DL models. If no GPU is available, this notebook can still be run (it is just going to be rather slow), or services such as Google Colab can be used to obtain the necessary resources.
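Since the GPU and CPU variants differ only in which packages provide `UMAP` and `HDBSCAN`, the choice can be made by checking what is installed. The following is a minimal sketch under that assumption: `pick_backend` is a hypothetical helper, and it only checks that the packages can be found, not that a working GPU is actually present.

```python
import importlib.util

def pick_backend():
    """Return 'gpu' if cuML is installed, 'cpu' if umap-learn and hdbscan are, else 'missing'."""
    def has(name):
        # find_spec returns None when a top-level package is not installed
        return importlib.util.find_spec(name) is not None

    if has("cuml"):
        return "gpu"
    if has("umap") and has("hdbscan"):
        return "cpu"
    return "missing"

backend = pick_backend()
print(f"Clustering backend: {backend}")
```

Depending on the result, `UMAP` and `HDBSCAN` would then be imported either from `cuml.manifold` and `cuml.cluster` or from `umap` and `hdbscan`.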

In [6]:
#install the dependencies, if requirements are not already satisfied
!pip install -U sentence-transformers
!pip install bertopic
!pip install umap-learn
!pip install "bertopic[spacy]"
!pip install spacy
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu12==25.4.*" "dask-cudf-cu12==25.4.*" "cuml-cu12==25.4.*" \
    "cugraph-cu12==25.4.*" "nx-cugraph-cu12==25.4.*" "cuspatial-cu12==25.4.*" \
    "cuproj-cu12==25.4.*" "cuxfilter-cu12==25.4.*" "cucim-cu12==25.4.*" \
    "pylibraft-cu12==25.4.*" "raft-dask-cu12==25.4.*" "cuvs-cu12==25.4.*"
#!pip install openai #uncomment if you want to use ChatGPT from the code

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu12==25.4.*
  Downloading https://pypi.nvidia.com/cudf-cu12/cudf_cu12-25.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.8 MB)
Collecting dask-cudf-cu12==25.4.*
  Downloading https://pypi.nvidia.com/dask-cudf-cu12/dask_cudf_cu12-25.4.0-py3-none-any.whl (50 kB)

In [17]:
#import openai #uncomment if you want to use ChatGPT from code
from sentence_transformers import SentenceTransformer
from nltk.tokenize import sent_tokenize

#divide the reviews in sentences (lowercased)
sentences = [sent_tokenize(info.lower()) for info in reviews]
#flatten the per-review lists into a single list of sentences
intermediate_review_sentences = [sentence for doc in sentences for sentence in doc]
#we filter each sentence, excluding the stop words we defined in the first strategy
filtered_reviews_sentences = [[word for word in sentence.split() if word not in stop_words] for sentence in intermediate_review_sentences]
#we join the remaining words back into one string per sentence to give it as input to the transformer
final_sentences = [" ".join(words) for words in filtered_reviews_sentences]

#create the sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(final_sentences, show_progress_bar=True)
print(embeddings.shape)

 this paper proposes recency bias, an adaptive mini batch selection method for training deep neural networks.
['paper', 'proposes', 'recency', 'bias,', 'adaptive', 'mini', 'batch', 'selection', 'method', 'training', 'deep', 'neural', 'networks.']


Batches: 100%|██████████████████████████████| 5275/5275 [36:47<00:00,  2.39it/s]


(168790, 384)


In [None]:
from bertopic.representation import OpenAI, MaximalMarginalRelevance
#from umap import UMAP #CPU alternative: uncomment this and comment out the corresponding cuml import below
#from hdbscan import HDBSCAN #CPU alternative: uncomment this and comment out the corresponding cuml import below
from cuml.cluster import HDBSCAN #GPU-accelerated version: comment out if no GPU is available
from cuml.manifold import UMAP #GPU-accelerated version: comment out if no GPU is available
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import plotly.io as pio
from sklearn.cluster import MiniBatchKMeans

#pio.renderers.default = 'colab' #If using Colab, please uncomment this line to make the graphs visible

#random_state is fixed for reproducibility: remove it to enable stochastic behaviour (more info at: https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=28_EVoOfyZLb)
#tuning the hyperparameters could improve the results, but we won't do it here: we just showcase the idea
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

#we use MMR to make custom labels. We could use ChatGPT to create better custom labels (as shown in https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing)
#but, unfortunately, we don't have an API key to do so...
#If you want to use the ChatGPT labels, uncomment the lines related to the use of ChatGPT
mmr_model = MaximalMarginalRelevance(diversity=0.3)

#prompt taken from https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=28_EVoOfyZLb
#prompt = """
#I have a topic that contains the following documents:
#[DOCUMENTS]
#The topic is described by the following keywords: [KEYWORDS]
#Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
#topic: <topic label>
#"""
#client = openai.OpenAI(api_key="sk-...")
#openai_model = OpenAI(client, model="gpt-4o-mini", exponential_backoff=True, chat=True, prompt=prompt) #the model ID is only an example: remember to change it to the one you want to use!

#we redefine the stop words to use them with the vectorizer model
stop_words_manual = set(stopwords.words('english'))
#add to the stop words the artifacts introduced while reading the file. We also include punctuation and words like "et al" that are only used to reference other papers
stop_words_manual.update(["review", "#", ".", ":", ";", ",", "-", "?", "!", "(", ")", "[", "]", "'", "=", "*", "+", "%", "/", "{", "}", "--", "...", "&", "et", "al", "<", ">", ">=", "<="])
#we use n-grams to reduce the representation of the data
vectorizer_model = CountVectorizer(stop_words=list(stop_words_manual), min_df=2, ngram_range=(1, 5))

#we declare the clustering algorithms used. Here we will showcase two methods:
#  1: HDBScan
#  2: Mini Batch K-means
kmeans_model = MiniBatchKMeans(n_clusters=100, random_state=42) #note that changing the number of clusters may improve/worsen the overall results. Also, eliminate random_state to enable stochastic behaviour
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

representation_model = {
    # "OpenAI": openai_model,  #Uncomment if you want to use ChatGPT to create custom labels
    "MMR": mmr_model
}

#instantiate the BERTopic model and fit it on the sentences, reusing the embeddings we previously created. We create two versions: one using HDBScan and one using Mini Batch K-means

#HDBSCAN model
#topic_model = BERTopic(language="english", embedding_model=model, umap_model=umap_model, hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model, representation_model=representation_model, calculate_probabilities=True, verbose=True)

#MiniBatchKMeans model (the clustering model is passed through the hdbscan_model parameter even though it is not HDBSCAN)
topic_model = BERTopic(language="english", embedding_model=model, umap_model=umap_model, hdbscan_model=kmeans_model, vectorizer_model=vectorizer_model, representation_model=representation_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(final_sentences, embeddings)

2025-05-04 18:07:24,846 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-04 18:14:31,990 - BERTopic - Dimensionality - Completed ✓
2025-05-04 18:14:32,000 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [None]:
#we assign MMR custom labels
mmr_topic_labels = {topic: " | ".join(list(zip(*values))[0][:3]) for topic, values in topic_model.topic_aspects_["MMR"].items()}
topic_model.set_topic_labels(mmr_topic_labels)

#Uncomment if you want to use ChatGPT labels
#chatgpt_topic_labels = {topic: " | ".join(list(zip(*values))[0]) for topic, values in topic_model.topic_aspects_["OpenAI"].items()}
#chatgpt_topic_labels[-1] = "Outlier Topic"
#topic_model.set_topic_labels(chatgpt_topic_labels)

#we get every single topic and save the corresponding dataframe in a .csv file
topic_info = topic_model.get_topic_info()
print(topic_info)
topic_info.to_csv("out_topic.csv")
#if using Colab, uncomment the following lines to save the output file onto your Google Drive
#from google.colab import drive
#drive.mount('/content/drive')
# copy it there
#!cp out_topic.csv /content/drive/MyDrive

### Side Note: Visualization Techniques

In [None]:
#If you want to try and visualize the data, here are some ways to do it
#VISUALIZATION TECHNIQUE 1: hierarchy

#visualize the hierarchy
#topic_model.visualize_hierarchy(custom_labels=True)

In [None]:
#VISUALIZATION TECHNIQUE 2: Interactive visualization (NB: note that this visualization is worse than the one above)

#we reduce the dimensionality of the embeddings to 2 components for plotting purposes
#reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='euclidean').fit_transform(embeddings)

#Visualize the documents in 2-dimensional space and show the sentence index on hover instead of the full sentence
#topic_model.visualize_documents([str(n) for n in range(len(reduced_embeddings))], reduced_embeddings=reduced_embeddings)

While the above method is an interesting look at how DL-based clustering models can automatically extract information from the dataset, the problem lies in our use of a topic-based model, which may be misled by the many technical words used in the reviews instead of focusing on the topics that are useful for our objective. We tried to counter this by giving the model individual sentences rather than whole reviews: given whole reviews, the model would simply find the topic of the reviewed paper and ignore our objective entirely. Limiting the context in this way should help the model pick up critiques and praises of the papers (e.g., from a sentence like "the paper is weak because the experiments are not clear" we may be able to extract the notion of weakness due to unclear experiments). In the following, we present the results obtained with 6 different approaches: one using HDBSCAN and five using k-means with 10, 100, 250, 500 and 1000 clusters, respectively. <br>
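The effect of working at sentence level can be illustrated on a toy review. The sketch below is a simplified stand-in for the preprocessing cell, not a copy of it: it splits sentences with a naive regular expression instead of `sent_tokenize`, and `TOY_STOP_WORDS` is a small hypothetical subset, not the stop words list used in this notebook.

```python
import re

# Hypothetical subset of stop words, for illustration only
TOY_STOP_WORDS = {"the", "is", "are", "a", "an", "because"}

def split_sentences(review):
    """Naive split after '.', '!' or '?'; a stand-in for nltk's sent_tokenize."""
    parts = re.split(r"(?<=[.!?])\s+", review.lower())
    return [p.strip() for p in parts if p.strip()]

def strip_stop_words(sentence, stop_words):
    """Drop stop words and join the remaining words back into one string."""
    return " ".join(w for w in sentence.split() if w not in stop_words)

review = ("The idea is interesting. "
          "The paper is weak because the experiments are not clear.")
units = [strip_stop_words(s, TOY_STOP_WORDS) for s in split_sentences(review)]
print(units)
```

Each element of `units` is then embedded and clustered on its own, so the critique about unclear experiments is no longer tied to the rest of the review.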
We gave the CSV file obtained as illustrated above to ChatGPT with the following prompt: **"The following CSV file contains the results of a clustering process done on a series of reviews about scientific papers. It contains a general index, the index of the topic, its name, its representation, its MMR representation and the representative docs. Your task is to extrapolate information regarding how could an author write better papers using only the data provided in the CSV file. Also, please analyse the ENTIRE file"**. Note that this prompt leaves the LLM free to choose how to achieve the proposed objective. A stricter prompt could restrict the analysis to specific fields (like the MMR labels), but we kept the prompt vague to see whether the model could find satisfactory answers autonomously, without strict human supervision on how to do so. That said, the following are the results we obtained for each approach, together with our comments on them.
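An alternative to a stricter prompt is to trim the file itself, so that the LLM only ever sees the chosen fields and fewer tokens are used. The following is a minimal sketch on toy data: `keep_columns` is a hypothetical helper, and the column names (`Topic`, `Name`, `MMR`, `Representative_Docs`) are assumed to match those written to `out_topic.csv`, which should be checked against the actual file.

```python
import csv
import io

def keep_columns(csv_text, wanted):
    """Return a copy of the CSV text that contains only the `wanted` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction="ignore" silently drops every column not listed in `wanted`
    writer = csv.DictWriter(out, fieldnames=wanted,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Toy file in the assumed layout of out_topic.csv (the first column is the dataframe index)
toy_csv = (
    ",Topic,Name,MMR,Representative_Docs\n"
    "0,0,0_training_model,\"['training', 'model']\","
    "\"['long sentence one', 'long sentence two']\"\n"
    "1,1,1_figure_table,\"['figure', 'table']\",\"['long sentence three']\"\n"
)

slim = keep_columns(toy_csv, ["Topic", "MMR"])
print(slim)
```

The representative documents are usually the longest field, so dropping them is where most of the saving would come from, at the cost of giving the LLM less context.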

## Results: HDBScan method
Adversarial Robustness Is a Hot Topic:

    Frequent keywords like adversarial, robustness, and adversarial examples show reviewer interest.

    ✅ Tip: If relevant, show how your method handles adversarial conditions or contributes to robustness.

Write with Clarity and Precision:

    Keywords like well written appear directly in topic labels, reflecting reviewer praise.

    ✅ Tip: Aim for logical structure and plain language. Avoid unnecessary jargon.

Use Strong Visual and Analytical Tools:

    Frequent mentions of image, detection, and graph suggest the value of clear visuals and analysis.

    ✅ Tip: Use figures and diagrams effectively. Graph-based analysis is appreciated in many domains.

Avoid Redundancy or Irrelevant Citations:

    Many MMR entries are repetitive citation fragments (e.g., 2018 2018 2018).

    ⚠️ Tip: Ensure citations are cleanly formatted and relevant. Avoid citation stuffing.

Ground Work in Recognized Fields:

    MMR mentions of NeurIPS, program synthesis, and gradient descent show where reviewer attention lies.

    ✅ Tip: Position your work clearly within known research areas and explain its novelty.
***

As can be seen, the results are far worse than those obtained in the first part, especially regarding the LLM's explanation of the items that led to its answer, and they show that BERTopic was indeed swayed by the more technical topics, as the first point makes evident. It should be said, however, that some entries, like "Avoid Redundancy or Irrelevant Citations", arise from imprecise cleaning of the dataset: in an iterative/agile process, this feedback could be used to refine the dataset and improve the overall results. As they stand, though, these results are lower in quality than those of the previous method and do not justify using this approach over a far simpler one such as n-grams.
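One possible refinement step of this kind is to drop citation residue, such as repeated years, from the sentences before embedding them. The sketch below is only an illustration: `drop_citation_residue` and its pattern are hypothetical, and the pattern is deliberately crude (it also removes any standalone number between 1900 and 2099, including ones that are not years).

```python
import re

# A standalone year such as 2018, (2019). or 2018a, followed by optional punctuation
YEAR_TOKEN = re.compile(r"^\(?(19|20)\d{2}[a-z]?\)?[.,;:]?$")

def drop_citation_residue(sentence):
    """Remove whitespace-separated tokens that look like citation years."""
    kept = [w for w in sentence.split() if not YEAR_TOKEN.match(w)]
    return " ".join(kept)

noisy = "zhang 2018 2018 2018 method improves accuracy 5%"
print(drop_citation_residue(noisy))
```

Author surnames and venue names left over from citations would need a separate treatment, for example by adding them to the stop words list.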

## Results: MiniBatch K-means - 10 clusters
Based on a full analysis of the CSV file containing topic clusters from peer review comments, here are actionable insights that authors can use to write better scientific papers:

1. Emphasize Methodological Clarity

(Topic 0: training | model)

    Keywords: training, model, neural networks, loss, algorithm

    Guidance: Ensure detailed explanation of your model's training process. Highlight how your method performs across datasets like MNIST or CIFAR. Make loss functions and optimization strategies transparent.

2. Improve Visual Aids and Presentation

(Topic 1: figure | table | algorithm)

    Keywords: figure, table, algorithm, comparison

    Guidance: Reviewers frequently critique unclear or underutilized figures and tables. Ensure visuals directly support your methodology and results, and are explained clearly in the text.

3. Focus on Writing Quality and Structure

(Topic 2: authors | written | section)

    Keywords: written, section, overall, well written

    Guidance: Grammar and structure matter. Many reviews explicitly mention whether the paper is well-written. Proofread thoroughly and ensure logical organization of sections.

4. Be Rigorous in Adversarial or Specialized Domains

(Topic 3: adversarial | paper | images)

    Keywords: adversarial, GAN, images, attacks, models

    Guidance: For specialized domains like adversarial learning, reviewers expect a deep understanding and relevant state-of-the-art context. Discuss the robustness and generalization of your models.

5. Justify Experimental Design and Highlight Novelty

(Topic 4: experiments | experimental | novelty)

    Keywords: experiments, novelty, evaluation, empirical

    Guidance: Reviewers value strong experimental evidence but often question novelty. Make sure your contribution is clearly differentiated from past work, and supported by rigorous, reproducible experiments.

6. Explain Attention and Transformer-Based Models Clearly

(Topic 5: attention | model | BERT)

    Keywords: attention, BERT, embeddings, translation

    Guidance: If your work involves transformers or attention mechanisms, provide intuition and ablation studies to show why your approach is better. Clarify complex architectures with diagrams or pseudocode.

7. Choose Datasets Thoughtfully

(Topic 6: datasets | baselines)

    Keywords: MNIST, CIFAR-10, performance, synthetic

    Guidance: Common datasets like MNIST and CIFAR-10 are sometimes dismissed as too simple. Include more challenging, real-world datasets or justify their relevance.

8. Clearly State Where and How Your Paper Was Published

(Topic 7 & 8: ICML/NeurIPS references, arXiv preprints)

    Keywords: well written, arxiv preprint, ICML 2019, NeurIPS

    Guidance: Reviewers frequently refer to where similar work was published. Cite related work from high-quality venues, and be clear about the status of your submission (e.g., preprint, conference).

9. Use Formal Structure and Mathematical Rigor

(Topic 9: eq | information processing systems)

    Keywords: equation, processing systems, results

    Guidance: Include mathematical formulations where appropriate. Use clear and consistent notation, and tie theoretical contributions to practical implications.
***

The results obtained using 10 clusters are not very satisfactory: some of them seem to require a leap of logic on the LLM's part (for example, point 9), while others are too technical for our purposes (for example, points 4 and 6) and show how BERTopic was misled by the many technical terms used in the reviews, to which it chose to give more weight than to the ones we were interested in.

## Results: MiniBatch K-means - 100 clusters
Based on the analysis of all 100 clusters in the dataset, here are data-driven suggestions for improving scientific papers:
🔍 1. Emphasize Strong Experimental Evidence

    Highly frequent terms: experiments, results, performance, accuracy, dataset

    Implication: Reviewers consistently value clear, well-executed experimental sections. This includes:

        Solid experimental design

        Clear reporting of metrics (accuracy, performance)

        Comparisons against baselines

        Use of diverse datasets

✅ Tip: Present detailed, reproducible experiments and explicitly discuss results and their implications.

🧠 2. Explain the Proposed Method Clearly

    Terms like: proposed, model, method, algorithm, training

    Reviewers focus on how well the methodology is described and justified.

✅ Tip: Use clear diagrams, step-by-step descriptions, and justify design choices.

📄 3. Clarity and Quality of Writing Matters

    Phrases like: "paper well written", section, written, would

    Papers praised for writing quality often score better, even in technical reviews.

✅ Tip: Invest in editing for clarity, structure, and logical flow. Reviewers appreciate well-written content.

🧪 4. Show Theoretical and Practical Contributions

    Keywords such as theoretical, applications, contributions, practical

    Reviewers appreciate when theory is backed by practical application or real-world impact.

✅ Tip: Discuss both theoretical implications and practical use cases.

💡 5. Be Novel, But Not at the Expense of Rigor

    Clusters emphasize contributions and novelty (interesting contributions, novel)

    However, without rigor (well-validated results), novelty isn't sufficient.

✅ Tip: Balance innovation with thorough validation.

🧯 6. Address Known Issues and Risks

    Frequent terms: adversarial, robustness, uncertainty, limitations

    Reviewers care about a paper's handling of known pitfalls in ML models.

✅ Tip: Proactively discuss limitations, robustness tests, and failure cases.
***

As above, these results are far worse than the ones obtained in the first part, with ChatGPT seemingly inventing relationships between keywords to give a complete answer, making unsupported claims in the process. At least, unlike the first method, this one did not lead the LLM to be swayed by technical topics (though the presence of "adversarial" in point 6 suggests this is not entirely true).

## Results: MiniBatch K-means - 250 clusters
Most Common Feedback Themes:

The most frequent words in representative reviews include:

    "paper", "authors", "proposed", "method", "results", "model", "algorithm", "learning", "neural", "networks", "training"

This suggests reviewers focus heavily on:

    The clarity and novelty of proposed methods

    The quality of results

    The modeling approach (especially in deep learning and graph-based models)

🧠 How Authors Can Improve Their Papers:

1. Clearly Define the Proposed Method

    Many reviews emphasize the "proposed method". Make sure it's:

        Clearly stated

        Motivated with a strong rationale

        Compared against alternatives

2. Improve Explanation and Clarity

    Terms like "information", "well", and "written" suggest writing quality is scrutinized.

    Ensure:

        Logical structure

        Clear language (avoid jargon without definitions)

        Explicit contribution statements

3. Strengthen Results and Evaluation

    Frequent mentions of "results", "performance", and "experiments" imply reviewers want:

        Strong empirical evidence

        Relevant benchmarks

        Ablation studies or error analysis

4. Address Weaknesses Proactively

    Words like "weak" indicate reviewers commonly criticize:

        Weak baselines

        Lack of novelty

        Poor generalization

To avoid this:

    Highlight novel contributions early

    Justify your choices (data, baselines, architecture)

    Discuss limitations openly

5. Use Proper Referencing and Citations

    Names like "zhang", "li", "chen" and sites like "arxiv" suggest citation quality matters.

    Ensure:

        Comprehensive and accurate referencing

        Avoid overreliance on non-peer-reviewed sources (like arXiv)
***

This is a textbook case of an LLM coming up with its own answers without actually considering the given data: how it could derive from "neural" and "networks" that reviewers want quality and novelty is the perfect example. In any case, like the attempts above, this one also does not seem to work, because the model used for clustering focuses too much on technical topics, even though we tried to atomize every phrase to prevent this from happening.

## Results: MiniBatch K-means - 500 clusters
🔑 Most Emphasized Topics and Phrases (From MMR & Representative Docs)
1. Clear Writing and Structure

    Frequent phrases: “paper well written,” “well written easy follow,” “writing”

    Tip: Prioritize clarity and coherence in writing. Reviewers consistently highlight well-structured, readable papers positively.

2. Strong Experimental Validation

    Keywords: “experiments,” “ablation study,” “tables,” “plots,” “results”

    Tip: Support your claims with comprehensive and clear experimental results, including visualizations and ablation studies.

3. Topical Relevance

    Hot topics: “adversarial,” “reinforcement learning,” “graph neural networks,” “GANs,” “SGD,” “gradient descent,” “datasets”

    Tip: Engaging with trending research areas can increase reviewer interest. Ensure your work contributes meaningfully to these domains if relevant.

4. Conference Awareness

    MMR terms like: “NeurIPS,” “CVPR,” “ICLR,” “IEEE Conference on Computer Vision”

    Tip: Tailor your writing and formatting to the target venue. Reviewers often reference the norms and expectations of specific conferences.

5. Citations and Background

    Repeated terms: “references,” “previous work,” “theorems”

    Tip: Demonstrate strong grounding in related work and provide complete citations. Clearly situate your work in the context of existing literature.

6. Constructive Feedback Phrases

    Appearing in representative reviews: “think paper,” “weak accept,” “would improve,” “could,” “seems”

    Tip: Address ambiguity and potential concerns proactively. Write with a critical eye toward possible reviewer objections.

🧠 Writing Strategy Summary (Based on Top Word Frequencies)

    Emphasize clarity and readability—reviewers reward well-articulated papers.

    Provide strong experimental evidence with clear figures/tables.

    Ensure relevance to current research topics and major conferences.

    Build upon and cite existing literature rigorously.

    Anticipate reviewer feedback and critiques in advance.
***

This seems to be the best answer among those obtained with k-means, but some of the reasons the LLM gives for its answers are unclear (the whole of point 6), are not enough to justify the conclusion it reaches (point 5), or are once again the result of the model chasing every technical term it could find instead of the terms useful for our purpose (point 3).

## Results: MiniBatch K-means - 1000 clusters
Based on the full analysis of all representative reviews in the file, here are the most common themes where authors can improve their scientific papers:
Key Areas for Improvement

    Clarity and Impact of Results ("result" – 201 mentions)

        Reviewers frequently focus on how clearly results are presented and whether they convincingly support the claims.

    Quality and Design of Experiments ("experiments" – 142 mentions)

        Thorough, well-designed experiments are essential. Reviewers note when experimental design is insufficient or poorly explained.

    Comparative Analysis ("comparison", "baseline" – 121 mentions combined)

        Papers should compare their methods against strong, relevant baselines and clearly show improvement.

    Completeness ("missing", "lack" – 78 mentions combined)

        Missing components like explanations, evaluations, or discussions weaken a paper’s credibility.

    Novelty ("novelty" – 32 mentions)

        A common concern is whether the contribution is original enough. Highlighting unique angles is critical.

    Ablation Studies ("ablation" – 29 mentions)

        Including these analyses to isolate the contribution of each component is valued.

    Motivation and Significance ("motivation", "significance" – 45 mentions)

        Clear motivation and demonstrated importance of the problem are often underemphasized.

    Explanations and Methodology ("explanation", "methodology" – 23 mentions combined)

        Reviewers dislike vague methods or unexplained decisions.

    Unclear Wording or Structure ("unclear" – 17 mentions)

        Clarity of writing and argumentation is crucial.

Recommendations for Authors

    Clearly communicate results, with visual aids and context that highlight their importance.

    Design comprehensive experiments and explain every step and choice made.

    Benchmark against relevant baselines and justify why those were chosen.

    Avoid leaving gaps—ensure all expected sections and analyses are present.

    Emphasize novelty with a clear comparison to prior work.

    Include ablation studies when proposing complex models or methods.

    Explain the "why" behind your work—motivate your approach in the intro and reinforce its impact in the conclusion.

    Write clearly and precisely to minimize misinterpretation or confusion.
***

In this step, the LLM essentially ran a frequency analysis of the given results over a set of keywords that it autonomously decided were the most important (looking at the code it generated, these are: 'novelty', 'unclear', 'missing', 'lack', 'ablation', 'baseline', 'motivation', 'experiments', 'comparison', 'result', 'significance', 'evaluation', 'explanation', 'methodology'). However, this is the same idea as the n-grams approach we proposed at the start, which is far less computationally intensive than the clustering pipeline used here, so these results are not good enough to justify the extra cost. The list of keywords the LLM uses is also too restrictive to reveal interesting patterns. <br>
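
We did not keep the exact code the LLM generated, so the following is only a minimal sketch of this kind of keyword frequency analysis. The keyword list is the one reported above; the `docs` list is hypothetical toy data standing in for the representative reviews, and the choice of case-insensitive substring matching (so that "result" also counts "results") is an assumption.

```python
import re
from collections import Counter

# Keyword list reported above as the one the LLM picked on its own.
KEYWORDS = [
    "novelty", "unclear", "missing", "lack", "ablation", "baseline",
    "motivation", "experiments", "comparison", "result", "significance",
    "evaluation", "explanation", "methodology",
]

def keyword_counts(docs, keywords=KEYWORDS):
    """Count case-insensitive substring occurrences of each keyword."""
    counts = Counter({k: 0 for k in keywords})
    for doc in docs:
        text = doc.lower()
        for k in keywords:
            # Substring match (assumption): "result" also matches "results".
            counts[k] += len(re.findall(re.escape(k), text))
    return counts

# Hypothetical toy sentences, written in the stopword-free style of our dataset.
docs = [
    "results unclear ablation study missing",
    "baseline comparison missing novelty limited",
    "experiments lack strong baseline",
]

counts = keyword_counts(docs)
for keyword, n in counts.most_common(5):
    print(keyword, n)
```

A count like this only needs a single pass over the text, which is why it compares so unfavorably, in terms of cost, with embedding and clustering every sentence first.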
To see whether this could be improved, we repeated the same process, explicitly adding that the LLM **must not** use any keyword. These are the results obtained:

***
Analyzing the full content of all 1,000 representative review excerpts—without relying on keywords—reveals several recurring reviewer sentiments and phrases that provide insights into how authors can write better papers. Here are some key findings:
Common Reviewer Comments & Their Implications

    “Vote reject paper.”

        Direct rejection appears more than once, indicating recurring dissatisfaction. Authors should strive to address the root causes of rejection, often related to clarity, novelty, or rigor.

    “Please clarify.” / “Please provide details.”

        These reflect frequent requests for clearer explanations or elaboration, showing that vagueness or ambiguity in writing is a significant issue.

    “Given clarifications author response, would willing increase score.”

        Indicates that some reviewers are open to changing their opinion if confusions are resolved—implying that clarity and detailed communication can salvage reviews.

    “However find number shortcomings need addressed.”

        A typical critique suggesting that multiple flaws (in method, logic, or presentation) are common and must be minimized before submission.

    “Additional feedback aim improve paper.” / “Provide additional feedback aim improve paper.”

        Reviewers are giving constructive feedback, but authors must be receptive and proactive in addressing such suggestions.

    Mentions of publication venues (e.g., “NeurIPS 2018”, “ICML 2019”)

        Some representative texts reference other papers, possibly for comparison or benchmarking. Authors should situate their work clearly within the context of prior high-impact papers.

    “Includes determining results, whether theoretical empirical, correct scientifically rigorous.”

        Highlights the importance of sound methodology and solid results, both theoretical and empirical.

    “Leaning towards rejecting paper.”

        Shows how borderline papers can often go either way depending on presentation and completeness.

Summary of Insights

Authors can improve their chances of acceptance by:

    Writing clearly and avoiding ambiguity, especially in explaining methods and results.

    Anticipating reviewer doubts and addressing them proactively in the paper.

    Being rigorous and comprehensive—reviewers reject when multiple shortcomings are found.

    Understanding the expectations of top venues and aligning their work accordingly.

    Taking reviewer feedback seriously, especially when it’s constructive or indicates potential for acceptance with revisions.
***

In this case, some results are not relevant to our objective: “Given clarifications author response, would willing increase score.” is the perfect example, since it is not something that can help authors write better papers. In general, these results almost look as if the LLM did not analyze all the reviews, or got stuck on some useless data it considered to be of interest.

## Results analysis
As the answers above show, the LLM seemingly performed worse with this approach than with the n-gram approach. This is probably because the BERTopic model is swayed by the many technical words inside the reviews, even though the reviews were split into sentences to try to strip away this context. We could clean the dataset even further by removing the papers' topics from the reviews, but this could hinder the extraction of critiques, which would completely defeat the purpose of our activity. A further step would be to run sentiment analysis on the individual sentences, in order to find the most critical and the most positive ones and extract data from them. We don't think this would work, however, because many critiques and praises are likely nested inside mixed sentences, where they would be lost because they barely influence the overall sentiment score of the sentence. Thus, we have reasons to believe that a topic-based approach is not suited to our purpose, even though in theory it could have worked. <br>
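
The dilution effect we expect from sentence-level sentiment analysis can be illustrated with a toy example. This is only a sketch under strong assumptions: the `LEXICON` dictionary and the two sentences are hypothetical, and a real experiment would use a trained sentiment model rather than a hand-written word list.

```python
# Hypothetical toy lexicon: +1 for praise words, -1 for critique words.
LEXICON = {
    "well": 1, "interesting": 1, "clear": 1, "novel": 1,
    "weak": -1, "missing": -1, "unclear": -1, "insufficient": -1,
}

def sentence_score(sentence):
    """Average polarity of the lexicon words found in the sentence."""
    tokens = sentence.lower().split()
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical sentences in the stopword-free style of our dataset.
mixed = "paper well written idea interesting experiments weak baselines missing"
critical = "experiments weak baselines missing"

print(sentence_score(mixed))     # praise and critique cancel out
print(sentence_score(critical))  # the same critique, taken in isolation
```

In the mixed sentence the critique is still present, but the overall score looks neutral, so a filter that keeps only the most negative sentences would discard it.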