# Assignment 3: Extracting “Better‐Paper” Guidelines from the OpenReview Dataset

This notebook documents the steps, decisions, and code used to mine common reviewer feedback with local LLMs and turn it into a Markdown checklist of best practices.
Lacking deep ML expertise, I relied heavily on Ollama, first with DeepSeek-8B and then, for better results, with Qwen-3-14B, to extract shared concerns from peer reviews. Finally, I augmented those LLM-generated bullets with Sentence-BERT embeddings and clustering to group similar feedback into coherent themes.
Let's start with a simple example: Qwen-3-14B extracting common concerns from the first three reviews in our dataset.

In [34]:
import ollama

question = r"""
I am analyzing a dataset of OpenReview papers because I want to create a small markdown of general guidelines, to help authors write better papers.
Here are three reviews of a paper I chose:
<|review1|>
 This paper proposes Recency Bias, an adaptive mini batch selection method for training deep neural networks. To select informative minibatches for training, the proposed method maintains a fixed size sliding window of past model predictions for each data sample. At a given iteration, samples which have highly inconsistent predictions within the sliding window are added to the minibatch. The main contribution of this paper is the introduction of sliding window to remember past model predictions, as an improvement over the SOTA approach: Active Bias, which maintains a growing window of model predictions. Empirical studies are performed to show the superiority of Recency Bias over two SOTA approaches. Results are shown on the task of (1) image classification from scratch and (2) image classification by fine-tuning pretrained networks. +ves: + The idea of using a sliding window over a growing window in active batch selection is interesting. + Overall, the paper is well written. In particular, the Related Work section has a nice flow and puts the proposed method into context. Despite the method having limited novelty (sliding window instead of a growing window), the method has been well motivated by pointing out the limitations in SOTA methods. + The results section is well structured. It*s nice to see hyperparameter tuning results; and loss convergence graphs in various learning settings for each dataset. Concerns: - The key concern about the paper is the lack of rigorous experimentation to study the usefulness of the proposed method. Despite the paper stating that there have been earlier work (Joseph et al, 2019 and Wang et al, 2019) that attempt mini-batch selection, the paper does not compare with them. This is limiting. Further, since the proposed method is not specific to the domain of images, evaluating it on tasks other than image classification, such as text classification for instance, would have helped validate its applicability across domains. 
- Considering the limited results, a deeper analysis of the proposed method would have been nice. The idea of a sliding window over a growing window is a generic one, and there have been many efforts to theoretically analyze active learning over the last two decades. How does the proposed method fit in there? (For e.g., how does the expected model variance change in this setting?) Some form of theoretical/analytical reasoning behind the effectiveness of recency bias (which is missing) would provide greater insights to the community and facilitate further research in this direction. - The claim of 20.5% reduction in test error mentioned in the abstract has not been clearly addressed and pointed out in the results section of the paper. - On the same note, the results are not conclusively in favor of the proposed method, and only is marginally better than the competitors. Why does online batch perform consistently than the proposed method? There is no discussion of these inferences from the results. - The results would have been more complete if results were shown in a setting where just recency bias is used without the use of the selection pressure parameter. In other words, an ablation study on the effect of the selection pressure parameter would have been very useful. - How important is the warm-up phase to the proposed method? Considering the paper states that this is required to get good estimates of the quantization index of the samples, some ablation studies on reducing/increasing the warm-up phase and showing the results would have been useful to understand this. - Fig 4: Why are there sharp dips periodically in all the graphs? What do these correspond to? - The intuition behind the method is described well, however, the proposed method would have been really solidified if it were analysed in the context of a simple machine learning problem (such as logistic regression). 
As an example, verifying if the chosen minibatch samples are actually close to the decision boundary of a model (even if the model is very simple) would have helped analyze the proposed method well. Minor comments: * It would have been nice to see the relation between the effect of using recency bias and the difficulty of the task/dataset. * In the 2nd line in Introduction, it should be *deep networks* instead of *deep networks netowrks*. * Since both tasks in the experiments are about image classification, it would be a little misleading to present them as *image classification* and *finetuning*. A more informative way of titling them would be *image classification from scratch* and *image classification by finetuning*. * In Section 3.1, in the LHS of equation 3, it would be appropriate to use P(y_i/x_i; q) instead of P(y/x_i; q) since the former term was used in the paragraph. =====POST-REBUTTAL COMMENTS======== I thank the authors for the response and the efforts in the updated draft. Some of my queries were clarified. However, unfortunately, I still think more needs to be done to explain the consistency of the results and to study the generalizability of this work across datasets. I retain my original decision for these reasons.
<|review1|>
<|review2|>
This paper proposes an interesting heuristic of batch construction from samples. Instead of the usual random sampling, the authors to sample based on some measures of the ``uncertainty”. To be specific, the uncertainty is measured as a normalized entropy estimated from a window of historical predictions. I like the idea of designing more sophisticated ways to encourage more exploration over the samples that the model is not good at. The thought is similar as active learning. It is interesting to see how similar thought can be used to improve the performance of the algorithm in the general batch gradient descent setting. On the other hand I am not quite convinced the proposed way is truly better. The main concern is the experiments do not quite show the state-of-the-art result at all. It is not even close on MNIST, CIFAR-10 and CIFAR-100. Also those datasets are relatively small one. Can authors add results on larger datasets such as tiny image net? Besides this main concern I also have some worries about the design of the algorithm. I listed them below: 1. The vanilla stochastic gradient descent can be roughly justified since the expectation of the stochastic gradient is the true gradient of the loss. Now with the proposed heuristic will this still be true? 2. Is there any guarantee the algorithm can converge? It is not clear to me as the optimization proceeds the ``uncertainty” may oscillate. Is there any condition when the convergence is guaranteed? 3. As the number of classes grows the estimation of the entropy itself is a tough problem. Is there any way to mitigate this issue other than increase the window size? Another minor comment: Could the authors add more explanation on equation (4)? For example, is related to the maximum entropy led by a uniform distribution, and the summation term in (4) is related to the empirical entropy.
<|review2|>
<|review3|>
This paper explores a well motivated but very heuristic idea for selecting the next samples to train on for training deep learning models. This method relies on looking at the uncertainty of predictions of in the recent history of statements and preferring those instances that have a predictive uncertainty over the recent predictions. This allows the training method to train on instances that are neither too hard nor too easy and focus on reducing the uncertainty whenever it has the greatest potential gain to do so. There are two extra components that make this method work: - Windowing: only looking at the recent history of the instances which has two effects: firstly, the current state of the model is explored which gives a more recent assessment relative to the current state of the model. Secondly, it makes the algorithm faster by reducing the overhead of analyzing the prediction history of samples. - Annealing the selection bias: as the training goes on the selection becomes more random and less biased. This approach is evaluated in on three simple data-sets: MNIST, CIFAR-10 and CIFAR-100. Although this is a very limited subset of models, the results are consistent and statistically significant, although their effect is not really huge. The paper gives very little theoretical justification or analysis of the results but gives only the presented empirical evidence which seems to support the hypothesis on the efficacy of the approach. Another drawback of the approach is that it introduces new hyperparameters: those governing the annealing schedule for the selection bias. Since the approach seems efficient in a relatively constrained setup, it can be reasonably expected that it might be helpful in more general situations, therefore. On the other hand, since it is only evaluated on three very similar tasks, it limits the conclusiveness of the results. That*s why I would for weak accept. 
In the presence of more empirical (or even theoretical) evidence, I would vote for strong accept.
<|review3|>
Can you help me create the markdown guidelines, listing the common mistakes that are worth citing and the best solution to fix each of them?
I want something like:
<|result|>
# 📝 Paper Writing Guidelines for Authors

This guide outlines key principles to ...

## 🧪 Rigorous Experimentation

Many submissions fall short on experimental rigor—insufficient controls, lack of repeatability, vague protocol descriptions, or no statistical power analysis.

**Best-Practice Solutions:**

1. **Define Hypotheses & Variables**
    * State null and alternative hypotheses explicitly.

...

## 2. 📊 Comparison to Prior Work
<|result|>

Answer me with only the final guideline markdown.
"""


response = ollama.chat(model='qwen3:14b', messages = [
    {
        "role": "user",
        "content": question
    }
])

response = response['message']['content']
print(response)

<think>
Okay, I need to help the user create markdown guidelines for authors based on the reviews they provided. Let me start by analyzing the three reviews to identify common issues and areas for improvement.

First, Review 1 points out that the paper lacks rigorous experimentation. They didn't compare with earlier works like Joseph et al. and Wang et al., and didn't test on other domains like text classification. Also, there's a lack of theoretical analysis and some results are not conclusive. The abstract's claim about a 20.5% reduction in test error isn't addressed properly. They also mention issues with ablation studies and the need for more analysis on the warm-up phase and the selection pressure parameter.

Review 2 mentions that the experiments don't show state-of-the-art results on standard datasets like MNIST, CIFAR-10, and CIFAR-100. The author is concerned about the method's effectiveness and suggests testing on larger datasets like Tiny ImageNet. There are also questions a
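
The output above begins with the model's `<think>` reasoning rather than the requested markdown. A minimal sketch of a helper that drops that part before printing or saving is shown here. It assumes the reasoning is wrapped in `<think>...</think>` (or ends with a lone `</think>`); the name `strip_think` is hypothetical and not part of the Ollama library.

```python
import re

# Assumption: the reasoning is wrapped in a <think>...</think> block.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(text):
    """Return the reply without its <think> reasoning section."""
    cleaned = _THINK_BLOCK.sub("", text)
    # Some replies may contain only the closing tag; keep what follows it.
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>", 1)[1]
    return cleaned.strip()
```

It could be applied as `print(strip_think(response))` once `response` holds the message content.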

## 💡 The initial idea
The initial idea was to go through the reviews in batches, extract the main criticisms, and add them to a text file (pre-filled with some tips to guide the LLM) that collects all the feedback gathered so far. To do this, I prompt the LLM to expand the markdown document whenever it identifies new suggestions. I also wanted to maintain a list of the best solutions for addressing each concern.

Here's the code:

In [45]:
import ollama
import openpyxl
from io import SEEK_END

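# This file contains the template result we want to get from the model
with open("expected_result.txt", "r", encoding="utf-8") as starting_result:
    template = starting_result.read()

# "w+" lets us both append new guidelines and re-read the accumulated file;
# after the write, the file position is already at the end
final_result = open("first_idea_result.md", "w+", encoding="utf-8")
final_result.write(template)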

def get_main_question(result, review1, review2, review3):
    return f"""I am analyzing a dataset of OpenReview papers because I want to create a small markdown of general guidelines, to help authors write better papers.
So I collected some reviews and this is the result I created:
<|result|>
{result}
<|result|>
Now I have three reviews of a paper I chose.
I want you to find any new common mistakes that are worth citing and add them as a new heading to the markdown I provided you. Maintain the original style of the markdown (with headings, emoji in the headings, no third-level headings, etc.). If the mistakes are already present, do not add them. If you don't find any new mistakes, leave the markdown as it is.
Here are the reviews:
<|review1|>
{review1}
<|review1|>
<|review2|>
{review2}
<|review2|>
<|review3|>
{review3}
<|review3|>
Answer me with only the final guideline markdown.
"""

sheet = openpyxl.load_workbook("or_2020.xlsx").active

for row in range(5, 40, 3):  # ideally up to sheet.max_row, but on my PC that would take ~80 hours
    review1 = sheet.cell(row, 6).value
    review2 = sheet.cell(row+1, 6).value
    review3 = sheet.cell(row+2, 6).value
    # Rewind before reading: the file position sits at the end after each write,
    # so read() would otherwise return an empty string
    final_result.seek(0)
    final_question = get_main_question(final_result.read(), review1, review2, review3)
    final_result.seek(0, SEEK_END)
    response = ollama.chat(model='qwen3:14b', messages = [
        {
            "role": "user",
            "content": final_question
        }
    ])
    response = response['message']['content']
    if response.find('```markdown') != -1:
        response = response.split('```markdown')[1].split('```')[0] # get the guideline markdown part
    elif response.find('#') != -1:
        response = response[response.find('#'):]  # keep everything from the first heading onward
    final_result.write(response)

final_result.seek(0)
uniform_question = f"""
Rewrite this markdown so that it has a uniform style (for example, every heading has an emoji, there are no third-level subheadings, it starts with the title "# 📝 Paper Writing Guidelines for Authors", etc.). Finally, if some headings are too similar, merge them into a single heading.
Here's the markdown:
<|markdown|>
{final_result.read()}
<|markdown|>
"""
response = ollama.chat(model='qwen3:14b', messages = [
    {
        "role": "user",
        "content": uniform_question
    }
])
response = response['message']['content']
# Strip off the '<think>' section if present
if '</think>' in response:
    response = response.split('</think>', 1)[1]
final_result.write(response)
print(response)
final_result.close()



# 📝 Paper Writing Guidelines for Authors  

This guide outlines key principles to help authors avoid common pitfalls and meet the expectations of peer reviewers and the broader research community.  

## 📊 Comparison to Prior Work  

Authors often claim novelty without fair, head-to-head benchmarks against the most relevant baselines.  

### Best-Practice Solutions  

1. **Conduct a Focused Literature Review**  
   - Identify 3-5 seminal and recent papers tackling the same problem.  
   - Highlight gaps your approach fills.  

2. **Re-implement or Use Published Code**  
   - Whenever possible, obtain or re-implement baseline methods rather than quoting old results.  
   - Match preprocessing, hyperparameters, and evaluation metrics exactly.  

3. **Present Results in Comparative Tables**  
   ```markdown
   | Method        | Dataset     | Metric     | Result         |
   |---------------|-------------|------------|----------------|
   | Proposed      | X           | Accuracy   | 87.4 

The result was not bad, but some problems arose:

- **Prompt-length explosion:** as the markdown grows, the LLM hallucinates or its output gets cut off.
- **Inconsistency:** later generations sometimes rephrase or lose earlier entries.
- **Throughput:** re-sending the entire file for each triple of reviews becomes slow.

For this reason, I decided to revise the strategy.
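
The prompt-length problem can be made visible with a rough size check before each call. The sketch below is only a heuristic: the ratio of about 4 characters per token is an assumption, not Qwen's actual tokenizer, and `context_tokens` and `reserve_for_answer` are hypothetical parameters that would have to match the model configuration in use.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1

def fits_context(prompt, context_tokens, reserve_for_answer=1024):
    """Check whether the prompt plus room for the answer fits the context window."""
    return estimate_tokens(prompt) + reserve_for_answer <= context_tokens
```

Calling `fits_context(final_question, context_tokens=...)` before each request would flag the point at which the accumulated markdown no longer leaves room for a full answer.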

## 🔄 Revised Strategy: Use LLMs + Clustering

Rather than carrying forward the cumulative output and feeding it back to the LLM each time, I decided to revamp the workflow to separate concern extraction from guideline synthesis, improving both speed and flexibility. Here’s how it works in practice:

1. **Dynamic Review Grouping:**  
   We scan the spreadsheet row by row, grouping all consecutive reviews for the same paper title into a single batch prompt.

2. **One-Shot LLM Extraction:**  
   We send all collected reviews with a clear instruction:  
   > “List only the common concerns shared by these reviews as bullet points.”  
   The LLM returns clean, newline-delimited bullets, which we parse and trim.

3. **Checkpointing in JSON:**  
   By saving title → concerns to a JSON file, we can interrupt and resume processing without redoing work.

4. **Normalization & Deduplication:**  
   Once all concerns are collected, we lowercase, strip punctuation, and filter out noise before clustering.

5. **Embeddings & Clustering:**  
   Cleaned sentences are embedded with Sentence-BERT and automatically grouped (e.g. hierarchical or K-Means). For each cluster, we pick a representative sentence—either the one closest to the centroid or simply the longest—and assemble a preliminary Markdown checklist.

Finally, a quick human pass lets us assign theme labels (e.g., Novelty, Baselines, Theory) and polish the bullets into “Do”/“Don’t” guidelines.
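
The "closest to the centroid" choice in step 5 can be sketched as follows. This is a minimal illustration in plain Python on toy 2-D vectors, not the notebook's actual clustering code: in the real pipeline the vectors would be Sentence-BERT embeddings and the labels would come from the clustering step, and `pick_representatives` is a hypothetical helper name.

```python
from collections import defaultdict
from math import dist

def pick_representatives(sentences, vectors, labels):
    """Map each cluster label to the sentence closest to that cluster's centroid."""
    # Group sentence indices by cluster label
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    representatives = {}
    for label, members in groups.items():
        dim = len(vectors[members[0]])
        # Centroid = component-wise mean of the cluster's vectors
        centroid = [sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)]
        # The representative is the member with the smallest Euclidean distance
        best = min(members, key=lambda i: dist(vectors[i], centroid))
        representatives[label] = sentences[best]
    return representatives

# Toy data: three nearby "baseline" concerns and one unrelated concern
toy_sentences = ["no baselines", "missing baseline comparison",
                 "weak baselines", "no theory"]
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 10.0]]
toy_labels = [0, 0, 0, 1]
toy_result = pick_representatives(toy_sentences, toy_vectors, toy_labels)
```

Picking the longest sentence instead would only require replacing the `min(...)` call with `max(members, key=lambda i: len(sentences[i]))`.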

Here's the code for steps 1, 2, and 3:

In [None]:
import ollama
import openpyxl
import os
import json

OUT_PATH = "concerns.json"
SHEET_PATH = "or_2020.xlsx"

# Load the OpenReview sheet
sheet = openpyxl.load_workbook(SHEET_PATH).active

# Load existing results if any
if os.path.exists(OUT_PATH):
    with open(OUT_PATH, "r", encoding="utf-8") as f:
        results = json.load(f)
else:
    results = {}

def get_main_question_dynamic(reviews):
    """
    Build the prompt for an arbitrary number of reviews.
    reviews: list of strings, each representing a review.
    """
    sections = []
    for i, rev in enumerate(reviews, start=1):
        sections.append(f"<|review{i}|>\n{rev}\n<|review{i}|>")
    reviews_block = "\n".join(sections)
    return (
        "Perfect, now I have other reviews of a paper I chose:\n"
        f"{reviews_block}\n"
        "Answer me with only the common concerns that the reviewers have, "
        "with only the bullet points without any other explanation."
    )

question = r"""
I am analyzing a dataset of OpenReview papers and their reviews because I want to extract the common concerns that reviewers raise, to help authors know them in advance.
In particular, I have three reviews of a paper I chose:
<|review1|>
 This paper proposes Recency Bias, an adaptive mini batch selection method for training deep neural networks. To select informative minibatches for training, the proposed method maintains a fixed size sliding window of past model predictions for each data sample. At a given iteration, samples which have highly inconsistent predictions within the sliding window are added to the minibatch. The main contribution of this paper is the introduction of sliding window to remember past model predictions, as an improvement over the SOTA approach: Active Bias, which maintains a growing window of model predictions. Empirical studies are performed to show the superiority of Recency Bias over two SOTA approaches. Results are shown on the task of (1) image classification from scratch and (2) image classification by fine-tuning pretrained networks. +ves: + The idea of using a sliding window over a growing window in active batch selection is interesting. + Overall, the paper is well written. In particular, the Related Work section has a nice flow and puts the proposed method into context. Despite the method having limited novelty (sliding window instead of a growing window), the method has been well motivated by pointing out the limitations in SOTA methods. + The results section is well structured. It*s nice to see hyperparameter tuning results; and loss convergence graphs in various learning settings for each dataset. Concerns: - The key concern about the paper is the lack of rigorous experimentation to study the usefulness of the proposed method. Despite the paper stating that there have been earlier work (Joseph et al, 2019 and Wang et al, 2019) that attempt mini-batch selection, the paper does not compare with them. This is limiting. Further, since the proposed method is not specific to the domain of images, evaluating it on tasks other than image classification, such as text classification for instance, would have helped validate its applicability across domains. 
- Considering the limited results, a deeper analysis of the proposed method would have been nice. The idea of a sliding window over a growing window is a generic one, and there have been many efforts to theoretically analyze active learning over the last two decades. How does the proposed method fit in there? (For e.g., how does the expected model variance change in this setting?) Some form of theoretical/analytical reasoning behind the effectiveness of recency bias (which is missing) would provide greater insights to the community and facilitate further research in this direction. - The claim of 20.5% reduction in test error mentioned in the abstract has not been clearly addressed and pointed out in the results section of the paper. - On the same note, the results are not conclusively in favor of the proposed method, and only is marginally better than the competitors. Why does online batch perform consistently than the proposed method? There is no discussion of these inferences from the results. - The results would have been more complete if results were shown in a setting where just recency bias is used without the use of the selection pressure parameter. In other words, an ablation study on the effect of the selection pressure parameter would have been very useful. - How important is the warm-up phase to the proposed method? Considering the paper states that this is required to get good estimates of the quantization index of the samples, some ablation studies on reducing/increasing the warm-up phase and showing the results would have been useful to understand this. - Fig 4: Why are there sharp dips periodically in all the graphs? What do these correspond to? - The intuition behind the method is described well, however, the proposed method would have been really solidified if it were analysed in the context of a simple machine learning problem (such as logistic regression). 
As an example, verifying if the chosen minibatch samples are actually close to the decision boundary of a model (even if the model is very simple) would have helped analyze the proposed method well. Minor comments: * It would have been nice to see the relation between the effect of using recency bias and the difficulty of the task/dataset. * In the 2nd line in Introduction, it should be *deep networks* instead of *deep networks netowrks*. * Since both tasks in the experiments are about image classification, it would be a little misleading to present them as *image classification* and *finetuning*. A more informative way of titling them would be *image classification from scratch* and *image classification by finetuning*. * In Section 3.1, in the LHS of equation 3, it would be appropriate to use P(y_i/x_i; q) instead of P(y/x_i; q) since the former term was used in the paragraph. =====POST-REBUTTAL COMMENTS======== I thank the authors for the response and the efforts in the updated draft. Some of my queries were clarified. However, unfortunately, I still think more needs to be done to explain the consistency of the results and to study the generalizability of this work across datasets. I retain my original decision for these reasons.
<|review1|>
<|review2|>
This paper proposes an interesting heuristic of batch construction from samples. Instead of the usual random sampling, the authors to sample based on some measures of the ``uncertainty”. To be specific, the uncertainty is measured as a normalized entropy estimated from a window of historical predictions. I like the idea of designing more sophisticated ways to encourage more exploration over the samples that the model is not good at. The thought is similar as active learning. It is interesting to see how similar thought can be used to improve the performance of the algorithm in the general batch gradient descent setting. On the other hand I am not quite convinced the proposed way is truly better. The main concern is the experiments do not quite show the state-of-the-art result at all. It is not even close on MNIST, CIFAR-10 and CIFAR-100. Also those datasets are relatively small one. Can authors add results on larger datasets such as tiny image net? Besides this main concern I also have some worries about the design of the algorithm. I listed them below: 1. The vanilla stochastic gradient descent can be roughly justified since the expectation of the stochastic gradient is the true gradient of the loss. Now with the proposed heuristic will this still be true? 2. Is there any guarantee the algorithm can converge? It is not clear to me as the optimization proceeds the ``uncertainty” may oscillate. Is there any condition when the convergence is guaranteed? 3. As the number of classes grows the estimation of the entropy itself is a tough problem. Is there any way to mitigate this issue other than increase the window size? Another minor comment: Could the authors add more explanation on equation (4)? For example, is related to the maximum entropy led by a uniform distribution, and the summation term in (4) is related to the empirical entropy.
<|review2|>
<|review3|>
This paper explores a well motivated but very heuristic idea for selecting the next samples to train on for training deep learning models. This method relies on looking at the uncertainty of predictions of in the recent history of statements and preferring those instances that have a predictive uncertainty over the recent predictions. This allows the training method to train on instances that are neither too hard nor too easy and focus on reducing the uncertainty whenever it has the greatest potential gain to do so. There are two extra components that make this method work: - Windowing: only looking at the recent history of the instances which has two effects: firstly, the current state of the model is explored which gives a more recent assessment relative to the current state of the model. Secondly, it makes the algorithm faster by reducing the overhead of analyzing the prediction history of samples. - Annealing the selection bias: as the training goes on the selection becomes more random and less biased. This approach is evaluated in on three simple data-sets: MNIST, CIFAR-10 and CIFAR-100. Although this is a very limited subset of models, the results are consistent and statistically significant, although their effect is not really huge. The paper gives very little theoretical justification or analysis of the results but gives only the presented empirical evidence which seems to support the hypothesis on the efficacy of the approach. Another drawback of the approach is that it introduces new hyperparameters: those governing the annealing schedule for the selection bias. Since the approach seems efficient in a relatively constrained setup, it can be reasonably expected that it might be helpful in more general situations, therefore. On the other hand, since it is only evaluated on three very similar tasks, it limits the conclusiveness of the results. That*s why I would for weak accept. 
In the presence of more empirical (or even theoretical) evidence, I would vote for strong accept.
<|review3|>
Answer me with only the common concern that the reviewers have, with only the bullet points without any other explanation.
"""

answer = r"""
- Rigorous Experimentation: Many submissions fall short on experimental rigor—insufficient controls, lack of repeatability, vague protocol descriptions, or no statistical power analysis.
- Comparison to Prior Work: Authors often claim novelty without fair, head-to-head benchmarks against the most relevant baselines.
- Quantitative Claims Need Proof: Percentages are meaningless without a clear definition of the metric, baseline, and statistical significance. For example, "Our method is 20% better" is vague without context.
- Dataset Size & Diversity: Too-small or homogenous datasets lead to overfitting and findings that don't generalize.
"""

# First row to process is 5, since the first 3 rows have been used to collect the starting result
row = 5
max_row = sheet.max_row

while row <= max_row:
    title = sheet.cell(row, 1).value
    
    # Skip if already processed
    if title in results:
        print(f"Skipping {title!r}, already in JSON.")
        # Skip to the next title
        while row <= max_row and sheet.cell(row, 1).value == title:
            row += 1
        continue

    # Grab the reviews for this title
    reviews = []
    while row <= max_row and sheet.cell(row, 1).value == title:
        rev = sheet.cell(row, 6).value
        if rev and rev.strip():
            reviews.append(rev.strip())
        row += 1
    print(f"Processing {title!r} with {len(reviews)} reviews.")

    final_question = get_main_question_dynamic(reviews)
    response = ollama.chat(
        model='qwen3:14b',
        messages=[
            {"role": "user",      "content": question},
            {"role": "assistant", "content": answer},
            {"role": "user",      "content": final_question}
        ]
    )['message']['content']

    # Strip off the '<think>' section if present
    if '</think>' in response:
        response = response.split('</think>', 1)[1]
   
    # Split into sentences by newline (and trim)
    sentences = [line.strip() for line in response.split("\n") if line.strip()]
    print(sentences)
    
    # If sentences are bullet points, remove the bullet marker
    sentences = [s[2:] if s.startswith("- ") else s for s in sentences]

    # Store under the sheet title
    results[title] = sentences
    print(f"Collected {len(sentences)} concerns for {title!r}.")

    # Checkpoint after every paper so an interrupted run can resume without redoing work
    with open(OUT_PATH, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

print(f"\n✅ Saved {len(results)} papers' concerns to {OUT_PATH}")

Skipping 'Prestopping: How Does Early Stopping Help Generalization Against Label Noise? | OpenReview', already in JSON.
Skipping 'Analysis and Interpretation of Deep CNN Representations as Perceptual Quality Features | OpenReview', already in JSON.
Skipping 'Improving Evolutionary Strategies with Generative Neural Networks | OpenReview', already in JSON.
Skipping 'Wide Neural Networks are Interpolating Kernel Methods: Impact of Initialization on Generalization | OpenReview', already in JSON.
Skipping 'SSE-PT: Sequential Recommendation Via Personalized Transformer | OpenReview', already in JSON.
Skipping 'Scoring-Aggregating-Planning: Learning task-agnostic priors from interactions and sparse rewards for zero-shot generalization | OpenReview', already in JSON.
Skipping 'Count-guided Weakly Supervised Localization Based on Density Map | OpenReview', already in JSON.
Skipping 'Star-Convexity in Non-Negative Matrix Factorization | OpenReview', already in JSON.
Skipping 'Noise Regularizatio

And here is the code for points 4 and 5:

In [43]:
import re
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from collections import defaultdict
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import json

# Load your previously saved concerns.json
with open("concerns.json", "r", encoding="utf-8") as f:
    concern_map = json.load(f)

# Load the expected result from the file
expected_result = open("expected_result.txt", "r", encoding="utf-8")

# Flatten all the lists of sentences into one list
all_sentences = [sent for sentences in concern_map.values() for sent in sentences]

# Removes exact repeats so clusters aren’t skewed by identical items.
unique_sentences = list(dict.fromkeys(all_sentences))

print(unique_sentences)

def normalize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Normalize: lowercase, strip punctuation, remove extra whitespace
normalized = [normalize(s) for s in unique_sentences]

# Drop filler lines (e.g. "okay") and fragments made up mostly of stop-words (e.g. "and", "the", "of")
stop_words = set(stopwords.words('english'))

filtered = []
for orig, norm in zip(unique_sentences, normalized):
    tokens = word_tokenize(norm)
    # keep only sentences with at least 3 non-stop-word tokens
    if sum(1 for t in tokens if t not in stop_words) >= 3:
        filtered.append((orig, norm))

sentences, norms = zip(*filtered)

# EMBEDDINGS
# Each sentence is converted into a fixed-length vector that “encodes” its meaning.
# Sentences with similar semantics end up closer in this vector space.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(list(norms), convert_to_tensor=False)

# CLUSTERING
# Starts with each sentence as its own cluster.
# Iteratively merges the two closest clusters until only n_clusters remain.
n_clusters = 8
clusterer = AgglomerativeClustering(n_clusters=n_clusters) # or KMeans(n_clusters=n_clusters, random_state=0)
labels = clusterer.fit_predict(embeddings)

# Group the original sentences by cluster label; each cluster now holds sentences with similar meaning.
clusters = defaultdict(list)
for lab, orig in zip(labels, sentences):
    clusters[lab].append(orig)


def pick_representative(sent_list):
    # simple heuristic: choose the longest (most detailed) sentence
    return max(sent_list, key=len)

# For now we will just pick the longest sentence in each cluster as the representative.
guidelines = []
for lab, items in clusters.items():
    rep = pick_representative(items)
    guidelines.append(f"- {rep}")

print("\n".join(guidelines))

# Read the example guideline document opened above, then release the file handle
expected_text = expected_result.read()
expected_result.close()

# Join outside the f-string: a backslash inside an f-string expression is a SyntaxError before Python 3.12
concerns_text = "\n".join(guidelines)

final_question = f"""
I have analyzed a dataset of OpenReview papers and their reviews because I wanted to extract the common concerns that reviewers raise, so that authors know in advance what to expect. Now I have a list of general concerns. Can you help me create a nice Markdown document with the concerns and the best solutions? For example, this is a possible result:
<|result|>
{expected_text}
<|result|>
Remember that I want general guidelines; I do not want citations to specific papers or ML-related guidelines. These are the concerns:
<|concerns|>
{concerns_text}
<|concerns|>
Answer me with only the final guideline markdown.
"""

# A more powerful hosted model (e.g. ChatGPT) could be swapped in here
response = ollama.chat(
    model='qwen3:14b',
    messages=[{"role": "user", "content": final_question}]
)['message']['content']

# Drop the model's <think>...</think> reasoning block, keeping only the final answer
if '</think>' in response:
    response = response.split('</think>', 1)[1]
print(response)

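# Write explicitly as UTF-8 so non-ASCII characters in the answer do not fail on platforms with another default encoding
with open("final_idea_result.md", "w", encoding="utf-8") as f:
    f.write(response)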


[nltk_data] Downloading package punkt to /Users/fabbi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/fabbi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/fabbi/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


['Rigorous Experimentation: Many submissions fall short on experimental rigor—insufficient controls, lack of repeatability, vague protocol descriptions, or no statistical power analysis.', 'Comparison to Prior Work: Authors often claim novelty without fair, head-to-head benchmarks against the most relevant baselines.', "Quantitative Claims Need Proof: Percentages are meaningless without a clear definition of the metric, baseline, and statistical significance. For example, 'Our method is 20% better' is vague without context.", "Dataset Size & Diversity: Too-small or homogenous datasets lead to overfitting and findings that don't generalize.", 'Lack of comparison with relevant baselines: Authors did not compare with key existing methods (e.g., [1], [2], [3]) or failed to cite important related works on early-stopping and label noise.', 'Limited novelty: The proposed method is described as incremental or too similar to prior work (e.g., self-training, co-training, iterative learning with 

## ⚠️ Limitations & Future Work

Our revised pipeline makes guideline generation more scalable, but several challenges remain. Processing large numbers of papers end-to-end can quickly become a bottleneck, and some of our design choices introduce trade-offs:

- **Performance & Cost:** Running a 14B-parameter LLM once per paper (plus SBERT for embeddings) is slow and expensive. Scaling beyond a small sample requires either substantial GPU capacity or a paid inference API to maintain reasonable throughput.  
- **Fixed Number of Clusters:** Using a hard-coded `n_clusters` risks over- or under-segmenting reviewer concerns. In future iterations we should incorporate metrics (e.g. silhouette score) or density-based methods (DBSCAN/HDBSCAN) to choose cluster counts automatically.  
- **Simple Representative Heuristic:** We currently pick the longest sentence in each cluster as the “headline” concern, which may be verbose or overly technical. A more robust approach could select the sentence closest to the cluster centroid or ask a small, focused LLM to synthesize a concise summary.
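
The two clustering ideas above (choosing the cluster count with a silhouette score, and picking the sentence nearest the cluster centroid) could look roughly like the sketch below. It is not part of the pipeline I ran: the helper names `choose_n_clusters` and `centroid_representatives`, the `k_min`/`k_max` search range, and the synthetic 2-D "embeddings" are all assumptions made for illustration. In the real notebook the inputs would be the `embeddings` and `sentences` computed earlier.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def choose_n_clusters(embeddings, k_min=2, k_max=10):
    """Try each k in [k_min, k_max] and keep the one with the highest silhouette score."""
    # The silhouette score is only defined for 2 <= k <= n_samples - 1
    k_max = min(k_max, len(embeddings) - 1)
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, k_max + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


def centroid_representatives(sentences, embeddings, labels):
    """For each cluster, return the sentence whose embedding is closest to the cluster mean."""
    reps = {}
    for lab in np.unique(labels):
        idx = np.where(labels == lab)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        reps[int(lab)] = sentences[idx[np.argmin(dists)]]
    return reps


# Synthetic stand-in for SBERT embeddings (assumption): three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_embeddings = np.vstack([c + rng.normal(scale=0.5, size=(5, 2)) for c in centers])
demo_sentences = [f"concern {i}" for i in range(len(demo_embeddings))]

best_k, demo_labels = choose_n_clusters(demo_embeddings, k_min=2, k_max=6)
demo_reps = centroid_representatives(demo_sentences, demo_embeddings, demo_labels)
print(best_k, demo_reps)
```

On real review data the silhouette curve is likely to be much flatter than on these toy blobs, so the selected `k` should be treated as a starting point to inspect by hand, not as a definitive answer.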