## Assignment 3

### Authors: Jacopo Corrao, Pietro Farina

This assignment consists of developing code that combines LLMs and traditional ML to extract knowledge from the OpenReview review dataset.
The goal is to find interesting patterns in reviews (and possibly, why not, in papers) that can give hints to authors about how to write better papers.


We decided to organize the work as follows:

1. Manual inspection of the dataset
2. Data Cleaning
3. Knowledge Extraction from Paper Acceptance Data using BERT-based NLP Techniques
4. Paper Acceptance Analysis
5. Pairwise comparison of papers through LLMs

### Setup



In [3]:
from google.colab import userdata
my_secret_key = userdata.get('API_KEY')

In [4]:
!pip install openai pandas openpyxl



In [3]:
from google.colab import drive

drive.mount("/content/drive")

file_path = "/content/drive/MyDrive/KnowledgeDiscoveryAndPatternExtraction/open_review_dataset.xlsx"

Mounted at /content/drive



# Manual Inspection of the Dataset

In [5]:
import pandas as pd

# Load the Excel file
excel_file = pd.ExcelFile(file_path)

# Display sheet names
print(excel_file.sheet_names)

# Load each sheet into a dictionary
sheets = {sheet_name: excel_file.parse(sheet_name) for sheet_name in excel_file.sheet_names}

# Prepare summaries for each sheet
sheet_summaries = {}
for name, df in sheets.items():
    summary = {
        "columns": df.columns.tolist(),
        "head": df.head().to_dict(orient="records")
    }
    sheet_summaries[name] = summary
# Avoid naming the loop variable `sum`, which would shadow the built-in function
for sheet_name in sheet_summaries:
    print(sheet_summaries[sheet_name])

['Sheet1', 'Sheet2', 'Sheet3', 'Sheet4', 'Sheet6', 'Sheet5']
{'columns': ['title', 'Unnamed: 1', 'keywords', 'E', 'F', 'G', 'decision', 'J', 'K', 'rate', 'T'], 'head': [{'title': '#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning | OpenReview', 'Unnamed: 1': '06 Nov 2016 (modified: 10 Jan 2017)', 'keywords': 'Keywords:###Deep learning, Reinforcement Learning, Games', 'E': 'Conflicts:###berkeley.edu, eecs.berkeley.edu, openai.com, ugent.be', 'F': 93, 'G': '06 Feb 2017', 'decision': 'Decision:###Reject', 'J': 1458, 'K': '22 Dec 2016 10 Jan 2017', 'rate': 4, 'T': 1655}, {'title': '#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning | OpenReview', 'Unnamed: 1': '06 Nov 2016 (modified: 10 Jan 2017)', 'keywords': 'Keywords:###Deep learning, Reinforcement Learning, Games', 'E': 'Conflicts:###berkeley.edu, eecs.berkeley.edu, openai.com, ugent.be', 'F': 93, 'G': '06 Feb 2017', 'decision': 'Decision:###Reject', 'J': 1458, 'K': '19 Dec 2

The dataset is split into six sheets, each representing different years of the review process for some conference.

| Sheet  | Rows  | Columns | Notes                                                                 |
|--------|-------|---------|-----------------------------------------------------------------------|
| Sheet1 | 1495  | 22      | Many unnamed columns (e.g., 'Unnamed: 1', 'F', 'G'), likely noisy; includes abstract, decision, review, rates |
| Sheet2 | 2849  | 12      | Some columns in Chinese ('序号' = index, '方差' = variance); has derived stats like avgscore, confidence sum |
| Sheet3 | 4733  | 12      | Includes submission dates, IDs, abstract/keywords, rates, and decisions |
| Sheet4 | 7769  | 19      | Rich metadata (e.g., review times, reviewer level, comment length), cleaner naming |
| Sheet5 | 3457  | 25      | Multiple reviewers' scores, citation info, and a mix of identifiers |
| Sheet6 | 2966  | 26      | Very similar to Sheet5 but with some extra fields like 'arxiv', more structured reviewer columns |

In [6]:
import json

# Create a prompt with sheet summaries
prompt = "I have an Excel file with multiple sheets, this dataset represent the openreview dataset with information about the review process of some papers, my objective is to extract information that could help write better papers. Each sheet has the following structure:\n\n"

for name, summary in sheet_summaries.items():
    columns = ', '.join(summary['columns'])
    first_row = summary['head'][0] if summary['head'] else {}
    first_row_str = json.dumps(first_row)
    prompt += f"[Sheet: {name} | Columns: {columns} ] "

prompt += "Can you compare these sheets, highlight the differences in their structures, explain the different labels, and suggest how I might align them for a unified analysis? I want you to provide a deep detailed explanations of the previous points"

# Get GPT-4's response
# response = client.chat.completions.create(
#     model="openai/gpt-4",
#     messages=[
#         {"role": "system", "content": "You are a data analyst."},
#         {"role": "user", "content": prompt}
#     ]
# )

# Print GPT-4's response
# print(response.choices[0].message.content)



```
# Output of previous request:

Let's take a look at the structure of each of the Excel sheets:

**Sheet1** has columns such as title1, abstract, keywords, title, decision, rate, confidence, review, etc. It mostly contains initial attributes of the papers, like the title, abstract, keywords, and review-related data like decision, rate, confidence to judge and review content.

**Sheet2** has a mix of Chinese and English columns like 序号 (Number), title, confidence, rate源 (Rate Source), review, reviewLength, accept, confidence分数 (Confidence Score), confidence总和 (Total Confidence), avgscore, rate, 方差 (variance). It contains a detailed scoring and review related data, with added information about review lengths, acceptance status and aggregate information such as total confidence, average score, and variance.

**Sheet3** has auto_increase, id, title, submit_date, decision, Confidence, rate1, rate, review, abstract_field, keyword_field, etc. It carries submission level data like title, submission date, deposition and reviews along with basic paper attributes like abstract and keywords.

**Sheet4** has more details about paper review process and timelines, abstract, keyword, titlelength, paper_decision_time, decision, paper_decision_comment, paper_decision_commentlength, review, review_contentlength, etc. with some attributes from earlier sheets like title, decision etc.

**Sheet5** and **Sheet6** seem to be identical in their structure, containing detailed paper review ratings, decision, reviewer details, chairs, title, keywords, abstract, etc., along with citation-related details citenum, cite, link1 and some aggregation values like avgscore, fangcha (variance in Chinese).

Now, comparing these sheets:

1. Some columns such as 'title', 'review', 'rate', 'confidence', 'decision', 'abstract', 'keywords' seem common between almost all sheets and can form a baseline to join or combine these sheets.
2. Sheet1 has multiple unnamed columns that may make merging with other sheets challenging.
3. Sheet2 has some columns in Chinese which need translation or mapping with other English columns if suitable.
4. Sheet2, Sheet5, and Sheet6 have columns that provide aggregate information or metrics derived from raw data e.g., avgscore, variance (in Chinese) that might not be available in other sheets.
5. Sheets 4, 5, and 6 provide more process control level data related to paper review process and timeline of events.
6. Sheets 5 and 6 have more citation and referencing related details like 'citenum', 'cite', 'link' along with reviewer related data which the other sheets may lack.
7. The 'id' and 'paper_id' could communicate between different sheets for combination, if they are the same.

Based on this, here are some recommendations to align the sheets for a unified analysis:

1. Investigate the unnamed columns: Before proceeding with any data alignment, it would be worthwhile to explore these unnamed columns for potential useful data.

2. Translate the Chinese language columns: To make merging of the datasets easier, translate the Chinese columns into English if feasible or find their corresponding English fields in other sheets.

3. Standardize column naming: You may need to consider renaming or aliasing columns that represent the same attributes in different sheets for easy merging. For instance, 'title1' in Sheet1 could be renamed to 'title' for consistency across sheets.

4. Merge using common columns: Consider merging datasets based on common columns such as 'title', 'abstract', 'keywords', 'rate', 'review', 'confidence', 'decision' or ‘id’, ‘paper_id’, etc.

5. Handling derived metrics: Columns that contain derived information like 'avgscore' , 'fangcha', 'rate0' through 'rate6', etc. in Sheet2, Sheet5, and Sheet6 need to be handled carefully as they might not align directly with raw data in other sheets. The logic or method of their calculation needs to be reviewed to see if similar can be derived from other sheets.

6. Combining reviewing process-related data: Consider merging reviewer-related data like 'review_score_one', 'review_score_two', 'review_score_three' from Sheet4 with respective details from Sheet5 and Sheet6. Do a similar approach for paper decision and reviewing process-related data.
```



### Columns Ranking

We can explore the data and the columns to identify the most meaningful ones.

**High Importance (Core columns, appear in ≥3 sheets)**:

* title (6): Identifier for papers.
* decision (5): Outcome of the review process: acceptance or rejection.
* abstract (4): Insights on the content of the paper.
* review (4): Full review text.
* rate, rate1: Numerical review scores.
* confidence, avgscore: Reviewer confidence and mean score.
* keywords: Another insight into the content of the paper.

**Medium Importance (Moderate usage or derived metrics)**:
* paper_id, rate0 to rate6: Breakdown of scores per reviewer.
* reviewer0 to reviewer4: Individual reviewers.
* cite, fangcha (variance), chairs, link1, Title1.

**Low Importance / Noisy / One-off**, these appear only once or are ambiguous:
* Columns like e, f, g, unnamed: 1, unnamed: 11, etc.
* Chinese-only: 序号, 方差, rate源 — likely derivatives of others.
* Specialized metadata: tl_dl, titlelength, review_publish_time, review_score_one, arxiv.
These should be discarded or deprioritized in correlation analysis unless later needed for specific modeling.

Overall, some column names were missing or imprecise, so we had to explore the data manually to rename them appropriately.

### Notes

Sheets 1 to 4 use a **review-per-row** layout: each row represents a single evaluation from one reviewer, so a paper spans multiple rows. Sheets 5 and 6 instead use a **paper-per-row** layout, where the comments from the different reviewers sit in different columns of the same row.
This discrepancy between sheets might mislead the analysis; however, due to time constraints, we decided to keep the representation used in the original data.
In general, **review-per-row** suits fine-grained analysis of individual reviewer feedback and enables statistical summaries such as average score, disagreement, or sentiment. **Paper-per-row**, on the other hand, makes it easier to summarize the overall reception and is a cleaner format for descriptive analytics, e.g., decision outcome based on scores or comments.
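
As an illustration of how the two layouts relate, the sketch below collapses a review-per-row frame into a paper-per-row frame with a `groupby`. The toy values are invented, and grouping on `title` assumes that titles identify papers uniquely, which we did not verify for every sheet.

```python
import pandas as pd

# Toy review-per-row frame; column names follow the dataset, values are invented.
reviews = pd.DataFrame({
    "title": ["Paper A", "Paper A", "Paper B"],
    "rate": [6, 8, 3],
    "review": ["solid work", "clear writing", "weak baselines"],
    "decision": ["Accept", "Accept", "Reject"],
})

# Collapse to one row per paper: average the scores, count the reviews,
# join the review texts, and keep the decision from the first row.
papers = reviews.groupby("title", as_index=False).agg(
    avgscore=("rate", "mean"),
    n_reviews=("rate", "count"),
    review=("review", " ".join),
    decision=("decision", "first"),
)
print(papers)
```

We did not apply this conversion in the notebook; it only shows that the paper-per-row view can be derived from the review-per-row one when needed.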

Sheets 5 and 6 include, for each paper, the link to the webpage describing the corresponding review process. Following these links, we discovered that the submitted paper itself was also available. We decided to retrieve some papers and perform a **manual information extraction** via LLMs, described in the last section.

# Data Cleaning

## Objective of the Work
The purpose of this cleaning procedure is to prepare a consistent and noise-free dataset for analyzing the characteristics that distinguish accepted from rejected papers on OpenReview. The ultimate goal is to identify useful patterns to improve the writing and structure of a scientific paper.

## Cleaning Steps and Motivations

### 1. Column Standardization
- Column names referring to the same concept but labeled differently were unified (e.g., abstract_field → abstract, keyword_field → keywords, title1 → title).
- Purpose: ensure consistency across sheets to facilitate data merging and cross-analysis.
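
A minimal sketch of this renaming step, using only the three aliases named above. The helper name `standardize_columns` and the rule of skipping a rename when the target column already exists are illustrative assumptions, not the exact procedure we ran.

```python
import pandas as pd

# Aliases taken from the examples in the text; extend per sheet as needed.
COLUMN_ALIASES = {
    "abstract_field": "abstract",
    "keyword_field": "keywords",
    "title1": "title",
}

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known aliases, skipping any whose target name already exists."""
    renames = {
        old: new
        for old, new in COLUMN_ALIASES.items()
        if old in df.columns and new not in df.columns
    }
    return df.rename(columns=renames)

demo = pd.DataFrame({"title1": ["Paper A"], "abstract_field": ["..."], "rate": [6]})
standardized = standardize_columns(demo)
print(standardized.columns.tolist())
```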

### 2. Review Unification
- Columns such as reviewer0, reviewer1, etc., were concatenated into a single review column.
- Purpose: provide a compact and comprehensive textual representation of the feedback received by each paper.
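
The concatenation could look like the sketch below. It assumes the reviewer columns are named `reviewer` followed by digits; the helper name `unify_reviews` and the single-space separator are hypothetical choices.

```python
import pandas as pd

def unify_reviews(df: pd.DataFrame, prefix: str = "reviewer") -> pd.DataFrame:
    """Join all columns named <prefix><digits> into a single 'review' column."""
    cols = [
        c for c in df.columns
        if str(c).startswith(prefix) and str(c)[len(prefix):].isdigit()
    ]
    if not cols:
        return df.copy()
    out = df.drop(columns=cols)
    # Skip missing or blank cells so they do not leave stray separators.
    out["review"] = df[cols].apply(
        lambda row: " ".join(
            str(v).strip() for v in row if pd.notna(v) and str(v).strip()
        ),
        axis=1,
    )
    return out

demo = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "reviewer0": ["Good idea.", "Unclear."],
    "reviewer1": ["Needs ablations.", None],
})
unified = unify_reviews(demo)
print(unified)
```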

### 3. Handling of rate Columns
- In sheets with multiple rating columns (e.g., rate1, rate2), the average was calculated, ignoring zeros.
- Purpose: obtain a meaningful and synthetic measure of the paper's evaluation.
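
A sketch of the zero-ignoring average, assuming a zero means "no score given". Zeros are turned into missing values so that the row-wise mean, which skips missing values by default, leaves them out. The helper name is illustrative.

```python
import numpy as np
import pandas as pd

def mean_rate_ignoring_zeros(df: pd.DataFrame, rate_cols: list) -> pd.Series:
    """Row-wise mean of the rating columns, treating 0 as 'no score'."""
    rates = df[rate_cols].apply(pd.to_numeric, errors="coerce")
    # Zeros become NaN, and DataFrame.mean skips NaN by default.
    return rates.replace(0, np.nan).mean(axis=1)

demo = pd.DataFrame({"rate1": [6, 4, 0], "rate2": [8, 0, 0]})
demo["rate"] = mean_rate_ignoring_zeros(demo, ["rate1", "rate2"])
print(demo)
```

A row whose ratings are all zero ends up with a missing `rate` rather than a score of 0, which keeps it out of later averages.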

### 4. Standardization of the decision Column
- Multiple decision variants (e.g., decision1, decision123) were merged into a single decision column.
- Numerical values ("0", "1") were converted into textual labels: "Reject" and "Accept".
- Purpose: obtain a consistent categorical column for outcome classification.
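
The mapping can be sketched as follows. It assumes that `1` encodes acceptance and `0` rejection, as the order of the labels above suggests, and that any text containing "accept" or "reject" (e.g., `Decision:###Reject`) should map to the corresponding label. The function name is hypothetical.

```python
import pandas as pd

def normalize_decision(value):
    """Map a raw decision value to 'Accept' / 'Reject', or None if unknown."""
    if pd.isna(value):
        return None
    text = str(value).strip().lower()
    # Assumption: numeric 1 means accepted, numeric 0 means rejected.
    if text in {"1", "1.0"} or "accept" in text:
        return "Accept"
    if text in {"0", "0.0"} or "reject" in text:
        return "Reject"
    return None

raw = pd.Series(["Decision:###Reject", 1, "0", None, "Accept (Poster)"])
print(raw.map(normalize_decision).tolist())
```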

### 5. Removal of Low-Coverage Columns
- Columns filled in less than 1% of the rows were eliminated.
- Purpose: remove noisy and uninformative dimensions for statistical modeling.
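
A possible implementation of the 1% coverage filter is sketched below. Here "filled" is assumed to mean "not NaN", so empty strings would still count as filled; the helper name and toy data are illustrative.

```python
import pandas as pd

def drop_low_coverage(df: pd.DataFrame, min_fill: float = 0.01) -> pd.DataFrame:
    """Keep only columns whose share of non-null cells is at least min_fill."""
    coverage = df.notna().mean()  # fraction of filled rows per column
    return df.loc[:, coverage >= min_fill]

demo = pd.DataFrame({
    "title": [f"paper {i}" for i in range(200)],
    "sparse": [None] * 199 + ["x"],          # 0.5% filled, dropped
    "ok": ["a", "b", "c"] + [None] * 197,    # 1.5% filled, kept
})
filtered = drop_low_coverage(demo)
print(filtered.columns.tolist())
```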

### 6. Removal of Columns Containing Links
- Columns containing URLs (e.g., http://...) were removed as they are not relevant for analysis.
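
One way to detect such columns automatically is sketched below. The `min_share` threshold (drop a column when at least half of its non-null values start with `http://` or `https://`) is an assumption made for illustration; in practice the columns can also be listed by hand.

```python
import pandas as pd

def drop_url_columns(df: pd.DataFrame, min_share: float = 0.5) -> pd.DataFrame:
    """Drop columns where at least min_share of non-null values look like URLs."""
    to_drop = []
    for col in df.columns:
        values = df[col].dropna().astype(str)
        # str.match anchors the pattern at the start of each value.
        if len(values) and values.str.match(r"https?://").mean() >= min_share:
            to_drop.append(col)
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "link1": ["https://example.org/a", "https://example.org/b"],
    "rate": [6, 3],
})
no_links = drop_url_columns(demo)
print(no_links.columns.tolist())
```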

### 7. Final Cleanup of Redundant Columns
- In Sheet2, the column accept was renamed to decision for semantic consistency.
- In Sheet5, the chairs column was removed.
- In all sheets, only the title column was kept; duplicates like title1 were dropped.

### 8. Basic Text Cleaning
- Basic text preprocessing was applied: converting to lowercase, removing special characters, extra spaces, and irrelevant text portions.
- Purpose: ensure uniformity in textual fields for subsequent NLP or classification tasks.
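
A minimal sketch of such a cleaning function. The two "irrelevant text portions" handled here, the trailing `| OpenReview` suffix of titles and field prefixes such as `Keywords:###`, are taken from the raw rows shown earlier; the exact set of patterns we removed may have been broader, so treat the regular expressions as assumptions.

```python
import re

def clean_text(text) -> str:
    """Lowercase, strip known boilerplate, drop special characters, squeeze spaces."""
    text = str(text).lower()
    text = re.sub(r"\s*\|\s*openreview\s*$", "", text)  # trailing page-title suffix
    text = re.sub(r"^[a-z ]+:###", "", text)            # prefix such as "keywords:###"
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # special characters
    return re.sub(r"\s+", " ", text).strip()            # extra whitespace

print(clean_text("#Exploration: A Study of Count-Based Exploration | OpenReview"))
print(clean_text("Keywords:###Deep learning, Games"))
```

Note that replacing punctuation with spaces also removes the commas between keywords, so this function is meant for free-text fields; keyword lists should be split before being cleaned.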

### 9. Removal of Non-Significant Columns
- Throughout the process, columns deemed irrelevant to the goal of identifying patterns useful for improving paper acceptance probability were removed.

## Conclusion
This data cleaning process has transformed a heterogeneous dataset into a clean, consistent, and analytically usable version. It now enables reliable statistical or machine learning analysis to discover patterns among paper features and their acceptance likelihood. This forms a solid basis for providing actionable and data-driven advice to researchers preparing a scientific paper.

# Knowledge Extraction from Paper Acceptance Data using BERT-based NLP Techniques

## 1. Introduction

The primary objective is to extract actionable insights that can guide researchers in crafting more effective paper submissions by leveraging natural language processing (NLP) techniques and machine learning-based topic modeling.

### 1.1 Dataset Overview

The dataset includes the following key fields:
- `title`: Title of the paper.
- `keywords`: Keywords assigned by authors.
- `decision`: Outcome of the submission ("Accept" or "Reject").
- `rate` / `avgscore`: Numerical score given by reviewers.
- `review_publish_time`, `paper_decision_time`: Timestamps for tracking timelines.
- `review_contentlength`, `paper_decision_commentlength`: Length of review and final decision comments.

---

## 2. Methodology

### 2.1 Data Preprocessing

Data was extracted from Excel sheets containing multiple years of conference submissions. Each sheet was processed independently to ensure modularity and flexibility across different editions of the same conference.

Key preprocessing steps included:
- Cleaning and standardizing keyword formatting (`Keywords:###BERT, model compression` → `["BERT", "model compression"]`)
- Extracting clean decisions (`Accept` / `Reject`)
- Computing title lengths and extracting n-grams (bigrams, trigrams)

### 2.2 Natural Language Processing with BERT

#### 2.2.1 Topic Modeling via BERTopic

To identify thematic differences between accepted and rejected papers, we employed **BERTopic**, a state-of-the-art topic modeling technique based on BERT embeddings. This method allows for:
- Unsupervised clustering of titles into semantically coherent topics
- Comparison of dominant topics in accepted vs. rejected papers
- Insight into which themes are most associated with successful submissions

BERTopic leverages transformer-based contextual embeddings to capture nuanced semantic relationships between paper titles, offering a richer alternative to traditional methods like LDA.

#### 2.2.2 Keyword Analysis

We analyzed the frequency and distribution of keywords in accepted and rejected papers. In particular, we calculated:
- Absolute frequency of each keyword
- Relative lift: how much more likely a keyword is to appear in accepted papers than in all papers

This allowed us to identify keywords strongly correlated with acceptance.
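
The lift described above can be written as P(keyword | accepted) / P(keyword | all papers). The sketch below computes it from `(keywords, decision)` pairs; the function name, the `min_count` filter, and the toy data are illustrative assumptions, and each keyword is counted at most once per paper.

```python
from collections import Counter

def keyword_lift(papers, min_count=1):
    """Return {keyword: P(kw | accepted) / P(kw | all)} for (keywords, decision) pairs."""
    papers = list(papers)
    accepted = [kws for kws, decision in papers if decision == "Accept"]
    if not papers or not accepted:
        return {}
    all_counts = Counter(kw for kws, _ in papers for kw in set(kws))
    acc_counts = Counter(kw for kws in accepted for kw in set(kws))
    lift = {}
    for kw, n_all in all_counts.items():
        if n_all < min_count:
            continue  # rare keywords give unstable ratios
        p_all = n_all / len(papers)
        p_acc = acc_counts[kw] / len(accepted)
        lift[kw] = p_acc / p_all
    return lift

demo = [
    (["gan", "vision"], "Accept"),
    (["gan"], "Accept"),
    (["rl"], "Reject"),
    (["gan", "rl"], "Reject"),
]
lift = keyword_lift(demo)
print(lift)
```

A lift above 1 means the keyword is over-represented among accepted papers, and a lift below 1 means it is under-represented. On real data a `min_count` well above 1 is advisable, since keywords seen only a handful of times produce extreme values.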

#### 2.2.3 Power Words in Titles

Using bigram and trigram analysis, we identified frequent combinations of words appearing in paper titles. These were compared across accepted and rejected sets to uncover linguistic patterns that may influence reviewer perception.
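
One simple way to compare the two sets is to rank n-grams by the difference in their per-title rate between accepted and rejected papers, as in the sketch below. The ranking criterion and the function names are assumptions made for illustration, and the titles are invented.

```python
from collections import Counter

def ngram_counts(titles, n=2):
    """Count word n-grams over a list of titles (lowercased, whitespace tokens)."""
    counts = Counter()
    for title in titles:
        tokens = str(title).lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def distinctive_ngrams(accepted_titles, rejected_titles, n=2, top=5):
    """Rank n-grams by (rate in accepted titles) minus (rate in rejected titles)."""
    acc = ngram_counts(accepted_titles, n)
    rej = ngram_counts(rejected_titles, n)
    n_acc = max(len(accepted_titles), 1)
    n_rej = max(len(rejected_titles), 1)
    diff = {g: acc[g] / n_acc - rej[g] / n_rej for g in set(acc) | set(rej)}
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:top]

accepted_demo = [
    "Deep Reinforcement Learning for Games",
    "Deep Reinforcement Learning with Counts",
]
rejected_demo = ["Deep Learning for Games"]
top_bigrams = distinctive_ngrams(accepted_demo, rejected_demo, n=2, top=2)
print(top_bigrams)
```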

#### 2.2.4 Correlation Between Review Scores and Outcomes

Where available, we analyzed numerical review scores (`rate` or `avgscore`) to determine:
- Average scores for accepted vs. rejected papers
- Distribution of scores within each group
- Threshold values that correlate with acceptance
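
For the threshold analysis, one simple option is to scan every observed score and keep the cut-off for which the rule "accept if score >= threshold" agrees most often with the actual decisions. This is a sketch of that idea with invented scores, not necessarily the procedure behind our reported numbers.

```python
import numpy as np

def best_score_threshold(scores, accepted):
    """Return (threshold, accuracy) for the rule 'accept if score >= threshold'."""
    scores = np.asarray(scores, dtype=float)
    accepted = np.asarray(accepted, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):  # candidate thresholds in ascending order
        acc = float(np.mean((scores >= t) == accepted))
        if acc > best_acc:       # ties keep the lowest threshold
            best_t, best_acc = float(t), acc
    return best_t, best_acc

demo_scores = [3.0, 4.5, 5.0, 6.0, 6.5, 7.0]
demo_accepted = [False, False, False, True, False, True]
threshold, accuracy = best_score_threshold(demo_scores, demo_accepted)
print(threshold, accuracy)
```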

---

## 6. Future Work

- **Sentiment and linguistic pattern analysis using LLMs**:  
  Feed the full review texts and final decision comments into a Large Language Model (e.g., GPT, Llama, Mistral) to extract qualitative differences between accepted and rejected papers. This could highlight:
  - Common reasons for rejection (e.g., lack of novelty, insufficient experiments)
  - Phrases or argument structures associated with acceptance
  - Tone, clarity, and persuasiveness in reviewer feedback

  This would complement the current quantitative approach with deeper qualitative insights that are difficult to extract via traditional NLP methods.

- **Automated classification** of paper abstracts to predict acceptance likelihood
- Integration of **citation graphs** to assess prior work coverage
- Analysis of **reviewer engagement** through comment length and detail level
- Correlation between **reviewer agreement** and final decision outcome

---

In [6]:
# `Counter` is part of the standard-library `collections` module, so it is not installed via pip
!pip install pandas bertopic numpy

In [1]:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer  # needed by run_topic_modeling
from collections import Counter
import numpy as np
import re

# Extract n-grams (bigrams/trigrams) from titles
def extract_ngrams(texts, n=2):
    # str.split() without arguments ignores leading/trailing whitespace, avoiding empty tokens
    tokens = [str(t).lower().split() for t in texts]
    ngrams = []
    for tok in tokens:
        for i in range(len(tok)-n+1):
            ngrams.append(tuple(tok[i:i+n]))
    return Counter(ngrams)

# Function to load all sheets
def load_all_sheets(file_path):
    xls = pd.ExcelFile(file_path, engine='openpyxl')
    all_data = []
    for sheet in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet)
        if 'title' in df.columns and 'decision' in df.columns:
            df['sheet'] = sheet  # Add sheet name for traceability
            all_data.append(df)
    return pd.concat(all_data, ignore_index=True)

# Function to clean decision values ('Accept' or 'Reject')
def clean_decision(decision):
    if isinstance(decision, str):
        if 'Accept' in decision:
            return 'Accept'
        elif 'Reject' in decision:
            return 'Reject'
    return None

# Function to extract keywords
def extract_keywords(keywords_str):
    if isinstance(keywords_str, str):
        return [kw.strip() for kw in keywords_str.split(',') if kw.strip()]
    return []

# Function to count words in title
def count_words(title):
    return len(str(title).split())

# Function to run global topic modeling
def run_topic_modeling(titles):
    print("\nRunning Global Topic Modeling...")
    vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
    model = BERTopic(language="english", vectorizer_model=vectorizer, min_topic_size=10, verbose=False)
    topics, probs = model.fit_transform(titles)
    print(model.get_topic_info().head(10))
    return model

# Function to count global keywords
def count_global_keywords(df, col='keywords'):
    all_keywords = [item for sublist in df[col] for item in sublist]
    return Counter(all_keywords)

# Helper function to avoid KeyError
def safe_explode(df, col):
    """Internal function to handle optional columns"""
    if col in df.columns:
        return df.explode(col)[col].dropna().tolist()
    else:
        return []

# 1. Prepare data from a sheet
def prepare_sheet_data(df_sheet):
    print("Preparing sheet data...")
    optional_cols = ['title', 'keywords', 'decision', 'rate', 'publish_time', 'review_contentlength']
    relevant_cols = [col for col in optional_cols if col in df_sheet.columns]
    if 'title' not in df_sheet.columns:
        print("Warning: Required column 'title' is missing. Cannot proceed.")
        return None
    df = df_sheet[relevant_cols].copy()
    # Clean keywords
    if 'keywords' in df.columns:
        # fillna('') prevents missing values from becoming the literal keyword 'nan'
        df['keywords'] = df['keywords'].fillna('').astype(str).str.replace('Keywords:###', '', regex=False).str.split(',')
        df['keywords'] = df['keywords'].apply(lambda x: [kw.strip() for kw in x if isinstance(kw, str) and kw.strip()] if isinstance(x, list) else [])
    # Create decision_clean only if decision column exists
    if 'decision' in df.columns:
        # expand=False returns a Series rather than a one-column DataFrame
        df['decision_clean'] = df['decision'].astype(str).str.extract(r'(Accept|Reject)', expand=False)
    else:
        df['decision_clean'] = None
        print("Warning: Column 'decision' not found. Accept/Reject analysis skipped.")
    print(f"Data prepared: {len(df)} valid rows.")
    return df

# 2. Comparative analysis between accepted and rejected papers
def analyze_accepted_vs_rejected(df_prepared, sheet_name, summary_data):
    print(f"\nCOMPARATIVE ANALYSIS - SHEET '{sheet_name}'")
    has_decision = 'decision_clean' in df_prepared.columns and df_prepared['decision_clean'].notna().any()
    if not has_decision:
        print("No valid decisions ('Accept'/'Reject') found. Skipping Accept/Reject analysis.")
        accepted = pd.DataFrame()
        rejected = pd.DataFrame()
    else:
        accepted = df_prepared[df_prepared['decision_clean'] == 'Accept']
        rejected = df_prepared[df_prepared['decision_clean'] == 'Reject']
        print(f"Total papers: {len(df_prepared)}")
        print(f"Accepted: {len(accepted)} ({len(accepted)/len(df_prepared):.1%})")
        print(f"Rejected: {len(rejected)} ({len(rejected)/len(df_prepared):.1%})")

    # Initialize dictionary for this sheet
    summary_data[sheet_name] = {
        "total": len(df_prepared),
        "accepted_count": len(accepted),
        "rejected_count": len(rejected),
        "all_keywords": [],
        "accepted_keywords": [],
        "rejected_keywords": [],
        "title_lengths": [],
        "accepted_title_lengths": [],
        "rejected_title_lengths": [],
        "titles": [],
        "accepted_titles": [],
        "rejected_titles": [],
        "rates": [],
        "accepted_rates": [],
        "rejected_rates": [],
        "review_lengths": [],
        "accepted_review_lengths": [],
        "rejected_review_lengths": [],
        "submit_dates": [],
        "accepted_submit_dates": [],
        "rejected_submit_dates": [],
    }

    # Keyword Analysis
    if 'keywords' in df_prepared.columns and df_prepared['keywords'].apply(len).sum() > 0:
        all_kw = df_prepared.explode('keywords')['keywords'].value_counts()
        print("\nTop 10 Most Common Keywords:")
        print(all_kw.head(10))
        if has_decision:
            acc_kw = accepted.explode('keywords')['keywords'].value_counts().head(10)
            rej_kw = rejected.explode('keywords')['keywords'].value_counts().head(10)
            print("\nTop Keywords in Accepted Papers:")
            print(acc_kw)
            print("\nTop Keywords in Rejected Papers:")
            print(rej_kw)
        summary_data[sheet_name]["all_keywords"] = safe_explode(df_prepared, 'keywords')
        summary_data[sheet_name]["accepted_keywords"] = safe_explode(accepted, 'keywords')
        summary_data[sheet_name]["rejected_keywords"] = safe_explode(rejected, 'keywords')
    else:
        print("\nNo keywords available for analysis.")
        summary_data[sheet_name]["all_keywords"] = []
        summary_data[sheet_name]["accepted_keywords"] = []
        summary_data[sheet_name]["rejected_keywords"] = []

    # Title Length Analysis
    def title_length(title):
        return len(str(title).split())

    df_prepared['title_len'] = df_prepared['title'].apply(title_length)
    mean_len_all = df_prepared['title_len'].mean()
    print(f"\nAverage number of words in titles: Total={mean_len_all:.1f}")
    summary_data[sheet_name]["title_lengths"] = df_prepared['title_len'].tolist()
    # dropna() must run before astype(str); otherwise NaN becomes the string 'nan' and is kept
    summary_data[sheet_name]["titles"] = df_prepared['title'].dropna().astype(str).tolist()

    if has_decision:
        mean_len_accept = df_prepared.loc[df_prepared['decision_clean'] == 'Accept', 'title_len'].mean()
        mean_len_reject = df_prepared.loc[df_prepared['decision_clean'] == 'Reject', 'title_len'].mean()
        print(f"    Accepted={mean_len_accept:.1f}, Rejected={mean_len_reject:.1f}")
        summary_data[sheet_name]["accepted_title_lengths"] = df_prepared.loc[df_prepared['decision_clean'] == 'Accept', 'title_len'].tolist()
        summary_data[sheet_name]["rejected_title_lengths"] = df_prepared.loc[df_prepared['decision_clean'] == 'Reject', 'title_len'].tolist()
        # dropna() before astype(str) so that missing titles are actually removed
        summary_data[sheet_name]["accepted_titles"] = accepted['title'].dropna().astype(str).tolist()
        summary_data[sheet_name]["rejected_titles"] = rejected['title'].dropna().astype(str).tolist()

    # Average Rating Analysis
    if 'rate' in df_prepared.columns:
        df_prepared['rate'] = pd.to_numeric(df_prepared['rate'], errors='coerce')
        mean_rate_all = df_prepared['rate'].mean()
        print(f"\nAverage rating: Total={mean_rate_all:.1f}")
        summary_data[sheet_name]["rates"] = df_prepared['rate'].dropna().tolist()
        if has_decision:
            mean_rate_accept = df_prepared.loc[df_prepared['decision_clean'] == 'Accept', 'rate'].mean()
            mean_rate_reject = df_prepared.loc[df_prepared['decision_clean'] == 'Reject', 'rate'].mean()
            print(f"    Accepted={mean_rate_accept:.1f}, Rejected={mean_rate_reject:.1f}")
            summary_data[sheet_name]["accepted_rates"] = df_prepared.loc[df_prepared['decision_clean'] == 'Accept', 'rate'].dropna().tolist()
            summary_data[sheet_name]["rejected_rates"] = df_prepared.loc[df_prepared['decision_clean'] == 'Reject', 'rate'].dropna().tolist()
    else:
        print("\nColumn 'rate' not present. Rating analysis skipped.")

    # Review content length analysis
    if 'review_contentlength' in df_prepared.columns:
        df_prepared['review_contentlength'] = pd.to_numeric(df_prepared['review_contentlength'], errors='coerce')
        summary_data[sheet_name]["review_lengths"] = df_prepared['review_contentlength'].dropna().tolist()
        if has_decision:
            summary_data[sheet_name]["accepted_review_lengths"] = df_prepared.loc[df_prepared['decision_clean'] == 'Accept', 'review_contentlength'].dropna().tolist()
            summary_data[sheet_name]["rejected_review_lengths"] = df_prepared.loc[df_prepared['decision_clean'] == 'Reject', 'review_contentlength'].dropna().tolist()

    # Submission date analysis
    if 'publish_time' in df_prepared.columns:
        # astype(str) keeps the .str accessor valid even if the column was parsed as datetime
        df_prepared['submit_date'] = pd.to_datetime(df_prepared['publish_time'].astype(str).str.split(' ').str[0], errors='coerce')
        summary_data[sheet_name]["submit_dates"] = df_prepared['submit_date'].dropna().tolist()
        if has_decision:
            summary_data[sheet_name]["accepted_submit_dates"] = df_prepared.loc[df_prepared['decision_clean'] == 'Accept', 'submit_date'].dropna().tolist()
            summary_data[sheet_name]["rejected_submit_dates"] = df_prepared.loc[df_prepared['decision_clean'] == 'Reject', 'submit_date'].dropna().tolist()

    return accepted, rejected

# 3. Topic Modeling on Titles
def topic_modeling_titles_for_sheet(accepted, rejected, sheet_name):
    print(f"\nTopic Modeling on Titles - SHEET '{sheet_name}'")
    min_docs_for_topic_modeling = 5

    def run_topic_model(titles, label):
        if len(titles) < min_docs_for_topic_modeling:
            print(f"Fewer than {min_docs_for_topic_modeling} titles for {label}. Skipped.")
            return
        try:
            model = BERTopic(language="english", min_topic_size=min_docs_for_topic_modeling, verbose=False)
            topics, _ = model.fit_transform(titles)
            info = model.get_topic_info().head(6)
            print(f"\nTop Topics - {label}:")
            print(info)
        except Exception as e:
            print(f"Error in topic modeling for {label}: {e}")

    # dropna() before astype(str) so that missing titles are not modeled as the string 'nan'
    if not accepted.empty and 'title' in accepted.columns:
        titles = accepted['title'].dropna().astype(str).tolist()
        run_topic_model(titles, "Accepted")
    if not rejected.empty and 'title' in rejected.columns:
        titles = rejected['title'].dropna().astype(str).tolist()
        run_topic_model(titles, "Rejected")

# 4. Global Aggregated Analysis
def global_analysis(summary_data):
    print("\nGlobal Aggregated Analysis Across All Sheets")
    all_global_kw = []
    acc_global_kw = []
    rej_global_kw = []
    all_global_titles = []
    acc_global_titles = []
    rej_global_titles = []
    all_global_rates = []
    acc_global_rates = []
    rej_global_rates = []
    title_lengths_all = []
    title_lengths_acc = []
    title_lengths_rej = []
    review_lengths_all = []
    review_lengths_acc = []
    review_lengths_rej = []
    submit_dates_all = []
    submit_dates_acc = []
    submit_dates_rej = []

    for sheet in summary_data:
        data = summary_data[sheet]
        all_global_kw.extend(data["all_keywords"])
        acc_global_kw.extend(data["accepted_keywords"])
        rej_global_kw.extend(data["rejected_keywords"])
        all_global_titles.extend(data["titles"])
        acc_global_titles.extend(data["accepted_titles"])
        rej_global_titles.extend(data["rejected_titles"])
        all_global_rates.extend(data["rates"])
        acc_global_rates.extend(data["accepted_rates"])
        rej_global_rates.extend(data["rejected_rates"])
        title_lengths_all.extend(data["title_lengths"])
        title_lengths_acc.extend(data["accepted_title_lengths"])
        title_lengths_rej.extend(data["rejected_title_lengths"])
        review_lengths_all.extend(data["review_lengths"])
        review_lengths_acc.extend(data["accepted_review_lengths"])
        review_lengths_rej.extend(data["rejected_review_lengths"])
        submit_dates_all.extend(data["submit_dates"])
        submit_dates_acc.extend(data["accepted_submit_dates"])
        submit_dates_rej.extend(data["rejected_submit_dates"])

    # Global Statistics
    print("\nGlobal Statistics:")
    print(f"Total papers: {len(all_global_titles)}")
    print(f"Average number of words in titles: {np.mean(title_lengths_all):.1f}")
    if title_lengths_acc:
        print(f"Average number of words in accepted titles: {np.mean(title_lengths_acc):.1f}")
    if title_lengths_rej:
        print(f"Average number of words in rejected titles: {np.mean(title_lengths_rej):.1f}")

    # Ratings
    if all_global_rates:
        print(f"Global average rating: {np.mean(all_global_rates):.1f}")
        if acc_global_rates:
            print(f"Average rating for accepted papers: {np.mean(acc_global_rates):.1f}")
        if rej_global_rates:
            print(f"Average rating for rejected papers: {np.mean(rej_global_rates):.1f}")

    # Global Keywords
    print("\nTop 10 Global Keywords:")
    print(Counter(all_global_kw).most_common(10))
    print("\nTop 10 Keywords in Accepted Papers:")
    print(Counter(acc_global_kw).most_common(10))
    print("\nTop 10 Keywords in Rejected Papers:")
    print(Counter(rej_global_kw).most_common(10))

    # N-gram Analysis
    print("\nMost Common Bigrams in Titles:")
    print(extract_ngrams(all_global_titles, n=2).most_common(10))
    print("\nMost Common Trigrams in Titles:")
    print(extract_ngrams(all_global_titles, n=3).most_common(10))

    # Global Topic Modeling
    print("\nGlobal Topic Modeling on All Titles:")
    if len(all_global_titles) >= 5:
        model = BERTopic(language="english", min_topic_size=5, verbose=False)
        topics, _ = model.fit_transform(all_global_titles)
        print(model.get_topic_info().head(6))
    else:
        print("Not enough titles for global topic modeling.")

    # Topic vs Decision
    if len(all_global_titles) >= 5 and len(acc_global_titles) > 0 and len(rej_global_titles) > 0:
        print("\nGlobal Topic Modeling + Topic-Decision Association:")

        titles = acc_global_titles + rej_global_titles
        decisions = ['accept'] * len(acc_global_titles) + ['reject'] * len(rej_global_titles)

        df_global = pd.DataFrame({
            'title': titles,
            'decision': decisions
        })

        model = BERTopic(language="english", min_topic_size=5, verbose=False)
        topics, probs = model.fit_transform(df_global['title'])
        df_global['topic'] = topics

        topic_decisions = df_global.groupby('topic')['decision'].value_counts(normalize=True).unstack(fill_value=0).sort_values(by='accept', ascending=False)
        print(topic_decisions)

        print("\nTop Topics with Acceptance Keywords:")
        for topic_id in topic_decisions.index[:5]:
            keywords = model.get_topic(topic_id)
            if isinstance(keywords, list):
                keyword_list = ", ".join([word for word, _ in keywords[:5]])
            else:
                keyword_list = "N/A"
            print(f"Topic {topic_id}: {keyword_list}")

        topic_keywords = {topic: model.get_topic(topic) for topic in df_global['topic'].unique()}

        high_accept_topics = topic_decisions[topic_decisions['accept'] > 0.7].index.tolist()
        print("\nKeywords associated with high acceptance topics:")
        success_keywords = []
        for topic_id in high_accept_topics:
            keywords = model.get_topic(topic_id)
            if isinstance(keywords, list):
                words = [word for word, _ in keywords]
                success_keywords.extend(words)
                keyword_line = ", ".join(words[:5])
            else:
                keyword_line = "N/A"
            print(f"Topic {topic_id} → {keyword_line}")

        print("\nMost frequent keywords in winning topics:")
        print(Counter(success_keywords).most_common(10))
    else:
        print("Not enough data for Topic vs Decision analysis.")

    # Review Length Analysis
    if review_lengths_all:
        print("\nReview Length Analysis:")
        print(f"- Global average: {np.mean(review_lengths_all):.1f}")
        if review_lengths_acc:
            print(f"- Accepted: {np.mean(review_lengths_acc):.1f}")
        if review_lengths_rej:
            print(f"- Rejected: {np.mean(review_lengths_rej):.1f}")

    # Temporal Analysis
    if submit_dates_all:
        print("\nTemporal Analysis:")
        submit_df = pd.DataFrame({'date': submit_dates_all})
        submit_df['month_year'] = pd.to_datetime(submit_df['date']).dt.to_period('M')
        print(submit_df['month_year'].value_counts().sort_index())
        if submit_dates_acc and submit_dates_rej:
            accept_df = pd.DataFrame({'date': submit_dates_acc, 'type': 'accept'})
            reject_df = pd.DataFrame({'date': submit_dates_rej, 'type': 'reject'})
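            combined = pd.concat([accept_df, reject_df], ignore_index=True)
            # Parse dates explicitly, then count accepted/rejected papers per month.
            combined['date'] = pd.to_datetime(combined['date'])
            monthly = combined.groupby([combined['date'].dt.to_period('M'), 'type']).size().unstack(fill_value=0)
            print(monthly)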

# --- Main Execution ---
if __name__ == "__main__":
    file_path = "/content/drive/MyDrive/KnowledgeDiscoveryAndPatternExtraction/cleaned_dataset.xlsx"
    sheet_names = None
    try:
        xls = pd.ExcelFile(file_path, engine='openpyxl')
        sheet_names = xls.sheet_names
        print(f"\nFile loaded successfully. Available sheets: {sheet_names}")
    except FileNotFoundError:
        # Stop with a message instead of calling exit(), which is meant for interactive shells
        raise SystemExit(f"File '{file_path}' not found.")
    except Exception as e:
        raise SystemExit(f"Error opening Excel file: {e}")

    summary_data = {}  # To collect aggregated data

    for sheet_name in sheet_names:
        print(f"\n{'='*40}\nStarting analysis: {sheet_name}\n{'='*40}")
        df_sheet = pd.read_excel(xls, sheet_name=sheet_name)
        if df_sheet.empty:
            print(f"Sheet '{sheet_name}' is empty. Skipping.")
            continue
        df_prepared = prepare_sheet_data(df_sheet)
        if df_prepared is None or df_prepared.empty:
            print(f"No valid data for sheet '{sheet_name}'.")
            continue
        accepted, rejected = analyze_accepted_vs_rejected(df_prepared, sheet_name, summary_data)
        topic_modeling_titles_for_sheet(accepted, rejected, sheet_name)
        print(f"{'='*40}\nFinished analysis: {sheet_name}\n{'='*40}")

    # Final aggregated analysis
    if summary_data:
        global_analysis(summary_data)
    else:
        print("No useful data for aggregated analysis.")


File loaded successfully. Available sheets: ['Sheet1', 'Sheet2', 'Sheet3', 'Sheet4', 'Sheet6', 'Sheet5']

Starting analysis: Sheet1
Preparing sheet data...
Data prepared: 1495 valid rows.

COMPARATIVE ANALYSIS - SHEET 'Sheet1'
Total papers: 1495
Accepted: 607 (40.6%)
Rejected: 742 (49.6%)

Top 10 Most Common Keywords:
keywords
Deep learning                  1197
Unsupervised Learning           323
Computer vision                 317
Natural language processing     315
Applications                    251
Supervised Learning             233
Optimization                    192
Reinforcement Learning          175
Theory                          131
Transfer Learning                95
Name: count, dtype: int64

Top Keywords in Accepted Papers:
keywords
Deep learning                  524
Natural language processing    137
Unsupervised Learning          123
Computer vision                 99
Reinforcement Learning          95
Optimization                    93
Applications                   


Top Topics - Accepted:
   Topic  Count                                            Name  \
0     -1      7       -1_residuals_fractalnet_ultradeep_without   
1      0     40  0_optimization_to_imaginationbased_metacontrol   
2      1     25            1_pruning_cnns_pooling_convolutional   
3      2     20          2_quantization_ternary_trained_network   
4      3     18                3_language_natural_level_compose   
5      4     18           4_quasirecurrent_chaos_automatic_long   

                                      Representation  \
0  [residuals, fractalnet, ultradeep, without, de...   
1  [optimization, to, imaginationbased, metacontr...   
2  [pruning, cnns, pooling, convolutional, convne...   
3  [quantization, ternary, trained, network, towa...   
4  [language, natural, level, compose, character,...   
5  [quasirecurrent, chaos, automatic, long, extra...   

                                 Representative_Docs  
0  [FractalNet: Ultra-Deep Neural Networks withou...  
1  

# Paper Acceptance Analysis
**Objective:** Identify patterns and best practices from a dataset of accepted and rejected scientific papers to provide actionable insights for improving paper quality and increasing acceptance rates.

---

## 1. Dataset Overview

| Metric | Value |
|--------|-------|
| Total Papers Analyzed | 23,269 |
| Accepted Papers | 7,435 (31.9%) |
| Rejected Papers | 15,834 (68.1%) |
| Average Word Count in Titles | 9.3 words |
| - Accepted | 9.4 words |
| - Rejected | 9.6 words |
| Average Rating | 5.0 |
| - Accepted | 6.5 |
| - Rejected | 4.1 |

This section provides a general summary of the dataset, including the total number of papers analyzed, acceptance and rejection rates, average title length, and average ratings. It gives context and scale to the analysis.

---

## 2. Key Insights: What Works, What Doesn’t

### Topics Associated with High Acceptance Rates

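In the topic model, the following topics have a **100% acceptance rate**. The model was fit with `min_topic_size=5`, so a topic can contain as few as five titles and a perfect rate may rest on a small sample: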

| Topic ID | Keywords |
|----------|----------|
| 616 | recursion, combinator, lattices, programmerinterpreters |
| 1320 | explain, execution, logic, module |
| 673 | subset, homotopy, correspondence, transport |
| 1317 | warm, sgdr, gapaware, staleness |
| 1325 | retinal, prosthesis, ganglion, primate |
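
One way to check how much weight such rates can bear is to report each topic's size next to its acceptance rate and drop topics below a minimum support. The sketch below is illustrative only: the function name, the `min_support` threshold, and the toy data are assumptions, not part of the original analysis. It also skips topic `-1`, which BERTopic uses for outliers.

```python
from collections import Counter, defaultdict

def acceptance_by_topic(topics, decisions, min_support=20):
    """Return {topic: (n_papers, accept_rate)} for topics with enough papers."""
    counts = Counter(topics)
    accepted = defaultdict(int)
    for topic, decision in zip(topics, decisions):
        if decision == "accept":
            accepted[topic] += 1
    return {
        topic: (n, accepted[topic] / n)
        for topic, n in counts.items()
        # Skip BERTopic's outlier topic (-1) and topics with too few papers.
        if topic != -1 and n >= min_support
    }

# Hypothetical toy assignments, not taken from the dataset.
toy_topics = [0, 0, 0, 0, 1, 1]
toy_decisions = ["accept", "reject", "accept", "accept", "accept", "accept"]
print(acceptance_by_topic(toy_topics, toy_decisions, min_support=3))
```

In the toy data, topic 1 has a perfect acceptance rate but only two papers, so it is filtered out.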

---

### Keywords Found in Accepted Papers

| Keyword | Frequency |
|--------|-----------|
| Deep learning | 548 |
| Reinforcement Learning | 281 |
| Natural Language Processing | 138 |
| Unsupervised Learning | 135 |
| Optimization | 121 |
| Representation Learning | 106 |

---

### Keywords Commonly Found in Rejected Papers

| Keyword | Frequency |
|--------|-----------|
| Deep learning | 713 |
| deep learning | 713 |
| representation learning | 278 |
| self-supervised learning | 36 |

---

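This section highlights the topics and keywords most strongly associated with accepted or rejected papers, which helps separate well-received research areas from overused or less impactful terms. Note that *Deep learning* leads both keyword lists, so raw frequency alone does not distinguish the two groups.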
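
The rejected-paper table lists `Deep learning` and `deep learning` as separate rows, which suggests the keywords were counted without case normalization. A minimal sketch of normalized counting follows. The helper name and the toy input are hypothetical, and counting each keyword once per paper is an assumption about the intended statistic.

```python
from collections import Counter

def count_keywords(keyword_lists):
    """Count keywords after lowercasing and collapsing whitespace."""
    counts = Counter()
    for keywords in keyword_lists:
        # A set deduplicates within one paper, so each keyword counts once per paper.
        normalized = {" ".join(kw.lower().split()) for kw in keywords if kw.strip()}
        counts.update(normalized)
    return counts

# Hypothetical keyword lists, one list per paper.
toy = [
    ["Deep learning", "Optimization"],
    ["deep learning "],
    ["Deep Learning", "deep learning"],
]
print(count_keywords(toy).most_common(2))
```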

## 3. Title Structure & Length

- **Average title length**: 9.3 words  
- **Accepted papers**: ~9.4 words  
- **Rejected papers**: ~9.6 words  

### Most Common N-Grams in Titles:
- **Bigrams**: `neural networks`, `learning with`, `deep learning`
- **Trigrams**: `deep neural networks`, `generative adversarial networks`, `graph neural networks`

This section analyzes the average length and the most common bigrams and trigrams in paper titles. Accepted and rejected titles differ by only about 0.2 words on average, so title length alone offers little guidance; the n-grams mainly reflect the dominant research areas.
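
For readers who want to reproduce the n-gram counts, the sketch below shows one simple way to count word n-grams in titles. It is not the notebook's `extract_ngrams` function. The name `count_title_ngrams`, the tokenization rule, and the toy titles are assumptions made for illustration.

```python
import re
from collections import Counter

def count_title_ngrams(titles, n=2):
    """Count word n-grams across titles, using lowercase alphanumeric tokens."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        # Slide a window of n tokens over the title.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

# Hypothetical titles, not taken from the dataset.
toy_titles = ["Deep Neural Networks for Vision", "Pruning Deep Neural Networks"]
print(count_title_ngrams(toy_titles, n=2).most_common(3))
print(count_title_ngrams(toy_titles, n=3).most_common(1))
```

Stop words are kept on purpose here, which is why bigrams such as `learning with` can appear among the most frequent ones.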

---

## 4. Quality = Acceptance

There is a **strong correlation between high ratings and acceptance**:
- **Accepted papers average rating**: 6.5
- **Rejected papers average rating**: 4.1

This section shows the strong association between higher average ratings and acceptance: accepted papers average 2.4 points more than rejected ones. Perceived quality, expressed through clarity, rigor, and relevance, significantly affects the likelihood of acceptance.
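
The relationship between ratings and acceptance can also be expressed as a single number. A point-biserial correlation, which is the Pearson correlation between a 0/1 acceptance flag and the rating, is one option. The sketch below uses hypothetical toy values and is not a result from the dataset.

```python
from math import sqrt

def point_biserial(ratings, accepted):
    """Pearson correlation between a numeric rating and a 0/1 acceptance flag."""
    n = len(ratings)
    mean_r = sum(ratings) / n
    mean_a = sum(accepted) / n
    cov = sum((r - mean_r) * (a - mean_a) for r, a in zip(ratings, accepted))
    var_r = sum((r - mean_r) ** 2 for r in ratings)
    var_a = sum((a - mean_a) ** 2 for a in accepted)
    return cov / sqrt(var_r * var_a)

# Hypothetical ratings and decisions (1 = accepted, 0 = rejected).
toy_ratings = [7, 6, 8, 4, 3, 5]
toy_accepted = [1, 1, 1, 0, 0, 0]
print(f"point-biserial r = {point_biserial(toy_ratings, toy_accepted):.2f}")
```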

---

## 5. Review Length

| Category | Average Length (characters) |
|---------|-----------------------------|
| Global | 2561 |
| Accepted | 2538 |
| Rejected | 2569 |

This section compares the average length of peer reviews for accepted and rejected papers. The two means differ by about 1% (2538 vs. 2569), which suggests that review length is not a decisive factor in acceptance; the content and strength of the feedback matter more.
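
To judge whether a gap between two means is small relative to the spread of the data, an effect size such as Cohen's d can help. The sketch below computes it with the standard library on hypothetical review lengths. The per-review lengths of the real dataset are not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical review lengths, not taken from the dataset.
toy_accepted_lengths = [2400, 2600, 2500, 2700]
toy_rejected_lengths = [2450, 2650, 2550, 2750]
print(f"Cohen's d = {cohens_d(toy_accepted_lengths, toy_rejected_lengths):.2f}")
```

An absolute value of d near zero would support the reading that review length does not separate the two groups.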

# Pairwise comparison of papers through LLMs

In [7]:
def clean_titles_of_openreview(sheets):
    """
    Removes the word 'openreview' from the 'title' column in all sheets.

    Parameters:
        sheets (dict): Dictionary of DataFrames keyed by sheet name.

    Returns:
        dict: Cleaned sheets with updated titles.
    """
    cleaned = {}

    for sheet_name, df in sheets.items():
        df = df.copy()
        df.columns = df.columns.str.lower().str.strip()

        if 'title' in df.columns:
            df['title'] = df['title'].astype(str).str.replace(r'openreview', '', case=False, regex=True).str.strip()

        cleaned[sheet_name] = df

    return cleaned

sheets = clean_titles_of_openreview(sheets)

In this section, we explore how large language models (LLMs) can be used, through manual prompting, to extract useful insights from research papers. Specifically, we focus on papers that appear in the dataset both as a rejected and as an accepted submission, to identify changes that could contribute to writing stronger submissions.


In [8]:
from collections import defaultdict

def find_titles_in_multiple_sheets(sheets):
    """
    Identifies paper titles that appear in more than one sheet.

    Parameters:
        sheets (dict): Dictionary of DataFrames keyed by sheet name.

    Returns:
        dict: Dictionary mapping each duplicate title to the list of sheet names it appears in.
    """
    title_map = defaultdict(set)

    for sheet_name, df in sheets.items():
        df.columns = df.columns.str.lower().str.strip()
        if 'title' not in df.columns:
            continue

        titles = df['title'].dropna().str.strip().unique()
        for title in titles:
            title_map[title].add(sheet_name)

    duplicates = {title: sorted(list(sheet_names)) for title, sheet_names in title_map.items() if len(sheet_names) > 1}
    return duplicates

# Example usage:
duplicate_titles = find_titles_in_multiple_sheets(sheets)
print(f"Found {len(duplicate_titles)} titles in multiple sheets:")
for title, sheet_names in duplicate_titles.items():
    print(f"- '{title}' appears in: {sheet_names}")

Found 43 titles in multiple sheets:
- 'Data augmentation instead of explicit regularization |' appears in: ['Sheet2', 'Sheet4']
- 'Efficient Exploration through Bayesian Deep Q-Networks |' appears in: ['Sheet2', 'Sheet3']
- 'Graph2Seq: Scalable Learning Dynamics for Graphs |' appears in: ['Sheet2', 'Sheet3']
- 'Massively Parallel Hyperparameter Tuning |' appears in: ['Sheet2', 'Sheet3']
- 'Open Loop Hyperparameter Optimization and Determinantal Point Processes |' appears in: ['Sheet2', 'Sheet3']
- 'Value Propagation Networks |' appears in: ['Sheet2', 'Sheet3']
- 'withdrawn |' appears in: ['Sheet2', 'Sheet3']
- 'Dataset Distillation |' appears in: ['Sheet3', 'Sheet4']
- 'Deep Imitative Models for Flexible Inference, Planning, and Control |' appears in: ['Sheet3', 'Sheet4']
- 'Double Neural Counterfactual Regret Minimization |' appears in: ['Sheet3', 'Sheet4']
- 'Pushing the bounds of dropout |' appears in: ['Sheet3', 'Sheet4']
- 'Unified recurrent network for many feature types |' appea

Since only Sheet5 and Sheet6 contain links to the submitted papers, we focus on those two sheets:

In [9]:
def find_common_titles_between_sheets(sheets, sheet1_name, sheet2_name):
    """
    Finds papers with the same title in two specified sheets.

    Parameters:
        sheets (dict): Dictionary of DataFrames keyed by sheet name.
        sheet1_name (str): Name of the first sheet.
        sheet2_name (str): Name of the second sheet.

    Returns:
        list: List of titles common to both sheets.
    """
    df1 = sheets[sheet1_name].copy()
    df2 = sheets[sheet2_name].copy()

    df1.columns = df1.columns.str.lower().str.strip()
    df2.columns = df2.columns.str.lower().str.strip()

    titles1 = df1['title'].dropna().str.strip().unique()
    titles2 = df2['title'].dropna().str.strip().unique()

    common_titles = list(set(titles1).intersection(titles2))
    return common_titles

# Example usage:
common_titles = find_common_titles_between_sheets(sheets, 'Sheet5', 'Sheet6')
print(f"Common titles between Sheet5 and Sheet6: {len(common_titles)}")
for title in common_titles:
    print("-", title)

Common titles between Sheet5 and Sheet6: 29
- concentric spherical gnn for 3d representation learning
- direct evolutionary optimization of variational autoencoders with binary latents
- poisoned classifiers are not only backdoored, they are fundamentally broken
- learning to actively learn: a robust approach
- max-affine spline insights into deep network pruning
- almost tight l0-norm certified robustness of top-k predictions against adversarial perturbations
- is deeper better? it depends on locality of relevant features
- towards understanding label smoothing
- bridging the gap: providing post-hoc symbolic explanations for sequential decision-making problems with inscrutable representations
- align-rudder: learning from few demonstrations by reward redistribution
- mqtransformer: multi-horizon forecasts with context dependent and feedback-aware attention
- novel policy seeking with constrained optimization
- learning to solve multi-robot task allocation with a covariant-attention ba

In [10]:
def extract_title_info_from_sheets(sheets, titles):
    """
    For each given title, find and collect its rows from all sheets where it appears.

    Parameters:
        sheets (dict): Dictionary of DataFrames keyed by sheet name.
        titles (list or set): List of titles to search for.

    Returns:
        dict: Dictionary where keys are titles and values are lists of (sheet_name, DataFrame) pairs.
    """
    title_info = {}
    normalized_titles = {t.strip().lower(): t for t in titles}  # Normalize for matching, preserve original

    for norm_title, original_title in normalized_titles.items():
        matching_entries = []
        for sheet_name, df in sheets.items():
            df.columns = df.columns.str.lower().str.strip()
            if 'title' not in df.columns:
                continue

            # Safe matching without index misalignment
            mask = df['title'].astype(str).str.strip().str.lower() == norm_title
            matches = df[mask]

            if not matches.empty:
                matching_entries.append((sheet_name, matches))

        if matching_entries:
            title_info[original_title] = matching_entries

    return title_info

# Example usage:
# titles = list of duplicate paper titles
info_by_title = extract_title_info_from_sheets(sheets, common_titles)

Let's filter for the accepted papers:

In [11]:
import numpy as np

def is_accepted(decision):
    """
    Determines if a decision value indicates acceptance.
    Accepts variants like strings ("accept", "Accept (Poster)") or numbers (1, np.int64(1)).
    """
    if isinstance(decision, (int, np.integer)):
        return decision == 1
    elif isinstance(decision, str):
        return 'accept' in decision.lower()
    return False

def filter_accepted_papers_with_sheets(info_by_title):
    """
    Filters for accepted papers and lists the sheets where they were accepted.

    Parameters:
        info_by_title (dict): Dictionary of (title -> [(sheet_name, df), ...])

    Returns:
        dict: title -> list of sheet names where paper was accepted
    """
    accepted = {}

    for title, entries in info_by_title.items():
        accepted_sheets = []
        for sheet_name, df in entries:
            df.columns = df.columns.str.lower().str.strip()
            if 'decision' in df.columns:
                decisions = df['decision']
                if any(is_accepted(val) for val in decisions):
                    accepted_sheets.append(sheet_name)
        if accepted_sheets:
            accepted[title] = accepted_sheets

    return accepted

# Example usage:
accepted_titles_with_sheets = filter_accepted_papers_with_sheets(info_by_title)

# Print results
# Use a distinct loop variable so the global `sheets` dictionary is not overwritten
for title, accepted_in in accepted_titles_with_sheets.items():
    print(f"✅ '{title}' accepted in sheets: {accepted_in}")

✅ 'almost tight l0-norm certified robustness of top-k predictions against adversarial perturbations' accepted in sheets: ['Sheet5']
✅ 'bridging the gap: providing post-hoc symbolic explanations for sequential decision-making problems with inscrutable representations' accepted in sheets: ['Sheet5']
✅ 'wiring up vision: minimizing supervised synaptic updates needed to produce a primate ventral stream' accepted in sheets: ['Sheet5']
✅ 'autonomous learning of object-centric abstractions for high-level planning' accepted in sheets: ['Sheet5']
✅ 'on the certified robustness for ensemble models and beyond' accepted in sheets: ['Sheet5']
✅ 'relational learning with variational bayes' accepted in sheets: ['Sheet5']
✅ 'augmented sliced wasserstein distances' accepted in sheets: ['Sheet5']
✅ 'open-world semi-supervised learning' accepted in sheets: ['Sheet5']


We report here all the links for simplicity:

| Title | Link of Rejected (sheet 6) | Link of Accepted (sheet 5) |
|--------|-------|---------|
| Relational Learning with Variational Bayes | https://openreview.net/forum?id=PiKUvDj5jyN | https://openreview.net/forum?id=Az-7gJc6lpr |
| On the Certified Robustness for Ensemble Models and Beyond  | https://openreview.net/forum?id=IUYthV32lbK | https://openreview.net/forum?id=tUa4REjGjTf |
| Autonomous Learning of Object-Centric Abstractions for High-Level Planning  | https://openreview.net/forum?id=PmVfnB0nkqr | https://openreview.net/forum?id=rrWeE9ZDw_ |
| Almost Tight L0-norm Certified Robustness of Top-k Predictions against Adversarial Perturbations | https://openreview.net/forum?id=iOVomQW073 | https://openreview.net/forum?id=gJLEXy3ySpu |
| Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems with Inscrutable Representations | https://openreview.net/forum?id=TETmEkko7e5 | https://openreview.net/forum?id=o-1v9hdSult |
| Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream  | https://openreview.net/forum?id=5i4vRgoZauw | https://openreview.net/forum?id=g1SzIRLQXMM |
| Open-world Semi-supervised Learning  | https://openreview.net/forum?id=6VhmvP7XZue | https://openreview.net/forum?id=O-r8LOR-CCA |
| Augmented Sliced Wasserstein Distances  | https://openreview.net/forum?id=ot9bYHvuULl | https://openreview.net/forum?id=iMqTLyfwnOO |
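
If the papers are to be fetched programmatically, the forum identifier can be read from each link. The sketch below extracts the `id` query parameter and builds a PDF link from it. The `/pdf?id=` pattern is an assumption about OpenReview's URL scheme, and both helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse

def forum_id(url):
    """Return the value of the 'id' query parameter of an OpenReview link."""
    ids = parse_qs(urlparse(url).query).get("id")
    if not ids:
        raise ValueError(f"no 'id' parameter in {url!r}")
    return ids[0]

def pdf_url(url):
    """Build a PDF link from a forum link (assumed URL pattern)."""
    return f"https://openreview.net/pdf?id={forum_id(url)}"

# Link of the rejected version of the first paper in the table above.
rejected_link = "https://openreview.net/forum?id=PiKUvDj5jyN"
print(forum_id(rejected_link))
print(pdf_url(rejected_link))
```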

After retrieving all the papers, we asked the GPT-4o model to perform a **comparative analysis** of the accepted and rejected versions. In particular, we asked:



1. **Abstract Comparison**: How did the abstracts differ in clarity, structure, or emphasis? Which keywords or framing might have made the accepted version stronger?
2. **Introduction & Motivation**: Was the problem stated more clearly or urgently in the accepted paper?
Did the rejected version fail to establish novelty or significance?
3. **Related Work**: Was the literature coverage more thorough or recent in the accepted version?
Did it better position the paper within the current research landscape?
4. **Methodology & Experiments**:
Were experiments more comprehensive, reproducible, or better visualized?
Did the rejected paper lack clarity or detail in implementation?
5. **Results & Discussion**:
Was there more rigorous statistical analysis?
Was the accepted version better at interpreting results or anticipating criticism?
6. **Conclusion**: Did it better highlight contributions or suggest compelling future work?

We also asked the model to evaluate **quantitative and structural insights**:
1. **Length Analysis**:
Word counts per section: Were key sections longer or more detailed in the accepted version?
2. **Figure/Table Use**: More/better diagrams, clearer presentation of data?
3. **Citation Density**: Did the accepted version cite more or higher-impact works?

Another aspect of the analysis was **style and writing quality**:
1. **Readability Scores**: Which version reads more clearly or is closer to the target audience?
2. **Lexical Richness**: Does the accepted version use more domain-specific, impactful language?
3. **Passive vs. Active Voice**: Better engagement and clarity in accepted paper?

Finally, we asked the model to check the **review comments** and, where possible, to align them with specific parts of the paper to identify:
* Common criticism themes
* Specific fixes that led to acceptance
* Overlooked suggestions in the rejected version
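
The questions above can be assembled into a single prompt per paper pair. The sketch below shows one possible way to do that and to send the prompt through the `openai` Python client. The text does not state whether the model was queried through the API or a chat interface, so the prompt wording, the function names, and the placeholder texts are all assumptions.

```python
COMPARISON_ASPECTS = [
    "Abstract Comparison",
    "Introduction & Motivation",
    "Related Work",
    "Methodology & Experiments",
    "Results & Discussion",
    "Conclusion",
    "Length Analysis",
    "Figure/Table Use",
    "Citation Density",
    "Readability Scores",
    "Lexical Richness",
    "Passive vs. Active Voice",
    "Review Comments",
]

def build_comparison_prompt(title, rejected_text, accepted_text):
    """Assemble one prompt covering every comparison aspect (hypothetical wording)."""
    aspects = "\n".join(
        f"{i}. {name}" for i, name in enumerate(COMPARISON_ASPECTS, start=1)
    )
    return (
        f"Compare the rejected and accepted versions of the paper '{title}'.\n"
        f"Address each of the following aspects in turn:\n{aspects}\n\n"
        f"--- REJECTED VERSION ---\n{rejected_text}\n\n"
        f"--- ACCEPTED VERSION ---\n{accepted_text}"
    )

def ask_model(prompt, api_key, model="gpt-4o"):
    """Send the prompt as a single user message and return the reply text."""
    # Imported lazily so the prompt builder works without the openai package.
    from openai import OpenAI

    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Placeholder texts; the real paper contents would be inserted here.
prompt = build_comparison_prompt(
    "Relational Learning with Variational Bayes",
    "<rejected paper text>",
    "<accepted paper text>",
)
print(prompt[:200])
```

Calling `ask_model(prompt, my_secret_key)` would then return the comparison as text. Full papers can be long, so in practice the texts may need to be shortened to fit the model's context window.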

Let's see an example of the output:

## Model output on: Relational Learning with Variational Bayes
### Abstract Comparison
Clarity, Structure, and Emphasis
1. **Accepted version**: More polished, better flow, and uses formal transitions between ideas. Immediately connects relational learning to human cognition and AI significance before presenting the method.
2. **Rejected version**: Conveys similar content but lacks tight phrasing and structural polish. Phrases like "we propose an unsupervised learning method..." feel less guided.

Keywords/Framing
1. **Accepted**: Introduces “unsupervised,” “probabilistic graphical model,” and “relational processing tasks” with more authoritative framing.
2. **Summary**: Accepted version is stronger due to better balance between domain motivation and technical contribution. Rejected version reads more like an early draft.


### Introduction & Motivation

Problem Statement Clarity
1. **Accepted**: Provides a crisp, formal definition (APA citation), concrete example (circles and squares), and contrasts psychological vs. ML definitions early.
2. **Rejected**: Uses similar content but lacks clear transitions and doesn’t make the gap in current ML methods explicit early on.

Novelty & Significance
1. **Accepted**: Clearly states that current ML methods often rely on absolute properties and highlights this work as filling a critical gap.
2. **Rejected**: Mentions novelty but less convincingly, and doesn't emphasize what’s missing in existing approaches.


### Related Work

Coverage and Currency
1. **Accepted**: Rich discussion of supervised relational learning, SRL methods, cognitive tasks, and unsupervised disentanglement. Uses more recent and varied citations.
2. **Rejected**: Shares many citations with the accepted version but lacks deep discussion of how existing methods fall short on decoupled relational learning.

Positioning
1. **Accepted**: Clearly positions the contribution at the intersection of PGM and unsupervised relational representation. Better differentiates from neural relational inference and disentangled learning.
2. **Rejected**: Less effective at articulating the unique niche of the work.


### Methodology & Experiments

Clarity and Reproducibility
1. **Accepted**: Details assumptions on latent variables, ELBO training, independence constraints, and introduces RPDA (relation-preserving data augmentation) as a novel regularization.
2. **Rejected**: Contains same VRL core but fewer diagrams, lacks mutual information terms, and discusses optimization issues less rigorously.

Visualization & Experiment Depth
1. **Accepted**: Uses multiple datasets (MNIST, Omniglot, Yale Faces, RAVDESS). Includes comparisons against baselines and ablation studies.
2. **Rejected**: Only presents a synthetic MNIST-based experiment with qualitative results and limited testing.

### Results & Discussion

Statistical Rigor
1. **Accepted**: Reports clustering accuracy, error bars, multiple trials, and ablation studies to assess RPDA and latent design choices.
2. **Rejected**: Primarily qualitative figures with minimal quantitative evaluation and no ablation analysis.

Interpretation & Anticipating Criticism
1. **Accepted**: Addresses overfitting, shortcut path issues, and provides mitigation strategies. Discusses failure cases explicitly.
2. **Rejected**: Acknowledges optimization concerns but doesn’t propose solutions in depth.

### Conclusion

Highlighting Contributions & Future Work
1. **Accepted**: Stresses contributions in disentangling relational and absolute features, real-world applications, and even philosophical boundaries (e.g., VRL’s blindness to coupling).
2. **Rejected**: Lists contributions but lacks foresight and clarity in summarizing future directions.

---

### Quantitative & Structural Insights

Length Analysis
1. **Accepted**: More detailed in methodology, related work, and experiments.
2. **Rejected**: Shorter in scope and omits extended discussions (e.g., RPDA or variational lower bounds).

Figure/Table Use
1. **Accepted**: Includes scatter plots, reconstructions, relational mappings, ablation diagrams, and a comprehensive result table (12+ methods).
2. **Rejected**: Fewer visuals, no baselines, and lacks ablation support.

Citation Density
1. **Accepted**: Denser and more modern references (e.g., InfoGAN, NRI, RPDA, contrastive learning).
2. **Rejected**: Sparse citations and lacks key modern references (e.g., BYOL, vector arithmetic).

---

### Style and Writing Quality

Readability
1. **Accepted**: Clear, concise, and professionally written. Technical terms are introduced smoothly.
2. **Rejected**: Wordy with awkward phrasing. Needs more editorial polish.

Lexical Richness
1. **Accepted**: Uses precise domain-specific terminology (e.g., “mutual information regularization,” “non-identifiability”).
2. **Rejected**: Relies on general or repetitive language.

Passive vs. Active Voice
1. **Accepted**: More active constructions (“We propose...”, “Our contribution is...”) lead to stronger engagement.
2. **Rejected**: More passive language (“is proposed”), which weakens impact.

---

### Review Comments (Inferred from Revisions)

Common Criticism Themes (Implied)
- Insufficient evaluation and lack of comprehensive testing.
- Missing mitigation strategies for known issues (e.g., shortcut path, coupling).
- Reduced clarity and reproducibility due to sparse technical detail.

Specific Fixes That Led to Acceptance
- Introduction of RPDA.
- Mutual information penalty to enforce latent independence.
- Expanded experiments (more datasets, baseline comparisons, quantitative metrics).
- Better structured and clearer presentation.

Overlooked Suggestions in Rejected Version
- Full disentanglement of latent variables (z and b), acknowledged as an ongoing limitation.
- Inclusion of real-world data evaluations.
- Use of baseline models for comparison.

---



## Collected Results

We asked the same questions for each pair of papers. This gave us more specific information about the reviewed papers and allowed for a fine-grained analysis.

We then instructed the model to aggregate relevant comparative studies, distill key patterns, and generate evidence-based recommendations for improving paper quality.

---

Based on the reviews and comparisons of accepted and rejected papers in your dataset, here are several actionable insights for improving paper acceptance odds, along with evidence from the documents:

#### Clear and Substantial Empirical Evidence Matters

* Accepted papers tend to provide **comprehensive empirical evaluations** with statistically significant improvements over baselines. For instance, the accepted ASWD paper reports detailed performance benchmarks across multiple datasets and training conditions, clearly demonstrating its superiority over alternatives with metrics like FID scores and runtime evaluations.
* In contrast, the rejected version of the same work showed weaker relative improvements and lacked comparable robustness in experimental validation.

**Tip**: Include ablation studies, multiple datasets, and statistically sound comparisons to strengthen the empirical section.

#### Well-Articulated Problem Setting and Novelty
* Successful papers precisely define a **new, relevant problem setting**. For example, the accepted ORCA paper introduces “open-world semi-supervised learning” and distinguishes it well from adjacent concepts like zero-shot learning or open-set recognition, giving it strong novelty and motivation.
* The rejected version lacked clarity in positioning the work as a unique contribution compared to known problems like robust SSL or novel class discovery.

**Tip**: Clearly define the scope of the problem, contrast with existing work, and motivate its real-world importance.

#### Strong Theoretical Backing Helps
* Accepted papers often offer **rigorous theoretical contributions** that support empirical claims. For instance, the accepted certified robustness paper provides formal proofs and conditions for ensemble robustness using randomized smoothing.

**Tip**: Where applicable, include proofs, bounds, or theoretical justifications that provide general insights beyond empirical results.

#### Explainability and User Studies Add Value
* Papers that include human-in-the-loop evaluations or explanations tend to be well-received when appropriate. The accepted explanation paper included IRB-approved user studies with clear hypotheses and statistically supported preferences from users, boosting the paper's credibility and impact.
* The rejected version of this work lacked similarly convincing user feedback and depth in explanation evaluation.

**Tip**: For human-centric tasks, complement technical work with user validation or real-world deployment data.

#### Review Language Signals Quality
In reviews, phrases often linked to acceptance include:
* “rigorous theoretical analysis”
* “comprehensive experiments”
* “significantly outperforms baseline methods”
* “clearly defines a novel problem setting”

Whereas rejections often correlate with phrases like:
* “unclear contribution”
* “limited novelty”
* “insufficient empirical validation”
* “hard to follow motivation”

**Tip**: Use these phrases as a litmus test: if a reviewer might not say your work “significantly outperforms baselines” or “clearly defines a novel problem,” revisit those sections.

---

Several **thematic trends** in the dataset suggest that certain research topics are more prone to acceptance or rejection at venues like ICLR:
* Trendy topics like adversarial robustness and SSL still need clear novelty or real-world alignment to be accepted.
* Cross-disciplinary ideas—especially those blending theory, application, and user interaction—appear to be favored.
* Papers without a well-differentiated contribution, even in hot topics, are more vulnerable to rejection.

---
*From the comparative analysis on the 8 pairs of papers, can you summarize the most significant improvements among them, to give general hints to all authors?*

**Substantially Strengthened Empirical Evaluation**

Across almost all pairs, accepted versions introduced:
* More datasets (e.g., CIFAR, ImageNet, MNIST variants, Minecraft).
* More comprehensive metrics (accuracy, FID, Wasserstein distance, robustness bounds).
* Comparison against stronger or more diverse baselines.
* Ablation studies to isolate contributions of each component.

**Hint**: Show performance across multiple settings and isolate contributions. Reviewers reward breadth and reproducibility.

**Clearer Problem Framing and Novelty Positioning**

Accepted papers more explicitly:
* Defined the problem setting and how it extends or differs from prior work.
* Placed their contribution in context, sometimes including new terminology (e.g., “open-world semi-supervised learning”, “explanatory confidence”).
* Clarified why the work matters, either theoretically or practically.

**Hint**: Don't assume readers will infer your novelty—define it, compare it, and motivate it clearly.

**Improved Theoretical Insight or Justification**

For theory-heavy papers, accepted versions:

* Added proof sketches or complete proofs.
* Introduced tight bounds or conditions (e.g., certified robustness, injectivity).
* Linked theory with design choices (e.g., why a neural mapping was injective).

**Hint**: Even in application-heavy papers, theoretical grounding increases trust and interpretability.

**Increased Algorithmic and Architectural Clarity**

Accepted versions often added:

* Pseudo-code or algorithm boxes.
* Detailed architectural diagrams or hyperparameter setups.
* Design rationales (why this model, loss, or initialization?).

**Hint**: Help reviewers and readers reproduce or build on your work with clarity and completeness.

**Inclusion of Human Studies or Real-World Use Cases (Where Applicable)**

For papers involving human factors or symbolic reasoning:

* Accepted versions added IRB-approved user studies, often with statistically significant findings.
* Described deployment implications (e.g., in game AI or explainable planning).

**Hint**: When humans are involved, even small-scale user feedback can validate usefulness and clarity.

**Summary: Top 5 General Improvements**

| Theme | Description |
|---|---|
| Stronger Experiments | More datasets, metrics, ablations, baselines |
| Clear Novelty Framing | Explicit differentiation from prior work; clearer motivation |
| Theoretical or Conceptual Rigor | Conditions, proofs, or formal assumptions that support contributions |
| Algorithm & Architecture Clarity | Diagrams, algorithm boxes, training details |
| Human-Centric Validation | User studies, practical relevance, or task deployment results |

---

*From the comparative analysis on the 8 pairs of papers, can you summarize the most significant quantitative improvements among them, to give general hints to all authors?*

From the comparative analysis of the 8 accepted–rejected paper pairs in your dataset, several quantitative improvements consistently contributed to acceptance. These improvements span performance metrics, experimental scope, statistical validation, and reporting clarity. Here's a synthesis of the most impactful quantitative enhancements that authors should consider:

**Significantly Expanded Empirical Coverage**

| Aspect |	Rejected Version |	Accepted Version |
|---|---|---|
| # of Datasets	| 1–2 |	3–5+ (e.g., CIFAR-10, ImageNet, MNIST, Minecraft) |
| # of Baselines |	2–3	| 4–6+ (including strong recent models) |
| # of Metrics |	Accuracy or one primary metric |	Accuracy + FID + certified radius + runtime + uncertainty |

**Hint**: Triple your experimental scope. Evaluate on more diverse datasets and metrics to demonstrate robustness.

**Addition of Ablation Studies and Sensitivity Analysis**

|Component	| Rejected Version |	Accepted Version|
|---|---|---|
|Ablation Studies |	Often missing |	Present in all accepted versions|
|# of Ablation Variants |	0–2	| 3–6+ (e.g., loss terms, architecture parts, data sizes)|
|Hyperparameter Sensitivity	| Rarely included |	Grid tested (e.g., margin parameters, noise levels)|

**Hint**: For every architectural or loss-function element, show its effect quantitatively.

**Substantial Gains in Key Metrics Over Baselines**

| Paper Domain	| Metric |	Improvement (Accepted vs Rejected) |
|---|---|---|
|Open-World SSL (ORCA) |	Accuracy on unseen |	+96% on ImageNet unseen classes|
|Sliced Wasserstein Distance	| FID in image tasks |	10–20% better than SWD and GSWD|
|Certified Robustness |	Certified accuracy |	+3–7% in L₂-radius robustness over prior SOTA|
|Planning Explanation| 	User task success rate |	+35% improvement in user task performance |

**Hint**: Ensure your improvements are both large in magnitude and well-justified statistically (confidence intervals, standard deviations, etc.).

**Statistical Significance and Reporting Discipline**

|Reporting Style	| Rejected Version	| Accepted Version|
|-|-|-|
|Standard Deviations | Sometimes omitted	| Reported for nearly all metrics|
|Significance Tests	| Absent or anecdotal |	p-values reported in user studies, comparisons|
|Repetitions	| 1–3 runs |	5–10+ runs (with mean ± std)|

**Hint**: Report results across multiple seeds and include statistical tests where applicable.
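As a minimal sketch of the hint above, the snippet below reports a metric as mean ± standard deviation over several seeds, using only the Python standard library. The helper name `summarize_runs` and the accuracy values are made up for illustration and do not come from any paper in the dataset.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical accuracies from 5 seeds for a baseline and a proposed method.
baseline = [0.712, 0.705, 0.718, 0.709, 0.714]
proposed = [0.741, 0.736, 0.748, 0.739, 0.744]

for name, runs in [("baseline", baseline), ("proposed", proposed)]:
    m, s = summarize_runs(runs)
    print(f"{name}: {m:.3f} ± {s:.3f} (n={len(runs)})")
```

`stdev` computes the sample standard deviation (dividing by n - 1), which is the usual choice when the runs are a small sample of possible seeds.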

**More Detailed Runtime and Resource Analysis**

|Runtime Benchmarking |	Rejected Version	|Accepted Version|
|-|-|-|
|Training Time |	Not measured or vague |	Included in tables/charts|
|Inference Time	| Rare |	Measured on multiple hardware settings|
|Memory/Scalability |	Not explored	| Occasionally benchmarked|

**Hint**: If your method claims scalability or practicality, quantify it in runtime or memory benchmarks.

**Summary: Quantitative Improvement Targets**

| Dimension |	Suggested Target for Strong Submissions |
|--|--|
|# Datasets |	≥ 3 diverse and standard datasets|
|# Baselines |	≥ 4, including most recent strong ones|
|# Evaluation Metrics |	≥ 3 (accuracy + task-specific + secondary)|
|Ablation Studies |	≥ 3 configurations (each with metric impact)|
|Experimental Runs	| ≥ 5 seeds per experiment with std reporting|
|Metric Gains vs Baselines |	≥ 10–20% unless margins are already tight|
|Statistical Tests (if human)	| p-values or confidence intervals (α = 0.05)|
|Runtime/Memory	| Included if efficiency is a selling point |

These quantitative refinements are not only persuasive to reviewers but also signal research maturity.


## Feedback on the methodology

Although it allows for a fine-grained analysis, this approach has three main flaws:
1. **Time-consuming**: each pair of papers must be downloaded manually from the OpenReview website.
2. **Expensive**: each comparative study feeds the LLM the two original PDFs. The input prompt with the files ranges from ~20K to ~38K tokens, and the output analysis takes ~2K tokens, so a single study can reach ~40K tokens, which costs about $0.44 per pair of papers.
3. **Data scarcity**: only a few papers in the entire dataset had both accepted and rejected versions available for download. While this smaller sample allows for fine-grained analysis, it may not be sufficient to generalize findings across the whole dataset.
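The cost estimate in the second point can be reproduced with a short calculation. The per-token prices below ($10 per million input tokens and $30 per million output tokens) are assumptions chosen for illustration, as is the function name `estimate_cost_usd`. Under those assumed prices, the upper end of the token range gives roughly the $0.44 quoted above, but actual prices depend on the model and provider.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m=10.0, output_price_per_m=30.0):
    """Estimate the cost of one request in USD.

    Prices are per million tokens and are ASSUMED values, not official pricing.
    """
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


OUTPUT_TOKENS = 2_000  # approximate size of one output analysis
N_PAIRS = 8            # number of accepted/rejected pairs analyzed

# Lower and upper ends of the observed input size.
for input_tokens in (20_000, 38_000):
    per_pair = estimate_cost_usd(input_tokens, OUTPUT_TOKENS)
    print(f"{input_tokens:>6} input tokens: ${per_pair:.2f} per pair, "
          f"${per_pair * N_PAIRS:.2f} for {N_PAIRS} pairs")
```

Keeping the prices as parameters makes it easy to redo the estimate for a different model or for a larger set of paper pairs.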



