# Analysis of Paper Reviews

In this assignment, I chose to use both trigrams and embeddings to analyze the reviews of rejected papers because they offer complementary strengths in understanding the textual data. The goal of this analysis is to identify recurring themes and reasons for rejection.

Trigrams are particularly useful because they capture sequences of three consecutive words, which provide more context than individual words. For example, a trigram like "lack of clarity" is much more informative than just "lack" or "clarity" on their own. By extracting trigrams that include negative keywords such as "poor," "weak," or "insufficient," I can focus on phrases that are likely to highlight specific problems in the papers.
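
As a minimal sketch of this idea, the snippet below builds trigrams from a made-up sentence in plain Python and keeps only those that contain a negative keyword. The sentence and the reduced keyword set are assumptions for illustration; the notebook itself uses `nltk.trigrams` on the cleaned reviews.

```python
# Toy illustration: slide a three-word window over a token list and keep
# only the windows that contain at least one negative keyword.
# The sentence and the keyword set are made up for this example.
negative_keywords = {"lack", "poor", "insufficient"}

tokens = "the paper shows a lack of clarity in the method section".split()

# zip over three shifted views of the list yields consecutive triples
all_trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

keyword_trigrams = [
    trigram for trigram in all_trigrams
    if any(word in negative_keywords for word in trigram)
]

for trigram in keyword_trigrams:
    print(" ".join(trigram))
```

Every keyword appears in up to three overlapping trigrams, which is why the surrounding context shows up several times in the counts.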

However, relying solely on trigrams has its limitations. Trigrams are based on exact word matches, so similar phrases with different wording, like "poor clarity" and "unclear explanation," would not be grouped together. This is where embeddings come into play. Embeddings, generated with a pre-trained model from the Sentence Transformers library, encode the semantic meaning of words or phrases into numerical vectors. These vectors allow me to measure the similarity between phrases based on their meaning rather than their exact wording. By grouping trigrams whose embeddings have a high cosine similarity, I can cluster phrases that express the same idea, even if they use different words. This helps to reduce redundancy and ensures that I capture broader themes in the reviews.
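
To make the similarity measure concrete, here is a small sketch of cosine similarity on hand-made vectors. The three-dimensional vectors and the phrase "missing baselines" are hypothetical values chosen for illustration, not output of an embedding model; real sentence embeddings have several hundred dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made 3-dimensional "embeddings" (hypothetical values, not model output)
toy_vectors = {
    "poor clarity": [0.9, 0.1, 0.0],
    "unclear explanation": [0.8, 0.2, 0.1],
    "missing baselines": [0.1, 0.9, 0.2],
}

sim_same_theme = cosine(toy_vectors["poor clarity"], toy_vectors["unclear explanation"])
sim_other_theme = cosine(toy_vectors["poor clarity"], toy_vectors["missing baselines"])
print(round(sim_same_theme, 3), round(sim_other_theme, 3))
```

Vectors that point in nearly the same direction score close to 1, so a threshold such as 0.8 separates the two clarity phrases from the unrelated one in this toy setup.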


## *Common*

I start by importing the libraries used throughout the analysis.

In [None]:
# --- Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import word_tokenize, trigrams, ngrams
from nltk.probability import FreqDist
import re
from collections import defaultdict
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import csv

I started by loading the data and keeping only the reviews of papers that were rejected, since these reviews contain the reasons why the papers were not accepted.

In [None]:
# --- Cell 2: Filter for Rejected Papers ---
# Filter papers with "Decision:###Reject"
data = pd.read_excel('tp_2020conference.xlsx')
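# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
rejected_papers = data[data['paper_decision'].str.contains("Decision:###Reject", na=False)].copy()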

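I cleaned and standardized the text by lowercasing it and removing numbers, parenthesized text, special characters, and stopwords, so that all reviews share a consistent format. The cleaning step also strips the rating words "weak", "reject", and "accept". The text is tokenized into words in order to filter out the stopwords.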

In [None]:
# --- Cell: Preprocessing and Tokenization ---
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r'\d+', ' ', text)  # Remove numbers
    text = re.sub(r'\([^)]*\)', ' ', text)  # Remove text inside parentheses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # Remove special characters (keep only letters and spaces)
    text = re.sub(r'\b(weak|reject|accept|e|g)\b', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)

rejected_papers['clean_review'] = rejected_papers['review'].apply(clean_text).apply(remove_stopwords)
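
To see what the cleaning step does to a review, the sketch below applies the same regular expressions to a made-up sentence (the sample text is an assumption, not taken from the dataset). Two side effects are worth keeping in mind: all punctuation disappears, so no sentence boundaries are left for a later sentence tokenizer to find, and the word "weak" is removed, so it can no longer match as a keyword afterwards.

```python
import re

def clean_text(text):
    # Same steps as the cell above, repeated so this snippet runs on its own
    text = str(text).lower()
    text = re.sub(r'\d+', ' ', text)            # remove numbers
    text = re.sub(r'\([^)]*\)', ' ', text)      # remove text inside parentheses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)    # keep only letters and whitespace
    text = re.sub(r'\b(weak|reject|accept|e|g)\b', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical review snippet, written for this example
sample = "Weak reject. The evaluation is weak (see Table 2), e.g. no baselines."
cleaned = clean_text(sample)
print(cleaned)
```

The output is a single run of lowercase words without any periods, which is the form every later step receives.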

I extracted the trigrams from the cleaned text that contain common negative keywords, which often indicate a reason for rejection. The trigrams provide the context around these keywords.

In [None]:
# --- Cell: Extract Trigrams ---
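# Note: "weak" is stripped by clean_text and "no" is an NLTK English stopword,
# so neither keyword can still occur in clean_review at this point.
negative_keywords = {"lack", "poor", "unclear", "weak", "missing", "limited", "no", "incomplete", "insufficient"}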

trigrams_list = []

for review in rejected_papers['clean_review']:
    # clean_text removed all punctuation, so this normally returns the whole review as one sentence
    sentences = nltk.sent_tokenize(review)
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        # Generate trigrams
        trigrams_list.extend([trigram for trigram in nltk.trigrams(tokens) if any(word in negative_keywords for word in trigram)])


## *Version 1*

In this first version I compute the frequency distribution of the trigrams. The results are poor because exact matching treats trigrams that contain the same words in a different order as distinct trigrams.

In [None]:
# --- Cell: Frequency Distribution ---
trigram_freq = FreqDist(trigrams_list)
# Get the most common trigrams
most_common_trigrams = trigram_freq.most_common(20)


# --- Cell: Visualization ---
# Plot most common trigrams
plt.figure(figsize=(10, 5))
trigrams_labels = [' '.join(trigram) for trigram, _ in most_common_trigrams]
trigrams_counts = [count for _, count in most_common_trigrams]
plt.barh(trigrams_labels, trigrams_counts, color='lightcoral')
plt.xlabel("Frequency")
plt.ylabel("Trigrams")
plt.title("Most Common Negative Sentiment Trigrams with Keywords")
plt.gca().invert_yaxis()
plt.show()



# --- Cell: Insights ---
# Print the most common trigrams
print("Most Common Negative Sentiment Trigrams with Keywords:")
for trigram, count in most_common_trigrams:
    print(f"{' '.join(trigram)}: {count}")
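
The ordering problem can be shown on a toy list. The sketch below, which uses made-up trigrams and `collections.Counter` as a stand-in for the frequency count, compares exact-match counting with a variant that sorts the words of each trigram first. The sorted variant is only one possible workaround and is not what the notebook does.

```python
from collections import Counter

# Made-up trigrams: the first two contain the same words in a different order
toy_trigrams = [
    ("experiments", "lack", "baselines"),
    ("baselines", "lack", "experiments"),
    ("poor", "writing", "quality"),
]

# Exact-match counting: every ordering is a separate key
exact_counts = Counter(toy_trigrams)

# Order-insensitive counting: sort the words so that permutations collapse
unordered_counts = Counter(tuple(sorted(trigram)) for trigram in toy_trigrams)

print(exact_counts.most_common())
print(unordered_counts.most_common())
```

Sorting loses the original word order, so the keys are harder to read, which is a trade-off to weigh before using this variant.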

Because exact matching was too strict, I tried grouping the trigrams by shared words: trigrams that have two words in common are placed in the same group.

In [None]:
# --- Cell: Group Trigrams by Similarity ---
def group_trigrams_by_similarity(trigrams):
    """
    Group trigrams that have at least two identical words.
    """
    grouped_trigrams = defaultdict(list)

    for trigram in trigrams:
        # Generate all combinations of two words from the trigram
        two_word_combinations = list(combinations(trigram, 2))

        # Use the two-word combination as a key to group similar trigrams
        for combo in two_word_combinations:
            grouped_trigrams[combo].append(trigram)

    # Filter groups to keep only those with more than one trigram
    grouped_trigrams = {key: value for key, value in grouped_trigrams.items() if len(value) > 1}

    return grouped_trigrams

grouped_trigrams = group_trigrams_by_similarity(trigrams_list)

# Count occurrences of each group
grouped_trigram_counts = {key: len(value) for key, value in grouped_trigrams.items()}

# Sort by frequency
sorted_grouped_trigrams = sorted(grouped_trigram_counts.items(), key=lambda x: x[1], reverse=True)

# --- Cell: Visualization ---
# Plot most common grouped trigrams
plt.figure(figsize=(10, 5))
grouped_trigrams_labels = [' & '.join(key) for key, _ in sorted_grouped_trigrams[:20]]
grouped_trigrams_counts = [count for _, count in sorted_grouped_trigrams[:20]]
plt.barh(grouped_trigrams_labels, grouped_trigrams_counts, color='lightcoral')
plt.xlabel("Frequency")
plt.ylabel("Grouped Trigrams")
plt.title("Most Common Grouped Trigrams with Similar Words")
plt.gca().invert_yaxis()
plt.show()

# --- Cell: Insights ---
# Print the most common grouped trigrams
print("Most Common Grouped Trigrams with Similar Words:")
for trigram, count in sorted_grouped_trigrams[:20]:
    print(f"{' & '.join(trigram)}: {count}")
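
One detail of the pair-based grouping is that `itertools.combinations` keeps the words in the order in which they appear in the trigram, so `("lack", "novelty")` and `("novelty", "lack")` become different keys. The sketch below, on made-up trigrams, shows a variant that sorts each pair before using it as a key. This is an illustrative alternative under that assumption, not the grouping used above.

```python
from collections import defaultdict
from itertools import combinations

# Made-up trigrams for illustration
toy_trigrams = [
    ("results", "lack", "novelty"),
    ("novelty", "lack", "results"),   # same words, different order
    ("lack", "novelty", "compared"),
]

groups = defaultdict(list)
for trigram in toy_trigrams:
    for pair in combinations(trigram, 2):
        # Sorting makes ("novelty", "lack") and ("lack", "novelty") the same key
        groups[tuple(sorted(pair))].append(trigram)

pair_counts = {pair: len(members) for pair, members in groups.items()}
print(pair_counts[("lack", "novelty")])
```

Each trigram contributes to three pair groups, so the group sizes add up to three times the number of trigrams and should not be read as counts of distinct phrases.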


Grouping by shared word pairs also produced poor results, so I developed Version 2.

In [None]:
# WHOLE CODE 1.1
# --- Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import word_tokenize, trigrams, ngrams
from nltk.probability import FreqDist
import re
from collections import defaultdict
from itertools import combinations

# --- Cell: Filter for Rejected Papers ---
# Filter papers with "Decision:###Reject"
data = pd.read_excel('tp_2020conference.xlsx')
rejected_papers = data[data['paper_decision'].str.contains("Decision:###Reject", na=False)].copy()

# --- Cell: Preprocessing and Tokenization ---
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r'\d+', ' ', text)  # Remove numbers
    text = re.sub(r'\([^)]*\)', ' ', text)  # Remove text inside parentheses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # Remove special characters (keep only letters and spaces)
    text = re.sub(r'\b(weak|reject|accept|e|g)\b', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)

rejected_papers['clean_review'] = rejected_papers['review'].apply(clean_text).apply(remove_stopwords)


# --- Cell: Extract Trigrams 
negative_keywords = {"lack", "poor", "unclear", "weak", "missing", "limited", "no", "incomplete", "insufficient"}

trigrams_list = []

for review in rejected_papers['clean_review']:
    sentences = nltk.sent_tokenize(review)  # Split review into sentences
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        # Generate trigrams 
        trigrams_list.extend([trigram for trigram in nltk.trigrams(tokens) if any(word in negative_keywords for word in trigram)])

# --- Cell: Frequency Distribution ---
trigram_freq = FreqDist(trigrams_list)


# Get the most common trigrams
most_common_trigrams = trigram_freq.most_common(20)


# --- Cell: Visualization ---
# Plot most common trigrams
plt.figure(figsize=(10, 5))
trigrams_labels = [' '.join(trigram) for trigram, _ in most_common_trigrams]
trigrams_counts = [count for _, count in most_common_trigrams]
plt.barh(trigrams_labels, trigrams_counts, color='lightcoral')
plt.xlabel("Frequency")
plt.ylabel("Trigrams")
plt.title("Most Common Negative Sentiment Trigrams with Keywords")
plt.gca().invert_yaxis()
plt.show()



# --- Cell: Insights ---
# Print the most common trigrams
print("Most Common Negative Sentiment Trigrams with Keywords:")
for trigram, count in most_common_trigrams:
    print(f"{' '.join(trigram)}: {count}")


In [None]:
# WHOLE CODE 1.2
# --- Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import word_tokenize, trigrams, ngrams
from nltk.probability import FreqDist
import re
from collections import defaultdict
from itertools import combinations

# --- Cell: Filter for Rejected Papers ---
# Filter papers with "Decision:###Reject"
data = pd.read_excel('tp_2020conference.xlsx')
rejected_papers = data[data['paper_decision'].str.contains("Decision:###Reject", na=False)].copy()

# --- Cell: Sentiment Analysis for Rejected Papers ---
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# str(x) guards against missing (NaN) reviews, which are floats rather than strings
rejected_papers['review_sentiment'] = rejected_papers['review'].apply(lambda x: sia.polarity_scores(str(x))['compound'])

# --- Cell: Preprocessing and Tokenization ---
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r'\d+', ' ', text)  # Remove numbers
    text = re.sub(r'\([^)]*\)', ' ', text)  # Remove text inside parentheses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # Remove special characters (keep only letters and spaces)
    text = re.sub(r'\b(weak|reject|accept|e|g)\b', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)

rejected_papers['clean_review'] = rejected_papers['review'].apply(clean_text).apply(remove_stopwords)

def group_trigrams_by_similarity(trigrams):
    """
    Group trigrams that have at least two identical words.
    """
    grouped_trigrams = defaultdict(list)

    for trigram in trigrams:
        # Generate all combinations of two words from the trigram
        two_word_combinations = list(combinations(trigram, 2))

        # Use the two-word combination as a key to group similar trigrams
        for combo in two_word_combinations:
            grouped_trigrams[combo].append(trigram)

    # Filter groups to keep only those with more than one trigram
    grouped_trigrams = {key: value for key, value in grouped_trigrams.items() if len(value) > 1}

    return grouped_trigrams

# --- Cell: Extract Trigrams ---
negative_keywords = {"lack", "poor", "unclear", "weak", "missing", "limited", "no", "incomplete", "insufficient"}

trigrams_list = []

for review in rejected_papers['clean_review']:
    sentences = nltk.sent_tokenize(review)  # Split review into sentences
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        # Generate trigrams
        trigrams_list.extend([trigram for trigram in nltk.trigrams(tokens) if any(word in negative_keywords for word in trigram)])

# --- Cell: Group Trigrams by Similarity ---
grouped_trigrams = group_trigrams_by_similarity(trigrams_list)

# Count occurrences of each group
grouped_trigram_counts = {key: len(value) for key, value in grouped_trigrams.items()}

# Sort by frequency
sorted_grouped_trigrams = sorted(grouped_trigram_counts.items(), key=lambda x: x[1], reverse=True)

# --- Cell: Visualization ---
# Plot most common grouped trigrams
plt.figure(figsize=(10, 5))
grouped_trigrams_labels = [' & '.join(key) for key, _ in sorted_grouped_trigrams[:20]]
grouped_trigrams_counts = [count for _, count in sorted_grouped_trigrams[:20]]
plt.barh(grouped_trigrams_labels, grouped_trigrams_counts, color='lightcoral')
plt.xlabel("Frequency")
plt.ylabel("Grouped Trigrams")
plt.title("Most Common Grouped Trigrams with Similar Words")
plt.gca().invert_yaxis()
plt.show()

# --- Cell: Insights ---
# Print the most common grouped trigrams
print("Most Common Grouped Trigrams with Similar Words:")
for trigram, count in sorted_grouped_trigrams[:20]:
    print(f"{' & '.join(trigram)}: {count}")



## *Version 2 with embeddings*

In this version I group the trigrams by semantic similarity, using an embedding model and cosine similarity, so that trigrams expressing a common theme end up in the same group.

In [None]:
# --- Cell: Trigram Embedding and Grouping ---
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

def get_trigram_embedding(trigram):
    word_embeddings = [embedding_model.encode(word) for word in trigram]
    return np.mean(word_embeddings, axis=0)

def group_trigrams_by_embedding(trigrams, threshold=0.8):
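    # dict.fromkeys drops repeated trigrams (keeping first-seen order), so each one is embedded only once
    trigram_embeddings = {trigram: get_trigram_embedding(trigram) for trigram in dict.fromkeys(trigrams)}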
    grouped_trigrams = defaultdict(list)
    visited = set()

    for trigram1, embedding1 in trigram_embeddings.items():
        if trigram1 in visited:
            continue
        group = [trigram1]
        visited.add(trigram1)
        for trigram2, embedding2 in trigram_embeddings.items():
            if trigram2 in visited:
                continue
            similarity = cosine_similarity([embedding1], [embedding2])[0][0]
            if similarity >= threshold:
                group.append(trigram2)
                visited.add(trigram2)
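        # Key: tuple of all trigrams in the group; value: the same trigrams as a list,
        # so len(value) is the number of distinct trigrams in the group
        grouped_trigrams[tuple(group)] = group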

    return grouped_trigrams

grouped_trigrams = group_trigrams_by_embedding(trigrams_list, threshold=0.8)
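
The grouping strategy itself can be tried without loading a model. The sketch below runs the same kind of greedy, threshold-based grouping on hand-made vectors. The phrases and the three-dimensional vectors are hypothetical stand-ins for real embeddings, and `greedy_group` is an illustrative helper, not a function from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_group(vectors, threshold=0.8):
    """Attach every unvisited item to the first seed it is similar enough to."""
    groups = {}       # seed phrase -> list of members (seed included)
    visited = set()
    for seed, seed_vec in vectors.items():
        if seed in visited:
            continue
        visited.add(seed)
        groups[seed] = [seed]
        for other, other_vec in vectors.items():
            if other not in visited and cosine(seed_vec, other_vec) >= threshold:
                groups[seed].append(other)
                visited.add(other)
    return groups

# Hypothetical vectors standing in for model embeddings
toy_vectors = {
    "poor clarity": [0.9, 0.1, 0.0],
    "missing baselines": [0.1, 0.9, 0.2],
    "unclear explanation": [0.8, 0.2, 0.1],
    "no baseline comparison": [0.2, 0.8, 0.3],
}

groups = greedy_group(toy_vectors)
for seed, members in groups.items():
    print(seed, "->", members)
```

Because each item joins the first seed that passes the threshold, the result depends on the iteration order. A proper clustering algorithm would avoid that, at the cost of more setup.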

Then I saved the result to a CSV file for easier use.

In [None]:
# --- Cell: CSV Output ---
def save_grouped_trigrams_with_first_to_csv(grouped_trigrams, output_file):
    with open(output_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["First Trigram", "Grouped Trigrams"])
        # Iterate over the keys: each key is a tuple holding the trigrams of one group
        for group in grouped_trigrams:
            trigrams_text = ' | '.join([' '.join(trigram) for trigram in group])
            first_trigram = ' '.join(group[0])
            writer.writerow([first_trigram,  trigrams_text])

output_file = "grouped_trigrams_with_first.csv"
save_grouped_trigrams_with_first_to_csv(grouped_trigrams, output_file)
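
As a sketch of what the CSV could look like with a group-size column, the example below writes a small hypothetical grouping to an in-memory buffer and reads it back. The `toy_groups` dictionary, its contents, and the "Count" column are assumptions made for illustration.

```python
import csv
import io

# Hypothetical grouping result: representative trigram -> member trigrams
toy_groups = {
    ("poor", "clarity", "writing"): [
        ("poor", "clarity", "writing"),
        ("unclear", "explanation", "method"),
    ],
    ("missing", "baseline", "comparison"): [
        ("missing", "baseline", "comparison"),
    ],
}

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(["First Trigram", "Count", "Grouped Trigrams"])
for first, members in toy_groups.items():
    members_text = " | ".join(" ".join(trigram) for trigram in members)
    writer.writerow([" ".join(first), len(members), members_text])

# Read the rows back to check the layout
rows = list(csv.reader(buffer.getvalue().splitlines()))
for row in rows:
    print(row)
```

Keeping the count in its own column makes it easy to sort the file by group size in a spreadsheet.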

In [None]:
# WHOLE CODE
# --- Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk import word_tokenize
import re
from collections import defaultdict
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import csv


# --- Cell: Filter for Rejected Papers ---
# Filter papers with "Decision:###Reject"
data = pd.read_excel('tp_2020conference.xlsx')
rejected_papers = data[data['paper_decision'].str.contains("Decision:###Reject", na=False)].copy()


# --- Cell: Preprocessing and Tokenization ---
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

def clean_text(text):
    text = str(text).lower()  # Convert to lowercase
    text = re.sub(r'\d+', ' ', text)  # Remove numbers
    text = re.sub(r'\([^)]*\)', ' ', text)  # Remove text inside parentheses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # Remove special characters (keep only letters and spaces)
    text = re.sub(r'\b(weak|reject|accept|e|g)\b', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)


rejected_papers['clean_review'] = rejected_papers['review'].apply(clean_text).apply(remove_stopwords)

# Load a pre-trained embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # You can use other models like 'paraphrase-MiniLM-L6-v2'

def get_trigram_embedding(trigram):
    """
    Compute the embedding for a trigram by averaging the embeddings of its words.
    """
    word_embeddings = [embedding_model.encode(word) for word in trigram]
    return np.mean(word_embeddings, axis=0)

def group_trigrams_by_embedding(trigrams, threshold=0.8):
    """
    Group trigrams based on cosine similarity of their embeddings.
    """
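    # dict.fromkeys drops repeated trigrams (keeping first-seen order), so each one is embedded only once
    trigram_embeddings = {trigram: get_trigram_embedding(trigram) for trigram in dict.fromkeys(trigrams)}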
    grouped_trigrams = defaultdict(list)
    visited = set()

    for trigram1, embedding1 in trigram_embeddings.items():
        if trigram1 in visited:
            continue
        group = [trigram1]
        visited.add(trigram1)
        for trigram2, embedding2 in trigram_embeddings.items():
            if trigram2 in visited:
                continue
            similarity = cosine_similarity([embedding1], [embedding2])[0][0]
            if similarity >= threshold:
                group.append(trigram2)
                visited.add(trigram2)
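        # Key: tuple of all trigrams in the group; value: the same trigrams as a list,
        # so len(value) is the number of distinct trigrams in the group
        grouped_trigrams[tuple(group)] = group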

    return grouped_trigrams

# --- Cell: Save Grouped Trigrams with First Trigram and Counts to CSV ---
def save_grouped_trigrams_with_first_to_csv(grouped_trigrams, grouped_trigram_counts, output_file):
    """
    Save grouped trigrams to a CSV file with the first trigram for each group.
    """
    with open(output_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["First Trigram", "Count", "Grouped Trigrams"])  # Header
        # Iterate over the keys: each key is a tuple holding the trigrams of one group
        for group in grouped_trigrams:
            # Join trigrams in the group with " | "
            trigrams_text = ' | '.join([' '.join(trigram) for trigram in group])
            # Use the first trigram of the group
            first_trigram = ' '.join(group[0])
            # Write the first trigram, the count for this group, and all grouped trigrams
            writer.writerow([first_trigram, grouped_trigram_counts[group], trigrams_text])


# --- Cell: Extract Trigrams ---
negative_keywords = {"lack", "poor", "unclear", "weak", "missing", "limited", "no", "incomplete", "insufficient"}

trigrams_list = []

for review in rejected_papers['clean_review']:
    sentences = nltk.sent_tokenize(review)  # Split review into sentences
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        # Generate trigrams
        trigrams_list.extend([trigram for trigram in nltk.trigrams(tokens) if any(word in negative_keywords for word in trigram)])

# --- Cell: Group Trigrams by Embedding ---
grouped_trigrams = group_trigrams_by_embedding(trigrams_list, threshold=0.8)

# Count occurrences of each group
grouped_trigram_counts = {key: len(value) for key, value in grouped_trigrams.items()}

# Sort by frequency
sorted_grouped_trigrams = sorted(grouped_trigram_counts.items(), key=lambda x: x[1], reverse=True)


# --- Cell: Insights ---
# Print the most common grouped trigrams
print("Most Common Grouped Trigrams (Using Embeddings):")
for group, count in sorted_grouped_trigrams[:20]:
    print(f"{' | '.join([' '.join(trigram) for trigram in group])}: {count}")

# --- Cell: Save Grouped Trigrams with First Trigram and Counts to CSV ---
# Specify the output file path
output_file = "grouped_trigrams_with_first.csv"

# Save the grouped trigrams with the first trigram and details to the CSV file
save_grouped_trigrams_with_first_to_csv(grouped_trigrams, grouped_trigram_counts, output_file)
print(f"Grouped trigrams with the first trigram and details have been saved to {output_file}")

## Conclusion

Combining trigrams and embeddings creates a balanced approach. Trigrams provide interpretable, context-rich phrases that directly point to specific issues, while embeddings enable me to generalize and group these issues into meaningful clusters. For example, phrases like "poor clarity," "unclear explanation," and "lack of detail" might all be grouped under a broader theme of "clarity issues." This dual approach ensures that my analysis captures both the granular details and the overarching patterns in the data.


*Suggestion: copy the "WHOLE CODE" cells into Google Colab to run them.*