# TASK 1 for KDSH2025

### Goal: classify the research papers as publishable or non-publishable

Pre-processing:

Step 1: Convert Research Papers into Structured Format

The first step is to extract text and segment it into meaningful sections (like abstract, methods, results, etc.).
a. Extract Text from PDF

    Use libraries such as:
        PyPDF2: Basic text extraction.
        PDFMiner: Better for more complex PDFs.
        Grobid: Specifically designed for scientific papers, extracts structured content (title, abstract, references).

Example with PyPDF2:

In [None]:
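import PyPDF2  # development has moved to the successor package `pypdf`, which also provides PdfReader

def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, joined by newlines."""
    with open(file_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # Fall back to "" for pages with no extractable text (e.g., scanned images)
        page_texts = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(page_texts)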

pdf_text = extract_text_from_pdf('research_paper.pdf')
print(pdf_text)


b. Split Text into Sections

    Pattern Matching: Use regex to identify common headings (e.g., Introduction, Methods, Results, Conclusion).
    Tools: Grobid is excellent for automatically segmenting into structured formats.

Example (Regex for Common Headings):

In [None]:
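import re

HEADINGS = r'Abstract|Introduction|Methods|Results|Discussion|Conclusion'

def split_into_sections(text):
    # Assumes each heading sits alone on its own line (optionally numbered), so that
    # ordinary uses of words like "results" inside body text do not trigger a split.
    pattern = rf'^\s*(?:\d+\.?\s*)?({HEADINGS})\s*$'
    parts = re.split(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    # parts = [preamble, heading1, body1, heading2, body2, ...]; normalize heading case for lookup
    return {parts[i].strip().capitalize(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

sections = split_into_sections(pdf_text)
print(sections.get('Abstract', ''))  # empty string if no Abstract heading was found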


Step 2: Tokenize and Clean Data

Once the text is extracted and segmented, process it to make it suitable for analysis.
a. Tokenization

    Split text into words or sentences using tools like spaCy or NLTK.
    For example, spaCy offers robust tokenization for different languages.

Example:

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sections['Abstract'])
tokens = [token.text for token in doc]
print(tokens)


b. Remove Stopwords

    Stopwords are common words (e.g., "the", "is", "and") that do not add value to the analysis.
    Use prebuilt stopword lists from NLTK or spaCy, or customize your list.

Example:

In [None]:
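import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the stopword list and the tokenizer models
nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(sections['Abstract'])
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)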


c. Normalize Text

    Convert all text to lowercase to ensure case consistency.
    Remove special characters, punctuation, and numbers if irrelevant.

Example:

In [None]:
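import string

def normalize_text(tokens):
    # Strip punctuation and lowercase first, then keep only purely alphabetic tokens.
    # Filtering with isalpha() before stripping would discard tokens such as "state-of-the-art".
    table = str.maketrans('', '', string.punctuation)
    cleaned = (word.translate(table).lower() for word in tokens)
    return [word for word in cleaned if word.isalpha()]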

normalized_tokens = normalize_text(filtered_tokens)
print(normalized_tokens)


d. Lemmatization

    Reduce words to their base or dictionary form (e.g., "running" → "run").
    Use spaCy or WordNetLemmatizer from NLTK.

Example:

In [None]:
lemmatized_tokens = [token.lemma_ for token in doc if token.is_alpha]
print(lemmatized_tokens)


Step 3: Output Structured Format

Store the preprocessed data in a structured format like JSON or CSV for easy access.

Example JSON Format:

In [None]:
import json

data = {
    "Abstract": lemmatized_tokens,
    "Introduction": [],  # Repeat preprocessing for other sections
    "Methods": [],
    "Results": [],
    "Conclusion": []
}

with open('structured_data.json', 'w') as json_file:
    json.dump(data, json_file)


Additional Tools

    Grobid:
        For scientific papers, Grobid provides structured extraction including metadata, references, and sections.
        Grobid Documentation

    Other Libraries:
        Textract: Handles different file types (PDF, DOCX, etc.).
        Tika: Extracts text and metadata.

Feature Extraction:

1. Language Quality

Assess grammar, coherence, and readability.
a. Grammar and Spelling

    Tools:
        LanguageTool: An open-source grammar and spell checker.
        Grammarly API: (Paid) for advanced grammar checking.

    Implementation: Use LanguageTool to calculate the number of grammatical errors:

In [None]:
from language_tool_python import LanguageTool

tool = LanguageTool('en-US')
text = "This is an example sentence with errors."
matches = tool.check(text)
grammar_error_count = len(matches)
print(f"Number of grammar errors: {grammar_error_count}")


b. Coherence

    Measure how logically connected the sentences are.
    Semantic Similarity:
        Use embeddings (e.g., SBERT or GPT embeddings) to calculate the cosine similarity between consecutive sentences.

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
sentences = ["This is the first sentence.", "This is the second sentence."]
embeddings = model.encode(sentences)
coherence_score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Coherence score: {coherence_score}")


c. Readability

    TextStat: Calculate readability scores like Flesch Reading Ease, Gunning Fog Index, etc.

In [None]:
import textstat

text = "This is a sample sentence to calculate readability."
flesch_score = textstat.flesch_reading_ease(text)
print(f"Flesch Reading Ease score: {flesch_score}")


2. Methodology Validity

Assess whether the methods are appropriate for the research objectives.
a. Keyword Analysis

    Extract keywords from the "Methods" section and compare them with the "Objectives" section.
    Use KeyBERT or TF-IDF for keyword extraction.

In [None]:
from keybert import KeyBERT

kw_model = KeyBERT()
methods_text = "We used neural networks to classify images."
keywords = kw_model.extract_keywords(methods_text, keyphrase_ngram_range=(1, 2), stop_words='english')
print(keywords)


b. Concept Matching

    Use embeddings to check if the topics covered in the methodology align with the research objectives.
        Generate embeddings for both sections and compute cosine similarity.
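The comparison step can be sketched without any model download. This is a minimal illustration that assumes bag-of-words count vectors as a stand-in for sentence embeddings; the helper names `bow_vector` and `cosine` and the sample strings are hypothetical. With SBERT, the two vectors would instead come from `model.encode`, and the cosine step stays the same.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical section texts used only for illustration
objectives_text = "We aim to classify images using neural networks."
methods_example = "We used neural networks to classify images."
unrelated_text = "Soil samples were collected from river banks."

alignment = cosine(bow_vector(objectives_text), bow_vector(methods_example))
baseline = cosine(bow_vector(objectives_text), bow_vector(unrelated_text))
print(f"objectives vs methods:   {alignment:.2f}")
print(f"objectives vs unrelated: {baseline:.2f}")
```

A low score relative to such a baseline would suggest that the methodology drifts away from the stated objectives; the cut-off itself has to be tuned on the labeled papers.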

c. Completeness of Methodology

    Check for the presence of essential subsections:
        Tools/technologies used.
        Dataset description.
        Experimental setup.
        Statistical tests or validation techniques.
    Example: Use regex to search for keywords like "dataset," "tool," or "experiment."
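A rough sketch of the regex check described above. The component names and keyword patterns in `REQUIRED_COMPONENTS` are assumptions chosen for illustration, not a validated vocabulary, and `methodology_completeness` is a hypothetical helper.

```python
import re

# Hypothetical keyword patterns for each expected methodology component
REQUIRED_COMPONENTS = {
    "tools": r"\b(tool|library|framework|software)s?\b",
    "dataset": r"\b(dataset|data set|corpus|benchmark)s?\b",
    "experimental_setup": r"\b(experiment|setup|hyperparameter|baseline)s?\b",
    "validation": r"\b(cross-validation|t-test|p-value|significance|ablation)\b",
}

def methodology_completeness(methods_text):
    """Return which components are mentioned and the fraction covered."""
    present = {
        name: bool(re.search(pattern, methods_text, flags=re.IGNORECASE))
        for name, pattern in REQUIRED_COMPONENTS.items()
    }
    score = sum(present.values()) / len(present)
    return present, score

sample_methods = "We train on a public image dataset and report 5-fold cross-validation results."
present, completeness = methodology_completeness(sample_methods)
print(present)
print(f"Completeness: {completeness:.2f}")
```

The resulting fraction can be stored as one numeric feature, while the per-component flags show which subsection is missing.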

3. Claim Validation

Identify overly ambitious or unsupported claims.
a. Fact-Checking Claims

    Tools:
        Use pre-trained models like TARS-QA (from Hugging Face) or fine-tune BERT for fact-checking tasks.
    Approach:
        Extract claims from the text (e.g., "This method improves accuracy by 20%").
        Compare claims with the supporting data in the results section.

b. Hyperbolic Language Detection

    Look for hyperbolic phrases using pattern matching or pre-trained models.
        Example: Check for words like "revolutionary," "breakthrough," "unprecedented," etc.

In [None]:
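import re

text = "Our method achieves unprecedented accuracy."
hyperbolic_words = ["revolutionary", "breakthrough", "unprecedented"]
# Match whole words only, ignoring case, and record every occurrence
pattern = re.compile(r"\b(" + "|".join(map(re.escape, hyperbolic_words)) + r")\b", re.IGNORECASE)
found_hyperbolic = [match.lower() for match in pattern.findall(text)]
print(f"Hyperbolic words: {found_hyperbolic}")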


c. Claim-Data Alignment

    Calculate alignment between claims and data:
        Compare numerical claims with reported experimental results.
        Check consistency in units, scales, and statistical validity.
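One simple way to compare numerical claims with reported results is to extract percentages from both and flag claimed values that never appear in the results. The sketch below assumes claims are expressed as percentages; the helper names, the `tolerance` parameter, and the sample sentences are all hypothetical.

```python
import re

PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")

def extract_percentages(text):
    """Return all percentage values mentioned in the text as floats."""
    return [float(value) for value in PERCENT.findall(text)]

def unsupported_claims(claim_text, results_text, tolerance=0.5):
    """List claimed percentages with no value in the results within `tolerance`."""
    reported = extract_percentages(results_text)
    return [
        claimed
        for claimed in extract_percentages(claim_text)
        if not any(abs(claimed - value) <= tolerance for value in reported)
    ]

# Made-up sentences for illustration
claim = "This method improves accuracy by 20% and cuts training time by 35%."
results = "Accuracy rose from 70.0% to 84.0%, a 20% relative gain. Training time fell by 12%."
print(unsupported_claims(claim, results))
```

This only checks that a number is present, not that it refers to the same quantity or unit, so it works best as a coarse flag for manual review.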

4. Additional Features

Beyond the specified aspects, consider these additional features:
a. Structural Features

    Section Lengths:
        Measure the word count of each section (e.g., abstract, methods) to detect imbalance or missing sections.
    Reference Count:
        Count the number of references cited in the paper.
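Both structural features can be computed from the section dictionary produced during pre-processing. This is a minimal sketch: it assumes references are numbered in the `[1]`, `[2]` style at the start of a line, and the helper names and sample data are hypothetical.

```python
import re

def section_word_counts(sections):
    """Word count per section; `sections` maps heading -> raw text."""
    return {name: len(text.split()) for name, text in sections.items()}

def count_references(references_text):
    """Count entries that start with a bracketed number such as [12]."""
    return len(re.findall(r"^\s*\[\d+\]", references_text, flags=re.MULTILINE))

# Made-up sections and reference list for illustration
sample_sections = {
    "Abstract": "We study paper quality prediction.",
    "Methods": "",
    "Results": "The classifier reaches good accuracy on held-out papers.",
}
sample_references = "[1] A. Author. Title one.\n[2] B. Author. Title two.\n"

word_counts = section_word_counts(sample_sections)
empty_sections = [name for name, n in word_counts.items() if n == 0]
print(word_counts)
print("Empty sections:", empty_sections)
print("References:", count_references(sample_references))
```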

b. Semantic Features

    Use Topic Modeling (e.g., LDA) to identify dominant topics in the paper.
    Sentiment Analysis:
        Analyze sentiment in the discussion or conclusion sections to detect optimism or bias.

c. Statistical Features

    Calculate the proportion of cited references to sentences to measure scientific rigor.
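The citation-to-sentence ratio could be estimated as follows. The sketch assumes numeric citation markers such as `[3]` or `[3, 7]` and a naive punctuation-based sentence split; `citation_density` is a hypothetical helper, and author-year citation styles would need a different pattern.

```python
import re

def citation_density(text):
    """Citations per sentence, assuming numeric markers like [3] or [3, 7]."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    citations = 0
    for group in re.findall(r"\[([\d,\s]+)\]", text):
        # "[1, 2]" counts as two citations
        citations += len([n for n in group.split(",") if n.strip()])
    return citations / len(sentences) if sentences else 0.0

sample_text = "Transformers dominate NLP [1, 2]. We follow prior work [3]. Our setup differs."
print(f"Citations per sentence: {citation_density(sample_text):.2f}")
```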

5. Feature Storage

Store extracted features in a structured format like a Pandas DataFrame or JSON for further analysis.

Example:

In [None]:
import pandas as pd

features = {
    "grammar_errors": grammar_error_count,
    "coherence_score": coherence_score,
    "flesch_score": flesch_score,
    "method_objective_alignment": 0.85,  # Example similarity score
    "hyperbolic_words_count": len(found_hyperbolic)
}

df = pd.DataFrame([features])
print(df)


6. Tools and Libraries

    Language Quality:
        spaCy, TextStat, Sentence Transformers, LanguageTool.
    Methodology Validity:
        KeyBERT, Regex, Hugging Face models.
    Claim Validation:
        Hugging Face Transformers, regex.
    Storage:
        Pandas, NumPy.

This process ensures comprehensive feature extraction, providing critical insights for publishability classification.

Model Training:

1. Challenges and Considerations

    Small Dataset: Only 15 labeled reference papers are provided, so data scarcity is a significant challenge.
    Complexity: Research papers contain dense information, requiring models that can handle long-form text.
    Solution: Use transfer learning with pre-trained models like BERT or similar transformers to leverage knowledge from large datasets.

2. Pipeline for Model Training
a. Preprocessing the Dataset

    Text Cleaning:
        Apply preprocessing steps (as described earlier): tokenization, stopword removal, lemmatization, etc.
        Ensure that each paper's sections (abstract, methods, results) are merged into a single input text or treated as separate features.

    Label Encoding:
        Encode labels into binary values:
            Publishable → 1
            Non-Publishable → 0

Example:

In [None]:
data = [
    {"text": "Paper 1 content...", "label": 1},
    {"text": "Paper 2 content...", "label": 0},
]


    Train-Test Split:
        Use an 80-20 split for training and testing.
        Optionally, use k-fold cross-validation (e.g., 5 folds) for better model reliability.

Example:

In [None]:
from sklearn.model_selection import train_test_split

texts = [d["text"] for d in data]
labels = [d["label"] for d in data]
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)
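With only 15 labeled papers, a single 80-20 split leaves three validation samples, so k-fold cross-validation gives a steadier estimate. scikit-learn's `StratifiedKFold` is the usual tool for this; the pure-Python sketch below only illustrates the idea of keeping the class balance similar across folds. The function name `stratified_folds` and the class ratio in `example_labels` are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=42):
    """Split sample indices into k folds with roughly equal class balance."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets a share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical label list for 15 papers; the 10/5 class ratio is made up
example_labels = [1] * 10 + [0] * 5
folds = stratified_folds(example_labels, k=5)
for i, val_idx in enumerate(folds):
    # Every fold except the current one forms the training set
    train_idx = [j for fold in folds if fold is not val_idx for j in fold]
    positives = sum(example_labels[j] for j in val_idx)
    print(f"fold {i}: {len(train_idx)} train, {len(val_idx)} val, {positives} publishable in val")
```

Each fold serves once as the validation set while the remaining folds are used for training, and the metrics are averaged over the k runs.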


b. Choose a Pre-trained Transformer Model

    Use a pre-trained transformer model from Hugging Face:
        BERT: General-purpose transformer for language tasks.
        SciBERT: Specifically trained on scientific literature.
        DistilBERT: Lightweight version of BERT for faster inference.

Install Hugging Face Transformers:

In [None]:
pip install transformers datasets


c. Tokenization

    Use the tokenizer corresponding to the chosen pre-trained model.
    Tokenize input text, truncating or padding to handle variable lengths.

Example:

In [None]:
from transformers import AutoTokenizer

model_name = "allenai/scibert_scivocab_uncased"  # Replace with "bert-base-uncased" if not using SciBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")


d. Model Fine-Tuning

    Load Pre-trained Model:
        Load a pre-trained transformer model with a classification head (e.g., BERTForSequenceClassification).

Example:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Binary classification


    Define the Optimizer and Scheduler:
        Use AdamW optimizer and a learning rate scheduler for efficient training.

Example:

In [None]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer

optimizer = AdamW(model.parameters(), lr=5e-5)


    Define Loss Function:
        Use Cross-Entropy Loss for binary classification tasks.

    Training Loop:
        Use Hugging Face’s Trainer API or manually define a PyTorch training loop.

Example (Using Trainer API):

In [None]:
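from transformers import Trainer, TrainingArguments
import torch

class PaperDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels so Trainer can index individual samples."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.as_tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

# Convert labels to tensors
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
)

# Trainer expects dataset objects (with __len__ and __getitem__), not plain dicts of tensors
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=PaperDataset(train_encodings, train_labels),
    eval_dataset=PaperDataset(val_encodings, val_labels),
    optimizers=(optimizer, None),  # With None, Trainer falls back to its default scheduler
)

trainer.train()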


3. Evaluate the Model

    Metrics:
        Accuracy, Precision, Recall, F1-Score.
        Use sklearn for metric calculation:

In [None]:
from sklearn.metrics import classification_report

# predict() expects a dataset rather than the raw tokenizer output
predictions = trainer.predict(trainer.eval_dataset)
preds = predictions.predictions.argmax(-1)
print(classification_report(predictions.label_ids, preds))


    Confusion Matrix:
        Visualize performance using a confusion matrix.

4. Handle Small Dataset Challenges
a. Data Augmentation

    Back-Translation: Translate text to another language and back to introduce variations.
    Paraphrasing: Use models like T5 or Pegasus to rephrase sentences.

Example:

In [None]:
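from transformers import pipeline

# Note: the base t5-small checkpoint is not fine-tuned for paraphrasing, so its output
# may not be a useful rewrite. For real augmentation, substitute a T5 or Pegasus
# checkpoint that was trained on paraphrase data.
paraphraser = pipeline("text2text-generation", model="t5-small")
paraphrased_text = paraphraser("Your original text here", max_length=50, num_return_sequences=1)
print(paraphrased_text[0]["generated_text"])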


b. Transfer Learning

    Fine-tune the pre-trained model on domain-specific datasets (e.g., research papers) before training on the 15 labeled examples.

c. Use Few-Shot Learning

    Leverage prompt engineering with large models (e.g., OpenAI GPT-3/4) for few-shot classification:
        Provide labeled examples as part of the input prompt.
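A few-shot prompt can be assembled with plain string handling before it is sent to whichever large model is used. The sketch below only builds the prompt text and makes no API call; the instruction wording, the `max_chars` cut-off, and the function name `build_few_shot_prompt` are assumptions.

```python
def build_few_shot_prompt(examples, query_text, max_chars=500):
    """Assemble a classification prompt from (text, label) example pairs."""
    label_names = {1: "Publishable", 0: "Non-Publishable"}
    lines = [
        "Classify each research paper excerpt as Publishable or Non-Publishable.",
        "",
    ]
    for text, label in examples:
        # Truncate long papers so several examples fit in the context window
        lines.append(f"Paper: {text[:max_chars]}")
        lines.append(f"Label: {label_names[label]}")
        lines.append("")
    # The query paper comes last with an empty label for the model to fill in
    lines.append(f"Paper: {query_text[:max_chars]}")
    lines.append("Label:")
    return "\n".join(lines)

few_shot_examples = [
    ("Paper 1 content...", 1),
    ("Paper 2 content...", 0),
]
prompt = build_few_shot_prompt(few_shot_examples, "Paper 3 content...")
print(prompt)
```

The model's completion after the final `Label:` is then mapped back to 1 or 0.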

5. Save and Deploy the Model

    Save Model:

In [None]:
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")


    Deploy Model:
        Use FastAPI or Flask to serve the model as an API for real-time predictions.

This process provides a robust framework to fine-tune a pre-trained model for the task, even with a small labeled dataset.


Evaluation:

Evaluation shows how well the model performs and whether it generalizes to unseen data. This section covers three measures for a classification model: Accuracy, F1-Score, and the Confusion Matrix.
1. Dataset Setup

Ensure that you have a validation dataset (or test dataset) that was not used during training.

    Input: Text data and corresponding true labels (val_texts and val_labels).
    Output: Predicted labels from the model and metrics to evaluate the predictions.

2. Perform Model Predictions

    Prepare Validation Data:
        Tokenize the validation data if not already done.

    val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

    Get Predictions:
        Use the trained model to predict on the validation dataset.
        Extract raw logits (output scores) and convert them into class predictions.

Example:

In [None]:
import torch

# Put model in evaluation mode
model.eval()

# Perform prediction
with torch.no_grad():
    outputs = model(
        input_ids=val_encodings["input_ids"],
        attention_mask=val_encodings["attention_mask"]
    )
logits = outputs.logits
predicted_labels = torch.argmax(logits, dim=1).cpu().numpy()  # Get class predictions


True Labels: Ensure your true labels (val_labels) are stored in the same format as predictions (e.g., as a NumPy array):

In [None]:
true_labels = val_labels.numpy() if isinstance(val_labels, torch.Tensor) else val_labels


3. Metrics Overview
a. Accuracy

    Definition: The percentage of correctly classified samples.
    Formula:
    $$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

b. F1-Score

    Definition: Harmonic mean of Precision and Recall, especially useful when the dataset is imbalanced.
    Formula:
    $$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
    Where:
        Precision: Fraction of predicted positives that are actually positive.
        $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
        Recall: Fraction of actual positives that are correctly predicted.
        $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

c. Confusion Matrix

    Definition: A matrix that summarizes the counts of true positive, true negative, false positive, and false negative predictions.
    Format:

    |                 | Predicted Positive  | Predicted Negative  |
    |-----------------|---------------------|---------------------|
    | Actual Positive | True Positive (TP)  | False Negative (FN) |
    | Actual Negative | False Positive (FP) | True Negative (TN)  |

4. Compute Metrics

    Using scikit-learn: Install scikit-learn if not already installed:

In [None]:
pip install scikit-learn


Calculate Accuracy, Precision, Recall, F1-Score, and Confusion Matrix:

In [None]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

# Accuracy
accuracy = accuracy_score(true_labels, predicted_labels)

# F1-Score
f1 = f1_score(true_labels, predicted_labels, average='binary')  # Use 'macro' or 'weighted' for multi-class

# Confusion Matrix
cm = confusion_matrix(true_labels, predicted_labels)

# Classification Report (Precision, Recall, F1-Score for each class)
report = classification_report(true_labels, predicted_labels)

print(f"Accuracy: {accuracy}")
print(f"F1-Score: {f1}")
print("Confusion Matrix:")
print(cm)
print("Classification Report:")
print(report)


Visualize Confusion Matrix:

    Use matplotlib or seaborn to plot the confusion matrix for better interpretability.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Non-Publishable", "Publishable"], yticklabels=["Non-Publishable", "Publishable"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()


5. Example Output

Given the following sample results:

    True Labels: [1, 0, 1, 1, 0]
    Predicted Labels: [1, 0, 1, 0, 0]

The metrics would look like:

    Accuracy: 4/5 = 0.8 (80%)
    F1-Score: 0.8
    Confusion Matrix (rows are true labels, columns are predicted labels, both in the order 0, 1):
        [[2, 0],
         [1, 2]]
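The sample labels can be checked by hand with a few lines of plain Python that count the four outcomes and apply the formulas from the Metrics Overview. The variable names ending in `_example` are chosen here to avoid overwriting the notebook's own variables.

```python
true_labels_example = [1, 0, 1, 1, 0]
predicted_labels_example = [1, 0, 1, 0, 0]

pairs = list(zip(true_labels_example, predicted_labels_example))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # publishable, predicted publishable
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-publishable, predicted non-publishable
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-publishable, predicted publishable
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # publishable, predicted non-publishable

accuracy_example = (tp + tn) / len(pairs)
precision_example = tp / (tp + fp) if tp + fp else 0.0
recall_example = tp / (tp + fn) if tp + fn else 0.0
f1_example = (
    2 * precision_example * recall_example / (precision_example + recall_example)
    if precision_example + recall_example
    else 0.0
)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy={accuracy_example:.2f} Precision={precision_example:.2f} "
      f"Recall={recall_example:.2f} F1={f1_example:.2f}")
```

Here the single error is a publishable paper predicted as non-publishable, so precision stays at 1.0 while recall drops to 2/3.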

6. Considerations

    Imbalanced Datasets:
        If one class dominates (e.g., most papers are non-publishable), focus more on F1-Score than Accuracy, as Accuracy can be misleading in such cases.
    Threshold Adjustments:
        If logits are used for predictions, adjust the decision threshold to optimize precision or recall based on the use case.
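Threshold adjustment can be illustrated without the model itself: convert each row of logits to probabilities with softmax, then label a paper as publishable only when the class-1 probability reaches the chosen threshold. The logits below are made up, and `predict_with_threshold` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logit_rows, threshold=0.5):
    """Label a paper 1 only if P(class 1) reaches the threshold."""
    return [1 if softmax(row)[1] >= threshold else 0 for row in logit_rows]

# Made-up logits in [non-publishable, publishable] order
example_logits = [[0.2, 1.5], [1.0, 1.2], [2.0, -1.0]]
print(predict_with_threshold(example_logits, threshold=0.5))
print(predict_with_threshold(example_logits, threshold=0.7))
```

Raising the threshold makes the classifier more conservative about the publishable class, which tends to increase precision at the cost of recall.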

This process ensures a comprehensive evaluation of your model’s performance, highlighting strengths and areas for improvement.


Output:

e. Output

    Create a binary output: 1 (Publishable) or 0 (Non-Publishable).

 Batch Prediction for Multiple Papers

If multiple research papers need to be classified, process them in batches.
Steps:

    Tokenize all papers in a batch using the tokenizer.
    Pass the batch through the model for inference.
    Collect and store predictions for each paper.

In [None]:
import pandas as pd
import torch
import torch.nn.functional as F

batch_texts = [
    "Paper 1 content...",
    "Paper 2 content...",
    "Paper 3 content..."
]

# Tokenize the batch
batch_encodings = tokenizer(batch_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

# Predict for the batch (evaluation mode disables dropout)
model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=batch_encodings["input_ids"],
        attention_mask=batch_encodings["attention_mask"]
    )

# Convert logits to probabilities and get predictions
logits = outputs.logits
probs = F.softmax(logits, dim=1)
predictions = torch.argmax(probs, dim=1).tolist()

# Combine results with confidence scores (probability of the predicted class)
output_data = [
    {"Paper ID": i + 1, "Prediction": pred, "Confidence": probs[i][pred].item()}
    for i, pred in enumerate(predictions)
]

# Save to CSV
df = pd.DataFrame(output_data)
df.to_csv("batch_predictions.csv", index=False)
