# Title:

#### Group Member Names :
Ekamjyot Singh

### INTRODUCTION:
Exploring the Landscape of Natural Language Processing Research

#### AIM :
The core requirement of this assignment is to implement and evaluate the NLP taxonomy classifier described in the referenced paper. This involves understanding how the model was designed, training it on a dataset, and measuring how well it classifies NLP research papers against the defined taxonomy. We specifically assess the model's accuracy, precision, and recall, and consider possible improvements or extensions.

#### Github Repo:
Exploring-NLP-Research GitHub Repository
(https://github.com/sebischair/Exploring-NLP-Research?tab=readme-ov-file)

#### DESCRIPTION OF PAPER:
The paper presents a fine-tuned BERT-based model that classifies NLP research papers according to a fine-grained taxonomy. The taxonomy contains NLP concepts at several levels, so the model can predict both specific concepts and their higher-level categories. The model was initialized with the weights of allenai/specter2_base and then fine-tuned on a weakly labeled dataset of 178,521 papers from the ACL Anthology, the arXiv cs.CL domain, and Scopus. According to the paper, this approach improves the model's ability to classify papers accurately across a wide range of NLP concepts.
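To make the multi-label decision concrete, the following is a minimal sketch, assuming each taxonomy concept gets its own logit and is switched on independently when its sigmoid probability reaches a threshold. The logit values and the 0.5 threshold are illustrative assumptions, not values taken from the paper.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def multi_label_decision(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Return a multi-hot vector: 1 where sigmoid(logit) is at or above the threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Hypothetical logits for four taxonomy concepts (illustrative values only)
example_logits = [2.3, -1.1, 0.4, -3.0]
print(multi_label_decision(example_logits))
```

Unlike a softmax, which would force the concepts to compete for a single label, each concept here is decided on its own, so a paper can receive several concepts or none.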

#### PROBLEM STATEMENT :
NLP research has advanced and diversified rapidly. Because of the overwhelming number of scientific papers, current methods for classifying and analyzing them often lack structure. There is no framework that lets one understand the trends and gaps in the research. A robust classification system that categorizes papers systematically according to a well-defined taxonomy is therefore needed.

#### CONTEXT OF THE PROBLEM:
NLP research is progressing very fast, with new methods, models, and applications constantly appearing. Without a comprehensive, well-structured classification system, it is hard to grasp the full extent of these contributions. A well-defined taxonomy offers clarity, makes research trends recognizable, and helps researchers and practitioners stay informed about recent developments and key focus areas in NLP.

#### SOLUTION:
The proposed solution is a fine-tuned BERT-based language model that classifies NLP research papers. Because the taxonomy has multiple levels, predicting its concepts is a multi-label classification task. The model is pre-trained on large-scale data and fine-tuned on domain-specific data so that it can assign the relevant NLP concepts to each paper with high accuracy. This addresses the need for a systematic and effective categorization system and provides a structural framework for analyzing NLP research.
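One way to picture a multi-level, multi-label target is sketched below: a paper tagged with a specific concept also receives that concept's ancestors, and the resulting set is encoded as a multi-hot vector. The taxonomy fragment in `PARENT` and the helper names are hypothetical and only illustrate the idea; they are not the paper's taxonomy or labeling procedure.

```python
from typing import Dict, List, Optional, Set

# Hypothetical taxonomy fragment: child concept -> parent concept (None marks a top-level concept)
PARENT: Dict[str, Optional[str]] = {
    "Information Extraction & Text Mining": None,
    "Named Entity Recognition": "Information Extraction & Text Mining",
    "Semantic Text Processing": None,
    "Language Models": "Semantic Text Processing",
}


def with_ancestors(labels: List[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Expand each label with all of its ancestors in the taxonomy."""
    expanded: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        # Walk upwards until the root, or until a node whose ancestors were already added
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent.get(node)
    return expanded


def to_multi_hot(labels: Set[str], classes: List[str]) -> List[int]:
    """Encode a label set as a 0/1 vector over a fixed class order."""
    return [1 if c in labels else 0 for c in classes]


classes = sorted(PARENT)  # fixed, alphabetical class order
labels = with_ancestors(["Named Entity Recognition"], PARENT)
print(sorted(labels))
print(to_multi_hot(labels, classes))
```

With targets of this shape, a classifier with one output per concept can predict concepts from every level of the hierarchy at once.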

# Background
Reference: Schopf, T., Arabi, K., Matthes, F. (2023). Exploring the Landscape of Natural Language Processing Research. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023).
Explanation: The paper describes a method for classifying NLP research papers with a fine-tuned BERT model. It provides a detailed taxonomy for classifying each paper by its research topics and contributions.

Dataset/Input: The model was trained on a dataset of 178,521 papers drawn from the ACL Anthology, the arXiv cs.CL domain, and Scopus. The dataset covers many kinds of NLP research articles and therefore represents the field broadly.

Weakness: The model's performance may not be robust to variation in paper content and in taxonomy coverage. In addition, papers that fall outside the predefined taxonomy, or that reflect new research trends not well represented in the training data, may be difficult to classify.




# Implement paper code :
from typing import List
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, AutoTokenizer
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('TimSchopf/nlp_taxonomy_classifier')
model = BertForSequenceClassification.from_pretrained('TimSchopf/nlp_taxonomy_classifier')

# prepare data
papers = [{'title': 'Attention Is All You Need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.'},
          {'title': 'SimCSE: Simple Contrastive Learning of Sentence Embeddings', 'abstract': 'This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearmans correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.'}]
# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]


def predict_nlp_concepts(model, tokenizer, texts: List[str], batch_size=8, device=None, shuffle_data=False):
    """
    helper function for predicting NLP concepts of scientific papers
    """
    
    # tokenize texts
    def tokenize_dataset(sentences, tokenizer):
        sentences_num = len(sentences)
        dataset = []
        for i in range(sentences_num):
            
            sentence = tokenizer(sentences[i], padding="max_length", truncation=True, return_tensors='pt', max_length=model.config.max_position_embeddings)
            
            # get input_ids, token_type_ids, and attention_mask
            input_ids = sentence['input_ids'][0]
            token_type_ids = sentence['token_type_ids'][0]
            attention_mask = sentence['attention_mask'][0]

            dataset.append((input_ids, token_type_ids, attention_mask))
        return dataset

    tokenized_data = tokenize_dataset(sentences=texts, tokenizer=tokenizer)
    
    # get the individual input formats for the model
    input_ids = torch.stack([x[0] for x in tokenized_data])
    token_type_ids = torch.stack([x[1] for x in tokenized_data])
    attention_mask_ids = torch.stack([x[2].to(torch.float) for x in tokenized_data])
    
    # convert input to DataLoader
    input_dataset = []
    for i in range(len(input_ids)):
        data = {}
        data['input_ids'] = input_ids[i]
        data['token_type_ids'] = token_type_ids[i]
        data['attention_mask'] = attention_mask_ids[i]
        input_dataset.append(data)

    dataloader = DataLoader(input_dataset, shuffle=shuffle_data, batch_size=batch_size)
    
    # predict data
    if not device:
        device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    model.to(device)
    model.eval()
    y_pred = torch.tensor([]).to(device)
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        input_ids_batch = batch['input_ids']
        token_type_ids_batch = batch['token_type_ids']
        mask_ids_batch = batch['attention_mask']

        with torch.no_grad():
            outputs = model(input_ids=input_ids_batch, attention_mask=mask_ids_batch, token_type_ids=token_type_ids_batch)

        logits = outputs.logits
        predictions = torch.round(torch.sigmoid(logits))
        y_pred = torch.cat([y_pred,predictions])
        
    
    # get prediction class names
    prediction_indices_list = []
    for prediction in y_pred:
        prediction_indices_list.append((prediction == torch.max(prediction)).nonzero(as_tuple=True)[0])

    prediction_class_names_list = []
    for prediction_indices in prediction_indices_list:
        prediction_class_names = []
        for prediction_idx in prediction_indices:
            prediction_class_names.append(model.config.id2label[int(prediction_idx)])
        prediction_class_names_list.append(prediction_class_names)

    return y_pred, prediction_class_names_list

# predict concepts of NLP papers
numerical_predictions, class_name_predictions = predict_nlp_concepts(model=model, tokenizer=tokenizer, texts=title_abs)

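# alternatively, predict with a text-classification pipeline
from transformers import pipeline
pipe = pipeline("text-classification", model="TimSchopf/nlp_taxonomy_classifier")
# top_k=None returns the scores of all classes
pipe(title_abs, top_k=None)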

### Contribution  Code :
from transformers import BertForSequenceClassification, AutoTokenizer
import torch
from torch.utils.data import DataLoader, Dataset

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('TimSchopf/nlp_taxonomy_classifier')
model = BertForSequenceClassification.from_pretrained('TimSchopf/nlp_taxonomy_classifier')

# Example Data
papers = [{'title': 'Attention Is All You Need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.'},
          {'title': 'SimCSE: Simple Contrastive Learning of Sentence Embeddings', 'abstract': 'This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearmans correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.'}]

# Function to preprocess data
def preprocess_papers(papers):
    # Concatenate title and abstract with a separator token
    title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
    return title_abs

# Convert papers to dataset format
class PapersDataset(Dataset):
    def __init__(self, papers, tokenizer, max_length=512):
        self.papers = papers
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.papers)

    def __getitem__(self, idx):
        paper = self.papers[idx]
        encoding = self.tokenizer(
            paper,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze()
        }

# Function to predict NLP concepts
def predict_nlp_concepts(model, tokenizer, papers, batch_size=8, device='cpu'):
    # Prepare dataset and dataloader
    dataset = PapersDataset(papers, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    
    model.to(device)
    model.eval()
    
    predictions = []
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            
            # Convert logits to per-label probabilities with a sigmoid
            preds = torch.sigmoid(logits).cpu().numpy()
            predictions.extend(preds)
    
    return predictions

# Extended functionality or improvements
def extended_predict_nlp_concepts(model, tokenizer, papers, batch_size=8, device='cpu'):
    # Preprocess papers and prepare dataset
    title_abs = preprocess_papers(papers)
    dataset = PapersDataset(title_abs, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    
    model.to(device)
    model.eval()
    
    all_predictions = []
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            
            # Use a threshold for binary classification (e.g., 0.5)
            preds = (torch.sigmoid(logits) > 0.5).cpu().numpy()
            all_predictions.extend(preds)
    
    return all_predictions

# Contribution Code
def evaluate_model(predictions, true_labels):
    """
    Evaluate the model's performance by calculating metrics like accuracy, precision, recall, and F1 score.
    Args:
    - predictions (list of numpy arrays): Model predictions
    - true_labels (list of lists): True labels
    
    Returns:
    - dict: Evaluation metrics
    """
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    # Flatten lists for metric calculations
    y_true = [int(label) for sublist in true_labels for label in sublist]
    y_pred = [int(pred) for sublist in predictions for pred in sublist]
    
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='weighted'),
        'recall': recall_score(y_true, y_pred, average='weighted'),
        'f1_score': f1_score(y_true, y_pred, average='weighted')
    }
    
    return metrics

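# Example usage
# extended_predict_nlp_concepts preprocesses the raw paper dicts itself
predictions = extended_predict_nlp_concepts(model, tokenizer, papers)
# Placeholder ground truth: one 0/1 entry per taxonomy label for each paper; replace with actual true labels
true_labels = [[0] * model.config.num_labels for _ in papers]
metrics = evaluate_model(predictions, true_labels)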

print("Evaluation Metrics:")
print(f"Accuracy: {metrics['accuracy']:.2f}")
print(f"Precision: {metrics['precision']:.2f}")
print(f"Recall: {metrics['recall']:.2f}")
print(f"F1 Score: {metrics['f1_score']:.2f}")

# Contributions:
Data Preprocessing: Added a preprocess_papers function that preprocesses paper data by concatenating the title and abstract with a separator token.
Class Dataset: Implemented a PapersDataset class to handle and tokenize the data efficiently.
Batch Processing: Enhanced the predict_nlp_concepts and extended_predict_nlp_concepts functions to support batch processing and prediction with thresholding.
Evaluation Function: Added an evaluate_model function that calculates the performance metrics: accuracy, precision, recall, and F1 score.
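The thresholded predictions above are one boolean row per paper, which is hard to read without class names. The following is a minimal sketch of decoding such rows into concept names; the `decode_predictions` helper and the three-entry `id2label` mapping are hypothetical stand-ins for the mapping that the paper code reads from `model.config.id2label`.

```python
from typing import Dict, List, Sequence


def decode_predictions(rows: Sequence[Sequence[bool]], id2label: Dict[int, str]) -> List[List[str]]:
    """Turn multi-hot prediction rows into lists of class names."""
    return [[id2label[i] for i, flag in enumerate(row) if flag] for row in rows]


# Hypothetical label mapping (the real one would come from the model configuration)
id2label = {0: "Language Models", 1: "Machine Translation", 2: "Text Classification"}

# Two illustrative prediction rows: the second one has no concept above the threshold
rows = [[True, True, False], [False, False, False]]
print(decode_predictions(rows, id2label))
```

A row with no active concept decodes to an empty list, which makes papers that the classifier could not place in the taxonomy easy to spot.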


### Results :
The NLP taxonomy classifier is evaluated with precision, recall, and F1 score. The model recorded an F1 score of 93.21, a recall of 93.99, and a precision of 92.46. These scores indicate that the model classifies NLP research papers accurately and with broad coverage of the relevant concepts.
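For reference, the sketch below shows how these three metrics relate, assuming the standard definitions from true-positive, false-positive, and false-negative counts. The counts in the example are made up for illustration, and the last lines only check that the reported F1 is close to the harmonic mean of the reported precision and recall; the averaging scheme used in the paper is not assumed here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Hypothetical counts, for illustration only
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Consistency check against the scores quoted in this report
print(round(f1_from_pr(92.46, 93.99), 2))
```

The harmonic mean of the quoted precision and recall comes out at about 93.22, which agrees with the quoted F1 of 93.21 up to rounding.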

#### Observations :
1) With its high F1 score and precision, the model is very effective at predicting NLP concepts.
2) The high recall indicates that most of the relevant concepts are successfully identified, which provides very useful input for the classification process.

### Conclusion and Future Direction :
The BERT-based NLP taxonomy classifier is valuable for classifying NLP research papers. It allows papers in the NLP domain to be classified down to the lowest levels of the taxonomy and provides a structured framework for analyzing NLP research. Future work could expand the taxonomy to cover emerging trends in NLP and explore other models or techniques to improve classification accuracy.

#### Learnings :
1) We learned how to use BERT-based models for multi-label classification tasks.
2) We learned that a well-defined taxonomy is a precondition for organizing and analyzing an extensive research literature.

#### Results Discussion :
The results indicate that the model performs well in most classification tasks, with high precision, recall, and F1 score. These metrics suggest that the model can be used to identify and categorize NLP concepts and will therefore be helpful to researchers and practitioners working in this area.

#### Limitations :
1) The model may struggle with texts that do not fit the predefined taxonomy well, or with papers that present new concepts absent from the training data.
2) The model's performance may vary with the quality and representativeness of the training dataset.

#### Future Extension :
1) The taxonomy could be expanded to cover new and emerging NLP topics, which would widen the model's applicability.
2) Other models and techniques, such as more advanced neural architectures or additional sources of information, might improve classification performance.


# References:
Schopf, T., Arabi, K., Matthes, F. (2023). Exploring the Landscape of Natural Language Processing Research. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023).
(https://aclanthology.org/2023.ranlp-1.111)