<a href="https://colab.research.google.com/github/FarhanDhanani/MLMAC/blob/main/MLMAC_ML_Model_Attribution_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center> This Notebook is prepared to submit the solutions to the 
<a href="https://mlmac.io/#overview">MLMAC CONTEST</a> from the FAST-MT Team.</center></h1>

<center><img src="https://upload.wikimedia.org/wikipedia/en/e/e4/National_University_of_Computer_and_Emerging_Sciences_logo.png"/></center>


---

A simple set of booleans records whether the notebook is running on Google Colab or locally, and whether the notebook is allowed to call the fine-tuned model API.

In [None]:
'''
## A set of simple booleans:
##  - (run_on_colab) = True : if notebook is running on colab
##  - (run_on_colab) = False : if notebook is running locally
'''
run_on_colab = True
'''
##  - (call_finetune_API) = True : allow to call finetune model API
##  - (call_finetune_API) = False : do not allow to call finetune model API
'''
call_finetune_API = False

Installing the required dependencies and setting up the base paths

In [None]:
'''
##  - Mounting google drive to colab
##  - Setting up base paths for accessing google drive
'''
if run_on_colab:
    from google.colab import drive
    from google.colab import files
    base_path = '/content/drive'
    drive.mount(base_path)
    base_path = base_path + '/My Drive'
else:
    # Setting up the base path for accessing the local drive
    base_path = '/Users/fdhanani/Desktop'

base_path = base_path + "/dataset/MLMAC/"

In [None]:
'''
##  - Installing required Dependencies
'''
!pip install apache_beam mwparserfromhell datasets
!pip install transformers sentencepiece
!pip install pyter3 sacremoses

In [None]:
import json
import time
import pyter
import string
import requests
import numpy as np
import pandas as pd
from random import seed
from pprint import pprint
from random import randint
from collections import Counter
from datasets import load_dataset
from transformers import pipeline
from sklearn.model_selection import KFold
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers.pipelines.conversational import Conversation

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

import tensorflow as tf
from transformers import AutoTokenizer
from transformers import create_optimizer
from transformers import DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification

import glob
import nltk

nltk.download('popular')
from nltk import word_tokenize
from collections import Counter
from nltk.corpus import stopwords
from collections import OrderedDict

**DIRECTORY STRUCTURE:** 

Before executing the subsequent notebook cells, please clone all the related files from the [official Github repository](https://github.com/FarhanDhanani/joker-clef-22-FAST-MT). Then ensure that the directory structure shown below is set up correctly, either on Google Drive if you are running on Colab or on your local system if you are running locally.

**BASE PATH: */dataset/MLMAC/**


```
.
├── ML Model Attribution Challenge.ipynb (MAIN SOURCE FILE)
│── Readme.md
├── dataset
│   └── MLMAC
│       ├── bleu_ter_scoring.json
│       ├── queries.json
│       ├── base_models
│       │   ├── bloom-2b5_api.json
│       │   ├── bloom-350m_api.json
│       │   ├── gpt-j-6B_api.json
│       │   ├── gpt2-xl_api.json
│       │   ├── model_bloom-350m.json
│       │   ├── model_codegen-350M-multi.json
│       │   ├── model_DialoGPT-large.json
│       │   ├── model_distilgpt2.json
│       │   ├── model_gpt-neo-125M.json
│       │   ├── model_gpt2.json
│       │   ├── model_Multilingual-MiniLM-L12-H384.json
│       │   ├── model_opt-350m.json
│       │   ├── xlnet-base-cased_api.json
│       ├── finetuned_models
│       │   ├── 0_api.json
│       │   ├── 1_api.json
│       │   ├── 2_api.json
│       │   ├── 3_api.json
│       │   ├── 4_api.json
│       │   ├── 5_api.json
│       │   ├── 6_api.json
│       │   ├── 7_api.json
│       │   ├── 8_api.json
│       │   ├── 9_api.json
│       │   ├── 10_api.json
│       │   ├── 11_api.json
│       ├── FT
│       │   ├── albert-base-v2/*
│       │   ├── distilbert-base-uncased/*
│       │   ├── roberta-base/*
│       │   ├── other-saved-tokenizer-files
│       │   ├── google
│       │   │   ├──  electra-small-discriminator/*
│       │   │
│       │   .
│       .
├── .
```

# LOADING DATA SETS FOR ASSEMBLING QUERIES

---

## LOADING GLUE DATA SET

---


Below, we have mentioned the correct citation for the GLUE data set.
```
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}
```



In [None]:
'''
##  - Loading the GLUE dataset
'''
dataset1 = load_dataset("glue", 'cola')
dataset1['train'][0]['sentence']

## LOADING ROTTEN TOMATOES DATA SET

---

Below, we have mentioned the correct citation for the ROTTEN TOMATOES data set.


```
@InProceedings{Pang+Lee:05a,
  author =       {Bo Pang and Lillian Lee},
  title =        {Seeing stars: Exploiting class relationships for sentiment
                  categorization with respect to rating scales},
  booktitle =    {Proceedings of the ACL},
  year =         2005
}
```





In [None]:
dataset2 = load_dataset("rotten_tomatoes", split="train")
dataset2[0]['text']

## LOADING SQUAD_V2 DATA SET

---


Below, we have mentioned the correct citation for the SQUAD_V2 data set.



```
@article{2016arXiv160605250R,
       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}
```



In [None]:
dataset3 = load_dataset("squad_v2", split='train')
dataset3[0]['answers']['text'][0]

## LOADING IMDB DATA SET

---

Below, we have mentioned the correct citation for the IMDB data set.

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```



In [None]:
dataset4 = load_dataset("imdb", split='test')
dataset4[0]['text']

## LOADING TREC DATA SET

---

Below, we have mentioned the correct citation for the TREC data set.



```
@inproceedings{li-roth-2002-learning,
    title = "Learning Question Classifiers",
    author = "Li, Xin  and
      Roth, Dan",
    booktitle = "{COLING} 2002: The 19th International Conference on Computational Linguistics",
    year = "2002",
    url = "https://www.aclweb.org/anthology/C02-1150",
}
```

In [None]:
dataset5 = load_dataset("trec", split="test")
dataset5[0]['text']

## LOADING WMT22 AFRICAN DATA SET

---

Below, we have mentioned the correct citation for the WMT22 AFRICAN data set.

```
@misc{https://doi.org/10.48550/arxiv.2207.04672,
  doi = {10.48550/ARXIV.2207.04672}, 
  url = {https://arxiv.org/abs/2207.04672},
  author = {{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7, 68T50},
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}
```



In [None]:
dataset6 = load_dataset("allenai/wmt22_african", 'afr-eng', split='train')
dataset6[0]['translation']['eng']

## LOADING OSCAR MINI DATA SET

---


Below, we have mentioned the correct citation for the OSCAR MINI data set.
```
@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Benoit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}
```

In [None]:
dataset7 = load_dataset("nthngdy/oscar-mini", 'unshuffled_deduplicated_en', split="train")
dataset7[0]['text']

## LOADING RACE DATA SET

---


Below, we have mentioned the correct citation for the RACE data set.
```
@article{lai2017large,
    title={RACE: Large-scale ReAding Comprehension Dataset From Examinations},
    author={Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard},
    journal={arXiv preprint arXiv:1704.04683},
    year={2017}
}
```



In [None]:
dataset8 = load_dataset("race", "middle", split="test")
dataset8[0]["question"]

## LOADING BLIMP DATA SET

---

Below, we have mentioned the correct citation for the BLIMP data set.



```
@article{warstadt2019blimp,
  title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
  author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1912.00582},
  year={2019}
}
```



In [None]:
dataset9 = load_dataset("blimp", 'adjunct_island', split='train')
dataset9[0]['sentence_good']

## LOADING  PIQA DATA SET

---

Below, we have mentioned the correct citation for the PIQA data set.

```
@inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and
            Ronan Le Bras and Jianfeng Gao
            and Yejin Choi},
  title = {PIQA: Reasoning about Physical Commonsense in
           Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on
               Artificial Intelligence},
  year = {2020},
}
```




In [None]:
dataset10= load_dataset('piqa', split='test')
dataset10[0]['goal']

# ASSEMBLING QUERIES FOR INTERROGATING MODELS

---

In this section, we will assemble a list of ninety distinct queries, which we can use later for interrogating the pre-trained base models and the fine-tuned models.

## DEFINING FUNCTION FOR ASSEMBLING QUERIES

---

In [None]:
'''
  ## Function to get a list of queries

  # The function takes no input
  # The function assembles a list of one hundred queries (ten per iteration)
    from the data sets loaded above; ten of them are duplicates, which
    leaves ninety distinct queries
'''

def getQueries():

  # Loading data sets
  dataset1 = load_dataset("glue", 'cola')
  dataset2 = load_dataset("rotten_tomatoes", split="train")
  dataset3 = load_dataset("squad_v2", split='train')
  dataset4 = load_dataset("imdb", split='test')
  dataset5 = load_dataset("trec", split="test")
  dataset6 = load_dataset("allenai/wmt22_african", 'afr-eng', split='train')
  dataset7 = load_dataset("nthngdy/oscar-mini", 'unshuffled_deduplicated_en', split="train")
  dataset8 = load_dataset("race", "middle", split="test")
  dataset9 = load_dataset("blimp", 'adjunct_island', split='train')
  dataset10= load_dataset('piqa', split='test')
  
  # Assembling queries randomly
  seed(1)
  queries=[]
  len_threshold = 60
  iterations = 10
  for it in range(iterations):
    index = randint(0, 500)
    
    dataset1_query = 'c'* len_threshold
    while(len(dataset1_query)>=len_threshold):
      index = randint(0, 500)
      dataset1_query = dataset1['train'][index]['sentence']
      
    dataset2_query = 'c'* len_threshold
    while(len(dataset2_query)>=len_threshold):
      index = randint(0, 500)
      dataset2_query = dataset2[index]['text']
      
    dataset3_query = 'c'* len_threshold
    while(len(dataset3_query)>=len_threshold):
      index = randint(0, 500)
      dataset3_query = dataset3[index]['answers']['text'][0]
      
    dataset4_query = 'c'* len_threshold
    while(len(dataset4_query)>=len_threshold):
      index = randint(0, 500)
      dataset4_query = dataset4[index]['text'] 
  
    dataset5_query = 'c'* len_threshold
    while(len(dataset5_query)>=len_threshold):
      index = randint(0, 500)
      dataset5_query = dataset5[index]['text']
      
    dataset6_query = 'c'* len_threshold
    while(len(dataset6_query)>=len_threshold):
      index = randint(0, 500)
      dataset6_query = dataset6[index]['translation']['eng']
    
    if(it==iterations-1):
      dataset7_query = 'c'
      while(len(dataset7_query)<len_threshold):
        index = randint(0, 5000)
        dataset7_query = dataset6[index]['translation']['eng']
    else:
      index = randint(0, 5000)
      dataset7_query = dataset7[index]['text'][0:len_threshold-1]
 
    dataset8_query = 'c'* len_threshold
    while(len(dataset8_query)>=len_threshold):
      index = randint(0, 500)
      dataset8_query = dataset8[index]["question"]
      
    dataset9_query = 'c'* len_threshold
    while(len(dataset9_query)>=len_threshold):
      index = randint(0, 500)
      dataset9_query = dataset9[index]['sentence_good']
  
    dataset10_query = 'c'* len_threshold
    while(len(dataset10_query)>=len_threshold):
      index = randint(0, 500)
      dataset10_query = dataset10[index]['goal']

    queries.append(dataset1_query)
    queries.append(dataset2_query)
    queries.append(dataset3_query)
    queries.append(dataset4_query)
    queries.append(dataset5_query)
    queries.append(dataset6_query)
    queries.append(dataset8_query)
    queries.append(dataset9_query)
    queries.append(dataset10_query)
    queries.append(dataset7_query)
    
  return queries
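
The sampling loops in `getQueries` all repeat one pattern: draw a random index until the selected text is shorter than the length threshold. The sketch below isolates that pattern in a hypothetical helper, `sample_short_text`, which is not part of the notebook. It assumes the texts are available as a plain list and adds a retry limit, so it cannot loop forever when no short text exists.

```python
from random import Random

def sample_short_text(texts, max_len, rng, max_index=500, max_tries=10_000):
    """Return a randomly chosen text shorter than max_len characters.

    Hypothetical helper: it mirrors the `while len(query) >= len_threshold`
    loops in getQueries, but gives up after `max_tries` draws.
    """
    upper = min(max_index, len(texts) - 1)
    for _ in range(max_tries):
        candidate = texts[rng.randint(0, upper)]
        if len(candidate) < max_len:
            return candidate
    raise ValueError(f"no text shorter than {max_len} characters found")

# Toy stand-in for one data set column (not real data from the notebook).
toy_texts = ["short one", "x" * 80, "another short sentence", "y" * 200]
rng = Random(1)  # a private generator, so the global seed is left untouched
query = sample_short_text(toy_texts, max_len=60, rng=rng)
print(query)
```

Because the helper uses its own `Random` instance, it would not reproduce the exact queries that `getQueries` draws with `seed(1)`; it only illustrates the filtering logic.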


## EXECUTING THE FUNCTION TO ASSEMBLE QUERY & SAVING THE RETURNED RESULT


---

In [None]:
'''
  ## Calling the getQueries function and saving the 
  ## returned response in the permanent file system.
'''

queries = getQueries() 
with open(base_path+"queries.json", "w") as outfile:
    outfile.write(json.dumps(queries, indent = 4))

# MODEL DEFINITION

---

In this section, we will create the functions and classes that allow us to query the pre-trained base models and the fine-tuned models via REST APIs. Let's first define a model class that contains all the details required to accomplish this task.

In [None]:
'''
  ## Defining a MODEL class that we can use later to query 
  ## pre-trained base models and fine-tuned models via REST APIs
'''

class Model:

  '''
    ## The function initializes the instance of the Model class

    ## The function takes four inputs
    
    ## The first input, "api", selects the endpoint the instance 
       will use to fetch a response for the given query.

       ## The first option is "hf", which is the endpoint to query base models.
       ## The second option is "mlmac", which is the endpoint to query fine-tuned models.
    
    ## The second input is "api_token", which is required to authenticate the calls. 
       The token also identifies the caller so that the service can verify 
       the request for security purposes.
    
    ## The third input is "model_id", which specifies the name or index of the model 
       that we want to query from the set of either base or fine-tuned models.

    ## The last input, the boolean "use_cache", specifies whether we want 
       to keep the queries along with their returned 
       responses in memory to avoid duplicate REST calls.
  '''
  def __init__(self, api, api_token, model_id, use_cache=True):
    self.api = api
    self.api_token = api_token
    self.model_id = model_id
    self.use_cache = use_cache
    self.cache = {}

    if api == "hf":
      if model_id == "gpt-j-6B":
        self.api_url = "https://api-inference.huggingface.co/models/EleutherAI/gpt-j-6B"
      else:
        self.api_url = f"https://api-inference.huggingface.co/models/model-attribution-challenge/{model_id}"
    elif api == "mlmac":
      self.api_url = f"https://api.mlmac.io:8080/query?model={model_id}"
    else:
      # Fail early instead of leaving self.api_url undefined
      raise ValueError(f"Unknown api '{api}'; expected 'hf' or 'mlmac'")

  

  '''
    ## The function executes the REST call to fetch the response for a given query

    ## The function takes four inputs
    
    ## The first input, "input", contains a query.
    ## The second input is "max_retries", which determines the number of times
       to retry if the API fails to deliver a response in time.
    ## The third input is "params", which specifies the additional parameters 
       required for the request.
    ## The last input is "options", which specifies other optional parameters 
       for executing the request.
    
    ## The function returns the result of the REST API call as its output 
    ## that contains the response for the given query.
  '''
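  def __call__(self, input, max_retries=10, params=None, options=None):
    # Use None defaults so that no dictionary is shared between calls
    params = {} if params is None else params
    options = {} if options is None else options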
    
    retry_duration = 20.0
    
    if self.use_cache and input in self.cache:
      print("response is from cache")
      return self.cache[input]
    
    if self.api == "hf":
      payload = {"inputs": input, "parameters": params, "options": options}
    elif self.api == "mlmac":
      payload = {"input": input}

    headers = {"Authorization": f"Bearer {self.api_token}"}

    for retry in range(max_retries):
      response = requests.post(self.api_url, json=payload, headers=headers)
    
      if response.status_code == 200:
        if self.api == "hf":
          result = response.json()
        elif self.api == "mlmac":
          result = response.json().get("result")

        # Only store the response when caching is enabled
        if self.use_cache:
          self.cache[input] = result

        return result
      elif response.status_code == 503:
        print(response.json())
        print(f"attempt {retry+1}/{max_retries}; waiting for {retry_duration:g} seconds")
        time.sleep(retry_duration)
      else: # error
        raise Exception(response.text)
    
    raise Exception(f"Failed after {max_retries} attempts")
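
The retry behaviour of `Model.__call__` can be exercised without any network access. The sketch below is a simplified stand-in, not the class itself: `call_with_retries` and its `send` and `sleep` parameters are hypothetical names introduced here, and `send` plays the role that `requests.post` plays in the class. It retries only on status 503, returns on 200, and treats every other status as a hard error.

```python
import time

def call_with_retries(send, max_retries=10, wait_seconds=20.0, sleep=time.sleep):
    """Call `send()` until it returns status 200, retrying only on 503.

    `send` is a hypothetical zero-argument callable that returns a
    (status_code, body) tuple.
    """
    for attempt in range(1, max_retries + 1):
        status, body = send()
        if status == 200:
            return body
        if status != 503:
            # Any other status is treated as a hard error, as in Model.__call__.
            raise RuntimeError(f"request failed with status {status}: {body}")
        print(f"attempt {attempt}/{max_retries}; waiting {wait_seconds} seconds")
        sleep(wait_seconds)
    raise RuntimeError(f"failed after {max_retries} attempts")

# Fake endpoint: unavailable twice, then successful (made-up responses).
responses = iter([(503, "loading"), (503, "loading"), (200, {"generated_text": "hi"})])
result = call_with_retries(lambda: next(responses), wait_seconds=0.0,
                           sleep=lambda seconds: None)
print(result)
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets the example run instantly instead of waiting between attempts.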

# COLLECTING RESPONSES FROM THE BASE MODELS

---

In [None]:
'''
  ## The function "getResponsesFromBaseModel" uses the provided array of 
     base models to return the responses from the base model specified by 
     the "model_index" for the given queries.


  ## FUNCTION INPUT
  ## base_models: a dictionary that maps each base model name to its Model instance
  ## base_model_names: the names of all the base models provided by the MLMAC
  ## queries: the list of all assembled queries
  ## model_index: the index (in base_model_names) of the model whose responses need to be recorded

  ## FUNCTION OUTPUT
  ## The function will return the list of responses generated by the specified
     base model for the given list of queries
'''
def getResponsesFromBaseModel(model_index, base_model_names, base_models, queries):
  model_responses = []
  for query in queries:
    output = base_models[base_model_names[model_index]](query)
    model_responses.append(output)
  return model_responses


'''
  ## The function "createResponseFileForBaseModel" collects the responses 
     returned by the "getResponsesFromBaseModel" and saves them in 
     the permanent file system.

  ## FUNCTION INPUT
  ## base_models: a dictionary that maps each base model name to its Model instance
  ## base_model_names: the names of all the base models provided by the MLMAC
  ## queries: the list of all assembled queries
  ## model_index: the index (in base_model_names) of the model whose responses need to be recorded

  ## FUNCTION OUTPUT
  ## The function saves the list of responses generated by the specified
     base model to the permanent file system
'''
def createResponseFileForBaseModel(model_index, base_model_names, base_models, queries):
  model_responses = getResponsesFromBaseModel(model_index, base_model_names, base_models, queries)
  with open(base_path + "base_models/" + base_model_names[model_index] + "_api.json", "w") as outfile:
    outfile.write(json.dumps(model_responses, indent = 4))

Next, we just use the model definition and the functions defined above to collect and save the responses generated by all the base models against each of the assembled queries.

**RUN WITH CAUTION**

The cells below call the Hugging Face Inference API, so set `HF_API_TOKEN` to your own token first.


In [None]:
## Use following link to access your hugging face api token 
## after creating your account
##    - https://huggingface.co/settings/tokens
  
HF_API_TOKEN = "enter_your_HF_API_TOKEN"

base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_models = {model_name: Model("hf", HF_API_TOKEN, model_name, use_cache=False) for model_name in base_model_names}

with open(base_path+"queries.json", 'r') as infile:
  queries = json.load(infile)

In [None]:
model_index = 7
createResponseFileForBaseModel(model_index, base_model_names, base_models, queries)

# COLLECTING RESPONSES FROM FINE-TUNED MODELS

---

In [None]:
'''
  ## The function "getResponsesFromFineTuneModels" uses the provided list of 
     queries to interrogate the given fine-tuned model specified by the 
     "finetuned_model" parameter


  ## FUNCTION INPUT
  ## finetuned_model: the fine-tuned model to interrogate
  ## queries: contains list of all assembled queries

  ## FUNCTION OUTPUT
  ## The function will return the list of responses generated by the specified
     fine-tuned model for the given list of queries
'''
def getResponsesFromFineTuneModels(finetuned_model, queries):
  model_responses = []
  for query in queries:
    output = finetuned_model(query)
    model_responses.append(output)
  return model_responses


'''
  ## The function "createResponseFileForFineTunedModel" collects the responses 
     returned by the "getResponsesFromFineTuneModels" and saves them in 
     the permanent file system.

  ## FUNCTION INPUT
  ## ft_models: the list of all the fine-tuned models provided by the MLMAC
  ## queries: the list of all assembled queries
  ## model_index: the index of the model whose responses need to be recorded

  ## FUNCTION OUTPUT
  ## The function saves the list of responses generated by the specified
     fine-tuned model to the permanent file system
'''

def createResponseFileForFineTunedModel(model_index, ft_models, queries):
  model_responses = getResponsesFromFineTuneModels(ft_models[model_index], queries)
  with open(base_path + "finetuned_models/" + str(model_index) + "_api.json", "w") as outfile:
    outfile.write(json.dumps(model_responses, indent = 4))

Next, we use the model definition and the functions defined above to collect and save the responses generated by all the fine-tuned models for each of the assembled queries.

**RUN WITH CAUTION**

The cells below call the MLMAC API, so set `MLMAC_API_TOKEN` to your own token first.

In [None]:
## We have used the following link to create MLMAC api token 
## after creating our account
##    - https://mlmac.io/status

MLMAC_API_TOKEN = "enter_your_MLMAC_API_TOKEN"

ft_models = [Model("mlmac", MLMAC_API_TOKEN, idx) for idx in range(12)]

with open(base_path+"queries.json", 'r') as infile:
  queries = json.load(infile)

In [None]:
model_index = 9

if (not (MLMAC_API_TOKEN == "") and call_finetune_API and (model_index>-1)):
  createResponseFileForFineTunedModel(model_index, ft_models, queries)

# APPROACH-01 BLEU & TER SCORES

---

In [None]:
'''
  ## The function "getBleuScore" compares the given candidate sentence with
     the provided reference sentence to generate a BLEU score

  ## FUNCTION INPUT
  ## reference: It contains the text to compare against. In the pairing loop
                of this notebook it's the response generated by a base model.
  ## candidate: It contains the text being scored. In the pairing loop of this
                notebook it's the response generated by a fine-tuned model.
  
  ## FUNCTION OUTPUT
  ## The function returns a BLEU score between the provided 
     reference and candidate sentence.
'''
def getBleuScore(reference, candidate):
  chencherry = SmoothingFunction()
  return sentence_bleu([reference.split()], candidate.split(), smoothing_function=chencherry.method7)

In [None]:
'''
  ## The function "getTerScore" compares the given candidate sentence with
     the provided reference sentence to generate a TER score

  ## FUNCTION INPUT
  ## reference: It contains the text to compare against. In the pairing loop
                of this notebook it's the response generated by a base model.
  ## candidate: It contains the text being scored. In the pairing loop of this
                notebook it's the response generated by a fine-tuned model.
  
  ## FUNCTION OUTPUT
  ## The function returns a TER score between the provided 
     reference and candidate sentence.
'''

def getTerScore(reference, candidate):
  return pyter.ter(candidate.split(), reference.split())
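
To make the TER score more concrete, the sketch below computes a simplified variant in plain Python. It is not the `pyter` implementation: full TER also allows block shifts, which this sketch leaves out, so it reduces to the word-level edit distance divided by the reference length. The names `word_edit_distance` and `simple_ter` are hypothetical and introduced only for this illustration.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance over words (insert, delete, substitute cost 1)."""
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete the hypothesis word
                            curr[j - 1] + 1,      # insert the reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simple_ter(reference, candidate):
    """Edit distance normalised by reference length (no shift operations)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(candidate.split(), ref_words) / len(ref_words)

# Toy sentences (not taken from the model responses).
score = simple_ter("the cat sat on the mat", "the cat sat on mat")
print(score)
```

A lower value means fewer edits are needed to turn the candidate into the reference, which is why the pairing code selects the base model with the minimum average TER.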

In [None]:
'''
  ## The function "remove_punctuations" removes punctuation from the input sentence

  ## FUNCTION INPUT
  ## sentence: A string from which punctuation needs to be removed.
  
  ## FUNCTION OUTPUT
  ## The function outputs the input string on a single line, in lower case, 
     with every punctuation mark replaced by a space
'''

def remove_punctuations(sentence):
    sentence_w_multi_line = " ".join(sentence.splitlines())
    sentence_w_punct = "".join([i.lower() if i not in string.punctuation else " " for i in sentence_w_multi_line])
    sentence = sentence_w_punct.strip()
    #sentence_w_num = ''.join(i for i in sentence_w_punct if not i.isdigit())
    return sentence

Next, we will use the functions developed above to pair each fine-tuned model with the base model whose generated responses best match its own.

In [None]:
'''
  ## NAMES OF ALL BASE MODELS
'''

base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]

'''
  ## EXECUTION OF APPROACH-01
'''
scoring = {}
for index in range(0, 12):
  model_index = index
  
  with open(base_path+ "finetuned_models/" + str(model_index) + "_api.json", 'r') as infile:
    finetuned_model_responses = json.load(infile)

  bleu_matching = {}
  ter_matching = {}

  bleu_matching_with_punct = {}
  ter_matching_with_punct = {}

  for base_model in base_model_names:
    base_model_response_path = base_path+ "base_models/"
    
    if base_model in base_model_api_responses:
      base_model_response_path +=  base_model + "_api.json"
    else:
      base_model_response_path +=  "model_" + base_model + ".json"

    with open(base_model_response_path, 'r') as infile:
      base_model_responses = json.load(infile)
    
    bleu_scores = []
    ter_scores = []

    bleu_scores_with_punct = []
    ter_scores_with_punct = []

    for (base_model_resp, finetuned_model_resp) in zip(base_model_responses, finetuned_model_responses):
      
      ref = base_model_resp[0]['generated_text']
      cand = finetuned_model_resp['generated_text']

      bleu_scores_with_punct.append(getBleuScore(ref, cand))
      ter_scores_with_punct.append(getTerScore(ref, cand))
      
      ref = remove_punctuations(ref)
      cand = remove_punctuations(cand)
      
      bleu_scores.append(getBleuScore(ref, cand))
      ter_scores.append(getTerScore(ref, cand))

      print(ref, ",", cand)
      


    bleu_matching[base_model] = sum(bleu_scores) / len(bleu_scores)
    ter_matching[base_model] = sum(ter_scores) / len(ter_scores)

    bleu_matching_with_punct[base_model] = sum(bleu_scores_with_punct) / len(bleu_scores_with_punct) 
    ter_matching_with_punct[base_model] = sum(ter_scores_with_punct) / len(ter_scores_with_punct)
    
  # Record the scores once all base models have been compared with this fine-tuned model
  scoring[model_index] = {}
  scoring[model_index]["bleu_matching"] = dict(sorted(bleu_matching.items(), key=lambda item: item[1]))
  scoring[model_index]["ter_matching"] = dict(sorted(ter_matching.items(), key=lambda item: item[1]))
  scoring[model_index]["bleu_matching_with_punct"] = dict(sorted(bleu_matching_with_punct.items(), key=lambda item: item[1]))
  scoring[model_index]["ter_matching_with_punct"] = dict(sorted(ter_matching_with_punct.items(), key=lambda item: item[1]))

  scoring[model_index]["bleu_matching_resp"] = max(bleu_matching, key=bleu_matching.get)
  scoring[model_index]["ter_matching_resp"] = min(ter_matching, key=ter_matching.get)
  scoring[model_index]["bleu_matching_with_punct_resp"] = max(bleu_matching_with_punct, key=bleu_matching_with_punct.get)
  scoring[model_index]["ter_matching_with_punct_resp"] = min(ter_matching_with_punct, key=ter_matching_with_punct.get)

'''
  ## SAVING ALL THE CALCULATIONS IN FILE-SYSTEM
'''
with open(base_path + "bleu_ter_scoring.json", "w") as outfile:
  outfile.write(json.dumps(scoring, indent = 4))

## GENERATING FINAL PAIRING

---

In [None]:
dict(sorted(bleu_matching.items(), key=lambda item: item[1]))

In [None]:
dict(sorted(ter_matching.items(), key=lambda item: item[1]))

In [None]:
dict(sorted(bleu_matching_with_punct.items(), key=lambda item: item[1]))

In [None]:
dict(sorted(ter_matching_with_punct.items(), key=lambda item: item[1]))

In [None]:
max(bleu_matching, key=bleu_matching.get)

In [None]:
min(ter_matching, key=ter_matching.get)

# APPROACH-02 VSM MODEL

---

## VSM MODEL VANILLA DOCUMENT-DOCUMENT BASED APPROACH

---

In this implementation, we use a basic vector space model (VSM) to pair each fine-tuned model with the most suitable base model. We create one document per fine-tuned model and per base model by concatenating its generated responses in the same query order. We then pair each fine-tuned model with the base model whose document is most similar in the vector space.
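
The document-to-document comparison can be illustrated on toy data with nothing but the standard library. The sketch below is an assumption-laden stand-in for the scikit-learn pipeline used in this notebook: it tokenizes on whitespace, uses a smoothed idf chosen for this sketch, and works on three made-up documents. The helper names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document name to a {term: tf * idf} dictionary."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    # Smoothed idf; this formula is a choice made for this sketch.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return {name: {term: count * idf[term] for term, count in Counter(words).items()}
            for name, words in tokenized.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up documents: one fine-tuned model and two base models.
toy_docs = {
    "ft_0": "the movie was great\nthe plot was thin",
    "base_a": "the movie was great\nthe plot was weak",
    "base_b": "def add(a, b):\n    return a + b",
}
vectors = tfidf_vectors(toy_docs)
best = max(["base_a", "base_b"], key=lambda name: cosine(vectors["ft_0"], vectors[name]))
print(best)
```

In this toy case `base_b` shares no terms with `ft_0`, so its similarity is zero and `base_a` is selected.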

In [None]:
'''
  ## NAMES OF ALL BASE MODELS
'''
base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]

docs = pd.Series(dtype=object)  # explicit dtype: the series holds text documents

'''
  ## BUILDING DOCUMENTS FOR EACH FINE-TUNED MODEL
'''

for file_index in range(0, 12):
  model_index = file_index
  doc_name = "ft_"+str(model_index)
  with open(base_path+ "finetuned_models/" + str(model_index) + "_api.json", 'r') as infile:
    finetuned_model_responses = json.load(infile)
  
  all_resp = ""
  for index in range(0, len(finetuned_model_responses)):
    all_resp += finetuned_model_responses[index]['generated_text'] + "\n"
  docs[doc_name] = all_resp


'''
  ## GENERATING DOCUMENTS FOR EACH BASE MODEL
'''

for base_model in base_model_names:
  base_model_response_path = base_path+ "base_models/"    
  if base_model in base_model_api_responses:
    base_model_response_path +=  base_model + "_api.json"
  else:
    base_model_response_path +=  "model_" + base_model + ".json"

  with open(base_model_response_path, 'r') as infile:
    base_model_responses = json.load(infile)
  
  all_resp = ""
  for index in range(0, len(base_model_responses)):
    response = base_model_responses[index][0]['generated_text']
    all_resp += response + "\n"
  docs[base_model] = all_resp

### GENERATING FINAL PAIRING 

---

In [None]:
vec = TfidfVectorizer(norm=None) # Do not normalize.
vec.fit(docs) # This determines the vocabulary.
tf_idf_sparse = vec.transform(docs)
cos_sim = cosine_similarity(tf_idf_sparse)

In [None]:
np.argmax(cos_sim[0][12:])

In [None]:
np.argmax(cos_sim[1][12:])

In [None]:
np.argmax(cos_sim[2][12:])

In [None]:
np.argmax(cos_sim[3][12:])

In [None]:
np.argmax(cos_sim[4][12:])

In [None]:
np.argmax(cos_sim[5][12:])

In [None]:
np.argmax(cos_sim[6][12:])

In [None]:
np.argmax(cos_sim[7][12:])

In [None]:
np.argmax(cos_sim[8][12:])

In [None]:
np.argmax(cos_sim[9][12:])

In [None]:
np.argmax(cos_sim[10][12:])

In [None]:
np.argmax(cos_sim[11][12:])

In [None]:
np.argmax(cos_sim[-1][12:])

## VSM MODEL CROSS VALIDATION


---

In this sub-section, we use ten-fold cross-validation to estimate how well the vector space model performs by simulating the pairing task. We build a synthetic data set from the base-model responses alone: in each fold, the training responses form one document per base model, and the held-out responses of each base model stand in for the output of an unknown fine-tuned model. We record the accuracy of the approach in each fold.
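The construction of a single fold can be sketched as follows. This is a simplified illustration with a hand-rolled splitter and made-up responses (the names `k_fold_indices` and `samples` are hypothetical); it only shows how the data is partitioned, not the similarity scoring.

```python
import random
from collections import defaultdict

def k_fold_indices(n_items, n_splits, seed):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for held_out in range(n_splits):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Hypothetical labelled responses: (text, index of the base model that wrote it).
samples = [(f"response {i} from model {i % 3}", i % 3) for i in range(30)]

all_test = []
for fold_no, (train_idx, test_idx) in enumerate(
        k_fold_indices(len(samples), n_splits=10, seed=2)):
    # Training responses are concatenated into one document per base model.
    base_docs = defaultdict(str)
    for i in train_idx:
        text, label = samples[i]
        base_docs[label] += text + "\n"

    # Held-out responses play the role of an unknown fine-tuned model's output.
    pseudo_ft = defaultdict(list)
    for i in test_idx:
        text, label = samples[i]
        pseudo_ft[label].append(text)

    all_test.extend(test_idx)
    print("fold", fold_no, "train size", len(train_idx), "test size", len(test_idx))
```

Every response is held out exactly once across the ten folds, so the per-fold accuracies together cover the whole synthetic data set.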

In [None]:
'''
  ## NAMES OF THE BASE MODEL
'''


base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]


'''
  ## PREPARING SYNTHETIC DATA SET FROM THE RESPONSES OF BASE MODEL
'''

labels = []
responses = []
model_index = 0

for base_model in base_model_names:
  base_model_response_path = base_path+ "base_models/"    
  if base_model in base_model_api_responses:
    base_model_response_path +=  base_model + "_api.json"
  else:
    base_model_response_path +=  "model_" + base_model + ".json"

  with open(base_model_response_path, 'r') as infile:
    base_model_responses = json.load(infile)
  
  for index in range(0, len(base_model_responses)):
    resp = base_model_responses[index][0]['generated_text']
    responses.append(resp)
    labels.append(model_index)
  
  model_index += 1

df = pd.DataFrame([responses, labels], index=['responses', 'labels']).transpose()
kf = KFold(n_splits = 10, shuffle = True, random_state = 2)


'''
  ## EXECUTION OF K-FOLD VALIDATION
'''
folds = 0

for train_index, test_index in kf.split(df):
  
  train = df.iloc[train_index]
  test = df.iloc[test_index]
  print(" ============== FOLD",folds," ==============")
  
  correct_pred_fold_avg = 0
  correct_pred_fold_inst = 0
  correct_pred_fold_freq = 0
  total_insts = 0
  correct_pred_fold_inst_total = 0


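  docs = pd.Series(dtype=object)

  for b_labels in range(0, 12):
    all_resp = ""
    for index, record in train[train.labels==b_labels].iterrows():
      resp = record.responses.strip()
      resp = resp.replace("\n", " ")
      # Append (not overwrite) so the document holds every training response.
      all_resp += resp + "\n"
    
    docs[str(b_labels)] = all_resp

  for ft_label in range(0, 12):
    
    aggregator = np.zeros(12)
    # Reset the per-label tallies so votes and hits do not leak across labels.
    count_freq_pred = [0]*12
    correct_pred_fold_inst = 0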
    for index, record in test[test.labels==ft_label].iterrows():
      resp = record.responses.strip()
      resp = resp.replace("\n", " ")

      newDocs = docs.copy()
      newDocs[str(ft_label)+"-"+str(index)] = resp
      vec = TfidfVectorizer(norm=None) 
      vec.fit(newDocs) 
      
      tf_idf_sparse = vec.transform(newDocs)
      cos_sim = cosine_similarity(tf_idf_sparse)
      aggregator += cos_sim[12][:12]
      
      pred_label_record = np.argmax(cos_sim[12][:12])
      count_freq_pred[pred_label_record]+=1

      if(pred_label_record == record.labels):
        correct_pred_fold_inst +=1

    
    aggregator = aggregator/len(test[test.labels==ft_label])
    pred_label_via_avg = np.argmax(aggregator)

    pred_label_via_max_freq = np.argmax(count_freq_pred)

    print(str(ft_label) + "<=> avg", str(pred_label_via_avg))
    print(str(ft_label) + "<=> freq", str(pred_label_via_max_freq))

    print("Total Correctly predicted Instances =",correct_pred_fold_inst, "out of", len(test[test.labels==ft_label]), "when label is ", ft_label)
    correct_pred_fold_inst_total += correct_pred_fold_inst
    total_insts+=len(test[test.labels==ft_label])

    if(ft_label==pred_label_via_avg):
      correct_pred_fold_avg +=1
    
    if (ft_label == pred_label_via_max_freq):
      correct_pred_fold_freq +=1

    
  print("Total Correctly predicted labels in FOLD", folds, "for FT models via avg =", correct_pred_fold_avg, "out of 12")
  print("Total Correctly predicted labels in FOLD", folds, "for FT models via max freq =", correct_pred_fold_freq, "out of 12")
  print("Total Correctly predicted instances in FOLD", folds, "for individual responses of FT model =", correct_pred_fold_inst_total, "out of", total_insts)
  folds +=1

## VSM MODEL VANILLA DOCUMENT-QUERY BASED APPROACH


---



This approach again uses a basic VSM to pair each fine-tuned model with its most likely base model, but treats the fine-tuned responses as queries rather than as a single document. We build one document per base model by concatenating its generated responses in the same prompt order, and keep the responses of each fine-tuned model as a list. Each fine-tuned response is then used as a query and scored against every base-model document in the constructed VSM. Finally, we pair each fine-tuned model with the base model whose document has the highest cosine similarity averaged over all of that model's responses.
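Per-response scores can be combined into one decision per fine-tuned model in more than one way. The sketch below uses made-up similarity scores and hypothetical base-model names to compare two simple options, averaging the scores and majority voting over the per-response winners; it is an illustration, not the submission code.

```python
from collections import Counter

# Hypothetical similarity scores: one row per fine-tuned response,
# one column per base-model document.
base_names = ["base_a", "base_b", "base_c"]
similarities = [
    [0.10, 0.70, 0.20],
    [0.15, 0.60, 0.25],
    [0.90, 0.05, 0.05],
    [0.20, 0.50, 0.30],
]

# Option 1: average the scores per base model and take the best column.
n_responses = len(similarities)
mean_scores = [
    sum(row[j] for row in similarities) / n_responses
    for j in range(len(base_names))
]
pred_avg = base_names[mean_scores.index(max(mean_scores))]

# Option 2: let each response vote for its best base model.
votes = Counter(row.index(max(row)) for row in similarities)
pred_vote = base_names[votes.most_common(1)[0][0]]

print("average:", pred_avg, "| majority vote:", pred_vote)
```

Averaging is sensitive to a few very confident responses, while voting treats every response equally, so the two rules can disagree on less clear-cut data.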

In [None]:
'''
  ## NAMES OF THE BASE MODEL
'''

base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]

'''
  ## GENERATING DOCUMENTS FOR EACH BASE MODEL
'''

docs = pd.Series(dtype=object)  # explicit dtype: the series holds text documents

for base_model in base_model_names:
  base_model_response_path = base_path+ "base_models/"    
  if base_model in base_model_api_responses:
    base_model_response_path +=  base_model + "_api.json"
  else:
    base_model_response_path +=  "model_" + base_model + ".json"

  with open(base_model_response_path, 'r') as infile:
    base_model_responses = json.load(infile)
  
  all_resp = ""
  for index in range(0, len(base_model_responses)):
    response = base_model_responses[index][0]['generated_text']
    all_resp += response + "\n"
  docs[base_model] = all_resp

### GENERATING FINAL PAIRING

---

In [None]:
'''
  ## EXECUTION OF THE APPROACH-02
'''
for file_index in range(0, 12):
  model_index = file_index
  doc_name = "ft_"+str(model_index)
  if (file_index==9):
    model_index = 8
  with open(base_path+ "finetuned_models/" + str(model_index) + "_api.json", 'r') as infile:
    finetuned_model_responses = json.load(infile)
  
  aggregator = np.zeros(12)
  for index in range(0, len(finetuned_model_responses)):
    resp = finetuned_model_responses[index]['generated_text']
    newDocs = docs.copy()
    newDocs[doc_name+str(index)] = resp

    vec = TfidfVectorizer(norm=None) # Do not normalize.
    vec.fit(newDocs) # This determines the vocabulary.
    tf_idf_sparse = vec.transform(newDocs)
    cos_sim = cosine_similarity(tf_idf_sparse)
    aggregator += cos_sim[12][:12]

  aggregator = aggregator/len(finetuned_model_responses)
  print(str(model_index) + "<=>", np.argmax(aggregator))

# APPROACH-03 MULTI-CLASS TEXT CLASSIFICATION

---

In [None]:
'''
  ## NAMES OF THE BASE MODEL
'''

base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]


'''
  ## POPULATING LIST OF RESPONSES FOR EACH BASE MODEL
'''
docs = pd.Series(dtype=object)

labels = []
responses = []
model_index = 0

for base_model in base_model_names:
  base_model_response_path = base_path+ "base_models/"    
  if base_model in base_model_api_responses:
    base_model_response_path +=  base_model + "_api.json"
  else:
    base_model_response_path +=  "model_" + base_model + ".json"

  with open(base_model_response_path, 'r') as infile:
    base_model_responses = json.load(infile)
  
  for index in range(0, len(base_model_responses)):
    resp = base_model_responses[index][0]['generated_text']
    responses.append(resp)
    labels.append(model_index)
  
  model_index += 1

In [None]:
'''
  ## CONSTRUCTING DATA FRAME FOR DEVELOPING TRAINING DATA SET
'''

df = pd.DataFrame([responses, labels], index=['responses', 'labels']).transpose()

In [None]:
'''
  ## CONSTRUCTING LIST OF MODEL NAMES
'''

model_names = ["distilbert-base-uncased", 'google/electra-small-discriminator', 'albert-base-v2', "roberta-base"]

## 10-FOLD CROSS VALIDATION USING DISTIL-BERT


---

In [None]:
model_name = model_names[0]
kf = KFold(n_splits = 10, shuffle = True, random_state = 2)
accuracy = []
folds = kf.split(df)


folds_no = 0

for train_index, test_index  in folds:
  # Reset the per-fold counters so each fold is reported out of 12.
  correct_pred_fold_avg = 0
  correct_pred_fold_freq = 0

  train = df.iloc[train_index]
  test = df.iloc[test_index]

  dataset = Dataset.from_pandas(train)
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenized_dataset = dataset.map((lambda x: tokenizer(x["responses"], truncation=True, max_length=647)), batched=True)
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

  tf_train_set = tokenized_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
  )


  batch_size = 16
  num_epochs = 3
  batches_per_epoch = len(tokenized_dataset) // batch_size
  total_train_steps = int(batches_per_epoch * num_epochs)
  optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

  model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=12)
  model.compile(optimizer=optimizer)
  model.fit(x=tf_train_set, epochs=num_epochs)

  classifications = []
  labels = []
  for ft_label in range(0, 12):
    
    aggregator = np.zeros(12)
    frequencer = np.zeros(12)
    correct_pred_fold_inst = 0

    for index, record in test[test.labels==ft_label].iterrows():
      resp  = record['responses']
      label = record['labels']

      tokenized = tokenizer(resp, return_tensors="np", padding="longest", truncation=True)
      output = model(tokenized).logits
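      # Take the scalar class index; keeping the 1-element array would make
      # the accuracy comparison below broadcast to an n-by-n matrix.
      model_pred = int(np.argmax(output.numpy(), axis=1)[0])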
      aggregator+=output.numpy()[0]
      frequencer[model_pred]+=1

      classifications.append(model_pred)
      labels.append(ft_label)

      if(model_pred == ft_label):
        correct_pred_fold_inst+=1

    pred_label_via_avg = np.argmax(aggregator/len(test[test.labels==ft_label]))
    pred_label_via_max_freq = np.argmax(frequencer)

    print(str(ft_label) + "<=> avg", str(pred_label_via_avg))
    print(str(ft_label) + "<=> freq", str(pred_label_via_max_freq))
    print("Total Correctly predicted Instances =",correct_pred_fold_inst, "out of", len(test[test.labels==ft_label]), "when label is ", ft_label)

    if(ft_label==pred_label_via_avg):
      correct_pred_fold_avg +=1
    
    if (ft_label == pred_label_via_max_freq):
      correct_pred_fold_freq +=1

    
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via avg =", correct_pred_fold_avg, "out of 12")
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via max freq =", correct_pred_fold_freq, "out of 12")
  accuracy.append(((classifications == np.array(labels)).sum() / len(test)))
  print(accuracy)
  folds_no+=1

## 10-FOLD CROSS VALIDATION USING ELECTRA-SMALL


---

In [None]:
model_name = model_names[1]
kf = KFold(n_splits = 10, shuffle = True, random_state = 2)
accuracy = []
folds = kf.split(df)


folds_no = 0

for train_index, test_index  in folds:
  # Reset the per-fold counters so each fold is reported out of 12.
  correct_pred_fold_avg = 0
  correct_pred_fold_freq = 0

  train = df.iloc[train_index]
  test = df.iloc[test_index]

  dataset = Dataset.from_pandas(train)
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenized_dataset = dataset.map((lambda x: tokenizer(x["responses"], truncation=True, max_length=647)), batched=True)
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

  tf_train_set = tokenized_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
  )


  batch_size = 16
  num_epochs = 3
  batches_per_epoch = len(tokenized_dataset) // batch_size
  total_train_steps = int(batches_per_epoch * num_epochs)
  optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

  model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=12)
  model.compile(optimizer=optimizer)
  model.fit(x=tf_train_set, epochs=num_epochs)

  classifications = []
  labels = []
  for ft_label in range(0, 12):
    
    aggregator = np.zeros(12)
    frequencer = np.zeros(12)
    correct_pred_fold_inst = 0

    for index, record in test[test.labels==ft_label].iterrows():
      resp  = record['responses']
      label = record['labels']

      tokenized = tokenizer(resp, return_tensors="np", padding="longest", truncation=True)
      output = model(tokenized).logits
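      # Take the scalar class index; keeping the 1-element array would make
      # the accuracy comparison below broadcast to an n-by-n matrix.
      model_pred = int(np.argmax(output.numpy(), axis=1)[0])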
      aggregator+=output.numpy()[0]
      frequencer[model_pred]+=1

      classifications.append(model_pred)
      labels.append(ft_label)

      if(model_pred == ft_label):
        correct_pred_fold_inst+=1

    pred_label_via_avg = np.argmax(aggregator/len(test[test.labels==ft_label]))
    pred_label_via_max_freq = np.argmax(frequencer)

    print(str(ft_label) + "<=> avg", str(pred_label_via_avg))
    print(str(ft_label) + "<=> freq", str(pred_label_via_max_freq))
    print("Total Correctly predicted Instances =",correct_pred_fold_inst, "out of", len(test[test.labels==ft_label]), "when label is ", ft_label)

    if(ft_label==pred_label_via_avg):
      correct_pred_fold_avg +=1
    
    if (ft_label == pred_label_via_max_freq):
      correct_pred_fold_freq +=1

    
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via avg =", correct_pred_fold_avg, "out of 12")
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via max freq =", correct_pred_fold_freq, "out of 12")
  accuracy.append(((classifications == np.array(labels)).sum() / len(test)))
  print(accuracy)
  folds_no+=1

## 10-FOLD CROSS VALIDATION USING ALBERT-BASE


---

In [None]:
model_name = model_names[2]
kf = KFold(n_splits = 10, shuffle = True, random_state = 2)
accuracy = []
folds = kf.split(df)


folds_no = 0

for train_index, test_index  in folds:
  # Reset the per-fold counters so each fold is reported out of 12.
  correct_pred_fold_avg = 0
  correct_pred_fold_freq = 0

  train = df.iloc[train_index]
  test = df.iloc[test_index]

  dataset = Dataset.from_pandas(train)
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenized_dataset = dataset.map((lambda x: tokenizer(x["responses"], truncation=True, max_length=647)), batched=True)
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

  tf_train_set = tokenized_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
  )


  batch_size = 16
  num_epochs = 3
  batches_per_epoch = len(tokenized_dataset) // batch_size
  total_train_steps = int(batches_per_epoch * num_epochs)
  optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

  model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=12)
  model.compile(optimizer=optimizer)
  model.fit(x=tf_train_set, epochs=num_epochs)

  classifications = []
  labels = []
  for ft_label in range(0, 12):
    
    aggregator = np.zeros(12)
    frequencer = np.zeros(12)
    correct_pred_fold_inst = 0

    for index, record in test[test.labels==ft_label].iterrows():
      resp  = record['responses']
      label = record['labels']

      tokenized = tokenizer(resp, return_tensors="np", padding="longest", truncation=True)
      output = model(tokenized).logits
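      # Take the scalar class index; keeping the 1-element array would make
      # the accuracy comparison below broadcast to an n-by-n matrix.
      model_pred = int(np.argmax(output.numpy(), axis=1)[0])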
      aggregator+=output.numpy()[0]
      frequencer[model_pred]+=1

      classifications.append(model_pred)
      labels.append(ft_label)

      if(model_pred == ft_label):
        correct_pred_fold_inst+=1

    pred_label_via_avg = np.argmax(aggregator/len(test[test.labels==ft_label]))
    pred_label_via_max_freq = np.argmax(frequencer)

    print(str(ft_label) + "<=> avg", str(pred_label_via_avg))
    print(str(ft_label) + "<=> freq", str(pred_label_via_max_freq))
    print("Total Correctly predicted Instances =",correct_pred_fold_inst, "out of", len(test[test.labels==ft_label]), "when label is ", ft_label)

    if(ft_label==pred_label_via_avg):
      correct_pred_fold_avg +=1
    
    if (ft_label == pred_label_via_max_freq):
      correct_pred_fold_freq +=1

    
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via avg =", correct_pred_fold_avg, "out of 12")
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via max freq =", correct_pred_fold_freq, "out of 12")
  accuracy.append(((classifications == np.array(labels)).sum() / len(test)))
  print(accuracy)
  folds_no+=1

## 10-FOLD CROSS VALIDATION USING ROBERTA-BASE


---

In [None]:
model_name = model_names[3]
kf = KFold(n_splits = 10, shuffle = True, random_state = 2)
accuracy = []
folds = kf.split(df)


folds_no = 0

for train_index, test_index  in folds:
  # Reset the per-fold counters so each fold is reported out of 12.
  correct_pred_fold_avg = 0
  correct_pred_fold_freq = 0

  train = df.iloc[train_index]
  test = df.iloc[test_index]

  dataset = Dataset.from_pandas(train)
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenized_dataset = dataset.map((lambda x: tokenizer(x["responses"], truncation=True, max_length=647)), batched=True)
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

  tf_train_set = tokenized_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
  )


  batch_size = 16
  num_epochs = 3
  batches_per_epoch = len(tokenized_dataset) // batch_size
  total_train_steps = int(batches_per_epoch * num_epochs)
  optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

  model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=12)
  model.compile(optimizer=optimizer)
  model.fit(x=tf_train_set, epochs=num_epochs)

  classifications = []
  labels = []
  for ft_label in range(0, 12):
    
    aggregator = np.zeros(12)
    frequencer = np.zeros(12)
    correct_pred_fold_inst = 0

    for index, record in test[test.labels==ft_label].iterrows():
      resp  = record['responses']
      label = record['labels']

      tokenized = tokenizer(resp, return_tensors="np", padding="longest", truncation=True)
      output = model(tokenized).logits
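      # Take the scalar class index; keeping the 1-element array would make
      # the accuracy comparison below broadcast to an n-by-n matrix.
      model_pred = int(np.argmax(output.numpy(), axis=1)[0])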
      aggregator+=output.numpy()[0]
      frequencer[model_pred]+=1

      classifications.append(model_pred)
      labels.append(ft_label)

      if(model_pred == ft_label):
        correct_pred_fold_inst+=1

    pred_label_via_avg = np.argmax(aggregator/len(test[test.labels==ft_label]))
    pred_label_via_max_freq = np.argmax(frequencer)

    print(str(ft_label) + "<=> avg", str(pred_label_via_avg))
    print(str(ft_label) + "<=> freq", str(pred_label_via_max_freq))
    print("Total Correctly predicted Instances =",correct_pred_fold_inst, "out of", len(test[test.labels==ft_label]), "when label is ", ft_label)

    if(ft_label==pred_label_via_avg):
      correct_pred_fold_avg +=1
    
    if (ft_label == pred_label_via_max_freq):
      correct_pred_fold_freq +=1

    
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via avg =", correct_pred_fold_avg, "out of 12")
  print("Total Correctly predicted labels in FOLD", folds_no, "for FT models via max freq =", correct_pred_fold_freq, "out of 12")
  accuracy.append(((classifications == np.array(labels)).sum() / len(test)))
  print(accuracy)
  folds_no+=1

## GENERATING FINAL PAIRING

---

In [None]:
'''
  ## MODEL TRAINING
'''

dataset = Dataset.from_pandas(df)
model_name = 'google/electra-small-discriminator'  # alternatives: 'albert-base-v2', 'roberta-base', 'distilbert-base-uncased'


def preprocess_function(examples):
    return tokenizer(examples["responses"], truncation=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_dataset = dataset.map(preprocess_function, batched=True)


data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_set = tokenized_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)


batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_dataset) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=12)
model.compile(optimizer=optimizer)
model.fit(x=tf_train_set, epochs=num_epochs)

In [None]:
'''
  ## SAVING THE TRAINED MODEL TO FILE SYSTEM
'''

model.save_pretrained(base_path+ "FT/"+model_name)
tokenizer.save_pretrained(base_path+ "FT/"+model_name)

In [None]:
'''
  ## LOADING BACK THE SAVED TRAINED MODEL FROM FILE SYSTEM
'''

model = TFAutoModelForSequenceClassification.from_pretrained(base_path+ "FT/"+model_name)
tokenizer = AutoTokenizer.from_pretrained(base_path+ "FT/"+model_name)

In [None]:
'''
  ## USING THE RELOADED TRAINED MODEL TO GENERATE THE PAIRING
'''

for file_index in range(0, 12):
  model_index = file_index
  doc_name = "ft_"+str(model_index)
  with open(base_path+ "finetuned_models/" + str(model_index) + "_api.json", 'r') as infile:
    finetuned_model_responses = json.load(infile)
  
  responses = []
  for index in range(0, len(finetuned_model_responses)):
    resp = finetuned_model_responses[index]['generated_text']
    responses.append(resp)

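  # Truncate as in training so over-long responses do not exceed the model's input limit.
  tokenized = tokenizer(responses, return_tensors="np", padding="longest", truncation=True)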

  outputs = model(tokenized).logits

  classifications = np.argmax(outputs, axis=1)
  # Average the logits over however many responses were loaded (no hard-coded count).
  print("freq", np.argmax(np.bincount(classifications)), "==== avg", np.argmax(tf.reduce_mean(outputs, 0).numpy()))

# APPROACH-04 MULTI-CLASS TEXT CLASSIFICATION VIA ONE-VS-ALL

In [None]:
with open(base_path+"queries.json", 'r') as infile:
  queries = json.load(infile)

for file_index in range(0, 1):
  model_index = file_index
  doc_name = "ft_"+str(model_index)
  if (file_index==9):
    model_index = 8
  with open(base_path+ "finetuned_models/" + str(model_index) + "_api.json", 'r') as infile:
    finetuned_model_responses = json.load(infile)
  
  responses = []
  iit=0
  for index in range(0, len(finetuned_model_responses)):
    resp = finetuned_model_responses[index]['generated_text']
    if (not(resp in queries)):
      print(resp)
      iit +=1
  print(iit)

In [None]:
base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]


base_model = base_model_names[0]
if base_model in base_model_api_responses:
  base_model_response_path = base_path+ "base_models/"+base_model + "_api.json"
else:
  base_model_response_path = base_path+ "base_models/"+  "model_" + base_model+".json"

with open(base_model_response_path, 'r') as infile:
    base_model_responses = json.load(infile)

model_index = 11
doc_name = "ft_"+str(model_index)
with open(base_path+ "finetuned_models/" + str(model_index) + "_api.json", 'r') as infile:
  finetuned_model_responses = json.load(infile)

for base_resp, fine_tune_resp in zip(base_model_responses, finetuned_model_responses):
  if (fine_tune_resp['generated_text'] == base_resp[0]['generated_text']):
    print(fine_tune_resp['generated_text'])


In [None]:
base_model_names = ["bloom-350m", "bloom-2b5", "codegen-350M-multi", "DialoGPT-large",
                    "distilgpt2", "gpt2", "gpt2-xl", "gpt-j-6B", "gpt-neo-125M",
                    "Multilingual-MiniLM-L12-H384", "opt-350m", "xlnet-base-cased"]

base_model_api_responses = ["bloom-2b5", "gpt2-xl", "gpt-j-6B", "xlnet-base-cased"]



base_model = "DialoGPT-large"
base_model_response_path = base_path+ "base_models/"    
if base_model in base_model_api_responses:
  base_model_response_path +=  base_model + "_api.json"
else:
  base_model_response_path +=  "model_" + base_model + ".json"

with open(base_model_response_path, 'r') as infile:
  base_model_responses = json.load(infile)

iit = 0
for index in range(0, len(base_model_responses)):
  response = base_model_responses[index][0]['generated_text']
  if (not(response in queries)):
    print(response)
    iit +=1
print(iit)