# IS20140 Final Project - Overview

## The Idea:

To create a version of Mat2Vec trained on data from the social sciences as opposed to data from the material sciences.

## The major issue:

Transformer models have poor Theory of Mind, thus they struggle in the social sciences (citation needed)

## The Alternative:

Use fine-tuned models to gain a broad understanding of large subsets of articles, and make it easier to refine a selection.

## The Implementation: 

NB Need to grab the citation for T-5

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Install all required files

In [None]:
#Import modules required to run 8-bit T-5 network
!pip install --quiet bitsandbytes
!pip install --quiet git+https://github.com/huggingface/transformers.git # Install latest version of transformers
!pip install --quiet accelerate
!pip install --quiet sentencepiece
!pip install --quiet tokenizers
!pip install --quiet datasets
!pip install --quiet evaluate
!pip install --quiet torch
!pip install --quiet numpy
!pip install pandas==1.3.4 #Force version 1.3.4 as read_excel fails otherwise
!pip install --quiet sentence-transformers
!pip install --quiet sklearn
!pip uninstall xlrd -y
!pip install xlrd==1.2.0

^C
[31mERROR: Operation cancelled by user[0m


## Preprocess actual data

In [3]:
#Import and preprocess dataset
import pandas as pd
#dataset = pd.read_excel("/content/drive/MyDrive/College/IS20140/Final Project (1)/savedrecs-tagSociology_HighlyCited-date26092022-1.xlsx")
dataset = pd.read_excel("/content/drive/MyDrive/College/GEOG30370/savedrecs-for_Geography_project.xls")
dataset = dataset.dropna(axis=0, subset=["Abstract"]) #Drop rows with no known abstract
#print(dataset.head())
print(dataset)

    Publication Type                                            Authors  \
0                  J  van Puijenbroek, PJTM; Buijse, AD; Kraak, MHS;...   
1                  J                Rossetti, G; Viaroli, P; Ferrari, I   
2                  J  Fernandes, MR; Aguiar, FC; Martins, MJ; Rivaes...   
3                  J  Sheng, Q; Xu, W; Chen, L; Wang, L; Wang, YD; L...   
4                  C  Fan, KW; Fok, L; Ma, XB; Yeh, P; Cheng, DS; Le...   
..               ...                                                ...   
907                J                 Robb, DM; Pieters, R; Lawrence, GA   
908                J  Fu, CH; Xu, Y; Bundy, A; Gruss, A; Coll, M; He...   
909                J  Matson, L; Ng, GHC; Dockry, M; Nyblade, M; Kin...   
910                J  Schmeller, D; Bohm, M; Arvanitidis, C; Barber-...   
912                J  Kane, A; Monadjem, A; Aschenborn, HKO; Bildste...   

    Book Authors Book Editors Book Group Authors  \
0            NaN          NaN                Na

References: 

    @inproceedings{specter2020cohan,
      title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
      author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
    booktitle={ACL},
    year={2020}
    }

### Use models as they are:

In [4]:
#Guess topics using T5
# As it stands, we MIGHT be able to run a fixed question against each abstract and use that to generate a summary?
# It all feels a bit on the tenuous side. Heck, it all feels VERY on the tenuous side.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use a variant of T5 that has already been fine-tuned on question-answering tasks

tokenizer = AutoTokenizer.from_pretrained("MaRiOrOsSi/t5-base-finetuned-question-answering")
model = AutoModelForSeq2SeqLM.from_pretrained("MaRiOrOsSi/t5-base-finetuned-question-answering")

def get_topics(input, model=model, tokenizer=tokenizer, max_output_length=50):
    question = "What are the main themes in this article?"
    context = input
    context = str(context)
    fully_formed_input = "question: "+question+" context: "+context
    fully_formed_input = str(fully_formed_input)
    input_ids = tokenizer(fully_formed_input, return_tensors="pt").input_ids
    input_ids = input_ids.to('cpu')
    outputs = model.generate(input_ids, max_new_tokens=max_output_length)
    output_string = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output_string

all_abstracts = dataset["Abstract"].to_numpy()
topics_list = list()
model = model
tokenizer = tokenizer
max_output_length = 50

for abstract in all_abstracts:
    topics = get_topics(abstract, model, tokenizer, max_output_length)
    topics_list.append(topics)

dataset["Topics"] = topics_list
print(dataset)

Downloading:   0%|          | 0.00/1.99k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (543 > 512). Running this sequence through the model will result in indexing errors


    Publication Type                                            Authors  \
0                  J  van Puijenbroek, PJTM; Buijse, AD; Kraak, MHS;...   
1                  J                Rossetti, G; Viaroli, P; Ferrari, I   
2                  J  Fernandes, MR; Aguiar, FC; Martins, MJ; Rivaes...   
3                  J  Sheng, Q; Xu, W; Chen, L; Wang, L; Wang, YD; L...   
4                  C  Fan, KW; Fok, L; Ma, XB; Yeh, P; Cheng, DS; Le...   
..               ...                                                ...   
907                J                 Robb, DM; Pieters, R; Lawrence, GA   
908                J  Fu, CH; Xu, Y; Bundy, A; Gruss, A; Coll, M; He...   
909                J  Matson, L; Ng, GHC; Dockry, M; Nyblade, M; Kin...   
910                J  Schmeller, D; Bohm, M; Arvanitidis, C; Barber-...   
912                J  Kane, A; Monadjem, A; Aschenborn, HKO; Bildste...   

    Book Authors Book Editors Book Group Authors  \
0            NaN          NaN                Na

In [5]:
dataset

Unnamed: 0,Publication Type,Authors,Book Authors,Book Editors,Book Group Authors,Author Full Names,Book Author Full Names,Group Authors,Article Title,Source Title,...,Research Areas,IDS Number,Pubmed Id,Open Access Designations,Highly Cited Status,Hot Paper Status,Date of Export,UT (Unique WOS ID),Web of Science Record,Topics
0,J,"van Puijenbroek, PJTM; Buijse, AD; Kraak, MHS;...",,,,"van Puijenbroek, Peter J. T. M.; Buijse, Antho...",,,Species and river specific effects of river fr...,RIVER RESEARCH AND APPLICATIONS,...,Environmental Sciences & Ecology; Water Resources,HH6OJ,,"hybrid, Green Published",,,2022-11-02,WOS:000455850600007,0,Increasing diversity and diversification
1,J,"Rossetti, G; Viaroli, P; Ferrari, I",,,,"Rossetti, Giampaolo; Viaroli, Pierluigi; Ferra...",,,ROLE OF ABIOTIC AND BIOTIC FACTORS IN STRUCTUR...,RIVER RESEARCH AND APPLICATIONS,...,Environmental Sciences & Ecology; Water Resources,497HJ,,,,,2022-11-02,WOS:000270049100002,0,zooplankton diversity and abundance
2,J,"Fernandes, MR; Aguiar, FC; Martins, MJ; Rivaes...",,,,"Fernandes, Maria Rosario; Aguiar, Francisca C....",,,Long-term human-generated alterations of Tagus...,CATENA,...,Geology; Agriculture; Water Resources,KS7LP,,,,,2022-11-02,WOS:000518488500042,0,Human intervention and natural disturbances in...
3,J,"Sheng, Q; Xu, W; Chen, L; Wang, L; Wang, YD; L...",,,,"Sheng, Qiang; Xu, Wang; Chen, Long; Wang, Lei;...",,,Effect of Urban River Morphology on the Struct...,SUSTAINABILITY,...,Science & Technology - Other Topics; Environme...,4B0HI,,gold,,,2022-11-02,WOS:000845469900001,0,Conservation of natural rivers and floods.
4,C,"Fan, KW; Fok, L; Ma, XB; Yeh, P; Cheng, DS; Le...",,"Shang, HQ",,"Fan, K. W.; Fok, L.; Ma, X. B.; Yeh, P.; Cheng...",,,Field Investigation on Biodiversity in the Eas...,PROCEEDINGS OF THE 2ND INTERNATIONAL YELLOW RI...,...,Environmental Sciences & Ecology; Water Resources,BIL99,,,,,2022-11-02,WOS:000260656600012,0,"Water, geological, and environmental factors"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
907,J,"Robb, DM; Pieters, R; Lawrence, GA",,,,"Robb, Daniel M.; Pieters, Roger; Lawrence, Gre...",,,Fate of turbid glacial inflows in a hydroelect...,ENVIRONMENTAL FLUID MECHANICS,...,Environmental Sciences & Ecology; Mechanics; M...,XN0XH,34966250.0,"Green Published, hybrid",,,2022-11-02,WOS:000699006000001,0,"Temperature, conductivity, and turbidity"
908,J,"Fu, CH; Xu, Y; Bundy, A; Gruss, A; Coll, M; He...",,,,"Fu, Caihong; Xu, Yi; Bundy, Alida; Gruss, Arna...",,,Making ecological indicators management ready:...,ECOLOGICAL INDICATORS,...,Biodiversity & Conservation; Environmental Sci...,JE3DM,,"Green Published, Green Accepted, hybrid, Green...",,,2022-11-02,WOS:000490574200003,0,Conservation and sustainability
909,J,"Matson, L; Ng, GHC; Dockry, M; Nyblade, M; Kin...",,,,"Matson, Laura; Ng, G-H Crystal; Dockry, Michae...",,,Transforming research and relationships throug...,ENVIRONMENTAL SCIENCE & POLICY,...,Environmental Sciences & Ecology,PR5US,,hybrid,,,2022-11-02,WOS:000607302000013,0,"Respect, respect, and value in life"
910,J,"Schmeller, D; Bohm, M; Arvanitidis, C; Barber-...",,,,"Schmeller, Dirk S.; Bohm, Monika; Arvanitidis,...",,,Building capacity in biodiversity monitoring a...,BIODIVERSITY AND CONSERVATION,...,Biodiversity & Conservation; Environmental Sci...,FJ0YJ,,Green Submitted,,,2022-11-02,WOS:000412437200001,0,"Research, policy and practice to tackle biodiv..."


In [6]:
# Summarize the abstract using T5
# The main themes/topics bit is kinda already doing that. And to be honest that's all we really need.

# As it stands, we MIGHT be able to run a fixed question against each abstract and use that to generate a summary?
# It all feels a bit on the tenuous side. Heck, it all feels VERY on the tenuous side.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use a variant of T5 that has already been fine-tuned on question-answering tasks

tokenizer = AutoTokenizer.from_pretrained("MaRiOrOsSi/t5-base-finetuned-question-answering")
model = AutoModelForSeq2SeqLM.from_pretrained("MaRiOrOsSi/t5-base-finetuned-question-answering")

def get_topics(input, model=model, tokenizer=tokenizer, max_output_length=50):
    question = "What are the main themes in this article?"
    context = input
    fully_formed_input = "question: "+question+" context: "+context
    fully_formed_input = str(fully_formed_input)
    input_ids = tokenizer(fully_formed_input, return_tensors="pt").input_ids
    input_ids = input_ids.to('cpu')
    outputs = model.generate(input_ids, max_new_tokens=max_output_length)
    output_string = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output_string

all_abstracts = dataset["Abstract"].to_numpy()
topics_list = list()
model = model
tokenizer = tokenizer
max_output_length = 50

for abstract in all_abstracts:
    topics = get_topics(abstract, model, tokenizer, max_output_length)
    topics_list.append(topics)

dataset["Topics"] = topics_list
print(dataset)

Token indices sequence length is longer than the specified maximum sequence length for this model (543 > 512). Running this sequence through the model will result in indexing errors


    Publication Type                                            Authors  \
0                  J  van Puijenbroek, PJTM; Buijse, AD; Kraak, MHS;...   
1                  J                Rossetti, G; Viaroli, P; Ferrari, I   
2                  J  Fernandes, MR; Aguiar, FC; Martins, MJ; Rivaes...   
3                  J  Sheng, Q; Xu, W; Chen, L; Wang, L; Wang, YD; L...   
4                  C  Fan, KW; Fok, L; Ma, XB; Yeh, P; Cheng, DS; Le...   
..               ...                                                ...   
907                J                 Robb, DM; Pieters, R; Lawrence, GA   
908                J  Fu, CH; Xu, Y; Bundy, A; Gruss, A; Coll, M; He...   
909                J  Matson, L; Ng, GHC; Dockry, M; Nyblade, M; Kin...   
910                J  Schmeller, D; Bohm, M; Arvanitidis, C; Barber-...   
912                J  Kane, A; Monadjem, A; Aschenborn, HKO; Bildste...   

    Book Authors Book Editors Book Group Authors  \
0            NaN          NaN                Na

In [7]:
#If we can make it output the relevant citation we are off. To. The. Races.

In [8]:
torch.cuda.empty_cache()

## Scratchpad - Semantic Search with FAISS

https://huggingface.co/course/chapter5/6?fw=pt

I have no idea if this will work, and to be honest I'm scared to find out.

In [10]:
#Dataset's already there and ready to go, so we just need to run it through the model and we're hopefully golden.

from transformers import AutoTokenizer, AutoModel
import torch

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

device = torch.device("cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0): MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_features

In [13]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

    # Original code from Huggingface course:
    embedding = get_embeddings(comments_dataset["text"][0]) #Get the embeddings for the text provided.
    embedding.shape

In [14]:
embedding = get_embeddings(dataset["Abstract"][0])
embedding.shape
#torch.Size([1, 768])

torch.Size([1, 768])

In [15]:
#Map embeddings to text
embeddings_dataset = dataset.map(
    lambda x: {"embeddings": get_embeddings(x["Abstract"]).detach().cpu().numpy()[0]}
)
embeddings_dataset.add_faiss_index(column="embeddings")

AttributeError: ignored

In [None]:
question = "What caused the appearance of intersex fish?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

#torch.Size([1, 768])

In [None]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

------

#### Scratchpad - debugging training issues

    ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,encoder_last_hidden_state. For reference, the inputs it received are input_ids,attention_mask,decoder_input_ids,decoder_attention_mask.

If I had to guess, this is an issue with the compute metrics.

    metric = evaluate.load("accuracy") #Set metric to evaluate model performance

And

    return metric.compute(predictions=predictions, references=labels)



Compute Metrics are defined in the ```compute_metrics``` section of the ```Trainer``` class, as shown below:

    trainer = Trainer(
        model=model_8bit,
        args=training_args,
        train_dataset=tokenized_split_dataset["train"],
        eval_dataset=tokenized_split_dataset["test"],
        compute_metrics=compute_metrics,
    )

And the ```compute_metrics``` function is defined in 

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)



In [None]:
# Scratchpad - Use Trainer without any custom compute metrics.
  # I remember this came from an issue with the inputs...

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import TrainingArguments, Trainer
import numpy as np
import accelerate
import evaluate
import torch

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

selected_model = "t5-3b-sharded" #@param ["t5-11b-sharded", "t5-3b-sharded"]
model_id=f"ybelkada/{model_name}"

model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

training_args = TrainingArguments(
    output_dir="test_trainer",  #Set output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=True,
    optim="adafactor",
    label_names=["input_ids", "decoder_input_ids"],
    include_inputs_for_metrics=True,
    )
metric = evaluate.load("accuracy") #Set metric to evaluate model performance
  #NB That metric above is the loss function. MSE is mostly used for regression tasks.

trainer = Trainer(
    model=model_8bit,
    args=training_args,
    train_dataset=tokenized_split_dataset["train"],
    eval_dataset=tokenized_split_dataset["test"],
    compute_metrics=compute_metrics,
)

trainer.train() #NB Looks like we just need to figure out how to compute the loss.
  #It may be getting confused by the attention masks, since
#trainer.save_model("test_trainer/saved_model")
#trainer.save_pretrained("test_trainer/saved_model")

# Run saved Model

In [None]:
#Use model
#We know this will break, that's okay.
new_model_8bit = "test_trainer/saved_model"
trained_tokenizer = AutoTokenizer.from_pretrained(new_model_8bit)
trained_model_8bit = AutoModelForSeq2SeqLM.from_pretrained(new_model_8bit, device_map="auto", load_in_8bit=True)

#Use the new 8-bit model:
max_new_tokens = 50
input_string = "What causes the relationship between metrics, markets and affect in the contemporary UK academy?" #Need to fill this with a relevant string. We know T5 can draw inferences and answer questions based on training data.

input_ids = trained_tokenizer(
    input_string, return_tensors="pt"
).input_ids  #Set this to an input parameter. Also need to make sure we're setting lengths appropriately.

outputs = trained_model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# **References (get these into the right citation style)**

https://huggingface.co/blog/hf-bitsandbytes-integration

https://discuss.huggingface.co/t/t5-finetuning-tips/684/5

https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil 