<a href="https://colab.research.google.com/github/Abhijith-Nagarajan/CS_546_Project/blob/feature%2Fquery-classification%2Fllm-based/Query_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Loading the libraries

In [None]:
!pip3 install datasets


In [None]:
import pandas as pd
import numpy as np
import spacy
import re
import torch.nn as nn
import torch.optim as optim

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

In [None]:
from spacy import displacy

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

### Methods to identify the nature of a query: alternatives to relying on keywords alone


1. Dependency Parsing
2. Number of entities and types of entities
3. Presence of hypothetical knowledge or actionable terms
4. Presence of subordinate clauses
5. Coreference and Anaphora
6. Relation to prior knowledge/assumption
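Several of these signals can be read directly off spaCy token attributes. The sketch below is a hypothetical helper (`query_signals`, `make_token`, and both cue sets are assumptions, not part of the notebook) that counts modal verbs, clause-level dependency labels, hypothetical cue words, and entity starts. It accepts any objects exposing `tag_`, `dep_`, `lower_`, and `ent_iob_`, so hand-labelled stand-in tokens are used here to keep it runnable without a model.

```python
from types import SimpleNamespace

# Dependency labels that often mark subordinate or embedded clauses in spaCy's
# English models. Assumed to be a reasonable first pass, not an exhaustive set.
CLAUSE_DEPS = {"advcl", "ccomp", "xcomp", "acl", "relcl", "csubj"}

# Illustrative cue words for hypothetical phrasing (an assumption).
HYPOTHETICAL_CUES = {"could", "might", "would", "may", "if", "potential"}


def query_signals(tokens):
    """Count simple complexity cues over spaCy-like tokens."""
    tokens = list(tokens)
    return {
        "n_tokens": len(tokens),
        "n_modals": sum(t.tag_ == "MD" for t in tokens),
        "n_clauses": sum(t.dep_ in CLAUSE_DEPS for t in tokens),
        "n_hypothetical": sum(t.lower_ in HYPOTHETICAL_CUES for t in tokens),
        "n_entities": sum(t.ent_iob_ == "B" for t in tokens),
    }


def make_token(text, tag, dep, iob="O"):
    """Build a stand-in token; with spaCy, pass nlp(text) to query_signals instead."""
    return SimpleNamespace(lower_=text.lower(), tag_=tag, dep_=dep, ent_iob_=iob)


# Hand-labelled stand-in for "Could diet alter BRCA1 expression?"
example = [
    make_token("Could", "MD", "aux"),
    make_token("diet", "NN", "nsubj"),
    make_token("alter", "VB", "ROOT"),
    make_token("BRCA1", "NNP", "compound", iob="B"),
    make_token("expression", "NN", "dobj"),
    make_token("?", ".", "punct"),
]
signals = query_signals(example)
```

With a loaded pipeline, `query_signals(nlp(question))` would produce the same dictionary from real tags and labels, which could then feed a rule-based or learned classifier.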




<h5> Starting with dependency parsing using spaCy </h5><br>
<p> Sample questions to try: <br>
    <ol>
        <li> Which are the mammalian orthologs of Drosophila Yki? </li>
        <li> Is the BAGEL algorithm used for arrayed CRISPR screens?</li>
        <li> Which is the subcellular localization of ERAP2?</li>
        <li> Which histone modifications have been associated to alternative splicing?</li>
        <li> What role might diet play in influencing gene expression associated with metabolic diseases? </li>
        <li> How could prolonged antibiotic use impact gut microbiota diversity, and what are the potential health consequences?</li>
    </ol>
</p>

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
sentence = 'How could prolonged antibiotic use impact gut microbiota diversity, and what are the potential health consequences?'

In [None]:
sentence2 = 'Which is the subcellular localization of ERAP2?'

In [None]:
doc = nlp('How could mutations in BRCA1 influence cancer treatment outcomes?')

In [None]:
doc2 = nlp(sentence2)

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})

In [None]:
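def get_depth(token):
    """Return the height of the dependency subtree rooted at `token` (a leaf counts as 1)."""
    children = list(token.children)
    if not children:  # No children implies a leaf node
        return 1
    return 1 + max(get_depth(child) for child in children)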

In [None]:
for token in doc:
    if token.dep_ == "ROOT":
        print(f'Root node: {token}')
        children = list(token.children)
        print(f'Total Children: {len(children)}')
        print(f'Total Depth: {get_depth(token)}')
        break

Root node: mutations
Total Children: 4
Total Depth: 6
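The root, child count, and depth printed above can be collected into a single feature dictionary. The following is a minimal sketch: `tree_features` and the `Node` stand-in are hypothetical names, and the function only assumes that each node exposes a `children` iterable, as spaCy tokens do. It walks the tree level by level, which avoids Python's recursion limit on very deep parses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a spaCy token: only `text` and `children` are modelled."""
    text: str
    children: list = field(default_factory=list)


def tree_features(root):
    """Return depth, direct child count, and widest level of a dependency tree."""
    depth, max_width = 0, 0
    level = [root]
    while level:
        depth += 1
        max_width = max(max_width, len(level))
        # Move one level down by collecting every child of the current level
        level = [child for node in level for child in node.children]
    return {
        "depth": depth,
        "n_root_children": len(list(root.children)),
        "max_width": max_width,
    }


# Toy tree: root -> (a -> (c), b)
toy = Node("root", [Node("a", [Node("c")]), Node("b")])
features = tree_features(toy)
```

The `depth` value follows the same convention as `get_depth`, where a single leaf has depth 1.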


In [None]:
G = nx.DiGraph()

# Add nodes and edges based on dependency structure
for token in doc:
    G.add_node(token.text)
    if token.dep_ != "ROOT":  # If not root, add edge from head to child
        G.add_edge(token.head.text, token.text)
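Because the graph above uses `token.text` as the node key, two tokens with the same surface form would be merged into one node. One way around this, sketched below under the assumption that each token exposes `i`, `text`, `dep_`, and `head` as spaCy tokens do, is to key nodes by token index and keep the text as a label. `dependency_edges` is a hypothetical helper, and the tokens are hand-built stand-ins.

```python
from types import SimpleNamespace


def dependency_edges(tokens):
    """Return (labels, edges) keyed by token index instead of token text."""
    labels, edges = {}, []
    for token in tokens:
        labels[token.i] = token.text
        if token.dep_ != "ROOT":  # the root has no incoming edge
            edges.append((token.head.i, token.i))
    return labels, edges


# Stand-in tokens for "the gene and the protein", which repeats the word "the"
root = SimpleNamespace(i=1, text="gene", dep_="ROOT")
root.head = root  # spaCy's root token is its own head
the1 = SimpleNamespace(i=0, text="the", dep_="det", head=root)
and_ = SimpleNamespace(i=2, text="and", dep_="cc", head=root)
protein = SimpleNamespace(i=4, text="protein", dep_="conj", head=root)
the2 = SimpleNamespace(i=3, text="the", dep_="det", head=protein)

labels, edges = dependency_edges([the1, root, and_, the2, protein])
```

With networkx, `G.add_edges_from(edges)` followed by `nx.draw(G, pos, labels=labels, with_labels=True)` would then show both occurrences of "the" as separate nodes.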

In [None]:
# Give every node a "layer" attribute: its distance from the root of the parse
root_text = next(token.text for token in doc if token.dep_ == "ROOT")
depths = nx.shortest_path_length(G, source=root_text)
nx.set_node_attributes(G, depths, "layer")

# multipartite_layout expects the name of a node attribute, not a callable
pos = nx.multipartite_layout(G, subset_key="layer", align="horizontal")
pos = {node: (x, -y) for node, (x, y) in pos.items()}  # flip so the root sits at the top

# Plot the graph
plt.figure(figsize=(10, 6))
nx.draw(G, pos, with_labels=True, node_size=2000, node_color="lightblue", font_size=10, font_weight="bold", arrows=True)
plt.title("Dependency Tree (Vertical Layout)")
plt.show()

In [None]:
for token in doc:
    print(f'Processing token: {token}. Head of token: {token.head}')
    #print(token.text, token.dep_, token.head.text, token.head.pos_,"\n")

Processing token: How. Head of token: mutations
Processing token: could. Head of token: mutations
Processing token: mutations. Head of token: mutations
Processing token: in. Head of token: mutations
Processing token: BRCA1. Head of token: outcomes
Processing token: influence. Head of token: cancer
Processing token: cancer. Head of token: treatment
Processing token: treatment. Head of token: outcomes
Processing token: outcomes. Head of token: in
Processing token: ?. Head of token: mutations


### Manipulating the dataset

In [None]:
bioasq_dataset = pd.read_json('training12b_new.json')

In [None]:
def get_required_fields(item: dict) -> list:
    """Extract the question text and its type; missing fields become None."""
    try:
        return [item['body'], item['type']]
    except (KeyError, TypeError):
        # Keep a two-element row so the DataFrame columns still line up
        return [None, None]

In [None]:
df = bioasq_dataset.questions.map(get_required_fields)

In [None]:
df = pd.DataFrame(df.tolist(), columns=['question', 'question_type'])

In [None]:
for q_type in df.question_type.unique():
    print(f'Processing {q_type}')
    q = df[df.question_type==q_type].sample(5)
    print(q,"\n")

Processing summary
                                               question question_type
1284  What is known about food intolerance and gluten ?       summary
4510          What links developmental pathways to ALS?       summary
571   Which histone modifications have been associat...       summary
5044                              What is telegenetics?       summary
1064                            What is Sotos syndrome?       summary 

Processing list
                                               question question_type
1155        Which receptors are targeted by suvorexant?          list
615   Which are the mammalian orthologs of Drosophil...          list
3905  What is the active ingredient in the most comm...          list
323         What is the treatment of acute myocarditis?          list
4901                  What are the hallmarks of cancer?          list 

Processing yesno
                                               question question_type
3552  Is the BAGEL algorithm used 

### Building the model

In [None]:
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')


In [None]:
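# One output per question type (the classification head defaults to 2 labels otherwise)
model = AutoModelForSequenceClassification.from_pretrained('allenai/scibert_scivocab_uncased', num_labels=df.question_type.nunique())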
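A sequence-classification head needs one output per question type, plus a stable mapping between type names and integer ids. The sketch below builds that mapping in plain Python. `build_label_maps` is a hypothetical helper, and the example list only reuses the three types visible in the samples above (the full dataset may contain more).

```python
def build_label_maps(question_types):
    """Map each distinct question type to an integer id, and back."""
    # Sorting makes the ids reproducible across runs
    names = sorted(set(question_types))
    label2id = {name: idx for idx, name in enumerate(names)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


# Types seen in the sampled rows; in the notebook this would be df.question_type
sample_types = ["summary", "list", "yesno", "list", "summary"]
label2id, id2label = build_label_maps(sample_types)
encoded = [label2id[t] for t in sample_types]
```

These dictionaries can be passed to `from_pretrained` as `num_labels=len(label2id)`, `id2label=id2label`, and `label2id=label2id` so that the model config records the label names alongside the ids.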

### Testing

In [None]:
from datasets import load_dataset

In [None]:
for item in bioasq_dataset.questions[:3]:
    question = item['body']
    question_type = item['type']
    print(f'Question: {question}')
    print(f'Question Type: {question_type}\n')

Question: Is Hirschsprung disease a mendelian or a multifactorial disorder?
Question Type: summary

Question: List signaling molecules (ligands) that interact with the receptor EGFR?
Question Type: list

Question: Is the protein Papilin secreted?
Question Type: yesno
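Once the classifier is fine-tuned, each question's predicted type would come from the highest-scoring logit. The sketch below shows that step in plain Python so it runs without a model. `softmax`, `predict_type`, the logit values, and the `id2label` mapping are all illustrative assumptions rather than outputs of this notebook.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_type(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for one question; not produced by a real model
id2label = {0: "list", 1: "summary", 2: "yesno"}
label, confidence = predict_type([0.2, 0.1, 2.3], id2label)
```

Comparing `label` against the gold `question_type` for each item gives a simple accuracy check on held-out questions.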

