## Challenge Description
The purpose of this challenge is to build an AI assistant capable of providing knowledge contained in the Malawi Technical Guidelines for Integrated Disease Surveillance and Response (TGs for IDSR).

You will train an open-source LLM to answer context-specific questions about Malawian public health processes, case definitions and guidelines, with training done on a dataset derived from the Malawi TGs for IDSR.

### Dataset 
This custom question-and-answer dataset is tailored to public health and disease surveillance, and addresses the queries health professionals commonly encounter during surveillance activities. It includes inquiries related to how to use forms, clarification on abbreviations found in data collection forms, application of clinical information, clinical case

The training dataset contains questions and answers, contextualized within the TG booklets. The questions come in various types, including what, why, who, where, and those seeking comparisons between concepts.
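
As a rough way to check this mix of question types, the leading word of each question can be tallied. The sketch below is illustrative only: `question_type` is a hypothetical helper, and the sample list holds a few questions from the test set rather than the full data.

```python
from collections import Counter

def question_type(question):
    """Return the lowercased first word of a question, or '' if it is empty."""
    words = question.strip().split()
    return words[0].lower() if words else ""

# Small sample for illustration; in the notebook this would be
# train_df['Question Text'] instead.
sample_questions = [
    "What is Community Based Surveillance (CBS)?",
    "What is indicator based surveillance (IBS)?",
    "What is Case based surveillance?",
]

type_counts = Counter(question_type(q) for q in sample_questions)
print(type_counts)
```

The first word is only a coarse proxy: comparison questions in the training set begin with phrases such as "Compare" or "In what ways", so they would need their own rule.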

In [59]:
import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize
import seaborn as sns
import matplotlib.pyplot as plt
import re
import os

import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
No module named 'transformers.integrations.deepspeed'; 'transformers.integrations' is not a package

In [44]:
train_df = pd.read_csv('Train.csv')
test_df = pd.read_csv('Test.csv')

In [45]:
train_df

Unnamed: 0,ID,Question Text,Question Answer,Reference Document,Paragraph(s) Number,Keywords
0,Q829,Compare the laboratory confirmation methods fo...,Chikungunya is confirmed using serological tes...,TG Booklet 6,"154, 166",Laboratory Confirmation For Chikungunya Vs. Di...
1,Q721,When should specimens be collected for Anthrax...,Specimens should be collected during the vesic...,TG Booklet 6,140,"Anthrax Specimen Collection: Timing, Preparati..."
2,Q464,Which key information should be recorded durin...,"During a register review, key information abou...",TG Booklet 3,439-440,"Register Review, Key Information, Suspected Ca..."
3,Q449,Why is the District log of suspected outbreaks...,The log includes information about response ac...,TG Booklet 3,412,"District Log, Response Activities, Steps Taken..."
4,Q6,What do Community based surveillance strategie...,Community-based surveillance strategies focus ...,TG Booklet 1,86,"Community-based Surveillance Strategies, Ident..."
...,...,...,...,...,...,...
743,Q413,Which section of the guidelines provides a des...,Section 11.0 of these 3rd Edition Malawi IDSR ...,TG Booklet 3,376,"Control Measures Description, Priority Disease..."
744,Q626,"Does MEF stand for an abbreviation in the TG, ...",Medical Teams International,TG Booklet 6,106,Medical Teams International
745,Q1141,In what ways do the verification and documenta...,"In emergency contexts, verification and docume...",TG Booklet 5,105-106,"Verification, Documentation, Early Warning, Em..."
746,Q331,What role does the examination of burial cerem...,Examining burial ceremonies helps identify pot...,TG Booklet 3,287,"Burial Ceremonies Examination, Exposure, Trans..."
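
The `Paragraph(s) Number` column mixes single numbers (`140`), comma-separated lists (`154, 166`) and ranges (`439-440`). A minimal sketch of how such values could be expanded into explicit paragraph numbers follows. `parse_paragraph_numbers` is a hypothetical helper, and it assumes every value follows one of these three shapes; anything else raises a `ValueError`.

```python
def parse_paragraph_numbers(value):
    """Expand a value such as '154, 166' or '439-440' into a list of ints."""
    numbers = []
    for part in str(value).split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Treat 'a-b' as an inclusive range of paragraph numbers
            start, end = (int(p) for p in part.split("-", 1))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers

print(parse_paragraph_numbers("154, 166"))
print(parse_paragraph_numbers("439-440"))
print(parse_paragraph_numbers(140))
```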


In [46]:
test_df

Unnamed: 0,ID,Question Text
0,Q4,"What is the definition of ""unusual event"""
1,Q5,What is Community Based Surveillance (CBS)?
2,Q9,What kind of training should members of VHC re...
3,Q10,What is indicator based surveillance (IBS)?
4,Q13,What is Case based surveillance?
...,...,...
494,Q1229,Where should completeness be evaluated in the ...
495,Q1230,Which dimensions of completeness are crucial i...
496,Q1236,How can the completeness of case reporting be ...
497,Q1239,Where should completeness and timeliness of re...


In [47]:
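def preprocess(text):
    """Lowercase the text and strip punctuation, keeping word characters and whitespace."""
    text = text.lower()
    # Raw string so that \w and \s reach the regex engine unescaped
    text = re.sub(r'[^\w\s]', '', text)
    return text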
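
Because `preprocess` deletes punctuation outright, words joined by a slash or hyphen get merged; the anthrax answer shown further down contains `storagetransport`, presumably from "storage/transport". One possible variant, sketched here as an assumption rather than the notebook's method, replaces punctuation with a space and then collapses repeated whitespace. The name `preprocess_keep_boundaries` is hypothetical.

```python
import re

def preprocess_keep_boundaries(text):
    """Lowercase, turn punctuation into spaces, and collapse runs of whitespace."""
    text = text.lower()
    # Replace each punctuation character with a space instead of removing it
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the extra spaces introduced above
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_keep_boundaries("Storage/transport details vary (B. anthracis)."))
```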

In [48]:
train_df['Question Text'] = train_df['Question Text'].apply(preprocess)
train_df['Question Answer'] = train_df['Question Answer'].apply(preprocess)
test_df['Question Text'] = test_df['Question Text'].apply(preprocess)

In [49]:
reference_documents = {}
directory = "MW_TGBookletsExcel"
for file in os.listdir(directory):
    if file.endswith(".xlsx"):
        file_name = os.path.splitext(file)[0]
        file_path = os.path.join(directory, file)
        xl = pd.ExcelFile(file_path)
        text = ""
        for sheet_name in xl.sheet_names:
            df = pd.read_excel(xl, sheet_name=sheet_name)
            if 'TG_IDSR' in df.columns:
                text += " ".join(df['TG_IDSR'].astype(str))
            else:
                print(f"Warning: 'TG_IDSR' column not found in '{sheet_name}' of file '{file}'.")
        reference_documents[file_name] = text

print("Reference documents loaded successfully.")

Reference documents loaded successfully.


In [50]:
reference_documents



In [51]:
train_data = []
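for index, row in train_df.iterrows():
    # Look up the booklet text by name; fall back to an empty string if it is missing
    reference_text = reference_documents.get(row["Reference Document"], "")
    # Separate the question from the reference text with a space so the two do not run together
    concatenated_text = row["Question Text"] + " " + reference_text
    train_data.append({"text": concatenated_text, "target": (row["Question Answer"], row["Paragraph(s) Number"], row["Keywords"])})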

print(concatenated_text)
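
Note that `reference_documents.get(..., "")` silently returns an empty string whenever a value in `Reference Document` does not match one of the `.xlsx` file names used as keys. A small check like the one below could surface such mismatches before training. `missing_references` is a hypothetical helper, and the dictionary and list are stand-ins for `reference_documents` and `train_df["Reference Document"]`.

```python
def missing_references(reference_names, documents):
    """Return the sorted reference names that have no non-empty text in `documents`."""
    return sorted({name for name in reference_names if not documents.get(name)})

# Hypothetical stand-ins for reference_documents and train_df["Reference Document"]
documents = {"TG Booklet 1": "some text", "TG Booklet 3": "more text"}
names = ["TG Booklet 6", "TG Booklet 3", "TG Booklet 1", "TG Booklet 6"]

print(missing_references(names, documents))
```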



In [52]:
train_data[:10]

  'target': ('chikungunya is confirmed using serological tests and pcr while diabetes diagnosis involves blood glucose measurements',
   '154, 166',
   'Laboratory Confirmation For Chikungunya Vs. Diabetes')},
  'target': ('specimens should be collected during the vesicular stage and caution is needed due to b anthracis high infectivity specimen collection methods and storagetransport details vary for different forms of anthrax',
   '140',
   'Anthrax Specimen Collection: Timing, Preparation, Transport')},
  'target': ('during a register review key information about suspected cases should be recorded including patient details signs and symptoms date of onset outcome and immunization status this information serves as a basis for case investigation activities the recorded data contributes to the linelisting of suspected cases which is crucial for analyzing outbreak patterns identifying risk factors and guiding further investigations and responses',
   '439-440',
   'Register Review, Key 

In [53]:
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

In [56]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_data(data_list):
    # Tokenize all texts in one call so that padding=True pads to a common length
    # and the result is a dict-like BatchEncoding, which CustomDataset indexes by key.
    texts = [data['text'] for data in data_list]
    return tokenizer(texts, padding=True, truncation=True, max_length=512)

train_data_tokenized = tokenize_data(train_data)
validation_data_tokenized = tokenize_data(val_data)

In [57]:
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
        
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)
    
train_dataset = CustomDataset(train_data_tokenized, train_labels)
validation_dataset = CustomDataset(validation_data_tokenized, validation_labels)

NameError: name 'train_labels' is not defined
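
The `NameError` arises because `train_labels` and `validation_labels` are never built. `BertForSequenceClassification` expects one integer class id per example, so the free-text tuple stored under `target` cannot be passed as labels directly. The sketch below shows one way to turn a categorical column into integer ids. Which column to classify is an assumption (the notebook does not say), `build_label_ids` is a hypothetical helper, and the ids would have to be computed for the same rows, in the same order, as the split `train_data` and `val_data`.

```python
def build_label_ids(values):
    """Map each distinct value to an integer id and encode the input list."""
    label2id = {value: idx for idx, value in enumerate(sorted(set(values)))}
    return [label2id[value] for value in values], label2id

# Hypothetical class values; the choice of target column is an assumption
example_values = ["TG Booklet 6", "TG Booklet 6", "TG Booklet 3", "TG Booklet 1"]
example_labels, label2id = build_label_ids(example_values)
num_labels = len(label2id)

print(example_labels, num_labels)
```

A mapping of this kind would also supply the `num_labels` value that the model-loading cell needs.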

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset
)

trainer.train()