## Challenge Description
The purpose of this challenge is to build an AI assistant capable of providing knowledge contained in the Malawi Technical Guidelines for Integrated Disease Surveillance and Response (TGs for IDSR).

You will train an open-source LLM, using a dataset derived from the Malawi TGs for IDSR, to answer context-specific questions about Malawian public health processes, case definitions, and guidelines.

### Dataset
This is a custom dataset of question-and-answer pairs for public health and disease surveillance, tailored to the queries health professionals commonly encounter during surveillance activities. It includes inquiries related to how to use forms, clarification of abbreviations found in data collection forms, application of clinical information, clinical case

The training dataset contains questions and answers, contextualized within the TG booklets. The questions come in various types, including what, why, who, where, and those seeking comparisons between concepts.
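
One rough way to see the mix of question types is to tally the first word of each question. The sketch below does this on a handful of sample questions; the `questions` list is illustrative only, and in this notebook the input would presumably be the `Question Text` column of the training data.

```python
import pandas as pd

# Illustrative sample only; the real input would be the "Question Text" column.
questions = pd.Series([
    "What is Community Based Surveillance (CBS)?",
    "What is indicator based surveillance (IBS)?",
    "What is Case based surveillance?",
    "Why does timely reporting matter?",        # hypothetical question
    "Compare passive and active surveillance.",  # hypothetical question
])

# Use the lower-cased first word as a rough proxy for the question type.
question_type = questions.str.split().str[0].str.lower()
type_counts = question_type.value_counts()
print(type_counts)
```

The first word is only a heuristic: a question such as "In what ways ..." would be counted under "in", so the tally is best read as a quick overview rather than a precise classification.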

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install accelerate -U
!pip install transformers[torch]

Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.27.2


In [4]:
import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize
import seaborn as sns
import matplotlib.pyplot as plt
import re
import os

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

In [5]:
train_df = pd.read_csv('/content/drive/MyDrive/Malawi_health_systems_LLMs/Train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Malawi_health_systems_LLMs/Test.csv')

In [6]:
train_df

Unnamed: 0,ID,Question Text,Question Answer,Reference Document,Paragraph(s) Number,Keywords
0,Q829,Compare the laboratory confirmation methods fo...,Chikungunya is confirmed using serological tes...,TG Booklet 6,"154, 166",Laboratory Confirmation For Chikungunya Vs. Di...
1,Q721,When should specimens be collected for Anthrax...,Specimens should be collected during the vesic...,TG Booklet 6,140,"Anthrax Specimen Collection: Timing, Preparati..."
2,Q464,Which key information should be recorded durin...,"During a register review, key information abou...",TG Booklet 3,439-440,"Register Review, Key Information, Suspected Ca..."
3,Q449,Why is the District log of suspected outbreaks...,The log includes information about response ac...,TG Booklet 3,412,"District Log, Response Activities, Steps Taken..."
4,Q6,What do Community based surveillance strategie...,Community-based surveillance strategies focus ...,TG Booklet 1,86,"Community-based Surveillance Strategies, Ident..."
...,...,...,...,...,...,...
743,Q413,Which section of the guidelines provides a des...,Section 11.0 of these 3rd Edition Malawi IDSR ...,TG Booklet 3,376,"Control Measures Description, Priority Disease..."
744,Q626,"Does MEF stand for an abbreviation in the TG, ...",Medical Teams International,TG Booklet 6,106,Medical Teams International
745,Q1141,In what ways do the verification and documenta...,"In emergency contexts, verification and docume...",TG Booklet 5,105-106,"Verification, Documentation, Early Warning, Em..."
746,Q331,What role does the examination of burial cerem...,Examining burial ceremonies helps identify pot...,TG Booklet 3,287,"Burial Ceremonies Examination, Exposure, Trans..."


In [7]:
test_df

Unnamed: 0,ID,Question Text
0,Q4,"What is the definition of ""unusual event"""
1,Q5,What is Community Based Surveillance (CBS)?
2,Q9,What kind of training should members of VHC re...
3,Q10,What is indicator based surveillance (IBS)?
4,Q13,What is Case based surveillance?
...,...,...
494,Q1229,Where should completeness be evaluated in the ...
495,Q1230,Which dimensions of completeness are crucial i...
496,Q1236,How can the completeness of case reporting be ...
497,Q1239,Where should completeness and timeliness of re...


In [8]:
reference_documents = {}
directory = "/content/drive/MyDrive/Malawi_health_systems_LLMs/MW_TGBookletsExcel"
for file in os.listdir(directory):
    if file.endswith(".xlsx"):
        file_name = os.path.splitext(file)[0]
        file_path = os.path.join(directory, file)
        xl = pd.ExcelFile(file_path)
        text = ""
        for sheet_name in xl.sheet_names:
            df = pd.read_excel(xl, sheet_name=sheet_name)
            if 'TG_IDSR' in df.columns:
                # Skip empty cells and keep a space between sheets so words do not fuse together
                text += " ".join(df['TG_IDSR'].dropna().astype(str)) + " "
            else:
                print(f"Warning: 'TG_IDSR' column not found in '{sheet_name}' of file '{file}'.")
        reference_documents[file_name] = text

print("Reference documents loaded successfully.")

Reference documents loaded successfully.
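
The training rows are later matched to these documents by an exact lookup on the `Reference Document` value, with an empty string as the fallback, so a mismatch between a file name and a value such as `TG Booklet 6` would silently yield an empty context. A minimal sketch of a coverage check follows; `missing_references` is a hypothetical helper, and the two lists are stand-ins for `train_df["Reference Document"].unique()` and `reference_documents.keys()`.

```python
def missing_references(referenced_names, loaded_names):
    """Return referenced document names with no exactly matching loaded key."""
    loaded = set(loaded_names)
    return sorted({name for name in referenced_names if name not in loaded})

# Hypothetical values standing in for the notebook's real data.
referenced = ["TG Booklet 1", "TG Booklet 3", "TG Booklet 6"]
loaded = ["TG Booklet 1", "TG Booklet 3"]

missing = missing_references(referenced, loaded)
print(missing)
```

An empty result would suggest every referenced booklet has a loaded counterpart; any names listed would need their file names or keys reconciled before building the training text.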


In [9]:
reference_documents



In [12]:
train_data = []
for index, row in train_df.iterrows():
    reference_text = reference_documents.get(row["Reference Document"], "")
    concatenated_text = row["Question Text"] + " " + reference_text
    train_data.append({"text": concatenated_text, "target": (row["Question Answer"], row["Paragraph(s) Number"], row["Keywords"])})


In [13]:
train_data[:1]

  'target': ('Chikungunya is confirmed using serological tests and PCR, while diabetes diagnosis involves blood glucose measurements.',
   '154, 166',
   'Laboratory Confirmation For Chikungunya Vs. Diabetes')}]
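
The `Paragraph(s) Number` target appears in several shapes in the preview above: a single number (`140`), a comma-separated list (`154, 166`), and a range (`439-440`). If these values are to be compared or used to select paragraphs, normalizing them first may help. The sketch below assumes only those three shapes occur; `parse_paragraph_numbers` is a hypothetical helper, not part of the original notebook.

```python
def parse_paragraph_numbers(spec):
    """Expand a spec such as '154, 166' or '439-440' into a sorted list of ints."""
    numbers = set()
    for part in str(spec).split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Treat 'a-b' as an inclusive range.
            start, end = (int(piece) for piece in part.split("-", 1))
            numbers.update(range(start, end + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)

print(parse_paragraph_numbers("154, 166"))
print(parse_paragraph_numbers("439-440"))
print(parse_paragraph_numbers(140))
```

Values in any other shape (for example, free text) would raise a `ValueError` here, which makes unexpected formats visible instead of hiding them.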

In [14]:
def clean_text(text):
    """Normalize whitespace, drop unwanted symbols, and collapse repeated punctuation."""
    # Collapse runs of whitespace (spaces, tabs, newlines) into a single space
    cleaned_text = re.sub(r'\s+', ' ', text)

    # Remove every character other than word characters, whitespace, and . , / ? & %
    cleaned_text = re.sub(r'[^\w\s.,/?&%]', '', cleaned_text)

    # Collapse repeated dots, commas, question marks, slashes, and ampersands into one
    cleaned_text = re.sub(r'\.{2,}', '.', cleaned_text)
    cleaned_text = re.sub(r',{2,}', ',', cleaned_text)
    cleaned_text = re.sub(r'\?{2,}', '?', cleaned_text)
    cleaned_text = re.sub(r'/+', '/', cleaned_text)
    cleaned_text = re.sub(r'&+', '&', cleaned_text)

    return cleaned_text.strip()

In [16]:
cleaned_train_data = []

for data in train_data:
    cleaned_text = clean_text(data['text'])
    cleaned_train_data.append({'text': cleaned_text, 'target': data['target']})

# Report only the size: each record may carry the text of a whole booklet,
# so printing the full list can exceed the notebook's output rate limit.
print(f"Cleaned {len(cleaned_train_data)} training examples.")



In [19]:
cleaned_train_data
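
Because each `text` entry concatenates a question with reference text, displaying the whole list can produce a very large output. A truncated preview is usually easier to inspect. The sketch below is one way to do that; `preview` and the `sample` record are hypothetical and stand in for `cleaned_train_data`.

```python
def preview(records, n=2, max_chars=80):
    """Return the first n records with each 'text' field truncated to max_chars."""
    shortened = []
    for record in records[:n]:
        text = record["text"]
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        # Copy the record so the original list is left untouched.
        shortened.append({**record, "text": text})
    return shortened

# Hypothetical record standing in for an entry of cleaned_train_data.
sample = [{"text": "question " + "booklet text " * 50,
           "target": ("answer", "140", "keywords")}]

short = preview(sample, n=1, max_chars=40)
print(short)
```

In the notebook, `preview(cleaned_train_data)` would show the first two records with their text cut to 80 characters while leaving the targets intact.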