<a href="https://colab.research.google.com/github/KagontleBooysen/Final-Lung-Cancer-project/blob/main/chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install necessary libraries
!pip install transformers nltk

# Importing necessary libraries
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf
import re
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
# Load BERT model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForQuestionAnswering.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFBertForQuestionAnswering.

All the weights of TFBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.


In [None]:
context = """
Lung cancer is a type of cancer that begins in the lungs, characterized by uncontrolled cell growth in the lung tissues. It is a significant health challenge in South Africa, with an estimated 8,000 new cases diagnosed annually. Lung cancer is the leading cause of cancer-related deaths among men and one of the top five cancers affecting women. The primary risk factor for lung cancer is smoking, with approximately 20% of the adult population identified as smokers, contributing to its high incidence. Environmental factors, such as exposure to asbestos and industrial pollutants, also play a role. Additionally, South Africa’s high HIV/AIDS prevalence, with about 13% of the adult population living with HIV, exacerbates the lung cancer burden, as immunocompromised individuals are at higher risk. Late-stage diagnosis is common due to limited access to healthcare services and inadequate screening programs, resulting in poorer outcomes. Treatment access is further hindered by the high costs associated with chemotherapy, radiation, and surgical interventions, which are often beyond the reach of many South Africans relying on the overburdened public healthcare system. Public health efforts, including anti-smoking campaigns and initiatives for early detection, are ongoing, but there is a pressing need for more comprehensive and accessible screening programs, along with enhanced support systems for patients and their families.
"""

In [None]:
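# Define preprocess_text function for text cleaning
def preprocess_text(text):
    """Remove everything except ASCII letters, digits, and whitespace, then lowercase.

    No tokenization happens here; callers split the result on whitespace when needed.
    """
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()
    return text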
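
To make the effect of the cleaning step concrete, here is a small standalone sketch that repeats the same regular expression on two sample strings. The sample inputs are illustrative assumptions, not data from this notebook. Note that punctuation is deleted rather than replaced by a space, so "HIV/AIDS" collapses into a single word.

```python
import re

def preprocess_text(text):
    """Remove everything except ASCII letters, digits, and whitespace, then lowercase."""
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()
    return text

# Apostrophes, '#', and '?' are dropped; digits are kept
cleaned_question = preprocess_text("What's the #1 risk factor?")
print(cleaned_question)    # whats the 1 risk factor

# The curly apostrophe and the slash are removed without inserting a space
cleaned_possessive = preprocess_text("South Africa’s HIV/AIDS prevalence")
print(cleaned_possessive)  # south africas hivaids prevalence
```

Because the same function is applied to both the reference and the candidate before scoring, both sides lose punctuation in the same way.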

In [None]:
# Define chat function for answering questions
def chat(question, reference=None):
    try:
        # Preprocess question text
        question = preprocess_text(question)

        # Tokenize the input message and context
        inputs = tokenizer(question, context, return_tensors='tf')

        # Get the model's output
        outputs = model(inputs)

        # Extract the answer start and end logits
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

        # Get the most likely start and end token positions
        start_index = tf.argmax(start_logits, axis=-1).numpy()[0]
        end_index = tf.argmax(end_logits, axis=-1).numpy()[0]

        # argmax always returns an in-range index, so only the span order needs checking
        if start_index <= end_index:
            # Convert token indices back to tokens
            input_ids = inputs['input_ids'].numpy()[0]
            answer_tokens = tokenizer.convert_ids_to_tokens(input_ids[start_index:end_index+1])

            # Clean the answer
            answer = tokenizer.convert_tokens_to_string(answer_tokens)

            # Calculate BLEU score if reference answer is provided
            if reference:
                reference = preprocess_text(reference).split()
                candidate = preprocess_text(answer).split()
                bleu_score = sentence_bleu([reference], candidate)
                print(f"Reference: {' '.join(reference)}")
                print(f"Candidate: {' '.join(candidate)}")
                print(f"BLEU Score: {bleu_score}")
                return answer, bleu_score
            else:
                return answer, None
        else:
            return "I'm sorry, I don't have the information you are looking for.", None
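    except Exception as e:
        # Label the failure so an error message is not mistaken for a model answer
        return f"Error while answering the question: {e}", None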
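
The `chat` function picks the start and end positions with two independent `argmax` calls, so the end can land before the start, and the function then falls back to the apology message. A common alternative is to score every valid pair jointly and keep the best one. The block below is a minimal sketch of that idea, not the notebook's method: `best_span` and `max_answer_len` are hypothetical names, and plain Python lists stand in for `outputs.start_logits[0]` and `outputs.end_logits[0]`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end.

    Spans longer than max_answer_len tokens are ignored. Returns None for empty input.
    """
    best = None
    for start, start_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# Toy logits (made up for illustration): independent argmax gives start=3, end=1
start_logits = [0.1, 0.2, 0.3, 5.0, 0.5]
end_logits = [0.0, 4.0, 0.2, 0.1, 3.0]

span = best_span(start_logits, end_logits)
print(span[:2])  # (3, 4)
```

In a real run the search would also be limited to positions inside the context segment, so that the answer cannot be drawn from the question tokens.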

In [None]:
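# Placeholder function for fine-tuning the model with actual data
def fine_tune_model(train_data, val_data, epochs=2, batch_size=8):
    """
    Placeholder for fine-tuning the model on tokenized SQuAD-style data.

    train_data and val_data are dicts with 'input_ids', 'attention_mask',
    'start_positions', and 'end_positions'. The epochs and batch_size
    defaults are illustrative assumptions, not tuned values.
    """
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

    # Compile without an explicit loss: Hugging Face TF models then use
    # their built-in task loss (here, the question-answering span loss)
    model.compile(optimizer=optimizer)

    # Convert the data to TensorFlow Dataset objects
    train_dataset = tf.data.Dataset.from_tensor_slices((
        {'input_ids': train_data['input_ids'], 'attention_mask': train_data['attention_mask']},
        {'start_positions': train_data['start_positions'], 'end_positions': train_data['end_positions']}
    ))
    val_dataset = tf.data.Dataset.from_tensor_slices((
        {'input_ids': val_data['input_ids'], 'attention_mask': val_data['attention_mask']},
        {'start_positions': val_data['start_positions'], 'end_positions': val_data['end_positions']}
    ))

    # Shuffle and batch the training data, batch the validation data, then train
    train_dataset = train_dataset.shuffle(1000).batch(batch_size)
    val_dataset = val_dataset.batch(batch_size)
    return model.fit(train_dataset, validation_data=val_dataset, epochs=epochs)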

In [None]:
# Example usage
question = "What is a significant health challenge in South Africa?"
reference = "Lung cancer"
answer, bleu_score = chat(question, reference)
print(f"Question: {question}")
print(f"Answer: {answer}")
if bleu_score is not None:
    print(f"BLEU Score: {bleu_score}")

Reference: lung cancer
Candidate: lung cancer
BLEU Score: 1.491668146240062e-154
Question: What is a significant health challenge in South Africa?
Answer: lung cancer
BLEU Score: 1.491668146240062e-154


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
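
The near-zero score for an exact match follows from the warnings above: by default `sentence_bleu` combines 1-gram to 4-gram precisions, and a two-token answer such as "lung cancer" contains no 3-grams or 4-grams. The sketch below re-implements a simplified single-reference BLEU in plain Python to show the effect of the n-gram order. `simple_bleu` is a hypothetical helper written for illustration, not NLTK's implementation; in NLTK itself, the warning's suggestion corresponds to passing `weights=(0.5, 0.5)` or a `smoothing_function` to `sentence_bleu`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # the candidate is too short to contain any n-gram of this order
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def simple_bleu(reference, candidate, max_n):
    """Uniformly weighted BLEU up to max_n, single reference, no smoothing."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # one zero precision zeroes the geometric mean
    # Brevity penalty: 1 when the candidate is longer than the reference
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "lung cancer".split()
candidate = "lung cancer".split()

bleu_4 = simple_bleu(reference, candidate, max_n=4)
bleu_2 = simple_bleu(reference, candidate, max_n=2)
print(bleu_4)  # 0.0
print(bleu_2)  # 1.0
```

For short extractive answers like the ones this chatbot returns, restricting the score to unigrams and bigrams is the simpler of the two options, since it keeps an exact match at 1.0.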
