# Install libraries

In [1]:
pip install transformers torch beautifulsoup4 requests nltk lxml sentencepiece


# Setup for scraped data

**Explanation of the code**
- Libraries: requests fetches web pages and BeautifulSoup parses the HTML.
- Function fetch_detailed_content: Takes a list of URLs, fetches each page's HTML, and extracts structured data from specific parts of it.
- HTTP GET Request: One request is made per URL. The HTML is parsed only if the response status code is 200.
- Content Parsing: Within the first div whose class is 'container mt-rem48px', the script collects h3, h4, p, and ul elements whose data-identity attribute is 'headline', 'paragraph-element', or 'unordered-list', and groups the text by section.
- Error Handling: For any non-200 response, the script prints the URL and the status code and moves on to the next URL. Network exceptions are not caught.
- Data Organization: Each URL maps to a dictionary with section headings as keys and lists of the related text as values.
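
The grouping of text under headings described above can be sketched without any network access. This is a minimal illustration, assuming the page elements have already been extracted as `(tag, text)` pairs; the helper name `group_by_heading` and the placeholder texts are hypothetical and are not part of the notebook.

```python
def group_by_heading(elements):
    """Group paragraph and list-item text under the most recent h3/h4 heading."""
    sections = {}
    current_headline = None
    for tag, text in elements:
        if tag in ("h3", "h4"):
            # A heading starts a new section with an empty list of texts
            current_headline = text
            sections[current_headline] = []
        elif tag in ("p", "li") and current_headline is not None:
            # Body text is attached to the section that is currently open
            sections[current_headline].append(text)
    return sections

# Hypothetical sample input; the texts are placeholders, not scraped data
sample = [
    ("p", "Text before any heading is skipped."),
    ("h3", "What is AA amyloidosis?"),
    ("p", "First paragraph."),
    ("p", "Second paragraph."),
    ("h4", "Example heading"),
    ("li", "First list item."),
]

sections = group_by_heading(sample)
for headline, texts in sections.items():
    print(headline, texts)
```

Text that appears before the first heading has no section to belong to, so it is dropped, which mirrors the `current_headline` check in the notebook's function.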

In [27]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Define a function to fetch detailed content from a list of URLs
def fetch_detailed_content(urls):
    # Initialize a dictionary to store the content from each URL
    content = {}
    # Loop through each URL in the list
    for url in urls:
        # Make an HTTP GET request to the URL (the timeout keeps a stalled request from hanging the loop)
        response = requests.get(url, timeout=10)
        # Initialize a dictionary to store the detailed content for the current URL
        detailed_content = {}

        # Check if the response status is successful (HTTP 200 OK)
        if response.status_code == 200:
            # Parse the HTML content of the page using BeautifulSoup
            soup = BeautifulSoup(response.text, 'lxml')  # requires the lxml package; the built-in 'html.parser' also works
            # Find all div elements with a specific class that contains the main content
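            overview_sections = soup.find_all('div', class_='container mt-rem48px')
            # Skip this URL if the expected container is missing, to avoid an IndexError below
            if not overview_sections:
                print(f"No main content container found at {url}")
                continue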

            # Initialize variable to keep track of the current section being processed
            current_headline = None
            # Iterate over elements within the main content section that match specified tags and attributes
            for element in overview_sections[0].find_all(['h4', 'h3', 'p', 'ul'], attrs={'data-identity': ['headline', 'paragraph-element', 'unordered-list']}, recursive=True):
                if element.name == 'h3' or element.name == 'h4':
                    # Update the current section headline if an h3 or h4 tag is found
                    current_headline = element.get_text(strip=True)
                    detailed_content[current_headline] = []
                elif element.name == 'p' and current_headline:
                    # Add paragraph text to the current section if a paragraph tag is found
                    detailed_content[current_headline].append(element.get_text(strip=True))
                elif element.name == 'ul' and current_headline:
                    # Add all list items text to the current section if an unordered list tag is found
                    list_items = [li.get_text(strip=True) for li in element.find_all('li')]
                    detailed_content[current_headline].extend(list_items)

            # Store the detailed content dictionary under its corresponding URL in the content dictionary
            content[url] = detailed_content
        else:
            # Print an error message if the HTTP request failed
            print(f"Failed to retrieve content from {url}, status code: {response.status_code}")

    # Return the dictionary containing all the content
    return content

# List of URLs to fetch content from
urls = [
    'https://my.clevelandclinic.org/health/diseases/17854-amyloidosis-aa',
    'https://my.clevelandclinic.org/health/diagnostics/9731-a1c',
    'https://my.clevelandclinic.org/health/body/23181-adenoids',
    'https://my.clevelandclinic.org/health/body/21901-epidermis',
    'https://my.clevelandclinic.org/health/body/24836-blood',
    'https://my.clevelandclinic.org/health/body/23173-cartilage',
    'https://my.clevelandclinic.org/health/body/21672-facial-muscles',
    'https://my.clevelandclinic.org/health/diseases/24742-decidual-cast',
    'https://my.clevelandclinic.org/health/diseases/17883-q-fever',
    'https://my.clevelandclinic.org/health/diseases/15662-non-hodgkin-lymphoma',
    'https://my.clevelandclinic.org/health/diseases/24814-macrocytosis',
    'https://my.clevelandclinic.org/health/diseases/knee-sprain',
    'https://my.clevelandclinic.org/health/treatments/22667-dermal-fillers',
    'https://my.clevelandclinic.org/health/treatments/7015-parathyroid-surgery',
    'https://my.clevelandclinic.org/health/treatments/24889-yoga-therapy',
    'https://my.clevelandclinic.org/health/treatments/16288-radical-nephrectomy',
    'https://my.clevelandclinic.org/health/treatments/21879-salpingectomy',
    'https://my.clevelandclinic.org/health/procedures/covid-vaccine',
    'https://my.clevelandclinic.org/health/procedures/13794-lung-volume-reduction-surgery-lvrs',
    'https://my.clevelandclinic.org/health/procedures/prostate-brachytherapy',
    'https://my.clevelandclinic.org/health/procedures/voice-feminization-surgery',
    'https://my.clevelandclinic.org/health/procedures/autologous-stem-cell-transplant',
    'https://my.clevelandclinic.org/health/drugs/21715-antifungals',
    'https://my.clevelandclinic.org/health/drugs/23683-phenol-throat-spray',
    'https://my.clevelandclinic.org/health/drugs/20429-quinapril-tablets',
    'https://my.clevelandclinic.org/health/drugs/20933-dexmethylphenidate-tablets',
    'https://my.clevelandclinic.org/health/drugs/18822-liothyronine-tablets',
    'https://my.clevelandclinic.org/health/diagnostics/23019-antibody-test',
    'https://my.clevelandclinic.org/health/diagnostics/17961-barium-enema',
    'https://my.clevelandclinic.org/health/diagnostics/22138-c3-complement-blood-test',
    'https://my.clevelandclinic.org/health/diagnostics/4849-hiv-testing',
    'https://my.clevelandclinic.org/health/diagnostics/24457-magnetic-resonance-cholangiopancreatography-mrcp',
    'https://my.clevelandclinic.org/health/symptoms/21819-abdominal-distension-distended-abdomen',
    'https://my.clevelandclinic.org/health/symptoms/24764-dactylitis-sausage-fingers',
    'https://my.clevelandclinic.org/health/symptoms/21999-hemostasis',
    'https://my.clevelandclinic.org/health/symptoms/palinopsia',
    'https://my.clevelandclinic.org/health/symptoms/17899-vaginal-bleeding'
]

# Fetch the content for the given URLs
content = fetch_detailed_content(urls)

# Print out the fetched content for each URL
for url, contents in content.items():
    print(f"Data from URL: {url}")
    for key, value in contents.items():
        print(f"Section: {key}")
        for text in value:
            print(f"  - {text}")



Data from URL: https://my.clevelandclinic.org/health/diseases/17854-amyloidosis-aa
Section: What is AA amyloidosis?
  - AA amyloidosis is one type of the rare disorder amyloidosis (pronounced “am-uh-loy-doh-sis”). Amyloidosis happens when proteins in your body lose their three-dimensional (3D) structure and become twisted clumps of misshapen fibrils (amyloid deposits) that gather on your organs and tissues.
  - AA amyloidosis is also known as secondary amyloidosis or amyloid serum A protein. This amyloidosis type happens when you have high levels of inflammation in your body that boost the serum A protein levels in your bloodstream. You may have high serum A protein levels if you have a long-lasting infection or inflammatory disease. In a sense, AA amyloidosis is a serious complication of inflammatory diseases and conditions. Healthcare providers treat AA amyloidosis by controlling the underlying disease or condition.
Section: How does AA amyloidosis affect my body?
  - AA amyloidosis 

# Set up the chatbot model using T5-large

**Explanation of the code**
- Pre-trained Models: Loads the pre-trained t5-large tokenizer and model from the Hugging Face Transformers library for text generation.
- Function select_context: Selects the single section most relevant to the question from the scraped content dictionary. The score is the number of words in the section title that appear in the lowercased question; when scores tie, the section with the longer text wins.
- Function generate_answer_t5: Builds a "question: ... context: ..." input from the selected section, truncates it to 512 tokens, and generates an answer with beam search (10 beams).
- Text Generation: The model generates the answer from the question and the selected context only, so a question unrelated to the scraped pages still receives an answer drawn from whichever section scored highest.
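
The title-matching score can be traced by hand on two section titles from the scraped data. The sketch below reproduces the notebook's substring test in a standalone function (`substring_score`, a name chosen here for illustration) and compares it with a hypothetical whole-word variant (`word_score`) that is not part of the notebook.

```python
import re

def substring_score(question, title):
    """Same test as the notebook: count title words that occur anywhere in the question."""
    question_lower = question.lower()
    return sum(keyword in question_lower for keyword in title.lower().split())

def word_score(question, title):
    """Hypothetical variant: count title words that match a whole word of the question."""
    question_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    title_words = re.findall(r"[a-z0-9]+", title.lower())
    return sum(word in question_words for word in title_words)

question = "What causes AA amyloidosis?"
titles = [
    "What is AA amyloidosis?",
    "How does AA amyloidosis affect my body?",
]

for title in titles:
    print(title, substring_score(question, title), word_score(question, title))
```

With the substring test, short title words such as "is" and "my" count as matches because they occur inside "amyloidosis", so the two titles score 4 and 3. The whole-word variant scores them 3 and 2. Either way, neither title is about causes, which illustrates that a title-only score can pick a section that does not contain the answer.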

In [28]:
# Import necessary libraries
from transformers import T5ForConditionalGeneration, T5Tokenizer
import numpy as np

# Load pre-trained tokenizer and model from Hugging Face's Transformers library
tokenizer = T5Tokenizer.from_pretrained('t5-large')
model = T5ForConditionalGeneration.from_pretrained('t5-large')

# Define a function to select the most relevant section of content based on the input question
def select_context(question, content):
    # Convert the question to lowercase to facilitate case-insensitive matching
    question_lower = question.lower()
    best_score = -1
    best_section = None

    # Iterate over the content dictionary
    for url, sections in content.items():
        for section_title, texts in sections.items():
            # Join all text elements into a single string for the section
            section_text = " ".join(texts)
            # Calculate match score based on the presence of section title keywords in the question
            match_score = sum(keyword in question_lower for keyword in section_title.lower().split())
            section_length = len(section_text)

            # Update the best section based on match score and section text length
            if match_score > best_score or (match_score == best_score and section_length > len(best_section['content'] if best_section else "")):
                best_score = match_score
                best_section = {
                    'source': url,
                    'section': section_title,
                    'content': section_text
                }

    # Return the section with the highest relevance to the question
    return best_section

# Define a function to generate an answer using the T5 model
def generate_answer_t5(question, model, tokenizer, content):
    # Select the most relevant content section for the question
    selected = select_context(question, content)
    if not selected:
        return "Sorry, I could not find any relevant information."

    # Construct a query for the T5 model by combining context and question
    context = f"Source: {selected['source']}\nSection: {selected['section']}\nDetails: {selected['content']}"
    input_text = f"question: {question} context: {context}"
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=1000, num_beams=10, early_stopping=True)

    # Decode and return the generated answer, skipping special tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# def chat(model, tokenizer, content):
#     print("Hello! Ask me any question about medical topics.")
#     while True:
#         question = input("You: ")
#         if question.lower() in ['quit', 'exit']:
#             print("Goodbye!")
#             break

#         answer = generate_answer_t5(question, model, tokenizer, content)
#         print("Bot:", answer)
#         feedback = input("Was this answer helpful? (Yes/No) ")

# # Main execution
# chat(model, tokenizer, content)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Set up the chatbot model using GPT2-large

**Explanation of the code**
- Library Imports: Imports the GPT-2 model and tokenizer classes from the transformers library.
- Tokenization and Model Setup: The tokenizer pads on the left, and the end-of-sequence token is used as the pad token if none is defined. Left padding matters when prompts of different lengths are batched for a decoder-only model such as GPT-2; this notebook encodes one prompt at a time, so no padding is actually added.
- Function select_context_gpt2: Uses the same scoring as select_context: it counts the section-title words that appear in the lowercased question and breaks ties in favor of the longer section.
- Function generate_answer_gpt2: Formats the selected section and the question into a prompt, then generates text with beam search (3 beams) and no_repeat_ngram_size=2, which prevents any two-token sequence from appearing twice in the output.
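
One point worth keeping in mind with GPT-2: it is a decoder-only model, so `generate` returns the prompt tokens followed by the newly generated tokens, and decoding `outputs[0]` directly repeats the question and context in the answer. The sketch below shows the slicing idea on plain Python lists; `strip_prompt` is a hypothetical helper and the token ids are stand-ins, not real GPT-2 ids.

```python
def strip_prompt(generated_ids, prompt_length):
    """Return only the tokens that were generated after the prompt."""
    return generated_ids[prompt_length:]

# Stand-in token ids for illustration only
prompt_ids = [101, 102, 103]
generated_ids = prompt_ids + [7, 8, 9]

new_ids = strip_prompt(generated_ids, len(prompt_ids))
print(new_ids)

# With the notebook's objects, the same idea would look like this (not run here):
#   prompt_length = inputs['input_ids'].shape[1]
#   answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

Slicing by prompt length works here because a single, unpadded prompt is passed to the model.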

In [30]:
# Import necessary libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

# Initialize tokenizer and model using GPT-2 large variant
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
# Ensure that padding is applied from the left
tokenizer.padding_side = "left"
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

# Set a pad token if it's not already set, using the end of sentence token as the pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Define a function to select the most relevant section of content for a given question
def select_context_gpt2(question, content):
    question_lower = question.lower()
    best_score = -1
    best_section = None

    # Iterate through the content and calculate a match score based on keyword relevance
    for url, sections in content.items():
        for section_title, texts in sections.items():
            section_text = " ".join(texts)
            keywords = section_title.lower().split()
            match_score = sum(keyword in question_lower for keyword in keywords)
            section_length = len(section_text)

            # Update the best section if this section has a higher match score or if the score is the same but the content is longer
            if match_score > best_score or (match_score == best_score and section_length > (len(best_section['content']) if best_section else 0)):
                best_score = match_score
                best_section = {
                    'source': url,
                    'section': section_title,
                    'content': section_text
                }

    # Return the best section or a message if no relevant content was found
    return best_section if best_section else {"content": "Sorry, I could not find any relevant information."}

# Define a function to generate an answer using GPT-2
def generate_answer_gpt2(question, model, tokenizer, content):
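    selected = select_context_gpt2(question, content)
    # The fallback dictionary has no 'source' or 'section' key, so return its message directly
    if 'source' not in selected:
        return selected['content']
    # Prepare the context for the model by including the source, section title, and content
    context = f"Source: {selected['source']}\nSection: {selected['section']}\n{selected['content']}"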
    input_text = f"question: {question} context: {context}"
    # Encode the input text for the model
    inputs = tokenizer.encode_plus(input_text, return_tensors='pt', max_length=512, truncation=True)
    # Generate an answer using the model
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=1000,
        num_beams=3,
        no_repeat_ngram_size=2,
        early_stopping=True
    )

    # Decode and return the generated text, skipping special tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# def chat(model, tokenizer, content):
#     print("Hello! Ask me any question about medical topics.")
#     while True:
#         question = input("You: ")
#         if question.lower() in ['quit', 'exit']:
#             print("Goodbye!")
#             break
#         answer = generate_answer_gpt2(question, model, tokenizer, content)
#         print("Bot:", answer)
#         feedback = input("Was this answer helpful? (Yes/No) ")

# # Start the chat
# chat(model, tokenizer, content)


# Chat section

**Explanation of the code**
- Chat Functionality: Users select which model to interact with. The script provides a simple menu to switch between models or to exit the program.
- Interactive Chat: Once a model is selected, users can ask questions and the model answers from the preloaded content. The script prompts for feedback after each response; the value is not stored, but it could be logged and used to improve model performance in practical applications.
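
The feedback prompt could be made useful by appending each turn to a log file. The following is a minimal sketch under stated assumptions: `record_feedback`, `load_feedback`, the JSON Lines format, and the field names are all hypothetical choices for illustration, not part of the notebook.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def record_feedback(path, model_type, question, answer, feedback):
    """Append one chat turn and the user's feedback to a JSON Lines file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_type": model_type,
        "question": question,
        "answer": answer,
        # Treat "yes"/"y" in any letter case as helpful, anything else as not helpful
        "helpful": feedback.strip().lower() in ("yes", "y"),
    }
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return entry

def load_feedback(path):
    """Read all logged turns back as a list of dictionaries."""
    with open(path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]

# Demo in a temporary directory, using two turns from the chat transcript
log_path = os.path.join(tempfile.mkdtemp(), "feedback.jsonl")
record_feedback(log_path, "T5", "What causes AA amyloidosis?",
                "abnormal proteins clump together", "Yes")
record_feedback(log_path, "T5", "aaaa", "True", "No")

entries = load_feedback(log_path)
print(len(entries), [entry["helpful"] for entry in entries])
```

In the chat loop, a call such as `record_feedback(log_path, model_type, question, answer, feedback)` right after the feedback prompt would be enough to start collecting data.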

In [20]:
from transformers import T5ForConditionalGeneration, T5Tokenizer, GPT2LMHeadModel, GPT2Tokenizer

# Define a function to load both T5 and GPT-2 models along with their tokenizers
def load_models():
    # Load the T5 model and tokenizer
    t5_tokenizer = T5Tokenizer.from_pretrained('t5-large')
    t5_model = T5ForConditionalGeneration.from_pretrained('t5-large')

    # Load the GPT-2 model and tokenizer
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    gpt2_tokenizer.padding_side = "left"  # Ensure padding is on the left for GPT-2
    gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2-large')

    # Set the pad token for GPT-2 if it's not already set
    if gpt2_tokenizer.pad_token is None:
        gpt2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        gpt2_model.resize_token_embeddings(len(gpt2_tokenizer))

    # Return the initialized models and tokenizers
    return t5_tokenizer, t5_model, gpt2_tokenizer, gpt2_model

# Initialize both models and tokenizers
t5_tokenizer, t5_model, gpt2_tokenizer, gpt2_model = load_models()

# Define a function to handle the chat interaction
def run_chat():
    while True:
        print("\nSelect a chat model to use:")
        print("1: T5 Model")
        print("2: GPT-2 Model")
        print("3: Exit Program")
        choice = input("Enter your choice (1, 2, or 3): ")

        if choice == '1':
            chat(t5_model, t5_tokenizer, content, model_type='T5')
        elif choice == '2':
            chat(gpt2_model, gpt2_tokenizer, content, model_type='GPT2')
        elif choice == '3':
            print("Exiting the program.")
            break
        else:
            print("Invalid choice. Please select either 1, 2, or 3.")

# Define a function for chatting using the specified model
def chat(model, tokenizer, content, model_type='T5'):
    print(f"\nHello! Ask me any question about medical topics. You are now chatting with the {model_type} model.")
    while True:
        question = input("You: ")
        if question.lower() in ['quit', 'exit']:
            print(f"You are exiting the {model_type} model chat.")
            break

        # Generate an answer based on the model selected
        if model_type == 'T5':
            answer = generate_answer_t5(question, model, tokenizer, content)
        elif model_type == 'GPT2':
            answer = generate_answer_gpt2(question, model, tokenizer, content)

        print("Bot:", answer)
        feedback = input("Was this answer helpful? (Yes/No) ")

# Start the chat interaction
run_chat()


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Select a chat model to use:
1: T5 Model
2: GPT-2 Model
3: Exit Program
Enter your choice (1, 2, or 3): 1

Hello! Ask me any question about medical topics. You are now chatting with the T5 model.
You: How is vaginal bleeding treated?
Bot: medication or surgery is needed to treat vaginal bleeding
Was this answer helpful? (Yes/No) Yes
You: How many disease on your database?
Bot: two
Was this answer helpful? (Yes/No) yes
You: can you show me?
Bot: there are ways you can lower your risk
Was this answer helpful? (Yes/No) No
You: Surgical treatment for vaginal bleeding
Bot: Endometrial ablation
Was this answer helpful? (Yes/No) Yes
You: What causes AA amyloidosis?
Bot: abnormal proteins clump together
Was this answer helpful? (Yes/No) Yes
You: aaaa
Bot: True
Was this answer helpful? (Yes/No) No
You: adasfkkasdglsdfklghsdf
Bot: False
Was this answer helpful? (Yes/No) yes
You: What is that car color?
Bot: blue
Was this answer helpful? (Yes/No) No
You: error
Bot: too much clotting
Was this answ

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Bot: question: What is this medication? context: Source: https://my.clevelandclinic.org/health/drugs/20933-dexmethylphenidate-tablets
Section: What should I tell my care team before I take this medication?
They need to know if you have any of these conditions: Circulation problems in fingers and toes Hardening or blockages of the arteries or heart blood vessels Heart disease or a heart defect High blood pressure History of a drug or alcohol abuse problem History of stroke Mental illness Suicidal thoughts, plans, or attempt; a previous suicide attempt by you or a family member An unusual or allergic reaction to dexmethylphenidate, methylphenidate, other medications, foods, dyes, or preservatives Pregnant or trying to get pregnant Breast-feeding or planning to breast-feed; you are breastfeeding or plan on breastfeeding; your child has a medical condition that may affect your ability to take the medication; or you plan to have a child who will be taking the drug.
What is the most importan

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Bot: question: How should I use this medication? context: Source: https://my.clevelandclinic.org/health/drugs/20933-dexmethylphenidate-tablets
Section: How should I use this medication?
Take this medication by mouth with a glass of water. Follow the directions on the prescription label. You can take this medication with or without food. Take your doses at regular intervals. Usually the last dose of the day will be taken at least 4 to 6 hours before your normal bedtime, so it will not interfere with sleep. Do not take your medication more often than directed. A special MedGuide will be given to you by the pharmacist with each prescription and refill. Be sure to read this information carefully each time. Talk to your care team regarding the use of this medication in children. While this medication may be prescribed for children as young as 6 years for selected conditions, precautions do apply. Overdosage: If you think you have taken too much of this medicine contact a poison control cent