01 - Convert PDFs to Text

Define pdf_dir - this is the absolute path to the directory with all PDF files you want evaluated

Define output_dir - this is the absolute path to the directory to output all the preprocessed PDF files

In [1]:
pdf_dir = r'/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/gpt_document_reader/amnesia_cases/papers_enrolled'
output_dir = r"/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/gpt_document_reader/amnesia_cases/ocr"

**Preprocess Using Text Extraction**

In [None]:
import os
import textract

class PDFTextExtractor:
    """
    A class to handle PDF text extraction and saving it to a file.

    Attributes
    ----------
    None

    Methods
    -------
    extract_text_from_pdf(file_path: str) -> str:
        Extracts text from a given PDF file and returns it as a string.
    save_text_to_file(text: str, output_file_path: str) -> None:
        Saves a given text string to a specified file path.
    extract_text_from_pdf_dir(pdf_dir: str, output_dir: str) -> None:
        Iterates through a directory of PDF files, extracts text, and saves it to text files in an output directory.
    """
    
    @staticmethod
    def extract_text_from_pdf(file_path: str) -> str:
        """
        Extracts text from a given PDF file and returns it as a string.

        Parameters
        ----------
        file_path : str
            The path of the PDF file to extract text from.

        Returns
        -------
        str
            The extracted text as a string.
        """
        text = textract.process(file_path)
        return text.decode("utf-8")
    
    @staticmethod
    def save_text_to_file(text: str, output_file_path: str) -> None:
        """
        Saves a given text string to a specified file path.

        Parameters
        ----------
        text : str
            The text string to save.
        output_file_path : str
            The path where the text file will be saved.

        Returns
        -------
        None
        """
        with open(output_file_path, "w", encoding="utf-8") as output_file:
            output_file.write(text)
    
    @staticmethod
    def extract_text_from_pdf_dir(pdf_dir: str, output_dir: str) -> None:
        """
        Iterates through a directory of PDF files, extracts text, and saves it to text files in an output directory.

        Parameters
        ----------
        pdf_dir : str
            The directory containing the PDF files.
        output_dir : str
            The directory where text files will be saved.

        Returns
        -------
        None
        """
        for file_name in os.listdir(pdf_dir):
            if file_name.endswith(".pdf"):
                input_file_path = os.path.join(pdf_dir, file_name)
                output_file_path = os.path.join(output_dir, f"{os.path.splitext(file_name)[0]}.txt")
                text = PDFTextExtractor.extract_text_from_pdf(input_file_path)
                PDFTextExtractor.save_text_to_file(text, output_file_path)

Preprocess the PDF files

In [None]:
# Create an instance of the PDFTextExtractor class
pdf_extractor = PDFTextExtractor()
# Call the extract_text_from_pdf_dir method
pdf_extractor.extract_text_from_pdf_dir(pdf_dir, output_dir)

**Extract Tezt Using OCR**

In [16]:
from pdf2image import convert_from_path
from PyPDF2 import PdfReader
import pytesseract
import os
from tqdm import tqdm
import cv2
import numpy as np

class OCROperator:
    """
    A class to handle OCR text extraction from PDFs in a directory.

    Attributes
    ----------
    None

    Methods
    -------
    extract_text_from_pdf(file_path: str) -> str:
        Extracts text from a given PDF file using OCR and returns it as a string.
    save_text_to_file(text: str, output_file_path: str) -> None:
        Saves the extracted text to a specified file path.
    extract_text_from_pdf_dir(pdf_dir: str, output_dir: str) -> None:
        Iterates through a directory of PDF files and extracts text using OCR.
    """
    
    @staticmethod
    def preprocess_image(image):
        """
        Preprocesses the image for OCR.

        Parameters
        ----------
        image : PIL.Image.Image
            The image to preprocess.

        Returns
        -------
        PIL.Image.Image
            The preprocessed image.
        """
        # Convert the image to grayscale
        gray = cv2.cvtColor(np.array(image), cv2.COLOR_BGR2GRAY)
        
        # Apply Gaussian blur to reduce noise
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)
        
        # Convert the image to binary (black and white)
        _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        
        return binary
        
    @staticmethod
    def extract_text_from_pdf(file_path: str) -> str:
        """
        Extracts text from a PDF using OCR and returns it as a string.

        Parameters
        ----------
        file_path : str
            The path of the PDF file.

        Returns
        -------
        str
            The OCR-extracted text.
        """
        images = convert_from_path(file_path)
        text = ""

        for image in images:
            text += pytesseract.image_to_string(image)

        return text
    
    @staticmethod
    def get_pdf_page_count(file_path: str) -> int:
        pdf = PdfReader(file_path)
        return len(pdf.pages)

    @staticmethod
    def save_text_to_file(text: str, output_file_path: str) -> None:
        """
        Saves the extracted text to a specified file path.

        Parameters
        ----------
        text : str
            The text to save.
        output_file_path : str
            The file path to save the text to.

        Returns
        -------
        None
        """
        with open(output_file_path, 'w', encoding='utf-8') as f:
            f.write(text)

    @staticmethod
    def extract_text_from_pdf_dir(pdf_dir: str, output_dir: str, page_threshold=50) -> None:
        """
        Iterates through a directory of PDF files and extracts text using OCR.

        Parameters
        ----------
        pdf_dir : str
            The directory containing the PDF files.
        output_dir : str
            The directory to save the extracted text to.

        Returns
        -------
        None
        """
        os.makedirs(output_dir,exist_ok=True)
        for file_name in tqdm(os.listdir(pdf_dir)):
            if file_name.endswith('.pdf'):
                input_file_path = os.path.join(pdf_dir, file_name)
                
                #Exclude if over page count
                page_count = OCROperator.get_pdf_page_count(input_file_path)
                if page_count > page_threshold:
                    print(f"Skipping {file_name} as it has {page_count} pages, exceeding the threshold of {page_threshold}.")
                    continue
                
                output_file_path = os.path.join(output_dir, f"{os.path.splitext(file_name)[0]}_OCR.txt")

                text = OCROperator.extract_text_from_pdf(input_file_path)
                OCROperator.save_text_to_file(text, output_file_path)


In [17]:
OCROperator.extract_text_from_pdf_dir(pdf_dir, output_dir)

  6%|▋         | 7/109 [02:28<27:54, 16.41s/it]  

Skipping 2020 - ESO-WSO 2020 Joint Meeting Abstracts.pdf as it has 750 pages, exceeding the threshold.


 98%|█████████▊| 107/109 [42:33<00:16,  8.29s/it] 

Skipping 2020 - ePoster Sessions.pdf as it has 706 pages, exceeding the threshold.


100%|██████████| 109/109 [43:54<00:00, 24.17s/it]


02 - Preprocess the Extracted Text

In [25]:
import os
import re

# Updated TextPreprocessor class without removing newlines
class TextPreprocessor:
    """
    A class to preprocess text files in a directory.

    Attributes:
    - input_dir (str): The directory containing the text files to be preprocessed.
    - output_dir (str): The directory where the preprocessed text files will be saved.

    Methods:
    - preprocess_text: Applies various preprocessing steps to a given text.
    - remove_non_ascii: Removes non-ASCII characters from the text.
    - process_files: Reads each text file from the input directory, applies preprocessing, and saves it to the output directory.
    """

    def __init__(self, input_dir):
        """
        Initializes the TextPreprocessor class with input and output directories.

        Parameters:
        - input_dir (str): Path to the directory containing the text files to be preprocessed.
        - output_dir (str): Path to the directory where the preprocessed text files will be saved.
        """
        self.input_dir = input_dir
        self.output_dir = os.path.join(input_dir, 'preprocessed')

    @staticmethod
    def preprocess_text(text):
        """
        Applies various preprocessing steps to a given text.

        Parameters:
        - text (str): The text to be preprocessed.

        Returns:
        - str: The preprocessed text.
        """
        text = re.sub(r'([,.:;])', r'\1 ', text)
        text = re.sub(r'([(])', r' \1', text)
        text = re.sub(r'([)])', r'\1 ', text)
        text = re.sub(r'(?<![a-zA-Z0-9-])-(?![a-zA-Z0-9-])', r' - ', text)
        return text

    @staticmethod
    def remove_non_ascii(text):
        """
        Removes non-ASCII characters from the text.

        Parameters:
        - text (str): The text from which non-ASCII characters will be removed.

        Returns:
        - str: The text with non-ASCII characters removed.
        """
        return re.sub(r'[^\x00-\x7F]+', ' ', text)

    def process_files(self):
        """
        Reads each text file from the input directory, applies preprocessing, and saves it to the output directory.
        """
        # Create output directory if it doesn't exist
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)
        
        # Loop through each file in the input directory
        for filename in os.listdir(self.input_dir):
            if filename.endswith('.txt'):
                input_filepath = os.path.join(self.input_dir, filename)
                output_filepath = os.path.join(self.output_dir, filename)
                
                # Read the original text
                with open(input_filepath, 'r', encoding='utf-8') as input_file:
                    original_text = input_file.read()
                
                # Apply preprocessing
                preprocessed_text = self.preprocess_text(original_text)
                cleaned_text = self.remove_non_ascii(preprocessed_text)
                
                # Save the preprocessed text
                with open(output_filepath, 'w', encoding='utf-8') as output_file:
                    output_file.write(cleaned_text)

In [26]:
# Initialize the TextPreprocessor class and preprocess the files
preprocessor = TextPreprocessor(output_dir)
preprocessor.process_files()

03 - Label Text Sections

Define Text Chunking Methods

In [20]:
# Importing required libraries
import os
import re
import json

class TextChunker:
    """
    A class to chunk a given text into smaller segments based on a token limit.
    
    Attributes:
    - text (str): The text to be chunked.
    - token_limit (int): The maximum number of tokens allowed in each chunk.
    - chunks (list): List to store the generated text chunks.
    
    Methods:
    - chunk_text: Splits the text into smaller segments based on the token limit.
    - get_chunks: Returns the list of generated text chunks.
    """
    
    def __init__(self, text, token_limit):
        """
        Initializes the TextChunker class with the text and token limit.
        
        Parameters:
        - text (str): The text to be chunked.
        - token_limit (int): The maximum number of tokens allowed in each chunk.
        """
        self.text = text
        self.token_limit = token_limit
        self.chunks = []
    
    def chunk_text(self):
        """
        Splits the text into smaller segments based on the token limit.
        """
        words = self.text.split()
        current_chunk = []
        current_chunk_tokens = 0
        
        for word in words:
            # Considering each word as a token and adding 1 for the space
            tokens_in_word = len(word.split()) + 1
            
            if current_chunk_tokens + tokens_in_word <= self.token_limit:
                current_chunk.append(word)
                current_chunk_tokens += tokens_in_word
            else:
                self.chunks.append(' '.join(current_chunk))
                current_chunk = [word]
                current_chunk_tokens = tokens_in_word
        
        # Adding the last chunk if any words are left
        if current_chunk:
            self.chunks.append(' '.join(current_chunk))
    
    def get_chunks(self):
        """
        Returns the list of generated text chunks.
        
        Returns:
        - list: List containing the generated text chunks.
        """
        return self.chunks
    
# > Example Usage:
# # Setting the token limit to 75% of GPT-3's maximum token limit (4096)
# token_limit = int(0.75 * 4096)  # About 3072 tokens

# # Reading the text file
# file_path = '/mnt/data/Horn 2017PD Fxconn_OCR.txt'
# with open(file_path, 'r', encoding='utf-8') as file:
#     text = file.read()

# # Creating an instance of the TextChunker class
# text_chunker = TextChunker(text, token_limit)

# # Chunking the text
# text_chunker.chunk_text()

# # Getting the list of chunks
# chunks = text_chunker.get_chunks()

# # Displaying the first chunk as a sample
# chunks[0][:500]  # Displaying the first 500 characters of the first chunk as a sample


Define Natural Language Processing Software

Latent Dirichlet Allocation approach employed

In [21]:
# Fixing the JSON serialization issue by converting NumPy int64 to native Python int

class SectionLabeler:
    """
    A class to label sections of a given text based on chunking and topic modeling.
    
    Attributes:
    - folder_path (str): The path to the folder containing the text files.
    - article_type (str): The type of article (e.g., 'research', 'case_report').
    - lda (LatentDirichletAllocation): The Latent Dirichlet Allocation model for topic modeling.
    
    Methods:
    - label_sections: Labels the sections of each text file in the folder.
    """
    
    def __init__(self, folder_path, article_type):
        """
        Initializes the SectionLabeler class with the folder path and article type.
        
        Parameters:
        - folder_path (str): The path to the folder containing the text files.
        - article_type (str): The type of article (e.g., 'research', 'case_report').
        """
        self.folder_path = folder_path
        self.article_type = article_type
        self.lda = LatentDirichletAllocation(n_components=7, random_state=42) # Number of topics can be adjusted
    
    def label_sections(self):
        """
        Labels the sections of each text file in the folder.
        """
        for filename in os.listdir(self.folder_path):
            if filename.endswith(".txt"):
                file_path = os.path.join(self.folder_path, filename)
                
                with open(file_path, 'r', encoding='utf-8') as f:
                    text = f.read()
                
                # Chunk the text using TextChunker
                chunker = TextChunker(text, int(4096 * 0.75))  # 75% of the maximum token limit of 4096
                chunker.chunk_text()
                chunks = chunker.get_chunks()
                
                # Vectorize the text chunks for topic modeling
                vectorizer = CountVectorizer()
                dtm = vectorizer.fit_transform(chunks)
                
                # Fit LDA model to the document-term matrix
                self.lda.fit(dtm)
                
                # Get the topic for each chunk
                topic_results = self.lda.transform(dtm)
                chunk_labels = topic_results.argmax(axis=1)
                
                # Store chunk and its label in a dictionary
                labeled_chunks = {}
                for i, label in enumerate(chunk_labels):
                    labeled_chunks[f"Chunk_{i+1}"] = {"text": chunks[i], "topic": int(label)}  # Convert to native int
                
                # Save the labeled chunks to a JSON file
                json_filename = filename.replace('.txt', '_labeled.json')
                json_file_path = os.path.join(self.folder_path, json_filename)
                
                with open(json_file_path, 'w', encoding='utf-8') as json_f:
                    json.dump(labeled_chunks, json_f, ensure_ascii=False, indent=4)
# > Example Usage:
# # Initialize and run the SectionLabeler
# labeler = SectionLabeler("/mnt/data", "research")
# labeler.label_sections()

Define OpenAI Labelling Software

In [22]:
import openai  # Make sure to install the OpenAI package

class OpenAIEvaluator:
    """
    A class to evaluate text chunks using the OpenAI API based on the type of article.

    Attributes:
    - api_key (str): OpenAI API key.
    - article_type (str): The type of article (e.g., 'research', 'case').
    - questions (dict): Dictionary mapping article types to evaluation questions.

    Methods:
    - __init__: Initializes the OpenAIEvaluator class with the API key path and article type.
    - read_api_key: Reads the OpenAI API key from a file.
    - evaluate_with_openai: Evaluates a text chunk based on the question corresponding to the article type.
    """

    def __init__(self, api_key_path):
        """
        Initializes the OpenAIEvaluator class.

        Parameters:
        - api_key_path (str): Path to the file containing the OpenAI API key.
        - article_type (str): The type of article (e.g., 'research', 'case').
        """
        self.api_key = self.read_api_key(api_key_path)
        openai.api_key = self.api_key

    def read_api_key(self, file_path):
        """
        Reads the OpenAI API key from a file.

        Parameters:
        - file_path (str): Path to the file containing the OpenAI API key.

        Returns:
        - str: OpenAI API key.
        """
        with open(file_path, 'r') as file:
            return file.readline().strip()

    def evaluate_with_openai(self, chunk, questions):
        """
        Evaluates a chunk based on multiple posed questions using OpenAI API.

        Parameters:
        - chunk (str): The text chunk to be evaluated.
        - questions (list): A list of questions for evaluation.

        Returns:
        - dict: A dictionary where keys are questions and values are binary decisions (0 or 1).
        """
        question_list = list(questions.keys())
        question_prompt = "\n".join([f"{q}" for q in question_list])
        prompt = f"Text Chunk: {chunk}\n{question_prompt}"

        try:
            response = openai.Completion.create(
                engine="gpt-3.5-turbo-16k-0613",
                prompt=prompt,
                max_tokens=10  # Adjust as needed
            )
            decision_text = response.choices[0].text.strip()
            decisions = decision_text.split("\n")
            
            if len(decisions) != len(questions):
                print("Warning: The number of decisions does not match the number of questions.")
                valid_decisions = [line.strip() for line in decisions if line.strip()]
                if len(valid_decisions) != len(questions):
                    decisions = valid_decisions
                    print("Solved warning.")
                else:
                    print('Decisions here, returning None: ', decisions)
                    return None

            return {q: 1 if "Y" in d else 0 for q, d in zip(questions, decisions)}
            
        except Exception as e:
            print(f"An error occurred: {e}")
            return "Unidentified"

In [23]:
# Importing required libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import json
import os
import openai
from tqdm import tqdm
from fuzzywuzzy import fuzz

class SectionLabeler:
    """
    A class to label sections of a text document using LDA and OpenAI's GPT-3.

    Attributes:
    - folder_path (str): The path to the folder containing text files.
    - article_type (str): The type of article (e.g., 'research', 'case').
    - lda_model (object): The trained LDA model for topic modeling.
    - vectorizer (object): The CountVectorizer object for text vectorization.
    - chunker (object): The TextChunker object for text chunking.
    - api_key (str): The OpenAI API key. [Replace with actual API key later]

    Methods:
    - train_lda: Trains the LDA model based on the text chunks.
    - dominant_topic: Finds the dominant topic for a given text chunk.
    - label_with_openai: Labels a section based on the dominant topic using OpenAI's GPT-3.
    - process_files: Processes all text files in the specified folder.
    - save_to_json: Saves the labeled sections to a JSON file.
    """

    def __init__(self, folder_path, article_type, api_key_path=None, label_method='LDA'):
        """
        Initializes the SectionLabeler class with the folder path and article type.

        Parameters:
        - api_key_path (str): Path to the file containing the OpenAI API key.
        - article_type (str): The type of article (e.g., 'research' or 'case').
        - folder_path (str): The path to the folder containing text files.
        - article_type (str): The type of article (e.g., 'research', 'case').
        """
        self.folder_path = folder_path
        self.article_type = article_type
        self.api_key_path = api_key_path
        self.lda_model = None
        self.vectorizer = None
        self.chunker = None
        self.label_method = label_method
        self.token_max = 4098
        self.token_safety = 0.4


    def select_labels(self):
        # Define section labels for each article type
        if self.article_type == "research":
            self.section_headers = {
            "Abstract": ["Abstract"],
            "Introduction": ["Background", "Introduction", "Intro"],
            "Methods": ["Methods", "Materials", "Material and Methods", "Materials & Methods", "Methodology", "Subjects and Methods"],
            "Results": ["Results", "Findings"],
            "Discussion": ["Discussion", "Interpretation"],
            "Conclusion": ["Conclusion", "Summary"],
            "References": ["References", "Bibliography", "Citations"]
            }
        if self.article_type == "case":
            self.section_headers = {
            "Abstract": ["Abstract"],
            "Introduction": ["Background", "Introduction", "Intro"],
            "Methods": ["Methods", "Materials", "Material and Methods", "Materials & Methods", "Methodology", "Subjects and Methods"],
            "Results": ["Results", "Findings"],
            "Discussion": ["Discussion", "Interpretation"],
            "Conclusion": ["Conclusion", "Summary"],
            "References": ["References", "Bibliography", "Citations"]
            }
        else:
            self.section_headers = {
            "Abstract": ["Abstract"],
            "Introduction": ["Background", "Introduction", "Intro"],
            "Methods": ["Methods", "Materials", "Material and Methods", "Materials & Methods", "Methodology", "Subjects and Methods"],
            "Results": ["Results", "Findings"],
            "Discussion": ["Discussion", "Interpretation"],
            "Conclusion": ["Conclusion", "Summary"],
            "References": ["References", "Bibliography", "Citations"]
            }
            
    def get_questions(self):
        if self.article_type == "research":
            return None
        elif self.article_type == "case":
            questions = {'Does this contain a neurological case report? (Y/N)': 'case_report'}
            return questions
        else:
            return None
        
    def train_lda(self, text_chunks):
        """
        Trains the LDA model based on the text chunks.

        Parameters:
        - text_chunks (list): List of text chunks.

        Returns:
        - object: Trained LDA model.
        """
        self.vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern='[a-zA-Z\-][a-zA-Z\-]{2,}')
        data_vectorized = self.vectorizer.fit_transform(text_chunks)
        lda_model = LatentDirichletAllocation(n_components=len(self.topic_labels), max_iter=10, learning_method='online')
        lda_Z = lda_model.fit_transform(data_vectorized)
        self.lda_model = lda_model
        return lda_model      
    
    def train_lda_on_all_files(self):
        """
        Trains the LDA model on text from all files in the specified folder.

        Returns:
        - object: Trained LDA model.
        """
        all_text_chunks = []

        for filename in tqdm(os.listdir(self.folder_path)):
            if filename.endswith('.txt'):
                with open(os.path.join(self.folder_path, filename), 'r') as f:
                    text = f.read()

                self.chunker = TextChunker(text, np.round(4096*0.75)) # Set to 75% max token limit
                self.chunker.chunk_text()
                chunks = self.chunker.get_chunks()

                all_text_chunks.extend(chunks)

        # Train LDA model on all text chunks
        return self.train_lda(all_text_chunks)  
        
    def dominant_topic(self, text_chunk):
        """
        Finds the dominant topic for a given text chunk.

        Parameters:
        - text_chunk (str): The text chunk to be labeled.

        Returns:
        - int: Index of the dominant topic.
        """
        text_vectorized = self.vectorizer.transform([text_chunk])
        topic_probability_scores = self.lda_model.transform(text_vectorized)
        dominant_topic_index = topic_probability_scores.argmax()
        return dominant_topic_index
    
    def label_with_lda(self, text_chunk):
        # Get the list of section labels based on the article type
        return self.topic_labels[self.dominant_topic(text_chunk)]
    
    def extract_topic_significant_words(self, dominant_topic_index):
        """
        Extracts significant words for a specific topic in the LDA model.

        Parameters:
        - dominant_topic_index (int): The index of the dominant topic.

        Returns:
        - list: Significant words for the dominant topic.
        """
        # Initialize an empty list to store the significant words for the dominant topic
        significant_words = []

        # Get the topic-word distribution for the dominant topic
        topic_word_distribution = self.lda_model.components_[dominant_topic_index]

        # Get the indices of the top N significant words
        N = 10  # Adjust as needed
        top_word_indices = topic_word_distribution.argsort()[-N:][::-1]

        # Get the actual words from the vectorizer
        feature_names = self.vectorizer.get_feature_names_out()
        significant_words = [feature_names[i] for i in top_word_indices]

        return significant_words

    def label_with_exact_matching(self, text):
        labeled_sections = {}
        current_section = None
        current_text = ""

        # Adding newline to section labels for exact matching
        section_labels_with_newline = []
        for labels in self.section_headers.values():
            section_labels_with_newline.extend([f"\n{label}\n" for label in labels])

        for line in text.split('\n'):
            if f"\n{line}\n" in section_labels_with_newline:
                if current_section:
                    labeled_sections[current_section] = current_text.strip()
                current_section = line
                current_text = ""
            else:
                current_text += line + "\n"

        # Adding the last section
        if current_section:
            labeled_sections[current_section] = current_text.strip()

        return labeled_sections

    def label_with_fuzzy_matching(self, text, labeled_sections):
        current_section = None
        current_text = ""

        for line in text.split('\n'):
            for section, labels in self.section_headers.items():
                if any(fuzz.partial_ratio(line, label) > 80 for label in labels):
                    if current_section:
                        if current_section not in labeled_sections:
                            labeled_sections[current_section] = current_text.strip()
                        else:
                            labeled_sections[current_section] += "\n" + current_text.strip()
                    current_section = section
                    current_text = ""
                    break
            else:
                current_text += line + "\n"

        # Adding the last section
        if current_section:
            if current_section not in labeled_sections:
                labeled_sections[current_section] = current_text.strip()
            else:
                labeled_sections[current_section] += "\n" + current_text.strip()

        return labeled_sections

    def label_with_loose_matching(self, text, labeled_sections):
        current_section = None
        current_text = ""

        for line in text.split('\n'):
            for section, labels in self.section_headers.items():
                if any(re.search(f"\\b{label}\\b", line, re.IGNORECASE) for label in labels):
                    if current_section:
                        if current_section not in labeled_sections:
                            labeled_sections[current_section] = current_text.strip()
                        else:
                            labeled_sections[current_section] += "\n" + current_text.strip()
                    current_section = section
                    current_text = ""
                    break
            else:
                current_text += line + "\n"

        # Adding the last section
        if current_section:
            if current_section not in labeled_sections:
                labeled_sections[current_section] = current_text.strip()
            else:
                labeled_sections[current_section] += "\n" + current_text.strip()

        return labeled_sections

    def label_text(self, text, show_residuals=True):
        labeled_sections = self.label_with_exact_matching(text)
        labeled_sections = self.label_with_fuzzy_matching(text, labeled_sections)
        labeled_sections = self.label_with_loose_matching(text, labeled_sections)
        
        # Check for residual text
        labeled_text = "".join(list(labeled_sections.values()))
        residual_text = set(text) - set(labeled_text)
        
        if self.label_method=='openai':
            # Extract only the "References" section from labeled_sections
            references_section = {key: value for key, value in labeled_sections.items() if key == "References"}
            
            # Remove each labeled section from residual_text
            residual_text = text
            for section_text in labeled_sections.values():
                residual_text = residual_text.replace(section_text, "")
            return references_section, residual_text
        else:
            if len(residual_text) > 100:
                print(f"Warning: High number of characters missed ({len(residual_text)}), please investigate manually.")
                if show_residuals:
                    print(residual_text)
            return labeled_sections, residual_text
        
    def process_files(self):
        """
        Processes all text files in the specified folder.
        """
        output_dict = {}

        self.select_labels()

        for filename in tqdm(os.listdir(self.folder_path)):
            if filename.endswith('.txt'):
                with open(os.path.join(self.folder_path, filename), 'r') as f:
                    text = f.read()

                # Initialize labeled_sections
                labeled_sections = {}

                # Use Keyword matching to find sections first
                if self.label_method == 'keyword_matching' or self.label_method == 'openai':
                    labeled_sections, text = self.label_text(text)

                if self.label_method != 'keyword_matching':
                    # Split the text into sections
                    self.chunker = TextChunker(text, np.round(self.token_max * self.token_safety))  # Set to 75% max token limit
                    self.chunker.chunk_text()
                    chunks = self.chunker.get_chunks()

                    for i, chunk in enumerate(chunks):
                        if self.label_method == 'LDA':
                            self.train_lda_on_all_files()
                            label = self.label_with_lda(chunk)
                            labeled_sections[label] = chunk  # Moved this inside the if block
                        elif self.label_method == 'openai':
                            # Set questions
                            self.questions_dict = self.get_questions()
                            # Intializing sections
                            for k, v in self.questions_dict.items():
                                labeled_sections[v] = []
                                                            
                            self.openai_evaluator = OpenAIEvaluator(self.api_key_path)
                            label_dict = self.openai_evaluator.evaluate_with_openai(chunk, self.questions_dict)
                            for question, decision in label_dict.items():
                                if decision == 1:
                                    print(self.questions_dict[question])
                                    print(chunk)
                                    labeled_sections[self.questions_dict[question]].append(chunk)

                        else:
                            raise ValueError(f"Labelling method {self.label_method} invalid. Select LDA or openai or keyword_matching")

                output_dict[filename] = labeled_sections

        self.save_to_json(output_dict)

    def save_to_json(self, output_dict):
        """
        Saves the labeled sections to a JSON file.

        Parameters:
        - output_dict (dict): Dictionary containing the labeled sections.

        Returns:
        - None
        """
        # Create a new directory in the same root folders
        out_dir = os.path.join(os.path.dirname(self.folder_path),  f"{self.label_method}_labeled_text")
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, 'labeled_sections.json'), 'w') as f:
            json.dump(output_dict, f, indent=4)



In [24]:
# Define input and output directory paths
article_type = "research" 
api_key_path = "/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/openai_key.txt"

# Initialize the SectionLabeler class and process the files
section_labeler = SectionLabeler(folder_path=os.path.join(output_dir, 'preprocessed'), article_type=article_type, api_key_path=api_key_path, label_method='keyword_matching')
section_labeler.process_files()

100%|██████████| 107/107 [03:19<00:00,  1.86s/it]


# 04 - Define Your Research Questions for OpenAI

Using the print_question_template Function

Overview
- The print_question_template function serves as a tool to help users generate a template of questions based on a specific context—be it for inclusion criteria, exclusion criteria, or evaluation questions. 
- You will then edit the template for your research questions and pass it to the GPT submission class. 
- This function is particularly useful for researchers or students who want to fine-tune the evaluation process of text-based data.

```
Function Signature
python

Copy code
def print_question_template(question_type: str) -> None:
Parameters
question_type (str): This is the type of questions you want to print. The options are 'inclusion', 'exclusion', 'evaluation', or 'custom'.
Returns
None: The function prints the template to the console but does not return any value.
How to Use
Choose the Question Type: Decide the type of questions you need. The options are:

'inclusion': For questions related to the inclusion criteria.
'exclusion': For questions related to the exclusion criteria.
'evaluation': For questions that evaluate the quality or specifics of the text.
'custom': For your custom set of questions.

Use the function like so:
```

>python
>Copy code
>print_question_template("inclusion")
>This will print a JSON-formatted dictionary of questions related to the inclusion criteria.
>
>Modify the Template: You can copy the printed dictionary and modify the questions or their corresponding labels as >needed.
>
>Pass to the Evaluator: Once modified, this dictionary can be passed as an argument to the OpenAIChatEvaluator or >any similar class for further processing.
>
>Example Usage
>Here's how you can use the function to get a template for inclusion questions:
>
>python
>Copy code
>print_question_template("inclusion")
>The output will look something like this:
>
>json
>Copy code
>{
>    "Amnesia case report? (Y/N)": "case_report",
>    "Published in English? (Y/N)": "is_english"
>}
>Custom Template
>If you choose the 'custom' type, the function will return a skeleton template with placeholders, which you can >fill in to create your own set of questions.


Remember, the idea is to make it easier for you to generate, modify, and utilize question templates for your specific needs.

In [27]:
def print_question_template(question_type):
    """
    Prints out a template for questions based on the specified question type.
    
    Parameters:
    - question_type (str): Type of questions to print ('inclusion', 'exclusion', 'evaluation').
    
    Returns:
    None
    """
    if question_type == "inclusion":
        questions = {'Amnesia case report? (Y/N)': 'case_report',
                     'Published in English? (Y/N)': 'is_english'}
    elif question_type == "exclusion":
        questions = {'Transient amnesia, reversible amnesia symptom, severe confabulation or drug use, toxicity, epilepsy-related confusion, psychological or psychiatric-related amnesia (functional amnesia)': 'other_cause',
                        'Did not examine/report both retrograde and anterograde memory domains': 'not_both_domains',
                        'Without descriptive/qualitative/quantitative data on amnesia severity/memory tests/questions/scenarios/details': 'not_enough_information',
                        'Had global cognitive impairment disproportionate to memory loss': 'disproportionate_impairment',
                        'Without measurable lesion-related brain MR/CT scans': 'no_scan',
                        'Had focal or widespread brain atrophy': 'neurodegenerative',
                        'Atypical cases with selective (e.g., semantic) memory loss or material/topographic-specific memory loss': 'atypical_case'
                    }  
    elif question_type == "evaluation":
        questions = {'Does the patient(s) represent(s) the whole experience of the investigator (center) or is the selection method unclear to the extent that other patients with similar presentation may not have been reported? (Good/Bad/Unclear)': 'representative_case_quality',
                        'Was patient’s causal exposure clearly described? (Good/Bad/Unclear)': 'causality_quality',
                        'Were diagnostic tests or assessment methods and the results clearly described (amnesia tests)? (Good/Bad/Unclear)': 'phenotyping_quality',
                        'Were other alternative causes that may explain the observation (amnesia) ruled out? (Good/Bad/Unclear)': 'workup_quality',
                        'Were patient’s demographics, medical history, comobidities clearly described? (Good/Bad/Unclear)': 'clinical_covariates_quality',
                        'Were patient’s symptoms, interventions, and clinical outcomes clearly presented as a timeline? (Good/Bad/Unclear)': 'history_quality',
                        'Was the lesion image taken around the time of observation (amnesia) assessment? (Good/Bad/Unclear)': 'temporal_causality_quality',
                        'Is the case(s) described with sufficient details to allow other investigators to replicate the research or to allow practitioners make inferences related to their own practice? (Good/Bad/Unclear)': 'history_quality_2'
        }
    elif question_type == "custom":
        questions = {' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
                        ' ? (metric/metric/metric)': 'question_label',
        }
        return questions
    else:
        print("Invalid question type. Please choose 'inclusion', 'exclusion', or 'evaluation'.")
        return
    
    print("Here is the template for the type of questions you chose:")
    print(json.dumps(questions, indent=4))

# # Example usage:
print_question_template("inclusion")
print_question_template("exclusion")
print_question_template("custom")

Here is the template for the type of questions you chose:
{
    "Amnesia case report? (Y/N)": "case_report",
    "Published in English? (Y/N)": "is_english"
}
Here is the template for the type of questions you chose:
{
    "Transient amnesia, reversible amnesia symptom, severe confabulation or drug use, toxicity, epilepsy-related confusion, psychological or psychiatric-related amnesia (functional amnesia)": "other_cause",
    "Did not examine/report both retrograde and anterograde memory domains": "not_both_domains",
    "Without descriptive/qualitative/quantitative data on amnesia severity/memory tests/questions/scenarios/details": "not_enough_information",
    "Had global cognitive impairment disproportionate to memory loss": "disproportionate_impairment",
    "Without measurable lesion-related brain MR/CT scans": "no_scan",
    "Had focal or widespread brain atrophy": "neurodegenerative",
    "Atypical cases with selective (e.g., semantic) memory loss or material/topographic-specifi

{' ? (metric/metric/metric)': 'question_label'}

**Critical Note**
- You should phrase these questions such that a Yes results in the manuscript 'passing'
- This can be made more robust by having users deliberately map a Yes for each question to whether it is good or not. 

In [28]:
question = {
    "Does this have case report-style information (Y)? Or is it a different type of article (N)": "case_report",
    "Do you think there might be a figure in this which has patient neuroimaging? (Y/N)": "has_imaging",
    "Do you think this has ruled out transient global amnesia, reversible amnesia symptom, drug use, toxicity, epilepsy-related confusion, psychological/psychiatric/functional amnesia? (Y/N)": "no_confounds",
    "Do you think this examined both retrograde and anterograde amnesia?": "examined_both_domains",
    "Do you think there was a good description of the amnesia severity, either qualitatively or quantitatively? (Y/N)": "good_severity_grading",
    "Do you think this did not have global cognitive impairment disproportionate to memory loss? (Y/N)": "proportionate_impairment",
    "Do you think this was unrelated to focal/global brain atrophy? (Y/N)": "neurodegenerative_ruled_out",
    "Do you think this represents a typical case of amnesia": "typical_case"
}

# 05 - Ask Questions to OpenAI

**Using the OpenAIChatEvaluator Class**

The OpenAIChatEvaluator class extends the OpenAIEvaluator class to provide additional functionality for text evaluation based on OpenAI's chat models.

Prerequisites
```
Python 3.x
OpenAI Python package
A JSON file containing labeled sections
```

Initialization
```
To initialize an instance of OpenAIChatEvaluator, you need to provide:

API Key Path: The path to a file containing your OpenAI API key.
JSON File Path: The path to a JSON file containing the labeled sections you want to evaluate.
Keys to Consider: A list of keys you want the evaluator to consider for evaluation.
Article Type: The type of article you are evaluating (e.g., 'research', 'case').
```

Methods
```
- read_json
- This method reads a JSON file from a given file path.
- json_data = evaluator.read_json('labeled_sections.json')

- get_questions
- This method generates evaluation questions based on the article_type. It returns a dictionary of questions.
- questions = evaluator.get_questions()

- send_to_openai
- This method takes a list of text chunks and sends them to OpenAI for evaluation.  It returns a list of answers corresponding to the chunks.
- answers = evaluator.send_to_openai(['chunk1', 'chunk2'])
```
____
# Workflow Example

Here's how you could use OpenAIChatEvaluator to evaluate a list of text chunks.

> python
> # Initialize the evaluator
> evaluator = OpenAIChatEvaluator('your_api_key.txt', 'labeled_sections.json', ['Introduction', 'Methods'], 'case')
>
> # Generate questions based on the article type
> questions = evaluator.get_questions()
>
> # Evaluate a list of text chunks
> chunks = ['This is a sample chunk.', 'This is another sample chunk.']
> answers = evaluator.send_to_openai(chunks)
>
> # Print the answers
> print(answers)

By following this guide, you should be able to use the OpenAIChatEvaluator class for evaluating text based on OpenAI's chat models.



In [58]:
import openai  # Make sure to install the OpenAI package
import json
import numpy as np
from tqdm import tqdm
import time

class OpenAIChatEvaluator(OpenAIEvaluator):
    """
    Class to evaluate text chunks using OpenAI's chat models.
    
    Attributes:
    - token_limit (int): The maximum number of tokens allowed in each OpenAI API call.
    - question_token (int): The number of tokens reserved for the question.
    - answer_token (int): The number of tokens reserved for the answer.
    - json_data (dict): The data read from the JSON file.
    - keys_to_consider (list): List of keys to consider from the JSON file.
    - article_type (str): The type of article (e.g., 'research', 'case').
    - questions (dict): Dictionary mapping article types to evaluation questions.
    """
    
    def __init__(self, api_key_path, json_file_path, keys_to_consider, question_type, question, token_limit=16000, question_token=500, answer_token=500):
        """
        Initializes the OpenAIChatEvaluator class.
        
        Parameters:
        - api_key_path (str): Path to the file containing the OpenAI API key.
        - json_file_path (str): Path to the JSON file containing the text data.
        - keys_to_consider (list): List of keys to consider from the JSON file.
        - article_type (str): The type of article (e.g., 'research', 'case').
        - token_limit (int): The maximum number of tokens allowed in each OpenAI API call. Default is 16000.
        - question_token (int): The number of tokens reserved for the question. Default is 500.
        - answer_token (int): The number of tokens reserved for the answer. Default is 500.
        """
        super().__init__(api_key_path)  # Call the parent class's constructor
        self.questions = question
        self.token_limit = token_limit
        self.question_token = question_token
        self.answer_token = answer_token
        self.json_path = json_file_path
        self.json_data = self.read_json(json_file_path)
        self.keys_to_consider = keys_to_consider
        self.question_type = question_type
        self.extract_relevant_text()
        self.all_answers = {}
        self.debug = False

    def read_json(self, json_file_path):
        """
        Reads JSON data from a file.
        
        Parameters:
        - json_file_path (str): Path to the JSON file containing the text data.
        
        Returns:
        - dict: The data read from the JSON file.
        """
        try:
            with open(json_file_path, 'r') as file:
                return json.load(file)
        except FileNotFoundError:
            print(f"Error: File {json_file_path} not found.")
            return {}
        except json.JSONDecodeError:
            print("Error: Could not decode the JSON file.")
            return {}

    
    def extract_relevant_text(self):
        """
        Extracts and stores relevant text sections based on keys_to_consider.
        """
        self.relevant_text_by_file = {}
        for file_name, sections in self.json_data.items():
            selected_text = ""
            for key, value in sections.items():
                if key in self.keys_to_consider:
                    selected_text += value
            self.relevant_text_by_file[file_name] = selected_text

    def evaluate_all_files(self):
        for file_name, selected_text in tqdm(self.relevant_text_by_file.items()):
            # Initialize a dictionary to store answers for this file
            self.all_answers[file_name] = {}
            if self.debug:
                print('On file:', file_name)
            
            # Chunk the text
            text_chunker = TextChunker(selected_text, np.round((self.token_limit) * 0.7))
            text_chunker.chunk_text()
            chunks = text_chunker.get_chunks()
            if self.debug:
                print('Number of chunks:', len(chunks))
            
            # Initialize a dictionary to store chunk-level answers for each question
            for question in self.questions.keys():
                self.all_answers[file_name][question] = {}

            # Send a query for each chunk
            for chunk_index, chunk in enumerate(chunks):
                if self.debug:
                    print('On chunk:', chunk_index)
                # Reset the conversation each time
                conversation = []
                conversation.append({"role": "system", "content": "You are a helpful assistant."})
                conversation.append({"role": "user", "content": f"Text Chunk: {chunk}"})
                
                # Initialize a conversation with OpenAI for this chunk
                for q_index, q in enumerate(self.questions.keys()):
                    # Use a while loop to allow 3 submission attempts
                    retry_count = 0
                    while retry_count < 3:
                        try:
                            # Add the question to the conversation and send it
                            conversation.append({"role": "user", "content": q})
                            response = openai.ChatCompletion.create(
                                model="gpt-3.5-turbo-16k",
                                messages=conversation
                            )
                            
                            # Retrieve the assistant's last answer
                            answer = response['choices'][-1]['message']['content']
                            
                            # Store the answer for this question and this chunk
                            self.all_answers[file_name][q][f"chunk_{chunk_index+1}"] = answer
                            
                            # Add the assistant's answer back to the conversation to maintain context
                            conversation.append({"role": "assistant", "content": answer})
                            
                            time.sleep(0.1)
                            break  # Exit the loop if successful
                        
                        #Handle Exceptions
                        except Exception as e:
                            if type(e).__name__ == 'RateLimitError':
                                print(f"Rate limit error: {e}. Retrying... ({retry_count+1})")
                                retry_count += 1
                                time.sleep(30)
                            else:
                                print(f"An error occurred: {e}. Retrying... ({retry_count+1})")
                                retry_count += 1
                                time.sleep(5)
                                            
                    if retry_count == 3:
                        self.all_answers[file_name][q][f"chunk_{chunk_index+1}"] = "Unidentified"
                    
        return self.all_answers
        
    def send_to_openai(self, chunks):
        """
        Sends text chunks to OpenAI for evaluation.
        
        Parameters:
        - chunks (list): List of text chunks to evaluate.
        
        Returns:
        - list: List of answers received from OpenAI.
        """
        answers = []
        for chunk in chunks:
            prompt = f"Text Chunk: {chunk}\n{self.questions}"

            try:
                response = openai.Completion.create(
                    engine="gpt-3.5-turbo-16k",
                    prompt=prompt,
                    max_tokens=self.answer_token  # Adjust as needed
                )
                decision_text = response.choices[0].text.strip()
                answers.append(decision_text)

            except Exception as e:
                print(f"An error occurred during response handling: {e}")
                answers.append("Unidentified")

        return answers
    
    def save_to_json(self, output_dict):
        """
        Saves the labeled sections to a JSON file.

        Parameters:
        - output_dict (dict): Dictionary containing the labeled sections.

        Returns:
        - None
        """
        # Create a new directory in the same root folder
        out_dir = os.path.join(os.path.dirname(self.json_path), "text_evaluations")
        os.makedirs(out_dir, exist_ok=True)
        
        # Save the dictionary to a JSON file
        with open(os.path.join(out_dir, f'{self.question_type}_evaluations.json'), 'w') as f:
            json.dump(output_dict, f, indent=4)

In [59]:
api_key_path = "/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/openai_key.txt"
json_file_path = "/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/gpt_document_reader/amnesia_cases/ocr/keyword_matching_labeled_text/labeled_sections.json"

# Define the keys you want to consider (exclude 'References')
keys_to_consider = ["Introduction", "Methods", "Results", "Discussion", "Conclusion"]  # Add or remove keys as per your requirement

# Define the type of article and questions
article_type = "research"

In [60]:
evaluator = OpenAIChatEvaluator(api_key_path, json_file_path, keys_to_consider, article_type, question)
answers = evaluator.evaluate_all_files()
evaluator.save_to_json(answers)

  3%|▎         | 3/107 [00:26<14:08,  8.16s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 176434 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


  5%|▍         | 5/107 [01:59<47:25, 27.90s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 171690 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


  7%|▋         | 8/107 [03:58<49:36, 30.07s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173860 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)
Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173121 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


  9%|▉         | 10/107 [07:05<1:30:34, 56.02s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 171560 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 11%|█         | 12/107 [08:24<1:10:00, 44.21s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 171877 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 13%|█▎        | 14/107 [09:49<1:03:01, 40.66s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 171947 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 14%|█▍        | 15/107 [11:18<1:24:26, 55.07s/it]

An error occurred: The server is overloaded or not ready yet.. Retrying... (1)
Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 174505 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 16%|█▌        | 17/107 [12:49<1:15:58, 50.65s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 176312 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)
Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 174787 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 22%|██▏       | 24/107 [16:52<31:50, 23.01s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175596 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 23%|██▎       | 25/107 [18:12<54:22, 39.78s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172904 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 27%|██▋       | 29/107 [20:34<40:14, 30.95s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175148 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 28%|██▊       | 30/107 [21:53<58:13, 45.37s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 171955 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 30%|██▉       | 32/107 [23:31<54:05, 43.27s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175179 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 32%|███▏      | 34/107 [25:02<49:35, 40.76s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175310 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 35%|███▍      | 37/107 [26:31<33:05, 28.36s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175206 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 38%|███▊      | 41/107 [28:14<22:31, 20.47s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 178141 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)
Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 177424 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 39%|███▉      | 42/107 [30:43<1:03:56, 59.03s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 177516 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 43%|████▎     | 46/107 [32:47<31:20, 30.83s/it]  

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175099 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 44%|████▍     | 47/107 [34:17<48:27, 48.46s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 174469 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 46%|████▌     | 49/107 [35:48<42:58, 44.45s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173761 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 49%|████▊     | 52/107 [37:27<29:34, 32.26s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172549 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 50%|████▉     | 53/107 [38:41<40:17, 44.76s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172603 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 51%|█████▏    | 55/107 [40:20<37:22, 43.12s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 176607 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 52%|█████▏    | 56/107 [41:39<45:58, 54.09s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175077 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 55%|█████▌    | 59/107 [43:14<27:18, 34.14s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173238 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 59%|█████▉    | 63/107 [45:40<20:08, 27.46s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 176804 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 60%|█████▉    | 64/107 [46:55<29:45, 41.53s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 171201 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 61%|██████    | 65/107 [48:21<38:32, 55.06s/it]

An error occurred: The server is overloaded or not ready yet.. Retrying... (1)


 64%|██████▎   | 68/107 [48:50<16:05, 24.75s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 178079 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 65%|██████▌   | 70/107 [50:18<19:28, 31.58s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 177501 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 66%|██████▋   | 71/107 [51:31<26:24, 44.00s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 176217 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 69%|██████▉   | 74/107 [53:33<19:37, 35.68s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173822 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 72%|███████▏  | 77/107 [55:10<14:50, 29.69s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 178404 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 75%|███████▍  | 80/107 [56:55<12:25, 27.62s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 177316 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 78%|███████▊  | 83/107 [58:43<10:45, 26.90s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172857 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 79%|███████▊  | 84/107 [1:00:19<18:17, 47.72s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175744 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 79%|███████▉  | 85/107 [1:01:44<21:39, 59.06s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 174905 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 81%|████████▏ | 87/107 [1:03:19<16:46, 50.33s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 175723 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 83%|████████▎ | 89/107 [1:05:12<15:16, 50.92s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 176275 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)
Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172268 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 84%|████████▍ | 90/107 [1:08:01<24:26, 86.28s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 177802 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 90%|████████▉ | 96/107 [1:10:22<03:57, 21.56s/it]

An error occurred: The server is overloaded or not ready yet.. Retrying... (1)
Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173422 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 91%|█████████ | 97/107 [1:11:59<07:21, 44.18s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173273 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 93%|█████████▎| 100/107 [1:13:34<03:37, 31.14s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172617 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 96%|█████████▋| 103/107 [1:15:12<01:44, 26.10s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 172223 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


 99%|█████████▉| 106/107 [1:16:43<00:24, 24.60s/it]

Rate limit error: Rate limit reached for default-gpt-3.5-turbo-16k in organization org-Y2tKyCPFO6tIjtCtOVZ7c9tr on tokens per min. Limit: 180000 / min. Current: 173952 / min. Contact us through our help center at help.openai.com if you continue to have issues.. Retrying... (1)


100%|██████████| 107/107 [1:17:54<00:00, 43.69s/it]

{'Mosimann et al. - 2012 - Fornix infarction and Korsakoff dementia after coi_OCR.txt': {'Does this have case report-style information (Y)? Or is it a different type of article (N)': {'chunk_1': 'Y'}, 'Do you think there might be a figure in this which has patient neuroimaging? (Y/N)': {'chunk_1': 'Y'}, 'Do you think this has ruled out transient global amnesia, reversible amnesia symptom, drug use, toxicity, epilepsy-related confusion, psychological/psychiatric/functional amnesia? (Y/N)': {'chunk_1': 'Based on the provided information, it is not possible to determine whether all of those possibilities have been ruled out.'}, 'Do you think this examined both retrograde and anterograde amnesia?': {'chunk_1': 'Based on the provided information, it is stated that the article discusses anterograde amnesia, but it does not mention retrograde amnesia specifically.'}, 'Do you think there was a good description of the amnesia severity, either qualitatively or quantitatively? (Y/N)': {'chunk_1':




Convert Inclusion/Exclusion Results to CSV

In [67]:
path_to_json = r'/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/gpt_document_reader/amnesia_cases/ocr/keyword_matching_labeled_text/text_evaluations/research_evaluations.json'

In [68]:
import pandas as pd
import os
import json

class InclusionExclusionSummarizer:
    """
    Class to summarize inclusion/exclusion criteria based on the answers received from GPT-3.5.
    
    Attributes:
    - json_path (str): Path to the JSON file containing the answers.
    - data (dict): The data read from the JSON file.
    - df (DataFrame): Pandas DataFrame to store summarized results.
    """
    
    def __init__(self, json_path):
        """
        Initializes the InclusionExclusionSummarizer class.
        
        Parameters:
        - json_path (str): Path to the JSON file containing the answers.
        """
        self.json_path = json_path
        self.data = self.read_json()
        self.df = self.summarize_results()
    
    def read_json(self):
        """
        Reads JSON data from a file.
        
        Returns:
        - dict: The data read from the JSON file.
        """
        with open(self.json_path, 'r') as file:
            return json.load(file)
    
    def summarize_results(self):
        """
        Summarizes the results by converting answers to binary form.
        
        Returns:
        - DataFrame: Pandas DataFrame containing the summarized results.
        """
        summary_dict = {}
        for article, questions in self.data.items():
            summary_dict[article] = {}
            for question, chunks in questions.items():
                # Convert all chunk answers to lowercase and check for "yes" keywords
                binary_answers = [1 if 'y' in answer.lower() else 0 for answer in chunks.values()]
                # Sum up the binary answers for each question
                summary_dict[article][question] = sum(binary_answers)
        
        # Convert the summary dictionary to a DataFrame
        df = pd.DataFrame.from_dict(summary_dict, orient='index')
        
        # Set all values above 0 to 1
        df[df > 0] = 1
        
        return df
    
    def drop_rows_with_zeros(self):
        """
        Drops any row in the DataFrame that contains a zero.
        
        Returns:
        - DataFrame: A new DataFrame with rows containing zeros removed.
        """
        return self.df[(self.df == 0).sum(axis=1) == 0]
    
    def save_to_csv(self, dropped=False):
        """
        Saves the DataFrame to a CSV file.
        
        Parameters:
        - dropped (bool): Indicates whether rows have been dropped from the DataFrame.
        
        Returns:
        - None
        """
        # Create a new directory in the same root folder
        out_dir = os.path.join(os.path.dirname(self.json_path), "inclusion_exclusion_results")
        os.makedirs(out_dir, exist_ok=True)
        
        # Determine the name of the CSV file based on whether rows have been dropped
        file_name = "automated_filtered_results.csv" if dropped else "unfiltered_results.csv"
        
        # Save the DataFrame to a CSV file
        csv_path = os.path.join(out_dir, file_name)
        if dropped:
            self.drop_rows_with_zeros().to_csv(csv_path)
        else:
            self.df.to_csv(csv_path)
            
    def run(self):
        """
        Executes all the summarization, saving and optional row-dropping steps in one method.
        
        Returns:
        - None
        """
        self.save_to_csv()
        self.save_to_csv(dropped=True)
        return self.df

In [69]:

summarizer = InclusionExclusionSummarizer(path_to_json)
result_df = summarizer.run()
result_df

Unnamed: 0,Does this have case report-style information (Y)? Or is it a different type of article (N),Do you think there might be a figure in this which has patient neuroimaging? (Y/N),"Do you think this has ruled out transient global amnesia, reversible amnesia symptom, drug use, toxicity, epilepsy-related confusion, psychological/psychiatric/functional amnesia? (Y/N)",Do you think this examined both retrograde and anterograde amnesia?,"Do you think there was a good description of the amnesia severity, either qualitatively or quantitatively? (Y/N)",Do you think this did not have global cognitive impairment disproportionate to memory loss? (Y/N),Do you think this was unrelated to focal/global brain atrophy? (Y/N),Do you think this represents a typical case of amnesia
Mosimann et al. - 2012 - Fornix infarction and Korsakoff dementia after coi_OCR.txt,1,1,0,1,1,1,1,1
Parkin - IMPAIRMENT OF MEMORY FOLLOWING DISCRETE THALAMIC I_OCR.txt,1,0,0,1,1,1,0,0
"Amnesia after right frontal subcortical lesion, following removal of a colloid cyst of the septum pellucidum and third ventricle_OCR.txt",1,1,0,1,1,1,1,0
Yasuda et al. - DISSOCIATION BETWEEN SEMANTIC AND AUTOBIOGRAPHIC M_OCR.txt,1,1,0,1,1,1,0,1
Kapur et al. - 1996 - Anterograde but not retrograde memory loss followi_OCR.txt,1,1,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...
Schnider et al. - 1992 - Dissociation of Color From Object in Amnesia_OCR.txt,1,0,0,1,1,0,0,0
Abe et al. - 1998 - Amnesia after a discrete basal forebrain lesion_OCR.txt,1,1,0,1,1,1,0,0
Coffey - 1989 - Hypothalamic and basal forebrain germinoma present_OCR.txt,1,1,0,1,1,1,0,0
Poreh et al. - 2006 - Anterograde and retrograde amnesia in a person wit_OCR.txt,1,1,0,1,1,1,1,1
