<h2>Pipeline Stage 2: Extract-O-Bot</h2>

<p>Welcome to stage 2 of our pipeline! After ineligible candidates are filtered out in stage 1, this stage extracts keywords from student applications and position profiles using the OpenAI API. The input is any unstructured text in those records: student resumes, student essays, job descriptions, external position summaries, and any free-text input added to student or position profiles.</p>


<p>Below the Extract-O-Bot code, you will find previously generated results that you can review without needing to run Extract-O-Bot.</p>

In [1]:
#~~~
import re  # For regular expressions
from concurrent.futures import ThreadPoolExecutor, as_completed, wait # For multithreading
import os  # For os.path operations
import time  # For time-related functions
import glob  # For file matching
import pandas as pd  # For data manipulation
from tqdm.notebook import tqdm  # For progress bar
import threading  # For threading operations
import tiktoken  # For token estimation
import ipywidgets as widgets  # For displaying widgets
from openai import OpenAI, RateLimitError, AzureOpenAI # For OpenAI API
from mypackage.utils import *  # For utility functions
import json

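# Error handling
# NOTE: this local class shadows the RateLimitError imported from openai above,
# so `except RateLimitError` in later cells matches this class, not the API's own exception.
class RateLimitError(Exception):
    pass
class TimeoutException(Exception):
    pass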

In [None]:
# Load the project path
path_to_project = load_project_path()

if path_to_project:
    print(f"Project path loaded: {path_to_project}")
else:
    print("Please set the project path in the initial notebook.")

<h3>API Key Setup</h3>

<p>Below, you need to add your own API key. You can obtain one from either Azure or OpenAI. Please note that API usage costs money.</p>


In [3]:
# OpenAI API key setup

# client = AzureOpenAI(
#     azure_endpoint = 'endpoint_goes_here',
#     api_key= 'key_goes_here',
#     api_version="2024-05-01-preview"
# )

# My API key
client = OpenAI(api_key='Your API key goes here')
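
<p>Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The variable name <code>OPENAI_API_KEY</code> and the helper <code>get_api_key</code> are assumptions for illustration and are not part of the original notebook.</p>

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var` (hypothetical helper).

    Raises a RuntimeError with a readable message if the variable is unset or empty.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Usage (commented out so the sketch runs without the openai package installed):
# client = OpenAI(api_key=get_api_key())
```

<p>Failing early with a clear message is usually easier to debug than an authentication error raised partway through a batch.</p>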

In [4]:
# Read in the data: the table produced by pipeline stage 1.

path = f'{path_to_project}/data/SP_table/SP1.parquet'


# Load the data
SP = pd.read_parquet(path)

In [None]:
print(SP.shape) 

<h3>Setting the Number of Student Applications and Job Postings for Testing</h3>

<p>I would recommend using a subset of the data for testing purposes. You can set the number of student applications and job postings you want to use for testing.</p>

<p>The number of rows processed will be <code>(num_job_postings * 2) + (num_student_applications * 2)</code>, because each unique student application and each unique job posting is sent through two separate extraction prompts.</p>
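
<p>As a quick check of the formula above, the helper below computes the expected number of extraction calls for a test subset. The function name <code>expected_extraction_calls</code> and the <code>prompts_per_profile</code> parameter are illustrative assumptions; the default of 2 reflects the two instruction sets this notebook defines for each profile type.</p>

```python
def expected_extraction_calls(num_student_applications, num_job_postings, prompts_per_profile=2):
    """Each unique application and posting is sent once per instruction set."""
    return (num_job_postings + num_student_applications) * prompts_per_profile


# 25 applications and 25 postings, two prompts each
print(expected_extraction_calls(25, 25))  # → 100
```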


In [6]:
# num_student_applications = 25
# num_job_postings = 25

# #region Using subset of SP for testing

# # take a random sample of 4 rows from SP
# # SP = SP.sample(4, random_state=40)

# # select the job posting ID column (pos_(Do Not Modify) Job Posting); duplicates are dropped below
# pos_grouped = SP['pos_(Do Not Modify) Job Posting']

# stu_grouped = SP['stu_(Do Not Modify) Application']

# # edit the pos_grouped df to only unique values
# pos_grouped = pos_grouped.drop_duplicates()

# # edit the stu_grouped df to only unique values
# stu_grouped = stu_grouped.drop_duplicates()

# # Randomly select num_job_postings unique job postings from the pos_grouped df
# sampled_job_postings = pos_grouped.sample(num_job_postings, random_state=30)

# # Randomly select num_student_applications unique applicants from the stu_grouped df
# sampled_applicants = stu_grouped.sample(num_student_applications, random_state=30)

# # edit SP to only rows where the pos_(Do Not Modify) Job Posting is in the sampled_job_postings
# SP = SP[SP['pos_(Do Not Modify) Job Posting'].isin(sampled_job_postings)]

# # edit SP to only rows where the stu_(Do Not Modify) Application is in the sampled_applicants
# SP = SP[SP['stu_(Do Not Modify) Application'].isin(sampled_applicants)]

# print('Shape')
# print(SP.shape)

# print('Unique Job Postings')
# print(SP['pos_(Do Not Modify) Job Posting'].nunique())

# print('Unique Applicants')
# print(SP['stu_(Do Not Modify) Application'].nunique())

# #endregion


<h3>Instructions for Extraction</h3>

<p>Below are all the prompts I wrote for the keyword extraction. This is where you will have the most control over the results.</p>

<p>The specific instructions I set are just something I came up with on the fly. I encourage everyone who tries this to create their own set of instructions to see what kind of results you get and how flexible the extraction can be.</p>


In [7]:
# stu technical skill and education extraction instructions
S_TS_Ed_EOB_instructions = '''
---

Task Description:  
You are an AI designed to extract and label technical skill and educational background keywords from job application profiles. Your objective is to identify and categorize the applicant's technical skills and educational background. Each technical skill should be labeled with a Relevance Score and a Skill Level Score.

---

Input Information:  
You will receive an application profile that includes the following sections:  
- Applicant Resume
- Applicant Essays 
- Education Summary

---

Extraction Criteria:

1. Technical Skills:
    - Identify keywords that represent the technical skills possessed by the applicant.
    - Each skill should be on its own line. For example, if the application states 'Qualitative & Quantitative Research Skills', Qualitative Research should be on one line, and Quantitative Research should be on another. 
    - Separate skills. If 'Data Analysis (pandas, NumPy, scikit-learn)' is in the application profile, Data Analysis, pandas, NumPy, and scikit-learn should all be on separate lines. 
    - Ensure ALL technical skills are extracted from the applicant's profile. 
    - Do not include soft skills when doing technical skill extraction.
   

2. Education:
    - Identify keywords that represent the applicant's educational background.
    - Don't only include the degree or major but also any relevant courses or academic experiences that are mentioned in the profile.

---

Labeling Criteria:

1. Technical Skills:
    - Consider the entire profile’s context when labeling the keywords. For instance, if "statistics" is extracted from the Skill/Education Summary, and the Resume shows extensive experience/interest in statistics, this context should be reflected in the Relevance and Skill Level Scores.
    - Each keyword should have 2 labels related to a Relevance Score and a Skill Level Score.
    - Relevance Score: 
        - The relevance score evaluates how well a technical skill aligns with the applicant’s field of study or industry of interest. To determine this, consider the applicant’s essays, work experience, and other relevant information in their profile.
        - Labels:
            - Low: The skill is not directly related to the applicant's experience, career goals, or industry of interest, or the skill is only mentioned one time with little to no context.
            - Medium: The skill is somewhat related to the applicant's experience, career goals, or industry of interest, but the applicant has limited experience or demonstrates a vague interest in using the skill.
            - High: The skill is directly related to the applicant's experience, career goals, or industry of interest, and the applicant has relevant experience or demonstrates a clear interest in using the skill.
    - Skill Level Score:
        - The skill level score evaluates the applicant’s proficiency in each technical skill.  
        - If not explicitly stated, assume the expected proficiency level of a skill based on the context of the application profile. If it's still unclear, default to Intermediate.
        - Labels:
            - Beginner: The applicant has basic knowledge or limited experience with the skill.
            - Intermediate: The applicant has a moderate level of knowledge or experience with the skill.
            - Advanced: The applicant has an advanced level of knowledge or extensive experience with the skill.
            
2. Education:
   - No labeling required for education keywords. Simply extract the relevant keywords.
        
---

Output Criteria:

- It is CRITICAL that you adhere strictly to the format provided below.
- Output should be in plain text without any special formatting (e.g., HTML, Markdown, LaTeX).
- Do not format your output with lists using hyphens or bullet points.
- Do not use any bold text (i.e. **bold**) or formatted headers.
- Do not add any explanations or extra text beyond the required output format.
- Do not include any additional explanations or words in the output beyond what is outlined under 'Expected Output Format:' in these instructions.

---

Expected Output Format:

Technical Skills:  
[Technical Skill Keyword], (Relevance Score: [Insert Relevance Score Here]), (Skill Level Score: [Insert Skill Level Label Here])

Education:
[Education Keyword]

---
'''

# stu Industry of Interest, Soft Skills, and Values/Motivations/Mission extraction instructions
S_In_SS_V_EOB_instructions = '''
---

Task Description:  
You are an AI designed to extract and label keywords from job application profiles related to the applicant's industry of interest, soft skills, and values/motivations/mission. 

---

Input Information:  
You will receive an application profile that includes the following sections:  
- Applicant Resume
- Applicant Essays 
- Applicant Education Summary

---

Extraction Criteria:

1. Industries of Interest:
   - Identify keywords that represent the industries the applicant is interested in.
   - Utilize information related to the applicant's field of study, career goals, work experience, and any other relevant details in the profile to extract these keywords.
   - In addition to directly extracting keywords from the document, infer and create keywords that likely correspond to the applicant's industry of interest based on the context of the application. For instance, if the applicant does not explicitly state "Business Analytics" but demonstrates significant interest in analyzing business trends, has taken coursework related to Business Analytics, and shows a likely inclination toward this field, then "Business Analytics" should be added as a keyword.
   - Ensure ALL industries of interest are extracted or inferred from the applicant's profile, reflecting a comprehensive understanding of the applicant's potential career directions.

2. Soft Skills:
    - Identify keywords that represent the soft skills possessed by the applicant.
    - Each skill should be on its own line. For example, if the application profile states 'Strong communication & organizational skills', Strong communication should be on one line, and Organizational skills should be on another.
    - Separate skills. If 'Team player (collaboration, communication)' is in the application profile, Team player, collaboration, and communication should all be on separate lines.
    - Ensure ALL soft skills are extracted from the applicant's profile.

3. Values, Motivations, and Mission:
    - Identify keywords that represent the applicant's values, motivations, and mission.

---

Labeling Criteria:

1. Industries of Interest:
   - Each industry of interest keyword should be labeled with an "Industry of Interest Score."
   - Industry of Interest Score:
      - The Industry of Interest Score measures how closely an industry keyword aligns with the applicant's academic background, work experience, and career goals.
      - To determine the score, consider the overall context of the applicant's profile, including their essays, resume, and any relevant information in the Skill/Education Summary.
      - Labels:
         - Primary: This score is assigned to industries that are directly related to the applicant’s primary field of study or career aspirations. These are the main industries the applicant is focused on and has substantial experience or demonstrated interest in.
         - Secondary: This score is assigned to industries that are closely related to the applicant’s primary industry of interest. These industries may be a natural extension or complementary field where the applicant has some experience or interest.
         - Tertiary: This score is assigned to industries that are less related to the applicant’s main interests or experiences. These may be industries the applicant has mentioned in passing or has limited experience in.
      - If the industry of interest is not explicitly stated in the profile, infer the appropriate score based on the context provided by the applicant's experience and stated goals. If the relevance is still unclear, default to labeling it as Secondary.


2. Soft Skills:
   - Each soft skill keyword should be labeled with a "Skill Level Score."
   - Skill Level Score:
      - The Skill Level Score measures the applicant's proficiency or expertise in each soft skill.
      - To determine the score, consider the full context of the applicant's profile, including their resume, essays, and any relevant information in the Skill/Education Summary. For example:
         - Beginner: This score is assigned to soft skills where the applicant has basic knowledge or limited experience. The applicant may have recently started developing this skill or has only had minimal exposure to it in academic or work settings.
         - Intermediate: This score is assigned to soft skills where the applicant has a moderate level of proficiency. The applicant has demonstrated a working knowledge of the skill and has applied it in various contexts, but may not yet be highly proficient.
         - Advanced: This score is assigned to soft skills where the applicant has extensive experience and a high level of proficiency. The applicant has a deep understanding of the skill, often demonstrated through significant accomplishments, leadership, or repeated successful application in complex situations.
   - If the proficiency level of a skill is not explicitly stated in the profile, infer the appropriate score based on the overall context provided by the applicant's experience and the nature of the skill. If the level of proficiency is still unclear, default to labeling it as Intermediate.


3. Values, Motivations, and Mission:
    - No labeling is necessary for keywords in this category. Simply extract the relevant keywords.

---

Output Criteria:

- It is CRITICAL that you adhere strictly to the format provided below.
- Output should be in plain text without any special formatting (e.g., HTML, Markdown, LaTeX).
- Do not format your output with lists using hyphens or bullet points.
- Do not use any bold text (i.e. **bold**) or formatted headers.
- Do not add any explanations or extra text beyond the required output format.
- Do not include any additional explanations or words in the output beyond what is outlined under 'Expected Output Format:' in these instructions.

---

Expected Output Format:

Industries of Interest:
[Industry Keyword], (Industry Label: [Insert Industry Label Here])

Soft Skills:  
[Soft Skill Keyword], (Skill Level Score: [Insert Skill Level Label Here])

Values, Motivations, and Mission:
[Value Keyword]

---
'''

# pos technical skill extraction instructions
P_TS_EOB_instructions = ''' 
---

Task Description:  
You are an AI designed to extract and label technical skill keywords from job position profiles. The profiles contain information related to a particular job position. Your objective is to identify and categorize the technical skills required or preferred for the position. Each technical skill should be labeled with a Necessity Score and a Skill Level Score.

---

Input Information:  
You will receive a job position profile that includes the following sections:  
1. Company Name  
2. Position Title  
3. Position Description
4. Other Skill/Requirements/Preferences Summary


---

Extraction Criteria:
    - Identify keywords that represent the technical skills required or preferred for the position.
    - If the Position Title includes relevant technical skills, extract them as keywords.
    - Treat the Position Description and Other Skill/Requirements/Preferences Summary sections as distinct entities. Extract keywords separately from each section. However, do not duplicate keywords.
    - Each skill should be on its own line. For example, if the position profile states 'Qualitative & Quantitative Research Skills', Qualitative Research should be on one line, and Quantitative Research should be on another. 
    - Separate skills. If 'Data Analysis (pandas, NumPy, scikit-learn)' is in the position profile, Data Analysis, pandas, NumPy, and scikit-learn should all be on separate lines. 
    - Include education requirements/preferences in the extraction. For example, if the job position requires a Bachelor's degree in Computer Science, extract 'Computer Science' as a technical skill. If a job position states 'We are looking for candidates with a background in Finance or Economics,' extract 'Finance' and 'Economics' as technical skills.
    - Include experience requirements/preferences in the extraction. For example, if the position states 'We are looking for candidates with experience in business analytics,' extract 'Business Analytics' as a technical skill.
    - Ensure ALL technical skills are extracted from the position profile. 

---

Labeling Criteria:
- Consider the entire profile’s context when labeling the keywords. For instance, if "statistics" is extracted from the Position Summary, and the Position Description states that the applicant should have experience in statistics, this context should be reflected in the Necessity and Skill Level Scores.
- Each keyword should have 2 labels related to a Necessity Score and a Skill Level Score.
- Necessity Score:
    - The necessity score evaluates how important a technical skill is to the job position. 
    - If not explicitly stated, assume the necessity of skills based on the context of the job position profile. If it's still unclear, default to Medium.
    - Labels:
        - High: The skill is a critical requirement for the job position.
        - Medium: The skill is a preferred requirement for the job position.
        - Low: The skill is a nice addition but not a requirement for the job position.
- Skill Level Score: 
    - The skill level score evaluates the expected proficiency level of each technical skill.
    - If not explicitly stated, assume the expected proficiency level of a skill based on the context of the job position profile. If it's still unclear, default to Intermediate.
       - Beginner: Applicants are expected to have basic knowledge or limited experience with the skill.
       - Intermediate: Applicants are expected to have a moderate level of knowledge or experience with the skill.
       - Advanced: Applicants are expected to have an advanced level of knowledge or extensive experience with the skill.
        
---

Output Criteria:

- It is CRITICAL that you adhere strictly to the format provided below.
- Output should be in plain text without any special formatting (e.g., HTML, Markdown, LaTeX).
- Do not format your output with lists using hyphens or bullet points.
- Do not use any bold text (i.e. **bold**) or formatted headers.
- Do not add any explanations or extra text beyond the required output format.
- Do not include any additional explanations or words in the output beyond what is outlined under 'Expected Output Format:' in these instructions.

---

Expected Output Format:  

Technical Skills:  
[Technical Skill Keyword], (Necessity Score: [Insert Necessity Score Here]), (Skill Level Score: [Insert Skill Level Label Here])

---

'''

# pos Industry of Interest, Soft Skills, and Company Values/Culture/Mission extraction instructions
P_In_SS_V_EOB_instructions = '''
---

Task Description:  
You are an AI designed to extract and label keywords from job position profiles for internships. Your objective is to identify and categorize keywords into three groups: Industry, Soft Skills, and Company Values/Culture/Mission. Each keyword category has specific extraction and labeling criteria.

---

Input Information:  
You will receive a job position profile that includes the following sections:  
1. Company Name  
2. Position Title  
3. Position Description  
4. Position Other Skill/Requirements/Preferences

---

Extraction Criteria:

1. Industry:
   - Identify keywords related to the industry of the job.
   - If the Position Title includes information relevant to the job's industry focus, extract them as keywords.
   - Treat the Position Description and Position Other Skill/Requirements/Preferences Summary sections as distinct entities. Extract keywords separately from each section. However, do not duplicate keywords.
   
2. Soft Skills:
   - Identify keywords that represent the soft skills required or preferred for the job position.
   - Each soft skill should be on its own line. For example, if the position profile states 'Strong communication & organizational skills', extract 'Strong communication' as one keyword and 'Organizational skills' as another.
   - Separate skills. If 'Team player (collaboration, communication)' is in the position profile, extract 'Team player', 'Collaboration', and 'Communication' as separate keywords.
   - Ensure ALL soft skills required or preferred for the position are extracted from the profile.
   - Treat the Position Description and Other Skill/Requirements/Preferences Summary sections as distinct entities. Extract keywords separately from each section. However, do not duplicate keywords.

3. Company Values, Culture, and Mission:
   - Identify keywords that represent the company's values, culture, and mission.

---

Labeling Criteria:

1. Industry:
   - Label each industry keyword with:
     - Primary: The main industries of the job.
     - Secondary: Closely related industries.
     - Tertiary: Industries less related to the main industries.
   - If not explicitly stated, assume the industry label (Primary, Secondary, or Tertiary) based on the context of the job position profile. If it's still unclear, default to Secondary.
   
2. Soft Skills:
   - Each soft skill keyword should be labeled with a "Skill Level Score."
   - Skill Level Score:
      - The Skill Level Score measures the level of proficiency or expertise required for each soft skill in the job position.
      - To determine the score, consider the full context of the job position profile, including the Position Description and Position Skill/Requirements/Preferences Summary. For example:
         - Beginner: This score is assigned to soft skills where basic knowledge or limited experience is required. The position may be entry-level or involve tasks that don't require advanced proficiency.
         - Intermediate: This score is assigned to soft skills where a moderate level of proficiency is required. The position may require a working knowledge of the skill, with some experience applying it in various contexts.
         - Advanced: This score is assigned to soft skills where a high level of proficiency is required. The position may involve complex tasks, leadership roles, or situations where extensive experience and expertise in the skill are necessary.
   - If the required proficiency level of a skill is not explicitly stated in the profile, infer the appropriate score based on the overall context provided by the job description and requirements. If the level of proficiency is still unclear, default to labeling it as Intermediate.

3. Company Values, Culture, and Mission:
   - No labeling is required for these keywords; simply extract the relevant terms.


---

Output Expectations:

- It is CRITICAL that you adhere strictly to the format provided below.
- Output should be in plain text without any special formatting (e.g., HTML, Markdown, LaTeX).
- Do not format your output with lists using hyphens or bullet points.
- Do not use any bold text (i.e. **bold**) or formatted headers.
- Do not add any explanations or extra text beyond the required output format.
- Do not include any additional explanations or words in the output beyond what is outlined under 'Expected Output Format:' in these instructions.

---

Expected Output Format:

Industries:  
[Industry Keyword], (Industry Label: [Insert Industry Label Here])

Soft Skills:  
[Soft Skill Keyword], (Skill Level Score: [Insert Skill Level Label Here])

Company Values, Culture, and Mission:  
[Company Values, Culture, and Mission Keyword]

---
'''
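
<p>Because every instruction set asks for lines of the form <code>[Keyword], (Label Name: Value), ...</code>, the model's reply can be split into keywords and labels with a small regular expression. The sketch below is only an illustration of that idea, not the notebook's own <code>process_keywords</code> function; the names <code>LABEL_PATTERN</code> and <code>parse_keyword_line</code> are hypothetical.</p>

```python
import re

# Matches "(Label Name: Value)" groups such as "(Skill Level Score: Advanced)".
# Parenthesised text without a colon, e.g. "(programming)", is left in the keyword.
LABEL_PATTERN = re.compile(r"\(\s*([^:()]+?)\s*:\s*([^()]+?)\s*\)")


def parse_keyword_line(line):
    """Split one output line into (keyword, {label_name: label_value})."""
    labels = {name: value for name, value in LABEL_PATTERN.findall(line)}
    # The keyword is everything before the first labelled group, minus the trailing comma.
    first = LABEL_PATTERN.search(line)
    keyword = line[: first.start()] if first else line
    return keyword.strip().rstrip(",").strip(), labels


keyword, labels = parse_keyword_line(
    "Python, (Relevance Score: High), (Skill Level Score: Advanced)"
)
print(keyword, labels)
```

<p>Unlabeled categories such as Education come back with an empty label dictionary, so the same helper can be applied to every section of the reply.</p>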


<h3>Grouped Keyword Extraction Code: All Qualitative Data</h3>

<p>This code is part of the Extract-O-Bot pipeline stage, with the primary objective of extracting keywords from student applications and job postings based on specific instructions. The extraction process is tailored to identify and categorize keywords related to technical skills, education, industry of interest, soft skills, and company values/culture/mission.</p>

<h4>Key Extraction Instructions:</h4>

<ul>
    <li><strong>Student Technical Skills and Education (S_TS_Ed):</strong> Extracts and labels technical skill and educational background keywords from student profiles. The extraction is based on the resume, essays, and education summary provided by the applicant. Technical skills are labeled with a Relevance Score and a Skill Level Score, while educational keywords are simply extracted without labels.</li>
    <li><strong>Student Industry of Interest, Soft Skills, and Values/Motivations/Mission (S_In_SS_V):</strong> Extracts and labels keywords related to the applicant's industry of interest, soft skills, and values/motivations/mission. Keywords are inferred from the student's entire profile, and specific labeling criteria are applied to each category.</li>
    <li><strong>Position Technical Skills (P_TS):</strong> Extracts and labels technical skill keywords from job position profiles. The process considers various sections of the job description, including the position title, description, and any additional requirements or preferences.</li>
    <li><strong>Position Industry of Interest, Soft Skills, and Company Values/Culture/Mission (P_In_SS_V):</strong> Extracts and labels keywords related to the industry of the job, required soft skills, and the company's values, culture, and mission. Each keyword category follows specific extraction and labeling criteria.</li>
</ul>

<h4>Key Functions:</h4>

<ul>
    <li><strong>load_and_combine_saved_dfs:</strong> Loads and combines multiple saved DataFrames from a specified directory, removing duplicates. This function is crucial for resuming processing from where it left off in case of timeouts or interruptions.</li>
    <li><strong>restart_processing:</strong> Restarts the keyword extraction process by loading previously saved progress and updating the DataFrame to mark already processed profiles. It ensures that the process continues smoothly without reprocessing already completed tasks.</li>
    <li><strong>save_temp_dataframe:</strong> Saves a temporary DataFrame containing only the processed rows to disk, used primarily in case of timeouts to ensure progress isn't lost.</li>
    <li><strong>process_keywords:</strong> Organizes and categorizes the extracted keywords into predefined categories, handling special cases for specific keyword formatting.</li>
    <li><strong>replenish_rps & replenish_tps:</strong> Continuously replenishes the rate and token limit semaphores to allow more requests and tokens per second, ensuring the API limits are respected while processing the data.</li>
    <li><strong>extract_o_bot:</strong> The core function that handles the interaction with the OpenAI API, sending formatted profiles for keyword extraction, managing rate limits, and handling errors like rate limit exceeded or generic failures.</li>
    <li><strong>clean_text:</strong> Prepares and formats the text data for a given profile, ensuring it is ready for keyword extraction. The formatting is customized based on whether the profile is a student or a position.</li>
    <li><strong>extract_keywords_single_profile:</strong> Extracts keywords from a single profile by first cleaning the text and then processing the keywords. The function handles different return cases based on the profile type and keyword category.</li>
    <li><strong>process_batch & process_batch_with_timeout:</strong> Processes batches of profiles for keyword extraction using concurrent processing. The functions manage batch sizes based on token limits and RPM constraints, incorporating a timeout mechanism to handle potential long-running operations.</li>
    <li><strong>get_batch:</strong> Manages the extraction of keywords for profiles in the DataFrame by processing them in batches. It calculates the batch size based on token and RPM constraints and handles the processing efficiently.</li>
    <li><strong>execute_keyword_extractions:</strong> Coordinates the keyword extraction process across multiple types of profiles and keyword categories, iterating through predefined extraction jobs.</li>
    <li><strong>estimate_tokens:</strong> Estimates the number of tokens required to process a given text using a specified model, aiding in planning and resource allocation.</li>
    <li><strong>calculate_profile_tokens:</strong> Calculates the estimated token counts for different types of profile extractions, assigning token counts based on the profile type and model used.</li>
    <li><strong>estimate_processing_time:</strong> Estimates the total processing time required based on the number of tokens and the API rate limits, helping to manage expectations and scheduling.</li>
</ul>
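
<p>The semaphore replenishment described for <code>replenish_rps</code> and <code>replenish_tps</code> can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's implementation: the helper name <code>start_replenisher</code> and its parameters are hypothetical, and it handles a single permit pool rather than separate request and token pools.</p>

```python
import threading


def start_replenisher(semaphore, permits_per_interval, stop_event, interval_s=1.0):
    """Release up to `permits_per_interval` permits every `interval_s` seconds.

    `semaphore` is expected to be a threading.BoundedSemaphore, whose bound
    caps how many unused permits can accumulate.
    """

    def _replenish():
        # Event.wait returns False on timeout and True once stop_event is set.
        while not stop_event.wait(interval_s):
            for _ in range(permits_per_interval):
                try:
                    semaphore.release()
                except ValueError:
                    break  # the bounded semaphore is already full

    thread = threading.Thread(target=_replenish, daemon=True)
    thread.start()
    return thread


# Example: allow bursts of at most 2 requests, refilled every second.
request_permits = threading.BoundedSemaphore(2)
stop = threading.Event()
replenisher = start_replenisher(request_permits, permits_per_interval=2, stop_event=stop)

request_permits.acquire()  # a worker would do this before each API call
stop.set()                 # stop replenishing when processing is finished
replenisher.join(timeout=2)
```

<p>Workers block in <code>acquire()</code> when the pool is empty, which throttles them to the replenishment rate without any explicit sleep calls in the request code.</p>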

<h4>Additional Details:</h4>
<ul>
    <li><strong>Model Selection:</strong> The script uses the 'gpt-4o-mini' model by default for testing because it is substantially cheaper per token than 'gpt-4o'. For final runs, 'gpt-4o' can be used for better output quality.</li>
    <li><strong>API Rate Limits:</strong> Although the account is on a higher API tier, rate-limit errors encountered in practice led to a conservative threshold of 150 requests per minute, trading some throughput for stability.</li>
    <li><strong>Token and Price Estimation:</strong> The script includes detailed calculations for token usage and cost estimation based on the selected model, ensuring users are aware of the processing requirements and associated costs.</li>
    <li><strong>Final Save and Data Combination:</strong> After processing, the final DataFrame is saved and combined with any previously saved output, ensuring that the results are consolidated and ready for further analysis or use.</li>
</ul>
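
<p>The time and cost estimates mentioned above come down to simple arithmetic. The sketch below shows one plausible way to compute them and is not the notebook's <code>estimate_processing_time</code>; the function names are hypothetical, and the limits and per-token prices are passed in as parameters because they depend on your account tier and on current pricing.</p>

```python
def estimate_minutes(total_tokens, total_requests, max_tokens_per_minute, max_rpm):
    """Lower bound on run time in minutes: whichever limit is hit first dominates."""
    token_minutes = total_tokens / max_tokens_per_minute
    request_minutes = total_requests / max_rpm
    return max(token_minutes, request_minutes)


def estimate_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Cost in the currency of the supplied prices (prices are per one million tokens)."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


# Illustrative numbers only: 100 requests, 300,000 tokens,
# an assumed limit of 200,000 tokens per minute, and the 150 RPM threshold.
print(estimate_minutes(300_000, 100, 200_000, 150))  # → 1.5
```

<p>In this example the token limit, not the request limit, determines the run time, which is typical when profiles contain long resumes or essays.</p>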

<p>This code automates the extraction of relevant keywords from student and job profiles, facilitating a more efficient and scalable approach to matching candidates with opportunities. The extensive use of threading, error handling, and resource management ensures that the process is robust and can handle large datasets within API constraints.</p>
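
<p>To make the threading and timeout handling concrete, here is a minimal sketch of running one batch in a thread pool with an overall deadline. It illustrates the general pattern only and is not the notebook's <code>process_batch_with_timeout</code>; the function name <code>run_batch_with_timeout</code> and the <code>worker</code> callable are hypothetical, and items are assumed to be hashable (for example, profile IDs).</p>

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_batch_with_timeout(items, worker, timeout_s, max_workers=4):
    """Run worker(item) for each item; return (results, errors, unfinished)."""
    results, errors, unfinished = {}, {}, []
    pool = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = {pool.submit(worker, item): item for item in items}
        # Wait for the whole batch, but no longer than timeout_s seconds.
        done, not_done = wait(futures, timeout=timeout_s)
        for future in done:
            item = futures[future]
            exc = future.exception()
            if exc is not None:
                errors[item] = exc  # keep the failure so the item can be retried
            else:
                results[item] = future.result()
        unfinished = [futures[future] for future in not_done]
    finally:
        # Do not block on stragglers; work that has not started is cancelled (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    return results, errors, unfinished


def fake_extract(profile_id):
    """Stand-in for an API call; fails for one ID to show error collection."""
    if profile_id == 3:
        raise ValueError("simulated failure")
    return f"keywords for {profile_id}"


results, errors, unfinished = run_batch_with_timeout([1, 2, 3], fake_extract, timeout_s=5)
print(sorted(results), sorted(errors), unfinished)
```

<p>Returning failed and unfinished items separately lets the caller save completed rows to disk and requeue only the remainder, which matches the resume-after-timeout behaviour described above.</p>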


In [None]:
# grouped keyword extraction code. All qualitative data.

def load_and_combine_saved_dfs(temp_dir, base_filename):
    """
    Function Name: load_and_combine_saved_dfs
    Purpose/Description:
        Loads and combines multiple saved DataFrames from the specified directory.
        The DataFrames are combined and duplicates are removed.
    Parameters:
        - temp_dir (str): The directory where the saved DataFrames are located.
        - base_filename (str): The base filename pattern to match when loading files.
    Return Value:
        - pandas.DataFrame: A combined DataFrame containing all the loaded data, or an empty DataFrame if no files are found.
    """
    
    # Find all files matching the pattern in the specified directory
    all_files = glob.glob(os.path.join(temp_dir, f"{base_filename}_*.parquet"))
    
    # Load all found DataFrames into a list
    df_list = [pd.read_parquet(file) for file in all_files]
    
    # If any DataFrames were loaded, combine them and remove duplicates
    if df_list:
        combined_df = pd.concat(df_list, ignore_index=True).drop_duplicates()
        print(f"Combined {len(df_list)} saved DataFrames.")
    else:
        combined_df = pd.DataFrame()  # If no files were found, return an empty DataFrame
    
    return combined_df


def restart_processing(dataframe, pbar, timeout_dir, periodic_save_dir, base_filename, max_tokens_per_minute, max_rpm, profile_type, keyword_cats, assistant_id):
    """
    Function Name: restart_processing
    Purpose/Description:
        Restarts the keyword extraction process by loading previously saved progress and resuming from where it left off.
        Updates the DataFrame to mark already processed profiles and continues processing the remaining ones.
    Parameters:
        - dataframe (pandas.DataFrame): The DataFrame containing profiles to process.
        - pbar (tqdm.tqdm): The progress bar object to update progress.
        - timeout_dir (str): The directory where temporary saved files are located.
        - periodic_save_dir (str): The directory to save periodic checkpoint files.
        - base_filename (str): The base name for saving and loading checkpoint files.
        - max_tokens_per_minute (int): The maximum number of tokens that can be processed per minute.
        - max_rpm (int): The maximum number of requests that can be processed per minute.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
        - keyword_cats (str): The category of keywords to extract.
        - assistant_id (str): The ID of the assistant responsible for keyword extraction.
    """
    
    # Load and combine previously saved DataFrames
    combined_df = load_and_combine_saved_dfs(timeout_dir, base_filename)
    
    # Determine the correct ID column based on the profile type
    id_column = f'{profile_type}_(Do Not Modify) Job Posting' if profile_type == 'pos' else f'{profile_type}_(Do Not Modify) Application'
    
    # Drop duplicates to ensure unique profiles (an empty DataFrame has no ID column to deduplicate on)
    combined_unique_profiles = combined_df if combined_df.empty else combined_df.drop_duplicates(subset=[id_column])
    
    # Define the processed flag column for the keyword category
    processed_flag = f"{keyword_cats}_processed"
    
    # Mark profiles as processed if they exist in the combined DataFrame
    if not combined_df.empty:
        dataframe.loc[dataframe[id_column].isin(combined_df[id_column]), processed_flag] = True
    
    # Identify remaining profiles that need to be processed
    remaining_df = dataframe[~dataframe[processed_flag]]
    
    # Drop duplicates from the remaining profiles
    remaining_unique_profiles = remaining_df.drop_duplicates(subset=[id_column])

    # Check if there are any profiles left to process
    if remaining_df.empty:
        print("No remaining rows to process. All data has been processed.")
    else:
        print(f"{len(combined_unique_profiles)} profiles have been processed for {keyword_cats} extraction. {len(remaining_unique_profiles)} profiles remaining.")
        # Continue processing the remaining profiles
        get_batch(remaining_df, pbar, timeout_dir, periodic_save_dir, base_filename, max_tokens_per_minute, max_rpm, profile_type, keyword_cats, assistant_id)


def save_temp_dataframe(dataframe, timeout_dir, base_filename, keyword_cats):
    """
    Function Name: save_temp_dataframe
    Purpose/Description:
        Saves a temporary DataFrame containing only the processed rows to disk. Used in case of timeouts to save progress.
    Parameters:
        - dataframe (pandas.DataFrame): The DataFrame containing profiles to save.
        - timeout_dir (str): The directory where the temporary files will be saved.
        - base_filename (str): The base name for the saved file.
        - keyword_cats (str): The category of keywords that have been processed.
    Return Value:
        - str: The path to the saved file, or None if no rows were saved.
    """
    
    # Filter the DataFrame to include only processed rows
    processed_dataframe = dataframe[dataframe[f"{keyword_cats}_processed"] == True]
    
    if not processed_dataframe.empty:
        # Generate a timestamp for the filename
        timestamp = time.strftime("%Y%m%d-%H%M%S")
        # Create the full path for the temporary file
        temp_path = os.path.join(timeout_dir, f"{base_filename}_{timestamp}.parquet")
        # Save the processed DataFrame to the temporary file
        processed_dataframe.to_parquet(temp_path)
        print(f"Saved processed rows to {temp_path} due to timeout.")
        return temp_path
    else:
        print("No rows to save; skipping save.")
        return None


def process_keywords(keywords, keyword_cat):
    """
    Function Name: process_keywords
    Purpose/Description:
        Processes extracted keywords, organizing them into predefined categories.
        Handles special cases for specific keyword formatting.
    Parameters:
        - keywords (str): The raw keyword string extracted from the profile.
        - keyword_cat (str): The category of keywords being processed (e.g., 'S_TS_Ed', 'P_TS').
    Return Value:
        - tuple: Processed keywords for each category based on the input keyword_cat.
    """
    
    #region Initial processing for special cases. Add more as needed.
    keywords = keywords.replace('Natural Language Processing (NLP)', 'Natural Language Processing')
    keywords = keywords.replace('(NLP)', 'Natural Language Processing')
    #endregion
    
    # Dictionary to store the processed keywords for each category
    processed_keywords = {
        'Technical Skills:': "",
        'Education:': "",
        'Industries of Interest:': "",
        'Soft Skills:': "",
        'Values, Motivations, and Mission:': "",
        'Industries:': "",
        'Company Values, Culture, and Mission:': ""
    }

    # Split the input keywords into lines
    lines = keywords.splitlines()
    # Remove empty lines and strip whitespace
    lines = [line.strip() for line in lines if line.strip()]
    current_section = None

    # Iterate through the lines to categorize them
    for line in lines:
        line = line.strip()  # Trim whitespace from the start and end
        if line in processed_keywords:
            current_section = line  # Set the current section based on the header
        elif current_section:
            # Add the line to the appropriate section
            processed_keywords[current_section] += line + "\n"
    
    # Return results based on the keyword category
    if keyword_cat == 'S_TS_Ed':
        return (
            processed_keywords['Technical Skills:'].strip(),
            processed_keywords['Education:'].strip()
        )
    elif keyword_cat == 'S_In_SS_V':
        return (
            processed_keywords['Industries of Interest:'].strip(),
            processed_keywords['Soft Skills:'].strip(),
            processed_keywords['Values, Motivations, and Mission:'].strip()
        )
    elif keyword_cat == 'P_TS':
        return processed_keywords['Technical Skills:'].strip()
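    elif keyword_cat == 'P_In_SS_V':
        return (
            processed_keywords['Industries:'].strip(),
            processed_keywords['Soft Skills:'].strip(),
            processed_keywords['Company Values, Culture, and Mission:'].strip()
        )
    else:
        # Fail loudly rather than silently returning None for an unrecognized category
        raise ValueError(f"Unknown keyword category: {keyword_cat!r}")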


def replenish_rps():
    """
    Function Name: replenish_rps
    
    Purpose/Description:
    Continuously replenishes the rate limit semaphore so that more requests can be issued.
    The function waits for the signal that requests have started, then tops the semaphore
    back up to `rps` permits every 5 seconds.
    These values are set in the 'Threads for resource management' region.

    Parameters:
    None

    Return Value:
    None
    """
    while True:
        time.sleep(0.1)  # Short delay to reduce CPU usage
        requests_started.wait()  # Wait until the request process starts
        while True:
            time.sleep(5)  # Wait before replenishing the rate limit
            for _ in range(rps - rate_limit._value):
                rate_limit.release()  # Release the semaphore to allow more requests


def replenish_tps():
    """
    Function Name: replenish_tps
    
    Purpose/Description:
    Continuously replenishes the token limit semaphore to allow more tokens per second (TPS) 
    to be processed. This function runs in a loop, waiting for the signal that requests 
    have started and then replenishing the semaphore periodically. 
    These values are set in the 'Threads for resource management' region. 

    Parameters:
    None

    Return Value:
    None
    """
    while True:
        time.sleep(0.1)  # Short delay to reduce CPU usage
        requests_started.wait()  # Wait until the request process starts
        while True:
            time.sleep(1)  # Wait before replenishing the token limit
            tokens_to_add = tps - token_limit._value  # Calculate how many tokens to add
            if tokens_to_add > 0:
                token_limit.release(tokens_to_add)  # Replenish the semaphore with the required tokens


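def extract_o_bot(formatted_profile, assistant_id, row, keyword_cats):
    """
    Function Name: extract_o_bot
    Purpose/Description:
        Sends a formatted profile to the assistant, polls the run until it completes, and returns the assistant's reply.
        Waits on the token and request semaphores before sending, and retries when the API reports a rate limit.
    Parameters:
        - formatted_profile (str): The formatted profile text to send to the assistant.
        - assistant_id (str): The ID of the assistant responsible for keyword extraction.
        - row (pandas.Series): The profile row, used to look up the estimated token count.
        - keyword_cats (str): The category of keywords to extract.
    Return Value:
        - tuple: The assistant's reply text, total tokens used, input tokens used, and output tokens used.
    """
    global total_tokens_used, total_input_tokens_used, total_output_tokens_used, threads_in_loop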
    try:
        
        tokens_needed = row[f'TokenCount_{keyword_cats}']
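        # Note: if token_limit is a plain threading.Semaphore, acquire() takes (blocking, timeout) rather than a count,
        # so this call takes a single permit regardless of tokens_needed
        token_limit.acquire(tokens_needed)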
        rate_limit.acquire()
        
        if not requests_started.is_set():
            requests_started.set()  # Signal that requests have started
            
        while True:
            try:
                # Create a new thread and send the formatted profile as a message
                thread = client.beta.threads.create()
                message = client.beta.threads.messages.create(thread_id=thread.id, content=formatted_profile, role="user")
                run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant_id)
                break  # Exit loop if successful
            except Exception as e:
                # Handle rate limit exceeded error (HTTP 429)
                if '429' in str(e):
                    print("Rate limit exceeded. Retrying after a delay.")
                    print("Error:", e)
                    time.sleep(5)  # Wait before retrying
                    continue
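                # Handle bad request error (HTTP 400)
                elif '400' in str(e):
                    print(e)
                    time.sleep(2)  # Brief pause so repeated failures do not spin in a tight loop
                    continue
                else:
                    raise  # Re-raise other exceptions with the original traceback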
                
        with threads_lock:
            threads_in_loop += 1  # Increment active threads count
        
        while True:
            time.sleep(2)
            try:
                # Check the status of the run
                run_status = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
                if run_status.status == "completed":
                        
                    # Update the total tokens used globally
                    total_input_tokens_used_in_run = run_status.usage.prompt_tokens
                    total_input_tokens_used += total_input_tokens_used_in_run
                    
                    total_output_tokens_used_in_run = run_status.usage.completion_tokens
                    total_output_tokens_used += total_output_tokens_used_in_run
                    
                    total_tokens_used_in_run = run_status.usage.total_tokens
                    total_tokens_used += total_tokens_used_in_run
                    
                    break  # Exit loop when processing is complete

                elif run_status.status == "failed":
                    wait_time = 0
                    # Handle rate limit exceeded within the run
                    if run_status.last_error.code == 'rate_limit_exceeded':
                        # The error message suggests a retry delay in seconds ("in 1.5s") or milliseconds ("in 250ms")
                        match1 = re.search(r'in (\d+(?:\.\d+)?)s', run_status.last_error.message)
                        match2 = re.search(r'in (\d+)ms', run_status.last_error.message)
                        if match1:
                            wait_time = float(match1.group(1)) + 1
                        elif match2:
                            wait_time = float(match2.group(1)) / 1000.0 + 1  # Convert milliseconds to seconds
                        else:
                            wait_time = 5  # No delay found in the message; fall back to a fixed wait
                        print(run_status.last_error.message)
                        print('#ERROR3#: Rate limit exceeded. Waiting for', wait_time, 'seconds.')
                    # Handle a generic error message
                    elif 'Sorry, something went wrong' in str(run_status.last_error):
                        print("Something went wrong. Trying again")
                        wait_time = 2
                    else:
                        print("Run failed. Trying again", run_status.last_error)
                    # A failed run never completes, so stop polling it and retry with a fresh thread
                    with threads_lock:
                        threads_in_loop -= 1
                    time.sleep(wait_time)
                    return extract_o_bot(formatted_profile, assistant_id, row, keyword_cats)
            except RateLimitError as e:
                print('#ERROR2#: Rate limit exceeded.')
                time.sleep(2)
                with threads_lock:
                    threads_in_loop -= 1
                return extract_o_bot(formatted_profile, assistant_id, row, keyword_cats)   
            except Exception as e:
                if 'rate_limit_exceeded' in str(e):
                    print('#ERROR2#: Rate limit exceeded. waiting and retrying.')
                    with threads_lock:
                        threads_in_loop -= 1
                    time.sleep(10)
                    return extract_o_bot(formatted_profile, assistant_id, row, keyword_cats)
                else:
                    print("Error:", e)
                    raise e  # Raise other exceptions
                
        with threads_lock:
            threads_in_loop -= 1  # Decrement active threads count

        # Retrieve and return the result message from the assistant
        messages = client.beta.threads.messages.list(thread_id=thread.id)
        for message in reversed(messages.data):
            role = message.role  
            for content in message.content:
                if content.type == 'text' and role == 'assistant':
                    return content.text.value, total_tokens_used_in_run, total_input_tokens_used_in_run, total_output_tokens_used_in_run

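    except RateLimitError as e:
        print('#ERROR1#: Rate limit exceeded.')
        time.sleep(3)
        # Retry so the caller always receives a result tuple instead of None
        return extract_o_bot(formatted_profile, assistant_id, row, keyword_cats)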
    except Exception as e:
        if 'rate_limit_exceeded' in str(e):
            time.sleep(3)
            return extract_o_bot(formatted_profile, assistant_id, row, keyword_cats)
        else:
            raise e  # Raise other exceptions if not related to rate limit


def clean_text(row, profile_type):
    """
    Function Name: clean_text
    Purpose/Description:
        Cleans and formats the text data for a given profile, preparing it for keyword extraction.
        The formatting is based on the profile type (student or position).
    Parameters:
        - row (pandas.Series): A row from the DataFrame containing profile data.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
    Return Value:
        - str: A formatted string containing the relevant information from the profile.
    """

    if profile_type == 'stu':
        # Format the text for a student profile
        formatted_text = f"""
        ----------------------------------------
        Applicant Profile:
        ----------------------------------------

        Applicant Name: {row['stu_Legal Name']}

        Applicant Resume: 
        ----------------------------------------
        {row['stu_Resume_text']}

        Applicant Essays:
        ----------------------------------------

        Essay - Dream Companies
        {row['stu_Essay - Dream Companies']}

        Essay - Experience in field of Study
        {row['stu_Essay - Experience in field of Study']}

        Essay - Influence on interest in the field
        {row['stu_Essay - Influence on interest in the field']}

        Essay - Internship Career Goals
        {row['stu_Essay - Internship Career Goals']}

        Essay - Overall Career Goals
        {row['stu_Essay - Overall Career Goals']}

        Applicant Skill/Education Summary
        ----------------------------------------

        {row['stu_position education summary']}
        """
        return formatted_text

    elif profile_type == 'pos':
        # Format the text for a position profile
        formatted_text = f""" 
        -----------------------------------------------
        Position Profile 
        -----------------------------------------------

        Company Name: {row['pos_Company']}
        Position Title: {row['pos_Name']}

        Position Description:
        -----------------------------------------------

        {row['pos_Job_desc_text']}

        Other Skills/Requirements/Preferences Summary
        -----------------------------------------------

        Other requirements/preferences: 
        {row['pos_Other requirements/preferences']}

        Other Skills:
        {row['pos_Other Skills']}

        Position Summary:
        {row['pos_External Position Summary']}
        """
        return formatted_text

    # Fail loudly rather than silently returning None for an unrecognized profile type
    raise ValueError(f"Unknown profile_type: {profile_type!r} (expected 'stu' or 'pos')")


def extract_keywords_single_profile(row, profile_type, keyword_cats, assistant_id, id_column):
    """
    Function Name: extract_keywords_single_profile
    Purpose/Description:
        Extracts keywords from a single profile by first cleaning the text and then processing the keywords.
    Parameters:
        - row (pandas.Series): A row from the DataFrame containing profile data.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
        - keyword_cats (str): The category of keywords to extract (e.g., 'S_TS_Ed', 'P_TS').
        - assistant_id (str): The ID of the assistant responsible for keyword extraction.
        - id_column (str): The name of the column containing the unique ID for the profile.
    Return Value:
        - tuple: A tuple containing the profile ID, extracted keywords, and token usage statistics.
    """

    # Clean the text for the profile
    cleaned_text = clean_text(row, profile_type)

    # Extract keywords from the cleaned text using the specified assistant
    keywords, actual_tokens_used, actual_input_tokens_used, actual_output_tokens_used = extract_o_bot(cleaned_text, assistant_id, row, keyword_cats)

    # Process the extracted keywords to categorize them appropriately
    processed_keywords = process_keywords(keywords, keyword_cats)

    # Handle the different return cases based on the profile type and keyword category
    if keyword_cats == 'S_TS_Ed':
        # Handle technical skills and education for student profiles
        tech_keywords, edu_keywords = processed_keywords
        return row[id_column], tech_keywords, edu_keywords, actual_tokens_used, actual_input_tokens_used, actual_output_tokens_used
    
    elif keyword_cats == 'S_In_SS_V':
        # Handle industry, soft skills, and values for student profiles
        industry_keywords, soft_keywords, values_keywords = processed_keywords
        return row[id_column], industry_keywords, soft_keywords, values_keywords, actual_tokens_used, actual_input_tokens_used, actual_output_tokens_used
    
    elif keyword_cats == 'P_TS':
        # Handle technical skills for position profiles
        tech_keywords = processed_keywords
        return row[id_column], tech_keywords, actual_tokens_used, actual_input_tokens_used, actual_output_tokens_used
    
    elif keyword_cats == 'P_In_SS_V':
        # Handle industry, soft skills, and values for position profiles
        industry_keywords, soft_keywords, values_keywords = processed_keywords
        return row[id_column], industry_keywords, soft_keywords, values_keywords, actual_tokens_used, actual_input_tokens_used, actual_output_tokens_used


def process_batch(dataframe, pbar, periodic_save_dir, base_filename, profile_type, keyword_cats, assistant_id, batch, num_in_batch, batch_completed):
    """
    Function Name: process_batch
    Purpose/Description:
        Processes a batch of profiles for keyword extraction using concurrent processing. Updates the DataFrame with extracted keywords.
    Parameters:
        - dataframe (pandas.DataFrame): The DataFrame containing profiles to process.
        - pbar (tqdm.tqdm): The progress bar object to update progress.
        - periodic_save_dir (str): The directory to save periodic checkpoint files.
        - base_filename (str): The base name for saving checkpoint files.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
        - keyword_cats (str): The category of keywords to extract.
        - assistant_id (str): The ID of the assistant responsible for keyword extraction.
        - batch (pandas.DataFrame): The batch of profiles to process in this iteration.
        - num_in_batch (int): The number of profiles in the current batch.
        - batch_completed (threading.Event): Event to signal that the batch has been processed.
    """

    global total_keyword_extractions, num_keyword_extractions
    id_column = f'{profile_type}_(Do Not Modify) Job Posting' if profile_type == 'pos' else f'{profile_type}_(Do Not Modify) Application'
    workers = max(1, num_in_batch)  # One worker thread per profile; ThreadPoolExecutor requires at least one worker

    # Use a ThreadPoolExecutor to process profiles concurrently
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Submit the keyword extraction task for each profile in the batch
        futures = [
            executor.submit(extract_keywords_single_profile, row, profile_type, keyword_cats, assistant_id, id_column)
            for _, row in batch.iterrows()
        ]

        # Process the results as they are completed
        for future in as_completed(futures):
            try:
                result = future.result()
                # Every result has the shape (profile_id, *keywords, total, input, output tokens);
                # only the keyword columns differ by category
                keyword_columns = {
                    'S_TS_Ed': ['technical_keywords', 'education_keywords'],
                    'S_In_SS_V': ['industry_keywords', 'soft_keywords', 'values_keywords'],
                    'P_TS': ['technical_keywords'],
                    'P_In_SS_V': ['industry_keywords', 'soft_keywords', 'values_keywords'],
                }[keyword_cats]
                profile_id, *keyword_values, actual_tokens_used, actual_input_tokens_used, actual_output_tokens_used = result
                profile_rows = dataframe[id_column] == profile_id

                # Write the extracted keywords, the token usage, and the processed flag for this profile
                for column, value in zip(keyword_columns, keyword_values):
                    dataframe.loc[profile_rows, f'{profile_type}_{column}'] = value
                dataframe.loc[profile_rows, f'Actual_Tokens_Used_{keyword_cats}'] = actual_tokens_used
                dataframe.loc[profile_rows, f'Actual_Input_Tokens_Used_{keyword_cats}'] = actual_input_tokens_used
                dataframe.loc[profile_rows, f'Actual_Output_Tokens_Used_{keyword_cats}'] = actual_output_tokens_used
                dataframe.loc[profile_rows, f'{keyword_cats}_processed'] = True

                # Update the global keyword extraction count and save progress periodically
                num_keyword_extractions += 1
                if num_keyword_extractions % 50 == 0:
                    dataframe.to_parquet(f"{periodic_save_dir}/{base_filename}_{num_keyword_extractions}.parquet")
                pbar.update(1)  # Update the progress bar
            except Exception as e:
                print("Error processing profile:", e)
                continue
        
        batch_completed.set()  # Signal that the batch has been completed
         
    
def process_batch_with_timeout(dataframe, pbar, timeout_dir, periodic_save_dir, base_filename, profile_type, keyword_cats, assistant_id, batch, num_in_batch):
    """
    Function Name: process_batch_with_timeout
    Purpose/Description:
        Processes a batch of profiles with a timeout mechanism to handle potential long-running operations.
        It triggers a countdown timer and a timeout handler to save progress if processing takes too long.
    Parameters:
        - dataframe (pandas.DataFrame): The DataFrame containing profiles to process.
        - pbar (tqdm.tqdm): The progress bar object to update progress.
        - timeout_dir (str): The directory to save temporary files if a timeout occurs.
        - periodic_save_dir (str): The directory to save periodic checkpoint files.
        - base_filename (str): The base name for saving temporary and checkpoint files.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
        - keyword_cats (str): The category of keywords to extract.
        - assistant_id (str): The ID of the assistant responsible for keyword extraction.
        - batch (pandas.DataFrame): The batch of profiles to process in this iteration.
        - num_in_batch (int): The number of profiles in the current batch.
    """

    # Initialize threading events to track the countdown and batch completion.
    countdown_finished = threading.Event()
    batch_completed = threading.Event()

    def countdown_timer(seconds):
        """
        Countdown timer that runs in a separate thread, updating the progress bar every second.
        When the countdown finishes, it sets the countdown_finished event.
        This is used to trigger the next batch processing step.
        """
        # Use a separate name so the countdown bar does not shadow the outer progress bar (pbar)
        with tqdm(total=seconds, desc=f"Processing {num_in_batch} rows. Time Till Next Batch:", bar_format="{l_bar}{bar}| {remaining} seconds", leave=False) as countdown_bar:
            for _ in range(seconds):
                time.sleep(1)  # Wait for 1 second
                countdown_bar.update(1)  # Update the countdown bar
        countdown_finished.set()  # Set the event to indicate the countdown has finished

    def timeout_handler():
        """
        Timeout handler that checks if the countdown has finished.
        If the countdown is finished, it saves the current state and restarts the process.
        If the countdown is still running, it ignores the timeout.
        """
        if countdown_finished.is_set():
            save_temp_dataframe(dataframe, timeout_dir, base_filename, keyword_cats)
            print("Timeout occurred, saving current state and restarting process.")
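            # max_tokens_per_minute and max_rpm are not parameters of this function; they must be defined at notebook level
            restart_processing(dataframe, pbar, timeout_dir, periodic_save_dir, base_filename, max_tokens_per_minute, max_rpm, profile_type, keyword_cats, assistant_id)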
        else:
            print("Timeout occurred, but batch has already started processing. Ignoring timeout.")

    # Start the countdown timer in a separate thread with a 60-second countdown.
    countdown_thread = threading.Thread(target=countdown_timer, args=(60,))
    countdown_thread.start()

    # Start the timeout handler with a 300-second delay.
    timeout_thread = threading.Timer(300, timeout_handler)
    timeout_thread.start()

    try:
        # Process the batch of profiles
        process_batch(dataframe, pbar, periodic_save_dir, base_filename, profile_type, keyword_cats, assistant_id, batch, num_in_batch, batch_completed)
        countdown_finished.wait()  # Wait for the countdown to finish
        batch_completed.wait()  # Wait for the batch to complete processing
    finally:
        # Ensure threads are cleaned up properly after processing
        countdown_thread.join()
        timeout_thread.cancel()


def get_batch(dataframe, pbar, timeout_dir, periodic_save_dir, base_filename, max_tokens_per_minute, max_rpm, profile_type, keyword_cats, assistant_id):
    """
    Function Name: get_batch
    Purpose/Description:
        Handles the extraction of keywords for profiles in the DataFrame by processing them in batches.
        Manages batch size based on token limits and requests per minute (RPM) constraints.
    Parameters:
        - dataframe (pandas.DataFrame): The DataFrame containing profiles to process.
        - pbar (tqdm.tqdm): The progress bar object to update progress.
        - timeout_dir (str): The directory to save temporary files if a timeout occurs.
        - periodic_save_dir (str): The directory to save periodic checkpoint files.
        - base_filename (str): The base name for saving temporary and checkpoint files.
        - max_tokens_per_minute (int): The maximum number of tokens that can be processed per minute.
        - max_rpm (int): The maximum number of requests that can be processed per minute.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
        - keyword_cats (str): The category of keywords to extract.
        - assistant_id (str): The ID of the assistant responsible for keyword extraction.
    """

    # Determine the correct ID and token column based on the profile type.
    id_column = f'{profile_type}_(Do Not Modify) Job Posting' if profile_type == 'pos' else f'{profile_type}_(Do Not Modify) Application'
    token_column = [f'TokenCount_{keyword_cats}']

    unique_profiles = dataframe.drop_duplicates(subset=[id_column])  # Remove duplicate profiles to avoid double processing.
    batch = pd.DataFrame(columns=unique_profiles.columns)  # Initialize an empty DataFrame for batching.
    batch_token_count = 0  # Initialize the token counter for the batch.
    num_in_batch = 0  # Initialize the counter for the number of profiles in the batch.

    for idx, row in unique_profiles.iterrows():
        # Skip profiles that have already been processed for the current keyword category.
        if dataframe.loc[dataframe[id_column] == row[id_column], f'{keyword_cats}_processed'].values[0]:
            continue
        
        current_token_count = int(row[token_column].iloc[0])  # Get the token count for the current profile.

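        # If adding the current profile would exceed the token or RPM limits, process the current batch first.
        # The num_in_batch > 0 check avoids submitting an empty batch when a single profile exceeds the token limit.
        if num_in_batch > 0 and (batch_token_count + current_token_count > max_tokens_per_minute or num_in_batch == max_rpm):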
            process_batch_with_timeout(dataframe, pbar, timeout_dir, periodic_save_dir, base_filename, profile_type, keyword_cats, assistant_id, batch, num_in_batch)

            # Reset the batch and token counter after processing.
            batch = pd.DataFrame(columns=unique_profiles.columns)
            batch_token_count = 0
            num_in_batch = 0
        
        # Add the current profile to the batch.
        if batch.empty:
            batch = pd.DataFrame([row])
        else:
            batch = pd.concat([batch, pd.DataFrame([row])], ignore_index=True)
        
        # Update the token count and batch size.
        batch_token_count += current_token_count
        num_in_batch += 1

    # Process any remaining profiles in the final batch.
    if not batch.empty:
        process_batch_with_timeout(dataframe, pbar, timeout_dir, periodic_save_dir, base_filename, profile_type, keyword_cats, assistant_id, batch, num_in_batch)


def execute_keyword_extractions(SP, pbar, timeout_dir, periodic_save_dir, base_filename, max_tokens_per_minute, max_rpm):
    """
    Function Name: execute_keyword_extractions
    Purpose/Description:
        Executes the keyword extraction process for multiple types of profiles and keyword categories.
        It iterates through predefined extraction jobs and calls the get_batch function for each.
    Parameters:
        - SP (pandas.DataFrame): The DataFrame containing profiles to process.
        - pbar (tqdm.tqdm): The progress bar object to update progress.
        - timeout_dir (str): The directory to save temporary files if a timeout occurs.
        - periodic_save_dir (str): The directory to save periodic checkpoint files.
        - base_filename (str): The base name for saving temporary and checkpoint files.
        - max_tokens_per_minute (int): The maximum number of tokens that can be processed per minute.
        - max_rpm (int): The maximum number of requests that can be processed per minute.
    """

    # Define the extraction jobs for each profile type and keyword category.
    extraction_jobs = [
        ('stu', 'S_TS_Ed', S_TS_Ed_EOB.id),
        ('stu', 'S_In_SS_V', S_In_SS_V_EOB.id),
        ('pos', 'P_TS', P_TS_EOB.id),
        ('pos', 'P_In_SS_V', P_In_SS_V_EOB.id)
    ]
    
    # Iterate through the extraction jobs and process each one.
    for profile_type, keyword_cats, assistant_id in extraction_jobs:
        get_batch(SP, pbar, timeout_dir, periodic_save_dir, base_filename, max_tokens_per_minute, max_rpm, profile_type, keyword_cats, assistant_id)


def estimate_tokens(text, model):
    """
    Function Name: estimate_tokens
    Purpose/Description: 
        Estimates the number of tokens required to process a given text using a specified model.
    Parameters:
        - text (str): The text for which to estimate the token count.
        - model (str): The model to be used for encoding the text (e.g., 'gpt-4o-mini').
    Return Value:
        - int: The number of tokens required to process the text.
    """
    encoding = tiktoken.encoding_for_model(model)  # Get the encoding for the specified model.
    tokens = encoding.encode(text)  # Encode the text into tokens.
    return len(tokens)  # Return the total number of tokens.


def calculate_profile_tokens(df, profile_type, model):
    """
    Function Name: calculate_profile_tokens
    Purpose/Description: 
        Calculates the estimated token counts for different types of profile extractions in a DataFrame.
        It assigns token counts to various columns based on the profile type (student or position) and the model used.
    Parameters:
        - df (pandas.DataFrame): The DataFrame containing the profiles to process.
        - profile_type (str): The type of profile ('stu' for student or 'pos' for position).
        - model (str): The model to be used for estimating tokens (e.g., 'gpt-4o-mini').
    Return Value:
        - pandas.DataFrame: The updated DataFrame with calculated token counts for each profile.
    """

    #region instruction token estimation
    # Estimated number of tokens required for instructions for each profile type and category.
    S_TS_Ed_EOB_instructions_token_est = 825
    S_In_SS_V_EOB_instructions_token_est = 1200
    P_TS_EOB_instructions_token_est = 875
    P_In_SS_V_EOB_instructions_token_est = 925
    #endregion

    #region average token output
    # Average number of tokens expected in the output for each profile type and category.
    S_TS_Ed_avg_token_output = 1200
    S_In_SS_V_avg_token_output = 400
    P_TS_avg_token_output = 1200
    P_In_SS_V_avg_token_output = 400
    #endregion

    # Determine the correct ID column based on the profile type.
    id_column = f'{profile_type}_(Do Not Modify) Job Posting' if profile_type == 'pos' else f'{profile_type}_(Do Not Modify) Application'
    unique_profiles = df.drop_duplicates(subset=[id_column])  # Remove duplicate profiles to avoid double counting.

    for idx, row in unique_profiles.iterrows():
        # Generate the formatted text and calculate the token count for each unique profile.
        formatted_text = clean_text(row, profile_type)
        token_count = estimate_tokens(formatted_text, model)
        
        if profile_type == 'stu':
            # Calculate and assign token counts for student profiles.
            df.loc[df[id_column] == row[id_column], 'TokenCount_S_TS_Ed'] = token_count + S_TS_Ed_EOB_instructions_token_est + S_TS_Ed_avg_token_output
            df.loc[df[id_column] == row[id_column], 'TokenCount_S_In_SS_V'] = token_count + S_In_SS_V_EOB_instructions_token_est + S_In_SS_V_avg_token_output
            df.loc[df[id_column] == row[id_column], 'TokenInputCount_S_TS_Ed'] = token_count + S_TS_Ed_EOB_instructions_token_est
            df.loc[df[id_column] == row[id_column], 'TokenInputCount_S_In_SS_V'] = token_count + S_In_SS_V_EOB_instructions_token_est
            df.loc[df[id_column] == row[id_column], 'TokenOutputCount_S_TS_Ed'] = S_TS_Ed_avg_token_output
            df.loc[df[id_column] == row[id_column], 'TokenOutputCount_S_In_SS_V'] = S_In_SS_V_avg_token_output
        elif profile_type == 'pos':
            # Calculate and assign token counts for position profiles.
            df.loc[df[id_column] == row[id_column], 'TokenCount_P_TS'] = token_count + P_TS_EOB_instructions_token_est + P_TS_avg_token_output
            df.loc[df[id_column] == row[id_column], 'TokenCount_P_In_SS_V'] = token_count + P_In_SS_V_EOB_instructions_token_est + P_In_SS_V_avg_token_output
            df.loc[df[id_column] == row[id_column], 'TokenInputCount_P_TS'] = token_count + P_TS_EOB_instructions_token_est
            df.loc[df[id_column] == row[id_column], 'TokenInputCount_P_In_SS_V'] = token_count + P_In_SS_V_EOB_instructions_token_est
            df.loc[df[id_column] == row[id_column], 'TokenOutputCount_P_TS'] = P_TS_avg_token_output
            df.loc[df[id_column] == row[id_column], 'TokenOutputCount_P_In_SS_V'] = P_In_SS_V_avg_token_output

    return df  # Return the DataFrame with updated token counts.


def estimate_processing_time(df, est_total_tokens, max_tokens_per_minute, max_rpm):
    """
    Function Name: estimate_processing_time
    Purpose/Description: 
        Estimates the total processing time required based on the number of tokens and the maximum allowed tokens per minute.
        Takes into account the rate limits of the API to determine whether tokens or requests per minute are the limiting factor.
    Parameters:
        - df (pandas.DataFrame): The DataFrame containing the profiles to process.
        - est_total_tokens (int): The estimated total number of tokens required for processing.
        - max_tokens_per_minute (int): The maximum number of tokens that can be processed per minute.
        - max_rpm (int): The maximum number of requests that can be processed per minute.
    Return Value:
        - tuple: A tuple containing the estimated hours and minutes required for processing.
    """
    pos_id_column = 'pos_(Do Not Modify) Job Posting'
    stu_id_column = 'stu_(Do Not Modify) Application'
    
    # Count unique profiles for positions and students.
    unique_pos_profiles = len(df.drop_duplicates(subset=[pos_id_column]))
    unique_stu_profiles = len(df.drop_duplicates(subset=[stu_id_column]))
    
    # Calculate the total number of profiles to process.
    total_profiles = unique_pos_profiles + unique_stu_profiles
    
    # Total number of profile processing operations (since each profile is processed twice).
    total_processing_ops = total_profiles * 2
    
    # Calculate the average tokens per profile (this is an approximation).
    avg_tokens_per_profile = est_total_tokens / total_processing_ops
    
    # Calculate the maximum tokens that can be processed per minute given the RPM constraint.
    max_tokens_possible_by_rpm = max_rpm * avg_tokens_per_profile
    
    if max_tokens_possible_by_rpm <= max_tokens_per_minute:
        # RPM is the limiting factor.
        num_batches = total_processing_ops / max_rpm
    else:
        # Tokens per minute is the limiting factor.
        num_batches = est_total_tokens / max_tokens_per_minute
    
    # Calculate the total estimated time in minutes.
    est_minutes = int(num_batches)
    
    # Convert minutes into hours and remaining minutes.
    est_hours = est_minutes // 60
    est_minutes = est_minutes % 60
    
    return est_hours, est_minutes  # Return the estimated hours and minutes required.


# Define the model to use for keyword extraction
'''
Note: ALWAYS USE THE 'gpt-4o-mini' MODEL IF YOU ARE ONLY TESTING THE CODE. 
It's 50x cheaper than the 'gpt-4o' model, and if you are running/modifying the code, 
you will likely need to run it over and over, and the output matters less. 
Switch to 'gpt-4o' if the primary focus is the output.
'''
model = 'gpt-4o-mini'  # Set the model to 'gpt-4o-mini' for testing or 'gpt-4o' for final runs.
API_tier = 3  # Set the API tier based on your OpenAI account level.


# Define the maximum number of requests per minute to avoid rate limiting
'''
Note: Despite being on a higher API tier, I experienced rate limiting at 200 requests per minute. 
To avoid issues, I kept it at 150 requests per minute as a safe threshold. 
This rate allows processing of 150 rows per minute, which is still efficient, but requires some wait time between batches.
At Tier 2, you should get 5000 requests per minute, but the limit was still 200 for me. 
The issue might be related to the per-second rate of sending requests. 
Someone will need to troubleshoot this if higher throughput is required.
'''
max_rpm = 150  # Set the maximum requests per minute to avoid rate limiting.


# Start the timer to measure the time taken for the entire process.
start_time = time.time()


# Create the assistants with specific instructions and parameters.
# Each assistant is tailored to extract different types of information from profiles.
#region Assistants
# Assistant for extracting technical skills and education from student profiles.
S_TS_Ed_EOB = client.beta.assistants.create(
    name="Student Profile Technical Skill and Education Extract-O-Bot",  # Name of the assistant.
    instructions=S_TS_Ed_EOB_instructions,  # Instructions specific to this task.
    model=model,  # Model to use for this assistant.
    temperature=1,  # Sampling temperature (1 is the API default; lower values give more deterministic output).
    top_p=1  # Nucleus sampling threshold; 1 samples from the full token distribution.
)

# Assistant for extracting industry interests, soft skills, and values from student profiles.
S_In_SS_V_EOB = client.beta.assistants.create(
    name="Student Profile Industry of Interest, Soft Skills, and Values Extract-O-Bot",  # Name of the assistant.
    instructions=S_In_SS_V_EOB_instructions,  # Instructions specific to this task.
    model=model,  # Model to use for this assistant.
    temperature=1,  # Sampling temperature (1 is the API default; lower values give more deterministic output).
    top_p=1  # Nucleus sampling threshold; 1 samples from the full token distribution.
)

# Assistant for extracting technical skills from job position profiles.
P_TS_EOB = client.beta.assistants.create(
    name="Position Profile Technical Skill Extract-O-Bot",  # Name of the assistant.
    instructions=P_TS_EOB_instructions,  # Instructions specific to this task.
    model=model,  # Model to use for this assistant.
    temperature=1,  # Sampling temperature (1 is the API default; lower values give more deterministic output).
    top_p=1  # Nucleus sampling threshold; 1 samples from the full token distribution.
)

# Assistant for extracting industry, soft skills, and company values from job position profiles.
P_In_SS_V_EOB = client.beta.assistants.create(
    name="Position Profile Industry of Interest, Soft Skills, and Company Values Extract-O-Bot",  # Name of the assistant.
    instructions=P_In_SS_V_EOB_instructions,  # Instructions specific to this task.
    model=model,  # Model to use for this assistant.
    temperature=1,  # Sampling temperature (1 is the API default; lower values give more deterministic output).
    top_p=1  # Nucleus sampling threshold; 1 samples from the full token distribution.
)
#endregion


#region token and time estimation

# Calculate the total number of keyword extractions needed. 
# Each profile is processed twice for different categories, hence the multiplication by 2.
total_keyword_extractions = (SP['pos_(Do Not Modify) Job Posting'].nunique() + SP['stu_(Do Not Modify) Application'].nunique()) * 2

#region tokens per minute
''' 
Note: The maximum number of tokens per minute varies depending on your OpenAI API tier and the model used.
Info on tokens per minute for different models and API tiers can be found here: https://platform.openai.com/docs/guides/rate-limits
Info on your specific API tier and rate limits can be found in the 'Your Profile' section under 'Limits' on the OpenAI platform.
If you are on tier 1, processing a large number of rows will be slow (2+ hours for 1000 rows); you can only process about 4-5 rows a minute.
Once you are on tier 2 or higher, you can process 150 rows a minute, or 1000 rows in about 7 minutes.
'''

# Determine max tokens per minute based on the model and API tier
if model == 'gpt-4o-mini':
    if API_tier == 1:
        max_tokens_per_minute = 200000  # Tier 1 limit for 'gpt-4o-mini'
    elif API_tier == 2:
        max_tokens_per_minute = 2000000  # Tier 2 limit for 'gpt-4o-mini'
    elif API_tier == 3:
        max_tokens_per_minute = 4000000  # Tier 3 limit for 'gpt-4o-mini'
    elif API_tier == 4:
        max_tokens_per_minute = 10000000  # Tier 4 limit for 'gpt-4o-mini'
    elif API_tier == 5:
        max_tokens_per_minute = 150000000  # Tier 5 limit for 'gpt-4o-mini'
    else:
        raise ValueError("Unsupported API tier for 'gpt-4o-mini'")

elif model == 'gpt-4o':
    if API_tier == 1:
        max_tokens_per_minute = 25000  # Tier 1 limit for 'gpt-4o'
    elif API_tier == 2:
        max_tokens_per_minute = 450000  # Tier 2 limit for 'gpt-4o'
    elif API_tier == 3:
        max_tokens_per_minute = 800000  # Tier 3 limit for 'gpt-4o'
    elif API_tier == 4:
        max_tokens_per_minute = 2000000  # Tier 4 limit for 'gpt-4o'
    elif API_tier == 5:
        max_tokens_per_minute = 30000000  # Tier 5 limit for 'gpt-4o'
    else:
        raise ValueError("Unsupported API tier for 'gpt-4o'")

else:
    print('I only made code for gpt-4o and gpt-4o-mini. If you want to use a different model, you will have to add the token limits yourself.')
    token_limit_input = input('Enter a token limit per minute for the model you are using: ')
    
    # Convert the input to an integer and set it as the max_tokens_per_minute
    try:
        max_tokens_per_minute = int(token_limit_input.replace(",", "").strip())  # Remove commas if present and convert to integer
    except ValueError:
        raise ValueError("Invalid input. Please enter a valid integer for the token limit.")

#endregion

#region Actual tokens used columns
# Initialize columns to track the actual tokens used during processing for each category.
SP['Actual_Tokens_Used_S_TS_Ed'] = 0
SP['Actual_Tokens_Used_S_In_SS_V'] = 0
SP['Actual_Tokens_Used_P_TS'] = 0
SP['Actual_Tokens_Used_P_In_SS_V'] = 0
SP['Actual_Input_Tokens_Used_S_TS_Ed'] = 0
SP['Actual_Input_Tokens_Used_S_In_SS_V'] = 0
SP['Actual_Input_Tokens_Used_P_TS'] = 0
SP['Actual_Input_Tokens_Used_P_In_SS_V'] = 0
SP['Actual_Output_Tokens_Used_S_TS_Ed'] = 0
SP['Actual_Output_Tokens_Used_S_In_SS_V'] = 0
SP['Actual_Output_Tokens_Used_P_TS'] = 0
SP['Actual_Output_Tokens_Used_P_In_SS_V'] = 0
#endregion

#region token count columns
# Initialize columns to track the estimated token count for each profile type and category.
SP['TokenCount_S_TS_Ed'] = 0
SP['TokenCount_S_In_SS_V'] = 0
SP['TokenCount_P_TS'] = 0
SP['TokenCount_P_In_SS_V'] = 0
#endregion

# Calculate the token count for student and position profiles.
SP = calculate_profile_tokens(SP, 'stu', model)
SP = calculate_profile_tokens(SP, 'pos', model)

#region token use and price estimation

# Define the relevant columns for token count and usage tracking.
token_cols = ['TokenCount_S_TS_Ed', 
              'TokenCount_S_In_SS_V', 
              'TokenInputCount_S_TS_Ed', 
              'TokenInputCount_S_In_SS_V',
              'TokenOutputCount_S_TS_Ed',
              'TokenOutputCount_S_In_SS_V',
              'TokenCount_P_TS', 
              'TokenCount_P_In_SS_V',  
              'TokenInputCount_P_TS', 
              'TokenInputCount_P_In_SS_V',
              'TokenOutputCount_P_TS',
              'TokenOutputCount_P_In_SS_V']

# Sum tokens for position profiles
pos_columns = ['pos_(Do Not Modify) Job Posting'] + token_cols[6:]  # Only include token columns related to position profiles.
pos_token_subset = SP[pos_columns].drop_duplicates()  # Drop duplicates to avoid double counting.

total_tokens_P_TS = pos_token_subset['TokenCount_P_TS'].sum()
total_tokens_P_In_SS_V = pos_token_subset['TokenCount_P_In_SS_V'].sum()
total_tokens_P = total_tokens_P_TS + total_tokens_P_In_SS_V

total_input_tokens_P_TS = pos_token_subset['TokenInputCount_P_TS'].sum()
total_input_tokens_P_In_SS_V = pos_token_subset['TokenInputCount_P_In_SS_V'].sum()
total_input_tokens_P = total_input_tokens_P_TS + total_input_tokens_P_In_SS_V

total_output_tokens_P_TS = pos_token_subset['TokenOutputCount_P_TS'].sum()
total_output_tokens_P_In_SS_V = pos_token_subset['TokenOutputCount_P_In_SS_V'].sum()
total_output_tokens_P = total_output_tokens_P_TS + total_output_tokens_P_In_SS_V

# Sum tokens for student profiles
stu_columns = ['stu_(Do Not Modify) Application'] + token_cols[:6]  # Only include token columns related to student profiles.
stu_token_subset = SP[stu_columns].drop_duplicates()  # Drop duplicates to avoid double counting.
total_tokens_S_TS_Ed = stu_token_subset['TokenCount_S_TS_Ed'].sum()
total_tokens_S_In_SS_V = stu_token_subset['TokenCount_S_In_SS_V'].sum()
total_tokens_S = total_tokens_S_TS_Ed + total_tokens_S_In_SS_V

total_input_tokens_S_TS_Ed = stu_token_subset['TokenInputCount_S_TS_Ed'].sum()
total_input_tokens_S_In_SS_V = stu_token_subset['TokenInputCount_S_In_SS_V'].sum()
total_input_tokens_S = total_input_tokens_S_TS_Ed + total_input_tokens_S_In_SS_V

total_output_tokens_S_TS_Ed = stu_token_subset['TokenOutputCount_S_TS_Ed'].sum()
total_output_tokens_S_In_SS_V = stu_token_subset['TokenOutputCount_S_In_SS_V'].sum()
total_output_tokens_S = total_output_tokens_S_TS_Ed + total_output_tokens_S_In_SS_V
    
# Compute the total token count for all profiles (position + student).
est_total_tokens = total_tokens_P + total_tokens_S
total_input_tokens = total_input_tokens_P + total_input_tokens_S
total_output_tokens = total_output_tokens_P + total_output_tokens_S

# Compute the max token count for all profiles to identify the most demanding profile in terms of tokens.
max_token_value = SP[token_cols].max().max()

#region cost per token
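# Set the cost per token (in USD) based on the model used.
if model == 'gpt-4o-mini':
    price_per_input_token = 0.00000015
    price_per_output_token = 0.00000030
elif model == 'gpt-4o':
    price_per_input_token = 0.000005
    price_per_output_token = 0.000015
else:
    # No prices are defined for other models; fail early instead of hitting a NameError below.
    raise ValueError(f"No token prices defined for model '{model}'. Add them to this block.")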

# Estimate the total cost for processing based on token usage.
est_total_price = (total_input_tokens * price_per_input_token) + (total_output_tokens * price_per_output_token)
#endregion

#endregion

#region time estimation
# Estimate the processing time based on the number of tokens, tokens per minute, and requests per minute limits.
est_hours, est_minutes = estimate_processing_time(SP, est_total_tokens, max_tokens_per_minute, max_rpm)
#endregion

# Report a minimum estimate of 4 minutes, even if the calculated time is less.
if est_hours == 0 and est_minutes < 4:
    est_minutes = 4

#region Display the token and time estimation results
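# Display the number of keyword extractions, estimated tokens, price, and processing time.
# Each unique profile is processed twice, so this count is twice the number of unique profiles.
total_rows_label = widgets.HTML(value=f"<b>Total number of keyword extractions to run:</b> {total_keyword_extractions}")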
est_tokens_label = widgets.HTML(value=f"<b>Estimate for total number of tokens required:</b> {est_total_tokens}")
est_price_label = widgets.HTML(value=f"<b>Estimate for total price for processing:</b> ${est_total_price:.2f}")
est_time_label = widgets.HTML(value=f"<b>Estimated time required:</b> {est_hours} hours and {est_minutes} minutes")
display(total_rows_label, est_tokens_label, est_price_label, est_time_label)
#endregion


#endregion


#region keywords columns
# Initialize columns to store extracted keywords for both student and position profiles.
SP.loc[:, 'pos_technical_keywords'] = None
SP.loc[:, 'stu_technical_keywords'] = None
SP.loc[:, 'pos_industry_keywords'] = None
SP.loc[:, 'stu_industry_keywords'] = None
SP.loc[:, 'pos_soft_keywords'] = None
SP.loc[:, 'stu_soft_keywords'] = None
SP.loc[:, 'pos_values_keywords'] = None
SP.loc[:, 'stu_values_keywords'] = None
SP.loc[:, 'stu_education_keywords'] = None
#endregion


#region processed flag columns
# Initialize columns to track whether each profile has been processed for each category.
SP['S_TS_Ed_processed'] = False
SP['S_In_SS_V_processed'] = False
SP['P_TS_processed'] = False
SP['P_In_SS_V_processed'] = False
#endregion


#region Global variables
# Initialize global variables to track the total tokens used and the number of keyword extractions performed.
total_tokens_used = 0
total_input_tokens_used = 0
total_output_tokens_used = 0
num_keyword_extractions = 0
threads_in_loop = 0
#endregion


#region Threads for resource management
# Define resource limits and thread management variables.
rps = 80  # Requests per second limit.
tps = 66666  # Tokens per second limit (about 4,000,000 tokens per minute / 60, the Tier 3 'gpt-4o-mini' limit).
threads_lock = threading.Lock()  # Lock for managing access to shared resources.
rate_limit = threading.Semaphore(rps)  # Semaphore for managing request rate limits.
token_limit = threading.Semaphore(tps)  # Semaphore for managing token usage limits.
threading.Thread(target=replenish_rps, daemon=True).start()  # Start thread to replenish request rate limits.
threading.Thread(target=replenish_tps, daemon=True).start()  # Start thread to replenish token limits.
batch_completed = threading.Event()  # Event to signal batch completion.
requests_started = threading.Event()  # Event to signal that requests have started.
#endregion


#region thread monitoring
# # For monitoring the number of threads waiting for API calls to be completed. Commented out for now, but useful for debugging.
# # Variable to track the number of threads currently processing.
# def monitor_loop_threads():
#     while True:
#         with threads_lock:
#             print(f"Threads in loop: {threads_in_loop}")
#         time.sleep(5)  # Print the number of threads in the loop every 5 seconds

# # Start the loop monitoring in a background thread
# threading.Thread(target=monitor_loop_threads, daemon=True).start()
#endregion


#region define paths for saving
# Define directories for saving intermediate and final results.
timeout_dir = f'{path_to_project}/data/extract_o_bot_saves/timeout_dir'
periodic_save_dir = f'{path_to_project}/data/extract_o_bot_saves/periodic_save_dir'
final_save_path = f'{path_to_project}/data/extract_o_bot_saves/final_save_dir/SP_extraction_save.parquet'
#endregion


# Run the keyword extraction process, displaying a progress bar.
with tqdm(total=total_keyword_extractions, desc="Extracting keywords") as pbar:
    execute_keyword_extractions(SP, pbar, timeout_dir, periodic_save_dir, 'SP_extraction_temp', max_tokens_per_minute, max_rpm)


#region Final token and time calculation
total_price = (total_input_tokens_used * price_per_input_token) + (total_output_tokens_used * price_per_output_token)
end_time = time.time()
elapsed_time = end_time - start_time
elapsed_hours = int(elapsed_time // 3600)
elapsed_minutes = int((elapsed_time % 3600) // 60)
elapsed_seconds = int(elapsed_time % 60)
formatted_time = f"{elapsed_hours}h {elapsed_minutes}m {elapsed_seconds}s"
process_complete_label = widgets.HTML(value=f"<b>Processing complete. Ran {total_keyword_extractions} keyword extractions. Final DataFrame saved to:</b> {final_save_path}")
tokens_label = widgets.HTML(value=f"<b>Total tokens used:</b> {total_tokens_used}")
price_label = widgets.HTML(value=f"<b>Total price for processing:</b> ${total_price:.2f}")
time_label = widgets.HTML(value=f"<b>Total time used:</b> {formatted_time}")
#endregion


#region final save

# combine the finished output with any previously saved output
combined_df = load_and_combine_saved_dfs(path, 'SP_extraction_save')

# If combined_df is empty, save the final DataFrame on its own; otherwise merge it with the earlier saves first.
if combined_df.empty:
    SP.to_parquet(final_save_path)
else:
    final_df = pd.concat([SP, combined_df], ignore_index=True).drop_duplicates()
    final_df.to_parquet(final_save_path)
    
#endregion


display(process_complete_label, tokens_label, price_label, time_label)
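
<p>The batching rule used by <code>get_batch</code> can be tried in isolation. The sketch below is a simplified, hypothetical stand-in (the name <code>make_batches</code> and the sample token counts are assumptions, not part of the pipeline): it groups items greedily so that no batch exceeds a token budget or a request cap, and it makes no API calls.</p>

```python
def make_batches(token_counts, max_tokens_per_minute, max_rpm):
    """Greedily group (id, tokens) pairs so each batch respects both limits."""
    batches, current, current_tokens = [], [], 0
    for item_id, tokens in token_counts:
        # Close the current batch if adding this item would break either limit.
        over_tokens = current_tokens + tokens > max_tokens_per_minute
        at_request_cap = len(current) == max_rpm
        if current and (over_tokens or at_request_cap):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(item_id)
        current_tokens += tokens
    # Keep whatever is left over as the final batch.
    if current:
        batches.append(current)
    return batches


# Hypothetical token estimates for five profiles.
profiles = [("a", 2000), ("b", 2500), ("c", 1800), ("d", 3000), ("e", 900)]
batches = make_batches(profiles, max_tokens_per_minute=5000, max_rpm=3)
print(batches)
```

<p>An item that is larger than the token budget on its own still ends up in a batch by itself, so the budget is best treated as a soft limit for oversized profiles.</p>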

<h3>Extract-O-Bot Output Analysis</h3>

<p>In this section, we will delve into the results generated by the Extract-O-Bot.</p>

<p>The table we're working with contains keyword extractions generated with the GPT-4o-mini model for all student and company profiles from 2024. It also serves as the foundation for Pipeline Stage 3, where we will evaluate and score the extracted keywords.</p>

<p><strong>Note:</strong> I opted to use GPT-4o-mini for the AI-generated tables to avoid the cost of using GPT-4o. However, I’ve found that GPT-4o provides significantly better extraction results. The tables in the Technical White Paper were generated using GPT-4o. If you'd like to generate and view results using GPT-4o, run a sample with Extract-O-Bot. Be sure to comment out SP_path and uncomment SP_path_final_save for the final output.</p>
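
<p>To get a feel for the cost difference between the two models before running a sample, the arithmetic can be sketched as follows. The per-token prices are copied from the pricing block in the Extract-O-Bot code and may be out of date; the workload figures (1,000 extractions at roughly 3,000 input and 800 output tokens each) and the helper name <code>estimate_cost</code> are purely illustrative assumptions.</p>

```python
# Per-token prices in USD, copied from the notebook's pricing block.
# They may be out of date; check the provider's pricing page before relying on them.
PRICES = {
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.00000030},
    "gpt-4o": {"input": 0.000005, "output": 0.000015},
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for the given token totals."""
    price = PRICES[model]
    return input_tokens * price["input"] + output_tokens * price["output"]


# Hypothetical workload: 1,000 extractions, ~3,000 input and ~800 output tokens each.
input_tokens = 1000 * 3000
output_tokens = 1000 * 800

for model in PRICES:
    cost = estimate_cost(model, input_tokens, output_tokens)
    print(f"{model}: ${cost:.2f}")
```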

<p>Exploring these results is critical for understanding the accuracy and relevance of the extracted keywords, evaluating how well the extracted data aligns with the intended categorization criteria, and identifying potential areas for improvement. The analysis validates Extract-O-Bot's performance and confirms that the keyword extraction process supports the broader objective of matching students with suitable job opportunities. By continuously evaluating and refining the output, we can improve the efficiency and reliability of the pipeline.</p>
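
<p>One simple first check is coverage: for each keyword column, what share of rows actually received keywords? The sketch below is a minimal illustration on toy data, not part of the pipeline. It assumes each keyword cell is <code>None</code>, a string, or a list; the helper name <code>keyword_coverage</code> and the sample rows are hypothetical.</p>

```python
def is_empty(value):
    """Treat None, empty strings, and empty lists as missing keywords."""
    return value is None or len(value) == 0


def keyword_coverage(rows, keyword_columns):
    """Return, per column, the share of rows with a non-empty keyword cell."""
    coverage = {}
    for col in keyword_columns:
        filled = sum(1 for row in rows if not is_empty(row.get(col)))
        coverage[col] = filled / len(rows) if rows else 0.0
    return coverage


# Toy rows standing in for records from the keyword table.
# With a DataFrame, rows could come from something like stu_keywords.to_dict("records").
rows = [
    {"stu_technical_keywords": ["Python", "SQL"], "stu_soft_keywords": ["teamwork"]},
    {"stu_technical_keywords": [], "stu_soft_keywords": ["communication"]},
    {"stu_technical_keywords": ["Excel"], "stu_soft_keywords": None},
    {"stu_technical_keywords": ["R"], "stu_soft_keywords": ""},
]

coverage = keyword_coverage(rows, ["stu_technical_keywords", "stu_soft_keywords"])
for col, share in coverage.items():
    print(f"{col}: {share:.0%} of rows have keywords")
```

<p>Columns with unusually low coverage are a reasonable place to start when reviewing the extraction instructions for that category.</p>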




In [10]:
SP_path = f'{path_to_project}/data/SP_table/SP2_post_extract_o_bot.parquet'
SP = pd.read_parquet(SP_path) 

# SP_path_final_save = f'{path_to_project}/data/extract_o_bot_saves/final_save_dir/SP_extraction_save.parquet'
# SP = pd.read_parquet(SP_path_final_save) 

In [11]:
stu_cols = '''
stu_Legal Name
stu_technical_keywords
stu_industry_keywords
stu_soft_keywords
stu_values_keywords
stu_education_keywords
'''

pos_cols = '''
pos_Company
pos_Name
pos_technical_keywords
pos_industry_keywords
pos_soft_keywords
pos_values_keywords
'''

stu_cols = as_list(stu_cols)
pos_cols = as_list(pos_cols)

stu_keywords = SP[stu_cols]
pos_keywords = SP[pos_cols]

<h3>Examples of Keyword Extraction and Labels from Students and Companies</h3>

<p>The tables below present examples of some of the keywords extracted by the Extract-O-Bot. These examples include both student and company profiles, showcasing the extracted keywords along with their assigned labels.</p>

In [None]:
print('Student Keywords:')
pretty_print(stu_keywords.sample(10))  # Show 10 randomly sampled student rows.

In [None]:
print("Position Profile Keywords:")
pretty_print(pos_keywords.sample(10))  # Show 10 randomly sampled position rows.