## Setting Up the Environment

We begin by importing the necessary libraries:

- **os, json, pathlib.Path**: For file system operations and reading configuration files.
- **pandas**: To load and manipulate tabular data.
- **torch**: To detect whether a GPU is available and to run the model.
- **transformers**: To load a language model and tokenizer from Hugging Face and build a generation pipeline.
- **IPython.display**: To render output as Markdown in the notebook.


In [None]:
import os
import json
import pandas as pd
from pathlib import Path
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)
from IPython.display import display, Markdown


This step ensures the environment has all the required tools and libraries loaded.

## Configuration and Credentials

Next, we define the key file paths: the root directory, the data directory, and the configuration directory. The script then reads Hugging Face credentials from a `credentials.json` file in the configuration directory, which keeps the token out of the notebook itself. Finally, it checks whether a `HUGGINGFACE_TOKEN` is available, either as an environment variable or in the credentials file.

In [None]:
# %%
# Paths
paths = {
    'root': Path.cwd().parent,
    'data': Path.cwd().parent / "data",
    "config": Path.cwd().parent / "config"
}

# Load Hugging Face credentials
with open(paths["config"] / 'credentials.json') as f:
    credentials = json.load(f)

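# Prefer a token already in the environment; otherwise fall back to credentials.json
if "HUGGINGFACE_TOKEN" not in os.environ and "HUGGINGFACE_TOKEN" in credentials:
    os.environ["HUGGINGFACE_TOKEN"] = credentials["HUGGINGFACE_TOKEN"]

if "HUGGINGFACE_TOKEN" in os.environ:
    print("Environment variable HUGGINGFACE_TOKEN set.")
else:
    print("Warning: no HUGGINGFACE_TOKEN found in the environment or credentials.json.")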
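
The lookup order described above (environment first, then the credentials file) can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: `resolve_token` is a hypothetical name, and the token values are stand-ins.

```python
import os


def resolve_token(credentials, env=None, key="HUGGINGFACE_TOKEN"):
    """Return the token from the environment if present, else from credentials, else None."""
    env = os.environ if env is None else env
    return env.get(key) or credentials.get(key)


# Stand-in values for illustration only (not real tokens)
print(resolve_token({"HUGGINGFACE_TOKEN": "from-file"}, env={}))
print(resolve_token({"HUGGINGFACE_TOKEN": "from-file"}, env={"HUGGINGFACE_TOKEN": "from-env"}))
```

Passing `env` explicitly keeps the helper easy to test; by default it reads `os.environ`.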

## Reading Instructions

The script attempts to open and read an `instructions.txt` file from the configuration directory. It handles potential errors gracefully:
- Notifying if the file doesn't exist.
- Catching and reporting any other exceptions that occur during file reading.

In [None]:
# %%
# Define the file path
file_path = paths["config"] / "instructions.txt"

try:
    # Open the file and read its content
    with open(file_path, 'r') as file:
        instructions = file.read()
        print("Instructions successfully read!")
        # print(instructions)
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
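
One consequence of the cell above is that `instructions` stays undefined when reading fails. A possible variant, sketched here under the assumption that an empty string is an acceptable fallback, wraps the same error handling in a function that always returns a value. The name `read_instructions` and the relative path in the example are hypothetical.

```python
from pathlib import Path


def read_instructions(file_path, default=""):
    """Return the file's text, or `default` if the file is missing or unreadable."""
    try:
        return Path(file_path).read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
    except (OSError, UnicodeDecodeError) as e:
        print(f"An error occurred: {e}")
    return default


# Hypothetical relative path; the notebook builds it from paths["config"]
instructions = read_instructions("config/instructions.txt")
```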

## Loading Candidate Data

This section attempts to load candidate data from a Parquet file. If reading the Parquet file fails (e.g., due to format issues or missing file), it falls back to loading a CSV version. The data contains job titles and a ranking index, which is used for further processing.

In [4]:
# %%
# Shortlisted candidates
try:
    data = pd.read_parquet(paths['data'] / "processed/filtered.parquet", columns=['job_title', 'rank']).set_index("rank")
except Exception as e:
    print(f"Failed to load parquet file: {e}. Loading CSV instead.")
    # 'rank' must be listed in usecols so that it can serve as the index
    data = pd.read_csv(paths['data'] / "processed/filtered.csv", usecols=['job_title', 'rank'], index_col='rank')
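
The try-then-fall-back pattern used above generalizes to any ordered list of loaders. The sketch below is illustrative only: `load_first_available` is a hypothetical helper, and the two stand-in loaders take the place of the `pd.read_parquet(...)` and `pd.read_csv(...)` calls so the example runs without any data files.

```python
def load_first_available(loaders):
    """Try each (label, loader) pair in order and return the first successful result."""
    errors = []
    for label, loader in loaders:
        try:
            return label, loader()
        except Exception as e:  # broad on purpose: any failure moves on to the next loader
            errors.append(f"{label}: {e}")
    raise RuntimeError("All loaders failed: " + "; ".join(errors))


# Stand-in loaders; in the notebook these would wrap the pandas readers
def broken_parquet():
    raise OSError("parquet engine not installed")


def working_csv():
    return [{"rank": 1, "job_title": "example title"}]


source, rows = load_first_available([("parquet", broken_parquet), ("csv", working_csv)])
print(source)  # csv
```

Collecting the error messages means that, if every source fails, the final exception reports why each one was skipped.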

## Initializing the Language Model

Here, we initialize a language model pipeline for text generation:

1. **Select a Model**: We use `microsoft/Phi-3-small-128k-instruct`, an instruction-tuned model from Microsoft's Phi-3 family with a 128k-token context window.
2. **Check Device**: The script checks if a GPU is available and uses it for faster computation; otherwise, it falls back to CPU.
3. **Load Model and Tokenizer**: The model and its tokenizer are loaded, and a text generation pipeline is set up for ease of use.

This setup allows us to generate human-like text based on specified prompts.

In [None]:
# %%
# Initialize the model and tokenizer
model_name = "microsoft/Phi-3-small-128k-instruct"

device = "cuda" if torch.cuda.is_available() else "cpu"  # Use GPU if available

# Adjust model loading for GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    # torch_dtype=torch.float16,  # Use half precision
    trust_remote_code=True,
    device_map=device,  # Use GPU if available
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set up the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

## Generating and Displaying Output

In the final step:

1. **Define Search Criteria**: We specify a job search phrase and location.
2. **Sample Candidate Data**: A random sample of five job titles is drawn from the loaded data, with a fixed seed for reproducibility.
3. **Prepare Prompt Messages**: Messages are structured for the model:
   - A system message sets the context for the AI.
   - A user message defines the task: searching for suitable candidates based on criteria, with sample data included.
4. **Generate Text**: The model generates text based on the prompt.
5. **Display Output**: The generated text is displayed in Markdown format, making it readable and formatted nicely.

This flow ties together data loading, prompt construction, model inference, and result display in a single notebook.

In [None]:
# Generate results
list_similar_roles = 'aspiring human resources'
location = "New York"

search_criteria = """**Criteria:**
- **Motivation:** Demonstrated interest and passion for a career in Human Resources.
- **Experience & Background:** Relevant experience, transferable skills, and educational background.
- **Fit with Company Values:** Highly motivated individuals with fundamental HR knowledge are preferred over those with extensive credentials.
"""

response_format = """**Response Format:**
Provide a ranked table in Markdown format with the following columns:
- **Rank:** Position from 1 (top) to 5.
- **Candidate:** Description of the candidate.
- **Experience & Background:** Summary of relevant experience and qualifications.
- **Reasoning:** Explanation for the ranking based on the criteria.
"""

# Sample data
list_of_candidates = data['job_title'].sample(5, random_state=42).to_list()

user_prompt = f"""**Prompt:**
Evaluate the list of candidates based on the search criteria and location: "{location}" for the role(s) of {list_similar_roles}.

{search_criteria}

Identify and rank the top 5 candidates who are most suitable for a position in human resources.

{response_format}

**List of Candidates:**
{list_of_candidates}
"""

role_system_description = 'You are an HR Recruitment Specialist. Your role is to assist in identifying, evaluating, and ranking job candidates based on their qualifications, experience, and suitability for specific roles. You prioritize motivated individuals with fundamental knowledge and present information in clear, organized formats to support hiring decisions.'

# Prepare messages in the format required by the model
messages = [
    {"role": "system", "content": role_system_description},
    {"role": "user", "content": user_prompt},
]

# Generate text
generation_args = {
    "max_new_tokens": 5_000,
    "return_full_text": False,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
# Markdown() takes a single text argument, so combine the label and the generated text
display(Markdown(f"**Generated Output:**\n\n{output[0]['generated_text']}"))
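
The prompt assembly in this cell can also be factored into a reusable function. The sketch below is one possible shape, not the notebook's own code: `build_messages` and its parameters are hypothetical, the candidate titles are placeholders, and presenting candidates as a numbered list (rather than a Python list literal) is an assumption about what reads more clearly to the model.

```python
def build_messages(system_description, criteria, candidates, role, location, top_n=5):
    """Assemble system and user chat messages for a candidate-ranking prompt."""
    # Number the candidates so the model can refer to them unambiguously
    numbered = "\n".join(f"{i}. {title}" for i, title in enumerate(candidates, start=1))
    user_prompt = (
        f'Evaluate the candidates below for the role of "{role}" in "{location}".\n\n'
        f"{criteria}\n\n"
        f"Rank the top {top_n} candidates in a Markdown table.\n\n"
        f"**List of Candidates:**\n{numbered}\n"
    )
    return [
        {"role": "system", "content": system_description},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    "You are an HR Recruitment Specialist.",
    "**Criteria:**\n- **Motivation:** Demonstrated interest in Human Resources.",
    ["Example title A", "Example title B"],  # placeholder titles, not real data
    role="aspiring human resources",
    location="New York",
)
print(messages[1]["content"])
```

The returned list has the same system/user structure as the `messages` variable in the cell above, so it could be passed to `pipe(...)` in the same way.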