## Coding challenge Part II (up to 60 min) - Advanced LLM Orchestration Features

After getting to know the basics of the LLM Orchestration Service, this coding challenge focuses on working with its more advanced features.
Again, let's assume the role of a Business AI developer, tasked with integrating AI features into business software products.

Here, we are working on an HR system used for screening and summarizing CVs of job applicants, so that recruiters get an initial overview of candidates.

Because CVs contain personal information, we must ensure that this information is not transmitted to the third-party providers that host the LLMs (e.g. Azure, Google Cloud, AWS).

### Scenario

You are to develop an LLM-based summarization service that lets a business user of an HR system input a `.txt` document and returns a summary of a specified length.


### Set Up the LLM Orchestration Service

In [None]:
# Set up the required authentication, see Part 1
%pip install python-dotenv
from dotenv import load_dotenv
# loading variables from file
load_dotenv("tmp/dummy.env")

In [None]:
from gen_ai_hub.orchestration.models.message import SystemMessage, UserMessage
from gen_ai_hub.orchestration.models.template import Template, TemplateValue
from gen_ai_hub.orchestration.models.llm import LLM
from gen_ai_hub.orchestration.models.config import OrchestrationConfig

cv_template = Template(
    messages=[
        SystemMessage("You are a helpful AI assistant that creates summaries from documents."),
        UserMessage(
            "Summarize the following {{?document_type}} in {{?length}} sentences {{?additional_instructions}}: {{?document}}"
        ),
    ],
    defaults=[
        TemplateValue(name="document_type", value="document"),
        TemplateValue(name="length", value="10"),
        TemplateValue(name="additional_instructions", value=""),
    ],
)

cv_config = OrchestrationConfig(
    template=cv_template,
    llm=LLM(name="gpt-4o", version="latest", parameters={"max_tokens": 256, "temperature": 0.1}),
)

### Run the Summarization Request

Make a request to the LLM that includes the CV data. Note the `template_values`, and how the `additional_instructions` field can be used to modify the prompt.
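To get a feeling for how the `{{?name}}` placeholders and their defaults combine, here is a minimal local sketch. It is not the service's implementation: `render_prompt` and `PLACEHOLDER` are hypothetical helpers, and the assumption is simply that values passed with the request override the template defaults.

```python
import re

# Hypothetical local stand-in for placeholder resolution.
# The real resolution happens inside the orchestration service.
PLACEHOLDER = re.compile(r"\{\{\?(\w+)\}\}")


def render_prompt(template: str, defaults: dict, values: dict) -> str:
    """Fill {{?name}} placeholders; request values override defaults."""
    merged = {**defaults, **values}

    def substitute(match):
        name = match.group(1)
        if name not in merged:
            raise KeyError(f"No value supplied for placeholder '{name}'")
        return merged[name]

    return PLACEHOLDER.sub(substitute, template)


user_prompt = (
    "Summarize the following {{?document_type}} in {{?length}} sentences "
    "{{?additional_instructions}}: {{?document}}"
)
defaults = {"document_type": "document", "length": "10", "additional_instructions": ""}

rendered = render_prompt(
    user_prompt,
    defaults,
    {"document": "<CV text>", "document_type": "CV", "length": "4"},
)
print(rendered)
```

With an empty `additional_instructions`, the rendered prompt contains a stray space before the colon, which is harmless but worth knowing when you inspect the final prompt.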

In [None]:
from gen_ai_hub.orchestration.utils import load_text_file
from gen_ai_hub.orchestration.service import OrchestrationService

doc_as_string = load_text_file("data/cv.txt")

api_url = "https://api.ai.internalprod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/d8070b9586567b37"
orchestration_service = OrchestrationService(api_url=api_url)

result = orchestration_service.run(
    config=cv_config,
    template_values=[
        TemplateValue(name="document", value=doc_as_string),
        TemplateValue(name="document_type", value="CV"),
        TemplateValue(name="length", value="4"),
        TemplateValue(name="additional_instructions", value="and structure using markdown headings (Candidate Information, Skills, Education). Use an emoji for each heading."),
    ]
)
print(result.orchestration_result.choices[0].message.content)

### Wrap the Summarization Functionality in a Class

For code structuring and ease of reuse, we'll create a class `CVSummarizer`.

In [None]:
class CVSummarizer:
    def __init__(self, config):
        api_url = "https://api.ai.internalprod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/d8070b9586567b37"
        self.service = OrchestrationService(api_url=api_url)
        self.config = config
        self.cv = None

    def load_cv_by_file_path(self, file_path):
        self.cv = load_text_file(file_path)
    
    def summarize(self):
        if not self.cv:
            raise ValueError("CV not set!")
        response = self.service.run(
            config=self.config,
            template_values=[
                TemplateValue(name="document", value=self.cv),
                TemplateValue(name="length", value="1"),
            ],
        )
        return response.orchestration_result.choices[0].message.content


summarizer = CVSummarizer(config=cv_config)
summarizer.load_cv_by_file_path("data/cv.txt")
cv_summary = summarizer.summarize()
print(cv_summary)

# TODO make the output of the summarized CV more structured, e.g. as in the example above

### Try Out Data Masking

Data Masking can be configured using content categories (called `ProfileEntity`).
If `ProfileEntity.EMAIL` is configured for the `SAPDataPrivacyIntegration` backend, all detected email addresses in the input to the LLM are replaced with the string "MASKED_EMAIL", so the LLM provider never learns the applicant's email address.
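The effect of masking can be illustrated locally with a deliberately simple sketch. This is only a stand-in: the regular expression and the `mask_emails` helper are assumptions made for illustration, and the actual detection is performed by the masking backend, not by a pattern like this one.

```python
import re

# Deliberately simple pattern, for illustration only.
# The real detection logic lives in the masking backend.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> str:
    """Replace every detected email address with a fixed placeholder."""
    return EMAIL_PATTERN.sub("MASKED_EMAIL", text)


sample = "Jane Doe, jane.doe@example.com, applying for Data Engineer."
masked = mask_emails(sample)
print(masked)
```

The name in the sample survives because only emails are masked here, which is why the configuration usually lists several entities rather than just one.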

In [None]:
from gen_ai_hub.orchestration.models.data_masking import DataMasking
from gen_ai_hub.orchestration.models.sap_data_privacy_integration import (
    SAPDataPrivacyIntegration,
    MaskingMethod,
    ProfileEntity,
)

data_masking = DataMasking(
    providers=[
        SAPDataPrivacyIntegration(
            method=MaskingMethod.ANONYMIZATION, # or MaskingMethod.PSEUDONYMIZATION
            entities=[
                ProfileEntity.EMAIL,
                ProfileEntity.PHONE,
                ProfileEntity.PERSON,
                ProfileEntity.LOCATION,
            ]
        )
    ]
)

config_w_masking = OrchestrationConfig(
    template=cv_template,
    llm=LLM(name="gpt-4o", version="latest", parameters={"max_tokens": 256, "temperature": 0.1}),
    data_masking=data_masking
)

result = orchestration_service.run(
    config=config_w_masking,
    template_values=[
        TemplateValue(name="document", value=doc_as_string),
        TemplateValue(name="length", value="10"),
        TemplateValue(name="document_type", value="CV"),
        TemplateValue(name="additional_instructions", value="and use markdown. Include contact details"),
    ]
)

print(result.orchestration_result.choices[0].message.content)

### Adapt the CVSummarizer to use Data Masking

Make the `CVSummarizerPrivacyConserving` class use the data masking config by implementing `summarize_with_masking`:

In [None]:
class CVSummarizerPrivacyConserving:
    def __init__(self, config):
        api_url = "https://api.ai.internalprod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/d8070b9586567b37"
        self.service = OrchestrationService(api_url=api_url)
        self.config = config
        self.cv = None

    def load_cv_by_file_path(self, file_path):
        self.cv = load_text_file(file_path)
    
    def summarize_with_masking(self):
        pass # TODO

summarizer = CVSummarizerPrivacyConserving(config=config_w_masking)
summarizer.load_cv_by_file_path("data/cv.txt")
cv_summary = summarizer.summarize_with_masking()
print(cv_summary)

### Explore Data Masking (Limitations)

The Data Masking capabilities of the LLM Orchestration Service are under active development.
When using the service, you should be aware of the limitations of the Data Masking feature.

Get the types of data that can be masked:

In [None]:
# There are additional entities to mask
list(ProfileEntity)

#### Search for Edge-Cases

We need to test the Data Masking feature with different types of CVs that contain various types of personally identifiable information.
We're looking for information in the CVs that was not successfully masked.

**Copy the existing CV and try adapting the information. See how you can provoke masking failures!**

**Please note the cases you find, so that we can include them in our test suites!**
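One possible way to spot masking failures systematically is to keep a list of the personal details you planted in a test CV and check whether any of them reappear in the summary. The sketch below assumes a hypothetical helper `find_leaks` and made-up sample values; it only catches verbatim (case-insensitive) leaks, so paraphrased or reformatted details still need a manual look.

```python
from typing import List


def find_leaks(summary: str, sensitive_values: List[str]) -> List[str]:
    """Return the sensitive values that still appear verbatim in the summary."""
    lowered = summary.lower()
    return [value for value in sensitive_values if value.lower() in lowered]


# Hypothetical details planted in a self-written test CV.
known_pii = ["Jane Doe", "jane.doe@example.com", "+49 170 1234567"]

# Hypothetical summary text, standing in for the service output.
example_summary = "MASKED_EMAIL belongs to an applicant named Jane Doe."

leaks = find_leaks(example_summary, known_pii)
print(leaks)
```

In the notebook, `example_summary` would be replaced by the string returned from `summarize_with_masking()`, and each non-empty result is a candidate for the issue list.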


In [None]:
more_data_masking = DataMasking(
    providers=[
        SAPDataPrivacyIntegration(
            method=MaskingMethod.ANONYMIZATION, # or MaskingMethod.PSEUDONYMIZATION
            entities=[
                ProfileEntity.PERSON,
                # TODO additional entities here
            ]
        )
    ]
)

config_w_more_masking = OrchestrationConfig(
    template=cv_template,
    llm=LLM(name="gpt-4o", version="latest", parameters={"max_tokens": 256, "temperature": 0.1}),
    data_masking=more_data_masking
)

summarizer = CVSummarizerPrivacyConserving(config=config_w_more_masking)
summarizer.load_cv_by_file_path("data/different_cv.txt") # TODO adapt to other CVs or create your own
cv_summary = summarizer.summarize_with_masking()
print(cv_summary)

In [None]:
Issues found:

* TODO

### Stretch Goal: Add Content Filtering to the CV Summarizer
The CV Summarizer should notify us if a CV contains inappropriate language so that we may review the application. We can use the input filtering functionality of the LLM orchestration service for this purpose. It was introduced at the end of Part 1.
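The control flow for flagging a CV can be sketched offline before wiring it to the service. Everything here is a stand-in: `FilterRejectedError` is a hypothetical placeholder for the error the orchestration client reports when the filter triggers, and `fake_summarize` replaces the real service call so the example runs without credentials.

```python
class FilterRejectedError(Exception):
    """Hypothetical stand-in for the error reported when the input filter triggers."""


def summarize_or_flag(summarize, cv_text):
    """Return (summary, None) on success, or (None, reason) if the CV was blocked."""
    try:
        return summarize(cv_text), None
    except FilterRejectedError as error:
        return None, str(error)


def fake_summarize(cv_text):
    # Hypothetical replacement for the service call, so the sketch runs offline.
    if "offensive" in cv_text:
        raise FilterRejectedError("Input rejected by content filter")
    return f"Summary of {len(cv_text.split())} words"


accepted = summarize_or_flag(fake_summarize, "Experienced data engineer")
flagged = summarize_or_flag(fake_summarize, "offensive text")
print(accepted)
print(flagged)
```

Returning the reason alongside the summary keeps the caller free to decide how to notify a recruiter, instead of letting the exception end the whole screening run.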

In [None]:
from gen_ai_hub.orchestration.models.azure_content_filter import AzureContentFilter, AzureThreshold

input_filter = AzureContentFilter(hate=AzureThreshold.ALLOW_SAFE,
                                  violence=AzureThreshold.ALLOW_SAFE,
                                  self_harm=AzureThreshold.ALLOW_SAFE,
                                  sexual=AzureThreshold.ALLOW_SAFE)

config_w_masking_filtering = OrchestrationConfig() # TODO add relevant parameters

# TODO Call CVSummarizerPrivacyConserving with the `config_w_masking_filtering` config and the CV in data/different_cv.txt
# TODO Handle the `OrchestrationError` that is raised when the filter detects something