<a href="https://colab.research.google.com/github/Ilvecho/Web-Scraping/blob/main/Docs_elaboration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this notebook is to process the text extracted from the web.

In particular, to fine-tune in a supervised fashion, we need **prompt - expected answer** pairs.

Hence, we leverage existing LLMs (ChatGPT) to create such pairs starting from the text we extracted.

Further processing is then needed:
- Translate the pairs from English to Italian
- Reformat the pairs into the layout expected for LoRA fine-tuning
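As a rough illustration of the reformatting step, the sketch below turns (question, answer) tuples into JSON Lines records. The key names `prompt` and `completion`, the helper names, and the sample Italian pair are all assumptions made for this example; the actual layout depends on the fine-tuning library used downstream.

```python
import json

def to_records(pairs):
    """Turn (question, answer) tuples into dicts.
    The 'prompt'/'completion' keys are hypothetical, not a required schema."""
    return [
        {"prompt": question.strip(), "completion": answer.strip()}
        for question, answer in pairs
    ]

def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line.
    ensure_ascii=False keeps accented Italian characters readable."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)

# Hypothetical, already-translated pair used only to show the shape of the output.
example_pairs = [
    ("Qual è il ruolo della conformità HR? ", " Definire politiche e procedure."),
]
jsonl_text = to_jsonl(to_records(example_pairs))
print(jsonl_text)
```

Writing one record per line keeps the file easy to stream and to inspect by hand.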

In [None]:
import numpy as np
import pandas as pd
import os
import re
import plotly.graph_objects as go

from google.colab import userdata
from google.colab import files,drive
drive.mount('/content/gdrive')

# From text to prompt-answer pairs

In [None]:
!pip install -qU langchain

In [None]:
!pip install -qU openai

In [None]:
from openai import OpenAI
from langchain_core.prompts import PromptTemplate
import json

In [None]:
os.environ['OPENAI_API_KEY'] = userdata.get('OpenAI_API_Key')

In [None]:
client = OpenAI()

model_3_5_turbo_1106 = "gpt-3.5-turbo-1106"
model = model_3_5_turbo_1106

template_general_questions = """
    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, you need to generate 5 pairs of Question-Answer on the general topic.
    Since the questions are generic, the answers must be at least 2 sentences (but do not go above 6 sentences).
"""

template_specific_questions = """
    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, identify one related sub-topic covered in the provided context.
    Create two Question-Answer pairs for said sub-topic. Double check that there are exactly two.
    Since the questions are specific to a sub-topic, each answer must be at most four sentences long.
    Repeat the above actions for five different sub-topics covered in the context.
    Before providing the output, review your answer and make sure that five sub-topics have been identified.
"""

content = """
    You are a helpful assistant that reads documents, understands their content, and generates Question-Answer pairs.
    Your output will be used to perform supervised fine-tuning of an LLM - keep it in mind when formulating both the question and the answer.
    The desired output format is the following:
    - The first line of the output should be "Topic:" followed by the topic identified in the provided document
    - Identify the questions with "Question:" and the answers with "Answer:"
    - Each question and each answer must be on a single line. As a result, each line will start either with "Question:" or with "Answer:"
    - Do not add anything else
    - Do not leave any empty line
    Avoid referring to any Named Entity in the questions, unless extremely relevant for the document content.
    Email addresses and phone numbers are not relevant for me - do not mention them at any time.
"""

The `transcript` variable still needs to be defined: we load it from one of the extracted text files.

In [None]:
with open('/content/gdrive/MyDrive/Syllog/transcripts/text_01.txt', 'r') as f:  # closes the file automatically
    transcript = f.read()

In [None]:
prompt_question = PromptTemplate(
    input_variables=["transcript"],
    template=template_specific_questions,  # or template_general_questions
)

query = prompt_question.format(transcript=transcript)

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": content},
        {"role": "user", "content": query}
    ],
    temperature=0.0,
)

flashcard_content = response.choices[0].message.content
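Since the system prompt asks for lines starting with "Topic:", "Question:" or "Answer:", the raw string can be split back into structured pairs. The following is a minimal sketch under that assumption; `parse_qa_output` and the abbreviated `sample` text are hypothetical names and data introduced here for illustration. Blank lines are skipped, because the model may emit them despite the instructions.

```python
def parse_qa_output(text):
    """Parse 'Topic:' / 'Question:' / 'Answer:' lines into a list of dicts."""
    pairs = []
    topic = None
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate empty lines in the model output
        if line.startswith("Topic:"):
            topic = line[len("Topic:"):].strip()
        elif line.startswith("Question:"):
            question = line[len("Question:"):].strip()
        elif line.startswith("Answer:") and question is not None:
            answer = line[len("Answer:"):].strip()
            pairs.append({"topic": topic, "question": question, "answer": answer})
            question = None  # wait for the next question before accepting an answer
    return pairs

# Abbreviated, made-up sample that mimics the expected output format.
sample = (
    "Topic: HR Compliance\n\n"
    "Question: What is the role of HR compliance?\n"
    "Answer: It defines policies and procedures.\n\n"
    "Topic: HR Compliance Officer\n\n"
    "Question: What does an HR compliance officer do?\n"
    "Answer: They ensure policies are followed.\n"
)
parsed_pairs = parse_qa_output(sample)
print(len(parsed_pairs))
```

In the notebook, the same function could be applied to `flashcard_content`. Lines that match none of the three prefixes are ignored, so it is worth comparing the number of parsed pairs against the number requested in the prompt.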

In [None]:
flashcard_content

"Topic: HR Compliance\n\nQuestion: What is the role of HR compliance?\nAnswer: The role of HR compliance is to define policies and procedures to ensure that employment and work practices align with all applicable laws, rules, and regulations while also supporting the company's larger objectives.\n\nQuestion: Why is HR compliance mandatory?\nAnswer: HR compliance is mandatory to implement best practices for company policies, create manuals for hiring and retaining employees, enhance the business's reputation, ensure legal employment practices, and drive business results within the framework of employment laws.\n\nTopic: HR Compliance Officer\n\nQuestion: What is the significance of an HR compliance officer?\nAnswer: HR compliance officers ensure that all individuals in the company adhere to policies and guidelines, work alongside a dedicated HR compliance team, and proactively ensure that the system meets all legal obligations.\n\nQuestion: What are the prerequisites for becoming an HR 