<a href="https://colab.research.google.com/github/Ilvecho/FineTuned_LLM/blob/main/Docs_elaboration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook processes the text extracted from the web into training data.

Supervised fine-tuning requires **prompt - expected answer** pairs.

We therefore use an existing LLM (ChatGPT) to generate such pairs from the extracted text.

Two further processing steps are then needed:
- Translate the pairs from English to Italian
- Reformat the pairs into the layout expected by the LoRA fine-tuning step

In [1]:
import numpy as np
import pandas as pd
import os
import re
import plotly.graph_objects as go

from google.colab import userdata
from google.colab import files,drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# From text to prompt-answer pairs

Install the needed libraries

In [None]:
!pip install -qU langchain

In [None]:
!pip install -qU openai

In [None]:
from openai import OpenAI
from langchain_core.prompts import PromptTemplate
import json

Get the API key from Colab's secure secrets storage

In [None]:
os.environ['OPENAI_API_KEY'] = userdata.get('OpenAI_API_Key')

Prompt templates

In [None]:
client = OpenAI()

model_3_5_turbo_1106 = "gpt-3.5-turbo-1106"
model = model_3_5_turbo_1106

template_general_questions = """
    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, you need to generate 5 pairs of Question-Answer on the general topic.
    Since the questions are generic, the answers must be at least 2 sentences (but do not go above 6 sentences).
"""

template_specific_questions = """
    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, identify one related sub-topic covered in the provided context.
    In the output, list all the identified sub-topics in a numbered list. You can use it to double check that there are five identified sub-topics.
    Create two Question-Answer pairs for said sub-topic. Double check that there are two.
    Since the questions are specific to a sub-topic, each answer must be at most four sentences long.
    Repeat the above actions for five different sub-topics covered in the context.
    Before providing the output, review your answer and make sure that five sub-topics have been identified.
"""

content = """
    You are a helpful assistant that reads documents, understands their content, and generates Question-Answer pairs.
    Your output will be used to perform supervised fine-tuning of an LLM - keep it in mind when formulating both the question and the answer.
    The desired output format is the following:
    - The first line of the output should be "Topic:" followed by the topic identified in the provided document
    - Identify the questions with "Question:" and the answers with "Answer:"
    - Each question and each answer must be on one line only. As a result, each line will start either with "Question:" or with "Answer:"
    Avoid referring to any Named Entity in the questions, unless extremely relevant for the document content.
    Email addresses and phone numbers are not relevant for me - do not mention them at any time.
"""

In [None]:
file_list = os.listdir('/content/gdrive/MyDrive/Syllog/transcripts')

tot_chars = []

for file in file_list:

  if 'ITA' in file:
    continue

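  file_path = os.path.join('/content/gdrive/MyDrive/Syllog/transcripts', file)
  # Use a context manager so the file handle is closed after reading
  with open(file_path, 'r') as transcript_file:
    transcript = transcript_file.read()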

  general_response = ''
  specific_response = ''

  general_question = PromptTemplate(
              input_variables = ["transcript"],
              template=template_general_questions #  template_specific_questions
          )

  query = general_question.format(transcript = transcript)

  response = client.chat.completions.create(
      model=model,
      messages=[
          {"role": "system", "content": content},
          {"role": "user", "content": query}
      ],
      temperature=0.0,
  )

  general_response = response.choices[0].message.content


  if os.path.getsize(file_path) / 1024 > 4.5:  # If the file is long enough, get the specific questions as well

    specific_question = PromptTemplate(
                input_variables = ["transcript"],
                template=template_specific_questions   #  template_general_questions
            )

    query = specific_question.format(transcript = transcript)

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": content},
            {"role": "user", "content": query}
        ],
        temperature=0.0,
    )

    specific_response = response.choices[0].message.content

  # Save the generated pairs (general first, then specific, if any)
  with open('/content/gdrive/MyDrive/Syllog/QA_pairs/' + file, 'w') as text_file:
    print(f"Saving questions from file: {file}")
    text_file.write(general_response + '\n\n\n' + specific_response)

Saving questions from file: text_01.txt
Saving questions from file: text_02.txt
Saving questions from file: text_08.txt
Saving questions from file: text_07.txt
Saving questions from file: text_06.txt
Saving questions from file: text_03.txt
Saving questions from file: text_04.txt
Saving questions from file: text_05.txt
Saving questions from file: text_09.txt
Saving questions from file: text_10.txt
Saving questions from file: text_23.txt
Saving questions from file: text_25.txt
Saving questions from file: text_29.txt
Saving questions from file: text_32.txt
Saving questions from file: text_33.txt
Saving questions from file: text_34.txt
Saving questions from file: text_37.txt
Saving questions from file: text_42.txt
Saving questions from file: text_45.txt
Saving questions from file: text_46.txt
Saving questions from file: text_47.txt
Saving questions from file: text_58.txt
Saving questions from file: text_59.txt
Saving questions from file: text_62.txt
Saving questions from file: text_65.txt
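
The saved files should follow the line-based format requested in the system prompt: a "Topic:" line, then lines starting with "Question:" or "Answer:". Below is a minimal sketch of how such a file could be turned back into pairs before translation. The function name `parse_qa_output` and the dictionary keys are assumptions made for illustration; they are not part of the original notebook. The sketch also assumes the model respected the one-line-per-item rule, which is worth checking on the real outputs.

```python
from typing import Dict, List, Optional, Tuple


def parse_qa_output(text: str) -> Tuple[Optional[str], List[Dict[str, str]]]:
    """Split generated text into a topic and a list of question/answer pairs.

    Hypothetical helper: assumes every question and answer sits on one line.
    """
    topic = None
    pairs = []
    pending_question = None

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Topic:"):
            # A saved file may hold two responses, each with its own topic line;
            # keep only the first one.
            if topic is None:
                topic = line[len("Topic:"):].strip()
        elif line.startswith("Question:"):
            pending_question = line[len("Question:"):].strip()
        elif line.startswith("Answer:") and pending_question is not None:
            pairs.append({
                "question": pending_question,
                "answer": line[len("Answer:"):].strip(),
            })
            pending_question = None
        # Any other line (blank lines, numbered sub-topic lists) is ignored.

    return topic, pairs
```

An answer without a preceding question is dropped silently here; logging such cases instead may be preferable when auditing the generated data.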


# Translate & save in proper format

In [2]:
!pip install -U deep-translator

Collecting deep-translator
  Downloading deep_translator-1.11.4-py3-none-any.whl (42 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 kB 700.7 kB/s eta 0:00:00
Installing collected packages: deep-translator
Successfully installed deep-translator-1.11.4


In [3]:
from deep_translator import GoogleTranslator

In [4]:
example = "This is just an example to test the capabilities of the translator. Let's see how it performs"

GoogleTranslator(source='en', target='it').translate(text=example)

'Questo è solo un esempio per testare le capacità del traduttore. Vediamo come si comporta'
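
With the translator working on a single string, the same call can be applied to every pair and the result written out one JSON object per line. The sketch below assumes the pairs are held as a list of dicts with `question` and `answer` keys (a hypothetical intermediate structure). The output keys `prompt` and `completion` are placeholders: the exact layout the fine-tuning step expects is not specified here, so adjust them as needed. The translation function is passed in as a parameter, so the helpers can be tried without network access.

```python
import json
from typing import Callable, Dict, List


def translate_pairs(pairs: List[Dict[str, str]],
                    translate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return a new list with both question and answer passed through `translate`."""
    return [
        {"question": translate(p["question"]), "answer": translate(p["answer"])}
        for p in pairs
    ]


def to_jsonl(pairs: List[Dict[str, str]],
             question_key: str = "prompt",
             answer_key: str = "completion") -> str:
    """Serialize pairs as JSON Lines; the key names are assumptions, not a fixed spec."""
    lines = [
        # ensure_ascii=False keeps accented Italian characters readable
        json.dumps({question_key: p["question"], answer_key: p["answer"]},
                   ensure_ascii=False)
        for p in pairs
    ]
    return "\n".join(lines)


# Possible usage in this notebook (requires network access):
# translator = GoogleTranslator(source='en', target='it')
# italian = translate_pairs(pairs, lambda s: translator.translate(text=s))
# print(to_jsonl(italian))
```

Translating question and answer separately keeps the pair boundaries intact, at the cost of two requests per pair.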