<a href="https://colab.research.google.com/github/datafyresearcher/datafy-finetuning-beginner/blob/main/notebooks/Medium/01_LLM_DataConverter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Supported Dataset**
In this tutorial, we build a custom dataset from State Bank of Pakistan (SBP) material, where each target area is covered by its own department.

Brief details of two of these State Bank of Pakistan departments:

1. **Banking Policy and Regulation Group:** This group is responsible for formulating and implementing policies related to banking and financial regulations in Pakistan. It plays a crucial role in maintaining the stability and integrity of the banking sector.

2. **Banking Supervision Group:** The Banking Supervision Group oversees and regulates banks and financial institutions operating in Pakistan. Its primary objective is to ensure the safety and soundness of the banking system by monitoring compliance with regulatory standards and conducting supervision activities.

This tutorial runs in Google Colab, which needs no local setup. The accompanying notebook walks through the setup and initial steps.

With the notebook, you can build your own custom dataset in a conversational (chat-style) format.


# **Requirements**

Before you begin with this tutorial, please ensure that you fulfil the following prerequisites:

1. A basic understanding of Excel, Python, Hugging Face (Datasets), and Pandas.

2. The ability to create or modify a JSON dataset.

3. A demonstrated enthusiasm for learning and enhancing your skills.

# **Follow Steps**

This tutorial aims to help you achieve the following key steps:

1. Convert Excel data into JSON format.
2. Apply data wrangling to the data: replace '&' with 'and', strip leading and trailing whitespace, and remove '\n'.
3. Build or modify datasets compatible with Hugging Face.
4. Understand the structure transformation from Pandas to JSON datasets.
5. Comprehend the dataset structure used by OpenAI, including the roles of "System," "User," and "Assistant" with their respective content.
6. Familiarize yourself with the Hugging Face dataset structure, which primarily involves "Question and Answer" pairs.
7. Ensure that your dataset is appropriately structured and ready for fine-tuning the Language Model.


In [None]:
import os, json
import pandas as pd
from pprint import pprint

# **Excel to JSONL**

Converting Excel data to JSON means restructuring a spreadsheet into JSON (JavaScript Object Notation). In the record-oriented layout used here, each row becomes a JSON object, each column name becomes a key, each cell becomes that key's value, and the whole sheet becomes an array of such objects. JSON is widely supported, so the result is easy to exchange with other tools.
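The row-to-object mapping can be illustrated without a spreadsheet. The following is a minimal plain-Python sketch; the `header` and `rows` values are made-up samples and `rows_to_records` is a hypothetical helper, not part of the notebook.

```python
import json

# Hypothetical spreadsheet contents: one header row plus two data rows.
header = ["question", "answer"]
rows = [
    ["What does SBP stand for?", "State Bank of Pakistan."],
    ["What does BPRD stand for?", "Banking Policy and Regulations Department."],
]

def rows_to_records(header, rows):
    """Turn each row into a dict keyed by the column names."""
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(header, rows)

# The whole sheet becomes a JSON array of objects, one object per row.
print(json.dumps(records, indent=2))
```

Each object carries the column names as keys, which is the same shape the notebook later obtains from Pandas.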

In [None]:
# Load the Dataset from Google Drive
!gdown 17PjyNrPPHdtyHlJotMudRX9crADe6WLE
# Read the Excel file
data = pd.read_excel("(State Bank) Question and Answer.xlsx")
data

Downloading...
From: https://drive.google.com/uc?id=17PjyNrPPHdtyHlJotMudRX9crADe6WLE
To: /content/(State Bank) Question and Answer.xlsx
  0% 0.00/10.2k [00:00<?, ?B/s]100% 10.2k/10.2k [00:00<00:00, 32.7MB/s]


| | question | answer |
|---|---|---|
| 0 | What is one of the main objectives of The Stat... | One of the main objectives of The State Bank o... |
| 1 | What is the role of the Banking Policy & Regul... | The Banking Policy & Regulations Department (B... |
| 2 | What are the goals of the Banking Conduct & Co... | To identify banking conduct trends, review reg... |
| 3 | What kind of advice does the EPD provide to th... | The EPD advises on issues related to Trade Pol... |
| 4 | What is the role of the Banking Supervision De... | BSD plays a pivotal role in maintaining the so... |
| 5 | BSD-2 plays a pivotal role in maintaining the ... | Financial products and services enable consume... |


Next, clean the dataset. The cell below replaces '&' with 'and', strips leading and trailing whitespace, and removes newline characters ('\n'), so that every question and answer is in a consistent format before conversion.

In [None]:
# Note: the .str accessor assumes every column holds strings.
# Replace '&' with 'and' in all columns (literal match, not a regex)
data_replace = data.apply(lambda col: col.str.replace('&', 'and', regex=False))
# Strip leading and trailing whitespace from all columns
data_strip = data_replace.apply(lambda col: col.str.strip())
# Remove newline characters ('\n') from all columns
data_escape = data_strip.apply(lambda col: col.str.replace('\n', '', regex=False))
data_escape

| | question | answer |
|---|---|---|
| 0 | What is one of the main objectives of The Stat... | One of the main objectives of The State Bank o... |
| 1 | What is the role of the Banking Policy and Reg... | The Banking Policy and Regulations Department ... |
| 2 | What are the goals of the Banking Conduct and ... | To identify banking conduct trends, review reg... |
| 3 | What kind of advice does the EPD provide to th... | The EPD advises on issues related to Trade Pol... |
| 4 | What is the role of the Banking Supervision De... | BSD plays a pivotal role in maintaining the so... |
| 5 | BSD-2 plays a pivotal role in maintaining the ... | Financial products and services enable consume... |
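The table output is truncated, so the effect of the three cleaning steps is easier to see on a single value. The sketch below applies the same steps, in the same order, to one made-up string; `clean_text` is a hypothetical helper introduced only for illustration.

```python
def clean_text(text: str) -> str:
    """Apply the three cleaning steps from the cell above to one string."""
    text = text.replace("&", "and")  # spell out ampersands
    text = text.strip()              # drop leading and trailing whitespace
    text = text.replace("\n", "")    # remove any remaining newlines
    return text

# Made-up sample with an ampersand, padding, and an internal newline.
sample = "  Banking Policy & Regulations\nDepartment \n"
print(repr(clean_text(sample)))
# → 'Banking Policy and RegulationsDepartment'
```

Note that the internal newline is removed without inserting a space, so the two neighbouring words run together. If that matters for your data, replacing '\n' with a single space is one possible alternative.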


#### Data Instance Structure

Next, convert the cleaned Pandas DataFrame into JSON records that Hugging Face tooling can consume. With `orient='records'`, each DataFrame row becomes one JSON object whose keys are the column names (`question` and `answer`), which is a layout commonly used for NLP datasets.

In [None]:
# Convert the DataFrame to a JSON string, then parse it into a list of dicts
json_data = data_escape.to_json(orient='records', indent=4)
json_data = json.loads(json_data)
pprint(json_data)

[{'answer': 'One of the main objectives of The State Bank of Pakistan (SBP) is '
            'to ensure a robust and efficient financial sector capable of '
            'catering to the needs of the general public, businesses, and '
            'regulated institutions.',
  'question': 'What is one of the main objectives of The State Bank of '
              'Pakistan (SBP)?'},
 {'answer': 'The Banking Policy and Regulations Department (BPRD) is entrusted '
            'with the responsibility of formulating and implementing an '
            'effective regulatory regime to achieve the overall objective of '
            'SBP. This includes formulating regulations responsive to the '
            'local and international environment.',
  'question': 'What is the role of the Banking Policy and Regulations '
              'Department (BPRD)?'},
 {'answer': 'To identify banking conduct trends, review regulatory '
            'interventions, promote grievance handling, empower consumers, '
    

Gain a comprehensive understanding of the dataset structures employed by OpenAI and Hugging Face, each with its distinct elements.

## Hugging Face Dataset Format: LLaMa2 - Falcon
The Hugging Face format used in this tutorial, intended for open models such as LLaMA 2 and Falcon, is built around question-and-answer pairs: each record holds a `question` and its `answer`. This format suits question-answering and other natural language understanding tasks. Knowing both structures makes it easier to prepare data for either OpenAI or Hugging Face.
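A question-and-answer record of this kind can be written to a JSONL file, one JSON object per line, as in the sketch below. The sample records, the file name, and both helper functions are illustrative assumptions. `load_with_datasets` relies on the `datasets` library's JSON loader and imports it lazily, so the rest of the block runs without that library installed.

```python
import json
import os
import tempfile

# Illustrative question/answer records (sample values, not the real dataset).
qa_records = [
    {"question": "What does SBP stand for?", "answer": "State Bank of Pakistan."},
    {"question": "What does BPRD stand for?",
     "answer": "Banking Policy and Regulations Department."},
]

def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_with_datasets(path):
    """Load the JSONL file as a Hugging Face Dataset (needs `pip install datasets`)."""
    from datasets import load_dataset  # lazy import: optional dependency
    return load_dataset("json", data_files=path, split="train")

# Write the sample records to a temporary file and show the first line.
path = os.path.join(tempfile.mkdtemp(), "qa_sample.jsonl")
write_jsonl(qa_records, path)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines[0])
```

If `datasets` is installed, `load_with_datasets(path)` should return a dataset with `question` and `answer` columns.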

## OpenAI Dataset Format: ChatGPT
OpenAI's chat format stores each training example as a list of messages, each with a role and its content. There are three roles: "system" holds the instruction or prompt that sets the model's behavior, "user" holds the input or query, and "assistant" holds the response the model is expected to produce. Together, these form the dialogue-based structure used by OpenAI.

In [None]:
# System prompt shared by every example in the dataset.
SYSTEM_PROMPT = "You are a distinguished banking expert, extensively trained to adeptly handle a wide array of banking and financial matters, with a distinct focus on the intricacies of the Pakistan Banking System"

def chat_gpt(json_data=None):
  """Wrap each question/answer record in the OpenAI chat "messages" layout."""
  # List that collects one chat example per question/answer pair.
  data_store = []

  # Treat a missing input as an empty dataset instead of failing on None.
  for record in json_data or []:
    data_store.append({
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    })
  return data_store
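The record layout described above, a `messages` list of role and content pairs, can be checked with a small validator before writing the file. This is a sketch only: `validate_chat_record` is a hypothetical helper, and the rule that the last message comes from the assistant is an assumption that matches this tutorial's records rather than a statement of OpenAI's full requirements.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_record(record):
    """Return a list of problems found in one chat-format record (empty if none)."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["record must contain a non-empty 'messages' list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = message.get("role")
        content = message.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")

    # Assumption: each example ends with the assistant's answer.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems

# Made-up examples: one well-formed record and one with several problems.
good = {"messages": [
    {"role": "system", "content": "You are a banking expert."},
    {"role": "user", "content": "What does SBP stand for?"},
    {"role": "assistant", "content": "State Bank of Pakistan."},
]}
bad = {"messages": [
    {"role": "user", "content": "What does SBP stand for?"},
    {"role": "bot", "content": ""},
]}

print(validate_chat_record(good))
print(validate_chat_record(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.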

In [None]:
data_format_type = "OpenAI"  # Two choices: "OpenAI" or "Hugging Face"

# Select the records to write based on the chosen format
if data_format_type == "Hugging Face":
  records = json_data
elif data_format_type == "OpenAI":
  records = chat_gpt(json_data)
else:
  raise ValueError(f"Unknown data_format_type: {data_format_type!r}")

# Write one JSON object per line (JSONL); the with block closes the file
with open("LLM_Custom_Dataset.jsonl", "w", encoding="utf-8") as data_file:
  for item in records:
    data_file.write(json.dumps(item) + "\n")

Before fine-tuning the language model, make sure the dataset is well organized: every line of the JSONL file should be a valid JSON object in the format you selected.
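One way to run such a check is to read the file back line by line, confirm that each line parses, and count which layout each record follows. The sketch below does this on a small temporary file. `classify_record` and `summarize_jsonl` are hypothetical helpers, and the sample file deliberately mixes layouts and includes one broken line to show what the check reports.

```python
import json
import os
import tempfile
from collections import Counter

def classify_record(record):
    """Label a parsed record as 'openai', 'hugging_face', or 'unknown'."""
    if isinstance(record, dict) and isinstance(record.get("messages"), list):
        return "openai"
    if isinstance(record, dict) and {"question", "answer"} <= record.keys():
        return "hugging_face"
    return "unknown"

def summarize_jsonl(path):
    """Count records per layout and collect line numbers that fail to parse."""
    counts, bad_lines = Counter(), []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                counts[classify_record(json.loads(line))] += 1
            except json.JSONDecodeError:
                bad_lines.append(line_number)
    return counts, bad_lines

# Build a small sample file: two valid records and one malformed line.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"question": "Q1", "answer": "A1"}) + "\n")
    f.write(json.dumps({"messages": [{"role": "user", "content": "Q2"}]}) + "\n")
    f.write("{not valid json}\n")

counts, bad_lines = summarize_jsonl(path)
print(dict(counts), bad_lines)
```

A file that is ready for fine-tuning would be expected to show a single layout in `counts` and an empty `bad_lines` list.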