# Instruction

In this sample code, we explain in detail how to use this package to generate your own chat dataset.

## Data Preparation

For this walkthrough, we have prepared a sample dataset for you.


Let's first take a look at this sample dataset. You can open sample_dataset.xlsx with any spreadsheet reader (e.g. Excel or Google Sheets).  
You will see several sheets in this file: QuestionAskingMerge, Self_cognition, End_of_Conversation, PC and Screen.  

This file shows the raw data format expected from users. Follow these rules to prepare your own input data:  
1. Categorize your QA data into different sheets.
2. Annotate the QA data with:
    - Type (category of the question)
    - Level (hierarchical level of the question)
    - No (question number)
    - UID (made from Level and No, i.e. Level_No)
    - Parent (the UID of the parent QA)
    - Well-formed question and well-formed answer  

    Notes:
    - Type is the category of the question.
    - Level is the hierarchical position of the question. The opening QA of a conversation should be level A (e.g. Q: Hi, I wanna ask you a question. A: Hi, what do you want to ask?), the QA that follows a level A QA should be level B (e.g. Q: Where is the capital of Taiwan? A: The capital of Taiwan is Taipei City), and so on.
    - This beta version provides only four levels: A, B, C and Z, where Z represents the final QA of a conversation (e.g. Q: Thank you! A: You're welcome!).
    - Follow the pattern of the sample file for a more detailed understanding.

3. Put all the sheets (all the different categories) into one merged sheet.
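The annotation rules above can be sketched in plain Python. This is only an illustration: the `Question` and `Answer` keys, the `Type` values, the empty `Parent` for the Z-level row, and the helpers `make_uid` and `find_orphans` are assumptions made for this example, and the underscore-joined UID format should be checked against the sample file.

```python
def make_uid(level, no):
    """Join Level and No into a UID such as 'A_1' (assumed format)."""
    return f"{level}_{no}"


# Three annotated QA rows; the key names are hypothetical column names.
rows = [
    {"Type": "Greeting", "Level": "A", "No": 1, "Parent": None,
     "Question": "Hi, I wanna ask you a question.",
     "Answer": "Hi, what do you want to ask?"},
    {"Type": "Geography", "Level": "B", "No": 1, "Parent": "A_1",
     "Question": "Where is the capital of Taiwan?",
     "Answer": "The capital of Taiwan is Taipei City"},
    {"Type": "Closing", "Level": "Z", "No": 1, "Parent": None,
     "Question": "Thank you!", "Answer": "You're welcome!"},
]

# Derive the UID of every row from its Level and No.
for row in rows:
    row["UID"] = make_uid(row["Level"], row["No"])


def find_orphans(rows):
    """Return UIDs whose Parent does not match any known UID."""
    known = {row["UID"] for row in rows}
    return [row["UID"] for row in rows
            if row["Parent"] is not None and row["Parent"] not in known]
```

Running `find_orphans(rows)` before merging the sheets is a cheap way to catch a mistyped Parent value. A list of dictionaries like this can be turned into a table with `pandas.DataFrame(rows)`.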

## Generation

### Setup

After preparing your own dataset, you can follow the code below to generate your own chat dataset.

First, let's import the packages used in this walkthrough and set the file paths.

In [None]:
# Packages
import pandas as pd
from chatgen.chat_algo import ChatAlgo
from chatgen.data_loader import load_xlsx, create_input_data

In [None]:
# Set data path
input_file = "./dataset/sample_dataset.xlsx" # sample dataset or replace with your own dataset
sheet_name = 'QuestionAskingMerge' # the sheet we would use or replace with your own merged sheet name
output_file = "./output/conversations.json" # output path

We can then build the input with the function ```create_input_data()```. We provide ```load_xlsx()``` for loading the sample data, but you can also load your dataset into a Pandas DataFrame in your own way.

In [None]:
# Load raw data & Create input data
data = load_xlsx(input_file, sheet_name)
input_data = create_input_data(data)

In [None]:
# Let's take a look at the input_data
input_data

### Algorithm

The algorithm generates the chat dataset by simulating human conversational behavior, and it is controlled by a few parameters.  
First, consider ```max_depth```, which sets how deep a generated conversation can be.  
For example, if you want to fine-tune a chatbot to be a debate expert, you should set ```max_depth``` larger, since a debate is usually a long conversation.  
```init_weight``` is the initial probability of the B and C levels, and ```final_level_weight``` is the initial probability of the Z level, since most conversations start at the A level.  

In each generation pass, the algorithm follows the pattern below, repeating until it has generated ```generate_times``` conversations:  

1. Randomly choose a number from ```1``` to ```max_depth``` as the number of rounds in this conversation
2. Initialize the probabilities and randomize the depth of this conversation
3. Sample one observation from the input data
4. Punish the probability of that observation (since most of the time we don't repeat the same QA in real life)
5. If the observation has children, reward its children. Otherwise, reward the observations at the same level by ```child_reward``` (this encourages the conversation to continue)
6. Modify the probabilities of the levels using the ORD matrix  

You can check ```./chatgen/chat_algo.py``` to see this matrix. Each row gives the weights of every level (column) based on the level of the observation (row). The values are based on our empirical research, and they also help keep the conversation going.

7. Repeat steps 3 to 6. Once the number of samples reaches ```final_weighting_threshold```, reward the Z level by ```final_level_reward``` (people do not always complete a conversation, so there is some probability that it ends before an answer is given)  
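The steps above can be sketched as a small weighted-sampling loop. This is a simplified illustration, not the implementation in `./chatgen/chat_algo.py`: the toy `QA_POOL`, the function `generate_conversation`, the multiplicative punish and reward updates, and stopping as soon as a Z-level QA is drawn are all assumptions made for this example. The ORD matrix step is left out because its values are not shown here.

```python
import random

# Toy QA pool (hypothetical); fields mirror the Level, UID and Parent columns.
QA_POOL = [
    {"uid": "A_1", "level": "A", "parent": None},
    {"uid": "B_1", "level": "B", "parent": "A_1"},
    {"uid": "B_2", "level": "B", "parent": "A_1"},
    {"uid": "C_1", "level": "C", "parent": "B_1"},
    {"uid": "Z_1", "level": "Z", "parent": None},
]


def generate_conversation(pool, max_depth=6, init_weight=0.05,
                          final_level_weight=1e-6, current_punish=0.01,
                          child_reward=8000, final_weighting_threshold=2,
                          final_level_reward=5000, rng=random):
    """Return the list of UIDs sampled for one simulated conversation."""
    rounds = rng.randint(1, max_depth)  # step 1: number of rounds
    # Step 2: level A dominates at the start, B and C are unlikely, Z is rare.
    start = {"A": 1.0, "B": init_weight, "C": init_weight,
             "Z": final_level_weight}
    weights = {qa["uid"]: start[qa["level"]] for qa in pool}
    history = []
    for step in range(1, rounds + 1):
        uids = list(weights)
        # Step 3: draw one QA in proportion to its current weight.
        picked_uid = rng.choices(uids, weights=[weights[u] for u in uids])[0]
        picked = next(qa for qa in pool if qa["uid"] == picked_uid)
        history.append(picked_uid)
        if picked["level"] == "Z":
            break  # assumption: a closing QA ends the conversation
        weights[picked_uid] *= current_punish  # step 4: discourage repeats
        children = [qa["uid"] for qa in pool if qa["parent"] == picked_uid]
        # Step 5: reward children, or same-level QAs when there are none.
        targets = children or [
            qa["uid"] for qa in pool
            if qa["level"] == picked["level"] and qa["uid"] != picked_uid
        ]
        for uid in targets:
            weights[uid] *= child_reward
        # Step 7: after enough samples, make closing QAs more likely.
        if step >= final_weighting_threshold:
            for qa in pool:
                if qa["level"] == "Z":
                    weights[qa["uid"]] *= final_level_reward
    return history
```

Calling `generate_conversation(QA_POOL, rng=random.Random(0))` returns one list of UIDs. Passing a seeded `random.Random` makes runs repeatable, which helps when comparing parameter settings.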

You can change the parameters based on your dataset to get the result you want.

In [None]:
# Params setting
params = {
    'system_prompt':'You are a helpful artificial intelligent assistant, your name is ChatGenBot.',
    'generate_times': 1000,
    'max_depth': 6,
    'init_weight': 0.05,
    'final_level_weight':0.000001,
    'current_punish': 0.01,
    'child_reward': 8000,
    'final_weighting_threshold': 2,
    'final_level_reward': 5000,
}

Calling ```create_chat_history()``` generates the chat history from the input data using the params you just set.

In [None]:
# Create conversation dataset
chat_algo = ChatAlgo(input_data) # initialization
chat_algo.create_chat_history(**params) # Generation

After generation, you can use ```sample_output()``` to take a glimpse of the generated dataset.

In [None]:
chat_algo.sample_output()

Finally, you can save the result to a JSON file with ```to_json()```.

In [None]:
chat_algo.to_json(output_file) # save to JSON
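To confirm that the file was written, you can read it back with the standard `json` module. The helper `describe_json` below is a hypothetical name for this example, and it makes no assumption about the structure of the saved conversations beyond the file being valid JSON.

```python
import json
from pathlib import Path


def describe_json(path):
    """Load a JSON file and return (top-level type name, number of entries)."""
    with Path(path).open(encoding="utf-8") as f:
        content = json.load(f)
    # Lists and dicts have a length; any other JSON value counts as one entry.
    size = len(content) if isinstance(content, (list, dict)) else 1
    return type(content).__name__, size
```

For example, `describe_json(output_file)` reports the top-level type of the saved file and how many entries it holds.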