# Making your dataset
This notebook converts a text file of conversations into the JSON format required by Hugging Face's chatbot model.  
The text file should contain conversations made up of lines of dialogue alternating between two speakers.  
Each utterance should be on its own line.  
Conversations should be separated by a single blank line.  
Each conversation should begin with speaker 1, the customer, followed by speaker 2, the bot.   

Example conversation format:
```
hours of operation  
All BotBank locations are open 7am to 4pm monday through friday! What else I can help with?  
that's all  
Thank you for using BotBank.  

what are the hours?  
All BotBank locations are open 7am to 4pm monday through friday! What else I can help with?  
thats all I needed  
Glad I could help. Thanks for choosing BotBank Have a nice day.  
```
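
Because the format depends on strict alternation between customer and bot, it can be worth checking a file before converting it. The sketch below is one possible check; `find_malformed` is a hypothetical helper that is not part of this notebook, and it assumes a well-formed conversation has an even number of non-empty lines (one bot reply per customer line).

```python
def find_malformed(text):
    """Return indices of conversations with an odd number of lines.

    Hypothetical helper: assumes conversations are separated by a blank
    line and that each customer line is followed by exactly one bot reply.
    """
    bad = []
    for index, convo in enumerate(text.strip().split('\n\n')):
        lines = [line for line in convo.split('\n') if line.strip()]
        if len(lines) % 2 != 0:
            bad.append(index)
    return bad


sample = (
    "hours of operation\n"
    "All BotBank locations are open 7am to 4pm monday through friday!\n"
    "\n"
    "what are the hours?\n"
    "All BotBank locations are open 7am to 4pm monday through friday!\n"
    "thats all I needed\n"
)
print(find_malformed(sample))  # the second conversation lacks a final bot reply
```

Here the second conversation ends on a customer line, so its index (1) is reported.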



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd '/content/drive/MyDrive/Engineering/Curriculum/8th Semester/Final Year Project/cs-nlg'

/content/drive/.shortcut-targets-by-id/112cUQJc2ZRAxcUeLWAQ1tZJ7bLyg9jf0/cs-nlg


In [None]:
!ls

conversations.txt		  financial_responses.txt  requirements.txt
cs-nlgbot.ipynb			  hugging-face		   runs
cs_training_data.json		  make_dataset.ipynb	   training_schema.json
dataset_cache_OpenAIGPTTokenizer  README.md


In [None]:
import json
from random import choice

Read the conversations from a text file.  
The file must follow the format described above; the parser relies on a blank line (a double newline, `'\n\n'`) to divide conversations.  
Each conversation is parsed into a list of utterances, giving a list of lists.  

In [None]:
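with open('conversations.txt') as txt:
    data = txt.read()

# Conversations are separated by a blank line; strip() avoids an empty
# trailing conversation when the file ends with extra newlines.
convos = data.strip().split('\n\n')
convo_list = []

for convo in convos:
    # Lowercase, drop commas and apostrophes, and put a space before
    # '.', '?' and '!' so each becomes its own token.
    fixed = convo.lower()
    fixed = fixed.replace('.', ' .').replace(',', '').replace('?', ' ?').replace('!', ' !').replace("'", "")
    convo_list.append(fixed.split('\n'))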

convo_list[1]

['hello do i have to manually log my steps every day if i use my fitbit ?',
 'no . as long as your fitbit is within 10 feet of the wireless usb dongle it will automatically download your steps to your 10k-a-day account . is there anything else i can help with ?',
 'thats all .',
 'glad i could help . thank you for choosing fitbit custcare .']
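
The normalization applied in the loop above is easier to see on a single string. The sketch below wraps the same chain of replacements in a function; the name `normalize` is hypothetical and used only for illustration.

```python
def normalize(text):
    """Lowercase text, drop commas and apostrophes, and put a space
    before '.', '?' and '!' so they become separate tokens."""
    text = text.lower()
    return (text.replace('.', ' .')
                .replace(',', '')
                .replace('?', ' ?')
                .replace('!', ' !')
                .replace("'", ''))


print(normalize("Hello, that's all I needed. Thanks!"))
# hello thats all i needed . thanks !
```

Note that the period rule is purely textual, so it also splits decimals: `normalize("3.5")` gives `"3 .5"`.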

Additional text is required to fill the 'candidates' section of the training data.  
The cell below reads replies from a text file taken from a financial question-and-answer dataset.  
Note that the cell only skips blank lines; it does not filter out replies longer than 100 or shorter than 40 characters.  

In [None]:
distractors = []
with open('financial_responses.txt') as e_file:
    for reply in e_file:
        # Remove the trailing newline and surrounding whitespace, then skip blank lines.
        reply = reply.strip()
        if reply:
            distractors.append(reply)
distractors[:5]

['Hi, I just got my new fitbit. How do I get started?',
 'Hello, Follow the instructions provided to register your device on Fitbit.com. Go to your 10K-A-Day account and click on the Connect Fitbit Device link on the Getting Started page or the Profile link at the top, then scroll to the bottom of the page to Connect. Is there anything else I can help with?',
 "No that's all",
 'Glad I could help. Thanks for choosing FitBit CustCare and Have a nice day.',
 'Hello, Do I have to manually log my steps every day if I use my Fitbit?']
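
The output above includes both very short and very long replies. If replies shorter than 40 or longer than 100 characters should be dropped, a filter along the following lines could be applied to `distractors`. The function name `filter_by_length` and its parameters are hypothetical, and this is only a sketch of one way to do it.

```python
def filter_by_length(replies, min_len=40, max_len=100):
    """Keep replies whose character count is within [min_len, max_len].

    Hypothetical helper: the bounds are parameters so they can be tuned.
    """
    return [r for r in replies if min_len <= len(r) <= max_len]


examples = [
    "No that's all",
    "Glad I could help. Thanks for choosing FitBit CustCare and Have a nice day.",
]
# Only the second reply (75 characters) falls inside the 40 to 100 range.
print(filter_by_length(examples))
```

Applied to the notebook's data, this would be `distractors = filter_by_length(distractors)`.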

The chatbot created by Hugging Face uses a persona to give context to its replies.  
The following cell defines the persona of the chatbot.   

In [None]:
personality = [
    "i am here to help you with your questions and requests .",
    "i am a customer support helper for FitBit .",
    "FitBit CustCare is a Twitter bot to handle FitBit's Customer Questions",
    "FitBit CustCare can answer a wide variety of FitBit queries",
    "i am a customer support engine .",
    "i cannot do some things that my human counterparts can but i can still help ."
]

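Split the conversations into training (first 80%) and validation (remaining 20%) sets.  
Each conversation becomes a record holding the persona and a list of utterances; each utterance holds the dialogue history so far and six candidate replies (five random distractors followed by the true reply).  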

In [None]:
def build_dialogs(convos):
    """Convert parsed conversations into persona/utterances records."""
    dialogs = []
    for convo in convos:
        utts = []
        # Step through the customer turns (even indices); the bot reply follows each one.
        for turn in range(0, len(convo) - 1, 2):
            utterance = {}
            # Five random distractors, then the true reply as the last candidate.
            utterance['candidates'] = [choice(distractors) for _ in range(5)]
            utterance['candidates'].append(convo[turn + 1])
            # History is everything up to and including the current customer turn.
            utterance['history'] = convo[:turn + 1]
            utts.append(utterance)
        dialogs.append({'personality': personality, 'utterances': utts})
    return dialogs

# First 80% of the conversations for training, the remaining 20% for validation.
train_length = int(len(convo_list) * 0.8)

train = build_dialogs(convo_list[:train_length])
validate = build_dialogs(convo_list[train_length:])

train_data = {}
train_data['train'] = train
train_data['validate'] = validate
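
To make the resulting structure concrete, the sketch below applies the same windowing to a single four-line conversation. It uses placeholder distractors and a hypothetical `build_utterances` helper rather than the notebook's variables, so treat it as an illustration of the record shape only.

```python
from random import choice


def build_utterances(convo, distractors, num_distractors=5):
    """Turn one conversation into a list of utterance records.

    Each record pairs the history up to a customer turn with a list of
    candidates whose last entry is the true bot reply.
    """
    utterances = []
    for turn in range(0, len(convo) - 1, 2):
        candidates = [choice(distractors) for _ in range(num_distractors)]
        candidates.append(convo[turn + 1])
        utterances.append({'candidates': candidates,
                           'history': convo[:turn + 1]})
    return utterances


convo = ['hours of operation', 'we are open 7am to 4pm .',
         'thats all', 'thank you for using botbank .']
records = build_utterances(convo, ['placeholder reply a', 'placeholder reply b'])

# Show each history alongside the true reply it should lead to.
for record in records:
    print(record['history'], '->', record['candidates'][-1])
```

A four-line conversation yields two records: the first has a one-line history, and the second has a three-line history that includes the bot's earlier reply.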


Save the training data to a JSON file.  

In [None]:
with open('cs_training_data.json', 'w') as file:
    json.dump(train_data, file)
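
After saving, it can be useful to reload the data and confirm that it has the expected shape. The sketch below is one possible check; `summarize` is a hypothetical helper, and the tiny `toy` dataset stands in for the real file so that the example runs on its own.

```python
import json


def summarize(dataset):
    """Count dialogs and utterances per split and list candidate counts.

    Hypothetical helper: assumes each split maps to a list of records
    with an 'utterances' list, as produced earlier in this notebook.
    """
    summary = {}
    for split, dialogs in dataset.items():
        utterances = [u for d in dialogs for u in d['utterances']]
        sizes = {len(u['candidates']) for u in utterances}
        summary[split] = {
            'dialogs': len(dialogs),
            'utterances': len(utterances),
            'candidate_counts': sorted(sizes),
        }
    return summary


# Round-trip a tiny hand-made dataset through JSON to mimic loading the file.
toy = {'train': [{'personality': ['p'],
                  'utterances': [{'candidates': ['a', 'b'], 'history': ['h']}]}],
       'validate': []}
print(summarize(json.loads(json.dumps(toy))))
```

To check the real file, load it with `json.load` and pass the result to `summarize`; more than one value in `candidate_counts` would indicate utterances with differing numbers of candidates.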