# Creating a chatbot based off of me!




## Background
After reading about [quinnai](https://www.quinnha.xyz/quinn-ai), I thought this project would be pretty fun to tackle! I've always wanted to make an app that could text just like me, and it was super cool seeing someone implement it with OpenAI's API.

I actually completely forgot about Quinn's blog post when I began the project. I was playing around with DALL-E and Stable Diffusion for a club project, and I suddenly thought about using OpenAI's APIs to make something cool, like a Q&A bot or an essay-editing assistant. The Q&A bot seemed easier, so I messed around with really *intense* prompt generation on GPT-3.5, since it was the newest version. Only after about 2 wasted hours did I realize that I could just fine-tune one of OpenAI's other GPT-3 models (davinci, curie, ada, etc.) to make this work. Unfortunately, we can't fine-tune the most recent versions of GPT-3, so I settled on davinci.

At this point, I remembered Quinn's article, so I dug it out of my phone's Safari history and began trying to recreate the process. This notebook documents how I did it!

## Data Collection

To fine-tune the model, I had to get a *ton* of my own data. I initially thought of using my iMessage data before realizing that it's way too much effort to export without a Mac. Quinn used Discord data, but I'm only sporadically active there, and it didn't seem like the most accurate representation of myself. Slack messages were also on the table, but that seemed like too much effort as well.

So, I just created a web scraper for [700 Questions to Get To Know Someone](https://thepleasantconversation.com/questions-to-get-to-know-someone/) and ran it on the server.

In [None]:
import json

import requests
from bs4 import BeautifulSoup

url = "https://thepleasantconversation.com/questions-to-get-to-know-someone/"
page = requests.get(url, timeout=30)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.content, "html.parser")

# grab every ordered list on the page; each <li> inside one is a question
question_lists = soup.find_all("ol")

questions = []
for question_list in question_lists:
    for question in question_list.find_all("li"):
        questions.append(question.get_text())

# save the raw questions as a JSON array of strings
with open("questions.json", "w") as f:
    json.dump(questions, f, indent=4)

I received a super long JSON file that wasn't formatted correctly, so I changed it to the format described in the [OpenAI Fine Tuning](https://platform.openai.com/docs/guides/fine-tuning) guidelines:

```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
```
Conveniently, I completely missed the 'JSONL' part (one JSON object per line, not one big array), but we'll fix this later :)


In [None]:
import json

with open("questions.json", "r") as f:
    data = json.load(f)

processed_data = []

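for prompt in data:
    # swap non-breaking spaces ("\u00a0") for regular spaces so words don't run together
    prompt = prompt.replace("\u00a0", " ").strip()
    # the completion is a placeholder to be filled in by hand
    processed_data.append({"prompt": prompt, "completion": "<ideal generated text>"})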

with open("questions_new.json", "w") as f:
    json.dump(processed_data, f, indent=4)


Afterwards, I obtained something in the format I needed! I painstakingly typed in answers to every single question (and some more).
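
With hundreds of answers typed by hand, it's easy to skip one. Here's a minimal sketch of a check for entries that still hold the placeholder. It wasn't part of my original process: `find_unanswered` is a hypothetical helper, and it assumes the record layout written to `questions_new.json` above.

```python
PLACEHOLDER = "<ideal generated text>"


def find_unanswered(records):
    """Return the prompts whose completion is empty or still the placeholder."""
    return [
        record["prompt"]
        for record in records
        if record.get("completion", "").strip() in ("", PLACEHOLDER)
    ]


# Small in-memory example; in practice, load the list from questions_new.json
sample = [
    {"prompt": "Where are you from?", "completion": "Boston, obviously!"},
    {"prompt": "Do you like coding?", "completion": PLACEHOLDER},
]
print(find_unanswered(sample))
```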

## Fine-Tuning
Afterwards, I spent a while trying to set the OpenAI API key as an environment variable, but then I was able to start fine-tuning! I cheated a bit and converted my JSON file to JSONL with this [converter](https://www.convertjson.com/json-to-jsonlines.htm), and then I was ready to start!
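
If you'd rather not paste your data into a website, the same conversion takes a few lines of Python. This is a sketch and not what I actually ran: `to_jsonl` and `convert_file` are hypothetical helper names, and the default file names assume the `questions_new.json` file from earlier.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def convert_file(src="questions_new.json", dst="questions_new.jsonl"):
    """Read a JSON array from src and write it to dst as JSON Lines."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        f.write(to_jsonl(records))
```

Calling `convert_file()` in the notebook should produce a `.jsonl` file that can be passed to the data preparation step.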

If you're new to OpenAI's API, you have to create an account with OpenAI and generate an API key [here](https://platform.openai.com/account/api-keys). OpenAI charges for every token you use, both for training and for generating text, but you'll start with an $18 free trial, which was more than enough for my purposes.
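
To get a rough feel for the bill before training, you can estimate the token count of your dataset. The sketch below is only a back-of-the-envelope estimate under stated assumptions: it uses the common rule of thumb of about 4 characters per token for English text, it assumes training is billed on tokens times epochs, and `n_epochs=4` is an assumed default. `price_per_1k_tokens` has no default on purpose, so look up the current price for your model.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token count using the ~4 characters per token rule of thumb."""
    if not text:
        return 0
    return max(1, len(text) // chars_per_token)


def estimate_training_cost(records, price_per_1k_tokens, n_epochs=4):
    """Estimate fine-tuning cost: total tokens x epochs x price per 1,000 tokens."""
    total_tokens = sum(
        estimate_tokens(record["prompt"] + record["completion"]) for record in records
    )
    return total_tokens * n_epochs / 1000 * price_per_1k_tokens
```

The real tokenizer will give different counts, so treat the result as a ballpark figure only.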

I got my API key and simply followed [this guide](https://platform.openai.com/docs/guides/fine-tuning) to train my model. I chose curie (since Quinn mentioned that davinci wasn't great for Q&A and I'm cheap 😛).

In [None]:
# OpenAI installation
!pip install --upgrade openai

### Prepare CLI data

To make sure that the JSONL file created is in the correct format, I did the following:

In [None]:
!openai tools fine_tunes.prepare_data -f <JSONL FILE HERE>

I accepted all of the changes it suggested and checked the new file again to ensure that it worked.
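
For checking the file again by hand, a small validator can help. This is a sketch of the kind of check I mean, not what the OpenAI tool does internally: `validate_jsonl` is a hypothetical helper that only confirms every line is a JSON object with non-empty `prompt` and `completion` strings.

```python
import json


def validate_jsonl(lines):
    """Return (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in ("prompt", "completion"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty {key!r}"))
    return problems
```

To run it on a file, pass the open file object, for example `validate_jsonl(open("questions_new.jsonl", encoding="utf-8"))`, where the file name is whatever your prepared file is called.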

### Fine-Tuning

Just run:

In [None]:
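# set api key; `!export` didn't work for me because each `!` command runs in its own
# subshell, so I just passed the api key every time. Don't be inefficient like me :P
# The %env magic should keep the key set for the rest of the notebook session instead.
%env OPENAI_API_KEY=<INSERT API KEY HERE>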

# start fine-tune job
!openai --api-key <INSERT API KEY> api fine_tunes.create -t <INSERT TRAINING DATA> -m <MODEL> --suffix '<INSERT CUSTOM MODEL NAME>'

In [None]:
# if the event stream gets interrupted, the job keeps running on OpenAI's side;
# run the command below to resume following its progress:
!openai --api-key <API KEY> api fine_tunes.follow -i <FINE-TUNE JOB ID>

To check the status:

In [None]:
# Retrieve the state of a fine-tune. The resulting object includes
# job status (which can be one of pending, running, succeeded, or failed)
# and other information
!openai --api-key <API KEY> api fine_tunes.get -i <FINE-TUNE JOB ID>
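
If you don't want to keep re-running the status command by hand, a polling loop is one option. The sketch below is deliberately independent of the OpenAI library: `get_status` is a hypothetical callable you would write yourself to return the job's current status string, and the polling interval and limit are arbitrary assumptions.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed"}


def wait_for_job(get_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return that status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # still pending or running, so wait before asking again
    raise TimeoutError("fine-tune job did not finish within the polling limit")
```

The `sleep` parameter exists only so the loop can be tried out without actually waiting.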

### Testing the fine-tuned model



In [None]:
!openai --api-key <API KEY> api completions.create -m <MODEL NAME> -p "<INSERT QUESTION HERE>"
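
One thing to watch: if the data preparation step appended a separator to every prompt and an ending to every completion, the question you send at test time should end with that same separator, and the response should be cut at that ending. Here's a minimal sketch of both steps. The `" ->"` separator and `"\n"` stop sequence are assumptions on my part, so check your prepared JSONL file for the actual values; `build_prompt` and `clean_completion` are hypothetical helper names.

```python
def build_prompt(question, separator=" ->"):
    """Append the separator the model saw at the end of every training prompt."""
    return question.strip() + separator


def clean_completion(text, stop="\n"):
    """Cut the raw completion at the first stop sequence and trim whitespace."""
    return text.split(stop, 1)[0].strip()


print(build_prompt("Where are you from?"))
print(clean_completion(" Boston, obviously!\nWhat is your"))
```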

I wasn't that satisfied with the responses, so I wrote down some more questions and fine-tuned again on the new data, this time passing the name of my fine-tuned model as the base model. Basically, just repeat the steps above, but with `-m` set to your fine-tuned model's name! You can find the name in the output box or in the [playground](https://platform.openai.com/playground).



I'm hoping to update it with more accurate information, add tighter question control (don't want too many questions about my personal life :P), and more! Here are some sample responses that my AI was able to generate:



```
Where are you from? -> Boston, obviously!

What is your favorite food? -> Japanese food, since its delish!

What makes you interested in someone? -> i'm usually interested in someone after chatting for a while!

Do you like coding? -> yesss, i do! it's super fun using pseudo-code
```

For $0.41, I got a decent model! I'm going to fine-tune it on the questions a couple more times to make it more accurate, but otherwise, I'm super satisfied with the results :)

The project was super fun, and I've just been sending it to friends for them to try (rip my wallet). Please let me know if you have any questions about implementation!
