##### Steps to apply this Jupyter Notebook to other Tim Ferriss Interviews
1. Select one of his interviews and find the transcript for it [here](https://tim.blog/2018/09/20/all-transcripts-from-the-tim-ferriss-show/).
2. Copy the transcript into a new `txt` file.
3. Edit the `txt` file so that each line of dialogue is a single paragraph, removing any `\n` that falls inside a speaker's turn.
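
Step 3 can be done by hand, but a small helper may save time. The following is a minimal sketch, not part of the original notebook. It assumes every turn starts with a capitalized speaker label followed by a colon (for example `Tim Ferriss:`); the `SPEAKER` pattern and the `merge_turns` name are hypothetical and may need adjusting for other transcripts.

```python
import re

# Assumed speaker-label pattern: one to four capitalized words followed by a
# colon at the start of a line, e.g. "Tim Ferriss:" or "Tony Fadell:".
SPEAKER = re.compile(r"^[A-Z][\w.'’-]*(?: [A-Z][\w.'’-]*){0,3}:")

def merge_turns(raw_text: str) -> list[str]:
    """Join wrapped lines so that each speaker turn becomes one paragraph."""
    turns = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        if SPEAKER.match(line) or not turns:
            turns.append(line)        # a new speaker turn starts here
        else:
            turns[-1] += " " + line   # continuation of the previous turn
    return turns
```

Writing `"\n".join(merge_turns(raw_text))` back to the `txt` file gives one line per turn, which is the layout `extract_paragraphs` below expects. A continuation line that happens to start with a capitalized word and a colon would be mistaken for a new turn, so the result is worth a quick visual check.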

### The idea of this project is to create a question-answering model based on a few paragraphs of provided text. Base GPT-3 models do a good job of answering questions when the answer is contained in the paragraph. When it isn't, however, they tend to try to answer anyway, which often leads to confabulated answers.

### 1. Extracting Relevant Paragraphs and putting them into a `csv`

#### 1.1 Extracting paragraphs from `txt`

In [1]:
import pandas as pd
import numpy as np

def extract_paragraphs(file_path):
    """Read a transcript and return one row per non-empty line of dialogue."""
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Keep only non-blank lines. list.remove('') would drop just the first
    # empty string and raise ValueError if the file contained none.
    all_paragraphs = [line for line in content.split("\n") if line.strip()]
    return pd.DataFrame(all_paragraphs, columns=['line_of_dialogue'])

tt2020_txt = extract_paragraphs("data/tt2020.txt")

tt2020_txt.to_csv('data/tt2020.csv', index=False)

tt2020_txt

Unnamed: 0,line_of_dialogue
0,"Tim Ferriss: Tony, welcome to the show."
1,"Tony Fadell: Hi, Tim. Great to be here."
2,Tim Ferriss: I am so thrilled to finally be ha...
3,"Tony Fadell: Great guy, great guy."
4,Tim Ferriss: I have some suggested topics from...
...,...
225,Tony Fadell: Keep going. Keep going.
226,Tim Ferriss: Everyone’s fighting or has fought...
227,"Tony Fadell: Thanks, Tim. Thanks. Have a great..."
228,"Tim Ferriss: Likewise. All right, my man. Take..."


#### 1.2 Counting Tokens and context length

In [2]:
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(tokenizer.encode(text))

tt2020_csv = pd.read_csv('data/tt2020.csv')

tt2020_csv['tokens'] = tt2020_csv.line_of_dialogue.apply(count_tokens)

tt2020_csv.to_csv('data/tt2020.csv', index=False)

tt2020_csv

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Token indices sequence length is longer than the specified maximum sequence length for this model (1212 > 1024). Running this sequence through the model will result in indexing errors


Unnamed: 0,line_of_dialogue,tokens
0,"Tim Ferriss: Tony, welcome to the show.",11
1,"Tony Fadell: Hi, Tim. Great to be here.",14
2,Tim Ferriss: I am so thrilled to finally be ha...,50
3,"Tony Fadell: Great guy, great guy.",11
4,Tim Ferriss: I have some suggested topics from...,114
...,...,...
225,Tony Fadell: Keep going. Keep going.,11
226,Tim Ferriss: Everyone’s fighting or has fought...,35
227,"Tony Fadell: Thanks, Tim. Thanks. Have a great...",23
228,"Tim Ferriss: Likewise. All right, my man. Take...",20


#### 1.3 Removing every `Tim Ferriss` line from the `csv`
This is done because we are only interested in what Tony Fadell has to say. It would be preferable, though, to keep the questions Tim posed so that they can be added to the prompts later on.
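
One way to keep Tim's questions is to pair each of Tony's turns with the interviewer turn that precedes it, before the interviewer lines are dropped. The sketch below is an assumption about how this could be done and is not used in the rest of the notebook; `pair_turns` is a hypothetical helper, and it assumes every line starts with its speaker label.

```python
def pair_turns(lines, interviewer="Tim Ferriss:", guest="Tony Fadell:"):
    """Return (question, answer) tuples: each guest turn paired with the
    most recent interviewer turn (None if there was none yet)."""
    pairs = []
    last_question = None
    for line in lines:
        if line.startswith(interviewer):
            # Remember the interviewer's turn without the speaker label.
            last_question = line[len(interviewer):].strip()
        elif line.startswith(guest):
            answer = line[len(guest):].strip()
            pairs.append((last_question, answer))
    return pairs
```

In this notebook it could be called as `pair_turns(tt2020_csv.line_of_dialogue)` before the filtering cell below runs. Consecutive guest turns all receive the same preceding question, which may or may not be the desired behaviour.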

In [3]:
# Drop Tim's lines with a boolean mask. Assigning through
# tt2020_csv['line_of_dialogue'][i] is chained indexing, which triggers a
# SettingWithCopyWarning and may not modify the frame.
is_tim = tt2020_csv.line_of_dialogue.str.contains('Tim Ferriss:', regex=False)
tt2020_csv = tt2020_csv[~is_tim].reset_index(drop=True)

tt2020_csv.to_csv('data/tt2020.csv', index=False)

tt2020_csv


Unnamed: 0,line_of_dialogue,tokens
0,"Tony Fadell: Hi, Tim. Great to be here.",14
1,"Tony Fadell: Great guy, great guy.",11
2,"Tony Fadell: Well, the caffeine thing and the ...",214
3,Tony Fadell: You have less control there and y...,128
4,"Tony Fadell: Right, just the alcohol.",12
...,...,...
110,"Tony Fadell: Yeah, absolutely. And there’s alw...",83
111,Tony Fadell: They most likely did. Or if they ...,27
112,Tony Fadell: Keep going. Keep going.,11
113,"Tony Fadell: Thanks, Tim. Thanks. Have a great...",23


#### 1.4 Removing dialogue without 'promptable' content

In [4]:
# Keep only lines with at least 60 tokens; shorter turns are treated here as
# lacking 'promptable' content. A boolean mask avoids chained assignment.
tt2020_csv = tt2020_csv[tt2020_csv.tokens >= 60].reset_index(drop=True)

tt2020_csv.to_csv('data/tt2020.csv', index=False)

tt2020_csv


Unnamed: 0,line_of_dialogue,tokens
0,"Tony Fadell: Well, the caffeine thing and the ...",214
1,Tony Fadell: You have less control there and y...,128
2,"Tony Fadell: Well, the caffeine—I basically ov...",174
3,"Tony Fadell: Well, we can get into this more l...",222
4,"Tony Fadell: Well, great question. My dad was ...",388
5,"Tony Fadell: Sure, sure. Well, first, there’s ...",586
6,"Tony Fadell: Sure. Well, first it was pesterin...",201
7,"Tony Fadell: Well, first, I had already—at tha...",314
8,"Tony Fadell: Well, General Magic was trying to...",253
9,Tony Fadell: Andy Hertzfeld was one of the pri...,173


### 2. Preparing Prompt-Completion pairs

#### 2.1 Using openai to generate questions (aka prompts) from the lines of dialogue
Note: This should take less than 5 minutes

In [5]:
import pandas as pd
import openai

tt2020_q = pd.read_csv('data/tt2020.csv')

def get_questions(line_of_dialogue):
    try:
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=f"""Create a numbered list of questions based on {line_of_dialogue}:
            """,
            temperature=0,
            max_tokens=300,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"]
        )
        return response['choices'][0]['text']
    except Exception as e:
        # Report the failure instead of silently swallowing it.
        print(e)
        return ""
    
tt2020_q['questions']= tt2020_q.line_of_dialogue.apply(get_questions)

tt2020_q.to_csv('data/tt2020_q.csv', index=False)

tt2020_q

#### 2.2 Removing rows with empty questions
Note: These need to be investigated further. The working theory is that the context (the line of dialogue) alone isn't enough to generate questions, and that the prior `line_of_dialogue` holds the necessary context. The same likely applies to the lines of dialogue that were arbitrarily removed in **1.4**. Keep in mind that `get_questions` also returns an empty string whenever the API call fails.

In [8]:
# Treat empty question strings as missing, then drop those rows.
tt2020_q['questions'] = tt2020_q['questions'].replace('', np.nan)

tt2020_q = tt2020_q.dropna()

tt2020_q = tt2020_q.reset_index(drop=True)

tt2020_q.to_csv('data/tt2020_q.csv', index=False)

tt2020_q

Unnamed: 0,line_of_dialogue,tokens,questions
0,"Tony Fadell: Well, the caffeine thing and the ...",214,\n1. What made Tony Fadell stop drinking alcoh...
1,Tony Fadell: You have less control there and y...,128,\n1. What made Tony Fadell decide to give up a...
2,"Tony Fadell: Well, the caffeine—I basically ov...",174,\n1. What made Tony Fadell decide to give up c...
3,"Tony Fadell: Well, we can get into this more l...",222,\n1. What was the impetus for Tony Fadell's re...
4,"Tony Fadell: Well, great question. My dad was ...",388,\n1. What was Tony Fadell's first job?\n2. Wha...
5,"Tony Fadell: Sure, sure. Well, first, there’s ...",586,\n1. What is the name of the movie that tells ...
6,"Tony Fadell: Sure. Well, first it was pesterin...",201,\n1. How did Tony Fadell get in touch with the...
7,"Tony Fadell: Well, first, I had already—at tha...",314,\n1. What was Tony Fadell's startup company in...
8,"Tony Fadell: Well, General Magic was trying to...",253,\n1. What was the goal of General Magic?\n2. W...
9,Tony Fadell: Andy Hertzfeld was one of the pri...,173,\n1. What was the name of the software develop...


#### 2.3 Using openai to generate `answers` to the `questions` based on the `line_of_dialogue` (aka completions for the prompts based on the context)
Note: This should also take less than 5 minutes

In [9]:
import openai

tt2020_qa = pd.read_csv('data/tt2020_q.csv')

def get_answers(row):
    try:
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=f"""Using {row.line_of_dialogue} answer the following questions {row.questions}:""",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response['choices'][0]['text']
    except Exception as e:
        print (e)
        return ""
    
tt2020_qa['answers']= tt2020_qa.apply(get_answers, axis=1)

tt2020_qa.to_csv('data/tt2020_qa.csv', index=False)

tt2020_qa

#### 2.4 Eliminating questions that weren't properly answered by OpenAI
Note: This also needs to be investigated further!

In [100]:
tt2020_qa = pd.read_csv('data/tt2020_qa.csv')

def count_lines(cell):
    """Count the newline characters in a cell."""
    return cell.count('\n')

# Answers start with one more '\n' than questions, so a well-formed row has
# exactly one extra newline in `answers`. Keep only those rows.
well_formed = tt2020_qa.apply(
    lambda row: count_lines(row.questions) == count_lines(row.answers) - 1, axis=1
)
tt2020_qa = tt2020_qa[well_formed].reset_index(drop=True)
tt2020_qa.to_csv('data/tt2020_qa.csv', index=False)
        
tt2020_qa 

Unnamed: 0,line_of_dialogue,tokens,questions,answers
0,"Tony Fadell: Well, the caffeine thing and the ...",214,\n1. What made Tony Fadell stop drinking alcoh...,\n\n1. Tony Fadell stopped drinking alcohol af...
1,Tony Fadell: You have less control there and y...,128,\n1. What made Tony Fadell decide to give up a...,\n\n1. Tony Fadell decided to give up alcohol ...
2,"Tony Fadell: Well, the caffeine—I basically ov...",174,\n1. What made Tony Fadell decide to give up c...,\n\n1. Tony Fadell decided to give up caffeine...
3,"Tony Fadell: Well, we can get into this more l...",222,\n1. What was the impetus for Tony Fadell's re...,\n\n1. The impetus for Tony Fadell's reset but...
4,"Tony Fadell: Well, great question. My dad was ...",388,\n1. What was Tony Fadell's first job?\n2. Wha...,\n\n1. Tony Fadell's first job was working for...
5,"Tony Fadell: Sure, sure. Well, first, there’s ...",586,\n1. What is the name of the movie that tells ...,\n\n1. The movie that tells Tony Fadell's stor...
6,"Tony Fadell: Sure. Well, first it was pesterin...",201,\n1. How did Tony Fadell get in touch with the...,\n\n1. Tony Fadell got in touch with the peopl...
7,"Tony Fadell: Well, first, I had already—at tha...",314,\n1. What was Tony Fadell's startup company in...,\n\n1. Tony Fadell's startup company in high s...
8,"Tony Fadell: Well, General Magic was trying to...",253,\n1. What was the goal of General Magic?\n2. W...,\n\n1. The goal of General Magic was to create...
9,Tony Fadell: Andy Hertzfeld was one of the pri...,173,\n1. What was the name of the software develop...,\n\n1. The software developer that Tony Fadell...
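
The newline-count check in 2.4 only compares how many lines each cell has. A stricter alternative, sketched here as an assumption rather than as the notebook's method, is to parse the numbered items and pair questions with answers by their number. `parse_numbered` and `pair_questions_answers` are hypothetical helpers, and like the notebook they assume each item sits on a single line.

```python
import re

# A numbered list item such as "1. Some text"
ITEM = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$")

def parse_numbered(block: str) -> dict[int, str]:
    """Map item number -> item text for a block like '\\n1. First\\n2. Second'."""
    items = {}
    for line in block.splitlines():
        match = ITEM.match(line)
        if match:
            items[int(match.group(1))] = match.group(2)
    return items

def pair_questions_answers(questions: str, answers: str) -> list[tuple[str, str]]:
    """Pair questions and answers sharing a number; unmatched items are dropped."""
    q_items = parse_numbered(questions)
    a_items = parse_numbered(answers)
    return [(q_items[n], a_items[n]) for n in sorted(q_items) if n in a_items]
```

With this approach a row with a missing answer loses only the unmatched question instead of being discarded entirely, and the later positional merge of prompts and completions cannot drift out of alignment.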


#### 2.5 Renaming columns, removing questions and answers groupings, and splitting the dataframe in two.

In [101]:
tt2020_qa = tt2020_qa.rename(columns={'line_of_dialogue':'text','questions':'prompt','tokens':'metadata', 'answers':'completion'})

tt2020_qo = tt2020_qa[['prompt']]

tt2020_ao = tt2020_qa[['completion']]

tt2020_qa

Unnamed: 0,text,metadata,prompt,completion
0,"Tony Fadell: Well, the caffeine thing and the ...",214,\n1. What made Tony Fadell stop drinking alcoh...,\n\n1. Tony Fadell stopped drinking alcohol af...
1,Tony Fadell: You have less control there and y...,128,\n1. What made Tony Fadell decide to give up a...,\n\n1. Tony Fadell decided to give up alcohol ...
2,"Tony Fadell: Well, the caffeine—I basically ov...",174,\n1. What made Tony Fadell decide to give up c...,\n\n1. Tony Fadell decided to give up caffeine...
3,"Tony Fadell: Well, we can get into this more l...",222,\n1. What was the impetus for Tony Fadell's re...,\n\n1. The impetus for Tony Fadell's reset but...
4,"Tony Fadell: Well, great question. My dad was ...",388,\n1. What was Tony Fadell's first job?\n2. Wha...,\n\n1. Tony Fadell's first job was working for...
5,"Tony Fadell: Sure, sure. Well, first, there’s ...",586,\n1. What is the name of the movie that tells ...,\n\n1. The movie that tells Tony Fadell's stor...
6,"Tony Fadell: Sure. Well, first it was pesterin...",201,\n1. How did Tony Fadell get in touch with the...,\n\n1. Tony Fadell got in touch with the peopl...
7,"Tony Fadell: Well, first, I had already—at tha...",314,\n1. What was Tony Fadell's startup company in...,\n\n1. Tony Fadell's startup company in high s...
8,"Tony Fadell: Well, General Magic was trying to...",253,\n1. What was the goal of General Magic?\n2. W...,\n\n1. The goal of General Magic was to create...
9,Tony Fadell: Andy Hertzfeld was one of the pri...,173,\n1. What was the name of the software develop...,\n\n1. The software developer that Tony Fadell...


In [102]:
tt2020_qo = tt2020_qo.astype(str).apply(lambda x: x.str.split('\n').explode())

tt2020_qo.replace('', np.nan, inplace=True)

tt2020_qo = tt2020_qo.dropna()

tt2020_qo = tt2020_qo.reset_index().drop('index',axis=1)

tt2020_qo

Unnamed: 0,prompt
0,1. What made Tony Fadell stop drinking alcohol?
1,2. How did Tony Fadell feel after stopping dri...
2,3. What did Tony Fadell think was the reason f...
3,4. What was Tony Fadell's reasoning for thinki...
4,1. What made Tony Fadell decide to give up alc...
...,...
194,4. How can staying a beginner help us achieve ...
195,5. How can working with others be our superpower?
196,1. What are some of the challenges Tony Fadell...
197,2. How has Tony Fadell dealt with these challe...


In [103]:
tt2020_ao = tt2020_ao.astype(str).apply(lambda x: x.str.split('\n').explode())

tt2020_ao.replace('', np.nan, inplace=True)

tt2020_ao = tt2020_ao.dropna()

tt2020_ao = tt2020_ao.reset_index().drop('index',axis=1)

tt2020_ao

Unnamed: 0,completion
0,1. Tony Fadell stopped drinking alcohol after ...
1,2. Tony Fadell felt better after stopping drin...
2,3. Tony Fadell thought the reason for feeling ...
3,4. Tony Fadell's reasoning for thinking alcoho...
4,1. Tony Fadell decided to give up alcohol beca...
...,...
194,4. Staying a beginner can help us achieve succ...
195,5. Working with others can be our superpower b...
196,1. Some of the challenges Tony Fadell has face...
197,2. Tony Fadell has dealt with these challenges...


#### 2.6 Cleaning up prompts and answers, and remerging
Removing numbers, leading and trailing whitespaces, and `'.'` prefixes

In [104]:
# Leading list marker such as "1. "
list_marker = r"^\s*\d+\.\s*"

# Strip only the leading marker; removing every digit would also delete
# numbers that belong to the question text.
tt2020_qo['prompt'] = tt2020_qo['prompt'].str.replace(list_marker, '', regex=True).str.strip()

tt2020_qo

Unnamed: 0,prompt
0,What made Tony Fadell stop drinking alcohol?
1,How did Tony Fadell feel after stopping drinki...
2,What did Tony Fadell think was the reason for ...
3,What was Tony Fadell's reasoning for thinking ...
4,What made Tony Fadell decide to give up alcohol?
...,...
194,How can staying a beginner help us achieve suc...
195,How can working with others be our superpower?
196,What are some of the challenges Tony Fadell ha...
197,How has Tony Fadell dealt with these challenges?


In [105]:
# Leading list marker such as "1. "
list_marker = r"^\s*\d+\.\s*"

cleaned = tt2020_ao['completion'].str.replace(list_marker, '', regex=True).str.strip()
# Give every completion a fixed " \n" prefix and an " END" suffix,
# which can later serve as a stop sequence.
tt2020_ao['completion'] = " " + "\n" + cleaned + " END"

tt2020_ao

Unnamed: 0,completion
0,\nTony Fadell stopped drinking alcohol after ...
1,\nTony Fadell felt better after stopping drin...
2,\nTony Fadell thought the reason for feeling ...
3,\nTony Fadell's reasoning for thinking alcoho...
4,\nTony Fadell decided to give up alcohol beca...
...,...
194,\nStaying a beginner can help us achieve succ...
195,\nWorking with others can be our superpower b...
196,\nSome of the challenges Tony Fadell has face...
197,\nTony Fadell has dealt with these challenges...


Remerging

In [106]:
tt2020_ft = pd.concat([tt2020_qo, tt2020_ao], axis=1)

tt2020_ft.to_csv('data/tt2020_ft.csv', index=False)

tt2020_ft

Unnamed: 0,prompt,completion
0,What made Tony Fadell stop drinking alcohol?,\nTony Fadell stopped drinking alcohol after ...
1,How did Tony Fadell feel after stopping drinki...,\nTony Fadell felt better after stopping drin...
2,What did Tony Fadell think was the reason for ...,\nTony Fadell thought the reason for feeling ...
3,What was Tony Fadell's reasoning for thinking ...,\nTony Fadell's reasoning for thinking alcoho...
4,What made Tony Fadell decide to give up alcohol?,\nTony Fadell decided to give up alcohol beca...
...,...,...
194,How can staying a beginner help us achieve suc...,\nStaying a beginner can help us achieve succ...
195,How can working with others be our superpower?,\nWorking with others can be our superpower b...
196,What are some of the challenges Tony Fadell ha...,\nSome of the challenges Tony Fadell has face...
197,How has Tony Fadell dealt with these challenges?,\nTony Fadell has dealt with these challenges...


### 3. Fine-Tuning!

#### 3.1 Creating the jsonl file

In [107]:
import openai

!openai tools fine_tunes.prepare_data -f data/tt2020_ft.csv

^C
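
If the interactive `prepare_data` tool is inconvenient to run inside a notebook, the JSONL file can also be written directly. The legacy fine-tuning format is one JSON object per line with a `prompt` and a `completion` key. The sketch below is a non-interactive alternative under that assumption; `to_jsonl_lines` and `write_jsonl` are hypothetical helpers, and they do not replicate any of the adjustments the tool may suggest.

```python
import json

def to_jsonl_lines(pairs):
    """Serialize (prompt, completion) pairs, one JSON object per line."""
    return [
        json.dumps({"prompt": prompt, "completion": completion}, ensure_ascii=False)
        for prompt, completion in pairs
    ]

def write_jsonl(pairs, path):
    """Write the serialized pairs to `path`, ending with a trailing newline."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(to_jsonl_lines(pairs)) + "\n")
```

For the dataframe built above, the pairs could be obtained with `zip(tt2020_ft.prompt, tt2020_ft.completion)`.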


#### 3.2 The actual Fine-Tune

In [None]:
!openai api fine_tunes.create -t data/tt2020_ft_prepared.jsonl -m 

In [108]:
!openai api fine_tunes.create -t data/tt2020_ft_prepared.jsonl -m curie

Uploaded file from data/tt2020_ft_prepared.jsonl: file-bSkmK5YYxBVFLK4Ur0l1wpjz
Created fine-tune: ft-AjMZxnlXfnhFuakfiYtiBZfW
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-11-26 14:35:10] Created fine-tune: ft-AjMZxnlXfnhFuakfiYtiBZfW
[2022-11-26 14:35:16] Fine-tune costs $0.10
[2022-11-26 14:35:17] Fine-tune enqueued. Queue number: 0
[2022-11-26 14:35:18] Fine-tune started
[2022-11-26 14:36:51] Completed epoch 1/4
[2022-11-26 14:37:37] Completed epoch 2/4
[2022-11-26 14:38:23] Completed epoch 3/4
[2022-11-26 14:39:09] Completed epoch 4/4
[2022-11-26 14:39:31] Uploaded model: curie:ft-personal-2022-11-26-14-39-31
[2022-11-26 14:39:32] Uploaded result file: file-KxzZYKHh42z4vkXD09pFIsbG
[2022-11-26 14:39:32] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m curie:ft-personal-2022-11-26-14-39-31 -p <YOUR_PROMPT>





In [109]:
!openai api fine_tunes.create -t data/tt2020_ft_prepared1.jsonl -m curie

Uploaded file from data/tt2020_ft_prepared1.jsonl: file-MlzkcYbdKFimKZJMOsbs140y
Created fine-tune: ft-WstVQvB4IVtbk7Ouk7Rf6HMr
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-11-26 14:47:50] Created fine-tune: ft-WstVQvB4IVtbk7Ouk7Rf6HMr
[2022-11-26 14:47:57] Fine-tune costs $0.09
[2022-11-26 14:47:57] Fine-tune enqueued. Queue number: 0
[2022-11-26 14:47:59] Fine-tune started
[2022-11-26 14:49:33] Completed epoch 1/4
[2022-11-26 14:50:18] Completed epoch 2/4
[2022-11-26 14:51:04] Completed epoch 3/4
[2022-11-26 14:51:50] Completed epoch 4/4
[2022-11-26 14:52:12] Uploaded model: curie:ft-personal-2022-11-26-14-52-12
[2022-11-26 14:52:13] Uploaded result file: file-rnsqVYSHp5h626T48V0S4hsu
[2022-11-26 14:52:13] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m curie:ft-personal-2022-11-26-14-52-12 -p <YOUR_PROMPT>
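
The fine-tuned model can also be queried from Python instead of the CLI. The following is a minimal sketch using the same legacy `openai.Completion.create` call as the rest of this notebook. The helper names and the `max_tokens` value are assumptions; the ` END` stop sequence mirrors the suffix appended to every completion in 2.6.

```python
STOP = " END"  # suffix appended to every training completion in section 2.6

def build_request(question, model):
    """Assemble keyword arguments for a completion request."""
    return {
        "model": model,
        "prompt": question,
        "temperature": 0,
        "max_tokens": 100,  # assumed budget; tune to the expected answer length
        "stop": [STOP],
    }

def clean_completion(text):
    """Strip the leading ' \\n' that every training completion started with."""
    return text.strip()

def ask(question, model):
    """Send the request with the legacy openai<1.0 client."""
    import openai  # imported lazily so the helpers above work without the package
    response = openai.Completion.create(**build_request(question, model))
    return clean_completion(response["choices"][0]["text"])
```

For example, `ask("What was the goal of General Magic?", "curie:ft-personal-2022-11-26-14-52-12")` would query the second model trained above, provided an API key is configured.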





In [110]:
!openai api fine_tunes.create -t data/tt2020_ft_prepared.jsonl -m davinci

^C
