# **Introduction to GPT features**

**GPU** is recommended for this assignment. `Runtime` -> `Change runtime type` -> `GPU`

**Instructions**
- Write code in the space indicated with `### START CODE HERE ###`
- Do not use loops (for/while) unless the instructions explicitly tell you to. Parallelization in Deep Learning is key!
- If you get stuck, ask for help in Slack or DM `@DRU Team`

**You will learn**
- The main features of GPT.
- How to build generative QA based on the given data and a pre-trained GPT model.
- How to prepare your own dataset and fine-tune a GPT model on it.

## Add OpenAI key to Config

Before you start, sign up or log in to your OpenAI account and generate an [API key](https://platform.openai.com/account/api-keys).

>**Note that while the first queries are free, there is a limit after which payment is required. OpenAI grants an initial budget of $18, more than enough to complete the lab, experiment with the pipeline, and fine-tune a model.**

>**Please keep your API key for use during the lab review. We don't store your key, so after verifying your work and earning points, you can delete it in the API keys.**

In [23]:
# VALIDATION_FIELD[cls] Config

class Config:

  # Section 1
  davinci_model_cost = 0.02/1000
  ada_model_cost = 0.0004/1000

  # Section 2
  wiki_article_path = 'wiki_pages'
  
  # names of Wikipedia's articles
  article_titles = ['Kyiv', 'History of Kyiv', 'Kyiv Metro', 'Kyiv culture', 'Kyiv Music Fest', 'FC Dynamo Kyiv', 'Igor Sikorsky Kyiv Polytechnic Institute', 'Paton Bridge', 'Saint Sophia Cathedral', 'Transport in Kyiv', 'Kyiv Zoo', 'Kyiv metropolitan area', 'Taras Shevchenko National University of Kyiv', 'Euromaidan', 'Motherland Monument', 'Podil', 'Kyiv TV Tower']

  ### START CODE HERE ###
  # your OpenAI API token key
  openai_api_key = ''
  ### END CODE HERE ###

  

## Section 1 - Try GPT

[Generative Pre-trained Transformer (GPT)](https://arxiv.org/abs/2005.14165) models by OpenAI have taken the natural language processing (NLP) community by storm by introducing compelling language models. These models can perform various NLP tasks, such as question answering, textual entailment, and text summarisation, without any supervised training. They need very few or no examples to understand a task and perform on par with, or even better than, state-of-the-art models trained in a supervised fashion.

With the recent release of [GPT-4](https://openai.com/product/gpt-4) and other, more powerful versions of OpenAI’s GPT models, it may take a long time before we can exploit their full potential.

Let us look at what the OpenAI models offer and how they can be applied.

In [24]:
import json
import os

!pip install openai==0.27.4
import openai

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt






### Set OpenAI key

In [25]:
# VALIDATION_FIELD[str] set_openai_api_key

openai.api_key = Config.openai_api_key

### **OpenAI API start**

Now, using the OpenAI API, we can send a prompt to ChatGPT. Create a [chat completions](https://platform.openai.com/docs/api-reference/chat/create) example based on the [official guide](https://platform.openai.com/docs/guides/chat/introduction).

In [27]:
### START CODE HERE ###
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the 2020 World Series?"},
    ]
    )
### END CODE HERE ###

print(answer)



**Expected output:**

```
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The 2020 World Series was played at a neutral site, Globe Life Field, in Arlington, Texas.",
        "role": "assistant"
      }
    }
  ],
  "created": **********,
  "id": "chatcmpl-***",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 22,
    "prompt_tokens": 57,
    "total_tokens": 79
  }
}
```

> Note: The model's answer can be different but should have the same meaning


We got the result along with all the query parameters and data. The assistant’s reply is in `['choices'][0]['message']['content']`.

`finish_reason` reports why the model stopped generating. The possible values are:
* `stop`: API returned complete model output
* `length`: Incomplete model output due to `max_tokens` parameter or token limit
* `content_filter`: Omitted content due to a flag from OpenAI's content filters
* `null`: API response still in progress or incomplete

`usage` shows how many tokens the request consumed. You need this value to calculate the cost of your request. We will talk about pricing later.
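To make these fields concrete, here is a minimal sketch that reads them from a response. It does not call the API: `response` is a hand-written dictionary shaped like the expected output above, `summarize_response` is a hypothetical helper, and the per-token price is only a stand-in borrowed from `Config.davinci_model_cost`, not the actual price of `gpt-3.5-turbo`.

```python
# Hand-written stand-in for an API response (same shape as the expected output above).
response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The 2020 World Series was played at a neutral site, "
                           "Globe Life Field, in Arlington, Texas.",
            },
        }
    ],
    "usage": {"completion_tokens": 22, "prompt_tokens": 57, "total_tokens": 79},
}


def summarize_response(response, cost_per_token):
    """Hypothetical helper: collect the reply text, completion status and estimated cost."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        # "stop" means the model finished on its own; "length" would mean it was cut off
        "complete": choice["finish_reason"] == "stop",
        "cost_usd": response["usage"]["total_tokens"] * cost_per_token,
    }


# Assumed price, used only for illustration (same value as Config.davinci_model_cost).
summary = summarize_response(response, cost_per_token=0.02 / 1000)
print(summary["text"])
print("complete:", summary["complete"])
print("estimated cost: %.8f $" % summary["cost_usd"])
```

Checking `finish_reason` before using the text is a cheap way to notice answers that were truncated by `max_tokens`.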

### **Models usage examples**

`gpt-3.5-turbo` is the current ChatGPT model. This model is accessible on the [web interface](https://chat.openai.com). Your starting budget is more than enough to finish this lab.

> Note: You can try the `GPT-4` model if you join the [waitlist](https://openai.com/waitlist/gpt-4-api) and get access. However, the models available by default are enough for this lab.

Using [Completions](https://platform.openai.com/docs/api-reference/completions), you can try the most powerful model available, `text-davinci-003`. Let's ask something about Kyiv.

Create a Completion request function with the parameters:
- `max_tokens = 30`
- `temperature = 0`
- `model = "text-davinci-003"`
- `prompt` is the function argument

In [None]:
# VALIDATION_FIELD[func] gpt_completion

### START CODE HERE ###
def gpt_completion(prompt):
  answer = ...
  return answer
### END CODE HERE ###

Add prompt `"What is the height of the Kyiv TV tower in metres?"`

In [None]:
# VALIDATION_FIELD[str] kyiv_prompt

### START CODE HERE ###
kyiv_prompt = ...
### END CODE HERE ###

Show GPT's answer

In [None]:
answer = gpt_completion(kyiv_prompt)

print(answer['choices'][0]['text'])

**Expected output:**

<table>
  <tr>
    <td>Result:</td>
    <td> 385 meters</td> 
  </tr>
</table>

> Note: The model's answer can be different but should have the same meaning

You most likely got the correct answer. You can check it on the [Wikipedia](https://en.wikipedia.org/wiki/Kyiv_TV_Tower) page.

So far, you have asked a simple question that does not require analytical work. It is like a simple Google query. Let us try some math and logic questions.

* Create a prompt to calculate `8 * 6 + 6`.

In [None]:
# VALIDATION_FIELD[str] math_prompt

### START CODE HERE ###
math_prompt = ...
### END CODE HERE ###

In [None]:
answer = gpt_completion(math_prompt)

print(answer['choices'][0]['text'])

**Expected output:**

<table>
  <tr>
    <td>Answer:</td>
    <td> 54</td> 
  </tr>
</table>

> Note: The model's answer can be different but should have the same meaning

GPT was able to solve a simple arithmetic example. Let us use GPT to solve a logic puzzle.

* Create a prompt to solve this exercise: `Maks have five apples. Maks give two apples to Maria, after what mother give one apple more to Maks. How many apples have Maks now?`

In [None]:
# VALIDATION_FIELD[str] logic_prompt

### START CODE HERE ###
logic_prompt = ...
### END CODE HERE ###

In [None]:
answer = gpt_completion(logic_prompt)

print(answer['choices'][0]['text'])

**Expected output:**

<table>
  <tr>
    <td>Answer:</td>
    <td> Maks has 4 apples now.</td> 
  </tr>
</table>

> Note: The model's answer can be different but should have the same meaning

### **Models mistakes**

`text-davinci-003` handled our tasks well. You may have noticed that the results were not always correct right away, and that some prompt changes were needed before getting the correct answer.

GPT is imperfect, so let us look at this issue in more detail.

Take recent events, for example. `text-davinci-003` was trained on data up to June 2021, so it cannot know who the current monarch of the United Kingdom is.

* Create the prompt `Who is a monarch of the United Kingdom?`

In [None]:
# VALIDATION_FIELD[str] monarch_UK_prompt

### START CODE HERE ###
monarch_UK_prompt = ...
### END CODE HERE ###

In [None]:
answer = gpt_completion(monarch_UK_prompt)

print(answer['choices'][0]['text'])

**Expected output:**

<table>
  <tr>
    <td>Answer:</td>
    <td> Queen Elizabeth II</td> 
  </tr>
</table>

> Note: The model's answer can be different but should have the same meaning

We got the wrong answer. You can check it on the [Wikipedia page](https://en.wikipedia.org/wiki/Monarchy_of_the_United_Kingdom) (Charles III has been the monarch since 8 September 2022).

Let's test the mathematical abilities of the model and try a more complex exercise.

* Use your previous math prompt to solve `sqrt(1213*4345)`

In [None]:
# VALIDATION_FIELD[str] harder_math_prompt

### START CODE HERE ###
harder_math_prompt = ...
### END CODE HERE ###

In [None]:
answer = gpt_completion(harder_math_prompt)

print(answer['choices'][0]['text'])

**Expected output:**

<table>
  <tr>
    <td>Answer:</td>
    <td> Anything except 2295.75</td> 
  </tr>
</table>

GPT models are a powerful tool. However, as we have seen, they can also make mistakes and give incorrect answers to both complex and simple questions (try the simple exercise `8*6 + 6*8`).
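Because the model's arithmetic cannot be trusted blindly, it helps to compute a reference value locally and compare. The sketch below is one possible way to do that. `first_number` and `matches_reference` are hypothetical helpers, and the simple regular expression assumes the answer contains a plain number without thousands separators.

```python
import math
import re


def first_number(text):
    """Return the first integer or decimal number found in the text, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def matches_reference(model_text, reference, tolerance=0.01):
    """Check whether the number in the model's answer is close to the reference value."""
    value = first_number(model_text)
    return value is not None and abs(value - reference) <= tolerance


# Reference values computed locally for the exercises in this section.
simple_reference = 8 * 6 + 6
harder_reference = math.sqrt(1213 * 4345)

print(simple_reference)
print(round(harder_reference, 2))

# In the lab you would pass answer['choices'][0]['text'] instead of these sample strings.
print(matches_reference(" 54", simple_reference))
print(matches_reference(" 2301.5", harder_reference))
```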


### **Generate haiku**


However, models can understand what you ask, stay within the limits set by the user's words, and even come up with something of their own.

Let us generate a [Haiku](https://www.britannica.com/art/haiku) about ChatGPT using the `text-ada-001` model as the fastest text model.

Create a Completion request with the following parameters:
- `max_tokens = 120`
- `temperature = 0.3`
- `model = "text-ada-001"`
- `prompt = "Tell me a Haiku about ChatGPT"`

In [None]:
### START CODE HERE ###
answer_ada = ...
### END CODE HERE ###

print(answer_ada['choices'][0]['text'])

Not very good. Let's compare it with the bigger `text-davinci-003` model.

Create a Completion request with the following parameters:
- `max_tokens = 120`
- `temperature = 0.3`
- `model = "text-davinci-003"`
- `prompt = "Tell me a Haiku about ChatGPT"`

In [None]:
### START CODE HERE ###
answer_davinci = ...
### END CODE HERE ###

print(answer_davinci['choices'][0]['text'])

Now we can compare both answers

In [None]:
print('Davinci model haiku:')
print(answer_davinci['choices'][0]['text'])
print('------------------------------------')
print('Ada model haiku:')
print(answer_ada['choices'][0]['text'])
print('------------------------------------')
print('Davinci cost: %.8f $' % (Config.davinci_model_cost * answer_davinci['usage']['total_tokens']))
print('Ada cost: %.8f $' % (Config.ada_model_cost * answer_ada['usage']['total_tokens']))

As you can see, the Davinci model writes a more creative and better haiku than Ada. However, Davinci is also 50 times more expensive per token than Ada, as this table shows:

<table>
  <tr>
    <td>Davinci:</td>
    <td> $0.0200 / 1K tokens</td> 
  </tr>
  <tr>
    <td>Ada:</td>
    <td> $0.0004 / 1K tokens</td> 
  </tr>
</table>

We need to pay more for better results. You can learn more about OpenAI pricing on the [official website](https://openai.com/pricing).

### **Creative solution**

GPT is a powerful multitasking model that can solve many different tasks. You can ask it any question, such as who the monarch of Great Britain is or for information about a favorite character, and have it solve math and logic tasks. However, the capabilities of the models go beyond this.

Using GPT, we can solve more complex tasks. Let's try one.

#### **GraphGPT**

The main idea is to create a prompt that will allow us to convert unstructured natural language into a knowledge graph.

To do this, we need to describe our task in a prompt and give GPT the output format and a worked example.


You need to fill in `//-- add your text here --//` in the following prompt:
> * Describe the task of finding as many connections as possible and writing them down in a list.
> * Note that the single format `[ENTITY 1, RELATIONSHIP, ENTITY 2]` should be used, that relations are directed, and that the order is important.
> * For testing, use `Maks, Petro and Vlad are colleagues.`

In [None]:
### START CODE HERE ###
new_prompt = """Given a prompt, //-- add your text here --//.

If an update is a relationship, //-- add your text here --//.

Example:
prompt: Alice is Bob's roommate.
updates:
[["Alice", "roommate", "Bob"]]

prompt: //-- add your text here --//
updates:
"""
### END CODE HERE ###

In [None]:
answer = openai.Completion.create(
  model='text-davinci-003',
  prompt=new_prompt,
  max_tokens=100,
  temperature=0.3
)

print(answer['choices'][0]['text'])

<table>
  <tr>
    <td> '[["Maks", "colleague", "Petro"], ["Maks", "colleague", "Vlad"], ["Petro", "colleague", "Maks"], ["Petro", "colleague", "Vlad"], ["Vlad", "colleague", "Maks"], ["Vlad", "colleague", "Petro"]]'</td> 
  </tr>
</table>



Create functions using [networkx](https://networkx.org/documentation/stable/reference/introduction.html) for visualizing the obtained connections:

In [None]:
def create_relationships_dataframe(answer_data):
  relationships = json.loads(answer_data)
  from_lable = [relationship[0] for relationship in relationships]
  weight_lable = [relationship[1] for relationship in relationships]
  to_lable = [relationship[2] for relationship in relationships]
  df_relationships = pd.DataFrame({'from':from_lable, 'to':to_lable, 'weight':weight_lable})
  return df_relationships

def get_connection_relationship_weight(edge, df_relation):
  name = df_relation.loc[(df_relation['from']==edge[0]) & (df_relation['to']==edge[1])]['weight'].iloc[0]
  return name

def show_graph(answer):
  answer_relationship = answer['choices'][0]['text']
  df_relationships = create_relationships_dataframe(answer_relationship)
  G = nx.from_pandas_edgelist(df_relationships, 'from', 'to', create_using=nx.DiGraph())
  pos = nx.spring_layout(G)
  nx.draw(G, pos, with_labels=True, node_size=800, node_color="lightblue")
  labels = {e: get_connection_relationship_weight(e, df_relationships) for e in G.edges}
  nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
  plt.show()

Now you can visualize your graph.

In [None]:
show_graph(answer)

The graph you obtained shows the relationships between Maks, Petro and Vlad, as specified in your prompt. All of them are colleagues and are connected to each other.

To make sure that the prompt you created works correctly, let us consider a more complex case by adding external relations.


You need to fill in `//-- add your text here --//` in the following prompt:
> * Paste your description and formatting.
> * Generate the relationship graph for: `Markus, Mario, Clara and Mykyta are friends. Maria is Clara teacher and Markus mother and Sam lives with Mario.`

In [None]:
# VALIDATION_FIELD[str] graph_prompt

### START CODE HERE ###
graph_prompt = """Given a prompt, //-- add your text here --//.

If an update is a relationship, //-- add your text here --//.

Example:
prompt: Alice is Bob's roommate.
updates:
[["Alice", "roommate", "Bob"]]

prompt: //-- add your text here --//
updates:
"""
### END CODE HERE ###

In [None]:
answer = openai.Completion.create(
  model='text-davinci-003',
  prompt=graph_prompt,
  max_tokens=100,
  temperature=0.3,
  stop = '.\n'
)

print(answer['choices'][0]['text'])
show_graph(answer)

Make sure that your graph is right and that `graph_prompt` is your final version. All nodes should have the correct connections (though not necessarily in the right direction).
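One way to check the connections while ignoring direction is to compare undirected pairs. The following is a minimal sketch under stated assumptions: `undirected_edges` is a hypothetical helper, `model_text` is a hand-written string in the `[ENTITY 1, RELATIONSHIP, ENTITY 2]` format rather than a real model answer, and `expected_pairs` is filled in by hand for the simpler colleagues example.

```python
import json


def undirected_edges(triples):
    """Turn [source, relationship, target] triples into a set of unordered pairs."""
    return {frozenset((source, target)) for source, _relationship, target in triples}


# Hand-written stand-in for answer['choices'][0]['text'].
model_text = (
    '[["Maks", "colleague", "Petro"], '
    '["Vlad", "colleague", "Maks"], '
    '["Petro", "colleague", "Vlad"]]'
)

# Pairs we expect to be connected, regardless of direction.
expected_pairs = {
    frozenset(pair)
    for pair in [("Maks", "Petro"), ("Maks", "Vlad"), ("Petro", "Vlad")]
}

found_pairs = undirected_edges(json.loads(model_text))
missing = expected_pairs - found_pairs  # connections the model did not produce
extra = found_pairs - expected_pairs    # connections we did not expect

print("missing:", missing)
print("extra:", extra)
```

An empty `missing` set means every expected connection is present, even if some arrows point the other way.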



## **Section 2 - Search Engine with GPT-3**

[Haystack](https://haystack.deepset.ai) is an open-source framework for building search systems that work intelligently over large document collections. Besides providing a comfortable entry point to the [OpenAI API](https://openai.com/product), Haystack offers all the other components we need to successfully implement an end-to-end NLP system with GPT: a vector database, a module for retrieval, and the pipeline that combines all those elements into one queryable system.
In this lab, we’ll demonstrate how to build a generative question-answering system that uses the GPT-3 `davinci-003` model to present results in convincing natural language.


In [None]:
!pip install openai==0.27.4
!pip install farm-haystack[colab]==1.15.1
!pip install Wikipedia-API==0.5.8
!pip install faiss-cpu==1.7.2
!pip install farm-haystack[faiss]==1.15.1
!pip install wget

In [None]:
import wikipediaapi
import json
import os
import shutil
from haystack.utils import convert_files_to_docs, clean_wiki_text
from haystack.nodes import PreProcessor
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.nodes import OpenAIAnswerGenerator
from haystack.pipelines import GenerativeQAPipeline
from haystack.utils import print_answers

### **Build a Search Engine with GPT-3**

If you’ve been online lately, you’ve likely seen the excitement about OpenAI’s newest language model, ChatGPT. ChatGPT is astonishingly good at many things, including debugging code and rewriting text in whatever style you ask it. As an offshoot of GPT-3.5, a large language model (LLM) with billions of parameters, ChatGPT owes its impressive knowledge to the fact that it’s seen a large portion of the internet during training — in the form of the Common Crawl corpus and other data.

It is understandable that people are excited by a language model that can hold a conversation and create a solid semblance of intelligence. But we need to stay critical when it comes to the validity of answers generated by these models. LLMs are especially prone to hallucinations: producing text that sounds sensible at first but doesn’t hold up to closer scrutiny, and presenting things that are made up entirely as facts.
Semantic search engines, a specialty of deepset, are often powered by extractive question-answering models. These models return verbatim snippets from the knowledge base rather than generating text from scratch the way ChatGPT does.

[Haystack](https://docs.haystack.deepset.ai/docs/intro), deepset’s open-source framework for applied natural language processing (NLP), allows you to leverage multiple GPT models in your pipeline. With this approach, you can build a GPT-powered semantic search engine that uses your data as ground truth and bases its natural-language answers on the information it contains.

### **The advent of large language models**

While the largest BERT model has 336 million parameters, OpenAI’s largest GPT-3.5 model, which ChatGPT is based on, has 520 times as many (roughly 175 billion).

From observation, we can say that GPT is exceptionally good at understanding implication and intent. It can remember what’s been discussed earlier in the conversation, including figuring out what you’re referring to with words like “he” or “before that.” It can tell you when your question doesn’t make sense. All of these properties add to the impression of actual intelligence. It also has to generate language from scratch, a much more challenging task than returning the correct section from a corpus.

### **Different types of search engines**

Semantic search engines come in different varieties and can roughly be distinguished by the type of answer they return. The answers could consist of matching documents (in document search), answer spans (in extractive QA), or newly generated answers (in generative QA).

### **The GenerativeQAPipeline: Haystack’s component for a generative search engine**

In this lab, we use the GenerativeQAPipeline. It consists of a retriever (to find relevant documents) and a generator (to write the text) chained together. The retriever connects to the database. Like the generator, it is often (but not necessarily) based on a Transformer model. Its task is to retrieve the documents from the database that are most likely to contain valuable information based on a user’s input query.
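The retriever-then-generator chain can be illustrated with a toy example. This sketch is not Haystack code: `retrieve`, `generate` and `toy_generative_qa` are hypothetical names, the retriever ranks by simple word overlap instead of embeddings, and the generator is a stub that only assembles the text a real language model would receive.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def generate(query, context_docs):
    """Stub generator: a real pipeline would send this text to a language model."""
    context = " ".join(context_docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"


def toy_generative_qa(query, documents, top_k=1):
    """Chain the two steps: retrieve relevant documents, then generate from them."""
    return generate(query, retrieve(query, documents, top_k=top_k))


documents = [
    "The Kyiv TV Tower is a lattice steel tower.",
    "Kyiv Zoo is a zoo in Kyiv.",
    "FC Dynamo Kyiv is a football club.",
]
query = "How tall is the Kyiv TV tower?"
print(toy_generative_qa(query, documents))
```

The important design point is that the generator only sees what the retriever hands over, which is why answer quality depends on retrieval quality.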

Download dataset:

In [None]:
# VALIDATION_FIELD[str] download_kyiv_dataset
import wget
import zipfile

wget.download('https://dru.fra1.digitaloceanspaces.com/DL_pytorch/datasets/07_attention_transformers/GPT-3/wikipedia_articles.zip')
with zipfile.ZipFile("wikipedia_articles.zip","r") as zip_ref:
    zip_ref.extractall("")

If you want to try your own dataset, list your Wikipedia article titles in `Config.article_titles`.

Use the code below to create a new dataset with [Wikipedia API](https://github.com/martin-majlis/Wikipedia-API).

In [None]:
'''
# you can use this function to create your own dataset
# add your Wikipedia articles to Config.article_titles

def create_own_dataset(wiki_articles_path, article_titles):
  wiki_files_path = wiki_articles_path
  if os.path.exists(wiki_files_path):
    shutil.rmtree(wiki_files_path)
  os.mkdir(wiki_files_path)
  wiki_wiki = wikipediaapi.Wikipedia('en')
  docs = []
  for title in article_titles:
    page = wiki_wiki.page(title)
    if page.exists():
      content = page.text
      doc = {'content': content, 'meta': {'title': title}}
      docs.append(doc)
      with open(f'{wiki_files_path}/{title}.txt', 'w') as f:
        f.write(doc['content'])

create_own_dataset(Config.wiki_article_path, Config.article_titles)

'''

#### **Converting and preprocessing**

Before setting up the pipeline, you need to preprocess your data and add them to the document store or database. This lab uses [FAISS](https://faiss.ai), which is a vector database.

The [`DocumentStore`](https://docs.haystack.deepset.ai/docs/document_store) expects data to be supplied as a Haystack data type called `Document` — a dictionary data type that stores information as a set of related fields.

Use the [convert_files_to_docs](https://docs.haystack.deepset.ai/reference/utils-api#convert_files_to_docs) function with the following arguments:
* `dir_path = wiki_files_path`
* `clean_func = clean_wiki_text`
* `split_paragraphs = True`

In [None]:
# VALIDATION_FIELD[str] docs

### START CODE HERE ###
wiki_files_path = Config.wiki_article_path
docs = ...
### END CODE HERE ###

Many documents, including Wikipedia articles about popular topics, can be long. It would be best to ensure that the documents in your database are short enough for the embedding model to capture their meaning adequately.
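A common way to keep snippets short is a sliding window over words with a small overlap between neighbouring snippets. The sketch below only illustrates the idea. `split_words` and its parameters are hypothetical names, this is not Haystack's implementation, and it ignores sentence boundaries.

```python
def split_words(text, split_length=100, split_overlap=3):
    """Split text into snippets of at most split_length words.

    Consecutive snippets share split_overlap words, so content near a
    boundary appears in both neighbours.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    step = split_length - split_overlap
    snippets = []
    for start in range(0, len(words), step):
        snippets.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return snippets


# Small demo with tiny windows so the overlap is easy to see.
demo_text = " ".join(f"w{i}" for i in range(10))
for snippet in split_words(demo_text, split_length=4, split_overlap=1):
    print(snippet)
```

With `split_length=4` and `split_overlap=1`, the last word of each snippet is repeated as the first word of the next one.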

>To do this, use the [PreProcessor](https://docs.haystack.deepset.ai/docs/preprocessor) to split them into shorter text snippets. Since we split by `"word"`, we suggest a split `length` of **100** words per snippet and an `overlap` of **3** words, to make sure no information gets lost. Clean `empty lines` and `whitespace`, and set `header footer` and `respect sentence boundary` to true as well:

In [None]:
# VALIDATION_FIELD[str] PreProcessor

### START CODE HERE ###
preprocessor = ...
### END CODE HERE ###
processed_docs = preprocessor.process(docs)

Every document has been turned into an object of the `Document` class. This object contains the document’s text and some automatically generated metadata, such as the file the text came from.


What do these processed documents look like? Let’s have a look at one of them:

In [None]:
processed_docs[0]

#### **Initializing the DocumentStore**

Time to set up the document store — for example, the vector-optimized FAISS database. When you initialize the document store, you need to know the length of your retriever’s document vector embeddings — the internal representations it will produce for each document. Since you’ll be working with the high-dimensional `text-embedding-ada-002` model from OpenAI, you need to set the vectors `embedding_dim` to **1536**.

>Use [FAISSDocumentStore](https://docs.haystack.deepset.ai/docs/document_store) with `faiss_index_factory_str = "Flat"`:

In [None]:
# VALIDATION_FIELD[str] FAISSDocumentStore

### START CODE HERE ###
if os.path.exists("faiss_document_store.db"):
  os.remove("faiss_document_store.db")
document_store = ...
### END CODE HERE ###

Now, delete any existing documents in the database, and add the preprocessed documents you generated earlier:

In [None]:
# VALIDATION_FIELD[cls] write_document

document_store.delete_documents()
document_store.write_documents(processed_docs)

>Note that so far, the database only contains plain-text documents. To add the high-dimensional vector embeddings — the representations of each document that make sense to the language model and that it can use for semantic search — you need to set up the model for retrieval.



#### **Retriever**

The retriever is the module that matches your query to the documents in the database and retrieves those it deems most likely to contain the answer. Retrievers can be keyword-based (like tf-idf and BM25) or encode semantic similarity using Transformer-generated text vectors. In the latter case, the retriever is also used to index the documents in your database, turning them into high-dimensional embeddings that the retriever can search.

You’ll be working with OpenAI’s most recent retrieval model, `text-embedding-ada-002`. To initialize it in Haystack, you need to provide your `OpenAI API key` and use [EmbeddingRetriever](https://docs.haystack.deepset.ai/reference/retriever-api#embeddingretriever) with a `batch size` of **32** and `the longest length of each document sequence` of **2048** for your `document store`:
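To show what "searching embeddings" means, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration (real `text-embedding-ada-002` vectors have 1536 dimensions), `cosine_similarity` is a hypothetical helper, and cosine similarity is only one common choice. The metric actually used depends on how the document store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only.
doc_vectors = {
    "tv_tower": [0.9, 0.1, 0.0],
    "zoo": [0.1, 0.9, 0.2],
    "metro": [0.7, 0.0, 0.7],
}
query_vector = [1.0, 0.0, 0.1]

# Rank document names from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```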

In [None]:
# VALIDATION_FIELD[str] EmbeddingRetriever

### START CODE HERE ###
retriever = ...
### END CODE HERE ###

When you set up the retriever, you connect it directly to your document store. Now you can use the update_embeddings method to turn the raw documents in the document store into high-dimensional vectors that the retrieval model can search and compare.

In [None]:
# VALIDATION_FIELD[cls] update_embeddings

document_store.update_embeddings(retriever)

#### **Generator**

You are now ready to initialize the GPT model to generate text for you. The [OpenAIAnswerGenerator](https://docs.haystack.deepset.ai/reference/answer-generator-api#openaianswergenerator) node can use four different GPT models. You can use the highest performing GPT-3.5 model, `text-davinci-003`, with `temperature=.5`, `max_tokens=30` parameters and your `OpenAI key`:


In [None]:
# VALIDATION_FIELD[str] generator

### START CODE HERE ###
generator = ...
### END CODE HERE ###

#### **Pipeline**

Now that all the individual elements of your GPT search engine are set up, it’s time to pass them to your generative QA pipeline.

>Use [GenerativeQAPipeline](https://docs.haystack.deepset.ai/reference/pipelines-api#generativeqapipeline) with the generator and retriever created earlier:

In [None]:
# VALIDATION_FIELD[str] GenerativeQAPipeline

### START CODE HERE ###
gpt_search_engine = ...
### END CODE HERE ###

#### **Querying the pipeline**

Now you can ask your system general questions about Kyiv (or whatever other topic your dataset is about). In addition to the query, you can pass a few parameters to the search engine, like the number of documents the retriever should deliver to the generator and the number of answers to generate (both designated **“top_k”**).

>Try asking GPT `"What is Kyiv known for?"`:

In [None]:
### START CODE HERE ###
query = ...
### END CODE HERE ###

params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}
answer = gpt_search_engine.run(query=query, params=params)

To print the answer generated by your pipeline, use Haystack’s handy `print_answers` function, which was imported earlier. Its `details` argument lets you determine how much detail you want to see. Setting it to `"minimum"` prints only the answer string. So what’s the search engine’s response to the question above?

In [None]:
print_answers(answer, details="minimum")

#### **Generated answers are context-dependent**

The GPT-3 model generates its answers based on the documents that it receives. You can test that by running the generator in isolation, without the retriever. You can’t run it without any documents at all, though, so you need to pass it a single snippet. Here’s what happens if you use a snippet from the article about the football team Dynamo Kyiv.


In [None]:
dinamo_document = processed_docs[0]

for doc in processed_docs:
  if doc.meta['name'] == 'FC Dynamo Kyiv.txt':
    dinamo_document = doc
    break

In [None]:
answers = generator.predict("What is Kyiv known for?", documents=[dinamo_document], top_k = 1)
print_answers(answers, details="minimum")

And let's see what happens if you try to generate an answer from a document that doesn't have any information about your query.

Let's take a document that is not about Dynamo Kyiv:

In [None]:
wrong_dinamo_file = processed_docs[0]
print(wrong_dinamo_file.meta)

Let's ask GPT about the Dynamo Kyiv football team using this document:

In [None]:
answers = generator.predict("How was the Kyiv Dinamo football team created?", documents=[wrong_dinamo_file], top_k = 1)
print_answers(answers, details="minimum")

Now, go back to the full version of the search engine, the one that has ingested the whole dataset, and ask a few more questions to better understand how your search engine operates:


In [None]:
query = "When is the best time to visit Kyiv?"
params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}

answer = gpt_search_engine.run(query=query, params=params)
print_answers(answer, details="minimum")

In [None]:
query = "Do people from Kyiv have their own culture?"
params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}

answer = gpt_search_engine.run(query=query, params=params)
print_answers(answer, details="minimum")

In [None]:
query = "Tell me about some interesting place in Kyiv"
params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}

answer = gpt_search_engine.run(query=query, params=params)
print_answers(answer, details="minimum")

In [None]:
query = "How was the TV tower built?"
params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}

answer = gpt_search_engine.run(query=query, params=params)
print_answers(answer, details="minimum")

In [None]:
query = "Is Kyiv a good place for clubbing?"
params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}

answer = gpt_search_engine.run(query=query, params=params)
print_answers(answer, details="minimum")

## **Section 3 - Fine-tune GPT**

Fine-tuning is the process of training a [Large Language Model (LLM)](https://en.wikipedia.org/wiki/Large_language_model) to recognize a specific pattern of input and output that can be applied to any custom NLP task.

Taken from the [official docs](https://platform.openai.com/docs/guides/fine-tuning), fine-tuning lets you get more out of the GPT-3 models by providing the following:
* Higher quality results than prompt design
* Ability to train on more examples than can fit in a prompt
* Token savings due to shorter prompts
* Lower latency requests

Fine-tuning involves the following steps:
* Prepare and upload training data
* Train a new fine-tuned model
* Use your fine-tuned model

In [None]:
import json
import os

!pip install openai==0.27.4
import openai
import pandas

### Add OpenAI API key

You need to make an account and generate an [API key](https://platform.openai.com/account/api-keys).

In [None]:
# VALIDATION_FIELD[str] set_openai_api_key_fine_tune

openai.api_key = Config.openai_api_key

Let's download our dataset.

You can also create your dataset with your questions and answers. The dataset should be of the following format:

```
{"prompt": "<question>", "completion": "<ideal answer>"}
{"prompt": "<question>", "completion": "<ideal answer>"}
{"prompt": "<question>", "completion": "<ideal answer>"}
```
and saved as a `.jsonl` file.
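The format above is JSON Lines: one JSON object per line. As a minimal sketch, the hypothetical helper `write_jsonl` below writes records in that format. It uses an in-memory buffer and placeholder records so that it runs without any files. In the lab you would pass a file opened with `open('my_dataset.jsonl', 'w')`, where the file name is only an example.

```python
import io
import json


def write_jsonl(records, file_obj):
    """Write each record as one JSON object per line (the JSON Lines format)."""
    for record in records:
        # json.dumps escapes newlines inside strings, so every record stays on one line.
        file_obj.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder records, for illustration only.
records = [
    {"prompt": "<question 1>", "completion": "<ideal answer 1>.\n"},
    {"prompt": "<question 2>", "completion": "<ideal answer 2>.\n"},
]

buffer = io.StringIO()
write_jsonl(records, buffer)

lines = buffer.getvalue().splitlines()
print(len(lines))
print(lines[0])
```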

In [None]:
!pip install wget
import wget

wget.download('https://dru.fra1.digitaloceanspaces.com/DL_pytorch/datasets/07_attention_transformers/GPT-3/QA_DataRoot_Labs.csv')

Let's check our dataset.

In [None]:
data = pandas.read_csv('QA_DataRoot_Labs.csv')

In [None]:
print('Prompt:', data.iloc[0]['prompt'])
print('Completion:', data.iloc[0]['completion'])

**Expected output:**

<table>
  <tr>
    <td>Prompt:</td>
    <td>What is it DataRoot Labs?</td>
 </tr>
 <tr>
    <td>Completion: </td>
    <td>DataRoot Labs is a full-service Data Science & Artificial Intelligence R&D company with main focus on Big Data Management & Strategy Consulting, Data Science & Engineering.DataRoot Labs consists of AI, HighLoad and Science teams — geeks, really good at building & assembling AI-Enabled solutions & Infrastructures, complex scientific R&D.AI Lab delivers to our partners and clients the unique value leveraging Deep Learning, Computer Vision, NLP, Advanced Scoring Models'} </td>
  </tr>
</table>

### Add suffixes and export data

Make sure to end each prompt with a suffix. According to the [OpenAI API reference](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset), we will use "` ->`".

Also, make sure to end each completion with a suffix as well; we are using "`.\n`".
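
The suffix rules can be illustrated on plain strings. The sketch below is one possible way to express them; the helper names `with_prompt_suffix` and `with_completion_suffix` are hypothetical and not part of the lab code.

```python
PROMPT_SUFFIX = " ->"
COMPLETION_SUFFIX = ".\n"

def with_prompt_suffix(prompt):
    # Leave the prompt alone if it already ends with the separator.
    return prompt if prompt.endswith(PROMPT_SUFFIX) else prompt + PROMPT_SUFFIX

def with_completion_suffix(completion):
    # Already terminated: nothing to do.
    if completion.endswith(COMPLETION_SUFFIX):
        return completion
    # Ends with a dot: only the newline is missing.
    if completion.endswith("."):
        return completion + "\n"
    # Otherwise add the full ".\n" suffix.
    return completion + COMPLETION_SUFFIX

print(repr(with_prompt_suffix("What is it DataRoot Labs?")))
print(repr(with_completion_suffix("Yuliya Sychikova")))
```

Both helpers are idempotent, so applying them twice does not duplicate a suffix. They could also be applied column-wise, for example with `data['prompt'].map(with_prompt_suffix)`.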

In [None]:
def add_suffix(dataset):
  # write through dataset.at because rows yielded by iterrows() may be copies
  for index, train_request in dataset.iterrows():
    # make sure the prompt ends with " ->"
    if not train_request['prompt'].endswith(' ->'):
      dataset.at[index, 'prompt'] = train_request['prompt'] + ' ->'

    # make sure the completion ends with ".\n":
    # add "\n" if it already ends with a dot, otherwise add ".\n"
    if not train_request['completion'].endswith('.\n'):
      if train_request['completion'].endswith('.'):
        dataset.at[index, 'completion'] = train_request['completion'] + '\n'
      else:
        dataset.at[index, 'completion'] = train_request['completion'] + '.\n'
  return dataset

In [None]:
data = add_suffix(data)

Let's see some QA examples from our dataset:

In [None]:
data[0:2]

**Expected output:**

<table>
  <tr>
    <td>0</td>
    <td>What is it DataRoot Labs? -></td>
    <td>DataRoot Labs is a full-service Data Science ..</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Can the DataRoot Labs provide advice? -></td>
    <td>During the discovery call with a client, we p...</td>
  </tr>
</table>

Convert the preprocessed dataset to a proper JSONL file.

A JSONL file is a newline-delimited JSON file, so we'll add a `\n` at the end of each object:

In [None]:
file_name = "training_data.jsonl"

data = data.to_dict('records')

with open(file_name, "w") as output_file:
  for entry in data:
    json.dump(entry, output_file)
    output_file.write("\n")

### Prepare data with the OpenAI tool

You can check the training data with the CLI data preparation tool provided by OpenAI. It suggests how to reformat the training data.
Let's try it on our training data. Run this line in Jupyter Notebook:
>**Note that the tool may prompt you to confirm some actions while it runs. Be sure to respond with "y" to confirm and agree to all of them.**

In [None]:
!openai tools fine_tunes.prepare_data -f training_data.jsonl

You will see something similar to this:

```
Analyzing...

- Your file contains 36 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- All prompts end with suffix ` ->`
- All completions end with suffix `.\n`
  WARNING: Some of your completions contain the suffix `.
` more than once. We suggest that you review your completions and add a unique ending
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: y


Your data will be written to a new JSONL file. Proceed [Y/n]: y

Wrote modified file to `training_data_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "training_data_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[".\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.94 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
```
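
A few of the checks reported above can be approximated locally. The following is a rough sketch, not a reimplementation of the OpenAI tool; the function name `check_training_lines` and the example records are hypothetical.

```python
import json

def check_training_lines(lines, prompt_suffix=" ->", completion_suffix=".\n"):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
            problems.append(f"line {number}: expected exactly 'prompt' and 'completion' keys")
            continue
        prompt, completion = record["prompt"], record["completion"]
        if not isinstance(prompt, str) or not isinstance(completion, str):
            problems.append(f"line {number}: prompt and completion must be strings")
            continue
        if not prompt.endswith(prompt_suffix):
            problems.append(f"line {number}: prompt does not end with {prompt_suffix!r}")
        if not completion.endswith(completion_suffix):
            problems.append(f"line {number}: completion does not end with {completion_suffix!r}")
        if not completion.startswith(" "):
            problems.append(f"line {number}: completion does not start with a space")
    return problems

# Hypothetical records: the first is well formed, the second has two issues.
example_lines = [
    json.dumps({"prompt": "Who are you? ->", "completion": " A test record.\n"}),
    json.dumps({"prompt": "Missing suffix", "completion": "No leading space.\n"}),
]
for problem in check_training_lines(example_lines):
    print(problem)
```

Because an open file iterates line by line, the same function could be called as `check_training_lines(open("training_data_prepared.jsonl"))` on the prepared file.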



### Fine-tuning

With the prepared data, we can start fine-tuning the GPT model.

#### Create a File

To fine-tune a model, you first need to upload the training file to OpenAI.

> Upload the prepared dataset file and create an upload response with [File create](https://platform.openai.com/docs/api-reference/files/upload), using the parameters `file=open(file_name, "rb")` and `purpose='fine-tune'`.

In [None]:
### START CODE HERE ###
file_name = 'training_data_prepared.jsonl'
upload_response = ...
### END CODE HERE ###

file_id = upload_response.id
upload_response

#### Start the fine-tuning

The upload response contains the `file_id`. This is the ID of your uploaded dataset, which is needed to fine-tune the `davinci` model.

Start the fine-tuning with the [FineTune create](https://platform.openai.com/docs/api-reference/fine-tunes/create) method, using these parameters:
- `training_file = file_id`
- `model = 'davinci'`
- `n_epochs = 8`

In [None]:
### START CODE HERE ###
fine_tune_response = ...
### END CODE HERE ###

In [None]:
fine_tune_id = fine_tune_response.id
fine_tune_response

> Write your fine-tune response ID to the `fine_tune_id` variable as a `str`.

In [None]:
# VALIDATION_FIELD[str] fine_tune_response_id

### START CODE HERE ###
# for example 'ft-xxxxxxxxxxxxxxxxxxx'
fine_tune_id = ...
### END CODE HERE ###

Training can take `from 15 to 25 minutes`. You can monitor the progress with [retrieve](https://platform.openai.com/docs/api-reference/fine-tunes/retrieve).

> `id = fine_tune_id` 

The message `"Fine-tune succeeded"` tells you that the process has finished.
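
Since training takes a while, the status is usually checked repeatedly. Below is a minimal polling sketch under stated assumptions: `get_status` is a hypothetical callable that returns the job status as a string (in this lab it would wrap the retrieve call), and the set of terminal statuses is an assumption to verify against the API version you use.

```python
import time

# Assumed terminal states; check them against the statuses your API reports.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_until_finished(get_status, poll_seconds=60.0, max_polls=60, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or max_polls is reached."""
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Do not sleep after the last allowed poll.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"still {status!r} after {max_polls} polls")

# Stand-in for a real status lookup: replays a scripted sequence of statuses.
fake_statuses = iter(["pending", "running", "running", "succeeded"])
final_status = wait_until_finished(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)  # succeeded
```

Capping the number of polls keeps a stuck job from blocking the notebook indefinitely.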

In [None]:
### START CODE HERE ###
response = ...
### END CODE HERE ###

print(response.events)

Get the name of your fine-tuned model.

In [None]:
fine_tuned_model = response.fine_tuned_model
fine_tuned_model

> Write your fine-tuned model name to the `fine_tuned_model` variable as a `str`.

In [None]:
# VALIDATION_FIELD[str] fine_tuned_model_name

### START CODE HERE ###
# for example 'davinci:ft-personal-2022-02-12-22-22-22'
fine_tuned_model = ...
### END CODE HERE ###

#### Completion

Let's ask our model who the COO of DataRoot Labs is.

> Your prompt should end with the suffix "` ->`"

In [None]:
### START CODE HERE ###
new_prompt = ...
### END CODE HERE ###

Using [Completion](https://platform.openai.com/docs/api-reference/completions/create), create a response from your fine-tuned model with these parameters:
- `max_tokens = 100`
- `temperature = 0.3`
- `stop = '.\n'`
- `model = your fine-tuned model name`
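
The role of the prompt suffix and the `stop` sequence can be shown without calling the API. This sketch only imitates the effect of `stop` on a piece of text; `build_prompt`, `cut_at_stop`, and the raw continuation string are hypothetical illustrations, not actual model output.

```python
PROMPT_SUFFIX = " ->"
STOP_SEQUENCE = ".\n"

def build_prompt(question):
    # The fine-tuned model expects the same separator it saw during training.
    question = question.strip()
    return question if question.endswith(PROMPT_SUFFIX) else question + PROMPT_SUFFIX

def cut_at_stop(generated_text, stop=STOP_SEQUENCE):
    # Keep only the text before the first stop sequence;
    # the stop sequence itself is not part of the result.
    head, _, _ = generated_text.partition(stop)
    return head

print(build_prompt("Who is the COO of DataRoot Labs?"))

# Hypothetical unbounded continuation, to show where generation would be cut.
raw_continuation = " Yuliya Sychikova.\n What is it DataRoot Labs? ->"
print(repr(cut_at_stop(raw_continuation)))  # ' Yuliya Sychikova'
```

Without a stop sequence, the model may keep generating further question and answer pairs until `max_tokens` is reached.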

In [None]:
### START CODE HERE ###
answer = ...
### END CODE HERE ###
print(answer['choices'][0]['text'])

**Expected output:**

<table>
  <tr>
    <td>Result:</td>
    <td> Yuliya Sychikova </td> 
  </tr>
</table>

# Conclusion
As we can see, our fine-tuned model has fit the training data well and answers the question correctly.

### What's next:
1. Try experimenting with GPT models and prompts.
2. Using the OpenAI API, try the text-to-image DALL·E or speech-to-text Whisper models.
3. Fine-tune your own model using another QA dataset.

##### Make sure that you didn't add or delete any notebook cells. Otherwise your work may not be accepted by the validator!