<a href="https://colab.research.google.com/github/Shen220/GPT/blob/main/CCS7_Generative_Pre_trained_Transformers_Demo_(GPT).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Creating your very own Generative Pre-trained Transformer (GPT)**

This activity walks through creating a simple Generative Pre-trained Transformer (GPT) that accepts user input and returns a set response. We will use existing pre-trained tools available on ***HuggingFace***, a repository of machine learning models and datasets. In particular, we will use the all-MiniLM-L6-v2 sentence transformer (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for our demo activity.

**You will be creating your own GPT based on a topic of your choice with your respective groups for CCS7. You are required to create a dataset with at least 150 Prompt-Completion pairs. Each unique question may be augmented at most five (5) times.**

**Make sure to click "File" and then "Save a Copy in Drive" before making any changes to the demo.**

# **Creating a GPT**

**We start by installing the sentence-transformers library from HuggingFace.** This will allow us to make use of the different transformer models that are available on the platform in our code.

In [None]:
pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)

**We then import the transformer tools that we will be using for our code.** We will also be making use of PyTorch, a library for machine learning and deep learning that runs its computations on your system's CPU or GPU.

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

**We will now upload the dataset for our GPT to use.** The file has two columns, "prompt" and "completion". The prompt column contains the potential questions that your user may ask when they run your application. The completion column contains the response that your application will give for the matching prompt.

**Hint:** An easy way to augment your dataset for GPT is to think of variations on the same question that a user may ask. For example, instead of the user only asking "Hi, how are you?", they could also ask "Hey, how are you doing?" or "How is it going?"

In [None]:
from google.colab import files
uploaded = files.upload()

Saving GPT_dataset.csv to GPT_dataset.csv
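
As a minimal sketch of the augmentation hint above, the snippet below pairs several phrasings of one question with the same completion and writes them out as CSV text. The column names `prompt` and `completion` are the ones the notebook code reads; the example rows and the helper name `build_rows` are hypothetical and only for illustration.

```python
import csv
import io

def build_rows(variations, completion):
    """Pair every phrasing of a question with the same completion."""
    return [{"prompt": p, "completion": completion} for p in variations]

# Hypothetical example data: one original question plus two variations.
greetings = ["Hi, how are you?", "Hey, how are you doing?", "How is it going?"]
rows = build_rows(greetings, "I am doing well, thank you for asking!")

# Write the rows as CSV text. To create a real file instead, replace the
# buffer with open("GPT_dataset.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["prompt", "completion"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each variation becomes its own row, so three phrasings of one question count as three Prompt-Completion pairs.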


**Once your dataset has been uploaded, it will then be read and stored in the "data" variable for future use.** Make sure to update the filename in the code snippet below so that it reads the custom dataset you have uploaded.

In [None]:
import pandas as pd
data = pd.read_csv('GPT_dataset.csv')
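
Before moving on, it may help to confirm that the file has the expected shape. The following is a rough sketch, not part of the original demo: it uses Python's built-in `csv` module on CSV text and checks for the `prompt` and `completion` columns, empty cells, and the 150-pair minimum stated above. The function name `check_dataset` and the sample text are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"prompt", "completion"}
MIN_PAIRS = 150  # minimum number of pairs required by the activity

def check_dataset(csv_text, min_pairs=MIN_PAIRS):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        problems.append("missing columns: " + ", ".join(sorted(missing)))
        return problems
    rows = list(reader)
    # Line numbers start at 2 because line 1 is the header
    # (assumes every record fits on a single line).
    empty = [i for i, r in enumerate(rows, start=2)
             if not (r["prompt"] or "").strip()
             or not (r["completion"] or "").strip()]
    if empty:
        problems.append("empty cells on CSV lines: " + ", ".join(map(str, empty)))
    if len(rows) < min_pairs:
        problems.append(f"only {len(rows)} pairs, expected at least {min_pairs}")
    return problems

# Hypothetical sample with one missing completion.
sample = "prompt,completion\nGood morning,Good morning to you!\nHello,\n"
print(check_dataset(sample, min_pairs=2))
```

To check an uploaded file, the same function could be called on its contents, for example `check_dataset(open('GPT_dataset.csv', newline='').read())`.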

**Once we have our dataset ready, our next step is to set up the model that we will be using to process our given data.** In this case, we will be using the "all-MiniLM-L6-v2" model, which is one of the many transformer models available on HuggingFace. Take note that you can use another model from this point onwards, but you will need to read its documentation to see how it is used, and you may need to change the code accordingly.

**Hint:** You can browse the other available sentence transformer models, along with more information on the framework, at https://www.sbert.net/index.html.

In [None]:
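# Load the pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# The prompts from the dataset, as a plain Python list
sentences = data['prompt'].tolist()

# Convert every prompt into an embedding vector
embeddings = model.encode(sentences)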

**This part of the code asks your user to input a question, and the GPT model attempts to interpret the question and provide an appropriate response.** It returns the "best" answer for the user's question. The score shown beside the response is the cosine similarity between the user's input and the closest prompt in your dataset, and it determines which "completion" is returned. A score closer to 1 means that the input is close to one of the prompts in your dataset, while a score closer to 0 means that the input is far from the expected prompts. The code only gives an answer when the best score is above 0.7; otherwise, it reports that there are no possible answers.

In [None]:
question=input("Hello! What is your question for today? \n")
queries = [question]

top_k = min(5, len(sentences))
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    cos_scores = util.cos_sim(query_embedding, embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)

    # Only the best match is used: answer if its score clears the 0.7 threshold
    best_score, best_idx = top_results[0][0], top_results[1][0]
    if best_score > 0.7:
        print(data['completion'][best_idx.item()], "(Score: {:.4f})".format(best_score))
    else:
        print("There are no possible answers")

Hello! What is your question for today? 
Good morning




Query: Good morning
Good Morning! Quite the rainy weather we are having today no? (Score: 1.0000)
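
The scoring step can also be illustrated without the model. The sketch below computes cosine similarity by hand on made-up three-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions) and applies the same 0.7 threshold used in the demo. The function names `cosine_similarity` and `best_answer`, the vectors, and the completions are all hypothetical; the demo itself relies on `util.cos_sim` and `torch.topk`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_answer(query_vec, prompt_vecs, completions, threshold=0.7):
    """Return (completion, score) for the closest prompt, or (None, score)
    when even the best score does not exceed the threshold."""
    scores = [cosine_similarity(query_vec, v) for v in prompt_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] > threshold:
        return completions[best], scores[best]
    return None, scores[best]

# Made-up "embeddings" for two prompts and their completions.
prompt_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
completions = ["Good morning!", "Goodbye!"]

# A query pointing almost the same way as the first prompt scores near 1.
print(best_answer([0.9, 0.1, 0.0], prompt_vecs, completions))

# A query that is not close to any prompt falls below the threshold.
print(best_answer([0.5, 0.5, 0.7], prompt_vecs, completions))
```

Raising the threshold makes the application stricter about what counts as a match, while lowering it lets loosely related questions receive an answer.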
