# Experimenting with Elicit (GPT-3)

- toc: true
- badges: true
- comments: true
- image: images/elicit-icon.png
- hide: false
- categories: [gpt, elicit, ought]

## What This Post Is About

A few months ago, I gained access to an exciting new tool called [Elicit](https://elicit.org/) from [Ought](https://ought.org/). Elicit is an AI research assistant that helps you answer questions "by making qualitative reasoning steps explicit and using language models to incrementally automate those steps."

Here we can see Elicit's main page, where I have opened the Task picker. (Though it's hidden here, there is a text prompt box below the picker where you can type in a question, topic, or sentence.)

![View of Elicit tasks](https://raw.githubusercontent.com/JayThibs/jacques-blog/master/images/elicit-tasks.png)

Using GPT-3 on the backend, Elicit is able to help with a wide range of research and brainstorming tasks. The team behind Elicit are focused on building tools that will help millions of people think through the types of questions we ask ourselves every day.

You can watch [Elicit screencasts](https://www.youtube.com/channel/UCg5fl1Ht965Me_KuV84XFwg) on YouTube to get a sense of all the things it can do. It really is impressive.

You can see here a video of Elicit being used to help 

## What Excites Me About Tools Like Elicit

Recent innovations in machine learning have given us models that are great at tasks with labeled data, but how can we use these models to answer more subjective questions for which we don't have a labeled dataset? For example, could we have an AI that we can ask questions like "How do I decide what career I should dedicate my life to?" or "My partner feels jealous about my career growth, what should I do?" It would be amazing to have an AI that could help answer these questions.

I'm particularly excited about how AI will help make knowledge accessible to everyone in a way that guides humanity towards a better future. Though still in its early stages, AI is already showing signs that it could revolutionize education, research, and even therapy.

One glimpse of this was a Twitter thread in which an OpenAI employee said they had used GPT-3 as a therapist and achieved a deeper breakthrough than they had ever had with a human therapist. Our models aren't ready to leave the lab yet, but once they become exceptional and robust, and we can make such services unbelievably cheap, accessible, and compatible with human values, inequality in access to world-class services will likely shrink quickly.

## My Initial Look Into Elicit

For this blog post, I will focus on giving an overview of Elicit rather than going into the nitty-gritty. I'm not going to be too formal and will instead point to some of the things I find cool and just give my thoughts.

Going back to AI for mental health, here we can see how Elicit is used to do positive reframing of negative statements you give it. The video below shows how Elicit is given example prompts for few-shot learning. If you ask GPT-3 to do a task without any examples in the prompt, we call it "zero-shot". Once we give it one example (K=1), it becomes "one-shot", and with more than one example (K≥2), we call it "few-shot". More examples generally make the output more accurate, but GPT-3 can still give great results with very few. As the [original GPT-3 paper](https://arxiv.org/abs/2005.14165) showed, as you increase K, GPT-3 can match or even beat state-of-the-art fine-tuned models on some tasks (scroll a bit to see a plot from the paper). That is why it is useful to add examples in Elicit.
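To make the zero-, one-, and few-shot distinction concrete, here is a minimal Python sketch of how a few-shot prompt might be assembled for positive reframing. The instruction text, the `Negative:`/`Reframed:` labels, the example pairs, and `build_prompt` are all hypothetical; they are not Elicit's actual prompts. The only thing that changes with K is how many worked examples precede the new statement.

```python
# Hypothetical sketch: how zero-, one-, and few-shot prompts differ.
# The instruction wording, labels, and example pairs are made up for
# illustration and are NOT Elicit's real prompts.

EXAMPLES = [
    ("I failed my exam, I'm useless.",
     "I didn't pass this time, but now I know what to study before the retake."),
    ("Nobody replied to my message.",
     "People may just be busy; I can follow up later."),
]


def build_prompt(statement: str, k: int) -> str:
    """Build a reframing prompt containing k worked examples (k=0 is zero-shot)."""
    if k < 0 or k > len(EXAMPLES):
        raise ValueError(f"k must be between 0 and {len(EXAMPLES)}")
    lines = ["Rewrite each negative statement as a positive reframing.", ""]
    # Prepend the first k example pairs; this is the only difference between settings.
    for negative, reframed in EXAMPLES[:k]:
        lines += [f"Negative: {negative}", f"Reframed: {reframed}", ""]
    # The model is expected to continue the text after the final "Reframed:".
    lines += [f"Negative: {statement}", "Reframed:"]
    return "\n".join(lines)


if __name__ == "__main__":
    for k in (0, 1, 2):
        print(f"--- K={k} ---")
        print(build_prompt("My code review got a lot of comments.", k))
```

With K=0 the model has only the instruction to go on; each extra example shows it the expected style and format of the answer.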

> youtube: https://youtu.be/AIW5xM2VMaQ


![GPT-3 vs. fine-tuned SOTA on TriviaQA](https://raw.githubusercontent.com/JayThibs/jacques-blog/master/images/gpt-3-vs-fine-tuned-sota-triviaqa.png)

> On TriviaQA GPT3’s performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG.

This is exciting and all, but I should note that this is not always the case, and you need to be aware of when a fine-tuned model trained on the full training data will do better in your situation. As a recent [paper](https://arxiv.org/abs/2109.02555) shows, GPT-3 does not mean the end of fine-tuned models:

> We investigated the performance of two powerful transformer language models, i.e. GPT-3 and BioBERT, in few-shot settings on various biomedical NLP tasks. The experimental results showed that, to a great extent, both the models underperform a language model fine-tuned on the full training data. Although GPT-3 had already achieved near state-of-the-art results in few-shot knowledge transfer on open-domain NLP tasks, it could not perform as effectively as BioBERT, which is orders of magnitude smaller than GPT-3.

That said, GPT-3 was designed to perform as well as possible on a wide variety of tasks without any fine-tuning, and on that front it did not fall short.

### Into Forecasting?

Elicit has a feature called [Elicit Forecast](https://forecast.elicit.org/binary?binaryQuestions.sortBy=popularity&limit=20&offset=0&predictors=community) which is hidden from the main website, but they go into a bit of detail about how it works in the video below. For any aspiring superforecaster out there, this could be a great tool.

> youtube: https://youtu.be/eIxoj46UibY

In the future, I expect people will ask Elicit questions like "Will we be able to finish project X on time?" and Elicit will give its best guess. Over time, I'm sure tools like Elicit will become invaluable not just for forecasters, but also for people like project managers who want a better idea of how long a project will actually take.

## Closing Thoughts

Having worked in government for the past 4 years, I find a lot of things about Elicit exciting, including things I had thought of building myself. I'm glad someone is building a tool like this.

As the Elicit team writes on their website, I can see it being applied in government policy: Senators could get better at asking questions and get up to speed on issues much more effectively than with the current approach. I can also see it working in strategic foresight teams like Horizons Canada, or in regulatory bodies that have to dig through tons of PDFs to find the right information and make sure companies comply with the law.

One of the things I'm most excited about is creating a second brain with a tool like Elicit. Sure, we can use Roam Research to connect our thoughts, and we can even load tons of text into it, but I think semantic search with Elicit could be much more powerful: after loading 100k documents on a topic, we could simply have it point us to relevant passages or answer the questions we have.
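Elicit's internals aren't described here, so the following is only a minimal sketch of the general idea behind semantic search: turn the query and each document into vectors, then rank documents by cosine similarity to the query. The `embed` function is a hypothetical stand-in that uses plain word counts; a real semantic search system would swap in a learned embedding model so that matches reflect meaning rather than shared words.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector.
    A real semantic search system would call a learned embedding model here."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, docs: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k (score, document) pairs ranked by similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    docs = [
        "Regulation on pipeline safety inspections",
        "Notes on career planning",
        "Pipeline inspection compliance checklist",
    ]
    for score, doc in search("pipeline compliance", docs, top_k=2):
        print(f"{score:.3f}  {doc}")
```

With 100k documents you would compute the document vectors once and store them in an index instead of re-embedding every document on each query, as this toy `search` does.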