# Phase-4-Advanced Modelling and Error Analysis

###Phase-3 output
In Phase-3, a model was built that uses semantic search to compare the user query with the questions in the dataset and returns the answer corresponding to the question most similar to the user query.

Limitation of the above model-

1) If a user query does not directly match any of the questions in the dataset, the model fails to answer it.

###How to overcome the limitations of Phase-3 model-

Suppose that the user query does not directly match any question in the dataset, but its answer can be found somewhere within the answers provided in the FAQ dataset. In that case, the answer to the user query should also be searched for in the list of answers provided in the FAQ dataset. This solution will be implemented in Phase-4.

Reference- The solution to the above-mentioned problem is based on https://www.youtube.com/watch?v=YhVgl70Tn_k

###BERT for Question Answering Systems

One approach for searching for the answer to the user query in the list of answers provided in the FAQ dataset is-

Step-1: Get the embeddings for the user query.

Step-2: Treat each pair of Question and Answer as one paragraph. Get the embeddings for all the paragraphs.

Step-3: Find the cosine similarity of the user-query with all the paragraphs and pick the top-K paragraphs that are most similar to the user-query.

Step-4: Pass the user-query along with each of the top-K paragraphs to BERT (trained on SQuAD). BERT will predict the best possible answer from each paragraph given to it.

Step-5: BERT gives a probability estimate for the best possible answer in the given paragraph. This probability represents the chance that the predicted answer is the best answer within that paragraph. Because the softmax is normalised over a single paragraph, these probabilities cannot be compared with the probabilities of answers from other paragraphs. So, the last softmax layer needs to be removed. This will enable BERT to output the logits instead of probabilities.

Step-6: Compare all the answers/logits that are returned by BERT and then return the best answer as the response to the user-query.
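
The retrieval and final selection steps (Step-3 and Step-6) can be sketched as below. This is only a minimal illustration and not the implementation used in this project: the 3-dimensional embeddings, the candidate answers, and their logit scores are made-up values, and the function names are hypothetical. In a real system the vectors would come from a sentence-embedding model and the scores from the BERT reader.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_paragraphs(query_vec, paragraph_vecs, k):
    """Indices of the k paragraphs most similar to the query (Step-3)."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(paragraph_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

def best_answer(candidates):
    """Pick the candidate with the highest raw logit score (Step-6).

    candidates: list of (answer_text, logit_score) pairs, one per paragraph.
    """
    return max(candidates, key=lambda pair: pair[1])[0]

# Toy 3-dimensional embeddings (made up for illustration only).
query_vec = [1.0, 0.0, 1.0]
paragraph_vecs = [
    [0.9, 0.1, 1.1],   # close to the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 0.0, 0.0],   # partially similar
]
top = top_k_paragraphs(query_vec, paragraph_vecs, k=2)
print(top)

# Hypothetical reader outputs for the two retrieved paragraphs.
candidates = [("answer from paragraph 0", 7.2), ("answer from paragraph 2", 3.1)]
print(best_answer(candidates))
```

Comparing raw logit scores across paragraphs is what makes Step-6 possible; per-paragraph probabilities would each be normalised separately and would not be comparable.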

### Limitations of the above proposed solution-

The above solution will return only a contiguous answer, i.e., it will return the starting and ending positions of the answer within the given paragraph. This means the model will not work in the following cases:

Case-1: The model will fail if the needed answer can be found only by picking up pieces of information from different places within the paragraph.

Case-2: The model will fail to return the answer if it can be found only by combining information from multiple paragraphs.
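
To see why the answer is always contiguous, here is a minimal sketch of how an extractive reader typically chooses its answer: it scores every (start, end) pair and returns the single best slice of the paragraph. The tokens and logit values are made up for illustration, and `best_span` is a hypothetical helper, not a function from any library.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) of the highest-scoring contiguous span."""
    best = None
    for start, s_logit in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# Made-up tokens and logits, for illustration only.
tokens = ["CBT", "focuses", "on", "thoughts", ";", "DBT", "adds", "mindfulness"]
start_logits = [2.0, 0.1, 0.0, 0.3, 0.0, 1.5, 0.2, 0.1]
end_logits = [0.1, 0.2, 0.0, 2.5, 0.0, 0.3, 0.1, 2.0]

start, end, score = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]), score)
```

The output is one slice `tokens[start:end + 1]`. Even if a second relevant piece exists later in the paragraph (here, the part about DBT), the reader cannot stitch the two pieces together, which is exactly Case-1 above.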

### Should the above proposed solution be implemented?

Ans- I think that the model built in Phase-3 will work fine in around 80% of the cases. I think that if the above proposed model is trained well, it should increase our success rate to at least around 85%. So, I think that the above proposed model should be implemented.

https://github.com/cdqa-suite/cdQA can be used as reference to implement the above proposed model.

**Most of the time was invested in analyzing whether the above proposed solution (BERT for Question Answering Systems) should be implemented or not. The actual implementation of the proposed model still needs to be done.**

###GPT-3 model
When the above proposed model (BERT for Question Answering Systems) was discussed, I got to know that if the model is deployed to production with a large dataset, it will have high latency, as it compares the user query embedding with the embeddings of all the question-answer pairs in the dataset.
The recommended choice of model for this problem is a seq-to-seq model. So, after some research, GPT-3 was used to solve this problem.

The below links were referred to while implementing GPT-3-

a) https://www.pragnakalp.com/question-answering-using-gpt3-examples/ 

b) https://www.youtube.com/watch?v=C-8sF81k7cY

###Expectations from GPT-3 model-

1) GPT-3 model should overcome the limitations of the BERT for Question Answering Systems model.

###Experiments with GPT-3-

1) GPT-3 solved the limitation of the model that was proposed in Phase-4. If the needed answer is spread across different paragraphs of the document, GPT-3 is able to give the answer by comprehending all the paragraphs that contain some part of it.

2) GPT-3 model still needs to be fine-tuned to find out how to use it for best results. For now, this is the first-cut implementation of the GPT-3 model.

3) Most of the time, when querying for information outside the provided document, the model responded with an error saying that the query does not lie within the provided document. This was the expected behavior.

4) While experimenting with the GPT-3 model, it was found that it also uses its own knowledge apart from the document provided to it. For instance, one more name was added to the founders of Google in the given document. On querying the model, the expectation was that it would return the modified founders as the answer, but GPT-3 still returned the actual founders. This was not the expected behavior.

5) Another instance of GPT-3 using its own knowledge outside the provided document: the document had no information about the founder of Apple. The only thing the document said about Apple was that it is one of the 5 biggest companies in the world. But the model returned the correct response about the founder of Apple. This should not be the case. A few more instances like this were observed.

6) The conclusion drawn from points 3, 4, and 5 above is that the model can use its own knowledge (outside the provided document) on topics that are discussed in the provided document.





In [None]:
!pip install openai

Collecting openai
  Downloading openai-0.15.0.tar.gz (40 kB)
Collecting pandas-stubs>=1.1.0.11
  Downloading pandas_stubs-1.2.0.49-py3-none-any.whl (161 kB)
Building wheels for collected packages: openai
  Building wheel for openai (setup.py) ... done
  Created wheel for openai: filename=openai-0.15.0-py3-none-any.whl size=50093 sha256=7451851374f8393bdfb383d779436b0ccad75f4e50798d0e2f9ef5e59e6a1566
  Stored in directory: /root/.cache/pip/wheels/bd/b1/b5/01a94056fd87ef0ed913b2fa6f1161076b730cf1449f579ab7
Successfully built openai
Installing collected packages: pandas-stubs, openai
Successfully installed openai-0.15.0 pandas-stubs-1.2.0.49


In [None]:
import openai
import pandas as pd
import json
import time

In [None]:
# # No need to execute this code as file is already uploaded.

# #Load Mental_Health_FAQ.csv into a pandas dataframe.
#url= 'https://raw.githubusercontent.com/ArpanJainGithub/Chatbot-MentalHealth/main/Mental_Health_FAQ.csv'
#mentalHealthFaq = pd.read_csv(url,encoding='unicode_escape')

In [None]:
# # No need to execute the below code as file is already uploaded.

# #Combining the pair of question and answer into one paragraph.
# paragraphs = []
# for i in range(mentalHealthFaq.shape[0]):
#   paragraph = mentalHealthFaq['Questions'][i] +" "+ mentalHealthFaq['Answers'][i]
#   paragraphs.append(paragraph)
# print(len(paragraphs))
# print(paragraphs)


In [None]:
# # No need to execute the below code as file is already uploaded.

# #Creating a list of dictionaries for all the paragraphs in the form of [{"text":"paragraph1"},{"text":"paragraph2"}]
# paragraphsJsonList = []
# for paragraph in paragraphs:
#   paragraphsDict = {}
#   paragraphsDict["text"] = paragraph
#   paragraphsJsonList.append(paragraphsDict)

In [None]:
# # No need to execute the below code as file is already uploaded.

# print(len(paragraphsJsonList))
# print(paragraphsJsonList)

In [None]:
## No need to execute the below code as file is already uploaded.

# #Converting the list of dictionaries that is created above to a jsonl file.
# with open('MentalHealthQueries.jsonl', 'w') as outfile:
#     for entry in paragraphsJsonList:
#         json.dump(entry, outfile)
#         outfile.write('\n')

In [None]:
# #All the files that are uploaded should be deleted. The DELETE and GET commands mentioned below are not working yet and need to be fixed.
# #Get the list of all the files that belongs to that user
# #GET https://api.openai.com/v1/files
# # Delete the file
# #DELETE https://api.openai.com/v1/files/{file-ulhKfqkK8wlVOIPFY7qPXSLZ}


#File is already uploaded, so commented the upload code
# #To upload the file 
openai.api_key = "YOUR API KEY"
 
# response = openai.File.create(
#  file=open("/content/drive/MyDrive/AppliedRootsAIML/ChatbotPractice/MentalHealthQueries.jsonl"),
#  purpose='answers'
# )
 
# print(response)

In [None]:
# print(response.id)

In [None]:
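#Code to perform the question answering.
def testModel(question):
  """Ask GPT-3 a question against the uploaded FAQ file and return the raw response."""
  answer = openai.Answer.create(
     search_model="ada",   # model used to rank the documents in the file
     model="curie",        # model used to generate the answer
     question=question,
     file="file-tjksW00XtspZgy7TPrhzhoGC",  # id of the uploaded jsonl file
     examples_context="In 2017, U.S. life expectancy was 78.6 years.",
     examples=[["What is human life expectancy in the United States?","78 years."]],
     max_rerank=10,        # maximum number of documents re-ranked by the search model
     max_tokens=1500,      # upper limit on the length of the generated answer
     stop=["\n", "<|endoftext|>"]  # generation ends at the first newline
  )
  return answer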

In [None]:
questionTest= "What is the mental illness?"
print(testModel(questionTest))

{
  "answers": [
    "Mental illness is a health condition that disrupts a person's thoughts, emotions, relationships, and daily functioning. It is associated with distress and diminished capacity to engage in the ordinary activities of daily life."
  ],
  "completion": "cmpl-4lKsUkir9MZROWHpLPdcS742dTaah",
  "file": "file-tjksW00XtspZgy7TPrhzhoGC",
  "model": "curie:2020-05-03",
  "object": "answer",
  "search_model": "ada:2020-05-03",
  "selected_documents": [
    {
      "document": 0,
      "object": "search_result",
      "score": 118.843,
    },
    {
      "document": 1,
      "object": "search_result",
      "score": 122.72,
      "text": "I\u2019m a young person and one of my parents has a mental illness. What can I do? Someone else\u2019s illness is not your fault. You also can\u2019t control how someone else feels, their illness, or the things they do or say. What you can do is take care of yourself."
    },
    {
      "document": 9,
      "object": "search_result",
      "

###Testing of the GPT-3 model on our dataset.
Test-0: To check the working of the model on a simple, straightforward query from the provided document.

Test-1: Make sure the model fails on queries that cannot be answered with the given document.

Test-2: Verify if GPT-3 model overcomes the limitations of the models that were built and proposed in Phase-3 and Phase-4.

Test-3: Check the performance of the GPT-3 model.


###Test-0: To check the working of the model on a simple, straightforward query from the provided document.

Test question-0: What's the difference between CBT and DBT?

Expected result- It should provide the complete answer that is mentioned in the document against this query.

In [None]:
start = time.time()
question0= "What's the difference between CBT and DBT?"
print(testModel(question0))
end = time.time()
print("Latency= "+ str(end-start))

{
  "answers": [
    "CBT (cognitive-behavioural therapy) and DBT (dialectical behaviour therapy) are two forms of psychotherapy or \u201ctalk therapy.\u201d In both, you work with a mental health professional to learn more about the challenges you experience and learn skills to help you manage challenges on your own."
  ],
  "completion": "cmpl-4lKsYG860b5khiaBThjXFcRYAaJWr",
  "file": "file-tjksW00XtspZgy7TPrhzhoGC",
  "model": "curie:2020-05-03",
  "object": "answer",
  "search_model": "ada:2020-05-03",
  "selected_documents": [
    {
      "document": 6,
      "object": "search_result",
      "score": 112.372,
      "text": "What's the difference between psychotherapy and counselling? Psychotherapy and counselling have a lot in common and usually mean the same thing. Both are used to describe professionals who use talk-based approaches to help someone recover from a mental illness or mental health problem. Many different professionals may provide counselling or psychotherapy, inclu

### Test-1: Make sure the model fails on queries that cannot be answered with the given document.

**Test question 1**- Who is Lord Mahavira?

Expected result- Model should fail as the document does not have any information about Lord Mahavira.

In [None]:
start = time.time()
question1= "Who is the Lord Mahavira?"
print(testModel(question1))
end = time.time()
print("Latency= "+ str(end-start))

{
  "answers": [
    "Lord Mahavira is the founder of Jainism."
  ],
  "completion": "cmpl-4lKsckBF5qzoZcD0AMlZOijUUqKeX",
  "file": "file-tjksW00XtspZgy7TPrhzhoGC",
  "model": "curie:2020-05-03",
  "object": "answer",
  "search_model": "ada:2020-05-03",
  "selected_documents": [
    {
      "document": 9,
      "object": "search_result",
      "score": -88.386,
      "text": "Not all mental health programs in BC require a doctor\u2019s referral. This is good news for people who are looking for help! A \"self-referral\" means that you ask to see someone, and then you will be evaluated to see if you meet the criteria to receive services. Contact your local health authority to learn more about programs in your area:"
    },
    {
      "document": 7,
      "object": "search_result",
      "score": -86.157,
      "text": "Schizoid personality disorder is believed to be relatively uncommon. While some people with SPD may see it as part of who they are, other people may feel a lot of distre

**Test question1 actual result- Fail**

Actual Result- Model gave the right answer, even though the document does not have any information about Lord Mahavira.

**Test question 2-** What are decorative laminates used for?

Expected result- Fail

In [None]:
start = time.time()
question2= "What are decorative laminates used for?"
print(testModel(question2))
end = time.time()
print("Latency= "+ str(end-start))

{
  "answers": [
    "Decorative laminates are used to create a decorative look on a surface. They are made of a thin layer of plastic or paper that is glued to a surface."
  ],
  "completion": "cmpl-4lKsf2dNESGQv1xuDex06lHJJSpf7",
  "file": "file-tjksW00XtspZgy7TPrhzhoGC",
  "model": "curie:2020-05-03",
  "object": "answer",
  "search_model": "ada:2020-05-03",
  "selected_documents": [
    {
      "document": 2,
      "object": "search_result",
      "score": -11.361,
      "text": "What is the legal status (and evidence) of CBD oil? Cannabidiol or CBD is a naturally occurring component of cannabis. It is extracted from the cannabis plant and often made into an oil for use. CBD is not psychoactive, and does not produce the \u2018high\u2019 of THC (tetrahyrocannabinol), the primary psychoactive component of cannabis. CBD is legal in Canada and has been used in the treatment of various medical conditions."
    },
    {
      "document": 6,
      "object": "search_result",
      "score":

**Test question2 actual result- Fail.**

Model gave the right answer, even though the document has no information about decorative laminates.

### Conclusion- 

When experimenting with a different dataset that had just 2 paragraphs, the model was failing for queries outside the provided document, but in the above tests, the model was able to answer queries for which the document does not have any information. Still, let's try some different tests to know more about the model.

###Test-2: Verify if GPT-3 model overcomes the limitations of the models that were built and proposed in Phase-3 and Phase-4.

Test question 3- What is mental health and what is the difference between mental health professionals?

Expected result- The above query contains 2 different questions, and the model should be able to answer both of them. This will verify that the model is able to answer a query that needs the combined information of multiple paragraphs.

In [None]:
#question3 = "What are the symptoms of mental illness and is it possible for the mental patients to get recover?"
question3 = "What is mental health and what is the difference between mental health professionals?"
start = time.time()
print(testModel(question3))
end = time.time()
print("Latency= "+ str(end-start))


{
  "answers": [
    "Mental health is the ability to think, feel, and act in a way that is healthy and productive. Mental health professionals are health care professionals who have received specialized training in mental health. They can help you manage your mental health and improve your quality of life."
  ],
  "completion": "cmpl-4lKsitbIV1efLklBp2j2Vjqvtg4KS",
  "file": "file-tjksW00XtspZgy7TPrhzhoGC",
  "model": "curie:2020-05-03",
  "object": "answer",
  "search_model": "ada:2020-05-03",
  "selected_documents": [
    {
      "document": 8,
      "object": "search_result",
      "score": 73.108,
      "text": "What is mental health? We all have mental health which is made up of our beliefs, thoughts, feelings and behaviours."
    },
    {
      "document": 1,
      "object": "search_result",
      "score": 82.8,
      "text": "What's the difference between psychotherapy and counselling? Psychotherapy and counselling have a lot in common and usually mean the same thing. Both are 

Actual Result- Pass.

Model answered both the questions.

###Test-3: Check the performance of the model.

Test question-4: What's the difference between CBT and DBT?

In [None]:
question4 = "What's the difference between CBT and DBT?"
start = time.time()
print(testModel(question4).answers)
end = time.time()
print("Latency= "+ str(end-start))

['CBT (cognitive-behavioural therapy) and DBT (dialectical behaviour therapy) are two forms of psychotherapy or “talk therapy.” In both, you work with a mental health professional to learn more about the challenges you experience and learn skills to help you manage challenges on your own.']
Latency= 4.088366985321045


Latency for Test question3 = 2.9392693042755127.

Observations about the performance of the model-

1) The answer given by the model is short compared to what is given in the document. The model is allowed to generate 1500 tokens, but the answers are still short.

2) The model sometimes repeats the same phrase many times to form an answer. For example, let us say that there are 100 words in the answer provided; sometimes only the first 10 words actually make sense, and the remaining 90 words are repetitions of those initial 10 words.


###What needs further research/experiments?

1) Why the model is giving short responses. One possible factor to check is the stop=["\n", "<|endoftext|>"] setting in testModel, which ends the answer at the first newline.

2) The quality of the answers provided by the model compared to the answers in the document.

3) Some words in the answers returned by the model are not present in the provided FAQ document. So, it seems that the model is answering the questions using knowledge from outside the provided document. This needs to be checked.
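
One simple way to start checking point 3 is to measure how many words of a returned answer actually occur in the provided document. The sketch below is only a rough, hypothetical heuristic (plain word overlap, no stemming or synonyms), and `grounding_ratio` is an illustrative name, not part of any library. The example texts are taken from the outputs shown earlier in this notebook.

```python
import re

def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def grounding_ratio(answer, documents):
    """Fraction of answer tokens that appear somewhere in the documents."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    vocabulary = set()
    for doc in documents:
        vocabulary.update(tokenize(doc))
    supported = sum(1 for tok in answer_tokens if tok in vocabulary)
    return supported / len(answer_tokens)

documents = [
    "What is mental health? We all have mental health which is made up of "
    "our beliefs, thoughts, feelings and behaviours.",
]
grounded = "Mental health is made up of our beliefs and thoughts."
ungrounded = "Lord Mahavira is the founder of Jainism."

print(grounding_ratio(grounded, documents))
print(grounding_ratio(ungrounded, documents))
```

A low ratio suggests the answer relies on knowledge outside the document. A high ratio does not prove the answer is correct, so this could only serve as a first filter before manual review.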

###Calculating the accuracy for the above model

1) Is the provided answer relevant?

2) Has the correct paragraph been chosen for providing the answer?

3) Is the answer given by the model as informative as the answer provided in the document?

Combining the above parameters, does the overall performance of the model match the expectation?

Parameters for calculating the accuracy.

For questionTest= "What does it mean to have a mental illness?"

1) Yes-1

2) Yes

3) No- 0.2

Result not satisfactory.


For Test question-0: What's the difference between CBT and DBT?

1) Yes-1

2) Yes

3) No- 0.2

Result not satisfactory.


For Test question 1- Who is Lord Mahavira?

1) No-0

2) No

3) No-0

Result not satisfactory.


Test question 2- What are decorative laminates used for?

1) No-0

2) No

3) No-0

Result not satisfactory.


Test question 3- What is mental health and what is the difference between mental health professionals?

1) No-0.5

2) No

3) No-0.1


Accuracy- 
In terms of the relevancy of responses= 2.5 out of 5 = 50%

In terms of the amount of information that the model should provide in order to solve the problem= 0.5 out of 5 = 10%
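
The two totals can be recomputed from the per-question scores listed above (parameter 1 for relevancy, parameter 3 for the amount of information). This is just the arithmetic written out; the variable and function names are chosen for illustration.

```python
# Per-question scores copied from the lists above, in the order:
# questionTest, Test question-0, Test question 1, Test question 2, Test question 3.
relevance_scores = [1, 1, 0, 0, 0.5]
informativeness_scores = [0.2, 0.2, 0, 0, 0.1]

def percentage(scores):
    """Total score as a percentage of the maximum (1 point per question)."""
    return 100 * sum(scores) / len(scores)

print(f"Relevance: {sum(relevance_scores)} out of {len(relevance_scores)} "
      f"= {percentage(relevance_scores):.0f}%")
print(f"Informativeness: {sum(informativeness_scores):.1f} out of "
      f"{len(informativeness_scores)} = {percentage(informativeness_scores):.0f}%")
```

This gives 50% for relevancy and 10% for the amount of information.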

###Latency

Average latency of the model is 3.3 seconds.

In [None]:
#Calculating Latency
#Latency for test question-1: Latency= 2.628389596939087
#Latency for test question-2: Latency= 3.8287103176116943
#Latency for test question-3: Latency= 3.978365659713745
#Latency for test question-3: Latency= 2.9392693042755127
AverageLatency = (2.628389596939087 + 3.8287103176116943 + 3.978365659713745 + 2.9392693042755127) / 4
print(AverageLatency)

3.3436837196350098


###Comparison of the models built till now

a) Models built in Phase-3: 

  1.   Semantic Search using BERT.
  
b) Models built in Phase-4:

  1.   Proposed Question and Answering System using BERT. Not implemented.
  2.   GPT-3 model.
     
We will be comparing the Semantic Search using BERT and the GPT-3 model-

1) The Semantic Search model is doing better than the GPT-3 model as of now.

2) Our dataset is a set of FAQs. In my opinion, after deploying the model, around 80% of the user queries will be similar to the questions that are present in the FAQ set. Using the semantic search model in this case will provide satisfactory answers to the user query. 

3) Let us assume we deploy the GPT-3 model. Then, on the basis of the current testing results, GPT-3 gives relevant answers to the user query, but these answers are not as good as the answers given in the FAQ dataset. 

4) Looking at the current testing results, we can settle for the limitations of the Semantic Search model, because it fulfils the core purpose of answering the user queries in 80% of the cases. If, in the future, the GPT-3 model is able to provide answers that are as good as the answers in the FAQ dataset, then we can go for the GPT-3 model.

###What can be improved if given more time-

1) A combination of both the models should be used, as 80% of the user-queries can be answered better using the Semantic Search model. For the remaining 20%, where the user-query does not directly match the FAQs in the dataset, the GPT-3 model can be used.

2) Use the GPT-3 model for semantic search.

3) Right now, the semantic search model only compares the user query with the set of questions in the FAQ dataset. It can be improved to compare the user query with both the questions and the answers in the FAQ dataset.

4) Check if the answers of the Question Answering model using GPT-3 can be improved to the level of the answers present in the FAQ set. If this can be done, it would be the best option.
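
The combination described in point 1 could be sketched as a simple router: answer from the FAQ when the semantic search match is strong, and fall back to the generative model otherwise. Everything below is an assumption for illustration: the `answer_query` function, the two stub models, and the threshold of 0.75 are hypothetical and would need to be replaced with the real models and a threshold tuned on actual queries.

```python
def answer_query(query, faq_search, generative_answer, threshold=0.75):
    """Route a query: use the FAQ answer if the match is strong, else fall back.

    faq_search(query) -> (answer, similarity) from the semantic search model.
    generative_answer(query) -> answer string from the generative model.
    Both callables and the threshold value are hypothetical placeholders.
    """
    answer, similarity = faq_search(query)
    if similarity >= threshold:
        return answer, "semantic-search"
    return generative_answer(query), "generative"

# Stub models so the sketch runs without any API access.
def fake_faq_search(query):
    if "cbt" in query.lower():
        return "CBT and DBT are two forms of psychotherapy.", 0.92
    return "Unrelated FAQ answer.", 0.30

def fake_generative_answer(query):
    return "Generated answer for: " + query

print(answer_query("What's the difference between CBT and DBT?",
                   fake_faq_search, fake_generative_answer))
print(answer_query("What is mental health and who can help?",
                   fake_faq_search, fake_generative_answer))
```

With this design, the generative model is only called for the queries that the FAQ cannot answer directly, which also keeps the average latency and cost down.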