<a href="https://colab.research.google.com/github/Jake-Jasper/AI-Capstone-Project/blob/main/Retrieval_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Retrieval evaluation

This notebook evaluates the retrieval step of each embedding model using precision, recall and F1 score.

## Question list (22)
Each question is paired with its relevant documents so that recall, precision and F1 score can be measured.
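As a minimal sketch of the three metrics, the helper below scores one question from two collections of document IDs. The function name `precision_recall_f1` is an assumption made for illustration and is not part of the notebook. The example values mirror question 8, where documents 2, 9 and 10 are labelled relevant and documents 2 and 10 are retrieved.

```python
def precision_recall_f1(relevant, retrieved):
    """Score one question from two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # true positives
    # precision: share of retrieved documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    # recall: share of relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    # F1: harmonic mean of the two, defined as 0 when there are no hits
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Question 8: relevant documents 2, 9 and 10; retrieved documents 2 and 10.
print(precision_recall_f1(relevant=[2, 9, 10], retrieved=[2, 10]))
```

Every retrieved document is relevant, so precision is 1.0, while one relevant document is missed, so recall is 2/3.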

|Doc_ID|Name|Relevant_Q|
|---|---|---|
|1|Evidence & insights - reducing household food waste & plastic packaging (packaged vs loose)|1,2,3,4,5,6,7,14,19|
|2|Food waste collection guidance |8, 17|
|3|Food Waste Reduction Roadmap - Executive Summary|9, 14, 16,19|
|4|Food Waste Reduction Roadmap - Hospitality|9, 19|
|5|Food Waste Reduction Roadmap - Manufacturers|9, 19|
|6|Food Waste Reduction Roadmap - Primary Producers|9, 10, 19|
|7|Food Waste Reduction Roadmap - Retail|9, 11, 19|
|8|Identifying impacts from food and farm digestates|12,13|
|9|Industry Guidance - Dealing with Household Food Waste at AD Facilities - Management of Liners|14,8, 15|
|10|Literature review - relationship between household food waste collection and food waste prevention|8, 15, 16, 17, 18, 19|
|11|The food waste reduction roadmap progress report 2022|9, 19|
|12|Towards the 2030 Food Waste Commitment -- setting our coalition baseline|14, 19|
|13|Upscaling farm food waste measurement in the UK|19, 22|
|14|WRAP-Evidence Review Plastic Packaging and Fresh Produce 171218|3, 4, 14|

1. "How should I store bananas?"
2. "What temperature should I keep milk?"
3. "How do I increase shelf life of lettuce?"
4. "How does selling fresh produce loose, affect waste?"
5. "What is a best-before date for?"
6. "What is a use-by date for?"
7. "How does removing date labels help?"
8. "What are the reasons to not participate in a food waste scheme?"
9. "What is the UN's sustainable development goal with respect to food?"
10. "Which document relates to primary producers and food waste?"
11. "Which document relates to retailers and food waste?"
12. "What is the most common use for digestate?"
13. "What affects the costs of valorisation?"
14. "How many tonnes of food waste are there each year?"
15. "What is the benefit of caddy liners?"
16. "How to prevent food loss?"
17. "How to prevent food waste?"
18. "How does composting affect food waste?"
19. "What are the main drivers of food waste?"
20. "Online shopping and food waste"
21. "What is involved with Courtauld 2030?"
22. "How do you measure in field food waste?"

In [1]:
!pip install -Uq sentence-transformers


In [2]:
import sqlite3
from sentence_transformers import SentenceTransformer, util
import numpy as np

In [100]:
# model used to encode question (should be same as the one used to create database)
#model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
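One cheap way to catch a model/database mismatch is to compare the size of a stored embedding with the output size of the model. The sketch below assumes embeddings are stored as raw float32 bytes, as the retrieval code in this notebook reads them. The helper names `blob_dim` and `matches_model` are hypothetical, and the blob is a stand-in rather than a real row. A matching size is necessary but not sufficient: two different models can share a dimension.

```python
# Output sizes of the two models named above (from their model cards).
EXPECTED_DIMS = {
    "sentence-transformers/all-mpnet-base-v2": 768,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}


def blob_dim(blob, bytes_per_value=4):
    """Number of float32 values stored in an embedding blob."""
    if len(blob) % bytes_per_value:
        raise ValueError("blob length is not a whole number of float32 values")
    return len(blob) // bytes_per_value


def matches_model(blob, model_name):
    """True if the stored vector has the size the named model produces."""
    return blob_dim(blob) == EXPECTED_DIMS[model_name]


# Stand-in for one stored embedding: 384 zero-valued float32s.
fake_blob = bytes(384 * 4)
print(matches_model(fake_blob, "sentence-transformers/all-MiniLM-L6-v2"))  # True
print(matches_model(fake_blob, "sentence-transformers/all-mpnet-base-v2"))  # False
```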

In [21]:
# Relevance labels
# keys are document IDs, numbered from 1 to match the table above
keys = list(range(1,15))
# values are the question numbers each document is relevant to
values = [
    [1,2,3,4,5,6,7,14,19],
    [8,17],
    [9,14, 16],
    [9,19],
    [9,19],
    [9,10, 19],
    [9,11, 19],
    [12,13, 19],
    [14,8,15],
    [8,15,16,17,18,19],
    [9, 19],
    [14, 19],
    [19, 22],
    [1,3,4,14],

]

QA = dict(zip(keys,values))
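The labels are keyed by document, but scoring works per question, so it can help to view the mapping the other way round. The sketch below inverts a small sample of the labels; `invert_labels` is a hypothetical helper name, not something the notebook defines.

```python
from collections import defaultdict


def invert_labels(doc_to_questions):
    """Turn {doc_id: [question numbers]} into {question number: sorted [doc_ids]}."""
    question_to_docs = defaultdict(list)
    for doc_id, question_numbers in doc_to_questions.items():
        for q_no in question_numbers:
            question_to_docs[q_no].append(doc_id)
    return {q_no: sorted(docs) for q_no, docs in sorted(question_to_docs.items())}


# Three entries copied from the labels above, to keep the example short.
sample = {2: [8, 17], 9: [14, 8, 15], 10: [8, 15, 16, 17, 18, 19]}
by_question = invert_labels(sample)
print(by_question[8])  # [2, 9, 10]
```

Applied to the full `QA` dictionary, any question number missing from the result has no labelled document and will always score zero.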

In [4]:
# question list
Questions = [
        "How should I store bananas?",
        "What temperature should I keep milk?",
        "How do I increase shelf life of lettuce?",
        "How does selling fresh produce loose, affect waste?",
        "What is a best-before date for?",
        "What is a use-by date for?",
        "How does removing date labels help?",
        "What are the reasons to not participate in a food waste scheme?",
        "What is the UN's sustainable development goal with respect to food?",
        "Which document relates to primary producers and food waste?",
        "Which document relates to retailers and food waste?",
        "What is the most common use for digestate?",
        "What affects the costs of valorisation?",
        "How many tonnes of food waste are there each year?",
        "What is the benefit of caddy liners?",
        "How to prevent food loss?",
        "How to prevent food waste?",
        "How does composting affect food waste?",
        "What are the main drivers of food waste?",
        "Online shopping and food waste",
        "What is involved with Courthold 2030?",
        "How do you measure in field food waste?"
]

In [101]:
# alternative database, built with all-mpnet-base-v2: '/content/drive/MyDrive/Knowledge-database-10-4-24-mpnet.db'
DB = '/content/drive/MyDrive/Knowledge-database-10-4-24-All-mini-L6-v2.db'


In [111]:
def score_retrieval(question=["No question"],
                    model=None, db=DB, threshold=0.7):
    """Return the unique document IDs with a passage scoring at least `threshold`."""
    relevant = set()
    q = model.encode(question)
    conn = sqlite3.connect(db)
    try:
        # Create a cursor object and load row by row
        cursor = conn.cursor()
        cursor.execute('SELECT * FROM documents')
        for row in cursor:
            # row[4] is the embedding stored as float32 bytes, row[1] the document ID
            score = util.pytorch_cos_sim(q, np.frombuffer(row[4], dtype=np.float32)).item()
            # keep the document if this passage meets the score threshold
            if score >= threshold:
                relevant.add(row[1])
    finally:
        # always release the database connection
        conn.close()

    # return only unique relevant documents.
    return sorted(relevant)
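The retrieval step relies on embeddings surviving a round trip through SQLite as raw float32 bytes. The sketch below reproduces that round trip on a toy scale. The two-column `demo` table, the three-dimensional vectors and the `cosine` helper are all made up for illustration; the real `documents` table has more columns and its vectors come from the model.

```python
import sqlite3

import numpy as np

# Hypothetical two-column table held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (doc_id INTEGER, embedding BLOB)")

vectors = {1: [1.0, 0.0, 0.0], 14: [0.6, 0.8, 0.0]}
for doc_id, vec in vectors.items():
    # store each vector as raw float32 bytes
    blob = np.asarray(vec, dtype=np.float32).tobytes()
    conn.execute("INSERT INTO demo VALUES (?, ?)", (doc_id, blob))


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = np.array([1.0, 0.0, 0.0], dtype=np.float32)
scores = {}
for doc_id, blob in conn.execute("SELECT doc_id, embedding FROM demo"):
    # decode the bytes with the same dtype that was used to write them
    scores[doc_id] = cosine(query, np.frombuffer(blob, dtype=np.float32))
conn.close()

above_threshold = sorted(d for d, s in scores.items() if s >= 0.7)
print(above_threshold)  # [1]
```

Reading the blob with a different dtype from the one used to write it would silently yield wrong numbers, which is one reason to keep the encoding and evaluation notebooks in step.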

In [97]:
# For sense checking: return the n highest-scoring rows for a question

def find_top_k_relevance(question=["No question"], model=None, n=5, db=DB):
    q = model.encode(question)
    scores = {}
    conn = sqlite3.connect(db)
    try:
        # Create a cursor object
        cursor = conn.cursor()
        # load row by row; row[0] is the row id, row[4] the float32 embedding
        cursor.execute('SELECT * FROM documents')
        for row in cursor:
            scores[row[0]] = util.pytorch_cos_sim(q, np.frombuffer(row[4], dtype=np.float32)).numpy()
    finally:
        # always release the database connection
        conn.close()

    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n])


find_top_k_relevance(question =["How should I store Bananas?"], model=model)

{7580: array([[0.76544386]], dtype=float32),
 239: array([[0.7401677]], dtype=float32),
 7585: array([[0.68936956]], dtype=float32),
 7485: array([[0.6675967]], dtype=float32),
 7602: array([[0.6573247]], dtype=float32)}
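When only the best few rows are needed, sorting every score is more work than necessary. The sketch below uses `heapq.nlargest` from the standard library as one alternative; `top_k` is a hypothetical helper, and the scores are rounded from the output above with one made-up low score added for contrast.

```python
import heapq


def top_k(scores, k=5):
    """Return the k (row_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Scores rounded from the output above, plus one made-up low score (row 1).
scores = {7580: 0.765, 239: 0.740, 7585: 0.689, 7485: 0.668, 7602: 0.657, 1: 0.100}
print(top_k(scores, k=2))  # [(7580, 0.765), (239, 0.74)]
```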

In [98]:
import pprint
conn = sqlite3.connect(DB)
cursor = conn.cursor()
cursor.execute("""SELECT * FROM documents
                 WHERE id = 7580""")
rows = cursor.fetchone()
conn.close()
pprint.pprint(f"{rows[3]} context: {rows[2]}, {rows[1]}")

('It is not appropriate to store bananas in the fridge as they are '
 'chilling-sensitive, and their skins become blackened, but the coolest place '
 'in the home is the best storage location and the fruit should be kept in '
 'bags. context: 14 - WRAP-Evidence Review Plastic Packaging and Fresh Produce '
 '171218.pdf, 14')


## The error appears to be in the encoding step: here document 4 shows up as 14. The files are probably not being listed in order.

In [57]:
import os
DIR = "/content/drive/MyDrive/WRAP food reports"

sorted(os.listdir(DIR))

['01 - Evidence & insights - reducing household food waste & plastic packaging (packaged vs loose).pdf',
 '02 - Food waste collection guidance .pdf',
 '03 - Food Waste Reduction Roadmap - Executive Summary.pdf',
 '04- Food Waste Reduction Roadmap - Hospitality & Food Service.pdf',
 '05 - Food Waste Reduction Roadmap - Manufacturers.pdf',
 '06 - Food Waste Reduction Roadmap - Primary Producers.pdf',
 '07 - Food Waste Reduction Roadmap - Retail.pdf',
 '08 - Identifying impacts from food and farm digestates.pdf',
 '09 - Industry Guidance - Dealing with Household Food Waste at AD Facilities - Management of Liners.pdf',
 '10 - Literature review - relationship between household food waste collection and food waste prevention.pdf',
 '11 - The food waste reduction roadmap progress report 2022.pdf',
 '12 - Towards the 2030 Food Waste Commitment -- setting our coalition baseline.pdf',
 '13 - Upscaling farm food waste measurement in the UK.pdf',
 '14 - WRAP-Evidence Review Plastic Packaging and F
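One way to make document numbers independent of listing order is to read them from the filename prefix instead of from the position in the list. This is a sketch of that idea, not the indexing code actually used to build the database, which is not shown here. The helper name `doc_number` is hypothetical; the filenames are taken from the outputs above.

```python
import re


def doc_number(filename):
    """Extract the leading document number from a name such as '04- Title.pdf'."""
    match = re.match(r"\s*(\d+)\s*-", filename)
    if match is None:
        raise ValueError(f"no leading number in {filename!r}")
    return int(match.group(1))


names = [
    "14 - WRAP-Evidence Review Plastic Packaging and Fresh Produce 171218.pdf",
    "04- Food Waste Reduction Roadmap - Hospitality & Food Service.pdf",
    "01 - Evidence & insights - reducing household food waste & plastic packaging (packaged vs loose).pdf",
]
print([doc_number(n) for n in names])  # [14, 4, 1]
```

The pattern tolerates the missing space in `04-`, so the number is recovered whatever order the files arrive in.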

In [118]:
## get scores for the document retrieval
def scoring(q, retrieved_docs, qa=QA):
    """Return (precision, recall, F1) for the question at zero-based index q."""
    precision, recall, f1 = 0, 0, 0
    # questions are numbered from 1 in the labels, so shift the zero-based index
    q_no = q + 1
    # every document labelled as relevant to this question
    all_relevant = {k for k, v in qa.items() if q_no in v}
    retrieved = set(retrieved_docs)
    # no labelled relevant documents, or nothing retrieved: all scores are 0
    if not all_relevant or not retrieved:
        return (precision, recall, f1)
    true_positives = len(all_relevant & retrieved)
    precision = true_positives / len(retrieved)
    recall = true_positives / len(all_relevant)
    # avoid a divide-by-zero error when there are no true positives
    if true_positives == 0:
        f1 = 0.0
    else:
        f1 = (2 * precision * recall) / (precision + recall)

    return (precision, recall, f1)


# loop to get all metrics
p, r, f = [], [], []

for i, q in enumerate(Questions):
    print(q)
    retrieved_docs = score_retrieval(question=q, model=model)
    print(retrieved_docs)
    # pass the retrieved documents explicitly rather than reading a global
    score_ = scoring(i, retrieved_docs)
    print(score_)
    p.append(score_[0])
    r.append(score_[1])
    f.append(score_[2])

How should I store bananas?
[1, 14]
(1.0, 1.0, 1.0)
What temperature should I keep milk?
[]
(0, 0, 0)
How do I increase shelf life of lettuce?
[14]
(1.0, 0.5, 0.6666666666666666)
How does selling fresh produce loose, affect waste?
[14]
(1.0, 0.5, 0.6666666666666666)
What is a best-before date for?
[1]
(1.0, 1.0, 1.0)
What is a use-by date for?
[]
(0, 0, 0)
How does removing date labels help?
[1]
(1.0, 1.0, 1.0)
What are the reasons to not participate in a food waste scheme?
[2, 10]
(1.0, 0.6666666666666666, 0.8)
What is the UN's sustainable development goal with respect to food?
[]
(0, 0, 0)
Which document relates to primary producers and food waste?
[10]
(0.0, 0.0, 0.0)
Which document relates to retailers and food waste?
[]
(0, 0, 0)
What is the most common use for digestate?
[8]
(1.0, 1.0, 1.0)
What affects the costs of valorisation?
[8]
(1.0, 1.0, 1.0)
How many tonnes of food waste are there each year?
[8, 9, 10, 11, 12, 13]
(0.3333333333333333, 0.4, 0.3636363636363636)
What is the 

In [119]:
import pandas as pd

df = pd.DataFrame({"Question":Questions, "Precision":p, "Recall":r, "F1":f})


In [120]:
df

Unnamed: 0,Question,Precision,Recall,F1
0,How should I store bananas?,1.0,1.0,1.0
1,What temperature should I keep milk?,0.0,0.0,0.0
2,How do I increase shelf life of lettuce?,1.0,0.5,0.666667
3,"How does selling fresh produce loose, affect w...",1.0,0.5,0.666667
4,What is a best-before date for?,1.0,1.0,1.0
5,What is a use-by date for?,0.0,0.0,0.0
6,How does removing date labels help?,1.0,1.0,1.0
7,What are the reasons to not participate in a f...,1.0,0.666667,0.8
8,What is the UN's sustainable development goal ...,0.0,0.0,0.0
9,Which document relates to primary producers an...,0.0,0.0,0.0
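A single summary figure per metric makes it easier to compare models. The sketch below macro-averages (takes the unweighted mean over questions) using only the ten rows visible in the table above, so the result is illustrative and is not the score for the full question set.

```python
from statistics import mean

# The ten rows visible in the table above: (precision, recall, F1).
rows = [
    (1.0, 1.0, 1.0),
    (0.0, 0.0, 0.0),
    (1.0, 0.5, 0.666667),
    (1.0, 0.5, 0.666667),
    (1.0, 1.0, 1.0),
    (0.0, 0.0, 0.0),
    (1.0, 1.0, 1.0),
    (1.0, 0.666667, 0.8),
    (0.0, 0.0, 0.0),
    (0.0, 0.0, 0.0),
]

# zip(*rows) regroups the tuples into one column per metric
macro_precision, macro_recall, macro_f1 = (mean(col) for col in zip(*rows))
print(f"precision={macro_precision:.3f} recall={macro_recall:.3f} f1={macro_f1:.3f}")
# precision=0.600 recall=0.467 f1=0.513
```

With the notebook's own lists, the same figures would come from `mean(p)`, `mean(r)` and `mean(f)`.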


In [121]:
# name the file after the database's sentence transformer and the sentence-splitting method
df.to_csv("/content/drive/MyDrive/All-mini-v6-spacy-core-sm.csv")