# Generating Answers Using an LLM
In this project we were given a dataset of 27 artifacts and asked to answer the question "why?"

Now, "why" is a very broad question, so we narrowed it down to "Why is the artifact significant to the country that gifted it?"

How did we come up with this question? Well, when giving a gift, what usually matters most is the thought put into it. Someone who knows you very well might give you something useful, customized, or personal, while someone who doesn't know you might give a gift card.

So, we decided to find the significance of each artifact, which would help the UN and fellow enthusiasts understand the value behind each of the gifts.

In [None]:
#importing libraries
import pandas as pd
!pip install serpapi
!pip install requests
!pip install -q -U google-generativeai
import serpapi
import requests
import pathlib
import textwrap
import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown

from google.colab import userdata
SERP_API_KEY=userdata.get('SERP_API_KEY')

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

Collecting serpapi
  Downloading serpapi-0.1.5-py2.py3-none-any.whl (10 kB)
Installing collected packages: serpapi
Successfully installed serpapi-0.1.5

In [None]:
#mounting Google Drive so that we can all access the gift list file
from google.colab import drive
drive.mount('/content/drive')

#data = "/content/drive/MyDrive/List_gifts_for UN - List_gifts_for UN.csv"
!ls "/content/drive/MyDrive/List_gifts_for UN - List_gifts_for UN.csv"
data = pd.read_csv("/content/drive/MyDrive/List_gifts_for UN - List_gifts_for UN.csv")



Mounted at /content/drive
'/content/drive/MyDrive/List_gifts_for UN - List_gifts_for UN.csv'


### Our Plan

We wanted to draw on sources and information beyond the given dataset, so we wrote a function that searches Google and scrapes the top results.

Then, we wrote a second function to scrape the text from the UN websites linked in the dataset.

To pass this text to the AI and generate a response, we wrote a function with our question hard-coded into the prompt that also accepts the sources from the previous two steps.

But how successful was our code? Finally, we wrote one last function: an accuracy metric to evaluate the AI's responses.


#### Step 1: Google Search, then scrape the text off the websites

We built a function that scrapes the text from the top three search results for each artifact name we enter.

In [None]:
#building a function to scrape the text
def google_search_scrape(search_string): #search_string holds the keywords that we are searching for

  #set up search parameters with API key
  params = {
    "q": search_string,
    "hl": "en",
    "gl": "us",
    "num": "3",
    "google_domain": "google.com",
    "api_key": SERP_API_KEY #might need to edit the restrictions on searching when it comes to the goddess of love one
  }

  search = serpapi.search(params)

  #sort through relevant links (for loops)
  relevant_links = []
  relevant_paras = []
  for i in search["organic_results"]:
    relevant_links.append(i["link"])

    #scrape text off of relevant links(for loops)
    req = Request(
      url=i["link"],
      headers={'User-Agent': 'Mozilla/5.0'}
    )
    webpage = urlopen(req).read() #copied from Miss Haripriya's collab
    html = BeautifulSoup(webpage, 'html.parser')
    paragraphs = html.select("p")
    paras = "" #building one string per link instead of a list of paragraphs, because a list inside the dictionary can't be put into a dataframe cleanly
    for para in paragraphs:
      paras = paras + para.text + " " #the space keeps neighboring paragraphs from running together
    relevant_paras.append(paras)

  return {"links": relevant_links, "paras": relevant_paras} #creating a dictionary with the links and the paragraphs of scraped text

# trial = google_search_scrape(search_string = "amphora")
# print(trial)

#relevant_links are the top three links that the search identified
#relevant_paras are the paragraphs of scraped text from those top three links
#this function handles one search at a time; when coding the steps that need an iteration (for loop), we need to create two new columns with this info, then input it into Google Gemini
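
Scraped paragraphs usually carry stray line breaks and runs of spaces, and a single page can be much longer than we want to pass along. The sketch below shows one way to tidy the `paras` strings returned by `google_search_scrape`. The helper name `tidy_scraped_text`, the `max_chars` limit, and the sample dictionary (including the `example.com` link) are our own assumptions for illustration; they are not part of the function above.

```python
import re

def tidy_scraped_text(text, max_chars=2000):
    """Collapse whitespace and cap the length of scraped text (hypothetical helper)."""
    # Replace every run of spaces, tabs, and newlines with a single space.
    cleaned = re.sub(r"\s+", " ", text).strip()
    if len(cleaned) <= max_chars:
        return cleaned
    # Cut at the limit, then back up to the previous space so we never end mid-word.
    shortened = cleaned[:max_chars]
    last_space = shortened.rfind(" ")
    if last_space > 0:
        shortened = shortened[:last_space]
    return shortened

# Made-up stand-in for the dictionary that google_search_scrape returns.
result = {
    "links": ["https://example.com/a", "https://example.com/b"],
    "paras": ["An  amphora is\n\na tall jar.   ", "It has two handles."],
}

tidy = [tidy_scraped_text(p, max_chars=20) for p in result["paras"]]
print(tidy)
```

Cutting at a word boundary is a simple choice; a stricter version might cut at the end of a sentence instead.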


#### Step 2: Scrape the text of the given website links (make a function)

In this step, we want to scrape the text from the UN websites we were given. So, we made a function that returns the paragraph text from a website.



In [None]:
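def scrape(url):
    #returns all of the paragraph text on the page at url as one string
    #browser-like User-Agent, since some sites reject requests without one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30) #timeout so one slow page can't hang the whole run
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p') #every <p> tag on the page
    main_paragraphs = ' '.join([p.get_text() for p in paragraphs]) #join the text of the paragraphs with spaces

    return main_paragraphs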

#scrape("https://www.un.org/ungifts/content/replica-of-palenque-head")
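
To make "returns the paragraph text" concrete without needing a live website, here is a small sketch that pulls the text out of every `<p>` tag in a hard-coded HTML string. It uses Python's built-in `html.parser` instead of BeautifulSoup so it runs with no installs, so it is not the `scrape` function above. The class name `ParagraphCollector`, the helper `paragraphs_from_html`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while we are inside a <p> tag
        self.current = []       # pieces of text for the paragraph being read
        self.paragraphs = []    # finished paragraphs

    def _flush(self):
        # Store the paragraph collected so far, if it has any text.
        text = "".join(self.current).strip()
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()       # also covers a previous <p> that was never closed
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self.in_p = False

    def handle_data(self, data):
        # Keep text only while inside a paragraph; menus and headings are skipped.
        if self.in_p:
            self.current.append(data)

def paragraphs_from_html(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)

sample = (
    "<html><body><h1>Gift</h1>"
    "<p>A bronze <b>statue</b>.</p>"
    "<div>menu</div>"
    "<p>Donated in 1975.</p>"
    "</body></html>"
)
print(paragraphs_from_html(sample))
```

As in `scrape`, the paragraphs are joined with a space, and anything outside a `<p>` tag (the heading and the menu here) is left out.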

#### Step 3: Code Gemini API function

Now that we have the relevant sources, we need to set up access to the Gemini API, which we use to generate responses to the problem statement.

The function also needs parameters that let us choose which sources we pass in, which will be important later.





In [None]:
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY') #store your own key in Colab secrets like you did for SERP_API_KEY, except this time it's named GOOGLE_API_KEY
genai.configure(api_key=GOOGLE_API_KEY)

In [None]:
def genai_response(un_paras, relevant2_paras=""):
  '''
  Two parameters
  un_paras - stores scraped text from the UN website
  relevant2_paras - (optional) stores paragraphs from top three most relevant websites
  Uses query and sources to generate response
  '''
  model = genai.GenerativeModel('gemini-1.5-flash')
  response = model.generate_content(f"""
  QUERY: Why is the artifact significant to the country that gave it?
  SOURCES:
  \n{un_paras}
  \n{relevant2_paras}
   """)
  display(to_markdown(response.text)) #to_markdown only builds the Markdown object; display() is what shows it
  return response.text
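
Since choosing which sources go into the prompt matters later, it can help to build the prompt text in its own small function and inspect it before sending anything to Gemini. The sketch below is one possible way to do that, not the code this notebook runs. The function name `build_prompt`, the bracketed source labels, and the sample texts are assumptions for illustration. The query is the same one hard-coded in `genai_response`.

```python
QUERY = "Why is the artifact significant to the country that gave it?"

def build_prompt(un_text, web_texts=None, query=QUERY):
    """Assemble the prompt from the query and whichever sources are supplied."""
    # Always start with the UN page, then add any web results that were passed in.
    sources = [("UN gift page", un_text)]
    for number, text in enumerate(web_texts or [], start=1):
        sources.append((f"Web result {number}", text))

    lines = [f"QUERY: {query}", "SOURCES:"]
    for label, text in sources:
        # Skip sources that are empty, for example a page that could not be scraped.
        if text and text.strip():
            lines.append(f"[{label}]")
            lines.append(text.strip())
    return "\n".join(lines)

# Made-up source snippets, only to show the shape of the prompt.
prompt = build_prompt(
    "A replica of a carved stone head.",
    ["Palenque was a Maya city.", "   "],
)
print(prompt)
```

Calling `build_prompt(un_text)` with no web texts gives a prompt with only the UN source, which matches running the pipeline on the UN pages alone.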

#### Step 4: Accuracy Metric
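In this step, we created an accuracy check. As written, it returns `True` when the response carries a numeric `accuracy` value between 0 and 100 (a percentage) and `False` otherwise.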



In [None]:
def check_response_accuracy(response):
    #the LLM response is plain text, so only look up 'accuracy' when we are given a dictionary;
    #response['accuracy'] on a string would raise a TypeError
    if isinstance(response, dict) and 'accuracy' in response:
        accuracy = response['accuracy']
        if isinstance(accuracy, (int, float)):
            # Assuming the accuracy is represented as a number (percentage, score, etc.)
            # You can add more specific checks based on the actual response structure
            if 0 <= accuracy <= 100:  # Example condition for percentage accuracy
                return True
    return False
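
If we want an actual percentage for each response, one rough option is to measure how much of the response's vocabulary also appears in the sources it was given. The sketch below is only an illustration of that idea and is not the metric used in this notebook. It measures overlap with the sources, which is a loose proxy and not true accuracy. The function name `source_overlap_percent`, the `min_word_len` cutoff, and the sample sentences are all assumptions.

```python
import re

def source_overlap_percent(response_text, source_text, min_word_len=4):
    """Percentage of the response's distinct longer words that also occur in the sources."""

    def words(text):
        # Lowercase words of at least min_word_len letters; skips short words like "the" and "and".
        return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_word_len}

    response_words = words(response_text)
    if not response_words:
        return 0.0  # nothing to compare, so report zero overlap
    shared = response_words & words(source_text)
    return round(100 * len(shared) / len(response_words), 1)

# Made-up response and source, only to show how the score is computed.
score = source_overlap_percent(
    "The statue symbolizes peace and friendship.",
    "This bronze statue was donated as a symbol of peace.",
)
print(score)
```

A low score here does not prove that a response is wrong, since the model may simply rephrase the sources, so a score like this is best read alongside a manual check of a few responses.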


#### Step 5: Code an iteration that simultaneously does the following
1) Creates new columns for the paragraphs of scraped text

2) Inputs the data for each into the gemini API function

3) Takes the response and stores it in the column

4) Takes each response and runs it through the accuracy metric

5) Prints out the accuracy and creates a new column with it

6) Creates a new csv with all this information for easy access


In [None]:
#THE ISSUE: Okay, so the main problem is the genai function, because if I input a lot of rows it gives me a forbidden error, so now I need to figure out how to run the dataset
#through this loop without getting a forbidden error
#POSSIBLE SOLUTION: Break up the csv into groups of 4 rows (tedious)

temp = data.head(5).copy() #note: this only runs the first five rows as a test; switch to the full dataset once we know it works (.copy() lets us add columns without a pandas warning)

#declaring lists to put the information in; this way, i can put it in the columns without problems
links = []
paras = []
un_paras = []
llm_resp = []
accuracy_score = []

#coding a for loop that does the above steps
for index, row in temp.iterrows():
  result = google_search_scrape(search_string = row["Name"]) #only one we don't need to fix since serp_api thing is already coded
  links.append(result["links"]) #
  web_paras = "\n".join(result["paras"]) #scraped web text for this row only
  paras.append(web_paras)
  un_res = scrape(url = row["Link to Museum"]) #FIXED
  un_paras.append(un_res)
  t_llm_resp = genai_response(un_paras = un_res, relevant2_paras = web_paras) #pass this row's web text, not the whole paras list; stored in a variable to use in the accuracy score append
  llm_resp.append(t_llm_resp) #how we are making the column

  accuracy_score.append(check_response_accuracy(response = t_llm_resp))


#making the columns
temp["links"] = links
temp["paras"] = paras
temp["un_paras"] = un_paras
temp["llm_resp"] = llm_resp
temp["accuracy_score"] = accuracy_score

#output
#

#data2 = pd.read_csv(temp.to_csv())
#where accuracy is an integer/ percent
temp.to_csv("/content/drive/MyDrive/List_gifts_for UN Output.csv")
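
The note at the top of this cell suggests breaking the data into groups of 4 to avoid the forbidden error. Assuming the error comes from sending too many requests too quickly (we have not confirmed this), the sketch below shows one way to automate that: process the items in batches, pause between batches, and retry a failed call after a wait. The names `batches`, `call_with_retries`, `process_in_batches`, and `fake_llm` are hypothetical, and `fake_llm` is only a stand-in for `genai_response` so the example runs without an API key.

```python
import time

def batches(items, size):
    """Yield consecutive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def call_with_retries(func, arg, attempts=3, wait_seconds=2.0):
    """Call func(arg); after an exception, wait and try again, up to attempts tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func(arg)
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            print(f"Attempt {attempt} failed ({error}); waiting {wait_seconds}s")
            time.sleep(wait_seconds)
            wait_seconds *= 2  # wait longer after each failure

def process_in_batches(items, func, batch_size=4, pause_seconds=5.0, wait_seconds=2.0):
    """Run func on every item, pausing after each batch of batch_size items."""
    results = []
    for batch in batches(items, batch_size):
        for item in batch:
            results.append(call_with_retries(func, item, wait_seconds=wait_seconds))
        time.sleep(pause_seconds)  # give the API a break between batches
    return results

def fake_llm(text):
    # Stand-in for genai_response so the sketch runs without an API key.
    return f"response to {text}"

page_texts = [f"page {n}" for n in range(1, 7)]
# Pauses are set to 0 here so the demo finishes instantly; real runs would use longer waits.
responses = process_in_batches(page_texts, fake_llm, pause_seconds=0, wait_seconds=0)
print(len(responses), "responses")
```

In the loop above, the list of scraped UN texts would take the place of `page_texts` and `genai_response` would take the place of `fake_llm`. The right batch size and pause length depend on the API's limits, which we would need to look up.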

In [None]:
temp = data.copy() #note: this runs the full dataset, using only the UN website text as the source; .copy() keeps the original data unchanged

#declaring lists to put the information in; this way, i can put it in the columns without problems

un_paras = []
llm_resp = []
accuracy_score = []

#coding a for loop that does the above steps
for index, row in temp.iterrows():
  #result = google_search_scrape(search_string = row["Name"]) #only one we don't need to fix since serp_api thing is already coded
  #links.append(result["links"]) #
  #paras.append("\n".join(result["paras"]))
  un_res = scrape(url = row["Link to Museum"]) #FIXED
  un_paras.append(un_res)
  t_llm_resp = genai_response(un_paras = un_res) #storing the response in a variable so we can also use it in the accuracy score append
  llm_resp.append(t_llm_resp) #how we are making the column

  accuracy_score.append(check_response_accuracy(response = t_llm_resp))


#making the columns
# temp["links"] = links
# temp["paras"] = paras
temp["un_paras"] = un_paras
temp["llm_resp"] = llm_resp
temp["accuracy_score"] = accuracy_score

#output
#

#data2 = pd.read_csv(temp.to_csv())
#where accuracy is an integer/ percent
temp.to_csv("/content/drive/MyDrive/List_gifts_for UN Output.csv")