Assuming we have a db of documents (lots of scraped articles) ready to extract from.
We also assume we have found a list of places we are interested in knowing about.

In [1]:
import pickle
with open('initial_places.pkl', 'rb') as f:
    initial_places = pickle.load(f)

To answer many questions we'll make use of the previously scraped articles. Furthermore new information searched in the internet will be added here.

In [2]:
from qdrant_haystack import QdrantDocumentStore

document_store = QdrantDocumentStore(
    path="qdrant",
    index="Document",
    embedding_dim=768,
    recreate_index=False,
)

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
from src.search_scrape_PIPE import search_scrape_pipeline
search_scrape_pipe = search_scrape_pipeline(document_store)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
from duckduckgo_search import DDGS

def internet_search_tool(search:str,n_results=3)->bool:
    '''
    This function makes use of the chatgpt api and google search library to, 
    given some place in the form of a string and a country name determine if 
    that place is in the named country or even a place at all.
    For that we will first inject to gpt a summary of the top 15 results found.

    :param place: string of the place to search
    :param place: string of the country to search
    :param n_results: number of page summaries to show to gpt     
    '''

    #first of all we'll perform a search and add the results.
    result_links = search_scrape_pipe.run(search,topk=20)


    results = f"Search: {search}\nResults:\n"
    #now we'll look for our results in the doc_store using the returned links
    cnt = 0
    for doc in document_store.get_all_documents_generator():
        if doc.meta["url"] in result_links:
            results += f"({cnt+1})\n " + doc.meta["summary"] + "\n"
            cnt += 1

            #delete the link from result_links as one summary can appear in multiple documents
            #if a page has been chopped into chunks
            result_links.pop(result_links.index(doc.meta["url"]))
        
        if cnt == n_results:
            break
    
    return results


In [5]:
internet_search_tool("trolltunga, norway")

Already in DB: https://trolltunga.com/
Already in DB: https://en.wikipedia.org/wiki/Trolltunga
Already in DB: https://www.tripadvisor.com/Attraction_Review-g1096319-d3522548-Reviews-Trolltunga-Odda_Hardanger_Hordaland_Western_Norway.html
Already in DB: https://trolltunga.com/plan-your-trip/the-hike-to-trolltunga/
Already in DB: https://thenorwayguide.com/trolltunga/
Already in DB: https://www.fjordnorway.com/en/see-and-do/trolltunga
Already in DB: https://norwegianroutes.com/trolltunga/
Already in DB: https://www.trolltunganorway.com/en_GB
Already in DB: https://www.lonelyplanet.com/norway/bergen-and-the-western-fjords/odda/attractions/trolltunga/a/poi-sig/1416541/360173
Already in DB: https://viatravelers.com/trolltunga-norway/


 30%|███       | 3/10 [00:02<00:06,  1.03it/s]

Unable to extract text from https://www.earthtrekkers.com/trolltunga/


 60%|██████    | 6/10 [00:03<00:02,  1.84it/s]

Unable to extract text from https://www.lifeinnorway.net/hiking-trolltunga-beginners/


 80%|████████  | 8/10 [00:05<00:01,  1.60it/s]

Unable to extract text from https://www.youtube.com/watch?v=Z4CrrkQ9ZOA


 90%|█████████ | 9/10 [00:05<00:00,  1.95it/s]

Unable to download the article https://www.alltrails.com/trail/norway/vestland/trolltunga-t-merket


100%|██████████| 10/10 [00:05<00:00,  1.76it/s]


Unable to download the article https://www.alltrails.com/trail/norway/vestland/skjeggedal-trolltunga


Preprocessing: 100%|██████████| 5/5 [00:00<00:00, 364.27docs/s]
Extracting entities: 100%|██████████| 1/1 [00:01<00:00,  1.63s/it]
10000it [00:00, 384234.52it/s]        


'Search: trolltunga, norway\nResults:\n(1)\n Trolltunga - all you need to know about hiking the "Troll\'s tongue"Trolltunga (Troll\'s tongue) is a rock formation and a popular hike in Norway.\nTucked away between fjords, villages and towns, you will find some of the most beautiful and breathtaking rock formations in Norway, like Trolltunga (Troll\'s tongue).\nIt is here that you start the hike up to Trolltunga rock, one of the most spectacular rock formations in Norway.\nIf you are travelling from Bergen to Odda for the Trolltunga hike, you can take bus number 930 from Bergen Bus Station (bay O).\nRead more about the hike "Trolltunga guided overnight hike" here.\n(2)\n Here’s some fun facts about this small Norwegian fjord town.\nFruit farms and orchards dot the fjord road, from where apples and cherries can be purchased.\nThe famous Trolltunga.\nOdda was built on heavy industryDespite the tourism interest, the economy of Odda has always been based on heavy industry.\nOdda’s valley is 

# AGENT

We'll use langchain's agent instead of haystack because of the convenience of many tools.

In [6]:
from haystack.nodes import EmbeddingRetriever
import torch

retriever = EmbeddingRetriever(
            document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",use_gpu=True,devices=[torch.device("mps")]
        )

  return self.fget.__get__(instance, owner)()
You seem to be using sentence-transformers/multi-qa-mpnet-base-dot-v1 model with the cosine function instead of the recommended dot_product. This can be set when initializing the DocumentStore


## Step 1

Filtering actual real places from the input feed and creating a brief description.

In [7]:
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.llms import OpenAI
from langchain.agents.tools import Tool

In [93]:
#llm initialisation
llm = OpenAI(temperature=0,openai_api_key="")

In [9]:
#wikipedia search tool
from langchain.tools import WikipediaQueryRun
from langchain.utilities import WikipediaAPIWrapper

wikipedia = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())

In [10]:
tools_step1 = [
    Tool(
        name = "Wikipedia search",
        func = wikipedia.run,
        description= "Search and access information from Wikipedia. Should be the first source to check. "
    ),
    Tool(
        name = "Broad knowledge search",
        func = internet_search_tool,
        description = "Returns three search results based on your input text, good for broad information."
    ),
   
]

In [11]:
agent = initialize_agent(tools_step1, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

In [12]:
place = "hike"
prompt = f"""
Words like (Library, School, Coliseum, Hospital, Fjord) without any context represent the idea of these places rather than a concrete location in the world.
In contrast words like (Barcelona, Rome, Philipines), even without any context represent a concrete place on earth, somewhere unique I can pinpoint with coordinates.

Does the word {place} fall on the first category or second? Answer only A or B. Please answer only with one letter.

If you don't know the word search for it using the provided tools.
"""
#agent.run(prompt)


# Step 1 v2

If we cannot find coordinates for the place skip it. Anyhow beforehand common wods have to be filtered.

In [116]:
geolocator.geocode?

[0;31mSignature:[0m
[0mgeolocator[0m[0;34m.[0m[0mgeocode[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mquery[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0;34m*[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mexactly_one[0m[0;34m=[0m[0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mtimeout[0m[0;34m=[0m[0mDEFAULT_SENTINEL[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlimit[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0maddressdetails[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlanguage[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mgeometry[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mextratags[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mcountry_codes[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mviewbox[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mbounded[0m[0;34m=[0m[0;

In [117]:
import pandas as pd
from tqdm import tqdm
from geopy.geocoders import Nominatim
class coordinates_search():
    def __init__(self,df_pth:str):
        features = {
            'geonameid': 'integer id of record in geonames database',
            'name': 'name of geographical point (utf8) varchar(200)',
            'asciiname': 'name of geographical point in plain ascii characters, varchar(200)',
            'alternatenames': 'alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)',
            'latitude': 'latitude in decimal degrees (wgs84)',
            'longitude': 'longitude in decimal degrees (wgs84)',
            'feature class': 'see http://www.geonames.org/export/codes.html, char(1)',
            'feature code': 'see http://www.geonames.org/export/codes.html, varchar(10)',
            'country code': 'ISO-3166 2-letter country code, 2 characters',
            'cc2': 'alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters',
            'admin1 code': 'fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)',
            'admin2 code': 'code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)',
            'admin3 code': 'code for third level administrative division, varchar(20)',
            'admin4 code': 'code for fourth level administrative division, varchar(20)',
            'population': 'bigint (8 byte int)',
            'elevation': 'in meters, integer',
            'dem': 'digital elevation model, srtm3 or gtopo30, average elevation of 3\'\'x3\'\' (ca 90mx90m) or 30\'\'x30\'\' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.',
            'timezone': 'the iana timezone id (see file timeZone.txt) varchar(40)',
            'modification date': 'date of last modification in yyyy-MM-dd format'
        }
        self.df = pd.read_csv('NO.txt', sep="\t", header=None,names=features.keys())
        self.geolocator = Nominatim(user_agent="tourist map app")
    
    def run(self,query,country):
        #option1 search in name column
        if len(self.df[self.df["name"]==query])  > 0:
            coords = self.df[self.df["name"]==query].iloc[0][["latitude","longitude"]].to_list()
            
            #make sure
            for i in range(2):
                coords[i] = float(coords[i])
            
            return coords
        
        #option 2 use geopy
        location = self.geolocator.geocode(query,country_codes=[country])
        if location != None:
            return [location.latitude,location.longitude]
        
        #option 3 check alternate names
        for i, row in tqdm(self.df.iterrows(),desc="Looking at alternatenames",total=len(self.df)):
            if row["alternatenames"] is str:
                if query.lower() in row["alternatenames"].lower().split(" "):
                    coords = row[["latitude","longitude"]].to_list()
                    
                    #make sure
                    for i in range(2):
                        coords[i] = float(coords[i])
                    
                    return coords

        print("Unable to find coordinates for this place")
        return None

In [118]:
cordi_searchy_tool = coordinates_search("NO.txt")

  self.df = pd.read_csv('NO.txt', sep="\t", header=None,names=features.keys())


In [130]:
place = "Månafossen"

In [131]:
from langchain import PromptTemplate, OpenAI, LLMChain

prompt_template = "Does the word {place} refer to a general concept? Answer only Yes or No"

llm_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template(prompt_template)
)
input_list = [
    {"place": place},
]
result = llm_chain.apply(input_list)[0]["text"]


In [132]:
result

'\n\nNo'

In [133]:
if "no" in result.lower():
    cords = cordi_searchy_tool.run(place,"NO")
    if cords is None:
        print(place)
    else:
        print(cords)

[58.85767, 6.38368]


: 

## Step 2

We'll use the brief description to build on top of it a slightly longer one comprising more touristically relevant information.

- main attractions
- touristic landmarks if the place isn't one itself
- things to do on vacation there


We'll add a new tool for searching the actual content of the scraped webpages more precisely instead of just summaries.

In [13]:
#first the question answerer that uses a combination of qa pipeline 
# with entailment checker and t5 to generate possible answers to a question 
# and their entailment to the knowledge base.
from src.question_search_answer_entail_TOOL import question_search_answer_entail
QSAE_tool = question_search_answer_entail(document_store,retriever=retriever)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
#QSAE_tool.run("what are the main atractions in bergen?")

In [15]:
from src.simple_search_TOOL import simple_search
SS = simple_search(document_store,retriever)

In [16]:
#SS.run("places to eat in bergen",filter="Bergen",topk=2)

In [17]:
#because this is a multi input tool
from langchain.tools import StructuredTool

In [18]:
tools_step2 = [
    Tool(
        name = "Broad knowledge search",
        func = internet_search_tool,
        description = "Returns three search results based on your input text, good for broad information."
    ),

    Tool(
        name =  "Question answerer",
        func = QSAE_tool.run,
        description= "Useful to find new information. Further evaluation of the answers and fact checking has to be performed using other tools. This tool should be the last resources"
    ),

    StructuredTool.from_function(SS.run,
    name="Semantic search",
    description="Useful to search for concrete information. The filter argument should always be the place being searched for (properly capitalize the name)."
    )
   
]

In [19]:
agent = initialize_agent(tools_step2, llm, agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

In [20]:
prompt_step2 = f"""
Given the following brief description about {place} elaborate a larger description clearly portraying in different sections: main attractions, touristic landmarks(if the place isn't one itself) and things to do as a tourist on vacation there. 
Should serve as a condensed touristic guide.

Initial description:
Bergen is a city and municipality in Vestland county on the west coast of Norway. 
It is known for its port, Bryggen, which is a World Heritage Site, and for its mild winter climate. 
It is also home to the Bergen School of Meteorology, the Norwegian School of Economics, and the University of Bergen.
"""

In [21]:
#agent.run(prompt_step2)

# Step 3:
Finally we'll try to answer some frequently asked questions about the place of interest.

- How is the climate?
- What is the best time of the year to visit?
- How is public transport there?
- What are some off-the-beaten-path attractions to explore?
- Is it safe to travel here?
- Where can I find more information or maps of the area?

In [22]:
faqs = [
    f"How is the climate in {place}?",
    f"What is the best time of the year to visit {place}",
    f"How good is public transport in {place}",
    f"What are some off-the-beaten-path attractions to explore in {place}?",
    f"Is it safe to travel to {place}? Any safety concerns/precautions?",
    f"Where can I find more information or maps of {place}?"
]


In [25]:
#FAQ = {}
#for question in faqs:
#    FAQ[question] = agent.run(question)

# Whole loop:

We'll create a new document store for the information summarized.

In [None]:
#step 1 filter out words 
prompt_template = "Does the word {place} refer to a general concept? Answer only Yes or No"

llm_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template(prompt_template)
)
input_list = [
    {"place": place},
]
result = llm_chain.apply(input_list)[0]["text"]

if "no" in result.lower():
    cords = cordi_searchy_tool.run(place,"NO")
    if cords is not None:
        print(cords)

#step2 create description
agent = initialize_agent(tools_step2, llm, agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
prompt_step2 = f"""
Elaborate a condensed touristic guide about {place} clearly portraying in different sections: main attractions, touristic landmarks(if the place isn't one itself) and things to do as a tourist on vacation there.
Do not talk about other places, search and add as many information as possible.
"""
description = agent.run(prompt_step2)

summary = description + "\n\nFrequently asked questions:\n"

In [23]:
SUMMARIES = {}
for place in initial_places:
    print(f"starting to analyze {place}")
    #step1 filter
    
    summary = ""
    #step3 FAQ
    faqs = [
    f"What is the weather like in {place}?",
    f"What is the best time of the year to visit {place}?",
    f"How good is public transport in {place}?",
    f"What are some off-the-beaten-path attractions to explore in {place}?",
    f"Is it safe to travel to {place}? Any safety concerns/precautions?",
    ]

    for question in faqs:
        #having some error:
        #ValidationError: 2 validation errors for Semantic searchSchemaSchema
        #query
        #field required (type=value_error.missing)
        #filter
        #field required (type=value_error.missing)

        try:
            summary += question + " " + agent.run(f"Answer the following question: {question} The answer should be very consice.") + "\n\n"
        except:
            pass
    SUMMARIES[place] = summary
    break

starting to analyze Månafossen


[1m> Entering new AgentExecutor chain...[0m


Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..




[1m> Entering new AgentExecutor chain...[0m


Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..


In [38]:
print(SUMMARIES["Månafossen"])

How is the climate in Månafossen? Månafossen is a stunning waterfall located in the Frafjordheiene landscape conservation area in Rogaland, Norway. It is the highest waterfall in the region, with a free fall of 92 metres. It is accessible via a short but steep hike from the parking lot at Eikeskog. The hike takes about an hour round trip and offers several lookout points over the canyon. It is possible to spend the night at Friluftsgarden Mån, but you must book the accommodation first. There is also a good chance of catching a mountain trout if you have a fishing rod with you.

What is the best time of the year to visit Månafossen? The best time to visit Månafossen is during the summer months, when the weather is warm and the trails are open. The hike to the waterfall is short but steep, and it is recommended to wear good footwear. There are also camping and fishing opportunities in the area, as well as a restored mountain farm and an exhibition about the area's history. 

How good is 

In [26]:
print(summary)

Månafossen is a beautiful waterfall located near Eikeskog in the region of Rogaland. It is the tallest waterfall in the county and has a 90 meter free fall. There are several attractions and activities to do in the area, such as a guided hike to the waterfall, a camping spot with a fireplace, and a view of the canyon below. Additionally, visitors can combine a visit to Månafossen with a visit to the nearby Lysefjorden and Preikestolen. To get to Månafossen, take the exit to Frafjord from road R45 and follow the road (fv281) to Eikeskog.

Frequently asked questions:
How is the climate in Månafossen? The climate in Månafossen is generally cool and wet, with temperatures ranging from 0°C to 20°C. The area is known for its lush forests and stunning views of the waterfall. It is a great place to go for a hike and explore the area's natural beauty.
What is the best time of the year to visit Månafossen The best time to visit Månafossen is during the summer months, when the weather is warmer a

In [67]:
coordinates_searhc = coordinates_search_tool("NO.txt")

  self.df = pd.read_csv('NO.txt', sep="\t", header=None,names=features.keys())


In [82]:
location = coordinates_searhc.run("Facebook","norway")

Looking at alternatenames: 100%|██████████| 607428/607428 [00:10<00:00, 55521.38it/s]


In [84]:
location

In [129]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="user_agent")
location = geolocator.geocode("Preikestolen",country_codes=["no"])
print(location)

Preikestolen, Osterøy, Vestland, 5284, Norge


In [36]:
X.iloc[1]["alternatenames"].split(" ")

['Ozero',
 'Bukhtles-Vandet,Ozero',
 'Vouvatusyarvi,Ozero',
 'Vouvatusjarvi,Ozero',
 'Vouvatus”yarvi,Vaaggtemjaeuʹrr,Vagatamjavri,Vagatamjávri,Vaggatem,Vaggetem,Vaggetem',
 'Vandet,Vaggetemjavrre,Vaggetemvatn,Vouvatus,Vouvatusjaervi,Vouvatusjärvi,Vââggtemjäuʹrr']