This notebook is used to debug tools designed for the agent.

# DuckDuckGo Maps search:

In [1]:
from duckduckgo_search import DDGS

with DDGS() as ddgs:
    for r in ddgs.maps(keywords=''):
        print(r)

AssertionError: keywords is mandatory
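
The assertion above is triggered by the empty `keywords` string. A minimal sketch of a guard that rejects blank queries before they reach `ddgs.maps`; `require_keywords` is a hypothetical helper, not part of `duckduckgo_search`.

```python
def require_keywords(keywords: str) -> str:
    """Return the stripped query, or raise if it is blank (hypothetical helper)."""
    cleaned = (keywords or "").strip()
    if not cleaned:
        raise ValueError("keywords must be a non-empty string")
    return cleaned
```

It would be used as `ddgs.maps(keywords=require_keywords(query))`, so a blank query fails with a clearer message before any request is made.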

# DuckDuckGo answers search:

Searching this way has a problem: it returns some text, but not the whole description.

In [None]:
from duckduckgo_search import DDGS

with DDGS() as ddgs:
    answers = list(ddgs.answers("Ryvingen"))

answers

[{'icon': None,
  'text': 'Ryvingen Lighthouse A coastal lighthouse located on an island in the municipality of Mandal, Vest-Agder, Norway.',
  'topic': None,
  'url': 'https://duckduckgo.com/Ryvingen_Lighthouse'},
 {'icon': None,
  'text': 'Ryvingen Peak A rock peak 3 nautical miles west-southwest of Brapiggen Peak, on the south side of Borg Massif...',
  'topic': None,
  'url': 'https://duckduckgo.com/Ryvingen_Peak'}]

In [None]:
# Going further: ask a full natural-language question instead of a keyword
from duckduckgo_search import DDGS

with DDGS() as ddgs:
    answers = list(ddgs.answers("What is the main economic activity in bergen?"))

answers

[]
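
`ddgs.answers` returns an empty list for a full question, so a caller needs a fallback. A minimal sketch, assuming each backend is a callable that takes the query and returns an iterable of results; `first_non_empty` is a hypothetical name.

```python
from typing import Callable, Iterable


def first_non_empty(query: str, backends: Iterable[Callable[[str], Iterable]]) -> list:
    """Try each search backend in order and return the first non-empty result list."""
    for backend in backends:
        results = list(backend(query))
        if results:
            return results
    # Every backend came back empty
    return []
```

Inside a `with DDGS() as ddgs:` block the backends could be `[ddgs.answers, ddgs.text]`, the two methods this notebook already uses.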

In [None]:
from serpapi import GoogleSearch

params = {
  "q": "Coffee",
  "location": "Austin, Texas, United States",
  "hl": "en",
  "gl": "us",
  "google_domain": "google.com",
  "api_key": "secret_api_key"
}

search = GoogleSearch(params)
results = search.get_dict()

In [None]:
results

{'error': 'Invalid API key. Your API key should be here: https://serpapi.com/manage-api-key'}
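
The error comes from the placeholder `"secret_api_key"`. A minimal sketch that reads the key from the environment and fails early when it is missing; the variable name `SERPAPI_API_KEY` and the helper `serpapi_params` are assumptions, not something SerpApi requires.

```python
import os


def serpapi_params(query: str, env=os.environ) -> dict:
    """Build the search parameters, taking the API key from the environment."""
    api_key = env.get("SERPAPI_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("Set SERPAPI_API_KEY before running the search")
    return {
        "q": query,
        "location": "Austin, Texas, United States",
        "hl": "en",
        "gl": "us",
        "google_domain": "google.com",
        "api_key": api_key,
    }
```

The result can be passed to `GoogleSearch(...)` exactly as `params` is in the cell above.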

This answer search doesn't seem to work, so we'll use the scraper node together with DuckDuckGo search instead.

We'll search for a question, then scrape the top 25 pages that DuckDuckGo returns and run a model over them (the pipeline below uses a DeBERTa entailment checker rather than a RoBERTa question-answering reader).

In [1]:
from haystack.document_stores import InMemoryDocumentStore
from newspaper3k_haystack import newspaper3k_scraper
from duckduckgo_search import DDGS
from haystack.pipelines import Pipeline
from haystack.nodes import PreProcessor
from haystack.nodes import EmbeddingRetriever
from haystack.nodes import FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack_entailment_checker import EntailmentChecker
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class search_entail_pipe():
    def __init__(self,TOP_LINKS = 25,document_store = None):

        # Number of search results to scrape per question
        self.top_links = TOP_LINKS

        # Optionally add the scraped knowledge to an existing, larger document store
        if document_store is None:
            self.document_store = InMemoryDocumentStore()
        else:
            self.document_store = document_store

        # Indexing pipeline: scrape pages, split them, store the chunks
        self.scraper = newspaper3k_scraper()
        self.processor = PreProcessor(
            clean_empty_lines=False,
            clean_whitespace=False,
            clean_header_footer=False,
            split_by="sentence",
            split_length=30,
            split_respect_sentence_boundary=False,
            split_overlap=0
            )

        self.indexing_pipeline = Pipeline()
        self.indexing_pipeline.add_node(component=self.scraper, name="scraper", inputs=['File'])
        self.indexing_pipeline.add_node(component=self.processor, name="processor", inputs=['scraper'])
        # Read from the processor (not the scraper) so the split documents are what gets stored
        self.indexing_pipeline.add_node(component=self.document_store, name="document_store", inputs=['processor'])

        # Retrieval: embed documents and queries with the same model
        # (the "mps" device is specific to Apple Silicon machines)
        self.retriever = EmbeddingRetriever(
            document_store=self.document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",use_gpu=True,devices=[torch.device("mps")]
        )

        # Model for generating statements to entail (loaded here, not yet used in answer())
        self.t5_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
        self.t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

        # Entailment checker
        self.entailment_checker = EntailmentChecker(
            model_name_or_path = "microsoft/deberta-v2-xlarge-mnli",
            entailment_contradiction_threshold = 0.5,use_gpu=True)

        self.entailment_check_pipeline = Pipeline()
        self.entailment_check_pipeline.add_node(component=self.retriever, name="Retriever", inputs=["Query"])
        self.entailment_check_pipeline.add_node(component=self.entailment_checker, name="EntailmentChecker", inputs=["Retriever"])


    def answer(self,question:str)->dict:
        # Get the links to scrape
        with DDGS() as ddgs:
            results = list(ddgs.text(question, safesearch='Off'))

        # Keep only the first TOP_LINKS results
        links = [r["href"] for r in results][:self.top_links]

        # Use the indexing pipeline to download and store the pages
        self.indexing_pipeline.run_batch(queries = links,
            params={
                "scraper":{
                    "metadata":True,
                }
            })

        # Create embeddings for each document so they can be retrieved semantically later
        self.document_store.update_embeddings(self.retriever)

        # Check entailment of the provided statement against the document store
        outp = self.entailment_check_pipeline.run(query=question)

        # outp is the pipeline's output dictionary, not a plain string
        return outp


In [2]:
SE_pipeline = search_entail_pipe()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
answers = SE_pipeline.answer("bergen is a small town")

Unable to extract text from https://www.cnn.com/travel/america-best-small-towns-cities/index.html
Unable to extract text from https://www.savoredjourneys.com/things-to-do-in-bergen-norway/
Unable to download the article https://www.niche.com/places-to-live/search/best-places-to-live/c/bergen-county-nj/
Unable to download the article https://wikitravel.org/en/Bergen_(Germany)
Unable to download the article https://www.niche.com/places-to-live/search/most-diverse-places/c/bergen-county-nj/

100%|██████████| 25/25 [00:23<00:00,  1.04it/s]
Preprocessing: 100%|██████████| 20/20 [00:00<00:00, 733.73docs/s]
Batches: 100%|██████████| 2/2 [00:09<00:00,  4.68s/it]
Documents Processed: 10000 docs [00:09, 1065.97 docs/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 15.42it/s]


In [8]:
answers

{'documents': [<Document: {'content': "Not to be confused with Bergen, a city in Norway.\n\nBergen is a town in the north of Celle district on the Lüneburg Heath in Lower Saxony, Germany. The infamous Bergen-Belsen concentration camp was located not far from Belsen, one of several farming villages in the borough.\n\nBergen is a former agricultural town, but today is economically heavily dependent on the surrounding military bases and the Bergen-Hohne Training Area to the west, which is the largest military training area in Western Europe.\n\nUnderstand [ edit ]\n\nBergen was first mentioned in 1197 and was a centre of local government, the seat of the sheriff (Amtsvogtei) and, later, the Royal Hanoverian Office. After the Kingdom of Hanover was annexed by Prussia in 1866, Bergen was absorbed into Fallingbostel district. In the reorganisation of 1885, however, Bergen was transferred into the newly formed Celle district.\n\nBergen town hall\n\nBergen's development as a market town was ra
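
The printed dictionary is dominated by the retrieved documents. A sketch for condensing it into a single label, assuming the checker also adds an `aggregate_entailment_info` entry holding `contradiction`, `neutral`, and `entailment` scores. That key is not visible in the truncated output above, so treat it as an assumption and inspect `answers.keys()` first; `entailment_verdict` is a hypothetical helper.

```python
def entailment_verdict(output: dict, threshold: float = 0.5) -> str:
    """Return the highest-scoring label, or 'unknown' if none reaches the threshold."""
    scores = output.get("aggregate_entailment_info") or {}  # assumed key name
    if not scores:
        return "unknown"
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "unknown"
```

Calling `entailment_verdict(answers)` would then give one word to show the agent instead of the full document dump.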

In [None]:
from typing import List, Optional

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.base import BaseComponent

class duckduck(BaseComponent):
    # If it's not a decision component, there is only one outgoing edge
    outgoing_edges = 1

    def __init__(self,document_store=None,save_htmls=None):
        '''
        :param document_store: (None by default) If given, all the scraped documents 
                            will be saved in the given document_store, otherwise in a temporary
                            in-memory document store.
        
        :param save_htmls: (None by default) Path where to store the downloaded article HTML; if None, nothing is saved.
        '''

        if document_store:
            self.document_store = document_store
        else:
            self.document_store = InMemoryDocumentStore()
        

        
    def run(self, query: str, my_arg: Optional[int] = 10):
        # Insert code here to manipulate the input and produce an output dictionary
        ...
        output={
            "documents": ...,
            "_debug": {"anything": "you want"}
        }
        return output, "output_1"

    def run_batch(self, queries: List[str], my_arg: Optional[int] = 10):
        # Insert code here to manipulate the input and produce an output dictionary
        ...
        output={
            "documents": ...,
        }
        return output, "output_1"
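
The `run` method is still a template. Its first step could be turning the search results into a list of URLs, as the `answer` method does with `r["href"]`. A minimal sketch; `top_links` is a hypothetical helper, and the deduplication is an addition that the notebook does not have.

```python
from typing import Dict, List


def top_links(results: List[Dict[str, str]], top_k: int = 10) -> List[str]:
    """Return up to top_k unique URLs from search results, preserving their order."""
    seen = set()
    links: List[str] = []
    for result in results:
        if len(links) >= top_k:
            break
        href = result.get("href")
        # Skip results without a URL and URLs that were already collected
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links
```

The returned list has the same shape as the `links` passed to `indexing_pipeline.run_batch` earlier in the notebook.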

In [3]:
import wikipediaapi

In [5]:
wiki_wiki = wikipediaapi.Wikipedia('MyProjectName (merlin@example.com)', 'en')

page_py = wiki_wiki.page('Python_(programming_language)')

In [10]:
page_py.summary

'Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.Python consistently ranks as one of the most popular programming languages.'
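
The summary above loses the space where paragraphs were joined ("indentation.Python", "Python 2.Python"). A small sketch for repairing that before the text is passed on; `fix_paragraph_joins` is a hypothetical helper, and the regex is a heuristic that could also split strings such as "v1.X".

```python
import re


def fix_paragraph_joins(text: str) -> str:
    """Insert a space after a period that is directly followed by a capital letter."""
    # The lookbehind needs a lowercase letter or digit before the period,
    # so version numbers such as "2.7.18" are left alone.
    return re.sub(r"(?<=[a-z0-9]\.)(?=[A-Z])", " ", text)
```

It would be applied as `fix_paragraph_joins(page_py.summary)`.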