# Scraping

In [72]:
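from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        try:
            page = await browser.new_page()

            # Load the thread and wait until network activity settles.
            await page.goto("https://www.reddit.com/r/learnpython/comments/78qnze/web_scraping_in_20_lines_of_code_with/")
            await page.wait_for_load_state("networkidle")

            # Dump the rendered HTML.
            content = await page.content()
            print(content)
        finally:
            # Close the browser even if navigation fails.
            await browser.close()

# Top-level await works in Jupyter/IPython; in a plain script use asyncio.run(run()).
await run()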

<!DOCTYPE html><html><head>
    <title>Blocked</title>
    <style>
      body {
          font: small verdana, arial, helvetica, sans-serif;
          width: 600px;
          margin: 0 auto;
      }

      h1 {
          height: 40px;
          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;
      }
    </style>
  </head>
  <body>
    <h1>whoa there, pardner!</h1>

<p>Your request has been blocked due to a network policy.</p>

<p>Try logging in or creating an account <a href="https://www.reddit.com/login/">here</a> to get back to browsing.</p>

<p>If you're running a script or application, please register or sign in with your developer credentials <a href="https://www.reddit.com/wiki/api/">here</a>. Additionally make sure your User-Agent is not empty and is something unique and descriptive and try again. if you're supplying an alternate User-Agent string,
try changing back to default as that can sometimes result in a block.</p>

<p>
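
The block page above asks for a User-Agent that is "unique and descriptive". As a minimal sketch, the helpers below build such a string and detect the block page by its `<title>`. The names `build_user_agent` and `is_blocked`, the app name, and the contact address are all hypothetical and not part of the original notebook. Treating a page titled "Blocked" as the block page is an assumption based only on the output shown here.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_user_agent(app_name, version, contact):
    """Hypothetical helper: format a descriptive User-Agent string."""
    return f"{app_name}/{version} (contact: {contact})"


def is_blocked(html):
    """Heuristic (assumption): the block page is identified by its 'Blocked' title."""
    parser = _TitleParser()
    parser.feed(html)
    return (parser.title or "").strip().lower() == "blocked"


# Shortened copy of the response head printed above.
blocked_html = "<!DOCTYPE html><html><head>\n    <title>Blocked</title></head><body></body></html>"
user_agent = build_user_agent("notebook-scraper", "0.1", "me@example.com")
print(user_agent)
print(is_blocked(blocked_html))
```

With Playwright, the resulting string could be supplied through `browser.new_context(user_agent=user_agent)` before opening a page. Whether that alone lifts the block is not something this notebook verifies.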

# Embedding

In [80]:
from langchain_community.document_loaders import PlaywrightURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunk sizes are measured in characters (the default length function is len).
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

urls = [
    "https://www.lespotevry.fr/",
]

selectors_to_ignore = [
    'iframe', 
    '.ad', '.ads', '.advertisement', '.banner',
    'nav', 'footer', 'header',
    '.navbar', '.menu', '.nav', '.footer', '.header',
    'form', 'button', 'input', 'select', 'textarea',
    '.form', '.button', '.btn',
    '.widget', '.social', '.share', '.tweet', '.like',
    'script', 'noscript', 'style', 'link',
    '.comments', '.comment', '.reply', '.discussion',
    '.forum', '.thread'
]

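# Render each URL in headless Chromium and drop the elements matching the selectors above.
loader = PlaywrightURLLoader(urls=urls, remove_selectors=selectors_to_ignore, headless=True)

# aload() is the async variant, which suits Jupyter's already-running event loop.
data = await loader.aload()
print(data[0].page_content)

# Split the loaded page into overlapping chunks for embedding.
documents = text_splitter.split_documents(data)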


Aujourd'hui, votre centre est ouvert jusqu'à 19:00

Aujourd'hui votre hypermarché est ouvert jusqu'à 19:00

Vos restaurants sont ouverts jusqu'à 23:00

Le Spot

c'est quoi ?

Le Spot, c’est votre nouveau terrain de jeu shopping, food, loisirs et culture au cœur d’Evry-Courcouronnes. 126 000 m² réunissant pour la 1ère fois en France, des commerces, des restaurants, un cinéma, une médiathèque, un théâtre, une salle de spectacle, une piscine et une patinoire.  Le Spot, un hyper lieu vivant, animé, végétalisé et ouvert sur son environnement. Pour shopper, rire, se régaler, se balader, chiller ou se dépenser, Le Spot, c’est une nouvelle destination à découvrir en famille, entre les amis, ou seul, tous les jours même le dimanche et en soirée.

Je viens au spot

En ce moment dans votre centre

Les dernières news !

Ça se passe en ce moment

            Soldes d'été du 26 juin au 23 juillet

Ça se passe en ce moment

            Vivez l'Euro de foot 2024 au Spot !

Ça se passe en ce moment
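
The splitter above is configured with `chunk_size=1000` and `chunk_overlap=200`. To illustrate what those two numbers mean, here is a simplified fixed-window sketch. It is not the `RecursiveCharacterTextSplitter` algorithm, which also tries to break on separators such as paragraph and line breaks. The function name `sliding_chunks` and the synthetic input are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 2500-character input, used only to show the window arithmetic.
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence cut at a chunk boundary still appear whole in one of the two chunks.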

  

In [77]:
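from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
embedding_model = OpenAIEmbeddings()

# Embed the chunks and store them on disk under ./chroma.
db = Chroma.from_documents(documents, embedding_model, persist_directory="./chroma")
# Recent Chroma versions persist automatically, so this call may be redundant or deprecated.
db.persist()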
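
To give a rough idea of what a vector store does at query time, the sketch below ranks chunks by similarity to a query. It makes several assumptions: a bag-of-words count vector stands in for the OpenAI embedding model, cosine similarity is used for illustration (the metric Chroma actually applies depends on its configuration), and the sample chunks are illustrative English strings. The names `toy_embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Illustrative sample chunks, not the actual stored documents.
chunks = [
    "summer sales from 26 june to 23 july",
    "restaurants are open until 23:00",
    "the centre is open until 19:00",
]
print(top_k("when are the restaurants open", chunks, k=1))
```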

In [78]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


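# Retrieve the most similar chunks from the Chroma store (4 by default).
retriever = db.as_retriever()
# Pull a community RAG prompt from the LangChain Hub.
prompt = hub.pull("rlm/rag-prompt")

# The question is sent to the retriever to build the context,
# and is also passed through unchanged to the prompt.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)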

response = rag_chain.invoke("What is the purpose of the page")

print(response)

The purpose of the page is to inform users about current events, promotions, and activities happening at the location. It also provides links to sign up for newsletters and follow on social media platforms for updates. Additionally, it offers a way to contact customer support for assistance or to report any issues.
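
The chain above first builds a dict with `context` and `question`, then fills a prompt template with it. The sketch below reproduces only that assembly step in plain Python, without LangChain or an LLM call. The `Doc` class, the `TEMPLATE` string, and the sample documents are hypothetical stand-ins; in particular, the template is not the actual text of `rlm/rag-prompt`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only page_content is modeled)."""
    page_content: str


def format_docs(docs):
    # Same helper as in the notebook: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical template; the real text of "rlm/rag-prompt" is not reproduced here.
TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {question}"


def build_prompt(question, retrieved_docs):
    # Mirrors the dict step of the chain: the context comes from the retrieved
    # documents, and the question is passed through unchanged.
    inputs = {"context": format_docs(retrieved_docs), "question": question}
    return TEMPLATE.format(**inputs)


# Illustrative documents, standing in for what the retriever would return.
docs = [Doc("Le Spot is a shopping centre."), Doc("Restaurants are open until 23:00.")]
prompt_text = build_prompt("What is the purpose of the page", docs)
print(prompt_text)
```

The model only sees the retrieved chunks, so an answer like the one above reflects whichever chunks ranked highest for the question rather than the whole page.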
