<figure>
    <img src="../../../figures/logo_ap.png"  width="80" height="80" align="left"/>
</figure>

# <span style="color:blue"><br><br><br><br><center>&nbsp;&nbsp;&nbsp;Aprendizaje Profundo</center></span>

# <span style="color:red"><center>Biblioteca Alejandría</center></span>

## <span style="color:green"><center>Web Scraping of Academic Sources</center></span>

## <span style="color:blue">Authors</span>

1. Álvaro Montenegro, alvaro.montenegro@aprendizajeprofundo.ai
1. Daniel Montenegro, daniel.montenegro@aprendizajeprofundo.ai

## <span style="color:blue" id="Contenido">Contents</span>

* [Required Libraries](#Librerías-Necesarias)
* [ArXiv](#ArXiv)
    * [Querying ArXiv](#Realizar-Consultas-en-ArXiv)
    * [Extracting Metadata from ArXiv](#Extraer-Metadata-en-ArXiv)
    * [Converting the ArXiv Results to a DataFrame](#Convertir-en-DataFrame-en-ArXiv)
* [Towards Data Science](#Towards-Data-Science)
    * [Querying Towards Data Science](#Realizar-Consultas-en-Towards-Data-Science)
* [SerpApi](#SerpApi)
    * [Querying SerpApi](#Realizar-Consultas-en-SerpApi)
* [Automation](#Automatización)
* [Using the Automation](#Uso-de-la-Automatización)
* [Conclusions](#Conclusiones)
* [Recommendations](#Recomendaciones)

## <span style="color:blue" id="Librerías-Necesarias">Required Libraries</span>

In [1]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from serpapi import GoogleSearch

[[Back]](#Contenido)

## <span style="color:blue">ArXiv</span>

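This section queries the arXiv API for papers matching a search term, extracts the metadata of each result, and loads it into a pandas DataFrame.

### <span style="color:#4CC9F0" id="Realizar-Consultas-en-ArXiv">Querying ArXiv</span>

The query is a plain HTTP GET request to the arXiv API endpoint `http://export.arxiv.org/api/query`. The URL carries the search field and term (`search_query`), the pagination window (`start` and `max_results`), and the sort criteria (`sortBy` and `sortOrder`). The response is an Atom XML feed.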

In [None]:
import requests
from typing import Annotated

def fetch_html(
    url: Annotated[str, "The URL to request"]
) -> str:
    """
    Send an HTTP GET request with requests and return the response body as text.

    Raises requests.RequestException if the request fails or the server
    returns an error status, so an error message is never parsed as content.
    """
    # Send the request; the timeout keeps the call from hanging indefinitely
    response = requests.get(url, timeout=30)
    # Turn 4xx and 5xx responses into exceptions
    response.raise_for_status()
    # Return the decoded body
    return response.text

type_query = "all"  # field to search: all, ti (title), au (author), abs (abstract), cat (category)
query = "RAG"
start = 0
max_results = 100
sortby = "submittedDate"
sortorder = "descending"
url = f'http://export.arxiv.org/api/query?search_query={type_query}:{query}&start={start}&max_results={max_results}&sortBy={sortby}&sortOrder={sortorder}'

print("Constructed URL:", url)
html_content = fetch_html(url=url)  # despite the name, this holds the Atom XML feed
#print(html_content)

Constructed URL: http://export.arxiv.org/api/query?search_query=all:RAG&start=0&max_results=100&sortBy=submittedDate&sortOrder=descending
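
A search term with spaces or special characters has to be percent-encoded before it goes into the URL. The following is a minimal sketch of one way to do that with the standard library's `urllib.parse.urlencode`. The helper name `build_arxiv_url` and its defaults are illustrative assumptions that mirror the variables of the cell above; they are not part of the notebook or of an official client.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # endpoint used in the cell above


def build_arxiv_url(query, type_query="all", start=0, max_results=100,
                    sortby="submittedDate", sortorder="descending"):
    """Build an arXiv API URL with percent-encoded parameters.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    params = {
        "search_query": f"{type_query}:{query}",
        "start": start,
        "max_results": max_results,
        "sortBy": sortby,
        "sortOrder": sortorder,
    }
    # safe=":" keeps the "field:term" separator readable; spaces become "+"
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"


print(build_arxiv_url("RAG"))
print(build_arxiv_url("retrieval augmented generation", max_results=10))
```

For the single-word query `RAG`, the sketch produces the same URL as the f-string in the cell above; the difference only shows up for multi-word or special-character queries.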


[[Back]](#Contenido)

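### <span style="color:#4CC9F0" id="Extraer-Metadata-en-ArXiv">Extracting Metadata from ArXiv</span>

Each result in the feed is an `entry` element. The loop below reads the primary category, the publication and update timestamps, the title, the summary, the author names, and the PDF link of each entry, and collects them as a list of dictionaries.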

In [3]:
# The response is XML, so parse it with the lxml XML parser
soup = BeautifulSoup(html_content, "lxml-xml")
# Each <entry> element describes one article
articles = soup.find_all("entry")

data_json = []
for entry in articles:
    prim_category = entry.find("primary_category").get("term")
    published = entry.find("published").text
    updated = entry.find("updated").text
    title = entry.find("title").text
    summary = entry.find("summary").text
    # An entry has one <author> element per author
    authors = entry.find_all("author")
    authors = [auth.find("name").text for auth in authors]
    # The link whose title attribute is "pdf" points to the PDF version
    link_article = entry.select('link[title="pdf"]')[0].get("href")
    data_json.append({"primary_category": prim_category,
                      "published": published,
                      "updated": updated,
                      "title": title,
                      "summary": summary,
                      "authors": authors,
                      "link_article": link_article})

[[Back]](#Contenido)
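
To see which parts of the feed the loop in the cell above reads, the sketch below parses a tiny hand-written Atom entry with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The sample XML is invented for illustration (its title, author names, and identifier are placeholders, not a real article), and `parse_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the arXiv Atom feed
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Invented sample with the same shape as one feed entry (placeholder values)
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <published>2024-01-01T00:00:00Z</published>
    <updated>2024-01-02T00:00:00Z</updated>
    <title>Placeholder title</title>
    <summary>Placeholder summary.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.IR" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>
""".strip()


def parse_entries(xml_text):
    """Return one dict per <entry>, using the same keys as the notebook loop."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NS):
        # The PDF link is the <link> element whose title attribute is "pdf"
        pdf_link = entry.find("atom:link[@title='pdf']", NS)
        rows.append({
            "primary_category": entry.find("arxiv:primary_category", NS).get("term"),
            "published": entry.findtext("atom:published", namespaces=NS),
            "updated": entry.findtext("atom:updated", namespaces=NS),
            "title": entry.findtext("atom:title", namespaces=NS),
            "summary": entry.findtext("atom:summary", namespaces=NS),
            "authors": [a.findtext("atom:name", namespaces=NS)
                        for a in entry.findall("atom:author", NS)],
            "link_article": pdf_link.get("href") if pdf_link is not None else None,
        })
    return rows


for row in parse_entries(SAMPLE_FEED):
    print(row)
```

Unlike the notebook loop, this sketch stores `None` when an entry has no PDF link instead of raising an `IndexError`.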

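### <span style="color:#4CC9F0" id="Convertir-en-DataFrame-en-ArXiv">Converting the ArXiv Results to a DataFrame</span>

The list of dictionaries is loaded into a pandas DataFrame. The `published` and `updated` columns are converted to datetimes, and the `id` and `version` columns are derived from the last part of the PDF link.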

In [4]:
data_df = pd.DataFrame(data_json)

# Parse the timestamp strings into timezone-aware datetimes
columns_to_convert = ['published', 'updated']
data_df[columns_to_convert] = data_df[columns_to_convert].apply(pd.to_datetime)

# The PDF link ends in "<YYMM>.<number>v<version>", e.g. ".../2411.19804v1"
suffix = data_df["link_article"].str.split(".").str[-1]
# id: the article number with the "v<version>" suffix removed.
# It is unique only within a month, because the YYMM prefix is dropped.
data_df.insert(0, "id", suffix.str.replace(r"v\d+$", "", regex=True).astype("int"))
# version: the digits after the final "v"
data_df.insert(1, "version", suffix.str.split("v").str[-1])
data_df.info()
data_df.sort_values(by="published", ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   id                100 non-null    int64              
 1   version           100 non-null    object             
 2   primary_category  100 non-null    object             
 3   published         100 non-null    datetime64[ns, UTC]
 4   updated           100 non-null    datetime64[ns, UTC]
 5   title             100 non-null    object             
 6   summary           100 non-null    object             
 7   authors           100 non-null    object             
 8   link_article      100 non-null    object             
dtypes: datetime64[ns, UTC](2), int64(1), object(6)
memory usage: 7.2+ KB


| | id | version | primary_category | published | updated | title | summary | authors | link_article |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19804 | 1 | cs.SE | 2024-11-29 16:09:43+00:00 | 2024-11-29 16:09:43+00:00 | Advanced System Integration: Analyzing OpenAPI... | Integrating multiple (sub-)systems is essent... | [Robin D. Pesl, Jerin G. Mathew, Massimo Mecel... | http://arxiv.org/pdf/2411.19804v1 |
| 1 | 19713 | 1 | cs.NE | 2024-11-29 14:01:34+00:00 | 2024-11-29 14:01:34+00:00 | CantorNet: A Sandbox for Testing Topological a... | Many natural phenomena are characterized by ... | [Michal Lewandowski, Hamid Eghbalzadeh, Bernha... | http://arxiv.org/pdf/2411.19713v1 |
| 2 | 19710 | 1 | cs.IR | 2024-11-29 13:57:07+00:00 | 2024-11-29 13:57:07+00:00 | Know Your RAG: Dataset Taxonomy and Generation... | Retrieval Augmented Generation (RAG) systems... | [Rafael Teixeira de Lima, Shubham Gupta, Cesar... | http://arxiv.org/pdf/2411.19710v1 |
| 3 | 19554 | 1 | cs.HC | 2024-11-29 09:07:21+00:00 | 2024-11-29 09:07:21+00:00 | Unimib Assistant: designing a student-friendly... | Natural language processing skills of Large ... | [Chiara Antico, Stefano Giordano, Cansu Koyutu... | http://arxiv.org/pdf/2411.19554v1 |
| 4 | 19539 | 1 | cs.AI | 2024-11-29 08:34:07+00:00 | 2024-11-29 08:34:07+00:00 | Knowledge Management for Automobile Failure An... | This paper presents a knowledge management s... | [Yuta Ojima, Hiroki Sakaji, Tadashi Nakamura, ... | http://arxiv.org/pdf/2411.19539v1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 2959 | 1 | cs.IR | 2024-11-05 09:58:36+00:00 | 2024-11-05 09:58:36+00:00 | HtmlRAG: HTML is Better Than Plain Text for Mo... | Retrieval-Augmented Generation (RAG) has bee... | [Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang... | http://arxiv.org/pdf/2411.02959v1 |
| 96 | 2850 | 1 | cs.CY | 2024-11-05 06:44:15+00:00 | 2024-11-05 06:44:15+00:00 | WASHtsApp -- A RAG-powered WhatsApp Chatbot fo... | This paper introduces WASHtsApp, a WhatsApp-... | [Simon Kloker, Alex Cedric Luyima, Matthew Baz... | http://arxiv.org/pdf/2411.02850v1 |
| 97 | 2832 | 2 | cs.CL | 2024-11-05 06:11:17+00:00 | 2024-11-06 11:19:42+00:00 | PersianRAG: A Retrieval-Augmented Generation S... | Retrieval augmented generation (RAG) models,... | [Hossein Hosseini, Mohammad Sobhan Zare, Amir ... | http://arxiv.org/pdf/2411.02832v2 |
| 98 | 2657 | 1 | cs.CL | 2024-11-04 22:45:52+00:00 | 2024-11-04 22:45:52+00:00 | Zebra-Llama: A Context-Aware Large Language Mo... | Rare diseases present unique challenges in h... | [Karthik Soman, Andrew Langdon, Catalina Villo... | http://arxiv.org/pdf/2411.02657v1 |
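
An arXiv identifier such as `2411.19804` includes the `YYMM` prefix, and numbers such as `02959` carry a leading zero, so converting only the last part of the link to an integer loses information. As a sketch of an alternative, the regular expression below keeps the full identifier as a string and the version as an integer. The helper name `split_arxiv_link` is hypothetical; the example links are taken from the table above.

```python
import re

# Matches "<YYMM>.<number>v<version>" at the end of the link
ARXIV_LINK = re.compile(r"(?P<id>\d{4}\.\d{4,5})v(?P<version>\d+)$")


def split_arxiv_link(link):
    """Return (identifier, version) for an arXiv PDF link, or None if it does not match."""
    match = ARXIV_LINK.search(link)
    if match is None:
        return None
    return match.group("id"), int(match.group("version"))


print(split_arxiv_link("http://arxiv.org/pdf/2411.19804v1"))  # ('2411.19804', 1)
print(split_arxiv_link("http://arxiv.org/pdf/2411.02832v2"))  # ('2411.02832', 2)
```

With pandas, the same pattern could be applied to the whole column through `data_df["link_article"].str.extract(ARXIV_LINK.pattern)`, which returns one column per named group.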


## <span style="color:blue">Towards Data Science</span>

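This section searches Towards Data Science for articles matching a search term and extracts the post data embedded in the results page.

### <span style="color:#4CC9F0" id="Realizar-Consultas-en-Towards-Data-Science">Querying Towards Data Science</span>

The search results page embeds its data as JSON in a `<script>` tag that assigns `window.__APOLLO_STATE__`. The cells below download the page, locate that tag, decode the JSON, and print the entries whose keys contain `Post`.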

In [5]:
query = "RAG"
url = f"https://towardsdatascience.com/search?q={query}"

resp = requests.get(url)
html_content = resp.text

In [17]:
import re

# Parse the HTML
soup = BeautifulSoup(html_content, "html.parser")

# Find the <script> tag that holds the JSON
script_tag = soup.find("script", string=lambda t: t and "window.__APOLLO_STATE__" in t)

# Extract the JSON content
data = {}  # stays empty if the tag is missing or the JSON cannot be decoded
if script_tag:
    script_content = script_tag.string
    # Split off and clean the JSON
    json_start = script_content.find("{")  # find the start of the JSON
    json_content = script_content[json_start:]  # extract the JSON
    try:
        data = json.loads(json_content)
        print(data)  # now a Python dictionary
    except json.JSONDecodeError as e:
        print("Error decoding the JSON:", e)

# Keep the keys that contain "Post"
keys_to_search = [key for key in data if re.search("Post", key)]
#keys_to_search

for ks in keys_to_search:
    print(data[ks])

{'ROOT_QUERY': {'__typename': 'Query', 'variantFlags': [{'__typename': 'VariantFlag', 'name': 'goliath_externalsearch_enable_comment_deindexation', 'valueType': {'__typename': 'VariantFlagBoolean', 'value': True}}, {'__typename': 'VariantFlag', 'name': 'ios_enable_verified_book_author', 'valueType': {'__typename': 'VariantFlagBoolean', 'value': True}}, {'__typename': 'VariantFlag', 'name': 'limit_user_follows', 'valueType': {'__typename': 'VariantFlagBoolean', 'value': True}}, {'__typename': 'VariantFlag', 'name': 'mobile_custom_app_icon', 'valueType': {'__typename': 'VariantFlagBoolean', 'value': True}}, {'__typename': 'VariantFlag', 'name': 'reader_fair_distribution_non_qp', 'valueType': {'__typename': 'VariantFlagBoolean', 'value': True}}, {'__typename': 'VariantFlag', 'name': 'browsable_stream_config_bucket', 'valueType': {'__typename': 'VariantFlagString', 'value': 'curated-topics'}}, {'__typename': 'VariantFlag', 'name': 'enable_braintree_apple_pay', 'valueType': {'__typename': '
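
Slicing the script text from the first `{` to the end works only if nothing follows the JSON object. A sketch of a more tolerant approach uses `json.JSONDecoder.raw_decode`, which decodes one JSON value starting at a given position and ignores whatever comes after it. The script text below is an invented miniature, and `extract_state` is a hypothetical helper name.

```python
import json


def extract_state(script_text, marker="window.__APOLLO_STATE__"):
    """Decode the JSON object assigned to `marker` inside a script's text.

    Returns None if the marker or the opening brace is missing.
    Raises json.JSONDecodeError if the text at the brace is not valid JSON.
    """
    marker_pos = script_text.find(marker)
    if marker_pos == -1:
        return None
    json_start = script_text.find("{", marker_pos)
    if json_start == -1:
        return None
    # raw_decode returns the decoded value and the index where it ended
    state, _end = json.JSONDecoder().raw_decode(script_text, json_start)
    return state


# Invented miniature of the script content, followed by unrelated JavaScript
sample = 'window.__APOLLO_STATE__ = {"Post:abc": {"title": "Example"}, "ROOT_QUERY": {}};\nwindow.other = 1;'
state = extract_state(sample)
post_keys = [key for key in state if key.startswith("Post")]
print(post_keys)  # ['Post:abc']
```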



[[Back]](#Contenido)

## <span style="color:blue">SerpApi</span>

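This section queries Google Scholar through SerpApi, using the `GoogleSearch` client from the `serpapi` package.

### <span style="color:#4CC9F0" id="Realizar-Consultas-en-SerpApi">Querying SerpApi</span>

The search parameters are the API key, which is read from the `GOOGLE_SCHOLAR_API_KEY` environment variable, the search engine (`google_scholar`), the query (`q`), and the interface language (`hl`). The call is left disabled because only 100 requests per month are available.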

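In [None]:
import os

# Only 100 requests per month are available, so the search stays disabled
# unless RUN_SERPAPI_QUERY is set to True.
RUN_SERPAPI_QUERY = False

if RUN_SERPAPI_QUERY:
    params = {
      "api_key": os.environ["GOOGLE_SCHOLAR_API_KEY"],
      "engine": "google_scholar",
      "q": "RAG AI",
      "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()
    print(results)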
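
Because only 100 requests per month are available, it can help to save each response to disk and reuse it when the notebook is re-run. The following is a sketch of that idea under stated assumptions: `cached_search`, the cache file name, and the `run_search` callable are hypothetical, and a stand-in function replaces the real SerpApi call so the example runs without an API key.

```python
import json
import tempfile
from pathlib import Path


def cached_search(params, cache_path, run_search):
    """Return the cached result if it exists; otherwise call run_search and cache it.

    run_search is any callable that takes the params dict and returns a
    JSON-serializable dict (a hypothetical interface for this sketch).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    results = run_search(params)
    cache_path.write_text(json.dumps(results), encoding="utf-8")
    return results


calls = []


def fake_search(params):
    # Stand-in for the real request: records each call and returns made-up data
    calls.append(params)
    return {"query": params["q"], "results": []}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "rag_ai.json"
    first = cached_search({"q": "RAG AI"}, cache_file, fake_search)
    second = cached_search({"q": "RAG AI"}, cache_file, fake_search)

print(len(calls))  # 1: the second call was served from the cache
```

With the real client, `run_search` could be `lambda p: GoogleSearch(p).get_dict()`, the call that appears in the cell above.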

[[Back]](#Contenido)

## <span style="color:blue" id="Automatización">Automation</span>

Automation of the process.

## <span style="color:blue" id="Uso-de-la-Automatización">Using the Automation</span>

Explanation of how to use the automation.

## <span style="color:blue" id="Conclusiones">Conclusions</span>

Conclusions of the notebook.

## <span style="color:blue" id="Recomendaciones">Recommendations</span>

Recommendations from the study.
