<figure>
    <img src="../../../figures/logo_ap.png"  width="80" height="80" align="left"/>
</figure>

# <span style="color:blue"><br><br><br><br><center>&nbsp;&nbsp;&nbsp;Aprendizaje Profundo</center></span>

# <span style="color:red"><center>Biblioteca Alejandría</center></span>

## <span style="color:green"><center>Web Scraping of Academic Sources</center></span>

## <span style="color:blue">Authors</span>

1. Álvaro Montenegro, alvaro.montenegro@aprendizajeprofundo.ai
1. Daniel Montenegro, daniel.montenegro@aprendizajeprofundo.ai

## <span style="color:blue">Contents</span>

* [Required Libraries](#Required-Libraries)
* [ArXiv](#ArXiv)
    * [Querying ArXiv](#Querying-ArXiv)
    * [Extracting Metadata from ArXiv](#Extracting-Metadata-from-ArXiv)
    * [Converting ArXiv Results to a DataFrame](#Converting-ArXiv-Results-to-a-DataFrame)
* [SerpApi](#SerpApi)
    * [Querying SerpApi](#Querying-SerpApi)
* [Automation](#Automation)
* [Using the Automation](#Using-the-Automation)
* [Conclusions](#Conclusions)
* [Recommendations](#Recommendations)

## <span style="color:blue">Required Libraries</span>

In [1]:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from serpapi import GoogleSearch

[[Back]](#Contents)

## <span style="color:blue">ArXiv</span>

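This section retrieves article metadata from arXiv in three steps: query the public API, extract the fields of each entry from the response, and load them into a pandas DataFrame.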

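### <span style="color:#4CC9F0">Querying ArXiv</span>

The arXiv API is queried with a plain HTTP GET request. The URL sets the search field (`all`), the search term, the pagination window (`start`, `max_results`), and the sort order. The response is an Atom XML feed.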

In [2]:
import requests
from typing import Annotated

def fetch_html(
    url: Annotated[str, "The URL to request"]
) -> str:
    """
    Send an HTTP GET request with requests and return the response body as text.
    """
    try:
        # Send the request; the timeout keeps the call from hanging indefinitely
        response = requests.get(url, timeout=30)
        # Raise on 4xx/5xx so an error page is never parsed as results
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        # Fail loudly instead of returning the error message as if it were content
        raise RuntimeError(f"Request failed: {e}") from e

type_query = "all"          # field to search in
query = "RAG"               # search term
start = 0                   # index of the first result
max_results = 100           # number of results to return
sortby = "submittedDate"
sortorder = "descending"
url = f'http://export.arxiv.org/api/query?search_query={type_query}:{query}&start={start}&max_results={max_results}&sortBy={sortby}&sortOrder={sortorder}'

print("Constructed URL:", url)
html_content = fetch_html(url=url)
#print(html_content)

Constructed URL: http://export.arxiv.org/api/query?search_query=all:RAG&start=0&max_results=100&sortBy=submittedDate&sortOrder=descending
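
The f-string above inserts the query verbatim. That works for a single word such as `RAG`, but a query containing spaces or other reserved characters would need to be percent-encoded. The following is a minimal sketch of one way to do that with the standard library; `build_arxiv_url` is a hypothetical helper, not part of the notebook, and its parameter names are assumptions that mirror the variables used above.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # same endpoint as the cell above


def build_arxiv_url(query, field="all", start=0, max_results=100,
                    sort_by="submittedDate", sort_order="descending"):
    """Build an arXiv API query URL with encoded parameters (hypothetical helper)."""
    params = {
        "search_query": f"{field}:{query}",
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # safe=":" keeps the "field:term" separator readable; spaces become "+"
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"


print(build_arxiv_url("RAG"))
print(build_arxiv_url("retrieval augmented generation", max_results=10))
```

For the single-word query the helper produces the same URL as the f-string, so it could be passed to `fetch_html` unchanged.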


[[Back]](#Contents)

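### <span style="color:#4CC9F0">Extracting Metadata from ArXiv</span>

The API returns an Atom feed in which each article is an `<entry>` element. The cell below parses the feed with BeautifulSoup and collects the primary category, dates, title, abstract, authors, and PDF link of every entry.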

In [3]:
# The response is Atom XML, so use the XML parser (requires lxml)
soup = BeautifulSoup(html_content, "lxml-xml")
# Each <entry> element describes one article
articles = soup.find_all("entry")

data_json = []
for entry in articles:
    prim_category = entry.find("primary_category").get("term")
    published = entry.find("published").text
    updated = entry.find("updated").text
    title = entry.find("title").text
    summary = entry.find("summary").text
    # An entry has one <author> element per author
    authors = [auth.find("name").text for auth in entry.find_all("author")]
    # The PDF location is the <link> element whose title is "pdf"
    link_article = entry.select('link[title="pdf"]')[0].get("href")
    data_json.append({"primary_category": prim_category,
                      "published": published,
                      "updated": updated,
                      "title": title,
                      "summary": summary,
                      "authors": authors,
                      "link_article": link_article})

[[Back]](#Contents)
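
The extraction step can be tried without a network call. The sketch below is an alternative that uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, with the feed namespaces spelled out explicitly. `SAMPLE_FEED` is a hand-written entry that reuses the title, authors, and link of the first article in this notebook's results; the abstract is a placeholder, and `parse_entries` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the arXiv Atom feed
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written sample with the structure of one feed entry (abstract is a placeholder)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <published>2025-02-24T18:16:10Z</published>
    <updated>2025-02-24T18:16:10Z</updated>
    <title>Mitigating Bias in RAG:
      Controlling the Embedder</title>
    <summary>  Placeholder abstract text.</summary>
    <author><name>Taeyoun Kim</name></author>
    <author><name>Jacob Springer</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/2502.17390v1"/>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""


def clean(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join((text or "").split())


def parse_entries(feed_xml):
    """Return one dictionary per <entry> element (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    records = []
    for entry in root.findall("atom:entry", NS):
        category = entry.find("arxiv:primary_category", NS)
        pdf_link = entry.find("atom:link[@title='pdf']", NS)
        records.append({
            "primary_category": category.get("term") if category is not None else None,
            "published": entry.findtext("atom:published", default="", namespaces=NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=NS),
            "title": clean(entry.findtext("atom:title", default="", namespaces=NS)),
            "summary": clean(entry.findtext("atom:summary", default="", namespaces=NS)),
            "authors": [clean(a.findtext("atom:name", default="", namespaces=NS))
                        for a in entry.findall("atom:author", NS)],
            # None when an entry has no PDF link, instead of raising IndexError
            "link_article": pdf_link.get("href") if pdf_link is not None else None,
        })
    return records


records = parse_entries(SAMPLE_FEED)
print(records[0]["title"], records[0]["authors"])
```

Unlike indexing with `[0]`, this version records `None` for a missing category or PDF link, so one incomplete entry does not stop the whole loop.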

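### <span style="color:#4CC9F0">Converting ArXiv Results to a DataFrame</span>

The list of dictionaries is loaded into a DataFrame, the dates are parsed as timestamps, and two columns derived from the PDF link are added: `id` and `version`.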

In [4]:
data_df = pd.DataFrame(data_json)
# Parse the ISO 8601 timestamps into timezone-aware datetimes
columns_to_convert = ['published', 'updated']
data_df[columns_to_convert] = data_df[columns_to_convert].apply(pd.to_datetime)
# The last dot-separated piece of the link looks like "17390v1": article number, then version
suffix = data_df["link_article"].str.split(".").str[-1]
# Strip the trailing "v<version>" so that only the article number remains
data_df.insert(0, "id", suffix.str.replace(r"v\d+$", "", regex=True).astype("int"))
data_df.insert(1, "version", suffix.str.split("v").str[-1])
data_df.info()
data_df.sort_values(by="published", ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   id                100 non-null    int64              
 1   version           100 non-null    object             
 2   primary_category  100 non-null    object             
 3   published         100 non-null    datetime64[ns, UTC]
 4   updated           100 non-null    datetime64[ns, UTC]
 5   title             100 non-null    object             
 6   summary           100 non-null    object             
 7   authors           100 non-null    object             
 8   link_article      100 non-null    object             
dtypes: datetime64[ns, UTC](2), int64(1), object(6)
memory usage: 7.2+ KB


| | id | version | primary_category | published | updated | title | summary | authors | link_article |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 17390 | 1 | cs.CL | 2025-02-24 18:16:10+00:00 | 2025-02-24 18:16:10+00:00 | Mitigating Bias in RAG: Controlling the Embedder | In retrieval augmented generation (RAG) syst... | [Taeyoun Kim, Jacob Springer, Aditi Raghunatha... | http://arxiv.org/pdf/2502.17390v1 |
| 1 | 17297 | 1 | cs.AI | 2025-02-24 16:25:25+00:00 | 2025-02-24 16:25:25+00:00 | Benchmarking Retrieval-Augmented Generation in... | This paper introduces Multi-Modal Retrieval-... | [Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, X... | http://arxiv.org/pdf/2502.17297v1 |
| 2 | 17163 | 1 | cs.CL | 2025-02-24 13:58:42+00:00 | 2025-02-24 13:58:42+00:00 | MEMERAG: A Multilingual End-to-End Meta-Evalua... | Automatic evaluation of retrieval augmented ... | [María Andrea Cruz Blandón, Jayasimha Talur, B... | http://arxiv.org/pdf/2502.17163v1 |
| 3 | 17125 | 1 | cs.CL | 2025-02-24 13:11:47+00:00 | 2025-02-24 13:11:47+00:00 | LettuceDetect: A Hallucination Detection Frame... | Retrieval Augmented Generation (RAG) systems... | [Ádám Kovács, Gábor Recski] | http://arxiv.org/pdf/2502.17125v1 |
| 4 | 17036 | 1 | cs.CL | 2025-02-24 10:37:13+00:00 | 2025-02-24 10:37:13+00:00 | Language Model Re-rankers are Steered by Lexic... | Language model (LM) re-rankers are used to r... | [Lovisa Hagström, Ercong Nie, Ruben Halifa, He... | http://arxiv.org/pdf/2502.17036v1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 9017 | 1 | cs.CL | 2025-02-13 07:11:01+00:00 | 2025-02-13 07:11:01+00:00 | Diversity Enhances an LLM's Performance in RAG... | The rapid advancements in large language mod... | [Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur,... | http://arxiv.org/pdf/2502.09017v1 |
| 96 | 8826 | 2 | cs.CL | 2025-02-12 22:33:41+00:00 | 2025-02-17 23:26:44+00:00 | Ask in Any Modality: A Comprehensive Survey on... | Large Language Models (LLMs) struggle with h... | [Mohammad Mahdi Abootorabi, Amirhosein Zobeiri... | http://arxiv.org/pdf/2502.08826v2 |
| 97 | 8756 | 1 | cs.AI | 2025-02-12 19:59:57+00:00 | 2025-02-12 19:59:57+00:00 | From PowerPoint UI Sketches to Web-Based Appli... | Developing web-based GIS applications, commo... | [Haowen Xu, Xiao-Ying Yu] | http://arxiv.org/pdf/2502.08756v1 |
| 98 | 8356 | 2 | cs.CL | 2025-02-12 12:39:51+00:00 | 2025-02-17 14:29:48+00:00 | Systematic Knowledge Injection into Large Lang... | Retrieval-Augmented Generation (RAG) has eme... | [Kushagra Bhushan, Yatin Nandwani, Dinesh Khan... | http://arxiv.org/pdf/2502.08356v2 |
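
The `id` column above is built from the piece of the link after the last dot and cast to an integer, so the `2502` prefix is lost and leading zeros are dropped. Articles from different months could therefore end up with the same `id`. As a hedged alternative, the sketch below keeps the full identifier as a string and returns the version as an integer. It assumes links that end in a new-style identifier (`YYMM.NNNNN` followed by `v` and the version); `split_arxiv_link` is a hypothetical helper that uses only the standard library.

```python
import re

# Matches new-style identifiers such as "2502.17390v1" at the end of a link
ARXIV_ID_RE = re.compile(r"(?P<id>\d{4}\.\d{4,5})v(?P<version>\d+)$")


def split_arxiv_link(link):
    """Return (identifier, version), or (None, None) if the link does not match."""
    match = ARXIV_ID_RE.search(link)
    if match is None:
        return None, None
    return match.group("id"), int(match.group("version"))


print(split_arxiv_link("http://arxiv.org/pdf/2502.09017v1"))  # -> ('2502.09017', 1)
print(split_arxiv_link("http://arxiv.org/pdf/2502.08826v2"))  # -> ('2502.08826', 2)
```

Applied to the DataFrame, a function like this could fill both columns through `data_df["link_article"].map(split_arxiv_link)`, at the cost of `id` becoming a string column.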


## <span style="color:blue">SerpApi</span>

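This section queries Google Scholar through SerpApi.

### <span style="color:#4CC9F0">Querying SerpApi</span>

The search is sent with the `GoogleSearch` client, using an API key read from the `GOOGLE_SCHOLAR_API_KEY` environment variable. The cell is left disabled because only 100 requests per month are available.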

In [None]:
# Only 100 requests per month are available, so this cell is disabled.
# Uncomment it to run the search.
# import os
#
# params = {
#   "api_key": os.environ["GOOGLE_SCHOLAR_API_KEY"],
#   "engine": "google_scholar",
#   "q": "RAG AI",
#   "hl": "en"
# }
#
# search = GoogleSearch(params)
# results = search.get_dict()
# results
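
With a quota of 100 requests per month, re-running the same search while developing the notebook is wasteful. One possible way to avoid it is to cache each response on disk, keyed by the search parameters. The sketch below is an assumption, not part of SerpApi: `cached_search`, `search_fn`, and the `serp_cache` directory are hypothetical names, and the function only relies on `search_fn` returning a JSON-serializable dictionary.

```python
import hashlib
import json
from pathlib import Path


def cached_search(params, search_fn, cache_dir="serp_cache"):
    """Return search_fn(params), reusing a JSON file on disk for repeated queries.

    `search_fn` is any callable that takes the params dict and returns a
    JSON-serializable dict. The API key is left out of the cache key.
    """
    key_params = {k: v for k, v in params.items() if k != "api_key"}
    digest = hashlib.sha256(
        json.dumps(key_params, sort_keys=True).encode("utf-8")
    ).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"

    if cache_file.exists():
        # Cache hit: no request is spent
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # Cache miss: run the real search once and store the response
    results = search_fn(params)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(results), encoding="utf-8")
    return results
```

With the cell above enabled, the call would look like `cached_search(params, lambda p: GoogleSearch(p).get_dict())`. Deleting the cache directory forces a fresh request.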

[[Back]](#Contents)

## <span style="color:blue">Automation</span>

Automation of the process.

## <span style="color:blue">Using the Automation</span>

Explanation of how to use the automation.

## <span style="color:blue">Conclusions</span>

Conclusions of the notebook.

## <span style="color:blue">Recommendations</span>

Recommendations from the study.
