# Web Scraping Explore

Information will be extracted from a web page dedicated to cataloging exoplanets.

## Step 1. Import dependencies

In [127]:
# Import dependencies
import os
from bs4 import BeautifulSoup
import requests
import time
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

## Step 2. Download the HTML
In this case, NASA's exoplanet catalog was selected:

In [128]:
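# Select the resource to download
resource_url = "https://science.nasa.gov/exoplanets/exoplanet-catalog/"

# Pause before the request, then download the page from the Internet.
# The pause is a separate statement because the second positional argument
# of requests.get is `params`; timeout keeps the call from hanging indefinitely.
time.sleep(10)
response = requests.get(resource_url, timeout=10)

# A Response is truthy for any status code below 400 (200 is the usual success case),
# which means the HTML content of the page could be downloaded
if response:
    # Parse the raw HTML text into a structured, nested tree.
    # 'html' lets BeautifulSoup pick the best HTML parser that is installed.
    soup = BeautifulSoup(response.text, 'html')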
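
The download step can also be wrapped in a small helper. The sketch below is only an illustration: `fetch_html`, its parameters, and the injectable `get` argument are hypothetical names, not part of the original notebook. It keeps the pause (`time.sleep`) separate from the request, since the second positional argument of `requests.get` is `params`, and it passes an explicit `timeout`.

```python
import time

import requests


def fetch_html(url, delay_seconds=0.0, timeout=10.0, get=requests.get):
    """Return the HTML text of `url`, or raise if the request fails.

    `fetch_html` and its parameters are hypothetical names used for
    illustration; `get` can be swapped for a stub when testing offline.
    """
    # Optional pause before the request, to be polite to the server
    if delay_seconds > 0:
        time.sleep(delay_seconds)
    # timeout (in seconds) stops the call from waiting forever
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could call `fetch_html(resource_url, delay_seconds=10)` and pass the returned text to `BeautifulSoup`.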


The response of the request is checked; if the request ran correctly, a 200 status code is shown:

In [129]:
# Check the response
print(f'✅ Successful execution: {response}')

✅ Successful execution: <Response [200]>
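
Printing the response shows the status code, but the check can also be made explicit. A `requests.Response` is truthy for any status code below 400, so `if response:` is not the same as testing for exactly 200. The sketch below builds `Response` objects by hand (no network access) purely to illustrate the difference; `describe` is a hypothetical helper.

```python
import requests


def describe(status_code):
    """Summarize how a Response with this status code evaluates."""
    response = requests.Response()
    response.status_code = status_code
    return {
        "truthy": bool(response),                # same result as response.ok
        "ok": response.ok,                       # True for status codes below 400
        "is_200": response.status_code == 200,   # strict check for 200
    }


for code in (200, 204, 404):
    print(code, describe(code))
```

If only a 200 should count as success, comparing `response.status_code` directly is the stricter test.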


## Step 3. Transform the HTML
The content and structure of the HTML are explored with the BeautifulSoup library:

In [130]:
# Find the containers that hold the information
soup.find_all("div", class_="hds-a11y-heading-22")

[<div class="hds-a11y-heading-22">TOI-6695 c</div>,
 <div class="hds-a11y-heading-22">Kepler-139 f</div>,
 <div class="hds-a11y-heading-22">AT2021ueyL b</div>,
 <div class="hds-a11y-heading-22">KMT-2017-BLG-2197L b</div>,
 <div class="hds-a11y-heading-22">KMT-2022-BLG-2076L b</div>,
 <div class="hds-a11y-heading-22">ZTF J1828+2308 b</div>,
 <div class="hds-a11y-heading-22">KMT-2022-BLG-1790L b</div>,
 <div class="hds-a11y-heading-22">KMT-2023-BLG-2209L b</div>,
 <div class="hds-a11y-heading-22">GJ 1289 b</div>,
 <div class="hds-a11y-heading-22">TOI-1453 b</div>,
 <div class="hds-a11y-heading-22">TOI-1453 c</div>,
 <div class="hds-a11y-heading-22">TOI-7041 b</div>,
 <div class="hds-a11y-heading-22">TOI-5143 c</div>,
 <div class="hds-a11y-heading-22">TOI-4364 b</div>,
 <div class="hds-a11y-heading-22">KOI-1843.03</div>]
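
The list above holds `Tag` objects rather than plain strings. As a minimal sketch, the names can be pulled out with `get_text`. The HTML fragment below is a small hand-written sample that reuses the class name from the output above, so it runs without downloading the page.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics two of the divs shown above
sample_html = """
<div class="hds-a11y-heading-22">TOI-6695 c</div>
<div class="hds-a11y-heading-22"> Kepler-139 f </div>
<div class="other">not a planet name</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
name_divs = sample_soup.find_all("div", class_="hds-a11y-heading-22")

# get_text(strip=True) returns the text content without surrounding whitespace
planet_names = [div.get_text(strip=True) for div in name_divs]
print(planet_names)
```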

In [133]:
# Extract the information from the HTML tags
exo_cards = soup.find_all('div', class_='hds-content-item content-list-item-exoplanet')
data = []
for item in exo_cards:
    exo_obj = {}
    exo_link = item.find('a', class_='hds-content-item-heading')
    # The planet name is in the first div after the heading link
    exo_name = exo_link.find_next('div').text.strip()
    exo_obj["Exoplanet's name:"] = exo_name
    exo_box_field = exo_link.find_all_next('div', class_='hds-content-item-inner')
    for field in exo_box_field:
        exo_all_tags = field.find_all_next('div', class_="CustomField")
        for field_tag in exo_all_tags:
            exo_tag_field = field_tag.find('span', class_='font-weight-bold')
            exo_field = exo_tag_field.text.strip()
            exo_value = exo_tag_field.find_next('span').text.strip()
            # find_all_next keeps scanning the rest of the document, so a plain
            # exo_obj[exo_field] = exo_value was overwritten by later cards;
            # setdefault() keeps the first value found for each field
            exo_obj.setdefault(exo_field, exo_value)
    data.append(exo_obj)
print(data)

[{"Exoplanet's name:": 'TOI-6695 c', 'Parsecs from Earth:': '391.038', 'Planet Mass:': '36 Earths', 'Stellar Magnitude:': '12.775', 'Discovery Date:': '2025'}, {"Exoplanet's name:": 'Kepler-139 f', 'Parsecs from Earth:': '1040', 'Planet Mass:': '1.34 Jupiters', 'Stellar Magnitude:': 'Unknown', 'Discovery Date:': '2025'}, {"Exoplanet's name:": 'AT2021ueyL b', 'Parsecs from Earth:': '7800', 'Planet Mass:': '8.84 Jupiters', 'Stellar Magnitude:': 'Unknown', 'Discovery Date:': '2025'}, {"Exoplanet's name:": 'KMT-2017-BLG-2197L b', 'Parsecs from Earth:': '5570', 'Planet Mass:': '0.88 Jupiters', 'Stellar Magnitude:': 'Unknown', 'Discovery Date:': '2025'}, {"Exoplanet's name:": 'KMT-2022-BLG-2076L b', 'Parsecs from Earth:': '201.827', 'Planet Mass:': '20 Jupiters', 'Stellar Magnitude:': '18.0447', 'Discovery Date:': '2025'}, {"Exoplanet's name:": 'ZTF J1828+2308 b', 'Parsecs from Earth:': '6480', 'Planet Mass:': '1.73 Jupiters', 'Stellar Magnitude:': 'Unknown', 'Discovery Date:': '2025'}, {"Ex
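
The loop above relies on `setdefault` because `find_all_next` keeps scanning past the current card. An alternative, sketched here under assumptions, is to search only inside each card with `find_all`, which looks at the descendants of a tag instead of everything that follows it in the document. The markup below is a simplified, hypothetical stand-in for the catalog page (only the class names come from the code above), and `parse_cards` is an illustrative name, so the real page may need different selectors.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup; the real page structure may differ
sample_html = """
<div class="hds-content-item content-list-item-exoplanet">
  <a class="hds-content-item-heading" href="#"><div>TOI-6695 c</div></a>
  <div class="CustomField"><span class="font-weight-bold">Planet Mass:</span> <span>36 Earths</span></div>
  <div class="CustomField"><span class="font-weight-bold">Discovery Date:</span> <span>2025</span></div>
</div>
<div class="hds-content-item content-list-item-exoplanet">
  <a class="hds-content-item-heading" href="#"><div>Kepler-139 f</div></a>
  <div class="CustomField"><span class="font-weight-bold">Planet Mass:</span> <span>1.34 Jupiters</span></div>
  <div class="CustomField"><span class="font-weight-bold">Discovery Date:</span> <span>2025</span></div>
</div>
"""


def parse_cards(html):
    """Return one dict per exoplanet card found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="content-list-item-exoplanet"):
        heading = card.find("a", class_="hds-content-item-heading")
        row = {"Exoplanet's name:": heading.get_text(strip=True)}
        # find_all only visits descendants of this card, so fields from
        # other cards cannot leak in and plain assignment is enough
        for field in card.find_all("div", class_="CustomField"):
            label = field.find("span", class_="font-weight-bold")
            value = label.find_next_sibling("span")
            row[label.get_text(strip=True)] = value.get_text(strip=True)
        rows.append(row)
    return rows


sample_rows = parse_cards(sample_html)
print(sample_rows)
```

Scoping the search to each card also avoids rescanning the rest of the document once per card.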

## Step 4. Process the DataFrame

In [134]:
# Convert the previous object into a DataFrame
df = pd.DataFrame(data)

In [135]:
df.head()

| | Exoplanet's name: | Parsecs from Earth: | Planet Mass: | Stellar Magnitude: | Discovery Date: |
|---|---|---|---|---|---|
| 0 | TOI-6695 c | 391.038 | 36 Earths | 12.775 | 2025 |
| 1 | Kepler-139 f | 1040.0 | 1.34 Jupiters | Unknown | 2025 |
| 2 | AT2021ueyL b | 7800.0 | 8.84 Jupiters | Unknown | 2025 |
| 3 | KMT-2017-BLG-2197L b | 5570.0 | 0.88 Jupiters | Unknown | 2025 |
| 4 | KMT-2022-BLG-2076L b | 201.827 | 20 Jupiters | 18.0447 | 2025 |
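
The scraped values are all strings (see the printed `data` list), including the placeholder `Unknown`. As a rough sketch of further processing, the numeric columns can be converted with `pd.to_numeric` and the mass split into a number and a unit. The sample rows are copied from the output above; the new column names (`mass_value`, `mass_unit`) and the cleaning steps themselves are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the scraped output, kept as strings
sample = pd.DataFrame([
    {"Exoplanet's name:": "TOI-6695 c", "Parsecs from Earth:": "391.038",
     "Planet Mass:": "36 Earths", "Stellar Magnitude:": "12.775",
     "Discovery Date:": "2025"},
    {"Exoplanet's name:": "Kepler-139 f", "Parsecs from Earth:": "1040",
     "Planet Mass:": "1.34 Jupiters", "Stellar Magnitude:": "Unknown",
     "Discovery Date:": "2025"},
])

# Drop the trailing colon from the scraped labels
clean = sample.rename(columns=lambda name: name.rstrip(":"))

# Non-numeric text such as "Unknown" becomes NaN
for column in ["Parsecs from Earth", "Stellar Magnitude", "Discovery Date"]:
    clean[column] = pd.to_numeric(clean[column], errors="coerce")

# Split values like "36 Earths" into a number and a unit
mass_parts = clean["Planet Mass"].str.extract(r"^([\d.]+)\s+(\w+)$")
clean["mass_value"] = pd.to_numeric(mass_parts[0], errors="coerce")
clean["mass_unit"] = mass_parts[1]

print(clean.dtypes)
```

Keeping the unit in its own column leaves the choice of a common mass scale (Earths or Jupiters) to a later step.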
