
# Web Scraping and Portfolio Construction with Yahoo Finance: Step by Step

This notebook completes the full assignment in **three parts**:
1) **Scraping** the **Top Gainers** from Yahoo Finance (symbol and name; target: 50).
2) **Downloading** **adjusted monthly** prices (12 months) with `yfinance`.
3) **Building a portfolio** of 10 stocks based on performance over the **first 6 months**, then **analyzing** its returns over the **last 6 months** (equal-weighted portfolio).

> **Note:** Yahoo Finance may show a *consent banner*, and the list is loaded dynamically. The notebook includes basic handling for the consent banner, scrolling, and the **Show more** button.



## Part 0. Environment setup
Run this cell **once** (or run the commands from a terminal) to install the dependencies:


In [1]:

# If running in Jupyter, uncomment the lines you need:
# !pip install --upgrade pip
# !pip install selenium webdriver-manager pandas yfinance matplotlib seaborn beautifulsoup4 lxml


## Imports

In [2]:

import time
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt

# --- Selenium ---
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

# pandas display settings
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)


## Selenium utilities

In [3]:

def build_driver(headless=False):
    """Create a ChromeDriver via webdriver-manager. Pass headless=True if you do not need to see the window."""
    chrome_opts = Options()
    chrome_opts.add_argument("--start-maximized")
    chrome_opts.add_argument("--disable-gpu")
    chrome_opts.add_argument("--no-sandbox")
    chrome_opts.add_argument("--window-size=1280,900")
    chrome_opts.add_argument("--lang=en-US,en")
    chrome_opts.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_opts.add_experimental_option("useAutomationExtension", False)
    if headless:
        chrome_opts.add_argument("--headless=new")
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=chrome_opts)


def handle_consent_if_any(driver, timeout=6):
    """Try to accept the consent dialog if a cookie banner appears. Returns True if a button was clicked."""
    try:
        wait = WebDriverWait(driver, timeout)
        # Typical consent buttons
        candidates = [
            (By.XPATH, "//button//*[contains(text(),'Accept') or contains(text(),'agree') or contains(text(),'Agree')]/ancestor::button"),
            (By.XPATH, "//button[contains(., 'Accept') or contains(., 'Agree')]"),
            (By.XPATH, "//button//*[contains(text(),'Estoy de acuerdo') or contains(text(),'Aceptar')]/ancestor::button"),
        ]
        for by, selector in candidates:
            try:
                btn = wait.until(EC.element_to_be_clickable((by, selector)))
                btn.click()
                time.sleep(0.5)
                return True
            except Exception:
                continue
    except Exception:
        pass
    return False


def click_show_more_until(driver, min_rows=50, timeout=8, max_clicks=20):
    """Click 'Show more' until at least min_rows rows are present or the button is gone. Returns the final row count."""
    wait = WebDriverWait(driver, timeout)
    clicks = 0
    while clicks < max_clicks:
        try:
            rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
            if len(rows) >= min_rows:
                break
            # Look for the 'Show more' button
            show_more_btn = None
            # Variants of the button
            for xpath in [
                "//button[.//span[contains(translate(., 'SM', 'sm'),'show more')]]",
                "//button[contains(translate(., 'SM', 'sm'),'show more')]",
                "//button[contains(@aria-label, 'Show more')]",
            ]:
                try:
                    show_more_btn = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))
                    if show_more_btn:
                        break
                except TimeoutException:
                    continue
            if show_more_btn is None:
                # Scroll to the bottom in case rows load via infinite scroll
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(1.0)
                new_rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
                if len(new_rows) <= len(rows):
                    break
            else:
                driver.execute_script("arguments[0].scrollIntoView(true);", show_more_btn)
                time.sleep(0.3)
                try:
                    show_more_btn.click()
                except ElementClickInterceptedException:
                    driver.execute_script("arguments[0].click();", show_more_btn)
                time.sleep(1.0)
                clicks += 1
        except Exception:
            break
    return len(driver.find_elements(By.CSS_SELECTOR, "table tbody tr"))


def parse_rows_symbol_name(driver, max_rows=50):
    """Extract (symbol, name) from the first rows (up to max_rows)."""
    rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
    data = []
    for row in rows[:max_rows]:
        try:
            # Method 1: by table cells
            tds = row.find_elements(By.CSS_SELECTOR, "td")
            symbol = tds[0].text.strip() if len(tds) > 0 else ""
            name = tds[1].text.strip() if len(tds) > 1 else ""
            # Fallback: by links
            if not symbol:
                link = row.find_element(By.CSS_SELECTOR, "a[href*='/quote/']")
                symbol = link.text.strip()
            if not name and len(tds) > 1:
                name = tds[1].get_attribute("title") or tds[1].text.strip()
            if symbol:
                data.append((symbol, name))
        except Exception:
            continue
    return pd.DataFrame(data, columns=["symbol", "name"])



## Part 1. Scraping the **Top Gainers** (target: 50 rows)


In [None]:

URL = "https://finance.yahoo.com/markets/stocks/gainers/?start=0&count=100"

driver = build_driver(headless=False)  # set to True if you do not need to see the window
try:
    driver.get(URL)

    # Accept the consent banner if one appears
    handle_consent_if_any(driver)

    wait = WebDriverWait(driver, 12)

    # Wait until a table with rows is present
    wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tbody tr")))

    # Ensure at least 50 rows (clicks 'Show more' repeatedly if the button exists)
    total_rows = click_show_more_until(driver, min_rows=50)
    print(f"Rows visible after expanding: {total_rows}")

    # Extract the first 50 rows (symbol and name)
    df_gainers = parse_rows_symbol_name(driver, max_rows=50)
finally:
    # Always close the browser, even if a wait times out
    driver.quit()

print(df_gainers.shape)
df_gainers.head(10)


Rows visible after expanding: 53
(50, 2)


,symbol,name
0,LCID,"Lucid Group, Inc."
1,AMBA,"Ambarella, Inc."
2,IREN,IREN Limited
3,BABA,Alibaba Group Holding Limited
4,AFRM,"Affirm Holdings, Inc."
5,DOOO,BRP Inc.
6,ADSK,"Autodesk, Inc."
7,CIFR,Cipher Mining Inc.
8,SATS,EchoStar Corporation
9,S,"SentinelOne, Inc."


In [8]:

# Save the list of gainers
df_gainers.to_csv("top_gainers_50.csv", index=False)
print("File saved: top_gainers_50.csv")
df_gainers.sample(min(len(df_gainers), 10), random_state=1)


File saved: top_gainers_50.csv


,symbol,name
27,GFI,Gold Fields Limited
35,BILI,Bilibili Inc.
40,MOH,"Molina Healthcare, Inc."
38,BF-A,Brown-Forman Corporation
2,IREN,IREN Limited
3,BABA,Alibaba Group Holding Limited
48,EMN,Eastman Chemical Company
29,SBSW,Sibanye Stillwater Limited
46,WPM,Wheaton Precious Metals Corp.
31,SOUN,"SoundHound AI, Inc."



## Part 2. Downloading adjusted monthly prices (12 months) with `yfinance`


In [11]:
import pandas as pd
import yfinance as yf

# 1) Start from the list of symbols (for example, from the scraping in Part 1).
#    df_gainers is assumed to have a 'symbol' column with up to 50 tickers.
symbols = (
    df_gainers["symbol"]
    .dropna()
    .astype(str).str.strip()
    .unique()
    .tolist()
)

# 2) Download everything in one call: 1 year, monthly frequency
raw = yf.download(
    symbols,
    period="1y",
    interval="1mo",
    auto_adjust=False,   # keep the OHLC columns and 'Adj Close'
    progress=False,
    group_by="column",   # multi-level columns grouped by field (Adj Close, Close, etc.)
    threads=True
)

# 3) Keep only the 'Adj Close' matrix (columns = symbols, index = monthly dates)
prices_m = raw["Adj Close"].copy()

# 4) Basic cleaning
# - Drop columns that are entirely empty (symbols with no data)
prices_m = prices_m.dropna(axis=1, how="all")

# - If more than 12 rows came back, keep the 12 most recent (fewer rows are left as is)
prices_m = prices_m.tail(12)

# (Optional) Save to CSV
prices_m.to_csv("adj_close_monthly_1y.csv")

print(prices_m.shape)  # (12, <=50)
prices_m.head()


(12, 50)


Date,ADSK,AEM,AFRM,AMBA,BABA,BF-A,BF-B,BHC,BIDU,BILI,BTDR,BTU,CDE,CELH,CIFR,CNXC,COO,DOOO,EMN,EQX,FSM,GFI,GH,GSAT,HCC,HL,HMY,HP,IAG,IREN,JOYY,KGC,LCID,MIAX,MOH,NG,NGD,NXE,OLN,OS,PRVA,S,SATS,SBSW,SJM,SNDK,SOUN,SSRM,UPWK,WPM
2024-10-01,283.799988,85.226173,43.849998,56.189999,96.351486,42.737293,43.180271,9.2,91.230003,22.120001,7.79,25.835382,6.44,30.08,4.93,41.43668,104.68,48.756016,102.163742,5.54,4.97,16.182167,21.879999,15.75,62.753315,6.460576,10.704062,32.161942,5.54,9.12,32.636093,9.986554,22.1,,321.220001,3.46,2.75,7.36,39.802032,29.52,18.360001,25.790001,25.059999,4.67,109.094658,,5.03,6.17,13.56,65.580765
2024-11-01,291.899994,83.379219,70.010002,71.550003,85.917831,40.560219,41.267906,8.37,85.050003,19.17,14.27,23.455421,6.46,28.450001,6.7,44.135979,104.459999,48.221645,101.804054,5.65,4.78,14.228132,35.610001,29.25,69.900414,5.494973,9.122093,33.147854,5.5,13.51,37.236771,9.619984,21.799999,,297.899994,3.66,2.75,8.44,41.315338,29.889999,21.48,27.950001,25.290001,4.09,113.208176,,9.31,5.81,16.969999,61.92469
2024-12-01,295.570007,77.617378,60.900002,72.739998,83.380714,36.961182,37.247032,8.06,84.309998,18.110001,21.67,20.649555,5.72,26.34,4.64,42.486404,91.93,50.339321,88.777184,5.02,4.29,12.961446,30.549999,31.049999,53.982235,4.899923,8.149335,30.877611,5.16,9.82,40.112198,9.211755,30.200001,,291.049988,3.33,2.48,6.6,32.942024,28.52,19.549999,22.200001,22.9,3.3,106.878769,,19.84,6.96,16.35,56.012188
2025-01-01,311.339996,92.235771,61.07,76.720001,97.197182,32.840946,32.535694,7.43,90.599998,16.719999,18.52,17.898254,6.6,24.98,5.73,51.333233,96.550003,47.448833,97.698471,6.07,5.07,16.624037,46.98,22.950001,52.519218,5.668343,11.186725,30.462954,6.24,10.22,41.549911,11.209125,27.6,,310.410004,3.13,3.02,6.56,28.546507,29.780001,22.85,23.950001,27.66,3.81,103.743843,,14.15,8.03,15.76,62.216949
2025-02-01,274.209991,95.550461,64.150002,61.43,130.307556,32.555115,32.634258,7.44,86.449997,20.34,12.31,13.598727,5.15,25.690001,4.08,44.619957,90.379997,39.350475,95.933723,6.42,4.32,17.645241,42.549999,21.59,47.911221,5.119472,9.876477,25.564194,5.52,8.24,44.847069,10.652643,22.200001,,301.119995,3.0,2.72,5.29,24.745504,23.299999,24.969999,20.629999,31.23,3.19,107.276695,,10.82,9.99,15.93,68.640823
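
In the output above, the MIAX and SNDK columns are empty in every row shown, so some tickers lack a full 12-month history. The following is a minimal sketch of how one might flag such tickers before ranking, assuming a price matrix shaped like `prices_m`. The helper name `incomplete_tickers` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def incomplete_tickers(prices: pd.DataFrame) -> pd.Series:
    """Return the number of missing monthly prices for each ticker that has any."""
    missing = prices.isna().sum()
    return missing[missing > 0].sort_values(ascending=False)


# Toy data (hypothetical): ticker 'NEW' only has prices for the last two months.
toy_prices = pd.DataFrame(
    {
        "OLD": [10.0, 11.0, 12.0, 13.0],
        "NEW": [np.nan, np.nan, 5.0, 5.5],
    },
    index=pd.to_datetime(["2024-10-01", "2024-11-01", "2024-12-01", "2025-01-01"]),
)

gaps = incomplete_tickers(toy_prices)
print(gaps)

# Keep only tickers with a complete history
complete = toy_prices.drop(columns=gaps.index)
print(list(complete.columns))
```

In Part 3, the `.dropna()` applied to `ret_6m` already excludes tickers that have no price at the start of the selection window; a check like this one only makes that exclusion visible.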



## Part 3. 10-stock portfolio (top 6M performers) and analysis of the following 6M
**Strategy:** Select the **10 stocks with the highest cumulative return** over the **first 6 months**.  
**Assumption:** **Equal-weighted** portfolio (10% each).  
We then compute the **monthly returns** of the 10 stocks and of the **portfolio** over the **last 6 months**.
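
The strategy can be illustrated on a small scale before applying it to the real data. The sketch below uses hypothetical tickers and prices (`AAA`, `BBB`, `CCC`), a 4-month selection window, a 2-month evaluation window, and a top-2 portfolio; it is a toy illustration, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy monthly prices (hypothetical): 4 months of selection window + 2 months of evaluation
prices = pd.DataFrame(
    {
        "AAA": [10.0, 11.0, 12.0, 15.0, 16.5, 16.5],
        "BBB": [20.0, 20.0, 21.0, 22.0, 19.8, 21.78],
        "CCC": [5.0, 5.0, 4.5, 4.0, 4.4, 4.4],
    }
)

selection = prices.iloc[:4]
evaluation_rows = 2
top_n = 2

# 1) Rank by cumulative return over the selection window (last / first - 1)
cum_ret = (selection.iloc[-1] / selection.iloc[0] - 1).sort_values(ascending=False)
picked = cum_ret.head(top_n).index.tolist()

# 2) Monthly returns in the evaluation window for the picked tickers
monthly = prices[picked].pct_change().iloc[-evaluation_rows:]

# 3) Equal weights: the portfolio return each month is the simple average
weights = np.repeat(1 / top_n, top_n)
portfolio = monthly.dot(weights)

# 4) Compound the monthly portfolio returns
cumulative = (1 + portfolio).prod() - 1

print(picked)
print(portfolio)
print(f"{cumulative:.2%}")
```

The selection window and the evaluation window do not overlap, so the ranking uses no prices from the months being evaluated.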


In [14]:

# Make sure we have 12 rows (12 months). If there are more, keep the 12 most recent.
if len(prices_m) >= 12:
    prices_12 = prices_m.tail(12)
else:
    prices_12 = prices_m.copy()
    print(f"Warning: only {len(prices_12)} months are available.")

# Split into 6M + 6M
first6 = prices_12.head(6)
last6  = prices_12.tail(6)

# Cumulative return over the first 6 months (last / first - 1)
ret_6m = (first6.ffill().iloc[-1] / first6.ffill().iloc[0] - 1).dropna().sort_values(ascending=False)

# Select the top 10
top10 = ret_6m.head(10).index.tolist()
top10


['GH', 'SSRM', 'SOUN', 'HMY', 'BABA', 'NGD', 'GFI', 'CNXC', 'GSAT', 'AEM']

In [15]:

# Monthly returns (pct_change) over the full 12-month window.
# Forward-fill explicitly: pandas deprecates the implicit fill inside pct_change.
retn = prices_12.ffill().pct_change(fill_method=None)

# Last 6 rows of returns for the top 10 (months 7..12)
retn_last6_top10 = retn[top10].iloc[-6:]

# Equal-weighted portfolio
weights = np.repeat(1/len(top10), len(top10))
port_last6 = retn_last6_top10.dot(weights)

retn_last6_top10.head(), port_last6.head()




(Ticker            GH      SSRM      SOUN       HMY      BABA       NGD  \
 Date                                                                     
 2025-04-01  0.108685  0.060818  0.144089  0.077183 -0.096801  0.072776   
 2025-05-01 -0.139953  0.111842  0.088267 -0.076140 -0.046806  0.118090   
 2025-06-01  0.281142  0.076923  0.061325 -0.042495 -0.003777  0.112360   
 2025-07-01 -0.212529 -0.062009 -0.037279 -0.036507  0.081641 -0.153535   
 2025-08-01  0.645193  0.615900  0.260407 -0.013373  0.119125  0.408115   
 
 Ticker           GFI      CNXC      GSAT       AEM  
 Date                                                
 2025-04-01  0.039151 -0.082315 -0.078619  0.084586  
 2025-05-01  0.020408  0.103252 -0.039542  0.003487  
 2025-06-01  0.029130 -0.055570  0.275731  0.011390  
 2025-07-01  0.029151 -0.016744 -0.002548  0.045657  
 2025-08-01  0.374384  0.020794  0.274159  0.159296  ,
 Date
 2025-04-01    0.032955
 2025-05-01    0.014290
 2025-06-01    0.074616
 2025-07-01   -0

### Summary tables

In [16]:

# 1) Selection ranking (return over the first 6 months)
ranking = ret_6m.to_frame("ret_acum_first6").assign(selected=lambda x: x.index.isin(top10))
display(ranking.head(20))

# 2) Monthly returns over the last 6 months, per stock (top 10)
display(retn_last6_top10)

# 3) Monthly returns over the last 6 months for the portfolio
port_df = port_last6.to_frame("portfolio_ret_last6")
display(port_df)

# 4) Metrics
port_cum_ret = (1 + port_last6).prod() - 1
port_vol = port_last6.std() * np.sqrt(12)  # approximate annualized volatility
print(f"""
Portfolio cumulative return (last 6M): {port_cum_ret:.2%}
Approx. annualized volatility (from 6M of data): {port_vol:.2%}
""" )


Ticker,ret_acum_first6,selected
GH,0.946984,True
SSRM,0.625608,True
SOUN,0.614314,True
HMY,0.369654,True
BABA,0.349561,True
NGD,0.349091,True
GFI,0.340413,True
CNXC,0.326714,True
GSAT,0.324444,True
AEM,0.267721,True


Date,GH,SSRM,SOUN,HMY,BABA,NGD,GFI,CNXC,GSAT,AEM
2025-04-01,0.108685,0.060818,0.144089,0.077183,-0.096801,0.072776,0.039151,-0.082315,-0.078619,0.084586
2025-05-01,-0.139953,0.111842,0.088267,-0.07614,-0.046806,0.11809,0.020408,0.103252,-0.039542,0.003487
2025-06-01,0.281142,0.076923,0.061325,-0.042495,-0.003777,0.11236,0.02913,-0.05557,0.275731,0.01139
2025-07-01,-0.212529,-0.062009,-0.037279,-0.036507,0.081641,-0.153535,0.029151,-0.016744,-0.002548,0.045657
2025-08-01,0.645193,0.6159,0.260407,-0.013373,0.119125,0.408115,0.374384,0.020794,0.274159,0.159296
2025-08-29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Date,portfolio_ret_last6
2025-04-01,0.032955
2025-05-01,0.01429
2025-06-01,0.074616
2025-07-01,-0.03647
2025-08-01,0.2864
2025-08-29,0.0



Portfolio cumulative return (last 6M): 39.55%
Approx. annualized volatility (from 6M of data): 40.15%
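
The last row of both return tables above is dated 2025-08-29 and contains only zero returns, unlike the other rows, which fall on the first day of a month. This pattern suggests that the download returned an extra row for the still-open current month that repeats the latest price. Below is a minimal sketch of how such a row could be dropped before computing returns, assuming that month-start dates mark complete monthly bars. `drop_partial_month` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def drop_partial_month(prices: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows dated on the first day of a month (assumed to be full monthly bars)."""
    return prices[prices.index.is_month_start]


# Toy data (hypothetical): the last row repeats the August price on a mid-month date
toy = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.1, 12.1]},
    index=pd.to_datetime(["2025-06-01", "2025-07-01", "2025-08-01", "2025-08-29"]),
)

clean = drop_partial_month(toy)
returns = clean.pct_change().dropna()
print(returns)
```

A zero-return row leaves the compounded return unchanged, but it does change the standard deviation used for the volatility estimate, so the two metrics are not affected equally.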



### Optional: Plots (matplotlib)

In [None]:

# Plot the portfolio's monthly return series over the last 6 months
plt.figure()
port_last6.plot(marker="o")
plt.title("Equal-weighted portfolio (top 10 by first-6M return): monthly returns (last 6M)")
plt.xlabel("Month")
plt.ylabel("Monthly return")
plt.grid(True)
plt.show()



### Comments and extensions
- You can try **other selection rules**: minimum 6M volatility, momentum streaks, liquidity filters, etc.
- If Yahoo changes the HTML, adjust the CSS/XPath selectors in `parse_rows_symbol_name` and `click_show_more_until`.
- In environments behind a proxy or firewall, `yfinance` may require additional configuration.
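
When the page structure changes, it can help to test the row-parsing logic against a saved copy of the HTML (for example, the contents of `driver.page_source`) without launching a browser. The following is a minimal, dependency-free sketch using the standard library's `html.parser`. The `RowCollector` class and the sample HTML are hypothetical and much simpler than Yahoo's real markup.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text nested inside a cell (e.g. within a link) is appended to that cell
        if self._in_td:
            self.rows[-1][-1] += data


# Hypothetical sample; in practice this would be the saved page source
sample_html = """
<table><tbody>
  <tr><td><a href="/quote/AAA/">AAA</a></td><td>Alpha Corp.</td></tr>
  <tr><td><a href="/quote/BBB/">BBB</a></td><td>Beta Inc.</td></tr>
</tbody></table>
"""

parser = RowCollector()
parser.feed(sample_html)
pairs = [(r[0].strip(), r[1].strip()) for r in parser.rows if len(r) >= 2]
print(pairs)
```

If the extracted pairs come back empty or shifted by a column on a real saved page, that is a sign the cell positions assumed in `parse_rows_symbol_name` no longer match the table layout.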
