# Platforms Queries

Guide notebook for testing and debugging queries against the `platforms` collection in MongoDB. It includes runnable examples, quick references, and interpretation notes so that third parties can understand what each helper returns.

**Quick navigation**
- Connection and base variables
- Basic Retrieval
- Slugs
- Operational Status
- Primary Domain
- DataSources: URLs
- DataSources: Links
- DataSources: Texts
- DataSources: Role / Kind
- Mobile Apps
- Social Profiles

## Connect to the DB and define examples

Use `get_db()` to open the default database (it reads credentials from `.env`). Adjust `slug` and `datasource_url` to reproduce the examples with another platform.

In [None]:
from src.DB.mongo import get_db

# Connect to MongoDB
db = get_db()

# Select the platforms collection
platforms = db["platforms"]

# Example inputs
slug = "roofstock"
datasource_url = "https://roofstock.com"

## Basic Retrieval
Basic helpers for fetching documents directly by slug or by a specific dataSource URL.

**What each helper does**
- `get_platform_by_slug`: returns the full document for a platform.
- `get_datasource_by_url`: returns only the dataSource sub-document that matches the URL.
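
As a rough sketch of what the positional `$` projection behind `get_datasource_by_url` might look like, the snippet below builds a filter and projection that could be passed to `find_one`. The field names (`slug`, `dataSources.url`) follow this notebook; the function `build_datasource_query` is hypothetical and is not part of `platforms_querys`.

```python
from typing import Any, Dict, Tuple


def build_datasource_query(slug: str, datasource_url: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: return (filter, projection) for one dataSource."""
    # The filter must reference the array field so "$" knows which element matched.
    query = {"slug": slug, "dataSources.url": datasource_url}
    # "dataSources.$" keeps only the first array element that satisfied the filter.
    projection = {"_id": 0, "slug": 1, "dataSources.$": 1}
    return query, projection


query, projection = build_datasource_query("roofstock", "https://roofstock.com")
print(query)
print(projection)
# With pymongo this pair could be used as: platforms.find_one(query, projection)
```

Because the projection keeps a single array element, the returned document would still carry a `dataSources` list, just with one entry in it.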


In [None]:
from src.DB.platforms_querys import get_platform_by_slug, get_datasource_by_url

# Fetch the full platform document
platform_doc = get_platform_by_slug(slug)

# Fetch a specific dataSource (uses the positional $ projection by default)
datasource_doc = get_datasource_by_url(slug, datasource_url)

print(f"Platform found: {platform_doc.get('name') if platform_doc else 'Not found'}")
print(f"DataSource found: {'Yes' if datasource_doc else 'No'}")
# Guard against a missing document so the loop does not fail on None
for ds in (datasource_doc or {}).get("dataSources", []):
    print(ds)

## Slugs
Helpers for listing, deduplicating, and detecting repeated slugs in `platforms`.

**What each helper does**
- `get_all_slugs`: returns all slugs (may include repeats, and empty values when `include_empty=True`).
- `get_unique_slugs`: uses `distinct` to keep only unique values.
- `get_repeated_slugs`: lists duplicated slugs with their `count`, sorted in descending order.
- `get_slugs_not_inactive`: returns slugs where `operational.status != 'inactive'`.
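
To make the "repeated with `count`" output easier to picture, here is a minimal in-memory sketch of the same idea, together with an aggregation pipeline that could compute it server-side. Both are assumptions for illustration: `repeated_values`, the output keys, and the pipeline are not taken from `platforms_querys`, and the sample slugs other than `roofstock` are made up.

```python
from collections import Counter
from typing import Any, Dict, List


def repeated_values(values: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical in-memory equivalent of a 'repeated slugs' report."""
    counts = Counter(values)
    # Keep only values seen more than once, highest count first.
    return [
        {"slug": value, "count": count}
        for value, count in counts.most_common()
        if count > 1
    ]


# Assumed pipeline for platforms.aggregate(...): group by slug, keep count > 1, sort desc.
REPEATED_SLUGS_PIPELINE = [
    {"$group": {"_id": "$slug", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$sort": {"count": -1}},
]

sample = ["roofstock", "example-a", "roofstock", "example-b", "example-a", "roofstock"]
print(repeated_values(sample))
# [{'slug': 'roofstock', 'count': 3}, {'slug': 'example-a', 'count': 2}]
```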

In [None]:
from src.DB.platforms_querys import (
    get_all_slugs,
    get_unique_slugs,
    get_repeated_slugs,
    get_slugs_not_inactive,
)

slugs = get_all_slugs()  # get_all_slugs(include_empty: bool = False) -> List[str]; returns every slug in the collection, may include repeats
slugs_unique = get_unique_slugs()  # get_unique_slugs(include_empty: bool = False) -> List[str]; returns unique slugs using distinct
slugs_repetidos = get_repeated_slugs()  # get_repeated_slugs(include_empty: bool = False) -> List[Dict[str, Any]]; returns repeated slugs with count, sorted by count desc
slugs_activos = get_slugs_not_inactive()  # get_slugs_not_inactive() -> List[str]; returns slugs with operational.status != inactive

In [None]:
print("get_all_slugs (includes repeats)")
print(slugs)
print(len(slugs))
print("")
print("get_unique_slugs (unique)")
print(slugs_unique)
print(len(slugs_unique))
print("")
print("get_repeated_slugs (repeated slugs)")
print(slugs_repetidos)
print(len(slugs_repetidos))
print("")
print("get_slugs_not_inactive (excludes inactive)")
print(slugs_activos)
print(len(slugs_activos))

## Operational Status
Helpers for managing the operational state (`operational.status`, `notes`, `updatedAt`).

**What each helper does**
- `manage_operational_status`: reads (get), writes (set), or deletes (delete) the operational status. It manages `updatedAt` automatically.
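
The idempotent and partial-update behavior exercised in the cells below can be sketched on a plain dictionary. This is only one possible reading: `set_operational` and its `{"modified": ...}` return value are hypothetical, and the real helper may return something different.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def set_operational(doc: Dict[str, Any], status: Optional[str] = None,
                    notes: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical in-memory version of an idempotent 'set' action."""
    current = doc.get("operational", {})
    changes: Dict[str, Any] = {}
    if status is not None and current.get("status") != status:
        changes["status"] = status
    if notes is not None and current.get("notes") != notes:
        changes["notes"] = notes
    if not changes:
        # Nothing differs, so updatedAt is left untouched.
        return {"modified": False}
    changes["updatedAt"] = datetime.now(timezone.utc)
    doc["operational"] = {**current, **changes}
    return {"modified": True}


doc = {"slug": "roofstock"}
print(set_operational(doc, status="active", notes="Initial check"))  # {'modified': True}
print(set_operational(doc, status="active", notes="Initial check"))  # {'modified': False}
print(set_operational(doc, notes="Updated notes only"))              # {'modified': True}
print(doc["operational"]["status"])                                  # active
```

Only fields passed explicitly are compared, which is why updating `notes` alone leaves `status` as it was.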

In [None]:
from src.DB.platforms_querys import manage_operational_status

In [None]:
# 1. Set status (Creates if not exists)
print("--- Set Status 'active' ---")
res_set = manage_operational_status(slug, action="set", status="active", notes="Initial check")
print(res_set)

# 2. Get status
print("\n--- Get Status ---")
curr_op = manage_operational_status(slug, action="get")
print(curr_op)

# 2.1 Default action is 'get'
print("\n--- Get Status (Default) ---")
curr_op_default = manage_operational_status(slug)
print(curr_op_default)

# 3. Idempotency Check (No changes)
print("\n--- Set Same Status (Idempotency) ---")
res_no_change = manage_operational_status(slug, action="set", status="active", notes="Initial check")
print(res_no_change)

# 4. Partial Update (Update notes only, status remains)
print("\n--- Update Notes Only ---")
res_notes = manage_operational_status(slug, action="set", notes="Updated notes only")
print(res_notes)
print(manage_operational_status(slug, action="get"))

# 5. Delete Status
print("\n--- Delete Status ---")
res_del = manage_operational_status(slug, action="delete")
print(res_del)

## Primary Domain
Helpers for listing and managing the `primaryDomain` field.

**What each helper does**
- `get_all_primary_domains`: returns every `primaryDomain` (includes empty or repeated values if they exist).
- `get_unique_primary_domains`: the same values, made unique via `distinct`.
- `get_repeated_primary_domains`: duplicates with their `count`, sorted in descending order.
- `manage_primary_domain`: reads (get), writes (set), or deletes (delete) the `primaryDomain`.

In [None]:
from src.DB.platforms_querys import (
    get_all_primary_domains,
    get_unique_primary_domains,
    get_repeated_primary_domains,
)

primary_domains = get_all_primary_domains()  # get_all_primary_domains(include_empty: bool = False) -> List[str]; returns every primaryDomain, may include repeats
primary_domains_unique = get_unique_primary_domains()  # get_unique_primary_domains(include_empty: bool = False) -> List[str]; returns unique primaryDomain values using distinct
primary_domains_repetidos = get_repeated_primary_domains()  # get_repeated_primary_domains(include_empty: bool = False) -> List[Dict[str, Any]]; returns repeated primaryDomain values with count, sorted by count desc


In [None]:
print("get_all_primary_domains (includes repeats)")
print(primary_domains)
print(len(primary_domains))
print("")

print("get_unique_primary_domains (unique)")
print(primary_domains_unique)
print(len(primary_domains_unique))
print("")

print("get_repeated_primary_domains (repeated primaryDomain values)")
print(primary_domains_repetidos)
print(len(primary_domains_repetidos))

In [None]:
from src.DB.platforms_querys import manage_primary_domain

# 1. Get Primary Domain
print("--- Get Primary Domain ---")
current_pd = manage_primary_domain(slug, action="get")
print(f"Current: {current_pd}")

# 2. Set/Update Primary Domain
print("\n--- Set Primary Domain ---")
res_set = manage_primary_domain(slug, action="set", domain="https://roofstock.com")
print(res_set)

# 3. Verify
print(f"New value: {manage_primary_domain(slug, action='get')}")

## DataSources
Helpers for inspecting the URLs, links, and texts stored in the model's `dataSources`. Each subsection shows the helper, its key parameters, and how to interpret the result.

### URLs: extraction and verification
Retrieves unique URLs, repeated URLs, and URLs filtered by `primaryDomain` for a given `slug`.

#### `get_unique_datasource_urls`
- Input: `slug`
- Output: list of unique `dataSources.url` values for that platform.
- Useful for a quick look at the URLs already loaded.

In [None]:
from src.DB.platforms_querys import get_unique_datasource_urls

datasource_urls_unique = get_unique_datasource_urls(slug)  # get_unique_datasource_urls(slug: str) -> List[str]; returns unique dataSources.url values for the platform

In [None]:
print("get_unique_datasource_urls (unique URLs in dataSources.url)")
print(datasource_urls_unique)
print(len(datasource_urls_unique))

#### `get_repeated_datasource_urls`
- Input: `slug`
- Output: repeated URLs with their `count`.
- Useful for detecting duplicate scraping or poorly normalized data.

In [None]:
from src.DB.platforms_querys import get_repeated_datasource_urls

datasource_urls_repeated = get_repeated_datasource_urls(slug)  # get_repeated_datasource_urls(slug: str) -> List[Dict[str, Any]]; returns repeated dataSources.url values with count

In [None]:
print("get_repeated_datasource_urls (repeated URLs in dataSources.url with count)")
print(datasource_urls_repeated)
print(len(datasource_urls_repeated))

#### `unique_platform_urls_from_primary_domain`
Compares `dataSources.url` against `primaryDomain` and returns the URLs that are valid for that platform.
- `mode="loose"`: accepts subdomains and variations that contain the base label.
- `mode="strict"`: accepts only the exact `primaryDomain` and its direct subdomains.
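
The difference between the two modes can be illustrated with a small matcher built on `urllib.parse`. This is a sketch under assumptions, not the helper's actual logic: `matches_primary_domain` is a hypothetical name, "direct subdomains" is read here as any host ending in `.<primaryDomain>`, and the base label is taken as the second-to-last label, which would need a public-suffix list to handle domains such as `.co.uk`.

```python
from urllib.parse import urlparse


def _host(value: str) -> str:
    """Return the lowercase hostname of a URL or bare domain, without 'www.'."""
    parsed = urlparse(value if "://" in value else f"//{value}")
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def matches_primary_domain(url: str, primary_domain: str, mode: str = "loose") -> bool:
    """Hypothetical matcher illustrating one reading of 'strict' vs 'loose'."""
    host, primary = _host(url), _host(primary_domain)
    if not host or not primary:
        return False
    # Both modes accept the exact domain or a host ending in ".<primary>".
    if host == primary or host.endswith("." + primary):
        return True
    if mode == "strict":
        return False
    # loose: additionally accept hosts that carry the base label as a whole label.
    labels = primary.split(".")
    base_label = labels[-2] if len(labels) >= 2 else labels[0]
    return base_label in host.split(".")


print(matches_primary_domain("https://blog.roofstock.com/x", "roofstock.com", "strict"))   # True
print(matches_primary_domain("https://roofstock.example.org", "roofstock.com", "strict"))  # False
print(matches_primary_domain("https://roofstock.example.org", "roofstock.com", "loose"))   # True
```

Matching on whole labels rather than substrings keeps a host like `notroofstock.com` out of both modes.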

In [None]:
from src.DB.platforms_querys import unique_platform_urls_from_primary_domain

urls_loose = unique_platform_urls_from_primary_domain(slug, mode="loose")  # unique_platform_urls_from_primary_domain(slug: str, mode: str = "loose") -> List[str]; unique dataSources URLs that match primaryDomain in loose mode
urls_strict = unique_platform_urls_from_primary_domain(slug, mode="strict")  # same signature; unique dataSources URLs that match primaryDomain in strict mode


In [None]:
print("unique_platform_urls_from_primary_domain")
print("mode=loose: includes primaryDomain and its subdomains. Also accepts cases like example.something.com if they contain the base label of primaryDomain")
print(urls_loose)
print(len(urls_loose))
print("")
print("mode=strict: only includes primaryDomain and its direct subdomains. Rejects cases like example.something.com unless they are a subdomain of primaryDomain")
print(urls_strict)
print(len(urls_strict))

### Links: navigation within each URL
Extracts the links stored in `dataSources.links`, filtered by page section.

#### `get_links_from_platform_datasource`
- `sections`: `None` (head, header, main, and footer), a string, or a list of sections.
- Returns unique links for the given `datasource_url`, useful for analyzing the navigation structure.
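
How a `sections` argument that accepts `None`, a string, or a list might be normalized is sketched below. The storage layout is an assumption: the example treats `dataSources.links` as a dict keyed by section, and `normalize_sections` and `collect_links` are hypothetical names.

```python
from typing import Dict, Iterable, List, Optional, Union

DEFAULT_SECTIONS = ("head", "header", "main", "footer")


def normalize_sections(sections: Optional[Union[str, Iterable[str]]]) -> List[str]:
    """Turn None, a string, or an iterable into a validated list of sections."""
    if sections is None:
        return list(DEFAULT_SECTIONS)
    if isinstance(sections, str):
        sections = [sections]
    selected = list(sections)
    unknown = [s for s in selected if s not in DEFAULT_SECTIONS]
    if unknown:
        raise ValueError(f"Unknown sections: {unknown}")
    return selected


def collect_links(links_by_section: Dict[str, List[str]], sections=None) -> List[str]:
    """Hypothetical: gather unique links from the chosen sections, keeping order."""
    seen: Dict[str, None] = {}
    for section in normalize_sections(sections):
        for link in links_by_section.get(section, []):
            seen.setdefault(link, None)
    return list(seen)


links = {"header": ["/a", "/b"], "main": ["/b", "/c"], "footer": ["/a", "/d"]}
print(collect_links(links))                      # ['/a', '/b', '/c', '/d']
print(collect_links(links, "main"))              # ['/b', '/c']
print(collect_links(links, ["main", "footer"]))  # ['/b', '/c', '/a', '/d']
```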

In [None]:
from src.DB.platforms_querys import get_links_from_platform_datasource

# get_links_from_platform_datasource(slug: str, datasource_url: str, sections=None) -> List[str]
# Returns unique links from dataSources.links for a specific URL within a platform.
# sections controls which parts of the page the links are taken from:
# - None: head, header, main, footer
# - "main": only that section
# - ["header", "main"]: a combination of sections

links_all = get_links_from_platform_datasource(slug, datasource_url)  # sections=None
links_main = get_links_from_platform_datasource(slug, datasource_url, sections="main")
links_footer = get_links_from_platform_datasource(slug, datasource_url, sections="footer")
links_combo = get_links_from_platform_datasource(slug, datasource_url, sections=[ "main", "footer"])

In [None]:
print("get_links_from_platform_datasource sections=None (head header main footer)")
print(links_all)
print(len(links_all))
print("")

print("get_links_from_platform_datasource sections=main (main only)")
print(links_main)
print(len(links_main))
print("")

print("get_links_from_platform_datasource sections=footer (footer only)")
print(links_footer)
print(len(links_footer))
print("")

print("get_links_from_platform_datasource sections=[main, footer] (combination)")
print(links_combo)
print(len(links_combo))

### Texts: content extracted from each URL
Retrieves the already-parsed texts from `dataSources.texts` to audit scraping quality.

#### `get_texts_from_platform_datasource`
- `sections`: `None`, a string, or a list selecting among head/header/main/footer.
- `dedupe`: controls whether the texts are deduplicated.
- Useful for checking that the extracted content is coherent before computing embeddings.
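
One way to use the `dedupe=False` output for auditing is to compare it against an order-preserving deduplicated copy and see which strings repeat most. The helpers below (`dedupe_keep_order`, `most_repeated`) are hypothetical and the sample texts are made up; they only illustrate the kind of check the two modes make possible.

```python
from collections import Counter
from typing import List, Tuple


def dedupe_keep_order(texts: List[str]) -> List[str]:
    """Drop repeated strings while keeping the first occurrence of each."""
    return list(dict.fromkeys(texts))


def most_repeated(texts: List[str], top: int = 3) -> List[Tuple[str, int]]:
    """Return up to `top` strings that occur more than once, most frequent first."""
    return [(text, count) for text, count in Counter(texts).most_common(top) if count > 1]


# Made-up sample standing in for a dedupe=False result.
raw = ["Sign in", "Invest in rentals", "Sign in", "Contact", "Sign in"]
unique = dedupe_keep_order(raw)

print(unique)                  # ['Sign in', 'Invest in rentals', 'Contact']
print(most_repeated(raw))      # [('Sign in', 3)]
print(len(unique) / len(raw))  # 0.6
```

A low unique-to-raw ratio would suggest that boilerplate such as navigation labels dominates the scraped text.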

In [None]:
from src.DB.platforms_querys import get_texts_from_platform_datasource

# get_texts_from_platform_datasource(slug: str, datasource_url: str, sections=None, dedupe: bool = True) -> List[str]
# Returns texts from dataSources.texts for a specific URL within a platform.
# sections controls which parts of the page the texts are taken from:
# - None: head, header, main, footer
# - "main": only that section
# - ["header", "main"]: a combination of sections
# dedupe controls whether the output is deduplicated:
# - True: unique list without duplicates
# - False: list as stored, without deduplication

texts_all = get_texts_from_platform_datasource(slug, datasource_url)  # sections=None dedupe=True
texts_main = get_texts_from_platform_datasource(slug, datasource_url, sections="main")  # dedupe=True
texts_header = get_texts_from_platform_datasource(slug, datasource_url, sections="header")  # dedupe=True
texts_combo_no_dedupe = get_texts_from_platform_datasource(slug, datasource_url, sections=["header", "main", "footer"], dedupe=False)

In [None]:
print("get_texts_from_platform_datasource sections=None dedupe=True (head header main footer)")
print(texts_all)
print(len(texts_all))
print("")

print("get_texts_from_platform_datasource sections=main dedupe=True (main only)")
print(texts_main)
print(len(texts_main))
print("")

print("get_texts_from_platform_datasource sections=header dedupe=True (header only)")
print(texts_header)
print(len(texts_header))
print("")

print("get_texts_from_platform_datasource sections=[header, main, footer] dedupe=False (combination without deduplication)")
print(texts_combo_no_dedupe)
print(len(texts_combo_no_dedupe))

### Role: labeling URLs by their function
Lets you read, assign, or delete the `role` label associated with each `datasource_url` (e.g. `official_site`, `blog`, `careers`).

#### `datasource_role`
- `action`: `get`, `set`, `delete`.
- `role`: required only for `set`.
- Useful for normalizing which URL is treated as official or secondary.
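
For readers curious how `set` and `delete` on a per-dataSource field could map to MongoDB update operators, the sketch below builds the filter and update documents using the positional `$` operator. `build_role_update` is a hypothetical helper, and nothing here confirms that `datasource_role` is implemented this way.

```python
from typing import Any, Dict, Optional, Tuple


def build_role_update(slug: str, datasource_url: str, action: str,
                      role: Optional[str] = None) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical: return (filter, update) for setting or unsetting a role."""
    # Matching on dataSources.url lets "$" point at the matched array element.
    query = {"slug": slug, "dataSources.url": datasource_url}
    if action == "set":
        if not role:
            raise ValueError("role is required when action='set'")
        return query, {"$set": {"dataSources.$.role": role}}
    if action == "delete":
        return query, {"$unset": {"dataSources.$.role": ""}}
    raise ValueError(f"Unsupported action: {action}")


query, update = build_role_update("roofstock", "https://roofstock.com", "set", "official_site")
print(update)  # {'$set': {'dataSources.$.role': 'official_site'}}
# With pymongo this pair could be used as: platforms.update_one(query, update)
```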

In [None]:
from src.DB.platforms_querys import datasource_role

current_role = datasource_role(slug, datasource_url, action="get")  # datasource_role(slug: str, datasource_url: str, action: str = "get", role: Optional[str] = None) -> Union[str, None, Dict[str, int]]
print("datasource_role action=get (current role)")
print(current_role)

In [None]:
set_result = datasource_role(slug, datasource_url, action="set", role="official_site")
print("datasource_role action=set role=official_site (update result)")
print(set_result)

In [None]:
current_role_after_set = datasource_role(slug, datasource_url, action="get")
print("datasource_role action=get (role after set)")
print(current_role_after_set)

In [None]:
delete_result = datasource_role(slug, datasource_url, action="delete")
print("datasource_role action=delete (unset result)")
print(delete_result)

In [None]:
current_role_after_delete = datasource_role(slug, datasource_url, action="get")
print("datasource_role action=get (role after delete)")
print(current_role_after_delete)

### Kind: classifying the URL type
Manages the `kind` field for each `datasource_url` (e.g. `website`, `social`, `marketplace`).

#### `datasource_kind`
- `action`: `get`, `set`, `delete`.
- `kind`: required only for `set`.
- Useful for debugging duplicate content or separating social domains from landing pages.

In [None]:
from src.DB.platforms_querys import datasource_kind

current_kind = datasource_kind(slug, datasource_url, action="get")  # datasource_kind(slug: str, datasource_url: str, action: str = "get", kind: Optional[str] = None) -> Union[str, None, Dict[str, int]]
print("datasource_kind action=get (current kind)")
print(current_kind)


In [None]:
set_result = datasource_kind(slug, datasource_url, action="set", kind="website")
print("datasource_kind action=set kind=website (update result)")
print(set_result)

In [None]:
current_kind_after_set = datasource_kind(slug, datasource_url, action="get")
print("datasource_kind action=get (kind after set)")
print(current_kind_after_set)

In [None]:
delete_result = datasource_kind(slug, datasource_url, action="delete")
print("datasource_kind action=delete (unset result)")
print(delete_result)

In [None]:
current_kind_after_delete = datasource_kind(slug, datasource_url, action="get")
print("datasource_kind action=get (kind after delete)")
print(current_kind_after_delete)

## Mobile Apps
Helpers for managing mobile applications in the `mobileApps` field.

### `upsert_mobile_app(slug, url, store)`
Creates or updates an entry in `mobileApps`.
- If the `url` already exists, it updates the `store`.
- If it does not exist, it adds the new app.

### `get_mobile_apps(slug, store=None)`
Returns the list of apps. It can optionally filter by `store`.

### `remove_mobile_app(slug, url=None, store=None)`
Removes entries from `mobileApps`, depending on the arguments passed:
- `remove_mobile_app(slug)` -> **Removes ALL apps** (leaves an empty list `[]`).
- `remove_mobile_app(slug, url="...")` -> Removes the app with that URL.
- `remove_mobile_app(slug, store="...")` -> Removes **all** apps from that store.
- `remove_mobile_app(slug, url="...", store="...")` -> Removes only the entry that matches both exactly.

### `delete_mobile_apps_field(slug)`
Removes the `mobileApps` field from the document entirely.
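
The upsert rule described above (match on `url`, update `store`, otherwise append) can be sketched on a plain list. `upsert_app` and its string return values are hypothetical; the real helper works against MongoDB and may report results differently.

```python
from typing import Dict, List


def upsert_app(apps: List[Dict[str, str]], url: str, store: str) -> str:
    """Hypothetical in-memory upsert keyed on the app URL."""
    for app in apps:
        if app.get("url") == url:
            if app.get("store") == store:
                return "unchanged"
            # Same URL, different store: update in place.
            app["store"] = store
            return "updated"
    # URL not present yet: append a new entry.
    apps.append({"url": url, "store": store})
    return "inserted"


apps: List[Dict[str, str]] = []
play_url = "https://play.google.com/store/apps/details?id=com.example"
print(upsert_app(apps, play_url, "google_play"))  # inserted
print(upsert_app(apps, play_url, "google_play"))  # unchanged
print(upsert_app(apps, play_url, "apple_store"))  # updated
print(len(apps))                                  # 1
```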

In [None]:
from src.DB.platforms_querys import (
    upsert_mobile_app,
    get_mobile_apps,
    remove_mobile_app,
    delete_mobile_apps_field
)

In [None]:
# 1. Upsert: add Google Play
print("--- Upsert Google Play ---")
res1 = upsert_mobile_app(slug, "https://play.google.com/store/apps/details?id=com.example", "google_play")
print(res1)

In [None]:
# 2. Upsert: add App Store
print("\n--- Upsert App Store ---")
res2 = upsert_mobile_app(slug, "https://apps.apple.com/app/id123456", "apple_store")
print(res2)

In [None]:
# 3. Get All
print("\n--- Get All Apps ---")
apps = get_mobile_apps(slug)
print(apps)

# 4. Get by Store
print("\n--- Get Google Play Apps ---")
gp_apps = get_mobile_apps(slug, store="google_play")
print(gp_apps)

In [None]:
# 5. Remove by Store (Example)
print("\n--- Remove Google Play Apps ---")
rem_res = remove_mobile_app(slug, store="google_play")
print(rem_res)
print(get_mobile_apps(slug))

In [None]:
# 6. Delete Field
print("\n--- Delete Mobile Apps Field ---")
del_res = delete_mobile_apps_field(slug)
print(del_res)
doc_check = get_platform_by_slug(slug, {"mobileApps": 1})
print("Field exists?", "mobileApps" in doc_check if doc_check else "Doc not found")

## Social Profiles
Helpers for managing social profiles in the `socialProfiles` field.

### `upsert_social_profile(slug, url, platform)`
Creates or updates an entry in `socialProfiles`.
- If the `url` already exists, it updates the `platform`.
- If it does not exist, it adds the new profile.

### `get_social_profiles(slug, platform=None)`
Returns the list of profiles. It can optionally filter by `platform`.

### `remove_social_profile(slug, url=None, platform=None)`
Removes entries from `socialProfiles`, depending on the arguments passed:
- `remove_social_profile(slug)` -> **Removes ALL profiles** (leaves an empty list `[]`).
- `remove_social_profile(slug, url="...")` -> Removes the profile with that URL.
- `remove_social_profile(slug, platform="...")` -> Removes **all** profiles on that platform.
- `remove_social_profile(slug, url="...", platform="...")` -> Removes only the entry that matches both exactly.

### `delete_social_profiles_field(slug)`
Removes the `socialProfiles` field from the document entirely.
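
The four removal cases listed above could translate into MongoDB update documents roughly as follows. This is a sketch under assumptions: `build_remove_update` is a hypothetical name, and it is not confirmed that `remove_social_profile` uses `$pull` internally.

```python
from typing import Any, Dict, Optional


def build_remove_update(url: Optional[str] = None,
                        platform: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical: build the update document for each removal case."""
    condition: Dict[str, str] = {}
    if url is not None:
        condition["url"] = url
    if platform is not None:
        condition["platform"] = platform
    if not condition:
        # No filters: reset the array to an empty list.
        return {"$set": {"socialProfiles": []}}
    # $pull removes every array element matching all given fields.
    return {"$pull": {"socialProfiles": condition}}


print(build_remove_update())
# {'$set': {'socialProfiles': []}}
print(build_remove_update(platform="linkedin"))
# {'$pull': {'socialProfiles': {'platform': 'linkedin'}}}
```

Passing both `url` and `platform` puts both keys in the `$pull` condition, so only elements matching the two values at once would be removed.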

In [None]:
from src.DB.platforms_querys import (
    upsert_social_profile,
    get_social_profiles,
    remove_social_profile,
    delete_social_profiles_field
)

In [None]:
# 1. Upsert: add LinkedIn
print("--- Upsert LinkedIn ---")
res1 = upsert_social_profile(slug, "https://www.linkedin.com/company/example", "linkedin")
print(res1)

In [None]:
# 2. Upsert: add Twitter
print("\n--- Upsert Twitter ---")
res2 = upsert_social_profile(slug, "https://twitter.com/example_io", "twitter")
print(res2)

In [None]:
# 3. Get All
print("\n--- Get All Social Profiles ---")
profiles = get_social_profiles(slug)
print(profiles)

# 4. Get by Platform
print("\n--- Get LinkedIn Profiles ---")
li_profiles = get_social_profiles(slug, platform="linkedin")
print(li_profiles)

In [None]:
# 5. Remove by Platform (Example)
print("\n--- Remove LinkedIn Profiles ---")
rem_res = remove_social_profile(slug, platform="linkedin")
print(rem_res)
print(get_social_profiles(slug))

In [None]:
# 6. Delete Field
print("\n--- Delete Social Profiles Field ---")
del_res = delete_social_profiles_field(slug)
print(del_res)
doc_check = get_platform_by_slug(slug, {"socialProfiles": 1})
print("Field exists?", "socialProfiles" in doc_check if doc_check else "Doc not found")