<a href="https://colab.research.google.com/github/HenryZumaeta/py4cd_EPC2025/blob/main/C04/C04_Script01_ManipulacionPandas_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# JSON Format (JavaScript Object Notation)

JSON syntax is defined by a context-free grammar, which can be written down as a simplified EBNF (Extended Backus-Naur Form) specification.

EBNF fundamentals: EBNF extends classic Backus-Naur notation with operators that express repetition, optionality, and grouping more concisely.

Key components of JSON:

  - Values: objects, arrays, strings, numbers, and the literals `true`, `false`, and `null`.
  - Objects (unordered collections of key:value pairs) and arrays (ordered sequences of values). Both may be empty.
  - Strings (delimited by double quotes).
  - Numbers:
    - Leading zeros are not allowed (a lone 0, as in `0` or `0.5`, is fine).
    - Commas and underscores are not allowed as digit separators.
    - Hexadecimal and binary notation are not allowed.
    - Special values such as `NaN` and `Infinity` are not allowed.
    - These restrictions help keep number literals interoperable between languages.
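
The number rules above can be checked mechanically. The following is a minimal sketch, assuming the number production of the JSON grammar translated into a Python regular expression. The names `JSON_NUMBER` and `is_json_number` are illustrative and do not belong to any library.

```python
import re

# EBNF-style sketch of the JSON number production:
#   number = [ "-" ] int [ frac ] [ exp ]
#   int    = "0" | digit1-9 { digit }
#   frac   = "." digit { digit }
#   exp    = ( "e" | "E" ) [ "+" | "-" ] digit { digit }
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")


def is_json_number(text: str) -> bool:
    """Return True if `text` is a valid JSON number literal."""
    return JSON_NUMBER.fullmatch(text) is not None


# Try a few candidates: valid literals, a leading zero, a separator,
# hexadecimal notation, and a special value.
for candidate in ["0", "-12.5e3", "01", "1_000", "0x1F", "NaN"]:
    print(f"{candidate!r}: {is_json_number(candidate)}")
```

The character class `[0-9]` is used instead of `\d` because, for Python strings, `\d` also matches non-ASCII digits, which the JSON grammar does not accept.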
    

In [5]:
# Parser in Python
import json
# The standard-library module rejects leading zeros, as the JSON grammar in the RFC requires
# (by default it still accepts NaN and Infinity as an extension)

json.loads('{"x":01}')

JSONDecodeError: Expecting ',' delimiter: line 1 column 7 (char 6)
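
Python's `json` module rejects the leading zero above, but by default it still accepts the non-standard constants `NaN`, `Infinity`, and `-Infinity`. The sketch below shows one way to refuse them as well, using the documented `parse_constant` hook of `json.loads`. The helper names `reject_constant` and `strict_loads` are hypothetical.

```python
import json


def reject_constant(name):
    """Called by the decoder for 'NaN', 'Infinity' and '-Infinity'."""
    raise ValueError(f"non-standard JSON constant: {name}")


def strict_loads(text):
    """Parse JSON text, refusing the special float constants."""
    return json.loads(text, parse_constant=reject_constant)


for text in ['{"x": 1}', '{"x": 01}', '{"x": NaN}']:
    try:
        print(text, "->", strict_loads(text))
    except ValueError as error:  # JSONDecodeError is a subclass of ValueError
        print(text, "-> rejected:", error)
```

Catching `ValueError` covers both cases, since `json.JSONDecodeError` derives from it.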

In [6]:
# Validation with formal grammars
from lark import Lark
# Library for building custom parsers

There are variants with extensions:
  - JSON5
  - Hjson
  - BSON (MongoDB)
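
To see why such variants exist, it helps to check what a strict parser refuses. The sketch below feeds Python's `json` module a few constructs that JSON5 permits (trailing commas, comments, single-quoted strings, unquoted keys). The sample texts and the helper `is_strict_json` are illustrative assumptions.

```python
import json

# Constructs allowed by JSON5 but not by standard JSON.
samples = {
    "trailing comma": '{"a": 1,}',
    "line comment": '{"a": 1} // note',
    "single quotes": "{'a': 1}",
    "unquoted key": '{a: 1}',
}


def is_strict_json(text):
    """Return True if the standard-library parser accepts `text`."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


results = {name: is_strict_json(text) for name, text in samples.items()}
for name, accepted in results.items():
    print(f"{name}: {'accepted' if accepted else 'rejected'}")
```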

## Mathematical and logical properties
- Determinism
- Injectivity
- Completeness
- Self-descriptiveness


# Applications
- Serialization of models and configurations
  - Hyperparameters: commonly stored in JSON files.
  - Lightweight models: some frameworks (TensorFlow.js, spaCy, Hugging Face) use JSON for model metadata or configuration.
  - ML pipelines: MLflow or Kubeflow
    - An ML pipeline is an automated sequence of stages that turns data into deployable models, including:
      - Data ingestion and preparation.
      - Training and hyperparameter tuning.
      - Evaluation and validation.
      - Model registration and versioning.
      - Deployment and monitoring in production.

      Pipelines enable:
      - Reproducibility: same code -> same model.
      - Automation: scheduled or event-triggered execution.
      - Scalability: efficient use of computational resources.
      - Governance: traceability of experiments and decisions.
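
As a small illustration of storing hyperparameters in JSON, the sketch below writes a configuration to disk and reads it back. The hyperparameter names and values and the file name `hyperparams.json` are made up for the example. A temporary directory is used so nothing is left behind.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical hyperparameters, for illustration only.
hyperparams = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "layers": [128, 64],
    "dropout": None,  # serialized as JSON null
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "hyperparams.json"
    # sort_keys gives a stable key order, which keeps diffs between runs readable
    config_path.write_text(
        json.dumps(hyperparams, indent=2, sort_keys=True), encoding="utf-8"
    )
    restored = json.loads(config_path.read_text(encoding="utf-8"))

print(restored == hyperparams)
```

Only JSON-compatible types (dicts with string keys, lists, strings, numbers, booleans, `None`) survive this round trip unchanged. Tuples, for instance, come back as lists.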

- Data exchange in APIs
  - RESTful APIs: a RESTful API respects the six architectural constraints of REST (Representational State Transfer)
    - Client-server
    - Stateless
    - Cacheable
    - Uniform interface
    - Layered system
    - Code on demand (optional)

  - Four key concepts
    - Resources
      ```
      GET /api/v1/users
      GET /api/v1/users/123
      POST /api/v1/models
      DELETE /api/v1/datasets_hpzl/456
      ```
    - Representations (JSON, XML, HTML, Markdown, etc.)
    - HTTP methods
    - Status codes
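
The four concepts can be tied together in a few lines. The sketch below is a toy dispatcher, not a real web server: it maps an HTTP method and a resource path to a status code and a JSON representation. The in-memory `USERS` store, the `handle` function, and the error messages are assumptions made for illustration.

```python
import json

# Hypothetical in-memory store standing in for a database of user resources.
USERS = {123: {"id": 123, "name": "Ada"}}

NOT_FOUND = json.dumps({"error": "unknown resource"})
NOT_ALLOWED = json.dumps({"error": "method not allowed"})


def handle(method, path):
    """Return (status_code, body) for a tiny, illustrative subset of a REST-style API."""
    parts = path.strip("/").split("/")  # "/api/v1/users/123" -> ["api", "v1", "users", "123"]
    if parts[:3] != ["api", "v1", "users"] or len(parts) > 4:
        return 404, NOT_FOUND
    if len(parts) == 3:  # the collection resource
        if method == "GET":
            return 200, json.dumps(list(USERS.values()))
        return 405, NOT_ALLOWED
    try:
        user_id = int(parts[3])
    except ValueError:
        return 404, NOT_FOUND
    if user_id not in USERS:
        return 404, NOT_FOUND
    if method == "GET":
        return 200, json.dumps(USERS[user_id])
    if method == "DELETE":
        del USERS[user_id]
        return 204, ""  # success with no response body
    return 405, NOT_ALLOWED


print(handle("GET", "/api/v1/users/123"))
print(handle("GET", "/api/v1/users/999"))
```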
    

# Code example


In [9]:
import pandas as pd
import numpy as np

In [10]:
# Let's load a JSON file from a dataset, using the connection between a Kaggle
# server (the data provider) and the machine that GCE provides for this
# notebook

# Create a hidden directory (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle

# Create a file inside the .kaggle directory
!touch ~/.kaggle/kaggle.json

# Define the credentials these two systems will use to communicate
api_token = {"username":"xxxxxxxxx","key":"xxxxxxxxx"}

# The json module
import json

# Dump (move an object from RAM to disk) the contents of the
# api_token variable into the file ~/.kaggle/kaggle.json
# (on Colab the home directory ~ is /root)
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(api_token, file)

# Set appropriate permissions: readable and writable by the owner only
!chmod 600 ~/.kaggle/kaggle.json
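
The same steps can be done without shell commands or a hard-coded `/root` path. The following is a sketch under the assumption of a POSIX system. The function `write_credentials` is hypothetical, and the demonstration writes to a temporary directory rather than to the real `~/.kaggle`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_credentials(directory, token):
    """Write `token` as JSON to <directory>/kaggle.json, accessible only by the owner."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "kaggle.json"
    # os.open lets us set the permission bits at the moment the file is created
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as file:
        json.dump(token, file)
    os.chmod(path, 0o600)  # also tighten a file that already existed
    return path


# Demonstration with placeholder credentials in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = write_credentials(Path(tmp_dir) / ".kaggle", {"username": "xxxxxxxxx", "key": "xxxxxxxxx"})
    print(saved.name, oct(saved.stat().st_mode & 0o777))
```

In the notebook, the call would presumably be `write_credentials(Path.home() / ".kaggle", api_token)`.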

In [11]:
# Let's run a search
!kaggle datasets list -s arXiv

ref                                                  title                                           size  lastUpdated                 downloadCount  voteCount  usabilityRating  
---------------------------------------------------  ----------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
Cornell-University/arxiv                             arXiv Dataset                             1633957934  2025-11-01 23:51:19.480000          92173       1553  0.875            
spsayakpaul/arxiv-paper-abstracts                    arXiv Paper Abstracts                       46765262  2021-09-30 05:53:24.790000           7028         66  1.0              
sumitm004/arxiv-scientific-research-papers-dataset   arXiv Scientific Research Papers Dataset    65401733  2025-02-14 00:11:08.223000           3141         53  1.0              
neelshah18/arxivdataset                              ARXIV data from 24,000+ papers              19218382

In [12]:
# https://www.kaggle.com/datasets/Cornell-University/arxiv
!kaggle datasets download Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
 99% 1.50G/1.52G [00:13<00:00, 73.5MB/s]
100% 1.52G/1.52G [00:13<00:00, 124MB/s] 


In [13]:
# Unzip the downloaded archive
!unzip arxiv.zip

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [14]:
# We need a way to inspect how the information is stored
# in this JSON file

# Head of the file
!head arxiv-metadata-oai-snapshot.json

# Note that the data is in a structured format: one JSON object per line

{"id":"0704.0001","submitter":"Pavel Nadolsky","authors":"C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan","title":"Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies","comments":"37 pages, 15 figures; published version","journal-ref":"Phys.Rev.D76:013009,2007","doi":"10.1103/PhysRevD.76.013009","report-no":"ANL-HEP-PR-07-12","categories":"hep-ph","license":null,"abstract":"  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predic

In [15]:
# Tail of the file
!tail arxiv-metadata-oai-snapshot.json

# Again, the data is in a structured format: one JSON object per line

{"id":"supr-con/9608003","submitter":"Oleg Tchernyshyov","authors":"A. S. Blaer (1), H. C. Ren (2), and O. Tchernyshyov (1) ((1) Columbia\n  University, (2) Rockefeller University)","title":"Extended bound states and resonances of two fermions on a periodic\n  lattice","comments":"21 pages, RevTeX, 4 Postscript figures, arithmetic errors corrected.\n  An abbreviated version (no appendix) appeared in PRB on March 1, 1997","journal-ref":"Phys. Rev. B 55, 6035 (1997)","doi":"10.1103/PhysRevB.55.6035","report-no":"CU-TP-771, RU-96-6B","categories":"supr-con cond-mat.supr-con","license":null,"abstract":"  The high-$T_c$ cuprates are possible candidates for d-wave superconductivity,\nwith the Cooper pair wave function belonging to a non-trivial irreducible\nrepresentation of the lattice point group. We argue that this d-wave symmetry\nis related to a special form of the fermionic kinetic energy and does not\nrequire any novel pairing mechanism. In this context, we present a detailed\nstudy o

In [17]:
# Load into RAM part of the information in the JSON file
# (the first 100000 rows)

arxiv = pd.read_json('arxiv-metadata-oai-snapshot.json',
                     lines=True,
                     nrows=100000)
arxiv.shape

(100000, 14)

In [18]:
arxiv

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,812.3869,Steven Weinstein,Steven Weinstein,Multiple Time Dimensions,,,,,physics.gen-ph physics.class-ph,http://arxiv.org/licenses/nonexclusive-distrib...,The possibility of physics in multiple time ...,"[{'version': 'v1', 'created': 'Fri, 19 Dec 200...",2008-12-22,"[[Weinstein, Steven, ]]"
99996,812.3870,Martin Weissman,Tatiana K. Howard and Martin H. Weissman,Depth Zero Representations of Nonlinear Covers...,12 pages,,,,math.RT math.NT,http://arxiv.org/licenses/nonexclusive-distrib...,"We generalize the methods of Moy-Prasad, in ...","[{'version': 'v1', 'created': 'Fri, 19 Dec 200...",2008-12-22,"[[Howard, Tatiana K., ], [Weissman, Martin H., ]]"
99997,812.3871,Nuno Alves,Nuno Alves,Decting Errors in Reversible Circuits With Inv...,,,,,cs.AR,http://arxiv.org/licenses/nonexclusive-distrib...,Reversible logic is experience renewed inter...,"[{'version': 'v1', 'created': 'Fri, 19 Dec 200...",2008-12-22,"[[Alves, Nuno, ]]"
99998,812.3872,Silvina Cichowolski,"S. Cichowolski, G.A. Romero, M.E. Ortega, C.E....",Unveiling the birth and evolution of the HII r...,"18 pages, 13 figures",,10.1111/j.1365-2966.2008.14322.x,,astro-ph,http://arxiv.org/licenses/nonexclusive-distrib...,"Based on a multiwavelength study, the ISM ar...","[{'version': 'v1', 'created': 'Fri, 19 Dec 200...",2009-11-13,"[[Cichowolski, S., ], [Romero, G. A., ], [Orte..."
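
In the table above the `id` values appear as `704.0001` rather than `0704.0001` as in the raw file, which suggests the identifiers were inferred as numbers somewhere along the way. One way to keep them as text is to parse the first records with the standard `json` module, which never converts a JSON string into a number. This is a minimal sketch: `read_first_records` and the sample lines are illustrative, not taken from the dataset.

```python
import io
import json
from itertools import islice


def read_first_records(lines, n):
    """Parse at most the first `n` lines of a JSON Lines source into dicts."""
    return [json.loads(line) for line in islice(lines, n) if line.strip()]


# Stand-in for the real file: three JSON Lines records with string ids.
sample = io.StringIO(
    '{"id": "0704.0001", "title": "First"}\n'
    '{"id": "0704.0002", "title": "Second"}\n'
    '{"id": "0704.0003", "title": "Third"}\n'
)

records = read_first_records(sample, 2)
print([record["id"] for record in records])

# With the real data, pass the open file object instead of `sample`;
# pd.DataFrame(records) would then hold `id` as text.
```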


In [19]:
# Column names
arxiv.columns

Index(['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],
      dtype='object')

In [20]:
# I need all the articles whose abstract (or title) contains a particular word.
# Based on some rough preliminary calculations, loading the whole dataset into
# memory in order to filter it is not feasible.
# So we need a smarter strategy.

# Let's count the articles/papers that mention a given
# word in the abstract: "python"

# Create an empty list to store the abstracts that pass the filter
python_abstract = []

# Read the data file line by line
archivo_arxiv = "/content/arxiv-metadata-oai-snapshot.json"

with open(archivo_arxiv, "r") as f:
    # "r": read mode
    for line_num, line in enumerate(f,1):
        articulo = json.loads(line)
        abstract = articulo.get("abstract", "")
        if "python" in abstract.lower():
            python_abstract.append(abstract)

# Result
python_abstract

['  We describe a novel, interdisciplinary, computational methods course that\nuses Python and associated numerical and visualization libraries to enable\nstudents to implement simulations for a number of different course modules.\nProblems in complex networks, biomechanics, pattern formation, and gene\nregulation are highlighted to illustrate the breadth and flexibility of\nPython-powered computational environments.\n',
 '  We have built an open-source software system for the modeling of biomolecular\nreaction networks, SloppyCell, which is written in Python and makes substantial\nuse of third-party libraries for numerics, visualization, and parallel\nprogramming. We highlight here some of the powerful features that Python\nprovides that enable SloppyCell to do dynamic code synthesis, symbolic\nmanipulation, and parallel exploration of complex parameter spaces.\n',
 '  A novel approach to protein multiple sequence alignment is discussed:\nsubstantially this method counterparts with su

In [21]:
# Number of articles with the word Python in the abstract
len(python_abstract)

9521

In [22]:
# Total number of articles
line_num

2872766

In [23]:
# Create an empty list to store the abstracts that pass the filter
julia_abstract = []

with open(archivo_arxiv, "r") as f:
    # "r": read mode
    for line_num, line in enumerate(f,1):
        articulo = json.loads(line)
        abstract = articulo.get("abstract", "")
        if "julia" in abstract.lower():
            julia_abstract.append(abstract)

# Result
julia_abstract

['  A polynomial skew product of C^2 is a map of the form f(z,w) = (p(z),\nq(z,w)), where p and q are polynomials, such that f is regular of degree d >=\n2. For polynomial maps of C, hyperbolicity is equivalent to the condition that\nthe closure of the postcritical set is disjoint from the Julia set; further,\ncritical points either iterate to an attracting cycle or infinity. For\npolynomial skew products, Jonsson (Math. Ann., 1999) established that f is\nAxiom A if and only if the closure of the postcritical set is disjoint from the\nright analog of the Julia set. Here we present the analogous conclusion:\ncritical orbits either escape to infinity or accumulate on an attracting set.\nIn addition, we construct new examples of Axiom A maps demonstrating various\npostcritical behaviors.\n',
 '  We show that if a meromorphic function has a direct singularity over\ninfinity, then the escaping set has an unbounded component and the intersection\nof the escaping set with the Julia set contai

In [24]:
# Number of articles with the word Julia in the abstract
# (this includes mathematical Julia sets, not only the Julia language)
len(julia_abstract)

1948

In [25]:
# Create an empty list to store the abstracts that pass the filter
r_abstract = []

with open(archivo_arxiv, "r") as f:
    # "r": read mode
    for line_num, line in enumerate(f,1):
        articulo = json.loads(line)
        abstract = articulo.get("abstract", "")
        # Note: this is a substring test, so it matches any abstract
        # that contains the letter "r", not just mentions of the R language
        if "r" in abstract.lower():
            r_abstract.append(abstract)

# Result
r_abstract

['  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanced sensitivity to the signal can be obtained with judicious\nselection of events.\n'

In [26]:
# Number of articles whose abstract contains the letter "r"
# (nearly all of them, so this count does not isolate the R language)
len(r_abstract)

2872684
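
The count above is almost the whole dataset because `"r" in abstract.lower()` is a substring test. A possible refinement, sketched below, matches whole words with regular expressions and counts several terms in a single pass over the file instead of one pass per term. The patterns, the function `count_mentions`, and the sample records are assumptions for illustration, and no counts for the real dataset are claimed here.

```python
import io
import json
import re
from collections import Counter

# \b marks a word boundary. "R" is matched case-sensitively as a standalone token.
PATTERNS = {
    "python": re.compile(r"\bpython\b", re.IGNORECASE),
    "julia": re.compile(r"\bjulia\b", re.IGNORECASE),
    "r": re.compile(r"\bR\b"),
}


def count_mentions(lines, patterns=PATTERNS):
    """Stream JSON Lines records once; return (total records, Counter of matches)."""
    counts = Counter()
    total = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        abstract = json.loads(line).get("abstract") or ""
        for name, pattern in patterns.items():
            if pattern.search(abstract):
                counts[name] += 1
    return total, counts


# Tiny stand-in for the real file.
sample = io.StringIO(
    '{"abstract": "We use Python and R for the analysis."}\n'
    '{"abstract": "The Julia set of a rational map is studied."}\n'
    '{"abstract": "Regression results are reported."}\n'
)
total, counts = count_mentions(sample)
print(total, dict(counts))
```

Word boundaries remove matches inside other words, but they cannot tell the Julia language from a Julia set, or the R language from a symbol named R. Separating those would need extra context, for example the `categories` field.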