# A practical guide to Hive and Pig

This hands-on session covers the following:

* An introduction to the dataset we are going to use.
* Hive basics.
* Solving a complex query with HQL on the flights dataset.
* Pig basics.
* Solving the same complex query with Pig Latin on the flights dataset.
  
# Flight delays dataset

We are going to use [this dataset](https://www.kaggle.com/datasets/tylerx/flights-and-airports-data) from Kaggle
to learn both Hive and Pig. Kaggle is a very popular data science site where data scientists publish and share their work. It also hosts competitions in which participants compete to build the best model for a given problem.

The dataset contains information about flight delays in the USA. There are two files of interest: `airports.csv` and `flights.csv`.

The first one holds information about the airports and has the following fields:
   * `airport_id`: airport identifier. Numeric, although we will store it as a `STRING` column in Hive.
   * `city`: city of the airport.
   * `state`: state of the airport.
   * `name`: name of the airport.
   
The `flights.csv` file has the following structure:
   * `DayofMonth`: day of the month of the flight.
   * `DayOfWeek`: day of the week of the flight.
   * `Carrier`: identifier of the airline.
   * `OriginAirportID`: identifier of the origin airport.
   * `DestAirportID`: identifier of the destination airport.
   * `DepDelay`: departure delay in minutes (may be negative if the flight departs earlier than scheduled).
   * `ArrDelay`: arrival delay in minutes (may be negative if the flight arrives earlier than scheduled).
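
As a minimal sketch of the record structure just described, the following Python snippet parses a few rows into typed records. The sample rows are the first lines of `flights.csv`; the `Flight` class and the `parse_flights` helper are names made up for this illustration and are not part of the dataset or of Hive.

```python
import csv
import io
from typing import List, NamedTuple


class Flight(NamedTuple):
    day_of_month: int
    day_of_week: int
    carrier: str
    origin_airport_id: str  # kept as text, like the airport identifier
    dest_airport_id: str
    dep_delay: int          # minutes; negative means an early departure
    arr_delay: int          # minutes; negative means an early arrival


# Header plus the first three data rows of flights.csv.
SAMPLE = """DayofMonth,DayOfWeek,Carrier,OriginAirportID,DestAirportID,DepDelay,ArrDelay
19,5,DL,11433,13303,-3,1
19,5,DL,14869,12478,0,-8
19,5,DL,14057,14869,-4,-15
"""


def parse_flights(text: str) -> List[Flight]:
    """Parse CSV text with a header line into typed Flight records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Flight(
            int(row["DayofMonth"]), int(row["DayOfWeek"]), row["Carrier"],
            row["OriginAirportID"], row["DestAirportID"],
            int(row["DepDelay"]), int(row["ArrDelay"]),
        )
        for row in reader
    ]


flights = parse_flights(SAMPLE)
print(flights[0])
```

Keeping the airport identifiers as text mirrors the choice made for `airport_id`: they are labels to join on, not numbers to do arithmetic with.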

The `notebooks` directory contains `archive.zip` with both files. Downloading it from Kaggle requires a registered account, so the archive has been included to save you the trouble.

To extract the files we are interested in from the `zip` archive we need the `unzip` package, which is not available in the container.

First we have to update the container's package repositories.

In [1]:
! apt update

Get:1 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Get:5 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [991 kB]
Get:6 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [28.5 kB]
Get:7 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [1882 kB]
Get:8 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [2448 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal/restricted amd64 Packages [33.4 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages [1275 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [177 kB]
Get:12 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages [11.3 MB]
Get:13 ht

Next we install the `unzip` package.

In [2]:
! apt install unzip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  zip
The following NEW packages will be installed:
  unzip
0 upgraded, 1 newly installed, 0 to remove and 190 not upgraded.
Need to get 168 kB of archives.
After this operation, 593 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 unzip amd64 6.0-25ubuntu1.1 [168 kB]
Fetched 168 kB in 0s (576 kB/s)
debconf: delaying package configuration, since apt-utils is not installed

Selecting previously unselected package unzip.
(Reading database ... 43749 files and directories currently installed.)
Preparing to unpack .../unzip_6.0-25ubuntu1.1_amd64.deb ...
Unpacking unzip (6.0-25ubuntu1.1

Now we extract the files we are interested in.

In [3]:
! unzip -j -o archive.zip  airports.csv flights.csv

Archive:  archive.zip
  inflating: airports.csv            
  inflating: flights.csv             


We show the number of lines and the first lines of the airports file, `airports.csv`:

In [4]:
! wc -l airports.csv && head airports.csv 

366 airports.csv
airport_id,city,state,name
10165,Adak Island,AK,Adak
10299,Anchorage,AK,Ted Stevens Anchorage International
10304,Aniak,AK,Aniak Airport
10754,Barrow,AK,Wiley Post/Will Rogers Memorial
10551,Bethel,AK,Bethel Airport
10926,Cordova,AK,Merle K Mudhole Smith
14709,Deadhorse,AK,Deadhorse Airport
11336,Dillingham,AK,Dillingham Airport
11630,Fairbanks,AK,Fairbanks International


We show the number of lines and the first lines of the flights file, `flights.csv`:

In [5]:
! wc -l flights.csv && head flights.csv

2702219 flights.csv
DayofMonth,DayOfWeek,Carrier,OriginAirportID,DestAirportID,DepDelay,ArrDelay
19,5,DL,11433,13303,-3,1
19,5,DL,14869,12478,0,-8
19,5,DL,14057,14869,-4,-15
19,5,DL,15016,11433,28,24
19,5,DL,11193,12892,-6,-11
19,5,DL,10397,15016,-1,-19
19,5,DL,15016,10397,0,-1
19,5,DL,10397,14869,15,24
19,5,DL,10397,10423,33,34


That is, there are 365 airports (not counting the header line) and about 2.7 million flights.
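
The arithmetic behind those figures is simply the `wc -l` count minus the header line. A tiny sketch, where `data_rows` is a helper name invented for this example:

```python
def data_rows(total_lines: int, header_lines: int = 1) -> int:
    """Number of data records in a CSV file, given its `wc -l` line count."""
    return total_lines - header_lines


airports = data_rows(366)      # line count reported for airports.csv
flights = data_rows(2702219)   # line count reported for flights.csv
print(airports, flights)
```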

We copy the files to make them accessible in Hadoop. Note that we used the `hdfs` command rather than the `hadoop` command; either one works for this.

In [6]:
! hdfs dfs -mkdir -p /user/root/flights
! hdfs dfs -put -f ./airports.csv /user/root/flights/
! hdfs dfs -put -f ./flights.csv /user/root/flights/
! hdfs dfs -ls /user/root/flights/

Found 2 items
-rw-r--r--   3 root supergroup      16308 2023-02-07 12:56 /user/root/flights/airports.csv
-rw-r--r--   3 root supergroup   72088113 2023-02-07 12:56 /user/root/flights/flights.csv


# Hive

A Hive server is already installed in our Hadoop cluster. Hive is probably the most widely used tool in the Hadoop ecosystem, largely because its query language, HQL, is very similar to SQL.

A Hive client called `beeline` is installed as well. We can run `beeline` commands from Jupyter cells. For example, the following command connects to Hive and lists the available databases.

In [7]:
! beeline -u "jdbc:hive2://localhost:10000" -e "SHOW DATABASES"

Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207125732_edb34956-1100-4df6-833e-6fe783a6cbd1): SHOW DATABASES
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=root_20230207125732_edb34956-1100-4df6-833e-6fe783a6cbd1); Time taken: 1.171 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207125732_edb34956-1100-4df6-833e-6fe783a6cbd1): SHOW DATABASES
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=root_20230207125732_edb34956-1100-4df6-833e-6fe783a6cbd1); Ti

We can create a new database with the following statement.

In [8]:
! beeline -u "jdbc:hive2://localhost:10000/" -e "\
CREATE DATABASE IF NOT EXISTS bda03 \
COMMENT 'Base de datos de la unidad BDA03' \
WITH DBPROPERTIES ('Creada por' = 'Javier Pérez', 'Fecha' = '20/12/22');"

Connecting to jdbc:hive2://localhost:10000/
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207125808_1bb06737-d49d-41cc-adc1-c93355f0e635): CREATE DATABASE IF NOT EXISTS bda03  COMMENT 'Base de datos de la unidad BDA03'  WITH DBPROPERTIES ('Creada por' = 'Javier P?rez', 'Fecha' = '20/12/22')
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20230207125808_1bb06737-d49d-41cc-adc1-c93355f0e635); Time taken: 0.046 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207125808_1bb06737-d49d-41cc-adc1-c93355f0e635): CREATE DATABASE IF NOT EXISTS bda03  COMMENT 'Base de datos de la unidad BDA03'  WITH DBPROPERTIES ('

The next step is to create the tables that will hold the airport and flight data. Hive has two kinds of tables:

* Internal: fully managed by Hive. The data from the files used to populate the table is placed in Hive's own storage, by default under `/user/hive/warehouse/database_name.db/`. When the table is dropped, Hive deletes both the data and the metadata.
* External: Hive does not manage the data; it only maintains the metadata. To create an external table, add the `EXTERNAL` keyword. The tables we create in this exercise are external.

To improve Hive's performance, tables can be partitioned by the value of a column. Hive creates one directory per distinct value of the partition column. The partition column is not actually stored as a field in the data files, but queries show it as if it were.
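
The partitioning idea can be sketched in plain Python. This is only an illustration of the directory-per-value layout (Hive names partition directories `column=value`), not of how Hive implements it. The table directory, the `AA` row, and the helper names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical table location, for illustration only.
TABLE_DIR = "/warehouse/flights_by_carrier"

# (carrier, origin, destination) rows; the AA row is invented.
ROWS = [
    ("DL", "11433", "13303"),
    ("AA", "10397", "15016"),
    ("DL", "14869", "12478"),
]


def partition_rows(rows, table_dir: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group rows into one `carrier=<value>` directory per carrier.

    The partition value lives only in the directory name, so it is
    dropped from the rows stored inside that directory.
    """
    layout = defaultdict(list)
    for carrier, origin, dest in rows:
        layout[f"{table_dir}/carrier={carrier}"].append((origin, dest))
    return dict(layout)


def read_partition(layout, table_dir: str, carrier: str):
    """Read a single partition, re-attaching the value taken from the path."""
    stored = layout.get(f"{table_dir}/carrier={carrier}", [])
    return [(carrier, origin, dest) for origin, dest in stored]


layout = partition_rows(ROWS, TABLE_DIR)
print(sorted(layout))
print(read_partition(layout, TABLE_DIR, "DL"))
```

A query that filters on the partition column only needs to read the matching directory, which is where the performance gain comes from.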

Finally, keep in mind the data types that Hive supports. You can look them up [here](https://cwiki.apache.org/confluence/display/hive/languagemanual+types).

The table that will hold the airport data is created like this:

In [9]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
DROP TABLE IF EXISTS airports; \
CREATE EXTERNAL TABLE IF NOT EXISTS airports (airportid STRING, city STRING, state STRING, airportname STRING) \
COMMENT 'USA Airports' \
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,' \
TBLPROPERTIES ('Autor' = 'Javier Pérez', 'Fecha' = '20/12/2022', 'skip.header.line.count'='1');"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207125913_c074cac0-cec9-414f-b4df-c1e8a0e12410): DROP TABLE IF EXISTS airports
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20230207125913_c074cac0-cec9-414f-b4df-c1e8a0e12410); Time taken: 0.183 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207125913_c074cac0-cec9-414f-b4df-c1e8a0e12410): DROP TABLE IF EXISTS airports
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=root_20230207125913_c074cac0-cec9-414f-b4df-c1e8a0e12410); Time taken: 0.13 seconds
INFO  : OK

Several points in the previous statement are worth commenting on:

* First, note that we added the database name to the connection string of the `beeline` client.
* The table is named `airports`.
* The table is external, so Hive only manages its metadata: dropping the table will not delete the underlying data files.
* The table has four text columns that match the description we gave of the `airports.csv` file.
* The field delimiter is set to the comma character (,).
* Finally, a table property is added so that the header line of the `csv` file is skipped.
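
To make the last three points concrete, here is a rough Python sketch of what reading a delimited text file with a skipped header amounts to. It only imitates the effect of the table definition and is not Hive's implementation; `read_table` is a name invented for the example, and the sample lines are the first lines of `airports.csv`.

```python
from typing import Dict, List

# Column names as declared in the CREATE TABLE statement.
COLUMNS = ["airportid", "city", "state", "airportname"]

RAW = """airport_id,city,state,name
10165,Adak Island,AK,Adak
10299,Anchorage,AK,Ted Stevens Anchorage International
"""


def read_table(raw: str, columns: List[str], delimiter: str = ",",
               skip_header_lines: int = 1) -> List[Dict[str, str]]:
    """Split each line on the delimiter and map the pieces to column names.

    The column names come from the table definition, not from the file:
    the header line is simply discarded.
    """
    lines = raw.splitlines()[skip_header_lines:]
    return [dict(zip(columns, line.split(delimiter))) for line in lines]


rows = read_table(RAW, COLUMNS)
print(rows[0]["airportname"])
```

Note that a plain split like this does not interpret quoted fields, so a value containing a comma would be cut in two. None of the sample lines shown in this notebook has that problem.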

The flights table is similar.

In [10]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
DROP TABLE IF EXISTS flights; \
CREATE EXTERNAL TABLE IF NOT EXISTS flights (dayofmonth TINYINT, dayofweek TINYINT, carrier STRING, \
    depairportid STRING, arrairportid STRING, depdelay SMALLINT, arrdelay SMALLINT) \
COMMENT 'Flights' \
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,' \
TBLPROPERTIES ('Autor' = 'Javier Pérez', 'Fecha' = '20/12/2022', 'skip.header.line.count'='1');"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130028_d4e1e8b9-dd85-4019-938b-0e53623a115d): DROP TABLE IF EXISTS flights
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20230207130028_d4e1e8b9-dd85-4019-938b-0e53623a115d); Time taken: 0.03 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207130028_d4e1e8b9-dd85-4019-938b-0e53623a115d): DROP TABLE IF EXISTS flights
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=root_20230207130028_d4e1e8b9-dd85-4019-938b-0e53623a115d); Time taken: 0.037 seconds
INFO  : OK
I

In the `flights` table the numeric types have been chosen to take up as little space as possible. The next step is to load the data. `LOAD DATA INPATH` relocates the files within HDFS instead of rewriting them, so it is a very fast process.
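
As an aside on the column types above, this small sketch picks the narrowest Hive integer type for a range of values. The ranges are the standard signed 1, 2 and 4 byte ranges; the delay bounds used in the example are assumed for illustration, not measured from the file, and `smallest_type` is a made-up helper.

```python
# Signed integer ranges of Hive's integer types, narrowest first.
RANGES = {
    "TINYINT": (-2**7, 2**7 - 1),      # 1 byte
    "SMALLINT": (-2**15, 2**15 - 1),   # 2 bytes
    "INT": (-2**31, 2**31 - 1),        # 4 bytes
}


def smallest_type(lowest: int, highest: int) -> str:
    """Return the narrowest integer type that holds every value in the range."""
    for name, (type_min, type_max) in RANGES.items():
        if type_min <= lowest and highest <= type_max:
            return name
    return "BIGINT"  # 8 bytes


print(smallest_type(1, 31))       # day of month
print(smallest_type(1, 7))        # day of week
print(smallest_type(-100, 2000))  # delays in minutes (assumed bounds)
```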

Before loading the data we have to grant permissions on the HDFS directory where we copied the `csv` files.

In [11]:
! hdfs dfs -chmod 777 /user/root/flights

In [12]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
LOAD DATA INPATH '/user/root/flights/airports.csv' OVERWRITE INTO TABLE airports;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130236_f0bcdde8-9923-4da5-9725-c015f8fec14a): LOAD DATA INPATH '/user/root/flights/airports.csv' OVERWRITE INTO TABLE airports
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20230207130236_f0bcdde8-9923-4da5-9725-c015f8fec14a); Time taken: 0.163 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207130236_f0bcdde8-9923-4da5-9725-c015f8fec14a): LOAD DATA INPATH '/user/root/flights/airports.csv' OVERWRITE INTO TABLE airports
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table bda03

In [13]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
LOAD DATA INPATH '/user/root/flights/flights.csv' OVERWRITE INTO TABLE flights;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130252_9aefc2a4-7d19-4e0a-8c05-4cd5cb4e1d99): LOAD DATA INPATH '/user/root/flights/flights.csv' OVERWRITE INTO TABLE flights
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=root_20230207130252_9aefc2a4-7d19-4e0a-8c05-4cd5cb4e1d99); Time taken: 0.044 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207130252_9aefc2a4-7d19-4e0a-8c05-4cd5cb4e1d99): LOAD DATA INPATH '/user/root/flights/flights.csv' OVERWRITE INTO TABLE flights
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table bda03.fli

If you ran the previous cells, you will have seen that loading the data was very fast. This is because `LOAD DATA INPATH` only moves the files within HDFS (note the `MOVE` task in the log) instead of rewriting them, and because Hive performs no integrity checks on load.

We can now run queries. For example, the following query shows 10 airports.

In [14]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
SELECT * FROM airports LIMIT 10"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130337_069a1854-b518-49a4-8169-86981d5f3e66): SELECT * FROM airports LIMIT 10
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:airports.airportid, type:string, comment:null), FieldSchema(name:airports.city, type:string, comment:null), FieldSchema(name:airports.state, type:string, comment:null), FieldSchema(name:airports.airportname, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20230207130337_069a1854-b518-49a4-8169-86981d5f3e66); Time taken: 2.553 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207130337_069a185

And the following one shows 10 flights:

In [15]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
SELECT * FROM flights LIMIT 10"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130412_15b8071d-e45c-4ae8-9be3-0b1a5ee56782): SELECT * FROM flights LIMIT 10
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:flights.dayofmonth, type:tinyint, comment:null), FieldSchema(name:flights.dayofweek, type:tinyint, comment:null), FieldSchema(name:flights.carrier, type:string, comment:null), FieldSchema(name:flights.depairportid, type:string, comment:null), FieldSchema(name:flights.arrairportid, type:string, comment:null), FieldSchema(name:flights.depdelay, type:smallint, comment:null), FieldSchema(name:flights.arrdelay, type:smallint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_2023

Let's work through a query to learn how to use Hive.

## Hive query: names of the 5 airports with the most operations (arrivals and departures)

We start by counting departures grouped by airport.

In [16]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
SELECT depairportid as airportid, count(*) AS flights FROM flights GROUP BY depairportid \
ORDER BY flights DESC LIMIT 5;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130541_d5772487-fc84-4680-bce1-d1e1b9d48668): SELECT depairportid as airportid, count(*) AS flights FROM flights GROUP BY depairportid  ORDER BY flights DESC LIMIT 5
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:airportid, type:string, comment:null), FieldSchema(name:flights, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20230207130541_d5772487-fc84-4680-bce1-d1e1b9d48668); Time taken: 2.06 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20230207130541_d5772487-fc84-4680-bce1-d1e1b9d48668): SELECT depairportid as a

We can do the same for arrivals and combine the two queries.

In [17]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
SELECT airportid, COUNT(*) as flights FROM ( \
    SELECT depairportid as airportid FROM flights \
    UNION ALL \
    SELECT arrairportid as airportid FROM flights \
) f GROUP BY airportid \
ORDER BY flights DESC LIMIT 5;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207130822_f498a4c1-1151-4cdf-bf20-934cd1f0c7de): SELECT airportid, COUNT(*) as flights FROM (      SELECT depairportid as airportid FROM flights      UNION ALL      SELECT arrairportid as airportid FROM flights  ) f GROUP BY airportid  ORDER BY flights DESC LIMIT 5
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:airportid, type:string, comment:null), FieldSchema(name:flights, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20230207130822_f498a4c1-1151-4cdf-bf20-934cd1f0c7de); Time taken: 0.586 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing c

We now have the codes of the 5 airports with the most operations. All that remains is to get each airport's name, which we do with a `JOIN` against the `airports` table. To make it easier to follow, we create a temporary table with the previous results and run the join against that table.

In [18]:
! beeline -u "jdbc:hive2://localhost:10000/bda03" -e "\
CREATE TEMPORARY TABLE airport_operations AS \
SELECT airportid, COUNT(*) as flights FROM ( \
    SELECT depairportid as airportid FROM flights \
    UNION ALL \
    SELECT arrairportid as airportid FROM flights \
) f GROUP BY airportid \
ORDER BY flights DESC LIMIT 5; \
\
SELECT airportname, flights \
FROM airport_operations JOIN airports ON airport_operations.airportid = airports.airportid \
ORDER BY flights DESC;"

Connecting to jdbc:hive2://localhost:10000/bda03
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=root_20230207131059_00d2ffe0-70fd-4189-8636-737ada0bc72b): CREATE TEMPORARY TABLE airport_operations AS  SELECT airportid, COUNT(*) as flights FROM (      SELECT depairportid as airportid FROM flights      UNION ALL      SELECT arrairportid as airportid FROM flights  ) f GROUP BY airportid  ORDER BY flights DESC LIMIT 5
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:airportid, type:string, comment:null), FieldSchema(name:flights, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20230207131059_00d2ffe0-70fd-4189-8636-737ada0bc72b); Time taken: 0.329 seconds
INFO  : Concurrency mode is disabled, n

+-------------------------------------------+----------+
|                airportname                | flights  |
+-------------------------------------------+----------+
| Hartsfield-Jackson Atlanta International  | 297087   |
| Chicago O'Hare International              | 254536   |
| Los Angeles International                 | 235988   |
| Dallas/Fort Worth International           | 208209   |
| Denver International                      | 194178   |
+-------------------------------------------+----------+
5 rows selected (35.625 seconds)
Beeline version 3.1.2 by Apache Hive
Closing: 0: jdbc:hive2://localhost:10000/bda03
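
To recap the logic of the query (union, group and count, sort and limit, join), here is the same pipeline sketched in plain Python over a handful of rows. The origin and destination pairs are the first rows of `flights.csv`; the airport names are placeholders, not the real names, and `top_airports` is a helper invented for this illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (origin, destination) pairs from the first rows of flights.csv.
FLIGHTS = [
    ("11433", "13303"), ("14869", "12478"), ("14057", "14869"),
    ("15016", "11433"), ("11193", "12892"), ("10397", "15016"),
    ("15016", "10397"), ("10397", "14869"), ("10397", "10423"),
]

# Placeholder names for illustration, NOT the real airport names.
AIRPORT_NAMES = {"10397": "Airport A", "14869": "Airport B", "15016": "Airport C"}


def top_airports(flights: List[Tuple[str, str]], names: Dict[str, str],
                 n: int) -> List[Tuple[str, int]]:
    # UNION ALL: every flight contributes one departure and one arrival.
    operations = [origin for origin, _ in flights] + [dest for _, dest in flights]
    # GROUP BY airportid with COUNT(*).
    counts = Counter(operations)
    # ORDER BY flights DESC LIMIT n (ties broken by id to keep it deterministic).
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
    # Inner JOIN with airports: ids without a matching name are dropped.
    return [(names[airport], count) for airport, count in top if airport in names]


result = top_airports(FLIGHTS, AIRPORT_NAMES, 3)
print(result)
```

`UNION ALL` matters here: a plain `UNION` would remove duplicates and every airport would be counted once.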


# Pig

As with Hive, the Pig client is also installed in our Hadoop cluster (Pig has no server). While Hive is designed to work declaratively on structured data, Pig can work on semi-structured data and mixes declarative and procedural programming, which makes it more flexible than Hive. Pig's query language is called Pig Latin.

Like Hive, Pig has its own data types. You can look them up [here](https://pig.apache.org/docs/latest/basic.html#data-types).

To run Pig from Jupyter we write a script and execute it with Pig.

For example, to read the `airports.csv` and `flights.csv` files we write:

In [19]:
%%writefile flights.pig

-- Register the PiggyBank library so we can use the CSVExcelStorage load function.
REGISTER piggybank.jar

/*
Read the airports.csv file.

We use the CSVExcelStorage loader, specifying the delimiter (,) and that the header must be skipped.
*/

AIRPORTS = LOAD '$airports_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (airportid:chararray, city:chararray, state:chararray, airportname:chararray);

-- Read the flights.csv file

FLIGHTS = LOAD '$flights_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (dayofmonth:int, dayofweek:int, carrier:chararray, 
               depairportid:chararray, arrairportid:chararray, depdelay:int, arrdelay:int);


-- Check that we can retrieve data.

-- Keep 10 airports
AIRPORTS_10 = LIMIT AIRPORTS 10;

-- Show the 10 airports
DUMP AIRPORTS_10;

-- Do the same with the flights
FLIGHTS_10 = LIMIT FLIGHTS 10;
DUMP FLIGHTS_10;

Writing flights.pig


In [20]:
! pig -x local -f flights.pig -param airports_file='airports.csv' -param flights_file='flights.csv' -param output_dir='pig/output/flights'

2023-02-07 14:04:11,936 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2023-02-07 14:04:11,940 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2023-02-07 14:04:12,065 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2023-02-07 14:04:12,065 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/pig_1675775052058.log
2023-02-07 14:04:12,102 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2023-02-07 14:04:12,350 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2023-02-07 14:04:12,466 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-02-07 14:04:12,468 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2023-02-07 14:

Note several interesting points:

* Pig scripts support two kinds of comments: line comments (`--`) and block comments (`/* ... */`).
* We had to register the `piggybank.jar` library in order to use the `CSVExcelStorage` loader. This loader is more powerful than Pig's default one, `PigStorage`. In this example we use it specifically to skip the header lines of the `csv` files.
* Each Pig Latin statement is atomic (it performs a single operation) and statements cannot be composed, so we have to chain successive assignments. It is common to reassign the same variable, overwriting it. In my view that technique hurts clarity, and I prefer to create new variables as the process moves forward.
* When running Pig we can pass parameters that are accessible from the script.
* I had to run Pig in local mode with the `-x local` option because my machine runs out of memory if I try to run it on Hadoop. You can try changing this option to see whether your machine can handle execution on the Hadoop cluster.
* Note that the script output shows 10 airports and 10 flights as tuples. The tuple is one of the complex types Pig supports. The other two are `map` and `bag`. We will use the latter further on.
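
The effect of passing parameters with `-param` can be imitated in Python with `string.Template`, which happens to use the same `$name` placeholder style. This is only an analogy for what the script sees after substitution, not how Pig does it internally; the two-line script below is a simplified stand-in for `flights.pig`, and `substitute_params` is a made-up helper.

```python
from string import Template
from typing import Dict

# Simplified stand-in for the script: only the parameterised paths matter here.
SCRIPT = (
    "AIRPORTS = LOAD '$airports_file';\n"
    "FLIGHTS = LOAD '$flights_file';"
)


def substitute_params(script: str, params: Dict[str, str]) -> str:
    """Replace every $name placeholder with its value.

    Template.substitute raises KeyError when a placeholder has no value.
    """
    return Template(script).substitute(params)


resolved = substitute_params(
    SCRIPT,
    {"airports_file": "airports.csv", "flights_file": "flights.csv"},
)
print(resolved)
```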

## Pig query: names of the 5 airports with the most operations (arrivals and departures)

We are going to solve the same query we wrote in Hive, this time with Pig. The strategy is similar: we union departures and arrivals and group by airport. The script below is incomplete: it only goes as far as the grouping and lacks the join with airports to obtain the names. This is deliberate, to show that the relation created by `GROUP` does not have the same structure as the result of the equivalent `GROUP BY` in Hive. We will complete the query further on.

In [21]:
%%writefile flights.pig

-- Register the PiggyBank library so we can use the CSVExcelStorage load function.
REGISTER piggybank.jar

/*
Read the airports.csv file.

We use the CSVExcelStorage loader, specifying the delimiter (,) and that the header must be skipped.
*/

AIRPORTS = LOAD '$airports_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (airportid:chararray, city:chararray, state:chararray, airportname:chararray);

-- Read the flights.csv file

FLIGHTS = LOAD '$flights_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (dayofmonth:int, dayofweek:int, carrier:chararray, 
               depairportid:chararray, arrairportid:chararray, depdelay:int, arrdelay:int);


/*
    FOREACH ... GENERATE is similar to SQL's SELECT
*/
DEPARTURES = FOREACH FLIGHTS GENERATE depairportid AS airportid;
ARRIVES    = FOREACH FLIGHTS GENERATE arrairportid AS airportid;

OPERATIONS = UNION DEPARTURES, ARRIVES;

TOTAL_OPERATIONS = GROUP OPERATIONS BY airportid;

-- Show the schema of the relation to see how GROUP works
DESCRIBE TOTAL_OPERATIONS;

Overwriting flights.pig


In [22]:
! pig -x local -f flights.pig -param airports_file='airports.csv' -param flights_file='flights.csv' -param output_dir='pig/output/flights'

2023-02-07 14:08:24,703 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2023-02-07 14:08:24,704 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2023-02-07 14:08:24,805 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2023-02-07 14:08:24,805 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/pig_1675775304799.log
2023-02-07 14:08:24,850 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2023-02-07 14:08:25,249 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2023-02-07 14:08:25,520 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-02-07 14:08:25,523 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2023-02-07 14:

Note a couple of things:

* The script finishes very quickly. Pig evaluates lazily: it does not run the query until the data is printed to the screen (`DUMP`) or stored in a file (`STORE`). `DESCRIBE` only prints the schema, so no job is launched.
* The TOTAL_OPERATIONS relation consists of tuples with two fields: `group` and OPERATIONS. The name `group` is assigned by Pig and holds the value of the field we grouped by (here, the airport code). OPERATIONS is a `bag` (a collection of tuples) containing the grouped tuples. In other words, if we displayed the grouped data, we would see tuples similar to this one:
    (1, {(1),(1),(1),(1),(1)}), where 1 would be the airport code.
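
To make the shape of the grouped relation more concrete, here is a minimal Python sketch (Python being the notebook's host language) that imitates `GROUP OPERATIONS BY airportid` on a few made-up rows. It is only an analogy, not how Pig works internally: the airport codes are invented, the helper name `group_by_airport` is hypothetical, and a Python list stands in for Pig's `bag`.

```python
from collections import defaultdict

def group_by_airport(operations):
    """Imitate Pig's GROUP ... BY: return a list of (group, bag) pairs.

    `operations` is a list of one-field tuples (airportid,), shaped like
    the OPERATIONS relation. A plain list plays the role of the bag.
    """
    bags = defaultdict(list)
    for row in operations:
        airportid = row[0]
        bags[airportid].append(row)
    # Each output tuple has two fields: the grouping key and the bag of rows.
    return [(key, bag) for key, bag in bags.items()]

# Made-up sample rows; the airport codes are invented for illustration.
operations = [("1",), ("2",), ("1",), ("1",), ("2",)]
total_operations = group_by_airport(operations)

for group, bag in total_operations:
    # COUNT(OPERATIONS) in Pig corresponds to the size of the bag.
    print(group, bag, len(bag))
```

Counting the elements of each bag is what `COUNT(OPERATIONS)` does in the next step of the script.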
    
We continue the `script` by counting and renaming fields:

In [23]:
%%writefile flights.pig

-- Register the PiggyBank library so we can use the CSVExcelStorage load function.
REGISTER piggybank.jar

/*
Read the airports.csv file.

We use the CSVExcelStorage loader, specifying the delimiter (,) and that the header row must be skipped.
*/

AIRPORTS = LOAD '$airports_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (airportid:chararray, city:chararray, state:chararray, airportname:chararray);

-- Read the flights.csv file

FLIGHTS = LOAD '$flights_file' USING
       org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
       AS (dayofmonth:int, dayofweek:int, carrier:chararray, 
               depairportid:chararray, arrairportid:chararray, depdelay:int, arrdelay:int);


/*
    FOREACH ... GENERATE is similar to SQL's SELECT
*/
DEPARTURES = FOREACH FLIGHTS GENERATE depairportid AS airportid;
ARRIVES    = FOREACH FLIGHTS GENERATE arrairportid AS airportid;

OPERATIONS = UNION DEPARTURES, ARRIVES;

TOTAL_OPERATIONS = GROUP OPERATIONS BY airportid;

-- Show the schema of the relation to make clear how GROUP works
DESCRIBE TOTAL_OPERATIONS;

-- Rename fields and count flights
TOTAL_OPERATIONS = FOREACH TOTAL_OPERATIONS GENERATE group AS airportid, COUNT(OPERATIONS) AS flights;

-- Sort in descending order by number of flights
TOTAL_OPERATIONS = ORDER TOTAL_OPERATIONS BY flights DESC;

-- Keep only the top 5 airports
TOP_TOTAL_OPERATIONS = LIMIT TOTAL_OPERATIONS 5;

-- Join with the airports relation to get the airport name
TOP_TOTAL_OPERATIONS = JOIN TOP_TOTAL_OPERATIONS BY airportid, AIRPORTS BY airportid;

DESCRIBE TOP_TOTAL_OPERATIONS;

-- Select the fields we are interested in
TOP_TOTAL_OPERATIONS = FOREACH TOP_TOTAL_OPERATIONS GENERATE airportname, flights;

-- Sort again by number of flights
TOP_TOTAL_OPERATIONS = ORDER TOP_TOTAL_OPERATIONS BY flights DESC;

DUMP TOP_TOTAL_OPERATIONS;

Overwriting flights.pig
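
As a rough cross-check of what each step of the script above computes, the following Python sketch reproduces the same pipeline on a tiny in-memory sample. Everything in it is illustrative: the airport ids and names are invented, `top_airports` is a hypothetical helper, and plain Python collections stand in for Pig relations.

```python
from collections import Counter

# Invented toy data; the real rows come from airports.csv and flights.csv.
airports = {"10": "Airport A", "20": "Airport B", "30": "Airport C"}
# Each flight is reduced to (depairportid, arrairportid).
flights = [("10", "20"), ("10", "30"), ("20", "10"), ("30", "10")]

def top_airports(flights, airports, n=5):
    # DEPARTURES UNION ARRIVES: one row per take-off and one per landing.
    operations = [dep for dep, _ in flights] + [arr for _, arr in flights]
    # GROUP BY airportid followed by COUNT: operations per airport.
    totals = Counter(operations)
    # ORDER BY flights DESC, then LIMIT n.
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    # JOIN with AIRPORTS to get the name, keeping (airportname, flights).
    joined = [(airports[aid], count) for aid, count in top if aid in airports]
    # Final ORDER BY flights DESC, mirroring the re-sort after the JOIN.
    return sorted(joined, key=lambda row: row[1], reverse=True)

print(top_airports(flights, airports))
```

The Pig script sorts a second time after the `JOIN` because a join does not guarantee that the earlier ordering is preserved; the sketch mirrors that step even though it is redundant for a Python list.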


In [24]:
! pig -x local -f flights.pig -param airports_file='airports.csv' -param flights_file='flights.csv' -param output_dir='pig/output/flights'

2023-02-07 14:12:43,283 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2023-02-07 14:12:43,284 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2023-02-07 14:12:43,349 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2023-02-07 14:12:43,349 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/pig_1675775563347.log
2023-02-07 14:12:43,364 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2023-02-07 14:12:43,517 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2023-02-07 14:12:43,598 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-02-07 14:12:43,600 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2023-02-07 14:

2023-02-07 14:12:46,038 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6
2023-02-07 14:12:46,249 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1252471141_0001
2023-02-07 14:12:46,249 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Executing with tokens: []
2023-02-07 14:12:46,409 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2023-02-07 14:12:46,413 [Thread-6] INFO  org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
2023-02-07 14:12:46,414 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1252471141_0001
2023-02-07 14:12:46,414 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ARRIVES,DEPARTURES,FLIGHTS,OPERATIONS,TOTAL_OPERATIONS
2023-02-07 14:12:46,414 [main] INFO  org.apache.pig.b

2023-02-07 14:12:56,331 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2023-02-07 14:12:56,331 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - mapreduce.task.io.sort.mb: 100
2023-02-07 14:12:56,331 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2023-02-07 14:12:56,331 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2023-02-07 14:12:56,331 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2023-02-07 14:12:56,332 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2023-02-07 14:12:56,349 [LocalJobRunner Map Task Executor #0] INFO  org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen

2023-02-07 14:13:16,354 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2023-02-07 14:13:16,354 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2023-02-07 14:13:16,354 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorProcessTree : [ ]
2023-02-07 14:13:16,355 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Processing split: Number of splits :1
Total Length = 33554432
Input split[0]:
   Length = 33554432
   ClassName: org.apache.hadoop.mapreduce.lib.input.FileSplit
   Locations:

-----------------------

2023-02-07 14:13:16,361 [LocalJobRunner Map Task Executor #0] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecor

2023-02-07 14:13:25,476 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
2023-02-07 14:13:25,478 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local1252471141_0001_m_000004_0 is done. And is in the process of committing
2023-02-07 14:13:25,479 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - map
2023-02-07 14:13:25,479 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local1252471141_0001_m_000004_0' done.
2023-02-07 14:13:25,480 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task - Final Counters for attempt_local1252471141_0001_m_000004_0: Counters: 19
	File System Counters
		FILE: Number of bytes read=139221908
		FILE: Number of bytes written=640409
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input r

2023-02-07 14:13:26,931 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#1 about to shuffle output of map attempt_local1252471141_0001_m_000001_0 decomp: 1212 len: 1216 to MEMORY
2023-02-07 14:13:26,935 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 1212 bytes from map-output for attempt_local1252471141_0001_m_000001_0
2023-02-07 14:13:26,938 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 1212, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->1212
2023-02-07 14:13:26,943 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#1 about to shuffle output of map attempt_local1252471141_0001_m_000005_0 decomp: 1192 len: 1196 to MEMORY
2023-02-07 14:13:26,944 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 1192 bytes from map-output for attempt_lo

2023-02-07 14:13:27,008 [Thread-6] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2023-02-07 14:13:27,130 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 14% complete
2023-02-07 14:13:27,132 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:27,155 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:27,156 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2023-02-07 14:13:27,157 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:27,192 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2023-02-07 14:13:27,193 [main] INFO  org.apache.pig.backend.h

2023-02-07 14:13:27,437 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2023-02-07 14:13:27,438 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2023-02-07 14:13:27,442 [pool-9-thread-1] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorProcessTree : [ ]
2023-02-07 14:13:27,442 [pool-9-thread-1] INFO  org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@388f9432
2023-02-07 14:13:27,442 [pool-9-thread-1] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:27,444 [pool-9-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, merg

2023-02-07 14:13:27,719 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2023-02-07 14:13:27,721 [JobControl] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:27,726 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2023-02-07 14:13:27,728 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2023-02-07 14:13:27,729 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2023-02-07 14:13:27,729 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2023-02-07 14:13:27,730 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of 

2023-02-07 14:13:27,949 [pool-12-thread-1] INFO  org.apache.hadoop.mapred.ReduceTask - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@373379bf
2023-02-07 14:13:27,949 [pool-12-thread-1] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:27,950 [pool-12-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2023-02-07 14:13:27,954 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local2017498330_0003_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2023-02-07 14:13:27,959 [localfetcher#3] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#3 about to shuffle output of map attempt_local2017498330_0003_m_000000

2023-02-07 14:13:28,291 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2023-02-07 14:13:28,293 [Thread-34] INFO  org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
2023-02-07 14:13:28,297 [Thread-34] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2023-02-07 14:13:28,297 [Thread-34] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-02-07 14:13:28,297 [Thread-34] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2023-02-07 14:13:28,298 [Thread-34] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2023-02-07 14:13:28,298 [Thread-34] INFO  org.apache.hadoop.mapreduce

2023-02-07 14:13:28,592 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local868288986_0004
2023-02-07 14:13:28,592 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases TOTAL_OPERATIONS
2023-02-07 14:13:28,592 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: TOTAL_OPERATIONS[40,19] C:  R: 
2023-02-07 14:13:28,593 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 57% complete
2023-02-07 14:13:28,594 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:28,595 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:28,596 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system 

2023-02-07 14:13:29,002 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - 
2023-02-07 14:13:29,002 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2023-02-07 14:13:29,002 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Spilling map output
2023-02-07 14:13:29,002 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 13459; bufvoid = 104857600
2023-02-07 14:13:29,002 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 26212940(104851760); length = 1457/6553600
2023-02-07 14:13:29,010 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
2023-02-07 14:13:29,022 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local167550179_0005_m_000000_0 is done. And is in the process o

2023-02-07 14:13:29,110 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - 2 / 2 copied.
2023-02-07 14:13:29,110 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.Task - Task attempt_local167550179_0005_r_000000_0 is allowed to commit now
2023-02-07 14:13:29,112 [pool-18-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local167550179_0005_r_000000_0' to file:/tmp/temp-2041046484/tmp603311600
2023-02-07 14:13:29,114 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2023-02-07 14:13:29,114 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local167550179_0005_r_000000_0' done.
2023-02-07 14:13:29,115 [pool-18-thread-1] INFO  org.apache.hadoop.mapred.Task - Final Counters for attempt_local167550179_0005_r_000000_0: Counters: 24
	File System Counters
		FILE: Number of bytes read=144270677
		FILE: Number of bytes written=3141780
		FILE: Number of read operations=0
	

2023-02-07 14:13:29,374 [pool-21-thread-1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2023-02-07 14:13:29,376 [EventFetcher for fetching Map Completion Events] INFO  org.apache.hadoop.mapreduce.task.reduce.EventFetcher - attempt_local285307142_0006_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2023-02-07 14:13:29,377 [localfetcher#6] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#6 about to shuffle output of map attempt_local285307142_0006_m_000000_0 decomp: 122 len: 126 to MEMORY
2023-02-07 14:13:29,377 [localfetcher#6] INFO  org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 122 bytes from map-output for attempt_local285307142_0006_m_000000_0
2023-02-07 14:13:29,377 [localfetcher#6] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - cl

2023-02-07 14:13:29,625 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2023-02-07 14:13:29,626 [Thread-56] INFO  org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
2023-02-07 14:13:29,630 [Thread-56] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2023-02-07 14:13:29,630 [Thread-56] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2023-02-07 14:13:29,630 [Thread-56] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2023-02-07 14:13:29,630 [Thread-56] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
2023-02-07 14:13:29,630 [Thread-56] INFO  org.apache.hadoop.mapreduce

2023-02-07 14:13:29,826 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1757441329_0007
2023-02-07 14:13:29,826 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases TOP_TOTAL_OPERATIONS
2023-02-07 14:13:29,826 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: TOP_TOTAL_OPERATIONS[54,23] C:  R: 
2023-02-07 14:13:29,828 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:29,829 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:29,831 [main] WARN  org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2023-02-07 14:13:29,834 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Map

In essence, Pig lets us do the same things as Hive (the converse is not always true), only with a different syntax. Personally, I find Pig Latin easier to follow than HQL for complex queries.