[![pythonista](img/pythonista.png)](https://www.pythonista.io)

# Set operations.

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Intro a UNION").getOrCreate()
ct = spark.sparkContext
%load_ext sparksql_magic

## Representative tables.

The `COVID_NACIONAL` table holds the COVID-19 infection data for Mexico from February 26, 2020 to April 30, 2022.
The source file was retrieved from its original source on February 20, 2020.

In [None]:
spark.read.parquet('data/data_covid.parquet').createOrReplaceTempView("COVID_NACIONAL")

In [None]:
%%sparksql
SELECT first(index)
FROM COVID_NACIONAL;

In [None]:
%%sparksql
SELECT last(index)
FROM COVID_NACIONAL;

In [None]:
%%sparksql
SELECT count(index)
FROM COVID_NACIONAL;

The `COVID_NACIONAL_2022` table holds the COVID-19 infection data for Mexico from January 1, 2022 to December 31, 2022.

In [None]:
spark.read.parquet('data/data_covid_2022.parquet').createOrReplaceTempView("COVID_NACIONAL_2022")

In [None]:
%%sparksql
SELECT first(index)
FROM COVID_NACIONAL_2022;

In [None]:
%%sparksql
SELECT last(index)
FROM COVID_NACIONAL_2022;

In [None]:
%%sparksql
SELECT count(index)
FROM COVID_NACIONAL_2022;

## The `UNION` clause.

```
SELECT ...
....
UNION
SELECT ...
```

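This clause returns the set union of the results of two `SELECT` statements. Both queries must return the same number of columns with compatible data types. Plain `UNION` removes duplicate rows, while `UNION ALL` keeps them.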
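
To make the deduplication behavior concrete, here is a minimal sketch that runs the same kind of query on toy data. It assumes SQLite (from the Python standard library) as a stand-in for Spark SQL, and the tables `t_a` and `t_b` and their rows are invented for illustration; they are not the COVID tables used in this notebook.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; table names and rows are toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_a (idx TEXT, cases INTEGER);
    CREATE TABLE t_b (idx TEXT, cases INTEGER);
    INSERT INTO t_a VALUES ('2021-12-31', 10), ('2022-01-01', 20), ('2022-01-02', 30);
    INSERT INTO t_b VALUES ('2022-01-01', 20), ('2022-01-02', 35), ('2022-01-03', 40);
""")

# UNION drops the one row that is identical in both tables: ('2022-01-01', 20).
union_rows = sorted(conn.execute(
    "SELECT idx, cases FROM t_a UNION SELECT idx, cases FROM t_b"
).fetchall())

# UNION ALL keeps every row from both inputs, duplicates included.
union_all_rows = sorted(conn.execute(
    "SELECT idx, cases FROM t_a UNION ALL SELECT idx, cases FROM t_b"
).fetchall())

print(len(union_rows), len(union_all_rows))
conn.close()
```

In this toy data the date `'2022-01-02'` still appears twice in the `UNION` result because the two tables disagree on `cases`. Rows are treated as duplicates only when every selected column matches.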

In [None]:
%%sparksql
SELECT 
    Index,
    Nacional
FROM COVID_NACIONAL
UNION
SELECT 
    Index, 
    Nacional 
FROM COVID_NACIONAL_2022
ORDER BY Index;

In [None]:
%%sparksql
WITH COVID_TOTAL AS
    (SELECT 
        Index,
        Nacional
    FROM COVID_NACIONAL
    UNION
    SELECT 
        Index, 
        Nacional 
    FROM COVID_NACIONAL_2022)
SELECT count(index)
FROM COVID_TOTAL

## The `INTERSECT` clause.

In [None]:
%%sparksql
SELECT 
    Index,
    Nacional
FROM COVID_NACIONAL
INTERSECT
SELECT 
    Index, 
    Nacional 
FROM COVID_NACIONAL_2022

In [None]:
%%sparksql
WITH COVID_TOTAL AS
    (SELECT 
        Index,
        Nacional
    FROM COVID_NACIONAL
    INTERSECT
    SELECT 
        Index, 
        Nacional 
    FROM COVID_NACIONAL_2022)
SELECT count(index)
FROM COVID_TOTAL

In [None]:
%%sparksql
SELECT
    a.Index,
    a.Nacional AS Nacional_original,
    b.Nacional AS Nacional_2022
FROM COVID_NACIONAL AS a
INNER JOIN COVID_NACIONAL_2022 AS b
    ON a.index = b.index;
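
The `INTERSECT` and `INNER JOIN` queries above can return different numbers of rows, since `INTERSECT` compares entire rows while the join matches on the index column only. The following sketch illustrates the difference on toy data. It assumes SQLite as a stand-in for Spark SQL, and `t_a`, `t_b`, and their rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; table names and rows are toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_a (idx TEXT, cases INTEGER);
    CREATE TABLE t_b (idx TEXT, cases INTEGER);
    INSERT INTO t_a VALUES ('2021-12-31', 10), ('2022-01-01', 20), ('2022-01-02', 30);
    INSERT INTO t_b VALUES ('2022-01-01', 20), ('2022-01-02', 35), ('2022-01-03', 40);
""")

# INTERSECT keeps only rows where every selected column matches in both tables.
intersect_rows = sorted(conn.execute(
    "SELECT idx, cases FROM t_a INTERSECT SELECT idx, cases FROM t_b"
).fetchall())

# The join matches on idx alone, so it also pairs rows whose cases differ.
join_rows = sorted(conn.execute("""
    SELECT a.idx, a.cases, b.cases
    FROM t_a AS a
    INNER JOIN t_b AS b ON a.idx = b.idx
""").fetchall())

print(intersect_rows)
print(join_rows)
conn.close()
```

Comparing the two counts is a quick way to check whether overlapping dates carry the same values in both tables.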

## The `EXCEPT` clause.

In [None]:
%%sparksql
SELECT 
    Index,
    Nacional
FROM COVID_NACIONAL
EXCEPT (SELECT
        a.index,
        a.Nacional
    FROM COVID_NACIONAL AS a
    INNER JOIN COVID_NACIONAL_2022 AS b
        ON a.index = b.index);
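
The query above subtracts the joined rows rather than the second table directly. A minimal sketch of why that matters, assuming SQLite as a stand-in for Spark SQL and hypothetical toy tables `t_a` and `t_b`: a plain `EXCEPT` removes only rows that match on every column, while subtracting the join result removes every row whose index also exists in the other table. SQLite does not accept parentheses around the right-hand query, so they are omitted here.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; table names and rows are toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_a (idx TEXT, cases INTEGER);
    CREATE TABLE t_b (idx TEXT, cases INTEGER);
    INSERT INTO t_a VALUES ('2021-12-31', 10), ('2022-01-01', 20), ('2022-01-02', 30);
    INSERT INTO t_b VALUES ('2022-01-01', 20), ('2022-01-02', 35), ('2022-01-03', 40);
""")

# Plain EXCEPT: a row is removed only if the identical row exists in t_b.
plain_except = sorted(conn.execute(
    "SELECT idx, cases FROM t_a EXCEPT SELECT idx, cases FROM t_b"
).fetchall())

# EXCEPT against the join: every t_a row whose idx appears in t_b is removed,
# regardless of whether the cases values agree.
except_join = sorted(conn.execute("""
    SELECT idx, cases FROM t_a
    EXCEPT
    SELECT a.idx, a.cases
    FROM t_a AS a
    INNER JOIN t_b AS b ON a.idx = b.idx
""").fetchall())

print(plain_except)
print(except_join)
conn.close()
```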

In [None]:
%%sparksql
WITH PREVIO_A_2022
AS 
    (SELECT 
         Index,
         Nacional
     FROM COVID_NACIONAL
     EXCEPT 
     (SELECT
          a.index,
          a.Nacional
      FROM COVID_NACIONAL AS a
      INNER JOIN COVID_NACIONAL_2022 AS b
          ON a.index = b.index))
SELECT count(index)
FROM PREVIO_A_2022;

In [None]:
%%sparksql
SELECT count(index)
FROM
    (SELECT 
     Index,
     Nacional 
     FROM COVID_NACIONAL
     EXCEPT (SELECT
             a.index,
             a.Nacional
         FROM COVID_NACIONAL AS a
         INNER JOIN COVID_NACIONAL_2022 AS b
             ON a.index = b.index)
     UNION
     SELECT 
         Index,
         Nacional
     FROM COVID_NACIONAL_2022);
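
The chained query above first removes the overlapping dates from the first table and then appends the second table, which should leave one row per date. The sketch below reproduces that pattern on toy data and compares it with a plain `UNION`. It assumes SQLite as a stand-in for Spark SQL, with hypothetical tables `t_a` and `t_b`. SQLite evaluates `EXCEPT` and `UNION` from left to right; the same order is assumed for Spark here, and that assumption is worth checking against the Spark documentation.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; table names and rows are toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_a (idx TEXT, cases INTEGER);
    CREATE TABLE t_b (idx TEXT, cases INTEGER);
    INSERT INTO t_a VALUES ('2021-12-31', 10), ('2022-01-01', 20), ('2022-01-02', 30);
    INSERT INTO t_b VALUES ('2022-01-01', 20), ('2022-01-02', 35), ('2022-01-03', 40);
""")

# Plain UNION: dates whose values differ between tables appear twice.
(plain_count,) = conn.execute("""
    SELECT count(idx) FROM
        (SELECT idx, cases FROM t_a
         UNION
         SELECT idx, cases FROM t_b)
""").fetchone()

# EXCEPT then UNION, evaluated left to right:
# (t_a minus the overlapping dates) plus all of t_b.
(chained_count,) = conn.execute("""
    SELECT count(idx) FROM
        (SELECT idx, cases FROM t_a
         EXCEPT
         SELECT a.idx, a.cases
         FROM t_a AS a
         INNER JOIN t_b AS b ON a.idx = b.idx
         UNION
         SELECT idx, cases FROM t_b)
""").fetchone()

print(plain_count, chained_count)
conn.close()
```

With this toy data the chained form yields one row per distinct date, with the second table's values taking priority on the overlap.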

In [None]:
%%sparksql
SELECT 
    Index,
    Nacional
FROM COVID_NACIONAL
WHERE Index BETWEEN "2020-01-01" AND "2021-12-31";

<p style="text-align: center"><a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/80x15.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p>
<p style="text-align: center">&copy; José Luis Chiquete Valdivieso. 2023.</p>