[![pythonista](img/pythonista.png)](https://www.pythonista.io)

# Complex data types.

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Intro a SELECT").getOrCreate()
ct = spark.sparkContext

In [None]:
df = spark.read.parquet('data/data_covid.parquet')
df.createOrReplaceTempView("COVID_NACIONAL")

### Using ```SELECT``` ... ```FROM```.

```
SELECT <elements>
FROM <sources>
```

### ```SELECT *```

In [None]:
spark.sql('''SELECT *
             FROM COVID_NACIONAL''')

In [None]:
spark.sql('''SELECT *
             FROM COVID_NACIONAL''').show()

In [None]:
spark.sql('''SELECT *
             FROM COVID_NACIONAL''').toPandas()

### Selecting specific columns.

```
SELECT <col 1>, <col 2>, ... <col n>
```

In [None]:
spark.sql('''SELECT Aguascalientes, Nacional
             FROM COVID_NACIONAL''').toPandas()

In [None]:
spark.sql('''SELECT Index, Aguascalientes, Nacional
             FROM COVID_NACIONAL''').toPandas()

## Operations on columns.

A ```SELECT``` can evaluate expressions that involve columns, combining them with functions and operators.
```
SELECT <expression 1>, <expression 2>, ... <expression n>
```

In [None]:
spark.sql('''SELECT (Nacional / 32)
             FROM COVID_NACIONAL''').toPandas()

In [None]:
spark.sql('''SELECT 2 * 5''').toPandas()

### Functions

https://spark.apache.org/docs/latest/api/sql/index.html

In [None]:
spark.sql('''SELECT CONCAT( 'dia ', Index)
             FROM COVID_NACIONAL''').toPandas()

In [None]:
spark.sql('''SELECT AVG(AGUASCALIENTES)
             FROM COVID_NACIONAL''').toPandas()

In [None]:
spark.sql('''SELECT EXTRACT(YEAR FROM Index), Index
             FROM COVID_NACIONAL''').toPandas()

## Using ```AS```.

In [None]:
spark.sql('''SELECT AGUASCALIENTES AS Ags, 
                    (Nacional / 32) as Promedio
             FROM COVID_NACIONAL''').toPandas()

### Using references.

The dot (```.```) qualifies a column with the table, or table alias, it belongs to.

In [None]:
spark.sql('''SELECT COV.AGUASCALIENTES as Ags 
             FROM COVID_NACIONAL as COV''').toPandas()

### Selecting columns from more than one table.

In [None]:
spark.sql('''SELECT (Nacional /32) as promedio
             FROM COVID_NACIONAL''').createOrReplaceTempView('Promedio_Nacional')

In [None]:
spark.sql('''SELECT c.Index, c.Nacional, p.promedio
             FROM COVID_NACIONAL as c, Promedio_Nacional as p''').toPandas()
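
Listing two sources separated by a comma, with no join condition, pairs every row of the first source with every row of the second, so the query above returns one row per combination rather than one row per day. The sketch below illustrates this on toy data. It is only a stand-in under stated assumptions: it uses Python's built-in ```sqlite3``` module instead of Spark so that it runs without a Spark installation, and the tables ```cases``` and ```averages``` and their values are hypothetical.

```python
import sqlite3

# Hypothetical toy data; sqlite3 stands in for Spark SQL here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cases (day TEXT, nacional INTEGER)")
con.execute("CREATE TABLE averages (promedio REAL)")
con.executemany("INSERT INTO cases VALUES (?, ?)",
                [("2020-03-01", 32), ("2020-03-02", 64), ("2020-03-03", 96)])

# Mirrors how Promedio_Nacional is derived: one average per source row.
con.execute("INSERT INTO averages SELECT nacional / 32.0 FROM cases")

# Comma-separated sources with no condition: every pairing of rows.
cross_rows = con.execute(
    "SELECT c.day, c.nacional, p.promedio FROM cases AS c, averages AS p"
).fetchall()

print(len(cross_rows))  # 3 rows x 3 rows = 9
```

To get one row per day, the two sources would need a shared key and a condition that matches them on it.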

## Using ```LIMIT```.

* The following statement returns the first ```20``` records produced by the query on the ```Nacional``` column.

In [None]:
spark.sql('''SELECT Nacional
             FROM COVID_NACIONAL
             LIMIT 20
             ''').toPandas()

## Using ```WHERE```.

In [None]:
spark.sql('''SELECT Index, Aguascalientes, (Nacional/32) as promedio
             FROM COVID_NACIONAL
             WHERE Aguascalientes > (Nacional/32)
             ''').toPandas()

In [None]:
spark.sql('''SELECT *
             FROM COVID_NACIONAL
             WHERE Index = '2020-03-25'
             ''').toPandas()

### Using ```AND```.

In [None]:
spark.sql('''SELECT Index, Aguascalientes
             FROM COVID_NACIONAL
             WHERE AGUASCALIENTES > (Nacional/32)
               AND AGUASCALIENTES > 100
             ''').toPandas()

### Using ```OR```.

In [None]:
spark.sql('''SELECT Index, Aguascalientes
             FROM COVID_NACIONAL
             WHERE AGUASCALIENTES > (Nacional/32) OR AGUASCALIENTES = 0
             ''').toPandas()
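
When ```AND``` and ```OR``` appear in the same ```WHERE``` clause, ```AND``` binds more tightly, so an ```OR``` condition has to be wrapped in parentheses to be evaluated as a group. A minimal sketch of the difference, assuming a hypothetical ```cases``` table with made-up values and using Python's built-in ```sqlite3``` module rather than Spark so that it is self-contained:

```python
import sqlite3

# Hypothetical toy data; sqlite3 stands in for Spark SQL here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cases (day TEXT, ags INTEGER, nacional INTEGER)")
con.executemany("INSERT INTO cases VALUES (?, ?, ?)",
                [("d1", 0, 10), ("d2", 0, 3200),
                 ("d3", 150, 3200), ("d4", 50, 3200)])

# Parsed as: ags = 0 OR (ags > 100 AND nacional > 1000)
no_parens = con.execute("""SELECT day FROM cases
                           WHERE ags = 0 OR ags > 100 AND nacional > 1000
                        """).fetchall()

# Parentheses make the OR a single condition that is then ANDed.
with_parens = con.execute("""SELECT day FROM cases
                             WHERE (ags = 0 OR ags > 100) AND nacional > 1000
                          """).fetchall()

print(sorted(no_parens))
print(sorted(with_parens))
```

Row ```d1``` appears only in the first result: it satisfies ```ags = 0``` but not the condition on ```nacional```.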

### Using ```BETWEEN```.

In [None]:
spark.sql('''SELECT Index, Aguascalientes
             FROM COVID_NACIONAL
             WHERE Aguascalientes BETWEEN 100 AND 500
             ''').toPandas()

In [None]:
spark.sql('''SELECT Index, Aguascalientes
             FROM COVID_NACIONAL
             WHERE Index BETWEEN '2021-01-01' AND '2021-01-15'
             ''').toPandas()

## Using ```ORDER BY```.

In [None]:
spark.sql('''SELECT Index, Aguascalientes, Nacional
             FROM COVID_NACIONAL
             WHERE Index BETWEEN '2021-01-01' AND '2021-01-15'
             ORDER BY Nacional
             ''').toPandas()

In [None]:
spark.sql('''SELECT Index, Aguascalientes, Nacional
             FROM COVID_NACIONAL
             WHERE Index BETWEEN '2021-01-01' AND '2021-01-15'
             ORDER BY Nacional DESC
             ''').toPandas()
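
Sorting also gives ```LIMIT``` a well-defined meaning: without ```ORDER BY``` the engine may return any rows that fit the limit, whereas the two clauses together select a "top N". A minimal sketch, assuming a hypothetical ```cases``` table with invented values and using Python's built-in ```sqlite3``` module instead of Spark so that it is self-contained:

```python
import sqlite3

# Hypothetical toy data; sqlite3 stands in for Spark SQL here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cases (day TEXT, nacional INTEGER)")
con.executemany("INSERT INTO cases VALUES (?, ?)",
                [("2021-01-01", 500), ("2021-01-02", 900),
                 ("2021-01-03", 700), ("2021-01-04", 300)])

# Sort from highest to lowest, then keep only the first two rows.
top_two = con.execute("""SELECT day, nacional
                         FROM cases
                         ORDER BY nacional DESC
                         LIMIT 2""").fetchall()

print(top_two)
```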

## ```GROUP BY```

In [4]:
iris = spark.read.option('header','true').option('inferSchema', 'true').csv('data/IRIS.csv')
iris.createOrReplaceTempView("Iris")

In [5]:
spark.sql('''SELECT * 
             FROM Iris
             ''').toPandas()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [9]:
spark.sql('''SELECT species, COUNT(species) 
             FROM Iris
             GROUP BY species
          ''').toPandas()

| | species | count(species) |
|---|---|---|
| 0 | Iris-virginica | 50 |
| 1 | Iris-setosa | 50 |
| 2 | Iris-versicolor | 50 |


<p style="text-align: center"><a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/80x15.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p>
<p style="text-align: center">&copy; José Luis Chiquete Valdivieso. 2023.</p>