
# PySpark + SQL

PySpark is the Python interface to Apache Spark. It is mainly used to process large volumes of data and to build processing pipelines.

You do not need big data to benefit from PySpark, though. Spark SQL is a good tool for efficient data analysis. Pandas can become slow on larger datasets, and cleaning and transforming data with it often takes a lot of code, whereas the same operations in SQL usually need fewer lines and read more clearly. Let's get started!

More information:
http://spark.apache.org/docs/latest/api/python/

# 1. Installing PySpark in Google Colab

In [None]:
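# Refresh the package index and install Java, which Spark needs because it runs on the JVM.
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack the Spark 3.2.1 distribution.
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
# Python packages: findspark locates the Spark installation, pyspark is the Python API, py4j bridges Python and the JVM.
!pip install -q findspark
!pip install pyspark
!pip install py4j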

import os
import sys

import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

spark = (
    SparkSession.builder
    .appName("5.1 Spark y SQL")
    .getOrCreate()
)

spark

In [None]:
spark

# 2. Reading the data

We use a public COVID-19 dataset published by Our World in Data (OWID).

In [None]:
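import requests

# Download the OWID COVID-19 CSV into the Colab working directory.
path = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
req = requests.get(path)
req.raise_for_status()  # stop here if the download failed

csv_file_name = 'owid-covid-data.csv'
with open(csv_file_name, 'wb') as csv_file:
    csv_file.write(req.content)

# header=True takes column names from the first row; inferSchema=True lets Spark guess the column types.
df = spark.read.csv('/content/' + csv_file_name, header=True, inferSchema=True)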

# 3. PySpark DataFrames

In [None]:
# Inspect the DataFrame schema
df.printSchema()

In [None]:
# Convert the `date` column to a date type
df.select(F.to_date(df.date).alias('date'))

In [None]:
# Summary statistics
df.describe().show()

In [None]:
# Filter the DataFrame:
# rows for Argentina, ordered by date descending.
df.filter(df.location == "Argentina").orderBy(F.desc("date")).show()

In [None]:
# Group by location and use sum as the aggregate over new cases.
df.groupBy("location").sum("new_cases").orderBy(F.desc("sum(new_cases)")).show(truncate=False)

# 4. Spark SQL

The SQL module makes it easy to interact with the data while still using Spark. There is less to learn, since it uses essentially the same SQL syntax you are probably already familiar with.

In [None]:
# Create a table from the DataFrame
df.createOrReplaceTempView("covid_data")  # temporary view, visible only in this Spark session
# df.write.saveAsTable("covid_data")  # option: save it as a persistent table
# df.write.mode("overwrite").saveAsTable("covid_data")  # save as a table, overwriting it if it exists

In [None]:

df2 = spark.sql("SELECT * from covid_data")
df2.printSchema()


In [None]:
df2.show()

In [None]:
groupDF = spark.sql("SELECT location, count(*) from covid_data group by location order by count(*)")
groupDF.show()

### N: Get the total confirmed COVID-19 cases per country as of the most recent available date

In [None]:
# total_cases is a running total, so the latest figure per country is its maximum.
query = """
SELECT location AS pais, max(total_cases) AS total_casos_confirmados
FROM covid_data
GROUP BY location
ORDER BY total_casos_confirmados DESC
"""

result = spark.sql(query)
result.show()

### O: Query the number of deaths on a specific date (2022-01-01)

In [None]:
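# total_deaths is cumulative, so this returns the death toll up to that date.
# Rows with a NULL continent are OWID aggregates (World, continents, income groups);
# they are excluded so that deaths are not counted more than once.
query = """
SELECT date, sum(total_deaths) AS num_muertes
FROM covid_data
WHERE date = '2022-01-01' AND continent IS NOT NULL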
GROUP BY date
"""

result = spark.sql(query)
result.show()

### P: Get the daily evolution of cases in a specific country (Argentina)

In [None]:
query = """
SELECT location AS pais, date, new_cases
FROM covid_data
WHERE iso_code = 'ARG'
ORDER BY date DESC
"""

result = spark.sql(query)
result.show(10)

### Q: Compute the percentage of the population vaccinated per country

In [None]:
query = """
SELECT location AS pais, max(people_vaccinated_per_hundred) AS porcentaje_personas_vacunadas
FROM covid_data
WHERE people_vaccinated_per_hundred IS NOT NULL
GROUP BY location
ORDER BY porcentaje_personas_vacunadas
"""

result = spark.sql(query)
result.show()

### R: Compute total cases and deaths in the most recent available month

In [None]:
query = """
SELECT location AS pais, SUM(new_cases) AS total_casos_mes, SUM(new_deaths) AS total_muertes_mes
FROM covid_data
WHERE new_cases IS NOT NULL
  AND new_deaths IS NOT NULL
  AND year(date) = (SELECT max(year(date)) FROM covid_data)
  AND month(date) = (
        SELECT max(month(date))
        FROM covid_data
        WHERE year(date) = (SELECT max(year(date)) FROM covid_data)
      )
GROUP BY location
ORDER BY total_casos_mes DESC, total_muertes_mes DESC
"""

result = spark.sql(query)
result.show()

In [None]:
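# Variant without the year filter: it keeps every row whose month number is at least
# the latest month of the latest year, in any year, so the totals are not limited
# to the most recent month.
query = """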
SELECT location AS pais, SUM(new_cases) AS total_casos_mes, SUM(new_deaths) AS total_muertes_mes
FROM covid_data
WHERE new_cases IS NOT NULL AND new_deaths IS NOT NULL AND month(date) >= (SELECT DISTINCT max(month(date))
                                                                          FROM covid_data
                                                                          WHERE year(date) >= (SELECT DISTINCT max(year(date))
                                                                          FROM covid_data)
                                                                          )
GROUP BY location
ORDER BY total_casos_mes DESC, total_muertes_mes DESC
"""

result = spark.sql(query)
result.show()