<a href="https://colab.research.google.com/github/MiguelAngeloTr/BIGDATA/blob/main/C1P2/%20Pr%C3%A1ctica_2_DataFrame_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing PySpark

1.   Java 8
2.   Apache Spark with Hadoop
3.   Findspark (used to locate Spark on the system)

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Render DataFrames as formatted tables in the notebook
spark

### Exploratory Data Analysis

In [None]:
!wget https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv

--2024-08-12 15:27:27--  https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv
Resolving jacobceles.github.io (jacobceles.github.io)... 185.199.111.153, 185.199.108.153, 185.199.110.153, ...
Connecting to jacobceles.github.io (jacobceles.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://jacobcelestine.com/knowledge_repo/colab_and_pyspark/cars.csv [following]
--2024-08-12 15:27:28--  https://jacobcelestine.com/knowledge_repo/colab_and_pyspark/cars.csv
Resolving jacobcelestine.com (jacobcelestine.com)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to jacobcelestine.com (jacobcelestine.com)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22608 (22K) [text/csv]
Saving to: ‘cars.csv’


2024-08-12 15:27:28 (82.1 MB/s) - ‘cars.csv’ saved [22608/22608]



In [None]:
df = spark.read.csv('cars.csv', header=True, sep=";")
df.show()

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0| 3504.|        12.0|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0| 3693.|        11.5|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0| 3436.|        11.0|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0| 3433.|        12.0|   70|    US|
|         Ford Torino|17.0|        8|       302.0|     140.0| 3449.|        10.5|   70|    US|
|    Ford Galaxie 500|15.0|        8|       429.0|     198.0| 4341.|        10.0|   70|    US|
|    Chevrolet Impala|14.0|        8|       454.0|     220.0| 4354.|         9.0|   70|    US|
|   Plymouth Fury iii|14.0|        8|       440.0|

The command above loads our data into a DataFrame (DF), that is, a two-dimensional labeled data structure whose columns can have different types.

### Viewing DataFrames

There are several ways to view your DataFrame (DF) in PySpark:

1.   `df.take(n)` returns a list of `n` Row objects.
2.   `df.collect()` fetches all the data in the DataFrame. Be very careful with it: on a large dataset it can easily crash the driver node.
3.   `df.show()` is the most commonly used way to view a DataFrame. It accepts a few parameters, such as the number of rows and truncation. For example, `df.show(5, False)` or `df.show(5, truncate=False)` shows the first 5 rows with their full contents, without truncation.
4.  `df.limit(n)` **returns a new DataFrame** containing the first `n` rows. Because Spark is distributed, there is no guarantee that `df.limit()` returns the same rows every time unless the data is explicitly ordered.

Let's see some of them in action:

In [None]:
df.show(5, truncate = False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+-------------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504. |12.0        |70   |US    |
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693. |11.5        |70   |US    |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436. |11.0        |70   |US    |
|AMC Rebel SST            |16.0|8        |304.0       |150.0     |3433. |12.0        |70   |US    |
|Ford Torino              |17.0|8        |302.0       |140.0     |3449. |10.5        |70   |US    |
+-------------------------+----+---------+------------+----------+------+------------+-----+------+
only showing top 5 rows



In [None]:
df.limit(5)

Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
Chevrolet Chevell...,18.0,8,307.0,130.0,3504.0,12.0,70,US
Buick Skylark 320,15.0,8,350.0,165.0,3693.0,11.5,70,US
Plymouth Satellite,18.0,8,318.0,150.0,3436.0,11.0,70,US
AMC Rebel SST,16.0,8,304.0,150.0,3433.0,12.0,70,US
Ford Torino,17.0,8,302.0,140.0,3449.0,10.5,70,US


### Viewing the DataFrame's columns

In [None]:
df.columns

['Car',
 'MPG',
 'Cylinders',
 'Displacement',
 'Horsepower',
 'Weight',
 'Acceleration',
 'Model',
 'Origin']

### DataFrame schema

Two methods are commonly used to inspect the data types of a DataFrame:

In [None]:
df.dtypes

[('Car', 'string'),
 ('MPG', 'string'),
 ('Cylinders', 'string'),
 ('Displacement', 'string'),
 ('Horsepower', 'string'),
 ('Weight', 'string'),
 ('Acceleration', 'string'),
 ('Model', 'string'),
 ('Origin', 'string')]

In [None]:
df.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: string (nullable = true)
 |-- Cylinders: string (nullable = true)
 |-- Displacement: string (nullable = true)
 |-- Horsepower: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Acceleration: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Origin: string (nullable = true)



#### Inferring the schema implicitly

In [None]:
df = spark.read.csv('cars.csv', header=True, sep=";", inferSchema=True)
df.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Displacement: double (nullable = true)
 |-- Horsepower: double (nullable = true)
 |-- Weight: decimal(4,0) (nullable = true)
 |-- Acceleration: double (nullable = true)
 |-- Model: integer (nullable = true)
 |-- Origin: string (nullable = true)



As you can see, the data types have been inferred automatically, even with the right precision for the decimal type. One problem that can arise here is that, when you read multiple files whose schemas differ from one another, implicit inference can leave null values in some columns. So let's also look at how to define schemas explicitly.

#### Defining the schema explicitly

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
df.columns

['Car',
 'MPG',
 'Cylinders',
 'Displacement',
 'Horsepower',
 'Weight',
 'Acceleration',
 'Model',
 'Origin']

In [None]:
# Describe the schema as a list of (column_name, data_type) pairs
labels = [
     ('Car',StringType()),
     ('MPG',DoubleType()),
     ('Cylinders',IntegerType()),
     ('Displacement',DoubleType()),
     ('Horsepower',DoubleType()),
     ('Weight',DoubleType()),
     ('Acceleration',DoubleType()),
     ('Model',StringType()),
     ('Origin',StringType())
]

In [None]:
# Build the schema that will be passed when reading the CSV
schema = StructType([StructField(name, dtype, True) for name, dtype in labels])
schema

StructType(List(StructField(Car,StringType,true),StructField(MPG,DoubleType,true),StructField(Cylinders,IntegerType,true),StructField(Displacement,DoubleType,true),StructField(Horsepower,DoubleType,true),StructField(Weight,DoubleType,true),StructField(Acceleration,DoubleType,true),StructField(Model,StringType,true),StructField(Origin,StringType,true)))

In [None]:
df = spark.read.csv('cars.csv', header=True, sep=";", schema=schema)
df.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Displacement: double (nullable = true)
 |-- Horsepower: double (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Acceleration: double (nullable = true)
 |-- Model: string (nullable = true)
 |-- Origin: string (nullable = true)



In [None]:
df.show(truncate=False)

+--------------------------------+----+---------+------------+----------+------+------------+-----+------+
|Car                             |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevelle Malibu       |18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |
|Buick Skylark 320               |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |
|Plymouth Satellite              |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |
|AMC Rebel SST                   |16.0|8        |304.0       |150.0     |3433.0|12.0        |70   |US    |
|Ford Torino                     |17.0|8        |302.0       |140.0     |3449.0|10.5        |70   |US    |
|Ford Galaxie 500                |15.0|8        |429.0       |198.0     |4341.0|10.0        |70   |US    |
|Chevrolet Impala                |14.

As we can see here, the data has been loaded correctly with the specified data types.

## DataFrame operations on columns

In this section we will go over the following:

1.   Selecting columns
2.   Selecting multiple columns
3.   Adding new columns
4.   Renaming columns
5.   Dropping columns
6.   Grouping by columns

### Selecting columns
There are several ways to select columns in PySpark. Below you can see how they differ and how each one works:

In [None]:
print(df.Car)
df.select(df.Car).show(3,truncate=False)

Column<'Car'>
+-------------------------+
|Car                      |
+-------------------------+
|Chevrolet Chevelle Malibu|
|Buick Skylark 320        |
|Plymouth Satellite       |
+-------------------------+
only showing top 3 rows



In [None]:
print(df['car'])
df.select(df['car']).show(3,truncate=False)

Column<'car'>
+-------------------------+
|car                      |
+-------------------------+
|Chevrolet Chevelle Malibu|
|Buick Skylark 320        |
|Plymouth Satellite       |
+-------------------------+
only showing top 3 rows



### Selecting multiple columns

In [None]:
print(df.Car, df.Cylinders)
df.select(df.Car, df.Cylinders).show(5,truncate=False)

Column<'Car'> Column<'Cylinders'>
+-------------------------+---------+
|Car                      |Cylinders|
+-------------------------+---------+
|Chevrolet Chevelle Malibu|8        |
|Buick Skylark 320        |8        |
|Plymouth Satellite       |8        |
|AMC Rebel SST            |8        |
|Ford Torino              |8        |
+-------------------------+---------+
only showing top 5 rows



In [None]:
print(df['car'],df['cylinders'])
df.select(df['car'],df['cylinders']).show(5,truncate=False)

Column<'car'> Column<'cylinders'>
+-------------------------+---------+
|car                      |cylinders|
+-------------------------+---------+
|Chevrolet Chevelle Malibu|8        |
|Buick Skylark 320        |8        |
|Plymouth Satellite       |8        |
|AMC Rebel SST            |8        |
|Ford Torino              |8        |
+-------------------------+---------+
only showing top 5 rows



In [None]:
from pyspark.sql.functions import col
df.select(col('car'),col('cylinders')).show(5,truncate=False)

+-------------------------+---------+
|car                      |cylinders|
+-------------------------+---------+
|Chevrolet Chevelle Malibu|8        |
|Buick Skylark 320        |8        |
|Plymouth Satellite       |8        |
|AMC Rebel SST            |8        |
|Ford Torino              |8        |
+-------------------------+---------+
only showing top 5 rows



### Adding new columns

We will look at three cases here:

1.   Adding a new column
2.   Adding multiple columns
3.   Deriving a new column from an existing one

In [None]:
# CASE 1: Add a new column
# We'll add a new column called 'first_column' at the end
from pyspark.sql.functions import lit
df = df.withColumn('first_column',lit(3))
# lit stands for literal. It fills every row with the given literal value.
# Using it is good practice when adding static data / constant values.
df.show(5,truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+------------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|first_column|
+-------------------------+----+---------+------------+----------+------+------------+-----+------+------------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |3           |
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |3           |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |3           |
|AMC Rebel SST            |16.0|8        |304.0       |150.0     |3433.0|12.0        |70   |US    |3           |
|Ford Torino              |17.0|8        |302.0       |140.0     |3449.0|10.5        |70   |US    |3           |
+-------------------------+----+---------+------------+----------+------+------------+-----+----

In [None]:
df.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Displacement: double (nullable = true)
 |-- Horsepower: double (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Acceleration: double (nullable = true)
 |-- Model: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- first_column: integer (nullable = false)



In [None]:
# CASE 2: Add multiple columns
# We'll add two new columns called 'second_column' and 'third_column' at the end
df = df.withColumn('second_column', lit(2)) \
       .withColumn('third_column', lit('Columna'))
# lit stands for literal. It fills every row with the given literal value.
# Using it is good practice when adding static data / constant values.
df.show(5,truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+------------+-------------+------------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|first_column|second_column|third_column|
+-------------------------+----+---------+------------+----------+------+------------+-----+------+------------+-------------+------------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |3           |2            |Columna     |
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |3           |2            |Columna     |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |3           |2            |Columna     |
|AMC Rebel SST            |16.0|8        |304.0       |150.0     |3433.0|12.0        |70   |US    |3           |2            |Columna     |
|Ford Torino        

In [None]:
# CASE 3: Derive a new column from existing ones
# We'll add a new column called 'car_model' that holds the Car and Model values joined with a space in between
from pyspark.sql.functions import concat
df = df.withColumn('car_model', concat(col("Car"), lit(" "), col("model")))
# Here lit(" ") supplies the literal space placed between the two column values.
df.show(5,truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+------------+-------------+------------+----------------------------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|first_column|second_column|third_column|car_model                   |
+-------------------------+----+---------+------------+----------+------+------------+-----+------+------------+-------------+------------+----------------------------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |3           |2            |Columna     |Chevrolet Chevelle Malibu 70|
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |3           |2            |Columna     |Buick Skylark 320 70        |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |3           |2            |Columna     |Plymouth Satelli

As we can see, the new `car_model` column has been created from existing columns. Since our goal was a column holding the car and model values joined with a space between them, we used the `concat` function.

### Renaming columns

In [None]:
# Renaming columns in PySpark
df = df.withColumnRenamed('first_column', 'new_column_one') \
       .withColumnRenamed('second_column', 'new_column_two') \
       .withColumnRenamed('third_column', 'new_column_three')
df.show(truncate=False)

+--------------------------------+----+---------+------------+----------+------+------------+-----+------+--------------+--------------+----------------+-----------------------------------+
|Car                             |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|new_column_one|new_column_two|new_column_three|car_model                          |
+--------------------------------+----+---------+------------+----------+------+------------+-----+------+--------------+--------------+----------------+-----------------------------------+
|Chevrolet Chevelle Malibu       |18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |3             |2             |Columna         |Chevrolet Chevelle Malibu 70       |
|Buick Skylark 320               |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |3             |2             |Columna         |Buick Skylark 320 70               |
|Plymouth Satellite              |18.0|8        |3

### Dropping columns

In [None]:
# Dropping a column in PySpark
df = df.drop('new_column_one')
df.show(5,truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+--------------+----------------+----------------------------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|new_column_two|new_column_three|car_model                   |
+-------------------------+----+---------+------------+----------+------+------------+-----+------+--------------+----------------+----------------------------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |2             |Columna         |Chevrolet Chevelle Malibu 70|
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |2             |Columna         |Buick Skylark 320 70        |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |2             |Columna         |Plymouth Satellite 70       |
|AMC Rebel SST            |16.0|8 

In [None]:
# Dropping multiple columns in a single call
df = df.drop('new_column_two', 'new_column_three')
df.show(5,truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+----------------------------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                   |
+-------------------------+----+---------+------------+----------+------+------------+-----+------+----------------------------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |Chevrolet Chevelle Malibu 70|
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |Buick Skylark 320 70        |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |Plymouth Satellite 70       |
|AMC Rebel SST            |16.0|8        |304.0       |150.0     |3433.0|12.0        |70   |US    |AMC Rebel SST 70            |
|Ford Torino              |17.0|8        |302.0       |140.0     |3449.0|10.5        |70   |US   

### Grouping by columns
Here we look at how to group values with the DataFrame API. We will cover how to:

1.   Group by a single column
2.   Group by multiple columns

In [None]:
# Grouping by a single column in PySpark
df.groupBy('Origin').count().show(5)

+------+-----+
|Origin|count|
+------+-----+
|Europe|   73|
|    US|  254|
| Japan|   79|
+------+-----+



In [None]:
# Count the number of records per model
df.groupBy('Model').count().show(40)

+-----+-----+
|Model|count|
+-----+-----+
|   73|   40|
|   71|   29|
|   70|   35|
|   75|   30|
|   78|   36|
|   77|   28|
|   82|   31|
|   81|   30|
|   79|   29|
|   72|   28|
|   74|   27|
|   76|   34|
|   80|   29|
+-----+-----+



In [None]:
# Count the number of records per car
df.groupBy('Car').count().show(40)

+--------------------+-----+
|                 Car|count|
+--------------------+-----+
|Volkswagen 1131 D...|    1|
|Chevrolete Chevel...|    1|
|Chevrolet Monte C...|    2|
|     Ford LTD Landau|    1|
|       Honda Prelude|    1|
|      Chevrolet Nova|    3|
|   Volkswagen Rabbit|    5|
|     Ford Torino 500|    1|
|        Toyota Camry|    1|
|         Audi 100 LS|    1|
|Plymouth Valiant ...|    1|
|Toyota Corolla Ma...|    1|
|Oldsmobile Cutlas...|    1|
|Fiat 124 Sport Coupe|    1|
|     Volvo 145e (sw)|    1|
|Chevrolet Caprice...|    3|
|            Audi Fox|    1|
|    Chevrolet Camaro|    1|
|       Dodge Aspen 6|    1|
|    Pontiac Catalina|    3|
|AMC Ambassador Br...|    1|
|       Ford Maverick|    5|
|      Chevrolet Vega|    3|
|   Plymouth Fury III|    1|
|       Datsun 200-SX|    1|
|Plymouth Volare P...|    1|
|   Plymouth Arrow GS|    1|
|     Mazda RX2 Coupe|    1|
|           Subaru DL|    2|
|      Dodge Aspen SE|    1|
|    Mazda GLC Custom|    1|
|          Dat

## Row-level operations on DataFrames

In this section we will cover the following:

1.   Filtering rows
2.   Getting distinct rows
3.   Sorting rows
4.   Union of DataFrames

### Filtering rows

In [None]:
# Filtering rows in PySpark
total_count = df.count()
print("TOTAL RECORD COUNT: " + str(total_count))
europe_filtered_count = df.filter(col('Origin')=='Europe').count()
print("EUROPE FILTERED RECORD COUNT: " + str(europe_filtered_count))
df.filter(col('Origin')=='Europe').show(truncate=False)

TOTAL RECORD COUNT: 406
EUROPE FILTERED RECORD COUNT: 73
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Car                         |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                      |
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Citroen DS-21 Pallas        |0.0 |4        |133.0       |115.0     |3090.0|17.5        |70   |Europe|Citroen DS-21 Pallas 70        |
|Volkswagen 1131 Deluxe Sedan|26.0|4        |97.0        |46.0      |1835.0|20.5        |70   |Europe|Volkswagen 1131 Deluxe Sedan 70|
|Peugeot 504                 |25.0|4        |110.0       |87.0      |2672.0|17.5        |70   |Europe|Peugeot 504 70                 |
|Audi 100 LS                 |24.0|4        |107.0       |90.0      |2430.0|14.5        |70   |Europe|Audi 100 LS 70                 

In [None]:
total_count = df.count()
print("TOTAL RECORD COUNT: " + str(total_count))
US_filtered_count = df.filter(col('Origin')=='US').count()
print("USA FILTERED RECORD COUNT: " + str(US_filtered_count))
df.filter(col('Origin')=='US').show(truncate=False)

TOTAL RECORD COUNT: 406
USA FILTERED RECORD COUNT: 254
+--------------------------------+----+---------+------------+----------+------+------------+-----+------+-----------------------------------+
|Car                             |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                          |
+--------------------------------+----+---------+------------+----------+------+------------+-----+------+-----------------------------------+
|Chevrolet Chevelle Malibu       |18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |Chevrolet Chevelle Malibu 70       |
|Buick Skylark 320               |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |Buick Skylark 320 70               |
|Plymouth Satellite              |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |Plymouth Satellite 70              |
|AMC Rebel SST                   |16.0|8        |304.0       |150.0     |3433.0|12.0   

In [None]:
total_count = df.count()
print("TOTAL RECORD COUNT: " + str(total_count))
Japan_filtered_count = df.filter(col('Origin')=='Japan').count()
print("Japan FILTERED RECORD COUNT: " + str(Japan_filtered_count))
df.filter(col('Origin')=='Japan').show(truncate=False)

TOTAL RECORD COUNT: 406
Japan FILTERED RECORD COUNT: 79
+---------------------------+----+---------+------------+----------+------+------------+-----+------+------------------------------+
|Car                        |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                     |
+---------------------------+----+---------+------------+----------+------+------------+-----+------+------------------------------+
|Toyota Corolla Mark ii     |24.0|4        |113.0       |95.0      |2372.0|15.0        |70   |Japan |Toyota Corolla Mark ii 70     |
|Datsun PL510               |27.0|4        |97.0        |88.0      |2130.0|14.5        |70   |Japan |Datsun PL510 70               |
|Datsun PL510               |27.0|4        |97.0        |88.0      |2130.0|14.5        |71   |Japan |Datsun PL510 71               |
|Toyota Corolla             |25.0|4        |113.0       |95.0      |2228.0|14.0        |71   |Japan |Toyota Corolla 71             |
|Toyota Corol

In [None]:
# Filtering rows in PySpark on multiple conditions
total_count = df.count()
print("TOTAL RECORD COUNT: " + str(total_count))
europe_filtered_count = df.filter((col('Origin')=='Europe') &
                                  (col('Cylinders')==4)).count() # two conditions in the filter
print("EUROPE FILTERED RECORD COUNT: " + str(europe_filtered_count))
df.filter((col('Origin')=='Europe') & (col('Cylinders')==4)).show(truncate=False)

TOTAL RECORD COUNT: 406
EUROPE FILTERED RECORD COUNT: 66
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Car                         |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                      |
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Citroen DS-21 Pallas        |0.0 |4        |133.0       |115.0     |3090.0|17.5        |70   |Europe|Citroen DS-21 Pallas 70        |
|Volkswagen 1131 Deluxe Sedan|26.0|4        |97.0        |46.0      |1835.0|20.5        |70   |Europe|Volkswagen 1131 Deluxe Sedan 70|
|Peugeot 504                 |25.0|4        |110.0       |87.0      |2672.0|17.5        |70   |Europe|Peugeot 504 70                 |
|Audi 100 LS                 |24.0|4        |107.0       |90.0      |2430.0|14.5        |70   |Europe|Audi 100 LS 70                 

In [None]:
# Count the number of records per number of cylinders
df.groupBy('Cylinders').count().show(40)

+---------+-----+
|Cylinders|count|
+---------+-----+
|        6|   84|
|        3|    4|
|        5|    3|
|        4|  207|
|        8|  108|
+---------+-----+



In [None]:
# Filtering rows in PySpark on multiple conditions
total_count = df.count()
print("TOTAL RECORD COUNT: " + str(total_count))
europe_filtered_count = df.filter((col('Origin')=='Europe') &
                                  (col('Cylinders')>=3)).count() # two conditions in the filter
print("EUROPE FILTERED RECORD COUNT: " + str(europe_filtered_count))
df.filter((col('Origin')=='Europe') & (col('Cylinders')>=3)).show(truncate=False)

TOTAL RECORD COUNT: 406
EUROPE FILTERED RECORD COUNT: 73
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Car                         |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                      |
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Citroen DS-21 Pallas        |0.0 |4        |133.0       |115.0     |3090.0|17.5        |70   |Europe|Citroen DS-21 Pallas 70        |
|Volkswagen 1131 Deluxe Sedan|26.0|4        |97.0        |46.0      |1835.0|20.5        |70   |Europe|Volkswagen 1131 Deluxe Sedan 70|
|Peugeot 504                 |25.0|4        |110.0       |87.0      |2672.0|17.5        |70   |Europe|Peugeot 504 70                 |
|Audi 100 LS                 |24.0|4        |107.0       |90.0      |2430.0|14.5        |70   |Europe|Audi 100 LS 70                 

### Identifying distinct rows

In [None]:
# Get distinct rows in PySpark
df.select('Origin').distinct().show()

+------+
|Origin|
+------+
|Europe|
|    US|
| Japan|
+------+



In [None]:
# Get distinct rows in PySpark across multiple columns
df.select('Origin','model').distinct().show()

+------+-----+
|Origin|model|
+------+-----+
| Japan|   76|
|    US|   81|
|    US|   80|
|    US|   76|
| Japan|   70|
|    US|   78|
|Europe|   76|
|    US|   70|
| Japan|   75|
|Europe|   80|
| Japan|   77|
|Europe|   72|
|    US|   75|
|    US|   79|
|    US|   82|
|Europe|   75|
| Japan|   78|
|    US|   71|
| Japan|   82|
| Japan|   80|
+------+-----+
only showing top 20 rows



In [None]:
# Distinct combinations of Origin and Cylinders
df.select('Origin','Cylinders').distinct().show()

+------+---------+
|Origin|Cylinders|
+------+---------+
| Japan|        6|
|    US|        4|
|    US|        8|
|    US|        6|
| Japan|        3|
|Europe|        5|
| Japan|        4|
|Europe|        6|
|Europe|        4|
+------+---------+



### Sorting rows

In [None]:
# Sorting rows in PySpark
# By default the data is sorted in ascending order
df.orderBy('Cylinders').show(truncate=False)

+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Car                         |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                      |
+----------------------------+----+---------+------------+----------+------+------------+-----+------+-------------------------------+
|Mazda RX-4                  |21.5|3        |80.0        |110.0     |2720.0|13.5        |77   |Japan |Mazda RX-4 77                  |
|Mazda RX-7 GS               |23.7|3        |70.0        |100.0     |2420.0|12.5        |80   |Japan |Mazda RX-7 GS 80               |
|Mazda RX2 Coupe             |19.0|3        |70.0        |97.0      |2330.0|13.5        |72   |Japan |Mazda RX2 Coupe 72             |
|Mazda RX3                   |18.0|3        |70.0        |90.0      |2124.0|13.5        |73   |Japan |Mazda RX3 73                   |
|Datsun 510 (sw)             |28.0|4        |97.0      

In [None]:
# Sort by model
df.orderBy('Model').show(truncate=False)

+--------------------------------+----+---------+------------+----------+------+------------+-----+------+-----------------------------------+
|Car                             |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                          |
+--------------------------------+----+---------+------------+----------+------+------------+-----+------+-----------------------------------+
|Chevrolet Chevelle Malibu       |18.0|8        |307.0       |130.0     |3504.0|12.0        |70   |US    |Chevrolet Chevelle Malibu 70       |
|Buick Skylark 320               |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |Buick Skylark 320 70               |
|Plymouth Satellite              |18.0|8        |318.0       |150.0     |3436.0|11.0        |70   |US    |Plymouth Satellite 70              |
|AMC Rebel SST                   |16.0|8        |304.0       |150.0     |3433.0|12.0        |70   |US    |AMC Rebel SST 70                   |

In [None]:
# To change the sort order, use the ascending parameter
df.orderBy('Cylinders', ascending=False).show(truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+-----+------+----------------------------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|car_model                   |
+-------------------------+----+---------+------------+----------+------+------------+-----+------+----------------------------+
|Plymouth 'Cuda 340       |14.0|8        |340.0       |160.0     |3609.0|8.0         |70   |US    |Plymouth 'Cuda 340 70       |
|Pontiac Safari (sw)      |13.0|8        |400.0       |175.0     |5140.0|12.0        |71   |US    |Pontiac Safari (sw) 71      |
|Ford Mustang Boss 302    |0.0 |8        |302.0       |140.0     |3353.0|8.0         |70   |US    |Ford Mustang Boss 302 70    |
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|11.5        |70   |US    |Buick Skylark 320 70        |
|Chevrolet Monte Carlo    |15.0|8        |400.0       |150.0     |3761.0|9.5         |70   |US   

In [None]:
# Using groupBy and orderBy together
df.groupBy("Origin").count().orderBy('count', ascending=False).show(10)

+------+-----+
|Origin|count|
+------+-----+
|    US|  254|
| Japan|   79|
|Europe|   73|
+------+-----+



In [None]:
# Count by model and sort by count (ascending by default; pass ascending=False for descending order)
df.groupBy("Model").count().orderBy('count').show(10)

+-----+-----+
|Model|count|
+-----+-----+
|   74|   27|
|   77|   28|
|   72|   28|
|   79|   29|
|   71|   29|
|   80|   29|
|   81|   30|
|   75|   30|
|   82|   31|
|   76|   34|
+-----+-----+
only showing top 10 rows



### Union of DataFrames

There are three main methods for performing a union of DataFrames. It is important to know the difference between them and which one is preferred:

*   `union()` - Merges two DataFrames with the same structure/schema, matching columns by position. If the schemas are not compatible (for example, a different number of columns), it raises an error.
*   `unionAll()` - This function has been deprecated since Spark 2.0.0 and is replaced by `union()`.
*   `unionByName()` - Merges two DataFrames by matching columns by name.

> Since `unionAll()` is deprecated, **`union()` is the preferred method for merging DataFrames.**
<br>
> The difference between `unionByName()` and `union()` is that `unionByName()` resolves columns by name, not by position.

In other SQL dialects, UNION removes duplicates while UNION ALL merges the two datasets and keeps duplicate records. In PySpark, however, both behave the same way and keep duplicate records. The recommendation is to use `distinct()` or `dropDuplicates()` to remove duplicate records.

In [None]:
# CASE 1: Union when the columns are in the same order
from pyspark.sql.functions import col
df = spark.read.csv('cars.csv', header=True, sep=";", inferSchema=True)
europe_cars = df.filter((col('Origin')=='Europe') & (col('Cylinders')==5))
japan_cars = df.filter((col('Origin')=='Japan') & (col('Cylinders')==3))
print("EUROPE CARS: "+str(europe_cars.count()))
print("JAPAN CARS: "+str(japan_cars.count()))
print("AFTER UNION: "+str(europe_cars.union(japan_cars).count()))

EUROPE CARS: 3
JAPAN CARS: 4
AFTER UNION: 7


In [None]:
df_europe_japan = europe_cars.union(japan_cars)
df_europe_japan.show(7)

+-------------------+----+---------+------------+----------+------+------------+-----+------+
|                Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+-------------------+----+---------+------------+----------+------+------------+-----+------+
|          Audi 5000|20.3|        5|       131.0|     103.0|  2830|        15.9|   78|Europe|
| Mercedes Benz 300d|25.4|        5|       183.0|      77.0|  3530|        20.1|   79|Europe|
|Audi 5000s (diesel)|36.4|        5|       121.0|      67.0|  2950|        19.9|   80|Europe|
|    Mazda RX2 Coupe|19.0|        3|        70.0|      97.0|  2330|        13.5|   72| Japan|
|          Mazda RX3|18.0|        3|        70.0|      90.0|  2124|        13.5|   73| Japan|
|         Mazda RX-4|21.5|        3|        80.0|     110.0|  2720|        13.5|   77| Japan|
|      Mazda RX-7 GS|23.7|        3|        70.0|     100.0|  2420|        12.5|   80| Japan|
+-------------------+----+---------+------------+----------+------+------------+-----+------+

In [None]:
# CASE 2: Union when the columns are not in the same order
# Create two DataFrames with shuffled columns
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])

In [None]:
df1.show()

+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
+----+----+----+



In [None]:
df2.show()

+----+----+----+
|col1|col2|col0|
+----+----+----+
|   4|   5|   6|
+----+----+----+



In [None]:
df1.unionByName(df2).show()

+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+



## Common data manipulation functions

In [None]:
# Functions available in PySpark
from pyspark.sql import functions
# As with any Python module, we can use the dir function to list the available functions
print(dir(functions))



### String functions

In [None]:
# Load the data
from pyspark.sql.functions import col
df = spark.read.csv('cars.csv', header=True, sep=";", inferSchema=True)

**Display the Car column as is, in lowercase and in uppercase, along with the first 4 characters of the column**

In [None]:
from pyspark.sql.functions import col,lower,upper,substring
help(substring)
# alias is used to rename the column
df.select(col('Car'),lower(col('Car')),upper(col('Car')),substring(col('Car'),1,4).alias("concatenated value")).show(5, False)

Help on function substring in module pyspark.sql.functions:

substring(str, pos, len)
    Substring starts at `pos` and is of length `len` when str is String type or
    returns the slice of byte array that starts at `pos` in byte and is of length `len`
    when str is Binary type.
    
    .. versionadded:: 1.5.0
    
    Notes
    -----
    The position is not zero based, but 1 based index.
    
    Examples
    --------
    >>> df = spark.createDataFrame([('abcd',)], ['s',])
    >>> df.select(substring(df.s, 1, 2).alias('s')).collect()
    [Row(s='ab')]

+-------------------------+-------------------------+-------------------------+------------------+
|Car                      |lower(Car)               |upper(Car)               |concatenated value|
+-------------------------+-------------------------+-------------------------+------------------+
|Chevrolet Chevelle Malibu|chevrolet chevelle malibu|CHEVROLET CHEVELLE MALIBU|Chev              |
|Buick Skylark 320        |buick skylark

**Concatenate the Car column and the Model column, adding a space between them.**

In [None]:
from pyspark.sql.functions import concat, lit
df.select(col("Car"),col("model"), concat(col("Car"), lit(" "), col("model"))).show(5, False)

+-------------------------+-----+----------------------------+
|Car                      |model|concat(Car,  , model)       |
+-------------------------+-----+----------------------------+
|Chevrolet Chevelle Malibu|70   |Chevrolet Chevelle Malibu 70|
|Buick Skylark 320        |70   |Buick Skylark 320 70        |
|Plymouth Satellite       |70   |Plymouth Satellite 70       |
|AMC Rebel SST            |70   |AMC Rebel SST 70            |
|Ford Torino              |70   |Ford Torino 70              |
+-------------------------+-----+----------------------------+
only showing top 5 rows



### Numeric functions


**Show the oldest and the newest model**

In [None]:
from pyspark.sql.functions import min, max
df.select(min(col('Model')), max(col('Model'))).show()

+----------+----------+
|min(Model)|max(Model)|
+----------+----------+
|        70|        82|
+----------+----------+



**Show the minimum and maximum weight**

In [None]:
from pyspark.sql.functions import min, max
df.select(min(col('Weight')), max(col('Weight'))).show()

+-----------+-----------+
|min(Weight)|max(Weight)|
+-----------+-----------+
|       1613|       5140|
+-----------+-----------+



**Add 10 to the minimum and maximum weight**

In [None]:
from pyspark.sql.functions import min, max, lit
df.select(min(col('Weight'))+lit(10), max(col('Weight')+lit(10))).show()

+------------------+------------------+
|(min(Weight) + 10)|max((Weight + 10))|
+------------------+------------------+
|              1623|              5150|
+------------------+------------------+



### Date operations

In [None]:
from pyspark.sql.functions import to_date, to_timestamp, lit
df = spark.createDataFrame([('2019-12-25 13:30:00',)], ['DOB'])
df.show()
df.printSchema()

+-------------------+
|                DOB|
+-------------------+
|2019-12-25 13:30:00|
+-------------------+

root
 |-- DOB: string (nullable = true)



In [None]:
df = df.select(to_date(col('DOB'),'yyyy-MM-dd HH:mm:ss'), to_timestamp(col('DOB'),'yyyy-MM-dd HH:mm:ss'))
df.show()
df.printSchema()

+---------------------------------+--------------------------------------+
|to_date(DOB, yyyy-MM-dd HH:mm:ss)|to_timestamp(DOB, yyyy-MM-dd HH:mm:ss)|
+---------------------------------+--------------------------------------+
|                       2019-12-25|                   2019-12-25 13:30:00|
+---------------------------------+--------------------------------------+

root
 |-- to_date(DOB, yyyy-MM-dd HH:mm:ss): date (nullable = true)
 |-- to_timestamp(DOB, yyyy-MM-dd HH:mm:ss): timestamp (nullable = true)



In [None]:
df = spark.createDataFrame([('25/Dec/2019 13:30:00',)], ['DOB'])
df = df.select(to_date(col('DOB'),'dd/MMM/yyyy HH:mm:ss'), to_timestamp(col('DOB'),'dd/MMM/yyyy HH:mm:ss'))
df.show()
df.printSchema()

+----------------------------------+---------------------------------------+
|to_date(DOB, dd/MMM/yyyy HH:mm:ss)|to_timestamp(DOB, dd/MMM/yyyy HH:mm:ss)|
+----------------------------------+---------------------------------------+
|                        2019-12-25|                    2019-12-25 13:30:00|
+----------------------------------+---------------------------------------+

root
 |-- to_date(DOB, dd/MMM/yyyy HH:mm:ss): date (nullable = true)
 |-- to_timestamp(DOB, dd/MMM/yyyy HH:mm:ss): timestamp (nullable = true)



**What are the dates 3 days before the earliest date and 3 days after the latest date?**

In [None]:
from pyspark.sql.functions import date_add, date_sub
# create a dummy dataframe
df = spark.createDataFrame([('1990-01-01',),('1995-01-03',),('2021-03-30',)], ['Date'])
# find out the required dates
df.select(date_add(max(col('Date')),3), date_sub(min(col('Date')),3)).show()

+----------------------+----------------------+
|date_add(max(Date), 3)|date_sub(min(Date), 3)|
+----------------------+----------------------+
|            2021-04-02|            1989-12-29|
+----------------------+----------------------+



## Joins in PySpark

In [None]:
# Create two DataFrames
cars_df = spark.createDataFrame([[1, 'Car A'],[2, 'Car B'],[3, 'Car C']], ["id", "car_name"])
car_price_df = spark.createDataFrame([[1, 1000],[2, 2000],[3, 3000]], ["id", "car_price"])
cars_df.show()
car_price_df.show()

+---+--------+
| id|car_name|
+---+--------+
|  1|   Car A|
|  2|   Car B|
|  3|   Car C|
+---+--------+

+---+---------+
| id|car_price|
+---+---------+
|  1|     1000|
|  2|     2000|
|  3|     3000|
+---+---------+



In [None]:
# Run an inner join so that each row shows the id, name and price of a car
cars_df.join(car_price_df, cars_df.id == car_price_df.id, 'inner').select(cars_df['id'],cars_df['car_name'],car_price_df['car_price']).show(truncate=False)

+---+--------+---------+
|id |car_name|car_price|
+---+--------+---------+
|1  |Car A   |1000     |
|3  |Car C   |3000     |
|2  |Car B   |2000     |
+---+--------+---------+



As you can see, we performed an inner join between two DataFrames. PySpark supports the following join types:
1. inner (default)
2. cross
3. outer
4. full
5. full_outer
6. left
7. left_outer
8. right
9. right_outer
10. left_semi
11. left_anti

## Spark SQL
SQL has been around since the 1970s, so you can imagine how many people rely on it for their daily work. As DataFrames grew in popularity, professionals with the technical skills to work with them were scarce. This led to the creation of Spark SQL, which lets you query a DataFrame with SQL once it is registered as a temporary view.

In [None]:
# Load data
df = spark.read.csv('cars.csv', header=True, sep=";")
# Register Temporary Table
df.createOrReplaceTempView("temp")
# Select all data from temp table
spark.sql("select * from temp limit 5").show()
# Select count of data in table
spark.sql("select count(*) as total_count from temp").show()

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0| 3504.|        12.0|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0| 3693.|        11.5|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0| 3436.|        11.0|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0| 3433.|        12.0|   70|    US|
|         Ford Torino|17.0|        8|       302.0|     140.0| 3449.|        10.5|   70|    US|
+--------------------+----+---------+------------+----------+------+------------+-----+------+

+-----------+
|total_count|
+-----------+
|        406|
+-----------+

