<a href="https://colab.research.google.com/github/JhonatanWalterSen/spark-in-colab/blob/main/SparSQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating DataFrames

# Installation

In [3]:
# Install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [4]:
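# Download Spark (if this mirror no longer hosts 3.3.1, older releases are kept under https://archive.apache.org/dist/spark/)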
!wget -q http://apache.osuosl.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

In [5]:
# Extract Spark
!tar xf spark-3.3.1-bin-hadoop3.tgz

In [6]:
# Set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"

In [7]:
# Install findspark
!pip install -q findspark

In [8]:
import findspark

In [9]:
findspark.init()

In [10]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('DataFrame').master('local[*]').getOrCreate()

In [9]:
spark


In [11]:
sc = spark.sparkContext

In [11]:
rdd = sc.parallelize([item for item in range(10)]).map(lambda x: (x,x**2))

In [12]:
rdd.collect()

[(0, 0),
 (1, 1),
 (2, 4),
 (3, 9),
 (4, 16),
 (5, 25),
 (6, 36),
 (7, 49),
 (8, 64),
 (9, 81)]

In [13]:
df = rdd.toDF(['numero','cuadrado'])

In [14]:
df.printSchema()

root
 |-- numero: long (nullable = true)
 |-- cuadrado: long (nullable = true)



In [15]:
df.show()

+------+--------+
|numero|cuadrado|
+------+--------+
|     0|       0|
|     1|       1|
|     2|       4|
|     3|       9|
|     4|      16|
|     5|      25|
|     6|      36|
|     7|      49|
|     8|      64|
|     9|      81|
+------+--------+



Create a DataFrame from an RDD with a schema

In [16]:
rdd1 = sc.parallelize([(1,'jose',35.5),(2,'Teresa',54.2),(3,'Karl',12.7)])

Import the data types

In [13]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

In [18]:
# Option 1: build the schema with StructType

esquema1 = StructType(
    [
        StructField('id', IntegerType(),True),
        StructField('nombre', StringType(),True),
        StructField('saldo', DoubleType(),True),
    ]
)

In [19]:
# Option 2: describe the schema with a DDL-formatted string
esquema2 = "`id` INT, `nombre` STRING, `saldo` DOUBLE"

In [20]:
df1 = spark.createDataFrame(rdd1, schema=esquema1)

In [21]:
df1.printSchema()

root
 |-- id: integer (nullable = true)
 |-- nombre: string (nullable = true)
 |-- saldo: double (nullable = true)



In [22]:
df1.show()

+---+------+-----+
| id|nombre|saldo|
+---+------+-----+
|  1|  jose| 35.5|
|  2|Teresa| 54.2|
|  3|  Karl| 12.7|
+---+------+-----+



In [23]:
df2 = spark.createDataFrame(rdd1, schema=esquema2)

In [24]:
df2.printSchema()

root
 |-- id: integer (nullable = true)
 |-- nombre: string (nullable = true)
 |-- saldo: double (nullable = true)



In [26]:
df2.show()

+---+------+-----+
| id|nombre|saldo|
+---+------+-----+
|  1|  jose| 35.5|
|  2|Teresa| 54.2|
|  3|  Karl| 12.7|
+---+------+-----+



Create a DataFrame from data sources

TXT

In [27]:
df = spark.read.text('./dataTXT.txt')

In [28]:
df.show()

+--------------------+
|               value|
+--------------------+
|Estamos en el cur...|
|En este capítulo ...|
|En esta sección e...|
|y en este ejemplo...|
+--------------------+



In [30]:
df.show(truncate=False)

+-----------------------------------------------------------------------+
|value                                                                  |
+-----------------------------------------------------------------------+
|Estamos en el curso de pyspark                                         |
|En este capítulo estamos estudiando el API SQL de Saprk                |
|En esta sección estamos creado dataframes a partir de fuentes de datos,|
|y en este ejemplo creamos un dataframe a partir de un texto plano      |
+-----------------------------------------------------------------------+



CSV

In [31]:
df1=spark.read.csv('./dataCSV.csv')

In [32]:
df1.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+--------------------+--------------------+
|        _c0|          _c1|                 _c2|                 _c3|        _c4|                 _c5|                 _c6|    _c7|   _c8|     _c9|         _c10|                _c11|             _c12|            _c13|                _c14|                _c15|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+--------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|vid

In [34]:
df1 = spark.read.option('header','true').csv('./dataCSV.csv')

In [35]:
df1.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13T17:13:...|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           Fal

READ TEXT FILES WITH A DIFFERENT DELIMITER

In [38]:
df2 = spark.read.option('header','true').option('delimiter','|').csv('./dataTab.txt')

In [39]:
df2.show()

+----+----+----------+-----+
|pais|edad|     fecha|color|
+----+----+----------+-----+
|  MX|  23|2021-02-21| rojo|
|  CA|  56|2021-06-10| azul|
|  US|  32|2020-06-02|verde|
+----+----+----------+-----+
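As a rough illustration of what the `header` and `delimiter` options do, the same kind of file can be parsed with Python's standard `csv` module. This is only a sketch outside Spark: the file contents below are an assumption reconstructed from the `df2.show()` output, not the actual `dataTab.txt`.

```python
import csv
import io

# Assumed file contents, reconstructed from the df2.show() output.
sample_text = """pais|edad|fecha|color
MX|23|2021-02-21|rojo
CA|56|2021-06-10|azul
US|32|2020-06-02|verde
"""

# DictReader takes the first line as column names, like option('header', 'true');
# delimiter='|' plays the role of option('delimiter', '|').
reader = csv.DictReader(io.StringIO(sample_text), delimiter="|")
rows = list(reader)

print(reader.fieldnames)
print(rows[0])
```

Every value comes back as a string here. Spark behaves similarly when reading CSV: columns are read as strings unless a schema is supplied or the `inferSchema` option is enabled.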



CREATE A DATAFRAME FROM A JSON FILE BY PROVIDING A SCHEMA

In [40]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

json_schema = StructType(
    [
        StructField('color', StringType(),True),
        StructField('edad', StringType(), True),
        StructField('fecha', DateType(), True),
        StructField('pais', StringType(), True)
    ]
)

In [41]:
df4 = spark.read.schema(json_schema).json('./dataJSON.json')

In [42]:
df4.show()

+-----+----+----------+----+
|color|edad|     fecha|pais|
+-----+----+----------+----+
| rojo|  23|2021-02-21|  MX|
| azul|  56|2021-06-10|  CA|
|verde|  32|2020-06-02|  US|
+-----+----+----------+----+



In [43]:
df4.printSchema()

root
 |-- color: string (nullable = true)
 |-- edad: string (nullable = true)
 |-- fecha: date (nullable = true)
 |-- pais: string (nullable = true)



Create a DataFrame from a Parquet file

In [44]:
df5 = spark.read.parquet('./dataPARQUET.parquet')

In [45]:
df5.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13T17:13:...|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           Fal

In [46]:
df6 = spark.read.format('parquet').load('./dataPARQUET.parquet')

In [47]:
df6.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



# WORKING WITH COLUMNS

In [48]:
df5.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13T17:13:...|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           Fal

In [50]:
df5.select('title').show(truncate=False)

+--------------------------------------------------------------------------------------+
|title                                                                                 |
+--------------------------------------------------------------------------------------+
|WE WANT TO TALK ABOUT OUR MARRIAGE                                                    |
|The Trump Presidency: Last Week Tonight with John Oliver (HBO)                        |
|Racist Superman | Rudy Mancuso, King Bach & Lele Pons                                 |
|Nickelback Lyrics: Real or Fake?                                                      |
|I Dare You: GOING BALD!?                                                              |
|2 Weeks with iPhone X                                                                 |
|Roy Moore & Jeff Sessions Cold Open - SNL                                             |
|5 Ice Cream Gadgets put to the Test                                                   |
|The Greatest Showman

# SELECT 

In [12]:
dft=spark.read.parquet('./datos.parquet')

In [13]:
dft.show()

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13 17:13:01|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           False| 

In [15]:
from pyspark.sql.functions import col
dft.select(col('video_id')).show()

+-----------+
|   video_id|
+-----------+
|2kyS6SvSYSE|
|1ZAPwfrtAFY|
|5qpjK5DgCt4|
|puqaWrEC7tY|
|d380meD0W0M|
|gHZ1Qz0KiKM|
|39idVpFF7NQ|
|nc99ccSXST0|
|jr9QtXwC9vc|
|TUmyygCMMGA|
|9wRQljFNDW8|
|VifQlJit6A0|
|5E4ZBSInqUU|
|GgVmn66oK_A|
|TaTleo4cOs8|
|kgaO45SyaO4|
|ZAQs-ctOqXQ|
|YVfyYrEmzgM|
|eNSN6qet1kE|
|B5HORANmzHw|
+-----------+
only showing top 20 rows



In [16]:
# This raises an error
# dft= select ('likes','dislikes'('likes','dislikes')).show()

In [21]:
dft.select(
    col('likes'),
    col('dislikes'),
    (col('likes')-col('dislikes')).alias('aceptación')
).show()

+------+--------+----------+
| likes|dislikes|aceptación|
+------+--------+----------+
| 57527|    2966|     54561|
| 97185|    6146|     91039|
|146033|    5339|    140694|
| 10172|     666|      9506|
|132235|    1989|    130246|
|  9763|     511|      9252|
| 15993|    2445|     13548|
| 23663|     778|     22885|
|  3543|     119|      3424|
| 12654|    1363|     11291|
|   655|      25|       630|
|  1576|     303|      1273|
|114188|    1333|    112855|
|  7848|    1171|      6677|
|  7473|     246|      7227|
|  9419|      52|      9367|
|  8011|     638|      7373|
|  5398|      53|      5345|
| 11963|      36|     11927|
|  8421|     191|      8230|
+------+--------+----------+
only showing top 20 rows



# Select Expr

In [24]:
dft.selectExpr('likes','dislikes','(likes - dislikes) as aceptacion').show()

+------+--------+----------+
| likes|dislikes|aceptacion|
+------+--------+----------+
| 57527|    2966|     54561|
| 97185|    6146|     91039|
|146033|    5339|    140694|
| 10172|     666|      9506|
|132235|    1989|    130246|
|  9763|     511|      9252|
| 15993|    2445|     13548|
| 23663|     778|     22885|
|  3543|     119|      3424|
| 12654|    1363|     11291|
|   655|      25|       630|
|  1576|     303|      1273|
|114188|    1333|    112855|
|  7848|    1171|      6677|
|  7473|     246|      7227|
|  9419|      52|      9367|
|  8011|     638|      7373|
|  5398|      53|      5345|
| 11963|      36|     11927|
|  8421|     191|      8230|
+------+--------+----------+
only showing top 20 rows



In [25]:
dft.selectExpr('count(distinct(video_id)) as videos').show()

+------+
|videos|
+------+
|  6837|
+------+
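`count(distinct ...)` follows SQL semantics, so null values are not counted. A plain-Python model of that rule, using made-up ids rather than the real dataset:

```python
# Made-up ids; None stands in for a null video_id.
video_ids = ["a1", "b2", "a1", None, "c3", "b2"]

# Distinct count that skips nulls, as SQL's COUNT(DISTINCT col) does.
distinct_videos = len({v for v in video_ids if v is not None})

print(distinct_videos)  # 3
```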



# FILTER

In [26]:
# Uses col, imported above from pyspark.sql.functions
dft.show()

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13 17:13:01|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           False| 

In [27]:
dft.filter(col('video_id') == '2kyS6SvSYSE').show()

+-----------+-------------+--------------------+-------------+-----------+-------------------+---------------+-------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|channel_title|category_id|       publish_time|           tags|  views|likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+-------------+-----------+-------------------+---------------+-------+-----+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...| CaseyNeistat|         22|2017-11-13 17:13:01|SHANtell martin| 748374|57527|    2966|        15954|https://i.ytimg.c...|            False|           False|                 False|SHANTELL'S CHANNE...|
|2kyS6Sv

In [28]:
dft.filter(col('trending_date') != '17.14.11').show()

+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|            video_id|       trending_date|               title|       channel_title|         category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|\nCook with confi...|             recipes|              videos| and restaurant g...| dining destinations|               null|                

In [29]:
dft2 = spark.read.parquet('./datos.parquet').where(col('likes') > 5000)

In [30]:
dft2.show()

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13 17:13:01|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           False| 

In [None]:
# Option 1: a single filter with both conditions joined by &
dft2.filter((col('trending_date') != '17.14.11') & (col('likes') > 7000)).show()

In [34]:
# Option 2: chained filter calls
dft2.filter(col('trending_date') != '17.14.11').filter(col('likes') > 7000).show()

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|YvfYK0EEhK4|     17.15.11|Brent Pella - Why...|         Brent Pella|         23|2017-11-14 15:32:51|"spirit airlines"...| 462490| 14132|     795|          666|https://i.ytimg.c...|            False|           False| 
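The two forms select the same rows, since chaining filters is a logical AND. The parentheses in the single-filter form matter because Python's `&` binds tighter than `!=` and `>`. The sketch below shows both points with plain Python values, not Spark columns. The third row and the `flag` variable are made up for illustration.

```python
rows = [
    {"trending_date": "17.14.11", "likes": 57527},
    {"trending_date": "17.15.11", "likes": 14132},
    {"trending_date": "17.15.11", "likes": 6000},  # made-up row
]

# One pass with both conditions joined by &.
single_pass = [r for r in rows if (r["trending_date"] != "17.14.11") & (r["likes"] > 7000)]

# Two passes, one condition each (the chained-filter form).
first_pass = [r for r in rows if r["trending_date"] != "17.14.11"]
chained = [r for r in first_pass if r["likes"] > 7000]

print(single_pass == chained)  # True

# Operator precedence: & is evaluated before != and >.
flag, likes = 5, 8000
with_parens = (flag != 0) & (likes > 7000)
without_parens = flag != 0 & likes > 7000  # parsed as flag != (0 & likes) > 7000
print(with_parens, without_parens)  # True False
```

With PySpark `Column` objects, the unparenthesized form typically raises an error rather than silently filtering, so wrapping each condition in parentheses is the safe habit.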

# Distinct
Removes duplicate rows

In [35]:
df_sin_duplicados = dft.distinct()

In [37]:
print('Conteo total del DF {}'.format(dft.count()))
print('Conteo sin duplicados {}'.format(df_sin_duplicados.count()))

Conteo total del DF 48137
Conteo sin duplicados 41428


# Drop Duplicates

In [16]:
dataframe = spark.createDataFrame([(1,'azul',567),(2,'rojo',487),(1,'azul', 345),(2,'verde',783)]).toDF('id','color','importe')

In [39]:
dataframe.show()

+---+-----+-------+
| id|color|importe|
+---+-----+-------+
|  1| azul|    567|
|  2| rojo|    487|
|  1| azul|    345|
|  2|verde|    783|
+---+-----+-------+



In [42]:
# The listed columns decide which rows count as duplicates
dataframe.dropDuplicates(['id','color']).show()

+---+-----+-------+
| id|color|importe|
+---+-----+-------+
|  1| azul|    567|
|  2| rojo|    487|
|  2|verde|    783|
+---+-----+-------+
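The difference between comparing whole rows and comparing a subset of columns can be modeled in plain Python. The helper `drop_duplicates` below is hypothetical and is not Spark's implementation. It keeps the first row seen for each key, whereas Spark does not promise which row of a duplicate group survives.

```python
rows = [(1, "azul", 567), (2, "rojo", 487), (1, "azul", 345), (2, "verde", 783)]

def drop_duplicates(rows, key_indexes):
    """Keep one row per distinct key; this model keeps the first one seen."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Whole-row comparison, like distinct(): no two rows are identical here.
full_row = drop_duplicates(rows, key_indexes=[0, 1, 2])
# Comparison on (id, color) only, like dropDuplicates(['id', 'color']).
by_id_color = drop_duplicates(rows, key_indexes=[0, 1])

print(len(full_row))  # 4
print(by_id_color)    # [(1, 'azul', 567), (2, 'rojo', 487), (2, 'verde', 783)]
```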



# Sort

In [45]:
dft.select(col('likes'),col('views'),col('video_id'),col('dislikes')).dropDuplicates(['video_id']).show()

+------+-------+--------------------+--------+
| likes|  views|            video_id|dislikes|
+------+-------+--------------------+--------+
| 63995|1525400|         bAkEd8r7Nnw|     896|
|   427|   9036|         eijd-yjXY9E|      14|
|  4145| 318249|         npcqBt_e4k0|     110|
|  6669| 203615|         LeWtF5y9-6Q|     136|
|  2166| 104499|         GhcqN2FDAnA|    1066|
| 10834| 160196|         v_CMMWCN5nQ|     162|
| 36068| 962042|         R8WBN3fJmwM|     845|
|   982|  36848|         oKuPJ7zF0_k|       6|
| 26482| 713615|         B3JFSL8AA70|    2443|
|275632|2822642|         f6Egj7ncOi8|    1444|
| 23922| 321885|         8gE6cek7F30|     317|
|    70|  13670|         EdkK29-TWJk|       1|
|  1131| 120802|         8szK9FBpdPI|      92|
| 12355| 294080|         6gFj1XJ6b5o|      80|
|  null|   null|\nhttp://www.Mast...|    null|
| 12070| 233766|         wOFuVNiAJQQ|     117|
| 21067| 210371|         PpElRBQ-yGc|     135|
|  4609| 363194|         q11UD-6XT-8|     955|
|   188|  311

In [47]:
dft.select(col('likes'),col('views'),col('video_id'),col('dislikes')).dropDuplicates(['video_id']).sort('likes').show()

+-----+-----+--------------------+--------+
|likes|views|            video_id|dislikes|
+-----+-----+--------------------+--------+
| null| null|\nFor more videos...|    null|
| null| null|\nFashion Editor:...|    null|
| null| null|\nAccess Hollywoo...|    null|
| null| null|\nStill haven’t s...|    null|
| null| null|\nhttps://www.you...|    null|
| null| null|Horror Outro ► ht...|    null|
| null| null|\nChapped lips ar...|    null|
| null| null|\nRoar: https://w...|    null|
| null| null|\nThe leading int...|    null|
| null| null|             \nToday|    null|
| null| null|\nONE STRANGE ROC...|    null|
| null| null|\nSNAPCHAT: fishi...|    null|
| null| null|\nInstagram: http...|    null|
| null| null|\nInstagram.com/w...|    null|
| null| null|\n5050 State Hwy....|    null|
| null| null|\nSIGN UP FOR BRA...|    null|
| null| null|\nJames Ambler an...|    null|
| null| null|\nhttp://www.Mast...|    null|
| null| null|\nEver After Tuto...|    null|
| null| null|          \nEvelin 

In [15]:
# Import desc
from pyspark.sql.functions import desc

In [49]:
dft.select(col('likes'),col('views'),col('video_id'),col('dislikes')).dropDuplicates(['video_id']).sort(desc('likes')).show()

+-------+--------+-----------+--------+
|  likes|   views|   video_id|dislikes|
+-------+--------+-----------+--------+
|3880071|39349927|7C2z4GqqS5E|   72707|
|2055137|13945717|kTlv5_Bs8aw|   23888|
|2050527|10695328|OK3GJ0WIQ8s|   14711|
|1956202|10666323|p8npDG2ulKQ|   13966|
|1735895|37736281|6ZfuNTqbHE8|   21969|
|1634124|33523622|2Vv-BfVoq4g|   21082|
|1572997| 7518332|kX0vO4vlJuU|    8113|
|1437859| 5884233|D_6QmL6rExk|    6390|
|1405355|31648454|VYOjWnS4cMY|   51547|
|1401915| 5275672|8O_MwlZ2dEg|    6268|
|1386616|15873034|ffxKSjUwKdU|   40714|
|1366736|16884972|J2HytHu5VBI|   59930|
|1290509| 6416697|2tDKp41nrw8|    4358|
|1207457|13754992|_5d-sQ7Fh5M|  280675|
|1167488| 8041970|oWjxSkJpxFU|  147643|
|1149185|24782158|FlsCjmMhFmw|  483924|
|1111592|38873543|i0p1bmr0EmE|   96407|
|1065777|14089954|dfnCAmr569k|   47839|
| 983693|14820746|tCXGJQYZ9JA|   44254|
| 975715|19716689|QwievZ1Tx-8|    9118|
+-------+--------+-----------+--------+
only showing top 20 rows
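The ascending sort lists the null rows first, while the descending sort pushes them to the end. That matches Spark's default null ordering: nulls first for ascending, nulls last for descending. A plain-Python model of both orderings, with a few `likes` values taken from the outputs and `None` standing in for null:

```python
likes = [57527, None, 655, None, 3880071]

# Ascending, nulls first: None sorts before any number.
ascending = sorted(likes, key=lambda v: (v is not None, v if v is not None else 0))

# Descending, nulls last: largest value first, None at the end.
descending = sorted(likes, key=lambda v: (v is None, -(v if v is not None else 0)))

print(ascending)   # [None, None, 655, 57527, 3880071]
print(descending)  # [3880071, 57527, 655, None, None]
```

When the nulls are just noise, as with the malformed rows here, dropping them before sorting or using a column method such as `asc_nulls_last()` is one way to keep them out of the first page of results.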



# OrderBy
The name mirrors SQL's ORDER BY; in PySpark, orderBy is an alias of sort.

In [50]:
dataframe.show()

+---+-----+-------+
| id|color|importe|
+---+-----+-------+
|  1| azul|    567|
|  2| rojo|    487|
|  1| azul|    345|
|  2|verde|    783|
+---+-----+-------+



In [52]:
dataframe.orderBy(col('color').desc(), col('importe')).show()

+---+-----+-------+
| id|color|importe|
+---+-----+-------+
|  2|verde|    783|
|  2| rojo|    487|
|  1| azul|    345|
|  1| azul|    567|
+---+-----+-------+
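Ordering by several columns means the first column decides, and later columns only break ties. A minimal plain-Python sketch reproduces the result above with two stable sorts. This is a model of the ordering, not of how Spark executes it.

```python
rows = [(1, "azul", 567), (2, "rojo", 487), (1, "azul", 345), (2, "verde", 783)]

# Sort by the tie-breaker first (importe ascending)...
by_importe = sorted(rows, key=lambda r: r[2])

# ...then by the main key (color descending). Python's sort is stable,
# so rows with the same color keep their importe order.
ordered = sorted(by_importe, key=lambda r: r[1], reverse=True)

for row in ordered:
    print(row)
```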



# Limit

In [53]:
top_10 = dft.orderBy(col('views').desc()).limit(10)

In [54]:
top_10.show()

+-----------+-------------+--------------------+-------------------+-----------+-------------------+--------------------+---------+-------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|      channel_title|category_id|       publish_time|                tags|    views|  likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+-------------------+-----------+-------------------+--------------------+---------+-------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|VYOjWnS4cMY|     18.02.06|Childish Gambino ...|ChildishGambinoVEVO|         10|2018-05-06 04:00:07|"Childish Gambino...|225211923|5023450|  343541|       517232|https://i.ytimg.c...|            False|          

In [56]:
dft.select(col('likes'),col('views'),col('video_id'),col('dislikes')).dropDuplicates(['video_id']).orderBy(col('views').desc()).limit(10).show()

+-------+--------+-----------+--------+
|  likes|   views|   video_id|dislikes|
+-------+--------+-----------+--------+
| 609101|48431654|-BQJo3vK8O8|   52259|
|3880071|39349927|7C2z4GqqS5E|   72707|
|1111592|38873543|i0p1bmr0EmE|   96407|
|1735895|37736281|6ZfuNTqbHE8|   21969|
|1634124|33523622|2Vv-BfVoq4g|   21082|
|1405355|31648454|VYOjWnS4cMY|   51547|
| 850362|27973210|u9Mv98Gr5pY|   26541|
|1149185|24782158|FlsCjmMhFmw|  483924|
| 641546|24421448|U9BwWKXjVaI|   16517|
| 587326|23758250|1J76wN0TPI4|   18799|
+-------+--------+-----------+--------+



# WithColumn

Add a new column

In [57]:
df_valoracion = dft.withColumn('valoracion',col('likes') - col('dislikes'))

In [58]:
df_valoracion.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)
 |-- valoracion: integer (nullable = true)



In [59]:
df_valoracion1 = (dft.withColumn('valoracion',col('likes') - col('dislikes'))
                     .withColumn('res_div',col('valoracion') % 10)
)

In [60]:
df_valoracion1.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)
 |-- valoracion: integer (nullable = true)
 |-- res_div: integer (nullable = true)



In [61]:
df_valoracion1.select(col('likes'), col('dislikes'), col('valoracion'), col('res_div')).show()

+------+--------+----------+-------+
| likes|dislikes|valoracion|res_div|
+------+--------+----------+-------+
| 57527|    2966|     54561|      1|
| 97185|    6146|     91039|      9|
|146033|    5339|    140694|      4|
| 10172|     666|      9506|      6|
|132235|    1989|    130246|      6|
|  9763|     511|      9252|      2|
| 15993|    2445|     13548|      8|
| 23663|     778|     22885|      5|
|  3543|     119|      3424|      4|
| 12654|    1363|     11291|      1|
|   655|      25|       630|      0|
|  1576|     303|      1273|      3|
|114188|    1333|    112855|      5|
|  7848|    1171|      6677|      7|
|  7473|     246|      7227|      7|
|  9419|      52|      9367|      7|
|  8011|     638|      7373|      3|
|  5398|      53|      5345|      5|
| 11963|      36|     11927|      7|
|  8421|     191|      8230|      0|
+------+--------+----------+-------+
only showing top 20 rows
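One detail worth knowing about `res_div`: Spark's `%` on integers returns a remainder with the sign of the dividend, as in Java, while Python's `%` returns a result with the sign of the divisor. The two differ whenever `valoracion` is negative, that is, when dislikes exceed likes. The helper `spark_style_remainder` and the sample row below are hypothetical, for illustration only.

```python
import math

def spark_style_remainder(dividend, divisor):
    """Integer remainder whose sign follows the dividend (truncated division)."""
    return int(math.fmod(dividend, divisor))

likes, dislikes = 25, 303      # made-up row where dislikes exceed likes
valoracion = likes - dislikes  # -278

print(valoracion % 10)                        # Python floor-mod: 2
print(spark_style_remainder(valoracion, 10))  # dividend-signed remainder: -8
```

For non-negative values both conventions agree, which is why every row in the output above shows the expected last digit.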



# With Column Renamed
Rename a column

In [15]:
dataframe.show()

+---+-----+-------+
| id|color|importe|
+---+-----+-------+
|  1| azul|    567|
|  2| rojo|    487|
|  1| azul|    345|
|  2|verde|    783|
+---+-----+-------+



In [16]:
df_renombrado = dataframe.withColumnRenamed('importe','Importe total')

In [17]:
df_renombrado.show()

+---+-----+-------------+
| id|color|Importe total|
+---+-----+-------------+
|  1| azul|          567|
|  2| rojo|          487|
|  1| azul|          345|
|  2|verde|          783|
+---+-----+-------------+



Note: if the column does not exist, no error is raised and the DataFrame is returned unchanged.

# Transformations

In [20]:
dataf = spark.read.parquet('./datos.parquet')

In [22]:
dataf.show()

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13 17:13:01|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           False| 

# drop

In [23]:
dataf.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [24]:
df_util = dataf.drop('comments_disabled')

In [25]:
df_util.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [26]:
df_util = dataf.drop('comments_disabled','ratings_disabled','thumbnail_link')

In [30]:
df_util.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



*Note: `drop` does not raise an error when asked to remove a column name that does not exist.*

# Sample

Take a random sample of the rows. The fraction is a per-row probability, so the size of the sample is approximate.

In [28]:
df_muestra = dataf.sample(0.8)

In [None]:
# To practice:
# do the same with randomSplit

# Incorrect or missing data

In [31]:
dataf.count()

48137

In [32]:
dataf.na.drop().count()

40379

In [33]:
dataf.na.drop('any').count()

40379

In [34]:
dataf.dropna().count()

40379

In [36]:
# drop the rows that have a null in the views column
dataf.na.drop(subset=['views']).count()

40949

In [38]:
dataf.na.drop(subset=['views','dislikes']).count()

40949

In [22]:
dataf.orderBy(col('views')).select(col('views'),col('likes'), col('dislikes')).show()

+-----+-----+--------+
|views|likes|dislikes|
+-----+-----+--------+
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
| null| null|    null|
+-----+-----+--------+
only showing top 20 rows



Impute the nulls with a replacement value

In [24]:
dataf.fillna(0).orderBy(col('views')).select(col('views'),col('likes'), col('dislikes')).show()

+-----+-----+--------+
|views|likes|dislikes|
+-----+-----+--------+
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
|    0|    0|       0|
+-----+-----+--------+
only showing top 20 rows



Fill only certain columns

In [25]:
dataf.fillna(0, subset=['likes','dislikes']).orderBy(col('views')).select(col('views'),col('likes'), col('dislikes')).show()

+-----+-----+--------+
|views|likes|dislikes|
+-----+-----+--------+
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
| null|    0|       0|
+-----+-----+--------+
only showing top 20 rows



# Actions



In [27]:
dataf.show()

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13 17:13:01|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           False| 

In [29]:
# configure the show() action
# show only 5 rows
dataf.show(5)

+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|       publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+-------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13 17:13:01|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           False| 

In [30]:
# show only 5 rows without truncating the cell contents
dataf.show(5, truncate=False)

+-----------+-------------+--------------------------------------------------------------+---------------------+-----------+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------+--------+-------------+----------------------------------------------+-----------------+----------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [31]:
# take returns the first n rows as a list of Row objects
dataf.take(1)

[Row(video_id='2kyS6SvSYSE', trending_date='17.14.11', title='WE WANT TO TALK ABOUT OUR MARRIAGE', channel_title='CaseyNeistat', category_id='22', publish_time=datetime.datetime(2017, 11, 13, 17, 13, 1), tags='SHANtell martin', views=748374, likes=57527, dislikes=2966, comment_count=15954, thumbnail_link='https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg', comments_disabled='False', ratings_disabled='False', video_error_or_removed='False', description="SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\\nwith this lens -- http://amzn.to/2rUJOmD\\nbig drone - http://tinyurl.com/h4ft3oy\\nOTHER GEAR ---  http://amzn.to/2o3GLX5\\nSony CAMERA http://amzn.to/2nOBmnv\\nOLD CAMERA; http://amzn.to/2o2cQBT\\nMAIN LENS; http://amzn.to/2od5gBJ\\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\\nBIG Canon CAMERA; http://tinyurl.com/jn4q4vz\\nBENDY TRIPOD THING; http://tinyurl.com/gw3ylz2\\nYOU NEED T

In [32]:
# head(n) also returns the first n rows as a list of Row objects
dataf.head(1)

[Row(video_id='2kyS6SvSYSE', trending_date='17.14.11', title='WE WANT TO TALK ABOUT OUR MARRIAGE', channel_title='CaseyNeistat', category_id='22', publish_time=datetime.datetime(2017, 11, 13, 17, 13, 1), tags='SHANtell martin', views=748374, likes=57527, dislikes=2966, comment_count=15954, thumbnail_link='https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg', comments_disabled='False', ratings_disabled='False', video_error_or_removed='False', description="SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\\nwith this lens -- http://amzn.to/2rUJOmD\\nbig drone - http://tinyurl.com/h4ft3oy\\nOTHER GEAR ---  http://amzn.to/2o3GLX5\\nSony CAMERA http://amzn.to/2nOBmnv\\nOLD CAMERA; http://amzn.to/2o2cQBT\\nMAIN LENS; http://amzn.to/2od5gBJ\\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\\nBIG Canon CAMERA; http://tinyurl.com/jn4q4vz\\nBENDY TRIPOD THING; http://tinyurl.com/gw3ylz2\\nYOU NEED T

In [33]:
# collect returns the whole column to the driver as a list of Row objects
dataf.select(col('likes')).collect()

[Row(likes=57527),
 Row(likes=97185),
 Row(likes=146033),
 Row(likes=10172),
 Row(likes=132235),
 Row(likes=9763),
 Row(likes=15993),
 Row(likes=23663),
 Row(likes=3543),
 Row(likes=12654),
 Row(likes=655),
 Row(likes=1576),
 Row(likes=114188),
 Row(likes=7848),
 Row(likes=7473),
 Row(likes=9419),
 Row(likes=8011),
 Row(likes=5398),
 Row(likes=11963),
 Row(likes=8421),
 Row(likes=9586),
 Row(likes=3585),
 Row(likes=11758),
 Row(likes=1707),
 Row(likes=4884),
 Row(likes=8676),
 Row(likes=4687),
 Row(likes=9033),
 Row(likes=156),
 Row(likes=715),
 Row(likes=4035),
 Row(likes=119),
 Row(likes=787419),
 Row(likes=3781),
 Row(likes=1661),
 Row(likes=2486),
 Row(likes=7515),
 Row(likes=1318),
 Row(likes=38397),
 Row(likes=6927),
 Row(likes=5389),
 Row(likes=308),
 Row(likes=7),
 Row(likes=15186),
 Row(likes=4451),
 Row(likes=33505),
 Row(likes=3417),
 Row(likes=2017),
 Row(likes=35),
 Row(likes=45406),
 Row(likes=99086),
 Row(likes=205),
 Row(likes=15397),
 Row(likes=None),
 Row(likes=None),

# Writing DataFrames

In [34]:
dff1 = dataf.repartition(2)

In [35]:
# Choose a different output format and set a separator
dff1.write.format('csv').option('sep','|').save('./salida/csv')

In [36]:
dff1.coalesce(1).write.format('csv').option('sep','|').save('./salida/csv1')

In [37]:
dataf.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: timestamp (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: integer (nullable = true)
 |-- likes: integer (nullable = true)
 |-- dislikes: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [38]:
dataf.select('comments_disabled').distinct().show()

+-----------------+
|comments_disabled|
+-----------------+
|            False|
|             null|
| sports and more.|
|          Wiz Kid|
|             True|
|         farfalle|
+-----------------+



In [39]:
df_limpio=dataf.filter(col('comments_disabled').isin('True','False'))

In [40]:
df_limpio.write.partitionBy('comments_disabled').parquet('./salida/parquet')

# Persistence


In [41]:
dframe = spark.createDataFrame([(1,'a'),(2,'b'),(3,'c')],['id','valor'])

In [42]:
dframe.show()

+---+-----+
| id|valor|
+---+-----+
|  1|    a|
|  2|    b|
|  3|    c|
+---+-----+



In [43]:
dframe.persist()

DataFrame[id: bigint, valor: string]

In [44]:
dframe.unpersist()

DataFrame[id: bigint, valor: string]

In [46]:
from pyspark.storagelevel import StorageLevel

In [47]:
dframe.persist(StorageLevel.DISK_ONLY)

DataFrame[id: bigint, valor: string]

# Exercises

The data attached to this lesson are part of the [NeurIPS 2020] Data Science for COVID-19 (DS4C) dataset, available on Kaggle. They describe COVID-19 infection cases in South Korea.


The Case csv file contains the reported cases, and the PatientInfo csv file contains the patient information.

1. Using the Case csv file, determine the three cities with the most confirmed cases of the disease. The output must contain three columns: province, city, and confirmed cases. The result must contain exactly the names of the three cities with the most confirmed cases; no other values are accepted.

2. Create a dataframe from the PatientInfo csv file. Make sure your dataframe contains no duplicate patients.

  a. How many patients have a recorded source of infection (column infected_by)? Return only the patients whose source of infection is recorded.

  b. From the output of the previous item, keep only the female patients. The output must not contain the columns released_date and deceased_date.

  c. Set the number of partitions of the dataframe resulting from the previous item to two. Write the resulting dataframe to a parquet file. The output must be partitioned by province and the write mode must be overwrite.

In [48]:
casos = spark.read.option('header','true').option('inferSchema','true').csv('./Case.csv')
pacientes_info = spark.read.option('header','true').option('inferSchema','true').csv('./PatientInfo.csv')

In [49]:
casos.printSchema()

root
 |--  case_id: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- group: boolean (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- confirmed: integer (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)



In [51]:
casos = casos.withColumnRenamed(' case_id', 'case_id')
casos.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- group: boolean (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- confirmed: integer (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)



In [52]:
casos.show()

+-------+--------+---------------+-----+--------------------+---------+---------+----------+
|case_id|province|           city|group|      infection_case|confirmed| latitude| longitude|
+-------+--------+---------------+-----+--------------------+---------+---------+----------+
|1000001|   Seoul|     Yongsan-gu| true|       Itaewon Clubs|      139|37.538621|126.992652|
|1000002|   Seoul|      Gwanak-gu| true|             Richway|      119| 37.48208|126.901384|
|1000003|   Seoul|        Guro-gu| true| Guro-gu Call Center|       95|37.508163|126.884387|
|1000004|   Seoul|   Yangcheon-gu| true|Yangcheon Table T...|       43|37.546061|126.874209|
|1000005|   Seoul|      Dobong-gu| true|     Day Care Center|       43|37.679422|127.044374|
|1000006|   Seoul|        Guro-gu| true|Manmin Central Ch...|       41|37.481059|126.894343|
|1000007|   Seoul|from other city| true|SMR Newly Planted...|       36|        -|         -|
|1000008|   Seoul|  Dongdaemun-gu| true|       Dongan Church|       17

In [53]:
# determine the three cities with the most confirmed cases: province, city and confirmed cases
casos.orderBy(desc('confirmed')).show()

+-------+-----------------+---------------+-----+--------------------+---------+---------+----------+
|case_id|         province|           city|group|      infection_case|confirmed| latitude| longitude|
+-------+-----------------+---------------+-----+--------------------+---------+---------+----------+
|1200001|            Daegu|         Nam-gu| true|  Shincheonji Church|     4511| 35.84008|  128.5667|
|1200009|            Daegu|              -|false|contact with patient|      917|        -|         -|
|1200010|            Daegu|              -|false|                 etc|      747|        -|         -|
|6000001| Gyeongsangbuk-do|from other city| true|  Shincheonji Church|      566|        -|         -|
|2000020|      Gyeonggi-do|              -|false|     overseas inflow|      305|        -|         -|
|1000036|            Seoul|              -|false|     overseas inflow|      298|        -|         -|
|1200002|            Daegu|   Dalseong-gun| true|Second Mi-Ju Hosp...|      196|35

In [54]:
# Remove the rows whose city is '-'
casos.filter(col('city') != '-').orderBy(desc('confirmed')).show()

+-------+-----------------+---------------+-----+--------------------+---------+---------+----------+
|case_id|         province|           city|group|      infection_case|confirmed| latitude| longitude|
+-------+-----------------+---------------+-----+--------------------+---------+---------+----------+
|1200001|            Daegu|         Nam-gu| true|  Shincheonji Church|     4511| 35.84008|  128.5667|
|6000001| Gyeongsangbuk-do|from other city| true|  Shincheonji Church|      566|        -|         -|
|1200002|            Daegu|   Dalseong-gun| true|Second Mi-Ju Hosp...|      196|35.857375|128.466651|
|1000001|            Seoul|     Yongsan-gu| true|       Itaewon Clubs|      139|37.538621|126.992652|
|1200003|            Daegu|         Seo-gu| true|Hansarang Convale...|      124|35.885592|128.556649|
|6000002| Gyeongsangbuk-do|   Cheongdo-gun| true|Cheongdo Daenam H...|      119| 35.64887|  128.7368|
|1000002|            Seoul|      Gwanak-gu| true|             Richway|      119| 3

In [55]:
# Remove the 'from other city' rows as well
casos.filter(col('city') != '-').filter(col('city') != 'from other city').orderBy(desc('confirmed')).show()
# Equivalent single filter (both conditions go inside filter, each in its own parentheses):
# casos.filter((col('city') != '-') & (col('city') != 'from other city')).orderBy(desc('confirmed')).show()

+-------+-----------------+------------+-----+--------------------+---------+---------+----------+
|case_id|         province|        city|group|      infection_case|confirmed| latitude| longitude|
+-------+-----------------+------------+-----+--------------------+---------+---------+----------+
|1200001|            Daegu|      Nam-gu| true|  Shincheonji Church|     4511| 35.84008|  128.5667|
|1200002|            Daegu|Dalseong-gun| true|Second Mi-Ju Hosp...|      196|35.857375|128.466651|
|1000001|            Seoul|  Yongsan-gu| true|       Itaewon Clubs|      139|37.538621|126.992652|
|1200003|            Daegu|      Seo-gu| true|Hansarang Convale...|      124|35.885592|128.556649|
|1000002|            Seoul|   Gwanak-gu| true|             Richway|      119| 37.48208|126.901384|
|6000002| Gyeongsangbuk-do|Cheongdo-gun| true|Cheongdo Daenam H...|      119| 35.64887|  128.7368|
|4100001|Chungcheongnam-do|  Cheonan-si| true|gym facility in C...|      103| 36.81503|  127.1139|
|1200004| 

In [56]:
# keep only the 3 requested columns
casos.filter(col('city') != '-').filter(col('city') != 'from other city').orderBy(desc('confirmed')).select('province','city','confirmed').show()

+-----------------+------------+---------+
|         province|        city|confirmed|
+-----------------+------------+---------+
|            Daegu|      Nam-gu|     4511|
|            Daegu|Dalseong-gun|      196|
|            Seoul|  Yongsan-gu|      139|
|            Daegu|      Seo-gu|      124|
|            Seoul|   Gwanak-gu|      119|
| Gyeongsangbuk-do|Cheongdo-gun|      119|
|Chungcheongnam-do|  Cheonan-si|      103|
|            Daegu|Dalseong-gun|      101|
|            Seoul|     Guro-gu|       95|
| Gyeongsangbuk-do| Bonghwa-gun|       68|
|      Gyeonggi-do|  Bucheon-si|       67|
|      Gyeonggi-do| Seongnam-si|       67|
| Gyeongsangbuk-do|Gyeongsan-si|       66|
|      Gyeonggi-do|Uijeongbu-si|       50|
|            Seoul|Yangcheon-gu|       43|
|            Seoul|   Dobong-gu|       43|
|            Seoul|     Guro-gu|       41|
| Gyeongsangbuk-do|  Yechun-gun|       40|
|            Busan|  Dongnae-gu|       39|
|            Daegu|     Dong-gu|       39|
+----------

2

In [57]:
pacientes_info.printSchema()

root
 |-- patient_id: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- infection_case: string (nullable = true)
 |-- infected_by: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- symptom_onset_date: string (nullable = true)
 |-- confirmed_date: timestamp (nullable = true)
 |-- released_date: timestamp (nullable = true)
 |-- deceased_date: timestamp (nullable = true)
 |-- state: string (nullable = true)



In [58]:
pacientes_info.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|1000000001|  male|50s|  Korea|   Seoul|  Gangseo-gu|     overseas inflow|       null|            75|        2020-01-22|2020-01-23 00:00:00|2020-02-05 00:00:00|         null|released|
|1000000002|  male|30s|  Korea|   Seoul| Jungnang-gu|     overseas inflow|       null|            31|              null|2020-01-30 00:00:00|2020-03-02 00:00:00|         null|released|
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 20020

In [59]:
pacientes_info.select(col('patient_id')).count()

5165

In [60]:
pacientes_info.select(col('patient_id')).distinct().count()

5164

In [61]:
pacientes_info = pacientes_info.dropDuplicates(['patient_id'])

In [62]:
pacientes_info.count()

5164

In [64]:
from pyspark.sql.functions import count

In [65]:
pacientes_info.select(count('infected_by').alias('conteo')).show()

+------+
|conteo|
+------+
|  1346|
+------+



In [66]:
contagiados = pacientes_info.na.drop(subset=['infected_by'])

In [67]:
contagiados.count()

1346

In [68]:
contagiados.show()

+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|patient_id|   sex|age|country|province|        city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|
+----------+------+---+-------+--------+------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|1000000003|  male|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 2002000001|            17|              null|2020-01-30 00:00:00|2020-02-19 00:00:00|         null|released|
|1000000005|female|20s|  Korea|   Seoul| Seongbuk-gu|contact with patient| 1000000002|             2|              null|2020-01-31 00:00:00|2020-02-24 00:00:00|         null|released|
|1000000006|female|50s|  Korea|   Seoul|   Jongno-gu|contact with patient| 10000

In [69]:
contagiados.filter(col('sex') == 'female').show()

+----------+------+---+-------+--------+-------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|patient_id|   sex|age|country|province|         city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|
+----------+------+---+-------+--------+-------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|1000000005|female|20s|  Korea|   Seoul|  Seongbuk-gu|contact with patient| 1000000002|             2|              null|2020-01-31 00:00:00|2020-02-24 00:00:00|         null|released|
|1000000006|female|50s|  Korea|   Seoul|    Jongno-gu|contact with patient| 1000000003|            43|              null|2020-01-31 00:00:00|2020-02-19 00:00:00|         null|released|
|1000000010|female|60s|  Korea|   Seoul|  Seongbuk-gu|contact with patient|

In [70]:
# exclude null values of sex (the equality test already drops them, so isNotNull is a safeguard)
contagiados.filter((col('sex') == 'female') & (col('sex').isNotNull())).show()

+----------+------+---+-------+--------+-------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|patient_id|   sex|age|country|province|         city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|      released_date|deceased_date|   state|
+----------+------+---+-------+--------+-------------+--------------------+-----------+--------------+------------------+-------------------+-------------------+-------------+--------+
|1000000005|female|20s|  Korea|   Seoul|  Seongbuk-gu|contact with patient| 1000000002|             2|              null|2020-01-31 00:00:00|2020-02-24 00:00:00|         null|released|
|1000000006|female|50s|  Korea|   Seoul|    Jongno-gu|contact with patient| 1000000003|            43|              null|2020-01-31 00:00:00|2020-02-19 00:00:00|         null|released|
|1000000010|female|60s|  Korea|   Seoul|  Seongbuk-gu|contact with patient|

In [71]:
# Drop the released_date and deceased_date columns
contagiados.filter((col('sex') == 'female') & (col('sex').isNotNull())).drop('released_date','deceased_date').show()

+----------+------+---+-------+--------+-------------+--------------------+-----------+--------------+------------------+-------------------+--------+
|patient_id|   sex|age|country|province|         city|      infection_case|infected_by|contact_number|symptom_onset_date|     confirmed_date|   state|
+----------+------+---+-------+--------+-------------+--------------------+-----------+--------------+------------------+-------------------+--------+
|1000000005|female|20s|  Korea|   Seoul|  Seongbuk-gu|contact with patient| 1000000002|             2|              null|2020-01-31 00:00:00|released|
|1000000006|female|50s|  Korea|   Seoul|    Jongno-gu|contact with patient| 1000000003|            43|              null|2020-01-31 00:00:00|released|
|1000000010|female|60s|  Korea|   Seoul|  Seongbuk-gu|contact with patient| 1000000003|             6|              null|2020-02-05 00:00:00|released|
|1000000014|female|60s|  Korea|   Seoul|    Jongno-gu|contact with patient| 1000000013|       

In [72]:
final_df = contagiados.filter((col('sex') == 'female') & (col('sex').isNotNull())).drop('released_date','deceased_date')

In [76]:
final_df.coalesce(2).write.partitionBy('province').mode('overwrite').parquet('./FinalEjercicio2')