<div style="background-color:white; text-align:center; padding:10px; color:black; margin-left:0px; border-radius: 10px; font-family:Trebuchet MS; font-size:45px">
<strong>Best Practices in Data Science</strong>
</div>

<center><img src="../images/Spark_logo.png" width="450"></center>

<div style="background-color:palegreen; text-align:center; padding:1px; color:black; margin-left:0px; border-radius: 5px; font-family:Trebuchet MS; font-size:35px">
Tips for Using Spark
</div>

<table class="table table-bordered table-hover">
  <thead>
    <tr>
      <th scope="col">Author</th>
      <th scope="col">Date</th>
      <th scope="col">Place</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Marcos << Data Scientist >> </td>
      <td>November 2020</td>
      <td>Mexico City, Mexico</td>
    </tr>
  </tbody>
</table>

# Starting Spark

In [1]:
import pyspark
sc = pyspark.SparkContext('local[*]')
sc

In [2]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)

# Loading the example dataset

In [3]:
schema = StructType([StructField('Loan_ID', StringType(), True),
                     StructField('Customer_ID', StringType(), True),
                     StructField('Loan_Status', StringType(), True),
                     StructField('Current_Loan_Amount', DoubleType(), True),
                     StructField('Term', StringType(), True),
                     StructField('Credit_Score', DoubleType(), True),
                     StructField('Annual_Income', DoubleType(), True),
                     StructField('Years_in_current_job', StringType(), True),
                     StructField('Home_Ownership', StringType(), True),
                     StructField('Purpose', StringType(), True),
                     StructField('Monthly_Debt', DoubleType(), True),
                     StructField('Years_of_Credit_History', DoubleType(), True),
                     StructField('Months_since_last_delinquent', DoubleType(), True),
                     StructField('Number_of_Open_Accounts', DoubleType(), True),
                     StructField('Number_of_Credit_Problems', DoubleType(), True),
                     StructField('Current_Credit_Balance', DoubleType(), True),
                     StructField('Maximum_Open_Credit', DoubleType(), True),
                     StructField('Bankruptcies', DoubleType(), True),
                     StructField('Tax_Liens', DoubleType(), True)
                    ])

When you load a dataset, it is best to select only the columns you will actually need, as in the following example:

In [4]:
df = (sqlContext.read
      .csv('../data/credit_train_v2.csv', sep=',', header=True, schema=schema)
      .select('Customer_ID',
              'Loan_Status',
              'Current_Loan_Amount',
              'Annual_Income',
              'Current_Credit_Balance'))

In [5]:
df.printSchema()

root
 |-- Customer_ID: string (nullable = true)
 |-- Loan_Status: string (nullable = true)
 |-- Current_Loan_Amount: double (nullable = true)
 |-- Annual_Income: double (nullable = true)
 |-- Current_Credit_Balance: double (nullable = true)



In [6]:
df.show(5)

+--------------------+-----------+-------------------+-------------+----------------------+
|         Customer_ID|Loan_Status|Current_Loan_Amount|Annual_Income|Current_Credit_Balance|
+--------------------+-----------+-------------------+-------------+----------------------+
|981165ec-3274-42f...| Fully Paid|           445412.0|    1167493.0|              228190.0|
|2de017a3-2e01-49c...| Fully Paid|           262328.0|         null|              229976.0|
|5efb2b2b-bf11-4df...| Fully Paid|        9.9999999E7|    2231892.0|              297996.0|
|e777faab-98ae-45a...| Fully Paid|           347666.0|     806949.0|              256329.0|
|81536ad9-5ccf-4eb...| Fully Paid|           176220.0|         null|              253460.0|
+--------------------+-----------+-------------------+-------------+----------------------+
only showing top 5 rows



# Running some calculations

In [7]:
from pyspark.sql.functions import col as c, greatest, when, lit, rand

In [8]:
tmp = df.select('*',
                when(c('Current_Loan_Amount')==99999999, c('Annual_Income')
                    ).otherwise(greatest(c('Current_Loan_Amount'),c('Annual_Income'))).cast(DecimalType(20,1)).alias('max_column'),
                (c('Current_Credit_Balance')/c('Annual_Income')).cast(DecimalType(10,5)).alias('ratio'),
                (when(rand()<0.5,0).otherwise(1)).alias('aux'))

In [9]:
tmp = tmp.select('*',
                 when(c('aux')==0, lit('2018-11-01').cast(DateType())
                     ).otherwise(lit('2018-12-01').cast(DateType())).alias('date')
                ).drop('aux')

In [10]:
tmp.show(5)

+--------------------+-----------+-------------------+-------------+----------------------+----------+-------+----------+
|         Customer_ID|Loan_Status|Current_Loan_Amount|Annual_Income|Current_Credit_Balance|max_column|  ratio|      date|
+--------------------+-----------+-------------------+-------------+----------------------+----------+-------+----------+
|981165ec-3274-42f...| Fully Paid|           445412.0|    1167493.0|              228190.0| 1167493.0|0.19545|2018-11-01|
|2de017a3-2e01-49c...| Fully Paid|           262328.0|         null|              229976.0|  262328.0|   null|2018-12-01|
|5efb2b2b-bf11-4df...| Fully Paid|        9.9999999E7|    2231892.0|              297996.0| 2231892.0|0.13352|2018-11-01|
|e777faab-98ae-45a...| Fully Paid|           347666.0|     806949.0|              256329.0|  806949.0|0.31765|2018-12-01|
|81536ad9-5ccf-4eb...| Fully Paid|           176220.0|         null|              253460.0|  176220.0|   null|2018-12-01|
+--------------------+-----------+-------------------+-------------+----------------------+----------+-------+----------+

For very large tables (millions of records), writing the whole table in a single unpartitioned call like this **is not recommended**:

```python
tmp.write.parquet('../output/example6')
```

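Instead, use the same writer:

* `df.write.parquet(path)`

combined with the `partitionBy()` and `mode()` options: `partitionBy()` writes one sub-directory per value of the chosen column, and `mode()` controls what happens when the destination already exists.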
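As a rough sketch of how those two options fit together, the helper below chains them on a Spark DataFrame. `write_partitioned` and its parameter names are hypothetical and not part of the original notebook; the only assumption about `table` is that it is a `pyspark.sql.DataFrame` containing the partition column.

```python
def write_partitioned(table, path, partition_col, mode='append'):
    """Write a Spark DataFrame as Parquet, one sub-directory per partition value.

    Hypothetical helper for illustration; `table` is assumed to be a
    pyspark.sql.DataFrame that contains the column `partition_col`.
    """
    (table.write
          .partitionBy(partition_col)  # creates <path>/<partition_col>=<value>/ directories
          .mode(mode)                  # e.g. 'append' or 'overwrite'
          .parquet(path))

# Example call (requires a running Spark context and the notebook's `tmp` table):
# write_partitioned(tmp, '../output/example6', 'date', mode='append')
```

With `'append'`, existing partition directories are left in place and new files are added next to them, which is why the saving function further down deletes a partition before rewriting it.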

Before that, let's write a function that reports the progress of the save as it runs.

In [11]:
import datetime
import pytz

In [12]:
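def progress_time(str_log):
    # Print a log line stamped with the current Mexico City time.
    # '%H:%M:%S' replaces '%T', which strftime does not support on every platform.
    now = datetime.datetime.now(pytz.timezone('America/Mexico_City'))
    print('[' + now.strftime('%Y-%m-%d %H:%M:%S') + ']: ' + str_log)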

In [13]:
def delete_info(sc, path):
    # Recursively delete `path` through the Hadoop FileSystem API exposed by the JVM gateway.
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

In [14]:
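def saving_data(table, path_file, table_name, partition, dates_list):
    # Relies on the global SparkContext `sc` created in the first cell.
    for x in dates_list:
        # Drop the existing partition directory first so that re-running does not duplicate rows.
        delete_info(sc, path_file+table_name+'/'+partition+'='+str(x))
        progress_time(table_name+' >> '+partition+' = '+str(x))
        # Write only this partition's rows; 'append' leaves the other partitions untouched.
        table.filter(c(partition)==x).write.partitionBy(partition).mode('append').parquet(path_file+table_name)
    print('Done!')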

In [15]:
path_file = '../output/'
file_name = 'example6'

In [17]:
saving_data(tmp, path_file, file_name, 'date', ['2018-11-01','2018-12-01'])

[2020-12-19 20:28:40]: example6 >> date = 2018-11-01
[2020-12-19 20:28:41]: example6 >> date = 2018-12-01
Done!


We verify that the table was saved correctly:

<img src="../output/example6.png" width="450">
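Besides inspecting the folder visually, the layout can also be checked from Python. The sketch below assumes the output lives on the local filesystem and that Spark wrote Hive-style `date=<value>` sub-directories; `list_partitions` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def list_partitions(table_path, partition='date'):
    """Return the sorted partition values found under a partitioned Parquet folder."""
    prefix = partition + '='
    return sorted(
        entry.name[len(prefix):]
        for entry in Path(table_path).iterdir()
        if entry.is_dir() and entry.name.startswith(prefix)
    )

# Example call for the table written above:
# list_partitions('../output/example6')
```

Files such as `_SUCCESS` are skipped because only directories whose name starts with the partition prefix are kept.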

## Reviewing the results

In [18]:
df = sqlContext.read.parquet(path_file + file_name)

In [19]:
df.show(5)

+--------------------+-----------+-------------------+-------------+----------------------+----------+-------+----------+
|         Customer_ID|Loan_Status|Current_Loan_Amount|Annual_Income|Current_Credit_Balance|max_column|  ratio|      date|
+--------------------+-----------+-------------------+-------------+----------------------+----------+-------+----------+
|981165ec-3274-42f...| Fully Paid|           445412.0|    1167493.0|              228190.0| 1167493.0|0.19545|2018-11-01|
|5efb2b2b-bf11-4df...| Fully Paid|        9.9999999E7|    2231892.0|              297996.0| 2231892.0|0.13352|2018-11-01|
|4ffe99d3-7f2a-44d...|Charged Off|           206602.0|     896857.0|              215308.0|  896857.0|0.24007|2018-11-01|
|90a75dde-34d5-419...| Fully Paid|           217646.0|    1184194.0|              122170.0| 1184194.0|0.10317|2018-11-01|
|018973c9-e316-495...|Charged Off|           648714.0|         null|              193306.0|  648714.0|   null|2018-11-01|
+--------------------+-----------+-------------------+-------------+----------------------+----------+-------+----------+

<div style="color:black; font-size:40px">
<strong>Thank you!</strong>
</div>