<div style="background-color:white; text-align:center; padding:10px; color:black; margin-left:0px; border-radius: 10px; font-family:Trebuchet MS; font-size:45px">
<strong>Best Practices in Data Science</strong>
</div>

<center><img src="../images/Spark_logo.png" width="450"></center>

<div style="background-color:palegreen; text-align:center; padding:1px; color:black; margin-left:0px; border-radius: 5px; font-family:Trebuchet MS; font-size:35px">
Do It with Functions!
</div>

<table class="table table-bordered table-hover">
  <thead>
    <tr>
      <th scope="col">Author</th>
      <th scope="col">Date</th>
      <th scope="col">Place</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Marcos << Data Scientist >> </td>
      <td>November 2020</td>
      <td>Mexico City, Mexico</td>
    </tr>
  </tbody>
</table>

# Spark Services

In [1]:
import pyspark
sc = pyspark.SparkContext('local[*]')

In [2]:
sc

In [3]:
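from pyspark.sql import SQLContext
# Import only the types this notebook uses instead of a wildcard import
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, DateType, DecimalType)
sqlContext = SQLContext(sc)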

# Loading the example dataset

In [4]:
schema = StructType([StructField('incmpl', IntegerType(), True),
                     StructField('date', IntegerType(), True),
                     StructField('ratio_1', DoubleType(), True),
                     StructField('ratio_2', DoubleType(), True),
                     StructField('ratio_3', DoubleType(), True),
                     StructField('ratio_4', DoubleType(), True),
                     StructField('ratio_5', DoubleType(), True),
                     StructField('num_1', DoubleType(), True),
                     StructField('num_2', DoubleType(), True),
                     StructField('num_3', DoubleType(), True),
                     StructField('year', IntegerType(), True),
                     StructField('month', IntegerType(), True),
                     StructField('date_2', DateType(), True)])

In [5]:
df = sqlContext.read.csv('../data/credit_tiny_dummy.csv', sep=',', header=True, schema=schema)

In [6]:
df.printSchema()

root
 |-- incmpl: integer (nullable = true)
 |-- date: integer (nullable = true)
 |-- ratio_1: double (nullable = true)
 |-- ratio_2: double (nullable = true)
 |-- ratio_3: double (nullable = true)
 |-- ratio_4: double (nullable = true)
 |-- ratio_5: double (nullable = true)
 |-- num_1: double (nullable = true)
 |-- num_2: double (nullable = true)
 |-- num_3: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- date_2: date (nullable = true)



In [7]:
df.select('date','date_2','year','month','ratio_1','ratio_2','ratio_3','num_1','num_2','num_3').show(5)

+------+----------+----+-----+-------+------------------+-------+------------+------------+------------+
|  date|    date_2|year|month|ratio_1|           ratio_2|ratio_3|       num_1|       num_2|       num_3|
+------+----------+----+-----+-------+------------------+-------+------------+------------+------------+
|201201|2012-01-01|2012|    1|   1.27|0.7708797184999999|   1.23|8.3481306975|0.8482867987|0.0025750175|
|201202|2012-02-01|2012|    2|   1.27|0.7708797184999999|   1.23|8.3481306975|0.8482867987|0.0025750175|
|201203|2012-03-01|2012|    3|   1.27|0.7708797184999999|   1.23|8.3481306975|0.8482867987|0.0025750175|
|201204|2012-04-01|2012|    4|   1.27|0.7708797184999999|   1.23|8.3481306975|0.8482867987|0.0025750175|
|201205|2012-05-01|2012|    5|   1.27|0.7708797184999999|   1.23|8.3481306975|0.8482867987|0.0025750175|
+------+----------+----+-----+-------+------------------+-------+------------+------------+------------+
only showing top 5 rows



# Suppose we want to run calculations for one particular month
#### Example: December 2014

In [8]:
from pyspark.sql.functions import col, sum as _sum

In [9]:
# Keep only December 2014 and add num_3 scaled by 100 as a fixed-precision decimal
df2 = df.filter('date_2 == "2014-12-01"'
               ).select('*',
                        (col('num_3')*100).cast(DecimalType(20,5)).alias('new_column'))

In [10]:
df2.select('date_2','num_3','new_column').show(5)

+----------+------------+----------+
|    date_2|       num_3|new_column|
+----------+------------+----------+
|2014-12-01|0.0103605209|   1.03605|
|2014-12-01| 0.012011615|   1.20116|
|2014-12-01|0.1216192209|  12.16192|
|2014-12-01|0.1341868555|  13.41869|
|2014-12-01|-0.033046993|  -3.30470|
+----------+------------+----------+
only showing top 5 rows
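
The `new_column` values above are `num_3 * 100` rounded to five decimal places by the cast to `DecimalType(20,5)`. As a rough cross-check outside Spark, the same rounding can be sketched with Python's `decimal` module. This is an illustration under assumptions: `to_new_column` is a hypothetical helper, the inputs are the strings printed by `show`, and half-up rounding is assumed to match what the cast does. Spark itself multiplies in double precision before casting.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_new_column(num_3_text):
    """Scale a num_3 value by 100 and round it to 5 decimal places (half-up)."""
    scaled = Decimal(num_3_text) * 100
    return scaled.quantize(Decimal('0.00001'), rounding=ROUND_HALF_UP)

# num_3 values copied from the table printed above
for value in ['0.0103605209', '0.1341868555', '-0.033046993']:
    print(value, '->', to_new_column(value))
```

For these three inputs the sketch prints 1.03605, 13.41869 and -3.30470, matching the corresponding rows of `new_column`.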



In [11]:
print('Rows: {:,}'.format(df2.count()))

Rows: 20,200


In [12]:
df2.write.partitionBy('date_2').mode('overwrite').parquet('../output/example5a')

Done! The table has been saved, as the following image shows.

<img src="../output/example5a.png" width="450">

### Checking the results

In [13]:
df2 = sqlContext.read.parquet('../output/example5a')

In [14]:
print('Rows: {:,}'.format(df2.count()))

Rows: 20,200


In [15]:
df2.select('date','date_2','year','month','ratio_1','ratio_2','ratio_3','num_1','num_2','num_3').show(5)

+------+----------+----+-----+-------+------------------+-------+------------+------------------+------------+
|  date|    date_2|year|month|ratio_1|           ratio_2|ratio_3|       num_1|             num_2|       num_3|
+------+----------+----+-----+-------+------------------+-------+------------+------------------+------------+
|201412|2014-12-01|2014|   12|   1.23|      0.7629253159|   1.06|7.5041156394|0.8716100398000001|0.0103605209|
|201412|2014-12-01|2014|   12|   0.65|      0.5630340219|   2.32|9.7269471799|      1.0397253728| 0.012011615|
|201412|2014-12-01|2014|   12|   0.29|      0.9821823361|   0.51|1.6189236708|       0.488424288|0.1216192209|
|201412|2014-12-01|2014|   12|   0.58|0.6503938054999999|   0.45| 1.638700494|      0.7900542656|0.1341868555|
|201412|2014-12-01|2014|   12|   0.91|      0.6164210334|   2.06|-9.9999993E7|      0.1311900016|-0.033046993|
+------+----------+----+-----+-------+------------------+-------+------------+------------------+------------+
only showing top 5 rows

# To score another month we would have to repeat the same steps
### For example: January 2015 to March 2015

In [None]:
# Same transformation as before, copied by hand for a different set of months
df2 = df.filter('date_2 in ("2015-01-01","2015-02-01","2015-03-01")'
               ).select('*',
                        (col('num_3')*100).cast(DecimalType(20,5)).alias('new_column'))


df2.write.partitionBy('date_2').mode('append').parquet('../output/example5a')
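
One caveat with `mode('append')`: running the cell above a second time adds the same rows again, because append never removes what a partition already holds. The toy model below uses a plain dictionary in place of the Parquet folder (no Spark involved; `append_rows` and `replace_rows` are hypothetical names) to contrast plain appending with deleting a partition before writing it.

```python
from collections import defaultdict

def append_rows(store, partition_value, rows):
    """Mimic mode('append'): add rows to whatever the partition already holds."""
    store[partition_value].extend(rows)

def replace_rows(store, partition_value, rows):
    """Delete the partition first, then append, so re-runs leave one copy."""
    store.pop(partition_value, None)
    append_rows(store, partition_value, rows)

january = [{'num_3': 0.012011615}, {'num_3': 0.1341868555}]

appended = defaultdict(list)
replaced = defaultdict(list)
for _ in range(2):  # simulate running the same cell twice
    append_rows(appended, '2015-01-01', january)
    replace_rows(replaced, '2015-01-01', january)

print(len(appended['2015-01-01']))  # 4: the rows were written twice
print(len(replaced['2015-01-01']))  # 2: the second run replaced the first
```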

# Do it with functions!

## Functions

In [16]:
import datetime
import pytz
from dateutil.relativedelta import relativedelta

In [17]:
def get_dates_range(start_date, months_count):
    """Return `months_count` consecutive month-start dates as 'YYYY-MM-DD' strings."""
    start_date = str(start_date)
    # Snap the start date to the first day of its month
    date_1 = datetime.datetime.strptime(start_date, '%Y-%m-%d').date().replace(day=1)
    return [str(date_1 + relativedelta(months=x)) for x in range(months_count)]
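
To make the month stepping explicit, here is a sketch of the same range built with only the standard library. It is an alternative illustration, not the notebook's implementation: `month_starts` is a hypothetical name, and it assumes the input is a 'YYYY-MM-DD' string.

```python
import datetime

def month_starts(start_date, months_count):
    """Return months_count first-of-month dates as 'YYYY-MM-DD' strings."""
    first = datetime.datetime.strptime(str(start_date), '%Y-%m-%d').date().replace(day=1)
    dates = []
    for offset in range(months_count):
        # Count months from year 0 so that December rolls over into January
        year, month_index = divmod(first.year * 12 + (first.month - 1) + offset, 12)
        dates.append(datetime.date(year, month_index + 1, 1).isoformat())
    return dates

print(month_starts('2014-12-15', 3))
```

A start date in the middle of a month is snapped to the first day, so the call above yields December 2014, January 2015 and February 2015.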

In [18]:
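def progress_time(str_log):
    """Print a message prefixed with the current Mexico City timestamp."""
    # '%H:%M:%S' is the portable spelling of '%T', which is not supported on every platform
    now = datetime.datetime.now(pytz.timezone('America/Mexico_City'))
    print('[' + now.strftime('%Y-%m-%d %H:%M:%S') + ']: ' + str_log)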

In [19]:
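def delete_info(sc, path):
    """Recursively delete `path` through the Hadoop FileSystem API."""
    # Note: sc._jvm and sc._jsc are private PySpark attributes and may change between versions
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)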

In [20]:
def saving_data(table, path_file, table_name, partition, dates_list):
    """Rewrite one partition per date: delete it if present, then append the fresh rows."""
    # Relies on the global SparkContext `sc`; `path_file` must end with '/'
    for x in dates_list:
        delete_info(sc, path_file + table_name + '/' + partition + '=' + str(x))
        progress_time(table_name + ' >> ' + partition + ' = ' + str(x))
        table.filter(col(partition) == x).write.partitionBy(partition).mode('append').parquet(path_file + table_name)
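
The path handed to `delete_info` follows the `column=value` folder layout that `partitionBy` produces. A small helper can make that construction explicit and tolerate a base path with or without a trailing slash. `partition_path` is a hypothetical name used only for this sketch.

```python
import posixpath

def partition_path(path_file, table_name, partition, value):
    """Build the folder path of one partition, e.g. .../table/date_2=2014-12-01."""
    return posixpath.join(path_file, table_name, '{}={}'.format(partition, value))

print(partition_path('../output/', 'example5b', 'date_2', '2014-12-01'))
print(partition_path('../output', 'example5b', 'date_2', '2015-01-01'))
```

Both calls resolve under `../output/example5b/`, whether or not the base path ends with a slash.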

In [21]:
def calculus_and_saving_data(table, start_date, months_count, path_file, table_name, partition):
    """Confirm the date range interactively, compute `new_column`, and save one partition per month."""
    range_list = get_dates_range(start_date, months_count)
    print('Dates to process:', range_list)
    w = input('Is it dates correct [y/n]?:')

    if w == 'y':
        # Filter on the partition column rather than a hard-coded 'date_2'
        tmp = table.filter(col(partition).isin(range_list)) \
                   .select('*', (col('num_3') * 100).cast(DecimalType(20, 5)).alias('new_column'))
        saving_data(tmp, path_file, table_name, partition, range_list)
        progress_time('end process')
    else:
        print('Set dates correctly')
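
Because the prompt uses `input`, the function blocks whenever nobody is at the keyboard, for example in a scheduled run. One possible variation, sketched here under assumptions, takes the question function as a parameter so a script or a test can answer for the user. `confirm_dates` and its `ask` parameter are hypothetical and not part of the notebook.

```python
def confirm_dates(range_list, ask=input):
    """Show the dates and return True only when the answer is 'y'."""
    print('Dates to process:', range_list)
    answer = ask('Are these dates correct [y/n]?: ')
    return answer.strip().lower() == 'y'

dates = ['2014-12-01', '2015-01-01']

# Interactive use keeps the default: confirm_dates(dates)
# Unattended use supplies the answer instead of reading from the keyboard
if confirm_dates(dates, ask=lambda prompt: 'y'):
    print('Confirmed, processing can start')
else:
    print('Set dates correctly')
```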

## Variables

In [22]:
example_path = '../output/'
df_name = 'example5b'

## Example of how the functions work

In [23]:
get_dates_range('2014-12-01',5)

['2014-12-01', '2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01']

In [24]:
progress_time('this is an example')

[2020-11-06 00:05:43]: this is an example


# Suppose we need to run calculations for several months

## Example: December 2014 to April 2015

In [25]:
calculus_and_saving_data(df, '2014-12-01', 5, example_path, df_name, 'date_2')

Dates to process: ['2014-12-01', '2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01']


Is it dates correct [y/n]?: n


Set dates correctly


In [26]:
calculus_and_saving_data(df, '2014-12-01', 5, example_path, df_name, 'date_2')

Dates to process: ['2014-12-01', '2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01']


Is it dates correct [y/n]?: y


[2020-11-06 00:06:12]: example5b >> date_2 = 2014-12-01
[2020-11-06 00:06:14]: example5b >> date_2 = 2015-01-01
[2020-11-06 00:06:15]: example5b >> date_2 = 2015-02-01
[2020-11-06 00:06:17]: example5b >> date_2 = 2015-03-01
[2020-11-06 00:06:18]: example5b >> date_2 = 2015-04-01
[2020-11-06 00:06:19]: end process


The following image confirms that the table has been saved.

<img src="../output/example5b.png" width="450">

## Checking the results

In [27]:
tmp = sqlContext.read.parquet(example_path + df_name)

In [28]:
print('Rows: {:,}'.format(tmp.count()))

Rows: 97,768


In [29]:
tmp.select('date_2','num_3','new_column').show(5)

+----------+------------+----------+
|    date_2|       num_3|new_column|
+----------+------------+----------+
|2015-01-01|-0.020233211|  -2.02332|
|2015-01-01|-0.077430956|  -7.74310|
|2015-01-01| 0.012011615|   1.20116|
|2015-01-01|0.1341868555|  13.41869|
|2015-01-01|-0.033046993|  -3.30470|
+----------+------------+----------+
only showing top 5 rows



In [30]:
tmp.groupBy('date_2').agg(_sum('new_column')).orderBy('date_2').show()

+----------+---------------+
|    date_2|sum(new_column)|
+----------+---------------+
|2014-12-01|    66569.82316|
|2015-01-01|    86451.51560|
|2015-02-01|    86482.35292|
|2015-03-01|    81442.42464|
|2015-04-01|    79700.30998|
+----------+---------------+



In [32]:
tmp.groupBy('date_2').count().orderBy('date_2').show()

+----------+-----+
|    date_2|count|
+----------+-----+
|2014-12-01|20200|
|2015-01-01|19998|
|2015-02-01|19594|
|2015-03-01|19392|
|2015-04-01|18584|
+----------+-----+
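
As a quick consistency check, the per-month counts above should add up to the total row count reported earlier for the saved table (97,768). The sketch below hard-codes the printed figures. In the notebook itself they would come from the `tmp` DataFrame.

```python
# Row counts per partition, copied from the groupBy output above
counts_by_month = {
    '2014-12-01': 20200,
    '2015-01-01': 19998,
    '2015-02-01': 19594,
    '2015-03-01': 19392,
    '2015-04-01': 18584,
}

total = sum(counts_by_month.values())
print('Rows: {:,}'.format(total))  # Rows: 97,768
```

The December 2014 figure also agrees with the 20,200 rows counted in the single-month example.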



<div style="color:black; font-size:40px">
<strong>Thank you!</strong>
</div>