# Partitioning

## Creating a Spark session

In [1]:
from pyspark.sql import SparkSession

When creating the session or context, Spark lets you set how many partitions it will use by default.

To do this, put the number inside the square brackets '[ ]' of the master URL: 'local[5]' runs Spark locally with 5 worker threads, which here also becomes the default number of partitions.

In [2]:
spark = SparkSession.builder.appName("Particionado").master("local[5]").getOrCreate()

21/12/17 01:39:32 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 172.18.0.2 instead (on interface eth0)
21/12/17 01:39:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/17 01:39:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
df = spark.range(0,20)
df.rdd.getNumPartitions()

5

The 'parallelize' method lets you set the number of partitions manually.

In [4]:
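# Distribute the numbers 0..19 across 6 partitions
# (the tuple (0, 20) would contain only the two elements 0 and 20)
rdd1 = spark.sparkContext.parallelize(range(0, 20), 6)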
rdd1.getNumPartitions()

6
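
To build intuition for what the second argument does, the following pure-Python sketch splits a range into a given number of contiguous slices. The helper `slice_range` is hypothetical and only a simplified model; how Spark actually assigns elements to partitions is an implementation detail and may differ.

```python
def slice_range(data, num_slices):
    """Split `data` into `num_slices` contiguous chunks of near-equal size.

    Hypothetical helper: a simplified model of partitioning, not Spark's code.
    """
    items = list(data)
    size = len(items)
    # Boundary i sits at floor(i * size / num_slices),
    # so chunk sizes differ by at most one element.
    bounds = [i * size // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


partitions = slice_range(range(0, 20), 6)
for index, chunk in enumerate(partitions):
    print(index, chunk)
```

On a real RDD, `rdd1.glom().collect()` returns the elements grouped by partition, which lets you check the actual distribution.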

Likewise, we can set the number of partitions when creating an RDD or DataFrame from a file.

For RDDs, it is done as follows; the second argument of 'textFile' is the minimum number of partitions:

In [5]:
rddDesdeArchivo = spark.sparkContext.textFile("data/deporte.csv",10)

In [6]:
rddDesdeArchivo.getNumPartitions()

10

It is good practice to keep data files partitioned: they load faster and are easier to manage.

The 'saveAsTextFile' method writes the RDD to a directory at the given path, one 'part-' file per partition (a single file if the RDD has only one partition).

In [12]:
rddDesdeArchivo.saveAsTextFile("partition")

In [13]:
!ls partition/

part-00000  part-00002	part-00004  part-00006	part-00008  _SUCCESS
part-00001  part-00003	part-00005  part-00007	part-00009


The following cells show how to load multiple files into a single RDD.

This can also be done for DataFrames.

In [15]:
!head -n 5 partition/part-00001

7,Athletics
8,Ice Hockey
9,Swimming
10,Badminton
11,Sailing


In [19]:
# Not the best way to load the partition files:
# wholeTextFiles reads each file as a single (path, content) pair
rdd = spark.sparkContext.wholeTextFiles("partition/*")
rdd.take(4)

[('file:/home/jovyan/work/partition/part-00000',
  'deporte_id,deporte\n1,Basketball\n2,Judo\n3,Football\n4,Tug-Of-War\n5,Speed Skating\n6,Cross Country Skiing\n'),
 ('file:/home/jovyan/work/partition/part-00001',
  '7,Athletics\n8,Ice Hockey\n9,Swimming\n10,Badminton\n11,Sailing\n12,Biathlon\n13,Gymnastics\n14,Art Competitions\n'),
 ('file:/home/jovyan/work/partition/part-00002',
  '15,Alpine Skiing\n16,Handball\n17,Weightlifting\n18,Wrestling\n19,Luge\n20,Water Polo\n'),
 ('file:/home/jovyan/work/partition/part-00003',
  '21,Hockey\n22,Rowing\n23,Bobsleigh\n24,Fencing\n25,Equestrianism\n26,Shooting\n27,Boxing\n28,Taekwondo\n')]

In [30]:
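# split() with no argument splits on any whitespace, so multi-word names
# such as 'Speed Skating' are broken apart; only the paths (keys) are used later
lista = rdd.mapValues(lambda x : x.split()).collect()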
lista

[('file:/home/jovyan/work/partition/part-00000',
  ['deporte_id,deporte',
   '1,Basketball',
   '2,Judo',
   '3,Football',
   '4,Tug-Of-War',
   '5,Speed',
   'Skating',
   '6,Cross',
   'Country',
   'Skiing']),
 ('file:/home/jovyan/work/partition/part-00001',
  ['7,Athletics',
   '8,Ice',
   'Hockey',
   '9,Swimming',
   '10,Badminton',
   '11,Sailing',
   '12,Biathlon',
   '13,Gymnastics',
   '14,Art',
   'Competitions']),
 ('file:/home/jovyan/work/partition/part-00002',
  ['15,Alpine',
   'Skiing',
   '16,Handball',
   '17,Weightlifting',
   '18,Wrestling',
   '19,Luge',
   '20,Water',
   'Polo']),
 ('file:/home/jovyan/work/partition/part-00003',
  ['21,Hockey',
   '22,Rowing',
   '23,Bobsleigh',
   '24,Fencing',
   '25,Equestrianism',
   '26,Shooting',
   '27,Boxing',
   '28,Taekwondo']),
 ('file:/home/jovyan/work/partition/part-00004',
  ['29,Cycling',
   '30,Diving',
   '31,Canoeing',
   '32,Tennis',
   '33,Modern',
   'Pentathlon',
   '34,Figure',
   'Skating',
   '35,Golf']),
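
Note that `str.split()` with no arguments splits on any whitespace, which is why names such as 'Speed Skating' are broken apart in the output above. A small sketch comparing it with `str.splitlines()`, using a shortened copy of the contents of `part-00000`:

```python
# Shortened copy of the text that wholeTextFiles returned for part-00000.
content = "deporte_id,deporte\n4,Tug-Of-War\n5,Speed Skating\n6,Cross Country Skiing\n"

# split() with no argument splits on every run of whitespace, including spaces.
by_whitespace = content.split()

# splitlines() splits only at line boundaries, so each record stays whole.
by_line = content.splitlines()

print(by_whitespace)
print(by_line)
```

In the cell above, `rdd.mapValues(lambda x: x.splitlines())` would keep one record per line. This does not affect the next step, which only uses the file paths.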


In [31]:
# Keep only the file paths (the first element of each pair) and sort them
l = [path for path, _ in lista]
l.sort()
l

['file:/home/jovyan/work/partition/part-00000',
 'file:/home/jovyan/work/partition/part-00001',
 'file:/home/jovyan/work/partition/part-00002',
 'file:/home/jovyan/work/partition/part-00003',
 'file:/home/jovyan/work/partition/part-00004',
 'file:/home/jovyan/work/partition/part-00005',
 'file:/home/jovyan/work/partition/part-00006',
 'file:/home/jovyan/work/partition/part-00007',
 'file:/home/jovyan/work/partition/part-00008',
 'file:/home/jovyan/work/partition/part-00009']
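
Since only the file names are needed here, reading every file with `wholeTextFiles` just to obtain the paths is more work than necessary. A minimal sketch of an alternative that lists the files with the standard library, assuming the data sits on the local filesystem (as the `file:` paths above suggest); `part_files` is a hypothetical helper:

```python
from pathlib import Path


def part_files(directory):
    """Return the sorted paths of the part-* files in an output directory.

    Hypothetical helper; marker files such as _SUCCESS do not match the pattern.
    """
    return sorted(str(p) for p in Path(directory).glob("part-*"))


paths = part_files("partition")
# Comma-separated list of paths, in the form textFile accepts.
joined = ",".join(paths)
print(joined)
```

`textFile` also accepts a directory or a glob pattern, so `spark.sparkContext.textFile("partition")` should load the same files without building the list by hand.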

In [20]:
rddDesdeArchivo = (
    spark.sparkContext.textFile(",".join(l), 10)
    .map(lambda line: line.split(","))
)

In [21]:
rddDesdeArchivo.take(7)

[['deporte_id', 'deporte'],
 ['1', 'Basketball'],
 ['2', 'Judo'],
 ['3', 'Football'],
 ['4', 'Tug-Of-War'],
 ['5', 'Speed Skating'],
 ['6', 'Cross Country Skiing']]
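
The header line `deporte_id,deporte` is loaded as an ordinary record, as the first row above shows. A sketch of one way to drop it and convert the id to an integer, shown on a plain Python list with a few rows copied from the output; the helper `parse_row` is hypothetical:

```python
rows = [
    ["deporte_id", "deporte"],
    ["1", "Basketball"],
    ["2", "Judo"],
    ["5", "Speed Skating"],
]

header = ["deporte_id", "deporte"]


def parse_row(row):
    """Convert an [id, name] pair of strings into an (int, str) tuple."""
    return int(row[0]), row[1]


# Keep every row except the header, then convert the types.
records = [parse_row(row) for row in rows if row != header]
print(records)
```

On the RDD, the equivalent would be `rddDesdeArchivo.filter(lambda row: row != header).map(parse_row)`.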

In [22]:
!pwd

/home/jovyan/work
