<a href="https://colab.research.google.com/github/AnIsAsPe/LDATopicModeling_pyspark/blob/main/Introducci%C3%B3n_a_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



![image.png](https://spark.apache.org/docs/1.1.1/img/cluster-overview.png)

# Installing PySpark in Colab

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=12bf45cb000afb5b7025b301aa4048496fc2abe46c5b4d9f87fbe2c090f503ce
  Stored in directory: /root/.cache/pip/wheels/40/1b/2c/30f43be2627857ab80062bef1527c0128f7b4070b6b2d02139
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


# Spark session

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Prueba_spark_colab").getOrCreate()
spark

# Spark Context

In [3]:
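# Create a SparkContext, the entry point for working with RDDs.
# Note: a SparkSession already exists at this point, so getOrCreate() returns
# that session's context (also available as spark.sparkContext) and the
# "local[4]" master requested here is not applied.
from pyspark import SparkConf
from pyspark import SparkContext
import numpy as np

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[4]"))

# "The master" is the machine connected to the other machines in the cluster;
# it manages how the data is split and transformed.
# "local[4]" asks Spark to run on this machine with 4 worker threads instead.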

# Resilient Distributed Datasets

## Creating an RDD by parallelizing a collection

In [34]:
!cat /proc/cpuinfo | grep processor | wc -l # number of processors on the Colab machine


2


In [35]:
!echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE) / (1024 * 1024))) # RAM in MB


12993


In [12]:
a_np = np.random.randint(0,100,20)
a_rdd = sc.parallelize(a_np, 4)

In [5]:
print(type(a_np))
print(type(a_rdd))

<class 'numpy.ndarray'>
<class 'pyspark.rdd.RDD'>


In [6]:
a_rdd.collect()  # returns the distributed elements to the driver as a list

[89, 50, 74, 69, 68, 33, 8, 71, 51, 93, 55, 93, 88, 33, 13, 71, 62, 73, 68, 97]

In [7]:
a_rdd.glom().collect()  # glom() groups the elements by partition, so we can see how the data was split

[[89, 50, 74, 69, 68, 33, 8, 71, 51, 93],
 [55, 93, 88, 33, 13, 71, 62, 73, 68, 97]]

In [11]:
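# stop the current context and start a new one with 3 local worker threads
sc.stop()
sc = SparkContext(master="local[3]")
a_rdd = sc.parallelize(a_np)  # no partition count given: the context default is used
a_rdd.glom().collect()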

[[89, 50, 74, 69, 68, 33],
 [8, 71, 51, 93, 55, 93],
 [88, 33, 13, 71, 62, 73, 68, 97]]
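
The uneven 6/6/8 split above can be reproduced with a rough plain-Python model. This is only a sketch under an assumption: that PySpark 3.1 first groups the elements into batches of size `max(1, min(n // num_slices, 1024))` and then hands whole batches to each partition using integer positions. The function name `model_partition_sizes` is hypothetical and the model may not hold for other versions or input types.

```python
def model_partition_sizes(n_elements, num_slices, max_batch=1024):
    """Rough model (an assumption, not library code) of partition sizes."""
    # Assumed batching step: elements are grouped into fixed-size batches.
    batch_size = max(1, min(n_elements // num_slices, max_batch))
    batches = [
        min(batch_size, n_elements - start)
        for start in range(0, n_elements, batch_size)
    ]
    n_batches = len(batches)

    # Assumed slicing step: whole batches are assigned by integer positions.
    sizes = []
    for i in range(num_slices):
        start = (i * n_batches) // num_slices
        end = ((i + 1) * n_batches) // num_slices
        sizes.append(sum(batches[start:end]))
    return sizes


sizes_two = model_partition_sizes(20, 2)    # the 2-partition case
sizes_three = model_partition_sizes(20, 3)  # the local[3] case
print(sizes_two, sizes_three)
```

Under this model, 20 elements in 3 partitions become four batches of 6, 6, 6 and 2 elements, and the last partition receives two of them.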

In [13]:
a_rdd.collect() 

[53, 0, 23, 38, 31, 32, 95, 1, 35, 34, 84, 8, 77, 84, 86, 7, 98, 73, 26, 56]

## Creating an RDD from external data

In [16]:
texto = sc.textFile("/content/drive/MyDrive/Datos/gabriel_garcia_marquez_cien_annos_soledad.txt",4)


In [17]:
type(texto)

pyspark.rdd.RDD

In [18]:
texto.map(lambda s: len(s)).reduce(lambda a, b: a + b) # total number of characters

814791
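
As a small plain-Python analogue of the count above (a sketch with a made-up two-line sample, not the novel's text): `textFile` yields one element per line without the line terminator, so the map/reduce total counts the characters in each line but not the newlines.

```python
from functools import reduce

# Hypothetical sample standing in for the contents of a text file.
sample = "first line\nsecond line\n"

# One element per line, without the trailing newline.
lines = sample.splitlines()

# Same shape as the RDD pipeline: map each line to its length, then add up.
char_count = reduce(lambda a, b: a + b, map(len, lines))

print(char_count, len(sample))
```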

# RDD operations
- Transformations (Map)
- Actions (Reduce)

### Transformations (Map)

In [19]:
# map: applies the function to every element and returns one result per element
sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()

[range(1, 3), range(1, 4), range(1, 5)]

In [20]:
# flatMap: works like map, but flattens the per-element results into a single list
sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect()

[3, 9, 4, 16, 5, 25]
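
The difference between the two can be sketched in plain Python, without Spark. This is only an analogue: a list comprehension stands in for `map` and `itertools.chain.from_iterable` stands in for the flattening step of `flatMap`.

```python
from itertools import chain

data = [3, 4, 5]

# map-like: one output per input, so the results stay nested.
mapped = [[x, x * x] for x in data]

# flatMap-like: the same per-element results, concatenated into one flat list.
flat_mapped = list(chain.from_iterable([x, x * x] for x in data))

print(mapped)
print(flat_mapped)
```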

In [21]:
# mapping with a regular (named) function
def square_if_odd(x):
    """
    If the number is odd, return its square; even numbers
    are returned unchanged
    """
    if x % 2 == 1:
        return x * x
    else:
        return x

numeros = sc.parallelize(np.arange(20))
numeros.map(square_if_odd).collect()

[0, 1, 2, 9, 4, 25, 6, 49, 8, 81, 10, 121, 12, 169, 14, 225, 16, 289, 18, 361]

### Actions (Reduce)

In [22]:
numbers = sc.parallelize([1, 4, 6, 2, 9, 10])
# named "total" to avoid shadowing Python's built-in sum()
total = numbers.reduce(lambda a, b: a + b)
total

32

In [23]:
numbers.count() # count the elements

6

In [24]:
numbers.first()

1

In [27]:
numbers.take(3)

[1, 4, 6]

In [28]:
# Find the maximum element with reduce
numbers.reduce(lambda x,y: x if x > y else y)

10
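
Spark's RDD programming guide asks that the function passed to `reduce` be commutative and associative, because each partition is reduced separately and the partial results are then combined. The plain-Python sketch below imitates that two-stage process; `partitioned_reduce` is a hypothetical helper and the partition boundaries are chosen by hand.

```python
from functools import reduce


def partitioned_reduce(partitions, func):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


values = [1, 4, 6, 2, 9, 10]
two_parts = [values[:3], values[3:]]
one_part = [values]

# Addition and max are associative and commutative: partitioning does not matter.
total = partitioned_reduce(two_parts, lambda a, b: a + b)
largest = partitioned_reduce(two_parts, lambda x, y: x if x > y else y)

# Subtraction is not associative: the answer changes with the partitioning.
sub_two_parts = partitioned_reduce(two_parts, lambda a, b: a - b)
sub_one_part = partitioned_reduce(one_part, lambda a, b: a - b)

print(total, largest, sub_two_parts, sub_one_part)
```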

In [29]:
# Filter: keep the non-negative numbers divisible by 3
# (filter is itself a transformation; collect is the action that returns the result)
numbers.filter(lambda x: x % 3 == 0 and x >= 0).collect()

[6, 9]

In [30]:

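# split on ", " so the words do not keep a leading space
words = 'MapReduce, GFS, Hadoop, HDFS, Spark'.split(', ')

wordRDD = sc.parallelize(words)

# keep the longer of the two words at each step
wordRDD.reduce(lambda w, v: w if len(w) > len(v) else v)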

'MapReduce'

In [31]:
# using a regular (named) function with reduce
def largerThan(x, y):
    """
    Return the longer of the two words (y when the lengths are equal)
    """
    if len(x) > len(y):
        return x
    else:
        return y

wordRDD.reduce(largerThan)

'MapReduce'

# Groupby

In [33]:
numeros.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [40]:
result=numeros.groupBy(lambda x:x%2).collect()
result

[(0, <pyspark.resultiterable.ResultIterable at 0x7f36cc541fd0>),
 (1, <pyspark.resultiterable.ResultIterable at 0x7f36cc565b10>)]

In [42]:
sorted([(x, sorted(y)) for (x, y) in result])

[(0, [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]),
 (1, [1, 3, 5, 7, 9, 11, 13, 15, 17, 19])]
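
What `groupBy` computes can be sketched in plain Python with a dictionary of lists. This is only an analogue of the result shown above, not how Spark implements it; `group_by` is a hypothetical helper.

```python
from collections import defaultdict


def group_by(items, key_func):
    """Collect items into lists keyed by key_func(item), sorted by key."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return sorted(groups.items())


# Same grouping as above: even numbers under key 0, odd numbers under key 1.
grouped = group_by(range(20), lambda x: x % 2)
print(grouped)
```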

## Lazy Evaluation
Lazy evaluation is the strategy Spark uses to speed up parallelized operations.
Transformations are only recorded as a sequence of steps; execution is delayed until it is strictly necessary, that is, until an action asks for a result.

(example at https://dzone.com/articles/the-benefits-amp-examples-of-using-apache-spark-wi)
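
The idea can be illustrated without Spark, using Python's own lazy `map` and `filter` as a stand-in for transformations and `list()` as a stand-in for an action. This is a sketch of the concept only; `traced_square` and the `calls` log are hypothetical names used to show when the work actually happens.

```python
calls = []  # records every element that is actually processed


def traced_square(x):
    calls.append(x)
    return x * x


# "Transformations": building the pipeline does not run traced_square.
squares = map(traced_square, range(5))
even_squares = filter(lambda v: v % 2 == 0, squares)
calls_before = list(calls)

# "Action": asking for the results forces the whole pipeline to run.
result = list(even_squares)
calls_after = list(calls)

print(calls_before, result, calls_after)
```

Nothing is logged until the final `list()` call, which mirrors how an RDD pipeline only runs once an action such as `collect()` or `reduce()` is invoked.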





Official Spark documentation:

* [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html#initializing-spark)

* [Cluster Mode Overview](https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html)





