<a href="https://colab.research.google.com/github/S3gam/EDEM-Data-Analytics/blob/main/00_Introduction_to_Apache_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites

Installing Spark

---



In [None]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!pip -q install findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()

Start the Spark session and print the version


---


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create the session. This is always the first step when working with Spark

spark = SparkSession \
        .builder \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.2.0'

Creating the tunnel<br>
**To check the Spark UI, open the URL printed by the cell below: https://######/jobs/ (or https://######/SQL/)**


In [None]:
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(4040)") + "jobs/")

https://hiiuly8awhq-496ff2e9c6d22116-4040-colab.googleusercontent.com/jobs/


# Download Datasets

In [None]:
# Download the datasets needed for the exercises

!mkdir -p /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/frankenstein.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/el_quijote.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/characters.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/planets.csv -P /dataset
!ls /dataset

characters.csv	el_quijote.txt	frankenstein.txt  planets.csv


# RDD

---



## Example 1

In [None]:
textFile = spark.sparkContext.textFile("/dataset/frankenstein.txt")  # Read the file into an RDD and keep a reference to it
textFile.first()  # Return the first line of the file

'FRANKENSTEIN'


Creating parallelized collections
This is a quick way to create an RDD:

## Example 2

The **parallelize** function converts a local collection into an RDD (it splits the data into partitions and distributes them across the nodes).


In [None]:
distData = spark.sparkContext.parallelize([25, 20, 15, 10, 5])
distData.reduce(lambda x, y: x + y)

75
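
As a rough mental model of what happens in Example 2, the sketch below uses plain Python (no Spark) to split the list into chunks, reduce each chunk, and then combine the partial results. The helpers `split_into_partitions` and `partitioned_reduce`, and the choice of two partitions, are illustrative assumptions rather than Spark's actual slicing logic. It also hints at why the function passed to `reduce` should be associative and commutative: subtraction gives a different answer once the data is partitioned.

```python
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks (a simplified stand-in for partitioning)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partitioned_reduce(data, func, num_partitions=2):
    """Reduce each chunk on its own, then reduce the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


data = [25, 20, 15, 10, 5]

# Addition is associative and commutative, so partitioning does not change the result.
total = partitioned_reduce(data, lambda x, y: x + y)
print(total)

# Subtraction is neither, so the partitioned result differs from a sequential reduce.
sequential_diff = reduce(lambda x, y: x - y, data)
partitioned_diff = partitioned_reduce(data, lambda x, y: x - y)
print(sequential_diff, partitioned_diff)
```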

## Exercise 1
Count the number of lines in the `el_quijote.txt` file

---



In [None]:
textFile2 = spark.sparkContext.textFile("/dataset/el_quijote.txt")  # Read the file into an RDD and keep a reference to it
textFile2.count()

2186

## Exercise 2
Print the first line of the file `el_quijote.txt`

---



In [None]:
textFile2.first()

'DON QUIJOTE DE LA MANCHA'

## Transformations and Actions in RDDs 

### Actions

### Example 3

In [None]:
print(textFile2.count()) # Number of elements in RDD
print(textFile2.first()) # First element in RDD

2186
DON QUIJOTE DE LA MANCHA
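
The distinction between actions and transformations matters because transformations are lazy: Spark only records them, and nothing is computed until an action such as `count()` or `first()` runs. Python generators behave in a similar way, so the plain-Python sketch below can serve as an analogy. No Spark is involved, and `traced_upper` and the sample lines are made up for illustration.

```python
calls = []  # records every element that actually gets processed


def traced_upper(line):
    """Upper-case a line and remember that it was processed."""
    calls.append(line)
    return line.upper()


sample_lines = ["don quijote", "de la", "mancha"]  # made-up sample data

# "Transformation": building the generator does no work yet.
mapped = (traced_upper(line) for line in sample_lines)
processed_before = len(calls)

# "Action": asking for the first element triggers only as much work as needed.
first_line = next(mapped)
processed_after = len(calls)

print(processed_before, processed_after, first_line)
```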


### Transformations

### Example 4

In [None]:
# reduceByKey
lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")  # Read frankenstein.txt
pairs = lines.map(lambda s: (s, 1))  # Build (line, 1) pairs: each line is the key, with a count of 1
counts = pairs.reduceByKey(lambda a, b: a + b).cache()  # Add up the counts per distinct line and cache the result
counts.count()  # Number of distinct lines
counts.collect()
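
For intuition, `reduceByKey` merges all values that share a key using the supplied function. The plain-Python sketch below mimics that behaviour on a handful of made-up lines. The helper `reduce_by_key` is hypothetical and ignores partitioning and shuffling, which the real operation has to handle.

```python
def reduce_by_key(pairs, func):
    """Merge the values of identical keys with func (single-machine model of reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


sample_lines = ["chapter", "", "letter", "", "", "chapter"]  # made-up sample, not the real file

# Same shape as Example 4: every line becomes a (line, 1) pair.
pairs = [(line, 1) for line in sample_lines]
line_counts = reduce_by_key(pairs, lambda a, b: a + b)
print(line_counts)
```

Blank lines all share the empty-string key, so they are counted together under a single entry.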

In [None]:
# sortByKey
sorted_counts = counts.sortByKey()  # Not named `sorted`, which would shadow the Python built-in
sorted_counts.collect()

### Example 5

In [None]:
# filter

linesWithThe = textFile.filter(lambda line: "the" in line)  # Keep only the lines that contain the substring "the"
linesWithThe.count()  # Number of lines that contain "the" (not the number of occurrences of the word)

3712
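
Note that `"the" in line` is a substring test, so a line containing only `there` or `other` also matches, while a line that only has a capitalised `The` does not. If whole-word matching is what you need, a regular expression with word boundaries is one option. The sketch below is plain Python on made-up lines. The predicate `contains_word_the` is a hypothetical helper that could be passed to `filter` in place of the lambda, and ignoring case is an extra assumption on top of the original example.

```python
import re

# \b marks a word boundary, so "the" must appear as a standalone word.
WORD_THE = re.compile(r"\bthe\b", re.IGNORECASE)


def contains_word_the(line):
    """Return True only if the line contains 'the' as a whole word (any case)."""
    return WORD_THE.search(line) is not None


sample_lines = [  # made-up sample data
    "the monster fled",
    "there was no other way",
    "The night was cold",
    "nothing here",
]

substring_matches = [line for line in sample_lines if "the" in line]
word_matches = [line for line in sample_lines if contains_word_the(line)]

print(substring_matches)
print(word_matches)
```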

### Exercise 3
Get the word count for the file `frankenstein.txt`

---

In [None]:
# countByValue: Alvaro's example (count how many times each word appears)

lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")
contarPalabras = lines.flatMap(lambda linea: linea.split(" ")).countByValue()

for palabra, contador in contarPalabras.items():
    print("{} : {}".format(palabra, contador))





In [None]:
# reduceByKey: Luis's example (count how many times each word appears)

lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")
pairs = lines.flatMap(lambda a: a.split(" ")).map(lambda a: (a, 1))
counts = pairs.reduceByKey(lambda a, b: a + b).cache()
counts.collect()




In [None]:
# reduceByKey: the instructor's example (count how many times each word appears)


lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.collect()
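
All three solutions split on a single space, so empty strings (from consecutive spaces or blank lines) count as words, and `The`, `the` and `the,` end up as different keys. If that matters, one possible refinement is to normalise each line before counting. The `tokenize` helper below is a hypothetical sketch in plain Python, tried on a made-up line; it could be used as `lines.flatMap(tokenize)` in place of the `split(" ")` lambda.

```python
import re
from collections import Counter

# One or more ASCII letters, optionally followed by an apostrophe and more letters.
# Assumption: English text only; Spanish text such as el_quijote.txt would need a wider character class.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(line):
    """Lower-case a line and return its words without punctuation or empty strings."""
    return WORD_PATTERN.findall(line.lower())


sample_line = "The monster,  the creature; the DAEMON!"  # made-up sample

naive_counts = Counter(sample_line.split(" "))
clean_counts = Counter(tokenize(sample_line))

print(naive_counts)
print(clean_counts)
```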


### Exercise 4
Get the top 10 most frequent words with more than 4 characters

---



In [None]:
# Instructor's example (keep words longer than 4 characters, count them, and show the 10 most frequent)

lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")

lines.flatMap(lambda line: line.split(" ")) \
    .filter(lambda word: len(word) > 4) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda pair: (pair[1], pair[0])) \
    .sortByKey(False) \
    .take(10)
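
Swapping each pair and sorting the whole RDD works, but only ten results are needed. The plain-Python sketch below models the same idea with `heapq.nlargest` on made-up counts; `top_n` is a hypothetical helper. In PySpark, calling `takeOrdered(10, key=lambda pair: -pair[1])` on the `(word, count)` RDD is a comparable shortcut that avoids the swap.

```python
import heapq

# Made-up (word, count) pairs standing in for the output of reduceByKey.
word_counts = [
    ("which", 5), ("their", 3), ("would", 7), ("could", 2),
    ("should", 4), ("these", 1), ("there", 6),
]


def top_n(pairs, n):
    """Return the n pairs with the highest counts, largest first."""
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


top_three = top_n(word_counts, 3)
print(top_three)
```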

## Key/Value Pair RDD

---



### Example 6


---



In [None]:
charac_sw = spark.sparkContext.textFile("/dataset/characters.csv")
planets_sw = spark.sparkContext.textFile("/dataset/planets.csv")
charac_sw.take(10)

In [None]:
planets_sw.take(10)

In [None]:
from itertools import islice

charac_sw_noheader = charac_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)

planets_sw_noheader = planets_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)
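
The lambda relies on the header being the first line of the first partition (index 0). To see the mechanics without Spark, the sketch below applies the same logic to two hand-made partitions. The rows are invented for illustration and do not reflect the real columns of `characters.csv`; `drop_header` is simply the lambda above given a name.

```python
from itertools import islice


def drop_header(idx, it):
    """Skip the first element of partition 0; pass other partitions through unchanged."""
    return islice(it, 1, None) if idx == 0 else it


# Made-up partitions: the header sits at the start of partition 0.
partitions = [
    ["name;planet", "Alice;Earth"],
    ["Bob;Mars", "Carol;Venus"],
]

rows_without_header = [
    row
    for idx, part in enumerate(partitions)
    for row in drop_header(idx, iter(part))
]
print(rows_without_header)
```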

### Exercise 5
Get a list of the population of the planet each Star Wars character belongs to

---
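
One way to approach this is to turn both RDDs into key/value pairs keyed by planet name and combine them with `join`, which pairs up the values that share a key. The plain-Python sketch below only models what such a join returns. The rows, the `;` delimiter and the column positions are invented assumptions, so check them against the output of `take(10)` above before adapting it.

```python
def inner_join(left_pairs, right_pairs):
    """Model of a pair-RDD inner join: emit (key, (left, right)) for every matching key."""
    right_by_key = {}
    for key, value in right_pairs:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left_pairs
        for right_value in right_by_key.get(key, [])
    ]


# Invented rows with a hypothetical layout: "character;planet" and "planet;population".
character_rows = ["Alice;Earth", "Bob;Mars", "Carol;Nowhere"]
planet_rows = ["Earth;1000", "Mars;50"]

# Key both datasets by planet name.
characters_by_planet = [(row.split(";")[1], row.split(";")[0]) for row in character_rows]
population_by_planet = [(row.split(";")[0], row.split(";")[1]) for row in planet_rows]

joined = inner_join(characters_by_planet, population_by_planet)
print(joined)
```

Characters whose planet has no match (here the invented `Carol`) drop out of an inner join, which is worth keeping in mind when comparing result counts.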
