# Apache Spark Koalas

### 1. Introduction to Koalas

Koalas is an implementation of the pandas API on top of Apache Spark. It offers the advantages of Spark DataFrames for working on a cluster while keeping pandas syntax.

#### Benefits of Koalas
Koalas lets data scientists who already know pandas start working in Spark-based big data environments almost immediately, with a much shorter and gentler learning curve.

It also provides a single base library for datasets of any size, instead of using pandas for small datasets and PySpark for large ones.

#### Koalas usage examples
This notebook covers the main functions of Koalas, taken from the official documentation at https://koalas.readthedocs.io/

In [1]:
import sys
sys.executable

'/usr/bin/python3'

In [2]:
!pip install koalas

Collecting koalas
  Downloading koalas-0.32.0-py3-none-any.whl (593 kB)
Installing collected packages: koalas
Successfully installed koalas-0.32.0


In [3]:
# Install spark-related dependencies
!wget -q  https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz

!pip install -q findspark
!pip install pyspark
# Set up required environment variables

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

import findspark
findspark.init()

import numpy as np
import pandas as pd
import pyspark
import databricks.koalas as ks
from pyspark.sql import SparkSession

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=025e18bd04ff2f3abc0e8daa7df9a214070c00fcbc9deb002750802ae3f0815d
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


ImportError: cannot import name 'Iterable' from 'collections' (/usr/lib/python3.10/collections/__init__.py)
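
The `ImportError` above appears when `databricks.koalas` 0.32.0 is imported under Python 3.10, where `Iterable` is only available from `collections.abc`. Since Spark 3.2 the Koalas API also ships inside PySpark as `pyspark.pandas`. The following is a minimal sketch of an import fallback, assuming at most one of the two packages is usable in the environment; the helper name `load_pandas_on_spark` is hypothetical and not part of either library.

```python
import importlib


def load_pandas_on_spark():
    """Return the first importable pandas-on-Spark module, or None.

    Hypothetical helper: it is not part of Koalas or PySpark.
    """
    # pyspark.pandas is the Koalas API bundled with PySpark 3.2+;
    # databricks.koalas is the original standalone package.
    for name in ("pyspark.pandas", "databricks.koalas"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


ks = load_pandas_on_spark()
print("pandas-on-Spark module:", getattr(ks, "__name__", None))
```

If `ks` comes back as `None`, neither package could be imported and the cells below will not run.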

### 2. Object creation

Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [None]:
s = ks.Series([1, 3, 5, np.nan, 6, 8])

In [None]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a Koalas DataFrame by passing a dict of objects that can be converted to Series:

In [None]:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

In [None]:
kdf

| | a | b | c |
|---|---|---|---|
| 10 | 1 | 100 | one |
| 20 | 2 | 200 | two |
| 30 | 3 | 300 | three |
| 40 | 4 | 400 | four |
| 50 | 5 | 500 | five |
| 60 | 6 | 600 | six |


Creating a pandas DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [None]:
dates = pd.date_range('20130101', periods=6)

In [None]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [None]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [None]:
pdf

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.296152 | -0.215817 | -0.391997 | 0.118745 |
| 2013-01-02 | 0.044379 | 0.349173 | -0.188457 | 0.066455 |
| 2013-01-03 | 0.744525 | 0.26566 | -0.251066 | 0.67719 |
| 2013-01-04 | -0.489116 | 1.277547 | 0.304424 | -2.593397 |
| 2013-01-05 | 0.116915 | 1.12017 | -0.365417 | 0.465095 |
| 2013-01-06 | -0.474115 | 1.429882 | -0.86758 | -1.035799 |


This pandas DataFrame can now be converted to a Koalas DataFrame:

In [None]:
kdf = ks.from_pandas(pdf)

In [None]:
type(kdf)

databricks.koalas.frame.DataFrame

However, it looks and behaves the same as a pandas DataFrame:

In [None]:
kdf

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.296152 | -0.215817 | -0.391997 | 0.118745 |
| 2013-01-02 | 0.044379 | 0.349173 | -0.188457 | 0.066455 |
| 2013-01-03 | 0.744525 | 0.26566 | -0.251066 | 0.67719 |
| 2013-01-04 | -0.489116 | 1.277547 | 0.304424 | -2.593397 |
| 2013-01-05 | 0.116915 | 1.12017 | -0.365417 | 0.465095 |
| 2013-01-06 | -0.474115 | 1.429882 | -0.86758 | -1.035799 |


It is also possible to create a **Koalas DataFrame from a Spark DataFrame**.

Creating a Spark DataFrame from a pandas DataFrame:

In [None]:
spark = SparkSession.builder.getOrCreate()

In [None]:
sdf = spark.createDataFrame(pdf)

In [None]:
sdf.show()

+--------------------+--------------------+--------------------+-------------------+
|                   A|                   B|                   C|                  D|
+--------------------+--------------------+--------------------+-------------------+
| 0.29615231058572056|-0.21581669345913104| -0.3919973355730924|0.11874484580111531|
|0.044379013962365974| 0.34917316342486104|-0.18845664139431073|0.06645488760730403|
|  0.7445246714017707|  0.2656600343550899| -0.2510658115353583| 0.6771898706194357|
|-0.48911595381271716|   1.277546934890121|  0.3044237066145135| -2.593396542357387|
| 0.11691548345955091|  1.1201699234737166| -0.3654170537258906| 0.4650945694820092|
|-0.47411468152149955|   1.429882011214831| -0.8675798849779087|-1.0357987764791312|
+--------------------+--------------------+--------------------+-------------------+



Creating a Koalas DataFrame from a Spark DataFrame.
`to_koalas()` is automatically attached to Spark DataFrames and becomes available as an API once Koalas is imported.

In [None]:
kdf = sdf.to_koalas()

In [None]:
kdf

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.296152 | -0.215817 | -0.391997 | 0.118745 |
| 1 | 0.044379 | 0.349173 | -0.188457 | 0.066455 |
| 2 | 0.744525 | 0.26566 | -0.251066 | 0.67719 |
| 3 | -0.489116 | 1.277547 | 0.304424 | -2.593397 |
| 4 | 0.116915 | 1.12017 | -0.365417 | 0.465095 |
| 5 | -0.474115 | 1.429882 | -0.86758 | -1.035799 |
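
Note that the date index of `pdf` is absent from `sdf`, so the Koalas DataFrame built from it gets a default integer index. One way to keep the dates, sketched here in plain pandas with made-up values, is to move the index into a regular column before handing the frame to Spark. The column name `index` is what pandas assigns to an unnamed index. The `index_col` argument in the final comment is assumed from the Koalas `to_koalas` signature and should be checked against the installed version.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
pdf = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=list("AB"))

# reset_index() turns the unnamed DatetimeIndex into a column called "index",
# so the dates are carried along as ordinary data.
flat = pdf.reset_index()
print(flat)

# With a Spark session and Koalas available, the column could then be
# restored as the index (not run here; assumption, see the text above):
#   kdf = spark.createDataFrame(flat).to_koalas(index_col="index")
```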


It has specific `dtypes`. Currently, only the types common to both Spark and pandas are supported.

In [None]:
kdf.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

### 3. Data manipulation


Unlike pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark just takes the requested number of rows from a partition. **Do not rely on `head()` to return specific rows**; use `.loc` or `.iloc` instead.

In [None]:
kdf.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.296152 | -0.215817 | -0.391997 | 0.118745 |
| 1 | 0.044379 | 0.349173 | -0.188457 | 0.066455 |
| 2 | 0.744525 | 0.26566 | -0.251066 | 0.67719 |
| 3 | -0.489116 | 1.277547 | 0.304424 | -2.593397 |
| 4 | 0.116915 | 1.12017 | -0.365417 | 0.465095 |
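
Label-based and position-based selection can be sketched in plain pandas, whose syntax Koalas mirrors. The values below are made up for illustration and are not the notebook's data. Koalas may support only a subset of the `.iloc` patterns that pandas accepts, so treat this as a sketch and check the documentation for the installed version.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=[10, 20, 30, 40],
)

# .loc selects by index label: the rows labeled 20 through 30 (both inclusive).
by_label = df.loc[20:30]

# .iloc selects by integer position: the first two rows, whatever their labels.
by_position = df.iloc[:2]

print(by_label)
print(by_position)
```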


Display the index, the columns, and the underlying NumPy data.

You can also retrieve the index; the index column can be assigned to a DataFrame (see later).

In [None]:
kdf.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [None]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [None]:
kdf.to_numpy()

array([[ 0.29615231, -0.21581669, -0.39199734,  0.11874485],
       [ 0.04437901,  0.34917316, -0.18845664,  0.06645489],
       [ 0.74452467,  0.26566003, -0.25106581,  0.67718987],
       [-0.48911595,  1.27754693,  0.30442371, -2.59339654],
       [ 0.11691548,  1.12016992, -0.36541705,  0.46509457],
       [-0.47411468,  1.42988201, -0.86757988, -1.03579878]])

**Describe** shows a quick statistical summary of your data:

In [None]:
kdf.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.03979 | 0.704436 | -0.293349 | -0.383619 |
| std | 0.471632 | 0.662294 | 0.378098 | 1.233614 |
| min | -0.489116 | -0.215817 | -0.86758 | -2.593397 |
| 25% | -0.474115 | 0.26566 | -0.391997 | -1.035799 |
| 50% | 0.044379 | 0.349173 | -0.365417 | 0.066455 |
| 75% | 0.296152 | 1.277547 | -0.188457 | 0.465095 |
| max | 0.744525 | 1.429882 | 0.304424 | 0.67719 |
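
As a cross-check, the `mean` and `std` rows for column `A` can be reproduced with Python's standard library from the six rounded values displayed earlier. The median is included for comparison: averaging the two middle values, as pandas would, gives a result that differs from the `50%` row. This suggests that the percentiles reported here are actual data points, presumably because Spark computes them approximately, an assumption worth checking against the Koalas documentation.

```python
import statistics

# Column A, copied from the displayed (rounded) output.
a = [0.296152, 0.044379, 0.744525, -0.489116, 0.116915, -0.474115]

mean_a = statistics.mean(a)      # arithmetic mean
std_a = statistics.stdev(a)      # sample standard deviation (n - 1 denominator)
median_a = statistics.median(a)  # average of the two middle values

print(round(mean_a, 6), round(std_a, 6), round(median_a, 6))
```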


Transposing your data

In [None]:
kdf.T

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| A | 0.296152 | 0.044379 | 0.744525 | -0.489116 | 0.116915 | -0.474115 |
| B | -0.215817 | 0.349173 | 0.26566 | 1.277547 | 1.12017 | 1.429882 |
| C | -0.391997 | -0.188457 | -0.251066 | 0.304424 | -0.365417 | -0.86758 |
| D | 0.118745 | 0.066455 | 0.67719 | -2.593397 | 0.465095 | -1.035799 |


Sorting by index

In [None]:
kdf.sort_index(ascending=False)

| | A | B | C | D |
|---|---|---|---|---|
| 5 | -0.474115 | 1.429882 | -0.86758 | -1.035799 |
| 4 | 0.116915 | 1.12017 | -0.365417 | 0.465095 |
| 3 | -0.489116 | 1.277547 | 0.304424 | -2.593397 |
| 2 | 0.744525 | 0.26566 | -0.251066 | 0.67719 |
| 1 | 0.044379 | 0.349173 | -0.188457 | 0.066455 |
| 0 | 0.296152 | -0.215817 | -0.391997 | 0.118745 |


Sorting by value

In [None]:
kdf.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.296152 | -0.215817 | -0.391997 | 0.118745 |
| 2 | 0.744525 | 0.26566 | -0.251066 | 0.67719 |
| 1 | 0.044379 | 0.349173 | -0.188457 | 0.066455 |
| 4 | 0.116915 | 1.12017 | -0.365417 | 0.465095 |
| 3 | -0.489116 | 1.277547 | 0.304424 | -2.593397 |
| 5 | -0.474115 | 1.429882 | -0.86758 | -1.035799 |


#### To learn more about this library, see the following link:
* https://github.com/databricks/koalas