# Pandas API - PySpark

Let's now use the pandas API on Spark from PySpark and compare it with pandas.

We import what we need.

In [4]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession



## Object creation

A Series from a list of values

In [5]:
s = ps.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

From a dictionary

In [6]:
psdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])
psdf

,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


From a NumPy array

In [7]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
pdf

,A,B,C,D
2013-01-01,0.100382,-0.510363,-1.101851,0.244373
2013-01-02,0.694996,0.411423,-0.015967,-1.501402
2013-01-03,-1.25245,-1.04113,0.374565,0.067906
2013-01-04,-0.27945,1.077472,0.419467,-0.758035
2013-01-05,-1.967315,1.230969,-0.695647,-2.227611
2013-01-06,-0.923597,-0.10947,-0.138978,0.698801


Note that everything above was done just as we would do it directly in pandas. In fact, the last DataFrame was created with pandas alone, and the result looks very similar to the previous ones.

Now let's convert that DataFrame to a pandas-on-Spark DataFrame.

In [9]:
psdf = ps.from_pandas(pdf)
psdf

,A,B,C,D
2013-01-01,0.100382,-0.510363,-1.101851,0.244373
2013-01-02,0.694996,0.411423,-0.015967,-1.501402
2013-01-03,-1.25245,-1.04113,0.374565,0.067906
2013-01-04,-0.27945,1.077472,0.419467,-0.758035
2013-01-05,-1.967315,1.230969,-0.695647,-2.227611
2013-01-06,-0.923597,-0.10947,-0.138978,0.698801


The result is practically the same, but the object is now of class `pyspark.pandas.frame.DataFrame`.

In [10]:
type(psdf)

pyspark.pandas.frame.DataFrame

Now let's create a native Spark DataFrame from the pandas DataFrame `pdf`.

Remember that a Spark session must always be started first.

In [11]:
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf.show()

+-------------------+--------------------+--------------------+-------------------+
|                  A|                   B|                   C|                  D|
+-------------------+--------------------+--------------------+-------------------+
| 0.1003819642605223| -0.5103629101313774| -1.1018505475297005|0.24437300440695403|
| 0.6949962094071847| 0.41142326685207614|-0.01596663751324...| -1.501402173363727|
|-1.2524496691744738| -1.0411301932258201|   0.374564834401734|0.06790591990672723|
|-0.2794498851972563|  1.0774722119305427|  0.4194669380864002|-0.7580348824516717|
|-1.9673152826832414|  1.2309685669282733| -0.6956468684266478|-2.2276112631432157|
|-0.9235970872598038|-0.10946985855981894|-0.13897827910275792| 0.6988007260975541|
+-------------------+--------------------+--------------------+-------------------+



Note that the output is now in Spark's native format.

To switch to a friendlier, pandas-like format, we can use the `.pandas_api()` method.

In [12]:
psdf = sdf.pandas_api()
psdf

,A,B,C,D
0,0.100382,-0.510363,-1.101851,0.244373
1,0.694996,0.411423,-0.015967,-1.501402
2,-1.25245,-1.04113,0.374565,0.067906
3,-0.27945,1.077472,0.419467,-0.758035
4,-1.967315,1.230969,-0.695647,-2.227611
5,-0.923597,-0.10947,-0.138978,0.698801


We can apply most of the operations we would normally apply to a pandas DataFrame. Here are some examples.

In [13]:
psdf.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

In [14]:
psdf.head(3)

,A,B,C,D
0,0.100382,-0.510363,-1.101851,0.244373
1,0.694996,0.411423,-0.015967,-1.501402
2,-1.25245,-1.04113,0.374565,0.067906


In [15]:
psdf.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [16]:
psdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [17]:
psdf.to_numpy()



array([[ 0.10038196, -0.51036291, -1.10185055,  0.244373  ],
       [ 0.69499621,  0.41142327, -0.01596664, -1.50140217],
       [-1.25244967, -1.04113019,  0.37456483,  0.06790592],
       [-0.27944989,  1.07747221,  0.41946694, -0.75803488],
       [-1.96731528,  1.23096857, -0.69564687, -2.22761126],
       [-0.92359709, -0.10946986, -0.13897828,  0.69880073]])

In [18]:
psdf.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.604572,0.176484,-0.193068,-0.579328
std,0.966451,0.896106,0.601692,1.125167
min,-1.967315,-1.04113,-1.101851,-2.227611
25%,-1.25245,-0.510363,-0.695647,-1.501402
50%,-0.923597,-0.10947,-0.138978,-0.758035
75%,0.100382,1.077472,0.374565,0.244373
max,0.694996,1.230969,0.419467,0.698801


In [19]:
psdf.sort_values(by='B')

,A,B,C,D
2,-1.25245,-1.04113,0.374565,0.067906
0,0.100382,-0.510363,-1.101851,0.244373
5,-0.923597,-0.10947,-0.138978,0.698801
1,0.694996,0.411423,-0.015967,-1.501402
3,-0.27945,1.077472,0.419467,-0.758035
4,-1.967315,1.230969,-0.695647,-2.227611


In general, we can apply most of the pandas functions we already know.

## Missing data

The pandas API on Spark primarily uses the value `np.nan` to represent missing data.

Let's look at an example.

In [20]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])
pdf1.loc[dates[0]:dates[1], 'E'] = 1
psdf1 = ps.from_pandas(pdf1)
psdf1

,A,B,C,D,E
2013-01-01,0.100382,-0.510363,-1.101851,0.244373,1.0
2013-01-02,0.694996,0.411423,-0.015967,-1.501402,1.0
2013-01-03,-1.25245,-1.04113,0.374565,0.067906,
2013-01-04,-0.27945,1.077472,0.419467,-0.758035,
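
Before dropping or filling anything, it can help to count the missing values per column. The following is a minimal sketch in plain pandas, with values rounded from the table above. It assumes that `isna().sum()` behaves the same way on the pandas-on-Spark frame `psdf1`.

```python
import numpy as np
import pandas as pd

# Two columns rounded from the table above; E is missing in the last two rows.
df = pd.DataFrame({"A": [0.1, 0.7, -1.3, -0.3],
                   "E": [1.0, 1.0, np.nan, np.nan]})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```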


To drop rows with missing data, we do the same as we would in pandas.

In [21]:
psdf1.dropna(how='any')

,A,B,C,D,E
2013-01-01,0.100382,-0.510363,-1.101851,0.244373,1.0
2013-01-02,0.694996,0.411423,-0.015967,-1.501402,1.0


In [22]:
psdf1

,A,B,C,D,E
2013-01-01,0.100382,-0.510363,-1.101851,0.244373,1.0
2013-01-02,0.694996,0.411423,-0.015967,-1.501402,1.0
2013-01-03,-1.25245,-1.04113,0.374565,0.067906,
2013-01-04,-0.27945,1.077472,0.419467,-0.758035,


As in pandas, unless we pass `inplace=True`, the call returns a copy of the object with the modification applied and leaves `psdf1` unchanged.
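
The copy-versus-original behavior can be checked with a small sketch. It uses plain pandas and a made-up frame `df`, on the assumption that the pandas-on-Spark object behaves the same way, as the `psdf1` output above suggests. Reassigning the result is one way to keep the cleaned data.

```python
import numpy as np
import pandas as pd

# Made-up frame: only the first row has no missing values.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [1.0, np.nan, np.nan]})

cleaned = df.dropna(how="any")   # returns a new DataFrame
print(len(cleaned), len(df))     # the original still has all 3 rows

df = df.dropna(how="any")        # reassign to keep the result
print(len(df))
```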

We can also fill the missing values with some other value.

In [23]:
psdf1.fillna(value=5)

,A,B,C,D,E
2013-01-01,0.100382,-0.510363,-1.101851,0.244373,1.0
2013-01-02,0.694996,0.411423,-0.015967,-1.501402,1.0
2013-01-03,-1.25245,-1.04113,0.374565,0.067906,5.0
2013-01-04,-0.27945,1.077472,0.419467,-0.758035,5.0
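
A single scalar is not always the right replacement for every column. As a hedged sketch in plain pandas, a dictionary can give each column its own fill value. The frame `df` and the chosen fill values are made up for illustration, and the dictionary form is assumed to work the same way on a pandas-on-Spark DataFrame, which this notebook does not exercise.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column.
df = pd.DataFrame({"D": [0.5, np.nan], "E": [1.0, np.nan]})

# Fill D with a constant and E with its own mean (mean() skips NaN).
filled = df.fillna({"D": 0.0, "E": df["E"].mean()})
print(filled)
```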


## Grouping

Let's use a new DataFrame.

In [24]:
psdf = ps.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})
psdf

,A,B,C,D
0,foo,one,0.19752,-0.055502
1,bar,one,-0.422662,-0.107727
2,foo,two,1.808825,-1.229166
3,bar,three,0.013953,0.343982
4,foo,two,0.439945,-0.687115
5,bar,two,-0.871588,0.215851
6,foo,one,-0.386648,-0.364695
7,foo,three,-1.297976,1.263343


We can group and apply functions just as we would in pandas.

In [25]:
psdf.groupby('A').sum()

A,C,D
foo,0.761666,-1.073135
bar,-1.280297,0.452106


In [26]:
psdf.groupby(['A', 'B']).sum()

A,B,C,D
foo,one,-0.189128,-0.420197
bar,one,-0.422662,-0.107727
foo,two,2.24877,-1.916281
bar,three,0.013953,0.343982
bar,two,-0.871588,0.215851
foo,three,-1.297976,1.263343
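
The groups above are not sorted by key (`foo` comes before `bar`), which is worth keeping in mind because plain pandas sorts group keys by default. Appending `.sort_index()` is one way to get a stable ordering, and `agg` allows a different function per column. The sketch below uses plain pandas with made-up values so that the result is easy to verify by hand. The same calls are assumed to work on a pandas-on-Spark DataFrame.

```python
import pandas as pd

# Made-up values chosen so the sums and means are easy to check.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Sum of C and mean of D per group, ordered by the group key.
summary = df.groupby("A").agg({"C": "sum", "D": "mean"}).sort_index()
print(summary)
```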


## Plotting

As with pandas, we can make plots from the pandas API on Spark.

In [27]:
pser = pd.Series(np.random.randn(1000),
                 index=pd.date_range('1/1/2000', periods=1000))
psser = ps.Series(pser)
psser = psser.cummax()

In [28]:
psser.plot()

In [29]:
pdf = pd.DataFrame(np.random.randn(1000, 4), index=pser.index,
                   columns=['A', 'B', 'C', 'D'])
psdf = ps.from_pandas(pdf)
psdf = psdf.cummax()

In [30]:
psdf.plot()



