## Permutation and random sampling

You can permute (randomly reorder) a Series or the rows of a DataFrame with the function numpy.random.permutation. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [1]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame(np.arange(6 * 8).reshape((6, 8)))
df

    0   1   2   3   4   5   6   7
0   0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
2  16  17  18  19  20  21  22  23
3  24  25  26  27  28  29  30  31
4  32  33  34  35  36  37  38  39
5  40  41  42  43  44  45  46  47


- np.arange(6 * 8) creates a NumPy array of consecutive integers from 0 to 47 (6 * 8 - 1).
- .reshape((6, 8)) rearranges that array into a two-dimensional array with 6 rows and 8 columns, filling it row by row.
- pd.DataFrame(...) builds a pandas DataFrame from that two-dimensional array, so each array row becomes a DataFrame row and each array column becomes a DataFrame column.

In short, the code creates a DataFrame with 6 rows and 8 columns whose values are the consecutive integers from 0 to 47.

In [14]:
sampler = np.random.permutation(6)
sampler

array([1, 3, 0, 4, 5, 2], dtype=int32)

This array can then be used in iloc-based indexing or with the equivalent take() function:

In [15]:
df.take(sampler)
# Reorders the rows of df according to the integer positions in sampler.

    0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
3  24  25  26  27  28  29  30  31
0   0   1   2   3   4   5   6   7
4  32  33  34  35  36  37  38  39
5  40  41  42  43  44  45  46  47
2  16  17  18  19  20  21  22  23


In [16]:
df.iloc[sampler]

    0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
3  24  25  26  27  28  29  30  31
0   0   1   2   3   4   5   6   7
4  32  33  34  35  36  37  38  39
5  40  41  42  43  44  45  46  47
2  16  17  18  19  20  21  22  23
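As a quick check of the equivalence described above, the following sketch rebuilds the same 6 × 8 frame and compares take() with iloc indexing. The seeded generator np.random.default_rng(0) and the names rng, taken and indexed are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6 * 8).reshape((6, 8)))

# Seeded generator (an assumption, used here only for reproducibility).
rng = np.random.default_rng(0)
sampler = rng.permutation(len(df))  # each row position 0..5 exactly once

taken = df.take(sampler)
indexed = df.iloc[sampler]

# Both calls select rows by integer position, so the two results should match.
print(taken.equals(indexed))
# A permutation contains every position exactly once.
print(sorted(sampler.tolist()) == list(range(len(df))))
```

Using len(df) instead of a literal 6 keeps the permutation in step with the frame if its number of rows changes.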


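By calling take() with axis="columns", we can also select a permutation of the columns. Note that the example below calls np.random.permutation(6), so only the first six of the eight column positions are shuffled and columns 6 and 7 do not appear in the result; pass the number of columns (8 here) to permute all of them: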

In [23]:
column_sampler = np.random.permutation(6)
column_sampler

array([3, 4, 2, 1, 0, 5], dtype=int32)

In [24]:
df.take(column_sampler, axis="columns")

    3   4   2   1   0   5
0   3   4   2   1   0   5
1  11  12  10   9   8  13
2  19  20  18  17  16  21
3  27  28  26  25  24  29
4  35  36  34  33  32  37
5  43  44  42  41  40  45
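The output above keeps only six of the eight columns because the permutation was built from 6 positions. A minimal sketch of permuting every column, assuming the same 6 × 8 frame, takes the length from df.shape[1]; the name shuffled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6 * 8).reshape((6, 8)))

# df.shape[1] is the number of columns (8 here), so no column is dropped.
column_sampler = np.random.permutation(df.shape[1])
shuffled = df.take(column_sampler, axis="columns")

print(shuffled.shape)
# The shuffled frame has the same column labels, only in a different order.
print(sorted(shuffled.columns.tolist()) == df.columns.tolist())
```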


If you need to select a random subset without replacement (the same row cannot appear twice), you can use the sample() method on Series and DataFrame:

In [25]:
df

    0   1   2   3   4   5   6   7
0   0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
2  16  17  18  19  20  21  22  23
3  24  25  26  27  28  29  30  31
4  32  33  34  35  36  37  38  39
5  40  41  42  43  44  45  46  47


In [30]:
df.sample(n=4)

    0   1   2   3   4   5   6   7
2  16  17  18  19  20  21  22  23
4  32  33  34  35  36  37  38  39
1   8   9  10  11  12  13  14  15
0   0   1   2   3   4   5   6   7
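Because sample() draws different rows on every call, results like the one above are not reproducible as written. The sketch below, assuming the same frame, uses the random_state parameter of sample() to fix the seed and also shows what happens when more rows are requested than exist. The names first, second and error_message are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6 * 8).reshape((6, 8)))

# random_state fixes the seed, so repeated calls return the same rows.
first = df.sample(n=4, random_state=42)
second = df.sample(n=4, random_state=42)
print(first.equals(second))

# Without replacement, no row label can appear twice.
print(first.index.is_unique)

# Asking for more rows than the frame holds fails unless replace=True is passed.
error_message = None
try:
    df.sample(n=10)
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```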


To generate a sample with replacement (allowing repeated choices), pass replace=True to sample():

In [31]:
choices = pd.Series([-4, 1, 9, 6, -8])
choices

0   -4
1    1
2    9
3    6
4   -8
dtype: int64

In [33]:
choices.sample(n=8, replace=True)

4   -8
3    6
2    9
0   -4
0   -4
3    6
1    1
0   -4
dtype: int64
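The repeated index labels in the output above are the visible sign of sampling with replacement. A small sketch, assuming the same choices Series and an arbitrary seed, checks this programmatically; the name draws is illustrative.

```python
import pandas as pd

choices = pd.Series([-4, 1, 9, 6, -8])

# Drawing 8 values from a Series of 5 is only possible with replacement.
draws = choices.sample(n=8, replace=True, random_state=0)

print(len(draws))
# With more draws than distinct labels, at least one label must repeat.
print(draws.index.has_duplicates)
# Every drawn value comes from the original Series.
print(draws.isin(choices).all())
```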

## Computing indicator/dummy variables

Another type of transformation frequently used in statistical modeling and machine learning applications is converting a categorical variable into a dummy or indicator matrix, in other words encoding categories as numbers. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing only 1s and 0s. pandas has a pandas.get_dummies() function for doing this, although you could also devise one yourself. Consider an example DataFrame:

In [34]:
df = pd.DataFrame({"key": ["A", "A", "C", "B", "D", "B","F"],
                   "data1": range(7)})
df

  key  data1
0   A      0
1   A      1
2   C      2
3   B      3
4   D      4
5   B      5
6   F      6


In [35]:
pd.get_dummies(df["key"], dtype=float)

     A    B    C    D    F
0  1.0  0.0  0.0  0.0  0.0
1  1.0  0.0  0.0  0.0  0.0
2  0.0  0.0  1.0  0.0  0.0
3  0.0  1.0  0.0  0.0  0.0
4  0.0  0.0  0.0  1.0  0.0
5  0.0  1.0  0.0  0.0  0.0
6  0.0  0.0  0.0  0.0  1.0


Here dtype=float was passed to change the output type from boolean (the default in more recent versions of pandas) to floating point. Without it, the result holds True/False values:

In [36]:
df_1 = pd.get_dummies(df["key"])
df_1

       A      B      C      D      F
0   True  False  False  False  False
1   True  False  False  False  False
2  False  False   True  False  False
3  False   True  False  False  False
4  False  False  False   True  False
5  False   True  False  False  False
6  False  False  False  False   True
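The text mentions that you could devise such a function yourself. One possible sketch, not the way pandas implements it, compares the column against each distinct value in turn. The helper manual_dummies is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "A", "C", "B", "D", "B", "F"],
                   "data1": range(7)})


def manual_dummies(values: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: one boolean column per distinct value."""
    categories = sorted(values.unique())
    # Comparing the Series with a category yields a boolean indicator column.
    return pd.DataFrame({cat: values == cat for cat in categories},
                        index=values.index)


by_hand = manual_dummies(df["key"])
print(by_hand)
```

Each row of the result has exactly one True, in the column that matches that row's category.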


In some cases, you may want to add a prefix to the columns of the indicator DataFrame, which can then be merged with the other data. pandas.get_dummies has a prefix argument for doing this:

In [37]:
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)
dummies

   key_A  key_B  key_C  key_D  key_F
0    1.0    0.0    0.0    0.0    0.0
1    1.0    0.0    0.0    0.0    0.0
2    0.0    0.0    1.0    0.0    0.0
3    0.0    1.0    0.0    0.0    0.0
4    0.0    0.0    0.0    1.0    0.0
5    0.0    1.0    0.0    0.0    0.0
6    0.0    0.0    0.0    0.0    1.0


In [38]:
df_with_dummy = df[["data1"]].join(dummies)  # .join is covered in detail later
df_with_dummy

   data1  key_A  key_B  key_C  key_D  key_F
0      0    1.0    0.0    0.0    0.0    0.0
1      1    1.0    0.0    0.0    0.0    0.0
2      2    0.0    0.0    1.0    0.0    0.0
3      3    0.0    1.0    0.0    0.0    0.0
4      4    0.0    0.0    0.0    1.0    0.0
5      5    0.0    1.0    0.0    0.0    0.0
6      6    0.0    0.0    0.0    0.0    1.0
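As an alternative to building the dummies and joining them back by hand, get_dummies also accepts a whole DataFrame together with a columns argument naming the columns to encode. The sketch below, assuming the same df, compares both routes; the names encoded and joined are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "A", "C", "B", "D", "B", "F"],
                   "data1": range(7)})

# Encode only the "key" column; "data1" is passed through unchanged.
encoded = pd.get_dummies(df, columns=["key"], prefix="key", dtype=float)

# The two-step route: build the dummies separately, then join them.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)
joined = df[["data1"]].join(dummies)

print(encoded.columns.tolist() == joined.columns.tolist())
print((encoded.to_numpy() == joined.to_numpy()).all())
```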


Another example, this time encoding two categorical columns at once:

In [44]:
df_2 = pd.DataFrame({
    "key": ["A", "A", "C", "B", "D", "B", "F"],
    "data1": range(7),
    "frutas": ["Pera", "Mango", "Manzana", "Mango", "Fresa", "Mango", "Naranja"]
})

print(df_2)

  key  data1   frutas
0   A      0     Pera
1   A      1    Mango
2   C      2  Manzana
3   B      3    Mango
4   D      4    Fresa
5   B      5    Mango
6   F      6  Naranja


In [46]:
dummies_1 = pd.get_dummies(df_2[["key", "frutas"]], prefix=["key", "frutas"], dtype=float)
dummies_1

   key_A  key_B  key_C  key_D  key_F  frutas_Fresa  frutas_Mango  frutas_Manzana  frutas_Naranja  frutas_Pera
0    1.0    0.0    0.0    0.0    0.0           0.0           0.0             0.0             0.0          1.0
1    1.0    0.0    0.0    0.0    0.0           0.0           1.0             0.0             0.0          0.0
2    0.0    0.0    1.0    0.0    0.0           0.0           0.0             1.0             0.0          0.0
3    0.0    1.0    0.0    0.0    0.0           0.0           1.0             0.0             0.0          0.0
4    0.0    0.0    0.0    1.0    0.0           1.0           0.0             0.0             0.0          0.0
5    0.0    1.0    0.0    0.0    0.0           0.0           1.0             0.0             0.0          0.0
6    0.0    0.0    0.0    0.0    1.0           0.0           0.0             0.0             1.0          0.0



If a row in a DataFrame belongs to multiple categories, we have to use a different approach to create the dummy variables. Let's look at the MovieLens dataset: