# The DataFrame class

A `DataFrame` is a collection of `Series` that share the same index, which yields a table-like structure able to hold all kinds of information.

Each `Series` in the `DataFrame` can be thought of as a column, and each column can be given a name:

In [1]:
import pandas as pd
import numpy as np

array = np.random.uniform(-10, 10, size=[4,4])
array

array([[-3.9214417 ,  6.11023565,  5.36193942, -8.82113798],
       [-9.22863879,  7.57846801,  8.53763999, -9.70114605],
       [ 4.6685973 ,  8.60845296, -6.26455643,  4.41078475],
       [ 6.27816749,  0.85211818,  4.88392651,  4.89501855]])

In [2]:
# Rendering in Jupyter
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,-3.921442,6.110236,5.361939,-8.821138
B,-9.228639,7.578468,8.53764,-9.701146
C,4.668597,8.608453,-6.264556,4.410785
D,6.278167,0.852118,4.883927,4.895019


In [3]:
# Printed representation
print(df)

          W         X         Y         Z
A -3.921442  6.110236  5.361939 -8.821138
B -9.228639  7.578468  8.537640 -9.701146
C  4.668597  8.608453 -6.264556  4.410785
D  6.278167  0.852118  4.883927  4.895019


In [5]:
# Type of a df
type(df)

pandas.core.frame.DataFrame
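
To make the "collection of `Series` sharing an index" idea concrete, here is a minimal sketch that builds a `DataFrame` from two `Series`. The names `w`, `x` and `df_from_series`, and the data itself, are illustrative assumptions and do not come from the cells above.

```python
import pandas as pd

# Two Series with the same index labels (illustrative data)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# The dict keys become the column names; rows are aligned on the shared index
df_from_series = pd.DataFrame({'W': w, 'X': x})

print(df_from_series)
print(list(df_from_series.columns))
print(list(df_from_series.index))
```

Each column taken back out of `df_from_series` is again a `Series` carrying the same index labels.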

## Working with DataFrames

We can retrieve a column by its name:

In [6]:
df

Unnamed: 0,W,X,Y,Z
A,9.860165,7.278812,0.144769,7.45376
B,-4.912559,-1.876585,4.148209,-8.73223
C,9.133373,9.713773,4.145827,4.627366
D,0.141455,8.270157,2.787122,4.180101


In [7]:
df['X']

A    7.278812
B   -1.876585
C    9.713773
D    8.270157
Name: X, dtype: float64

In [10]:
df.X

A    7.278812
B   -1.876585
C    9.713773
D    8.270157
Name: X, dtype: float64

As we can see, a column is actually a `Series`:

In [11]:
type(df['X'])

pandas.core.series.Series

We can also retrieve several columns by passing a list of names:

In [13]:
df[ ['Y','Z'] ]

Unnamed: 0,Y,Z
A,0.144769,7.45376
B,4.148209,-8.73223
C,4.145827,4.627366
D,2.787122,4.180101
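
The difference between indexing with a single name and with a list of names can be sketched as follows. The frame `df_demo` and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Small illustrative frame (not the one used in the cells above)
df_demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'Y': [5, 6]}, index=['A', 'B'])

single = df_demo['X']            # a single label returns a Series
wrapped = df_demo[['X']]         # a one-element list returns a DataFrame
reordered = df_demo[['Y', 'W']]  # columns come back in the requested order

print(type(single).__name__)
print(type(wrapped).__name__)
print(list(reordered.columns))
```

Attribute access such as `df.X` only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute, so the bracket form is the safer default.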


### Adding a column

Assigning to a column name that does not exist yet creates the column:

In [67]:
df['TOTAL'] = df['W'] + df['X'] + df['Y'] + df['Z']

In [52]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835

In [53]:
df['add1'], df['add2'] = [1, 2, 3, 4], [5, 6, 7, 8]
df

Unnamed: 0,W,X,Y,Z,TOTAL,add1,add2
A,9.860165,7.278812,0.144769,7.45376,24.737506,1,5
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164,2,6
C,9.133373,9.713773,4.145827,4.627366,27.620338,3,7
D,0.141455,8.270157,2.787122,4.180101,15.378835,4,8


In [71]:
aux_df = pd.DataFrame(data=[ [1, 2], [3, 4], [5, 6], [7, 8] ], 
                      index=['A', 'B', 'C', 'D'], columns=['add1', 'add2'])
aux_df

Unnamed: 0,add1,add2
A,1,2
B,3,4
C,5,6
D,7,8


In [59]:
# Remove the two columns added above
df.drop(['add1', 'add2'], axis=1, inplace=True)
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [17]:
df['extra'] = [1, 2, 3, 4]
df

Unnamed: 0,W,X,Y,Z,TOTAL,extra
A,9.860165,7.278812,0.144769,7.45376,24.737506,1
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164,2
C,9.133373,9.713773,4.145827,4.627366,27.620338,3
D,0.141455,8.270157,2.787122,4.180101,15.378835,4
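
Columns can also come from another `DataFrame` or from a `Series`, in which case pandas aligns the rows by index label rather than by position. The sketch below is one possible way to do this with `DataFrame.join`; `df_demo` and `aux_demo` are hypothetical stand-ins for the frames above.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, 2.0, 3.0, 4.0]}, index=['A', 'B', 'C', 'D'])
aux_demo = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                        index=['A', 'B', 'C', 'D'], columns=['add1', 'add2'])

# join matches rows on the index labels and returns a new DataFrame
joined = df_demo.join(aux_demo)

# Assigning a Series aligns by label too: 'A' receives 10 even though it is listed last
df_demo['extra'] = pd.Series([40, 30, 20, 10], index=['D', 'C', 'B', 'A'])

print(joined)
print(df_demo)
```

Assigning a plain list, as in the cells above, is positional instead, so the list length must match the number of rows.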


### Deleting a column

In [30]:
new_df = df.drop('TOTAL', axis=1, inplace=False)

In [31]:
new_df

Unnamed: 0,W,X,Y,Z
A,9.860165,7.278812,0.144769,7.45376
B,-4.912559,-1.876585,4.148209,-8.73223
C,9.133373,9.713773,4.145827,4.627366
D,0.141455,8.270157,2.787122,4.180101


In [33]:
# The original df is not modified
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [34]:
# Unless we ask for it explicitly
df.drop('TOTAL', axis=1, inplace=True)

df

Unnamed: 0,W,X,Y,Z
A,9.860165,7.278812,0.144769,7.45376
B,-4.912559,-1.876585,4.148209,-8.73223
C,9.133373,9.713773,4.145827,4.627366
D,0.141455,8.270157,2.787122,4.180101


In [38]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [40]:
df.drop('TOTAL', axis=1)

Unnamed: 0,W,X,Y,Z
A,9.860165,7.278812,0.144769,7.45376
B,-4.912559,-1.876585,4.148209,-8.73223
C,9.133373,9.713773,4.145827,4.627366
D,0.141455,8.270157,2.787122,4.180101


In [41]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835
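
As an alternative to `axis=1`, `drop` accepts a `columns=` keyword, and its `errors='ignore'` option skips labels that do not exist instead of raising a `KeyError`. A minimal sketch, with hypothetical names (`original`, `trimmed`, `safe`) and data:

```python
import pandas as pd

original = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'TOTAL': [4, 6]}, index=['A', 'B'])

# columns= is equivalent to passing the labels together with axis=1
trimmed = original.drop(columns=['TOTAL'])

# drop returns a new object, so the original still has the column
print(list(original.columns))
print(list(trimmed.columns))

# A label that is not present would raise KeyError unless errors='ignore' is given
safe = original.drop(columns=['missing'], errors='ignore')
print(list(safe.columns))
```

Rebinding the name (`original = original.drop(columns=['TOTAL'])`) is a common alternative to `inplace=True`.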


### Deleting a row

In [42]:
df.drop('C', axis=0)

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [43]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835
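
Rows can likewise be dropped with the `index=` keyword, and several labels can be removed at once by passing a list. The frame below is an illustrative assumption, not the one from the cells above.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])

# index= is equivalent to passing the labels together with axis=0
without_c = df_demo.drop(index='C')
without_a_d = df_demo.drop(index=['A', 'D'])

print(list(without_c.index))
print(list(without_a_d.index))
print(list(df_demo.index))  # the original keeps all four rows
```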


### Selecting rows

In [46]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [49]:
df.loc['C']

W         9.133373
X         9.713773
Y         4.145827
Z         4.627366
TOTAL    27.620338
Name: C, dtype: float64

In [48]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


We can also select a row by its integer position:

In [51]:
df.iloc[2]

W         9.133373
X         9.713773
Y         4.145827
Z         4.627366
TOTAL    27.620338
Name: C, dtype: float64
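
The contrast between label-based `loc` and position-based `iloc` shows up most clearly with slices: a `loc` slice includes both end labels, while an `iloc` slice excludes the end position, as in ordinary Python slicing. A small sketch with hypothetical data:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [10, 20, 30, 40], 'X': [1, 2, 3, 4]},
                       index=['A', 'B', 'C', 'D'])

by_label = df_demo.loc['B':'C']   # label slice: both 'B' and 'C' are included
by_position = df_demo.iloc[1:3]   # position slice: positions 1 and 2, end excluded
last_row = df_demo.iloc[-1]       # negative positions count from the end

print(list(by_label.index))
print(list(by_position.index))
print(last_row.name)
```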

### Selecting a subset

In [74]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [77]:
# Row C and column Z
df.loc['C','Z']

4.627365739552305

In [79]:
type(df.loc['C','Z'])

numpy.float64

In [80]:
df.loc[ ['C'],['Z'] ]

Unnamed: 0,Z
C,4.627366


In [81]:
type(df.loc[['C'],['Z']])

pandas.core.frame.DataFrame

In [82]:
# Rows A, B and columns W, Y
df.loc[ ['A','B'],['W','Y'] ]

Unnamed: 0,W,Y
A,9.860165,0.144769
B,-4.912559,4.148209
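
Subsets can also be expressed with slices on both axes, and a single cell can be read with the scalar accessor `at`. The following is a minimal sketch on a hypothetical frame built from `np.arange`, so the values are predictable:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(16).reshape(4, 4),
                       index=['A', 'B', 'C', 'D'], columns=['W', 'X', 'Y', 'Z'])

value = df_demo.at['C', 'Z']               # scalar lookup by row and column label
block = df_demo.loc['A':'B', 'W':'Y']      # label slices on rows and columns (ends included)
same_block = df_demo.iloc[0:2, 0:3]        # the same block addressed by position

print(value)
print(block)
```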


## Conditional selection

One of the most useful features of a `DataFrame` is conditional (boolean) selection:

In [83]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [84]:
# Boolean mask: which values are > 0
df>0

Unnamed: 0,W,X,Y,Z,TOTAL
A,True,True,True,True,True
B,False,False,True,False,False
C,True,True,True,True,True
D,True,True,True,True,True


In [85]:
# Values that are > 0 (the rest become NaN)
df[ df>0 ]

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,,,4.148209,,
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835
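
Indexing with a boolean frame keeps the shape and fills the non-matching cells with `NaN`. When a different fill value is wanted, `DataFrame.where` accepts a replacement, and summing the mask counts the matches per column. A sketch on hypothetical data:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.5, -2.0], 'X': [-0.5, 3.0]}, index=['A', 'B'])

masked = df_demo[df_demo > 0]                # NaN where the condition is False
replaced = df_demo.where(df_demo > 0, 0.0)   # 0.0 where the condition is False
positives_per_column = (df_demo > 0).sum()   # True counts as 1

print(masked)
print(replaced)
print(positives_per_column)
```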


In [86]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [87]:
df['W']>0

A     True
B    False
C     True
D     True
Name: W, dtype: bool

In [88]:
# Rows where W>0
df[  df['W']>0  ]

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [92]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [104]:
# Columns Y and X of the rows where W>0
df[df['W']>0] [['Y', 'X']]

Unnamed: 0,Y,X
A,0.144769,7.278812
C,4.145827,9.713773
D,2.787122,8.270157


We can combine conditions with the element-wise operators `|` (or) and `&` (and), wrapping each condition in parentheses:

In [105]:
# Rows where X>0 or Z<0
df[(df['X']>0) | (df['Z'] < 0)]

Unnamed: 0,W,X,Y,Z,TOTAL
A,9.860165,7.278812,0.144769,7.45376,24.737506
B,-4.912559,-1.876585,4.148209,-8.73223,-11.373164
C,9.133373,9.713773,4.145827,4.627366,27.620338
D,0.141455,8.270157,2.787122,4.180101,15.378835


In [107]:
# Columns W and Y of the rows where X>0 or Z<0
df[(df['X']>0) | (df['Z'] < 0)] [['W','Y']]

Unnamed: 0,W,Y
A,9.860165,0.144769
B,-4.912559,4.148209
C,9.133373,4.145827
D,0.141455,2.787122
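
The combinations above can also be written with `&` and with the negation operator `~`, and `loc` lets the row mask and the column list go in a single step. The sketch below uses a hypothetical frame with rounded values so the results are easy to follow:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [9.0, -4.0, 9.0, 0.1],
                        'X': [7.0, -1.0, 9.0, 8.0],
                        'Z': [7.0, -8.0, 4.0, 4.0]},
                       index=['A', 'B', 'C', 'D'])

# Both conditions must hold: only row A has X>0 and Z>5
both = df_demo[(df_demo['X'] > 0) & (df_demo['Z'] > 5)]

# Row mask and column selection in one loc call
subset = df_demo.loc[(df_demo['X'] > 0) | (df_demo['Z'] < 0), ['W', 'Z']]

# ~ negates a mask: rows where X is not > 0
negated = df_demo[~(df_demo['X'] > 0)]

print(both)
print(subset)
print(negated)
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python, and the keywords `and` and `or` raise a `ValueError` when applied to a `Series`.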


## Modifying the index

In [118]:
# Create the DataFrame again
array = np.random.uniform(-10, 10, size=[4,4])
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])

In [119]:
df

Unnamed: 0,W,X,Y,Z
A,7.654097,-2.390534,-3.172367,4.238012
B,-6.336643,-6.489957,-6.888587,-6.799413
C,6.310137,-6.35972,-6.53854,1.760575
D,-8.041991,1.102392,-4.960883,-8.527815


In [120]:
# Add a new column holding the labels we want to use as the index
df['Códigos'] = ['AA','BB','CC','DD']

df

Unnamed: 0,W,X,Y,Z,Códigos
A,7.654097,-2.390534,-3.172367,4.238012,AA
B,-6.336643,-6.489957,-6.888587,-6.799413,BB
C,6.310137,-6.35972,-6.53854,1.760575,CC
D,-8.041991,1.102392,-4.960883,-8.527815,DD


In [111]:
# Replace the row index with that column
df.set_index('Códigos')

Códigos,W,X,Y,Z
AA,2.580647,-8.838819,-0.507634,-1.576856
BB,8.072826,-3.838686,1.174599,-6.542102
CC,4.017447,-2.89429,-6.4028,6.324158
DD,-9.932894,-1.29493,6.109224,5.337968


In [112]:
# The change is not saved by default
df

Unnamed: 0,W,X,Y,Z,Códigos
A,2.580647,-8.838819,-0.507634,-1.576856,AA
B,8.072826,-3.838686,1.174599,-6.542102,BB
C,4.017447,-2.89429,-6.4028,6.324158,CC
D,-9.932894,-1.29493,6.109224,5.337968,DD


In [122]:
# Unless we ask for it explicitly
df.set_index('Códigos', inplace=True)

df

Códigos,W,X,Y,Z
AA,7.654097,-2.390534,-3.172367,4.238012
BB,-6.336643,-6.489957,-6.888587,-6.799413
CC,6.310137,-6.35972,-6.53854,1.760575
DD,-8.041991,1.102392,-4.960883,-8.527815


In [114]:
print(df)

                W         X         Y         Z
Códigos                                        
AA       2.580647 -8.838819 -0.507634 -1.576856
BB       8.072826 -3.838686  1.174599 -6.542102
CC       4.017447 -2.894290 -6.402800  6.324158
DD      -9.932894 -1.294930  6.109224  5.337968


In [115]:
# Look up a row using the new index
df.loc['AA']

W    2.580647
X   -8.838819
Y   -0.507634
Z   -1.576856
Name: AA, dtype: float64
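
Because `set_index` returns a new `DataFrame`, assigning the result to a name is a common alternative to `inplace=True`. The sketch below also shows the `drop=False` option, which keeps the column in place as well as using it for the index. The frame and the column name `code` are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'code': ['AA', 'BB']}, index=['A', 'B'])

indexed = df_demo.set_index('code')           # 'code' moves from the columns to the index
kept = df_demo.set_index('code', drop=False)  # 'code' is both the index and a column

print(indexed.index.name)
print(list(indexed.columns))
print(list(kept.columns))
print(indexed.loc['BB', 'W'])
```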

### Default index

In [123]:
df

Códigos,W,X,Y,Z
AA,7.654097,-2.390534,-3.172367,4.238012
BB,-6.336643,-6.489957,-6.888587,-6.799413
CC,6.310137,-6.35972,-6.53854,1.760575
DD,-8.041991,1.102392,-4.960883,-8.527815


In [125]:
# Reset the index and explicitly discard the previous one
df.reset_index(drop=True, inplace=True)

df

Unnamed: 0,W,X,Y,Z
0,7.654097,-2.390534,-3.172367,4.238012
1,-6.336643,-6.489957,-6.888587,-6.799413
2,6.310137,-6.35972,-6.53854,1.760575
3,-8.041991,1.102392,-4.960883,-8.527815
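
Without `drop=True`, `reset_index` does not discard the old index but moves it back into the frame as a regular column. A minimal sketch of both behaviours, using a hypothetical frame whose index is named `code`:

```python
import pandas as pd

indexed = pd.DataFrame({'W': [1, 2]}, index=pd.Index(['AA', 'BB'], name='code'))

restored = indexed.reset_index()           # old index becomes the 'code' column
dropped = indexed.reset_index(drop=True)   # old index is discarded

print(list(restored.columns))
print(list(dropped.columns))
print(list(restored.index))
```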


This is only the tip of the iceberg; for more information on the `DataFrame` class, see the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).