# The DataFrame class

A `DataFrame` is a collection of `Series` that share the same index, which yields a table-like structure able to hold all kinds of information.

Each `Series` in the `DataFrame` can be regarded as a column, to which we can assign a name:

In [2]:
import pandas as pd
import numpy as np

array = np.random.uniform(-10, 10, size=[4,4])
array

array([[-4.55247054,  6.77223991, -8.4658286 , -3.97604152],
       [ 3.9003549 ,  0.14120416,  1.12049427,  9.85254541],
       [ 8.52473717, -8.65497523, -9.91272797, -8.77016859],
       [-0.3462199 ,  8.66681536,  8.54750545, -4.02034797]])

In [3]:
# Rendering in Jupyter
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,-4.552471,6.77224,-8.465829,-3.976042
B,3.900355,0.141204,1.120494,9.852545
C,8.524737,-8.654975,-9.912728,-8.770169
D,-0.34622,8.666815,8.547505,-4.020348


In [4]:
# Plain-text rendering with print
print(df)

          W         X         Y         Z
A -4.552471  6.772240 -8.465829 -3.976042
B  3.900355  0.141204  1.120494  9.852545
C  8.524737 -8.654975 -9.912728 -8.770169
D -0.346220  8.666815  8.547505 -4.020348


In [5]:
# Type of a DataFrame
type(df)

pandas.core.frame.DataFrame

## Working with DataFrames

We can access a column by its name:

In [8]:
df['X']

A    6.772240
B    0.141204
C   -8.654975
D    8.666815
Name: X, dtype: float64

In [9]:
df.X

A    6.772240
B    0.141204
C   -8.654975
D    8.666815
Name: X, dtype: float64

As we can see, a column is actually a `Series`:

In [10]:
type(df['X'])

pandas.core.series.Series
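Bracket access (`df['X']`) and attribute access (`df.X`) return the same `Series` for simple names, but attribute access only works when the name is a valid Python identifier that does not collide with an existing `DataFrame` attribute. A minimal sketch, using a hypothetical frame whose column names were chosen to show the difference:

```python
import pandas as pd

# Hypothetical frame: one column name contains a space and another
# collides with the DataFrame.count method.
sample = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4], "X": [7, 8]})

# Bracket access works for any column name.
print(sample["unit price"])

# Attribute access finds the method first, so this is not the column.
print(callable(sample.count))

# The column is still reachable with brackets.
print(sample["count"].tolist())

# For simple names both forms return equal Series.
print(sample.X.equals(sample["X"]))
```

For that reason bracket access is the safer default in reusable code.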

We can also access several columns by passing a list of names:

In [11]:
df[ ['W','Z'] ]

Unnamed: 0,W,Z
A,-4.552471,-3.976042
B,3.900355,9.852545
C,8.524737,-8.770169
D,-0.34622,-4.020348


### Adding a column

In [13]:
df

Unnamed: 0,W,X,Y,Z
A,-4.552471,6.77224,-8.465829,-3.976042
B,3.900355,0.141204,1.120494,9.852545
C,8.524737,-8.654975,-9.912728,-8.770169
D,-0.34622,8.666815,8.547505,-4.020348


In [14]:
df['TOTAL'] = df['W'] + df['X'] + df['Y'] + df['Z']

In [15]:
df

Unnamed: 0,W,X,Y,Z,TOTAL
A,-4.552471,6.77224,-8.465829,-3.976042,-10.222101
B,3.900355,0.141204,1.120494,9.852545,15.014599
C,8.524737,-8.654975,-9.912728,-8.770169,-18.813135
D,-0.34622,8.666815,8.547505,-4.020348,12.847753
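Adding the columns one by one becomes tedious as the table grows. As an alternative sketch (the frame and its values are illustrative assumptions), `sum(axis=1)` computes the same row totals for any list of columns:

```python
import pandas as pd

values = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=["A", "B"],
    columns=["W", "X", "Y", "Z"],
)

# axis=1 adds across the columns, producing one total per row.
values["TOTAL"] = values[["W", "X", "Y", "Z"]].sum(axis=1)
print(values)
```

Selecting the columns explicitly keeps the total from including other columns that may be added later.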


In [16]:
df['add1'], df['add2'] = [1, 2, 3, 4], [5, 6, 7, 8]
df

Unnamed: 0,W,X,Y,Z,TOTAL,add1,add2
A,-4.552471,6.77224,-8.465829,-3.976042,-10.222101,1,5
B,3.900355,0.141204,1.120494,9.852545,15.014599,2,6
C,8.524737,-8.654975,-9.912728,-8.770169,-18.813135,3,7
D,-0.34622,8.666815,8.547505,-4.020348,12.847753,4,8


In [17]:
aux_df = pd.DataFrame(data=[ [1, 2], [3, 4], [5, 6], [7, 8] ], 
                      index=['A', 'B', 'C', 'D'], columns=['add1', 'add2'])
aux_df

Unnamed: 0,add1,add2
A,1,2
B,3,4
C,5,6
D,7,8


In [18]:
df

Unnamed: 0,W,X,Y,Z,TOTAL,add1,add2
A,-4.552471,6.77224,-8.465829,-3.976042,-10.222101,1,5
B,3.900355,0.141204,1.120494,9.852545,15.014599,2,6
C,8.524737,-8.654975,-9.912728,-8.770169,-18.813135,3,7
D,-0.34622,8.666815,8.547505,-4.020348,12.847753,4,8


In [19]:
df.drop(['add1'], axis=1, inplace=True)

In [20]:
df

Unnamed: 0,W,X,Y,Z,TOTAL,add2
A,-4.552471,6.77224,-8.465829,-3.976042,-10.222101,5
B,3.900355,0.141204,1.120494,9.852545,15.014599,6
C,8.524737,-8.654975,-9.912728,-8.770169,-18.813135,7
D,-0.34622,8.666815,8.547505,-4.020348,12.847753,8


In [21]:
df['extra'] = [1, 2, 3, 4]
df

Unnamed: 0,W,X,Y,Z,TOTAL,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,-10.222101,5,1
B,3.900355,0.141204,1.120494,9.852545,15.014599,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,-18.813135,7,3
D,-0.34622,8.666815,8.547505,-4.020348,12.847753,8,4


### Deleting a column

In [22]:
new_df = df.drop('TOTAL', axis=1, inplace=False)

In [23]:
new_df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [24]:
# The original df is not modified
df

Unnamed: 0,W,X,Y,Z,TOTAL,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,-10.222101,5,1
B,3.900355,0.141204,1.120494,9.852545,15.014599,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,-18.813135,7,3
D,-0.34622,8.666815,8.547505,-4.020348,12.847753,8,4


In [25]:
# Unless we request it explicitly with inplace=True
df.drop('TOTAL', axis=1, inplace=True)

df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [27]:
df.drop('TOTAL', axis=1)

KeyError: "['TOTAL'] not found in axis"
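The `KeyError` appears because the column had already been removed. When a missing label should simply be skipped, `drop` accepts `errors="ignore"`; the `columns=` keyword is an equivalent, more explicit spelling of `axis=1`. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2], "TOTAL": [3, 4]}, index=["A", "B"])

# columns="TOTAL" is equivalent to drop("TOTAL", axis=1).
without_total = frame.drop(columns="TOTAL")

# errors="ignore" skips labels that are not present instead of raising KeyError.
still_fine = without_total.drop(columns="TOTAL", errors="ignore")
print(list(still_fine.columns))
```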

In [28]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


### Deleting a row

In [29]:
df.drop('C', axis=0)

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [30]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4
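As with columns, `drop` returns a new `DataFrame` and leaves the original untouched, so the result has to be assigned to keep it. The sketch below, on a hypothetical single-column frame, uses the `index=` keyword (equivalent to `axis=0`) to remove several rows at once:

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2, 3, 4]}, index=["A", "B", "C", "D"])

# index=[...] is equivalent to drop([...], axis=0).
trimmed = frame.drop(index=["B", "C"])

print(list(trimmed.index))
print(len(frame))  # the original still has all of its rows
```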


In [31]:
df['add2']

A    5
B    6
C    7
D    8
Name: add2, dtype: int64

### Selecting rows

In [32]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [33]:
df.loc['C']

W        8.524737
X       -8.654975
Y       -9.912728
Z       -8.770169
add2     7.000000
extra    3.000000
Name: C, dtype: float64

In [34]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


We can also select a row by its integer position with `iloc`:

In [35]:
df.iloc[2]

W        8.524737
X       -8.654975
Y       -9.912728
Z       -8.770169
add2     7.000000
extra    3.000000
Name: C, dtype: float64
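The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the index labels are themselves integers in a different order. The frame below is a hypothetical example built for that purpose:

```python
import pandas as pd

# Integer labels that do not match the row positions.
frame = pd.DataFrame({"W": [10, 20, 30]}, index=[2, 0, 1])

by_label = frame.loc[0, "W"]     # row whose label is 0 (the second row)
by_position = frame.iloc[0, 0]   # first row and first column by position

print(by_label, by_position)
```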

### Selecting a subset

In [36]:
# Row C and column Z
df.loc['C','Z']

-8.77016858881463

In [37]:
type(df.loc['C','Z'])

numpy.float64

In [38]:
df.loc[ ['C'], ['Z'] ]

Unnamed: 0,Z
C,-8.770169


In [39]:
type(df.loc[['C'],['Z']])

pandas.core.frame.DataFrame

In [40]:
# Rows A, B and columns W, Y
df.loc[ ['A','B'],['W','Y'] ]

Unnamed: 0,W,Y
A,-4.552471,-8.465829
B,3.900355,1.120494
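Besides lists of labels, `loc` also accepts slices. A detail worth remembering is that label slices include both endpoints, whereas positional slices with `iloc` exclude the stop position, as in plain Python. A small sketch with an illustrative frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=list("ABCD"),
    columns=list("WXYZ"),
)

# Label slices include both ends: rows A, B, C and columns X, Y.
by_label = frame.loc["A":"C", "X":"Y"]

# Positional slices exclude the stop: rows 0, 1 and columns 1, 2.
by_position = frame.iloc[0:2, 1:3]

print(by_label.shape, by_position.shape)
```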


## Conditional selection

One of the most useful features of `DataFrame` objects is conditional (boolean) selection:

In [41]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [42]:
# Which entries are > 0
df > 0

Unnamed: 0,W,X,Y,Z,add2,extra
A,False,True,False,False,True,True
B,True,True,True,True,True,True
C,True,False,False,False,True,True
D,False,True,True,False,True,True


In [43]:
# Values of the entries > 0 (the rest become NaN)
df[ df>0 ]

Unnamed: 0,W,X,Y,Z,add2,extra
A,,6.77224,,,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,,,,7,3
D,,8.666815,8.547505,,8,4
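Indexing with a boolean `DataFrame` keeps the shape of the table and replaces the entries that fail the condition with `NaN`. If another replacement value is preferred, `where` takes it as its second argument. A minimal sketch with illustrative values:

```python
import pandas as pd

frame = pd.DataFrame({"W": [-1.0, 2.0], "X": [3.0, -4.0]}, index=["A", "B"])

# Entries that fail the condition become NaN.
masked = frame[frame > 0]

# where() keeps entries that satisfy the condition and substitutes the rest.
filled = frame.where(frame > 0, 0.0)

# Summing the boolean mask counts the matching entries per column.
positives_per_column = (frame > 0).sum()

print(masked)
print(filled)
print(positives_per_column)
```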


In [44]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [45]:
df['W']>0

A    False
B     True
C     True
D    False
Name: W, dtype: bool

In [46]:
# Rows where W > 0
df[  df['W']>0  ]

Unnamed: 0,W,X,Y,Z,add2,extra
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3


In [47]:
df

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [None]:
# Values strictly between -5 and 0 (the rest become NaN)
# Alternative: columns Y and extra of the rows where Z > 0
# df[df['Z'] > 0][['Y', 'extra']]

df[(-5 < df) & (df < 0)]

We can combine conditions with the element-wise operators `|` (or) and `&` (and), wrapping each condition in parentheses:

In [48]:
# Rows where X > 0 or Z < 0
df[(df['X']>0) | (df['Z'] < 0)]

Unnamed: 0,W,X,Y,Z,add2,extra
A,-4.552471,6.77224,-8.465829,-3.976042,5,1
B,3.900355,0.141204,1.120494,9.852545,6,2
C,8.524737,-8.654975,-9.912728,-8.770169,7,3
D,-0.34622,8.666815,8.547505,-4.020348,8,4


In [56]:
# Columns W and Y of the rows where X > 0 or Z < 0
df[(df['X']>0) | (df['Z'] < 0)] [['W','Y']]

Unnamed: 0,W,Y
A,-4.552471,-8.465829
B,3.900355,1.120494
C,8.524737,-9.912728
D,-0.34622,8.547505
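The same kind of query can be written in a single step by passing the boolean mask and the list of columns to `loc`, which avoids chaining two indexing operations. The sketch below uses a hypothetical frame and also shows `~` for negation. Python's own `and`, `or` and `not` do not work element-wise on a `Series` and raise a `ValueError`.

```python
import pandas as pd

frame = pd.DataFrame(
    {"W": [1, 2, 3, 4], "X": [5, -6, 7, -8], "Z": [9, 10, -11, 12]},
    index=list("ABCD"),
)

# Parentheses are required because | and & bind tighter than comparisons.
mask = (frame["X"] > 0) | (frame["Z"] < 0)

# Rows and columns are selected in one loc call.
subset = frame.loc[mask, ["W"]]

# ~ negates the mask element-wise.
rest = frame.loc[~mask, ["W"]]

print(subset)
print(rest)
```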


## Modifying the index

In [88]:
# Create the DataFrame again
array = np.random.uniform(-10, 10, size=[4,4])
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])

In [89]:
df

Unnamed: 0,W,X,Y,Z
A,2.705129,8.985539,-2.140531,8.949349
B,-7.167698,-7.282757,-5.907088,7.676385
C,-9.412705,2.442505,-9.59811,4.411441
D,6.506499,5.644636,5.281078,2.862166


In [90]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [92]:
df.index.to_list()

['A', 'B', 'C', 'D']

In [99]:
df

Unnamed: 0,W,X,Y,Z
A,2.705129,8.985539,-2.140531,8.949349
B,-7.167698,-7.282757,-5.907088,7.676385
C,-9.412705,2.442505,-9.59811,4.411441
D,6.506499,5.644636,5.281078,2.862166


In [98]:
new_df = df[df['X'] > 10]
new_df

Unnamed: 0,W,X,Y,Z


In [101]:
new_df.empty

True

In [67]:
# Add a new column with the codes that will become the new index
df['Códigos'] = ['AA','BB','CC','DD']

df

Unnamed: 0,W,X,Y,Z,Códigos
A,1.949229,0.018863,-9.856222,6.926107,AA
B,-4.032715,7.52829,3.273706,2.680084,BB
C,-6.185562,-1.561919,-2.630363,2.749117,CC
D,-5.317242,8.285601,0.424686,-5.464411,DD


In [68]:
# Append a row under the new label len(df), which is 4 here
df.loc[len(df)] = [2, 3, 4, 5, 6]
df

Unnamed: 0,W,X,Y,Z,Códigos
A,1.949229,0.018863,-9.856222,6.926107,AA
B,-4.032715,7.52829,3.273706,2.680084,BB
C,-6.185562,-1.561919,-2.630363,2.749117,CC
D,-5.317242,8.285601,0.424686,-5.464411,DD
4,2.0,3.0,4.0,5.0,6
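Assigning to `df.loc[len(df)]` creates a row under that label, and it would overwrite an existing row if the label were already in the index. One alternative, sketched here under the assumption that a fresh integer index is acceptable, is `pd.concat` with `ignore_index=True`; the frame and row values are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"W": [1.0, 2.0], "X": [3.0, 4.0]})

# The new row is a one-row DataFrame with the same column names.
new_row = pd.DataFrame([{"W": 5.0, "X": 6.0}])

# ignore_index=True renumbers the rows 0..n-1 in the result.
extended = pd.concat([frame, new_row], ignore_index=True)
print(extended)
```

`pd.concat` returns a new object, so `frame` itself keeps its two rows.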


In [69]:
# Replace the row index
df.set_index('Códigos')

Códigos,W,X,Y,Z
AA,1.949229,0.018863,-9.856222,6.926107
BB,-4.032715,7.52829,3.273706,2.680084
CC,-6.185562,-1.561919,-2.630363,2.749117
DD,-5.317242,8.285601,0.424686,-5.464411
6,2.0,3.0,4.0,5.0


In [70]:
# The change is not saved by default
df

Unnamed: 0,W,X,Y,Z,Códigos
A,1.949229,0.018863,-9.856222,6.926107,AA
B,-4.032715,7.52829,3.273706,2.680084,BB
C,-6.185562,-1.561919,-2.630363,2.749117,CC
D,-5.317242,8.285601,0.424686,-5.464411,DD
4,2.0,3.0,4.0,5.0,6


In [71]:
# Unless we request it explicitly
df.set_index('Códigos', inplace=True)
df

Códigos,W,X,Y,Z
AA,1.949229,0.018863,-9.856222,6.926107
BB,-4.032715,7.52829,3.273706,2.680084
CC,-6.185562,-1.561919,-2.630363,2.749117
DD,-5.317242,8.285601,0.424686,-5.464411
6,2.0,3.0,4.0,5.0


In [72]:
# Look up a row using the new index
df.loc['AA']

W    1.949229
X    0.018863
Y   -9.856222
Z    6.926107
Name: AA, dtype: float64

### Default index

In [73]:
df

Códigos,W,X,Y,Z
AA,1.949229,0.018863,-9.856222,6.926107
BB,-4.032715,7.52829,3.273706,2.680084
CC,-6.185562,-1.561919,-2.630363,2.749117
DD,-5.317242,8.285601,0.424686,-5.464411
6,2.0,3.0,4.0,5.0


In [76]:
# Reset the index and explicitly discard the previous one
df.reset_index(drop=True, inplace=True)

df

Unnamed: 0,W,X,Y,Z
0,1.949229,0.018863,-9.856222,6.926107
1,-4.032715,7.52829,3.273706,2.680084
2,-6.185562,-1.561919,-2.630363,2.749117
3,-5.317242,8.285601,0.424686,-5.464411
4,2.0,3.0,4.0,5.0
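With `drop=True` the previous index is discarded. Without it, `reset_index` moves the old index back into an ordinary column, which effectively undoes `set_index`. A minimal sketch on a hypothetical frame whose index is named `Códigos`:

```python
import pandas as pd

frame = pd.DataFrame(
    {"W": [1.0, 2.0]},
    index=pd.Index(["AA", "BB"], name="Códigos"),
)

# The old index becomes a regular column again.
restored = frame.reset_index()

# drop=True throws the old index away.
discarded = frame.reset_index(drop=True)

print(list(restored.columns))
print(list(discarded.columns))
```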


In [79]:
df['sin(W)'] = np.sin(df['W'])
df

Unnamed: 0,W,X,Y,Z,sin(W)
0,1.949229,0.018863,-9.856222,6.926107,0.929245
1,-4.032715,7.52829,3.273706,2.680084,0.777778
2,-6.185562,-1.561919,-2.630363,2.749117,0.097468
3,-5.317242,8.285601,0.424686,-5.464411,0.822586
4,2.0,3.0,4.0,5.0,0.909297


In [86]:
df.columns.to_list()

['W', 'X', 'Y', 'Z', 'sin(W)']

In [87]:
df.index

RangeIndex(start=0, stop=5, step=1)

This is only the tip of the iceberg; for more information on the `DataFrame` class, see the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).