# The DataFrame class

A `DataFrame` is a group of `Series` that share the same index, forming a table-like structure that can hold all kinds of information.
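
As a minimal sketch of that idea, a `DataFrame` can be assembled from a dictionary of `Series` that share an index. The names `w`, `x`, and `frame`, and the values, are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each dictionary key becomes a column name
frame = pd.DataFrame({'W': w, 'X': x})

print(frame)
print(frame.index.tolist())    # row labels shared by every column
print(frame.columns.tolist())  # column names
```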

Each `Series` in the `DataFrame` can be treated as a column, and we can give each column a name:

In [37]:
import pandas as pd
import numpy as np

array = np.random.uniform(-10, 10, size=[4,4])
array

array([[-1.7240448 ,  2.97950317, -1.1342438 ,  2.23533255],
       [-4.83413266, -9.96135135, -0.27271347, -3.568344  ],
       [-0.07643831,  4.59503218,  9.02601192, -5.48409867],
       [-7.03261511,  1.59842748,  8.83025173, -9.44327703]])

In [38]:
# Rendered in Jupyter
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 |


In [39]:
# Printed to standard output
print(df)

          W         X         Y         Z
A -1.724045  2.979503 -1.134244  2.235333
B -4.834133 -9.961351 -0.272713 -3.568344
C -0.076438  4.595032  9.026012 -5.484099
D -7.032615  1.598427  8.830252 -9.443277


In [40]:
# Type of a DataFrame
type(df)

pandas.core.frame.DataFrame

## Working with DataFrames

We can select a column by its name:

In [43]:
df['X']

A    2.979503
B   -9.961351
C    4.595032
D    1.598427
Name: X, dtype: float64

In [44]:
df.X

A    2.979503
B   -9.961351
C    4.595032
D    1.598427
Name: X, dtype: float64

As we can see, a column is actually a `Series`:

In [45]:
type(df['X'])

pandas.core.series.Series
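
The two access styles shown above are not fully interchangeable. The following sketch, which uses a hypothetical frame `demo` with deliberately awkward column names, illustrates why bracket access is the safer default: attribute access only works when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute.

```python
import pandas as pd

# Small illustrative frame; the column names are chosen to show the pitfalls
demo = pd.DataFrame({'X': [1, 2], 'count': [3, 4], 'my col': [5, 6]})

# Bracket access works for any column name
print(demo['count'].tolist())
print(demo['my col'].tolist())

# Attribute access works for simple names such as X...
print(demo.X.tolist())

# ...but 'count' resolves to the DataFrame.count method, not the column
print(callable(demo.count))
```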

We can also select several columns by passing a list of names:

In [46]:
df[['W','Z']]

|   | W | Z |
|---|---|---|
| A | -1.724045 | 2.235333 |
| B | -4.834133 | -3.568344 |
| C | -0.076438 | -5.484099 |
| D | -7.032615 | -9.443277 |
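
The return type depends on what goes inside the brackets. A short sketch with a hypothetical frame `demo` shows that a single label yields a `Series`, while a list yields a `DataFrame`, even when the list holds only one name.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0], 'Z': [3.0, 4.0]}, index=['A', 'B'])

as_series = demo['W']     # a single label returns a Series
as_frame = demo[['W']]    # a one-element list returns a DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```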


### Adding a column

In [49]:
df['TOTAL'] = df['W'] + df['X'] + df['Y'] + df['Z']

In [50]:
df

|   | W | X | Y | Z | TOTAL |
|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 2.356547 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | -18.636541 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 8.060507 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | -6.047213 |
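
When many columns are involved, writing out every term becomes tedious. One possible alternative, sketched here on a hypothetical frame `demo`, is a row-wise `sum` over a list of columns. Note one difference in behaviour: `sum` skips `NaN` values by default, whereas `+` propagates them.

```python
import pandas as pd

demo = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]],
    index=['A', 'B'], columns=['W', 'X', 'Y', 'Z'],
)

# Explicit sum, written term by term
explicit = demo['W'] + demo['X'] + demo['Y'] + demo['Z']

# Row-wise sum over a list of columns; axis=1 adds across columns
demo['TOTAL'] = demo[['W', 'X', 'Y', 'Z']].sum(axis=1)

print(demo)
print((explicit == demo['TOTAL']).all())
```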


In [51]:
df['add1'], df['add2'] = [1, 2, 3, 4], [5, 6, 7, 8]
df

|   | W | X | Y | Z | TOTAL | add1 | add2 |
|---|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 2.356547 | 1 | 5 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | -18.636541 | 2 | 6 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 8.060507 | 3 | 7 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | -6.047213 | 4 | 8 |


In [52]:
aux_df = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8]],
                      index=['A', 'B', 'C', 'D'], columns=['add1', 'add2'])
aux_df

|   | add1 | add2 |
|---|---|---|
| A | 1 | 2 |
| B | 3 | 4 |
| C | 5 | 6 |
| D | 7 | 8 |


In [54]:
df.drop(['add1'], axis=1, inplace=True)

In [55]:
df

|   | W | X | Y | Z | TOTAL | add2 |
|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 2.356547 | 5 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | -18.636541 | 6 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 8.060507 | 7 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | -6.047213 | 8 |


In [56]:
df['extra'] = [1, 2, 3, 4]
df

|   | W | X | Y | Z | TOTAL | add2 | extra |
|---|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 2.356547 | 5 | 1 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | -18.636541 | 6 | 2 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 8.060507 | 7 | 3 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | -6.047213 | 8 | 4 |


### Deleting a column

In [57]:
new_df = df.drop('TOTAL', axis=1, inplace=False)

In [58]:
new_df

|   | W | X | Y | Z | add2 | extra |
|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 5 | 1 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | 6 | 2 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 7 | 3 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | 8 | 4 |


In [59]:
# The original df is not modified
df

|   | W | X | Y | Z | TOTAL | add2 | extra |
|---|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 2.356547 | 5 | 1 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | -18.636541 | 6 | 2 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 8.060507 | 7 | 3 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | -6.047213 | 8 | 4 |


In [60]:
# Unless we request it explicitly with inplace=True
df.drop('TOTAL', axis=1, inplace=True)

df

|   | W | X | Y | Z | add2 | extra |
|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 5 | 1 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | 6 | 2 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 7 | 3 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | 8 | 4 |


In [62]:
df.drop('TOTAL', axis=1)

KeyError: "['TOTAL'] not found in axis"
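
The error above appears because the `TOTAL` column no longer exists. If a missing label should be tolerated rather than reported, `drop` accepts `errors='ignore'`. The sketch below uses a hypothetical frame `demo` and the `columns=` keyword, which is equivalent to passing the labels with `axis=1`.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'TOTAL': [3, 4]})

# First drop removes the column and returns a new frame
without_total = demo.drop(columns=['TOTAL'])

# A second drop of the same label raises KeyError by default
try:
    without_total.drop(columns=['TOTAL'])
    raised = False
except KeyError:
    raised = True

# errors='ignore' skips labels that are not present
still_fine = without_total.drop(columns=['TOTAL'], errors='ignore')

print(raised)
print(still_fine.columns.tolist())
```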

### Deleting a row

In [65]:
df.drop('C', axis=0)

|   | W | X | Y | Z | add2 | extra |
|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 5 | 1 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | 6 | 2 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | 8 | 4 |


In [66]:
df

|   | W | X | Y | Z | add2 | extra |
|---|---|---|---|---|---|---|
| A | -1.724045 | 2.979503 | -1.134244 | 2.235333 | 5 | 1 |
| B | -4.834133 | -9.961351 | -0.272713 | -3.568344 | 6 | 2 |
| C | -0.076438 | 4.595032 | 9.026012 | -5.484099 | 7 | 3 |
| D | -7.032615 | 1.598427 | 8.830252 | -9.443277 | 8 | 4 |
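
As with columns, dropping a row returns a new frame and leaves the original untouched. A common alternative to `inplace=True` is to reassign the result, as in this sketch on a hypothetical frame `demo`; the `index=` keyword is equivalent to passing the labels with `axis=0`.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

# drop returns a new frame; demo keeps all its rows
trimmed = demo.drop(index=['C'])

print(demo.index.tolist())
print(trimmed.index.tolist())

# Reassigning the name is a common alternative to inplace=True
demo = demo.drop(index=['A', 'B'])
print(demo.index.tolist())
```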


In [68]:
df['add2']

A    5
B    6
C    7
D    8
Name: add2, dtype: int64

### Selecting rows

In [70]:
df.loc['C']

W       -0.076438
X        4.595032
Y        9.026012
Z       -5.484099
add2     7.000000
extra    3.000000
Name: C, dtype: float64

We can also select a row by its integer position with `iloc`:

In [72]:
df.iloc[2]

W       -0.076438
X        4.595032
Y        9.026012
Z       -5.484099
add2     7.000000
extra    3.000000
Name: C, dtype: float64
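
The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the labels are themselves integers. The frame `demo` below is a hypothetical example built for that purpose.

```python
import pandas as pd

# Integer labels that do not match the positions (illustrative data)
demo = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

by_label = demo.loc[20]      # row whose label is 20
by_position = demo.iloc[2]   # third row, counting from 0

print(by_label['value'])
print(by_position['value'])
```

A selected row comes back as a `Series` with a single dtype, which is why the integer columns `add2` and `extra` are displayed as `7.000000` and `3.000000` in the outputs above: they are upcast to `float64` to match the other columns.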

### Selecting a subset

In [None]:
# Row C and column Z
df.loc['C','Z']

In [None]:
type(df.loc['C','Z'])

In [None]:
df.loc[ ['C'],['Z'] ]

In [None]:
type(df.loc[['C'],['Z']])

In [None]:
# Rows A, B and columns W, Y
df.loc[ ['A','B'],['W','Y'] ]
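
The cells above have no recorded output, so here is a small sketch on a hypothetical frame `demo` that shows what each form returns. It also covers slices, which the notebook does not use: label slices with `loc` include both endpoints, while position slices with `iloc` exclude the stop position.

```python
import pandas as pd

demo = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['A', 'B', 'C'], columns=['W', 'X', 'Y'],
)

scalar = demo.loc['C', 'Y']          # single labels return the cell value
one_cell = demo.loc[['C'], ['Y']]    # lists return a 1x1 DataFrame

# Label slices include both endpoints
label_slice = demo.loc['A':'B', 'W':'X']

# Position slices exclude the stop position, like Python lists
position_slice = demo.iloc[0:2, 0:2]

print(scalar)
print(one_cell.shape)
print(label_slice.equals(position_slice))
```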

## Conditional selection

One of the most useful features of a `DataFrame` is conditional selection:

In [None]:
# Boolean mask: which entries are > 0
df > 0

In [None]:
# Entries > 0; the remaining cells are shown as NaN
df[df > 0]
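
Indexing a frame with a boolean frame of the same shape does not remove anything; it keeps the shape and fills the cells that fail the condition with `NaN`. The sketch below demonstrates this on a hypothetical 2x2 frame `demo`.

```python
import pandas as pd

demo = pd.DataFrame({'W': [-1.0, 2.0], 'X': [3.0, -4.0]}, index=['A', 'B'])

mask = demo > 0        # boolean DataFrame with the same shape
masked = demo[mask]    # keeps the shape; cells where mask is False become NaN

print(mask)
print(masked)
print(int(masked.isna().sum().sum()))  # number of masked-out cells
```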

In [None]:
df['W']>0

In [None]:
# Rows where W > 0
df[df['W'] > 0]

In [None]:
# Columns Y and X for the rows where W > 0
df[df['W'] > 0][['Y', 'X']]
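
Filtering rows and then picking columns takes two indexing steps. A possible single-step equivalent, sketched on a hypothetical frame `demo`, passes the boolean mask and the column list to `loc` together. This form is also the one to prefer when assigning values to the selection.

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [1.0, -2.0, 3.0], 'X': [4.0, 5.0, 6.0], 'Y': [7.0, 8.0, 9.0]},
    index=['A', 'B', 'C'],
)

# Two steps: filter rows, then pick columns
two_steps = demo[demo['W'] > 0][['Y', 'X']]

# One step: boolean mask for rows, list of labels for columns
one_step = demo.loc[demo['W'] > 0, ['Y', 'X']]

print(one_step)
print(two_steps.equals(one_step))
```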

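We can combine conditions with the element-wise operators `|` (or) and `&` (and). Each condition must be wrapped in parentheses, and Python's `or` and `and` keywords do not work on a `Series`: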

In [None]:
# Rows where X > 0 or Z < 0
df[(df['X']>0) | (df['Z'] < 0)]

In [None]:
# Columns W and Y for the rows where X > 0 or Z < 0
df[(df['X']>0) | (df['Z'] < 0)] [['W','Y']]
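
The following sketch, on a hypothetical frame `demo`, puts the combination rules side by side: `|` and `&` with parenthesised conditions, `~` to negate a mask, and the `ValueError` raised when the `or` keyword is applied to two `Series`.

```python
import pandas as pd

demo = pd.DataFrame({'X': [1.0, -2.0, -3.0], 'Z': [4.0, -5.0, 6.0]},
                    index=['A', 'B', 'C'])

# Element-wise OR and AND; each comparison needs its own parentheses
either = demo[(demo['X'] > 0) | (demo['Z'] < 0)]
both = demo[(demo['X'] < 0) & (demo['Z'] < 0)]

# ~ negates a boolean mask
neither = demo[~((demo['X'] > 0) | (demo['Z'] < 0))]

# The Python keyword `or` asks for a single truth value and raises ValueError
try:
    (demo['X'] > 0) or (demo['Z'] < 0)
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(either.index.tolist())
print(both.index.tolist())
print(neither.index.tolist())
print(keyword_failed)
```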

## Modifying the index

In [None]:
# Recreate the DataFrame
array = np.random.uniform(-10, 10, size=[4,4])
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])

In [None]:
df

In [None]:
# Add a new column holding the labels for the new index
df['Códigos'] = ['AA','BB','CC','DD']

df

In [None]:
# Replace the row index with that column
df.set_index('Códigos')

In [None]:
# By default the change is not stored in df
df

In [None]:
# Unless we request it explicitly with inplace=True
df.set_index('Códigos', inplace=True)

df

In [None]:
print(df)

In [None]:
# Look up a row using the new index
df.loc['AA']
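
Since these cells have no recorded output, the sketch below reproduces the behaviour on a hypothetical two-row frame `demo`. It also shows `drop=False`, which the notebook does not use: it keeps the column in place as well as using it for the index.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0], 'Códigos': ['AA', 'BB']},
                    index=['A', 'B'])

# Returns a new frame; the 'Códigos' column moves into the index
indexed = demo.set_index('Códigos')

# drop=False keeps 'Códigos' as a regular column as well
indexed_keep = demo.set_index('Códigos', drop=False)

print(indexed.index.tolist())
print(indexed.columns.tolist())
print(indexed_keep.columns.tolist())
print(demo.index.tolist())  # the original frame is unchanged
```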

### Default index

In [None]:
df

In [None]:
# Reset the index to the default and explicitly discard the previous one
df.reset_index(drop=True, inplace=True)

df
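
The `drop` argument decides what happens to the old index. This sketch, on a hypothetical frame `demo` whose index is named `Códigos`, compares `drop=True` with the default `drop=False`, which moves the old index back into a regular column.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0]},
                    index=pd.Index(['AA', 'BB'], name='Códigos'))

# drop=True discards the old index
discarded = demo.reset_index(drop=True)

# The default, drop=False, moves the old index back into a column
restored = demo.reset_index()

print(discarded.index.tolist())
print(discarded.columns.tolist())
print(restored.columns.tolist())
```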

This is only the tip of the iceberg. For more information about the `DataFrame` class, see the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).