# The DataFrame class

A `DataFrame` is a collection of `Series` that share the same index, producing a table-like structure that can hold all kinds of data.

Each `Series` in the `DataFrame` can be thought of as a column, to which we can assign a name:

In [1]:
import pandas as pd
import numpy as np

array = np.random.uniform(-10, 10, size=[4,4])

df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])
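Because the values above are random, they change on every run. A minimal sketch of a reproducible variant follows, assuming a seeded generator is acceptable; `np.random.default_rng`, the seed value `42`, and the name `df_again` are illustrative choices that are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same values on every run.
# The seed 42 is an arbitrary, illustrative choice.
rng = np.random.default_rng(seed=42)
array = rng.uniform(-10, 10, size=(4, 4))

df = pd.DataFrame(array, index=['A', 'B', 'C', 'D'], columns=['W', 'X', 'Y', 'Z'])

# A second generator built with the same seed reproduces the same table.
df_again = pd.DataFrame(
    np.random.default_rng(seed=42).uniform(-10, 10, size=(4, 4)),
    index=df.index,
    columns=df.columns,
)
print(df.equals(df_again))
```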



In [2]:
# Rich rendering in Jupyter
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 |


In [3]:
# Plain-text rendering with print
print(df)

          W         X         Y         Z
A  4.281925  8.568825  8.551277  2.712402
B -0.729713 -2.087403  4.115279  4.056047
C -0.452809  3.109461 -1.688535  7.430385
D  2.474910  7.360632 -4.399690 -6.816107


In [4]:
# Type of a df
type(df)

pandas.core.frame.DataFrame
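To illustrate the idea that a `DataFrame` is a set of `Series` sharing one index, here is a minimal sketch that builds one from a dictionary of `Series`. The names `w`, `x`, `df_from_series` and the sample values are made up for this example.

```python
import pandas as pd

# Two Series with the same labels, deliberately listed in a different order.
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'], name='W')
x = pd.Series([10.0, 20.0, 30.0], index=['C', 'B', 'A'], name='X')

# Each dictionary key becomes a column name; values are aligned by label,
# not by position.
df_from_series = pd.DataFrame({'W': w, 'X': x})

print(df_from_series.loc['A'])
```

Row `A` pairs the value that `w` stores under `A` with the value that `x` stores under `A`, even though `x` lists that label last.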

## Working with DataFrames

We can select a column by its name:

In [5]:
df['X']

A    8.568825
B   -2.087403
C    3.109461
D    7.360632
Name: X, dtype: float64

As we can see, a column is really a `Series`:

In [6]:
type(df['X'])

pandas.core.series.Series

We can also select several columns by passing a list of names:

In [7]:
df[['Y','Z']]

|   | Y | Z |
|---|---|---|
| A | 8.551277 | 2.712402 |
| B | 4.115279 | 4.056047 |
| C | -1.688535 | 7.430385 |
| D | -4.399690 | -6.816107 |
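The difference between a single name and a list of names also determines the type of the result. The following sketch uses a small made-up frame; the variable names `col_as_series` and `col_as_frame` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'W': [1.0, 2.0], 'X': [3.0, 4.0]}, index=['A', 'B'])

col_as_series = df['X']     # a single name returns a Series
col_as_frame = df[['X']]    # a one-element list returns a one-column DataFrame

print(type(col_as_series).__name__)
print(type(col_as_frame).__name__)
```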


### Adding a column

In [8]:
df['TOTAL'] = df['W'] + df['X'] + df['Y'] + df['Z']

In [9]:
df

|   | W | X | Y | Z | TOTAL |
|---|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 | 24.114429 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 | 5.354211 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 | 8.398502 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 | -1.380255 |
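Adding the columns one by one gets tedious as their number grows. As a sketch of an alternative, the row-wise total can be computed with `sum(axis=1)`, and `assign` returns a new frame with an extra column instead of modifying the existing one. The sample values and the `MEAN` column are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [1.0, 2.0], 'X': [3.0, 4.0], 'Y': [5.0, 6.0], 'Z': [7.0, 8.0]},
    index=['A', 'B'],
)

value_columns = ['W', 'X', 'Y', 'Z']

# axis=1 sums across the columns of each row.
df['TOTAL'] = df[value_columns].sum(axis=1)

# assign returns a new DataFrame; df itself does not gain the MEAN column.
df_with_mean = df.assign(MEAN=df[value_columns].mean(axis=1))

print(df_with_mean)
```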


### Dropping a column

In [10]:
df.drop('TOTAL', axis=1)

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 |


In [11]:
# The original df is not modified
df

|   | W | X | Y | Z | TOTAL |
|---|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 | 24.114429 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 | 5.354211 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 | 8.398502 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 | -1.380255 |


In [12]:
# Unless we ask for it explicitly
df.drop('TOTAL', axis=1, inplace=True)

df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 |
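An alternative to `inplace=True` is to keep the frame that `drop` returns. The sketch below also uses the `columns=` keyword, which is equivalent to passing the label together with `axis=1`. The sample data and the name `trimmed` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [1.0, 2.0], 'X': [3.0, 4.0], 'TOTAL': [4.0, 6.0]},
    index=['A', 'B'],
)

# drop returns a new DataFrame; df keeps its TOTAL column.
trimmed = df.drop(columns='TOTAL')

# Same result as the axis-based spelling used in the cells above.
same_result = df.drop('TOTAL', axis=1)

print(list(df.columns))
print(list(trimmed.columns))
```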


### Dropping a row

In [13]:
df.drop('D', axis=0)

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
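Several rows can be dropped in one call by passing a list of labels. This is a minimal sketch on made-up data; the `index=` keyword is equivalent to passing the labels with `axis=0`, and the name `remaining` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [1.0, 2.0, 3.0, 4.0], 'X': [5.0, 6.0, 7.0, 8.0]},
    index=['A', 'B', 'C', 'D'],
)

# As with columns, the result is a new DataFrame and df is left untouched.
remaining = df.drop(index=['B', 'D'])

print(list(remaining.index))
print(list(df.index))
```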


### Selecting rows

In [14]:
df.loc['C']

W   -0.452809
X    3.109461
Y   -1.688535
Z    7.430385
Name: C, dtype: float64

We can also select a row by its integer position with `iloc`:

In [15]:
df.iloc[2]

W   -0.452809
X    3.109461
Y   -1.688535
Z    7.430385
Name: C, dtype: float64
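The two accessors also differ when slicing: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as in regular Python slicing. A small sketch with made-up values (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [1.0, 2.0, 3.0, 4.0], 'X': [5.0, 6.0, 7.0, 8.0]},
    index=['A', 'B', 'C', 'D'],
)

by_label = df.loc['A':'C']     # label slice: the end label 'C' is included
by_position = df.iloc[0:2]     # position slice: the end position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```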

### Selecting a subset

In [16]:
# Row C, column Z
df.loc['C','Z']

7.4303850091149215

In [17]:
# Rows A, B and columns W, Y
df.loc[['A','B'],['W','Y']]

|   | W | Y |
|---|---|---|
| A | 4.281925 | 8.551277 |
| B | -0.729713 | 4.115279 |
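The same subset can be expressed by integer position with `iloc`, and a single cell can be read with the scalar accessor `at`. The following is a sketch on a made-up 4x4 frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Values 0.0 to 15.0 laid out row by row, so each cell is easy to check.
df = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4),
    index=['A', 'B', 'C', 'D'],
    columns=['W', 'X', 'Y', 'Z'],
)

by_label = df.loc[['A', 'B'], ['W', 'Y']]
by_position = df.iloc[[0, 1], [0, 2]]   # rows 0 and 1, columns 0 and 2

single_value = df.at['C', 'Z']          # one cell, addressed by labels

print(by_label.equals(by_position))
print(single_value)
```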


## Conditional selection

One of the most useful features of a `DataFrame` is conditional selection, that is, filtering with boolean conditions:

In [18]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| B | -0.729713 | -2.087403 | 4.115279 | 4.056047 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 |


In [19]:
# Which entries are > 0
df>0

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | True | True | True | True |
| B | False | False | True | True |
| C | False | True | False | True |
| D | True | True | False | False |


In [20]:
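# Values of the entries that are > 0 (the rest become NaN)
df[df>0]

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| B | NaN | NaN | 4.115279 | 4.056047 |
| C | NaN | 3.109461 | NaN | 7.430385 |
| D | 2.474910 | 7.360632 | NaN | NaN |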


In [21]:
# Rows where X > 0
df[df['X']>0]

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 |


In [22]:
# Columns Y and Z of the rows where X > 0
df[df['X']>0][['Y','Z']]

|   | Y | Z |
|---|---|---|
| A | 8.551277 | 2.712402 |
| C | -1.688535 | 7.430385 |
| D | -4.399690 | -6.816107 |
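The row filter and the column selection above are two chained indexing steps. As a sketch of an equivalent single step, `loc` accepts the boolean mask and the column list together. The sample values are rounded from the table above, and the names `mask`, `chained` and `single_step` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'X': [8.5, -2.0, 3.1, 7.3],
        'Y': [8.5, 4.1, -1.6, -4.3],
        'Z': [2.7, 4.0, 7.4, -6.8],
    },
    index=['A', 'B', 'C', 'D'],
)

mask = df['X'] > 0                         # boolean Series, one value per row

chained = df[mask][['Y', 'Z']]             # two steps: filter rows, then pick columns
single_step = df.loc[mask, ['Y', 'Z']]     # one step: rows and columns together

print(single_step)
```

Both forms return the same data when reading. The single-step form is generally the safer habit when the selection is later used for assignment, where chained indexing may operate on a temporary copy.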


We can combine conditions with the element-wise operators `|` (or) and `&` (and), wrapping each condition in parentheses:

In [23]:
# Rows where X > 0 or Z < 0
df[(df['X']>0) | (df['Z'] < 0)]

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 4.281925 | 8.568825 | 8.551277 | 2.712402 |
| C | -0.452809 | 3.109461 | -1.688535 | 7.430385 |
| D | 2.474910 | 7.360632 | -4.399690 | -6.816107 |


In [24]:
# Columns W and Y of the rows where X > 0 or Z < 0
df[(df['X']>0) | (df['Z'] < 0)][['W','Y']]

|   | W | Y |
|---|---|---|
| A | 4.281925 | 8.551277 |
| C | -0.452809 | -1.688535 |
| D | 2.474910 | -4.399690 |
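The cells above only exercise `|`. The sketch below, on values rounded from the same table, shows `&` and the negation operator `~`, and what happens when the Python keyword `and` is used instead: pandas cannot reduce a whole `Series` to a single truth value and raises `ValueError`. The names `both`, `negated` and `error_message` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'X': [8.5, -2.0, 3.1, 7.3], 'Z': [2.7, 4.0, 7.4, -6.8]},
    index=['A', 'B', 'C', 'D'],
)

both = df[(df['X'] > 0) & (df['Z'] < 0)]   # rows meeting both conditions
negated = df[~(df['X'] > 0)]               # rows where X > 0 is False

# The keyword `and` needs one True/False per operand, so it fails on a Series.
error_message = None
try:
    df[(df['X'] > 0) and (df['Z'] < 0)]
except ValueError as exc:
    error_message = str(exc)

print(list(both.index))
print(list(negated.index))
print(error_message)
```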


## Modifying the index

In [25]:
# Recreate the DataFrame
array = np.random.uniform(-10, 10, size=[4,4])
df = pd.DataFrame(array, index=['A','B','C','D'], columns=['W','X','Y','Z'])

In [26]:
# Add a new column holding the labels we want to use as the index
df['Códigos'] = ['AA','BB','CC','DD']

df

|   | W | X | Y | Z | Códigos |
|---|---|---|---|---|---|
| A | 6.302277 | -0.742166 | -3.265597 | 7.645083 | AA |
| B | 7.119599 | -7.316686 | -0.068432 | 0.445198 | BB |
| C | 4.354697 | -3.544881 | -8.123959 | -3.356849 | CC |
| D | 0.302046 | 1.991204 | -6.192697 | 8.574128 | DD |


In [27]:
# Replace the row index
df.set_index('Códigos')

| Códigos | W | X | Y | Z |
|---|---|---|---|---|
| AA | 6.302277 | -0.742166 | -3.265597 | 7.645083 |
| BB | 7.119599 | -7.316686 | -0.068432 | 0.445198 |
| CC | 4.354697 | -3.544881 | -8.123959 | -3.356849 |
| DD | 0.302046 | 1.991204 | -6.192697 | 8.574128 |


In [28]:
# The change is not saved by default
df

|   | W | X | Y | Z | Códigos |
|---|---|---|---|---|---|
| A | 6.302277 | -0.742166 | -3.265597 | 7.645083 | AA |
| B | 7.119599 | -7.316686 | -0.068432 | 0.445198 | BB |
| C | 4.354697 | -3.544881 | -8.123959 | -3.356849 | CC |
| D | 0.302046 | 1.991204 | -6.192697 | 8.574128 | DD |


In [29]:
# Unless we request it explicitly
df.set_index('Códigos', inplace=True)

df

| Códigos | W | X | Y | Z |
|---|---|---|---|---|
| AA | 6.302277 | -0.742166 | -3.265597 | 7.645083 |
| BB | 7.119599 | -7.316686 | -0.068432 | 0.445198 |
| CC | 4.354697 | -3.544881 | -8.123959 | -3.356849 |
| DD | 0.302046 | 1.991204 | -6.192697 | 8.574128 |


In [30]:
print(df)

                W         X         Y         Z
Códigos                                        
AA       6.302277 -0.742166 -3.265597  7.645083
BB       7.119599 -7.316686 -0.068432  0.445198
CC       4.354697 -3.544881 -8.123959 -3.356849
DD       0.302046  1.991204 -6.192697  8.574128


In [31]:
# Look up a row using the new index
df.loc['AA']

W    6.302277
X   -0.742166
Y   -3.265597
Z    7.645083
Name: AA, dtype: float64
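As with `drop`, an alternative to `inplace=True` is to rebind the name to the frame that `set_index` returns. A minimal sketch on made-up data follows; it reuses the `Códigos` column name from the cells above, and the name `reindexed` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [1.0, 2.0], 'Códigos': ['AA', 'BB']},
    index=['A', 'B'],
)

# set_index returns a new DataFrame; the column moves into the index
# and its name becomes the index name.
reindexed = df.set_index('Códigos')

print(reindexed.index.name)
print(reindexed.loc['AA', 'W'])
```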

### Default index

In [32]:
# Reset the index to the default one and explicitly discard the previous index
df.reset_index(drop=True, inplace=True)

df

|   | W | X | Y | Z |
|---|---|---|---|---|
| 0 | 6.302277 | -0.742166 | -3.265597 | 7.645083 |
| 1 | 7.119599 | -7.316686 | -0.068432 | 0.445198 |
| 2 | 4.354697 | -3.544881 | -8.123959 | -3.356849 |
| 3 | 0.302046 | 1.991204 | -6.192697 | 8.574128 |
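Without `drop=True`, `reset_index` does not discard the old index: it moves it back into a regular column. The sketch below contrasts both behaviours on made-up data; the names `restored` and `discarded` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [1.0, 2.0]},
    index=pd.Index(['AA', 'BB'], name='Códigos'),
)

restored = df.reset_index()             # old index becomes the 'Códigos' column
discarded = df.reset_index(drop=True)   # old index is thrown away

print(list(restored.columns))
print(list(discarded.columns))
```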


This is only the tip of the iceberg; for more information on the `DataFrame` class, see the [official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).