## Pandas DataFrame

* A DataFrame is a two-dimensional array; you can think of it as a table.
* Unlike a Series, a DataFrame can hold more than one column (in addition to the index).

In the real world, data is often stored in a form very similar to a DataFrame.
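
To make the table analogy concrete, here is a minimal sketch that builds a DataFrame from a dictionary. The column names and values (`name`, `age`) are invented for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column.
data = {
    "name": ["Ana", "Luis", "Marta"],
    "age": [28, 35, 41],
}
people = pd.DataFrame(data)

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column labels
print(type(people["age"]))   # a single column is a Series
```

Each column behaves like a Series that shares the same index as the others.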

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.DataFrame(np.random.randn(10,4), columns = list('ABCD'))

In [4]:
df

Unnamed: 0,A,B,C,D
0,-0.110579,-0.688595,1.564336,-1.305491
1,-0.041443,1.072491,-1.524662,-1.508517
2,-0.09699,0.683185,0.162391,-1.208158
3,-0.965065,0.902975,-1.695447,0.706238
4,1.182873,-0.953047,-1.25598,0.764277
5,0.69886,-1.162905,-0.641405,-0.838327
6,-1.850862,0.046674,0.021869,0.922557
7,0.553259,-0.376879,-0.035828,0.338207
8,1.09492,-2.480638,1.354422,-1.190416
9,0.488835,-1.487284,0.357246,0.137698


### Reading a CSV file with pandas

In [5]:
df = pd.read_csv('prueba.csv')
df.head()

Unnamed: 0,A,B,C,D
0,0.187497,1.12215,-0.988277,-1.985934
1,0.360803,-0.562243,-0.340693,-0.986988
2,-0.040627,0.067333,-0.452978,0.686223
3,-0.279572,-0.702492,0.252265,0.958977
4,0.537438,-1.737568,0.714727,-0.939288
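
Since `prueba.csv` is not included here, the following sketch reads CSV text from memory instead. The contents of `csv_text` are made up; `pd.read_csv` accepts a file path or any file-like object, so the call is otherwise the same.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "A,B\n0.1,1.5\n0.2,2.5\n0.3,3.5\n"

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first two rows
print(sample.dtypes)   # column types inferred while parsing
```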


### Adding a custom index

In [6]:
df = pd.read_csv('prueba.csv')
days = pd.date_range('20190827', periods = 10)
df.index = days

In [7]:
df.head()

Unnamed: 0,A,B,C,D
2019-08-27,0.187497,1.12215,-0.988277,-1.985934
2019-08-28,0.360803,-0.562243,-0.340693,-0.986988
2019-08-29,-0.040627,0.067333,-0.452978,0.686223
2019-08-30,-0.279572,-0.702492,0.252265,0.958977
2019-08-31,0.537438,-1.737568,0.714727,-0.939288


In [8]:
df.index

DatetimeIndex(['2019-08-27', '2019-08-28', '2019-08-29', '2019-08-30',
               '2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03',
               '2019-09-04', '2019-09-05'],
              dtype='datetime64[ns]', freq='D')

In [9]:
df.values

array([[ 1.874970e-01,  1.122150e+00, -9.882770e-01, -1.985934e+00],
       [ 3.608030e-01, -5.622430e-01, -3.406930e-01, -9.869880e-01],
       [-4.062700e-02,  6.733300e-02, -4.529780e-01,  6.862230e-01],
       [-2.795720e-01, -7.024920e-01,  2.522650e-01,  9.589770e-01],
       [ 5.374380e-01, -1.737568e+00,  7.147270e-01, -9.392880e-01],
       [ 7.001100e-02, -5.164430e-01, -1.655689e+00,  2.467210e-01],
       [ 1.268000e-03,  9.515170e-01,  2.107360e+00, -1.087260e-01],
       [-1.852580e-01,  8.565200e-01, -6.862850e-01,  1.104195e+00],
       [ 3.870230e-01,  1.706336e+00, -2.452653e+00,  2.604660e-01],
       [-1.054974e+00,  5.567750e-01, -9.452190e-01, -3.029500e-02]])
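
As a smaller, reproducible version of the cells above, this sketch passes the date index directly to the constructor and then reads back the index and the underlying array. The values are illustrative, and `to_numpy()` is shown as an alternative to `.values` that returns the same data.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2019-08-27", periods=3)
small = pd.DataFrame(
    np.arange(6).reshape(3, 2),  # illustrative values 0..5
    index=days,
    columns=["A", "B"],
)

print(small.index[0])   # first label of the DatetimeIndex
arr = small.to_numpy()  # same data as small.values, as a NumPy array
print(arr.shape)
```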

## Descriptive statistics with DataFrames

The pandas package provides a small but useful set of functions that report statistical characteristics of a DataFrame.

In [10]:
df.describe()

Unnamed: 0,A,B,C,D
count,10.0,10.0,10.0,10.0
mean,-0.001639,0.174188,-0.444744,-0.079465
std,0.451656,1.049677,1.267397,0.971164
min,-1.054974,-1.737568,-2.452653,-1.985934
25%,-0.1491,-0.550793,-0.977513,-0.731647
50%,0.03564,0.312054,-0.569632,0.108213
75%,0.317477,0.927768,0.104026,0.579784
max,0.537438,1.706336,2.10736,1.104195


In [11]:
df.mean(0)  # axis 0: one mean per column

A   -0.001639
B    0.174188
C   -0.444744
D   -0.079465
dtype: float64

In [12]:
df.mean(1)  # axis 1: one mean per row

2019-08-27   -0.416141
2019-08-28   -0.382280
2019-08-29    0.064988
2019-08-30    0.057294
2019-08-31   -0.356173
2019-09-01   -0.463850
2019-09-02    0.737855
2019-09-03    0.272293
2019-09-04   -0.024707
2019-09-05   -0.368428
Freq: D, dtype: float64
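
The axis argument is easier to follow on a frame small enough to check by hand. This sketch uses invented values and the keyword form `axis=`, which is equivalent to the positional `0` and `1` used above.

```python
import pandas as pd

scores = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

col_means = scores.mean(axis=0)  # collapses the rows: one value per column
row_means = scores.mean(axis=1)  # collapses the columns: one value per row

print(col_means)  # A -> 2.0, B -> 20.0
print(row_means)  # 5.5, 11.0, 16.5
```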

## Extracting data from DataFrames

As with NumPy arrays and pandas Series, data can be extracted in several ways: sometimes a single value, sometimes a subset. The same operations are used heavily with DataFrames.

### Selecting the first and last rows

In [13]:
df.head()

Unnamed: 0,A,B,C,D
2019-08-27,0.187497,1.12215,-0.988277,-1.985934
2019-08-28,0.360803,-0.562243,-0.340693,-0.986988
2019-08-29,-0.040627,0.067333,-0.452978,0.686223
2019-08-30,-0.279572,-0.702492,0.252265,0.958977
2019-08-31,0.537438,-1.737568,0.714727,-0.939288


In [14]:
df.tail()

Unnamed: 0,A,B,C,D
2019-09-01,0.070011,-0.516443,-1.655689,0.246721
2019-09-02,0.001268,0.951517,2.10736,-0.108726
2019-09-03,-0.185258,0.85652,-0.686285,1.104195
2019-09-04,0.387023,1.706336,-2.452653,0.260466
2019-09-05,-1.054974,0.556775,-0.945219,-0.030295


In [15]:
df['A']

2019-08-27    0.187497
2019-08-28    0.360803
2019-08-29   -0.040627
2019-08-30   -0.279572
2019-08-31    0.537438
2019-09-01    0.070011
2019-09-02    0.001268
2019-09-03   -0.185258
2019-09-04    0.387023
2019-09-05   -1.054974
Freq: D, Name: A, dtype: float64

In [19]:
df.A  # returns a Series

2019-08-27    0.187497
2019-08-28    0.360803
2019-08-29   -0.040627
2019-08-30   -0.279572
2019-08-31    0.537438
2019-09-01    0.070011
2019-09-02    0.001268
2019-09-03   -0.185258
2019-09-04    0.387023
2019-09-05   -1.054974
Freq: D, Name: A, dtype: float64

In [20]:
df[['A','B']]  # returns a DataFrame

Unnamed: 0,A,B
2019-08-27,0.187497,1.12215
2019-08-28,0.360803,-0.562243
2019-08-29,-0.040627,0.067333
2019-08-30,-0.279572,-0.702492
2019-08-31,0.537438,-1.737568
2019-09-01,0.070011,-0.516443
2019-09-02,0.001268,0.951517
2019-09-03,-0.185258,0.85652
2019-09-04,0.387023,1.706336
2019-09-05,-1.054974,0.556775
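
The difference between single and double brackets can be checked directly. A minimal sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

one_col = frame["A"]          # single label -> Series
one_col_df = frame[["A"]]     # list with one label -> DataFrame
two_cols = frame[["A", "C"]]  # list of labels -> DataFrame

print(type(one_col).__name__, type(one_col_df).__name__)
print(two_cols.shape)
```

Attribute access such as `frame.A` only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute, so bracket access is the safer default.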


### Extracting part of a DataFrame

In [21]:
df[2:4]

Unnamed: 0,A,B,C,D
2019-08-29,-0.040627,0.067333,-0.452978,0.686223
2019-08-30,-0.279572,-0.702492,0.252265,0.958977


#### Using iloc

In [24]:
df.iloc[[2,4]]  

Unnamed: 0,A,B,C,D
2019-08-29,-0.040627,0.067333,-0.452978,0.686223
2019-08-31,0.537438,-1.737568,0.714727,-0.939288


In [27]:
df.iloc[2]  # row at position 2, returned as a Series

A   -0.040627
B    0.067333
C   -0.452978
D    0.686223
Name: 2019-08-29 00:00:00, dtype: float64

In [28]:
df.iloc[[2]]  # same row at position 2, returned as a one-row DataFrame

Unnamed: 0,A,B,C,D
2019-08-29,-0.040627,0.067333,-0.452978,0.686223
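
The following sketch summarizes how the type of the `iloc` argument determines what comes back. The index labels `x`, `y`, `z` are hypothetical and are used only to show that `iloc` ignores labels and works by position.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1, 2, 3]},
    index=["x", "y", "z"],  # hypothetical labels
)

row_series = frame.iloc[2]    # integer -> Series holding row "z"
row_frame = frame.iloc[[2]]   # list -> DataFrame with one row
rows_slice = frame.iloc[0:2]  # slice -> rows at positions 0 and 1

print(type(row_series).__name__, type(row_frame).__name__)
print(list(rows_slice.index))
```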


### Two-dimensional extraction from a DataFrame

In [29]:
df.iloc[2:4, 1:4]

Unnamed: 0,B,C,D
2019-08-29,0.067333,-0.452978,0.686223
2019-08-30,-0.702492,0.252265,0.958977


In [30]:
df.iloc[[2,4], [1,3]]

Unnamed: 0,B,D
2019-08-29,0.067333,0.686223
2019-08-31,-1.737568,-0.939288
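
With random values it is hard to verify which cells were selected, so this sketch repeats both forms on a frame filled with consecutive integers (an illustrative choice, not the notebook's data).

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

block = grid.iloc[0:2, 1:3]         # rows 0-1, columns B-C (slice ends excluded)
picked = grid.iloc[[0, 2], [1, 3]]  # rows 0 and 2, columns B and D

print(block)
print(picked)
```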


### Label-based extraction

In [31]:
df['20190827':'20190829']

Unnamed: 0,A,B,C,D
2019-08-27,0.187497,1.12215,-0.988277,-1.985934
2019-08-28,0.360803,-0.562243,-0.340693,-0.986988
2019-08-29,-0.040627,0.067333,-0.452978,0.686223


In [34]:
df.loc['20190827':'20190829', 'A':'C']

Unnamed: 0,A,B,C
2019-08-27,0.187497,1.12215,-0.988277
2019-08-28,0.360803,-0.562243,-0.340693
2019-08-29,-0.040627,0.067333,-0.452978


In [35]:
df.loc['20190827':'20190829', ['A','C']]

Unnamed: 0,A,C
2019-08-27,0.187497,-0.988277
2019-08-28,0.360803,-0.340693
2019-08-29,-0.040627,-0.452978


In [36]:
df.loc['20190827']

A    0.187497
B    1.122150
C   -0.988277
D   -1.985934
Name: 2019-08-27 00:00:00, dtype: float64
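
One detail worth noting in the outputs above is that label slices with `loc` include both endpoints, whereas positional slices with `iloc` exclude the end. A minimal sketch on an invented five-day frame:

```python
import pandas as pd

days = pd.date_range("2019-08-27", periods=5)
ts = pd.DataFrame({"A": range(5), "B": range(5, 10)}, index=days)

by_label = ts.loc["2019-08-27":"2019-08-29", "A":"B"]  # both ends included
by_position = ts.iloc[0:2]                             # end position excluded

print(by_label.shape)     # three rows: 27, 28 and 29 August
print(by_position.shape)  # two rows: positions 0 and 1
```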

In [38]:
df.loc[['20190827','20190829']]  # raises a KeyError in the pandas version used here

KeyError: u"None of [Index([u'20190827', u'20190829'], dtype='object')] are in the [index]"

To fix this, pass the labels as datetime objects rather than as strings.

In [41]:
from datetime import datetime

date1 = datetime(2019,8,27,0,0,0)
date2 = datetime(2019,8,29,0,0,0)

df.loc[[date1,date2]]

Unnamed: 0,A,B,C,D
2019-08-27,0.187497,1.12215,-0.988277,-1.985934
2019-08-29,-0.040627,0.067333,-0.452978,0.686223
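
As an alternative to building each `datetime` by hand, the strings can be converted with `pd.to_datetime` before indexing. This is a sketch on an invented frame; the explicit `format` argument is an assumption made here to avoid relying on format inference.

```python
import pandas as pd

days = pd.date_range("2019-08-27", periods=5)
ts = pd.DataFrame({"A": range(5)}, index=days)

# Convert the strings to a DatetimeIndex, then select by label.
wanted = pd.to_datetime(["20190827", "20190829"], format="%Y%m%d")
subset = ts.loc[wanted]

print(subset)
```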
