# First pandas class

We will be talking about **pandas**, the *Python* library.

**bold**

*italic*

***bold italic***

***

# Markdown

## Heading 2
### Heading 3
#### Heading 4

- list item 1
- list item 2
- list item 3

[More information in the Markdown Guide](https://www.markdownguide.org/basic-syntax/)

***

## Importing the libraries

In [96]:
import pandas as pd
import numpy as np
import matplotlib  # needed so that matplotlib.__version__ works below
import matplotlib.pyplot as plt

In [98]:
matplotlib.__version__

'3.2.1'

In [2]:
np.__version__

'1.18.2'

In [4]:
# Check the versions
pd.__version__

'0.23.0'

## Creating objects

We create a `Series` by passing it a sequence of values.

In [5]:
# Pass a list of values in square brackets [n1, n2, nx, ...]
# Python functions take their arguments in parentheses; press Shift + Tab in the notebook to see them

s = pd.Series([1, 3, 5, 6, 9])
s

0    1
1    3
2    5
3    6
4    9
dtype: int64

The result is a Series of integer values, with dtype `int64`.
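
As a quick check, here is a minimal sketch that inspects the dtype and the default index of such a Series. The name `s_demo` is illustrative and not part of the class material.

```python
import pandas as pd

# Same values as the Series `s` above; `s_demo` is an illustrative name
s_demo = pd.Series([1, 3, 5, 6, 9])

print(s_demo.dtype)        # an integer dtype (int64 on most platforms)
print(list(s_demo.index))  # default positional index: [0, 1, 2, 3, 4]
print(s_demo[2])           # value stored under index label 2 -> 5
```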

In [7]:
s1 = pd.Series([1.0, 5, 7, 0, np.nan, 11])
s1

0     1.0
1     5.0
2     7.0
3     0.0
4     NaN
5    11.0
dtype: float64

The values in this second Series are floating-point numbers, with dtype `float64`.
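
One reason a Series of whole numbers can end up as `float64` is a missing value: `np.nan` is itself a float. A small sketch of that behaviour, with illustrative variable names:

```python
import numpy as np
import pandas as pd

ints_only = pd.Series([1, 5, 7])         # integers only
with_nan = pd.Series([1, 5, 7, np.nan])  # same values plus a missing one

print(ints_only.dtype)        # an integer dtype
print(with_nan.dtype)         # float64, because NaN is a float
print(with_nan.isna().sum())  # number of missing values -> 1
```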

In [10]:
s2 = pd.Series(["Bob", "Marco", 7, True, 9.0, np.nan])
s2

0      Bob
1    Marco
2        7
3     True
4        9
5      NaN
dtype: object

In [17]:
nombres = ["Bob", "Marco", "Jhon", "Nicole", False]
nombres

['Bob', 'Marco', 'Jhon', 'Nicole', False]

In [20]:
type(nombres)

list

In [23]:
# The first element has index 0
nombres[0]

'Bob'

In [24]:
# The third value has index 2
nombres[2]

'Jhon'

In [14]:
s3 = pd.Series(nombres)
s3

0       Bob
1     Marco
2      Jhon
3    Nicole
4     False
dtype: object

### Summary:

- A Series is a one-dimensional, indexed sequence of values
- If all values are numeric, the Series has a numeric dtype such as `int64` or `float64`
- If the values are mixed, the Series has dtype `object`; a two-dimensional table is a `DataFrame`
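
To make the distinction between dimensions and dtypes concrete, here is a small sketch. The names `numeric`, `mixed` and `table` and their values are made up for illustration.

```python
import pandas as pd

numeric = pd.Series([1, 2, 3])
mixed = pd.Series(["Bob", 7, True])
table = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

print(numeric.ndim, mixed.ndim)  # 1 1  (a Series is one-dimensional)
print(mixed.dtype)               # object
print(table.ndim, table.shape)   # 2 (3, 2)
```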

***

### Creating a `DataFrame`

In [16]:
# Create a range of dates
dates = pd.date_range(
    '20200101',
    periods=10,
    freq='M'
)
dates

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31'],
              dtype='datetime64[ns]', freq='M')
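
The `freq` argument controls the spacing of the generated dates. For comparison with the month-end dates above, this sketch uses two other aliases, `'D'` (calendar day) and `'MS'` (month start). As far as we know, recent pandas releases (2.2 and later) prefer the alias `'ME'` over `'M'` for month end, so the cell above may raise a deprecation warning there.

```python
import pandas as pd

daily = pd.date_range("2020-01-01", periods=3, freq="D")
month_starts = pd.date_range("2020-01-01", periods=3, freq="MS")

print(list(daily.strftime("%Y-%m-%d")))         # ['2020-01-01', '2020-01-02', '2020-01-03']
print(list(month_starts.strftime("%Y-%m-%d")))  # ['2020-01-01', '2020-02-01', '2020-03-01']
```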

In [39]:
# Build the DataFrame
df = pd.DataFrame(
    np.random.randn(10,8),
    index=dates,
    columns=list('ABCDEFGH')
)
df

Unnamed: 0,A,B,C,D,E,F,G,H
2020-01-31,-0.385702,-1.707119,0.511851,0.146049,0.542037,-0.415407,-1.308496,-0.884677
2020-02-29,0.226818,-0.946026,-0.982451,0.413264,-0.220793,-1.569928,-0.772236,0.692063
2020-03-31,-1.253948,1.391175,-0.199537,-0.76119,0.474623,0.357972,-0.176746,1.260548
2020-04-30,0.232865,-0.648775,-1.141502,0.749856,1.285802,-0.564492,-0.844324,-0.899584
2020-05-31,0.420269,0.709554,0.127297,-0.115412,-0.003018,-0.660336,-1.82939,0.694976
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231
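
Because the values come from a random generator, re-running the cell gives different numbers each time. If reproducible output is wanted, one option is a seeded generator. The sketch below assumes NumPy 1.17 or later for `np.random.default_rng`; the seed 42 and the frame names are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # the seed value is arbitrary
df_a = pd.DataFrame(rng.standard_normal((4, 3)), columns=list("ABC"))

rng = np.random.default_rng(42)  # same seed again
df_b = pd.DataFrame(rng.standard_normal((4, 3)), columns=list("ABC"))

print(df_a.equals(df_b))  # True: same seed, same values
```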


### Another way to create a DataFrame

Another option is to pass it a set of `key-value` pairs, in the form of a `dictionary`:
```
{
    key1: val1,
    key2: val2,
    ...
    keyn: valn
}
```

The values (`val`) can be:

- a number
- a string
- a boolean
- empty or null
- a list of values
- a dictionary
- a list of dictionaries

```
{
  key1: [            // this whole list is val1
        { key2: val2 }
    ]
}
```
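
Two of these shapes can be passed straight to `pd.DataFrame`. The sketch below uses made-up names and ages: a dictionary of lists, where each key becomes a column, and a list of dictionaries, where each dictionary becomes a row.

```python
import pandas as pd

# Dictionary of lists: each key becomes a column
from_dict = pd.DataFrame({"name": ["Bob", "Marco"], "age": [30, 25]})

# List of dictionaries: each dictionary becomes a row
from_records = pd.DataFrame([
    {"name": "Bob", "age": 30},
    {"name": "Marco", "age": 25},
])

print(from_dict.equals(from_records))  # True: both describe the same table
```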

In [48]:
df2 = pd.DataFrame(
    {
            'A' : 1.5,
            'B' : pd.Timestamp('20201110'),
            'C' : "Texto",
            'D' : True,
            'E' : np.nan,
            'F' : pd.Categorical(["ml", "dl", "ai", "pd", "np", "iu", "io", "po", "lo", "mu" ]) 
    },
    index=dates
)
df2

Unnamed: 0,A,B,C,D,E,F
2020-01-31,1.5,2020-11-10,Texto,True,,ml
2020-02-29,1.5,2020-11-10,Texto,True,,dl
2020-03-31,1.5,2020-11-10,Texto,True,,ai
2020-04-30,1.5,2020-11-10,Texto,True,,pd
2020-05-31,1.5,2020-11-10,Texto,True,,np
2020-06-30,1.5,2020-11-10,Texto,True,,iu
2020-07-31,1.5,2020-11-10,Texto,True,,io
2020-08-31,1.5,2020-11-10,Texto,True,,po
2020-09-30,1.5,2020-11-10,Texto,True,,lo
2020-10-31,1.5,2020-11-10,Texto,True,,mu


In [49]:
df3 = pd.DataFrame(
    {
            'A' : 1.5,
            'B' : pd.Timestamp('20201110'),
            'C' : "Texto",
            'D' : True,
            'E' : np.nan,
            'F' : pd.Categorical(["ml", "dl", "ai", "pd", "np", "iu", "io", "po", "lo", "mu" ]) 
    }
)
df3

Unnamed: 0,A,B,C,D,E,F
0,1.5,2020-11-10,Texto,True,,ml
1,1.5,2020-11-10,Texto,True,,dl
2,1.5,2020-11-10,Texto,True,,ai
3,1.5,2020-11-10,Texto,True,,pd
4,1.5,2020-11-10,Texto,True,,np
5,1.5,2020-11-10,Texto,True,,iu
6,1.5,2020-11-10,Texto,True,,io
7,1.5,2020-11-10,Texto,True,,po
8,1.5,2020-11-10,Texto,True,,lo
9,1.5,2020-11-10,Texto,True,,mu


In [50]:
df4 = pd.DataFrame(
    {
            'A' : 1.5,
            'B' : pd.Timestamp('20201110'),
            'C' : "Texto",
            'D' : True,
            'E' : np.nan,
            'F' : ["ml", "dl", "ai", "pd", "np", "iu", "io", "po", "lo", "mu" ]
    }
)
df4

Unnamed: 0,A,B,C,D,E,F
0,1.5,2020-11-10,Texto,True,,ml
1,1.5,2020-11-10,Texto,True,,dl
2,1.5,2020-11-10,Texto,True,,ai
3,1.5,2020-11-10,Texto,True,,pd
4,1.5,2020-11-10,Texto,True,,np
5,1.5,2020-11-10,Texto,True,,iu
6,1.5,2020-11-10,Texto,True,,io
7,1.5,2020-11-10,Texto,True,,po
8,1.5,2020-11-10,Texto,True,,lo
9,1.5,2020-11-10,Texto,True,,mu


In [55]:
# Check the data types of this DataFrame
df3.dtypes

A           float64
B    datetime64[ns]
C            object
D              bool
E           float64
F          category
dtype: object

In [56]:
df4.dtypes

A           float64
B    datetime64[ns]
C            object
D              bool
E           float64
F            object
dtype: object
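
The only difference between the two outputs is column `F`: `category` versus `object`. A categorical column stores each distinct label once and refers to it by an integer code. This rough sketch shows the idea, with illustrative labels and names:

```python
import pandas as pd

labels = ["ml", "dl", "ml", "ai", "dl"]  # illustrative values
as_category = pd.Series(labels, dtype="category")

print(list(as_category.cat.categories))  # ['ai', 'dl', 'ml']
print(list(as_category.cat.codes))       # [2, 1, 2, 0, 1]
```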

In [57]:
categorias = pd.Categorical([1, 5, 7, 8])
categorias

[1, 5, 7, 8]
Categories (4, int64): [1, 5, 7, 8]

In [58]:
var_num = [1, 5, 7, 8]
var_num

[1, 5, 7, 8]

In [59]:
type(var_num)

list

In [60]:
type(categorias)

pandas.core.arrays.categorical.Categorical

In [65]:
type(df4)

pandas.core.frame.DataFrame

## Basic DataFrame operations

In [67]:
# Look at the top of the DataFrame
df4.head()  # returns the first 5 rows

Unnamed: 0,A,B,C,D,E,F
0,1.5,2020-11-10,Texto,True,,ml
1,1.5,2020-11-10,Texto,True,,dl
2,1.5,2020-11-10,Texto,True,,ai
3,1.5,2020-11-10,Texto,True,,pd
4,1.5,2020-11-10,Texto,True,,np


In [70]:
df4.head(n=7)  # specify how many rows we want to see

Unnamed: 0,A,B,C,D,E,F
0,1.5,2020-11-10,Texto,True,,ml
1,1.5,2020-11-10,Texto,True,,dl
2,1.5,2020-11-10,Texto,True,,ai
3,1.5,2020-11-10,Texto,True,,pd
4,1.5,2020-11-10,Texto,True,,np
5,1.5,2020-11-10,Texto,True,,iu
6,1.5,2020-11-10,Texto,True,,io


In [71]:
# Look at the last 5 rows of the DataFrame
df.tail()

Unnamed: 0,A,B,C,D,E,F,G,H
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231
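
Both `head` and `tail` accept the number of rows as an argument. A minimal sketch on a made-up ten-row frame (the name `frame` and its column `n` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"n": range(10)})

print(frame.head(3)["n"].tolist())  # [0, 1, 2]
print(frame.tail(2)["n"].tolist())  # [8, 9]
print(len(frame.head()))            # 5, the default number of rows
```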


In [72]:
# Look at the index of the DataFrame
df.index

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31'],
              dtype='datetime64[ns]', freq='M')

In [73]:
df4.index

RangeIndex(start=0, stop=10, step=1)

In [75]:
df2.index

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31'],
              dtype='datetime64[ns]', freq='M')

In [76]:
# Look at the columns
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], dtype='object')

In [77]:
df4.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [78]:
df3.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [79]:
# Check the basic summary statistics of a DataFrame
df.describe()

Unnamed: 0,A,B,C,D,E,F,G,H
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.223238,0.301675,-0.220041,0.270886,0.300727,-0.345809,-0.468605,0.116222
std,1.132643,1.248243,0.573754,0.739704,0.869208,0.570342,1.103654,0.709592
min,-1.253948,-1.707119,-1.141502,-0.76119,-0.844267,-1.569928,-1.82939,-0.899584
25%,-0.564428,-0.806415,-0.673349,-0.072957,-0.42035,-0.636375,-1.100075,-0.315218
50%,0.229841,0.756523,-0.090924,0.238089,0.235803,-0.243442,-0.75791,0.211269
75%,0.816526,1.155859,0.210333,0.665708,0.830309,0.041246,-0.081864,0.627855
max,2.587899,2.126415,0.511851,1.736665,1.846932,0.357972,2.190316,1.260548
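
The rows of `describe()` can be read back individually with `.loc`. The sketch below uses a small made-up column so the numbers are easy to verify by hand. It also suggests that the `std` row is the sample standard deviation, the same value `Series.std()` returns by default.

```python
import pandas as pd

small = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})  # illustrative data
summary = small.describe()

print(summary.loc["mean", "x"])           # 2.5
print(summary.loc["50%", "x"])            # 2.5 (the median)
print(round(summary.loc["std", "x"], 4))  # 1.291
```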


In [81]:
# Transpose a DataFrame
transponer = df.T
transponer

Unnamed: 0,2020-01-31 00:00:00,2020-02-29 00:00:00,2020-03-31 00:00:00,2020-04-30 00:00:00,2020-05-31 00:00:00,2020-06-30 00:00:00,2020-07-31 00:00:00,2020-08-31 00:00:00,2020-09-30 00:00:00,2020-10-31 00:00:00
A,-0.385702,0.226818,-1.253948,0.232865,0.420269,2.587899,1.052504,-0.624004,0.948611,-0.972929
B,-1.707119,-0.946026,1.391175,-0.648775,0.709554,0.803492,0.908775,2.126415,1.23822,-0.858961
C,0.511851,-0.982451,-0.199537,-1.141502,0.127297,0.238012,0.017689,0.275771,-0.224607,-0.822929
D,0.146049,0.413264,-0.76119,0.749856,-0.115412,0.330129,0.054407,-0.687683,0.842772,1.736665
E,0.542037,-0.220793,0.474623,1.285802,-0.003018,1.846932,0.9264,-0.844267,-0.486869,-0.513576
F,-0.415407,-1.569928,0.357972,-0.564492,-0.660336,-0.071476,-0.017651,-0.751106,0.17345,0.060879
G,-1.308496,-0.772236,-0.176746,-0.844324,-1.82939,-1.185325,-0.050236,-0.743584,0.033969,2.190316
H,-0.884677,0.692063,1.260548,-0.899584,0.694976,-0.207883,0.409796,0.012741,-0.350996,0.435231


In [82]:
transponer.columns

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31'],
              dtype='datetime64[ns]', freq='M')

In [83]:
transponer.index

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], dtype='object')
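
Transposing swaps the row index and the column labels. A small sketch on a made-up 3 x 2 frame shows the shape change, and that transposing twice returns the original table:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((3, 2)), index=["r1", "r2", "r3"], columns=["A", "B"])
flipped = wide.T

print(wide.shape, flipped.shape)  # (3, 2) (2, 3)
print(list(flipped.index))        # ['A', 'B']
print(flipped.T.equals(wide))     # True
```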

In [85]:
# Transposing back makes sense here, because the transposed table has more columns than rows
df5 = transponer.T
df5

Unnamed: 0,A,B,C,D,E,F,G,H
2020-01-31,-0.385702,-1.707119,0.511851,0.146049,0.542037,-0.415407,-1.308496,-0.884677
2020-02-29,0.226818,-0.946026,-0.982451,0.413264,-0.220793,-1.569928,-0.772236,0.692063
2020-03-31,-1.253948,1.391175,-0.199537,-0.76119,0.474623,0.357972,-0.176746,1.260548
2020-04-30,0.232865,-0.648775,-1.141502,0.749856,1.285802,-0.564492,-0.844324,-0.899584
2020-05-31,0.420269,0.709554,0.127297,-0.115412,-0.003018,-0.660336,-1.82939,0.694976
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231


In [89]:
# Sort by the index labels or by the column labels
df5.sort_index(axis=0, ascending=False)  # axis=0 refers to the rows

Unnamed: 0,A,B,C,D,E,F,G,H
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-05-31,0.420269,0.709554,0.127297,-0.115412,-0.003018,-0.660336,-1.82939,0.694976
2020-04-30,0.232865,-0.648775,-1.141502,0.749856,1.285802,-0.564492,-0.844324,-0.899584
2020-03-31,-1.253948,1.391175,-0.199537,-0.76119,0.474623,0.357972,-0.176746,1.260548
2020-02-29,0.226818,-0.946026,-0.982451,0.413264,-0.220793,-1.569928,-0.772236,0.692063
2020-01-31,-0.385702,-1.707119,0.511851,0.146049,0.542037,-0.415407,-1.308496,-0.884677


In [91]:
# Sort by the index labels or by the column labels
df5.sort_index(axis=0, ascending=True)  # axis=0 sorts the rows; axis=1 would sort the columns

Unnamed: 0,A,B,C,D,E,F,G,H
2020-01-31,-0.385702,-1.707119,0.511851,0.146049,0.542037,-0.415407,-1.308496,-0.884677
2020-02-29,0.226818,-0.946026,-0.982451,0.413264,-0.220793,-1.569928,-0.772236,0.692063
2020-03-31,-1.253948,1.391175,-0.199537,-0.76119,0.474623,0.357972,-0.176746,1.260548
2020-04-30,0.232865,-0.648775,-1.141502,0.749856,1.285802,-0.564492,-0.844324,-0.899584
2020-05-31,0.420269,0.709554,0.127297,-0.115412,-0.003018,-0.660336,-1.82939,0.694976
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231
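
Both calls above use `axis=0`, so only the row order changes. For completeness, here is a sketch of `axis=1` on a small made-up frame, which reorders the columns by their labels instead:

```python
import pandas as pd

frame = pd.DataFrame({"B": [1, 2], "A": [3, 4], "C": [5, 6]})  # illustrative data

by_columns = frame.sort_index(axis=1, ascending=False)

print(list(frame.columns))       # ['B', 'A', 'C']  original order
print(list(by_columns.columns))  # ['C', 'B', 'A']  labels sorted in descending order
```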


In [93]:
df5.sort_values(by="D", ascending=False)  # sort by column D, from largest to smallest

Unnamed: 0,A,B,C,D,E,F,G,H
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-04-30,0.232865,-0.648775,-1.141502,0.749856,1.285802,-0.564492,-0.844324,-0.899584
2020-02-29,0.226818,-0.946026,-0.982451,0.413264,-0.220793,-1.569928,-0.772236,0.692063
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-01-31,-0.385702,-1.707119,0.511851,0.146049,0.542037,-0.415407,-1.308496,-0.884677
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-05-31,0.420269,0.709554,0.127297,-0.115412,-0.003018,-0.660336,-1.82939,0.694976
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-03-31,-1.253948,1.391175,-0.199537,-0.76119,0.474623,0.357972,-0.176746,1.260548


In [94]:
df5.sort_values(by="D", ascending=True)  # sort by column D, from smallest to largest

Unnamed: 0,A,B,C,D,E,F,G,H
2020-03-31,-1.253948,1.391175,-0.199537,-0.76119,0.474623,0.357972,-0.176746,1.260548
2020-08-31,-0.624004,2.126415,0.275771,-0.687683,-0.844267,-0.751106,-0.743584,0.012741
2020-05-31,0.420269,0.709554,0.127297,-0.115412,-0.003018,-0.660336,-1.82939,0.694976
2020-07-31,1.052504,0.908775,0.017689,0.054407,0.9264,-0.017651,-0.050236,0.409796
2020-01-31,-0.385702,-1.707119,0.511851,0.146049,0.542037,-0.415407,-1.308496,-0.884677
2020-06-30,2.587899,0.803492,0.238012,0.330129,1.846932,-0.071476,-1.185325,-0.207883
2020-02-29,0.226818,-0.946026,-0.982451,0.413264,-0.220793,-1.569928,-0.772236,0.692063
2020-04-30,0.232865,-0.648775,-1.141502,0.749856,1.285802,-0.564492,-0.844324,-0.899584
2020-09-30,0.948611,1.23822,-0.224607,0.842772,-0.486869,0.17345,0.033969,-0.350996
2020-10-31,-0.972929,-0.858961,-0.822929,1.736665,-0.513576,0.060879,2.190316,0.435231
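
`sort_values` also accepts a list of columns, with one sort direction per column, and it returns a new DataFrame rather than changing the original. A short sketch with made-up data:

```python
import pandas as pd

scores = pd.DataFrame({"group": ["b", "a", "b", "a"], "value": [2, 9, 7, 4]})

# Sort by group ascending, then by value descending within each group
ordered = scores.sort_values(by=["group", "value"], ascending=[True, False])

print(ordered["value"].tolist())  # [9, 4, 7, 2]
print(scores["value"].tolist())   # [2, 9, 7, 4]  the original is unchanged
```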
