## What are NumPy and pandas?

![](notas_files/320px-NumPy_logo_2020.svg.png)

NumPy is an open-source Python library for scientific computing. It provides a set of features that let a Python programmer work with high-performance arrays and matrices.

![](notas_files/320px-Pandas_logo.svg.png)

pandas is a data-manipulation package that brings a DataFrame object, modeled on R's data frames (and on several R packages), to a Python environment.

NumPy and pandas are often used together, because pandas builds on NumPy: it relies heavily on the NumPy array to implement its own data objects and shares many of its features. Both libraries belong to what is known as the SciPy stack, a set of Python libraries used for scientific computing.

In [1]:
import pandas as pd
import numpy as np 

### NumPy arrays

NumPy lets you work with high-performance arrays and matrices. Its main data object is the ndarray, an N-dimensional array type that describes a collection of "items" of the same type. For example:

In [2]:
np.array([1, 2, 3, 4, 5])  # define the ndarray

array([1, 2, 3, 4, 5])

ndarrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which gives significantly better performance than Python loop constructs.
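
As a minimal sketch of what "vectorized" means here (the variable names are illustrative, not part of the original notes), the same squares can be computed element by element in a Python loop or with a single array expression:

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Loop version: one Python-level multiplication per element
squared_loop = [v * v for v in values.tolist()]

# Vectorized version: one expression applied to the whole array
squared_vec = values * values

print(squared_loop)
print(squared_vec.tolist())
```

Both versions give the same values; on large arrays the vectorized form is usually the faster one because the loop runs inside NumPy rather than in the Python interpreter.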

NumPy arrays support element selection, logical operations, slicing, reshaping, combining (also known as "stacking"), splitting, and a range of numerical methods (minimum, maximum, mean, standard deviation, variance, and more). All of these concepts carry over to pandas objects, which extend these capabilities to offer a much richer and more expressive way of representing and manipulating data than NumPy arrays provide.
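
The operations listed above can be sketched on a small array. This is only an illustration with made-up values; each line maps to one of the capabilities named in the paragraph:

```python
import numpy as np

m = np.arange(6)                   # [0 1 2 3 4 5]

third = m[2]                       # select one element by position
evens = m[m % 2 == 0]              # logical (boolean) selection
middle = m[1:4]                    # slice
grid = m.reshape(2, 3)             # reshape to 2 rows x 3 columns
stacked = np.vstack([grid, grid])  # combine ("stack") two arrays vertically
left, right = np.split(m, 2)       # split into two equal parts

print(third, evens, middle)
print(grid.shape, stacked.shape)
print(m.min(), m.max(), m.mean(), m.std(), m.var())
```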

## Series

The Series is the core building block of pandas. A Series is a one-dimensional, labeled, indexed array built on the NumPy ndarray.

> ``` s = pd.Series(data, index=index) ```

Like an array, a Series can hold zero or more values of any data type. A Series can be created and initialized by passing a scalar value, a NumPy ndarray, a Python list, or a Python dict as the data parameter of the Series constructor.

##### From a list

In [3]:
a = pd.Series([1, 3, 5, np.nan, 6, 8])
a

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

###### Automatically generated index

In [4]:
a.index

RangeIndex(start=0, stop=6, step=1)

##### From a dictionary

In [5]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

##### From a scalar value

In [6]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

### Differences between ndarrays and Series objects

There are some differences worth noting between ndarrays and Series objects. First, the elements of a NumPy array are accessed by integer position, starting at zero for the first element. A pandas Series is more flexible: you can define your own labeled index and use it to access elements. The labels can be letters instead of numbers, or numbers in descending rather than ascending order. Second, aligning data from different Series and matching labels between Series objects is easier than with ndarrays, for example when dealing with missing values. If a label has no match during alignment, pandas returns NaN (not a number) for it, so the operation does not fail.
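
A minimal sketch of label alignment, using two made-up Series whose indexes only partly overlap. Labels present in only one of them come out as NaN instead of raising an error:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position
total = s1 + s2
print(total)
```

Here `'b'` and `'c'` are added label by label, while `'a'` and `'d'` have no partner and become NaN.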

## DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- A dict of 1-D ndarrays, lists, dicts, or Series

- A 2-D numpy.ndarray

- A structured ndarray

- A Series

- Another DataFrame


### Creating a DataFrame (df)


#### From NumPy


In [7]:
dates = pd.date_range('20130101', periods=6)
dates


DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
df = pd.DataFrame(np.random.randn(6, 4),
                  index=dates,
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.725486,0.941799,0.651274,-0.225261
2013-01-02,-1.458111,-0.436451,1.343954,-0.261167
2013-01-03,0.296856,-0.756599,-0.482995,-0.763437
2013-01-04,0.974857,1.041274,-0.133294,-1.463526
2013-01-05,0.762019,-0.573106,0.172002,0.313105
2013-01-06,0.662657,-0.005017,0.51157,-0.27687


#### From Python dictionaries


In [9]:
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
    'G': [1,2,1,2]
})
df2

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,2013-01-02,1.0,3,test,foo,1
1,1.0,2013-01-02,1.0,3,train,foo,2
2,1.0,2013-01-02,1.0,3,test,foo,1
3,1.0,2013-01-02,1.0,3,train,foo,2


## Data types



| Pandas dtype     | Python type     | NumPy type     | Usage|
|-----------------|:---------------------:|:---------------------:|:---------------------:|
| object     | str     | string_, unicode_     | Text|
| int64     | int     | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64     | Integer numbers|
| float64     | float     | float_, float16, float32, float64     | Floating point numbers|
| bool     | bool     | bool_     | True/False values|
| datetime64[ns]     | NA     | datetime64[ns]     | Date and time values|
| timedelta64[ns]     | NA     | NA     | Differences between two datetimes|
| category     | NA     | NA     | Finite list of text values |

### Missing data

* None: the Pythonic missing value

* NaN (Not a Number): the representation of a missing number recognized by every system that uses the IEEE floating-point standard

> See https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html


In [10]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [11]:
np.nan + 1

nan

In [12]:
np.nan == '2'

False

In [13]:
np.nan > 0

False
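
The comparisons above come out False because NaN never compares equal to, or greater than, anything, including itself. For that reason missing values are usually detected with `isna()` rather than with `==`. A minimal sketch, reusing the Series from cell 10:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])

print(np.nan == np.nan)     # NaN is not equal to itself
print(s.isna().tolist())    # True marks the missing positions
print(s.isna().sum())       # number of missing values
print(s.dropna().tolist())  # the values that remain
```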

## Importing data


```python
pd.read_csv(
    '../../dataset/censo2010/hogar.csv',  # file path
    delimiter=',',  # delimiter: ',', ';', '|', '\t'
    header=0,  # row number to use as the column names
    names=None,  # column names (mind the interaction with header)
    index_col=0,  # which column is the index
    usecols=None,  # which columns to use, e.g. [0, 1, 2] or ['foo', 'bar', 'baz']
    dtype=None,  # column types, e.g. {'a': np.int32, 'b': str}
    skiprows=None,  # rows to skip at the start
    skipfooter=0,  # rows to skip at the end
    nrows=None,  # number of rows to read
    decimal='.',  # decimal separator, e.g. ',' for European data
    quotechar='"',  # character used to quote strings
    #encoding=None,  # for accents, ñ, etc.
)
```

In [14]:
df_in = pd.read_csv('../../../dataset/censo2010/hogar.csv')
df_in

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152594,1152595,1426434,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,1,1,0
1152595,1152596,1426435,1,1,1,1,1,1,1,1,...,1,2,1,2,2,1,1,1,1,0
1152596,1152597,1426436,1,1,1,1,1,1,1,1,...,2,3,1,1,1,1,1,3,3,0
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2



## Inspecting a df


In [15]:

df_in.head()


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1


In [16]:
df_in.tail(3)

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
1152596,1152597,1426436,1,1,1,1,1,1,1,1,...,2,3,1,1,1,1,1,3,3,0
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
1152598,1152599,1426438,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2


In [17]:
df_in.index

RangeIndex(start=0, stop=1152599, step=1)

In [18]:
type(df_in.index)

pandas.core.indexes.range.RangeIndex

In [19]:
df_in.columns

Index(['HOGAR_REF_ID', 'VIVIENDA_REF_ID', 'NHOG', 'H05', 'H06', 'H07', 'H08',
       'H09', 'H10', 'H11', 'H12', 'H13', 'H14', 'H15', 'H16', 'H19A', 'H19B',
       'H19C', 'H19D', 'PROP', 'INDHAC', 'TOTPERS', 'ALGUNBI'],
      dtype='object')

In [20]:
type(df_in.columns)

pandas.core.indexes.base.Index

In [21]:
df_in.values

array([[      1,       1,       1, ...,       1,       1,       0],
       [      2,       1,       2, ...,       1,       1,       0],
       [      3,       1,       4, ...,       1,       2,       0],
       ...,
       [1152597, 1426436,       1, ...,       3,       3,       0],
       [1152598, 1426437,    1000, ...,       7,       0,       2],
       [1152599, 1426438,    1000, ...,       7,       0,       2]])

In [22]:
type(df_in.values)

numpy.ndarray

In [23]:
df_in.shape

(1152599, 23)

## Selecting (select)

### Subsetting with []

#### Columns


##### Selecting a single column as a Series
A Series has two main components: the index and the data (values). A Series has no columns.
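
A minimal sketch of those two components on a toy frame (the rows are made up; only the column names are borrowed from the census file):

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [5, 1, 1], 'ALGUNBI': [0, 0, 1]})  # made-up rows

col = toy['ALGUNBI']
print(col.index)   # the row labels
print(col.values)  # the underlying NumPy array
print(col.name)    # the former column label becomes the Series name
```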

In [24]:
df_in['ALGUNBI']


0          0
1          0
2          0
3          1
4          1
          ..
1152594    0
1152595    0
1152596    0
1152597    2
1152598    2
Name: ALGUNBI, Length: 1152599, dtype: int64

In [25]:
type(df_in['ALGUNBI'])


pandas.core.series.Series

##### Selecting as a DataFrame


In [26]:
df_in[['ALGUNBI']]


Unnamed: 0,ALGUNBI
0,0
1,0
2,0
3,1
4,1
...,...
1152594,0
1152595,0
1152596,0
1152597,2


In [27]:
type(df_in[['ALGUNBI']])


pandas.core.frame.DataFrame

##### Selecting multiple columns as a DataFrame by passing a list


In [28]:
df_in[['PROP', 'ALGUNBI']]


Unnamed: 0,PROP,ALGUNBI
0,5,0
1,1,0
2,1,0
3,5,1
4,6,1
...,...,...
1152594,1,0
1152595,1,0
1152596,1,0
1152597,0,2


##### Changing the order
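
The notes leave this heading without a cell. One way to reorder columns, sketched here on a made-up frame, is to pass the column list in the desired order:

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [5, 1], 'INDHAC': [1, 1], 'ALGUNBI': [0, 0]})

# The result follows the order of the list, not the order in the frame
reordered = toy[['ALGUNBI', 'PROP', 'INDHAC']]
print(reordered.columns.tolist())
```

The original frame keeps its own column order; the selection returns a new object.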


#### Selecting by position


In [29]:
df_in.columns[0]


'HOGAR_REF_ID'

In [30]:
df_in[df_in.columns[0]]


0                1
1                2
2                3
3                4
4                5
            ...   
1152594    1152595
1152595    1152596
1152596    1152597
1152597    1152598
1152598    1152599
Name: HOGAR_REF_ID, Length: 1152599, dtype: int64

In [31]:
df_in[[df_in.columns[0]]]


Unnamed: 0,HOGAR_REF_ID
0,1
1,2
2,3
3,4
4,5
...,...
1152594,1152595
1152595,1152596
1152596,1152597
1152597,1152598


In [32]:
df_in[df_in.columns[[18, 21]]]



Unnamed: 0,H19D,TOTPERS
0,2,1
1,1,1
2,1,2
3,1,8
4,1,2
...,...,...
1152594,1,1
1152595,1,1
1152596,1,3
1152597,0,0


#### Rows

##### By range

In [33]:
df_in[0:1]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0


In [34]:
df_in[0:10]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
5,6,2,2,1,3,1,1,1,1,1,...,3,5,1,1,1,2,5,1,1,1
6,7,2,3,1,2,2,1,1,1,1,...,1,1,1,2,1,1,3,3,1,1
7,8,2,5,1,4,1,1,5,1,1,...,1,2,1,2,1,2,1,3,2,1
8,9,2,7,1,1,1,1,1,1,1,...,5,5,1,1,1,1,6,1,2,1
9,10,3,1,1,2,1,1,1,1,1,...,1,2,1,1,1,1,3,3,2,0


> Operator overloading
Certain operators behave differently depending on the objects they are applied to. For example, when indexing with `df[...]`, the behavior depends on what is passed:
* string: returns a column as a Series
* list of strings: returns columns as a DataFrame
* slice of integers or labels: returns rows (the slice bounds can be either integer positions or labels)
* sequence of booleans: returns the rows where the value is True
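
The four cases can be sketched side by side on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [5, 1, 1, 5], 'ALGUNBI': [0, 0, 0, 1]})

as_series = toy['PROP']              # string -> Series
as_frame = toy[['PROP', 'ALGUNBI']]  # list of strings -> DataFrame
first_rows = toy[0:2]                # slice -> rows 0 and 1
flagged = toy[toy['ALGUNBI'] > 0]    # boolean mask -> rows where True

print(type(as_series).__name__, type(as_frame).__name__)
print(first_rows.index.tolist(), flagged.index.tolist())
```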

### .loc

.loc indexing selects data by row, by column, or both at once, and it does so through labels.
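
Because `df_in` has a default integer index, labels and positions happen to coincide there. A minimal sketch with hypothetical string labels (`'h1'`, `'h2'`, `'h3'` are invented for illustration) makes the label-based behavior easier to see:

```python
import pandas as pd

toy = pd.DataFrame(
    {'PROP': [5, 1, 1], 'ALGUNBI': [0, 0, 1]},
    index=['h1', 'h2', 'h3'],  # hypothetical row labels
)

rows = toy.loc['h1':'h2']     # label slice: both ends are included
cell = toy.loc['h2', 'PROP']  # one row label and one column label

print(rows.index.tolist())
print(cell)
```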

#### A single row as a Series

In [35]:
df_in.loc[1]

HOGAR_REF_ID       2
VIVIENDA_REF_ID    1
NHOG               2
H05                1
H06                1
H07                1
H08                1
H09                1
H10                1
H11                1
H12                1
H13                1
H14                1
H15                1
H16                2
H19A               1
H19B               1
H19C               1
H19D               1
PROP               1
INDHAC             1
TOTPERS            1
ALGUNBI            0
Name: 1, dtype: int64

In [36]:
type(df_in.loc[1])

pandas.core.series.Series

In [37]:
df_in.loc[[1]]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0


In [38]:
type(df_in.loc[[1]])

pandas.core.frame.DataFrame

#### Several rows

In [39]:
df_in.loc[[1, 2]]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0


#### Many rows via index slices

In [40]:
df_in.loc[1:10]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
5,6,2,2,1,3,1,1,1,1,1,...,3,5,1,1,1,2,5,1,1,1
6,7,2,3,1,2,2,1,1,1,1,...,1,1,1,2,1,1,3,3,1,1
7,8,2,5,1,4,1,1,5,1,1,...,1,2,1,2,1,2,1,3,2,1
8,9,2,7,1,1,1,1,1,1,1,...,5,5,1,1,1,1,6,1,2,1
9,10,3,1,1,2,1,1,1,1,1,...,1,2,1,1,1,1,3,3,2,0
10,11,3,3,1,1,1,1,1,1,1,...,1,3,1,2,1,1,5,1,1,0


In [41]:
df_in.loc[:10]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
5,6,2,2,1,3,1,1,1,1,1,...,3,5,1,1,1,2,5,1,1,1
6,7,2,3,1,2,2,1,1,1,1,...,1,1,1,2,1,1,3,3,1,1
7,8,2,5,1,4,1,1,5,1,1,...,1,2,1,2,1,2,1,3,2,1
8,9,2,7,1,1,1,1,1,1,1,...,5,5,1,1,1,1,6,1,2,1
9,10,3,1,1,2,1,1,1,1,1,...,1,2,1,1,1,1,3,3,2,0


In [42]:
df_in.loc[10:]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
10,11,3,3,1,1,1,1,1,1,1,...,1,3,1,2,1,1,5,1,1,0
11,12,3,6,1,1,1,1,1,1,1,...,5,5,1,1,1,1,6,3,5,0
12,13,4,1,1,1,1,1,1,1,1,...,1,2,1,1,1,2,3,1,1,0
13,14,5,1,2,4,2,1,2,1,1,...,1,2,1,1,1,2,3,1,1,0
14,15,6,1,2,8,2,1,2,1,1,...,6,10,1,1,2,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152594,1152595,1426434,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,1,1,0
1152595,1152596,1426435,1,1,1,1,1,1,1,1,...,1,2,1,2,2,1,1,1,1,0
1152596,1152597,1426436,1,1,1,1,1,1,1,1,...,2,3,1,1,1,1,1,3,3,0
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2


#### Subsetting rows and columns with loc

##### Several rows

In [43]:
df_in.loc[[1, 2], ['PROP', 'ALGUNBI']]

Unnamed: 0,PROP,ALGUNBI
1,1,0
2,1,0


#### Selecting all of the rows and some columns

In [44]:
df_in.loc[:, ['PROP', 'ALGUNBI']]

Unnamed: 0,PROP,ALGUNBI
0,5,0
1,1,0
2,1,0
3,5,1
4,6,1
...,...,...
1152594,1,0
1152595,1,0
1152596,1,0
1152597,0,2


#### Selecting all of the rows and a slice of columns

In [45]:
df_in.loc[:, 'PROP':]

Unnamed: 0,PROP,INDHAC,TOTPERS,ALGUNBI
0,5,1,1,0
1,1,1,1,0
2,1,1,2,0
3,5,6,8,1
4,6,1,2,1
...,...,...,...,...
1152594,1,1,1,0
1152595,1,1,1,0
1152596,1,3,3,0
1152597,0,7,0,2


#### Accessing a single element

In [46]:
df_in.loc[1, 'PROP']

1

In [47]:
df_in.loc[:, 'PROP'][1]

1

In [48]:
df_in.loc[:, 'PROP'].loc[1]

1

In [49]:
df_in.loc[1]['PROP']

1

In [50]:
df_in.loc[1].loc['PROP']

1

#### The element as a Series

In [51]:
df_in.loc[[1], 'PROP']

1    1
Name: PROP, dtype: int64

In [52]:
type(df_in.loc[[1], 'PROP'])

pandas.core.series.Series

#### The element as a DataFrame

In [53]:
df_in.loc[[1], ['PROP']]

Unnamed: 0,PROP
1,1


In [54]:
type(df_in.loc[[1], ['PROP']])

pandas.core.frame.DataFrame

### .iloc
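
`.iloc` selects by integer position rather than by label, and its slices exclude the end position. A minimal sketch, using made-up labels that deliberately differ from the positions:

```python
import pandas as pd

toy = pd.DataFrame(
    {'PROP': [5, 1, 1, 5], 'ALGUNBI': [0, 0, 0, 1]},
    index=[10, 20, 30, 40],  # labels chosen to differ from positions
)

by_position = toy.iloc[1:3]  # positions 1 and 2 (end excluded)
by_label = toy.loc[20:30]    # labels 20 through 30 (end included)

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Both selections return the same two rows here, but for different reasons.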

In [55]:
df_in.iloc[0]

HOGAR_REF_ID       1
VIVIENDA_REF_ID    1
NHOG               1
H05                1
H06                4
H07                1
H08                1
H09                1
H10                1
H11                1
H12                1
H13                1
H14                1
H15                2
H16                2
H19A               1
H19B               2
H19C               1
H19D               2
PROP               5
INDHAC             1
TOTPERS            1
ALGUNBI            0
Name: 0, dtype: int64

In [56]:
df_in.iloc[[0]]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0


In [57]:
df_in.iloc[:, 18]

0          2
1          1
2          1
3          1
4          1
          ..
1152594    1
1152595    1
1152596    1
1152597    0
1152598    0
Name: H19D, Length: 1152599, dtype: int64

In [58]:
df_in.iloc[:, [18]]

Unnamed: 0,H19D
0,2
1,1
2,1
3,1
4,1
...,...
1152594,1
1152595,1
1152596,1
1152597,0


In [59]:
df_in.iloc[[5, 2, 4]]


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
5,6,2,2,1,3,1,1,1,1,1,...,3,5,1,1,1,2,5,1,1,1
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1


In [60]:
df_in.iloc[3:5]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1


In [61]:
df_in.iloc[3:5, [18, 21]]

Unnamed: 0,H19D,TOTPERS
3,1,8
4,1,2


In [62]:
df_in.iloc[3:5, 18:21]

Unnamed: 0,H19D,PROP,INDHAC
3,1,5,6
4,1,6,1


## Selecting where (where)

### [ ]

#### By hand

In [63]:
df_in_head = df_in.head()

In [64]:
booleano = [False, False, True, True, True]

In [65]:
type(booleano)

list

In [66]:
df_in_head[booleano]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1


In [67]:
booleano = np.array([False, False, True, True, True])


In [68]:
type(booleano)


numpy.ndarray

In [69]:
df_in_head[booleano]


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1


#### Simple conditions

##### Using comparison operators: '<', '>', '==', '>=', '<=', '!='


In [70]:
booleano = df_in.ALGUNBI > 0
df_in[booleano]


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
5,6,2,2,1,3,1,1,1,1,1,...,3,5,1,1,1,2,5,1,1,1
6,7,2,3,1,2,2,1,1,1,1,...,1,1,1,2,1,1,3,3,1,1
7,8,2,5,1,4,1,1,5,1,1,...,1,2,1,2,1,2,1,3,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152566,1152567,1426397,1,1,2,1,1,1,1,1,...,1,1,1,2,1,2,3,6,5,1
1152568,1152569,1426399,1,1,2,1,1,1,1,1,...,1,1,1,1,1,2,3,6,4,1
1152574,1152575,1426406,1,1,2,2,1,1,1,1,...,1,1,1,1,1,1,3,6,4,1
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2


In [71]:
booleano = df_in.ALGUNBI == 0
df_in[booleano]


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
9,10,3,1,1,2,1,1,1,1,1,...,1,2,1,1,1,1,3,3,2,0
10,11,3,3,1,1,1,1,1,1,1,...,1,3,1,2,1,1,5,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152592,1152593,1426431,1,1,2,1,1,1,1,1,...,1,2,1,2,2,1,4,1,1,0
1152593,1152594,1426432,1,1,1,1,1,1,1,1,...,2,2,1,1,1,1,3,4,3,0
1152594,1152595,1426434,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,1,1,0
1152595,1152596,1426435,1,1,1,1,1,1,1,1,...,1,2,1,2,2,1,1,1,1,0


#### Multiple conditions


##### Using &, |, ~
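
Each comparison needs its own parentheses, because `&` and `|` bind more tightly than the comparison operators in Python. A minimal sketch on a made-up frame, with the masks stored in variables so the combinations stay readable:

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [5, 1, 1, 1], 'ALGUNBI': [0, 0, 1, 2]})

a = toy['ALGUNBI'] > 0
b = toy['PROP'] == 1

both = toy[a & b]        # AND: both conditions hold
either = toy[a | b]      # OR: at least one condition holds
not_both = toy[~(a & b)] # NOT: negate the combined mask

# De Morgan: ~(a & b) is the same mask as (~a) | (~b)
print((~(a & b)).equals((~a) | (~b)))
```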

In [72]:
booleano = ~((df_in.ALGUNBI > 0) & (df_in.PROP == 1))
df_in[booleano]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152594,1152595,1426434,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,1,1,0
1152595,1152596,1426435,1,1,1,1,1,1,1,1,...,1,2,1,2,2,1,1,1,1,0
1152596,1152597,1426436,1,1,1,1,1,1,1,1,...,2,3,1,1,1,1,1,3,3,0
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2


In [73]:
booleano = ((df_in.ALGUNBI > 0) | (df_in.PROP != 1))
df_in[booleano]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
5,6,2,2,1,3,1,1,1,1,1,...,3,5,1,1,1,2,5,1,1,1
6,7,2,3,1,2,2,1,1,1,1,...,1,1,1,2,1,1,3,3,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152590,1152591,1426426,1,1,2,1,1,1,1,1,...,2,3,1,2,1,1,3,3,4,0
1152592,1152593,1426431,1,1,2,1,1,1,1,1,...,1,2,1,2,2,1,4,1,1,0
1152593,1152594,1426432,1,1,1,1,1,1,1,1,...,2,2,1,1,1,1,3,4,3,0
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2


In [74]:
booleano = (df_in.ALGUNBI > 0) & (df_in.PROP == 1)
df_in[booleano]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
7,8,2,5,1,4,1,1,5,1,1,...,1,2,1,2,1,2,1,3,2,1
57,58,45,1,2,2,1,1,1,1,1,...,2,3,1,1,1,1,1,6,8,1
101,102,90,1,2,4,2,1,1,1,2,...,1,1,1,2,1,2,1,5,3,1
109,110,96,1,3,8,2,3,1,2,0,...,1,1,2,2,1,2,1,6,5,1
115,116,100,1,2,4,2,1,1,1,2,...,1,1,1,2,1,2,1,6,7,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152385,1152386,1426165,1,1,1,1,2,1,1,1,...,2,2,1,2,1,1,1,3,2,1
1152386,1152387,1426166,1,1,1,1,2,1,1,1,...,2,2,1,2,1,1,1,1,1,1
1152387,1152388,1426167,1,1,1,1,2,1,1,1,...,2,2,1,2,1,1,1,5,5,1
1152388,1152389,1426168,1,1,1,1,2,1,1,1,...,2,2,1,2,1,1,1,1,1,1


#### Filtering by membership: isin
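
`isin` returns a boolean mask, so it can be negated with `~` to keep the rows whose value is not in the list. A minimal sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [0, 1, 2, 3]})

inside = toy[toy['PROP'].isin([1, 2])]    # PROP is 1 or 2
outside = toy[~toy['PROP'].isin([1, 2])]  # PROP is anything else

print(inside.index.tolist(), outside.index.tolist())
```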


In [75]:
df_in[ df_in.PROP.isin([1,2]) ]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
7,8,2,5,1,4,1,1,5,1,1,...,1,2,1,2,1,2,1,3,2,1
14,15,6,1,2,8,2,1,2,1,1,...,6,10,1,1,2,1,1,1,1,0
17,18,9,1,1,4,1,1,1,1,1,...,5,5,1,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152589,1152590,1426425,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,2,2,0
1152591,1152592,1426429,1,1,2,1,1,1,1,1,...,2,4,1,1,1,1,1,3,5,0
1152594,1152595,1426434,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,1,1,0
1152595,1152596,1426435,1,1,1,1,1,1,1,1,...,1,2,1,2,2,1,1,1,1,0


#### Finding missing values

In [76]:
df_in[ df_in.PROP.isnull() ]

Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI


### Using .loc to select columns with a boolean row mask

In [77]:
df_in.loc[ df_in.PROP.isin([1,2]), ['PROP','ALGUNBI']]


Unnamed: 0,PROP,ALGUNBI
1,1,0
2,1,0
7,1,1
14,1,0
17,1,0
...,...,...
1152589,1,0
1152591,1,0
1152594,1,0
1152595,1,0


#### Comparing two columns


In [78]:
df_in.loc[ df_in.PROP > df_in.ALGUNBI ]


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,1,1,1,1,4,1,1,1,1,1,...,2,2,1,2,1,2,5,1,1,0
1,2,1,2,1,1,1,1,1,1,1,...,1,2,1,1,1,1,1,1,1,0
2,3,1,4,1,3,1,1,1,1,1,...,3,5,1,1,1,1,1,1,2,0
3,4,1,6,1,1,1,1,1,1,1,...,3,5,1,1,1,1,5,6,8,1
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152592,1152593,1426431,1,1,2,1,1,1,1,1,...,1,2,1,2,2,1,4,1,1,0
1152593,1152594,1426432,1,1,1,1,1,1,1,1,...,2,2,1,1,1,1,3,4,3,0
1152594,1152595,1426434,1,1,2,1,1,1,1,1,...,2,3,1,1,1,1,1,1,1,0
1152595,1152596,1426435,1,1,1,1,1,1,1,1,...,1,2,1,2,2,1,1,1,1,0


## Sorting (order by)


### Ascending

In [79]:
df_in.sort_values(['PROP','ALGUNBI'])


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
27,28,17,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
28,29,18,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
33,34,21,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
34,35,22,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
35,36,23,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1151462,1151463,1424966,1,2,4,2,1,1,1,1,...,3,3,1,2,2,1,6,5,6,1
1151463,1151464,1424967,1,1,4,1,1,1,1,1,...,3,3,1,1,1,1,6,6,8,1
1151820,1151821,1425430,1,1,2,1,1,1,2,0,...,1,1,1,2,1,2,6,5,3,1
1151821,1151822,1425430,2,1,1,1,1,1,2,0,...,1,1,1,2,1,2,6,6,7,1


### Descending


In [80]:
df_in.sort_values(['PROP','ALGUNBI'], ascending=False)


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
4,5,2,1,1,1,2,1,1,1,1,...,5,5,1,1,2,1,6,1,2,1
8,9,2,7,1,1,1,1,1,1,1,...,5,5,1,1,1,1,6,1,2,1
25,26,16,1,1,1,1,1,1,1,1,...,3,5,1,1,1,1,6,1,2,1
30,31,19,2,1,1,1,1,1,1,1,...,3,5,1,1,1,1,6,6,8,1
123,124,106,1,2,4,2,1,1,1,1,...,2,2,1,1,2,1,6,6,7,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1150547,1150548,1423792,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
1150548,1150549,1423793,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
1151377,1151378,1424839,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
1152597,1152598,1426437,1000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,7,0,2
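
The sort direction can also differ per column: `ascending` accepts a list with one boolean per sort key. A minimal sketch on made-up rows, sorting `PROP` ascending and `ALGUNBI` descending within each `PROP` value:

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [1, 0, 1, 0], 'ALGUNBI': [0, 2, 1, 1]})

mixed = toy.sort_values(['PROP', 'ALGUNBI'], ascending=[True, False])
print(mixed)
```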


## Counting

### Non-null values
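
`count()` skips missing values, while `shape[0]` counts every row. In the census file the two agree because no column has missing values; a made-up frame with one NaN shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'PROP': [1, 1, 2], 'TOTPERS': [3.0, np.nan, 2.0]})

non_null = toy.count()  # per-column count, NaN excluded
n_rows = toy.shape[0]   # number of rows, NaN included

print(non_null)
print(n_rows)
```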

In [81]:
df_in.count()

HOGAR_REF_ID       1152599
VIVIENDA_REF_ID    1152599
NHOG               1152599
H05                1152599
H06                1152599
H07                1152599
H08                1152599
H09                1152599
H10                1152599
H11                1152599
H12                1152599
H13                1152599
H14                1152599
H15                1152599
H16                1152599
H19A               1152599
H19B               1152599
H19C               1152599
H19D               1152599
PROP               1152599
INDHAC             1152599
TOTPERS            1152599
ALGUNBI            1152599
dtype: int64

### Number of records

In [82]:
df_in.shape[0]

1152599

## Grouping (group by)
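
A minimal sketch of the two result shapes used in this section, on a made-up frame: by default the grouping keys become the index of the result, and with `as_index=False` they stay as ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({'PROP': [1, 1, 2], 'ALGUNBI': [0, 0, 1]})

# Series indexed by (PROP, ALGUNBI)
sizes = toy.groupby(['PROP', 'ALGUNBI']).size()

# DataFrame with PROP, ALGUNBI and a 'size' column
sizes_df = toy.groupby(['PROP', 'ALGUNBI'], as_index=False).size()

print(sizes)
print(sizes_df)
```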

In [83]:
df_in.groupby(['PROP','ALGUNBI']).size()

PROP  ALGUNBI
0     2            2465
1     0          639712
      1            9246
2     0           67111
      1            2674
3     0          294851
      1           48592
4     0           42047
      1            2466
5     0           21368
      1            2425
6     0           16269
      1            3373
dtype: int64

In [84]:
# NOTE: the grouping variables can be kept as columns instead of as the index
df_in.groupby(['PROP','ALGUNBI'],as_index=False).size()

Unnamed: 0,PROP,ALGUNBI,size
0,0,2,2465
1,1,0,639712
2,1,1,9246
3,2,0,67111
4,2,1,2674
5,3,0,294851
6,3,1,48592
7,4,0,42047
8,4,1,2466
9,5,0,21368


In [85]:
df_in.groupby(['PROP','ALGUNBI']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H13,H14,H15,H16,H19A,H19B,H19C,H19D,INDHAC,TOTPERS
PROP,ALGUNBI,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
0,2,2465,2465,2465,2465,2465,2465,2465,2465,2465,2465,...,2465,2465,2465,2465,2465,2465,2465,2465,2465,2465
1,0,639712,639712,639712,639712,639712,639712,639712,639712,639712,639712,...,639712,639712,639712,639712,639712,639712,639712,639712,639712,639712
1,1,9246,9246,9246,9246,9246,9246,9246,9246,9246,9246,...,9246,9246,9246,9246,9246,9246,9246,9246,9246,9246
2,0,67111,67111,67111,67111,67111,67111,67111,67111,67111,67111,...,67111,67111,67111,67111,67111,67111,67111,67111,67111,67111
2,1,2674,2674,2674,2674,2674,2674,2674,2674,2674,2674,...,2674,2674,2674,2674,2674,2674,2674,2674,2674,2674
3,0,294851,294851,294851,294851,294851,294851,294851,294851,294851,294851,...,294851,294851,294851,294851,294851,294851,294851,294851,294851,294851
3,1,48592,48592,48592,48592,48592,48592,48592,48592,48592,48592,...,48592,48592,48592,48592,48592,48592,48592,48592,48592,48592
4,0,42047,42047,42047,42047,42047,42047,42047,42047,42047,42047,...,42047,42047,42047,42047,42047,42047,42047,42047,42047,42047
4,1,2466,2466,2466,2466,2466,2466,2466,2466,2466,2466,...,2466,2466,2466,2466,2466,2466,2466,2466,2466,2466
5,0,21368,21368,21368,21368,21368,21368,21368,21368,21368,21368,...,21368,21368,21368,21368,21368,21368,21368,21368,21368,21368


## Aggregation functions

 |   Aggregation   |      Description      |
 |-----------------|:---------------------:|
 | count()         | Number of cases       |
 | first(), last() | First and last item   |
 | mean(), median()| Mean, median          |
 | min(), max()    | Minimum and maximum   |
 | std(), var()    | Standard deviation and variance |
 | mad()           | Mean absolute deviation (removed in pandas 2.0) |
 | prod()          | Product of the items  |
 | sum()           | Sum of the cases      |

### Computing the mean per group

![](notas_files/groupby-example.png)

In [86]:
( df_in
    .groupby(['PROP','ALGUNBI'])
    .mean()
)

Unnamed: 0_level_0,Unnamed: 1_level_0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H13,H14,H15,H16,H19A,H19B,H19C,H19D,INDHAC,TOTPERS
PROP,ALGUNBI,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
0,2,435113.864097,555295.901826,757.646247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
1,0,603795.279263,758350.46135,1.083014,1.039271,1.822239,1.082889,1.00357,1.004314,1.0,1.009212,...,1.01045,1.143344,2.215835,3.422187,1.004671,1.289855,1.148475,1.062703,2.108658,2.479202
1,1,475890.517737,604263.872485,1.611616,1.395306,2.553861,1.435864,1.121133,1.039368,1.246052,0.925698,...,0.871512,2.544019,1.796561,2.276011,1.1061,1.570409,1.138871,1.484642,4.066083,3.814623
2,0,569535.949174,718025.70066,1.119638,1.095484,1.901655,1.151063,1.009626,1.005916,1.0,1.036432,...,1.02852,1.424118,2.042303,3.086066,1.013619,1.328441,1.144805,1.136058,2.301635,2.471577
2,1,442732.551234,563904.281227,1.654824,1.647719,2.965969,1.644353,1.239342,1.048616,1.20718,1.073298,...,0.970456,3.408003,1.611818,1.882573,1.186612,1.717277,1.169035,1.688482,4.586761,4.083396
3,0,559084.665088,705990.791135,1.149045,1.042364,1.750542,1.100054,1.011026,1.006264,1.0,1.008231,...,1.047573,1.222841,1.630145,2.474643,1.023846,1.247552,1.064266,1.213467,2.665699,2.295308
3,1,386821.999877,502532.358022,1.818756,1.153606,2.119053,1.23685,1.320444,1.018131,1.120678,0.929556,...,1.461064,1.848226,1.202873,1.250514,1.302498,1.742838,1.133149,1.758541,4.236335,2.464809
4,0,580057.61417,730560.250648,1.142436,1.066307,1.854211,1.121554,1.010869,1.005684,1.0,1.017362,...,1.043784,1.294123,1.809784,2.774609,1.024282,1.399886,1.187314,1.158965,2.423597,2.32585
4,1,486273.088808,620033.500811,1.659773,1.401865,2.655312,1.424169,1.243715,1.032036,1.172749,0.965126,...,1.148824,2.779805,1.367397,1.554745,1.203974,1.6691,1.186131,1.647202,4.720195,3.710868
5,0,551720.709191,696519.829324,1.064302,1.0424,1.687711,1.101928,1.006131,1.018673,1.0,1.004118,...,1.013618,1.172782,1.490406,2.165762,1.015771,1.392643,1.098231,1.174373,3.384734,2.590416


### Computing summary statistics per group


In [87]:
( df_in[['PROP','ALGUNBI','TOTPERS']]
    .groupby(['PROP','ALGUNBI'])
    .agg(['min', 'max', 'mean', 'median'])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,TOTPERS,TOTPERS,TOTPERS,TOTPERS
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,median
PROP,ALGUNBI,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0,2,0,0,0.0,0.0
1,0,1,8,2.479202,2.0
1,1,1,8,3.814623,4.0
2,0,1,8,2.471577,2.0
2,1,1,8,4.083396,4.0
3,0,1,8,2.295308,2.0
3,1,1,8,2.464809,2.0
4,0,1,8,2.32585,2.0
4,1,1,8,3.710868,4.0
5,0,1,8,2.590416,2.0


### Computing different summary statistics per group and per variable


In [88]:
( df_in
    .groupby(['PROP','ALGUNBI'])
    .agg({
        'NHOG': ['min', 'max'],
        'TOTPERS':['mean', 'median']
        })
)


Unnamed: 0_level_0,Unnamed: 1_level_0,NHOG,NHOG,TOTPERS,TOTPERS
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,median
PROP,ALGUNBI,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0,2,1,1000,0.0,0.0
1,0,1,46,2.479202,2.0
1,1,1,37,3.814623,4.0
2,0,1,36,2.471577,2.0
2,1,1,34,4.083396,4.0
3,0,1,48,2.295308,2.0
3,1,1,41,2.464809,2.0
4,0,1,38,2.32585,2.0
4,1,1,38,3.710868,4.0
5,0,1,24,2.590416,2.0
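
The dictionary form above produces two-level column labels. As a minimal sketch, assuming flat column names are preferred, pandas' named aggregation gives each output column an explicit name. The `toy` frame and the output names (`nhog_min`, etc.) are hypothetical; only the column names `PROP`, `ALGUNBI`, `NHOG` and `TOTPERS` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's df_in.
toy = pd.DataFrame({
    "PROP":    [1, 1, 2, 2, 2],
    "ALGUNBI": [0, 0, 0, 1, 1],
    "NHOG":    [1, 3, 2, 1, 4],
    "TOTPERS": [2, 4, 1, 5, 3],
})

# Named aggregation: output_name=(input_column, aggregation_function)
summary = (
    toy
    .groupby(["PROP", "ALGUNBI"])
    .agg(
        nhog_min=("NHOG", "min"),
        nhog_max=("NHOG", "max"),
        totpers_mean=("TOTPERS", "mean"),
        totpers_median=("TOTPERS", "median"),
    )
)
print(summary)
```

The result keeps the `(PROP, ALGUNBI)` MultiIndex on the rows but has a single level of column labels.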


## Lambda Functions
Anonymous functions applied to every column with `apply`


In [89]:
( df_in
    .apply(lambda x: x[x > 2].count())
)

HOGAR_REF_ID       1152597
VIVIENDA_REF_ID    1152590
NHOG                 26224
H05                   8179
H06                 116010
H07                      0
H08                   1125
H09                   1204
H10                      0
H11                      0
H12                   4579
H13                      0
H14                  89386
H15                 292326
H16                 699073
H19A                     0
H19B                     0
H19C                     0
H19D                     0
PROP                431391
INDHAC              558761
TOTPERS             456218
ALGUNBI                  0
dtype: int64
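
To make the column-wise behaviour easier to see, here is a small sketch on a hypothetical two-column frame (`toy` is not part of the notebook). It also shows a vectorized expression that should give the same counts without a lambda.

```python
import pandas as pd

# Hypothetical toy frame used only for illustration.
toy = pd.DataFrame({"a": [1, 3, 5], "b": [0, 2, 4]})

# apply() passes each column (a Series) to the lambda in turn;
# the lambda keeps the values greater than 2 and counts them.
counts = toy.apply(lambda col: col[col > 2].count())

# Equivalent vectorized form: build a boolean frame, then sum the True values.
vectorized = (toy > 2).sum()

print(counts)
print(vectorized)
```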

In [90]:
( df_in
    .groupby(['PROP','ALGUNBI'], group_keys=False)  # keep the original row index in the result
    .apply(lambda x: (x+0.)/x.sum()*100)
)


Unnamed: 0,HOGAR_REF_ID,VIVIENDA_REF_ID,NHOG,H05,H06,H07,H08,H09,H10,H11,...,H15,H16,H19A,H19B,H19C,H19D,PROP,INDHAC,TOTPERS,ALGUNBI
0,8.482363e-09,6.718969e-09,0.004397,0.004490,0.011092,0.004247,0.004651,0.004594,0.004680,0.004661,...,0.006280,0.004322,0.004607,0.006721,0.004261,0.007970,0.004680,0.001383,0.001807,
1,5.177925e-10,2.061321e-10,0.000289,0.000150,0.000086,0.000144,0.000156,0.000156,0.000156,0.000155,...,0.000071,0.000091,0.000156,0.000121,0.000136,0.000147,0.000156,0.000074,0.000063,
2,7.766888e-10,2.061321e-10,0.000577,0.000150,0.000257,0.000144,0.000156,0.000156,0.000156,0.000155,...,0.000212,0.000228,0.000156,0.000121,0.000136,0.000147,0.000156,0.000074,0.000126,
3,3.255996e-07,6.397891e-08,0.199336,0.035137,0.020751,0.035039,0.036470,0.039888,0.038941,0.042680,...,0.096339,0.142776,0.037523,0.029146,0.037272,0.031980,0.041237,0.053121,0.100364,0.041237
4,3.500505e-07,1.083315e-07,0.018622,0.020665,0.010454,0.041894,0.021834,0.027093,0.024777,0.030874,...,0.104537,0.094985,0.024355,0.017443,0.050088,0.017593,0.029647,0.006417,0.016464,0.029647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152594,2.984025e-04,2.940338e-04,0.000144,0.000150,0.000172,0.000144,0.000156,0.000156,0.000156,0.000155,...,0.000141,0.000137,0.000156,0.000121,0.000136,0.000147,0.000156,0.000074,0.000063,
1152595,2.984028e-04,2.940340e-04,0.000144,0.000150,0.000086,0.000144,0.000156,0.000156,0.000156,0.000155,...,0.000071,0.000091,0.000156,0.000242,0.000272,0.000147,0.000156,0.000074,0.000063,
1152596,2.984031e-04,2.940342e-04,0.000144,0.000150,0.000086,0.000144,0.000156,0.000156,0.000156,0.000155,...,0.000141,0.000137,0.000156,0.000121,0.000136,0.000147,0.000156,0.000222,0.000189,
1152597,1.074628e-01,1.042104e-01,0.053545,,,,,,,,...,,,,,,,,0.040568,,0.040568
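
The cell above expresses every value as a percentage of its group total. A minimal sketch of the same idea for a single column, using `transform("sum")` instead of `apply`, is shown below; the `toy` frame and the `pct_of_group` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one grouping column and one numeric column.
toy = pd.DataFrame({
    "PROP":    [1, 1, 2, 2],
    "TOTPERS": [1, 3, 2, 2],
})

# transform("sum") returns the group total aligned to every original row.
group_total = toy.groupby("PROP")["TOTPERS"].transform("sum")

# Each row's share of its own group's total, in percent.
toy["pct_of_group"] = toy["TOTPERS"] / group_total * 100
print(toy)
```

Because `transform` returns a result with the same index as the input, the new column can be assigned directly without any realignment.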


## Having
After grouping and summarizing, filter the resulting rows


In [91]:
(
    df_in[['PROP','ALGUNBI','TOTPERS']]
    .groupby(['PROP','ALGUNBI'])
    .sum()
    # keep only the aggregated rows whose total exceeds the threshold
    .loc[lambda d: d['TOTPERS'] > 10000]
)

Unnamed: 0_level_0,Unnamed: 1_level_0,TOTPERS
PROP,ALGUNBI,Unnamed: 2_level_1
1,0,1585975
1,1,35270
2,0,165870
2,1,10919
3,0,676774
3,1,119770
4,0,97795
5,0,55352
6,0,43545
6,1,12148
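
There are two things a HAVING-style filter can return: the aggregated rows that pass the condition, or the original rows belonging to the groups that pass it. The sketch below contrasts both on hypothetical data (`toy` and the threshold of 10 are illustrative, not taken from the notebook).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "PROP":    [1, 1, 2, 2, 3],
    "TOTPERS": [4, 7, 1, 2, 12],
})

# 1) Aggregate first, then filter the aggregated rows (closest to SQL HAVING).
totals = toy.groupby("PROP")[["TOTPERS"]].sum()
having = totals[totals["TOTPERS"] > 10]

# 2) Keep the ORIGINAL rows of every group whose total passes the condition.
#    The function given to filter() must return a single boolean per group.
rows_of_big_groups = toy.groupby("PROP").filter(lambda g: g["TOTPERS"].sum() > 10)

print(having)
print(rows_of_big_groups)
```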


## Top 1 per group
Useful for extracting the first rows of each group that meet a condition


In [92]:
(df_in[['PROP','ALGUNBI','TOTPERS']]
    .sort_values('TOTPERS', ascending=False)
    .groupby(['PROP','ALGUNBI'])
    .head(1)
)


Unnamed: 0,PROP,ALGUNBI,TOTPERS
199325,1,0,8
722844,3,1,8
722841,3,0,8
722831,6,0,8
555723,2,0,8
555729,1,1,8
556037,4,0,8
555644,2,1,8
555127,4,1,8
271727,5,0,8
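
The sort-then-`head(1)` pattern above generalizes to `head(n)`. When only the single largest row per group is needed, one possible alternative is `idxmax`, sketched here on hypothetical data (`toy` is illustrative).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "PROP":    [1, 1, 2, 2],
    "TOTPERS": [3, 8, 5, 2],
})

# idxmax() returns, for each group, the index label of the row with the largest value.
top_idx = toy.groupby("PROP")["TOTPERS"].idxmax()

# Use those labels to pull the full rows from the original frame.
top1 = toy.loc[top_idx]
print(top1)
```

When several rows tie for the maximum, `idxmax` returns the first one it encounters, so the selected row may differ from the one chosen by the sort-based approach.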


## Adding Rows and Columns

### Adding Columns


In [93]:

df_1 = df_in[['PROP', 'INDHAC']]
df_1

Unnamed: 0,PROP,INDHAC
0,5,1
1,1,1
2,1,1
3,5,6
4,6,1
...,...,...
1152594,1,1
1152595,1,1
1152596,1,3
1152597,0,7


In [94]:
df_2 = df_in[['TOTPERS', 'ALGUNBI']]
df_2

Unnamed: 0,TOTPERS,ALGUNBI
0,1,0
1,1,0
2,2,0
3,8,1
4,2,1
...,...,...
1152594,1,0
1152595,1,0
1152596,3,0
1152597,0,2


In [95]:
df = pd.concat([df_1,df_2],axis=1,sort=False)
df

Unnamed: 0,PROP,INDHAC,TOTPERS,ALGUNBI
0,5,1,1,0
1,1,1,1,0
2,1,1,2,0
3,5,6,8,1
4,6,1,2,1
...,...,...,...,...
1152594,1,1,1,0
1152595,1,1,1,0
1152596,1,3,3,0
1152597,0,7,0,2


### Adding Rows

In [96]:
df_1 = df_in.iloc[0:3][['PROP', 'INDHAC']]
df_1

Unnamed: 0,PROP,INDHAC
0,5,1
1,1,1
2,1,1


In [97]:
df_2 = df_in.iloc[3:6][['PROP', 'INDHAC']]
df_2


Unnamed: 0,PROP,INDHAC
3,5,6
4,6,1
5,5,1


#### Option 1

In [98]:
df = pd.concat([df_1,df_2],axis=0,sort=False)
df

Unnamed: 0,PROP,INDHAC
0,5,1
1,1,1
2,1,1
3,5,6
4,6,1
5,5,1


#### Option 2

In [99]:
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# on current versions use pd.concat as in Option 1
df = df_1.append([df_2])
df


Unnamed: 0,PROP,INDHAC
0,5,1
1,1,1
2,1,1
3,5,6
4,6,1
5,5,1


## Transform

![](notas_files/transform-example.png)

In [100]:
( df_in[['PROP','ALGUNBI','TOTPERS']]
    .groupby(['PROP','ALGUNBI'])
    # Note: by operator precedence this computes x - (x.mean() / x.std());
    # a per-group z-score would be (x - x.mean()) / x.std()
    .transform(lambda x: x-x.mean()/x.std())
)


Unnamed: 0,TOTPERS
0,-0.954643
1,-0.817160
2,0.182840
3,6.052921
4,0.200861
...,...
1152594,-0.817160
1152595,-0.817160
1152596,1.182840
1152597,
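
If the intent of the cell above is to standardize each value within its group, the subtraction needs parentheses. The following is a minimal sketch of a per-group z-score on hypothetical data; `toy` is illustrative and the values are chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: two groups of three values each.
toy = pd.DataFrame({
    "PROP":    [1, 1, 1, 2, 2, 2],
    "TOTPERS": [1, 2, 3, 2, 4, 6],
})

# Per-group z-score: subtract the group mean, then divide by the group
# standard deviation (pandas uses the sample standard deviation, ddof=1).
z = (
    toy
    .groupby("PROP")["TOTPERS"]
    .transform(lambda x: (x - x.mean()) / x.std())
)
print(z)
```

Group 1 has mean 2 and standard deviation 1, and group 2 has mean 4 and standard deviation 2, so both groups map to -1, 0, 1.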


## MultiIndex

In [101]:
df = ( df_in[['PROP','ALGUNBI','TOTPERS']]
    .groupby(['PROP','ALGUNBI'])
    .mean()
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,TOTPERS
PROP,ALGUNBI,Unnamed: 2_level_1
0,2,0.0
1,0,2.479202
1,1,3.814623
2,0,2.471577
2,1,4.083396
3,0,2.295308
3,1,2.464809
4,0,2.32585
4,1,3.710868
5,0,2.590416


In [102]:
df.index

MultiIndex([(0, 2),
            (1, 0),
            (1, 1),
            (2, 0),
            (2, 1),
            (3, 0),
            (3, 1),
            (4, 0),
            (4, 1),
            (5, 0),
            (5, 1),
            (6, 0),
            (6, 1)],
           names=['PROP', 'ALGUNBI'])
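
A few common ways to read data back out of a frame indexed by a MultiIndex are sketched below. The frame is built by hand with hypothetical values; only the level names `PROP` and `ALGUNBI` and the column name `TOTPERS` come from the notebook.

```python
import pandas as pd

# Hypothetical frame with a two-level row index.
idx = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (2, 0), (2, 1)], names=["PROP", "ALGUNBI"]
)
df = pd.DataFrame({"TOTPERS": [2.5, 3.8, 2.4, 4.1]}, index=idx)

# A full key (one value per level) selects a single cell.
one = df.loc[(1, 1), "TOTPERS"]

# A partial key on the first level returns the matching sub-frame,
# indexed by the remaining level (ALGUNBI).
prop1 = df.loc[1]

# xs() takes a cross-section on any level, here ALGUNBI == 0.
cross = df.xs(0, level="ALGUNBI")

# reset_index() turns the index levels back into ordinary columns.
flat = df.reset_index()
print(flat)
```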

## Pivot Table

![](notas_files/pivot-table-datasheet.png)
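
As a minimal sketch of the idea in the figure, `pivot_table` places one variable on the rows, another on the columns, and aggregates a third inside each cell. The `toy` frame is hypothetical; its column names mirror the notebook's.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "PROP":    [1, 1, 2, 2, 2],
    "ALGUNBI": [0, 1, 0, 0, 1],
    "TOTPERS": [2, 4, 1, 3, 5],
})

# Rows: PROP, columns: ALGUNBI, cell value: mean of TOTPERS.
pivot = toy.pivot_table(
    index="PROP", columns="ALGUNBI", values="TOTPERS", aggfunc="mean"
)
print(pivot)
```

This should give the same numbers as grouping by both columns, taking the mean, and calling `unstack()`.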

## CrossTabs


![](notas_files/crosstab_cheatsheet.png)
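
A short sketch of `pd.crosstab` on hypothetical data follows: by default it counts how many rows fall in each combination of two categorical variables, and `normalize="index"` converts the counts to row shares.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "PROP":    [1, 1, 2, 2, 2],
    "ALGUNBI": [0, 1, 0, 0, 1],
})

# Frequency table: rows are PROP values, columns are ALGUNBI values.
counts = pd.crosstab(toy["PROP"], toy["ALGUNBI"])

# Same table expressed as the share of each row's total.
shares = pd.crosstab(toy["PROP"], toy["ALGUNBI"], normalize="index")

print(counts)
print(shares)
```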

## Stack
![](https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_stack.png)
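
The figure can be reproduced with a small sketch: `stack()` moves the column labels into an inner level of the row index, turning a wide frame into a long Series. The `wide` frame is hypothetical.

```python
import pandas as pd

# Hypothetical wide frame: two rows, two columns.
wide = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# stack(): column labels become the innermost level of the row index.
long = wide.stack()
print(long)

# unstack() reverses the operation and restores the wide layout.
restored = long.unstack()
print(restored)
```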

## Unstack
![](notas_files/reshaping_unstack.png)

In [103]:
df = ( df_in[['PROP','ALGUNBI','TOTPERS']]
    .groupby(['PROP','ALGUNBI'])
    .mean()
)
df.unstack()

Unnamed: 0_level_0,TOTPERS,TOTPERS,TOTPERS
ALGUNBI,0,1,2
PROP,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,,,0.0
1,2.479202,3.814623,
2,2.471577,4.083396,
3,2.295308,2.464809,
4,2.32585,3.710868,
5,2.590416,3.28701,
6,2.676563,3.601542,


![](notas_files/reshaping_unstack_1.png)

![](https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_unstack_0.png)

# Joins

## Tables to Join

![](notas_files/join-setup_1.png)

In [104]:
data = {'key': [1, 2, 3], 'value': ['x1', 'x2', 'x3']}
x = pd.DataFrame.from_dict(data)
x

Unnamed: 0,key,value
0,1,x1
1,2,x2
2,3,x3


In [105]:
data = {'key': [1, 2, 4], 'value': ['y1', 'y2', 'y3']}
y = pd.DataFrame.from_dict(data)
y

Unnamed: 0,key,value
0,1,y1
1,2,y2
2,4,y3


## Inner join

This clause looks for matches between two tables based on a column they have in common, so only the intersection appears in the result.

![](notas_files/INNER_JOIN.webp)

Let's look at an example

![](notas_files/join-inner_2.png)

In [106]:
#inner join in python pandas

inner_join_df= pd.merge(x, y, on='key', how='inner')
inner_join_df 


Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,2,x2,y2


## Left Join

Unlike an INNER JOIN, which keeps only the rows matched in both tables, a LEFT JOIN gives priority to the left table and looks up matches in the right one. Every row of the left table appears in the result even when it has no match in the right table; the right-hand columns are then left empty (NaN in pandas).

![](notas_files/LEFT_JOIN.webp)

Let's look at an example

![](notas_files/left_joint.png)

In [107]:
left_join_df = pd.merge(x, y, on='key', how='left')
left_join_df

Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,2,x2,y2
2,3,x3,


## Right Join

A RIGHT JOIN is very similar, but priority goes to the right table.

![](notas_files/RIGHT_JOIN.webp)

Let's look at an example

![](notas_files/right_joint.png)

In [108]:
right_join_df = pd.merge(x, y, on='key', how='right')
right_join_df

Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,2,x2,y2
2,4,,y3


## Full Join

While LEFT JOIN shows all rows of the left table and RIGHT JOIN all rows of the right table, FULL OUTER JOIN (or simply FULL JOIN) shows all rows of both tables regardless of whether they match. Missing values are filled with NULL in SQL, which pandas represents as NaN.

![](notas_files/FULL_JOIN.webp)

Let's look at an example

![](notas_files/outer_joint.png)

In [109]:
outer_join_df = pd.merge(x, y, on='key', how='outer')
outer_join_df

Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,2,x2,y2
2,3,x3,
3,4,,y3
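
To see which table each row of a full join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the same `x` and `y` tables used above so it runs on its own.

```python
import pandas as pd

# Same tables as in the join examples above.
x = pd.DataFrame({"key": [1, 2, 3], "value": ["x1", "x2", "x3"]})
y = pd.DataFrame({"key": [1, 2, 4], "value": ["y1", "y2", "y3"]})

# indicator=True adds a "_merge" column with the values
# "both", "left_only" or "right_only".
outer = pd.merge(x, y, on="key", how="outer", indicator=True)
print(outer)

# Rows present only in the left table (an anti-join).
only_left = outer[outer["_merge"] == "left_only"]
print(only_left)
```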


## Relationships

### One to One

In [110]:

data = {'key': [1, 2, 3], 'value': ['x1', 'x2', 'x3']}
x = pd.DataFrame.from_dict(data)
x

Unnamed: 0,key,value
0,1,x1
1,2,x2
2,3,x3


In [111]:

data = {'key': [1, 2, 3], 'value': ['y1', 'y2', 'y3']}
y = pd.DataFrame.from_dict(data)
y

Unnamed: 0,key,value
0,1,y1
1,2,y2
2,3,y3


In [112]:
inner_join_df= pd.merge(x, y, on='key', how='inner')
inner_join_df 

Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,2,x2,y2
2,3,x3,y3


### One to Many (or Many to One)

In [113]:
data = {'key': [1, 2, 2, 1], 'value': ['x1', 'x2', 'x3', "x4"]}
x = pd.DataFrame.from_dict(data)
x



Unnamed: 0,key,value
0,1,x1
1,2,x2
2,2,x3
3,1,x4


In [114]:

data = {'key': [1, 2], 'value': ['y1', 'y2']}
y = pd.DataFrame.from_dict(data)
y

Unnamed: 0,key,value
0,1,y1
1,2,y2


![](notas_files/join-one-to-many_5.png)

In [115]:
inner_join_df= pd.merge(x, y, on='key', how='inner')
inner_join_df 

Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,1,x4,y1
2,2,x2,y2
3,2,x3,y2


### Many to Many

In [116]:
data = {'key': [1, 2, 2, 3], 'value': ['x1', 'x2', 'x3', "x4"]}
x = pd.DataFrame.from_dict(data)
x

Unnamed: 0,key,value
0,1,x1
1,2,x2
2,2,x3
3,3,x4


In [117]:
data = {'key': [1, 2, 2, 3], 'value': ['y1', 'y2', 'y3', "y4"]}
y = pd.DataFrame.from_dict(data)
y

Unnamed: 0,key,value
0,1,y1
1,2,y2
2,2,y3
3,3,y4


![](notas_files/join-many-to-many_6.png)

In [118]:
inner_join_df= pd.merge(x, y, on='key', how='inner')
inner_join_df 

Unnamed: 0,key,value_x,value_y
0,1,x1,y1
1,2,x2,y2
2,2,x2,y3
3,2,x3,y2
4,2,x3,y3
5,3,x4,y4
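
A many-to-many join multiplies rows for every repeated key, which is easy to trigger by accident. As a hedged sketch, the `validate` parameter of `pd.merge` can assert the expected relationship and raise an error when the keys do not satisfy it. The tables are rebuilt here so the block runs on its own.

```python
import pandas as pd

# Same tables as in the many-to-many example above: key 2 is repeated in both.
x = pd.DataFrame({"key": [1, 2, 2, 3], "value": ["x1", "x2", "x3", "x4"]})
y = pd.DataFrame({"key": [1, 2, 2, 3], "value": ["y1", "y2", "y3", "y4"]})

# Without validation the merge succeeds and key 2 yields 2 x 2 = 4 rows.
many = pd.merge(x, y, on="key", how="inner")
print(len(many))

# validate="one_to_one" checks that keys are unique on both sides
# and raises MergeError otherwise.
try:
    pd.merge(x, y, on="key", how="inner", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```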


## Union all

![](notas_files/unionAll.png)

In [119]:
data = {'val_0': [1, 2, 2, 3], 'value': ['a1', 'b2', 'c3', "d4"]}
y = pd.DataFrame.from_dict(data)
y


Unnamed: 0,val_0,value
0,1,a1
1,2,b2
2,2,c3
3,3,d4


In [120]:
data = {'val_0': [1, 2, 2, 3], 'value': ['a1', 'b2', 'e3', "f4"]}
x = pd.DataFrame.from_dict(data)
x

Unnamed: 0,val_0,value
0,1,a1
1,2,b2
2,2,e3
3,3,f4


In [121]:
df_union_all= pd.concat([y, x], ignore_index=True)
df_union_all

Unnamed: 0,val_0,value
0,1,a1
1,2,b2
2,2,c3
3,3,d4
4,1,a1
5,2,b2
6,2,e3
7,3,f4


### Union

![](notas_files/union.png)

In [122]:
df_union= pd.concat([x, y],ignore_index=True).drop_duplicates()
df_union

Unnamed: 0,val_0,value
0,1,a1
1,2,b2
2,2,e3
3,3,f4
6,2,c3
7,3,d4
