# Pandas - Operations
Inteligencia Artificial - Facundo A. Lucianna - CEIA - FIUBA

We have already seen an introduction to Pandas. Now we take a closer look at the operations that can be performed on DataFrames. Keep in mind that, although we cover the most common operations, they are only the tip of the iceberg.

In [1]:
import pandas as pd
import numpy as np

In [2]:
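# Let's create the following DataFrame with personalities from Mar del Plata. In this case, we build it from a list of lists.
# Note that persona_3 has only four values: pandas pads the missing profession with an empty value.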
persona_1 = ["Clotilde", "Acosta", "3 octubre de 1940", "Mar del Plata", "Actriz"]
persona_2 = ["Rosario", "Bléfari", "24/12/65", "MDP", "Cantante"]
persona_3 = ["Norberto", "Carredegoas", "1936-04-12", "San Antonio Oeste"]
persona_4 = ["Bárbara", "Torres", "11-04-1973", "M del plata", "Actuación"]
persona_5 = ["Eugenio", "Weinbaum", "17/08/1961", "MDQ", "Conductor"]

personajes = [persona_1, persona_2, persona_3, persona_4, persona_5]

In [3]:
df_personas = pd.DataFrame(personajes, columns = ['Nombre', 'Apellido', 'Fecha Nacimiento', 'Ciudad', 'Profesion'])

In [4]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDP,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


## Using .loc or .iloc

In Pandas DataFrames there are different ways to select records by row and column.
- **.loc**: Selects by labels or by boolean conditions
- **.iloc**: Selects elements by their integer position
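
As a minimal sketch of the difference, assuming a small made-up DataFrame (`df_demo` is hypothetical) whose index labels do not coincide with the row positions, the same cell can be reached by label or by position. Label slices include both ends, while position slices exclude the end:

```python
import pandas as pd

# Hypothetical data: the index labels are 10, 20, 30, not 0, 1, 2
df_demo = pd.DataFrame({"Ciudad": ["MDP", "Rosario", "Salta"]}, index=[10, 20, 30])

by_label = df_demo.loc[20, "Ciudad"]    # row labeled 20
by_position = df_demo.iloc[1, 0]        # second row, first column

label_slice = df_demo.loc[10:20]        # labels 10 and 20, both ends included
position_slice = df_demo.iloc[0:1]      # position 0 only, end excluded

print(by_label, by_position, len(label_slice), len(position_slice))
```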

In [5]:
# Here the index labels are numeric, so we access the row labeled 3 and the column "Ciudad"
df_personas.loc[3, "Ciudad"]

'M del plata'

We can see that Mar del Plata is misspelled. We can fix it by assigning through .loc:

In [6]:
df_personas.loc[3, "Ciudad"] = "Mar del Plata"

In [7]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDP,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,Mar del Plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


In [8]:
# Here we access by position: row position 1, which happens to match index label 1 because the index is numeric
# and sorted, and column position 3, which corresponds to "Ciudad"
df_personas.iloc[1, 3]

'MDP'

Here Mar del Plata appears abbreviated as MDP. We can fix it by assigning through .iloc:

In [9]:
df_personas.iloc[1, 3] = "Mar del Plata"

In [10]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,Mar del Plata,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,Mar del Plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


## The first tools…

Pandas gives us tools for a first exploration of a new DataFrame, so that we can get familiar with its overall structure.

### .head() and .tail()

The first thing we want to do is look at the table. Since tables are often very large, we can look at just the head or the tail of the DataFrame.

In [11]:
# This is how a CSV file is loaded into Pandas; we will cover it in more detail in another notebook
df_salaries = pd.read_csv("./datasets/Salaries.csv", low_memory=False)

In [12]:
# The head method shows the first N rows of the DataFrame. It is useful for a first exploration.
N = 10
df_salaries.head(N)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
5,6,DAVID SULLIVAN,ASSISTANT DEPUTY CHIEF II,118602.0,8601.0,189082.74,,316285.74,316285.74,2011,,San Francisco,
6,7,ALSON LEE,"BATTALION CHIEF, (FIRE DEPARTMENT)",92492.01,89062.9,134426.14,,315981.05,315981.05,2011,,San Francisco,
7,8,DAVID KUSHNER,DEPUTY DIRECTOR OF INVESTMENTS,256576.96,0.0,51322.5,,307899.46,307899.46,2011,,San Francisco,
8,9,MICHAEL MORRIS,"BATTALION CHIEF, (FIRE DEPARTMENT)",176932.64,86362.68,40132.23,,303427.55,303427.55,2011,,San Francisco,
9,10,JOANNE HAYES-WHITE,"CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)",285262.0,0.0,17115.73,,302377.73,302377.73,2011,,San Francisco,


In [13]:
# The tail method shows the last N rows of the DataFrame.
df_salaries.tail(10)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
148644,148645,Randy D Winn,"Stationary Eng, Sewage Plant",0.00,0.00,0.00,0.00,0.0,0.0,2014,,San Francisco,PT
148645,148646,Carolyn A Wilson,Human Services Technician,0.00,0.00,0.00,0.00,0.0,0.0,2014,,San Francisco,PT
148646,148647,Not provided,Not provided,Not Provided,Not Provided,Not Provided,Not Provided,0.0,0.0,2014,,San Francisco,
148647,148648,Joann Anderson,Communications Dispatcher 2,0.00,0.00,0.00,0.00,0.0,0.0,2014,,San Francisco,PT
148648,148649,Leon Walker,Custodian,0.00,0.00,0.00,0.00,0.0,0.0,2014,,San Francisco,PT
148649,148650,Roy I Tillery,Custodian,0.00,0.00,0.00,0.00,0.0,0.0,2014,,San Francisco,PT
148650,148651,Not provided,Not provided,Not Provided,Not Provided,Not Provided,Not Provided,0.0,0.0,2014,,San Francisco,
148651,148652,Not provided,Not provided,Not Provided,Not Provided,Not Provided,Not Provided,0.0,0.0,2014,,San Francisco,
148652,148653,Not provided,Not provided,Not Provided,Not Provided,Not Provided,Not Provided,0.0,0.0,2014,,San Francisco,
148653,148654,Joe Lopez,"Counselor, Log Cabin Ranch",0.00,0.00,-618.13,0.00,-618.13,-618.13,2014,,San Francisco,PT


### .shape

To find out the overall structure in number of rows and columns, we have the *.shape* attribute (it is an attribute, not a method, so it is used without parentheses):

In [14]:
df_salaries.shape

(148654, 13)

### .info()

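The .info() method is extremely useful for understanding the structure of the DataFrame. It reports how many rows there are and, for each column, its name, the number of non-null values and its data type. In general, object columns hold strings; here, for example, the tail output above suggests that BasePay is read as object because some rows contain the text "Not Provided".

It also reports how much RAM the DataFrame occupies. The `+` in the output means the figure is a lower bound, because the contents of object columns are not measured.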

In [15]:
df_salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148049 non-null  object 
 4   OvertimePay       148654 non-null  object 
 5   OtherPay          148654 non-null  object 
 6   Benefits          112495 non-null  object 
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            38119 non-null   object 
dtypes: float64(3), int64(2), object(8)
memory usage: 14.7+ MB


In addition, if the index is not the default one (position), the summary shows the type of the index, in this case numeric:

In [16]:
df_salaries.set_index("Id").info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148654 entries, 1 to 148654
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   EmployeeName      148654 non-null  object 
 1   JobTitle          148654 non-null  object 
 2   BasePay           148049 non-null  object 
 3   OvertimePay       148654 non-null  object 
 4   OtherPay          148654 non-null  object 
 5   Benefits          112495 non-null  object 
 6   TotalPay          148654 non-null  float64
 7   TotalPayBenefits  148654 non-null  float64
 8   Year              148654 non-null  int64  
 9   Notes             0 non-null       float64
 10  Agency            148654 non-null  object 
 11  Status            38119 non-null   object 
dtypes: float64(3), int64(1), object(8)
memory usage: 14.7+ MB


Or, when the index is made of strings:

In [17]:
df_salaries.set_index("EmployeeName").info()

<class 'pandas.core.frame.DataFrame'>
Index: 148654 entries, NATHANIEL FORD to Joe Lopez
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   JobTitle          148654 non-null  object 
 2   BasePay           148049 non-null  object 
 3   OvertimePay       148654 non-null  object 
 4   OtherPay          148654 non-null  object 
 5   Benefits          112495 non-null  object 
 6   TotalPay          148654 non-null  float64
 7   TotalPayBenefits  148654 non-null  float64
 8   Year              148654 non-null  int64  
 9   Notes             0 non-null       float64
 10  Agency            148654 non-null  object 
 11  Status            38119 non-null   object 
dtypes: float64(3), int64(2), object(7)
memory usage: 14.7+ MB


### .dtypes

This attribute returns the data type of each column:

In [18]:
df_salaries.dtypes

Id                    int64
EmployeeName         object
JobTitle             object
BasePay              object
OvertimePay          object
OtherPay             object
Benefits             object
TotalPay            float64
TotalPayBenefits    float64
Year                  int64
Notes               float64
Agency               object
Status               object
dtype: object

### .describe()

describe is another very useful method for numeric columns. It provides statistical measures of center and dispersion for the columns that contain numbers, giving a quick first look at the distribution of each variable.

In [19]:
df_salaries.describe()

Unnamed: 0,Id,TotalPay,TotalPayBenefits,Year,Notes
count,148654.0,148654.0,148654.0,148654.0,0.0
mean,74327.5,74768.321972,93692.554811,2012.522643,
std,42912.857795,50517.005274,62793.533483,1.117538,
min,1.0,-618.13,-618.13,2011.0,
25%,37164.25,36168.995,44065.65,2012.0,
50%,74327.5,71426.61,92404.09,2013.0,
75%,111490.75,105839.135,132876.45,2014.0,
max,148654.0,567595.43,567595.43,2014.0,


This method returns a DataFrame with the results, so we can index into it and perform any kind of operation on it:

In [20]:
df_salaries.describe().loc["count"]

Id                  148654.0
TotalPay            148654.0
TotalPayBenefits    148654.0
Year                148654.0
Notes                    0.0
Name: count, dtype: float64
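
As an illustration of operating on the summary, the sketch below derives the interquartile range from the rows of a `describe()` result. The column `pay` and its values are made up for the example:

```python
import pandas as pd

# Hypothetical column of payments
df_demo = pd.DataFrame({"pay": [10.0, 20.0, 30.0, 40.0, 50.0]})

summary = df_demo.describe()

# Interquartile range computed from two rows of the summary table
iqr = summary.loc["75%", "pay"] - summary.loc["25%", "pay"]
print(iqr)
```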

## Slicing

With .loc or .iloc we can do more than retrieve a single element: we can slice the DataFrame to obtain sub-DataFrames.

In [21]:
df_population = pd.read_csv("./datasets/countries_population.csv")

With .iloc we can slice by giving a range of positions, as with lists or NumPy arrays:

In [22]:
df_population.iloc[:10, 0:8]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54208.0,55434.0,56234.0,56699.0
1,Africa Eastern and Southern,AFE,"Population, total",SP.POP.TOTL,130836765.0,134159786.0,137614644.0,141202036.0
2,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996967.0,9169406.0,9351442.0,9543200.0
3,Africa Western and Central,AFW,"Population, total",SP.POP.TOTL,96396419.0,98407221.0,100506960.0,102691339.0
4,Angola,AGO,"Population, total",SP.POP.TOTL,5454938.0,5531451.0,5608499.0,5679409.0
5,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0
6,Andorra,AND,"Population, total",SP.POP.TOTL,13410.0,14378.0,15379.0,16407.0
7,Arab World,ARB,"Population, total",SP.POP.TOTL,92197715.0,94724540.0,97334438.0,100034191.0
8,United Arab Emirates,ARE,"Population, total",SP.POP.TOTL,92417.0,100801.0,112112.0,125130.0
9,Argentina,ARG,"Population, total",SP.POP.TOTL,20481781.0,20817270.0,21153042.0,21488916.0


In [23]:
# Negative positions also work; as with lists, the end of the range (-1) is excluded, so the last row is left out
df_population.iloc[-10:-1, 0:8]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963
256,Virgin Islands (U.S.),VIR,"Population, total",SP.POP.TOTL,32500.0,34300.0,35000.0,39800.0
257,Vietnam,VNM,"Population, total",SP.POP.TOTL,32670050.0,33666110.0,34683410.0,35721210.0
258,Vanuatu,VUT,"Population, total",SP.POP.TOTL,63689.0,65700.0,67793.0,69944.0
259,World,WLD,"Population, total",SP.POP.TOTL,3032156000.0,3071596000.0,3124561000.0,3189656000.0
260,Samoa,WSM,"Population, total",SP.POP.TOTL,108627.0,112112.0,115768.0,119552.0
261,Kosovo,XKX,"Population, total",SP.POP.TOTL,947000.0,966000.0,994000.0,1022000.0
262,"Yemen, Rep.",YEM,"Population, total",SP.POP.TOTL,5315351.0,5393034.0,5473671.0,5556767.0
263,South Africa,ZAF,"Population, total",SP.POP.TOTL,17099840.0,17524530.0,17965730.0,18423160.0
264,Zambia,ZMB,"Population, total",SP.POP.TOTL,3070780.0,3164330.0,3260645.0,3360099.0
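
A small sketch of how the end of a negative range behaves, assuming a made-up ten-row DataFrame (`df_demo` is hypothetical). Leaving the end of the slice empty is what includes the last row:

```python
import pandas as pd

df_demo = pd.DataFrame({"value": range(10)})

without_last = df_demo.iloc[-3:-1]   # positions 7 and 8; the end (-1) is excluded
through_end = df_demo.iloc[-3:]      # positions 7, 8 and 9

print(without_last["value"].tolist(), through_end["value"].tolist())
```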


In [24]:
df_population_with_ct_index = df_population.set_index("Country Code")

With .loc we can also slice.

NOTE: The label range does not follow any sort order. It returns the rows located between the labels ARB and ARM as they appear in the index (in .loc both ends are inclusive). The fact that they are in alphabetical order here is just a coincidence of how the "Country Code" column was ordered when it was set as the index.

In [25]:
df_population_with_ct_index.loc["ARB":"ARM", "2010":"2013"]

Unnamed: 0_level_0,2010,2011,2012,2013
Country Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ARB,354890097.0,363156846.0,371437642.0,379696477.0
ARE,8549998.0,8946778.0,9141598.0,9197908.0
ARG,40788453.0,41261490.0,41733271.0,42202935.0
ARM,2877314.0,2876536.0,2884239.0,2897593.0
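
To make the note about ordering concrete, here is a sketch with a deliberately unsorted, made-up index (`df_demo` and its values are hypothetical). The label slice follows the order in which the labels appear; sorting the index first gives the alphabetical range:

```python
import pandas as pd

# Hypothetical index that is deliberately not in alphabetical order
df_demo = pd.DataFrame({"population": [1, 2, 3, 4]}, index=["ZZZ", "ARB", "AAA", "ARM"])

# Rows from the label "ARB" through the label "ARM", in index order
between = df_demo.loc["ARB":"ARM"]

# After sorting the index, the same slice is an alphabetical range
sorted_between = df_demo.sort_index().loc["ARB":"ARM"]

print(between.index.tolist(), sorted_between.index.tolist())
```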


## Mutability

An important detail is that DataFrames are mutable. If we do not keep this in mind, we can run into problems when modifying values.

Mutability matters because it saves RAM during manipulations, which can quickly become a concern with tabular data.

In [26]:
persona_1 = ["Clotilde", "Acosta", "3 octubre de 1940", "Mar del Plata", "Actriz"]
persona_2 = ["Rosario", "Bléfari", "24/12/65", "MDP", "Cantante"]
persona_3 = ["Norberto", "Carredegoas", "1936-04-12", "San Antonio Oeste"]
persona_4 = ["Bárbara", "Torres", "11-04-1973", "M del plata", "Actuación"]
persona_5 = ["Eugenio", "Weinbaum", "17/08/1961", "MDQ", "Conductor"]

personajes = [persona_1, persona_2, persona_3, persona_4, persona_5]

df_personas = pd.DataFrame(personajes, columns = ['Nombre', 'Apellido', 'Fecha Nacimiento', 'Ciudad', 'Profesion'])

In [27]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDP,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


We build a sub-DataFrame with the first 3 rows and the columns Nombre and Apellido:

In [28]:
df_sub = df_personas.loc[0:2, "Nombre":"Apellido"]

In this sub-DataFrame we change a value:

In [29]:
df_sub.loc[0, "Nombre"] = "Laura"

In [30]:
df_sub

Unnamed: 0,Nombre,Apellido
0,Laura,Acosta
1,Rosario,Bléfari
2,Norberto,Carredegoas


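Because the slice can be a view that shares memory with the original DataFrame, the original changed as well. If we do not keep this in mind, it can cause many problems. This behavior depends on the pandas version and on whether the slice is a view or a copy: older versions typically emit a SettingWithCopyWarning here, and with Copy-on-Write enabled the original is left untouched.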

In [31]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Laura,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDP,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


To prevent changes to a sub-DataFrame from affecting the original DataFrame, we should use the .copy() method. Use it only when justified, because it increases RAM usage.

NOTE: If the DataFrame holds mutable Python objects (for example lists), this method copies only the references to them, not the objects themselves; those objects have to be copied explicitly, for example with Python's copy.deepcopy applied to each element.

In [32]:
df_sub = df_personas.loc[0:2, "Nombre":"Apellido"].copy()

In [33]:
df_sub.loc[0, "Nombre"] = "Verónica"

In [34]:
df_sub

Unnamed: 0,Nombre,Apellido
0,Verónica,Acosta
1,Rosario,Bléfari
2,Norberto,Carredegoas


In [35]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Laura,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDP,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor
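
The following sketch illustrates both points about .copy(), using a made-up DataFrame with a hypothetical `tags` column that stores Python lists. Plain values are protected by the copy, but the lists stay shared unless each one is copied explicitly:

```python
import copy
import pandas as pd

# Hypothetical data: "tags" holds mutable Python lists
df_demo = pd.DataFrame({"Nombre": ["Clotilde", "Rosario"],
                        "tags": [["actriz"], ["cantante"]]})

# .copy() protects plain values: editing the copy leaves the original untouched
df_copy = df_demo.copy()
df_copy.loc[0, "Nombre"] = "Laura"

# ...but the lists inside "tags" are still the same Python objects
df_copy.loc[0, "tags"].append("shared")

# Copying each element explicitly gives independent lists
df_independent = df_demo.copy()
df_independent["tags"] = df_independent["tags"].map(copy.deepcopy)
df_independent.loc[1, "tags"].append("private")

print(df_demo)
```

Here `df_demo` keeps "Clotilde" in the first row, yet its first list gains the "shared" element, while the "private" element appears only in `df_independent`.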


## Conditional slicing

A very powerful manipulation in Pandas is slicing with conditions based on some column:

In [36]:
# We create a boolean Series from the "Country Code" column, marking the rows equal to "ARG"
condition = df_population["Country Code"] == "ARG"
condition

0      False
1      False
2      False
3      False
4      False
       ...  
261    False
262    False
263    False
264    False
265    False
Name: Country Code, Length: 266, dtype: bool

If we pass this boolean Series to the DataFrame, it returns the rows that satisfy the condition:

In [37]:
df_population[condition]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
9,Argentina,ARG,"Population, total",SP.POP.TOTL,20481781.0,20817270.0,21153042.0,21488916.0,21824427.0,22159644.0,...,42202935.0,42669500.0,43131966.0,43590368.0,44044811.0,44494502.0,44938712.0,45376763.0,45808747.0,
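
The same boolean Series can also be passed to .loc, which additionally lets us pick columns. A minimal sketch, assuming a made-up three-row table (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical data, in millions of inhabitants
df_demo = pd.DataFrame({"Country Code": ["ARG", "BRA", "CHL"],
                        "2021": [45.8, 214.0, 19.5]})

mask = df_demo["Country Code"] == "ARG"

rows = df_demo[mask]                 # all columns of the matching rows
values = df_demo.loc[mask, "2021"]   # a single column of the matching rows

print(rows.shape, values.tolist())
```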


We can also build more advanced conditions over multiple columns with boolean operators. Each comparison must be wrapped in parentheses, because & and | bind more tightly than comparison operators. This is where the power of Pandas starts to show:

In [38]:
df_population[(df_population["1960"] < 8000) & (df_population["1961"] < 8000)]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
147,St. Martin (French part),MAF,"Population, total",SP.POP.TOTL,3898.0,3996.0,4078.0,4179.0,4302.0,4471.0,...,36458.0,36018.0,35865.0,36061.0,36569.0,37264.0,38002.0,38659.0,39239.0,
179,Nauru,NRU,"Population, total",SP.POP.TOTL,4377.0,4627.0,4942.0,5270.0,5590.0,5859.0,...,10208.0,10289.0,10374.0,10474.0,10577.0,10678.0,10764.0,10834.0,10873.0,
225,Sint Maarten (Dutch part),SXM,"Population, total",SP.POP.TOTL,2833.0,3077.0,3367.0,3703.0,4063.0,4460.0,...,36607.0,37685.0,38825.0,39969.0,40574.0,40895.0,41608.0,42310.0,42846.0,
228,Turks and Caicos Islands,TCA,"Population, total",SP.POP.TOTL,5825.0,5867.0,5884.0,5870.0,5851.0,5814.0,...,34733.0,35371.0,35979.0,36558.0,37116.0,37667.0,38194.0,38718.0,39226.0,
245,Tuvalu,TUV,"Population, total",SP.POP.TOTL,5321.0,5330.0,5340.0,5341.0,5354.0,5388.0,...,10849.0,10973.0,11099.0,11232.0,11365.0,11505.0,11655.0,11792.0,11925.0,


In [39]:
df_population[(df_population["1960"] < 8000) | (df_population["1961"] < 8000)]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
52,Cayman Islands,CYM,"Population, total",SP.POP.TOTL,7870.0,8024.0,8141.0,8219.0,8299.0,8370.0,...,59933.0,60848.0,61721.0,62564.0,63382.0,64172.0,64948.0,65720.0,66498.0,
147,St. Martin (French part),MAF,"Population, total",SP.POP.TOTL,3898.0,3996.0,4078.0,4179.0,4302.0,4471.0,...,36458.0,36018.0,35865.0,36061.0,36569.0,37264.0,38002.0,38659.0,39239.0,
179,Nauru,NRU,"Population, total",SP.POP.TOTL,4377.0,4627.0,4942.0,5270.0,5590.0,5859.0,...,10208.0,10289.0,10374.0,10474.0,10577.0,10678.0,10764.0,10834.0,10873.0,
225,Sint Maarten (Dutch part),SXM,"Population, total",SP.POP.TOTL,2833.0,3077.0,3367.0,3703.0,4063.0,4460.0,...,36607.0,37685.0,38825.0,39969.0,40574.0,40895.0,41608.0,42310.0,42846.0,
228,Turks and Caicos Islands,TCA,"Population, total",SP.POP.TOTL,5825.0,5867.0,5884.0,5870.0,5851.0,5814.0,...,34733.0,35371.0,35979.0,36558.0,37116.0,37667.0,38194.0,38718.0,39226.0,
245,Tuvalu,TUV,"Population, total",SP.POP.TOTL,5321.0,5330.0,5340.0,5341.0,5354.0,5388.0,...,10849.0,10973.0,11099.0,11232.0,11365.0,11505.0,11655.0,11792.0,11925.0,


In [40]:
df_population[((df_population["1960"] < 8000) | (df_population["1961"] < 8000)) & 
              (df_population["Country Code"] != "NRU")]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
52,Cayman Islands,CYM,"Population, total",SP.POP.TOTL,7870.0,8024.0,8141.0,8219.0,8299.0,8370.0,...,59933.0,60848.0,61721.0,62564.0,63382.0,64172.0,64948.0,65720.0,66498.0,
147,St. Martin (French part),MAF,"Population, total",SP.POP.TOTL,3898.0,3996.0,4078.0,4179.0,4302.0,4471.0,...,36458.0,36018.0,35865.0,36061.0,36569.0,37264.0,38002.0,38659.0,39239.0,
225,Sint Maarten (Dutch part),SXM,"Population, total",SP.POP.TOTL,2833.0,3077.0,3367.0,3703.0,4063.0,4460.0,...,36607.0,37685.0,38825.0,39969.0,40574.0,40895.0,41608.0,42310.0,42846.0,
228,Turks and Caicos Islands,TCA,"Population, total",SP.POP.TOTL,5825.0,5867.0,5884.0,5870.0,5851.0,5814.0,...,34733.0,35371.0,35979.0,36558.0,37116.0,37667.0,38194.0,38718.0,39226.0,
245,Tuvalu,TUV,"Population, total",SP.POP.TOTL,5321.0,5330.0,5340.0,5341.0,5354.0,5388.0,...,10849.0,10973.0,11099.0,11232.0,11365.0,11505.0,11655.0,11792.0,11925.0,


In [41]:
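# Without extra parentheses, & is evaluated before |, so this reads as A | (B & C):
# the "ARG" condition only restricts the 1961 term, and every row with 1960 < 8000 is kept
df_population[(df_population["1960"] < 8000) | (df_population["1961"] < 8000) & 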
              (df_population["Country Code"] == "ARG")]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
52,Cayman Islands,CYM,"Population, total",SP.POP.TOTL,7870.0,8024.0,8141.0,8219.0,8299.0,8370.0,...,59933.0,60848.0,61721.0,62564.0,63382.0,64172.0,64948.0,65720.0,66498.0,
147,St. Martin (French part),MAF,"Population, total",SP.POP.TOTL,3898.0,3996.0,4078.0,4179.0,4302.0,4471.0,...,36458.0,36018.0,35865.0,36061.0,36569.0,37264.0,38002.0,38659.0,39239.0,
179,Nauru,NRU,"Population, total",SP.POP.TOTL,4377.0,4627.0,4942.0,5270.0,5590.0,5859.0,...,10208.0,10289.0,10374.0,10474.0,10577.0,10678.0,10764.0,10834.0,10873.0,
225,Sint Maarten (Dutch part),SXM,"Population, total",SP.POP.TOTL,2833.0,3077.0,3367.0,3703.0,4063.0,4460.0,...,36607.0,37685.0,38825.0,39969.0,40574.0,40895.0,41608.0,42310.0,42846.0,
228,Turks and Caicos Islands,TCA,"Population, total",SP.POP.TOTL,5825.0,5867.0,5884.0,5870.0,5851.0,5814.0,...,34733.0,35371.0,35979.0,36558.0,37116.0,37667.0,38194.0,38718.0,39226.0,
245,Tuvalu,TUV,"Population, total",SP.POP.TOTL,5321.0,5330.0,5340.0,5341.0,5354.0,5388.0,...,10849.0,10973.0,11099.0,11232.0,11365.0,11505.0,11655.0,11792.0,11925.0,
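
The effect of grouping can be checked on a tiny made-up table (`df_demo` and its columns are hypothetical). Without parentheses around the OR, Python evaluates `&` first:

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 5, 9], "y": [9, 5, 1], "code": ["A", "B", "C"]})

a = df_demo["x"] < 3         # True, False, False
b = df_demo["y"] < 3         # False, False, True
c = df_demo["code"] == "B"   # False, True, False

implicit = a | b & c         # parsed as a | (b & c)
grouped = (a | b) & c        # the OR is evaluated first

print(implicit.tolist(), grouped.tolist())
```

The two masks differ: `implicit` still selects the first row, while `grouped` selects none.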


We can slice using statistical calculations:

In [42]:
df_population[(df_population["1960"] == np.max(df_population["1960"]))]
# Equivalent:
df_population[(df_population["1960"] == df_population["1960"].max())]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
259,World,WLD,"Population, total",SP.POP.TOTL,3032156000.0,3071596000.0,3124561000.0,3189656000.0,3255146000.0,3322047000.0,...,7175500000.0,7261847000.0,7347679000.0,7433651000.0,7519371000.0,7602716000.0,7683806000.0,7763933000.0,7836631000.0,


In [43]:
df_population[(df_population["1960"] > np.nanquantile(df_population["1960"], 0.75))]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
1,Africa Eastern and Southern,AFE,"Population, total",SP.POP.TOTL,1.308368e+08,1.341598e+08,1.376146e+08,1.412020e+08,1.449202e+08,1.487700e+08,...,5.626016e+08,5.780754e+08,5.938718e+08,6.099789e+08,6.263929e+08,6.430901e+08,6.600463e+08,6.772433e+08,6.946651e+08,
3,Africa Western and Central,AFW,"Population, total",SP.POP.TOTL,9.639642e+07,9.840722e+07,1.005070e+08,1.026913e+08,1.049535e+08,1.072899e+08,...,3.804379e+08,3.908830e+08,4.015867e+08,4.125513e+08,4.237699e+08,4.352294e+08,4.469116e+08,4.588035e+08,4.708989e+08,
7,Arab World,ARB,"Population, total",SP.POP.TOTL,9.219772e+07,9.472454e+07,9.733444e+07,1.000342e+08,1.028328e+08,1.057364e+08,...,3.796965e+08,3.878998e+08,3.960283e+08,4.040429e+08,4.119428e+08,4.198520e+08,4.278703e+08,4.360807e+08,4.445178e+08,
20,Bangladesh,BGD,"Population, total",SP.POP.TOTL,4.801350e+07,4.936283e+07,5.075215e+07,5.220201e+07,5.374172e+07,5.538511e+07,...,1.527614e+08,1.545174e+08,1.562563e+08,1.579772e+08,1.596854e+08,1.613767e+08,1.630462e+08,1.646894e+08,1.663035e+08,
29,Brazil,BRA,"Population, total",SP.POP.TOTL,7.217924e+07,7.431134e+07,7.651433e+07,7.877265e+07,8.106457e+07,8.337353e+07,...,2.010359e+08,2.027637e+08,2.044718e+08,2.061631e+08,2.078338e+08,2.094693e+08,2.110495e+08,2.125594e+08,2.139934e+08,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,Ukraine,UKR,"Population, total",SP.POP.TOTL,4.266465e+07,4.320635e+07,4.375223e+07,4.428861e+07,4.479696e+07,4.526455e+07,...,4.548965e+07,4.527216e+07,4.515404e+07,4.500467e+07,4.483114e+07,4.462252e+07,4.438620e+07,4.413205e+07,4.381458e+07,
249,Upper middle income,UMC,"Population, total",SP.POP.TOTL,1.115221e+09,1.118631e+09,1.134490e+09,1.161693e+09,1.188526e+09,1.216291e+09,...,2.374694e+09,2.394063e+09,2.412687e+09,2.431146e+09,2.449903e+09,2.466337e+09,2.480708e+09,2.492597e+09,2.501428e+09,
251,United States,USA,"Population, total",SP.POP.TOTL,1.806710e+08,1.836910e+08,1.865380e+08,1.892420e+08,1.918890e+08,1.943030e+08,...,3.160599e+08,3.183863e+08,3.207390e+08,3.230718e+08,3.251221e+08,3.268382e+08,3.283300e+08,3.315011e+08,3.318937e+08,
257,Vietnam,VNM,"Population, total",SP.POP.TOTL,3.267005e+07,3.366611e+07,3.468341e+07,3.572121e+07,3.678000e+07,3.785895e+07,...,9.075259e+07,9.171385e+07,9.267708e+07,9.364044e+07,9.460064e+07,9.554596e+07,9.646211e+07,9.733858e+07,9.816883e+07,


## Missing values

A very important aspect of DataFrames is missing data. There are situations where rows do not have every column filled in. Pandas lets us handle these cases.

Before we start, let's address the elephant in the room. Python and the different libraries (NumPy and Pandas) each have values that indicate missing data, but they behave in completely different ways. That is why it is advisable to use the Pandas methods to look for missing data, since they account for all these differences.
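
As a quick sketch of that recommendation, the top-level function `pd.isna` gives the same answer for the three markers, regardless of how each one behaves under `==`:

```python
import numpy as np
import pandas as pd

markers = [None, np.nan, pd.NA]

# pd.isna treats the three markers the same way
flags = [bool(pd.isna(marker)) for marker in markers]
print(flags)
```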

In [44]:
# Careful: each missing-value marker behaves differently when compared with itself
print(None == None)
print(np.nan == np.nan)
print(pd.NA == pd.NA)

# R handles this much better

True
False
<NA>


In [45]:
array = np.array([[0, 1, 2, 3, 4, 5],
              [0, 1, 2, pd.NA, 5, 6],
              [0, 1, 2, None, np.nan, 6]]).T
df_con_Nan = pd.DataFrame(array)

In [46]:
df_con_Nan

Unnamed: 0,0,1,2
0,0,0.0,0.0
1,1,1.0,1.0
2,2,2.0,2.0
3,3,,
4,4,5.0,
5,5,6.0,6.0


The .notna() method returns a boolean mask that is True wherever the value is not null:

In [47]:
df_con_Nan.notna()

Unnamed: 0,0,1,2
0,True,True,True
1,True,True,True
2,True,True,True
3,True,False,False
4,True,True,False
5,True,True,True


The .isna() method, in contrast, does the opposite:

In [48]:
df_con_Nan.isna()

Unnamed: 0,0,1,2
0,False,False,False
1,False,False,False
2,False,False,False
3,False,True,True
4,False,False,True
5,False,False,False


The .dropna() method removes the rows that have at least one null value:

In [49]:
df_con_Nan.dropna()

Unnamed: 0,0,1,2
0,0,0,0
1,1,1,1
2,2,2,2
5,5,6,6


If we set the axis to columns, it removes the columns that have at least one NaN:

In [50]:
df_con_Nan.dropna(axis="columns")

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
5,5


We can have finer control. For example, **thresh** sets the minimum number of non-null values a row needs in order to be kept; with thresh=2 and three columns, the rows with two or more missing values are removed:

In [51]:
df_con_Nan.dropna(thresh=2)

Unnamed: 0,0,1,2
0,0,0,0.0
1,1,1,1.0
2,2,2,2.0
4,4,5,
5,5,6,6.0


With **subset** we can choose which columns to inspect. In this case, for example, the rows where column 1 is null are removed:

In [52]:
df_con_Nan.dropna(subset=[1])

Unnamed: 0,0,1,2
0,0,0,0.0
1,1,1,1.0
2,2,2,2.0
4,4,5,
5,5,6,6.0


Naturally, we can combine the arguments we have seen.
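
A sketch of how these arguments interact, on a made-up four-column DataFrame (`df_demo` is hypothetical). The `how="all"` argument is not used in the cells above; it is included only as a related option of `dropna`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 4.0, np.nan],
})

# Keep rows with at least 2 non-null values (row 1 has exactly 2)
at_least_two = df_demo.dropna(thresh=2)

# Same threshold idea, but counting only columns "a" and "c"
in_subset = df_demo.dropna(subset=["a", "c"], thresh=1)

# how="all" removes only the rows that are completely empty
not_empty = df_demo.dropna(how="all")

print(at_least_two.index.tolist(), in_subset.index.tolist(), not_empty.index.tolist())
```

Note that `thresh` counts values that are present, not values that are missing, so its effect depends on how many columns are being inspected.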

### Slicing
We can slice instead of dropping, which is more useful when we want to keep the rows (or columns) that contain NaN.

For example, we keep the rows that have null values in column 2:

In [53]:
df_con_Nan[df_con_Nan[2].isna()]

Unnamed: 0,0,1,2
3,3,,
4,4,5.0,


Or the opposite case:

In [54]:
df_con_Nan[df_con_Nan[2].notna()]

Unnamed: 0,0,1,2
0,0,0,0
1,1,1,1
2,2,2,2
5,5,6,6


And we can use these null-value methods in conditional slicing:

In [55]:
df_con_Nan[(df_con_Nan[2].isna()) & (df_con_Nan[0] > 3)]

Unnamed: 0,0,1,2
4,4,5,


In [56]:
df_con_nan_2 = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

In [57]:
df_con_nan_2

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,,,,
3,,3.0,,4.0


Something extremely useful is filling in null values. In this case, for example, we fill with 0:

In [58]:
df_con_nan_2.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0.0
1,3.0,4.0,0.0,1.0
2,0.0,0.0,0.0,0.0
3,0.0,3.0,0.0,4.0


Or with "a":

In [59]:
df_con_nan_2.fillna("a")

Unnamed: 0,A,B,C,D
0,a,2.0,a,0.0
1,3.0,4.0,a,1.0
2,a,a,a,a
3,a,3.0,a,4.0


Or fill with the value immediately above (very useful in time series: when the record for a day is missing, we use the previous day's):

In [60]:
# .ffill() is equivalent to fillna(method="ffill"), which is deprecated in recent pandas versions
df_con_nan_2.ffill()

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,3.0,4.0,,1.0
3,3.0,3.0,,4.0


We can even do more advanced things: with a dictionary we can specify, for each column, what to fill with. Here we fill column D with the mean value of that column.

In [61]:
values = {"D": np.mean(df_con_nan_2["D"])}
df_con_nan_2.fillna(value=values)

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,,,,1.666667
3,,3.0,,4.0
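
The dictionary can hold a different rule per column. A sketch with made-up data (`df_demo` is hypothetical), filling one column with its mean and another with a constant:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 10.0, np.nan]})

# Series.mean() skips missing values by default, so the mean of A is 2.0
fill_values = {"A": df_demo["A"].mean(), "B": 0.0}
filled = df_demo.fillna(value=fill_values)

print(filled)
```

Columns that are not listed in the dictionary are left as they are.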


## .replace()

This is another advanced method, which performs replacements in the DataFrame.

In [62]:
persona_1 = ["Clotilde", "Acosta", "3 octubre de 1940", "Mar del Plata", "Actriz"]
persona_2 = ["Rosario", "Bléfari", "24/12/65", "MDQ", "Cantante"]
persona_3 = ["Norberto", "Carredegoas", "1936-04-12", "San Antonio Oeste"]
persona_4 = ["Bárbara", "Torres", "11-04-1973", "M del plata", "Actuación"]
persona_5 = ["Eugenio", "Weinbaum", "17/08/1961", "MDQ", "Conductor"]

personajes = [persona_1, persona_2, persona_3, persona_4, persona_5]

df_personas = pd.DataFrame(personajes, columns = ['Nombre', 'Apellido', 'Fecha Nacimiento', 'Ciudad', 'Profesion'])

In [63]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDQ,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


For example, we can replace every occurrence of MDQ with Mar del Plata:

In [64]:
df_personas.replace("MDQ", "Mar del Plata")

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,Mar del Plata,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,Mar del Plata,Conductor


In [65]:
# replace() does not modify df_personas; it returns a new DataFrame with the replacement applied (this is common among pandas methods)
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDQ,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


In [66]:
# With the optional argument inplace=True, however, the change is made on the DataFrame itself. Use with care!
df_personas.replace("3 octubre de 1940", "03/10/1940", inplace=True)

In [67]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,03/10/1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDQ,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor
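
An alternative to `inplace=True` is to bind the returned DataFrame to a name. A minimal sketch with made-up data, showing that the original object is left untouched:

```python
import pandas as pd

# Hypothetical column of cities
original = pd.DataFrame({"Ciudad": ["Mar del Plata", "MDQ", "MDQ"]})

# replace() returns a new DataFrame; `original` keeps its values
cleaned = original.replace("MDQ", "Mar del Plata")

print(original["Ciudad"].tolist())
print(cleaned["Ciudad"].tolist())
```

Keeping both objects around makes it easy to compare the data before and after the change, at the cost of holding two copies in memory.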


This method becomes very powerful when combined with regular expressions:

In [68]:
# Using a regex: "MD" followed by any run of non-whitespace characters
df_personas.replace(to_replace=r'[M][D][^\s]*', value="Mar del Plata", regex=True)

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,03/10/1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,Mar del Plata,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,Mar del Plata,Conductor


In [69]:
# Using a dictionary of regexes: each pattern maps to its own replacement
df_personas.replace(regex={r'[M][D][^\s]*': "Mar del Plata", r'(?:^|\W)M(?:$|\W)': 'Mar '})

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,03/10/1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,Mar del Plata,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,Mar del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,Mar del Plata,Conductor
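
Because a DataFrame-wide regex is tested against every string cell, it can be safer to restrict it to one column. A minimal sketch, assuming a made-up Series of cities and patterns anchored with `^` and `$` so that only whole-cell matches are rewritten:

```python
import pandas as pd

# Hypothetical city spellings
cities = pd.Series(["Mar del Plata", "MDQ", "San Antonio Oeste", "M del plata"])

# Each key is a regex, each value its replacement
patterns = {
    r"^MD\S*$": "Mar del Plata",
    r"^M del plata$": "Mar del Plata",
}

normalized = cities.replace(regex=patterns)
print(normalized.tolist())
```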


In [70]:
# Using a nested dictionary: {column: {old value: new value}}
df_personas.replace({'Fecha Nacimiento': {"3 octubre de 1940": "03/10/1940", "1936-04-12": "12/04/1936",
                                          "11-04-1973":"11/04/1973", "24/12/65":"24/12/1965"}})

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,03/10/1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/1965,MDQ,Cantante
2,Norberto,Carredegoas,12/04/1936,San Antonio Oeste,
3,Bárbara,Torres,11/04/1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


## Transformations

Pandas also lets us create new columns, or modify existing ones, by operating on the columns the DataFrame already has:

In [71]:
persona_1 = ["Clotilde", "Acosta", "3 octubre de 1940", "Mar del Plata", "Actriz"]
persona_2 = ["Rosario", "Bléfari", "24/12/65", "MDQ", "Cantante"]
persona_3 = ["Norberto", "Carredegoas", "1936-04-12", "San Antonio Oeste"]
persona_4 = ["Bárbara", "Torres", "11-04-1973", "M del plata", "Actuación"]
persona_5 = ["Eugenio", "Weinbaum", "17/08/1961", "MDQ", "Conductor"]

personajes = [persona_1, persona_2, persona_3, persona_4, persona_5]

df_personas = pd.DataFrame(personajes, columns = ['Nombre', 'Apellido', 'Fecha Nacimiento', 'Ciudad', 'Profesion'])

In [72]:
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz
1,Rosario,Bléfari,24/12/65,MDQ,Cantante
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,
3,Bárbara,Torres,11-04-1973,M del plata,Actuación
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor


For example, we can create a full-name column by joining the first name and the last name:

In [73]:
df_personas["Nombre completo"] = df_personas["Nombre"] + " " + df_personas["Apellido"]
df_personas

Unnamed: 0,Nombre,Apellido,Fecha Nacimiento,Ciudad,Profesion,Nombre completo
0,Clotilde,Acosta,3 octubre de 1940,Mar del Plata,Actriz,Clotilde Acosta
1,Rosario,Bléfari,24/12/65,MDQ,Cantante,Rosario Bléfari
2,Norberto,Carredegoas,1936-04-12,San Antonio Oeste,,Norberto Carredegoas
3,Bárbara,Torres,11-04-1973,M del plata,Actuación,Bárbara Torres
4,Eugenio,Weinbaum,17/08/1961,MDQ,Conductor,Eugenio Weinbaum


Or mathematical operations, such as computing degrees Celsius from degrees Fahrenheit:

In [74]:
# sep=r"\s+" splits on any run of whitespace (delim_whitespace=True is deprecated in recent pandas versions)
df_temp_buenos_aires = pd.read_csv("./datasets/AGBUENOS.txt", header=None, sep=r"\s+",
                                  names=["month", "day", "year", "Temp F"])

In [75]:
df_temp_buenos_aires.head(10)

Unnamed: 0,month,day,year,Temp F
0,1,1,1995,82.4
1,1,2,1995,75.1
2,1,3,1995,73.7
3,1,4,1995,77.1
4,1,5,1995,79.5
5,1,6,1995,71.3
6,1,7,1995,71.4
7,1,8,1995,75.2
8,1,9,1995,66.3
9,1,10,1995,61.8


In [76]:
df_temp_buenos_aires["Temp C"] = (df_temp_buenos_aires["Temp F"] - 32) * 0.5556

In [77]:
df_temp_buenos_aires

Unnamed: 0,month,day,year,Temp F,Temp C
0,1,1,1995,82.4,28.00224
1,1,2,1995,75.1,23.94636
2,1,3,1995,73.7,23.16852
3,1,4,1995,77.1,25.05756
4,1,5,1995,79.5,26.39100
...,...,...,...,...,...
9261,5,9,2020,61.3,16.27908
9262,5,10,2020,67.0,19.44600
9263,5,11,2020,62.4,16.89024
9264,5,12,2020,52.2,11.22312
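
The factor 0.5556 used above is a rounded form of 5/9, so the results drift slightly from the exact conversion. A minimal sketch with the exact factor, on a few made-up Fahrenheit values:

```python
import pandas as pd

# Hypothetical readings: freezing point, the first value of the dataset, boiling point
temps = pd.DataFrame({"Temp F": [32.0, 82.4, 212.0]})

# Exact conversion: C = (F - 32) * 5 / 9
temps["Temp C"] = (temps["Temp F"] - 32) * 5 / 9

print(temps)
```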


### .apply()

`.apply()` is a transformation method that applies a function along an axis of a DataFrame (to each column or each row) or to each element of a Series. It tends to be computationally inefficient, so it is best reserved for cases where the transformation cannot be expressed any other way, or where the DataFrame is small.

In [78]:
df_sensores = pd.DataFrame({"sensor_1": np.random.uniform(-1, 1, 100),
                           "sensor_2": np.random.randn(100),
                           "sensor_3": np.random.poisson(size=100)})

The DataFrame created above holds random values that simulate the readings of three sensors at different instants in time:

In [79]:
df_sensores.head()

Unnamed: 0,sensor_1,sensor_2,sensor_3
0,0.020209,-0.973746,0
1,0.950218,-0.183695,0
2,0.303155,-0.277144,2
3,-0.245605,-0.09257,1
4,0.495088,0.400893,1


With .apply() we can compute the exponential of every reading:

In [80]:
df_sensores.apply(np.exp).head()

Unnamed: 0,sensor_1,sensor_2,sensor_3
0,1.020415,0.377666,1.0
1,2.586273,0.83219,1.0
2,1.354124,0.757945,7.389056
3,0.782231,0.911586,2.718282
4,1.640643,1.493157,2.718282


We can also revisit the conversion to Celsius using .apply() and anonymous functions:

In [81]:
# Anonymous functions are very handy here
df_temp_buenos_aires["Temp C apply"] = df_temp_buenos_aires["Temp F"].apply(lambda x : (x - 32) * 0.5556)
df_temp_buenos_aires.head()

Unnamed: 0,month,day,year,Temp F,Temp C,Temp C apply
0,1,1,1995,82.4,28.00224,28.00224
1,1,2,1995,75.1,23.94636,23.94636
2,1,3,1995,73.7,23.16852,23.16852
3,1,4,1995,77.1,25.05756,25.05756
4,1,5,1995,79.5,26.391,26.391


Careful: some functions perform aggregations. That is, they are not applied element by element; instead they compute one aggregated value per column. For example, here we compute the mean of each sensor:

In [82]:
# Aggregating functions. Note: axis names the axis the operation runs along (axis=0, down the rows),
# so the result has one aggregated value per entry of the other axis (one per column).
df_sensores.apply(np.mean)

sensor_1    0.039759
sensor_2    0.211523
sensor_3    1.040000
dtype: float64

If we change the axis, the mean is computed for each row instead:

In [83]:
df_sensores.apply(np.mean, axis=1)

0    -0.317846
1     0.255508
2     0.675337
3     0.220608
4     0.631994
        ...   
95    0.395734
96   -0.126388
97    0.554983
98    0.997121
99    0.877795
Length: 100, dtype: float64
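
The effect of `axis` is easier to check on a tiny table. A minimal sketch with made-up readings, comparing `.apply(np.mean)` against the built-in `.mean()`:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 3 rows, 2 sensors
readings = pd.DataFrame({
    "sensor_1": [1.0, 2.0, 3.0],
    "sensor_2": [10.0, 20.0, 30.0],
})

# Default axis=0: the function receives each column, giving one value per sensor
per_column = readings.apply(np.mean)

# axis=1: the function receives each row, giving one value per row
per_row = readings.apply(np.mean, axis=1)

print(per_column)
print(per_row)
```

For common aggregations the dedicated methods (`readings.mean()`, `readings.mean(axis=1)`) return the same values and are usually the simpler choice.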

In [84]:
df_sensores.apply(lambda x : x["sensor_1"] - x["sensor_2"], axis=1)

0     0.993956
1     1.133913
2     0.580298
3    -0.153036
4     0.094196
        ...   
95    0.752027
96   -0.320779
97   -1.692419
98   -1.293366
99   -1.688330
Length: 100, dtype: float64
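
A row-wise lambda like the one above can often be written as plain column arithmetic. A minimal sketch on made-up readings, checking that both forms agree:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings
readings = pd.DataFrame({
    "sensor_1": [0.5, -0.2, 0.9],
    "sensor_2": [0.1, 0.3, -0.4],
})

# Row by row: the lambda receives each row as a Series
row_wise = readings.apply(lambda row: row["sensor_1"] - row["sensor_2"], axis=1)

# Vectorized: the subtraction runs on whole columns at once
vectorized = readings["sensor_1"] - readings["sensor_2"]

print(np.allclose(row_wise, vectorized))
```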

We can also apply several operations at once:

In [85]:
df_sensores.apply([np.exp, np.sin, lambda x : 0 if x < 0.5 else 3 ]).head(10)

Unnamed: 0_level_0,sensor_1,sensor_1,sensor_1,sensor_2,sensor_2,sensor_2,sensor_3,sensor_3,sensor_3
Unnamed: 0_level_1,exp,sin,<lambda>,exp,sin,<lambda>,exp,sin,<lambda>
0,1.020415,0.020208,0,0.377666,-0.826998,0,1.0,0.0,0
1,2.586273,0.813542,3,0.83219,-0.182664,0,1.0,0.0,0
2,1.354124,0.298532,0,0.757945,-0.27361,0,7.389056,0.909297,3
3,0.782231,-0.243143,0,0.911586,-0.092437,0,2.718282,0.841471,3
4,1.640643,0.475109,0,1.493157,0.39024,0,2.718282,0.841471,3
5,1.455842,0.366816,0,0.236681,-0.991594,0,1.0,0.0,0
6,0.680545,-0.37543,0,31.255326,-0.296091,3,2.718282,0.841471,3
7,0.467463,-0.689237,0,8.473334,0.843984,3,1.0,0.0,0
8,0.486059,-0.660455,0,4.49164,0.997649,3,20.085537,0.14112,3
9,0.664204,-0.397845,0,3.602617,0.958491,3,1.0,0.0,0
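
The threshold rule in the lambda above (0 below 0.5, otherwise 3) can also be expressed without `.apply()`. A minimal sketch using `np.where` on made-up readings; wrapping the result back into a DataFrame is one possible choice, not the only one:

```python
import numpy as np
import pandas as pd

# Hypothetical readings
readings = pd.DataFrame({"sensor_1": [0.2, 0.95], "sensor_3": [2, 0]})

# np.where picks 0 where the condition holds and 3 elsewhere, for the whole table at once
flags = pd.DataFrame(
    np.where(readings < 0.5, 0, 3),
    index=readings.index,
    columns=readings.columns,
)

print(flags)
```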


## String operations

String manipulation is something a data scientist runs into all the time. Pandas provides tools for it:

In [86]:
df_salary = pd.read_csv("./datasets/Salaries.csv", chunksize=5)
df_salary = next(df_salary)

In [87]:
df_salary

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


For example, we can convert a column to lowercase:

In [88]:
df_salary["JobTitle"] = df_salary["JobTitle"].str.lower()
df_salary

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,general manager-metropolitan transit authority,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,captain iii (police department),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,captain iii (police department),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,wire rope cable maintenance mechanic,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"deputy chief of department,(fire department)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


Or to title case:

In [89]:
df_salary["EmployeeName"] = df_salary["EmployeeName"].str.title()
df_salary

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,Nathaniel Ford,general manager-metropolitan transit authority,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,Gary Jimenez,captain iii (police department),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,Albert Pardini,captain iii (police department),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,Christopher Chong,wire rope cable maintenance mechanic,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,Patrick Gardner,"deputy chief of department,(fire department)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


We can get the number of characters in each string:

In [90]:
df_salary["len(Name)"] = df_salary["EmployeeName"].str.len()
df_salary

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,len(Name)
0,1,Nathaniel Ford,general manager-metropolitan transit authority,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,14
1,2,Gary Jimenez,captain iii (police department),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,12
2,3,Albert Pardini,captain iii (police department),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,14
3,4,Christopher Chong,wire rope cable maintenance mechanic,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,17
4,5,Patrick Gardner,"deputy chief of department,(fire department)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,15


In [91]:
# String methods also work for filtering rows
df_salary[df_salary["EmployeeName"].str.startswith("Nathaniel")]

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,len(Name)
0,1,Nathaniel Ford,general manager-metropolitan transit authority,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,14


In [92]:
df_salary[df_salary["JobTitle"].str.contains("department")]

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,len(Name)
1,2,Gary Jimenez,captain iii (police department),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,12
2,3,Albert Pardini,captain iii (police department),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,14
4,5,Patrick Gardner,"deputy chief of department,(fire department)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,15
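
Real columns often mix upper and lower case and contain missing values. A minimal sketch, assuming a made-up Series of job titles, of how `str.contains` can handle both through its `case` and `na` parameters:

```python
import pandas as pd

# Hypothetical job titles, including a missing one
titles = pd.Series(["Captain III (POLICE DEPARTMENT)", "wire rope mechanic", None])

# case=False ignores capitalization; na=False treats missing values as "no match"
mask = titles.str.contains("department", case=False, na=False)

print(titles[mask])
```

By default the pattern is interpreted as a regular expression; passing `regex=False` searches for the literal text instead.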


We can also split strings, for example into first name and last name:

In [93]:
df_salary["EmployeeName"].str.split(" ", expand=True, n=1)

Unnamed: 0,0,1
0,Nathaniel,Ford
1,Gary,Jimenez
2,Albert,Pardini
3,Christopher,Chong
4,Patrick,Gardner


In [94]:
nombre_completo_list = df_salary["EmployeeName"].str.split(" ", n = 1, expand=True)
 
df_salary["First Name"] = nombre_completo_list[0]

df_salary["Last Name"] = nombre_completo_list[1]

In [95]:
df_salary

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status,len(Name),First Name,Last Name
0,1,Nathaniel Ford,general manager-metropolitan transit authority,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,,14,Nathaniel,Ford
1,2,Gary Jimenez,captain iii (police department),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,,12,Gary,Jimenez
2,3,Albert Pardini,captain iii (police department),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,,14,Albert,Pardini
3,4,Christopher Chong,wire rope cable maintenance mechanic,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,,17,Christopher,Chong
4,5,Patrick Gardner,"deputy chief of department,(fire department)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,,15,Patrick,Gardner
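
The split and the two assignments can be collapsed into a single statement. A minimal sketch on made-up names, assuming every name contains at least one space:

```python
import pandas as pd

# Hypothetical employee names
staff = pd.DataFrame({"EmployeeName": ["Nathaniel Ford", "Gary Jimenez"]})

# n=1 splits on the first space only; expand=True returns one column per piece,
# which are assigned to the two new columns in order
staff[["First Name", "Last Name"]] = staff["EmployeeName"].str.split(" ", n=1, expand=True)

print(staff)
```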


## Date operations

Another area where Pandas is very powerful is date handling.

In [96]:
server_df = pd.read_csv("./datasets/server.csv")

In [97]:
server_df.head(10)

Unnamed: 0,datetime,server_id,cpu_utilization,free_memory,session_count
0,2019-03-06 00:00:00,100,0.4,0.54,52
1,2019-03-06 01:00:00,100,0.49,0.51,58
2,2019-03-06 02:00:00,100,0.49,0.54,53
3,2019-03-06 03:00:00,100,0.44,0.56,49
4,2019-03-06 04:00:00,100,0.42,0.52,54
5,2019-03-06 05:00:00,100,0.49,0.5,54
6,2019-03-06 06:00:00,100,0.43,0.54,50
7,2019-03-06 07:00:00,100,0.43,0.51,55
8,2019-03-06 08:00:00,100,0.46,0.47,50
9,2019-03-06 09:00:00,100,0.48,0.51,51


In [98]:
server_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40800 entries, 0 to 40799
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   datetime         40800 non-null  object 
 1   server_id        40800 non-null  int64  
 2   cpu_utilization  40800 non-null  float64
 3   free_memory      40800 non-null  float64
 4   session_count    40800 non-null  int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 1.6+ MB


In [99]:
# The datetime column holds strings.
# Some filters still work, because strings compare lexicographically and timestamps written as
# "YYYY-MM-DD HH:MM:SS" sort in chronological order. But we are limited to that.
server_df[server_df["datetime"] < "2019-03-06 02:00:00"]

Unnamed: 0,datetime,server_id,cpu_utilization,free_memory,session_count
0,2019-03-06 00:00:00,100,0.40,0.54,52
1,2019-03-06 01:00:00,100,0.49,0.51,58
816,2019-03-06 00:00:00,101,0.82,0.21,89
817,2019-03-06 01:00:00,101,0.81,0.16,85
1632,2019-03-06 00:00:00,102,0.78,0.21,79
...,...,...,...,...,...
38353,2019-03-06 01:00:00,147,0.50,0.48,58
39168,2019-03-06 00:00:00,148,0.79,0.24,79
39169,2019-03-06 01:00:00,148,0.76,0.30,76
39984,2019-03-06 00:00:00,149,0.68,0.21,80


But if we convert the *"datetime"* column to the datetime data type, we can perform all kinds of manipulations:

In [100]:
# Convert the datetime column
server_df['datetime2'] = pd.to_datetime(server_df['datetime'])

In the previous cell, pd.to_datetime() inferred the date format, but we can state it explicitly to make parsing easier:

In [101]:
server_df["datetime"] = pd.to_datetime(server_df["datetime"], format="%Y-%m-%d %H:%M:%S")
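
With an explicit format, any value that does not match raises an error by default. A minimal sketch, on made-up strings, of how `errors="coerce"` turns such values into `NaT` so they can be inspected afterwards:

```python
import pandas as pd

# Hypothetical raw column with one malformed entry
raw = pd.Series(["2019-03-06 00:00:00", "2019-03-06 01:00:00", "not a date"])

# Values that do not match the format become NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
print(raw[parsed.isna()])  # the entries that failed to parse
```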

### Slicing with dates

When slicing with dates, we can use just part of the date to select rows:

In [102]:
# This works when the dates are used as the index
server_df_index_date = server_df.set_index("datetime")

In [103]:
server_df_index_date

Unnamed: 0_level_0,server_id,cpu_utilization,free_memory,session_count,datetime2
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-03-06 00:00:00,100,0.40,0.54,52,2019-03-06 00:00:00
2019-03-06 01:00:00,100,0.49,0.51,58,2019-03-06 01:00:00
2019-03-06 02:00:00,100,0.49,0.54,53,2019-03-06 02:00:00
2019-03-06 03:00:00,100,0.44,0.56,49,2019-03-06 03:00:00
2019-03-06 04:00:00,100,0.42,0.52,54,2019-03-06 04:00:00
...,...,...,...,...,...
2019-04-08 19:00:00,149,0.73,0.20,81,2019-04-08 19:00:00
2019-04-08 20:00:00,149,0.75,0.25,83,2019-04-08 20:00:00
2019-04-08 21:00:00,149,0.80,0.26,82,2019-04-08 21:00:00
2019-04-08 22:00:00,149,0.75,0.29,82,2019-04-08 22:00:00


In [104]:
server_df_index_date.loc['2019-03-07 02:00:00'].head(5)

Unnamed: 0_level_0,server_id,cpu_utilization,free_memory,session_count,datetime2
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-03-07 02:00:00,100,0.44,0.5,56,2019-03-07 02:00:00
2019-03-07 02:00:00,101,0.78,0.21,87,2019-03-07 02:00:00
2019-03-07 02:00:00,102,0.75,0.27,80,2019-03-07 02:00:00
2019-03-07 02:00:00,103,0.76,0.28,85,2019-03-07 02:00:00
2019-03-07 02:00:00,104,0.74,0.24,77,2019-03-07 02:00:00


In [105]:
server_df_index_date.loc['2019-03-07'].head(5)

Unnamed: 0_level_0,server_id,cpu_utilization,free_memory,session_count,datetime2
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-03-07 00:00:00,100,0.51,0.52,55,2019-03-07 00:00:00
2019-03-07 01:00:00,100,0.46,0.5,49,2019-03-07 01:00:00
2019-03-07 02:00:00,100,0.44,0.5,56,2019-03-07 02:00:00
2019-03-07 03:00:00,100,0.45,0.52,51,2019-03-07 03:00:00
2019-03-07 04:00:00,100,0.42,0.5,53,2019-03-07 04:00:00


This kind of slice also accepts other strings that can be parsed as dates:

In [106]:
server_df_index_date.loc['March 2019']

Unnamed: 0_level_0,server_id,cpu_utilization,free_memory,session_count,datetime2
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-03-06 00:00:00,100,0.40,0.54,52,2019-03-06 00:00:00
2019-03-06 01:00:00,100,0.49,0.51,58,2019-03-06 01:00:00
2019-03-06 02:00:00,100,0.49,0.54,53,2019-03-06 02:00:00
2019-03-06 03:00:00,100,0.44,0.56,49,2019-03-06 03:00:00
2019-03-06 04:00:00,100,0.42,0.52,54,2019-03-06 04:00:00
...,...,...,...,...,...
2019-03-31 19:00:00,149,0.79,0.26,79,2019-03-31 19:00:00
2019-03-31 20:00:00,149,0.71,0.27,81,2019-03-31 20:00:00
2019-03-31 21:00:00,149,0.71,0.31,81,2019-03-31 21:00:00
2019-03-31 22:00:00,149,0.80,0.24,84,2019-03-31 22:00:00


In [107]:
server_df_index_date.loc['2019-03']

Unnamed: 0_level_0,server_id,cpu_utilization,free_memory,session_count,datetime2
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-03-06 00:00:00,100,0.40,0.54,52,2019-03-06 00:00:00
2019-03-06 01:00:00,100,0.49,0.51,58,2019-03-06 01:00:00
2019-03-06 02:00:00,100,0.49,0.54,53,2019-03-06 02:00:00
2019-03-06 03:00:00,100,0.44,0.56,49,2019-03-06 03:00:00
2019-03-06 04:00:00,100,0.42,0.52,54,2019-03-06 04:00:00
...,...,...,...,...,...
2019-03-31 19:00:00,149,0.79,0.26,79,2019-03-31 19:00:00
2019-03-31 20:00:00,149,0.71,0.27,81,2019-03-31 20:00:00
2019-03-31 21:00:00,149,0.71,0.31,81,2019-03-31 21:00:00
2019-03-31 22:00:00,149,0.80,0.24,84,2019-03-31 22:00:00
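
Partial date strings also work as the endpoints of a range. A minimal sketch, assuming a small made-up log with a sorted daily index:

```python
import pandas as pd

# Hypothetical log covering 30 March to 2 April 2019
index = pd.date_range("2019-03-30", periods=4, freq="D")
log = pd.DataFrame({"cpu": [0.4, 0.5, 0.6, 0.7]}, index=index)

# Every row that falls in March 2019
march = log.loc["2019-03"]

# A range between two days; both endpoints are included
span = log.loc["2019-03-31":"2019-04-01"]

print(march)
print(span)
```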


### Extracting information from the date column

We also have an arsenal of attributes for extracting parts of a date or for conditional slicing:

In [108]:
server_df["year"] = server_df["datetime"].dt.year
server_df["month"] = server_df["datetime"].dt.month
server_df["day"] = server_df["datetime"].dt.day
server_df["hour"] = server_df["datetime"].dt.hour
server_df["weekday"] = server_df["datetime"].dt.weekday
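
These attributes can drive conditional filters. A minimal sketch, on made-up timestamps, that keeps only weekend rows; `weekday` counts from Monday = 0, so Saturday and Sunday are 5 and 6:

```python
import pandas as pd

# Hypothetical timestamps: a Wednesday, a Saturday and a Sunday
stamps = pd.Series(pd.to_datetime(["2019-03-06 10:00", "2019-03-09 12:00", "2019-03-10 08:00"]))

# Name of each day, handy for checking the numeric codes
day_names = stamps.dt.day_name()

# Boolean mask built from the weekday code
weekend = stamps[stamps.dt.weekday >= 5]

print(day_names.tolist())
print(weekend)
```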

In [109]:
server_df["datetime"].dt.is_leap_year

0        False
1        False
2        False
3        False
4        False
         ...  
40795    False
40796    False
40797    False
40798    False
40799    False
Name: datetime, Length: 40800, dtype: bool

In [110]:
server_df["datetime"].dt.quarter

0        1
1        1
2        1
3        1
4        1
        ..
40795    2
40796    2
40797    2
40798    2
40799    2
Name: datetime, Length: 40800, dtype: int64

In [111]:
server_df["datetime"].dt.nanosecond

0        0
1        0
2        0
3        0
4        0
        ..
40795    0
40796    0
40797    0
40798    0
40799    0
Name: datetime, Length: 40800, dtype: int64

Dates are often stored across several numeric columns; in that case we can build a date column from them:

In [112]:
server_df["date_2"] = pd.to_datetime(server_df[["year", "month","day", "hour"]])

In [113]:
server_df

Unnamed: 0,datetime,server_id,cpu_utilization,free_memory,session_count,datetime2,year,month,day,hour,weekday,date_2
0,2019-03-06 00:00:00,100,0.40,0.54,52,2019-03-06 00:00:00,2019,3,6,0,2,2019-03-06 00:00:00
1,2019-03-06 01:00:00,100,0.49,0.51,58,2019-03-06 01:00:00,2019,3,6,1,2,2019-03-06 01:00:00
2,2019-03-06 02:00:00,100,0.49,0.54,53,2019-03-06 02:00:00,2019,3,6,2,2,2019-03-06 02:00:00
3,2019-03-06 03:00:00,100,0.44,0.56,49,2019-03-06 03:00:00,2019,3,6,3,2,2019-03-06 03:00:00
4,2019-03-06 04:00:00,100,0.42,0.52,54,2019-03-06 04:00:00,2019,3,6,4,2,2019-03-06 04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
40795,2019-04-08 19:00:00,149,0.73,0.20,81,2019-04-08 19:00:00,2019,4,8,19,0,2019-04-08 19:00:00
40796,2019-04-08 20:00:00,149,0.75,0.25,83,2019-04-08 20:00:00,2019,4,8,20,0,2019-04-08 20:00:00
40797,2019-04-08 21:00:00,149,0.80,0.26,82,2019-04-08 21:00:00,2019,4,8,21,0,2019-04-08 21:00:00
40798,2019-04-08 22:00:00,149,0.75,0.29,82,2019-04-08 22:00:00,2019,4,8,22,0,2019-04-08 22:00:00
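
The same assembly can be tried on a tiny table. A minimal sketch with made-up values; `pd.to_datetime` recognizes the component columns by name, so they need to be called `year`, `month`, `day` (and optionally `hour`, `minute`, `second`):

```python
import pandas as pd

# Hypothetical date components stored as integers
parts = pd.DataFrame({
    "year": [2019, 2019],
    "month": [3, 4],
    "day": [6, 8],
    "hour": [1, 19],
})

# One timestamp per row, built from the named columns
stamps = pd.to_datetime(parts)

print(stamps)
```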
