
# Class 8: Introduction to Handling Tabular Data with Pandas

**MDS7202: Laboratorio de Programación Científica para Ciencia de Datos**


### Class objectives

- Introduce data structured in tabular form.
- Understand the introductory aspects of `pandas`: `Series` and `DataFrames`.
- Indexing and basic operations on `DataFrames`.
- Filtering and queries.


# Motivation

First, we present the dataset we will work with throughout the class, followed by some questions it raises.




### Better Life Index

To explain `pandas`, we will analyse OECD data, specifically the Better Life Index:


<img src="https://raw.githubusercontent.com/MDS7202/MDS7202/main/recursos/2023-01/08-Pandas1/oecd.png" alt="OECD Better life index"/>


http://www.oecdbetterlifeindex.org/

https://stats.oecd.org/Index.aspx?DataSetCode=BLI

The index covers 11 topics considered essential for the well-being of the population. Each topic contains one or more indicators:

| Topic | Indicator (English) | Indicator (Spanish) | Unit | Description |
|---|---|---|---|---|
| Housing 🏠 | Dwellings without basic facilities | Vivienda con Instalaciones Básicas | Percentage | Percentage of people living in dwellings without an indoor flushing toilet, latest available year |
|  | Housing expenditure | Gastos en Vivienda | Percentage | Share of housing costs in household net adjusted income, latest available year |
|  | Rooms per person | Habitaciones por Persona | Ratio | Average number of rooms shared per person in a dwelling, latest available year |
| Income 💰 | Household net adjusted disposable income | Ingreso Familiar Disponible | US Dollar | Average amount of money a household earns per year after taxes, latest available year |
|  | Household net wealth | Patrimonio Neto Familiar | US Dollar | Average total value of a household's financial assets (savings, stocks) minus its liabilities (loans), latest available year |
| Jobs ⚙️ | Labour market insecurity | Seguridad en el Empleo | Percentage | Expected loss of earnings when someone becomes unemployed, latest available year |
|  | Employment rate | Tasa de Empleo | Percentage | Percentage of people aged 15 to 64 currently in paid employment, latest available year |
|  | Long-term unemployment rate | Tasa de Desempleo a Largo Plazo | Percentage | Percentage of people aged 15 to 64 who are not working but have been actively looking for a job for more than a year, latest available year |
|  | Personal earnings | Ingresos Personales | US Dollar | Average annual earnings per full-time employee, latest available year |
| Community 🧑‍🤝‍🧑   | Quality of support network  | Calidad del Apoyo Social | Percentage | Percentage of people with friends or relatives they can rely on in case of need |
| Education 📚 | Educational attainment | Nivel de Educación | Percentage | Percentage of people aged 25 to 64 who have completed at least upper secondary education, latest available year |
|  | Student skills | Competencias de estudiantes en matemáticas, lectura y ciencias | Average score | Average performance of 15-year-old students according to PISA (Programme for International Student Assessment) |
|  | Years in education  | Años de educación | Years | Average duration of formal education in which a five-year-old child can expect to enrol during their lifetime |
| Environment 🌳 | Air pollution | Contaminación del Aire | Micrograms per cubic metre | Average concentration of particulate matter (PM2.5) in cities with more than 100,000 inhabitants, measured in micrograms per cubic metre, latest available year |
|  | Water quality | Calidad del Agua | Percentage | Percentage of people who report being satisfied with local water quality |
| Civic Engagement 🗳️  | Stakeholder engagement for developing regulations | Participación de los interesados en la elaboración de regulaciones | Average score | Level of government transparency when drafting regulations, latest available year |
|  | Voter turnout | Participación electoral | Percentage | Percentage of registered voters who voted in recent elections, latest available year |
| Health ⚕️ | Life expectancy | Esperanza de vida | Years | Average number of years a person can expect to live, latest available year |
|  | Self-reported health | Salud según informan las personas | Percentage | Percentage of people who report that their health is "good or very good", latest available year |
| Life Satisfaction ✨ | Life satisfaction | Satisfacción ante la vida | Average score | Average self-evaluation of life satisfaction, on a scale from 0 to 10 |
| Safety 🌃 | Feeling safe walking alone at night | Sentimiento de seguridad al caminar solos por la noche | Percentage | Percentage of people who report feeling safe when walking alone at night  |
|  | Homicide rate | Tasa de homicidios | Ratio | Average number of reported homicides per 100,000 people, latest available year |
| Work-Life Balance 🧘 | Employees working very long hours | Empleados que trabajan muchas horas | Percentage | Percentage of employees working more than fifty hours a week on average, latest available year |
|  | Time devoted to leisure and personal care | Tiempo destinado al ocio y el cuidado personal | Hours | Average number of hours per day devoted to leisure and personal care, including sleeping and eating |

---

**So far, all the data we have worked with has had the following characteristics:**


### 1. We entered it by hand

In this case, we would have to copy and paste the data manually into an array.

For example, the first 10 rows of the OECD dataset for the columns:

- Air pollution
- Dwellings without basic facilities
- Educational attainment
- Employees working very long hours
- Employment rate



In [1]:
import numpy as np

datos = np.array(
    [
        [5.0, np.nan, 81.0, 12.84, 73.0],
        [16.0, 0.9, 85.0, 6.59, 72.0],
        [15.0, 1.9, 77.0, 4.7, 63.33],
        [10.0, 6.7, 49.0, 7.01, 61.0],
        [7.0, 0.2, 91.33, 3.67, 73.33],
        [16.0, 9.4, 65.0, 9.32, 62.67],
        [10.0, 23.9, 54.0, 26.01, 67.0],
        [20.0, 0.7, 93.67, 5.5, 73.67],
        [9.0, 0.5, 81.0, 2.32, 74.0],
        [8.0, 7.0, 88.67, 2.44, 74.0],
    ]
)

datos


array([[ 5.  ,   nan, 81.  , 12.84, 73.  ],
       [16.  ,  0.9 , 85.  ,  6.59, 72.  ],
       [15.  ,  1.9 , 77.  ,  4.7 , 63.33],
       [10.  ,  6.7 , 49.  ,  7.01, 61.  ],
       [ 7.  ,  0.2 , 91.33,  3.67, 73.33],
       [16.  ,  9.4 , 65.  ,  9.32, 62.67],
       [10.  , 23.9 , 54.  , 26.01, 67.  ],
       [20.  ,  0.7 , 93.67,  5.5 , 73.67],
       [ 9.  ,  0.5 , 81.  ,  2.32, 74.  ],
       [ 8.  ,  7.  , 88.67,  2.44, 74.  ]])

> **Question ❓**: So, how could I read an Excel spreadsheet with `numpy`? What about a JSON file? A CSV? A database?

### 2. We have only used numbers

> **Question ❓**: How can I operate on strings in `numpy`?

In this case, I would like to add a new column to the data: the country that each row of values describes:

In [2]:
pais = np.array([
    "Australia",
    "Austria",
    "Belgium",
    "Brazil",
    "Canada",
    "Chile",
    "Colombia",
    "Czech Republic",
    "Denmark",
    "Estonia",
])

But remember that for `numpy` to work efficiently, arrays must be homogeneous, that is, all their elements must share the same type.

> **Question ❓**: What consequences could adding this new column to the data have?

In [3]:
pais

array(['Australia', 'Austria', 'Belgium', 'Brazil', 'Canada', 'Chile',
       'Colombia', 'Czech Republic', 'Denmark', 'Estonia'], dtype='<U14')

In [4]:
nuevos_datos = np.concatenate([pais[:, np.newaxis], datos], axis=1)
nuevos_datos



array([['Australia', '5.0', 'nan', '81.0', '12.84', '73.0'],
       ['Austria', '16.0', '0.9', '85.0', '6.59', '72.0'],
       ['Belgium', '15.0', '1.9', '77.0', '4.7', '63.33'],
       ['Brazil', '10.0', '6.7', '49.0', '7.01', '61.0'],
       ['Canada', '7.0', '0.2', '91.33', '3.67', '73.33'],
       ['Chile', '16.0', '9.4', '65.0', '9.32', '62.67'],
       ['Colombia', '10.0', '23.9', '54.0', '26.01', '67.0'],
       ['Czech Republic', '20.0', '0.7', '93.67', '5.5', '73.67'],
       ['Denmark', '9.0', '0.5', '81.0', '2.32', '74.0'],
       ['Estonia', '8.0', '7.0', '88.67', '2.44', '74.0']], dtype='<U32')

In [5]:
nuevos_datos.mean(axis=1)

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> None

### 3. We have only used integer indices to retrieve data

> **Question ❓**: And if my rows or columns had names (strings), how could I add them to `numpy`?

In [6]:
columnas = [
    "Country",
    "Air pollution",
    "Dwellings without basic facilities",
    "Educational attainment",
    "Employees working very long hours",
    "Employment rate",
]


These are just a few of the limitations of `numpy` when it comes to handling data.

As we can see, `numpy` lacks some higher-level features that are needed to carry out data science tasks properly and efficiently.

This is where `pandas` comes in.
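As a preview of how `pandas` addresses the three limitations above, the following minimal sketch rebuilds the first three rows of the manual example as a `DataFrame`. The variable names `preview` and `numeric_means` are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# First three rows of the manual example, now with named columns
# and the country names used as row labels.
preview = pd.DataFrame(
    {
        "Air pollution": [5.0, 16.0, 15.0],
        "Dwellings without basic facilities": [np.nan, 0.9, 1.9],
        "Educational attainment": [81.0, 85.0, 77.0],
    },
    index=["Australia", "Austria", "Belgium"],
)

# The numeric columns keep a numeric dtype even though the row labels are strings.
print(preview.dtypes)

# Column-wise mean; missing values (NaN) are skipped by default.
numeric_means = preview.mean()
print(numeric_means)
```

Unlike the `nuevos_datos` array, the numbers here are not converted to strings, so arithmetic such as `mean()` keeps working.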

---

## 1. Pandas 🐼


`Pandas` (derived from _panel data_) is a Python library for handling tabular data.

<div align='center'>
<img src="https://raw.githubusercontent.com/MDS7202/MDS7202/main/recursos/2023-01/08-Pandas1/dataframe.png" alt="DataFrames" style="width: 500px;"/>
</div>


It is designed to provide tools that make it easier to explore, clean, and process data. Its focus is *simplicity and efficiency*. Like the libraries seen earlier, it is *open source*.


The core data structure of pandas is the `DataFrame`.



By convention, `pandas` is imported as follows:

In [7]:
import pandas as pd

In [None]:
# pandas functions are then called through the `pd` alias,
# for example pd.read_excel(...) or pd.Series(...)

### Input / Output (IO)

Reading data with `pandas` is straightforward: it natively supports many file types and data sources:

- `CSV`
- `Excel`
- `SQL`
- `JSON`
- ...

Data stored in these sources can be loaded into `DataFrames` through the `read_*` functions.

Likewise, `DataFrames` can be saved in the format of your choice using the `to_*` methods.

<div align='center'>

<img src="https://raw.githubusercontent.com/MDS7202/MDS7202/main/recursos/2023-01/08-Pandas1/pandas_io.png" alt="DataFrames" style="width: 800px;"/>
</div>
    
All the details about what `pandas` can and cannot read are in the following reference: https://pandas.pydata.org/docs/user_guide/io.html
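The `read_*` / `to_*` symmetry can be sketched with a CSV round trip that needs no file on disk. This is a minimal illustration: the frame `small` and its column names are made up for the example, and an in-memory `io.StringIO` buffer stands in for a real file path.

```python
import io

import pandas as pd

# Tiny illustrative table (two Water quality values from the dataset).
small = pd.DataFrame({"country": ["Chile", "Denmark"], "water_quality": [71, 95]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = small.to_csv(index=False)
print(csv_text)

# read_csv accepts a path or any file-like object, such as a StringIO buffer.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

The same pattern applies to the other formats, for instance `to_excel` paired with `read_excel`, although those need a real file and, for Excel, an extra dependency such as `openpyxl`.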

### Importing the Dataset

Next, we will import the dataset into a `DataFrame`. Note how well `Jupyter` renders DataFrames (DF).

Every DataFrame has an **index (shown as the first column) and columns (shown as the first row)**. Most commonly:

- Columns are labelled with `strings` that identify the name of the variable.
- Rows are labelled with integers that identify the observation number.

However, rows can also be identified by strings, just as columns can be identified by integers.
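The three labelling options just described can be sketched on toy frames. The variable names are illustrative only; the values are taken from the Employment rate and Air pollution columns.

```python
import pandas as pd

# Default case: string column names and an integer row index (0, 1, 2, ...).
default_labels = pd.DataFrame({"Employment rate": [73, 72, 63]})
print(default_labels.index.tolist())     # [0, 1, 2]

# Rows identified by strings instead of integers.
string_rows = pd.DataFrame(
    {"Employment rate": [73, 72, 63]},
    index=["Australia", "Austria", "Belgium"],
)
print(string_rows.index.tolist())

# Columns identified by integers: a nested list with no column names
# gets the default integer column labels.
integer_columns = pd.DataFrame([[73, 5], [72, 16]])
print(integer_columns.columns.tolist())  # [0, 1]
```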

In [8]:
# opening Excel files and plotting requires installing these extra dependencies
%pip install pandas openpyxl matplotlib plotly statsmodels

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

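# show all columns when displaying a DataFrame
pd.set_option("display.max_columns", None)


# header=1: the column names are on the second row of the sheet
# index_col=0: the first column (Country) becomes the row index
df = pd.read_excel(
    "../../recursos/2023-02/pandas-1/dataset.xlsx",
    header=1,
    index_col=0,
)
df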

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17


In [12]:
type(df)

pandas.core.frame.DataFrame

---

## 2.- The Basics

Next we will look at the basic attributes and methods of a DataFrame.

### Attributes

#### Columns

We can see the column names of our `DataFrame` through `df.columns`:

In [11]:
df.columns

Index(['Dwellings without basic facilities', 'Housing expenditure',
       'Rooms per person', 'Household net adjusted disposable income',
       'Household net wealth', 'Labour market insecurity', 'Employment rate',
       'Long-term unemployment rate', 'Personal earnings',
       'Quality of support network', 'Educational attainment',
       'Student skills', 'Years in education', 'Air pollution',
       'Water quality', 'Stakeholder engagement for developing regulations',
       'Voter turnout', 'Life expectancy', 'Self-reported health',
       'Life satisfaction', 'Feeling safe walking alone at night',
       'Homicide rate', 'Employees working very long hours',
       'Time devoted to leisure and personal care'],
      dtype='object')

#### Index

We can see the row labels of our `DataFrame` through `df.index`:

In [13]:
df.index

Index(['Australia', 'Austria', 'Belgium', 'Canada', 'Chile', 'Colombia',
       'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany',
       'Greece', 'Hungary', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Japan',
       'Korea', 'Latvia', 'Lithuania', 'Luxembourg', 'Mexico', 'Netherlands',
       'New Zealand', 'Norway', 'Poland', 'Portugal', 'Slovak Republic',
       'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey',
       'United Kingdom', 'United States', 'OECD - Total', 'Brazil', 'Russia',
       'South Africa'],
      dtype='object', name='Country')

#### Length (number of rows)

In [14]:
len(df)

41

#### Shape

In [15]:
df.shape

(41, 24)
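The attributes above are related to each other: `shape` is simply the pair (number of rows, number of columns). A small sketch on a toy frame (the name `toy` is illustrative) makes the relationship explicit.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Water quality": [93, 92, 84], "Air pollution": [5, 16, 15]},
    index=["Australia", "Austria", "Belgium"],
)

n_rows, n_cols = toy.shape

# len() counts rows, so it matches the first entry of shape
# and the length of the index.
print(n_rows == len(toy) == len(toy.index))

# The second entry of shape is the number of column labels.
print(n_cols == len(toy.columns))
```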

### General Information about the DataFrame

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41 entries, Australia to South Africa
Data columns (total 24 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Dwellings without basic facilities                 38 non-null     float64
 1   Housing expenditure                                39 non-null     float64
 2   Rooms per person                                   38 non-null     float64
 3   Household net adjusted disposable income           30 non-null     float64
 4   Household net wealth                               29 non-null     float64
 5   Labour market insecurity                           34 non-null     float64
 6   Employment rate                                    41 non-null     int64  
 7   Long-term unemployment rate                        39 non-null     float64
 8   Personal earnings                                  36 non-null     float64
 9  

### Selecting a Few Elements

In [17]:
df

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17


#### Head

Returns the first `n` rows (5 by default).


In [22]:
df.head()

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,


#### Tail

Returns the last `n` rows (5 by default).


In [23]:
df.tail(5)

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
United States,0.1,19.0,2.4,45284.0,632100.0,7.7,70,0.66,60558.0,91,91.0,488.0,17.2,10,83,3.1,65,78.6,88.0,6.9,73.9,5.5,11.09,14.44
OECD - Total,4.4,20.0,1.8,33604.0,408376.0,7.0,68,1.78,43241.0,89,78.0,486.0,17.2,14,81,2.4,68,80.2,69.0,6.5,68.4,3.7,11.01,14.98
Brazil,6.7,,,,,,61,,,90,49.0,395.0,16.2,10,73,2.2,79,74.8,,6.4,35.6,26.7,7.13,
Russia,14.8,18.0,0.9,,,,70,1.59,,89,94.0,492.0,16.2,15,55,,68,71.8,43.0,5.8,52.8,9.6,0.14,
South Africa,37.0,18.0,,,,,43,16.46,,88,73.0,,,22,67,,73,57.5,,4.7,36.1,13.7,18.12,14.92


#### Sample

Returns `n` randomly chosen rows.

In [29]:
df.sample(5)

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Turkey,8.0,20.0,1.0,,,12.5,52,2.39,,86,39.0,425.0,18.3,20,65,1.5,86,78.0,69.0,5.5,59.8,1.4,32.64,14.79
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Poland,3.0,22.0,1.1,19814.0,210991.0,5.7,66,1.52,27046.0,86,92.0,504.0,17.6,22,82,2.6,55,78.0,58.0,6.1,67.3,0.7,5.95,14.42
Sweden,0.0,19.0,1.7,31287.0,,3.2,77,1.12,42393.0,91,83.0,496.0,19.3,6,96,2.0,86,82.4,75.0,7.3,75.6,0.9,1.07,15.18
Hungary,4.7,19.0,1.2,,104458.0,4.7,68,1.72,22576.0,86,84.0,474.0,16.4,19,77,1.2,70,76.2,60.0,5.6,56.3,1.0,3.03,


> **Question ❓**: Is there a way to repeat the same random sample of the data?

In [30]:
df.sample(5, random_state=42)

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Netherlands,0.1,19.0,1.9,29333.0,157824.0,4.8,76,1.97,52877.0,91,78.0,508.0,18.7,14,93,2.6,82,81.6,76.0,7.4,82.0,0.6,0.42,
Hungary,4.7,19.0,1.2,,104458.0,4.7,68,1.72,22576.0,86,84.0,474.0,16.4,19,77,1.2,70,76.2,60.0,5.6,56.3,1.0,3.03,
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
New Zealand,,26.0,2.4,,388514.0,4.7,77,0.74,40043.0,96,79.0,506.0,17.7,5,89,2.5,80,81.7,88.0,7.3,65.7,1.3,15.11,14.87
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
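The behaviour of `head`, `tail`, and `sample` can be checked on a toy frame whose contents are easy to predict. This is a minimal sketch; `toy` and the other variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# head() and tail() take the number of rows as an optional argument.
first_three = toy.head(3)
last_two = toy.tail(2)

# Fixing random_state makes the draw reproducible:
# both calls return exactly the same rows in the same order.
sample_a = toy.sample(4, random_state=42)
sample_b = toy.sample(4, random_state=42)
print(sample_a.equals(sample_b))
```

Without `random_state`, each call to `sample` is free to return a different set of rows.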


> **Question ❓**: What is the most basic building block of a `DataFrame`?

In [None]:
df.values

---

## 3.- `Series`

Objects of type `pd.Series` are the building blocks of DataFrames. Each one consists of a one-dimensional array (which can hold a sequence of values or objects) associated with an index. Optionally, a `Series` can also carry a name (the equivalent of a column name in a DataFrame).

In [31]:
serie = pd.Series([1, 9, 7, -5, 3, 10], name="Mi serie")
serie

0     1
1     9
2     7
3    -5
4     3
5    10
Name: Mi serie, dtype: int64

### Basic attributes

In [32]:
serie.values

array([ 1,  9,  7, -5,  3, 10], dtype=int64)

In [34]:
list(serie.index)

[0, 1, 2, 3, 4, 5]

In [35]:
serie.dtype

dtype('int64')

In [36]:
serie.name

'Mi serie'

In [37]:
serie.shape

(6,)
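The index of a `Series` does not have to be the default integers. As a sketch of how the name and the index relate to a DataFrame column, the example below builds a small series with country labels; the variable names `water` and `as_frame` are illustrative.

```python
import pandas as pd

# A Series with string labels and a name, like a single DataFrame column.
water = pd.Series(
    [93, 92, 84],
    index=["Australia", "Austria", "Belgium"],
    name="Water quality",
)

print(water.index.tolist())
print(water.name)

# Converting it to a DataFrame: the series name becomes the column label
# and the index is preserved as the row labels.
as_frame = water.to_frame()
print(as_frame.columns.tolist())
```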

### Indexing Series

We can access any element of a series using the same indexing principles as in `numpy`. Note that `serie` has the default integer index, so each label coincides with its position:

In [38]:
serie

0     1
1     9
2     7
3    -5
4     3
5    10
Name: Mi serie, dtype: int64

In [39]:
serie[0]

1

In [40]:
serie[0:2]

0    1
1    9
Name: Mi serie, dtype: int64

In [41]:
serie[0:4]

0    1
1    9
2    7
3   -5
Name: Mi serie, dtype: int64
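Labels and positions only coincide when the index is the default `0, 1, 2, ...`. As a hedged sketch of the difference, the example below reuses the same values with an index that starts at 10 and uses the `.loc` (label-based) and `.iloc` (position-based) accessors, which the cells above do not use.

```python
import pandas as pd

# Same values as `serie`, but the labels no longer match the positions.
shifted = pd.Series([1, 9, 7, -5, 3, 10], index=[10, 11, 12, 13, 14, 15])

by_label = shifted.loc[10]      # element whose label is 10
by_position = shifted.iloc[0]   # element in the first position
first_two = shifted.iloc[0:2]   # positional slice, end excluded as in numpy

print(by_label, by_position)
print(first_two)
```

Being explicit with `.loc` or `.iloc` avoids ambiguity about whether an integer refers to a label or to a position.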

---

## 4.- Indexing DataFrames

In this section we will see how to select rows and columns through different kinds of indexing.

<div align="center">
    <img src="https://raw.githubusercontent.com/MDS7202/MDS7202/main/recursos/2023-01/08-Pandas1/subsets.png" alt="OECD Better life index" width="800px"/>
</div>

In [42]:
df.head(5)

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,


### Accessing a specific series



Returning to our example, we can access the series of our dataset through an indexer that receives the name of a column.

For example:

> **Water quality 💧**: Percentage of people who report being satisfied with local water quality


In [46]:
df['Water quality']

Country
Australia          93
Austria            92
Belgium            84
Canada             91
Chile              71
Colombia           75
Czech Republic     87
Denmark            95
Estonia            84
Finland            95
France             81
Germany            91
Greece             69
Hungary            77
Iceland            99
Ireland            85
Israel             67
Italy              71
Japan              87
Korea              76
Latvia             79
Lithuania          81
Luxembourg         84
Mexico             68
Netherlands        93
New Zealand        89
Norway             98
Poland             82
Portugal           86
Slovak Republic    85
Slovenia           90
Spain              72
Sweden             96
Switzerland        95
Turkey             65
United Kingdom     84
United States      83
OECD - Total       81
Brazil             73
Russia             55
South Africa       67
Name: Water quality, dtype: int64

In [45]:
df['Water quality']['Chile']

71
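The chained lookup above first extracts a column (a `Series`) and then looks up a row label inside it. A minimal sketch on a toy frame shows both steps, together with `.loc`, which takes the row label and the column name in a single call. The frame `toy` is illustrative and reuses three values from the dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Water quality": [91, 71, 75], "Air pollution": [7, 16, 10]},
    index=["Canada", "Chile", "Colombia"],
)

column = toy["Water quality"]               # a Series indexed by country
chained = toy["Water quality"]["Chile"]     # column first, then row label
direct = toy.loc["Chile", "Water quality"]  # row label and column at once

print(type(column).__name__)
print(chained, direct)
```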

### Column selector

Let us now see how to select a couple of specific columns, for example:

> **Water quality 💧**: Percentage of people who report being satisfied with local water quality

> **Air pollution 🏙️**: Average concentration of particulate matter (PM2.5) in cities with more than 100,000 inhabitants


In [50]:
df[['Water quality', 'Air pollution']]

Country,Water quality,Air pollution
Australia,93,5
Austria,92,16
Belgium,84,15
Canada,91,7
Chile,71,16
Colombia,75,10
Czech Republic,87,20
Denmark,95,9
Estonia,84,8
Finland,95,6


### Row selector

To select rows, we can pass a `numpy`-style row slice:


In [53]:
df[0:10]

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17
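
Inside plain brackets, a name selects columns while a slice selects rows by position. The sketch below illustrates this on a hypothetical `df_demo` built from a few values shown in this class; it also compares the slice with `iloc` and `head`, which should give the same rows.

```python
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
df_demo = pd.DataFrame(
    {"Water quality": [93, 92, 84, 91, 71], "Air pollution": [5, 16, 15, 7, 16]},
    index=pd.Index(
        ["Australia", "Austria", "Belgium", "Canada", "Chile"], name="Country"
    ),
)

by_brackets = df_demo[0:3]    # a slice inside [] selects rows by position
by_iloc = df_demo.iloc[0:3]   # explicit positional indexer
by_head = df_demo.head(3)     # first n rows

print(by_brackets.equals(by_iloc) and by_iloc.equals(by_head))  # True
```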


> **Question ❓**: How do we select rows and columns at the same time?

In [55]:
df[['Water quality', 'Air pollution']][0:10]

| Country | Water quality | Air pollution |
|---|---|---|
| Australia | 93 | 5 |
| Austria | 92 | 16 |
| Belgium | 84 | 15 |
| Canada | 91 | 7 |
| Chile | 71 | 16 |
| Colombia | 75 | 10 |
| Czech Republic | 87 | 20 |
| Denmark | 95 | 9 |
| Estonia | 84 | 8 |
| Finland | 95 | 6 |


In [57]:
df.loc[0:10, ['Water quality', 'Air pollution']]


TypeError: cannot do slice indexing on Index with these indexers [0] of type int

In [None]:
df[['Water quality']][9: 10]

### Loc: Label-based Indexer

`loc` accesses elements by index (row) labels and column names.

In [58]:
df.loc[
    ["Chile", "Mexico", "Brazil", "Colombia"], # <- rows
    ["Water quality", "Air pollution"]         # <- columns
]

| Country | Water quality | Air pollution |
|---|---|---|
| Chile | 71 | 16 |
| Mexico | 68 | 16 |
| Brazil | 73 | 10 |
| Colombia | 75 | 10 |
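
As a complement, this is a minimal sketch of three common `loc` patterns: lists of labels, label slices, and a single cell. The `df_demo` table is hypothetical, built from a few values shown above. Note that label slices include both endpoints, unlike positional slices.

```python
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
df_demo = pd.DataFrame(
    {
        "Water quality": [93, 92, 84, 91, 71, 75],
        "Air pollution": [5, 16, 15, 7, 16, 10],
    },
    index=pd.Index(
        ["Australia", "Austria", "Belgium", "Canada", "Chile", "Colombia"],
        name="Country",
    ),
)

# Lists of labels: rows first, then columns.
subset = df_demo.loc[["Chile", "Colombia"], ["Water quality"]]

# A label slice includes BOTH endpoints.
label_slice = df_demo.loc["Belgium":"Chile", "Water quality"]

# One row label plus one column label returns a single value.
chile_water = df_demo.loc["Chile", "Water quality"]

print(subset.shape)                # (2, 1)
print(label_slice.index.tolist())  # ['Belgium', 'Canada', 'Chile']
print(chile_water)                 # 71
```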


In [59]:
df.loc[
    ["Chile", "Mexico", "Brazil", "Colombia"], # <- rows
    :                                          # <- columns
]

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Mexico,25.5,20.0,1.0,,,5.5,61,0.07,15314.0,81,38.0,416.0,15.2,16,68,3.2,63,75.4,66.0,6.5,41.8,18.1,28.7,
Brazil,6.7,,,,,,61,,,90,49.0,395.0,16.2,10,73,2.2,79,74.8,,6.4,35.6,26.7,7.13,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,


In [60]:
df.loc[
    :                                        , # <- rows
    ["Water quality", "Air pollution"]         # <- columns
]

| Country | Water quality | Air pollution |
|---|---|---|
| Australia | 93 | 5 |
| Austria | 92 | 16 |
| Belgium | 84 | 15 |
| Canada | 91 | 7 |
| Chile | 71 | 16 |
| Colombia | 75 | 10 |
| Czech Republic | 87 | 20 |
| Denmark | 95 | 9 |
| Estonia | 84 | 8 |
| Finland | 95 | 6 |


> **Question ❓**: What if we want to select rows by numeric position?

### Iloc: Position-based Indexer

To select by integer position we need a different indexer: `iloc`

In [61]:
df.iloc[8:12, :]

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17
France,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
Germany,0.2,20.0,1.8,34294.0,259667.0,2.7,75,1.57,47585.0,90,87.0,508.0,18.1,14,91,1.8,76,81.1,65.0,7.0,72.5,0.5,4.26,15.62


In [62]:
df.iloc[:, 0:3]

| Country | Dwellings without basic facilities | Housing expenditure | Rooms per person |
|---|---|---|---|
| Australia | NaN | 20.0 | NaN |
| Austria | 0.9 | 21.0 | 1.6 |
| Belgium | 1.9 | 21.0 | 2.2 |
| Canada | 0.2 | 22.0 | 2.6 |
| Chile | 9.4 | 18.0 | 1.2 |
| Colombia | 23.9 | 17.0 | 1.2 |
| Czech Republic | 0.7 | 24.0 | 1.4 |
| Denmark | 0.5 | 23.0 | 1.9 |
| Estonia | 7.0 | 17.0 | 1.6 |
| Finland | 0.5 | 23.0 | 1.9 |


In [None]:
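import numpy as np  # a plain Python list does not support `list >= 3`
array = np.array([1, 2, 3, 4, 56, 6])
array >= 3  # element-wise comparison -> boolean array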

In [63]:
df.iloc[:, [14]]

| Country | Water quality |
|---|---|
| Australia | 93 |
| Austria | 92 |
| Belgium | 84 |
| Canada | 91 |
| Chile | 71 |
| Colombia | 75 |
| Czech Republic | 87 |
| Denmark | 95 |
| Estonia | 84 |
| Finland | 95 |


In [64]:
df.iloc[0, 0]

nan
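
The following sketch summarizes how `iloc` behaves, again on a hypothetical `df_demo` built from values shown above. Positional slices exclude the end position. When rows are known by position but the column only by name, one option (not the only one) is to translate the name into a position with `Index.get_loc`.

```python
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
df_demo = pd.DataFrame(
    {
        "Water quality": [93, 92, 84, 91, 71, 75],
        "Air pollution": [5, 16, 15, 7, 16, 10],
    },
    index=pd.Index(
        ["Australia", "Austria", "Belgium", "Canada", "Chile", "Colombia"],
        name="Country",
    ),
)

first_rows = df_demo.iloc[0:3, :]   # positions 0, 1 and 2 (the end is excluded)
one_cell = df_demo.iloc[4, 0]       # row 4 (Chile), column 0 (Water quality)

# Rows by position, column by name: convert the name into its position.
col_pos = df_demo.columns.get_loc("Air pollution")
mixed = df_demo.iloc[0:3, col_pos]

print(first_rows.index.tolist())  # ['Australia', 'Austria', 'Belgium']
print(one_cell)                   # 71
print(mixed.tolist())             # [5, 16, 15]
```

This also explains the earlier `TypeError`: `loc` expects labels, and the index of this table contains country names, not integers.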

### Boolean Masks and Queries 🎭: Boolean Selection


A useful selection technique is to pass an array of booleans: only the rows whose value is `True` are kept.

In [66]:
df

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17


> **Question ❓**: How could we get the countries where 80% or less of the population is satisfied with the water quality?

In [74]:
mask = df.loc[:, "Water quality"] <= 80
# without .values -> Series
mask

Country
Australia          False
Austria            False
Belgium            False
Canada             False
Chile               True
Colombia            True
Czech Republic     False
Denmark            False
Estonia            False
Finland            False
France             False
Germany            False
Greece              True
Hungary             True
Iceland            False
Ireland            False
Israel              True
Italy               True
Japan              False
Korea               True
Latvia              True
Lithuania          False
Luxembourg         False
Mexico              True
Netherlands        False
New Zealand        False
Norway             False
Poland             False
Portugal           False
Slovak Republic    False
Slovenia           False
Spain               True
Sweden             False
Switzerland        False
Turkey              True
United Kingdom     False
United States      False
OECD - Total       False
Brazil              True
Russia           

In [None]:
mask = df.loc[:, "Water quality"] <= 80
# with .values -> NumPy array
mask.values

In [75]:
df.loc[mask.values, "Water quality"]

Country
Chile           71
Colombia        75
Greece          69
Hungary         77
Israel          67
Italy           71
Korea           76
Latvia          79
Mexico          68
Spain           72
Turkey          65
Brazil          73
Russia          55
South Africa    67
Name: Water quality, dtype: int64

In [76]:
mask

Country
Australia          False
Austria            False
Belgium            False
Canada             False
Chile               True
Colombia            True
Czech Republic     False
Denmark            False
Estonia            False
Finland            False
France             False
Germany            False
Greece              True
Hungary             True
Iceland            False
Ireland            False
Israel              True
Italy               True
Japan              False
Korea               True
Latvia              True
Lithuania          False
Luxembourg         False
Mexico              True
Netherlands        False
New Zealand        False
Norway             False
Poland             False
Portugal           False
Slovak Republic    False
Slovenia           False
Spain               True
Sweden             False
Switzerland        False
Turkey              True
United Kingdom     False
United States      False
OECD - Total       False
Brazil              True
Russia           

In [77]:
mascara = df["Water quality"] >= 101
mascara

Country
Australia          False
Austria            False
Belgium            False
Canada             False
Chile              False
Colombia           False
Czech Republic     False
Denmark            False
Estonia            False
Finland            False
France             False
Germany            False
Greece             False
Hungary            False
Iceland            False
Ireland            False
Israel             False
Italy              False
Japan              False
Korea              False
Latvia             False
Lithuania          False
Luxembourg         False
Mexico             False
Netherlands        False
New Zealand        False
Norway             False
Poland             False
Portugal           False
Slovak Republic    False
Slovenia           False
Spain              False
Sweden             False
Switzerland        False
Turkey             False
United Kingdom     False
United States      False
OECD - Total       False
Brazil             False
Russia           

In [78]:
df.loc[mascara, ['Water quality']]

| Country | Water quality |
|---|---|


In [79]:
df.loc[(df.loc[:, "Water quality"] > 80) & (df.loc[:, "Water quality"] < 94)]

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
France,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
Germany,0.2,20.0,1.8,34294.0,259667.0,2.7,75,1.57,47585.0,90,87.0,508.0,18.1,14,91,1.8,76,81.1,65.0,7.0,72.5,0.5,4.26,15.62
Ireland,1.0,20.0,2.1,25310.0,217130.0,7.8,67,3.23,47653.0,95,82.0,509.0,18.1,7,85,1.3,65,81.8,83.0,7.0,75.9,0.7,5.25,
Japan,6.4,22.0,1.9,29798.0,305878.0,1.4,75,1.03,40863.0,89,,529.0,16.4,14,87,1.4,53,84.1,36.0,5.9,72.5,0.2,,
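
Here is a minimal sketch of how masks can be combined, using a hypothetical `df_demo` built from a few values shown above. It uses `&` (and), `|` (or) and `~` (not), and then writes the first filter again with `DataFrame.query`, where backticks quote column names that contain spaces.

```python
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
df_demo = pd.DataFrame(
    {
        "Water quality": [93, 92, 84, 91, 71, 75],
        "Air pollution": [5, 16, 15, 7, 16, 10],
    },
    index=pd.Index(
        ["Australia", "Austria", "Belgium", "Canada", "Chile", "Colombia"],
        name="Country",
    ),
)

water = df_demo["Water quality"]

# Each comparison needs its own parentheses: & and | bind tighter than > or <.
between = df_demo.loc[(water > 80) & (water < 94)]
extremes = df_demo.loc[(water <= 75) | (water >= 93)]
polluted = df_demo.loc[~(df_demo["Air pollution"] < 10)]

# Same filter as `between`, expressed as a query string.
between_query = df_demo.query("`Water quality` > 80 and `Water quality` < 94")

print(between.index.tolist())         # ['Australia', 'Austria', 'Belgium', 'Canada']
print(extremes.index.tolist())        # ['Australia', 'Chile', 'Colombia']
print(polluted.index.tolist())        # ['Austria', 'Belgium', 'Chile', 'Colombia']
print(between.equals(between_query))  # True
```

Python's `and` / `or` keywords do not work on a `Series`; the element-wise operators `&` and `|` are needed outside of `query`.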


---

## 5.- Operations with DataFrames



### `Describe`

Computes descriptive statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column.

In [82]:
df.describe()

Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
count,38.0,39.0,38.0,30.0,29.0,34.0,41.0,39.0,36.0,41.0,40.0,40.0,40.0,41.0,41.0,39.0,41.0,41.0,38.0,41.0,41.0,41.0,39.0,23.0
mean,5.057895,20.641026,1.636842,28000.533333,288004.724138,6.817647,68.463415,2.809744,39912.611111,90.097561,77.775,486.925,17.535,13.341463,82.341463,2.166667,69.536585,79.597561,66.868421,6.534146,68.253659,3.436585,8.029744,15.08
std,8.334093,2.497232,0.426438,7012.87002,164375.661472,5.827649,7.871142,3.533804,12932.304171,4.253263,14.936597,31.382167,1.392388,5.699166,10.415397,0.570933,12.060468,4.616952,14.092687,0.742836,13.847186,6.254469,7.783618,0.667717
min,0.0,15.0,0.9,16275.0,70160.0,0.7,43.0,0.05,15314.0,78.0,38.0,395.0,14.1,3.0,55.0,1.2,47.0,57.5,33.0,4.7,35.6,0.2,0.14,13.83
25%,0.325,19.0,1.2,21504.75,159373.0,3.725,66.0,1.035,26017.75,88.0,76.0,479.5,16.575,10.0,75.0,1.75,61.0,78.0,60.0,5.9,60.0,0.6,3.36,14.63
50%,0.95,21.0,1.65,29469.5,259667.0,5.1,69.0,1.78,41628.0,91.0,81.5,494.0,17.55,14.0,84.0,2.2,69.0,81.3,69.0,6.5,70.1,1.0,5.25,14.92
75%,6.625,22.5,1.9,32395.25,386006.0,7.775,74.0,3.2,49263.25,93.0,88.0,506.0,18.3,16.0,91.0,2.55,79.0,82.4,75.75,7.2,77.7,3.1,11.05,15.59
max,37.0,26.0,2.6,45284.0,769053.0,29.8,86.0,16.46,63062.0,98.0,94.0,529.0,21.0,28.0,99.0,3.2,91.0,84.1,88.0,7.6,90.1,26.7,32.64,16.47


In [83]:
descripcion_df = df.describe()
descripcion_df

Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
count,38.0,39.0,38.0,30.0,29.0,34.0,41.0,39.0,36.0,41.0,40.0,40.0,40.0,41.0,41.0,39.0,41.0,41.0,38.0,41.0,41.0,41.0,39.0,23.0
mean,5.057895,20.641026,1.636842,28000.533333,288004.724138,6.817647,68.463415,2.809744,39912.611111,90.097561,77.775,486.925,17.535,13.341463,82.341463,2.166667,69.536585,79.597561,66.868421,6.534146,68.253659,3.436585,8.029744,15.08
std,8.334093,2.497232,0.426438,7012.87002,164375.661472,5.827649,7.871142,3.533804,12932.304171,4.253263,14.936597,31.382167,1.392388,5.699166,10.415397,0.570933,12.060468,4.616952,14.092687,0.742836,13.847186,6.254469,7.783618,0.667717
min,0.0,15.0,0.9,16275.0,70160.0,0.7,43.0,0.05,15314.0,78.0,38.0,395.0,14.1,3.0,55.0,1.2,47.0,57.5,33.0,4.7,35.6,0.2,0.14,13.83
25%,0.325,19.0,1.2,21504.75,159373.0,3.725,66.0,1.035,26017.75,88.0,76.0,479.5,16.575,10.0,75.0,1.75,61.0,78.0,60.0,5.9,60.0,0.6,3.36,14.63
50%,0.95,21.0,1.65,29469.5,259667.0,5.1,69.0,1.78,41628.0,91.0,81.5,494.0,17.55,14.0,84.0,2.2,69.0,81.3,69.0,6.5,70.1,1.0,5.25,14.92
75%,6.625,22.5,1.9,32395.25,386006.0,7.775,74.0,3.2,49263.25,93.0,88.0,506.0,18.3,16.0,91.0,2.55,79.0,82.4,75.75,7.2,77.7,3.1,11.05,15.59
max,37.0,26.0,2.6,45284.0,769053.0,29.8,86.0,16.46,63062.0,98.0,94.0,529.0,21.0,28.0,99.0,3.2,91.0,84.1,88.0,7.6,90.1,26.7,32.64,16.47


> **Question ❓**: How do we get the average number of rooms per person (`Rooms per person`)?

In [84]:
descripcion_df.loc[['mean'], ['Rooms per person']]

|  | Rooms per person |
|---|---|
| mean | 1.636842 |


> **Question ❓**: Why does _almost_ the same selector return different results?
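
One way to explore this question is to compare scalar selectors with list selectors. The sketch below uses a hypothetical `rooms` table with a few values copied from earlier outputs, so its mean differs from the one computed on the full dataset.

```python
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
rooms = pd.DataFrame(
    {
        "Rooms per person": [1.6, 2.2, 2.6, 1.2],
        "Housing expenditure": [21.0, 21.0, 22.0, 18.0],
    },
    index=pd.Index(["Austria", "Belgium", "Canada", "Chile"], name="Country"),
)
summary = rooms.describe()

as_scalar = summary.loc["mean", "Rooms per person"]      # label, label -> number
as_series = summary.loc["mean", ["Rooms per person"]]    # label, list  -> Series
as_frame = summary.loc[["mean"], ["Rooms per person"]]   # list, list   -> DataFrame

print(type(as_scalar).__name__)
print(type(as_series).__name__)   # Series
print(as_frame.shape)             # (1, 1)
```

Each list in the selector keeps one dimension: two lists return a `DataFrame`, one list returns a `Series`, and no lists return a single value.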

### Computing Totals

For example: sum, mean, median and standard deviation.

In [85]:
df.sum()

Dwellings without basic facilities                       192.20
Housing expenditure                                      805.00
Rooms per person                                          62.20
Household net adjusted disposable income              840016.00
Household net wealth                                 8352137.00
Labour market insecurity                                 231.80
Employment rate                                         2807.00
Long-term unemployment rate                              109.58
Personal earnings                                    1436854.00
Quality of support network                              3694.00
Educational attainment                                  3111.00
Student skills                                         19477.00
Years in education                                       701.40
Air pollution                                            547.00
Water quality                                           3376.00
Stakeholder engagement for developing re

In [86]:
df.mean()

Dwellings without basic facilities                        5.057895
Housing expenditure                                      20.641026
Rooms per person                                          1.636842
Household net adjusted disposable income              28000.533333
Household net wealth                                 288004.724138
Labour market insecurity                                  6.817647
Employment rate                                          68.463415
Long-term unemployment rate                               2.809744
Personal earnings                                     39912.611111
Quality of support network                               90.097561
Educational attainment                                   77.775000
Student skills                                          486.925000
Years in education                                       17.535000
Air pollution                                            13.341463
Water quality                                            82.34

In [87]:
df.median()

Dwellings without basic facilities                        0.95
Housing expenditure                                      21.00
Rooms per person                                          1.65
Household net adjusted disposable income              29469.50
Household net wealth                                 259667.00
Labour market insecurity                                  5.10
Employment rate                                          69.00
Long-term unemployment rate                               1.78
Personal earnings                                     41628.00
Quality of support network                               91.00
Educational attainment                                   81.50
Student skills                                          494.00
Years in education                                       17.55
Air pollution                                            14.00
Water quality                                            84.00
Stakeholder engagement for developing regulations      

In [88]:
df.std()

Dwellings without basic facilities                        8.334093
Housing expenditure                                       2.497232
Rooms per person                                          0.426438
Household net adjusted disposable income               7012.870020
Household net wealth                                 164375.661472
Labour market insecurity                                  5.827649
Employment rate                                           7.871142
Long-term unemployment rate                               3.533804
Personal earnings                                     12932.304171
Quality of support network                                4.253263
Educational attainment                                   14.936597
Student skills                                           31.382167
Years in education                                        1.392388
Air pollution                                             5.699166
Water quality                                            10.41

> **Question ❓**: What if I wanted to compute the mean of each row?

In [89]:
df.mean(axis=1)

Country
Australia          23191.190909
Austria            16393.843750
Belgium            19469.774583
Canada             20983.409167
Chile               6090.348571
Colombia              55.241667
Czech Republic      2182.590455
Denmark             8373.634167
Estonia             8525.776667
Finland            11458.162917
France             14871.392917
Germany            14282.464583
Greece              8479.655217
Hungary             5825.593182
Iceland             3001.191429
Ireland            12665.660000
Israel              1906.723158
Italy              14345.665417
Japan              17982.630000
Korea              14966.584783
Latvia              4636.631250
Lithuania           2240.958571
Luxembourg         37937.746522
Mexico               779.236667
Netherlands        10490.134348
New Zealand        19537.137273
Norway             13788.015217
Poland             10793.416250
Portugal           12189.686957
Slovak Republic     7202.248696
Slovenia           10834.196250


In [91]:
df.mean(axis=0)

Dwellings without basic facilities                        5.057895
Housing expenditure                                      20.641026
Rooms per person                                          1.636842
Household net adjusted disposable income              28000.533333
Household net wealth                                 288004.724138
Labour market insecurity                                  6.817647
Employment rate                                          68.463415
Long-term unemployment rate                               2.809744
Personal earnings                                     39912.611111
Quality of support network                               90.097561
Educational attainment                                   77.775000
Student skills                                          486.925000
Years in education                                       17.535000
Air pollution                                            13.341463
Water quality                                            82.34
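
The `axis` argument is easier to remember with a small example. The sketch below uses a hypothetical `df_demo` with a few values copied from earlier outputs: `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row. Missing values are skipped by default (`skipna=True`).

```python
import numpy as np
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
df_demo = pd.DataFrame(
    {
        "Rooms per person": [np.nan, 1.6, 2.2],
        "Housing expenditure": [20.0, 21.0, 21.0],
    },
    index=pd.Index(["Australia", "Austria", "Belgium"], name="Country"),
)

per_column = df_demo.mean(axis=0)   # one value per column
per_row = df_demo.mean(axis=1)      # one value per row

print(per_column.index.tolist())    # ['Rooms per person', 'Housing expenditure']
print(per_row.index.tolist())       # ['Australia', 'Austria', 'Belgium']
print(per_row["Australia"])         # 20.0 (the NaN is ignored)
```

Keep in mind that a row-wise mean over the full table mixes indicators with different units (dollars, percentages, years), so the resulting number is hard to interpret.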

### Round

In [92]:
df.describe()

Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
count,38.0,39.0,38.0,30.0,29.0,34.0,41.0,39.0,36.0,41.0,40.0,40.0,40.0,41.0,41.0,39.0,41.0,41.0,38.0,41.0,41.0,41.0,39.0,23.0
mean,5.057895,20.641026,1.636842,28000.533333,288004.724138,6.817647,68.463415,2.809744,39912.611111,90.097561,77.775,486.925,17.535,13.341463,82.341463,2.166667,69.536585,79.597561,66.868421,6.534146,68.253659,3.436585,8.029744,15.08
std,8.334093,2.497232,0.426438,7012.87002,164375.661472,5.827649,7.871142,3.533804,12932.304171,4.253263,14.936597,31.382167,1.392388,5.699166,10.415397,0.570933,12.060468,4.616952,14.092687,0.742836,13.847186,6.254469,7.783618,0.667717
min,0.0,15.0,0.9,16275.0,70160.0,0.7,43.0,0.05,15314.0,78.0,38.0,395.0,14.1,3.0,55.0,1.2,47.0,57.5,33.0,4.7,35.6,0.2,0.14,13.83
25%,0.325,19.0,1.2,21504.75,159373.0,3.725,66.0,1.035,26017.75,88.0,76.0,479.5,16.575,10.0,75.0,1.75,61.0,78.0,60.0,5.9,60.0,0.6,3.36,14.63
50%,0.95,21.0,1.65,29469.5,259667.0,5.1,69.0,1.78,41628.0,91.0,81.5,494.0,17.55,14.0,84.0,2.2,69.0,81.3,69.0,6.5,70.1,1.0,5.25,14.92
75%,6.625,22.5,1.9,32395.25,386006.0,7.775,74.0,3.2,49263.25,93.0,88.0,506.0,18.3,16.0,91.0,2.55,79.0,82.4,75.75,7.2,77.7,3.1,11.05,15.59
max,37.0,26.0,2.6,45284.0,769053.0,29.8,86.0,16.46,63062.0,98.0,94.0,529.0,21.0,28.0,99.0,3.2,91.0,84.1,88.0,7.6,90.1,26.7,32.64,16.47


What we see next is known as *method chaining*: calling a method directly on the result of the previous one.

In [93]:
df.describe().round(2)

Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
count,38.0,39.0,38.0,30.0,29.0,34.0,41.0,39.0,36.0,41.0,40.0,40.0,40.0,41.0,41.0,39.0,41.0,41.0,38.0,41.0,41.0,41.0,39.0,23.0
mean,5.06,20.64,1.64,28000.53,288004.72,6.82,68.46,2.81,39912.61,90.1,77.78,486.92,17.54,13.34,82.34,2.17,69.54,79.6,66.87,6.53,68.25,3.44,8.03,15.08
std,8.33,2.5,0.43,7012.87,164375.66,5.83,7.87,3.53,12932.3,4.25,14.94,31.38,1.39,5.7,10.42,0.57,12.06,4.62,14.09,0.74,13.85,6.25,7.78,0.67
min,0.0,15.0,0.9,16275.0,70160.0,0.7,43.0,0.05,15314.0,78.0,38.0,395.0,14.1,3.0,55.0,1.2,47.0,57.5,33.0,4.7,35.6,0.2,0.14,13.83
25%,0.32,19.0,1.2,21504.75,159373.0,3.72,66.0,1.04,26017.75,88.0,76.0,479.5,16.58,10.0,75.0,1.75,61.0,78.0,60.0,5.9,60.0,0.6,3.36,14.63
50%,0.95,21.0,1.65,29469.5,259667.0,5.1,69.0,1.78,41628.0,91.0,81.5,494.0,17.55,14.0,84.0,2.2,69.0,81.3,69.0,6.5,70.1,1.0,5.25,14.92
75%,6.62,22.5,1.9,32395.25,386006.0,7.78,74.0,3.2,49263.25,93.0,88.0,506.0,18.3,16.0,91.0,2.55,79.0,82.4,75.75,7.2,77.7,3.1,11.05,15.59
max,37.0,26.0,2.6,45284.0,769053.0,29.8,86.0,16.46,63062.0,98.0,94.0,529.0,21.0,28.0,99.0,3.2,91.0,84.1,88.0,7.6,90.1,26.7,32.64,16.47
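
To make the idea concrete, the sketch below runs the same two steps with and without an intermediate variable. It uses a hypothetical `df_demo` built from a few values shown above. Wrapping the chain in parentheses is a common style that allows one method per line.

```python
import pandas as pd

# Hypothetical mini-DataFrame: a few values copied from the outputs above.
df_demo = pd.DataFrame(
    {"Water quality": [93, 92, 84, 91, 71], "Air pollution": [5, 16, 15, 7, 16]},
    index=pd.Index(
        ["Australia", "Austria", "Belgium", "Canada", "Chile"], name="Country"
    ),
)

# Step by step, with an intermediate variable.
summary = df_demo.describe()
rounded = summary.round(2)

# The same two steps chained together.
chained = (
    df_demo
    .describe()
    .round(2)
)

print(chained.equals(rounded))  # True
```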


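### Counting Values

Counts how many times each value appears. Useful when working with ordinal and categorical data. **Here we apply it to a `Series`**; since pandas 1.1 a `DataFrame` also has a `value_counts` method, which counts unique rows instead.

> **Note 📖**: Notice that in this example we first round and then count (`value_counts` already sorts the result from most to least frequent). This is known as *method chaining* and is used a lot with `pandas`.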

In [95]:
df.loc[:, "Time devoted to leisure and personal care"].round(0)

Country
Australia          14.0
Austria            15.0
Belgium            16.0
Canada             15.0
Chile               NaN
Colombia            NaN
Czech Republic      NaN
Denmark            16.0
Estonia            15.0
Finland            15.0
France             16.0
Germany            16.0
Greece              NaN
Hungary             NaN
Iceland             NaN
Ireland             NaN
Israel              NaN
Italy              16.0
Japan               NaN
Korea              15.0
Latvia             14.0
Lithuania           NaN
Luxembourg          NaN
Mexico              NaN
Netherlands         NaN
New Zealand        15.0
Norway             16.0
Poland             14.0
Portugal            NaN
Slovak Republic     NaN
Slovenia           15.0
Spain              16.0
Sweden             15.0
Switzerland         NaN
Turkey             15.0
United Kingdom     15.0
United States      14.0
OECD - Total       15.0
Brazil              NaN
Russia              NaN
South Africa       15.0
Name: Ti

In [94]:
df.loc[:, "Time devoted to leisure and personal care"].round(0).value_counts()

Time devoted to leisure and personal care
15.0    12
16.0     7
14.0     4
Name: count, dtype: int64
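
Note that the counts above add up to 23, not 41: `value_counts` drops missing values by default. The sketch below shows this on a hypothetical `leisure` series with a few values copied from earlier outputs, and uses `dropna=False` to count the missing values as well.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: a few values copied from the outputs above.
leisure = pd.Series(
    [14.35, 14.55, 15.7, 14.56, np.nan, 15.87],
    index=["Australia", "Austria", "Belgium", "Canada", "Chile", "Denmark"],
    name="Time devoted to leisure and personal care",
)

counts = leisure.round(0).value_counts()                       # NaN is ignored
counts_with_nan = leisure.round(0).value_counts(dropna=False)  # NaN is counted

print(counts.sum())           # 5
print(counts_with_nan.sum())  # 6
```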

### Sorting Data

Sorts the data by rows or by columns.

For the following examples we will use the environmental data:


**Environmental quality**

- Air pollution 🏙️: average concentration of particulate matter (PM2.5) in cities with populations over 100,000

- Water quality 💧: percentage of people who report being satisfied with the quality of their local water




In [96]:
df_ambiental = df.loc[:, ['Air pollution', 'Water quality']]
df_ambiental

| Country | Air pollution | Water quality |
|---|---|---|
| Australia | 5 | 93 |
| Austria | 16 | 92 |
| Belgium | 15 | 84 |
| Canada | 7 | 91 |
| Chile | 16 | 71 |
| Colombia | 10 | 75 |
| Czech Republic | 20 | 87 |
| Denmark | 9 | 95 |
| Estonia | 8 | 84 |
| Finland | 6 | 95 |


> **Note 📖**: Sorting produces a new `DataFrame`! In general, this is true of most `DataFrame` operations.

To sort we use the `sort_values` method, which takes the column to sort by plus an optional `ascending` parameter. With `True` (the default) the rows are sorted in ascending order; with `False`, in descending order.

In this case we want to sort from best to worst water quality, that is, in descending order:

In [97]:
df_ambiental_ordenado = df_ambiental.sort_values("Water quality", ascending=False)
df_ambiental_ordenado

| Country | Air pollution | Water quality |
|---|---|---|
| Iceland | 3 | 99 |
| Norway | 5 | 98 |
| Sweden | 6 | 96 |
| Denmark | 9 | 95 |
| Finland | 6 | 95 |
| Switzerland | 15 | 95 |
| Australia | 5 | 93 |
| Netherlands | 14 | 93 |
| Austria | 16 | 92 |
| Canada | 7 | 91 |
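
To make the point about new objects concrete, here is a small sketch showing that `sort_values` leaves the original `DataFrame` untouched. The `df_demo` table is a hypothetical three-row sample with values copied from the outputs above.

```python
import pandas as pd

# Hypothetical three-row sample with the same two columns used above.
df_demo = pd.DataFrame(
    {"Air pollution": [16, 3, 9], "Water quality": [71, 99, 95]},
    index=pd.Index(["Chile", "Iceland", "Denmark"], name="Country"),
)

# sort_values returns a new, sorted DataFrame...
df_demo_ordenado = df_demo.sort_values("Water quality", ascending=False)

# ...while the original keeps its row order.
print(df_demo.index.tolist())           # ['Chile', 'Iceland', 'Denmark']
print(df_demo_ordenado.index.tolist())  # ['Iceland', 'Denmark', 'Chile']
```

This is why the cells in this section assign the result to a variable: without the assignment, the sorted table would be displayed and then discarded.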


> **Question ❓**: Can we sort by more than one column?


In [98]:
df_ambiental.sort_values(['Water quality', 'Air pollution'])

| Country | Air pollution | Water quality |
|---|---|---|
| Russia | 15 | 55 |
| Turkey | 20 | 65 |
| Israel | 21 | 67 |
| South Africa | 22 | 67 |
| Mexico | 16 | 68 |
| Greece | 18 | 69 |
| Chile | 16 | 71 |
| Italy | 18 | 71 |
| Spain | 11 | 72 |
| Brazil | 10 | 73 |


In [99]:
df_ambiental = df_ambiental.sort_values(
    ["Water quality", "Air pollution"], ascending=False
)
df_ambiental

| Country | Air pollution | Water quality |
|---|---|---|
| Iceland | 3 | 99 |
| Norway | 5 | 98 |
| Sweden | 6 | 96 |
| Switzerland | 15 | 95 |
| Denmark | 9 | 95 |
| Finland | 6 | 95 |
| Netherlands | 14 | 93 |
| Australia | 5 | 93 |
| Austria | 16 | 92 |
| Germany | 14 | 91 |


> **Question ❓**: Does it make sense to sort both columns in descending order?

In [100]:
df_ambiental = df_ambiental.sort_values(
    ["Water quality", "Air pollution"], ascending=[False, True]
)
df_ambiental

| Country | Air pollution | Water quality |
|---|---|---|
| Iceland | 3 | 99 |
| Norway | 5 | 98 |
| Sweden | 6 | 96 |
| Finland | 6 | 95 |
| Denmark | 9 | 95 |
| Switzerland | 15 | 95 |
| Australia | 5 | 93 |
| Netherlands | 14 | 93 |
| Austria | 16 | 92 |
| Canada | 7 | 91 |
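
The tie-breaking behaviour can be sketched on a handful of rows. The `df_demo` table below is a hypothetical sample (values copied from the outputs above) where three countries share the same water quality, so the second sort key decides their order.

```python
import pandas as pd

# Hypothetical sample: three countries tie at 95 in "Water quality".
df_demo = pd.DataFrame(
    {"Air pollution": [15, 6, 9, 3], "Water quality": [95, 95, 95, 99]},
    index=pd.Index(["Switzerland", "Finland", "Denmark", "Iceland"], name="Country"),
)

# One flag per column: best water first, and among ties the cleanest air first.
orden = df_demo.sort_values(
    ["Water quality", "Air pollution"], ascending=[False, True]
)

print(orden.index.tolist())  # ['Iceland', 'Finland', 'Denmark', 'Switzerland']
```

Passing a list to `ascending` only makes sense when it has one entry per sort column, in the same order as the column list.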


### Analyzing Null Values


We can check how many non-null values each column has (and therefore how many nulls) using the `info` method we saw earlier.

In [101]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41 entries, Australia to South Africa
Data columns (total 24 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Dwellings without basic facilities                 38 non-null     float64
 1   Housing expenditure                                39 non-null     float64
 2   Rooms per person                                   38 non-null     float64
 3   Household net adjusted disposable income           30 non-null     float64
 4   Household net wealth                               29 non-null     float64
 5   Labour market insecurity                           34 non-null     float64
 6   Employment rate                                    41 non-null     int64  
 7   Long-term unemployment rate                        39 non-null     float64
 8   Personal earnings                                  36 non-null     float64
 9  

Now, if we want to find the nulls in one specific column, for example, we can use the `isna()` method:

In [102]:
df['Self-reported health']

Country
Australia          85.0
Austria            70.0
Belgium            74.0
Canada             88.0
Chile              57.0
Colombia            NaN
Czech Republic     60.0
Denmark            71.0
Estonia            53.0
Finland            70.0
France             66.0
Germany            65.0
Greece             74.0
Hungary            60.0
Iceland            76.0
Ireland            83.0
Israel             84.0
Italy              71.0
Japan              36.0
Korea              33.0
Latvia             47.0
Lithuania          43.0
Luxembourg         69.0
Mexico             66.0
Netherlands        76.0
New Zealand        88.0
Norway             77.0
Poland             58.0
Portugal           48.0
Slovak Republic    66.0
Slovenia           64.0
Spain              72.0
Sweden             75.0
Switzerland        78.0
Turkey             69.0
United Kingdom     69.0
United States      88.0
OECD - Total       69.0
Brazil              NaN
Russia             43.0
South Africa        NaN
Name: Se

In [103]:
df['Self-reported health'].isna()

Country
Australia          False
Austria            False
Belgium            False
Canada             False
Chile              False
Colombia            True
Czech Republic     False
Denmark            False
Estonia            False
Finland            False
France             False
Germany            False
Greece             False
Hungary            False
Iceland            False
Ireland            False
Israel             False
Italy              False
Japan              False
Korea              False
Latvia             False
Lithuania          False
Luxembourg         False
Mexico             False
Netherlands        False
New Zealand        False
Norway             False
Poland             False
Portugal           False
Slovak Republic    False
Slovenia           False
Spain              False
Sweden             False
Switzerland        False
Turkey             False
United Kingdom     False
United States      False
OECD - Total       False
Brazil              True
Russia           

Then, using this mask with `.loc`, we can get the rows that contain null values:

In [104]:
mascara = df['Self-reported health'].isna()

df.loc[mascara, ['Self-reported health']]

| Country | Self-reported health |
|---|---|
| Colombia | NaN |
| Brazil | NaN |
| South Africa | NaN |
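
The same mask-then-`.loc` pattern, sketched on a hypothetical four-row sample (`salud`) so the whole flow fits on screen:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample of the column analysed above.
salud = pd.DataFrame(
    {"Self-reported health": [57.0, np.nan, 71.0, np.nan]},
    index=pd.Index(["Chile", "Colombia", "Denmark", "Brazil"], name="Country"),
)

# Boolean mask: True where the value is missing.
mascara = salud["Self-reported health"].isna()

# .loc keeps only the rows where the mask is True.
nulos = salud.loc[mascara]

print(nulos.index.tolist())  # ['Colombia', 'Brazil']
print(int(mascara.sum()))    # 2, because True counts as 1
```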


We can also get the non-null values by **negating** the mask (`~` operator):

In [105]:
~df['Self-reported health'].isna()

Country
Australia           True
Austria             True
Belgium             True
Canada              True
Chile               True
Colombia           False
Czech Republic      True
Denmark             True
Estonia             True
Finland             True
France              True
Germany             True
Greece              True
Hungary             True
Iceland             True
Ireland             True
Israel              True
Italy               True
Japan               True
Korea               True
Latvia              True
Lithuania           True
Luxembourg          True
Mexico              True
Netherlands         True
New Zealand         True
Norway              True
Poland              True
Portugal            True
Slovak Republic     True
Slovenia            True
Spain               True
Sweden              True
Switzerland         True
Turkey              True
United Kingdom      True
United States       True
OECD - Total        True
Brazil             False
Russia           

In [106]:
mascara_2 = ~df['Self-reported health'].isna()

df.loc[mascara_2, ['Self-reported health']]

| Country | Self-reported health |
|---|---|
| Australia | 85.0 |
| Austria | 70.0 |
| Belgium | 74.0 |
| Canada | 88.0 |
| Chile | 57.0 |
| Czech Republic | 60.0 |
| Denmark | 71.0 |
| Estonia | 53.0 |
| Finland | 70.0 |
| France | 66.0 |
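
As a side note, pandas also provides `notna()`, which gives the same result as negating `isna()`. A short sketch on a hypothetical three-value series:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
salud = pd.Series([85.0, np.nan, 70.0], index=["Australia", "Colombia", "Austria"])

no_nulos_negando = ~salud.isna()  # negate the "is null" mask
no_nulos_directo = salud.notna()  # ask for "is not null" directly

# Both masks are identical, so either one can be used for filtering.
print(no_nulos_negando.equals(no_nulos_directo))  # True
print(salud[no_nulos_directo].index.tolist())     # ['Australia', 'Austria']
```

Which form to use is a matter of taste; `~` is still needed for masks that have no built-in opposite.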


> **Question ❓**: What happens if I call `isna()` on the whole DataFrame?

In [110]:
df.isna()

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Austria,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Belgium,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Canada,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Chile,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
Colombia,False,False,False,True,True,True,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True
Czech Republic,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
Denmark,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Estonia,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Finland,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Then, combining that information with the `sum` method, I can find how many null values there are per column (the default) or per row (`axis=1`).

In [111]:
df.isna().sum()

Dwellings without basic facilities                    3
Housing expenditure                                   2
Rooms per person                                      3
Household net adjusted disposable income             11
Household net wealth                                 12
Labour market insecurity                              7
Employment rate                                       0
Long-term unemployment rate                           2
Personal earnings                                     5
Quality of support network                            0
Educational attainment                                1
Student skills                                        1
Years in education                                    1
Air pollution                                         0
Water quality                                         0
Stakeholder engagement for developing regulations     2
Voter turnout                                         0
Life expectancy                                 

In [112]:
df.isna().sum(axis=1)

Country
Australia          2
Austria            0
Belgium            0
Canada             0
Chile              3
Colombia           6
Czech Republic     2
Denmark            0
Estonia            0
Finland            0
France             0
Germany            0
Greece             1
Hungary            2
Iceland            3
Ireland            1
Israel             5
Italy              0
Japan              3
Korea              1
Latvia             0
Lithuania          3
Luxembourg         1
Mexico             3
Netherlands        1
New Zealand        2
Norway             1
Poland             0
Portugal           1
Slovak Republic    1
Slovenia           0
Spain              0
Sweden             1
Switzerland        3
Turkey             3
United Kingdom     0
United States      0
OECD - Total       0
Brazil             9
Russia             6
South Africa       9
dtype: int64
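
The difference between the two calls comes down to the `axis` argument. A minimal sketch on a hypothetical 3x2 table (the column names `a` and `b` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table with four missing values in total.
df_demo = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [np.nan, 5.0, np.nan]},
    index=["x", "y", "z"],
)

# axis=0 (default): add up each column, giving the nulls per column.
nulos_por_columna = df_demo.isna().sum()

# axis=1: add up each row, giving the nulls per row.
nulos_por_fila = df_demo.isna().sum(axis=1)

# Summing twice gives the total number of nulls in the table.
total_nulos = int(df_demo.isna().sum().sum())

print(nulos_por_columna.to_dict())  # {'a': 2, 'b': 2}
print(nulos_por_fila.to_dict())     # {'x': 1, 'y': 1, 'z': 2}
print(total_nulos)                  # 4
```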

Now, to drop rows that contain nulls, I can use the `dropna` method:

In [113]:
df.dropna()

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17
France,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
Germany,0.2,20.0,1.8,34294.0,259667.0,2.7,75,1.57,47585.0,90,87.0,508.0,18.1,14,91,1.8,76,81.1,65.0,7.0,72.5,0.5,4.26,15.62
Italy,0.7,23.0,1.4,26588.0,279889.0,12.3,58,6.59,36658.0,92,61.0,485.0,16.6,18,71,2.5,73,83.3,71.0,6.0,58.4,0.6,4.11,16.47
Latvia,13.9,23.0,1.2,16275.0,70160.0,9.6,70,3.35,23683.0,86,88.0,487.0,18.0,11,79,2.2,59,74.7,47.0,5.9,62.4,4.8,1.27,13.83


And what if I only want to drop the rows where `Self-reported health` is null?


> **Exercise 📝:** Visit https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#pandas.DataFrame.dropna

In [114]:
df_sin_self_reported_health_nulo = df.dropna(subset=['Self-reported health'])
df_sin_self_reported_health_nulo

Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17
France,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
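
To compare the main ways of calling `dropna`, here is a sketch on a hypothetical four-row table (columns `a` and `b` are made up). It contrasts the default behaviour with the `subset` and `how` parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 1 is entirely null, row 2 is complete.
df_demo = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan], "b": [np.nan, np.nan, 6.0, 8.0]}
)

# Default: drop every row that has at least one null.
sin_ningun_nulo = df_demo.dropna()

# subset: only look at the listed columns when deciding what to drop.
sin_nulos_en_a = df_demo.dropna(subset=["a"])

# how="all": drop a row only if all of its values are null.
sin_filas_vacias = df_demo.dropna(how="all")

print(sin_ningun_nulo.index.tolist())   # [2]
print(sin_nulos_en_a.index.tolist())    # [0, 2]
print(sin_filas_vacias.index.tolist())  # [0, 2, 3]
```

The default is the most aggressive option, so on a table with many sparse columns it can discard most of the rows.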


## Analyzing Duplicates

Suppose that, for some unknown reason, the data came with 4 rows labeled `Chile`:

In [115]:
df_duplicados = df.copy()

# Script to generate the DataFrame with duplicates:
# take the list of index labels and manually relabel a few rows as Chile.
index = df_duplicados.index.tolist()
index[10] = 'Chile'
index[22] = 'Chile'
index[29] = 'Chile'

print(index)

# then reassign the index
df_duplicados.index = index
df_duplicados

['Australia', 'Austria', 'Belgium', 'Canada', 'Chile', 'Colombia', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'Chile', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Japan', 'Korea', 'Latvia', 'Lithuania', 'Chile', 'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Poland', 'Portugal', 'Chile', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey', 'United Kingdom', 'United States', 'OECD - Total', 'Brazil', 'Russia', 'South Africa']


Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17


> **Question ❓**: What happens if I index by Chile?

In [116]:
df_duplicados.loc[['Chile'], :]

Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Chile,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
Chile,0.5,21.0,1.9,39264.0,769053.0,1.7,66,2.35,63062.0,93,77.0,483.0,15.1,12,84,1.7,91,82.8,69.0,6.9,75.8,0.6,3.82,
Chile,1.2,23.0,1.1,20474.0,119696.0,9.9,66,4.78,24328.0,91,91.0,463.0,15.8,21,85,3.0,60,77.3,66.0,6.2,63.5,0.8,4.14,
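
Repeated labels can also be detected directly on the index. The sketch below uses a hypothetical three-row table and the index attribute `is_unique` together with `Index.duplicated`:

```python
import pandas as pd

# Hypothetical table where the label "Chile" appears twice in the index.
df_demo = pd.DataFrame(
    {"Employment rate": [63, 65, 67]},
    index=["Chile", "Chile", "Colombia"],
)

# is_unique tells us whether any label is repeated.
print(df_demo.index.is_unique)  # False

# Selecting a repeated label returns every matching row.
filas_chile = df_demo.loc["Chile"]
print(filas_chile.shape)  # (2, 1)

# Index.duplicated works like the DataFrame method, but on the labels.
etiquetas_repetidas = df_demo.index.duplicated(keep=False)
print(etiquetas_repetidas.tolist())  # [True, True, False]
```

Checking `is_unique` right after loading the data is a cheap way to catch this situation before it surprises a later `.loc` lookup.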


### Aside: Resetting the Index and Renaming Columns

If for some reason we no longer need the countries as the index (or whatever index we have at the moment), we can reset it with the `reset_index()` method:

In [117]:
df_duplicados

Unnamed: 0,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17


In [118]:
df_duplicados = df_duplicados.reset_index()
df_duplicados

Unnamed: 0,index,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Employment rate,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
0,Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
1,Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
2,Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
3,Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
4,Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
5,Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
6,Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
7,Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
8,Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
9,Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17


In [120]:
df_duplicados = df_duplicados.rename(columns={
    'index': 'Country',  # column created by reset_index(), since the index had no name
    'Employment rate': 'Tasa de empleo'
})
df_duplicados

Unnamed: 0,Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Tasa de empleo,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
0,Australia,,20.0,,32759.0,427064.0,5.4,73,1.31,49126.0,95,81.0,502.0,21.0,5,93,2.7,91,82.5,85.0,7.3,63.5,1.1,13.04,14.35
1,Austria,0.9,21.0,1.6,33541.0,308325.0,3.5,72,1.84,50349.0,92,85.0,492.0,17.0,16,92,1.3,80,81.7,70.0,7.1,80.6,0.5,6.66,14.55
2,Belgium,1.9,21.0,2.2,30364.0,386006.0,3.7,63,3.54,49675.0,91,77.0,503.0,19.3,15,84,2.0,89,81.5,74.0,6.9,70.1,1.0,4.75,15.7
3,Canada,0.2,22.0,2.6,30854.0,423849.0,6.0,73,0.77,47622.0,93,91.0,523.0,17.3,7,91,2.9,68,81.9,88.0,7.4,82.2,1.3,3.69,14.56
4,Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
5,Colombia,23.9,17.0,1.2,,,,67,0.79,,89,54.0,410.0,14.1,10,75,1.4,53,76.2,,6.3,44.4,24.5,26.56,
6,Czech Republic,0.7,24.0,1.4,21453.0,,3.1,74,1.04,25372.0,91,94.0,491.0,17.9,20,87,1.6,61,79.1,60.0,6.7,72.3,0.5,5.65,
7,Denmark,0.5,23.0,1.9,29606.0,118637.0,4.2,74,1.31,51466.0,95,81.0,504.0,19.5,9,95,2.0,86,80.9,71.0,7.6,83.5,0.6,2.34,15.87
8,Estonia,7.0,17.0,1.6,19697.0,159373.0,3.8,74,1.92,24336.0,92,89.0,524.0,17.7,8,84,2.7,64,77.8,53.0,5.7,69.0,3.1,2.42,14.9
9,Finland,0.5,23.0,1.9,29943.0,200827.0,3.9,70,2.13,42964.0,95,88.0,523.0,19.8,6,95,2.2,67,81.5,70.0,7.6,85.1,1.3,3.81,15.17
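
The two steps of this aside, sketched on a hypothetical two-row table. The sketch assumes, as in the cells above, that the index has no name, so `reset_index()` stores the old labels in a column called `index`.

```python
import pandas as pd

# Hypothetical table whose index has no name.
df_demo = pd.DataFrame({"Employment rate": [63, 67]}, index=["Chile", "Colombia"])

# reset_index moves the labels into a column and installs a 0..n-1 index.
df_reset = df_demo.reset_index()
print(df_reset.columns.tolist())  # ['index', 'Employment rate']

# rename takes a dict {old name: new name}; columns not listed stay unchanged.
df_renombrado = df_reset.rename(
    columns={"index": "Country", "Employment rate": "Tasa de empleo"}
)
print(df_renombrado.columns.tolist())  # ['Country', 'Tasa de empleo']

# drop=True discards the old labels instead of keeping them as a column.
df_sin_etiquetas = df_demo.reset_index(drop=True)
print(df_sin_etiquetas.columns.tolist())  # ['Employment rate']
```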


------------------------ End of aside ---------------------------

Back to duplicates: to find them we can do something very similar to `isna()` using the `duplicated` method.

In [121]:
df_duplicados.duplicated(subset=['Country'])

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22     True
23    False
24    False
25    False
26    False
27    False
28    False
29     True
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
dtype: bool

Then, using a mask, we can select the duplicated rows.

In [122]:
df_duplicados.loc[df_duplicados['Country'].duplicated(), :]

Unnamed: 0,Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Tasa de empleo,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
10,Chile,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
22,Chile,0.5,21.0,1.9,39264.0,769053.0,1.7,66,2.35,63062.0,93,77.0,483.0,15.1,12,84,1.7,91,82.8,69.0,6.9,75.8,0.6,3.82,
29,Chile,1.2,23.0,1.1,20474.0,119696.0,9.9,66,4.78,24328.0,91,91.0,463.0,15.8,21,85,3.0,60,77.3,66.0,6.2,63.5,0.8,4.14,


> **Question ❓**: Weren't there 4 rows with Chile?


We can choose which occurrences are marked using the `keep` argument. According to the documentation:

```text
keep : {‘first’, ‘last’, False}, default ‘first’
    Determines which duplicates (if any) to mark.

    - first : Mark duplicates as True except for the first occurrence.
    - last : Mark duplicates as True except for the last occurrence.
    - False : Mark all duplicates as True.
```

In [123]:
df_duplicados.duplicated(subset=['Country'], keep=False)

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22     True
23    False
24    False
25    False
26    False
27    False
28    False
29     True
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
dtype: bool

In [124]:
df_duplicados[df_duplicados.loc[:, ['Country']].duplicated(keep=False)]

Unnamed: 0,Country,Dwellings without basic facilities,Housing expenditure,Rooms per person,Household net adjusted disposable income,Household net wealth,Labour market insecurity,Tasa de empleo,Long-term unemployment rate,Personal earnings,Quality of support network,Educational attainment,Student skills,Years in education,Air pollution,Water quality,Stakeholder engagement for developing regulations,Voter turnout,Life expectancy,Self-reported health,Life satisfaction,Feeling safe walking alone at night,Homicide rate,Employees working very long hours,Time devoted to leisure and personal care
4,Chile,9.4,18.0,1.2,,100967.0,8.7,63,,25879.0,85,65.0,443.0,17.5,16,71,1.3,47,79.9,57.0,6.5,47.9,4.2,9.72,
10,Chile,0.5,21.0,1.8,31304.0,280653.0,7.6,65,4.0,43755.0,90,78.0,496.0,16.5,13,81,2.1,75,82.4,66.0,6.5,70.5,0.5,7.67,16.36
22,Chile,0.5,21.0,1.9,39264.0,769053.0,1.7,66,2.35,63062.0,93,77.0,483.0,15.1,12,84,1.7,91,82.8,69.0,6.9,75.8,0.6,3.82,
29,Chile,1.2,23.0,1.1,20474.0,119696.0,9.9,66,4.78,24328.0,91,91.0,463.0,15.8,21,85,3.0,60,77.3,66.0,6.2,63.5,0.8,4.14,
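
The three values of `keep` can be compared side by side. The `paises` table below is hypothetical toy data with `Chile` repeated three times:

```python
import pandas as pd

# Hypothetical toy data: "Chile" appears at positions 0, 2 and 3.
paises = pd.DataFrame({"Country": ["Chile", "Peru", "Chile", "Chile"]})

# keep="first" (default): the first occurrence is not marked.
marca_first = paises.duplicated(subset=["Country"])

# keep="last": the last occurrence is not marked.
marca_last = paises.duplicated(subset=["Country"], keep="last")

# keep=False: every row that has a repeated value is marked.
marca_todos = paises.duplicated(subset=["Country"], keep=False)

print(marca_first.tolist())  # [False, False, True, True]
print(marca_last.tolist())   # [True, False, True, False]
print(marca_todos.tolist())  # [True, False, True, True]
```

With `first` or `last` the mask marks the copies one would discard, while `keep=False` is the option to use when the goal is to inspect every row involved.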


Finally, we can see which rows are fully duplicated by calling `duplicated` on the whole DataFrame.

In [125]:
df_duplicados.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
dtype: bool
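
Every value above is `False` because the four `Chile` rows share a label but not their data. A sketch on a hypothetical three-row table shows the difference between comparing whole rows and comparing a single column:

```python
import pandas as pd

# Hypothetical table: rows 0 and 2 are identical in every column,
# row 1 only shares the country name.
df_demo = pd.DataFrame(
    {"Country": ["Chile", "Chile", "Chile"], "Employment rate": [63, 65, 63]}
)

# Without subset, a row is a duplicate only if all of its columns match.
filas_identicas = df_demo.duplicated()

# With subset, only the listed columns are compared.
mismo_pais = df_demo.duplicated(subset=["Country"])

print(filas_identicas.tolist())  # [False, False, True]
print(mismo_pais.tolist())       # [False, True, True]
```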