In [None]:
# MIT License

# Copyright (c) 2021 GDSC UNI

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

<table align="center">
  <td align="center"><a target="_blank" href="https://gdsc.community.dev/universidad-nacional-de-ingenieria/">
        <img src="https://i.ibb.co/pX2w52P/GDSC.png" style="padding-bottom:5px;" />
      View GDSC UNI</a></td>

  <td align="center"><a target="_blank" href="https://colab.research.google.com/drive/1bfaeH8bjLp4h8Oigwzgss0qnz0uzYfd0?usp=sharing">
        <img src="https://i.ibb.co/Bf0HK0q/Colaboratory.png"  style="padding-bottom:5px;" />Run in Google Colab </a></td>

  <td align="center"><a target="_blank" href="https://github.com/GDSC-UNI/Pandas-For-Data-Science/blob/main/PFDS8_M%C3%BAltiples_indices.ipynb">
        <img src="https://i.ibb.co/VHHdRx2/Github.png"  height="110px" style="padding-bottom:5px;"/>View source on GitHub</a></td>
</table>



<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:#000080">PFDS8:</span> Multiple Indexes</h1>
<hr>

Notebook 3 covered categorical variables, which classify our data. The values of a categorical variable can also be used to group the data by treating them as index labels. To illustrate this idea we will use the Covid dataset, which has already been through a cleaning process. The only change we make is converting the month variable to a categorical type.


In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('./Datasets/Covid.csv')

df.head(5)

Unnamed: 0,location,country,gender,age,month
0,"Shenzhen, Guangdong",China,male,66.0,1
1,Shanghai,China,female,56.0,1
2,Zhejiang,China,male,46.0,1
3,Tianjin,China,female,60.0,1
4,Tianjin,China,male,58.0,1


In [None]:
df['month'] = pd.Categorical(df['month'].apply(str))
df.dtypes

location      object
country       object
gender        object
age          float64
month       category
dtype: object

Because our dataset is large, we will work with only two values of the "country" column. To make this selection we use a new method:
 
<code>DataFrame.isin(values)</code>
 
The values parameter can be an iterable, a Series, a DataFrame, or a dictionary. The same method is available on a single column as Series.isin, which is the form used here. As an example, we use *isin* to select months 1 and 3 of our DataFrame.


In [None]:
df[df['month'].isin(['1','3'])]

Unnamed: 0,location,country,gender,age,month
0,"Shenzhen, Guangdong",China,male,66.0,1
1,Shanghai,China,female,56.0,1
2,Zhejiang,China,male,46.0,1
3,Tianjin,China,female,60.0,1
4,Tianjin,China,male,58.0,1
...,...,...,...,...,...
718,Yau Ma Tei,Hong Kong,female,37.0,1
719,Tsing Yi,Hong Kong,male,75.0,1
720,Kowloon,Hong Kong,male,39.0,1
806,Lapland,Finland,female,32.0,1
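The filtering step above can be sketched on a small, made-up DataFrame so that it runs without the Covid file. The `toy` data and the variable names below are hypothetical and serve only as an illustration: `isin` produces a boolean mask, and the mask (or its negation with `~`) selects rows.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'country': ['China', 'Spain', 'Japan', 'China'],
    'month': pd.Categorical(['1', '2', '3', '1']),
})

# Series.isin returns a boolean mask with one entry per row.
mask = toy['month'].isin(['1', '3'])

# Keep the rows where the mask is True, or negate it with ~ to drop them.
selected = toy[mask]
excluded = toy[~mask]

print(selected)
print(excluded)
```

Because month was converted to strings before becoming categorical, the values passed to `isin` must also be strings (`'1'`, not `1`).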


We keep only the rows whose "country" value is 'China' or 'Spain' and then set the "country" and "location" columns as the index of our DataFrame with the *set_index* method.
 
<code>DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)</code>
 
In addition, we call the *sort_index* method with its default parameters to obtain a DataFrame sorted by label.
 
<code>DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)</code>


In [None]:
df_grouped = df[df['country'].isin(['China', 'Spain'])]
df_grouped = df_grouped.set_index(['country', 'location']).sort_index()
df_grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,gender,age,month
country,location,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
China,Beijing,male,37.0,1
China,Beijing,male,39.0,1
China,Beijing,male,56.0,1
China,Beijing,female,18.0,1
China,Beijing,female,32.0,1
...,...,...,...,...
Spain,Castellon,male,31.0,2
Spain,Castile and Leon,male,30.0,2
Spain,Tenerife,male,69.0,2
Spain,Valencia,male,44.0,2
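As a minimal sketch of what `set_index` followed by `sort_index` produces, the following uses a hypothetical four-row DataFrame (the rows are invented, not taken from the dataset). It shows that the two columns become index levels and, with the default `drop=True`, stop being regular columns.

```python
import pandas as pd

# Hypothetical toy data, deliberately unsorted.
toy = pd.DataFrame({
    'country': ['Spain', 'China', 'Spain', 'China'],
    'location': ['Valencia', 'Tianjin', 'Barcelona', 'Beijing'],
    'age': [44.0, 60.0, 36.0, 37.0],
})

# Two columns become a two-level index; sort_index orders it by label.
indexed = toy.set_index(['country', 'location']).sort_index()

print(indexed.index.nlevels)      # number of index levels
print(list(indexed.index.names))  # level names
print(list(indexed.columns))      # remaining regular columns
print(indexed)
```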


With country and location as the new index levels of our DataFrame, we can apply the same methods we use with an ordinary index.


In [None]:
df_grouped.loc['Spain', :]

Unnamed: 0_level_0,gender,age,month
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Andalusia,male,62.0,2
Andalusia,male,28.0,2
Andalusia,male,42.0,2
Andalusia,male,53.0,2
Andalusia,male,55.0,2
Andalusia,female,25.0,2
Andalusia,male,58.0,2
Barcelona,female,36.0,2
Barcelona,male,22.0,2
Barcelona,female,22.0,2


In [None]:
df_grouped.loc['Spain', :].loc['Barcelona', :]

Unnamed: 0_level_0,gender,age,month
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Barcelona,female,36.0,2
Barcelona,male,22.0,2
Barcelona,female,22.0,2


A method that does the same job as loc but is designed specifically for multiple indexes is *xs*, which returns a cross-section of the DataFrame.

<code>DataFrame.xs(key, axis=0, level=None, drop_level=True)</code>

In [None]:
# Pass the key as a scalar; list keys are rejected by recent pandas versions
df_grouped.xs('Spain')

Unnamed: 0_level_0,gender,age,month
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Andalusia,male,62.0,2
Andalusia,male,28.0,2
Andalusia,male,42.0,2
Andalusia,male,53.0,2
Andalusia,male,55.0,2
Andalusia,female,25.0,2
Andalusia,male,58.0,2
Barcelona,female,36.0,2
Barcelona,male,22.0,2
Barcelona,female,22.0,2


In [None]:
# Pass a tuple to select on several levels; list keys are rejected by recent pandas versions
df_grouped.xs(('Spain', 'Barcelona'))

Unnamed: 0_level_0,Unnamed: 1_level_0,gender,age,month
country,location,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Spain,Barcelona,female,36.0,2
Spain,Barcelona,male,22.0,2
Spain,Barcelona,female,22.0,2
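The `level` and `drop_level` parameters from the signature above are not used in the cells so far. A small sketch on hypothetical data (the `toy` rows are invented) shows how `level` selects on the inner level directly and how `drop_level=False` keeps the full index in the result.

```python
import pandas as pd

# Hypothetical toy data indexed by country and location.
toy = pd.DataFrame({
    'country': ['China', 'China', 'Spain', 'Spain'],
    'location': ['Beijing', 'Tianjin', 'Barcelona', 'Barcelona'],
    'age': [37.0, 60.0, 36.0, 22.0],
}).set_index(['country', 'location']).sort_index()

# Select on the inner level by name; the matched level is dropped by default.
barcelona = toy.xs('Barcelona', level='location')

# drop_level=False keeps both index levels in the result.
barcelona_full = toy.xs('Barcelona', level='location', drop_level=False)

print(barcelona)
print(barcelona_full)
```

Selecting by `level` avoids having to name the country first, which is the main practical difference from chaining two `loc` calls.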


To explore more methods, we now set all the countries and locations of our DataFrame as index levels.

In [None]:
df_countries = df.set_index(['country', 'location']).sort_index()
df_countries

Unnamed: 0_level_0,Unnamed: 1_level_0,gender,age,month
country,location,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Australia,NSW,male,35.0,1
Australia,NSW,male,43.0,1
Australia,NSW,male,53.0,1
Australia,NSW,female,21.0,1
Australia,Queensland,male,44.0,1
...,...,...,...,...
Vietnam,Vinh Phuc,female,42.0,2
Vietnam,Vinh Phuc,female,16.0,2
Vietnam,Vinh Phuc,female,29.0,2
Vietnam,Vinh Phuc,female,55.0,2


pandas provides an object called *IndexSlice*, which lets us write slices over multiple indexes.

In [None]:
ids = pd.IndexSlice

In [None]:
df_countries.loc[ids['Australia':'China'],:].sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,gender,age,month
country,location,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Australia,NSW,male,35.0,1
Australia,NSW,male,43.0,1
Australia,NSW,male,53.0,1
Australia,NSW,female,21.0,1
Australia,Queensland,male,44.0,1
...,...,...,...,...
China,Yunnan,male,71.0,1
China,Yunnan,female,55.0,1
China,Yunnan,female,36.0,1
China,Yunnan,male,65.0,1
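`IndexSlice` can also address the inner level. The sketch below uses a hypothetical four-row DataFrame (invented values, and `idx` is just a local alias) to show a label slice on the outer level and a list of labels on the inner level. It assumes the index has been sorted, as in the cells above.

```python
import pandas as pd

# Hypothetical toy data with a sorted two-level index.
toy = pd.DataFrame({
    'country': ['Australia', 'China', 'China', 'Spain'],
    'location': ['NSW', 'Beijing', 'Tianjin', 'Barcelona'],
    'age': [35.0, 37.0, 60.0, 36.0],
}).set_index(['country', 'location']).sort_index()

idx = pd.IndexSlice

# Label slices include both endpoints: Australia up to and including China.
first_two = toy.loc[idx['Australia':'China', :], :]

# Every country on the outer level, only the listed locations on the inner one.
by_location = toy.loc[idx[:, ['Beijing', 'Barcelona']], :]

print(first_two)
print(by_location)
```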


To get the values of a given level, such as "country" or "location", we use the *index.get_level_values* method, passing the level whose values we want.

<code>Index.get_level_values(level)</code>

In [None]:
df_countries.index.get_level_values(0)

Index(['Australia', 'Australia', 'Australia', 'Australia', 'Australia',
       'Australia', 'Australia', 'Australia', 'Australia', 'Australia',
       ...
       'USA', 'USA', 'Vietnam', 'Vietnam', 'Vietnam', 'Vietnam', 'Vietnam',
       'Vietnam', 'Vietnam', 'Vietnam'],
      dtype='object', name='country', length=825)

In [None]:
df_countries.index.get_level_values(1)

Index(['NSW', 'NSW', 'NSW', 'NSW', 'Queensland', 'Queensland', 'Queensland',
       'Queensland', 'Queensland', 'South Australia',
       ...
       'Massachusetts', 'Washington', 'Ho Chi Minh City', 'Ho Chi Minh City',
       'Vinh Phuc', 'Vinh Phuc', 'Vinh Phuc', 'Vinh Phuc', 'Vinh Phuc',
       'Vinh Phuc'],
      dtype='object', name='location', length=825)
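The level can be given by name as well as by position. As a small sketch on hypothetical data (the `toy` rows are invented), the following retrieves a level by name and reduces it to its distinct labels with `unique`.

```python
import pandas as pd

# Hypothetical toy data with a sorted two-level index.
toy = pd.DataFrame({
    'country': ['Australia', 'China', 'China', 'Spain'],
    'location': ['NSW', 'Beijing', 'Tianjin', 'Barcelona'],
    'age': [35.0, 37.0, 60.0, 36.0],
}).set_index(['country', 'location']).sort_index()

# One label per row, repeated as often as the label occurs.
countries = toy.index.get_level_values('country')

# Distinct labels only, in order of first appearance.
distinct_countries = countries.unique()

print(list(countries))
print(list(distinct_countries))
```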

In [None]:
df_countries['age']['China']['Tianjin']

location
Tianjin    60.0
Tianjin    58.0
Tianjin    46.0
Tianjin    29.0
Tianjin    39.0
Tianjin    59.0
Tianjin    57.0
Tianjin    68.0
Tianjin    40.0
Tianjin    46.0
Tianjin    56.0
Tianjin    29.0
Tianjin    29.0
Tianjin    57.0
Tianjin    30.0
Tianjin    55.0
Tianjin    79.0
Tianjin    19.0
Tianjin    71.0
Tianjin    50.0
Tianjin    78.0
Tianjin    49.0
Name: age, dtype: float64

We can also perform mathematical operations, such as the mean of a column, at a chosen index level. For example, to find the average age per location:

In [None]:
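# Group by the 'location' index level (mean(level=...) is not available in pandas 2.x);
# sort=False keeps the locations in order of first appearance
df_countries['age'].groupby(level='location', sort=False).mean()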

location
NSW                        38.0
Queensland                 33.6
South Australia            60.0
Victoria                   47.5
Preah Sihanouk Province    60.0
                           ... 
Illinois                   63.0
Massachusetts              25.0
Washington                 35.0
Ho Chi Minh City           47.0
Vinh Phuc                  37.0
Name: age, Length: 116, dtype: float64
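The same per-level aggregation can be sketched on a small hypothetical DataFrame (invented ages) using `groupby(level=...)`, here once for the inner level and once for the outer level.

```python
import pandas as pd

# Hypothetical toy data with a sorted two-level index.
toy = pd.DataFrame({
    'country': ['China', 'China', 'China', 'Spain', 'Spain'],
    'location': ['Beijing', 'Beijing', 'Tianjin', 'Barcelona', 'Barcelona'],
    'age': [30.0, 40.0, 50.0, 22.0, 36.0],
}).set_index(['country', 'location']).sort_index()

# Mean age for each label of the inner level.
by_location = toy['age'].groupby(level='location').mean()

# Mean age for each label of the outer level.
by_country = toy['age'].groupby(level='country').mean()

print(by_location)
print(by_country)
```

Passing a list such as `level=['country', 'location']` groups by both levels at once, which keeps locations with the same name in different countries apart.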