# Project: Using multiple indexes to extract information

In this project we will see how to use multiple indexes (a pandas `MultiIndex`) to give the information in a dataframe a better structure and
to query that information in a simple way.

### Some functions used in this script:

`pd.Categorical( df_columna.apply(str) )` <- Converts the variable 'df_columna' to the 'category' type

`.rename()` <- Changes the names of indexes or columns

`.isin()` <- Returns a boolean value ('True' or 'False') for each row, depending on whether the column value is in a given set

`df[ df_columna.isin(select_values) ]` <- Returns the records whose values are inside the set 'select_values'

`df[ ~df_columna.isin(select_values) ]` <- Returns the records whose values are outside the set 'select_values'

`.set_index()` <- Turns one or several columns into the indexes of a dataframe

`.reset_index()` <- Turns the multiple indexes back into columns

`.sort_index()` <- Returns the dataframe sorted according to its indexes

`.loc[]` <- Selects rows of a dataframe (with or without multiple indexes)

`.groupby(level='indice_name').sum()` <- Sums all values with respect to a specific index level (the older `.sum(level='indice_name')` form was deprecated and removed in pandas 2.0)

`.index.get_level_values( level )` <- Extracts the index values (labels) of the specified level, given by number or by name

`.unstack('index_name')` <- Turns the values of the index level 'index_name' into columns
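As a minimal sketch of the `.isin()` filtering described in the list above, the mask can be used directly for the "inside" case and negated with `~` for the "outside" case. The toy dataframe and the name `select_values` are illustrative assumptions, not the population data used in this notebook.

```python
import pandas as pd

# Hypothetical toy data (not the population file loaded later in this notebook)
df = pd.DataFrame({'Country': ['Mexico', 'France', 'Peru'],
                   'poblacion': [1.0, 2.0, 3.0]})

select_values = ['Mexico', 'France']
mask = df['Country'].isin(select_values)   # boolean Series, one value per row

inside = df[mask]     # rows whose 'Country' is in select_values
outside = df[~mask]   # rows whose 'Country' is not in select_values
```

Comparing the mask with `==True` is not needed, because the mask is already boolean.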

In [1]:
import pandas as pd

In [2]:
# The following line makes large numbers display without scientific notation:
pd.options.display.float_format = '{:,.2f}'.format

In [3]:
# Load the data:
df_pob = pd.read_csv('./db/Population/poblacion.csv')

df_pob#.head()

Unnamed: 0,Country,year,pop
0,Afghanistan,2015,34413603.00
1,Albania,2015,2880703.00
2,Algeria,2015,39728025.00
3,American Samoa,2015,55812.00
4,Andorra,2015,78011.00
...,...,...,...
1035,Pre-demographic dividend,2018,919485393.00
1036,Small states,2018,40575321.00
1037,South Asia,2018,1814388744.00
1038,South Asia (IDA & IBRD),2018,1814388744.00


This dataframe shows the number of inhabitants of different countries of the world for the years 2015 to 2018. It also contains aggregate rows such as 'South Asia' or 'Small states'.

In [4]:
# Rename the column 'pop' to 'poblacion'
df_pob.rename( columns={'pop':'poblacion'}, inplace=True )

In [5]:
df_pob.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1040 entries, 0 to 1039
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    1040 non-null   object 
 1   year       1040 non-null   int64  
 2   poblacion  1032 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 24.5+ KB


In [6]:
# We will use 'Country' and 'year' as indexes, so we convert them to the 'category' type
df_pob['Country'] = pd.Categorical( df_pob['Country'].apply(str) )
df_pob['year'] = pd.Categorical( df_pob['year'].apply(str) )

In [7]:
df_pob.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1040 entries, 0 to 1039
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   Country    1040 non-null   category
 1   year       1040 non-null   category
 2   poblacion  1032 non-null   float64 
dtypes: category(2), float64(1)
memory usage: 21.6 KB
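A common shorthand for the `pd.Categorical(...)` conversion above is `.astype('category')`. The following is only a sketch on hypothetical toy data; it assumes, as in this notebook, that the years should become strings before being turned into categories.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
df = pd.DataFrame({'Country': ['Mexico', 'France', 'Mexico'],
                   'year': [2015, 2015, 2016]})

# Convert both columns to the 'category' dtype
df['Country'] = df['Country'].astype('category')
df['year'] = df['year'].astype(str).astype('category')

year_categories = list(df['year'].cat.categories)
```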


In [8]:
df_pob

Unnamed: 0,Country,year,poblacion
0,Afghanistan,2015,34413603.00
1,Albania,2015,2880703.00
2,Algeria,2015,39728025.00
3,American Samoa,2015,55812.00
4,Andorra,2015,78011.00
...,...,...,...
1035,Pre-demographic dividend,2018,919485393.00
1036,Small states,2018,40575321.00
1037,South Asia,2018,1814388744.00
1038,South Asia (IDA & IBRD),2018,1814388744.00


In [9]:
# Extract a sample of some countries
# This is done with a boolean mask built by '.isin(select_values)'

df_sample = df_pob[ df_pob['Country'].isin( ['Mexico','France','Colombia','Germany'] ) ]
df_sample


Unnamed: 0,Country,year,poblacion
42,Colombia,2015,47520667.0
68,France,2015,66593366.0
73,Germany,2015,81686611.0
127,Mexico,2015,121858258.0
302,Colombia,2016,48171392.0
328,France,2016,66859768.0
333,Germany,2016,82348669.0
387,Mexico,2016,123333376.0
562,Colombia,2017,48901066.0
588,France,2017,66865144.0


## Using `.set_index()` to turn column values into index values

In [10]:
# Turn the columns ['Country','year'] into indexes.
# 'Country' and 'year' were previously converted to the 'category' type; '.set_index()' itself accepts columns of any dtype
df_sample_new = df_sample.set_index( ['Country','year'] )
df_sample_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2015,47520667.0
France,2015,66593366.0
Germany,2015,81686611.0
Mexico,2015,121858258.0
Colombia,2016,48171392.0
France,2016,66859768.0
Germany,2016,82348669.0
Mexico,2016,123333376.0
Colombia,2017,48901066.0
France,2017,66865144.0
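To illustrate that `.set_index()` does not depend on the 'category' dtype, here is a small sketch with plain string and integer columns. The two rows reuse values shown above, but the dataframe itself is a stand-alone assumption and not `df_sample`.

```python
import pandas as pd

# Stand-alone toy dataframe with plain (non-categorical) columns
df = pd.DataFrame({'Country': ['France', 'Colombia'],
                   'year': [2015, 2015],
                   'poblacion': [66593366.0, 47520667.0]})

# Both columns become levels of a MultiIndex
df_idx = df.set_index(['Country', 'year'])

# A full key (one label per level) plus a column name selects a single value
france_2015 = df_idx.loc[('France', 2015), 'poblacion']
```

Note that the year is an integer here, so it is looked up as `2015` and not as the string `'2015'` used in the rest of the notebook.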


In [11]:
# Sort with respect to the categories of the indexes:
df_sample_new = df_sample_new.sort_index()

df_sample_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2015,47520667.0
Colombia,2016,48171392.0
Colombia,2017,48901066.0
Colombia,2018,49648685.0
France,2015,66593366.0
France,2016,66859768.0
France,2017,66865144.0
France,2018,66987244.0
Germany,2015,81686611.0
Germany,2016,82348669.0


## Using `.reset_index()` to turn index values into column values

In [12]:
df_sample_new.reset_index()

Unnamed: 0,Country,year,poblacion
0,Colombia,2015,47520667.0
1,Colombia,2016,48171392.0
2,Colombia,2017,48901066.0
3,Colombia,2018,49648685.0
4,France,2015,66593366.0
5,France,2016,66859768.0
6,France,2017,66865144.0
7,France,2018,66987244.0
8,Germany,2015,81686611.0
9,Germany,2016,82348669.0


In [13]:
df_sample_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2015,47520667.0
Colombia,2016,48171392.0
Colombia,2017,48901066.0
Colombia,2018,49648685.0
France,2015,66593366.0
France,2016,66859768.0
France,2017,66865144.0
France,2018,66987244.0
Germany,2015,81686611.0
Germany,2016,82348669.0


## Extracting information with `.loc[]`


In [14]:
df_sample_new.loc['France']

Unnamed: 0_level_0,poblacion
year,Unnamed: 1_level_1
2015,66593366.0
2016,66859768.0
2017,66865144.0
2018,66987244.0


In [15]:
# with double brackets the result keeps the 'Country' index level, which makes the dataframe more explicit
df_sample_new.loc[['France']]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
France,2015,66593366.0
France,2016,66859768.0
France,2017,66865144.0
France,2018,66987244.0


In [16]:
df_sample_new.loc['France',:,:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
France,2015,66593366.0
France,2016,66859768.0
France,2017,66865144.0
France,2018,66987244.0


In [17]:
df_sample_new.loc['France'].loc['2017']

poblacion   66,865,144.00
Name: 2017, dtype: float64

In [18]:
df_sample_new.loc['France','2017']

poblacion   66,865,144.00
Name: (France, 2017), dtype: float64

In [19]:
df_sample_new.loc['France','2017',:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
France,2017,66865144.0


In [20]:
df_sample_new.loc[['France','Germany'],['2015','2018'],:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
France,2015,66593366.0
France,2018,66987244.0
Germany,2015,81686611.0
Germany,2018,82927922.0


In [21]:
df_sample_new.loc[ ['Colombia','Mexico'],:,:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2015,47520667.0
Colombia,2016,48171392.0
Colombia,2017,48901066.0
Colombia,2018,49648685.0
Mexico,2015,121858258.0
Mexico,2016,123333376.0
Mexico,2017,124777324.0
Mexico,2018,126190788.0


In [22]:
df_sample_new.loc[ 'Colombia':'Mexico',:,: ]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2015,47520667.0
Colombia,2016,48171392.0
Colombia,2017,48901066.0
Colombia,2018,49648685.0
France,2015,66593366.0
France,2016,66859768.0
France,2017,66865144.0
France,2018,66987244.0
Germany,2015,81686611.0
Germany,2016,82348669.0


In [23]:
df_sample_new.loc[ ['Colombia','Mexico'], '2016':'2017' ,: ]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2016,48171392.0
Colombia,2017,48901066.0
Mexico,2016,123333376.0
Mexico,2017,124777324.0


In [24]:
df_sample_new.loc[ 'Colombia':'Mexico' ,'2016':'2017',:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Colombia,2016,48171392.0
Colombia,2017,48901066.0
France,2016,66859768.0
France,2017,66865144.0
Germany,2016,82348669.0
Germany,2017,82657002.0
Mexico,2016,123333376.0
Mexico,2017,124777324.0


In [25]:
df_sample_new.loc[ :,'2018',: ]

Unnamed: 0_level_0,poblacion
Country,Unnamed: 1_level_1
Colombia,49648685.0
France,66987244.0
Germany,82927922.0
Mexico,126190788.0
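The selections above can also be written with `pd.IndexSlice` or with `.xs()`, which some readers find easier to follow than a tuple of bare colons. This is a sketch on hypothetical toy data, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy data with the same index level names as the notebook
index = pd.MultiIndex.from_product(
    [['Colombia', 'France'], ['2017', '2018']], names=['Country', 'year'])
df = pd.DataFrame({'poblacion': [1.0, 2.0, 3.0, 4.0]}, index=index)

idx = pd.IndexSlice

# All countries, year '2018', all columns
rows_2018 = df.loc[idx[:, '2018'], :]

# Cross-section on the 'year' level; the selected level is dropped from the result
by_year = df.xs('2018', level='year')
```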


## Reorganizing the initial dataframe

In [26]:
# Initial dataframe
df_pob

Unnamed: 0,Country,year,poblacion
0,Afghanistan,2015,34413603.00
1,Albania,2015,2880703.00
2,Algeria,2015,39728025.00
3,American Samoa,2015,55812.00
4,Andorra,2015,78011.00
...,...,...,...
1035,Pre-demographic dividend,2018,919485393.00
1036,Small states,2018,40575321.00
1037,South Asia,2018,1814388744.00
1038,South Asia (IDA & IBRD),2018,1814388744.00


In [27]:
# Turn the columns ['Country','year'] into a multiple index:
df_pob_new = df_pob.set_index( ['Country','year'] )
df_pob_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Afghanistan,2015,34413603.00
Albania,2015,2880703.00
Algeria,2015,39728025.00
American Samoa,2015,55812.00
Andorra,2015,78011.00
...,...,...
Pre-demographic dividend,2018,919485393.00
Small states,2018,40575321.00
South Asia,2018,1814388744.00
South Asia (IDA & IBRD),2018,1814388744.00


In [28]:
# Sort the information with respect to the indexes
# For example: 'Country' in descending order and 'year' in ascending order
df_pob_new.sort_index(ascending=[False,True])

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Zimbabwe,2015,13814629.00
Zimbabwe,2016,14030390.00
Zimbabwe,2017,14236745.00
Zimbabwe,2018,14439018.00
Zambia,2015,15879361.00
...,...,...
Albania,2018,2866376.00
Afghanistan,2015,34413603.00
Afghanistan,2016,35383128.00
Afghanistan,2017,36296400.00


In [29]:
# Sort the information
# For example: 'Country' in ascending order and 'year' in ascending order
df_pob_new = df_pob_new.sort_index(ascending=[True,True])
df_pob_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Afghanistan,2015,34413603.00
Afghanistan,2016,35383128.00
Afghanistan,2017,36296400.00
Afghanistan,2018,37172386.00
Albania,2015,2880703.00
...,...,...
Zambia,2018,17351822.00
Zimbabwe,2015,13814629.00
Zimbabwe,2016,14030390.00
Zimbabwe,2017,14236745.00


In [30]:
df_pob_new.loc[ 'Aruba':'Austria' ,'2015':'2017',:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Aruba,2015,104341.0
Aruba,2016,104872.0
Aruba,2017,105366.0
Australia,2015,23815995.0
Australia,2016,24190907.0
Australia,2017,24601860.0
Austria,2015,8642699.0
Austria,2016,8736668.0
Austria,2017,8797566.0


In [31]:
df_pob_new.loc['Australia',:,:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Australia,2015,23815995.0
Australia,2016,24190907.0
Australia,2017,24601860.0
Australia,2018,24992369.0


In [32]:
df_pob_new.loc['Australia','2016':'2018',:]

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Australia,2016,24190907.0
Australia,2017,24601860.0
Australia,2018,24992369.0


## Extracting information with multiple indexes without using `.loc[]`

In [33]:
# General pattern: data_frame['column']['index_1']...['index_N']

df_pob_new[ 'poblacion' ] ['Australia'][:]

year
2015   23,815,995.00
2016   24,190,907.00
2017   24,601,860.00
2018   24,992,369.00
Name: poblacion, dtype: float64

In [34]:
df_pob_new['poblacion']['Australia']['2016':'2018']

year
2016   24,190,907.00
2017   24,601,860.00
2018   24,992,369.00
Name: poblacion, dtype: float64

## Applying operations to DataFrames with multiple indexes

In [35]:
df_pob_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Afghanistan,2015,34413603.00
Afghanistan,2016,35383128.00
Afghanistan,2017,36296400.00
Afghanistan,2018,37172386.00
Albania,2015,2880703.00
...,...,...
Zambia,2018,17351822.00
Zimbabwe,2015,13814629.00
Zimbabwe,2016,14030390.00
Zimbabwe,2017,14236745.00


In [36]:
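# Total population in each year
# Note: the data also contain aggregate rows (for example 'Arab World' or 'South Asia'),
# so this total is larger than the actual world population
# '.sum(level=...)' was removed in pandas 2.0; 'groupby(level=...)' is the equivalent
df_pob_new.groupby(level='year').sum()

# level='year' #<-- indicates that the sum is computed with respect to the index 'year'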

Unnamed: 0_level_0,poblacion
year,Unnamed: 1_level_1
2015,65679147019.0
2016,66487930677.0
2017,67294176701.0
2018,68087886692.0
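The per-level sum can be checked on a small example. The sketch below uses hypothetical toy data with plain string labels and `groupby(level=...)`, which works in current pandas versions.

```python
import pandas as pd

# Hypothetical toy data: two "countries", two years
df = pd.DataFrame(
    {'Country': ['A', 'A', 'B', 'B'],
     'year': ['2015', '2016', '2015', '2016'],
     'poblacion': [1.0, 2.0, 3.0, 4.0]}
).set_index(['Country', 'year'])

# Sum over all countries, one row per value of the 'year' level
totals = df.groupby(level='year').sum()

total_2016 = totals.loc['2016', 'poblacion']   # 2.0 + 4.0
```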


In [37]:
# Total population in the year 2016 (aggregate rows included):
df_pob_new.loc[:,'2016',:].sum()

poblacion   66,487,930,677.00
dtype: float64

In [38]:
df_pob_new.loc[:,'2016',:].sum().values[0]

66487930677.0

In [39]:
# Total population in the year 2016 (aggregate rows included):
df_pob_new['poblacion'][:,'2016'].sum()

66487930677.0

## Turning indexes into columns with `.unstack()`

In [40]:
df_pob_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Afghanistan,2015,34413603.00
Afghanistan,2016,35383128.00
Afghanistan,2017,36296400.00
Afghanistan,2018,37172386.00
Albania,2015,2880703.00
...,...,...
Zambia,2018,17351822.00
Zimbabwe,2015,13814629.00
Zimbabwe,2016,14030390.00
Zimbabwe,2017,14236745.00


In [41]:
# Move the index 'year' to the columns
df_pob_new.unstack('year')

Unnamed: 0_level_0,poblacion,poblacion,poblacion,poblacion
year,2015,2016,2017,2018
Country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Afghanistan,34413603.00,35383128.00,36296400.00,37172386.00
Albania,2880703.00,2876101.00,2873457.00,2866376.00
Algeria,39728025.00,40551404.00,41389198.00,42228429.00
American Samoa,55812.00,55741.00,55620.00,55465.00
Andorra,78011.00,77297.00,77001.00,77006.00
...,...,...,...,...
Virgin Islands (U.S.),107710.00,107510.00,107268.00,106977.00
West Bank and Gaza,4270092.00,4367088.00,4454805.00,4569087.00
"Yemen, Rep.",26497889.00,27168210.00,27834821.00,28498687.00
Zambia,15879361.00,16363507.00,16853688.00,17351822.00


In [42]:
# Move the index 'Country' to the columns
df_pob_new.unstack('Country')

Unnamed: 0_level_0,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion,poblacion
Country,Afghanistan,Albania,Algeria,American Samoa,Andorra,Angola,Antigua and Barbuda,Arab World,Argentina,Armenia,...,Uruguay,Uzbekistan,Vanuatu,"Venezuela, RB",Vietnam,Virgin Islands (U.S.),West Bank and Gaza,"Yemen, Rep.",Zambia,Zimbabwe
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2015,34413603.0,2880703.0,39728025.0,55812.0,78011.0,27884381.0,93566.0,396028278.0,43131966.0,2925553.0,...,3412009.0,31298900.0,271130.0,30081829.0,92677076.0,107710.0,4270092.0,26497889.0,15879361.0,13814629.0
2016,35383128.0,2876101.0,40551404.0,55741.0,77297.0,28842484.0,94527.0,404024433.0,43590368.0,2936146.0,...,3424132.0,31847900.0,278330.0,29846179.0,93638724.0,107510.0,4367088.0,27168210.0,16363507.0,14030390.0
2017,36296400.0,2873457.0,41389198.0,55620.0,77001.0,29816748.0,95426.0,411898965.0,44044811.0,2944809.0,...,3436646.0,32388600.0,285510.0,29390409.0,94596642.0,107268.0,4454805.0,27834821.0,16853688.0,14236745.0
2018,37172386.0,2866376.0,42228429.0,55465.0,77006.0,30809762.0,96286.0,419790588.0,44494502.0,2951776.0,...,3449299.0,32955400.0,292680.0,28870195.0,95540395.0,106977.0,4569087.0,28498687.0,17351822.0,14439018.0


In [43]:
# Extracting one column from a multi-level column index:
df_pob_new.unstack('Country')['poblacion'] [['Albania']]


Country,Albania
year,Unnamed: 1_level_1
2015,2880703.0
2016,2876101.0
2017,2873457.0
2018,2866376.0
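When only one value column is involved, unstacking the Series instead of the DataFrame avoids the extra 'poblacion' column level, so a single lookup is shorter. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with the same index level names as the notebook
df = pd.DataFrame(
    {'Country': ['A', 'A', 'B', 'B'],
     'year': ['2015', '2016', '2015', '2016'],
     'poblacion': [1.0, 2.0, 3.0, 4.0]}
).set_index(['Country', 'year'])

# Series.unstack: rows are countries, columns are years (a single column level)
wide = df['poblacion'].unstack('year')

value_b_2016 = wide.loc['B', '2016']
```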


## Extracting the labels of multiple indexes

In [44]:
df_pob_new

Unnamed: 0_level_0,Unnamed: 1_level_0,poblacion
Country,year,Unnamed: 2_level_1
Afghanistan,2015,34413603.00
Afghanistan,2016,35383128.00
Afghanistan,2017,36296400.00
Afghanistan,2018,37172386.00
Albania,2015,2880703.00
...,...,...
Zambia,2018,17351822.00
Zimbabwe,2015,13814629.00
Zimbabwe,2016,14030390.00
Zimbabwe,2017,14236745.00


In [45]:
# Extract the index values of a given level:
names_country = df_pob_new.index.get_level_values('Country')
names_country

CategoricalIndex(['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan',
                  'Albania', 'Albania', 'Albania', 'Albania', 'Algeria',
                  'Algeria',
                  ...
                  'Yemen, Rep.', 'Yemen, Rep.', 'Zambia', 'Zambia', 'Zambia',
                  'Zambia', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
                 categories=['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Arab World', ...], ordered=False, name='Country', dtype='category', length=1040)

In [46]:
# Extract the index values of a given level:
years = df_pob_new.index.get_level_values('year')
years

CategoricalIndex(['2015', '2016', '2017', '2018', '2015', '2016', '2017',
                  '2018', '2015', '2016',
                  ...
                  '2017', '2018', '2015', '2016', '2017', '2018', '2015',
                  '2016', '2017', '2018'],
                 categories=['2015', '2016', '2017', '2018'], ordered=False, name='year', dtype='category', length=1040)
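As the outputs above show, `.get_level_values()` returns one label per row, so labels repeat. The sketch below, on a hypothetical toy index, contrasts it with the level names (`.names`) and with the distinct labels of a level (`.unique()` or `.levels`).

```python
import pandas as pd

# Hypothetical toy index with the same level names as the notebook
index = pd.MultiIndex.from_product(
    [['Colombia', 'France'], ['2017', '2018']], names=['Country', 'year'])

level_names = list(index.names)                 # names of the levels

countries = index.get_level_values('Country')   # one label per row, with repeats
unique_countries = list(countries.unique())     # distinct labels, in order of appearance

year_labels = list(index.levels[1])             # distinct labels stored for level 1
```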