# GUIDED PRACTICE 2 - Pivot Tables

* A pivot table is an operation similar to a `GroupBy`, common in spreadsheets and other programs that work with tabular data.
* A pivot table takes one or more columns as input and groups the entries into a two-dimensional table that summarizes the data (usually through an aggregation).
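
As a minimal sketch of the idea, the example below builds a pivot table from a small made-up table. The `sales` DataFrame and its column names are hypothetical and are not part of the Titanic data used later.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [10, 20, 30, 50, 40],
})

# Rows come from `region`, columns from `product`;
# each cell holds the mean `amount` of the matching rows.
summary = sales.pivot_table(values="amount", index="region",
                            columns="product", aggfunc="mean")
print(summary)
```

The two South/A rows (30 and 50) collapse into a single cell, which is the aggregation step that distinguishes a pivot table from a plain reshape.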

## Using pivot tables

* We will use the Titanic passenger dataset.
* It contains sociodemographic information about the ship's passengers (including sex, age, ticket class, etc.).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')

In [2]:
titanic.head()

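| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |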


In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


## Pivot tables by hand

* Let's start by grouping passengers by gender, by survival status, and so on.
* This can be done with a ``GroupBy``. For example, here is the proportion of survivors by sex:

In [4]:
titanic.groupby(['sex'])[['survived']].mean()

| sex | survived |
|---|---|
| female | 0.742038 |
| male | 0.188908 |


* It appears that about 3 in 4 women survived, while the proportion is much lower among men (about 1 in 5).
* Let's see what happens when we look at survival by sex and class:
    - we group by sex and class, select the `survived` column, apply an aggregation (the mean), and combine the resulting groups.

In [5]:
survived_sex_class = pd.DataFrame(titanic.groupby(['sex', 'class'])['survived'].aggregate('mean'))
survived_sex_class

| sex | class | survived |
|---|---|---|
| female | First | 0.968085 |
| female | Second | 0.921053 |
| female | Third | 0.5 |
| male | First | 0.368852 |
| male | Second | 0.157407 |
| male | Third | 0.135447 |


#### The [aggregate()](https://pandas.pydata.org/pandas-docs/version/0.21.0/generated/pandas.DataFrame.aggregate.html#pandas-dataframe-aggregate) function aggregates data using a callable, a string, a `dict`, or a `list` of these.

* This gives a better idea of how gender and class affect the chances of survival. The problem is that the code starts to look cluttered and hard to read.
* This kind of ``GroupBy`` is very common in Pandas, which is why there is a ``pivot_table`` method that handles this type of multidimensional aggregation concisely.
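
To make the correspondence concrete, here is a small sketch on made-up data (the `df` rows are hypothetical stand-ins, not real Titanic records). It builds the two-dimensional layout by hand with `groupby` plus `unstack`, and then with `pivot_table`.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows (values are made up).
df = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "male", "female"],
    "class":    ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 1, 0, 1, 1],
})

# Manual route: group on two keys, aggregate, then move `class` to the columns.
by_hand = df.groupby(["sex", "class"])["survived"].mean().unstack("class")

# Same layout requested directly from pivot_table.
pivoted = df.pivot_table(values="survived", index="sex",
                         columns="class", aggfunc="mean")

print(by_hand)
print(pivoted)
```

Both tables should hold the same numbers; the `pivot_table` call just states the row key, the column key, and the aggregation in one place.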

## Pivot table syntax

* Here is the equivalent of the previous operation using the ``pivot_table`` method:

#### The [pivot_table()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) function creates a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.

In [6]:
titanic.pivot_table(values='survived', index = 'sex', columns = 'class', aggfunc = 'mean')

| sex | First | Second | Third |
|---|---|---|---|
| female | 0.968085 | 0.921053 | 0.5 |
| male | 0.368852 | 0.157407 | 0.135447 |


* This is clearly more readable than the earlier ``groupby``.
* As expected, the chances of survival were higher (for both men and women) for passengers in a higher class.
* Almost all first-class women survived (presumably reflecting the maxim "women and children - with money - first").

* Sometimes it is useful to compute the totals along each group (here, the overall survival rate per row and per column); this can be done with ``margins``:

In [7]:
def mean_round(x):
    return round(np.mean(x),2)

titanic.pivot_table(values = 'survived', 
                    index = 'sex', 
                    columns = 'class', 
                    aggfunc = mean_round,
                    margins = True, 
                    margins_name = 'Total')

| sex | First | Second | Third | Total |
|---|---|---|---|---|
| female | 0.97 | 0.92 | 0.5 | 0.74 |
| male | 0.37 | 0.16 | 0.14 | 0.19 |
| Total | 0.63 | 0.47 | 0.24 | 0.38 |
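
One detail worth checking on a small case is how the margin cells are obtained. The sketch below uses made-up rows (the `df` data is hypothetical) and suggests that the margin applies the aggregation to all underlying rows of the group, rather than averaging the cell values already in the table.

```python
import pandas as pd

# Hypothetical rows: three women (two in First, one in Third) and two men.
df = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male"],
    "class":    ["First", "First", "Third", "First", "Third"],
    "survived": [1, 1, 0, 0, 1],
})

table = df.pivot_table(values="survived", index="sex", columns="class",
                       aggfunc="mean", margins=True, margins_name="Total")
print(table)

# The female cells are 1.0 (First) and 0.0 (Third). Their plain average would
# be 0.5, but the margin uses all three rows: (1 + 1 + 0) / 3.
print(table.loc["female", "Total"])
```

In other words, with `aggfunc='mean'` the margin is weighted by group size, which is usually what is wanted for a survival rate.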


### Multi-level pivot tables

* As with ``GroupBy``, the grouping in a pivot table can be specified with multiple levels.


In [8]:
titanic.pivot_table(values='survived', 
                    index = ['sex', 'class'], 
                    columns = 'embark_town',
                    aggfunc = 'mean')

| sex | class | Cherbourg | Queenstown | Southampton |
|---|---|---|---|---|
| female | First | 0.976744 | 1.0 | 0.958333 |
| female | Second | 1.0 | 1.0 | 0.910448 |
| female | Third | 0.652174 | 0.727273 | 0.375 |
| male | First | 0.404762 | 0.0 | 0.35443 |
| male | Second | 0.2 | 0.0 | 0.154639 |
| male | Third | 0.232558 | 0.076923 | 0.128302 |


#### The [cut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html?highlight=cut#pandas-cut) function segments and sorts data values into bins; it is also used to convert continuous variables into categorical ones, such as the age ranges used below.
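
A short sketch of how `pd.cut` assigns values to bins, using a made-up `ages` Series (hypothetical values, chosen to sit on and beyond the bin edges). It may help when reading the two counts in the cells that follow, which presumably differ because the oldest passenger (age 80) falls outside a last edge of 79.5.

```python
import pandas as pd

# Hypothetical ages, including one on a bin edge (10) and one past the last edge (80).
ages = pd.Series([4, 10, 15, 35, 80])

# Intervals are right-inclusive by default: (0, 10], (10, 20.5], (20.5, 79.5].
binned = pd.cut(ages, [0, 10, 20.5, 79.5])
print(binned)

# Values outside every interval become NaN, so they drop out of any
# pivot table that uses these bins as a grouping key.
print(binned.isna().sum())
```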

In [46]:
titanic['age'].max()

80.0

In [50]:
list(range(0,16,5))

[0, 5, 10, 15]

In [51]:
# Bin edges spaced 6 years apart (note: the variable name means "five years")
cinco_anos = list(range(0,int(titanic['age'].max())+6,6))
cinco_anos

[0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84]

In [52]:
# Count survivors (non-zero `survived` values) in each cell, then add up the whole table
age = pd.cut(titanic['age'], [0, 10, 20.5, 30.5, 100])
titanic.pivot_table(values = 'survived',
                    index = ['sex', 'class'], 
                    columns = age, 
                    aggfunc=np.count_nonzero).sum().sum()

290

In [53]:
# Same count with the last edge at 79.5: ages above it fall outside every bin
age = pd.cut(titanic['age'], [0, 10, 20.5, 30.5, 79.5])
titanic.pivot_table(values = 'survived',
                    index = ['sex', 'class'], 
                    columns = age, 
                    aggfunc=np.count_nonzero).sum().sum()

289

#### The same strategy works for the columns: here we add information about the fare paid, using `pd.qcut` to compute the quantiles automatically:

In [54]:
fare = pd.qcut(titanic['fare'], 4)
titanic.pivot_table(values = 'survived', 
                    index = ['sex', age], 
                    columns = [fare, 'class'],
                    aggfunc = 'mean')

| sex | age | (-0.001, 7.91] / First | (-0.001, 7.91] / Third | (7.91, 14.454] / Second | (7.91, 14.454] / Third | (14.454, 31.0] / First | (14.454, 31.0] / Second | (14.454, 31.0] / Third | (31.0, 512.329] / First | (31.0, 512.329] / Second | (31.0, 512.329] / Third |
|---|---|---|---|---|---|---|---|---|---|---|---|
| female | (0.0, 10.0] | NaN | NaN | NaN | 0.8 | NaN | 1.0 | 0.5 | 0.0 | 1.0 | 0.2 |
| female | (10.0, 20.5] | NaN | 0.7 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | 1.0 | NaN | 0.0 |
| female | (20.5, 30.5] | NaN | 0.636364 | 0.916667 | 0.357143 | 1.0 | 0.857143 | 0.571429 | 0.95 | 1.0 | 0.0 |
| female | (30.5, 79.5] | NaN | 0.0 | 0.846154 | 0.4 | 0.8 | 0.928571 | 0.4 | 1.0 | 1.0 | 0.2 |
| male | (0.0, 10.0] | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 0.363636 | 1.0 | 1.0 | 0.125 |
| male | (10.0, 20.5] | NaN | 0.043478 | 0.142857 | 0.238095 | NaN | 0.0 | 0.166667 | 0.4 | 0.0 | 0.0 |
| male | (20.5, 30.5] | NaN | 0.160714 | 0.0 | 0.103448 | 0.6 | 0.0 | 0.1 | 0.428571 | 0.0 | 0.5 |
| male | (30.5, 79.5] | 0.0 | 0.027027 | 0.153846 | 0.2 | 0.461538 | 0.055556 | 0.0 | 0.318182 | 0.0 | 0.666667 |


#### The [qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html?highlight=qcut#pandas-qcut) function discretizes a variable into equal-sized buckets (each holding roughly the same number of observations), based on rank or on sample quantiles.
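
The difference between `cut` and `qcut` is easiest to see on skewed data. The sketch below uses a made-up `fares` Series (hypothetical values with one large outlier) and compares how many rows land in each of four bins.

```python
import pandas as pd

# Hypothetical, strongly skewed values (one large outlier).
fares = pd.Series([5, 7, 8, 10, 15, 30, 60, 500])

# cut: four equal-width intervals, so most values pile up in the first one.
equal_width = pd.cut(fares, 4)

# qcut: edges placed at the sample quartiles, so the bins hold similar row counts.
equal_count = pd.qcut(fares, 4)

print(equal_width.value_counts().sort_index().tolist())
print(equal_count.value_counts().sort_index().tolist())
```

For a variable like `fare`, where a few tickets cost far more than the rest, quantile-based bins tend to give pivot-table cells with more comparable sample sizes.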

* Dictionary of aggregation functions:

In [55]:
titanic.pivot_table(index = ['sex', age], 
                    columns = [fare, 'class'],
                    aggfunc = {'survived':'mean'})

Values: `survived`

| sex | age | (-0.001, 7.91] / First | (-0.001, 7.91] / Third | (7.91, 14.454] / Second | (7.91, 14.454] / Third | (14.454, 31.0] / First | (14.454, 31.0] / Second | (14.454, 31.0] / Third | (31.0, 512.329] / First | (31.0, 512.329] / Second | (31.0, 512.329] / Third |
|---|---|---|---|---|---|---|---|---|---|---|---|
| female | (0.0, 10.0] | NaN | NaN | NaN | 0.8 | NaN | 1.0 | 0.5 | 0.0 | 1.0 | 0.2 |
| female | (10.0, 20.5] | NaN | 0.7 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | 1.0 | NaN | 0.0 |
| female | (20.5, 30.5] | NaN | 0.636364 | 0.916667 | 0.357143 | 1.0 | 0.857143 | 0.571429 | 0.95 | 1.0 | 0.0 |
| female | (30.5, 79.5] | NaN | 0.0 | 0.846154 | 0.4 | 0.8 | 0.928571 | 0.4 | 1.0 | 1.0 | 0.2 |
| male | (0.0, 10.0] | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 0.363636 | 1.0 | 1.0 | 0.125 |
| male | (10.0, 20.5] | NaN | 0.043478 | 0.142857 | 0.238095 | NaN | 0.0 | 0.166667 | 0.4 | 0.0 | 0.0 |
| male | (20.5, 30.5] | NaN | 0.160714 | 0.0 | 0.103448 | 0.6 | 0.0 | 0.1 | 0.428571 | 0.0 | 0.5 |
| male | (30.5, 79.5] | 0.0 | 0.027027 | 0.153846 | 0.2 | 0.461538 | 0.055556 | 0.0 | 0.318182 | 0.0 | 0.666667 |


* In this case, ``values`` is omitted: when a mapping is given for ``aggfunc``, `values` is determined automatically from its keys.
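
The mapping form becomes more useful with several value columns. As a sketch on made-up rows (the `df` data is hypothetical), each key of the dictionary names a column to aggregate and each value names the aggregation applied to it.

```python
import pandas as pd

# Hypothetical rows with two numeric columns to summarize.
df = pd.DataFrame({
    "sex":      ["female", "female", "male", "male"],
    "class":    ["First", "Third", "First", "Third"],
    "survived": [1, 0, 0, 1],
    "fare":     [80.0, 8.0, 50.0, 7.0],
})

# No `values` argument: the dict keys select the columns,
# and each column gets its own aggregation.
table = df.pivot_table(index="sex", columns="class",
                       aggfunc={"survived": "sum", "fare": "mean"})
print(table)
```

The result has a two-level column index, with the aggregated column name on the outer level and `class` on the inner level.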

### Additional pivot table options

* The full list of arguments of the ``pivot_table`` method is as follows:

```python
# call signature as of Pandas 0.18
DataFrame.pivot_table(data, values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None, margins=False,
                      dropna=True, margins_name='All')
```

* We have seen examples of the first three arguments.
* ``fill_value`` and ``dropna`` deal with missing data and offer a relatively simple way to handle it (we will return to these examples later).

* ``aggfunc`` controls the type of aggregation applied (a mean by default).
* As with ``GroupBy``, the aggregation can be given as a string naming one of several common options (``'sum'``, ``'mean'``, ``'count'``, ``'min'``, ``'max'``, etc.) or as a function that implements an aggregation (for example, ``np.sum``, ``min``, ``sum``, etc.).
* It can also be given as a dictionary that maps a column to an operation:

In [56]:
titanic.pivot_table(index = ['sex', age], 
                    columns = [fare, 'class'],
                    aggfunc = {'survived':mean_round}, 
                    fill_value = 0)

Values: `survived`

| sex | age | (-0.001, 7.91] / First | (-0.001, 7.91] / Third | (7.91, 14.454] / Second | (7.91, 14.454] / Third | (14.454, 31.0] / First | (14.454, 31.0] / Second | (14.454, 31.0] / Third | (31.0, 512.329] / First | (31.0, 512.329] / Second | (31.0, 512.329] / Third |
|---|---|---|---|---|---|---|---|---|---|---|---|
| female | (0.0, 10.0] | 0 | 0.0 | 0.0 | 0.8 | 0.0 | 1.0 | 0.5 | 0.0 | 1 | 0.2 |
| female | (10.0, 20.5] | 0 | 0.7 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | 1.0 | 0 | 0.0 |
| female | (20.5, 30.5] | 0 | 0.64 | 0.92 | 0.36 | 1.0 | 0.86 | 0.57 | 0.95 | 1 | 0.0 |
| female | (30.5, 79.5] | 0 | 0.0 | 0.85 | 0.4 | 0.8 | 0.93 | 0.4 | 1.0 | 1 | 0.2 |
| male | (0.0, 10.0] | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.36 | 1.0 | 1 | 0.12 |
| male | (10.0, 20.5] | 0 | 0.04 | 0.14 | 0.24 | 0.0 | 0.0 | 0.17 | 0.4 | 0 | 0.0 |
| male | (20.5, 30.5] | 0 | 0.16 | 0.0 | 0.1 | 0.6 | 0.0 | 0.1 | 0.43 | 0 | 0.5 |
| male | (30.5, 79.5] | 0 | 0.03 | 0.15 | 0.2 | 0.46 | 0.06 | 0.0 | 0.32 | 0 | 0.67 |


In [57]:
titanic.pivot_table(index = ['sex', age], 
                    columns = [fare, 'class'],
                    aggfunc = {'survived':mean_round}, 
                    fill_value = round(titanic['survived'].mean(),2))

Values: `survived`

| sex | age | (-0.001, 7.91] / First | (-0.001, 7.91] / Third | (7.91, 14.454] / Second | (7.91, 14.454] / Third | (14.454, 31.0] / First | (14.454, 31.0] / Second | (14.454, 31.0] / Third | (31.0, 512.329] / First | (31.0, 512.329] / Second | (31.0, 512.329] / Third |
|---|---|---|---|---|---|---|---|---|---|---|---|
| female | (0.0, 10.0] | 0.38 | 0.38 | 0.38 | 0.8 | 0.38 | 1.0 | 0.5 | 0.0 | 1.0 | 0.2 |
| female | (10.0, 20.5] | 0.38 | 0.7 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | 1.0 | 0.38 | 0.0 |
| female | (20.5, 30.5] | 0.38 | 0.64 | 0.92 | 0.36 | 1.0 | 0.86 | 0.57 | 0.95 | 1.0 | 0.0 |
| female | (30.5, 79.5] | 0.38 | 0.0 | 0.85 | 0.4 | 0.8 | 0.93 | 0.4 | 1.0 | 1.0 | 0.2 |
| male | (0.0, 10.0] | 0.38 | 0.38 | 0.38 | 1.0 | 0.38 | 1.0 | 0.36 | 1.0 | 1.0 | 0.12 |
| male | (10.0, 20.5] | 0.38 | 0.04 | 0.14 | 0.24 | 0.38 | 0.0 | 0.17 | 0.4 | 0.0 | 0.0 |
| male | (20.5, 30.5] | 0.38 | 0.16 | 0.0 | 0.1 | 0.6 | 0.0 | 0.1 | 0.43 | 0.0 | 0.5 |
| male | (30.5, 79.5] | 0.0 | 0.03 | 0.15 | 0.2 | 0.46 | 0.06 | 0.0 | 0.32 | 0.0 | 0.67 |
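
A small sketch of what `fill_value` does, on made-up rows (the `df` data is hypothetical, and the placeholder `-1` is chosen only so that filled cells are easy to spot). Combinations with no rows produce `NaN` by default, and `fill_value` replaces those cells in the result.

```python
import pandas as pd

# Hypothetical rows: there is no male passenger in Third class.
df = pd.DataFrame({
    "sex":      ["female", "female", "male"],
    "class":    ["First", "Third", "First"],
    "survived": [1, 0, 0],
})

# Default behavior: the empty male/Third cell is NaN.
default = df.pivot_table(values="survived", index="sex",
                         columns="class", aggfunc="mean")

# fill_value substitutes a placeholder in the empty cells of the result.
filled = df.pivot_table(values="survived", index="sex",
                        columns="class", aggfunc="mean", fill_value=-1)

print(default)
print(filled)
```

Note that filling with `0`, as in the first of the two cells above, makes an empty group look the same as a group where nobody survived, so the choice of placeholder deserves some care.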
