# Hands-On Data Analysis with Pandas

Here we practice what we learned in the other Pandas notebooks, applying those techniques to a set of tasks and analyses.

### About the Data

We will use the 2019 flight statistics from the United States Department of Transportation (available [here](https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FMF&QO_fu146_anzr=Nv4%20Pn44vr45), as the file `T100_MARKET_ALL_CARRIER.zip`). Each row contains information about a specific route for a given carrier in a given month (for example, JFK &rarr; LAX on Delta Airlines in January). The dataset has 321,409 rows and 41 columns. Note that you do not need to extract the zip file to read it with `pd.read_csv()`.
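
As a minimal sketch of that last point, the snippet below writes a tiny CSV into a zip archive and reads it back without extracting it. The file name `tiny_sample.zip` and its contents are made up for illustration. By default pandas infers the compression from the `.zip` extension, which works as long as the archive holds a single file.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature stand-in for the real download
csv_text = "PASSENGERS,ORIGIN,DEST\n120.0,JFK,LAX\n95.0,LAX,JFK\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = Path(tmp_dir) / "tiny_sample.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("tiny_sample.csv", csv_text)

    # compression='infer' is the default, so the .zip suffix is enough
    sample = pd.read_csv(zip_path)

print(sample.shape)
```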

#### Exercises

##### 1. Read the data and convert the column names to lowercase so they are easier to work with.

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv('DL_SelectFields.zip')
df.shape
df

Unnamed: 0,PASSENGERS,FREIGHT,MAIL,DISTANCE,UNIQUE_CARRIER,AIRLINE_ID,UNIQUE_CARRIER_NAME,UNIQUE_CARRIER_ENTITY,REGION,CARRIER,...,DEST_STATE_NM,DEST_COUNTRY,DEST_COUNTRY_NAME,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS,DATA_SOURCE
0,0.0,0.0,0.0,103.0,AAT,19650,Air Sunshine Inc.,30005,D,AAT,...,,VG,British Virgin Islands,282,2019,3,8,1,F,IU
1,0.0,0.0,0.0,364.0,GL,20402,Miami Air International,06690,D,GL,...,West Virginia,US,United States,39,2019,3,8,1,L,DU
2,0.0,0.0,0.0,2055.0,GL,20402,Miami Air International,06690,D,GL,...,Idaho,US,United States,83,2019,3,8,5,L,DU
3,0.0,0.0,0.0,1258.0,GL,20402,Miami Air International,06690,D,GL,...,Florida,US,United States,33,2019,3,8,3,L,DU
4,0.0,0.0,0.0,1632.0,GL,20402,Miami Air International,06690,D,GL,...,Florida,US,United States,33,2019,3,8,4,L,DU
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321383,96984.0,505308.0,15979.0,1448.0,AS,19930,Alaska Airlines Inc.,06031,D,AS,...,Washington,US,United States,93,2019,3,7,3,F,DU
321384,97011.0,473666.0,55963.0,1448.0,AS,19930,Alaska Airlines Inc.,06031,D,AS,...,Alaska,US,United States,1,2019,2,6,3,F,DU
321385,97098.0,592775.0,18834.0,1448.0,AS,19930,Alaska Airlines Inc.,06031,D,AS,...,Washington,US,United States,93,2019,3,8,3,F,DU
321386,97329.0,210292.0,5431.0,404.0,DL,19790,Delta Air Lines Inc.,01260,D,DL,...,Georgia,US,United States,34,2019,1,3,1,F,DU
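
The cell above loads the data but leaves the column names in upper case, and the remaining cells keep using the upper-case names. A minimal sketch of the renaming step the exercise asks for, on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up frame that mimics a few of the dataset's column names
toy = pd.DataFrame({
    "PASSENGERS": [0.0, 96984.0],
    "UNIQUE_CARRIER": ["GL", "AS"],
    "DEST_COUNTRY_NAME": ["United States", "United States"],
})

# rename returns a new frame, so the original upper-case frame is untouched
toy_lower = toy.rename(columns=str.lower)

print(toy_lower.columns.to_list())
```

Applying the same call to `df` would require switching the later cells to lower-case column names as well.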


##### 2. What are the columns in the dataset?

In [6]:
df.columns

Index(['PASSENGERS', 'FREIGHT', 'MAIL', 'DISTANCE', 'UNIQUE_CARRIER',
       'AIRLINE_ID', 'UNIQUE_CARRIER_NAME', 'UNIQUE_CARRIER_ENTITY', 'REGION',
       'CARRIER', 'CARRIER_NAME', 'CARRIER_GROUP', 'CARRIER_GROUP_NEW',
       'ORIGIN_AIRPORT_ID', 'ORIGIN_AIRPORT_SEQ_ID', 'ORIGIN_CITY_MARKET_ID',
       'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_FIPS',
       'ORIGIN_STATE_NM', 'ORIGIN_COUNTRY', 'ORIGIN_COUNTRY_NAME',
       'ORIGIN_WAC', 'DEST_AIRPORT_ID', 'DEST_AIRPORT_SEQ_ID',
       'DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME', 'DEST_STATE_ABR',
       'DEST_STATE_FIPS', 'DEST_STATE_NM', 'DEST_COUNTRY', 'DEST_COUNTRY_NAME',
       'DEST_WAC', 'YEAR', 'QUARTER', 'MONTH', 'DISTANCE_GROUP', 'CLASS',
       'DATA_SOURCE'],
      dtype='object')

##### 3. How many distinct carriers are in the dataset?

In [7]:
df.UNIQUE_CARRIER_NAME.nunique()

318

##### 4. Calculate the totals of the `freight`, `mail`, and `passengers` columns for flights from the United Kingdom to the United States.

In [8]:
df_q4 = df[['FREIGHT', 'MAIL', 'PASSENGERS','ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME']]\
    .query("ORIGIN_COUNTRY_NAME == 'United Kingdom' and DEST_COUNTRY_NAME == 'United States'")\
    .groupby(['ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME']).sum()

df_q4


Unnamed: 0_level_0,Unnamed: 1_level_0,FREIGHT,MAIL,PASSENGERS
ORIGIN_COUNTRY_NAME,DEST_COUNTRY_NAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
United Kingdom,United States,903296879.0,29838395.0,10685614.0


##### 5. Find the 5 carriers with the highest median route distance (that is, for all the origin-destination pairs each carrier serves, find the median distance after removing duplicates).

In [9]:
# Build a subset with the origin-destination pair so duplicate routes can be dropped
df_q5 = df[['DISTANCE', 'UNIQUE_CARRIER_NAME','ORIGIN', 'DEST']]
# Select DISTANCE explicitly so the median is not attempted on the text columns
df_q5_no_dup = df_q5.drop_duplicates().groupby('UNIQUE_CARRIER_NAME')[['DISTANCE']]\
    .median().nlargest(n=5, columns='DISTANCE', keep='first')

df_q5_no_dup


Unnamed: 0_level_0,DISTANCE
UNIQUE_CARRIER_NAME,Unnamed: 1_level_1
Singapore Airlines Ltd.,8068.0
Cathay Pacific Airways Ltd.,8020.0
Qantas Airways Ltd.,7886.0
Longtail Aviation Ltd.,7855.5
National Aviation Company of India Limited d/b/a Air India,7798.0


##### 6. Find the total cargo carried (mail + freight) and the mean distance traveled for the 10 carriers that carried the most cargo.

In [9]:
import numpy as np

df_q6 = df[['UNIQUE_CARRIER_NAME', 'FREIGHT', 'MAIL','DISTANCE']]\
    .assign(carga_total=lambda x: x.FREIGHT + x.MAIL)\
    .groupby('UNIQUE_CARRIER_NAME').agg(
        carga_total=('carga_total', 'sum'),
        distancia_media=('DISTANCE', 'mean')
    ).nlargest(10, columns='carga_total', keep='first')

df_q6

Unnamed: 0_level_0,carga_total,distancia_media
UNIQUE_CARRIER_NAME,Unnamed: 1_level_1,Unnamed: 2_level_1
Federal Express Corporation,12709660000.0,1121.887981
United Parcel Service,9173867000.0,1030.81265
Atlas Air Inc.,3356847000.0,1900.987202
United Air Lines Inc.,1577892000.0,1810.656058
American Airlines Inc.,1353074000.0,1583.185048
Kalitta Air LLC,1272180000.0,2519.580343
Polar Air Cargo Airways,1199386000.0,3111.794118
Delta Air Lines Inc.,1129524000.0,1612.688752
China Airlines Ltd.,837079900.0,5828.277778
Cathay Pacific Airways Ltd.,774177700.0,7498.685315


##### 7. Which 10 carriers carried the most passengers from the United States to another country?

In [11]:
df_q7 = df[['UNIQUE_CARRIER_NAME','PASSENGERS','ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME']]\
    .query("ORIGIN_COUNTRY_NAME == 'United States' and DEST_COUNTRY_NAME != 'United States'")\
    .groupby('UNIQUE_CARRIER_NAME')[['PASSENGERS']].sum().nlargest(10,'PASSENGERS')
df_q7

Unnamed: 0_level_0,PASSENGERS
UNIQUE_CARRIER_NAME,Unnamed: 1_level_1
American Airlines Inc.,14867653.0
United Air Lines Inc.,14427923.0
Delta Air Lines Inc.,13054230.0
JetBlue Airways,4522492.0
British Airways Plc,3758945.0
Lufthansa German Airlines,3123611.0
Westjet,2626600.0
Air Canada,2540855.0
Southwest Airlines Co.,2146960.0
Virgin Atlantic Airways,2074735.0


##### 8. For each carrier found in exercise 7, find the most popular destination country other than the United States.

In [27]:
transportadoras = df_q7.index.to_list()
df_q8 = df[['UNIQUE_CARRIER_NAME','PASSENGERS', 'DEST_COUNTRY_NAME']]\
    .loc[df['UNIQUE_CARRIER_NAME'].isin(transportadoras) \
        & (df['DEST_COUNTRY_NAME'] != 'United States')]\
            .groupby(['UNIQUE_CARRIER_NAME', 'DEST_COUNTRY_NAME']).sum()

df_q8.groupby(level=0).head(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,PASSENGERS
UNIQUE_CARRIER_NAME,DEST_COUNTRY_NAME,Unnamed: 2_level_1
Air Canada,Canada,2540855.0
American Airlines Inc.,Antigua and Barbuda,105815.0
British Airways Plc,United Kingdom,3758945.0
Delta Air Lines Inc.,Antigua and Barbuda,10393.0
JetBlue Airways,Antigua and Barbuda,24367.0
Lufthansa German Airlines,Canada,369.0
Southwest Airlines Co.,Aruba,83990.0
United Air Lines Inc.,Antigua and Barbuda,13579.0
Virgin Atlantic Airways,United Kingdom,2074735.0
Westjet,Canada,2626600.0
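
One caveat about the result above: `groupby(level=0).head(1)` returns the first row of each group in index order, and the grouped index is sorted alphabetically by country, which may be why `Antigua and Barbuda` appears for several carriers. A minimal sketch of one way to pick the destination with the most passengers instead, using made-up numbers (`toy_q8` is hypothetical):

```python
import pandas as pd

toy_q8 = pd.DataFrame({
    "UNIQUE_CARRIER_NAME": ["A", "A", "A", "B", "B"],
    "DEST_COUNTRY_NAME": ["Aruba", "Mexico", "Canada", "Canada", "Japan"],
    "PASSENGERS": [10.0, 500.0, 300.0, 40.0, 70.0],
})

totals = toy_q8.groupby(["UNIQUE_CARRIER_NAME", "DEST_COUNTRY_NAME"])[["PASSENGERS"]].sum()

# Sort by passengers first so head(1) keeps the largest row per carrier
top_destination = (
    totals.sort_values("PASSENGERS", ascending=False)
    .groupby(level=0)
    .head(1)
    .sort_index()
)
print(top_destination)
```

Without the `sort_values` step, carrier `A` would be paired with `Aruba` simply because it comes first alphabetically.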


##### 9. For each carrier found in exercise 7, find the total number of passengers that flew on international flights to or from the destinations in exercise 8 or the United States. Note that this dataset only has data for flights originating in and/or heading to the United States.

In [48]:
temp = df_q8.groupby(level=0).head(1).index.to_list()
destinos = list(set([x[1] for x in temp])) #list comprehension
origens = destinos[:]
origens.append('United States')

In [52]:
df_q9 = df[['UNIQUE_CARRIER_NAME','PASSENGERS','ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME']]\
    .loc[df['UNIQUE_CARRIER_NAME'].isin(transportadoras) \
        & (df['ORIGIN_COUNTRY_NAME'] != df['DEST_COUNTRY_NAME']) \
        & (df['ORIGIN_COUNTRY_NAME'].isin(origens) | df['DEST_COUNTRY_NAME'].isin(destinos))]\
        .groupby('UNIQUE_CARRIER_NAME')[['PASSENGERS']].sum()
df_q9

Unnamed: 0_level_0,PASSENGERS
UNIQUE_CARRIER_NAME,Unnamed: 1_level_1
Air Canada,5044089.0
American Airlines Inc.,17371773.0
British Airways Plc,7531344.0
Delta Air Lines Inc.,14647449.0
JetBlue Airways,4805860.0
Lufthansa German Airlines,3123611.0
Southwest Airlines Co.,2230477.0
United Air Lines Inc.,16916729.0
Virgin Atlantic Airways,4140090.0
Westjet,5273827.0


##### 10. Between which two cities did the most passengers travel? Remember to count both directions.

In [76]:
# To count both directions, build a column with the origin and destination
# city names in alphabetical order, then sum the duplicates this creates
df_q10 = df[['ORIGIN_CITY_NAME', 'DEST_CITY_NAME', 'PASSENGERS']]\
    .assign(route=lambda x: np.where(x.ORIGIN_CITY_NAME < x.DEST_CITY_NAME,
        x.ORIGIN_CITY_NAME + '-' + x.DEST_CITY_NAME,
        x.DEST_CITY_NAME + '-' + x.ORIGIN_CITY_NAME))\
    .groupby('route')[['PASSENGERS']].sum().nlargest(1, columns='PASSENGERS')

df_q10

Unnamed: 0_level_0,PASSENGERS
route,Unnamed: 1_level_1
"Chicago, IL-New York, NY",4131579.0

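The idea behind the route key can be checked on a small made-up frame. The sketch below uses `Series.where` in place of `np.where` to put the alphabetically smaller city first; `toy_q10` and its numbers are hypothetical.

```python
import pandas as pd

toy_q10 = pd.DataFrame({
    "ORIGIN_CITY_NAME": ["Chicago, IL", "New York, NY", "Boston, MA"],
    "DEST_CITY_NAME": ["New York, NY", "Chicago, IL", "Chicago, IL"],
    "PASSENGERS": [100.0, 80.0, 30.0],
})

origin, dest = toy_q10.ORIGIN_CITY_NAME, toy_q10.DEST_CITY_NAME

# Keep the alphabetically smaller city first so A-B and B-A share one key
first = origin.where(origin < dest, dest)
second = dest.where(origin < dest, origin)

by_route = toy_q10.assign(route=first + "-" + second).groupby("route")["PASSENGERS"].sum()
print(by_route)
```

Both Chicago/New York rows end up under the same key, so their passengers are summed together.
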

##### 11. Find the 3 carriers with the most flights to or from the pair of cities from exercise 10 and calculate the percentage of passengers accounted for by these cities.

In [108]:
rota = ['Chicago, IL', 'New York, NY']

df_q11 = df[['UNIQUE_CARRIER_NAME','ORIGIN_CITY_NAME', 'DEST_CITY_NAME', 'PASSENGERS']]\
    .loc[(df['ORIGIN_CITY_NAME'].isin(rota) | df['DEST_CITY_NAME'].isin(rota))]\
    .groupby('UNIQUE_CARRIER_NAME').agg(n_voos = ('ORIGIN_CITY_NAME', 'count'),
        soma = ('PASSENGERS', 'sum')).assign(
            percent = lambda x: round(100 * x.soma/x.soma.sum(),2)\
        ).nlargest(3, 'n_voos')

# Convert the percent column to a string so the percent sign is displayed
df_q11.percent = df_q11.percent.astype(str) + '%'
df_q11

Unnamed: 0_level_0,n_voos,soma,percent
UNIQUE_CARRIER_NAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SkyWest Airlines Inc.,3403,8042151.0,4.21%
United Air Lines Inc.,3121,26113000.0,13.66%
Envoy Air,2797,7498399.0,3.92%


##### 12. Find the percentage of international travel per country using the total passengers on class F flights.

In [106]:
df_q12 = df[['ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME', 'PASSENGERS', 'CLASS']]\
    .query("CLASS == 'F' and ORIGIN_COUNTRY_NAME != DEST_COUNTRY_NAME")\
    .groupby('ORIGIN_COUNTRY_NAME').agg(n_internacionais = ('CLASS', 'count'))\
    .assign(percent_viagens=lambda x: round(100 * x.n_internacionais / x.n_internacionais.sum(),2))

df_q12.percent_viagens = df_q12.percent_viagens.astype(str) + '%'
df_q12

Unnamed: 0_level_0,n_internacionais,percent_viagens
ORIGIN_COUNTRY_NAME,Unnamed: 1_level_1,Unnamed: 2_level_1
Anguilla,27,0.05%
Antigua and Barbuda,94,0.19%
Argentina,141,0.28%
Aruba,255,0.51%
Australia,284,0.57%
...,...,...
United Kingdom,1148,2.31%
United States,24704,49.62%
Uruguay,12,0.02%
Uzbekistan,12,0.02%
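
The cell above counts rows per origin country. Since the exercise asks for the share of total passengers, a variation that sums `PASSENGERS` instead might look like the sketch below; the frame `toy_q12` and its numbers are made up.

```python
import pandas as pd

toy_q12 = pd.DataFrame({
    "ORIGIN_COUNTRY_NAME": ["United States", "United States", "Canada", "Mexico", "United States"],
    "DEST_COUNTRY_NAME": ["Canada", "Mexico", "United States", "United States", "United States"],
    "PASSENGERS": [300.0, 100.0, 400.0, 200.0, 999.0],
    "CLASS": ["F", "F", "F", "F", "F"],
})

# The last row is domestic, so the filter drops it
international = toy_q12.query("CLASS == 'F' and ORIGIN_COUNTRY_NAME != DEST_COUNTRY_NAME")

# Share of class F international passengers by origin country, in percent
passenger_share = (
    international.groupby("ORIGIN_COUNTRY_NAME")["PASSENGERS"].sum()
    .pipe(lambda totals: 100 * totals / totals.sum())
    .round(2)
)
print(passenger_share)
```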


##### 13. Using a crosstab, find the percentage of total passengers on class F international flights between US cities and the countries found in question 12 for the carriers found in question 11.

In [32]:
transportadoras_q11 = df_q11.index.to_list()
temp_list = df_q12.index.to_list()
paises_q12 = temp_list[:] # list of countries from question 12
paises_q12.remove('United States') # drop the USA so only international destinations remain

df_q13 = df[['UNIQUE_CARRIER_NAME', 'ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME',
        'ORIGIN_CITY_NAME', 'PASSENGERS', 'CLASS']]\
        .loc[df['UNIQUE_CARRIER_NAME'].isin(transportadoras_q11) \
        & (df['ORIGIN_COUNTRY_NAME']== 'United States')\
        & df['DEST_COUNTRY_NAME'].isin(paises_q12)]

df_q13

Unnamed: 0,UNIQUE_CARRIER_NAME,ORIGIN_COUNTRY_NAME,DEST_COUNTRY_NAME,ORIGIN_CITY_NAME,PASSENGERS,CLASS
2490,United Air Lines Inc.,United States,Canada,"Baltimore, MD",0.0,F
2520,United Air Lines Inc.,United States,Canada,"New York, NY",0.0,F
2527,United Air Lines Inc.,United States,Canada,"Kansas City, MO",0.0,F
2532,United Air Lines Inc.,United States,Canada,"Harrisburg, PA",0.0,F
2988,United Air Lines Inc.,United States,Canada,"Kansas City, MO",0.0,F
...,...,...,...,...,...,...
316516,United Air Lines Inc.,United States,Mexico,"Houston, TX",27474.0,F
316994,United Air Lines Inc.,United States,Mexico,"Houston, TX",28693.0,F
317091,United Air Lines Inc.,United States,United Kingdom,"Newark, NJ",29006.0,F
318405,United Air Lines Inc.,United States,Mexico,"Houston, TX",33517.0,F


In [38]:
# optionally store the crosstab with its values expressed as percentages
cross_tab = 100 * pd.crosstab(index=df_q13.ORIGIN_CITY_NAME, columns=df_q13.DEST_COUNTRY_NAME,
    values=df_q13.PASSENGERS, aggfunc='sum', normalize=True) # relative to the grand total
cross_tab 

DEST_COUNTRY_NAME,Antigua and Barbuda,Argentina,Aruba,Australia,Belgium,Belize,Bermuda,"Bonaire, Sint Eustatius, and Saba",Brazil,Canada,...,South Korea,Spain,Sweden,Switzerland,Taiwan,The Bahamas,Trinidad and Tobago,Turks and Caicos Islands,United Kingdom,Uruguay
ORIGIN_CITY_NAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Albuquerque, NM",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000076,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000045,0.000000,0.00000,0.0
"Anchorage, AK",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0
"Atlanta, GA",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000006,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000147,0.000000,0.00000,0.0
"Austin, TX",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000032,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0
"Baltimore, MD",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000357,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Tampa, FL",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.002569,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0
"Tulsa, OK",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0
"Washington, DC",0.0,0.0,0.056255,0.0,0.526241,0.0,0.000006,0.0,0.397941,0.010881,...,0.0,0.340526,0.0,0.702496,0.0,0.009868,0.000000,0.020838,1.30815,0.0
"West Palm Beach/Palm Beach, FL",0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000070,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0

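With `normalize=True`, `pd.crosstab` divides every cell by the grand total, so the whole table above adds up to 100 after the multiplication. A small sketch with made-up data that contrasts this with `normalize='index'`, where each row adds up to 1 (`toy_q13` is hypothetical):

```python
import pandas as pd

toy_q13 = pd.DataFrame({
    "ORIGIN_CITY_NAME": ["Newark, NJ", "Newark, NJ", "Houston, TX", "Houston, TX"],
    "DEST_COUNTRY_NAME": ["Canada", "Mexico", "Mexico", "Canada"],
    "PASSENGERS": [10.0, 30.0, 50.0, 10.0],
})

# Arguments shared by both calls
common = dict(index=toy_q13.ORIGIN_CITY_NAME, columns=toy_q13.DEST_COUNTRY_NAME,
              values=toy_q13.PASSENGERS, aggfunc="sum")

share_of_total = pd.crosstab(normalize=True, **common)    # all cells sum to 1
share_of_row = pd.crosstab(normalize="index", **common)   # each row sums to 1

print(share_of_total)
print(share_of_row)
```

Which normalization fits depends on whether the question is about a share of all passengers or a share within each city.
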

##### 14. Create a pivot table showing the total passengers carried between cities in the United States and other countries by the carriers identified in exercise 7. Select the top 10 US cities and the top 10 international countries from that result.

In [63]:
df_q14 = df[['UNIQUE_CARRIER_NAME', 'ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME', 'ORIGIN_CITY_NAME',
        'PASSENGERS']].loc[df['UNIQUE_CARRIER_NAME'].isin(transportadoras) \
        & (df['ORIGIN_COUNTRY_NAME']== 'United States') & (df['DEST_COUNTRY_NAME'] != 'United States')]

pivot_table_q14 = df_q14.pivot_table(index='ORIGIN_CITY_NAME', columns='DEST_COUNTRY_NAME',
         values='PASSENGERS', aggfunc='sum', margins=True, margins_name='Total')



In [64]:
# top 10 cities (head(11) because the first entry is the Total margin)
pivot_table_q14.loc[:,'Total'].sort_values(ascending=False).head(11)


ORIGIN_CITY_NAME
Total                    63144004.0
New York, NY              8283535.0
Miami, FL                 6356686.0
Atlanta, GA               5276471.0
Newark, NJ                4669645.0
Los Angeles, CA           4306887.0
Houston, TX               3967596.0
San Francisco, CA         3488325.0
Chicago, IL               3366769.0
Dallas/Fort Worth, TX     3315762.0
Fort Lauderdale, FL       2065595.0
Name: Total, dtype: float64

In [65]:
# top 10 destination countries (head(11) because the first entry is the Total margin)
pivot_table_q14.loc['Total',:].sort_values(ascending=False).head(11)

DEST_COUNTRY_NAME
Total                 63144004.0
United Kingdom         9509531.0
Mexico                 7974133.0
Canada                 7644077.0
Germany                4945224.0
Dominican Republic     3339214.0
Japan                  2336393.0
Jamaica                1915561.0
Netherlands            1895959.0
France                 1705949.0
China                  1499616.0
Name: Total, dtype: float64
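
The two cells above rank the margins with `head(11)` so that the `Total` entry comes along. To subset the pivot table itself to the top cities and countries, one option is to drop `Total` before ranking and then index with the resulting labels. A minimal sketch on made-up data, taking the top 2 instead of the top 10 (`toy_q14` and `toy_pivot` are hypothetical):

```python
import pandas as pd

toy_q14 = pd.DataFrame({
    "ORIGIN_CITY_NAME": ["Miami, FL", "Miami, FL", "Newark, NJ", "Tulsa, OK", "Newark, NJ"],
    "DEST_COUNTRY_NAME": ["Mexico", "Canada", "Canada", "Mexico", "Japan"],
    "PASSENGERS": [50.0, 20.0, 40.0, 5.0, 10.0],
})

toy_pivot = toy_q14.pivot_table(index="ORIGIN_CITY_NAME", columns="DEST_COUNTRY_NAME",
                                values="PASSENGERS", aggfunc="sum",
                                margins=True, margins_name="Total")

# Rank the margins without the Total entry itself
top_cities = toy_pivot["Total"].drop("Total").nlargest(2).index
top_countries = toy_pivot.loc["Total"].drop("Total").nlargest(2).index

# Keep only the top rows and columns of the pivot table
top_block = toy_pivot.loc[top_cities, top_countries]
print(top_block)
```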

##### 15. For the top 15 international countries, find the percentage of class F passengers traveling to or from the top 10 US cities, for international travel (if only cities A, B, and C flew to Aruba, the Aruba row/column would sum to 1). Plot the results as a heatmap.

In [132]:
# work in progress: long-format table; the heatmap is attempted in the next cell

top10_cidades = df[['ORIGIN_CITY_NAME', 'PASSENGERS']].groupby('ORIGIN_CITY_NAME').sum().nlargest(10,'PASSENGERS').index.to_list()
top15_paises = df[['DEST_COUNTRY_NAME', 'PASSENGERS']].groupby('DEST_COUNTRY_NAME').sum().nlargest(15,'PASSENGERS').index.to_list()

df_q15 = df[['ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME', 'ORIGIN_CITY_NAME','DEST_CITY_NAME', 'PASSENGERS', 'CLASS']]\
        .loc[(df['ORIGIN_COUNTRY_NAME']!= df['DEST_COUNTRY_NAME']) & (df['ORIGIN_CITY_NAME'] != df['DEST_CITY_NAME'])\
        & (df['ORIGIN_COUNTRY_NAME'].isin(top15_paises) | df['DEST_COUNTRY_NAME'].isin(top15_paises))\
        & (df['ORIGIN_CITY_NAME'].isin(top10_cidades) | df['DEST_CITY_NAME'].isin(top10_cidades)) & (df['CLASS'] == 'F')]\
        .groupby(['ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME', 'ORIGIN_CITY_NAME','DEST_CITY_NAME']).agg(passenger = ('PASSENGERS', 'sum'))\
        .assign(percent = lambda x: round(100 * x.passenger / x.passenger.sum(),2))\
        .sort_values(by='percent', ascending=False)

porcentagens = df_q15.percent.values
df_q15.percent = df_q15.percent.astype(str) + '%'
print(df_q15.shape)
df_q15

(1466, 2)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,passenger,percent
ORIGIN_COUNTRY_NAME,DEST_COUNTRY_NAME,ORIGIN_CITY_NAME,DEST_CITY_NAME,Unnamed: 4_level_1,Unnamed: 5_level_1
United States,United Kingdom,"New York, NY","London, United Kingdom",1897943.0,1.36%
United Kingdom,United States,"London, United Kingdom","New York, NY",1883794.0,1.35%
United States,United Kingdom,"Los Angeles, CA","London, United Kingdom",921716.0,0.66%
France,United States,"Paris, France","New York, NY",890411.0,0.64%
United States,France,"New York, NY","Paris, France",900369.0,0.64%
...,...,...,...,...,...
The Bahamas,United States,"Nassau, The Bahamas","Denver, CO",3.0,0.0%
The Bahamas,United States,"Nassau, The Bahamas","Los Angeles, CA",11.0,0.0%
The Bahamas,United States,"Nassau, The Bahamas","San Francisco, CA",0.0,0.0%
The Bahamas,United States,"North Eleuthera, The Bahamas","Orlando, FL",1745.0,0.0%


In [133]:
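# Build the heatmap (based on tutorial 3)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# sns.heatmap expects a 2-D table, so pivot the long result into country x US city
long_q15 = df_q15.reset_index()
from_us = long_q15.ORIGIN_COUNTRY_NAME == 'United States'
matrix_q15 = long_q15.assign(
        country=np.where(from_us, long_q15.DEST_COUNTRY_NAME, long_q15.ORIGIN_COUNTRY_NAME),
        city=np.where(from_us, long_q15.ORIGIN_CITY_NAME, long_q15.DEST_CITY_NAME))\
    .loc[lambda x: x.city.isin(top10_cidades) & x.country.isin(top15_paises)]\
    .pivot_table(index='country', columns='city', values='passenger', aggfunc='sum', fill_value=0)
matrix_q15 = matrix_q15.div(matrix_q15.sum(axis=1), axis=0)  # each country row sums to 1

plt.figure(figsize=(12, 6))
ax = sns.heatmap(data=matrix_q15, cmap='Blues', annot=True, fmt='.2f', cbar_kws={'label': 'Share of class F passengers'})
ax.set_title('Class F International Passengers by Country and US City')
ax.set_ylabel('Country')
ax.set_xlabel('US city')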

In [80]:
df_q15.shape


(12014, 7)

# References

Original repository on GitHub: https://github.com/stefmolin/pandas-workshop