# Maldives Tourism
One of the best things about being a Dr Prepper is that people are always bringing interesting datasets to your attention. A little while ago, Tableau Zen Master Lorna Brown showed me a dataset with all kinds of information on tourism in the Maldives. This database has a lot of data on different Key Economic Indicators, but as you can imagine, it also has a bit of a quirky structure! For inspiration as to why we might want to clean this data up, check out Lorna's viz below:

1. Pivot all of the month fields into a single column
2. Rename the fields and ensure that each field has the correct data type
3. Filter out the nulls
4. Filter our dataset so our Values are referring to Number of Tourists
5. Remove all totals and subtotals from our dataset so that only the lowest level of granularity remains. Currently we have Total > Continents > Countries, but we don't have data for every country in a continent, so it's not as simple as just filtering out the totals and subtotals. Plus, the Continents level of detail also includes the Middle East and UN passport holders as categories. If you feel confident in your prep skills, this (plus the output) should be enough information to go on; otherwise, read on for a breakdown of the steps we need to take:
    * Filter out Total tourist arrivals
    * Split our workflow into 2 streams: Continents and Countries
    * Split out the Continent and Country names from the relevant fields
    * Aggregate our Country stream to the Continent level
    * Join the two streams together and work out how many tourist arrivals there are that we don't know the country of
    * Add in a Country field with the value "Unknown"
    * Union this back to where we had our Country breakdown
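
Before working on the real file, the sub-steps of step 5 can be sketched on a tiny example. The numbers below are made up purely for illustration (they are not from the Maldives dataset), and the `Known` column name is just a label chosen for this sketch. The key idea is that a right join onto the continent totals keeps continents with no country breakdown at all, so their whole total ends up as "Unknown".

```python
import pandas as pd

# Made-up numbers purely for illustration; not taken from the Maldives dataset
continents = pd.DataFrame({
    'Continent': ['Europe', 'Africa'],
    'Month': ['Jan-10', 'Jan-10'],
    'Tourists': [100, 20],
})
countries = pd.DataFrame({
    'Continent': ['Europe', 'Europe'],
    'Country': ['Germany', 'Italy'],
    'Month': ['Jan-10', 'Jan-10'],
    'Tourists': [30, 50],
})

# Aggregate the country stream up to the continent level
known = (
    countries.groupby(['Continent', 'Month'], as_index=False)['Tourists']
    .sum()
    .rename(columns={'Tourists': 'Known'})
)

# Right join keeps continents that have no country breakdown (Africa here)
unknown = known.merge(continents, on=['Continent', 'Month'], how='right')
unknown['Known'] = unknown['Known'].fillna(0)

# Whatever the countries don't account for is attributed to "Unknown"
unknown['Tourists'] = (unknown['Tourists'] - unknown['Known']).astype(int)
unknown['Country'] = 'Unknown'
unknown = unknown[['Continent', 'Country', 'Month', 'Tourists']]

# Union the remainder back onto the country breakdown
result = pd.concat([unknown, countries], ignore_index=True)
print(result)
```

In this toy case Europe's 100 arrivals split into 30 + 50 known and 20 unknown, while all 20 of Africa's arrivals are unknown, so summing `result` by continent gives back the original totals.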

In [55]:
import pandas as pd
import numpy as np
from pandasgui import show

In [67]:
# Pivot the month columns into rows, keep only tourist counts (excluding the grand total), and drop missing values
df_tourism = (
    pd.read_csv('Tourism Input.csv')
    .melt(
        id_vars=['id', 'Series-Measure', 'Hierarchy-Breakdown', 'Unit-Detail'],
        var_name='Month',
        value_name='Tourists',
    )
    .query('`Unit-Detail` == "Tourists" and `Series-Measure` != "Total tourist arrivals"')
    .drop(columns='Unit-Detail')
    .replace({'na': np.nan})
    .dropna()
)
df_tourism['Tourists'] = df_tourism['Tourists'].astype('int')
df_tourism

Unnamed: 0,id,Series-Measure,Hierarchy-Breakdown,Month,Tourists
8,1111,Tourist arrivals from Europe,Real Sector / Tourism / Tourist arrivals,Jan-10,51334
9,1112,Tourist arrivals from Asia,Real Sector / Tourism / Tourist arrivals,Jan-10,13243
10,1113,Tourist arrivals from Africa,Real Sector / Tourism / Tourist arrivals,Jan-10,350
11,1114,Tourist arrivals from Americas,Real Sector / Tourism / Tourist arrivals,Jan-10,1289
12,1115,Tourist arrivals from Oceania,Real Sector / Tourism / Tourist arrivals,Jan-10,703
...,...,...,...,...,...
3687,1122,Tourist arrivals from China,Real Sector / Tourism / Tourist arrivals / Asia,Dec-20,171
3688,1241,Tourist arrivals from India,Real Sector / Tourism / Tourist arrivals / Asia,Dec-20,18637
3693,1252,Tourist arrivals from France,Real Sector / Tourism / Tourist arrivals / Europe,Dec-20,3998
3694,1253,Tourist arrivals from Australia,Real Sector / Tourism / Tourist arrivals / Oce...,Dec-20,607
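
Step 2 asks for correct data types, and the Month column is still text such as `Jan-10`. A minimal sketch of one way to parse it, assuming every label is an English three-letter month abbreviation followed by a two-digit year (the sample values are taken from the output above; the variable names are only for this example):

```python
import pandas as pd

# Two month labels in the same shape as the Month column
months = pd.Series(['Jan-10', 'Dec-20'], name='Month')

# %b = abbreviated month name, %y = two-digit year; each value becomes the first day of that month
parsed = pd.to_datetime(months, format='%b-%y')
print(parsed)
```

Giving an explicit `format` avoids any guessing about whether `10` is a day or a year. On the real data this would be applied to `df_tourism['Month']`.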


In [90]:
df_continents = df_tourism.loc[df_tourism['Hierarchy-Breakdown'] == 'Real Sector / Tourism / Tourist arrivals'].copy()
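# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df_continents['Continent'] = df_continents['Series-Measure'].str.replace('Tourist arrivals from (the )?', '', regex=True)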
df_continents = df_continents[['Continent', 'Month', 'Tourists']]
df_continents

Unnamed: 0,Continent,Month,Tourists
8,Europe,Jan-10,51334
9,Asia,Jan-10,13243
10,Africa,Jan-10,350
11,Americas,Jan-10,1289
12,Oceania,Jan-10,703
...,...,...,...
3678,Africa,Dec-20,1996
3679,Americas,Dec-20,4929
3680,Oceania,Dec-20,728
3681,Middle East,Dec-20,4557
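
The continent names only come out clean if the prefix pattern is treated as a regular expression, because of the optional `(the )?` group. Whether `str.replace` does that by default depends on the pandas version (from pandas 2.0 the default is a literal match), so it is safer to state it explicitly. A small sketch with two illustrative labels; the second label is an assumption about how the Middle East row is worded in the input:

```python
import pandas as pd

# Illustrative labels; the exact wording of the Middle East row is assumed
labels = pd.Series([
    'Tourist arrivals from Europe',
    'Tourist arrivals from the Middle East',
])
pattern = r'^Tourist arrivals from (the )?'

# Literal matching looks for the pattern text itself, so nothing is replaced
literal = labels.str.replace(pattern, '', regex=False)

# Regex matching strips the prefix and the optional "the "
as_regex = labels.str.replace(pattern, '', regex=True)

print(literal.tolist())
print(as_regex.tolist())
```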


In [89]:
df_countries = df_tourism.loc[df_tourism['Hierarchy-Breakdown'] != 'Real Sector / Tourism / Tourist arrivals'].copy()
df_countries['Continent'] = df_countries['Hierarchy-Breakdown'].str.split(' / ').str[-1]
df_countries['Country'] = df_countries['Series-Measure'].str.extract('Tourist arrivals from (.*)')
df_countries = df_countries[['Continent', 'Country', 'Month', 'Tourists']]
df_countries

Unnamed: 0,Continent,Country,Month,Tourists
15,Europe,Germany,Jan-10,5890
16,Europe,Italy,Jan-10,12276
17,Europe,Russia,Jan-10,5873
18,Europe,the United Kingdom,Jan-10,8405
19,Asia,China,Jan-10,6069
...,...,...,...,...
3687,Asia,China,Dec-20,171
3688,Asia,India,Dec-20,18637
3693,Europe,France,Dec-20,3998
3694,Oceania,Australia,Dec-20,607
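
Note that the extract above keeps the article in "the United Kingdom". Whether the expected output wants it kept isn't stated here, so treat the following as an optional sketch: the same split-and-extract idea on two sample rows, with a non-capturing `(?:the )?` group that drops a leading "the ". The sample strings mirror the shapes seen in the outputs above.

```python
import pandas as pd

sample = pd.DataFrame({
    'Series-Measure': [
        'Tourist arrivals from Germany',
        'Tourist arrivals from the United Kingdom',
    ],
    'Hierarchy-Breakdown': [
        'Real Sector / Tourism / Tourist arrivals / Europe',
        'Real Sector / Tourism / Tourist arrivals / Europe',
    ],
})

# The continent is the last level of the hierarchy path
sample['Continent'] = sample['Hierarchy-Breakdown'].str.split(' / ').str[-1]

# expand=False returns a Series; (?:the )? skips an optional leading article
sample['Country'] = sample['Series-Measure'].str.extract(
    r'Tourist arrivals from (?:the )?(.*)', expand=False
)

print(sample[['Continent', 'Country']])
```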


In [97]:
# Sum the known countries per continent and month, then compare against the reported continent totals
df_unknown_countries = (
    df_countries.groupby(['Continent', 'Month'], as_index=False)['Tourists'].sum()
    .merge(df_continents, on=['Continent', 'Month'], how='right')
    .rename(columns={'Tourists_x': 'Assumed Tourists', 'Tourists_y': 'Actual Tourists'})
)
df_unknown_countries['Country'] = 'Unknown'
# Continents with no country breakdown have no country-level sum, so treat it as 0
df_unknown_countries.fillna(0, inplace=True)
df_unknown_countries['Tourists'] = df_unknown_countries['Actual Tourists'] - df_unknown_countries['Assumed Tourists']
df_unknown_countries = df_unknown_countries[['Continent', 'Country', 'Month', 'Tourists']]
df_unknown_countries

Unnamed: 0,Continent,Country,Month,Tourists
0,Europe,Unknown,Jan-10,11991.0
1,Asia,Unknown,Jan-10,5432.0
2,Africa,Unknown,Jan-10,350.0
3,Americas,Unknown,Jan-10,1289.0
4,Oceania,Unknown,Jan-10,703.0
...,...,...,...,...
875,Africa,Unknown,Dec-20,1996.0
876,Americas,Unknown,Dec-20,1924.0
877,Oceania,Unknown,Dec-20,121.0
878,Middle East,Unknown,Dec-20,4557.0


In [98]:
# Union the "Unknown" remainder back onto the country breakdown
df_output = pd.concat([df_unknown_countries, df_countries])
df_output

Unnamed: 0,Continent,Country,Month,Tourists
0,Europe,Unknown,Jan-10,11991.0
1,Asia,Unknown,Jan-10,5432.0
2,Africa,Unknown,Jan-10,350.0
3,Americas,Unknown,Jan-10,1289.0
4,Oceania,Unknown,Jan-10,703.0
...,...,...,...,...
3687,Asia,China,Dec-20,171.0
3688,Asia,India,Dec-20,18637.0
3693,Europe,France,Dec-20,3998.0
3694,Oceania,Australia,Dec-20,607.0
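
A quick sanity check is worth running on the final output: once the subtotals are replaced by an unknown-country remainder, summing the output by continent and month should give back the reported continent totals. The helper below is a hypothetical sketch (the `reconcile` name and the toy numbers are made up for illustration); on the real data it would be called with `df_output` and `df_continents`.

```python
import pandas as pd

def reconcile(output: pd.DataFrame, continents: pd.DataFrame) -> pd.DataFrame:
    """Return the continent/month rows where the output does not add back up to the reported total."""
    rebuilt = output.groupby(['Continent', 'Month'], as_index=False)['Tourists'].sum()
    check = rebuilt.merge(
        continents,
        on=['Continent', 'Month'],
        how='outer',  # also flags groups that exist on only one side
        suffixes=('_output', '_reported'),
    )
    return check[check['Tourists_output'] != check['Tourists_reported']]

# Toy data, made up for illustration
continents = pd.DataFrame({'Continent': ['Europe'], 'Month': ['Jan-10'], 'Tourists': [100]})
good = pd.DataFrame({
    'Continent': ['Europe'] * 3,
    'Country': ['Unknown', 'Germany', 'Italy'],
    'Month': ['Jan-10'] * 3,
    'Tourists': [20.0, 30.0, 50.0],
})
bad = good.assign(Tourists=[10.0, 30.0, 50.0])  # remainder is 10 short

print(reconcile(good, continents))  # no mismatches
print(reconcile(bad, continents))   # one mismatching row
```

It can also be useful to look for negative remainders, for example with `df_output[df_output['Tourists'] < 0]`, since those would mean the listed countries add up to more than their continent's reported total.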
