### Prepping Data Challenge: Maldives Tourism (week 12)

Our input is very wide this week, with 136 fields and only 28 rows. It covers tourism in the Maldives from 2010 to 2020. 

#### Requirements:
 
 1. Input the data
 2. Pivot all of the month fields into a single column 
 3. Rename the fields and ensure that each field has the correct data type
 4. Filter out the nulls 
 5. Filter our dataset so our Values are referring to Number of Tourists
 6. Our goal now is to remove all totals and subtotals from our dataset so that only the lowest level of granularity remains. Currently we have Total > Continents > Countries, but we don't have data for all countries in a continent, so it's not as simple as just filtering out the totals and subtotals. Plus in our Continents level of detail, we also have The Middle East and UN passport holders as categories. If you feel confident in your prep skills, this (plus the output) should be enough information to go on, but otherwise read on for a breakdown of the steps we need to take:
    - Filter out Total tourist arrivals
    - Split our workflow into 2 streams: Continents and Countries
    - Split out the Continent and Country names from the relevant fields 
    - Aggregate our Country stream to the Continent level 
    - Join the two streams together and work out how many tourist arrivals there are whose country we don't know
    - Add in a Country field with the value "Unknown" 
    - Union this back to where we had our Country breakdown 
 7. Output the data

### 1 - 3

In [1]:
#import libraries
import pandas as pd
pd.options.mode.chained_assignment = None  # setting this to None suppresses pandas' SettingWithCopyWarning
import numpy as np
import re

In [2]:
columns = ['Series-Measure','Hierarchy-Breakdown','Unit-Detail']
df = (
    pd.read_csv('WK12-Tourism.csv', na_values=['na'])
    .drop(columns=['id'])
    .melt(id_vars=columns)
    .rename(columns={'variable': 'Month'})
)

In [3]:
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
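
The pivot and date parsing above can be sketched on a tiny frame. This is only an illustration: the frame `wide`, its month column names, and its numbers are made up, and the real file may label its month columns differently. Note that `%b` matches abbreviated month names in the current locale, so this assumes an English locale.

```python
import pandas as pd

# Hypothetical wide input: one column per month (made-up values)
wide = pd.DataFrame({
    'Series-Measure': ['Total tourist arrivals', 'Average stay'],
    'Jan-10': [67478, 8.2],
    'Feb-10': [70000, 8.0],
})

# Pivot the month columns into rows; var_name names the new column directly,
# which avoids a separate rename of the default 'variable' column
long_df = wide.melt(id_vars=['Series-Measure'], var_name='Month')

# '%b-%y' parses an abbreviated month name followed by a two-digit year
long_df['Month'] = pd.to_datetime(long_df['Month'], format='%b-%y')
print(long_df)
```

Passing `var_name='Month'` to `melt` is an optional shortcut; the rename used in the cell above gives the same column.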

In [4]:
df.head()

|   | Series-Measure | Hierarchy-Breakdown | Unit-Detail | Month | value |
|---|---|---|---|---|---|
| 0 | Total tourist arrivals | Real Sector / Tourism | Tourists | 2010-01-01 | 67478.0 |
| 1 | Tourist bednights | Real Sector / Tourism | Bednights | 2010-01-01 | 552287.0 |
| 2 | Average stay | Real Sector / Tourism | Days | 2010-01-01 | 8.184697 |
| 3 | Operational bed capacity | Real Sector / Tourism | Beds | 2010-01-01 | 22825.0 |
| 4 | Bednight capacity | Real Sector / Tourism | Beds | 2010-01-01 | 707575.0 |


###  4 - 6

In [5]:
# keep non-null rows measured in Tourists, and filter out the 'Total tourist arrivals' grand total
df = df.loc[(df['value'].notna()) & (df['Unit-Detail'] == 'Tourists') & (df['Series-Measure'] != 'Total tourist arrivals')]

In [6]:
df['value'] = df['value'].astype(int)
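
The order of these two cells matters: the integer cast only works because the nulls were filtered out first. A minimal sketch with made-up values shows the failure, plus pandas' nullable `Int64` dtype as one alternative when missing values need to be kept.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
values = pd.Series([67478.0, np.nan, 5890.0])

# Casting floats that contain NaN to a plain int dtype raises an error
try:
    values.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Dropping the nulls first makes the cast safe
clean = values.dropna().astype(int)

# Alternative: the nullable 'Int64' dtype keeps missing values as <NA>
nullable = values.astype('Int64')
print(cast_failed, clean.tolist(), nullable.isna().sum())
```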

In [7]:
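# pattern-search helper based on WJ Sutton's method, which adapts the approach from
# https://stackoverflow.com/questions/17972938/check-if-a-string-in-a-pandas-dataframe-column-is-in-a-list-of-strings
# s_list is a regex alternation such as 'Europe|Asia'; returns the first match in s_str, or 'NA' if there is none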
def pattern(s_str:str, s_list:str):
    search_obj = re.search(s_list, s_str)
    if search_obj :
        return_str = s_str[search_obj.start(): search_obj.end()]
    else:
        return_str = 'NA'
    return return_str
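
For comparison, the same first-match lookup can be written without a helper using `Series.str.extract`. This is a sketch of an alternative, not the method used in this notebook; the `labels` values are hypothetical and the real `Series-Measure` strings may differ.

```python
import re
import pandas as pd

continents = ['Europe', 'Asia', 'Middle East']
# Escape each name so any regex metacharacters are treated literally
conti_pattern = '|'.join(re.escape(c) for c in continents)

# Hypothetical labels; the real Series-Measure values may differ
labels = pd.Series([
    'Tourist arrivals from Europe',
    'Tourist arrivals from Middle East',
    'Tourist bednights',
])

# The capture group returns the first match per row, NaN when nothing matches
found = labels.str.extract(f'({conti_pattern})', expand=False).fillna('NA')
print(found.tolist())  # ['Europe', 'Middle East', 'NA']
```

Either approach depends on the names being distinct enough that one does not accidentally match inside another label.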

In [8]:
# Split our workflow into 2 streams: Continents and Countries 
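# .copy() makes each stream an independent frame, so the column assignments below are unambiguous
df_continents = df.loc[df['Hierarchy-Breakdown'] ==  'Real Sector / Tourism / Tourist arrivals'].copy()
df_countries = df.loc[df['Hierarchy-Breakdown'] !=  'Real Sector / Tourism / Tourist arrivals'].copy()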

In [9]:
# Split out the Continent and Country names from the relevant fields 
continents = ['Europe','Asia','Africa','Americas','Oceania','Middle East','UN passport holders and others']
countries =['Germany','Italy','Russia','United Kingdom','China','India','France','Australia','United States']

In [10]:
conti_pattern = '|'.join(continents)
count_pattern = '|'.join(countries)
 
df_continents['Continent']  = df_continents['Series-Measure'].apply(lambda x: pattern(s_str=x, s_list=conti_pattern))
df_countries['Continent']  = df_countries['Hierarchy-Breakdown'].apply(lambda x: pattern(s_str=x, s_list=conti_pattern))
df_countries['Country']  = df_countries['Series-Measure'].apply(lambda x: pattern(s_str=x, s_list=count_pattern))

In [None]:
#join.head()

In [11]:
# Aggregate our Country stream to the Continent level 
country = df_countries.groupby(['Continent','Month'],as_index=False)['value'].sum()
country.columns = ['Continent','Month','Country_Values']

In [12]:
# Join the two streams together and work out how many tourist arrivals there are whose country we don't know
join = pd.merge(df_continents,country, on=['Continent','Month'], how='left')
# continents with no country rows get NaN from the left join; treat that as 0 known arrivals
join['difference'] = join['value'] - join['Country_Values'].fillna(0)
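
One thing to watch with the left join: a continent-level category that has no country rows (for example, one that is not in the `countries` list) gets `NaN` for the aggregated country value, and subtracting `NaN` would make its difference `NaN` as well. The sketch below uses made-up numbers and the hypothetical frames `totals` and `known` to show the effect; `indicator=True` is an optional way to see which rows found no match.

```python
import pandas as pd

# Made-up continent totals and aggregated country values
totals = pd.DataFrame({'Continent': ['Europe', 'Africa'], 'value': [100, 10]})
known = pd.DataFrame({'Continent': ['Europe'], 'Country_Values': [80]})

# indicator=True adds a _merge column showing which rows found a match
joined = totals.merge(known, on='Continent', how='left', indicator=True)

# Naive subtraction propagates NaN for the unmatched continent
joined['naive'] = joined['value'] - joined['Country_Values']

# Treating a missing country total as 0 keeps the full continent total as unknown
joined['difference'] = joined['value'] - joined['Country_Values'].fillna(0)
print(joined)
```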

In [13]:
# Add in a Country field with the value "Unknown" 
join['Country'] = 'Unknown'

In [14]:
# Union this back to where we had our Country breakdown 
df1 = df_countries[['Month','Continent','Country','value']]
df2 = join[['Month','Continent','Country','difference']]
df2.columns = ['Month','Continent','Country','value']

### 7. Output the data

In [15]:
output = pd.concat([df1,df2])
output.columns = ['Month','Breakdown','Country','Number of Tourists']
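
A quick sanity check for this kind of output, sketched on made-up numbers: once the "Unknown" rows are included, summing the output by continent should reproduce the continent totals. The frames `expected` and `toy_output` are hypothetical stand-ins, not the challenge data.

```python
import pandas as pd

# Made-up continent totals, indexed by continent
expected = pd.Series({'Europe': 100, 'Asia': 80})

# Made-up output rows, including the Unknown remainder for each continent
toy_output = pd.DataFrame({
    'Breakdown': ['Europe', 'Europe', 'Europe', 'Asia', 'Asia'],
    'Country': ['Germany', 'Italy', 'Unknown', 'China', 'Unknown'],
    'Number of Tourists': [30, 50, 20, 60, 20],
})

# Re-aggregate to the continent level and compare with the totals
rebuilt = toy_output.groupby('Breakdown')['Number of Tourists'].sum()
mismatch = rebuilt.sub(expected).abs().gt(0)  # sub aligns on the continent index
print(mismatch.any())
```

On the real data the same comparison would be grouped by both continent and month.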

In [16]:
output.to_csv('WK12-Tourist Output.csv', index=False)

In [17]:
output.head()

|    | Month | Breakdown | Country | Number of Tourists |
|---|---|---|---|---|
| 15 | 2010-01-01 | Europe | Germany | 5890.0 |
| 16 | 2010-01-01 | Europe | Italy | 12276.0 |
| 17 | 2010-01-01 | Europe | Russia | 5873.0 |
| 18 | 2010-01-01 | Europe | United Kingdom | 8405.0 |
| 19 | 2010-01-01 | Asia | China | 6069.0 |
