### 1. Define the Question and Understand it

### SOURCE  https://zindi.africa/hackathons/kenya-hack-tanzania-tourism-prediction-challenge/data
- **Here is a brief description of the question**
- Question Definition
    - The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.

    - Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

    - Tanzania’s tourist attractions include the Serengeti plains, which hosts the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park; among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures – and certainly the best wildlife photography safaris in the world.

### Problem Statement
- The objective of this task is to develop a machine learning model that predicts what a tourist will spend when visiting Tanzania. The model can be used by tour operators and the Tanzania Tourism Board to help tourists across the world estimate their expenditure before visiting Tanzania.

- **The main goal is to develop a machine learning model that accurately predicts what a tourist may spend when visiting Tanzania.**


### 2. Establish the type of ML and the metric to be used
- Since we are trying to predict a continuous value (cost/spending), our model will be a regression model.
- Type --> Regression
- Metric --> MAE (Mean Absolute Error)
- Other metrics for regression include:
    - Mean Squared Error (MSE).
    - Root Mean Squared Error (RMSE).
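
To make the difference between these metrics concrete, here is a minimal sketch that computes all three with NumPy. The `regression_errors` helper and the spending values are made up for illustration; they are not part of the competition data.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))   # average size of the miss
    mse = np.mean(residuals ** 2)      # squares punish large misses
    rmse = np.sqrt(mse)                # back in the units of the target
    return mae, mse, rmse

# Made-up spending values, purely illustrative
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 300.0, 500.0]

mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

The single large miss (100 units on the last row) pulls RMSE well above MAE, which is one reason MAE is the gentler choice when the target has a few very large values.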

### 3. Work Flow
- We will follow the following procedures
    <font color=red>
    <ol>
      <li>Understanding data given and some processing</li>
      <li>Transform data for modelling</li>
      <li>Feature Engineering If possible</li>
      <li>Split data for validation and training purpose (Model Evaluation)</li>
      <li>Develop baseline model</li>
      <li>Evaluation</li>
      <li>Test various models, develop ensembles, tune hyperparameters, etc.</li>
    </ol>
    </font>

In [2]:
# loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings
warnings.filterwarnings("ignore")
# some notebook params 
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
%matplotlib inline

## Processing

In [3]:
# read the data
train=pd.read_csv("Train.csv")
test=pd.read_csv("Test.csv")
sub=pd.read_csv("SampleSubmission.csv")
des=pd.read_csv("VariableDefinitions.csv")

In [4]:
train.shape , test.shape , sub.shape , des.shape

((4809, 23), (1601, 22), (1601, 2), (23, 2))

### For easier processing, I will merge the two datasets so they can be processed at once

In [5]:
#create a number to show the size of train set
train_size = train.shape[0]

In [6]:
data=pd.concat([train,test],sort=False).reset_index(drop=True)
# data.columns.tolist()
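
The merge-then-split pattern can be sketched on a toy example. The frames, column names, and the `long_stay` feature below are hypothetical stand-ins; the point is only that `pd.concat` keeps row order, so positional slicing recovers the two sets after shared preprocessing.

```python
import pandas as pd

# Toy frames standing in for Train.csv / Test.csv (values are made up)
toy_train = pd.DataFrame({"nights": [3, 7, 2], "total_cost": [100.0, 250.0, 80.0]})
toy_test = pd.DataFrame({"nights": [5, 1]})

n_train = len(toy_train)
combined = pd.concat([toy_train, toy_test], sort=False).reset_index(drop=True)

# Shared preprocessing happens once on the combined frame
combined["long_stay"] = (combined["nights"] >= 5).astype(int)

# Rows keep their order, so positional slicing recovers the two sets;
# the target is NaN on the test rows and is dropped there
train_part = combined.iloc[:n_train].copy()
test_part = combined.iloc[n_train:].drop(columns="total_cost")

print(train_part.shape, test_part.shape)
```

One caveat of this design: any statistic computed on the combined frame (a mean, a category list) also sees the test rows, so it is worth being deliberate about which steps run before the split.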

In [7]:
# preview
train.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,package_transport_int,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,No,No,No,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,No,No,No,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,No,No,No,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,No,Yes,Yes,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,No,No,No,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0


In [8]:
# check for null values
(data.isna().sum()/data.shape[0])*100

ID                        0.000000
country                   0.000000
age_group                 0.000000
travel_with              22.480499
total_female              0.062402
total_male                0.109204
purpose                   0.000000
main_activity             0.000000
info_source               0.000000
tour_arrangement          0.000000
package_transport_int     0.000000
package_accomodation      0.000000
package_food              0.000000
package_transport_tz      0.000000
package_sightseeing       0.000000
package_guided_tour       0.000000
package_insurance         0.000000
night_mainland            0.000000
night_zanzibar            0.000000
payment_mode              0.000000
first_trip_tz             0.000000
most_impressing           6.614665
total_cost               24.976599
dtype: float64

In [9]:
# fill nulls
# we assume that a null travel_with means the tourist travelled alone
data["travel_with"] = data["travel_with"].fillna("Alone")
# a null most_impressing is treated as "No comments" (the tourist may not have wanted to comment)
data["most_impressing"] = data["most_impressing"].fillna("No comments")
# fill the numeric head counts with the column mean
data["total_female"] = data["total_female"].fillna(data["total_female"].mean())
data["total_male"] = data["total_male"].fillna(data["total_male"].mean())
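
As a possible variation on the imputation above, the sketch below fills a numeric gap with a statistic computed from the training rows only, and uses the median rather than the mean. The toy frame and the `n_train` boundary are assumptions for illustration; this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Toy data: the first four rows play the role of training rows
toy = pd.DataFrame({
    "travel_with": ["Spouse", None, "Alone", "Children", None],
    "total_female": [1.0, np.nan, 0.0, 1.0, 2.0],
})
n_train = 4  # hypothetical train/test boundary

# Categorical gap: fill with an explicit category
toy["travel_with"] = toy["travel_with"].fillna("Alone")

# Numeric gap: take the statistic from training rows only, so nothing
# about the test rows leaks into the fill value. The median is less
# sensitive than the mean to a few unusually large groups.
fill_value = toy["total_female"].iloc[:n_train].median()
toy["total_female"] = toy["total_female"].fillna(fill_value)

print(fill_value)
print(toy)
```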

### Feature Engineering

In [10]:
# add total attendee and total night spent
data["total_persons"] = data["total_female"] + data["total_male"]
data["total_nights_spent"] = data["night_mainland"] + data["night_zanzibar"]

In [11]:
# Normalise the age_group labels so every value has the form "start-end".
# Assigning through data[mask]['age_group'] writes to a temporary copy and
# has no effect, so the labels are mapped directly on the column instead.
data["age_group"] = data["age_group"].replace({
    "64+": "64-100",
    "65+": "64-100",
    "24-Jan": "1-24",  # assumed to be "1-24" turned into a date by a spreadsheet
})

# get the start and end of each age range
data['start_age'] = data['age_group'].str.split("-").str.get(0).astype('int64')
data['end_age'] = data['age_group'].str.split("-").str.get(1).astype('int64')



In [12]:
#mean age 
data['age_mean'] = (data['start_age']+data['end_age'])/2
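
An alternative way to get the same three columns is a single regular expression, which avoids renaming the open-ended group first. This is only a sketch: the cap of 100 mirrors the notebook's choice, and the lower bound stays at 65 here (the notebook rewrites the group to start at 64), so the midpoint of the last group differs slightly.

```python
import pandas as pd

ages = pd.Series(["1-24", "25-44", "45-64", "65+"], name="age_group")

# Capture the lower bound and, when present, the upper bound
pattern = r"^(?P<start_age>\d+)(?:-(?P<end_age>\d+))?\+?$"
bounds = ages.str.extract(pattern).astype(float)

# Assumption: cap the open-ended group at 100
bounds["end_age"] = bounds["end_age"].fillna(100)
bounds["age_mean"] = (bounds["start_age"] + bounds["end_age"]) / 2

print(bounds)
```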

In [13]:
# Flag tourists who likely travelled during peak periods such as holiday months, when
# tourism services are relatively expensive due to high demand. The data has no travel
# date, so the 'Leisure and Holidays' purpose is used as a proxy.

data['peak_period'] = data['purpose'] == 'Leisure and Holidays'

In [14]:
# flag whether the tourist comes from an African country
# some African countries, spelled as they appear in the dataset
african = ['SOUTH AFRICA', 'NIGERIA', 'RWANDA', 'MOZAMBIQUE', 'KENYA', 'ALGERIA', 'EGYPT','MALAWI', 
           'UGANDA', 'ZIMBABWE', 'ZAMBIA', 'CONGO', 'MAURITIUS', 'DRC', 'TUNISIA', 'ETHIOPIA','BURUNDI',
           'GHANA', 'NIGER', 'COMORO', 'ANGOLA', 'SUDAN', 'NAMIBIA', 'LESOTHO', 'IVORY COAST', 'MADAGASCAR',
           'DJIBOUT', 'MORROCO', 'BOTSWANA', 'LIBERIA', 'GUINEA', 'SOMALI']

data['is_africa'] = data['country'].isin(african)

### NOTE
- We can also derive many more features from the textual columns, such as the comments.
- A person's comment may reflect factors that strongly affect the length of their stay and what they pay.

#### Transform string and categorical data to numerical

In [15]:
for x in  data.columns[data.dtypes == 'object']:
    print(f"{ x}   =>  {data[x].nunique()}")

ID   =>  6410
country   =>  118
age_group   =>  4
travel_with   =>  5
purpose   =>  7
main_activity   =>  9
info_source   =>  8
tour_arrangement   =>  2
package_transport_int   =>  2
package_accomodation   =>  2
package_food   =>  2
package_transport_tz   =>  2
package_sightseeing   =>  2
package_guided_tour   =>  2
package_insurance   =>  2
payment_mode   =>  4
first_trip_tz   =>  2
most_impressing   =>  7


In [16]:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# label-encode the ordered, binary, and high-cardinality columns
label_cols = ['age_group', 'package_transport_int', 'package_accomodation', 'package_food',
              'package_transport_tz', 'package_sightseeing', 'package_guided_tour',
              'package_insurance', 'first_trip_tz', 'country', 'peak_period', 'is_africa']
for col in label_cols:
    data[col] = le.fit_transform(data[col])

In [17]:
# use get dummies
columns_to_transform = ['tour_arrangement','travel_with','purpose','main_activity','info_source','most_impressing','payment_mode']
data = pd.get_dummies( data,columns = columns_to_transform,drop_first=True)
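
The two encodings used above behave differently, and a small sketch may help. Label encoding `country` gives each of the 118 countries an arbitrary integer, which a linear model reads as an order. One alternative (an assumption here, not something the notebook does) is frequency encoding. The toy frame below also shows what `drop_first=True` removes.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["KENYA", "CHINA", "KENYA", "ITALY", "KENYA", "CHINA"],
    "payment_mode": ["Cash", "Credit Card", "Cash", "Other", "Cash", "Cash"],
})

# Frequency encoding: replace each country with its share of the rows
country_share = toy["country"].value_counts(normalize=True)
toy["country_freq"] = toy["country"].map(country_share)

# One-hot encoding; drop_first=True drops the first category ("Cash"),
# which becomes the implicit baseline when all dummy columns are 0
encoded = pd.get_dummies(toy, columns=["payment_mode"], drop_first=True, dtype=int)

print(encoded)
```

Tree-based models such as CatBoost usually cope with label-encoded integers, so the choice matters most for the linear model.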

In [18]:
data.head()

Unnamed: 0,ID,country,age_group,total_female,total_male,package_transport_int,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,first_trip_tz,total_cost,total_persons,total_nights_spent,start_age,end_age,age_mean,peak_period,is_africa,tour_arrangement_Package Tour,travel_with_Children,...,purpose_Volunteering,main_activity_Bird watching,main_activity_Conference tourism,main_activity_Cultural tourism,main_activity_Diving and Sport Fishing,main_activity_Hunting tourism,main_activity_Mountain climbing,main_activity_Wildlife tourism,main_activity_business,"info_source_Newspaper, magazines,brochures","info_source_Radio, TV, Web",info_source_Tanzania Mission Abroad,info_source_Trade fair,"info_source_Travel, agent, tour operator",info_source_inflight magazines,info_source_others,most_impressing_Excellent Experience,most_impressing_Friendly People,most_impressing_Good service,most_impressing_No comments,most_impressing_Satisfies and Hope Come Back,"most_impressing_Wonderful Country, Landscape, Nature",payment_mode_Credit Card,payment_mode_Other,payment_mode_Travellers Cheque
0,tour_0,101,2,1.0,1.0,0,0,0,0,0,0,0,13.0,0.0,0,674602.5,2.0,13.0,45,64,54.5,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,tour_10,111,1,1.0,0.0,0,0,0,0,0,0,0,14.0,7.0,1,3214906.5,1.0,21.0,25,44,34.5,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
2,tour_1000,111,1,0.0,1.0,0,0,0,0,0,0,0,1.0,31.0,0,3315000.0,1.0,32.0,25,44,34.5,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,tour_1002,111,1,1.0,1.0,0,1,1,1,1,1,0,11.0,0.0,1,7790250.0,2.0,11.0,25,44,34.5,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0
4,tour_1004,17,0,1.0,0.0,0,0,0,0,0,0,0,7.0,4.0,1,1657500.0,1.0,11.0,1,24,12.5,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0


In [19]:
# check if all data are numerical
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6410 entries, 0 to 6409
Data columns (total 58 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   ID                                                    6410 non-null   object 
 1   country                                               6410 non-null   int64  
 2   age_group                                             6410 non-null   int64  
 3   total_female                                          6410 non-null   float64
 4   total_male                                            6410 non-null   float64
 5   package_transport_int                                 6410 non-null   int64  
 6   package_accomodation                                  6410 non-null   int64  
 7   package_food                                          6410 non-null   int64  
 8   package_transport_tz                                  6410

In [20]:
## Only ID is still an object column; it is not useful for modelling, so it will be dropped

In [21]:
# get back our training and test datasets

# drop id
data.drop("ID", inplace=True, axis=1)
# slice by position; the rows are still in their original order
# (you could also split on total_cost, since it is null in test)
train = data[:train_size].copy()
test = data[train_size:].copy()

# drop cost
test.drop('total_cost', inplace=True, axis=1)

In [22]:
train.shape , test.shape

((4809, 57), (1601, 56))

In [23]:
# Modelling
X = train.drop(columns=["total_cost"])
cols = X.columns
label = train["total_cost"]

In [24]:
# split for train and val
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score

# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(X,label, test_size=0.25, random_state = 2021)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

(3606, 56) (3606,)
(1203, 56) (1203,)


## MODELLING

- Normally you only need to know how a model works at a high level.
- You do not have to know how it is implemented; the background is enough.
- Some of the most useful steps in modelling:
    - Initialize the model object
    - Train the object using Object.fit(features, y)
    - After training, perform some predictions and score the model
    
    - I will train just two different models:
        - A boosting model (CatBoost)
        - A plain LinearRegression model
        
    - We will then try improving the score (MAE), for example by averaging the predictions of different models
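
The initialize, fit, predict, score steps look the same for almost any scikit-learn style model. The sketch below wraps them in a hypothetical `fit_and_score` helper and runs it on synthetic data (not the tourism data), comparing a median-predicting `DummyRegressor` baseline with `LinearRegression`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: cost grows with two features, plus noise
rng = np.random.default_rng(2021)
features = rng.uniform(1, 20, size=(400, 2))
target = (50_000 * features[:, 0] + 120_000 * features[:, 1]
          + rng.normal(0, 100_000, 400))

X_tr, X_val, y_tr, y_val = train_test_split(
    features, target, test_size=0.25, random_state=2021)

def fit_and_score(model):
    """Fit on the training split and return the validation MAE."""
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

# A model that always predicts the training median is the bar to beat
baseline_mae = fit_and_score(DummyRegressor(strategy="median"))
linear_mae = fit_and_score(LinearRegression())

print(baseline_mae, linear_mae)
```

A trivial baseline like this gives the MAE numbers some context: a model is only useful if it clearly beats always guessing the median.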

In [25]:
# import the models needed
from sklearn.linear_model import LinearRegression
from catboost import CatBoostRegressor
# import metric function

from sklearn.metrics import mean_absolute_error

In [28]:
# modelling for catboost
cat = CatBoostRegressor(depth=6, iterations=100, learning_rate=0.05, verbose=False, random_state=2021)
cat.fit(X_train, y_train)
y_predc = cat.predict(X_test)
maec = mean_absolute_error(y_test, y_predc)
print('CatBoost Error {}'.format(maec))

CatBoost Error 5095677.102269491


In [29]:
# modelling for LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_predl = lin_reg.predict(X_test)
mael = mean_absolute_error(y_test, y_predl)
print('LinearRegression Error {}'.format(mael))

LinearRegression Error 5849084.523206729


### Observation
- The boosting model performs better than LinearRegression (lower MAE).
- We can therefore explore more boosting models and compare them.
- We can also tune the models' parameters and check how the results change.

In [None]:
### TRY TUNING THE DEPTH PARAM

In [40]:
# train one CatBoost model per tree depth and average their predictions
predictions = []
for i in range(10):
    # Training the model
    cat = CatBoostRegressor(
        iterations=1500,
        # n_estimators=(i*100),
        loss_function='MAE',
        logging_level='Silent',
        depth=i,
        # verbose=False
        random_state=2021,
    )
    cat.fit(X_train, y_train)

    # Making predictions
    predc = cat.predict(X_test)
    predictions.append(predc)

# Averaging the predictions
predc = np.mean(predictions, axis=0)

In [41]:
# check score of the averaged predictions
maec = mean_absolute_error(y_test, predc)
print('Error {}'.format(maec))

Error 4801392.176224123
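
The averaging step itself is simple array arithmetic. Here is a minimal sketch with made-up predictions from three hypothetical models, showing how `axis=0` averages across models for each sample.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Hypothetical predictions from three different models
preds = [
    np.array([12.0, 18.0, 33.0, 44.0]),
    np.array([8.0, 23.0, 28.0, 37.0]),
    np.array([11.0, 19.0, 31.0, 42.0]),
]

stacked = np.vstack(preds)       # shape (n_models, n_samples)
blended = stacked.mean(axis=0)   # average across models, per sample

mae_each = np.abs(stacked - y_true).mean(axis=1)  # one MAE per model
mae_blend = np.abs(blended - y_true).mean()

print(mae_each, mae_blend)
```

Because the models' errors partly cancel, the blended MAE can never exceed the average of the individual MAEs, although it can still be worse than the single best model.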


### TRY OUT XGBOOST MODEL WITH PARAM TUNING

In [None]:
def BestParamSnippet():
    import warnings
    warnings.filterwarnings("ignore")

    # load libraries
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    # reuses the X_train / y_train split created earlier
    model = XGBRegressor()
    # candidate values to search over; XGBoost calls the tree-depth parameter
    # max_depth and the number of boosting rounds n_estimators
    parameters = {'max_depth'     : [6, 8, 10],
                  'learning_rate' : [0.01, 0.05, 0.1],
                  'n_estimators'  : [150, 200, 400, 500],
                 }
    # score with negated MAE so the search optimises the metric chosen above
    grid = GridSearchCV(estimator=model, param_grid=parameters, cv=2, n_jobs=-1,
                        scoring='neg_mean_absolute_error')
    grid.fit(X_train, y_train)

    # Results from Grid Search
    print("\n========================================================")
    print(" Results from Grid Search ")
    print("========================================================")

    print("\n The best estimator across ALL searched params:\n",
          grid.best_estimator_)

    print("\n The best score (negated MAE) across ALL searched params:\n",
          grid.best_score_)

    print("\n The best parameters across ALL searched params:\n",
          grid.best_params_)

    print("\n ========================================================")
BestParamSnippet()
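
For a quick feel of how a grid search reports MAE, the sketch below runs one on synthetic data. It swaps in scikit-learn's `GradientBoostingRegressor` so that it runs without XGBoost installed; the data, the grid values, and the variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, only to make the example runnable
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, 200)

param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximises, so MAE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

# Flip the sign back to read the score as an ordinary MAE
best_cv_mae = -search.best_score_
print(search.best_params_, best_cv_mae)
```

The number of fits grows as the product of the grid sizes times the number of folds, so it pays to keep each list short.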





### KFOLD VALIDATION

In [None]:
from sklearn.model_selection import KFold

error = []
pred_test = np.zeros(len(test))

# use four folds
fold = KFold(n_splits=4)

for train_index, test_index in fold.split(X, label):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = label.iloc[train_index], label.iloc[test_index]
    cat = CatBoostRegressor(verbose=False, random_seed=1234, use_best_model=True, loss_function='MAE', n_estimators=1000, learning_rate=0.05)
    cat.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])
    preds = cat.predict(X_test)
    fold_mae = mean_absolute_error(y_test, preds)
    print(f"ERROR   {fold_mae}")
    error.append(fold_mae)
    # accumulate this fold's predictions on the real test set
    pred_test += cat.predict(test)

# average the accumulated test predictions over the folds
pred_test /= fold.get_n_splits()
np.mean(error)
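
The same K-fold pattern can be tried quickly on synthetic data. This sketch uses `Ridge` in place of CatBoost so that it runs in a fraction of a second; the arrays and names are hypothetical. It records one validation MAE per fold and averages each fold's predictions on an unseen set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1234)
X_all = rng.normal(size=(120, 4))
y_all = X_all @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.3, 120)
X_unseen = rng.normal(size=(30, 4))  # plays the role of the test set

kf = KFold(n_splits=4, shuffle=True, random_state=1234)
fold_errors = []
unseen_pred = np.zeros(len(X_unseen))

for tr_idx, val_idx in kf.split(X_all):
    model = Ridge(alpha=1.0).fit(X_all[tr_idx], y_all[tr_idx])
    val_pred = model.predict(X_all[val_idx])
    fold_errors.append(mean_absolute_error(y_all[val_idx], val_pred))
    # Each fold adds an equal share, so the running sum ends as the average
    unseen_pred += model.predict(X_unseen) / kf.get_n_splits()

cv_mae = float(np.mean(fold_errors))
print(fold_errors, cv_mae)
```

Shuffling before splitting is a design choice made here; it matters when the rows are stored in some meaningful order.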

## THE END....

### MORE TIPS

-  You can try building your own boosting/ensemble models with different algorithms
-  Test tree algorithms, e.g. DecisionTreeRegressor and RandomForestRegressor
-  Check whether deep learning models perform well on this data
-  Do more feature engineering and check purpose
-  Try removing some features and check the effect on the model
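
As a starting point for the tree-model and feature-removal tips, here is a hedged sketch on synthetic data where only two of four features carry signal. The feature names and data are invented; on the real data you would pass `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X_syn = rng.uniform(0, 1, size=(500, 4))
# Only the first two columns influence the target
y_syn = 10 * X_syn[:, 0] + 5 * X_syn[:, 1] ** 2 + rng.normal(0, 0.2, 500)
names = ["f0", "f1", "noise_a", "noise_b"]

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=7)

tree = DecisionTreeRegressor(max_depth=5, random_state=7).fit(X_a, y_a)
forest = RandomForestRegressor(n_estimators=100, random_state=7).fit(X_a, y_a)

tree_mae = mean_absolute_error(y_b, tree.predict(X_b))
forest_mae = mean_absolute_error(y_b, forest.predict(X_b))

# Rank features by the forest's impurity-based importance
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

print(tree_mae, forest_mae)
print(ranked)
```

Features that sit at the bottom of such a ranking are reasonable first candidates to drop, with the validation MAE checked before and after.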



### I hope this helps. In case of any query, reach me on **0705698768**