# AutoGluon Classification Model

#### Data: Tourism Nova Scotia
#### Time: 2006 to 2024
#### Data Source: https://data.novascotia.ca/Business-and-Industry/Tourism-Nova-Scotia-Visitation/n783-4gmh/about_data 

## Import necessary libraries

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor
# TabularDataset: loads tabular data
# TabularPredictor: trains models and makes predictions

from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

## Load the dataset

In [2]:
data_location = './Tourism_Nova_Scotia_Visitation_20250202_UPDATED.csv'
data = TabularDataset(data_location)
data.head()

| | Mode of entry | Month/Year | Visitor Origin | Country | Number of Visitors (Rounded to nearest hundred) |
| --- | --- | --- | --- | --- | --- |
| 0 | Air | January 2006 | Atlantic Canada | Canada | 5400.0 |
| 1 | Air | January 2006 | Quebec | Canada | 3400.0 |
| 2 | Air | January 2006 | Ontario | Canada | 16600.0 |
| 3 | Air | January 2006 | Western Canada | Canada | 7000.0 |
| 4 | Air | January 2006 | New England (inc Maine) | United States | 800.0 |


### Data Dictionary

 - **Mode of entry** - How the visitor entered Nova Scotia [Air, Road]. This is the prediction target.
 - **Month/Year** - The month and year of visitor entry.
 - **Visitor Origin** - The region, province, or city the visitor is from.
 - **Country** - The country the visitor is from.
 - **Number of Visitors (Rounded to nearest hundred)** - Count of visitors.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7182 entries, 0 to 7181
Data columns (total 5 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   Mode of entry                                    7182 non-null   object 
 1   Month/Year                                       7182 non-null   object 
 2   Visitor Origin                                   7182 non-null   object 
 3   Country                                          7182 non-null   object 
 4   Number of Visitors (Rounded to nearest hundred)  6282 non-null   float64
dtypes: float64(1), object(4)
memory usage: 280.7+ KB
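
`data.info()` reports `Month/Year` as an `object` column, so values such as `January 2006` are plain strings. Below is a minimal sketch of how such a string could be split into numeric year and month parts with the standard library. It assumes every value follows the `<month name> <year>` pattern seen in the first rows; the helper name `parse_month_year` is hypothetical and not part of the notebook.

```python
from datetime import datetime

def parse_month_year(text):
    """Parse a string like 'January 2006' into a (year, month) tuple."""
    # %B matches the full English month name, %Y the four-digit year
    parsed = datetime.strptime(text.strip(), "%B %Y")
    return parsed.year, parsed.month

print(parse_month_year("January 2006"))
```

Numeric parts like these would let a model pick up seasonal patterns that an opaque string might hide, though whether that helps here is untested.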


## Perform EDA

In [4]:
# Class percentages of 'Mode of entry' before the train/test split, for reference

data['Mode of entry'].value_counts()/data.shape[0] * 100

Mode of entry
Air     53.258145
Road    46.741855
Name: count, dtype: float64

## Train Test Split 

In [5]:
# scikit-learn's train_test_split: random 70/30 split, stratified on the label so both subsets keep the same class proportions

train_df, test_df = train_test_split(data, test_size=0.3, random_state=42, stratify=data['Mode of entry'])

In [6]:
# Percentage of 'Mode of entry' column values after train-test-split

train_df['Mode of entry'].value_counts()/train_df.shape[0] * 100

Mode of entry
Air     53.252437
Road    46.747563
Name: count, dtype: float64

In [7]:
test_df['Mode of entry'].value_counts()/test_df.shape[0] * 100

Mode of entry
Air     53.271462
Road    46.728538
Name: count, dtype: float64
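
The three outputs show that `stratify=data['Mode of entry']` keeps the Air/Road ratio nearly identical across the full data, the training set, and the test set. To illustrate the idea, here is a hand-rolled sketch of a stratified split using only the standard library. It is not scikit-learn's implementation, and the 530/470 label counts are made-up round numbers chosen to resemble the roughly 53/47 balance above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split row indices so each class sends about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # randomize within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def air_share(labels, indices):
    """Fraction of the selected rows labelled 'Air'."""
    return sum(labels[i] == "Air" for i in indices) / len(indices)

labels = ["Air"] * 530 + ["Road"] * 470  # hypothetical counts
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)
print(len(train_idx), len(test_idx))
print(air_share(labels, train_idx), air_share(labels, test_idx))
```

Because each class is split separately, both subsets end up with the same Air share as the full list, which is the behaviour the percentages above confirm for the real data.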

In [8]:
# AutoGluon’s TabularPredictor --> to predict the values of a target column based on the other columns in a tabular dataset 

predictor = TabularPredictor(label='Mode of entry', path='mode_of_entry_predictors')



## Monitoring the Models

In [9]:
predictor.fit(train_data=train_df)  # optionally pass presets='best_quality'

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.11
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
CPU Count:          8
Memory Avail:       4.87 GB / 15.71 GB (31.0%)
Disk Space Avail:   83.87 GB / 237.83 GB (35.3%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets='good'         : Good accuracy with very fast inference speed.
	presets='medium'

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x270fe889810>

In [10]:
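# NOTE: train_df is passed here, so the `score_test` column below is accuracy on the training data, not on the held-out test set
predictor.leaderboard(data=train_df)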

If you only need to load model weights and optimizer state, use the safe `Learner.load` instead.
  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | RandomForestEntr | 0.931371 | 0.717694 | accuracy | 0.182778 | 0.036668 | 0.658815 | 0.182778 | 0.036668 | 0.658815 | 1 | True | 6 |
| 1 | RandomForestGini | 0.931172 | 0.715706 | accuracy | 0.192076 | 0.035663 | 0.596371 | 0.192076 | 0.035663 | 0.596371 | 1 | True | 5 |
| 2 | ExtraTreesEntr | 0.921424 | 0.61829 | accuracy | 0.25699 | 0.081784 | 0.934145 | 0.25699 | 0.081784 | 0.934145 | 1 | True | 9 |
| 3 | LightGBMLarge | 0.921026 | 0.809145 | accuracy | 0.056999 | 0.002019 | 1.116314 | 0.056999 | 0.002019 | 1.116314 | 1 | True | 13 |
| 4 | ExtraTreesGini | 0.920629 | 0.610338 | accuracy | 0.273969 | 0.061447 | 0.89188 | 0.273969 | 0.061447 | 0.89188 | 1 | True | 8 |
| 5 | XGBoost | 0.862542 | 0.815109 | accuracy | 0.072481 | 0.007733 | 0.682661 | 0.072481 | 0.007733 | 0.682661 | 1 | True | 11 |
| 6 | NeuralNetTorch | 0.850209 | 0.82505 | accuracy | 0.048184 | 0.010048 | 20.496666 | 0.048184 | 0.010048 | 20.496666 | 1 | True | 12 |
| 7 | LightGBM | 0.849413 | 0.829026 | accuracy | 0.032328 | 0.0 | 0.400152 | 0.032328 | 0.0 | 0.400152 | 1 | True | 4 |
| 8 | WeightedEnsemble_L2 | 0.846827 | 0.83499 | accuracy | 0.068027 | 0.015733 | 15.790571 | 0.020066 | 0.0 | 0.110964 | 2 | True | 14 |
| 9 | CatBoost | 0.82972 | 0.829026 | accuracy | 0.015634 | 0.015733 | 15.279455 | 0.015634 | 0.015733 | 15.279455 | 1 | True | 7 |
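
Because the leaderboard call receives `train_df`, its `score_test` column measures how well each model fits rows it has already seen, while `score_val` comes from the validation data AutoGluon held out during fitting. The small sketch below uses the numbers copied from the leaderboard output and ranks the models by the gap between the two; a large gap is a common sign of overfitting. The variable names are illustrative.

```python
# (accuracy on train_df, score_val) copied from the leaderboard output
scores = {
    "RandomForestEntr": (0.931371, 0.717694),
    "RandomForestGini": (0.931172, 0.715706),
    "ExtraTreesEntr": (0.921424, 0.61829),
    "LightGBMLarge": (0.921026, 0.809145),
    "ExtraTreesGini": (0.920629, 0.610338),
    "XGBoost": (0.862542, 0.815109),
    "NeuralNetTorch": (0.850209, 0.82505),
    "LightGBM": (0.849413, 0.829026),
    "WeightedEnsemble_L2": (0.846827, 0.83499),
    "CatBoost": (0.82972, 0.829026),
}

# Gap between fit on seen data and validation accuracy, per model
gaps = {model: round(seen - val, 6) for model, (seen, val) in scores.items()}

for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<20} gap = {gap:.3f}")
```

The tree-bagging models (random forests and extra trees) show the largest gaps, while CatBoost and the weighted ensemble show almost none, so ranking by `score_val` gives a more trustworthy ordering than ranking by the first column.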


In [11]:
# Print a summary of the fit; show_plot=False skips generating the summary plot

predictor.fit_summary(show_plot=False)


*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val eval_metric  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.834990    accuracy       0.015733  15.790571                0.000000           0.110964            2       True         14
1              LightGBM   0.829026    accuracy       0.000000   0.400152                0.000000           0.400152            1       True          4
2              CatBoost   0.829026    accuracy       0.015733  15.279455                0.015733          15.279455            1       True          7
3        NeuralNetTorch   0.825050    accuracy       0.010048  20.496666                0.010048          20.496666            1       True         12
4               XGBoost   0.815109    accuracy       0.007733   0.682661                0.007733           0.682661            1       True         11
5         LightGBMLarge   0.8091

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.558648111332008,
  'KNeighborsDist': 0.5745526838966203,
  'LightGBMXT': 0.7753479125248509,
  'LightGBM': 0.8290258449304175,
  'RandomForestGini': 0.7157057654075547,
  'RandomForestEntr': 0.7176938369781312,
  'CatBoost': 0.8290258449304175,
  'ExtraTreesGini': 0.610337972166998,
  'ExtraTreesEntr': 0.6182902584493042,
  'NeuralNetFastAI': 0.7713717693836978,
  'XGBoost': 0.8151093439363817,
  'NeuralNetTorch': 0.8250497017892644,
  'LightGBMLarge': 0.

In [13]:
# validate the model against unseen data

y_test = test_df["Mode of entry"]
test_data = test_df.drop(columns=["Mode of entry"])

In [14]:
y_pred = predictor.predict(test_data)

In [15]:
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

In [16]:
metrics

{'accuracy': 0.8129930394431555,
 'balanced_accuracy': 0.8106949091550781,
 'mcc': 0.6239071784230497,
 'f1': 0.794910941475827,
 'precision': 0.8152400835073069,
 'recall': 0.7755710029791459}
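
All six metrics can be derived from the four cells of a binary confusion matrix. The sketch below uses counts that were back-calculated from the reported values and the 2155-row test set; they are inferred, not printed by the notebook, and the function name `binary_metrics` is illustrative. "Positive" here means whichever class AutoGluon treated as the positive class.

```python
from math import sqrt

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
    }

# Inferred counts (assumption): chosen to be consistent with the reported metrics on 2155 rows
metrics_check = binary_metrics(tp=781, fp=177, fn=226, tn=971)
for name, value in metrics_check.items():
    print(f"{name}: {value:.4f}")
```

Reading the metrics this way makes their meaning concrete: under these counts, 177 of 958 positive predictions were wrong, and 226 of 1007 actual positives were missed.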

In [17]:
# abs() leaves these values unchanged because every metric reported above is already positive
absolute_metrics = {key: abs(value) for key, value in metrics.items()}
absolute_metrics

{'accuracy': 0.8129930394431555,
 'balanced_accuracy': 0.8106949091550781,
 'mcc': 0.6239071784230497,
 'f1': 0.794910941475827,
 'precision': 0.8152400835073069,
 'recall': 0.7755710029791459}

In [18]:
# Feature Importance
importance = predictor.feature_importance(test_df)
importance

Computing feature importance via permutation shuffling for 4 features using 2155 rows with 5 shuffle sets...
	1.37s	= Expected runtime (0.27s per shuffle set)
	0.5s	= Actual runtime (Completed 5 of 5 shuffle sets)


| | importance | stddev | p_value | n | p99_high | p99_low |
| --- | --- | --- | --- | --- | --- | --- |
| Visitor Origin | 0.216241 | 0.013067 | 1.592474e-06 | 5 | 0.243147 | 0.189335 |
| Number of Visitors (Rounded to nearest hundred) | 0.207146 | 0.005447 | 5.733009e-08 | 5 | 0.218362 | 0.19593 |
| Month/Year | 0.074153 | 0.003761 | 7.916196e-07 | 5 | 0.081898 | 0.066409 |
| Country | 0.010487 | 0.004659 | 0.003657927 | 5 | 0.02008 | 0.000895 |
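
The log above says the importances were computed by permutation shuffling: each feature column is shuffled in turn, and the resulting drop in the score is taken as that feature's importance. A toy, standard-library sketch of that idea follows. The rule-based `toy_predict` function, the synthetic rows, and all names are illustrative assumptions, and this is not AutoGluon's implementation.

```python
import random

def accuracy(rows, labels, predict):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(rows, labels, predict, feature, n_shuffles=5, seed=0):
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels, predict)
    drops = []
    for _ in range(n_shuffles):
        column = [row[feature] for row in rows]
        rng.shuffle(column)  # break the link between this feature and the label
        shuffled = [{**row, feature: value} for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(shuffled, labels, predict))
    return sum(drops) / n_shuffles

def toy_predict(row):
    # Hypothetical rule: only 'origin' is used, 'noise' is ignored
    return "Road" if row["origin"] == "near" else "Air"

rows = [{"origin": "near" if i % 2 == 0 else "far", "noise": i % 7} for i in range(200)]
labels = [toy_predict(row) for row in rows]  # labels follow the rule exactly

for feature in ("origin", "noise"):
    print(feature, permutation_importance(rows, labels, toy_predict, feature))
```

The feature the toy model depends on loses a large share of its accuracy when shuffled, while the ignored feature scores exactly zero. The same reading applies to the table above: an importance of 0.216 means accuracy fell by about 0.216 when `Visitor Origin` was shuffled.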


### Test Use Cases

In [19]:
# Use case 1: a hypothetical group of visitors from Berlin, Germany

res = {
    "Visitor Origin" : "Berlin",
    "Number of Visitors (Rounded to nearest hundred)" : 100,
    "Month/Year" : "January 2025",
    "Country" : "Germany"
}

In [20]:
tourism_data = TabularDataset([res])
predictor.predict(tourism_data)

0    Air
Name: Mode of entry, dtype: object

In [21]:
res2 = {
    "Visitor Origin" : "Baddeck",
    "Number of Visitors (Rounded to nearest hundred)" : 10,
    "Month/Year" : "January 2025",
    "Country" : "Canada"
}

In [22]:
tourism_data = TabularDataset([res2])
predictor.predict(tourism_data)

0    Road
Name: Mode of entry, dtype: object
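
Both use cases pass `Visitor Origin` values ("Berlin", "Baddeck") that may not occur in the training data, in which case the model cannot have learned anything specific about them and the prediction rests on the remaining features. Below is a minimal sketch of a check for previously unseen categorical values. The helper `unseen_values` is hypothetical, and `seen_rows` holds only the five rows shown by `data.head()`, not the full training set.

```python
def unseen_values(seen_rows, new_row, columns):
    """Return {column: value} for values in new_row that never appear in seen_rows."""
    unseen = {}
    for column in columns:
        known = {row[column] for row in seen_rows}
        if new_row[column] not in known:
            unseen[column] = new_row[column]
    return unseen

# Only the rows printed by data.head(); the real training set has many more
seen_rows = [
    {"Visitor Origin": "Atlantic Canada", "Country": "Canada"},
    {"Visitor Origin": "Quebec", "Country": "Canada"},
    {"Visitor Origin": "Ontario", "Country": "Canada"},
    {"Visitor Origin": "Western Canada", "Country": "Canada"},
    {"Visitor Origin": "New England (inc Maine)", "Country": "United States"},
]
new_row = {"Visitor Origin": "Baddeck", "Country": "Canada"}
print(unseen_values(seen_rows, new_row, ["Visitor Origin", "Country"]))
```

With the real data, `seen_rows` could be built from the training frame, for example with `train_df.to_dict('records')`, so that the check covers every category the model actually saw.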

### Conclusion Summary

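#### RandomForestEntr tops the leaderboard with 93.14%, but that leaderboard was computed on the training data (`train_df`), so the figure is training accuracy rather than test accuracy. Its validation accuracy is only 71.77%, well below WeightedEnsemble_L2 (83.50%), and a gap this large suggests overfitting.
#### RandomForestGini follows closely with 93.12% accuracy on the training data and 71.57% validation accuracy.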

#### On the held-out test set, overall accuracy is 81.30%, balanced accuracy is close behind (81.07%), and the MCC (Matthews Correlation Coefficient) is 0.62, which indicates reasonable predictive performance on both classes.
#### F1-score: 0.79 - a reasonable balance between precision and recall.
#### Precision: 0.82 - about 82% of the positive-class predictions were correct.
#### Recall: 0.78 - about 78% of the actual positive-class rows were identified.

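#### Visitor Origin (0.216) and Number of Visitors (Rounded to nearest hundred) (0.207) are the most important predictors: shuffling either one lowers test accuracy by roughly 21 to 22 percentage points. Month/Year (0.074) matters less, and Country (0.010) contributes very little.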


