# AutoGluon Classification Model

#### Data: Tourism Nova Scotia
#### Time: 2006 to 2024
#### Data Source: https://data.novascotia.ca/Arts-Culture-and-History/Tourism-Nova-Scotia-Listed-Operators/2h2s-6bg4/about_data 

## Import necessary libraries

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor
# TabularDataset: loads tabular data
# TabularPredictor: trains models and makes predictions

from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

## Load the dataset

In [2]:
data_location = './Tourism_Nova_Scotia_Listed_Operators_20250413.csv'
data = TabularDataset(data_location)
data.head()

Unnamed: 0,Region,Name,Type,Longitude,Latitude,Location
0,South Shore,1 and Only Riverside Accommodations,Accommodation,-65.03934,43.826116,"(43.826116, -65.03934)"
1,Cabot Trail,11827 Cabot Trail,Accommodation,-61.075365,46.49472,"(46.49472, -61.075365)"
2,Cabot Trail,13791 Cabot Trail,Accommodation,-61.025454,46.568788,"(46.568788, -61.025454)"
3,Eastern Shore,3 Moonlight Beach Suites,Accommodation,-63.33005,44.64372,"(44.64372, -63.33005)"
4,Cabot Trail,97 Cheticamp Island Road,Accommodation,-61.029663,46.594279,"(46.594279, -61.029663)"


### Data Dictionary

 - **Region** - Tourism region in which the operator is located.
 - **Name** - Name of the operator.
 - **Type** - Operator category (for example, Accommodation or Eat & Drink); this is the prediction target.
 - **Longitude** - Longitude of the operator.
 - **Latitude** - Latitude of the operator.
 - **Location** - Coordinates of the operator as a string, formatted "(Latitude, Longitude)".
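
The Location column repeats Latitude and Longitude as a single string. The following is a minimal sketch, assuming every non-missing value follows the "(lat, lon)" format seen in the preview rows, of parsing it back into numbers. `parse_location` is a hypothetical helper and is not part of this notebook.

```python
import ast

def parse_location(text):
    """Parse a "(lat, lon)" string into a (lat, lon) tuple of floats."""
    lat, lon = ast.literal_eval(text)
    return float(lat), float(lon)

# Value copied from the first preview row
lat, lon = parse_location("(43.826116, -65.03934)")
print(lat, lon)
```

Comparing the parsed pair with the Latitude and Longitude columns is one way to confirm that Location carries no extra information.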

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2519 entries, 0 to 2518
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Region     2517 non-null   object 
 1   Name       2519 non-null   object 
 2   Type       2519 non-null   object 
 3   Longitude  2472 non-null   float64
 4   Latitude   2472 non-null   float64
 5   Location   2472 non-null   object 
dtypes: float64(2), object(4)
memory usage: 118.2+ KB
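
The non-null counts above imply a small number of missing values. As a rough sketch, the counts can be turned into missing-value totals and percentages by hand; the numbers are copied from the `data.info()` output, and on the loaded frame `data.isna().sum()` should report the same totals.

```python
# Non-null counts copied from the data.info() output above
total_rows = 2519
non_null = {
    "Region": 2517,
    "Name": 2519,
    "Type": 2519,
    "Longitude": 2472,
    "Latitude": 2472,
    "Location": 2472,
}

# Missing rows per column, as a count and as a percentage of all rows
missing = {col: total_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in non_null:
    print(f"{col:<10} missing={missing[col]:>3} ({missing_pct[col]}%)")
```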


## Perform EDA

In [4]:
# Percentage of each 'Type' value before the train-test split, for reference

data['Type'].value_counts()/data.shape[0] * 100

Type
Accommodation    30.051608
Eat & Drink      17.229059
Attraction       14.728067
Outdoors         12.107979
Tour Ops          6.947201
Fine Arts         6.748710
Trails            6.470822
Campground        5.716554
Name: count, dtype: float64
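
The class shares above also give a useful reference point for later accuracy figures. As a sketch under the assumption that the shares are copied correctly from the output, the accuracy of a trivial model that always predicts the most frequent class can be computed as follows.

```python
# Class shares (percent) copied from the value_counts output above
class_share = {
    "Accommodation": 30.051608,
    "Eat & Drink": 17.229059,
    "Attraction": 14.728067,
    "Outdoors": 12.107979,
    "Tour Ops": 6.947201,
    "Fine Arts": 6.748710,
    "Trails": 6.470822,
    "Campground": 5.716554,
}

# A model that always predicts the majority class is right this often
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class] / 100

print(majority_class, round(baseline_accuracy, 4))
```

Any trained model should clearly beat this baseline of roughly 0.30 to be considered useful.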

## Train Test Split 

In [5]:
# Stratified 70/30 split with scikit-learn: stratifying on 'Type' keeps the class proportions similar in both subsets

train_df, test_df = train_test_split(data, test_size=0.3, random_state=42, stratify=data['Type'])

In [6]:
# Percentage of each 'Type' value after the train-test split

train_df['Type'].value_counts()/train_df.shape[0] * 100

Type
Accommodation    30.062394
Eat & Drink      17.243335
Attraction       14.747589
Outdoors         12.081679
Tour Ops          6.920023
Fine Arts         6.749858
Trails            6.466251
Campground        5.728871
Name: count, dtype: float64

In [7]:
test_df['Type'].value_counts()/test_df.shape[0] * 100

Type
Accommodation    30.026455
Eat & Drink      17.195767
Attraction       14.682540
Outdoors         12.169312
Tour Ops          7.010582
Fine Arts         6.746032
Trails            6.481481
Campground        5.687831
Name: count, dtype: float64
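
To confirm that stratification worked, the train and test shares can be compared class by class. This is a small sketch using the percentages printed above; the dictionaries are typed in by hand rather than read from the data frames.

```python
# Class shares (percent) copied from the train and test outputs above
train_share = {
    "Accommodation": 30.062394, "Eat & Drink": 17.243335,
    "Attraction": 14.747589, "Outdoors": 12.081679,
    "Tour Ops": 6.920023, "Fine Arts": 6.749858,
    "Trails": 6.466251, "Campground": 5.728871,
}
test_share = {
    "Accommodation": 30.026455, "Eat & Drink": 17.195767,
    "Attraction": 14.682540, "Outdoors": 12.169312,
    "Tour Ops": 7.010582, "Fine Arts": 6.746032,
    "Trails": 6.481481, "Campground": 5.687831,
}

# Absolute difference in percentage points for each class
diff = {c: abs(train_share[c] - test_share[c]) for c in train_share}
worst_class = max(diff, key=diff.get)

print(worst_class, round(diff[worst_class], 4))
```

Every class differs by less than 0.1 percentage points between the two subsets, which is what a stratified split is expected to produce.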

In [8]:
# AutoGluon’s TabularPredictor --> to predict the values of a target column based on the other columns in a tabular dataset 

predictor = TabularPredictor(label='Type', path='type_predictors')



## Monitoring the Models

In [9]:
predictor.fit(train_data=train_df)  # optional: presets='best_quality'

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.11
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
CPU Count:          8
Memory Avail:       4.80 GB / 15.71 GB (30.6%)
Disk Space Avail:   83.89 GB / 237.83 GB (35.3%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets='good'         : Good accuracy with very fast inference speed.
	presets='medium'

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x1a4b18ea9d0>

In [10]:
# train_df is passed here, so the score_test column below is accuracy on the training data
predictor.leaderboard(data=train_df)

If you only need to load model weights and optimizer state, use the safe `Learner.load` instead.
  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,ExtraTreesGini,0.912649,0.603399,accuracy,0.166203,0.065495,1.151546,0.166203,0.065495,1.151546,1,True,9
1,ExtraTreesEntr,0.912649,0.603399,accuracy,0.192571,0.120707,1.079077,0.192571,0.120707,1.079077,1,True,10
2,RandomForestEntr,0.912082,0.600567,accuracy,0.174658,0.060318,1.090751,0.174658,0.060318,1.090751,1,True,7
3,RandomForestGini,0.911514,0.597734,accuracy,0.189677,0.061566,1.154682,0.189677,0.061566,1.154682,1,True,6
4,LightGBMLarge,0.905275,0.617564,accuracy,0.05241,0.018105,6.140833,0.05241,0.018105,6.140833,1,True,13
5,KNeighborsDist,0.866704,0.373938,accuracy,0.030343,0.003833,0.007476,0.030343,0.003833,0.007476,1,True,2
6,LightGBMXT,0.747022,0.660057,accuracy,0.074411,0.007973,2.265336,0.074411,0.007973,2.265336,1,True,4
7,LightGBM,0.739081,0.654391,accuracy,0.038484,0.008043,3.326558,0.038484,0.008043,3.326558,1,True,5
8,XGBoost,0.718094,0.654391,accuracy,0.133164,0.009512,1.923573,0.133164,0.009512,1.923573,1,True,11
9,CatBoost,0.690301,0.671388,accuracy,0.041937,0.00807,48.465652,0.041937,0.00807,48.465652,1,True,8
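
Because this leaderboard was computed on `train_df`, the difference between `score_test` and `score_val` is a rough measure of how much each model overfits. The sketch below computes that gap from the values printed above; the list is typed in by hand for illustration.

```python
# (model, score on train_df, validation score) copied from the leaderboard above
scores = [
    ("ExtraTreesGini", 0.912649, 0.603399),
    ("ExtraTreesEntr", 0.912649, 0.603399),
    ("RandomForestEntr", 0.912082, 0.600567),
    ("RandomForestGini", 0.911514, 0.597734),
    ("LightGBMLarge", 0.905275, 0.617564),
    ("KNeighborsDist", 0.866704, 0.373938),
    ("LightGBMXT", 0.747022, 0.660057),
    ("LightGBM", 0.739081, 0.654391),
    ("XGBoost", 0.718094, 0.654391),
    ("CatBoost", 0.690301, 0.671388),
]

# Gap between training accuracy and validation accuracy, largest first
gaps = sorted(
    ((model, train - val) for model, train, val in scores),
    key=lambda item: item[1],
    reverse=True,
)

for model, gap in gaps:
    print(f"{model:<18} gap={gap:.4f}")
```

The tree ensembles and KNeighborsDist fit the training data far better than the validation data, while CatBoost shows the smallest gap. Passing the held-out `test_df` to `leaderboard` instead would make `score_test` a genuine test score.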


In [11]:
# Print a summary of the fit; show_plot=False skips displaying the summary plot

predictor.fit_summary(show_plot=False)


*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val eval_metric  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.674221    accuracy       0.024116  56.290739                0.000000           0.091534            2       True         14
1              CatBoost   0.671388    accuracy       0.008070  48.465652                0.008070          48.465652            1       True          8
2            LightGBMXT   0.660057    accuracy       0.007973   2.265336                0.007973           2.265336            1       True          4
3              LightGBM   0.654391    accuracy       0.008043   3.326558                0.008043           3.326558            1       True          5
4               XGBoost   0.654391    accuracy       0.009512   1.923573                0.009512           1.923573            1       True         11
5         LightGBMLarge   0.6175

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.39943342776203966,
  'KNeighborsDist': 0.37393767705382436,
  'NeuralNetFastAI': 0.3654390934844193,
  'LightGBMXT': 0.660056657223796,
  'LightGBM': 0.6543909348441926,
  'RandomForestGini': 0.5977337110481586,
  'RandomForestEntr': 0.6005665722379604,
  'CatBoost': 0.6713881019830028,
  'ExtraTreesGini': 0.603399433427762,
  'ExtraTreesEntr': 0.603399433427762,
  'XGBoost': 0.6543909348441926,
  'NeuralNetTorch': 0.37393767705382436,
  'LightGBMLarge': 

In [13]:
# validate the model against unseen data

y_test = test_df["Type"]
test_data = test_df.drop(columns=["Type"])

In [14]:
y_pred = predictor.predict(test_data)

In [15]:
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

In [16]:
metrics

{'accuracy': 0.6547619047619048,
 'balanced_accuracy': 0.5958061234104766,
 'mcc': 0.5763092714515887}
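
Balanced accuracy is lower than plain accuracy here because it averages recall per class, so small classes count as much as large ones. The following is a minimal pure-Python sketch of the two definitions on made-up labels; the toy data is hypothetical and unrelated to the real predictions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of rows predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class has equal weight."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical toy labels: a large class predicted perfectly,
# a small class predicted correctly only half the time
y_true = ["Accommodation"] * 4 + ["Campground"] * 2
y_pred = ["Accommodation"] * 5 + ["Campground"]

acc = accuracy(y_true, y_pred)           # 5 of 6 rows correct
bal = balanced_accuracy(y_true, y_pred)  # (1.0 + 0.5) / 2
print(round(acc, 4), bal)
```

The same effect explains the gap between 0.6548 and 0.5958 above: the predictor does better on the frequent classes than on the rare ones.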

In [17]:
# abs() changes nothing here because all three metrics are already positive
absolute_metrics = {key: abs(value) for key, value in metrics.items()}
absolute_metrics

{'accuracy': 0.6547619047619048,
 'balanced_accuracy': 0.5958061234104766,
 'mcc': 0.5763092714515887}

In [18]:
# Feature Importance
importance = predictor.feature_importance(test_df)
importance

Computing feature importance via permutation shuffling for 5 features using 756 rows with 5 shuffle sets...
	3.67s	= Expected runtime (0.73s per shuffle set)
	0.93s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
Name,0.424339,0.017027,3.104253e-07,5,0.459398,0.38928
Region,0.029101,0.009116,0.001018785,5,0.047871,0.01033
Latitude,0.015344,0.002743,0.000117491,5,0.020992,0.009696
Longitude,0.008201,0.003788,0.004195819,5,0.016,0.000402
Location,0.004497,0.003811,0.02882297,5,0.012344,-0.003349
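
The importance values above come from permutation shuffling: a feature's values are shuffled across rows and the resulting drop in score is recorded. The sketch below illustrates the idea in plain Python with a hypothetical toy model and data; it is not AutoGluon's implementation.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, col, n_shuffles=5, seed=0):
    """Average accuracy drop after shuffling column `col` across rows."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_shuffles):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(base - accuracy(predict, permuted, labels))
    return sum(drops) / n_shuffles

# Hypothetical toy data: the label depends only on feature 0
rows = [(i % 2, i % 5) for i in range(100)]
labels = ["cafe" if r[0] == 1 else "trail" for r in rows]

def predict(row):
    """Toy model that looks at feature 0 and ignores feature 1."""
    return "cafe" if row[0] == 1 else "trail"

imp_signal = permutation_importance(predict, rows, labels, col=0)
imp_noise = permutation_importance(predict, rows, labels, col=1)
print(imp_signal, imp_noise)
```

Shuffling the feature the model relies on lowers accuracy, while shuffling an ignored feature leaves it unchanged. This mirrors the table above, where Name carries most of the importance.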


In [19]:
test_df.head(2)

Unnamed: 0,Region,Name,Type,Longitude,Latitude,Location
1578,Eastern Shore,Rose & Rooster Cafe,Eat & Drink,-63.257678,44.687648,"(44.687648, -63.257678)"
1317,Halifax Metro,Beyond Pho Vietnamese Cuisine,Eat & Drink,-63.579968,44.642812,"(44.642812, -63.579968)"


### Test Use Cases

In [20]:
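# Use case 1: a record taken from the test set (row 1578)
# Latitude and Longitude are given as floats to match the float64 training columns

res = {
    "Region": "Eastern Shore",
    "Name": "Rose & Rooster Cafe",
    "Latitude": 44.687648,
    "Longitude": -63.257678,
    "Location": "(44.687648, -63.257678)"
}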

In [21]:
operator_data = TabularDataset([res])
predictor.predict(operator_data)

0    Eat & Drink
Name: Type, dtype: object

In [22]:
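# Use case 2: name and region of a Cabot Trail operator combined with
# the Halifax Metro coordinates of test row 1317
res2 = {
    "Region": "Cabot Trail",
    "Name": "11827 Cabot Trail",
    "Latitude": 44.642812,
    "Longitude": -63.579968,
    "Location": "(44.642812, -63.579968)"
}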

In [23]:
operator_data = TabularDataset([res2])
predictor.predict(operator_data)

0    Trails
Name: Type, dtype: object

### Conclusion Summary

#### Because the leaderboard was computed on `train_df`, its `score_test` column reports training accuracy, not held-out accuracy. **ExtraTreesGini** and **ExtraTreesEntr** top that column with identical scores of 0.912649, but their validation accuracy is only 0.603399, which indicates substantial overfitting.
#### **RandomForestEntr** follows the same pattern: 0.912082 on the training data versus 0.600567 on validation. By validation accuracy, the strongest models are **WeightedEnsemble_L2** (0.674221) and **CatBoost** (0.671388).

#### On the held-out test set, the predictor reaches an accuracy of 0.6548, in line with the validation scores. The lower balanced accuracy (0.5958) and the MCC (Matthews Correlation Coefficient, 0.5763) show that performance is uneven across the eight classes, so class imbalance or weak features may need attention.

#### **Name** dominates feature importance (0.424), indicating that it heavily influences predictions. Region (0.029), Latitude (0.015), and Longitude (0.008) contribute only marginally, and Location is the least impactful (0.004), with a 99% confidence interval that includes zero.