# Project 2
## Introduction
This notebook analyzes tourism data for European Union countries to identify frequent patterns and association rules. It covers data filtering, transformation, and the application of the Apriori algorithm to find which countries tend to have a high tourism impact in the same years.

## Dataset
The dataset covers key tourism factors for multiple countries from 1999 to 2023.

File Path: world_tourism_economy_data.csv
Shape: Printed at runtime to verify the dimensions of the data.

In [1]:
%pip install pandas scikit-learn mlxtend matplotlib



## 1. Import Libraries
We begin by importing the necessary libraries for data manipulation and association rule mining.

In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

## 2. Load the Dataset
Load the dataset containing global tourism economy data. Ensure the file is located at the specified path.

In [3]:
# Load the dataset
file_path = 'world_tourism_economy_data.csv'
tourism_data = pd.read_csv(file_path)
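
The Dataset section mentions printing the shape at runtime, but the loading cell does not print it. A minimal sketch of such a check follows, using a small in-memory stand-in for the CSV. The sample rows and the `required_columns` list are assumptions for illustration; the column names are the ones used later in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for world_tourism_economy_data.csv (values are made up)
sample_csv = io.StringIO(
    "country_code,year,tourism_receipts,gdp\n"
    "AUT,1999,100.0,1000.0\n"
    "BEL,1999,20.0,2000.0\n"
)
sample_data = pd.read_csv(sample_csv)

# Print the dimensions as (rows, columns)
print(f"Shape: {sample_data.shape}")

# Confirm the columns used later in the notebook are present
required_columns = ['country_code', 'year', 'tourism_receipts', 'gdp']
missing_columns = [c for c in required_columns if c not in sample_data.columns]
print(f"Missing columns: {missing_columns}")
```

The same two checks can be run on `tourism_data` right after `pd.read_csv(file_path)`.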



## 3. Data Preparation
### Filter Data for European Union Countries
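Here, we keep only the rows for European Union member states, selected by their three-letter country codes. The list below has 25 codes; Malta (`MLT`) and the Netherlands (`NLD`) are EU members but are not included.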

In [4]:
# european_country_codes = [
#     'ALB', 'AND', 'ARM', 'AUT', 
#     'BEL', 'BGR', 'BIH',
#      'BLR', 'BUL', 'CHE', 
#     'CYP', 'CZE', 'DEU', 'DNK', 'EST', 'FIN', 'FRA', 'GEO', 'GRC', 'HRV', 
#     'HUN', 'IRL', 'ISL', 'ISR', 'ITA', 'KOS', 'LTU', 'LUX', 'LVA', 'MDA', 
#     'MNE', 
#     'NLD', 
#     'NOR', 
#     'POL', 'PRT', 'ROU', 'RUS', 'SVK', 'SVN', 'ESP', 
#     'SWE', 'TUR', 'UKR', 'GBR'
# ]
european_union_country_codes = [
    'AUT', 'BEL', 'BGR', 'CYP', 'CZE', 'DEU', 'DNK', 'EST', 'FIN', 'FRA', 
    'GRC', 'HRV', 'HUN', 'IRL', 'ITA', 'LTU', 'LUX', 'LVA', 'POL', 'PRT', 
    'ROU', 'SVK', 'SVN', 'ESP', 'SWE'
]
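# .copy() makes the filtered frame independent, so the column assignments below do not trigger SettingWithCopyWarning
tourism_data = tourism_data[tourism_data['country_code'].isin(european_union_country_codes)].copy()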



### Compute Derived Metrics
Calculate the **Tourism GDP Percentage** (tourism receipts as a percentage of GDP) and flag a row as **High Tourism Impact** when that percentage exceeds a threshold, set here to 5%.
We then drop rows where the `High_Tourism_Impact` column is missing.

In [5]:
# Calculate Tourism GDP Percentage and High Tourism Impact
tourism_data['Tourism_GDP_Percentage'] = (tourism_data['tourism_receipts'] / tourism_data['gdp']) * 100
tourism_threshold = 5  # Adjust threshold as needed
tourism_data['High_Tourism_Impact'] = tourism_data['Tourism_GDP_Percentage'] > tourism_threshold

# Clean the data
tourism_data = tourism_data.dropna(subset=['High_Tourism_Impact'])
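
One detail of this step is worth checking: in pandas, comparing a missing value with a number yields `False` rather than a missing value, so a Boolean flag built with `>` never contains `NaN`. The sketch below illustrates this with made-up numbers (the `demo` frame is hypothetical). It suggests that dropping on `Tourism_GDP_Percentage` is one way to exclude rows with missing receipts or GDP, if that is the intent.

```python
import numpy as np
import pandas as pd

# Made-up values; the middle row has no receipts figure
demo = pd.DataFrame({
    'tourism_receipts': [80.0, np.nan, 10.0],
    'gdp': [1000.0, 1000.0, 1000.0],
})
demo['Tourism_GDP_Percentage'] = demo['tourism_receipts'] / demo['gdp'] * 100
demo['High_Tourism_Impact'] = demo['Tourism_GDP_Percentage'] > 5

# The comparison turns the missing value into False, so nothing is dropped here
rows_after_flag_dropna = len(demo.dropna(subset=['High_Tourism_Impact']))

# Dropping on the percentage column removes the row with missing data
rows_after_pct_dropna = len(demo.dropna(subset=['Tourism_GDP_Percentage']))

print(rows_after_flag_dropna, rows_after_pct_dropna)
```

Note that switching the column in the real notebook could change the results if some year has no data for any country, because that year would then disappear from the pivoted table.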



### List Countries with High Tourism Impact
Identify unique countries classified as having a high tourism impact.

In [6]:
# List unique countries with 'High_Tourism_Impact'
countries_with_high_impact = tourism_data[tourism_data['High_Tourism_Impact']]['country_code'].unique()
print(f"Countries with High Tourism Impact: {countries_with_high_impact}")

Countries with High Tourism Impact: ['AUT' 'BGR' 'CYP' 'EST' 'GRC' 'HRV' 'HUN' 'LTU' 'PRT' 'SVN' 'LUX']




## 4. Association Rule Mining
### Prepare Data for Association Rule Mining
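Group the data by `country_code` and `year`, then pivot it into a Boolean table suitable for the Apriori algorithm: each row is a year (a transaction), each column is a country (an item), and a cell is `True` when that country had a high tourism impact in that year.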

In [7]:
# Group the data by 'country_code' and 'year', and pivot
ds_grouped = tourism_data.groupby(['country_code', 'year'], as_index=False).agg({'High_Tourism_Impact': 'any'})
ds_pivot = ds_grouped.pivot(index='year', columns='country_code', values='High_Tourism_Impact').fillna(False)

# Ensure the pivoted table is Boolean (astype replaces the deprecated applymap)
ds_pivot = ds_pivot.astype(bool)
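
To make the reshaping concrete, here is a small sketch of the same group-and-pivot steps on made-up data. The country codes `AAA` and `BBB` and the `toy` frame are hypothetical. It shows how a (country, year) pair with no row in the input ends up as `False` in the Boolean table.

```python
import pandas as pd

# Made-up flags for two countries over two years
toy = pd.DataFrame({
    'country_code': ['AAA', 'AAA', 'BBB'],
    'year': [2000, 2001, 2000],
    'High_Tourism_Impact': [True, False, True],
})

toy_grouped = toy.groupby(['country_code', 'year'], as_index=False).agg(
    {'High_Tourism_Impact': 'any'}
)
# Rows are years (transactions), columns are countries (items)
toy_pivot = toy_grouped.pivot(
    index='year', columns='country_code', values='High_Tourism_Impact'
)
# BBB has no row for 2001, so that cell is missing until it is filled
toy_pivot = toy_pivot.fillna(False).astype(bool)
print(toy_pivot)
```

Treating a missing pair as `False` is a modelling choice: it counts "no data" as "no high impact", which lowers the support of any itemset containing that country.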



### Apply the Apriori Algorithm
Run the Apriori algorithm on the pivoted dataset to generate frequent itemsets with a minimum support of 0.5.

In [8]:
# Run apriori on the pivoted data
min_support = 0.5
freq_itemsets = apriori(ds_pivot, min_support=min_support, use_colnames=True)

# Display the frequent itemsets and their support
print(freq_itemsets)

     support                        itemsets
0       0.84                           (BGR)
1       0.84                           (CYP)
2       0.68                           (EST)
3       0.72                           (GRC)
4       0.88                           (HRV)
..       ...                             ...
154     0.52  (CYP, HRV, LUX, BGR, EST, GRC)
155     0.52  (CYP, HRV, BGR, EST, SVN, GRC)
156     0.52  (CYP, HRV, LUX, BGR, SVN, GRC)
157     0.52  (CYP, HRV, BGR, SVN, GRC, PRT)
158     0.56  (CYP, HRV, LUX, BGR, SVN, PRT)

[159 rows x 2 columns]
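
The support of an itemset is the fraction of rows (years) in which every country in the itemset is `True`. The sketch below computes it by hand on a toy table, as a way to sanity-check individual rows of the Apriori output. The table, the country codes, and the helper `itemset_support` are hypothetical and not part of mlxtend.

```python
import pandas as pd

# Toy Boolean table: 4 years (rows) x 3 hypothetical countries (columns)
toy_pivot = pd.DataFrame(
    {
        'AAA': [True, True, True, False],
        'BBB': [True, True, False, False],
        'CCC': [True, False, True, True],
    },
    index=[2000, 2001, 2002, 2003],
)

def itemset_support(table, items):
    """Fraction of rows in which every column in `items` is True."""
    return table[list(items)].all(axis=1).mean()

print(itemset_support(toy_pivot, ['AAA']))         # 3 of 4 years
print(itemset_support(toy_pivot, ['AAA', 'BBB']))  # 2 of 4 years
```

Read this way, and assuming the pivoted table has one row per year from 1999 to 2023 (25 rows), a support of 0.84 corresponds to 21 years.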




### Generate Association Rules
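Derive association rules from the frequent itemsets using a minimum confidence of 1, so only rules that held in every year where the antecedent occurred are kept. Sort and display the top 10 rules by confidence; since every retained rule has confidence 1.0, this sort cannot distinguish between them.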

In [17]:

# Number of itemsets
num_itemsets = len(freq_itemsets)

# Generate association rules
rules = association_rules(freq_itemsets, metric="confidence", min_threshold=1)
rules = rules.sort_values(by='confidence', ascending=False)

# Display the top 10 rules
print("Top 10 rules by confidence:")
print(rules.head(10))


Top 10 rules by confidence:
              antecedents      consequents  antecedent support  \
0                   (CYP)            (BGR)                0.84   
290       (PRT, GRC, BGR)       (CYP, HRV)                0.52   
301       (LUX, GRC, SVN)       (CYP, BGR)                0.52   
300  (GRC, LUX, BGR, SVN)            (CYP)                0.52   
299  (CYP, LUX, GRC, SVN)            (BGR)                0.52   
298            (GRC, SVN)  (CYP, HRV, BGR)                0.60   
297       (GRC, BGR, SVN)       (CYP, HRV)                0.60   
296       (HRV, GRC, SVN)       (CYP, BGR)                0.60   
295       (CYP, GRC, SVN)       (HRV, BGR)                0.60   
294  (GRC, HRV, BGR, SVN)            (CYP)                0.60   

     consequent support  support  confidence      lift  leverage  conviction  \
0                  0.84     0.84         1.0  1.190476    0.1344         inf   
290                0.84     0.52         1.0  1.190476    0.0832         inf   
301  
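
The metric columns in this table follow from three supports: that of the antecedent, that of the consequent, and that of both together. The sketch below recomputes them with the standard definitions; the helper `rule_metrics` is hypothetical and is not an mlxtend function.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Standard rule metrics from antecedent, consequent, and joint support."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = math.inf if confidence == 1 else (1 - support_c) / (1 - confidence)
    return confidence, lift, leverage, conviction

# Supports taken from rule 0 in the output above: (CYP) -> (BGR)
confidence, lift, leverage, conviction = rule_metrics(0.84, 0.84, 0.84)
print(round(confidence, 6), round(lift, 6), round(leverage, 4), conviction)
```

For rule 0 this reproduces the printed values: confidence 1.0, lift 1.190476, leverage 0.1344, and infinite conviction.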



### Analyze Specific Antecedent Rules
Define a function to filter association rules where a specific country is in the antecedents and extract related consequents.

In [18]:
def get_rules_where_country_is_antecedent(rules, code):
    # Keep only the rules whose antecedents contain the given country code
    c_rules = rules[rules['antecedents'].apply(lambda x: code in x)]
    
    # Extract the consequents of these rules
    c_consequents = c_rules['consequents']
    
    # Flatten the list of consequents and get unique countries
    consequent_countries = set()
    for consequent in c_consequents:
        consequent_countries.update(consequent)  # Adds all countries in the consequent
    
    # Return the list of unique countries
    return list(consequent_countries)

# Call the function with the `rules` DataFrame
countries_with_antecedent = get_rules_where_country_is_antecedent(rules, 'PRT')

# Print the result
print("Countries appearing as consequents when PRT is an antecedent:", countries_with_antecedent)


Countries appearing as consequents when PRT is an antecedent: ['CYP', 'HRV', 'BGR', 'SVN']
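
The `association_rules` output stores antecedents and consequents as frozensets, which is why membership tests such as `code in x` work. The sketch below runs the same lookup on a hand-built rules table to show the expected behaviour, including a code that never appears. The toy rules, the country codes, and the helper name `consequents_for` are hypothetical. Unlike the function above, it returns a sorted list so the output order is reproducible.

```python
import pandas as pd

# Hypothetical rules in the same shape as the `association_rules` output:
# antecedents and consequents are frozensets of country codes
toy_rules = pd.DataFrame({
    'antecedents': [frozenset({'AAA'}), frozenset({'AAA', 'BBB'}), frozenset({'CCC'})],
    'consequents': [frozenset({'BBB'}), frozenset({'CCC'}), frozenset({'AAA'})],
})

def consequents_for(rules, code):
    """Sorted country codes that appear as consequents when `code` is an antecedent."""
    mask = rules['antecedents'].apply(lambda items: code in items)
    found = set()
    for consequent in rules.loc[mask, 'consequents']:
        found.update(consequent)
    return sorted(found)

print(consequents_for(toy_rules, 'AAA'))
print(consequents_for(toy_rules, 'ZZZ'))
```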


