<a href="https://colab.research.google.com/github/Kiron-Ang/DSC/blob/main/vacation_recommender_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vacation Recommender System
### Kiron Ang, November 2024

Are you struggling to decide where to go on vacation next? This simple content-based recommender system can help you! Follow the directions below to get recommendations for your next vacation. By the end, you'll also understand how the system arrived at its suggestions!

[Please click here to access the recommender system online at Google Colab. Using the tool requires a Google account.](https://colab.research.google.com/drive/1T6lyTMzBqbNAFKgaNKXgLuAQ4i7WZ71U)

---

#### Step 1: Print Version Numbers
It's always good practice to print out version numbers of the software tools used to make a new tool. Not only does it help promote open source development, but it also gives credit where credit is due. Run the cell below to get a neat list of version numbers.

In [None]:
print("Printing version numbers. . .")
!python -V

!pip install -U polars > output.txt
import polars
print("polars", polars.__version__)

!pip install -U scikit-learn > output.txt
import sklearn
print("scikit-learn", sklearn.__version__)

import ipywidgets
print("ipywidgets", ipywidgets.__version__)

import IPython
print("IPython", IPython.__version__)
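
As a side note, the standard library can also report installed versions without importing each package. The sketch below assumes Python 3.8 or newer; the helper name `installed_version` is hypothetical and is not part of the notebook.

```python
import importlib.metadata


def installed_version(package_name):
    """Return the installed version string, or a placeholder if the package is missing."""
    try:
        return importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        return "not installed"


# Distribution names as published on PyPI, mirroring the packages printed above.
for name in ["polars", "scikit-learn", "ipywidgets", "ipython"]:
    print(name, installed_version(name))
```

This variant never raises when a package is absent, which can be handy when checking an environment before installing anything.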

#### Step 2: Get Country Information from United Nations Data
The recommender system needs some information to work with! This tool is quite basic and uses only three datasets provided by the United Nations. The cell below performs the following steps:
1. Read in data directly from the UN Data website. These three datasets are related to tourism, GDP, and crime for countries all around the world.
2. Filter all the datasets to only keep data for the year 2021. This is the most recent information available; there is no data for 2022, 2023, or 2024 quite yet.
3. Narrow down the data further by choosing one statistic per dataset. The data arrives in a long format, where each row holds a single statistic for a single country. Several rows therefore correspond to one country, which complicates data processing. One simple approach is to keep just one statistic, so that every row corresponds to a unique country. United Nations data is organized by "series", like "GDP per capita (US dollars)"; for each of the three datasets, one series is chosen.
4. Rename a column. Originally, the column with the country name has no label for some unknown reason. This makes writing code difficult later on, so the label "country" is assigned.
5. Remove unnecessary columns. Filtering the data makes some of the columns redundant. For example, if all of the data is now from 2021, then the "Year" column no longer provides useful information.
6. Rename the remaining columns. These new names are based on the information series chosen earlier.
7. Join the three datasets together to form one, complete dataset.
8. Convert the columns to the Float64 data type. The values were read in as strings, which makes comparing numbers in a later step tricky, so the commas in the strings are removed and the columns are cast to Float64.
9. Display the final dataset.

In [None]:
# data.un.org
tourism = polars.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB66_176_202310_Tourist-Visitors%20Arrival%20and%20Expenditure.csv", encoding = "latin-1", skip_rows = 1)
gdp = polars.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB66_230_202310_GDP%20and%20GDP%20Per%20Capita.csv", encoding = "latin-1", skip_rows = 1)
crime = polars.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB66_328_202310_Intentional%20homicides%20and%20other%20crimes.csv", encoding = "latin-1", skip_rows = 1, infer_schema = False)

tourism = tourism.filter(tourism["Year"] == 2021)
gdp = gdp.filter(gdp["Year"] == 2021)
crime = crime.filter(crime["Year"] == "2021")

tourism = tourism.filter(tourism["Series"] == "Tourist/visitor arrivals (thousands)")
gdp = gdp.filter(gdp["Series"] == "GDP per capita (US dollars)")
crime = crime.filter(crime["Series"] == "Assault rate per 100,000 population")

tourism = tourism.rename({"": "country"})
gdp = gdp.rename({"": "country"})
crime = crime.rename({"": "country"})

tourism = tourism.drop("Region/Country/Area", "Year", "Series", "Tourism arrivals series type", "Tourism arrivals series type footnote", "Footnotes", "Source")
gdp = gdp.drop("Region/Country/Area", "Year", "Series", "Footnotes", "Source")
crime = crime.drop("Region/Country/Area", "Year", "Series", "Footnotes", "Source")

tourism = tourism.rename({"Value": "tourist_arrivals_thousands"})
gdp = gdp.rename({"Value": "gdp_per_capita"})
crime = crime.rename({"Value": "assault_rate_per_100000"})

two = tourism.join(gdp, on = "country")
all = two.join(crime, on = "country")

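# Remove every thousands separator (str.replace only replaces the first match),
# then cast the cleaned strings to Float64.
all = all.with_columns([
    polars.col("tourist_arrivals_thousands").str.replace_all(",", "").cast(polars.Float64),
    polars.col("gdp_per_capita").str.replace_all(",", "").cast(polars.Float64),
    polars.col("assault_rate_per_100000").str.replace_all(",", "").cast(polars.Float64),
])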

all
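
To make the filter, join, and convert steps easier to follow, here is a minimal pure-Python sketch of the same idea on a handful of made-up rows. The country names, values, and the helper names `select_one_series` and `inner_join` are all hypothetical; the notebook itself does this work with polars.

```python
# Made-up rows in the "long" layout: one row per (country, year, series).
# Names and numbers are purely illustrative, not real UN data.
tourism_rows = [
    {"country": "Country A", "Year": 2021, "Series": "Tourist/visitor arrivals (thousands)", "Value": "1,234,567"},
    {"country": "Country A", "Year": 2020, "Series": "Tourist/visitor arrivals (thousands)", "Value": "900"},
    {"country": "Country B", "Year": 2021, "Series": "Tourist/visitor arrivals (thousands)", "Value": "321"},
]
gdp_rows = [
    {"country": "Country A", "Year": 2021, "Series": "GDP per capita (US dollars)", "Value": "45,000"},
    {"country": "Country A", "Year": 2021, "Series": "Some other series", "Value": "2,000,000"},
    {"country": "Country C", "Year": 2021, "Series": "GDP per capita (US dollars)", "Value": "8,500"},
]


def select_one_series(rows, year, series):
    """Keep one statistic for one year and return {country: float_value}."""
    selected = {}
    for row in rows:
        if row["Year"] == year and row["Series"] == series:
            # Python's str.replace removes every comma before the float conversion.
            selected[row["country"]] = float(row["Value"].replace(",", ""))
    return selected


def inner_join(left, right):
    """Keep only countries present in both tables, pairing their values."""
    return {country: (left[country], right[country]) for country in left if country in right}


tourism_2021 = select_one_series(tourism_rows, 2021, "Tourist/visitor arrivals (thousands)")
gdp_2021 = select_one_series(gdp_rows, 2021, "GDP per capita (US dollars)")
joined = inner_join(tourism_2021, gdp_2021)
print(joined)
```

Only "Country A" survives here because an inner join drops any country missing from either table. The same effect explains why the joined dataset can contain fewer countries than any single UN file.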

#### Step 3: Create a Survey to Get User Information
To tailor its recommendations, the recommender system needs some information from the user. Run the cell below to create an "intake survey" of sorts that asks about previous vacations. After running the cell, fill out the survey yourself.

The code below accomplishes the following:
1. Create two lists, one with countries and one with months of the year.
2. Create two dropdown menus for the survey.
3. Initialize an empty dictionary that will later be used to store information about previous vacations that the user has taken.
4. Define a function to add survey results to the dictionary when the "Submit" button is clicked.
5. Define another function that passes the survey information into the previously defined function.
6. Create a submit button that utilizes the function defined in step #5.
7. Print out some directions for users to help them fill out the survey.
8. Display the dropdown menus and the submit button!

In [None]:
countries = all["country"].to_list()
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

country_dropdown = ipywidgets.Dropdown(options = countries, description = "Country:")
month_dropdown = ipywidgets.Dropdown(options = months, description = "Month:")

past_vacations = {}

def submit_survey(month, country):
    if month in past_vacations:
        past_vacations[month].append(country)
    else:
        past_vacations[month] = [country]
    print(f"Adding {month} trip to {country}. . .")
    print(f"Vacations: {past_vacations}")

def on_submit(button):
    submit_survey(month_dropdown.value, country_dropdown.value)

submit_button = ipywidgets.Button(description = "Submit")
submit_button.on_click(on_submit)

print("Please use the form below to enter information")
print("about previous vacations that you enjoyed.")
print("Select the month that you traveled, along with")
print("the country that you visited. If your trip was")
print("longer than a month, then put down the month")
print("that you enjoyed the most. Fill out the form as")
print("many times as you need to. If you visited a")
print("country several times, please fill out the form")
print("for each time you visited.")
print("")

IPython.display.display(month_dropdown, country_dropdown, submit_button)
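
The bookkeeping behind the "Submit" button can be tried without any widgets. The sketch below is a compact variant that uses `dict.setdefault`; the function name `record_vacation` and the sample trips are hypothetical.

```python
def record_vacation(vacations, month, country):
    """Append a country to the list stored under its month, creating the list if needed."""
    vacations.setdefault(month, []).append(country)
    return vacations


example = {}
record_vacation(example, "July", "Japan")
record_vacation(example, "July", "Italy")
record_vacation(example, "December", "Japan")
print(example)  # {'July': ['Japan', 'Italy'], 'December': ['Japan']}
```

Each month maps to a list rather than a single value, so repeat visits in the same month are all kept.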

#### Step 4: Compute Cosine Similarity
Next, the system needs to determine which countries are similar based on the United Nations data prepared earlier. The popular library scikit-learn has a function for computing the cosine similarity of numeric data. Run this cell now to store the results in a data frame in which each row and each column represents a country, so that each entry compares one pair of countries. Values closer to one indicate that the two countries are more similar.

In [None]:
import sklearn.metrics.pairwise
cosine_similarity = polars.DataFrame(sklearn.metrics.pairwise.cosine_similarity(all[:, 1:]))

# Map the default column names to the country names, in the same order.
new_names_dictionary = dict(zip(cosine_similarity.columns, all["country"]))

cosine_similarity = cosine_similarity.rename(new_names_dictionary)
cosine_similarity = cosine_similarity.insert_column(0, all["country"])
cosine_similarity
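
For intuition about the numbers in this table, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand in plain Python; the feature vectors are made up and the function name `cosine` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical features: (arrivals in thousands, GDP per capita, assault rate).
small = [100.0, 20000.0, 50.0]
large = [200.0, 40000.0, 100.0]      # exactly twice `small`, so same direction
different = [5000.0, 2000.0, 300.0]  # a different mix of the three features

print(cosine(small, large))      # very close to 1: direction matters, not size
print(cosine(small, different))  # noticeably lower
```

Because the measure compares direction rather than magnitude, a country and one with exactly double every statistic count as essentially identical. With unscaled columns, the column holding the largest numbers also tends to drive the result. Rescaling the features first is one possible refinement, though the notebook does not do so.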

#### Step 5: Get a List of Similar Countries
The code in this cell uses the list of countries that the user provided earlier to produce a new list of countries that are similar. It uses the data frame generated above, and for each unique country the user provided, the cell below gets the three most similar countries. Run this cell now to get a personalized list.

In [None]:
# Flatten the per-month lists into a list of unique countries the user has visited.
past_countries = list({country for countries in past_vacations.values() for country in countries})
similar_countries = []

for country in past_countries:
  most_similar = cosine_similarity.sort(by = country, descending = True).select(["country", country]).slice(1, 3)
  for similar_country in most_similar["country"]:
    similar_countries.append(similar_country)

similar_countries = list(set(similar_countries))
similar_countries
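
The "three most similar" rule can be sketched on a single row of similarity scores. The scores and the function name `most_similar` below are hypothetical, and this version excludes the target country by name instead of skipping the first sorted row.

```python
def most_similar(similarity_row, target, k=3):
    """Return the k names with the highest similarity to `target`, excluding `target` itself."""
    others = {name: score for name, score in similarity_row.items() if name != target}
    return sorted(others, key=others.get, reverse=True)[:k]


# Made-up similarity scores between "Country A" and every country, itself included.
scores = {
    "Country A": 1.0,
    "Country B": 0.97,
    "Country C": 0.42,
    "Country D": 0.99,
    "Country E": 0.80,
}
print(most_similar(scores, "Country A"))  # ['Country D', 'Country B', 'Country E']
```

Filtering by name does not depend on the country itself sorting first. That may matter if another country ties with it at a similarity of one.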

#### Step 6: Find the User's Most Frequently Traveled Month
This step is simple compared to the others. The code tallies the vacations recorded in the survey dictionary to determine the month in which the user has taken the most trips. Run this cell, and then move on to the final step!

In [None]:
past_months = {key: len(value) for key, value in past_vacations.items()}
most_frequent_month = max(past_months, key = past_months.get)
most_frequent_month
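
The same tally can be written with `collections.Counter`. The sketch below also returns `None` when the survey is empty, a case in which calling `max` on an empty dictionary would raise an error. The function name `busiest_month` and the sample data are hypothetical.

```python
from collections import Counter


def busiest_month(vacations):
    """Return the month with the most recorded trips, or None if nothing was recorded."""
    counts = Counter({month: len(countries) for month, countries in vacations.items()})
    if not counts:
        return None
    return counts.most_common(1)[0][0]


sample = {"July": ["Japan", "Italy"], "December": ["Japan"]}
print(busiest_month(sample))  # July
print(busiest_month({}))      # None
```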

#### Step 7: Provide Final Recommendations
To put everything together, this cell prints a suggestion that the user take a vacation in their most frequently traveled month to one of the similar countries found earlier. The bulleted list provides some fun suggestions that can inspire users to go outside of their comfort zone!

In [None]:
print(f"For your next vacation, you should travel in {most_frequent_month}")
print("to one of the countries below:")

for country in similar_countries:
  print(f"• {country}")