# The purpose of this project is to find similarities among the most popular attractions in the data (not limited to one city).

**Similarity groups are needed in Marketplace to display attractions that are similar to the one the user selected.**

1. Download "all_attractions.csv", which contains the data for all the attractions in our inventory (already downloaded to Drive).

2. Preprocess the data, mainly to remove repeated rows.

3. Extract the X most popular attractions from each supplier and merge them into one DataFrame.

4. Run the similarity model to extract similarity groups.

5. Investigate the groups obtained.

6. Run the duplicates model to extract duplicates.

7. Investigate the groups obtained.

Installations and imports

In [None]:
!pip install sentence_transformers
!pip install spacy
!pip install transformers


In [None]:
# import the files from google colab

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data_path = '/content/drive/MyDrive/ColabNotebooks/bridgify/duplicates_and_similarities/most_popular/all_attractions.csv'

In [None]:
import re
import json
import pandas as pd
from pandas import DataFrame
import numpy as np
from typing import Any, Dict, List
from sentence_transformers import SentenceTransformer, util

### 1+2. Download and preprocess the data

In [None]:

def unavailable_to_nan(df: DataFrame, col_list: List[str]) -> None:
  """
  change 'unavailable' to empty string in the specified columns

  Args:
    df: raw DataFrame of attractions
    col_list: list of text columns

  Returns:
    None

  """

  for col in col_list:
      df[col] = df[col].apply(lambda x: np.nan if x == 'unavailable' else x)
      df[col] = df[col].fillna("")


def remove_duplicates_and_nan(df: DataFrame) -> None:
  """
  Remove rows that share the same title, description and address,
  and rows with a missing 'text' value. The DataFrame is modified in place.

  Args:
    df: DataFrame of attractions

  Returns:
    None
  """
  print("Shape before removing duplicates:", df.shape)
  df.drop_duplicates(subset=['title', 'description', 'address'], inplace=True)
  df.dropna(subset=["text"], inplace=True)
  df.reset_index(inplace=True)
  print("Shape after removing duplicates:", df.shape)


def format_categories(df: pd.DataFrame) -> pd.Series:
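  """
  Transform each entry of the "categories_list" column into a list of categories

  Args:
    df: DataFrame of attractions

  Returns:
    a DataFrame column (Series) with a list of unique categories in each entry
  """

  def to_category_list(x):
      # entries that are already lists are kept as-is
      if isinstance(x, list):
          return x
      # strip brackets, braces and quotes, then split on commas
      cleaned = re.sub(r'[()\[\'"{}\]]', '', x).strip()
      return list({j.strip().title() for j in cleaned.split(",")})

  return df["categories_list"].apply(to_category_list)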


def strip_list(df: DataFrame, col: str):
  """
  Remove empty items from a list of each entry of the prediction column

  Args:
    df: DataFrame with a new column for the different tags_format
    col: str, the name of the new column with the new tags_format

  Returns:
    None
  """
  df[col] = df[col].apply(lambda x: [ele for ele in x if ele.strip()])


def data_preprocess(raw_df: DataFrame) -> DataFrame:
  """
  preprocess the raw DataFrame: update the name of the columns if needed,
  creates 'prediction' column with list of categories,
  creates 'text' column of joining the title and description,
  remove duplicate rows

  Args:
    raw_df: raw DataFrame of attractions

  Returns:
    Pre-processed DataFrame
  """
  raw_df = raw_df.rename(
      columns={"name": "title", "about": "description", "tags": "categories_list", "source": "inventory_supplier",
                "location_point": "geolocation"})
  if 'prediction' not in raw_df.columns:
      raw_df["prediction"] = format_categories(raw_df)
      strip_list(raw_df, "prediction")
      raw_df["prediction"] = raw_df["prediction"].apply(lambda x: str(x))

  unavailable_to_nan(raw_df, ["title", "description"])
  raw_df["text"] = raw_df["title"] + '. ' + raw_df["description"]
  remove_duplicates_and_nan(raw_df)
  print("The data were processed")
  return raw_df


In [None]:
attractions = pd.read_csv(data_path)
attractions.head()



Unnamed: 0,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,geolocation,...,external_city_name,additional_info_id,city_id,native_language,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant
0,3fa4ebbf-c39d-46ea-853d-882c94642332,2022-03-24 18:03:42.888 +0200,2022-03-29 13:14:12.163 +0300,From Catania: Etna Downhill Mountain Biking Ex...,Take part in an adrenaline-filled mountain bik...,,,,,POINT (15.25942 37.990372),...,,,,,,,False,,,
1,c92cc02c-dfbc-446c-9ca4-11725f4ee68c,2022-03-24 18:06:08.001 +0200,2022-03-29 13:14:12.163 +0300,Seville: Guadalquivir Yacht Tour w/ Drink & Fo...,Enjoy a small-group boat tour to sail down the...,,,,,POINT (-6.00181 37.387691),...,,,,,,,False,,,
2,139af3f5-ca5a-4a55-ab7c-3929ede3054e,2022-03-24 18:28:15.268 +0200,2022-03-29 13:31:16.852 +0300,Eagle's Nest Shooting Experience,This is our most basic package perfect for beg...,,,,,,...,,,,,,,False,,,
3,85625ef9-2cd6-4362-8534-b62470b52d38,2022-03-24 18:26:52.171 +0200,2022-03-29 13:31:16.852 +0300,Austin Bergstrom International Airport Ground ...,We offer First Class Limousine and luxury car ...,,,,,,...,,,,,,,False,,,
4,337a0281-92c7-45c5-96e7-dd23ba89d6ee,2022-03-24 19:08:13.242 +0200,2022-03-31 16:27:59.186 +0300,Visit The Mayan Towns Around Lake Atitlan On a...,Atitlan is known for the stunning views of its...,,,,,,...,Guatemala City,,326.0,,,,False,,,


In [None]:
attractions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 512663 entries, 0 to 512662
Data columns (total 36 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   uuid                    512663 non-null  object 
 1   created_at              512663 non-null  object 
 2   last_updated            512663 non-null  object 
 3   title                   512655 non-null  object 
 4   description             485032 non-null  object 
 5   translation_status      14876 non-null   object 
 6   native_name             0 non-null       float64
 7   native_about            0 non-null       float64
 8   address                 227439 non-null  object 
 9   geolocation             200142 non-null  object 
 10  main_photo_url          509557 non-null  object 
 11  availability_type       144521 non-null  object 
 12  inventory_supplier      512663 non-null  object 
 13  duration                512663 non-null  object 
 14  rating              

In [None]:
attractions = data_preprocess(attractions)

Shape before removing duplicates: (512663, 38)
Shape after removing duplicates: (501384, 39)
The data were processed
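
To see what the preprocessing does on a small scale, here is a minimal sketch on a toy DataFrame. The rows are invented for illustration, and the helper `toy_preprocess` is hypothetical: it only mirrors the three core steps of `data_preprocess` ('unavailable' to empty string, building `text`, dropping rows that repeat title, description and address).

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from all_attractions.csv).
toy = pd.DataFrame({
    "title": ["Louvre Museum", "Louvre Museum", "City Walk"],
    "description": ["Skip the line.", "Skip the line.", "unavailable"],
    "address": ["Paris", "Paris", "Rome"],
})


def toy_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical, simplified stand-in for data_preprocess."""
    out = df.copy()
    for col in ["title", "description"]:
        # Treat the literal string 'unavailable' and missing values as empty text.
        out[col] = out[col].replace("unavailable", "").fillna("")
    out["text"] = out["title"] + ". " + out["description"]
    # Keep the first of any rows that share title, description and address.
    out = out.drop_duplicates(subset=["title", "description", "address"])
    return out.reset_index(drop=True)


processed = toy_preprocess(toy)
print(processed["text"].tolist())
```

Unlike the notebook's `reset_index(inplace=True)`, which keeps the old index as an extra `index` column (hence the column count growing from 38 to 39 above), the sketch drops the old index.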


3. Extract the X most popular attractions from each supplier and merge them into one DataFrame

In [None]:
relevant_suppliers = ['Getyourguide', 'Viator', 'Musement', 'Tiqets'] # according to product team
attractions['inventory_supplier'].unique()

array(['Getyourguide', 'Viator', 'Ticketmaster', 'GoogleMaps', 'Musement',
       'Eventim', 'Tiqets', 'SportsEvents365', 'Bajabikes', 'Tickitto'],
      dtype=object)

In [None]:
# set the number of attractions per supplier
X = 20

all_most_popular_attractions = pd.DataFrame()

for supplier in relevant_suppliers:
  # extract the X most popular attractions (by number of reviews) from a specific supplier
  most_pop_supplier = attractions[attractions["inventory_supplier"] == supplier].sort_values(by="number_of_reviews", ascending=False)[:X].copy()
  # re-sort ascending, so rank 0 is the attraction with the fewest reviews among the top X
  most_pop_supplier.sort_values(by="number_of_reviews", ascending=True, inplace=True)
  most_pop_supplier["rank"] = list(range(most_pop_supplier.shape[0]))
  # add to all most popular
  all_most_popular_attractions = pd.concat([all_most_popular_attractions, most_pop_supplier])
  print("all_most_popular_attractions shape:", all_most_popular_attractions.shape)


def all_most_popular(attractions: DataFrame, num_attractions_per_supplier: int) -> DataFrame:
  """Same logic as the loop above, wrapped in a function (reads the global relevant_suppliers)."""
  all_most_popular_attractions = pd.DataFrame()
  for supplier in relevant_suppliers:
    # extract the most popular attractions (by number of reviews) from a specific supplier
    most_pop_supplier = attractions[attractions["inventory_supplier"] == supplier].sort_values(by="number_of_reviews", ascending=False)[:num_attractions_per_supplier].copy()
    most_pop_supplier.sort_values(by="number_of_reviews", ascending=True, inplace=True)
    most_pop_supplier["rank"] = list(range(most_pop_supplier.shape[0]))
    # add to all most popular
    all_most_popular_attractions = pd.concat([all_most_popular_attractions, most_pop_supplier])
    print("all_most_popular_attractions shape:", all_most_popular_attractions.shape)
  return all_most_popular_attractions

all_most_popular_attractions shape: (20, 40)
all_most_popular_attractions shape: (40, 40)
all_most_popular_attractions shape: (60, 40)
all_most_popular_attractions shape: (80, 40)


In [None]:
all_most_popular_attractions["rank"].unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
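
The per-supplier loop above can also be written without an explicit loop. The following is a sketch under assumptions, not the notebook's method: the toy data and review counts are invented, and it relies on `groupby(...).head(X)` keeping the first X rows of each group in the already sorted order. It does not add the `rank` column.

```python
import pandas as pd

# Invented review counts, for illustration only.
toy = pd.DataFrame({
    "inventory_supplier": ["Viator", "Viator", "Viator", "Tiqets", "Tiqets", "Eventim"],
    "title": ["a", "b", "c", "d", "e", "f"],
    "number_of_reviews": [10, 300, 50, 7, 90, 999],
})
relevant_suppliers = ["Viator", "Tiqets"]
X = 2

top_x = (
    toy[toy["inventory_supplier"].isin(relevant_suppliers)]  # drop other suppliers
    .sort_values("number_of_reviews", ascending=False)       # most reviewed first
    .groupby("inventory_supplier", sort=False)
    .head(X)                                                  # first X rows per supplier
    .reset_index(drop=True)
)
print(top_x)
```

Suppliers with fewer than X attractions simply contribute all of their rows, which matches the slicing behaviour of `[:X]` in the loop.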

4. Run the similarity model to extract similarity groups

**similarity model**
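
The model scores every pair of attractions by the cosine similarity of their text embeddings and keeps the pairs above a threshold (0.65 in `main`). As a rough illustration of that scoring step, here is a NumPy sketch; the 3-dimensional vectors are made up and stand in for real sentence embeddings, and NumPy is used in place of `util.cos_sim` purely for illustration.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" for four texts (real ones come from the model).
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# Cosine similarity is the dot product of L2-normalised vectors.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
cosine = unit @ unit.T

# Every unordered pair (i < j) appears once: n * (n - 1) / 2 pairs in total.
rows, cols = np.triu_indices(len(embeddings), k=1)
pairs = sorted(
    zip(rows.tolist(), cols.tolist(), cosine[rows, cols].tolist()),
    key=lambda p: p[2],
    reverse=True,
)

threshold = 0.65  # same value as in the notebook's main()
similar_pairs = [(i, j) for i, j, score in pairs if score > threshold]
print(len(pairs), similar_pairs)
```

With four vectors there are six pairs; only the two pairs of nearly parallel vectors pass the threshold.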

Transform DataFrame to dict

In [None]:
all_most_popular_attractions_dict = all_most_popular_attractions.to_dict('records')

In [None]:
import uuid
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util






def model_embedding(df, col):
  """
  Calculate the embeddings (as a torch.Tensor) of each entry in the given text column using SentenceTransformers

  Args:
    df: preprocessed DataFrame
    col: str, the name of the text column according to which the embeddings will be calculated

  Returns:
    torch.Tensor
  """
  model = SentenceTransformer('all-MiniLM-L6-v2')
  print("model:", type(model))

  # Single list of sentences
  sentences = df[col].values
  print("sentences:", type(sentences))
  # Compute embeddings
  embeddings = model.encode(sentences, convert_to_tensor=True)  # each text transforms to a vector
  print("embedd:", type(embeddings))
  print("finished embeddings")
  return embeddings
  

def pairs_df_model(embeddings):
  """
  Compute the cosine similarity of each embedded vector with every other embedded vector

  Args:
    embeddings: torch.Tensor with the embeddings of the text column

  Returns:
    DataFrame with columns: 'index' ([i, j] pair), 'score' (cosine score of the two vectors),
    'ind1' (first vector index), 'ind2' (second vector index).
    (The DataFrame has n(n-1)/2 rows, one for each unordered pair out of n vectors)

  """
  cosine_scores = util.cos_sim(embeddings, embeddings)
  print("cosine scores:", type(cosine_scores))

  # Collect every unordered pair (i < j) with its cosine similarity score
  pairs = []
  for i in range(len(cosine_scores) - 1):
      for j in range(i + 1, len(cosine_scores)):
          # .item() converts the 0-dim tensor to a plain Python float
          pairs.append({'index': [i, j], 'score': cosine_scores[i][j].item()})

  # Sort scores in decreasing order
  pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

  # transform to DataFrame and split the pairs into two columns: 'ind1', 'ind2'
  pairs_df = pd.DataFrame(pairs)

  pairs_df["ind1"] = pairs_df["index"].apply(lambda x: x[0]).values
  pairs_df["ind2"] = pairs_df["index"].apply(lambda x: x[1]).values

  return pairs_df


def similarity_matrix(similarity_idx_df, reduced_df):
  """
  Creates an n x n similarity matrix. Each attraction has a similarity score in relation to every attraction in the data

  Args:
    similarity_idx_df: DataFrame output of the function pairs_df_model
    reduced_df: preprocessed DataFrame

  Returns:
    square DataFrame. columns = index = the indices of the attractions. values: similarity score
  """
  n = reduced_df.shape[0]
  # start from the identity matrix: every attraction has similarity 1 with itself
  matrix = pd.DataFrame(np.eye(n), index=range(n), columns=range(n))
  # fill both triangles from the (ind1, ind2, score) rows; .iat avoids chained assignment
  for i, j, score in zip(similarity_idx_df["ind1"], similarity_idx_df["ind2"], similarity_idx_df["score"]):
    matrix.iat[i, j] = float(score)
    matrix.iat[j, i] = float(score)
  return matrix


def change_idx_and_cols(similarity_matrix, df, col):
  """
  Rename the columns and the index of the similarity matrix to the values of the specified column

  Args:
    similarity_matrix: square pd.DataFrame with the similarity score of each attraction with every attraction
    df: pd.DataFrame of the attractions
    col: the name of the column whose values become the new column and index labels

  Returns:
    list of dictionaries of similarity scores
  """
  similarity_matrix[col] = df[col]
  similarity_matrix = similarity_matrix.set_index(col)
  similarity_matrix.columns = similarity_matrix.index

  return similarity_matrix.to_dict('records')


def groups_idx(similarity_df):
    """
    Creates a list of sets; each set is a similarity group which contains attraction indices.
    A group consists of the pairs of a particular index and the pairs of its pairs.
    There is no overlap of indices between the groups.

    Args:
      similarity_df: DataFrame output of the function pairs_df_model

    Returns:
      a list of sets. Each set contains attraction indices and represents a similarity group
    """
    sets_list = list()

    # go over all the index pairs in the dataframe
    for idx in similarity_df["index"].values:
        pair = set(idx)

        # the group that will hold the pair and every existing group it touches
        merged = set(pair)
        # the existing groups that have no index in common with the pair
        untouched = []

        for group in sets_list:
            if pair & group:
                # the pair intersects this group: absorb the whole group
                merged.update(group)
            else:
                untouched.append(group)

        # build a new list instead of removing items from sets_list while iterating over it,
        # which could skip a group that should have been merged
        untouched.append(merged)
        sets_list = untouched

    return sets_list



def groups_df(similarity_df_above_threshold, df):  
    """
    Creates a list of records with the 'id' (uuid) and 'similarity_uuid' of the attractions
    which have a similarity score above the threshold

    Args:
      similarity_df_above_threshold: the output of pairs_df_model filtered to 'score' > threshold
      df: pre-processed DataFrame of the attractions

    Returns:
      a list of dictionaries with the keys 'id' and 'similarity_uuid'
    """

    display_columns = ['uuid']

    # extract the unique indices that appear in at least one pair above the threshold
    above_threshold_idx = sorted({int(i) for pair in similarity_df_above_threshold["index"] for i in pair})
    print("above threshold..:", above_threshold_idx)

    # extract the relevant rows from the dataframe (copy, so the assignments below do not touch df)
    df_above_threshold = df.loc[above_threshold_idx, display_columns].copy()
    df_above_threshold.columns = ["id"]

    df_above_threshold['similarity_uuid'] = None

    # divide the indices into groups according to similarity
    groups_list = groups_idx(similarity_df_above_threshold)

    # give all the attractions of a group the same random uuid
    # (a single .loc call avoids the chained assignment that triggers SettingWithCopyWarning)
    for group in groups_list:
        df_above_threshold.loc[list(group), 'similarity_uuid'] = str(uuid.uuid4())

    similarity_groups_json = df_above_threshold.to_dict('records')
    return similarity_groups_json


def main(data):

    df_reduced = pd.DataFrame.from_dict(data)

    # Creating similarity DataFrame according to 'text'
    embeddings_text = model_embedding(df_reduced, "text")
    similarity_df = pairs_df_model(embeddings_text)

    # create a square matrix of the similarity scores
    similarity_matrix_text = similarity_matrix(similarity_df, df_reduced)
    similarity_matrix_text.to_csv("similarity_matrix.csv")
    similarity_matrix_text = change_idx_and_cols(similarity_matrix_text, df_reduced, "title")

    # keep only the pairs whose cosine score (computed on the 'text' column) is above the threshold
    similarity_threshold = 0.65
    similarity_df_above_threshold = similarity_df[similarity_df["score"] > similarity_threshold]

    # group the attractions above the threshold and assign a uuid to each group
    similarity_df_json = groups_df(similarity_df_above_threshold, df_reduced)

    return similarity_df_json, similarity_matrix_text

if __name__ == "__main__":
  data = all_most_popular_attractions_dict
  similarity_json, similarity_matrix_dict = main(data)



model: <class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
sentences: <class 'numpy.ndarray'>
embedd: <class 'torch.Tensor'>
finished embeddings
cosine scores: <class 'torch.Tensor'>
above threshold..: [1, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 23, 28, 34, 35, 37, 38, 39, 41, 43, 44, 46, 47, 48, 50, 51, 52, 55, 56, 58, 59, 65, 66, 67, 70, 71, 73, 76, 77, 78, 79]
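
The grouping step treats the above-threshold pairs as edges of a graph and puts every chain of linked attractions into one group. A common way to compute such groups is a union-find (disjoint-set) structure. The sketch below is an alternative formulation, not the notebook's `groups_idx`; the function name `connected_groups` and the example pairs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def connected_groups(pairs: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Group indices so that any two indices linked by a chain of pairs share a group."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: Dict[int, Set[int]] = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Invented index pairs: 1-5 and 5-8 chain into one group, 9-10 is separate,
# and 8-1 closes a loop without creating a new group.
example_pairs = [(1, 5), (9, 10), (5, 8), (8, 1)]
print(sorted(sorted(g) for g in connected_groups(example_pairs)))
# prints [[1, 5, 8], [9, 10]]
```

Because membership is transitive, two attractions can share a group without scoring above the threshold against each other. Chaining of this kind may explain groups that mix loosely related items.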




Add the 'similarity_uuid' column to the data in order to investigate the similarity groups

In [None]:
similarity_groups_df = pd.DataFrame.from_dict(similarity_json)
similarity_groups_df.rename(columns={'id': 'uuid'}, inplace=True)
# inner merge: only the attractions that belong to a similarity group
data_with_similarity = pd.merge(all_most_popular_attractions, similarity_groups_df, on='uuid', how='inner')
# outer merge: all the attractions; 'similarity_uuid' is NaN for those without a group
data_with_without_similarity = pd.merge(all_most_popular_attractions, similarity_groups_df, on='uuid', how='outer')
data_with_similarity.sort_values(by="similarity_uuid", inplace=True)

In [None]:
data_with_similarity.columns

Index(['index', 'uuid', 'created_at', 'last_updated', 'title', 'description',
       'translation_status', 'native_name', 'native_about', 'address',
       'geolocation', 'main_photo_url', 'availability_type',
       'inventory_supplier', 'duration', 'rating', 'number_of_reviews',
       'is_free', 'price', 'order_webpage', 'hotel_pickup', 'is_active',
       'is_itinerary_resource', 'is_curated', 'is_accessible',
       'categories_list', 'external_id', 'external_city_name',
       'additional_info_id', 'city_id', 'native_language',
       'similarity_group_id', 'currency', 'is_city_processed',
       'is_relevant_for_adult', 'is_relevant_for_child',
       'is_relevant_for_infant', 'prediction', 'text', 'rank',
       'similarity_uuid'],
      dtype='object')

5. Investigate the groups obtained

Check how many groups were found in the data

In [None]:
data_with_similarity["similarity_uuid"].nunique()

12

Check how many attractions belong to the similarity groups

In [None]:
data_with_similarity.shape

(44, 41)
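
Beyond the two totals above (12 groups, 44 attractions), a per-group summary can make the investigation easier. The following is a sketch on invented data; the toy frame stands in for `data_with_similarity`, and the summary column names (`n_attractions`, `n_suppliers`, `min_price`, `max_price`) are hypothetical.

```python
import pandas as pd

# Invented stand-in for data_with_similarity (group ids shortened for readability).
toy = pd.DataFrame({
    "similarity_uuid": ["g1", "g1", "g1", "g2", "g2"],
    "uuid": ["u1", "u2", "u3", "u4", "u5"],
    "inventory_supplier": ["Tiqets", "Viator", "Viator", "Musement", "Musement"],
    "price": [20.0, 25.0, 30.0, 42.0, 48.0],
})

# One row per group: size, number of distinct suppliers and price range.
summary = toy.groupby("similarity_uuid").agg(
    n_attractions=("uuid", "count"),
    n_suppliers=("inventory_supplier", "nunique"),
    min_price=("price", "min"),
    max_price=("price", "max"),
)
print(summary)
```

A group with `n_suppliers == 1` contains attractions from a single source only, which is worth checking when the goal is to compare offers across suppliers.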

In [None]:
data_with_similarity_selected_columns = data_with_similarity[["similarity_uuid", "uuid", "title", "description", "inventory_supplier", "price"]]
for group in data_with_similarity["similarity_uuid"].unique():
  display(data_with_similarity_selected_columns[data_with_similarity_selected_columns["similarity_uuid"] == group])


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
41,0416037e-5112-41c4-9bcc-ef9364c74c93,efd75582-2146-4c50-bd47-ee8404b5b468,Louvre Museum: E-Ticket,These Louvre tickets give you effortless entry...,Tiqets,20.0
10,0416037e-5112-41c4-9bcc-ef9364c74c93,9b769039-abfd-434d-a3ed-ffb6f48e4714,Paris: Louvre Museum Timed-Entrance Ticket,Gain entrance to the Louvre Museum in Paris an...,Getyourguide,18.68


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
0,0a5942b3-814d-4034-bdf9-f61ec566c646,eab900b4-b4fb-4424-9a82-b6d797382eb4,"Dubai: Red Dune Safari, Camel Ride, Sandboard ...",Experience a thrilling 4X4 ride through the re...,Getyourguide,43.56
18,0a5942b3-814d-4034-bdf9-f61ec566c646,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,"Premium Red Dunes, Camel Safari & BBQ at Al Kh...",Spend an evening at the one & only Al Khayma D...,Viator,98.6
8,0a5942b3-814d-4034-bdf9-f61ec566c646,e1bf79d3-0ca0-4376-b6d1-3f28c401f840,"Dubai: Premium Red Dunes, Camel Safari, & BBQ ...",Escape Dubai for an unforgettable desert safar...,Getyourguide,87.22
4,0a5942b3-814d-4034-bdf9-f61ec566c646,8dba7be3-fa90-4d2d-a294-8795121be192,"Dubai: Red Dune Safari, Camel Riding, Sandboar...",Escape Dubai and drive across the Red Arabian ...,Getyourguide,36.75


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
40,3e8a89c3-febc-44be-a078-34a212a7fac6,7c1685db-a67e-4b19-a8a6-e443d7535056,Park Güell,Want to see the most flamboyant and famous par...,Tiqets,13.5
3,3e8a89c3-febc-44be-a078-34a212a7fac6,f10b68e1-6808-4b73-9729-aa87658713ef,Barcelona: Park Güell Admission Ticket,"Visit Park Güell, one of Gaudí’s major works i...",Getyourguide,10.99


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
36,4faa8efe-85f2-42cd-bfdd-191f2d9da70e,ace96b8c-ea17-486e-9ea5-38f0a921c1e7,Pompeii: Reserved Entrance,The eruption of Mt. Vesuvius in 79 C.E. was a ...,Tiqets,21.0
22,4faa8efe-85f2-42cd-bfdd-191f2d9da70e,3dc05141-fb53-4a44-95c7-579bdc4162e2,Archaeological site of Pompeii small-group tou...,Both haunting and fascinating in equal measure...,Musement,42.0


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
23,600c4c27-e8f0-47c1-8476-65508a610565,00b71aac-7f17-473a-a567-7ae07ebeb5a0,Van Gogh Museum entrance and Amsterdam canal c...,This unique combo ticket will save you both ti...,Musement,38.0
39,600c4c27-e8f0-47c1-8476-65508a610565,af03cd7f-e44a-46ed-98ca-767a709bc558,Van Gogh Museum,See the largest collection of Van Gogh's paint...,Tiqets,21.0
2,600c4c27-e8f0-47c1-8476-65508a610565,c74f0fd4-582b-4488-b891-f298784a8d63,Amsterdam: Van Gogh Museum Ticket,Don't miss out on the Van Gogh Museum. With th...,Getyourguide,23.07
17,600c4c27-e8f0-47c1-8476-65508a610565,4d2168fc-0c86-40a4-9ab7-a314e9817cce,Amsterdam Open Boat Canal Cruise - Live Guide ...,The best way of seeing historical Amsterdam is...,Viator,21.8


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
42,61275a36-cfe8-4ad0-b4a0-06104bc5670a,91366148-5ced-4bad-9ad2-b1f8ff6a4087,Vatican Museums & Sistine Chapel: Skip The Line,These skip-the-line tickets to the Vatican Cit...,Tiqets,26.4
20,61275a36-cfe8-4ad0-b4a0-06104bc5670a,d12cfdbe-c16c-47e3-8146-f48880fdde1f,Skip-the-Line: Vatican Museums & Sistine Chape...,Spend more time inside with no-wait access to ...,Viator,54.79
31,61275a36-cfe8-4ad0-b4a0-06104bc5670a,3c19f0fc-493d-4068-80f0-62f5c007ce84,Essential Vatican guided tour: Skip-the-line V...,Get the most out of your visit and discover al...,Musement,37.5
13,61275a36-cfe8-4ad0-b4a0-06104bc5670a,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,Vatican: Museums & Sistine Chapel Entrance Ticket,See priceless works of art from the Papal coll...,Getyourguide,28.56
25,61275a36-cfe8-4ad0-b4a0-06104bc5670a,a9ed7116-daf5-41c6-a96e-a64c016b8614,Vatican Museums skip the line tickets with esc...,The invaluable masterpieces of the Vatican Mus...,Musement,27.0
12,61275a36-cfe8-4ad0-b4a0-06104bc5670a,13923a3e-8a52-446a-821b-c0a30b747389,Vatican Museum and Sistine Chapel Tour,Join a 3-hour tour of the Vatican with fast-tr...,Getyourguide,51.63
5,61275a36-cfe8-4ad0-b4a0-06104bc5670a,d155bc30-142a-4afa-b34b-80f74fed1351,Skip-the-Ticket-Line Vatican Tour and Sistine ...,Spend more time inside when you skip the ticke...,Getyourguide,53.39
16,61275a36-cfe8-4ad0-b4a0-06104bc5670a,323cd91a-8deb-49aa-8a63-105e3ca9c0fd,"Fast Track - Vatican Tour with Museums, Sistin...",See the highlights of Vatican City with an exp...,Viator,54.68


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
34,66073985-3b43-406b-aac8-9746cf153d45,413d0154-3849-4053-b4c7-8d937f525957,Casa Batlló: Standard Entrance (Blue),"Casa Batlló's nature-inspired facade, brillian...",Tiqets,35.0
26,66073985-3b43-406b-aac8-9746cf153d45,48cc9c3d-0619-45b7-b530-103ed4c0b4e6,Casa Batlló 10D Experience Blue tickets,Casa Batlló is one of Barcelona's most emblema...,Musement,42.0


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
32,ad6b2d23-fc72-423d-ad9d-fb53f491a158,4782c74d-9369-4d34-98b0-340f3038245c,Skip-the-line tickets for the Uffizi Gallery,The Uffizi Gallery is one of the most famous a...,Musement,42.0
27,ad6b2d23-fc72-423d-ad9d-fb53f491a158,8df52e75-7e98-44f1-86e3-46e62ad11f32,"Skip-the-line combo tickets to Uffizi Gallery,...",Don't miss the opportunity to access the most ...,Musement,48.0


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
33,b6a81ef7-917c-4dc9-ab02-25c922ae3f25,a23f64b0-87e6-4e41-9498-1e565b95b1f4,Sagrada Familia entrance tickets,"Discover the cathedral of the Sagrada Familia,...",Musement,37.0
21,b6a81ef7-917c-4dc9-ab02-25c922ae3f25,7fc8f93e-e7ad-4e40-817e-0cc0a83093a0,Guided tour of Sagrada Familia with entrance t...,Barcelona is known as the capital of Modernism...,Musement,65.0
9,b6a81ef7-917c-4dc9-ab02-25c922ae3f25,08a4d0e7-70f5-4970-84e3-da44d8d1a58a,Barcelona: Sagrada Familia Fast-Track Access T...,Gain fast track entrance to Gaudi's unfinished...,Getyourguide,34.72
24,b6a81ef7-917c-4dc9-ab02-25c922ae3f25,d2cf519f-e2ce-4b98-b577-330e31af3a25,Sagrada Familia tickets and guided visit,Barcelona is known as the capital of Modernism...,Musement,50.0
43,b6a81ef7-917c-4dc9-ab02-25c922ae3f25,5ddb203c-cb0a-4429-bb43-c2410a9f66f5,Sagrada Familia: Fast Track,Make seeing the Sagrada Familia the first thin...,Tiqets,33.8


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
29,cebf99b2-e7d3-4752-bac7-2339c95459f6,64f1bfc0-3f5b-45b8-8f56-cea0726d9856,Tour of Turin with tickets and guided tour of ...,Visit the historic center of Turin while strol...,Musement,43.0
15,cebf99b2-e7d3-4752-bac7-2339c95459f6,506b1bc7-56cf-458e-8075-851962e90c83,Private Tour to Giza Pyramids and The Egyptian...,Explore Ancient Egypt all in one day. Take in ...,Viator,93.0


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
28,e889768c-cdb4-42a0-ba8e-7080314ef6ba,763931f1-8c03-4f5e-acdb-7df36f395ec1,"Colosseum, Roman Forum and Palatine Hill tour",Discover the glory of Ancient Rome on this tou...,Musement,42.2
30,e889768c-cdb4-42a0-ba8e-7080314ef6ba,12215024-ccc6-4286-b525-2b2dd05b7b6b,"Priority access to the Colosseum, Roman Forum ...",Visit Rome's most popular sites with one singl...,Musement,27.0
19,e889768c-cdb4-42a0-ba8e-7080314ef6ba,61753dfb-31e4-432b-9b47-56344bf3bed8,Skip the Line: Colosseum Small Group Tour with...,The ancient glory of Rome is reborn! Skip the ...,Viator,48.08
14,e889768c-cdb4-42a0-ba8e-7080314ef6ba,efb07955-0028-47d2-8f20-ee5e3b18f548,Rome Hop-On Hop-Off Sightseeing Tour,Let the sights of Rome unfold before you on an...,Viator,30.19
35,e889768c-cdb4-42a0-ba8e-7080314ef6ba,73147dad-162c-4075-8638-b8256ca6b6df,"Colosseum, Roman Forum & Palatine Hill: Video ...","Take your time exploring the Colosseum, one of...",Tiqets,28.0
37,e889768c-cdb4-42a0-ba8e-7080314ef6ba,c8ad3073-cc0a-4d0c-a740-ed29f5e1b5b8,"Colosseum, Roman Forum & Palatine Hill: Priori...",Sink your sword into the very best of Rome wit...,Tiqets,29.0
7,e889768c-cdb4-42a0-ba8e-7080314ef6ba,fa7ca7c3-8811-4328-809b-3e35498f42cb,"Rome: Colosseum, Roman Forum, Palatine Hill Pr...","Get into the Colosseum, Palatine Hill, and Rom...",Getyourguide,23.62
6,e889768c-cdb4-42a0-ba8e-7080314ef6ba,c75ddc00-eb7f-4fce-968c-8fe79cf083ea,"Rome: Colosseum, Roman Forum, Palatine Hill Fa...",Explore Ancient Rome on a walking tour of the ...,Getyourguide,50.1


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,price
11,f9c695b3-4ec1-42fb-9e93-3049bb50d0c8,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,Getyourguide,44.65
38,f9c695b3-4ec1-42fb-9e93-3049bb50d0c8,1fa02e65-7527-42d5-a496-e0ae375e1feb,Burj Khalifa: Floors 124 & 125,Treat yourself – and those you're with – to so...,Tiqets,41.43
1,f9c695b3-4ec1-42fb-9e93-3049bb50d0c8,fae98f00-b8b3-457d-9f64-0d6b881c452d,Dubai: Aquarium & Burj Khalifa Levels 124/125 ...,See 2 of Dubai’s most significant sights with ...,Getyourguide,61.26


In [None]:
data_with_similarity_selected_columns.to_csv('marketplace_similarity.csv')

Explore the attractions that were not assigned to any similarity group

In [None]:
data_with_without_similarity[["similarity_uuid", "uuid", "inventory_supplier", "price", "text"]]

Unnamed: 0,similarity_uuid,uuid,inventory_supplier,price,text
0,,d7b5ba07-f121-45f7-80a2-508ae7c88720,Getyourguide,18.02,"Barcelona: 48, 72, 96, or 120-Hour Public Tran..."
1,0a5942b3-814d-4034-bdf9-f61ec566c646,eab900b4-b4fb-4424-9a82-b6d797382eb4,Getyourguide,43.56,"Dubai: Red Dune Safari, Camel Ride, Sandboard ..."
2,,a792f750-eb2e-400a-9def-b249593f4c27,Getyourguide,21.97,Versailles Palace & Gardens Full Access Ticket...
3,,bab1183a-3627-4755-bb82-e4f7cb5beed2,Getyourguide,44.49,Paris: Eiffel Tower Summit Direct Access by El...
4,,ec1d65e8-95bc-494d-9d8c-5b1887909f75,Getyourguide,9.34,Amsterdam: GVB Public Transport Ticket. Discov...
...,...,...,...,...,...
75,,7c944f30-38b2-40a8-a250-b4f089073d30,Tiqets,17.00,Musée d'Orsay: Dedicated Entrance. The Musée d...
76,3e8a89c3-febc-44be-a078-34a212a7fac6,7c1685db-a67e-4b19-a8a6-e443d7535056,Tiqets,13.50,Park Güell. Want to see the most flamboyant an...
77,0416037e-5112-41c4-9bcc-ef9364c74c93,efd75582-2146-4c50-bd47-ee8404b5b468,Tiqets,20.00,Louvre Museum: E-Ticket. These Louvre tickets ...
78,61275a36-cfe8-4ad0-b4a0-06104bc5670a,91366148-5ced-4bad-9ad2-b1f8ff6a4087,Tiqets,26.40,Vatican Museums & Sistine Chapel: Skip The Lin...


In [None]:
data_with_without_similarity[["similarity_uuid", "uuid", "inventory_supplier","text", "price"]][data_with_without_similarity["similarity_uuid"].isna()]

Unnamed: 0,similarity_uuid,uuid,inventory_supplier,text,price
0,,d7b5ba07-f121-45f7-80a2-508ae7c88720,Getyourguide,"Barcelona: 48, 72, 96, or 120-Hour Public Tran...",18.02
2,,a792f750-eb2e-400a-9def-b249593f4c27,Getyourguide,Versailles Palace & Gardens Full Access Ticket...,21.97
3,,bab1183a-3627-4755-bb82-e4f7cb5beed2,Getyourguide,Paris: Eiffel Tower Summit Direct Access by El...,44.49
4,,ec1d65e8-95bc-494d-9d8c-5b1887909f75,Getyourguide,Amsterdam: GVB Public Transport Ticket. Discov...,9.34
6,,6c6e305e-5477-45d8-9597-1f57eb57e45e,Getyourguide,Disneyland Paris 1-Day Ticket. Enjoy a magical...,64.82
7,,99e738a5-e1c0-49a3-8a61-e1433a1ac9f8,Getyourguide,From Krakow: Guided Tour Auschwitz-Birkenau wi...,35.34
20,,ce8de660-dced-4799-a3e6-5940b77f8e3d,Viator,"Reykjavik Food Walk. Local food, city & histor...",125.0
21,,7a3c1974-0c58-4fc1-9b84-1f56dd7ca33a,Viator,Cinque Terre Day Trip from Florence with Optio...,61.5
22,,30e7134e-28b7-4de3-8fcb-cb153c32f71d,Viator,Neuschwanstein Castle and Linderhof Palace Day...,66.98
24,,3e18460c-d8db-476f-903a-5686443db021,Viator,Cu Chi Tunnels: Morning or Afternoon Guided To...,22.0


In [None]:
for idx, text in enumerate(data_with_without_similarity["text"]):
  if "Duomo di " in text:
    print(idx)

60


In [None]:
#"Louvre" [3, 62, 70]
#"Van Gogh Museum" [11, 75, 96]


In [None]:
data_with_without_similarity

Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
0,390091,d7b5ba07-f121-45f7-80a2-508ae7c88720,2022-03-24 18:05:57.918 +0200,2022-03-29 13:14:12.163 +0300,"Barcelona: 48, 72, 96, or 120-Hour Public Tran...",Explore Barcelona with a public transport tick...,,,,,...,,,False,,,,[],"Barcelona: 48, 72, 96, or 120-Hour Public Tran...",0,
1,468592,eab900b4-b4fb-4424-9a82-b6d797382eb4,2022-03-24 20:58:31.432 +0200,2022-03-29 13:14:12.163 +0300,"Dubai: Red Dune Safari, Camel Ride, Sandboard ...",Experience a thrilling 4X4 ride through the re...,,,,,...,,,False,,,,['Guided Tours'],"Dubai: Red Dune Safari, Camel Ride, Sandboard ...",1,0a5942b3-814d-4034-bdf9-f61ec566c646
2,265157,a792f750-eb2e-400a-9def-b249593f4c27,2022-03-24 18:06:21.225 +0200,2022-03-29 13:14:12.163 +0300,Versailles Palace & Gardens Full Access Ticket...,Enjoy 1 or 2 days full access to Versailles' w...,,,,,...,,,False,,,,[],Versailles Palace & Gardens Full Access Ticket...,2,
3,503446,bab1183a-3627-4755-bb82-e4f7cb5beed2,2022-03-24 18:06:28.160 +0200,2022-03-29 13:14:12.163 +0300,Paris: Eiffel Tower Summit Direct Access by El...,Explore the Eiffel Tower with a direct access ...,,,,,...,,,False,,,,['Guided Tours'],Paris: Eiffel Tower Summit Direct Access by El...,3,
4,482964,ec1d65e8-95bc-494d-9d8c-5b1887909f75,2022-03-27 18:09:07.427 +0300,2022-03-29 13:14:12.163 +0300,Amsterdam: GVB Public Transport Ticket,Discover Amsterdam at your own pace and enjoy ...,,,,,...,,,False,,,,['Guided Tours'],Amsterdam: GVB Public Transport Ticket. Discov...,4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,293224,7c944f30-38b2-40a8-a250-b4f089073d30,2022-03-24 18:14:47.273 +0200,2022-06-17 03:01:44.782 +0300,Musée d'Orsay: Dedicated Entrance,The Musée d'Orsay is home to France's national...,,,,"Esplanade Valéry Giscard d'Estaing, Paris, 0",...,,EUR,False,,,,"['Art', 'Museums']",Musée d'Orsay: Dedicated Entrance. The Musée d...,15,
76,293220,7c1685db-a67e-4b19-a8a6-e443d7535056,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Park Güell,Want to see the most flamboyant and famous par...,,,,"Carrer d'Olot, 12, Barcelona, 0",...,,EUR,False,,,,"['Architecture', 'Historic Sites']",Park Güell. Want to see the most flamboyant an...,16,3e8a89c3-febc-44be-a078-34a212a7fac6
77,293228,efd75582-2146-4c50-bd47-ee8404b5b468,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Louvre Museum: E-Ticket,These Louvre tickets give you effortless entry...,,,,"Rue de Rivoli, Paris, 0",...,,EUR,False,,,,"['Art', 'Museums']",Louvre Museum: E-Ticket. These Louvre tickets ...,17,0416037e-5112-41c4-9bcc-ef9364c74c93
78,293221,91366148-5ced-4bad-9ad2-b1f8ff6a4087,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Vatican Museums & Sistine Chapel: Skip The Line,These skip-the-line tickets to the Vatican Cit...,,,,"Viale Vaticano, Rome, 0",...,,EUR,False,,,,"['Art', 'Museums']",Vatican Museums & Sistine Chapel: Skip The Lin...,18,61275a36-cfe8-4ad0-b4a0-06104bc5670a


In [None]:
similarity_matrix = pd.read_csv("similarity_matrix.csv")

In [None]:
similarity_matrix.drop(columns='Unnamed: 0', inplace=True)

In [None]:
similarity_matrix.columns = [int(col) for col in similarity_matrix.columns]

In [None]:
pd.set_option('display.max_colwidth', None)


In [None]:
data_with_without_similarity.iloc[[3,  62, 70]]

In [None]:
similarity_matrix.iloc[11][55]
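Looking up one pair at a time gets tedious when a candidate group has several members. The following is a minimal sketch of checking all pairs among a few candidate rows at once. The `toy_matrix` values are made up and only mimic the layout of `similarity_matrix` (symmetric, integer labels); the candidate labels are taken from the note above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for similarity_matrix: symmetric, integer row and column labels.
toy_matrix = pd.DataFrame(
    np.array([
        [1.00, 0.72, 0.30],
        [0.72, 1.00, 0.69],
        [0.30, 0.69, 1.00],
    ]),
    index=[3, 62, 70],
    columns=[3, 62, 70],
)

candidates = [3, 62, 70]
threshold = 0.78

# Pairwise scores among the candidates only.
sub = toy_matrix.loc[candidates, candidates]

# Ignore the diagonal so self-similarity does not count as a match.
values = sub.to_numpy().copy()
np.fill_diagonal(values, 0.0)
best_pair_score = values.max()

print(sub)
print("any pair above threshold:", best_pair_score >= threshold)
```

If no pair reaches the threshold, the candidates cannot end up in the same group regardless of how the groups are built from the pairs.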

**Conclusion of threshold = 0.78**<br>
for **80** attractions:<br>
**7** groups were obtained<br>
A total of **29** attractions were classified into a similarity group.<br>
Generally the groups look good: only 1 out of 80 attractions (20 from each supplier) was a false positive (FP). However, I noticed several false negatives (FN).
Several attractions that should have been clustered together, e.g. "Van Gogh Museum", were left out of a group because the similarities between them were lower than the set threshold.
I'll lower the threshold to **0.65** and check whether the resulting groups are more accurate than with 0.78.

**Conclusion of threshold = 0.65**<br>
for 80 attractions:<br>
12 groups were obtained<br>
A total of 44 attractions were classified into a similarity group.<br>
The groups still look homogeneous, with no FP.
Among the remaining attractions that were not classified into any similarity group, I couldn't find any FN.
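To make the effect of the threshold concrete, here is a minimal sketch that groups items whose pairwise similarity is at or above a threshold and lists the resulting groups. It assumes a group is a connected component of the thresholded similarity graph, which may differ from how the model above builds its groups. The toy matrix and the helper `groups_at_threshold` are hypothetical.

```python
import numpy as np

def groups_at_threshold(sim, threshold):
    """Return groups (size >= 2) of indices linked by similarity >= threshold.

    Assumption: a group is a connected component of the thresholded graph.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)
    return sorted(members for members in components.values() if len(members) > 1)

toy_sim = np.array([
    [1.00, 0.82, 0.70, 0.10],
    [0.82, 1.00, 0.60, 0.20],
    [0.70, 0.60, 1.00, 0.15],
    [0.10, 0.20, 0.15, 1.00],
])

groups_high = groups_at_threshold(toy_sim, 0.78)
groups_low = groups_at_threshold(toy_sim, 0.65)
print(groups_high)
print(groups_low)
```

In the toy data, item 2 joins the group only at the lower threshold, which is the same kind of false negative described above for 0.78.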

### Extract named entities (NER) from the attractions and find each group's common entities

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
#named_text = nlp(text)
data_with_without_similarity["entity_name"] = data_with_without_similarity["title"].apply(lambda text: [ent.text for ent in nlp(text).ents])

In [None]:
data_with_without_similarity["entity_name"]

0                                                      []
1                        [Vatican Museum, Sistine Chapel]
2                                       [Dubai, 124, 125]
3                   [Paris, Louvre Museum Timed-Entrance]
4                                                      []
                             ...                         
75                                                     []
76    [La Pedrera Essential: Skip The Line + Audio Guide]
77                                          [Kew Gardens]
78                                                     []
79                                    [Rooftops & Museum]
Name: entity_name, Length: 80, dtype: object

In [None]:
data_with_similarity_selected_columns = data_with_without_similarity[["similarity_uuid", "uuid", "title", "description", "inventory_supplier", "entity_name"]]
for group in data_with_similarity["similarity_uuid"].unique():
  display(data_with_similarity_selected_columns[data_with_similarity_selected_columns["similarity_uuid"] == group])

Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
5,1719c732-9bb1-4d2d-83c5-0658f72a5fab,e1bf79d3-0ca0-4376-b6d1-3f28c401f840,"Dubai: Premium Red Dunes, Camel Safari, & BBQ at Al Khayma",Escape Dubai for an unforgettable desert safari across red sand dunes and enjoy a BBQ dinner feast and traditional desert activities inside the majestic Al Khayma Camp.,Getyourguide,"[Dubai, Premium Red Dunes, Camel Safari, Al Khayma]"
9,1719c732-9bb1-4d2d-83c5-0658f72a5fab,8dba7be3-fa90-4d2d-a294-8795121be192,"Dubai: Red Dune Safari, Camel Riding, Sandboarding & BBQ","Escape Dubai and drive across the Red Arabian Desert in a 4WD vehicle. Enjoy the sunset, sandboarding, camel ride and visit the camel firm. Choose the 4-hour program or the 7-hour program with the addition of a BBQ dinner.",Getyourguide,"[Dubai, Dune Safari, Camel Riding, Sandboarding & BBQ]"
18,1719c732-9bb1-4d2d-83c5-0658f72a5fab,eab900b4-b4fb-4424-9a82-b6d797382eb4,"Dubai: Red Dune Safari, Camel Ride, Sandboard & BBQ Options",Experience a thrilling 4X4 ride through the red Arabian Desert on this evening safari tour. Choose the 4-hour program that includes dune bashing and sandboarding or the 7-hour program with the addition of a BBQ dinner and Bedouin-style camp experience.,Getyourguide,"[Dubai, Dune Safari, Camel Ride, Sandboard & BBQ Options]"
22,1719c732-9bb1-4d2d-83c5-0658f72a5fab,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,"Premium Red Dunes, Camel Safari & BBQ at Al Khayma Camp™️","Spend an evening at the one & only Al Khayma Desert Camp in Dubai to experience the bygone Bedouin life. Revisit those good old days with Shisha, Henna, Arabian makeover, Barbecue dinner, local Tanoura and Ladies Khaliji Dance.",Viator,"[Premium Red Dunes, Camel Safari & BBQ, Al Khayma Camp]"


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
4,68d39033-1431-468d-af9c-a9085b11bd98,08a4d0e7-70f5-4970-84e3-da44d8d1a58a,Barcelona: Sagrada Familia Fast-Track Access Ticket,"Gain fast track entrance to Gaudi's unfinished masterpiece, the Sagrada Familia, and explore Barcelona's most-visited landmark at your own pace with an informative audio guide.",Getyourguide,[]
40,68d39033-1431-468d-af9c-a9085b11bd98,a23f64b0-87e6-4e41-9498-1e565b95b1f4,Sagrada Familia entrance tickets,"Discover the cathedral of the Sagrada Familia, one of the important landmarks in Barcelona, and a UNESCO World Heritage Site which attracts more than three million visitors a year.Sagrada Familia, Gaudí’s unfinished masterpiece, is one of the most-visited attractions in the world. Avoid the long lines and enter the cathedral, where you can spend as long as you like enjoying the impressive architecture. Don't miss the chance to admire the amazing interior of this basilica, where vaults reach up to seventy meters, and marvel at the Latin cross plan with five aisles, extremely rich in ornamentation and symbolism.A multi-lingual audio guide is also included. You'll need to download the audio guide app on your mobile and select the language you prefer: Catalan, Spanish, English, French, German, Italian, Portuguese, Chinese, Japanese, Russian, Hungarian, Korean, Swedish, Finnish, Polish, and Dutch. You'll also get the opportunity to select a small-group tour option on weekends to visit this world-famous temple with a local guide and enter the Sagrada Familia museum, where you'll find content and graphic materials about the history and development of the basilica from its early beginnings to the present.",Musement,[Sagrada Familia]
53,68d39033-1431-468d-af9c-a9085b11bd98,d2cf519f-e2ce-4b98-b577-330e31af3a25,Sagrada Familia tickets and guided visit,"Barcelona is known as the capital of Modernism and the place where the famous architect Antoni Gaudí lived and worked. Gaudí, one of the greatest innovators of his time, has left behind numerous treasures for the discerning tourist to discover.The Sagrada Familia is one of the most visited buildings in the world. With this tour, you will visit the astonishing interior of this basilica, where vaults reach up to seventy meters. Antoni Gaudí designed a Latin cross plan with five aisles, extremely rich in ornamentation and symbolism.You will also get the opportunity to visit the Sagrada Familia museum, where you can find drawings, plaster models and pictures about the history and development of this basilica from its early beginnings to the present day.",Musement,[Sagrada Familia]
58,68d39033-1431-468d-af9c-a9085b11bd98,7fc8f93e-e7ad-4e40-817e-0cc0a83093a0,Guided tour of Sagrada Familia with entrance to the towers,"Barcelona is known as the capital of Modernism and the place where the famous architect Antoni Gaudí worked and lived. Gaudí, one of the greatest innovators of his time, left behind numerous treasures in Barcelona for the discerning tourist to discover. The Sagrada Familia is one of the most visited buildings in the world and this guided tour will allow you to explore this magnificent modernist building. You will visit the astonishing interior of the basilica, where vaults go up to seventy meters. Antoni Gaudí designed a Latin cross plan with five aisles, extremely rich in ornamentation and symbolism. You will also visit the Sagrada Familia Museum, where you will see an exhibition of drawings, plaster models and pictures about the history and development of this basilica from its early beginnings to the present day. The museum will also give valuable information about Antoni Gaudí's life and career.Your guided tour ends at the entrance of the elevator to the tower. Your admission to go up the towers is included in your ticket. Take the elevator up the tower and admire the astonishing views of the city!",Musement,[Sagrada Familia]
60,68d39033-1431-468d-af9c-a9085b11bd98,5ddb203c-cb0a-4429-bb43-c2410a9f66f5,Sagrada Familia: Fast Track,"Make seeing the Sagrada Familia the first thing you do in Barcelona! Sagrada Familia tickets sell out fast, so join the many people who visited Barcelona's top attraction with our Sagrada Familia fast-track tickets and save yourself time and money.\r\n\r\nOnce you're inside, prepare to be mind-blown by Gaudí's modernist, yet unfinished masterpiece. Look up at the spires that tower over the Catalan capital. See the sunlight sending rainbows streaming in through the stained-glass windows. Every inch of this basilica has the wow-factor!\r\n\r\nLook for the turtle at the base of one pillar, and the tortoise at another. These are designed to show the balance between land and sea. In fact, there are decorations inspired by nature and Christian iconography everywhere.",Tiqets,[Sagrada Familia]


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
10,7bbb3064-0b2e-41fb-9a18-b181710f2e59,f10b68e1-6808-4b73-9729-aa87658713ef,Barcelona: Park Güell Admission Ticket,"Visit Park Güell, one of Gaudí’s major works in Barcelona. Take in spectacular views of Barcelona and explore this stunning green space that’s surrounded by modernist architecture.",Getyourguide,[]
63,7bbb3064-0b2e-41fb-9a18-b181710f2e59,7c1685db-a67e-4b19-a8a6-e443d7535056,Park Güell,"Want to see the most flamboyant and famous park in Barcelona? Get your hands on these Park Güell tickets and explore the Park Güell Monumental Zone at your leisure. \r\n\r\nThis colorful park perched high on the hills was named after Count Eusebi Güell – and originally intended as a gated community for the city's well-to-do. It was opened to everyone when Gaudí passed away in 1926, and at that point the architect himself had called it home for the last 20 years. \r\n\r\nIt gets even more fanciful inside the Park Güell Monumental Zone. Man-made walls, roads, and walkways mimic natural forms. Exuberant buildings, colorful tile work, and the amazing snaking Serpent Bench all beg for you to take a photo. And the views of Barcelona from up high are like no other! \r\n\r\nPark Güell tickets will immerse you in Gaudí's one-of-a-kind imagination, and it's bound to inspire yours, too.",Tiqets,[Park Güell]


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
0,99db2ecd-389a-42da-bccc-f09c1658d23b,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,Vatican: Museums & Sistine Chapel Entrance Ticket,See priceless works of art from the Papal collections in the Vatican Museums and Sistine Chapel. Marvel at masterpieces from antiquity to Michelangelo’s legendary frescoes. Enjoy optional acces to the Papal Villas and Vatican Gardens.,Getyourguide,[]
1,99db2ecd-389a-42da-bccc-f09c1658d23b,13923a3e-8a52-446a-821b-c0a30b747389,Vatican Museum and Sistine Chapel Tour,"Join a 3-hour tour of the Vatican with fast-track access and explore the Vatican Museums, and the Sistine Chapel.",Getyourguide,"[Vatican Museum, Sistine Chapel]"
8,99db2ecd-389a-42da-bccc-f09c1658d23b,d155bc30-142a-4afa-b34b-80f74fed1351,Skip-the-Ticket-Line Vatican Tour and Sistine Chapel,"Spend more time inside when you skip the ticket line at the Vatican Museums & Sistine Chapel through an official, priority entrance. On this comprehensive guided tour, you’ll experience the best of the Vatican in just 3 hours.",Getyourguide,[Sistine Chapel]
20,99db2ecd-389a-42da-bccc-f09c1658d23b,d12cfdbe-c16c-47e3-8146-f48880fdde1f,Skip-the-Line: Vatican Museums & Sistine Chapel Guided Small-Group Tour,"Spend more time inside with no-wait access to the Vatican Museums and Sistine Chapel through an official, Vatican partner entrance. On this comprehensive guided tour, you'll experience the best of the Vatican in just three hours, including the Raphael Rooms, St. Peter's, and more. Navigate the vast complex of artwork and history with an expert, who will bring this ancient collection to life. Choose from several departure times.",Viator,[]
25,99db2ecd-389a-42da-bccc-f09c1658d23b,323cd91a-8deb-49aa-8a63-105e3ca9c0fd,"Fast Track - Vatican Tour with Museums, Sistine Chapel & Raphael rooms","See the highlights of Vatican City with an expert guide, visiting the Vatican Museums & Sistine Chapel. Head inside the world’s largest collection of private art with an expert guide, and see for yourself why the Vatican is a mecca for millions of travelers. Explore intriguing sites like Raphael’s Rooms and then visit the Sistine Chapel to see incredible frescoes by Michelangelo. <br><br>Before you start, our multilingual staff will welcome you to our fully equipped, air-conditioned offices.",Viator,"[Museums, Sistine Chapel & Raphael]"
43,99db2ecd-389a-42da-bccc-f09c1658d23b,3c19f0fc-493d-4068-80f0-62f5c007ce84,Essential Vatican guided tour: Skip-the-line Vatican Museums and Sistine Chapel,"Get the most out of your visit and discover all the treasures of the Vatican! Skip the long lines and learn about the fascinating collections from a professional guide.This experience will allow you to go inside the Vatican Museums and admire renowned artworks of Michelangelo and Raffaello. Visit the famous Courtyards of the Vatican City, the Gallery of Tapestries and the Gallery of the Candelabra. Seize the opportunity to marvel at the Sistine Chapel and Michelangelo's undisputed masterpiece. Finally, as you exit out into St. Peter’s Square, you will also see Bernini’s optical illusion of the columns.",Musement,"[Vatican, Vatican Museums, Sistine Chapel]"
61,99db2ecd-389a-42da-bccc-f09c1658d23b,91366148-5ced-4bad-9ad2-b1f8ff6a4087,Vatican Museums & Sistine Chapel: Skip The Line,"These skip-the-line tickets to the Vatican City's artistic highlights let you breeze past the long lines waiting to step inside the Vatican and head straight for the entrance. No waiting, just wall-to-wall masterpieces.\r\n\r\nUse the extra time you would have spent in line to visit all four remarkable collections: classical sculpture, Renaissance masterpieces, and stunning artifacts from Ancient Egypt and the Etruscans.\r\n\r\nAfter passing _the Pine Cone_, end your tour by admiring the Sistine Chapel. Michelangelo created more than 300 figures on over 500 square meters of the ceiling in his breathtaking fresco. Twenty-two years later, he returned to create _The Last Judgement_ on the entire wall above the altar.",Tiqets,[Vatican Museums & Sistine Chapel:]


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
51,b3728588-78f6-428c-aa82-9a18c6aeb48c,48cc9c3d-0619-45b7-b530-103ed4c0b4e6,Casa Batlló 10D Experience Blue tickets,"Casa Batlló is one of Barcelona's most emblematic modernist buildings and one of the most highly rated cultural and tourist attractions, welcoming 1 million visitors every year. This masterpiece is known as the Casa dels Ossos or House of Bones because of the skeletal shapes that make up its facade and it is considered as one of the most creative and original jobs of the architect.Enter the world of Antonio Gaudí and discover this gem, a UNESCO World Heritage site. Enjoy an immersive experience thanks to the intelligent audio guide. Admire all of the magic surrounding Casa Batlló and discover a legend of architecture and design.The Casa Batlló 10D Experience is a new tour of the astonishing Gaudí masterpiece, with two all-new innovative spaces, technological installations and intelligent devices. Discover this UNESCO World Heritage Site and start your journey into the genius’s mind in the amazing Gaudí Dome and Gaudí Cube.",Musement,[]
74,b3728588-78f6-428c-aa82-9a18c6aeb48c,413d0154-3849-4053-b4c7-8d937f525957,Casa Batlló: Standard Entrance (Blue),"Casa Batlló's nature-inspired facade, brilliant colors, and thought-provoking whimsical features make it one of Gaudí's most popular masterpieces. With these Casa Batlló tickets, you'll enjoy admission to all accessible levels of Casa Batlló as well as its famous Dragon Rooftop – and get a unique experience along the way.\r\n\r\nThe Casa Batlló 10D Experience is an immersive adventure combining artificial intelligence, augmented reality and machine learning. This unique trip will see the history of this magical house come to life all around you during a self-guided tour available in 15 languages, and is included in your Casa Batlló tickets.\r\n\r\nThe 10D Experience allows you to enter the mind of the genius architect while he created his masterpiece. See the Gaudí Dome, an innovative space with over 1,000 screens, and experience the enlightening moment in which Gaudí surrendered to nature. Don't miss the Gaudí Cube, a first-of-its-kind experience which will take you on an immersive journey!",Tiqets,[Casa Batlló: Standard Entrance (Blue]


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
2,ce0c4e1c-692c-4c72-af28-60ac9b08bc96,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,Dubai: Burj Khalifa Level 124 and 125 Entry Ticket,"Witness the views over Dubai from the observation deck of the iconic Burj Khalifa, the world's tallest building. Ascend 125 floors for a panoramic 360-degree view over the Arabian Gulf.",Getyourguide,"[Dubai, 124, 125]"
68,ce0c4e1c-692c-4c72-af28-60ac9b08bc96,1fa02e65-7527-42d5-a496-e0ae375e1feb,Burj Khalifa: Floors 124 & 125,"Treat yourself – and those you're with – to something truly special with Burj Khalifa tickets. From the dizzying heights of the 124th and 125 floors, you'll enjoy spectacular views of Dubai as you literally live the high life. With panoramic views from the world's highest building, this is a true bucket-list experience.",Tiqets,[125]


Unnamed: 0,similarity_uuid,uuid,title,description,inventory_supplier,entity_name
6,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,fa7ca7c3-8811-4328-809b-3e35498f42cb,"Rome: Colosseum, Roman Forum, Palatine Hill Priority Tickets","Get into the Colosseum, Palatine Hill, and Roman Forum in central Rome through the fast track entrance with a combined package. Enjoy a hassle-free experience at your own pace. Marvel at the remains of some of the greatest monuments of the Roman Republic.",Getyourguide,"[Colosseum, Roman Forum]"
7,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,c75ddc00-eb7f-4fce-968c-8fe79cf083ea,"Rome: Colosseum, Roman Forum, Palatine Hill Fast-Track Tour","Explore Ancient Rome on a walking tour of the Colosseum, Roman Forum, and Palatine Hill, and skip the lines to enter the popular attractions with an expert guide.",Getyourguide,"[Colosseum, Roman Forum, Palatine Hill Fast-Track]"
21,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,61753dfb-31e4-432b-9b47-56344bf3bed8,Skip the Line: Colosseum Small Group Tour with Roman Forum & Palatine Hill,"The ancient glory of Rome is reborn! Skip the Line at three of the most significant surviving remnants of the Eternal City: the Colosseum, Palatine Hill and Roman Forum. Your English-speaking expert historian tour guide will share anecdotes and history throughout the 3-hour tour, rebuilding the impressive ruins with tales of Ancient Rome’s heyday on this small-group experience of up to 20 participants.",Viator,[Roman Forum & Palatine Hill]
44,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,12215024-ccc6-4286-b525-2b2dd05b7b6b,"Priority access to the Colosseum, Roman Forum and Palatine Hill with optional guided tour","Visit Rome's most popular sites with one single ticket.With this priority entrance ticket, you'll get to skip the long lines at the ticket office and meet your Musement representative who will accompany you inside the Colosseum. Don't worry about the long waiting time and spend more time inside, enjoying the amazing atmosphere and captivating history of this unique site.During weekends, you can choose to add a guided tour to your ticket: visit the most ancient sites in Rome in a small group of 16 people maximum and discover the history and secrets behind these fascinating monuments.",Musement,"[Colosseum, Roman Forum, Palatine Hill]"
48,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,763931f1-8c03-4f5e-acdb-7df36f395ec1,"Colosseum, Roman Forum and Palatine Hill tour","Discover the glory of Ancient Rome on this tour of the Eternal City's three main archaeological areas: The Colosseum, Roman Forum, and Palatine Hill. Thanks to skip-the-line entrance tickets to all sites, you won't waste any time waiting in line.Your guide will retrace the history of the empire from its birth in 753 BC, telling you about the clashes between gladiators that took place in the Colosseum and the exciting events related to the kingdoms of Caesar and Nero.",Musement,"[Colosseum, Roman Forum, Palatine Hill]"
69,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,c8ad3073-cc0a-4d0c-a740-ed29f5e1b5b8,"Colosseum, Roman Forum & Palatine Hill: Priority Entrance + Arena Floor","Sink your sword into the very best of Rome with this triple bill of unmissable culture! Get a gladiator's-eye view of the iconic arena and explore areas that are off-limits to the general public. Then get to the heart of ancient Rome at the Forum, and see where it all began at the Palatine Hill.",Tiqets,"[Colosseum, Roman Forum & Palatine Hill:]"
73,e9a2a8c9-5a33-42af-9fff-a26e49fcdb53,73147dad-162c-4075-8638-b8256ca6b6df,"Colosseum, Roman Forum & Palatine Hill: Video Guide","Take your time exploring the Colosseum, one of Ancient Rome's most treasured archaeological sites. Then see the Roman Forum, which was the political seat of one of the greatest empires the world has ever known, then make your way to the Palatine Hill, where Romulus founded the Eternal City. One ticket, three great Roman sites.",Tiqets,"[Colosseum, Roman Forum & Palatine Hill:]"
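The entity lists above can be reduced to one label per group. The following is a minimal sketch, assuming the "common NER" of a group is the entity string that appears in the most members of that group. The helper `most_common_entity` and the toy frame are hypothetical and only mirror the `similarity_uuid` and `entity_name` columns.

```python
from collections import Counter

import pandas as pd

def most_common_entity(entity_lists):
    """Return the entity found in the most rows of a group, or None if there are no entities."""
    counts = Counter()
    for entities in entity_lists:
        counts.update(set(entities))  # count each entity once per attraction
    if not counts:
        return None
    return counts.most_common(1)[0][0]

toy = pd.DataFrame({
    "similarity_uuid": ["g1", "g1", "g1", "g2", "g2"],
    "entity_name": [
        [],
        ["Sagrada Familia"],
        ["Sagrada Familia"],
        [],
        ["Park Güell"],
    ],
})

group_entity = toy.groupby("similarity_uuid")["entity_name"].apply(most_common_entity)
print(group_entity)
```

Exact string matching is a simplification: entities such as "Roman Forum" and "Roman Forum & Palatine Hill" in the output above would be counted separately, so some normalization would likely be needed on real titles.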


### How to choose which x attractions will be displayed in the Marketplace UI?

1. Add a 'rank' column to all the data.

2. For each supplier, rank its most popular attractions from 0 to (x_attraction - 1). The most popular attraction gets the highest score.

3. Choose the attraction with the highest score from one of the suppliers, drop its similar attractions (if any), and choose the next best attraction, giving priority to the least-chosen supplier.
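Steps 1 and 2 can be sketched as follows. This is only an illustration: the `popularity` column is a hypothetical stand-in for whatever signal orders the attractions within a supplier, and the toy frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "inventory_supplier": ["Viator", "Viator", "Viator", "Tiqets", "Tiqets", "Tiqets"],
    "title": ["a", "b", "c", "d", "e", "f"],
    "popularity": [5, 50, 20, 7, 1, 9],  # hypothetical popularity signal
})

# Rank within each supplier from 0 (least popular) to x_attraction - 1 (most popular).
toy["rank"] = (
    toy.groupby("inventory_supplier")["popularity"]
    .rank(method="first", ascending=True)
    .astype(int)
    - 1
)
print(toy)
```

Because every supplier gets the same 0 to (x_attraction - 1) scale, the top attraction of each supplier shares the maximum rank, which is what lets step 3 compare suppliers fairly.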

In [None]:
data_with_without_similarity.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80 entries, 0 to 79
Data columns (total 41 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   index                   80 non-null     int64  
 1   uuid                    80 non-null     object 
 2   created_at              80 non-null     object 
 3   last_updated            80 non-null     object 
 4   title                   80 non-null     object 
 5   description             80 non-null     object 
 6   translation_status      0 non-null      object 
 7   native_name             0 non-null      float64
 8   native_about            0 non-null      float64
 9   address                 36 non-null     object 
 10  geolocation             58 non-null     object 
 11  main_photo_url          79 non-null     object 
 12  availability_type       21 non-null     object 
 13  inventory_supplier      80 non-null     object 
 14  duration                80 non-null     obje

In [None]:
data_with_without_similarity["rank"].unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [None]:
import random

supplier_count_dict = {supplier: 0 for supplier in relevant_suppliers}
popular_data = data_with_without_similarity.copy()
chosen_idx = list()

num_attractions_for_ui = 15

for i in range(num_attractions_for_ui):
  # stop early if every attraction has already been chosen or dropped
  if popular_data.empty:
    break
  print(popular_data.shape)
  max_rank = popular_data["rank"].max()
  print("max_rank:", max_rank)
  # find the most popular attractions still available
  most_popular_df = popular_data[popular_data["rank"] == max_rank]
  display(most_popular_df)

  # among their suppliers, pick the one chosen least often so far
  # (ties go to the first supplier encountered)
  pop_suppliers = most_popular_df["inventory_supplier"].unique()
  chosen_supplier = min(pop_suppliers, key=lambda supplier: supplier_count_dict[supplier])
  # update supplier_count_dict once, only for the supplier actually chosen
  supplier_count_dict[chosen_supplier] += 1
  print(chosen_supplier)
  chosen_attraction_idx = random.choice(most_popular_df[most_popular_df["inventory_supplier"] == chosen_supplier].index)
  chosen_idx.append(chosen_attraction_idx)
  print(chosen_idx)

  # drop the chosen attraction together with its similarity group (if any)
  similarity_group = popular_data.loc[chosen_attraction_idx, "similarity_uuid"]
  print("similarity group:", similarity_group)
  if not pd.isna(similarity_group):
    popular_data = popular_data[popular_data["similarity_uuid"] != similarity_group]
  else:
    popular_data = popular_data.drop(index=chosen_attraction_idx)

(80, 41)
max_rank: 19


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
19,387732,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,2022-03-24 18:03:58.080 +0200,2022-03-29 13:14:12.163 +0300,Vatican: Museums & Sistine Chapel Entrance Ticket,See priceless works of art from the Papal coll...,,,,,...,,,False,,,,[],Vatican: Museums & Sistine Chapel Entrance Tic...,19,16f3d732-abaf-4baf-abaa-5566e6972f83
39,212626,d12cfdbe-c16c-47e3-8146-f48880fdde1f,2022-03-24 18:41:57.837 +0200,2022-03-29 13:32:46.934 +0300,Skip-the-Line: Vatican Museums & Sistine Chape...,Spend more time inside with no-wait access to ...,,,,,...,,,False,,,,"['Guided Tours', 'Historic Sites', 'Walking & ...",Skip-the-Line: Vatican Museums & Sistine Chape...,19,16f3d732-abaf-4baf-abaa-5566e6972f83
59,352668,a23f64b0-87e6-4e41-9498-1e565b95b1f4,2022-03-24 19:29:35.584 +0200,2022-06-03 16:52:23.650 +0300,Sagrada Familia entrance tickets,"Discover the cathedral of the Sagrada Familia,...",,,,,...,,USD,False,,,,['Guided Tours'],Sagrada Familia entrance tickets. Discover the...,19,d65ee334-73fc-4435-93c9-9102a8dba8a4
79,293216,5ddb203c-cb0a-4429-bb43-c2410a9f66f5,2022-03-24 18:14:47.260 +0200,2022-06-17 03:01:44.782 +0300,Sagrada Familia: Fast Track,Make seeing the Sagrada Familia the first thin...,,,,"Carrer de Mallorca, 401, Barcelona, 0",...,,EUR,False,,,,['Religion'],Sagrada Familia: Fast Track. Make seeing the S...,19,d65ee334-73fc-4435-93c9-9102a8dba8a4


Getyourguide
[19]
similarity group: 16f3d732-abaf-4baf-abaa-5566e6972f83
(72, 41)
max_rank: 19


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
59,352668,a23f64b0-87e6-4e41-9498-1e565b95b1f4,2022-03-24 19:29:35.584 +0200,2022-06-03 16:52:23.650 +0300,Sagrada Familia entrance tickets,"Discover the cathedral of the Sagrada Familia,...",,,,,...,,USD,False,,,,['Guided Tours'],Sagrada Familia entrance tickets. Discover the...,19,d65ee334-73fc-4435-93c9-9102a8dba8a4
79,293216,5ddb203c-cb0a-4429-bb43-c2410a9f66f5,2022-03-24 18:14:47.260 +0200,2022-06-17 03:01:44.782 +0300,Sagrada Familia: Fast Track,Make seeing the Sagrada Familia the first thin...,,,,"Carrer de Mallorca, 401, Barcelona, 0",...,,EUR,False,,,,['Religion'],Sagrada Familia: Fast Track. Make seeing the S...,19,d65ee334-73fc-4435-93c9-9102a8dba8a4


Musement
[19, 59]
similarity group: d65ee334-73fc-4435-93c9-9102a8dba8a4
(67, 41)
max_rank: 18


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
38,102846,61753dfb-31e4-432b-9b47-56344bf3bed8,2022-03-24 18:41:57.837 +0200,2022-03-29 13:32:46.934 +0300,Skip the Line: Colosseum Small Group Tour with...,The ancient glory of Rome is reborn! Skip the ...,,,,,...,,,False,,,,"['Guided Tours', 'Historic Sites', 'Walking & ...",Skip the Line: Colosseum Small Group Tour with...,18,36aa5c4f-4bd7-44f1-83f7-9bcb29f330cf
58,335064,4782c74d-9369-4d34-98b0-340f3038245c,2022-03-24 19:30:28.675 +0200,2022-06-03 16:52:23.650 +0300,Skip-the-line tickets for the Uffizi Gallery,The Uffizi Gallery is one of the most famous a...,,,,Lungarno degli Acciaioli 30,...,,USD,False,,,,"['Culture', 'Guided Tours', 'Popular', 'Histor...",Skip-the-line tickets for the Uffizi Gallery. ...,18,8668e6e1-8299-495c-983f-c7fa6d5f3694


Viator
[19, 59, 38]
similarity group: 36aa5c4f-4bd7-44f1-83f7-9bcb29f330cf
(59, 41)
max_rank: 18


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
58,335064,4782c74d-9369-4d34-98b0-340f3038245c,2022-03-24 19:30:28.675 +0200,2022-06-03 16:52:23.650 +0300,Skip-the-line tickets for the Uffizi Gallery,The Uffizi Gallery is one of the most famous a...,,,,Lungarno degli Acciaioli 30,...,,USD,False,,,,"['Culture', 'Guided Tours', 'Popular', 'Histor...",Skip-the-line tickets for the Uffizi Gallery. ...,18,8668e6e1-8299-495c-983f-c7fa6d5f3694


Musement
[19, 59, 38, 58]
similarity group: 8668e6e1-8299-495c-983f-c7fa6d5f3694
(57, 41)
max_rank: 17


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
17,470595,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,...,,,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,17,7a6c3423-3419-45f7-be1a-13f78f81a71f
37,60651,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,2022-03-24 19:25:02.514 +0200,2022-03-31 18:11:25.745 +0300,"Premium Red Dunes, Camel Safari & BBQ at Al Kh...",Spend an evening at the one & only Al Khayma D...,,,,,...,,,False,,,,"['Guided Tours', 'Hidden Gems', 'Historic Site...","Premium Red Dunes, Camel Safari & BBQ at Al Kh...",17,b15fa0a1-6adc-43fb-96a4-477b7624a258
57,342331,ab348a39-9178-4963-af68-9f0fbf92d0b4,2022-03-24 19:26:45.512 +0200,2022-06-17 03:04:29.436 +0300,Da Vinci's Last Supper tickets and guided tour,Join this exclusive tour to discover one of th...,,,,Corso Magenta 67,...,,USD,False,,,,"['Culture', 'Architecture', 'Guided Tours', 'P...",Da Vinci's Last Supper tickets and guided tour...,17,
77,293228,efd75582-2146-4c50-bd47-ee8404b5b468,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Louvre Museum: E-Ticket,These Louvre tickets give you effortless entry...,,,,"Rue de Rivoli, Paris, 0",...,,EUR,False,,,,"['Museums', 'Art']",Louvre Museum: E-Ticket. These Louvre tickets ...,17,5caff090-1870-4626-8461-5e09756e590b


Getyourguide
Tiqets
[19, 59, 38, 58, 77]
similarity group: 5caff090-1870-4626-8461-5e09756e590b
(55, 41)
max_rank: 17


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
17,470595,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,...,,,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,17,7a6c3423-3419-45f7-be1a-13f78f81a71f
37,60651,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,2022-03-24 19:25:02.514 +0200,2022-03-31 18:11:25.745 +0300,"Premium Red Dunes, Camel Safari & BBQ at Al Kh...",Spend an evening at the one & only Al Khayma D...,,,,,...,,,False,,,,"['Guided Tours', 'Hidden Gems', 'Historic Site...","Premium Red Dunes, Camel Safari & BBQ at Al Kh...",17,b15fa0a1-6adc-43fb-96a4-477b7624a258
57,342331,ab348a39-9178-4963-af68-9f0fbf92d0b4,2022-03-24 19:26:45.512 +0200,2022-06-17 03:04:29.436 +0300,Da Vinci's Last Supper tickets and guided tour,Join this exclusive tour to discover one of th...,,,,Corso Magenta 67,...,,USD,False,,,,"['Culture', 'Architecture', 'Guided Tours', 'P...",Da Vinci's Last Supper tickets and guided tour...,17,


Getyourguide
Viator
[19, 59, 38, 58, 77, 37]
similarity group: b15fa0a1-6adc-43fb-96a4-477b7624a258
(51, 41)
max_rank: 17


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
17,470595,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,...,,,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,17,7a6c3423-3419-45f7-be1a-13f78f81a71f
57,342331,ab348a39-9178-4963-af68-9f0fbf92d0b4,2022-03-24 19:26:45.512 +0200,2022-06-17 03:04:29.436 +0300,Da Vinci's Last Supper tickets and guided tour,Join this exclusive tour to discover one of th...,,,,Corso Magenta 67,...,,USD,False,,,,"['Culture', 'Architecture', 'Guided Tours', 'P...",Da Vinci's Last Supper tickets and guided tour...,17,


Getyourguide
Musement
[19, 59, 38, 58, 77, 37, 57]
similarity group: nan
(50, 41)
max_rank: 17


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
17,470595,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,...,,,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,17,7a6c3423-3419-45f7-be1a-13f78f81a71f


Getyourguide
[19, 59, 38, 58, 77, 37, 57, 17]
similarity group: 7a6c3423-3419-45f7-be1a-13f78f81a71f
(47, 41)
max_rank: 16


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
36,32750,58623b2b-79b9-4ace-971e-302062807c3e,2022-03-24 18:39:04.135 +0200,2022-03-29 13:32:46.934 +0300,Tuscany in One Day Sightseeing Tour from Florence,"Famous for a wealth of art, history, striking ...",,,,,...,,,False,,,,"['Cuisine', 'Guided Tours', 'Historic Sites', ...",Tuscany in One Day Sightseeing Tour from Flore...,16,
76,293220,7c1685db-a67e-4b19-a8a6-e443d7535056,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Park Güell,Want to see the most flamboyant and famous par...,,,,"Carrer d'Olot, 12, Barcelona, 0",...,,EUR,False,,,,"['Architecture', 'Historic Sites']",Park Güell. Want to see the most flamboyant an...,16,172e257c-52ec-40fe-ab51-000b364313aa


Viator
Tiqets
[19, 59, 38, 58, 77, 37, 57, 17, 76]
similarity group: 172e257c-52ec-40fe-ab51-000b364313aa
(45, 41)
max_rank: 16


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
36,32750,58623b2b-79b9-4ace-971e-302062807c3e,2022-03-24 18:39:04.135 +0200,2022-03-29 13:32:46.934 +0300,Tuscany in One Day Sightseeing Tour from Florence,"Famous for a wealth of art, history, striking ...",,,,,...,,,False,,,,"['Cuisine', 'Guided Tours', 'Historic Sites', ...",Tuscany in One Day Sightseeing Tour from Flore...,16,


Viator
[19, 59, 38, 58, 77, 37, 57, 17, 76, 36]
similarity group: nan
(44, 41)
max_rank: 15


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
35,21399,4d2168fc-0c86-40a4-9ab7-a314e9817cce,2022-03-24 18:47:02.482 +0200,2022-03-27 15:15:26.657 +0300,Amsterdam Open Boat Canal Cruise - Live Guide ...,The best way of seeing historical Amsterdam is...,,,,,...,,,False,,,,"['Outdoor Activities', 'Hidden Gems']",Amsterdam Open Boat Canal Cruise - Live Guide ...,15,286d6007-8080-4c81-a640-f348631c0cab
75,293224,7c944f30-38b2-40a8-a250-b4f089073d30,2022-03-24 18:14:47.273 +0200,2022-06-17 03:01:44.782 +0300,Musée d'Orsay: Dedicated Entrance,The Musée d'Orsay is home to France's national...,,,,"Esplanade Valéry Giscard d'Estaing, Paris, 0",...,,EUR,False,,,,"['Museums', 'Art']",Musée d'Orsay: Dedicated Entrance. The Musée d...,15,


Viator
Tiqets
[19, 59, 38, 58, 77, 37, 57, 17, 76, 36, 75]
similarity group: nan
(43, 41)
max_rank: 15


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
35,21399,4d2168fc-0c86-40a4-9ab7-a314e9817cce,2022-03-24 18:47:02.482 +0200,2022-03-27 15:15:26.657 +0300,Amsterdam Open Boat Canal Cruise - Live Guide ...,The best way of seeing historical Amsterdam is...,,,,,...,,,False,,,,"['Outdoor Activities', 'Hidden Gems']",Amsterdam Open Boat Canal Cruise - Live Guide ...,15,286d6007-8080-4c81-a640-f348631c0cab


Viator
[19, 59, 38, 58, 77, 37, 57, 17, 76, 36, 75, 35]
similarity group: 286d6007-8080-4c81-a640-f348631c0cab
(39, 41)
max_rank: 14


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
54,327724,99f7be76-aa37-42ff-8e38-84e2cd392b1f,2022-03-24 19:29:49.821 +0200,2022-06-03 16:52:23.650 +0300,Pisa Leaning Tower and Cathedral skip-the-line...,"The Leaning Tower, situated in Piazza dei Mira...",,,,,...,,USD,False,,,,"['Culture', 'Architecture', 'Guided Tours', 'P...",Pisa Leaning Tower and Cathedral skip-the-line...,14,
74,293256,f14adf11-2567-4388-8754-0fa6af692a09,2022-03-24 18:14:48.443 +0200,2022-06-17 03:01:44.782 +0300,Rijksmuseum,Flas your Rijksmuseum tickets and get up close...,,,,"Museumstraat 1, Amsterdam, 0",...,,EUR,False,,,,"['Museums', 'Art']",Rijksmuseum. Flas your Rijksmuseum tickets and...,14,


Musement
[19, 59, 38, 58, 77, 37, 57, 17, 76, 36, 75, 35, 54]
similarity group: nan
(38, 41)
max_rank: 14


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
74,293256,f14adf11-2567-4388-8754-0fa6af692a09,2022-03-24 18:14:48.443 +0200,2022-06-17 03:01:44.782 +0300,Rijksmuseum,Flas your Rijksmuseum tickets and get up close...,,,,"Museumstraat 1, Amsterdam, 0",...,,EUR,False,,,,"['Museums', 'Art']",Rijksmuseum. Flas your Rijksmuseum tickets and...,14,


Tiqets
[19, 59, 38, 58, 77, 37, 57, 17, 76, 36, 75, 35, 54, 74]
similarity group: nan
(37, 41)
max_rank: 13


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
33,96976,64e177bb-ed5e-460b-bc29-d29765c54a96,2022-03-24 18:43:57.594 +0200,2022-03-29 13:32:51.760 +0300,"Stonehenge, Windsor Castle, and Bath from London",See the official residence of The Queen and ho...,,,,,...,,,False,,,,"['Guided Tours', 'Historic Sites']","Stonehenge, Windsor Castle, and Bath from Lond...",13,
53,299467,4a655b8b-b211-4af3-a47f-1f60b02b3de5,2022-03-24 19:27:09.670 +0200,2022-06-17 03:04:29.436 +0300,Alhambra and Nasrid Palace skip the line ticke...,No need to wait when you can skip the line to ...,,,,,...,,USD,False,,,,"['Architecture', 'Guided Tours', 'Historic Sit...",Alhambra and Nasrid Palace skip the line ticke...,13,


Viator
Musement
[19, 59, 38, 58, 77, 37, 57, 17, 76, 36, 75, 35, 54, 74, 53]
similarity group: nan


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [None]:
def choose_X_most_popular_idx(data_with_similarity_id: DataFrame) -> List[int]:
  """
  Choose NUM_ATTRACTIONS_TO_DISPLAY attractions by descending rank, balancing
  the suppliers and taking at most one attraction per similarity group

  Args:
    data_with_similarity_id: DataFrame with 'rank', 'inventory_supplier' and 'similarity_uuid' columns

  Returns:
    list of the index labels of the chosen attractions
  """
  supplier_count_dict = {supplier: 0 for supplier in relevant_suppliers}
  popular_data = data_with_similarity_id.copy()
  chosen_idx = list()

  for _ in range(NUM_ATTRACTIONS_TO_DISPLAY):
    max_rank = popular_data["rank"].max()

    # the attractions that share the highest remaining rank
    most_popular_df = popular_data[popular_data["rank"] == max_rank]

    # among their suppliers, choose the one selected the fewest times so far
    pop_suppliers = most_popular_df["inventory_supplier"].unique()
    chosen_supplier = min(pop_suppliers, key=lambda supplier: supplier_count_dict[supplier])
    # count only the supplier that was actually chosen
    supplier_count_dict[chosen_supplier] += 1

    chosen_attraction_idx = random.choice(most_popular_df[most_popular_df["inventory_supplier"] == chosen_supplier].index)
    chosen_idx.append(chosen_attraction_idx)

    # drop the chosen attraction together with its similarity group, if it has one
    similarity_group = popular_data.loc[chosen_attraction_idx, "similarity_uuid"]

    if not pd.isna(similarity_group):
      popular_data = popular_data[popular_data["similarity_uuid"] != similarity_group]
    else:
      # reassign instead of dropping in place, which avoids SettingWithCopyWarning
      popular_data = popular_data.drop(index=chosen_attraction_idx)

  return chosen_idx
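
The selection loop can be illustrated on a toy list of records using only the standard library. This is a minimal sketch, not the notebook's implementation: the record layout, the `pick_most_popular` name and the `seed` parameter are assumptions made for the example. At each step it takes the highest remaining rank, prefers the supplier chosen least often so far, and then removes every attraction in the chosen attraction's similarity group.

```python
import random
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy records; the field names loosely mirror the notebook's columns.
ATTRACTIONS: List[Dict] = [
    {"id": 0, "supplier": "A", "rank": 3, "group": "g1"},
    {"id": 1, "supplier": "B", "rank": 3, "group": "g1"},
    {"id": 2, "supplier": "A", "rank": 2, "group": None},
    {"id": 3, "supplier": "B", "rank": 2, "group": "g2"},
    {"id": 4, "supplier": "A", "rank": 1, "group": "g2"},
]


def pick_most_popular(records: List[Dict], k: int, seed: Optional[int] = None) -> List[int]:
    rng = random.Random(seed)
    counts: Counter = Counter()
    remaining = list(records)
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        top_rank = max(r["rank"] for r in remaining)
        top = [r for r in remaining if r["rank"] == top_rank]
        # Prefer the supplier picked the fewest times; ties break alphabetically.
        supplier = min({r["supplier"] for r in top}, key=lambda s: (counts[s], s))
        counts[supplier] += 1
        pick = rng.choice([r for r in top if r["supplier"] == supplier])
        chosen.append(pick["id"])
        if pick["group"] is not None:
            # Remove the whole similarity group of the chosen attraction.
            remaining = [r for r in remaining if r["group"] != pick["group"]]
        else:
            remaining = [r for r in remaining if r["id"] != pick["id"]]
    return chosen


print(pick_most_popular(ATTRACTIONS, k=3, seed=0))
# → [0, 3, 2]
```

In this toy data each supplier has a single candidate at every step, so the result does not depend on the random seed.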

In [None]:
#checking for duplicates
assert len(set(chosen_idx)) == len(chosen_idx)

In [None]:
# chosen_idx holds index labels, so select with .loc rather than positional .iloc
chosen_most_popular_attractions = data_with_without_similarity.loc[chosen_idx]

In [None]:
chosen_most_popular_attractions["rank"]

19    19
59    19
38    18
58    18
77    17
37    17
57    17
17    17
76    16
36    16
75    15
35    15
54    14
74    14
53    13
Name: rank, dtype: int64

In [None]:
chosen_most_popular_attractions

Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
19,387732,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,2022-03-24 18:03:58.080 +0200,2022-03-29 13:14:12.163 +0300,Vatican: Museums & Sistine Chapel Entrance Ticket,See priceless works of art from the Papal coll...,,,,,...,,,False,,,,[],Vatican: Museums & Sistine Chapel Entrance Tic...,19,457f8f00-2888-4ecb-b9bb-5e0184e7a602
59,352668,a23f64b0-87e6-4e41-9498-1e565b95b1f4,2022-03-24 19:29:35.584 +0200,2022-06-03 16:52:23.650 +0300,Sagrada Familia entrance tickets,"Discover the cathedral of the Sagrada Familia,...",,,,,...,,USD,False,,,,['Guided Tours'],Sagrada Familia entrance tickets Discover the ...,19,19d8a8e5-2f51-4b86-aa9b-4c4585ac4a9f
38,102846,61753dfb-31e4-432b-9b47-56344bf3bed8,2022-03-24 18:41:57.837 +0200,2022-03-29 13:32:46.934 +0300,Skip the Line: Colosseum Small Group Tour with...,The ancient glory of Rome is reborn! Skip the ...,,,,,...,,,False,,,,"['Historic Sites', 'Guided Tours', 'Walking & ...",Skip the Line: Colosseum Small Group Tour with...,18,7a536c41-8726-4cc0-942a-51df86c19085
58,335064,4782c74d-9369-4d34-98b0-340f3038245c,2022-03-24 19:30:28.675 +0200,2022-06-03 16:52:23.650 +0300,Skip-the-line tickets for the Uffizi Gallery,The Uffizi Gallery is one of the most famous a...,,,,Lungarno degli Acciaioli 30,...,,USD,False,,,,"['Historic Sites', 'Culture', 'Popular', 'Guid...",Skip-the-line tickets for the Uffizi Gallery T...,18,3fc41bef-eb3c-4c7e-ae81-71c5458e6ac0
77,293228,efd75582-2146-4c50-bd47-ee8404b5b468,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Louvre Museum: E-Ticket,These Louvre tickets give you effortless entry...,,,,"Rue de Rivoli, Paris, 0",...,,EUR,False,,,,"['Museums', 'Art']",Louvre Museum: E-Ticket These Louvre tickets g...,17,0c65ce5c-5456-4de2-aaae-4356df4a1528
37,60651,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,2022-03-24 19:25:02.514 +0200,2022-03-31 18:11:25.745 +0300,"Premium Red Dunes, Camel Safari & BBQ at Al Kh...",Spend an evening at the one & only Al Khayma D...,,,,,...,,,False,,,,"['Historic Sites', 'Nightlife', 'Guided Tours'...","Premium Red Dunes, Camel Safari & BBQ at Al Kh...",17,c10e1741-e2eb-4a94-8884-84df2299206e
57,342331,ab348a39-9178-4963-af68-9f0fbf92d0b4,2022-03-24 19:26:45.512 +0200,2022-06-17 03:04:29.436 +0300,Da Vinci's Last Supper tickets and guided tour,Join this exclusive tour to discover one of th...,,,,Corso Magenta 67,...,,USD,False,,,,"['Historic Sites', 'Culture', 'Popular', 'Arch...",Da Vinci's Last Supper tickets and guided tour...,17,
17,470595,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,...,,,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,17,620b20f5-4e36-430c-bc03-20bffef1df29
76,293220,7c1685db-a67e-4b19-a8a6-e443d7535056,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Park Güell,Want to see the most flamboyant and famous par...,,,,"Carrer d'Olot, 12, Barcelona, 0",...,,EUR,False,,,,"['Architecture', 'Historic Sites']",Park Güell Want to see the most flamboyant and...,16,15681704-703d-4509-a8b6-3f145389ff86
36,32750,58623b2b-79b9-4ace-971e-302062807c3e,2022-03-24 18:39:04.135 +0200,2022-03-29 13:32:46.934 +0300,Tuscany in One Day Sightseeing Tour from Florence,"Famous for a wealth of art, history, striking ...",,,,,...,,,False,,,,"['Historic Sites', 'Cuisine', 'Guided Tours', ...",Tuscany in One Day Sightseeing Tour from Flore...,16,


In [None]:
(chosen_most_popular_attractions.set_index("index")).to_dict('records')

In [None]:
chosen_most_popular_attractions.to_csv("chosen_most_popular_attractions.csv")

In [None]:
supplier_count_dict

{'Getyourguide': 5, 'Musement': 5, 'Tiqets': 4, 'Viator': 7}

In [None]:
# create a list of dictionaries: {'uuid': ..., 'most_popular_global': position (1-15)}

def create_most_pop_dict(df, chosen_idx):
  # chosen_idx holds index labels, so select with .loc
  chosen_most_pop_df = pd.DataFrame(df.loc[chosen_idx, "uuid"])
  chosen_most_pop_df["most_popular_global"] = list(range(1, len(chosen_idx) + 1))
  return chosen_most_pop_df.to_dict('records')


create_most_pop_dict(data_with_without_similarity, chosen_idx)


Unnamed: 0,uuid,most_popular_global
19,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,1
59,a23f64b0-87e6-4e41-9498-1e565b95b1f4,2
38,61753dfb-31e4-432b-9b47-56344bf3bed8,3
58,4782c74d-9369-4d34-98b0-340f3038245c,4
77,efd75582-2146-4c50-bd47-ee8404b5b468,5
37,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,6
57,ab348a39-9178-4963-af68-9f0fbf92d0b4,7
17,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,8
76,7c1685db-a67e-4b19-a8a6-e443d7535056,9
36,58623b2b-79b9-4ace-971e-302062807c3e,10
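
For reference, the same `uuid` to position mapping can be built without pandas. This is a minimal sketch, assuming the chosen uuids are already in popularity order; the helper name `rank_records` is hypothetical.

```python
from typing import Dict, List, Union


def rank_records(uuids: List[str]) -> List[Dict[str, Union[str, int]]]:
    # Position 1 is the most popular attraction.
    return [
        {"uuid": attraction_uuid, "most_popular_global": position}
        for position, attraction_uuid in enumerate(uuids, start=1)
    ]


records = rank_records([
    "ec5b1e75-ffa6-4db8-943f-7c58891dcaf0",
    "a23f64b0-87e6-4e41-9498-1e565b95b1f4",
])
print(records[0])
```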


Complete code:<br>
input: List of dictionaries of the attractions<br>
output: List of dictionaries of X most popular attractions
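
To make the similarity part of this flow concrete, here is a small standard-library sketch: score every pair of vectors by cosine similarity, keep the pairs above a threshold, merge the surviving pairs into disjoint groups, and give each group one uuid. The three-dimensional vectors stand in for sentence embeddings, and the helper names (`cosine`, `similar_pairs`, `merge_pairs`, `assign_group_ids`) are assumptions for illustration; the notebook itself relies on SentenceTransformers and pandas.

```python
import math
import uuid
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple

THRESHOLD = 0.65  # same value as SIMILARITY_THRESHOLD in the notebook


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similar_pairs(vectors: List[Sequence[float]], threshold: float) -> List[Tuple[int, int]]:
    # n vectors give n * (n - 1) / 2 unordered pairs to score.
    return [(i, j) for i, j in combinations(range(len(vectors)), 2)
            if cosine(vectors[i], vectors[j]) > threshold]


def merge_pairs(pairs: List[Tuple[int, int]]) -> List[Set[int]]:
    groups: List[Set[int]] = []
    for pair in pairs:
        merged = set(pair)
        untouched = []
        for group in groups:
            if set(pair) & group:
                merged |= group  # the pair bridges into this group
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return groups


def assign_group_ids(groups: List[Set[int]]) -> Dict[int, str]:
    mapping: Dict[int, str] = {}
    for group in groups:
        group_id = str(uuid.uuid4())  # one shared id per group
        for idx in group:
            mapping[idx] = group_id
    return mapping


vectors = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0), (0.0, 0.9, 0.1), (0.0, 0.0, 1.0)]
pairs = similar_pairs(vectors, THRESHOLD)
groups = merge_pairs(pairs)
ids = assign_group_ids(groups)
print(pairs)   # → [(0, 1), (2, 3)]
print(groups)  # → [{0, 1}, {2, 3}]
```

Vector 4 is not close to any other vector, so it gets no group id, which corresponds to the empty `similarity_uuid` values seen in the outputs above.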

In [None]:
!pip install sentence_transformers

In [None]:
# import the files from google colab

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

import re
from typing import List

import numpy as np
import pandas as pd
from pandas import DataFrame


def unavailable_to_nan(df: DataFrame, col_list: List[str]) -> None:
  """
  Replace 'unavailable' and missing values with an empty string in the specified columns

  Args:
    df: raw DataFrame of attractions
    col_list: list of text columns

  Returns:
    None
  """
  for col in col_list:
    df[col] = df[col].apply(lambda x: np.nan if x == 'unavailable' else x)
    df[col] = df[col].fillna("")


def remove_duplicates_and_nan(df: DataFrame) -> None:
  """
  Remove rows that share the same title, description and address, drop rows
  with a missing 'text' value, and reset the index (all in place)

  Args:
    df: DataFrame of attractions

  Returns:
    None
  """
  print("Shape before removing duplicates:", df.shape)
  df.drop_duplicates(subset=['title', 'description', 'address'], inplace=True)
  df.dropna(subset=["text"], inplace=True)
  df.reset_index(inplace=True)
  print("Shape after removing duplicates:", df.shape)


def format_categories(df: pd.DataFrame) -> pd.Series:
  """
  Transform each entry of the "categories_list" column into a list of unique, title-cased categories

  Args:
    df: DataFrame of attractions

  Returns:
    a DataFrame column (Series) with a list of categories in each entry
  """
  def to_list(x):
    if isinstance(x, list):
      return x
    # remove brackets, braces, parentheses and quotes, then split the string on commas
    cleaned = re.sub(r'[()\[\'"{}\]]', '', x).strip()
    return list({j.strip().title() for j in cleaned.split(",")})

  return df["categories_list"].apply(to_list)


def strip_list(df: DataFrame, col: str) -> None:
  """
  Remove empty items from the list stored in each entry of the given column

  Args:
    df: DataFrame that contains the column with the formatted tags
    col: str, the name of the column with the formatted tags

  Returns:
    None
  """
  df[col] = df[col].apply(lambda x: [ele for ele in x if ele.strip()])


def data_preprocess(raw_df: DataFrame) -> DataFrame:
  """
  Preprocess the raw DataFrame: rename the columns where needed,
  create a 'prediction' column with the list of categories,
  create a 'text' column by joining the title and the description,
  and remove duplicate rows

  Args:
    raw_df: raw DataFrame of attractions

  Returns:
    Pre-processed DataFrame
  """
  raw_df = raw_df.rename(
      columns={"name": "title", "about": "description", "tags": "categories_list",
               "source": "inventory_supplier", "location_point": "geolocation"})
  if 'prediction' not in raw_df.columns:
    raw_df["prediction"] = format_categories(raw_df)
    strip_list(raw_df, "prediction")
    # store the list of categories as its string representation
    raw_df["prediction"] = raw_df["prediction"].apply(str)

  unavailable_to_nan(raw_df, ["title", "description"])
  raw_df["text"] = raw_df["title"] + '. ' + raw_df["description"]
  remove_duplicates_and_nan(raw_df)
  print("The data were processed")
  return raw_df
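
The tag clean-up performed by `format_categories` and `strip_list` can be tried on a single value without pandas. This is a minimal sketch under stated assumptions: the helper name `normalize_categories` is hypothetical, and the final sort is an addition made only to give a stable order (the notebook keeps the order of the underlying set).

```python
import re
from typing import List, Union


def normalize_categories(raw: Union[str, List[str]]) -> List[str]:
    if isinstance(raw, list):
        return raw
    # Drop brackets, braces, parentheses and quotes, then split on commas.
    cleaned = re.sub(r'[()\[\'"{}\]]', '', raw)
    tags = {part.strip().title() for part in cleaned.split(",")}
    # Discard empty tags; sorting is an assumption added for a stable order.
    return sorted(tag for tag in tags if tag)


print(normalize_categories("['guided tours', 'historic sites', 'Guided Tours']"))
# → ['Guided Tours', 'Historic Sites']
print(normalize_categories("[]"))
# → []
```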


In [None]:
import re
import json
import uuid
import pandas as pd
from pandas import DataFrame
from torch import Tensor
import numpy as np
import random
from typing import Any, Dict, List
from sentence_transformers import SentenceTransformer, util


SIMILARITY_THRESHOLD: float = 0.65
ATTRACTIONS_PER_SUPPLIER = 40
NUM_ATTRACTIONS_TO_DISPLAY = 15
RELEVANT_SUPPLIERS: List[str] = ['Getyourguide', 'Viator', 'Musement', 'Tiqets']

def _model_embedding(df: DataFrame, col: str) -> Tensor:
    """
    Calculate the embedding (as a torch Tensor) of each entry in the given text column with SentenceTransformers

    Args:
      df: preprocessed DataFrame
      col: str, the name of the text column according to which the embeddings will be calculated

    Returns:
      torch.Tensor
    """
    model: SentenceTransformer = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = df[col].values
    embeddings: Tensor = model.encode(sentences, convert_to_tensor=True)
    print("finished embeddings")
    return embeddings


def _pairs_df_model(embeddings: Tensor) -> DataFrame:
    """
    Compute the cosine similarity of each embedded vector with every other embedded vector

    Args:
      embeddings: Tensor embeddings of the text column

    Returns:
      DataFrame sorted by descending score with columns: 'index' ([i, j] pair of vector indices),
      'score' (cosine score of the pair), 'ind1' and 'ind2' (the two indices as separate columns).
      For n vectors the DataFrame has n(n-1)/2 rows, one per unordered pair
    """
    cosine_scores: Tensor = util.cos_sim(embeddings, embeddings)
    pairs: List[Dict[str, Any]] = []
    for i in range(len(cosine_scores) - 1):
        for j in range(i + 1, len(cosine_scores)):
            # .item() stores a plain float instead of a 0-dim Tensor
            pairs.append({'index': [i, j], 'score': cosine_scores[i][j].item()})

    pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
    pairs_df = pd.DataFrame(pairs)

    pairs_df["ind1"] = pairs_df["index"].apply(lambda x: x[0])
    pairs_df["ind2"] = pairs_df["index"].apply(lambda x: x[1])

    return pairs_df


def similarity_matrix(similarity_idx_df: DataFrame, reduced_df: DataFrame) -> DataFrame:
    """
    Create an n x n similarity matrix: each attraction has a similarity score with every attraction in the data

    Args:
      similarity_idx_df: DataFrame output of the function _pairs_df_model
      reduced_df: preprocessed DataFrame

    Returns:
      square DataFrame. columns = index = the indices of the attractions. values: similarity score
    """
    n = reduced_df.shape[0]
    matrix: DataFrame = pd.DataFrame(columns=range(n), index=range(n))
    for i in range(n):
        for j in range(i, n):
            if j == i:
                matrix.iloc[i, j] = 1
            else:
                # take the single score of the pair (i, j) as a scalar
                similarity_score = \
                    similarity_idx_df[(similarity_idx_df["ind1"] == i) & (similarity_idx_df["ind2"] == j)][
                        "score"].values[0]
                # the matrix is symmetric; .iloc[i, j] avoids chained assignment
                matrix.iloc[i, j] = similarity_score
                matrix.iloc[j, i] = similarity_score
    return matrix


def _groups_idx(similarity_df: DataFrame) -> List[set]:
    """
    Create a list of sets. Each set is a similarity group which contains attraction indices (a group consists
    of the pairs of a particular index and the pairs of its pairs). There is no overlap of indices between the groups

    Args:
      similarity_df: DataFrame output of the function _pairs_df_model

    Returns:
      a list of sets. Each set contains attraction indices and represents a similarity group
    """
    sets_list: List[set] = list()

    for idx in similarity_df["index"].values:
        pair = set(idx)
        merged = set(pair)
        remaining: List[set] = list()

        # build a new list instead of removing items from sets_list while iterating over it
        for group in sets_list:
            if pair & group:
                merged.update(group)
            else:
                remaining.append(group)

        remaining.append(merged)
        sets_list = remaining

    return sets_list


def _groups_df(similarity_df_above_threshold: DataFrame, df: DataFrame) -> List[Dict[str, str]]:
    """
    Creates a DataFrame of 'uuid' and 'similarity_uuid' of the attractions which have similarity score above the threshold

    Args:
      similarity_df_above_threshold: a filtered DataFrame of the output of pairs_df which pass 'score' > threshold
      df: pre-processed DataFrame of the attractions

    Returns:
      a DataFrame of 'uuid' and 'similarity_uuid'
    """
    display_columns: List[str] = ['uuid']
    above_threshold_idx: List[int] = list(set(np.array([idx for idx in similarity_df_above_threshold["index"]]).ravel()))
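    # copy so the assignments below do not write to a slice of df (avoids SettingWithCopyWarning)
    df_above_threshold: DataFrame = df.loc[above_threshold_idx, display_columns].copy()
    df_above_threshold.columns = ["id"]
    df_above_threshold['similarity_uuid'] = 0
    groups_list: List[set[int]] = _groups_idx(similarity_df_above_threshold)

    for group in groups_list:
        # one fresh uuid per similarity group, shared by all of its members
        df_above_threshold.loc[list(group), 'similarity_uuid'] = str(uuid.uuid4())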

    similarity_groups_json: List[Dict[str, str]] = df_above_threshold.to_dict('records')
    return similarity_groups_json


def _compute_similarity_groups(
        attractions: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """
    Assigns a similarity uuid to each attraction that has at least one similar attraction

    Args:
        attractions: List of dictionaries of the attractions

    Returns:
        List of dictionaries, each with the keys "id" (the attraction uuid) and "similarity_uuid"
    """
    raw_df: DataFrame = pd.DataFrame.from_dict(attractions)
    df_reduced: DataFrame = data_preprocess(raw_df)

    embeddings_text: Tensor = _model_embedding(df_reduced, "text")
    similarity_df: DataFrame = _pairs_df_model(embeddings_text)

    similarity_df_above_threshold: DataFrame = similarity_df[similarity_df["score"] > SIMILARITY_THRESHOLD]

    similarity_df_json: List[Dict[str, str]] = _groups_df(similarity_df_above_threshold, df_reduced)

    return similarity_df_json



def all_most_popular(attractions: DataFrame) -> DataFrame:
  """
  Extract the specified number of most popular attractions from each supplier
  and join them into one DataFrame

  Args:
    attractions: DataFrame of all the attractions in the database

  Return:
    DataFrame with the most popular attractions of each paid supplier

  """
  all_most_popular_attractions = pd.DataFrame()

  for supplier in RELEVANT_SUPPLIERS:
    # extract the ATTRACTIONS_PER_SUPPLIER most-reviewed attractions of this supplier
    most_pop_supplier = attractions[attractions["inventory_supplier"] == supplier].sort_values(by="number_of_reviews", ascending=False)[:ATTRACTIONS_PER_SUPPLIER]
    # re-sort ascending so that the highest rank goes to the most popular attraction
    most_pop_supplier = most_pop_supplier.sort_values(by="number_of_reviews", ascending=True)
    most_pop_supplier["rank"] = list(range(most_pop_supplier.shape[0]))
    # add to all most popular
    all_most_popular_attractions = pd.concat([all_most_popular_attractions, most_pop_supplier])
    print("all_most_popular_attractions shape:", all_most_popular_attractions.shape)
  return all_most_popular_attractions


def choose_x_most_popular_idx(data_with_similarity_id: DataFrame) -> List[int]:
  """
  Select the specified number of indices of the most popular attractions

  Args:
    data_with_similarity_id: DataFrame of all most popular attractions with similarity_uuid column

  Return:
    list of indices of the chosen most popular attractions

  """

  supplier_count_dict: Dict[str,int] = {supplier:0 for supplier in RELEVANT_SUPPLIERS}
  popular_data: DataFrame = data_with_similarity_id.copy() 
  chosen_idx = list()

  for i in range(NUM_ATTRACTIONS_TO_DISPLAY):
    max_rank: int = popular_data["rank"].max()
    
    # finding the most_popular attractions
    most_popular_df: DataFrame = popular_data[popular_data["rank"] == max_rank]
    
    # how to choose which one?
    # first, find the suppliers
    pop_suppliers: List[str] = most_popular_df["inventory_supplier"].unique()
    min_count = float("inf")
    for supplier in pop_suppliers:
      if supplier_count_dict[supplier] < min_count:
        min_count = supplier_count_dict[supplier]
        chosen_supplier: str = supplier
    # update supplier_count_dict once, for the supplier that was actually chosen
    supplier_count_dict[chosen_supplier] += 1
        
    chosen_attraction_idx: int = random.choice(most_popular_df[most_popular_df["inventory_supplier"] == chosen_supplier].index)
    chosen_idx.append(chosen_attraction_idx)
    
    # drop similarities of the chosen attractions
    similarity_group: str = popular_data.loc[chosen_attraction_idx]["similarity_uuid"]
    
    if not pd.isna(similarity_group):
      popular_data: DataFrame = popular_data[popular_data["similarity_uuid"] != similarity_group]
    else:
      popular_data.drop(index=chosen_attraction_idx, inplace=True)

  return chosen_idx


def create_most_pop_dict(df, chosen_idx):
    """
    Creates a list of dictionaries with the attraction id, city name and global popularity rank
    (1 to NUM_ATTRACTIONS_TO_DISPLAY, 1 being the most popular)

    Args:
      df: DataFrame of all popular attractions with similarity_uuid column
      chosen_idx: list of int, The output of choose_x_most_popular_idx function

    Return:
      list of dictionaries with the keys "attraction_id", "external_city_name" and "rank"
    """

    # chosen_idx holds index labels (taken from .index in choose_x_most_popular_idx), so select with .loc
    chosen_most_pop_df = df.loc[chosen_idx, ["uuid"]].copy()
    chosen_most_pop_df["uuid"] = chosen_most_pop_df["uuid"].astype(str)
    chosen_most_pop_df.rename(columns={"uuid": "attraction_id"}, inplace=True)
    chosen_most_pop_df['external_city_name'] = df.loc[chosen_idx, "external_city_name"]
    # chosen_idx is ordered from most to least popular, so rank 1 is the most popular
    chosen_most_pop_df["rank"] = list(range(1, NUM_ATTRACTIONS_TO_DISPLAY + 1))
    return chosen_most_pop_df.to_dict("records")



def selected_most_popular(attractions: List[Dict[str,str]]):
  """ 
  select the specified number of most popular attractions

  Args:
    attractions: List of dictionaries of the attractions

  Return:
    List of dictionaries of selected most popular attractions

  """
  attractions_df: DataFrame = pd.DataFrame.from_dict(attractions)
  attractions_df_preprocess: DataFrame = data_preprocess(attractions_df)

  # choose the X most popular attractions from each supplier (all_most_popular returns a DataFrame)
  all_most_popular_attractions_df: DataFrame = all_most_popular(attractions_df_preprocess)
  # create similarity groups
  similarity_groups: List[Dict[str,str]] = _compute_similarity_groups(all_most_popular_attractions_df)
  similarity_groups_df: DataFrame = pd.DataFrame.from_dict(similarity_groups)
  # add the similarity_uuid column to the popular attractions (outer merge keeps attractions without a group)
  similarity_groups_df.rename(columns={'id': 'uuid'}, inplace=True)
  data_with_similarity:  DataFrame = pd.merge(all_most_popular_attractions_df, similarity_groups_df, how='outer')
  
  chosen_popular_idx: List[int] = choose_x_most_popular_idx(data_with_similarity)
  most_popular_dict = create_most_pop_dict(data_with_similarity, chosen_popular_idx) 

  print("chosen_idx:", chosen_popular_idx)
  #chosen_most_popular_attractions: DataFrame = data_with_similarity.loc[chosen_popular_idx]
  return most_popular_dict
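
The grouping step in `_groups_idx` amounts to finding connected components over the above-threshold pairs: if A is similar to B and B is similar to C, all three land in one group. A minimal sketch of the same idea with a union-find structure, assuming the pairs arrive as plain `(i, j)` tuples; `merge_pairs_into_groups` is a hypothetical helper and not part of the notebook:

```python
from typing import Dict, Iterable, List, Set, Tuple


def merge_pairs_into_groups(pairs: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Merge index pairs into disjoint groups (connected components)."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    # Collect every node under its root.
    groups: Dict[int, Set[int]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


example_groups = merge_pairs_into_groups([(0, 1), (2, 3), (1, 2), (5, 6)])
print(sorted(sorted(g) for g in example_groups))
```

The pair `(1, 2)` bridges the first two pairs, so the sketch returns two groups: one with indices 0 to 3 and one with 5 and 6.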


In [None]:
data_path = '/content/drive/MyDrive/ColabNotebooks/bridgify/duplicates_and_similarities/most_popular/all_attractions.csv'
attractions_dict = (pd.read_csv(data_path)).to_dict('records')
attractions_dict


[{'additional_info_id': nan,
  'address': nan,
  'availability_type': nan,
  'categories_list': '\'{"Guided Tours"}\'',
  'city_id': nan,
  'created_at': '2022-03-24 18:03:42.888 +0200',
  'currency': nan,
  'description': 'Take part in an adrenaline-filled mountain bike tour through Mount Etna’s nature paths. Ride downhill on exciting trails with lava bumps surrounded by ancient trees, making for a truly unique experience with a professional guide.',
  'duration': '06:00:00',
  'external_city_name': nan,
  'external_id': '320938',
  'geolocation': 'POINT (15.25942 37.990372)',
  'hotel_pickup': False,
  'inventory_supplier': 'Getyourguide',
  'is_accessible': nan,
  'is_active': True,
  'is_city_processed': False,
  'is_curated': True,
  'is_free': False,
  'is_itinerary_resource': False,
  'is_relevant_for_adult': nan,
  'is_relevant_for_child': nan,
  'is_relevant_for_infant': nan,
  'last_updated': '2022-03-29 13:14:12.163 +0300',
  'main_photo_url': 'https://cdn.getyourguide.com/i
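
The per-supplier step in `all_most_popular` (keep the most-reviewed attractions of each `inventory_supplier`, then rank them so that the most popular gets the highest rank) can also be written without an explicit loop. A minimal sketch with `groupby`, assuming the same `inventory_supplier` and `number_of_reviews` columns; `top_n_per_supplier` and the toy rows are hypothetical, and unlike the notebook function it keeps every supplier present in the frame rather than only those in `RELEVANT_SUPPLIERS`:

```python
import pandas as pd


def top_n_per_supplier(attractions: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n most-reviewed rows per supplier and rank them within the supplier."""
    top = (
        attractions.sort_values("number_of_reviews", ascending=False)
        .groupby("inventory_supplier", sort=False)
        .head(n)
        .copy()
    )
    # rank 0 = least popular of the kept rows, n - 1 = most popular
    top["rank"] = (
        top.groupby("inventory_supplier")["number_of_reviews"]
        .rank(method="first", ascending=True)
        .astype(int)
        - 1
    )
    return top


toy = pd.DataFrame({
    "uuid": ["a", "b", "c", "d", "e"],
    "inventory_supplier": ["S1", "S1", "S1", "S2", "S2"],
    "number_of_reviews": [10, 50, 30, 7, 9],
})
ranked = top_n_per_supplier(toy, n=2)
print(ranked.sort_values(["inventory_supplier", "rank"]))
```

Ties in `number_of_reviews` are broken by row order here (`method="first"`), which may not match the order produced by the notebook's two `sort_values` calls.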

In [None]:
selected_most_popular_dict = selected_most_popular(attractions_dict)
selected_most_popular_df = pd.DataFrame.from_dict(selected_most_popular_dict)

Shape before removing duplicates: (512663, 38)
Shape after removing duplicates: (501384, 39)
The data were processed
all_most_popular_attractions shape: (40, 40)
all_most_popular_attractions shape: (80, 40)
all_most_popular_attractions shape: (120, 40)
all_most_popular_attractions shape: (160, 40)
Shape before removing duplicates: (160, 40)
Shape after removing duplicates: (160, 41)
The data were processed


finished embeddings


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


chosen_idx: [39, 119, 78, 118, 157, 77, 117, 37, 156, 76, 155, 114, 154, 73, 113]
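
The selection loop in `choose_x_most_popular_idx` can be read as a greedy pick: take the highest-ranked remaining attraction, prefer the supplier chosen least often so far, then discard everything in the same similarity group. A minimal sketch on plain dictionaries, assuming hypothetical field names `idx`, `supplier`, `rank` and `group` (with `None` for attractions without a similarity group); it breaks ties deterministically by lowest `idx` instead of using `random.choice`:

```python
from typing import Dict, List, Optional


def greedy_diverse_pick(items: List[Dict], k: int) -> List[int]:
    """Pick up to k items by rank, balancing suppliers and skipping similar items."""
    remaining = list(items)
    supplier_counts: Dict[str, int] = {}
    chosen: List[int] = []

    while remaining and len(chosen) < k:
        top_rank = max(item["rank"] for item in remaining)
        candidates = [item for item in remaining if item["rank"] == top_rank]
        # Prefer the supplier picked least often so far; ties go to the lowest idx.
        pick = min(
            candidates,
            key=lambda item: (supplier_counts.get(item["supplier"], 0), item["idx"]),
        )
        chosen.append(pick["idx"])
        supplier_counts[pick["supplier"]] = supplier_counts.get(pick["supplier"], 0) + 1

        group: Optional[str] = pick["group"]
        if group is None:
            # No similarity group: drop only the chosen item.
            remaining = [item for item in remaining if item["idx"] != pick["idx"]]
        else:
            # Drop the whole similarity group, including the chosen item.
            remaining = [item for item in remaining if item["group"] != group]
    return chosen


toy_items = [
    {"idx": 0, "supplier": "S1", "rank": 2, "group": "g1"},
    {"idx": 1, "supplier": "S2", "rank": 2, "group": "g1"},
    {"idx": 2, "supplier": "S2", "rank": 1, "group": None},
    {"idx": 3, "supplier": "S1", "rank": 1, "group": None},
    {"idx": 4, "supplier": "S1", "rank": 0, "group": "g2"},
]
print(greedy_diverse_pick(toy_items, k=3))
```

In the toy data, item 1 is never chosen because it shares group `g1` with item 0, and item 2 is preferred over item 3 because supplier `S2` has not been picked yet at that point.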


In [None]:
selected_most_popular_dict

[{'attraction_id': 'ec5b1e75-ffa6-4db8-943f-7c58891dcaf0',
  'external_city_name': 'Rome',
  'rank': 1},
 {'attraction_id': 'a23f64b0-87e6-4e41-9498-1e565b95b1f4',
  'external_city_name': 'Barcelona',
  'rank': 2},
 {'attraction_id': '61753dfb-31e4-432b-9b47-56344bf3bed8',
  'external_city_name': nan,
  'rank': 3},
 {'attraction_id': '4782c74d-9369-4d34-98b0-340f3038245c',
  'external_city_name': 'Florence',
  'rank': 4},
 {'attraction_id': 'efd75582-2146-4c50-bd47-ee8404b5b468',
  'external_city_name': 'Paris',
  'rank': 5},
 {'attraction_id': '200b3af9-49f3-4cbf-9379-c93be6a0a2d6',
  'external_city_name': 'Dubai',
  'rank': 6},
 {'attraction_id': 'ab348a39-9178-4963-af68-9f0fbf92d0b4',
  'external_city_name': 'Milan',
  'rank': 7},
 {'attraction_id': 'c88d79eb-7ee4-47ac-8d0d-9970fc82fa10',
  'external_city_name': 'Dubai',
  'rank': 8},
 {'attraction_id': '7c1685db-a67e-4b19-a8a6-e443d7535056',
  'external_city_name': 'Barcelona',
  'rank': 9},
 {'attraction_id': '58623b2b-79b9-4ace-9

In [None]:
pd.merge(data_with_without_similarity, selected_most_popular_df, how='inner', left_on="uuid", right_on="attraction_id")

Unnamed: 0,index,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,...,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank_x,similarity_uuid,attraction_id,rank_y
0,470595,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,...,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,17,f9c695b3-4ec1-42fb-9e93-3049bb50d0c8,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,8
1,387732,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,2022-03-24 18:03:58.080 +0200,2022-03-29 13:14:12.163 +0300,Vatican: Museums & Sistine Chapel Entrance Ticket,See priceless works of art from the Papal coll...,,,,,...,False,,,,[],Vatican: Museums & Sistine Chapel Entrance Tic...,19,61275a36-cfe8-4ad0-b4a0-06104bc5670a,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,1
2,96976,64e177bb-ed5e-460b-bc29-d29765c54a96,2022-03-24 18:43:57.594 +0200,2022-03-29 13:32:51.760 +0300,"Stonehenge, Windsor Castle, and Bath from London",See the official residence of The Queen and ho...,,,,,...,False,,,,"['Historic Sites', 'Guided Tours']","Stonehenge, Windsor Castle, and Bath from Lond...",13,,64e177bb-ed5e-460b-bc29-d29765c54a96,14
3,32750,58623b2b-79b9-4ace-971e-302062807c3e,2022-03-24 18:39:04.135 +0200,2022-03-29 13:32:46.934 +0300,Tuscany in One Day Sightseeing Tour from Florence,"Famous for a wealth of art, history, striking ...",,,,,...,False,,,,"['Historic Sites', 'Cuisine', 'Guided Tours', ...",Tuscany in One Day Sightseeing Tour from Flore...,16,,58623b2b-79b9-4ace-971e-302062807c3e,10
4,60651,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,2022-03-24 19:25:02.514 +0200,2022-03-31 18:11:25.745 +0300,"Premium Red Dunes, Camel Safari & BBQ at Al Kh...",Spend an evening at the one & only Al Khayma D...,,,,,...,False,,,,"['Guided Tours', 'Nature', 'Outdoor Activities...","Premium Red Dunes, Camel Safari & BBQ at Al Kh...",17,0a5942b3-814d-4034-bdf9-f61ec566c646,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,6
5,102846,61753dfb-31e4-432b-9b47-56344bf3bed8,2022-03-24 18:41:57.837 +0200,2022-03-29 13:32:46.934 +0300,Skip the Line: Colosseum Small Group Tour with...,The ancient glory of Rome is reborn! Skip the ...,,,,,...,False,,,,"['Historic Sites', 'Guided Tours', 'Walking & ...",Skip the Line: Colosseum Small Group Tour with...,18,e889768c-cdb4-42a0-ba8e-7080314ef6ba,61753dfb-31e4-432b-9b47-56344bf3bed8,3
6,299467,4a655b8b-b211-4af3-a47f-1f60b02b3de5,2022-03-24 19:27:09.670 +0200,2022-06-17 03:04:29.436 +0300,Alhambra and Nasrid Palace skip the line ticke...,No need to wait when you can skip the line to ...,,,,,...,False,,,,"['Architecture', 'Historic Sites', 'Guided Tou...",Alhambra and Nasrid Palace skip the line ticke...,13,,4a655b8b-b211-4af3-a47f-1f60b02b3de5,15
7,327724,99f7be76-aa37-42ff-8e38-84e2cd392b1f,2022-03-24 19:29:49.821 +0200,2022-06-03 16:52:23.650 +0300,Pisa Leaning Tower and Cathedral skip-the-line...,"The Leaning Tower, situated in Piazza dei Mira...",,,,,...,False,,,,"['Popular', 'Architecture', 'Guided Tours', 'C...",Pisa Leaning Tower and Cathedral skip-the-line...,14,,99f7be76-aa37-42ff-8e38-84e2cd392b1f,12
8,342331,ab348a39-9178-4963-af68-9f0fbf92d0b4,2022-03-24 19:26:45.512 +0200,2022-06-17 03:04:29.436 +0300,Da Vinci's Last Supper tickets and guided tour,Join this exclusive tour to discover one of th...,,,,Corso Magenta 67,...,False,,,,"['Popular', 'Architecture', 'Culture', 'Guided...",Da Vinci's Last Supper tickets and guided tour...,17,,ab348a39-9178-4963-af68-9f0fbf92d0b4,7
9,335064,4782c74d-9369-4d34-98b0-340f3038245c,2022-03-24 19:30:28.675 +0200,2022-06-03 16:52:23.650 +0300,Skip-the-line tickets for the Uffizi Gallery,The Uffizi Gallery is one of the most famous a...,,,,Lungarno degli Acciaioli 30,...,False,,,,"['Popular', 'Guided Tours', 'Culture', 'Museum...",Skip-the-line tickets for the Uffizi Gallery. ...,18,ad6b2d23-fc72-423d-ad9d-fb53f491a158,4782c74d-9369-4d34-98b0-340f3038245c,4


In [None]:
data_with_without_similarity["rank"].loc[[59, 19, 38, 58, 77, 17, 37, 57, 76, 36, 75, 35, 74, 54, 33]]

59    19
19    19
38    18
58    18
77    17
17    17
37    17
57    17
76    16
36    16
75    15
35    15
74    14
54    14
33    13
Name: rank, dtype: int64

In [None]:
selected_most_popular_df["rank"]

0     19
1     19
2     18
3     18
4     17
5     17
6     17
7     17
8     16
9     16
10    15
11    15
12    14
13    14
14    13
Name: rank, dtype: int64

In [None]:
selected_most_popular_df

Unnamed: 0,uuid,created_at,last_updated,title,description,translation_status,native_name,native_about,address,geolocation,...,similarity_group_id,currency,is_city_processed,is_relevant_for_adult,is_relevant_for_child,is_relevant_for_infant,prediction,text,rank,similarity_uuid
0,a23f64b0-87e6-4e41-9498-1e565b95b1f4,2022-03-24 19:29:35.584 +0200,2022-06-03 16:52:23.650 +0300,Sagrada Familia entrance tickets,"Discover the cathedral of the Sagrada Familia,...",,,,,,...,,USD,False,,,,['Guided Tours'],Sagrada Familia entrance tickets Discover the ...,39,19d8a8e5-2f51-4b86-aa9b-4c4585ac4a9f
1,ec5b1e75-ffa6-4db8-943f-7c58891dcaf0,2022-03-24 18:03:58.080 +0200,2022-03-29 13:14:12.163 +0300,Vatican: Museums & Sistine Chapel Entrance Ticket,See priceless works of art from the Papal coll...,,,,,POINT (12.45249 41.903839),...,,,False,,,,[],Vatican: Museums & Sistine Chapel Entrance Tic...,39,457f8f00-2888-4ecb-b9bb-5e0184e7a602
2,61753dfb-31e4-432b-9b47-56344bf3bed8,2022-03-24 18:41:57.837 +0200,2022-03-29 13:32:46.934 +0300,Skip the Line: Colosseum Small Group Tour with...,The ancient glory of Rome is reborn! Skip the ...,,,,,,...,,,False,,,,"['Historic Sites', 'Guided Tours', 'Walking & ...",Skip the Line: Colosseum Small Group Tour with...,38,7a536c41-8726-4cc0-942a-51df86c19085
3,4782c74d-9369-4d34-98b0-340f3038245c,2022-03-24 19:30:28.675 +0200,2022-06-03 16:52:23.650 +0300,Skip-the-line tickets for the Uffizi Gallery,The Uffizi Gallery is one of the most famous a...,,,,Lungarno degli Acciaioli 30,POINT (11.252402 43.768837),...,,USD,False,,,,"['Historic Sites', 'Culture', 'Popular', 'Guid...",Skip-the-line tickets for the Uffizi Gallery T...,38,3fc41bef-eb3c-4c7e-ae81-71c5458e6ac0
4,efd75582-2146-4c50-bd47-ee8404b5b468,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Louvre Museum: E-Ticket,These Louvre tickets give you effortless entry...,,,,"Rue de Rivoli, Paris, 0",POINT (2.335537 48.861109),...,,EUR,False,,,,"['Museums', 'Art']",Louvre Museum: E-Ticket These Louvre tickets g...,37,0c65ce5c-5456-4de2-aaae-4356df4a1528
5,c88d79eb-7ee4-47ac-8d0d-9970fc82fa10,2022-03-24 20:58:30.750 +0200,2022-03-29 13:14:12.163 +0300,Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,Witness the views over Dubai from the observat...,,,,,POINT (55.308651 25.26944),...,,,False,,,,[],Dubai: Burj Khalifa Level 124 and 125 Entry Ti...,37,620b20f5-4e36-430c-bc03-20bffef1df29
6,200b3af9-49f3-4cbf-9379-c93be6a0a2d6,2022-03-24 19:25:02.514 +0200,2022-03-31 18:11:25.745 +0300,"Premium Red Dunes, Camel Safari & BBQ at Al Kh...",Spend an evening at the one & only Al Khayma D...,,,,,,...,,,False,,,,"['Historic Sites', 'Nightlife', 'Guided Tours'...","Premium Red Dunes, Camel Safari & BBQ at Al Kh...",37,c10e1741-e2eb-4a94-8884-84df2299206e
7,ab348a39-9178-4963-af68-9f0fbf92d0b4,2022-03-24 19:26:45.512 +0200,2022-06-17 03:04:29.436 +0300,Da Vinci's Last Supper tickets and guided tour,Join this exclusive tour to discover one of th...,,,,Corso Magenta 67,POINT (9.1710985 45.4653785),...,,USD,False,,,,"['Historic Sites', 'Culture', 'Popular', 'Arch...",Da Vinci's Last Supper tickets and guided tour...,37,
8,7c1685db-a67e-4b19-a8a6-e443d7535056,2022-03-24 18:14:47.271 +0200,2022-06-17 03:01:44.782 +0300,Park Güell,Want to see the most flamboyant and famous par...,,,,"Carrer d'Olot, 12, Barcelona, 0",POINT (2.152694 41.414495),...,,EUR,False,,,,"['Architecture', 'Historic Sites']",Park Güell Want to see the most flamboyant and...,36,15681704-703d-4509-a8b6-3f145389ff86
9,58623b2b-79b9-4ace-971e-302062807c3e,2022-03-24 18:39:04.135 +0200,2022-03-29 13:32:46.934 +0300,Tuscany in One Day Sightseeing Tour from Florence,"Famous for a wealth of art, history, striking ...",,,,,,...,,,False,,,,"['Historic Sites', 'Cuisine', 'Guided Tours', ...",Tuscany in One Day Sightseeing Tour from Flore...,36,
