# **New route-buildibg approach**<br>
finding the next attraction according to distance, similarity, tags and popularity vectors

Using Berlin_google_tagged data.<br>
The file was tagged using the "tagging" code, however, since the model was also trained on Berlin data, the tagging results are biased. For the example of route_building it is fine.

**similarity**

* In the context of similarity between attractions, the algorithm specifically 
addresses the similarity to the last attraction, but still takes into account the other selected attractions as well

**Tags**

* Preference to the attraction is given on the selected tags and also on the number of tags that the attraction answers. For example: an attraction that contains 5/5 tags will have a higher priority than an attraction that contains 3/5 tags


* If a selected attraction meets a particular tag then the priority of the particular tag decreases but is still higher than the tags that are not selected at all

**Popularity**

* Currently the popularity is determined by "number_of_reviews" (can work with google attractions, but is problematic indication regarding of other suppliers attractions)

**Distance**

* The distance is calculated by Haversine distance:
The angular distance between two points on the surface of a sphere (in KM)

**Next attraction formula**<br>

* The first attraction is selected according to the weighting vectors of popularity and tags

* The formula is flexible and can be changed according to our / the user's requirements

* Currently the formula prioritizing popularity, then distance and then tags and similarities.<br>
Coefficients:<br>Popularity: 3<br>Distance: 2<br>Tags and Similarities: 1

###Installations and imports

In [1]:
!pip install sentence_transformers
!pip install haversine

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
[K     |████████████████████████████████| 79 kB 7.4 MB/s 
[?25hCollecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
[K     |████████████████████████████████| 3.8 MB 54.8 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 57.9 MB/s 
[?25hCollecting huggingface-hub
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
[K     |████████████████████████████████| 67 kB 6.1 MB/s 
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
[K     |████████████████████████████████| 6.5 MB 48.6 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64

In [2]:
import os
import re
import sys
import logging
import warnings
import argparse
import datetime
import numpy as np
import pandas as pd
import haversine as hs
from itertools import combinations
from collections import Counter
from configparser import ConfigParser
from sentence_transformers import SentenceTransformer, util

### **Similarities** matrix

In [3]:
# main.py


# read config file
config_object = ConfigParser()
config_object.read("pr_config.ini")


def parse_args(logger):
    """
    This function initialize the parser and the input parameters
    """

    my_parser = argparse.ArgumentParser(description=config_object['params']['description'])
    my_parser.add_argument('--path', '-p',
                           required=True,
                           type=str,
                           help="config_object['params']['path_help']")

    my_parser.add_argument('--save', '-s',
                           required=False,
                           type=str, default=None,
                           help=config_object['params']['save_help'])

    args = my_parser.parse_args()
    logger.info('Parsed arguments')
    return args


def val_input(args, logger):
    """
    This function validated that the input file exists and that the output path's folder exists
    """

    if not os.path.isfile(args.path):
        logger.debug('the input file doesn\'t exists')
        return False

    if args.save:
        if '/' in args.save:
            folder = "/".join(args.save.split('/')[:-1])
            if not os.path.exists(folder):
                logger.debug('the output folder doesn\'t exists')
                return False
        else:
            folder = "/".join(args.path.split('/')[:-1])
            args.save = folder + '/' + args.save

    else:
        current_time = datetime.datetime.now()
        save_path = f'processed_data_{current_time.year}-{current_time.month}-{current_time.day}-{current_time.hour}-' \
                    f'{current_time.minute}.xlsx'
        args.save = save_path
        logger.info('the save path was set to default')
    logger.info(f'args={args}')
    logger.info('input was validated')
    return True


def init_logger():
    """
    This function initialize the logger and returns its handle
    """

    log_formatter = logging.Formatter('%(levelname)s-%(asctime)s-FUNC:%(funcName)s-LINE:%(lineno)d-%(message)s')
    logger = logging.getLogger('log')
    logger.setLevel('DEBUG')
    file_handler = logging.FileHandler('pr_log.txt')
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(log_formatter)
    logger.addHandler(file_handler)

    return logger


def unavailable_to_nan(df, logger):
    """
    Transforming 'unavailable' to np.nan
    """
    text_cols = ["about", "name", "address"]
    for col in text_cols:
        try:
            df[col] = df[col].apply(lambda x: np.nan if x == 'unavailable' else x)
        except KeyError as er:
            logger.debug(f'{col} column is missing from the DataFrame!')
            print(er)
            sys.exit(1)


def remove_duplicates_and_nan(df, logger):
    """ Remove rows which are exactly the same """

    logger.info(f"Shape before removing duplicates and Nans: {df.shape}")
    print("Shape before removing duplicates and Nans:", df.shape)
    try:
        # I exclude 'address' from 'drop_duplicates' because in many rows the address is inaccurate or missing so the
        # duplicates will be expressed especially according to 'name' and 'about'
        df.drop_duplicates(subset=['name', 'about'], inplace=True)
        df.dropna(subset=['name', 'about', 'address'], inplace=True)
        df.reset_index(inplace=True)
    except KeyError as er:
        logger.debug("One or more columns from the list ['name','about'] are missing from the "
                     "DataFrame!")
        print(er)
        sys.exit(1)

    logger.info(f"Shape after removing duplicates: {df.shape}")
    print("Shape after removing duplicates:", df.shape)
    return df


def model_embedding(text_df, col):
    """
  return the embeddings (as torch) of all the text column
  """
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Single list of sentences
    sentences = text_df[col].values

    # Compute embeddings
    embeddings = model.encode(sentences, convert_to_tensor=True)  # each text transforms to a vector
    return embeddings


def pairs_df_model(embeddings):
    """
  receive embeddings as dataframe.
  Return a DataFrame of computed cosine-similarities for each embedded vector with each other embedded vector.
  The shape of the DataFrame supposed to be (len(embedding), len(embeddings))
  """
    cosine_scores = util.cos_sim(embeddings, embeddings)

    # Find the pairs with the highest cosine similarity scores
    pairs = []
    for i in range(len(cosine_scores) - 1):
        for j in range(i + 1, len(cosine_scores)):
            pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

    # Sort scores in decreasing order
    pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

    # transform to DataFrame and add split the pairs to two colums: 'ind1', 'ind2'
    pairs_df = pd.DataFrame(pairs)

    pairs_df["ind1"] = pairs_df["index"].apply(lambda x: x[0]).values
    pairs_df["ind2"] = pairs_df["index"].apply(lambda x: x[1]).values
    # pairs_df = pairs_df[pairs_df["score"] > threshold]
    return pairs_df


def df_for_model(df, text_col, name_score):
    """
    The function receives dataframe and a text column (not 'about') according to which the similarity will be calculated
    and retrieves a similarity df with the columns: name_score, "ind1", "ind2"
    """
    embedding = model_embedding(df, text_col)
    df_similarity = pairs_df_model(embedding)
    df_similarity.rename(columns={"score": name_score}, inplace=True)
    return df_similarity.drop(columns=["index"])


def merge_df(df1, df2):
    """
    return merged dataframe according to the values in 'ind1' and ind2'
     """
    return pd.merge(df1, df2, on=["ind1", "ind2"], how="inner")



def embeddings_for_model(group_vectors_df):
    """
    Retrieve vectors dataframe and return the data as a numpy array for the similarity model
    """
    groups_vectors = group_vectors_df["avg_vector"].values
    return np.array(groups_vectors.tolist())


def last_col_first(df):
    """
  changing the order of the columns so the last column will be the first column in the dataframe
  """
    cols = df.columns.to_list()
    cols = cols[-1:] + cols[:-1]
    return df[cols]


def similarity_matrix(similarity_idx_df, reduced_df):
  """
  Return n^2 similarity matrix. Each attarction has a similarity score in relation to each attraction in the data
  """
  similarity_matrix = pd.DataFrame(columns=[i for i in range(reduced_df.shape[0])], index=range(reduced_df.shape[0]))
  for i in range(reduced_df.shape[0]):
    for j in range(i,reduced_df.shape[0]):
      if j == i:
        similarity_matrix.iloc[i][j] = 1
        similarity_matrix.iloc[j][i] = 1
      else:
        similarity_score = similarity_idx_df[(similarity_idx_df["ind1"] == i) & (similarity_idx_df["ind2"] == j)]["score"].values
        similarity_matrix.iloc[i][j] = similarity_score
        similarity_matrix.iloc[j][i] = similarity_score
  return similarity_matrix


def data_attributes(df):
    """
    :param df: The groups dataframe that was extracted from the original data
    :return: print the dataframe information
    """
    print("Number of groups in the data:", df["group"].nunique())
    print("\n")
    print("Number of rows in the data:", df.shape[0])
    print("\n")
    print("row count in each group:\n")
    print(df.groupby("group")["name"].count())

def norm_df(df):
  return (df-df.min())/ (df.max() - df.min())


def main():
    # config logger
    logger = init_logger()
    logger.info('STARTED RUNNING')

    # load the data
    raw_df = pd.read_csv("Berlin_google_tagged.csv", encoding='UTF-8')

    # 'unavailable' to NAN
    unavailable_to_nan(raw_df, logger)

    # Remove rows which are exactly the same
    df_reduced = remove_duplicates_and_nan(raw_df, logger)
    df_reduced["name_about"] = df_reduced["name"] + " " + df_reduced["about"]

    # Creating similarities DataFrames according to 'name' and 'address'
    #name_similarity = df_for_model(df_reduced, "name", "name_score")
    #address_similarity = df_for_model(df_reduced, "address", "address_score")

    # Creating similarity DataFrame according to 'about' column and according
    embeddings_about = model_embedding(df_reduced, "name_about")
    embeddings = pd.DataFrame(embeddings_about)
    about_similarity = pairs_df_model(embeddings_about)
    similarity_matrix_scores = similarity_matrix(about_similarity, df_reduced)

    # transforming the tensors to floats 
    for i in range(similarity_matrix_scores.shape[0]):
      similarity_matrix_scores.iloc[i] = similarity_matrix_scores.iloc[i].astype('float')

    return similarity_matrix_scores, df_reduced


berlin_similarity_matrix, berlin_reduced  = main()
berlin_similarity_norm = norm_df(berlin_similarity_matrix)

Shape before removing duplicates and Nans: (301, 17)
Shape after removing duplicates: (290, 18)


Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [53]:
# should add this line to code in pycharm for route-building
for i in range(berlin_similarity_norm.shape[0]):
  berlin_similarity_norm.iloc[i] = berlin_similarity_norm.iloc[i].astype('float')

In [4]:
berlin_similarity_norm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,280,281,282,283,284,285,286,287,288,289
0,1.0,0.322699,0.25035,0.297662,0.248276,0.282005,0.213082,0.243938,0.151737,0.127449,...,0.15907,0.171361,0.181363,0.195135,0.161197,0.205002,0.160908,0.143461,0.188931,0.213141
1,0.324559,1.0,0.171081,0.322057,0.223086,0.40631,0.244922,0.220396,0.164508,0.226302,...,0.123165,0.212983,0.182022,0.10395,0.183199,0.216534,0.115309,0.195455,0.089166,0.211241
2,0.26949,0.190021,1.0,0.136857,0.116162,0.18565,0.267309,0.274952,0.193231,0.235519,...,0.101739,0.213039,0.143898,0.24949,0.226971,0.330113,0.094496,0.128048,0.193419,0.129954
3,0.292044,0.314752,0.107156,1.0,0.145723,0.27826,0.169044,0.170754,0.02366,0.130988,...,0.047314,0.083685,0.209702,0.157556,0.266267,0.313744,0.05645,0.099006,0.041353,0.022244
4,0.280671,0.25452,0.132092,0.189026,1.0,0.364073,0.538468,0.107977,0.184431,0.227856,...,0.226767,0.241998,0.090812,0.083915,0.188177,0.157129,0.178756,0.137263,0.091187,0.165804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,0.235929,0.244939,0.339304,0.345675,0.153436,0.254085,0.288843,0.391109,0.365301,0.363703,...,0.391472,0.445401,0.42709,0.64793,0.754086,1.0,0.368343,0.474023,0.546079,0.421581
286,0.103594,0.052278,0.0073,0.0,0.083149,0.058934,0.038802,0.045836,0.109361,0.174072,...,0.858439,0.474824,0.25829,0.318941,0.445608,0.297884,1.0,0.318682,0.441958,0.467553
287,0.173278,0.221323,0.13635,0.137271,0.129793,0.101107,0.132663,0.319492,0.386053,0.367114,...,0.421652,0.46766,0.29528,0.314593,0.355153,0.471784,0.384445,1.0,0.421136,0.373138
288,0.241302,0.145633,0.225732,0.11037,0.111583,0.140291,0.1732,0.205928,0.244766,0.22737,...,0.511684,0.484858,0.273579,0.557858,0.580663,0.558202,0.511367,0.438984,1.0,0.499115


In [5]:
def current_similarity_vec(last_vec, last_idx, similarity_df_norm):
  """
  Return the similarity vector associated with the last attraction along with the other selected attractions
  """
  current_similarity_vec = similarity_df_norm.iloc[last_idx]
  return current_similarity_vec + (1/3) * last_vec


###**Distances** Matrix

In [21]:
import re

txt = berlin_reduced["location_point"][0]
berlin_reduced["long_lat"]  = berlin_reduced["location_point"].apply([lambda x:[float(s) for s in re.findall(r'-?\d+\.?\d*', x)][1:]])

In [22]:
!pip install haversine



In [23]:
berlin_reduced.head()

Unnamed: 0.1,index,Unnamed: 0,name,created,source,address,rating,number_of_reviews,location_point,about,tags,main_photo_url,order_page,curated,is_free,price,data_source,prediction,name_about,long_lat
0,0,0,Neue Wache,2021-10-28 09:52:47,GoogleMaps,"Unter den Linden 4, 10117 Berlin, Germany",4.5,1519,"('SRID=4326;POINT (13.3955281 52.5178902)',)",Moving memorial dedicated to war victims. War ...,Architecture,https://upload.wikimedia.org/wikipedia/commons...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Neue Wache Moving memorial dedicated to war vi...,"[13.3955281, 52.5178902]"
1,1,1,Rotes Rathaus,2021-10-28 09:52:38,GoogleMaps,"Rathausstraße 15, 10178 Berlin, Germany",4.4,1043,('SRID=4326;POINT (13.408644299999999 52.51827...,Imposing neo Renaissance town hall. This massi...,Architecture,https://lh5.googleusercontent.com/p/AF1QipN2J_...,,False,True,,,"['Architecture', 'Historic Sites']",Rotes Rathaus Imposing neo Renaissance town ha...,"[13.408644299999999, 52.518277499999996]"
2,2,2,Tempelhofer Feld,2021-10-28 09:52:47,GoogleMaps,"Tempelhofer Damm, 12101 Berlin, Germany",4.5,19915,('SRID=4326;POINT (13.401893 52.47539099999999...,Recreational hub in an old airport. Former air...,"Urban Parks, Walking & Biking",https://lh5.googleusercontent.com/p/AF1QipPJf_...,,False,True,,,['Urban Parks'],Tempelhofer Feld Recreational hub in an old ai...,"[13.401893, 52.475390999999995]"
3,3,3,The Wall Museum,2021-10-28 09:52:35,GoogleMaps,"Mühlenstraße 78-80, 10243 Berlin, Germany",4.2,986,('SRID=4326;POINT (13.445263899999999 52.50265...,Berlin Wall news reels & interview clips. Mult...,"Museums, Historic Sites",https://lh5.googleusercontent.com/p/AF1QipPc8Q...,,False,True,,,['Museums'],The Wall Museum Berlin Wall news reels & inter...,"[13.445263899999999, 52.5026556]"
4,4,4,New Palace,2021-10-28 09:53:06,GoogleMaps,"Am Neuen Palais, 14469 Potsdam, Germany",4.7,6855,('SRID=4326;POINT (13.016029999999999 52.40130...,"Rococo interiors in 18th-century palace. Vast,...","Architecture, Art, Historic Sites, Museums",https://lh5.googleusercontent.com/p/AF1QipM84B...,,False,True,,,"['Architecture', 'Historic Sites', 'Museums']",New Palace Rococo interiors in 18th-century pa...,"[13.016029999999999, 52.401301]"


**Haversine distance:**<br>The angular distance between two points on the surface of a sphere.

In [24]:
def distance_matrix(df_reduced):

  df_reduced["long_lat"]  = df_reduced["location_point"].apply([lambda x:[float(s) for s in re.findall(r'-?\d+\.?\d*', x)][1:]])
  distances_matrix = pd.DataFrame(columns=[i for i in range(df_reduced.shape[0])], index=range(df_reduced.shape[0]))
  for i in range(berlin_reduced.shape[0]):
    for j in range(i,berlin_reduced.shape[0]):
      loc1 = df_reduced["long_lat"][i]
      loc2 = df_reduced["long_lat"][j]
      dist_score = hs.haversine(loc1,loc2)  # distance in km

      distances_matrix.iloc[i][j] = dist_score
      distances_matrix.iloc[j][i] = dist_score
  return distances_matrix


berlin_distance_matrix = distance_matrix(berlin_reduced)
berlin_distances_norm = norm_df(berlin_distance_matrix)

In [25]:
berlin_distances_norm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,280,281,282,283,284,285,286,287,288,289
0,0.0,0.031809,0.102179,0.115364,0.657188,0.028013,0.644246,0.04073,0.016645,0.077252,...,0.153904,0.050042,0.227524,0.097288,0.019323,0.137853,0.190706,0.055585,0.016092,0.098773
1,0.032852,0.0,0.103233,0.088133,0.678238,0.004441,0.666058,0.025985,0.02314,0.054503,...,0.190835,0.031555,0.202371,0.127439,0.052794,0.172519,0.228841,0.082548,0.025872,0.133531
2,0.104727,0.102448,0.0,0.113007,0.651278,0.100235,0.637616,0.128171,0.113982,0.14946,...,0.185794,0.133235,0.221605,0.107066,0.109966,0.155913,0.218393,0.08414,0.115631,0.178695
3,0.129931,0.096112,0.124181,0.0,0.730718,0.099377,0.720276,0.109923,0.121186,0.09578,...,0.291937,0.105277,0.130294,0.211346,0.150646,0.266064,0.332558,0.165452,0.124141,0.240603
4,0.991725,0.991002,0.958895,0.979051,0.0,0.99047,0.039033,0.999161,0.995044,1.0,...,0.983412,1.0,0.98064,0.973661,0.991908,0.976105,0.98205,0.978939,0.995544,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,0.122868,0.148884,0.135583,0.210552,0.576521,0.145066,0.560334,0.158533,0.137184,0.191116,...,0.027968,0.166946,0.313559,0.044011,0.107239,0.0,0.054474,0.080066,0.135987,0.084563
286,0.161312,0.187425,0.180237,0.249761,0.550472,0.183886,0.533465,0.193477,0.17398,0.224264,...,0.031135,0.201408,0.347615,0.093261,0.14538,0.051698,0.0,0.125512,0.172508,0.101168
287,0.05375,0.077289,0.079382,0.142052,0.627299,0.073155,0.61301,0.092674,0.069734,0.126653,...,0.108802,0.101372,0.253918,0.042123,0.04319,0.086866,0.143483,0.0,0.069308,0.089868
288,0.016242,0.025283,0.113866,0.111247,0.66585,0.023146,0.653311,0.025538,0.00267,0.062983,...,0.168779,0.035107,0.221337,0.114169,0.03265,0.153991,0.205837,0.07234,0.0,0.106592


### **Tags** vector

In [16]:
bridgify_tags = ['Amusements', 'Architecture', 'Art', 'Beach',
        'Culinary Experiences', 'Culture', 'Festivals',
        'Guided Tours', 'Hidden Gems', 'Historic Sites','Local Events',
        'Local Markets','Museums', 'Music', 'Nature',
        'Nightlife', 'Outdoor Activities', 'Popular', 'Religion', 'Shopping',
        'Shows/Performances', 'Sporting Events', 'Street Food', 'Urban Parks',
        'Walking & Biking', 'Watersports', 'Wellness & Wellbeing']

Lets choose tags hypothetically<br>
chosen tags = Architecture, Culinary Experiences, Shopping, Art, Urban Parks

In [17]:
chosen_tags = ["Architecture", "Culinary Experiences", "Shopping", "Art", "Urban Parks", "Museums"]

In [18]:
# Creating pandas DataFrame for the chosen tags (each tag will have a different column)

def df_tags(df, tags_list):
  """ Creating DataFrame for all tags (each tag will have a different column)"""
  tags_dict = {tag: [] for tag in tags_list}
  for tag_name in tags_dict.keys():
    for tags in df["prediction"]:
      if tag_name in tags:
        tags_dict[tag_name].append(1)
      else:
        tags_dict[tag_name].append(0)
  return pd.DataFrame(tags_dict)


def chosen_tags_vec(df_reduced, chosen_tags):

  # creating a dataframe for the chosen tags
  tags_df = df_tags(df_reduced, chosen_tags) 

  # sum the tags to a vector
  tags_sum = tags_df.sum(axis=1)

  # normalized the vector to be between 0-1. transforming the results with '1-norm' in order for the best attraction in terms of tags to be the lowest. Adding 0.01 in order not to reset the results
  tags_sum_norm = 1 - norm_df(tags_sum) + 0.01  

  return tags_sum_norm


berlin_tags_vec = chosen_tags_vec(berlin_reduced, chosen_tags)
berlin_tags_vec

0      0.51
1      0.51
2      0.51
3      0.51
4      0.01
       ... 
285    1.01
286    1.01
287    1.01
288    1.01
289    1.01
Length: 290, dtype: float64

In [19]:
# Creating pandas DataFrame for the chosen tags (each tag will have a different column)
# for modify!!!!!!!!!!!!!!!!!!!!!!!!!!!!

def df_tags(df, tags_list):
  """ Creating DataFrame for all tags (each tag will have a different column)"""
  tags_dict = {tag: [] for tag in tags_list}
  for tag_name in tags_dict.keys():
    for tags in df["prediction"]:
      if tag_name in tags:
        tags_dict[tag_name].append(1)
      else:
        tags_dict[tag_name].append(0)
  return pd.DataFrame(tags_dict)


def chosen_tags_vec(df_reduced, chosen_tags, idx_list=None):
  
  # creating a dataframe for the chosen tags
  tags_df = df_tags(df_reduced, chosen_tags) 

  # reduce the values of the tags that already been chosen
  if idx_list:
    selected_tags_count = tags_df.iloc[idx_list].sum()
    selected_tags_count = selected_tags_count[selected_tags_count > 0]
    for i in range(selected_tags_count.shape[0]):
      tag_name = selected_tags_count.index[i]
      tag_value = selected_tags_count.values[i]
      tags_df[tag_name] = tags_df[tag_name] * (1/(2*tag_value))

  # sum the tags to a vector
  tags_sum = tags_df.sum(axis=1)

  # normalized the vector to be between 0-1. transforming the results with '1-norm' in order for the best attraction in terms of tags to be the lowest. Adding 0.01 in order not to reset the results
  tags_sum_norm = 1 - norm_df(tags_sum) + 0.01  

  return tags_sum_norm

 
berlin_tags_vec = chosen_tags_vec(berlin_reduced, chosen_tags)
# berlin_tags = df_tags(berlin_reduced, chosen_tags)
# drop_tag = berlin_tags.iloc[[31, 47, 179, 124, 235]] 
# used_tags_count = drop_tag.sum()[drop_tag.sum()>0]
# used_tags_count[used_tags_count.index[0]]
# for i in range(used_tags_count.shape[0]):
#   berlin_tags[used_tags_count.index[i]] = berlin_tags[used_tags_count.index[i]] * (1/(2**used_tags_count.values[i]))


Lets choose tags hypothetically<br>
chosen tags = Architecture, Culinary Experiences, Shopping, Art, Urban Parks

In [20]:
# chosen_tags_arr = np.zeros(len(bridgify_tags))
# for i in range(len(bridgify_tags)):
#   if bridgify_tags[i] in chosen_tags:
#     chosen_tags_arr[i] = 1
# chosen_tags_arr

### **Popularity** vector

Popularity is determined in this case by the number of reviews

In [26]:
def popularity_vec(df):
  pop_vec = df["number_of_reviews"]
  norm_vec = 1 - norm_df(pop_vec) + 0.01
  return norm_vec

berlin_popularity_vec = popularity_vec(berlin_reduced)
berlin_popularity_vec.sort_values()

31     0.010000
47     0.257392
130    0.632268
177    0.726939
179    0.740806
         ...   
87     1.009963
234    1.009963
99     1.009988
76     1.009994
279    1.010000
Name: number_of_reviews, Length: 290, dtype: float64

In [27]:
berlin_reduced.iloc[[279]]

Unnamed: 0.1,index,Unnamed: 0,name,created,source,address,rating,number_of_reviews,location_point,about,tags,main_photo_url,order_page,curated,is_free,price,data_source,prediction,name_about,long_lat
279,289,289,kolula - Stand Up Paddle,2021-10-28 09:51:49,GoogleMaps,"Havelchaussee 1, 14193 Berlin, Germany",4.6,50,('SRID=4326;POINT (13.1921851 52.4675699999999...,Boat rental service.,"Watersports, Outdoor Activities",https://lh5.googleusercontent.com/p/AF1QipOPU9...,,False,True,,,['Outdoor Activities'],kolula - Stand Up Paddle Boat rental service.,"[13.1921851, 52.467569999999995]"


### **Results**



####Displaying matrices and vectors

In [36]:
print("berlin similarity matrix:")
berlin_similarity_norm

berlin similarity matrix:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,280,281,282,283,284,285,286,287,288,289
0,1.0,0.322699,0.25035,0.297662,0.248276,0.282005,0.213082,0.243938,0.151737,0.127449,...,0.15907,0.171361,0.181363,0.195135,0.161197,0.205002,0.160908,0.143461,0.188931,0.213141
1,0.324559,1.0,0.171081,0.322057,0.223086,0.40631,0.244922,0.220396,0.164508,0.226302,...,0.123165,0.212983,0.182022,0.10395,0.183199,0.216534,0.115309,0.195455,0.089166,0.211241
2,0.26949,0.190021,1.0,0.136857,0.116162,0.18565,0.267309,0.274952,0.193231,0.235519,...,0.101739,0.213039,0.143898,0.24949,0.226971,0.330113,0.094496,0.128048,0.193419,0.129954
3,0.292044,0.314752,0.107156,1.0,0.145723,0.27826,0.169044,0.170754,0.02366,0.130988,...,0.047314,0.083685,0.209702,0.157556,0.266267,0.313744,0.05645,0.099006,0.041353,0.022244
4,0.280671,0.25452,0.132092,0.189026,1.0,0.364073,0.538468,0.107977,0.184431,0.227856,...,0.226767,0.241998,0.090812,0.083915,0.188177,0.157129,0.178756,0.137263,0.091187,0.165804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,0.235929,0.244939,0.339304,0.345675,0.153436,0.254085,0.288843,0.391109,0.365301,0.363703,...,0.391472,0.445401,0.42709,0.64793,0.754086,1.0,0.368343,0.474023,0.546079,0.421581
286,0.103594,0.052278,0.0073,0.0,0.083149,0.058934,0.038802,0.045836,0.109361,0.174072,...,0.858439,0.474824,0.25829,0.318941,0.445608,0.297884,1.0,0.318682,0.441958,0.467553
287,0.173278,0.221323,0.13635,0.137271,0.129793,0.101107,0.132663,0.319492,0.386053,0.367114,...,0.421652,0.46766,0.29528,0.314593,0.355153,0.471784,0.384445,1.0,0.421136,0.373138
288,0.241302,0.145633,0.225732,0.11037,0.111583,0.140291,0.1732,0.205928,0.244766,0.22737,...,0.511684,0.484858,0.273579,0.557858,0.580663,0.558202,0.511367,0.438984,1.0,0.499115


The results are normalized (0<x<1).<br>
Completely similar results are equal to 1 and completely different attractions are equal to zero

In [37]:
print("berlin distances matrix:")
berlin_distances_norm

berlin distances matrix:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,280,281,282,283,284,285,286,287,288,289
0,0.0,0.031809,0.102179,0.115364,0.657188,0.028013,0.644246,0.04073,0.016645,0.077252,...,0.153904,0.050042,0.227524,0.097288,0.019323,0.137853,0.190706,0.055585,0.016092,0.098773
1,0.032852,0.0,0.103233,0.088133,0.678238,0.004441,0.666058,0.025985,0.02314,0.054503,...,0.190835,0.031555,0.202371,0.127439,0.052794,0.172519,0.228841,0.082548,0.025872,0.133531
2,0.104727,0.102448,0.0,0.113007,0.651278,0.100235,0.637616,0.128171,0.113982,0.14946,...,0.185794,0.133235,0.221605,0.107066,0.109966,0.155913,0.218393,0.08414,0.115631,0.178695
3,0.129931,0.096112,0.124181,0.0,0.730718,0.099377,0.720276,0.109923,0.121186,0.09578,...,0.291937,0.105277,0.130294,0.211346,0.150646,0.266064,0.332558,0.165452,0.124141,0.240603
4,0.991725,0.991002,0.958895,0.979051,0.0,0.99047,0.039033,0.999161,0.995044,1.0,...,0.983412,1.0,0.98064,0.973661,0.991908,0.976105,0.98205,0.978939,0.995544,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,0.122868,0.148884,0.135583,0.210552,0.576521,0.145066,0.560334,0.158533,0.137184,0.191116,...,0.027968,0.166946,0.313559,0.044011,0.107239,0.0,0.054474,0.080066,0.135987,0.084563
286,0.161312,0.187425,0.180237,0.249761,0.550472,0.183886,0.533465,0.193477,0.17398,0.224264,...,0.031135,0.201408,0.347615,0.093261,0.14538,0.051698,0.0,0.125512,0.172508,0.101168
287,0.05375,0.077289,0.079382,0.142052,0.627299,0.073155,0.61301,0.092674,0.069734,0.126653,...,0.108802,0.101372,0.253918,0.042123,0.04319,0.086866,0.143483,0.0,0.069308,0.089868
288,0.016242,0.025283,0.113866,0.111247,0.66585,0.023146,0.653311,0.025538,0.00267,0.062983,...,0.168779,0.035107,0.221337,0.114169,0.03265,0.153991,0.205837,0.07234,0.0,0.106592


The results are normalized (0<x<1).<br>
The closer the attractions are to each other, the lower the result

In [38]:
print("tags vector:")
berlin_tags_vec

tags vector:


0      0.76
1      0.76
2      0.01
3      0.51
4      0.26
       ... 
285    1.01
286    1.01
287    1.01
288    1.01
289    1.01
Length: 290, dtype: float64

The results are normalized (0<x<1).<br>
An attraction that contains all the tags will get a result close to zero while an attraction that does not contain any of the tags will get a result that is close to 1

In [39]:
print("Popularity vector:")
berlin_popularity_vec

Popularity vector:


0      1.001034
1      1.003939
2      0.888754
3      1.004287
4      0.968466
         ...   
285    1.009835
286    1.009683
287    1.000418
288    1.009640
289    0.972512
Name: number_of_reviews, Length: 290, dtype: float64

The results are normalized (0<x<1).<br>
The most popular attraction will get ~0 and the least popular will get ~1

#### Building the route

In [34]:
def first_attraction(tags_vec, pop_vec):
  """
  return the first attraction according to popularity and chosen tags
  """
  return (tags_vec * pop_vec).sort_values().index[0]



def next_best_attraction(tags_vec, pop_vec,distance_vec, similarity_vec):
  """
  return the next best attraction according to previous attraction
  """
  vectors_results = tags_vec +  (3*pop_vec) + (2*distance_vec) + similarity_vec
  
  # drop the chosen indices
  vectors_results.drop(index=chosen_idx, inplace=True)

  return vectors_results.sort_values().index[0]

In [35]:
# select number of attractions per day
num_attractions = 5

# Start with selecting the first attraction
attraction_idx = first_attraction(berlin_tags_vec, berlin_popularity_vec)
print("first chosen attraction index:",attraction_idx)

# Save a list with the selected attractions indices
chosen_idx = [attraction_idx]

similarity_vec = 0
similarity_vec_norm = 0


# select the next (num_attractions - 1)
for i in range(num_attractions-1):

  # update the tags_vector
  berlin_tags_vec = chosen_tags_vec(berlin_reduced, chosen_tags, chosen_idx) 

  # find similarity_vec according to current attraction
  similarity_vec = current_similarity_vec(similarity_vec, attraction_idx, berlin_similarity_norm)
  similarity_vec_norm = norm_df(similarity_vec)

  # extract distance vector 
  distance_vec = berlin_distances_norm.iloc[attraction_idx]

  # Select the next best attraction
  attraction_idx = next_best_attraction(berlin_tags_vec, berlin_popularity_vec, distance_vec, similarity_vec_norm)

  # append the next index to the indices list
  chosen_idx.append(attraction_idx)
  print(chosen_idx)

# present the selected attractions by order:
berlin_reduced.iloc[chosen_idx].drop(columns=["index", "Unnamed: 0"])

first chosen attraction index: 31
[31, 47]
[31, 47, 235]
[31, 47, 235, 98]
[31, 47, 235, 98, 130]


Unnamed: 0,name,created,source,address,rating,number_of_reviews,location_point,about,tags,main_photo_url,order_page,curated,is_free,price,data_source,prediction,name_about,long_lat
31,Alexanderplatz,2021-10-28 09:50:08,GoogleMaps,"10178 Berlin, Germany",4.2,163891,('SRID=4326;POINT (13.413305999999999 52.52198...,Pedestrianized square with iconic tower. Histo...,"Historic Sites, Popular, Architecture",https://upload.wikimedia.org/wikipedia/commons...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Alexanderplatz Pedestrianized square with icon...,"[13.413305999999999, 52.521981399999994]"
47,Brandenburg Gate,2021-10-28 09:50:08,GoogleMaps,"Pariser Platz, 10117 Berlin, Germany",4.7,123358,('SRID=4326;POINT (13.377704099999999 52.51627...,Grand classical archway & city divide. Restore...,"Historic Sites, Popular, Architecture",https://lh5.googleusercontent.com/p/AF1QipMz1I...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Brandenburg Gate Grand classical archway & cit...,"[13.377704099999999, 52.516274599999996]"
235,Mall of Berlin,2021-10-28 09:48:56,GoogleMaps,"Leipziger Pl. 12, 10117 Berlin, Germany",4.4,40796,('SRID=4326;POINT (13.380708499999999 52.51051...,Spacious shopping mall with a food court. Expa...,Shopping,https://www.mallofberlin.de/fileadmin/files/st...,,False,True,,,['Shopping'],Mall of Berlin Spacious shopping mall with a f...,"[13.380708499999999, 52.5105167]"
98,Boros Foundation,2021-10-28 09:48:32,GoogleMaps,"Reinhardtstraße 20, 10117 Berlin, Germany",4.4,928,"('SRID=4326;POINT (13.384125 52.5234611)',)",Private collection of contemporary art. Contem...,Art,https://scontent.fsdv3-1.fna.fbcdn.net/v/t1.18...,,False,True,,,"['Art', 'Museums']",Boros Foundation Private collection of contemp...,"[13.384125, 52.5234611]"
130,Checkpoint Charlie,2021-10-28 09:50:08,GoogleMaps,"Friedrichstraße 43-45, 10117 Berlin, Germany",4.0,61938,"('SRID=4326;POINT (13.3903913 52.5074434)',)",Cold war & east-west border landmark. Landmark...,"Historic Sites, Popular",https://lh5.googleusercontent.com/p/AF1QipOF6a...,,False,True,,,"['Architecture', 'Culture', 'Historic Sites', ...",Checkpoint Charlie Cold war & east-west border...,"[13.3903913, 52.5074434]"


In [59]:
berlin_distance_matrix.iloc[235][98]

1.450889618200225

### **Results- probabilities**

In [86]:
# def first_attraction(tags_vec, pop_vec):
#   """
#   return the first attraction according to popularity and chosen tags
#   """
#   # transforming the results so that the highest score is the best.
#   results = 1/(tags_vec * pop_vec)

#   # normalizing the vectors so that each vector sum will be equal to 1.
#   attractions_probs = results / results.sum()
  
#   return np.random.choice(range(len(tags_vec)), p=attractions_probs.values)


def first_attraction(tags_vec, pop_vec):
  """
  return the first attraction according to popularity and chosen tags
  """
  return (tags_vec * pop_vec).sort_values().index[0]



def next_best_attraction(tags_vec, pop_vec,distance_vec, similarity_vec):
  """
  return the next best attraction according to previous attraction
  """
  vectors_results = 1 / (tags_vec +  (3*pop_vec) + (2*distance_vec) + similarity_vec)
  
  # drop the chosen indices
  vectors_results.drop(index=chosen_idx, inplace=True)

  # normalizing the vectors so that each vector sum will be equal to 1
  attractions_probs = vectors_results / vectors_results.sum()
  
  return np.random.choice(vectors_results.index, p=attractions_probs)

In [87]:
# select number of attractions per day
num_attractions = 5

# Start with selecting the first attraction
attraction_idx = first_attraction(berlin_tags_vec, berlin_popularity_vec)
print("first chosen attraction index:",attraction_idx)

# Save a list with the selected attractions indices
chosen_idx = [attraction_idx]

similarity_vec = 0
similarity_vec_norm = 0


# select the next (num_attractions - 1)
for i in range(num_attractions-1):

  # update the tags_vector
  berlin_tags_vec = chosen_tags_vec(berlin_reduced, chosen_tags, chosen_idx) 

  # find similarity_vec according to current attraction
  similarity_vec = current_similarity_vec(similarity_vec, attraction_idx, berlin_similarity_norm)
  similarity_vec_norm = norm_df(similarity_vec)

  # extract distance vector 
  distance_vec = berlin_distances_norm.iloc[attraction_idx]

  # Select the next best attraction
  attraction_idx = next_best_attraction(berlin_tags_vec, berlin_popularity_vec, distance_vec, similarity_vec_norm)

  # append the next index to the indices list
  chosen_idx.append(attraction_idx)
  print(chosen_idx)

# present the selected attractions by order:
berlin_reduced.iloc[chosen_idx]

first chosen attraction index: 31
[31, 7]
[31, 7, 118]
[31, 7, 118, 159]
[31, 7, 118, 159, 266]


Unnamed: 0.1,index,Unnamed: 0,name,created,source,address,rating,number_of_reviews,location_point,about,tags,main_photo_url,order_page,curated,is_free,price,data_source,prediction,name_about
31,37,37,Alexanderplatz,2021-10-28 09:50:08,GoogleMaps,"10178 Berlin, Germany",4.2,163891,('SRID=4326;POINT (13.413305999999999 52.52198...,Pedestrianized square with iconic tower. Histo...,"Historic Sites, Popular, Architecture",https://upload.wikimedia.org/wikipedia/commons...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Alexanderplatz Pedestrianized square with icon...
7,7,7,Gaststätte W. Prassnik,2021-10-28 09:51:14,GoogleMaps,"Torstraße 65, 10119 Berlin, Germany",4.5,321,('SRID=4326;POINT (13.408131599999999 52.52926...,Beer & bites with a nostalgic vibe. House beer...,Nightlife,https://lh5.googleusercontent.com/p/AF1QipNmgz...,,False,True,,,['Nightlife'],Gaststätte W. Prassnik Beer & bites with a nos...
118,125,125,Lustgarten,2021-10-28 09:52:32,GoogleMaps,"Unter den Linden 1, 10178 Berlin, Germany",4.6,8041,('SRID=4326;POINT (13.399199999999999 52.51869...,Public park with lawns & fountains. The one-ti...,"Architecture, Urban Parks",https://lh5.googleusercontent.com/p/AF1QipN3bH...,,False,True,,,['Urban Parks'],Lustgarten Public park with lawns & fountains....
159,166,166,Illuseum Berlin,2021-10-28 09:55:37,GoogleMaps,"Karl-Liebknecht-Str. 9, 10178 Berlin, Germany",4.1,2601,"('SRID=4326;POINT (13.406989 52.5215677)',)",Visual tricks & immersive installations. Holog...,"Museums, Amusements",https://www.illuseum-berlin.de/wp-content/uplo...,,False,True,,,"['Amusements', 'Museums']",Illuseum Berlin Visual tricks & immersive inst...
266,276,276,Engelbecken Park,2021-10-28 09:54:42,GoogleMaps,"Legiendamm 4, 10179 Berlin, Germany",4.5,110,('SRID=4326;POINT (13.417270199999999 52.50431...,"Small, pleasant park with flowers next to an h...",Urban Parks,https://lh5.googleusercontent.com/p/AF1QipMOIS...,,False,True,,,['Urban Parks'],"Engelbecken Park Small, pleasant park with flo..."


In [90]:
berlin_distance_matrix.iloc[31][7]

0.9755089986276935

### Junk

Save a list with the chosen attractions indexes

In [86]:
chosen_idx = [first_attraction_idx]
chosen_idx

[31]

Update the tags_vector

In [87]:
berlin_tags_vec = chosen_tags_vec(berlin_reduced, chosen_tags, chosen_idx)

Drop the chosen indices from the data and choose the next best attraction

In [93]:
next_attraction_idx = next_best_attraction(chosen_idx, berlin_tags_vec, berlin_popularity_vec, berlin_distances_norm, berlin_similarity_norm)

# append the next idx to the indices list
chosen_idx.append(next_attraction_idx)
print(chosen_idx)

# update the tags_vector
berlin_tags_vec = chosen_tags_vec(berlin_reduced, chosen_tags, chosen_idx)


[31, 47, 130, 124, 179]


In [94]:
berlin_reduced.iloc[chosen_idx]

Unnamed: 0.1,index,Unnamed: 0,name,created,source,address,rating,number_of_reviews,location_point,about,tags,main_photo_url,order_page,curated,is_free,price,data_source,prediction,name_about,long_lat
31,37,37,Alexanderplatz,2021-10-28 09:50:08,GoogleMaps,"10178 Berlin, Germany",4.2,163891,('SRID=4326;POINT (13.413305999999999 52.52198...,Pedestrianized square with iconic tower. Histo...,"Historic Sites, Popular, Architecture",https://upload.wikimedia.org/wikipedia/commons...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Alexanderplatz Pedestrianized square with icon...,"[13.413305999999999, 52.521981399999994]"
47,53,53,Brandenburg Gate,2021-10-28 09:50:08,GoogleMaps,"Pariser Platz, 10117 Berlin, Germany",4.7,123358,('SRID=4326;POINT (13.377704099999999 52.51627...,Grand classical archway & city divide. Restore...,"Historic Sites, Popular, Architecture",https://lh5.googleusercontent.com/p/AF1QipMz1I...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Brandenburg Gate Grand classical archway & cit...,"[13.377704099999999, 52.516274599999996]"
130,137,137,Checkpoint Charlie,2021-10-28 09:50:08,GoogleMaps,"Friedrichstraße 43-45, 10117 Berlin, Germany",4.0,61938,"('SRID=4326;POINT (13.3903913 52.5074434)',)",Cold war & east-west border landmark. Landmark...,"Historic Sites, Popular",https://lh5.googleusercontent.com/p/AF1QipOF6a...,,False,True,,,"['Architecture', 'Culture', 'Historic Sites', ...",Checkpoint Charlie Cold war & east-west border...,"[13.3903913, 52.5074434]"
124,131,131,Memorial to the Murdered Jews of Europe,2021-10-28 09:50:08,GoogleMaps,"Cora-Berliner-Straße 1, 10117 Berlin, Germany",4.6,39846,"('SRID=4326;POINT (13.3787127 52.5139474)',)","2,711 columns commemorating Holocaust. 2,711 c...","Historic Sites, Popular, Art, Architecture",https://upload.wikimedia.org/wikipedia/commons...,,False,True,,,"['Architecture', 'Historic Sites']","Memorial to the Murdered Jews of Europe 2,711 ...","[13.3787127, 52.5139474]"
179,186,186,Potsdamer Platz,2021-10-28 09:52:32,GoogleMaps,"10785 Berlin, Germany",4.4,44155,"('SRID=4326;POINT (13.3759441 52.5096488)',)",Square at the heart of city's history. Histori...,"Architecture, Popular, Shopping",https://lh5.googleusercontent.com/p/AF1QipMGqj...,,False,True,,,"['Architecture', 'Historic Sites', 'Popular']",Potsdamer Platz Square at the heart of city's ...,"[13.3759441, 52.5096488]"


In [97]:
berlin_distance_matrix.iloc[31][179]

4.363386967399246

In [210]:
berlin_reduced.iloc[102]["about"]

'Free installation intended to allow silence and contemplation in an artful space.'

In [100]:
berlin_distances_norm.iloc[286][31]

0.19816414724286188