# Here we create a typical visitor age group and way of travel for each site. This will help us make the model more personalized.

**Case study:**

According to the government of Spain, by age category:
- Young people tend to visit towns, parks or lookouts, and contemporary buildings and art galleries.
- Adults tend to visit historic buildings, museums and archaeological remains, and experiences, cultural centers, theaters and music.
- Older people prefer religious sites, and food, seaside, sport and others.

By way of travel:
- People traveling alone tend to visit points of interest, like squares and sculptures, and experiences, cultural centers, theaters and music.
- Couples tend to visit towns, parks or lookouts, and contemporary buildings and art galleries.
- Families tend to visit historic buildings, museums and archaeological remains, and routes and urban routes.
- Groups tend to go for food, seaside, sport and others.

Before doing this, I tried running the model with random values in these columns and, _obviously_, they didn't help the model.

With this new approach, where each site category is more likely to attract specific visitor profiles, the MVP with personalised values should work better. Personalised recommendations are the main goal of this project. Since there isn't any such data available online, and travel recommenders don't work with personalised values, we need synthetic data to make this work.

In the first tries folder you'll see how I did a lot of feature engineering to make this synthetic data as realistic as possible, while keeping some imbalance in the results so that it is useful to the model. Here you'll see only the approach I chose to create the data, but there are more tries in case you want to check them. I included two of the best tries, but kept the last one.

Since we hadn't covered how to create synthetic data, I used Stack Overflow, ChatGPT and a lot of Medium articles (I even paid for the premium version, which I highly recommend) to learn how to create synthetic data responsibly.

I used one of the datasets that already has the reduced categories encoded, to make it easier to create the new columns. The mapping is as follows:

    'historic building, museums and archaeological rests': 1,
    'town, parks or lookouts': 2,
    'points of interest, like squares and sculptures': 3,
    'experiences, cultural centers, theaters and music': 4,
    'religious': 5,
    'contemporary buildings and art galleries': 6,
    'route and urban routes': 7,
    'food, seaside, sport and others': 8
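
The case-study preferences above can be written against these codes as a small lookup. This is only a minimal sketch: the names `category_codes`, `preferred_age`, `preferred_way` and `invert` are hypothetical and not part of the notebook. Categories 3 and 7 have no stated age preference, and category 5 has no stated way of travel, so they are left out of the respective lookups.

```python
# Category codes, as listed above.
category_codes = {
    'historic building, museums and archaeological rests': 1,
    'town, parks or lookouts': 2,
    'points of interest, like squares and sculptures': 3,
    'experiences, cultural centers, theaters and music': 4,
    'religious': 5,
    'contemporary buildings and art galleries': 6,
    'route and urban routes': 7,
    'food, seaside, sport and others': 8,
}

# Preferences from the case study, expressed with the codes (hypothetical names).
preferred_age = {'young': [2, 6], 'adult': [1, 4], 'old': [5, 8]}
preferred_way = {
    'alone': [3, 4],
    'in couple': [2, 6],
    'in family': [1, 7],
    'in group': [8],
}


def invert(preferences):
    """Turn {profile: [codes]} into {code: profile}."""
    return {code: profile
            for profile, codes in preferences.items()
            for code in codes}


age_by_code = invert(preferred_age)
way_by_code = invert(preferred_way)

# Print one line per category with its favoured profiles, if any.
code_to_name = {code: name for name, code in category_codes.items()}
for code in sorted(code_to_name):
    print(code,
          code_to_name[code],
          age_by_code.get(code, 'no stated age preference'),
          way_by_code.get(code, 'no stated way of travel'),
          sep=' | ')
```

Writing the preferences down once like this makes it easier to check that the generation code and the case study agree.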

In [2]:
import pandas as pd
import numpy as np

In [23]:
# This is the version I feel comfortable with, because it is imbalanced but realistic.


# Load the data from the excel file 'data_final_2.xlsx' into df7
df7 = pd.read_excel("data_final_2.xlsx")

# Draw a random 'age' label (used for the ~10% of rows that are assigned purely at random)
def generate_age_random(row):
    return np.random.choice(['young', 'adult', 'old'], p=[0.35, 0.35, 0.3])

# Draw a random 'way_travel' label (used for the ~10% of rows that are assigned purely at random)
def generate_way_travel_random(row):
    return np.random.choice(['alone', 'in couple', 'in family', 'in group'], p=[0.25, 0.25, 0.25, 0.25])

# Give roughly 10% of the rows in df7 a purely random label; the rest stay NaN for now
df7['age'] = df7.apply(lambda row: generate_age_random(row) if np.random.rand() <= 0.1 else np.nan, axis=1)
df7['way_travel'] = df7.apply(lambda row: generate_way_travel_random(row) if np.random.rand() <= 0.1 else np.nan, axis=1)

# Fill the remaining 'age' values with the specified imbalance:
# the age group that favours a category is drawn with probability 0.87
def generate_age_imbalance(row):
    if pd.notnull(row['age']):
        return row['age']  # keep the values already assigned at random
    if row['cat_sites_reduced_more_encoded'] in [2, 6]:
        return np.random.choice(['young', 'adult'], p=[0.87, 0.13])
    elif row['cat_sites_reduced_more_encoded'] in [1, 4]:
        return np.random.choice(['adult', 'old'], p=[0.87, 0.13])
    elif row['cat_sites_reduced_more_encoded'] in [5, 8]:
        # Per the case study, old people favour these categories, so 'old' gets the 0.87
        return np.random.choice(['old', 'young'], p=[0.87, 0.13])
    else:
        # Categories 3 and 7 have no stated age preference: uniform draw
        return np.random.choice(['young', 'adult', 'old'])

# Fill the remaining 'way_travel' values with the specified imbalance:
# the way of travel that favours a category is drawn with probability 0.87
def generate_way_travel_imbalance(row):
    if pd.notnull(row['way_travel']):
        return row['way_travel']  # keep the values already assigned at random
    if row['cat_sites_reduced_more_encoded'] in [3, 4]:
        # Per the case study, people traveling alone favour these categories;
        # 'in couple' as the minority label is an assumption
        return np.random.choice(['alone', 'in couple'], p=[0.87, 0.13])
    elif row['cat_sites_reduced_more_encoded'] in [2, 6]:
        return np.random.choice(['in couple', 'in family'], p=[0.87, 0.13])
    elif row['cat_sites_reduced_more_encoded'] in [1, 7]:
        return np.random.choice(['in family', 'in group'], p=[0.87, 0.13])
    elif row['cat_sites_reduced_more_encoded'] == 8:
        return np.random.choice(['in group', 'in couple'], p=[0.87, 0.13])
    else:
        # Category 5 has no stated way-of-travel preference: uniform draw
        return np.random.choice(['in couple', 'in family', 'in group'])

# Apply the new functions to create the 'age' and 'way_travel' columns with the adjusted imbalance and 0.87 probability in df7
df7['age'] = df7.apply(generate_age_imbalance, axis=1)
df7['way_travel'] = df7.apply(generate_way_travel_imbalance, axis=1)
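
Because `np.random.choice` draws from NumPy's global random state, every run of the cell above produces a different dataset. Below is a minimal sketch of the same biased-sampling idea with a seeded generator, on a toy frame. The seed value, the toy data and the helper `sample_age` are assumptions made for illustration; this is not the code used to build the final file.

```python
import numpy as np
import pandas as pd

# Seeded generator so the synthetic labels are reproducible (seed is arbitrary).
rng = np.random.default_rng(42)

# Toy data: 1000 sites of category 2 and 1000 sites of category 1.
toy = pd.DataFrame({'cat_sites_reduced_more_encoded': [2] * 1000 + [1] * 1000})


def sample_age(code, rng):
    """Draw an age label, biased towards the group that favours the category."""
    if code in (2, 6):
        return rng.choice(['young', 'adult'], p=[0.87, 0.13])
    if code in (1, 4):
        return rng.choice(['adult', 'old'], p=[0.87, 0.13])
    # No stated preference: uniform draw.
    return rng.choice(['young', 'adult', 'old'])


toy['age'] = [sample_age(code, rng)
              for code in toy['cat_sites_reduced_more_encoded']]

# Share of 'young' among category-2 sites; it should land near 0.87.
is_cat_2 = toy['cat_sites_reduced_more_encoded'] == 2
share_young = (toy.loc[is_cat_2, 'age'] == 'young').mean()
print(round(share_young, 2))
```

With a fixed seed, rerunning the notebook gives the same synthetic columns, which makes model results comparable between runs.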


In [24]:

distribution_age7 = df7.groupby(['age', 'cat_sites_reduced_more']).size().reset_index(name='count')
distribution_age7

| | age | cat_sites_reduced_more | count |
|---|---|---|---|
| 0 | adult | contemporary buildings and art galleries | 51 |
| 1 | adult | experiences, cultural centers, theaters and music | 416 |
| 2 | adult | food, music, seaside, sport and others | 2 |
| 3 | adult | historic building, museums and archaeological ... | 878 |
| 4 | adult | points of interest, like squares and sculptures | 174 |
| 5 | adult | religious | 13 |
| 6 | adult | route and urban routes | 69 |
| 7 | adult | town, parks or lookouts | 82 |
| 8 | old | contemporary buildings and art galleries | 11 |
| 9 | old | experiences, cultural centers, theaters and music | 74 |
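
Raw counts are hard to compare across categories of very different sizes. One way to read the imbalance is to normalise within each category, for example with `pd.crosstab`. This is a minimal sketch on made-up rows; the toy data is hypothetical and is not the notebook's output.

```python
import pandas as pd

# Hypothetical toy rows, only to show the shape of the result.
toy = pd.DataFrame({
    'age': ['young', 'young', 'young', 'adult',
            'adult', 'old',
            'adult', 'adult'],
    'cat_sites_reduced_more': (['town, parks or lookouts'] * 4
                               + ['religious'] * 2
                               + ['route and urban routes'] * 2),
})

# Each row is a category; values are the share of each age group
# within that category (every row sums to 1).
shares = pd.crosstab(toy['cat_sites_reduced_more'], toy['age'],
                     normalize='index')
print(shares.round(2))
```

On the real data, the same call with `df7['cat_sites_reduced_more']` and `df7['age']` would show whether the favoured age group in each category sits near the intended 0.87.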


In [25]:
user_age = {
    'young': 0,
    'adult': 1,
    'old': 2
}

user_way = {
    'alone': 0,
    'in couple': 1,
    'in group': 2,
    'in family': 3
}

df7['age_encoded'] = df7['age'].map(user_age)
df7['way_encoded'] = df7['way_travel'].map(user_way)
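
`Series.map` with a dict returns `NaN` for any label that is missing from the dict, so a misspelled label would silently produce missing codes. Below is a minimal sketch of a sanity check on a toy series; the misspelled label `'in famly'` is deliberate and hypothetical.

```python
import pandas as pd

user_way = {'alone': 0, 'in couple': 1, 'in group': 2, 'in family': 3}

# Toy labels with one deliberate typo to trigger an unmapped value.
labels = pd.Series(['alone', 'in couple', 'in famly'])
encoded = labels.map(user_way)

# Labels that did not match any key in the dict.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here means every label was encoded; the same check can be applied to `df7['age']` and `df7['way_travel']` before saving.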

In [26]:
df7.to_excel("data_final_final_7.xlsx")