### Redefining distributions

In the original dataset, every column followed the same pattern: each possible value appeared in almost the same proportion.

#### Customized distribution

In this step I force the fields to follow distributions that I defined myself.

##### Rating fields

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets\festival_dataset_dirty.csv")

#-----------------------
# Column satisfaction_score
# Mixed reviews for satisfaction, but generally positive
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.11, 0.05, 0.04, 0.12, 0.08, 0.1, 0.07, 0.3, 0.1, 0.03]

# Randomly assign the satisfaction score based on the distribution
df["satisfaction_score"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column security_rating
# Generally positive reviews for security
# Note: 3 is not among the values, so no row receives that rating
values = [1, 2, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.02, 0.02, 0.01, 0.05, 0.1, 0.2, 0.25, 0.2, 0.15]

# Randomly assign the security rating based on the distribution
df["security_rating"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column cleanliness_rating
# Generally negative reviews for cleanliness
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.03, 0.12, 0.18, 0.12, 0.1, 0.18, 0.1, 0.07, 0.06, 0.04]

# Randomly assign the cleanliness rating based on the distribution
df["cleanliness_rating"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)
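
The three assignments above repeat one pattern, so a small helper could reduce the repetition and catch a mismatched list before sampling. The following is a minimal sketch, not part of the original notebook: the function name `draw_from_distribution`, the seed `42`, and the sample size are assumptions chosen for illustration. It uses NumPy's `default_rng` so the draws are reproducible between runs.

```python
import numpy as np

def draw_from_distribution(values, probabilities, size, rng):
    """Draw `size` samples from `values` using the given probabilities.

    Hypothetical helper: it validates the inputs before sampling.
    """
    if len(values) != len(probabilities):
        raise ValueError("values and probabilities must have the same length")
    if not np.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return rng.choice(values, size=size, p=probabilities)

rng = np.random.default_rng(42)  # fixed seed (an assumption) for reproducibility

# Same values and weights as the satisfaction_score block
scores = draw_from_distribution(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [0.11, 0.05, 0.04, 0.12, 0.08, 0.1, 0.07, 0.3, 0.1, 0.03],
    size=1000,
    rng=rng,
)
```

In the notebook, the result could be assigned directly, for example `df["satisfaction_score"] = draw_from_distribution(values, distribution, len(df), rng)`.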

##### Non-rating fields

In [2]:
#-----------------------
# Column group_size
# Majority of larger groups over smaller ones
values = [1, 2, 3, 4, 5]
distribution = [0.09, 0.17, 0.31, 0.23, 0.2]

# Randomly assign the group size based on the distribution
df["group_size"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column payment_method
# Tendency to pay with card and cash
values = ["Card", "Cash", "Festival App"]
distribution = [0.65, 0.3, 0.05]

# Randomly assign the payment method based on the distribution
df["payment_method"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column stages_visited
# People tend to visit 4-5 stages (the weights below peak at 4)
values = [1, 2, 3, 4, 5]
distribution = [0.06, 0.1, 0.19, 0.4, 0.25]

# Randomly assign the stages visited based on the distribution
df["stages_visited"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)
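
To confirm that a sampled column actually follows its target weights, the observed shares can be compared with the intended ones. This is a rough sketch on a stand-in frame rather than the real dataset: `demo`, `comparison`, the seed, and the row count of 14,000 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility

values = ["Card", "Cash", "Festival App"]
distribution = [0.65, 0.3, 0.05]

# Stand-in frame with a hypothetical row count
demo = pd.DataFrame(
    {"payment_method": rng.choice(values, size=14_000, p=distribution)}
)

# Observed share of each category next to its target weight
observed = demo["payment_method"].value_counts(normalize=True)
comparison = pd.DataFrame({
    "target": pd.Series(distribution, index=values),
    "observed": observed,
})
comparison["abs_diff"] = (comparison["observed"] - comparison["target"]).abs()
print(comparison)
```

With a sample of this size the differences should be small. A large gap would suggest that the values and weights lists are misaligned.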

#### Column *"age"*

I sampled this field from a normal distribution, which I considered the best fit for a realistic age spread. I then clipped the values to a minimum of 18 and a maximum of 59, so any draw below or above those limits is set to the nearest boundary.

In [3]:
# For age I will use a normal distribution with a mean of 30 and a standard deviation of 8
min_age = 18
max_age = 59
mean_age = 30
std_dev_age = 8

# Generate random ages using a normal distribution
ages = np.random.normal(loc=mean_age, scale=std_dev_age, size=len(df))

# Clip the ages to be within the specified range
clipped_ages = np.clip(ages, min_age, max_age)

# Round the ages to the nearest integer
clipped_ages = np.round(clipped_ages).astype(int)

# Assign the clipped ages to the 'age' column
df["age"] = clipped_ages
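
Because clipping moves out-of-range draws to the boundary instead of discarding them, it piles extra rows onto the minimum age. The sketch below estimates the size of that effect with the same parameters as the cell above. The seed and the sample size of 100,000 are hypothetical choices, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(7)  # hypothetical seed
min_age, max_age = 18, 59
mean_age, std_dev_age = 30, 8

raw = rng.normal(loc=mean_age, scale=std_dev_age, size=100_000)

# Share of draws that clipping moves to each boundary
share_below = np.mean(raw < min_age)
share_above = np.mean(raw > max_age)
print(f"raised to {min_age}: {share_below:.1%}")
print(f"lowered to {max_age}: {share_above:.1%}")
```

Since 18 lies 1.5 standard deviations below the mean, roughly 6.7% of draws are expected to be raised to 18, while almost none reach the upper limit. If that spike at 18 is unwanted, redrawing the out-of-range values would be one alternative to clipping.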

#### Adding variety in *"origin_city"*

In [4]:
# I will introduce 'Malaga' to add variety
values = ["Madrid", "Barcelona", "Valencia", "Sevilla", "Malaga"]
distribution = [0.38, 0.29, 0.18, 0.1, 0.05]
df["origin_city"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#### Changing *"ticket_price"*

The original ticket prices were:

In [5]:
print(df["ticket_price"].value_counts())

ticket_price
80     4750
200    4642
150    4608
Name: count, dtype: int64


These values didn’t reflect a realistic price gap between tickets, especially considering the added value of multi-day or VIP passes. I updated them as follows:

In [6]:
# Replace 150 with 210 and 200 with 350
df["ticket_price"] = df["ticket_price"].replace({150: 210, 200: 350})

print(df["ticket_price"].value_counts())

ticket_price
80     4750
350    4642
210    4608
Name: count, dtype: int64
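
As a small illustration of how the dictionary form of `replace` behaves, the sketch below applies the same mapping to a made-up Series. The sample prices and the names `prices`, `price_map`, and `updated` are illustrative and not taken from the dataset.

```python
import pandas as pd

prices = pd.Series([80, 150, 200, 150, 80], name="ticket_price")  # made-up sample
price_map = {150: 210, 200: 350}

updated = prices.replace(price_map)

# Values without an entry in the mapping (80 here) are left unchanged
print(updated.tolist())  # [80, 210, 350, 210, 80]
```

This is why the count for 80 stays the same in the output above, while the counts for 150 and 200 move to 210 and 350.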
