## Redefining distributions

All columns followed the same pattern: every possible value appeared in almost the same proportion.
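
One way to check for this kind of near-uniform pattern is to look at the normalized value counts of a column. The sketch below uses a small made-up frame (`toy`) instead of the festival dataset, so the data and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: a uniformly generated rating column
rng = np.random.default_rng(0)
toy = pd.DataFrame({"satisfaction_score": rng.integers(1, 11, size=10_000)})

# Share of rows per value; a near-uniform column shows roughly 1/10 for each of the 10 values
proportions = toy["satisfaction_score"].value_counts(normalize=True).sort_index()
print(proportions.round(3))

# Largest deviation from the perfectly uniform share
max_gap = (proportions - 1 / proportions.size).abs().max()
print(f"max deviation from uniform: {max_gap:.3f}")
```

A small maximum deviation means the column is close to uniform, which is the pattern described above.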

### Customized distribution

In this step I force the fields to follow distributions that I defined myself.

#### Rating fields

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets & Tables\festival_dataset_dirty.csv")

#-----------------------
# Column satisfaction_score
# Mixed reviews for satisfaction, but generally positive
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.11, 0.05, 0.04, 0.12, 0.08, 0.1, 0.07, 0.3, 0.1, 0.03]

# Randomly assign the satisfaction score based on the distribution
df["satisfaction_score"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column security_rating
# Generally positive reviews for security
# Note: 3 is not among the possible values in this list
values = [1, 2, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.02, 0.02, 0.01, 0.05, 0.1, 0.2, 0.25, 0.2, 0.15]

# Randomly assign the security rating based on the distribution
df["security_rating"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column cleanliness_rating
# Generally negative reviews for cleanliness
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.03, 0.12, 0.18, 0.12, 0.1, 0.18, 0.1, 0.07, 0.06, 0.04]

# Randomly assign the cleanliness rating based on the distribution
df["cleanliness_rating"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)
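
Because `np.random.choice` draws from NumPy's global random state, each run of the cell above produces a different assignment. Below is a minimal sketch of a reproducible variant, assuming a seeded `np.random.default_rng` generator and a hypothetical helper `assign_by_distribution` (neither is part of the original notebook). It also checks that the probabilities sum to 1 and compares the observed shares with the targets.

```python
import numpy as np
import pandas as pd

def assign_by_distribution(n_rows, values, probabilities, rng):
    """Draw n_rows samples from values with the given probabilities (hypothetical helper)."""
    if len(values) != len(probabilities):
        raise ValueError("values and probabilities must have the same length")
    if not np.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return rng.choice(values, size=n_rows, p=probabilities)

rng = np.random.default_rng(42)  # seed chosen arbitrarily for illustration
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
distribution = [0.11, 0.05, 0.04, 0.12, 0.08, 0.1, 0.07, 0.3, 0.1, 0.03]

demo = pd.DataFrame(index=range(14_000))  # stand-in for df
demo["satisfaction_score"] = assign_by_distribution(len(demo), values, distribution, rng)

# Compare the observed shares with the target distribution
observed = demo["satisfaction_score"].value_counts(normalize=True).reindex(values, fill_value=0)
comparison = pd.DataFrame(
    {"target": distribution, "observed": observed.round(3).to_numpy()},
    index=values,
)
print(comparison)
```

With a fixed seed the same scores are generated on every run, which makes later analysis steps easier to compare.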

#### Non-rating fields

In [4]:
#-----------------------
# Column group_size
# Majority of larger groups over smaller ones
values = [1, 2, 3, 4, 5]
distribution = [0.09, 0.17, 0.31, 0.23, 0.2]

# Randomly assign the group size based on the distribution
df["group_size"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

#-----------------------
# Column payment_method
# Tendency to pay with card and cash
values = ["Card", "Cash", "Festival App"]
distribution = [0.65, 0.3, 0.05]

# Randomly assign the payment method based on the distribution
df["payment_method"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

### Column *"age"*

I used a right-skewed normal distribution for this field, since it gives a more realistic age spread than evenly distributed values. I then clipped the values to a minimum of 18 and a maximum of 59, so any draw below or above those limits is set to the nearest boundary.

In [5]:
# Import necessary libraries
from scipy.stats import skewnorm

# Named min_age/max_age so the Python built-ins min and max are not shadowed
min_age = 18
max_age = 59
mean = 30      # passed as loc (location); the resulting mean is slightly higher because of the skew
std_dev = 8    # passed as scale

# Generate a right-skewed normal distribution for age.
# 0.4 is the skewness parameter (positive means right skew).
# This concentrates the distribution towards lower values while keeping them close to the mean,
# which reduces the number of values that have to be clipped at the minimum.
age_dist = skewnorm.rvs(0.4, loc=mean, scale=std_dev, size=len(df), random_state=None)

# Clip to keep values in range
age_dist = np.clip(age_dist, min_age, max_age)

# Round to nearest integer
df["age"] = np.round(age_dist).astype(int)

print(df["age"].value_counts())

age
32    746
33    711
36    709
30    707
35    697
31    687
34    685
29    661
28    601
37    601
27    571
38    537
39    527
26    500
40    460
18    455
25    437
24    414
41    405
42    349
23    334
22    317
43    271
21    221
44    214
45    188
20    186
19    171
46    143
47    128
48     89
49     76
50     57
51     35
53     32
52     25
54     16
56     13
55      9
59      8
57      5
58      2
Name: count, dtype: int64
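
The counts above show a side effect of clipping: age 18 appears 455 times while age 19 appears 171 times, largely because every draw below the minimum is collapsed onto the boundary. If that spike is unwanted, one alternative is to redraw out-of-range values instead of clipping them. The sketch below is an assumption on my part, not the method used in this notebook; `sample_age_without_clipping` is a hypothetical helper and the seed is arbitrary.

```python
import numpy as np
from scipy.stats import skewnorm

def sample_age_without_clipping(n, min_age=18, max_age=59, loc=30, scale=8, skew=0.4, seed=0):
    """Draw n integer ages, redrawing any value outside [min_age, max_age] (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    ages = np.empty(0, dtype=int)
    while ages.size < n:
        draws = skewnorm.rvs(skew, loc=loc, scale=scale, size=n, random_state=rng)
        rounded = np.round(draws).astype(int)
        # Keep only in-range draws; out-of-range ones are discarded and redrawn
        kept = rounded[(rounded >= min_age) & (rounded <= max_age)]
        ages = np.concatenate([ages, kept])
    return ages[:n]

ages = sample_age_without_clipping(14_000)
counts = np.bincount(ages, minlength=60)
print("age 18:", counts[18], "age 19:", counts[19])
```

The trade-off is that redrawing slightly changes the shape of the distribution inside the range, while clipping keeps the shape but piles extra rows onto the boundaries.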


## Adding variety in *"origin_city"*

In [6]:
# I will introduce 'Malaga' to add variety
values = ["Madrid", "Barcelona", "Valencia", "Sevilla", "Malaga"]
distribution = [0.38, 0.29, 0.18, 0.1, 0.05]
df["origin_city"] = np.random.choice(
    values,
    size=len(df),
    p=distribution
)

## Changing *"ticket_price"*

The original ticket prices were:

In [7]:
print(df["ticket_price"].value_counts())

ticket_price
80     4750
200    4642
150    4608
Name: count, dtype: int64


These values didn’t reflect a realistic price gap between tickets, especially considering the added value of multi-day or VIP passes. I updated them as follows:

In [8]:
# Replace 150 with 210 and 200 with 350
df["ticket_price"] = df["ticket_price"].replace({150: 210, 200: 350})

print(df["ticket_price"].value_counts())

ticket_price
80     4750
350    4642
210    4608
Name: count, dtype: int64
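
A quick sanity check for this kind of replacement is to confirm that each old price maps to exactly one new price with the same number of rows, as the two outputs above suggest. The sketch below uses a toy series (the prices come from this notebook, the rows are made up) and is only an illustration of that check.

```python
import pandas as pd

# Toy stand-in for the ticket_price column
prices = pd.Series([80, 150, 200, 150, 80, 200, 200], name="ticket_price")
mapping = {150: 210, 200: 350}

updated = prices.replace(mapping)

# Each old price should map to one new price with an unchanged row count
before = prices.value_counts()
after = updated.value_counts()
for old_price, count in before.items():
    new_price = mapping.get(old_price, old_price)
    assert after[new_price] == count, f"row count changed for {old_price}"

print(after.sort_index())
```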


## Saving changes

In [9]:
df.to_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets & Tables\festival_dataset_dirty_modified.csv", index=False)