# Normal distribution in *"age"*

In [27]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets\festival_dataset_dirty.csv")

# For age I will use a normal distribution with a mean of 30 and a standard deviation of 8
# Named min_age/max_age so the built-in min() and max() are not shadowed
min_age = 18
max_age = 59
mean = 30
std_dev = 8

# Generate random ages using a normal distribution
ages = np.random.normal(loc=mean, scale=std_dev, size=len(df))

# Clip the ages to be within the specified range
clipped_ages = np.clip(ages, min_age, max_age)

# Round the ages to the nearest integer
clipped_ages = np.round(clipped_ages).astype(int)

# Assigning the clipped ages to the 'age' column
df["age"] = clipped_ages

print(df["age"].value_counts())

age
18    1077
29     734
32     680
30     673
28     673
31     657
33     654
27     633
34     632
35     606
25     595
26     591
23     541
36     537
24     515
37     467
38     425
39     389
22     379
21     367
40     334
20     289
41     265
19     259
42     214
43     187
44     129
45     108
46      99
47      83
48      60
49      48
50      26
52      23
51      17
53       9
54       9
55       8
56       4
57       3
58       1
Name: count, dtype: int64


The value *18* showed an unexpectedly high peak: 1077 of the 14,000 rows (about 7.7%). The cause is clipping, which maps every sample that falls below the lower limit to exactly *18* instead of discarding it. This introduced a bias, making *18* the mode, which didn’t reflect a realistic age distribution.
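
The size of that peak can be estimated directly from the normal CDF. The following is a minimal sketch, assuming the same parameters as the cell above (mean 30, standard deviation 8, lower limit 18); the variable names are illustrative and not part of the original notebook.

```python
from scipy.stats import norm

# Parameters taken from the cell above
lower, mean, std_dev = 18, 30, 8

# After clipping and rounding, every sample below 18.5 ends up as 18
share_at_18 = norm.cdf(lower + 0.5, loc=mean, scale=std_dev)

# Without clipping, 18 would only collect samples in [17.5, 18.5)
share_unclipped = share_at_18 - norm.cdf(lower - 0.5, loc=mean, scale=std_dev)

print(f"Expected share at 18 with clipping:    {share_at_18:.1%}")
print(f"Expected share at 18 without clipping: {share_unclipped:.1%}")
```

The clipped share comes out at roughly 7.5%, which is in line with the 1077 of 14,000 rows observed above, while an unclipped bin at 18 would hold only a small fraction of that.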

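To solve this, I applied a ***right-skewed normal distribution*** using `scipy.stats.skewnorm` with a shape parameter of `0.4`. A positive shape parameter moves probability mass toward higher ages, so fewer samples fall below the lower limit. Note that in `skewnorm`, `loc` and `scale` are location and scale parameters rather than the mean and standard deviation, so with `loc=30` the actual mean ends up at about 32.4. This approach allowed me to reduce the artificial accumulation of minimum values while still preserving a realistic presence of younger attendees.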

You can read more about the skewnorm function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html#scipy.stats.skewnorm).
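
To see what the skewed distribution actually looks like before sampling, its moments and the expected share of rows that clipping will push to 18 can be queried from `skewnorm` directly. This is a small sketch assuming the shape, `loc`, and `scale` values used in this notebook (0.4, 30, 8); the variable names are only for illustration.

```python
from scipy.stats import skewnorm

# Shape, location and scale as used in this notebook
a, loc, scale = 0.4, 30, 8

# Actual mean and standard deviation of the skewed distribution;
# these differ from loc and scale whenever a != 0
dist_mean = skewnorm.mean(a, loc=loc, scale=scale)
dist_std = skewnorm.std(a, loc=loc, scale=scale)

# Share of samples below 18.5, i.e. those that become 18 after clip + round
share_at_18 = skewnorm.cdf(18.5, a, loc=loc, scale=scale)

print(f"mean = {dist_mean:.2f}, std = {dist_std:.2f}")
print(f"Expected share at 18 after clipping: {share_at_18:.1%}")
```

The mean lands a little above `loc`, and the expected share at 18 is less than half of what the plain normal distribution with the same `loc` and `scale` produces.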

In [28]:
# Import necessary libraries
from scipy.stats import skewnorm

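# Named min_age/max_age so the built-in min() and max() are not shadowed
min_age = 18
max_age = 59
mean = 30
std_dev = 8

# Generate a right-skewed age distribution (shape a=0.4 > 0 skews right).
# Note: loc and scale are skewnorm's location and scale parameters,
# not the mean and standard deviation of the resulting distribution.
age_dist = skewnorm.rvs(0.4, loc=mean, scale=std_dev, size=len(df), random_state=None)

# Clip to keep values in range
age_dist = np.clip(age_dist, min_age, max_age)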

# Round to nearest integer
df["age"] = np.round(age_dist).astype(int)

print(df["age"].value_counts())

age
33    762
30    735
32    722
34    696
36    691
29    683
31    652
35    641
28    619
27    603
37    580
38    563
26    539
39    499
25    463
18    462
40    430
41    416
24    402
23    345
42    294
43    283
22    281
21    262
44    247
20    196
45    175
19    167
46    145
47    105
48     81
49     71
50     54
51     41
52     35
54     17
53     14
57     11
55      9
56      3
58      3
59      3
Name: count, dtype: int64


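With this new approach, the *"age"* distribution looks much more natural. The count for *18* dropped from 1077 to 462, so it is no longer the mode (that is now *33*), but it still retains enough frequency to represent a meaningful portion of the younger audience. A smaller peak at *18* remains (462 rows versus 167 for *19*) because clipping is still applied to the skewed samples.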
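
If the remaining peak at the lower limit ever becomes a problem, one possible alternative (not the method used in this notebook) is to sample from a truncated normal distribution, which never produces out-of-range values and therefore needs no clipping to enforce the limits. The sketch below assumes the same limits and parameters as above; `n_rows` stands in for `len(df)` and the seed `42` is arbitrary.

```python
import numpy as np
from scipy.stats import truncnorm

min_age, max_age = 18, 59
mean, std_dev = 30, 8
n_rows = 14_000  # stand-in for len(df); matches the row count in the outputs above

# truncnorm expects its bounds in standard-deviation units relative to loc.
# The half-unit margins let 18 and 59 each cover a full rounding bin.
a = (min_age - 0.5 - mean) / std_dev
b = (max_age + 0.5 - mean) / std_dev

samples = truncnorm.rvs(a, b, loc=mean, scale=std_dev, size=n_rows, random_state=42)

# Round to integers; the clip is only a guard for values at the exact bin edges
ages = np.clip(np.round(samples), min_age, max_age).astype(int)

counts = np.bincount(ages, minlength=max_age + 1)
print("count at 18:", counts[18], "| count at 19:", counts[19])
```

Because no mass is piled onto the boundary, the count at 18 should sit close to its neighbours instead of standing out. The trade-off is that the share of 18-year-olds becomes smaller than in the skewed, clipped version above.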