### New columns

To improve performance and keep the dataset organized, I added or recalculated the following columns:

1. **Rating and expenses fields**  
To build proper dimension tables, I averaged each rating field and summed each expense field per *"ticket_id"*.
2. **Mean rating and total expense**  
Two simple summary metrics: the total spent across food, drinks and merchandise, and the mean of the three rating scores.
3. ***"is_multiday"*, *"age_group"*, rating levels and group type**  
For easier segmentation in later phases.
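
As a rough illustration of the first point, the sketch below shows how a per-ticket aggregate can be written back onto every row with `groupby(...).transform(...)`. The `toy` frame and its values are made up for the example; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy data: two rows for ticket 1, one row for ticket 2
toy = pd.DataFrame({
    "ticket_id": [1, 1, 2],
    "satisfaction_rating": [6.0, 8.0, 9.0],
    "food_expense": [10.0, 5.5, 20.0],
})

# transform() returns one value per original row, so the result can be
# assigned straight back to the frame without a merge
toy["satisfaction_rating"] = (
    toy.groupby("ticket_id")["satisfaction_rating"].transform("mean").round(2)
)
toy["food_expense"] = (
    toy.groupby("ticket_id")["food_expense"].transform("sum").round(2)
)

print(toy)
```

Both rows of ticket 1 end up with the same mean rating and the same summed expense, which is why the original per-row values are no longer available after this step.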

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets & Tables\festival_dataset_clean.csv")

# ------------------------
# Rating and expenses calculations
# ------------------------

# Calculate mean for each ticket_id and assign it to each row and rating type
ratings_list = ["satisfaction_rating", "cleanliness_rating", "security_rating"]
for rating in ratings_list:
    df[rating] = (
        # Group by ticket_id to get the mean score for each ticket
        df.groupby("ticket_id")[rating]
        # Write the mean back to every row of that ticket, rounded to 2 decimals
        .transform("mean")
        .round(2)
    )

# Calculate total expenses in different categories
expense_list = ["food_expense", "drink_expense", "merch_expense"]
for expense in expense_list:
    df[expense] = (
        # Groups by ticket_id to get the total expense for each ticket
        df.groupby("ticket_id")[expense].
        # Sums up the expenses for each ticket_id and rounds to 2
        transform("sum").
        round(2)
        )
    
# ------------------------
# Total expense and mean rating
# ------------------------

# Calculate the total amount spent on food, drinks, and merchandise
df["total_spent"] = round(df["food_expense"] + df["drink_expense"] + df["merch_expense"], 2)

# Mean rating of all 3 rating scores
df["mean_rating"] = round((df["satisfaction_rating"] + df["security_rating"] + df["cleanliness_rating"])/3, 2)

# ------------------------
# Segmentation columns
# ------------------------

# Boolean column indicating whether the person attends multiple days:
# every ticket type other than "1-day Pass" counts as multi-day
df["is_multiday"] = df["ticket_type"] != "1-day Pass"

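# Create bins for age groups (each bin includes its upper edge)
# The last bin is open-ended so that ages above 59 still fall under "55+"
df["age_group"] = pd.cut(
    df["age"],
    bins=[17, 24, 34, 44, 54, np.inf],
    labels=["18-24", "25-34", "35-44", "45-54", "55+"],
)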

# Create a "<name>_level" column for each rating by binning its score
for name in ["satisfaction", "cleanliness", "security"]:
    df[name + "_level"] = pd.cut(
        df[name + "_rating"],
        bins=[0, 5, 8, 10],
        labels=["Low", "Medium", "High"],
        # Include the lowest edge, so a rating of exactly 0 falls in "Low"
        include_lowest=True,
    )

# Segment the type of visitor for easier analysis
df["group_type"] = (
    df["group_size"]
    # Map the values 1 and 2 to "Individual" and "Couple", respectively
    .map({1: "Individual", 2: "Couple"})
    # All other values (3 or more) are mapped to "Group"
    .fillna("Group")
)
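
To check where boundary values land, here is a minimal sketch using a made-up `sample` frame (the values are hypothetical, chosen to sit on the bin edges). It reuses the rating bins and the group mapping from the cell above.

```python
import pandas as pd

# Hypothetical sample values placed on or near the bin edges
sample = pd.DataFrame({
    "satisfaction_rating": [0.0, 5.0, 5.01, 8.0, 10.0],
    "group_size": [1, 2, 3, 4, 6],
})

# pd.cut intervals are closed on the right by default: (0, 5], (5, 8], (8, 10]
# include_lowest=True additionally admits the lowest edge, 0
sample["satisfaction_level"] = pd.cut(
    sample["satisfaction_rating"],
    bins=[0, 5, 8, 10],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Sizes missing from the mapping become NaN and are then filled with "Group"
sample["group_type"] = (
    sample["group_size"]
    .map({1: "Individual", 2: "Couple"})
    .fillna("Group")
)

print(sample)
```

A rating of exactly 5 is still "Low" and exactly 8 is still "Medium", because each bin includes its upper edge.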

### Changing types

For better performance and to avoid inconsistencies, I will assign explicit types to the new columns.

*Note: rating scores and expenses don't need to be included in this step since they were already processed during the [Data Cleaning](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/3.-%20Data%20Cleaning).*

In [None]:
df = df.astype({
    "total_spent": "float64",
    "mean_rating": "float64",
    "is_multiday": "bool",
    "age_group": "category",
    "satisfaction_level": "category",
    "cleanliness_level": "category",
    "security_level": "category",
    "group_type": "category"
})

# ------------------------
# Saving the final dataset
# ------------------------

df.to_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets & Tables\final_festival_dataset.csv", index=False)
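
One caveat worth keeping in mind: a CSV file does not store column types, so the `category` columns are read back as plain text unless the types are passed again on load. The sketch below illustrates this with a made-up two-row `demo` frame and an in-memory buffer instead of the real file; it is only an illustration, not part of the pipeline.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the final dataset
demo = pd.DataFrame({
    "group_type": pd.Series(["Couple", "Group"], dtype="category"),
    "is_multiday": [True, False],
})

# Write to an in-memory buffer to keep the sketch self-contained
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Plain reload: the category dtype is not preserved
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with explicit dtypes: the category dtype is restored
buffer.seek(0)
typed = pd.read_csv(
    buffer,
    dtype={"group_type": "category", "is_multiday": "bool"},
)

print(plain.dtypes)
print(typed.dtypes)
```

When the final dataset is loaded in a later notebook, the same `dtype` mapping can be passed to `pd.read_csv` for the columns typed above.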