### Data Cleaning

With the information gathered during the [Data Survey](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/1.-%20Python%20Phase/1.-%20Data%20Survey), I proceed to clean the dataset.

### Removing non-essential fields

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Festival Purchase Behavior Analysis\Datasets\festival_dataset_dirty_modified.csv")

# Drop the fields that are not needed for this analysis
df = df.drop(columns=[
    "hours_spent",
    "ticket_price",
    "stages_visited",
    "attendee_id",
    "entry_time",
    "purchase_date",
    "was_present",
    "transport_used",
    "top_artist_seen",
    "origin_city",
])

### Change alphanumeric ticket_id to a numeric identifier

In [None]:
# Replace each alphanumeric ticket_id with an integer code
# pd.factorize returns a tuple (codes, uniques); only the codes array is needed
# Codes start at 0, so 1 is added; duplicated ticket_id values share the same code
df["ticket_id"] = pd.factorize(df["ticket_id"])[0] + 1
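To make the behaviour of `pd.factorize` concrete, here is a minimal sketch on a handful of made-up ticket IDs (the `TK-...` values are hypothetical and do not come from the dataset). It shows that codes are assigned in order of first appearance and that repeated IDs receive the same number.

```python
import pandas as pd

# Hypothetical ticket IDs; rows 0/2 and rows 1/3 are duplicates
sample = pd.Series(["TK-9A1", "TK-4C7", "TK-9A1", "TK-4C7", "TK-0Z3"])

# factorize returns (codes, uniques); codes are 0-based, in order of first appearance
codes, uniques = pd.factorize(sample)

# Shift the codes so that identifiers start at 1
numeric_ids = codes + 1

print(numeric_ids.tolist())  # [1, 2, 1, 2, 3]
print(list(uniques))         # ['TK-9A1', 'TK-4C7', 'TK-0Z3']
```

Note that, by default, `pd.factorize` gives missing values the code -1, so a null `ticket_id` would end up as 0 after adding 1.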

### Null values handling

In [18]:
# --- Gender column ---

# Count how often each value occurs in the column, including nulls (dropna=False),
# to see what needs cleaning
print(df['gender'].value_counts(dropna=False))
print("\n")

# Normalized distribution of the observed (non-null) gender values
gender_dist = df["gender"].value_counts(normalize=True)

# Boolean mask that locates the null values
mask = df["gender"].isnull()

# Number of null values to fill
n_nulls = mask.sum()

# Fill the nulls with random draws weighted by the observed distribution,
# so the imputed values preserve the original proportions
df.loc[mask, "gender"] = np.random.choice(
    gender_dist.index,
    size=n_nulls,
    p=gender_dist.values
)

# --- Ticket Type column ---
# Same steps to clean "ticket_type" column as we followed for "gender" column
print(df["ticket_type"].value_counts(dropna=False))
print("---------------------------------")

type_dist = df["ticket_type"].value_counts(normalize=True)
mask = df["ticket_type"].isnull()
n_nulls = mask.sum()
df.loc[mask, "ticket_type"] = np.random.choice(
    type_dist.index,
    size=n_nulls,
    p=type_dist.values
)


print(df["gender"].value_counts(dropna=False))
print("\n")
print(df["ticket_type"].value_counts(dropna=False))

gender
Female    7636
Male      5359
Other      866
NaN        139
Name: count, dtype: int64


ticket_type
3-day Pass    8158
1-day Pass    2782
VIP           2780
NaN            280
Name: count, dtype: int64
---------------------------------
gender
Female    7716
Male      5411
Other      873
Name: count, dtype: int64


ticket_type
3-day Pass    8323
1-day Pass    2849
VIP           2828
Name: count, dtype: int64
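Because the fill relies on random draws, the exact counts above will change from run to run. If reproducible results are wanted, one option is to draw from a seeded generator. The sketch below is an assumption-laden variant, not the code used above: it swaps `np.random.choice` for NumPy's `Generator` API, and the helper name `fill_from_distribution`, the seed value, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def fill_from_distribution(series, rng):
    """Fill nulls by sampling from the observed (non-null) value frequencies."""
    dist = series.value_counts(normalize=True)  # NaN is excluded by default
    mask = series.isnull()
    filled = series.copy()
    filled.loc[mask] = rng.choice(
        dist.index.to_numpy(),
        size=mask.sum(),
        p=dist.to_numpy(),
    )
    return filled

# The seed is arbitrary; any fixed value makes the draws repeatable
rng = np.random.default_rng(42)

# Toy column with two missing values (hypothetical data)
toy = pd.Series(["Female", "Male", None, "Female", "Other", None, "Female"])
filled = fill_from_distribution(toy, rng)

print(filled.isnull().sum())  # 0
```

Re-creating the generator with the same seed before each run yields the same imputed values, which keeps downstream figures stable.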


### Field rename

For better consistency with the other `*_rating` columns, "satisfaction_score" will be renamed to "satisfaction_rating". This change will come in handy later on, in the [Data Modelling](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/1.-%20Python%20Phase/4.-%20Data%20Modelling) phase.

In [19]:
# Rename column for consistency
df.rename({"satisfaction_score": "satisfaction_rating"}, axis=1, inplace=True)

### Fixing typos

During my [Data Survey](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/1.-%20Python%20Phase/1.-%20Data%20Survey) I found the following typos, which are corrected here.

In [20]:
# Map each misspelled value found in the survey to its correct form
# (note the trailing space in "cash ")
df["payment_method"] = df["payment_method"].replace({"cash ": "Cash"})
df["favourite_genre"] = df["favourite_genre"].replace({"hiphop": "Hip-Hop", "Regueton": "Reggaeton"})
df["recommend_to_friend"] = df["recommend_to_friend"].replace({"nO": "No"})

### Trimming values with spaces

In [21]:
# Normalize whitespace in all string columns
for col in df.columns:
    if df[col].dtype == 'object':
        # Remove leading and trailing whitespace
        df[col] = df[col].str.strip()
        # Collapse runs of whitespace into a single space
        df[col] = df[col].str.replace(r'\s+', ' ', regex=True)
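As a quick check of what this cleanup does, the following sketch applies the same two string operations to a small, made-up frame (the values are hypothetical). It uses `select_dtypes` to pick the text columns, which is an alternative to testing each column's dtype in the loop; the text column is created with an explicit object dtype so the selection behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical data with stray and repeated spaces
toy = pd.DataFrame({
    "payment_method": pd.Series(["  Cash", "Credit  Card ", "Cash"], dtype="object"),
    "group_size": [2, 4, 1],
})

# Select text columns only; numeric columns are left untouched
text_cols = toy.select_dtypes(include="object").columns

for col in text_cols:
    # Strip the ends, then collapse internal runs of whitespace
    toy[col] = toy[col].str.strip().str.replace(r"\s+", " ", regex=True)

print(toy["payment_method"].tolist())  # ['Cash', 'Credit Card', 'Cash']
```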

### Type conversion

In [22]:
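# Map the Yes/No answers to booleans first: casting strings directly with astype('bool')
# would turn every non-empty string (including "No") into True.
# Assumes the column only contains "Yes" and "No" at this point.
df['recommend_to_friend'] = df['recommend_to_friend'].map({'Yes': True, 'No': False})

# Cast each column to a dtype suited to analysis; 'category' can also reduce memory usage
df = df.astype({
    'ticket_id': 'category',
    'ticket_type': 'category',
    'age': 'int',
    'gender': 'category',
    'group_size': 'int',
    'food_expense': 'float',
    'drink_expense': 'float',
    'merch_expense': 'float',
    'payment_method': 'category',
    'favourite_genre': 'category',
    'satisfaction_rating': 'int',
    'security_rating': 'int',
    'cleanliness_rating': 'int',
    'recommend_to_friend': 'bool'
})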

# Date conversion
df['attendance_date'] = pd.to_datetime(df['attendance_date'])
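A caveat on boolean columns such as `recommend_to_friend`: on an object column, `astype('bool')` applies Python truthiness, so any non-empty string becomes `True`. The sketch below contrasts a direct cast with an explicit mapping. It assumes the answers are stored as the strings "Yes" and "No", which is inferred from the typo fix earlier rather than confirmed by the dataset.

```python
import pandas as pd

# Hypothetical answers stored as text
answers = pd.Series(["Yes", "No", "Yes"], dtype="object")

# Direct cast: every non-empty string is truthy, so "No" becomes True
naive = answers.astype("bool")
print(naive.tolist())   # [True, True, True]

# Explicit mapping keeps the meaning of each answer
mapped = answers.map({"Yes": True, "No": False}).astype("bool")
print(mapped.tolist())  # [True, False, True]
```

Checking `value_counts()` on the column after conversion is a cheap way to confirm that both `True` and `False` are present.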


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ticket_id            14000 non-null  category      
 1   ticket_type          14000 non-null  category      
 2   attendance_date      14000 non-null  datetime64[ns]
 3   age                  14000 non-null  int64         
 4   gender               14000 non-null  category      
 5   group_size           14000 non-null  int64         
 6   food_expense         14000 non-null  float64       
 7   drink_expense        14000 non-null  float64       
 8   merch_expense        14000 non-null  float64       
 9   payment_method       14000 non-null  category      
 10  favourite_genre      14000 non-null  category      
 11  satisfaction_rating  14000 non-null  int64         
 12  security_rating      14000 non-null  int64         
 13  cleanliness_rating   14000 non-