# Data Survey

It's essential to understand the structure and quality of the dataset I'll be working with. This first section focuses on exploring the data to identify potential issues, limitations, and opportunities for improvement.

I plan to examine the following aspects, in order of priority:

1. **Null Values**  
   Are there any missing values? Which columns are affected, and how significant is the impact?

2. **Inconsistent or Incorrect Entries (Typos)**  
   Do any fields contain typos, extra spaces, or inconsistent formatting? Which columns need standardization?

3. **Data Distributions**  
   How do values distribute across key fields (e.g. satisfaction, ticket type, group size)? Do they appear natural or artificially uniform?

4. **Data Consistency**  
   Are related columns logically aligned?

5. **Field Types and Optimization**  
   Are the column data types appropriate? Can I optimize memory or performance through type conversion?

6. **Outliers and Edge Cases**  
   Are there any values that fall outside expected ranges (e.g. negative expenses, extreme ages)?

This survey will shape the decisions I make in the [Data Sharpening](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/2.-%20Data%20Shapening) and [Data Cleaning]() phases.
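Points 4 and 6 of the list (consistency and outliers) can be sketched as simple boolean masks. This is only a minimal illustration on a few made-up rows, not the real dataset: the column names match those shown in the overview, but the row values, the rule that a purchase cannot come after attendance, and the 16 to 100 age range are my own assumptions.

```python
import pandas as pd

# Illustrative rows only (NOT taken from the real dataset).
sample = pd.DataFrame({
    "purchase_date": ["2025-03-07", "2025-05-04", "2025-02-18"],
    "attendance_date": ["2025-05-02", "2025-05-03", "2025-05-01"],
    "age": [20, 130, 27],
    "merch_expense": [78.35, 48.70, -5.00],
})

purchase = pd.to_datetime(sample["purchase_date"])
attendance = pd.to_datetime(sample["attendance_date"])

# Consistency (assumed rule): a ticket cannot be bought after the attendance date.
bought_after_attending = sample[purchase > attendance]

# Outliers (assumed thresholds): ages outside 16-100 and negative expenses.
bad_age = sample[~sample["age"].between(16, 100)]
negative_expense = sample[sample["merch_expense"] < 0]

print(bought_after_attending)
print(bad_age)
print(negative_expense)
```

Each mask returns the offending rows, so the same pattern can be pointed at `df` once the real thresholds are decided.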

### Searching for null values, typos and checking types

In [5]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets\festival_dataset_dirty.csv")

# This part of the code provides an overview of the dataset, spaced, so each function is easier to read.

print(df.shape)
print("\n")
print(df.head())
print("\n")
df.info()  # info() prints its own report and returns None, so wrapping it in print() adds a stray "None"
print("\n")
print(round(df.describe(), 2))
print("\n")
print(df.isnull().sum())
print("\n")
print(df.duplicated().sum())
print("\n")
print(df.nunique())
print("\n")
print(df.columns)
print("\n")

# I'm selecting columns of type 'object' (text) or 'category' to analyze unique values in those columns.
text_columns = df.select_dtypes(include=['object', 'category']).columns

# Display the unique values of each text column.
# Reading them side by side helps me understand the dataset
# and spot typos or inconsistent formatting.
for col in text_columns:
    print(f"\nUnique values of '{col}':")
    print(df[col].unique())

(14000, 25)


   ticket_id ticket_type  ticket_price purchase_date attendance_date  \
0  TCK102187  3-day Pass           200    2025-03-07      2025-05-02   
1  TCK105186  3-day Pass           150    2025-02-06      2025-05-03   
2  TCK107469         NaN           200    2025-03-08      2025-05-03   
3  TCK102961  3-day Pass           200    2025-04-30      2025-05-01   
4  TCK107515  3-day Pass            80    2025-02-18      2025-05-01   

  entry_time  was_present attendee_id  age  gender  ... merch_expense  \
0   19:22:00         True   ATD202187   20  Female  ...         78.35   
1   15:59:00         True   ATD205186   33   Other  ...         48.70   
2   16:32:00         True   ATD207469   35  Female  ...         56.57   
3   22:19:00         True   ATD202961   50   Other  ...         10.78   
4   19:35:00         True   ATD207515   27   Other  ...         11.05   

  payment_method  favourite_genre  stages_visited  top_artist_seen  \
0   Festival App              Pop           

With this first overview, I now have a clearer understanding of the dataset:

1. **Null Values**  
   Missing values were found in the fields *"ticket_type"* and *"gender"*. These will need attention during the cleaning phase.

2. **Typos and Inconsistencies**  
   Several textual fields contain typos or inconsistent formatting, particularly in *"favourite_genre"*, *"payment_method"*, and *"recommend_to_friend"*.

3. **Distributions**  
   A quick look at means and percentiles suggests that some columns have unusually uniform or skewed distributions, which may not reflect realistic festival behavior.

4. **Field Types**  
   Apart from the numeric fields, the remaining columns are currently of type *object*. That wastes memory and slows down operations, so these columns should be converted to more appropriate types such as *category* or *datetime*.
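As a rough sketch of what findings 2 and 4 imply, the snippet below normalizes a text column before converting it to *category*, and parses a date column into *datetime*. The rows are illustrative (the leading space in `" Festival App"` is a hypothetical defect), and stripping plus title-casing is just one possible normalization; it would not suit a column like *"ticket_type"*, where `"3-day Pass"` would become `"3-Day Pass"`.

```python
import pandas as pd

# Illustrative rows only; repeated so the memory difference is visible.
toy = pd.DataFrame({
    "payment_method": ["Card", "Cash", "cash", " Festival App", "Cash"] * 200,
    "purchase_date": ["2025-03-07", "2025-02-06", "2025-03-08",
                      "2025-04-30", "2025-02-18"] * 200,
})

before = toy.memory_usage(deep=True).sum()

# Normalize whitespace and casing first, so "cash" and "Cash"
# collapse into a single category instead of two.
toy["payment_method"] = (
    toy["payment_method"].str.strip().str.title().astype("category")
)

# Parse ISO-formatted date strings into datetime64 values.
toy["purchase_date"] = pd.to_datetime(toy["purchase_date"], format="%Y-%m-%d")

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(list(toy["payment_method"].cat.categories))
print(f"memory: {before} -> {after} bytes")
```

Converting before normalizing would lock the duplicated spellings in as separate categories, which is why the cleaning step comes first here.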

### A closer look at distributions

In [6]:
# Columns whose value distributions I want to inspect (NaN counts included).
columns_to_inspect = [
    'origin_city', 'ticket_type', 'gender', 'group_size', 'payment_method',
    'stages_visited', 'age', 'satisfaction_score', 'security_rating',
    'cleanliness_rating',
]

for col in columns_to_inspect:
    print(df[col].value_counts(dropna=False))
    print("\n")

origin_city
Madrid       5587
Barcelona    4195
Sevilla      2818
Valencia     1400
Name: count, dtype: int64


ticket_type
3-day Pass    8158
1-day Pass    2782
VIP           2780
NaN            280
Name: count, dtype: int64


gender
Female    5124
Male      4486
Other     4261
NaN        129
Name: count, dtype: int64


group_size
5    2861
2    2852
4    2809
1    2749
3    2729
Name: count, dtype: int64


payment_method
Card            4709
Festival App    4673
Cash            3693
cash             925
Name: count, dtype: int64


stages_visited
3    3572
2    3536
1    3450
4    3442
Name: count, dtype: int64


age
58    398
33    372
41    367
35    365
42    364
23    363
46    357
19    351
34    350
31    347
44    346
57    345
43    344
32    343
50    342
39    341
24    336
20    336
38    332
56    328
52    327
45    327
54    326
51    326
49    325
25    322
37    322
22    322
30    319
26    319
53    318
40    318
28    317
18    315
47    313
21    313
36    313
29  

1. **Balanced Distributions**  
   The fields *"origin_city"* and *"ticket_type"* appear to have reasonably well-balanced distributions, reflecting realistic attendance patterns. No modifications will be applied to these.

2. **Uniform or Artificial Distributions**  
   Fields like *"gender"*, *"age"*, *"stages_visited"*, *"payment_method"*, and *"group_size"* show distributions that are too uniform or lack natural variability. They do not match what I would expect from a real festival and will require adjustment.

3. **Unrealistic Rating Distributions**  
   While the rating fields (*"satisfaction_score"*, *"security_rating"*, and *"cleanliness_rating"*) are not strictly uniform, they lack realism. They contain only high, rounded values (e.g., 5.0, 6.5, 8.0, 9.5), and the same set of values appears in all three fields. This pattern suggests the data is artificial and calls for a more natural spread of scores, including lower and more varied ratings.
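One way to put a number on "too uniform" is a chi-square goodness-of-fit statistic against equal expected counts. The sketch below is my own suggestion rather than part of the original analysis. It uses the *"group_size"* and *"origin_city"* counts printed above, and the critical values are the standard 5% table entries for 4 and 3 degrees of freedom.

```python
# Counts copied from the value_counts() output above.
group_size_counts = {5: 2861, 2: 2852, 4: 2809, 1: 2749, 3: 2729}
origin_city_counts = {"Madrid": 5587, "Barcelona": 4195,
                      "Sevilla": 2818, "Valencia": 1400}

# Standard chi-square critical values at the 5% level, keyed by degrees of freedom.
CRITICAL_5PCT = {3: 7.815, 4: 9.488}


def chi_square_vs_uniform(counts):
    """Return the chi-square statistic against equal expected counts."""
    observed = list(counts.values())
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


for name, counts in [("group_size", group_size_counts),
                     ("origin_city", origin_city_counts)]:
    stat = chi_square_vs_uniform(counts)
    critical = CRITICAL_5PCT[len(counts) - 1]
    verdict = ("compatible with a uniform draw" if stat < critical
               else "clearly not uniform")
    print(f"{name}: chi2 = {stat:.2f} (critical {critical}) -> {verdict}")
```

With these counts, *"group_size"* stays well below its critical value (about 5.05 against 9.488), which is what a purely random uniform draw would produce, while *"origin_city"* exceeds its threshold by a wide margin.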