# Data Survey

It's essential to understand the structure and quality of the dataset I’ll be working with. This first section focuses on exploring the data to identify potential issues, limitations, and opportunities for improvement.

I plan to examine the following aspects, in order of priority:

1. **Null Values**  
   Are there any missing values? Which columns are affected, and how significant is the impact?

2. **Inconsistent or Incorrect Entries (Typos)**  
   Do any fields contain typos, extra spaces, or inconsistent formatting? Which columns need standarization?

3. **Data Distributions**  
   How do values distribute across key fields (e.g. satisfaction, ticket type, group size)? Do they appear natural or artificially uniform?

4. **Data Consistency**  
   Are related columns logically aligned?

5. **Field Types and Optimization**  
   Are the column data types appropriate? Can I optimize memory or performance through type conversion?

6. **Outliers and Edge Cases**  
   Are there any values that fall outside expected ranges (e.g. negative expenses, extreme ages)?

7. **Duplicates**  
   Are there any duplicated records along any field?

This survey will shape the decisions I make in the [Data Sharpening](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/2.-%20Data%20Sharpening) and [Data Cleaning](https://github.com/Donnie-McGee/Festival-Purchase-Behavior-Analysis/tree/main/3.-%20Data%20Cleaning).

### Searching for null values, typos and checking types

In [9]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets & Tables\festival_dataset_dirty.csv")

# This part of the code provides an overview of the dataset, spaced, so each function is easier to read.

df.shape
print("\n")
df.head(10)
print("\n")
df.info()
print("\n")
round(df.describe(), 2)
print("\n")
df.isnull().sum()
print("\n")
df.duplicated().sum()
print("\n")
df.nunique()
print("\n")
df.columns
print("\n")

# I'm selecting columns of type 'object' (text) or 'category' to analyze unique values in those columns.
text_columns = df.select_dtypes(include=['object', 'category']).columns

# Displaying unique values for each text column
# With it, we can see the unique values in each text column, which helps us understand the dataset better
# And detect typos or inconsistencies in the data.
for col in text_columns:
    print(f"\nUnique values of '{col}':")
    print(df[col].unique())





<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ticket_id            14000 non-null  object 
 1   ticket_type          13720 non-null  object 
 2   ticket_price         14000 non-null  int64  
 3   purchase_date        14000 non-null  object 
 4   attendance_date      14000 non-null  object 
 5   entry_time           14000 non-null  object 
 6   was_present          14000 non-null  bool   
 7   attendee_id          14000 non-null  object 
 8   age                  14000 non-null  int64  
 9   gender               13861 non-null  object 
 10  origin_city          14000 non-null  object 
 11  transport_used       14000 non-null  object 
 12  group_size           14000 non-null  int64  
 13  food_expense         14000 non-null  float64
 14  drink_expense        14000 non-null  float64
 15  merch_expense        14000 non-n

With this first overview, I now have a clearer understanding of the dataset:

1. **Null Values**  
   Missing values were found in the fields *"ticket_type"* and *"gender"*. These will need attention during the cleaning phase.

2. **Typos and Inconsistencies**  
   Several textual fields contain typos or inconsistent formatting, particularly in *"favourite_genre"*, *"payment_method"*, and *"recommend_to_friend"*.

3. **Distributions**  
   A quick look at means and percentiles suggests that some columns have unusually uniform or skewed distributions, which may not reflect realistic festival behavior.

4. **Field Types**  
   Aside from numeric fields, all other columns are currently of type *object*. This will negatively impact performance and must be optimized by converting them to more appropriate types like *category* or *datetime*.

5. **Duplicates**  
   *"ticket_id"*, which is an essential field, presents duplicates in its records, what I could use to create a fact table.

### Closer look to distributions

In [10]:
print(df['origin_city'].value_counts(dropna=False))
print("\n")
print(df['ticket_type'].value_counts(dropna=False))
print("\n")
print(df['gender'].value_counts(dropna=False))
print("\n")
print(df['group_size'].value_counts(dropna=False))
print("\n")
print(df['payment_method'].value_counts(dropna=False))
print("\n")
print(df['stages_visited'].value_counts(dropna=False))
print("\n")
print(df['age'].value_counts(dropna=False))
print("\n")
print(df['satisfaction_score'].value_counts(dropna=False))
print("\n")
print(df['security_rating'].value_counts(dropna=False))
print("\n")
print(df['cleanliness_rating'].value_counts(dropna=False))
print("\n")

origin_city
Madrid       5587
Barcelona    4195
Sevilla      2818
Valencia     1400
Name: count, dtype: int64


ticket_type
3-day Pass    8158
1-day Pass    2782
VIP           2780
NaN            280
Name: count, dtype: int64


gender
Female    5112
Male      4573
Other     4176
NaN        139
Name: count, dtype: int64


group_size
5    2861
2    2852
4    2809
1    2749
3    2729
Name: count, dtype: int64


payment_method
Festival App    4783
Card            4619
Cash            3683
cash             915
Name: count, dtype: int64


stages_visited
3    3572
2    3536
1    3450
4    3442
Name: count, dtype: int64


age
58    398
33    372
41    367
35    365
42    364
     ... 
29    311
48    310
55    310
27    309
59    291
Name: count, Length: 42, dtype: int64


satisfaction_score
8.0    5561
9.5    4252
6.5    2790
5.0    1397
Name: count, dtype: int64


security_rating
8.0    5633
5.0    2842
6.5    2764
9.5    2761
Name: count, dtype: int64


cleanliness_rating
8.0    5664
6.5   

1. **Balanced Distributions**  
   The fields *"origin_city"* and *"ticket_type"* appear to have reasonably well-balanced distributions, reflecting realistic attendance patterns. No modifications will be applied to these.

2. **Monotonous or Artificial Distributions**  
   Fields like *"gender"*, *"age"*, *"stages_visited"*, *"payment_method"*, and *"group_size"* show distributions that are too uniform or lack natural variability. These do not align with what would be expected in a real festival scenario and will require adjustment.

3. **Unrealistic Rating Distributions**  
   While the rating fields (*"satisfaction_score"*, *"security_rating"*, and *"cleanliness_rating"*) are not strictly monotonous, they lack realism. They contain only high, rounded values (e.g., 5.0, 6.5, 8.0, 9.5) with identical categories across all three fields. This pattern suggests artificiality and calls for a more natural spread of scores, including lower and more varied ratings.