# Data Survey

It's essential to understand the structure and quality of the dataset I’ll be working with. This first section focuses on exploring the data to identify potential issues, limitations, and opportunities for improvement.

I plan to examine the following aspects, in order of priority:

1. **Null Values**  
   Are there any missing values? Which columns are affected, and how significant is the impact?

2. **Inconsistent or Incorrect Entries (Typos)**  
   Do any fields contain typos, extra spaces, or inconsistent formatting? Which columns need standardization?

3. **Data Distributions**  
   How do values distribute across key fields (e.g. satisfaction, ticket type, group size)? Do they appear natural or artificially uniform?

4. **Data Consistency**  
   Are related columns logically aligned, or do values in one column contradict those in another?

5. **Field Types and Optimization**  
   Are the column data types appropriate? Can I optimize memory or performance through type conversion?

6. **Outliers and Edge Cases**  
   Are there any values that fall outside expected ranges (e.g. negative expenses, extreme ages)?

This survey will shape the decisions I make in the [Data Sharpening]() and [Data Cleaning]() sections.
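As a rough illustration of how the null-value and outlier questions could be checked, here is a minimal sketch on a toy DataFrame. The column names (`age`, `expenses`, `ticket_type`), the helper names `null_summary` and `out_of_range`, and the bounds are all assumptions made for this example; they are not taken from the festival dataset.

```python
import pandas as pd

# Toy data with hypothetical column names; the real dataset's columns may differ.
toy = pd.DataFrame({
    "age": [25, 31, None, 140, 19],
    "expenses": [120.0, -15.0, 80.0, None, 60.0],
    "ticket_type": ["VIP", "General", "General", None, "VIP"],
})


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a share of all rows."""
    missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


def out_of_range(frame: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Return the rows where any listed column falls outside its (low, high) bounds."""
    mask = pd.Series(False, index=frame.index)
    for col, (low, high) in bounds.items():
        # Comparisons against NaN evaluate to False, so missing values are not flagged here.
        mask |= (frame[col] < low) | (frame[col] > high)
    return frame[mask]


print(null_summary(toy))

# Assumed plausible ranges, chosen only for this illustration.
print(out_of_range(toy, {"age": (0, 110), "expenses": (0, float("inf"))}))
```

Keeping the two checks separate means a missing value is counted as missing and never reported as an outlier, which keeps the later cleaning decisions easier to reason about.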

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\PC\Desktop\Estudio\Analisis de Datos\Proyectos\Festival Purchase Behavior Analysis\Datasets\festival_dataset_dirty_modified.csv")

# Overview of the dataset. A blank line is printed between outputs so each one is easier to read.

print(df.shape)
print("\n")
print(df.head())
print("\n")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\n")
print(df.describe())
print("\n")
print(df.isnull().sum())
print("\n")
print(df.duplicated().sum())
print("\n")
print(df.nunique())
print("\n")
print(df.columns)
print("\n")

# Select the 'object' (text) and 'category' columns so their unique values can be inspected.
text_columns = df.select_dtypes(include=['object', 'category']).columns

# Print the unique values of each text column. This shows what each column contains
# and helps detect typos or inconsistent entries.
for col in text_columns:
    print(f"\nUnique values of '{col}':")
    print(df[col].unique())
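
Reading long lists of unique values by eye can miss entries that differ only by case or stray spaces. One possible way to surface them is to group the raw values by a normalized form. The sketch below uses a toy column and a hypothetical helper named `variant_report`; the values are invented for illustration and are not from the festival dataset.

```python
import pandas as pd

# Toy column with hypothetical values; some variants differ only by case or spacing.
toy = pd.DataFrame(
    {"ticket_type": ["VIP", "vip ", " General", "General", "Genral", None]}
)


def variant_report(series: pd.Series) -> dict:
    """Map each normalized value (trimmed, lowercased) to its raw spellings,
    keeping only the normalized values that have more than one raw spelling."""
    groups = {}
    for value in series.dropna().astype(str).unique():
        key = value.strip().lower()
        groups.setdefault(key, []).append(value)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


for key, variants in variant_report(toy["ticket_type"]).items():
    print(f"{key!r} appears as: {variants}")
```

This only catches differences in case and surrounding whitespace. A genuine misspelling such as `Genral` normalizes to its own key, so it still has to be found by inspecting the unique values directly.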