# 01. Exploratory Data Analysis (EDA)

This notebook contains an initial exploratory analysis of the dataset.  

The goal is to understand the structure, quality, and basic statistical properties of the data to guide further processing and modeling.

## Objectives

- Load and preview the dataset
- Understand the structure and types of data
- Identify missing values and potential data quality issues
- Generate basic descriptive statistics
- Visualize key distributions and relationships

> 🧭 This notebook serves as the starting point for any data-driven workflow.

## Library Imports

We import the core libraries needed for data analysis and visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization styles
sns.set_theme(style="whitegrid", font_scale=1.2)  # Seaborn theme
plt.style.use('ggplot')  # Applied last, so it overrides overlapping seaborn settings (can be changed)

# Optional: adjust figure aesthetics
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['axes.grid'] = True

# Uncomment to list all available matplotlib styles
# print(plt.style.available)


## Dataset Loading

We load the dataset from a CSV file located in the `data/raw/` folder.  
The loader tries UTF-8 first and falls back to Latin-1 if decoding fails, which avoids the most common read errors.


In [None]:
# Define the relative path to the data file
data_path = '../data/raw/dataset.csv'  # Replace with your actual filename

# Load the dataset, trying UTF-8 first and falling back to Latin-1
df = None  # None indicates that loading failed
for encoding in ('utf-8', 'latin1'):
    try:
        df = pd.read_csv(data_path, encoding=encoding)
        break
    except UnicodeDecodeError:
        continue  # Try the next encoding
    except Exception as e:  # e.g. missing file or malformed CSV
        print(f"Error loading CSV file: {e}")
        break

# Check if the dataset was loaded successfully
if df is not None:
    print(f"✅ Dataset loaded successfully. Shape: {df.shape}")
else:
    print("❌ Failed to load dataset. Please check the file path and encoding.")
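
One caveat of the fallback above is that Latin-1 maps every possible byte to a character, so it never raises a decoding error, even when the file actually uses a different encoding. The sketch below illustrates this on raw bytes. The helper `decode_with_fallback` is a hypothetical name introduced only for this illustration and is not part of pandas.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin1")):
    """Return (text, encoding_used) for the first encoding that decodes `raw`.

    Hypothetical helper for illustration; it mirrors the UTF-8 -> Latin-1
    fallback used when loading the CSV file.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    raise ValueError("None of the candidate encodings could decode the data")


# Valid UTF-8 bytes are decoded by the first candidate
print(decode_with_fallback("Año".encode("utf-8")))

# Latin-1 bytes are not valid UTF-8 here, so the fallback is used
print(decode_with_fallback("Año".encode("latin1")))

# A Windows-1252 euro sign (byte 0x80) also "succeeds" as Latin-1,
# but it decodes to a control character instead of '€'
text, used = decode_with_fallback("€".encode("cp1252"))
print(repr(text), used)
```

If the previewed text shows odd characters after a Latin-1 fallback, it may be worth trying another encoding such as `cp1252` explicitly.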


## First Look at the Dataset

We display the first and last few rows to get an initial sense of the data structure and content.

In [None]:
# Preview the dataset
if df is not None:
    print("🔍 First 5 rows of the dataset:")
    display(df.head())

    print("\n🔎 Last 5 rows of the dataset:")
    display(df.tail())
else:
    print("⚠️ Cannot preview data because the dataset failed to load.")


## Dataset Overview

We inspect general metadata such as column names, data types, and non-null values to understand the dataset structure.

In [None]:
# Display dataset information
if df is not None:
    print("🧾 Dataset Info:")
    df.info()
else:
    print("⚠️ Dataset information is unavailable because it failed to load.")

## Unique Values by Column

Next, we explore the unique values present in each column of the dataset. This helps us understand the types of data, categories, and overall diversity, which is essential for guiding further analysis.


In [None]:
# Function to display unique values for each column in a visually friendly format
def show_unique_values(dataframe, exclude_columns=None):
    if exclude_columns is None:
        exclude_columns = []

    unique_values = {}

    for col in dataframe.columns:
        if col not in exclude_columns:
            values = dataframe[col].dropna().unique()
            if len(values) > 0:
                unique_values[col] = values

    print(f"\nDisplaying unique values for {len(unique_values)} out of {len(dataframe.columns)} columns.\n")

    for col, values in unique_values.items():
        print(f"\n{'=' * 80}\n{col} ({len(values)} unique values)\n{'=' * 80}")

        max_display = 200
        if len(values) > max_display:
            print(f"Showing the first {max_display} values (out of {len(values)} total):\n")
            values_to_show = values[:max_display]
        else:
            values_to_show = values

        for i, val in enumerate(values_to_show):
            print(f"  {i+1}. {val}")

# Apply the function to the dataset
if df is not None:
    show_unique_values(df, exclude_columns=['FECHA_CORTE', 'PER_OCU', 'PER_DECLA', 'PER_UBIC', 'PER_SA', 'EVENTOS'])
else:
    print("⚠️ Cannot display unique values because the dataset failed to load.")
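
Before printing every value, a compact per-column count of distinct values can help decide which columns are worth listing in full and which (identifiers, free text) should be excluded. The following is a minimal sketch on a toy DataFrame. The function name `summarize_cardinality` and the toy columns are assumptions made for this example.

```python
import pandas as pd


def summarize_cardinality(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, distinct values, and missing values per column."""
    summary = pd.DataFrame({
        "dtype": dataframe.dtypes.astype(str),
        "n_unique": dataframe.nunique(dropna=True),  # NaN is not counted as a value
        "n_missing": dataframe.isna().sum(),
    })
    # Share of rows that hold a distinct value; close to 1.0 suggests an identifier
    summary["unique_ratio"] = summary["n_unique"] / len(dataframe)
    return summary.sort_values("n_unique", ascending=False)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "category": ["a", "b", "a", None],
    "flag": [True, True, True, True],
})

cardinality = summarize_cardinality(toy)
print(cardinality)
```

Columns with a `unique_ratio` near 1.0 are natural candidates for the `exclude_columns` argument of `show_unique_values`.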


## Descriptive Statistics

We calculate descriptive statistics to understand the distribution, central tendency, and variability of the dataset's values.


In [None]:
# Calculate descriptive statistics
if df is not None:
    print("Descriptive Statistics:")
    display(df.describe(include='all'))
else:
    print("⚠️ Cannot compute descriptive statistics because the dataset failed to load.")

## Missing Values

We analyze the presence of missing values in the dataset. Understanding which columns contain null entries and their proportions helps identify data quality issues and guides preprocessing decisions.


In [None]:
# Check for missing values
if df is not None:
    # Calculate the number of missing values per column
    null_counts = df.isnull().sum()
    
    # Calculate the percentage of missing values
    null_percentage = (null_counts / len(df)) * 100
    
    # Create a DataFrame with null information
    null_info = pd.DataFrame({
        'Missing Values': null_counts,
        'Percentage (%)': null_percentage
    })
    
    print("Missing Values Analysis:")
    display(null_info)
else:
    print("⚠️ Cannot analyze missing values because the dataset failed to load.")
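
On wide datasets the full table can be hard to scan. One possible refinement, sketched here on toy data, keeps only the columns that contain missing values and sorts them by percentage. The function name `missing_value_report` and its `min_pct` parameter are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd


def missing_value_report(dataframe: pd.DataFrame, min_pct: float = 0.0) -> pd.DataFrame:
    """Return columns whose missing percentage exceeds `min_pct`, worst first."""
    is_missing = dataframe.isna()
    report = pd.DataFrame({
        "Missing Values": is_missing.sum(),
        "Percentage (%)": (is_missing.mean() * 100).round(2),
    })
    # Drop columns at or below the threshold, then sort in descending order
    report = report[report["Percentage (%)"] > min_pct]
    return report.sort_values("Percentage (%)", ascending=False)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

report = missing_value_report(toy)
print(report)
```

Here column `c` is omitted because it has no missing values. Raising `min_pct` narrows the report to the most affected columns.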


## Data Exploration and Visualization

In this section, we analyze the distribution of a selected categorical column: a table with the frequency and percentage of each category, followed by a bar chart of the most frequent ones. Percentages are computed over all rows, so they will not add up to 100% when the column contains missing values.


In [None]:
def analyze_categorical_distribution(dataframe, column_name, top_n=10, title=None):
    """
    Analyze and visualize the distribution of a categorical column.

    Parameters:
    - dataframe: pandas DataFrame
    - column_name: name of the categorical column to analyze
    - top_n: number of top categories to visualize
    - title: custom title for the chart
    """
    if dataframe is None or column_name not in dataframe.columns:
        print(f"⚠️ Column '{column_name}' not found or dataset is not loaded.")
        return

    # Frequency and percentage
    count = dataframe[column_name].value_counts()
    percent = (count / len(dataframe)) * 100

    # Combine into a DataFrame
    summary_df = pd.DataFrame({
        'Frequency': count,
        'Percentage (%)': percent
    })

    print(f"Distribution of '{column_name}':")
    display(summary_df)

    # Plot
    plt.figure(figsize=(14, 8))
    sns.barplot(x=count.values[:top_n], y=count.index[:top_n])
    plt.title(title or f"Top {top_n} Most Frequent Categories in '{column_name}'")
    plt.xlabel('Count')
    plt.ylabel(column_name)
    plt.tight_layout()
    plt.show()

# Example usage:
if df is not None:
    analyze_categorical_distribution(df, column_name='HECHO', top_n=10)
else:
    print("⚠️ Dataset not loaded.")


## Outlier Detection in Count Variables

This step visualizes count variables using boxplots to identify potential outliers.

Boxplots summarize the distribution, showing medians, quartiles, and extreme values that may indicate anomalies.

This helps inform further cleaning before analysis or modeling.

Column names are examples and should be adapted to your dataset.


In [None]:
# Step 1: Define count variables
count_variables = ['Count_Occupied', 'Count_Declared', 'Count_Located', 'Count_Reported', 'EventCount']

# Step 2: Visualize boxplot to detect outliers in count variables
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[count_variables])
plt.title("Boxplot of count variables to detect outliers")
plt.ylabel("Value")
plt.xticks(rotation=45)
plt.grid(True, axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
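
Because the column names above are placeholders, selecting them directly raises a `KeyError` when any of them is absent. A small check like the one sketched below can filter the list down to columns that exist and are numeric before plotting. The helper `select_existing_numeric` and the toy data are assumptions made for this example.

```python
import pandas as pd


def select_existing_numeric(dataframe: pd.DataFrame, candidates):
    """Split `candidates` into usable numeric columns and skipped ones."""
    usable, skipped = [], []
    for col in candidates:
        if col in dataframe.columns and pd.api.types.is_numeric_dtype(dataframe[col]):
            usable.append(col)
        else:
            skipped.append(col)  # Missing from the DataFrame or not numeric
    return usable, skipped


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Count_Occupied": [1, 2, 3],
    "Count_Declared": ["1", "2", "3"],  # Stored as text, so not numeric
})

usable, skipped = select_existing_numeric(
    toy, ["Count_Occupied", "Count_Declared", "EventCount"]
)
print("Usable:", usable)
print("Skipped:", skipped)
```

The `usable` list can then be passed to `sns.boxplot(data=df[usable])` in place of the hard-coded list.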


## Outlier Summary

This table quantifies outliers detected in each count variable using the Interquartile Range (IQR) method. A value is counted as an outlier when it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The table shows the number and percentage of such data points, providing insight into the potential impact of outliers on the dataset.

In [None]:
# Preview the impact of outliers
resumen_outliers = []

for col in count_variables:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    total = len(df)
    num_outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].shape[0]

    resumen_outliers.append({
        'Variable': col,
        'Q1': round(Q1, 2),
        'Q3': round(Q3, 2),
        'IQR': round(IQR, 2),
        'Lower bound': round(lower_bound, 2),
        'Upper bound': round(upper_bound, 2),
        'Outliers detected': num_outliers,
        'Outlier percentage (%)': round((num_outliers / total) * 100, 2)
    })

resumen_df = pd.DataFrame(resumen_outliers)
display(resumen_df)
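
To make the IQR rule concrete, here is a small worked example on a toy series. The helper `iqr_bounds` and its parameter `k` are names introduced for this sketch, and it relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) outlier bounds using the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data: eight ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]

# Q1 = 3 and Q3 = 7, so IQR = 4 and the bounds are 3 - 6 = -3 and 7 + 6 = 13
print(f"Bounds: [{lower}, {upper}]")
print(f"Outliers: {outliers.tolist()}")
print(f"Share of outliers: {len(outliers) / len(values):.2%}")
```

Only the value 100 lies outside the bounds. Count variables are often right-skewed, so a high outlier percentage under this rule does not necessarily indicate data errors.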

In [None]:
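import csv  # Needed for csv.QUOTE_MINIMAL
import os   # Needed for os.linesep

# Export the outlier summary to the processed-data folder
resumen_df.to_csv(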
    path_or_buf='../data/processed/resumen_outliers.csv',
    sep=',',
    na_rep='',
    header=True,
    index=False,
    encoding='utf-8',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator=os.linesep,
    quotechar='"',
    decimal='.',
    errors='strict'
)