# Exploratory Data Analysis (EDA) Notebook for the SII Project

This notebook explores the training and testing datasets of the Severity Impairment Index (SII) project. It follows a fixed sequence of steps to assess the quality, distribution, and relationships of the variables.

## 1. Import Libraries and Initial Configuration

In this section, we import the necessary libraries for analysis and configure visualization options for pandas and matplotlib.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Visualization options
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)
sns.set(style="whitegrid", palette="muted")
%matplotlib inline

## 2. Load Training and Testing Data

We load the `train.csv` and `test.csv` files from the data folder. We display the dimensions and the first few rows of each dataset for initial inspection.

In [2]:
# Define paths to data files
data_dir = r"data"
train_path = f"{data_dir}/train.csv"
test_path = f"{data_dir}/test.csv"

# Load the data
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

# Display dimensions and first rows
print("Dimensions of train:", train.shape)
print("Dimensions of test:", test.shape)
display(train.head())
display(test.head())

## 3. Overview of the Data

We inspect the columns, data types, and basic statistics of both datasets using `.info()` and `.describe()`.

In [3]:
print("General information of the training set:")
train.info()
print("\nGeneral information of the testing set:")
test.info()

print("\nDescriptive statistics of numerical variables (train):")
display(train.describe())

print("\nDescriptive statistics of numerical variables (test):")
display(test.describe())
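
Alongside `.info()`, it can help to check which columns exist in only one of the two datasets. The sketch below is illustrative only: `compare_columns` and the toy frames are hypothetical names, not part of the project, and the toy data merely stands in for `train` and `test`.

```python
import pandas as pd

def compare_columns(train_df, test_df):
    """Return the columns unique to each frame, in their original order."""
    only_train = [c for c in train_df.columns if c not in test_df.columns]
    only_test = [c for c in test_df.columns if c not in train_df.columns]
    return only_train, only_test

# Toy frames standing in for train/test (hypothetical data)
toy_train = pd.DataFrame({"id": [1, 2], "age": [10, 12], "target": [0, 1]})
toy_test = pd.DataFrame({"id": [3], "age": [11]})

only_train, only_test = compare_columns(toy_train, toy_test)
print("Only in train:", only_train)
print("Only in test:", only_test)
```

On the real data, the call would be `compare_columns(train, test)`.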

## 4. Analysis of Missing Values

We calculate and visualize the number and percentage of null values per column in both datasets.

In [4]:
# Function to calculate missing values
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.DataFrame({'Missing Values': mis_val, '% of Total': mis_val_percent})
    mis_val_table = mis_val_table[mis_val_table['Missing Values'] > 0].sort_values('% of Total', ascending=False)
    return mis_val_table

print("Missing values in train:")
display(missing_values_table(train))

print("Missing values in test:")
display(missing_values_table(test))

# Visualization (compute the table once and reuse it for both axes)
missing_train = missing_values_table(train)
plt.figure(figsize=(14,5))
sns.barplot(x=missing_train.index, y=missing_train['% of Total'])
plt.title('Percentage of Missing Values by Column (train)')
plt.xticks(rotation=90)
plt.show()
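
Since both datasets are inspected, a side-by-side comparison of missing percentages may show columns whose missingness differs between them. This is a minimal sketch on toy data; `compare_missing` and the column names `train_%`, `test_%`, and `abs_diff` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def compare_missing(train_df, test_df):
    """Percentage of nulls per column in each frame, sorted by absolute difference."""
    table = pd.DataFrame({
        "train_%": train_df.isnull().mean() * 100,
        "test_%": test_df.isnull().mean() * 100,
    })
    table["abs_diff"] = (table["train_%"] - table["test_%"]).abs()
    return table.sort_values("abs_diff", ascending=False)

# Toy frames standing in for train/test (hypothetical data)
toy_train = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"a": [1.0, 2.0, 3.0, np.nan], "b": [np.nan, np.nan, 1.0, 2.0]})

comparison = compare_missing(toy_train, toy_test)
print(comparison)
```

Columns present in only one frame would show `NaN` in the other frame's column, because pandas aligns the two Series on their index.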

## 5. Distribution of the Target Variable

We analyze the distribution of `PCIAT-PCIAT_Total` with a histogram and a boxplot, then group it into quartiles and plot the size of each group.

In [5]:
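# Remove nulls in the target variable for analysis.
# .copy() yields an independent frame, so adding a column to it later
# does not trigger pandas' SettingWithCopyWarning.
train_obj = train.dropna(subset=['PCIAT-PCIAT_Total']).copy()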

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
sns.histplot(train_obj['PCIAT-PCIAT_Total'], bins=30, kde=True)
plt.title('Distribution of PCIAT-PCIAT_Total')

plt.subplot(1,2,2)
sns.boxplot(y=train_obj['PCIAT-PCIAT_Total'])
plt.title('Boxplot of PCIAT-PCIAT_Total')
plt.show()

# Group into quartiles
train_obj['SII_group'] = pd.qcut(train_obj['PCIAT-PCIAT_Total'], q=4, labels=[0,1,2,3])

plt.figure(figsize=(6,4))
sns.countplot(x='SII_group', data=train_obj)
plt.title('Distribution of SII_group (quartiles)')
plt.show()
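
Quartile binning with `pd.qcut` raises a `ValueError` when several quantile edges coincide, which can happen if many rows share the same score. Whether that applies to `PCIAT-PCIAT_Total` is an assumption, not something established above. The sketch below uses a hypothetical toy series to show how `duplicates="drop"` and `retbins=True` behave in that situation.

```python
import pandas as pd

# Toy scores with many ties at zero (hypothetical data)
scores = pd.Series([0, 0, 0, 0, 0, 5, 12, 20, 31, 40, 55, 80], name="score")

# labels=False returns integer bin codes; duplicates="drop" merges repeated
# edges, so fewer than four groups may come back; retbins=True also returns
# the bin edges that were actually used.
groups, edges = pd.qcut(scores, q=4, labels=False, retbins=True, duplicates="drop")

print("Bin edges:", edges)
print(groups.value_counts().sort_index())
```

Here the minimum and the 25th percentile are both 0, so only three groups remain. Passing four explicit labels would fail in that case, which is why `labels=False` is used in this sketch.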

## 6. Analysis of Numerical Variables

We explore the distribution of the numerical variables through histograms and boxplots, complementing the descriptive statistics from Section 3.

In [6]:
# Select numerical variables (excluding the target variable and identifier)
num_cols = train.select_dtypes(include=[np.number]).columns
num_cols = [col for col in num_cols if col not in ['PCIAT-PCIAT_Total', 'Subject_ID']]

# Histograms
train[num_cols].hist(figsize=(16, 12), bins=30, layout=(int(np.ceil(len(num_cols)/4)), 4))
plt.suptitle('Distribution of Numerical Variables (train)')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Boxplots of some numerical variables
for col in num_cols[:6]:  # Show only the first 6 to avoid clutter
    plt.figure(figsize=(6,2))
    sns.boxplot(x=train[col])
    plt.title(f'Boxplot of {col}')
    plt.show()
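
With many numerical columns, a skewness ranking can suggest which histograms to look at first. The following is a rough sketch on toy data; `rank_by_skewness` is a hypothetical helper, and skewness is only one of several possible summary measures.

```python
import pandas as pd

def rank_by_skewness(df):
    """Sample skewness of each numeric column, ordered by absolute value."""
    skew = df.select_dtypes(include="number").skew()
    order = skew.abs().sort_values(ascending=False).index
    return skew.reindex(order)

# Toy frame: one symmetric column, one right-skewed column (hypothetical data)
toy = pd.DataFrame({"sym": [1, 2, 3, 4, 5], "right": [1, 1, 1, 2, 20]})

skew_rank = rank_by_skewness(toy)
print(skew_rank)
```

On the notebook's data, the equivalent call would be `rank_by_skewness(train[num_cols])`.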

## 7. Analysis of Categorical Variables

We analyze the value frequencies of the categorical variables, such as `Basic_Demos-Sex` and any others present in the data.

In [7]:
# Select categorical variables
cat_cols = train.select_dtypes(include=['object', 'category']).columns

for col in cat_cols:
    print(f"Frequency of values for {col}:")
    display(train[col].value_counts(dropna=False))
    plt.figure(figsize=(5,3))
    sns.countplot(y=col, data=train, order=train[col].value_counts().index)
    plt.title(f'Distribution of {col}')
    plt.show()
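
Object-typed columns can include identifiers with nearly as many distinct values as rows, for which a count plot is unreadable. One possible safeguard is to plot only low-cardinality columns. The helper name `low_cardinality_columns`, the threshold, and the toy frame below are all assumptions made for illustration.

```python
import pandas as pd

def low_cardinality_columns(df, max_unique=20):
    """Names of object/category columns with at most max_unique distinct values."""
    cat = df.select_dtypes(include=["object", "category"])
    return [c for c in cat.columns if cat[c].nunique(dropna=True) <= max_unique]

# Toy frame: an identifier-like column, a season-like column, a numeric column
toy = pd.DataFrame({
    "id": pd.Series(["s1", "s2", "s3", "s4"], dtype=object),
    "season": pd.Series(["Fall", "Fall", "Spring", "Fall"], dtype=object),
    "age": [10, 11, 12, 13],
})

plot_cols = low_cardinality_columns(toy, max_unique=3)
print(plot_cols)
```

The loop over `cat_cols` could then iterate over `low_cardinality_columns(train)` instead.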

## 8. Correlation Between Variables

We calculate the correlation matrix between numerical variables and visualize it with a heatmap to identify relevant relationships.

In [8]:
corr = train[num_cols + ['PCIAT-PCIAT_Total']].corr()

plt.figure(figsize=(14,10))
sns.heatmap(corr, annot=False, cmap='coolwarm', center=0)
plt.title('Correlation Matrix Between Numerical Variables')
plt.show()
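
A heatmap without annotations makes it hard to read off specific pairs. As a complement, the strongest pairwise correlations can be listed from the upper triangle of the matrix. This is a minimal sketch on toy data; `top_correlated_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Largest absolute pairwise correlations, each pair counted once."""
    corr_abs = df.corr().abs()
    # Keep only the strict upper triangle to skip self-pairs and mirrored duplicates
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    pairs = corr_abs.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy frame (hypothetical data): b is almost a multiple of a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

On the notebook's data, the input would be `train[num_cols + ['PCIAT-PCIAT_Total']]`.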

## 9. Visualization of Outliers

We identify and visualize possible outliers with boxplots, focusing on the variables most correlated with the target.

In [9]:
# Visualize outliers in the numerical variables most correlated with the target variable
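# Drop the target's self-correlation by name rather than by position,
# so ties at |r| = 1 cannot leave it in the selection
corr_target = corr['PCIAT-PCIAT_Total'].drop('PCIAT-PCIAT_Total').abs().sort_values(ascending=False)
top_corr_vars = corr_target.index[:6]  # Six variables most correlated with the target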

for col in top_corr_vars:
    plt.figure(figsize=(6,2))
    sns.boxplot(x=train[col])
    plt.title(f'Boxplot of Possible Outliers in {col}')
    plt.show()
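
Boxplots flag points beyond 1.5 times the interquartile range (IQR) from the quartiles, which is seaborn's default whisker setting. The same rule can be applied numerically to count flagged values per column. The sketch below is illustrative; `iqr_outlier_mask` and the toy series are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy series with one extreme value (hypothetical data)
values = pd.Series([10, 11, 12, 13, 14, 15, 100])

mask = iqr_outlier_mask(values)
print("Outliers flagged:", int(mask.sum()))
print(values[mask])
```

For the notebook's variables, `train[top_corr_vars].apply(iqr_outlier_mask).sum()` would give a per-column count.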