# SpatialIQ: Predicting Students' Spatial Intelligence through Machine Learning

## A Comprehensive AI-in-Education Research Study

**Author:** AI Research Developer  
**Date:** November 2025  
**Dataset Citation:** DOI: 10.21227/5qxw-bw66  
**Purpose:** Interpretable prediction and analysis of spatial intelligence in high school students using behavioral, academic, and demographic features

---

## Executive Summary

This notebook presents a rigorous machine learning pipeline designed to predict students' **Spatial Intelligence** levels (ranging from Very Low to Very High) using a rich dataset of 40 behavioral, academic, and demographic features collected from high school students. Spatial intelligence—the ability to visualize, manipulate, and reason about spatial relationships—is a critical component of cognitive development and academic achievement, particularly in STEM disciplines.

### Research Questions:
1. **What behavioral and academic patterns predict spatial intelligence?**
2. **How do demographic factors (gender, parental education, environment) influence spatial reasoning ability?**
3. **Can we build interpretable models that identify actionable insights for educators?**
4. **What ethical considerations arise from AI-driven student cognitive profiling?**

### Methodology Overview:
- **Phase 1:** Comprehensive data exploration and statistical profiling
- **Phase 2:** Intelligent feature engineering with domain knowledge
- **Phase 3:** Competitive modeling with Logistic Regression, Random Forest, XGBoost, and Neural Networks
- **Phase 4:** Deep model interpretation using SHAP explainability methods
- **Phase 5:** Actionable insights and ethical considerations

---

## Dataset Overview

The dataset comprises 398 high school students with 40 features across six categories:
- **Demographics:** Age, Gender, Class size, Environment (Urban/Suburban/Rural)
- **Socioeconomic:** Family size, Parental occupation, Parental education, Income level
- **Academic:** Major, GPA, Study time, Extra classes, Teacher assessment
- **Behavioral:** Internet usage, TV watching, Pattern recognition, Geographic familiarity
- **Gaming Preferences:** Action, Adventure, Strategy, Sport, Simulation, Role-playing, Puzzle games
- **Learning Modes:** Visual vs. Auditory, Map usage, Diagram usage, Experience with GIS

**Target Variable:** Spatial Intelligence (Categorical: VL, L, M, H, VH)


---

# Section 1: Import Required Libraries and Dataset Loading

This section imports the libraries used for data manipulation, visualization, machine learning, and model interpretation: pandas and NumPy for data handling, scikit-learn for preprocessing, classical models, and neural networks (`MLPClassifier`), XGBoost and LightGBM for gradient boosting, and SHAP for model explainability. SciPy and statsmodels supply the statistical tests used during exploratory analysis.

In [None]:
# ==================== LIBRARY IMPORTS ====================
# Core Data Science and Manipulation
import pandas as pd
import numpy as np
from pathlib import Path

# Data Preprocessing and Encoding
from sklearn.preprocessing import StandardScaler, LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Model Selection, Training, and Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier, plot_importance
from lightgbm import LGBMClassifier

# Metrics and Evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    roc_curve, auc, matthews_corrcoef
)

# Statistical Analysis
from scipy.stats import chi2_contingency, spearmanr, pearsonr
# mutual_info_classif lives in scikit-learn, not scipy.stats
from sklearn.feature_selection import mutual_info_classif
from scipy.stats import kurtosis, skew
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Feature Importance and SHAP Interpretability
import shap
from sklearn.inspection import permutation_importance

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Warnings and Configuration
import warnings
warnings.filterwarnings('ignore')

# Configuration for visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ All libraries successfully imported")
print(f"✓ NumPy version: {np.__version__}")
print(f"✓ Pandas version: {pd.__version__}")
import sklearn  # imported here only to report its version
print(f"✓ Scikit-learn version: {sklearn.__version__}")
print(f"✓ SHAP library initialized for model interpretation")

## 1.1: Load and Inspect the Dataset

In this subsection, we load the Dataset.csv file and perform initial exploratory inspection to understand the data structure, dimensions, data types, and preliminary statistical characteristics. This foundational step is critical for identifying potential data quality issues, missing values, and the nature of features we'll be working with throughout the analysis.

In [None]:
# Load the dataset from CSV file
data_path = Path("data/Dataset.csv")
df = pd.read_csv(data_path)

# Display basic information about the dataset
print("=" * 80)
print("DATASET LOADING AND INITIAL INSPECTION")
print("=" * 80)
print(f"\n✓ Dataset successfully loaded from: {data_path}")
print(f"\nDataset Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display column names and data types
print("\n" + "=" * 80)
print("COLUMN INFORMATION AND DATA TYPES")
print("=" * 80)
print("\nColumn Names and Data Types:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# Display the first rows, then summary statistics
print("\n" + "=" * 80)
print("FIRST 10 ROWS OF DATA")
print("=" * 80)
display(df.head(10))

print("\n" + "=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)
display(df.describe().transpose())

# Check for missing values
print("\n" + "=" * 80)
print("MISSING VALUES ANALYSIS")
print("=" * 80)
missing_counts = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing_counts.index,
    'Missing_Count': missing_counts.values,
    'Missing_Percentage': missing_percentage.values
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_df) > 0:
    print(f"\nColumns with missing values ({len(missing_df)} found):")
    display(missing_df)
else:
    print("\n✓ No missing values detected in the dataset!")

# Identify target variable
print("\n" + "=" * 80)
print("TARGET VARIABLE IDENTIFICATION")
print("=" * 80)
print("\nDataset column names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

# The last column is typically the target (Spatial Intelligence level)
target_col = df.columns[-1]
print(f"\n✓ Identified target variable: '{target_col}'")
print(f"Unique values in target: {df[target_col].unique()}")
print(f"Target value counts:\n{df[target_col].value_counts().sort_index()}")

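# Create an ordinal reference mapping for later use.
# sorted() would order the labels alphabetically (H, L, M, VH, VL), so the scale is spelled out
# explicitly; the label spellings are assumed to match the dataset overview (VL, L, M, H, VH).
ordinal_levels = ['VL', 'L', 'M', 'H', 'VH']
observed_levels = set(df[target_col].dropna().unique())
level_order = ordinal_levels if observed_levels <= set(ordinal_levels) else sorted(observed_levels)
target_mapping = {val: idx for idx, val in enumerate(level_order) if val in observed_levels}
print(f"\nTarget ordinal mapping: {target_mapping}")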

---

# Section 2: Exploratory Data Analysis and Statistical Profiling

## 2.1: Distribution Analysis of Spatial Intelligence

Spatial Intelligence is our target variable, classified into five ordinal categories ranging from Very Low (VL) to Very High (VH). Understanding the distribution of this outcome variable is crucial for identifying potential class imbalance issues and informing our choice of evaluation metrics and sampling strategies. We'll visualize this distribution through multiple perspectives: raw counts, percentages, and normalized visualizations.

In [None]:
# Analyze the distribution of the target variable
print("=" * 80)
print("TARGET VARIABLE DISTRIBUTION ANALYSIS")
print("=" * 80)

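target_distribution = df[target_col].value_counts().sort_index()
# sort_index() is alphabetical (H, L, M, VH, VL); reorder to the ordinal scale when those labels are present
ordinal_levels = ['VL', 'L', 'M', 'H', 'VH']
if set(target_distribution.index) == set(ordinal_levels):
    target_distribution = target_distribution.reindex(ordinal_levels)
target_percentages = (target_distribution / len(df)) * 100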

distribution_df = pd.DataFrame({
    'Spatial_Intelligence_Level': target_distribution.index,
    'Count': target_distribution.values,
    'Percentage': target_percentages.values,
    'Cumulative_Percentage': target_percentages.cumsum().values
})

print("\nTarget Variable Distribution:")
display(distribution_df)

# Calculate class balance metrics
class_weights = len(df) / (len(target_distribution) * target_distribution)
print(f"\nClass Balance Weights (for handling imbalance):")
for level, weight in zip(target_distribution.index, class_weights):
    print(f"  {level}: {weight:.4f}")

# Create visualization of target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot with counts
axes[0].bar(target_distribution.index, target_distribution.values, color=sns.color_palette("husl", len(target_distribution)))
axes[0].set_xlabel('Spatial Intelligence Level', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=12, fontweight='bold')
axes[0].set_title('Distribution of Spatial Intelligence (Counts)', fontsize=13, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(target_distribution.values):
    axes[0].text(i, v + 1, str(v), ha='center', fontweight='bold')

# Pie chart with percentages
colors = sns.color_palette("husl", len(target_distribution))
wedges, texts, autotexts = axes[1].pie(
    target_distribution.values,
    labels=target_distribution.index,
    autopct='%1.1f%%',
    colors=colors,
    startangle=90
)
axes[1].set_title('Distribution of Spatial Intelligence (Percentage)', fontsize=13, fontweight='bold')
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')

plt.tight_layout()
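Path('visualizations').mkdir(exist_ok=True)  # the output directory must exist before savefig is called
plt.savefig('visualizations/01_target_distribution.png', dpi=300, bbox_inches='tight')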
plt.show()

print("\n✓ Target distribution visualization saved")
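
The class-balance weights printed in this cell follow the formula `n_samples / (n_classes * class_count)`. As a minimal sketch on made-up labels (the toy values are illustrative and not drawn from the dataset), the same numbers can be obtained from scikit-learn's `compute_class_weight` with the `'balanced'` heuristic, which yields a dictionary that classifiers accepting `class_weight` can consume:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy labels standing in for the target column (illustrative values only)
toy_target = pd.Series(['M'] * 6 + ['H'] * 3 + ['L'] * 2 + ['VH'] * 1)

# Manual computation, mirroring the formula used in the cell above
counts = toy_target.value_counts().sort_index()
manual_weights = len(toy_target) / (len(counts) * counts)

# scikit-learn's 'balanced' heuristic applies the same formula
classes = np.array(sorted(toy_target.unique()))
sk_weights = compute_class_weight(
    class_weight='balanced', classes=classes, y=toy_target.to_numpy()
)
class_weight_dict = {label: float(w) for label, w in zip(classes, sk_weights)}

print(manual_weights.round(3).to_dict())
print(class_weight_dict)
```

Rare levels receive the largest weights, so misclassifying them costs more during training. Whether weighting, resampling, or neither is appropriate depends on how imbalanced the real distribution turns out to be.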

## 2.2: Feature Type Classification and Statistical Profiling

Before conducting deeper analysis, we need to classify features by type (numeric, categorical ordinal, categorical nominal) and understand their individual distributions. This classification will inform our encoding strategies and feature engineering decisions. We'll compute key statistical properties including skewness, kurtosis, and outlier presence for numeric features, and frequency distributions for categorical features.

In [None]:
# Create visualization directory if it doesn't exist
import os
os.makedirs('visualizations', exist_ok=True)

# Separate features from target
X = df.drop(columns=[target_col])
y = df[target_col]

# Classify features by data type
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print("=" * 80)
print("FEATURE TYPE CLASSIFICATION")
print("=" * 80)
print(f"\nNumeric Features ({len(numeric_features)} total):")
for feat in numeric_features:
    print(f"  • {feat}")

print(f"\nCategorical Features ({len(categorical_features)} total):")
for feat in categorical_features:
    print(f"  • {feat}")

print(f"\n✓ Total Features (excluding target): {len(numeric_features) + len(categorical_features)}")

# Statistical profiling of numeric features
print("\n" + "=" * 80)
print("NUMERIC FEATURES: STATISTICAL PROFILING")
print("=" * 80)

numeric_stats = []
for feat in numeric_features:
    stats = {
        'Feature': feat,
        'Mean': X[feat].mean(),
        'Std': X[feat].std(),
        'Min': X[feat].min(),
        'Max': X[feat].max(),
        'Median': X[feat].median(),
        'Skewness': skew(X[feat].dropna()),
        'Kurtosis': kurtosis(X[feat].dropna()),
        'Unique_Values': X[feat].nunique()
    }
    numeric_stats.append(stats)

numeric_stats_df = pd.DataFrame(numeric_stats)
print("\nNumeric Features Summary:")
display(numeric_stats_df.round(4))

# Statistical profiling of categorical features
print("\n" + "=" * 80)
print("CATEGORICAL FEATURES: VALUE DISTRIBUTION")
print("=" * 80)

categorical_stats = []
for feat in categorical_features:
    stats = {
        'Feature': feat,
        'Data_Type': X[feat].dtype,
        'Unique_Values': X[feat].nunique(),
        'Top_Value': X[feat].mode()[0] if len(X[feat].mode()) > 0 else 'N/A',
        'Top_Value_Freq': X[feat].value_counts().iloc[0] if len(X[feat].value_counts()) > 0 else 0,
        'Missing_Values': X[feat].isnull().sum()
    }
    categorical_stats.append(stats)

categorical_stats_df = pd.DataFrame(categorical_stats)
print("\nCategorical Features Summary:")
display(categorical_stats_df)

# Visualize distributions of numeric features
print("\n" + "=" * 80)
print("VISUALIZING NUMERIC FEATURE DISTRIBUTIONS")
print("=" * 80)

n_numeric = len(numeric_features)
n_cols = 4
n_rows = (n_numeric + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 3*n_rows))
axes = axes.flatten()

for idx, feat in enumerate(numeric_features):
    axes[idx].hist(X[feat].dropna(), bins=30, color='steelblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{feat}\n(Skew: {skew(X[feat].dropna()):.2f})', fontweight='bold')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(axis='y', alpha=0.3)

# Hide unused subplots
for idx in range(n_numeric, len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.savefig('visualizations/02_numeric_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Numeric distributions saved")
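
The introduction to this subsection mentions outlier presence, which the profiling table above does not report. One common way to quantify it is the interquartile-range (IQR) fence rule. The sketch below is an assumption about how that check could look, not the notebook's own method; `iqr_outlier_summary` is a hypothetical helper and the toy columns are illustrative:

```python
import numpy as np
import pandas as pd

def iqr_outlier_summary(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    rows = []
    for col in frame.select_dtypes(include=[np.number]).columns:
        values = frame[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_out = int(((values < lower) | (values > upper)).sum())
        rows.append({
            'Feature': col,
            'Lower_Fence': lower,
            'Upper_Fence': upper,
            'Outlier_Count': n_out,
            'Outlier_Pct': 100 * n_out / len(values) if len(values) else 0.0,
        })
    return pd.DataFrame(rows)

# Toy data: one implausible GPA value, no unusual ages
toy = pd.DataFrame({
    'GPA': [3.0, 3.1, 3.2, 3.3, 3.4, 9.9],
    'Age': [15, 16, 16, 17, 17, 18],
})
outlier_df = iqr_outlier_summary(toy)
print(outlier_df)
```

In the notebook this helper could be called as `iqr_outlier_summary(X[numeric_features])`. For features with only a handful of distinct values (for example Likert-style scores), the fence rule tends to be uninformative and is best read alongside the `Unique_Values` column.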

## 2.3: Categorical Feature Distributions and Demographic Insights

Understanding the distribution of categorical features provides crucial insights into the composition of our study population. We'll analyze the frequency distributions across key demographic variables including gender, environment, major, and parental occupation. These distributions will help us understand potential biases in our dataset and inform our data preprocessing decisions.

In [None]:
# Detailed analysis of key categorical features
print("=" * 80)
print("CATEGORICAL FEATURE DETAILED ANALYSIS")
print("=" * 80)

# Select key demographic features for visualization
key_categorical = categorical_features[:8] if len(categorical_features) >= 8 else categorical_features

fig, axes = plt.subplots(2, 4, figsize=(18, 10))
axes = axes.flatten()

for idx, feat in enumerate(key_categorical):
    value_counts = X[feat].value_counts()
    axes[idx].barh(value_counts.index, value_counts.values, color=sns.color_palette("husl", len(value_counts)))
    axes[idx].set_title(f'{feat}\n({len(value_counts)} categories)', fontweight='bold')
    axes[idx].set_xlabel('Frequency')
    axes[idx].grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(value_counts.values):
        axes[idx].text(v + 0.5, i, str(v), va='center', fontweight='bold', fontsize=9)

# Hide unused subplots
for idx in range(len(key_categorical), len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.savefig('visualizations/03_categorical_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Categorical distributions visualization saved")

# Print value counts for each categorical feature
for feat in categorical_features:
    print(f"\n{feat}:")
    print(X[feat].value_counts().to_string())
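
Frequency tables describe the sample but do not show whether a demographic feature is associated with the target. A possible follow-up, sketched here on made-up data, is a chi-square test of independence with Cramér's V as an effect size. The helper name `cramers_v` and the toy columns are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(feature: pd.Series, target: pd.Series) -> dict:
    """Chi-square test of independence plus Cramér's V for two categorical series."""
    table = pd.crosstab(feature, target)
    # Note: SciPy applies Yates' continuity correction by default for 2x2 tables
    chi2, p_value, dof, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    v = float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else float('nan')
    return {'chi2': float(chi2), 'p_value': float(p_value), 'dof': int(dof), 'cramers_v': v}

# Toy example: environment is strongly associated with the level by construction
toy = pd.DataFrame({
    'Environment': ['Urban'] * 20 + ['Rural'] * 20,
    'Level': ['H'] * 15 + ['L'] * 5 + ['H'] * 5 + ['L'] * 15,
})
result = cramers_v(toy['Environment'], toy['Level'])
print(result)
```

With roughly 400 students spread over five target levels, some contingency cells may have small expected counts, in which case the chi-square approximation becomes unreliable and the p-values should be interpreted cautiously.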

---

# Section 3: Data Preprocessing and Intelligent Encoding

## 3.1: Strategic Encoding Framework

This section implements a sophisticated encoding strategy that respects the underlying structure of different feature types:

- **Ordinal Encoding:** Applied to features with inherent order (e.g., Spatial Intelligence levels VL→L→M→H→VH, Education levels)
- **One-Hot Encoding:** Applied to nominal categorical features without meaningful order (e.g., gender, environment, major)
- **Standardization:** Applied to numeric features to ensure comparable scales for algorithms sensitive to feature magnitude

Matching the encoding to each feature type preserves ordinal information where it exists and avoids imposing a false order on nominal categories.

In [None]:
# Define encoding strategy for features
print("=" * 80)
print("ENCODING STRATEGY DEFINITION")
print("=" * 80)

# Ordinal features with explicit ordering (knowledge-based)
# These are features that have a natural hierarchy
ordinal_features_mapping = {
    # Education levels (if present) - typically: Unemployment < High School < Bachelor < Master < PhD
    # Spatial intelligence related orderings will be handled when found
    # Identify by examining unique values
}

# Binary features (Yes/No, 0/1) - can be left as-is or encoded
binary_features = []

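# Examine categorical features to separate numeric-like columns from nominal ones
print("\nExamining categorical features for ordinal structure:")
for feat in categorical_features:
    # Drop NaN and sort by string form: sorted() raises TypeError on mixed NaN/str values
    unique_vals = sorted(X[feat].dropna().unique(), key=str)
    print(f"  {feat}: {unique_vals}")
    
    # Object-dtype columns whose values are all numbers are treated as binary/numeric-like
    if unique_vals and all(isinstance(v, (int, float)) for v in unique_vals):
        binary_features.append(feat)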

print(f"\n✓ Identified {len(binary_features)} binary/numeric-like categorical features")

# Prepare data copy for preprocessing
X_processed = X.copy()
y_processed = y.copy()

# Handle binary/numeric categorical features - convert to numeric if already numeric
print("\n" + "=" * 80)
print("ENCODING IMPLEMENTATION")
print("=" * 80)

for feat in binary_features:
    X_processed[feat] = pd.to_numeric(X_processed[feat], errors='coerce').fillna(X_processed[feat].mode()[0])

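# Encode target variable using OrdinalEncoder (preserving order VL < L < M < H < VH).
# LabelEncoder sorts labels alphabetically (H, L, M, VH, VL), which would scramble the scale,
# so the category order is passed explicitly; label spellings are assumed from the dataset overview.
target_order = ['VL', 'L', 'M', 'H', 'VH']
print(f"\nTarget variable ordinal encoding: {target_order}")

# The variable name le_target is kept, although it now holds an OrdinalEncoder
le_target = OrdinalEncoder(categories=[target_order])
y_processed = pd.Series(
    le_target.fit_transform(y_processed.to_frame()).ravel().astype(int),
    index=y_processed.index
)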

# Separate remaining categorical features for One-Hot Encoding
nominal_categorical = [f for f in categorical_features if f not in binary_features]

print(f"\nFeatures to be One-Hot Encoded ({len(nominal_categorical)} total):")
for feat in nominal_categorical:
    print(f"  • {feat}")

# Apply One-Hot Encoding to nominal categorical features
X_encoded = pd.get_dummies(X_processed, columns=nominal_categorical, drop_first=False, dtype=int)

print(f"\n✓ Encoding complete!")
print(f"Original features: {X_processed.shape[1]}")
print(f"Features after encoding: {X_encoded.shape[1]}")
print(f"New feature columns added: {X_encoded.shape[1] - X_processed.shape[1]}")

# Display encoded feature names
print("\nEncoded features:")
encoded_cols = X_encoded.columns.tolist()
for i, col in enumerate(encoded_cols, 1):
    print(f"{i:3d}. {col}")

# Standardize numeric features
print("\n" + "=" * 80)
print("NUMERIC FEATURE STANDARDIZATION")
print("=" * 80)

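# Scale only the originally numeric columns; the 0/1 one-hot indicators are left unscaled
numeric_features_encoded = [c for c in numeric_features + binary_features if c in X_encoded.columns]
scaler = StandardScaler()
X_scaled = X_encoded.copy()
# Note: fitting on the full dataset is acceptable for exploration, but for model evaluation the
# scaler should be fit on the training split only so that test-set statistics do not leak in
X_scaled[numeric_features_encoded] = scaler.fit_transform(X_encoded[numeric_features_encoded])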

print(f"\n✓ Standardized {len(numeric_features_encoded)} numeric features")
print(f"  Using StandardScaler (mean=0, std=1)")

# Display summary of preprocessing
print("\n" + "=" * 80)
print("PREPROCESSING SUMMARY")
print("=" * 80)
print(f"\nOriginal dataset shape: {df.shape}")
print(f"Processed dataset shape: {X_scaled.shape}")
print(f"Target variable shape: {y_processed.shape}")
print(f"\n✓ Data preprocessing completed successfully!")

# Show sample of processed data
print("\nSample of processed data (first 5 rows):")
display(X_scaled.head())

---

# Section 4: Feature Engineering and Dimensionality Reduction

## 4.1: Domain-Driven Feature Engineering

In this section, we create derived features that capture meaningful combinations of existing variables. The choices are informed by educational psychology and cognitive science, and each engineered feature is intended to represent a latent concept that plausibly influences the development of spatial intelligence.

### Engineered Features:
1. **Study Efficiency:** Ratio of study time to academic performance (GPA)
2. **Academic Support Index:** Combination of extra classes and parental education level
3. **Gaming Engagement:** Aggregated preference across multiple game genres
4. **Visual Learning Orientation:** Combined score of map and diagram usage
5. **Digital Lifestyle:** Combined internet and gaming engagement
6. **Family Education:** Average of parental education levels
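
The six definitions above can be sketched on a toy frame as follows. Every column name here is a hypothetical placeholder (the real dataset's column names may differ), and simple sums are used as the "combined" scores, which is only one of several reasonable aggregation choices:

```python
import numpy as np
import pandas as pd

# Toy frame; all column names are hypothetical placeholders
toy = pd.DataFrame({
    'Study_Time': [2.0, 4.0, 0.0],
    'GPA': [3.0, 3.6, 2.5],
    'Extra_Classes': [1, 0, 1],
    'Father_Education': [2, 3, 1],
    'Mother_Education': [3, 3, 2],
    'Game_Action': [1, 0, 1],
    'Game_Strategy': [1, 1, 0],
    'Game_Puzzle': [0, 1, 1],
    'Map_Usage': [2, 3, 1],
    'Diagram_Usage': [3, 3, 0],
    'Internet_Usage': [4, 2, 5],
})

game_cols = [c for c in toy.columns if c.startswith('Game_')]
eng = pd.DataFrame(index=toy.index)

# 6. Family Education: average of the two parental education levels
eng['Family_Education'] = toy[['Father_Education', 'Mother_Education']].mean(axis=1)
# 1. Study Efficiency: study time relative to GPA (a zero GPA becomes NaN instead of inf)
eng['Study_Efficiency'] = toy['Study_Time'] / toy['GPA'].replace(0, np.nan)
# 2. Academic Support Index: extra classes combined with family education
eng['Academic_Support_Index'] = toy['Extra_Classes'] + eng['Family_Education']
# 3. Gaming Engagement: aggregate preference across game genres
eng['Gaming_Engagement'] = toy[game_cols].sum(axis=1)
# 4. Visual Learning Orientation: map usage plus diagram usage
eng['Visual_Learning_Orientation'] = toy['Map_Usage'] + toy['Diagram_Usage']
# 5. Digital Lifestyle: internet usage plus gaming engagement
eng['Digital_Lifestyle'] = toy['Internet_Usage'] + eng['Gaming_Engagement']

print(eng.round(3))
```

Because the summed components sit on different scales, standardizing them before combining (or after, as part of the preprocessing step) may give each component a more even contribution.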