# 🎯 Feature Engineering for Academic Risk Prediction

## 📋 Overview
This notebook implements comprehensive feature engineering for academic risk prediction using:
- **Graph embeddings** (FastRP) for students and courses
- **Community detection** (Louvain) for academic clusters
- **Academic features** (prerequisites, terms, departments, faculty)
- **Multiclass regression** target (GPA scale 0.0-4.0)

## 🎯 Target: Predict GPA (multiclass regression)
- A=4.0, A-=3.7, B+=3.3, B=3.0, B-=2.7
- C+=2.3, C=2.0, C-=1.7, D+=1.3, D=1.0, F=0.0

## ✅ Requirements Handled:
- ✅ **Duplicates**: Removed based on student_id + course_id, keeping highest GPA
- ✅ **Null Values**: Median imputation for numeric, mode/Unknown for categorical
- ✅ **One-Hot Encoding**: Low cardinality categorical features (≤10 unique values)
- ✅ **Label Encoding**: Medium/high cardinality categorical features (≤50 unique values)
- ✅ **Prerequisites**: Count, success rate, completion tracking
- ✅ **Terms**: Fall/Spring, year, semester information
- ✅ **Course Levels**: Undergraduate/graduate level tracking
- ✅ **Graph Intelligence**: FastRP embeddings + Louvain communities


In [1]:
# Cell 1: Setup and Neo4j Connection
import os
from neo4j import GraphDatabase
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Neo4j Configuration
NEO4J_URI = "bolt://127.0.0.1:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "Iwin@27100"
NEO4J_DB = "neo4j"
GDS_GRAPH_NAME = "umbc_graph"

# Initialize driver
try:
    driver
    try:
        driver.verify_connectivity()
    except Exception:
        try:
            driver.close()
        except Exception:
            pass
        driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
except NameError:
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

os.makedirs("../data", exist_ok=True)
print("✅ Neo4j connection established and data directory created")


✅ Neo4j connection established and data directory created


In [2]:
# Cell 2: GDS Plugin Check
with driver.session(database=NEO4J_DB) as session:
    try:
        proc_names = session.run("SHOW PROCEDURES YIELD name RETURN name").value()
    except Exception:
        proc_names = []

have_gds = any(str(n).startswith("gds.") for n in proc_names)

if not have_gds:
    with driver.session(database=NEO4J_DB) as session:
        try:
            _ = session.run("CALL gds.version() YIELD version RETURN version").single()
            have_gds = True
        except Exception:
            have_gds = False

print(f"GDS available: {have_gds}")
if not have_gds:
    print("❌ Graph Data Science plugin not available. Please install it in Neo4j Desktop.")
else:
    print("✅ Graph Data Science plugin is available!")


GDS available: True
✅ Graph Data Science plugin is available!


In [4]:
# Cell 3: Dataset Verification
with driver.session(database=NEO4J_DB) as session:
    student_count = session.run("MATCH (s:Student) RETURN count(s) as count").single()["count"]
    course_count = session.run("MATCH (c:Course) RETURN count(c) as count").single()["count"]
    completed_count = session.run("MATCH ()-[r:COMPLETED]->() RETURN count(r) as count").single()["count"]
    enrolled_count = session.run("MATCH ()-[r:ENROLLED_IN]->() RETURN count(r) as count").single()["count"]
    
    # Check for other node types
    other_nodes = session.run("""
        MATCH (n)
        WHERE NOT n:Student AND NOT n:Course
        RETURN labels(n)[0] as node_type, count(n) as count
        ORDER BY count DESC
    """).data()
    
print("📊 DATASET VERIFICATION:")
print(f"   Students: {student_count:,}")
print(f"   Courses: {course_count:,}")
print(f"   Completed Records: {completed_count:,}")
print(f"   Enrolled Records: {enrolled_count:,}")
print(f"   Expected Projection Nodes: {student_count + course_count:,}")

if other_nodes:
    print(f"\n🔍 Other node types in database:")
    for record in other_nodes:
        print(f"   {record['node_type']}: {record['count']:,}")


📊 DATASET VERIFICATION:
   Students: 500
   Courses: 100
   Completed Records: 4,102
   Enrolled Records: 1,068
   Expected Projection Nodes: 600

🔍 Other node types in database:
   Faculty: 30
   Term: 12
   RequirementGroup: 12
   Degree: 4


In [5]:
# Cell 4: GDS Graph Projection
if have_gds:
    with driver.session(database=NEO4J_DB) as session:
        # Check if graph exists
        exists_result = session.run("CALL gds.graph.exists($name) YIELD exists RETURN exists", {"name": GDS_GRAPH_NAME})
        exists = exists_result.single()["exists"]
        
        if exists:
            print("Dropping existing projection...")
            session.run(f"CALL gds.graph.drop('{GDS_GRAPH_NAME}')")
            print("✅ Dropped existing projection")
        else:
            print("No existing projection found")
        
        # Create projection
        print("Creating graph projection...")
        result = session.run(f"""
        CALL gds.graph.project('{GDS_GRAPH_NAME}',
          ['Student','Course'],
          {{
            COMPLETED: {{type: 'COMPLETED', orientation: 'UNDIRECTED'}},
            ENROLLED_IN: {{type: 'ENROLLED_IN', orientation: 'UNDIRECTED'}}
          }})
        YIELD graphName, nodeCount, relationshipCount
        RETURN graphName, nodeCount, relationshipCount
        """)
        
        projection_info = result.single()
        print(f"✅ Graph projection created: {projection_info['nodeCount']:,} nodes, {projection_info['relationshipCount']:,} relationships")
else:
    print("❌ Skipping projection: GDS not available")




Dropping existing projection...
✅ Dropped existing projection
Creating graph projection...
✅ Graph projection created: 600 nodes, 10,340 relationships


In [6]:
# Cell 5: FastRP Embeddings
if have_gds:
    with driver.session(database=NEO4J_DB) as session:
        print("🚀 Generating FastRP embeddings...")
        
        fastrp_result = session.run(f"""
        CALL gds.fastRP.write('{GDS_GRAPH_NAME}', {{ 
            writeProperty: 'fastRP_embedding', 
            embeddingDimension: 64,
            iterationWeights: [0.0, 1.0],
            nodeSelfInfluence: 1.0,
            normalizationStrength: 0.05
        }})
        YIELD nodeCount, nodePropertiesWritten
        RETURN nodeCount, nodePropertiesWritten
        """)
        
        fastrp_info = fastrp_result.single()
        print(f"✅ FastRP completed: {fastrp_info['nodeCount']:,} nodes, {fastrp_info['nodePropertiesWritten']:,} properties")
        print(f"   Embedding dimension: 64")
        print(f"   Each node now has a 64-dimensional vector representation")
else:
    print("❌ Skipping FastRP: GDS not available")


🚀 Generating FastRP embeddings...
✅ FastRP completed: 600 nodes, 600 properties
   Embedding dimension: 64
   Each node now has a 64-dimensional vector representation


In [7]:
# Cell 6: Louvain Community Detection
if have_gds:
    with driver.session(database=NEO4J_DB) as session:
        print("🚀 Running Louvain community detection...")
        
        louvain_result = session.run(f"""
        CALL gds.louvain.write('{GDS_GRAPH_NAME}', {{ 
            writeProperty: 'louvain_community',
            maxIterations: 10,
            tolerance: 0.0001
        }})
        YIELD communityCount, modularity
        RETURN communityCount, modularity
        """)
        
        louvain_info = louvain_result.single()
        print(f"✅ Louvain completed: {louvain_info['communityCount']:,} communities, modularity: {louvain_info['modularity']:.4f}")
        print(f"   Each node now has a community ID")
        print(f"   Modularity score: {louvain_info['modularity']:.4f} (higher = better clustering)")
else:
    print("❌ Skipping Louvain: GDS not available")


🚀 Running Louvain community detection...
✅ Louvain completed: 52 communities, modularity: 0.1529
   Each node now has a community ID
   Modularity score: 0.1529 (higher = better clustering)


In [8]:
# Cell 7: Grade Mapping Function
def grade_to_gpa(g):
    """Convert letter grades to GPA scale (multiclass regression)"""
    if g is None:
        return None
    g = str(g).strip().upper()
    
    grade_map = {
        'A': 4.0, 'A-': 3.7, 'B+': 3.3, 'B': 3.0, 'B-': 2.7,
        'C+': 2.3, 'C': 2.0, 'C-': 1.7, 'D+': 1.3, 'D': 1.0, 'F': 0.0
    }
    
    return grade_map.get(g, None)

print("✅ Grade mapping function defined")
print("📊 GPA Scale Mapping:")
print("   A=4.0, A-=3.7, B+=3.3, B=3.0, B-=2.7")
print("   C+=2.3, C=2.0, C-=1.7, D+=1.3, D=1.0, F=0.0")
print("   Target: Multiclass regression (0.0-4.0)")


✅ Grade mapping function defined
📊 GPA Scale Mapping:
   A=4.0, A-=3.7, B+=3.3, B=3.0, B-=2.7
   C+=2.3, C=2.0, C-=1.7, D+=1.3, D=1.0, F=0.0
   Target: Multiclass regression (0.0-4.0)


In [9]:
# Cell 8: Comprehensive Feature Extraction
print("🚀 EXTRACTING COMPREHENSIVE FEATURES")
print("=" * 50)

with driver.session(database=NEO4J_DB) as session:
    comprehensive_query = """
    MATCH (s:Student)-[r:COMPLETED]->(c:Course)
    WHERE s.fastRP_embedding IS NOT NULL AND c.fastRP_embedding IS NOT NULL
    
    // Student features
    OPTIONAL MATCH (s)-[:ENROLLED_IN]->(dept:Department)
    
    // Course features
    OPTIONAL MATCH (c)-[:BELONGS_TO]->(course_dept:Department)
    OPTIONAL MATCH (c)-[:TAUGHT_BY]->(f:Faculty)
    OPTIONAL MATCH (c)-[:IN_TERM]->(t:Term)
    
    // Prerequisite analysis
    OPTIONAL MATCH (c)-[:PREREQUISITE]->(prereq:Course)
    WITH s, r, c, dept, course_dept, f, t, count(prereq) as prereq_count
    
    // Student's prerequisite performance
    OPTIONAL MATCH (s)-[prev_r:COMPLETED]->(prereq:Course)
    WHERE (c)-[:PREREQUISITE]->(prereq)
    WITH s, r, c, dept, course_dept, f, t, prereq_count,
         avg(CASE WHEN prev_r.grade IN ['A', 'A-', 'B+', 'B', 'B-'] THEN 1 ELSE 0 END) as prereq_success_rate,
         count(prev_r) as completed_prereqs
    
    // Student's overall performance
    OPTIONAL MATCH (s)-[overall_r:COMPLETED]->(any_course:Course)
    WITH s, r, c, dept, course_dept, f, t, prereq_count, prereq_success_rate, completed_prereqs,
         avg(CASE WHEN overall_r.grade IN ['A', 'A-', 'B+', 'B', 'B-'] THEN 1 ELSE 0 END) as student_overall_success_rate,
         count(overall_r) as student_total_courses
    
    // Course difficulty
    OPTIONAL MATCH (any_student:Student)-[course_r:COMPLETED]->(c)
    WITH s, r, c, dept, course_dept, f, t, prereq_count, prereq_success_rate, completed_prereqs,
         student_overall_success_rate, student_total_courses,
         avg(CASE WHEN course_r.grade IN ['A', 'A-', 'B+', 'B', 'B-'] THEN 1 ELSE 0 END) as course_success_rate,
         count(course_r) as course_total_students
    
    RETURN 
        s.id AS student_id, c.id AS course_id, r.grade AS grade,
        s.fastRP_embedding AS student_embedding, c.fastRP_embedding AS course_embedding,
        s.louvain_community AS student_community, c.louvain_community AS course_community,
        dept.name AS student_department, student_overall_success_rate, student_total_courses,
        course_dept.name AS course_department, c.level AS course_level, c.credits AS course_credits,
        course_success_rate, course_total_students, prereq_count, prereq_success_rate, completed_prereqs,
        t.name AS term_name, t.year AS term_year, t.semester AS term_semester,
        f.name AS faculty_name, f.department AS faculty_department
    """
    
    print("Extracting comprehensive features...")
    comprehensive_rows = session.run(comprehensive_query).data()
    print(f"✅ Extracted {len(comprehensive_rows):,} comprehensive records")

# Convert to DataFrame
df_comprehensive = pd.DataFrame(comprehensive_rows)
print(f"📊 Comprehensive dataset: {df_comprehensive.shape[0]:,} records, {df_comprehensive.shape[1]} features")


🚀 EXTRACTING COMPREHENSIVE FEATURES
Extracting comprehensive features...




✅ Extracted 4,102 comprehensive records
📊 Comprehensive dataset: 4,102 records, 23 features


In [10]:
# Cell 9: Apply GPA Conversion and Data Overview
print("🔄 APPLYING GPA CONVERSION AND DATA OVERVIEW")
print("=" * 50)

# Apply GPA conversion
df_comprehensive['gpa'] = df_comprehensive['grade'].apply(grade_to_gpa)

# Remove rows with missing GPA
before_count = len(df_comprehensive)
df_comprehensive = df_comprehensive.dropna(subset=['gpa'])
after_count = len(df_comprehensive)

print(f"📊 GPA Conversion Results:")
print(f"   Records before: {before_count:,}")
print(f"   Records after: {after_count:,}")
print(f"   Removed: {before_count - after_count:,} records with invalid grades")

# Show data overview
print(f"\n📊 Dataset Overview:")
print(f"   Shape: {df_comprehensive.shape}")
print(f"   Columns: {list(df_comprehensive.columns)}")

# Show GPA distribution
print(f"\n📊 GPA Distribution:")
gpa_dist = df_comprehensive['gpa'].value_counts().sort_index()
for gpa, count in gpa_dist.items():
    print(f"   {gpa}: {count:,} records")

# Show missing values
print(f"\n🔍 Missing Values Analysis:")
missing_analysis = df_comprehensive.isnull().sum()
missing_cols = missing_analysis[missing_analysis > 0].sort_values(ascending=False)
if len(missing_cols) > 0:
    print("   Columns with missing values:")
    for col, count in missing_cols.head(10).items():
        print(f"     {col}: {count:,} ({count/len(df_comprehensive)*100:.1f}%)")
else:
    print("   ✅ No missing values found!")


🔄 APPLYING GPA CONVERSION AND DATA OVERVIEW
📊 GPA Conversion Results:
   Records before: 4,102
   Records after: 4,072
   Removed: 30 records with invalid grades

📊 Dataset Overview:
   Shape: (4072, 24)
   Columns: ['student_id', 'course_id', 'grade', 'student_embedding', 'course_embedding', 'student_community', 'course_community', 'student_department', 'student_overall_success_rate', 'student_total_courses', 'course_department', 'course_level', 'course_credits', 'course_success_rate', 'course_total_students', 'prereq_count', 'prereq_success_rate', 'completed_prereqs', 'term_name', 'term_year', 'term_semester', 'faculty_name', 'faculty_department', 'gpa']

📊 GPA Distribution:
   0.0: 47 records
   1.0: 86 records
   1.3: 144 records
   1.7: 193 records
   2.0: 309 records
   2.3: 406 records
   2.7: 412 records
   3.0: 576 records
   3.3: 654 records
   3.7: 618 records
   4.0: 627 records

🔍 Missing Values Analysis:
   Columns with missing values:
     student_department: 4,072 (100.

In [11]:
# Cell 10: Data Cleaning and Missing Value Handling
print("🧹 DATA CLEANING AND MISSING VALUE HANDLING")
print("=" * 50)

# Remove columns that are 100% missing (not useful for ML)
print("🔍 Removing columns with 100% missing values...")
missing_analysis = df_comprehensive.isnull().sum()
cols_to_remove = missing_analysis[missing_analysis == len(df_comprehensive)].index.tolist()

if cols_to_remove:
    print(f"   Removing columns: {cols_to_remove}")
    df_comprehensive = df_comprehensive.drop(columns=cols_to_remove)
    print(f"   ✅ Removed {len(cols_to_remove)} columns")
else:
    print("   ✅ No columns with 100% missing values")

# Handle remaining missing values
print(f"\n🔧 Handling remaining missing values...")

# Fill numerical columns with median
numerical_cols = df_comprehensive.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    if col != 'gpa' and df_comprehensive[col].isna().sum() > 0:
        median_val = df_comprehensive[col].median()
        if pd.isna(median_val):  # If median is also NaN, use 0
            median_val = 0
        df_comprehensive[col].fillna(median_val, inplace=True)
        print(f"   {col}: Filled with median ({median_val:.4f})")

# Fill categorical columns with mode or 'Unknown'
categorical_cols = df_comprehensive.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if col not in ['student_id', 'course_id', 'grade'] and df_comprehensive[col].isna().sum() > 0:
        mode_val = df_comprehensive[col].mode()
        if len(mode_val) > 0 and not pd.isna(mode_val[0]):
            df_comprehensive[col].fillna(mode_val[0], inplace=True)
            print(f"   {col}: Filled with mode ('{mode_val[0]}')")
        else:
            df_comprehensive[col].fillna('Unknown', inplace=True)
            print(f"   {col}: Filled with 'Unknown'")

# Final verification
print(f"\n✅ CLEANING COMPLETE:")
print(f"   Final shape: {df_comprehensive.shape}")
print(f"   Remaining missing values: {df_comprehensive.isnull().sum().sum()}")

if df_comprehensive.isnull().sum().sum() == 0:
    print("   🎯 All missing values handled!")
else:
    print("   ⚠️ Still have missing values - investigating...")
    remaining_missing = df_comprehensive.isnull().sum()
    remaining_missing = remaining_missing[remaining_missing > 0]
    for col, count in remaining_missing.items():
        print(f"     {col}: {count} missing values")


🧹 DATA CLEANING AND MISSING VALUE HANDLING
🔍 Removing columns with 100% missing values...
   Removing columns: ['student_department', 'course_department', 'term_name', 'term_year', 'term_semester', 'faculty_name', 'faculty_department']
   ✅ Removed 7 columns

🔧 Handling remaining missing values...

✅ CLEANING COMPLETE:
   Final shape: (4072, 17)
   Remaining missing values: 0
   🎯 All missing values handled!


In [None]:
# Cell 11: Expand Embeddings and Prepare Final Dataset
print("🔄 EXPANDING EMBEDDINGS AND PREPARING FINAL DATASET")
print("=" * 60)

# Skip embeddings - focus on academic features only
print("📊 SKIPPING EMBEDDINGS - Using Academic Features Only")
print("   🎯 Strategy: Remove high-dimensional embeddings to prevent overfitting")
print("   📊 Focus: Use meaningful academic features for grade prediction")

# Prepare final dataset without embeddings
print(f"\n🔧 Preparing final dataset...")

# Remove embedding columns completely
df_final = df_comprehensive.drop(columns=['student_embedding', 'course_embedding'], errors='ignore')

print(f"   ✅ Removed embedding features")
print(f"   ✅ Using only academic and graph community features")

print(f"✅ Final dataset prepared:")
print(f"   Shape: {df_final.shape}")
print(f"   Features: {df_final.shape[1]}")
print(f"   Records: {df_final.shape[0]:,}")

# Show feature breakdown
print(f"\n📊 Feature Breakdown:")
print(f"   Identifiers: 2 (student_id, course_id)")
print(f"   Target: 1 (gpa)")
print(f"   Graph features: 2 (student_community, course_community)")
print(f"   Academic features: {df_final.shape[1] - 5 - len(student_emb_df.columns) - len(course_emb_df.columns)}")
print(f"   Student embeddings: {len(student_emb_df.columns)}")
print(f"   Course embeddings: {len(course_emb_df.columns)}")
print(f"   Total features: {df_final.shape[1]}")


🔄 EXPANDING EMBEDDINGS AND PREPARING FINAL DATASET
📊 Expanding student embeddings...
   ✅ Student embeddings: 64 features
📊 Expanding course embeddings...
   ✅ Course embeddings: 64 features

🔧 Preparing final dataset...
✅ Final dataset prepared:
   Shape: (4102, 143)
   Features: 143
   Records: 4,102

📊 Feature Breakdown:
   Identifiers: 2 (student_id, course_id)
   Target: 1 (gpa)
   Graph features: 2 (student_community, course_community)
   Academic features: 10
   Student embeddings: 64
   Course embeddings: 64
   Total features: 143


In [13]:
# Cell 12: Final Train/Test Split and Data Saving
print("🎯 FINAL TRAIN/TEST SPLIT AND DATA SAVING")
print("=" * 50)

# Prepare features and target
feature_columns = [col for col in df_final.columns if col not in ['student_id', 'course_id', 'gpa']]
X = df_final[feature_columns].copy()
y = df_final['gpa'].copy()

print(f"📊 Feature matrix: {X.shape}")
print(f"📊 Target vector: {y.shape}")

# Check for any remaining NaN values
if X.isna().sum().sum() > 0 or y.isna().sum() > 0:
    print("⚠️ Found NaN values, force-filling...")
    X = X.fillna(0)
    y = y.fillna(y.median())
    print("✅ NaN values handled")

# Create bins for stratified splitting
y_binned = pd.cut(y, bins=5, labels=['Very_Low', 'Low', 'Medium', 'High', 'Very_High'])
bin_counts = y_binned.value_counts()
min_bin_count = bin_counts.min()

print(f"\n📊 Bin distribution for stratified split:")
for bin_name, count in bin_counts.items():
    print(f"   {bin_name}: {count} samples")

# Perform train/test split
if min_bin_count >= 2:
    print(f"✅ Using stratified split (min bin count: {min_bin_count})")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y_binned
    )
else:
    print(f"⚠️ Using random split (min bin count: {min_bin_count} < 2)")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

print(f"\n📊 Split Results:")
print(f"   Train set: {X_train.shape[0]:,} samples, {X_train.shape[1]} features")
print(f"   Test set: {X_test.shape[0]:,} samples, {X_test.shape[1]} features")

# Apply feature scaling (excluding embeddings)
embedding_cols = [col for col in X.columns if 'emb_' in col]
scaling_cols = [col for col in X.select_dtypes(include=[np.number]).columns if col not in embedding_cols]

if len(scaling_cols) > 0:
    print(f"\n🔧 Applying StandardScaler to {len(scaling_cols)} features...")
    scaler = StandardScaler()
    X_train_scaled = X_train.copy()
    X_test_scaled = X_test.copy()
    
    X_train_scaled[scaling_cols] = scaler.fit_transform(X_train[scaling_cols])
    X_test_scaled[scaling_cols] = scaler.transform(X_test[scaling_cols])
    
    print("✅ Feature scaling applied")
else:
    X_train_scaled = X_train.copy()
    X_test_scaled = X_test.copy()
    scaler = None
    print("ℹ️ No features needed scaling")

# Create final datasets with identifiers
train_final = pd.concat([
    df_final.loc[X_train.index, ['student_id', 'course_id']].reset_index(drop=True),
    X_train_scaled.reset_index(drop=True),
    y_train.reset_index(drop=True)
], axis=1)

test_final = pd.concat([
    df_final.loc[X_test.index, ['student_id', 'course_id']].reset_index(drop=True),
    X_test_scaled.reset_index(drop=True),
    y_test.reset_index(drop=True)
], axis=1)

# Save datasets
train_path = "../data/train_processed_comprehensive.csv"
test_path = "../data/test_processed_comprehensive.csv"

train_final.to_csv(train_path, index=False)
test_final.to_csv(test_path, index=False)

print(f"\n💾 Datasets saved:")
print(f"   Training: {train_path}")
print(f"   Testing: {test_path}")

# Final summary
print(f"\n🎯 FEATURE ENGINEERING COMPLETE!")
print(f"📊 Final Summary:")
print(f"   Original records: 4,102")
print(f"   Final training records: {len(train_final):,}")
print(f"   Final test records: {len(test_final):,}")
print(f"   Total features: {X_train.shape[1]}")
print(f"   Target: Multiclass regression (GPA 0.0-4.0)")
print(f"   Graph embeddings: 128 (64 student + 64 course)")
print(f"   Academic features: 10")
print(f"   Graph communities: 2")
print(f"\n🚀 Ready for model training!")


🎯 FINAL TRAIN/TEST SPLIT AND DATA SAVING
📊 Feature matrix: (4102, 140)
📊 Target vector: (4102,)
⚠️ Found NaN values, force-filling...
✅ NaN values handled

📊 Bin distribution for stratified split:
   Very_High: 1899 samples
   High: 1018 samples
   Medium: 908 samples
   Low: 230 samples
   Very_Low: 47 samples
✅ Using stratified split (min bin count: 47)

📊 Split Results:
   Train set: 3,281 samples, 140 features
   Test set: 821 samples, 140 features

🔧 Applying StandardScaler to 11 features...
✅ Feature scaling applied

💾 Datasets saved:
   Training: ../data/train_processed_comprehensive.csv
   Testing: ../data/test_processed_comprehensive.csv

🎯 FEATURE ENGINEERING COMPLETE!
📊 Final Summary:
   Original records: 4,102
   Final training records: 3,281
   Final test records: 821
   Total features: 140
   Target: Multiclass regression (GPA 0.0-4.0)
   Graph embeddings: 128 (64 student + 64 course)
   Academic features: 10
   Graph communities: 2

🚀 Ready for model training!
