# FSLSM Learning Style Classification - Complete Analysis
## Felder-Silverman Learning Style Model Implementation

**Objective:** Achieve 96%+ accuracy in predicting learning styles from behavioral data (this notebook reports accuracy as R² × 100)

**Current Status:** R² of 0.85 (85%), below the 96% target

---

## 1. Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
import xgboost as xgb
from scipy import stats

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully")

## 2. Data Loading and Initial Exploration

In [None]:
# Load data
data_path = Path('../data/training_data.csv')
df = pd.read_csv(data_path)

print(f"📊 Dataset Shape: {df.shape}")
print(f"\n📈 First 5 rows:")
df.head()

In [None]:
# Dataset info
print("📋 Dataset Information:")
df.info()

In [None]:
# Statistical summary
print("📊 Statistical Summary:")
df.describe()

## 3. Feature Analysis

### 3.1 Feature List and Categories

In [None]:
# Define feature groups
feature_groups = {
    'Active/Reflective': [
        'activeModeRatio', 'questionsGenerated', 'debatesParticipated',
        'reflectiveModeRatio', 'reflectionsWritten', 'journalEntries'
    ],
    'Sensing/Intuitive': [
        'sensingModeRatio', 'simulationsCompleted', 'challengesCompleted',
        'intuitiveModeRatio', 'conceptsExplored', 'patternsDiscovered'
    ],
    'Visual/Verbal': [
        'visualModeRatio', 'diagramsViewed', 'wireframesExplored',
        'verbalModeRatio', 'textRead', 'summariesCreated'
    ],
    'Sequential/Global': [
        'sequentialModeRatio', 'stepsCompleted', 'linearNavigation',
        'globalModeRatio', 'overviewsViewed', 'navigationJumps'
    ]
}

all_features = [f for group in feature_groups.values() for f in group]
label_cols = ['activeReflective', 'sensingIntuitive', 'visualVerbal', 'sequentialGlobal']

print(f"📊 Total Features: {len(all_features)}")
print(f"🎯 Target Labels: {len(label_cols)}")
print(f"\n📋 Feature Groups:")
for group, features in feature_groups.items():
    print(f"  {group}: {len(features)} features")

### 3.2 Feature Distributions

In [None]:
# Plot feature distributions by group
for group_name, features in feature_groups.items():
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    fig.suptitle(f'{group_name} Features Distribution', fontsize=16)
    
    for idx, feature in enumerate(features):
        ax = axes[idx // 3, idx % 3]
        df[feature].hist(bins=30, ax=ax, edgecolor='black')
        ax.set_title(feature)
        ax.set_xlabel('Value')
        ax.set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()

### 3.3 Label Distributions

In [None]:
# Plot label distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('FSLSM Dimension Distributions', fontsize=16)

for idx, label in enumerate(label_cols):
    ax = axes[idx // 2, idx % 2]
    df[label].hist(bins=23, ax=ax, edgecolor='black', range=(-11.5, 11.5))
    ax.set_title(f'{label} (mean={df[label].mean():.2f}, std={df[label].std():.2f})')
    ax.set_xlabel('Score (-11 to +11)')
    ax.set_ylabel('Frequency')
    ax.axvline(0, color='red', linestyle='--', label='Balanced')
    ax.legend()

plt.tight_layout()
plt.show()

### 3.4 Feature Correlations

In [None]:
# Map each feature group to its target label (defined once, outside the loop)
label_map = {
    'Active/Reflective': 'activeReflective',
    'Sensing/Intuitive': 'sensingIntuitive',
    'Visual/Verbal': 'visualVerbal',
    'Sequential/Global': 'sequentialGlobal'
}

# Correlation heatmap for each dimension
for group_name, features in feature_groups.items():
    label = label_map[group_name]
    
    # Calculate correlations
    corr_data = df[features + [label]].corr()
    
    # Plot
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='coolwarm', center=0,
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title(f'{group_name} - Feature Correlations with {label}')
    plt.tight_layout()
    plt.show()
    
    # Print top correlations with label
    label_corr = corr_data[label].drop(label).sort_values(ascending=False)
    print(f"\n📊 {group_name} - Top correlations with {label}:")
    print(label_corr)

## 4. Labeling Algorithm Explanation

### Rule-Based Labeling Algorithm

In [None]:
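# Print the header and separator first; inside a triple-quoted string the
# `" + "="*60` part would be printed literally instead of being evaluated.
print("\n🎯 LABELING ALGORITHM EXPLANATION\n" + "=" * 60)
print("""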
The synthetic data is generated using a rule-based algorithm based on FSLSM theory:

1. GENERATE LEARNING STYLE PROFILE (Labels)
   - Randomly generate scores for each dimension: -11 to +11
   - These represent the "true" learning style

2. GENERATE BEHAVIORAL FEATURES (Based on Profile)
   For each dimension, features are generated to match the profile:
   
   Active/Reflective:
   - If Active (score < -3): High activeModeRatio (0.6-0.9), many questions/debates
   - If Reflective (score > 3): High reflectiveModeRatio (0.6-0.9), many reflections/journals
   - If Balanced: Moderate values for both
   
   Sensing/Intuitive:
   - If Sensing (score < -3): High sensingModeRatio, many simulations/challenges
   - If Intuitive (score > 3): High intuitiveModeRatio, many concepts/patterns explored
   - If Balanced: Moderate values for both
   
   Visual/Verbal:
   - If Visual (score < -3): High visualModeRatio, many diagrams/wireframes viewed
   - If Verbal (score > 3): High verbalModeRatio, much text read/summaries created
   - If Balanced: Moderate values for both
   
   Sequential/Global:
   - If Sequential (score < -3): High sequentialModeRatio, many steps/linear navigation
   - If Global (score > 3): High globalModeRatio, many overviews/navigation jumps
   - If Balanced: Moderate values for both

3. ADD REALISTIC NOISE
   - Add Gaussian noise (5% of value) to make data realistic
   - Ensures features aren't perfectly predictive

This creates a dataset where:
- Features are causally related to labels (realistic)
- Relationships are strong but not perfect (realistic noise)
- ML models can learn the underlying patterns
""")

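The three steps above can be sketched for a single dimension as follows. This is a minimal illustration, not the generator that produced `training_data.csv`: the function names `add_noise` and `generate_active_reflective`, the integer score assumption, the count ranges, the balanced ratio range (0.4-0.6), the choice of `reflectiveModeRatio = 1 - activeModeRatio`, and the reading of "5% of value" as a standard deviation are all assumptions. Only the -11 to +11 score range, the ±3 cutoffs, and the 0.6-0.9 ratio range come from the description.

```python
import numpy as np

def add_noise(value, rng, rel_std=0.05):
    """Add Gaussian noise with standard deviation rel_std * |value| (assumed reading of '5% of value')."""
    return value + rng.normal(0.0, rel_std * abs(value))

def generate_active_reflective(rng):
    """Generate one (score, features) pair for the Active/Reflective dimension."""
    # Step 1: draw the "true" score; integer scores are an assumption.
    score = int(rng.integers(-11, 12))  # upper bound is exclusive

    # Step 2: choose feature levels that match the profile.
    if score < -3:    # Active
        active_ratio = rng.uniform(0.6, 0.9)
        questions, reflections = rng.integers(10, 21), rng.integers(0, 6)   # assumed ranges
    elif score > 3:   # Reflective
        active_ratio = 1.0 - rng.uniform(0.6, 0.9)
        questions, reflections = rng.integers(0, 6), rng.integers(10, 21)   # assumed ranges
    else:             # Balanced
        active_ratio = rng.uniform(0.4, 0.6)                                # assumed range
        questions, reflections = rng.integers(4, 12), rng.integers(4, 12)   # assumed ranges

    # Step 3: perturb every feature, then clip to valid ranges.
    features = {
        "activeModeRatio": float(np.clip(add_noise(active_ratio, rng), 0.0, 1.0)),
        "reflectiveModeRatio": float(np.clip(add_noise(1.0 - active_ratio, rng), 0.0, 1.0)),
        "questionsGenerated": max(0.0, add_noise(float(questions), rng)),
        "reflectionsWritten": max(0.0, add_noise(float(reflections), rng)),
    }
    return score, features

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
for score, feats in (generate_active_reflective(rng) for _ in range(5)):
    print(score, {name: round(value, 2) for name, value in feats.items()})
```

Because the noise is proportional to each value, a feature that is exactly zero stays zero; a real generator might add a small absolute noise floor as well.
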
## 5. Current Model Performance Analysis

In [None]:
# Prepare data
X = df[all_features].values
y_dict = {col: df[col].values for col in label_cols}

# Split features. Labels are split per dimension in the next cell with the same
# test_size and random_state, which keeps feature and label rows aligned.
X_train_data, X_test_data = train_test_split(X, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_data)
X_test_scaled = scaler.transform(X_test_data)

print(f"✅ Data prepared:")
print(f"  Train: {X_train_scaled.shape[0]} samples")
print(f"  Test: {X_test_scaled.shape[0]} samples")
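
Splitting features and labels in separate `train_test_split` calls only stays aligned as long as every call uses the same array length, `test_size`, and `random_state`. A less fragile option, sketched here on stand-in arrays, is to pass all arrays to a single call. The names `X_demo`, `Y_demo`, and `row_ids` are hypothetical and the random data only mimics the shapes used in this notebook (24 features, 4 labels).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 24))             # stand-in for df[all_features].values
Y_demo = rng.integers(-11, 12, size=(50, 4))   # stand-in for df[label_cols].values
row_ids = np.arange(50)                        # original row positions, used to check alignment

# One call shuffles every array with the same permutation.
X_tr, X_te, Y_tr, Y_te, ids_tr, ids_te = train_test_split(
    X_demo, Y_demo, row_ids, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape, Y_tr.shape, Y_te.shape)
# Each split row still pairs with the labels of the same original row.
print(np.array_equal(Y_demo[ids_tr], Y_tr), np.array_equal(X_demo[ids_te], X_te))
```

With this layout, column `i` of `Y_tr` and `Y_te` holds the targets for `label_cols[i]`, so the per-label loop can index a column instead of re-splitting.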

In [None]:
# Train baseline models and evaluate
baseline_results = {}

for label in label_cols:
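    # Split labels with the same test_size and random_state as the feature split
    # so that label rows line up with X_train_scaled / X_test_scaled
    y_train, y_test = train_test_split(y_dict[label], test_size=0.2, random_state=42)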
    
    # Train baseline XGBoost
    model = xgb.XGBRegressor(
        objective='reg:squarederror',
        max_depth=6,
        learning_rate=0.1,
        n_estimators=100,
        random_state=42
    )
    model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred = model.predict(X_test_scaled)
    
    # Metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    baseline_results[label] = {
        'MAE': mae,
        'RMSE': rmse,
        'R²': r2,
        'Accuracy': r2 * 100  # R² expressed as a percentage, not classification accuracy
    }
    
    print(f"\n{label}:")
    print(f"  MAE: {mae:.3f}")
    print(f"  RMSE: {rmse:.3f}")
    print(f"  R²: {r2:.3f} ({r2*100:.1f}%)")

# Overall performance
avg_r2 = np.mean([r['R²'] for r in baseline_results.values()])
print(f"\n🎯 Average R²: {avg_r2:.3f} ({avg_r2*100:.1f}%)")
print(f"\n⚠️ PROBLEM: Current accuracy (R² × 100) is {avg_r2*100:.1f}%, target is 96%")
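
R² measures explained variance in the predicted scores, which is not the same as the share of learners assigned the correct style. A minimal sketch of a categorical accuracy check follows, assuming the ±3 cutoffs from the labeling rules in Section 4 define three categories per dimension. `to_category` and `category_accuracy` are hypothetical helper names, and the arrays are made-up illustration values, not results from this notebook.

```python
import numpy as np

def to_category(scores, threshold=3):
    """Map scores to -1 (negative pole), 0 (balanced), or +1 (positive pole)."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores < -threshold, -1, np.where(scores > threshold, 1, 0))

def category_accuracy(y_true, y_pred, threshold=3):
    """Fraction of samples whose predicted category matches the true category."""
    return float(np.mean(to_category(y_true, threshold) == to_category(y_pred, threshold)))

# Illustration values only.
y_true_demo = np.array([-9, -2, 0, 5, 11, 4])
y_pred_demo = np.array([-7.5, -3.4, 1.2, 2.8, 9.1, 4.6])

print(f"{category_accuracy(y_true_demo, y_pred_demo):.3f}")  # prints 0.667
```

In the example, the second and fourth predictions are numerically close to the true scores but fall on the other side of a cutoff, so a model can have small errors and still miss the category. Reporting both R² and a categorical accuracy makes the 96% target unambiguous.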

## 6. Problem Diagnosis

### Why is accuracy low?

In [None]:
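# Print the header and separator first; inside a triple-quoted string the
# `" + "="*60` part would be printed literally instead of being evaluated.
print("\n🎯 PROBLEM DIAGNOSIS\n" + "=" * 60)
print("""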
Current Issues:

1. ❌ INSUFFICIENT DATA
   - Current: 500 samples
   - Suggested: 2000-5000 samples to reach the 96% target (rough estimate)
   - With 24 features, more data is needed to learn the patterns reliably

2. ❌ SIMPLE FEATURES
   - Current: 24 basic behavioral features
   - Missing: Interaction features, polynomial features, time-based features
   - Need feature engineering to capture complex patterns

3. ❌ UNTUNED HYPERPARAMETERS
   - Using hand-picked XGBoost parameters (max_depth=6, learning_rate=0.1, n_estimators=100)
   - Need hyperparameter tuning (GridSearchCV/RandomizedSearchCV)
   - Tuning may improve results by an estimated 5-10% (not yet measured)

4. ❌ NO ENSEMBLE
   - Using single XGBoost model
   - Ensemble methods (stacking, voting) can boost accuracy
   - Combine XGBoost + Random Forest + Neural Network

Solutions to implement:
✅ Increase dataset to 2000+ samples
✅ Add engineered features (interactions, polynomials)
✅ Hyperparameter tuning with GridSearchCV
✅ Create ensemble model
""")
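
As one possible reading of "engineered features (interactions, polynomials)", the sketch below derives a difference and a product from each pair of opposing mode ratios. The column pairs come from the feature groups defined in Section 3.1; the helper `add_engineered_features`, the derived column names, and the demo values are hypothetical, and whether these features actually help has to be checked against the baseline.

```python
import pandas as pd

# (negative-pole ratio, positive-pole ratio) per dimension, following the
# sign convention of the labels: score < 0 is the first pole, score > 0 the second.
RATIO_PAIRS = {
    "activeReflective": ("activeModeRatio", "reflectiveModeRatio"),
    "sensingIntuitive": ("sensingModeRatio", "intuitiveModeRatio"),
    "visualVerbal": ("visualModeRatio", "verbalModeRatio"),
    "sequentialGlobal": ("sequentialModeRatio", "globalModeRatio"),
}

def add_engineered_features(frame):
    """Return a copy of frame with a ratio difference and a ratio product per dimension."""
    out = frame.copy()
    for dim, (neg_col, pos_col) in RATIO_PAIRS.items():
        # The difference has the same sign as the target score when the ratios match the profile.
        out[f"{dim}_ratioDiff"] = out[pos_col] - out[neg_col]
        # The product is largest when both modes are used about equally (balanced learners).
        out[f"{dim}_ratioProduct"] = out[pos_col] * out[neg_col]
    return out

# Two made-up learners, for illustration only.
demo = pd.DataFrame({
    "activeModeRatio": [0.8, 0.3], "reflectiveModeRatio": [0.2, 0.7],
    "sensingModeRatio": [0.5, 0.6], "intuitiveModeRatio": [0.5, 0.4],
    "visualModeRatio": [0.7, 0.2], "verbalModeRatio": [0.3, 0.8],
    "sequentialModeRatio": [0.4, 0.9], "globalModeRatio": [0.6, 0.1],
})
engineered = add_engineered_features(demo)
print(engineered.filter(like="_ratio").round(2))
```

The same function could be applied to `df` before building `X`, with the new column names appended to `all_features`.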