# Linear Model Comparison: Lasso Regression vs Linear SVR

## Research Objective
This notebook compares **Lasso Regression** and **Linear Support Vector Regression (SVR)** for predicting Instagram users' perceived stress scores (`perceived_stress_score`).

---

## Methodology Overview

### 1. Lasso Regression (Least Absolute Shrinkage and Selection Operator)
- **Objective Function**: $\min_{\beta} \frac{1}{2n}||y - X\beta||_2^2 + \alpha||\beta||_1$
- **Key Feature**: Automatic feature selection via L1 regularization
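
To see why the L1 penalty can produce coefficients that are exactly zero, consider the simplified case of standardized, uncorrelated features. There the Lasso solution is the least-squares coefficient passed through a soft-thresholding step. The sketch below is illustrative only: `soft_threshold` and the example coefficients are assumptions, not part of scikit-learn or of this dataset.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; entries with |z| <= alpha become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Hypothetical least-squares coefficients for four standardized features
ols_coefs = np.array([0.8, -0.05, 0.3, 0.02])

# With alpha = 0.1, the two small coefficients are removed entirely
lasso_coefs = soft_threshold(ols_coefs, alpha=0.1)
print(lasso_coefs)
```

Large coefficients are shrunk by `alpha`, while small ones are dropped, which is the mechanism behind Lasso's feature selection.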

### 2. Linear SVR (Support Vector Regression with Linear Kernel)
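- **Objective Function**: $\min_{w,b,\xi,\xi^*} \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$, subject to $y_i - w^\top x_i - b \le \epsilon + \xi_i$, $w^\top x_i + b - y_i \le \epsilon + \xi_i^*$, and $\xi_i, \xi_i^* \ge 0$
- **Key Feature**: ε-insensitive loss function: residuals smaller than ε cost nothing and larger ones are penalized linearly, which makes the fit less sensitive to outliers than squared loss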
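
The difference between the ε-insensitive loss and the squared loss can be checked numerically. This is a minimal sketch with made-up residuals; the helper name `epsilon_insensitive` is an assumption for illustration.

```python
import numpy as np

def epsilon_insensitive(residual, epsilon=0.1):
    """Zero inside the epsilon tube, linear growth outside it."""
    return np.maximum(np.abs(residual) - epsilon, 0.0)

# Hypothetical residuals: a tiny error, a moderate error, and an outlier
residuals = np.array([0.05, 0.5, 5.0])

eps_loss = epsilon_insensitive(residuals, epsilon=0.1)
sq_loss = 0.5 * residuals ** 2

for r, e, s in zip(residuals, eps_loss, sq_loss):
    print(f"residual={r:4.2f}  eps-insensitive={e:5.2f}  squared={s:6.3f}")
```

The outlier contributes linearly to the ε-insensitive loss but quadratically to the squared loss, so it pulls the SVR fit less.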

### 3. Evaluation Metrics
- **RMSE**: $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
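- **R²**: $1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_{i=1}^{n}(y_i - \bar{y})^2$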
- **MAE**: $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
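
As a quick sanity check, the three metrics can be computed by hand with NumPy. The arrays below are made-up values chosen for easy arithmetic, not data from this study.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# MAE: mean absolute residual
mae = np.mean(np.abs(residuals))

# R²: one minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE={rmse:.4f}, MAE={mae:.4f}, R2={r2:.4f}")
```

These hand-computed values should match `mean_squared_error`, `mean_absolute_error`, and `r2_score` from scikit-learn on the same inputs.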

---
## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR  # Use LinearSVR instead of SVR for better memory efficiency
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import gc  # Garbage collection for memory management
import warnings
warnings.filterwarnings('ignore')  # Note: this also hides ConvergenceWarning from Lasso and LinearSVR

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi'] = 100

print("Libraries imported successfully!")

---
## 2. Define Features and Target Variable

In [None]:
# Define feature list (20 features)
FEATURES = [
    'daily_active_minutes_instagram',
    'likes_given_per_day',
    'comments_written_per_day',
    'time_on_feed_per_day',
    'user_engagement_score',
    'sessions_per_day',
    'dms_received_per_week',
    'ads_viewed_per_day',
    'average_session_length_minutes',
    'stories_viewed_per_day',
    'posts_created_per_week',
    'time_on_reels_per_day',
    'time_on_messages_per_day',
    'ads_clicked_per_day',
    'time_on_explore_per_day',
    'age',
    'dms_sent_per_week',
    'diet_quality',
    'exercise_hours_per_week',
    'reels_watched_per_day'
]

CATEGORICAL_FEATURES = ['diet_quality']
TARGET = 'perceived_stress_score'

print(f"Total features: {len(FEATURES)}")
print(f"Target: {TARGET}")

---
## 3. Data Loading

In [None]:
# Load the raw data; conversion to float32 happens later, when the feature matrix is extracted
df = pd.read_csv('instagram_usage_lifestyle.csv')

print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
df.head()

In [None]:
# Check unique values in diet_quality
print("Unique values in diet_quality:")
print(df['diet_quality'].value_counts())

---
## 4. Data Preprocessing

### 4.1 Encode Categorical Variables
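
One alternative to a hand-written dictionary is an ordered pandas categorical, which makes unknown labels easy to spot. The sketch below uses illustrative level names and sample values; the real levels should come from the `value_counts()` output above.

```python
import pandas as pd

# Hypothetical ordered levels, lowest to highest
levels = ['Poor', 'Average', 'Good', 'Excellent']

# Hypothetical sample, including one label that is not in `levels`
sample = pd.Series(['Poor', 'Good', 'Excellent', 'Unknown'])

# Codes follow the declared order; labels outside `levels` get the code -1
cat = pd.Categorical(sample, categories=levels, ordered=True)
codes = pd.Series(cat.codes)

print(codes.tolist())
```

A code of `-1` flags a label that needs attention, instead of silently switching to a different encoding.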

In [None]:
# Create a copy for processing
df_processed = df[FEATURES + [TARGET]].copy()

# Ordinal encoding for diet_quality ('Fair' and 'Average' are treated as the same level)
diet_quality_mapping = {
    'Poor': 1,
    'Fair': 2,
    'Average': 2,
    'Good': 3,
    'Excellent': 4
}

df_processed['diet_quality_encoded'] = df_processed['diet_quality'].map(diet_quality_mapping)

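# Check for labels that are absent from the mapping (genuinely missing values are imputed later)
unmapped = df_processed['diet_quality'].notna() & df_processed['diet_quality_encoded'].isna()
if unmapped.any():
    # Fallback: LabelEncoder assigns codes alphabetically, so the ordinal meaning is lost
    print("Using LabelEncoder as fallback...")
    le = LabelEncoder()
    df_processed['diet_quality_encoded'] = le.fit_transform(df_processed['diet_quality'].astype(str))
    print(f"Mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")
else:
    print("Ordinal encoding successful!")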

# Update feature list
FEATURES_ENCODED = [f if f != 'diet_quality' else 'diet_quality_encoded' for f in FEATURES]

# Show encoding
print("\nEncoding result:")
print(df_processed[['diet_quality', 'diet_quality_encoded']].drop_duplicates())

### 4.2 Extract Features and Convert to float32 (Memory Optimization)

In [None]:
# Extract features and target
X = df_processed[FEATURES_ENCODED].values.astype(np.float32)  # Use float32 to save memory
y = df_processed[TARGET].values.astype(np.float32)

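# Handle missing values with median imputation (per column for X)
# Note: the medians are computed before the split, so they include the future test rows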
X = np.nan_to_num(X, nan=np.nanmedian(X, axis=0))
y = np.nan_to_num(y, nan=np.nanmedian(y))

print(f"X shape: {X.shape}, dtype: {X.dtype}")
print(f"y shape: {y.shape}, dtype: {y.dtype}")
print(f"Memory for X: {X.nbytes / 1024**2:.2f} MB")

# Clear unused dataframes
del df, df_processed
gc.collect()

### 4.3 Train-Test Split

In [None]:
# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

### 4.4 Feature Standardization

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train).astype(np.float32)
X_test_scaled = scaler.transform(X_test).astype(np.float32)

print("Standardization completed!")
print(f"Train mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.6f}")

---
## 5. Lasso Regression Model

### 5.1 Hyperparameter Tuning (Memory-Optimized)

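Tuning uses **3-fold CV** over a reduced parameter grid (5 values of `alpha`), which cuts the number of model fits. Running with `n_jobs=1` avoids the extra memory that parallel workers would need.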
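
The cost of a grid search can be estimated by counting model fits: one per parameter combination per fold, plus one final refit on the full training set. A small sketch, where `n_fits` is a hypothetical helper and the grid mirrors the `alpha` values used in this section:

```python
from sklearn.model_selection import ParameterGrid

# Same alpha grid as in the tuning cell
lasso_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
n_candidates = len(ParameterGrid(lasso_grid))

def n_fits(n_candidates, cv, refit=True):
    """Number of model fits GridSearchCV performs (refit adds one final fit)."""
    return n_candidates * cv + int(refit)

fits_3fold = n_fits(n_candidates, cv=3)
fits_5fold = n_fits(n_candidates, cv=5)
print(f"3-fold: {fits_3fold} fits, 5-fold: {fits_5fold} fits")
```

Dropping from 5 to 3 folds therefore removes 10 of the 26 fits for this grid, at the price of a noisier validation estimate.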

In [None]:
# Reduced parameter grid for memory efficiency
lasso_param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]
}

# Create Lasso model
lasso = Lasso(max_iter=1000, random_state=42, tol=1e-3)

# Grid search with 3-fold CV (reduced from 5 to cut the number of fits)
lasso_grid_search = GridSearchCV(
    lasso, 
    lasso_param_grid, 
    cv=3,  # Reduced to 3-fold
    scoring='neg_mean_squared_error',
    n_jobs=1,  # Single job to reduce memory
    return_train_score=False  # Don't store train scores
)

print("Training Lasso model...")
lasso_grid_search.fit(X_train_scaled, y_train)

print(f"\nLasso Best Parameters:")
print(f"  alpha: {lasso_grid_search.best_params_['alpha']}")
print(f"  CV RMSE: {np.sqrt(-lasso_grid_search.best_score_):.4f}")

In [None]:
# Get best Lasso model
best_lasso = lasso_grid_search.best_estimator_

# Clear grid search results to free memory
del lasso_grid_search
gc.collect()

---
## 6. Linear SVR Model

### 6.1 Hyperparameter Tuning (Memory-Optimized)

**LinearSVR** (based on liblinear) is used instead of `SVR(kernel='linear')` (based on libsvm) because it scales better to large numbers of samples in both time and memory.
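
The two estimators solve closely related problems, so on well-behaved data their predictions should largely agree. The sketch below checks this on small synthetic data. All names and values are illustrative, and exact agreement is not expected because the solvers differ and `LinearSVR` also regularizes the intercept.

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

# Synthetic linear data (hypothetical, not the Instagram dataset)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
w_true = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ w_true + 0.1 * rng.standard_normal(200)

# Kernel-based SVR with a linear kernel (libsvm)
kernel_svr = SVR(kernel='linear', C=1.0, epsilon=0.1).fit(X_demo, y_demo)

# Primal/dual linear solver (liblinear)
linear_svr = LinearSVR(C=1.0, epsilon=0.1, dual=True, max_iter=10000,
                       random_state=0).fit(X_demo, y_demo)

# Note the different coefficient shapes: 2-D for SVR, 1-D for LinearSVR
print(kernel_svr.coef_.shape, linear_svr.coef_.shape)

# Correlation between the two sets of predictions
agreement = np.corrcoef(kernel_svr.predict(X_demo),
                        linear_svr.predict(X_demo))[0, 1]
print(f"Prediction correlation: {agreement:.4f}")
```

The shape difference matters when coefficients from both models are placed side by side in one table.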

In [None]:
# Reduced parameter grid
svr_param_grid = {
    'C': [0.1, 1.0, 10.0],
    'epsilon': [0.1, 0.5, 1.0]
}

# Use LinearSVR (more memory efficient than SVR with linear kernel)
svr = LinearSVR(max_iter=1000, random_state=42, tol=1e-3, dual=True)

# Grid search with 3-fold CV
svr_grid_search = GridSearchCV(
    svr, 
    svr_param_grid, 
    cv=3,  # Reduced to 3-fold
    scoring='neg_mean_squared_error',
    n_jobs=1,  # Single job to reduce memory
    return_train_score=False
)

print("Training Linear SVR model...")
svr_grid_search.fit(X_train_scaled, y_train)

print(f"\nLinear SVR Best Parameters:")
print(f"  C: {svr_grid_search.best_params_['C']}")
print(f"  epsilon: {svr_grid_search.best_params_['epsilon']}")
print(f"  CV RMSE: {np.sqrt(-svr_grid_search.best_score_):.4f}")

In [None]:
# Get best SVR model
best_svr = svr_grid_search.best_estimator_

# Clear grid search results
del svr_grid_search
gc.collect()

---
## 7. Model Evaluation

### 7.1 Test Set Evaluation

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Evaluate model performance."""
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    results = {
        'Model': model_name,
        'Train_RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),
        'Train_R2': r2_score(y_train, y_train_pred),
        'Test_RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
        'Test_R2': r2_score(y_test, y_test_pred),
        'Test_MAE': mean_absolute_error(y_test, y_test_pred)
    }
    
    return results, y_test_pred

In [None]:
# Evaluate both models
lasso_results, lasso_pred = evaluate_model(
    best_lasso, X_train_scaled, X_test_scaled, y_train, y_test, 'Lasso'
)

svr_results, svr_pred = evaluate_model(
    best_svr, X_train_scaled, X_test_scaled, y_train, y_test, 'Linear SVR'
)

# Display results
results_df = pd.DataFrame([lasso_results, svr_results])
print("\nTest Set Evaluation Results:")
print("=" * 70)
results_df

### 7.2 Cross-Validation (5-Fold, Memory-Optimized)

In [None]:
from sklearn.pipeline import make_pipeline

# 5-fold CV for final evaluation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is refit inside each fold, so validation rows never influence the scaling

# Lasso CV (same solver settings as in tuning)
lasso_cv_model = make_pipeline(
    StandardScaler(),
    Lasso(alpha=best_lasso.alpha, max_iter=1000, random_state=42, tol=1e-3)
)
lasso_cv_scores = cross_val_score(lasso_cv_model, X, y, cv=kfold, scoring='neg_mean_squared_error')
lasso_cv_rmse = np.sqrt(-lasso_cv_scores)

# Linear SVR CV (same solver settings as in tuning)
svr_cv_model = make_pipeline(
    StandardScaler(),
    LinearSVR(C=best_svr.C, epsilon=best_svr.epsilon, max_iter=1000,
              random_state=42, tol=1e-3, dual=True)
)
svr_cv_scores = cross_val_score(svr_cv_model, X, y, cv=kfold, scoring='neg_mean_squared_error')
svr_cv_rmse = np.sqrt(-svr_cv_scores)

print("5-Fold Cross-Validation Results:")
print("=" * 50)
print(f"Lasso RMSE:      {lasso_cv_rmse.mean():.4f} ± {lasso_cv_rmse.std():.4f}")
print(f"Linear SVR RMSE: {svr_cv_rmse.mean():.4f} ± {svr_cv_rmse.std():.4f}")

### 7.3 Model Comparison Visualization

In [None]:
# Create comparison plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

models = ['Lasso', 'Linear SVR']
colors = ['steelblue', 'coral']

# RMSE comparison
rmse_means = [lasso_cv_rmse.mean(), svr_cv_rmse.mean()]
rmse_stds = [lasso_cv_rmse.std(), svr_cv_rmse.std()]
axes[0].bar(models, rmse_means, yerr=rmse_stds, color=colors, capsize=10, edgecolor='black')
axes[0].set_ylabel('RMSE', fontsize=12)
axes[0].set_title('RMSE Comparison (5-Fold CV)', fontsize=14)
for i, (m, s) in enumerate(zip(rmse_means, rmse_stds)):
    axes[0].text(i, m + s + 0.01, f'{m:.4f}', ha='center', fontsize=11)

# R² comparison
r2_values = [lasso_results['Test_R2'], svr_results['Test_R2']]
axes[1].bar(models, r2_values, color=colors, edgecolor='black')
axes[1].set_ylabel('R²', fontsize=12)
axes[1].set_title('R² Comparison (Test Set)', fontsize=14)
for i, v in enumerate(r2_values):
    axes[1].text(i, v + 0.01, f'{v:.4f}', ha='center', fontsize=11)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

### 7.4 Predicted vs Actual Scatter Plot

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Lasso
axes[0].scatter(y_test, lasso_pred, alpha=0.3, color='steelblue', s=10)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual', fontsize=12)
axes[0].set_ylabel('Predicted', fontsize=12)
axes[0].set_title(f'Lasso (R²={lasso_results["Test_R2"]:.4f})', fontsize=14)

# Linear SVR
axes[1].scatter(y_test, svr_pred, alpha=0.3, color='coral', s=10)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual', fontsize=12)
axes[1].set_ylabel('Predicted', fontsize=12)
axes[1].set_title(f'Linear SVR (R²={svr_results["Test_R2"]:.4f})', fontsize=14)

plt.tight_layout()
plt.savefig('prediction_scatter.png', dpi=150, bbox_inches='tight')
plt.show()

---
## 8. Feature Importance Analysis

In [None]:
# Get feature coefficients
lasso_coef = best_lasso.coef_
svr_coef = best_svr.coef_

# Create importance dataframe
importance_df = pd.DataFrame({
    'Feature': FEATURES_ENCODED,
    'Lasso_Coef': lasso_coef,
    'SVR_Coef': svr_coef,
    'Lasso_Abs': np.abs(lasso_coef),
    'SVR_Abs': np.abs(svr_coef)
}).sort_values('Lasso_Abs', ascending=False)

print("Feature Coefficients (sorted by Lasso importance):")
print("=" * 70)
importance_df[['Feature', 'Lasso_Coef', 'SVR_Coef']]

In [None]:
# Lasso feature selection
selected = importance_df[importance_df['Lasso_Coef'] != 0]
excluded = importance_df[importance_df['Lasso_Coef'] == 0]

print(f"\nLasso selected {len(selected)}/{len(FEATURES_ENCODED)} features")
if len(excluded) > 0:
    print(f"\nExcluded features (coefficient = 0):")
    for f in excluded['Feature'].values:
        print(f"  - {f}")

In [None]:
# Feature coefficient visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Lasso coefficients
lasso_sorted = importance_df.sort_values('Lasso_Abs', ascending=True)
colors_l = ['steelblue' if c >= 0 else 'coral' for c in lasso_sorted['Lasso_Coef']]
axes[0].barh(lasso_sorted['Feature'], lasso_sorted['Lasso_Coef'], color=colors_l)
axes[0].set_xlabel('Coefficient')
axes[0].set_title('Lasso Feature Coefficients')
axes[0].axvline(x=0, color='black', linewidth=0.5)

# SVR coefficients
svr_sorted = importance_df.sort_values('SVR_Abs', ascending=True)
colors_s = ['steelblue' if c >= 0 else 'coral' for c in svr_sorted['SVR_Coef']]
axes[1].barh(svr_sorted['Feature'], svr_sorted['SVR_Coef'], color=colors_s)
axes[1].set_xlabel('Coefficient')
axes[1].set_title('Linear SVR Feature Coefficients')
axes[1].axvline(x=0, color='black', linewidth=0.5)

plt.tight_layout()
plt.savefig('feature_coefficients.png', dpi=150, bbox_inches='tight')
plt.show()

---
## 9. Results Summary

In [None]:
# Final summary table
final_summary = pd.DataFrame({
    'Model': ['Lasso', 'Linear SVR'],
    'Best_Params': [
        f"alpha={best_lasso.alpha}",
        f"C={best_svr.C}, eps={best_svr.epsilon}"
    ],
    'Test_RMSE': [lasso_results['Test_RMSE'], svr_results['Test_RMSE']],
    'Test_R2': [lasso_results['Test_R2'], svr_results['Test_R2']],
    'CV_RMSE_Mean': [lasso_cv_rmse.mean(), svr_cv_rmse.mean()],
    'CV_RMSE_Std': [lasso_cv_rmse.std(), svr_cv_rmse.std()]
})

print("\n" + "=" * 70)
print("FINAL MODEL COMPARISON")
print("=" * 70)
final_summary

In [None]:
# Export results
final_summary.to_csv('linear_models_results.csv', index=False)
importance_df.to_csv('feature_coefficients.csv', index=False)

print("\nResults exported:")
print("  ✓ linear_models_results.csv")
print("  ✓ feature_coefficients.csv")

---
## 10. Conclusions

### Memory Optimization Techniques Used:
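1. **float32** instead of float64 (halves the memory of the stored feature and target arrays)
2. **3-fold CV** for hyperparameter tuning (reduced from 5; fewer fits, so mainly a runtime saving)
3. **LinearSVR** instead of SVR with linear kernel (scales better to large sample counts)
4. **Reduced parameter grid** (fewer combinations)
5. **Single-threaded** execution (`n_jobs=1`, so no data copies for parallel workers)
6. **Garbage collection** after large objects are deleted (`del` followed by `gc.collect()`)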
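
The float32 saving can be verified directly from the array sizes. A minimal sketch, with an assumed array shape (the row count is hypothetical; 20 columns matches the feature list):

```python
import numpy as np

# Hypothetical dataset size: 10,000 rows and 20 features
n_rows, n_cols = 10_000, 20

x64 = np.zeros((n_rows, n_cols), dtype=np.float64)
x32 = x64.astype(np.float32)

# Each float64 takes 8 bytes, each float32 takes 4 bytes
mb64 = x64.nbytes / 1024**2
mb32 = x32.nbytes / 1024**2
print(f"float64: {mb64:.2f} MB, float32: {mb32:.2f} MB")
```

The saving applies to the arrays held in the notebook; an estimator may still convert its input internally while fitting.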

### Model Comparison:
| Method | Advantages | Limitations |
|--------|------------|-------------|
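| Lasso | Automatic feature selection (the L1 penalty sets some coefficients to exactly zero) | Among correlated features, tends to keep one and drop the others somewhat arbitrarily |
| Linear SVR | ε-insensitive loss is less sensitive to outliers than squared loss | No feature selection (the L2 penalty shrinks coefficients but does not zero them) |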

In [None]:
print("\n" + "=" * 50)
print("Analysis Complete!")
print("=" * 50)