# Week 4 Part 1: Data Exploration for Classification

## Interactive Learning: Exploring Innovation Success Patterns

This notebook provides hands-on exploration of classification datasets, helping you understand:
- How to explore data before building classifiers
- Feature distributions and their relationship to outcomes
- Data preparation techniques for classification
- Visual patterns that indicate classification potential

**Learning Objectives:**
1. Load and understand classification datasets
2. Visualize feature distributions by class
3. Identify patterns and relationships
4. Prepare data for classification algorithms

In [None]:
# Cell 1: Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_iris, load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print('Libraries imported successfully!')
print(f'NumPy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')
import sklearn
print(f'Scikit-learn version: {sklearn.__version__}')

In [None]:
# Cell 2: Data Generation and Loading Functions

def generate_innovation_dataset(n_samples=1000, random_state=42):
    """
    Generate a synthetic innovation success dataset.
    
    Features represent:
    - Novelty Score: How innovative the idea is (0-100)
    - Market Size: Potential market in millions
    - Team Experience: Years of combined team experience
    - Development Time: Months to develop
    - Budget Efficiency: Percentage of budget used efficiently
    - Competitive Advantage: Uniqueness score (0-100)
    
    Target: Success (1) or Failure (0)
    """
    np.random.seed(random_state)
    
    # Generate base features
    X, y = make_classification(
        n_samples=n_samples,
        n_features=6,
        n_informative=5,
        n_redundant=1,
        n_clusters_per_class=2,
        weights=[0.4, 0.6],  # roughly 60% success rate (before label noise)
        flip_y=0.05,  # 5% label noise
        random_state=random_state
    )
    
    # Transform to realistic ranges
    novelty_score = 50 + 20 * X[:, 0]  # 0-100 scale
    novelty_score = np.clip(novelty_score, 0, 100)
    
    market_size = np.exp(2 + 0.5 * X[:, 1]) * 10  # Millions, log-normal
    market_size = np.clip(market_size, 1, 1000)
    
    team_experience = 5 + 3 * X[:, 2]  # Years
    team_experience = np.clip(team_experience, 0, 20)
    
    development_time = 12 + 6 * X[:, 3]  # Months
    development_time = np.clip(development_time, 3, 36)
    
    budget_efficiency = 50 + 15 * X[:, 4]  # Percentage
    budget_efficiency = np.clip(budget_efficiency, 10, 100)
    
    competitive_advantage = 50 + 20 * X[:, 5]  # 0-100 scale
    competitive_advantage = np.clip(competitive_advantage, 0, 100)
    
    # Create DataFrame
    df = pd.DataFrame({
        'novelty_score': novelty_score,
        'market_size': market_size,
        'team_experience': team_experience,
        'development_time': development_time,
        'budget_efficiency': budget_efficiency,
        'competitive_advantage': competitive_advantage,
        'success': y
    })
    
    return df

def load_standard_datasets():
    """
    Load standard classification datasets for comparison.
    Returns dictionary of datasets.
    """
    datasets = {}
    
    # Iris dataset
    iris = load_iris()
    iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
    iris_df['species'] = iris.target
    datasets['iris'] = iris_df
    
    # Wine dataset
    wine = load_wine()
    wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
    wine_df['quality'] = wine.target
    datasets['wine'] = wine_df
    
    return datasets

def get_data_summary(df, target_column):
    """
    Generate comprehensive data summary statistics.
    """
    summary = {
        'shape': df.shape,
        'features': list(df.drop(columns=[target_column]).columns),
        'target_distribution': df[target_column].value_counts(normalize=True).to_dict(),
        'missing_values': df.isnull().sum().to_dict(),
        'dtypes': df.dtypes.to_dict()
    }
    
    # Basic statistics
    summary['statistics'] = df.describe().to_dict()
    
    return summary

In [None]:
# Cell 3: Visualization Functions

def plot_feature_distributions(df, target_column, figsize=(15, 10)):
    """
    Plot distribution of features by target class.
    """
    features = [col for col in df.columns if col != target_column]
    n_features = len(features)
    n_cols = 3
    n_rows = (n_features + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    axes = axes.flatten()
    
    for i, feature in enumerate(features):
        ax = axes[i]
        
        # Plot distributions for each class
        for class_val in df[target_column].unique():
            data = df[df[target_column] == class_val][feature]
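            # NumPy integer types are not always instances of Python int, so check both
            is_numeric = isinstance(class_val, (int, float, np.integer, np.floating))
            label = f'Class {class_val}' if is_numeric else str(class_val)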
            ax.hist(data, alpha=0.5, label=label, bins=20, edgecolor='black', linewidth=0.5)
        
        ax.set_xlabel(feature.replace('_', ' ').title())
        ax.set_ylabel('Frequency')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    # Hide unused subplots
    for i in range(n_features, len(axes)):
        axes[i].axis('off')
    
    plt.suptitle('Feature Distributions by Class', fontsize=16, y=1.02)
    plt.tight_layout()
    return fig

def plot_correlation_matrix(df, target_column, figsize=(10, 8)):
    """
    Plot correlation matrix including target variable.
    """
    fig, ax = plt.subplots(figsize=figsize)
    
    # Calculate correlation matrix
    corr = df.corr()
    
    # Create mask for upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))
    
    # Plot heatmap
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', 
                cmap='coolwarm', center=0, vmin=-1, vmax=1,
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    
    plt.title('Feature Correlation Matrix', fontsize=14)
    plt.tight_layout()
    return fig

def plot_pca_visualization(df, target_column, figsize=(15, 5)):
    """
    Visualize data in 2D using PCA.
    """
    fig, axes = plt.subplots(1, 3, figsize=figsize)
    
    # Prepare data
    X = df.drop(columns=[target_column])
    y = df[target_column]
    
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Apply PCA
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)
    
    # Plot 1: PCA scatter plot
    scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, 
                             cmap='viridis', alpha=0.6, edgecolor='black', linewidth=0.5)
    axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
    axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
    axes[0].set_title('PCA Visualization')
    plt.colorbar(scatter, ax=axes[0])
    
    # Plot 2: Explained variance
    pca_full = PCA()
    pca_full.fit(X_scaled)
    cumsum_var = np.cumsum(pca_full.explained_variance_ratio_)
    
    axes[1].bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
               pca_full.explained_variance_ratio_, alpha=0.7, label='Individual')
    axes[1].plot(range(1, len(cumsum_var) + 1), cumsum_var, 
                'r-o', linewidth=2, markersize=6, label='Cumulative')
    axes[1].set_xlabel('Principal Component')
    axes[1].set_ylabel('Explained Variance Ratio')
    axes[1].set_title('PCA Explained Variance')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    # Plot 3: Feature loadings
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    feature_names = X.columns
    
    for i, feature in enumerate(feature_names):
        axes[2].arrow(0, 0, loadings[i, 0], loadings[i, 1], 
                     head_width=0.05, head_length=0.05, fc='blue', ec='blue', alpha=0.6)
        axes[2].text(loadings[i, 0]*1.1, loadings[i, 1]*1.1, feature, 
                    fontsize=8, ha='center')
    
    axes[2].set_xlim(-1, 1)
    axes[2].set_ylim(-1, 1)
    axes[2].set_xlabel('PC1 Loadings')
    axes[2].set_ylabel('PC2 Loadings')
    axes[2].set_title('PCA Feature Loadings')
    axes[2].grid(True, alpha=0.3)
    axes[2].axhline(y=0, color='k', linewidth=0.5)
    axes[2].axvline(x=0, color='k', linewidth=0.5)
    
    plt.tight_layout()
    return fig

def plot_feature_importance(df, target_column, figsize=(12, 5)):
    """
    Calculate and plot feature importance using statistical tests.
    """
    fig, axes = plt.subplots(1, 2, figsize=figsize)
    
    X = df.drop(columns=[target_column])
    y = df[target_column]
    
    # F-statistic importance
    selector_f = SelectKBest(score_func=f_classif, k='all')
    selector_f.fit(X, y)
    f_scores = selector_f.scores_
    
    # Mutual information importance
    mi_scores = mutual_info_classif(X, y, random_state=42)
    
    feature_names = X.columns
    
    # Plot F-statistic scores
    indices = np.argsort(f_scores)[::-1]
    axes[0].barh(range(len(indices)), f_scores[indices], color='skyblue', edgecolor='navy', alpha=0.7)
    axes[0].set_yticks(range(len(indices)))
    axes[0].set_yticklabels([feature_names[i] for i in indices])
    axes[0].set_xlabel('F-statistic Score')
    axes[0].set_title('Feature Importance (ANOVA F-test)')
    axes[0].grid(True, alpha=0.3, axis='x')
    
    # Plot Mutual Information scores
    indices = np.argsort(mi_scores)[::-1]
    axes[1].barh(range(len(indices)), mi_scores[indices], color='lightcoral', edgecolor='darkred', alpha=0.7)
    axes[1].set_yticks(range(len(indices)))
    axes[1].set_yticklabels([feature_names[i] for i in indices])
    axes[1].set_xlabel('Mutual Information Score')
    axes[1].set_title('Feature Importance (Mutual Information)')
    axes[1].grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    return fig

In [None]:
# Cell 4: Data Preparation Functions

def prepare_classification_data(df, target_column, test_size=0.2, random_state=42):
    """
    Prepare data for classification: split, scale, and encode.
    """
    # Separate features and target
    X = df.drop(columns=[target_column])
    y = df[target_column]
    
    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Create DataFrames with scaled data
    X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
    X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
    
    return {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'X_train_scaled': X_train_scaled_df,
        'X_test_scaled': X_test_scaled_df,
        'scaler': scaler,
        'feature_names': list(X.columns)
    }

def analyze_class_balance(y, title="Class Distribution"):
    """
    Analyze and visualize class balance.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Count plot
    class_counts = y.value_counts()
    axes[0].bar(class_counts.index, class_counts.values, 
               color='steelblue', edgecolor='navy', alpha=0.7)
    axes[0].set_xlabel('Class')
    axes[0].set_ylabel('Count')
    axes[0].set_title(f'{title} - Counts')
    axes[0].grid(True, alpha=0.3, axis='y')
    
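    # Add value labels on bars (the x position must be the bar's class label,
    # not the enumeration order, because value_counts() sorts by frequency)
    for idx, val in class_counts.items():
        axes[0].text(idx, val, str(val), ha='center', va='bottom')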
    
    # Pie chart
    axes[1].pie(class_counts.values, labels=class_counts.index, 
               autopct='%1.1f%%', startangle=90, colors=sns.color_palette('husl', len(class_counts)))
    axes[1].set_title(f'{title} - Proportions')
    
    plt.tight_layout()
    
    # Print statistics
    print(f"\n{title} Statistics:")
    print(f"Total samples: {len(y)}")
    for class_label, count in class_counts.items():
        print(f"Class {class_label}: {count} ({count/len(y)*100:.1f}%)")
    
    # Check for imbalance
    imbalance_ratio = class_counts.max() / class_counts.min()
    print(f"\nImbalance ratio: {imbalance_ratio:.2f}")
    if imbalance_ratio > 3:
        print("WARNING: Dataset is highly imbalanced! Consider using:")
        print("- Class weight balancing")
        print("- SMOTE or other oversampling techniques")
        print("- Stratified sampling")
    elif imbalance_ratio > 1.5:
        print("Note: Dataset shows some imbalance. Consider using stratified sampling.")
    else:
        print("Dataset is well-balanced.")
    
    return fig

def detect_outliers(df, target_column, method='zscore', threshold=3):
    """
    Detect outliers in the dataset using various methods.
    """
    X = df.drop(columns=[target_column])
    outlier_indices = set()
    
    if method == 'zscore':
        # Z-score method
        z_scores = np.abs(stats.zscore(X))
        outlier_mask = (z_scores > threshold).any(axis=1)
        outlier_indices = set(X.index[outlier_mask])
    
    elif method == 'iqr':
        # IQR method
        Q1 = X.quantile(0.25)
        Q3 = X.quantile(0.75)
        IQR = Q3 - Q1
        outlier_mask = ((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1)
        outlier_indices = set(X.index[outlier_mask])

    else:
        raise ValueError(f"Unknown method '{method}'; use 'zscore' or 'iqr'.")
    
    print(f"\nOutlier Detection Results ({method} method):")
    print(f"Total samples: {len(df)}")
    print(f"Outliers detected: {len(outlier_indices)} ({len(outlier_indices)/len(df)*100:.1f}%)")
    
    return list(outlier_indices)

def create_interaction_features(df, target_column):
    """
    Create interaction features for better classification.
    """
    X = df.drop(columns=[target_column])
    df_interactions = df.copy()
    
    # Create product and ratio interaction features
    if 'novelty_score' in X.columns and 'competitive_advantage' in X.columns:
        df_interactions['innovation_index'] = (
            df['novelty_score'] * df['competitive_advantage'] / 100
        )
    
    if 'budget_efficiency' in X.columns and 'development_time' in X.columns:
        df_interactions['efficiency_ratio'] = (
            df['budget_efficiency'] / (df['development_time'] + 1)
        )
    
    if 'market_size' in X.columns and 'team_experience' in X.columns:
        df_interactions['market_experience_score'] = (
            np.log1p(df['market_size']) * df['team_experience']
        )
    
    print("\nInteraction Features Created:")
    new_features = [col for col in df_interactions.columns if col not in df.columns]
    for feature in new_features:
        print(f"- {feature}")
    
    return df_interactions

## Part 1: Load and Explore the Innovation Dataset

Let's start by generating our innovation success dataset and exploring its characteristics.

In [None]:
# Cell 5: Generate and load the innovation dataset

# Generate innovation dataset
print("Generating innovation success dataset...")
innovation_df = generate_innovation_dataset(n_samples=1000)

print("\nDataset created successfully!")
print(f"Shape: {innovation_df.shape}")
print("\nFirst 5 rows:")
innovation_df.head()

In [None]:
# Cell 6: Get comprehensive data summary

summary = get_data_summary(innovation_df, 'success')

print("DATA SUMMARY")
print("="*50)
print(f"\nDataset shape: {summary['shape'][0]} samples, {summary['shape'][1]} columns")
print(f"\nFeatures ({len(summary['features'])}):")
for feature in summary['features']:
    print(f"  - {feature}")

print(f"\nTarget distribution:")
for class_val, proportion in summary['target_distribution'].items():
    print(f"  Class {class_val}: {proportion:.1%}")

print(f"\nMissing values: {sum(summary['missing_values'].values())}")

# Display basic statistics
print("\nBasic Statistics:")
innovation_df.describe()

## Part 2: Visualize Feature Distributions

Understanding how features are distributed across different classes helps identify which features might be good predictors.
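
One way to put a number on "clear separation" is a standardized mean difference (Cohen's d) between the two classes. The sketch below uses toy data rather than the innovation dataset, and `cohens_d` is a hypothetical helper introduced only for illustration. Values near 0 suggest heavily overlapping distributions, while larger absolute values suggest the classes sit further apart on that feature.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples, using a pooled std."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)

# Toy feature whose class means differ (illustrative values only)
separated_pos = rng.normal(60, 10, 500)
separated_neg = rng.normal(40, 10, 500)

# Toy feature with identical class distributions
overlap_pos = rng.normal(50, 10, 500)
overlap_neg = rng.normal(50, 10, 500)

d_separated = cohens_d(separated_pos, separated_neg)
d_overlap = cohens_d(overlap_pos, overlap_neg)

print(f"Separated feature:   d = {d_separated:.2f}")
print(f"Overlapping feature: d = {d_overlap:.2f}")
```

The same helper could be applied column by column to a DataFrame split on the target, as a numeric companion to the histograms.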

In [None]:
# Cell 7: Plot feature distributions by class

print("Visualizing feature distributions by success/failure...\n")
fig = plot_feature_distributions(innovation_df, 'success')
plt.show()

print("\nObservations:")
print("- Features with clear separation between classes are better predictors")
print("- Overlapping distributions indicate features that might not discriminate well")
print("- Look for features where successful and failed innovations have different patterns")

In [None]:
# Cell 8: Analyze feature correlations

print("Analyzing feature correlations...\n")
fig = plot_correlation_matrix(innovation_df, 'success')
plt.show()

# Find features most correlated with success
correlations_with_target = innovation_df.corr()['success'].sort_values(ascending=False)
print("\nFeatures most correlated with success:")
for feature, corr in correlations_with_target.items():
    if feature != 'success':
        print(f"  {feature}: {corr:.3f}")

## Part 3: Dimensionality Reduction and Visualization

PCA helps us visualize high-dimensional data in 2D and understand which features contribute most to variance.
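
To make the mechanics less of a black box, here is a minimal NumPy sketch of what PCA computes, assuming toy data in place of the innovation dataset: standardize the columns, take a singular value decomposition, and read the explained variance ratios from the squared singular values. The variable names are illustrative and not part of the notebook's helper functions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 200 samples, 4 features; feature 1 is nearly a copy of feature 0
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Standardize each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized matrix: the rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)

# Each component's share of the total variance
explained_variance_ratio = s**2 / np.sum(s**2)

# Project onto the first two components for a 2D scatter plot
X_2d = X_std @ Vt[:2].T

print("Explained variance ratio:", np.round(explained_variance_ratio, 3))
print("2D projection shape:", X_2d.shape)
```

Because two of the four toy features are almost identical, the first component absorbs roughly half of the variance, which is the kind of redundancy the explained-variance plot is meant to reveal.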

In [None]:
# Cell 9: PCA visualization

print("Applying PCA for visualization...\n")
fig = plot_pca_visualization(innovation_df, 'success')
plt.show()

print("\nPCA Insights:")
print("- Left plot: How well the data separates in 2D")
print("- Middle plot: How many components needed to capture variance")
print("- Right plot: Which features contribute most to principal components")

## Part 4: Feature Importance Analysis

Before building classifiers, let's identify which features are most important for predicting success.
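
The ANOVA F-test scores a feature by comparing the variation between class means with the variation inside each class. The sketch below computes that ratio by hand on toy data to show where the score comes from; `anova_f` is a hypothetical helper for illustration, not a replacement for scikit-learn's `f_classif`.

```python
import numpy as np

def anova_f(feature, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = feature.mean()

    ss_between = 0.0  # variation of class means around the grand mean
    ss_within = 0.0   # variation of samples around their own class mean
    for c in classes:
        group = feature[labels == c]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
        ss_within += np.sum((group - group.mean()) ** 2)

    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)

# Toy features: one shifted by class, one unrelated to the class
informative = rng.normal(loc=y * 1.0, scale=1.0)
noise = rng.normal(size=400)

f_informative = anova_f(informative, y)
f_noise = anova_f(noise, y)

print(f"Informative feature: F = {f_informative:.1f}")
print(f"Noise feature:       F = {f_noise:.1f}")
```

A feature whose class means differ relative to its within-class spread gets a large F, which is why this score favors features with shifted means and can miss purely non-linear effects.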

In [None]:
# Cell 10: Calculate feature importance

print("Calculating feature importance...\n")
fig = plot_feature_importance(innovation_df, 'success')
plt.show()

print("\nFeature Importance Methods:")
print("- F-statistic (ANOVA): Compares class means, so it captures linear dependence on the target")
print("- Mutual Information: Can also capture non-linear relationships")
print("\nFeatures that score highly under both methods are strong candidate predictors.")

## Part 5: Data Preparation for Classification

Now let's prepare the data for classification algorithms.
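
One detail worth making explicit is that scaling statistics should come from the training split only and then be reused on the test split, so that no information from the test set leaks into preparation. The following is a minimal NumPy sketch of that idea on toy data; it uses a plain shuffled split rather than the stratified split in `prepare_classification_data`, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=50, scale=15, size=(100, 3))  # toy feature matrix

# Shuffle row indices and hold out 20% as a test split
indices = rng.permutation(len(X))
test_idx, train_idx = indices[:20], indices[20:]
X_train, X_test = X[train_idx], X[test_idx]

# Learn the scaling statistics from the training rows only
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std  # reuse training statistics

print("Train means after scaling:", np.round(X_train_scaled.mean(axis=0), 3))
print("Test means after scaling: ", np.round(X_test_scaled.mean(axis=0), 3))
```

The training columns end up with zero mean and unit variance by construction, while the test columns are only approximately standardized. That small mismatch is expected and is the honest picture of how the model will see new data.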

In [None]:
# Cell 11: Prepare data for classification

print("Preparing data for classification...\n")
data_dict = prepare_classification_data(innovation_df, 'success', test_size=0.2)

print(f"Training set: {data_dict['X_train'].shape[0]} samples")
print(f"Test set: {data_dict['X_test'].shape[0]} samples")
print(f"Features: {len(data_dict['feature_names'])}")
print(f"\nFeature names: {', '.join(data_dict['feature_names'])}")

print("\nData has been:")
print("✓ Split into training and test sets")
print("✓ Scaled with StandardScaler fitted on the training set only")
print("✓ Stratified so both splits keep the original class proportions")

In [None]:
# Cell 12: Analyze class balance

print("Analyzing class balance in training and test sets...\n")

# Training set balance
fig_train = analyze_class_balance(data_dict['y_train'], "Training Set Class Distribution")
plt.show()

# Test set balance
fig_test = analyze_class_balance(data_dict['y_test'], "Test Set Class Distribution")
plt.show()

## Part 6: Outlier Detection and Feature Engineering

Let's identify outliers and create interaction features to improve classification.
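
The two outlier rules used in this part can disagree, especially on small samples, because a single extreme value inflates the standard deviation that the z-score divides by. The sketch below applies both rules by hand to a tiny made-up array (not the innovation dataset) with the same defaults as the notebook: a z-score threshold of 3 and IQR fences at 1.5 times the IQR.

```python
import numpy as np

# Tiny toy sample with one obviously extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 50], dtype=float)

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = np.abs((values - values.mean()) / values.std())
z_outliers = values[z_scores > 3]

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
iqr_outliers = values[iqr_mask]

print("Largest z-score:", round(float(z_scores.max()), 2))
print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

Here the value 50 has a z-score just under 3, so the z-score rule flags nothing, while the IQR rule flags it. With only nine points, no value can reach a z-score of 3 when the population standard deviation is used, which is one reason to compare methods rather than rely on a single rule.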

In [None]:
# Cell 13: Detect outliers

print("Detecting outliers using different methods...\n")

# Z-score method
outliers_zscore = detect_outliers(innovation_df, 'success', method='zscore', threshold=3)

# IQR method
outliers_iqr = detect_outliers(innovation_df, 'success', method='iqr')

# Common outliers
common_outliers = set(outliers_zscore) & set(outliers_iqr)
print(f"\nOutliers detected by both methods: {len(common_outliers)}")

if len(common_outliers) > 0:
    print("\nSample of outlier records:")
    # Outlier indices are index labels, so select with .loc rather than .iloc
    print(innovation_df.loc[sorted(common_outliers)[:3]])

In [None]:
# Cell 14: Create interaction features

print("Creating interaction features...\n")
innovation_enhanced = create_interaction_features(innovation_df, 'success')

print("\nEnhanced dataset shape:", innovation_enhanced.shape)
print("\nNew features added:", innovation_enhanced.shape[1] - innovation_df.shape[1])

# Show correlation of new features with target
new_features = [col for col in innovation_enhanced.columns if col not in innovation_df.columns]
if new_features:
    print("\nNew feature correlations with success:")
    for feature in new_features:
        corr = innovation_enhanced[[feature, 'success']].corr().iloc[0, 1]
        print(f"  {feature}: {corr:.3f}")

## Part 7: Compare with Standard Datasets

Let's also explore standard classification datasets to understand different data characteristics.

In [None]:
# Cell 15: Load and explore standard datasets

print("Loading standard classification datasets...\n")
standard_datasets = load_standard_datasets()

for name, df in standard_datasets.items():
    print(f"\n{name.upper()} Dataset:")
    print(f"  Shape: {df.shape}")
    print(f"  Features: {list(df.columns[:-1])}")
    print(f"  Classes: {df.iloc[:, -1].nunique()}")
    print(f"  Class distribution:")
    for class_val, count in df.iloc[:, -1].value_counts().items():
        print(f"    Class {class_val}: {count} samples")

## Interactive Exploration

Now you can explore the data interactively! Try modifying the parameters below.

In [None]:
# Cell 16: Interactive parameter exploration

# Try different parameters!
SAMPLE_SIZE = 500  # Change this: 100, 500, 1000, 5000
TEST_SIZE = 0.3    # Change this: 0.1, 0.2, 0.3, 0.4
RANDOM_STATE = 42  # Change this for different random splits

# Generate new dataset with your parameters
custom_df = generate_innovation_dataset(n_samples=SAMPLE_SIZE, random_state=RANDOM_STATE)

# Prepare the data
custom_data = prepare_classification_data(custom_df, 'success', 
                                         test_size=TEST_SIZE, 
                                         random_state=RANDOM_STATE)

print(f"Custom Dataset Configuration:")
print(f"  Total samples: {SAMPLE_SIZE}")
print(f"  Training samples: {len(custom_data['y_train'])}")
print(f"  Test samples: {len(custom_data['y_test'])}")
print(f"  Test size ratio: {TEST_SIZE:.0%}")

# Quick visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot training distribution
train_counts = custom_data['y_train'].value_counts()
axes[0].bar(train_counts.index, train_counts.values, color='steelblue', alpha=0.7)
axes[0].set_title(f'Training Set (n={len(custom_data["y_train"])})')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')

# Plot test distribution
test_counts = custom_data['y_test'].value_counts()
axes[1].bar(test_counts.index, test_counts.values, color='coral', alpha=0.7)
axes[1].set_title(f'Test Set (n={len(custom_data["y_test"])})')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

## Summary and Next Steps

### What We've Learned:
1. **Data Exploration**: How to thoroughly explore a classification dataset
2. **Feature Analysis**: Identifying important features using statistical methods
3. **Visualization**: Multiple ways to visualize classification data
4. **Data Preparation**: Properly splitting and scaling data for ML
5. **Class Balance**: Checking for imbalanced datasets and knowing the options for handling them
6. **Feature Engineering**: Creating new features to improve classification

### Key Takeaways:
- Always explore your data before building models
- Check for class imbalance and outliers
- Use multiple visualization techniques to understand patterns
- Feature importance helps focus on relevant variables
- Proper data preparation is crucial for model performance

### Next Steps:
1. **Part 2**: Build and compare different classification algorithms
2. **Part 3**: Evaluate models and tune hyperparameters
3. **Exercises**: Try with your own dataset

### Exercises to Try:
1. Generate a highly imbalanced dataset (90% one class) and explore it
2. Create additional interaction features and check their importance
3. Compare the innovation dataset with the standard datasets
4. Try different outlier detection thresholds
5. Visualize the data using different dimensionality reduction techniques
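
As a starting point for the first exercise, the sketch below builds a toy label vector with roughly 10% positives and computes two quick diagnostics: the imbalance ratio used earlier in this notebook and the accuracy of always predicting the majority class. The data and variable names are illustrative assumptions, not output from `generate_innovation_dataset`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy labels: each sample is positive with probability 0.1
y = (rng.random(1000) < 0.1).astype(int)

counts = np.bincount(y, minlength=2)

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()

# Accuracy of a "classifier" that always predicts the majority class
majority_baseline_accuracy = counts.max() / counts.sum()

print("Class counts:", counts)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

A baseline near 90% accuracy without learning anything is the reason accuracy alone can mislead on imbalanced data, and it gives a reference point for any model built on such a dataset.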

In [None]:
# Cell 17: Export prepared data for next notebooks

# Save the prepared data for use in other notebooks
print("Saving prepared data for next notebooks...")

# Save as CSV
innovation_df.to_csv('innovation_dataset.csv', index=False)
innovation_enhanced.to_csv('innovation_dataset_enhanced.csv', index=False)

print("✓ Saved: innovation_dataset.csv")
print("✓ Saved: innovation_dataset_enhanced.csv")
print("\nData ready for Part 2: Building Classification Models!")