# 🔍 Comprehensive Feature Analysis & Dataset Combination
## Analyzing ALL Fontys Travel Datasets

**Goal:** 
1. Analyze features in ALL 6 datasets
2. Compare which dataset has the BEST features
3. Identify which datasets can be JOINED together
4. Create a COMBINED dataset for modeling

---

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


---
## 📂 Section 1: Load ALL Datasets

Let's load all 6 datasets and handle any loading issues.

In [7]:
# Define file paths
files = {
    'df1_trip_flight': 'trip-overview-flight-20251013010007-fontys.csv',
    'df2_deelnemer': 'export_Deelnemer_boekingen_Vertrekkend_januari_2025-fontys.xls',
    'df3_reizen_riss': 'export_reizen_Riss_31-12-2025-fontys.xlsx',
    'df4_overzicht_cax': 'Overzicht_reizen_vanuit_cax-GV_vertrek_juni_2025_en_verder_-_fontys.xlsx',
    'df5_boekingen_total': 'Export_boekingen_total_RISS_31-12-2024_-_fontys.xlsx',
    'df6_grip_freeze': 'Grip_export_booking_freeze_1_wk-fontys.xlsx'
}

datasets = {}

print("Loading datasets...\n")
print("="*80)

for name, file in files.items():
    try:
        if file.endswith('.csv'):
            # Try different encodings and separators for CSV
            try:
                df = pd.read_csv(file, encoding='utf-8')
            except:
                try:
                    df = pd.read_csv(file, encoding='latin-1')
                except:
                    df = pd.read_csv(file, encoding='utf-8', on_bad_lines='skip')
        else:
            df = pd.read_excel(file)
        
        datasets[name] = df
        print(f"✅ {name}: {len(df):,} rows × {len(df.columns)} columns")
    except Exception as e:
        print(f"❌ {name}: ERROR - {e}")
        datasets[name] = None

print("="*80)
print(f"\n✅ Successfully loaded {sum(1 for v in datasets.values() if v is not None)} out of {len(files)} datasets!")

Loading datasets...

❌ df1_trip_flight: ERROR - [Errno 2] No such file or directory: 'trip-overview-flight-20251013010007-fontys.csv'
❌ df2_deelnemer: ERROR - [Errno 2] No such file or directory: 'export_Deelnemer_boekingen_Vertrekkend_januari_2025-fontys.xls'
❌ df3_reizen_riss: ERROR - [Errno 2] No such file or directory: 'export_reizen_Riss_31-12-2025-fontys.xlsx'
❌ df4_overzicht_cax: ERROR - [Errno 2] No such file or directory: 'Overzicht_reizen_vanuit_cax-GV_vertrek_juni_2025_en_verder_-_fontys.xlsx'
❌ df5_boekingen_total: ERROR - [Errno 2] No such file or directory: 'Export_boekingen_total_RISS_31-12-2024_-_fontys.xlsx'
❌ df6_grip_freeze: ERROR - [Errno 2] No such file or directory: 'Grip_export_booking_freeze_1_wk-fontys.xlsx'

✅ Successfully loaded 0 out of 6 datasets!


---
## 📊 Section 2: Quick Dataset Overview

Let's see the basic structure of each dataset.

In [3]:
print("\n" + "="*100)
print("DATASET OVERVIEW")
print("="*100)

overview_data = []

for name, df in datasets.items():
    if df is not None:
        overview_data.append({
            'Dataset': name,
            'Rows': f"{len(df):,}",
            'Columns': len(df.columns),
            'Memory': f"{df.memory_usage(deep=True).sum() / 1024**2:.2f} MB"
        })

overview_df = pd.DataFrame(overview_data)
print(overview_df.to_string(index=False))
print("="*100)


DATASET OVERVIEW
Empty DataFrame
Columns: []
Index: []


---
## 🔍 Section 3: Detailed Feature Analysis - Each Dataset

Let's analyze the features (columns) in EACH dataset individually.

In [4]:
def analyze_dataset_features(df, dataset_name):
    """
    Comprehensive feature analysis for a single dataset
    """
    print("\n" + "="*100)
    print(f"📊 FEATURE ANALYSIS: {dataset_name}")
    print("="*100)
    
    if df is None:
        print("❌ Dataset not loaded")
        return None
    
    # Basic info
    print(f"\n📏 Dimensions: {len(df):,} rows × {len(df.columns)} columns")
    print(f"\n📋 Column Names:\n{', '.join(df.columns.tolist())}")
    
    # Feature quality analysis
    feature_quality = []
    
    for col in df.columns:
        total = len(df)
        missing = df[col].isna().sum()
        missing_pct = (missing / total) * 100
        unique = df[col].nunique()
        unique_pct = (unique / total) * 100
        dtype = str(df[col].dtype)
        
        # Quality score (0-100)
        completeness_score = 100 - missing_pct
        variety_score = min(unique_pct, 100) if unique > 1 else 0
        quality_score = (completeness_score * 0.7) + (variety_score * 0.3)
        
        feature_quality.append({
            'Column': col,
            'Type': dtype,
            'Missing': missing,
            'Missing %': f"{missing_pct:.1f}%",
            'Unique': unique,
            'Unique %': f"{unique_pct:.1f}%",
            'Quality Score': f"{quality_score:.1f}"
        })
    
    feature_df = pd.DataFrame(feature_quality)
    feature_df = feature_df.sort_values('Quality Score', ascending=False)
    
    print(f"\n🎯 FEATURE QUALITY RANKING:\n")
    print(feature_df.to_string(index=False))
    
    # Summary statistics
    avg_quality = feature_df['Quality Score'].str.replace('%', '').astype(float).mean()
    print(f"\n📈 Average Feature Quality: {avg_quality:.1f}/100")
    
    # First few rows
    print(f"\n👀 Sample Data (first 3 rows):\n")
    print(df.head(3).to_string())
    
    return feature_df

# Analyze all datasets
feature_analyses = {}
for name, df in datasets.items():
    feature_analyses[name] = analyze_dataset_features(df, name)


📊 FEATURE ANALYSIS: df1_trip_flight
❌ Dataset not loaded

📊 FEATURE ANALYSIS: df2_deelnemer
❌ Dataset not loaded

📊 FEATURE ANALYSIS: df3_reizen_riss
❌ Dataset not loaded

📊 FEATURE ANALYSIS: df4_overzicht_cax
❌ Dataset not loaded

📊 FEATURE ANALYSIS: df5_boekingen_total
❌ Dataset not loaded

📊 FEATURE ANALYSIS: df6_grip_freeze
❌ Dataset not loaded


---
## 🏆 Section 4: Compare Datasets - Which Has the BEST Features?

Let's compare the overall quality across all datasets.

In [5]:
print("\n" + "="*100)
print("🏆 DATASET COMPARISON - WHICH HAS THE BEST FEATURES?")
print("="*100)

comparison_data = []

for name, df in datasets.items():
    if df is not None:
        # Calculate metrics
        total_cells = df.shape[0] * df.shape[1]
        missing_cells = df.isna().sum().sum()
        completeness = ((total_cells - missing_cells) / total_cells) * 100
        
        # Get average quality from feature analysis
        if feature_analyses[name] is not None:
            avg_quality = float(feature_analyses[name]['Quality Score'].str.replace('%', '').mean())
        else:
            avg_quality = 0
        
        # Calculate feature diversity (how many different types of features)
        numeric_cols = len(df.select_dtypes(include=[np.number]).columns)
        categorical_cols = len(df.select_dtypes(include=['object']).columns)
        datetime_cols = len(df.select_dtypes(include=['datetime64']).columns)
        
        comparison_data.append({
            'Dataset': name,
            'Rows': len(df),
            'Features': len(df.columns),
            'Completeness %': f"{completeness:.1f}%",
            'Avg Quality': f"{avg_quality:.1f}",
            'Numeric': numeric_cols,
            'Categorical': categorical_cols,
            'DateTime': datetime_cols
        })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Avg Quality', ascending=False)

print("\n" + comparison_df.to_string(index=False))
print("\n" + "="*100)

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Average Quality Score
quality_scores = comparison_df['Avg Quality'].str.replace('%', '').astype(float)
colors = ['#2ecc71' if x > 80 else '#f39c12' if x > 60 else '#e74c3c' for x in quality_scores]
axes[0].barh(comparison_df['Dataset'], quality_scores, color=colors)
axes[0].set_xlabel('Average Feature Quality Score', fontsize=12)
axes[0].set_title('🏆 Dataset Quality Comparison', fontsize=14, fontweight='bold')
axes[0].axvline(x=80, color='green', linestyle='--', alpha=0.5, label='Excellent (80+)')
axes[0].axvline(x=60, color='orange', linestyle='--', alpha=0.5, label='Good (60+)')
axes[0].legend()

# Plot 2: Feature Count and Types
comparison_df.plot(x='Dataset', y=['Numeric', 'Categorical', 'DateTime'], 
                   kind='bar', stacked=True, ax=axes[1], 
                   color=['#3498db', '#e74c3c', '#f39c12'])
axes[1].set_title('📊 Feature Type Distribution', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Number of Features')
axes[1].set_xlabel('')
axes[1].legend(title='Feature Type')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Winner announcement
best_dataset = comparison_df.iloc[0]['Dataset']
best_score = comparison_df.iloc[0]['Avg Quality']
print(f"\n🥇 WINNER: {best_dataset} with average quality score of {best_score}!")


🏆 DATASET COMPARISON - WHICH HAS THE BEST FEATURES?


KeyError: 'Avg Quality'

---
## 🔗 Section 5: Identify JOIN KEYS - Which Datasets Can Work Together?

Let's find common columns that can be used to merge datasets.

In [None]:
print("\n" + "="*100)
print("🔗 IDENTIFYING JOIN KEYS - Which Datasets Can Be Combined?")
print("="*100)

# Get all column names from all datasets
all_columns = {}
for name, df in datasets.items():
    if df is not None:
        all_columns[name] = set(df.columns)

# Find common columns between datasets
print("\n📋 COMMON COLUMNS (Potential Join Keys):\n")

dataset_names = list(all_columns.keys())
join_possibilities = []

for i, ds1 in enumerate(dataset_names):
    for ds2 in dataset_names[i+1:]:
        common = all_columns[ds1] & all_columns[ds2]
        if common:
            print(f"\n{ds1} ⟷ {ds2}")
            print(f"  Common columns: {', '.join(sorted(common))}")
            join_possibilities.append({
                'Dataset 1': ds1,
                'Dataset 2': ds2,
                'Common Columns': ', '.join(sorted(common)),
                'Join Keys': len(common)
            })

if join_possibilities:
    join_df = pd.DataFrame(join_possibilities)
    join_df = join_df.sort_values('Join Keys', ascending=False)
    print("\n" + "="*100)
    print("\n🎯 JOIN POSSIBILITIES RANKED:\n")
    print(join_df.to_string(index=False))
else:
    print("\n⚠️ No common columns found between datasets!")
    print("   → Datasets may need to be analyzed separately")
    print("   → Or manual mapping/ID creation may be required")

print("\n" + "="*100)

---
## 🧩 Section 6: Analyze Specific Join Keys

Let's examine the quality of potential join keys in detail.

In [None]:
def analyze_join_key(df1, df2, df1_name, df2_name, key_column):
    """
    Analyze a potential join key between two datasets
    """
    print(f"\n🔍 Analyzing JOIN KEY: '{key_column}'")
    print(f"   Between: {df1_name} ⟷ {df2_name}")
    print("-" * 80)
    
    # Get unique values
    df1_values = set(df1[key_column].dropna().unique())
    df2_values = set(df2[key_column].dropna().unique())
    
    # Calculate overlap
    common_values = df1_values & df2_values
    only_in_df1 = df1_values - df2_values
    only_in_df2 = df2_values - df1_values
    
    # Calculate match percentage
    match_pct = (len(common_values) / max(len(df1_values), len(df2_values))) * 100
    
    print(f"\n📊 Statistics:")
    print(f"   {df1_name}: {len(df1_values):,} unique values")
    print(f"   {df2_name}: {len(df2_values):,} unique values")
    print(f"   Common values: {len(common_values):,}")
    print(f"   Only in {df1_name}: {len(only_in_df1):,}")
    print(f"   Only in {df2_name}: {len(only_in_df2):,}")
    print(f"\n✨ Match percentage: {match_pct:.1f}%")
    
    # Determine if it's a good join key
    if match_pct > 80:
        quality = "🟢 EXCELLENT - Strong overlap, great for joining!"
    elif match_pct > 50:
        quality = "🟡 GOOD - Moderate overlap, usable for joining"
    elif match_pct > 20:
        quality = "🟠 FAIR - Limited overlap, may lose data"
    else:
        quality = "🔴 POOR - Very little overlap, not recommended"
    
    print(f"\n{quality}")
    
    return {
        'key': key_column,
        'match_pct': match_pct,
        'common': len(common_values)
    }

# Analyze each potential join
print("\n" + "="*100)
print("🔍 DETAILED JOIN KEY ANALYSIS")
print("="*100)

join_quality = []

for i, ds1_name in enumerate(dataset_names):
    df1 = datasets[ds1_name]
    if df1 is None:
        continue
    
    for ds2_name in dataset_names[i+1:]:
        df2 = datasets[ds2_name]
        if df2 is None:
            continue
        
        common_cols = all_columns[ds1_name] & all_columns[ds2_name]
        
        if common_cols:
            for col in sorted(common_cols):
                try:
                    result = analyze_join_key(df1, df2, ds1_name, ds2_name, col)
                    join_quality.append({
                        'Dataset 1': ds1_name,
                        'Dataset 2': ds2_name,
                        'Join Key': col,
                        'Match %': f"{result['match_pct']:.1f}%",
                        'Common Values': result['common']
                    })
                except Exception as e:
                    print(f"   ⚠️ Could not analyze column '{col}': {e}")

if join_quality:
    join_quality_df = pd.DataFrame(join_quality)
    join_quality_df['Match %'] = join_quality_df['Match %'].str.replace('%', '').astype(float)
    join_quality_df = join_quality_df.sort_values('Match %', ascending=False)
    join_quality_df['Match %'] = join_quality_df['Match %'].apply(lambda x: f"{x:.1f}%")
    
    print("\n" + "="*100)
    print("\n🎯 BEST JOIN KEYS SUMMARY:\n")
    print(join_quality_df.to_string(index=False))
    print("\n" + "="*100)

print("\n💡 TIP: Focus on join keys with >80% match for best results!")

---
## 🎯 Section 7: CREATE THE COMBINED DATASET

Based on our analysis, let's create the best combined dataset!

In [None]:
print("\n" + "="*100)
print("🎯 CREATING COMBINED DATASET")
print("="*100)

# Strategy: Start with the highest quality dataset and merge others
# Use the best join keys we identified

# Get the best dataset (highest quality)
best_dataset_name = comparison_df.iloc[0]['Dataset']
combined_df = datasets[best_dataset_name].copy()

print(f"\n🏁 Starting with: {best_dataset_name}")
print(f"   Initial size: {len(combined_df):,} rows × {len(combined_df.columns)} columns")

# Try to merge other datasets
merged_count = 0

for ds_name in dataset_names:
    if ds_name == best_dataset_name or datasets[ds_name] is None:
        continue
    
    # Find common columns
    common_cols = all_columns[best_dataset_name] & all_columns[ds_name]
    
    if not common_cols:
        print(f"\n⚠️ Skipping {ds_name}: No common columns for joining")
        continue
    
    # Try each common column as join key
    best_join_key = None
    best_match = 0
    
    for col in common_cols:
        try:
            # Calculate match percentage
            base_values = set(combined_df[col].dropna().unique())
            new_values = set(datasets[ds_name][col].dropna().unique())
            common_values = base_values & new_values
            match_pct = (len(common_values) / max(len(base_values), len(new_values))) * 100
            
            if match_pct > best_match:
                best_match = match_pct
                best_join_key = col
        except:
            continue
    
    if best_join_key and best_match > 20:  # Only merge if >20% match
        print(f"\n✅ Merging {ds_name} using '{best_join_key}' ({best_match:.1f}% match)")
        
        # Perform the merge
        before_cols = len(combined_df.columns)
        combined_df = combined_df.merge(
            datasets[ds_name],
            on=best_join_key,
            how='left',
            suffixes=('', f'_{ds_name}')
        )
        after_cols = len(combined_df.columns)
        new_cols = after_cols - before_cols
        
        print(f"   Added {new_cols} new columns")
        merged_count += 1
    else:
        print(f"\n⚠️ Skipping {ds_name}: Poor match quality ({best_match:.1f}%)")

print("\n" + "="*100)
print(f"\n🎉 COMBINED DATASET CREATED!")
print(f"   Final size: {len(combined_df):,} rows × {len(combined_df.columns)} columns")
print(f"   Datasets merged: {merged_count + 1}")
print("\n" + "="*100)

---
## 📊 Section 8: Analyze the Combined Dataset

Let's see what we have in our final combined dataset!

In [None]:
print("\n" + "="*100)
print("📊 COMBINED DATASET ANALYSIS")
print("="*100)

print(f"\n📏 Dimensions: {len(combined_df):,} rows × {len(combined_df.columns)} columns")
print(f"\n📋 All Columns ({len(combined_df.columns)} total):\n")

# Group columns by type
numeric_cols = combined_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = combined_df.select_dtypes(include=['object']).columns.tolist()
datetime_cols = combined_df.select_dtypes(include=['datetime64']).columns.tolist()

print(f"🔢 Numeric features ({len(numeric_cols)}):")
for col in numeric_cols[:20]:  # Show first 20
    print(f"   - {col}")
if len(numeric_cols) > 20:
    print(f"   ... and {len(numeric_cols) - 20} more")

print(f"\n📝 Categorical features ({len(categorical_cols)}):")
for col in categorical_cols[:20]:  # Show first 20
    print(f"   - {col}")
if len(categorical_cols) > 20:
    print(f"   ... and {len(categorical_cols) - 20} more")

if datetime_cols:
    print(f"\n📅 DateTime features ({len(datetime_cols)}):")
    for col in datetime_cols:
        print(f"   - {col}")

# Data quality
total_cells = combined_df.shape[0] * combined_df.shape[1]
missing_cells = combined_df.isna().sum().sum()
completeness = ((total_cells - missing_cells) / total_cells) * 100

print(f"\n📈 Data Quality:")
print(f"   Completeness: {completeness:.1f}%")
print(f"   Missing values: {missing_cells:,} out of {total_cells:,} cells")
print(f"   Memory usage: {combined_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Show sample
print(f"\n👀 Sample Data (first 5 rows):\n")
print(combined_df.head(5))

---
## 🎯 Section 9: Feature Selection for Modeling

Now let's identify which features are BEST for machine learning!

In [None]:
print("\n" + "="*100)
print("🎯 FEATURE SELECTION FOR MODELING")
print("="*100)

# Analyze each feature for modeling potential
feature_scores = []

for col in combined_df.columns:
    # Calculate metrics
    missing_pct = (combined_df[col].isna().sum() / len(combined_df)) * 100
    unique_count = combined_df[col].nunique()
    unique_pct = (unique_count / len(combined_df)) * 100
    dtype = str(combined_df[col].dtype)
    
    # Feature quality score (0-100)
    # Factors:
    # - Low missing values (70% weight)
    # - Good variety but not too unique (30% weight)
    completeness_score = 100 - missing_pct
    
    # Variety score: prefer 2-50% unique (not constant, not unique IDs)
    if unique_count <= 1:
        variety_score = 0  # Constant column
    elif unique_pct > 95:
        variety_score = 30  # Likely an ID, less useful
    elif 2 <= unique_pct <= 50:
        variety_score = 100  # Sweet spot!
    else:
        variety_score = 70
    
    total_score = (completeness_score * 0.7) + (variety_score * 0.3)
    
    # Determine recommendation
    if total_score >= 80 and missing_pct < 30:
        recommendation = "✅ EXCELLENT - Highly recommended"
    elif total_score >= 60 and missing_pct < 50:
        recommendation = "🟢 GOOD - Recommended"
    elif total_score >= 40:
        recommendation = "🟡 FAIR - Consider with caution"
    else:
        recommendation = "❌ POOR - Not recommended"
    
    feature_scores.append({
        'Feature': col,
        'Type': dtype,
        'Missing %': f"{missing_pct:.1f}%",
        'Unique': unique_count,
        'Unique %': f"{unique_pct:.1f}%",
        'Score': f"{total_score:.1f}",
        'Recommendation': recommendation
    })

# Create DataFrame and sort
feature_scores_df = pd.DataFrame(feature_scores)
feature_scores_df['Score_Numeric'] = feature_scores_df['Score'].astype(float)
feature_scores_df = feature_scores_df.sort_values('Score_Numeric', ascending=False)
feature_scores_df = feature_scores_df.drop('Score_Numeric', axis=1)

print("\n🏆 FEATURE RANKING FOR MODELING:\n")
print(feature_scores_df.to_string(index=False))

# Summary statistics
excellent = len(feature_scores_df[feature_scores_df['Recommendation'].str.contains('EXCELLENT')])
good = len(feature_scores_df[feature_scores_df['Recommendation'].str.contains('GOOD')])
fair = len(feature_scores_df[feature_scores_df['Recommendation'].str.contains('FAIR')])
poor = len(feature_scores_df[feature_scores_df['Recommendation'].str.contains('POOR')])

print("\n" + "="*100)
print("\n📊 FEATURE QUALITY SUMMARY:")
print(f"   ✅ Excellent features: {excellent}")
print(f"   🟢 Good features: {good}")
print(f"   🟡 Fair features: {fair}")
print(f"   ❌ Poor features: {poor}")
print(f"\n   💡 Recommended for modeling: {excellent + good} features")
print("="*100)

# Visualize feature quality distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Feature scores distribution
scores = feature_scores_df['Score'].astype(float)
axes[0].hist(scores, bins=20, color='#3498db', edgecolor='black')
axes[0].axvline(x=80, color='green', linestyle='--', label='Excellent (80+)', linewidth=2)
axes[0].axvline(x=60, color='orange', linestyle='--', label='Good (60+)', linewidth=2)
axes[0].axvline(x=40, color='red', linestyle='--', label='Poor (<40)', linewidth=2)
axes[0].set_xlabel('Feature Quality Score', fontsize=12)
axes[0].set_ylabel('Number of Features', fontsize=12)
axes[0].set_title('📊 Feature Quality Distribution', fontsize=14, fontweight='bold')
axes[0].legend()

# Plot 2: Recommendation breakdown
recommendation_counts = {
    'Excellent': excellent,
    'Good': good,
    'Fair': fair,
    'Poor': poor
}
colors = ['#2ecc71', '#3498db', '#f39c12', '#e74c3c']
axes[1].pie(recommendation_counts.values(), labels=recommendation_counts.keys(), 
            autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('🎯 Feature Recommendation Breakdown', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

---
## ✅ Section 10: Final Recommendations

Let's create a summary of the best features to use for modeling!

In [None]:
print("\n" + "="*100)
print("✅ FINAL RECOMMENDATIONS FOR MODELING")
print("="*100)

# Get recommended features
recommended_features = feature_scores_df[
    feature_scores_df['Recommendation'].str.contains('EXCELLENT|GOOD', regex=True)
]

print(f"\n🎯 RECOMMENDED FEATURES ({len(recommended_features)} total):\n")
print(recommended_features[['Feature', 'Type', 'Missing %', 'Score', 'Recommendation']].to_string(index=False))

print("\n" + "="*100)
print("\n💡 NEXT STEPS FOR YOUR MODEL:")
print("="*100)
print("""
1️⃣ **Start with EXCELLENT features** (highest quality)
   → These have <20% missing values and good variety
   → Perfect for initial model training

2️⃣ **Add GOOD features** for more information
   → These have <50% missing values
   → Can improve model performance

3️⃣ **Handle missing data:**
   → For numeric: use median/mean imputation
   → For categorical: use mode or create 'Unknown' category
   → Or use algorithms that handle missing values (XGBoost, LightGBM)

4️⃣ **Feature engineering:**
   → Create date-based features (day, month, year, day of week)
   → Combine categorical features
   → Create interaction features

5️⃣ **Encode categorical variables:**
   → Label encoding for ordinal features
   → One-hot encoding for nominal features
   → Target encoding for high-cardinality features

6️⃣ **Scale/normalize numeric features:**
   → StandardScaler or MinMaxScaler
   → Important for algorithms like SVM, Neural Networks
""")

print("="*100)
print("\n🎉 Analysis complete! You now know:")
print("   ✅ Which dataset has the best features")
print("   ✅ Which datasets can be joined together")
print("   ✅ How to create a combined dataset")
print("   ✅ Which features to use for modeling")
print("\n" + "="*100)

---
## 💾 Section 11: Save the Combined Dataset

Let's save our combined dataset for future use!

In [None]:
# Save the combined dataset
output_file = 'combined_fontys_dataset.csv'
combined_df.to_csv(output_file, index=False)

print(f"✅ Combined dataset saved to: {output_file}")
print(f"   Size: {len(combined_df):,} rows × {len(combined_df.columns)} columns")
print(f"   File size: {combined_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Also save the feature recommendations
feature_scores_df.to_csv('feature_recommendations.csv', index=False)
print(f"\n✅ Feature recommendations saved to: feature_recommendations.csv")

# Save recommended features list
recommended_feature_names = recommended_features['Feature'].tolist()
with open('recommended_features.txt', 'w') as f:
    f.write("\n".join(recommended_feature_names))
print(f"✅ Recommended feature names saved to: recommended_features.txt")

print("\n" + "="*100)
print("\n🎉 ALL DONE! You're ready to start modeling!")
print("="*100)