# Vehicle Dataset Classification with Fuzzy Rough Sets

This notebook demonstrates how to use the Eddy library (Fuzzy LEM2) for vehicle classification.

## Dataset Information
- **846 samples** of vehicle silhouettes
- **18 continuous features** (shape measurements)
- **4 classes**: van, saab, bus, opel

## What is Fuzzy LEM2?
LEM2 (Learning from Examples Module, version 2) is a rule induction algorithm based on rough set theory. The fuzzy extension handles continuous features by working with fuzzy membership degrees instead of crisp set membership.
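
As a rough illustration of what a fuzzy membership degree is, the sketch below uses a triangular membership function, one common choice in fuzzy set theory. It is not taken from Eddy: the name `triangular_membership` and the example breakpoints are assumptions made purely for illustration, and the library may define memberships differently.

```python
def triangular_membership(x, left, peak, right):
    """Degree (0..1) to which x belongs to a triangle-shaped fuzzy set.

    Hypothetical helper for illustration; assumes left < peak < right.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# A made-up "medium compactness" set spanning 80..100 with its peak at 90.
for value in (80, 85, 90, 97.5, 105):
    print(value, triangular_membership(value, 80, 90, 100))
```

A value of 85 belongs to this made-up set with degree 0.5, while 90 belongs fully; a crisp interval would only answer "inside" or "outside".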

## 1. Setup and Imports

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    accuracy_score,
    ConfusionMatrixDisplay
)

# Eddy library imports
from eddy.fuzzylem import FuzzyLEM2Classifier
import eddy.datasets as data

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("✅ Imports successful!")

✅ Imports successful!


## 2. Load and Explore the Dataset

In [2]:
# Load the vehicle dataset
(X, y), ds_name = data.vehicle()

print(f"Dataset: {ds_name}")
print(f"Shape: {X.shape}")
print(f"Samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print(f"Classes: {len(np.unique(y))}")

Dataset: vehicle
Shape: (768, 8)
Samples: 768
Features: 8
Classes: 1
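
The dataset summary at the top of this notebook lists 846 samples, 18 features and 4 classes, so it can be worth checking those numbers right after loading to catch a wrong file or loader early. The helper below is a minimal sketch: `check_dataset` is a hypothetical name, and the synthetic arrays only stand in for the real `X` and `y`.

```python
import numpy as np


def check_dataset(X, y, n_samples, n_features, n_classes):
    """Return a list of human-readable mismatches (empty if all checks pass)."""
    problems = []
    if X.shape != (n_samples, n_features):
        problems.append(f"X has shape {X.shape}, expected {(n_samples, n_features)}")
    if len(y) != n_samples:
        problems.append(f"y has {len(y)} labels, expected {n_samples}")
    found = len(np.unique(y))
    if found != n_classes:
        problems.append(f"found {found} classes, expected {n_classes}")
    return problems


# Synthetic stand-in with the documented dimensions (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(846, 18))
y_demo = np.arange(846) % 4
print(check_dataset(X_demo, y_demo, 846, 18, 4))
```

In the notebook this would be called as `check_dataset(X, y, 846, 18, 4)`; any non-empty result is a reason to inspect the loader before training.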


In [None]:
# Feature names (from the CSV header)
feature_names = [
    'Compactness', 'Circularity', 'Distance_circularity', 'Radius_ratio',
    'Praxis_aspect_ratio', 'Max_length_aspect_ratio', 'Scatter_ratio',
    'Elongatedness', 'Praxis_rectangular', 'Length_rectangular',
    'Major_variance', 'Minor_variance', 'Gyration_radius',
    'Major_skewness', 'Minor_skewness', 'Minor_kurtosis',
    'Major_kurtosis', 'Hollows_ratio'
]

class_names = ['van', 'saab', 'bus', 'opel']

# Create a DataFrame for easier exploration
df = pd.DataFrame(X, columns=feature_names)
df['Class'] = y
df['Class_Name'] = df['Class'].map(dict(enumerate(class_names)))  # 0 -> 'van', 1 -> 'saab', ...

print("\nüìä First few rows:")
df.head()

## 3. Data Exploration and Visualization

In [None]:
# Class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
class_counts = df['Class_Name'].value_counts()
axes[0].bar(class_counts.index, class_counts.values, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'])
axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Vehicle Type')
axes[0].set_ylabel('Count')
axes[0].grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, (name, count) in enumerate(class_counts.items()):
    axes[0].text(i, count + 5, str(count), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%',
            colors=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A'], startangle=90)
axes[1].set_title('Class Distribution (%)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nüìà Class Statistics:")
for name in class_names:
    count = (df['Class_Name'] == name).sum()
    percentage = count / len(df) * 100
    print(f"   {name:>6s}: {count:3d} samples ({percentage:5.1f}%)")

In [None]:
# Feature statistics
print("\nüìê Feature Statistics:")
df[feature_names].describe()

In [None]:
# Visualize some key features by class
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

features_to_plot = ['Compactness', 'Circularity', 'Elongatedness', 
                    'Scatter_ratio', 'Major_variance', 'Minor_variance']

for idx, feature in enumerate(features_to_plot):
    for class_name in class_names:
        data_subset = df[df['Class_Name'] == class_name][feature]
        axes[idx].hist(data_subset, alpha=0.5, label=class_name, bins=20)
    
    axes[idx].set_title(f'{feature} Distribution', fontweight='bold')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap (first 10 features for readability)
plt.figure(figsize=(12, 10))
correlation_matrix = df[feature_names[:10]].corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            square=True, cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Heatmap (First 10 Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
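
An annotated heatmap can be hard to scan, so listing the strongly correlated feature pairs directly is a useful complement. The sketch below does this with `np.corrcoef` on a small synthetic matrix; `correlated_pairs` and the 0.9 threshold are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # rowvar=False: columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Synthetic data: column "b" is almost a copy of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
pairs = correlated_pairs(X_demo, ["a", "b", "c"])
print(pairs)
```

Applied to the notebook's data, the call would be `correlated_pairs(X, feature_names)`.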

## 4. Data Preparation

In [None]:
# Split the data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
print(f"\nTraining class distribution:")
train_dist = pd.Series(y_train).value_counts().sort_index()
for class_idx, count in train_dist.items():
    print(f"   {class_names[int(class_idx)]}: {count} samples")
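
To make the effect of `stratify=y` concrete, the sketch below implements a simplified stratified split in plain Python: each class is shuffled and split separately, so class proportions are roughly preserved in both parts. It illustrates the idea only and is not scikit-learn's implementation; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 100 samples of class 0 and 40 samples of class 1.
labels = [0] * 100 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "train /", len(test_idx), "test")
print(sum(labels[i] for i in test_idx), "class-1 samples in the test set")
```

Without stratification, a random split of imbalanced classes could leave a small class under-represented in the test set.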

## 5. Train Fuzzy LEM2 Classifier

### Parameter Explanation
- **alpha** (0.05): Dependency threshold - controls how strict the rules are
  - Lower values → stricter rules (fewer false positives)
  - Higher values → more lenient rules (better coverage)

- **beta** (0.2): Covering threshold - controls how precisely rules must cover the concept
  - Lower values → more precise covering required
  - Higher values → allows partial covering

In [None]:
# Initialize the classifier
clf = FuzzyLEM2Classifier(alpha=0.05, beta=0.2)

print("ü§ñ Training Fuzzy LEM2 Classifier...")
print("‚è≥ This may take a few minutes...\n")

# Train the model
clf.fit(X_train, y_train)

print("\n✅ Training complete!")

In [None]:
# Analyze the generated rules
print("\nüìã Generated Rules Summary:")
print("-" * 40)

total_rules = 0
for class_idx, class_name in enumerate(class_names):
    rules = clf.rules_[class_idx]
    num_rules = len(rules)
    total_rules += num_rules
    print(f"{class_name:>6s}: {num_rules:3d} rule complexes")

print("-" * 40)
print(f"Total:  {total_rules:3d} rule complexes")

## 6. Make Predictions and Evaluate

In [None]:
# Make predictions on test set
print("üîÆ Making predictions on test set...")
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"\nüéØ Overall Accuracy: {accuracy:.2%}")

In [None]:
# Detailed classification report
print("\n" + "="*70)
print("üìà CLASSIFICATION REPORT")
print("="*70)
print()
print(classification_report(y_test, y_pred, target_names=class_names, digits=3))

In [None]:
# Confusion Matrix Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Confusion matrix - counts
cm = confusion_matrix(y_test, y_pred)
disp1 = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp1.plot(ax=axes[0], cmap='Blues', values_format='d')
axes[0].set_title('Confusion Matrix (Counts)', fontsize=14, fontweight='bold')

# Confusion matrix - percentages
cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
disp2 = ConfusionMatrixDisplay(confusion_matrix=cm_percent, display_labels=class_names)
disp2.plot(ax=axes[1], cmap='Greens', values_format='.1f')
axes[1].set_title('Confusion Matrix (Percentages)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
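
Row-normalizing divides each row by the number of true samples in that class, which produces a division by zero if a class never appears in `y_test`. The sketch below shows one way to guard against that with NumPy; `row_percentages` is a hypothetical helper and the matrix is made up.

```python
import numpy as np


def row_percentages(cm):
    """Convert a confusion matrix of counts to row-wise percentages.

    Rows that sum to zero (class absent from the true labels) stay at 0.
    """
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    out = np.zeros_like(cm)
    # Only divide where the row sum is positive; other entries keep their 0.
    np.divide(cm, row_sums, out=out, where=row_sums > 0)
    return out * 100


cm_demo = [[8, 2, 0],
           [1, 9, 0],
           [0, 0, 0]]  # third class has no true samples
print(row_percentages(cm_demo))
```

With a stratified split of a dataset where every class is well represented, the guard is not expected to trigger, but it keeps the cell from emitting NaN values on smaller subsets.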

In [None]:
# Per-class performance
print("\nüìä Per-Class Performance:")
print("-" * 50)

cm = confusion_matrix(y_test, y_pred)
per_class_data = []

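for i, name in enumerate(class_names):
    tp = cm[i, i]
    total = cm[i].sum()
    # Fraction of this class's samples that were classified correctly.
    # Note: this per-class "accuracy" is the same quantity as recall.
    accuracy_class = tp / total if total > 0 else 0
    
    # Precision: share of predictions for this class that were correct
    predicted = cm[:, i].sum()
    precision = tp / predicted if predicted > 0 else 0
    recall = accuracy_class
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0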
    
    per_class_data.append({
        'Class': name,
        'Accuracy': accuracy_class,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })
    
    print(f"{name:>6s}: Acc={accuracy_class:.2%}, Prec={precision:.2%}, Rec={recall:.2%}, F1={f1:.3f}")

# Visualize per-class performance
perf_df = pd.DataFrame(per_class_data)
perf_df.set_index('Class')[['Accuracy', 'Precision', 'Recall', 'F1-Score']].plot(
    kind='bar', figsize=(12, 6), rot=0
)
plt.title('Per-Class Performance Metrics', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.legend(loc='lower right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
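
As a worked example of the formulas in the cell above, the sketch below derives precision, recall and F1 for each class from a small made-up confusion matrix (rows are true classes, columns are predictions). The helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        actual = sum(cm[i])                          # row sum: true members of class i
        predicted = sum(cm[r][i] for r in range(n))  # column sum: predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


cm_demo = [[8, 2],
           [4, 6]]
for label, (p, r, f1) in zip(["a", "b"], per_class_metrics(cm_demo)):
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

For class "a", 8 of its 10 true samples are found (recall 0.8), but only 8 of the 12 samples predicted as "a" are correct (precision about 0.667).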

## 7. Example Predictions

In [None]:
# Show some example predictions
print("\nüîç Example Predictions (First 10 test samples):")
print("-" * 60)
print(f"{'Sample':>6} | {'True':>6} | {'Predicted':>10} | {'Correct':>7}")
print("-" * 60)

for i in range(min(10, len(y_test))):
    true_class = class_names[int(y_test[i])]
    pred_class = class_names[int(y_pred[i])]
    correct = "✓" if y_test[i] == y_pred[i] else "✗"
    print(f"{i+1:>6} | {true_class:>6} | {pred_class:>10} | {correct:>7}")

## 8. Parameter Tuning Experiment (Optional)

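**Warning**: This section trains one classifier for each of the 9 parameter combinations (3 alpha values × 3 beta values), so it can take a long time to run. Feel free to skip it or reduce the parameter ranges.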

In [None]:
# Test different parameter combinations
alpha_values = [0.01, 0.05, 0.1]
beta_values = [0.1, 0.2, 0.3]

results = []

print("üî¨ Testing different parameter combinations...\n")
print("This will take several minutes...\n")

for alpha in alpha_values:
    for beta in beta_values:
        print(f"Testing alpha={alpha}, beta={beta}...", end=" ")
        
        try:
            clf_test = FuzzyLEM2Classifier(alpha=alpha, beta=beta)
            clf_test.fit(X_train, y_train)
            y_pred_test = clf_test.predict(X_test)
            acc = accuracy_score(y_test, y_pred_test)
            
            results.append({
                'alpha': alpha,
                'beta': beta,
                'accuracy': acc
            })
            
            print(f"Accuracy: {acc:.2%}")
        except Exception as e:
            print(f"Failed: {e}")

print("\n✅ Parameter tuning complete!")

In [None]:
# Visualize parameter tuning results
if results:
    results_df = pd.DataFrame(results)
    
    # Create pivot table for heatmap
    pivot_table = results_df.pivot(index='alpha', columns='beta', values='accuracy')
    
    plt.figure(figsize=(10, 6))
    sns.heatmap(pivot_table, annot=True, fmt='.3f', cmap='YlOrRd', 
                cbar_kws={'label': 'Accuracy'})
    plt.title('Parameter Tuning Results (Alpha vs Beta)', fontsize=14, fontweight='bold')
    plt.xlabel('Beta (covering threshold)')
    plt.ylabel('Alpha (dependency threshold)')
    plt.tight_layout()
    plt.show()
    
    # Find best parameters
    best_result = results_df.loc[results_df['accuracy'].idxmax()]
    print(f"\nüèÜ Best Parameters:")
    print(f"   Alpha: {best_result['alpha']}")
    print(f"   Beta: {best_result['beta']}")
    print(f"   Accuracy: {best_result['accuracy']:.2%}")
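
The loop above selects `alpha` and `beta` by their accuracy on the test set, so the best score it reports is likely optimistic. A common alternative is to choose parameters on a separate validation split and evaluate on the test set only once at the end. The sketch below shows that pattern in plain Python with a toy one-feature model; `ThresholdModel`, `grid_search`, `accuracy` and the data are hypothetical stand-ins, and in the notebook the factory would build a `FuzzyLEM2Classifier(alpha=..., beta=...)` instead.

```python
from itertools import product


class ThresholdModel:
    """Toy classifier: predicts 1 when x * scale > cutoff (stand-in for a real model)."""

    def __init__(self, scale, cutoff):
        self.scale = scale
        self.cutoff = cutoff

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if x * self.scale > self.cutoff else 0 for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def grid_search(make_model, grid, X_tr, y_tr, X_val, y_val):
    """Try every parameter combination; return (best_params, best_val_accuracy)."""
    best_params, best_score = None, -1.0
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = make_model(**params).fit(X_tr, y_tr)
        score = accuracy(y_val, model.predict(X_val))  # validation data only
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Made-up data: the label is 1 exactly when x > 5.
X_tr_demo, y_tr_demo = [1, 3, 6, 8], [0, 0, 1, 1]
X_val_demo, y_val_demo = [2, 4, 5.5, 7, 9], [0, 0, 1, 1, 1]
grid = {"scale": [1.0, 2.0], "cutoff": [3.0, 5.0, 7.0]}

best_params, best_score = grid_search(
    ThresholdModel, grid, X_tr_demo, y_tr_demo, X_val_demo, y_val_demo
)
print(best_params, best_score)
```

The validation split can be carved out of `X_train` with a second call to `train_test_split`, leaving `X_test` untouched until the final evaluation.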

## 9. Comparison with Other Classifiers (Optional)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Train different classifiers
classifiers = {
    'Fuzzy LEM2': clf,
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB()
}

comparison_results = []

print("üî¨ Comparing with other classifiers...\n")

for name, classifier in classifiers.items():
    if name != 'Fuzzy LEM2':
        print(f"Training {name}...", end=" ")
        classifier.fit(X_train, y_train)
        print("Done.")
    
    y_pred_comp = classifier.predict(X_test)
    acc = accuracy_score(y_test, y_pred_comp)
    
    comparison_results.append({
        'Classifier': name,
        'Accuracy': acc
    })

print("\n✅ Comparison complete!")

In [None]:
# Visualize comparison
comp_df = pd.DataFrame(comparison_results).sort_values('Accuracy', ascending=False)

plt.figure(figsize=(12, 6))
colors = ['#FF6B6B' if x == 'Fuzzy LEM2' else '#95E1D3' for x in comp_df['Classifier']]
bars = plt.barh(comp_df['Classifier'], comp_df['Accuracy'], color=colors)

# Add value labels
for i, (bar, acc) in enumerate(zip(bars, comp_df['Accuracy'])):
    plt.text(acc + 0.01, i, f'{acc:.2%}', va='center', fontweight='bold')

plt.xlabel('Accuracy', fontweight='bold')
plt.title('Classifier Comparison on Vehicle Dataset', fontsize=14, fontweight='bold')
plt.xlim(0, 1)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìä Classifier Rankings:")
print(comp_df.to_string(index=False))
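
Accuracy figures are easier to interpret next to a trivial baseline that always predicts the most frequent training class. The sketch below computes that baseline in plain Python on made-up labels; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)


# Made-up labels for illustration only.
y_train_demo = ["bus"] * 5 + ["van"] * 3 + ["saab"] * 2
y_test_demo = ["bus", "van", "bus", "saab"]
label, acc = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(label, acc)
```

scikit-learn offers the same idea as `DummyClassifier(strategy="most_frequent")`, which could be added to the `classifiers` dictionary alongside the other models.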

## 10. Summary and Conclusions

In [None]:
print("="*70)
print("üìù SUMMARY")
print("="*70)
print(f"\n‚úÖ Successfully trained Fuzzy LEM2 classifier on vehicle dataset")
print(f"\nüìä Key Results:")
print(f"   ‚Ä¢ Dataset: {X.shape[0]} samples, {X.shape[1]} features, {len(class_names)} classes")
print(f"   ‚Ä¢ Test Accuracy: {accuracy:.2%}")
print(f"   ‚Ä¢ Total Rules Generated: {sum(len(clf.rules_[i]) for i in range(len(class_names)))}")
print(f"\nüí° Key Observations:")
print(f"   ‚Ä¢ Vehicle classification is challenging due to overlapping features")
print(f"   ‚Ä¢ Fuzzy LEM2 generates interpretable if-then rules")
print(f"   ‚Ä¢ Parameters (alpha, beta) significantly affect performance")
print("\n" + "="*70)