# Wine Cultivar Origin Prediction System
## Model Development and Training

**Author:** Eneasato David  
**Matric No:** 23CG034068  
**Algorithm:** Random Forest Classifier  
**Date:** January 21, 2026

---

This notebook implements a multiclass classification model to predict wine cultivar origin based on chemical properties.

## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Machine learning
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
)

# Model persistence
import joblib

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

print("All libraries imported successfully!")

## 2. Load and Explore the Wine Dataset

In [None]:
# Load the Wine dataset
wine_data = load_wine()

# Create a DataFrame for better visualization
df = pd.DataFrame(data=wine_data.data, columns=wine_data.feature_names)
df['cultivar'] = wine_data.target

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nTarget Distribution:")
print(df['cultivar'].value_counts().sort_index())

## 3. Data Preprocessing

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nNo missing values found!" if df.isnull().sum().sum() == 0 else "Missing values detected!")

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

## 4. Feature Selection

Six of the dataset's 13 features are selected based on domain knowledge and importance:
1. **alcohol** - Primary identifier
2. **malic_acid** - Taste profile
3. **ash** - Mineral content
4. **magnesium** - Essential mineral
5. **flavanoids** - Phenolic compounds
6. **proline** - Amino acid content

In [None]:
# Define selected features
SELECTED_FEATURES = [
    'alcohol',
    'malic_acid',
    'ash',
    'magnesium',
    'flavanoids',
    'proline'
]

# Extract features and target
X = df[SELECTED_FEATURES]
y = df['cultivar']

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nSelected Features:")
for i, feature in enumerate(SELECTED_FEATURES, 1):
    print(f"{i}. {feature}")

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, feature in enumerate(SELECTED_FEATURES):
    axes[idx].hist(X[feature], bins=20, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {feature}')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
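
One way to sanity-check the six hand-picked features is to rank all 13 features by importance from a forest trained on the full feature set. The sketch below is only an assumption about how such a check could be done, not the selection procedure used in this notebook; `full_rf` and `ranking` are illustrative names, and the forest settings are arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

SELECTED_FEATURES = ['alcohol', 'malic_acid', 'ash', 'magnesium', 'flavanoids', 'proline']

# Load all 13 features as pandas objects
wine = load_wine(as_frame=True)
X_all, y_all = wine.data, wine.target

# Fit a forest on every feature purely to rank them.
# The full dataset is used for exploration only; for model selection,
# rank on the training split to avoid peeking at test data.
full_rf = RandomForestClassifier(n_estimators=200, random_state=42)
full_rf.fit(X_all, y_all)

ranking = pd.Series(
    full_rf.feature_importances_, index=X_all.columns
).sort_values(ascending=False)

# Mark where each hand-picked feature lands in the data-driven ranking
for rank, (name, score) in enumerate(ranking.items(), 1):
    marker = '*' if name in SELECTED_FEATURES else ' '
    print(f"{rank:2d}. {marker} {name:30s} {score:.4f}")
```

Features marked with `*` are the ones used in this notebook. If a selected feature ranks low, that is a prompt to revisit the choice rather than proof that it is wrong, since impurity-based importances are only one view of relevance.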

## 5. Train-Test Split

In [None]:
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])
print("\nTraining set class distribution:")
print(y_train.value_counts().sort_index())
print("\nTest set class distribution:")
print(y_test.value_counts().sort_index())

## 6. Feature Scaling

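The features span very different ranges (proline is in the hundreds to thousands, while ash is between roughly 1 and 3). Random Forest itself is insensitive to feature scale, but the features are standardized here so that the saved pipeline also works with scale-sensitive models. The scaler is fit on the training set only, so no test-set statistics leak into training.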

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaling completed!")
print("\nScaled Training Data - First 5 rows:")
print(pd.DataFrame(X_train_scaled, columns=SELECTED_FEATURES).head())
print("\nMean of scaled features (should be ~0):")
print(pd.DataFrame(X_train_scaled, columns=SELECTED_FEATURES).mean())
print("\nStd of scaled features (should be ~1):")
print(pd.DataFrame(X_train_scaled, columns=SELECTED_FEATURES).std())
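
To make the transformation concrete, the sketch below reproduces `StandardScaler` by hand as z = (x - mean) / std, using training-set statistics only. It uses all 13 features for brevity, and `mu`, `sigma`, and `X_test_manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)

# Manual z-score: statistics come from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # NumPy default ddof=0, the same convention StandardScaler uses
X_test_manual = (X_test - mu) / sigma

# The manual result should agree with the fitted scaler
print("Matches StandardScaler:", np.allclose(X_test_manual, scaler.transform(X_test)))
```

Note that pandas `.std()` defaults to `ddof=1`, so a standard deviation computed that way on scaled data comes out slightly above 1 rather than exactly 1.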

## 7. Model Training - Random Forest Classifier

Random Forest is chosen for its:
- Robustness against overfitting
- Ability to handle non-linear relationships
- Feature importance estimation
- Excellent performance on multiclass problems

In [None]:
# Initialize Random Forest Classifier with manually chosen hyperparameters (not tuned)
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced'
)

# Train the model
print("Training Random Forest Classifier...")
rf_model.fit(X_train_scaled, y_train)
print("Model training completed!")

# Display feature importances
feature_importance = pd.DataFrame({
    'Feature': SELECTED_FEATURES,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importance)
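
`GridSearchCV` is imported at the top of the notebook but never used. A minimal sketch of how the hyperparameters could be searched instead of set by hand is shown below. The grid values are assumptions chosen to keep the search small, not tuned recommendations, and scaling is omitted because the forest does not depend on it.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

SELECTED_FEATURES = ['alcohol', 'malic_acid', 'ash', 'magnesium', 'flavanoids', 'proline']

wine = load_wine(as_frame=True)
X = wine.data[SELECTED_FEATURES]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small illustrative grid (hypothetical values)
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10],
    'min_samples_leaf': [1, 2],
}

# 5-fold cross-validated search on the training split only
search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

With the default `refit=True`, `search.best_estimator_` is retrained on the whole training split and can be evaluated on the test set in the usual way.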

In [None]:
# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance Score')
plt.title('Feature Importance in Random Forest Model')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
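
The importances plotted above are impurity-based and computed from the training data. As a cross-check, permutation importance measures how much held-out accuracy drops when a single feature is shuffled. The sketch below assumes unscaled inputs and a forest configured roughly like the one in this notebook; `perm` and `perm_df` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

SELECTED_FEATURES = ['alcohol', 'malic_acid', 'ash', 'magnesium', 'flavanoids', 'proline']

wine = load_wine(as_frame=True)
X = wine.data[SELECTED_FEATURES]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(
    n_estimators=200, max_depth=10, random_state=42
).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in accuracy
perm = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=42, scoring='accuracy'
)

perm_df = pd.DataFrame({
    'Feature': SELECTED_FEATURES,
    'MeanDrop': perm.importances_mean,
    'Std': perm.importances_std,
}).sort_values('MeanDrop', ascending=False)

print(perm_df)
```

On a test set this small the estimates are noisy, so the `Std` column matters as much as the mean.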

## 8. Model Evaluation

In [None]:
# Make predictions
y_train_pred = rf_model.predict(X_train_scaled)
y_test_pred = rf_model.predict(X_test_scaled)

# Calculate metrics
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Precision, Recall, F1-score (weighted averages)
precision = precision_score(y_test, y_test_pred, average='weighted')
recall = recall_score(y_test, y_test_pred, average='weighted')
f1 = f1_score(y_test, y_test_pred, average='weighted')

print("="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"\nWeighted Precision: {precision:.4f}")
print(f"Weighted Recall: {recall:.4f}")
print(f"Weighted F1-Score: {f1:.4f}")
print("="*60)

In [None]:
# Detailed classification report
print("\nDETAILED CLASSIFICATION REPORT")
print("="*60)
print(classification_report(
    y_test, 
    y_test_pred, 
    target_names=['Cultivar 0', 'Cultivar 1', 'Cultivar 2']
))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=['Cultivar 0', 'Cultivar 1', 'Cultivar 2'],
    yticklabels=['Cultivar 0', 'Cultivar 1', 'Cultivar 2']
)
plt.title('Confusion Matrix - Test Set')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
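
The weighted precision and recall reported earlier can be derived directly from a confusion matrix. The sketch below works through the arithmetic on hypothetical labels (made up for illustration, not results from this notebook) and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 1])

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class

tp = np.diag(cm)                           # correct predictions per class
per_class_precision = tp / cm.sum(axis=0)  # column sums = times each class was predicted
per_class_recall = tp / cm.sum(axis=1)     # row sums = true count (support) per class

# "Weighted" averaging weights each class by its support
support = cm.sum(axis=1)
weighted_precision = np.average(per_class_precision, weights=support)
weighted_recall = np.average(per_class_recall, weights=support)

print("Per-class precision:", per_class_precision.round(4))
print("Per-class recall:   ", per_class_recall.round(4))
print(f"Weighted precision: {weighted_precision:.4f}")
print(f"Weighted recall:    {weighted_recall:.4f}")
```

Support-weighted recall always equals overall accuracy, which is why those two numbers coincide in the metrics printout.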

## 9. Cross-Validation

In [None]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(
    rf_model, 
    X_train_scaled, 
    y_train, 
    cv=5, 
    scoring='accuracy'
)

print("\nCROSS-VALIDATION RESULTS (5-Fold)")
print("="*60)
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print("="*60)
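
The cross-validation above runs on data that was already scaled with statistics from the whole training set, so each validation fold has influenced the scaler. For a Random Forest the effect is negligible, but a stricter variant wraps the scaler and model in a pipeline so the scaler is refit inside every fold. The sketch below shows that variant; `pipeline`, `cv`, and `pipe_scores` are illustrative names, and the shuffled `StratifiedKFold` is an assumption rather than the notebook's setup.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SELECTED_FEATURES = ['alcohol', 'malic_acid', 'ash', 'magnesium', 'flavanoids', 'proline']

wine = load_wine(as_frame=True)
X = wine.data[SELECTED_FEATURES]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and model are fit together, once per training fold
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42,
        class_weight='balanced',
    ),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='accuracy')

print(f"Pipeline CV scores: {pipe_scores}")
print(f"Mean CV accuracy: {pipe_scores.mean():.4f} (+/- {pipe_scores.std() * 2:.4f})")
```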

## 10. Save Model and Preprocessing Components

Joblib is used for model persistence because it serializes the NumPy arrays inside scikit-learn models efficiently.

In [None]:
# Create a model package with all necessary components
model_package = {
    'model': rf_model,
    'scaler': scaler,
    'feature_names': SELECTED_FEATURES,
    'target_names': ['Cultivar 0', 'Cultivar 1', 'Cultivar 2'],
    'metadata': {
        'algorithm': 'Random Forest Classifier',
        'test_accuracy': test_accuracy,
        'train_accuracy': train_accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'cv_mean_accuracy': cv_scores.mean(),
        'trained_on': '2026-01-21',
        'author': 'Eneasato David',
        'matric_no': '23CG034068'
    }
}

# Save the model package
model_path = 'wine_cultivar_model.pkl'
joblib.dump(model_package, model_path, compress=3)

print(f"\nModel saved successfully to: {model_path}")
print(f"File size: {Path(model_path).stat().st_size / 1024:.2f} KB")

# Verify by loading
loaded_package = joblib.load(model_path)
print("\nModel verification: Successfully loaded!")
print(f"Algorithm: {loaded_package['metadata']['algorithm']}")
print(f"Test Accuracy: {loaded_package['metadata']['test_accuracy']:.4f}")
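
Pickled scikit-learn models are not guaranteed to load correctly under a different scikit-learn version. One possible safeguard, sketched below, is to record the library version in the package metadata and compare it at load time. The `sklearn_version` key and the demo file name are hypothetical additions, not part of the package saved above.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Store the training-time library version next to the model
package = {
    'model': model,
    'metadata': {'sklearn_version': sklearn.__version__},  # hypothetical extra key
}

# Round-trip through a temporary file so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'demo_model.pkl'
    joblib.dump(package, str(path), compress=3)
    restored = joblib.load(str(path))

saved_version = restored['metadata']['sklearn_version']
if saved_version != sklearn.__version__:
    print(f"Warning: model trained with scikit-learn {saved_version}, "
          f"running {sklearn.__version__}")
else:
    print(f"scikit-learn version matches: {saved_version}")
```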

## 11. Test Prediction Function

In [None]:
def predict_cultivar(model_package, input_features):
    """
    Predict wine cultivar from input features.
    
    Parameters:
    -----------
    model_package : dict
        Package containing model, scaler, and metadata
    input_features : dict
        Dictionary with feature names as keys and values as floats
    
    Returns:
    --------
    dict : Prediction results including cultivar and probabilities
    """
    # Extract components
    model = model_package['model']
    scaler = model_package['scaler']
    feature_names = model_package['feature_names']
    target_names = model_package['target_names']
    
    # Prepare input as a DataFrame so column names match those the scaler was fit on
    input_df = pd.DataFrame(
        [[input_features[f] for f in feature_names]], columns=feature_names
    )
    
    # Scale and predict
    input_scaled = scaler.transform(input_df)
    prediction = model.predict(input_scaled)[0]
    probabilities = model.predict_proba(input_scaled)[0]
    
    return {
        'prediction': int(prediction),
        'cultivar_name': target_names[prediction],
        'probabilities': {
            target_names[i]: float(prob) for i, prob in enumerate(probabilities)
        },
        'confidence': float(probabilities[prediction])
    }

# Test with a sample
sample_input = {
    'alcohol': 13.5,
    'malic_acid': 2.0,
    'ash': 2.3,
    'magnesium': 110.0,
    'flavanoids': 2.5,
    'proline': 1000.0
}

result = predict_cultivar(loaded_package, sample_input)
print("\nTEST PREDICTION")
print("="*60)
print(f"Input: {sample_input}")
print(f"\nPredicted Cultivar: {result['cultivar_name']}")
print(f"Confidence: {result['confidence']*100:.2f}%")
print("\nClass Probabilities:")
for cultivar, prob in result['probabilities'].items():
    print(f"  {cultivar}: {prob*100:.2f}%")
print("="*60)
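
`predict_cultivar` raises a bare `KeyError` when a feature is missing and does not check value types. If the function is later exposed to user input, a small validation step could run first. The helper below is a hypothetical sketch (`validate_input` and `REQUIRED_FEATURES` are not part of the notebook) that returns a cleaned dict of floats or raises a descriptive `ValueError`.

```python
import math

# Same six features used throughout the notebook
REQUIRED_FEATURES = ['alcohol', 'malic_acid', 'ash', 'magnesium', 'flavanoids', 'proline']


def validate_input(input_features, feature_names=REQUIRED_FEATURES):
    """Return a dict of finite floats keyed by feature name, or raise ValueError."""
    missing = [f for f in feature_names if f not in input_features]
    if missing:
        raise ValueError(f"Missing features: {missing}")

    cleaned = {}
    for name in feature_names:
        raw = input_features[name]
        try:
            value = float(raw)
        except (TypeError, ValueError):
            raise ValueError(f"Feature '{name}' must be numeric, got {raw!r}") from None
        if not math.isfinite(value):
            raise ValueError(f"Feature '{name}' must be finite, got {raw!r}")
        cleaned[name] = value
    return cleaned


# Numeric strings are coerced to floats; extra keys are dropped
sample = {'alcohol': '13.5', 'malic_acid': 2.0, 'ash': 2.3,
          'magnesium': 110, 'flavanoids': 2.5, 'proline': 1000, 'note': 'ignored'}
print(validate_input(sample))
```

The cleaned dict can then be passed to `predict_cultivar` in place of the raw input.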

## Summary

### Model Performance
- **Algorithm**: Random Forest Classifier
- **Features Used**: 6 (alcohol, malic_acid, ash, magnesium, flavanoids, proline)
- **Test Accuracy**: Reported in Section 8 and stored under `test_accuracy` in the saved model metadata
- **Model Persistence**: Joblib

### Key Achievements
✅ Data preprocessing with proper handling  
✅ Feature scaling implemented  
✅ Model trained and evaluated  
✅ Comprehensive metrics calculated  
✅ Model saved for production use  

### Next Steps
1. Deploy model in Flask web application
2. Create user-friendly interface
3. Host on cloud platform

---
**Project Ready for Production Deployment** 🚀