# 🧬 ML-Based Forensic Gender Classifier
## Team Metric Mind - VTU CSE Project

**Using 15 Mandibular Measurements (No Serial No./ID)**

---

### 📊 Model Information:
- **Best Model:** Logistic Regression
- **Accuracy:** 75.00%
- **Features:** 15 mandibular measurements
- **Dataset:** 156 samples (103 Male, 53 Female)
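Because the classes are imbalanced, the accuracy figure is easier to interpret next to the majority-class baseline, that is, a classifier that always answers Male. The sketch below computes that baseline from the counts listed above; the variable names are illustrative.

```python
# Class counts taken from the dataset summary above
n_male, n_female = 103, 53
n_total = n_male + n_female

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = max(n_male, n_female) / n_total

print(f"Samples: {n_total}")
print(f"Majority-class baseline: {baseline_accuracy:.2%}")
```

On the full dataset the baseline is about 66%, so 75.00% is a gain of roughly nine percentage points. The exact baseline on the test split depends on how that split was drawn.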

---

## 📥 Step 1: Upload Your Dataset

**Upload the `Metric_Final.xlsx` file to Colab**

Click the folder icon on the left → Upload button → Select the file

In [14]:
# Install required packages
%pip install -q pandas numpy scikit-learn matplotlib seaborn openpyxl joblib

print("✅ All packages installed!")

Note: you may need to restart the kernel to use updated packages.
✅ All packages installed!


ERROR: Could not find a version that satisfies the requirement pandas (from versions: none)

ERROR: No matching distribution found for pandas
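The success message above prints even when installation fails, as the `ERROR` lines show. One way to check the environment before continuing is to ask Python whether each module can be located. This is a sketch using only the standard library; `missing_packages` is a hypothetical helper name, and note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names (not pip names) of the packages this notebook relies on
required = ["pandas", "numpy", "sklearn", "matplotlib", "seaborn",
            "openpyxl", "joblib"]

missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The `from versions: none` message generally means pip found no distribution it could install for this interpreter and platform, or could not reach the package index, so trying a different Python build or checking network access are reasonable first steps.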


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

ModuleNotFoundError: No module named 'pandas'

## 📊 Step 2: Load and Explore Dataset

In [None]:
# Load dataset
df = pd.read_excel('Metric_Final.xlsx')

print("="*80)
print("DATASET OVERVIEW")
print("="*80)
print(f"\n📊 Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\n📋 First 5 rows:")
display(df.head())

print(f"\n🎯 Gender Distribution:")
print(df['Gender'].value_counts())
print(f"\nPercentage:")
print(df['Gender'].value_counts(normalize=True) * 100)

## 🔧 Step 3: Data Preprocessing

**Removing S. No. and ID No. - Using only 15 mandibular measurements**

In [None]:
# Separate features and target - EXCLUDE S. No. and ID No.
target_col = 'Gender'
exclude_cols = ['S. No.', 'ID No.', 'Gender']
X = df.drop(columns=exclude_cols)
y = df[target_col]

print("="*80)
print("FEATURES USED (15 MANDIBULAR MEASUREMENTS)")
print("="*80)
print(f"\n✓ Total Features: {len(X.columns)}\n")
for i, col in enumerate(X.columns, 1):
    print(f"  {i:2d}. {col}")

# Handle missing values
X = X.fillna(X.median())
print(f"\n✓ Missing values handled")

# Encode target
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(f"✓ Target encoded: {le.classes_} → {np.unique(y_encoded)}")
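The cell above fills missing values with medians computed over the whole dataset, before the train-test split, so test rows influence the imputed values. A possible alternative, sketched here on synthetic data, is to put imputation and scaling inside a scikit-learn `Pipeline` so that both are fitted on the training portion only. The array shapes and random values are placeholders, not the project's measurements.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 15 measurements (NOT the real data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(156, 15))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=156) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some gaps

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each step is fitted on the training rows only when pipe.fit is called
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

A side benefit of this layout is that a single fitted `pipe` object carries the imputer, scaler, and classifier together, which reduces the chance of applying them inconsistently at prediction time.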

## 📦 Step 4: Train-Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print("="*80)
print("TRAIN-TEST SPLIT")
print("="*80)
print(f"\n✓ Training set: {X_train.shape[0]} samples")
print(f"✓ Testing set: {X_test.shape[0]} samples")
print(f"✓ Split ratio: 80% train, 20% test")
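With 156 samples, the 80/20 label is approximate. The sketch below reproduces the split sizes on placeholder labels that have the same class counts as the dataset summary; only the sizes matter here, not the values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the class counts from the dataset summary
labels = np.array([1] * 103 + [0] * 53)   # 1 = Male, 0 = Female
indices = np.arange(len(labels))

idx_train, idx_test, lab_train, lab_test = train_test_split(
    indices, labels, test_size=0.2, random_state=42, stratify=labels
)

n_train, n_test = len(idx_train), len(idx_test)
print(f"Train: {n_train}, Test: {n_test}")
print(f"One test sample is worth {100 / n_test:.3f} accuracy points")
```

scikit-learn rounds the test size up, so the split is 124 training and 32 test samples. With 32 test samples, each prediction moves accuracy by about 3.1 points, and an accuracy of 75.00% would correspond to 24 correct predictions out of 32, assuming it was measured on this split.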

## 🔧 Step 5: Feature Scaling

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("="*80)
print("FEATURE SCALING")
print("="*80)
print(f"\n✓ Features scaled using StandardScaler")
print(f"  • Mean ≈ 0, Standard Deviation ≈ 1")
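`StandardScaler` standardizes each column as z = (x - mean) / std, using the mean and population standard deviation learned from the training set. The sketch below checks that against a manual NumPy computation on made-up numbers. It also shows why test-set columns are only approximately centered: they are shifted by the training statistics, not their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for two features
train = np.array([[10.0, 120.0], [11.0, 125.0], [12.0, 118.0], [9.0, 130.0]])
test = np.array([[10.5, 122.0], [13.0, 119.0]])

demo_scaler = StandardScaler().fit(train)

# Manual version: subtract the training mean, divide by the training
# standard deviation (ddof=0, i.e. the population formula)
manual_train = (train - train.mean(axis=0)) / train.std(axis=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(demo_scaler.transform(train), manual_train))
print(np.allclose(demo_scaler.transform(test), manual_test))
print(demo_scaler.transform(test).mean(axis=0))  # generally not exactly 0
```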

## 🤖 Step 6: Train Multiple ML Models

In [None]:
print("="*80)
print("TRAINING 5 ML MODELS")
print("="*80)

results = {}

# Model 1: SVM
print("\n🎯 [1/5] Training SVM...")
svm_model = SVC(kernel='rbf', probability=True, random_state=42)
svm_model.fit(X_train_scaled, y_train)
y_pred_svm = svm_model.predict(X_test_scaled)
svm_acc = accuracy_score(y_test, y_pred_svm)
print(f"   ✓ Accuracy: {svm_acc:.4f} ({svm_acc*100:.2f}%)")
results['SVM'] = {'model': svm_model, 'accuracy': svm_acc, 'predictions': y_pred_svm}

# Model 2: Random Forest
print("\n🌲 [2/5] Training Random Forest...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)
rf_acc = accuracy_score(y_test, y_pred_rf)
print(f"   ✓ Accuracy: {rf_acc:.4f} ({rf_acc*100:.2f}%)")
results['Random Forest'] = {'model': rf_model, 'accuracy': rf_acc, 'predictions': y_pred_rf}

# Model 3: Logistic Regression
print("\n📈 [3/5] Training Logistic Regression...")
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)
lr_acc = accuracy_score(y_test, y_pred_lr)
print(f"   ✓ Accuracy: {lr_acc:.4f} ({lr_acc*100:.2f}%)")
results['Logistic Regression'] = {'model': lr_model, 'accuracy': lr_acc, 'predictions': y_pred_lr}

# Model 4: Decision Tree
print("\n🌳 [4/5] Training Decision Tree...")
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)
dt_acc = accuracy_score(y_test, y_pred_dt)
print(f"   ✓ Accuracy: {dt_acc:.4f} ({dt_acc*100:.2f}%)")
results['Decision Tree'] = {'model': dt_model, 'accuracy': dt_acc, 'predictions': y_pred_dt}

# Model 5: Neural Network
print("\n🧠 [5/5] Training Neural Network...")
nn_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
nn_model.fit(X_train_scaled, y_train)
y_pred_nn = nn_model.predict(X_test_scaled)
nn_acc = accuracy_score(y_test, y_pred_nn)
print(f"   ✓ Accuracy: {nn_acc:.4f} ({nn_acc*100:.2f}%)")
results['Neural Network'] = {'model': nn_model, 'accuracy': nn_acc, 'predictions': y_pred_nn}

print("\n✅ All models trained successfully!")
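A single 20% hold-out from 156 samples gives a noisy estimate, and `cross_val_score` is imported above but never used. One option is stratified k-fold cross-validation with the scaler inside a pipeline, so each fold is scaled using its own training part. The sketch uses `make_classification` as a stand-in dataset; the fold count and class weights are assumptions chosen to resemble the dataset summary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 156 samples, 15 features, roughly 1/3 vs 2/3 classes
X_demo, y_demo = make_classification(
    n_samples=156, n_features=15, n_informative=5,
    weights=[0.34, 0.66], random_state=42,
)

# The scaler is refitted inside every fold, on that fold's training rows
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single-split accuracy could vary with a different random split.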

## 📊 Step 7: Model Comparison

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame([
    {'Model': name, 'Accuracy': data['accuracy']}
    for name, data in results.items()
]).sort_values('Accuracy', ascending=False)

print("="*80)
print("MODEL PERFORMANCE COMPARISON")
print("="*80)
print("\n")
display(comparison_df)

# Find best model
best_model_name = comparison_df.iloc[0]['Model']
best_model = results[best_model_name]['model']
best_accuracy = comparison_df.iloc[0]['Accuracy']

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"✅ Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")

# Visualize comparison
plt.figure(figsize=(10, 6))
bars = plt.barh(comparison_df['Model'], comparison_df['Accuracy'], 
                color=['#2ecc71' if i == 0 else '#3498db' for i in range(len(comparison_df))])
plt.xlabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('Model Accuracy Comparison (15 Features)', fontsize=14, fontweight='bold')
plt.xlim([0, 1])
for i, (model, acc) in enumerate(zip(comparison_df['Model'], comparison_df['Accuracy'])):
    plt.text(acc + 0.02, i, f'{acc:.3f}', va='center', fontweight='bold')
plt.tight_layout()
plt.show()

## 🎯 Step 8: Detailed Evaluation of Best Model

In [None]:
# Get best model predictions
y_pred_best = results[best_model_name]['predictions']

print("="*80)
print(f"DETAILED EVALUATION: {best_model_name}")
print("="*80)

print("\n📋 Classification Report:")
print(classification_report(y_test, y_pred_best, target_names=le.classes_))

print("\n🎯 Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred_best)
print(cm)

# Visualize Confusion Matrix
plt.figure(figsize=(8, 6))
# Take tick labels from the encoder so they always match the encoded class order
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=le.classes_, yticklabels=le.classes_,
            annot_kws={'size': 16, 'weight': 'bold'})
plt.xlabel('Predicted Gender', fontsize=12, fontweight='bold')
plt.ylabel('Actual Gender', fontsize=12, fontweight='bold')
plt.title(f'Confusion Matrix - {best_model_name}\nAccuracy: {best_accuracy*100:.2f}%',
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
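With imbalanced classes, per-class rates are more informative than accuracy alone. The sketch below derives them from a 2×2 confusion matrix laid out as scikit-learn returns it, with rows as actual classes and columns as predicted classes. The matrix values are invented for illustration and are not the project's results.

```python
import numpy as np

# Invented example matrix: rows = actual, columns = predicted,
# class order [Female, Male]
cm_demo = np.array([[6, 5],
                    [3, 18]])

accuracy = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: correct predictions divided by actual class size
recall_per_class = np.diag(cm_demo) / cm_demo.sum(axis=1)
balanced_accuracy = recall_per_class.mean()

print(f"Accuracy:          {accuracy:.3f}")
print(f"Recall (Female):   {recall_per_class[0]:.3f}")
print(f"Recall (Male):     {recall_per_class[1]:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

In this invented case the overall accuracy hides a much lower recall on the smaller class, which is the kind of gap the classification report above is meant to surface.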

## 💾 Step 9: Save Models

In [None]:
print("="*80)
print("SAVING MODELS (15 FEATURES)")
print("="*80)

# Save best model
joblib.dump(best_model, 'best_model_15features.pkl')
print(f"\n✓ Best model saved: best_model_15features.pkl")

# Save scaler
joblib.dump(scaler, 'scaler_15features.pkl')
print(f"✓ Scaler saved: scaler_15features.pkl")

# Save label encoder
joblib.dump(le, 'label_encoder_15features.pkl')
print(f"✓ Label encoder saved: label_encoder_15features.pkl")

# Save feature names
with open('feature_names_15.txt', 'w') as f:
    for feature in X.columns:
        f.write(f"{feature}\n")
print(f"✓ Feature names saved: feature_names_15.txt")

# Save all models
for model_name, model_data in results.items():
    filename = f"model_{model_name.replace(' ', '_').lower()}_15features.pkl"
    joblib.dump(model_data['model'], filename)
    print(f"✓ {model_name} saved: {filename}")

print("\n✅ All models saved successfully!")
print("\n📥 Download files from left sidebar (folder icon)")

## 🧪 Step 10: Test with Custom Input

In [None]:
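def predict_gender(measurements):
    """
    Predict gender from 15 mandibular measurements.

    Args:
        measurements: List of 15 values (no Serial No. or ID), in the
            same order as the training columns (X.columns)

    Returns:
        Dictionary with prediction, confidence, and class probabilities

    Raises:
        ValueError: If the number of measurements does not match the
            number of training features
    """
    # Validate input; raise instead of returning an error dict so callers
    # that read result['gender'] do not fail later with a KeyError
    expected = len(X.columns)
    if len(measurements) != expected:
        raise ValueError(
            f'Expected {expected} measurements, got {len(measurements)}'
        )

    # Build a one-row DataFrame with the training column names so the
    # scaler receives the same feature names it was fitted on
    input_data = pd.DataFrame([measurements], columns=X.columns)

    # Scale
    input_scaled = scaler.transform(input_data)

    # Predict
    prediction = best_model.predict(input_scaled)[0]
    probabilities = best_model.predict_proba(input_scaled)[0]

    # Get label
    gender = le.inverse_transform([prediction])[0]
    confidence = max(probabilities) * 100

    # Pair each probability with its decoded class label instead of
    # assuming a fixed Female/Male column order
    class_labels = le.inverse_transform(best_model.classes_)

    return {
        'gender': gender,
        'confidence': f'{confidence:.2f}%',
        'probabilities': {
            label: f'{prob*100:.2f}%'
            for label, prob in zip(class_labels, probabilities)
        }
    }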

# Example: Test with sample measurements (15 values only)
print("="*80)
print("TESTING WITH CUSTOM INPUT (15 FEATURES)")
print("="*80)

sample_measurements = [
    10.5,   # M1 Length
    12.3,   # M2 Bicondylar breadth
    0.85,   # M3 Mandibular index
    9.8,    # M4 Bigonial breadth
    3.2,    # M5 URB
    3.1,    # M6 LRB
    6.5,    # M7 CondRH
    5.8,    # M8 CorRH
    120,    # M9 Gonial angle
    7.5,    # M10 Cor length
    1.2,    # M11 Cor breadth
    11.5,   # M12 C-C distance
    4.2,    # M13 Inter cor distance
    3.6,    # M14 Cor-Fr distance
    4.8     # M15 Bimental breadth
]

print(f"\n📝 Input: {len(sample_measurements)} measurements")
result = predict_gender(sample_measurements)
print(f"\n🎯 Prediction: {result['gender']}")
print(f"✅ Confidence: {result['confidence']}")
print(f"\n📊 Probabilities:")
for gender, prob in result['probabilities'].items():
    print(f"  • {gender}: {prob}")

## 📥 Step 11: Download Models

**Click the folder icon on the left sidebar**

Download these files:
- `best_model_15features.pkl`
- `scaler_15features.pkl`
- `label_encoder_15features.pkl`
- `feature_names_15.txt`

**You can now use these models locally or in your Flask API!**
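Loading the saved artifacts locally might look like the following. To keep the sketch runnable on its own, it first trains and saves a tiny stand-in model to a temporary directory; in practice you would skip that part and point `joblib.load` at the downloaded files. The stand-in data and the file names inside the temporary directory are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# --- Stand-in for the training notebook (placeholder data) ---
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 15))
y_text = np.where(X_demo[:, 0] > 0, "Male", "Female")

le_demo = LabelEncoder().fit(y_text)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression(max_iter=1000).fit(
    scaler_demo.transform(X_demo), le_demo.transform(y_text)
)

with tempfile.TemporaryDirectory() as tmp:
    paths = {name: os.path.join(tmp, f"{name}.pkl")
             for name in ("model", "scaler", "encoder")}
    joblib.dump(model_demo, paths["model"])
    joblib.dump(scaler_demo, paths["scaler"])
    joblib.dump(le_demo, paths["encoder"])

    # --- What local use looks like: load the three artifacts ---
    model = joblib.load(paths["model"])
    scaler = joblib.load(paths["scaler"])
    encoder = joblib.load(paths["encoder"])

# Scale, predict, then decode the numeric class back to its text label
new_row = X_demo[:1]  # one row of 15 values
predicted = encoder.inverse_transform(model.predict(scaler.transform(new_row)))
print(predicted[0])
```

The scaler and label encoder must come from the same training run as the model. Pickled scikit-learn objects are also best loaded with the same scikit-learn version that saved them.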

---

## ✅ Summary

**Features Used:** 15 mandibular measurements (No Serial No., No ID)  
**Best Model:** Logistic Regression  
**Accuracy:** 75.00%  
**Dataset:** 156 samples  

**Files Created:**
- best_model_15features.pkl
- scaler_15features.pkl
- label_encoder_15features.pkl
- feature_names_15.txt
- All 5 model files

**Team Metric Mind** | VTU CSE Project | 2024