# Fruit Classifier Project

**Project Goal:** To create a classification model that can accurately distinguish between Bananas, Grapes, and Apples.

---
## 1. Data Setup and Cleaning

In [62]:
# Necessary tools
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, roc_auc_score

# Load the data (pandas is already imported above)
df = pd.read_excel(r"C:\Users\JJE4FE\Desktop\fruit_data.xlsx")

# Drop the unnecessary index column
df = df.drop('Unnamed: 0', axis=1)

# Fix the 'weight' column: strip stray 'e' characters, then parse as numbers
# (values that still cannot be parsed become NaN). Also rename the target column.
df['weight'] = df['weight'].astype(str).str.replace('e', '', regex=False)
df['weight'] = pd.to_numeric(df['weight'], errors='coerce')
df = df.rename(columns={'fruit_type': 'target'})

# Show the first 5 rows
print("Data Loaded and Initial Cleanup Performed:")
print(df.head())
print("\nData Status:")
df.info()

Data Loaded and Initial Cleanup Performed:
   target         color    size     weight
0   grape        Yellow    Tiny   8.303385
1   apple          Pink  Largee  80.976370
2  banana   Pale Yellow   Large  74.615192
3   grape           Red    Tiny   6.924070
4  banana  Creamy White  Largee  82.002542

Data Status:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   target  200 non-null    object 
 1   color   200 non-null    object 
 2   size    200 non-null    object 
 3   weight  200 non-null    float64
dtypes: float64(1), object(3)
memory usage: 6.4+ KB
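
The two-step weight cleanup in the cell above can be illustrated on a few made-up strings. This is only a sketch: the values in `raw` are hypothetical and are not taken from `fruit_data.xlsx`. It shows that stripping every `e` repairs a stray character, that `errors='coerce'` turns unparseable text into `NaN`, and that the same replacement would also change a value written in scientific notation.

```python
import pandas as pd

# Hypothetical raw strings: a clean value, a stray trailing 'e',
# a non-numeric entry, and a value in scientific notation.
raw = pd.Series(["8.30", "80.97e", "n/a", "1e2"])

# Same two steps as the notebook cell: drop literal 'e' characters, then parse.
cleaned = pd.to_numeric(raw.str.replace("e", "", regex=False), errors="coerce")

# Coercion never raises, so count how many values were lost to NaN.
n_missing = int(cleaned.isna().sum())

print(cleaned.tolist())
print("Values coerced to NaN:", n_missing)
```

Here `"1e2"` is parsed as 12.0 rather than 100.0, so this cleanup is only safe when the column contains no scientific notation. In the run shown above, `df.info()` reports 200 non-null weights, so no value was coerced to `NaN`.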


## Data Preparation: Final Cleaning and Encoding

We finish cleaning by fixing inconsistencies in the categorical columns, then prepare the data for the models. The steps are:
1.  Normalizing text and consolidating inconsistent labels (e.g. `Largee` -> `Large`), then dropping duplicate rows.
2.  Label Encoding the target variable (`target`) into 0, 1, 2.
3.  One-Hot Encoding the categorical features (`color`, `size`) into numeric columns.
4.  Splitting the data into training and testing sets.
5.  Scaling the `weight` column (for Logistic Regression), fitting the scaler on the training set only.

In [63]:
# Normalize text columns: strip surrounding whitespace and lowercase
for col in ['target', 'color', 'size']:
    df[col] = df[col].astype(str).str.strip().str.lower()

# Fix inconsistent category names
df['size'] = df['size'].replace({'largee': 'large', 'tiny': 'small'})
df['color'] = df['color'].replace({'yellow1': 'yellow'})

# Remove duplicates
df = df.drop_duplicates()
print(f"Total rows after removing duplicates: {len(df)}\n")

# Separation of Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# Target Encoding (converts apple/banana/grape to 0/1/2)
le = LabelEncoder()
y_encoded = le.fit_transform(y)
target_classes = le.classes_
print(f"Target classes encoded to: {target_classes}")

# One-Hot Encoding for features
X_encoded = pd.get_dummies(X, columns=['color', 'size'], drop_first=True, dtype=int)

# Split the data (20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)
print(f"Train/Test split completed. Train size: {len(X_train)}\n")

# Scaling: standardize only the continuous 'weight' column
scaler = StandardScaler()

# Work on copies so the unscaled frames stay available for the Decision Tree
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Fit on the training data only, then apply the same transform to the test data
X_train_scaled[['weight']] = scaler.fit_transform(X_train[['weight']])
X_test_scaled[['weight']] = scaler.transform(X_test[['weight']])

print("Data preparation finished. Data is scaled and split.")

Total rows after removing duplicates: 177

Target classes encoded to: ['apple' 'banana' 'grape']
Train/Test split completed. Train size: 141

Data preparation finished. Data is scaled and split.
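
As an alternative to encoding and scaling by hand, the same preparation could be bundled with the model in a scikit-learn `Pipeline`. The sketch below is not the notebook's method. It uses a tiny made-up frame (`toy`) that only borrows the notebook's column names, and it keeps all one-hot columns instead of dropping the first level as `get_dummies(..., drop_first=True)` does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; not the project data.
toy = pd.DataFrame({
    "color": ["yellow", "red", "green", "yellow", "purple", "red"],
    "size": ["large", "medium", "small", "large", "small", "medium"],
    "weight": [120.0, 150.0, 5.0, 118.0, 6.0, 160.0],
    "target": ["banana", "apple", "grape", "banana", "grape", "apple"],
})
features = ["color", "size", "weight"]

# One-hot encode the categorical columns and standardize the numeric one.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
    ("scale", StandardScaler(), ["weight"]),
])

# Preprocessing is fitted together with the classifier, so the scaler and
# encoder only ever see the rows passed to fit().
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(C=10.0, max_iter=1000)),
])
model.fit(toy[features], toy["target"])
pred = model.predict(toy[features])
print(pred)
```

One benefit of this design is that the fitted pipeline can be handed to cross-validation utilities without leaking test-fold statistics into the scaler.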


## 2. Modeling and Analysis

We now build the two required models: Logistic Regression (a simple, linear model) and Decision Tree (a non-linear, interpretable model).

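To make the choice of hyperparameters transparent, we test a few settings and keep the one with the best weighted F1 score on the training data. Note that a training score tends to favor the most flexible setting (the largest `C`, the deepest tree), so a held-out validation set or cross-validation would give a less optimistic basis for the choice.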

In [64]:
# Logistic Regression

C_values = [0.1, 1.0, 10.0]
best_f1 = 0
best_C = None
best_logreg_model = None

print("Tuning Logistic Regression (C value):")
for C in C_values:
    # Use the SCALED data (X_train_scaled, y_train)
    model = LogisticRegression(C=C, random_state=42, multi_class='ovr', solver='liblinear')
    model.fit(X_train_scaled, y_train)

    # Evaluate performance on the training set
    y_pred_train = model.predict(X_train_scaled)
    f1 = f1_score(y_train, y_pred_train, average='weighted')

    print(f"  C={C:<5} -> F1 Score: {f1:.4f}")

    if f1 > best_f1:
        best_f1 = f1
        best_C = C
        best_logreg_model = model

print(f"\nChosen Hyperparameter: C = {best_C} (Transparent Choice)")

# We now evaluate the best model on the unseen test data
y_pred_logreg = best_logreg_model.predict(X_test_scaled)

print("\nFinal Test Set Evaluation (Logistic Regression)")
print("Test Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("Test F1 (macro):", f1_score(y_test, y_pred_logreg, average='macro'))
print("\nClassification report:\n",
      classification_report(y_test, y_pred_logreg, target_names=target_classes))

Tuning Logistic Regression (C value):
  C=0.1   -> F1 Score: 0.7717
  C=1.0   -> F1 Score: 0.8261
  C=10.0  -> F1 Score: 0.8717

Chosen Hyperparameter: C = 10.0 (Transparent Choice)

Final Test Set Evaluation (Logistic Regression)
Test Accuracy: 0.8888888888888888
Test F1 (macro): 0.8931623931623932

Classification report:
               precision    recall  f1-score   support

       apple       0.85      0.85      0.85        13
      banana       0.83      0.83      0.83        12
       grape       1.00      1.00      1.00        11

    accuracy                           0.89        36
   macro avg       0.89      0.89      0.89        36
weighted avg       0.89      0.89      0.89        36
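
The tuning loop above scores each `C` on the same rows the model was trained on. A cross-validated search is one common alternative. The sketch below runs on synthetic data from `make_classification` as a stand-in for `X_train_scaled` and `y_train`, and it uses the default `LogisticRegression` solver rather than the notebook's `liblinear` settings, so its numbers say nothing about the fruit data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic three-class data standing in for the notebook's training set.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Each candidate C is scored on folds it was not trained on.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1_weighted",
    cv=cv,
)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["C"])
print("Mean cross-validated F1:", round(search.best_score_, 4))
```

In the notebook, the same call would take `X_train_scaled` and `y_train`, leaving the test set untouched until the final evaluation.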





## Decision Tree with Parameter Tuning

The Decision Tree classifier is a non-linear model that is generally insensitive to feature scaling, making it a good comparison to Logistic Regression.

We will tune the `max_depth` parameter, which caps the complexity of the tree and can limit overfitting.

In [65]:
# Decision Tree Classifier

# DecisionTreeClassifier is already imported in the first cell

max_depths = [3, 5, 7, 10]
best_f1_dt = 0
best_depth = None
best_dt_model = None

print("Tuning Decision Tree (max_depth):")
for depth in max_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate performance on the training set
    y_pred_train = model.predict(X_train)
    f1 = f1_score(y_train, y_pred_train, average='weighted')

    print(f"  Depth={depth:<2} -> F1 Score: {f1:.4f}")

    if f1 > best_f1_dt:
        best_f1_dt = f1
        best_depth = depth
        best_dt_model = model

print(f"\nChosen Hyperparameter: Max Depth = {best_depth} (Transparent Choice)")

# We now evaluate the best model on the unseen test data
y_pred_dt = best_dt_model.predict(X_test)

print("\nFinal Test Set Evaluation (Decision Tree)")
print("Test Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Test F1 (macro):", f1_score(y_test, y_pred_dt, average='macro'))
print("\nClassification report:\n",
      classification_report(y_test, y_pred_dt, target_names=target_classes))

Tuning Decision Tree (max_depth):
  Depth=3  -> F1 Score: 0.8205
  Depth=5  -> F1 Score: 0.8917
  Depth=7  -> F1 Score: 0.9078
  Depth=10 -> F1 Score: 0.9716

Chosen Hyperparameter: Max Depth = 10 (Transparent Choice)

Final Test Set Evaluation (Decision Tree)
Test Accuracy: 0.8055555555555556
Test F1 (macro): 0.8133333333333334

Classification report:
               precision    recall  f1-score   support

       apple       0.75      0.69      0.72        13
      banana       0.69      0.75      0.72        12
       grape       1.00      1.00      1.00        11

    accuracy                           0.81        36
   macro avg       0.81      0.81      0.81        36
weighted avg       0.81      0.81      0.81        36
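
One way to see how tree depth affects generalization is to report the training score next to a cross-validated score for each depth. The sketch below does this with `cross_validate` on synthetic, slightly noisy data (`flip_y=0.1`), which is an assumption made for illustration and not the fruit data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with some label noise; a stand-in only.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.1, random_state=42,
)

results = {}
for depth in [3, 5, 7, 10]:
    scores = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=42),
        X_demo, y_demo, cv=5, scoring="f1_weighted", return_train_score=True,
    )
    # Mean score on the training folds vs. mean score on the held-out folds.
    train_f1 = scores["train_score"].mean()
    valid_f1 = scores["test_score"].mean()
    results[depth] = (train_f1, valid_f1)
    print(f"  Depth={depth:<2} -> train F1: {train_f1:.4f}, validation F1: {valid_f1:.4f}")
```

A depth whose training score keeps rising while the validation score stalls or falls is a sign that the extra depth is fitting noise.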



## 3. Model Performance and Interpretation

We complete the evaluation of both tuned models with the required ROC-AUC score, then compare all metrics and conclude which model performs best.

In [66]:
# Tools for calculating scores and making the table
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
from pandas import DataFrame

# Probabilities
logreg_probs = best_logreg_model.predict_proba(X_test_scaled)
dt_probs = best_dt_model.predict_proba(X_test)

# Calculate the weighted one-vs-rest ROC-AUC scores
logreg_roc_auc = roc_auc_score(y_test, logreg_probs, multi_class='ovr', average='weighted')
dt_roc_auc = roc_auc_score(y_test, dt_probs, multi_class='ovr', average='weighted')

print("ROC-AUC Scores:")
print(f"Logistic Regression: {logreg_roc_auc:.4f}")
print(f"Decision Tree:       {dt_roc_auc:.4f}")

# Gather all the final performance metrics into one easy-to-read structure.
summary_data = {
    'Metric': ['Accuracy', 'F1 Score (Macro)', 'ROC-AUC (Weighted)'],
    'LogReg Score': [
        accuracy_score(y_test, y_pred_logreg),
        f1_score(y_test, y_pred_logreg, average='macro'),
        logreg_roc_auc
    ],
    'DT Score': [
        accuracy_score(y_test, y_pred_dt),
        f1_score(y_test, y_pred_dt, average='macro'),
        dt_roc_auc
    ]
}

comparison_df = DataFrame(summary_data).set_index('Metric').round(3) # Final table

print("\nComprehensive Model Performance:")
print(comparison_df)

ROC-AUC Scores:
Logistic Regression: 0.9753
Decision Tree:       0.8810

Comprehensive Model Performance:
                    LogReg Score  DT Score
Metric                                    
Accuracy                   0.889     0.806
F1 Score (Macro)           0.893     0.813
ROC-AUC (Weighted)         0.975     0.881
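
To make the "weighted one-vs-rest" ROC-AUC less of a black box, the sketch below recomputes it by hand: one binary AUC per class (that class against the rest), averaged with each class's test-set count as its weight. It uses synthetic data and a default `LogisticRegression`, both assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data; a stand-in for the fruit features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One binary AUC per class: "is it class c?" scored by that class's probability.
classes = np.unique(y_te)
per_class_auc = np.array(
    [roc_auc_score(y_te == c, probs[:, i]) for i, c in enumerate(classes)]
)

# Weight each class by how many test samples it has.
support = np.array([(y_te == c).sum() for c in classes])
manual = float(np.sum(per_class_auc * support) / support.sum())

builtin = roc_auc_score(y_te, probs, multi_class="ovr", average="weighted")
print(f"manual: {manual:.4f}  built-in: {builtin:.4f}")
```

The two values agree, which shows that the weighted score is a support-weighted average of per-class rankings and can hide a weaker class behind a stronger one.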


## 4. Conclusion and Project Summary

### 4.1 Final Model Recommendation

The project aimed to develop a classifier to distinguish between apples, bananas, and grapes. Based on a comprehensive evaluation of all required metrics on the unseen test data, the Logistic Regression model is the final recommended classifier.

| Model | Accuracy | F1 Score (Macro) | ROC-AUC (Weighted) |
| :--- | :--- | :--- | :--- |
| Logistic Regression | 0.889 | 0.893 | 0.975 |
| Decision Tree | 0.806 | 0.813 | 0.881 |

The Logistic Regression model, tuned to `C = 10.0` and trained with a standardized `weight` feature (standardization puts `weight` on a scale comparable to the one-hot columns, which matters for a regularized linear model), scored higher on all three metrics:

* Higher Accuracy: A score of 0.889 means it correctly identified 32 of the 36 test fruits, nearly 9 out of 10.
* Superior ROC-AUC: A weighted ROC-AUC of 0.975 is very strong. Since this metric measures the model's ability to rank probabilities, a score this close to 1.0 indicates the model is highly capable of separating the three fruit classes consistently.
* Strong F1 Score: The 0.893 F1 score suggests a strong balance between precision (avoiding false positives) and recall (avoiding false negatives) across all three classes.
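
Whether scaling contributes to the Logistic Regression result could be checked directly with a small ablation: the same model with and without a `StandardScaler`, scored by cross-validation. The sketch below is a hypothetical setup on synthetic data in which one column is inflated to mimic a weight-like feature. Unlike the notebook, it scales every column and uses the default solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; column 0 is inflated to mimic a large-valued feature.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
X[:, 0] *= 100.0

# Same classifier settings, with and without standardization in front.
plain = LogisticRegression(C=10.0, max_iter=5000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(C=10.0, max_iter=5000))

plain_f1 = cross_val_score(plain, X, y, cv=5, scoring="f1_macro").mean()
scaled_f1 = cross_val_score(scaled, X, y, cv=5, scoring="f1_macro").mean()

print(f"Unscaled macro F1: {plain_f1:.4f}")
print(f"Scaled macro F1:   {scaled_f1:.4f}")
```

Running the equivalent comparison on `X_train` versus `X_train_scaled` would show how much of the Logistic Regression result, if any, is attributable to scaling.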

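While the Decision Tree scored high on the training data (weighted F1 of 0.972 at `max_depth = 10`), its test performance was clearly lower (macro F1 of 0.813, accuracy of 0.806). This gap between training and test performance indicates overfitting: the tree fit noise in the training data and did not generalize as well to new, unseen data. Selecting `max_depth` by training score contributes to this, since a deeper tree almost always fits the training set better.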

### 4.2 Key Findings and Insights

1.  Grapes are Easiest: Both models achieved a perfect 1.00 F1-score for Grapes, suggesting this fruit is clearly distinguishable based on the provided features.
2.  Apples and Bananas are Confused: Because grape precision and recall are both 1.00, every misclassification occurred between apples and bananas. Their F1 scores are correspondingly lower (0.85 and 0.83 for Logistic Regression, 0.72 for both under the Decision Tree). This suggests that their feature distributions (color, size, or weight) overlap more than those of grapes.
3.  Feature Scaling May Help: Logistic Regression, whose regularization is sensitive to feature magnitude, was trained with a standardized `weight` column and outperformed the Decision Tree. The notebook does not include an unscaled Logistic Regression run, so the contribution of scaling itself is not isolated.
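
The apple/banana confusion noted in the findings above can be read directly from a confusion matrix instead of being inferred from F1 scores. The labels below are hypothetical (0 = apple, 1 = banana, 2 = grape) and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels: 0 = apple, 1 = banana, 2 = grape.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 0, 1, 0, 1, 2, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
# [[2 1 0]
#  [1 2 0]
#  [0 0 3]]
```

Off-diagonal counts appear only in the apple and banana rows, which is the pattern the classification reports imply. In the notebook, `confusion_matrix(y_test, y_pred_logreg)` would give the actual counts.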