In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import make_pipeline

In [2]:
# load the dataset
file_path = 'fruit_data_with_colors.txt'
fruits_data = pd.read_csv(file_path, delimiter='\t')
fruits_data.head()

Unnamed: 0,fruit_label,fruit_name,fruit_subtype,mass,width,height,color_score
0,1,apple,granny_smith,192,8.4,7.3,0.55
1,1,apple,granny_smith,180,8.0,6.8,0.59
2,1,apple,granny_smith,176,7.4,7.2,0.6
3,2,mandarin,mandarin,86,6.2,4.7,0.8
4,2,mandarin,mandarin,84,6.0,4.6,0.79


The dataset consists of the following columns:   
fruit_label: Numeric labels for the fruit.  
fruit_name: Name of the fruit (e.g., apple, mandarin).  
fruit_subtype: Specific subtype of the fruit.  
mass: The mass of the fruit.  
width: The width of the fruit.  
height: The height of the fruit.  
color_score: A score representing the color.  

To build a classification model, we can use features such as mass, width, height, and color_score to predict the fruit_label or fruit_name.

Before building the model, let's perform some exploratory data analysis (EDA) to understand the data better, starting with a check for missing values.

In [4]:
# checking for missing values
missing_values = fruits_data.isnull().sum()
missing_values

fruit_label      0
fruit_name       0
fruit_subtype    0
mass             0
width            0
height           0
color_score      0
dtype: int64

There are no missing values in the dataset, which is good as it simplifies the preprocessing stage.

In [5]:
# summary statistics
summary_statistics = fruits_data.describe()
summary_statistics

Unnamed: 0,fruit_label,mass,width,height,color_score
count,59.0,59.0,59.0,59.0,59.0
mean,2.542373,163.118644,7.105085,7.69322,0.762881
std,1.208048,55.018832,0.816938,1.361017,0.076857
min,1.0,76.0,5.8,4.0,0.55
25%,1.0,140.0,6.6,7.2,0.72
50%,3.0,158.0,7.2,7.6,0.75
75%,4.0,177.0,7.5,8.2,0.81
max,4.0,362.0,9.6,10.5,0.93


The dataset contains 59 entries.
The features are on very different scales: mass ranges from 76 to 362 (mean 163.1, std 55.0), while width, height, and color_score span much narrower ranges (color_score, for example, runs from 0.55 to 0.93 with a std of 0.077).
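
To make the difference in scale concrete, here is a small sketch that computes each feature's range and coefficient of variation (std divided by mean) directly from the numbers in the describe() output above. The dictionary `reported_stats` and the helper `spread_summary` are illustrative names, not part of the original notebook.

```python
# Summary statistics copied from the describe() output above: (mean, std, min, max).
reported_stats = {
    "mass": (163.118644, 55.018832, 76.0, 362.0),
    "width": (7.105085, 0.816938, 5.8, 9.6),
    "height": (7.69322, 1.361017, 4.0, 10.5),
    "color_score": (0.762881, 0.076857, 0.55, 0.93),
}

def spread_summary(stats):
    """Return {feature: (range, coefficient_of_variation)} for each feature."""
    summary = {}
    for name, (mean, std, lo, hi) in stats.items():
        summary[name] = (hi - lo, std / mean)
    return summary

spread = spread_summary(reported_stats)
for name, (value_range, cv) in spread.items():
    print(f"{name:12s} range={value_range:7.2f}  cv={cv:.3f}")
```

On these figures, mass has both the largest absolute range and the largest relative spread, which is one reason a scaling step is often included in the pipeline.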

In [6]:
# distribution of fruit labels
fruit_label_distribution = fruits_data['fruit_label'].value_counts()
fruit_label_distribution

1    19
3    19
4    16
2     5
Name: fruit_label, dtype: int64

In [7]:
# distribution of fruit names
fruit_names_distribution = fruits_data['fruit_name'].value_counts()
fruit_names_distribution

apple       19
orange      19
lemon       16
mandarin     5
Name: fruit_name, dtype: int64

There are four unique fruit labels (1, 2, 3, 4) in the dataset.
Labels 1 (apple) and 3 (orange) have 19 samples each, label 4 (lemon) has 16 samples, and label 2 (mandarin) has only 5 samples, so the classes are imbalanced.

Given this information, we can proceed to build a classification model. The task is to predict the fruit label (equivalently, the fruit name) from mass, width, height, and color_score. Let's start by splitting the data into training and testing sets, and then select an appropriate model for training.

In [9]:
# separating features and target
X = fruits_data[['mass', 'width','height','color_score']]
y = fruits_data['fruit_label']

In [10]:
# splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
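
Because label 2 (mandarin) has only 5 samples, a purely random split can leave it over- or under-represented in the test set. One possible variant, shown here as a sketch rather than the notebook's actual method, is to pass `stratify` to `train_test_split`. The arrays `X_demo` and `y_demo` are hypothetical stand-ins that only reproduce the class counts reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels with the class counts reported above:
# 19 apples (1), 5 mandarins (2), 19 oranges (3), 16 lemons (4).
y_demo = np.repeat([1, 2, 3, 4], [19, 5, 19, 16])
# Placeholder feature matrix; only its row count matters for this illustration.
X_demo = np.arange(len(y_demo) * 4, dtype=float).reshape(len(y_demo), 4)

# stratify=y_demo keeps class proportions roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many samples of each label ended up in the test split.
labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```

With stratification, every label should appear in the 12-sample test split in roughly the same proportion as in the full dataset.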

In [11]:
# creating a RandomForestClassifier model in a pipeline that applies StandardScaler
# for feature scaling
model = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))

In [12]:
# training the model
model.fit(X_train, y_train)

In [13]:
# making prediction
y_pred = model.predict(X_test)

In [14]:
# evaluating the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_output = classification_report(y_test, y_pred)
print(f'Accuracy is {accuracy},\nClassification report:{classification_report_output}')

Accuracy is 1.0,
Classification report:              precision    recall  f1-score   support

           1       1.00      1.00      1.00         3
           2       1.00      1.00      1.00         2
           3       1.00      1.00      1.00         2
           4       1.00      1.00      1.00         5

    accuracy                           1.00        12
   macro avg       1.00      1.00      1.00        12
weighted avg       1.00      1.00      1.00        12



The Random Forest Classifier model has achieved excellent results on the test set:

Accuracy: 100%  
Precision: 100% for all fruit labels (1, 2, 3, 4), meaning every sample predicted as a given label actually belonged to that label.  
Recall: 100% for all fruit labels, meaning the model identified every test sample of each label.  
F1-Score: 100% for all fruit labels; this is the harmonic mean of precision and recall.
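
The three metrics above can be written out from raw per-class counts. The following is a minimal sketch; the function name `precision_recall_f1` is illustrative, and reporting undefined ratios as 0.0 is an assumed convention chosen for this example.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute per-class precision, recall, and F1 from raw counts.

    Undefined ratios (zero denominator) are reported as 0.0 here, which is an
    illustrative convention rather than the only possible choice.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Label 4 in the report above: 5 test samples, all predicted correctly.
perfect = precision_recall_f1(true_positives=5, false_positives=0, false_negatives=0)
print(perfect)

# A hypothetical imperfect case: 3 hits, 1 false alarm, 2 misses.
imperfect = precision_recall_f1(true_positives=3, false_positives=1, false_negatives=2)
print(imperfect)
```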

These results suggest that the model classifies the fruits well from their features (mass, width, height, color_score). However, the test set contains only 12 samples, so a perfect score is a noisy estimate and may overstate how well the model generalizes. To check the model's robustness, consider additional validation methods such as cross-validation, or gather more data.
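
One common additional validation method is stratified k-fold cross-validation, which scores the model on several different train/test partitions instead of a single 12-sample test set. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo`), not on the fruit dataset, so the printed scores say nothing about the real model. In the notebook, `X` and `y` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the fruit dataset): four classes with the same
# counts as above, each drawn around a different centre in 4 dimensions.
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], [19, 5, 19, 16])
X_demo = rng.normal(loc=y_demo[:, None] * 2.0, scale=1.0, size=(len(y_demo), 4))

model_cv = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42)
)

# With 5 samples in the smallest class, 5 folds leaves one such sample per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model_cv, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"fold accuracies: {cv_scores}")
print(f"mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```

The spread of the fold accuracies gives a rough sense of how much a single-split score can vary on a dataset this small.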

Building classification models for all fruit subtypes involves a similar process, but this time our target variable will be the fruit_subtype instead of fruit_label or fruit_name. Given the potentially larger number of classes and more specific distinctions between them, this task might be more challenging.

In [15]:
# preparing the data with fruit_subtype as the target variable
y_subtype = fruits_data['fruit_subtype']

In [17]:
# splitting the dataset into training and testing sets for the new target
X_train_subtype, X_test_subtype, y_train_subtype, y_test_subtype = train_test_split(X, y_subtype, 
                                                                                    test_size=0.2, random_state=42)

In [18]:
# creating RandomForest classifier model
model_subtype = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))

In [19]:
# training the model
model_subtype.fit(X_train_subtype, y_train_subtype)

In [20]:
# making predictions
y_pred_subtype = model_subtype.predict(X_test_subtype)

In [22]:
# evaluating the model
accuracy_subtype = accuracy_score(y_test_subtype, y_pred_subtype)
classification_report_subtype = classification_report(y_test_subtype, y_pred_subtype)
print(f'Accuracy is {accuracy_subtype},\nClassification report:\n{classification_report_subtype}')

Accuracy is 0.9166666666666666,
Classification report:
                  precision    recall  f1-score   support

        braeburn       0.00      0.00      0.00         1
     cripps_pink       0.00      0.00      0.00         0
golden_delicious       1.00      1.00      1.00         1
    granny_smith       1.00      1.00      1.00         1
        mandarin       1.00      1.00      1.00         2
  spanish_belsan       1.00      1.00      1.00         2
   spanish_jumbo       1.00      1.00      1.00         1
    turkey_navel       1.00      1.00      1.00         1
         unknown       1.00      1.00      1.00         3

        accuracy                           0.92        12
       macro avg       0.78      0.78      0.78        12
    weighted avg       0.92      0.92      0.92        12





Accuracy: 91.67% (11 of 12 test samples).  
The precision, recall, and f1-score vary across subtypes.  
'braeburn' and 'cripps_pink' both show a precision and recall of 0, but for different reasons. The single 'braeburn' test sample was misclassified, so its recall is 0; the model never predicted 'braeburn', so its precision is undefined and reported as 0. 'cripps_pink' has a support of 0: it does not occur in the test set and appears in the report only because the model predicted it, so its precision is 0 and its recall is undefined. A likely cause is the small number of training samples for these subtypes.  
For the other subtypes ('golden_delicious', 'granny_smith', 'mandarin', 'spanish_belsan', 'spanish_jumbo', 'turkey_navel', and 'unknown'), the model achieved 100% precision and recall.  
The overall accuracy is quite high, but the variability in performance across subtypes suggests that the model might benefit from a more balanced dataset or perhaps a different modeling approach for subtypes with fewer samples.
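
The averages in the subtype report can be reproduced from the report itself. The sketch below rebuilds the 12 test labels from the support column and assumes the single error is the 'braeburn' sample being predicted as 'cripps_pink', which is the only pattern consistent with the reported support and accuracy. The names `y_true_demo` and `y_pred_demo` are illustrative, and `zero_division=0` is used so that undefined metrics are reported as 0.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Test labels reconstructed from the support column of the report above.
y_true_demo = (
    ["braeburn", "golden_delicious", "granny_smith"]
    + ["mandarin"] * 2
    + ["spanish_belsan"] * 2
    + ["spanish_jumbo", "turkey_navel"]
    + ["unknown"] * 3
)
# Predictions: everything correct except the braeburn sample, which is
# predicted as cripps_pink (the only pattern consistent with the report).
y_pred_demo = ["cripps_pink"] + y_true_demo[1:]

acc = accuracy_score(y_true_demo, y_pred_demo)

# zero_division=0 reports undefined precision/recall as 0.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average="macro", zero_division=0
)
weighted_p, weighted_r, weighted_f1, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average="weighted", zero_division=0
)

print(f"accuracy={acc:.4f}")
print(f"macro avg precision={macro_p:.2f}, weighted avg precision={weighted_p:.2f}")
```

The macro average treats all nine subtypes equally, so the two zero-scoring classes pull it down to 7/9 (about 0.78). The weighted average scales each class by its support, so 'cripps_pink' (support 0) contributes nothing and the result is 11/12 (about 0.92).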