# EXOTIC FRUIT CLASSIFICATION 
*Fourth Master AI Engineering project*

## Import Libraries and Dataset

* The Python packages I used in this project are listed in the requirements.txt file
* The dataset is stored in the 'Data' folder
* I renamed the columns for two reasons:
    - I needed to translate the Italian column names into English
    - I didn't want measurement units and ranges in the column names

In [1]:
# Import libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import pygwalker as pyg

from src.preprocessing_and_viz import handle_duplicates, plot_feature_distribution, scatter_plot, feature_scaling

In [2]:
# Import Dataset

df = pd.read_csv("Data/Fruits_Dataset.csv")

df.head()

| | Frutto | Peso (g) | Diametro medio (mm) | Lunghezza media (mm) | Durezza buccia (1-10) | Dolcezza (1-10) |
|---|---|---|---|---|---|---|
| 0 | Mela | 86.4 | 89.68 | 8.69 | 9.61 | 2.41 |
| 1 | Mela | 77.58 | 73.45 | 6.49 | 7.2 | 3.87 |
| 2 | Mela | 81.95 | 81.66 | 6.4 | 9.09 | 2.88 |
| 3 | Mela | 66.33 | 36.71 | 6.78 | 8.21 | 2.55 |
| 4 | Mela | 56.73 | 75.69 | 5.78 | 9.15 | 3.88 |


In [3]:
# Renaming columns

new_columns = [
    'Fruit',
    'Weight',
    'Average diameter',
    'Average length',
    'Peel hardness',
    'Sweetness',
]

df.columns = new_columns

df.columns

Index(['Fruit', 'Weight', 'Average diameter', 'Average length',
       'Peel hardness', 'Sweetness'],
      dtype='object')
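
Assigning `df.columns` positionally works only as long as the column order in the CSV never changes. A minimal sketch of a more defensive alternative, assuming the Italian headers shown in the `df.head()` output above, uses an explicit mapping. The names `column_map`, `sample` and `renamed` are illustrative and not part of the project:

```python
import pandas as pd

# Mapping from the original Italian headers to the English names used in this notebook
column_map = {
    "Frutto": "Fruit",
    "Peso (g)": "Weight",
    "Diametro medio (mm)": "Average diameter",
    "Lunghezza media (mm)": "Average length",
    "Durezza buccia (1-10)": "Peel hardness",
    "Dolcezza (1-10)": "Sweetness",
}

# One-row stand-in for the real dataset (first row of the df.head() output)
sample = pd.DataFrame(
    [["Mela", 86.4, 89.68, 8.69, 9.61, 2.41]],
    columns=list(column_map),
)

# Fail early if a header is missing instead of silently mislabeling a column
missing = set(column_map) - set(sample.columns)
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")

renamed = sample.rename(columns=column_map)
print(list(renamed.columns))
```

With the real data, `sample` would simply be the DataFrame returned by `pd.read_csv`.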

## Statistics Information

Useful information I got:
* Shape: (500,6)
* `Fruit` is the target column. There are 5 options: Apple, Banana, Orange, Grape, Kiwi
* 5 numerical features with continuous values (float64)
* Statistics:
    - mean values and ranges differ by about an order of magnitude across features (mean Average diameter is 86.57, mean Sweetness is 3.59), so I will apply normalization/standardization
    - The maximum value for Peel hardness should have been 10, but it's clearly higher (13.72). Strange!
    - there seem to be no missing values

In [7]:
# Shape and Info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fruit             500 non-null    object 
 1   Weight            500 non-null    float64
 2   Average diameter  500 non-null    float64
 3   Average length    500 non-null    float64
 4   Peel hardness     500 non-null    float64
 5   Sweetness         500 non-null    float64
dtypes: float64(5), object(1)
memory usage: 23.6+ KB


In [8]:
# Statistics

df.describe()

| | Weight | Average diameter | Average length | Peel hardness | Sweetness |
|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 47.31344 | 86.5731 | 5.29862 | 7.43764 | 3.59322 |
| std | 26.768797 | 64.293403 | 2.641993 | 1.812548 | 1.264899 |
| min | 8.57 | 7.53 | 1.15 | 3.07 | 1.25 |
| 25% | 24.7975 | 50.51 | 2.6725 | 6.1525 | 2.57 |
| 50% | 42.38 | 70.45 | 5.67 | 7.34 | 3.535 |
| 75% | 68.08 | 88.8525 | 7.455 | 8.615 | 4.465 |
| max | 111.21 | 299.89 | 11.14 | 13.72 | 6.95 |
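
The original headers declared a 1-10 scale for Peel hardness and Sweetness, so the 13.72 maximum can be investigated by counting values outside that scale. A minimal sketch, where `expected_ranges`, `out_of_range_counts` and the toy rows are assumptions made for illustration:

```python
import pandas as pd

# Expected scales taken from the original column headers: "(1-10)"
expected_ranges = {"Peel hardness": (1, 10), "Sweetness": (1, 10)}


def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count, per column, the values that fall outside [low, high].

    Missing values are counted as out of range, because
    Series.between returns False for NaN.
    """
    counts = {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }
    return pd.Series(counts, name="out_of_range")


# Toy rows for illustration only; with the real data, pass `df` instead
toy = pd.DataFrame(
    {"Peel hardness": [9.61, 7.20, 13.72], "Sweetness": [2.41, 3.87, 6.95]}
)
result = out_of_range_counts(toy, expected_ranges)
print(result)
```

Whether such rows should be clipped, dropped or kept is a separate decision; the sketch only reports how many there are.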


In [4]:
# Rename Target values and see how they are distributed

target_value_names = {
    'Mela' : 'Apple',
    'Banana' : 'Banana',
    'Arancia' : 'Orange',
    'Uva' : 'Grape',
    'Kiwi' : 'Kiwi'
}

df['Fruit'] = df['Fruit'].map(lambda x: target_value_names[x])

df['Fruit'].value_counts()

Fruit
Apple     100
Banana    100
Orange    100
Grape     100
Kiwi      100
Name: count, dtype: int64

## Preprocessing

* No duplicates in the dataset. Shape is still (500,6)
* No missing values in the dataset.

In [5]:
# Check for duplicates

df_cleaned = handle_duplicates(df)

No duplicates found!
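
The source of `handle_duplicates` lives in `src/preprocessing_and_viz` and is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it reports and drops fully duplicated rows (the name `handle_duplicates_sketch` and the toy data are hypothetical):

```python
import pandas as pd


def handle_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: report and drop fully duplicated rows."""
    n_duplicates = int(df.duplicated().sum())
    if n_duplicates == 0:
        print("No duplicates found!")
        return df.copy()
    print(f"Removed {n_duplicates} duplicated rows")
    # Keep the first occurrence of each row and rebuild a clean index
    return df.drop_duplicates().reset_index(drop=True)


# Toy data for illustration only: the first two rows are identical
toy = pd.DataFrame(
    {"Fruit": ["Apple", "Apple", "Kiwi"], "Weight": [86.4, 86.4, 20.1]}
)
deduplicated = handle_duplicates_sketch(toy)
```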


In [8]:
# Check for missing values

df_cleaned.isna().sum()

Fruit               0
Weight              0
Average diameter    0
Average length      0
Peel hardness       0
Sweetness           0
dtype: int64

## Data Visualization

Feature distribution without comparison with target column:
* Weight: as expected, it is a positively (right) skewed distribution: the mean (47.31) is above the median (42.38) and the tail reaches a high max value (111.21)
* Average diameter: most fruits are small, so the distribution is positively (right) skewed, with the highest values coming from the Banana class
* Average length: its distribution has two peaks, a sign of fruits with well-separated lengths
* Peel hardness: has an almost normal distribution with a couple of possible outliers; median and mean are nearly the same
* Sweetness: has an almost normal distribution; median and mean are nearly the same

Feature distribution compared with the target column:
* Weight, Average diameter, Average length: each fruit follows the distribution that was expected; interestingly, Apple, Orange and Kiwi have almost the same distribution for Average diameter
* Peel hardness: each fruit seems to follow a normal distribution
* Sweetness: Orange, Grape and Kiwi reach higher values, as expected

Scatter plot to see how features are distributed according to target column:
* Weight vs Average diameter: two clusters, for Banana and Grape, are clearly visible. Apple, Orange and Kiwi may be more difficult to classify
* Average length vs Average diameter: this plot confirms the previous point
* Peel hardness vs Weight: lighter fruits have higher Peel hardness values
* Sweetness vs Average diameter: smaller fruits have higher Sweetness values
* Peel hardness vs Sweetness: Banana has the lowest Sweetness values and well-distributed Peel hardness values. Orange, instead, has higher Sweetness values and the lowest Peel hardness values
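
The skew direction of each feature can also be checked numerically rather than read off the plots. A minimal sketch using pandas' `skew()`, where a positive value indicates a longer right tail (mean above median) and a negative value a longer left tail; `shape_summary` and the toy columns are illustrative assumptions:

```python
import pandas as pd


def shape_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median and sample skewness for every numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "mean": numeric.mean(),
            "median": numeric.median(),
            "skew": numeric.skew(),
        }
    )


# Toy values for illustration only: a long right tail and a symmetric column
toy = pd.DataFrame(
    {
        "right_tailed": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 20.0],
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)
summary = shape_summary(toy)
print(summary)
```

Calling `shape_summary(df_cleaned)` would give the same three numbers for the five fruit features.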

In [28]:
# Feature distribution among ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness']

feature_to_viz = 'Weight'

plot_feature_distribution(df_cleaned, feature_to_viz, comparison=False)

In [33]:
# Feature distribution among ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness']
# and comparison with TARGET 'Fruit' 

feature_to_viz = 'Sweetness'

plot_feature_distribution(df_cleaned, feature_to_viz, comparison='Fruit')

In [39]:
# Scatter plot among ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness']
# and comparison with TARGET 'Fruit' 

feature_x = 'Sweetness'
feature_y = 'Average diameter'

scatter_plot(df_cleaned, feature_x, feature_y, target='Fruit')

In [None]:
# Visualization dashboard using PYGWALKER: explore the previous plots interactively
# You can save a chart configuration to a JSON file and share it with other people using this code:
# pyg.walk(df_cleaned, spec="Images/pygwalker_config_1.json")


pyg.walk(df_cleaned)

## Feature Engineering 

Two options:
* Feature scaling only with MinMaxScaler method:
    - it brings all features into the same range of values [0,1]
    - intuitive and fast to implement
* Feature scaling only with StandardScaler method:
    - it's useful to get stable and robust distributions from a statistical point of view, bringing all features to mean 0 and std 1
    - in this dataset, features are close, or at least quite close, to a normal distribution

**My decision** is to use *StandardScaler* because:
1. in this dataset, there are features that should be normally distributed. All of them are properties of fruits, so as the number of instances increases I think they will follow a Gaussian distribution
2. MinMaxScaler compresses data to a fixed range, which could cause a loss of significant information
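
The two options can be written out directly. The sketch below uses plain pandas to apply the formulas that scikit-learn's `MinMaxScaler` and `StandardScaler` compute with their default settings. It is not the `feature_scaling` helper from `src`, whose code is not shown here, and the function names and toy data are assumptions:

```python
import pandas as pd


def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale each numeric column to [0, 1]: (x - min) / (max - min)."""
    numeric = df.select_dtypes(include="number")
    return (numeric - numeric.min()) / (numeric.max() - numeric.min())


def standard_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Center each numeric column at 0 with unit variance: (x - mean) / std.

    ddof=0 (population standard deviation) matches StandardScaler.
    """
    numeric = df.select_dtypes(include="number")
    return (numeric - numeric.mean()) / numeric.std(ddof=0)


# Toy data for illustration only
toy = pd.DataFrame({"Weight": [10.0, 20.0, 40.0], "Sweetness": [2.0, 4.0, 6.0]})
minmax = min_max_scale(toy)
standard = standard_scale(toy)
print(minmax)
print(standard)
```

This sketch does not guard against constant columns, where the denominator would be zero.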

In [13]:
# Feature scaling with MinMaxScaler method

df_minmax_transformed = feature_scaling(df_cleaned, method = 'MinMax')
df_minmax_transformed.describe().map(lambda x: f"{x:.2f}")

| | Weight | Average diameter | Average length | Peel hardness | Sweetness |
|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 0.38 | 0.27 | 0.42 | 0.41 | 0.41 |
| std | 0.26 | 0.22 | 0.26 | 0.17 | 0.22 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.16 | 0.15 | 0.15 | 0.29 | 0.23 |
| 50% | 0.33 | 0.22 | 0.45 | 0.4 | 0.4 |
| 75% | 0.58 | 0.28 | 0.63 | 0.52 | 0.56 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [12]:
# Feature scaling with StandardScaler method

df_standard_transformed = feature_scaling(df_cleaned, method = 'Standard')
df_standard_transformed.describe().map(lambda x: f'{x:.2f}')

| | Weight | Average diameter | Average length | Peel hardness | Sweetness |
|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | -0.0 | -0.0 | 0.0 | 0.0 | -0.0 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -1.45 | -1.23 | -1.57 | -2.41 | -1.85 |
| 25% | -0.84 | -0.56 | -0.99 | -0.71 | -0.81 |
| 50% | -0.18 | -0.25 | 0.14 | -0.05 | -0.05 |
| 75% | 0.78 | 0.04 | 0.82 | 0.65 | 0.69 |
| max | 2.39 | 3.32 | 2.21 | 3.47 | 2.66 |


## Visualization after feature engineering

In [14]:
# Scatter plot among ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness']
# and comparison with TARGET 'Fruit' 

feature_x = 'Sweetness'
feature_y = 'Average diameter'

scatter_plot(df_standard_transformed, feature_x, feature_y, target='Fruit')

## K-Nearest Neighbors (KNN) implementation

## Results

## [EXTRA] Machine Learning Models implementation