# EXOTIC FRUIT CLASSIFICATION 
*Fourth Master AI Engineering project*

## Import Libraries and Dataset

* The Python packages used in this project are listed in the `requirements.txt` file
* The dataset is stored in the `Data` folder
* I renamed the columns for two reasons:
    - I needed to translate the Italian column names into English
    - I didn't want measurement units and ranges in the column names

In [7]:
# Import libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
import pygwalker as pyg

from src.preprocessing_and_viz import handle_duplicates, plot_feature_distribution, scatter_plot

In [2]:
# Import Dataset

df = pd.read_csv("Data/Fruits_Dataset.csv")

df.head()

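|   | Frutto | Peso (g) | Diametro medio (mm) | Lunghezza media (mm) | Durezza buccia (1-10) | Dolcezza (1-10) |
|---|--------|----------|---------------------|----------------------|-----------------------|-----------------|
| 0 | Mela | 86.4 | 89.68 | 8.69 | 9.61 | 2.41 |
| 1 | Mela | 77.58 | 73.45 | 6.49 | 7.2 | 3.87 |
| 2 | Mela | 81.95 | 81.66 | 6.4 | 9.09 | 2.88 |
| 3 | Mela | 66.33 | 36.71 | 6.78 | 8.21 | 2.55 |
| 4 | Mela | 56.73 | 75.69 | 5.78 | 9.15 | 3.88 |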


In [3]:
# Renaming columns

new_columns = [
    'Fruit',
    'Weight',
    'Average diameter',
    'Average length',
    'Peel hardness',
    'Sweetness',
]

df.columns = new_columns

df.columns

Index(['Fruit', 'Weight', 'Average diameter', 'Average length',
       'Peel hardness', 'Sweetness'],
      dtype='object')
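
Assigning a list to `df.columns` relies on the column order in the CSV never changing. As a hedged alternative, the sketch below renames by explicit mapping instead. The Italian names are copied from the `df.head()` output; the helper name `rename_columns` and the two-row `sample` frame are assumptions made only for illustration.

```python
import pandas as pd

# Italian column names as they appear in the df.head() output
column_translation = {
    "Frutto": "Fruit",
    "Peso (g)": "Weight",
    "Diametro medio (mm)": "Average diameter",
    "Lunghezza media (mm)": "Average length",
    "Durezza buccia (1-10)": "Peel hardness",
    "Dolcezza (1-10)": "Sweetness",
}


def rename_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with translated column names.

    Raises KeyError if an expected Italian column is missing, so a changed
    CSV layout is noticed instead of silently mislabelled.
    """
    missing = set(column_translation) - set(df.columns)
    if missing:
        raise KeyError(f"Columns not found in the dataset: {sorted(missing)}")
    return df.rename(columns=column_translation)


# First two rows of the dataset, copied from the df.head() output
sample = pd.DataFrame(
    {
        "Frutto": ["Mela", "Mela"],
        "Peso (g)": [86.4, 77.58],
        "Diametro medio (mm)": [89.68, 73.45],
        "Lunghezza media (mm)": [8.69, 6.49],
        "Durezza buccia (1-10)": [9.61, 7.2],
        "Dolcezza (1-10)": [2.41, 3.87],
    }
)

renamed = rename_columns(sample)
print(list(renamed.columns))
```

With a mapping, reordering the columns in the source file does not attach the wrong English name to a feature.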

## Statistics Information

Useful information I got:
* Shape: (500, 6)
* `Fruit` is the target column. There are 5 classes: Apple, Banana, Orange, Grape, Kiwi
* 5 numerical features with continuous values (float64)
* Statistics:
    - the features' means and ranges differ by about one order of magnitude, so I will apply normalization/standardization
    - there appear to be no missing values

In [7]:
# Shape and Info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fruit             500 non-null    object 
 1   Weight            500 non-null    float64
 2   Average diameter  500 non-null    float64
 3   Average length    500 non-null    float64
 4   Peel hardness     500 non-null    float64
 5   Sweetness         500 non-null    float64
dtypes: float64(5), object(1)
memory usage: 23.6+ KB


In [8]:
# Statistics

df.describe()

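|       | Weight | Average diameter | Average length | Peel hardness | Sweetness |
|-------|--------|------------------|----------------|---------------|-----------|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 47.31344 | 86.5731 | 5.29862 | 7.43764 | 3.59322 |
| std | 26.768797 | 64.293403 | 2.641993 | 1.812548 | 1.264899 |
| min | 8.57 | 7.53 | 1.15 | 3.07 | 1.25 |
| 25% | 24.7975 | 50.51 | 2.6725 | 6.1525 | 2.57 |
| 50% | 42.38 | 70.45 | 5.67 | 7.34 | 3.535 |
| 75% | 68.08 | 88.8525 | 7.455 | 8.615 | 4.465 |
| max | 111.21 | 299.89 | 11.14 | 13.72 | 6.95 |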
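
Since the features sit on different scales, here is a minimal sketch of z-score standardization with plain pandas. It is one possible way to do the scaling planned in the notes, not necessarily the approach used later in the project; the function name `standardize` is an assumption, and the sample rows are the five rows from the `df.head()` output with the English column names.

```python
import pandas as pd


def standardize(df: pd.DataFrame, feature_columns: list) -> pd.DataFrame:
    """Hypothetical helper: z-score the given columns, leave the others untouched."""
    scaled = df.copy()
    features = scaled[feature_columns]
    # ddof=0 is the population standard deviation, the convention that
    # scikit-learn's StandardScaler also follows
    scaled[feature_columns] = (features - features.mean()) / features.std(ddof=0)
    return scaled


feature_columns = [
    "Weight",
    "Average diameter",
    "Average length",
    "Peel hardness",
    "Sweetness",
]

# The five rows shown by df.head(), with the renamed columns
sample = pd.DataFrame(
    {
        "Fruit": ["Mela"] * 5,
        "Weight": [86.4, 77.58, 81.95, 66.33, 56.73],
        "Average diameter": [89.68, 73.45, 81.66, 36.71, 75.69],
        "Average length": [8.69, 6.49, 6.4, 6.78, 5.78],
        "Peel hardness": [9.61, 7.2, 9.09, 8.21, 9.15],
        "Sweetness": [2.41, 3.87, 2.88, 2.55, 3.88],
    }
)

scaled = standardize(sample, feature_columns)
print(scaled[feature_columns].mean().round(6))
```

In a train/test workflow the mean and standard deviation would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data.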


In [4]:
# Rename Target values and see how they are distributed

target_value_names = {
    'Mela' : 'Apple',
    'Banana' : 'Banana',
    'Arancia' : 'Orange',
    'Uva' : 'Grape',
    'Kiwi' : 'Kiwi'
}

df['Fruit'] = df['Fruit'].map(lambda x: target_value_names[x])

df['Fruit'].value_counts()

Fruit
Apple     100
Banana    100
Orange    100
Grape     100
Kiwi      100
Name: count, dtype: int64

## Preprocessing

* No duplicates in the dataset, so the shape is still (500, 6)
* No missing values in the dataset

In [5]:
# Check for duplicates

df_cleaned = handle_duplicates(df)

No duplicates found!
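
The source of `handle_duplicates` lives in `src.preprocessing_and_viz` and is not shown here. The sketch below is only a guess at what such a helper could look like, based on the message it prints; the name `handle_duplicates_sketch`, the second message, and the toy rows are all assumptions.

```python
import pandas as pd


def handle_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for handle_duplicates: report and drop repeated rows."""
    # duplicated() marks every row that repeats an earlier row in full
    n_duplicates = int(df.duplicated().sum())
    if n_duplicates == 0:
        print("No duplicates found!")
        return df
    print(f"Dropping {n_duplicates} duplicated row(s)")
    return df.drop_duplicates().reset_index(drop=True)


# Toy rows invented for illustration, not taken from the dataset
toy = pd.DataFrame(
    {
        "Fruit": ["Apple", "Apple", "Kiwi"],
        "Weight": [80.0, 80.0, 20.0],
    }
)

deduplicated = handle_duplicates_sketch(toy)
print(deduplicated.shape)
```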


In [8]:
# Check for missing values

df_cleaned.isna().sum()

Fruit               0
Weight              0
Average diameter    0
Average length      0
Peel hardness       0
Sweetness           0
dtype: int64

## Data Visualization

Distribution of each feature on its own (no comparison with the target column):
* Weight:
* Average diameter:
* Average length:
* Peel hardness:
* Sweetness:

Distribution of each feature and scatter plots, broken down by the target column:
* Weight:
* Average diameter:
* Average length:
* Peel hardness:
* Sweetness:

In [10]:
# Distribution of one feature, chosen from ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness']

feature_to_viz = 'Sweetness'

plot_feature_distribution(df_cleaned, feature_to_viz, comparison=False)

In [11]:
# Distribution of one feature, chosen from ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness'],
# split by the TARGET column 'Fruit'

feature_to_viz = 'Sweetness'

plot_feature_distribution(df_cleaned, feature_to_viz, comparison='Fruit')
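
As a numeric complement to the per-class plot, the same comparison can be sketched with a `groupby` summary. The helper name `summarize_by_target` is an assumption, and the toy values are invented for illustration, not taken from the dataset.

```python
import pandas as pd


def summarize_by_target(df: pd.DataFrame, feature: str, target: str = "Fruit") -> pd.DataFrame:
    """Hypothetical helper: per-class count, mean, std, min and max of one feature."""
    return df.groupby(target)[feature].agg(["count", "mean", "std", "min", "max"])


# Toy values invented for illustration only
toy = pd.DataFrame(
    {
        "Fruit": ["Apple", "Apple", "Kiwi", "Kiwi"],
        "Sweetness": [2.0, 4.0, 5.0, 7.0],
    }
)

summary = summarize_by_target(toy, "Sweetness")
print(summary)
```

On the real data this would be called as `summarize_by_target(df_cleaned, feature_to_viz)`; classes whose means sit far apart relative to their standard deviations are the ones the feature separates well.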

In [8]:
# Scatter plot of two features, chosen from ['Weight', 'Average diameter', 'Average length', 'Peel hardness', 'Sweetness'],
# coloured by the TARGET column 'Fruit'

feature_x = 'Weight'
feature_y = 'Average diameter'

scatter_plot(df_cleaned, feature_x, feature_y, target='Fruit')

In [7]:
pyg.walk(df_cleaned)


## Feature Engineering

## K-Nearest Neighbors (KNN) implementation

## Results

## [EXTRA] Machine Learning Models implementation