# Good Apples

## Introduction

This dataset was found on Kaggle at https://www.kaggle.com/datasets/nelgiriyewithana/apple-quality?resource=download

The data describes the variables used to evaluate whether an apple is 'good' or 'bad'. It includes the following variables:

1. **A_id**: Unique identifier for each fruit
2. **Size**: Size of the fruit
3. **Weight**: Weight of the fruit
4. **Sweetness**: Degree of sweetness of the fruit
5. **Crunchiness**: Texture indicating the crunchiness of the fruit
6. **Juiciness**: Level of juiciness of the fruit
7. **Ripeness**: Stage of ripeness of the fruit
8. **Acidity**: Acidity level of the fruit
9. **Quality**: Overall quality of the fruit - **Outcome variable**

## Data Preparation

#### Importing relevant packages 


In [6]:
# Import needed packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# If you're working in Jupyter Notebook, include the following so that plots will display:
%matplotlib inline

### Overview of the data

In [18]:
df = pd.read_csv('apple_quality.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4001 entries, 0 to 4000
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4001 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB


#### Dealing with nulls and formatting
The `df.info()` output shows an inconsistency: *Acidity* has 4001 non-null values, while every other column has 4000. To investigate, I located the rows that contain nulls.

In [20]:
nan_rows = df[df.isnull().any(axis=1)]
nan_rows

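|      | A_id | Size | Weight | Sweetness | Crunchiness | Juiciness | Ripeness | Acidity | Quality |
|------|------|------|--------|-----------|-------------|-----------|----------|---------|---------|
| 4000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Created_by_Nidula_Elgiriyewithana | NaN |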


It seems that the final row has a signature from the dataset creator. This will be removed.

In [22]:
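# Keep the first 4000 rows; .copy() helps avoid a SettingWithCopyWarning when columns are modified later
df = df.iloc[0:4000].copy()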
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4000 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB


The data now looks consistent. As a final tidy-up, I convert the *A_id* column to an integer, since it is an identifier rather than a measurement.

In [26]:
df['A_id'] = df['A_id'].astype('int')
df['A_id'].dtype

dtype('int32')
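
One thing the `df.info()` output above still shows is that *Acidity* is stored as `object`, most likely because the signature string forced the whole column to be read as text. A minimal sketch of how it could be converted to a numeric type with `pd.to_numeric` follows. The small `demo` frame and its values are made up for illustration and are not taken from the dataset; on the real data the same call would be applied to `df['Acidity']`.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame: Acidity read as strings (object dtype).
# These values are invented for illustration only.
demo = pd.DataFrame({"Acidity": ["0.25", "-1.5", "2.0"]})

# errors="coerce" turns any unparseable entry into NaN instead of raising,
# so leftover text would show up in the NaN count below.
demo["Acidity"] = pd.to_numeric(demo["Acidity"], errors="coerce")

print(demo["Acidity"].dtype)         # float64
print(demo["Acidity"].isna().sum())  # 0
```

Checking the NaN count after the conversion is a cheap way to confirm that no other stray text was hiding in the column.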

### Addressing assumptions

#### Quality of input features
Random Forest assumes that input features are relevant and informative for the prediction task. If features are **noisy, irrelevant, or contain biases**, the model's performance may degrade.

This assumption can be evaluated by asking whether the recorded features capture what actually makes an apple 'good' or 'bad'. **What defines a good apple?**

- Is a good apple large? Small? Or is there an optimal medium? Is it sweet? Crunchy? Sour?

After looking over the variables, it seems that they generally describe the factors that contribute to an apple's quality (subject to personal preference, of course). The only thing that may be missing is the concept of apple health: blemishes, wear and tear, rotting, etc. This is possibly wrapped into other features such as ripeness and crunchiness, though.

**Evaluation - The features appear relevant**

In [43]:
# Pairwise correlations between the predictor variables
predict_corr = df[['Size','Weight','Sweetness','Crunchiness','Juiciness','Ripeness','Acidity']].corr()
predict_corr


|             | Size | Weight | Sweetness | Crunchiness | Juiciness | Ripeness | Acidity |
|-------------|------|--------|-----------|-------------|-----------|----------|---------|
| Size        | 1.0 | -0.170702 | -0.32468 | 0.169868 | -0.018892 | -0.134773 | 0.196218 |
| Weight      | -0.170702 | 1.0 | -0.154246 | -0.095882 | -0.092263 | -0.243824 | 0.016414 |
| Sweetness   | -0.32468 | -0.154246 | 1.0 | -0.037552 | 0.095882 | -0.2738 | 0.085999 |
| Crunchiness | 0.169868 | -0.095882 | -0.037552 | 1.0 | -0.259607 | -0.201982 | 0.069943 |
| Juiciness   | -0.018892 | -0.092263 | 0.095882 | -0.259607 | 1.0 | -0.097144 | 0.248714 |
| Ripeness    | -0.134773 | -0.243824 | -0.2738 | -0.201982 | -0.097144 | 1.0 | -0.202669 |
| Acidity     | 0.196218 | 0.016414 | 0.085999 | 0.069943 | 0.248714 | -0.202669 | 1.0 |
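
To make the matrix easier to read, the pairs can be ranked by absolute correlation. The sketch below rebuilds the matrix from the values printed above (in the notebook, `predict_corr` could be used directly) and keeps each pair once. The names `corr`, `pairs` and `ranked` are my own. The strongest relationship is Size vs. Sweetness at about -0.32, so none of the predictors look close to being duplicates of one another.

```python
import numpy as np
import pandas as pd

cols = ["Size", "Weight", "Sweetness", "Crunchiness", "Juiciness", "Ripeness", "Acidity"]
# Values copied from the correlation output above
values = [
    [1.0, -0.170702, -0.32468, 0.169868, -0.018892, -0.134773, 0.196218],
    [-0.170702, 1.0, -0.154246, -0.095882, -0.092263, -0.243824, 0.016414],
    [-0.32468, -0.154246, 1.0, -0.037552, 0.095882, -0.2738, 0.085999],
    [0.169868, -0.095882, -0.037552, 1.0, -0.259607, -0.201982, 0.069943],
    [-0.018892, -0.092263, 0.095882, -0.259607, 1.0, -0.097144, 0.248714],
    [-0.134773, -0.243824, -0.2738, -0.201982, -0.097144, 1.0, -0.202669],
    [0.196218, 0.016414, 0.085999, 0.069943, 0.248714, -0.202669, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(5))
```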


#### Balance of data set

Random Forest performs well on balanced datasets where each class or target value is adequately represented. Imbalanced datasets can lead to biased models that favor the majority class.
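
A quick way to check this is to count how often each class appears in the outcome variable. The sketch below uses a small made-up `quality` Series in place of `df['Quality']`, and it assumes the labels are the strings `good` and `bad`. The `imbalance_ratio` name is my own shorthand for the majority count divided by the minority count, where a value near 1 indicates a balanced target.

```python
import pandas as pd

# Hypothetical labels standing in for df['Quality']; invented for illustration
quality = pd.Series(["good", "bad", "good", "good", "bad", "good"], name="Quality")

counts = quality.value_counts()                # absolute count per class
shares = quality.value_counts(normalize=True)  # proportion per class

# Majority count divided by minority count; close to 1 means balanced
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print(imbalance_ratio)  # 2.0 for this toy example
```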

#### Homogeneous feature importance
Random Forest assumes that the importance of features remains consistent across the dataset. If feature importance varies significantly across different subsets of data, the model may not generalize well.
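
One rough way to probe this is to fit the same model on two disjoint halves of the data and compare the resulting `feature_importances_`. The sketch below does this on synthetic data from `make_classification`, which only stands in for the seven apple features. The helper `fit_importances` and the `max_gap` summary are my own illustrative choices, not an established test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the seven apple features and the Quality labels
X, y = make_classification(n_samples=600, n_features=7, n_informative=4, random_state=0)

# Shuffle the row indices and split them into two disjoint halves
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
half_a, half_b = idx[:300], idx[300:]

def fit_importances(rows):
    """Fit a forest on the given rows and return its feature importances."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[rows], y[rows])
    return model.feature_importances_

imp_a = fit_importances(half_a)
imp_b = fit_importances(half_b)

# A large gap between the two vectors would hint at unstable importances
max_gap = np.abs(imp_a - imp_b).max()
print(np.round(imp_a, 3))
print(np.round(imp_b, 3))
print(round(max_gap, 3))
```

On the real data, the halves could instead be defined by a meaningful grouping (for example, small vs. large apples) to see whether the ranking of features changes between groups.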