# Prediction
In this notebook, we preprocess the data and feed it to the machine learning models.

## Importing the Dataset

In [1]:
import sys
import os

# Add the directory containing utils.py to the Python path
sys.path.append(os.path.abspath(os.path.join('..', '..')))

In [2]:
from utils import load_dataset

DATA_PATH = '../data/apple_quality.csv'
apple_data = load_dataset(DATA_PATH)
apple_data.head()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.491590483,good
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809367,good
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,bad
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723217,good
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,good


## Data Cleaning
During data cleaning, we handle missing values and convert categorical variables to numerical ones.

In [3]:
# Check for missing values
missing_values = apple_data.isnull().sum()
missing_values

A_id           1
Size           1
Weight         1
Sweetness      1
Crunchiness    1
Juiciness      1
Ripeness       1
Acidity        0
Quality        1
dtype: int64

Every column except `Acidity` has exactly one missing value. Given that we have 4000 records at our disposal, we can simply drop the rows that contain missing values.
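The per-column counts above do not show whether the missing values share a row or are spread over several rows. A minimal sketch of how to check this, using a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not taken from the apple dataset):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Size": [1.2, -0.4, np.nan, 0.7],
    "Weight": [0.3, 1.1, np.nan, -2.0],
    "Quality": ["good", "bad", None, "good"],
})

# A row is affected if at least one of its columns is missing.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# dropna() with default arguments removes every such row.
cleaned = toy.dropna()

print(f"rows with missing values: {rows_with_missing} of {len(toy)}")
print(f"rows kept after dropna: {len(cleaned)}")
```

Running the same two checks on `apple_data` would show how many records `dropna()` actually discards.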

In [4]:
apple_data = apple_data.dropna()
apple_data.head()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.491590483,good
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809367,good
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,bad
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723217,good
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,good


In [5]:
# Confirm that no missing values remain after dropping rows
missing_values = apple_data.isnull().sum()
missing_values

A_id           0
Size           0
Weight         0
Sweetness      0
Crunchiness    0
Juiciness      0
Ripeness       0
Acidity        0
Quality        0
dtype: int64

As we can now see, there are no more missing values in the dataset. Next, we transform the categorical column `Quality` into numerical values. We use a `LabelEncoder`, which assigns integer codes in sorted label order, so `bad` becomes 0 and `good` becomes 1. This is fine for an attribute with only two values. For attributes with more than two values, we would use a `OneHotEncoder` instead, because integer codes would suggest an ordering between categories that does not exist.
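For comparison, here is a minimal sketch of one-hot encoding an attribute with more than two values. The `Color` column is hypothetical and does not exist in the apple dataset; it only illustrates the case described above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical attribute with three categories (not part of the apple dataset).
toy = pd.DataFrame({"Color": ["red", "green", "yellow", "green"]})

one_hot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded_colors = one_hot.fit_transform(toy[["Color"]]).toarray()

# Categories are stored in sorted order, one output column per category.
print(one_hot.categories_[0])
print(encoded_colors)
```

Each row contains a single 1 in the column of its category, so no artificial ordering between categories is introduced.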

In [6]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
apple_quality = apple_data['Quality']
apple_quality_encoded = encoder.fit_transform(apple_quality)
apple_quality_encoded

array([1, 1, 0, ..., 0, 1, 1])
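The encoded array alone does not say which label each integer stands for. A short sketch of how to inspect the mapping, using a small illustrative list in place of the `Quality` column (the names `toy_labels` and `toy_encoder` are assumptions made for this example):

```python
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the `Quality` column (values are illustrative).
toy_labels = ["good", "good", "bad", "good"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ lists the original labels; the position of each label is its code.
mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = [str(label) for label in toy_encoder.inverse_transform(toy_encoded)]
print(decoded)
```

Applying the same `classes_` lookup to the fitted `encoder` above shows the mapping used for the real data.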

In [7]:
# Store the encoded labels in a new column and remove the original text column
apple_data['QualityEncoded'] = apple_quality_encoded
apple_data = apple_data.drop(columns='Quality')
apple_data.head()

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,QualityEncoded
0,0.0,-3.970049,-2.512336,5.34633,-1.012009,1.8449,0.32984,-0.491590483,1
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.86753,-0.722809367,1
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,0
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.63797,-3.413761,0.790723217,1
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,1


## A