Instructions
Objective:
To determine the variety of date fruit from data describing the colour, length, diameter, and shape.

Data:
Obtained from DATASETS (muratkoklu.com) and used in M. Koklu, R. Kursun, Y.S. Taspinar, and I. Cinar, "Classification of Date Fruits into Genetic Varieties Using Image Analysis," Mathematical Problems in Engineering, Vol.2021, Article ID: 4793293 (2021).

Problem Statement:
In food production it is important to properly label ingredients for both health and business reasons. However, sometimes mistakes are made and there is room for improvement in food labeling practices. A number of different types of dates are grown around the world, and it takes expertise to correctly identify the variety. Your job as a machine learning developer is to create a model that can identify the type of date from external features such as colour, length, diameter and shape factors which have been determined by a computer vision model.

Steps to be completed:
Create a Jupyter notebook and complete the following steps:


In [70]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix


Data
Load Date_Fruit_Datasets.csv into a pandas dataframe. Print out the header. Use pandas.DataFrame.describe to summarize the data. Using markdown, explain the meaning of the columns (as well as you can with the information available) and make observations about the dataset.


In [3]:
url = "https://raw.githubusercontent.com/Hunteracademic/Neural_network_group_7/master/Date_Fruit_Datasets.csv"
Fruit_data = pd.read_csv(url)

In [4]:
Fruit_data.head()

Unnamed: 0,AREA,PERIMETER,MAJOR_AXIS,MINOR_AXIS,ECCENTRICITY,EQDIASQ,SOLIDITY,CONVEX_AREA,EXTENT,ASPECT_RATIO,...,KurtosisRR,KurtosisRG,KurtosisRB,EntropyRR,EntropyRG,EntropyRB,ALLdaub4RR,ALLdaub4RG,ALLdaub4RB,Class
0,422163,2378.908,837.8484,645.6693,0.6373,733.1539,0.9947,424428,0.7831,1.2976,...,3.237,2.9574,4.2287,-59191260000.0,-50714214400,-39922372608,58.7255,54.9554,47.84,BERHI
1,338136,2085.144,723.8198,595.2073,0.569,656.1464,0.9974,339014,0.7795,1.2161,...,2.6228,2.635,3.1704,-34233070000.0,-37462601728,-31477794816,50.0259,52.8168,47.8315,BERHI
2,526843,2647.394,940.7379,715.3638,0.6494,819.0222,0.9962,528876,0.7657,1.315,...,3.7516,3.8611,4.7192,-93948350000.0,-74738221056,-60311207936,65.4772,59.286,51.9378,BERHI
3,416063,2351.21,827.9804,645.2988,0.6266,727.8378,0.9948,418255,0.7759,1.2831,...,5.0401,8.6136,8.2618,-32074310000.0,-32060925952,-29575010304,43.39,44.1259,41.1882,BERHI
4,347562,2160.354,763.9877,582.8359,0.6465,665.2291,0.9908,350797,0.7569,1.3108,...,2.7016,2.9761,4.4146,-39980970000.0,-35980042240,-25593278464,52.7743,50.908,42.6666,BERHI


In [5]:
Fruit_data.describe()

Unnamed: 0,AREA,PERIMETER,MAJOR_AXIS,MINOR_AXIS,ECCENTRICITY,EQDIASQ,SOLIDITY,CONVEX_AREA,EXTENT,ASPECT_RATIO,...,SkewRB,KurtosisRR,KurtosisRG,KurtosisRB,EntropyRR,EntropyRG,EntropyRB,ALLdaub4RR,ALLdaub4RG,ALLdaub4RB
count,898.0,898.0,898.0,898.0,898.0,898.0,898.0,898.0,898.0,898.0,...,898.0,898.0,898.0,898.0,898.0,898.0,898.0,898.0,898.0,898.0
mean,298295.207127,2057.660953,750.811994,495.872785,0.737468,604.577938,0.98184,303845.592428,0.736267,2.131102,...,0.250518,4.247845,5.110894,3.780928,-31850210000.0,-29018600000.0,-27718760000.0,50.082888,48.805681,48.098393
std,107245.205337,410.012459,144.059326,114.268917,0.088727,119.593888,0.018157,108815.656947,0.053745,17.820778,...,0.632918,2.892357,3.745463,2.049831,20372410000.0,17129520000.0,14841370000.0,16.063125,14.125911,10.813862
min,1987.0,911.828,336.7227,2.2832,0.3448,50.2984,0.8366,2257.0,0.5123,1.0653,...,-1.0291,1.7082,1.6076,1.7672,-109122000000.0,-92616970000.0,-87471770000.0,15.1911,20.5247,22.13
25%,206948.0,1726.0915,641.06865,404.684375,0.685625,513.317075,0.978825,210022.75,0.705875,1.373725,...,-0.19695,2.536625,2.50885,2.577275,-44294440000.0,-38946380000.0,-35645340000.0,38.224425,38.654525,39.250725
50%,319833.0,2196.34545,791.3634,495.05485,0.7547,638.14095,0.9873,327207.0,0.74695,1.52415,...,0.13555,3.0698,3.1278,3.0807,-28261560000.0,-26209900000.0,-23929280000.0,53.8413,50.3378,49.6141
75%,382573.0,2389.716575,858.63375,589.0317,0.80215,697.930525,0.9918,388804.0,0.77585,1.67475,...,0.59395,4.44985,7.3204,4.283125,-14604820000.0,-14331050000.0,-16603670000.0,63.06335,59.5736,56.666675
max,546063.0,2811.9971,1222.723,766.4536,1.0,833.8279,0.9974,552598.0,0.8562,535.5257,...,3.0923,26.1711,26.7367,32.2495,-162731600.0,-562772700.0,-437043500.0,79.8289,83.0649,74.1046


In [6]:
Fruit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 898 entries, 0 to 897
Data columns (total 35 columns):
AREA             898 non-null int64
PERIMETER        898 non-null float64
MAJOR_AXIS       898 non-null float64
MINOR_AXIS       898 non-null float64
ECCENTRICITY     898 non-null float64
EQDIASQ          898 non-null float64
SOLIDITY         898 non-null float64
CONVEX_AREA      898 non-null int64
EXTENT           898 non-null float64
ASPECT_RATIO     898 non-null float64
ROUNDNESS        898 non-null float64
COMPACTNESS      898 non-null float64
SHAPEFACTOR_1    898 non-null float64
SHAPEFACTOR_2    898 non-null float64
SHAPEFACTOR_3    898 non-null float64
SHAPEFACTOR_4    898 non-null float64
MeanRR           898 non-null float64
MeanRG           898 non-null float64
MeanRB           898 non-null float64
StdDevRR         898 non-null float64
StdDevRG         898 non-null float64
StdDevRB         898 non-null float64
SkewRR           898 non-null float64
SkewRG           898 non-


Use pandas.DataFrame.info to check that the entries have the correct datatypes and whether there are any missing values. Use pandas.DataFrame.duplicated to check for duplicate entries. Fix the dataset so that there are no missing values, duplicate rows, or incorrect data types. Use markdown to make observations and explain what you have done.


In [7]:
print(Fruit_data.duplicated().sum())

0


In [8]:
print(Fruit_data.isna().sum())

AREA             0
PERIMETER        0
MAJOR_AXIS       0
MINOR_AXIS       0
ECCENTRICITY     0
EQDIASQ          0
SOLIDITY         0
CONVEX_AREA      0
EXTENT           0
ASPECT_RATIO     0
ROUNDNESS        0
COMPACTNESS      0
SHAPEFACTOR_1    0
SHAPEFACTOR_2    0
SHAPEFACTOR_3    0
SHAPEFACTOR_4    0
MeanRR           0
MeanRG           0
MeanRB           0
StdDevRR         0
StdDevRG         0
StdDevRB         0
SkewRR           0
SkewRG           0
SkewRB           0
KurtosisRR       0
KurtosisRG       0
KurtosisRB       0
EntropyRR        0
EntropyRG        0
EntropyRB        0
ALLdaub4RR       0
ALLdaub4RG       0
ALLdaub4RB       0
Class            0
dtype: int64
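The checks above report no duplicate rows and no missing values, so nothing had to be repaired in this dataset. As a minimal sketch of what the repair step could look like when such problems do occur, the toy frame below (hypothetical values, not taken from Date_Fruit_Datasets.csv) is cleaned with drop_duplicates, dropna, and an explicit type conversion.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    "AREA": ["422163", "338136", "338136", "526843"],   # numbers stored as strings
    "PERIMETER": [2378.908, 2085.144, 2085.144, np.nan],
    "Class": ["BERHI", "BERHI", "BERHI", "DOKOL"],
})

print(toy.duplicated().sum())   # number of fully duplicated rows
print(toy.isna().sum())         # missing values per column

cleaned = (
    toy.drop_duplicates()              # remove repeated rows
       .dropna()                       # remove rows with missing values
       .astype({"AREA": "int64"})      # convert the column stored as text
       .reset_index(drop=True)
)
print(cleaned.dtypes)
```

Dropping rows is only one option; imputing missing values may be preferable when the dataset is small.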



Create a bar plot using seaborn.barplot of the number of elements in each category. Use markdown to comment on how well balanced the dataset is.
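No cell in this notebook carries out the bar plot step. A minimal sketch of one way to do it follows, assuming seaborn and matplotlib are installed; the helper name plot_class_counts and the toy labels are illustrative and not part of the original notebook.

```python
import pandas as pd

def plot_class_counts(labels: pd.Series):
    """Draw one bar per class showing how many rows it has."""
    # Plotting libraries are imported here so the counting step below
    # still runs in an environment without them.
    import matplotlib.pyplot as plt
    import seaborn as sns

    counts = labels.value_counts().sort_index()
    ax = sns.barplot(x=counts.index, y=counts.values)
    ax.set_xlabel("Class")
    ax.set_ylabel("Number of samples")
    plt.show()
    return ax

# Toy labels (hypothetical counts, not the real class distribution).
toy_labels = pd.Series(["BERHI"] * 3 + ["DOKOL"] * 5 + ["SOGAY"] * 2, name="Class")
counts = toy_labels.value_counts().sort_index()
imbalance_ratio = counts.max() / counts.min()   # 1.0 would mean perfectly balanced
print(counts)
print(imbalance_ratio)
```

In the notebook, plot_class_counts(Fruit_data['Class']) would draw the plot for the real data, and the ratio of the largest to the smallest class gives a single number to quote when commenting on balance.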



Move the labels into a separate dataframe and use sklearn.preprocessing.LabelEncoder to convert the string labels into integers. Reshape the labels into a 2d array. Determine which number has been assigned to each type of date and record this information in markdown.


In [9]:
# Creating a separate dataframe for the features and the labels
df_features = Fruit_data.drop(['Class'], axis = 1)
labels = Fruit_data['Class']

# Encoding the labels and converting the strings into integers
encoder = LabelEncoder()
labels_encoded = encoder.fit_transform(labels)

# Reshaping encoded labels into a two-dimensional array
labels_encoded = labels_encoded.reshape(-1, 1)
labels_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

# Outputting the number assigned to each type
labels_mapping

{'BERHI': 0,
 'DEGLET': 1,
 'DOKOL': 2,
 'IRAQI': 3,
 'ROTANA': 4,
 'SAFAVI': 5,
 'SOGAY': 6}
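The mapping is alphabetical because LabelEncoder stores the sorted unique labels in classes_ and uses each label's position there as its integer code. A small sketch on toy labels (a subset of the class names, in arbitrary order) shows this, along with inverse_transform for turning predicted integers back into names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

classes = ["SOGAY", "BERHI", "IRAQI", "BERHI", "DEGLET"]   # toy labels, arbitrary order
encoder = LabelEncoder()
encoded = encoder.fit_transform(classes)

# classes_ is sorted, and each label's code is its position in classes_.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from integer codes.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```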


Use sklearn.preprocessing.MinMaxScaler to scale the features (but not the labels). Split the data into training, testing and validation sets with appropriate proportions.


In [10]:
# Scaling the features 
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df_features)

# Splitting the data into training, validation, and testing sets
x_train, x_valtest, y_train, y_valtest = train_test_split(scaled_features, labels_encoded, train_size = 0.7, random_state = 10)
x_val, x_test, y_val, y_test = train_test_split(x_valtest, y_valtest, train_size = 0.5, random_state = 10)

In [11]:
x_train.shape, x_val.shape, x_test.shape, y_train.shape, y_val.shape, y_test.shape

((628, 34), (135, 34), (135, 34), (628, 1), (135, 1), (135, 1))
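The cell above fits MinMaxScaler on all 898 rows before splitting, so the validation and test rows influence the scaling, and the split is not stratified by class. A hedged alternative, sketched here on synthetic data (the array sizes and variable names are assumptions), splits first with stratify and fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(140, 5))            # synthetic features (assumption)
y = np.repeat(np.arange(7), 20)          # 7 classes, 20 samples each

# 70% train, then the remainder split evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=10)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=10)

# Fit the scaler on the training rows only, then reuse it for the other sets.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
print(X_train_s.min(), X_train_s.max())  # training features span [0, 1]
```

With this ordering, validation and test values can fall slightly outside [0, 1], which is expected and harmless for a dense network.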


Modeling 
Use tf.keras.Sequential to create a fully connected artificial neural network with at least two hidden layers. Choose an activation function for each layer, and make sure the input and output dimensions are appropriate for the data. Print a summary of the model using tf.keras.Model.summary.


In [47]:
# Creating an artificial neural network with two hidden layers 
model = Sequential([
    Dense(units = 64, activation='relu', input_shape=(34,)),  
    Dense(units = 32, activation='relu'),                      
    Dense(units = 7, activation='softmax')                      
])

In [59]:
# A summary of the model
model.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_44 (Dense)             (None, 64)                2240      
_________________________________________________________________
dense_45 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_46 (Dense)             (None, 7)                 231       
Total params: 4,551
Trainable params: 4,551
Non-trainable params: 0
_________________________________________________________________
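The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the figures for the 34-64-32-7 architecture; the helper name dense_params is illustrative.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [34, 64, 32, 7]   # input features, two hidden layers, output classes
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)        # [2240, 2080, 231]
print(sum(per_layer))   # 4551
```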



Compile the model with a choice of optimizer and loss function, and set the metrics argument equal to ['accuracy'].


In [60]:
# Choosing the Adam optimizer and sparse categorical cross-entropy loss (integer labels, 7 classes)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
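The loss passed to compile, sparse_categorical_crossentropy, takes integer class labels (as produced by LabelEncoder) rather than one-hot vectors. As a rough NumPy illustration of the quantity it computes, not the Keras implementation itself (which also clips probabilities for numerical stability), the loss is the mean negative log of the probability assigned to the true class.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean of -log(probability assigned to the true class)."""
    y_true = np.asarray(y_true).reshape(-1)          # accepts shape (n,) or (n, 1)
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(y_true)), y_true]   # probability of the true class per row
    return float(-np.log(picked).mean())

# Toy softmax outputs for 3 samples and 3 classes (hypothetical values).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.25, 0.25, 0.5]])
labels = np.array([[0], [1], [2]])                   # integer labels, shape (3, 1)
loss = sparse_categorical_crossentropy(labels, probs)
print(loss)
```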


Train the model and record the training accuracy. Find the validation accuracy and confusion matrix.


In [61]:
# Training the model for 25 epochs with a batch size of 32
history = model.fit(
    x_train, y_train,
    epochs=25,
    batch_size=32,
    validation_data=(x_val, y_val),
    verbose=1
)

Train on 628 samples, validate on 135 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [62]:
# Training loss and accuracy (evaluate returns [loss, accuracy])
model.evaluate(x_train, y_train)



[0.3622702382932043, 0.8757962]

In [63]:
# Validation loss and accuracy (evaluate returns [loss, accuracy])
model.evaluate(x_val, y_val)



[0.4061429374747806, 0.8888889]

In [71]:
# Computing the confusion matrix on the test set
y_test_probs = model.predict(x_test)
y_test_pred = np.argmax(y_test_probs, axis=1)

# True labels
y_test_true = y_test.flatten()

# Confusion matrix
cm = confusion_matrix(y_test_true, y_test_pred)
cm

array([[ 4,  0,  0,  2,  0,  0,  0],
       [ 0,  7,  3,  0,  1,  0,  4],
       [ 0,  3, 34,  0,  0,  0,  0],
       [ 2,  0,  0, 10,  0,  0,  0],
       [ 0,  0,  0,  0, 20,  0,  0],
       [ 0,  0,  0,  0,  0, 32,  0],
       [ 0,  1,  0,  0,  4,  0,  8]], dtype=int64)
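The confusion matrix above can be reduced to an overall accuracy and a per-class recall, which makes it easier to see which varieties are confused. The sketch below copies the matrix from the output and uses the class order from the LabelEncoder mapping; rows are true classes and columns are predictions, as returned by sklearn.metrics.confusion_matrix.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted).
cm = np.array([[ 4,  0,  0,  2,  0,  0,  0],
               [ 0,  7,  3,  0,  1,  0,  4],
               [ 0,  3, 34,  0,  0,  0,  0],
               [ 2,  0,  0, 10,  0,  0,  0],
               [ 0,  0,  0,  0, 20,  0,  0],
               [ 0,  0,  0,  0,  0, 32,  0],
               [ 0,  1,  0,  0,  4,  0,  8]])
class_names = ["BERHI", "DEGLET", "DOKOL", "IRAQI", "ROTANA", "SAFAVI", "SOGAY"]

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per-class share of true samples found

print(f"accuracy: {accuracy:.3f}")
for name, r in zip(class_names, recall):
    print(f"{name}: {r:.2f}")
```

For this matrix, 115 of the 135 samples lie on the diagonal. ROTANA and SAFAVI are classified without error, while DEGLET (7 of 15) and SOGAY (8 of 13) are the weakest classes.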

The model reaches 87.6% accuracy on the training set and 88.9% on the validation set. The two figures are close, so there is no sign of overfitting. We will still vary the hyperparameters to see whether the accuracy improves.

Return to the above steps to try at least five different choices of hyperparameters (including dimensions, activation functions, number of layers, optimizer, loss function, etc.). Neatly present the description of each model tried along with the training and validation accuracies, and the confusion matrix.


In [72]:
# Creating another model to check if accuracy improves--adding additional layers and more units

model_2 = Sequential([
    Dense(units = 256, activation='relu', input_shape=(34,)), 
    Dense(units = 128, activation='relu'),    
    Dense(units = 64, activation='relu'),
    Dense(units = 7, activation='softmax')                  
])
model_2.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model_2.fit(
    x_train, y_train,
    epochs=25,
    batch_size=32,
    validation_data=(x_val, y_val),
    verbose=1
)

print("Training accuracy for second model: ", model_2.evaluate(x_train, y_train))
print("Validation accuracy for second model: ", model_2.evaluate(x_val, y_val))

y_test_probs = model_2.predict(x_test)
y_test_pred = np.argmax(y_test_probs, axis=1)
y_test_true = y_test.flatten()
cm = confusion_matrix(y_test_true, y_test_pred)
print(cm)

Train on 628 samples, validate on 135 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Training accuracy for second model:  [0.26804395826758853, 0.89649683]
Validation accuracy for second model:  [0.3344751912134665, 0.91851854]
[[ 2  0  0  3  1  0  0]
 [ 0  8  3  0  1  0  3]
 [ 0  1 36  0  0  0  0]
 [ 0  0  0 12  0  0  0]
 [ 0  0  0  0 20  0  0]
 [ 0  0  0  0  0 32  0]
 [ 0  3  0  0  2  0  8]]


With an additional hidden layer and more units per layer, the model is more complex but performs better: training accuracy rises from 87.6% to 89.6% and validation accuracy from 88.9% to 91.9%.

In [78]:
# Creating another model to check if accuracy improves: a different activation function
model_3 = Sequential([
    Dense(units = 64, activation='tanh', input_shape=(34,)),
    Dense(units = 32, activation='tanh'),    
    Dense(units = 7, activation='softmax')
])
model_3.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model_3.fit(
    x_train, y_train,
    epochs=25,
    batch_size=32,
    validation_data=(x_val, y_val),
    verbose=1
)
print("Training accuracy for third model: ",model_3.evaluate(x_train, y_train))
print("Validation accuracy for third model: ",model_3.evaluate(x_val, y_val))

y_test_probs = model_3.predict(x_test)
y_test_pred = np.argmax(y_test_probs, axis=1)
y_test_true = y_test.flatten()
cm = confusion_matrix(y_test_true, y_test_pred)
print(cm)

Train on 628 samples, validate on 135 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Training accuracy for third model:  [0.3613097737928864, 0.88375795]
Validation accuracy for third model:  [0.3236304470786342, 0.9111111]
[[ 4  0  0  2  0  0  0]
 [ 0  6  5  0  1  0  3]
 [ 0  1 36  0  0  0  0]
 [ 0  0  0 12  0  0  0]
 [ 0  0  0  0 20  0  0]
 [ 0  0  0  0  0 32  0]
 [ 0  4  0  0  1  0  8]]


In [75]:
# Creating another model to check if accuracy improves: trying out a different optimizer (SGD with momentum)
model_4 = Sequential([
    Dense(units = 64, activation='relu', input_shape=(34,)), 
    Dense(units = 32, activation='relu'),    
    Dense(units = 7, activation='softmax')                    
])
model_4.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model_4.fit(
    x_train, y_train,
    epochs=25,
    batch_size=32,
    validation_data=(x_val, y_val),
    verbose=1
)
print("Training accuracy for fourth model: ",model_4.evaluate(x_train, y_train))
print("Validation accuracy for fourth model: ",model_4.evaluate(x_val, y_val))

y_test_probs = model_4.predict(x_test)
y_test_pred = np.argmax(y_test_probs, axis=1)
y_test_true = y_test.flatten()
cm = confusion_matrix(y_test_true, y_test_pred)
print(cm)

Train on 628 samples, validate on 135 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Training accuracy for fourth model:  [0.3425321497355297, 0.86464965]
Validation accuracy for fourth model:  [0.3524906904609115, 0.9037037]
[[ 3  0  0  2  1  0  0]
 [ 0  5  4  0  1  0  5]
 [ 0  1 36  0  0  0  0]
 [ 0  0  0 12  0  0  0]
 [ 0  0  0  0 20  0  0]
 [ 0  0  0  0  0 32  0]
 [ 0  2  0  0  0  0 11]]


In [82]:
# Creating another model to check if accuracy improves: larger batch size
model_5 = Sequential([
    Dense(units = 64, activation='relu', input_shape=(34,)),
    Dense(units = 32, activation='relu'),
    Dense(units = 7, activation='softmax')
])
model_5.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Note: model_5 is not re-created between iterations, so the batch-size-64 run
# continues from the weights learned with batch size 32 (50 epochs in total).
for batch_size in [32, 64]:
    model_5.fit(
        x_train, y_train,
        epochs=25,
        batch_size=batch_size,
        validation_data=(x_val, y_val),
        verbose=0
    )
    print(f"Training accuracy for fifth model and batch size {batch_size}: ", model_5.evaluate(x_train, y_train))
    print(f"Validation accuracy for fifth model and batch size {batch_size}: ", model_5.evaluate(x_val, y_val))
    y_test_probs = model_5.predict(x_test)
    y_test_pred = np.argmax(y_test_probs, axis=1)
    y_test_true = y_test.flatten()
    cm = confusion_matrix(y_test_true, y_test_pred)
    print(cm)

Training accuracy for fifth model and batch size 32:  [0.37496140048762033, 0.8742038]
Validation accuracy for fifth model and batch size 32:  [0.3584494615042651, 0.88148147]
[[ 4  0  0  2  0  0  0]
 [ 0  6  4  0  0  0  5]
 [ 0  1 36  0  0  0  0]
 [ 2  0  0 10  0  0  0]
 [ 0  0  0  0 20  0  0]
 [ 0  0  0  0  0 32  0]
 [ 0  4  0  0  0  0  9]]
Training accuracy for fifth model and batch size 64:  [0.30086019122676483, 0.9012739]
Validation accuracy for fifth model and batch size 64:  [0.3290406584739685, 0.91851854]
[[ 4  0  0  2  0  0  0]
 [ 0  7  4  0  0  0  4]
 [ 0  1 36  0  0  0  0]
 [ 0  0  0 11  0  1  0]
 [ 0  0  0  0 20  0  0]
 [ 0  0  0  0  0 32  0]
 [ 0  4  0  0  0  0  9]]
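To compare the trials side by side, the accuracies printed above can be collected into one table. The sketch below copies them from the evaluate outputs (rounded to four decimals); the row descriptions are shorthand for the architectures in each cell, and the last row reflects that model_5 kept training when the batch size changed to 64.

```python
import pandas as pd

# Accuracies copied from the evaluate() outputs above, rounded to four decimals.
results = pd.DataFrame(
    [
        ("model: 64-32 relu, adam", 0.8758, 0.8889),
        ("model_2: 256-128-64 relu, adam", 0.8965, 0.9185),
        ("model_3: 64-32 tanh, adam", 0.8838, 0.9111),
        ("model_4: 64-32 relu, SGD momentum 0.9", 0.8646, 0.9037),
        ("model_5: 64-32 relu, adam, batch 32", 0.8742, 0.8815),
        ("model_5: continued with batch 64", 0.9013, 0.9185),
    ],
    columns=["model", "train_acc", "val_acc"],
)
# Positive gap means validation accuracy exceeds training accuracy.
results["gap"] = results["val_acc"] - results["train_acc"]
print(results.sort_values("val_acc", ascending=False).to_string(index=False))
```

Because each configuration was trained once, on a validation set of 135 samples, differences of one or two percentage points amount to only a handful of samples and should be read with caution.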



Conclusion
Select the best model and justify your selection using markdown.



Use the best model to make predictions on the testing set. Find the testing accuracy and confusion matrix.



Use markdown to comment on how well the model works to make predictions for this use case.