# Notes

The data was collected from various search engines and stored in a directory called 'data', where the images sit in folders named after their corresponding label. We start by setting up our GPU, which means downloading and configuring the appropriate software. In the code we remove any file that does not have an image extension, because some of the downloads are not valid images. The data is then configured for better processing when training the model. A sequential model is used and tested with different Dense and 2D convolutional (Conv2D) layers.
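The extension cleanup described above could look roughly like the sketch below. The function name `remove_non_images` and the extension whitelist are assumptions made for illustration; the notes do not say which extensions were actually kept.

```python
from pathlib import Path

# Assumed whitelist; adjust to whatever formats the training pipeline accepts.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def remove_non_images(data_dir):
    """Delete every file under data_dir whose suffix is not whitelisted.

    Returns the removed paths so the cleanup can be reviewed afterwards.
    """
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() not in IMAGE_EXTENSIONS:
            path.unlink()
            removed.append(path)
    return removed
```

This only checks the file name. A file with an image extension but corrupt contents would still get through, so a decode check may be needed on top of it.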

One thing to note: the model was trained on RGB images, but when I used OpenCV to test it, my pictures were going in as BGR (OpenCV's default channel order), which lowered the accuracy by 30%. After fixing this problem, the model's classification accuracy increased to about 60-70%.
When I tried to recreate the layers of the VGG16 model (a model that has won awards in image classification), the accuracy surprisingly dropped again to about 30%. This led me to believe that the model is too big for my dataset, which has only about 250 images per category, while VGG16 was built to classify 1000 categories with around a thousand images each. From my research, the recommended amount is around 1000 images per category. After downloading more images and using data augmentation to get more data, the model's accuracy increased to about 80%.
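For the RGB/BGR mismatch mentioned above, a minimal sketch of the fix is to reverse the channel axis before the image reaches the model. The helper name `bgr_to_rgb` is made up for this example. With OpenCV installed, `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` does the same conversion.

```python
import numpy as np


def bgr_to_rgb(image):
    """Reverse the last (channel) axis of an H x W x 3 array: BGR -> RGB."""
    return image[..., ::-1]


# One pure-blue pixel in OpenCV's channel order: (B, G, R) = (255, 0, 0).
bgr_pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb_pixel = bgr_to_rgb(bgr_pixel)
print(rgb_pixel[0, 0].tolist())  # [0, 0, 255]
```

A model trained on RGB input would see this pixel as red instead of blue if the conversion were skipped, which fits the observation that color matters a lot for these categories.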

After reaching a sufficient accuracy categorizing 3 sports balls (basketballs, soccer balls and volleyballs), I added more ball labels to see whether my model would hold up or needed more tuning. (80% was OK with me because I mostly used edge cases or harder-to-identify pictures to really test my model.) I also added a 'star' label to see whether the model was generalizing well between the shape of a ball and the shape of a star. When I added the new categories (bowling balls, tennis balls and stars), the model did not perform so well, so more tuning was needed to accommodate the new variation in the data. Although the model was pretty good at classifying stars, it would sometimes confuse stars with tennis balls and vice versa. This tells me that color is a pretty big factor in categorizing images, and this is especially hard with sports balls because some balls are covered in colors that hide the details that set them apart from other balls.

To clean my data I deleted, for example, images where the 3 holes of the bowling ball were not visible, because without them it looked like it could be any rubber ball. Images that I consider edge cases and that the model kept getting wrong were augmented and added to the training data, to see whether the model could pick up on the similarities from the augmented versions rather than from the original image itself. With this, the model reached about 84% accuracy.

# Challenges

### 1. Splitting Data
When splitting data as shown below:

In [None]:
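import os
import tensorflow as tf

# 80% of the images under 'training', selected using the shared seed
training_set = tf.keras.utils.image_dataset_from_directory(
    os.path.join(data_dir, 'training'),
    validation_split=0.2,
    subset="training",
    seed=123,
    batch_size=64)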

In [None]:
# Remaining 20%; the same seed and split keep the two subsets disjoint
validation_set = tf.keras.utils.image_dataset_from_directory(
    os.path.join(data_dir, 'training'),
    validation_split=0.2,
    subset="validation",
    seed=123,
    batch_size=64)

The accuracy of the model dropped considerably. During training, the validation accuracy would be stuck at 80% while the training accuracy was close to 98%, meaning the model was overfitting.  
But when I split my data like this:

In [None]:
data_set = tf.keras.utils.image_dataset_from_directory(
    os.path.join(data_dir, 'training'),
    batch_size=32)

In [None]:
# First 80% of the batches go to training, the next 20% to validation
training = data_set.take(int(len(data_set) * .8))
validation = data_set.skip(int(len(data_set) * .8)).take(int(len(data_set) * .2))

The training and validation accuracies stay pretty close together.  
But caching and prefetching the training and validation data separately gives me the same overfitting problem, so the code below must run before splitting the data.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
data_set = data_set.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)

I have not figured out why this happens yet. :(
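One possible cause worth checking (an assumption, not something confirmed in these notes): `shuffle()` reshuffles on every pass by default, so `take`/`skip` applied after it select different batches in each epoch. Batches that count as validation in one epoch can then be training batches in another. The sketch below uses ten integers as stand-ins for batches to show the effect.

```python
import tensorflow as tf

# Ten stand-in "batches". shuffle() defaults to
# reshuffle_each_iteration=True, so every pass sees a new order.
data_set = tf.data.Dataset.range(10).shuffle(10, seed=0)

training = data_set.take(8)
validation = data_set.skip(8)

# Collect every element that lands in the validation split over 20 passes.
seen_in_validation = set()
for epoch in range(20):
    seen_in_validation.update(int(x) for x in validation.as_numpy_iterator())

# A fixed 80/20 split would only ever hold out the same 2 elements.
print(len(seen_in_validation))
```

If this is what is happening, training and validation accuracy would look close partly because the model has already trained on most of the validation images. Passing `reshuffle_each_iteration=False` to `shuffle()` before the split, or shuffling only the training part after the split, would be one way to test that idea.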

### 2. Modeling
These are the models that worked best for me when testing different numbers and combinations of categories:

In [None]:
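# Best model for 3 categories
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

model = Sequential()
# Conv2D(filters, kernel_size, strides): 16 filters, 3x3 kernel, stride 1
model.add(Conv2D(16, (3, 3), 1, activation='relu', input_shape=(256, 256, 3)))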
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Conv2D(32, (3, 3), 1, activation='relu'))
model.add(Conv2D(32, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Conv2D(16, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Flatten())
model.add(Dropout(0.2))

model.add(Dense(256, activation='relu'))
model.add(Dense(len(class_names)))

In [None]:
# Best model for 6 categories
model = Sequential()  # start from a fresh, empty model
model.add(Conv2D(16, (3, 3), 1, activation='relu', input_shape=(256, 256, 3)))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Conv2D(16, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Conv2D(32, (3, 3), 1, activation='relu'))
model.add(Conv2D(32, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Conv2D(64, (3, 3), 1, activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

model.add(Flatten())
model.add(Dropout(0.2))

model.add(Dense(256, activation='relu'))
model.add(Dense(len(class_names)))

### 3. Cross Validation
I wrote my own K-fold cross-validation to fit the way my data is split, since my images and labels are in the same dataset.  
My challenge here ties in with my splitting challenge, because I split the data in the following way:

In [None]:
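# Batches before the test fold, the test fold itself, then the batches after it
t1 = data.take(first_seg)
test = data.skip(first_seg).take(test_seg)
t2 = data.skip(first_seg + test_seg).take(second_seg)
# The training set is everything except the test fold
train = t1.concatenate(t2)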

Split this way, evaluating the model in cross-validation gives better results than the actual trained model gets when I split the data using `tf.keras.utils`.
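The notes use `first_seg`, `test_seg` and `second_seg` without showing how they are computed. The sketch below is one way to derive them for a given fold. The helper `fold_segments` and the choice to let the last fold absorb the remainder are assumptions, not the original implementation.

```python
def fold_segments(num_batches, k, fold):
    """Return (first_seg, test_seg, second_seg) batch counts for one fold.

    The test fold is the fold-th of k contiguous blocks. The last fold
    absorbs any remainder, so the three counts always sum to num_batches.
    """
    if not 0 <= fold < k:
        raise ValueError("fold must be in range(k)")
    fold_size = num_batches // k
    first_seg = fold * fold_size
    test_seg = num_batches - first_seg if fold == k - 1 else fold_size
    second_seg = num_batches - first_seg - test_seg
    return first_seg, test_seg, second_seg


# Example: 23 batches split into 5 folds.
for fold in range(5):
    print(fold, fold_segments(23, 5, fold))
```

The three values plug straight into the `take`/`skip` calls above. The folds are only disjoint if `data` yields its batches in the same order on every pass, so any per-epoch reshuffling has to happen after the split.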