<a href="https://colab.research.google.com/github/sergiolms/smartbite/blob/main/google-collab/SmartBite_Training_phase.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SmartBite 🍏
The AI tool that makes you eat wiser.

## 🥫 Dataset used
- The Food-101 dataset, downloaded from [Kaggle](https://www.kaggle.com/datasets/dansbecker/food-101).
  - It contains 101 folders, one per dish, with 1,000 pictures each.

- For the food information, time constraints led us to initially use ChatGPT to obtain each dish's list of ingredients and its nutritional information for a recommended portion in grams.
  - As a follow-up task, we intend to use the [Open Food Facts database](https://world.openfoodfacts.org/data), a ~10 GB database of foods and ingredients with their nutritional information and health scores. This should give us more precise nutritional information for each dish.

## 🍽 Project Setup
Import the dependencies and mount Google Drive.

In [None]:
import os
import json
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import regularizers
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import GlobalAveragePooling2D
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint, CSVLogger

from google.colab import drive

# Mount the drive
drive.mount('/content/drive')
# Here we load our project folder, which has the following structure:
# - root folder ("smartbite_project")
# |-- datasource // Contains all the data used by the project
#   |-- food_info // Contains the nutritional information JSON file
#   |-- images // Contains all images
#   |-- meta // Contains metadata files, such as training & test data and classes
# |-- model
#   |-- model_trained_3class.keras // This is the trained model
# Change this path to the location of this project in your Drive
base_dir = '/content/drive/MyDrive/smartbite_project/'

## 📂 Load metadata

In [10]:
images_dir = os.path.join(base_dir,"datasource/images")

# Folder names of the food classes, exactly as they appear on disk. E.g.: "bread_pudding", "bruschetta", "caesar_salad"
with open(os.path.join(base_dir,"datasource/meta/classes.txt"), 'r') as f:
    food_folders = f.read().strip().split('\n')

# User-friendly names of the same classes, in the same order as the previous file. E.g.: "Bread pudding", "Bruschetta", "Caesar salad"
with open(os.path.join(base_dir,"datasource/meta/labels.txt"), 'r') as f:
    food_names = f.read().strip().split('\n')

total_classes = len(food_folders)
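
Because `classes.txt` and `labels.txt` list the classes in the same order, the two lists can be zipped into a lookup table. The following is a minimal sketch that uses the three example names from the comments above instead of the real files; `to_display_name` is a hypothetical helper, not part of the notebook.

```python
# Sample values taken from the comments above; the real lists come from
# classes.txt and labels.txt.
food_folders = ["bread_pudding", "bruschetta", "caesar_salad"]
food_names = ["Bread pudding", "Bruschetta", "Caesar salad"]

# Both files must have the same number of lines for the pairing to be valid.
assert len(food_folders) == len(food_names)

# Map each folder name to its user-friendly label.
folder_to_name = dict(zip(food_folders, food_names))


def to_display_name(folder):
    """Hypothetical helper: return the friendly name, or the folder name if unknown."""
    return folder_to_name.get(folder, folder)


print(to_display_name("caesar_salad"))  # Caesar salad
```

Checking the lengths up front makes an out-of-sync pair of files fail early instead of silently mislabeling predictions.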

### ☝️🤓 Train & Validation

The whole dataset consists of 101 dishes with 1,000 pictures each, for a total of 101,000 pictures to process.

The following files split it into 75% training data and 25% validation data.

That is:
- 75,750 pictures used for training
- 25,250 pictures used for validation (test).
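
The figures above can be checked with a few lines of arithmetic. This sketch assumes the 75/25 split is applied to every class individually (750 training and 250 validation pictures per dish); the variable names are illustrative only.

```python
total_classes = 101
images_per_class = 1000
train_fraction = 0.75

total_images = total_classes * images_per_class

# Assumption: the split is applied per class.
train_per_class = int(images_per_class * train_fraction)
test_per_class = images_per_class - train_per_class

train_images = total_classes * train_per_class
test_images = total_classes * test_per_class

print(total_images, train_images, test_images)  # 101000 75750 25250
```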

In [11]:
with open(os.path.join(base_dir, "datasource/meta/train.json")) as f:
    train_data = json.load(f)
with open(os.path.join(base_dir, "datasource/meta/test.json")) as f:
    test_data = json.load(f)

# 📊 Data preprocessing

## 🔡 Create a Dataframe from data

In [12]:
def create_dataframe(data):
    """Flatten {label: [image paths]} into a DataFrame with 'filename' and 'label' columns."""
    filenames = []
    labels = []
    for label, items in data.items():
        for item in items:
            filenames.append(item.strip() + ".jpg")  # Image
            labels.append(label.strip())  # Category
    return pd.DataFrame({'filename': filenames, 'label': labels})

In [13]:
# Dataset for training
df_train = create_dataframe(train_data)
# Dataset for validation
df_test = create_dataframe(test_data)
df_train.head()

| | filename | label |
|---|---|---|
| 0 | churros/1004234.jpg | churros |
| 1 | churros/1013460.jpg | churros |
| 2 | churros/1016791.jpg | churros |
| 3 | churros/102100.jpg | churros |
| 4 | churros/1025494.jpg | churros |
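
Before handing the data to the generators, it can help to sanity-check the flattened rows. The sketch below works on plain tuples rather than a DataFrame so that it runs without any dependencies. It assumes the JSON files map each label to a list of `label/image_id` paths, as the output above suggests; the `bruschetta` entry is a made-up placeholder.

```python
from collections import Counter

# Toy stand-in for train.json: {label: [image paths without extension]}.
# The churros entries come from the output above; the bruschetta entry is a
# made-up placeholder.
toy_data = {
    "churros": ["churros/1004234", "churros/1013460"],
    "bruschetta": ["bruschetta/placeholder_id"],
}

# Same flattening as create_dataframe, kept as plain (filename, label) tuples.
rows = [
    (item.strip() + ".jpg", label.strip())
    for label, items in toy_data.items()
    for item in items
]

# Count images per label to spot missing or unbalanced classes.
images_per_label = Counter(label for _, label in rows)

# Check that every filename lives in the folder named after its label.
mismatched = [name for name, label in rows if not name.startswith(label + "/")]

print(rows[0])
print(dict(images_per_label))
print(mismatched)
```

On the real data, every label would be expected to have the same count, and `mismatched` would be expected to stay empty.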


## 🌅 Create ImageDataGenerator artifacts

In [None]:
img_width, img_height = 299, 299

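# A single ImageDataGenerator is shared by the training and test generators.
# Note: validation_split only takes effect when `subset` is passed to
# flow_from_dataframe. It is unused here because the split comes from the JSON files.
datagen = ImageDataGenerator(
    rescale=1. / 255,
    validation_split=0.2
)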

# We use the JSON files and folders to create the generator
train_generator = datagen.flow_from_dataframe(
    df_train,
    directory=images_dir,
    x_col='filename',
    y_col='label',
    class_mode='categorical',
    target_size=(img_height, img_width),
)

test_generator = datagen.flow_from_dataframe(
    df_test,
    directory=images_dir,
    x_col='filename',
    y_col='label',
    class_mode='categorical',
    target_size=(img_height, img_width),
)

# ⚙️ Model configuration

In [None]:
batch_size = 32

# Initialize the pretrained Model
inception = tf.keras.applications.inception_v3.InceptionV3(weights='imagenet', include_top=False)

# Set layers
x = inception.output
x = GlobalAveragePooling2D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.2)(x)

predictions = Dense(
    total_classes,
    kernel_regularizer=regularizers.l2(0.005),
    activation='softmax'
)(x)

# Create model & compile it
model = Model(inputs=inception.input, outputs=predictions)
model.compile(
    optimizer=SGD(learning_rate=0.0001, momentum=0.9),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Add checkpoints: after each epoch, save the model only if it is the best so far
# (ModelCheckpoint monitors val_loss by default)
checkpointer = ModelCheckpoint(
    filepath='best_model.keras',
    verbose=1,
    save_best_only=True
)

csv_logger = CSVLogger('history.log')

# Display a summary of the model we're going to use
model.summary()

## 🏋🏻‍♂️ Train model

In [None]:
# Train the model (each epoch took ~9h, so we suggest skipping this step :P)
# Note: the batch size is set by the generators (flow_from_dataframe defaults to 32),
# so it is not passed to fit().
history = model.fit(
    train_generator,
    validation_data=test_generator,
    epochs=30,
    verbose=1,
    callbacks=[csv_logger, checkpointer])

# Save the entire model in the native Keras format (.keras).
model.save('model_trained.keras')

### 🔄 Pause & resume training

Since each epoch may take ~9h to process, we sometimes needed to stop the training and resume it later.
If you run this from Google Colab, your session may end when it hits 12h or so.
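
To get a feel for why resuming matters, here is a rough back-of-the-envelope estimate. It takes the ~9h per epoch and ~12h session limit quoted above at face value and treats the whole epoch time as training steps, so the per-step figure is an upper bound because it also absorbs the validation pass.

```python
import math

train_images = 75750      # from the split described earlier
batch_size = 32           # batch size used in this notebook
hours_per_epoch = 9       # approximate figure quoted above
epochs = 30
session_limit_hours = 12  # approximate Colab limit quoted above

steps_per_epoch = math.ceil(train_images / batch_size)
seconds_per_step = hours_per_epoch * 3600 / steps_per_epoch
total_hours = hours_per_epoch * epochs
epochs_per_session = session_limit_hours // hours_per_epoch
sessions_needed = math.ceil(epochs / epochs_per_session)

print(steps_per_epoch)                      # 2368
print(round(seconds_per_step, 1))           # 13.7
print(total_hours)                          # 270
print(epochs_per_session, sessions_needed)  # 1 30
```

Under these assumptions only one full epoch fits in a session, so the 30 epochs would need about 30 separate sessions.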

The following block demonstrates how resuming a training run works. You don't need to run it if you were patient enough to run the whole training. 😀

In [None]:
# Load the best version of the model saved so far by the checkpoint callback
model = tf.keras.models.load_model('best_model.keras')

# Load the training history
history_df = pd.read_csv('history.log')

# Get the number of epochs completed (CSVLogger writes one row per epoch)
initial_epoch = len(history_df)

# Set the checkpoints up again, we still want to save the best models from next epochs
checkpointer = ModelCheckpoint(filepath='best_model.keras', verbose=1, save_best_only=True)
csv_logger = CSVLogger('history.log', append=True)

# Continue training the model
# Note: the batch size is set by the generators, so it is not passed to fit()
history = model.fit(
    train_generator,
    validation_data=test_generator,
    initial_epoch=initial_epoch, # Continue from last epoch
    epochs=30,
    verbose=1,
    callbacks=[csv_logger, checkpointer])

model.save('model_trained.keras')
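
The way `initial_epoch` is derived from the log can be illustrated without TensorFlow. This sketch writes a small stand-in for `history.log` and counts its rows. The column names are an assumption about what `CSVLogger` writes for this model, and the metric values are made-up placeholders.

```python
import csv
import os
import tempfile

# Write a small stand-in for history.log. The metric values are made-up
# placeholders; only the number of rows matters here.
log_path = os.path.join(tempfile.mkdtemp(), "history.log")
with open(log_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["epoch", "accuracy", "loss", "val_accuracy", "val_loss"])
    writer.writerow([0, 0.10, 4.0, 0.20, 3.5])
    writer.writerow([1, 0.30, 3.0, 0.35, 2.9])
    writer.writerow([2, 0.45, 2.4, 0.40, 2.7])

with open(log_path, newline="") as f:
    rows = list(csv.DictReader(f))

# CSVLogger numbers epochs from 0, so the row count is also the index of the
# next epoch to run.
initial_epoch = len(rows)
total_epochs = 30
remaining_epochs = total_epochs - initial_epoch

print(initial_epoch, remaining_epochs)  # 3 27
```

When `initial_epoch` is passed to `fit`, `epochs` is interpreted as the index of the final epoch, not as the number of additional epochs, which is why it stays at 30 when resuming. One caveat: `best_model.keras` holds the best epoch so far, which may be earlier than the last epoch recorded in the log.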