<a href="https://colab.research.google.com/github/PolemoniProkshitha/ai_food_wastage_analysis/blob/main/Food_Wastage_Prediction_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Transformer models can be fine-tuned for regression as well as for classification.
This guide walks through fine-tuning a transformer model, BERT, on a synthetic food wastage dataset for a regression task in Google Colab.

# **High-Level Steps:**

1. **Setup Google Colab Environment:** Install necessary libraries.
2. **Load and Prepare Data:** Load your dataset and prepare it for the transformer model.
3. **Model Selection and Tokenization:** Choose a transformer model and tokenize the data.
4. **Fine-Tuning the Model:** Adapt the transformer for regression and train the model.
5. **Evaluation and Predictions:** Evaluate the model's performance and make predictions.


# Step 1: Setup Google Colab Environment

In Google Colab, start by installing necessary libraries.

In [1]:
!pip install transformers datasets
!pip install torch

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests (from transformers)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)


# Step 2: Load and Prepare Data
Prepare your synthetic dataset as a CSV file, then upload it to Colab and load it with pandas:

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
from google.colab import files
uploaded = files.upload()

Saving synthetic_food_wastage_data.csv to synthetic_food_wastage_data.csv


In [4]:
import pandas as pd
import io

data = pd.read_csv(io.BytesIO(uploaded['synthetic_food_wastage_data.csv']))

In [5]:
# Define your features and target
X = data.drop(columns=['date', 'food_wasted'])
y = data['food_wasted']

In [6]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
# Convert data to the required format for transformers
train_data = X_train.copy()
train_data['labels'] = y_train

test_data = X_test.copy()
test_data['labels'] = y_test

# Step 3: Model Selection and Tokenization
Here we use BERT (`bert-base-uncased`). Because BERT takes text as input, each row of features is converted to a string and tokenized before it is passed to the model.
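
As an illustration of the row-to-text step, the sketch below shows one way a row of tabular features could be written out as a string before it reaches the tokenizer. The helper name `row_to_text` and the example column names are hypothetical and not taken from the dataset; the notebook itself simply calls `str()` on the row.

```python
def row_to_text(features):
    """Join feature names and values into one 'name: value' string.

    `features` is a plain dict. With pandas, something like
    row.drop('labels').to_dict() could supply it (an assumption,
    not what the notebook does).
    """
    parts = [f"{name}: {value}" for name, value in features.items()]
    return ", ".join(parts)


# Hypothetical feature names and values, for illustration only
example_row = {"meals_served": 320, "day_of_week": "Monday", "temperature": 21.5}
text = row_to_text(example_row)
print(text)
```

A compact string like this leaves out the index and dtype footer that a printed pandas Series carries, so fewer of the `max_len` tokens are spent on formatting.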

In [8]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import torch

In [9]:
# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [10]:
class FoodWasteDataset(Dataset):
    """Serves each DataFrame row as tokenized text plus a float label."""

    def __init__(self, data, tokenizer, max_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # str() of a pandas Series gives its printed form: one line per
        # feature, followed by a trailing Name/dtype line.
        inputs = self.tokenizer(
            str(row.drop('labels')),
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
        # Remove the batch dimension added by return_tensors='pt'
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        inputs['labels'] = torch.tensor(row['labels'], dtype=torch.float)
        return inputs

In [11]:
# Prepare data loaders
max_len = 128  # Adjust based on your data
train_dataset = FoodWasteDataset(train_data, tokenizer, max_len)
test_dataset = FoodWasteDataset(test_data, tokenizer, max_len)

In [12]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
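
To make the batching concrete, the following sketch mimics how a loader with `batch_size=16` splits a dataset into batches. The sample count of 100 is an assumed figure, not the size of the actual dataset, and the indices are kept in order, whereas the training loader above shuffles them.

```python
import math


def batch_indices(n_samples, batch_size):
    """Yield lists of consecutive sample indices, one list per batch."""
    for start in range(0, n_samples, batch_size):
        yield list(range(start, min(start + batch_size, n_samples)))


n_samples = 100  # assumed dataset size, for illustration only
batch_size = 16
batches = list(batch_indices(n_samples, batch_size))

print(len(batches))      # number of batches per epoch
print(len(batches[-1]))  # size of the final, partial batch
```

Because the last batch can be smaller than the others, an average of per-batch losses is close to, but not exactly, the per-sample average.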

# Step 4: Fine-Tuning the Model
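Fine-tuning BERT for a regression task means replacing the final classification layer with one that outputs a single continuous value. Loading `BertForSequenceClassification` with `num_labels=1` provides such a layer.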
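
Conceptually, a single-output layer is a linear map from the model's pooled representation to one number, trained against a squared-error loss. The sketch below shows that computation in plain Python. The vector size, weights, and targets are made up for illustration; the real layer lives inside `BertForSequenceClassification`.

```python
def linear_head(pooled, weight, bias):
    """Map a pooled feature vector to one continuous value: w . x + b."""
    return sum(w * x for w, x in zip(weight, pooled)) + bias


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy 3-dimensional "pooled" vectors (BERT-base uses 768 dimensions)
batch = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
weight, bias = [2.0, -1.0, 0.5], 1.0  # made-up parameters
targets = [5.0, 1.0]                  # made-up regression targets

preds = [linear_head(x, weight, bias) for x in batch]
loss = mse(preds, targets)
print(preds, loss)
```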

In [13]:
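from transformers import BertForSequenceClassification
# The AdamW class in transformers is deprecated in favor of PyTorch's implementation
from torch.optim import AdamW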
import torch.nn as nn

In [14]:
# Load the BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = AdamW(model.parameters(), lr=2e-5)




In [16]:
# Training loop
epochs = 3
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [17]:
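for epoch in range(epochs):
    model.train()  # enable training-mode behavior such as dropout
    total_loss = 0
    for batch in train_loader:
        # Move the tokenized inputs and labels to the target device
        inputs = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**inputs)
        # Logits have shape (batch, 1); squeeze to (batch,) to match the labels
        loss = criterion(outputs.logits.squeeze(-1), inputs['labels'])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Report the mean of the per-batch losses for this epoch
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}')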

Epoch 1, Loss: 282.6409952264083
Epoch 2, Loss: 215.44374486019737
Epoch 3, Loss: 185.67078158729956
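
The loss values above are in squared units of `food_wasted`, so their magnitude depends on the scale of the target. One option that this notebook does not use is to standardize the target before training and convert predictions back afterwards. A minimal sketch with made-up target values (the helper names are hypothetical):

```python
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation of the targets."""
    return statistics.fmean(values), statistics.pstdev(values)


def scale(values, mean, std):
    """Standardize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    """Map standardized values back to the original units."""
    return [v * std + mean for v in values]


y_train_example = [10.0, 20.0, 30.0, 40.0]  # made-up targets
mean, std = fit_scaler(y_train_example)     # fit on training targets only
scaled = scale(y_train_example, mean, std)
restored = unscale(scaled, mean, std)
print(mean, std)
```

An MSE measured on standardized targets can be multiplied by `std ** 2` to compare it with an MSE in the original units.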


# Step 5: Evaluation and Predictions
After training, evaluate the model on the test data.

In [18]:
model.eval()
predictions, true_labels = [], []

with torch.no_grad():
    for batch in test_loader:
        inputs = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**inputs)
        preds = outputs.logits.squeeze(-1).cpu().numpy()
        labels = inputs['labels'].cpu().numpy()

        predictions.extend(preds)
        true_labels.extend(labels)

In [19]:
# Evaluate the model
from sklearn.metrics import mean_squared_error

In [20]:
mse = mean_squared_error(true_labels, predictions)
print(f'Mean Squared Error on Test Data: {mse}')

Mean Squared Error on Test Data: 156.2137451171875
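
An MSE is easier to interpret as a root mean squared error (RMSE), which is in the same units as `food_wasted`; the square root of the value reported above is roughly 12.5. Comparing against a baseline that always predicts the mean gives further context. The sketch below uses made-up numbers, not the notebook's results, to show both calculations.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 20.0, 30.0, 40.0]  # made-up true values
y_pred = [12.0, 18.0, 33.0, 37.0]  # made-up model predictions

model_mse = mse(y_true, y_pred)
rmse = math.sqrt(model_mse)

# Baseline: always predict the mean of the true values
mean_value = sum(y_true) / len(y_true)
baseline_mse = mse(y_true, [mean_value] * len(y_true))

# R^2 compares the model's error with the baseline's error
r2 = 1 - model_mse / baseline_mse
print(model_mse, rmse, baseline_mse, r2)
```

A model whose MSE is not clearly below the baseline's has learned little beyond the average of the target.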
