# Food Delivery Dataset Analysis and CNN Model Training

This notebook demonstrates the analysis of a food delivery dataset using various machine learning techniques. The notebook is divided into sections for data preprocessing, CNN model training, and analysis of the image dataset. You can choose between running a CNN model on an image dataset (Food101) or performing tabular analysis with Delivery or Restaurant datasets.

### Table of Contents
- **CNN Model Analysis for Food101 Dataset**: Training a pre-trained or custom CNN on the Food101 dataset.
- **Tabular Dataset Analysis**: Analysis of the food delivery datasets using Logistic Regression or MLP classifiers.
- **Image Size and Aspect Ratio Analysis**: Analyzing image sizes and aspect ratios for the dataset.



In [1]:
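import os
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import sklearn
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from packaging import version  # assumed source of version.parse used in the MLP cell
from torchvision import datasets, transforms, models
from sklearn.metrics import accuracy_score, auc, classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, label_binarize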


### CNN Model Analysis for Food101 Dataset

This section shows how to train a CNN on the Food101 dataset. You can choose between a pre-trained ResNet18 and a small custom CNN.
- The model is trained for a configurable number of epochs.
- The function prints the training loss and accuracy for each epoch and returns them as a list.

The following cell defines the training function and its parameters.


In [2]:
def create_cnn_model(dataset_path, meta_path, num_classes, img_height=224, img_width=224, batch_size=64, epochs=100, use_pretrained=True):
    """
    Creates and trains a CNN model (ResNet18 or custom CNN) for food image classification.
    """
    # Preprocessing and transformations for dataset
    transform = transforms.Compose([
        transforms.Resize((img_height, img_width)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalizing for pre-trained models
    ])
    
    dataset = datasets.ImageFolder(dataset_path, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Use pre-trained ResNet18 model if selected
    if use_pretrained:
        model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # 'pretrained=True' is deprecated since torchvision 0.13
        model.fc = nn.Linear(model.fc.in_features, num_classes)  # Change the final layer to match number of classes
    else:
        # Custom CNN model can be defined here (example)
        model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * (img_height // 4) * (img_width // 4), num_classes)
        )
    
    # Move model to GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    
    # Set loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    epoch_accuracies = []
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # Calculate accuracy
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            running_loss += loss.item()

        epoch_loss = running_loss / len(train_loader)
        epoch_accuracy = correct / total * 100
        epoch_accuracies.append((epoch + 1, epoch_loss, epoch_accuracy))

        print(f"Epoch {epoch+1}/{epochs} - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.2f}%")

    return epoch_accuracies
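
The custom CNN's final `nn.Linear` layer relies on each padded 3x3 convolution preserving the spatial size and each `MaxPool2d(2)` halving it, which is where `64 * (img_height // 4) * (img_width // 4)` comes from. The sketch below checks that arithmetic in plain Python. The helper names are hypothetical and not part of the notebook.

```python
def conv2d_out(size, kernel_size=3, stride=1, padding=1):
    """Output length of one spatial dimension after nn.Conv2d (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def maxpool2d_out(size, kernel_size=2):
    """Output length after nn.MaxPool2d(kernel_size); its stride defaults to kernel_size."""
    return (size - kernel_size) // kernel_size + 1


def flattened_features(img_height, img_width, out_channels=64, num_blocks=2):
    """Number of inputs reaching nn.Linear after num_blocks conv + pool stages."""
    h, w = img_height, img_width
    for _ in range(num_blocks):
        h, w = maxpool2d_out(conv2d_out(h)), maxpool2d_out(conv2d_out(w))
    return out_channels * h * w


print(flattened_features(224, 224))  # 64 * 56 * 56 = 200704
```

With the default 224x224 input and 101 classes, that single linear layer holds roughly 20 million weights, which is worth keeping in mind when comparing the custom CNN with ResNet18.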


The cell above only defines the training function. The next cell calls it on the Food101 dataset and prints the loss and accuracy for each epoch.

Run the following cell to train the model and see the results.

In [None]:
# Example of how to call the model training function
dataset_path = "../Datasets/archive/images"
meta_path = "../Datasets/archive/meta/meta"
num_classes = 101  # Number of classes in Food101
img_height = 224
img_width = 224

epoch_accuracies = create_cnn_model(dataset_path, meta_path, num_classes, img_height, img_width, epochs=5)

# Display results in a readable format
epoch_df = pd.DataFrame(epoch_accuracies, columns=["Epoch", "Loss", "Accuracy"])
epoch_df

**The above table displays the training loss and accuracy for each epoch.**
You can use this to monitor the model's performance during training.

Next, we will analyze the tabular food delivery datasets.


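### Loading and Preprocessing the Datasets
The function below loads the already preprocessed, pre-split CSV files for the selected dataset (either Delivery or Restaurant). It merges the training and validation splits into a single training set, keeps the test split separate, and identifies the categorical and numerical feature columns.

Run the following cell to define the loader used in the rest of the analysis.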

In [4]:
def load_and_preprocess_data(base_dir, dataset_type):
    """
    Load the preprocessed data from the specified directory for the given dataset type.
    """
    data_dir = os.path.join(base_dir, dataset_type)

    X_train = pd.read_csv(os.path.join(data_dir, 'X_train.csv'))
    y_train = pd.read_csv(os.path.join(data_dir, 'y_train.csv'))
    X_valid = pd.read_csv(os.path.join(data_dir, 'X_valid.csv'))
    y_valid = pd.read_csv(os.path.join(data_dir, 'y_valid.csv'))
    X_test = pd.read_csv(os.path.join(data_dir, 'X_test.csv'))
    y_test = pd.read_csv(os.path.join(data_dir, 'y_test.csv'))

    # Assuming 'target' column in y files
    y_train = y_train['target']
    y_valid = y_valid['target']
    y_test = y_test['target']

    # Combine training and validation sets
    X_train_full = pd.concat([X_train, X_valid], axis=0)
    y_train_full = pd.concat([y_train, y_valid], axis=0)

    # Identify categorical features
    categorical_features = X_train_full.select_dtypes(include=['object', 'category']).columns.tolist()
    numerical_features = [col for col in X_train_full.columns if col not in categorical_features]

    num_classes = len(y_train_full.unique())

    return X_train_full, y_train_full, X_test, y_test, num_classes, categorical_features, numerical_features

## Tabular Dataset Analysis

In this section, we will analyze the tabular datasets (Delivery or Restaurant) using machine learning classifiers. You can choose between Logistic Regression and an MLP classifier. The results are saved under `Results/{model_name}_{dataset_name}`.

The following cells define the helper functions. The dataset and model are selected interactively in `run_tabular_analysis()`.

In [5]:
def plot_feature_importance(model, feature_names, model_name, results_dir):
    """
    Plot feature importance for Logistic Regression.
    """
    # Get the coefficients from the logistic regression model
    classifier = model.named_steps['classifier']
    coef = classifier.coef_

    # Get feature names after preprocessing (the feature_names argument is not used)
    preprocessor = model.named_steps['preprocessor']
    feature_names_transformed = preprocessor.get_feature_names_out()

    # Binary models expose a single coefficient row, which belongs to classes_[1]
    classes = classifier.classes_ if coef.shape[0] == len(classifier.classes_) else classifier.classes_[1:]
    coef_df = pd.DataFrame(coef.T, index=feature_names_transformed, columns=classes)

    # Plot the ten largest coefficients (by absolute value) for each class
    for class_label in classes:
        top_features = coef_df[class_label].abs().sort_values(ascending=False).head(10)
        plt.figure(figsize=(8, 6))
        sns.barplot(x=top_features.values, y=top_features.index)
        plt.title(f"Top Features for Class {class_label}")
        plt.xlabel("Absolute Coefficient Value")
        plt.ylabel("Feature")
        plt.tight_layout()
        plt.savefig(os.path.join(results_dir, f"{model_name}_Feature_Importance_Class_{class_label}.png"))
        plt.show()

In [6]:
def evaluate_logistic_regression(model, param_grid, X_train, X_test, y_train, y_test, model_name, class_labels, results_dir):
    print(f"\nEvaluating {model_name}")

    # Grid Search with Cross-Validation
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )

    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

    print(f"Best parameters found: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

    # Save cross-validation results
    cv_results = pd.DataFrame(grid_search.cv_results_)
    cv_results.to_csv(os.path.join(results_dir, f"{model_name}_grid_search_results.csv"), index=False)

    # Predict and evaluate on the test set
    y_pred = best_model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)
    print(f"{model_name} Test Accuracy: {test_accuracy:.4f}")

    # Ensure class_labels are strings
    class_labels_str = [str(label) for label in class_labels]

    report = classification_report(y_test, y_pred, target_names=class_labels_str)
    print("Classification Report:")
    print(report)

    # Save classification report to a text file
    with open(os.path.join(results_dir, f"{model_name}_classification_report.txt"), "w") as f:
        f.write(report)

    # Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        conf_matrix,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=class_labels_str,
        yticklabels=class_labels_str
    )
    plt.title(f"{model_name} Confusion Matrix")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.savefig(os.path.join(results_dir, f"{model_name}_confusion_matrix.png"))
    plt.show()

    # Learning Curve
    plot_learning_curve(best_model, X_train, y_train, model_name, results_dir)

    # Feature Importance
    plot_feature_importance(best_model, X_train.columns, model_name, results_dir)

    return best_model
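
The evaluation function above calls `plot_learning_curve(best_model, X_train, y_train, model_name, results_dir)`, which is not defined in the cells shown. The following is one possible sketch of such a helper built on scikit-learn's `learning_curve`. The output file name, the five training-set sizes, and the returned values are assumptions made for illustration.

```python
import os

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve


def plot_learning_curve(model, X, y, model_name, results_dir, cv=5):
    """Hypothetical helper: plot mean train/CV accuracy against training-set size."""
    train_sizes, train_scores, valid_scores = learning_curve(
        model, X, y, cv=cv, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 5),
    )
    plt.figure(figsize=(8, 6))
    # Average the per-fold scores for each training-set size
    plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="Training accuracy")
    plt.plot(train_sizes, valid_scores.mean(axis=1), marker="o", label="Cross-validation accuracy")
    plt.title(f"{model_name} Learning Curve")
    plt.xlabel("Training examples")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.grid()
    plt.savefig(os.path.join(results_dir, f"{model_name}_learning_curve.png"))
    plt.show()
    return train_sizes, train_scores, valid_scores
```

A persistent gap between the two curves suggests overfitting, while two low, converged curves suggest the model is too simple for the data.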

In [7]:
def plot_roc_curve(model, X_test, y_test, model_name, num_classes, results_dir):
    """
    Plot the ROC curve for multiclass classification.
    """
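    # Binarize the output for multiclass
    classes = np.unique(y_test)
    y_test_bin = label_binarize(y_test, classes=classes)
    # label_binarize returns a single column for two classes; expand it to one column per class
    if y_test_bin.shape[1] == 1:
        y_test_bin = np.hstack([1 - y_test_bin, y_test_bin])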

    # Get probabilities
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)
    elif hasattr(model, "decision_function"):
        y_prob = model.decision_function(X_test)
    else:
        print("Model does not support probability predictions.")
        return

    # Plot ROC curve for each class
    plt.figure(figsize=(10, 7))
    for idx, class_label in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_test_bin[:, idx], y_prob[:, idx])
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f"Class {class_label} (AUC = {roc_auc:.2f})")

    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.title(f"{model_name} ROC Curve (Multiclass)")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend(loc="lower right")
    plt.grid()
    plt.savefig(os.path.join(results_dir, f"{model_name}_ROC_curve.png"))
    plt.show()

In [8]:
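def train_mlp_pytorch(X_train, y_train, X_test, y_test, num_classes, results_dir, class_labels):
    # Select the device here, since no global `device` is defined in the notebook
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Encode categorical variables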
    categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns
    numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns

    # One-Hot Encoding for categorical features
    if version.parse(sklearn.__version__) >= version.parse("1.2"):
        encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    else:
        encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
    X_train_cat = encoder.fit_transform(X_train[categorical_cols])
    X_test_cat = encoder.transform(X_test[categorical_cols])

    # Standardization for numerical features
    scaler = StandardScaler()
    X_train_num = scaler.fit_transform(X_train[numerical_cols])
    X_test_num = scaler.transform(X_test[numerical_cols])

    # Combine numerical and categorical features
    X_train_processed = np.hstack([X_train_num, X_train_cat])
    X_test_processed = np.hstack([X_test_num, X_test_cat])

    # Encode target variable
    label_encoder = LabelEncoder()
    y_train_encoded = label_encoder.fit_transform(y_train)
    y_test_encoded = label_encoder.transform(y_test)

    # Convert to PyTorch tensors
    X_train_tensor = torch.tensor(X_train_processed, dtype=torch.float32).to(device)
    y_train_tensor = torch.tensor(y_train_encoded, dtype=torch.long).to(device)
    X_test_tensor = torch.tensor(X_test_processed, dtype=torch.float32).to(device)
    y_test_tensor = torch.tensor(y_test_encoded, dtype=torch.long).to(device)

    # Define model parameters
    input_size = X_train_processed.shape[1]
    hidden_sizes = [128, 64]
    model = MLPClassifierTorch(input_size, hidden_sizes, num_classes).to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    num_epochs = 50
    batch_size = 64

    # Create DataLoader
    train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # Training loop
    train_acc_history = []
    val_acc_history = []

    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for inputs_batch, labels_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs_batch)
            loss = criterion(outputs, labels_batch)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs_batch.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total += labels_batch.size(0)
            correct += (predicted == labels_batch).sum().item()

        epoch_loss = running_loss / len(train_dataset)
        epoch_acc = correct / total
        train_acc_history.append(epoch_acc)

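        # "Validation" accuracy; note that this is measured on the test set,
        # because the validation split was merged into the training data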
        model.eval()
        with torch.no_grad():
            outputs = model(X_test_tensor)
            _, predicted = torch.max(outputs.data, 1)
            total = y_test_tensor.size(0)
            correct = (predicted == y_test_tensor).sum().item()
            val_acc = correct / total
            val_acc_history.append(val_acc)

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f}, Val Acc: {val_acc:.4f}')

    # Save the trained model
    torch.save(model.state_dict(), os.path.join(results_dir, 'MLPClassifier_PyTorch.pth'))

    # Plot training and validation accuracy
    plt.figure(figsize=(8,6))
    plt.plot(range(1, num_epochs+1), train_acc_history, label='Training Accuracy')
    plt.plot(range(1, num_epochs+1), val_acc_history, label='Validation Accuracy')
    plt.title('Training and Validation Accuracy for MLP Classifier')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid()
    plt.savefig(os.path.join(results_dir, 'MLPClassifier_Accuracy.png'))
    plt.show()

    # Evaluate on test set
    model.eval()
    with torch.no_grad():
        outputs = model(X_test_tensor)
        _, predicted = torch.max(outputs.data, 1)
        y_pred = predicted.cpu().numpy()
        y_true = y_test_encoded

    # Ensure class_labels are strings
    class_labels_str = [str(label) for label in class_labels]

    # Classification report
    report = classification_report(y_true, y_pred, target_names=class_labels_str)
    print("Classification Report:")
    print(report)

    # Save classification report to a text file
    with open(os.path.join(results_dir, "MLPClassifier_classification_report.txt"), "w") as f:
        f.write(report)

    # Confusion Matrix
    conf_matrix = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        conf_matrix,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=class_labels_str,
        yticklabels=class_labels_str
    )
    plt.title("MLP Classifier Confusion Matrix")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.savefig(os.path.join(results_dir, "MLPClassifier_confusion_matrix.png"))
    plt.show()
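
`train_mlp_pytorch` instantiates `MLPClassifierTorch(input_size, hidden_sizes, num_classes)`, but the class itself does not appear in the cells shown. A minimal sketch that is compatible with that call could look as follows. The ReLU activations and the absence of dropout or normalization are assumptions, so the original architecture may differ.

```python
import torch
import torch.nn as nn


class MLPClassifierTorch(nn.Module):
    """Hypothetical sketch: fully connected network with ReLU hidden layers."""

    def __init__(self, input_size, hidden_sizes, num_classes):
        super().__init__()
        layers = []
        in_features = input_size
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(in_features, hidden_size))
            layers.append(nn.ReLU())
            in_features = hidden_size
        # The final layer returns raw logits; nn.CrossEntropyLoss applies log-softmax itself.
        layers.append(nn.Linear(in_features, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Shape check with made-up dimensions: 4 samples, 10 features, 3 classes
model = MLPClassifierTorch(input_size=10, hidden_sizes=[128, 64], num_classes=3)
logits = model(torch.randn(4, 10))
print(logits.shape)
```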

In [9]:
def run_tabular_analysis():
    """
    Runs the tabular data analysis for Delivery or Restaurant datasets.
    """
    print("Select Dataset Type:")
    print("1. Delivery Dataset")
    print("2. Restaurant Dataset")
    dataset_choice = input("Enter the dataset number (1 or 2): ")

    if dataset_choice == "1":
        dataset_type = "delivery"
        dataset_name = "Delivery"
    elif dataset_choice == "2":
        dataset_type = "restaurant"
        dataset_name = "Restaurant"
    else:
        print("Invalid dataset choice.")
        return

    base_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir, 'Datasets', 'preprocessed_data'))

    # Load and preprocess data
    X_train_full, y_train_full, X_test, y_test, num_classes, categorical_features, numerical_features = load_and_preprocess_data(base_dir, dataset_type)

    print("\nSelect Model:")
    print("1. Logistic Regression")
    print("2. MLP Classifier")
    model_choice = input("Enter the model number (1 or 2): ")

    # Get class labels for plotting
    class_labels = sorted(y_train_full.unique())
    class_labels_str = [str(label) for label in class_labels]  # Convert to strings

    # Create results directory
    if model_choice == "1":
        model_name = "LogisticRegression"
    elif model_choice == "2":
        model_name = "MLPClassifier_PyTorch"
    else:
        print("Invalid model choice.")
        return

    results_dir = f"Results/{model_name}_{dataset_name}"
    os.makedirs(results_dir, exist_ok=True)

    if model_choice == "1":
        # Logistic Regression
        model = create_logistic_regression_pipeline(categorical_features, numerical_features)

        # Define hyperparameter grid for Logistic Regression
        param_grid = {
            'classifier__C': [0.01, 0.1, 1, 10],
            'classifier__penalty': ['l2'],
            'classifier__solver': ['lbfgs', 'saga', 'sag'],
            'classifier__max_iter': [500, 1000]
        }

        print("\nStarting Logistic Regression training and evaluation...")
        # Evaluate Logistic Regression
        best_model = evaluate_logistic_regression(
            model,
            param_grid,
            X_train_full,
            X_test,
            y_train_full,
            y_test,
            model_name,
            class_labels_str,  # Use string labels
            results_dir
        )

        # Plot ROC Curve
        plot_roc_curve(best_model, X_test, y_test, model_name, num_classes, results_dir)

    elif model_choice == "2":
        # MLP Classifier using PyTorch

        print("\nStarting MLP Classifier training and evaluation...")
        # Train and evaluate MLP Classifier
        train_mlp_pytorch(
            X_train_full,
            y_train_full,
            X_test,
            y_test,
            num_classes,
            results_dir,
            class_labels_str  # Use string labels
        )
    else:
        print("Invalid model choice.")
        return

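The cells above define the full tabular workflow; nothing is loaded or trained until `run_tabular_analysis()` is called.

The following cell defines a simplified `evaluate_logistic_regression()` that only runs the grid search and reports the test accuracy. Because it reuses the same name, running it replaces the fuller definition from the earlier cell.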

In [10]:
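def evaluate_logistic_regression(model, param_grid, X_train, X_test, y_train, y_test, model_name, class_labels, results_dir=None):
    """
    Evaluates the logistic regression model on the dataset using hyperparameter tuning.
    `results_dir` is accepted so that calls written for the earlier definition still work; it is not used here.
    """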
    from sklearn.model_selection import GridSearchCV

    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_

    # Evaluate the best model
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Display results
    print(f"Best Model: {grid_search.best_params_}")
    print(f"Test Accuracy: {accuracy:.4f}")
    
    return best_model


You can now run the `evaluate_logistic_regression()` function to train and evaluate a Logistic Regression model on the preprocessed food delivery dataset.

Once the model finishes, the evaluation results will display the best hyperparameters and the test accuracy.

In [None]:
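# Example of how to call the logistic regression evaluation function
# Assumes X_train, X_test, y_train and y_test are already in memory and contain only numeric features
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.1, 1, 10]}  # Hyperparameter grid for the inverse regularization strength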
model = LogisticRegression(solver='liblinear')
model_name = 'Logistic Regression'

best_model = evaluate_logistic_regression(model, param_grid, X_train, X_test, y_train, y_test, model_name, class_labels=['Class 0', 'Class 1'])

### Image Size and Aspect Ratio Analysis

In this section, we'll analyze the image sizes and aspect ratios in a given dataset directory.
- The function counts how often each (width, height) pair and each aspect ratio occurs.
- It returns both frequency tables so that their distributions can be inspected or plotted.

Let's proceed with the image size analysis.

In [12]:
def analyze_image_sizes(root_dir, extensions={'jpg', 'jpeg', 'png', 'bmp', 'gif'}):
    """
    Analyzes the sizes and aspect ratios of the images in the dataset directory.
    """
    aspect_ratios = defaultdict(int)
    image_sizes = defaultdict(int)

    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if file.split('.')[-1].lower() in extensions:
                image_path = os.path.join(root, file)
                # Use a context manager so each file handle is closed after reading the size
                with Image.open(image_path) as img:
                    width, height = img.size

                aspect_ratio = width / height
                image_sizes[(width, height)] += 1
                aspect_ratios[aspect_ratio] += 1

    return image_sizes, aspect_ratios
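
Because `aspect_ratios` is keyed by the exact float ratio, near-identical shapes such as 512x341 and 512x342 end up in separate bins. One way to get a more readable summary is to round the ratio before counting. The helper below is a hypothetical addition that works on the `image_sizes` dictionary returned above; the two-decimal rounding is an arbitrary choice.

```python
from collections import Counter


def summarize_aspect_ratios(image_sizes, ndigits=2):
    """Hypothetical helper: count images per rounded width/height ratio."""
    ratio_counts = Counter()
    for (width, height), count in image_sizes.items():
        ratio_counts[round(width / height, ndigits)] += count
    return ratio_counts


# Three entries copied from the recorded output of this notebook
sample_sizes = {(512, 512): 62206, (512, 384): 14791, (384, 512): 6518}
for ratio, count in summarize_aspect_ratios(sample_sizes).most_common():
    print(f"{ratio:.2f}: {count}")
```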


`analyze_image_sizes()` collects the sizes and aspect ratios of all images under the specified dataset directory, so you can inspect the distribution of image dimensions.

The next cell runs it on the Food101 images and prints both distributions.

In [13]:
# Example of how to call the analyze_image_sizes function
root_dir = "../Datasets/archive/images"  # Path to the images directory

image_sizes, aspect_ratios = analyze_image_sizes(root_dir)

# Display results
print("Image Sizes Distribution:", image_sizes)
print("Aspect Ratios Distribution:", aspect_ratios)

Image Sizes Distribution: defaultdict(<class 'int'>, {(512, 342): 359, (512, 384): 14791, (512, 512): 62206, (512, 382): 2512, (384, 512): 6518, (512, 289): 911, (382, 512): 2879, (512, 341): 1508, (373, 512): 2, (512, 307): 371, (512, 340): 383, (341, 512): 251, (512, 314): 4, (342, 512): 52, (512, 385): 101, (512, 304): 37, (512, 485): 5, (512, 288): 890, (512, 360): 7, (512, 511): 270, (287, 512): 77, (306, 512): 504, (289, 512): 290, (512, 343): 183, (512, 305): 4, (512, 366): 41, (512, 308): 82, (512, 381): 26, (304, 512): 13, (512, 463): 10, (511, 512): 543, (512, 383): 454, (512, 369): 11, (512, 306): 1031, (512, 403): 9, (512, 462): 2, (512, 410): 34, (339, 512): 29, (512, 481): 5, (512, 313): 5, (512, 296): 4, (512, 328): 6, (512, 252): 3, (512, 352): 7, (512, 405): 8, (512, 333): 17, (506, 512): 25, (466, 512): 4, (468, 512): 5, (512, 335): 7, (288, 512): 255, (383, 512): 210, (512, 287): 220, (512, 346): 14, (512, 396): 8, (512, 339): 151, (512, 437): 2, (307, 512): 132, (50