# Module 05 - Images
## Computer Vision
Computer vision is a field of artificial intelligence that focuses on enabling machines to interpret and understand visual data from the world, such as images and videos. At its core, computer vision seeks to replicate human visual perception by using algorithms to process, analyze, and extract meaningful information from visual content. This could involve recognizing objects, detecting patterns, identifying faces, or even interpreting handwritten text. The field leverages advancements in machine learning and deep learning to make sense of complex visual data, transforming how machines "see" and interact with their environment. By converting raw pixels into actionable insights, computer vision is a bridge between digital systems and the physical world.

<p style="text-align: center"><img src="https://thislondonhouse.com/Jupyter/Images/computer_vision.png"></p>

Increasingly, businesses are turning to computer vision to automate tasks and enhance decision-making. For instance, in retail, it is used for inventory management and cashier-less checkout systems. In healthcare, computer vision powers diagnostic tools that analyze medical images, improving accuracy and speed. The field also finds applications in agriculture, manufacturing, and autonomous vehicles, where it monitors crops, detects defects in products, and enables safe navigation, respectively. By harnessing the potential of computer vision, businesses can drive innovation, optimize operations, and create new opportunities, all while delivering better value to their customers. This transformative capability exemplifies the power of AI to shape the future of industries.

### Classification, Detection, Recognition
The ability of computers to see is powered by a range of techniques in computer vision, primarily driven by advancements in machine learning and deep learning. Convolutional Neural Networks (CNNs) are at the forefront, designed to mimic human visual processing by using layers of filters to detect features such as edges, textures, and shapes in images. Techniques like object detection enable computers to locate and identify multiple objects within a scene, while semantic segmentation breaks an image into regions for pixel-level understanding. Optical character recognition (OCR) enables the interpretation of text in visual data, and image classification assigns labels to entire images based on their content. Additionally, methods like feature extraction and keypoint detection are used to match patterns or track movements. Combined, these techniques allow computers to analyze and interpret visual information, making them invaluable across various applications.
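
To make the idea of a filter concrete, the following is a minimal sketch of how one filter responds to an edge. It assumes a toy 6x6 image and a hand-written, Sobel-like vertical-edge filter; the helper name `convolve2d_valid` is hypothetical, and a real CNN would learn its filter values during training rather than use fixed ones.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products.

    This is the cross-correlation that CNN layers compute.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy 6x6 image: dark left half (0), bright right half (1)
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0

# Hand-written vertical-edge filter (an assumption for illustration)
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=np.float32)

response = convolve2d_valid(image, vertical_edge)
print(response)  # every row is [0. 4. 4. 0.]
```

The response is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is what "detecting an edge" means in practice.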

Different techniques in computer vision suit different tasks. Convolutional Neural Networks (CNNs) are highly effective for image classification and feature extraction because they detect patterns and hierarchical features in visual data. Object detection techniques, such as YOLO (You Only Look Once) and Faster R-CNN, identify and locate multiple objects within an image or video; YOLO in particular is fast enough for real-time applications like autonomous driving. In contrast, segmentation techniques like U-Net (semantic segmentation) and Mask R-CNN (instance segmentation) provide a pixel-level understanding of images, useful in fields like medical imaging and agriculture where detailed analysis is critical.

Meanwhile, Optical Character Recognition (OCR) is tailored for extracting and interpreting text from images, making it invaluable for document processing and automation. Feature-matching methods, such as Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF), are particularly useful for image stitching or 3D reconstruction by recognizing and aligning key points between images. The choice of technique often depends on the specific goals, computational resources, and accuracy requirements of the task, illustrating the diversity and adaptability of approaches within the realm of computer vision.

## Synthetic Data
Synthetic data is a powerful tool in machine learning, used to augment datasets, reduce bias, and train models more effectively when real-world data is scarce, expensive, or challenging to collect. By generating artificial yet statistically relevant data, synthetic datasets can simulate various scenarios and improve model generalization. This is particularly helpful in applications like healthcare, finance, and autonomous systems, where obtaining real data might involve privacy concerns, high costs, or safety risks. Synthetic data also aids in balancing datasets, mitigating issues like class imbalance, and creating diverse, representative training samples.

In the context of image classification, synthetic data generation often involves techniques such as data augmentation, where transformations like rotation, scaling, flipping, and color adjustments are applied to existing images to produce variations. More advanced methods include using generative adversarial networks (GANs) or 3D rendering to create entirely new images based on the properties of the original dataset. This can be especially beneficial for rare or underrepresented classes in the dataset, as it provides additional samples to improve the model's performance and reduce overfitting. The key advantage is that synthetic data expands the scope of the dataset without requiring manual collection or labeling efforts.
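
As a rough illustration of data augmentation, the sketch below applies three of the transformations mentioned above to a single image using only NumPy. The random array is a stand-in for a real photo, and the 20% brightness factor is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a real photo: height x width x channels, values in [0, 1]
image = rng.random((128, 128, 3)).astype(np.float32)

flipped = np.fliplr(image)                  # mirror left-to-right
rotated = np.rot90(image, k=1)              # rotate 90 degrees counter-clockwise
brighter = np.clip(image * 1.2, 0.0, 1.0)   # raise brightness by 20%, keep values in [0, 1]

# Each variant keeps the original label, so one labeled photo yields several samples
augmented = [flipped, rotated, brighter]
print([a.shape for a in augmented])
```

Which transformations are safe depends on the task: mirroring a hot dog does not change what it is, but mirroring a handwritten digit might.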



<p style="text-align: center"><img src="https://thislondonhouse.com/Jupyter/Images/hotdog_left.png" width=40%>&nbsp;<img src="https://thislondonhouse.com/Jupyter/Images/hotdog_right.png" width=40%></p>


In [None]:
# Libraries
import os
from groq import Groq
from dotenv import load_dotenv
import base64
import random
import zipfile
from time import time
import matplotlib.pyplot as plt
import numpy as np
import requests
from PIL import Image
from io import BytesIO
import tensorflow as tf
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC, LinearSVC
from scikeras.wrappers import KerasClassifier
from tensorflow.keras.layers import Dense, Input, MaxPooling2D, Conv2D, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from skimage.feature import hog
from skimage.color import rgb2gray

In [None]:
# Function
def load_data_from_directory(base_dir, include_flipped_images=False):
    data = []
    labels = []

    # Iterate through each class folder
    for class_name in os.listdir(base_dir):
        class_dir = os.path.join(base_dir, class_name)

        if os.path.isdir(class_dir):  # Ensure it is a directory
            for file_name in os.listdir(class_dir):
                file_path = os.path.join(class_dir, file_name)

                try:
                    # Open the image, preprocess it, and add to dataset
                    with Image.open(file_path) as img:
                        img = img.convert('RGB').resize((128, 128))  # Force 3 channels so every array has the same shape
                        img_array = np.array(img)  # Convert to a NumPy array
                        data.append(img_array)
                        labels.append(class_name)  # Use the folder name as the label
                        if include_flipped_images:
                            flipped_img = img.transpose(Image.FLIP_LEFT_RIGHT)
                            flipped_img_array = np.array(flipped_img)  # Convert flipped image to a NumPy array

                            # Append the flipped image and label
                            data.append(flipped_img_array)
                            labels.append(class_name)  # The label remains the same

                except Exception as e:
                    print(f"Error loading image {file_path}: {e}")

    return np.array(data, dtype='float32')/255.0, np.array(labels)

def load_random_images(folder, num_images):
    images = []
    if os.path.isdir(folder):
        # Match image extensions regardless of case (e.g. .JPG)
        all_images = [os.path.join(folder, f) for f in os.listdir(folder) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
        sampled_images = random.sample(all_images, min(len(all_images), num_images))
        images.extend(sampled_images)
    return images

def plot_images_grid(images, grid_size):
    fig, axes = plt.subplots(grid_size, grid_size, figsize=(12, 12))
    fig.subplots_adjust(hspace=0.5, wspace=0.5)
    for i, ax in enumerate(axes.flatten()):
        if i < len(images):
            img = Image.open(images[i])
            ax.imshow(img)
            ax.axis('off')
    plt.show()

def get_image(image_url):
    response = requests.get(image_url)
    response.raise_for_status()  # Fail early on a bad download
    img = Image.open(BytesIO(response.content)).convert('RGB')  # Force 3 channels
    img = img.resize((128, 128))  # Match the size used for the training images

    # Convert to a batch of one image and scale pixel values to [0, 1]
    img_array = np.array(img)
    data = [img_array]
    return np.array(data, dtype='float32')/255.0

def flatten_images(X):
    return X.reshape(X.shape[0], -1)

def to_grayscale(images):
    return np.array([rgb2gray(image) for image in images])

def extract_hog_features(images, pixels_per_cell=(8, 8), cells_per_block=(2, 2), orientations=9):
    return np.array([hog(image, 
                         pixels_per_cell=pixels_per_cell, 
                         cells_per_block=cells_per_block, 
                         orientations=orientations, 
                         block_norm='L2-Hys') for image in images])

def create_sequential_model(dims, metric):
    print(dims)
    model = Sequential()
    model.add(Input(shape=dims))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2))  # One logit per class: hotdog, nothotdog
    model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=metric)
    return model

def classifier_performance(y, y_pred, labels_dict=None):
    accuracy = metrics.accuracy_score(y, y_pred)
    precision = metrics.precision_score(y, y_pred, average='weighted')
    recall = metrics.recall_score(y, y_pred, average='weighted')
    balanced_accuracy = metrics.balanced_accuracy_score(y, y_pred)
    f1 = metrics.f1_score(y, y_pred, average='weighted')
    # Class names in sorted label order, which is the order scikit-learn reports them in
    if labels_dict is not None:
        class_names = [labels_dict[i] for i in sorted(labels_dict.keys())]
    else:
        class_names = np.unique(np.concatenate([y, y_pred]))
    report = metrics.classification_report(y, y_pred, target_names=class_names)

    # Display the confusion matrix with custom labels
    conf_matrix = metrics.confusion_matrix(y, y_pred)
    disp = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=class_names)
    disp.plot(cmap=plt.cm.Greens)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"Balanced Accuracy: {balanced_accuracy:.4f}")
    print(f"F1-score: {f1:.4f}")
    print("\nDetailed Classification Report:")
    print(report)
    plt.show()

# Read an image file and return its contents as a base64-encoded string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

## Image Exercise 1
### Business Problem
Being able to discriminate between types of objects is an essential first step to teaching a computer how to make sense of an image. If we can effectively teach a computer to discriminate between different types of images, we can then build subsequent tools that leverage these abilities and automate processes that previously required human intervention.

### Data Collection/Selection
We will be loading data from a Kaggle dataset. More information here: https://www.kaggle.com/datasets/thedatasith/hotdog-nothotdog and here: https://www.youtube.com/watch?v=ACmydtFDTGs

The data are organized into testing and training folders with each folder containing subfolders for each class of object.
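
Because the folder names double as class labels, a quick count of files per subfolder is a cheap way to check class balance. The following is a minimal sketch using only the standard library; the function name `count_images_per_class` and the set of accepted extensions are assumptions for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

def count_images_per_class(split_dir):
    """Return {class_name: image_count} for each class subfolder of `split_dir`."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

Once the archive has been extracted, this could be called on the training or testing folder to see how many images each class contains.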

In [None]:
# Specify the URL of the zip file and the local paths
zip_file_name = 'hotdog_nothotdog.zip'
zip_file_url = f'https://www.thislondonhouse.com/Jupyter/{zip_file_name}'  # Replace with the URL of the zip file
extract_to_dir = 'data/images'  # Replace with your desired extraction folder

# Download the file from the URL
response = requests.get(zip_file_url)
response.raise_for_status()  # Stop here if the download failed
with open(zip_file_name, 'wb') as file:
    file.write(response.content)
print(f"Downloaded zip file to: {zip_file_name}")

# Create the extraction directory if it doesn't exist
os.makedirs(extract_to_dir, exist_ok=True)

# Open and extract the zip file
if os.path.exists(zip_file_name):
    with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
        zip_ref.extractall(extract_to_dir)
        print(f"Files extracted to: {extract_to_dir}")

    # Delete the file
    os.remove(zip_file_name)
    print(f"{zip_file_name} has been deleted.")

Though we are analyzing images, we will still want to profile the data. This includes exploring a subset of the data to understand what we are analyzing.

In [None]:
test_dir = 'data/images/hotdog-nothotdog/hotdog-nothotdog/test'
train_dir = 'data/images/hotdog-nothotdog/hotdog-nothotdog/train'

This is a sample of 'hotdog' images.

In [None]:
hotdog_images = load_random_images(f"{train_dir}/hotdog", 36)  # 36 images fill the 6x6 grid
plot_images_grid(hotdog_images, 6)

This is a sample of 'nothotdog' images.

In [None]:
nothotdog_images = load_random_images(f"{train_dir}/nothotdog", 36)  # 36 images fill the 6x6 grid
plot_images_grid(nothotdog_images, 6)

Now, we will load our training and testing data. Whereas previous exercises split the training and testing sets from a single dataset, image datasets are often presorted into folders to simplify the analysis. So, we will load our training data first and then our testing data.

When loading the data, we will perform several standardization steps. These steps are similar in purpose to the cleaning steps for text analysis. In this case, we resize each image to 128x128 pixels. Each pixel value is stored as an integer from 0 to 255 (256 levels), so we then divide by 255 to place every value on a scale of 0 to 1.
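
The two standardization steps can be sketched with NumPy alone. This is an illustration under stated assumptions: the random array stands in for a decoded photo, and `resize_nearest` is a hypothetical nearest-neighbor helper that is simpler than the resampling Pillow's `resize` performs in the notebook.

```python
import numpy as np

def resize_nearest(image, size=(128, 128)):
    """Nearest-neighbor resize: pick the closest source pixel for each target pixel."""
    rows = np.arange(size[0]) * image.shape[0] // size[0]
    cols = np.arange(size[1]) * image.shape[1] // size[1]
    return image[rows][:, cols]

rng = np.random.default_rng(0)
# Stand-in for a decoded 200x300 RGB photo with 8-bit channels
raw = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)

resized = resize_nearest(raw)                 # (128, 128, 3), still integers in 0..255
scaled = resized.astype(np.float32) / 255.0   # float32 values in [0, 1]

print(raw.shape, "->", scaled.shape, scaled.dtype)
```

Giving every image the same shape lets the dataset be stacked into one array, and putting every value on the same 0 to 1 scale tends to help many classifiers train more stably.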

In [None]:
X_train, y_train = load_data_from_directory(train_dir)
print(f"Training data: {len(X_train)} samples; Shape: {X_train.shape}")

In [None]:
X_test, y_test = load_data_from_directory(test_dir)
print(f"Testing data: {len(X_test)} samples; Shape: {X_test.shape}")

Now, we will visualize the data that will be fed into our analysis pipeline.

In [None]:
num_images = 36
random_indices = random.sample(range(len(X_train)), num_images)
random_images = [X_train[i] for i in random_indices]

# Create a 6x6 grid
fig, axes = plt.subplots(6, 6, figsize=(8, 8))

# Plot each image
for ax, img in zip(axes.flatten(), random_images):
    ax.imshow(img)  # Display the image
    ax.axis('off')  # Hide the axes for better aesthetics

plt.tight_layout()
plt.show()

### Model Specification
As with the text analysis exercise, we will use several transformers, here ones designed to process image data. The first converts all images to grayscale because shape, rather than color, is the more important indicator of a hot dog. This also reduces the complexity of the problem: rather than dealing with the intensities of three color channels (red, green, blue), we deal with a single intensity value per pixel. Next, we apply a histogram of oriented gradients (HOG) function. This function further reduces the complexity of the image while emphasizing important characteristics such as edges and their direction. Finally, we flatten the result so that each image is represented as a single vector of values rather than a matrix of values.
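
To show what these transformers compute, here is a simplified sketch in NumPy. It is not scikit-image's implementation: the grayscale weights are the luminance weights `rgb2gray` uses, but the histogram helper handles a single cell and omits block normalization. All three function names are hypothetical.

```python
import numpy as np

def to_gray(rgb):
    """Weighted sum of the R, G, B channels, giving one intensity per pixel."""
    return rgb @ np.array([0.2125, 0.7154, 0.0721])

def cell_orientation_histogram(cell, orientations=9):
    """Magnitude-weighted histogram of unsigned gradient directions (0 to 180 degrees)."""
    gy, gx = np.gradient(cell)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=orientations, range=(0.0, 180.0), weights=magnitude)
    return hist

def hog_feature_length(image_size=128, pixels_per_cell=8, cells_per_block=2, orientations=9):
    """Length of the flattened HOG vector when blocks slide one cell at a time."""
    cells = image_size // pixels_per_cell
    blocks = cells - cells_per_block + 1
    return blocks * blocks * cells_per_block * cells_per_block * orientations

# 8x8 cell whose brightness rises left to right, so every gradient points horizontally
cell = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))
hist = cell_orientation_histogram(cell)
print(hist)                  # all of the weight falls in the first orientation bin
print(hog_feature_length())  # 15 * 15 * 2 * 2 * 9 = 8100
```

With the settings used in this notebook, each 128x128 image (49,152 values across three channels) is summarized by 8,100 HOG values that describe edge directions rather than raw colors.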

As with previous exercises, we will begin with a logistic regression classifier.

In [None]:
# Create a pipeline with the custom transformer
pipeline = Pipeline([
    ('grayscale', FunctionTransformer(to_grayscale, validate=False)),  # Convert images to grayscale
    ('hog', FunctionTransformer(lambda x: extract_hog_features(x), validate=False)),  # Extract HOG features
    ('flatten', FunctionTransformer(flatten_images, validate=False)),
    ('classifier', LogisticRegression(max_iter=1000))
])

Fit the pipeline.

In [None]:
pipeline.fit(X_train, y_train)

And predict the results.

In [None]:
logistic_predicted = pipeline.predict(X_test)

Assess classifier performance.

In [None]:
classifier_performance(y_test, logistic_predicted)
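
The figures printed by `classifier_performance` all derive from the confusion matrix. The sketch below works through the arithmetic for one class using hypothetical counts chosen only for illustration; note that the notebook's function reports weighted averages across both classes, so its numbers will not match a single-class calculation exactly.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)       # of the predicted positives, how many were right
    recall = tp / (tp + fn)          # of the actual positives, how many were found
    specificity = tn / (tn + fp)     # of the actual negatives, how many were found
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }

# Hypothetical counts, not results from this notebook
m = binary_metrics(tp=60, fp=30, fn=40, tn=70)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```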

For this exercise, a convolutional neural network is added to the end of the list of comparison classifiers. This type of neural network is specially designed to process images and to detect objects. The following code applies early stopping rules to the neural network to limit overfitting.

In [None]:
# Define the EarlyStopping callback
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)
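
The patience rule can be sketched in plain Python. This is a simplified model of the idea, not Keras's implementation, and the validation losses are hypothetical values chosen to show the behavior.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2
losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.63, 0.64, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)  # 7 2
```

With `restore_best_weights=True`, the model would keep the weights from the best epoch (epoch 2 in this sketch) rather than those from the epoch at which training stopped.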

As with the text analysis exercises, we will then run a series of classifiers to assess the quality of our model. 

In [None]:
results = []
classifiers = ((DummyClassifier(), "Dummy Classifier"),
               (LogisticRegression(C=5, max_iter=10000), "Logistic Regression"),
               (RidgeClassifier(alpha=1.0, solver="sparse_cg"), "Ridge Classifier"),
               (KNeighborsClassifier(n_neighbors=100), "kNN"),
               (RandomForestClassifier(), "Random Forest"),
               (SVC(kernel='linear', C=1.0, max_iter=10000), "Linear SVC"),
               (SGDClassifier(loss="log_loss", alpha=1e-4, n_iter_no_change=3, early_stopping=True), "log-loss SGD",),
               (NearestCentroid(), "NearestCentroid"),
               (ComplementNB(alpha=0.1), "Complement naive Bayes"), 
               (KerasClassifier(model=create_sequential_model((128,128,3), ['accuracy']), epochs=10, batch_size=5, verbose=1, validation_split=0.2, callbacks=[early_stopping]), 'Neural Network'))

for clf, name in classifiers:
    print("=" * 80)
    print(name)
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    if name == 'Neural Network':
        # The neural network takes the raw images, so no grayscale, HOG, or flattening step is needed
        pipeline = Pipeline([
            ('classifier', clf) 
        ])
    else:
        pipeline = Pipeline([
            ('grayscale', FunctionTransformer(to_grayscale, validate=False)),  # Convert images to grayscale
            ('hog', FunctionTransformer(lambda x: extract_hog_features(x), validate=False)),  # Extract HOG features
            ('flatten', FunctionTransformer(flatten_images, validate=False)),
            ('classifier', clf) 
        ])
    pipeline.fit(X_train, y_train)

    train_time = time() - t0
    print(f"train time: {train_time:.3}s")

    t0 = time()
    y_pred = pipeline.predict(X_test)
    test_time = time() - t0
    print(f"test time:  {test_time:.3}s")
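    # Labels are the folder names, and scikit-learn sorts them: 'hotdog' (0) comes before 'nothotdog' (1)
    classifier_performance(y_test, y_pred, {0: 'Hot Dog', 1: 'Not Hot Dog'})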
    print()
    if name:
        clf_descr = str(name)
    else:
        clf_descr = clf.__class__.__name__

    results.append((clf_descr, metrics.accuracy_score(y_test, y_pred), train_time, test_time))

Processing images is a resource-intensive task, so it is particularly important to consider the efficiency of our models.

In [None]:
results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time)
test_time = np.array(test_time)

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))
ax1.scatter(score, training_time, s=60)
ax1.set(
    title="Score-training time trade-off",
    yscale="log",
    xlabel="test accuracy",
    ylabel="training time (s)",
)
ax2.scatter(score, test_time, s=60)
ax2.set(
    title="Score-test time trade-off",
    yscale="log",
    xlabel="test accuracy",
    ylabel="test time (s)",
)

for i, txt in enumerate(clf_names):
    ax1.annotate(txt, (score[i], training_time[i]))
    ax2.annotate(txt, (score[i], test_time[i]))

plt.tight_layout()
plt.show()

In [None]:
test_image = get_image('https://thislondonhouse.com/Jupyter/Images/hotdog.jpg')
plt.imshow(test_image[0])
plt.show()
pipeline.predict(test_image)

In [None]:
test_image = get_image('https://thislondonhouse.com/Jupyter/Images/tacos.jpg')
plt.imshow(test_image[0])
plt.show()
pipeline.predict(test_image)

In [None]:
test_image = get_image('https://thislondonhouse.com/Jupyter/Images/puppy.jpg')
plt.imshow(test_image[0])
plt.show()
pipeline.predict(test_image)

### Conclusion
The model performed moderately well. It did better than the dummy classifier, which predicts the same class regardless of the input, but it was not much better than a coin flip. There could be several reasons for this. The training sample may not be large enough, and it may be necessary to simply take more pictures of hot dogs. Alternatively, we could introduce synthetic data, such as mirrored copies of the training images, because the orientation of a hot dog is not a distinguishing feature. Also, hot dogs are fairly common food items, but they are often pictured on a plate. This could suggest to the classifier that the plate is part of the hot dog, increasing the likelihood that any other plated food item would be classified as a hot dog. The cost of misclassifying a hot dog is very low, so the risk of deploying our classifier is low. If the task were more consequential, such as predicting illness, our classifier would be insufficient and might do more harm than good.
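
The loader defined earlier already accepts `include_flipped_images=True` for this purpose. As a sketch of the same idea applied to arrays that are already in memory, the hypothetical helper below doubles a dataset by appending mirrored copies; the random array stands in for real training images.

```python
import numpy as np

def add_flipped_copies(X, y):
    """Append a left-right mirrored copy of every image; mirroring does not change the label."""
    X_flipped = X[:, :, ::-1, :]   # reverse the width axis of (n, height, width, channels)
    return np.concatenate([X, X_flipped]), np.concatenate([y, y])

rng = np.random.default_rng(0)
X = rng.random((4, 128, 128, 3)).astype(np.float32)   # stand-in for loaded training images
y = np.array(["hotdog", "nothotdog", "hotdog", "nothotdog"])

X_aug, y_aug = add_flipped_copies(X, y)
print(X_aug.shape, y_aug.shape)  # (8, 128, 128, 3) (8,)
```

Synthetic copies belong in the training set only; the test set should stay untouched so that the reported performance still reflects real images.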

## Image Exercise 2
In this exercise, we will be building an LLM-wrapper application. These steps will serve as a model for how we approach LLM-wrappers in the future.

### Business Problem
Creating copy for online stores is a labor-intensive process. Though very few people actually read the copy, web crawlers do, and product descriptions are essential for developing an effective search engine optimization (SEO) strategy. Small companies often lack the resources needed to develop high-quality product descriptions efficiently. So, it would be valuable to have an LLM that can 'see' our products and describe them in a way that fits our store's identity.

### Data Collection/Selection
For this exercise, I have downloaded an image from an Etsy store: https://www.etsy.com/listing/671179169/linen-dress-long-midcalf-belt-dress

In [None]:
image_url = "https://i.etsystatic.com/7803582/r/il/7f4b28/2007580738/il_1140xN.2007580738_1ttb.jpg"

Image.open(BytesIO(requests.get(image_url).content))

### LLM Engineering
In this exercise, the LLM is our intelligence, but we have to tell it what kind of intelligence to exhibit. Unlike the chat features, which allow us to specify the system behavior separately, here we must embed all instructions in a single prompt. The image is then passed along with the prompt, and the LLM assesses the image based on the instructions.

In [None]:
instruction = """
    What do you see?
"""

In [None]:
instruction = """
    Imagine you are an assistant at an upscale boutique and you need to describe our new line of spring dresses. 
    Could you describe this dress in a way that would be enticing to upscale customers who refresh their wardrobe annually?
    Do not provide any other commentary. Only describe the dress.
"""
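
Since the instruction and the image travel together in one user message, it can help to see that message as plain data. The sketch below mirrors the structure of the request made later in this exercise; the function name `build_vision_messages` is hypothetical, and the example.com URL is a placeholder.

```python
def build_vision_messages(instruction, image_url):
    """Package one text instruction and one image reference into a single user message."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction.strip()},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

# Placeholder URL for illustration only
messages = build_vision_messages("What do you see?", "https://example.com/dress.jpg")
print(messages[0]["content"][0]["text"])
```

Swapping the instruction while keeping the image fixed makes it easy to compare how different prompts change the description.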

### Application Building

Again, we will be using Groq for LLM inference. [Sign up for API access](https://console.groq.com/login). We will use the free level of service, but there are paid levels, so it is important to protect your key. Once you have created an API key, you can add it as a variable to a variables.env file to keep the key out of your source code.

In [None]:
dotenv_path = 'variables.env'

load_dotenv(dotenv_path)
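
If the variables.env file is missing or the variable is misnamed, `os.getenv` quietly returns `None` and the failure only surfaces later. A small guard, sketched below with the hypothetical helper `require_env`, makes the problem obvious up front. It assumes the file contains a line of the form `GROQ_API_KEY=<your key>`.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to variables.env"
        )
    return value

# Possible usage in this notebook (left as a comment because it needs a real key):
# client = Groq(api_key=require_env("GROQ_API_KEY"))
```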

Here we load the environment variable from the variables.env file and pass it into the Groq library to establish a link to their inference resources.

In [None]:
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

In [None]:
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": instruction
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    }
                }
            ]
        }
    ],
)

print(image_url)
print(completion.choices[0].message.content)

## Image Exercise 3
In this exercise, we will be building an LLM-wrapper application. These steps will serve as a model for how we approach LLM-wrappers in the future.

### Business Problem
Many organizations are still heavily reliant on paper. It takes time, effort, and money to transform paper processes to digital processes and the transformation is often slow and uneven. Therefore, it would be useful to have tools that can read paper documents and accurately transcribe the information into digital systems. In this exercise, we will have the LLM read a change of minor form and extract pertinent information from it.

### Data Collection/Selection
For this exercise, I have scanned a change of minor form.

In [None]:
image_path = "data/minor_drop.png"

In [None]:
Image.open(image_path)

### LLM Engineering

Here I provide instructions for how I want the LLM to assess the image and how I want it to respond. The goal of this exercise is to extract the information from the form and then load it into some subsequent digital system. Therefore, we will ask the LLM to return the data as JSON.

In [None]:
instruction = """
    The following is an add/drop form for Loyola University Maryland.
    The form contains data entry boxes for Student ID, Current Major, Student Athlete, Class Year, Last Name, First Name, Middle Initial (M.I.).
    Can you read the form and find this information?
    Report this information in JSON format.
    Only extract this information in JSON. Do not provide any other commentary.
"""

### Application Building

In [None]:
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text", 
                     "text": instruction
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{encode_image(image_path)}",
                    },
                },
            ],
        }
    ],
    model="llama-3.2-11b-vision-preview",
)
print(chat_completion.choices[0].message.content)
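
To load the result into a downstream system, the reply text has to be turned into a Python object. Models sometimes wrap JSON in a Markdown code fence or add stray text despite the instructions, so the sketch below extracts the first JSON object before parsing. The helper name `parse_json_reply` is hypothetical, and the reply shown is made up with placeholder values, not actual model output.

```python
import json
import re

def parse_json_reply(reply):
    """Extract and parse the JSON object in a model reply, ignoring any surrounding text."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the reply")
    return json.loads(match.group(0))

# Hypothetical reply with placeholder values, wrapped in a code fence
fence = "`" * 3
reply = fence + 'json\n{"Student ID": "0000000", "Last Name": "Doe", "First Name": "Jane"}\n' + fence

record = parse_json_reply(reply)
print(record["Last Name"])  # Doe
```

In the notebook, the same function could be applied to `chat_completion.choices[0].message.content`. Extracted values should still be checked against the original form before they are trusted.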