# <a id='toc1_'></a>[Visual Question Answering](#toc0_)

Visual Question Answering (VQA) is the task of answering open-ended questions about an image. Its applications include medical VQA, education, and surveillance, among others. This project uses the [VizWiz](https://vizwiz.org/tasks-and-datasets/vqa/) dataset, which was constructed to train models that help visually impaired people. In the words of the creators of VizWiz: “we introduce the visual question answering (VQA) dataset coming from this population, which we call VizWiz-VQA. It originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question.”



<p align="center">
  <img src="Latex Paper/graphics/chapter1/vizwiz_example.png" alt="vizwiz_example" width="500"/>
</p>

- **Note:** This repository implements the paper [Less is More: Linear Layers on CLIP Features as Powerful VizWiz Model](https://arxiv.org/abs/2206.05281).
- If time permits, reading OpenAI's [CLIP](https://openai.com/blog/clip/) paper before going through this repository is strongly recommended.

## <a id='toc1_1_'></a>[Importing Required Libraries](#toc0_)

In [12]:
# Importing os, numpy and pandas for data manipulation
import os
import numpy as np 
import pandas as pd

# For data visualization, we will use matplotlib, wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# For data preprocessing, we will use Counter, train_test_split, Levenshtein distance, Python Image Library and OneHotEncoder
from collections import Counter
import Levenshtein as lev
from PIL import Image
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# For saving and loading the preprocessed data, we will use pickle
import pickle

# For Building the model, we will use PyTorch and its functions
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import clip
from torch.utils.data import Dataset, DataLoader

# For taking the image from the URL, we will use requests
import requests

# For evaluation, we will need sklearn.metrics.average_precision_score
from sklearn.metrics import average_precision_score

# Importing json for results formatting which will be uploaded for evaluation
import json
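
The `Levenshtein` package imported above computes the edit distance between two strings. As a rough illustration of that quantity only, here is a pure-Python sketch. `edit_distance` is a hypothetical helper written for this example, not the package's implementation, and how the notebook applies the distance to answers is not shown in this section.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(edit_distance("grey", "gray"))      # 1
print(edit_distance("soda", "soda can"))  # 4
```

Small distances such as the first one suggest two answer strings that differ only by spelling.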

## <a id='toc1_3_'></a>[Configuring the Notebook](#toc0_)

In [13]:
# Configuring the paths for the dataset
INPUT_PATH = r'E:\HK1  2025-2026\Đồ án CS420\VQA\Dataset\VizWiz-'  # raw string so backslashes are not treated as escapes
ANNOTATIONS = INPUT_PATH + '/Annotations'
TRAIN_PATH = INPUT_PATH + '/train/train'
VALIDATION_PATH = INPUT_PATH + '/val/val'
ANNOTATIONS_TRAIN_PATH = ANNOTATIONS + '/train.json'
ANNOTATIONS_VAL_PATH = ANNOTATIONS + '/val.json'
ANNOTATIONS_TEST_PATH = ANNOTATIONS + '/test.json'
TEST_PATH = INPUT_PATH + '/test/test'

OUTPUT_PATH = r'E:\HK1  2025-2026\Đồ án CS420\VQA\Output'  # raw string so backslashes are not treated as escapes
ANSWER_SPACE = 0 # Will be configured later when we build the vocab using the methodology described in the paper
MODEL_NAME = "ViT-L/14@336px" # This is the backbone of the CLIP model


# Using accelerated computing if available
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))

CUDA available: True
Device: cuda
GPU Name: NVIDIA GeForce RTX 3060
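
The configuration cell builds every path by string concatenation. As an alternative sketch, the same layout can be expressed with `pathlib`, which inserts the correct separator for the current operating system. `build_paths` and the two placeholder roots below are assumptions made for this example; only the sub-folder names come from the cell above.

```python
from pathlib import Path

def build_paths(input_root, output_root):
    """Collect the dataset locations in one dict of Path objects."""
    root = Path(input_root)
    annotations = root / "Annotations"
    return {
        "train_images": root / "train" / "train",
        "val_images": root / "val" / "val",
        "test_images": root / "test" / "test",
        "annotations_train": annotations / "train.json",
        "annotations_val": annotations / "val.json",
        "annotations_test": annotations / "test.json",
        "output": Path(output_root),
    }

# "data/VizWiz" and "output" are placeholder locations, not the notebook's real paths
paths = build_paths("data/VizWiz", "output")
print(paths["annotations_train"])
```

`Path` objects can be passed directly to `open`, `pd.read_json`, and `Image.open`.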


## <a id='toc1_4_'></a>[Processing Data](#toc0_)

In [3]:
import json
import pandas as pd

# Read the JSON annotations file
with open(ANNOTATIONS_TRAIN_PATH, 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data)

print(df.head())


                       image  \
0  VizWiz_train_00000000.jpg   
1  VizWiz_train_00000001.jpg   
2  VizWiz_train_00000002.jpg   
3  VizWiz_train_00000003.jpg   
4  VizWiz_train_00000004.jpg   

                                            question  \
0                   What's the name of this product?   
1        Can you tell me what is in this can please?   
2  Is this enchilada sauce or is this tomatoes?  ...   
3            What is the captcha on this screenshot?   
4                                 What is this item?   

                                             answers answer_type  answerable  
0  [{'answer_confidence': 'yes', 'answer': 'basil...       other           1  
1  [{'answer_confidence': 'yes', 'answer': 'soda'...       other           1  
2  [{'answer_confidence': 'yes', 'answer': 'these...       other           1  
3  [{'answer_confidence': 'yes', 'answer': 't36m'...       other           1  
4  [{'answer_confidence': 'yes', 'answer': 'solar...       other           

In [4]:
df.dtypes

image          object
question       object
answers        object
answer_type    object
answerable      int64
dtype: object
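
As the output above shows, each entry of the `answers` column is a list of dictionaries with the keys `answer_confidence` and `answer`. The sketch below counts the answer strings of a single record. The record is invented for illustration (only its keys mirror the annotation format), and `answer_counts` is a hypothetical helper.

```python
from collections import Counter

# Invented record for illustration; only the keys mirror the annotation format
sample_answers = [
    {"answer_confidence": "yes", "answer": "soda"},
    {"answer_confidence": "yes", "answer": "soda"},
    {"answer_confidence": "maybe", "answer": "coke"},
    {"answer_confidence": "yes", "answer": "soda"},
    {"answer_confidence": "no", "answer": "unanswerable"},
]

def answer_counts(answers):
    """Count how often each answer string occurs in one record."""
    return Counter(entry["answer"] for entry in answers)

counts = answer_counts(sample_answers)
top_answer, top_votes = counts.most_common(1)[0]
print(top_answer, top_votes)  # soda 3
```

With a dataframe loaded as above, the same helper could be applied row by row, for example `df['answers'].apply(answer_counts)`.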

In [6]:
print(df)

                           image  \
0      VizWiz_train_00000000.jpg   
1      VizWiz_train_00000001.jpg   
2      VizWiz_train_00000002.jpg   
3      VizWiz_train_00000003.jpg   
4      VizWiz_train_00000004.jpg   
...                          ...   
20518  VizWiz_train_00023949.jpg   
20519  VizWiz_train_00023950.jpg   
20520  VizWiz_train_00023951.jpg   
20521  VizWiz_train_00023952.jpg   
20522  VizWiz_train_00023953.jpg   

                                                question  \
0                       What's the name of this product?   
1            Can you tell me what is in this can please?   
2      Is this enchilada sauce or is this tomatoes?  ...   
3                What is the captcha on this screenshot?   
4                                     What is this item?   
...                                                  ...   
20518                  What's the color for this laptop?   
20519  (inaudible) can you see it? If so, then tell m...   
20520         What are the 

In [None]:
def read_dataframe(path):
    """
    Reads the JSON file and returns a dataframe with available columns
    (image, question, answers, answer_type, answerable if exist)

    Parameters:
        path (str): Path to the JSON file

    Returns:
        df (pandas.DataFrame): Dataframe with the available columns
    """
    df = pd.read_json(path)

    cols = ['image', 'question', 'answers', 'answer_type', 'answerable']

    existing_cols = [c for c in cols if c in df.columns]
    df = df[existing_cols]

    return df

def split_train_test(dataframe, test_size = 0.05):
    """
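    Splits the dataframe into train and test sets, stratified on the
    answer_type and answerable columns

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be split
        test_size (float): Fraction of rows placed in the test set (default 0.05)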

    Returns:
        train (pandas.DataFrame): Train set
        test (pandas.DataFrame): Test set
    """
    train, test = train_test_split(dataframe, test_size=test_size, random_state=42, stratify=dataframe[['answer_type', 'answerable']])
    return train, test

def plot_histogram(dataframe, column):
    """
    Plots the histogram of the given column

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be plotted
        column (str): Column to be plotted
    
    Returns:
        None
    """
    plt.hist(dataframe[column])
    plt.title(column)
    plt.show()

def plot_pie(dataframe, column):
    """
    Plots the pie chart of the given column

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be plotted
        column (str): Column to be plotted
    
    Returns:
        None
    """
    value_counts = dataframe[column].value_counts()
    plt.pie(value_counts, labels=value_counts.index, autopct='%1.1f%%')
    plt.title(column)
    plt.show()

def plot_wordcloud(dataframe, column):
    """
    Plots the wordcloud of the given column

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be plotted
        column (str): Column to be plotted

    Returns:
        None
    """
    text = " ".join([word for word in dataframe[column]])

    wordcloud = WordCloud(width = 800, height = 800, 
                    background_color ='white', 
                    min_font_size = 10).generate(text) 
    
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.show()

def explore_dataframe(dataframe):
    """
    Explores the dataframe (EDA) by plotting the pie charts, histograms and wordclouds of the columns

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be explored

    Returns:
        None
    """
    plot_pie(dataframe, 'answer_type')
    plot_pie(dataframe, 'answerable')
    plot_histogram(dataframe, 'answerable')
    plot_wordcloud(dataframe, 'question')
    
def get_number_of_distinct_answers(dataframe):
    """
    Returns the number of distinct answers in the dataframe

    Parameters:
        dataframe (pandas.DataFrame): Dataframe to be explored

    Returns:
        len(unique_answers_set) (int): Number of distinct answers in the dataframe
    """
    unique_answers_set = set()
    for row in dataframe['answers']:
        for answer_map in row:
            unique_answers_set.add(answer_map['answer'])
    return len(unique_answers_set)

def process_images(dataframe, image_path, clip_model, preprocessor, device):
    """
    Processes the images in the dataframe and returns the image features

    Parameters:
        dataframe (pandas.DataFrame): Dataframe containing the images
        image_path (str): Path to the input images
        clip_model (clip.model.CLIP): CLIP model
        preprocessor (callable): Image transform returned by clip.load alongside the model
        device (torch.device): Device to be used for processing
    
    Returns:
        images (list): List of image features
    """
    images = []
    # Inference only: disable autograd so no computation graph is kept per image
    with torch.no_grad():
        for _, row in dataframe.iterrows():
            full_path = os.path.join(image_path, row['image'])
            image = Image.open(full_path)
            image = preprocessor(image).unsqueeze(0).to(device)
            image_features = clip_model.encode_image(image)
            image_features = torch.flatten(image_features, start_dim=1)
            images.append(image_features)
    return images

def process_questions(dataframe, clip_model, device):
    """
    Processes the questions in the dataframe and returns the question features

    Parameters:
        dataframe (pandas.DataFrame): Dataframe containing the questions
        clip_model (clip.model.CLIP): CLIP model
        device (torch.device): Device to be used for processing

    Returns:
        questions (list): List of question features
    """
    questions = []
    # Inference only: disable autograd so no computation graph is kept per question
    with torch.no_grad():
        for _, row in dataframe.iterrows():
            question = row['question']
            question = clip.tokenize(question).to(device)
            text_features = clip_model.encode_text(question).float()
            text_features = torch.flatten(text_features, start_dim=1)
            questions.append(text_features)
    return questions
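
`process_images` and `process_questions` encode one row per call. A common variation, not used in this notebook, is to group rows into batches before encoding. The sketch below shows only the grouping step: `batched` is a hypothetical helper, the file names are placeholders, and the CLIP call is deliberately left out.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch

image_names = [f"image_{i}.jpg" for i in range(5)]  # placeholder file names
for group in batched(image_names, batch_size=2):
    print(group)
```

Each group could then be preprocessed, stacked into one tensor, and encoded in a single forward pass.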

## <a id='toc1_5_'></a>[Creating Dataframes & Splitting](#toc0_)

Using the functions defined above, we create the dataframes and split them into train and test sets.

In [15]:
train_df = read_dataframe(ANNOTATIONS_TRAIN_PATH)
validation_df = read_dataframe(ANNOTATIONS_VAL_PATH)
test_df = read_dataframe(ANNOTATIONS_TEST_PATH)
ANSWER_SPACE = get_number_of_distinct_answers(train_df) # The answer space will be decreased later when we process the answers
print("Number of distinct answers: ", ANSWER_SPACE)



KeyError: "['answers', 'answer_type', 'answerable'] not in index"
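
`split_train_test` stratifies on `answer_type` and `answerable`, so each combination of the two keeps roughly the same share in both sets. The pure-Python sketch below illustrates that idea on invented rows. `stratified_split` is a hypothetical stand-in written for this example, not scikit-learn's algorithm.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_size=0.05, seed=42):
    """Split rows so every group defined by key(row) contributes
    about test_size of its members to the test set."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for members in groups.values():
        members = members[:]  # copy so the caller's list is not shuffled
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Invented rows: 80 answerable and 20 unanswerable questions
rows = [{"answer_type": "other", "answerable": 1} for _ in range(80)]
rows += [{"answer_type": "unanswerable", "answerable": 0} for _ in range(20)]

train, test = stratified_split(
    rows, key=lambda r: (r["answer_type"], r["answerable"]), test_size=0.1
)
print(len(train), len(test))  # 90 10
```

The 80/20 ratio between the two groups is the same in `train` and `test`, which is the property the `stratify` argument is meant to provide.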

## <a id='toc1_6_'></a>[Exploratory Data Analysis](#toc0_)

Shape: (19496, 5)

Info:
<class 'pandas.core.frame.DataFrame'>
Index: 19496 entries, 14709 to 8401
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   image        19496 non-null  object
 1   question     19496 non-null  object
 2   answers      19496 non-null  object
 3   answer_type  19496 non-null  object
 4   answerable   19496 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 913.9+ KB
None

Missing values:
 image          0
question       0
answers        0
answer_type    0
answerable     0
dtype: int64


TypeError: unhashable type: 'list'
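
A `TypeError: unhashable type: 'list'` arises whenever Python is asked to hash a list, for example when list values are placed in a set or used as dictionary keys. Assuming the failing step touched the `answers` column, which holds lists, one possible workaround is to convert each list into a tuple of answer strings first. The rows below are invented and `to_hashable` is a hypothetical helper.

```python
# Invented rows; each value is a list of dicts, like the annotation format
rows = [
    [{"answer": "soda"}, {"answer": "coke"}],
    [{"answer": "soda"}, {"answer": "coke"}],
    [{"answer": "basil"}],
]

try:
    set(rows)  # lists cannot be hashed, so this raises TypeError
except TypeError as error:
    message = str(error)
    print(message)

def to_hashable(answers):
    """Reduce one answers list to a tuple of its answer strings."""
    return tuple(entry["answer"] for entry in answers)

unique_rows = {to_hashable(answers) for answers in rows}
print(len(unique_rows))  # 2
```

On a dataframe, the equivalent step would be applying such a conversion to the `answers` column before any operation that relies on hashing, or excluding that column from the operation.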