# <a id='toc1_'></a>[Visual Question Answering](#toc0_)

Visual Question Answering (VQA) is the task of answering open-ended questions about an image. Its applications include medical imaging, education, and surveillance, among others. This project uses the [VizWiz](https://vizwiz.org/tasks-and-datasets/vqa/) dataset, which was constructed to train models that help visually impaired people. In the words of the VizWiz creators: “we introduce the visual question answering (VQA) dataset coming from this population, which we call VizWiz-VQA. It originates from a natural visual question answering setting where blind people each took an image and recorded a spoken question about it, together with 10 crowdsourced answers per visual question.”



<p align="center">
  <img src="Latex Paper/graphics/chapter1/vizwiz_example.png" alt="vizwiz_example" width="500"/>
</p>

- **Note:** This repository implements the paper [Less is More: Linear Layers on CLIP Features as Powerful VizWiz Model](https://arxiv.org/abs/2206.05281).
- If time permits, reading OpenAI's [CLIP](https://openai.com/blog/clip/) paper before going through this repository is strongly recommended.

## <a id='toc1_1_'></a>[Importing Required Libraries](#toc0_)

In [8]:
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import clip
from torch.utils.data import Dataset, DataLoader
import numpy as np
import seaborn as sns
import json
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

## <a id='toc1_3_'></a>[Configuring the Notebook](#toc0_)

In [13]:
# Configuring the paths for the dataset
INPUT_PATH = r'E:\HK1  2025-2026\Đồ án CS420\VQA\Dataset\VizWiz-'  # raw string so backslashes are not read as escape sequences
ANNOTATIONS = INPUT_PATH + '/Annotations'
TRAIN_PATH = INPUT_PATH + '/train/train'
VALIDATION_PATH = INPUT_PATH + '/val/val'
ANNOTATIONS_TRAIN_PATH = ANNOTATIONS + '/train.json'
ANNOTATIONS_VAL_PATH = ANNOTATIONS + '/val.json'
OUTPUT_PATH = r'E:\HK1  2025-2026\Đồ án CS420\VQA\Output'  # raw string so backslashes are not read as escape sequences
ANSWER_SPACE = 0 # Will be configured later when we build the vocab using the methodology described in the paper
MODEL_NAME = "ViT-L/14@336px" # This is the backbone of the CLIP model

# Using accelerated computing if available
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))

CUDA available: True
Device: cuda
GPU Name: NVIDIA GeForce RTX 3060
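
The path configuration above joins strings by hand, which mixes `\` and `/` separators. The same layout can be expressed with `pathlib`. This is a minimal sketch, assuming the folder structure from the configuration cell (`Annotations`, `train/train`, `val/val`). `build_paths` and `DATA_ROOT` are hypothetical names, and the root used here is a placeholder rather than the actual dataset location.

```python
from pathlib import Path


def build_paths(data_root):
    """Return the dataset paths used in this notebook, relative to data_root.

    The folder layout mirrors the configuration cell; build_paths itself is a
    hypothetical helper and not part of the original notebook.
    """
    root = Path(data_root)
    annotations = root / "Annotations"
    return {
        "train_images": root / "train" / "train",
        "val_images": root / "val" / "val",
        "train_annotations": annotations / "train.json",
        "val_annotations": annotations / "val.json",
    }


DATA_ROOT = "data/vizwiz"  # placeholder root, not the real dataset path
paths = build_paths(DATA_ROOT)
print(paths["train_annotations"].as_posix())  # → data/vizwiz/Annotations/train.json
```

`Path` objects can be passed directly to `open()`, so the rest of the notebook would not need to change if this approach were adopted.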


## <a id='toc1_4_'></a>[Processing Data](#toc0_)

In [15]:
import json
import pandas as pd

# Read the training annotations from the JSON file
with open(ANNOTATIONS_TRAIN_PATH, 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data)

print(df.head())


                       image  \
0  VizWiz_train_00000000.jpg   
1  VizWiz_train_00000001.jpg   
2  VizWiz_train_00000002.jpg   
3  VizWiz_train_00000003.jpg   
4  VizWiz_train_00000004.jpg   

                                            question  \
0                   What's the name of this product?   
1        Can you tell me what is in this can please?   
2  Is this enchilada sauce or is this tomatoes?  ...   
3            What is the captcha on this screenshot?   
4                                 What is this item?   

                                             answers answer_type  answerable  
0  [{'answer_confidence': 'yes', 'answer': 'basil...       other           1  
1  [{'answer_confidence': 'yes', 'answer': 'soda'...       other           1  
2  [{'answer_confidence': 'yes', 'answer': 'these...       other           1  
3  [{'answer_confidence': 'yes', 'answer': 't36m'...       other           1  
4  [{'answer_confidence': 'yes', 'answer': 'solar...       other           
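
As the preview above shows, each `answers` cell holds a list of dictionaries with `answer` and `answer_confidence` keys. One simple way to collapse the crowdsourced answers into a single label is a majority vote. The sketch below is illustrative only: the example record is made up, `most_common_answer` is a hypothetical helper, and this is not necessarily the answer-selection rule used in the paper.

```python
from collections import Counter


def most_common_answer(answers):
    """Return the most frequent answer string among the crowdsourced answers.

    Ties go to the answer that appears first in the list.
    """
    counts = Counter(entry["answer"] for entry in answers)
    return counts.most_common(1)[0][0]


# Made-up record in the same shape as one `answers` cell
example_answers = [
    {"answer_confidence": "yes", "answer": "soda"},
    {"answer_confidence": "yes", "answer": "soda"},
    {"answer_confidence": "maybe", "answer": "coke"},
    {"answer_confidence": "yes", "answer": "soda can"},
]

print(most_common_answer(example_answers))  # → soda
```

Assuming `df` is loaded as above, the helper could be applied to the whole column with `df["answers"].apply(most_common_answer)`.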

In [4]:
df.dtypes

question_type             object
multiple_choice_answer    object
answers                   object
image_id                   int64
answer_type               object
question_id                int64
dtype: object

In [None]:
print(df["answer_type"].unique())

['other' 'yes/no' 'number']
