# <a id='toc1_'></a>[Visual Question Answering](#toc0_)

Visual Question Answering (VQA) is the task of answering open-ended questions about an image. It has applications in medical imaging, education, surveillance, and other domains. In this project we use the VQA v2 dataset, which was constructed to balance “yes/no” answers and reduce the language priors that appeared in the first version of VQA.

In the words of the creators of VQA v2:
“Compared to VQA v1.0, the VQA v2.0 dataset is more balanced. For every question, there exist complementary images such that the answer to the question is different, which helps in reducing question–answer biases and encourages models to rely more on visual understanding.”

<p align="center"> <img src="Latex Paper\graphics\chapter1\minhhoa.png" alt="vqa_v2_example" width="500"/> </p>

## <a id='toc1_2_'></a>[Importing Libraries](#toc0_)

In [2]:
# Importing os, numpy and pandas for data manipulation
import os
import numpy as np 
import pandas as pd

# For data visualization, we will use matplotlib, wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# For data preprocessing, we will use Counter, train_test_split, Levenshtein distance, Python Image Library and OneHotEncoder
from collections import Counter
import Levenshtein as lev
from PIL import Image
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# For saving and loading the preprocessed data, we will use pickle
import pickle

# For building the model, we will use PyTorch and CLIP
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import clip
from torch.utils.data import Dataset, DataLoader

# For taking the image from the URL, we will use requests
import requests

# For evaluation, we will need sklearn.metrics.average_precision_score
from sklearn.metrics import average_precision_score

# Importing json for results formatting which will be uploaded for evaluation
import json

## <a id='toc1_3_'></a>[Configuring the Notebook](#toc0_)

In [14]:

INPUT_PATH = r"E:\HK1  2025-2026\Đồ án CS420\VQA\Dataset\VQA V2"
OUTPUT_PATH = r"E:\HK1  2025-2026\Đồ án CS420\VQA\Output"

TRAIN_PATH = os.path.join(INPUT_PATH, "train2014", "train2014")
VALIDATION_PATH = os.path.join(INPUT_PATH, "val2014", "val2014")
ANNOTATIONS_TRAIN_PATH = os.path.join(INPUT_PATH, "v2_Annotations_Train_mscoco", "v2_mscoco_train2014_annotations.json")
ANNOTATIONS_VAL_PATH = os.path.join(INPUT_PATH, "v2_Annotations_Val_mscoco", "v2_mscoco_val2014_annotations.json")
QUESTIONS_TRAIN_PATH =  os.path.join(INPUT_PATH, "v2_Questions_Train_mscoco", "v2_OpenEnded_mscoco_train2014_questions.json")
QUESTIONS_VAL_PATH =  os.path.join(INPUT_PATH, "v2_Questions_Val_mscoco", "v2_OpenEnded_mscoco_val2014_questions.json")


ANSWER_SPACE = 0
MODEL_NAME = "ViT-L/14@336px"

# Check CUDA
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))


CUDA available: True
Device: cuda
GPU Name: NVIDIA GeForce RTX 3060
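
The paths configured above are specific to one machine, so it can help to confirm that they exist before loading any data. The following is a minimal sketch and not part of the original notebook: `find_missing_paths` is a hypothetical helper, and `demo_paths` is a stand-in for a dictionary of the notebook's constants.

```python
import os

def find_missing_paths(paths):
    """Return the names whose configured path does not exist on disk."""
    return [name for name, path in paths.items() if not os.path.exists(path)]

# Stand-in values for illustration; in the notebook this dictionary would
# hold TRAIN_PATH, VALIDATION_PATH, ANNOTATIONS_TRAIN_PATH, and so on.
demo_paths = {
    "CURRENT_DIR": os.getcwd(),
    "NOWHERE": os.path.join(os.getcwd(), "no_such_vqa_directory_for_demo"),
}

missing = find_missing_paths(demo_paths)
print("Missing paths:", missing)
```

Running a check like this right after the configuration cell surfaces a mistyped file name immediately, rather than as a `FileNotFoundError` later in the notebook.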


## <a id='toc1_6_'></a>[Exploratory Data Analysis](#toc0_)

### <a id='toc1_6_1_'></a>[Load Dataframe](#toc0_)

In [15]:

# Load annotations
with open(ANNOTATIONS_TRAIN_PATH, "r") as f:
    ann_data = json.load(f)

annotations = ann_data["annotations"]
df_ann = pd.DataFrame(annotations)

# Load questions
with open(QUESTIONS_TRAIN_PATH, "r") as f:
    q_data = json.load(f)

questions = q_data["questions"]
df_q = pd.DataFrame(questions)

print(df_ann.head())
print(df_q.head())


       question_type multiple_choice_answer  \
0       what is this                    net   
1               what                pitcher   
2  what color is the                 orange   
3            is this                    yes   
4  what color is the                  white   

                                             answers  image_id answer_type  \
0  [{'answer': 'net', 'answer_confidence': 'maybe...    458752       other   
1  [{'answer': 'pitcher', 'answer_confidence': 'y...    458752       other   
2  [{'answer': 'orange', 'answer_confidence': 'ye...    458752       other   
3  [{'answer': 'yes', 'answer_confidence': 'yes',...    458752      yes/no   
4  [{'answer': 'white', 'answer_confidence': 'yes...    262146       other   

   question_id  
0    458752000  
1    458752001  
2    458752002  
3    458752003  
4    262146000  
   image_id                                     question  question_id
0    458752    What is this photo taken looking through?    458752000
1    4
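
The printed `answers` column holds, for each question, a list of dictionaries with at least an `answer` and an `answer_confidence` field. As a rough sketch of how that nested structure could be summarised, the snippet below counts how often each answer string occurs in one such list. The helper name `answer_votes` and the toy records are assumptions made for illustration; they are not taken from the dataset.

```python
from collections import Counter

def answer_votes(answers):
    """Count how many entries in one `answers` list give each answer string."""
    return Counter(entry["answer"] for entry in answers)

# Toy records shaped like the entries shown in the output above (hypothetical values).
toy_answers = [
    {"answer": "net", "answer_confidence": "maybe"},
    {"answer": "net", "answer_confidence": "yes"},
    {"answer": "fence", "answer_confidence": "yes"},
]

votes = answer_votes(toy_answers)
print(votes.most_common(1))  # [('net', 2)]
```

On the real data the same function could be applied per row, for example with `df_ann["answers"].apply(answer_votes)`.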


### <a id='toc1_6_1_'></a>[Merge Dataset (Question + Annotation)](#toc0_)
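
The merge in this section joins questions to annotations on both `question_id` and `image_id`. The sketch below illustrates the same join on toy data, under the assumption that each question has exactly one annotation; the `validate="one_to_one"` argument and the row-count check are additions for illustration and are not used in the notebook's own cell.

```python
import pandas as pd

# Toy frames mirroring the columns of df_q and df_ann (values abbreviated).
toy_q = pd.DataFrame({
    "image_id": [458752, 458752, 262146],
    "question": ["What is this?", "What position?", "What color is the snow?"],
    "question_id": [458752000, 458752001, 262146000],
})
toy_ann = pd.DataFrame({
    "image_id": [458752, 458752, 262146],
    "question_id": [458752000, 458752001, 262146000],
    "multiple_choice_answer": ["net", "pitcher", "white"],
})

# validate="one_to_one" makes pandas raise MergeError if a key pair is
# duplicated on either side, instead of silently multiplying rows.
merged = pd.merge(
    toy_q, toy_ann,
    on=["question_id", "image_id"],
    how="inner",
    validate="one_to_one",
)

# An inner join drops unmatched questions, so compare row counts explicitly.
assert len(merged) == len(toy_q)
print(merged)
```

If the row counts differ on real data, some questions have no matching annotation, which usually points to a wrong file path or mismatched splits.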

In [16]:
df = pd.merge(df_q, df_ann, on=["question_id", "image_id"])
print(df.head())

   image_id                                     question  question_id  \
0    458752    What is this photo taken looking through?    458752000   
1    458752           What position is this man playing?    458752001   
2    458752             What color is the players shirt?    458752002   
3    458752  Is this man a professional baseball player?    458752003   
4    262146                      What color is the snow?    262146000   

       question_type multiple_choice_answer  \
0       what is this                    net   
1               what                pitcher   
2  what color is the                 orange   
3            is this                    yes   
4  what color is the                  white   

                                             answers answer_type  
0  [{'answer': 'net', 'answer_confidence': 'maybe...       other  
1  [{'answer': 'pitcher', 'answer_confidence': 'y...       other  
2  [{'answer': 'orange', 'answer_confidence': 'ye...       other  
3  [{'answ