#Pytorch Toy Project 2

How to make a CUSTOM Dataset (by imjjun, KUBIG 16th)



---


Building a network with custom layers is important, but handling the dataset well is just as important if we want to use pre-built models. We can then apply those models in contests and projects.

In this notebook, we will learn PyTorch's Dataset & DataLoader and prepare a dataset from a past contest.


#DataModule


*This notebook is based on the official tutorial in the PyTorch docs.

Our goal is to build a 'Dataset iterator' that feeds data to the model. PyTorch provides two basic classes in torch.utils.data for building our own customized dataset: ***Dataset*** (loads the data) & ***DataLoader*** (makes it iterable).

The tutorial example is simple and image-related, so afterwards we will build another customized Dataset for NLP (Natural Language Processing)!

In [None]:
#Let's load a built-in Dataset: FashionMNIST

import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


training_data = datasets.FashionMNIST(
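    root="data", #path where the data is stored
    train=True,  #train=True: training data / train=False: test (inference) data
    download=True, #download=True downloads the data; can be set to False if it is already downloaded
    transform=ToTensor() #transform: image transformation (usually conversion to Tensor or normalization)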
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

In [None]:
print(type(training_data))
print(type(test_data))

In [None]:
#We can index the dataset like a list!

labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx] #Indexing the dataset like a list returns two things: data & label
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray") #squeeze: remove dimensions of size 1
plt.show()

##Dataset

*Prerequisite: inheritance in OOP (Object-Oriented Programming)

*Please refer to this article if you want to know more: https://compmath.korea.ac.kr/oop/Inheritance.html

A custom Dataset class implements three methods: \__init__(), \__len__() and \__getitem__(). The class inherits from torch.utils.data.Dataset.

- \__init__(): initializes the class. We load the data (or the information needed to locate it) here.

- \__len__(): returns the number of samples. This is needed to compute batch indices, etc.

- \__getitem__(): returns the sample the model needs for a given index. For example, an image with its label, a sentence with its label, or a real image with its target image.
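
To make the three methods concrete before the image example, here is a minimal sketch of an in-memory Dataset. The class name `SquaresDataset` and its data are hypothetical and chosen only for illustration; the only PyTorch API assumed is `torch.utils.data.Dataset`.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared)."""

    def __init__(self, n):
        # Build (or load) everything the dataset needs once, up front.
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.x)

    def __getitem__(self, idx):
        # Return one (input, target) pair for the given index.
        return self.x[idx], self.y[idx]


toy = SquaresDataset(5)
print(len(toy))
sample_x, sample_y = toy[3]
print(sample_x.item(), sample_y.item())
```

Because `__len__` and `__getitem__` are defined, the object supports `len(toy)` and `toy[3]` just like a list.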

In [None]:
import os
import pandas as pd
from torchvision.io import read_image


#For FashionMNIST Data, images are stored in 'img_dir' & labels are stored in the CSV file 'annotations_file'
#CSV files are commonly loaded with the python library 'pandas'
#Images are read with torchvision's built-in 'read_image', which is similar to opencv's 'imread'


class CustomImageDataset(Dataset): #Inheritance!!

    #initialize our dataset and store the inputs as attributes on self

    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file, names=['file_name', 'label']) #read annotation file
        self.img_dir = img_dir #image directory
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):  #methods in a class always take self
        return len(self.img_labels) #return the number of samples; usually the length of the labels, since it's simpler

    def __getitem__(self, idx): #idx: index of the requested sample
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

##DataLoader


Fortunately, PyTorch provides a basic data-loading class, "DataLoader"!

If we define our dataset as a PyTorch Dataset class, we can wrap it simply with DataLoader from torch.utils.data. We can step through it manually with **'iter()'** and 'next()', but in practice a DataLoader is usually consumed with a for loop.

There are some arguments you have to choose:

- **batch_size**: Choose it considering your domain, hardware (especially GPU memory) etc. Larger batches usually give higher throughput, but they do not always give better model accuracy.

- **shuffle**: Usually True for the train dataset & False for the test dataset

- **pin_memory**: If True, batches are placed in pinned (page-locked) host memory, which speeds up the copy from DRAM to VRAM (DRAM is what we usually call RAM & VRAM is the RAM of the GPU)

- **num_workers**: the number of subprocesses used for data loading [ a common rule of thumb is 4 * (the # of GPUs) ]
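
The arguments above can be tried on a small example. This is a minimal sketch that assumes hypothetical toy tensors wrapped in `TensorDataset`; shuffling is switched off only so that the printed shapes are predictable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

loader = DataLoader(
    toy_dataset,
    batch_size=4,      # samples per batch; the last batch may be smaller
    shuffle=False,     # fixed order here; usually True for training data
    num_workers=0,     # 0 = load in the main process; raise it for heavy I/O
    pin_memory=torch.cuda.is_available(),  # only helps when copying to a GPU
)

# Iterate with a for loop (here a comprehension) and collect the batch shapes.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(len(loader))     # number of batches
print(batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches and one final batch of 2; passing `drop_last=True` would discard that smaller batch.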

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=False)

In [None]:
print(type(train_dataloader))
print(type(test_dataloader))

In [None]:
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")

#Data Link



https://dacon.io/competitions/official/235747/overview/description

We will participate in this competition next month :)


## 1) Download the Dataset

In [None]:
!gdown 1pg-Q42ybABcXaoInyF-QRqdr8tRx3A-R

Downloading...
From: https://drive.google.com/uc?id=1pg-Q42ybABcXaoInyF-QRqdr8tRx3A-R
To: /content/open.zip


In [None]:
!unzip /content/open.zip

Archive:  /content/open.zip
  inflating: sample_submission.csv   
  inflating: test_data.csv           
  inflating: topic_dict.csv          
  inflating: train_data.csv          


## 2) Install one of the NLP packages, Transformers

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)

In [None]:
from transformers import AutoTokenizer #AutoTokenizer converts sentences into token ids, the inputs of the model
import torch

#tokenizer=AutoTokenizer.from_pretrained("klue/roberta-large")
#y=tokenizer(text, return_tensors='pt', truncation=True, max_length=20, padding='max_length', add_special_tokens=True)

#https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizer <- You can refer to it

#Access the result like a dict

#input_id=y['input_ids']

#attention_mask=y['attention_mask']

"""Inputs:
    - text -> sentence
    - return_tensors -> 'pt': pytorch, 'np': numpy etc
    - truncation -> Allow sentences longer than max_length to be cut off
    - max_length -> Maximum number of tokens per sentence
    - padding -> 'max_length' pads every sentence up to max_length (replaces the deprecated pad_to_max_length=True)
    - add_special_tokens -> Add special tokens related to pretrained model(BERT, RoBERTa etc)
"""

"""Outputs:
    - input_ids -> Token ids of the tokenized input
    - attention_mask -> Separates real tokens (1) from padding tokens (0)
"""
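
To see what truncation, padding and the attention mask mean without downloading a model, here is a hand-rolled sketch. It is NOT the transformers implementation: the `encode` function, the toy vocabulary and the padding id 0 are all assumptions made for illustration, and special tokens are left out.

```python
def encode(tokens, vocab, max_length, pad_id=0):
    """Toy stand-in for a tokenizer call; not the transformers implementation."""
    ids = [vocab[t] for t in tokens][:max_length]   # truncation: keep at most max_length ids
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,               # padding up to max_length
        "attention_mask": [1] * len(ids) + [0] * n_pad,    # 1 = real token, 0 = padding
    }


# Hypothetical vocabulary mapping each word to an integer id.
vocab = {"i": 1, "like": 2, "pytorch": 3, "very": 4, "much": 5}

short = encode(["i", "like", "pytorch"], vocab, max_length=4)
long = encode(["i", "like", "pytorch", "very", "much"], vocab, max_length=4)
print(short)  # three real tokens followed by one padding position
print(long)   # five tokens truncated to four, no padding
```

Both results have the same length, which is what allows sentences of different lengths to be stacked into one batch tensor.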

## 3) What does the dataset look like?


Output: Show the csv file by any method (pandas, numpy etc). Please print it out in this ipynb.

In [None]:
#Please print out the dataset by any method


Unnamed: 0,index,title,topic_idx
0,0,인천→핀란드 항공기 결항…휴가철 여행객 분통,4
1,1,실리콘밸리 넘어서겠다…구글 15조원 들여 美전역 거점화,4
2,2,이란 외무 긴장완화 해결책은 미국이 경제전쟁 멈추는 것,4
3,3,NYT 클린턴 측근韓기업 특수관계 조명…공과 사 맞물려종합,4
4,4,시진핑 트럼프에 중미 무역협상 조속 타결 희망,4
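
One possible way to produce a printout like the one above is with pandas. In this sketch the first two rows of that output are embedded as a string so the snippet runs without the file; in the notebook you would presumably pass the path of the unzipped train_data.csv to `pd.read_csv` instead, and treating the first column as the row index is an assumption.

```python
import io

import pandas as pd

# The first rows of the output shown above, embedded so the snippet runs
# without the file. In the notebook, pass the CSV path instead.
sample_csv = io.StringIO(
    "Unnamed: 0,index,title,topic_idx\n"
    "0,0,인천→핀란드 항공기 결항…휴가철 여행객 분통,4\n"
    "1,1,실리콘밸리 넘어서겠다…구글 15조원 들여 美전역 거점화,4\n"
)

df = pd.read_csv(sample_csv, index_col=0)  # assumption: first column is the row index
print(df.head())                           # first rows of the table
print(df["topic_idx"].value_counts())      # number of titles per topic
```

`head()` gives a quick look at the columns, and `value_counts()` shows how the class labels are distributed.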


## 4) Make your Dataset!

Make only the **'train set'**, not the test set

In [None]:
# Define your function or code to use the given dataset as a pytorch Dataset !
# You can refer to the code sharing tab of the dacon homepage above :)

from torch.utils.data import Dataset, DataLoader

class NLPDataset_train(Dataset):

    """ Dataset Implementation
        Load and prepare the data in __init__,
        and return your tokenized inputs through __getitem__ (the outputs listed below are returned by this method).
        Don't forget the __len__!

        Inputs:
        - csv file which contains ['title', 'topic_idx'] (sentence, class index)
        - AutoTokenizer of Transformers -> converts sentences into token ids

        Outputs:
        - input_ids: token ids of the given sentence
        - attention_mask: 1 for real tokens, 0 for padding tokens
        - label: the category of the given sentence

    """
    def __init__(self, csv):
      # TODO: read the csv and prepare the tokenizer

      return None

    def __len__(self):
      # TODO: return the number of samples
      return None


    def __getitem__(self, idx): # idx: index of the requested sample
      # TODO: return (input_ids, attention_mask, label) for sample idx
      return None
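
For reference, here is one possible shape of such a Dataset, written as a sketch and not as the intended solution. The names `TitleTopicDataset` and `toy_tokenizer` are hypothetical; the toy tokenizer only mimics the `input_ids` / `attention_mask` output format so that the example runs without downloading a pretrained model, and the small DataFrame stands in for the real csv.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


def toy_tokenizer(text, max_length=8):
    """Hypothetical stand-in for AutoTokenizer: one id per whitespace token."""
    ids = [(sum(map(ord, tok)) % 1000) + 1 for tok in text.split()][:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": torch.tensor(ids + [0] * n_pad),                  # 0 = padding id
        "attention_mask": torch.tensor([1] * len(ids) + [0] * n_pad),  # 1 = real token
    }


class TitleTopicDataset(Dataset):
    """Sketch: wraps a DataFrame with 'title' and 'topic_idx' columns."""

    def __init__(self, frame, tokenizer):
        self.titles = frame["title"].tolist()
        self.labels = frame["topic_idx"].tolist()
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        encoded = self.tokenizer(self.titles[idx])
        label = torch.tensor(self.labels[idx])
        return encoded["input_ids"], encoded["attention_mask"], label


# Hypothetical stand-in rows; the real data would come from the csv file.
frame = pd.DataFrame({
    "title": ["a b c", "d e", "f g h i", "j"],
    "topic_idx": [4, 4, 2, 0],
})
sketch_dataset = TitleTopicDataset(frame, toy_tokenizer)
sketch_loader = DataLoader(sketch_dataset, batch_size=2, shuffle=False)
input_ids, attention_mask, labels = next(iter(sketch_loader))
print(input_ids.shape, attention_mask.shape, labels)
```

With a real Hugging Face tokenizer called with `return_tensors='pt'`, each returned tensor has a leading dimension of size 1, so it would likely need a `squeeze(0)` before being returned from `__getitem__`.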




## 5) See what components are iterated!

In [None]:
#Loading your dataset
train_dataset = NLPDataset_train("train_data.csv") #pass an instance, not the class itself
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

#Take one batch; this assumes __getitem__ returns (input_ids, attention_mask, label)
input_ids, attention_mask, labels = next(iter(train_loader))

#Print out train batch !
print(f"Feature batch: {input_ids, attention_mask}")
print(f"Feature batch size: {input_ids.shape, attention_mask.shape}")
print(f"Labels batch & size: {labels, labels.shape}")