# Download datasets

## CRDDC2022 Dataset

If the two cells below do not work, you can download the CRDDC2022 dataset manually from the following links and extract each archive into `./CRDD2022_all_countries/`:

1- Japan dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_Japan.zip), then extract it.

2- India dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_India.zip), then extract it.

3- Czech dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_Czech.zip), then extract it.

4- Norway dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_Norway.zip), then extract it.

5- United States dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_United_States.zip), then extract it.

6- China MotorBike dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_China_MotorBike.zip), then extract it.

7- China Drone dataset from [here](https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_China_Drone.zip), then extract it.

In [None]:
CRDDC2022 = {'Japan': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_Japan.zip',
            'India': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_India.zip',
            'Czech': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_Czech.zip',
            'Norway': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_Norway.zip',
            'United_States': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_United_States.zip',
            'China_MotorBike': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_China_MotorBike.zip',
            'China_Drone': 'https://bigdatacup.s3.ap-northeast-1.amazonaws.com/2022/CRDDC2022/RDD2022/Country_Specific_Data_CRDDC2022/RDD2022_China_Drone.zip'}

In [None]:
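import os
import urllib.request
from tqdm import tqdm

# urlretrieve does not create missing directories, so make the target folder first
os.makedirs('./CRDD2022_all_countries', exist_ok=True)
print('Downloading the CRDDC2022 Dataset...')
for country_name, url in tqdm(CRDDC2022.items()):
    urllib.request.urlretrieve(url, f'./CRDD2022_all_countries/{country_name}.zip')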
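
The download cell only saves the `.zip` files, while the later cells read from `./CRDD2022_all_countries/<country>/train/images`. A minimal sketch of the extraction step follows. The helper name `extract_archives` is hypothetical, and the code assumes each archive contains a top-level folder named after its country, which should be checked against the actual archives.

```python
import zipfile
from pathlib import Path

def extract_archives(archive_dir):
    """Extract every .zip found in archive_dir in place; return the archive names."""
    archive_dir = Path(archive_dir)
    extracted = []
    for archive in sorted(archive_dir.glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            # Assumption: the zip holds a top-level <country>/ folder
            zf.extractall(archive_dir)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_archives('./CRDD2022_all_countries')` after the download would unpack all seven archives next to the `.zip` files.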

# Restructure the datasets and convert the labels to the Yolov7 format

In [None]:
from xml.etree import ElementTree
from xml.dom import minidom
import collections
import os

import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

import torch
from IPython.display import Image  # for displaying images

print('torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

In [None]:
import os
import random
from shutil import copy

In [None]:
countries = ['Czech','India','Japan','Norway','United_States','China_MotorBike','China_Drone']
countriestest = ['Czech','India','Japan','Norway','United_States','China_MotorBike']

base_path = "./CRDD2022_all_countries/"
damageType_to_class = {"D00":0,"D10":1, "D20":2, "D40":3}
damageTypes = ['D00','D10','D20','D40']
class_dict = {'D00':0,'D10':0,'D20':0,'D40':0}

Since labels are only available for the train dataset, we have to split it into two sets: the train set and the validation set.
The split is random: each country's image list is shuffled, then cut according to `PROPORTION_TRAIN`.
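
As a sketch of the same shuffle-then-cut idea in one reusable function: the name `split_train_val` and the `seed` parameter are not part of the notebook. They are added here only so that the split can be reproduced between runs.

```python
import random

def split_train_val(files, proportion_train=0.9, seed=0):
    """Shuffle a copy of `files` with a fixed seed and cut it into (train, val)."""
    shuffled = list(files)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # local RNG, so the global state is not changed
    cut = int(proportion_train * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

With the default `proportion_train=0.9`, a list of 10 file names yields 9 training files and 1 validation file, and the same seed always gives the same partition.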

In [None]:
file_list = []
for country in countries :
    file_list_country = os.listdir('./CRDD2022_all_countries/{}/train/images'.format(country))
    random.shuffle(file_list_country)
    file_list.append(file_list_country)
    print("Number of images in "+country+" : "+str(len(file_list_country)))

In [None]:
file_list_test = []
for country in countriestest :
    file_list_country = os.listdir('./CRDD2022_all_countries/{}/test/images'.format(country))
    file_list_test.append(file_list_country)
    print("Number of images in "+country+" : "+str(len(file_list_country)))

In [None]:
# -p creates the parent directories and does not fail if a directory already exists
!mkdir -p RDD_Dataset_2022_Yolo/images/train
!mkdir -p RDD_Dataset_2022_Yolo/images/val
!mkdir -p RDD_Dataset_2022_Yolo/labels/train
!mkdir -p RDD_Dataset_2022_Yolo/labels/val

In [None]:
# -p avoids an error when RDD_Dataset_2022_Yolo/images already exists
!mkdir -p RDD_Dataset_2022_Yolo/images/test

In [None]:
PROPORTION_TRAIN = 0.9 # Proportion of the images used for training
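# Point to the RDD_Dataset_2022_Yolo folders created with mkdir, where the converted dataset is written
PATH_IMGS = "RDD_Dataset_2022_Yolo/images/"
PATH_LABELS = "RDD_Dataset_2022_Yolo/labels/"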

In [None]:
file_list_train = []
file_list_val = []
for i in range(len(countries)) :
    file_list_train.append(file_list[i][:int(PROPORTION_TRAIN*len(file_list[i]))])
    file_list_val.append(file_list[i][int(PROPORTION_TRAIN*len(file_list[i])):])

In [None]:
phases = ['train','val']
file_list_train_val = [file_list_train,file_list_val]
for (j,phase) in enumerate(phases) :
    file_list_phase = file_list_train_val[j]
    for (i,country) in enumerate(countries) :
        file_list_country = file_list_phase[i]

        ################################### FOR THE LABELS ###################################
        for file in file_list_country:
            file_name = os.path.splitext(file)[0]  # strip only the extension
            # ElementTree.parse accepts a path and closes the file itself
            tree = ElementTree.parse(base_path + country + '/train/annotations/xmls/' + file_name + '.xml')
            root = tree.getroot()
            file_txt = open(PATH_LABELS+phase+'/'+file_name+'.txt','w')

            for obj in root.iter('size'):
                img_height = int(obj.find('height').text)
                img_width = int(obj.find('width').text)

            nb_boxes_img = 0
            for obj in root.iter('object'):
                cls_name = obj.find('name').text
                if cls_name in damageTypes :  # skip objects of any other class
                    class_dict[cls_name]+=1
                    nb_boxes_img += 1
                    xmlbox = obj.find('bndbox')
                    xmin = int(round(float(xmlbox.find('xmin').text)))
                    xmax = int(round(float(xmlbox.find('xmax').text)))
                    ymin = int(round(float(xmlbox.find('ymin').text)))
                    ymax = int(round(float(xmlbox.find('ymax').text)))

                    x_center = 0.5*(xmin + xmax)
                    y_center = 0.5*(ymin + ymax)
                    width = xmax - xmin
                    height = ymax - ymin
                    
                    x_center, y_center, width, height = round(x_center/img_width,5), round(y_center/img_height,5), round(width/img_width,5), round(height/img_height,5)

                    class_number = damageType_to_class[cls_name]

                    file_txt.write(str(class_number)+' '+str(x_center)+' '+str(y_center)+' '+str(width)+' '+str(height)+'\n')
            file_txt.close()
            ################################ FOR THE IMAGES ########################################
            img_path = base_path + country + '/train/images/' + file
            phase_path = PATH_IMGS+phase+'/'
            copy(img_path,phase_path)
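
The box arithmetic in the loop above turns Pascal VOC corner coordinates in pixels into the normalized `x_center y_center width height` values written to each label line. A minimal standalone sketch of that conversion, with the hypothetical helper name `voc_to_yolo`:

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_width, img_height, ndigits=5):
    """Convert pixel corner coordinates to normalized (x_center, y_center, width, height)."""
    x_center = 0.5 * (xmin + xmax) / img_width
    y_center = 0.5 * (ymin + ymax) / img_height
    width = (xmax - xmin) / img_width
    height = (ymax - ymin) / img_height
    # Same rounding as the conversion loop: 5 decimal places
    return tuple(round(v, ndigits) for v in (x_center, y_center, width, height))
```

For example, a box with corners (100, 200) and (300, 400) in a 1000x800 image maps to `(0.2, 0.375, 0.2, 0.25)`.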

In [None]:
for (j,phase) in enumerate(['test']) :
    file_list_phase = file_list_test
    for (i,country) in enumerate(countriestest) :  # file_list_test was built from countriestest, which excludes China_Drone
        file_list_country = file_list_phase[i]
        for file in file_list_country:
            ################################ FOR THE IMAGES ########################################
            img_path = base_path + country + '/test/images/' + file
            phase_path = PATH_IMGS+phase+'/'
            copy(img_path,phase_path)
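
Before training, it may be worth checking that every image copied to the train and val folders has a matching label file. The sketch below is one possible check. `find_unlabeled` is a hypothetical helper, and it assumes image and label files share the same base name, as in the conversion loop.

```python
import os

def find_unlabeled(images_dir, labels_dir):
    """Return the image file names that have no label file with the same base name."""
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(labels_dir)}
    return sorted(f for f in os.listdir(images_dir)
                  if os.path.splitext(f)[0] not in label_stems)
```

An empty list for both `train` and `val` would suggest the conversion completed for every image. The test folder has no labels, so it is not a meaningful input for this check.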

# Clone the Yolov7 GitHub repository to start training

In [None]:
!git clone https://github.com/WongKinYiu/yolov7.git  # clone repository
%cd yolov7

%pip install -qr requirements.txt  # install dependencies
import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

In [None]:
# Weights & Biases  (optional)

# %pip install -q wandb
# import wandb
# wandb.login(key='xxx') # After registering on WandB you will have access to your key, which you can put in place of xxx

If you run into any problem while training or testing with Yolov7, please refer to the [Yolov7 GitHub repository](https://github.com/WongKinYiu/yolov7).

## Train

Every training run is saved in `yolov7/runs/train/`.

Train with the default hyperparameters, either from scratch or starting from pretrained weights.
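
The training commands point `--data` at `data/coco-custom.yaml`, which is not shown in this notebook. The sketch below generates a plausible version of that file, assuming it follows the layout of the repository's `data/coco.yaml` (`train`, `val`, `nc`, `names`) and that the notebook runs from inside `yolov7/`, so the converted dataset sits one level up. The helper `build_data_yaml` is hypothetical, and the paths should be adapted to your setup.

```python
from pathlib import Path

def build_data_yaml(dataset_root, class_names):
    """Return the text of a Yolov7-style dataset config for the given classes."""
    root = Path(dataset_root)
    lines = [
        f"train: {(root / 'images' / 'train').as_posix()}",
        f"val: {(root / 'images' / 'val').as_posix()}",
        f"nc: {len(class_names)}",          # number of classes
        f"names: {list(class_names)}",      # class order must match damageType_to_class
    ]
    return "\n".join(lines) + "\n"

yaml_text = build_data_yaml('../RDD_Dataset_2022_Yolo', ['D00', 'D10', 'D20', 'D40'])
print(yaml_text)
# To save it at the assumed location used by --data:
# Path('data/coco-custom.yaml').write_text(yaml_text)
```

The model configs passed to `--cfg` (`yolov7-RDD2022.yaml` and `yolov7x-RDD2022.yaml`) are also not shown here and have to be provided separately.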

In [None]:
# Train yolov7 from scratch on all countries
!python train.py --workers 8 --device 0 --batch-size 8 --data data/coco-custom.yaml --img 640 640 --cfg cfg/training/yolov7-RDD2022.yaml --weights ' ' --name yolo7 --hyp data/hyp.scratch.p5.yaml

In [None]:
# Fine-tune yolov7x from pretrained weights on all countries
!python train.py --workers 8 --device 0 --batch-size 8 --data data/coco-custom.yaml --img 640 640 --cfg cfg/training/yolov7x-RDD2022.yaml --weights yolov7x_training.pt  --name yolo7x --hyp data/hyp.scratch.custom.yaml