<a href="https://colab.research.google.com/github/AIFahim/Some-Preprocessing-of-Dhaka-AI/blob/main/Some_Important_Data_Preprocessing_YoLo_v3_For_Dhaka_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<h1>Table of Contents:</h1>

1. [Data Download](#1)
2. [Data Cleaning](#2)
3. [Convert .xml to .txt](#3)
4. [Resizing all the Images](#4)
5. [Train and Validation Split](#5)
6. [Creating Metadata](#6)
7. [Saving the Processed Dataset](#7)

<a id="1"></a>
## 1. Data Download:
First, download the original data hosted in the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/POREXF) repository. For simplicity, I downloaded it and put a copy on my Google Drive. Running the following code cell downloads that copy and extracts it into a usable form.

In [None]:
from IPython.display import clear_output
import os, glob


# Downloading the dataset
!gdown --id 1GIiqqmqEPSiGBb1MU1kIZG4q7BOIzqik
!unzip traffic-dataset.zip; rm traffic-dataset.zip;
clear_output()

# The .zip file contains .rar archives, so we extract those as well.
!unrar x train.rar
!unrar x test1.rar
clear_output()

# Removing rar files that we no longer need. 
!rm train.rar
!rm test1.rar

# Removing unnecessary demo data folder from workspace.
!rm -r sample_data

# Renaming raw data folder to remove space. Trust me, it makes life a lot easier :D 
%mv 'Final Train Dataset' train_data_raw


<a id="2"></a>
## 2. Data Cleaning:
When I first wrote this code, I found problems with three of the images. The problems fall into two categories.
1. The training images <font color="red">`Pias (359).PNG`</font> and <font color="red">`Pias (360).PNG`</font> are actually `JPEG` files that were saved with a `.PNG` extension. As a result, the height and width attributes in the corresponding XML labels are 0, because the automatic label generator could not read the images properly.

2. The label file <font color="red">`231.xml`</font> is actually a `.txt`-format label saved with an `.xml` extension. Because the label indices in that file are inconsistent, we simply drop the label and its image together.


To avoid these problems, we remove the three problematic images together with their label files and process the rest. The following code cell removes them.


In [None]:
corrupt_files = ['231.jpg', '231.xml', 'Pias (359).PNG','Pias (359).xml', 'Pias (360).PNG', 'Pias (360).xml']

%cd /content/train_data_raw/

for file in corrupt_files:
    file_path = os.path.join('/content/train_data_raw/', file)  
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f'{file} is removed successfully')
    else:
        print(f'{file} is already deleted')

%cd /content/

/content/train_data_raw
231.jpg is removed successfully
231.xml is removed successfully
Pias (359).PNG is removed successfully
Pias (359).xml is removed successfully
Pias (360).PNG is removed successfully
Pias (360).xml is removed successfully
/content
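The mismatch between extension and actual format described above can be detected automatically. The following is only a minimal sketch, not part of the original pipeline: the helper names `sniff_image_format` and `extension_matches_content` are hypothetical, and the check looks only at the well-known leading bytes of PNG and JPEG files.

```python
import os
import tempfile

# Leading "magic" bytes of the two formats involved here.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_format(path):
    """Return 'png', 'jpeg', or None based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    return None


def extension_matches_content(path):
    """True when the file extension agrees with the detected format."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    ext = "jpeg" if ext == "jpg" else ext
    return sniff_image_format(path) == ext


# Demonstration with a fake file: JPEG bytes saved under a .PNG name.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mislabeled.PNG")
with open(demo_path, "wb") as f:
    f.write(JPEG_MAGIC + b"\x00" * 16)

print(sniff_image_format(demo_path), extension_matches_content(demo_path))
```

Running a check like this over the raw training folder before conversion would flag any other files with the same problem.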


<a id="3"></a>
## 3. Convert .xml to .txt


In [None]:
"""
Thanks to @bjornstenger for his excellent code for converting labels from XML format to .txt format.
This cell is adapted from his original script.
Original Link: https://github.com/bjornstenger/xml2yolo/blob/master/convert.py 
"""

from xml.dom import minidom

# Remember the numbers assigned here. These are the label indexes used in the training process.
# Feel free to unfold the dictionary to see what's inside.
lut={"ambulance": 0,
     "auto rickshaw": 1,
     "bicycle": 2,
     "bus": 3,
     "car": 4,
     "garbagevan": 5,
     "human hauler": 6,
     "minibus": 7,
     "minivan": 8,
     "motorbike": 9,
     "pickup": 10,
     "army vehicle": 11,
     "policecar": 12,
     "rickshaw": 13,
     "scooter": 14,
     "suv": 15,
     "taxi": 16,
     "three wheelers (CNG)": 17,
     "truck": 18,
     "van": 19,
     "wheelbarrow": 20
     }

label_count ={}

print(f'Object Names: {list(lut.keys())}' )

def convert_coordinates(size, box):
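    """
    Converts pixel coordinates of a bounding box to the normalized YOLO format.
    box: (xmin, xmax, ymin, ymax)
    size: (width, height) of the image

    returns a tuple (x, y, w, h): the box center and the box width and height,
    each divided by the image width or height
    """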
    dw = 1.0/size[0]
    dh = 1.0/size[1]
    x = (box[0]+box[1])/2.0
    y = (box[2]+box[3])/2.0
    w = box[1]-box[0]
    h = box[3]-box[2]
    x = x*dw
    w = w*dw
    y = y*dh
    h = h*dh
    return (x,y,w,h)


def convert_xml2yolo(filelist, lut ):
    """
    filelist: list of .xml file paths to convert to .txt file
    lut: a dictionary containing class_name to class_index mapping
    """
    for fname in filelist:
        xmldoc = minidom.parse(fname)
        fname_out = (fname[:-4]+'.txt')

        with open(fname_out, "w") as f:
            # print(f'processing{fname}')

            itemlist = xmldoc.getElementsByTagName('object')
            size = xmldoc.getElementsByTagName('size')[0]
            width = int((size.getElementsByTagName('width')[0]).firstChild.data)
            height = int((size.getElementsByTagName('height')[0]).firstChild.data)

            for item in itemlist:
                # get class label
                classid =  (item.getElementsByTagName('name')[0]).firstChild.data
                if classid in lut:
                    label_str = str(lut[classid])
                else:
                    label_str = "-1"
                    print("warning: label '%s' not in look-up table for file '%s'" % (classid, fname))
                # get bbox coordinates
                xmin = ((item.getElementsByTagName('bndbox')[0]).getElementsByTagName('xmin')[0]).firstChild.data
                ymin = ((item.getElementsByTagName('bndbox')[0]).getElementsByTagName('ymin')[0]).firstChild.data
                xmax = ((item.getElementsByTagName('bndbox')[0]).getElementsByTagName('xmax')[0]).firstChild.data
                ymax = ((item.getElementsByTagName('bndbox')[0]).getElementsByTagName('ymax')[0]).firstChild.data
                b = (float(xmin), float(xmax), float(ymin), float(ymax))
                bb = convert_coordinates((width,height), b)
                #print(bb)

                label_count[classid] = label_count.get(classid, 0) + 1

                f.write(label_str + " " + " ".join([("%.6f" % a) for a in bb]) + '\n')
        # print ("wrote %s" % fname_out)


Object Names: ['ambulance', 'auto rickshaw', 'bicycle', 'bus', 'car', 'garbagevan', 'human hauler', 'minibus', 'minivan', 'motorbike', 'pickup', 'army vehicle', 'policecar', 'rickshaw', 'scooter', 'suv', 'taxi', 'three wheelers (CNG)', 'truck', 'van', 'wheelbarrow']
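To make the coordinate conversion concrete, here is a small worked example. It is a sketch that restates the same arithmetic as `convert_coordinates` with named arguments; the function name `voc_to_yolo` and the image and box values are made up for illustration.

```python
def voc_to_yolo(img_w, img_h, xmin, ymin, xmax, ymax):
    """Convert a pixel box (corner format) to normalized YOLO format."""
    x_center = (xmin + xmax) / 2.0 / img_w   # box center, as a fraction of image width
    y_center = (ymin + ymax) / 2.0 / img_h   # box center, as a fraction of image height
    width = (xmax - xmin) / img_w            # box width, as a fraction of image width
    height = (ymax - ymin) / img_h           # box height, as a fraction of image height
    return x_center, y_center, width, height


# Hypothetical 1000x800 image with a box from (100, 200) to (300, 400).
example = voc_to_yolo(img_w=1000, img_h=800, xmin=100, ymin=200, xmax=300, ymax=400)
print(example)  # (0.2, 0.375, 0.2, 0.25)
```

All four values are fractions of the image size, so they always fall between 0 and 1 for a box that lies inside the image.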


Now let us build the list of file paths needed to convert the `.xml` files to `.txt` files.

In [None]:
from IPython.display import clear_output
import os, glob
# Reading Image file paths
formats = ['jpg']
image_file_list = []
for format in formats:
    image_file_list.extend(glob.glob(f'/content/dataset/images_jpg/*.{format}'))

# Reading XML label file paths (needed by the conversion cell below)
label_file_list_xml = glob.glob('/content/train_data_raw/*.xml')

print(f'Image files found: {len(image_file_list)} ')  # \nLabel files found: { len(label_file_list_xml)}'

Image files found: 2996 


In [None]:
# Converting .xml file to .txt file
convert_xml2yolo(label_file_list_xml, lut)
label_file_list_txt = glob.glob('/content/train_data_raw/*.txt')
print(f'XML --> TXT files: {len(label_file_list_txt)}')

XML --> TXT files: 3000
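After the conversion it can be useful to sanity-check the generated label lines. The sketch below is an assumption about what such a check might look like, not something the original notebook does: `check_yolo_line` is a hypothetical helper that verifies a line has five fields, a class index inside the look-up table range (which also catches the `-1` written for unknown labels), and coordinates between 0 and 1.

```python
def check_yolo_line(line, num_classes=21):
    """Return a list of problems found in one label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, found {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class index {class_id} out of range")
    if any(not 0.0 <= c <= 1.0 for c in coords):
        problems.append("coordinate outside [0, 1]")
    return problems


print(check_yolo_line("4 0.200000 0.375000 0.200000 0.250000"))
print(check_yolo_line("-1 0.500000 0.500000 0.100000 0.100000"))
```

Looping this over every line of every `.txt` file would show whether any labels were written with an unknown class or with a box that extends past the image border.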


In [None]:
label_count

{'ambulance': 70,
 'army vehicle': 43,
 'auto rickshaw': 372,
 'bicycle': 459,
 'bus': 3333,
 'car': 5476,
 'garbagevan': 3,
 'human hauler': 169,
 'minibus': 95,
 'minivan': 934,
 'motorbike': 2284,
 'pickup': 1225,
 'policecar': 32,
 'rickshaw': 3536,
 'scooter': 38,
 'suv': 859,
 'taxi': 60,
 'three wheelers (CNG)': 2989,
 'truck': 1492,
 'van': 756,
 'wheelbarrow': 119}

## Data Visualization:
Let us look at how often each label occurs in the dataset.

In [None]:
import pandas as pd
import plotly.express as px

# DataFrame Generation
df = pd.DataFrame({'labels': label_count.keys(), 'count': label_count.values()})
df.columns = ['labels', 'count']
df.sort_values(['count'], ascending = False, inplace =True)
df.head(n=21)

# # Plotting
# fig = px.bar(df, x="labels", y='count',  color="count",
#     orientation='v', 
#     title='Frequency of the Labels in Dhaka.ai-2020 Challenge', 
#     color_continuous_scale=px.colors.sequential.Viridis_r
# )
# fig.update_layout(title_x=0.5, xaxis_title = 'Labels', yaxis_title = 'Label Count')
# fig.update_xaxes(tickangle=60)
# fig.show()

| index | labels | count |
|---|---|---|
| 1 | car | 5476 |
| 10 | rickshaw | 3536 |
| 2 | bus | 3333 |
| 4 | three wheelers (CNG) | 2989 |
| 9 | motorbike | 2284 |
| 8 | truck | 1492 |
| 11 | pickup | 1225 |
| 7 | minivan | 934 |
| 6 | suv | 859 |
| 3 | van | 756 |
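The counts show a strong class imbalance. As a rough sketch of how to quantify it, the snippet below copies the `label_count` values printed earlier and computes each label's share of all boxes. The threshold of 50 boxes for calling a class "rare" is an arbitrary assumption chosen only for illustration.

```python
# Counts copied from the label_count output shown earlier in this notebook.
label_count = {
    'ambulance': 70, 'army vehicle': 43, 'auto rickshaw': 372, 'bicycle': 459,
    'bus': 3333, 'car': 5476, 'garbagevan': 3, 'human hauler': 169,
    'minibus': 95, 'minivan': 934, 'motorbike': 2284, 'pickup': 1225,
    'policecar': 32, 'rickshaw': 3536, 'scooter': 38, 'suv': 859, 'taxi': 60,
    'three wheelers (CNG)': 2989, 'truck': 1492, 'van': 756, 'wheelbarrow': 119,
}

total = sum(label_count.values())
ranked = sorted(label_count.items(), key=lambda kv: kv[1], reverse=True)

# Share of all bounding boxes for the five most frequent labels.
for name, count in ranked[:5]:
    print(f"{name:<22}{count:>6}{count / total:>8.1%}")

# Labels with fewer than 50 boxes (arbitrary threshold, see above).
rare = [name for name, count in label_count.items() if count < 50]
print("rare labels:", rare)
```

Classes with only a handful of boxes, such as `garbagevan`, are unlikely to be learned well without extra data or augmentation.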


<a id="4"></a>
## 4. Resizing all the Images (Optional)
Every image is resized to 1024 x 1024 pixels.



In [None]:
%cd /content/
!mkdir dataset
%cd dataset
# !mkdir train
# %cd train
!gdown --id 11Fjy9BE8TCPJvfS4KcGEQhTC58uUokGG
!unzip images_jpg.zip; rm images_jpg.zip;
%cd /content/input/dataset/

/content
/content/dataset
Downloading...
From: https://drive.google.com/uc?id=11Fjy9BE8TCPJvfS4KcGEQhTC58uUokGG
To: /content/dataset/images_jpg.zip
1.42GB [00:22, 63.1MB/s]
Archive:  images_jpg.zip
   creating: images_jpg/
  inflating: images_jpg/01.jpg       
  inflating: images_jpg/09.jpg       
  inflating: images_jpg/10.jpg       
  inflating: images_jpg/13.jpg       
  inflating: images_jpg/16.jpg       
  inflating: images_jpg/18.jpg       
  inflating: images_jpg/28.jpg       
  inflating: images_jpg/39.jpg       
  inflating: images_jpg/48.jpg       
  inflating: images_jpg/49.jpg       
  inflating: images_jpg/50.jpg       
  inflating: images_jpg/54.jpg       
  inflating: images_jpg/56.jpg       
  inflating: images_jpg/58.jpg       
  inflating: images_jpg/59.jpg       
  inflating: images_jpg/61.jpg       
  inflating: images_jpg/63.jpg       
  inflating: images_jpg/68.jpg       
  inflating: images_jpg/70.jpg       
  inflating: images_jpg/75.jpg       
  inflating: imag

In [None]:
from PIL import Image
img_sizes = {}

for fname in image_file_list:
    img = Image.open(fname)
    img_sizes[img.size] = img_sizes.get(img.size, 0) +1 
img_sizes

{(352, 399): 1,
 (352, 401): 1,
 (352, 404): 1,
 (352, 405): 1,
 (352, 421): 1,
 (352, 426): 1,
 (352, 429): 1,
 (352, 432): 1,
 (352, 435): 1,
 (352, 436): 2,
 (352, 437): 1,
 (352, 438): 1,
 (352, 439): 2,
 (352, 443): 1,
 (352, 444): 1,
 (352, 445): 1,
 (352, 448): 1,
 (352, 454): 1,
 (352, 456): 1,
 (352, 458): 1,
 (352, 460): 1,
 (352, 461): 1,
 (352, 462): 1,
 (352, 467): 1,
 (352, 469): 2,
 (352, 472): 1,
 (352, 476): 1,
 (352, 478): 1,
 (352, 479): 3,
 (352, 482): 1,
 (352, 486): 1,
 (352, 488): 1,
 (352, 490): 1,
 (352, 491): 1,
 (352, 492): 1,
 (352, 493): 1,
 (352, 496): 1,
 (352, 498): 2,
 (352, 502): 1,
 (352, 506): 2,
 (352, 507): 1,
 (352, 509): 1,
 (352, 511): 2,
 (352, 512): 1,
 (352, 515): 1,
 (352, 517): 1,
 (352, 524): 1,
 (352, 526): 1,
 (352, 527): 1,
 (352, 537): 1,
 (352, 543): 2,
 (352, 545): 1,
 (352, 547): 1,
 (352, 586): 1,
 (407, 539): 1,
 (431, 576): 1,
 (451, 587): 1,
 (490, 654): 1,
 (491, 654): 1,
 (491, 656): 1,
 (492, 655): 1,
 (530, 667): 1,
 (540, 3

In [None]:
def resize_images(file_list, width = 1024, height = 1024, overwrite = True, save_dir = ''):
    """
    Resizes every image in file_list to (width, height).
    overwrite: if True, replace each original file; otherwise save a copy in save_dir
    """
    total_files = len(file_list)
    for idx, path in enumerate(file_list, start=1):
        img = Image.open(path)
        img_resized = img.resize((width, height))
        filename = os.path.basename(path)
        if overwrite:
            img_resized.save(path)
            print(f"{idx}/{total_files}: {filename} {img.size}--> ({width}x{height})")
        else:
            img_resized.save(os.path.join(save_dir, filename))
            print(f'{filename} saved to {save_dir}')
    # Clear the per-file progress messages once everything is done
    clear_output()

In [None]:
resize_images(image_file_list , overwrite= True)
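Resizing the images does not require regenerating the label files, because YOLO labels store coordinates as fractions of the image size. The sketch below illustrates this with made-up numbers: the original image size, the box, and the helper names `to_yolo` and `scale_box` are all hypothetical.

```python
import math


def to_yolo(img_w, img_h, xmin, ymin, xmax, ymax):
    """Normalized (x_center, y_center, width, height) of a pixel box."""
    return ((xmin + xmax) / 2.0 / img_w, (ymin + ymax) / 2.0 / img_h,
            (xmax - xmin) / img_w, (ymax - ymin) / img_h)


def scale_box(box, old_size, new_size):
    """Pixel box after the image is stretched from old_size to new_size."""
    sx = new_size[0] / old_size[0]
    sy = new_size[1] / old_size[1]
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


old_size = (1280, 720)      # hypothetical original (width, height)
new_size = (1024, 1024)     # target size used in this notebook
box = (320, 180, 640, 540)  # hypothetical (xmin, ymin, xmax, ymax) in pixels

before = to_yolo(*old_size, *box)
after = to_yolo(*new_size, *scale_box(box, old_size, new_size))

# The normalized labels agree up to floating-point rounding.
print(all(math.isclose(a, b) for a, b in zip(before, after)))
```

Note that stretching a non-square image to a square changes its aspect ratio, so objects appear distorted even though the labels remain valid.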

In [None]:
!zip -r /content/resizedDataset.zip /content/dataset/images_jpg
#!zip /content/dataset/images_jpg

  adding: content/dataset/images_jpg/ (stored 0%)
  adding: content/dataset/images_jpg/Navid_370.jpg (deflated 2%)
  adding: content/dataset/images_jpg/Navid_385.jpg (deflated 1%)
  adding: content/dataset/images_jpg/Asraf_45.jpg (deflated 0%)
  adding: content/dataset/images_jpg/46.jpg (deflated 1%)
  adding: content/dataset/images_jpg/Pias_(310).jpg (deflated 0%)
  adding: content/dataset/images_jpg/Dipto_767.jpg (deflated 2%)
  adding: content/dataset/images_jpg/Pias_(525).jpg (deflated 3%)
  adding: content/dataset/images_jpg/Pias_(205).jpg (deflated 0%)
  adding: content/dataset/images_jpg/Dipto_595.jpg (deflated 0%)
  adding: content/dataset/images_jpg/Navid_713.jpg (deflated 1%)
  adding: content/dataset/images_jpg/Pias_(493).jpg (deflated 3%)
  adding: content/dataset/images_jpg/Navid_802.jpg (deflated 0%)
  adding: content/dataset/images_jpg/Pias_(227).jpg (deflated 1%)
  adding: content/dataset/images_jpg/Navid_157.jpg (deflated 1%)
  adding: content/dataset/images_jpg/Navid_

In [None]:
from google.colab import files
files.download("/content/resizedDataset.zip")


We don't have to resize the test images because they are already resized to our desired format. 

<a id="5"></a>
## 5. Train and Validation Split


In [None]:
import random
random.seed(401)

# randomly selecting the indexes of the files for the validation set
valid_set_index = random.sample(range(len(image_file_list)), 600)
len(set(image_file_list)), len(set(label_file_list_txt)), len(valid_set_index)

image_file_list = sorted(image_file_list)
label_file_list_txt = sorted(label_file_list_txt)

# sanity check of the image files and labels being in the same order
print('Checking files concurrency')
print(image_file_list[:5])
print(label_file_list_txt[:5])

# code to separate train and validation set
valid_selected_images = []
valid_selected_labels = []

for index in valid_set_index: 
    valid_selected_images.append(image_file_list[index])
    valid_selected_labels.append(label_file_list_txt[index])


print('\n\nChecking files concurrency in validation set')
print(valid_selected_images[:5])
print(valid_selected_labels[:5])

Checking files concurrency
['/content/train_data_raw/01.jpg', '/content/train_data_raw/02.jpg', '/content/train_data_raw/03.jpg', '/content/train_data_raw/04.jpg', '/content/train_data_raw/05.jpg']
['/content/train_data_raw/01.txt', '/content/train_data_raw/02.txt', '/content/train_data_raw/03.txt', '/content/train_data_raw/04.txt', '/content/train_data_raw/05.txt']


Checking files concurrency in validation set
['/content/train_data_raw/Numan_(143).jpg', '/content/train_data_raw/Dipto_ 191.jpg', '/content/train_data_raw/Navid_635.JPG', '/content/train_data_raw/Dipto_852.jpg', '/content/train_data_raw/Numan_(44).jpg']
['/content/train_data_raw/Numan_(143).txt', '/content/train_data_raw/Dipto_ 191.txt', '/content/train_data_raw/Navid_635.txt', '/content/train_data_raw/Dipto_852.txt', '/content/train_data_raw/Numan_(44).txt']
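The split above relies on the sorted image list and the sorted label list lining up index by index. A more defensive variant pairs each image with its label by file stem before sampling. This is only a sketch under my own assumptions: `stem` and `split_pairs` are hypothetical helpers, the paths are invented, and because it samples from the paired list it will not reproduce the exact selection made by the cell above even with the same seed.

```python
import os
import random


def stem(path):
    """File name without directory and extension, e.g. '/a/01.jpg' -> '01'."""
    return os.path.splitext(os.path.basename(path))[0]


def split_pairs(image_paths, label_paths, n_valid, seed=401):
    """Pair images with labels by stem, then pick n_valid pairs for validation."""
    labels_by_stem = {stem(p): p for p in label_paths}
    pairs = [(img, labels_by_stem[stem(img)])
             for img in sorted(image_paths) if stem(img) in labels_by_stem]
    rng = random.Random(seed)  # local generator, leaves the global seed alone
    valid_idx = set(rng.sample(range(len(pairs)), n_valid))
    valid = [p for i, p in enumerate(pairs) if i in valid_idx]
    train = [p for i, p in enumerate(pairs) if i not in valid_idx]
    return train, valid


# Invented paths: 'd.jpg' has no label and is therefore left out of both sets.
images = ['/data/b.JPG', '/data/a.jpg', '/data/c.jpg', '/data/d.jpg']
labels = ['/data/a.txt', '/data/b.txt', '/data/c.txt']
train, valid = split_pairs(images, labels, n_valid=1)
print(len(train), len(valid))
```

Pairing by stem also avoids mismatches when extensions differ in case, such as `.JPG` versus `.jpg`.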


In [None]:
import shutil

# Creating validation directory
valid_dir = '/content/valid/'

if os.path.exists(valid_dir):
    print(f'Directory {valid_dir} already exists !')
else: 
    os.makedirs(valid_dir)
    print(f"Directory {valid_dir} is created successfully!") 


for idx in range(len(valid_selected_images)):
    # moving image files
    mypath = valid_selected_images[idx]
    if os.path.exists(mypath):
        filename = mypath.split('/')[-1]
        shutil.move(mypath , valid_dir + filename)
    else:
        print(f'{mypath} not found')
        
    # moving label files
    mypath = valid_selected_labels[idx]
    if os.path.exists(mypath):
        filename = mypath.split('/')[-1]
        shutil.move(mypath , valid_dir + filename)
    else:
        print(f'{mypath} not found')

Directory /content/valid/ is created successfully!


The images that remain in `train_data_raw` now make up the training set, so we rename the directory to `train`.

In [None]:
!mv train_data_raw train  

<a id="6"></a>
## 6. Creating Metadata:
The starter code expects some metadata files, and we need to produce their contents. We have to produce the following files.
* `train.txt`: A text file containing the full paths of all the training image files.
* `valid.txt`: A text file containing the full paths of all the validation image files.
* `test.txt`: A text file containing the full paths of all the test image files.
* `traffic.names`: A text file containing all the traffic label names, one per line.
* `traffic.data`: A configuration file that stores the number of classes and the locations of `train.txt` and `valid.txt` for training.

In [None]:
def lookup_image_file_paths(formats, dir):
    """
    This function takes a specified set of formats and directory address to list all the filepaths
    of the desired format in that directory
    """
    filepaths = []
    for format in formats:
        filepaths.extend(glob.glob(f'{dir}*.{format}'))
    return filepaths

def make_txt_file(formats, dir):
    """
    Formats the file names to write in the desired txt file
    """
    filepaths = lookup_image_file_paths(formats, dir)
    
    filenames = [x.split('/')[-1] for x in filepaths]
    txt_file_name = dir.split('/')[-2]

    print(f'{txt_file_name} : {len(filepaths)} images')
    # The with block closes the file automatically
    with open(f'/content/metadata/{txt_file_name}.txt', 'w') as outfile:
        for filename in filenames:
            outfile.write(f'data/{txt_file_name}/'+filename+'\n')


In [None]:
train_dir = '/content/train/'
valid_dir = '/content/valid/'
test_dir =  '/content/test/'
!mkdir metadata

# Making the .txt file containing list of images. 
make_txt_file(formats, dir = train_dir)
make_txt_file(formats, dir = test_dir)
make_txt_file(formats, dir = valid_dir)

# Writing the file traffic.names
object_labels = list(lut.keys())
with open('/content/metadata/traffic.names', 'w') as outfile:
    for label in object_labels:
        outfile.write(label + '\n')

# Writing the file traffic.data (the class count is taken from the label list: 21)
data_config = f'classes={len(object_labels)}\ntrain=train.txt\nvalid=valid.txt\nnames=traffic.names'
with open('/content/metadata/traffic.data', 'w') as outfile:
    outfile.write(data_config + '\n')

train : 2400 images
test : 500 images
valid : 600 images
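To show what the two configuration files contain, here is a small sketch that builds their text in memory instead of writing to disk. The helper names `build_names_file` and `build_data_config` are hypothetical, and the two-class list is a shortened stand-in for the 21 labels used in this notebook.

```python
def build_names_file(names):
    """One label per line, in class-index order."""
    return "\n".join(names) + "\n"


def build_data_config(names, train_list="train.txt", valid_list="valid.txt",
                      names_file="traffic.names"):
    """Text of the .data file: class count plus the paths it points to."""
    lines = [
        f"classes={len(names)}",
        f"train={train_list}",
        f"valid={valid_list}",
        f"names={names_file}",
    ]
    return "\n".join(lines) + "\n"


# Shortened stand-in for the full label list.
demo_names = ["ambulance", "auto rickshaw"]
print(build_names_file(demo_names))
print(build_data_config(demo_names))
```

Deriving `classes` from the length of the label list keeps the two files consistent if the look-up table ever changes.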


<a id="7"></a>
## 7. Saving the Processed Dataset:


In [None]:
!zip -r dhaka-traffic-yolo-v3_seed_401.zip train test valid metadata
clear_output()

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!cp dhaka-traffic-yolo-v3_seed_401.zip '/content/gdrive/My Drive/yolov3/'
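Before relying on the copied archive, it may be worth checking that it contains what you expect. The sketch below uses Python's standard `zipfile` module to count the files under each top-level folder. The helper name `count_top_level_entries` is hypothetical, and the demonstration builds a tiny throwaway archive rather than reading the real one.

```python
import collections
import os
import tempfile
import zipfile


def count_top_level_entries(zip_path):
    """Count files (not directories) under each top-level folder of a zip."""
    counts = collections.Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):   # skip directory entries
                continue
            counts[name.split("/")[0]] += 1
    return dict(counts)


# Throwaway archive that mimics the layout produced above.
demo_zip = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("train/a.jpg", b"")
    zf.writestr("train/a.txt", "4 0.5 0.5 0.1 0.1\n")
    zf.writestr("valid/b.jpg", b"")
    zf.writestr("metadata/train.txt", "data/train/a.jpg\n")

print(count_top_level_entries(demo_zip))
```

Pointing the same function at the real archive should report entries for `train`, `test`, `valid`, and `metadata`.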