# Getting Additional Data to Balance the Classes

## New Training and Validation Data Needed to Balance the Classes

- We face a severe class imbalance, which must be fixed so that the model does not learn to predict only Class 3.
- The Amazon Bin Image Dataset is large, so getting more data is feasible.
- Since Class 3 has the most data, we will download additional data for every other class until each one matches it.
- We will make sure the new data does not duplicate our existing data.


## Run the cells below to download the additional data

### Keep `file_list.json`, provided by Udacity, in the same directory so new files can be checked against the existing data


## Stats for New Data required per Class (1, 2, 4 & 5)

In [9]:
# Train Data Difference for Class 1
2053 - 902

1151

In [2]:
# Train Data Difference for Class 2
2053 - 1759

294

In [3]:
# Train Data Difference for Class 4
2053 - 1818

235

In [4]:
# Train Data Difference for Class 5
2053 - 1420

633

In [5]:
#Valid Data Difference for Class 1
513 - 226

287

In [6]:
#Valid Data Difference for Class 2
513 - 440

73

In [7]:
#Valid Data Difference for Class 4
513 - 455

58

In [8]:
#Valid Data Difference for Class 5
513 - 355

158

### Total Additional Data Required

In [10]:
#Total More Data Required Class 1
1151 + 287

1438

In [11]:
#Total More Data Required Class 2
294 + 73

367

In [12]:
#Total More Data Required Class 4
235 + 58

293

In [13]:
#Total More Data Required Class 5
633 + 158

791
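
The arithmetic in the cells above can also be done in one pass. The sketch below is only a convenience, not part of the original notebook: it takes the per-class counts used in the subtractions above (with 2053 and 513 as the Class 3 train and validation counts) and the helper name `deficits` is made up for illustration.

```python
# Per-class image counts taken from the subtraction cells above.
train_counts = {1: 902, 2: 1759, 3: 2053, 4: 1818, 5: 1420}
valid_counts = {1: 226, 2: 440, 3: 513, 4: 455, 5: 355}

def deficits(counts):
    """Return how many images each class needs to match the largest class."""
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items() if n < target}

train_needed = deficits(train_counts)
valid_needed = deficits(valid_counts)

# Train and validation deficits added together, per class.
total_needed = {cls: train_needed[cls] + valid_needed[cls] for cls in train_needed}
print(total_needed)  # {1: 1438, 2: 367, 4: 293, 5: 791}
```

Keeping the counts in dictionaries means a change to the dataset only requires updating the two inputs.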

## Script to get more Data

- The Amazon Bin Image Dataset contains about 500,000 images.
- Its documentation is at https://github.com/awslabs/open-data-docs/tree/main/docs/aft-vbi-pds
- According to the documentation, each image has a corresponding metadata file whose `EXPECTED_QUANTITY` field gives the number of objects in that image.
- I wrote a script that downloads each metadata file, checks whether the number of objects is 1, 2, 4 or 5, and if so downloads the corresponding image into that class's folder.
- I set a counter for the number of images required per class, based on the statistics above.
- Once a class's counter hits zero, images with that number of objects are no longer downloaded.
- Since the images in the S3 bucket are numbered sequentially, we can iterate over number ranges to look up files.
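
The steps above can be sketched without any network access, which makes the counter logic easy to check before touching S3. This is a hypothetical restructuring, not the script used in this notebook: the function `collect` and its `fetch_quantity` / `fetch_image` parameters are invented names, and the stand-in callables replace the real metadata and image downloads.

```python
def collect(ids, existing, needed, fetch_quantity, fetch_image):
    """Walk candidate image ids and fetch images for classes still short.

    ids            -- iterable of integer image ids to try
    existing       -- set of metadata paths already in the dataset
    needed         -- dict {class: images still required}; mutated in place
    fetch_quantity -- callable(id) -> EXPECTED_QUANTITY for that id
    fetch_image    -- callable(id, qty) that stores the image for class qty
    """
    for i in ids:
        if not any(needed.values()):
            break  # every counter has reached zero
        if f"data/metadata/{i}.json" in existing:
            continue  # already part of the current dataset
        qty = fetch_quantity(i)
        if needed.get(qty, 0) > 0:
            fetch_image(i, qty)
            needed[qty] -= 1
    return needed

# Dry run with stand-in callables instead of S3 calls (made-up ids and quantities).
fake_quantities = {1: 1, 2: 3, 3: 5, 4: 1, 5: 2, 6: 1}
downloaded = []
remaining = collect(
    ids=range(1, 7),
    existing={"data/metadata/4.json"},
    needed={1: 2, 5: 1},
    fetch_quantity=fake_quantities.get,
    fetch_image=lambda i, qty: downloaded.append((i, qty)),
)
print(downloaded)  # [(1, 1), (3, 5), (6, 1)]
print(remaining)   # {1: 0, 5: 0}
```

Holding the counters in one dictionary avoids a separate variable and branch per class, so a class is skipped automatically once its count reaches zero.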

In [2]:
import json

# Load the list of files already in the dataset, keyed by class
with open('file_list.json', 'r') as f:
    d = json.load(f)
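
Checking a path against five lists on every iteration gets slow as the lists grow. A possible alternative, assuming `file_list.json` maps each class to a list of paths such as `data/metadata/<id>.json` (which is how the script below uses it), is to flatten everything into one set. The helper name `build_existing_set` and the sample paths are made up for illustration.

```python
def build_existing_set(file_list):
    """Flatten {class: [paths]} into a single set for fast membership tests."""
    existing = set()
    for paths in file_list.values():
        existing.update(paths)
    return existing

# Small stand-in for the contents of file_list.json (hypothetical paths).
sample_file_list = {
    "1": ["data/metadata/100.json", "data/metadata/205.json"],
    "3": ["data/metadata/317.json"],
}
existing = build_existing_set(sample_file_list)

print("data/metadata/205.json" in existing)  # True
print("data/metadata/999.json" in existing)  # False
```

With the real file, `build_existing_set(d)` would replace the chained `to_check in d['1'] or ...` test with a single `to_check in existing`.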

In [13]:
import boto3
import os
s3_client = boto3.client('s3')

In [15]:
# Folder where the downloaded metadata files are saved
directory = os.path.join('train_data', 'test')

In [32]:
directory2=os.path.join('new_data', '5')

In [33]:
os.makedirs(directory2)
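
The cell above creates one class folder at a time and raises an error if the folder already exists. A small sketch of creating all four class folders in one go is shown here; the helper name `make_class_dirs` is hypothetical, and the demo runs in a temporary directory so it does not touch `new_data`.

```python
import os
import tempfile

def make_class_dirs(root, class_ids):
    """Create root/<class> for every class id; safe to call repeatedly."""
    paths = []
    for cls in class_ids:
        path = os.path.join(root, str(cls))
        os.makedirs(path, exist_ok=True)  # no error if the folder is already there
        paths.append(path)
    return paths

# Demonstrated in a temporary directory; in the notebook the root would be 'new_data'.
with tempfile.TemporaryDirectory() as tmp:
    created = make_class_dirs(os.path.join(tmp, "new_data"), [1, 2, 4, 5])
    created_names = [os.path.basename(p) for p in created]
    all_exist = all(os.path.isdir(p) for p in created)

print(created_names, all_exist)  # ['1', '2', '4', '5'] True
```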

In [23]:
classes = {1, 2, 4, 5}

In [40]:
class_1_cnt = 1438
class_2_cnt = 366
class_4_cnt = 293
class_5_cnt = 791

for i in range(105276, 110000):
    
    if((i%1000)==0):
        print(i)
        print(classes)
        print("Class 1 left:", class_1_cnt)
        print("Class 2 left:", class_2_cnt)
        print("Class 4 left:", class_4_cnt)
        print("Class 5 left:", class_5_cnt)
    
    to_check = 'data/metadata/' + str(i) + '.json'
    if to_check in d['1'] or to_check in d['2'] or to_check in d['3'] or to_check in d['4'] or to_check in d['5']:
        #print("Have already: ",to_check)
        continue
    else:
        s3_path = to_check[5:]
        #print(s3_path)
        file_name = str(i) + '.json'
        s3_client.download_file('aft-vbi-pds', s3_path,
                             os.path.join(directory, file_name))
        with open(os.path.join(directory, file_name), 'r') as f:
            d_temp = json.load(f)
        if d_temp["EXPECTED_QUANTITY"] not in classes:
            print(d_temp["EXPECTED_QUANTITY"])
            continue
        else:
            qty = d_temp["EXPECTED_QUANTITY"]
            print("FOUND: ", qty)
            if(qty==1):
                class_1_cnt-=1
                if(class_1_cnt == 0):
                    classes.remove(1)
            elif(qty==2):
                class_2_cnt-=1
                if(class_2_cnt == 0):
                    classes.remove(2)
            elif(qty==4):
                class_4_cnt-=1
                if(class_4_cnt == 0):
                    classes.remove(4)
            else:
                class_5_cnt-=1
                if(class_5_cnt == 0):
                    classes.remove(5)
                    
            image_name = str(i) + '.jpg'
            down_directory = os.path.join('new_data', str(d_temp["EXPECTED_QUANTITY"]))
            # Build the S3 key with '/', since os.path.join would use '\' on Windows
            s3_client.download_file('aft-vbi-pds', 'bin-images/' + image_name,
                             os.path.join(down_directory, image_name))
            
            if(class_1_cnt ==0 and class_2_cnt==0 and class_4_cnt==0 and class_5_cnt==0):
                print("ALL DONE")
                break
            

FOUND:  5
FOUND:  2
FOUND:  1
3
FOUND:  4
FOUND:  4
FOUND:  2
FOUND:  4
FOUND:  1
FOUND:  4
8
7
6
FOUND:  4
FOUND:  2
3
FOUND:  2
3
FOUND:  4
8
7
FOUND:  4
FOUND:  4
FOUND:  2
6
FOUND:  2
6
3
11
FOUND:  4
3
FOUND:  2
9
8
7
3
FOUND:  2
FOUND:  5
FOUND:  5
FOUND:  4
12
10
6
8
3
7
9
7
FOUND:  4
FOUND:  1
FOUND:  2
FOUND:  4
3
6
FOUND:  5
7
6
7
FOUND:  1
3
7
FOUND:  2
FOUND:  1
3
9
FOUND:  2
3
FOUND:  2
FOUND:  4
FOUND:  2
3
FOUND:  2
FOUND:  4
3
3
7
FOUND:  5
FOUND:  4
FOUND:  4
FOUND:  5
FOUND:  2
0
0
6
6
FOUND:  5
9
3
6
FOUND:  5
13
11
FOUND:  5
3
3
FOUND:  5
8
11
FOUND:  1
0
7
FOUND:  1
3
3
FOUND:  4
7
6
FOUND:  5
3
FOUND:  2
6
FOUND:  5
FOUND:  2
3
FOUND:  2
3
FOUND:  2
6
9
10
6
FOUND:  5
15
17
7
6
FOUND:  5
FOUND:  4
6
FOUND:  5
21
FOUND:  4
3
9
8
12
11
FOUND:  4
FOUND:  5
FOUND:  4
FOUND:  4
3
3
FOUND:  5
8
7
13
6
FOUND:  1
3
3
FOUND:  4
FOUND:  4
6
FOUND:  5
7
7
FOUND:  5
7
10
3
FOUND:  1
10
8
10
9
FOUND:  2
3
FOUND:  4
FOUND:  2
FOUND:  4
FOUND:  2
3
FOUND:  4
3
FOUND:  2
FOUND:  

### Got Required Number of Images for Classes 2 and 4!

In [41]:
# After Running my Script from 100000 to 110000
print(classes)
print("Class 1 left:", class_1_cnt)
print("Class 2 left:", class_2_cnt)
print("Class 4 left:", class_4_cnt)
print("Class 5 left:", class_5_cnt)

{1, 5}
Class 1 left: 1084
Class 2 left: 0
Class 4 left: 0
Class 5 left: 264


### Trying a New Range to Fill the Remaining Classes

In [46]:
for i in range(110000, 120000):
    
    if((i%1000)==0):
        print(i)
        print(classes)
        print("Class 1 left:", class_1_cnt)
        print("Class 2 left:", class_2_cnt)
        print("Class 4 left:", class_4_cnt)
        print("Class 5 left:", class_5_cnt)
    
    to_check = 'data/metadata/' + str(i) + '.json'
    if to_check in d['1'] or to_check in d['2'] or to_check in d['3'] or to_check in d['4'] or to_check in d['5']:
        #print("Have already: ",to_check)
        continue
    else:
        s3_path = to_check[5:]
        #print(s3_path)
        file_name = str(i) + '.json'
        s3_client.download_file('aft-vbi-pds', s3_path,
                             os.path.join(directory, file_name))
        with open(os.path.join(directory, file_name), 'r') as f:
            d_temp = json.load(f)
        if d_temp["EXPECTED_QUANTITY"] not in classes:
            print(d_temp["EXPECTED_QUANTITY"])
            continue
        else:
            qty = d_temp["EXPECTED_QUANTITY"]
            print("FOUND: ", qty)
            if(qty==1):
                class_1_cnt-=1
                if(class_1_cnt == 0):
                    classes.remove(1)
            elif(qty==2):
                class_2_cnt-=1
                if(class_2_cnt == 0):
                    classes.remove(2)
            elif(qty==4):
                class_4_cnt-=1
                if(class_4_cnt == 0):
                    classes.remove(4)
            else:
                class_5_cnt-=1
                if(class_5_cnt == 0):
                    classes.remove(5)
                    
            image_name = str(i) + '.jpg'
            down_directory = os.path.join('new_data', str(d_temp["EXPECTED_QUANTITY"]))
            # Build the S3 key with '/', since os.path.join would use '\' on Windows
            s3_client.download_file('aft-vbi-pds', 'bin-images/' + image_name,
                             os.path.join(down_directory, image_name))
            
            if(class_1_cnt ==0 and class_2_cnt==0 and class_4_cnt==0 and class_5_cnt==0):
                print("ALL DONE")
                break
            

110000
{1, 5}
Class 1 left: 1084
Class 2 left: 0
Class 4 left: 0
Class 5 left: 264
FOUND:  1
FOUND:  1
7
6
21
20
8
7
FOUND:  1
2
3
4
FOUND:  5
6
2
6
FOUND:  5
FOUND:  5
4
4
3
4
3
4
4
3
12
7
6
7
6
2
FOUND:  1
FOUND:  5
4
4
3
8
10
2
3
12
2
FOUND:  5
4
3
17
16
3
2
13
FOUND:  5
6
FOUND:  5
8
7
3
6
3
6
3
FOUND:  5
7
6
4
6
7
FOUND:  5
4
3
3
2
2
FOUND:  1
9
8
9
9
FOUND:  5
4
15
FOUND:  1
6
FOUND:  5
6
FOUND:  5
6
FOUND:  5
17
10
8
8
6
8
7
4
2
3
3
2
4
FOUND:  5
19
18
7
6
0
27
10
15
3
4
9
8
FOUND:  1
3
3
2
FOUND:  1
6
20
19
4
3
4
3
17
FOUND:  5
FOUND:  1
3
FOUND:  5
FOUND:  1
2
FOUND:  1
0
2
FOUND:  1
2
FOUND:  1
FOUND:  5
3
0
4
FOUND:  5
4
FOUND:  1
4
6
4
3
3
4
2
9
FOUND:  5
FOUND:  5
9
7
6
FOUND:  5
7
4
FOUND:  5
9
3
3
4
2
FOUND:  1
3
FOUND:  5
4
FOUND:  1
2
10
9
8
2
FOUND:  5
9
10
2
3
FOUND:  1
FOUND:  1
8
7
6
8
7
3
4
4
11
10
7
3
4
3
4
3
4
3
6
FOUND:  5
4
2
3
4
2
FOUND:  5
FOUND:  5
8
3
10
3
9
8
15
3
2
3
FOUND:  5
14
11
2
3
4
3
3
7
9
2
0
FOUND:  1
4
FOUND:  5
4
3
3
FOUND:  1
7
6
3
12
17
16
F

### Class 5 Also Filled! Only Class 1 Left

In [47]:
# After Running my Script from 110000 to 120000
print(classes)
print("Class 1 left:", class_1_cnt)
print("Class 2 left:", class_2_cnt)
print("Class 4 left:", class_4_cnt)
print("Class 5 left:", class_5_cnt)

{1}
Class 1 left: 316
Class 2 left: 0
Class 4 left: 0
Class 5 left: 0


In [52]:
classes = {1}
class_1_cnt = 316

In [53]:
for i in range(120000, 130000):
    
    if((i%1000)==0):
        print(i)
        print(classes)
        print("Class 1 left:", class_1_cnt)
        print("Class 2 left:", class_2_cnt)
        print("Class 4 left:", class_4_cnt)
        print("Class 5 left:", class_5_cnt)
    
    to_check = 'data/metadata/' + str(i) + '.json'
    if to_check in d['1'] or to_check in d['2'] or to_check in d['3'] or to_check in d['4'] or to_check in d['5']:
        #print("Have already: ",to_check)
        continue
    else:
        s3_path = to_check[5:]
        #print(s3_path)
        file_name = str(i) + '.json'
        s3_client.download_file('aft-vbi-pds', s3_path,
                             os.path.join(directory, file_name))
        with open(os.path.join(directory, file_name), 'r') as f:
            d_temp = json.load(f)
        if d_temp["EXPECTED_QUANTITY"] not in classes:
            print(d_temp["EXPECTED_QUANTITY"])
            continue
        else:
            qty = d_temp["EXPECTED_QUANTITY"]
            print("FOUND: ", qty)
            if(qty==1):
                class_1_cnt-=1
                if(class_1_cnt == 0):
                    classes.remove(1)
            elif(qty==2):
                class_2_cnt-=1
                if(class_2_cnt == 0):
                    classes.remove(2)
            elif(qty==4):
                class_4_cnt-=1
                if(class_4_cnt == 0):
                    classes.remove(4)
            else:
                class_5_cnt-=1
                if(class_5_cnt == 0):
                    classes.remove(5)
                    
            image_name = str(i) + '.jpg'
            down_directory = os.path.join('new_data', str(d_temp["EXPECTED_QUANTITY"]))
            # Build the S3 key with '/', since os.path.join would use '\' on Windows
            s3_client.download_file('aft-vbi-pds', 'bin-images/' + image_name,
                             os.path.join(down_directory, image_name))
            
            if(class_1_cnt ==0 and class_2_cnt==0 and class_4_cnt==0 and class_5_cnt==0):
                print("ALL DONE")
                break
            

120000
{1}
Class 1 left: 316
Class 2 left: 0
Class 4 left: 0
Class 5 left: 0
FOUND:  1
7
8
6
2
3
3
4
2
4
4
19
18
3
6
5
4
5
7
18
20
6
5
2
FOUND:  1
4
2
3
2
FOUND:  1
3
FOUND:  1
2
3
6
5
4
3
8
7
FOUND:  1
2
4
3
6
5
4
8
4
3
FOUND:  1
9
5
0
24
7
6
3
5
6
FOUND:  1
6
5
5
4
3
2
3
3
2
FOUND:  1
4
3
9
8
7
4
3
7
8
8
10
5
3
3
9
8
5
4
15
10
9
10
18
17
7
6
3
6
5
7
9
6
10
3
3
4
8
7
11
8
4
0
3
4
13
10
5
4
5
4
2
2
3
3
7
5
2
3
2
4
5
5
4
5
3
9
8
10
9
0
3
7
8
FOUND:  1
2
3
5
7
FOUND:  1
3
4
7
8
7
2
4
4
3
7
6
3
2
3
4
3
25
24
23
0
FOUND:  1
2
5
2
FOUND:  1
4
6
5
5
3
3
4
4
0
5
5
4
6
3
4
4
3
6
5
7
6
7
6
3
2
5
4
3
6
5
5
10
3
5
4
4
6
5
FOUND:  1
8
5
7
12
11
7
6
3
2
FOUND:  1
FOUND:  1
10
4
3
2
4
7
5
8
7
6
5
8
3
4
9
7
11
10
0
2
12
11
9
6
2
4
2
8
19
2
3
8
2
5
4
6
3
2
3
2
5
10
9
3
4
5
3
9
8
7
6
5
FOUND:  1
6
7
9
3
2
2
4
2
3
8
7
5
4
5
4
FOUND:  1
FOUND:  1
9
5
5
4
3
4
FOUND:  1
0
2
5
6
5
6
FOUND:  1
2
2
5
4
2
2
FOUND:  1
2
FOUND:  1
3
8
9
8
6
FOUND:  1
2
5
3
2
5
4
0
3
3
2
4
3
0
14
13
4
5
7
6
7
8
FOUND:  1
5
3
2
FO

In [54]:
print(classes)
print("Class 1 left:", class_1_cnt)
print("Class 2 left:", class_2_cnt)
print("Class 4 left:", class_4_cnt)
print("Class 5 left:", class_5_cnt)

set()
Class 1 left: 0
Class 2 left: 0
Class 4 left: 0
Class 5 left: 0
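
The counters reaching zero can be cross-checked against what is actually on disk. The following is a rough sketch, assuming the images were saved as `.jpg` files under one folder per class as in the script above; the helper name `count_images` is made up, and the demo uses a temporary directory with empty placeholder files rather than the real `new_data` folder.

```python
import os
import tempfile

def count_images(root):
    """Count the .jpg files in each class subfolder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            counts[name] = sum(
                1 for f in os.listdir(folder) if f.lower().endswith(".jpg")
            )
    return counts

# Demo on a throwaway folder layout with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for cls, n in {"1": 3, "5": 2}.items():
        os.makedirs(os.path.join(tmp, cls))
        for k in range(n):
            open(os.path.join(tmp, cls, f"{k}.jpg"), "w").close()
    demo_counts = count_images(tmp)

print(demo_counts)  # {'1': 3, '5': 2}
```

In the notebook, `count_images('new_data')` could then be compared with the per-class totals computed earlier.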


## ALL DONE! GOT NEW DATA FOR ALL CLASSES!