# Introduction

## Project Plan

<center>How can computers help us in our fight against Climate Change?</center>

I wanted to explore the question above by combining a few of my passions and seeing how I could contribute to solving this issue. I plan on creating a machine learning algorithm, employing a Convolutional Neural Network, to accurately detect and sort recyclable waste from other junk. This project has three key aspects:  

1. Collecting images to train

2. Creating a model to accurately detect recyclable waste  

3. Adding a live camera feed component, to show in real time whether the object is recyclable or not.

For now, the scope is limited to only soda cans, due to time restrictions, but I have plans to expand it to support paper products and plastic bottles. I also plan on adding in my experience with robotics to make a physical device to sort automatically, so students can simply place the object in front of the camera and the machine will properly dispose of the waste for them.

## Project Inspiration

Hong Kong in general doesn't have the greatest track record with the environment. It is notorious for bad air quality and has one of the lowest recycling rates in the world despite being so developed. As a city, we should be at the forefront of the fight against climate change. But quite the opposite is true, and I think we must all push to change that narrative.  

On a more personal note, I've been passionate about the environment since a young age. I have been an avid member of the sustainability club at my school since 6th grade, did my BSA Eagle Project in 8th grade focused on reducing plastic waste in our community, and have led my school's participation in the Hong Kong 10 Tonne Challenge. Further, I've co-founded a club at my school to tackle the issue of e-waste in our community. The environment is something I'm extremely passionate about, and I wanted to see how I could do my bit to help.

## Purpose of this Report  

Having completed the first aspect of the project, I wanted to document my process, track my progress, and have a representation of all the work I put into this. This report goes in-depth into what I was looking for in my data, the various problems I faced and how I overcame them in sourcing my data, as well as how these images will fit into the next steps of my project.

## Report Outline

After this introduction, this report goes into how I:  

1. Established parameters and guidelines for my dataset

* Requirements that my dataset needed to meet
* Methodology of sourcing the data
* How I will work with this data
* How I will prepare this data to be optimal for my use  

2. Scraped various websites for my data  
3. Evaluated the strengths and weaknesses of the images I had
4. Plan on deploying this data into my machine learning algorithm  

I also include additional thoughts at the end on how I think my data currently fulfills my needs, and what challenges I expect to encounter.

# Dataset Needs

## Overview

When I first started thinking about the data I would need, I didn't quite know what I was getting myself into. I'm still relatively new to Python, and I'd never done anything with web scraping or image downloading before. I initially found various pre-built tools online, such as a Google Images scraper. However, many websites keep making it harder to scrape data from them, and as a result, I really didn't realize what a challenge getting my images was going to be.

## Requirements of Dataset

I had done a few machine learning projects prior to this. I'd made one model to predict the direction of a stock given historical data, and another to detect cancerous cells. Even though the dataset had been provided for both of them, they gave me valuable insight into exactly what kind of data an algorithm needs to be effective. After revisiting my previous projects, I came up with the following criteria for the dataset I was putting together:  

1. **Sufficient Quantity:** I wanted to ensure that I had enough data to actually train my model. With the cancer cell project, I had almost 2000 images, and as a result, I wanted to make sure I had a similar number of images at the minimum. 
 
2. **Detail Rich:** One of the hallmarks of a Convolutional Neural Network (referred to as CNN or ConvNet in this document) is its ability to detect individual features amongst a wide array of images. However, the features cannot be detected if they don't exist in the first place. I wanted to ensure that the images I was getting were actually helpful in training the network, and weren't just random images of cans.  

3. **Highly Versatile:** If my end goal is to be able to attach a live camera feed and detect real life cans, I thought that my dataset would really need to cover a large variety of color conditions, lighting conditions, etc. I don't know what equipment I will end up working with, and I don't know where I would set it up. I wanted to make sure that my model would be accurate regardless of the environment it is implemented in, and to achieve that, I had to have a diverse dataset.  

With these parameters established, I had a good set of goals to meet, and was able to focus on images that fulfilled at least one of these goals.

### Example of Goal 2:

!["text"](29.jpg)

This image is perfect for my neural network, as it explicitly shows one of the key features of a soda can: the tab.

### Example of Goal 3:

![text](Flickr150.jpg)

The image above, although not a great image on its own, actually serves an extremely valuable purpose for my project. The lighting is quite dark and the colors are undersaturated. The very things that make it a not-so-great photo will make my model stronger.

## Sourcing the Dataset  

As mentioned earlier, I thought that collecting my data would be as easy as providing my search query into a tool created by someone else, and that a lot of my programming would be focused on developing, tuning, and refining my machine learning model. However, this was most certainly not the case.  

In order to get my data, I had to create various web scrapers, sifting through HTML pages of various sources, and finding the image link to download.  

As for the sources themselves, I picked three: CAN Stock Images, iStock Images, and Flickr Photos. I go much more in depth on why I picked these three, as well as my process of scraping for the pictures and downloading them in the next section.

## Working with the Dataset

Once I have the images downloaded, working with them should be straightforward. I have experience converting images and storing them in tensors for easy use with machine learning models.  

All digital images can be represented with a few different values determining each pixel. When these are put together, they create a specific signal that your computer uses to display it. Our process would just be undoing this, and getting to those raw numbers so we can feed them into the neural network and train it.
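As a rough illustration of what "getting to those raw numbers" means, the sketch below treats a tiny image as a grid of (red, green, blue) values from 0 to 255 and rescales each one to the 0 to 1 range that neural networks are usually trained on. The `normalize_pixels` helper and the 2x2 sample image are made up for this example; in the actual project an imaging library would read the pixels from the downloaded files.

```python
# Hypothetical helper: rescale 8-bit RGB values (0-255) to floats in [0, 1].
def normalize_pixels(image):
    return [
        [tuple(channel / 255 for channel in pixel) for pixel in row]
        for row in image
    ]


# A made-up 2x2 image: each pixel is (red, green, blue).
sample_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

normalized = normalize_pixels(sample_image)
print(normalized[0][0])  # the pure red pixel, now as floats
```

The nested structure (rows, then pixels, then channels) is the same shape a tensor of an image would have, just written with plain Python lists.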

## Preparing my Dataset

I also discuss this in a bit more detail later in this report. However, in order to be consistent with Goals 1 and 3, I will also be modifying my dataset before feeding it into my model. I will be applying filters, transformations, and other image manipulations onto each image, to make my dataset more vast, more versatile, and more equipped to accurately detect cans. These forms of image augmentation also help prevent overfitting to the data, and further increase model accuracy through that. This process will be done at a later stage in the project, when I'm actually putting together and making my neural network.
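To give a feel for how augmentation multiplies a dataset, here is a minimal sketch that turns one image into four variants using a horizontal flip and two brightness changes. It works on a plain list-of-pixels image so that it runs without any extra packages; the helper names and the sample pixels are assumptions for illustration, and the real pipeline would more likely rely on an imaging or deep learning library for these transformations.

```python
# Hypothetical helpers operating on an image stored as rows of (r, g, b) pixels.
def flip_horizontal(image):
    # Mirror each row left-to-right.
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    # Scale every channel by `factor`, clamping to the valid 0-255 range.
    return [
        [tuple(min(255, max(0, round(c * factor))) for c in pixel) for pixel in row]
        for row in image
    ]


def augment(image):
    # One original image becomes several training variants.
    return [
        image,
        flip_horizontal(image),
        adjust_brightness(image, 0.5),  # darker copy
        adjust_brightness(image, 1.5),  # brighter copy
    ]


sample_image = [[(200, 100, 50), (10, 20, 30)]]
variants = augment(sample_image)
print(len(variants))
```

The darker and brighter copies are the kind of variation that should help with Goal 3, since the camera's lighting conditions are not known in advance.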

# Dataset Scraping

## Overview

In this section, I will be discussing the process I took to actually obtain my dataset. After identifying what exactly I needed from my dataset, I had to select my sources. After a bit of consideration, I decided to scrape images off of three websites: CAN Stock, iStock, and Flickr. I chose the first two sites since the stock images would provide clean, well-lit, clear images with easily identifiable features (i.e. the pull tab, the lip, the concave shape of the base). On the other hand, with Flickr being user-uploaded images, I thought they would provide a bit more of a "real-life" aspect, from natural lighting to crumpled cans.

Then, I will also detail how I made my dataset of "Not Cans". For this, I ended up creating various random noise images and exporting them, to be easily accessible and for us to be able to see them.

## Images of Soda Cans

### Importing Packages
As with any program, the first step is to import all the packages we will be using. 

| Package | Use  |
| :---: | :--- |
| time | We will be using the 'time' package in order to set timeouts for various loops, as well as give the pages that we visit time to load  |
| selenium | The 'selenium' package will be used for the purpose of emulating a web browser. For my personal use case, I used Google Chrome since I already had it installed. A little prior setup is required, which I will detail in next steps, but the package itself is used to emulate and automate a web browser, run commands like scrolling, clicking, text input, as well as retrieve the HTML code from it  |
| bs4 | 'bs4', better known as BeautifulSoup4, is a great package when dealing with HTML. It allows you to efficiently sort through the source code, extract various tags and properties of said tags, and other analysis of HTML code  |
| urllib | 'urllib' is what I originally used for downloading images. It works very smoothly for CAN Stock and iStock images, and it is quite intuitive. However, it does have its limitations, and for Flickr I had to employ two other packages  |
| requests | This was one of the packages I had to switch to in order to download images from Flickr. With the way that I scraped these images, I had to navigate to the link of the image itself, and the way 'urllib' does this is forbidden by Flickr. As a result, I used 'requests' to access the images and download them  |
| shutil | Last but not least, 'shutil' was used in conjunction with 'requests' for Flickr. This package is extremely useful for working with your file directory, and I used it to copy the image from the request into the directory.  |

In [None]:
# General Code Requirements
import time

# Automated Web Browser and HTML Manipulation
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# Downloading Images from CAN and iStock
import urllib.request

# Downloading Images from Flickr
import requests
import shutil

### Setting up the WebDriver

Having imported all our packages, the next step is to initialize and create an instance of the automated web browser we will be using. First, we configure some of the settings of the driver. Setting the 'headless' value of options to True means that it will run in the background, without rendering the browser itself. Although setting it to False was useful when I was debugging the program, setting it to True when actually running the full code allows it to be a bit faster and more resource-efficient.

We must also specify the window size of the driver. Since my MacBook has a screen resolution of 1920 x 1200, I just set it as that, but the resolution itself shouldn't make a big difference, and can easily be adjusted to meet the resolution of your device.

The next step is to create an instance of the WebDriver. First we must point the WebDriver to the path of the driver. As I mentioned earlier, I personally am using Chrome, and as a result had to install the Chrome driver from the Chromium Open Source Project. The driver can be found here: https://chromedriver.chromium.org

Lastly, we create a WebDriver instance, and assign it to the variable "driver".

In [None]:

# Configuring the preferences for our WebDriver
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

# Locating the path for the driver, and creating a WebDriver object
DRIVER_PATH = '/Users/anonymousvikram/chromedriver'
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

### Retrieving the HTML and Image Links from each page

Now that our WebDriver has been set up, we get to the fun part! Each site has a different process. However, CAN Stock and iStock are quite similar, with just slight tweaks in the code. So I've split up this section to discuss the two together, and then Flickr separately. Note that Flickr was a bit of a challenge due to the way it displays images, so the code for that is quite a bit more extensive and jumps through more hurdles.


#### CAN Stock Images and iStock Images  
Firstly, I'll talk through the process of getting the HTML and separating the image links for CAN Stock and iStock. In this step, there are only two places where the code differs, and I'll make sure to mention them in my explanation as well as mark them with comments in the code below.  

We begin by navigating to the web page on the driver we set up. This is the first point of difference. Since the two sites are obviously different, the URL you navigate to for each one is different. However, the command is still the same, and the next steps are also identical. Then, we run a while loop for 15 seconds, during which the driver just keeps scrolling down. This ensures that all the images on the first page load, so that we can extract their URLs and download them. It is certainly possible to navigate to further pages, but I thought that the data I got from this was sufficient.  

After we scroll to the bottom of the page and the while loop has finished, we get the HTML code of the page by accessing the 'page_source' attribute of the driver, and storing that in a variable called 'source_code'. Then, we quit the driver to save resources, and print "Completed Selenium" to provide a status update for the user. This has now given us the HTML of the page.

In [None]:

# The next line is for scraping from CAN Stock Images
driver.get("https://www.canstockphoto.com/images-photos/soda-cans.html")

# The next line is for scraping from iStock Images
driver.get("https://www.istockphoto.com/hk/圖片/soda-cans?mediatype=photography&phrase=soda%20cans&sort=mostpopular")

# Scroll to the bottom of the page
start = time.time()
while time.time() <= start + 15:
    driver.execute_script("window.scrollBy(0, 1000)")

# Store the HTML of the page and quit the driver
source_code = driver.page_source
driver.quit()

# Inform user of progress
print("Completed Selenium")

# Create BeautifulSoup object from the HTML code
soup = BeautifulSoup(source_code)

# The next line is for scraping from CAN Stock Images
images = soup.find_all("article")

# The next line is for scraping from iStock Images
images = soup.find_all("a", {"class": "gallery-mosaic-asset__link"})

However, the HTML on its own is not very helpful for our purpose. We have to sift through the code and locate the specific URLs that will allow us to download the images. That's where BeautifulSoup comes in. We create a new BeautifulSoup instance, passing it the 'source_code' we just extracted, and storing this BeautifulSoup instance in a variable called 'soup'. 


The final step is once again different between CAN Stock Images and iStock Images. The two display search results differently, and as a result the HTML code must be sorted in different ways. In the case of CAN Stock, we use the 'BeautifulSoup.find_all()' method on 'soup', and search for all "article" tags. For iStock, however, we use the same method on the same object, but instead search for all "a" tags that have their "class" attribute set to "gallery-mosaic-asset__link". 

This method returns an iterable object, of which each entry consists of a string of text with the general location of the URL. We will run further bs4 methods on this string in a later step, to pinpoint the image URL and download it.
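The idea of narrowing a page down to the image links inside particular tags can be sketched without any third-party packages. The example below uses Python's built-in `html.parser` instead of BeautifulSoup (a deliberate swap, so the sketch runs anywhere) and collects the `src` of every `<img>` nested inside an `<article>`, similar in spirit to the CAN Stock case. The class name `ArticleImageCollector` and the sample HTML are made up for illustration and do not reflect the real page markup.

```python
from html.parser import HTMLParser


class ArticleImageCollector(HTMLParser):
    """Collect the src of every <img> that appears inside an <article>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "img" and self.article_depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


# Made-up HTML standing in for a search results page.
sample_html = """
<article><a href="/photo/1"><img src="https://example.com/can1.jpg"></a></article>
<img src="https://example.com/logo.png">
<article><a href="/photo/2"><img src="https://example.com/can2.jpg"></a></article>
"""

collector = ArticleImageCollector()
collector.feed(sample_html)
print(collector.image_urls)
```

Note that the logo image is skipped because it sits outside any `<article>`, which is the same reason the scraper searches for a specific container tag rather than for every image on the page.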

#### Flickr Images Part One - Accepting Cookies  
Now, with Flickr Images, there were countless hurdles I had to jump through to scrape images off their website. This next code block tackles the first such challenge. As with the previous example, the first step is to navigate to the web page with the results. However, this is where we diverge a bit. On Flickr, upon visiting the website for the first time, you are presented with a cookies dialog, asking for permission to use cookies to enhance your web experience. However, sometimes the dialog is slow to appear. So, we do a few steps:  

1. We first scroll down slightly, to make sure the dialog appears.  

2. We *look* for the button to "Accept All Cookies". The key step here is that we check that the button is indeed there. Without the if statement, if the browser already had cookie settings saved, the code would raise a NoSuchElementException and stop immediately.  

3. Once found, we inform the user that the button is indeed found, just as a bit of feedback in case the code is running in headless mode.  

4. Now that we know the button exists, we use the "find_element_by_xpath()" method on our driver to locate the button. Notice the difference between the two commands. When looking for it, we use "find_element**s**_by_xpath()". This method returns a list of all matches, so if nothing is found it simply returns an empty list instead of raising an exception.  

5. Lastly, we inform the user that the code has successfully found and clicked the button to agree to all cookies.  

After accepting cookies, Flickr actually reloads its page, so we make our code wait for 10 seconds to let that process take place.

In [None]:
# Navigate to the Web Page
driver.get("https://www.flickr.com/search/?text=soda%20can&view_all=1")

# Timeout to let the page load
time.sleep(5)

# Small scroll to ensure cookies dialog appears
driver.execute_script("window.scrollBy(0, 100)")

# Look for Button to Accept Cookies, without interrupting the code in case no such button is found
if(len(driver.find_elements_by_xpath("//button[@id='truste-consent-button']")) != 0):

    # Update the user on progress
    print("Found button")

    # Navigate to the Accept Button and Click it
    driver.find_element_by_xpath("//button[@id='truste-consent-button']").click()

    # Update the user on progress
    print("Agreed")

# Small timeout to let the page refresh after accepting cookies
time.sleep(10)

#### Flickr Images Part Two - Scrolling, Clicking, and Much More  

This next part is significantly more complicated than what I had to do for the other two sources. A variety of factors make scraping images off of Flickr challenging. 

The first is that results are displayed in an infinite scroll, rather than over multiple pages. This means that scraping is no longer just scroll to the bottom of the page, save HTML, and go to page 2. Further, when you scroll far enough, Flickr actually starts unloading the images near the top to keep your experience smooth. Although great for user experience, when automating, it turns out to be a real challenge for getting each image URL. Lastly, although Flickr uses an "infinite scroll", after a while it starts making you click a "Load More Results" button, so I had to implement a solution for that as well.  

It's clear that all of these problems are quite critical to the success of the scraper and each one must be dealt with. For the issue of the infinite scroll, I implemented a few procedures. First, I extract the HTML of the page every 5 scrolls, rather than after scrolling to the very bottom. This deals with the issue of images at the top unloading. However, it also creates a new problem: it's extremely easy to start getting duplicate images with this method. So each time, I process the HTML immediately and extract the URL of each lead. Then, I store that URL in an array, so that it can be referred to later when actually downloading the images. But the array also serves a second purpose: I can use it to cross-reference already processed images and prevent duplicates! Last but not least, Flickr's "Load More Results" button changes between two possibilities. So I implement code similar to the cookie check, to look for either version of the load more button and, if it is found, to click on it.

When all of this is put together, although my loop runs for longer, it quite efficiently processes and stores the image leads into an array, which can be easily iterated through, tracked down, and downloaded from.

Although it seems like a lot of work, the sheer number of images I got from this process made it well worth it.
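The duplicate check described above can be sketched in isolation. This version keeps a `set` alongside the list so that each "have I seen this lead?" check is a single lookup rather than a scan of every stored lead; that is a small variation on the approach in this report, not the code that was actually run. The function name and the sample hrefs are hypothetical.

```python
def add_new_leads(image_leads, seen, hrefs):
    """Append unseen hrefs to image_leads; return how many were added."""
    added = 0
    for href in hrefs:
        full_url = "https://www.flickr.com" + href
        if full_url in seen:
            continue  # already collected on an earlier pass
        seen.add(full_url)
        image_leads.append(full_url)
        added += 1
    return added


image_leads, seen = [], set()
# Two overlapping "passes", as happens when the page is re-read after scrolling.
first = add_new_leads(image_leads, seen, ["/photos/a/1/", "/photos/b/2/"])
second = add_new_leads(image_leads, seen, ["/photos/b/2/", "/photos/c/3/"])
print(first, second, len(image_leads))
```

The list preserves the order in which leads were found, which keeps the later download numbering stable, while the set only exists to make the repeat check fast.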

In [None]:
# Initializing array for leads to be used later when downloading, as well as to prevent repeats
imageLeads = []


start = time.time()

# Update the user on progress
print("starting")

counter = 0

# Run the 'infinite' scroll for 5 minutes
while time.time() <= start + 300:

    # Scroll down the page
    driver.execute_script("window.scrollBy(0, 12000)")

    # Update the user on progress
    print("Scroll")

    # Short delay to allow results to load
    time.sleep(0.5)

    # Clicking "Load More" button
    if(len(driver.find_elements_by_xpath("//button[@class='alt']")) != 0):
            
            driver.find_element_by_xpath("//button[@class='alt']").click()
            
            # Update the user on progress
            print("load more")

    # Clicking "Load More" button as well
    if(len(driver.find_elements_by_xpath("//button[@class='alt no-outline']")) != 0):
        
        driver.find_element_by_xpath("//button[@class='alt no-outline']").click()
        
        # Update the user on progress
        print("load more")
    
    # Update the number of times scrolled
    counter = counter + 1
    
    # Extract HTML and get URLs every 5 scrolls
    if(counter == 5):

        # Update the user on progress
        print("Loading pictures now")

        # Delay to let results load
        time.sleep(15)

        # Update the user on progress
        print("Done Loading")

        # Extracting HTML from the Webpage
        plainText = driver.page_source

        # Creating a BeautifulSoup object with the HTML extracted
        soupTemp = BeautifulSoup(plainText)

        # Find all links to images in the HTML and store it in an iterable object
        imageIteration1 = soupTemp.find_all("a", {"class": "overlay"})

        # Iterate through the iterable to extract the URLs
        for imageLead in imageIteration1:

            # Find and Create the image URL   
            href = imageLead.extract().get('href')
            href = "https://www.flickr.com" + href

            # Setting up a variable to determine if image has been found before
            isRepeat = False

            # Going through existing leads to see if image has been processed before
            for usedLead in imageLeads:
                if(usedLead == href):

                    # Setting the variable to true so that image isn't processed again
                    isRepeat = True
            
            if(isRepeat):
                
                # Update the user on progress
                print("Repeat image lead")
            
            else:

                # Update the user on progress
                print(href)
                
                # Adding the URL to the list to download later and prevent duplicates
                imageLeads.append(href)
        
        # Update the user on progress
        print(str(len(imageLeads)) + " image leads found so far")

        # Resetting Counter
        counter = 0

        # Update the user on progress
        print(str(time.time() - start) + " seconds")

# Update the user on progress
print(str(len(imageLeads)) + " image leads found")

# Quitting Selenium Driver
driver.quit()

# Small delay to allow system to recover
print("sleeping zzz")
time.sleep(10)

#### Flickr Images Part Three -  Getting Image URLs from Leads  

You may have noticed that in my previous explanation, I referred to the end result being an array of 'leads', rather than direct image URLs. There is actually a very good reason for this. With Flickr, the URL you can extract leads you directly to the user's photo album, with a highlight on the photo you selected. Downloading this would just be downloading a web page, rather than an image. So it is actually necessary to navigate to the URL we extracted, and run BeautifulSoup and sort through the HTML all over again, to ultimately get the image URL that we can download.

In order to implement this, I make a new Selenium instance using the same options as earlier, and iterate through all the leads we collected. For each lead, we look for the image, get its "src" property, and select all the text from the third character onwards (the first two are '//', which interfere with transforming the src into a standalone URL). We then add "https://" at the beginning of this src to get the final image URL. Then, we print it to update the user on the progress, store it into an array to be downloaded, and repeat.
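As a side note, the standard library can do this conversion without slicing the string by hand. The sketch below uses `urllib.parse.urljoin`, which resolves a "src" that starts with two slashes against the page's address and leaves an already complete URL unchanged. The helper name `to_absolute_url` and the sample image address are made up for illustration.

```python
from urllib.parse import urljoin


def to_absolute_url(src, page_url="https://www.flickr.com/"):
    # urljoin keeps the page's scheme for "//host/path" sources and
    # leaves already-absolute URLs untouched.
    return urljoin(page_url, src)


print(to_absolute_url("//live.staticflickr.com/123/photo.jpg"))
print(to_absolute_url("https://live.staticflickr.com/123/photo.jpg"))
```

Both calls print the same `https://` address, so this approach would keep working even if a page started returning full URLs in its "src" attributes.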


In [None]:
# Update the user on progress
print("Finding True Image URLs")

# Initializing empty array to store image urls
trueImageUrls = []

# Iteration variable to keep the user informed on progress
tempInt = 0

# Setting up new Selenium driver
driverImg = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

# Iterating through leads found in previous step
for href in imageLeads:

    # Going to the URL of the lead
    driverImg.get(href)

    # Storing the HTML of the page
    trueImageSource = driverImg.page_source

    # Creating a BeautifulSoup instance with the HTML to sort through
    trueImageSoup = BeautifulSoup(trueImageSource)

    # Looking for the first image tag in the code
    trueImageLink = trueImageSoup.find("img")

    # Getting the src property of the image, and selecting all characters from the third one
    trueHref = trueImageLink.get("src")[2:]

    # Finalizing image URL
    trueHref = "https://" + trueHref

    # Update the user on progress
    print(str(tempInt) + "/" + str(len(imageLeads)) + ": " + trueHref)
    
    # Storing image URL to be downloaded
    trueImageUrls.append(trueHref)

    # Increasing the progress iterator
    tempInt = tempInt + 1
print(str(len(trueImageUrls)) + " true images found")

### Downloading the Images from the Image Links  

This next section will cover how we download the images from the URLs we've extracted thus far. Again, CAN Stock and iStock were quite similar in this process, but Flickr needed its own approach, so I'll discuss both of the different methods.

#### CAN Stock Images and iStock Images  

In this case, the code once again differs in only two places. Downloading is made very straightforward by the urllib package. However, the differences come in narrowing down the actual URL of the image, and in saving it. Similar to last time, the tags that each website uses are slightly different, so the narrowing-down step changes accordingly. Then, when saving images, I changed the file names to differentiate between the two sources.

For the download itself, we use the 'request.urlretrieve()' function, providing both the link to download from as well as the path of the file to download to.
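The file naming scheme can be sketched on its own: a source prefix plus a running index, joined onto the download folder. The helper name `build_image_path` and the relative folder used here are assumptions for illustration; `PurePosixPath` is used so that the result looks the same on every operating system.

```python
from pathlib import PurePosixPath


def build_image_path(download_dir, source, index):
    # e.g. <download_dir>/canStock0.jpg or <download_dir>/iStock0.jpg
    return str(PurePosixPath(download_dir) / f"{source}{index}.jpg")


print(build_image_path("downloads/drink can single", "canStock", 0))
print(build_image_path("downloads/drink can single", "iStock", 0))
```

Keeping the source in the file name makes it easy to tell later which site an image came from, for example when comparing how stock and Flickr images affect the model.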

In [None]:
# Iteration Variable for naming file
i = 0

# Looping through all the images found in previous step
for link in images:

    # The next line is for downloading from CAN Stock Images
    href = link.extract().find("a").find("img").get("src")

    # The next line is for downloading from iStock Images
    href = link.extract().find("figure").find("img").get('src')

    print(href)

    # The next line is for downloading from CAN Stock Images
    full_name = "/Users/anonymousvikram/recyclingVision/downloads/drink can single/canStock" + str(i) + ".jpg"

    # The next line is for downloading from iStock Image
    full_name = "/Users/anonymousvikram/recyclingVision/downloads/drink can single/iStock" + str(i) + ".jpg"

    # Actually download the file
    urllib.request.urlretrieve(href, full_name)
    i += 1
# Let the user know how many images were downloaded
print(str(i) + " images processed")

#### Flickr Download  

Once again, Flickr was slightly more complicated, although altogether not too bad. I was unable to use urllib, since the way it accesses the image is forbidden by Flickr. However, by hacking together code from the 'requests' and 'shutil' packages, I was able to put together a workaround.

First, we define the path where the image will be saved. After doing that, we give the user a little progress update. Then, we use requests to get the image as a raw stream. We open a file at that path, tell requests to decode the content, and use shutil to copy the raw stream into the destination. We repeat this for all the URLs we extracted.
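The copying step can be tried without touching the network. In the sketch below, an in-memory `io.BytesIO` stream stands in for the raw response body (an assumption made purely so the example is self-contained), and `shutil.copyfileobj` writes it to a temporary file in chunks. The helper name `save_stream` and the fake bytes are hypothetical.

```python
import io
import os
import shutil
import tempfile


def save_stream(stream, destination_path):
    # Copy the stream to disk in chunks instead of loading it all into memory.
    with open(destination_path, "wb") as destination:
        shutil.copyfileobj(stream, destination)


# Made-up bytes standing in for the raw body of an image response.
fake_image_bytes = b"\xff\xd8\xff" + b"\x00" * 1024
target = os.path.join(tempfile.mkdtemp(), "Flickr0.jpg")
save_stream(io.BytesIO(fake_image_bytes), target)
print(os.path.getsize(target))
```

Because `copyfileobj` only needs an object it can read from, the same helper works whether the bytes come from a web response or from a test stream like this one.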

In [None]:
# Setting up iteration variable for progress and naming
tempInt = 0

# Iterating through all URLs collected
for imageUrl in trueImageUrls:

    # Determining path of the file
    full_name = "/Users/anonymousvikram/recyclingVision/downloads/drink can single/Flickr" + str(
            tempInt) + ".jpg"
    
    # Update the user on progress
    print(str(tempInt) + "/" + str(len(trueImageUrls)) + ": " + full_name[66:])

    # Get the image file using requests
    r = requests.get(imageUrl, stream=True)

    # Open a file at the destination
    with open(full_name, 'wb') as destination:
        
        # Decode the image file
        r.raw.decode_content = True

        # Copy the file into the directory
        shutil.copyfileobj(r.raw, destination)    
    
    # Increment the iterator
    tempInt += 1

# Let the user know how many images were downloaded
print(str(tempInt) + " images processed")

## Images of Not Soda Cans

Now that we have our dataset of soda cans, we have to generate images that aren't soda cans. Since we're using binary classification, the machine learning algorithm needs images of both labels to train correctly. Instead of scraping random images off the web, I decided to generate random noise images. These are created using random values in NumPy, converted into images, and saved so that we can visualize them and reuse them.

### Importing Packages  

This endeavor is far simpler than the previous one, and only requires two packages:

| Package | Use |
| :---: | :-- |
| NumPy | NumPy is probably one of the most famous packages for Python. It is extremely powerful, and augments what you can do with the language. For our use case, we will be using it to generate a 3D array of random values, which we will then use to convert into an image |
| Python Imaging Library | The Python Imaging Library, better known as PIL, will be used to convert our random numbers into actual images. It takes the three values for each pixel (red, green, and blue) and produces an image out of them. |

In [None]:
import numpy as np 
from PIL import Image

### Creating and Downloading Images

Now that our work environment is all set up, we can get started! I wanted to have an approximately even split between pictures of cans and pictures of not cans, so I ran this code 2000 times using the for loop. We begin by creating a 3D NumPy array of random values. The 400 and 400 mean that the image itself will be 400x400 pixels, and the 3 means that each pixel has 3 values: red, green, and blue. We also multiply this array by 255, since the np.random.rand() function generates values between 0 and 1, and those values have to be scaled up to the 0 to 255 range that pixel colors use.

We then use PIL to generate the image from the array. We specify that each value is of type uint8, and convert the result to the "RGB" (Red Green Blue) color mode. "RGB" is used rather than "RGBA" because the JPEG format has no alpha channel, so PIL refuses to save an "RGBA" image as a .jpg file.

Finally, we finish it off by saving it into the directory.

In [None]:
# Setting up a for loop to make 2000 images

for i in range(2000):
    
    # Generating random values to make an image out of
    imageArray = np.random.rand(400,400,3) * 255

    # Converting random values into an actual RGB image (JPEG cannot store alpha)
    image = Image.fromarray(imageArray.astype("uint8")).convert("RGB")

    # Saving that image into the directory
    image.save("/Users/anonymousvikram/recyclingVision/downloads/not drink can single/Flickr" + str(
            i) + ".jpg")

# Dataset Post-Analysis

## Strengths

I think one of the greatest strengths of my dataset is its sheer size. I have over 4000 images, which I will likely split into 3000 for training and 1000 for testing. The variety within my dataset is also very beneficial. As I mentioned earlier, I have two sources of stock photos, which have good lighting and coloring and show all the features very clearly. But I also have over 1500 images from Flickr that are a bit more realistic, with slightly imperfect cans and less-than-ideal lighting and color conditions.
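As a rough sketch of how a 3000/1000 split could be done, the snippet below shuffles a list of file paths and cuts it at 25%. The function name `split_dataset`, the seed, and the placeholder file names are assumptions for illustration only, not the code I will necessarily use.

```python
import random

def split_dataset(paths, test_fraction=0.25, seed=0):
    """Shuffle a list of image paths and split it into (train, test)."""
    shuffled = list(paths)
    # A fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical file names standing in for the real dataset
paths = [f"image{i}.jpg" for i in range(4000)]
train_paths, test_paths = split_dataset(paths)
print(len(train_paths), len(test_paths))  # 3000 1000
```

Shuffling before splitting matters here because the images are grouped by source, so an unshuffled cut could put all of one source into the test set.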

## Shortcomings

Although most of this was automated, one of the most tedious parts of this undertaking was sorting through the images themselves. Flickr images in particular often contain cans but are not appropriate for the algorithm to learn from, or are not beneficial in any way for the ConvNet I plan on using (see example below). Sorting through over 2000 images and determining which ones would be good for my Neural Network took a long time, and I think with a bit of refinement of my search parameters, I could have avoided having to do it so meticulously.

![text](Flickr846.jpg)

The image above was in my original dataset, and showed up as a "soda can" on Flickr. However, it clearly isn't, and had to be manually removed from the dataset to prevent it from disrupting the accuracy of my model.

# Dataset Results

## Future Considerations

Going ahead with my project, I do have one concern about my dataset. For the images of non-cans, I have an abundance of random noise images (see below). However, since I'm using a ConvNet, my algorithm might end up learning that random noise means "not a can". That in itself isn't too big of an issue, but my fear is that it will start detecting *only* random noise as non-cans, and that my algorithm might fail when presented with actual trash. If this is the case, I will likely have to use the various web scrapers I've built to repopulate the non-can dataset with random real images.

| ![text](notCan80.png) | ![text](notCan2261.png) |
| :---: | :---: |

Above are two examples of the random noise images that were generated. Side by side, it becomes very clear how similar they look.
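To put a rough number on that similarity, here is a minimal sketch that generates two noise arrays the same way as before and compares their average color. The helper `make_noise` and the seed are assumptions made for this illustration; it uses NumPy's `default_rng` instead of `np.random.rand` purely so the result is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

def make_noise(rng, size=400):
    """Generate one size x size RGB noise array with values from 0 to 254."""
    return (rng.random((size, size, 3)) * 255).astype("uint8")

first = make_noise(rng)
second = make_noise(rng)

# Average red, green, and blue value of each image
print(first.mean(axis=(0, 1)))
print(second.mean(axis=(0, 1)))
```

Every channel of both images should average out to roughly 127, which suggests that all of the noise images carry essentially the same information as far as a classifier is concerned.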

## Image Augmentation

On the flip side, I do have a powerful tool at my disposal when training my algorithm. I will likely augment my current dataset numerous times by applying various color filters, rotating, stretching, and performing other transformations on the images. I think that, combined with the sheer number of images I have, this will give me a strong, reliable dataset - one that will be effective in training my algorithm to detect cans.
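As an illustration of what such augmentation might look like, the sketch below produces a few transformed copies of one image using PIL. The function name `augment`, the specific angles and factors, and the solid-color stand-in image are all assumptions; the actual transformations will be decided when I train the model.

```python
from PIL import Image, ImageEnhance, ImageOps

def augment(image):
    """Return a few transformed copies of a PIL image."""
    width, height = image.size
    return [
        image.rotate(15, expand=True),                # small rotation
        ImageOps.mirror(image),                       # horizontal flip
        image.resize((width * 6 // 5, height)),       # stretch to 20% wider
        ImageEnhance.Color(image).enhance(1.5),       # more saturated colors
        ImageEnhance.Brightness(image).enhance(0.7),  # darker lighting
    ]

# Solid-color stand-in for a real photo from the dataset
sample = Image.new("RGB", (400, 300), (200, 30, 30))
variants = augment(sample)
print(len(variants))  # 5
```

Each original image yields five extra training examples here, so even a handful of transformations multiplies the size of the dataset.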

# Summary

## Outcome of Dataset

Overall, I'm quite happy with the data I have now. In my opinion, I managed to meet all three goals I set out for my data, and went above and beyond in some of them as well. I never expected to have this many images, and am quite proud of what I was able to accomplish in this first segment of the project.  


The data I've compiled can be found at https://bit.ly/3l8Nmhn

## Challenges due to Data

The biggest issue I see right now with the data I have is the sizing. For the non-cans, I standardized the size to an even 400x400 square. However, the images I scraped all have different resolutions, and almost none of them are square. Scaling them down to a square may result in too much distortion, meaning that I may not be able to apply some of the transformations I would like to. However, I think I have a few different ways of circumventing this, and will detail my findings in the next report.
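One possible way around the distortion, sketched below, is to shrink each image until it fits inside a 400x400 square and pad the leftover space with a solid color. This is only one candidate approach, not a decision; the function name `pad_to_square`, the black fill, and the stand-in image are assumptions for illustration.

```python
from PIL import Image

def pad_to_square(image, size=400, fill=(0, 0, 0)):
    """Resize to fit inside a size x size square without distortion, padding the rest."""
    image = image.convert("RGB")
    # Scale by the longer side so the whole image fits in the square
    scale = size / max(image.size)
    new_w = max(1, round(image.width * scale))
    new_h = max(1, round(image.height * scale))
    resized = image.resize((new_w, new_h))
    # Paste the resized image in the center of a blank square canvas
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas

# Gray stand-in for a scraped, non-square photo
photo = Image.new("RGB", (1024, 683), (120, 120, 120))
square = pad_to_square(photo)
print(square.size)  # (400, 400)
```

The trade-off is that the padded borders carry no information, so a center crop might be preferable for images where the can fills most of the frame.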

## Contact Information

If you would like to inquire about anything detailed in this report or give suggestions/feedback, please reach out to me at 220602@hkis.edu.hk

<h1> <center> Thank you for reading my report. </center> </h1>