## DSI 15 Capstone Project

## Product Image Classification on Amazon

### Problem Statement

E-commerce sites receive thousands of new listings every day, and users sometimes classify their uploaded images incorrectly or use the wrong product depiction. Mismatched listing information reduces the likelihood of successful transactions and consumes unnecessary resources when corrections have to be made on a large scale. A product detection system would help ensure that products are listed and categorized correctly, or assist users in classifying product types. To meet this need, an image classifier based on neural networks will be trained to identify the correct image labels accurately.

### Executive Summary

This capstone project aims to build an image classification model using convolutional neural networks that identifies the following 3 categories of products: clothing, footwear and watches. The source data for the model consists of images scraped from Amazon, the world's largest online retailer.

The stakeholders are e-commerce companies and the users of their services. The model will help companies improve the effectiveness of potential transactions. It will also improve the user experience by making listings more accurate and by avoiding the problems that arise from wrongly identified products.

Performance will be measured using the AUC and the Type I / Type II errors (false positives and false negatives).
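
As a minimal sketch of how Type I / Type II error counts could be read off for a three-class problem, the snippet below treats each class one-vs-rest: a wrong prediction counts as a false positive for the predicted class and a false negative for the actual class. The function name `per_class_errors` and the sample labels are illustrative assumptions, not part of the project code.

```python
from collections import Counter

def per_class_errors(y_true, y_pred):
    """Count one-vs-rest false positives (Type I) and false negatives (Type II) per class."""
    false_pos = Counter()
    false_neg = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual != predicted:
            false_pos[predicted] += 1  # the predicted class was wrongly asserted
            false_neg[actual] += 1     # the actual class was missed
    classes = sorted(set(y_true) | set(y_pred))
    return {c: {"type_1": false_pos[c], "type_2": false_neg[c]} for c in classes}

# Hypothetical labels, for illustration only
y_true = ["watch", "shirt", "shoe", "shoe", "watch"]
y_pred = ["watch", "shoe", "shoe", "shirt", "watch"]
errors = per_class_errors(y_true, y_pred)
print(errors)
```

Counting per class keeps the two error types separate, which is useful if one kind of mistake (for example, a shirt listed as a shoe) is more costly than the other.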

Foreseen challenges include potentially imbalanced data, complex background noise and poor-resolution images.

The goal appears to be sufficiently scoped at present: the 3 categories have distinct image features, and the quantity of images is adequate for the purposes of this analysis. The tentative completion date of end of July 2020 remains a reasonable target.

## Data Extraction

Selenium and the Chrome WebDriver are used to scrape the 3 categories on Amazon. The listing images are known to be fixed in size and are saved in JPEG format. To ensure a sufficient quantity for testing and to allow for discarding ineligible data points, just over 2,000 images were saved from each class (scraping stops after the first results page that takes the count past 2,000). They were extracted to the respective folders:
<ul>
    <li>Watch_Images - 2,016 images</li>
    <li>Shirt_Images - 2,016 images</li>
    <li>Shoe_Images - 2,016 images</li>
</ul>
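
The per-folder counts above can be checked with a short script. This is a sketch only: the helper name `count_images` is hypothetical, and it assumes the three folders sit in the parent directory (`..`), as the relative paths in the watch and shirt scraping cells suggest.

```python
from pathlib import Path

def count_images(root, folders, suffix=".jpg"):
    """Return {folder_name: number of files ending in suffix}; missing folders count as 0."""
    counts = {}
    for name in folders:
        folder = Path(root) / name
        if folder.is_dir():
            counts[name] = sum(1 for _ in folder.glob("*" + suffix))
        else:
            counts[name] = 0
    return counts

# ".." is an assumed location for the image folders
print(count_images("..", ["Watch_Images", "Shirt_Images", "Shoe_Images"]))
```

Comparing these counts across classes is also a quick way to confirm that the data set is balanced before training.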

In [4]:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from multiprocessing import Pool
import requests
import random
import time

In [5]:
# Chrome() looks for the chromedriver executable on the PATH by default
driver = webdriver.Chrome()

### Extract Images from the Amazon Category of Men's Watches

In [None]:
def main():
  # Save thumbnails from the Men's Watches listing until more than 2,000 are stored
  image_num = 0  # running count of saved images, also used as the file name
  driver.get("https://www.amazon.com/s?i=specialty-aps&bbn=16225019011&rh=n%3A7141123011%2Cn%3A16225019011%2Cn%3A6358539011&pf_rd_i=16225019011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=5cd8272b-5ce4-4c26-bfcb-d6dca0c1e427&pf_rd_p=5cd8272b-5ce4-4c26-bfcb-d6dca0c1e427&pf_rd_r=CX53J0NV7EFDPJSMDE9S&pf_rd_r=CX53J0NV7EFDPJSMDE9S&pf_rd_s=merchandised-search-left-2&pf_rd_t=101&ref=AE_Men_Watches")
  time.sleep(3)  # give the page time to render
  while True:
    page_content = soup(driver.page_source, 'html.parser')
    images = page_content.find_all("img", {"class": "s-image"})
    for image in images:
      # The with block closes the file after each write
      with open('../Watch_Images/' + str(image_num) + ".jpg", 'wb') as f:
        f.write(requests.get(image['src']).content)
      image_num += 1
    # Stop at the target count or when the page has no "next page" item
    if image_num > 2000 or page_content.find("li", {'class': 'a-last'}) is None:
      break
    # "xpath" is the string value of selenium's By.XPATH
    driver.find_element("xpath", "//li[contains(@class, 'a-last')]/a").click()
    time.sleep(3)

main()

### Extract Images from the Amazon Category of Men's Shirts

In [None]:
def main():
  # Save thumbnails from the Men's Shirts listing until more than 2,000 are stored
  image_num = 0  # running count of saved images, also used as the file name
  driver.get("https://www.amazon.com/s?i=fashion-mens-intl-ship&bbn=16225019011&rh=n%3A16225019011%2Cn%3A1040658%2Cn%3A2476517011&dc&pf_rd_i=16225019011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=554625a3-8de1-4fdc-8877-99874d353388&pf_rd_r=SK809FT75R844KJ5WXGY&pf_rd_s=merchandised-search-4&pf_rd_t=101&qid=1594157542&rnid=1040658&ref=sr_nr_n_1")
  time.sleep(3)  # give the page time to render
  while True:
    page_content = soup(driver.page_source, 'html.parser')
    images = page_content.find_all("img", {"class": "s-image"})
    for image in images:
      # The with block closes the file after each write
      with open('../Shirt_Images/' + str(image_num) + ".jpg", 'wb') as f:
        f.write(requests.get(image['src']).content)
      image_num += 1
    # Stop at the target count or when the page has no "next page" item
    if image_num > 2000 or page_content.find("li", {'class': 'a-last'}) is None:
      break
    # "xpath" is the string value of selenium's By.XPATH
    driver.find_element("xpath", "//li[contains(@class, 'a-last')]/a").click()
    time.sleep(3)

main()

### Extract Images from the Amazon Category of Men's Shoes

In [7]:
def main():
  # Save thumbnails from the Men's Shoes listing until more than 2,000 are stored
  # Machine-specific absolute output folder; adjust for your environment
  out_dir = 'C:/Users/silve/Desktop/materials-master/materials-master/DSI15 Capstone Project/Shoe_Images/'
  image_num = 0  # running count of saved images, also used as the file name
  driver.get("https://www.amazon.com/s?i=specialty-aps&bbn=16225019011&rh=n%3A7141123011%2Cn%3A16225019011%2Cn%3A679255011&pf_rd_i=16225019011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=5cd8272b-5ce4-4c26-bfcb-d6dca0c1e427&pf_rd_p=5cd8272b-5ce4-4c26-bfcb-d6dca0c1e427&pf_rd_r=V3G56PM79KZ6R1KBGHTK&pf_rd_r=V3G56PM79KZ6R1KBGHTK&pf_rd_s=merchandised-search-left-2&pf_rd_t=101&ref=AE_Men_Shoes")
  time.sleep(3)  # give the page time to render
  while True:
    page_content = soup(driver.page_source, 'html.parser')
    images = page_content.find_all("img", {"class": "s-image"})
    for image in images:
      # The with block closes the file after each write
      with open(out_dir + str(image_num) + ".jpg", 'wb') as f:
        f.write(requests.get(image['src']).content)
      image_num += 1
    # Stop at the target count or when the page has no "next page" item
    if image_num > 2000 or page_content.find("li", {'class': 'a-last'}) is None:
      break
    # "xpath" is the string value of selenium's By.XPATH
    driver.find_element("xpath", "//li[contains(@class, 'a-last')]/a").click()
    time.sleep(3)

main()
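
The scraping cells write whatever each request returns, so some saved files may be empty or may not contain JPEG data at all. The sketch below is one possible way to flag such ineligible files before they are discarded: it checks that each `.jpg` file begins with the two-byte JPEG Start Of Image marker. The helper name `find_invalid_jpegs` is hypothetical, and the check does not prove that a file decodes correctly.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8"  # every JPEG file starts with the Start Of Image marker

def find_invalid_jpegs(folder):
    """Return sorted names of .jpg files in folder that do not start with the JPEG marker."""
    invalid = []
    for path in Path(folder).glob("*.jpg"):
        with open(path, "rb") as f:
            if f.read(2) != JPEG_MAGIC:
                invalid.append(path.name)
    return sorted(invalid)
```

For example, `find_invalid_jpegs("../Watch_Images")` would list the watch files that should be inspected or removed, assuming that folder location.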

### Preliminary EDA

A preliminary visual inspection shows that there is minimal image noise from type inconsistency. Some images are misclassified or out of category, but these make up only a minuscule portion of the data set.
<ul>
<li>The watch images are mostly consistent and of similar representation, but there appears to be many duplicate images. This will be further investigated.</li>
<li>The shoe images are mostly clean, unique and of similar representation without much internal image noise. However, there is a mix of variations such as slippers, boots and sandals.</li>
<li>The shirt category has a mixed image representation. There are variations such as long and short sleeves, hoodies, vests and singlets, and some images include human figures.</li>
</ul>
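
One way to start investigating the suspected duplicates among the watch images is to group files by a hash of their contents. The snippet below is a minimal sketch under that assumption: the helper name `find_exact_duplicates` is hypothetical, and it only detects byte-identical files, so resized or re-encoded copies of the same picture would not be caught.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(folder):
    """Group .jpg files in folder by SHA-256 of their bytes; return groups with more than one file."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group lists file names that share identical contents, so all but one file in a group could be dropped before training.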