## Using this notebook to download the dataset images
The final dataset comprises over 400k images and takes up about 200 GB on disk.

Downloading the whole dataset takes several days because the images are not distributed directly: they are only listed in a txt file of direct links to the websites that host them.

Because the dataset cannot be downloaded directly, several issues arise: some links no longer work, some images are corrupted, some have been replaced by an error placeholder image, and some hosts use anti-scraping protection. These issues are explored in separate notebooks. The images that are no longer available have been collected in a csv file, and the team behind the Street2Shop dataset has been contacted for help with them.
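A cheap first check for corrupted or replaced downloads is to look at the first bytes of each payload before saving it. The sketch below is only an illustration and not necessarily what the follow-up notebooks do: `looks_like_jpeg` is a hypothetical helper, and it assumes the images are JPEGs, as the `.jpg` extension used in this notebook suggests. It catches HTML error pages and empty responses, but not a placeholder that is itself a valid JPEG.

```python
JPEG_MAGIC = b"\xff\xd8\xff"  # every JPEG file starts with these three bytes


def looks_like_jpeg(content: bytes) -> bool:
    """Return True if the payload starts with the JPEG signature."""
    return content[:3] == JPEG_MAGIC


# An HTML error page served with status 200 fails the check,
# while a payload that starts with the JPEG signature passes it.
print(looks_like_jpeg(b"<html><body>Not found</body></html>"))
print(looks_like_jpeg(JPEG_MAGIC + b"\xe0\x00\x10JFIF"))
```

A check like this could run right after the status code test, so that suspicious responses are logged instead of being written to disk.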

In [1]:
import os
import pandas as pd
import requests
import csv

In [2]:
# Download urls from www.tamaraberg.com/street2shop/wheretobuyit/photos.tar
file_path = "./photos.txt"
photos_file = pd.read_table(file_path, header=None)

photos_file = photos_file[0].str.split(pat=",", n=1, expand=True)
photos_file.columns = ["photo", "url"]

print(photos_file.head())
print(photos_file.shape)
print(photos_file["photo"].unique().shape)
print(photos_file["url"].unique().shape)

       photo                                                url
0  000000001  http://s3.amazonaws.com/media.modcloth/images/...
1  000000002  http://s3.amazonaws.com/media.modcloth/images/...
2  000000003  http://s3.amazonaws.com/media.modcloth/images/...
3  000000004  http://s3.amazonaws.com/media.modcloth/images/...
4  000000005  http://media1.modcloth.com/community_outfit_im...
(424840, 2)
(424840,)
(424840,)


In [3]:
# Top URL domains. Column 0 holds the host of http(s) URLs; any other string falls into column 1.
url_domains = photos_file["url"].str.extract(r'(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)')
domain_counts = url_domains[0].value_counts()  # already sorted in descending order
print(domain_counts)
print(domain_counts.sum(), "images")  # URLs without an http(s) prefix are not counted here

www.zappos.com                     74353
ecx.images-amazon.com              45773
images.asos-media.com              23819
g.nordstromimage.com               23434
media1.modcloth.com                18435
g-ecx.images-amazon.com            18209
product-images2.therealreal.com    17125
scene7.targetimg1.com              17098
productshots2.modcloth.net         16942
product-images4.therealreal.com    16911
productshots3.modcloth.net         16875
productshots1.modcloth.net         16768
productshots0.modcloth.net         16731
images.neimanmarcus.com            15197
www.neimanmarcus.com               13257
www.forever21.com                  11390
product-images1.therealreal.com    10761
product-images3.therealreal.com    10691
slimages.macys.com                  9619
images.bloomingdales.com            7258
images.urbanoutfitters.com          6545
s7d2.scene7.com                     6255
media.kohls.com.edgesuite.net       3516
images.anthropologie.com            1942
s3.amazonaws.com
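The same per-domain count can be sketched without a regular expression by using `urllib.parse.urlparse` from the standard library. This is an alternative to the cell above, not the method used in this notebook; `url_domain` is a hypothetical helper and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def url_domain(url: str) -> str:
    """Return the host part of a URL, or '' when the URL has no scheme."""
    return urlparse(url).netloc


# Made-up URLs, only to show the shape of the result.
sample_urls = [
    "http://www.zappos.com/images/1.jpg",
    "http://www.zappos.com/images/2.jpg",
    "https://images.asos-media.com/products/3.jpg",
]

counts = Counter(url_domain(u) for u in sample_urls)
for domain, n in counts.most_common():
    print(domain, n)
```

Strings without an `http://` or `https://` prefix yield an empty host, so they are easy to isolate and inspect separately.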

### Download images + extract broken urls

In [4]:
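def image_extraction_3(df):
    # We treat a link as broken when the request fails or times out.
    # timeout=3 applies to connecting and to each read, not to the whole download.
    img_path = "C:/Users/heret/Desktop/street2shop/photos/"
    broken_path = "broken_urls.csv"
    user_agent = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')

    # Write the header only once, so that re-running on another slice does not repeat it.
    if not os.path.exists(broken_path) or os.path.getsize(broken_path) == 0:
        with open(broken_path, "w", newline="") as f:
            csv.writer(f).writerow(["photo", "url"])

    for url, photo_id in zip(df["url"], df["photo"]):
        try:
            r = requests.get(url, timeout=3, headers={'User-Agent': user_agent})
            # Responses with a non-OK status are skipped without being logged.
            if r.status_code == requests.codes.ok:
                with open(img_path + photo_id.lstrip("0") + ".jpg", 'wb') as f:
                    f.write(r.content)
        except Exception:
            # Unlike a bare except, this still lets Ctrl+C interrupt the run.
            # newline="" prevents blank rows in the csv file on Windows.
            with open(broken_path, "a", newline="") as f:
                csv.writer(f).writerow([photo_id, url])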
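Since the full download runs over several days and several slices, it can help to skip photos that are already on disk when a run is restarted. The sketch below is an assumption about how that could be done, not part of the original notebook: `already_downloaded` and `pending` are hypothetical helpers, and they rely on the file naming used in the cell above (leading zeros stripped, `.jpg` appended).

```python
import os
import tempfile


def already_downloaded(photo_id: str, img_dir: str) -> bool:
    """Check whether the file for this photo id exists in img_dir."""
    return os.path.exists(os.path.join(img_dir, photo_id.lstrip("0") + ".jpg"))


def pending(photo_ids, img_dir):
    """Return the photo ids that still have no file on disk, in order."""
    return [pid for pid in photo_ids if not already_downloaded(pid, img_dir)]


# Demo in a temporary directory: pretend photo 2 was downloaded earlier.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "2.jpg"), "wb").close()
    print(pending(["000000001", "000000002", "000000003"], tmp))
```

Note that photos skipped because of a non-OK status leave no file either, so they would be retried on every restart.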

In [5]:
# photos_file has a 0-based index while photo ids start at 1, and .loc includes both ends:
# start_n = 5216 starts the download at photo 5217, finish_n = 5218 ends it at photo 5219.
start_n = 258511
finish_n = 275000
split_urls = photos_file.loc[start_n:finish_n]
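To avoid off-by-one mistakes when choosing a slice, the conversion from photo ids to `.loc` labels can be written down once. This is a minimal sketch, assuming `photos_file` keeps its default 0-based index and that photo ids are consecutive from 1; `loc_bounds` is a hypothetical helper, not part of the original notebook.

```python
def loc_bounds(first_id: int, last_id: int):
    """Map an inclusive range of photo ids to inclusive .loc labels."""
    if not 1 <= first_id <= last_id:
        raise ValueError("expected 1 <= first_id <= last_id")
    # Photo id k sits at index label k - 1.
    return first_id - 1, last_id - 1


start_n, finish_n = loc_bounds(5217, 5219)
print(start_n, finish_n)  # 5216 5218
```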

In [6]:
%%time
image_extraction_3(split_urls) #started at 22:05 #DO NOT OPEN CSV FILE WHILE SCRIPT IS RUNNING

Wall time: 5h 15min 37s
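The wall time of this slice gives a rough estimate for the whole download. The calculation below is a back-of-the-envelope sketch that assumes the throughput of this run is representative of the full list of 424840 URLs and that downloads stay sequential; real timings will vary by host.

```python
from datetime import timedelta

rows = 275000 - 258511 + 1  # .loc is inclusive on both ends
elapsed = timedelta(hours=5, minutes=15, seconds=37)
total_rows = 424840  # number of rows in photos_file

per_url = elapsed.total_seconds() / rows
estimate = timedelta(seconds=per_url * total_rows)

print(f"{rows} URLs at {per_url:.2f} s each")
print(f"full download: about {estimate.days} days {estimate.seconds // 3600} h")
```

At this rate the full list would take between five and six days, which matches the "several days" mentioned at the top of the notebook.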
