## Gathering the URLs

In this notebook, we begin the process of scraping sneaker sales from the website goat.com. Because of the structure of the GOAT website, we first gather the URLs for each sneaker so that we can later visit each sneaker link and gather the pertinent data.

I define a function that takes a year and returns the page source listing the shoes released in that year. The search is restricted to Air Jordans in the Lifestyle category, for men's and women's sneakers.

After gathering the page sources, I build a dictionary that maps each year to its list of sneaker hyperlinks. These links are used later to gather the data needed to predict shoe resale price.
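
The search URL used by the scraper can also be assembled from its query parameters rather than by string concatenation. The sketch below is only an illustration: `build_search_url` and `BASE_URL` are hypothetical names that do not appear in the notebook, and the parameter names are copied from the URL used in the scraping function.

```python
from urllib.parse import urlencode

# Hypothetical constant: the path portion of the search URL used in this notebook.
BASE_URL = "https://www.goat.com/sneakers"


def build_search_url(year, brand="air jordan", web_group="lifestyle",
                     genders=("men", "women")):
    """Build the filtered search URL for one release year (illustrative helper)."""
    # A list of pairs keeps the parameter order and allows the repeated 'gender' key.
    params = [
        ("brand", brand),
        ("release_date_year", str(year)),  # str() lets the caller pass an int or a string
        ("web_groups", web_group),
    ]
    params += [("gender", g) for g in genders]
    # urlencode escapes the space in 'air jordan' as '+'.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url(2013))
```

Because the year is converted with `str()`, this version accepts either `2013` or `'2013'`, whereas concatenating an integer onto the URL string would raise a `TypeError`.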

In [4]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import matplotlib.pyplot as plot
from time import sleep
import json, re
from selenium import webdriver
from selenium.webdriver.common.by import By
import pickle

import requests

## this is to suppress warnings I was getting in this code. 
import warnings
# Suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
# Since the GOAT sneaker website uses infinite scroll to show all sneakers,
# I define a function that takes a year (as a string) and returns the page source for shoes released that year.
# The search is restricted to Air Jordans, Lifestyle, and men's/women's sneakers.

def GoatScraper(year):
    goat_url = 'https://www.goat.com/sneakers?brand=air+jordan&release_date_year=' + year + '&web_groups=lifestyle&gender=men&gender=women'

    ## -- --
    ## we use selenium to scroll down the page on the website, since goat utilizes infinite scrolling
    ## -- --

    driver = webdriver.Chrome() ## utilizing the chrome browser
    driver.maximize_window()    ## starting off with a max window to scroll

    driver.get(goat_url)        ## loading the webpage 

    last_height = 0             ## recording the initial height as zero to make sure we can scroll
                                ## the entire page

    while True:
        driver.execute_script('window.scrollBy(0,8000)')    ## start off with a BIG scroll
        sleep(1.75)                                         ## sleep, had to adjust this to reach the entire page since after
                                                            ## you scroll the page loads, so a slower loading page needs more time to sleep

        new_height = driver.execute_script("return document.body.scrollHeight") ## set a new height to compare
        
        if new_height == last_height:   ## if we cannot scroll anymore, we break the while loop
            print('Ended with a page height of: ' + str(new_height)) ## printing our ending page height
            break
        else:                           ## otherwise, we set new lower height, and continue to scroll 
            last_height = new_height

    page_source = driver.page_source    ## after we scroll the page, we record the source of the page to scrape
    driver.quit()                       ## close the browser so each year does not leave a Chrome window open

    return page_source
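
The scroll-until-the-height-stops-changing loop can be separated from the browser setup so that it can be tried without opening Chrome. This is a minimal sketch under assumptions: `scroll_to_bottom`, its `max_rounds` safety cap, and the `FakeDriver` stand-in are hypothetical and not part of the notebook. The only thing required of the driver is an `execute_script` method, which a Selenium WebDriver provides.

```python
from time import sleep


def scroll_to_bottom(driver, pause=1.75, step=8000, max_rounds=500):
    """Scroll until the page height stops changing; return the last height seen."""
    last_height = 0
    for _ in range(max_rounds):  # hypothetical cap so a misbehaving page cannot loop forever
        driver.execute_script(f"window.scrollBy(0, {step})")
        sleep(pause)  # give newly requested items time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded, so we are at the bottom
            break
        last_height = new_height
    return last_height


class FakeDriver:
    """Stand-in for a browser: each scroll 'loads' the next height in a list."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._i = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._i]
        # Any other script is treated as a scroll that loads more content.
        self._i = min(self._i + 1, len(self._heights) - 1)
        return None


print(scroll_to_bottom(FakeDriver([1000, 2000, 3000]), pause=0))
```

With a real browser, the same function would be called as `scroll_to_bottom(driver)` after `driver.get(goat_url)`. As in the original loop, a page that loads more slowly than `pause` allows can make the loop stop early.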

In [10]:
#years = ['2019','2020','2021','2022','2023','2024']
years = ['2013','2014','2015','2016','2017','2018']
link_dict = {}

for year in years:
    print('--')
    print('Working on year ' + year)
    source = GoatScraper(year)
    ## -- --
    ## now we use BeautifulSoup to parse the page source for the data we want
    ## -- --
    soup = BeautifulSoup(source, 'html.parser')   ## parsing the page source with an explicit parser

    ## grabbing the div that wraps the grid of shoes.
    ## the class name was found by inspecting the page source to see where the hyperlink for each shoe is kept
    data = soup.find_all('div', attrs = {'class':"GridStyles__GridWrapper-sc-1cm482p-1 fpmUch"})[0]

    ## under that div, we need the 'a' tags that contain the 'href' we are after
    hrefs = data.find_all('a')
    print('Now we are getting the links to the shoes')

    ## we now grab those hyperlinks and put them into a list to be used
    links = [link['href'] for link in hrefs]

    link_dict[year] = links
    print('--')
    print('Done with ' + year)
    print('--')


--
Working on year 2013
Ended with a page height of: 9692
Now we are getting the links to the shoes
--
Done with 2013
--
--
Working on year 2014
Ended with a page height of: 11798
Now we are getting the links to the shoes
--
Done with 2014
--
--
Working on year 2015
Ended with a page height of: 13052
Now we are getting the links to the shoes
--
Done with 2015
--
--
Working on year 2016
Ended with a page height of: 17274
Now we are getting the links to the shoes
--
Done with 2016
--
--
Working on year 2017
Ended with a page height of: 21918
Now we are getting the links to the shoes
--
Done with 2017
--
--
Working on year 2018
Ended with a page height of: 38015
Now we are getting the links to the shoes
--
Done with 2018
--
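
The notebook does not show what the collected `href` values look like. If they turn out to be site-relative paths (an assumption), they can be converted to absolute URLs and de-duplicated before later use. In this sketch, `normalize_links`, `BASE`, and the example paths are hypothetical placeholders.

```python
from urllib.parse import urljoin

# Assumed site root; hrefs that are already absolute are left unchanged by urljoin.
BASE = "https://www.goat.com"


def normalize_links(hrefs, base=BASE):
    """Turn hrefs into absolute URLs and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base, h) for h in hrefs if h)  # skip empty or missing hrefs
    return list(dict.fromkeys(absolute))               # dict keys preserve insertion order


# Placeholder paths for illustration only, not real product pages.
example = ["/sneakers/example-a", "/sneakers/example-b", "/sneakers/example-a"]
print(normalize_links(example))
```

Applied to the dictionary built above, this would be `{year: normalize_links(links) for year, links in link_dict.items()}`.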


In [8]:
# we export the dictionary of years and URLs to a binary file using pickle
with open('links_file', 'wb') as links_file:   # the with block closes the file even if dump fails
    pickle.dump(link_dict, links_file)
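
A later notebook will need to read this file back. The following is a small round-trip sketch; `save_links` and `load_links` are hypothetical helper names, and the sample dictionary is placeholder data rather than real scraped output.

```python
import pickle


def save_links(link_dict, path):
    """Write the year -> links dictionary to a pickle file."""
    with open(path, "wb") as f:   # pickle requires binary mode
        pickle.dump(link_dict, f)


def load_links(path):
    """Read the dictionary back; only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    sample = {"2013": ["/sneakers/example-a"]}   # placeholder data
    save_links(sample, "links_file_example")     # separate name so the real file is untouched
    print(load_links("links_file_example") == sample)
```

Pickle works here because the file is written and read by the same project. For data that needs to be shared or inspected by hand, `json.dump` would be a plain-text alternative, since the dictionary only contains strings and lists.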