# Data Collection

The goal of this notebook is to scrape the Running Shoes Guru site for each page containing a shoe review, then save the HTML in a dataframe. The site is structured as 117 homepages, each containing 10 links to reviews, for a total of 1163 reviews (the final page was incomplete at the time of writing).
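
As a quick sanity check on those figures, the arithmetic can be written out. The numbers are the ones quoted above; the variable names are only illustrative.

```python
import math

total_reviews = 1163   # total quoted above
links_per_page = 10    # links on a full page

# Number of pages needed to hold every review
n_pages = math.ceil(total_reviews / links_per_page)

# Links left over for the final, incomplete page
last_page_links = total_reviews - (n_pages - 1) * links_per_page

print(n_pages, last_page_links)  # 117 3
```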


The steps to achieve this:

1) Scrape the 10 links on a single homepage
2) Run a loop across all homepages to collect every link
3) Use the requests module to collect the HTML for each collected link

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup 
import requests
import re
from datetime import datetime

import time

## Section 1: Scrape the links from a single homepage

In this section we collect the links from a single homepage; we are aiming for 10 in total.

In [2]:
#load an example page (page 50)
page=requests.get('https://www.runningshoesguru.com/reviews/page/50')
page_soup=BeautifulSoup(page.content, 'html.parser')

print(page_soup.prettify() )

<!DOCTYPE html>
<html lang="en">
 <head>
  <link as="style" href="https://cdn.runningshoesguru.com/wp-content/cache/fvm/min/1701332637-cssd58405d58b3fd05e5c2c0e5d4eff2da7811898d6c24b8984e1e58b1d9b093.css" media="all" rel="preload"/>
  <link as="style" href="https://cdn.runningshoesguru.com/wp-content/cache/fvm/min/1701332637-css6a6b57001c9178d5f14e9f6c1ca64217dad5f5b72b4d158abaa93f762160b.css" media="all" rel="preload"/>
  <script data-cfasync="false">
   if(navigator.userAgent.match(/MSIE|Internet Explorer/i)||navigator.userAgent.match(/Trident\/7\..*?rv:11/i)){var href=document.location.href;if(!href.match(/[?&]iebrowser/)){if(href.indexOf("?")==-1){if(href.indexOf("#")==-1){document.location.href=href+"?iebrowser=1"}else{document.location.href=href.replace("#","?iebrowser=1#")}}else{if(href.indexOf("#")==-1){document.location.href=href+"&iebrowser=1"}else{document.location.href=href.replace("#","&iebrowser=1#")}}}}
  </script>
  <script data-cfasync="false">
   class FVMLoader{const

Reviewing the page HTML, we can see that each review link sits inside an 'h4' heading, so we can search for these:

In [3]:
print(page_soup.find_all('h4'))

[<h4> 1164 matching reviews </h4>, <h4 class="d-none d-sm-block block-title" style="margin-top:0">Search</h4>, <h4 class="block-title" style="margin-top:0">Filter</h4>, <h4 class="mt-2" style=""><a href="https://www.runningshoesguru.com/review/inov-8-mudclaw-275/">Inov-8 Mudclaw 275</a></h4>, <h4 class="mb-4 mt-3" style=""><a href="https://www.runningshoesguru.com/review/inov-8-mudclaw-275/">Inov-8 Mudclaw 275</a></h4>, <h4 class="mt-2" style=""><a href="https://www.runningshoesguru.com/2019/02/nike-legend-react-review/">Nike Legend React</a></h4>, <h4 class="mb-4 mt-3" style=""><a href="https://www.runningshoesguru.com/2019/02/nike-legend-react-review/">Nike Legend React</a></h4>, <h4 class="mt-2" style=""><a href="https://www.runningshoesguru.com/2019/02/brooks-launch-6-review/">Brooks Launch 6</a></h4>, <h4 class="mb-4 mt-3" style=""><a href="https://www.runningshoesguru.com/2019/02/brooks-launch-6-review/">Brooks Launch 6</a></h4>, <h4 class="mt-2" style=""><a href="https://www.run

It's now fairly straightforward to see that we are looking for the 'a' tag inside each of these headings. Let's collect them in a set and check its length. We read the 'href' attribute so that we only keep the link itself, and the set removes the repeats, since each review appears in two headings.

In [4]:
sample_links_set=set()
for heading in page_soup.find_all('h4'):
    if heading.find('a') is not None: 
        sample_links_set.add(heading.find('a')['href'])



## Test 1

In [5]:
#test1 
print(f'Distinct pages found: {len(sample_links_set)}')
print(sample_links_set)

Distinct pages found: 10
{'https://www.runningshoesguru.com/review/new-balance-summit-unknown/', 'https://www.runningshoesguru.com/review/new-balance-fresh-foam-gobi-v3/', 'https://www.runningshoesguru.com/2019/02/saucony-kinvara-10-review/', 'https://www.runningshoesguru.com/2019/02/nike-revolution-4-review/', 'https://www.runningshoesguru.com/2019/02/brooks-launch-6-review/', 'https://www.runningshoesguru.com/2019/02/nike-legend-react-review/', 'https://www.runningshoesguru.com/review/inov-8-mudclaw-275/', 'https://www.runningshoesguru.com/review/salomon-s-lab-sense-7/', 'https://www.runningshoesguru.com/review/altra-lone-peak-4-0/', 'https://www.runningshoesguru.com/2019/02/altra-solstice-review/'}


Great, we have 10 distinct links from our example page, printed above. We have only checked this for one example page, so when we write the loop we'll need to add a test to check the other pages.
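
One way to make the same extraction reusable across pages is to wrap it in a function. The sketch below is only an illustration: the function name `extract_review_links` and the hand-written HTML are assumptions for the example, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def extract_review_links(html):
    """Return the set of hrefs found inside <h4> headings."""
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    for heading in soup.find_all('h4'):
        anchor = heading.find('a')
        # Skip headings such as "Search" or "Filter" that carry no link
        if anchor is not None and anchor.has_attr('href'):
            links.add(anchor['href'])
    return links

# Tiny invented page that mimics the structure seen in the output above
sample_html = """
<h4>Filter</h4>
<h4 class="mt-2"><a href="https://example.com/review/shoe-a/">Shoe A</a></h4>
<h4 class="mb-4 mt-3"><a href="https://example.com/review/shoe-a/">Shoe A</a></h4>
<h4 class="mt-2"><a href="https://example.com/review/shoe-b/">Shoe B</a></h4>
"""

found = extract_review_links(sample_html)
print(len(found))  # 2: the repeated heading collapses in the set
```

Because the function takes raw HTML rather than a URL, it can be checked against a small fixed snippet like this without touching the network.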

## Section 2: Run a loop using this method to scrape all review page links

In [5]:
links_set=set()
for page_number in range(0,118):
    page_to_scrape=requests.get(f'https://www.runningshoesguru.com/reviews/page/{page_number}')
    page_soup=BeautifulSoup(page_to_scrape.content, 'html.parser')
    
    page_links=set()
    for h4 in page_soup.find_all('h4'):
        if h4.find('a') is not None:
            page_links.add(h4.find('a')['href'])
    
    links_set=links_set.union(page_links)
    
    #need to check we have collected 10 links from each page
    if len(page_links)!=10:
        print(f'WARNING: {len(page_links)} links collected from page {page_number}')
    
    #this gives us a progress update every 10 pages
    if page_number%10==0:
        print(f'pages scraped: {page_number}')

print(f'Number of links collected: {len(links_set)}')

pages scraped: 0
pages scraped: 10
pages scraped: 20
pages scraped: 30
pages scraped: 40
pages scraped: 50
pages scraped: 60
pages scraped: 70
pages scraped: 80
pages scraped: 90
pages scraped: 100
pages scraped: 110
1163 {'https://www.runningshoesguru.com/2020/05/saucony-cohesion-13-review/', 'https://www.runningshoesguru.com/review/brooks-caldera-4-review/', 'https://www.runningshoesguru.com/reviews/road/mizuno-wave-inspire-19-review/', 'https://www.runningshoesguru.com/2015/09/nike-zoom-odyssey-review/', 'https://www.runningshoesguru.com/reviews/trail/asics-fuji-lite-4-review/', 'https://www.runningshoesguru.com/2013/01/kswiss-blade-light-run-ii-review/', 'https://www.runningshoesguru.com/2013/05/saucony-peregrine-3-review/', 'https://www.runningshoesguru.com/2020/10/saucony-endorphin-shift-review/', 'https://www.runningshoesguru.com/2014/03/saucony-guide-7-review/', 'https://www.runningshoesguru.com/2021/02/mizuno-wave-inspire-17-review/', 'https://www.runningshoesguru.com/2019/10/

In [6]:
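#the set was saved with the line below; here it is reloaded from disk
#np.save('links_to_reviews',np.array(links_set))
#allow_pickle=True is needed because the set is stored as a pickled object array
links_list=list(np.load('links_to_reviews.npy', allow_pickle=True).tolist())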

## Test 2

Now that we have collected the links from the site, we need to check that we haven't missed any. We do not need to check that they are unique: we used a set, and Python sets do not contain duplicate values.
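
A minimal illustration of both points, using invented URLs and an assumed expected count rather than the real data:

```python
# Hypothetical links as they might be gathered from a page,
# including one repeat
collected = [
    'https://example.com/review/shoe-a/',
    'https://example.com/review/shoe-b/',
    'https://example.com/review/shoe-a/',
]

unique_links = set(collected)   # duplicates are dropped automatically
expected_count = 2              # assumed figure for this toy example

print(len(collected), len(unique_links))    # 3 2
print(len(unique_links) == expected_count)  # True
```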

In [7]:
len(links_list)

1163

This matches the expected number of reviews, so we can create and save a dataframe.

In [8]:
full_data=pd.DataFrame(links_list, columns=['link'])
print(full_data.head(10))


                                                link
0  https://www.runningshoesguru.com/2017/09/hoka-...
1  https://www.runningshoesguru.com/2015/08/nike-...
2  https://www.runningshoesguru.com/2022/05/nike-...
3  https://www.runningshoesguru.com/2022/03/sauco...
4  https://www.runningshoesguru.com/2018/08/new-b...
5  https://www.runningshoesguru.com/2012/09/scott...
6  https://www.runningshoesguru.com/2016/06/sauco...
7  https://www.runningshoesguru.com/2022/04/nike-...
8  https://www.runningshoesguru.com/2012/05/nike-...
9  https://www.runningshoesguru.com/2020/05/new-b...


In [9]:
full_data.to_csv('Running_dataset',index=False)

## Section 3: Use requests to get html for each link

In this section our aim is to use the page links we collected to get the HTML for each page. We'll then check that we don't have any null values.

In [10]:
load_data=pd.read_csv('Running_dataset')
links_list=load_data['link'].to_list()


In [12]:
start = time.time()
# Took 48 mins to run
load_data['page']=load_data['link'].apply(lambda link: requests.get(link).content)
print(load_data)
end= time.time()
print((end-start)/60)

                                                   link  \
0     https://www.runningshoesguru.com/2017/09/hoka-...   
1     https://www.runningshoesguru.com/2015/08/nike-...   
2     https://www.runningshoesguru.com/2022/05/nike-...   
3     https://www.runningshoesguru.com/2022/03/sauco...   
4     https://www.runningshoesguru.com/2018/08/new-b...   
...                                                 ...   
1158  https://www.runningshoesguru.com/2013/01/mizun...   
1159  https://www.runningshoesguru.com/reviews/trail...   
1160  https://www.runningshoesguru.com/2013/12/new-b...   
1161  https://www.runningshoesguru.com/2020/07/reebo...   
1162  https://www.runningshoesguru.com/2017/06/nike-...   

                                                   page  
0     b'<!DOCTYPE html> \n<html lang="en"> \n<head>\...  
1     b'<!DOCTYPE html> \n<html lang="en"> \n<head>\...  
2     b'<!DOCTYPE html> \n<html lang="en"> \n<head>\...  
3     b'<!DOCTYPE html> \n<html lang="en"> \n<head>\...  
4

In [18]:
load_data.to_csv('Running_dataset',index=False)
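
A long run like this can be interrupted by a single failed request. The sketch below shows one possible way to make each fetch more tolerant. It is not the method used above: `fetch_html`, its parameters, and the retry policy are all assumptions for illustration. The fetch function is passed in as `get` so the helper can be tried without network access.

```python
import time

def fetch_html(url, get, retries=3, delay=1.0):
    """Return the page body as bytes, or None if every attempt fails.

    `get` is any callable that behaves like requests.get.
    """
    for attempt in range(retries):
        try:
            response = get(url, timeout=30)
            response.raise_for_status()   # raise on 4xx / 5xx responses
            return response.content
        except Exception as error:        # broad on purpose in this sketch
            print(f'attempt {attempt + 1} failed for {url}: {error}')
            if attempt < retries - 1:
                time.sleep(delay)         # brief pause before retrying
    return None
```

With the notebook's names it could be applied as `load_data['link'].apply(lambda link: fetch_html(link, requests.get))`. Failed fetches would then show up as null values in the check in Test 3.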

## Test 3

In this section, we want to check that we haven't got any null values and that the html is a plausible length for a review page. 

In [17]:
load_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1163 entries, 0 to 1162
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   link    1163 non-null   object
 1   page    1163 non-null   object
dtypes: object(2)
memory usage: 18.3+ KB


In [16]:
#compute the length of each page once, then summarise
page_lengths=load_data['page'].apply(len)
print(page_lengths.mean())
print(page_lengths.min())
print(page_lengths.max())

162342.17454858124
55136
515865
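
The summary statistics above show the overall range. To see which individual rows look suspicious, the same idea can be turned into a filter. This is a sketch on an invented three-row frame: the column names follow the notebook, but the data and the `min_plausible_length` cut-off are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for load_data; the contents are invented
pages = pd.DataFrame({
    'link': ['a', 'b', 'c'],
    'page': [b'<html>' + b'x' * 60000, b'<html>short</html>', None],
})

min_plausible_length = 50000   # assumed cut-off, not a measured value

missing = pages['page'].isna()
# Length of each fetched page; missing entries count as 0
lengths = pages['page'].apply(lambda p: len(p) if p is not None else 0)
too_short = ~missing & (lengths < min_plausible_length)

# Links whose HTML is missing or implausibly short
flagged = pages.loc[missing | too_short, 'link'].tolist()
print(flagged)  # ['b', 'c']
```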
