# Notebook Overview

By working through the Next.co.uk robots.txt file, we identify which pages are allowed and which are disallowed, so that any content we scrape is collected ethically.

# Imports

In [2]:
import requests # Retrieving robots.txt file
from datetime import date # Saving robots.txt file

# Allow and Disallow pages according to robots.txt

In [88]:
# URL of Next's robots.txt
robots_txt = 'https://www.next.co.uk/robots.txt'

# Get the text from robots.txt page (time out rather than hang, and fail on HTTP errors)
response = requests.get(robots_txt, timeout=10)
response.raise_for_status()
allow_and_disallow = response.text

# Save robots.txt file
today = date.today()
with open('next_uk_robots.txt', 'w') as text_file:
    text_file.write('Next.co.uk Robots Text File (as of {}): \n---\n{}'.format(today, allow_and_disallow))

# Create list of each line (string -> list); splitlines() handles both \r\n and \n endings
allow_and_disallow = allow_and_disallow.splitlines()

# Filter for only pages that are allowed or disallowed
allow = [i for i in allow_and_disallow if i.startswith('Allow')]
disallow = [i for i in allow_and_disallow if i.startswith('Disallow')]
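
The filter above collects every `Allow` and `Disallow` line regardless of which `User-agent` group it belongs to. As a minimal sketch, assuming the file follows the usual group layout, the rules could instead be grouped per user-agent. The function name `parse_robots` and the sample text (including `ExampleBot`) are hypothetical and only for illustration.

```python
from collections import defaultdict

def parse_robots(text):
    """Group Allow/Disallow paths by the User-agent they apply to."""
    rules = defaultdict(lambda: {'allow': [], 'disallow': []})
    agents = []        # user-agents of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if ':' not in line:
            continue
        field, value = (part.strip() for part in line.split(':', 1))
        field = field.lower()
        if field == 'user-agent':
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ('allow', 'disallow'):
            in_rules = True
            if value:  # skip rules with an empty path
                for agent in agents:
                    rules[agent][field].append(value)
    return dict(rules)

# Hypothetical sample, not the real Next.co.uk file
sample = """\
User-agent: *
Disallow: /error.asp
Allow: /shop/gender-women/sizetype-petite$

User-agent: ExampleBot
Disallow: /
"""
groups = parse_robots(sample)
print(groups['*']['allow'])
print(groups['ExampleBot']['disallow'])
```

With the real file, `parse_robots(response.text)['*']` would hold the rules aimed at all crawlers, which is the group a generic scraper should normally follow.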

# Filter pages to find useful Allow pages

In [55]:
# Filter for only pages that are clothing
non_clothing_prefixes = (
    'Allow: /homeware',
    'Allow: /shop/department-homeware',
    'Allow: /shop/department-beauty-*-0',
)
allow_clothing = [i for i in allow if not i.startswith(non_clothing_prefixes)]

In [56]:
# Preview both lists
num_to_preview = 3

# Preview Allow Clothing list
print(f'Allow Clothing Preview ({len(allow_clothing)} total):')
for page in allow_clothing[:num_to_preview]:
    print('\t',page)

# Preview Disallow list
print(f'\n\nDisallow Preview ({len(disallow)} total):')
for page in disallow[:num_to_preview]:
    print('\t',page)

Allow Clothing Preview (59 total):
	 Allow: /shop/gender-women/sizetype-petite$
	 Allow: /shop/gender-women/sizetype-maternity$
	 Allow: /shop/gender-women/sizetype-curve$


Disallow Preview (514 total):
	 Disallow: /error.asp
	 Disallow: */Error
	 Disallow: /thanks.asp


## Pages to use

Base URL: `https://www.next.co.uk`

> Extension for Women's Clothing: `/shop/gender-women/sizetype-*\$` [Link](https://www.next.co.uk/shop/gender-women/sizetype-)

> Extension for Men's Clothing: `/shop/gender-men/sizetype-*\$` [Link](https://www.next.co.uk/shop/gender-men/sizetype-)
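
In these patterns `*` stands for any sequence of characters and a trailing `$` anchors the end of the path. The sketch below shows one way to test a candidate path against such patterns before requesting it. It assumes the longest-match precedence described in RFC 9309 (the longest matching pattern wins, and `Allow` wins a tie). The helper names `pattern_to_regex` and `is_allowed` are hypothetical, as is the `category-dresses` path.

```python
import re

def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and join the pieces with '.*' where '*' appeared
    body = '.*'.join(re.escape(part) for part in pattern.split('*'))
    return re.compile(body + ('$' if anchored else ''))

def is_allowed(path, allow_patterns, disallow_patterns):
    """Longest matching pattern wins; Allow wins ties; no match means allowed."""
    best_len, allowed = -1, True
    for patterns, verdict in ((disallow_patterns, False), (allow_patterns, True)):
        for pattern in patterns:
            if pattern_to_regex(pattern).match(path) and (
                len(pattern) > best_len or (len(pattern) == best_len and verdict)
            ):
                best_len, allowed = len(pattern), verdict
    return allowed

allow_patterns = ['/shop/gender-women/sizetype-*$']
disallow_patterns = ['/shop/gender-women/sizetype-*/*']

print(is_allowed('/shop/gender-women/sizetype-petite',
                 allow_patterns, disallow_patterns))
print(is_allowed('/shop/gender-women/sizetype-petite/category-dresses',
                 allow_patterns, disallow_patterns))
```

The first path matches only the `Allow` pattern. The second matches both, and the longer `Disallow` pattern takes precedence, so deeper pages under a size type stay off limits.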

In [93]:
# Don't use these very similar pages:
print('Do not scrape from these url extensions:\n')
similar_pages = [i for i in disallow if '/shop/gender-women/sizetype' in i]
for page in similar_pages:
    # Drop the directive name; str.strip() would trim a set of characters, not a prefix
    print('\t', page.split(':', 1)[1].strip())

Do not scrape from these url extensions:

	 /shop/gender-women/sizetype-*/*


In [94]:
# Don't use these very similar pages:
print('Do not scrape from these url extensions:\n')
similar_pages = [i for i in disallow if '/boys' in i]
for page in similar_pages:
    # Drop the directive name; str.strip() would trim a set of characters, not a prefix
    print('\t', page.split(':', 1)[1].strip())

Do not scrape from these url extensions:

	 /brands/boy
	 /shop/boys/*
	 /boys/*/1*
	 /boys/*/2*
	 /boys/*/3*
	 /boys/*/4*
	 /boys/*/5*
	 /boys/*/6*
	 /boys/*/7*
	 /boys/*/8*
	 /boys/*/9*
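
When removing the directive name from a line, note that `str.strip` treats its argument as a set of characters to trim from both ends, not as a prefix. This may be why an entry such as `/brands/boy` can appear without a trailing `s`. A small sketch of the difference, using a hypothetical line:

```python
line = 'Disallow: /brands/boys'  # hypothetical entry for illustration

# str.strip trims any of the characters 'D', 'i', 's', 'a', 'l', 'o', 'w', ':', ' '
# from both ends, so the trailing 's' of the path is lost as well
stripped = line.strip('Disallow: ')
print(stripped)  # /brands/boy

# Splitting on the first colon removes only the directive name
path = line.split(':', 1)[1].strip()
print(path)  # /brands/boys
```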


In [98]:
# Use these allowed pages:
print('Scrape from these url extensions:\n')
similar_pages = [i for i in allow if 'boys' in i]
for page in similar_pages:
    # Drop the directive name; str.strip() would trim a set of characters, not a prefix
    print('\t', page.split(':', 1)[1].strip())

Scrape from these url extensions:

	 /shop/gender-newbornboys-gender-newbornunisex-gender-olderboys-gender-youngerboys/designfeature-christmassweater$
	 /boys/new-in/older-boys/1$
