# Overview

This page renders the Jupyter notebook that Frances put together on August 21 to scrape [Farmers.co.nz](https://www.farmers.co.nz/).

* Aim: to scrape all product names and prices for women's fashion from the Farmers.co.nz site

As usual, we first need to load some packages.

In [1]:
import requests 
from bs4 import BeautifulSoup 
import pandas as pd 


## Initial scrape: Only one page
Start with just one page - the first page for women's tops.


## First, check that we are allowed to scrape

A common first step is to confirm whether we can scrape at all - so we should check out the [robots.txt](https://www.farmers.co.nz/robots.txt). 

```
User-agent:*
Disallow: /cms/post-surgery-bra-fitting
Disallow: /cms/mothers-day-v2
Disallow: /women/new-collection/sally-ann-mullin-s-top-picks
Disallow: /cms/comp
Disallow: /cms/testcataloguetest
Disallow: /cms/page_SelectorTest_20190823084433
Disallow: /cms/instore-bed-selector
Disallow: /cms/stevens-privacy-policy
Disallow: /cms/sremraf
Disallow: /cms/store-update
Disallow: /cms/sleepyhead-early-bird-voucher
Disallow: *SearchTerm=*
Disallow: /filter/*
Disallow: */INTERSHOP/web*
Disallow: */ManufacturerName-*
Disallow: */ProductSalePriceGross-*
Disallow: */Size-*
Disallow: */FilterClearance-*
Disallow: */ColourFamilyDisplayName-*
Disallow: */function-*
Disallow: */gender-*
Disallow: */age-*
Disallow: */SpecialOffer-*
Disallow: */StockStatus-*
Disallow: */styleshape-*
Disallow: */fabric-*
Disallow: */fillingtype-*
Disallow: */GroupSize*
Disallow: *ContextCategoryUUID*

Sitemap: https://www.farmers.co.nz/sitemap-sitemap

User-agent: bingbot
Crawl-delay: 1
```

The page we want to scrape is https://www.farmers.co.nz/women/fashion/tops. Its path does not match any of the `Disallow` rules above, so scraping it is not disallowed!
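
The same check can be sketched in code with Python's standard `urllib.robotparser`. This is a minimal sketch, not part of the original notebook: it parses a short excerpt of the rules above instead of downloading the file, and the helper name `is_allowed` is made up for illustration. As far as I know, the standard-library parser treats rules as plain path prefixes, so the wildcard rules (such as `*SearchTerm=*`) are left out of the excerpt and would still need to be checked by eye.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt shown above (literal paths only, wildcard rules omitted)
ROBOTS_EXCERPT = """\
User-agent:*
Disallow: /cms/comp
Disallow: /cms/store-update
"""


def is_allowed(url: str, robots_text: str = ROBOTS_EXCERPT, agent: str = "*") -> bool:
    """Hypothetical helper: True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


tops_allowed = is_allowed("https://www.farmers.co.nz/women/fashion/tops")
cms_allowed = is_allowed("https://www.farmers.co.nz/cms/comp")
print(tops_allowed, cms_allowed)
```

To check against the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, at the cost of one extra request.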

### Defining the Website URL
It is good to look at the website beforehand and navigate around a bit, to see whether the structure looks easy to navigate and the formatting is consistent.

In [2]:
# Define the URL of the website
url = "https://www.farmers.co.nz/women/fashion/tops"

### Testing the website
Next we test the URL: we send a request to the web server hosting it, asking for the content of the page.

In [3]:
# Send a GET request to the URL
response = requests.get(url)


Depending on the response, we can start scraping (or not). Common responses are:
* 200 OK: the request succeeded and the server returned the page content
* 404 Not Found: nothing is returned; there may be an error in the URL
* other responses, such as 500 Internal Server Error or 403 Forbidden (access denied)

Here the response is:

In [4]:
response

<Response [200]>

> A response code of 200 means that the server answered our request successfully. We can then move on to scraping ;-)
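
The status codes listed above can also be looked up programmatically. The sketch below is not from the original notebook: it uses the standard-library `http.HTTPStatus` enum, and the helper names `describe_status` and `should_scrape` are hypothetical.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Hypothetical helper: return the code with its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that are not defined in the enum end up here
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


def should_scrape(code: int) -> bool:
    """Only go ahead when the server answered 200 OK."""
    return code == HTTPStatus.OK


for code in (200, 403, 404, 500):
    print(describe_status(code), "-> scrape" if should_scrape(code) else "-> stop")
```

With `requests`, one would pass `response.status_code` to these helpers; `response.raise_for_status()` is an alternative that raises an exception for 4xx and 5xx responses.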

### Using BeautifulSoup to start scraping
The whole code will be embedded in an if/else structure:

**if** the request succeeded
   ***(then)*** scrape the website
      (many actions there)
      at the end, print a message with the location of the data file
**else** scrape nothing
      print a message saying that nothing was returned



In [7]:


# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the product listings
    products = soup.find_all("div", class_ = "product-tile")
    
    # Create lists to store the data
    product_names = []
    product_prices = []
    
    # Loop through the product listings and extract the data
    for product in products:
        name_tag = product.find("span", class_ = "product-title-span") # found by inspecting html
        price_tag = product.find("div", class_="current-price") # found by inspecting html
        
        if name_tag:
            name = name_tag.text.strip()  
        else:
            name = "N/A"
        
        if price_tag:
            price = price_tag.text.strip()
        else:
            price = "N/A"
        
        product_names.append(name)
        product_prices.append(price)
    
    # Create a DataFrame from the lists
    df = pd.DataFrame({
        "Product Name": product_names,
        "Product Price": product_prices
    })
    
    # Save the DataFrame to a CSV file
    df.to_csv("farmers_women_tops_p1.csv", index=False)
    
    print("Data has been written to farmers_women_tops_p1.csv")
else:
    print("Failed to retrieve the webpage.")

Data has been written to farmers_women_tops_p1.csv


In [8]:
df.head()

| | Product Name | Product Price |
|---|---|---|
| 0 | Oliver Black Animal Print V-Neck Short Sleeve ... | $69.99 |
| 1 | Ella J Ditsy Tie Top, Aqua | $69.99 |
| 2 | Whistle Rib V-Neck Flutter Sleeve Tee, Baby Blue | $59.99 |
| 3 | Whistle V-Neck Top, Black | $69.99 |
| 4 | Ella J Tropical Tie Top, Blue | $69.99 |


That got 24 prices (one page). The next step is to extend the scrape across all pages for women's tops and save the result to a new file, farmers_women_tops.csv.

## Cleaning the web scraped data frame
Looking at the resulting data frame, we see that the variable `Product Price` is stored as text (it includes the `$` sign).
We can use a regular expression to create a numeric "Price" variable.

In [10]:
# Extracting the price using regex
df['Price'] = df['Product Price'].str.extract(r'(\d+\.\d+|\d+)').astype(float)
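
To see what this pattern captures, here is a minimal sketch on a few made-up strings. The sample values are hypothetical and not taken from the scraped file. It also shows one limitation worth knowing about: a thousands separator cuts the match short unless the commas are removed first.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
raw = pd.Series(["$69.99", "Now $49.99", "$1,299.00", "N/A"])

pattern = r"(\d+\.\d+|\d+)"

# expand=False returns a Series (one capture group) instead of a one-column DataFrame
prices = raw.str.extract(pattern, expand=False).astype(float)
# "$69.99" -> 69.99, "Now $49.99" -> 49.99, "$1,299.00" -> 1.0, "N/A" -> NaN

# Removing thousands separators first avoids the truncated match
cleaned = (
    raw.str.replace(",", "", regex=False)
    .str.extract(pattern, expand=False)
    .astype(float)
)
# "$1,299.00" -> 1299.0

print(pd.DataFrame({"raw": raw, "prices": prices, "cleaned": cleaned}))
```

Rows where no number is found (such as the "N/A" placeholder used in the scraping code) become `NaN`, so they are skipped by `describe()`.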

### Some descriptive statistics
We may be interested in a first overview of the data collected


In [19]:
# Descriptive statistics for the Price variable
price_stats = df['Price'].describe()

# Select the statistics we are interested in and display them horizontally
selected_stats = price_stats[[ 'min','mean', '50%' , 'max', 'count']].to_frame().T

print(selected_stats)


         min   mean    50%    max  count
Price  49.99  71.24  69.99  99.99   24.0


### Graphical distribution of the prices


## Second part: looping over pages

In [9]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the website
base_url = "https://www.farmers.co.nz/women/fashion/tops"

# Lists to store the product data
product_names = []
product_prices = []

# Loop through each page (0 to 14 in this specific case) 
for page_num in range(0, 15): # note, 2nd number of range not included, so needs to be 14+1
    # Modify the URL to include the page number
    url = f"{base_url}/Page-{page_num}-SortingAttribute-SortBy-asc" # specific to the Farmers.co.nz site
    
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find the product listings
        products = soup.find_all("div", class_="product-tile") # found by inspecting html
        
        # Loop through the product listings and extract the data
        for product in products:
            name_tag = product.find("span", class_="product-title-span") # found by inspecting html
            price_tag = product.find("div", class_="current-price") # found by inspecting html
            
            # Extract the product name
            if name_tag:
                name = name_tag.text.strip()  
            else:
                name = "N/A"
            
            # extract the product price
            if price_tag:
                price = price_tag.text.strip()
            else:
                price = "N/A"
            
            # Append the data to the lists
            product_names.append(name)
            product_prices.append(price)
    else:
        print(f"Failed to retrieve page {page_num}")

# Create a DataFrame from the lists
df = pd.DataFrame({
    "Product Name": product_names,
    "Product Price": product_prices
})

# Save the DataFrame to a CSV file
df.to_csv("farmers_women_tops.csv", index=False)

print("Data has been written to farmers_women_tops.csv")

Data has been written to farmers_women_tops.csv


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product Name   360 non-null    object
 1   Product Price  360 non-null    object
dtypes: object(2)
memory usage: 5.8+ KB


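The scrape ran successfully and returned 360 rows for women's tops (15 pages of 24 products). That is slightly more than the 348 tops noted originally, so it is worth checking the data for duplicate rows.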
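
The page loop above boils down to building one URL per page and fetching each in turn. The sketch below separates those two steps; it is not part of the original notebook, and the names `page_urls` and `fetch_politely` are hypothetical. The one-second pause is an assumption on my part: the robots.txt only states `Crawl-delay: 1` for bingbot, but spacing out requests is a common courtesy.

```python
import time
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def page_urls(base_url: str, n_pages: int) -> List[str]:
    """Build the paginated URLs (pages 0 .. n_pages - 1) in the Farmers.co.nz pattern."""
    return [
        f"{base_url}/Page-{page_num}-SortingAttribute-SortBy-asc"
        for page_num in range(n_pages)
    ]


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], Any],
    delay_seconds: float = 1.0,
) -> Iterator[Tuple[str, Any]]:
    """Call `fetch` on each URL, sleeping between consecutive requests."""
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # assumed courtesy pause, not required by the site
        yield url, fetch(url)


urls = page_urls("https://www.farmers.co.nz/women/fashion/tops", 15)
print(urls[0])
print(urls[-1])
```

In the notebook, `fetch` would be `requests.get`, for example `for url, response in fetch_politely(urls, requests.get): ...`.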

Next, we need to work out how to loop across categories (i.e. 'new arrivals', 'dresses', 'tops'...). The first step is to get the URL for each category.

In [13]:
# Define the URL of the website
url = "https://www.farmers.co.nz/women/fashion"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the categories
    categories = soup.find_all("a", class_ = "category-list-image")
    
    # Create list to store the data
    category_urls = []

    # Loop through the categories and extract their URLs
    for category in categories:
        url = category.get("href", "N/A")
        category_urls.append(url)
          
    print("category_urls list has been created")
else:
    print("Failed to retrieve the webpage.")    

category_urls list has been created


In [14]:
category_urls

['http://www.farmers.co.nz/women/fashion/new-arrivals',
 'http://www.farmers.co.nz/women/fashion/dresses',
 'http://www.farmers.co.nz/women/fashion/tops',
 'http://www.farmers.co.nz/women/fashion/skirts',
 'http://www.farmers.co.nz/women/fashion/jeans',
 'http://www.farmers.co.nz/women/fashion/pants-leggings',
 'http://www.farmers.co.nz/women/fashion/activewear',
 'http://www.farmers.co.nz/women/fashion/shorts',
 'http://www.farmers.co.nz/women/fashion/swimwear',
 'http://www.farmers.co.nz/women/fashion/sweatshirts-hoodies',
 'http://www.farmers.co.nz/women/fashion/knitwear',
 'http://www.farmers.co.nz/women/fashion/coats-jackets']

Next, work out how to get the number of pages as a variable, so that it doesn't have to be hard-coded as it was above for tops (which had pages 0-14).

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the website
base_url = "https://www.farmers.co.nz/women/fashion/dresses"

# Send a GET request to the URL
response = requests.get(base_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the number of pages to be iterated through
    pagenum_tag = soup.find_all("span", class_ = "pagination-hide")

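    # Take the text of the last pagination element
    lastpage = pagenum_tag[-1].text.strip()

    # Drop the first two characters and read the rest as the number of pages
    lastpage_num = int(lastpage[2:])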
    print("retrieved number of pages")
else:
    print("Failed to retrieve the webpage.")    

retrieved number of pages
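
The slice `lastpage[2:]` relies on the pagination label always starting with exactly two characters before the number. The notebook does not show what that label looks like, so the sample strings below are hypothetical. As a sketch of a slightly more forgiving alternative, a regular expression can pick out the trailing number wherever it starts (`parse_last_page` is a made-up helper name):

```python
import re
from typing import Optional


def parse_last_page(label: str) -> Optional[int]:
    """Hypothetical helper: return the number at the end of a pagination label, if any."""
    match = re.search(r"(\d+)\s*$", label)
    return int(match.group(1)) if match else None


# Hypothetical label texts, for illustration only
for label in ("- 5", "of 19", ""):
    print(repr(label), "->", parse_last_page(label))
```

Returning `None` when no number is found makes the "no pagination" case explicit instead of raising a `ValueError` from `int()`.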


Now try to combine the category loop above with the code that scrapes all the pages of a category (after determining the number of pages).
Also remember to add a variable that holds the date and time of the scrape.
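
For the timestamp, a minimal sketch using the same format string as the cell below is shown here. The helper name `scrape_timestamp` and the fixed example date are hypothetical. Accepting an optional `now` argument makes it easy to stamp a whole run with a single time if preferred, instead of a fresh time for every product.

```python
from datetime import datetime
from typing import Optional

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def scrape_timestamp(now: Optional[datetime] = None) -> str:
    """Hypothetical helper: format the given (or current) local time as text."""
    moment = now if now is not None else datetime.now()
    return moment.strftime(TIME_FORMAT)


# A fixed, made-up moment so the result is reproducible
example = scrape_timestamp(datetime(2000, 1, 2, 3, 4, 5))
print(example)  # 2000-01-02 03:04:05

# The text can be turned back into a datetime with the same format
parsed = datetime.strptime(example, TIME_FORMAT)
```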

In [16]:
#import requests
#from bs4 import BeautifulSoup
#import pandas as pd
from datetime import datetime # need this new module

# Define the URL of the website
url = "https://www.farmers.co.nz/women/fashion"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the categories
    categories = soup.find_all("a", class_ = "category-list-image")
    
    # Create list to store the data
    category_urls = []

    # Loop through the categories and extract their URLs
    for category in categories:
        url = category.get("href", "N/A")
        category_urls.append(url)
          
    print("category_urls list has been created")
else:
    print("Failed to retrieve the webpage.")   

# now run a loop across the list of categories (dresses, tops etc ) 

# Lists to store the product data
product_names = []
product_prices = []
product_urls = [] # get the full url for now can transform at later stage
scrape_times = [] 
    
for category_url in category_urls:
    
    # Base URL for the product category
    base_url = category_url

    # Send a GET request to the URL
    response = requests.get(base_url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")
    
        # Find the number of pages to be iterated through
        pagenum_tag = soup.find_all("span", class_ = "pagination-hide")
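        if not pagenum_tag:
            # No pagination element found. Note that with lastpage_num = 0
            # the page loop below does not run, so this category yields no rows.
            lastpage_num = 0
        else:
            # Drop the first two characters of the last pagination label
            # and read the rest as the number of pages
            lastpage = pagenum_tag[-1].text.strip()
            lastpage_num = int(lastpage[2:])

        print(f"Number of pages is {lastpage_num}")
    else:
        print("Failed to retrieve the webpage.")
        continue  # skip this category so a stale or undefined lastpage_num is never used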

    for page_num in range(0, lastpage_num):
        # Modify the URL to include the page number
        url = f"{base_url}/Page-{page_num}-SortingAttribute-SortBy-asc"
    
        # Send a GET request to the URL
        response = requests.get(url)
    
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.content, "html.parser")
        
            # Find the product listings
            products = soup.find_all("div", class_="product-tile")
        
            # Loop through the product listings and extract the data
            for product in products:
                name_tag = product.find("span", class_="product-title-span")
                price_tag = product.find("div", class_="current-price")
            
                # Extract and clean the product name
                name = name_tag.text.strip() if name_tag else "N/A"
            
                # Extract and clean the product price
                price = price_tag.text.strip() if price_tag else "N/A"

                # get time of scrape
                scrape_time = datetime.today().strftime('%Y-%m-%d %H:%M:%S')
            
                # Append the data to the lists
                product_names.append(name)
                product_prices.append(price)
                product_urls.append(base_url)
                scrape_times.append(scrape_time)
        else:
            print(f"Failed to retrieve page {page_num}")

    # Create a DataFrame from everything collected so far (this runs once per category)
    df = pd.DataFrame({
        "Product Name": product_names,
        "Product Price": product_prices,
        "Product Url" : product_urls, # note this is actually the category URL (eg 'tops') so will help with categorisation
        "Scrape Time" : scrape_times
    })

    # Save the DataFrame to a CSV file
    df.to_csv("farmers_womens_fashion.csv", index=False)

    print(f"Data has been written to farmers_womens_fashion.csv for page {category_url}")

category_urls list has been created
Number of pages is 7
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/new-arrivals
Number of pages is 5
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/dresses
Number of pages is 19
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/tops
Number of pages is 3
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/skirts
Number of pages is 3
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/jeans
Number of pages is 6
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/pants-leggings
Number of pages is 9
Data has been written to farmers_womens_fashion.csv for page http://www.farmers.co.nz/women/fashion/activewear
Number of pages is 0
Data has been written to farme

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product Name   1293 non-null   object
 1   Product Price  1293 non-null   object
 2   Product Url    1293 non-null   object
 3   Scrape Time    1293 non-null   object
dtypes: object(4)
memory usage: 40.5+ KB


Perfect! That means we've scraped 1293 products and have a clean data frame with our 4 columns!
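
As noted in the code comments, the `Product Url` column actually holds the category URL, and it can be transformed at a later stage. A minimal sketch of that transformation, using only the standard library, might look like the following. The helper names `category_from_url` and `force_https` are hypothetical, and switching to `https` is an assumption based on the category links being returned with `http://` while the pages were requested over `https://`.

```python
from urllib.parse import urlsplit


def category_from_url(url: str) -> str:
    """Hypothetical helper: return the last path segment, e.g. 'tops'."""
    path = urlsplit(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def force_https(url: str) -> str:
    """Hypothetical helper: rewrite the URL scheme to https, leaving the rest unchanged."""
    return urlsplit(url)._replace(scheme="https").geturl()


sample = "http://www.farmers.co.nz/women/fashion/pants-leggings"
print(category_from_url(sample))  # pants-leggings
print(force_https(sample))        # https://www.farmers.co.nz/women/fashion/pants-leggings
```

Applied to the data frame, this could become a new column, for example `df["Category"] = df["Product Url"].map(category_from_url)`.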