<a href="https://colab.research.google.com/github/MaximL98/CrawlingInMyProtein.github.io/blob/master/HW03/CrawlingInMP_gym_equip_and_prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This code handles the following:


1.   Checking whether a product belongs to the gym equipment category (not clothes).
2.   Retrieving product prices and displaying products within the price range the user asks for.



Importing relevant libraries

In [None]:
# Imports
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re
import json
import time

In [None]:
mp_main = 'https://www.myprotein.co.il'

Function that returns:


*   Product name
*   Product price
*   Product picture

In [None]:
def get_product_details(mp_soup):
  product_prices = [] # List to store product prices
  product_names = [] # List to store product names
  product_pics = [] # List to store product image URLs
  all_products = mp_soup.find('ul', {'class': 'productListProducts_products'})
  # If the product list container exists
  if all_products:
    # Find all product items in the list
    product_list = all_products.find_all('li')
    for product in product_list:
      # Try to find the product price in the standard class
      product_price = product.find('span', class_='productBlock_fromValue')
      # For some products the price is stored in a different class
      if product_price is None:
        product_price = product.find('span', class_='productBlock_priceValue')
      # Skip list items that have no price at all
      if product_price is None:
        continue
      # Extract the price as a float, removing the last character (currency symbol)
      product_prices.append(float(product_price.text.strip()[:-1]))

      # Find the product name
      product_name = product.find('div', class_='productBlock_title')
      product_name = product_name.find('h3', class_='productBlock_productName')
      product_names.append(product_name.text.strip())

      # Find the product image URL
      product_pic = product.find('img', class_='productBlock_image')
      product_pics.append(product_pic.attrs['src'])

  return product_prices, product_names, product_pics
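
The price extraction above relies on the currency symbol being the last character of the price text. A minimal sketch of a more tolerant parser, assuming prices look like `58.00₪` or `₪1,250.50`; the helper name `parse_price` and the sample strings are hypothetical and not taken from the site:

```python
import re

def parse_price(text):
    """Return the first number found in a price string as a float, or None."""
    # Drop thousands separators, then look for digits with an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

print(parse_price(' 58.00₪ '))    # 58.0
print(parse_price('₪1,250.50'))   # 1250.5
print(parse_price('Sold out'))    # None
```

Returning `None` for text without a number lets the caller decide whether to skip the product.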

Function that returns the link to each product

In [None]:
def get_links_from_page(mp_soup):
    links = [] # Initialize an empty list to store links
    all_products = mp_soup.find('ul', {'class': 'productListProducts_products'})
    if not all_products:
        # Handle the case where the product list is not found
        print("Page does not exist!")
        return links # Return an empty list so the result can still be flattened

    # Iterate over each product list item
    for li in all_products.find_all('li'):
        # Find the anchor tags (links) within the list item
        a_tag_array = li.find_all('a', {'class': 'productBlock_link'})
        # Assuming the second anchor tag is the desired link; skip items without one
        if len(a_tag_array) < 2:
            continue
        a_tag = a_tag_array[1]
        # If a valid link is found, append it to the list
        if a_tag.get('href'):
            links.append(mp_main + a_tag['href'])
    return links # Return the list of extracted links

In [None]:
def flatten(xss):
    return [x for xs in xss for x in xs]
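
As a quick illustration of what `flatten` does with the per-page lists collected later, here is a small sketch on made-up data. It also shows `itertools.chain.from_iterable`, a standard-library alternative that gives the same result:

```python
from itertools import chain

def flatten(xss):
    return [x for xs in xss for x in xs]

pages = [['a', 'b'], [], ['c']]  # one inner list per scraped page (sample data)
print(flatten(pages))                    # ['a', 'b', 'c']
print(list(chain.from_iterable(pages)))  # ['a', 'b', 'c']
```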

Access all products in the accessories list (6 pages) from this [link](https://www.myprotein.co.il/clothing/soft-accessories.list?pageNumber=).

Then retrieve the name, price and picture of each product.

In [None]:
# Initialize empty lists to store product data
links = []
product_names = []
product_prices = []
product_pics = []

for i in range(1,7):
  print(f"Loading products from page {i} ... ")
  mp_url = "https://www.myprotein.co.il/clothing/soft-accessories.list?pageNumber=" + str(i)
  # Use requests to retrieve data from a given URL
  mp_response = requests.get(mp_url)
  # Parse the whole HTML page using BeautifulSoup
  mp_soup = BeautifulSoup(mp_response.text, 'html.parser')
  # Function call to extract product links from the current page
  links.append(get_links_from_page(mp_soup))
  # Append extracted product details to their respective lists
  prices, names, pics = get_product_details(mp_soup)
  product_prices.append(prices)
  product_names.append(names)
  product_pics.append(pics)

Loading products from page 1 ... 
Loading products from page 2 ... 
Loading products from page 3 ... 
Loading products from page 4 ... 
Loading products from page 5 ... 
Loading products from page 6 ... 


In [None]:
links = flatten(links)
product_names = flatten(product_names)
product_pics = flatten(product_pics)
product_prices = flatten(product_prices)

Some of the retrieved data needs to be filtered. This function keeps only the text found inside parentheses and skips any group that contains a percent sign.

In [None]:
def extract_words_in_parentheses(text):
  pattern = r'\(([^()%]+)\)'
  matches = re.findall(pattern, text)
  result = [match for match in matches if '%' not in match]
  return result
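
A small usage sketch of this filter on a made-up ingredient string (the sample text is an assumption, not real product data):

```python
import re

def extract_words_in_parentheses(text):
    pattern = r'\(([^()%]+)\)'
    matches = re.findall(pattern, text)
    result = [match for match in matches if '%' not in match]
    return result

sample = 'whey protein concentrate (milk), emulsifier (soy lecithin), cocoa powder (10%)'
print(extract_words_in_parentheses(sample))  # ['milk', 'soy lecithin']
```

The group `(10%)` is dropped because the character class `[^()%]` already refuses to match a percent sign.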

Another filter function, used in ```get_product_range```: it splits comma-separated items into separate list entries.





In [None]:
def split_and_insert(input_list):
  new_list = []
  for item in input_list:
    if ',' in item:
      split_items = item.split(',')
      new_list.extend(split_items)
    else:
      new_list.append(item)
  return new_list
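
To show how the split step combines with whitespace stripping, here is a condensed sketch on sample data. It is a shorter variant of the function above, relying on the fact that `str.split` returns the whole string as a single element when no comma is present:

```python
def split_and_insert(input_list):
    new_list = []
    for item in input_list:
        # item.split(',') is [item] when there is no comma, so one branch is enough
        new_list.extend(item.split(','))
    return new_list

groups = ['milk, soy', 'egg']  # sample output of a parenthesis extraction step
parts = split_and_insert(groups)
print(parts)                        # ['milk', ' soy', 'egg']
print([p.lstrip() for p in parts])  # ['milk', 'soy', 'egg']
```

The leading space in `' soy'` is why the split result is passed through `lstrip` afterwards.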

Get the product range from each product's details page (for example: Hard Accessories)

In [None]:
def get_product_range(links):
  hard_acc_products = {} # Dictionary to store hard accessory products
  soft_acc_products = {} # Dictionary to store soft accessory products
  description_texts = {} # Dictionary to store product descriptions
  ingredients = [] # List to store ingredient lists
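  # Note: product names are read from the global product_names list by position,
  # so links[i] is assumed to correspond to product_names[i]
  for i, link in enumerate(links):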
    # Use requests to retrieve data from a given URL
    mp_response = requests.get(link)
    # Parse the whole HTML page using BeautifulSoup
    mp_soup = BeautifulSoup(mp_response.text, 'html.parser')

    # Extract product description
    product_div_description = mp_soup.find('div', {'class': 'productDescription_contentWrapper'})
    # Ingredient information retrieval
    product_div_ingredients = mp_soup.find('div', {'class': 'productDescription_contentPropertyListItem_ingredients'})
    ingredients_div = None
    if product_div_ingredients:
      ingredients_div = product_div_ingredients.find('div', class_='athenaProductPageSynopsisContent')
    # Extract ingredients; a product without an ingredient section gets an empty
    # list so that ingredients[i] stays aligned with links[i]
    if ingredients_div:
      ingredients_text = ingredients_div.text.strip()
      ingredients_text = re.sub(r'\xa0', ' ', ingredients_text)
      ingredients_text = extract_words_in_parentheses(ingredients_text.lower())
      ingredients_text = split_and_insert(ingredients_text)
      ingredients_text = [string.lstrip() for string in ingredients_text]
      ingredients.append(ingredients_text)
    else:
      ingredients.append([])

    if product_div_description:
      description_ul = product_div_description.find('ul', class_='productDescription_contentPropertyValue')
      description_li = description_ul.find('li', class_='productDescription_contentPropertyValue_value')
      description_text = description_li.text.strip()
      description_texts[product_names[i]] = description_text
      # Classify product as hard or soft accessory
      if description_text == "Hard Accessories":
        hard_acc_products[product_names[i]] = link
      else:
        soft_acc_products[product_names[i]] = link

    else:
      description_texts[product_names[i]] = 'None'
      continue
  return hard_acc_products, soft_acc_products, description_texts, ingredients

In [None]:
hard_acc_products, soft_acc_products, description_texts, _ = get_product_range(links)

Get price data for protein products from this [link](https://www.myprotein.co.il/nutrition/protein.list)





In [None]:
mp_url = "https://www.myprotein.co.il/nutrition/protein.list"
# Use requests to retrieve data from a given URL
mp_response = requests.get(mp_url)
# Parse the whole HTML page using BeautifulSoup
mp_soup = BeautifulSoup(mp_response.text, 'html.parser')

Get the links, descriptions, ingredients, names, prices and pictures for all the new products.

In [None]:
links2 =  get_links_from_page(mp_soup)

In [None]:
_, _, description_texts2, ingredients2 = get_product_range(links2)

In [None]:
product_prices2 = []
product_names2 = []
product_pics2 = []

prices, names, pics = get_product_details(mp_soup)
product_prices2.append(prices)
product_names2.append(names)
product_pics2.append(pics)

In [None]:
product_names2 = flatten(product_names2)
product_pics2 = flatten(product_pics2)
product_prices2 = flatten(product_prices2)

Create a new list of tuples in the format (product name, product price).

In [None]:
product_data = [(product_names2[i], product_prices2[i]) for i in range(0, len(product_names2))]

Sort this new list by price.

In [None]:
sorted_product_data = sorted(
    product_data,
    key=lambda x: x[1]
)
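
A short sketch of the same sort on made-up tuples. Python's sort is stable, so products with equal prices keep their original relative order, also when `reverse=True` is used:

```python
sample = [('Bar', 68.0), ('Powder', 58.0), ('Shake', 68.0)]  # made-up products

# Cheapest first; 'Bar' stays ahead of 'Shake' because the sort is stable
by_price = sorted(sample, key=lambda item: item[1])
print(by_price)  # [('Powder', 58.0), ('Bar', 68.0), ('Shake', 68.0)]

# Most expensive first
by_price_desc = sorted(sample, key=lambda item: item[1], reverse=True)
print(by_price_desc)  # [('Bar', 68.0), ('Shake', 68.0), ('Powder', 58.0)]
```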

Dictionary of all the operations a user might use when specifying a price range

In [None]:
operation_dict = {'less than': 1,
                  'less than or equal': 2,
                  'higher than': 3,
                  'higher than or equal': 4,
                  'between': 5,
                  'equal': 6}

Extract the operation and the numbers from the user query

In [None]:
def get_operation(q):
  # Find the operation based on keywords in the query. Several keywords can match
  # at once (a query with '... or equal' also contains 'equal'), so keep the longest
  matches = [key for key in operation_dict if key in q]

  # Extract numerical values from the query
  extracted_nums = re.findall(r'\d+', q)

  # Validate the keyword and the number of extracted values
  if not matches or not extracted_nums or len(extracted_nums) >= 3:
    print("Bad user input!")
    return
  operation = operation_dict[max(matches, key=len)]

  # Handle specific operation (operation 5) requiring two values
  if operation == 5:
    if len(extracted_nums) != 2:
      print("Bad user input!")
      return
    price_a, price_b = extracted_nums
    return operation, price_a, price_b

  # Default case: return the first extracted value and None for the second
  return operation, extracted_nums[0], None
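
One pitfall of matching keywords by substring is that a query can contain several keywords at once: a phrase ending in "or equal" also contains "equal". A minimal sketch of resolving this by preferring the longest matching keyword; the table `OPERATIONS` and the helper `find_operation` are hypothetical names used only for illustration:

```python
# Hypothetical subset of an operation table
OPERATIONS = {'less than': 1, 'less than or equal': 2, 'equal': 6}

def find_operation(query):
    """Return the code of the longest keyword contained in the query, or None."""
    matches = [key for key in OPERATIONS if key in query]
    if not matches:
        return None
    return OPERATIONS[max(matches, key=len)]

print(find_operation('price less than or equal 100'))  # 2
print(find_operation('price equal 100'))               # 6
print(find_operation('cheap stuff'))                   # None
```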

Function that performs the operation, using match-case (requires Python 3.10 or later)

In [None]:
def perform_operation(operation, product_price, price_a, price_b):
  # Convert price_a to a float
  price_a = float(price_a)

  # Convert price_b to a float if it's not None
  if price_b:
    price_b = float(price_b)

  # Perform the comparison based on the operation
  match operation:
    case 1:
      return product_price < price_a # Product price is less than price_a
    case 2:
      return product_price <= price_a # Product price is less than or equal to price_a
    case 3:
      return product_price > price_a # Product price is greater than price_a
    case 4:
      return product_price >= price_a # Product price is greater or equal to price_a
    case 5:
      return price_a < product_price < price_b # Product price is between price_a and price_b
    case 6:
      return product_price == price_a # Product price is equal to price_a

  print("Such operation does not exist!")
  return
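
On Python versions older than 3.10, where `match` is unavailable, the same comparison can be sketched with a lookup table built from the standard `operator` module. This is an alternative illustration, not the notebook's method; `COMPARISONS` and `price_matches` are hypothetical names:

```python
import operator

# Hypothetical lookup table mirroring operation codes 1-4 and 6
COMPARISONS = {1: operator.lt, 2: operator.le, 3: operator.gt,
               4: operator.ge, 6: operator.eq}

def price_matches(operation, product_price, price_a, price_b=None):
    price_a = float(price_a)
    if operation == 5:  # 'between' needs both bounds
        return price_a < product_price < float(price_b)
    compare = COMPARISONS.get(operation)
    if compare is None:
        raise ValueError(f'Unknown operation: {operation}')
    return compare(product_price, price_a)

print(price_matches(5, 58.0, '50', '200'))  # True
print(price_matches(1, 58.0, '50'))         # False
```

Raising an exception for an unknown code, instead of printing and returning `None`, keeps a bad operation from being silently treated as "no match" in a filter.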

Example of a user query and its output.

In [None]:
q = "i want product with price between 50 and 200"
operation, price_a, price_b = get_operation(q)
[price for price in sorted_product_data if perform_operation(operation, price[1], price_a, price_b)]

[('Impact Whey Protein Powder', 58.0),
 ('Impact Soy Protein', 68.0),
 ('Impact Pea Protein', 68.0),
 ('Plant Protein Superblend', 69.0),
 ('Protein Meal Replacement Blend', 73.0),
 ('Protein Hot Chocolate', 76.0),
 ('Impact Weight Gainer', 97.0),
 ('Breakfast Smoothie', 119.0),
 ('Clear Vegan Protein', 121.0),
 ('Clear Soy Protein', 121.0),
 ('Clear Whey Hydrate', 136.0),
 ('Impact Whey Isolate Powder', 150.0),
 ('Whey Forward Isolate', 150.0),
 ('Clear Collagen Protein Powder', 155.0),
 ('Clear Whey Diet', 160.0),
 ('Clear Whey Protein Powder', 170.0),
 ('Total Protein Blend', 170.0),
 ('Impact Diet Whey', 184.0),
 ('Collagen Protein Powder', 184.0),
 ('Impact Casein Powder', 194.0)]

Convert all the data retrieved from both links into JSON format.

In [None]:
product_data_to_json = [(product_names[i], product_prices[i],  description_texts[product_names[i]], product_pics[i]) for i in range(0, len(description_texts))]

In [None]:
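# Note: get_product_range stores descriptions under names taken from the global
# product_names list, which is why the lookup key below is product_names[i]
product_data_to_json2 = [(product_names2[i], product_prices2[i],  description_texts2[product_names[i]], ingredients2[i], product_pics2[i]) for i in range(0, len(description_texts2))]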

In [None]:
# Convert to a list of dictionaries
data_dict = [{"product": name, "price": price, "description": description, "picture": picture} for name, price, description, picture in product_data_to_json]
# Write to a JSON file
with open('products_from_acc_list.json', 'w') as outfile:
  json.dump(data_dict, outfile, indent=4)

In [None]:
data_dict2 = [{"product": name, "price": price, "description": description, "ingredients":ingredients, "picture": picture} for name, price, description, ingredients, picture in product_data_to_json2]
with open('products_from_protein_list.json', 'w') as outfile:
  json.dump(data_dict2, outfile, indent=4)
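
A small sketch of the same export on made-up data, followed by a read-back check. It pairs the parallel lists with `zip` and passes `ensure_ascii=False`, which keeps non-ASCII characters readable in the file. The file name and sample values are assumptions:

```python
import json
import os
import tempfile

names = ['Shaker', 'Lifting Straps']  # made-up sample data
prices = [25.0, 40.0]
records = [{'product': n, 'price': p} for n, p in zip(names, prices)]

# Write to a temporary directory so the sketch leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), 'sample_products.json')
with open(path, 'w', encoding='utf-8') as outfile:
    json.dump(records, outfile, indent=4, ensure_ascii=False)

# Read the file back and confirm the data survived the round trip
with open(path, encoding='utf-8') as infile:
    loaded = json.load(infile)
print(loaded == records)  # True
```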