                  ***Product Alternates***
This code works on most Shopify websites. Tested sites, ranked from best to worst:

https://www.boysnextdoor-apparel.co/collections/all  (works best)

https://sartale2022.myshopify.com/collections/all  (works best)

https://glamaroustitijewels.com/collections/all  (good)

https://berkehome.pl/collections/all  (ok)

https://kitchenoasis.com/collections/all  (page not working)

# After analyzing the Shopify websites you provided, I attempted to ***generalize*** the code so that it can be used across multiple websites. I did this by observing the URL patterns of the product pages and identifying commonalities, such as the presence or absence of '/collections/all' in the URL. To generate the product URLs, I used a regular expression that converts each product name into URL format by keeping only its alphanumeric characters, and then checked that the generated URL resolves so that only correct product URLs are kept.
# Generating the output takes some time, at most about 5 minutes.
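
The URL generation described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's exact code: the helper names `title_to_slug` and `candidate_urls` are hypothetical, `example.com` and the product titles are made up, and it assumes product pages live under `/products/<slug>`, either beneath the given URL or beneath the store root once `/collections/all` is stripped.

```python
import re

def title_to_slug(title):
    """Approximate a product-page slug from a title (hypothetical helper)."""
    # Drop double quotes, lowercase, then collapse every run of
    # non-alphanumeric characters into a single hyphen.
    return re.sub(r'[^a-z0-9]+', '-', title.lower().replace('"', ''))

def candidate_urls(url, title):
    """Return the product URLs worth probing, in order (hypothetical helper)."""
    slug = title_to_slug(title)
    candidates = [url + '/products/' + slug]
    suffix = '/collections/all'
    # Only strip the suffix when it is really there; slicing blindly would
    # cut characters off URLs that do not end in '/collections/all'.
    if url.endswith(suffix):
        candidates.append(url[:-len(suffix)] + '/products/' + slug)
    return candidates

print(title_to_slug('Pocket Sweater - Navy'))
for candidate in candidate_urls('https://example.com/collections/all', 'Waist Bag'):
    print(candidate)
```

A slug built this way is only a guess at the real product URL, which is why each candidate still has to be checked against the site before it is used.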





This code extracts product variations from any Shopify website URL using web scraping techniques. It reads product information from the store's `products.json` endpoint (the version shown here uses the product titles) and then applies the DBSCAN clustering algorithm to group similar products together.

The extracted variants can include attributes such as colors, sizes, or other product features, depending on how they are represented on the website. The clustering algorithm is used to group products together based on their similarity, which can help identify common product categories and subcategories, and potentially reveal insights about customer preferences and behaviors.
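
To make "similarity" concrete, the sketch below computes the Euclidean distance between bag-of-n-grams count vectors for a few titles, using only the standard library. It is an approximation of what `CountVectorizer(ngram_range=(1,3))` followed by DBSCAN's default Euclidean metric would see: stop-word removal is ignored, the helper names `ngram_counts` and `euclidean` are hypothetical, and the titles are made up for illustration.

```python
import math
import re
from collections import Counter

def ngram_counts(title, max_n=3):
    """Count word n-grams (n = 1..max_n) in a lowercased title."""
    # Tokens are runs of two or more word characters.
    words = re.findall(r'\w\w+', title.lower())
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams[' '.join(words[i:i + n])] += 1
    return grams

def euclidean(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    # Counter returns 0 for n-grams that a title does not contain.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

navy = ngram_counts('Pocket Sweater Navy')
white = ngram_counts('Pocket Sweater White')
bag = ngram_counts('Waist Bag Navy')

same_product = euclidean(navy, white)   # differs only in the color word
other_product = euclidean(navy, bag)    # shares only the color word

print(round(same_product, 2))   # 2.45
print(round(other_product, 2))  # 3.16
```

Two color variants of the same product differ in six n-grams (distance √6 ≈ 2.45), while two different products that merely share a color differ in ten (distance √10 ≈ 3.16), so a neighborhood radius between those values keeps the variants together and the unrelated products apart.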

To ensure the clustering algorithm performs optimally, the hyperparameters for the DBSCAN algorithm are tuned, with the eps and min_samples values adjusted to achieve good clustering results. This may involve experimenting with different values to find the optimal settings for the specific dataset being analyzed.
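
One way to run such an experiment is to sweep `eps` over a few values and count the resulting clusters and noise points. The sketch below does this on a small made-up list of titles; the titles, the candidate `eps` values, and the helper name `count_clusters` are assumptions for illustration, not the values or data used to tune this notebook.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles: two pairs of color variants plus one unrelated product.
titles = [
    'Pocket Sweater Navy',
    'Pocket Sweater White',
    'Waist Bag Charcoal',
    'Waist Bag Navy',
    'Limited Edition Canvas Tote',
]

# Same vectorizer settings as the notebook: bag of 1- to 3-word n-grams.
vectors = CountVectorizer(stop_words='english', ngram_range=(1, 3)).fit_transform(titles)

def count_clusters(eps, min_samples=2):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(vectors).labels_
    n_clusters = len(set(labels) - {-1})   # label -1 marks noise, not a cluster
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

for eps in (1.0, 2.5, 3.0, 4.0):
    print(eps, count_clusters(eps))
```

On this toy list a very small `eps` leaves every title as noise, a moderate one separates the two variant pairs, and a large one merges everything into a single cluster. The same kind of sweep on real titles can help narrow down a reasonable range, although the best value depends on how long and how varied the titles are.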

In [1]:
import requests
import json
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import DBSCAN
import numpy as np

In [2]:
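def allpages(url,page):
    """Return the products on one page of the store's products.json feed, or None when the page is empty."""
    r=requests.get(url+f"/products.json?page={page}",timeout=5)
    data=r.json()
    if len(data['products'])>0:                       # a non-empty page means there may be more pages
        return data['products']
    return None                                       # empty page: pagination is finished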

In [3]:
def dataparsing(data):
    products=[]
    for product in data:                                    # Fetching Title of all products
        products.append(product['title'])
    return products

In [4]:
def dbscanclustering(product):
    """Cluster [title, url] pairs by title similarity and return {cluster label: [urls]}."""
    product_groups = {}
    product_descriptions = [i[0] for i in product]                              # the titles are the text that gets vectorized
    vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,3))       # bag of 1- to 3-word n-grams
    product_vectors = vectorizer.fit_transform(product_descriptions)

    clustering = DBSCAN(eps=3,min_samples=2).fit(product_vectors)               # DBSCAN clustering with tuned hyperparameters
    for i, label in enumerate(clustering.labels_):
        if label == -1:                                                         # -1 marks noise: no similar product was found
            continue
        product_groups.setdefault(str(label), []).append(product[i][1])         # similar products are clustered together
    return product_groups


In [5]:
# def FindAllGroups(url):
#     product=[]
#     products=[]
#     product_key=[]
#     product_groups=[]
#     lastProducts = []
#     for page in range(1,50):
#         data=allpages(url,page)
#         try:
#             products.append(dataparsing(data))
#         except:
#             print(f'Completed,Total pages={page-1}')
#             break
            
#     # product=[product for i in products for product in i]
#     for i in products:
#       for product in i:
#         product_url=re.sub(r'[^a-zA-Z0-9]+', '-', product)
#         product_url=product_url.lower()
#         # product_url=product_url.removesuffix('/collections/all')
#         product_key.append([product,url+'/products/'+product_url])
#     product_alternates=dbscanclustering(product_key)
  
#     for key in product_alternates:
#       lastProducts.append({"product alternates": product_alternates[key]})

#     return json.dumps(lastProducts)

In [12]:
def FindAllGroups(url):
    products=[]
    product_key=[]
    lastProducts = []
    for page in range(1,50):
        data=allpages(url,page)
        if data is None:                                                        # an empty page means every page has been read
            print(f'Completed,Total pages={page-1}')
            break
        products.append(dataparsing(data))

    suffix='/collections/all'
    candidates=[url]                                                            # try the URL exactly as given first
    if url.endswith(suffix):
        candidates.append(url[:-len(suffix)])                                   # then the store root, without '/collections/all'

    for i in products:
        for product in i:
            # Product name is converted into URL format by RegEx
            product_url1 = re.sub(r'[^a-z0-9]+', '-', product.lower().replace('"', ''))
            for candidate in candidates:
                product_url=candidate+'/products/'+product_url1
                response = requests.head(product_url,timeout=5)
                if response.status_code==200:                                   # keep only URLs that actually resolve
                    product_key.append([product,product_url])
                    break

    product_alternates=dbscanclustering(product_key)

    for key in product_alternates:
        lastProducts.append({"product alternates": product_alternates[key]})

    return json.dumps(lastProducts)

In [17]:
product_alternates=FindAllGroups('https://www.boysnextdoor-apparel.co/collections/all')

Completed,Total pages=23


In [18]:
print(product_alternates)                   # Output in required format

[{"product alternates": ["https://www.boysnextdoor-apparel.co/products/and-wander-heather-waist-bag-charcoal", "https://www.boysnextdoor-apparel.co/products/and-wander-heather-waist-bag-navy"]}, {"product alternates": ["https://www.boysnextdoor-apparel.co/products/beams-two-pocket-sweater-navy", "https://www.boysnextdoor-apparel.co/products/beams-two-pocket-sweater-white", "https://www.boysnextdoor-apparel.co/products/boysnextdoor-pocket-sweater-black", "https://www.boysnextdoor-apparel.co/products/boysnextdoor-pocket-sweater-green", "https://www.boysnextdoor-apparel.co/products/boysnextdoor-pocket-sweater-pink", "https://www.boysnextdoor-apparel.co/products/boysnextdoor-pocket-sweater-white"]}, {"product alternates": ["https://www.boysnextdoor-apparel.co/products/beams-x-abu-garcia-limited-edition-shorts-black", "https://www.boysnextdoor-apparel.co/products/beams-x-abu-garcia-limited-edition-shorts-green"]}, {"product alternates": ["https://www.boysnextdoor-apparel.co/products/beams-x