# Data Scraping

**Objective:**  
    - To automate the scraping of dishes and ingredients from restaurant websites or third-party platforms with minimal manual intervention.

**Thought Process:**  
    - I plan to begin the data scraping process by compiling a comprehensive list of restaurants located in the Sydney CBD 2000 region. Using the names of these establishments, I will leverage SerpAPI to obtain their corresponding websites. Subsequently, I will extract the dishes and their ingredients directly from these websites. This entire workflow is designed to be fully automated, minimizing the need for manual input.

# Code Flow
- To gather the names of restaurants, cafes, and pubs in Sydney CBD 2000, I initially considered using popular websites such as Timeout, TripAdvisor, and OpenTable. However, a significant limitation I encountered was that these platforms typically feature only the top-rated establishments. For example, they often present lists like "10 Best Restaurants in Central Business District" or "35 Best Restaurants in Sydney CBD 2000."


- To address this issue, I attempted to extract data directly from Google search results, which provided a more comprehensive list of restaurants in the Sydney CBD 2000 area. Unfortunately, this approach was short-lived because of recurring ChromeDriver errors, so I reverted to the initial method of sourcing names from curated lists.


- After compiling a list of 200 restaurants and retrieving their websites with SerpAPI's Google Search API, I have a CSV file containing both the restaurant names and their respective websites. The next step is to extract the dishes offered by these restaurants, along with their corresponding ingredients.

In [None]:
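import csv
import os

import requests
from bs4 import BeautifulSoup

# Read the SerpAPI key from the environment instead of hardcoding it in the notebook.
# The variable name SERPAPI_API_KEY is an assumption; use whatever name you export.
API_KEY = os.environ.get('SERPAPI_API_KEY', '')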

def get_restaurant_names():
    url = "https://www.timeout.com/sydney/restaurants/the-best-restaurants-in-sydney"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        restaurant_elements = soup.find_all(['h2', 'h3'])
        restaurant_names = [restaurant.get_text(strip=True) for restaurant in restaurant_elements]
        return restaurant_names
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return []

def get_website(restaurant_name):
    params = {
        "q": restaurant_name,
        "location": "Sydney",
        "google_domain": "google.com.au",
        "hl": "en",
        "gl": "au",
        "api_key": API_KEY
    }
    response = requests.get("https://serpapi.com/search.json", params=params)
    if response.status_code == 200:
        results = response.json()
        if 'organic_results' in results and results['organic_results']:
            website = results['organic_results'][0].get('link')
            return website
        else:
            return "No website found"
    else:
        print(f"Failed to fetch results for {restaurant_name}, status code: {response.status_code}")
        return "Failed to retrieve website"

from urllib.parse import urljoin

def scrape_menu(website_url):
    try:
        response = requests.get(website_url, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            menu_div = soup.find('div', class_='menu')
            menu_items = []
            if menu_div:
                # Take the first link inside the menu block (often a PDF menu)
                pdf_link = menu_div.find('a')
                if pdf_link and 'href' in pdf_link.attrs:
                    # urljoin resolves relative links and leaves absolute ones unchanged
                    menu_items.append(urljoin(website_url, pdf_link['href']))
            # Collect every text node that mentions "menu"; str() turns each node into plain text
            possible_menu = soup.find_all(string=lambda text: text and "menu" in text.lower())
            if possible_menu:
                menu_items.extend([str(menu).strip() for menu in possible_menu])
            return " | ".join(menu_items) if menu_items else "Menu not found"
        else:
            return "Failed to retrieve menu"
    except Exception as e:
        print(f"Error retrieving menu from {website_url}: {str(e)}")
        return "Error"

def save_restaurants_with_menu_to_csv():
    restaurant_names = get_restaurant_names()
    with open('restaurants_with_websites_and_menus.csv', mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Restaurant Name', 'Website', 'Menu'])
        for name in restaurant_names:
            website = get_website(name)
            print(f"Restaurant: {name}, Website: {website}")
            # get_website() can return None when the first result has no link
            if website and website.startswith("http"):
                menu = scrape_menu(website)
            else:
                menu = "Website not found"
            writer.writerow([name, website, menu])

save_restaurants_with_menu_to_csv()


- The above method gave me only 200 restaurant names, which I think is not enough, so I tried using a Llama model to get the list of restaurants, the dishes on their menus, and the ingredients.

In [1]:
#Llama Extraction code 

# Preprocessing Data

- Load the CSV file (data_updated.csv) with ISO-8859-1 encoding to handle special characters.
- Define and apply a function that uses a regular expression to strip leading numbers and periods (e.g., "1. Restaurant Name") from the `Restraunts` column (column names are spelled as they appear in the CSV header).
- Handle missing or non-string values in the `Dishes` and `Ingridients` columns by:
  - Filling any NaN (missing) values with empty strings.
  - Converting all values in these columns to strings to ensure consistent data types.
- Group the data by the `Restraunts` column and merge the `Dishes` and `Ingridients` columns by concatenating the unique values for each restaurant.



In [10]:
import pandas as pd
import re

df = pd.read_csv('data_updated.csv', encoding='ISO-8859-1')
def clean_restaurant_name(name):
    return re.sub(r'^\d+\.\s*', '', name)
df['Restraunts'] = df['Restraunts'].apply(clean_restaurant_name)
df['Dishes'] = df['Dishes'].fillna('').astype(str)
df['Ingridients'] = df['Ingridients'].fillna('').astype(str)
df_merged = df.groupby('Restraunts').agg({
    'Dishes': lambda x: ', '.join(x.unique()),   # Merge unique dishes into a single string
    'Ingridients': lambda x: ', '.join(x.unique()) # Merge unique ingredients into a single string
}).reset_index()
df_merged.to_csv('data_updated_merged.csv', index=False)
print("Merging completed and saved to 'data_updated_merged.csv'.")

Merging completed and saved to 'data_updated_merged.csv'.
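
To make the cleaning and merging steps concrete, here is a small standard-library sketch on made-up rows. The sample restaurant names and the `merge_unique` helper are hypothetical and only illustrate what the regular expression and the group-by step do; the notebook itself uses pandas.

```python
import re
from collections import defaultdict

def clean_restaurant_name(name):
    # Strip a leading list number such as "1. " or "12."
    return re.sub(r'^\d+\.\s*', '', name)

def merge_unique(rows):
    """Group (restaurant, dishes, ingredients) rows by cleaned name,
    keeping unique cell values in first-seen order."""
    merged = defaultdict(lambda: {"Dishes": [], "Ingridients": []})
    for name, dishes, ingredients in rows:
        entry = merged[clean_restaurant_name(name)]
        for column, value in (("Dishes", dishes), ("Ingridients", ingredients)):
            value = value or ""  # treat None like an empty cell, similar to fillna('')
            if value not in entry[column]:
                entry[column].append(value)
    return {
        name: {column: ", ".join(values) for column, values in columns.items()}
        for name, columns in merged.items()
    }

# Hypothetical sample rows (not taken from data_updated.csv)
rows = [
    ("1. Cafe Example", "Fish and chips", "Fish, Batter"),
    ("Cafe Example", "Garlic prawns", "prawn, garlic butter"),
    ("2. Noodle Bar", "Ramen", "noodles, egg"),
    ("2. Noodle Bar", "Ramen", "noodles, egg"),
]
merged = merge_unique(rows)
print(merged["Cafe Example"]["Dishes"])
print(merged["Cafe Example"]["Ingridients"])
```

As in the pandas version, uniqueness is checked on whole cell values, not on individual ingredients, so the same ingredient can still appear twice in a merged string when it comes from two different cells.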


# Handling Aliases
 - There are three ways to handle ingredient aliases, i.e. the same ingredient appearing under different names or spellings:
   - Manual mapping.
   - Fuzzy matching with a similarity threshold.
   - Clustering similar ingredients using embeddings.
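
As a rough illustration of the manual approach, the sketch below lowercases each ingredient and looks it up in a hand-written alias table. The entries in `ALIASES` are examples chosen for illustration (assumptions, not a vetted list), and `normalize_ingredient` is a hypothetical helper.

```python
# Hand-maintained alias table: variant -> canonical name.
# These entries are illustrative assumptions, not a curated list.
ALIASES = {
    "prawns": "prawn",
    "mozzarella cheese": "mozzarella",
    "ocean trout": "trout",
}

def normalize_ingredient(raw):
    """Lowercase, collapse whitespace, then apply the alias table."""
    key = " ".join(raw.lower().split())
    return ALIASES.get(key, key)

samples = ["Fish", "fish", "Prawns", "Mozzarella cheese", "  Ocean  Trout "]
print(sorted({normalize_ingredient(s) for s in samples}))
# ['fish', 'mozzarella', 'prawn', 'trout']
```

Lowercasing alone already removes case-only duplicates such as "Batter" and "batter"; the alias table is only needed for genuinely different names.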

## The Fuzzy Way 

In [13]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv("data_updated_merged.csv")
df['Ingridients'] = df['Ingridients'].fillna('').astype(str)
# Split the Ingredients column into individual ingredients and create a unique list
all_ingredients = set()
df['Ingridients'].str.split(', ').apply(all_ingredients.update)

# Convert set to a list
all_ingredients = list(all_ingredients)


In [14]:
print(all_ingredients)

['burgers', 'Fish', 'variety of baked goods', 'variety of vegetables', '', 'Chutney', 'Premium meats', 'Meat', 'Lamb shoulder', 'noodles', 'Dumplings', 'egg', 'chocolate', 'variety of toppings', 'Ocean Trout', 'Batter', 'cured meats', 'mozzarella cheese', 'croissant pastry', 'Beans', 'Various seafood', 'prawn', 'basil', 'duck', 'trout', 'fine dining ingredients', 'Puff Pastry', 'hoisin sauce', 'Coconut Milk', 'calamari', 'Sake', 'bacon', 'banh mi bread', 'bread', 'olives', 'pastry', 'squid', 'soft shell crab', 'batter', 'Yuca', 'Oysters', 'parmesan cheese', 'raw fish', 'omelette', 'Crab', 'garlic butter', 'halloumi cheese', 'Broccoli', 'blood sausage', 'artichoke', 'Cream', 'Butter', 'soy sauce', 'bun', 'beer batter', 'egg yolk', 'Yogurt', 'spices', 'Pork', 'variety of beans', 'Sandwiches', 'Pasta dishes', 'mixed vegetables', 'Tea', 'fish roe', 'Lentils', 'chili', 'Beef broth', 'seafood mix', 'variety of seafood', 'Mozzarella', 'Premium seafood', 'Snails', 'tomatoes', 'onions', 'variet

In [16]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [17]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Define a function to find similar ingredients using fuzzy matching
def find_similar_ingredients(ingredient_list, threshold=85):
    similar_ingredients = {}
    
    for i, ingredient in enumerate(ingredient_list):
        # Compare the current ingredient to all others
        matches = process.extract(ingredient, ingredient_list, limit=len(ingredient_list))
        
        # Filter out matches below the threshold
        matches = [match for match in matches if match[1] > threshold and match[0] != ingredient]
        
        if matches:
            # Create a mapping of ingredient -> most similar match
            similar_ingredients[ingredient] = matches[0][0]
    
    return similar_ingredients

# Get a mapping of similar ingredients
similar_ingredient_mapping = find_similar_ingredients(all_ingredients, threshold=85)
print(similar_ingredient_mapping)



{'Fish': 'fish', 'variety of baked goods': 'Baked goods', 'variety of vegetables': 'variety', 'Premium meats': 'premium meats', 'Meat': 'meat', 'Lamb shoulder': 'lamb shoulder', 'noodles': 'Noodles', 'egg': 'egg yolk', 'chocolate': 'Chocolate', 'variety of toppings': 'variety', 'Ocean Trout': 'trout', 'Batter': 'batter', 'cured meats': 'Meat', 'mozzarella cheese': 'Mozzarella', 'croissant pastry': 'pastry', 'Beans': 'variety of beans', 'Various seafood': 'seafood', 'prawn': 'prawns', 'basil': 'Basil', 'duck': 'Duck', 'trout': 'Ocean Trout', 'Puff Pastry': 'pastry', 'hoisin sauce': 'Hoisin Sauce', 'Coconut Milk': 'Coconut milk', 'bacon': 'Bacon', 'banh mi bread': 'bread', 'bread': 'Bread', 'pastry': 'Pastry', 'soft shell crab': 'Crab', 'batter': 'Batter', 'Oysters': 'oysters', 'parmesan cheese': 'Cheese', 'raw fish': 'Raw Fish', 'Crab': 'crab', 'garlic butter': 'Butter', 'halloumi cheese': 'Cheese', 'blood sausage': 'sausage', 'Cream': 'cream', 'Butter': 'butter', 'soy sauce': 'Soy Sauc

In [18]:
for ingredient, similar in similar_ingredient_mapping.items():
    print(f"{ingredient} -> {similar}")

Fish -> fish
variety of baked goods -> Baked goods
variety of vegetables -> variety
Premium meats -> premium meats
Meat -> meat
Lamb shoulder -> lamb shoulder
noodles -> Noodles
egg -> egg yolk
chocolate -> Chocolate
variety of toppings -> variety
Ocean Trout -> trout
Batter -> batter
cured meats -> Meat
mozzarella cheese -> Mozzarella
croissant pastry -> pastry
Beans -> variety of beans
Various seafood -> seafood
prawn -> prawns
basil -> Basil
duck -> Duck
trout -> Ocean Trout
Puff Pastry -> pastry
hoisin sauce -> Hoisin Sauce
Coconut Milk -> Coconut milk
bacon -> Bacon
banh mi bread -> bread
bread -> Bread
pastry -> Pastry
soft shell crab -> Crab
batter -> Batter
Oysters -> oysters
parmesan cheese -> Cheese
raw fish -> Raw Fish
Crab -> crab
garlic butter -> Butter
halloumi cheese -> Cheese
blood sausage -> sausage
Cream -> cream
Butter -> butter
soy sauce -> Soy Sauce
beer batter -> Batter
egg yolk -> egg
spices -> Spices
Pork -> pork
variety of beans -> Beans
Pasta dishes -> pasta d
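
The mapping above is pairwise and can point in both directions (for example `trout -> Ocean Trout` and `Ocean Trout -> trout`), so it does not yet give one canonical name per ingredient. One possible way to turn such a mapping into groups is a small union-find pass, sketched below. The `group_aliases` helper and the rule of picking the shortest lowercased name as canonical are assumptions for illustration; the sample mapping is a handful of pairs copied from the output above.

```python
def group_aliases(mapping):
    """Merge ingredient -> similar-ingredient pairs into alias groups
    and return a dict of ingredient -> canonical name."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    # Union the two sides of every pair
    for left, right in mapping.items():
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Collect the members of each group under its root
    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)

    # Assumed rule: canonical name = shortest member, lowercased
    canonical = {}
    for members in groups.values():
        name = min(members, key=lambda m: (len(m), m.lower())).lower()
        for member in members:
            canonical[member] = name
    return canonical

sample = {
    "Fish": "fish",
    "Ocean Trout": "trout",
    "trout": "Ocean Trout",
    "Batter": "batter",
    "beer batter": "Batter",
}
canonical = group_aliases(sample)
print(canonical["Ocean Trout"], canonical["beer batter"])
# trout batter
```

Because the grouping is transitive, a loose match such as `egg -> egg yolk` would pull both names into one group, so the resulting groups are worth reviewing by hand before they are applied to the data.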

## Using Embeddings

In [22]:
!pip install spacy 
!pip install nltk
!pip install scikit-learn

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.11.1 requires ruamel-yaml, which is not installed.
llama-index-core 0.10.68.post1 requires numpy<2.0.0, but you have numpy 2.0.2 which is incompatible.
chromadb 0.5.5 requires numpy<2.0.0,>=1.22.5, but you have numpy 2.0.2 which is incompatible.




ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.6.0 requires daal==2021.4.0, which is not installed.
thinc 8.3.2 requires numpy<2.1.0,>=2.0.0; python_version >= "3.9", but you have numpy 1.24.4 which is incompatible.
tensorflow-intel 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.24.4 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.24.4 which is incompatible.
blis 1.0.1 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.24.4 which is incompatible.



Collecting numpy>=1.14.6
  Downloading numpy-1.24.4-cp39-cp39-win_amd64.whl (14.9 MB)
     ---------------------------------------- 14.9/14.9 MB 3.1 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
Successfully installed numpy-1.24.4


In [23]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     -------------------------------------- 33.5/33.5 MB 305.8 kB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [25]:
import pandas as pd
import spacy

# Load spaCy's medium-sized English model
nlp = spacy.load('en_core_web_md')

# Load the CSV file into a pandas DataFrame
df = pd.read_csv("data_updated_merged.csv")
df['Ingridients'] = df['Ingridients'].fillna('').astype(str)

# Split the Ingredients column into individual ingredients
all_ingredients = set()
df['Ingridients'].str.split(', ').apply(all_ingredients.update)

# Convert set to a list
all_ingredients = list(all_ingredients)

# Optional: Remove common stopwords or symbols if necessary
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def clean_ingredient(ingredient):
    return ' '.join([word for word in ingredient.split() if word.lower() not in stop_words])

all_ingredients = [clean_ingredient(ingredient) for ingredient in all_ingredients]


TypeError: 'float' object is not iterable
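
The cell above ends in an error before any clustering is done. As a minimal sketch of what the clustering step could look like, the code below groups items whose vectors have a cosine similarity at or above a threshold, using a simple greedy pass. This is a stand-in written with the standard library, not the notebook's actual method: the toy 3-dimensional vectors, the `cluster_by_similarity` helper, and the 0.9 threshold are all assumptions. In the notebook, the vectors would presumably come from spaCy, e.g. `nlp(ingredient).vector`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster_by_similarity(vectors, threshold=0.9):
    """Greedy clustering: add each item to the first cluster whose first
    member (its representative) is similar enough, else start a new cluster."""
    clusters = []  # list of lists of names
    for name, vector in vectors.items():
        for cluster in clusters:
            if cosine_similarity(vector, vectors[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Toy vectors standing in for word embeddings (made-up numbers)
toy_vectors = {
    "prawn": [0.9, 0.1, 0.0],
    "prawns": [0.88, 0.12, 0.01],
    "butter": [0.0, 0.2, 0.95],
    "garlic butter": [0.05, 0.3, 0.9],
    "basil": [0.1, 0.95, 0.1],
}
for cluster in cluster_by_similarity(toy_vectors):
    print(cluster)
```

The threshold plays the same role as the fuzzy-matching threshold: a lower value merges more aggressively, and whether "garlic butter" should fall into the same cluster as "butter" is a judgment call that depends on how the ingredient list will be used.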