## Project Plan
I plan to take two datasets from two different beer recipe websites and determine whether they can be joined. Once they are joined, I would like to explore what analysis the combined data supports by breaking it into subgroups by beer style, beer strength, hop bitterness (IBU), and other classifications that will aid the analysis.

The first dataset will be scraped from Beersmith.com. Beersmith is primarily a beer recipe system for amateur and professional brewers (the software is paid). It lets you build, copy, and modify a recipe, and the recipe covers the entire brewing process. The brewer can then choose to share the recipe with other brewers.
The second dataset was downloaded from Kaggle. The original data came from brewersfriend.com.
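
The subgrouping idea from the plan can be sketched as follows. This is only an illustration: the band edges and labels are hypothetical and not taken from either dataset, and the sample IBU values are the first five rows shown later in this notebook.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical band edges and labels, chosen only for illustration.
IBU_EDGES = [20, 40, 60]
IBU_LABELS = ["low", "moderate", "high", "very high"]

def ibu_band(ibu):
    """Return the label of the band containing `ibu` (a value equal to an edge goes to the higher band)."""
    return IBU_LABELS[bisect_right(IBU_EDGES, ibu)]

# IBU values from the first five scraped rows.
sample_ibus = [15.2, 74.4, 14.4, 64.5, 29.5]
band_counts = Counter(ibu_band(v) for v in sample_ibus)
print(band_counts)
```

The same approach would apply to beer strength (ABV) once sensible cut points are chosen from the actual data.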

## Scraping https://beersmithrecipes.com/recent:
For each of the 21 recipes per page, I want to collect:

- Recipe Name → found inside `<h4><a title='View Recipe' href='URL'>NAME</a></h4>`
- Recipe URL → extracted from `<a href="URL">`
- Beer Style → inside `<span class='subtitle'>Cream Ale ( 1C)</span>`
- Brewer Name → inside `<a title='View Profile' href='URL'>USERNAME</a>`
- Original Gravity (OG), Bitterness (IBUs), ABV → found inside `<span class='recipelist'>OG: 1.044, Bitterness: 15.2 IBUs, ABV: 4.7 %</span>`

The recipe URL contains the unique recipe number generated by the host site. I will extract it into its own column, recipe_number.
The `recipelist` field is also a compound field that contains 5 pieces of data, so I will split it into its own columns. Since it is simple comma-delimited text, I plan to create the additional columns in Excel.
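
As an alternative to doing the split in Excel, the two extractions could be sketched in Python. This is a minimal sketch, not the method used in this notebook. It assumes the URL has the form `https://beersmithrecipes.com/viewrecipe/<number>/...` and that the stats string looks like `OG: 1.044 (10.9° P), Bitterness: 15.2 IBUs, ABV: 4.7 %`, based on the sample rows. The function names and the output keys (`og`, `plato`, `ibu`, `abv`) are made up for illustration.

```python
import re

# Assumed formats, inferred from the sample rows in this notebook.
RECIPE_NUMBER_RE = re.compile(r"/viewrecipe/(\d+)")
STATS_RE = re.compile(
    r"OG:\s*(?P<og>\d+\.\d+)"
    r"(?:\s*\((?P<plato>\d+(?:\.\d+)?)\s*°\s*P\))?"  # degrees Plato, optional
    r",\s*Bitterness:\s*(?P<ibu>\d+(?:\.\d+)?)\s*IBUs?"
    r",\s*ABV:\s*(?P<abv>\d+(?:\.\d+)?)\s*%"
)

def extract_recipe_number(url):
    """Return the numeric recipe id from a recipe URL, or None if absent."""
    match = RECIPE_NUMBER_RE.search(url)
    return int(match.group(1)) if match else None

def parse_stats(stats):
    """Split the compound stats string into floats; None if it does not match."""
    match = STATS_RE.search(stats)
    if match is None:
        return None
    return {
        key: (float(value) if value is not None else None)
        for key, value in match.groupdict().items()
    }

print(parse_stats("OG: 1.044 (10.9° P), Bitterness: 15.2 IBUs, ABV: 4.7 %"))
```

With a DataFrame like the one built below, these could be applied per row, for example `df["Recipe URL"].map(extract_recipe_number)`. Rows that return `None` would need a manual look, since the real strings may vary from the assumed pattern.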

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

# Set up Selenium WebDriver
driver = webdriver.Chrome()

# Open the Beersmith Recent Recipes page
url = "https://beersmithrecipes.com/recent"
driver.get(url)

# Wait for page to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h4 > a[title='View Recipe']"))
) #Total scrape took

# Extract recipe details
recipes = driver.find_elements(By.CSS_SELECTOR, "h4")

recipe_data = []
for recipe in recipes:
    try:
        name_element = recipe.find_element(By.CSS_SELECTOR, "a[title='View Recipe']")
        name = name_element.text
        recipe_url = name_element.get_attribute("href")
        
        beer_style = recipe.find_element(By.XPATH, "./following-sibling::span[@class='subtitle']").text
        brewer = recipe.find_element(By.XPATH, "./following-sibling::span[@class='subtitle']/a").text
        stats = recipe.find_element(By.XPATH, "./following-sibling::span[@class='recipelist']").text

        recipe_data.append({
            "Recipe Name": name,
            "Recipe URL": recipe_url,
            "Beer Style": beer_style,
            "Brewer": brewer,
            "Stats": stats
        })
    
    except Exception as e:
        print(f"Skipping an item due to error: {e}")

# Close the browser
driver.quit()

# Convert to DataFrame and display
df = pd.DataFrame(recipe_data)
print(df)

                                         Recipe Name  \
0         Super Magnifico Mexican Lager 8g - Solo v1   
1                       Clemens Honey Stout - 12 gal   
2                                    Modelo Especial   
3                                           GammaRay   
4   Rockaway Chocolate Peanut Butter Stout 2 Batch 2   
5                       ALE: MB: Cali Mountain Stout   
6           ALE: MB: Belgian Pale Turned Into Saison   
7                           Munich Dunkel, Batch 109   
8                           Gold Metal Munich Helles   
9                               RIS Imperial Stout!!   
10                       BB Internationl Amber Lager   
11                                   Millsner - Copy   
12                           10A: Olustvere Pale Ale   
13                                       Mother Gose   
14                               fullerss golden byo   
15                         Killian's Irish Red Clone   
16                                   Louie's Luc

In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Set up Selenium WebDriver
driver = webdriver.Chrome()
base_url = "https://beersmithrecipes.com/recent"
driver.get(base_url)

all_recipes = []

while True:
    # Wait for recipes to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h4 > a[title='View Recipe']"))
    )

    # Extract recipe details
    recipes = driver.find_elements(By.CSS_SELECTOR, "h4")

    for recipe in recipes:
        try:
            name_element = recipe.find_element(By.CSS_SELECTOR, "a[title='View Recipe']")
            name = name_element.text
            recipe_url = name_element.get_attribute("href")

            beer_style = recipe.find_element(By.XPATH, "./following-sibling::span[@class='subtitle']").text
            brewer = recipe.find_element(By.XPATH, "./following-sibling::span[@class='subtitle']/a").text
            stats = recipe.find_element(By.XPATH, "./following-sibling::span[@class='recipelist']").text

            all_recipes.append({
                "Recipe Name": name,
                "Recipe URL": recipe_url,
                "Beer Style": beer_style,
                "Brewer": brewer,
                "Stats": stats
            })
        
        except Exception as e:
            print(f"Skipping an item due to error: {e}")

    # Try to find the "Next Page" link
    try:
        next_page_elements = driver.find_elements(By.CSS_SELECTOR, "div.pagelinks a")
        next_page_url = next_page_elements[-1].get_attribute("href")  # Last link is usually "Next Page"

        # If the next page URL is different, navigate to it
        if next_page_url and next_page_url != driver.current_url:
            driver.get(next_page_url)
            time.sleep(3)  # Allow time for the page to load
        else:
            print("No more pages. Stopping.")
            break

    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("No Next Page link found. Stopping.")
        break

# Close the browser
driver.quit()

# Convert to DataFrame and save to CSV
df = pd.DataFrame(all_recipes)
df.to_csv("beer_recipes.csv", index=False)

print("Scraping complete. Data saved to beer_recipes.csv")

KeyboardInterrupt: 
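
Because the run above was interrupted before `df.to_csv(...)` was reached, nothing was written to disk and the rows had to be recovered from `all_recipes` in memory. One way to guard against that is to save in a `finally` block. The sketch below is an assumption-laden illustration, not the code that was run: `collect_rows` and `save_rows` are hypothetical helpers, and `page_iter` stands in for whatever yields one list of row dicts per scraped page.

```python
import csv

def save_rows(rows, path):
    """Write a list of dicts to a CSV file; do nothing if the list is empty."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

def collect_rows(page_iter, path):
    """Gather rows page by page and save whatever was collected, even on Ctrl-C."""
    rows = []
    try:
        for page_rows in page_iter:
            rows.extend(page_rows)
    except KeyboardInterrupt:
        print(f"Interrupted after {len(rows)} rows; saving partial results.")
    finally:
        # Runs on normal completion, on interrupt, and on any other error.
        save_rows(rows, path)
    return rows
```

In the scraping loop, the Selenium calls for one page would be wrapped in a generator that yields that page's rows, so an interrupt at any point still leaves a CSV of everything collected so far.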

In [7]:
pd.DataFrame(all_recipes)


| | Recipe Name | Recipe URL | Beer Style | Brewer | Stats |
|---|---|---|---|---|---|
| 0 | Super Magnifico Mexican Lager 8g - Solo v1 | https://beersmithrecipes.com/viewrecipe/513713... | Cream Ale ( 1C) | cdburg | OG: 1.044 (10.9° P), Bitterness: 15.2 IBUs, AB... |
| 1 | Clemens Honey Stout - 12 gal | https://beersmithrecipes.com/viewrecipe/209010... | Imperial Stout (20C) | stevclem | OG: 1.097 (23.2° P), Bitterness: 74.4 IBUs, AB... |
| 2 | Modelo Especial | https://beersmithrecipes.com/viewrecipe/494521... | Vienna Lager ( 7A) | jonsl8 | OG: 1.045 (11.2° P), Bitterness: 14.4 IBUs, AB... |
| 3 | GammaRay | https://beersmithrecipes.com/viewrecipe/223328... | New England IPA (21B) | rickkickin | OG: 1.060 (14.9° P), Bitterness: 64.5 IBUs, AB... |
| 4 | Rockaway Chocolate Peanut Butter Stout 2 Batch 2 | https://beersmithrecipes.com/viewrecipe/510436... | Sweet Stout (16A) | HiawathaBrewing | OG: 1.059 (14.5° P), Bitterness: 29.5 IBUs, AB... |
| ... | ... | ... | ... | ... | ... |
| 63116 | Daddy T's ESB | https://beersmithrecipes.com/viewrecipe/649/da... | Extra Special/Strong Bitter (English Pale Ale)... | BeerSmith | OG: 1.050 (12.5° P), Bitterness: 39.4 IBUs, AB... |
| 63117 | Culver City Stout | https://beersmithrecipes.com/viewrecipe/647/cu... | Dry Stout (13A) | BeerSmith | OG: 1.044 (11.0° P), Bitterness: 37.4 IBUs, AB... |
| 63118 | Cromwell Bitter | https://beersmithrecipes.com/viewrecipe/645/cr... | Standard/Ordinary Bitter ( 8A) | BeerSmith | OG: 1.036 (9.0° P), Bitterness: 28.9 IBUs, ABV... |
| 63119 | Cremora Cream Ale | https://beersmithrecipes.com/viewrecipe/644/cr... | Cream Ale ( 6A) | BeerSmith | OG: 1.052 (12.8° P), Bitterness: 16.2 IBUs, AB... |


In [9]:
df.shape


(630, 5)

In [10]:
pd.DataFrame(all_recipes)

| | Recipe Name | Recipe URL | Beer Style | Brewer | Stats |
|---|---|---|---|---|---|
| 0 | Super Magnifico Mexican Lager 8g - Solo v1 | https://beersmithrecipes.com/viewrecipe/513713... | Cream Ale ( 1C) | cdburg | OG: 1.044 (10.9° P), Bitterness: 15.2 IBUs, AB... |
| 1 | Clemens Honey Stout - 12 gal | https://beersmithrecipes.com/viewrecipe/209010... | Imperial Stout (20C) | stevclem | OG: 1.097 (23.2° P), Bitterness: 74.4 IBUs, AB... |
| 2 | Modelo Especial | https://beersmithrecipes.com/viewrecipe/494521... | Vienna Lager ( 7A) | jonsl8 | OG: 1.045 (11.2° P), Bitterness: 14.4 IBUs, AB... |
| 3 | GammaRay | https://beersmithrecipes.com/viewrecipe/223328... | New England IPA (21B) | rickkickin | OG: 1.060 (14.9° P), Bitterness: 64.5 IBUs, AB... |
| 4 | Rockaway Chocolate Peanut Butter Stout 2 Batch 2 | https://beersmithrecipes.com/viewrecipe/510436... | Sweet Stout (16A) | HiawathaBrewing | OG: 1.059 (14.5° P), Bitterness: 29.5 IBUs, AB... |
| ... | ... | ... | ... | ... | ... |
| 63116 | Daddy T's ESB | https://beersmithrecipes.com/viewrecipe/649/da... | Extra Special/Strong Bitter (English Pale Ale)... | BeerSmith | OG: 1.050 (12.5° P), Bitterness: 39.4 IBUs, AB... |
| 63117 | Culver City Stout | https://beersmithrecipes.com/viewrecipe/647/cu... | Dry Stout (13A) | BeerSmith | OG: 1.044 (11.0° P), Bitterness: 37.4 IBUs, AB... |
| 63118 | Cromwell Bitter | https://beersmithrecipes.com/viewrecipe/645/cr... | Standard/Ordinary Bitter ( 8A) | BeerSmith | OG: 1.036 (9.0° P), Bitterness: 28.9 IBUs, ABV... |
| 63119 | Cremora Cream Ale | https://beersmithrecipes.com/viewrecipe/644/cr... | Cream Ale ( 6A) | BeerSmith | OG: 1.052 (12.8° P), Bitterness: 16.2 IBUs, AB... |


In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630 entries, 0 to 629
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Recipe Name  630 non-null    object
 1   Recipe URL   630 non-null    object
 2   Beer Style   630 non-null    object
 3   Brewer       630 non-null    object
 4   Stats        630 non-null    object
dtypes: object(5)
memory usage: 24.7+ KB


In [12]:
print(len(all_recipes))  # Should print 63121

63121


In [13]:
df = pd.DataFrame(all_recipes)
df.to_csv("beer_recipes_recovered.csv", index=False)
print("Recovered data saved.")

Recovered data saved.


In [14]:
print(type(all_recipes))  # Should be <class 'list'>
print(type(all_recipes[0]))  # Should be <class 'dict'>

<class 'list'>
<class 'dict'>


In [15]:
df = pd.DataFrame.from_records(all_recipes)  # Alternative method
print(df.shape)  # Should be (63121, 5)

(63121, 5)


In [16]:
df.to_csv("beer_recipes_fixed.csv", index=False)
print("Saved data successfully!")

Saved data successfully!
