# Project

You're going to build a Streamlit app like the [Westminster Directory app](https://westminster-directory.streamlit.app/) or the [Recipe app](https://allrecipes.streamlit.app/) I showed in class.

You are expected to use what we have learned in class:

- numpy
- pandas
- matplotlib
- regular expressions
- web scraping
- streamlit
- etc.

In [1]:
# Cell 2: imports & helper
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import pandas as pd
import re
import json
import time




## Step 1: Project Idea and Plan

Submit your project idea and plan by class time next Tuesday, 4/22.

Here are some example ideas:

- Weather Data: scrapes weather data from a weather website (e.g., Weather.com) for a specific location. Extract information such as temperature, humidity, wind speed, and weather condition.

- Job Listings: scrapes job listings from a job search website (e.g., Indeed.com or LinkedIn). Gather details such as job title, company name, location, and job description.

- News Headlines: scrapes headlines from a news website (e.g., CNN.com or BBC.com). Extract the title of the article, publication date, and a brief summary.

- Wikipedia: scrapes articles on a specific topic.

- Movie: scrapes information about movies from a movie database website (e.g., IMDb or Rotten Tomatoes). Gather details such as movie title, release year, genre, cast, and ratings.

- Some professional websites related to your major. 
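
As a rough illustration of the weather idea, the sketch below pulls numeric fields out of a piece of text with regular expressions. The sample string, the `parse_weather` name, and the field names are all made up for this example; real pages are laid out differently, so the patterns would need adjusting.

```python
import re

# Hypothetical text, as it might look after extracting it from a page
snippet = "Salt Lake City: 72°F, Humidity 45%, Wind 8 mph"

# One pattern per field; each one captures only the number
PATTERNS = {
    "temperature_f": r"(-?\d+)\s*°\s*F",
    "humidity_pct": r"Humidity\s*:?\s*(\d+)\s*%",
    "wind_mph": r"Wind\s*:?\s*(\d+)\s*mph",
}

def parse_weather(text):
    """Return a dict of ints; a field is None when its pattern is not found."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = int(match.group(1)) if match else None
    return result

weather = parse_weather(snippet)
print(weather)
```

Returning `None` for a missing field, rather than raising an error, makes it easier to load partial results into a pandas DataFrame later.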

Here are some example plans:

- Recipe app: Let users choose recipes within a selected calorie range.

- Job app: Show the trend of programming languages in the job market.

- Movie: Show the trend of review ratings. Analyze sentiment and genre information.

- News: Analyze how hot a topic is. 
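
To make the job-app plan more concrete, here is a minimal sketch that counts how many listings mention each language. The language list, the `count_language_mentions` name, and the three job descriptions are invented for illustration; a real project would feed in scraped text instead.

```python
import re
from collections import Counter

# Hypothetical list of languages to look for
LANGUAGES = ["python", "java", "javascript", "sql"]

def count_language_mentions(descriptions):
    """Count the number of descriptions that mention each language at least once."""
    counts = Counter()
    for text in descriptions:
        for lang in LANGUAGES:
            # \b keeps "java" from matching inside "javascript"
            if re.search(rf"\b{re.escape(lang)}\b", text, flags=re.IGNORECASE):
                counts[lang] += 1
    return counts

# Made-up job descriptions standing in for scraped data
jobs = [
    "Data analyst: Python and SQL required.",
    "Front-end developer with JavaScript experience.",
    "Backend engineer (Java, SQL).",
]
counts = count_language_mentions(jobs)
print(counts.most_common())
```

The resulting counts could go straight into a pandas Series and a matplotlib bar chart.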

## Step 2: Project

You have about 2-3 weeks to build your project in the following steps:

1. Exploring ideas and making a project plan.

2. Scraping the data.

3. Designing and drafting the interface and functionality of the app.

4. Building the Streamlit app on your laptop.

5. Publishing it publicly via GitHub and Streamlit Community Cloud.
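
For steps 3 and 4, one possible shape for a small app is sketched below. It assumes a table with `title`, `primary_alcohol`, `review_count`, and `url` columns; the `filter_recipes` and `render` names, the widget labels, and the sample rows are hypothetical. Streamlit is imported inside `render` so that the filtering logic can be tried without Streamlit installed.

```python
import pandas as pd

def filter_recipes(recipes, alcohol, min_reviews=0):
    """Keep rows for one alcohol with enough reviews, most-reviewed first."""
    mask = (recipes["primary_alcohol"] == alcohol) & (recipes["review_count"] >= min_reviews)
    return recipes.loc[mask].sort_values("review_count", ascending=False)

def render(recipes):
    """Draw the page; meant to be called from a script started with `streamlit run`."""
    import streamlit as st  # imported lazily, see the note above

    st.title("Cocktail explorer")  # hypothetical app title
    options = sorted(recipes["primary_alcohol"].dropna().unique())
    alcohol = st.selectbox("Primary alcohol", options)
    min_reviews = st.slider("Minimum review count", 0, 100, 0)
    st.dataframe(filter_recipes(recipes, alcohol, min_reviews)[["title", "review_count", "url"]])

# Made-up rows for trying the filter on its own
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "primary_alcohol": ["gin", "rum", "gin"],
    "review_count": [5, 50, 20],
    "url": ["u1", "u2", "u3"],
})
print(filter_recipes(sample, "gin", min_reviews=10))
```

To try the page itself, save the functions in a file such as `app.py`, add a call like `render(pd.read_csv("recipes.csv"))` at the bottom (for a CSV with those columns), and start it with `streamlit run app.py`.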

In [5]:
def get_primary_alcohol(ingredients):
    alcohol_keywords = [
        'whiskey','bourbon','rye','scotch','vodka',
        'gin','rum','tequila','brandy','mezcal','liqueur'
    ]
    for item in ingredients:
        name = (item.get('ingredient') or '').lower()
        for alc in alcohol_keywords:
            if alc in name:
                return alc
    return None
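
One thing to watch with the substring test above: `'gin' in name` is also true for "ginger beer". A possible variant, sketched here under the hypothetical name `get_primary_alcohol_strict`, matches whole words with a regular expression instead. It is an alternative to compare against, not a drop-in replacement.

```python
import re

ALCOHOL_KEYWORDS = [
    'whiskey', 'bourbon', 'rye', 'scotch', 'vodka',
    'gin', 'rum', 'tequila', 'brandy', 'mezcal', 'liqueur'
]

# \b on both sides means a keyword only matches as a whole word
ALCOHOL_RE = re.compile(r'\b(' + '|'.join(ALCOHOL_KEYWORDS) + r')\b', re.IGNORECASE)

def get_primary_alcohol_strict(ingredients):
    """Return the first whole-word alcohol keyword found, or None."""
    for item in ingredients:
        match = ALCOHOL_RE.search(item.get('ingredient') or '')
        if match:
            return match.group(1).lower()
    return None

# Made-up ingredient lists
print(get_primary_alcohol_strict([{'ingredient': 'Ginger beer'}, {'ingredient': 'Dark rum'}]))
print(get_primary_alcohol_strict([{'ingredient': 'Ginger beer'}]))
```

The two versions can also disagree when a name contains several keywords: for "rye whiskey" the original returns `whiskey` (first in the keyword list), while this variant returns `rye` (leftmost in the text).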

In [6]:

def parse_recipe(soup, url):

    title_tag = soup.find(class_="heading__title")
    title = title_tag.get_text(strip=True) if title_tag else ""

    author_tag = soup.find(class_="mntl-attribution__item-name")
    author = author_tag.get_text(strip=True) if author_tag else ""

    story_tag = soup.find(
        class_=(
            "comp article__header--project mntl-sc-page mntl-block "
            "article-intro text-passage structured-content"
        )
    )
    story = story_tag.get_text(" ", strip=True) if story_tag else ""


    ingredients = []
    for li in soup.find_all(class_="structured-ingredients__list-item"):
        qty_tag  = li.find('span', {'data-ingredient-quantity': True})
        unit_tag = li.find('span', {'data-ingredient-unit': True})
        name_tag = li.find('span', {'data-ingredient-name': True})

        qty  = qty_tag.get_text(strip=True) if qty_tag else ""
        unit = unit_tag.get_text(strip=True) if unit_tag else ""
        name = name_tag.get_text(strip=True) if name_tag else ""

        note = None
        if not qty and not unit:
            full = li.get_text(" ", strip=True)
            if ':' in full:
                note = full.split(':', 1)[0]

        entry = {'quantity': qty, 'unit': unit, 'ingredient': name}
        if note:
            entry['note'] = note
        ingredients.append(entry)

    steps = [
        s.get_text(" ", strip=True)
        for s in soup.find_all(
            class_=(
                "comp mntl-sc-block mntl-sc-block-startgroup "
                "mntl-sc-block-group--LI"
            )
        )
    ]

    rc_tag = soup.find(
        class_=(
            "comp aggregate-star-rating__count mntl-aggregate-rating "
            "mntl-text-block"
        )
    )
    rc_text = rc_tag.get_text() if rc_tag else ""
    # Allow thousands separators such as "1,234" when reading the count
    m = re.search(r'\d[\d,]*', rc_text)
    review_count = int(m.group(0).replace(',', '')) if m else 0

    primary_alcohol = get_primary_alcohol(ingredients) or ""

    return {
        'url':             url,
        'title':           title,
        'author':          author,
        'story':           story,
        'ingredients':     ingredients,
        'steps':           steps,
        'review_count':    review_count,
        'primary_alcohol': primary_alcohol
    }


In [7]:

start_url = "https://www.liquor.com/cocktail-and-other-recipes-4779343"
headers = {"User-Agent": "Mozilla/5.0"}
max_depth = 3

recipe_links = set()
other_links = set()
visited = set()
num_visited = 0

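# deque gives O(1) pops from the left, which suits a breadth-first queue
from collections import deque
queue = deque([(start_url, 0)])

recipe_data = []

while queue:
    url, depth = queue.popleft()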
    if url in visited or depth > max_depth:
        continue
    visited.add(url)

    print(f"Number: {num_visited} Visiting: {url} (Depth: {depth}) Recipes Stored: {len(recipe_data)}")
    num_visited += 1
    time.sleep(0.05)

    try:
        # A timeout keeps one unresponsive page from stalling the whole crawl
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        if 'recipe' in url:
            recipe_data.append(parse_recipe(soup,url))

        for link in soup.find_all('a', href=True):
            href = link['href']
            if not href.startswith('https://www.liquor.com'):
                continue
            if 'recipe' in href:
                if href not in recipe_links:
                    recipe_links.add(href)
                    queue.append((href, depth + 1))
            else:
                if href not in other_links:
                    other_links.add(href)
                    queue.append((href, depth + 1))
    except Exception as e:
        print(f"Error visiting {url}: {e}")


Number: 0 Visiting: https://www.liquor.com/cocktail-and-other-recipes-4779343 (Depth: 0) Recipes Stored: 0
Number: 1 Visiting: https://www.liquor.com/ (Depth: 1) Recipes Stored: 1
Number: 2 Visiting: https://www.liquor.com/cocktail-by-spirit-4779438 (Depth: 1) Recipes Stored: 1
Number: 3 Visiting: https://www.liquor.com/cocktail-type-4779426 (Depth: 1) Recipes Stored: 1
Number: 4 Visiting: https://www.liquor.com/cocktail-preparation-style-4779415 (Depth: 1) Recipes Stored: 1
Number: 5 Visiting: https://www.liquor.com/cocktails-by-occasion-4779403 (Depth: 1) Recipes Stored: 1
Number: 6 Visiting: https://www.liquor.com/cocktail-flavors-4779387 (Depth: 1) Recipes Stored: 1
Number: 7 Visiting: https://www.liquor.com/other-recipes-4779379 (Depth: 1) Recipes Stored: 1
Number: 8 Visiting: https://www.liquor.com/spirits-and-liqueurs-4779376 (Depth: 1) Recipes Stored: 2
Number: 9 Visiting: https://www.liquor.com/bourbon-4779371 (Depth: 1) Recipes Stored: 2
Number: 10 Visiting: https://www.liquo
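
The crawl above is a breadth-first search with a depth limit. The sketch below shows the same control flow on a tiny in-memory link graph, with no network access, so the visiting order is easy to follow. The `crawl_order` function and the `site` dictionary are made up for illustration.

```python
from collections import deque

def crawl_order(start, links, max_depth):
    """Return (page, depth) pairs in the order a breadth-first crawl visits them."""
    visited = set()
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        # Skip pages already seen or deeper than the limit
        if page in visited or depth > max_depth:
            continue
        visited.add(page)
        order.append((page, depth))
        for nxt in links.get(page, []):
            if nxt not in visited:
                queue.append((nxt, depth + 1))
    return order

# Hypothetical site: each key is a page, each value is the list of pages it links to
site = {
    'home': ['spirits', 'recipes'],
    'spirits': ['gin', 'home'],
    'recipes': ['martini'],
    'gin': ['martini'],
    'martini': ['deep-page'],
}
order = crawl_order('home', site, max_depth=2)
print(order)
```

Every page at depth 1 is visited before any page at depth 2, and `deep-page` is never visited because it sits at depth 3.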

In [8]:
print(f"Found {len(recipe_links)} recipes:")
print(f"\nFound {len(other_links)} other links:")
for link in sorted(recipe_links):
    print(link)

print()
for _ in range(9):
    print("━" * 90)
print()

for link in sorted(other_links):
    print(link)


Found 1621 recipes:

Found 2588 other links:
https://www.liquor.com/14-hours-ahead-cocktail-recipe-5096893
https://www.liquor.com/25th-hour-cocktail-recipe-5209421
https://www.liquor.com/50-50-birthday-cocktail-recipe-5070388
https://www.liquor.com/a-little-chili-punch-recipe-platypus-8414536
https://www.liquor.com/abbey-toddy-cocktail-recipe-5114280
https://www.liquor.com/absinthe-cocktail-recipes-5075576
https://www.liquor.com/absinthe-suisse-cocktail-recipe-5075636
https://www.liquor.com/absinthe-suissesse-cocktail-recipe-5212145
https://www.liquor.com/across-the-pacific-cocktail-recipe-5083941
https://www.liquor.com/after-hours-tennis-club-cocktail-recipe-5101642
https://www.liquor.com/alcoholic-carrot-cake-oreos-recipe-5115644
https://www.liquor.com/alcoholic-cinnamon-bun-oreos-recipe-5115637
https://www.liquor.com/alcoholic-nutter-butters-recipe-5115634
https://www.liquor.com/alcoholic-oreos-recipe-5115646
https://www.liquor.com/alcoholic-snow-cone-recipes-5193507
https://www.liq

In [10]:

recipes_df = pd.DataFrame(recipe_data)


In [11]:

ingredients_df = (
    recipes_df
    .loc[:, ['url','ingredients']]
    .explode('ingredients')
    .reset_index(drop=True)
)

ingredients_df = pd.concat([
    ingredients_df.drop(columns='ingredients'),
    pd.json_normalize(ingredients_df['ingredients'])
], axis=1)

In [12]:
steps_df = (
    recipes_df
    .loc[:, ['url','steps']]
    .explode('steps')
    .reset_index(drop=True)
)
steps_df['step_number'] = steps_df.groupby('url').cumcount() + 1
steps_df = steps_df.rename(columns={'steps':'instruction'})
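
To see what the `explode`, `json_normalize`, and `cumcount` steps do, here is a small sketch on made-up data. The `toy` frame and its two fake URLs are assumptions for illustration; only the column names mirror the cells above.

```python
import pandas as pd

# Two fake recipes, each with a list of ingredient dicts and a list of steps
toy = pd.DataFrame({
    "url": ["page-a", "page-b"],
    "ingredients": [
        [{"quantity": "2", "unit": "ounces", "ingredient": "gin"},
         {"quantity": "1", "unit": "ounce", "ingredient": "lime juice"}],
        [{"quantity": "2", "unit": "ounces", "ingredient": "rum"}],
    ],
    "steps": [["Shake with ice.", "Strain into a glass."], ["Stir and serve."]],
})

# explode: one row per list element, with the url repeated
long_ingredients = toy[["url", "ingredients"]].explode("ingredients").reset_index(drop=True)

# json_normalize: turn each dict into columns, then glue them next to the url
flat_ingredients = pd.concat(
    [long_ingredients.drop(columns="ingredients"),
     pd.json_normalize(long_ingredients["ingredients"].tolist())],
    axis=1,
)

# cumcount: number the steps within each recipe, starting at 1
long_steps = toy[["url", "steps"]].explode("steps").reset_index(drop=True)
long_steps["step_number"] = long_steps.groupby("url").cumcount() + 1

print(flat_ingredients)
print(long_steps)
```

The `reset_index(drop=True)` call matters: after `explode`, rows from the same recipe share an index label, and `pd.concat(..., axis=1)` aligns on the index.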


In [13]:
print(recipes_df.shape, ingredients_df.shape, steps_df.shape)

(1586, 8) (8907, 5) (5357, 3)


In [14]:
recipes_df.to_csv('recipes.csv', index=False)
ingredients_df.to_csv('ingredients.csv', index=False)
steps_df.to_csv('steps.csv', index=False)
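
A note for reading these files back in the app: columns that hold Python lists (such as `ingredients` and `steps` in `recipes.csv`) are written to CSV as text, so `pd.read_csv` returns strings for them. The sketch below shows the round trip on a made-up one-row frame, using an in-memory buffer in place of a file, and converts the text back with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column
df = pd.DataFrame({"url": ["page-a"], "steps": [["Shake.", "Strain."]]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_value = loaded.loc[0, "steps"]  # the list's text form, not a list

# literal_eval safely parses the text back into a Python list
loaded["steps"] = loaded["steps"].apply(ast.literal_eval)
print(type(raw_value).__name__, loaded.loc[0, "steps"])
```

The normalized `ingredients.csv` and `steps.csv` hold plain scalar columns, so they do not need this conversion.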