# Project

You're going to build a Streamlit app like the [Westminster Directory app](https://westminster-directory.streamlit.app/) or the [Recipe app](https://allrecipes.streamlit.app/) I showed in class.

You are expected to use what we have learned in class:

- numpy
- pandas
- matplotlib
- regular expressions
- web scraping
- Streamlit
- etc.

In [1]:
# Cell 2: imports & helper
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import pandas as pd
import re
import json
import time




## Step 1: Project Idea and Plan

You need to submit your project idea and plan by class time next Tuesday, 4/22.

Here are some example ideas:

- Weather Data: scrapes weather data from a weather website (e.g., Weather.com) for a specific location. Extract information such as temperature, humidity, wind speed, and weather condition.

- Job Listings: scrapes job listings from a job search website (e.g., Indeed.com or LinkedIn). Gather details such as job title, company name, location, and job description.

- News Headlines: scrapes headlines from a news website (e.g., CNN.com or BBC.com). Extract the title of the article, publication date, and a brief summary.

- Wikipedia: scrapes articles on a specific topic.

- Movie: scrapes information about movies from a movie database website (e.g., IMDb or Rotten Tomatoes). Gather details such as movie title, release year, genre, cast, and ratings.

- Some professional websites related to your major. 

Here are some example plans:

- Recipe app: Let users choose recipes within a selected calorie range.

- Job app: Show the trend of programming languages in the job market.

- Movie: Show the trend in review ratings. Analyze sentiment and genre information.

- News: Analyze how hot a topic is. 

## Step 2: Project

You have about 2-3 weeks to build your project in the following steps:

1. Exploring ideas and making a project plan.

2. Scraping the data.

3. Designing and drafting the interface and functionality of the app.

4. Building the Streamlit app on your local laptop.

5. Publishing it via GitHub and Streamlit Cloud.

In [2]:
# Cell 3: run it
page = "https://www.liquor.com/bourbon-cocktails-4779435"

headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(page, headers=headers)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
links = []
recipe_boxes = soup.find_all(class_="mntl-taxonomysc-sibling-node mntl-text-link js-carousel-item")
for recipe in recipe_boxes:
    links.append(recipe["href"])
links.append(page)

for link in links:
    print("-", link)


- https://www.liquor.com/vodka-cocktails-4779437
- https://www.liquor.com/rum-cocktails-4779434
- https://www.liquor.com/scotch-cocktails-4779431
- https://www.liquor.com/rye-whiskey-cocktails-4779433
- https://www.liquor.com/whiskey-cocktails-4779430
- https://www.liquor.com/tequila-and-mezcal-cocktails-4779429
- https://www.liquor.com/brandy-cocktails-4779428
- https://www.liquor.com/other-cocktails-4779427
- https://www.liquor.com/gin-cocktails-4779436
- https://www.liquor.com/bourbon-cocktails-4779435
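
The hrefs printed above happen to be absolute URLs, but scraped hrefs are often relative or duplicated. A minimal sketch of one way to clean them up with `urljoin` and `urlparse` (both imported in the first cell); the helper name `normalize_links` and the `allowed_host` parameter are illustrative assumptions, not part of the assignment:

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, hrefs, allowed_host="www.liquor.com"):
    """Resolve hrefs against base_url and keep unique same-host links, in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)    # handles "/path" and full URLs alike
        absolute = absolute.split("#", 1)[0]  # drop any "#fragment" part
        if urlparse(absolute).netloc != allowed_host:
            continue                          # skip links that leave the site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

example = normalize_links(
    "https://www.liquor.com/bourbon-cocktails-4779435",
    [
        "/gin-cocktails-4779436",
        "https://www.liquor.com/rum-cocktails-4779434",
        "https://example.com/elsewhere",
        "/gin-cocktails-4779436#top",
    ],
)
print(example)
```

Keeping the first-seen order (rather than returning a bare `set`) makes the printed list reproducible between runs.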


In [3]:
# Cell 4: visit each category page and split its links into recipes and others
other_links = []
recipe_links = []
for link in links:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(link, headers=headers)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    recipe_boxes = soup.find_all(class_="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image")
    for recipe in recipe_boxes:
        temp_link = recipe["href"]
        if re.search(r'recipe', temp_link):
            recipe_links.append(temp_link)
        else:
            other_links.append(temp_link)

for link in recipe_links:
    print(link)
print(f"Found {len(recipe_links)} recipes:")

print()
for link in other_links:
    print(link)
print(f"Found {len(other_links)} Other links:")

https://www.liquor.com/kansas-city-ice-water-cocktail-recipe-8558889
https://www.liquor.com/lavender-mule-cocktail-recipe-8412231
https://www.liquor.com/nice-list-cocktail-recipe-5496825
https://www.liquor.com/a-little-chili-punch-recipe-platypus-8414536
https://www.liquor.com/dirty-shirley-cocktail-recipe-5441621
https://www.liquor.com/homemade-zima-cocktail-recipe-5324713
https://www.liquor.com/martini-recipe-variations-5218629
https://www.liquor.com/cajun-martini-cocktail-recipe-5218591
https://www.liquor.com/dreamy-dorini-smoking-martini-cocktail-recipe-5203959
https://www.liquor.com/farmers-cocktail-recipe-5196661
https://www.liquor.com/hairy-navel-cocktail-recipe-5189037
https://www.liquor.com/slideshows/vodka-cocktail-recipes/
https://www.liquor.com/chi-chi-cocktail-recipe-5188137
https://www.liquor.com/bay-breeze-cocktail-recipe-5181821
https://www.liquor.com/chocolate-martini-cocktail-recipe-5120730
https://www.liquor.com/pom-blood-orange-old-fashioned-cocktail-recipe-5119325
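
Matching on the bare word `recipe` also lets through collection pages such as `slideshows/vodka-cocktail-recipes/` and `martini-recipe-variations-5218629`. A stricter regular expression is one option. The sketch below assumes that single-recipe URLs end in `-recipe-<digits>` or look like `/recipes/<slug>/`; both shapes are inferred only from URLs printed in this notebook, so treat the pattern as a starting point, not a rule the site guarantees:

```python
import re

# Assumed URL shapes for a single recipe page (inferred from printed output only).
SINGLE_RECIPE = re.compile(r"(-recipe-\d+|/recipes/[^/]+)/?$")

def is_single_recipe(url):
    """Return True if the URL looks like one recipe rather than a collection."""
    return bool(SINGLE_RECIPE.search(url))

samples = [
    "https://www.liquor.com/lavender-mule-cocktail-recipe-8412231",
    "https://www.liquor.com/recipes/luck-of-the-irish/",
    "https://www.liquor.com/slideshows/vodka-cocktail-recipes/",
    "https://www.liquor.com/martini-recipe-variations-5218629",
]
for url in samples:
    print(is_single_recipe(url), url)
```

Note the trade-off: a URL like `a-little-chili-punch-recipe-platypus-8414536` does not match this pattern and would need its own rule if it turns out to be a real recipe page.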


In [4]:



recipe_data = []
for recipe_link in recipe_links:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(recipe_link, headers=headers)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find(class_="heading__title").text
    story = soup.find(class_="comp article__header--project mntl-sc-page mntl-block article-intro text-passage structured-content").text.replace('\n', '')
    ingredient_items = soup.find_all(class_="structured-ingredients__list-item")

    ingredients = []

    for li in ingredient_items:
        # find the spans that hold quantity, unit, and ingredient name
        qty_tag  = li.find('span', {'data-ingredient-quantity': True})
        unit_tag = li.find('span', {'data-ingredient-unit': True})
        name_tag = li.find('span', {'data-ingredient-name': True})

        qty  = qty_tag.get_text(strip=True) if qty_tag else None
        unit = unit_tag.get_text(strip=True) if unit_tag else None
        name = name_tag.get_text(strip=True) if name_tag else None

        # detect a leading “Garnish:” note if no qty/unit
        note = None
        if not qty and not unit:
            full = li.get_text(separator=' ', strip=True)
            if ':' in full:
                note = full.split(':',1)[0]

        ingredients.append({
            'quantity':    qty,
            'unit':        unit,
            'ingredient':  name,
            **({'note': note} if note else {})
        })
    steps = []
    step_items = soup.find_all(class_="comp mntl-sc-block mntl-sc-block-startgroup mntl-sc-block-group--LI")
    for step in step_items:
        steps.append(step.text.replace('\n', ''))
    review_count = soup.find(class_="comp aggregate-star-rating__count mntl-aggregate-rating mntl-text-block").text
    # raw string avoids an invalid-escape warning for \d
    r = re.search(r'(\d+)', review_count)
    review_count = r.group(1) if r else None

    print(recipe_link)
    print()
    print(title)
    print()
    print(story)
    print()
    print(ingredients)
    print()
    print(steps)
    print()
    print(review_count)
    break

https://www.liquor.com/kansas-city-ice-water-cocktail-recipe-8558889

Kansas City Ice Water

 The Kansas City Ice Water, also known as the KC Ice Water, is a simple cocktail that usually includes gin, vodka, lime juice, triple sec, and lemon-lime soda. The drink remains popular in its original form, but that hasn’t stopped some bartenders from seeking to improve upon its base template. This zesty, floral, and crowd-pleasing riff comes from Andrew Olsen, national beverage director of J. Rieger & Co, a historic distillery in Kansas City, Missouri. “The KC Ice Water was a common order in the high-volume bar scene of many college towns and local haunts, especially in the Midwest region,” says Olsen. “In 2017, when opening up a restaurant in the Country Club Plaza district, I recognized that popularity and wanted to make something that was reminiscent of the original flavors, but with a more craft element added to it.” “Our version stays with the vodka base but instead of splitting with gin
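
The `quantity` field is scraped as text, which makes numeric filtering awkward later in the app. A small sketch of converting it to a number, under the assumption (not verified against the site) that quantities look like `"2"`, `"3/4"`, or `"1 1/2"`; the helper name `parse_quantity` is hypothetical:

```python
from fractions import Fraction

def parse_quantity(text):
    """Convert '2', '3/4', or '1 1/2' to a float; return None if unparsable."""
    if not text:
        return None
    parts = text.split()
    if not parts:
        return None
    try:
        # '1 1/2' -> Fraction(1) + Fraction(1, 2) -> 1.5
        return float(sum(Fraction(part) for part in parts))
    except (ValueError, ZeroDivisionError):
        return None

for raw in ["2", "3/4", "1 1/2", "dash", None]:
    print(repr(raw), "->", parse_quantity(raw))
```

Returning `None` for anything unexpected keeps the scraper running and leaves the odd cases easy to find afterwards.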

In [9]:

start_url = "https://www.liquor.com/cocktail-and-other-recipes-4779343"
headers = {"User-Agent": "Mozilla/5.0"}
max_depth = 2

recipe_links = set()
other_links = set()
visited = set()
num_visited = 0

queue = [(start_url, 0)]

while queue:
    url, depth = queue.pop(0)
    if url in visited or depth > max_depth:
        continue
    visited.add(url)

    print(f"Number: {num_visited} Visiting: {url} (Depth: {depth}) recipe length: {len(recipe_links)}")
    num_visited += 1
    time.sleep(0.05)

    try:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        for link in soup.find_all('a', href=True):
            href = link['href']
            if not href.startswith('https://www.liquor.com'):
                continue
            if 'recipe' in href:
                if href not in recipe_links:
                    recipe_links.add(href)
                    queue.append((href, depth + 1))
            else:
                if href not in other_links:
                    other_links.add(href)
                    queue.append((href, depth + 1))
    except Exception as e:
        print(f"Error visiting {url}: {e}")


Visiting: https://www.liquor.com/cocktail-and-other-recipes-4779343 (Depth: 0) recipe length: 0
Visiting: https://www.liquor.com/ (Depth: 1) recipe length: 39
Visiting: https://www.liquor.com/cocktail-by-spirit-4779438 (Depth: 1) recipe length: 41
Visiting: https://www.liquor.com/cocktail-type-4779426 (Depth: 1) recipe length: 55
Visiting: https://www.liquor.com/cocktail-preparation-style-4779415 (Depth: 1) recipe length: 64
Visiting: https://www.liquor.com/cocktails-by-occasion-4779403 (Depth: 1) recipe length: 82
Visiting: https://www.liquor.com/cocktail-flavors-4779387 (Depth: 1) recipe length: 122
Visiting: https://www.liquor.com/other-recipes-4779379 (Depth: 1) recipe length: 148
Visiting: https://www.liquor.com/spirits-and-liqueurs-4779376 (Depth: 1) recipe length: 197
Visiting: https://www.liquor.com/bourbon-4779371 (Depth: 1) recipe length: 197
Visiting: https://www.liquor.com/brandy-4779364 (Depth: 1) recipe length: 213
Visiting: https://www.liquor.com/gin-4779369 (Depth: 1) r
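
The crawl above is a depth-limited breadth-first search: visit a page, queue its links one level deeper, and skip anything already visited or too deep. A minimal sketch of the same idea on a made-up in-memory "site", so it runs without network access. It swaps the list-based queue for `collections.deque`, whose `popleft()` is O(1) while `list.pop(0)` is O(n); `crawl_order`, `get_links`, and `toy_site` are illustrative names:

```python
from collections import deque

def crawl_order(start, get_links, max_depth):
    """Return (url, depth) pairs in the order a depth-limited BFS visits them."""
    visited = set()
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append((url, depth))
        for nxt in get_links(url):
            if nxt not in visited:
                queue.append((nxt, depth + 1))
    return order

# Hypothetical link structure standing in for real pages.
toy_site = {"home": ["a", "b"], "a": ["c", "home"], "b": ["c"], "c": ["d"], "d": []}
order = crawl_order("home", lambda u: toy_site.get(u, []), max_depth=2)
print(order)
```

In a real crawl, `get_links` would fetch the page and extract its hrefs; keeping it as a parameter makes the traversal logic easy to test on its own.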

In [12]:
print(f"Found {len(recipe_links)} recipes:")
print(f"\nFound {len(other_links)} other links:")
for link in sorted(recipe_links):
    print(link)

print()
for _ in range(9):
    print("━" * 90)
print()

for link in sorted(other_links):
    print(link)


Found 1586 recipes:

Found 2410 other links:
https://www.liquor.com/14-hours-ahead-cocktail-recipe-5096893
https://www.liquor.com/25th-hour-cocktail-recipe-5209421
https://www.liquor.com/50-50-birthday-cocktail-recipe-5070388
https://www.liquor.com/a-little-chili-punch-recipe-platypus-8414536
https://www.liquor.com/abbey-toddy-cocktail-recipe-5114280
https://www.liquor.com/absinthe-cocktail-recipes-5075576
https://www.liquor.com/absinthe-suisse-cocktail-recipe-5075636
https://www.liquor.com/absinthe-suissesse-cocktail-recipe-5212145
https://www.liquor.com/across-the-pacific-cocktail-recipe-5083941
https://www.liquor.com/after-hours-tennis-club-cocktail-recipe-5101642
https://www.liquor.com/alcoholic-carrot-cake-oreos-recipe-5115644
https://www.liquor.com/alcoholic-cinnamon-bun-oreos-recipe-5115637
https://www.liquor.com/alcoholic-nutter-butters-recipe-5115634
https://www.liquor.com/alcoholic-oreos-recipe-5115646
https://www.liquor.com/alcoholic-snow-cone-recipes-5193507
https://www.liq

In [16]:
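def get_primary_alcohol(ingredients):
    """Return the first spirit keyword found in the ingredient names, or None."""
    alcohol_keywords = [
        'whiskey','bourbon','rye','scotch','vodka',
        'gin','rum','tequila','brandy','mezcal','liqueur'
    ]
    for item in ingredients:
        name = (item.get('ingredient') or '').lower()
        for alc in alcohol_keywords:
            # match whole words only, so 'gin' does not match 'ginger beer'
            if re.search(rf'\b{alc}\b', name):
                return alc
    return None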

In [18]:
recipe_data = []
num_scraped = 0

for recipe_link in recipe_links:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(recipe_link, headers=headers)
    print(f"Visiting: {recipe_link}")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    num_scraped += 1

    # title & intro (empty string if missing)
    title_tag = soup.find(class_="heading__title")
    title = title_tag.get_text(strip=True) if title_tag else ""

    story_tag = soup.find(
        class_=(
            "comp article__header--project mntl-sc-page mntl-block "
            "article-intro text-passage structured-content"
        )
    )
    story = story_tag.get_text(" ", strip=True) if story_tag else ""

    # ingredients (always a list; fields default to "")
    ingredients = []
    for li in soup.find_all(class_="structured-ingredients__list-item"):
        qty_tag  = li.find('span', {'data-ingredient-quantity': True})
        unit_tag = li.find('span', {'data-ingredient-unit': True})
        name_tag = li.find('span', {'data-ingredient-name': True})

        qty  = qty_tag.get_text(strip=True) if qty_tag else ""
        unit = unit_tag.get_text(strip=True) if unit_tag else ""
        name = name_tag.get_text(strip=True) if name_tag else ""

        note = None
        if not qty and not unit:
            full = li.get_text(" ", strip=True)
            if ':' in full:
                note = full.split(':', 1)[0]

        entry = {'quantity': qty, 'unit': unit, 'ingredient': name}
        if note:
            entry['note'] = note
        ingredients.append(entry)

    # steps (empty list if none)
    steps = [
        s.get_text(" ", strip=True)
        for s in soup.find_all(
            class_=(
                "comp mntl-sc-block mntl-sc-block-startgroup "
                "mntl-sc-block-group--LI"
            )
        )
    ]

    # review count (0 if missing or unparsable)
    rc_tag = soup.find(
        class_=(
            "comp aggregate-star-rating__count mntl-aggregate-rating "
            "mntl-text-block"
        )
    )
    rc_text = rc_tag.get_text() if rc_tag else ""
    m = re.search(r'(\d+)', rc_text)
    review_count = int(m.group(1)) if m else 0

    # primary alcohol (empty string if none found)
    primary_alcohol = get_primary_alcohol(ingredients) or ""

    recipe_data.append({
        'url':             recipe_link,
        'title':           title,
        'story':           story,
        'ingredients':     ingredients,
        'steps':           steps,
        'review_count':    review_count,
        'primary_alcohol': primary_alcohol
    })

    print(f"Scraped: {title}  (#{num_scraped})")
    time.sleep(0.05)

https://www.liquor.com/whiskey-highball-cocktail-recipe-5085252
Scraped: Whiskey Highball NUM: 1
https://www.liquor.com/recipes/luck-of-the-irish/
Scraped: Luck of the Irish NUM: 2
https://www.liquor.com/blood-sage-cocktail-recipe-5119331
Scraped: Blood Sage NUM: 3
https://www.liquor.com/silver-fizz-cocktail-recipe-5224789
Scraped: Silver Fizz NUM: 4
https://www.liquor.com/recipes/stone-cold-larceny/
Scraped: Stone Cold Larceny NUM: 5
https://www.liquor.com/recipes/green-tea-highball/
Scraped: Green Tea Highball NUM: 6
https://www.liquor.com/recipes/the-beatnik/
Scraped: Beatnik NUM: 7
https://www.liquor.com/recipes/20th-century/
Scraped: 20th Century NUM: 8
https://www.liquor.com/recipes/freehand-old-fashioned/
Scraped: Freehand Old Fashioned NUM: 9
https://www.liquor.com/recipes/momisette/
Scraped: Momisette NUM: 10
https://www.liquor.com/recipes/cabana-club/
Scraped: Cabana Club NUM: 11
https://www.liquor.com/tia-mia-cocktail-recipe-7484869
Scraped: Tia Mia NUM: 12
https://www.liquo

AttributeError: 'NoneType' object has no attribute 'get_text'
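
A long scraping run like this one can stop partway through because of a single bad page. One possible pattern is to record the failure for that URL and keep going, then save the results with `json` (imported in the first cell) so the scrape does not have to be repeated. This is a sketch only: `scrape_all`, `fake_scrape`, and the checkpoint file name are hypothetical, and `fake_scrape` stands in for the real request-and-parse code:

```python
import json
import os
import tempfile

def scrape_all(links, scrape_one):
    """Apply scrape_one to each link, collecting results and failures separately."""
    records, failures = [], []
    for link in links:
        try:
            records.append(scrape_one(link))
        except Exception as exc:  # broad on purpose: log the page and move on
            failures.append({"url": link, "error": repr(exc)})
    return records, failures

def fake_scrape(link):
    """Hypothetical stand-in for real scraping; fails on 'broken' URLs."""
    if "broken" in link:
        raise AttributeError("'NoneType' object has no attribute 'get_text'")
    return {"url": link, "title": link.rsplit("/", 1)[-1]}

records, failures = scrape_all(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_scrape,
)
print(len(records), "scraped;", len(failures), "failed")

# Checkpoint to disk so the data can be reloaded without scraping again.
path = os.path.join(tempfile.gettempdir(), "recipes_checkpoint.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

The `failures` list doubles as a to-do list: those URLs can be inspected by hand to see which selector came back as `None`.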

In [None]:
# 1. One row per recipe (ingredients & steps are lists)
recipes_df = pd.DataFrame(recipe_data)



In [None]:
# 2. Explode & normalize ingredients
ingredients_df = (
    recipes_df
    .loc[:, ['url','ingredients']]
    .explode('ingredients')
    .reset_index(drop=True)
)
# Turn the dict in 'ingredients' column into real columns
ingredients_df = pd.concat([
    ingredients_df.drop(columns='ingredients'),
    pd.json_normalize(ingredients_df['ingredients'])
], axis=1)

In [None]:
# 3. Explode steps: one row per instruction
steps_df = (
    recipes_df
    .loc[:, ['url','steps']]
    .explode('steps')
    .reset_index(drop=True)
)
steps_df['step_number'] = steps_df.groupby('url').cumcount() + 1
steps_df = steps_df.rename(columns={'steps':'instruction'})


In [None]:
print(recipes_df.shape, ingredients_df.shape, steps_df.shape)