# <center>HW1: Web Scraping </center>

## Instructions: 


- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description.
- Follow the Submission Instruction to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT a group assignment**.
    - Never copy others' work (even with minor modifications, e.g. changing variable names).
    - If you generate code using large language models, make sure to adapt the generated code so that it meets all requirements and executes successfully. Also, clearly indicate which parts of the code were generated by LLMs and describe your adaptation.

<img src="HW1.png" width ="70%">

## Q1: collect reviews (10 points)

Write a function `scrape_reviews(url)` to scrape all **reviews** for a movie, including:
- **rating** (see (1) in Figure)
- **summary** (see (2) in Figure)
- **details** (see 3(a) and 3(b) in Figure): Scrape the review details as shown in 3(a-b). If the details of a review are hidden behind a "Spoiler" button (see 3(a)), click the button to expand the review first and then collect the details. Hints:
    - Collect all Spoiler buttons and click them one by one
    - Then collect the page content. If you use CSS selector, this selector: `"div.ipc-overflowText--children, div[data-testid=review-spoiler-content] div.review-content"` may cover both cases
- **review date** (see (4) in Figure).
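
The two hints for the **details** field can be sketched roughly as follows. This is a minimal illustration, not a required implementation: `expand_spoilers_and_collect`, `pause`, and the two selector constants are hypothetical names, `page` is assumed to be a Playwright async `Page`, and the Spoiler button selector is an assumption about the page markup that may need adjusting.

```python
import asyncio

# Selector quoted from the hint above; it is meant to cover both plain and
# spoiler-expanded review text.
DETAILS_SELECTOR = (
    "div.ipc-overflowText--children, "
    "div[data-testid=review-spoiler-content] div.review-content"
)
# Assumption: Spoiler buttons can be located by their visible "Spoiler" label.
SPOILER_BUTTON_SELECTOR = 'button:has(span:text("Spoiler"))'


async def expand_spoilers_and_collect(page, pause=0.5):
    """Click every Spoiler button on `page`, then return the text of all review details."""
    # Hint 1: collect all Spoiler buttons and click them one by one.
    buttons = await page.query_selector_all(SPOILER_BUTTON_SELECTOR)
    for button in buttons:
        await button.click()
        await asyncio.sleep(pause)  # give the page a moment to render the expanded text

    # Hint 2: collect the page content once everything is expanded.
    elements = await page.query_selector_all(DETAILS_SELECTOR)
    return [await element.inner_text() for element in elements]
```

Note that collecting at page level yields one flat list of detail texts in document order. To keep each detail aligned with its rating, summary, and date, the same selectors can instead be applied inside each review element.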


Requirements:
- `Function Input`: `page URL`: the URL of a movie review page
- `Function Output`: save all reviews as a DataFrame with columns (`rating, summary, detail, date`).
- If a field, e.g. rating, is missing, use `None` to indicate it. 
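
One possible way to satisfy the missing-field requirement is to normalize each scraped review before building the DataFrame. The sketch below is only an illustration: `COLUMNS` and `to_record` are hypothetical names, and it assumes each review has already been scraped into a plain dict that may lack some keys.

```python
# Column names taken from the Function Output requirement above.
COLUMNS = ["rating", "summary", "detail", "date"]


def to_record(raw):
    """Return a dict with every required column; missing or blank values become None."""
    record = {}
    for column in COLUMNS:
        value = raw.get(column)  # None when the key is absent
        if isinstance(value, str):
            value = value.strip() or None  # whitespace-only text also counts as missing
        record[column] = value
    return record
```

A list of such records can then be passed to `pd.DataFrame(records, columns=COLUMNS)`, which keeps the column order fixed even when a field is missing from every review.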

In [1]:
import asyncio
from playwright.async_api import async_playwright
import pandas as pd
# ADD IMPORT STATEMENTS HERE

In [7]:
async def scrape_reviews(url):
    """Scrape rating, summary, details, and date for every review on a review page."""
    # Keep a handle on the Playwright driver so it can be stopped when we are done.
    playwright = await async_playwright().start()
    browser = await playwright.webkit.launch(headless=False)
    page = await browser.new_page()
    await page.goto(url)

    # Click the "All" button, if present, so every review is loaded, then wait for the page to update.
    all_button = await page.query_selector('span.ipc-see-more__text:text("All")')
    if all_button:
        await all_button.click()
        await asyncio.sleep(10)

    # Each review is an <article>; the hashed class names in this selector may change if the site is updated.
    all_reviews = await page.query_selector_all('article.sc-7d2e5b85-1.cvfQlw.user-review-item')

    results = []
    for review in all_reviews:
        # Expand the review first if its details are hidden behind a "Spoiler" button.
        spoiler_button = await review.query_selector('button:has(span:text("Spoiler"))')
        if spoiler_button:
            await spoiler_button.click()
            await asyncio.sleep(1)

        # Any field that is missing from a review is recorded as None.
        r = await review.query_selector('span.ipc-rating-star--rating')
        rating = await r.inner_text() if r else None

        s = await review.query_selector('h3.ipc-title__text')
        summary = await s.inner_text() if s else None

        d = await review.query_selector('div.ipc-html-content-inner-div')
        details = await d.inner_text() if d else None

        rd = await review.query_selector('li.ipc-inline-list__item.review-date')
        rating_date = await rd.inner_text() if rd else None

        results.append({'rating': rating, 'summary': summary, 'details': details, 'rating_date': rating_date})

    # Release the browser and the Playwright driver.
    await browser.close()
    await playwright.stop()

    return pd.DataFrame(results)

In [9]:
# Pick any movie from IMDb and find its movie id. Then replace the movie id in the url below.

url = "https://www.imdb.com/title/tt34956443/reviews/"
#url = "https://www.imdb.com/title/tt30788842/reviews/?ref_=tt_ov_ql_2"
results = await scrape_reviews(url)

results

| | rating | summary | details | rating_date |
|---|---|---|---|---|
| 0 | 9 | wonderful | It is so good to see that it exceeds expectati... | Jan 31, 2025 |
| 1 | 9 | Wonderful but only a step away from perfect | This movie presents a complex theme, exploring... | Feb 1, 2025 |
| 2 | 9 | A landmark in animated film | The first time to go to cinema for us is to wa... | Feb 25, 2025 |
| 3 | 10 | Nezha 2 is one of the best animated movies I h... | "Nezha 2" is nothing short of a cinematic gem,... | Feb 9, 2025 |
| 4 | 10 | A Chinese epic animation-while non-Chinese vie... | I didn't get to watch Ne Zha 1 in theaters, an... | Feb 12, 2025 |
| ... | ... | ... | ... | ... |
| 148 | 9 | Wonderful, go watch it | A word of advice: do watch Nezha (2019), the p... | Mar 2, 2025 |
| 149 | 10 | This is the real list of deities! | The core sword of political similes is the dar... | Feb 26, 2025 |
| 150 | 10 | One of the best animated movies I have ever seen. | I watched this yesterday in the theaters and I... | Feb 26, 2025 |
| 151 | 1 | unsatisfying experience | "Ne Zha 2" is a film that struggles to capture... | Mar 2, 2025 |
