# STA326 Assignment 1: Web Scraping
This assignment is openly available as part of Data Science Practice (STA326).

## Overview


In this assignment, we will explore web scraping, which lets us collect diverse information from websites, and then use the scraped data for a simple analysis. The target website for this assignment is [douban](https://movie.douban.com/top250).

In [88]:
# Imports

import requests  # send request
from bs4 import BeautifulSoup  # parse web pages
import pandas as pd  # csv
from time import sleep  # wait

## Part 1: Web Scraping

### Scraping Rules

1) If you are using another organization's website for scraping, make sure to check the website's terms & conditions. 

2) Do not request data from the website too aggressively (quickly) with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.

3) The layout of a website may change from time to time. Because of this, if you're scraping a website, make sure to revisit the site and rewrite your code as needed.
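Rule 2 can be sketched as a small helper that pauses between requests. This is only an illustration: `fetch_politely` and its parameters are hypothetical names, and `fetch` stands in for whatever function actually downloads a page.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Stand-in "fetch" that just upper-cases its input, so no network is needed.
pages = fetch_politely(['page-a', 'page-b'], fetch=str.upper, delay=0.0)
print(pages)
```

Passing `sleep` as a parameter is a design choice that makes the helper easy to test without actually waiting.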

### 1a) Basic workflow

We will first retrieve the contents on a page and examine them a bit.

Make a variable called `page_url`, that stores the URL (as a string) like:
https://movie.douban.com/top250?start=0

Now, to open the URL, use `requests.get()` and provide `page_url` as its input. Store this in a variable called `page`.

After that, make a variable called `soup` that holds the parsed HTML. To build it, pass the content of the page (for example, `page.text`) to `BeautifulSoup`, together with the name of a parser such as `'html.parser'`.


### 1b) Extracting Data

To extract the data we want, we start by selecting the list of page elements of interest.

Extract each field from the page and save it in a list, such as `movie_name` and `movie_star`.

Make sure each extracted value is a string.

We present an example that scrapes `movie_name` and `movie_star` from `page_url`.

In [89]:
movie_name = []  # movie name
movie_star = []  # movie rating

# Define request headers that resemble a normal browser visit
# (some sites reject requests that use the default client headers)
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.30',
}

page_url = 'https://movie.douban.com/top250?start=0'

res = requests.get(page_url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

for movie in soup.select('.item'):
    name = movie.select('.hd a')[0].text.replace('\n', '')  # select the movie name
    movie_name.append(name)
    star = movie.select('.rating_num')[0].text  # select the movie rating
    movie_star.append(star)

print(movie_name)
print(movie_star)

['肖申克的救赎\xa0/\xa0The Shawshank Redemption\xa0/\xa0月黑高飞(港)  /  刺激1995(台)', '霸王别姬\xa0/\xa0再见，我的妾  /  Farewell My Concubine', '阿甘正传\xa0/\xa0Forrest Gump\xa0/\xa0福雷斯特·冈普', '泰坦尼克号\xa0/\xa0Titanic\xa0/\xa0铁达尼号(港 / 台)', '这个杀手不太冷\xa0/\xa0Léon\xa0/\xa0终极追杀令(台)  /  杀手莱昂', '千与千寻\xa0/\xa0千と千尋の神隠し\xa0/\xa0神隐少女(台)  /  千与千寻的神隐', '美丽人生\xa0/\xa0La vita è bella\xa0/\xa0一个快乐的传说(港)  /  Life Is Beautiful', '星际穿越\xa0/\xa0Interstellar\xa0/\xa0星际启示录(港)  /  星际效应(台)', '盗梦空间\xa0/\xa0Inception\xa0/\xa0潜行凶间(港)  /  全面启动(台)', "辛德勒的名单\xa0/\xa0Schindler's List\xa0/\xa0舒特拉的名单(港)  /  辛德勒名单", '楚门的世界\xa0/\xa0The Truman Show\xa0/\xa0真人Show(港)  /  真人戏', "忠犬八公的故事\xa0/\xa0Hachi: A Dog's Tale\xa0/\xa0秋田犬八千(港)  /  忠犬小八(台)", "海上钢琴师\xa0/\xa0La leggenda del pianista sull'oceano\xa0/\xa0声光伴我飞(港)  /  一九零零的传奇", '三傻大闹宝莱坞\xa0/\xa03 Idiots\xa0/\xa0三个傻瓜(台)  /  作死不离3兄弟(港)', '放牛班的春天\xa0/\xa0Les choristes\xa0/\xa0歌声伴我心(港)  /  唱诗班男孩', '机器人总动员\xa0/\xa0WALL·E\xa0/\xa0太空奇兵·威E(港)  /  瓦力(台)', '疯狂动物城\xa0/\xa0Zootopia\xa0/\xa0优兽大都会(港)  /  动物方城市(台)'
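Each scraped name above packs several titles into one string. A minimal sketch for splitting them follows. It assumes the separators seen in the sample output (`\xa0/\xa0` between the main titles and a slash padded with two spaces between regional titles), and `split_titles` is a hypothetical helper name.

```python
import re


def split_titles(raw_name):
    """Split one scraped name string into a list of individual titles."""
    # Split on '\xa0/\xa0' or on a slash padded by two spaces on each side.
    # A slash with single spaces, as in '(港 / 台)', is left untouched.
    parts = re.split(r'\xa0/\xa0| {2}/ {2}', raw_name)
    return [part.strip() for part in parts if part.strip()]


raw = '肖申克的救赎\xa0/\xa0The Shawshank Redemption\xa0/\xa0月黑高飞(港)  /  刺激1995(台)'
titles = split_titles(raw)
print(titles[0])   # first title in the string
print(titles[1:])  # remaining titles
```

If the site changes its separators, the pattern would need to be adjusted.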

### 1c) Collecting into a dataframe

Create a dataframe `movie_df` and add the data from the lists above to it. 
- `movie_name` is the movie name. Set the column name as `movie name`
- `movie_star` is the movie rating. Add it to the dataframe, and set the column name as `movie star`

Make sure to check the head of your dataframe to see that everything looks right, e.g. `movie_df.head()`.

The example below stores the data in a CSV file.

In [90]:
csv_name =  "MovieDouban.csv"

movie_df = pd.DataFrame() # initialize a DataFrame object
movie_df['movie name'] = movie_name
movie_df['movie star'] = movie_star

movie_df.to_csv(csv_name, encoding='utf_8_sig')  # save data to a csv file

## Task 1 Extract DouBan's Top 250 Movies Information

### Task Description
Your task is to write two Python functions.

1. The first function, `get_movie_info`, scrapes the complete information of the top 250 movies listed on DouBan's movie ranking. The function should store the following details for each movie in the corresponding lists:

- `movie_name`: List to store the names of the movies.
- `movie_url`: List to store the URLs of the movies.
- `movie_star`: List to store the ratings of the movies.
- `movie_star_people`: List to store the number of people who have rated the movie.
- `movie_director`: List to store the directors of the movies.
- `movie_actor`: List to store the main actors of the movies.
- `movie_year`: List to store the release year of the movies.
- `movie_country`: List to store the country of origin of the movies.
- `movie_type`: List to store the genre of the movies.


2. The second function, `save_to_csv`, should build a DataFrame `movie_df` in which each column is one of the above categories and holds the values for all 250 movies, save it to the CSV file `csv_name`, and return it.
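The scraped ratings and vote counts arrive as strings. If numeric values are wanted for later analysis, a conversion could look like the sketch below. The helper names are hypothetical, and the vote-count format (digits followed by a text suffix, e.g. `'12345人评价'`) is an assumption about the page, not something the assignment specifies.

```python
import re


def parse_rating(star_text):
    """Convert a rating string such as '9.7' to a float."""
    return float(star_text.strip())


def parse_vote_count(people_text):
    """Return the first run of digits in the text as an int, or None."""
    match = re.search(r'\d+', people_text)
    return int(match.group()) if match else None


print(parse_rating(' 9.7 '))
print(parse_vote_count('12345人评价'))  # made-up sample value
```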

### Hint

While implementing the `get_movie_info` function, pay special attention to the following:

1. **Director and Actor Information**: The webpage lists both directors and main actors for movies. Ensure that you correctly identify and separate these two pieces of information. In cases where the main actors are not mentioned, the `movie_actor` list should contain `None`.

2. **Handling Movies with Multiple Years**: `'大闹天宫 / 大闹天宫 上下集 / The Monkey King'` has multiple release years listed together. You need to handle such cases so that the `movie_year` list is populated correctly. A simple approach is to join all of the listed years into a single string for that movie.
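Both hints can be sketched with plain string handling. The layout assumed here is not given in the assignment: a credit line of the form `导演: … 主演: …` (导演 = director, 主演 = starring) and a detail line of the form `year / country / genre` separated by `\xa0/\xa0`. The helper names and sample strings are made up for illustration, so check them against the real page before relying on them.

```python
import re


def split_director_actor(credit_line):
    """Return (director, actor); actor is None when no '主演:' part exists."""
    text = credit_line.replace('\xa0', ' ').strip()
    director_part, _, actor_part = text.partition('主演:')
    director = director_part.replace('导演:', '').strip()
    actor = actor_part.strip() or None  # empty string becomes None
    return director, actor


def parse_detail_line(detail_line):
    """Return (year, country, genre) from a 'year / country / genre' line."""
    fields = [field.strip() for field in detail_line.split('\xa0/\xa0')]
    # Join every four-digit year found in the first field into one string.
    year = '/'.join(re.findall(r'\d{4}', fields[0]))
    country = fields[1] if len(fields) > 1 else None
    genre = fields[2] if len(fields) > 2 else None
    return year, country, genre


# Made-up sample strings that follow the assumed layout.
print(split_director_actor('导演: Director A\xa0\xa0\xa0主演: Actor B / Actor C'))
print(split_director_actor('导演: Director A'))
print(parse_detail_line('1961(中国大陆) / 1964(中国大陆)\xa0/\xa0中国大陆\xa0/\xa0剧情 动画'))
```

Joining the years with `/` is one possible convention. Any separator works as long as it is applied consistently.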


In [91]:
# save the data here
movie_name = []
movie_url = [] 
movie_star = [] 
movie_star_people = []
movie_director = [] 
movie_actor = []  
movie_year = []  
movie_country = []  
movie_type = [] 


def get_movie_info(url, headers):
    """
    Scrapes the movie information from one page of DouBan's Top 250 list
    and appends it to the module-level lists defined above.
    :param url: URL of the page to scrape
    :param headers: Headers for the scraping request
    :return: None
    """

    ##### scrape movie info ###### 
    # YOUR CODE HERE
    ##############################



def save_to_csv(csv_name):

    """
    Collect the scraped lists into a DataFrame and save it to a CSV file.
    :param csv_name: Name of the CSV file to write
    :return: A DataFrame containing the movie information
    """

    movie_df = pd.DataFrame() # initialize a DataFrame

    ##### save to dataframe ###### 
    # YOUR CODE HERE
    ##############################


    return movie_df

### Function Test

You can test your implementation with the code below.

In [92]:
# Define request headers that resemble a normal browser visit
# (some sites reject requests that use the default client headers)
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.30',
}
# Start scraping data from DouBan
for i in range(10):  # Scrape 10 pages in total, 25 entries per page
    page_url = 'https://movie.douban.com/top250?start={}'.format(str(i * 25))
    print('Starting to scrape page {}, URL: {}'.format(str(i + 1), page_url))
    get_movie_info(page_url, headers)
    sleep(1)  # Wait 1 second between pages so the site is not overloaded

# Save the data to a CSV file
save_to_csv(csv_name="STA326_MovieDouban250.csv")

Starting to scrape page 1, URL: https://movie.douban.com/top250?start=0
Starting to scrape page 2, URL: https://movie.douban.com/top250?start=25
Starting to scrape page 3, URL: https://movie.douban.com/top250?start=50
Starting to scrape page 4, URL: https://movie.douban.com/top250?start=75
Starting to scrape page 5, URL: https://movie.douban.com/top250?start=100
Starting to scrape page 6, URL: https://movie.douban.com/top250?start=125
Starting to scrape page 7, URL: https://movie.douban.com/top250?start=150
Starting to scrape page 8, URL: https://movie.douban.com/top250?start=175
Starting to scrape page 9, URL: https://movie.douban.com/top250?start=200
Starting to scrape page 10, URL: https://movie.douban.com/top250?start=225
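After the loop finishes, one quick sanity check is to confirm that every list received the same number of entries. The sketch below uses a hypothetical helper, `find_length_mismatches`, and tiny stand-in data. In the assignment, the dictionary would hold the nine scraped lists and `expected` would be 250.

```python
def find_length_mismatches(columns, expected):
    """Return {name: length} for every list whose length is not `expected`."""
    return {name: len(values) for name, values in columns.items()
            if len(values) != expected}


# Stand-in data; 'movie_actor' is one entry short on purpose.
demo = {
    'movie_name': ['a', 'b', 'c'],
    'movie_star': ['9.7', '9.6', '9.5'],
    'movie_actor': ['x', None],
}
print(find_length_mismatches(demo, expected=3))  # {'movie_actor': 2}
```

An empty result means all lists line up, which is a precondition for placing them side by side as DataFrame columns.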


## Complete!

Congrats, you're done!
