# Scraper
---
## Table of contents
1. [Scraping](#Scraping)
    1. [Dependencies](#Dependencies)
    2. [Helper methods](#Helper-methods)
    3. [Methods for scraping](#Methods-for-scraping)
    4. [Start scraping!](#Start-scraping!)

## Scraping
---
## Dependencies
Install the dependencies of this repository with:
```
pip install selenium fonttools requests bs4 pandas matplotlib
```
This implementation uses `selenium` as the tool for scraping, and `fontTools` to process the anti-scraping-protected rater count information.

`selenium` is system and browser dependent. This notebook was designed to run on an x64 Windows system with Firefox 93.0. If your OS and browser match that configuration, following step 2 below should be enough to run the code. Otherwise, follow all of the steps below:
1. Install the latest [Mozilla Firefox](https://www.mozilla.org/en-US/firefox/new/)
2. Download the latest [web driver for Firefox](https://github.com/mozilla/geckodriver/releases) corresponding to your OS, and extract the driver to the `maoyan_scraper` directory of this repository (i.e., the same folder where this notebook is stored)
3. Change the `executable_path` of the web driver in the [Start scraping!](#Start-scraping!) section below to the newly extracted web driver

For `fontTools`, you don't need to make any further configurations. Hooray!

In [1]:
import io
from urllib import parse
import math
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
from typing import List, Tuple
import pandas as pd
from fontTools.ttLib import TTFont
import re
import matplotlib.pyplot as plt

## Helper methods
Let's start with some helper methods. `captcha_handler` detects if [maoyan](https://maoyan.com) is annoyed by our requests and asking us to prove that we are human. If so, it will patiently wait for us to mollify [maoyan](https://maoyan.com) and hand the control to other methods once we have finished the proof.

In [2]:
import time


def captcha_handler(driver):
    # '猫眼验证中心' ("Maoyan Verification Center") marks the captcha page
    while '猫眼验证中心' in driver.page_source:  # wait for captcha to be manually solved
        time.sleep(5)  # check the webpage once every 5 seconds
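
One possible refinement, shown here only as a sketch: the loop above waits indefinitely, so an unattended run can hang on a captcha nobody solves. The hypothetical `wait_for_captcha` below (not part of the original notebook) polls for the same marker text but gives up after a timeout. `FakeDriver` is a stand-in that only exists to exercise the function without a browser.

```python
import time


def wait_for_captcha(driver, marker='猫眼验证中心', poll_seconds=5.0, timeout=300.0):
    """Return True once `marker` is gone from the page, False on timeout."""
    deadline = time.monotonic() + timeout
    while marker in driver.page_source:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


class FakeDriver:
    """Stand-in for a selenium driver: shows the captcha for the first few reads."""

    def __init__(self, captcha_reads):
        self.captcha_reads = captcha_reads

    @property
    def page_source(self):
        if self.captcha_reads > 0:
            self.captcha_reads -= 1
            return '<title>猫眼验证中心</title>'
        return '<dd>movie entry</dd>'


# The captcha disappears after two reads, so the wait succeeds.
solved = wait_for_captcha(FakeDriver(captcha_reads=2), poll_seconds=0.01, timeout=1.0)
# The captcha never disappears, so the wait gives up.
stuck = wait_for_captcha(FakeDriver(captcha_reads=10**9), poll_seconds=0.01, timeout=0.05)
print(solved, stuck)
```

A caller could then decide whether to retry, skip the page, or stop the run when the function returns `False`.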

Then comes `translate_numbers`. If you take a closer look at a movie [detail page](https://maoyan.com/films/1200486), you may notice that the **rating**, **rater count**, and **box office** fields are protected from being scraped in the following way:

1. Pick some unused Unicode characters.
2. Generate a font file that maps the picked characters to the glyphs for the digits 0-9.

As a result, a displayed value (say, a rating) that appears as 9.6 to human eyes under the customized font turns into unreadable placeholder characters around the decimal point when copied elsewhere.

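`translate_numbers` recovers the digits behind the protected fields by comparing the outline points of each glyph in the downloaded font with those of a known reference font (`iconfont.woff`), whose ten digit glyphs are listed in order from 0 to 9.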
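
To make the lookup concrete, here is a minimal sketch of how an obfuscated string could be decoded once a glyph-name-to-digit mapping is known. The mapping `sample_map` and the code points in it are invented for illustration; on the real site they would come from `translate_numbers` for whichever font the page serves. The hypothetical helper `decode_obfuscated` builds the glyph name with `ord` rather than slicing the `unicode_escape` output.

```python
def decode_obfuscated(text: str, num_dict: dict) -> str:
    """Replace each obfuscated character with the digit its glyph stands for."""
    decoded = []
    for character in text:
        glyph_name = f'uni{ord(character):04X}'  # e.g. '\ue137' -> 'uniE137'
        if glyph_name in num_dict:
            decoded.append(str(num_dict[glyph_name]))
        else:
            decoded.append(character)  # '.', '万', and other plain characters
    return ''.join(decoded)


# Illustrative mapping only: these code points are invented for the example.
sample_map = {'uniE137': 9, 'uniF2A0': 6}
print(decode_obfuscated('\ue137.\uf2a0', sample_map))  # prints 9.6
```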

In [3]:
def translate_numbers(new_font) -> dict:
    base_font = TTFont('./iconfont.woff')
    char_list = [base_font['glyf']['uniF36C'],
                 base_font['glyf']['uniE0EB'],
                 base_font['glyf']['uniEB75'],
                 base_font['glyf']['uniF64C'],
                 base_font['glyf']['uniE6A2'],
                 base_font['glyf']['uniF5D3'],
                 base_font['glyf']['uniF5BB'],
                 base_font['glyf']['uniE307'],
                 base_font['glyf']['uniE47E'],
                 base_font['glyf']['uniE16B']]

    numbers_dictionary = {}
    for key in new_font['glyf'].keys():
        if 'uni' in key:
            numbers_dictionary[key] = match_char(char_list, new_font['glyf'][key])

    return numbers_dictionary


def distance(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    # Euclidean distance between two outline points
    return math.sqrt(((a[0] - b[0]) ** 2) + ((a[1] - b[1]) ** 2))


def nearest_point_distance(base_char, new_point: Tuple[int, int]) -> float:
    # distance from new_point to the closest outline point of base_char
    distances: List[float] = []
    for base_point in base_char.coordinates:
        distances.append(distance(base_point, new_point))
    return min(distances)


def match_char(char_list, new_char) -> int:
    # glyphs with fewer than 10 outline points are treated as non-digits
    if len(new_char.coordinates) < 10:
        return -1

    distance_list = []
    for base_char in char_list:
        # average, over the new glyph's points, of the distance to the closest base point
        avg_distance: float = 0
        for new_point in new_char.coordinates:
            avg_distance += nearest_point_distance(base_char, new_point)
        avg_distance /= len(new_char.coordinates)
        distance_list.append(avg_distance)

    # the index in char_list is the digit that the best-matching glyph represents
    return distance_list.index(min(distance_list))
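
The matching rule is easier to see on toy data. The sketch below reimplements the same idea (the average, over the new glyph's points, of the distance to the closest reference point) on plain coordinate lists rather than fontTools glyph objects. The shapes and the names `mean_nearest_distance` and `best_match` are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean_nearest_distance(reference: Sequence[Point], candidate: Sequence[Point]) -> float:
    """Average distance from each candidate point to its closest reference point."""
    total = 0.0
    for point in candidate:
        total += min(math.dist(point, ref_point) for ref_point in reference)
    return total / len(candidate)


def best_match(references: List[Sequence[Point]], candidate: Sequence[Point]) -> int:
    """Index of the reference outline that the candidate resembles most."""
    scores = [mean_nearest_distance(reference, candidate) for reference in references]
    return scores.index(min(scores))


# Two made-up reference outlines: index 0 is a square, index 1 is a tall bar.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
tall_bar = [(0, 0), (2, 0), (2, 30), (0, 30)]
# A slightly shifted tall bar, as if redrawn in a new font file.
shifted_bar = [(1, 1), (3, 1), (3, 29), (1, 29)]

print(best_match([square, tall_bar], shifted_bar))  # prints 1
```

Note that the score is one-directional: only the candidate's points are measured against the reference, so swapping the two arguments can give a different value.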

## Methods for scraping
Now that the helper methods are ready, it is time for the scraping methods to join the show. `get_movie_index` goes through the [Top 100 movies](https://maoyan.com/board/4) and scrapes the rank, link, name, main cast, and rating of each of the 100 movies. `get_movie_detail` then visits the detail page of each movie to obtain its director, type, country/region, length, release time, release place, rater count, and box office. Once this information is collected, `combine_movie_info` merges the two tables on `rank`.
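
Judging from the loop in `get_movie_index`, the Top 100 board is paginated ten movies at a time through the `offset` query parameter. As a quick illustration, the hypothetical helper `board_urls` below lists the URLs that loop would visit; it is a sketch for inspection, not part of the scraper.

```python
from urllib import parse


def board_urls(start: int = 0, total: int = 100, page_size: int = 10):
    """URLs of the board pages from `start` up to (not including) `total`."""
    return [parse.urljoin('https://maoyan.com', f'board/4?offset={offset}')
            for offset in range(start, total, page_size)]


urls = board_urls()
print(len(urls))   # 10
print(urls[0])     # https://maoyan.com/board/4?offset=0
print(urls[-1])    # https://maoyan.com/board/4?offset=90
```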

In [4]:
def get_movie_index(driver, offset: int = 0):
    movie_list = []

    while offset < 100:
        driver.get(parse.urljoin('https://maoyan.com', f'board/4?offset={offset}'))
        captcha_handler(driver)
        WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.TAG_NAME, 'dd'))  # wait for element to load
        source = driver.page_source
        source = bs(source, 'html.parser')

        movie_entries = source.find_all('dd')
        for entry in movie_entries:
            # acquire values
            rank: int = int(entry.i.get_text().strip())  # strip() removes whitespace
            link: str = entry.a.get('href')
            title: str = entry.a.get('title')
            stars: List[str] = entry.find('p', class_='star').get_text().strip()[3:].replace('，', ',').split(',')
            stars = [star.strip() for star in stars]
            rating_raw: str = entry.find('i', class_='integer').get_text().strip() + \
                              entry.find('i', class_='fraction').get_text().strip()
            rating: float = float(rating_raw)
            # store movie entry
            movie = {'rank': rank, 'title': title, 'link': link, 'stars': tuple(stars), 'rating': rating}
            movie_list.append(movie)

        offset += 10

    movie_index = pd.DataFrame(movie_list)
    return movie_index


def get_movie_detail(driver, movie_index):
    detail_list = []
    rank = 1

    while rank <= 100:
        driver.get(parse.urljoin('https://maoyan.com', movie_index.query(f'rank == {rank}')['link'].values[0]))
        captcha_handler(driver)
        WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, 'tab-content-container'))
        source = driver.page_source
        source = bs(source, 'html.parser')

        detail = {'rank': rank}
        
        # director
        director = source.find('div', class_='celebrity-group')
        assert director.div.get_text().strip() == '导演'
        detail['director'] = director.ul.li.div.a.get_text().strip()

        # type, country/region, length, release time, release place
        brief = source.find('div', class_='movie-brief-container').ul.findAll('li')
        types = [type_element.get_text().strip() for type_element in brief[0].findAll('a')]
        detail['types'] = tuple(types)
        country_region_and_length = brief[1].get_text().strip().split('/')
        detail['country/region'] = tuple(country_region_and_length[0].strip().replace('，', ',').split(','))
        detail['length'] = int(country_region_and_length[1].strip()[:-2])
        release_time_and_place = brief[2].get_text().strip()
        time_charset = '1234567890 -:'
        split_point = 0
        while release_time_and_place[split_point] in time_charset:
            split_point += 1
        detail['release time'] = release_time_and_place[:10]
        detail['release place'] = release_time_and_place[split_point:-2]

        # rater count
        font_url = re.findall(r"url\('(//vfile.meituan.net/colorstone/.+\.woff)'\)", source.prettify())[0]
        new_font = io.BytesIO(requests.get(f'http:{font_url}').content)
        new_font = TTFont(new_font)
        num_dict = translate_numbers(new_font)  # translated unicode -> number dictionary
        num_dict_keys = num_dict.keys()

        rater_raw = source.find('span', class_='score-num').get_text().strip()[:-3]
        rater = ''
        for index, character in enumerate(rater_raw):
            character_key = f'uni{character.encode("unicode_escape").decode()[-4:].upper()}'
            if character_key in num_dict_keys:
                rater += str(num_dict[character_key])
            else:
                rater += character

        if rater[-1] == '万':
            rater = round(float(rater[:-1]) * 10000)
        else:
            rater = int(rater)

        detail['rater count'] = rater

        # box office (may be unavailable)
        box = source.findAll('div', class_='film-mbox-item')
        try:
            detail['box office (10k CNY)'] = int(box[1].div.get_text().strip())
        except (ValueError, IndexError):
            detail['box office (10k CNY)'] = -1

        detail_list.append(detail)
        rank += 1

    movie_detail = pd.DataFrame(detail_list)
    return movie_detail


def combine_movie_info(movie_index, movie_detail):
    movie_info = pd.merge(movie_index, movie_detail, how='left', on='rank')
    return movie_info
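
Two of the string-handling steps inside `get_movie_detail` can be tried in isolation. The sketch below restates them as small functions with hypothetical names (`parse_rater_count`, `split_release`). The sample release string, including its two-character suffix, is an assumed format inferred from the slicing in the code above, not text captured from the site.

```python
import re
from typing import Tuple


def parse_rater_count(text: str) -> int:
    """Convert a decoded count such as '271.9万' or '8390' to an integer."""
    if text.endswith('万'):  # 万 means ten thousand
        return round(float(text[:-1]) * 10000)
    return int(text)


def split_release(text: str, suffix_length: int = 2) -> Tuple[str, str]:
    """Split '<date/time><place><suffix>' into (date, place)."""
    match = re.match(r'[0-9 :-]+', text)  # leading run of date/time characters
    time_part = match.group(0) if match else ''
    place = text[len(time_part):len(text) - suffix_length]
    return time_part[:10], place


print(parse_rater_count('271.9万'))            # 2719000
print(parse_rater_count('8390'))               # 8390
print(split_release('2018-07-05中国大陆上映'))  # ('2018-07-05', '中国大陆')
```

Using a regular expression for the leading date avoids running off the end of the string when no place follows the date.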

## Start scraping!
It's time. Run the cell below to start scraping. Note that for some movies the box office data is unavailable and is recorded as -1.

**Note:** if you have tweaked your browser and web driver to configure `selenium`, remember to change the `executable_path` of the web driver in the cell below.
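
A note for newer environments: in Selenium 4 the `executable_path` argument of `webdriver.Firefox` is deprecated in favour of a `Service` object, and recent releases may reject it outright. The sketch below shows one way to start Firefox under that API. `driver_filename` and `make_driver` are hypothetical helpers, and the assumption that the driver sits next to the notebook follows the setup steps in the Dependencies section.

```python
import sys


def driver_filename(platform: str = sys.platform) -> str:
    """geckodriver ships as 'geckodriver.exe' on Windows, 'geckodriver' elsewhere."""
    return 'geckodriver.exe' if platform.startswith('win') else 'geckodriver'


def make_driver(driver_dir: str = '.'):
    """Start Firefox through Selenium 4's Service API (selenium is imported lazily)."""
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    service = Service(executable_path=f'{driver_dir}/{driver_filename()}')
    return webdriver.Firefox(service=service)


print(driver_filename('win32'))  # geckodriver.exe
print(driver_filename('linux'))  # geckodriver
```

With these helpers, `driver = make_driver()` would take the place of the `webdriver.Firefox(executable_path=...)` call.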

In [5]:
driver = webdriver.Firefox(executable_path='./geckodriver.exe')  # web driver

movie_index = get_movie_index(driver)
movie_detail = get_movie_detail(driver, movie_index)
driver.quit()  # end the session and stop the geckodriver process

movie_info = combine_movie_info(movie_index, movie_detail)
movie_info.to_csv('./movie_info.csv', encoding='utf-8-sig')  # export as .csv
movie_info.to_pickle('./movie_info.pkl')  # export as pickle

## View data
The scraped movie information can be viewed with:

In [6]:
movie_info.head()

| | rank | title | link | stars | rating | director | types | country/region | length | release time | release place | rater count | box office (10k CNY) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 我不是药神 | /films/1200486 | (徐峥, 周一围, 王传君) | 9.6 | 文牧野 | (剧情, 喜剧) | (中国大陆,) | 117 | 2018-07-05 | 中国大陆 | 2719000 | 310002 |
| 1 | 2 | 肖申克的救赎 | /films/1297 | (蒂姆·罗宾斯, 摩根·弗里曼, 鲍勃·冈顿) | 9.5 | 弗兰克·德拉邦特 | (剧情, 犯罪) | (美国,) | 142 | 1994-09-13 | 美国 | 8390 | -1 |
| 2 | 3 | 绿皮书 | /films/1206605 | (维果·莫腾森, 马赫沙拉·阿里, 琳达·卡德里尼) | 9.5 | 彼得·法雷里 | (剧情, 喜剧, 传记) | (美国,) | 130 | 2019-03-01 | 中国大陆 | 253000 | 47872 |
| 3 | 4 | 海上钢琴师 | /films/1292 | (蒂姆·罗斯, 比尔·努恩, 克兰伦斯·威廉姆斯三世) | 9.3 | 朱塞佩·托纳多雷 | (剧情, 爱情, 音乐) | (意大利,) | 126 | 2019-11-15 | 中国大陆 | 91252 | 14376 |
| 4 | 5 | 哪吒之魔童降世 | /films/1211270 | (吕艳婷, 囧森瑟夫, 瀚墨) | 9.6 | 饺子 | (动画, 喜剧, 奇幻) | (中国大陆,) | 110 | 2019-07-26 | 中国大陆 | 3967000 | 66507 |


The scraper has also saved the movie information to the `maoyan_scraper` directory of this repository (as `movie_info.csv` and `movie_info.pkl`), so you can also load it with:

In [7]:
movie_info = pd.read_pickle('./movie_info.pkl')
movie_info.head()

| | rank | title | link | stars | rating | director | types | country/region | length | release time | release place | rater count | box office (10k CNY) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 我不是药神 | /films/1200486 | (徐峥, 周一围, 王传君) | 9.6 | 文牧野 | (剧情, 喜剧) | (中国大陆,) | 117 | 2018-07-05 | 中国大陆 | 2719000 | 310002 |
| 1 | 2 | 肖申克的救赎 | /films/1297 | (蒂姆·罗宾斯, 摩根·弗里曼, 鲍勃·冈顿) | 9.5 | 弗兰克·德拉邦特 | (剧情, 犯罪) | (美国,) | 142 | 1994-09-13 | 美国 | 8390 | -1 |
| 2 | 3 | 绿皮书 | /films/1206605 | (维果·莫腾森, 马赫沙拉·阿里, 琳达·卡德里尼) | 9.5 | 彼得·法雷里 | (剧情, 喜剧, 传记) | (美国,) | 130 | 2019-03-01 | 中国大陆 | 253000 | 47872 |
| 3 | 4 | 海上钢琴师 | /films/1292 | (蒂姆·罗斯, 比尔·努恩, 克兰伦斯·威廉姆斯三世) | 9.3 | 朱塞佩·托纳多雷 | (剧情, 爱情, 音乐) | (意大利,) | 126 | 2019-11-15 | 中国大陆 | 91252 | 14376 |
| 4 | 5 | 哪吒之魔童降世 | /films/1211270 | (吕艳婷, 囧森瑟夫, 瀚墨) | 9.6 | 饺子 | (动画, 喜剧, 奇幻) | (中国大陆,) | 110 | 2019-07-26 | 中国大陆 | 3967000 | 66507 |
