# Assignment 1 Web Scraping & Data Analysis

## Outline

+ [Install neccessary packages](#Install-neccessary-packages)
+ [Import related packages](#Import-related-packages)
+ [Designed functions](#Designed-functions)
  + [1. Font decoding algorithm](#1.-Font-decoding-algorithm)
  + [2. Web scraping functions](#2.-Web-scraping-functions)
+ [Function application](#Function-application)

## Install neccessary packages

In [1]:
!pip install requests
!pip install pandas 
!pip install bs4
!pip install fontTools
!pip install numpy
!pip install sklearn
# !pip install threadpool 



## Import related packages

In [2]:
# use for decoding algorithm
import numpy as np
from sklearn.metrics import mean_squared_error
import re
import requests
from requests.exceptions import RequestException
import time
from bs4 import BeautifulSoup
import pandas as pd
# parse font file (.woff)
from fontTools.ttLib import TTFont
import os
from concurrent.futures import ThreadPoolExecutor,ProcessPoolExecutor #线程池，进程池

## Designed functions

### 1. Font decoding algorithm
> ProperResult class is designed to calculate the proper relationship between the base font and the newly downloaded font. The main approach is to calculate the gradient among the points of a specific character. After a sort of the gradients, choose a best map among characters with lowest loss to match the encoded numbers.

+ **Diagram demonstration**:
<img src="images/ProperResult.png" alt="ProperResult" style="zoom:90%;" />

In [3]:
class ProperResult:
    """
        usually use the ProperResult.result to get the best map
        
        parameters:
          data_base: dict, base font stored in a form: {encoded number: points} eg: {'uniE815':(1,49),(32,56)...}
          data_font: dict, new font stored in a form: {encoded number: points} eg: {'uniE211':(1,23),(12,50)...}
          
    """
    def __init__(self, data_base, data_font):
        self.result = list(data_font.keys())
        self.data_base = list(data_base.keys())
        self.Coordinates = self.generate_Coordinates(data_base, data_font)
        self.whole = self.calculate_whole()
        self.check_if_smaller()
    
    # Store all the distance into a list
    def generate_Coordinates(self, data_base, data_font):
        Coordinates = []
        for name, points in data_font.items():
            Coordinates.append(self.Coordinate(name, data_base, points))
        return Coordinates
    
    # Calculate the whole loss
    def calculate_whole(self):
        whole = 0
        for item in range(len(self.result)):
            whole += self.Coordinates[item].distance[self.data_base[item]]
        return whole
    
    # If changing the position of map leads to the reduction of whole loss, then swap the position
    def check_if_smaller(self):
        while True:
            i = 0
            while i < 10:
                j = i + 1
                flag = False
                while j < 10:
                    if self.Coordinates[i].distance[self.data_base[j]] + \
                            self.Coordinates[j].distance[self.data_base[i]] \
                            < self.Coordinates[i].distance[self.data_base[i]] \
                            + self.Coordinates[j].distance[self.data_base[j]]:
                        self.swap(i, j)
                        flag = True
                        break
                    j += 1
                if flag:
                    break
                i += 1
            if i == 10:
                break
    
    # swap the position
    def swap(self, position1, position2):
        self.result[position1], self.result[position2] = self.result[position2], self.result[position1]
        self.Coordinates[position1], self.Coordinates[position2] = self.Coordinates[position2], self.Coordinates[
            position1]

    # store the distance between a character to all of the other characters
    class Coordinate:
        def __init__(self, name, sample, points):
            self.name = name
            self.points = points
            self.distance = self.calculate_distance(sample)
            self.minimum = min(self.distance.items(), key=lambda x: x[1])
            self.sorted_result = sorted(self.distance.items(), key=lambda x: x[1], reverse=False)

        def calculate_distance(self, sample):
            result = {}
            for name, points in sample.items():
                result[name] = calculate_rmse_distance(self.points, points)
            return result
        

### 2. Web scraping functions

+ **Diagram demonstration**:
<img src="images/web scraping.png" alt="web scraping" style="zoom:90%;" />

**Information requesting part**

In [4]:
# request the html text of a single page
def get_single_page(url,proxies=None):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
#             'Cookie': '__mta=256733748.1632830079504.1632842059702.1632842109743.26; uuid_n_v=v1; uuid=DC1FA030205211EC8F810FD937DF6B98FF471C1ADEC34DDDADC5F415E8503753; _csrf=e827d2725a857ece1b4a392ea6b636b48edc2ae14dbb0c820703b8c3c7747408; _lxsdk_cuid=1755f7af674c8-07477984254229-333769-1fa400-1755f7af674c8; _lxsdk=DC1FA030205211EC8F810FD937DF6B98FF471C1ADEC34DDDADC5F415E8503753; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1632830079; __mta=256733748.1632830079504.1632839525262.1632839531803.9; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1632842110; _lxsdk_s=17c2cb8bb6f-1c7-017-268%7C%7C58'
        }
        response = requests.get(url, headers=headers, proxies=proxies)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

    
# parse inner links from html
def parse_gather_url(html):
    html = BeautifulSoup(html, 'lxml')
    links = html.select('#app .board-wrapper dd > a ')
    result = []
    for link in links:
        result.append(link.attrs['href'])
    return result

def proxy_gen(pages=10):
    proxyUrl = f"http://http.tiqu.letecs.com/getip3?num={pages}&type=1&pro=&city=0&yys=0&port=1&pack=189489&ts=0&ys=0&cs=0&lb=1&sb=0&pb=4&mr=1&regions="
    proxyList = requests.get(proxyUrl).text.strip().split('\r\n')
    for i in range(pages):
        proxyMeta = proxyList[i]
        yield  {
            "http"  : proxyMeta,
            "https"  : proxyMeta
        }
    
# get the links in one function: request 10 pages to get 100 inner links    
def collect_urls():
    links = []
    i = 0
    proxy = proxy_gen()
    while i < 10:
        url = 'http://maoyan.com/board/4?offset=' + str(i * 10)
        html = get_single_page(url, proxy)
        if html == None:
            continue
        new_link = parse_gather_url(html)
        if new_link == []:
            print(f"Fail to get {i + 1}th page information, retrying")
            proxy = proxy_gen()
#             time.sleep(0.2)
        else:
            print(f"Acquiring {i + 1}th page information")
            links.extend(new_link)
            time.sleep(0.2)
            i += 1
    return links


# parse specific information of inner page
def parse_specific_page_information(html, url):
    font_dict = parse_font(html, url)
    single_info = dict()
    to_be_transformed = re.findall('<span class="stonefont">(.*?)</span>', html)
    single_info["rating"] = to_be_transformed[0] # rating
    single_info["rating number"] = to_be_transformed[1] # rating number
    if len(to_be_transformed) > 2: # cumulative income
        single_info["cumulative income"] = to_be_transformed[2] + re.findall('<span class="unit">(.*?)</span>', html)[0]
    html = BeautifulSoup(html, 'lxml')
    single_info["title"] = html.select_one('h1').string # title
    single_info["title_en"] = html.select_one('.movie-brief-container > div').string # English title
    series_of_description = html.select('ul>.ellipsis')
    flag = html.select_one(".film-mbox-item:first-of-type > div:nth-child(1)")
    if flag and flag != "暂无": # first week income
        single_info["first week income"] = html.select_one(
            ".film-mbox-item:first-of-type > div:nth-child(1)").string + "万"
    single_info["type"] = list(map(lambda x: x.string.strip(), series_of_description[0].select('a'))) # type
    single_info["area"] = series_of_description[1].string.split('/')[0].strip() # area
    single_info["duration"] = series_of_description[1].string.split('/')[1].strip() # duration
    single_info["time in CN"] = series_of_description[2].string.strip()[:10] # release time in China
    personnel = list(map(lambda x: x.string.strip(), html.select('.info a')))
    single_info["director"] = personnel[0] # director
    single_info["actors"] = personnel[1:] # actors
    star = html.select(".time > ul")
    reviews = html.select(".comment-content")
    single_info["reviews"] = [(star[i]["data-score"], reviews[i].string.strip()) for i in range(len(star))] # reviews
    portrait = list(map(lambda x: x.text.strip(), html.select('.award-list .award-item > div:first-of-type')))
    content = list(map(lambda x: x.text.strip(), html.select('.award-list .award-item >.content')))
    single_info["awards"] = {portrait[i]: trans2awards(content)[i] for i in range(len(portrait))} # awards
    if html.select_one(".film-honors-item:first-of-type>.honors-name:first-of-type"):
        single_info["number of prize"] = html.select_one(
            ".film-honors-item:first-of-type>.honors-name:first-of-type").string[:-1]
    if html.select_one(".film-honors-item:nth-child(2)>.honors-name:first-of-type"):
        single_info["number of nomination"] = html.select_one(
            ".film-honors-item:nth-child(2)>.honors-name:first-of-type").string[:-1]
    return replace_font(single_info, font_dict)

In [5]:
# get the links in one function: request 10 pages to get 100 inner links    
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_urls(*args):
    def inner_request():
        time.sleep(0.2)
        return get_single_page(url, proxy)
    offset, proxy = args
    print(offset, proxy)
    url = 'http://maoyan.com/board/4?offset=' + str(offset * 10)
    html = inner_request()
    new_link = parse_gather_url(html)
    if new_link == []:
        print(f"Fail to get {offset + 1}th page information, retrying")
        return []
#         collect_urls(*args)
    else:
        print(f"Acquiring {offset + 1}th page information")
        return new_link

links = []
def thread_request(pages):
    links = []
    proxy = list(proxy_gen(pages))
    thread_pool = ThreadPoolExecutor(3) #定义5个线程执行此任务
#     process_pool = ProcessPoolExecutor(5) #定义5个进程
    all_task = [thread_pool.submit(collect_urls, i,proxy[i]) for i in range(pages)]
    for future in as_completed(all_task):
        new_link = future.result()
        links.extend(new_link)
    return links
#         print("in main: get page {}s success".format(data))
#     for  in range(pages):
#         print((i,proxy[i]))
#         task1 = thread_pool.submit(collect_urls,())
#         print(task1.done())
        


In [6]:
links = thread_request(10)

0 {'http': '42.6.114.125:7807', 'https': '42.6.114.125:7807'}
1 {'http': '42.6.114.103:4891', 'https': '42.6.114.103:4891'}
2 {'http': '42.6.114.125:6686', 'https': '42.6.114.125:6686'}
Acquiring 1th page information
3 {'http': '42.6.114.103:3053', 'https': '42.6.114.103:3053'}
Acquiring 3th page information
4 {'http': '42.6.114.103:4365', 'https': '42.6.114.103:4365'}
Acquiring 2th page information
5 {'http': '42.6.114.103:5151', 'https': '42.6.114.103:5151'}
Acquiring 5th page information
6 {'http': '42.6.114.125:7088', 'https': '42.6.114.125:7088'}
Acquiring 6th page information
7 {'http': '42.6.114.114:2540', 'https': '42.6.114.114:2540'}
Acquiring 4th page information
8 {'http': '42.6.114.124:3300', 'https': '42.6.114.124:3300'}
Acquiring 7th page information
9 {'http': '42.6.114.114:3249', 'https': '42.6.114.114:3249'}
Acquiring 8th page information
Acquiring 9th page information
Acquiring 10th page information


In [8]:
len(links)

100

**Additional useful function**

In [18]:
# caculate rmse distance used by the Coordinate class
def calculate_rmse_distance(dis1, dis2):
    size = min(len(dis1), len(dis2))
    return np.sqrt(mean_squared_error(dis1[:size, :], dis2[:size, :]) + abs(len(dis1) - len(dis2)) * 10)


# gradient calculated for preprocess data
def gradient(to_be_fused):
    for name, points in to_be_fused.items():
        to_be_fused[name] = np.sort(np.diff(points, axis=0), axis=0)


# transform list into awards list 
def trans2awards(data: list) -> list:
    result = []
    for i in data:
        awards = i.split("\n")
        tmp_dict = {}
        for j in awards:
            tmp_dict[j[:2]] = list(map(lambda x: x.strip(), j[3:].split('/')))
        result.append(tmp_dict)
    return result


# create a dirctory if not exits
def mkdir(path):
    if not os.path.exists(path):
        os.makedirs(path)


# replace contents by dictionary accordingly    
def replace_character(content: str, mappings: dict) -> str:
    for character, replace_element in mappings.items():
        content = content.replace(
            character, replace_element
        )
    return content

**Font parse**

In [None]:
# parse font by created class and methods
def parse_font(html,url):
    woff_url = re.findall(r"vfile.*?woff", html)[0]
    path = "font"
    mkdir(path)
    font_name = f'{path}/{url[url.rfind("/")+1:]}.woff'
    with open(font_name,'wb') as f:
        f.write(requests.get("http://" + woff_url).content)
    baseFonts = TTFont('basefonts.woff')  
    base_nums = ['1', '4', '7', '5', '0', '2', '6', '3', '9', '8']  # basic number list
    base_fonts = ['uniE815','uniE6A0','uniF1EC','uniEB0C','uniF13C',
                  'uniEC95','uniE301','uniE5A9','uniE195','uniEFB5']  # basic map list
    onlineFonts = TTFont(font_name)  # downloaded font file
    uni_list = onlineFonts.getGlyphNames()[1:-1]  # delete useless part
    data_base ={}
    data_font = {}
    for i in range(10):
        data_base[base_fonts[i]]=np.array(list(baseFonts['glyf'][base_fonts[i]].coordinates))
        data_font[uni_list[i]]=np.array(list(onlineFonts['glyf'][uni_list[i]].coordinates))
    for i in [data_base,data_font]:
        gradient(i)
    font_result = ProperResult(data_base, data_font).result
    font_dict = dict()
    for i in range(len(font_result)):
        font_dict[("&#x"+font_result[i][3:]+";").lower()]= base_nums[i]
    return font_dict

# replace encoded font into proper font
def replace_font(single_info,font_dict):
    for key,value in single_info.items():
        if type(value)==str:
            single_info[key]=replace_character(value, font_dict)
    return single_info

## Function application

**Request the inner links of 100 movies**

In [57]:
links = collect_urls()
links

Acquiring 1th page information
Acquiring 2th page information
Fail to get 3th page information, retrying
Acquiring 3th page information
Acquiring 4th page information


KeyboardInterrupt: 

In [None]:
data_links = pd.DataFrame({"links":links})
data_links.to_csv("data.csv",index=0)

**Store thm in data.csv file**

In [10]:
data = pd.read_csv("data.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   links   100 non-null    object
dtypes: object(1)
memory usage: 928.0+ bytes


**Prepare a dataFrame for data to write in**

In [None]:
columns = ("title","title_en","rating","rating number","first week income","cumulative income",
    "type","area","duration","time in CN","director","actors","reviews","awards","number of prize","number of nomination")
dataset = pd.DataFrame(columns=columns)

In [312]:
def acquire_all_pages(url):
    global dataset
    whole_url = "https://maoyan.com" + url
    html = get_single_page(whole_url)
    if re.findall(r"猫眼验证中心", html):
        print("validation...")
        time.sleep(10)
        acquire_all_pages(url)
    else:
        dataset = dataset.append(parse_specific_page_information(html,url),ignore_index=True)
data["links"].iloc[10:].apply(acquire_all_pages)
dataset

2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array


validation...
validation...


2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.stringData array
2 extra bytes in post.str

Unnamed: 0,title,title_en,rating,rating number,first week income,cumulative income,type,area,duration,time in CN,director,actors,reviews,awards,number of prize,number of nomination
0,我不是药神,Dying To Survive,9.6,271.5万,123812万,31.00亿,"[剧情, 喜剧]",中国大陆,117分钟,2018-07-05,文牧野,"[徐峥, 周一围, 王传君, 谭卓, 文牧野, 徐峥, 周一围, 王传君, 谭卓, 章宇, ...","[(10, 我认为这是一部跟摔跤吧爸爸同层次的电影，忍到最后还是会流泪。生命很坚强也很脆弱，...","{'第38届香港金像奖': {'获奖': ['最佳两岸华语电影']}, '第55届台湾电影金...",28,24
1,肖申克的救赎,The Shawshank Redemption,3.5,8150,,,"[剧情, 犯罪]",美国,142分钟,1994-09-13,弗兰克·德拉邦特,"[蒂姆·罗宾斯, 摩根·弗里曼, 鲍勃·冈顿, 威廉·桑德勒, 弗兰克·德拉邦特, 蒂姆·罗...","[(10, 简直就是穿越过来的基督山伯爵啊！这是我看完《肖申克的救赎》的第一反应。犹记得高中...","{'第67届奥斯卡金像奖': {'提名': ['最佳影片', '最佳男主角', '最佳改编剧...",4,16
2,绿皮书,Green Book,3.5,25.2万,11502万,4.73亿,"[剧情, 喜剧, 传记]",美国,130分钟,2019-03-01,彼得·法雷里,"[维果·莫腾森, 马赫沙拉·阿里, 琳达·卡德里尼, 塞巴斯蒂安·马尼斯科, 彼得·法雷里,...","[(9, 无意中刷到这部电影被简介吸引买了首映，第二天就发现绿皮书斩获奥斯卡最佳影片，本担心...","{'第91届奥斯卡金像奖': {'获奖': ['最佳影片', '最佳男配角', '最佳原创剧...",23,36
3,海上钢琴师,La leggenda del pianista sull'oceano,9.3,90918,6351万,1.44亿,"[剧情, 爱情, 音乐]",意大利,126分钟,2019-11-15,朱塞佩·托纳多雷,"[蒂姆·罗斯, 比尔·努恩, 克兰伦斯·威廉姆斯三世, 普路特·泰勒·文斯, 朱塞佩·托纳多...","[(9, 在暴风雨夜的船上如履平地，闲庭信步，无惧风暴还肆意欢愉地弹琴邀人共赏！看到这段忍不...","{'第57届金球奖': {'获奖': ['最佳电影音乐']}, '第12届欧洲电影奖': {...",8,5
4,霸王别姬,Farewell My Concubine,3.4,7634,暂无万,5万,"[剧情, 爱情]","中国大陆,中国香港",171分钟,1993-07-26,陈凯歌,"[张国荣, 张丰毅, 巩俐, 吕齐, 陈凯歌, 张国荣, 张丰毅, 巩俐, 吕齐, 英达, ...","[(10, 你不曾真的离去，你始终在我心里。\nMiss You Much Leslie),...","{'第66届奥斯卡金像奖': {'提名': ['最佳外语片', '最佳摄影']}, '第46...",5,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,钢琴家,The Pianist,8.9,758,,,"[剧情, 音乐, 传记, 战争]","法国,德国,英国,波兰",150分钟,2002-05-24,罗曼·波兰斯基,"[艾德里安·布洛迪, 艾米莉娅·福克斯, 米哈乌·热布罗夫斯基, 埃德·斯托帕德, 罗曼·波...","[(10, 在大荧幕上看到还是蛮震撼的。), (9, 好早之前看的片子了。隐隐约约还记得一点...","{'第75届奥斯卡金像奖': {'获奖': ['最佳导演', '最佳男主角', '最佳改编剧...",20,23
96,致命魔术,The Prestige,8.8,1302,暂无万,5309万美元,"[剧情, 悬疑, 惊悚]","美国,英国",130分钟,2006-10-17,克里斯托弗·诺兰,"[休·杰克曼, 克里斯蒂安·贝尔, 迈克尔·凯恩, 斯嘉丽·约翰逊, 克里斯托弗·诺兰, 休...","[(10, 诺兰的这部魔术作品是我所看过的魔术电影中最喜爱的一部。\n①.剧情紧凑，影片开头...","{'第79届奥斯卡金像奖': {'提名': ['最佳摄影', '最佳艺术指导']}, '第3...",0,5
97,饮食男女,Eat Drink Man Woman,8.8,310,暂无万,729万美元,"[剧情, 家庭]","中国台湾,美国",124分钟,1994-08-03,李安,"[郎雄, 吴倩莲, 杨贵媚, 王渝文, 李安, 郎雄, 吴倩莲, 杨贵媚, 王渝文, 张艾嘉...","[(10, 《饮食男女》\n乍一看这名字以为是个爱情片。毕竟是学识阅历都不够丰富，知道个“食...","{'第67届奥斯卡金像奖': {'提名': ['最佳外语片']}, '第31届台湾电影金马奖...",0,5
98,模仿游戏,The Imitation Game,3.9,53966,1889万,5247万,"[剧情, 战争, 传记]",英国,114分钟,2015-07-21,莫滕·泰杜姆,"[本尼迪克特·康伯巴奇, 凯拉·奈特莉, 马修·古迪, 罗里·金奈尔, 莫滕·泰杜姆, 本尼...","[(10, 比起同档期的电影，真的是不错了。私心为了本尼去看这部电影，看之前虽然大概知道图灵...","{'第87届奥斯卡金像奖': {'获奖': ['最佳改编剧本'], '提名': ['最佳影片...",9,37


In [25]:
dataset.to_csv("dataset.csv",index=0)

In [3]:
dataset = pd.read_csv("dataset.csv")
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   title                 100 non-null    object 
 1   title_en              100 non-null    object 
 2   rating                100 non-null    object 
 3   rating number         100 non-null    object 
 4   first week income     32 non-null     object 
 5   cumulative income     56 non-null     object 
 6   type                  100 non-null    object 
 7   area                  100 non-null    object 
 8   duration              100 non-null    object 
 9   time in CN            100 non-null    object 
 10  director              100 non-null    object 
 11  actors                100 non-null    object 
 12  reviews               100 non-null    object 
 13  awards                100 non-null    object 
 14  number of prize       94 non-null     float64
 15  number of nomination  94

In [4]:
dataset

Unnamed: 0,title,title_en,rating,rating number,first week income,cumulative income,type,area,duration,time in CN,director,actors,reviews,awards,number of prize,number of nomination
0,我不是药神,Dying To Survive,9.6,271.5万,123812万,31.00亿,"['剧情', '喜剧']",中国大陆,117分钟,2018-07-05,文牧野,"['徐峥', '周一围', '王传君', '谭卓', '文牧野', '徐峥', '周一围',...","[('10', '我认为这是一部跟摔跤吧爸爸同层次的电影，忍到最后还是会流泪。生命很坚强也很...","{'第38届香港金像奖': {'获奖': ['最佳两岸华语电影']}, '第55届台湾电影金...",28.0,24.0
1,肖申克的救赎,The Shawshank Redemption,3.5,8150,,,"['剧情', '犯罪']",美国,142分钟,1994-09-13,弗兰克·德拉邦特,"['蒂姆·罗宾斯', '摩根·弗里曼', '鲍勃·冈顿', '威廉·桑德勒', '弗兰克·德...","[('10', '简直就是穿越过来的基督山伯爵啊！这是我看完《肖申克的救赎》的第一反应。犹记...","{'第67届奥斯卡金像奖': {'提名': ['最佳影片', '最佳男主角', '最佳改编剧...",4.0,16.0
2,绿皮书,Green Book,3.5,25.2万,11502万,4.73亿,"['剧情', '喜剧', '传记']",美国,130分钟,2019-03-01,彼得·法雷里,"['维果·莫腾森', '马赫沙拉·阿里', '琳达·卡德里尼', '塞巴斯蒂安·马尼斯科',...","[('9', '无意中刷到这部电影被简介吸引买了首映，第二天就发现绿皮书斩获奥斯卡最佳影片，...","{'第91届奥斯卡金像奖': {'获奖': ['最佳影片', '最佳男配角', '最佳原创剧...",23.0,36.0
3,海上钢琴师,La leggenda del pianista sull'oceano,9.3,90918,6351万,1.44亿,"['剧情', '爱情', '音乐']",意大利,126分钟,2019-11-15,朱塞佩·托纳多雷,"['蒂姆·罗斯', '比尔·努恩', '克兰伦斯·威廉姆斯三世', '普路特·泰勒·文斯',...","[('9', '在暴风雨夜的船上如履平地，闲庭信步，无惧风暴还肆意欢愉地弹琴邀人共赏！看到这...","{'第57届金球奖': {'获奖': ['最佳电影音乐']}, '第12届欧洲电影奖': {...",8.0,5.0
4,霸王别姬,Farewell My Concubine,3.4,7634,,5万,"['剧情', '爱情']","中国大陆,中国香港",171分钟,1993-07-26,陈凯歌,"['张国荣', '张丰毅', '巩俐', '吕齐', '陈凯歌', '张国荣', '张丰毅'...","[('10', '你不曾真的离去，你始终在我心里。\nMiss You Much Lesli...","{'第66届奥斯卡金像奖': {'提名': ['最佳外语片', '最佳摄影']}, '第46...",5.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,钢琴家,The Pianist,8.9,758,,,"['剧情', '音乐', '传记', '战争']","法国,德国,英国,波兰",150分钟,2002-05-24,罗曼·波兰斯基,"['艾德里安·布洛迪', '艾米莉娅·福克斯', '米哈乌·热布罗夫斯基', '埃德·斯托帕...","[('10', '在大荧幕上看到还是蛮震撼的。'), ('9', '好早之前看的片子了。隐隐...","{'第75届奥斯卡金像奖': {'获奖': ['最佳导演', '最佳男主角', '最佳改编剧...",20.0,23.0
96,致命魔术,The Prestige,8.8,1302,,5309万美元,"['剧情', '悬疑', '惊悚']","美国,英国",130分钟,2006-10-17,克里斯托弗·诺兰,"['休·杰克曼', '克里斯蒂安·贝尔', '迈克尔·凯恩', '斯嘉丽·约翰逊', '克里...","[('10', '诺兰的这部魔术作品是我所看过的魔术电影中最喜爱的一部。\n①.剧情紧凑，影...","{'第79届奥斯卡金像奖': {'提名': ['最佳摄影', '最佳艺术指导']}, '第3...",0.0,5.0
97,饮食男女,Eat Drink Man Woman,8.8,310,,729万美元,"['剧情', '家庭']","中国台湾,美国",124分钟,1994-08-03,李安,"['郎雄', '吴倩莲', '杨贵媚', '王渝文', '李安', '郎雄', '吴倩莲',...","[('10', '《饮食男女》\n乍一看这名字以为是个爱情片。毕竟是学识阅历都不够丰富，知道...","{'第67届奥斯卡金像奖': {'提名': ['最佳外语片']}, '第31届台湾电影金马奖...",0.0,5.0
98,模仿游戏,The Imitation Game,3.9,53966,1889万,5247万,"['剧情', '战争', '传记']",英国,114分钟,2015-07-21,莫滕·泰杜姆,"['本尼迪克特·康伯巴奇', '凯拉·奈特莉', '马修·古迪', '罗里·金奈尔', '莫...","[('10', '比起同档期的电影，真的是不错了。私心为了本尼去看这部电影，看之前虽然大概知...","{'第87届奥斯卡金像奖': {'获奖': ['最佳改编剧本'], '提名': ['最佳影片...",9.0,37.0
