# Assignment 1 Web Scraping & Data Analysis

## Highlights
+ **Speed:** <span style="color:red">Best run time: 93.98 s</span>
    + Uses a thread pool to speed up the scraping
    + Uses an IP pool to reduce verification challenges from the website
        + **To run the scraping yourself, please log in to the IP pool website and use your own personal link. Supporting everyone's tests is quite expensive for a student. Website link: https://zhimahttp.com/**
+ **Accuracy:**
    + Ensures 100% accuracy of font decoding

## Outline
This notebook can be run directly because the web scraping part has been commented out.  
To test the scraping, change the IP pool link and uncomment the code.

+ [Install necessary packages](#Install-necessary-packages)
+ [Import related packages](#Import-related-packages)
+ [Designed functions](#Designed-functions)
  + [1. Font decoding algorithm](#1.-Font-decoding-algorithm)
  + [2. Web scraping functions](#2.-Web-scraping-functions)  
  To test the scraping, replace the link in `proxy_gen()` with your personal link.
+ [Function application](#Function-application)
  + [Request the inner links of 100 movies](#Request-the-inner-links-of-100-movies)
  + [Store them in data.csv file](#Store-them-in-data.csv-file)
  + [Prepare a dataFrame for data to write in](#Prepare-a-dataFrame-for-data-to-write-in)
  + [Start scraping 100 pages](#Start-scraping-100-pages)
  + [The time consume](#The-time-consume)
  + [Store data into dataset.csv file](#Store-data-into-dataset.csv-file)
+ [Data Analysis](#Data-Analysis)
  + [Overall inspection](#Overall-inspection)
  + [Basic analysis: Type vs area Example](#Basic-analysis:-Type-vs-area-Example)
  + [Machine learning](#Machine-learning)

## Install necessary packages

In [1]:
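# !pip install requests
# !pip install pandas
# !pip install bs4
# !pip install lxml          # parser used by BeautifulSoup(html, 'lxml')
# !pip install fontTools
# !pip install numpy
# !pip install scikit-learn  # provides the `sklearn` module
# !pip install pyecharts
# !pip install jieba
# copy and datetime are part of the Python standard library and need no installation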

## Import related packages

In [2]:
# use for decoding algorithm
import numpy as np
from sklearn.metrics import mean_squared_error
import re
import requests
from requests.exceptions import RequestException
import time
from bs4 import BeautifulSoup
import pandas as pd
# parse font file (.woff)
from fontTools.ttLib import TTFont
import os
import warnings
# thread pool
from concurrent.futures import ThreadPoolExecutor,as_completed
from pyecharts.charts import Pie,Bar,Grid,WordCloud,Line
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode
import copy
import datetime
import jieba
import jieba.posseg as pseg
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVR
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV

# ignore warnings
warnings.filterwarnings("ignore")

## Designed functions

### 1. Font decoding algorithm
> The `ProperResult` class finds the correct mapping between the glyphs of the base font and those of a newly downloaded font. Each glyph is represented by the gradients (differences between consecutive points) of its outline, and glyphs are compared with an RMSE-based distance. Starting from an initial mapping, the class swaps pairs of glyphs whenever that lowers the total loss, and the resulting lowest-loss mapping is used to decode the obfuscated digits.

+ **Diagram demonstration**:
<img src="images/ProperResult.png" alt="ProperResult" style="zoom:90%;" />

In [3]:
class ProperResult:
    """
        Find the best mapping between base-font glyphs and new-font glyphs.
        Read ProperResult.result to get the best mapping.
        
        parameters:
          data_base: dict, base font stored as {glyph name: points}, e.g. {'uniE815': [(1, 49), (32, 56), ...]}
          data_font: dict, new font stored in the same form, e.g. {'uniE211': [(1, 23), (12, 50), ...]}
          
    """
    def __init__(self, data_base, data_font):
        self.result = list(data_font.keys())
        self.data_base = list(data_base.keys())
        self.Coordinates = self.generate_Coordinates(data_base, data_font)
        self.whole = self.calculate_whole()
        self.check_if_smaller()
    
    # Build one Coordinate per new-font glyph, holding its distances to every base glyph
    def generate_Coordinates(self, data_base, data_font):
        Coordinates = []
        for name, points in data_font.items():
            Coordinates.append(self.Coordinate(name, data_base, points))
        return Coordinates
    
    # Calculate the whole loss
    def calculate_whole(self):
        whole = 0
        for item in range(len(self.result)):
            whole += self.Coordinates[item].distance[self.data_base[item]]
        return whole
    
    # Swap two positions whenever that reduces the total loss; restart the scan after each swap until no swap helps
    def check_if_smaller(self):
        while True:
            i = 0
            while i < 10:
                j = i + 1
                flag = False
                while j < 10:
                    if self.Coordinates[i].distance[self.data_base[j]] + \
                            self.Coordinates[j].distance[self.data_base[i]] \
                            < self.Coordinates[i].distance[self.data_base[i]] \
                            + self.Coordinates[j].distance[self.data_base[j]]:
                        self.swap(i, j)
                        flag = True
                        break
                    j += 1
                if flag:
                    break
                i += 1
            if i == 10:
                break
    
    # swap the position
    def swap(self, position1, position2):
        self.result[position1], self.result[position2] = self.result[position2], self.result[position1]
        self.Coordinates[position1], self.Coordinates[position2] = self.Coordinates[position2], self.Coordinates[
            position1]

    # Stores the distances from one glyph to every glyph of the base font
    class Coordinate:
        def __init__(self, name, sample, points):
            self.name = name
            self.points = points
            self.distance = self.calculate_distance(sample)
            self.minimum = min(self.distance.items(), key=lambda x: x[1])
            self.sorted_result = sorted(self.distance.items(), key=lambda x: x[1], reverse=False)

        def calculate_distance(self, sample):
            result = {}
            for name, points in sample.items():
                result[name] = calculate_rmse_distance(self.points, points)
            return result
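
The swap-based search in `check_if_smaller` can be illustrated on a small cost matrix. The sketch below is a simplified, standalone version of the idea, not the class itself: `cost[i][j]` is a hypothetical distance between new glyph `i` and base glyph `j`, and the name `improve_by_swaps` is an assumption made for illustration. Pairwise swapping is a local search, so in general it is not guaranteed to find the globally best assignment.

```python
def improve_by_swaps(cost):
    """Return `assign`, where new glyph assign[k] is matched to base glyph k.

    Starts from the identity matching and keeps swapping two positions
    while a swap lowers the total cost (local search, not guaranteed optimal).
    """
    n = len(cost)
    assign = list(range(n))
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                current = cost[assign[i]][i] + cost[assign[j]][j]
                swapped = cost[assign[i]][j] + cost[assign[j]][i]
                if swapped < current:
                    assign[i], assign[j] = assign[j], assign[i]
                    improved = True
    return assign


# Hypothetical distances: row = new glyph, column = base glyph.
cost = [
    [9.0, 0.5, 7.0],
    [0.3, 8.0, 6.0],
    [5.0, 4.0, 0.2],
]
assign = improve_by_swaps(cost)
total = sum(cost[assign[k]][k] for k in range(len(cost)))
print(assign, total)
```

Every accepted swap strictly lowers the total cost, so the loop always terminates.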
        

### 2. Web scraping functions
> Use a **thread pool** and an **IP pool** to speed up the scraping

+ **Diagram demonstration**:
<img src="images/web scraping.png" alt="web scraping" style="zoom:90%;" />

**Information requesting part**

In [4]:
# request the html text of a single page
def get_single_page(url,proxies=None):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
#             'Cookie': '__mta=256733748.1632830079504.1632842059702.1632842109743.26; uuid_n_v=v1; uuid=DC1FA030205211EC8F810FD937DF6B98FF471C1ADEC34DDDADC5F415E8503753; _csrf=e827d2725a857ece1b4a392ea6b636b48edc2ae14dbb0c820703b8c3c7747408; _lxsdk_cuid=1755f7af674c8-07477984254229-333769-1fa400-1755f7af674c8; _lxsdk=DC1FA030205211EC8F810FD937DF6B98FF471C1ADEC34DDDADC5F415E8503753; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1632830079; __mta=256733748.1632830079504.1632839525262.1632839531803.9; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1632842110; _lxsdk_s=17c2cb8bb6f-1c7-017-268%7C%7C58'
        }
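        # the timeout keeps a dead proxy from blocking the thread forever; a timeout raises a RequestException
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)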
        if response.status_code == 200:
            return response.text
        else:
            return get_single_page(url,list(proxy_gen(1))[0]) # change ip and request again
    except RequestException as e:
        time.sleep(0.2)
        return get_single_page(url,list(proxy_gen(1))[0])

    
# parse inner links, ranks and ratings from the board page html
def parse_gather_url(html):
    html = BeautifulSoup(html, 'lxml')
    links = html.select('#app .board-wrapper dd > a ')
    ranks = html.select('#app .board-wrapper dd > i ')
    rates = html.select('.score')
    result = []
    # use distinct names so the loop variable does not shadow the list of ratings
    for link, rank, rate in zip(links, ranks, rates):
        result.append((link.attrs['href'], rank.text, rate.text))
    return result

# ip pool for fast request
def proxy_gen(pages=10):
    # Please replace proxyUrl with your personal link; this link has only 71 IPs left,
    # which is enough to scrape around 50 pages
    proxyUrl = f"http://http.tiqu.letecs.com/getip3?num={pages}&type=1&pro=&city=0&yys=0&port=1&time=1&ts=0&ys=0&cs=0&lb=1&sb=0&pb=4&mr=1&regions="
    proxyList = requests.get(proxyUrl).text.strip().split('\r\n')
    for i in range(pages):
        proxyMeta = proxyList[i]
        yield  {
            "http"  : proxyMeta,
            "https"  : proxyMeta
        }
    
# request one board page (10 movies) and return its inner links
def collect_urls(*args):
    offset, proxy = args
    url = 'http://maoyan.com/board/4?offset=' + str(offset * 10)
    html = get_single_page(url, proxy)
    new_link = parse_gather_url(html)
    if not new_link:
        # nothing is retried here; rerun the cell if a page is missing
        print(f"Failed to get the information of page {offset + 1}")
        return []
    print(f"Acquired the information of page {offset + 1}")
    return new_link
    
# Use thread to request 10 pages for 100 urls
def thread_100pages_request(pages=10):
    links = []
    proxy = list(proxy_gen(pages))
    thread_pool = ThreadPoolExecutor(3) 
    all_task = [thread_pool.submit(collect_urls, i,proxy[i]) for i in range(pages)]
    for future in as_completed(all_task):
        new_link = future.result()
        links.extend(new_link)
    return links
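
The pattern used by the functions above, one task per page with its own proxy and results collected as tasks finish, can be tried without any network access. In the sketch below `fake_fetch`, `fetch_all` and the proxy strings are placeholders invented for illustration; the notebook itself calls `get_single_page` with proxies obtained from `proxy_gen`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(page, proxy):
    # Stand-in for a real HTTP request made through `proxy`.
    return page, f"page {page} fetched via {proxy}"


def fetch_all(pages, proxies, max_workers=3):
    results = {}
    # The context manager waits for all tasks and shuts the pool down.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_fetch, p, proxies[p]) for p in range(pages)]
        # as_completed yields futures in completion order, not submission order,
        # so results are keyed by page number instead of appended blindly.
        for future in as_completed(futures):
            page, text = future.result()
            results[page] = text
    return results


proxies = [f"proxy-{i}.example:8080" for i in range(4)]  # placeholder addresses
results = fetch_all(4, proxies)
for page in sorted(results):
    print(results[page])
```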

**Additional useful function**

In [5]:
# calculate the RMSE distance used by the Coordinate class; a difference in the number of points is penalised
def calculate_rmse_distance(dis1, dis2):
    size = min(len(dis1), len(dis2))
    return np.sqrt(mean_squared_error(dis1[:size, :], dis2[:size, :]) + abs(len(dis1) - len(dis2)) * 50) 


# gradient calculated for preprocess data
def gradient(to_be_fused):
    for name, points in to_be_fused.items():
#         to_be_fused[name] = np.sort(np.diff(points, axis=0), axis=0)
        to_be_fused[name] = np.diff(points, axis=0)


# transform list into awards list 
def trans2awards(data: list) -> list:
    result = []
    for i in data:
        awards = i.split("\n")
        tmp_dict = {}
        for j in awards:
            tmp_dict[j[:2]] = list(map(lambda x: x.strip(), j[3:].split('/')))
        result.append(tmp_dict)
    return result


# create a directory if it does not exist (safe when several threads call it at once)
def mkdir(path):
    os.makedirs(path, exist_ok=True)


# replace contents by dictionary accordingly    
def replace_character(content: str, mappings: dict) -> str:
    for character, replace_element in mappings.items():
        content = content.replace(
            character, replace_element
        )
    return content


# parse specific information of inner page
def parse_specific_page_information(html, url,rank,if_modified):
    try:
        font_dict = parse_font(html, url)
    except Exception as e:
        # show the failure and the page that caused it, then re-raise the original error
        print(e)
        print(html)
        raise
    single_info = dict()
    to_be_transformed = re.findall('<span class="stonefont">(.*?)</span>', html)
    single_info["rating"] = to_be_transformed[0] # rating
    single_info["rating number"] = to_be_transformed[1] # rating number
    if len(to_be_transformed) > 2: # cumulative income
        single_info["cumulative income"] = to_be_transformed[2] + re.findall('<span class="unit">(.*?)</span>', html)[0]
    else:
        single_info["cumulative income"] = None
    html = BeautifulSoup(html, 'lxml')
    single_info["title"] = html.select_one('h1').string # title
    single_info["title_en"] = html.select_one('.movie-brief-container > div').string # English title
    series_of_description = html.select('ul>.ellipsis')
    flag = html.select_one(".film-mbox-item:first-of-type > div:nth-child(1)")
    # compare the text of the tag, not the tag itself, with the "no data" markers
    if flag and flag.string and flag.string.strip() not in ("暂无万", "暂无"): # first week income
        single_info["first week income"] = flag.string + "万"
    else:
        single_info["first week income"] = None
    single_info["type"] = list(map(lambda x: x.string.strip(), series_of_description[0].select('a'))) # type
    single_info["area"] = series_of_description[1].string.split('/')[0].strip() # area
    single_info["duration"] = series_of_description[1].string.split('/')[1].strip() # duration
    single_info["time in CN"] = series_of_description[2].string.strip()[:10] # release time in China
    personnel = list(map(lambda x: x.string.strip(), html.select('.info a')))
    single_info["director"] = personnel[0] # director
    single_info["actors"] = personnel[1:] # actors
    star = html.select(".time > ul")
    reviews = html.select(".comment-content")
    single_info["reviews"] = [(star[i]["data-score"], reviews[i].string.strip()) for i in range(len(star))] # reviews
    portrait = list(map(lambda x: x.text.strip(), html.select('.award-list .award-item > div:first-of-type')))
    content = list(map(lambda x: x.text.strip(), html.select('.award-list .award-item >.content')))
    single_info["awards"] = {portrait[i]: trans2awards(content)[i] for i in range(len(portrait))} # awards
    if html.select_one(".film-honors-item:first-of-type>.honors-name:first-of-type"):
        single_info["number of prize"] = html.select_one(
            ".film-honors-item:first-of-type>.honors-name:first-of-type").string[:-1]
    else:
        single_info["number of prize"] = None
    if html.select_one(".film-honors-item:nth-child(2)>.honors-name:first-of-type"):
        single_info["number of nomination"] = html.select_one(
            ".film-honors-item:nth-child(2)>.honors-name:first-of-type").string[:-1]
    else:
        single_info["number of nomination"] = None
    single_info["rank"] = rank
    result = replace_font(single_info, font_dict,if_modified)
    return result
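
The distance used to compare two glyphs can be checked by hand on tiny inputs. The sketch below re-implements the same formula with the standard library only (the notebook itself uses NumPy and scikit-learn); the helper names `diff_points` and `rmse_distance` and the sample outlines are assumptions for illustration. Because the comparison works on differences between consecutive points, shifting a glyph does not change its distance.

```python
import math


def diff_points(points):
    # Differences between consecutive (x, y) points, like np.diff(points, axis=0).
    return [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(points, points[1:])]


def rmse_distance(a, b, length_penalty=50):
    # Compare the common prefix and penalise a difference in point count.
    size = min(len(a), len(b))
    squared = [(p - q) ** 2
               for row_a, row_b in zip(a[:size], b[:size])
               for p, q in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    return math.sqrt(mse + abs(len(a) - len(b)) * length_penalty)


outline = [(0, 0), (10, 0), (10, 20), (0, 20)]   # hypothetical glyph outline
shifted = [(x + 5, y + 7) for x, y in outline]   # same shape, moved
shorter = outline[:3]                            # one point fewer

same = rmse_distance(diff_points(outline), diff_points(shifted))
different = rmse_distance(diff_points(outline), diff_points(shorter))
print(same, different)
```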

In [6]:
# Request 100 pages and store in the dataframe
columns = ("rating","rating number","cumulative income","title","title_en","first week income",
    "type","area","duration","time in CN","director","actors","reviews","awards","number of prize","number of nomination","rank")

def request_inner_pages(top_ranks,if_modified):
    # Put each dict into list
    def add_to_list(dict_list):
        for i,value in enumerate(dict_list.values()):
            to_added_list[i].append(value)
            
    content = dict.fromkeys(columns)
    to_added_list = [[] for i in range(17)]
    proxy = list(proxy_gen(top_ranks))
    thread_pool = ThreadPoolExecutor(5) 
    all_task = []
    for i in range(top_ranks):
        all_task.append(thread_pool.submit(acquire_all_pages, data.values[i][0],i+1,proxy[i],if_modified))
        time.sleep(0.2)
    for future in as_completed(all_task):
        line = future.result()
        add_to_list(line)
    for i,key in enumerate(content.keys()):
        content[key] = to_added_list[i]
    return pd.DataFrame(content)

# Request and parse single page
def acquire_all_pages(*args):
    url,rank,proxy,if_modified = args
    whole_url = "https://maoyan.com" + url
    html = get_single_page(whole_url,proxy)
    result = None
    if not re.findall(r"猫眼验证中心", html):
        result = parse_specific_page_information(html,url,rank,if_modified)
    if not result:
        result = acquire_all_pages(url,rank,list(proxy_gen(1))[0],if_modified)
    print(f"rank {rank} has completed")
    return result

**Font parse**

In [7]:
# parse font by created class and methods
def parse_font(html,url):
    woff_url = re.findall(r"vfile.*?woff", html)[0]
    path = "font"
    mkdir(path)
    font_name = f'{path}/{url[url.rfind("/")+1:]}.woff'
    with open(font_name,'wb') as f:
        f.write(requests.get("http://" + woff_url).content)
    baseFonts = TTFont('basefonts.woff')  
    base_nums = ['1', '4', '7', '5', '0', '2', '6', '3', '9', '8']  # basic number list
    base_fonts = ['uniE815','uniE6A0','uniF1EC','uniEB0C','uniF13C',
                  'uniEC95','uniE301','uniE5A9','uniE195','uniEFB5']  # basic map list
    onlineFonts = TTFont(font_name)  # downloaded font file
    uni_list = onlineFonts.getGlyphNames()[1:-1]  # delete useless part
    data_base ={}
    data_font = {}
    for i in range(10):
        data_base[base_fonts[i]]=np.array(list(baseFonts['glyf'][base_fonts[i]].coordinates))
        data_font[uni_list[i]]=np.array(list(onlineFonts['glyf'][uni_list[i]].coordinates))
    for i in [data_base,data_font]:
        gradient(i)
    font_result = ProperResult(data_base, data_font).result
    font_dict = dict()
    for i in range(len(font_result)):
        font_dict[("&#x"+font_result[i][3:]+";").lower()]= base_nums[i]
    return font_dict

# replace encoded characters with the decoded digits
def replace_font(single_info,font_dict,if_modified):
    for key,value in single_info.items():
        if isinstance(value, str):
            single_info[key]=replace_character(value, font_dict)
    # optionally verify the decoded rating against the rating scraped from the board page
    if if_modified and float(single_info["rating"])!=data.loc[int(single_info['rank'])-1,"rate"]:
        print(f"Movie rank {single_info['rank']} was decoded incorrectly, retrying; ratings are {single_info['rating']} vs {data.loc[int(single_info['rank'])-1,'rate']}")
        return False
    return single_info
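
Once the glyph order is known, decoding is a plain string replacement of HTML entities by digits. The sketch below shows that last step in isolation; the helper names, the glyph names and the digits assigned to them are made up for illustration and do not come from a real font file.

```python
def build_font_dict(glyph_names, digits):
    # 'uniE815' -> '&#xe815;', paired with the digit at the same position.
    return {f"&#x{name[3:]};".lower(): digit
            for name, digit in zip(glyph_names, digits)}


def decode(text, font_dict):
    # Replace every encoded entity with its digit.
    for entity, digit in font_dict.items():
        text = text.replace(entity, digit)
    return text


glyph_names = ["uniE815", "uniE6A0", "uniF1EC"]  # hypothetical matched glyphs
digits = ["9", "6", "1"]                          # hypothetical digits
font_dict = build_font_dict(glyph_names, digits)
decoded = decode("&#xe815;.&#xe6a0;", font_dict)
print(decoded)
```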

## Function application

### Request the inner links of 100 movies

In [8]:
# request 100 inner urls
# If an error occurs, run the cell again

# links = thread_100pages_request(10)

### Store them in data.csv file

In [9]:
# data_links = pd.DataFrame({"links":[link[0] for link in links],"ranks":[int(rank[1]) for rank in links],"rate":[float(rank[2]) for rank in links]}).sort_values(by="ranks").set_index("ranks")
# data_links.to_csv("data.csv",index=0)

In [10]:
# read them from file
data = pd.read_csv("data.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   links   100 non-null    object 
 1   rate    100 non-null    float64
dtypes: float64(1), object(1)
memory usage: 1.7+ KB


In [11]:
# display urls
data

Unnamed: 0,links,rate
0,/films/1200486,9.6
1,/films/1297,9.5
2,/films/1206605,9.5
3,/films/1292,9.3
4,/films/1211270,9.6
...,...,...
95,/films/1233,8.8
96,/films/1219776,8.6
97,/films/20131,8.8
98,/films/78646,9.3


### Start scraping 100 pages

In [12]:
# start = time.time()
# dataset_with_modified = request_inner_pages(100,True).sort_values(by="rank")
# end = time.time()
# with_time = f"Time consumed: {end-start:.2f} s"

In [13]:
# # Verification ensures the accuracy of the data; without it, scraping is faster:
# start = time.time()
# dataset_without_modified = request_inner_pages(100,False).sort_values(by="rank")
# end = time.time()
# without_time = f"Time consumed: {end-start:.2f} s"

### The time consume

In [14]:
# print(with_time, " vs ", without_time)

### Store data into dataset.csv file

In [15]:
# dataset_with_modified.reset_index(drop=True).to_csv("dataset.csv",index=0)

In [16]:
dataset = pd.read_csv("dataset.csv")
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   rating                100 non-null    float64
 1   rating number         100 non-null    object 
 2   cumulative income     56 non-null     object 
 3   title                 100 non-null    object 
 4   title_en              100 non-null    object 
 5   first week income     32 non-null     object 
 6   type                  100 non-null    object 
 7   area                  100 non-null    object 
 8   duration              100 non-null    object 
 9   time in CN            100 non-null    object 
 10  director              100 non-null    object 
 11  actors                100 non-null    object 
 12  reviews               100 non-null    object 
 13  awards                100 non-null    object 
 14  number of prize       94 non-null     float64
 15  number of nomination  94

## Data Analysis

### Overall inspection

In [17]:
dataset.describe()

Unnamed: 0,rating,number of prize,number of nomination,rank
count,100.0,94.0,94.0,100.0
mean,8.98,11.787234,17.031915,50.5
std,0.348445,11.013064,13.764105,29.011492
min,8.0,0.0,0.0,1.0
25%,8.7,3.0,5.25,25.75
50%,8.95,8.0,14.0,50.5
75%,9.3,18.75,24.0,75.25
max,9.8,40.0,59.0,100.0


In [18]:
dataset.sample(5)

Unnamed: 0,rating,rating number,cumulative income,title,title_en,first week income,type,area,duration,time in CN,director,actors,reviews,awards,number of prize,number of nomination,rank
20,9.0,7753,5.73亿,少年派的奇幻漂流,Life of Pi,10263万,"['剧情', '奇幻', '冒险']","美国,中国台湾,英国,加拿大",127分钟,2012-11-22,李安,"['苏拉·沙玛', '伊尔凡·可汗', '塔布', '拉菲·斯波', '李安', '苏拉·沙...","[('10', '在pad上看了一次，去IMAX看了一次，奇思妙想，精彩纷呈，很完美的电影。...","{'第85届奥斯卡金像奖': {'获奖': ['最佳导演', '最佳摄影', '最佳视觉效果...",26.0,52.0,21
24,9.4,6932,13.06亿,泰坦尼克号,Titanic,44388万,"['剧情', '爱情', '灾难']",美国,194分钟,1998-04-03,詹姆斯·卡梅隆,"['莱昂纳多·迪卡普里奥', '凯特·温丝莱特', '比利·赞恩', '格劳瑞亚·斯图尔特'...","[('10', '这应该是我人生中看的第一部电影了 初二初三的时候躲在被窝里 大半夜一个人用...","{'第70届奥斯卡金像奖': {'获奖': ['最佳影片', '最佳导演', '最佳摄影',...",38.0,30.0,25
15,9.2,1653,,辛德勒的名单,Schindler's List,,"['剧情', '历史', '战争']",美国,195分钟,1993-11-30,史蒂文·斯皮尔伯格,"['连姆·尼森', '拉尔夫·费因斯', '本·金斯利', '艾伯丝·戴维兹', '史蒂文·...","[('10', '一部黑白电影能折射出人性伟大的斑斓色彩，非《辛德勒的名单》莫属。打字机的敲...","{'第66届奥斯卡金像奖': {'获奖': ['最佳影片', '最佳导演', '最佳改编剧本...",31.0,23.0,16
92,9.2,13.7万,1.92亿,奇迹男孩,Wonder,5467万,"['剧情', '家庭']",美国,113分钟,2018-01-19,斯蒂芬·卓博斯基,"['雅各布·特瑞布雷', '朱莉娅·罗伯茨', '欧文·威尔逊', '伊扎贝拉·维多维奇',...","[('10', '、很温暖。从头哭到尾。我长痘痘长了快十年，去年去美容院导致烂脸，从去年九月...","{'第90届奥斯卡金像奖': {'提名': ['最佳化妆与发型']}, '第71届英国电影学...",3.0,4.0,93
97,8.8,1315,5309万美元,致命魔术,The Prestige,,"['剧情', '悬疑', '惊悚']","美国,英国",130分钟,2006-10-17,克里斯托弗·诺兰,"['休·杰克曼', '克里斯蒂安·贝尔', '迈克尔·凯恩', '斯嘉丽·约翰逊', '克里...","[('10', '诺兰的这部魔术作品是我所看过的魔术电影中最喜爱的一部。\n①.剧情紧凑，影...","{'第79届奥斯卡金像奖': {'提名': ['最佳摄影', '最佳艺术指导']}, '第3...",0.0,5.0,98


### Basic analysis: Type vs area Example

In [19]:
feature_verus_rank = dataset[['area','rank','type']]
feature_verus_rank

Unnamed: 0,area,rank,type
0,中国大陆,1,"['剧情', '喜剧']"
1,美国,2,"['剧情', '犯罪']"
2,美国,3,"['剧情', '喜剧', '传记']"
3,意大利,4,"['剧情', '爱情', '音乐']"
4,中国大陆,5,"['动画', '喜剧', '奇幻']"
...,...,...,...
95,"中国台湾,美国",96,"['剧情', '家庭']"
96,美国,97,"['剧情', '悬疑', '犯罪']"
97,"美国,英国",98,"['剧情', '悬疑', '惊悚']"
98,英国,99,"['剧情', '战争', '传记']"


In [20]:
# type_count: number of movies per type, e.g. {'剧情': 78,'喜剧': 19,...}
# country_tendency: for each country, the number of movies of each type
# country_movies: number of top movies each country took part in
type_count =dict()
country_tendency = dict()
country_movies = dict()
for key,value in feature_verus_rank[["area",'type']].values:
    key_list = key.replace("，",",").split(',')
    for country in key_list:
        if country not in country_tendency:
            country_tendency[country] = dict()
            country_movies[country] = 1
        else:
            country_movies[country] += 1
        for movie_type in eval(value):
            if movie_type in country_tendency[country]:
                country_tendency[country][movie_type] +=1
            else:
                country_tendency[country][movie_type] = 1
    for movie_type in eval(value):
        if movie_type in type_count:
            type_count[movie_type] +=1
        else:
            type_count[movie_type] = 1
            
country_rank = copy.deepcopy(country_movies)
country_rank["中国"] = country_rank["中国大陆"] + country_rank["中国香港"] + country_rank["中国台湾"]
country_rank = sorted(country_rank.items(), key = lambda x:x[1], reverse = True)
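
The same counts can be gathered more compactly. The sketch below uses `collections.Counter` together with `ast.literal_eval`, which parses the stored list strings without executing arbitrary code the way `eval` can. The three sample rows are invented for illustration and only mimic the format of the `area` and `type` columns.

```python
import ast
from collections import Counter, defaultdict

rows = [  # hypothetical (area, type) pairs in the same string format as dataset.csv
    ("美国,英国", "['剧情', '悬疑']"),
    ("美国", "['剧情', '犯罪']"),
    ("日本", "['动画']"),
]

type_count = Counter()
country_movies = Counter()
country_tendency = defaultdict(Counter)

for area, type_str in rows:
    types = ast.literal_eval(type_str)  # safely parse "['a', 'b']" into a list
    type_count.update(types)
    # normalise full-width commas before splitting the co-producing countries
    for country in area.replace("，", ",").split(","):
        country_movies[country] += 1
        country_tendency[country].update(types)

print(type_count.most_common(1))
print(dict(country_movies))
```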

#### Type tendency

In [21]:
# Check the proportion of types
(
    Pie()
    .add(
        "",
        [z for z in type_count.items()],
        radius=[60, 140],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Type tendency"),
        legend_opts=opts.LegendOpts(
            type_="scroll", pos_top="5%", pos_left="80%", orient="vertical"
        ),
    ).set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
).render_notebook()

#### Top movie number of each country

In [22]:
# Number of top movies per country
# The counts add up to more than 100 because several countries can take part in producing one movie
(
    Bar()
    .add_xaxis(
        list(country_movies.keys())
    )
    .add_yaxis("Number of top movies", list(country_movies.values()))
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)),
        title_opts=opts.TitleOpts(title="Top movie number of each country", subtitle="According to the top 100 movies"),
    )
).render_notebook()

In [23]:
# rank of the country by number of movies 
country_rank

[('美国', 60),
 ('中国', 16),
 ('英国', 15),
 ('日本', 10),
 ('法国', 8),
 ('德国', 8),
 ('中国大陆', 7),
 ('意大利', 7),
 ('中国香港', 6),
 ('韩国', 5),
 ('加拿大', 4),
 ('中国台湾', 3),
 ('瑞士', 3),
 ('印度', 2),
 ('黎巴嫩', 1),
 ('西班牙', 1),
 ('巴西', 1),
 ('荷兰', 1),
 ('奥地利', 1),
 ('澳大利亚', 1),
 ('波兰', 1)]

#### Type tendency of these countries

In [24]:
country_tendency

{'中国大陆': {'剧情': 5,
  '喜剧': 4,
  '动画': 1,
  '奇幻': 2,
  '爱情': 2,
  '家庭': 1,
  '历史': 1,
  '动作': 2,
  '西部': 1},
 '美国': {'剧情': 49,
  '犯罪': 11,
  '喜剧': 8,
  '传记': 5,
  '动作': 7,
  '悬疑': 10,
  '惊悚': 10,
  '科幻': 7,
  '爱情': 13,
  '冒险': 10,
  '历史': 3,
  '战争': 5,
  '动画': 5,
  '家庭': 6,
  '奇幻': 5,
  '灾难': 1},
 '意大利': {'剧情': 7, '爱情': 5, '音乐': 1, '战争': 2, '犯罪': 1},
 '中国香港': {'剧情': 6,
  '爱情': 2,
  '家庭': 1,
  '历史': 1,
  '喜剧': 1,
  '动作': 1,
  '西部': 1,
  '犯罪': 1,
  '悬疑': 1},
 '日本': {'剧情': 6,
  '犯罪': 1,
  '动画': 4,
  '冒险': 3,
  '奇幻': 4,
  '家庭': 3,
  '爱情': 3,
  '歌舞': 1,
  '喜剧': 1},
 '法国': {'剧情': 7,
  '动作': 1,
  '犯罪': 1,
  '喜剧': 3,
  '音乐': 2,
  '爱情': 3,
  '传记': 1,
  '战争': 1},
 '英国': {'动作': 3,
  '悬疑': 2,
  '惊悚': 4,
  '科幻': 3,
  '剧情': 12,
  '奇幻': 2,
  '冒险': 3,
  '喜剧': 4,
  '战争': 3,
  '音乐': 2,
  '传记': 3,
  '爱情': 2,
  '犯罪': 1},
 '印度': {'喜剧': 2, '动作': 1, '家庭': 1, '剧情': 1, '冒险': 1},
 '黎巴嫩': {'剧情': 1},
 '中国台湾': {'剧情': 3, '奇幻': 1, '冒险': 1, '爱情': 1, '家庭': 2},
 '加拿大': {'剧情': 4,
  '奇幻': 1,
  '冒险': 1,
  '科幻': 1,
  '悬疑': 

In [25]:
country_rank_result = [result[0] for result in country_rank[: 6]]
country_rank_result

['美国', '中国', '英国', '日本', '法国', '德国']

In [26]:
America = (
    Bar()
    .add_xaxis(list(country_tendency['美国'].keys()))
    .add_yaxis("Number of films", list(country_tendency['美国'].values()))
    .set_global_opts(title_opts=opts.TitleOpts(title="America",pos_top="3%",))
)
China = (
    Bar()
    .add_xaxis(list(country_tendency['中国大陆'].keys()))
    .add_yaxis("Number of films", list(country_tendency['中国大陆'].values()))
    .set_global_opts(title_opts=opts.TitleOpts(title="China (mainland)", pos_top="3%",pos_right="5%"))
)
UK = (
    Bar()
    .add_xaxis(list(country_tendency['英国'].keys()))
    .add_yaxis("Number of films", list(country_tendency['英国'].values()))
    .set_global_opts(title_opts=opts.TitleOpts(title="UK",pos_top="35%",))
)
JP = (
    Bar()
    .add_xaxis(list(country_tendency['日本'].keys()))
    .add_yaxis("Number of films", list(country_tendency['日本'].values()))
    .set_global_opts(title_opts=opts.TitleOpts(title="Japan", pos_top="35%",pos_right="5%"))
)
French = (
    Bar()
    .add_xaxis(list(country_tendency['法国'].keys()))
    .add_yaxis("Number of films", list(country_tendency['法国'].values()))
    .set_global_opts(title_opts=opts.TitleOpts(title="France",pos_top="70%"))
)
German = (
    Bar()
    .add_xaxis(list(country_tendency['德国'].keys()))
    .add_yaxis("Number of films", list(country_tendency['德国'].values()))
    .set_global_opts(title_opts=opts.TitleOpts(title="Germany",pos_top="70%", pos_right="5%"))
)
    
(
    Grid(init_opts=opts.InitOpts(width="900px",height="900px"))
    .add(America,grid_opts=opts.GridOpts(pos_bottom="70%", pos_right="50%"))
    .add(China,grid_opts=opts.GridOpts(pos_bottom="70%",pos_left="55%"))
    .add(UK,grid_opts=opts.GridOpts(pos_top="36%",pos_bottom="40%", pos_right="50%"))
    .add(JP,grid_opts=opts.GridOpts(pos_top="36%",pos_bottom="40%", pos_left="55%"))
    .add(French,grid_opts=opts.GridOpts(pos_top="70%", pos_right="50%"))
    .add(German,grid_opts=opts.GridOpts(pos_top="70%", pos_left="55%"))
).render_notebook()

In [27]:
dataset.sample()

Unnamed: 0,rating,rating number,cumulative income,title,title_en,first week income,type,area,duration,time in CN,director,actors,reviews,awards,number of prize,number of nomination,rank
41,9.5,135万,15.32亿,疯狂动物城,Zootopia,15517万,"['动画', '动作', '冒险']",美国,109分钟,2016-03-04,拜伦·霍华德,"['瑞奇·摩尔', '金妮弗·古德温', '杰森·贝特曼', '伊德瑞斯·艾尔巴', '拜伦...","[('10', '疯狂动物城搞笑中的真理1:不论你做什么工作，安全问题永远是家人最关心的（兔...","{'第89届奥斯卡金像奖': {'获奖': ['最佳动画长片']}, '第74届金球奖': ...",18.0,19.0,42


### Further analysis: Review as an example

In [28]:
# Collect all review in one variable
review = []
for film in dataset.reviews.values:
    for row in eval(film):  
        review.append(row[1])
review = ",".join(review).replace("\n","").replace(" ","").replace("\r","").replace("\u3000","").replace("\xa0","").replace("•","")
# stopwords
with open('stopwords.txt') as f:
    stopword = [x.strip() for x in f.readlines()]
# user dictionary for word segmentation
jieba.load_userdict("word_dict.txt")
review = pseg.cut(review,use_paddle=True)
count = {}
# filter word counts by tags and amount
for i,tag in review:
    judge = ["v","xc","w","c","m","q","p",'u']
    if i not in stopword and tag not in judge :
        if i in count:
            count[i] +=1
        else:
            count[i] = 1
filtered_count = dict()
for key,value in count.items():
    if value>10:
        filtered_count[key] = value
filtered_count

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Public\Documents\Wondershare\CreatorTemp\jieba.cache
Loading model cost 0.865 seconds.
Prefix dict has been built successfully.


{'流泪': 12,
 '生命': 70,
 '关系': 26,
 '良心': 11,
 '国家': 32,
 '政府': 12,
 '发展': 17,
 '制度': 13,
 '压力': 14,
 '未来': 17,
 '妈妈': 16,
 '身体': 13,
 '不好': 15,
 '眼泪': 19,
 '很好': 14,
 '真实': 99,
 '中国': 73,
 '法律': 22,
 '优秀': 17,
 '演员': 67,
 '社会': 78,
 '现实': 97,
 '人性': 78,
 '好好': 20,
 '生活': 222,
 '善良': 59,
 '平凡': 19,
 '温暖': 50,
 '题材': 34,
 '内心': 57,
 '超级': 19,
 '英雄': 24,
 '值得一看': 14,
 '泪点': 19,
 '故事': 296,
 '真相': 20,
 '制作': 13,
 '演技': 69,
 '角色': 56,
 '影响': 34,
 '经典': 99,
 '情节': 62,
 '台词': 34,
 '历史': 52,
 '辛德勒': 15,
 '强大': 20,
 '人类': 65,
 '极致': 15,
 '体验': 15,
 '美': 34,
 '画面': 61,
 '特效': 16,
 '情感': 75,
 '信仰': 27,
 '艺术': 24,
 '国内': 26,
 '残酷': 32,
 '场景': 42,
 '太': 149,
 '不幸': 16,
 '年轻': 33,
 '学会': 19,
 '语言': 17,
 '身份': 15,
 '人生': 142,
 '悲伤': 18,
 '难': 33,
 '剧情': 111,
 '梦想': 41,
 '心中': 42,
 '主人公': 41,
 '观众': 90,
 '美好': 91,
 '过程': 44,
 '主线': 11,
 '主角': 69,
 '身上': 22,
 '自由': 52,
 '名字': 43,
 '力量': 28,
 '剧本': 20,
 '小说': 25,
 '当年': 27,
 '奥斯卡': 33,
 '始终': 26,
 '地位': 12,
 '美国': 61,
 '全世界': 14,
 '精神': 35,
 '体制': 15,
 '

In [29]:
(
    WordCloud()
    .add(series_name="Review Tendency", data_pair=filtered_count.items(), word_size_range=[6, 66])
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Review Tendency", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
        ),
        tooltip_opts=opts.TooltipOpts(is_show=True),
    )
    
).render_notebook()
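
The filtering that feeds the word cloud (drop stopwords, drop unwanted part-of-speech tags, keep only frequent words) can also be written with `collections.Counter`. The token list, tags, stopwords and threshold below are invented for illustration; in the notebook the tokens come from `jieba.posseg` and the threshold is 10. Keeping the stopwords in a set makes each membership test constant-time.

```python
from collections import Counter

# Hypothetical (word, part-of-speech tag) pairs standing in for the jieba output.
tokens = [
    ("故事", "n"), ("很", "d"), ("好", "a"), ("故事", "n"), ("看", "v"),
    ("的", "u"), ("故事", "n"), ("好", "a"), ("经典", "a"),
]
stopwords = {"很"}
excluded_tags = {"v", "u"}
threshold = 1  # keep words that occur more than this many times

counts = Counter(
    word for word, tag in tokens
    if word not in stopwords and tag not in excluded_tags
)
filtered_count = {word: n for word, n in counts.items() if n > threshold}
print(filtered_count)
```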

### Explore rank secrets

In [30]:
explore_set = dataset.copy(deep=True)

#### Preprocess data
Transform object columns into numerical data
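
Several columns are still strings with Chinese unit suffixes, such as `127分钟`, `10263万` or `5.73亿`. A minimal sketch of converting them is shown below; the helper names are assumptions, and amounts with a currency suffix (for example `5309万美元`) are deliberately returned as `None` rather than converted, since no exchange rate is given.

```python
import datetime
import re

UNITS = {"万": 1e4, "亿": 1e8}  # 万 = ten thousand, 亿 = hundred million


def parse_amount(text):
    """Convert strings like '10263万' or '5.73亿' to a float; None if unsupported."""
    if text is None:
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([万亿]?)", text.strip())
    if match is None:  # e.g. a currency suffix such as '美元'
        return None
    number, unit = match.groups()
    return float(number) * UNITS.get(unit, 1)


def parse_minutes(text):
    # '127分钟' -> 127
    return int(text.replace("分钟", ""))


def parse_ordinal(text):
    # '2012-11-22' -> proleptic Gregorian ordinal, as date.toordinal() returns
    return datetime.date.fromisoformat(text).toordinal()


print(parse_amount("5.73亿"), parse_amount("10263万"), parse_amount("5309万美元"))
print(parse_minutes("127分钟"), parse_ordinal("2012-11-22"))
```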

In [31]:
explore_set['time in CN'] = pd.to_datetime(dataset['time in CN'], format='%Y/%m/%d').map(datetime.datetime.toordinal)
explore_set.sample()

Unnamed: 0,rating,rating number,cumulative income,title,title_en,first week income,type,area,duration,time in CN,director,actors,reviews,awards,number of prize,number of nomination,rank
27,9.1,4315,1396万,三傻大闹宝莱坞,3 idiots,742万,"['剧情', '喜剧', '冒险']",印度,171分钟,734479,拉吉库马尔·希拉尼,"['阿米尔·汗', '黄渤', '卡琳娜·卡普', '汤唯', '拉吉库马尔·希拉尼', '...","[('9', '很好看很搞笑很温馨。。。就是这个电影名字太毁了。。。好几次因为这个名字太傻我...",{'第37届日本电影学院奖': {'提名': ['Best Foreign Language...,0.0,3.0,28


In [32]:
explore_set.actors.apply(lambda x: " ".join(eval(x)))

0     徐峥 周一围 王传君 谭卓 文牧野 徐峥 周一围 王传君 谭卓 章宇 杨新鸣 王砚辉 贾晨飞...
1     蒂姆·罗宾斯 摩根·弗里曼 鲍勃·冈顿 威廉·桑德勒 弗兰克·德拉邦特 蒂姆·罗宾斯 摩根·...
2     维果·莫腾森 马赫沙拉·阿里 琳达·卡德里尼 塞巴斯蒂安·马尼斯科 彼得·法雷里 维果·莫腾...
3     蒂姆·罗斯 比尔·努恩 克兰伦斯·威廉姆斯三世 普路特·泰勒·文斯 朱塞佩·托纳多雷 蒂姆·...
4     吕艳婷 囧森瑟夫 瀚墨 陈浩 饺子 吕艳婷 囧森瑟夫 瀚墨 陈浩 绿绮 张珈铭 杨卫 李南 ...
                            ...                        
95    郎雄 吴倩莲 杨贵媚 王渝文 李安 郎雄 吴倩莲 杨贵媚 王渝文 张艾嘉 赵文瑄 陈昭荣 归...
96    约翰·赵 米切尔·拉 黛博拉·梅辛 约瑟夫·李 阿尼什·查甘蒂 约翰·赵 米切尔·拉 黛博拉...
97    休·杰克曼 克里斯蒂安·贝尔 迈克尔·凯恩 斯嘉丽·约翰逊 克里斯托弗·诺兰 休·杰克曼 克...
98    本尼迪克特·康伯巴奇 凯拉·奈特莉 马修·古迪 罗里·金奈尔 莫滕·泰杜姆 本尼迪克特·康伯...
99    莱昂纳多·迪卡普里奥 马克·鲁法洛 本·金斯利 马克斯·冯·叙多夫 马丁·斯科塞斯 莱昂纳多...
Name: actors, Length: 100, dtype: object

**The actor vocabulary is far larger than the number of movies, so it cannot support a meaningful correlation analysis. The actors feature is therefore ignored.**

In [33]:
vectorizer = CountVectorizer()
Count = vectorizer.fit_transform(explore_set.actors.apply(lambda x: " ".join(eval(x)))).todense()
Count.shape

(100, 15257)
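
To see why the actor matrix gets so wide, the sketch below runs `CountVectorizer` on three made-up actor strings. Each distinct token becomes a column. With the default token pattern, a name such as `蒂姆·罗宾斯` is likely split at the `·` into separate tokens, which would inflate the vocabulary further. The `docs` list is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents: one space-joined actor string per movie
docs = [
    "徐峥 周一围 王传君",
    "蒂姆·罗宾斯 摩根·弗里曼",
    "徐峥 摩根·弗里曼",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: movies x tokens

n_movies, n_tokens = counts.shape
print(n_movies, n_tokens)
print(sorted(vectorizer.vocabulary_))
```

Even in this toy case there are more token columns than movies, which is the same imbalance as in the full data.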

#### Drop useless features
Name-like features (actors and titles) do not help the analysis, so they are dropped here. The reviews can optionally be dropped as well.

In [34]:
explore_set.drop(["actors","title","title_en"],axis=1,inplace=True)
explore_set.sample()

Unnamed: 0,rating,rating number,cumulative income,first week income,type,area,duration,time in CN,director,reviews,awards,number of prize,number of nomination,rank
36,9.0,1095,233万美元,,"['剧情', '家庭', '历史']","中国大陆,中国香港",132分钟,728065,张艺谋,"[('9', '人为了什么而活着？估计很多人都说不清楚。但是对于福贵而说，就是为了活着而活着...","{'第47届戛纳电影节': {'获奖': ['评委会大奖', '最佳男演员', '天主教人道...",7.0,3.0,37


In [35]:
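# Multi-hot encode the genre lists
type_set = set()
type_hot = {}
# Each value is a list stored as a string; eval turns it back into a list
rows = [eval(x) for x in explore_set.type.values]
for row in rows:
    for item in row:
        type_set.add(item)
# One zero-filled column per genre, one entry per movie
for key in list(type_set):
    type_hot[key] = [0 for i in range(len(rows))]
for i,row in enumerate(rows):
    for item in row:
        type_hot[item][i]+=1

pd.DataFrame(type_hot)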

Unnamed: 0,传记,悬疑,科幻,剧情,歌舞,战争,惊悚,灾难,家庭,动画,历史,动作,冒险,爱情,犯罪,奇幻,音乐,西部,喜剧
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
96,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
97,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
98,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
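
The same multi-hot table can be built without explicit loops. This is an alternative sketch, not the notebook's method: it uses `ast.literal_eval` in place of `eval` and `pandas.get_dummies` on the exploded lists. The `raw_types` values are made up for illustration.

```python
import ast

import pandas as pd

# Illustrative rows: genre lists stored as strings, as in the CSV
raw_types = pd.Series(["['剧情', '喜剧']", "['剧情', '犯罪']", "['动画', '喜剧']"])

# literal_eval only parses Python literals, unlike eval
genre_lists = raw_types.apply(ast.literal_eval)

# One row per (movie, genre), then indicator columns summed back per movie
exploded = genre_lists.explode()
type_one_hot = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)
print(type_one_hot)
```

This version derives the number of rows from the data, so it does not depend on the dataset having exactly 100 movies.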


In [36]:
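def remove_units(x):
    # Convert a count string with an optional 万 or 亿 suffix into a float
    if x[-1] == "万":    # 万 = 10,000
        return float(x.replace("万",""))*10000
    elif x[-1] == "亿":  # 亿 = 100,000,000
        return float(x.replace("亿",""))*100000000
    else:
        return float(x)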

In [37]:
# Concatenate the genre columns, then drop type and reviews (only a few reviews are listed per movie)
explore_set_com = pd.concat([explore_set,pd.DataFrame(type_hot)],axis = 1) 
explore_set_com.drop(["type",'reviews'],axis=1,inplace = True)
# Strip the 分钟 (minutes) suffix and convert to int
explore_set_com[["duration"]] = explore_set_com["duration"].str.replace("分钟","").astype("int64")
# Expand the 万 / 亿 suffixes in the rating counts
explore_set_com[["rating number"]] = explore_set_com["rating number"].apply(remove_units)

In [38]:
# Transform income data into numerical data, expressed in units of 万 (10,000)
def transform_unit(x):
    if x[-1] == "万":
        return float(x.replace("万",""))
    elif  x[-1] == "亿":
        return float(x.replace("亿",""))*10000
    # Any other suffix falls through and returns None
tmpTestIncome = explore_set_com[["rank","first week income","cumulative income"]].dropna()
tmpTestIncome["first week income"] = tmpTestIncome["first week income"].str.replace("万","").astype("float64")
tmpTestIncome["cumulative income"] = tmpTestIncome["cumulative income"].apply(transform_unit)
tmpTestIncome

Unnamed: 0,rank,first week income,cumulative income
0,1,123812.0,318800.0
2,3,11502.0,47900.0
3,4,6351.0,14400.0
4,5,121.0,503600.0
6,7,4840.0,9675.0
7,8,2380.0,5979.0
11,12,19234.0,48800.0
12,13,26885.0,87700.0
17,18,11869.0,123000.0
18,19,8685.0,129900.0
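
`transform_unit` returns `None` for strings that do not end in 万 or 亿, such as the `233万美元` value shown in an earlier sample row. A hedged sketch of a more tolerant parser follows. `parse_income_wan` and `UNIT_IN_WAN` are hypothetical names, treating a bare number as single currency units is an assumption, and no currency conversion is attempted.

```python
import re

# Multipliers that express each unit in 万 (10,000); "" means no unit suffix
UNIT_IN_WAN = {"万": 1.0, "亿": 10000.0, "": 1.0 / 10000.0}


def parse_income_wan(text):
    """Return the amount in units of 万, or None if no leading number is found."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*([万亿]?)", str(text))
    if match is None:
        return None
    number, unit = match.groups()
    return float(number) * UNIT_IN_WAN[unit]


for raw in ["1396万", "31.88亿", "233万美元", "unknown"]:
    print(raw, parse_income_wan(raw))
```

Anything after the unit (for example a currency name) is ignored, so amounts in different currencies would still need to be separated before they are compared.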


In [39]:
# Rank vs income
(
    Line()
    .add_xaxis(tmpTestIncome["rank"])
    .add_yaxis("Cumulative income", tmpTestIncome["cumulative income"].values)
    .add_yaxis("First week income", tmpTestIncome["first week income"].values)
    .set_global_opts(title_opts=opts.TitleOpts(title="Rank vs income"))
    
).render_notebook()

**Income is hard to relate to rank: only a few movies have income data, and income varies with release period and area, so it is dropped.**

In [40]:
explore_set_com.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   rating                100 non-null    float64
 1   rating number         100 non-null    float64
 2   cumulative income     56 non-null     object 
 3   first week income     32 non-null     object 
 4   area                  100 non-null    object 
 5   duration              100 non-null    int64  
 6   time in CN            100 non-null    int64  
 7   director              100 non-null    object 
 8   awards                100 non-null    object 
 9   number of prize       94 non-null     float64
 10  number of nomination  94 non-null     float64
 11  rank                  100 non-null    int64  
 12  传记                    100 non-null    int64  
 13  悬疑                    100 non-null    int64  
 14  科幻                    100 non-null    int64  
 15  剧情                    10

In [41]:
explore_set_com.drop(["first week income","cumulative income"],axis=1,inplace = True)

explore_set_com[["number of prize"]] = explore_set_com[["number of prize"]].fillna(explore_set_com[["number of prize"]].mean()).astype("int64")
explore_set_com.sample()

Unnamed: 0,rating,rating number,area,duration,time in CN,director,awards,number of prize,number of nomination,rank,...,动画,历史,动作,冒险,爱情,犯罪,奇幻,音乐,西部,喜剧
43,9.1,1925000.0,美国,181,737173,乔·罗素,"{'第92届奥斯卡金像奖': {'提名': ['最佳视觉效果']}, '第73届英国电影学院...",10,14.0,44,...,0,0,1,1,0,0,1,0,0,0


In [42]:
# award one hot encoding
award_set = set()
award_hot = {}
rows = [eval(x) for x in explore_set_com.awards.values]
# Strip the edition prefix (e.g. 第37届) so every edition of a festival shares one key
for row in rows:
    for item in row.keys():
        award_set.add(re.sub(r"第\d+届", "", item))
for key in list(award_set):
    award_hot[key] = [0 for i in range(len(rows))]
for i,row in enumerate(rows):
    for item in row:
        key = re.sub(r"第\d+届", "", item)
        award_hot[key][i]+=1
        
pd.DataFrame(award_hot)

Unnamed: 0,美国演员工会奖,圣地亚哥影评人协会奖,安妮奖,英国电影学院奖,中国电影导演协会年度奖,洛迦诺国际电影节,底特律影评人协会奖,韩国电影大钟奖,美国服装设计工会奖,金众电影青年,...,洛杉矶影评人协会奖,温哥华影评人协会奖,华沙电影节,美国在线影评人协会奖,法国电影凯撒奖,美国电影学会奖,芝加哥影评人协会奖,柏林国际电影节,圣塞巴斯蒂安国际电影节,日本电影学院奖
0,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,1,0,0,1,0,0,1,...,0,1,0,1,0,1,1,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
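
The key step in the award encoding is removing the edition prefix, so that different editions of the same festival collapse into one column. A small sketch of that step on three award keys that appear in the sample rows of this notebook (`EDITION_PREFIX` is a hypothetical name):

```python
import re

# Matches an edition prefix such as 第37届 ("the 37th")
EDITION_PREFIX = re.compile(r"第\d+届")

award_keys = ["第37届日本电影学院奖", "第92届奥斯卡金像奖", "第47届戛纳电影节"]
festival_names = [EDITION_PREFIX.sub("", key) for key in award_keys]
print(festival_names)
# ['日本电影学院奖', '奥斯卡金像奖', '戛纳电影节']
```

The raw-string form `r"第\d+届"` avoids the invalid-escape warning that a plain `"第\d+届"` literal triggers in newer Python versions.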


**Concatenating the award columns is not a good idea because the number of features would exceed the number of samples, so the awards column is dropped.**

In [43]:
explore_set_com.drop(["awards"],axis=1,inplace = True)
explore_set_com.sample()

Unnamed: 0,rating,rating number,area,duration,time in CN,director,number of prize,number of nomination,rank,传记,...,动画,历史,动作,冒险,爱情,犯罪,奇幻,音乐,西部,喜剧
76,8.4,844.0,美国,126,729363,格斯·范·桑特,5,20.0,77,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
train = explore_set_com.copy(deep=True)

**Use this function to map each category in an object column to an integer code (1 = most frequent value)**

In [45]:
def transInt(data,column):
    tem = data[column].value_counts().to_dict()
    j = 1
    for i in tem.keys():
        tem[i]=j
        j+=1
    data[column] = data[column].map(tem)
    return tem

director_map = transInt(train,'director')
area_map= transInt(train,'area')
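
A toy run makes the behaviour of this encoding concrete. The sketch below re-implements the same idea under a hypothetical name, `frequency_rank_encode`, on a made-up column: categories are numbered in order of decreasing frequency.

```python
import pandas as pd


def frequency_rank_encode(data, column):
    """Replace each category with its frequency rank (1 = most frequent)."""
    counts = data[column].value_counts()  # sorted from most to least frequent
    mapping = {value: rank for rank, value in enumerate(counts.index, start=1)}
    data[column] = data[column].map(mapping)
    return mapping


toy = pd.DataFrame({"director": ["A", "B", "A", "C", "A", "B"]})
mapping = frequency_rank_encode(toy, "director")
print(mapping)
print(toy["director"].tolist())
```

Integer codes of this kind impose an order on the categories, so a linear model treats code 50 as "more" than code 2 even though the numbering only reflects frequency.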

In [46]:
director_map

{'大卫·芬奇': 1,
 '克里斯托弗·诺兰': 2,
 '李安': 3,
 '史蒂文·斯皮尔伯格': 4,
 '宫崎骏': 5,
 '朱塞佩·托纳多雷': 6,
 '王家卫': 7,
 '蒂姆·波顿': 8,
 '詹姆斯·卡梅隆': 9,
 '理查德·林克莱特': 10,
 '理查德·柯蒂斯': 11,
 '彼特·道格特': 12,
 '彼得·威尔': 13,
 '阿兰·葛斯彭纳': 14,
 '卢卡·瓜达尼诺': 15,
 '肯尼斯·罗纳根': 16,
 '托尼·凯耶': 17,
 '马丁·麦克唐纳': 18,
 '埃里克·布雷斯': 19,
 '中岛哲也': 20,
 '延尚昊': 21,
 '弗兰克·德拉邦特': 22,
 '杨德昌': 23,
 '奥里奥尔·保罗': 24,
 '罗曼·波兰斯基': 25,
 '姜文': 26,
 '托德·菲利普斯': 27,
 '奥利维埃·纳卡什': 28,
 '乔治·库克': 29,
 '吕克·贝松': 30,
 '新海诚': 31,
 '罗伯·莱纳': 32,
 '梅尔·吉布森': 33,
 '阿尼什·查甘蒂': 34,
 '马丁·布莱斯特': 35,
 '拉吉库马尔·希拉尼': 36,
 '弗洛里安·亨克尔·冯·多纳斯马尔克': 37,
 '克里斯托夫·巴拉蒂': 38,
 '李·昂克里奇': 39,
 '罗伯特·泽米吉斯': 40,
 '安德鲁·斯坦顿': 41,
 '赛尔乔·莱翁内': 42,
 '岩井俊二': 43,
 '盖·里奇': 44,
 '马克·赫尔曼': 45,
 '涅提·蒂瓦里': 46,
 '刘伟强': 47,
 '陈凯歌': 48,
 '格斯·范·桑特': 49,
 '比利·怀德': 50,
 '北野武': 51,
 '昆汀·塔伦蒂诺': 52,
 '朗·霍华德': 53,
 '斯蒂芬·卓博斯基': 54,
 '莫滕·泰杜姆': 55,
 '李濬益': 56,
 '让-皮埃尔·热内': 57,
 '娜丁·拉巴基': 58,
 '韦斯·安德森': 59,
 '布莱恩·辛格': 60,
 '罗伯托·贝尼尼': 61,
 '弗朗西斯·福特·科波拉': 62,
 '西德尼·吕美特': 63,
 '拜伦·霍华德': 64,
 '加布里埃莱·穆奇诺': 65,
 '贝纳尔多·贝托鲁奇': 66,
 '饺子

In [47]:
# area one hot encoding
area_set = country_movies.keys()
area_hot = {}
rows = [x.split() for x in explore_set_com.area.values]
for key in list(area_set):
    area_hot[key] = [0 for i in range(100)]
for i,row in enumerate(rows):
    tmp = row[0].replace("，",',').split(",")
    for item in tmp:
        area_hot[item][i]+=1
        
pd.DataFrame(area_hot)

Unnamed: 0,中国大陆,美国,意大利,中国香港,日本,法国,英国,印度,黎巴嫩,中国台湾,加拿大,韩国,德国,瑞士,西班牙,巴西,荷兰,奥地利,澳大利亚,波兰
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
96,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
97,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
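
For comma-separated strings like the area column, pandas offers a built-in shortcut. The sketch below is an alternative to the loop above, not the notebook's method, and the `areas` values are made up. It normalises the full-width comma first, as the loop does.

```python
import pandas as pd

# Illustrative area strings, including one with a full-width comma
areas = pd.Series(["中国大陆,中国香港", "美国", "美国，英国"])

# Normalise the separator, then split into one indicator column per area
area_one_hot = areas.str.replace("，", ",", regex=False).str.get_dummies(sep=",")
print(area_one_hot)
```

Unlike the loop, this only creates columns for areas that actually occur, so it does not rely on a separately built key set such as `country_movies`.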


#### Construct the feature matrix X and target Y

In [48]:
Y = dataset[["rank"]]
X = pd.concat([train,pd.DataFrame(area_hot)],axis = 1).drop(['rank'],axis =1)
X

Unnamed: 0,rating,rating number,area,duration,time in CN,director,number of prize,number of nomination,传记,悬疑,...,加拿大,韩国,德国,瑞士,西班牙,巴西,荷兰,奥地利,澳大利亚,波兰
0,9.6,2719000.0,9,117,736880,73,28,24.0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9.5,8384.0,1,142,728184,22,4,16.0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9.5,253000.0,1,130,737119,77,23,36.0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,9.3,91224.0,6,126,737378,6,8,5.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9.6,3966000.0,9,110,737266,67,12,10.0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,8.8,320.0,14,124,728143,3,0,5.0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,8.6,16012.0,1,102,737042,34,2,2.0,0,1,...,0,0,0,0,0,0,0,0,0,0
97,8.8,1315.0,3,130,732601,2,0,5.0,0,1,...,0,0,0,0,0,0,0,0,0,0
98,9.3,59375.0,4,114,735800,55,9,37.0,1,0,...,0,0,0,0,0,0,0,0,0,0


#### Standardization

In [49]:
def standardization(df, column, rate=1):
    # Min-max scaling: map the column linearly onto the range [0, rate]
    return (df[column]-df[column].min())*rate/(df[column].max() - df[column].min())
X["number of nomination"] = X["number of nomination"].fillna(X["number of nomination"].mean())
X["rating number"] = standardization(X,"rating number",20)
X["time in CN"] = standardization(X,"time in CN",20)
X["duration"] = standardization(X,"duration",20)
X["rating"] = standardization(X,"rating",10)
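
Despite its name, the `standardization` helper performs min-max scaling: the smallest value maps to 0, the largest to `rate`, and everything else falls linearly in between. A minimal sketch on made-up durations (`min_max_scale` is a hypothetical stand-in that takes a Series directly):

```python
import pandas as pd


def min_max_scale(series, rate=1):
    """Linearly rescale a Series onto [0, rate]."""
    span = series.max() - series.min()
    return (series - series.min()) * rate / span


durations = pd.Series([90, 120, 180])  # illustrative values in minutes
scaled = min_max_scale(durations, rate=20)
print(scaled.tolist())
```

Choosing a larger `rate` for one column than another, as the cell above does with 20 versus 10, changes how strongly regularised models penalise that column's coefficient.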

In [50]:
X.sample()

Unnamed: 0,rating,rating number,area,duration,time in CN,director,number of prize,number of nomination,传记,悬疑,...,加拿大,韩国,德国,瑞士,西班牙,巴西,荷兰,奥地利,澳大利亚,波兰
62,4.444444,1.290039,9,3.797468,18.372531,69,0,2.0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Machine learning

In [51]:
# ridge
alphas_alt = np.arange(5,20,0.01)
# lasso
alphas2 = [5e-05, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]
# elasticnet
e_alphas = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007]
e_l1ratio = np.arange(0.8,1, 0.01)
ridge = RidgeCV(alphas=alphas_alt)
# max_iter is passed as an int because recent scikit-learn releases reject floats such as 1e7
lasso = LassoCV(max_iter=int(1e7), alphas=alphas2, random_state=4)
elasticnet = ElasticNetCV(max_iter=int(1e7), alphas=e_alphas, l1_ratio=e_l1ratio)
svr = SVR(C=20, epsilon=0.008, gamma=0.0003)

In [52]:
# score() returns R^2, computed here on the same data that was used for fitting
elastic = elasticnet.fit(X,Y)
elastic_score = elastic.score(X,Y)
lasso = lasso.fit(X, Y)
lasso_score = lasso.score(X,Y)
ridge = ridge.fit(X, Y)
ridge_score = ridge.score(X,Y)
svr = svr.fit(X, Y)
svr_score = svr.score(X,Y)

In [53]:
score = [elastic_score, lasso_score, ridge_score, 
              svr_score]
pd.DataFrame({
    'Model': ['elastic_net', 'lasso', 'ridge', 
              'svr'],
    'Score': score}).sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
1,lasso,0.582309
0,elastic_net,0.582202
2,ridge,0.370701
3,svr,0.147054
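
The scores above are R² values from `score(X, Y)` on the same rows used for fitting, so they measure fit rather than predictive power. A hedged sketch of a held-out check follows. It uses synthetic data (`X_demo`, `y_demo`) purely to show the mechanics and says nothing about how the movie models would score.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 100 samples, 5 features, known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

# Hold out 20% of the rows that the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```

With only 100 movies and dozens of one-hot columns, a gap between the training and held-out scores would be a sign of overfitting.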


#### Choose the better algorithm and check its parameters

In [54]:
result = pd.DataFrame(columns = X.columns)
result.loc['lasso',:] = lasso.coef_
result.loc["elastic",:] = elastic.coef_
result.loc["ridge",:] = ridge.coef_

In [55]:
result

Unnamed: 0,rating,rating number,area,duration,time in CN,director,number of prize,number of nomination,传记,悬疑,...,加拿大,韩国,德国,瑞士,西班牙,巴西,荷兰,奥地利,澳大利亚,波兰
lasso,-7.224766,-0.736531,0.41942,1.543406,-1.173904,-0.040037,-0.386362,-0.256407,7.926529,9.870749,...,5.347394,15.398166,-17.740075,18.096262,-3.705427,21.410122,0.0,-19.654198,79.69764,36.915889
elastic,-7.187103,-0.72241,0.414328,1.505391,-1.164856,-0.039782,-0.386595,-0.254841,8.367025,9.957613,...,5.090626,15.757778,-17.087817,17.265586,-3.668091,10.404817,9.88858,-18.438341,77.241771,35.203674
ridge,-3.97164,-1.266439,0.178557,-0.291446,-0.942337,-0.034113,-0.544252,-0.027983,5.862276,3.274897,...,1.09098,3.139274,-0.472988,-0.302313,-0.532916,-0.750217,-0.750217,-0.114717,2.137118,1.476423
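
Because rank 1 is the best position, a negative coefficient pushes a movie toward the top of the list. A signed sort puts those coefficients last, so ordering by absolute value can be easier to read. The sketch below uses four lasso coefficients copied from the table above; `by_magnitude` is a hypothetical name.

```python
import pandas as pd

# Four lasso coefficients copied from the result table
lasso_coef = pd.Series(
    {
        "rating": -7.224766,
        "rating number": -0.736531,
        "area": 0.41942,
        "duration": 1.543406,
    }
)

# Order by magnitude so large negative coefficients are not buried at the bottom
order = lasso_coef.abs().sort_values(ascending=False).index
by_magnitude = lasso_coef.reindex(order)
print(by_magnitude)
```

The features are on different scales (scaled columns, integer codes, and 0/1 indicators), so the magnitudes are only a rough guide to importance.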


In [56]:
result.loc["lasso",:].sort_values(ascending=False)

澳大利亚                     79.69764
波兰                      36.915889
英国                      28.423856
家庭                      21.665379
巴西                      21.410122
瑞士                      18.096262
动画                      15.440009
韩国                      15.398166
中国香港                     11.48791
悬疑                       9.870749
喜剧                       8.915103
传记                       7.926529
动作                       6.037231
加拿大                      5.347394
奇幻                       3.755453
duration                 1.543406
战争                       1.440985
area                      0.41942
荷兰                            0.0
director                -0.040037
剧情                      -0.157782
number of nomination    -0.256407
number of prize         -0.386362
rating number           -0.736531
time in CN              -1.173904
灾难                       -1.38288
西班牙                     -3.705427
惊悚                      -4.681422
犯罪                      -5.473725
爱情            

In [57]:
result.loc["elastic",:].sort_values(ascending=False)

澳大利亚                    77.241771
波兰                      35.203674
英国                      28.298837
家庭                      21.591455
瑞士                      17.265586
韩国                      15.757778
动画                      15.234323
中国香港                    11.411328
巴西                      10.404817
悬疑                       9.957613
荷兰                        9.88858
喜剧                       8.825224
传记                       8.367025
动作                       5.802905
加拿大                      5.090626
奇幻                       3.987026
战争                       1.644514
duration                 1.505391
area                     0.414328
director                -0.039782
剧情                      -0.098751
number of nomination    -0.254841
number of prize         -0.386595
rating number            -0.72241
灾难                      -1.117618
time in CN              -1.164856
西班牙                     -3.668091
惊悚                       -4.57339
犯罪                      -5.244334
爱情            

In [58]:
result.loc["ridge",:].sort_values(ascending=False)

英国                      8.314847
传记                      5.862276
家庭                      3.348184
悬疑                      3.274897
韩国                      3.139274
美国                      2.846058
战争                      2.462613
澳大利亚                    2.137118
惊悚                      1.860276
奇幻                      1.637404
波兰                      1.476423
加拿大                      1.09098
灾难                      0.693453
动画                      0.654769
喜剧                       0.54538
音乐                      0.457658
中国香港                    0.414433
area                    0.178557
历史                      0.038324
number of nomination   -0.027983
director               -0.034113
中国台湾                   -0.074935
奥地利                    -0.114717
动作                     -0.180939
冒险                     -0.225395
duration               -0.291446
瑞士                     -0.302313
西部                     -0.424771
德国                     -0.472988
西班牙                    -0.532916
number of 