<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#项目概述" data-toc-modified-id="项目概述-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Overview</a></span></li><li><span><a href="#数据抓取" data-toc-modified-id="数据抓取-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Scraping</a></span><ul class="toc-item"><li><span><a href="#使用到的库" data-toc-modified-id="使用到的库-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Libraries Used</a></span></li><li><span><a href="#下载网页" data-toc-modified-id="下载网页-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Downloading Pages</a></span></li><li><span><a href="#处理网页信息" data-toc-modified-id="处理网页信息-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Processing Page Information</a></span><ul class="toc-item"><li><span><a href="#处理数字" data-toc-modified-id="处理数字-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Processing Numbers</a></span></li><li><span><a href="#处理日期" data-toc-modified-id="处理日期-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Processing Dates</a></span></li><li><span><a href="#估算番剧长度" data-toc-modified-id="估算番剧长度-2.3.3"><span class="toc-item-num">2.3.3&nbsp;&nbsp;</span>Estimating Bangumi Length</a></span></li></ul></li><li><span><a href="#汇总信息" data-toc-modified-id="汇总信息-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Aggregating Information</a></span><ul class="toc-item"><li><span><a href="#获取合法的mdid" data-toc-modified-id="获取合法的mdid-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>Finding Valid mdids</a></span></li><li><span><a href="#整合数据" data-toc-modified-id="整合数据-2.4.2"><span class="toc-item-num">2.4.2&nbsp;&nbsp;</span>Consolidating the Data</a></span></li></ul></li><li><span><a href="#代码运行" data-toc-modified-id="代码运行-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Running the Code</a></span></li></ul></li><li><span><a href="#回归分析" data-toc-modified-id="回归分析-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Regression Analysis</a></span></li><li><span><a href="#可视化" data-toc-modified-id="可视化-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Visualization</a></span></li></ul></div>

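# Project Overview

> bilibili, in full 哔哩哔哩弹幕网 (the Bilibili danmaku site), also known as 哔哩哔哩 or simply "B站", is a danmaku (bullet-comment) video-sharing website headquartered in Shanghai, mainland China, that started out with ACG-related content.

As the final paper for a data analysis course, this project collects Bilibili bangumi data from scratch, including variables such as `play count`, `followers`, `total danmaku`, `score`, and `number of ratings`.\
The data are then analyzed and visualized. The analysis mainly fits a regression model to explore how play count relates to the other variables; the visualization mainly plots the number of bangumi by premiere date along a timeline, to see in which periods more titles were released.

**Here "bangumi" covers not only anime but also films, TV series, and so on: everything listed under Bilibili's media section is included in the analysis.**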

# Data Scraping

## Libraries Used

In [1]:
import re
import time
from datetime import datetime

import pandas as pd

import requests
import parsel

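## Downloading Pages

Given a bangumi's media id, the page is downloaded with the requests library (everything this project needs is in the static content of the page). Note that the encoding is utf-8.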

In [17]:
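def gethtml(mdid):
    '''
    Fetch the HTML text of a Bilibili bangumi detail page
    input  mdid [int]
    output the detail page of that bangumi [str], or 'invalid' if the page does not exist
    '''

    url = f'https://www.bilibili.com/bangumi/media/md{mdid}'
    # A timeout keeps one stalled request from blocking the whole crawl
    response = requests.get(url, timeout=10)
    response.encoding = 'utf8'
    html = response.text
    # Invalid URLs under the Bilibili domain redirect to an error page containing the string below
    if 'Σ(oﾟдﾟoﾉ) 无法找到该页面~' in html:
        html = 'invalid'
    return html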

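## Processing Page Information

### Processing Numbers

Bilibili's play counts, follower counts, and danmaku counts often end in 万 (ten thousand) or 亿 (hundred million), so a small function converts them to plain numbers.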

In [18]:
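def eval_playdata(s):
    '''
    Convert play data such as "xxx万" or "xxx亿" into a number
    input  s [str]
    output the integer value the string expresses [int], or 'NA' if it cannot be parsed
    '''

    d = {'万': 1e4, '亿': 1e8}
    s = s.strip()
    try:
        # Trailing 万 or 亿: multiply the numeric part by the unit
        if s and s[-1] in d:
            ans = int(round(float(s[:-1]) * d[s[-1]]))
        else:
            # No Chinese unit, a plain number (float() is used instead of
            # eval() so that scraped text is never executed as code)
            ans = int(float(s))
    except ValueError:
        ans = 'NA'
    return ans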
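
The conversion can be checked offline on a few sample strings. The block below is a minimal sketch rather than the notebook's own function: `parse_count` and `UNITS` are hypothetical names, it returns `None` instead of `'NA'`, and it uses `decimal.Decimal` so that products such as 76.4 × 10000 are exact.

```python
from decimal import Decimal, InvalidOperation

# Multipliers for the Chinese magnitude suffixes used on the page
UNITS = {'万': 10**4, '亿': 10**8}


def parse_count(text):
    """Hypothetical helper: turn '76.4万' or '2760' into an int, or None."""
    text = text.strip()
    multiplier = 1
    if text and text[-1] in UNITS:
        multiplier = UNITS[text[-1]]
        text = text[:-1]
    try:
        # Decimal avoids binary floating-point rounding in e.g. 76.4 * 10000
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        # Not a finite number (empty string, '-', 'NaN', ...)
        return None


for raw in ['2760', '76.4万', '1.2亿', '-']:
    print(raw, '->', parse_count(raw))
```

The sample run in this notebook is consistent with this reading: a play count of 764000 corresponds to a page string of the form `76.4万`.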

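### Processing Dates

Bilibili describes premiere dates in several different ways, so they need special handling.\
Roughly the following cases occur:
- `%Y年%m月%d日开播` (year, month, day + "premiered"), e.g. [md7](https://www.bilibili.com/bangumi/media/md7)
- `%Y年%m月开播` (year, month + "premiered"), e.g. [md9892](https://www.bilibili.com/bangumi/media/md9892)
- `%Y年开播` (year + "premiered"), e.g. [md9352](https://www.bilibili.com/bangumi/media/md9352)
- `%Y年%m月%d日上映` (year, month, day + "released"), e.g. [md10086](https://www.bilibili.com/bangumi/media/md10086)
- `%Y开播` (bare year + "premiered"), e.g. [md27372](https://www.bilibili.com/bangumi/media/md27372)

Dates that match none of these forms get no special error handling: `strptime` raises an exception, and the record is simply excluded from the data.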

In [19]:
def eval_broadcast_date(s):
    '''
    Convert the premiere date string into a datetime
    input  s [str]
    output the date as a datetime [datetime]
    '''

    # s[:-2] drops the two-character suffix (开播 or 上映)
    if '日' in s:
        ans = datetime.strptime(s[:-2], '%Y年%m月%d日')
    elif '月' in s:
        ans = datetime.strptime(s[:-2], '%Y年%m月')
    elif '年' in s:
        ans = datetime.strptime(s[:-2], '%Y年')
    else:
        ans = datetime.strptime(s[:-2], '%Y')

    return ans
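
The five formats listed above can also be covered by a single regular expression. This is a sketch under assumptions, not the notebook's method: `parse_broadcast_date` and `DATE_RE` are hypothetical names, and unknown formats return `None` instead of raising.

```python
import re
from datetime import datetime

# Year, optional month, optional day, then the suffix 开播 or 上映
DATE_RE = re.compile(
    r'^(\d{4})年?(?:(\d{1,2})月)?(?:(\d{1,2})日)?(?:开播|上映)$')


def parse_broadcast_date(text):
    """Hypothetical helper: return a datetime, or None for unknown formats."""
    match = DATE_RE.match(text.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        # A missing month or day defaults to 1, as strptime does
        return datetime(int(year), int(month or 1), int(day or 1))
    except ValueError:
        # Out-of-range values such as month 13
        return None


for raw in ['2013年1月3日开播', '2016年7月开播', '2019开播', '敬请期待']:
    print(raw, '->', parse_broadcast_date(raw))
```

Returning `None` keeps the "exclude what does not fit" policy explicit, at the cost of no longer reporting which string failed.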

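### Estimating Bangumi Length

Films state their length in minutes directly; for finished titles the length is estimated at 25 minutes per 话 (anime episode) and 45 minutes per 集 (series episode). If the bangumi is still airing, `NA` is returned.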

In [20]:
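def eval_length(over):
    '''
    Estimate the length of a bangumi
    input  over [str], e.g. '已完结，全3集' or '123分钟'
    output total length in minutes [int], or 'NA' if it cannot be estimated
    '''
    # Remove spaces
    over = over.replace(' ','')
    ### Still airing
    if '更新' in over:
        return 'NA'
    ### Finished
    # Film: the string ends with 分钟 (minutes)
    elif over[-2:]=='分钟':
        length = int(over[:-2])
    # Anime: the string ends with 话; over[5:-1] skips the prefix '已完结,全'
    elif over[-1]=='话':
        # Estimate 25 minutes per episode
        length = int(over[5:-1])*25
    # TV series: the string ends with 集
    elif over[-1]=='集':
        # Estimate 45 minutes per episode
        length = int(over[5:-1])*45
    else:
        length = 'NA'
    return length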
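
The fixed slice `over[5:-1]` depends on the prefix being exactly five characters long. As an illustration only, the sketch below pulls the count out with a regular expression instead; `estimate_minutes`, `LENGTH_RE`, and `MINUTES_PER_UNIT` are hypothetical names, and the per-episode minutes are the same estimates used above.

```python
import re

# Assumed minutes per unit: 25 per 话, 45 per 集, films already in minutes
MINUTES_PER_UNIT = {'话': 25, '集': 45, '分钟': 1}
# A number followed by one of the known units at the end of the string
LENGTH_RE = re.compile(r'(\d+)(分钟|话|集)$')


def estimate_minutes(status):
    """Hypothetical helper: estimated length in minutes, or None."""
    status = status.replace(' ', '')
    # Titles that are still airing have no final length yet
    if '更新' in status:
        return None
    match = LENGTH_RE.search(status)
    if match is None:
        return None
    count, unit = match.groups()
    return int(count) * MINUTES_PER_UNIT[unit]


for raw in ['已完结, 全13话', '已完结，全3集', '123分钟', '连载中, 每周一、周五更新1集']:
    print(raw, '->', estimate_minutes(raw))
```

For `'已完结, 全13话'` this gives 13 × 25 = 325 minutes, which agrees with the `length` value stored for md7 in the regression section.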

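## Aggregating Information

This part mainly uses the re library with regular expressions, together with the CSS selectors of the parsel library, to pick out elements of the page.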

In [21]:
def getinfo(html):
    '''
    Extract the detailed information of a bangumi
    input  html [str]
    output the extracted bangumi information [dict]
    '''

    assert html != 'invalid'
    selector = parsel.Selector(html)
    # Title
    title = selector.css('.media-info-title-t::text').get()
    # Score
    score = selector.css('.media-info-score-content::text').get()
    # Number of ratings
    review_times = selector.css('.media-info-review-times::text').get()
    # Tags
    tags = selector.css('.media-tag::text').getall()
    # Play data
    media_info_label = selector.css('.media-info-label::text').getall()
    media_info_play_data = selector.css('em::text').getall()
    play_data = dict(zip(media_info_label, media_info_play_data))
    # Whether the title belongs to a series
    series = '系列' in media_info_label[1]
    # Play count, followers, danmaku count
    play_times, followers, bullet_screen = media_info_play_data
    # Premiere date (and whether the title has finished)
    ## These two fields could not be retrieved with CSS selectors, so a regular expression is used
    pattern1 = '<div class="media-info-time"><span>.*</span> <span>.*</span></div>'
    prefix_len = len('<div class="media-info-time"><span>')
    suffix_len = len('</span></div>')
    play_season_str = re.search(pattern1, html)[0]
    broadcast_date, _, over = play_season_str.partition('</span> <span>')
    broadcast_date, over = broadcast_date[prefix_len:], over[:-suffix_len]
    # Film or theatrical release: a length given in minutes usually means a film or an anime film
    film = '分钟' in over
    # Bangumi length
    length = eval_length(over)
    # Whether the title is for premium members only
    pattern2 = '<div class="btn-pay-wrapper vip-only">'
    vip_only = bool(re.search(pattern2, html))
    # Collect everything in one dict; keep as much information as possible at this stage,
    # hence the many variables with a str suffix, which can be dropped later
    info = dict(
        title=title,
        # There may be no score; float() avoids running eval() on scraped text
        score=float(score) if score else 'NA',
        review_times=int(review_times[:-2]) if review_times else 'NA',
        # Tags joined with commas
        tags=','.join(tags),
        # play_data=str(play_data),
        series=series,
        film=film,
        # Convert xxx万 / xxx亿 into numbers
        play_times=eval_playdata(play_times),
        followers=eval_playdata(followers),
        bullet_screen=eval_playdata(bullet_screen),
        # Date as a string
        broadcast_date_str=broadcast_date,
        # Date converted to a datetime
        broadcast_date=eval_broadcast_date(broadcast_date),
        over_str=over,
        over=over[:3] == '已完结',
        length=length,
        vip_only=vip_only)
    return info

In [22]:
# Example run
getinfo(gethtml(28226002))

{'title': '旷野青春',
 'score': 9.2,
 'review_times': 133,
 'tags': '人文,自然',
 'series': False,
 'film': False,
 'play_times': 764000,
 'followers': 20000,
 'bullet_screen': 2760,
 'broadcast_date_str': '2019年12月23日开播',
 'broadcast_date': datetime.datetime(2019, 12, 23, 0, 0),
 'over_str': '连载中, 每周一、周五更新1集',
 'over': False,
 'length': 'NA',
 'vip_only': False}
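
`getinfo` locates the `media-info-time` block with a regular expression and then slices off the surrounding tags by length. The same two fields can be read with capture groups; the sketch below runs on a hand-written snippet, not a real download. `TIME_RE`, `extract_time_fields`, and `sample_html` are hypothetical names, and the snippet only imitates the markup implied by `pattern1`.

```python
import re

# Non-greedy groups capture the text of the two <span> elements
TIME_RE = re.compile(
    r'<div class="media-info-time"><span>(.*?)</span> <span>(.*?)</span></div>')

# Hand-written sample that imitates the structure matched by pattern1
sample_html = (
    '<div class="media-info-time">'
    '<span>2013年1月3日开播</span> <span>已完结, 全13话</span></div>'
)


def extract_time_fields(html):
    """Hypothetical helper: (premiere date, status) strings, or None."""
    match = TIME_RE.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(extract_time_fields(sample_html))
```

Capture groups avoid the `prefix_len` and `suffix_len` bookkeeping, and the non-greedy `.*?` stops at the first closing tag rather than the last one on the line.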

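### Finding Valid mdids

Bilibili's media ids are numbered sequentially, so the approach here is to walk through them in order starting from 0; each call to the function probes n more ids beyond those already visited. (Bilibili's numbering has currently reached roughly 20 million, so the full data set is very large; this project collected only a little over 14,000 records.)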

In [23]:
def moremdid(n):
    '''
    Read the mdids stored in the local mdidlist.txt file, then probe for more candidate mdids
    input  n [int]
    output all known valid mdids [list]
    '''

    # Read the mdid list from the file (one id per line)
    with open('./data/mdidlist.txt', mode='r') as f:
        mdid = [int(line) for line in f if line.strip()]
    start = mdid[-1] + 1
    for i in range(start, start + n):
        # If the page does not exist, the 404 response evaluates to False
        if requests.get(f'https://www.bilibili.com/bangumi/media/md{i}'):
            mdid.append(i)
    # Overwrite the source file with the updated mdid list
    with open('./data/mdidlist.txt', mode='w') as f:
        f.writelines(f'{x}\n' for x in mdid)
    print(f'completed, length of mdid now is {len(mdid)}')
    return mdid
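
The read, probe, and rewrite cycle can be tried without touching the network. The following is a sketch under assumptions: `load_ids`, `save_ids`, and `extend_ids` are hypothetical names, and the `exists` callback stands in for the HTTP check so that the demo runs on a temporary file.

```python
import os
import tempfile


def load_ids(path):
    """Read one integer id per line, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [int(line) for line in f if line.strip()]


def save_ids(path, ids):
    """Overwrite the file with one id per line."""
    with open(path, mode='w', encoding='utf-8') as f:
        f.writelines(f'{i}\n' for i in ids)


def extend_ids(path, n, exists):
    """Probe the n ids after the last known one; keep those for which exists(i) is true."""
    ids = load_ids(path)
    start = ids[-1] + 1
    ids.extend(i for i in range(start, start + n) if exists(i))
    save_ids(path, ids)
    return ids


# Demo with a temporary file and a fake predicate instead of real HTTP requests
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'mdidlist.txt')
    save_ids(demo_path, [7, 9])
    result = extend_ids(demo_path, 5, exists=lambda i: i % 2 == 0)
    reloaded = load_ids(demo_path)

print(result)
```

Because probing resumes from the last valid id, a trailing run of invalid ids is checked again on the next call; storing the last probed id separately would avoid that.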

### Consolidating the Data

pandas is used to gather the records prepared above into a single table.

In [24]:
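def generate_df(mdid):
    '''
    Fetch the information of every bangumi in mdid and put it into one DataFrame
    input  mdid [list]
    output the combined information of all bangumi [pandas.DataFrame]
    '''
    l = []
    for i in mdid:
        # Do not hit the site too quickly
        time.sleep(0.1)
        # Catch errors with try so that a single run surfaces as many problems as possible
        try:
            html = gethtml(i)
            info = getinfo(html)
            info['mdid'] = i
            l.append(info)
        except Exception as e:
            # As long as there are not many exceptions, simply drop the records that failed
            print(repr(e))
            print(f'{i} failed!')
    return pd.DataFrame(l)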
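
The loop above prints failures and moves on. If the failed ids are needed later, for example to retry them, they can be returned alongside the rows. This is a minimal sketch, not the notebook's code: `collect` and `fake_fetch` are hypothetical, and `fake_fetch` replaces the real `gethtml` plus `getinfo` call so the example runs offline.

```python
import time


def collect(ids, fetch, delay=0.1):
    """Call fetch(i) for each id; return (rows, failed) instead of only printing."""
    rows, failed = [], []
    for i in ids:
        # Throttle the requests
        time.sleep(delay)
        try:
            row = dict(fetch(i))
            row['mdid'] = i
            rows.append(row)
        except Exception as exc:
            # Remember which id failed and why
            failed.append((i, repr(exc)))
    return rows, failed


def fake_fetch(i):
    """Stand-in for gethtml + getinfo: id 2 simulates an unparseable page."""
    if i == 2:
        raise ValueError('unparseable date')
    return {'title': f'item {i}'}


rows, failed = collect([1, 2, 3], fake_fetch, delay=0)
print(rows)
print(failed)
```

The `rows` list has the same shape as `l` in `generate_df`, so it could be passed to `pd.DataFrame` unchanged.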

## Running the Code

In [25]:
# Read the local mdidlist file (n=0 probes no new ids)
mdid = moremdid(0)

completed, length of mdid now is 14685


In [None]:
# Fetch the information of every bangumi in mdid and gather it into one DataFrame
df = generate_df(mdid)

In [26]:
# Save to a file for later processing
# (the encoding argument of to_excel was removed in pandas 2.0, so it is omitted)
df.to_excel('./data/BLBLdata.xlsx', index=False)

# Regression Analysis

In [29]:
# Reset the namespace
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [61]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

In [29]:
# Load the data
df = pd.read_excel('./data/BLBLdata.xlsx')
df.shape

(14267, 16)

In [33]:
df.loc[0]

title                                 漫研部
score                                 8.7
review_times                        224.0
tags                          搞笑,泡面,日常,校园
series                               True
film                                False
play_times                      1079000.0
followers                         92000.0
bullet_screen                      5283.0
broadcast_date_str            2013年1月3日开播
broadcast_date        2013-01-03 00:00:00
over_str                        已完结, 全13话
over                                 True
length                              325.0
vip_only                            False
mdid                                    7
Name: 0, dtype: object

In [64]:
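# Data used in the regression
variables = ['series','film','play_times','followers','bullet_screen','over','vip_only']
# .copy() makes df_ols independent of df, so later changes do not trigger SettingWithCopyWarning
df_ols = df.loc[ : , variables].copy()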
df_ols.shape

(14267, 7)

In [65]:
# Drop rows that contain missing values
df_ols = df_ols.dropna()
df_ols.shape

(13246, 7)

In [67]:
df_ols.head()

| | series | film | play_times | followers | bullet_screen | over | vip_only |
|---|---|---|---|---|---|---|---|
| 0 | True | False | 1079000.0 | 92000.0 | 5283.0 | True | False |
| 1 | True | False | 624000.0 | 92000.0 | 2910.0 | True | False |
| 2 | False | False | 288000.0 | 43000.0 | 16000.0 | True | False |
| 3 | True | False | 1693000.0 | 194000.0 | 7656.0 | True | False |
| 4 | True | False | 1684000.0 | 194000.0 | 5774.0 | True | False |


In [91]:
df_ols.corr()

| | series | film | play_times | followers | bullet_screen | over | vip_only |
|---|---|---|---|---|---|---|---|
| series | 1.0 | -0.19422 | 0.145946 | 0.348945 | 0.038988 | 0.180284 | 0.117681 |
| film | -0.19422 | 1.0 | -0.091739 | -0.136331 | -0.020466 | -0.918725 | 0.320583 |
| play_times | 0.145946 | -0.091739 | 1.0 | 0.575098 | 0.263467 | 0.061395 | 0.116256 |
| followers | 0.348945 | -0.136331 | 0.575098 | 1.0 | 0.13158 | 0.125232 | 0.200782 |
| bullet_screen | 0.038988 | -0.020466 | 0.263467 | 0.13158 | 1.0 | 0.001394 | 0.028047 |
| over | 0.180284 | -0.918725 | 0.061395 | 0.125232 | 0.001394 | 1.0 | -0.317504 |
| vip_only | 0.117681 | 0.320583 | 0.116256 | 0.200782 | 0.028047 | -0.317504 | 1.0 |


In [88]:
pd.crosstab(df_ols['film'],df_ols['over'])

| film \ over | False | True |
|---|---|---|
| False | 457 | 9245 |
| True | 3544 | 0 |


In [104]:
y, X = dmatrices(
    'play_times ~  film  + bullet_screen + over +vip_only',
    data=df_ols,
    return_type='dataframe')

In [105]:
model = sm.OLS(y,X).fit()
model.summary()

OLS Regression Results

| | | | |
|---|---|---|---|
| Dep. Variable: | play_times | R-squared: | 0.099 |
| Model: | OLS | Adj. R-squared: | 0.099 |
| Method: | Least Squares | F-statistic: | 364.3 |
| Date: | Tue, 21 Dec 2021 | Prob (F-statistic): | 4.33e-298 |
| Time: | 16:56:35 | Log-Likelihood: | -244230.0 |
| No. Observations: | 13246 | AIC: | 488500.0 |
| Df Residuals: | 13241 | BIC: | 488500.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.642e+06 | 1.16e+06 | 8.279 | 0.000 | 7.36e+06 | 1.19e+07 |
| film[T.True] | -1.306e+07 | 1.23e+06 | -10.632 | 0.000 | -1.55e+07 | -1.07e+07 |
| over[T.True] | -5.434e+06 | 1.18e+06 | -4.594 | 0.000 | -7.75e+06 | -3.12e+06 |
| vip_only[T.True] | 8.726e+06 | 5.08e+05 | 17.190 | 0.000 | 7.73e+06 | 9.72e+06 |
| bullet_screen | 2.4202 | 0.078 | 30.840 | 0.000 | 2.266 | 2.574 |

| | | | |
|---|---|---|---|
| Omnibus: | 24910.18 | Durbin-Watson: | 1.817 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 51716056.123 |
| Skew: | 14.337 | Prob(JB): | 0.0 |
| Kurtosis: | 307.763 | Cond. No. | 25800000.0 |
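
For reference, the formula `play_times ~ film + bullet_screen + over + vip_only` corresponds to the linear model below, where the three boolean variables enter as 0/1 indicators. The symbols $\beta_0,\dots,\beta_4$ and $\varepsilon_i$ are notation introduced here, and the fitted equation simply restates the `coef` column of the summary above.

$$
\text{play\_times}_i = \beta_0 + \beta_1\,\text{film}_i + \beta_2\,\text{over}_i + \beta_3\,\text{vip\_only}_i + \beta_4\,\text{bullet\_screen}_i + \varepsilon_i
$$

$$
\widehat{\text{play\_times}}_i \approx 9.642\times10^{6} - 1.306\times10^{7}\,\text{film}_i - 5.434\times10^{6}\,\text{over}_i + 8.726\times10^{6}\,\text{vip\_only}_i + 2.4202\,\text{bullet\_screen}_i
$$

With $R^2 = 0.099$, these four variables account for roughly 10% of the variance in play count.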


# Visualization