为能更科学地对艺术品进行定价，学者们开始运用计量经济模型对艺术品市场开展研究，通过衡量市场规模和追踪走势变化实现直观的艺术指数化。从1985年苏富比(微博)拍卖公司首次发布“苏富比艺术市场综合指数”到现如今的三十余年里，艺术指数已经从艺术市场的摘要性综合指标发展成为投资与资产配置的重要参照指标，国际上普遍适用的艺术指数和估值系统主要有(1)Mei Moses Art Index、(2)Art Market Research、(3)ArtPrice Indicator、以及(3)Artnet。第一，Mei Moses Art Index即梅摩艺术品指数，由时任纽约大学史登商学院的梅建平教授与Michael Moses教授共同创立，通过追踪每一件艺术品历年来的多次拍卖成交记录，比较直观地反映出一件艺术品的流传脉络和价格走势。梅摩指数的资料收录范围包括1810年至今的28000多对交易记录数据，类别涵盖了印象派和现代派、古典画派、美国画派、英国画派、拉丁画派、当代艺术、海外中国传统艺术、以及中国当代油画等八大类指数。梅摩艺术品指数其特别之处在于只使用重复出售作品的数据，从而反映价值的实际增幅而非总体平均，因此被认为是最接近真实价格的指数。摩根士丹利已将梅摩指数确定为世界十大生产指数之一，此外，摩根大通、美林、UBS、花旗银行、德意志银行等大型金融机构都在引用这个指数作为艺术品投资参考。第二，位于伦敦的Art Market Research也在使用几大拍卖行数据的基础上，着重比较各品项投资报酬率，目前已建立了超过10万个指数，用户包括佳士得、苏富比、英国税务局、及纽约联邦储备银行等艺术与金融机构。第三，诞生于法国的艺术品价格网Artprice则以网罗全球拍卖市场记录而著称，提供涉及超过405000名艺术家的2500多万个艺术品拍卖价格和指数(ArtPrice Indicator)，并根据某件艺术品历次拍卖成交价、艺术品本身的特点以及艺术品的价格指数，通过该艺术品类型特有的算法计算而得出的该件艺术品的不同时期的价格的折现值。第四，国际著名艺术品网站Artnet也致力于建立大型艺术品交易数据库，即Artnet Price Database，以提升艺术品市场的价格透明度，该数据库已被包括各大拍卖行、艺术商、博物馆及保险公司之内的社会各界广泛应用。 


In [1]:
import csv
import numpy as np
import pandas as pd

file_name = 'artwork_transactions_sample.csv'
artwork_transactions = pd.read_csv(file_name)
artwork_transactions.head()

Unnamed: 0,id,trans_type,auction_info_link,display_price,calc_price,eval_price,eval_price_min,eval_price_max,eval_price_curr,collector,...,trade_curr,origin_url,created_at,updated_at,transaction_date,comments_count,likes_count,lot,auction_id,company_id
0,1310631,auction,,流拍,,"USD 1,600-2,400",,,,,...,,http://auction.artron.net/paimai-art0029920063/,2014-01-18 00:00:00,2014-01-18 00:00:00,2014-01-17 16:00:00,0,0,63,1331,177
1,1476875,auction,,流拍,,"RMB 80,000-120,000",,,,,...,,http://auction.artron.net/paimai-art5017610140/,2012-05-04 00:00:00,2012-05-04 00:00:00,2012-05-03 16:00:00,0,0,140,2344,3
2,1698094,auction,,流拍,,"HKD 3,000-5,000",,,,,...,,http://auction.artron.net/paimai-art64950475/,2010-05-27 00:00:00,2010-05-27 00:00:00,2010-05-26 16:00:00,0,0,475,3310,101
3,1085516,auction,,"RMB 22,600",22600.0,"RMB 20,000-30,000",,,,,...,,http://auction.artron.net/paimai-art5036120262/,2013-07-09 00:00:00,2013-07-09 00:00:00,2013-07-08 16:00:00,0,0,262,1824,316
4,730475,auction,,流拍,,"RMB 22,000-32,000",,,,,...,,http://auction.artron.net/paimai-art5027780224/,2012-12-16 00:00:00,2012-12-16 00:00:00,2012-12-15 16:00:00,0,0,224,1219,238


In [2]:
print (len(artwork_transactions))
print (artwork_transactions.info())

10000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 29 columns):
id                      10000 non-null int64
trans_type              10000 non-null object
auction_info_link       0 non-null float64
display_price           10000 non-null object
calc_price              5192 non-null float64
eval_price              9098 non-null object
eval_price_min          3 non-null float64
eval_price_max          3 non-null float64
eval_price_curr         3 non-null object
collector               0 non-null float64
sell_agency             0 non-null float64
search_string           0 non-null float64
trans_category          9974 non-null object
trans_art_type          10000 non-null object
artwork_auction_name    10000 non-null object
image_url               0 non-null float64
icon_img_url            0 non-null float64
info_update_time        9997 non-null object
trade_channel_id        10000 non-null int64
trade_curr              1 non-null object
ori

In [3]:
def find_out_lables(column):
    lables = []
    for value in column:
        if value not in lables:
            lables.append(value)
    print(lables)
    return lables


find_out_lables(artwork_transactions['trans_category'])
find_out_lables(artwork_transactions['trans_art_type'])


['综合媒材', '绘画', '书法', '油画', '水粉水彩', '版画', '素描', '当代艺术', '雕塑', '西画雕塑其它', nan, '摄影', '现当代及其它瓷器', '装置']
['西画雕塑', '中国书画', '其它类', '陶瓷']


['西画雕塑', '中国书画', '其它类', '陶瓷']

In [4]:
column = zip (artwork_transactions['trans_art_type'],artwork_transactions['trans_category'])
type_lables = []
for value in column:
    if value not in type_lables:
        type_lables.append(value)
print('category',type_lables)


category [('西画雕塑', '综合媒材'), ('中国书画', '绘画'), ('中国书画', '书法'), ('西画雕塑', '油画'), ('西画雕塑', '水粉水彩'), ('西画雕塑', '版画'), ('西画雕塑', '素描'), ('西画雕塑', '当代艺术'), ('西画雕塑', '雕塑'), ('西画雕塑', '西画雕塑其它'), ('中国书画', nan), ('西画雕塑', '摄影'), ('其它类', nan), ('陶瓷', '现当代及其它瓷器'), ('西画雕塑', '装置'), ('西画雕塑', nan)]


In [5]:
artwork_transactions.drop(['auction_info_link'],axis=1,inplace=True)
artwork_transactions.drop(['collector'],axis=1,inplace=True)
artwork_transactions.drop(['sell_agency'],axis=1,inplace=True)
artwork_transactions.drop(['image_url'],axis=1,inplace=True)
artwork_transactions.drop(['icon_img_url'],axis=1,inplace=True)
artwork_transactions.drop(['search_string'],axis=1,inplace=True)

artwork_transactions.drop(['eval_price_min'],axis=1,inplace=True)
artwork_transactions.drop(['eval_price_max'],axis=1,inplace=True)
artwork_transactions.drop(['eval_price_curr'],axis=1,inplace=True)        #information in eval_price
artwork_transactions.drop(['trade_curr'],axis=1,inplace=True)             #only one value

In [6]:
artwork_transactions['trans_category']=artwork_transactions['trans_category'].fillna('nan')


#evalprice 不是数，要确认其type，evalprice_min\max\curr的信息其实都在evalprice中
#info_update_time 和交易id应该是有关系的，在总的数据库中，应用相邻的id的交易之info_update_time填充
#calc_price是成交价格，流拍艺术品的成交价格为nan

print (artwork_transactions.info())
artwork_transactions.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
id                      10000 non-null int64
trans_type              10000 non-null object
display_price           10000 non-null object
calc_price              5192 non-null float64
eval_price              9098 non-null object
trans_category          10000 non-null object
trans_art_type          10000 non-null object
artwork_auction_name    10000 non-null object
info_update_time        9997 non-null object
trade_channel_id        10000 non-null int64
origin_url              10000 non-null object
created_at              10000 non-null object
updated_at              10000 non-null object
transaction_date        10000 non-null object
comments_count          10000 non-null int64
likes_count             10000 non-null int64
lot                     10000 non-null object
auction_id              10000 non-null int64
company_id              10000 non-null int64
dtypes: float64(1), int64(

Unnamed: 0,id,trans_type,display_price,calc_price,eval_price,trans_category,trans_art_type,artwork_auction_name,info_update_time,trade_channel_id,origin_url,created_at,updated_at,transaction_date,comments_count,likes_count,lot,auction_id,company_id
0,1310631,auction,流拍,,"USD 1,600-2,400",综合媒材,西画雕塑,0063 冬季烟火 其二 纸板上粘贴日本纸，水彩 铅笔 亚克力,2015-07-12 05:07:34,4058,http://auction.artron.net/paimai-art0029920063/,2014-01-18 00:00:00,2014-01-18 00:00:00,2014-01-17 16:00:00,0,0,63,1331,177
1,1476875,auction,流拍,,"RMB 80,000-120,000",绘画,中国书画,0140 牡丹双蝶 镜片 设色纸本,2015-07-12 13:30:32,2415,http://auction.artron.net/paimai-art5017610140/,2012-05-04 00:00:00,2012-05-04 00:00:00,2012-05-03 16:00:00,0,0,140,2344,3
2,1698094,auction,流拍,,"HKD 3,000-5,000",绘画,中国书画,0475 采菊图 立轴 设色纸本,2015-07-12 20:07:52,2985,http://auction.artron.net/paimai-art64950475/,2010-05-27 00:00:00,2010-05-27 00:00:00,2010-05-26 16:00:00,0,0,475,3310,101
3,1085516,auction,"RMB 22,600",22600.0,"RMB 20,000-30,000",书法,中国书画,0262 书法 镜片连框 墨笔纸本,2015-07-11 14:19:01,13270,http://auction.artron.net/paimai-art5036120262/,2013-07-09 00:00:00,2013-07-09 00:00:00,2013-07-08 16:00:00,0,0,262,1824,316
4,730475,auction,流拍,,"RMB 22,000-32,000",书法,中国书画,0224 书法,2015-07-10 09:14:31,8413,http://auction.artron.net/paimai-art5027780224/,2012-12-16 00:00:00,2012-12-16 00:00:00,2012-12-15 16:00:00,0,0,224,1219,238


 ## problem: 
* display price 和 calc_price 分别是什么
    
    --》display是成交价格，calc_price为换算为人民币
    
    

* 关于display price“流拍”、“未提供”等情况
    
    --》可以先用分类器判断是否流拍
    
    --》将“未提供“先抛弃掉
    
    

* 关于价格，尽可能外币按当时汇率转换

In [7]:
#证实nan全部来自流拍和为提供
for i in range(len(artwork_transactions)):
    if artwork_transactions['display_price'][i] != '未提供'and artwork_transactions['display_price'][i] != '流拍' and np.isnan(artwork_transactions['calc_price'][i]):
        print(artwork_transactions['display_price'][i],artwork_transactions['calc_price'][i])
    if not np.isnan(artwork_transactions['calc_price'][i]) and (artwork_transactions['display_price'][i] == '未提供'or artwork_transactions['display_price'][i] == '流拍'):
        print(artwork_transactions['display_price'][i],artwork_transactions['calc_price'][i])
 

In [8]:
#删除‘为提供’的hang
artwork_transactions = artwork_transactions[artwork_transactions.display_price != '未提供']
artwork_transactions = artwork_transactions.reset_index()

In [9]:
artwork_transactions['ifsold'] = range(len(artwork_transactions))

In [10]:
artwork_transactions.head()

Unnamed: 0,index,id,trans_type,display_price,calc_price,eval_price,trans_category,trans_art_type,artwork_auction_name,info_update_time,...,origin_url,created_at,updated_at,transaction_date,comments_count,likes_count,lot,auction_id,company_id,ifsold
0,0,1310631,auction,流拍,,"USD 1,600-2,400",综合媒材,西画雕塑,0063 冬季烟火 其二 纸板上粘贴日本纸，水彩 铅笔 亚克力,2015-07-12 05:07:34,...,http://auction.artron.net/paimai-art0029920063/,2014-01-18 00:00:00,2014-01-18 00:00:00,2014-01-17 16:00:00,0,0,63,1331,177,0
1,1,1476875,auction,流拍,,"RMB 80,000-120,000",绘画,中国书画,0140 牡丹双蝶 镜片 设色纸本,2015-07-12 13:30:32,...,http://auction.artron.net/paimai-art5017610140/,2012-05-04 00:00:00,2012-05-04 00:00:00,2012-05-03 16:00:00,0,0,140,2344,3,1
2,2,1698094,auction,流拍,,"HKD 3,000-5,000",绘画,中国书画,0475 采菊图 立轴 设色纸本,2015-07-12 20:07:52,...,http://auction.artron.net/paimai-art64950475/,2010-05-27 00:00:00,2010-05-27 00:00:00,2010-05-26 16:00:00,0,0,475,3310,101,2
3,3,1085516,auction,"RMB 22,600",22600.0,"RMB 20,000-30,000",书法,中国书画,0262 书法 镜片连框 墨笔纸本,2015-07-11 14:19:01,...,http://auction.artron.net/paimai-art5036120262/,2013-07-09 00:00:00,2013-07-09 00:00:00,2013-07-08 16:00:00,0,0,262,1824,316,3
4,4,730475,auction,流拍,,"RMB 22,000-32,000",书法,中国书画,0224 书法,2015-07-10 09:14:31,...,http://auction.artron.net/paimai-art5027780224/,2012-12-16 00:00:00,2012-12-16 00:00:00,2012-12-15 16:00:00,0,0,224,1219,238,4


In [11]:
artwork_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7631 entries, 0 to 7630
Data columns (total 21 columns):
index                   7631 non-null int64
id                      7631 non-null int64
trans_type              7631 non-null object
display_price           7631 non-null object
calc_price              5192 non-null float64
eval_price              7107 non-null object
trans_category          7631 non-null object
trans_art_type          7631 non-null object
artwork_auction_name    7631 non-null object
info_update_time        7629 non-null object
trade_channel_id        7631 non-null int64
origin_url              7631 non-null object
created_at              7631 non-null object
updated_at              7631 non-null object
transaction_date        7631 non-null object
comments_count          7631 non-null int64
likes_count             7631 non-null int64
lot                     7631 non-null object
auction_id              7631 non-null int64
company_id              7631 non-null int64

In [12]:
for i in range(len(artwork_transactions)):
    if np.isnan(artwork_transactions.calc_price[i]):
        artwork_transactions['ifsold'][i] = 0
    elif artwork_transactions.calc_price[i] > 0:
        artwork_transactions['ifsold'][i] = 1
    else: print('nooooooo',i)
    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


In [13]:
# 先弄80%数据出来train,剩下的test
def sperate_train_data(data):
    the_num_of_total_data = len(data)
    data_train = data[:int(the_num_of_total_data * 0.8)]
    data_test = data[int(the_num_of_total_data * 0.8):]
    return data_train, data_test

In [14]:
# 流拍没有 lable
(ifsold_train,ifsold_test) = sperate_train_data(list(artwork_transactions['ifsold']))

In [15]:
# about art type (feature)
type_map ={('西画雕塑', '综合媒材'):1,
('中国书画', '绘画'):2, 
('中国书画', '书法'):3, 
('西画雕塑', '油画'):4, 
('西画雕塑', '水粉水彩'):5, 
('西画雕塑', '版画'):6,
('西画雕塑', '素描'):7, 
('西画雕塑', '当代艺术'):8, 
('西画雕塑', '雕塑'):9, 
('西画雕塑', '西画雕塑其它'):10, 
('中国书画', 'nan'):11, 
('西画雕塑', '摄影'):12, 
('其它类', 'nan'):13, 
('陶瓷', '现当代及其它瓷器'):14, 
('西画雕塑', '装置'):15, 
('西画雕塑', 'nan'):16}


trans_category = list(artwork_transactions['trans_category'])
trans_art_type = list(artwork_transactions['trans_art_type'])
category_type = zip(trans_art_type,trans_category)
art_type = []
for x in category_type:
    art_type.append(type_map[x])
    
(art_type_train,art_type_test) = sperate_train_data(art_type)

In [16]:
# about time 
import datetime
import time

def texts_to_dates(texts):
    dates = []
    for text in texts:
        date = datetime.datetime.strptime(text, '%Y-%m-%d %H:%M:%S').date()
        dates.append(date)
    return dates


def texts_to_strptime(texts):
    dates = []
    for text in texts:
        date = datetime.datetime.strptime(text, '%Y-%m-%d %H:%M:%S').date()
        date_s = time.mktime(date.timetuple())
        dates.append(date_s)
    return dates

def get_age_days(date1s,date2s):
    age_days=[]
    for i in range(len(date1s)):
        age_days.append((date1s[i] - date2s[i]).days)
    return age_days


transaction_date = texts_to_strptime(artwork_transactions.transaction_date)
(transaction_date_train, transaction_date_test) = sperate_train_data(transaction_date)

In [17]:
# about time 
auction_id = list(artwork_transactions.auction_id)
company_id = list(artwork_transactions.company_id)
(auction_train, auction_test) = sperate_train_data(list(artwork_transactions.auction_id))
(company_train, company_test) = sperate_train_data(list(artwork_transactions.company_id))



In [18]:
#about eval_price
eval_price = []
eval_price_min = []
eval_price_max = []
for price in artwork_transactions['eval_price']:
    if pd.isnull(price): eval_price.append([])
    else:
        price = price.split(' ')
        price[1] = (price[1].strip('\u3000')).split('-')
        price_min = int(price[1][0].replace(',',''))
        price_max = int(price[1][1].replace(',',''))
        eval_price.append(price)
        eval_price_min.append(price_min)
        eval_price_max.append(price_max)
        
(eval_price_max_train, eval_price_max_test) = sperate_train_data(eval_price_max)
(eval_price_min_train, eval_price_min_test) = sperate_train_data(eval_price_min)

In [19]:

lable_train = ifsold_train
lable_test = ifsold_test
feature_train =  list(zip(company_train,auction_train,transaction_date_train,art_type_train,eval_price_max_train+eval_price_min_train))
feature_test =  list(zip(company_test,auction_test,transaction_date_test,art_type_test,eval_price_max_test+eval_price_min_test))

In [22]:
from sklearn import tree
from sklearn.metrics import accuracy_score

dt = tree.DecisionTreeClassifier(min_samples_split=200)#(min_samples_split= 400)
dt = dt.fit(feature_train, lable_train)
pred_dt = dt.predict(feature_test)
pred2_dt = dt.predict(feature_train)
print(accuracy_score(lable_test, pred_dt), accuracy_score(lable_train, pred2_dt))

0.680419122462 0.718217562254


In [25]:
importances = dt.feature_importances_
print(importances.argsort())

[3 4 1 0 2]


company_train, 

auction_train, 

transaction_date_train,

art_type_train, 

eval_price_max_train+eval_price_min_train

In [23]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb = gnb.fit(feature_train, lable_train)
pred_gnb = gnb.predict(feature_test)
pred2_gnb = gnb.predict(feature_train)
print(accuracy_score(lable_test, pred_gnb), accuracy_score(lable_train, pred2_gnb))

0.66339227243 0.678243774574
