# A Summary of the Mercari Price Suggestion Challenge (Part 1)

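In my spare time outside work, I joined this Kaggle competition with a few friends (there was very little time I could really devote to it during working hours, so I studied some of the public kernels). Although I forked someone else's work, running it and analyzing it carefully taught me a lot. One very important aspect of this competition is the time limit: every step has to finish within one hour! (This was the tricky part for our team; if you want to tune hyperparameters or use a more complex model that takes longer to train, you have to make a trade-off.) Another important rule is that you cannot train a word2vec model on the competition data, although you can use publicly available pretrained word vectors, such as Facebook's [fasttext](https://github.com/facebookresearch/fastText "fasttext").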

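I am using some free time over the Chinese New Year holiday to write this up. To me, the important thing is not climbing the leaderboard but building up a set of feature engineering tricks and algorithms. (Facepalm: mostly I was too lazy to keep tuning hyperparameters. I also wanted to use the holiday to summarize the projects I worked on over the past year.)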

## RNN version

#### This post mainly uses an RNN and a Ridge model for prediction; the final public score is 0.42688

### Data

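The data provided in this competition is interesting: it contains both text and numeric fields, combined into one structured table. The challenge for participants is to combine and transform these fields into a strong feature set.
First, a quick look at each field. The official data description is reproduced below; the meaning of each field is easy to understand.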

The files consist of a list of product listings. These files are tab-delimited.
- train_id or test_id - the id of the listing
- name - the title of the listing. Note that we have cleaned the data to remove text that look like prices (e.g. 20 dollar ) to avoid leakage. These removed prices are represented as [rm]
- item_condition_id - the condition of the items provided by the seller
- category_name - category of the listing
- brand_name
- price - the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in test.tsv since that is what you will predict.
- shipping - 1 if shipping fee is paid by seller and 0 by buyer
- item_description - the full description of the item. Note that we have cleaned the data to remove text that look like prices (e.g. 20 dollar) to avoid leakage. These removed prices are represented as [rm]

## Preparation

### Importing packages

In [58]:
#Import packages
##Import all needed packages for constructing models and solving the competition
from datetime import datetime 
start_real = datetime.now()
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dropout, Dense, concatenate, GRU, Embedding, Flatten, Activation
# from keras.layers import Bidirectional
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from nltk.corpus import stopwords
import math
# set seed
np.random.seed(123)

### Defining the RMSLE error function (Y and Y_pred are already log-transformed here, because the raw Y is not normally distributed)

In [59]:
#Define RMSL Error Function
##This is for checking the predictions at the end. Note that the Y and Y_pred will already be in log scale by the time this is used, so no need to log them in the function.
def rmsle(Y, Y_pred):
    assert Y.shape == Y_pred.shape
    return np.sqrt(np.mean(np.square(Y_pred - Y )))
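
To make the log-scale remark concrete, here is a minimal standard-library sketch showing that a plain RMSE on `log1p`-transformed values gives the same number as an RMSLE computed on raw prices. The prices and the helper names `rmsle_raw` and `rmse` are made up for illustration and are not part of the kernel.

```python
import math

def rmsle_raw(prices, predicted_prices):
    """RMSLE computed directly on raw (non-log) prices."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(prices, predicted_prices)]
    return math.sqrt(sum(squared) / len(squared))

def rmse(y, y_pred):
    """Plain RMSE: the same computation as rmsle() above, but on lists."""
    squared = [(p - a) ** 2 for a, p in zip(y, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices in USD, made up for illustration.
prices = [10.0, 52.0, 35.0]
predicted = [12.0, 40.0, 35.0]

# Move both to log scale first, as the notebook does with np.log1p.
log_prices = [math.log1p(v) for v in prices]
log_predicted = [math.log1p(v) for v in predicted]

score_on_raw = rmsle_raw(prices, predicted)
score_on_log = rmse(log_prices, log_predicted)
print(score_on_raw, score_on_log)  # the two values agree
```

This is why the notebook can train on `log1p(price)` with an ordinary squared-error loss and only apply `expm1` when writing out predictions.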

### Loading the data

In [60]:
#Load train and test data
# Raw strings keep the Windows backslashes from being read as escape sequences
train_df = pd.read_table(r'F:\kaggle\MercariPrice/train.tsv')
test_df = pd.read_table(r'F:\kaggle\MercariPrice/test.tsv')
print(train_df.shape, test_df.shape)

(1482535, 8) (693359, 7)


To keep testing quick, only a small subset is used here, with a 2:1 ratio of train to test rows.

In [61]:
train_df = train_df[:2000]
test_df = test_df[:1000]
print(train_df.shape, test_df.shape)

(2000, 8) (1000, 7)


## 数据预处理

- Preprocessing, removing low prices: Mercari does not allow listings priced below 3 USD, so any rows with a price below 3 should be removed. This helps the final model.

In [62]:
train_df = train_df.drop(train_df[(train_df.price < 3.0)].index)
print(train_df.shape)

(1999, 8)


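- Preprocessing, removing stop words. This step is not actually executed: the models are quite robust to stop words, and although removing them gives a small improvement, it is not worth spending 1-2 minutes on this block given the strict time limit.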

In [63]:
# %%time

# stop = stopwords.words('english')
# train_df.item_description.fillna(value='No description yet', inplace=True)
# train_df['item_description'] = train_df['item_description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# train_df.name.fillna(value="missing", inplace=True)
# train_df['name'] = train_df['name'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

# test_df.item_description.fillna(value='No description yet', inplace=True)
# test_df['item_description'] = test_df['item_description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# test_df.name.fillna(value="missing", inplace=True)
# test_df['name'] = test_df['name'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

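- Preprocessing, description length: the length of the item_description field, i.e. the raw number of words in the description, has some correlation with price. The RNN might find this out on its own, but since a maximum sequence length is used to save computation, it does not always know.
The description length clearly helps the model, while the length of the name field probably matters less. Since the name length does not hurt model performance either, it is kept as well.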

In [64]:
# get name and description lengths
def wordCount(text):
    try:
        if text == 'No description yet':
            return 0
        else:
            text = text.lower()
            words = [w for w in text.split(" ")]
            return len(words)
    except: 
        return 0
train_df['desc_len'] = train_df['item_description'].apply(lambda x: wordCount(x))
test_df['desc_len'] = test_df['item_description'].apply(lambda x: wordCount(x))
train_df['name_len'] = train_df['name'].apply(lambda x: wordCount(x))
test_df['name_len'] = test_df['name'].apply(lambda x: wordCount(x))
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,desc_len,name_len
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet,0,7
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,36,4
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,29,2
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...,32,3
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity,5,4
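
One detail of `wordCount` worth knowing: `text.split(" ")` splits on every single space, so repeated or trailing spaces produce empty strings that are counted as words. The sketch below compares that rule with `str.split()` without arguments, which splits on runs of whitespace. The sample string and function names are hypothetical, and this is only an illustration, not a change the kernel made.

```python
def word_count_space(text):
    # Same splitting rule as wordCount above: split on a single space.
    return len(text.split(" "))

def word_count_whitespace(text):
    # str.split() with no argument splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return len(text.split())

sample = "New with  tags. "  # hypothetical text: double space, trailing space

print(word_count_space(sample))       # 5, the empty strings are counted
print(word_count_whitespace(sample))  # 3
```

Either rule gives a usable length feature; the counts only differ on listings with irregular spacing.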


- Preprocessing, category split: split the category_name field into three parts so that the model can extract more information from it.
I tried making a small 3-part RNN layer for this instead, which does worse than this method but is occasionally faster.

In [65]:
# split category name into 3 parts
def split_cat(text):
    try: return text.split("/")
    except: return ("No Label", "No Label", "No Label")
train_df['subcat_0'], train_df['subcat_1'], train_df['subcat_2'] = \
zip(*train_df['category_name'].apply(lambda x: split_cat(x)))
test_df['subcat_0'], test_df['subcat_1'], test_df['subcat_2'] = \
zip(*test_df['category_name'].apply(lambda x: split_cat(x)))
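
As a variation on `split_cat`, the sketch below always returns exactly three parts, which makes the three-column assignment explicit even for paths with fewer or more than three levels. The name `split_cat_fixed`, the `depth` parameter, and the sample paths are assumptions made for illustration.

```python
NO_LABEL = "No Label"

def split_cat_fixed(text, depth=3):
    """Split a category path into exactly `depth` parts."""
    if not isinstance(text, str):        # NaN / None for missing categories
        return (NO_LABEL,) * depth
    parts = text.split("/", depth - 1)   # extra levels stay in the last part
    parts += [NO_LABEL] * (depth - len(parts))
    return tuple(parts)

print(split_cat_fixed("Men/Tops/T-shirts"))
print(split_cat_fixed(float("nan")))
# A hypothetical path with more than three levels keeps its tail together.
print(split_cat_fixed("Electronics/Computers & Tablets/iPad/Tablet/eBook Readers"))
```

The design choice here is to avoid a bare `except` and to test for the missing value directly, so unexpected errors are not silently turned into "No Label".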

In [66]:
train_df.head(5)

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,desc_len,name_len,subcat_0,subcat_1,subcat_2
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet,0,7,Men,Tops,T-shirts
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,36,4,Electronics,Computers & Tablets,Components & Parts
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,29,2,Women,Tops & Blouses,Blouse
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...,32,3,Home,Home Décor,Home Décor Accents
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity,5,4,Women,Jewelry,Necklaces


In [67]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1999 entries, 0 to 1999
Data columns (total 13 columns):
train_id             1999 non-null int64
name                 1999 non-null object
item_condition_id    1999 non-null int64
category_name        1988 non-null object
brand_name           1158 non-null object
price                1999 non-null float64
shipping             1999 non-null int64
item_description     1999 non-null object
desc_len             1999 non-null int64
name_len             1999 non-null int64
subcat_0             1999 non-null object
subcat_1             1999 non-null object
subcat_2             1999 non-null object
dtypes: float64(1), int64(5), object(7)
memory usage: 218.6+ KB


- Preprocessing, filling in missing values.
The brand name data is sparse, missing over 600,000 values. This step gets some of those values back by checking the listing names, although it does not seem to help the models either way at this point. An exact match of the name against all_brands finds about 3,000 of these, and we can be pretty confident in them. At the other extreme, searching for any brand among all the words in the name finds over 200,000, but many of these are incorrect. A middle ground is possible by either keeping case sensitivity or trimming some of the 5,000 brand names.
For example, PINK is a brand by Victoria's Secret. If we ignore case, almost every pink item gets labeled with the PINK brand. The other issue is that some of the "brand names" are not brands but really categories, like "Boots" or "Keys".
Currently, checking every word in the name for a case-sensitive match does best. This gets around 137,000 finds while avoiding the problems with brands like PINK.

In [68]:
full_set = pd.concat([train_df,test_df])
all_brands = set(full_set['brand_name'].values)
train_df.brand_name.fillna(value="missing", inplace=True)
test_df.brand_name.fillna(value="missing", inplace=True)

In [69]:
# get to finding!
premissing = len(train_df.loc[train_df['brand_name'] == 'missing'])
def brandfinder(line):
    brand = line[0]
    name = line[1]
    namesplit = name.split(' ')
    if brand == 'missing':
        for x in namesplit:
            if x in all_brands:
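                return name  # NOTE: returns the whole listing name (see row 0 of the output below); returning the matched word `x` was probably intended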
    if name in all_brands:
        return name
    return brand
train_df['brand_name'] = train_df[['brand_name','name']].apply(brandfinder, axis = 1)
test_df['brand_name'] = test_df[['brand_name','name']].apply(brandfinder, axis = 1)
found = premissing-len(train_df.loc[train_df['brand_name'] == 'missing'])
print(found)

68
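
The case-sensitive word match described above can be sketched in plain Python as follows. Unlike `brandfinder`, this version returns the matched word rather than the whole listing name, and it leaves an existing brand untouched; the function name `find_brand` and the small brand set are hypothetical.

```python
MISSING = "missing"

def find_brand(brand, name, known_brands):
    """Return a brand for a listing, filling gaps from the title."""
    if brand != MISSING:
        return brand                    # keep the brand the seller entered
    if name in known_brands:            # exact match of the whole title
        return name
    for token in name.split(" "):       # case-sensitive word-by-word match
        if token in known_brands:
            return token
    return MISSING

known_brands = {"PINK", "Razer", "Nike"}  # hypothetical brand set

print(find_brand(MISSING, "Razer BlackWidow Chroma Keyboard", known_brands))
print(find_brand(MISSING, "pink summer dress", known_brands))  # stays missing
print(find_brand(MISSING, "PINK hoodie", known_brands))
```

Keeping the match case-sensitive is what stops ordinary words such as "pink" from being promoted to the PINK brand.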


In [70]:
train_df.head(3)

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,desc_len,name_len,subcat_0,subcat_1,subcat_2
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,MLB Cincinnati Reds T Shirt Size XL,10.0,1,No description yet,0,7,Men,Tops,T-shirts
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,36,4,Electronics,Computers & Tablets,Components & Parts
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,29,2,Women,Tops & Blouses,Blouse


## Splitting the data

In [71]:
##Standard split the train test for validation and log the price

# Scale target variable to log.
train_df["target"] = np.log1p(train_df.price)

# Split training examples into train/dev examples.
train_df, dev_df = train_test_split(train_df, random_state=123, train_size=0.99)

# Calculate number of train/dev/test examples.
n_trains = train_df.shape[0]
n_devs = dev_df.shape[0]
n_tests = test_df.shape[0]
print("Training on", n_trains, "examples")
print("Validating on", n_devs, "examples")
print("Testing on", n_tests, "examples")

Training on 1979 examples
Validating on 20 examples
Testing on 1000 examples


## RNN Model

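Solving the problem with an RNN model involves the following steps:
- Preprocess the data
- Define the RNN model
- Fit the RNN model on the training set
- Evaluate the RNN model on the validation set
- Use the RNN model to predict on the test set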

### Preprocessing the data

In [72]:
# Concatenate train - dev - test data for easier processing
full_df = pd.concat([train_df, dev_df, test_df])

In [73]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2999 entries, 1503 to 999
Data columns (total 15 columns):
brand_name           2999 non-null object
category_name        2985 non-null object
desc_len             2999 non-null int64
item_condition_id    2999 non-null int64
item_description     2999 non-null object
name                 2999 non-null object
name_len             2999 non-null int64
price                1999 non-null float64
shipping             2999 non-null int64
subcat_0             2999 non-null object
subcat_1             2999 non-null object
subcat_2             2999 non-null object
target               1999 non-null float64
test_id              1000 non-null float64
train_id             1999 non-null float64
dtypes: float64(4), int64(4), object(7)
memory usage: 374.9+ KB


In [74]:
# Fill missing (NA) values, and replace 'No description yet' with 'missing' to help the model
# Filling missing values
def fill_missing_values(df):
    df.category_name.fillna(value="missing", inplace=True)
    df.brand_name.fillna(value="missing", inplace=True)
    df.item_description.fillna(value="missing", inplace=True)
    df.item_description.replace('No description yet',"missing", inplace=True)
    return df

print("Filling missing data...")
full_df = fill_missing_values(full_df)
print(full_df.category_name[1])

Filling missing data...
1    Electronics/Computers & Tablets/Components & P...
1              Other/Office supplies/Shipping Supplies
Name: category_name, dtype: object


In [75]:
full_df.head(3)

Unnamed: 0,brand_name,category_name,desc_len,item_condition_id,item_description,name,name_len,price,shipping,subcat_0,subcat_1,subcat_2,target,test_id,train_id
1503,Hoover,Home/Home Appliances/Vacuums & Floor Care,35,1,Brand New Hoover Air Express Handheld Vacuum w...,Hoover Air Express Hand Vacuum,5,12.0,1,Home,Home Appliances,Vacuums & Floor Care,2.564949,,1503.0
553,Old Navy,Kids/Girls 0-24 Mos/One-Pieces,10,2,Shirt 0-3 month leggings 6 months old navy and...,Winter bundle,2,8.0,0,Kids,Girls 0-24 Mos,One-Pieces,2.197225,,553.0
775,American Boy & Girl,Kids/Toys/Dolls & Accessories,5,2,Theater case/ Trunk with box.,Marisol Lot HOLD APPLE,4,29.0,0,Kids,Toys,Dolls & Accessories,3.401197,,775.0


In [76]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2999 entries, 1503 to 999
Data columns (total 15 columns):
brand_name           2999 non-null object
category_name        2999 non-null object
desc_len             2999 non-null int64
item_condition_id    2999 non-null int64
item_description     2999 non-null object
name                 2999 non-null object
name_len             2999 non-null int64
price                1999 non-null float64
shipping             2999 non-null int64
subcat_0             2999 non-null object
subcat_1             2999 non-null object
subcat_2             2999 non-null object
target               1999 non-null float64
test_id              1000 non-null float64
train_id             1999 non-null float64
dtypes: float64(4), int64(4), object(7)
memory usage: 454.9+ KB


#### Handling categorical features

In [77]:


print("Processing categorical data...")
le = LabelEncoder()
# full_df.category = full_df.category_name
le.fit(full_df.category_name)
full_df['category'] = le.transform(full_df.category_name)

le.fit(full_df.brand_name)
full_df.brand_name = le.transform(full_df.brand_name)

le.fit(full_df.subcat_0)
full_df.subcat_0 = le.transform(full_df.subcat_0)

le.fit(full_df.subcat_1)
full_df.subcat_1 = le.transform(full_df.subcat_1)

le.fit(full_df.subcat_2)
full_df.subcat_2 = le.transform(full_df.subcat_2)

Processing categorical data...
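
For readers unfamiliar with label encoding, here is a rough standard-library sketch of the idea: each distinct string gets an integer id in sorted order, and the largest id plus one is the number of rows an embedding table needs (the role `MAX_BRAND` plays later). This is a simplified stand-in, not scikit-learn's implementation, and the brand list is made up.

```python
def fit_label_map(values):
    """Map each distinct value to an integer id, in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}

brands = ["Razer", "missing", "Target", "missing", "Hoover"]  # hypothetical

label_map = fit_label_map(brands)
encoded = [label_map[b] for b in brands]

# Ids start at 0, so an embedding table needs max id + 1 rows.
num_embedding_rows = max(encoded) + 1

print(label_map)
print(encoded)
print(num_embedding_rows)
```

Fitting on the concatenated train, dev and test rows, as the notebook does, ensures every value seen at prediction time already has an id.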


In [78]:
full_df.head(3)

Unnamed: 0,brand_name,category_name,desc_len,item_condition_id,item_description,name,name_len,price,shipping,subcat_0,subcat_1,subcat_2,target,test_id,train_id,category
1503,200,Home/Home Appliances/Vacuums & Floor Care,35,1,Brand New Hoover Air Express Handheld Vacuum w...,Hoover Air Express Hand Vacuum,5,12.0,1,3,44,318,2.564949,,1503.0,103
553,354,Kids/Girls 0-24 Mos/One-Pieces,10,2,Shirt 0-3 month leggings 6 months old navy and...,Winter bundle,2,8.0,0,4,38,222,2.197225,,553.0,171
775,32,Kids/Toys/Dolls & Accessories,5,2,Theater case/ Trunk with box.,Marisol Lot HOLD APPLE,4,29.0,0,4,86,109,3.401197,,775.0,192


In [79]:
del le

In [80]:
full_df.head(3)

Unnamed: 0,brand_name,category_name,desc_len,item_condition_id,item_description,name,name_len,price,shipping,subcat_0,subcat_1,subcat_2,target,test_id,train_id,category
1503,200,Home/Home Appliances/Vacuums & Floor Care,35,1,Brand New Hoover Air Express Handheld Vacuum w...,Hoover Air Express Hand Vacuum,5,12.0,1,3,44,318,2.564949,,1503.0,103
553,354,Kids/Girls 0-24 Mos/One-Pieces,10,2,Shirt 0-3 month leggings 6 months old navy and...,Winter bundle,2,8.0,0,4,38,222,2.197225,,553.0,171
775,32,Kids/Toys/Dolls & Accessories,5,2,Theater case/ Trunk with box.,Marisol Lot HOLD APPLE,4,29.0,0,4,86,109,3.401197,,775.0,192


#### Handling text features: converting text to sequences

In [81]:
print("Transforming text data to sequences...")
raw_text = np.hstack([full_df.item_description.str.lower(), full_df.name.str.lower(), full_df.category_name.str.lower()])

print("   Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)

print("   Transforming text to sequences...")
full_df['seq_item_description'] = tok_raw.texts_to_sequences(full_df.item_description.str.lower())
full_df['seq_name'] = tok_raw.texts_to_sequences(full_df.name.str.lower())
full_df['seq_category'] = tok_raw.texts_to_sequences(full_df.category_name.str.lower())

Transforming text data to sequences...
   Fitting tokenizer...
   Transforming text to sequences...
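
The text-to-sequence step can be pictured with the simplified sketch below: words are ranked by frequency, the most frequent word gets id 1, and id 0 stays free for padding. This is only an approximation of what the Keras `Tokenizer` does (it skips punctuation filtering, for instance), and the helper names and sample texts are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign ids by descending frequency, starting at 1 (0 is for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

texts = ["Brand new with tags", "New with box", "new"]  # hypothetical
word_index = fit_word_index(texts)

print(word_index["new"], word_index["with"])
print(texts_to_sequences(["new with", "new shoes"], word_index))
```

Fitting one index over descriptions, names and categories together, as the notebook does, means the same word maps to the same id in every text field.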


In [82]:
del tok_raw

print(full_df['seq_name'][:5])

1503    [3817, 520, 1182, 412, 1761]
553                        [485, 29]
775            [8977, 206, 321, 401]
918        [98, 280, 8978, 8979, 57]
1297        [30, 50, 2259, 371, 738]
Name: seq_name, dtype: object


In [83]:
full_df.head(4)

Unnamed: 0,brand_name,category_name,desc_len,item_condition_id,item_description,name,name_len,price,shipping,subcat_0,subcat_1,subcat_2,target,test_id,train_id,category,seq_item_description,seq_name,seq_category
1503,200,Home/Home Appliances/Vacuums & Floor Care,35,1,Brand New Hoover Air Express Handheld Vacuum w...,Hoover Air Express Hand Vacuum,5,12.0,1,3,44,318,2.564949,,1503.0,103,"[17, 6, 3817, 520, 1182, 3003, 1761, 9, 5400, ...","[3817, 520, 1182, 412, 1761]","[36, 36, 1594, 9938, 2784, 128]"
553,354,Kids/Girls 0-24 Mos/One-Pieces,10,2,Shirt 0-3 month leggings 6 months old navy and...,Winter bundle,2,8.0,0,4,38,222,2.197225,,553.0,171,"[103, 133, 38, 424, 57, 50, 451, 502, 244, 1, ...","[485, 29]","[33, 97, 133, 140, 224, 52, 379]"
775,32,Kids/Toys/Dolls & Accessories,5,2,Theater case/ Trunk with box.,Marisol Lot HOLD APPLE,4,29.0,0,4,86,109,3.401197,,775.0,192,"[5403, 129, 5404, 9, 67]","[8977, 206, 321, 401]","[33, 160, 443, 26]"
918,207,"Women/Athletic Apparel/Pants, Tights, Leggings",12,1,Brand new with tags. Black background with tan...,Lularoe OS peeking Santa leggings.,5,44.0,0,10,5,231,3.806662,,918.0,323,"[17, 6, 9, 87, 28, 826, 9, 654, 418, 1021, 221...","[98, 280, 8978, 8979, 57]","[3, 45, 59, 64, 132, 57]"


In [84]:
# Define some constants. Note that the comments next to the first few lines give the longest entry in that column.
MAX_NAME_SEQ = 10 #17
MAX_ITEM_DESC_SEQ = 75 #269
MAX_CATEGORY_SEQ = 8 #8
MAX_TEXT = np.max([
    np.max(full_df.seq_name.max()),
    np.max(full_df.seq_item_description.max()),
#     np.max(full_df.seq_category.max()),
]) + 100
MAX_CATEGORY = np.max(full_df.category.max()) + 1
MAX_BRAND = np.max(full_df.brand_name.max()) + 1
MAX_CONDITION = np.max(full_df.item_condition_id.max()) + 1
MAX_DESC_LEN = np.max(full_df.desc_len.max()) + 1
MAX_NAME_LEN = np.max(full_df.name_len.max()) + 1
MAX_SUBCAT_0 = np.max(full_df.subcat_0.max()) + 1
MAX_SUBCAT_1 = np.max(full_df.subcat_1.max()) + 1
MAX_SUBCAT_2 = np.max(full_df.subcat_2.max()) + 1
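
A caveat on `MAX_TEXT`: assuming `Series.max()` compares the token lists the way Python compares lists (element by element), it picks the list with the largest leading id, which is not necessarily the list holding the largest id overall. That is presumably why a safety margin of 100 is added. The sketch below, with made-up sequences, shows the difference and one way to get the exact maximum.

```python
# Hypothetical token-id sequences standing in for full_df.seq_name.
seqs = [[3, 9000], [485, 29], [8977, 206]]

# Python compares lists lexicographically, so this picks [8977, 206].
lexicographic_max = max(seqs)
naive_max = max(lexicographic_max)

# Exact largest id: look inside every non-empty sequence.
true_max = max(max(s) for s in seqs if s)

# An embedding table indexed by these ids needs true_max + 1 rows.
vocab_rows = true_max + 1

print(naive_max, true_max, vocab_rows)
```

If the embedding table is sized from the naive value and the margin is too small, an id can fall outside the table, so the exact maximum is the safer input.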

In [85]:
np.max(full_df.subcat_2.max()) + 1

335

In [86]:
#Get data for RNN model
def get_rnn_data(dataset):
    X = {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(dataset.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand_name': np.array(dataset.brand_name),
        'category': np.array(dataset.category),
#         'category_name': pad_sequences(dataset.seq_category, maxlen=MAX_CATEGORY_SEQ),
        'item_condition': np.array(dataset.item_condition_id),
        'num_vars': np.array(dataset[["shipping"]]),
        'desc_len': np.array(dataset[["desc_len"]]),
        'name_len': np.array(dataset[["name_len"]]),
        'subcat_0': np.array(dataset.subcat_0),
        'subcat_1': np.array(dataset.subcat_1),
        'subcat_2': np.array(dataset.subcat_2),
    }
    return X

In [87]:
print(np.array(full_df.seq_item_description))
print(pad_sequences(full_df.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ))

[ list([17, 6, 3817, 520, 1182, 3003, 1761, 9, 5400, 1, 1949, 8, 5401, 1104, 49, 39, 27, 1761, 113, 869, 2538, 19, 11, 547, 19, 172, 85, 4, 63, 5402, 653, 243, 371, 178, 381, 15])
 list([103, 133, 38, 424, 57, 50, 451, 502, 244, 1, 1950])
 list([5403, 129, 5404, 9, 67]) ..., list([39, 52, 257, 166, 80, 87])
 list([448, 16, 185, 10, 448, 16, 188])
 list([167, 365, 206, 12, 5318, 5319, 396])]
[[   0    0    0 ...,  178  381   15]
 [   0    0    0 ...,  244    1 1950]
 [   0    0    0 ..., 5404    9   67]
 ..., 
 [   0    0    0 ...,  166   80   87]
 [   0    0    0 ...,  448   16  188]
 [   0    0    0 ..., 5318 5319  396]]
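
The zeros on the left of each row come from padding. With its default settings, `pad_sequences` pads at the front and, for sequences that are too long, also truncates from the front, keeping the last `maxlen` ids. A plain-Python sketch of that behavior, with the hypothetical helper name `pad_pre`, looks like this:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad with `value` (maxlen > 0)."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([485, 29], 5))               # short sequence: zeros in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))   # long sequence: front is cut off
```

With `MAX_ITEM_DESC_SEQ = 75`, a description longer than 75 tokens therefore contributes only its last 75 tokens to the RNN.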


#### Converting the data into the form the RNN model needs

In [88]:
train = full_df[:n_trains]
dev = full_df[n_trains:n_trains+n_devs]
test = full_df[n_trains+n_devs:]

X_train = get_rnn_data(train)
Y_train = train.target.values.reshape(-1, 1)

X_dev = get_rnn_data(dev)
Y_dev = dev.target.values.reshape(-1, 1)

X_test = get_rnn_data(test)

In [89]:
X_train

{'brand_name': array([200, 354,  32, ...,  78, 505, 389], dtype=int64),
 'category': array([103, 171, 192, ..., 339, 353, 325], dtype=int64),
 'desc_len': array([[35],
        [10],
        [ 5],
        ..., 
        [ 0],
        [ 9],
        [41]], dtype=int64),
 'item_condition': array([1, 2, 2, ..., 2, 1, 2], dtype=int64),
 'item_desc': array([[   0,    0,    0, ...,  178,  381,   15],
        [   0,    0,    0, ...,  244,    1, 1950],
        [   0,    0,    0, ..., 5404,    9,   67],
        ..., 
        [   0,    0,    0, ...,    0,    0,   77],
        [   0,    0,    0, ...,   89,  233,  106],
        [   0,    0,    0, ...,    5, 1330,   71]]),
 'name': array([[   0,    0,    0, ..., 1182,  412, 1761],
        [   0,    0,    0, ...,    0,  485,   29],
        [   0,    0,    0, ...,  206,  321,  401],
        ..., 
        [   0,    0,    0, ...,  623, 4066,   93],
        [   0,    0,    0, ...,    0,  548,  390],
        [   0,    0,    0, ...,  122,    7,  487]]),
 'na

In [90]:
X_train["name"].shape
X_train["num_vars"].shape[1]

1

### Defining the RNN model

In [91]:
print(X_train['subcat_2'].shape)
print(X_train['desc_len'].shape)
print(X_train['item_desc'].shape)
print(X_train['name'].shape[1])


(1979,)
(1979, 1)
(1979, 75)
10


In [92]:
##Now to build the model. Old category stuff is commented out but left in case of revisit. (other adjustment notes in comments)
# set seed again in case testing models adjustments by looping next 2 blocks
np.random.seed(123)
def new_rnn_model(lr=0.001, decay=0.0):
    # Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand_name = Input(shape=[1], name="brand_name")
#     category = Input(shape=[1], name="category")
#     category_name = Input(shape=[X_train["category_name"].shape[1]], name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    num_vars = Input(shape=[X_train["num_vars"].shape[1]], name="num_vars")
    desc_len = Input(shape=[1], name="desc_len")
    name_len = Input(shape=[1], name="name_len")
    subcat_0 = Input(shape=[1], name="subcat_0")
    subcat_1 = Input(shape=[1], name="subcat_1")
    subcat_2 = Input(shape=[1], name="subcat_2")

    # Embeddings layers (adjust outputs to help model)
    emb_name = Embedding(MAX_TEXT, 20)(name)
    emb_item_desc = Embedding(MAX_TEXT, 60)(item_desc)
    emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
#     emb_category_name = Embedding(MAX_TEXT, 20)(category_name)
#     emb_category = Embedding(MAX_CATEGORY, 10)(category)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
    emb_desc_len = Embedding(MAX_DESC_LEN, 5)(desc_len)
    emb_name_len = Embedding(MAX_NAME_LEN, 5)(name_len)
    emb_subcat_0 = Embedding(MAX_SUBCAT_0, 10)(subcat_0)
    emb_subcat_1 = Embedding(MAX_SUBCAT_1, 10)(subcat_1)
    emb_subcat_2 = Embedding(MAX_SUBCAT_2, 10)(subcat_2)
    

    # rnn layers (GRUs are faster than LSTMs and speed is important here)
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_name)
#     rnn_layer3 = GRU(8) (emb_category_name)

    # main layers
    main_l = concatenate([
        Flatten() (emb_brand_name)
#         , Flatten() (emb_category)
        , Flatten() (emb_item_condition)
        , Flatten() (emb_desc_len)
        , Flatten() (emb_name_len)
        , Flatten() (emb_subcat_0)
        , Flatten() (emb_subcat_1)
        , Flatten() (emb_subcat_2)
        , rnn_layer1
        , rnn_layer2
#         , rnn_layer3
        , num_vars
    ])
    # (increasing the nodes or adding layers does not affect the time quite as much as the rnn layers)
    main_l = Dropout(0.1)(Dense(512,kernel_initializer='normal',activation='relu') (main_l))
    main_l = Dropout(0.1)(Dense(256,kernel_initializer='normal',activation='relu') (main_l))
    main_l = Dropout(0.1)(Dense(128,kernel_initializer='normal',activation='relu') (main_l))
    main_l = Dropout(0.1)(Dense(64,kernel_initializer='normal',activation='relu') (main_l))

    # the output layer.
    output = Dense(1, activation="linear") (main_l)
    
    model = Model([name, item_desc, brand_name , item_condition, 
                   num_vars, desc_len, name_len, subcat_0, subcat_1, subcat_2], output)

    optimizer = Adam(lr=lr, decay=decay)
    # (mean squared error loss function works as well as custom functions)  
    model.compile(loss = 'mse', optimizer = optimizer)

    return model

model = new_rnn_model()
model.summary()
#from keras.utils import plot_model
#plot_model(model, to_file='model.png')
del model

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
brand_name (InputLayer)          (None, 1)             0                                            
____________________________________________________________________________________________________
item_condition (InputLayer)      (None, 1)             0                                            
____________________________________________________________________________________________________
desc_len (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
name_len (InputLayer)            (None, 1)             0                                            
___________________________________________________________________________________________

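### Fitting the RNN model and setting hyperparameters
The key cost is time: this RNN takes 35 to 40 minutes to run! Two epochs with smaller batches work better than more epochs with larger batches.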


In [93]:
# Model hyperparameters
BATCH_SIZE = 512 * 3
epochs = 2

In [94]:
# Compute the learning-rate decay
exp_decay = lambda init, fin, steps: (init/fin)**(1/(steps-1)) - 1
steps = int(len(X_train['name']) / BATCH_SIZE) * epochs
lr_init, lr_fin = 0.005, 0.001
print(steps)
lr_decay = exp_decay(lr_init, lr_fin, steps)

2
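
The `exp_decay` formula solves for a per-step rate `d` such that dividing the learning rate by `(1 + d)` at every step takes it from `lr_init` to `lr_fin` in `steps - 1` steps. The sketch below checks that, and compares it with a time-based schedule `lr_init / (1 + d * t)`, which is how the `decay` argument of the older Keras optimizers behaves as far as I recall; treat that part as an assumption. The value `steps = 1000` is hypothetical.

```python
def exp_decay(init, fin, steps):
    # Same formula as the cell above; requires steps > 1.
    return (init / fin) ** (1 / (steps - 1)) - 1

def geometric_lr(init, decay, step):
    # Schedule for which exp_decay is exact: divide by (1 + decay) each step.
    return init / (1 + decay) ** step

def time_based_lr(init, decay, step):
    # Time-based schedule (assumed to match the legacy Keras `decay` argument).
    return init / (1 + decay * step)

lr_init, lr_fin, steps = 0.005, 0.001, 1000  # steps is a hypothetical value
decay = exp_decay(lr_init, lr_fin, steps)
last = steps - 1

final_geometric = geometric_lr(lr_init, decay, last)
final_time_based = time_based_lr(lr_init, decay, last)
print(final_geometric)   # lands on lr_fin
print(final_time_based)  # stays above lr_fin
```

Under the time-based assumption the rate decays more slowly than the formula intends, so `lr_fin` acts as a rough target rather than an exact end point. Note also that `steps = 1` would divide by zero, which can happen on a subset smaller than one batch.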


In [95]:
# Create the model and fit it on the training set
# verbose=1 prints progress output while training.
rnn_model = new_rnn_model(lr=lr_init, decay=lr_decay)
rnn_model.fit(
        X_train, Y_train, epochs=epochs, batch_size=BATCH_SIZE,
        validation_data=(X_dev, Y_dev), verbose=1,
)

Train on 1979 samples, validate on 20 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1ce2e9e8>

In [96]:
# Evaluate the RNN model on the validation set
print("Evaluating the model on validation data...")
Y_dev_preds_rnn = rnn_model.predict(X_dev, batch_size=BATCH_SIZE)
print(" RMSLE error:", rmsle(Y_dev, Y_dev_preds_rnn))

Evaluating the model on validation data...
 RMSLE error: 2.61973697446


In [97]:
#Make prediction for test data
rnn_preds = rnn_model.predict(X_test, batch_size=BATCH_SIZE, verbose=1)
rnn_preds = np.expm1(rnn_preds)



## Ridge Models

In [98]:
##Now onto the Ridge models. Less to play with in the Ridge models but it is faster than the RNN.
# Concatenate train - dev - test data for easier handling
full_df = pd.concat([train_df, dev_df, test_df])

In [99]:
full_df.head(3)

Unnamed: 0,brand_name,category_name,desc_len,item_condition_id,item_description,name,name_len,price,shipping,subcat_0,subcat_1,subcat_2,target,test_id,train_id
1503,Hoover,Home/Home Appliances/Vacuums & Floor Care,35,1,Brand New Hoover Air Express Handheld Vacuum w...,Hoover Air Express Hand Vacuum,5,12.0,1,Home,Home Appliances,Vacuums & Floor Care,2.564949,,1503.0
553,Old Navy,Kids/Girls 0-24 Mos/One-Pieces,10,2,Shirt 0-3 month leggings 6 months old navy and...,Winter bundle,2,8.0,0,Kids,Girls 0-24 Mos,One-Pieces,2.197225,,553.0
775,American Boy & Girl,Kids/Toys/Dolls & Accessories,5,2,Theater case/ Trunk with box.,Marisol Lot HOLD APPLE,4,29.0,0,Kids,Toys,Dolls & Accessories,3.401197,,775.0


In [100]:
##Handle missing data and convert data type to string
##All inputs are converted to strings so the vectorizers below can process them. The other note here is that NAs in item_description are filled with 'No description yet' so they are read the same as the 'No description yet' entries.
print("Handling missing values...")
full_df['category_name'] = full_df['category_name'].fillna('missing').astype(str)
full_df['subcat_0'] = full_df['subcat_0'].astype(str)
full_df['subcat_1'] = full_df['subcat_1'].astype(str)
full_df['subcat_2'] = full_df['subcat_2'].astype(str)
full_df['brand_name'] = full_df['brand_name'].fillna('missing').astype(str)
full_df['shipping'] = full_df['shipping'].astype(str)
full_df['item_condition_id'] = full_df['item_condition_id'].astype(str)
full_df['desc_len'] = full_df['desc_len'].astype(str)
full_df['name_len'] = full_df['name_len'].astype(str)
full_df['item_description'] = full_df['item_description'].fillna('No description yet').astype(str)

Handling missing values...


In [101]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2999 entries, 1503 to 999
Data columns (total 15 columns):
brand_name           2999 non-null object
category_name        2999 non-null object
desc_len             2999 non-null object
item_condition_id    2999 non-null object
item_description     2999 non-null object
name                 2999 non-null object
name_len             2999 non-null object
price                1999 non-null float64
shipping             2999 non-null object
subcat_0             2999 non-null object
subcat_1             2999 non-null object
subcat_2             2999 non-null object
target               1999 non-null float64
test_id              1000 non-null float64
train_id             1999 non-null float64
dtypes: float64(4), object(11)
memory usage: 374.9+ KB


In [102]:
#Vectorizing all the data
##Takes around 8-10 minutes depending on the inputs used.
print("Vectorizing data...")
default_preprocessor = CountVectorizer().build_preprocessor()
def build_preprocessor(field):
    field_idx = list(full_df.columns).index(field)
    return lambda x: default_preprocessor(x[field_idx])

Vectorizing data...
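
The idea behind `build_preprocessor` is that each vectorizer receives a whole row as its "document" and uses a closure to pick out one column before the usual lowercasing. A minimal standard-library sketch of that closure pattern follows; the column order, the row, and the name `build_field_preprocessor` are hypothetical, and the lowercasing stands in for the default `CountVectorizer` preprocessor.

```python
columns = ["brand_name", "name", "shipping"]  # hypothetical column order

def build_field_preprocessor(field):
    """Return a function that extracts one field from a row and lowercases it."""
    idx = columns.index(field)           # resolved once, captured by the closure
    return lambda row: str(row[idx]).lower()

row = ["Razer", "Razer BlackWidow Chroma Keyboard", "0"]

name_pre = build_field_preprocessor("name")
shipping_pre = build_field_preprocessor("shipping")

print(name_pre(row))
print(shipping_pre(row))
```

Because the index is looked up when the preprocessor is built, the column order of `full_df` must not change between building the vectorizers and transforming the data.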


In [103]:
full_df.head(3)

Unnamed: 0,brand_name,category_name,desc_len,item_condition_id,item_description,name,name_len,price,shipping,subcat_0,subcat_1,subcat_2,target,test_id,train_id
1503,Hoover,Home/Home Appliances/Vacuums & Floor Care,35,1,Brand New Hoover Air Express Handheld Vacuum w...,Hoover Air Express Hand Vacuum,5,12.0,1,Home,Home Appliances,Vacuums & Floor Care,2.564949,,1503.0
553,Old Navy,Kids/Girls 0-24 Mos/One-Pieces,10,2,Shirt 0-3 month leggings 6 months old navy and...,Winter bundle,2,8.0,0,Kids,Girls 0-24 Mos,One-Pieces,2.197225,,553.0
775,American Boy & Girl,Kids/Toys/Dolls & Accessories,5,2,Theater case/ Trunk with box.,Marisol Lot HOLD APPLE,4,29.0,0,Kids,Toys,Dolls & Accessories,3.401197,,775.0


In [104]:
full_df['name'].value_counts()

Bundle                                      5
Dress                                       4
Boots                                       4
Black booties                               2
Tommy Hilfiger                              2
Giffin 25 rdta full tank kit                2
Blouse                                      2
Pandora bracelet                            2
Kate Spade Wallet                           2
Lululemon crops                             2
Lularoe OS leggings                         2
Michael Kors                                2
Coach purse                                 2
3 packs frozen gift bundle                  1
Gold Tall n Short M8 Lighter Bundle Sale    1
NWT VS ULTIMATE SPORTS BRA 34ddd            1
Oversized gold circle clear lens glasses    1
New boy elf on the shelf ships today        1
iPhone 6 64gb Gold (Sprint)                 1
???SALE??? JUICY COUTURE TOGGLE NECKLACE    1
Off shoulder top L                          1
Instrumental Beauty sonic skin bru

In [105]:
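# sklearn.pipeline provides FeatureUnion, which applies several transformers to the same input and concatenates their outputs column-wise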
vectorizer = FeatureUnion([
    ('name', CountVectorizer(
        ngram_range=(1, 2),
        max_features=50000,
        preprocessor=build_preprocessor('name'))),
#     ('category_name', CountVectorizer(
#         token_pattern='.+',
#         preprocessor=build_preprocessor('category_name'))),
    ('subcat_0', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('subcat_0'))),
    ('subcat_1', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('subcat_1'))),
    ('subcat_2', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('subcat_2'))),
    ('brand_name', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('brand_name'))),
    ('shipping', CountVectorizer(
        token_pattern=r'\d+',
        preprocessor=build_preprocessor('shipping'))),
    ('item_condition_id', CountVectorizer(
        token_pattern=r'\d+',
        preprocessor=build_preprocessor('item_condition_id'))),
    ('desc_len', CountVectorizer(
        token_pattern=r'\d+',
        preprocessor=build_preprocessor('desc_len'))),
    ('name_len', CountVectorizer(
        token_pattern=r'\d+',
        preprocessor=build_preprocessor('name_len'))),
    ('item_description', TfidfVectorizer(
        ngram_range=(1, 3),
        max_features=100000,
        preprocessor=build_preprocessor('item_description'))),
])
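
The per-column trick above can be tried on a tiny frame. This is a minimal sketch, not the kernel's code: `toy_df`, `column_preprocessor` and `toy_vectorizer` are hypothetical names and the three rows are invented for illustration. Each vectorizer receives a whole row array and its preprocessor picks out a single field, so `FeatureUnion` can vectorize every column in one pass.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Toy frame standing in for full_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "name": ["Hoover hand vacuum", "Winter bundle", "Coach purse"],
    "brand_name": ["Hoover", "Old Navy", "Coach"],
    "shipping": ["1", "0", "0"],
})

# Default preprocessing of CountVectorizer (lowercasing, no accent stripping).
default_preprocessor = CountVectorizer().build_preprocessor()

def column_preprocessor(columns, field):
    """Return a preprocessor that extracts `field` from a row array."""
    field_idx = list(columns).index(field)
    return lambda row: default_preprocessor(row[field_idx])

toy_vectorizer = FeatureUnion([
    # Free text: unigrams and bigrams.
    ("name", CountVectorizer(
        ngram_range=(1, 2),
        preprocessor=column_preprocessor(toy_df.columns, "name"))),
    # Categorical text: '.+' keeps the whole value as a single token.
    ("brand_name", CountVectorizer(
        token_pattern=r".+",
        preprocessor=column_preprocessor(toy_df.columns, "brand_name"))),
    # Numeric-looking strings: one token per run of digits.
    ("shipping", CountVectorizer(
        token_pattern=r"\d+",
        preprocessor=column_preprocessor(toy_df.columns, "shipping"))),
])

# Each row of toy_df.values is passed as one "document" to every vectorizer.
X_toy = toy_vectorizer.fit_transform(toy_df.values)
print(X_toy.shape)
```

Using `token_pattern='.+'` turns a `CountVectorizer` into a one-hot encoder for a categorical column, which is why the same class can serve both the text and the category fields.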

In [106]:
full_df.values

array([['Hoover', 'Home/Home Appliances/Vacuums & Floor Care', '35', ...,
        2.5649493574615367, nan, 1503.0],
       ['Old Navy', 'Kids/Girls 0-24 Mos/One-Pieces', '10', ...,
        2.1972245773362196, nan, 553.0],
       ['American Boy & Girl', 'Kids/Toys/Dolls & Accessories', '5', ...,
        3.4011973816621555, nan, 775.0],
       ..., 
       ['PINK', "Women/Women's Handbags/Totes & Shoppers", '6', ..., nan,
        997.0, nan],
       ['Funko', 'Kids/Toys/Action Figures & Statues', '7', ..., nan,
        998.0, nan],
       ['missing', 'Vintage & Collectibles/Trading Cards/Sports', '7', ...,
        nan, 999.0, nan]], dtype=object)

In [107]:
X = vectorizer.fit_transform(full_df.values)

X_train = X[:n_trains]
Y_train = train_df.target.values.reshape(-1, 1)

X_dev = X[n_trains:n_trains+n_devs]
Y_dev = dev_df.target.values.reshape(-1, 1)

X_test = X[n_trains+n_devs:]
print(X.shape, X_train.shape, X_dev.shape, X_test.shape)

(2999, 112822) (1979, 112822) (20, 112822) (1000, 112822)


In [108]:
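# Fit the Ridge models on the training set
## The cross-validated Ridge model does slightly better than the plain one, but even the minimum of 2-fold CV still takes 4~5 minutes. Under the strict time limit, more folds are impractical. A plain Ridge regression takes only about 30s.
#Training data: 
##X : {array-like, sparse matrix}, shape = [n_samples, n_features]
#Target values
##y : array-like, shape = [n_samples] or [n_samples, n_targets]
# The input here is a CSR sparse matrix
#gcv_mode (note: the options below are documented for RidgeCV's gcv_mode, not for Ridge's solver)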
##'auto' : use svd if n_samples > n_features or when X is a sparse
##         matrix, otherwise use eigen
##'svd' : force computation via singular value decomposition of X
##        (does not work for sparse matrices)
##'eigen' : force computation via eigendecomposition of X^T X
#normalize : boolean, optional, default False
#This parameter is ignored when fit_intercept is set to False. 
#If True, the regressors X will be normalized before regression by subtracting the mean 
#and dividing by the l2-norm. 
#If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
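# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2; omit the argument on newer versions.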
ridge_model = Ridge(
    solver='auto', fit_intercept=True, alpha=1.0,
    max_iter=100, normalize=False, tol=0.05, random_state = 1,
)
ridge_modelCV = RidgeCV(
    fit_intercept=True, alphas=[5.0],
    normalize=False, cv = 2, scoring='neg_mean_squared_error',
)
ridge_model.fit(X_train, Y_train)
ridge_modelCV.fit(X_train, Y_train)

RidgeCV(alphas=[5.0], cv=2, fit_intercept=True, gcv_mode=None,
    normalize=False, scoring='neg_mean_squared_error',
    store_cv_values=False)

In [114]:
#score():The coefficient R^2 is defined as (1 - u/v), 
#where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
# and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 
ridge_model.score(X_train, Y_train)

0.97646890936611108
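
The definition of `score()` quoted in the comment can be checked by hand. The sketch below uses synthetic data (`X_demo` and `y_demo` are hypothetical, not the competition data) and compares the manual $1 - u/v$ with what `Ridge.score` returns.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)

model = Ridge(alpha=1.0).fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# u: residual sum of squares, v: total sum of squares.
u = ((y_demo - y_pred) ** 2).sum()
v = ((y_demo - y_demo.mean()) ** 2).sum()
r2_manual = 1.0 - u / v

# score() returns the same coefficient of determination.
r2_sklearn = model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```

Note that the 0.976 above is computed on the training rows, so it mainly shows how closely the model fits data it has already seen. With 112822 features and only 1979 training rows in this sample run, the dev-set error is the more informative number.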

In [109]:
# Evaluate the Ridge models on the dev set
Y_dev_preds_ridge = ridge_model.predict(X_dev)
Y_dev_preds_ridge = Y_dev_preds_ridge.reshape(-1, 1)
print("RMSL error on dev set:", rmsle(Y_dev, Y_dev_preds_ridge))
Y_dev_preds_ridgeCV = ridge_modelCV.predict(X_dev)
Y_dev_preds_ridgeCV = Y_dev_preds_ridgeCV.reshape(-1, 1)
print("CV RMSL error on dev set:", rmsle(Y_dev, Y_dev_preds_ridgeCV))
# Predict on the test set

ridge_preds = ridge_model.predict(X_test)
ridge_preds = np.expm1(ridge_preds)
ridgeCV_preds = ridge_modelCV.predict(X_test)
ridgeCV_preds = np.expm1(ridgeCV_preds)

RMSL error on dev set: 0.554456598655
CV RMSL error on dev set: 0.5674141717
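
Why `np.expm1` is applied to the test predictions: the `target` column shown earlier is `log1p(price)` (for example, a price of 12.0 maps to 2.564949), so the models predict in log space. The sketch below assumes that `rmsle` is a plain RMSE on those log-space values; the function names `rmsle_log_space` and `rmsle_raw` are hypothetical and the prices are invented.

```python
import numpy as np

def rmsle_log_space(y_log_true, y_log_pred):
    """RMSE between values that are already log1p-transformed."""
    return np.sqrt(np.mean(np.square(y_log_pred - y_log_true)))

def rmsle_raw(price_true, price_pred):
    """RMSLE computed directly from untransformed prices."""
    diff = np.log1p(price_pred) - np.log1p(price_true)
    return np.sqrt(np.mean(np.square(diff)))

# Invented prices, for illustration only.
prices_true = np.array([12.0, 8.0, 29.0])
prices_pred = np.array([10.0, 9.5, 35.0])

# What the model sees and predicts: log1p of the price.
log_true = np.log1p(prices_true)
log_pred = np.log1p(prices_pred)

# expm1 inverts log1p, mapping predictions back to dollar prices.
recovered = np.expm1(log_pred)

err_log = rmsle_log_space(log_true, log_pred)
err_raw = rmsle_raw(prices_true, recovered)
print(err_log, err_raw)
```

Both numbers agree, which is why the model can be trained and evaluated with an ordinary squared-error objective in log space and only converted back to prices for the submission.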


### Combining the predictions of several models and finding the best weights

In [110]:
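# Evaluate the blended model on the dev set
## Combine the three predictions into one. Rather than a plain average, the blend uses ratios to weight the 3 models. A simple loop tries every candidate pair of ratios to find the best ones on the dev set. It is not the most computationally efficient loop, but it runs in about 2 seconds, so it hardly matters.
def aggregate_predicts3(Y1, Y2, Y3, ratio1, ratio2):
    # All three prediction arrays must line up element by element
    assert Y1.shape == Y2.shape == Y3.shape
    return Y1 * ratio1 + Y2 * ratio2 + Y3 * (1.0 - ratio1-ratio2)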

# Y_dev_preds = aggregate_predicts3(Y_dev_preds_rnn, Y_dev_preds_ridgeCV, Y_dev_preds_ridge, 0.4, 0.3)
# print("RMSL error for RNN + Ridge + RidgeCV on dev set:", rmsle(Y_dev, Y_dev_preds))
#ratio optimum finder for 3 models
best1 = 0
best2 = 0
lowest = 0.99
for i in range(100):
    for j in range(100):
        r = i*0.01
        r2 = j*0.01
        if r+r2 < 1.0:
            Y_dev_preds = aggregate_predicts3(Y_dev_preds_rnn, Y_dev_preds_ridgeCV, Y_dev_preds_ridge, r, r2)
            fpred = rmsle(Y_dev, Y_dev_preds)
            if fpred < lowest:
                best1 = r
                best2 = r2
                lowest = fpred
#             print(str(r)+"-RMSL error for RNN + Ridge + RidgeCV on dev set:", fpred)
Y_dev_preds = aggregate_predicts3(Y_dev_preds_rnn, Y_dev_preds_ridgeCV, Y_dev_preds_ridge, best1, best2)

print(best1)
print(best2)
print("(Best) RMSL error for RNN + Ridge + RidgeCV on dev set:", rmsle(Y_dev, Y_dev_preds))

0.04
0.0
(Best) RMSL error for RNN + Ridge + RidgeCV on dev set: 0.543152766234
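
The same weight search can be written as a small reusable function. This is a sketch under assumptions, not the kernel's code: `best_blend_weights` and `rmse` are hypothetical names, the predictions are synthetic, and unlike the loop above it also tries combinations where the third weight is exactly zero.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error (equals RMSLE when inputs are in log space)."""
    return float(np.sqrt(np.mean(np.square(y_pred - y_true))))

def best_blend_weights(y_true, p1, p2, p3, steps=100):
    """Grid-search weights (w1, w2, 1 - w1 - w2) that minimize RMSE."""
    best = (0.0, 0.0, rmse(y_true, p3))
    for i in range(steps + 1):
        # j is capped so that w1 + w2 never exceeds 1.
        for j in range(steps + 1 - i):
            w1, w2 = i / steps, j / steps
            blended = w1 * p1 + w2 * p2 + (1.0 - w1 - w2) * p3
            score = rmse(y_true, blended)
            if score < best[2]:
                best = (w1, w2, score)
    return best

# Synthetic dev-set targets and three noisy "model" predictions.
rng = np.random.RandomState(42)
y_true = rng.rand(200, 1)
p1 = y_true + 0.10 * rng.randn(200, 1)
p2 = y_true + 0.20 * rng.randn(200, 1)
p3 = y_true + 0.30 * rng.randn(200, 1)

w1, w2, score = best_blend_weights(y_true, p1, p2, p3)
print(w1, w2, score)
```

Because the weights are tuned on the dev set itself, they are only as reliable as that set is large. In the sample run above the dev set has just 20 rows, so the reported best ratios (0.04 and 0.0) should be read as noisy.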


### Writing the best result to a file

In [111]:
# best predicted submission
preds = aggregate_predicts3(rnn_preds, ridgeCV_preds, ridge_preds, best1, best2)
submission = pd.DataFrame({
        "test_id": test_df.test_id,
        "price": preds.reshape(-1),
})
submission.to_csv("./rnn_ridge_submission_best.csv", index=False)