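### Data processing
- The raw dataset is downloaded directly from Kaggle. After extraction, the train directory contains 25,000 images and the test directory contains 12,500 images.
- The images are loaded with the flow_from_directory method of Keras ImageDataGenerator, which reads images from subfolders (one per class), so both the training and the test images have to be moved into subdirectories.
- The test set is simple: create a single subdirectory and move every file into it.
- The training set needs two subdirectories, one for the dog images and one for the cat images.
- Training file names follow the pattern {class}.{index}.jpg, for example cat.1.jpg or dog.1.jpg, so the naming rule can be used to sort the training images.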

The current image directory structure is:
```
data
 ├── test   [12500 images]
 └── train  [25000 images]
```
After preprocessing, the image directory structure is:
```
data
 ├── test
 │   └── none [12500 images]
 └── train
     ├── cat  [12500 images]
     └── dog  [12500 images]
```
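
As a small illustration of the naming rule, the class can be read from the part of the file name before the first dot. This is a sketch rather than the notebook's own code: `label_from_filename` is a hypothetical helper, and it assumes every training file follows `{class}.{index}.jpg`.

```python
def label_from_filename(name):
    """Return 'cat' or 'dog' for names like 'cat.1.jpg', and None for anything else."""
    parts = name.split('.')
    if (len(parts) == 3 and parts[0] in ('cat', 'dog')
            and parts[1].isdigit() and parts[2] == 'jpg'):
        return parts[0]
    return None

for name in ['cat.1.jpg', 'dog.12499.jpg', 'notes.txt']:
    print(name, '->', label_from_filename(name))
```

Matching the whole pattern avoids misfiling a file whose name merely contains "cat" or "dog" somewhere.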

In [10]:
# Data processing code
import os
from shutil import move

train_src = 'data/train/'
test_src = 'data/test/'
dog_dest = 'data/train/dog/'
cat_dest = 'data/train/cat/'
test_dest = 'data/test/none/'

# Create the subdirectories
os.makedirs(dog_dest, exist_ok=True)
os.makedirs(cat_dest, exist_ok=True)
os.makedirs(test_dest, exist_ok=True)

# Move the test images (top-level files only) into the subfolder
for name in os.listdir(test_src):
    if name.endswith('.jpg'):
        move(os.path.join(test_src, name), os.path.join(test_dest, name))

# Move each training image into the subfolder that matches its file name prefix
for name in os.listdir(train_src):
    if not name.endswith('.jpg'):
        continue
    if name.startswith('cat.'):
        move(os.path.join(train_src, name), os.path.join(cat_dest, name))
    elif name.startswith('dog.'):
        move(os.path.join(train_src, name), os.path.join(dog_dest, name))

print("cat images: ", 
      len([name for name in os.listdir(cat_dest) if os.path.isfile(os.path.join(cat_dest, name))]))
print("dog images: ", 
      len([name for name in os.listdir(dog_dest) if os.path.isfile(os.path.join(dog_dest, name))]))
print("test images: ", 
      len([name for name in os.listdir(test_dest) if os.path.isfile(os.path.join(test_dest, name))]))

cat images:  12500
dog images:  12500
test images:  12500


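### Exporting deep features
- Export the deep (bottleneck) features of VGG16, VGG19, ResNet50, Xception, and InceptionV3 for the current training and test sets.
- VGG16, VGG19, and ResNet50 expect 224 × 224 input images; Xception and InceptionV3 expect 299 × 299.
- First, every image goes through the model's own preprocess_input function. For Xception and InceptionV3 this scales the pixel values to the range -1 to 1; for VGG16, VGG19, and ResNet50 it subtracts the per-channel ImageNet mean instead, without rescaling to that range.
- Next, a global average pooling layer is appended. It shrinks the exported feature files and also helps against overfitting.
- Finally, Keras ImageDataGenerator is used to export the deep features as arrays, which are stored on local disk for the model training that follows.
- Time taken to export the deep features for each model:
  - VGG16: about 3 min 25 s
  - VGG19: about 3 min 52 s
  - ResNet50: about 3 min 30 s
  - InceptionV3: about 4 min 35 s
  - Xception: about 6 min 56 s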
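
To make the file-size argument concrete, the sketch below compares how much float32 data would be stored with and without global average pooling. The feature-map shapes are the ones these architectures are generally documented to produce with include_top=False at the input sizes above; they are assumptions here and are worth verifying with model.output_shape.

```python
# (height, width, channels) of the last feature map; assumed values, check model.output_shape
feature_maps = {
    'VGG16': (7, 7, 512),
    'VGG19': (7, 7, 512),
    'ResNet50': (7, 7, 2048),
    'InceptionV3': (8, 8, 2048),
    'Xception': (10, 10, 2048),
}

n_images = 25000 + 12500   # train + test
bytes_per_value = 4        # float32

def size_mb(values_per_image):
    """Storage in megabytes for n_images feature vectors of the given length."""
    return n_images * values_per_image * bytes_per_value / 1e6

for name, (h, w, c) in feature_maps.items():
    raw = size_mb(h * w * c)   # storing the full feature map
    pooled = size_mb(c)        # storing one average per channel
    print("{:12s} raw: {:9.1f} MB   pooled: {:6.1f} MB".format(name, raw, pooled))
```

Pooling shrinks each file by a factor of height × width, which is 49 for the 224 × 224 models.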

In [1]:
from keras.models import *
from keras.layers import *
from keras.applications import *
from keras.preprocessing.image import *

import os
import time
import h5py

train_data_path = 'data/train/'
test_data_path = 'data/test/'

def save_bottleneck_features(MODEL, image_size, module_name, preprocess):
    """Export the globally average-pooled features of a pretrained MODEL to an HDF5 file."""
    start_time = time.time()

    width = image_size[0]
    height = image_size[1]
    input_tensor = Input((height, width, 3))
    # Apply the model-specific preprocessing inside the graph
    x = Lambda(preprocess)(input_tensor)

    # Pretrained convolutional base without its classifier, followed by global average pooling
    base_model = MODEL(input_tensor=x, weights='imagenet', include_top=False)
    model = Model(base_model.input, GlobalAveragePooling2D()(base_model.output))

    # shuffle=False keeps the feature rows aligned with generator.classes and generator.filenames
    gen = ImageDataGenerator()
    train_generator = gen.flow_from_directory(train_data_path, (height, width), shuffle=False)
    test_generator = gen.flow_from_directory(test_data_path, (height, width), shuffle=False, class_mode=None)

    train = model.predict_generator(train_generator)
    test = model.predict_generator(test_generator)

    # Make sure the output folder exists and open the file in write mode explicitly,
    # because the default mode of h5py.File differs between h5py versions
    os.makedirs("bottleneck_features", exist_ok=True)
    with h5py.File("bottleneck_features/{}_bottleneck_features.h5".format(module_name), 'w') as h:
        h.create_dataset("train", data=train)
        h.create_dataset("test", data=test)
        h.create_dataset("label", data=train_generator.classes)

    end_time = time.time()

    print("{} extract features total consumed: {} seconds".format(module_name, end_time - start_time))

Using TensorFlow backend.


In [2]:
save_bottleneck_features(VGG16, (224, 224), 'VGG16', vgg16.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
VGG16 extrac features total consumed: 205.36566758155823 seconds


In [2]:
save_bottleneck_features(VGG19, (224, 224), 'VGG19', vgg19.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
VGG19 extrac features total consumed: 232.02474784851074 seconds


In [3]:
save_bottleneck_features(ResNet50, (224, 224), 'ResNet50', resnet50.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
ResNet50 extrac features total consumed: 210.49121832847595 seconds


In [4]:
save_bottleneck_features(InceptionV3, (299, 299), 'InceptionV3', inception_v3.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
InceptionV3 extrac features total consumed: 274.9980471134186 seconds


In [5]:
save_bottleneck_features(Xception, (299, 299), 'Xception', xception.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Xception extrac features total consumed: 416.50056076049805 seconds


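### Model training
- Three functions are defined here. retrieve_features reads the deep feature files from local disk and extracts three arrays from them: X_train, X_test, and y_train.
- train_model builds and trains our own model. The model has two layers: a BatchNormalization layer, intended to reduce overfitting, and a Dense layer that performs the classification.
- generate_submission_csv produces the file that is submitted to Kaggle.
- Because Kaggle uses LogLoss as its evaluation metric, the predicted probabilities are clipped to a minimum and a maximum value (0.005 and 0.995 in the code).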
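
The effect of the clipping can be checked with plain arithmetic. The sketch below uses the 0.005 and 0.995 bounds from generate_submission_csv; `clip` and `sample_logloss` are hypothetical helper names introduced only for this illustration.

```python
import math

LOW, HIGH = 0.005, 0.995   # bounds used in generate_submission_csv

def clip(p, low=LOW, high=HIGH):
    """Limit a probability to the interval [low, high]."""
    return min(max(p, low), high)

def sample_logloss(y_true, p):
    """Log loss of a single prediction p for the binary label y_true."""
    return -math.log(p) if y_true == 1 else -math.log(1.0 - p)

# A confidently wrong prediction: after clipping, the loss is capped at -ln(0.005)
print(sample_logloss(1, 1e-7), sample_logloss(1, clip(1e-7)))
# A confidently right prediction: clipping costs a small amount, -ln(0.995)
print(sample_logloss(1, 0.9999), sample_logloss(1, clip(0.9999)))
```

The trade is a small, fixed penalty on every confident correct answer in exchange for a hard ceiling on the cost of each mistake.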

In [89]:
import h5py
import numpy as np
import pandas as pd
from keras.models import *
from keras.layers import *
from keras.preprocessing.image import *
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from keras.optimizers import *
import time

def retrieve_features(files):
    X_train = []
    X_test = []
    y_train = []
    
    for filename in files:
        with h5py.File(filename, 'r') as h:
            X_train.append(np.array(h['train']))
            X_test.append(np.array(h['test']))
            y_train = np.array(h['label'])
        
    X_train = np.concatenate(X_train, axis=1)
    X_test = np.concatenate(X_test, axis=1)
    
    return X_train, X_test, y_train

    

def train_model(X_train, y_train, epochs, optimizer='adam'):
    # construct model
    input_tensor = Input(X_train.shape[1:])
    x = BatchNormalization()(input_tensor)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(input_tensor, x)
    # compile model
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    # train model
    start_time = time.time()
    model.fit(X_train, y_train, batch_size=200, epochs=epochs, validation_split=0.2, verbose=1)
    end_time = time.time()
    print("Trainning model total consumed:{} seconds".format(end_time - start_time))
    
    return model

def generate_submission_csv(X_test, model):
    start_time = time.time()
    y_pred = model.predict(X_test, verbose=1)
    end_time = time.time()
    print("Predicting model total consumed:{} seconds".format(end_time - start_time))
    y_pred = y_pred.clip(min=0.005, max=0.995)

    df = pd.read_csv("data/sample_submission.csv")

    gen = ImageDataGenerator()
    test_generator = gen.flow_from_directory('data/test/', (224, 224), shuffle=False, batch_size=32, class_mode=None)

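    for i, fname in enumerate(test_generator.filenames):
        # fname looks like 'none/123.jpg' ('none\\123.jpg' on Windows); extract the numeric id
        base = fname.replace('\\', '/').rsplit('/', 1)[-1]
        index = int(base[:base.rfind('.')])
        # y_pred has shape (n, 1); row index - 1 assumes sample_submission.csv lists ids 1..n in order
        df.at[index - 1, 'label'] = y_pred[i, 0]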

    df.to_csv('data/pred.csv', index=None)
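
The id is parsed from each file name because the generator is expected to list files in string order rather than numeric order, so row i of the predictions does not belong to image i + 1. A minimal sketch of that mapping, using a hypothetical `image_id` helper and a made-up list of five file names:

```python
def image_id(fname):
    """Extract 7 from 'none/7.jpg' or from the Windows form 'none\\7.jpg'."""
    base = fname.replace('\\', '/').rsplit('/', 1)[-1]
    return int(base.rsplit('.', 1)[0])

# String sorting puts '10.jpg' and '100.jpg' before '2.jpg'
filenames = sorted('none/{}.jpg'.format(i) for i in (1, 2, 3, 10, 100))
print(filenames)
print([image_id(f) for f in filenames])
```

Handling both separators keeps the parsing independent of the operating system the notebook runs on.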

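### Transfer learning - VGG16
- Trained for 10 epochs, taking about 8.66 seconds in total. The best training accuracy reached 0.9828 and the best validation accuracy reached 0.9658.
- Prediction took about 1.66 seconds. The generated file scored 0.08130 on Kaggle (LogLoss, lower is better).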
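
One point to keep in mind when reading the validation accuracy: Keras takes the validation_split fraction from the end of the arrays, before any shuffling, and the features were exported with shuffle=False, so the rows are ordered class by class. Assuming the usual alphabetical class order (cat = 0, then dog = 1), the last 20% of the rows are all dogs. The sketch below illustrates this with labels only and shows one possible remedy, shuffling the row order once before training; the seed and variable names are arbitrary.

```python
import random

# Labels in the order produced by a non-shuffled generator: all cats, then all dogs
labels = [0] * 12500 + [1] * 12500
n_val = int(len(labels) * 0.2)

# validation_split=0.2 takes the last n_val rows as they are
print("classes in the tail, unshuffled:", set(labels[-n_val:]))

# Shuffle the row order once; the same permutation must be applied to X_train
order = list(range(len(labels)))
random.Random(0).shuffle(order)
shuffled = [labels[i] for i in order]
print("classes in the tail, shuffled:  ", set(shuffled[-n_val:]))
```

With NumPy arrays the same idea is `idx = np.random.permutation(len(y_train))` followed by indexing both `X_train[idx]` and `y_train[idx]`.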

In [10]:
bottleneck_files = ["bottleneck_features/VGG16_bottleneck_features.h5"]

X_train, X_test, y_train = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train, 10)

generate_submission_csv(X_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:8.657156229019165 seconds
Predicting model total consumed:1.6564559936523438 seconds
Found 12500 images belonging to 1 classes.


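### Transfer learning - VGG19
- Trained for 10 epochs, taking about 8.53 seconds in total. The best training accuracy reached 0.9848 and the best validation accuracy reached 0.9728.
- Prediction took about 1.63 seconds. The generated file scored 0.07487 on Kaggle.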

In [11]:
bottleneck_files = ["bottleneck_features/VGG19_bottleneck_features.h5"]

X_train, X_test, y_train = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train, 10)

generate_submission_csv(X_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:8.53214144706726 seconds
Predicting model total consumed:1.6252193450927734 seconds
Found 12500 images belonging to 1 classes.


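### Transfer learning - ResNet50
- Trained for 5 epochs, taking about 5.92 seconds in total. The best training accuracy reached 0.9921 and the best validation accuracy reached 0.9830.
- Prediction took about 0.53 seconds. The generated file scored 0.06036 on Kaggle.
- Only 5 epochs are used because repeated runs showed that the model tends to overfit beyond 5 epochs.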
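
Instead of fixing the epoch count by hand, the stopping point can also be chosen from the validation loss. The sketch below shows the idea in plain Python on a made-up list of validation losses (illustrative numbers, not results from this notebook); `best_epoch` is a hypothetical helper. Keras offers comparable behaviour through its EarlyStopping callback.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss, ending the scan
    once the loss has failed to improve for `patience` epochs in a row."""
    best, best_at, waited = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_at

# Hypothetical validation losses: improving at first, then rising as the model overfits
val_losses = [0.080, 0.062, 0.055, 0.051, 0.049, 0.052, 0.056, 0.061]
print(best_epoch(val_losses))
```

A rule like this only makes sense when the validation rows are representative of both classes.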

In [25]:
bottleneck_files = ["bottleneck_features/ResNet50_bottleneck_features.h5"]

X_train, X_test, y_train = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train, 5)

generate_submission_csv(X_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Trainning model total consumed:5.922510862350464 seconds
Predicting model total consumed:0.5312831401824951 seconds
Found 12500 images belonging to 1 classes.


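### Transfer learning - InceptionV3
- Trained for 3 epochs, taking about 3.84 seconds in total. The best training accuracy reached 0.9943 and the best validation accuracy reached 0.9912.
- Prediction took about 0.70 seconds. The generated file scored 0.04588 on Kaggle.
- Only 3 epochs are used because the model overfits very easily after just 3 epochs.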

In [37]:
bottleneck_files = ["bottleneck_features/InceptionV3_bottleneck_features.h5"]

X_train, X_test, y_train = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train, 3)

generate_submission_csv(X_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Trainning model total consumed:3.8441414833068848 seconds
Predicting model total consumed:0.7031993865966797 seconds
Found 12500 images belonging to 1 classes.


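### Transfer learning - Xception
- Trained for 5 epochs, taking about 8.16 seconds in total. The best training accuracy reached 0.9963 and the best validation accuracy reached 0.9946.
- Prediction took about 1.31 seconds. The generated file scored 0.04630 on Kaggle.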

In [86]:
bottleneck_files = ["bottleneck_features/Xception_bottleneck_features.h5"]

X_train, X_test, y_train = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train, 5)

generate_submission_csv(X_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Trainning model total consumed:8.157132863998413 seconds
Predicting model total consumed:1.3126063346862793 seconds
Found 12500 images belonging to 1 classes.


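### Transfer learning - combining the features of all models
- Trained for 15 epochs, taking about 23.60 seconds in total. The best training accuracy reached 0.9978 and the best validation accuracy reached 0.9944.
- Prediction took about 1.56 seconds. The generated file scored 0.03762 on Kaggle.
- Here we define a custom optimizer with a smaller learning rate, so that the model does not learn too fast and fail to converge properly.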
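
For a sense of scale, the combined model sees the five feature vectors concatenated side by side. The sketch below adds up the widths and counts the parameters of the BatchNormalization + Dense(1) head built by train_model. The per-model widths are the channel counts generally documented for each network's last feature map; they are assumptions here and can be verified against the shapes stored in the .h5 files.

```python
# Feature width per model after global average pooling (assumed; check h['train'].shape[1])
widths = {'VGG16': 512, 'VGG19': 512, 'ResNet50': 2048,
          'InceptionV3': 2048, 'Xception': 2048}

total = sum(widths.values())   # number of columns after np.concatenate(..., axis=1)

bn_trainable = 2 * total       # gamma and beta
bn_non_trainable = 2 * total   # moving mean and moving variance
dense = total + 1              # one weight per input plus a bias

print("concatenated width:", total)
print("trainable parameters:", bn_trainable + dense)
print("non-trainable parameters:", bn_non_trainable)
```

The head stays small compared with the 20,000 training rows it is fitted on, which is consistent with training finishing in seconds.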

In [93]:
bottleneck_files = ["bottleneck_features/VGG16_bottleneck_features.h5",
                    "bottleneck_features/VGG19_bottleneck_features.h5", 
                    "bottleneck_features/ResNet50_bottleneck_features.h5",
                    "bottleneck_features/InceptionV3_bottleneck_features.h5",
                    "bottleneck_features/Xception_bottleneck_features.h5"]

X_train, X_test, y_train = retrieve_features(bottleneck_files)

optimizer = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

model = train_model(X_train, y_train, 15, optimizer)

generate_submission_csv(X_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Trainning model total consumed:23.59627342224121 seconds
Predicting model total consumed:1.5626370906829834 seconds
Found 12500 images belonging to 1 classes.


In [15]:
import math

def logloss(true_label, predicted):
    if true_label == 1:
        return -math.log(predicted)
    else:
        return -math.log(1 - predicted)

In [21]:
# Correct prediction: true label = 1, predicted value 0.999
print(logloss(1, 0.999))
# Correct prediction: true label = 0, predicted value 0.001
print(logloss(0, 0.001))

# Wrong prediction: true label = 1, predicted value 1e-21 (confidently wrong)
print(logloss(1, 0.000000000000000000001))
# Wrong prediction: true label = 0, predicted value 0.6
print(logloss(0, 0.6))

0.0010005003335835344
0.0010005003335835344
48.35428695287496
0.916290731874155
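
The logloss function above is the per-sample term of the metric. Averaged over all n test images, with y_i the true label (1 = dog, 0 = cat) and p_i the predicted probability of a dog, the score can be written as:

```latex
\[
\mathrm{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \Big]
\]
```

A single term with p_i close to 0 for a true label of 1 can outweigh thousands of good predictions, which is what the 48.35 in the output illustrates.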
