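### Data processing
- The original dataset is downloaded directly from Kaggle, unpacked, preprocessed, and placed under train/cat, train/dog, and test/none (see the readme for details)
- The project also provides an [extra dataset](http://www.robots.ox.ac.uk/%7Evgg/data/pets/) with 7390 images in total: 2400 cat images and 4990 dog images
- After unpacking, all files sit in the images folder, with names of the form {breed}_{index}.jpg, for example Abyssinian_1.jpg
- The dataset also comes with an official list stating which species each breed belongs to. Abyssinian, for example, is a cat, so Abyssinian_1.jpg should be sorted into the train/cat directory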
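
The naming scheme described above can be parsed with a few lines of Python. This is only a minimal sketch: the helper names `split_breed` and `target_dir` and the small `SPECIES` lookup are hypothetical, and the lookup lists just a few breeds for illustration.

```python
import os

# Hypothetical, partial lookup for illustration; only a few breeds are listed.
SPECIES = {
    "Abyssinian": "cat",
    "British_Shorthair": "cat",
    "american_bulldog": "dog",
    "pug": "dog",
}

def split_breed(filename):
    """Split '{breed}_{index}.jpg' into (breed, index).

    Breed names may themselves contain underscores, so split on the last one.
    """
    stem, _ext = os.path.splitext(os.path.basename(filename))
    breed, index = stem.rsplit("_", 1)
    return breed, int(index)

def target_dir(filename, train_root="train"):
    """Return the class subdirectory an image belongs in, e.g. 'train/cat'."""
    breed, _ = split_breed(filename)
    return train_root + "/" + SPECIES[breed]

print(split_breed("Abyssinian_1.jpg"))
print(target_dir("British_Shorthair_12.jpg"))
```

Splitting on the last underscore matters because names such as British_Shorthair contain underscores of their own.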

#### Image directory structure
The current image directory structure is as follows:
```
data
 ├── images [7390 images]
 ├── test
 │   └── none [12500 images]
 └── train
     ├── cat [12500 images]
     └── dog [12500 images]
```
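We need to sort the 7390 images under images into the cat and dog subdirectories of train, according to the official breed list. If the sorting succeeds, the new dataset should contain 12500+2400=14900 cat images and 12500+4990=17490 dog images. After processing, the directory structure should look like this: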
```
data
 ├── test
 │   └── none [12500 images]
 └── train
     ├── cat [14900 images]
     └── dog [17490 images]
```
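
The expected counts can be checked against the file system after the images have been sorted. The sketch below is one possible way to do that; the function name `count_images` and the list of accepted extensions are assumptions, not part of the original project.

```python
import os

# Assumed set of image extensions; the datasets described here use .jpg.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def count_images(root):
    """Return {relative_subdirectory: image_count} for each directory under root that holds images."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for f in filenames if f.lower().endswith(IMAGE_EXTENSIONS))
        if n:
            counts[os.path.relpath(dirpath, root)] = n
    return counts

# Only runs when a data directory is present next to the notebook.
if os.path.isdir("data"):
    for subdir, n in sorted(count_images("data").items()):
        print(subdir, n)
```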

In [1]:
import os
from shutil import copyfile

dog_breeds = ['american_bulldog', 'american_pit_bull_terrier','basset_hound','beagle','boxer',
             'chihuahua','english_cocker_spaniel','english_setter','german_shorthaired','great_pyrenees',
             'havanese','japanese_chin','keeshond','leonberger','miniature_pinscher','newfoundland','pomeranian',
             'pug','saint_bernard','samoyed','scottish_terrier','shiba_inu','staffordshire_bull_terrier',
             'wheaten_terrier','yorkshire_terrier']

cat_breeds = ['Abyssinian','Bengal','Birman','Bombay','British_Shorthair','Egyptian_Mau','Maine_Coon','Persian',
              'Ragdoll','Russian_Blue','Siamese','Sphynx']

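src = 'data/images/'
# the extra images are copied to data/local_test/ and serve as a labelled local test set
dog_dest = 'data/local_test/dog/'
cat_dest = 'data/local_test/cat/'

os.makedirs(dog_dest, exist_ok=True)
os.makedirs(cat_dest, exist_ok=True)

def copy_extra_data(dest, breeds):
    """Copy every file under src whose name contains one of the given breed names."""
    for root, dirs, files in os.walk(src):
        for name in files:
            # substring match on the breed, e.g. 'Abyssinian' in 'Abyssinian_1.jpg'
            if any(breed in name for breed in breeds):
                # join with root so that files in subdirectories resolve correctly
                copyfile(os.path.join(root, name), os.path.join(dest, name))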

copy_extra_data(dog_dest, dog_breeds)
copy_extra_data(cat_dest, cat_breeds)

print("Local cat test images: ", 
      len([name for name in os.listdir(cat_dest) if os.path.isfile(os.path.join(cat_dest, name))]))
print("Local dog test images: ", 
      len([name for name in os.listdir(dog_dest) if os.path.isfile(os.path.join(dog_dest, name))]))

Local cat test images:  2400
Local dog test images:  4990


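### Exporting deep features
- Export the deep features of VGG16, VGG19, ResNet50, Xception, and InceptionV3 for the current training and test sets
- VGG16, VGG19, and ResNet50 expect images of size (224, 224)
- Xception and InceptionV3 expect images of size (299, 299)
- First, preprocess all data with each model's own preprocess_input function: for Xception and InceptionV3 this scales pixel values to the range -1 to 1, while for VGG16, VGG19, and ResNet50 it subtracts the ImageNet channel means
- Next, add a global average pooling operation, which both shrinks the exported feature files and helps prevent overfitting
- Finally, use Keras's ImageDataGenerator to read the images from their directories in batches (the code below sets no augmentation options)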
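
To make the effect of the pooling step concrete, here is a pure-Python sketch of global average pooling on a tiny feature map. The function name is hypothetical, and the shape figures in the closing comment assume the standard Keras VGG16 output of 7 x 7 x 512 for a 224 x 224 input, which this notebook does not print.

```python
def global_average_pool(feature_map):
    """Average an H x W x C nested list over H and W, returning C values."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]

# A 2 x 2 feature map with 3 channels.
fmap = [
    [[1.0, 10.0, 0.0], [3.0, 20.0, 0.0]],
    [[5.0, 30.0, 0.0], [7.0, 40.0, 4.0]],
]
print(global_average_pool(fmap))  # [4.0, 25.0, 1.0]

# Assumed VGG16 case: 7 * 7 * 512 = 25088 values per image shrink to 512.
print(7 * 7 * 512, "->", 512)
```

Each channel collapses to a single number, so the stored feature vector depends only on the channel count and not on the spatial size of the last convolutional layer.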

In [30]:
from keras.models import *
from keras.layers import *
from keras.applications import *
from keras.preprocessing.image import *

import time
import h5py
import math

train_data_path = 'data/train/'
test_data_path = 'data/test/'
local_test_data_path = 'data/local_test/'

def save_bottleneck_features(MODEL, image_size, module_name, preprocess):
    
    start_time = time.time()
    
    width = image_size[0]
    height = image_size[1]
    input_tensor = Input((height, width, 3))
    # apply the model-specific preprocessing inside the graph
    x = Lambda(preprocess)(input_tensor)
    
    # pretrained convolutional base without its classifier head, followed by global average pooling
    base_model = MODEL(input_tensor=x, weights='imagenet', include_top=False)
    model = Model(base_model.input, GlobalAveragePooling2D()(base_model.output))

    # shuffle=False keeps the predicted rows in the same order as generator.classes and generator.filenames
    gen = ImageDataGenerator()
    train_generator = gen.flow_from_directory(train_data_path, image_size, shuffle=False)
    test_generator = gen.flow_from_directory(test_data_path, image_size, shuffle=False, class_mode=None)
    local_test_generator = gen.flow_from_directory(local_test_data_path, image_size, shuffle=False, class_mode=None)

    train = model.predict_generator(train_generator)
    test = model.predict_generator(test_generator)
    local_test = model.predict_generator(local_test_generator)
    
    # open in write mode explicitly; recent h5py versions default to read-only
    with h5py.File("bottleneck_features/{}_bottleneck_features.h5".format(module_name), 'w') as h:
        h.create_dataset("train", data=train)
        h.create_dataset("test", data=test)
        h.create_dataset("label", data=train_generator.classes)
        h.create_dataset("local_test", data=local_test)
        h.create_dataset("local_test_label", data=local_test_generator.classes)
        
    end_time = time.time()
    
    print("{} extrac features total consumed: {} seconds".format(module_name, end_time - start_time))

In [3]:
save_bottleneck_features(VGG16, (224, 224), 'VGG16', vgg16.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Found 7390 images belonging to 2 classes.
VGG16 extrac features total consumed: 268.376017332077 seconds


In [4]:
save_bottleneck_features(VGG19, (224, 224), 'VGG19', vgg19.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Found 7390 images belonging to 2 classes.
VGG19 extrac features total consumed: 299.7511031627655 seconds


In [5]:
save_bottleneck_features(ResNet50, (224, 224), 'ResNet50', resnet50.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Found 7390 images belonging to 2 classes.
ResNet50 extrac features total consumed: 251.73025679588318 seconds


In [6]:
save_bottleneck_features(InceptionV3, (299, 299), 'InceptionV3', inception_v3.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Found 7390 images belonging to 2 classes.
InceptionV3 extrac features total consumed: 322.9861743450165 seconds


In [7]:
save_bottleneck_features(Xception, (299, 299), 'Xception', xception.preprocess_input)

Found 25000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Found 7390 images belonging to 2 classes.
Xception extrac features total consumed: 494.8071129322052 seconds


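### Transfer learning
- The deep features for all of the models above have now been extracted
- According to the logs above, VGG16 took about 4 min 28 s, VGG19 about 5 min 0 s, ResNet50 about 4 min 12 s, InceptionV3 about 5 min 23 s, and Xception about 8 min 15 s
- With these deep features we can build a new model, and only the final layer needs to be constructed
- The code below first adds a batch normalization layer and then a fully connected layer with a sigmoid output to do the classification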
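
The extraction times can be read off the logs by converting the logged seconds into minutes and seconds. The helper below is a small sketch for that; `format_duration` is a hypothetical name, and the values are copied from the feature-extraction output earlier in this notebook.

```python
# Durations in seconds, copied from the feature-extraction logs above.
LOGGED_SECONDS = {
    "VGG16": 268.376017332077,
    "VGG19": 299.7511031627655,
    "ResNet50": 251.73025679588318,
    "InceptionV3": 322.9861743450165,
    "Xception": 494.8071129322052,
}

def format_duration(seconds):
    """Format a duration as 'M min S s', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return "{} min {} s".format(minutes, secs)

for name, seconds in LOGGED_SECONDS.items():
    print(name, format_duration(seconds))
```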

In [38]:
import h5py
import numpy as np
import pandas as pd
from keras.models import *
from keras.layers import *
from keras.preprocessing.image import *
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
import time

def retrieve_features(files):
    X_train = []
    X_test = []
    y_train = []
    X_local_test = []
    y_local_test = []
    
    for filename in files:
        with h5py.File(filename, 'r') as h:
            X_train.append(np.array(h['train']))
            X_test.append(np.array(h['test']))
            X_local_test.append(np.array(h['local_test']))
            y_train = np.array(h['label'])
            y_local_test = np.array(h['local_test_label'])
        
    X_train = np.concatenate(X_train, axis=1)
    X_test = np.concatenate(X_test, axis=1)
    X_local_test = np.concatenate(X_local_test, axis=1)
    
    return X_train, X_test, X_local_test, y_train, y_local_test

    

def train_model(X_train, y_train):
    # construct model
    input_tensor = Input(X_train.shape[1:])
    x = BatchNormalization()(input_tensor)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(input_tensor, x)
    # compile model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
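    # train model; shuffle first because validation_split takes the last 20% of the rows,
    # and the exported features are ordered by class (all cats, then all dogs)
    idx = np.random.permutation(len(X_train))
    start_time = time.time()
    model.fit(X_train[idx], y_train[idx], batch_size=200, epochs=10, validation_split=0.2, verbose=1)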
    end_time = time.time()
    print("Trainning model total consumed:{} seconds".format(end_time - start_time))
    
    return model

def evaluate_model(X_local_test, y_local_test, model):
    evaluate = model.evaluate(X_local_test, y_local_test, verbose=1)
    print("Evaluate result, loss:{} , acc:{}".format(evaluate[0], evaluate[1]))


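def generate_submission_csv(X_test, model):

    # predicted probability per test image, flattened from shape (n, 1) to (n,)
    y_pred = model.predict(X_test, verbose=1).ravel()
    #y_pred = y_pred.clip(min=0.005, max=1)

    df = pd.read_csv("data/sample_submission.csv")

    # recreate the generator only to recover the file order used during feature extraction
    gen = ImageDataGenerator()
    test_generator = gen.flow_from_directory('data/test/', (224, 224), shuffle=False, batch_size=32, class_mode=None)

    for i, fname in enumerate(test_generator.filenames):
        # 'none/123.jpg' or 'none\123.jpg' -> 123, whichever path separator the OS uses
        index = int(fname.replace('\\', '/').rsplit('/', 1)[-1].rsplit('.', 1)[0])
        df.at[index-1, 'label'] = y_pred[i]

    df.to_csv('data/pred.csv', index=False)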
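
The submission function looks each file's numeric id up instead of writing predictions row by row. The sketch below illustrates why, under the assumption that the generator lists files in lexicographic order (so 10.jpg comes before 2.jpg). The helper name `image_id` and the prediction values are made up for the example.

```python
def image_id(path):
    """Extract the numeric id from 'none/123.jpg' or 'none\\123.jpg'."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    return int(name.rsplit(".", 1)[0])

# Assumed generator order: lexicographic, not numeric.
filenames = sorted("none/{}.jpg".format(i) for i in range(1, 13))
# Made-up predictions, aligned with `filenames`.
predictions = [k / 100 for k in range(len(filenames))]

# Row k of the submission holds image id k + 1, hence the `id - 1` index.
labels = [None] * len(filenames)
for path, p in zip(filenames, predictions):
    labels[image_id(path) - 1] = p

print([image_id(f) for f in filenames])
print(labels)
```

Writing `predictions` straight into the rows would pair image 10's score with row 2, which is exactly the misalignment the id lookup avoids.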

In [17]:
# train and evaluate using all the models bottleneck_features
bottleneck_files = ["bottleneck_features/VGG16_bottleneck_features.h5",
                    "bottleneck_features/VGG19_bottleneck_features.h5", 
                    "bottleneck_features/ResNet50_bottleneck_features.h5",
                    "bottleneck_features/InceptionV3_bottleneck_features.h5",
                    "bottleneck_features/Xception_bottleneck_features.h5"]


X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:10.235456943511963 seconds
Evaluate result, loss:0.015419765178416724 , acc:0.9952638700947226
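
Concatenating the pooled features along axis 1 gives one long vector per image. The per-model widths below are assumptions based on the standard Keras architectures (they are not printed anywhere in this notebook), so treat the totals as a sanity check rather than a measurement. The helper names are hypothetical.

```python
# Assumed width of each model's globally pooled feature vector.
FEATURE_WIDTH = {
    "VGG16": 512,
    "VGG19": 512,
    "ResNet50": 2048,
    "InceptionV3": 2048,
    "Xception": 2048,
}

def concatenated_width(model_names):
    """Width of the feature vector after concatenating the given models."""
    return sum(FEATURE_WIDTH[name] for name in model_names)

def dense_head_parameters(width):
    """Weights plus one bias for a single-unit dense layer on `width` inputs."""
    return width + 1

all_models = list(FEATURE_WIDTH)
print(concatenated_width(all_models))  # 7168
print(dense_head_parameters(concatenated_width(all_models)))  # 7169
print(concatenated_width(["InceptionV3", "Xception"]))  # 4096
```

A classifier this small is consistent with the roughly ten seconds of training time reported in the logs.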


In [8]:
# train and evaluate using VGG16
bottleneck_files = ["bottleneck_features/VGG16_bottleneck_features.h5"]

X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:10.407723903656006 seconds
Evaluate result, loss:0.08219051772916969 , acc:0.9677943166441136


In [15]:
# train and evaluate using VGG19
bottleneck_files = ["bottleneck_features/VGG19_bottleneck_features.h5"]

X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:7.750809907913208 seconds
Evaluate result, loss:0.09374141918086662 , acc:0.9645466847090663


In [16]:
# train and evaluate using ResNet50
bottleneck_files = ["bottleneck_features/ResNet50_bottleneck_features.h5"]

X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:9.15723466873169 seconds
Evaluate result, loss:0.06548396722878 , acc:0.9783491204330176


In [9]:
# train and evaluate using Xception
bottleneck_files = ["bottleneck_features/Xception_bottleneck_features.h5"]

X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:12.313801765441895 seconds
Evaluate result, loss:0.020657120406509036 , acc:0.9928281461434371


In [11]:
# train and evaluate using Inception
bottleneck_files = ["bottleneck_features/InceptionV3_bottleneck_features.h5"]

X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:11.642153978347778 seconds
Evaluate result, loss:0.023863461442611998 , acc:0.993234100135318


In [19]:
# train and evaluate using the InceptionV3 and Xception bottleneck_features
bottleneck_files = ["bottleneck_features/InceptionV3_bottleneck_features.h5",
                    "bottleneck_features/Xception_bottleneck_features.h5"]


X_train, X_test, X_local_test, y_train, y_local_test = retrieve_features(bottleneck_files)

model = train_model(X_train, y_train)

evaluate_model(X_local_test, y_local_test, model)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed:10.891825914382935 seconds
Evaluate result, loss:0.01649118572826267 , acc:0.9939106901217862
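
The seven evaluation runs are easier to compare side by side. The sketch below sorts them by loss and converts each accuracy into a count of misclassified images; all numbers are copied from the logs above, while the names `RESULTS` and `misclassified` are introduced here for illustration.

```python
LOCAL_TEST_SIZE = 7390  # images in data/local_test, per the generator output above

# (loss, accuracy) on the local test set, copied from the evaluation logs above.
RESULTS = {
    "all five models": (0.015419765178416724, 0.9952638700947226),
    "VGG16": (0.08219051772916969, 0.9677943166441136),
    "VGG19": (0.09374141918086662, 0.9645466847090663),
    "ResNet50": (0.06548396722878, 0.9783491204330176),
    "Xception": (0.020657120406509036, 0.9928281461434371),
    "InceptionV3": (0.023863461442611998, 0.993234100135318),
    "InceptionV3 + Xception": (0.01649118572826267, 0.9939106901217862),
}

def misclassified(accuracy, total=LOCAL_TEST_SIZE):
    """Number of wrongly classified images implied by an accuracy value."""
    return round((1 - accuracy) * total)

# Print the runs from lowest to highest loss.
for name, (loss, acc) in sorted(RESULTS.items(), key=lambda item: item[1][0]):
    print("{:<24} loss={:.4f} acc={:.4f} errors={}".format(name, loss, acc, misclassified(acc)))
```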


In [39]:
generate_submission_csv(X_test, model)

Found 12500 images belonging to 1 classes.


In [49]:
def logloss(true_label, predicted, eps=1e-15):
    # clip so that log(0) can never occur
    p = np.clip(predicted, eps, 1 - eps)
    if true_label == 1:
        return -np.log(p)
    else:
        return -np.log(1 - p)

0.994 0.005


In [55]:
print(math.log(0.005))
print(math.log(0.00000000000000000001))

-5.298317366548036
-46.051701859880914
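
The two logarithms above show why clipping predictions can help a log-loss score: one confident mistake costs about 46 without clipping and about 5.3 with a floor of 0.005. The sketch below works through that arithmetic. The function name `log_loss_single` is hypothetical, and the 0.005 threshold is taken from the commented-out clip line and the cell above, not from a tuned value.

```python
import math

def log_loss_single(true_label, predicted, clip=0.0):
    """Log loss of one prediction, with the probability clipped to [clip, 1 - clip]."""
    p = min(max(predicted, clip), 1.0 - clip)
    return -math.log(p) if true_label == 1 else -math.log(1.0 - p)

# A dog (label 1) that the model calls a cat with near certainty.
confident_miss = 1e-20
print(log_loss_single(1, confident_miss))              # about 46.05
print(log_loss_single(1, confident_miss, clip=0.005))  # about 5.30

# The price of clipping: a correct, very confident prediction is capped at 0.995.
print(log_loss_single(1, 0.999999, clip=0.005))        # about 0.005
```

Clipping trades a small fixed cost on every correct prediction for a much smaller penalty on the rare confident errors.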
