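### Data Processing
- The original dataset is downloaded directly from Kaggle, extracted, and preprocessed into the `train/cat`, `train/dog`, and `test/none` directories (see the readme for details).
- The project also uses an additional [supplementary dataset](http://www.robots.ox.ac.uk/%7Evgg/data/pets/).
- After extraction, all of its files sit in the `images` folder and are named `{breed}_{index}.jpg`, for example `Abyssinian_1.jpg`.
- The dataset's official documentation lists the species of every breed. Abyssinian, for instance, is a cat, so `Abyssinian_1.jpg` belongs in `train/cat`.
- The supplementary dataset contains 7390 images in total: 2400 cats and 4990 dogs.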
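
The naming rule above is enough to recover a breed, and from it the species, from a file name alone. The following is a minimal sketch, assuming the breed is everything before the last underscore. The helper name `species_of` and the two shortened breed sets are illustrative assumptions, not part of the project:

```python
import os

# Shortened breed sets for illustration only; the project's full lists are longer.
CAT_BREEDS = {"Abyssinian", "Bengal", "Maine_Coon"}
DOG_BREEDS = {"beagle", "pug", "american_pit_bull_terrier"}

def species_of(filename):
    """Return 'cat', 'dog', or None for a name like 'Abyssinian_1.jpg'."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    # Split on the LAST underscore, because breed names may contain underscores.
    breed, _, index = stem.rpartition("_")
    if not index.isdigit():
        return None
    if breed in CAT_BREEDS:
        return "cat"
    if breed in DOG_BREEDS:
        return "dog"
    return None

print(species_of("Abyssinian_1.jpg"))                  # cat
print(species_of("american_pit_bull_terrier_12.jpg"))  # dog
```

Matching the whole breed name rather than a substring avoids accidental hits if one breed name ever happens to be contained in another.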

#### Image Directory Structure
The current image directory layout is:
```
data
 ├── images [7390 images]
 ├── test
 │   └── none [12500 images]
 └── train
     ├── cat [12500 images]
     └── dog [12500 images]
```
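The 7390 images under `images` need to be sorted into the `cat` and `dog` subdirectories of `train` according to the official species labels. If that succeeds, the new dataset should contain 12500 + 2400 = 14900 cat images and 12500 + 4990 = 17490 dog images, and the directory layout should look like this: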
```
data
 ├── test
 │   └── none [12500 images]
 └── train
     ├── cat [14900 images]
     └── dog [17490 images]
```

In [9]:
import os
from shutil import copyfile

dog_breeds = ['american_bulldog', 'american_pit_bull_terrier','basset_hound','beagle','boxer',
             'chihuahua','english_cocker_spaniel','english_setter','german_shorthaired','great_pyrenees',
             'havanese','japanese_chin','keeshond','leonberger','miniature_pinscher','newfoundland','pomeranian',
             'pug','saint_bernard','samoyed','scottish_terrier','shiba_inu','staffordshire_bull_terrier',
             'wheaten_terrier','yorkshire_terrier']

cat_breeds = ['Abyssinian','Bengal','Birman','Bombay','British_Shorthair','Egyptian_Mau','Maine_Coon','Persian',
              'Ragdoll','Russian_Blue','Siamese','Sphynx']

src = 'data/images/'
dog_dest = 'data/train/dog/'
cat_dest = 'data/train/cat/'

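def handle_extra_data(dest, breeds):
    """Copy every file in src whose name contains one of the given breed names into dest."""
    for name in os.listdir(src):
        path = os.path.join(src, name)
        if os.path.isfile(path) and any(breed in name for breed in breeds):
            copyfile(path, os.path.join(dest, name))

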
print("Total extra images: ", len([name for name in os.listdir(src) if os.path.isfile(os.path.join(src, name))]))
print("Total cat images: ", len([name for name in os.listdir(cat_dest) if os.path.isfile(os.path.join(cat_dest, name))]))
print("Total dog images: ", len([name for name in os.listdir(dog_dest) if os.path.isfile(os.path.join(dog_dest, name))]))

handle_extra_data(dog_dest, dog_breeds)
handle_extra_data(cat_dest, cat_breeds)


print("After adding extra images:")
print("Total cat images: ", len([name for name in os.listdir(cat_dest) if os.path.isfile(os.path.join(cat_dest, name))]))
print("Total dog images: ", len([name for name in os.listdir(dog_dest) if os.path.isfile(os.path.join(dog_dest, name))]))

Total extra images:  7390
Total cat images:  12500
Total dog images:  12500
After adding extra images:
Total cat images:  14900
Total dog images:  17490


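### Exporting Deep Features
- Export deep features from VGG16, VGG19, ResNet50, Xception, and InceptionV3 for the current training and test sets.
- VGG16, VGG19, and ResNet50 expect 224×224 input images.
- Xception and InceptionV3 expect 299×299 input images.
- All images first go through each model's own `preprocess_input`. For Xception and InceptionV3 this scales pixel values to the range -1 to 1; for VGG16, VGG19, and ResNet50 it converts RGB to BGR and subtracts the ImageNet per-channel means.
- A global average pooling layer is then added on top of each network, which both shrinks the exported feature files and helps guard against overfitting.
- Finally, Keras's `ImageDataGenerator` reads the images from disk in batches. No augmentation options are set, so the features come from the unmodified images.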
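
To make the scaling and pooling steps concrete, here is a small pure-Python sketch. It is not the Keras implementation: the function names are hypothetical, the feature map is a toy nested list, and only the -1 to 1 scaling variant is shown.

```python
def scale_to_unit_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0

def global_average_pool(feature_map):
    """Collapse a height x width x channels nested list to one mean per channel."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[r][c][k] for r in range(height) for c in range(width))
        / (height * width)
        for k in range(channels)
    ]

# Toy 2 x 2 feature map with 3 channels (VGG16 yields 7 x 7 x 512 for a 224 x 224 input).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]],
]
pooled = global_average_pool(toy_map)

print(scale_to_unit_range(0), scale_to_unit_range(255))  # -1.0 1.0
print(pooled)                                            # [4.0, 1.0, 2.0]
print(7 * 7 * 512, "->", 512)  # values stored per image before and after pooling
```

Pooling keeps one number per channel, which is why the saved feature files stay small: 512 values per image instead of 25088 in the VGG16 case.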

In [1]:
from keras.models import *
from keras.layers import *
from keras.applications import *
from keras.preprocessing.image import *

import time
import h5py

train_data_path = 'data/train/'
test_data_path = 'data/test/'

def save_bottleneck_features(MODEL, image_size, module_name, preprocess):
    """Run every train/test image through a pretrained network and save the pooled features."""
    start_time = time.time()

    width = image_size[0]
    height = image_size[1]
    input_tensor = Input((height, width, 3))
    # Apply the model-specific preprocessing inside the graph
    x = Lambda(preprocess)(input_tensor)

    # Pretrained convolutional base without the classifier head, followed by global average pooling
    base_model = MODEL(input_tensor=x, weights='imagenet', include_top=False)
    model = Model(base_model.input, GlobalAveragePooling2D()(base_model.output))

    # shuffle=False keeps the feature rows aligned with generator.classes and generator.filenames
    gen = ImageDataGenerator()
    train_generator = gen.flow_from_directory(train_data_path, image_size, shuffle=False, batch_size=36)
    test_generator = gen.flow_from_directory(test_data_path, image_size, shuffle=False, batch_size=36, class_mode=None)

    train = model.predict_generator(train_generator)
    test = model.predict_generator(test_generator)
    # Open in 'w' mode explicitly; newer h5py versions default to read-only
    with h5py.File("bottleneck_features/{}_bottleneck_features.h5".format(module_name), 'w') as h:
        h.create_dataset("train", data=train)
        h.create_dataset("test", data=test)
        h.create_dataset("label", data=train_generator.classes)

    end_time = time.time()

    print("{} extrac features total consumed: {} seconds".format(module_name, end_time - start_time))

Using TensorFlow backend.


In [2]:
save_bottleneck_features(VGG16, (224, 224), 'VGG16', vgg16.preprocess_input)

Found 32390 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
VGG16 extrac features total consumed: 258.84423875808716 seconds


In [3]:
save_bottleneck_features(VGG19, (224, 224), 'VGG19', vgg19.preprocess_input)

Found 32390 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
VGG19 extrac features total consumed: 302.42244029045105 seconds


In [4]:
save_bottleneck_features(ResNet50, (224, 224), 'ResNet50', resnet50.preprocess_input)

Found 32390 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
ResNet50 extrac features total consumed: 253.90742802619934 seconds


In [2]:
save_bottleneck_features(InceptionV3, (299, 299), 'InceptionV3', inception_v3.preprocess_input)

Found 32390 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
InceptionV3 extrac features total consumed: 344.6618912220001 seconds


In [3]:
save_bottleneck_features(Xception, (299, 299), 'Xception', xception.preprocess_input)

Found 32390 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.
Xception extrac features total consumed: 498.6959562301636 seconds


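### Transfer Learning
- Deep feature extraction for all of the models above is now complete.
- VGG16 took about 4 min 19 s, VGG19 about 5 min 2 s, ResNet50 about 4 min 14 s, InceptionV3 about 5 min 45 s, and Xception about 8 min 19 s.
- With these features in hand, the new model only needs its final layers to be built and trained.
- It starts with a dropout layer with a rate of 0.5, followed by a single fully connected layer that performs the classification.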
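
To see how small the trainable part is, the arithmetic can be spelled out. This sketch assumes the five pooled feature vectors are concatenated per image and uses the standard channel counts of each network's last convolutional block; the dictionary name is hypothetical:

```python
# Length of the pooled feature vector per model (include_top=False + global average pooling).
feature_dims = {
    "VGG16": 512,
    "VGG19": 512,
    "ResNet50": 2048,
    "InceptionV3": 2048,
    "Xception": 2048,
}

input_dim = sum(feature_dims.values())
# Dropout has no weights; Dense(1) has one weight per input plus one bias.
trainable_params = input_dim * 1 + 1

print(input_dim)         # 7168
print(trainable_params)  # 7169
```

With only a few thousand parameters to fit on precomputed features, training takes seconds rather than the minutes needed for feature extraction.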

In [2]:
import h5py
import numpy as np
from keras.models import *
from keras.layers import *
import time

X_train = []
X_test = []
bottle_neck_files = ["bottleneck_features/VGG16_bottleneck_features.h5",
                    "bottleneck_features/VGG19_bottleneck_features.h5", 
                    "bottleneck_features/ResNet50_bottleneck_features.h5",
                    "bottleneck_features/InceptionV3_bottleneck_features.h5",
                    "bottleneck_features/Xception_bottleneck_features.h5"]

for filename in bottle_neck_files:
    with h5py.File(filename, 'r') as h:
        X_train.append(np.array(h['train']))
        X_test.append(np.array(h['test']))
        y_train = np.array(h['label'])

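X_train = np.concatenate(X_train, axis=1)
X_test = np.concatenate(X_test, axis=1)

# The features were exported in directory order (all cats, then all dogs), and
# validation_split takes the last 20% of the samples before any shuffling,
# so shuffle features and labels together first.
np.random.seed(2017)  # arbitrary seed, only for reproducibility
perm = np.random.permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]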

print(X_train.shape[1:])

input_tensor = Input(X_train.shape[1:])
x = Dropout(0.5)(input_tensor)
x = Dense(1, activation='sigmoid')(x)
model = Model(input_tensor, x)

model.compile(optimizer='adadelta', loss='binary_crossentropy', metrics=['accuracy'])

start_time = time.time()

model.fit(X_train, y_train, batch_size=128, epochs=10, validation_split=0.2, verbose=1)

end_time = time.time()
    
print("Trainning model total consumed: {} seconds".format(end_time - start_time))

(7168,)
Train on 25912 samples, validate on 6478 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Trainning model total consumed: 24.594088792800903 seconds
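
One detail worth keeping in mind with `validation_split=0.2`: Keras takes the validation samples from the end of the arrays, before any per-epoch shuffling. The sketch below mimics that rule in plain Python (the helper `keras_style_split` is hypothetical) to show what the split would contain if the labels are still in the generator's directory order, cats first and then dogs:

```python
def keras_style_split(samples, validation_split):
    """Split the way validation_split does: the tail of the data, taken before shuffling."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

# Labels in directory order: 14900 cats (class 0) followed by 17490 dogs (class 1).
labels = [0] * 14900 + [1] * 17490
train_part, val_part = keras_style_split(labels, 0.2)

print(len(train_part), len(val_part))  # 25912 6478
print(set(val_part))                   # {1}
```

Under that assumption the validation set would contain only dogs, so shuffling features and labels together before calling `fit` gives a more representative validation estimate.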


In [7]:
import os
import pandas as pd
from keras.preprocessing.image import *

train_data_path = 'data/train/'
test_data_path = 'data/test/'

y_pred = model.predict(X_test, verbose=1)
# Clip the probabilities away from exactly 0 and 1
y_pred = y_pred.clip(min=0.005, max=0.995)

df = pd.read_csv("data/sample_submission.csv")

# The generator is only used to recover the order in which the test files were read
gen = ImageDataGenerator()
test_generator = gen.flow_from_directory(test_data_path, (224, 224), shuffle=False, batch_size=32, class_mode=None)

for i, fname in enumerate(test_generator.filenames):
    # The numeric file stem is the submission id; os.path handles both '/' and '\\' separators
    index = int(os.path.splitext(os.path.basename(fname))[0])
    df.at[index-1, 'label'] = y_pred[i, 0]

df.to_csv('data/pred.csv', index=False)
df.head(10)

Found 12500 images belonging to 1 classes.


| | id | label |
|---|---|---|
| 0 | 1 | 0.995 |
| 1 | 2 | 0.995 |
| 2 | 3 | 0.995 |
| 3 | 4 | 0.995 |
| 4 | 5 | 0.005 |
| 5 | 6 | 0.005 |
| 6 | 7 | 0.005 |
| 7 | 8 | 0.005 |
| 8 | 9 | 0.005 |
| 9 | 10 | 0.005 |
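
The clipping to the range 0.005 to 0.995 is easiest to motivate with numbers. The sketch below assumes the submission is scored with log loss, which the text above does not state, and uses hypothetical helper names:

```python
import math

def log_loss_single(y_true, p):
    """Binary cross-entropy for one sample with predicted probability p of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def clip(p, low=0.005, high=0.995):
    """Keep a probability inside [low, high]."""
    return max(low, min(high, p))

# A confidently wrong prediction: true label 1, predicted probability almost 0.
raw = 1e-7
print(round(log_loss_single(1, raw), 2))        # 16.12
print(round(log_loss_single(1, clip(raw)), 2))  # 5.3

# A confidently right prediction loses very little from clipping.
print(round(log_loss_single(1, clip(0.9999)), 4))  # 0.005
```

Under that assumption, clipping trades a tiny cost on correct predictions for a much smaller penalty on the few confident mistakes.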
