# Avito Text and Image Feature Extraction
### Andrew Ribeiro | May 2018 | Andrew@kexp.io
This notebook helps you extract features from the text and image components of the [Avito Demand Prediction Challenge](https://www.kaggle.com/c/avito-demand-prediction/kernels). All code assumes that your input files and folders are in a folder labeled *input* in the same directory as this notebook.

This notebook uses pretrained word2vec embeddings, which can be found here: [Fasttext Russian 2M](https://www.kaggle.com/jingqliu/fasttext-russian-2m).

**You must download the Avito and Fasttext data from Kaggle in order to extract the features.**

## Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from gensim.models import KeyedVectors
from keras.applications.vgg19 import VGG19
from keras.models import Model
from skimage.io import imread
from skimage.transform import resize
from keras.utils import Sequence
import h5py

Using TensorFlow backend.


## Loading the Data

In [2]:
trainDF = pd.read_csv("input/avito-demand-prediction/train.csv")
testDF = pd.read_csv("input/avito-demand-prediction/test.csv")

# Corrupted images in the training set.
badImgs = ["4f029e2a00e892aa2cac27d98b52ef8b13d91471f613c8d3c38e3f29d4da0b0c",
           "b98b291bd04c3d92165ca515e00468fd9756af9a8f1df42505deed1dcfb5d7ae",
           "60d310a42e87cdf799afcd89dc1b11ae3fdc3d0233747ec7ef78d82c87002e83",
           "8513a91e55670c709069b5f85e12a59095b802877715903abef16b7a6f306e58"]

# Blank out only the image reference so these rows are treated as having no image.
trainDF.loc[trainDF["image"].isin(badImgs), "image"] = np.nan

# Image Path
imagePath = "./input/avito-demand-prediction/train_jpg/data/competition_files/train_jpg/"
imageExt = ".jpg"

# Fasttext Word2Vec Path
russVecPath = 'input/fasttext-russian-2m/wiki.ru.vec'

trainDF.head()

Unnamed: 0,item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability
0,b912c3c6a6ad,e00f8ff2eaf9,Свердловская область,Екатеринбург,Личные вещи,Товары для детей и игрушки,Постельные принадлежности,,,Кокоби(кокон для сна),"Кокон для сна малыша,пользовались меньше месяц...",400.0,2.0,2017-03-28,Private,d10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c...,1008.0,0.12789
1,2dac0150717d,39aeb48f0017,Самарская область,Самара,Для дома и дачи,Мебель и интерьер,Другое,,,Стойка для Одежды,"Стойка для одежды, под вешалки. С бутика.",3000.0,19.0,2017-03-26,Private,79c9392cc51a9c81c6eb91eceb8e552171db39d7142700...,692.0,0.0
2,ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ...",4000.0,9.0,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a...,3032.0,0.43177
3,02996f1dd2ea,bf5cccea572d,Татарстан,Набережные Челны,Личные вещи,Товары для детей и игрушки,Автомобильные кресла,,,Автокресло,Продам кресло от0-25кг,2200.0,286.0,2017-03-25,Company,e6ef97e0725637ea84e3d203e82dadb43ed3cc0a1c8413...,796.0,0.80323
4,7c90be56d2ab,ef50846afc0b,Волгоградская область,Волгоград,Транспорт,Автомобили,С пробегом,ВАЗ (LADA),2110.0,"ВАЗ 2110, 2003",Все вопросы по телефону.,40000.0,3.0,2017-03-16,Private,54a687a3a0fc1d68aed99bdaaf551c5c70b761b16fd0a2...,2264.0,0.20797
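
One way the vectors at `russVecPath` could be turned into a fixed-length feature for `title` or `description` is to average the vectors of the words that appear in the vocabulary. The sketch below is only an illustration under stated assumptions: `mean_word_vector` is a hypothetical helper, tokenization is a plain lowercase whitespace split, and a small dict stands in for the object returned by `KeyedVectors.load_word2vec_format(russVecPath)`, which supports the same `in` and `[]` lookups.

```python
import numpy as np

def mean_word_vector(text, vectors, dim):
    """Average the vectors of the whitespace tokens of `text` found in `vectors`.

    `vectors` is anything supporting `token in vectors` and `vectors[token]`,
    e.g. a gensim KeyedVectors object or a plain dict (used here as a stand-in).
    """
    tokens = str(text).lower().split()
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known words (or missing text): fall back to a zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0).astype(np.float32)

# Toy two-dimensional vocabulary; real Fasttext vectors are much larger.
toy_vectors = {
    "стойка": np.array([1.0, 0.0]),
    "для": np.array([0.0, 1.0]),
}
title_vec = mean_word_vector("Стойка для Одежды", toy_vectors, dim=2)
missing_vec = mean_word_vector(np.nan, toy_vectors, dim=2)
```

Averaging ignores word order and out-of-vocabulary words, so it is a baseline rather than a tuned text representation.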


## Pretrained VGG19 Imagenet Features

In [2]:
# Feature Extraction Hyperparams
imgShape = (224,224,3)
featureLayerName = "fc1" # We will extract the activations from the first fully connected layer. 
batchSize = 100 # Number of images we process at once. 

# Feature Extraction Setup
vgg = VGG19(weights='imagenet', input_shape=imgShape, classes=1000)
#vgg.summary()
fc1_features = Model(vgg.input,vgg.get_layer(featureLayerName).output)
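
A point that may be worth checking before extracting features: `skimage.transform.resize` returns floats scaled to [0, 1] by default, while the ImageNet weights in Keras expect inputs prepared by `keras.applications.vgg19.preprocess_input` (RGB to BGR, then per-channel mean subtraction on a 0 to 255 scale). The sketch below reproduces that transformation in plain NumPy so the arithmetic is visible. `caffe_style_preprocess` is a hypothetical helper name, and in practice calling `preprocess_input` on the 0 to 255 image is the simpler route.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by Keras' "caffe" preprocessing mode.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(img01):
    """Convert an RGB image with values in [0, 1] to mean-centered BGR on a 0-255 scale."""
    rgb255 = np.asarray(img01, dtype=np.float32) * 255.0
    bgr = rgb255[..., ::-1]          # reverse the channel axis: RGB -> BGR
    return bgr - IMAGENET_BGR_MEAN   # broadcast over height and width

# Toy all-white image with the same shape as imgShape.
toy = np.ones((224, 224, 3), dtype=np.float32)
prepped = caffe_style_preprocess(toy)
```

Features extracted without this step are still deterministic, but they come from inputs outside the range the network was trained on.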

In [None]:
# Feature Extraction - Training Data
onlyImages = trainDF[~trainDF["image"].isnull()]

hf = h5py.File('AvitoVGGFeatures_Train.h5', 'w')
# Count batches over the rows that actually have an image; ceil keeps the last partial batch.
nBatches = int(np.ceil(onlyImages.shape[0] / batchSize))

def readAndReshape(img):
    # Read an image from disk and resize it to the VGG19 input shape.
    return resize(imread(img), imgShape)

for batchIndex in range(nBatches):
    sampleFrame = onlyImages[batchIndex*batchSize:(batchIndex+1)*batchSize]

    imgURIs     = sampleFrame["image"].map(lambda x: imagePath+x+imageExt)
    imgReads    = imgURIs.map(readAndReshape)
    # Stack along a new first axis so each image stays intact: (nImages, 224, 224, 3).
    feedImgs    = np.stack(list(imgReads))
    outFeatures = fc1_features.predict(feedImgs)

    # Store one dataset per row, keyed by the DataFrame index.
    for idx, idxName in enumerate(sampleFrame.index):
        hf.create_dataset(str(idxName), data=outFeatures[idx])
    break # Debugging stop after the first batch; remove to process the entire set.

hf.close()
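
When a batch is assembled from individually resized images, stacking along a new leading axis keeps every image intact, whereas concatenating side by side and reshaping interleaves rows from different images. The toy sketch below illustrates this, along with a batch split that keeps the final partial batch. `batch_slices` is a hypothetical helper, and tiny 2x2 arrays stand in for real images.

```python
import numpy as np

def batch_slices(n_items, batch_size):
    """Return slices covering range(n_items) in chunks of batch_size (last one may be shorter)."""
    n_batches = int(np.ceil(n_items / batch_size))
    return [slice(i * batch_size, min((i + 1) * batch_size, n_items))
            for i in range(n_batches)]

# Five toy "images"; image i is filled with the value i so mixing is easy to spot.
imgs = [np.full((2, 2, 3), i, dtype=np.float32) for i in range(5)]

# np.stack adds a new first axis: each batch has shape (n_in_batch, 2, 2, 3).
batches = [np.stack(imgs[s]) for s in batch_slices(len(imgs), batch_size=2)]

# For comparison: hstack joins images along the width, so a reshape afterwards
# puts pixel rows from both images into the first slot.
mixed = np.hstack(imgs[0:2]).reshape(2, 2, 2, 3)
```

Here `batches[0][1]` contains only the value 1, while `mixed[0]` contains both 0 and 1.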

In [31]:
# Feature Extraction - Test Data
onlyImages = testDF[~testDF["image"].isnull()]

# Write to the test file only; reopening the training file with 'w' would erase it.
hf = h5py.File('AvitoVGGFeatures_Test.h5', 'w')
nBatches = int(np.ceil(onlyImages.shape[0] / batchSize))

for batchIndex in range(nBatches):
    sampleFrame = onlyImages[batchIndex*batchSize:(batchIndex+1)*batchSize]

    # NOTE: imagePath still points at the training images, which is what triggers the
    # FileNotFoundError below; point it at the test image folder before running this cell.
    imgURIs     = sampleFrame["image"].map(lambda x: imagePath+x+imageExt)
    imgReads    = imgURIs.map(lambda uri: resize(imread(uri), imgShape))
    # Stack along a new first axis so each image stays intact: (nImages, 224, 224, 3).
    feedImgs    = np.stack(list(imgReads))
    outFeatures = fc1_features.predict(feedImgs)

    # Store one dataset per row, keyed by the DataFrame index.
    for idx, idxName in enumerate(sampleFrame.index):
        hf.create_dataset(str(idxName), data=outFeatures[idx])
hf.close()

FileNotFoundError: [Errno 2] No such file or directory: './input/avito-demand-prediction/train_jpg/data/competition_files/train_jpg/a8b57acb5ab304f9c331ac7a074219aed4d349d8aef386bd5e32262735e68512.jpg'

In [24]:
# Example of how to read the stored features
hf = h5py.File('AvitoVGGFeatures_Train.h5', 'r')

for i in hf.keys():
    print(i,np.array(hf.get(i)).shape)
    break # remove to iterate through the entire set.
    
hf.close()

0 (1, 4096)


In [30]:
onlyImages = trainDF[~trainDF["image"].isnull()]

In [33]:
onlyImages.index[0]

0

In [14]:
hf.close()

In [27]:
# Feature Extraction Hyperparams
imgShape = (224,224,3)
featureLayerName = "fc1" # We will extract the activations from the first fully connected layer. 
batchSize = 50 # Number of images we process at once. 

# Feature Extraction Setup
vgg = VGG19(weights='imagenet', input_shape=imgShape, classes=1000)
#vgg.summary()
fc1_features = Model(vgg.input,vgg.get_layer(featureLayerName).output)

# Feature Extraction - Training Data
hf = h5py.File('AvitoVGGFeatures_Train.h5', 'w')

for index, trainingSample in trainDF.iterrows():
    if not isinstance(trainingSample["image"],str):
        hf.create_dataset(str(index), data=np.nan)
    else:
        imgURI      = imagePath+trainingSample["image"]+imageExt
        imgRead     = resize(imread(imgURI),imgShape)
        outFeatures = fc1_features.predict(np.expand_dims(imgRead, axis=0))
        hf.create_dataset(str(index), data=outFeatures)
        
hf.close()

# Feature Extraction - Test Data
hf = h5py.File('AvitoVGGFeatures_Test.h5', 'w')

for index, testSample in testDF.iterrows():
    if not isinstance(testSample["image"],str):
        hf.create_dataset(str(index), data=np.nan)
    else:
        # NOTE: imagePath points at the training images; use the test image folder here.
        imgURI      = imagePath+testSample["image"]+imageExt
        imgRead     = resize(imread(imgURI),imgShape)
        outFeatures = fc1_features.predict(np.expand_dims(imgRead, axis=0))
        hf.create_dataset(str(index), data=outFeatures)
        
hf.close()
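
Because rows without an image are stored as a scalar NaN and rows with an image as a `(1, 4096)` array, downstream code has to handle both shapes. The following sketch shows one way to collect the datasets into a single matrix aligned with the DataFrame index. `features_to_matrix` is a hypothetical helper, and a plain dict of arrays stands in for an open `h5py.File`, which offers the same `in` and `[]` access.

```python
import numpy as np

def features_to_matrix(store, index, n_features=4096):
    """Build a (len(index), n_features) matrix; rows without features stay NaN."""
    out = np.full((len(index), n_features), np.nan, dtype=np.float32)
    for row, key in enumerate(index):
        name = str(key)
        if name not in store:
            continue                      # nothing was written for this row
        values = np.asarray(store[name])  # works for arrays and h5py datasets
        if values.size == n_features:     # skip the scalar NaN placeholders
            out[row] = values.reshape(n_features)
    return out

# Toy store with 4 features per image: row 0 has features, row 1 is a NaN
# placeholder, and row 2 was never written.
toy_store = {
    "0": np.ones((1, 4), dtype=np.float32),
    "1": np.array(np.nan),
}
matrix = features_to_matrix(toy_store, index=[0, 1, 2], n_features=4)
```

Keeping missing rows as NaN leaves the choice of imputation (zeros, column means, or a missing-image flag) to the model that consumes the matrix.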

In [25]:
trainDF[0:1]["image"].iloc[0]

'd10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c7679f17c333c959b19'

In [26]:
hf = h5py.File('AvitoVGGFeatures_Train.h5', 'w')

OSError: Unable to create file (unable to truncate a file which is already open)
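
This error typically appears when an earlier handle to the same file was never closed, for example after an exception interrupted a cell before `hf.close()` ran. A `with` block closes the file even when the body fails, so the file can be reopened for writing afterwards. The sketch below is a minimal illustration: the temporary path and dataset contents are made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "features_demo.h5")

# The handle is closed automatically when the block exits, even on error.
with h5py.File(path, "w") as hf:
    hf.create_dataset("0", data=np.zeros((1, 4096), dtype=np.float32))

# No handle is left open, so truncating and rewriting the file succeeds.
with h5py.File(path, "w") as hf:
    hf.create_dataset("0", data=np.ones((1, 4096), dtype=np.float32))

with h5py.File(path, "r") as hf:
    stored_shape = hf["0"].shape
    first_value = float(hf["0"][0, 0])
```

In a notebook where a stale handle is already open, calling `hf.close()` on it (or restarting the kernel) releases the file.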