## Text embeddings and image keypoints

In this notebook I do two things: I use a trained FastText model to convert text into 300-dimensional vectors that can be used as features in models like LGB, and I extract keypoints from images.

In [1]:
# the second of the two original FastText imports shadowed the first, so only it is kept
from gensim.models import FastText
import pandas as pd
import numpy as np
import tqdm
from path import Path
import cv2



### Text embeddings

If you want to use text in a non-neural-network model, there are many options:
- use a vectorizer and train the model on the resulting tokens;
- reduce dimensionality after vectorizing the text;
- create meta-features using other models;
- vectorize the text using word2vec-like models;
- etc.
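
The first two options can be sketched on a toy corpus as follows. This is a minimal illustration using only NumPy, not the approach taken in this notebook; `bag_of_words` and `reduce_dim` are hypothetical helper names, and in practice a library vectorizer and a truncated SVD implementation would normally do this work.

```python
import numpy as np


def bag_of_words(texts):
    """Count matrix (documents x vocabulary) built from whitespace tokens."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            counts[i, index[w]] += 1
    return counts, vocab


def reduce_dim(matrix, k):
    """Project rows onto the top-k singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


# toy corpus, invented for illustration
texts = ['red bike for sale', 'blue bike for sale', 'old wooden table']
counts, vocab = bag_of_words(texts)
reduced = reduce_dim(counts, 2)
print(counts.shape, reduced.shape)
```

The reduced matrix has one dense row per document, which is the shape a tree-based model expects as input.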

In this notebook I use a pre-trained FastText model to vectorize the text.

In [4]:
fasttext_model = FastText.load('embeddings/avito_big_150m_sg1.w2v')

In [11]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [12]:
train['description'] = train['description'].apply(lambda x: str(x).replace('/\n', ' ').replace('\xa0', ' ').replace('.', '. ').replace(',', ', '))
test['description'] = test['description'].apply(lambda x: str(x).replace('/\n', ' ').replace('\xa0', ' ').replace('.', '. ').replace(',', ', '))
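
To make the effect of the cleaning step concrete, here is the same chain of replacements as a named function applied to a made-up string. `clean_description` and the sample text are hypothetical and only serve as an illustration.

```python
def clean_description(x):
    """Same chain of replacements as in the cell above, as a named function."""
    return (str(x)
            .replace('/\n', ' ')   # the literal sequence "/" followed by a newline
            .replace('\xa0', ' ')  # non-breaking space -> regular space
            .replace('.', '. ')    # make sure a space follows every period
            .replace(',', ', '))   # make sure a space follows every comma


sample = 'Good bike.Cheap,call\xa0now'   # invented example text
cleaned = clean_description(sample)
tokens = cleaned.split()
missing = clean_description(float('nan'))

print(cleaned)
print(tokens)
print(missing)
```

Two things are worth noticing: punctuation stays attached to the preceding word after `split()` (for example `bike.`), and a missing description becomes the string `nan` because of `str(x)`.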

In [None]:
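vectors = []
for text in train['description']:
    a = np.zeros((1, 300))
    c = 0
    for word in text.split():
        # .wv holds the word vectors; skip words the model cannot embed
        if word in fasttext_model.wv:
            a += fasttext_model.wv[word]
            c += 1
    # avoid 0 / 0 (NaN) when no word of the description is known
    if c > 0:
        a = a / c
    vectors.append(a[0])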

In [None]:
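vectors_test = []
for text in test['description']:
    a = np.zeros((1, 300))
    c = 0
    for word in text.split():
        # .wv holds the word vectors; skip words the model cannot embed
        if word in fasttext_model.wv:
            a += fasttext_model.wv[word]
            c += 1
    # avoid 0 / 0 (NaN) when no word of the description is known
    if c > 0:
        a = a / c
    vectors_test.append(a[0])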
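
The averaging done in the two loops above can be sketched on a toy vocabulary. This assumes a plain dictionary in place of the FastText model and 3 dimensions instead of 300; `average_vector` and `toy` are hypothetical names.

```python
import numpy as np


def average_vector(text, embeddings, dim):
    """Mean of the vectors of known words; a zero vector if none are known."""
    known = [embeddings[w] for w in text.split() if w in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# toy 3-dimensional "embeddings", invented for illustration
toy = {
    'red': np.array([1.0, 0.0, 0.0]),
    'bike': np.array([0.0, 1.0, 0.0]),
}

both = average_vector('red bike', toy, 3)
partial = average_vector('red scooter', toy, 3)
empty = average_vector('unknown words only', toy, 3)
print(both, partial, empty)
```

Unknown words are simply skipped, so a text with a single known word gets that word's vector, and a text with no known words gets zeros instead of NaN.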

In [21]:
np.save('train_emb.npy', np.array(vectors))
np.save('test_emb.npy', np.array(vectors_test))
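
One possible way to use the saved arrays later is to load them and append them as extra columns to the other numeric features. The sketch below uses random stand-in data in a temporary directory; `append_embeddings` is a hypothetical helper and the array sizes are invented.

```python
import os
import tempfile

import numpy as np


def append_embeddings(features, emb_path):
    """Load saved embeddings and append them as extra columns to `features`."""
    emb = np.load(emb_path)
    if emb.shape[0] != features.shape[0]:
        raise ValueError('row counts differ: %d vs %d'
                         % (features.shape[0], emb.shape[0]))
    return np.hstack([features, emb])


# toy demonstration with random data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_emb.npy')
    np.save(path, np.random.rand(4, 5))     # stand-in for train_emb.npy
    other_features = np.random.rand(4, 2)   # stand-in for existing features
    combined = append_embeddings(other_features, path)

print(combined.shape)
```

The row check matters because the embeddings are stored without an index: they only line up with the data if the row order of `train.csv` has not changed.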

### Extracting image keypoints

In this competition participants used images in their models in many ways. One of them is extracting keypoints from the images and creating a feature with the number of keypoints in each image.

A keypoint is an "interesting point" in the image, such as a corner or another high-contrast spot on an object (an animal, a building and so on). Images with more detail tend to yield more keypoints, so the count could be useful for the model.

In [4]:
s = 'train_jpg/'
imgs = Path(s).files('*.png')
imgs += Path(s).files('*.jpg')
imgs += Path(s).files('*.jpeg')

In [None]:
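def keyp(img):
    '''Return [image path, number of FAST keypoints]; the count is 0 if the image cannot be processed.

    https://www.kaggle.com/c/avito-demand-prediction/discussion/59414#347781'''
    try:
        # read the image as grayscale (flag 0)
        img1 = cv2.imread(img, 0)
        fast = cv2.FastFeatureDetector_create()

        # find the keypoints and count them
        kp = fast.detect(img1, None)
        kp = len(kp)
        return [img, kp]
    except Exception:
        # unreadable or corrupt image
        return [img, 0]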

In [None]:
features_train = np.empty((len(imgs), 2), dtype=object)
step = 20000
i = 0
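
The loop that follows walks over the image list in slices of `step` items. The slicing logic on its own can be sketched like this; `chunk_bounds` is a hypothetical helper and the numbers are arbitrary.

```python
def chunk_bounds(n_items, step):
    """Yield (start, stop) pairs covering range(n_items) in slices of at most `step`."""
    for start in range(0, n_items, step):
        yield start, min(start + step, n_items)


bounds = list(chunk_bounds(45, 20))
sizes = [stop - start for start, stop in bounds]
print(bounds)
print(sizes)
```

The last slice is shorter whenever the number of items is not a multiple of `step`, which is why the temporary array in the loop is sized from the slice itself.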

In [None]:
i = 0
while i < len(imgs):
    print(i)
    imgs_temp = imgs[i:i + step]
    features_temp = np.empty((len(imgs_temp), 2), dtype = object)
    for ind, img in tqdm.tqdm(enumerate(imgs_temp)):
        img_keys1 = keyp(img)
        features_temp[ind, :] = img_keys1

    features_train[i:i + step, :] = features_temp
    i += step


np.save('f:/Avito/train_keypoints.npy', features_train)
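
Each saved row holds a file path and a keypoint count. To join the counts back to the tabular data, one option is to key them by the file name without its extension. The sketch below assumes that this stem matches an image identifier in the data, which this notebook does not confirm; `keypoints_by_image_id` and the sample rows are hypothetical.

```python
import os


def keypoints_by_image_id(rows):
    """Map file name without extension -> keypoint count for rows of [path, count]."""
    result = {}
    for path, count in rows:
        stem = os.path.splitext(os.path.basename(str(path)))[0]
        result[stem] = int(count)
    return result


# invented rows in the same [path, count] layout as the saved array
rows = [['train_jpg/abc123.jpg', 57], ['train_jpg/def456.png', 0]]
lookup = keypoints_by_image_id(rows)
print(lookup)
```

Because the array is saved with `dtype=object`, reading it back with `np.load` requires `allow_pickle=True` in recent NumPy versions.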

In [1]:
s = 'test_jpg/'
print('reading images')
imgs = Path(s).files('*.png')
imgs += Path(s).files('*.jpg')
imgs += Path(s).files('*.jpeg')

features_test = np.empty((len(imgs), 2), dtype=object)
step = 20000
i = 0

reading images


In [9]:
i = 0
while i < len(imgs):
    print(i)
    imgs_temp = imgs[i:i + step]
    features_temp = np.empty((len(imgs_temp), 2), dtype = object)
    for ind, img in tqdm.tqdm(enumerate(imgs_temp)):
        img_keys1 = keyp(img)
        features_temp[ind, :] = img_keys1

    features_test[i:i + step, :] = features_temp
    i += step


np.save('f:/Avito/test_keypoints.npy', features_test)

0
20000it [01:24, 235.94it/s]
20000
20000it [01:23, 239.17it/s]
40000
20000it [01:25, 234.80it/s]
60000
20000it [01:24, 235.81it/s]
80000
20000it [01:23, 240.05it/s]
100000
20000it [01:23, 239.13it/s]
120000
20000it [01:26, 232.54it/s]
140000
20000it [01:26, 231.29it/s]
160000
20000it [01:27, 228.49it/s]
180000
20000it [01:26, 230.45it/s]
200000
20000it [01:25, 233.77it/s]
220000
20000it [01:26, 231.87it/s]
240000
20000it [01:27, 227.97it/s]
260000
20000it [01:27, 229.78it/s]
280000
20000it [01:23, 238.44it/s]
300000
20000it [01:23, 239.51it/s]
320000
20000it [01:23, 238.95it/s]
340000
20000it [01:24, 235.69it/s]
360000
20000it [01:21, 244.69it/s]
380000
20000it [01:23, 238.40it/s]
400000
20000it [01:22, 241.68it/s]
420000
20000it [01:23, 238.92it/s]
440000
20000it [01:22, 241.12it/s]
460000
5829it [00:24, 238.28it/s]

