# FastText

In this notebook we train a FastText model from scratch on the dataset and later use it to produce POI (point of interest) embeddings. Each POI is then classified from its embedding.

In [1]:
import numpy as np
import pandas as pd
import os

from gensim.models import FastText
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Load train and test sets.

In [2]:
labels = [
    'Active Life', 'Arts & Entertainment', 'Automotive', 'Beauty & Spas',
    'Education', 'Event Planning & Services', 'Financial Services', 'Food',
    'Health & Medical', 'Home Services', 'Hotels & Travel', 'Local Flavor',
    'Local Services', 'Mass Media', 'Nightlife', 'Pets', 'Professional Services',
    'Public Services & Government', 'Real Estate', 'Religious Organizations',
    'Restaurants', 'Shopping'
]

train_df = pd.read_csv('data/train.csv', na_filter=False)
test_df = pd.read_csv('data/test.csv', na_filter=False)

Train the FastText model on the training sequences, or load a previously saved model if one exists.

In [3]:
emb_size = 300
epochs = 100
ft_model_fname = f'ft_{emb_size}_{epochs}.model'
models_dir = 'ft_models'

train_sequences = train_df['sequence'].apply(lambda x: x.split())
test_sequences = test_df['sequence'].apply(lambda x: x.split())

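# Reuse a cached model when available; otherwise train one and save it.
os.makedirs(models_dir, exist_ok=True)  # avoids an error when the directory is missing
ft_model_path = os.path.join(models_dir, ft_model_fname)

if os.path.exists(ft_model_path):
    ft_model = FastText.load(ft_model_path)
else:
    # Keywords below follow gensim 3.x. gensim 4.x renamed `size` to `vector_size`
    # and, in build_vocab/train, `sentences` to `corpus_iterable`.
    ft_model = FastText(size=emb_size)
    ft_model.build_vocab(sentences=train_sequences)
    ft_model.train(sentences=train_sequences, total_examples=len(train_sequences), epochs=epochs)
    ft_model.save(ft_model_path)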
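
FastText builds each token vector from character n-grams, which is generally why tokens that appear only in the test set can still be embedded. The sketch below illustrates how such n-grams are extracted; `char_ngrams` is a hypothetical helper, not part of gensim, and the boundary markers and the 3 to 6 length range are assumed to match the library defaults.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Return the character n-grams of a token wrapped in '<' and '>' markers."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped token.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Illustrative token; a smaller max_n keeps the result short.
grams = char_ngrams("cafe", min_n=3, max_n=4)
print(grams)
```

Two tokens that share many n-grams, such as a word and its misspelling, end up with similar vectors, which can help with noisy POI names.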

Create POI embeddings by averaging the representations of the tokens they contain.

In [4]:
train_features = np.stack(train_sequences.apply(lambda x: ft_model.wv[x].mean(axis=0)))
test_features = np.stack(test_sequences.apply(lambda x: ft_model.wv[x].mean(axis=0)))

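# Align the indicator columns with `labels` so that train and test share the same columns,
# even if a category is absent from one of the splits.
train_labels = train_df['categories'].str.get_dummies(sep=', ').reindex(columns=labels, fill_value=0)
test_labels = test_df['categories'].str.get_dummies(sep=', ').reindex(columns=labels, fill_value=0)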
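
The averaging step above can be sketched on toy data without a trained model. Here `toy_vectors` and `mean_pool` are hypothetical stand-ins for `ft_model.wv` and the lambda used in the cell, and the zero-vector fallback for empty sequences is an assumption that the notebook itself does not make.

```python
import numpy as np


def mean_pool(tokens, lookup, dim):
    """Average the vectors of `tokens`; fall back to zeros for an empty sequence."""
    if not tokens:
        # Assumption: an empty sequence maps to the zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.stack([lookup[t] for t in tokens]).mean(axis=0)


# Hypothetical 3-dimensional token vectors standing in for a trained model.
toy_vectors = {
    "pizza": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "bar": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

sequences = [["pizza", "bar"], ["pizza"], []]
features = np.stack([mean_pool(seq, toy_vectors, dim=3) for seq in sequences])
print(features.shape)
```

Each row of `features` is one POI embedding, so the matrix has one row per sequence and one column per embedding dimension.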

Classification via Logistic Regression: one binary classifier is fit per category and scored by accuracy on the test set.

In [5]:
test_preds = np.zeros((len(test_df), len(labels)))
scores = []

for label_idx, label_name in enumerate(labels):
    train_target = train_labels[label_name]
    test_target = test_labels[label_name]

    clf = LogisticRegression(solver='sag')
    clf.fit(train_features, train_target)
    preds = clf.predict(test_features)
    test_preds[:, label_idx] = preds

    score = accuracy_score(test_target, preds)
    scores.append(score)
    print('Test score for class {} is {:.4f}'.format(label_name, score))

print('Mean test score is {:.4f}'.format(np.mean(scores)))

Test score for class Active Life is 0.9793
Test score for class Arts & Entertainment is 0.9765
Test score for class Automotive is 0.9711
Test score for class Beauty & Spas is 0.9749
Test score for class Education is 0.9868
Test score for class Event Planning & Services is 0.9592
Test score for class Financial Services is 0.9928
Test score for class Food is 0.9084
Test score for class Health & Medical is 0.9720
Test score for class Home Services is 0.9560
Test score for class Hotels & Travel is 0.9837
Test score for class Local Flavor is 0.9922
Test score for class Local Services is 0.9411
Test score for class Mass Media is 0.9982
Test score for class Nightlife is 0.9783
Test score for class Pets is 0.9924
Test score for class Professional Services is 0.9694
Test score for class Public Services & Government is 0.9954
Test score for class Real Estate is 0.9878
Test score for class Religious Organizations is 0.9989
Test score for class Restaurants is 0.9724
Test score for class Shopping is 0.9226
Mean test score is 0.9732
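
The mean score reported above is the average of the per-label accuracies. A minimal sketch of that computation on a toy prediction matrix follows; `y_true` and `y_pred` are made-up arrays for illustration, not results from this notebook.

```python
import numpy as np

# Toy multi-label data: rows are POIs, columns are labels (illustrative values only).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [1, 1, 0],
                   [0, 0, 0]])

# Accuracy of each label column, computed independently of the others.
per_label_acc = (y_true == y_pred).mean(axis=0)

# Mean over labels, analogous to np.mean(scores) in the cell above.
mean_acc = per_label_acc.mean()

# With the same number of rows per label, this equals the fraction of
# correct entries in the whole matrix (one minus the Hamming loss).
elementwise_acc = (y_true == y_pred).mean()
print(per_label_acc, mean_acc, elementwise_acc)
```

Because every label column is scored separately, the mean reflects how many individual label decisions are correct rather than how many POIs have all of their labels predicted correctly.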
