# Task

The files `airlines.train.tsv` and `airlines.test.tsv` contain user ratings of various airlines. The full dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">here</a>.

Each record contains the airline name, the reviewer's country, the cabin class they flew in, the review text, and an overall rating from 0 to 10.

The task is to predict the rating a user gave from the first four fields (airline, country, cabin class, review text). To do this, the data must first be converted to the vw format. The format in which solutions must be submitted is described below.

As the answer, submit the trained weights of a vowpal wabbit model. To evaluate a solution, vw will be run with these weights on the test data and the R2 metric will be computed. Solutions scoring above `0.35` receive 100%. Solutions with lower quality receive a lower grade, in line with the quality achieved. Save the model (weights) to the file `result.vw`.
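For reference, the R2 metric mentioned above can be computed by hand. The following is a minimal sketch assuming the standard definition of the coefficient of determination, `1 - SS_res / SS_tot`; the helper name `r2_by_hand` and the sample ratings are illustrative only and are not part of the grading code.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean label.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


ratings = [9.0, 7.0, 7.0, 1.0, 9.0]        # illustrative labels
mean_rating = sum(ratings) / len(ratings)

print(r2_by_hand(ratings, ratings))             # perfect predictions: 1.0
print(r2_by_hand(ratings, [mean_rating] * 5))   # constant mean prediction: 0.0
```

Under this definition, a model that always predicts the mean rating scores 0, so the `0.35` threshold means the model has to remove roughly a third of the variance left by that baseline.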

vw format:
* The target variable is the user rating
* 4 namespaces: name, country, cabin, review
* Values in name, country, cabin are normalized to a single token: every character that is not a letter, digit, or underscore (that is, every character matching the regular expression `\W`) is replaced with `_`, and the whole string is lowercased.
* In review, only valid tokens are kept (that is, those matching the regular expression `[a-zA-Z0-9_]+`).

To show what this format looks like, the file `airlines.test.sample.vw` contains the first 10 records of the test set, encoded accordingly.
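The rules above can be sketched on a single record. This is a minimal illustration, not the reference encoder: the helper `to_vw_line` is a hypothetical name, the review text is shortened, and lowercasing the review is an assumption inferred from the sample file rather than from the written rules.

```python
import re


def to_vw_line(rating, name, country, cabin, review):
    """Encode one record as a vw line with the four namespaces."""
    def normalize(value):
        # Replace every character matching \W with "_" and lowercase.
        return re.sub(r"\W", "_", value).lower()

    # Keep only tokens matching [a-zA-Z0-9_]+ (lowercasing is an assumption).
    tokens = re.findall(r"[a-zA-Z0-9_]+", review.lower())
    return "{} |name {} |country {} |cabin {} |review {}".format(
        rating, normalize(name), normalize(country), normalize(cabin),
        " ".join(tokens),
    )


line = to_vw_line(8.0, "south-african-airways", "United Kingdom", "Economy",
                  "JNB-LHR on the new airbus.")
print(line)
# 8.0 |name south_african_airways |country united_kingdom |cabin economy |review jnb lhr on the new airbus
```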


In [1]:
import pandas as pd
import numpy as np

In [2]:
columns=['name', 'country', 'cabin', 'review', 'rating']
df_train = pd.read_csv('airlines.train.tsv', sep='\t')
df_test = pd.read_csv('airlines.test.tsv', sep='\t')
df_train.columns = columns
df_test.columns = columns

In [3]:
df_train.head()

Unnamed: 0,name,country,cabin,review,rating
0,sunwing-airlines,Canada,Economy,March 5th 2014 from Ottawa Canada to Cuba WG 6...,9.0
1,lufthansa,United Kingdom,Economy,SIN-FRA-BHX in Economy. First leg from Singapo...,7.0
2,spirit-airlines,United States,Economy,"Spirit does what they state on their web site,...",7.0
3,sunwing-airlines,Canada,Premium Economy,My fiancé and I were booked to fly to Cayo San...,1.0
4,british-airways,United States,First Class,DXB-LHR B777-200ER BA0108 August 18 First Clas...,9.0


In [4]:
df_test.head()

Unnamed: 0,name,country,cabin,review,rating
0,south-african-airways,United Kingdom,Economy,JNB-LHR on the new airbus. Seats were roomy an...,8.0
1,jet-airways,Qatar,Business Class,Flew Business Class DOH-BOM-DOH. Outbound: Use...,6.0
2,american-airlines,United States,First Class,This is a rough review because we flew first b...,5.0
3,flybe,United Kingdom,Economy,Am thoroughly fed up with Flybe customer servi...,1.0
4,american-airlines,United Arab Emirates,Economy,I have flown MIA-JFK on an old B767-300. Fligh...,5.0


In [5]:
import re
def formating(text):
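    # Replace every non-word character with "_" and lowercase the result.
    # The raw string avoids an invalid-escape warning for "\W".
    return re.sub(r'\W', '_', text).lower()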
formating('United Kingdom')

'united_kingdom'

In [6]:
def cleaning(text):
    # Keep only the lowercased tokens matching [a-zA-Z0-9_]+.
    words = re.findall(r"[a-zA-Z0-9_]+", text.lower())

    # Signal a review with no valid tokens by returning None.
    if not words:
        return None
    return " ".join(words)

In [7]:
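def df_format(df):
    # Work on a copy so the caller's DataFrame is left unchanged.
    df = df.copy()
    # Normalize the categorical columns: name, country, cabin.
    for column in df.columns[:3]:
        df[column] = df[column].apply(formating)

    # Tokenize the review text.
    df[df.columns[3]] = df[df.columns[3]].apply(cleaning)
    return df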

In [8]:
def write_vw(df, filename):
    with open(filename, "w") as f:
        for name, country, cabin, review, rating in df.values:
            # Skip rows whose review has no valid tokens (cleaning returns None);
            # otherwise the literal word "None" would be written as a feature.
            if not review:
                continue
            vw_object = "{} |name {} |country {} |cabin {} |review {}".format(rating, name, country, cabin, review)
            f.write(vw_object + '\n')

In [9]:
write_vw(df_format(df_train), "airlines_train.vw")
write_vw(df_format(df_test), "airlines_test.vw")

In [10]:
! head -n 2 airlines_test.vw

8.0 |name south_african_airways |country united_kingdom |cabin economy |review jnb lhr on the new airbus seats were roomy and comfy staff polite and friendly and inflight entertainment system outstanding we had terrible turbulence throughout the flight but the captain was informative and reassuring and everyone remained calm food not great but otherwise excellent
6.0 |name jet_airways |country qatar |cabin business_class |review flew business class doh bom doh outbound used the oryx lounge at doha airport which was nice cabin was nearly empty seats are similar to those on jet s domestic business class found it difficult to sleep with the recline provided at 6 3 legrests did not help as my legs overshot it the light sandwich was passable service was attentive and cheerful inbound evening flight so looked forward to meal and wine same cheap french table wine indian non veg meal was not great cabin crew were attentive and friendly ife was limited one negative was that my bag was one of t

In [11]:
! head -n 2 airlines.test.sample.vw

8.0 |name south_african_airways |country united_kingdom |cabin economy |review jnb lhr on the new airbus seats were roomy and comfy staff polite and friendly and inflight entertainment system outstanding we had terrible turbulence throughout the flight but the captain was informative and reassuring and everyone remained calm food not great but otherwise excellent
6.0 |name jet_airways |country qatar |cabin business_class |review flew business class doh bom doh outbound used the oryx lounge at doha airport which was nice cabin was nearly empty seats are similar to those on jet s domestic business class found it difficult to sleep with the recline provided at 6 3 legrests did not help as my legs overshot it the light sandwich was passable service was attentive and cheerful inbound evening flight so looked forward to meal and wine same cheap french table wine indian non veg meal was not great cabin crew were attentive and friendly ife was limited one negative was that my bag was one of t
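The two `head` outputs above are compared by eye. As a sketch, the same check can be automated over the first 10 lines; `first_mismatch` is a hypothetical helper, and the commented call assumes both files sit in the working directory.

```python
from itertools import islice


def first_mismatch(path_a, path_b, limit=10):
    """Return the 1-based number of the first differing line among the
    first `limit` lines of two files, or None if they all match."""
    with open(path_a) as file_a, open(path_b) as file_b:
        lines_a = list(islice(file_a, limit))
        lines_b = list(islice(file_b, limit))

    for number, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        if a.rstrip("\n") != b.rstrip("\n"):
            return number

    # One file ended before the other within the first `limit` lines.
    if len(lines_a) != len(lines_b):
        return min(len(lines_a), len(lines_b)) + 1
    return None


# Usage in this notebook (assumes both files exist):
# first_mismatch("airlines_test.vw", "airlines.test.sample.vw", limit=10)
```

A return value of `None` would mean the generated file agrees with the provided sample on every compared line.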

Your resulting coefficients will be checked roughly as follows:

In [12]:
%%time
! (rm vw.cache || exit 0)
! vw --final_regressor result.vw airlines_train.vw --bit_precision 22 --learning_rate 0.1 --passes 1000 --cache_file vw.cache

final_regressor = result.vw
Num weight bits = 22
learning rate = 0.1
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = vw.cache
Reading datafile = airlines_train.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
81.000000 81.000000            1            1.0   9.0000   0.0000      167
62.006304 43.012608            2            2.0   7.0000   0.4416      120
41.398979 20.791654            4            4.0   1.0000   2.4589      253
40.406127 39.413274            8            8.0   1.0000   3.7094      301
30.456728 20.507330           16           16.0   8.0000   0.9863       67
31.211055 31.965382           32           32.0   1.0000   3.0708      160
31.972794 32.734532           64           64.0  10.0000   7.6112      217
25.612868 19.252943          128          128.0   8.0000   5.6953      130
23.906521 22.20017

In [14]:
from sklearn.metrics import r2_score

def read_target_from_vw(vw_object):
    # The label is the first space-separated token of a vw line.
    return float(vw_object.split(' ')[0])


def calc_r2(predictions_path, answers_path):
    # One prediction per line, as written by `vw --predictions`.
    with open(predictions_path, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_path, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])
        
    return r2_score(y_expected, y_pred)

In [15]:
! vw --testonly --initial_regressor result.vw --predictions predictions.txt airlines_test.vw
calc_r2('predictions.txt', 'airlines_test.vw')

only testing
predictions = predictions.txt
Num weight bits = 22
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airlines_test.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
2.932962 2.932962            1            1.0   8.0000   9.7126       48
2.327584 1.722206            2            2.0   6.0000   7.3123      116
9.546790 16.765996            4            4.0   1.0000   4.1912       92
5.849053 2.151316            8            8.0   8.0000   6.4084      152
4.395241 2.941429           16           16.0   8.0000   5.8056       53
3.587803 2.780364           32           32.0   8.0000   7.1935       56
3.389135 3.190468           64           64.0   8.0000   6.8690       37
3.145809 2.902482          128          128.0   8.0000   9.7600      200
3.566722 3.987635          256          256.0   2.0000   4.

0.6540537678643386

In [16]:
# After the run, the R2 metric is computed on the entire test set.
# Your task is to reach at least 0.35 by tuning the vw parameters.


calc_r2('predictions.txt', 'airlines_test.vw')

0.6540537678643386
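Tuning the vw parameters amounts to rerunning the training command with different values. The sketch below only builds the command line for several learning rates, using the flags already shown in the training cell; the helper `build_vw_train_command`, the candidate values, and the per-run model file names are hypothetical, and nothing here claims that any particular setting improves R2.

```python
def build_vw_train_command(train_path, model_path, bit_precision=22,
                           learning_rate=0.1, passes=1000,
                           cache_file="vw.cache"):
    """Build the vw training command as an argument list.
    Defaults mirror the training cell of this notebook."""
    return [
        "vw",
        "--final_regressor", model_path,
        train_path,
        "--bit_precision", str(bit_precision),
        "--learning_rate", str(learning_rate),
        "--passes", str(passes),
        "--cache_file", cache_file,
    ]


# Hypothetical candidate values; the output file names are illustrative.
for lr in (0.05, 0.1, 0.5):
    command = build_vw_train_command(
        "airlines_train.vw", "result_lr{}.vw".format(lr), learning_rate=lr)
    print(" ".join(command))
    # To actually train, assuming vw is installed and on PATH:
    # subprocess.run(command, check=True)
```

As in the training cell, the cache file would be deleted before each run, and the model chosen in the end still has to be saved as `result.vw`.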