# Задача

В файлах `airlines.reviews.train.tsv` и `airlines.reviews.test.tsv` находятся данные о пользовательских оценках различных авиакомпаний. Полноценный набор данных доступен <a href="https://github.com/quankiquanki/skytrax-reviews-dataset"> по ссылке </a>.

В данных есть отзыв, который оставил пользователь и его оценка от 0 до 10. Задача - как можно лучше предсказывать оценку по текстовому описанию, используя различные подходы для работы с текстами.

Ниже представлен код, который обучает базовую модель - линейная модель vw, где весь текст закодирован просто как Bag-of-words. Ваши изменения должны улучшить текущую модель за счет более качественного кодирования текстовых данных.

In [1]:
import pandas as pd
import numpy as np

import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [2]:
df_train = pd.read_csv('airlines.reviews.train.tsv', sep='\t')
df_test = pd.read_csv('airlines.reviews.test.tsv', sep='\t')

In [3]:
len_train = len(df_train)
df = pd.concat([df_train, df_test], ignore_index=True)

In [4]:
df_train.head()

Unnamed: 0,rating,content
0,9.0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,7.0,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,7.0,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."
3,1.0,"My fiancé and I were booked to fly to Cayo Santa Maria (CUBA) February 6-13 2014. Our flight was scheduled to leave at 6.10am. Upon arriving at the airport at 4.30am we quickly noticed that the line up was very long. When we finally got to the check-in desk they asked us where we were headed we replied Cayo Santa Maria. We advised her that we had checked in online already and we just needed to print our boarding passes. She took our baggage and weighed it. Right before she was about to send it off a rude manager from the back came and just yelled out ""gates to Santa Clara are closed"". We were so shocked because it was only 4.55am at that time. We told them the plane would just be sitting there we could still make it. The rep simply told us ""please step aside we need to assist other pas..."
4,9.0,"DXB-LHR B777-200ER BA0108 August 18 First Class. Transferred from an Emirates flight in DXB. BA DXB Galleries Lounge reception staff member excellent. Boarding reasonable with an on-time departure. Cabin crew outstanding and definitely lived up to the ""To Fly. To Serve"" BA slogan. Food tasty and well presented but not quite First Class and cost-cutting was evident. The New First seat is comfortable though the footrest is poorly designed and a storage area for small inflight items is missing. IFE monitor controls and selection very good but the screen could be more adjustable for reach. Lovely cabin ambience including colors textures mood lighting and window blinds. Toilets cramped and stocked with the cheapest liquid soaps and toilet paper. Overall an enjoyable flight but as a longstan..."


In [5]:
df_test.head()

Unnamed: 0,rating,content
0,8.0,JNB-LHR on the new airbus. Seats were roomy and comfy staff polite and friendly and inflight entertainment system outstanding. We had terrible turbulence throughout the flight but the captain was informative and reassuring and everyone remained calm. Food not great but otherwise excellent.
1,6.0,"Flew Business Class DOH-BOM-DOH. Outbound: Used the Oryx lounge at Doha airport which was nice. Cabin was nearly empty. Seats are similar to those on Jet's domestic business class. Found it difficult to sleep with the recline provided. At 6'3"" legrests did not help as my legs overshot it. The light sandwich was passable. Service was attentive and cheerful. Inbound: Evening flight so looked forward to meal and wine. Same cheap French table wine. Indian non-veg meal was not great. Cabin crew were attentive and friendly. IFE was limited. One negative was that my bag was one of the last off both flights with a priority tag."
2,5.0,This is a rough review because we flew first business and coach. We usually fly coach but for a trip to Napa we used our points to go first class. The AA/United merger combined the worst two airlines in the Western world. Flew on 4/7 (260 / 193) - BAN-DFX-SAN. Service food seating excellent. Plane a little old and shaky but all in all a good flight. Returned 4/13 (193/5290) SAN-Charlotte-BNA. Although we had first class we were relegated to business with an accompanying drop in quality across the board. The trouble is the age of the planes - it's like something from a museum. The noise from the engine was so loud it was like sticking your head under the hood of a car. But for once all flights left on time without mechanical problems. We were closer to the real world of 95% of all trave...
3,1.0,Am thoroughly fed up with Flybe customer services. My wife and I checked into the Flybe desk at Faro on the 29th September for the return trip to Exeter. Despite the fact that we had pre-booked our seats paying the New Economy Flight tariff we were told the aircraft was too full to allow us to sit together. I complained to Flybe customer services yet I have not even received the courtesy of an acknowledgement. Both outbound and inbound flights were also an hour late on departure.
4,5.0,I have flown MIA-JFK on an old B767-300. Flight was full. Seat very old but comfortable and pitch was ok. Entertainment only through overhead screens. I was shocked that this aircraft had the new hatrack design only for premium cabins Economy cabin still had the 90's hatrack design which a) looks very old and b) does not have enough room for everyone. Only non-alcoholic drink service for this 2hr flight. Staff very unprofessional. There seems to be an inconsistency in their uniform. One female flight attendant was wearing - what seemed to be her personal apron that featured dozens of US flags next to each other. After landing in JFK we had to hold at a remote stand before our gate was available. As usual passengesr were standing up and retrieving their bags from the overhead bins despi...


In [6]:
X_train, Y_train = df_train['content'], df_train['rating']
X_test, Y_test = df_test['content'], df_test['rating']

In [7]:
from sklearn.metrics import r2_score
import re


def convert_to_vw(raw_text, target):
    word_pattern = re.compile(r"[a-zA-Z0-9_]+")
    words = []
    for match in re.finditer(word_pattern, raw_text.lower()):
        words.append(match.group(0))
    
    if not words: 
        return None
    return "{} |d {}".format(float(target), " ".join(words))

def write_vw(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw(x, y)
            if not vw_object:
                continue
            f.write(vw_object + '\n')
            
def read_target_from_vw(vw_object):
    return float(vw_object.split(' ')[0])

def calc_r2(predictions_path, answers_path):
    with open(predictions_path, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_path, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])
        
    return r2_score(y_expected, y_pred)

In [8]:
write_vw(X_train, Y_train, 'airlines.train.vw')
write_vw(X_test, Y_test, 'airlines.test.vw')

In [9]:
! head -n 1 airlines.train.vw

9.0 |d march 5th 2014 from ottawa canada to cuba wg 630 they announced that the flight was going to be delayed 1 hour no explanation why they started boarding and we took off only 1 2 hour late there were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side on the way back from cuba on march 12th 2014 wg 631 we were slow going through immigration no fault of sunwing finally arrived to our plane at 10 35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected the 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us overall the staff were great very friendly and approachable the food served was pretty good considering most airlines don t offer meal service for free it was comparable to meals we ve had to purchase on other airlines


## Byte Pair Encoding

In [10]:
import sentencepiece as spm

In [11]:
text = df.content
sents = [t.split('.') for t in text]

with open('airbnb_sents.txt', 'w') as f:
    for t in sents:
        for s in t:
            f.write(s)
            f.write('\n')

In [12]:
%%time

! spm_train --input=airbnb_sents.txt --model_prefix=airlines.model1 --vocab_size=1000 --character_coverage=0.99--model_type=bpe

sentencepiece_trainer.cc(79) LOG(INFO) Starts training with : 
trainer_spec {
  input: airbnb_sents.txt
  input_format: 
  model_prefix: airlines.model1
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.99
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normaliza

In [13]:
with open('aairlines.texts.txt', 'w') as f:
    for t in text:
        f.write(t)
        f.write('\n')

In [14]:
%%time

!spm_encode --model=airlines.model1.model --output_format=id aairlines.texts.txt > aairlines.bpe.txt

CPU times: user 33.2 ms, sys: 11.9 ms, total: 45.1 ms
Wall time: 3.29 s


In [15]:
! head -n 2 aairlines.bpe.txt

940 235 99 620 44 290 8 242 40 21 613 7 75 410 21 208 0 3 0 128 115 0 304 94 303 152 45 12 46 5 20 9 588 7 59 263 140 190 67 934 754 0 304 581 12 248 6 49 428 258 125 140 0 114 190 281 0 454 26 3 0 19 149 101 26 79 12 826 6 76 30 21 41 16 257 26 574 14 720 85 3 440 137 3 440 0 157 5 237 150 44 75 410 21 15 940 140 114 99 620 208 0 3 0 128 138 49 26 834 588 403 976 67 763 19 63 152 40 16 0 162 41 287 314 7 116 103 48 402 0 128 200 135 5 196 69 4 3 260 30 12 24 245 32 745 18 6 5 103 428 258 235 264 571 441 264 906 148 412 12 0 31 3 0 19 149 26 634 262 354 166 8 146 137 101 617 573 68 140 114 250 3 18 21 465 47 187 137 580 585 858 149 0 455 5 108 26 221 51 179 6 3 368 191 21 90 136 0 31 87 280 9 634 70 729 16 370 309 369 34 8 438 229 64 22 398 0 355 9 589 136 7 450 49 34 169 56 7 791 15 159 309 0
3 680 17 817 17 178 184 0 14 469 0 518 161 44 611 15 5 37 128 0 115 9 221 716 32 325 13 9 22 8 152 245 461 7 163 94 268 86 422 79 25 274 166 30 86 12 488 3 0 449 58 16 137 84 174 143 40 189 141 

In [16]:
def convert_to_vw(raw_text, target):
    return "{} |d {}".format(float(target), raw_text)
        
def write_vw(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw(x, y)
            if not vw_object:
                continue
            f.write(vw_object)

In [17]:
with open('aairlines.bpe.txt', 'r') as bpe:
    bpe_texts = bpe.readlines()

    
len(bpe_texts)

34868

In [18]:
write_vw(bpe_texts[:len_train], df.rating[:len_train], 'airlines.train.vw')
write_vw(bpe_texts[len_train:], df.rating[len_train:], 'airlines.test.vw')

In [19]:
! head -n 1 'airlines.train.vw'

9.0 |d 940 235 99 620 44 290 8 242 40 21 613 7 75 410 21 208 0 3 0 128 115 0 304 94 303 152 45 12 46 5 20 9 588 7 59 263 140 190 67 934 754 0 304 581 12 248 6 49 428 258 125 140 0 114 190 281 0 454 26 3 0 19 149 101 26 79 12 826 6 76 30 21 41 16 257 26 574 14 720 85 3 440 137 3 440 0 157 5 237 150 44 75 410 21 15 940 140 114 99 620 208 0 3 0 128 138 49 26 834 588 403 976 67 763 19 63 152 40 16 0 162 41 287 314 7 116 103 48 402 0 128 200 135 5 196 69 4 3 260 30 12 24 245 32 745 18 6 5 103 428 258 235 264 571 441 264 906 148 412 12 0 31 3 0 19 149 26 634 262 354 166 8 146 137 101 617 573 68 140 114 250 3 18 21 465 47 187 137 580 585 858 149 0 455 5 108 26 221 51 179 6 3 368 191 21 90 136 0 31 87 280 9 634 70 729 16 370 309 369 34 8 438 229 64 22 398 0 355 9 589 136 7 450 49 34 169 56 7 791 15 159 309 0


## word2vec

In [20]:
import re
import nltk.data 
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

words_only = re.compile('[a-z]+')

def letters(s, regex = words_only):
    if isinstance(s, str):
        return words_only.findall(s.lower())
    else:
        return []

# удаление стоп-слов
def remove_stopwords(tokens, sw = stop_words):
    return [t for t in tokens if not t in sw]

def preprocess(s):
    return remove_stopwords(letters(s))

In [21]:
from nltk.stem.snowball import SnowballStemmer
from functools import lru_cache


snowball = SnowballStemmer("english")

def stemm_description(d):
    @lru_cache(maxsize=128)
    def stemm_token(token):
        return snowball.stem(token)

    return [stemm_token(t) for t in d if len(stemm_token(t)) >= 3]

In [22]:
from multiprocessing import Pool
from tqdm import tqdm_notebook as tqdm

with Pool(8) as p:
    clean_text = list(tqdm(p.imap(letters, df.content), total=len(df)))
    
df['clean_text'] = clean_text
    
with Pool(8) as p:
    stems = list(tqdm(p.imap(stemm_description, df.clean_text), total=len(df)))
    
df['stems'] = stems

df.sample()

HBox(children=(FloatProgress(value=0.0, max=34809.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=34809.0), HTML(value='')))




Unnamed: 0,rating,content,clean_text,stems
22046,9.0,I never had much of an issue with this airline. Ppolite staff food very reasonable a good selection of drinks and in-flight entertainment. On occasion the flights can be late but this is more due to the weather and the odd mechanical hiccup however this has improved. You may find this airline very expensive (Amsterdam to Atyrau) with 50% saving to be had via other routings. These comments relate to trips Amsterdam to Atyrau Atyrau to Aktau and Istanbul to Atyrau over the last 3 years.,"[i, never, had, much, of, an, issue, with, this, airline, ppolite, staff, food, very, reasonable, a, good, selection, of, drinks, and, in, flight, entertainment, on, occasion, the, flights, can, be, late, but, this, is, more, due, to, the, weather, and, the, odd, mechanical, hiccup, however, this, has, improved, you, may, find, this, airline, very, expensive, amsterdam, to, atyrau, with, saving, to, be, had, via, other, routings, these, comments, relate, to, trips, amsterdam, to, atyrau, atyrau, to, aktau, and, istanbul, to, atyrau, over, the, last, years]","[never, had, much, issu, with, this, airlin, ppolit, staff, food, veri, reason, good, select, drink, and, flight, entertain, occas, the, flight, can, late, but, this, more, due, the, weather, and, the, odd, mechan, hiccup, howev, this, has, improv, you, may, find, this, airlin, veri, expens, amsterdam, atyrau, with, save, had, via, other, rout, these, comment, relat, trip, amsterdam, atyrau, atyrau, aktau, and, istanbul, atyrau, over, the, last, year]"


In [23]:
with open('text_for_vector_models.txt', 'w') as f:
    for s in df.stems:
        f.write(' '.join(s))
        f.write('\n')

In [24]:
from gensim.models import word2vec

In [25]:
w2v_model = word2vec.Word2Vec(df.stems, workers=4, size=300, min_count=10, window=4, sample=1e-3)

In [26]:
print(w2v_model.wv.most_similar(positive=["arriv"], topn=5))

[('land', 0.7369639277458191), ('reach', 0.6460075378417969), ('leav', 0.5718560218811035), ('depart', 0.5467904806137085), ('departur', 0.5097861289978027)]


In [27]:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(list(word2vec.values())[0])

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
w2v = dict(zip(w2v_model.wv.index2word, w2v_model.wv.vectors))

In [28]:
train_vec = MeanEmbeddingVectorizer(w2v).transform(df.stems[:len_train])
test_vec = MeanEmbeddingVectorizer(w2v).transform(df.stems[len_train:])

In [29]:
train_vec_list = []
for vec in train_vec:
    train_vec_list.append(" ".join(vec.astype(str))+'\n')

test_vec_list = []
for vec in test_vec:
    test_vec_list.append(" ".join(vec.astype(str))+'\n')

In [30]:
write_vw(train_vec_list, df.rating[:len_train], 'airlines.train.vw')
write_vw(test_vec_list, df.rating[len_train:], 'airlines.test.vw')

In [31]:
! head -n 1 airlines.train.vw

9.0 |d -0.533525 0.22157243 -0.079161815 0.06949304 -0.18635605 -0.26331526 0.27673662 -0.30878136 -0.050869215 -0.390512 -0.047729775 0.11917531 0.15479024 -0.05440995 0.578881 0.042289637 -0.2302667 -0.09609324 0.13519546 -0.005928808 0.07478955 0.33502516 0.20409022 0.007634786 0.23034781 0.037199657 0.089911 -0.0130975265 -0.12902294 0.21588038 0.04489422 0.4076767 0.012097265 -0.03573991 0.043163758 -0.17549662 -0.09406937 -0.17676057 -0.13911007 0.051370833 0.0036234178 0.09253819 -0.045813717 0.06842078 0.083209164 0.30235234 -0.023727683 -0.21897721 -0.014977982 -0.029575758 0.012575022 -0.16241282 -0.24299334 -0.12627102 -0.15009402 0.041331537 0.022551334 0.065563224 -0.13570751 -0.1952967 -0.1322302 0.025231747 0.017009635 0.23763262 -0.011145899 -0.09357851 0.04317278 -0.020360943 -0.18530348 -0.30971047 -0.15794027 -0.029386092 -0.015007634 -0.09153813 -0.14747308 -0.20038635 -0.07110803 -0.0098932 -0.09247125 0.112206034 0.039026182 0.06304746 -0.003676025 0.076778665 0.0

## fasttext

In [32]:
with open('sentences_w2v.txt', 'w') as f:
    for s in df.stems:
        f.write(' '.join(s)+'\n')

In [33]:
import fasttext

%time ft_model = fasttext.train_unsupervised('sentences_w2v.txt', minn=3, maxn=4, dim=300)

CPU times: user 5min 52s, sys: 78.8 ms, total: 5min 52s
Wall time: 2min 57s


In [34]:
ft_model.get_nearest_neighbors('two')

[(0.5578595399856567, 'three'),
 (0.4924752116203308, 'four'),
 (0.45771321654319763, 'twenti'),
 (0.44715416431427, 'one'),
 (0.4464314579963684, 'six'),
 (0.4384196102619171, 'lwo'),
 (0.4272649884223938, 'seven'),
 (0.42494577169418335, 'avml'),
 (0.41626739501953125, 'sever'),
 (0.4153198003768921, 'btv')]

In [36]:
class MeanFTEmbeddingVectorizer(object):
    def __init__(self, ft_model):
        self.ft_model = ft_model
        self.dim = 300

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.ft_model[w] for w in words ]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

In [37]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
regressor = Ridge(solver='sparse_cg')

ft_pipeline = Pipeline([
    ("fasttext vectorizer", MeanFTEmbeddingVectorizer(ft_model)),
    ("regressor", regressor)])

In [38]:
%%time 
ft_pipeline.fit(df.stems[:len_train], df.rating[:len_train])

CPU times: user 16.9 s, sys: 1.87 s, total: 18.8 s
Wall time: 17.2 s


Pipeline(steps=[('fasttext vectorizer',
                 <__main__.MeanFTEmbeddingVectorizer object at 0x7fe412e53700>),
                ('regressor', Ridge(solver='sparse_cg'))])

In [39]:
y_pred = ft_pipeline.predict(df.stems[len_train:])
score = r2_score(df.rating[len_train:], y_pred)
print(score)

0.6270737894348059


In [40]:
train_vec_ = MeanFTEmbeddingVectorizer(ft_model).transform(df.stems[:len_train])
test_vec_ = MeanFTEmbeddingVectorizer(ft_model).transform(df.stems[len_train:])

In [41]:
train_vec_list_ = []
for vec in train_vec_:
    train_vec_list_.append(" ".join(vec.astype(str))+'\n')

test_vec_list_ = []
for vec in test_vec_:
    test_vec_list_.append(" ".join(vec.astype(str))+'\n')

In [42]:
write_vw(train_vec_list_, df.rating[:len_train], 'airlines.train.vw')
write_vw(test_vec_list_, df.rating[len_train:], 'airlines.test.vw')

In [43]:
! tail -n 1 airlines.test.vw

6.0 |d -0.056210034 0.12246174 0.008659998 -0.091258794 0.05676221 0.025556434 -0.01025939 0.05970572 -0.0029892186 0.007282945 0.063229434 -0.010545965 -0.033377443 0.011570189 0.040170886 0.00063148804 0.041947342 -0.026340084 0.025963942 -0.002189623 -0.012326189 -0.024820626 -0.0237386 -0.066191286 0.0034169997 -0.025406586 0.016302496 -0.0616548 0.07028285 0.0017741205 0.0290974 -0.029379597 0.04772751 0.029853476 -0.008407073 -0.061618462 -0.04064907 0.082222 0.045179326 0.08398146 0.03352416 0.009024294 -0.010396097 0.033919666 -0.10519761 -0.06565115 0.053353775 0.001979665 0.02526908 0.02047187 0.024780752 -0.0075714756 -0.20205498 -0.17929636 0.15614599 0.056763515 0.096764185 -0.081832394 0.005990319 -0.016435638 0.09416235 -0.01722016 -0.11328355 0.029640773 -0.073450774 -0.007783102 -0.13208783 0.032507986 -0.07274781 -0.0033256337 -0.09201386 0.07687179 0.13965818 0.14147465 0.059970718 0.05565129 -0.04969842 0.019261595 0.027872834 0.009260165 -0.031177362 0.051768526 -0

## Only stemming

In [44]:
def convert_to_vw(raw_text, target):
    return "{} |d {}".format(float(target), raw_text)
        
def write_vw(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw(x, y)
            if not vw_object:
                continue
            f.write(vw_object)

In [45]:
x_train = df.stems[:len_train].apply(lambda t: " ".join(t)+'\n')
x_test = df.stems[len_train:].apply(lambda t: " ".join(t)+'\n')

In [46]:
write_vw(x_train, df.rating[:len_train], 'airlines.train.vw')
write_vw(x_test, df.rating[len_train:], 'airlines.test.vw')

In [47]:
! head -n 2 airlines.train.vw

9.0 |d march from ottawa canada cuba they announc that the flight was delay hour explan whi they start board and took off onli hour late there were were seat togeth and remain were put aisl seat side side the way back from cuba march were slow through immigr fault sunw final arriv our plane the door immedi close and the plane took off minut later minut earlier than expect the were pretti much split each seat old daughter herself behind overal the staff were great veri friend and approach the food serv was pretti good consid most airlin don offer meal servic for free was compar meal had purchas other airlin
7.0 |d sin fra bhx economi first leg from singapor the was great larg becaus was fortun enough get exit row seat with unlimit legroom judg fellow passeng one wouldn happi with normal seat they had rather pathet legroom nice modern avod system but the ptvs were rather small compar other airlin servic was realli friend and warm but few frill amen kit whatsoev footrest meal were alrigh

## обучение

In [48]:
! vw --final_regressor result.model.vw airlines.train.vw --learning_rate 5 --bit_precision 18 --passes 20 -c -k

final_regressor = result.model.vw
Num weight bits = 18
learning rate = 5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airlines.train.vw.cache
Reading datafile = airlines.train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
81.000000 81.000000            1            1.0   9.0000   0.0000      111
45.582334 10.164668            2            2.0   7.0000   3.8118       95
39.327467 33.072601            4            4.0   1.0000   9.0000      183
42.267739 45.208011            8            8.0   1.0000  10.0000      212
27.301089 12.334438           16           16.0   8.0000   3.9412       52
23.931698 20.562308           32           32.0   1.0000   8.4146      124
16.819548 9.707398           64           64.0  10.0000  10.0000      164
13.836894 10.854240          128          128.0   8.0000   6.8688      106
12.438965 11.041037          25

In [49]:
! vw --initial_regressor result.model.vw --testonly --predictions predictions.txt airlines.test.vw

only testing
predictions = predictions.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airlines.test.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
2.024555 2.024555            1            1.0   8.0000   9.4229       43
1.079673 0.134791            2            2.0   6.0000   6.3671       96
9.942823 18.805973            4            4.0   1.0000   4.5513       69
5.763292 1.583761            8            8.0   8.0000   6.1873      127
3.924885 2.086479           16           16.0   8.0000   6.5185       39
3.637553 3.350221           32           32.0   8.0000   5.4999       45
3.979342 4.321132           64           64.0   8.0000   7.7984       26
4.706120 5.432898          128          128.0   8.0000   5.3631      145
4.857834 5.009548          256          256.0   2.0000   3.5602       77
4.750883 4.643932

In [50]:
calc_r2('predictions.txt', 'airlines.test.vw')

0.5623280639048646

Базовая модель выдает **0.55**. Решения, которые выбили не менее **0.56** будут оценены в 100 баллов. Все решения с меньшим качеством - соразмерно тому, какое качество получено.

В качестве ответа необходимо прислать Zip архив с названием `result.zip`, в котором будут два файла - `airlines.train.vw` и `airlines.test.vw` (названия должны быть именно такими!) с вашими закодированными признаками. Важно никаким образом не менять порядок и набор объектов в каждом из файлов. Решения, в которых будут обнаружены изменения порядка или набора будут оцениваться в 0 баллов.

На этих двух файлах будет вначале запущен процесс обучения vw с параметрами, как в базовой модели. То есть `vw --final_regressor result.model.vw airlines.train.vw --learning_rate 5 --bit_precision 18 --passes 20 -c -k`. После этого будет подсчитана метрика r2 на файле `airlines.test.vw` запуском соответствующей команды (последовательность в точности совпадает с базовым решением выше).

Так, если бы вы хотели сдать базовое решение в качестве своего, вам необходимо было бы сделать следующее.

In [53]:
! zip result.zip airlines.train.vw airlines.test.vw

updating: airlines.train.vw (deflated 66%)
updating: airlines.test.vw (deflated 66%)


In [54]:
! head -c 100 result.zip

PK    �5Smq�оB> �   airlines.train.vwUT	 F�(aF�(aux �  d   ��k����$��V+�����ZH�$