# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user ratings of various airlines. The full dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">here</a>.

Each record contains the review a user left and their rating from 0 to 10. For now we will work __only with the review texts of the train set__ (the file "airlines.reviews.train.tsv").

__Note:__ Tasks 1-3 must be completed in order, because each one uses the results of the previous one.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 800)

In [2]:
df = pd.read_csv('airlines.reviews.train.tsv', sep='\t', usecols=['content'])

In [3]:
df.head()

Unnamed: 0,content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."
3,"My fiancé and I were booked to fly to Cayo Santa Maria (CUBA) February 6-13 2014. Our flight was scheduled to leave at 6.10am. Upon arriving at the airport at 4.30am we quickly noticed that the line up was very long. When we finally got to the check-in desk they asked us where we were headed we replied Cayo Santa Maria. We advised her that we had checked in online already and we just needed to print our boarding passes. She took our baggage and weighed it. Right before she was about to send it off a rude manager from the back came and just yelled out ""gates to Santa Clara are closed"". We were so shocked because it was only 4.55am at that time. We told them the plane would just be sitting there we could still make it. The rep simply told us ""please step aside we need to assist other pas..."
4,"DXB-LHR B777-200ER BA0108 August 18 First Class. Transferred from an Emirates flight in DXB. BA DXB Galleries Lounge reception staff member excellent. Boarding reasonable with an on-time departure. Cabin crew outstanding and definitely lived up to the ""To Fly. To Serve"" BA slogan. Food tasty and well presented but not quite First Class and cost-cutting was evident. The New First seat is comfortable though the footrest is poorly designed and a storage area for small inflight items is missing. IFE monitor controls and selection very good but the screen could be more adjustable for reach. Lovely cabin ambience including colors textures mood lighting and window blinds. Toilets cramped and stocked with the cheapest liquid soaps and toilet paper. Overall an enjoyable flight but as a longstan..."


### Task 3 (30 points)

Work with the texts obtained in Task 2.

Apply a TF-IDF transformation (with n-gram range = (1, 1)) to the document collection. For each document, find the top-1 stem with the highest tf-idf weight. Write these stems to the file __tfidf_stems.txt__ in the following format: one word per document, with lines in the same order as the documents in the original dataset. The resulting file must contain as many lines and words as there are documents in "airlines.reviews.train.tsv".
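
To make the "top-1 term per document" idea concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed idf formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`; the function name `top_tfidf_term` and the toy documents are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def top_tfidf_term(docs):
    """Return the highest-weighted term of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (assumed to match scikit-learn's default formula).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    result = []
    for doc in docs:
        tf = Counter(doc)
        # L2 normalization rescales a whole document by one factor,
        # so it cannot change which term has the largest weight.
        result.append(max(tf, key=lambda t: tf[t] * idf[t]) if tf else "")
    return result

# Toy corpus (hypothetical): already tokenized and stemmed.
toy_docs = [
    ["flight", "delay", "delay", "staff"],
    ["flight", "seat", "legroom", "legroom"],
    ["flight", "staff", "food"],
]
top_terms = top_tfidf_term(toy_docs)
print(top_terms)  # ['delay', 'legroom', 'food']
```

The term `flight` occurs in every toy document, so its idf is the smallest and it never wins, even though it is as frequent as most other terms.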

In [4]:
### Your solution to Task 3

In [5]:
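import re

# Compile once: the pattern matches runs of lowercase ASCII letters.
WORD_PATTERN = re.compile(r'[a-z]+')

def cleaning(text):
    """Lowercase the text and return its alphabetic tokens as a list."""
    # An empty list (rather than None) is returned when nothing matches,
    # so that later steps can still iterate over the result.
    return WORD_PATTERN.findall(text.lower())

df['clean_content'] = df.content.apply(cleaning)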
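
It helps to see what a `[a-z]+` pattern does to real review text. The short sketch below uses a made-up sample sentence and a hypothetical helper `tokenize`; it shows that digits and punctuation disappear and that contractions are split at the apostrophe.

```python
import re

WORD_PATTERN = re.compile(r"[a-z]+")

def tokenize(text):
    # Lowercase first, then keep only runs of ASCII letters.
    return WORD_PATTERN.findall(text.lower())

# Made-up sample in the style of the reviews.
sample = "Flight WG 630 wasn't late; took off at 10.35am!"
tokens = tokenize(sample)
print(tokens)
# ['flight', 'wg', 'wasn', 't', 'late', 'took', 'off', 'at', 'am']
```

Fragments such as `t` and `am` are a side effect of this tokenization and may be worth keeping in mind when reading the final top terms.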

In [6]:
from nltk.stem.snowball import SnowballStemmer
from functools import lru_cache
from tqdm import tqdm

snowball = SnowballStemmer("english")

In [7]:
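# The cache lives at module level so that it is shared across documents;
# defined inside stemm_description it would be rebuilt on every call.
@lru_cache(maxsize=128)
def stemm_token(token):
    return snowball.stem(token)

def stemm_description(d):
    """Stem every token of a document, dropping tokens whose stem is empty."""
    stems = (stemm_token(t) for t in d)
    return [s for s in stems if s]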
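
Where an `lru_cache` is defined decides how much it helps: a cached function created at module level is shared by every document, which can be checked through `cache_info()`. In the sketch below `fake_stem` is a hypothetical stand-in for `snowball.stem`, used only so the example runs without NLTK.

```python
from functools import lru_cache

call_counter = {"n": 0}

def fake_stem(token):
    # Hypothetical stand-in for a real stemmer: strips trailing "s".
    call_counter["n"] += 1
    return token.rstrip("s")

@lru_cache(maxsize=None)
def cached_stem(token):
    return fake_stem(token)

docs = [["seats", "flights"], ["seats", "seats", "meals"]]
stems = [[cached_stem(t) for t in doc] for doc in docs]
info = cached_stem.cache_info()

print(stems)                    # [['seat', 'flight'], ['seat', 'seat', 'meal']]
print(info.hits, info.misses)   # 2 3
```

Five tokens are processed but the underlying stemmer runs only three times, once per distinct token.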

In [8]:
from multiprocessing import Pool

with Pool(2) as p:
    stems = list(tqdm(p.imap(stemm_description, df.clean_content), total=len(df.clean_content)))
    
df['stems'] = stems

100%|██████████| 23322/23322 [00:11<00:00, 1987.46it/s]


In [9]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set gives constant-time membership tests

In [10]:
def remove_stopwords(tokens, sw=stop_words):
    """Drop tokens that appear in the stop-word collection `sw`."""
    return [t for t in tokens if t not in sw]

df['stems_out_stop'] = df.stems.apply(remove_stopwords)
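
One subtlety of filtering after stemming is that a stemmed stop word may no longer match the unstemmed stop-word list. The sketch below illustrates this with hand-written stand-ins: `STOP_WORDS` is a tiny hypothetical list, and `STEMS` holds the stems an English Snowball stemmer is assumed to produce for these tokens (for example `very` becoming `veri`).

```python
# Hand-written stand-ins (assumptions), not the real NLTK resources.
STOP_WORDS = {"the", "was", "very"}
STEMS = {"the": "the", "crew": "crew", "was": "was",
         "very": "veri", "helpful": "help"}

def stem_then_filter(tokens):
    # Order used in this notebook: stem first, then drop stop words.
    return [s for s in (STEMS[t] for t in tokens) if s not in STOP_WORDS]

def filter_then_stem(tokens):
    # Alternative order: drop stop words first, then stem what is left.
    return [STEMS[t] for t in tokens if t not in STOP_WORDS]

tokens = ["the", "crew", "was", "very", "helpful"]
after_stemming = stem_then_filter(tokens)
before_stemming = filter_then_stem(tokens)
print(after_stemming)   # ['crew', 'veri', 'help']
print(before_stemming)  # ['crew', 'help']
```

Whether this matters here depends on how the texts from Task 2 were prepared, so treat it as something to check rather than a required change.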

In [11]:
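from sklearn.feature_extraction.text import TfidfVectorizer

# The vectorizer expects strings, so each token list is joined back with spaces.
# Note: the default token_pattern keeps only tokens of two or more characters.
vec = TfidfVectorizer(ngram_range=(1, 1))
tfidf = vec.fit_transform(df['stems_out_stop'].apply(' '.join))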

In [12]:
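# Invert the vocabulary: column index -> term.
index_value = {idx: term for term, idx in vec.vocabulary_.items()}

# For every document, map each non-zero term to its tf-idf weight.
fully_indexed = []
for row in tfidf:
    fully_indexed.append({index_value[col]: value for col, value in zip(row.indices, row.data)})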

In [13]:
answer = []
for fi in fully_indexed:
    # Pick the term with the highest tf-idf weight in this document.
    answer.append(max(fi.items(), key=lambda x: x[1])[0])

In [15]:
with open('tfidf_stems.txt', 'w') as f:
    f.write('\n'.join(answer))
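
The task requires the file to contain exactly one line per document, which is easy to verify after writing. The sketch below repeats the join-based writing on a toy list and counts the resulting lines; `write_lines`, `count_lines`, and the temporary file name are hypothetical helpers for illustration, and an empty term list is deliberately not handled.

```python
import os
import tempfile

def write_lines(path, terms):
    # Joining with "\n" yields one line per term and no trailing newline.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(terms))

def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return len(f.read().split("\n"))

terms = ["wg", "rather", "star"]  # toy stand-in for `answer`
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tfidf_stems_check.txt")
    write_lines(path, terms)
    n_lines = count_lines(path)

print(n_lines == len(terms))  # True
```

On the real data the same comparison would be made between the line count of `tfidf_stems.txt` and `len(df)`.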

In [16]:
answer[:5]

['wg', 'rather', 'star', 'santa', 'ba']