# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user ratings of various airlines. The full dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">here</a>.

Each record contains the review a user left and the rating they gave, from 0 to 10. For now we will work __only with the review texts of the train set__ (file "airlines.reviews.train.tsv").

__Note:__ Tasks 1-3 must be completed in order, because each one uses the results of the previous one.

In [10]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

# Show all columns and long text cells in full when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 800)

In [2]:
df = pd.read_csv('airlines.reviews.train.tsv', sep='\t', usecols=['content'])

In [4]:
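import re

# Tokenizer: keep only runs of lowercase Latin letters
words_only = re.compile('[a-z]+')

def letters(s, regex = words_only):
    # Non-string values (e.g. NaN) produce an empty token list
    if isinstance(s, str):
        return regex.findall(s.lower())
    else:
        return []

df['clean_content'] = df.content.apply(letters)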
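
To see what this tokenizer keeps and what it drops, here is a small self-contained sketch on a made-up sentence (the sample string is an assumption, not a row from the dataset). Because the pattern matches only runs of `a`-`z`, digits and punctuation act as separators, so a contraction such as `wouldn't` is split into two tokens.

```python
import re

words_only = re.compile('[a-z]+')

def letters(s, regex=words_only):
    # Non-string values (e.g. NaN for a missing review) yield an empty list
    if isinstance(s, str):
        return regex.findall(s.lower())
    return []

sample = "Flight WG 630 wouldn't leave on time!"  # hypothetical example text
tokens = letters(sample)
print(tokens)
```

The flight number disappears entirely and the apostrophe leaves a stray `t` token, which is worth keeping in mind when reading the frequency lists later.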

In [5]:
df.head(2)

,content,clean_content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...,"[march, th, from, ottawa, canada, to, cuba, wg, they, announced, that, the, flight, was, going, to, be, delayed, hour, no, explanation, why, they, started, boarding, and, we, took, off, only, hour, late, there, were, of, us, were, seated, together, and, remaining, were, put, in, aisle, seats, side, by, side, on, the, way, back, from, cuba, on, march, th, wg, we, were, slow, going, through, immigration, no, fault, of, sunwing, finally, arrived, to, our, plane, at, am, the, doors, immediately, closed, and, the, plane, took, off, minutes, later, minutes, earlier, than, expected, the, of, us, were, pretty, much, split, up, by, ...]"
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.,"[sin, fra, bhx, in, economy, first, leg, from, singapore, on, the, a, was, great, largely, because, i, was, fortunate, enough, to, get, an, exit, row, seat, with, unlimited, legroom, judging, by, fellow, passengers, one, wouldn, t, be, happy, with, normal, seats, as, they, had, rather, pathetic, legroom, nice, modern, avod, system, but, the, ptvs, were, rather, small, compared, to, other, a, airlines, service, was, really, friendly, and, warm, but, few, frills, no, amenity, kit, whatsoever, no, footrests, meals, were, alright, but, again, rather, simple, compared, to, asian, carriers, second, leg, to, birmingham, on, an, a, was, above, average, by, intra, ...]"


### Task 2 (10 points)

Work with the text obtained in Task 1.

Stem the tokens with SnowballStemmer from the NLTK library. Then remove all stop words (take the stop-word list from NLTK). Find the 20 most frequent stems among those that remain after stop-word removal and write them, in descending order of frequency (as in Task 1), to the file __popular_stems.txt__.

Save the resulting texts (stems with stop words removed) for Task 3.

In [5]:
### Your solution to Task 2

In [6]:
from nltk.stem.snowball import SnowballStemmer
from functools import lru_cache
from tqdm import tqdm

snowball = SnowballStemmer("english")

In [20]:
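@lru_cache(maxsize=None)
def stemm_token(token):
    # Defined at module level so the cache persists across reviews
    return snowball.stem(token)

def stemm_description(d):
    # Stem every token of one tokenized review
    return [stemm_token(t) for t in d]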
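
Where the cached helper is defined matters: an `lru_cache` wrapper created inside a function is rebuilt on every call, so it can only remember tokens within a single review. The sketch below illustrates the difference with a hypothetical stand-in stemmer, `fake_stem`, that merely truncates tokens and logs each real call; it is not the NLTK stemmer.

```python
from functools import lru_cache

call_log = []  # records every token that reaches the (fake) stemmer

def fake_stem(token):
    # Hypothetical stand-in for snowball.stem, used only to count real calls
    call_log.append(token)
    return token[:4]

def stem_nested(tokens):
    @lru_cache(maxsize=128)
    def cached(token):  # a fresh, empty cache on every call of stem_nested
        return fake_stem(token)
    return [cached(t) for t in tokens]

@lru_cache(maxsize=None)
def cached_shared(token):  # one cache shared by all calls in this process
    return fake_stem(token)

def stem_shared(tokens):
    return [cached_shared(t) for t in tokens]

docs = [["flight", "seat", "flight"], ["flight", "seat"]]

nested_result = [stem_nested(d) for d in docs]
nested_calls = len(call_log)

call_log.clear()
shared_result = [stem_shared(d) for d in docs]
shared_calls = len(call_log)

print(nested_calls, shared_calls)
```

Both variants return the same stems, but the shared cache calls the stemmer once per distinct token overall rather than once per distinct token per review. With `multiprocessing.Pool`, each worker process keeps its own copy of the cache.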

In [15]:
from multiprocessing import Pool

with Pool(2) as p:
    stems = list(tqdm(p.imap(stemm_description, df.clean_content), total=len(df.clean_content)))
    
df['stems'] = stems

100%|██████████| 23322/23322 [00:14<00:00, 1651.77it/s]


In [22]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [23]:
def remove_stopwords(tokens, sw = stop_words):
    # Keep only tokens that are not in the stop-word list
    return [t for t in tokens if t not in sw]
df['stems_out_stop'] = df.stems.apply(remove_stopwords)
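
Membership tests against a Python list scan the list element by element, so for a corpus of this size it may be worth converting the stop-word list to a `set` first. A minimal sketch, with a tiny hand-written stop list standing in for NLTK's (the words and the sample tokens below are assumptions for illustration only):

```python
# Hypothetical mini stop list; the notebook uses stopwords.words('english')
stop_words = {"the", "was", "and", "to", "a"}

def remove_stopwords(tokens, sw=stop_words):
    # Set lookup is O(1) on average, versus a linear scan for a list
    return [t for t in tokens if t not in sw]

sample_tokens = ["the", "flight", "was", "late", "and", "the", "crew", "was", "kind"]
filtered = remove_stopwords(sample_tokens)
print(filtered)
```

The function body is unchanged; only the type of the container passed as `sw` differs, so the same idea applies to the full NLTK list via `set(stopwords.words('english'))`.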

In [6]:
import re
import nltk.data
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words[:10]

# tokenization: keep only runs of lowercase Latin letters
words_only = re.compile('[a-z]+')

def letters(s, regex = words_only):
    if isinstance(s, str):
        return regex.findall(s.lower())
    else:
        return []

# stop-word removal
def remove_stopwords(tokens, sw = stop_words):
    return [t for t in tokens if t not in sw]

def preprocess(s):
    return remove_stopwords(letters(s))

In [7]:
df['clean_description'] = df.content.map(preprocess)

In [8]:
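from functools import lru_cache

@lru_cache(maxsize=None)
def stemm_token(token):
    # Module-level cache: each distinct token is stemmed once per process
    return snowball.stem(token)

def stemm_description(d):
    # Stem every token and drop stems shorter than 3 characters
    stems = (stemm_token(t) for t in d)
    return [s for s in stems if len(s) >= 3]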

In [13]:
from multiprocessing import Pool
from nltk.stem.snowball import SnowballStemmer
snowball = SnowballStemmer("english")
snowball.stem('apartment')

with Pool(2) as p:
    stems = list(tqdm(p.imap(stemm_description, df.clean_description), total=len(df)))
    
df['stems'] = stems


In [14]:
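from collections import Counter
from itertools import chain

# Flatten the per-review stem lists into one list of stems
words = list(chain.from_iterable(df['stems']))
# Count once; most_common returns (stem, count) pairs in descending order of count
stem_counts = Counter(words)
answer = [stem for stem, _ in stem_counts.most_common(20)]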
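
The flatten-and-count step can be checked on a toy example before running it on the whole corpus. This is a sketch with made-up stem lists (not dataset values), using `itertools.chain` to flatten and `Counter.most_common` to rank; the variable names are illustrative.

```python
from collections import Counter
from itertools import chain

toy_stems = [
    ["flight", "seat", "flight"],
    ["seat", "flight", "crew"],
    [],  # a review with no tokens left after preprocessing
]

# Flatten the list of lists; empty lists contribute nothing
words = list(chain.from_iterable(toy_stems))
counts = Counter(words)

# Keep only the stems, in descending order of frequency
top2 = [stem for stem, _ in counts.most_common(2)]
print(counts.most_common(2))
print(top2)
```

Counting once and reusing the `Counter` avoids building the same frequency table twice.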

In [16]:
with open('popular_stems.txt', 'w') as f:
    f.write('\n'.join(answer))
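
A quick way to confirm the file has the expected shape (one stem per line, most frequent first) is to read it back. The sketch below writes to a temporary directory so it does not touch the real `popular_stems.txt`; the three stems are the first three from the ranking shown in this notebook, and the explicit `encoding` argument is an added assumption.

```python
import os
import tempfile

answer = ["flight", "seat", "time"]  # first three stems of the notebook's ranking

# Write to a throwaway location rather than the real answer file
path = os.path.join(tempfile.mkdtemp(), "popular_stems.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(answer))

# Read the file back and split it into lines again
with open(path, encoding="utf-8") as f:
    read_back = f.read().splitlines()

print(read_back)
```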

In [15]:
Counter(words).most_common(20)

[('flight', 47092),
 ('seat', 23530),
 ('time', 14828),
 ('servic', 14326),
 ('good', 13337),
 ('food', 12481),
 ('airlin', 11004),
 ('hour', 10313),
 ('crew', 10160),
 ('staff', 8934),
 ('plane', 8671),
 ('check', 8551),
 ('return', 8270),
 ('cabin', 8109),
 ('class', 8095),
 ('fli', 7976),
 ('board', 7770),
 ('would', 7728),
 ('one', 7437),
 ('busi', 6956)]