### Import libraries

In [2]:
import pandas as pd

### Getting to know the data

In [3]:
train_df = pd.read_json('data/ranking_train.jsonl', lines=True)

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88107 entries, 0 to 88106
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      88107 non-null  object
 1   comments  88107 non-null  object
dtypes: object(2)
memory usage: 1.3+ MB
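The `lines=True` flag tells pandas that the file is in JSON Lines format, that is, one JSON object per line. A minimal sketch on a made-up two-record sample (the records are illustrative and are not taken from `ranking_train.jsonl`; only the field names follow the notebook):

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines format: one JSON object per line.
sample = io.StringIO(
    '{"text": "Post A", "comments": [{"text": "c1", "score": 0}, {"text": "c2", "score": 1}]}\n'
    '{"text": "Post B", "comments": [{"text": "c3", "score": 0}]}\n'
)

sample_df = pd.read_json(sample, lines=True)

# Each row holds a post title and a Python list of comment dicts,
# which is why `info()` reports both columns as `object`.
print(sample_df.shape)
print(type(sample_df.loc[0, "comments"]))
```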


In [5]:
train_df.head()

Unnamed: 0,text,comments
0,How many summer Y Combinator fundees decided n...,[{'text': 'Going back to school is not identic...
1,CBS acquires last.fm for $280m,[{'text': 'It will be curious to see where thi...
2,How Costco Became the Anti-Wal-Mart,[{'text': 'I really hate it when people falsel...
3,"Fortune Favors Big Turds | Screw The Money, Th...",[{'text': 'His real point is that something ca...
4,StartupWeekend: 70 Founders Create One Company...,[{'text': 'Looks like someone hasn't read The ...


### Check the data for NaN values

In [6]:
train_df.isna().sum()

text        0
comments    0
dtype: int64
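`isna()` only detects missing values, so a post with an empty list of comments would still count as non-null. A small sketch of an additional check, using a toy frame with made-up values (the frame `toy` is an assumption for illustration, not part of the dataset):

```python
import pandas as pd

# Toy data: "Post B" has an empty comment list.
toy = pd.DataFrame({
    "text": ["Post A", "Post B", "Post C"],
    "comments": [[{"text": "c1", "score": 0}], [], [{"text": "c2", "score": 0}]],
})

# isna() reports no missing values, even though "Post B" has no comments.
missing = toy.isna().sum()

# .str.len() also works on list-valued columns and returns each list's length.
comments_per_post = toy["comments"].str.len()
empty_posts = toy[comments_per_post == 0]

print(missing)
print(empty_posts)
```

Running `train_df['comments'].str.len().value_counts()` on the real data would show whether every post carries the same number of comments.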

### Number of unique values in the `text` column

In [7]:
train_df['text'].nunique()

87664

In [8]:
duplicates = train_df[train_df.duplicated(subset='text')]
duplicates['text'].value_counts()

Facebook is down                                           6
Ask HN: Idea Sunday                                        5
Ask HN: What are you working on?                           4
Ask HN: Who's Hiring?                                      4
Ask HN: Review my startup                                  3
                                                          ..
Google I/O 2012                                            1
ZeroRPC                                                    1
All that is wrong with the Recruitment Industry            1
Study predicts imminent irreversible planetary collapse    1
Clinkle                                                    1
Name: text, Length: 399, dtype: int64
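By default `duplicated(subset='text')` uses `keep='first'`, so the first occurrence of each title is not marked and the counts above are the numbers of extra copies (88107 - 87664 = 443 rows in total). A short sketch with made-up titles showing the difference from `keep=False`, which marks every copy:

```python
import pandas as pd

# Hypothetical titles: "A" appears three times, "B" twice, "C" once.
toy = pd.DataFrame({"text": ["A", "B", "A", "A", "C", "B"]})

# keep='first' (the default): only the repeats after the first occurrence are marked.
extra_copies = toy[toy.duplicated(subset="text")]["text"].value_counts()

# keep=False: every row whose title occurs more than once is marked.
all_copies = toy[toy.duplicated(subset="text", keep=False)]["text"].value_counts()

print(extra_copies)
print(all_copies)
```

Under the default setting, a title with a count of 6 therefore occurs 7 times in the full data.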

### Expand the data so that each row pairs a post with a single comment

In [9]:
df = train_df.explode('comments', ignore_index=True)
df.head(10)

Unnamed: 0,text,comments
0,How many summer Y Combinator fundees decided n...,{'text': 'Going back to school is not identica...
1,How many summer Y Combinator fundees decided n...,{'text': 'There will invariably be those who d...
2,How many summer Y Combinator fundees decided n...,{'text': 'For me school is a way to be connect...
3,How many summer Y Combinator fundees decided n...,{'text': 'I guess it really depends on how hun...
4,How many summer Y Combinator fundees decided n...,{'text': 'I know pollground decided to go back...
5,CBS acquires last.fm for $280m,{'text': 'It will be curious to see where this...
6,CBS acquires last.fm for $280m,{'text': 'Does this mean that there's now a bi...
7,CBS acquires last.fm for $280m,{'text': 'Also on BBC News: http://news.bbc.c...
8,CBS acquires last.fm for $280m,{'text': 'I don't understand what they do that...
9,CBS acquires last.fm for $280m,{'text': 'sold out too cheaply. given their le...
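What `explode` does here can be seen on a toy frame (the values are made up for illustration): every element of a list becomes its own row, and the post title is repeated for each of them.

```python
import pandas as pd

# Toy data: one post with two comments, one post with a single comment.
toy = pd.DataFrame({
    "text": ["Post A", "Post B"],
    "comments": [
        [{"text": "c1", "score": 0}, {"text": "c2", "score": 1}],
        [{"text": "c3", "score": 0}],
    ],
})

# ignore_index=True replaces the repeated original index with 0..n-1.
exploded = toy.explode("comments", ignore_index=True)

print(exploded)
```

The fresh `0..n-1` index matters later, because the merge step relies on row positions matching.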


### Normalize the JSON data into a flat table, separating `text` from `score`

In [10]:
df_comments = pd.json_normalize(df['comments'])
df_comments.head(5)

Unnamed: 0,text,score
0,Going back to school is not identical with giv...,0
1,There will invariably be those who don't see t...,1
2,For me school is a way to be connected to what...,2
3,I guess it really depends on how hungry you ar...,3
4,I know pollground decided to go back to school...,4
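As a minimal illustration of this step, `pd.json_normalize` turns a sequence of dicts into a table with one column per key. The records below are made up, and a plain list is passed instead of a Series only to keep the sketch simple:

```python
import pandas as pd

# Hypothetical comment records with the same keys as in the dataset.
records = [
    {"text": "c1", "score": 0},
    {"text": "c2", "score": 1},
]

# Each dict key becomes a column; each dict becomes a row.
flat = pd.json_normalize(records)

print(flat)
```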


### For more efficient training and better convergence of the pretrained model, the scores need to be changed, because in that model's training data score = 1.0 corresponds to two identical texts.
#### Convert `score` to the interval [0.2; 0.8], where `0.8` corresponds to the most popular comment and `0.2` to the least popular one

In [11]:
map_dict = {0: 0.8, 1: 0.65, 2: 0.5, 3: 0.35, 4: 0.2}

df_comments['score'] = df_comments['score'].map(map_dict)
df_comments.columns = ['comment', 'score']
df_comments.tail(20)

Unnamed: 0,comment,score
440515,We use a HSA-qualified high-deductible policy ...,0.8
440516,I&#x27;m self-employed and use a catastrophic ...,0.65
440517,"Mom &amp; School. Thanks Obama! No really, t...",0.5
440518,I&#x27;ve had one employer offer to pay you th...,0.35
440519,"I ended up on disability in 2003, been on Medi...",0.2
440520,neat insight! A friend of mine convinced me to...,0.8
440521,The fixed header and footer on this page makes...,0.65
440522,"Good read, thanks for putting this together. G...",0.5
440523,Awesome article. Thanks Justin!,0.35
440524,dopeness - very useful,0.2
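The dictionary above is a linear rescaling of the rank: each step down in rank lowers the score by 0.15. One thing to keep in mind is that `Series.map` silently produces NaN for any key missing from the dictionary, so a quick check after mapping can be useful. A sketch on made-up ranks (the rank `7` is deliberately invalid):

```python
import pandas as pd

map_dict = {0: 0.8, 1: 0.65, 2: 0.5, 3: 0.35, 4: 0.2}

# Hypothetical ranks; 7 is not a key of map_dict.
ranks = pd.Series([0, 1, 2, 3, 4, 7])

mapped = ranks.map(map_dict)

# Keys that are missing from the dictionary become NaN.
unmapped = int(mapped.isna().sum())

# The same values for ranks 0..4 written as a formula: 0.8 - 0.15 * rank.
linear = (0.8 - 0.15 * ranks.iloc[:5]).round(2)

print(unmapped)
print(linear.tolist())
```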


### Transform the original dataframe into a convenient form by merging it with the previous results

In [12]:
df = pd.merge(df['text'], df_comments, left_index=True, right_index=True)
df.head(10)

Unnamed: 0,text,comment,score
0,How many summer Y Combinator fundees decided n...,Going back to school is not identical with giv...,0.8
1,How many summer Y Combinator fundees decided n...,There will invariably be those who don't see t...,0.65
2,How many summer Y Combinator fundees decided n...,For me school is a way to be connected to what...,0.5
3,How many summer Y Combinator fundees decided n...,I guess it really depends on how hungry you ar...,0.35
4,How many summer Y Combinator fundees decided n...,I know pollground decided to go back to school...,0.2
5,CBS acquires last.fm for $280m,It will be curious to see where this heads in ...,0.8
6,CBS acquires last.fm for $280m,Does this mean that there's now a big-name com...,0.65
7,CBS acquires last.fm for $280m,Also on BBC News: http://news.bbc.co.uk/1/low...,0.5
8,CBS acquires last.fm for $280m,I don't understand what they do that is worth ...,0.35
9,CBS acquires last.fm for $280m,sold out too cheaply. given their leadership p...,0.2
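The merge on index works because both inputs share the same `0..n-1` index: `explode(..., ignore_index=True)` resets it, and `json_normalize` creates a fresh one. A small sketch with made-up values, which also shows that `pd.concat` along columns gives the same table under that assumption:

```python
import pandas as pd

# Toy stand-ins for df['text'] and df_comments; both have the index 0..2.
posts = pd.Series(["Post A", "Post A", "Post B"], name="text")
comments = pd.DataFrame({"comment": ["c1", "c2", "c3"], "score": [0.8, 0.65, 0.8]})

# Join rows that share the same index label.
merged = pd.merge(posts, comments, left_index=True, right_index=True)

# With identical indexes, column-wise concatenation yields the same rows.
concatenated = pd.concat([posts, comments], axis=1)

print(merged)
```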


#### It is interesting to see what the most relevant comments contain and how they relate to the post title that is repeated most often in the whole dataset

In [13]:
facebook = df[df['text'] == "Facebook is down"]
facebook.to_excel('facebookIsDown.xlsx')
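The top-ranked comments can also be inspected directly in the notebook, without exporting to Excel. A sketch on a toy frame with made-up comments (only the column names and the title follow the notebook):

```python
import pandas as pd

# Toy data: the repeated title has two threads, each with its own top comment.
toy = pd.DataFrame({
    "text": ["Facebook is down"] * 4 + ["Other post"] * 2,
    "comment": ["a", "b", "c", "d", "e", "f"],
    "score": [0.8, 0.65, 0.8, 0.65, 0.8, 0.65],
})

subset = toy[toy["text"] == "Facebook is down"]

# Keep only the comments that received the highest score within the subset.
top_comments = subset.loc[subset["score"] == subset["score"].max(), "comment"]

print(top_comments.tolist())
```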

### Save the dataframe in pickle format

In [14]:
df.to_pickle("data/ranking_train.pkl")  
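The saved file can later be loaded with `pd.read_pickle`, which restores the frame with its dtypes intact. A minimal round-trip sketch on a toy frame, written to a temporary directory (the file name `toy.pkl` is hypothetical):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"text": ["Post A"], "comment": ["c1"], "score": [0.8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    # Load the frame back while the temporary file still exists.
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```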