In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import html2text
from bs4 import BeautifulSoup
from tqdm import tqdm
from langdetect import detect

In [2]:
data_source = pd.read_csv("competitions_info.csv")
docs = []

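for doc in data_source.Description.values:
    # Name the parser explicitly so results do not depend on installed parsers
    parsed_html = BeautifulSoup(str(doc), "html.parser")
    # Drop script and style blocks, which contain no readable text
    for tag in parsed_html(["script", "style"]):
        tag.decompose()
    docs.append(" ".join(parsed_html.stripped_strings).replace("\n", " "))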
data_source["prep"] = docs
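
To see what the cleaning step does, here is a minimal sketch on a made-up HTML snippet. The helper name `html_to_text` is hypothetical, and the parser is pinned to Python's built-in `html.parser` so the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup


def html_to_text(raw_html):
    """Strip <script>/<style> blocks and return the visible text on one line."""
    soup = BeautifulSoup(str(raw_html), "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.stripped_strings).replace("\n", " ")


# Made-up snippet, not taken from competitions_info.csv
snippet = (
    "<div><p>First line.</p><p>Second line.</p>"
    "<style>p {color: red}</style><script>var x = 1;</script></div>"
)
print(html_to_text(snippet))
```

Joining `stripped_strings` with a space keeps words from adjacent tags apart, which plain string concatenation would not.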

In [3]:
bert_data = pd.read_csv("competitions_info_summ.csv")
bart_data = pd.read_csv("competitions_info_summ_bart.csv")
gpt2_data = pd.read_csv("competitions_info_summ_gpt2.csv")

In [5]:
data = pd.DataFrame(bert_data.Summarized_flag)
data["Summarized_flag_bart"] = bart_data.Summarized_flag
data["Summarized_flag_gpt2"] = gpt2_data.Summarized_flag
data["summ_bert"] = bert_data.Summarized
data["summ_bart"] = bart_data.Summarized
data["summ_gpt2"] = gpt2_data.Summarized
data = data[(data["Summarized_flag_bart"] == True) 
            & (data["Summarized_flag"] == True)
            & (data["Summarized_flag_gpt2"] == True)]
len(data)

639

### Select the summaries that all three algorithms generated successfully:

In [6]:
data_summ = data[data["Summarized_flag"] == True]
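
Because `data` was already filtered on all three flags in the previous cell, this filter should not drop any further rows. As a sketch of checking every flag in one step, assuming the same flag column names as above (the toy frame and its values are made up for illustration):

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Summarized_flag": [True, True, False],
    "Summarized_flag_bart": [True, False, True],
    "Summarized_flag_gpt2": [True, True, True],
})

flag_cols = ["Summarized_flag", "Summarized_flag_bart", "Summarized_flag_gpt2"]
# Keep only the rows where every model produced a summary.
all_ok = toy[toy[flag_cols].all(axis=1)]
print(all_ok.index.tolist())
```

The original row labels are preserved by the filter, which matters later when they are used to look up the source text.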

In [7]:
data_summ.head()

Unnamed: 0,Summarized_flag,Summarized_flag_bart,Summarized_flag_gpt2,summ_bert,summ_bart,summ_gpt2
5,True,True,True,Your success depends upon how closely you can ...,There are 5 essay sets. Each of the sets of es...,Your success depends upon how closely you can ...
8,True,True,True,BACKGROUND AND OBJECTIVES The data set for thi...,Great Candidates of America (GCA) is one of th...,BACKGROUND AND OBJECTIVES The data set for thi...
12,True,True,True,You are presented the opportunity of a lifetim...,The Board of Directors of the National Bureau ...,You are presented the opportunity of a lifetim...
13,True,True,True,"For this challenge, potential Facebook recruit...",Facebook is seeking data-savvy software engine...,"For this challenge, potential Facebook recruit..."
15,True,True,True,Understanding how and why we are here is one o...,"Galaxies come in all shapes, sizes and colors:...",Understanding how and why we are here is one o...


In [8]:
count = 5
texts_5_10 = []
texts_10_20 = []
texts_20_30 = []

for data_index in data_summ.index:
    source_text = data_source.loc[data_index, "prep"]
    # Skip non-English texts
    if detect(source_text) != 'en':
        continue
    row = data_summ.loc[data_index]
    # Rough sentence count: split on periods and ignore very short fragments
    sent_num = len([x for x in source_text.split(".") if len(x) > 5])
    # Use a separate name so the `data` DataFrame is not overwritten
    report = f"Text id:\n{data_index}\nSource text:\n{source_text}\n\n" \
    f"BERT summarization:\n{row['summ_bert']}\n\nBART summarization:\n{row['summ_bart']}\n\nGPT2 summarization:\n{row['summ_gpt2']}\n\n" \
    + ("-"*100)
    # Bucket by sentence count, skipping source texts that were already added
    if 5 < sent_num <= 10 and source_text not in [text[0] for text in texts_5_10]:
        texts_5_10.append((source_text, report))
    elif 10 < sent_num <= 20 and source_text not in [text[0] for text in texts_10_20]:
        texts_10_20.append((source_text, report))
    elif 20 < sent_num <= 30 and source_text not in [text[0] for text in texts_20_30]:
        texts_20_30.append((source_text, report))
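
The bucketing rule in the loop above can be isolated into small helpers to check its edge cases. This is a minimal sketch, not part of the original notebook: `count_sentences` and `length_bucket` are hypothetical names, and the period-splitting heuristic is copied from the cell, so abbreviations and decimal numbers will still be miscounted.

```python
def count_sentences(text, min_chars=5):
    """Approximate sentence count: split on '.' and keep fragments longer than min_chars."""
    return len([part for part in text.split(".") if len(part) > min_chars])


def length_bucket(sent_num):
    """Map a sentence count to the bucket label used in this notebook, or None."""
    if 5 < sent_num <= 10:
        return "5_10"
    if 10 < sent_num <= 20:
        return "10_20"
    if 20 < sent_num <= 30:
        return "20_30"
    return None


# Made-up sample text with six period-terminated sentences
sample = "This is sentence number one. " * 6
print(count_sentences(sample), length_bucket(count_sentences(sample)))
```

The lower bounds are exclusive, so a text with exactly five counted sentences falls into no bucket, and anything above thirty is dropped as well.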

## Evaluate summarization on texts of different lengths:

##### 5-10 sentences:

In [12]:
# A plain loop avoids building (and echoing) a list of None values
for _, report in texts_5_10[:15]:
    print(report)

Text id:
35
Source text:
RNA polymerase II is crucial for gene transcription (DNA to RNA) and is therefore the most studied polymerase. Transcription start sites (TSS), where the polymerase II first binds to the DNA, are  marked by specific short DNA sequences. Also other structural properties of the genome (histons, nucleosomes) need to allow binding. After polymerase II binds to the DNA, it opens the DNA helix and moves along the DNA during transcription. The column polII_presence  in data specifies experimentally measured concentration of polymerase II at each position in the genome. Each line in data specifies a position (nucleotide) in the DNA sequence. Nucleotides are listed in the same order as the appear in the genome and are described with: polymerase II presence (polII_presence - target variable, absent in testing file): 4 (very present), 3 (present), 2 (insignificant), 1 (absent), 0 (absent with very high probability) nucleotide (DNA): A, T, C, G, gene structure on + strand 

##### 10-20 sentences:

In [13]:
# A plain loop avoids building (and echoing) a list of None values
for _, report in texts_10_20[:10]:
    print(report)

Text id:
13
Source text:
For this challenge, potential Facebook recruits will be exploring the map of the entire internet. Unlike the map of a city, where best routes are relatively fixed except for the occasional construction or parade detour, the paths that information travels  over the web are constantly changing. There is no centralized system of stop-lights or traffic cops.  Instead, there are tens of thousands of autonomous systems using a common protocol to advertise the next available hops, updated depending on service-agreements,  capacity, and load. This will be a test of both the candidates engineering know-how and their ability to statistically learn on complex, dynamic graph structures. The Task: you will be given a path which, at one point in the training time period, was an optimal path from node A to B. The question is then to make a probalistic prediction, for each of the 5 test graphs, whether the given path is STILL an optimal  path.  This is a much more difficult ta

##### 20-30 sentences:

In [16]:
# A plain loop avoids building (and echoing) a list of None values
for _, report in texts_20_30[:15]:
    print(report)

Text id:
5
Source text:
Your success depends upon how closely you can deliver scores to those of human expert graders. For this competition, there are 5 essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information  and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the data sets has its own unique characteristics. The variability is intended to test the  limits of your scoring engine's capabilities. The data has these columns: id : A unique identifier for each individual student essay set : 1-5, an id for each set of essays essay : The ascii text of a student's response rater1 : Rater 1's grade rater2 : Rater 2's grade grade : Resolved score between the raters In addition, a Microsoft Word 2010 Readme file describes each e