# Цель ноутбука:

Ответить на следующие вопросы:


* количество типов разных сущностей
* средняя длина сущностей разного типа
* топ 10 сущностей каждого типа 
* есть ли вложенность / пересечения между сущностями

# Читаем данные

In [1]:
import json

with open('../data/train_dataset.json') as f:
    data = json.load(f)
    
print(len(data))
    
with open('../data/dev_dataset.json') as f:
    data += json.load(f)
    
print(len(data))

168300
177100


In [2]:
data[130000]

{'tokens': ['марка', 'лыж', 'и', 'ботинок', '—', 'fischer'],
 'token_labels': ['O', 'O', 'O', 'O', 'O', 'B-CORP'],
 'lang': 'RU-Russian',
 'id': 'b894d113-5781-4557-ac92-4df41c3676a0',
 'domain': 'train',
 'type': 'train'}

In [3]:
data[-2100]

{'tokens': ['71', '134аэн', '—', 'трамвайный', 'вагон', 'лм-99аэн'],
 'token_labels': ['O', 'O', 'O', 'O', 'O', 'B-PROD'],
 'lang': 'RU-Russian',
 'id': 'f5fc6956-ca44-44be-8760-29497c58bb8d',
 'domain': 'dev',
 'type': 'dev'}

# Распределение типов сущностей по языкам и по train/dev

In [4]:
from collections import Counter

def count_entities(sample):
    collapsed_entities = []
    saw_begin_before = False

    for label in sample['token_labels']:
        if label == 'O':
            saw_begin_before = False
        elif label[0] == 'B':
            saw_begin_before = True
            collapsed_entities.append(label[2:])
        elif label[0] == 'I':
            if saw_begin_before:
                continue
            else:
                raise ValueError("Found I-label without B-label before")
    
    c = dict(Counter(collapsed_entities))
    c['type'] = sample['type']
    c['lang'] = sample['lang']
    return c

In [5]:
import pandas as pd
entity_counts = pd.DataFrame(
    [count_entities(sample) for sample in data]
).fillna(0)

entity_counts.head(10)

Unnamed: 0,CORP,type,lang,GRP,PER,CW,PROD,LOC
0,1.0,train,BN-Bangla,0.0,0.0,0.0,0.0,0.0
1,0.0,train,BN-Bangla,1.0,0.0,0.0,0.0,0.0
2,0.0,train,BN-Bangla,0.0,1.0,0.0,0.0,0.0
3,1.0,train,BN-Bangla,0.0,0.0,0.0,0.0,0.0
4,0.0,train,BN-Bangla,1.0,0.0,0.0,0.0,0.0
5,0.0,train,BN-Bangla,0.0,1.0,0.0,0.0,0.0
6,0.0,train,BN-Bangla,0.0,1.0,0.0,0.0,0.0
7,0.0,train,BN-Bangla,0.0,1.0,0.0,0.0,0.0
8,0.0,train,BN-Bangla,0.0,0.0,1.0,0.0,0.0
9,0.0,train,BN-Bangla,0.0,0.0,1.0,0.0,0.0


In [6]:
entity_names = [c for c in entity_counts.columns if c not in ['type', 'lang']]

Cписок сущностей:

In [7]:
entity_names

['CORP', 'GRP', 'PER', 'CW', 'PROD', 'LOC']

### Ответ на вопрос про пересечение/вложенность 

Нет, ее нет (иначе бы мы получили более сложные названия сущностей). 

Доля каждой из сущностей во всем датасете:

In [8]:
entity_counts[entity_names].mean()

CORP    0.195528
GRP     0.194715
PER     0.261406
CW      0.228052
PROD    0.208741
LOC     0.321637
dtype: float64

В разрезе train/dev

In [9]:
entity_counts.groupby(['type'])[entity_names].mean()

Unnamed: 0_level_0,CORP,GRP,PER,CW,PROD,LOC
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
dev,0.1975,0.186136,0.266136,0.228977,0.21,0.333182
train,0.195425,0.195163,0.261159,0.228004,0.208675,0.321034


Пропорции сущностей сохраняются. А если в разрезе разных языков?

In [10]:
entity_counts.groupby(['lang', 'type'])[entity_names].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,CORP,GRP,PER,CW,PROD,LOC
lang,type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
BN-Bangla,dev,0.15875,0.1475,0.18,0.15,0.2375,0.12625
BN-Bangla,train,0.169804,0.15719,0.170327,0.14098,0.208366,0.15366
DE-German,dev,0.20625,0.2,0.37,0.23625,0.16625,0.37
DE-German,train,0.201503,0.229346,0.345621,0.229216,0.193529,0.312288
EN-English,dev,0.24125,0.2375,0.3625,0.22,0.18375,0.2925
EN-English,train,0.203333,0.233399,0.352745,0.245229,0.191046,0.31366
ES-Spanish,dev,0.17625,0.21,0.30875,0.24,0.1925,0.3425
ES-Spanish,train,0.189412,0.21085,0.307582,0.241176,0.198693,0.324706
FA-Farsi,dev,0.2,0.205,0.25125,0.25875,0.19625,0.405
FA-Farsi,train,0.19549,0.209085,0.279216,0.241438,0.193137,0.371438


Здесь уже появляется более-менее заметная разница. Посмотрим на нее поближе

In [11]:
langmeans = entity_counts.groupby(['lang', 'type'])[entity_names].mean().reset_index(level=1)
langmeans_diffs = (langmeans.loc[langmeans['type'] == 'train', entity_names] - langmeans.loc[langmeans['type'] == 'dev', entity_names]).abs()
langmeans_diffs

Unnamed: 0_level_0,CORP,GRP,PER,CW,PROD,LOC
lang,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BN-Bangla,0.011054,0.00969,0.009673,0.00902,0.029134,0.02741
DE-German,0.004747,0.029346,0.024379,0.007034,0.027279,0.057712
EN-English,0.037917,0.004101,0.009755,0.025229,0.007296,0.02116
ES-Spanish,0.013162,0.00085,0.001168,0.001176,0.006193,0.017794
FA-Farsi,0.00451,0.004085,0.027966,0.017312,0.003113,0.033562
HI-Hindi,0.008971,0.000817,0.008211,0.009338,0.010139,0.0071
KO-Korean,0.021536,0.001969,0.037279,0.008791,0.019812,0.007949
NL-Dutch,0.019894,0.012328,0.023105,0.009199,0.01933,0.012377
RU-Russian,0.014632,0.00576,0.000719,0.000719,0.002165,0.000498
TR-Turkish,0.004542,0.024453,0.000253,0.003905,0.010605,0.059404


Разница почти вплоть до 0.06. Это не очень мало, но вряд ли она нам как-то заметно испортить жизнь. Просто нужно будет учитывать этот возможный сдвиг. 

In [12]:
langmeans_diffs.max()

CORP    0.037917
GRP     0.029346
PER     0.037279
CW      0.025229
PROD    0.029134
LOC     0.059404
dtype: float64

Но тут был подсчет именно числа сущностей. Если в каком-то тексте было несколько сущностей одного вида, то они все считались. 

Попробуем теперь то же самое, но теперь будем считать тексты

In [13]:
entity_counts[entity_names] = (entity_counts[entity_names] > 0).astype(int)

In [14]:
entity_counts.groupby(['type']).mean()

Unnamed: 0_level_0,CORP,GRP,PER,CW,PROD,LOC
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
dev,0.182614,0.166023,0.215341,0.2025,0.188068,0.215682
train,0.180392,0.173987,0.212299,0.20082,0.184302,0.213553


Тут все почти без изменений. Сразу посмотрим на новые максимумы разностей средних.

In [15]:
langmeans = entity_counts.groupby(['lang', 'type'])[entity_names].mean().reset_index(level=1)
langmeans_diffs = (langmeans.loc[langmeans['type'] == 'train', entity_names] - langmeans.loc[langmeans['type'] == 'dev', entity_names]).abs()
langmeans_diffs.max()

CORP    0.033113
GRP     0.027557
PER     0.024485
CW      0.024747
PROD    0.029134
LOC     0.033440
dtype: float64

В целом, если смотреть на тексты, то разница между train и dev становится менее заметной. Это хорошо, поможет при разбиении. 

In [16]:
langmeans_diffs.to_csv('../data/processed_data/langmeans_diffs.csv')  # это нужно для ноутбука "Train-val split"

# Посмотрим теперь на сами сущности

In [17]:
def extract_entities(sample):
    extracted_entities = []

    buffer = []
    for token, label in zip(sample['tokens'], sample['token_labels']):
        if label == 'O':
            if buffer:
                extracted_entities.append({
                    'entity': buffer[0][1].split('-')[1],
                    'text': ' '.join([b[0] for b in buffer]),
                    'type': sample['type'],
                    'lang': sample['lang']
                })
                buffer = []
            
        elif label[0] == 'B':
            if buffer:
                extracted_entities.append({
                    'entity': buffer[0][1].split('-')[1],
                    'text': ' '.join([b[0] for b in buffer]),
                    'type': sample['type'],
                    'lang': sample['lang']
                })
                buffer = []
            
            buffer.append((token, label))
        elif label[0] == 'I':
            buffer.append((token, label))
            
    return extracted_entities

In [18]:
entities = []

for sample in data:
    entities.extend(extract_entities(sample))

In [19]:
entities[:4]

[{'entity': 'GRP',
  'text': 'হার্ভার্ড বিশ্ববিদ্যালয়',
  'type': 'train',
  'lang': 'BN-Bangla'},
 {'entity': 'PER',
  'text': 'মার্কো মারুলি',
  'type': 'train',
  'lang': 'BN-Bangla'},
 {'entity': 'CORP', 'text': 'এমজি মোটর', 'type': 'train', 'lang': 'BN-Bangla'},
 {'entity': 'GRP',
  'text': 'উদ্ভিদ উদ্যান',
  'type': 'train',
  'lang': 'BN-Bangla'}]

In [20]:
c = Counter(['  |  '.join([e['text'],e['entity']]) for e in entities])

In [21]:
c.most_common(10)

[('rotten tomatoes  |  CORP', 589),
 ('ایران  |  LOC', 527),
 ('united states census bureau  |  GRP', 453),
 ('köy  |  LOC', 372),
 ('national register of historic places  |  GRP', 307),
 ('대한민국  |  LOC', 276),
 ('ऐतिहासिक स्थलों का राष्ट्रीय पंजीकरण  |  GRP', 237),
 ('dvd  |  PROD', 231),
 ('بخش مرکزی  |  LOC', 227),
 ('미국  |  LOC', 221)]

In [22]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Но подавляющее большинство сущностей встречается только один раз:

In [23]:
pd.value_counts(list(c.values()))[:20]

1     113794
2      15537
3       5474
4       2654
5       1505
6        983
7        671
8        488
9        369
10       304
11       216
12       180
13       178
15       115
14       114
16       107
18        75
17        65
19        62
20        54
dtype: int64

Теперь по разным типам. Топ-10 типов сущностей каждого вида:

In [24]:
for entity_name in entity_names:
    print(entity_name, '\n')
    c = Counter([e['text'] for e in entities if e['entity'] == entity_name])
    print('\n'.join(f'{v[1]}\t\t{v[0]}' for v in c.most_common(10)))
    print('\n\n__________________________________')

CORP 

589		rotten tomatoes
193		रॉटेन टमेटोज़
182		mtv
165		bbc
163		netflix
154		tvn
140		شرکت هواپیمایی
124		hbo
123		爛 番 茄
120		jtbc


__________________________________
GRP 

453		united states census bureau
307		national register of historic places
237		ऐतिहासिक स्थलों का राष्ट्रीय पंजीकरण
109		संयुक्त राज्य जनगणना ब्यूरो
101		초등학교
81		the new york times
59		중학교
56		the beatles
48		comité olímpico internacional
47		the guardian


__________________________________
PER 

54		carlos linneo
52		반고
35		türk
32		eugène simon
28		edward meyrick
26		alman
25		এডওয়ার্ড মেয়ারিক
23		স্টিফান ভন ব্রুনিং
23		michael jackson
22		william shakespeare


__________________________________
CW 

159		فیلم
131		microsoft windows
87		single
68		linux
63		windows
58		ios
55		itunes
55		한서
52		android
52		sencillo


__________________________________
PROD 

231		dvd
196		motorfiets
130		xbox 360
114		playstation 4
113		playstation 3
105		yumurta
102		xbox one
92		playstation 2
90		cd
90		کانی


______

И посмотрим еще на длины сущностей (в смысле количества токенов)


По-простому (средняя длина)

In [25]:
for entity_name in entity_names:
    m = np.mean([len(e['text'].split()) for e in entities if e['entity'] == entity_name])
    print(entity_name, '\t', m) 

CORP 	 2.134791334631958
GRP 	 2.417424918227773
PER 	 2.2474277206160136
CW 	 2.566065184655828
PROD 	 1.7099574234308474
LOC 	 1.7034905542349557


И аналогично предыдущего пункту выведу топ-10 самых часто встречающихся длин

In [26]:
for entity_name in entity_names:
    print(entity_name, '\n')
    c = Counter([len(e['text'].split()) for e in entities if e['entity'] == entity_name])
    print('\n'.join(f'{v[1]}\t\t{v[0]}' for v in c.most_common(10)))
    print('\n\n__________________________________')

CORP 

13982		1
11133		2
4882		3
1597		4
715		5
645		6
392		7
249		8
144		9
62		10


__________________________________
GRP 

12272		2
8774		1
6741		3
3247		4
1628		5
513		6
267		7
95		8
49		9
22		10


__________________________________
PER 

30818		2
6002		1
5054		3
948		4
631		6
622		5
454		7
210		8
87		9
76		10


__________________________________
CW 

12413		1
11522		2
6340		3
4219		4
2245		5
1069		6
559		7
334		8
226		9
114		10


__________________________________
PROD 

20524		1
10500		2
3055		3
1310		4
497		5
235		6
93		7
80		9
57		8
32		10


__________________________________
LOC 

36947		1
9700		2
4279		3
2574		4
1111		5
505		6
282		7
181		8
149		9
92		10


__________________________________
