In [1]:
import pandas as pd
import re
import sklearn
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
import numpy as np

In [3]:
vostok = pd.read_csv('vostok_clean.csv', sep = ',')
sputnik = pd.read_csv('sputnik_clean.csv', sep = ',')
n1 = pd.read_csv('n1_clean.csv', sep = ',')
voa = pd.read_csv('voa_clean.csv', sep = ',')
rfe = pd.read_csv('rfe_clean.csv', sep = ',')

In [4]:
all_data = pd.concat([vostok,sputnik, n1, voa, rfe], axis=0, ignore_index=True)
len(all_data)

10599

Preparing month and country source columns for stratified sampling

In [454]:
# Month: the two characters between the first pair of hyphens in the date string
all_data['month'] = all_data['date'].str.extract(r'-(..)-', expand=False)
# Source country: everything after the first hyphen in the source label, e.g. 'Vostok-RU' -> 'RU'
all_data['country-source'] = all_data['source'].str.extract(r'-(.*)$', expand=False)
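To make the two regular expressions concrete, here is a small standalone sketch that applies them to date and source strings of the kinds that appear in this notebook. The helper names `month_of` and `country_of` are illustrative and not part of the original code.

```python
import re

def month_of(date_string):
    # '-(..)-' captures the two characters between the first pair of hyphens.
    return re.search('-(..)-', date_string).group(1)

def country_of(source_string):
    # '-(.*)$' captures everything after the first hyphen, up to the end.
    return re.search('-(.*)$', source_string).group(1)

# Both date layouts found in the data: date only, and date with a timestamp.
months = [month_of(d) for d in ['2018-11-02', '2019-02-11 15:13:59']]
countries = [country_of(s) for s in ['Vostok-RU', 'RFE-USA']]
print(months, countries)
```

Both date layouts work because the month always sits between the first two hyphens. A source label containing more than one hyphen would return everything after the first hyphen, so the pattern relies on the labels having exactly one.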

Here again we check the uniqueness of rows, to see whether any articles are shared across platforms and should therefore be removed.

In [455]:
all_data.nunique()

Unnamed: 0         3812
category             23
date               4631
source                5
text              10598
title             10549
month                 4
country-source        2
dtype: int64

Titles can legitimately be shared across outlets, so we deduplicate on text only. This drops the single row with a repeated text.

In [456]:
all_data = all_data.drop_duplicates('text')
len(all_data)

10598

In [457]:
del all_data['Unnamed: 0']
all_data.head(10)

Unnamed: 0,category,date,source,text,title,month,country-source
0,Društvo,2018-11-02,Vostok-RU,"PovodomDana slobode“, kojim se od ove godine u...",Beogradski dani slobode u Lugansku,11,RU
1,Rusija,2018-11-02,Vostok-RU,Uvodeći ograničenja Rusija je recipročno odgov...,Peskov: Sankcije Ukrajini iznuđena recipročna ...,11,RU
2,Ekonomijа,2018-11-02,Vostok-RU,"""Rosatom"" je uspešno pustio u rad reaktor prve...",„Rosatom“ pustio u rad reaktor prve plutajuće ...,11,RU
3,Region,2018-11-02,Vostok-RU,Crnogorsko pravosuđe nastavilo je progon lider...,Mandiću i Kneževiću opet oduzimaju pasoše,11,RU
4,Društvo,2018-11-02,Vostok-RU,Sudije u Hagu potpuno su se podelile u slučaju...,U Hagu donete dve suprotne odluke: Ne zna se k...,11,RU
5,Region,2018-11-02,Vostok-RU,"Zabrana ulaska u Crnu Goru Matiji Bećkoviću, Č...",„Zabrana ulaska u Crnu Goru Bećkoviću osim što...,11,RU
6,Rusija,2018-11-02,Vostok-RU,Danas se odlučuje kakav će biti svet u naredni...,Putin: Danas se odlučuje kakav će biti svet u ...,11,RU
7,Politika,2018-11-02,Vostok-RU,Premijer Italije Đuzepe Konte kaže da se nada ...,Konte se nada nastavku kontakata sa Putinom,11,RU
8,Bezbednost,2018-11-02,Vostok-RU,Teroristi su isporučili dva kontejnera sa hlor...,Savčenko: Teroristi isporučili dva kontejnera ...,11,RU
9,Rusija,2018-11-02,Vostok-RU,Najnovija izjava britanskog ministra spoljnih ...,Zaharova: Izjava Hanta o Rusiji oštra retorika...,11,RU


Check distribution before stratified sample

In [458]:
all_data.groupby(["month", "country-source"]).size()

month  country-source
01     RU                1038
       USA               2039
02     RU                1127
       USA               1725
11     RU                1181
       USA               1143
12     RU                1124
       USA               1221
dtype: int64

We draw a stratified sample over two columns, month and source country, so that the sample is proportionate to the original data set. Because we are ultimately concerned with an overview by source country, we stratify on that column rather than on the individual sources. Each of the five coders is assigned a maximum of 370 articles: 300 of their own plus 70 shared ones (40 training, 30 reliability). We therefore need 300 * 5 + 70 = 1570 articles, and 1570/10598 is roughly 0.148. We pass a slightly larger test_size of 0.149, so the split returns a few more articles than needed; the surplus is trimmed further down.

In [462]:
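split = StratifiedShuffleSplit(n_splits=1, test_size=0.149, random_state=42)
for train_index, test_index in split.split(all_data, all_data[["country-source", 'month']]):
    # split() yields positional indices, so select rows with .iloc rather than .loc;
    # after drop_duplicates the index labels no longer match the row positions
    strat_test_set = all_data.iloc[train_index]
    strat_coder_set = all_data.iloc[test_index]
strat_coder_set.head()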


Unnamed: 0,category,date,source,text,title,month,country-source
10360,,2019-02-11 15:13:59,RFE-USA,Cilj novoformiranog Koordinacionog tima za pra...,Đorđević: Stručnjaci da pomognu Vladi Srbije u...,2,USA
6727,Vesti,2019-01-28,N1-USA,[Policija je uhapsila zastupnika humanitarne o...,Uhapšen zastupnik humanitarne organizacije u N...,1,USA
7605,Kultura,2019-02-19,N1-USA,"[Orkestar Muzikon, nastavlja ""Level Up!"" sezon...","""Rock N Classic"": Kamerni orkestar Muzikon izv...",2,USA
7889,Vesti,2019-02-23,N1-USA,"[Izvor: N1, Uz Beograd, protesti su održani u ...","Protesti i u Kragujevcu, Pasjanu, Jagodini, u ...",2,USA
10027,,2019-01-17 18:46:42,RFE-USA,Policija u Gani saopštila je danas da je ubije...,Ubijen novinar u Gani koji je razotkrio korupc...,1,USA


Let's check the distribution.

In [463]:
len(strat_coder_set)

1580
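The target fraction and the 1580 rows reported above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to the next whole row, which is consistent with the count shown; the variable names are illustrative.

```python
import math

n_articles = 10598        # rows left after dropping the duplicated text
per_coder = 300           # individually coded articles per coder
n_coders = 5
shared = 40 + 30          # training + reliability articles

target = per_coder * n_coders + shared     # articles actually needed
exact_fraction = target / n_articles       # roughly 0.148

# Assumption: a fractional test_size is rounded up to a whole number of rows.
drawn = math.ceil(0.149 * n_articles)
surplus = drawn - target                   # rows to hand back to the test set
print(target, round(exact_fraction, 4), drawn, surplus)
```

Under that assumption the split draws ten articles more than the 1570 required, which matches the ten articles dropped later in the notebook.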

In [464]:
strat_coder_set.groupby(["month", "country-source"]).size()

month  country-source
01     RU                155
       USA               304
02     RU                168
       USA               257
11     RU                176
       USA               170
12     RU                168
       USA               182
dtype: int64
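One way to quantify how well the sample preserves the original proportions is to compare the share of each month and country cell in the two tables. The sketch below hard-codes the counts printed in this notebook; the dictionary and variable names are illustrative.

```python
# Counts per (month, country) cell, copied from the two groupby outputs.
full = {
    ('01', 'RU'): 1038, ('01', 'USA'): 2039,
    ('02', 'RU'): 1127, ('02', 'USA'): 1725,
    ('11', 'RU'): 1181, ('11', 'USA'): 1143,
    ('12', 'RU'): 1124, ('12', 'USA'): 1221,
}
sample = {
    ('01', 'RU'): 155, ('01', 'USA'): 304,
    ('02', 'RU'): 168, ('02', 'USA'): 257,
    ('11', 'RU'): 176, ('11', 'USA'): 170,
    ('12', 'RU'): 168, ('12', 'USA'): 182,
}

n_full = sum(full.values())
n_sample = sum(sample.values())

# Difference between each cell's share in the sample and in the full data.
gaps = {cell: sample[cell] / n_sample - full[cell] / n_full for cell in full}
largest_gap = max(abs(gap) for gap in gaps.values())
print(n_full, n_sample, largest_gap)
```

With these counts, every cell's share in the sample differs from its share in the full data by less than a tenth of a percentage point.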

The distribution by month and source country is as expected: the proportions of the original data are closely preserved. Now we need to create five separate files, one for each coder, once again stratified. To do so we first randomly drop ten articles so that we have exactly 1570 articles. Those ten articles are returned to the testing data set.

In [466]:
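# No random_state is set here, so a rerun drops a different set of ten articles
random_ten = strat_coder_set.sample(10)
dropping = random_ten['text'].tolist()
strat_coder_set = strat_coder_set[~strat_coder_set['text'].isin(dropping)]
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
strat_test_set = pd.concat([strat_test_set, random_ten])
len(strat_coder_set)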

1570

In [467]:
def repeated_split(x, dataset):
    """Split off a stratified fraction x of dataset; return (remainder, split-off part)."""
    split = StratifiedShuffleSplit(n_splits=1, test_size=x, random_state=42)
    # n_splits=1, so the loop body runs exactly once
    for train_index, test_index in split.split(dataset, dataset[["country-source", 'month']]):
        strat_train_set = dataset.iloc[train_index]
        strat_test_set = dataset.iloc[test_index]
    return strat_train_set, strat_test_set

In [468]:
first_coder = repeated_split(0.19, strat_coder_set)
second_coder = repeated_split(0.23, first_coder[0])
third_coder = repeated_split(0.30, second_coder[0])
fourth_coder = repeated_split(0.44, third_coder[0])
fifth_coder = repeated_split(0.81, fourth_coder[0])

In [469]:
damjan_data = first_coder[1]
milica_data = second_coder[1]
ognjan_data = third_coder[1]
spela_data = fourth_coder[1]
teo_data = fifth_coder[1]
trai_reli = fifth_coder[0]

In [470]:
len(damjan_data)

299

In [471]:
len(milica_data)

293

In [472]:
len(ognjan_data)

294

In [473]:
len(spela_data)

301

In [474]:
len(teo_data)

311

In [475]:
len(trai_reli)

72
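The sizes reported above follow directly from the rounded fractions passed to `repeated_split`. The sketch below replays the chain of splits on the row counts alone, assuming that each fractional `test_size` is rounded up to a whole row, and lists the unrounded fractions that would give exactly 300 per coder. Variable names are illustrative.

```python
import math

fractions = [0.19, 0.23, 0.30, 0.44, 0.81]   # values used in the notebook
remaining = 1570
sizes = []
for frac in fractions:
    # Assumption: the split takes ceil(fraction * rows) rows for the coder.
    taken = math.ceil(frac * remaining)
    sizes.append(taken)
    remaining -= taken
print(sizes, remaining)

# Unrounded fractions: 300 out of whatever is left before each split.
remaining_exact = 1570
exact_fractions = []
for _ in range(5):
    exact_fractions.append(300 / remaining_exact)
    remaining_exact -= 300
print([round(f, 3) for f in exact_fractions], remaining_exact)
```

The replay reproduces the five coder sizes and the 72 leftover rows, so the imbalance comes from truncating the fractions to two decimals. `StratifiedShuffleSplit` also accepts an integer `test_size`, read as an absolute number of rows, which may be a simpler way to request exactly 300 per coder.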

As we can see, the split sizes are slightly unbalanced because of our approach. We correct this by moving 1 random article from spela_data to damjan_data, a random sample of 6 from teo_data to ognjan_data, and a random sample of 5 from teo_data plus 2 random articles from trai_reli to milica_data.

In [476]:
def move_random(source, target, n):
    """Move n randomly chosen rows from source to target; return both frames."""
    moved = source.sample(n=n)
    # Rows are matched on text, which is unique after deduplication
    source = source[~source['text'].isin(moved['text'])]
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    target = pd.concat([target, moved])
    return source, target

spela_data, damjan_data = move_random(spela_data, damjan_data, 1)
teo_data, ognjan_data = move_random(teo_data, ognjan_data, 6)
teo_data, milica_data = move_random(teo_data, milica_data, 5)
trai_reli, milica_data = move_random(trai_reli, milica_data, 2)

In [477]:
len(damjan_data)

300

In [478]:
len(milica_data)

300

In [479]:
len(ognjan_data)

300

In [480]:
len(spela_data)

300

In [481]:
len(teo_data)

300

In [482]:
len(trai_reli)

70

Let's check that every observation in the coder data sets is unique. To do that we concatenate all six data sets and count the unique values of text. We expect 1570.

In [483]:
all_coders = pd.concat([damjan_data, milica_data, ognjan_data, spela_data, teo_data, trai_reli],axis=0, ignore_index=True)

In [484]:
all_coders.nunique()

category            17
date               813
source               5
text              1570
title             1569
month                4
country-source       2
dtype: int64

Finally, we split the 70 shared articles into separate training and reliability data sets. A test_size of 0.57 is used because 40/70 is roughly 0.571.

In [508]:
trai_reli_new = repeated_split(0.57, trai_reli)
training_data = trai_reli_new[1]
reliability_data = trai_reli_new[0]

In [509]:
len(training_data)

40

In [510]:
len(reliability_data)

30

In [512]:
strat_test_set.to_csv('test_data.csv', sep=',', encoding='utf-8')
strat_coder_set.to_csv('coder_data.csv', sep = ',', encoding = 'utf-8')
damjan_data.to_csv('damjan_data.csv', sep = ',', encoding = 'utf-8')
milica_data.to_csv('milica_data.csv', sep = ',', encoding = 'utf-8')
teo_data.to_csv('teo_data.csv', sep = ',', encoding = 'utf-8')
ognjan_data.to_csv('ognjan_data.csv', sep = ',', encoding = 'utf-8')
spela_data.to_csv('spela_data.csv', sep = ',', encoding = 'utf-8')
trai_reli.to_csv('trai_reli.csv', sep = ',', encoding = 'utf-8')
training_data.to_csv('training_data.csv', sep = ',', encoding = 'utf-8')
reliability_data.to_csv('reliability_data.csv', sep = ',', encoding = 'utf-8')

Now that we have the data, the next step is to convert each coder's set into Word document format.

In [17]:
damjan = pd.read_csv('damjan_data.csv', sep = ',', encoding = 'utf-8')

In [71]:
def import_zip_print(file):
    """Read a coder CSV and return one [id, title, text] list per article."""
    name = pd.read_csv(file, sep=',', encoding='utf-8')
    # The ID is the row position in this file, not the original article index
    index = ['UniqueID: ' + str(ind) for ind in name.index]
    title = ['Title: ' + titl for titl in name['title']]
    text = ['Text: ' + tex for tex in name['text']]
    return [list(a) for a in zip(index, title, text)]

In [78]:
damjan_final = import_zip_print('damjan_data.csv')
ognjan_final = import_zip_print('ognjan_data.csv')
teo_final = import_zip_print('teo_data.csv')
spela_final = import_zip_print('spela_data.csv')
milica_final = import_zip_print('milica_data.csv')
training_final = import_zip_print('training_data.csv')
reliability_final = import_zip_print('reliability_data.csv')
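To show the record layout that `import_zip_print` produces, here is a pandas-free sketch of the same idea on a made-up two-row file. The CSV contents are dummy values, and the standard-library `csv` reader stands in for `pd.read_csv`, so this is an approximation of the function above rather than a copy of it.

```python
import csv
import io

# Dummy two-row file standing in for a coder CSV (contents are made up).
dummy_csv = io.StringIO(
    "title,text\n"
    "First headline,First body\n"
    "Second headline,Second body\n"
)

records = []
for position, row in enumerate(csv.DictReader(dummy_csv)):
    # One [id, title, text] list per article, labelled for the coder.
    records.append([
        'UniqueID: ' + str(position),
        'Title: ' + row['title'],
        'Text: ' + row['text'],
    ])
print(records[0])
```

Because the ID is the row position within each file, the same ID values recur in every coder's document. If the IDs need to be unique across coders, the saved original index column would be a candidate to use instead.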

In [6]:
# Preview of teo_data.csv, held under its own name so damjan_final is not overwritten
teo_preview = pd.read_csv('teo_data.csv')
teo_preview[:100]

Unnamed: 0.1,Unnamed: 0,category,date,source,text,title,month,country-source
0,5890,Svet,2019-01-15,N1-USA,[Odbacivanje postignutog dogovora o Bregzitu u...,Junker: Povećan rizik od razlaza EU i Velike B...,1,USA
1,9290,,2018-11-27 17:20:46,RFE-USA,Saudijski prijestolonasljednik Mohmed bin Salm...,Protesti u Tunisu zbog dolaska saudijskog princa,11,USA
2,3234,Politika,2018-12-19 15:27:00,Sputnik-RU,Predednik kosovske skupštine Kadri Veselji izj...,Veselji: Sporazum bez promene granica i bez ZSO,12,RU
3,7889,Vesti,2019-02-23,N1-USA,"[Izvor: N1, Uz Beograd, protesti su održani u ...","Protesti i u Kragujevcu, Pasjanu, Jagodini, u ...",2,USA
4,1752,Ekonomijа,2019-02-03,Vostok-RU,Rusija će aktivno sarađivati sa evropskim zeml...,Rusija će sarađivati sa zemljama EU po pitanju...,2,RU
5,927,Analize,2018-12-17,Vostok-RU,Evropska Unija planira da podrži organizacije ...,EU planira da podrži aktiviste koji „brane dem...,12,RU
6,9127,,2018-11-17 13:46:18,RFE-USA,"Najmanje 40 ljudi, uglavnom civila, poginulo j...",Poginulo 40 ljudi u udarima koalicije predvođe...,11,USA
7,9670,,2018-12-22 12:20:21,RFE-USA,"Pokret ""Žuti prsluci"", koji je za danas planir...",Protest 'Žutih prsluka' u Parizu danas jedva p...,12,USA
8,8065,Vesti,2019-02-26,N1-USA,[Ministar odbrane Srbije Aleksandar Vulin reka...,Vulin: Srpska flota lovaca sa 14 migova najmod...,2,USA
9,5214,Svet,2018-12-18,N1-USA,[Evropski komesar za ekonomiju Pjer Moskovisi ...,Moskovisi želi da Italija izbegne sankcije,12,USA


In [4]:
damjan = pd.read_csv('coder_1_data.csv')
milica = pd.read_csv('coder_2_data.csv')
ognjan = pd.read_csv('coder_5_data.csv')
teo = pd.read_csv('coder_3_data.csv')
spela = pd.read_csv('coder_4_data.csv')

In [6]:
# Concatenate the frames loaded above, in coder order 1, 2, 5, 3, 4
all_data = pd.concat([damjan, milica, ognjan, teo, spela])

From this point we organize each coder's article set into a Word document.

Having obtained the training responses from the coders, we check their interpretation of the codebook.