This notebook loads datasets of semantic proximity (Word-in-Context) for various languages in the [WUG format](https://www.ims.uni-stuttgart.de/en/research/resources/experiment-data/wugs/). We provide the data in a minimal and an extended format. There are four dataframes in total: judgments_full, judgments_wug, uses_full and uses_wug. There are 20 transformed datasets. The notebook should run off the shelf in a Colab environment with Python 3.8.
The datasets and their versions are as follows:

#RuDSI-russian
rudsi = 'https://github.com/kategavrishina/RuDSI/tree/main/data' 

#NorDiaChange-norwegian
nordia1 = 'https://github.com/ltgoslo/nor_dia_change/tree/main/subset1/data'
nordia2 = 'https://github.com/ltgoslo/nor_dia_change/tree/main/subset2/data'

#RuShiftEval-russian
rushifteval1 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval1/data'
rushifteval2 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval2/data'
rushifteval3 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval3/data'

#RuSemShift- russian
rusemshift1 = 'https://github.com/juliarodina/RuSemShift/tree/master/rusemshift_1/DWUG/data'
rusemshift2 = 'https://github.com/juliarodina/RuSemShift/tree/master/rusemshift_2/DWUG/data'

#Discowug - German (Version: 1.1.1)
https://zenodo.org/record/7396225/files/discowug.zip



#surel - German (Version: 3.0.0)
https://zenodo.org/record/5784569/files/surel.zip


#durel - German (Version: 3.0.0)
https://zenodo.org/record/5784453/files/durel.zip


#DWUG_German - dwug_de (Version: 2.3.0)
https://zenodo.org/record/7441645/files/dwug_de.zip


#RefWUG - German (Version: 1.1.0)
https://zenodo.org/record/5791269/files/refwug.zip


#DWUG_English dwug_en (Version: 2.0.1)
https://zenodo.org/record/7387261/files/dwug_en.zip


#DWUG_Swedish (Version: 2.0.1)
https://zenodo.org/record/7389506/files/dwug_sv.zip


#DWUG_Spanish (Version: 4.0.0)
https://zenodo.org/record/6433667/files/dwug_es.zip


#DiaWUG Spanish (Version: 1.1.0)
https://zenodo.org/record/5791193/files/diawug.zip


#DUPS_WUG English (version 2.0.0)
https://zenodo.org/record/5500223/files/DUPS-WUG.zip

#WIC dataset
https://pilehvar.github.io/wic/package/WiC_dataset.zip

#TempoWIC dataset
https://codalab.lisn.upsaclay.fr/my/datasets/download/3e22f138-ca00-4b10-a0fd-2e914892200d

#Raw-C dataset
https://raw.githubusercontent.com/seantrott/raw-c/main/data/processed/raw-c.csv

#Usim Dataset
http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/

#CosimLex Dataset
https://www.clarin.si/repository/xmlui/handle/11356/1308/allzip


Many of the data sets are transformed when running the notebook. We cannot guarantee that there are no errors. Hence, please make sure that you compare the created data frames to the original data sets before doing serious research with them.
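
One simple way to start such a comparison is to check, per dataset, that the number of rows in a combined data frame matches the number of rows counted in the original files. The sketch below assumes the `name` column that this notebook adds to every data frame; the helper `row_count_mismatches` and the `expected` dictionary are hypothetical and not part of the WUG scripts, and the toy rows are made up.

```python
import pandas as pd

def row_count_mismatches(df, expected, name_col="name"):
    """Return {dataset: (expected, actual)} for datasets whose row count differs."""
    actual = df[name_col].value_counts().to_dict()
    mismatches = {}
    for dataset, n_expected in expected.items():
        n_actual = int(actual.get(dataset, 0))
        if n_actual != n_expected:
            mismatches[dataset] = (n_expected, n_actual)
    return mismatches

# Toy example with made-up rows (not real annotation data).
toy = pd.DataFrame({
    "name": ["dwug_de", "dwug_de", "dwug_en"],
    "judgment": [4.0, 2.0, 3.0],
})
print(row_count_mismatches(toy, {"dwug_de": 2, "dwug_en": 2}))
```

A matching row count does not prove that the transformation is correct, so spot checks of individual rows remain advisable.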

Note: To make cosimlex.sh run, the encoding 'utf-8' has been added to all file open statements in cosimlex.py. Please also run this script without a GPU.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import io
import numpy as np
import os
from zipfile import ZipFile

In [2]:
!git clone https://github.com/Garrafao/WUGs.git

Cloning into 'WUGs'...
remote: Enumerating objects: 823, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 823 (delta 122), reused 183 (delta 122), pack-reused 640
Receiving objects: 100% (823/823), 1.25 MiB | 3.89 MiB/s, done.
Resolving deltas: 100% (479/479), done.


In [3]:
!pip install fuzzywuzzy #needed for rawc script

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [4]:
#!cd /content/WUGs/scripts/misc && bash -e usim2data.sh #USIM dataset (erroneous at the moment)

In [5]:
#!cd /content/WUGs/scripts/misc && bash -e evonlp2wug.sh #tempowic dataset (erroneous at the moment)

In [6]:
!python3 -m spacy download fi_core_news_sm

2023-02-21 15:19:19.861358: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 15:19:21.607017: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-21 15:19:21.607147: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-21 15:19:24.722198: E tensorfl

In [7]:
!python3 -m spacy download hr_core_news_sm

2023-02-21 15:19:39.978822: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 15:19:41.525534: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-21 15:19:41.525708: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-21 15:19:43.788022: E tensorfl

In [8]:
#!cd /content/WUGs/scripts/misc && bash -e cosimlex2wug.sh #cosimlex dataset(erroneous at the moment)

In [9]:
#RuDSI-russian
rudsi = 'https://github.com/kategavrishina/RuDSI/tree/main/data' 

#NorDiaChange-norwegian
nordia1 = 'https://github.com/ltgoslo/nor_dia_change/tree/main/subset1/data'
nordia2 = 'https://github.com/ltgoslo/nor_dia_change/tree/main/subset2/data'

#RuShiftEval-russian
rushifteval1 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval1/data'
rushifteval2 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval2/data'
rushifteval3 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval3/data'

#RuSemShift- russian
rusemshift1 = 'https://github.com/juliarodina/RuSemShift/tree/master/rusemshift_1/DWUG/data'
rusemshift2 = 'https://github.com/juliarodina/RuSemShift/tree/master/rusemshift_2/DWUG/data'

#Discowug- deutsch
!wget https://zenodo.org/record/7396225/files/discowug.zip
with ZipFile('discowug.zip', 'r') as discowug:
    discowug.extractall()


#surel- deutsch
!wget https://zenodo.org/record/5784569/files/surel.zip
with ZipFile('surel.zip', 'r') as surel:
    surel.extractall()

#durel -deutsch
!wget https://zenodo.org/record/5784453/files/durel.zip
with ZipFile('durel.zip', 'r') as durel:
    durel.extractall()

#DWUG_deutsch
!wget https://zenodo.org/record/7441645/files/dwug_de.zip
with ZipFile('dwug_de.zip', 'r') as dwug_de:
    dwug_de.extractall()

#RefWUG - deutsch
!wget https://zenodo.org/record/5791269/files/refwug.zip
with ZipFile('refwug.zip', 'r') as refwug:
    refwug.extractall()

#DWUG_English
!wget https://zenodo.org/record/7387261/files/dwug_en.zip
with ZipFile('dwug_en.zip', 'r') as dwug_en:
    dwug_en.extractall()


#DWUG_Swedish
!wget https://zenodo.org/record/7389506/files/dwug_sv.zip
with ZipFile('dwug_sv.zip', 'r') as dwug_sv:
    dwug_sv.extractall()


#DWUG_ESPANOL
!wget https://zenodo.org/record/6433667/files/dwug_es.zip
with ZipFile('dwug_es.zip', 'r') as dwug_es:
    dwug_es.extractall()

#DiaWUG ESPANOL
!wget https://zenodo.org/record/5791193/files/diawug.zip
with ZipFile('diawug.zip', 'r') as diawug:
    diawug.extractall()


#DUPS_WUG English
!wget https://zenodo.org/record/5500223/files/DUPS-WUG.zip
with ZipFile('DUPS-WUG.zip', 'r') as dups:
    dups.extractall()




--2023-02-21 15:19:56--  https://zenodo.org/record/7396225/files/discowug.zip
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5486849 (5.2M) [application/octet-stream]
Saving to: ‘discowug.zip’


2023-02-21 15:20:02 (1.33 MB/s) - ‘discowug.zip’ saved [5486849/5486849]

--2023-02-21 15:20:03--  https://zenodo.org/record/5784569/files/surel.zip
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2676637 (2.6M) [application/octet-stream]
Saving to: ‘surel.zip’


2023-02-21 15:20:12 (390 KB/s) - ‘surel.zip’ saved [2676637/2676637]

--2023-02-21 15:20:12--  https://zenodo.org/record/5784453/files/durel.zip
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP requ
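
The ten download-and-extract steps in the cell above follow the same pattern, so they could also be written as one loop. The following is a minimal sketch under that assumption: the URLs are copied from this notebook, while the dictionary `ZENODO_ZIPS` and the helper `fetch_and_extract` are hypothetical names, and `urllib.request.urlretrieve` is swapped in for `!wget`.

```python
import os
import urllib.request
from zipfile import ZipFile

# URLs copied from the dataset list in this notebook.
ZENODO_ZIPS = {
    "discowug": "https://zenodo.org/record/7396225/files/discowug.zip",
    "surel": "https://zenodo.org/record/5784569/files/surel.zip",
    "durel": "https://zenodo.org/record/5784453/files/durel.zip",
    "dwug_de": "https://zenodo.org/record/7441645/files/dwug_de.zip",
    "refwug": "https://zenodo.org/record/5791269/files/refwug.zip",
    "dwug_en": "https://zenodo.org/record/7387261/files/dwug_en.zip",
    "dwug_sv": "https://zenodo.org/record/7389506/files/dwug_sv.zip",
    "dwug_es": "https://zenodo.org/record/6433667/files/dwug_es.zip",
    "diawug": "https://zenodo.org/record/5791193/files/diawug.zip",
    "DUPS-WUG": "https://zenodo.org/record/5500223/files/DUPS-WUG.zip",
}

def fetch_and_extract(name, url, target_dir="."):
    """Download url to <target_dir>/<name>.zip (if missing) and unpack it there."""
    zip_path = os.path.join(target_dir, name + ".zip")
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    with ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
    return zip_path

# Usage (requires network access):
# for name, url in ZENODO_ZIPS.items():
#     fetch_and_extract(name, url)
```

Skipping the download when the archive already exists avoids fetching the same file again when the cell is re-run.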

In [10]:
%run /content/WUGs/scripts/misc/wic_final.ipynb #WIC dataset

--2023-02-21 15:22:52--  https://pilehvar.github.io/wic/package/WiC_dataset.zip
Resolving pilehvar.github.io (pilehvar.github.io)... 185.199.110.153, 185.199.108.153, 185.199.109.153, ...
Connecting to pilehvar.github.io (pilehvar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275984 (270K) [application/zip]
Saving to: ‘WiC_dataset.zip’


2023-02-21 15:22:52 (12.6 MB/s) - ‘WiC_dataset.zip’ saved [275984/275984]





Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
%run /content/WUGs/scripts/misc/rawc2wug.py #Raw-C dataset



In [12]:
direc = []
#i = os.listdir('WUGs/scripts/misc/wugdata')
#direc.append(i)
#j = os.listdir('WUGs/scripts/misc/data')
#direc.append(j)
#k = os.listdir('WUGs/scripts/misc/wugformat')
#direc.append(k)
l = os.listdir('/content/rawc')                               #all data directories extracted from bash scripts and python scripts
direc.append(l)
m = os.listdir('/content/WIC')
direc.append(m)

In [14]:
paths = []
#for i in direc[0]:
    #paths.append('WUGs/scripts/misc/wugdata/'+i+ '/data/')       #list of directory paths

#for i in direc[2]:
    #paths.append('WUGs/scripts/misc/wugformat/'+ i + '/wug/data/')

for i in direc[1]:
    paths.append('/content/WIC/'+ i +'/')

#paths.append("WUGs/scripts/misc/data/")
paths.append("/content/rawc/")
#paths.append("/content/WIC")



In [16]:
folders = []                       #list of all folders
for ds in paths:
    path = os.listdir(ds)
    folders.append(path)


In [16]:
# path_j = []                                       #final list judgments paths
# for i in folders[0]:
#     pathj = paths[0] + i + "/judgments.csv"
#     path_j.append(pathj)
# for i in folders[1]:
#     pathj = paths[1] + i + "/judgments.csv"
#     path_j.append(pathj)
# for i in folders[2]:
#     pathj = paths[2] + i + "/judgments.csv"
#     path_j.append(pathj)
# for i in folders[3]:
#     pathj = paths[3] + i + "/judgments.csv"
#     path_j.append(pathj)
# for i in folders[4]:
#     pathj = paths[4] + i + "/judgments.csv"
#     path_j.append(pathj)
# for i in folders[5]:
#     pathj = paths[5] + i + "/judgments.csv"
#     path_j.append(pathj)
# for i in folders[12]:
#     pathj = paths[12] + i + "/judgments.csv"
#     path_j.append(pathj)

In [18]:
path_k = []                                   #final list judgments paths and dataframe for wic and rawc
for i in folders[0]:
    pathj = paths[0] + i + "/judgments.csv"
    path_k.append(pathj)
for i in folders[1]:
    pathj = paths[1] + i + "/judgments.csv"
    path_k.append(pathj)
for i in folders[2]:
    pathj = paths[2] + i + "/judgments.csv"
    path_k.append(pathj)
for i in folders[3]:
    pathj = paths[3] + i + "/judgments.csv"
    path_k.append(pathj)
raw_df = pd.DataFrame()
for i in path_k:
   Tmp = pd.read_csv(i, delimiter='\t')
   Tmp['name'] = i.split('/')[2]
   raw_df = pd.concat([raw_df, Tmp])    

In [19]:
raw_df['language'] = 'english'

In [19]:
# path_co = []                                   #final paths of cosimlex
# for i in folders[6]:
#     pathj = paths[6] + i + "/judgments.csv"
#     path_co.append(pathj)
# for i in folders[7]:
#     pathj = paths[7] + i + "/judgments.csv"
#     path_co.append(pathj)
# for i in folders[8]:
#     pathj = paths[8] + i + "/judgments.csv"
#     path_co.append(pathj)



In [20]:
# cosim_df = pd.DataFrame()             #cosimlex judgments dataframe
# for i in path_co:
#    Tmp = pd.read_csv(i, delimiter='\t')
#    Tmp['name'] = i.split('/')[4]
#    cosim_df = pd.concat([cosim_df, Tmp])

In [21]:
# cosim_df.loc[cosim_df["name"] == "fi", "language"] = 'Finnish' 
# cosim_df.loc[cosim_df["name"] == "hr", "language"] = 'Croatian'
# cosim_df.loc[cosim_df["name"] == "en", "language"] = 'english'

In [22]:
#cosim_df['name'] = 'Cosimlex'

In [23]:
# judgemt_df = pd.DataFrame()                   #temp judgments dataframe
# for i in path_j:
#    Tmp = pd.read_csv(i, delimiter='\t')
#    Tmp['name'] = i.split('/')[3]
#    judgemt_df = pd.concat([judgemt_df, Tmp])
   

In [25]:
# judgemt_df.loc[judgemt_df["name"] == "wugdata", "name"] = 'TempoWic'
# #judgemt_df.loc[judgemt_df["name"] == "wugformat", "name"] = 'CoSimLex'
# judgemt_df.loc[judgemt_df["name"] == "data", "name"] = 'USim'

In [26]:
# judgemt_df.loc[judgemt_df["name"] == "TempoWic", "language"] = 'English'
# judgemt_df.loc[judgemt_df["name"] == "USim", "language"] = 'English'
# #judgemt_df.loc[judgemt_df["name"] == "CoSimLex", "language"] = 'English' # change language, whats fi,and hr

In [21]:
dwugde = "dwug_de/data"                          #WUG data directory paths
dwugen = "dwug_en/data"
dwugsv = "dwug_sv/data"
discowugg = "discowug/data"
durel = "durel/data"
surel = "surel/data"
refwug = "refwug/data"
dwuges = 'dwug_es/data'
diawug = 'diawug/data'
#dwugla = 'dwug_la/data'
dups = 'DUPS-WUG/data'
dupswug = ''
dwug = [dwugde, dwugen,dwugsv,discowugg, durel, surel, refwug, dwuges, diawug, dups]
dirlist = []
for dataset in dwug:
  entries = os.listdir(dataset)                  #lemma folders of this dataset (avoids shadowing the built-in dir)
  dirlist.append(entries)

In [22]:
dwug_j = []                                                #dwug data paths
for i in dirlist[0]:
  dwugde_j = "dwug_de/data/" + i + "/judgments.csv"
  dwug_j.append(dwugde_j)
for i in dirlist[1]:
  dwugen_j = "dwug_en/data/" + i + "/judgments.csv"
  dwug_j.append(dwugen_j)
for i in dirlist[2]:
  dwugsv_j = "dwug_sv/data/" + i + "/judgments.csv"
  dwug_j.append(dwugsv_j)
for i in dirlist[3]:
  discowugg_j = "discowug/data/" + i + "/judgments.csv"
  dwug_j.append(discowugg_j)
for i in dirlist[4]:
  durel_j = "durel/data/" + i + "/judgments.csv"
  dwug_j.append(durel_j)
for i in dirlist[5]:
  surel_j = "surel/data/" + i + "/judgments.csv"
  dwug_j.append(surel_j)  
for i in dirlist[6]:
  refwug_j = "refwug/data/" + i + "/judgments.csv"
  dwug_j.append(refwug_j)
for i in dirlist[7]:
  dwuges_j = "dwug_es/data/" + i + "/judgments.csv"
  dwug_j.append(dwuges_j)
for i in dirlist[8]:
  diawug_j = "diawug/data/" + i + "/judgments.csv"
  dwug_j.append(diawug_j)
for i in dirlist[9]:
  dups_j = "DUPS-WUG/data/" + i + "/judgments.csv"
  dwug_j.append(dups_j)
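
The ten loops above differ only in the dataset directory, so the same list can be built with one nested loop. This is a sketch, not the notebook's own code: `collect_csv_paths` is a hypothetical helper, and unlike the cell above it sorts the lemma folders and skips entries that are not directories.

```python
import os

def collect_csv_paths(data_dirs, filename):
    """Return <data_dir>/<lemma>/<filename> for every lemma folder in each data dir."""
    paths = []
    for data_dir in data_dirs:
        for lemma in sorted(os.listdir(data_dir)):
            lemma_dir = os.path.join(data_dir, lemma)
            if os.path.isdir(lemma_dir):  # ignore stray files next to the lemma folders
                paths.append(os.path.join(lemma_dir, filename))
    return paths

# Usage with the directory list defined earlier in this notebook:
# dwug_j = collect_csv_paths(dwug, "judgments.csv")
# dwug_u = collect_csv_paths(dwug, "uses.csv")
```

On Linux and Colab, `os.path.join` produces `/`-separated paths, so the later `i.split('/')[0]` still yields the dataset name.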

In [23]:
judgemnt_df = pd.DataFrame()            #dwug data df
for i in dwug_j:
   Tmp = pd.read_csv(i, delimiter='\t')
   Tmp['name'] = i.split('/')[0]
   judgemnt_df = pd.concat([judgemnt_df, Tmp])
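
Calling `pd.concat` inside the loop copies the growing data frame on every iteration. A common alternative is to collect the per-file frames in a list and concatenate once at the end. The sketch below assumes the same tab-separated files and the same convention of taking the dataset name from one path component; `load_tsv_files` and its `name_index` parameter are hypothetical.

```python
import pandas as pd

def load_tsv_files(paths, name_index=0):
    """Read tab-separated files and tag each row with one component of its path."""
    frames = []
    for path in paths:
        frame = pd.read_csv(path, delimiter="\t")
        frame["name"] = path.split("/")[name_index]
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # One concat at the end; ignore_index gives a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)

# Usage with the path list built above:
# judgemnt_df = load_tsv_files(dwug_j, name_index=0)
```

Note that `ignore_index=True` differs from the cell above, which keeps the per-file row indices until `reset_index` is called.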


In [26]:
path_u = []                      #uses paths for tempowic and usim
for i in folders[0]:
    pathj = paths[0] + i + "/uses.csv"
    path_u.append(pathj)
for i in folders[1]:
    pathj = paths[1] + i + "/uses.csv"
    path_u.append(pathj)
for i in folders[2]:
    pathj = paths[2] + i + "/uses.csv"
    path_u.append(pathj)
for i in folders[3]:
    pathj = paths[3] + i + "/uses.csv"
    path_u.append(pathj)
# for i in folders[4]:
#     pathj = paths[4] + i + "/uses.csv"
#     path_u.append(pathj)
# for i in folders[5]:
#     pathj = paths[5] + i + "/uses.csv"
#     path_u.append(pathj)

# for i in folders[12]:
#     pathj = paths[12] + i + "/uses.csv"
#     path_u.append(pathj)

In [25]:
# path_cou = []                               #for cosimlex uses paths
# for i in folders[6]:
#     pathj = paths[6] + i + "/uses.csv"
#     path_cou.append(pathj)
# for i in folders[7]:
#     pathj = paths[7] + i + "/uses.csv"
#     path_cou.append(pathj)
# for i in folders[8]:
#     pathj = paths[8] + i + "/uses.csv"
#     path_cou.append(pathj)


In [33]:
# cosim_uses_df = pd.DataFrame()            #cosimlex uses df
# for i in path_cou:
#    Tmp = pd.read_csv(i, delimiter='\t')
#    Tmp['name'] = i.split('/')[4]
#    cosim_uses_df = pd.concat([cosim_uses_df, Tmp])

In [34]:
# cosim_uses_df.loc[cosim_uses_df["name"] == "fi", "language"] = 'Finnish' 
# cosim_uses_df.loc[cosim_uses_df["name"] == "hr", "language"] = 'Croatian'
# cosim_uses_df.loc[cosim_uses_df["name"] == "en", "language"] = 'english'

In [35]:
#cosim_uses_df['name'] = 'Cosimlex'

In [27]:
# path_k = []                            #wic and rawc uses df
# for i in folders[9]:
#     pathj = paths[9] + i + "/uses.csv"
#     path_k.append(pathj)
# for i in folders[10]:
#     pathj = paths[10] + i + "/uses.csv"
#     path_k.append(pathj)
# for i in folders[11]:
#     pathj = paths[11] + i + "/uses.csv"
#     path_k.append(pathj)
# for i in folders[13]:
#     pathj = paths[13] + i + "/judgments.csv"
#     path_k.append(pathj)
raw_uses_df = pd.DataFrame()
for i in path_u:
   Tmp = pd.read_csv(i, delimiter='\t')
   Tmp['name'] = i.split('/')[2]
   raw_uses_df = pd.concat([raw_uses_df, Tmp]) 

In [28]:
raw_uses_df['language'] = 'english'

In [38]:
# use_df = pd.DataFrame()    #uses df temp
# for i in path_u:
#    Tmp = pd.read_csv(i, delimiter='\t')
#    Tmp['name'] = i.split('/')[3]
#    use_df = pd.concat([use_df, Tmp])
   

In [30]:
dwug_u = []                                           #dwug data uses paths
for i in dirlist[0]:
  dwugde_u = "dwug_de/data/" + i + "/uses.csv"
  dwug_u.append(dwugde_u)
for i in dirlist[1]:
  dwugen_u = "dwug_en/data/" + i + "/uses.csv"
  dwug_u.append(dwugen_u)
for i in dirlist[2]:
  dwugsv_u = "dwug_sv/data/" + i + "/uses.csv"
  dwug_u.append(dwugsv_u)
for i in dirlist[3]:
  discowugg_u = "discowug/data/" + i + "/uses.csv"
  dwug_u.append(discowugg_u)
for i in dirlist[4]:
  durel_u = "durel/data/" + i + "/uses.csv"
  dwug_u.append(durel_u)
for i in dirlist[5]:
  surel_u = "surel/data/" + i + "/uses.csv"
  dwug_u.append(surel_u)  
for i in dirlist[6]:
  refwug_u = "refwug/data/" + i + "/uses.csv"
  dwug_u.append(refwug_u)
for i in dirlist[7]:
  dwuges_u = "dwug_es/data/" + i + "/uses.csv"
  dwug_u.append(dwuges_u)
for i in dirlist[8]:
  diawug_u = "diawug/data/" + i + "/uses.csv"
  dwug_u.append(diawug_u)

for i in dirlist[9]:
  dups_u = "DUPS-WUG/data/" + i + "/uses.csv"
  dwug_u.append(dups_u)

In [31]:
judgemnt_df

Unnamed: 0,identifier1,identifier2,annotator,judgment,comment,lemma,round,name,group
0,steub_tirol_1846-7463-39,lehnert_seehaefen02_1892-4170-8,annotator1,4.0,,Seminar,3.0,dwug_de,
1,heckert_schulgesetzgebung_1847-8359-12,26120215_1980_02_21_01_232.tcf.xml-4-7,annotator1,2.0,,Seminar,3.0,dwug_de,
2,2532889X_1981-02-24_01_017.tcf.xml-1812-11,lange_maedchenschule_1887-423-1,annotator1,3.0,,Seminar,3.0,dwug_de,
3,heckert_schulgesetzgebung_1847-1594-25,heckert_schulgesetzgebung_1847-3371-18,annotator1,3.0,,Seminar,3.0,dwug_de,
4,lehnert_seehaefen02_1892-6033-27,2532889X_1967-11-25_01_063.tcf.xml-3-10,annotator1,3.0,,Seminar,3.0,dwug_de,
...,...,...,...,...,...,...,...,...,...
218,disk-0,disk-3,45465124,2.0,,disk,,DUPS-WUG,
219,disk-0,disk-3,45404982,3.0,,disk,,DUPS-WUG,
220,disk-0,disk-3,45438972,3.0,,disk,,DUPS-WUG,
221,disk-0,disk-3,45214830,1.0,,disk,,DUPS-WUG,


In [32]:
judgemnt_df.loc[judgemnt_df["name"] == "dwug_de", "language"] = 'deutsch'
judgemnt_df.loc[judgemnt_df["name"] == "dwug_en", "language"] = 'english'
judgemnt_df.loc[judgemnt_df["name"] == "DUPS-WUG", "language"] = 'english'
judgemnt_df.loc[judgemnt_df["name"] == "dwug_es", "language"] = 'spanish'
#judgemnt_df.loc[judgemnt_df["name"] == "dwug_la", "language"] = 'latin'
judgemnt_df.loc[judgemnt_df["name"] == "dwug_sv", "language"] = 'swedish'
judgemnt_df.loc[judgemnt_df["name"] == "durel", "language"] = 'deutsch'
judgemnt_df.loc[judgemnt_df["name"] == "surel", "language"] = 'deutsch'
judgemnt_df.loc[judgemnt_df["name"] == "discowug", "language"] = 'deutsch'
judgemnt_df.loc[judgemnt_df["name"] == "refwug", "language"] = 'deutsch'
judgemnt_df.loc[judgemnt_df["name"] == "diawug", "language"] = 'spanish'
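
The per-dataset assignments above can also be expressed as a single dictionary lookup. The following sketch uses the same dataset names and language labels as the cell above; `LANGUAGE_BY_DATASET` and `add_language` are hypothetical names, and the toy data frame is made up.

```python
import pandas as pd

# Same dataset-to-language labels as in the cell above.
LANGUAGE_BY_DATASET = {
    "dwug_de": "deutsch",
    "dwug_en": "english",
    "DUPS-WUG": "english",
    "dwug_es": "spanish",
    "dwug_sv": "swedish",
    "durel": "deutsch",
    "surel": "deutsch",
    "discowug": "deutsch",
    "refwug": "deutsch",
    "diawug": "spanish",
}

def add_language(df, mapping=LANGUAGE_BY_DATASET):
    """Return a copy of df with a 'language' column looked up from 'name'."""
    out = df.copy()
    out["language"] = out["name"].map(mapping)  # names missing from mapping become NaN
    return out

toy = pd.DataFrame({"name": ["dwug_de", "diawug", "unknown"]})
print(add_language(toy))
```

Rows whose `name` is not in the dictionary end up with a missing language, which makes unmapped datasets easy to spot with `isna()`.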


In [33]:
judgemnt_df.reset_index(drop = True)

Unnamed: 0,identifier1,identifier2,annotator,judgment,comment,lemma,round,name,group,language
0,steub_tirol_1846-7463-39,lehnert_seehaefen02_1892-4170-8,annotator1,4.0,,Seminar,3.0,dwug_de,,deutsch
1,heckert_schulgesetzgebung_1847-8359-12,26120215_1980_02_21_01_232.tcf.xml-4-7,annotator1,2.0,,Seminar,3.0,dwug_de,,deutsch
2,2532889X_1981-02-24_01_017.tcf.xml-1812-11,lange_maedchenschule_1887-423-1,annotator1,3.0,,Seminar,3.0,dwug_de,,deutsch
3,heckert_schulgesetzgebung_1847-1594-25,heckert_schulgesetzgebung_1847-3371-18,annotator1,3.0,,Seminar,3.0,dwug_de,,deutsch
4,lehnert_seehaefen02_1892-6033-27,2532889X_1967-11-25_01_063.tcf.xml-3-10,annotator1,3.0,,Seminar,3.0,dwug_de,,deutsch
...,...,...,...,...,...,...,...,...,...,...
261008,disk-0,disk-3,45465124,2.0,,disk,,DUPS-WUG,,english
261009,disk-0,disk-3,45404982,3.0,,disk,,DUPS-WUG,,english
261010,disk-0,disk-3,45438972,3.0,,disk,,DUPS-WUG,,english
261011,disk-0,disk-3,45214830,1.0,,disk,,DUPS-WUG,,english


In [34]:
#final judgments df
#judgment_df = pd.concat([judgemt_df, judgemnt_df], axis = 0)
#judgement_final_df = pd.concat([judgment_df, judgemnt_df], axis = 0)    
judgment_df = pd.concat([judgemnt_df, raw_df], axis = 0)
#judgment_df = pd.concat([judgment_df, cosim_df], axis = 0)

In [36]:
judgemnt_df_wug = judgment_df[["identifier1", "identifier2", "annotator", "judgment", "comment", "lemma", "name", "language"]]

In [38]:
usee_df = pd.DataFrame()            #uses dwug df
for i in dwug_u:
    Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)   #quoting=3 is csv.QUOTE_NONE
    Tmp['name'] = i.split('/')[0]
    usee_df = pd.concat([usee_df, Tmp])                 #append each file once, tagged with its dataset name

In [39]:
usee_df.loc[usee_df["name"] == "dwug_de", "language"] = 'deutsch'
usee_df.loc[usee_df["name"] == "dwug_en", "language"] = 'english'
usee_df.loc[usee_df["name"] == "DUPS-WUG", "language"] = 'english'
usee_df.loc[usee_df["name"] == "dwug_es", "language"] = 'spanish'
#usee_df.loc[usee_df["name"] == "dwug_la", "language"] = 'latin'
usee_df.loc[usee_df["name"] == "dwug_sv", "language"] = 'swedish'
usee_df.loc[usee_df["name"] == "durel", "language"] = 'deutsch'
usee_df.loc[usee_df["name"] == "surel", "language"] = 'deutsch'
usee_df.loc[usee_df["name"] == "discowug", "language"] = 'deutsch'
usee_df.loc[usee_df["name"] == "refwug", "language"] = 'deutsch'
usee_df.loc[usee_df["name"] == "diawug", "language"] = 'spanish'

In [40]:
#combining uses df
#uses_full_df = pd.concat([usee_df, use_df], axis = 0)
uses_full_df = pd.concat([usee_df, raw_uses_df], axis = 0)
#judgement_final_df = pd.concat([judgment_df, judgemnt_df], axis = 0)
#uses1_df = pd.concat([uses_full_df, raw_uses_df], axis = 0)
#uses_df_full = pd.concat([uses1_df, cosim_uses_df], axis = 0)

In [41]:
wug_use_df= [['lemma', 'pos', 'date', 'grouping', 'identifier', 'description', 'context', 'indexes_target_token', 'indexes_target_sentence', 'name', 'language']]

In [42]:
russian = [rudsi, nordia1, nordia2, rushifteval1, rushifteval2, rushifteval3, rusemshift1, rusemshift2]
find_class = []
for URL in russian:
  page = requests.get(URL)
  soup = BeautifulSoup( page.content , 'html.parser')
  classy = soup.find_all('a', class_="Link--primary")[3:]
  find_class.append(classy)
  

In [43]:
judgements = []
uses = []
for find_by_class in find_class:
  for i in find_by_class:
    judgements.append("https://raw.githubusercontent.com/"+i['href']+"/judgments.csv")
    uses.append("https://raw.githubusercontent.com/"+i['href']+"/uses.csv")
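
The raw-file URLs can also be derived directly from a GitHub `tree` URL and a lemma folder name, without string replacement afterwards. This is a minimal sketch assuming the usual `raw.githubusercontent.com/<user>/<repo>/<branch>/<path>` layout; `raw_csv_url` is a hypothetical helper and the lemma `example_lemma` is a placeholder, not a folder from any of the repositories.

```python
def raw_csv_url(tree_url, lemma, filename):
    """Turn https://github.com/<user>/<repo>/tree/<branch>/<path> into a raw file URL."""
    prefix = "https://github.com/"
    if not tree_url.startswith(prefix):
        raise ValueError("expected a github.com URL: " + tree_url)
    user, repo, marker, rest = tree_url[len(prefix):].split("/", 3)
    if marker != "tree":
        raise ValueError("expected a /tree/ URL: " + tree_url)
    # rest holds "<branch>/<path>", e.g. "main/data"
    return f"https://raw.githubusercontent.com/{user}/{repo}/{rest}/{lemma}/{filename}"

# Placeholder lemma name, for illustration only.
print(raw_csv_url("https://github.com/kategavrishina/RuDSI/tree/main/data",
                  "example_lemma", "judgments.csv"))
```

The list of lemma folders still has to come from somewhere, for example from the scraping step above.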

In [44]:
judgements_df = pd.DataFrame()
for i in judgements:
   Tmp = pd.read_csv(io.StringIO(requests.get(i.replace("/tree","")).content.decode('utf-8')), delimiter='\t')
   Tmp['name'] = i.split('/')[5]
   judgements_df = pd.concat([judgements_df, Tmp])


In [45]:
judgements_df["language"] = "Russian"

In [46]:
judgements_df.loc[judgements_df["name"] == "nor_dia_change", "language"] = 'Norwegian'   #'name' holds the repository name taken from the URL

In [47]:
judgements_df = judgements_df.reset_index(drop = True)

In [48]:
usees_df = pd.DataFrame()
for i in uses:
    Tmp = pd.read_csv(io.StringIO(requests.get(i.replace("/tree","")).content.decode('utf-8')), delimiter='\t')
    Tmp['name'] = i.split('/')[5]
    usees_df = pd.concat([usees_df, Tmp])
    #uses_df = uses_df[['lemma', 'pos', 'date', 'grouping', 'identifier', 'description', 'context', 'indexes_target_token', 'indexes_target_sentence']]

In [49]:
usees_df

Unnamed: 0,lemma,pos,date,grouping,identifier,description,context,indexes_target_token,indexes_target_sentence,identifier_system,project,lang,user,name
0,год,NN,example,22,22_год_110831,,"Именно в эту местность, на приморскую виллу вб...",95:99,0:130,27802.0,sem_cluster,ru,ek.gavrishina,RuDSI
1,год,NN,example,22,22_год_24500,,"Понадобилось несколько лет, чтобы наконец изба...",56:65,0:129,27813.0,sem_cluster,ru,ek.gavrishina,RuDSI
2,год,NN,example,22,22_год_626263,,"Но он крепился и держал себя твердо, как взрос...",0:0,0:89,27796.0,sem_cluster,ru,ek.gavrishina,RuDSI
3,год,NN,example,22,22_год_399278,,10 июля 1985 года агенты французских спецслужб...,13:17,0:109,27817.0,sem_cluster,ru,ek.gavrishina,RuDSI
4,год,NN,example,22,22_год_67014,,"И еще: города будущего(графика)-- это то, что ...",65:69,0:107,27805.0,sem_cluster,ru,ek.gavrishina,RuDSI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,эрмитажный,ADJ,1991-2017,2,эрмитажный-109,,"Но уже во вторую встречу, на эрмитажном спекта...",29:39,0:594,,,,,RuSemShift
110,эрмитажный,ADJ,1918-1990,1,эрмитажный-110,,Я решил поговорить с самым умным человеком в н...,51:61,0:120,,,,,RuSemShift
111,эрмитажный,ADJ,1991-2017,2,эрмитажный-111,,В конце января 1904 года в Эрмитажном театре б...,27:37,0:59,,,,,RuSemShift
112,эрмитажный,ADJ,1918-1990,1,эрмитажный-112,,"Я уже хорошо ориентировался там, знал, что так...",79:89,0:96,,,,,RuSemShift


In [50]:
usees_df['language'] = 'Russian'
usees_df.loc[usees_df["name"] == "nor_dia_change", "language"] = 'Norwegian'   #'name' holds the repository name taken from the URL

In [52]:
judgment_final_df = pd.concat([judgment_df, judgements_df], axis = 0)
#uses_final_df = pd.concat([uses_df_full, usees_df], axis=0)
uses_final_df = pd.concat([uses_full_df, usees_df], axis=0)

In [53]:
#judgements_grouped_df = judgements_grouped_df.reset_index(drop= True)
judgment_final_df = judgment_final_df.reset_index(drop= True)
uses_final_df = uses_final_df.reset_index(drop= True)

In [None]:
use_df_wug= uses_final_df[['lemma', 'pos', 'date', 'grouping', 'identifier', 'description', 'context', 'indexes_target_token', 'indexes_target_sentence', 'name', 'language']]

In [None]:
judgemnt_df_wug = judgment_final_df[["identifier1", "identifier2", "annotator", "judgment", "comment", "lemma", "name", "language"]]

In [None]:
for i in list(judgements_df["lemma"].value_counts().index):
    df = judgements_df[judgements_df["lemma"]==i]
    numpy_df = df.to_numpy()
    header = list(df.columns)
    numpy_df = np.vstack([header, numpy_df])
    if not os.path.exists(i):
        os.mkdir(i)
    np.savetxt(i+"/judgments.csv", numpy_df,fmt='%s', delimiter='\t')   #file name as used in the WUG format

In [None]:
for i in list(use_df_wug["lemma"].value_counts().index):
    df = use_df_wug[use_df_wug["lemma"]==i]
    numpy_df = df.to_numpy()
    header = list(df.columns)
    numpy_df = np.vstack([header, numpy_df])
    if not os.path.exists(i):
        os.mkdir(i)
    np.savetxt(i+"/uses.csv", numpy_df,fmt='%s', delimiter='\t')
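
The two export loops above can also be written with `groupby` and `DataFrame.to_csv`, which writes the header row itself. The sketch below is one possible variant, not the notebook's own method: `write_per_lemma` and its `out_dir` parameter are hypothetical, and the toy data frame in the usage comment is made up.

```python
import os
import pandas as pd

def write_per_lemma(df, filename, out_dir="."):
    """Write one tab-separated file per lemma to <out_dir>/<lemma>/<filename>."""
    written = []
    for lemma, group in df.groupby("lemma", sort=False):
        lemma_dir = os.path.join(out_dir, str(lemma))
        os.makedirs(lemma_dir, exist_ok=True)  # no error if the folder already exists
        path = os.path.join(lemma_dir, filename)
        group.to_csv(path, sep="\t", index=False)
        written.append(path)
    return written

# Usage with a made-up toy frame:
# toy = pd.DataFrame({"lemma": ["a", "a", "b"], "judgment": [1.0, 2.0, 3.0]})
# write_per_lemma(toy, "judgments.csv", out_dir="export")
```

One difference from `np.savetxt` with `fmt='%s'` is that `to_csv` writes missing values as empty fields rather than the string `nan`.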