<a href="https://colab.research.google.com/github/Garrafao/WUGs/blob/main/scripts/misc/one_for_all.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook loads datasets of semantic proximity (Word-in-Context) for various languages the [WUG format](https://www.ims.uni-stuttgart.de/en/research/resources/experiment-data/wugs/). We provide the data in a minimal and an extended format. There are in total 4 dataframes: judgments_full, judgments_wug, uses_full and uses_wug. There are 20 transformed datasets. The notebook should run of-the-shelf in a colab environment with python 3.8.

Many of the data sets are transformed when running the notebook. We cannot guarantee that there are no errors. Hence, please make sure that you compare the created data frames to the original data sets before doing serious research with them.

Note: Please run this script without gpu on colab.

The datasets and their versions are as follows:

#RuDSI - Russian
rudsi = 'https://github.com/kategavrishina/RuDSI/tree/main/data'

#NorDiaChange - Norwegian
nordia1 = 'https://github.com/ltgoslo/nor_dia_change/tree/main/subset1/data'
nordia2 = 'https://github.com/ltgoslo/nor_dia_change/tree/main/subset2/data'

#RuShiftEval - Russian

https://github.com/akutuzov/rushifteval_public.git
rushifteval1 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval1/data'
rushifteval2 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval2/data'
rushifteval3 = 'https://github.com/akutuzov/rushifteval_public/tree/main/durel/rushifteval3/data'

#RuSemShift - Russian
rusemshift1 = 'https://github.com/juliarodina/RuSemShift/tree/master/rusemshift_1/DWUG/data'
rusemshift2 = 'https://github.com/juliarodina/RuSemShift/tree/master/rusemshift_2/DWUG/data'

#DiscoWUG - German (Version: 1.1.1)
https://zenodo.org/record/7396225/files/discowug.zip



#SURel - German (Version: 3.0.0)
https://zenodo.org/record/5784569/files/surel.zip


#DURel - German (Version: 3.0.0)
https://zenodo.org/record/5784453/files/durel.zip


#DWUG DE- German (Version: 2.3.0)
https://zenodo.org/record/7441645/files/dwug_de.zip


#RefWUG - German (Version: 1.1.0)
https://zenodo.org/record/5791269/files/refwug.zip


#DWUG EN - English (Version: 2.0.1)
https://zenodo.org/record/7387261/files/dwug_en.zip


#DWUG SV - Swedish(Version: 2.0.1)
https://zenodo.org/record/7389506/files/dwug_sv.zip


#DWUG ES - Spanish(Version: 4.0.0)
https://zenodo.org/record/6433667/files/dwug_es.zip


#DiaWUG - Spanish (Version: 1.1.0)
https://zenodo.org/record/5791193/files/diawug.zip


#DUPS_WUG - English (version 2.0.0)
https://zenodo.org/record/5500223/files/DUPS-WUG.zip

#WIC - English (version v1.0)
https://pilehvar.github.io/wic/package/WiC_dataset.zip

#TempoWIC - English
https://codalab.lisn.upsaclay.fr/my/datasets/download/3e22f138-ca00-4b10-a0fd-2e914892200d

#Raw-C - English
https://raw.githubusercontent.com/seantrott/raw-c/main/data/processed/raw-c.csv

#Usim - English
http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/

#CosimLex - English, Croatian, Finnish
https://www.clarin.si/repository/xmlui/handle/11356/1308/allzip


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import io
import numpy as np
import os
from zipfile import ZipFile
import csv

In [2]:
!git clone https://github.com/Garrafao/WUGs.git #contains transformation scripts

Cloning into 'WUGs'...
remote: Enumerating objects: 1751, done.[K
remote: Counting objects: 100% (677/677), done.[K
remote: Compressing objects: 100% (310/310), done.[K
remote: Total 1751 (delta 396), reused 603 (delta 349), pack-reused 1074[K
Receiving objects: 100% (1751/1751), 5.07 MiB | 14.06 MiB/s, done.
Resolving deltas: 100% (1074/1074), done.


In [3]:
!git clone "https://github.com/akutuzov/rushifteval_public.git" #rushifteval
!git clone "https://github.com/juliarodina/RuSemShift.git" #rusemshift
!git clone "https://github.com/kategavrishina/RuDSI.git" #rudsi
!git clone "https://github.com/ltgoslo/nor_dia_change.git" #nordiachange

Cloning into 'rushifteval_public'...
remote: Enumerating objects: 3238, done.[K
remote: Counting objects: 100% (3238/3238), done.[K
remote: Compressing objects: 100% (1378/1378), done.[K
remote: Total 3238 (delta 1918), reused 3163 (delta 1857), pack-reused 0[K
Receiving objects: 100% (3238/3238), 16.40 MiB | 15.17 MiB/s, done.
Resolving deltas: 100% (1918/1918), done.
Updating files: 100% (3704/3704), done.
Cloning into 'RuSemShift'...
remote: Enumerating objects: 2100, done.[K
remote: Counting objects: 100% (2100/2100), done.[K
remote: Compressing objects: 100% (991/991), done.[K
remote: Total 2100 (delta 1182), reused 2013 (delta 1108), pack-reused 0[K
Receiving objects: 100% (2100/2100), 9.80 MiB | 17.13 MiB/s, done.
Resolving deltas: 100% (1182/1182), done.
Updating files: 100% (1562/1562), done.
Cloning into 'RuDSI'...
remote: Enumerating objects: 311, done.[K
remote: Counting objects: 100% (226/226), done.[K
remote: Compressing objects: 100% (167/167), done.[K
remote:

In [4]:
!pip install fuzzywuzzy #needed for rawc script

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [5]:
!cd WUGs/scripts/misc && bash -e usim2data.sh #transform USim to WUG

[1musim2data.sh[m
--2024-05-11 20:55:26--  http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/cl-meaningincontext.tgz
Resolving www.dianamccarthy.co.uk (www.dianamccarthy.co.uk)... 212.159.9.91, 212.159.8.91
Connecting to www.dianamccarthy.co.uk (www.dianamccarthy.co.uk)|212.159.9.91|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 253228 (247K) [application/x-gzip]
Saving to: ‘./source/cl-meaningincontext.tgz’


2024-05-11 20:55:27 (606 KB/s) - ‘./source/cl-meaningincontext.tgz’ saved [253228/253228]

README
Data/
Data/README
Data/lexsub_wcdata.xml
Data/lexsubwc.dtd
Markup/
Markup/WordSenseBest/
Markup/WordSenseBest/wsbestmode.csv
Markup/WordSenseBest/README_wsbest.txt
Markup/WordSenseBest/wsbestratings.csv
Markup/SynonymBest/
Markup/SynonymBest/README_synbest.txt
Markup/SynonymBest/synbestratings.csv
Markup/WordSenseSimilarity/
Markup/WordSenseSimilarity/README_wssim.txt
Markup/WordSenseSimilarity/wssim2ratings.csv
Markup/UsageSimilarity/
Markup/Us

In [6]:
!cd WUGs/scripts/misc && bash -e evonlp2wug.sh  #transforms tempowic to wug

[1mevonlp2wug.sh[m
--2024-05-11 20:55:38--  https://codalab.lisn.upsaclay.fr/my/datasets/download/3e22f138-ca00-4b10-a0fd-2e914892200d
Resolving codalab.lisn.upsaclay.fr (codalab.lisn.upsaclay.fr)... 129.175.8.8
Connecting to codalab.lisn.upsaclay.fr (codalab.lisn.upsaclay.fr)|129.175.8.8|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://miniodis-rproxy.lisn.upsaclay.fr/py3-private/dataset_data_file/da46462c-0b8c-44c8-98a3-ef085c89e189/TempoWiC_Starting_Kit.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=EASNOMJFX9QFW4QIY4SL%2F20240511%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240511T205533Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=6ecbca01305590a97fa0a054b7ccda9fd105638a6971b30e78b6e123d4516a6d [following]
--2024-05-11 20:55:39--  https://miniodis-rproxy.lisn.upsaclay.fr/py3-private/dataset_data_file/da46462c-0b8c-44c8-98a3-ef085c89e189/TempoWiC_Starting_Kit.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=EASNOM

In [7]:
!python3 -m spacy download fi_core_news_sm #needed for cosimlex

Collecting fi-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fi_core_news_sm-3.7.0/fi_core_news_sm-3.7.0-py3-none-any.whl (14.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.3/14.3 MB[0m [31m70.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: fi-core-news-sm
Successfully installed fi-core-news-sm-3.7.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('fi_core_news_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
!python3 -m spacy download hr_core_news_sm #needed for cosimlex

Collecting hr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/hr_core_news_sm-3.7.0/hr_core_news_sm-3.7.0-py3-none-any.whl (13.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.2/13.2 MB[0m [31m43.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: hr-core-news-sm
Successfully installed hr-core-news-sm-3.7.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('hr_core_news_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
!cd WUGs/scripts/misc && bash -e cosimlex2wug.sh #transforms cosimlex to wug

[1mcosimlex2wug.sh[m
--2024-05-11 20:57:34--  https://www.clarin.si/repository/xmlui/handle/11356/1308/allzip
Resolving www.clarin.si (www.clarin.si)... 95.87.154.205
Connecting to www.clarin.si (www.clarin.si)|95.87.154.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘./source/allzip’

allzip                  [   <=>              ] 751.19K  1.34MB/s    in 0.5s    

2024-05-11 20:57:36 (1.34 MB/s) - ‘./source/allzip’ saved [769221]

Archive:  ./source/allzip.zip
  inflating: ./source/cosimlex_en.csv  
  inflating: ./source/cosimlex_fi.csv  
  inflating: ./source/cosimlex_hr.csv  
  inflating: ./source/cosimlex_sl.csv  
  inflating: ./source/README.md      
  inflating: ./source/cosimlex_scores.zip  


In [None]:
#RuDSI
rudsi = 'RuDSI/data/'

#NorDiaChange
nordia1 = 'nor_dia_change/subset1/data/'
nordia2 = 'nor_dia_change/subset2/data/'

#RuShiftEval
rushifteval1 = 'rushifteval_public/durel/rushifteval1/data/'
rushifteval2 = 'rushifteval_public/durel/rushifteval2/data/'
rushifteval3 = 'rushifteval_public/durel/rushifteval3/data/'

#RuSemShift
rusemshift1 = 'RuSemShift/rusemshift_1/DWUG/data/'
rusemshift2 = 'RuSemShift/rusemshift_2/DWUG/data/'

#Discowug
!wget https://zenodo.org/record/7396225/files/discowug.zip
with ZipFile('discowug.zip', 'r') as discowug:
    discowug.extractall()


#surel
!wget https://zenodo.org/record/5784569/files/surel.zip
with ZipFile('surel.zip', 'r') as surel:
    surel.extractall()

#durel
!wget https://zenodo.org/record/5784453/files/durel.zip
with ZipFile('durel.zip', 'r') as durel:
    durel.extractall()

#DWUG DE
!wget https://zenodo.org/record/7441645/files/dwug_de.zip
with ZipFile('dwug_de.zip', 'r') as dwug_de:
    dwug_de.extractall()

#RefWUG
!wget https://zenodo.org/record/5791269/files/refwug.zip
with ZipFile('refwug.zip', 'r') as refwug:
    refwug.extractall()

#DWUG EN
!wget https://zenodo.org/record/7387261/files/dwug_en.zip
with ZipFile('dwug_en.zip', 'r') as dwug_en:
    dwug_en.extractall()


#DWUG SV
!wget https://zenodo.org/record/7389506/files/dwug_sv.zip
with ZipFile('dwug_sv.zip', 'r') as dwug_sv:
    dwug_sv.extractall()


#DWUG ES
!wget https://zenodo.org/record/6433667/files/dwug_es.zip
with ZipFile('dwug_es.zip', 'r') as dwug_es:
    dwug_es.extractall()

#DiaWUG
!wget https://zenodo.org/record/5791193/files/diawug.zip
with ZipFile('diawug.zip', 'r') as diawug:
    diawug.extractall()


#DUPS_WUG
!wget https://zenodo.org/record/5500223/files/DUPS-WUG.zip
with ZipFile('DUPS-WUG.zip', 'r') as dups:
    dups.extractall()




--2024-05-06 08:45:22--  https://zenodo.org/record/7396225/files/discowug.zip
Resolving zenodo.org (zenodo.org)... 188.185.79.172, 188.184.103.159, 188.184.98.238, ...
Connecting to zenodo.org (zenodo.org)|188.185.79.172|:443... connected.
HTTP request sent, awaiting response... 301 MOVED PERMANENTLY
Location: /records/7396225/files/discowug.zip [following]
--2024-05-06 08:45:23--  https://zenodo.org/records/7396225/files/discowug.zip
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 5486849 (5.2M) [application/octet-stream]
Saving to: ‘discowug.zip’


2024-05-06 08:45:25 (4.06 MB/s) - ‘discowug.zip’ saved [5486849/5486849]

--2024-05-06 08:45:25--  https://zenodo.org/record/5784569/files/surel.zip
Resolving zenodo.org (zenodo.org)... 188.185.79.172, 188.184.103.159, 188.184.98.238, ...
Connecting to zenodo.org (zenodo.org)|188.185.79.172|:443... connected.
HTTP request sent, awaiting response... 301 MOVED PERMANENTLY
Location: /recor

In [None]:
%run WUGs/scripts/misc/wic2wug.ipynb #transforms WIC dataset to wug

--2024-05-11 20:58:39--  https://pilehvar.github.io/wic/package/WiC_dataset.zip
Resolving pilehvar.github.io (pilehvar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.110.153, ...
Connecting to pilehvar.github.io (pilehvar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 275984 (270K) [application/zip]
Saving to: ‘WiC_dataset.zip’


2024-05-11 20:58:39 (38.2 MB/s) - ‘WiC_dataset.zip’ saved [275984/275984]



In [None]:
%run WUGs/scripts/misc/rawc2wug.py #Raw-C to wug

In [None]:
direc = []
i = os.listdir('WUGs/scripts/misc/wugdata')
direc.append(i)

k = os.listdir('WUGs/scripts/misc/wugformat')
direc.append(k)                                   #all data directories extracted from tempowic, cosimlex and wic



In [None]:
paths = []          #list of directory paths
for i in direc[0]:
    paths.append('WUGs/scripts/misc/wugdata/'+i+ '/data/')     #tempowic

for i in direc[1]:
    paths.append('WUGs/scripts/misc/wugformat/'+ i + '/wug_all/data/all') #cosimlex


paths.append('/content/WiC_data/') #wic

paths.append("WUGs/scripts/misc/data/")  #usim
paths.append("/content/raw-c/") #rawc




In [None]:
folders = []                       #list of all folders names(lemma wise) in tempowic, cosimlex, wic, usim, rawc
for ds in paths:
    path = os.listdir(ds)
    folders.append(path)


In [None]:
#final list judgments paths for tempowic, cosimlex and usim
path_j = []

path_usim = []
for i in folders[0]:
     pathj = paths[0] + i + "/judgments.csv" #tempowic
     path_j.append(pathj)
for i in folders[3]:
     pathj = paths[3] + i + "/judgments.csv" #tempowic
     path_j.append(pathj)
for i in folders[5]:
     pathj = paths[5] + i + "/judgments.csv"  #tempowic
     path_j.append(pathj)

for i in folders[10]:            #usim
     pathj = paths[10] + i + "/judgments.csv"
     path_usim.append(pathj)

In [None]:
paths

['WUGs/scripts/misc/wugdata/tempowic_trial_all/data/',
 'WUGs/scripts/misc/wugdata/tempowic_train/data/',
 'WUGs/scripts/misc/wugdata/tempowic_train_all/data/',
 'WUGs/scripts/misc/wugdata/tempowic_validation_all/data/',
 'WUGs/scripts/misc/wugdata/tempowic_trial/data/',
 'WUGs/scripts/misc/wugdata/tempowic_validation/data/',
 'WUGs/scripts/misc/wugformat/en/wug_all/data/all',
 'WUGs/scripts/misc/wugformat/fi/wug_all/data/all',
 'WUGs/scripts/misc/wugformat/hr/wug_all/data/all',
 '/content/WiC_data/',
 'WUGs/scripts/misc/data/',
 '/content/raw-c/']

In [None]:
pathco = []
pat = paths[6] + "/judgments.csv" #cosimlex
pathco.append(pat)

pat = paths[7]  + "/judgments.csv" #cosimlex
pathco.append(pat)

pat = paths[8]  + "/judgments.csv" #cosimlex
pathco.append(pat)


In [None]:
pathco

['WUGs/scripts/misc/wugformat/en/wug_all/data/all/judgments.csv',
 'WUGs/scripts/misc/wugformat/fi/wug_all/data/all/judgments.csv',
 'WUGs/scripts/misc/wugformat/hr/wug_all/data/all/judgments.csv']

In [None]:
#final list judgments paths and dataframe for wic and rawc
path_k = []
p = []
for i in folders[9]:
     pathj = paths[9] + i + "/judgments.csv"      #wic
     path_k.append(pathj)
for i in folders[11]:                             #rawc
    pathj = paths[11] + i + "/judgments.csv"
    p.append(pathj)
#judgements dataframe for rawc and wic datasets
wic_df = pd.DataFrame()
rawc_df = pd.DataFrame()
for i in path_k:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[3]
   wic_df = pd.concat([wic_df, Tmp])

for i in p:
  tmp_df =  pd.read_csv(i, delimiter='\t', quoting = 3)
  tmp_df['dataset'] = i.split('/')[2]
  rawc_df = pd.concat([rawc_df, tmp_df])

In [None]:
raw_df = pd.DataFrame
raw_df = pd.concat([wic_df, rawc_df])

In [None]:
raw_df['language'] = 'English'

In [None]:
raw_df = raw_df.reset_index(drop = True)

In [None]:
raw_df.loc[raw_df["dataset"] == "dev", "dataset"] = 'wic_dev'
raw_df.loc[raw_df["dataset"] == "train", "dataset"] = 'wic_train'
raw_df.loc[raw_df["dataset"] == "test", "dataset"] = 'wic_test'


In [None]:
pathco

['WUGs/scripts/misc/wugformat/en/wug_all/data/all/judgments.csv',
 'WUGs/scripts/misc/wugformat/fi/wug_all/data/all/judgments.csv',
 'WUGs/scripts/misc/wugformat/hr/wug_all/data/all/judgments.csv']

In [None]:
cosim_df = pd.DataFrame()             #cosimlex judgments dataframe
for i in pathco:
   Tmp = pd.read_csv(i, delimiter='\t', quoting =3)
   Tmp['dataset'] = i.split('/')[4]
   cosim_df = pd.concat([cosim_df, Tmp])

In [None]:
cosim_df.loc[cosim_df["dataset"] == "fi", "language"] = 'Finnish'
cosim_df.loc[cosim_df["dataset"] == "hr", "language"] = 'Croatian'
cosim_df.loc[cosim_df["dataset"] == "en", "language"] = 'English'


In [None]:
cosim_df.loc[cosim_df["language"] == "Finnish", "dataset"] = 'Cosimlex_fi'
cosim_df.loc[cosim_df["language"] == "Croatian", "dataset"] = 'Cosimlex_hr'
cosim_df.loc[cosim_df["language"] == "English", "dataset"] = 'Cosimlex_en'

In [None]:
cosim_df.dataset.unique()

array(['Cosimlex_en', 'Cosimlex_fi', 'Cosimlex_hr'], dtype=object)

In [None]:
#cosim_df['dataset'] = 'Cosimlex'
cosim_df = cosim_df.reset_index(drop = True)

In [None]:
path_usim.remove('WUGs/scripts/misc/data/dwug_en/judgments.csv')

In [None]:
judge_df = pd.DataFrame()
jud_df =  pd.DataFrame()                 #judgments dataframe for tempowic and usim
for i in path_j:
    Tmp = pd.read_csv(i, delimiter='\t', quoting =3)
    Tmp['dataset'] = i.split('/')[4]
    judge_df = pd.concat([judge_df, Tmp])

for i in path_usim:
    Temp = pd.read_csv(i, delimiter='\t', quoting =3)
    Temp['dataset'] = i.split('/')[3]
    jud_df = pd.concat([jud_df, Temp])


In [None]:
judgemt_df = pd.DataFrame()
judgemt_df = pd.concat([judge_df, jud_df])

In [None]:
judgemt_df.loc[judgemt_df["dataset"] == "data", "dataset"] = 'USim'

In [None]:
judgemt_df.loc[judgemt_df["dataset"] == "TempoWic", "language"] = 'English'
judgemt_df.loc[judgemt_df["dataset"] == "USim", "language"] = 'English'

In [None]:
judgemt_df = judgemt_df.reset_index(drop = True)

In [None]:
dwugde = "dwug_de/data"                          #WUG data directory paths
dwugen = "dwug_en/data"
dwugsv = "dwug_sv/data"
discowugg = "discowug/data"
durel = "durel/data"
surel = "surel/data"
refwug = "refwug/data"
dwuges = 'dwug_es/data'
diawug = 'diawug/data'
dups = 'DUPS-WUG/data'
dupswug = ''
dwug = [dwugde, dwugen,dwugsv,discowugg, durel, surel, refwug, dwuges, diawug, dups]
dirlist = []
for dataset in dwug:
  dir = os.listdir(dataset)
  dirlist.append(dir)

In [None]:
dwug_j = []                                                #dwug data paths
for i in dirlist[0]:
  dwugde_j = "dwug_de/data/" + i + "/judgments.csv"
  dwug_j.append(dwugde_j)
for i in dirlist[1]:
  dwugen_j = "dwug_en/data/" + i + "/judgments.csv"
  dwug_j.append(dwugen_j)
for i in dirlist[2]:
  dwugsv_j = "dwug_sv/data/" + i + "/judgments.csv"
  dwug_j.append(dwugsv_j)
for i in dirlist[3]:
  discowugg_j = "discowug/data/" + i + "/judgments.csv"
  dwug_j.append(discowugg_j)
for i in dirlist[4]:
  durel_j = "durel/data/" + i + "/judgments.csv"
  dwug_j.append(durel_j)
for i in dirlist[5]:
  surel_j = "surel/data/" + i + "/judgments.csv"
  dwug_j.append(surel_j)
for i in dirlist[6]:
  refwug_j = "refwug/data/" + i + "/judgments.csv"
  dwug_j.append(refwug_j)
for i in dirlist[7]:
  dwuges_j = "dwug_es/data/" + i + "/judgments.csv"
  dwug_j.append(dwuges_j)
for i in dirlist[8]:
  diawug_j = "diawug/data/" + i + "/judgments.csv"
  dwug_j.append(diawug_j)
for i in dirlist[9]:
  dups_j = "DUPS-WUG/data/" + i + "/judgments.csv"
  dwug_j.append(dups_j)

In [None]:
judgemnt_df = pd.DataFrame()            #dwug data judgments df
for i in dwug_j:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[0]
   judgemnt_df = pd.concat([judgemnt_df, Tmp])


In [None]:
judgemnt_df[judgemnt_df.duplicated()]

Unnamed: 0,identifier1,identifier2,annotator,judgment,comment,lemma,round,dataset,group
281,Aussterben-c2-i85,Aussterben-c2-i70,annotator1,4.0,,Aussterben,,discowug,
22,neunjahrig-c1-i8,neunjahrig-c1-i44,annotator1,4.0,,neunjahrig,,discowug,
171,musaeus_reisen01_1779-358-81,ludovici_grundriss_1756-15872-133,annotator2,4.0,,flott,,durel,earlier
172,musaeus_reisen01_1779-358-81,ludovici_grundriss_1756-15872-133,annotator3,4.0,,flott,,durel,earlier
173,musaeus_reisen01_1779-358-81,ludovici_grundriss_1756-15872-133,annotator4,4.0,,flott,,durel,earlier
35,wlp_CU-rag.zip-cu-g-7.txt-419688-10,wlp_CU-rag.zip-cu-g-6.txt-1816722-5,annotator10,4.0,,chamaco_pibe_chico,,diawug,
123,wlp_AR-tez.zip-ar-b-8.txt-3277778-4,wlp_CU-rag.zip-cu-g-6.txt-1816722-5,annotator10,4.0,,chamaco_pibe_chico,,diawug,
80,wlp_ES-sbo.zip-es-g-3.txt-5574825-6,wlp_ES-sbo.zip-es-g-9.txt-16481170-3,annotator16,4.0,,vidriera_escaparate,,diawug,


In [None]:
path_u = []
path_us = []                   #uses paths for tempowic and usim
for i in folders[0]:
    pathj = paths[0] + i + "/uses.csv"    #tempowic
    path_u.append(pathj)
for i in folders[3]:
    pathj = paths[3] + i + "/uses.csv"     #tempowic
    path_u.append(pathj)
for i in folders[4]:
    pathj = paths[4] + i + "/uses.csv"     #tempowic
    path_u.append(pathj)
for i in folders[10]:
     pathj = paths[10] + i + "/uses.csv"   #usim
     path_us.append(pathj)

In [None]:
path_cou = []                               #for cosimlex uses paths
pat = paths[6] + "/uses.csv"
path_cou.append(pat)

pat = paths[7] + "/uses.csv"
path_cou.append(pat)

pat = paths[8] +  "/uses.csv"
path_cou.append(pat)

In [None]:
cosim_uses_df = pd.DataFrame()            #cosimlex uses df
for i in path_cou:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[4]
   cosim_uses_df = pd.concat([cosim_uses_df, Tmp])

In [None]:
cosim_uses_df.loc[cosim_uses_df["dataset"] == "fi", "language"] = 'Finnish'
cosim_uses_df.loc[cosim_uses_df["dataset"] == "hr", "language"] = 'Croatian'
cosim_uses_df.loc[cosim_uses_df["dataset"] == "en", "language"] = 'English'

In [None]:
cosim_uses_df.loc[cosim_uses_df["language"] == "Finnish", "dataset"] = 'Cosimlex_fi'
cosim_uses_df.loc[cosim_uses_df["language"] == "Croatian", "dataset"] = 'Cosimlex_hr'
cosim_uses_df.loc[cosim_uses_df["language"] == "English", "dataset"] = 'Cosimlex_en'

In [None]:
#cosim_uses_df['dataset'] = 'Cosimlex'
cosim_uses_df = cosim_uses_df.reset_index(drop = True)

In [None]:
path_k = []
path_r = []                         #wic and rawc uses df
for i in folders[9]:
    pathj = paths[9] + i + "/uses.csv" #wic
    path_k.append(pathj)
for i in folders[11]:
    pathj = paths[11] + i + "/uses.csv" #rawc
    path_r.append(pathj)
raw_u_df = pd.DataFrame()
raw_us_df = pd.DataFrame()
for i in path_k:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[3]
   raw_u_df = pd.concat([raw_u_df, Tmp])

for i in path_r:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[2]
   raw_us_df = pd.concat([raw_us_df, Tmp])

In [None]:
raw_uses_df = pd.DataFrame()
raw_uses_df = pd.concat([raw_u_df, raw_us_df])

In [None]:
raw_uses_df['language'] = 'English'

In [None]:
raw_uses_df.loc[raw_uses_df["dataset"] == "dev", "dataset"] = 'wic_dev'
raw_uses_df.loc[raw_uses_df["dataset"] == "train", "dataset"] = 'wic_train'
raw_uses_df.loc[raw_uses_df["dataset"] == "test", "dataset"] = 'wic_test'

In [None]:
path_us.remove("WUGs/scripts/misc/data/dwug_en/uses.csv")

In [None]:
u_df = pd.DataFrame()
ud_df =  pd.DataFrame()                 #uses dataframe for tempowic and usim
for i in path_u:
    Tmp = pd.read_csv(i, delimiter='\t', quoting =3)
    Tmp['dataset'] = i.split('/')[4]
    u_df = pd.concat([u_df, Tmp])

for i in path_us:
    Tmp = pd.read_csv(i, delimiter='\t', quoting =3)
    Tmp['dataset'] = i.split('/')[3]
    ud_df = pd.concat([ud_df, Tmp])


In [None]:
use_df = pd.DataFrame()
use_df = pd.concat([u_df, ud_df])

In [None]:
#use_df.loc[use_df["dataset"] == "wugdata", "dataset"] = 'TempoWic'
use_df.loc[use_df["dataset"] == "data", "dataset"] = 'USim'

In [None]:
use_df['language'] = 'English'

In [None]:
dwug_u = []                                           #dwug data uses paths
for i in dirlist[0]:
  dwugde_u = "dwug_de/data/" + i + "/uses.csv"
  dwug_u.append(dwugde_u)
for i in dirlist[1]:
  dwugen_u = "dwug_en/data/" + i + "/uses.csv"
  dwug_u.append(dwugen_u)
for i in dirlist[2]:
  dwugsv_u = "dwug_sv/data/" + i + "/uses.csv"
  dwug_u.append(dwugsv_u)
for i in dirlist[3]:
  discowugg_u = "discowug/data/" + i + "/uses.csv"
  dwug_u.append(discowugg_u)
for i in dirlist[4]:
  durel_u = "durel/data/" + i + "/uses.csv"
  dwug_u.append(durel_u)
for i in dirlist[5]:
  surel_u = "surel/data/" + i + "/uses.csv"
  dwug_u.append(surel_u)
for i in dirlist[6]:
  refwug_u = "refwug/data/" + i + "/uses.csv"
  dwug_u.append(refwug_u)
for i in dirlist[7]:
  dwuges_u = "dwug_es/data/" + i + "/uses.csv"
  dwug_u.append(dwuges_u)
for i in dirlist[8]:
  diawug_u = "diawug/data/" + i + "/uses.csv"
  dwug_u.append(diawug_u)
for i in dirlist[9]:
  dups_u = "DUPS-WUG/data/" + i + "/uses.csv"
  dwug_u.append(dups_u)

In [None]:
judgemnt_df.loc[judgemnt_df["dataset"] == "dwug_de", "language"] = 'German'
judgemnt_df.loc[judgemnt_df["dataset"] == "dwug_en", "language"] = 'English'
judgemnt_df.loc[judgemnt_df["dataset"] == "DUPS-WUG", "language"] = 'English'
judgemnt_df.loc[judgemnt_df["dataset"] == "dwug_es", "language"] = 'Spanish'
judgemnt_df.loc[judgemnt_df["dataset"] == "dwug_sv", "language"] = 'Swedish'
judgemnt_df.loc[judgemnt_df["dataset"] == "durel", "language"] = 'German'
judgemnt_df.loc[judgemnt_df["dataset"] == "surel", "language"] = 'German'
judgemnt_df.loc[judgemnt_df["dataset"] == "discowug", "language"] = 'German'
judgemnt_df.loc[judgemnt_df["dataset"] == "refwug", "language"] = 'German'
judgemnt_df.loc[judgemnt_df["dataset"] == "diawug", "language"] = 'Spanish'


In [None]:
#final judgments df (without russian and norwegian datasets)
judgment_df = pd.DataFrame()
judgment_df = pd.concat([judgment_df, judgemt_df], axis = 0)
judgment_df = pd.concat([judgment_df, judgemnt_df], axis = 0)
judgment_df = pd.concat([judgment_df, raw_df], axis = 0)
judgment_df = pd.concat([judgment_df, cosim_df], axis = 0)

In [None]:
judgment_df = judgment_df.reset_index(drop=True)

In [None]:
usee_df = pd.DataFrame()            #uses dwug df
for i in dwug_u:
    Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
    Tmp['dataset'] = i.split('/')[0]
    usee_df = pd.concat([usee_df, Tmp])

In [None]:
usee_df.loc[usee_df["dataset"] == "dwug_de", "language"] = 'German'
usee_df.loc[usee_df["dataset"] == "dwug_en", "language"] = 'English'
usee_df.loc[usee_df["dataset"] == "DUPS-WUG", "language"] = 'English'
usee_df.loc[usee_df["dataset"] == "dwug_es", "language"] = 'Spanish'
#usee_df.loc[usee_df["dataset"] == "dwug_la", "language"] = 'latin'
usee_df.loc[usee_df["dataset"] == "dwug_sv", "language"] = 'Swedish'
usee_df.loc[usee_df["dataset"] == "durel", "language"] = 'German'
usee_df.loc[usee_df["dataset"] == "surel", "language"] = 'German'
usee_df.loc[usee_df["dataset"] == "discowug", "language"] = 'German'
usee_df.loc[usee_df["dataset"] == "refwug", "language"] = 'German'
usee_df.loc[usee_df["dataset"] == "diawug", "language"] = 'Spanish'

In [None]:
#combining uses df
uses_full_df = pd.concat([usee_df, use_df], axis = 0)
uses1_df = pd.concat([uses_full_df, raw_uses_df], axis = 0)
uses_df_full = pd.concat([uses1_df, cosim_uses_df], axis = 0)

In [None]:
#getting the data
rudsi_f = os.listdir(rudsi)
nordia_f1= os.listdir(nordia1)
nordia_f2 =os.listdir(nordia2)
rushift_f1 = os.listdir(rushifteval1)
rushift_f2 = os.listdir(rushifteval2)
rushift_f3 = os.listdir(rushifteval3)
rusem_f1 = os.listdir(rusemshift1)
rusem_f2 = os.listdir(rusemshift2)

In [None]:
judgements_rusem = []
judgements_nordia = []
judgements_rudsi = []
judgements_rushift = []
uses_rusem = []
uses_nordia = []
uses_rudsi = []
uses_rushift = []


In [None]:
for j in rudsi_f :
      judgements_rudsi.append(rudsi+j+"/judgments.csv")
      uses_rudsi.append(rudsi+j+"/uses.csv")
for j in rusem_f1:
      judgements_rusem.append(rusemshift1+j+"/judgments.csv")
      uses_rusem.append(rusemshift1+j+"/uses.csv")
for j in rusem_f2:
      judgements_rusem.append(rusemshift2+j+"/judgments.csv")
      uses_rusem.append(rusemshift2+j+"/uses.csv")
for j in rushift_f1 :
      judgements_rushift.append(rushifteval1+j+"/judgments.csv")
      uses_rushift.append(rushifteval1+j+"/uses.csv")
for j in rushift_f2 :
      judgements_rushift.append(rushifteval2+j+"/judgments.csv")
      uses_rushift.append(rushifteval2+j+"/uses.csv")
for j in rushift_f3 :
      judgements_rushift.append(rushifteval3+j+"/judgments.csv")
      uses_rushift.append(rushifteval3+j+"/uses.csv")
for j in nordia_f1 :
      judgements_nordia.append(nordia1+j+"/judgments.csv")
      uses_nordia.append(nordia1+j+"/uses.csv")
for j in nordia_f2:
      judgements_nordia.append(nordia2+j+"/judgments.csv")
      uses_nordia.append(nordia2+j+"/uses.csv")

In [None]:
#judgments dataframe for rudsi, rusemshift, rushifteval, nordiachange
jud_rudsi = pd.DataFrame()
for i in judgements_rudsi:
    Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
    Tmp['dataset'] = i.split('/')[0]
    jud_rudsi = pd.concat([jud_rudsi, Tmp])


In [None]:
jud_rusems = pd.DataFrame()
for i in judgements_rusem:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[1]
   jud_rusems = pd.concat([jud_rusems, Tmp])


In [None]:
jud_rushift = pd.DataFrame()
for i in judgements_rushift:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[2]
   jud_rushift = pd.concat([jud_rushift, Tmp])

In [None]:
jud_nordia = pd.DataFrame()
for i in judgements_nordia:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[1]
   jud_nordia = pd.concat([jud_nordia, Tmp])

In [None]:
jud_nordia.loc[jud_nordia['dataset'] == 'subset1', 'dataset'] = 'NorDiaChange1'
jud_nordia.loc[jud_nordia['dataset'] == 'subset2', 'dataset'] = 'NorDiaChange2'

In [None]:
judgements_df = pd.DataFrame()
judgements_df = pd.concat([judgements_df, jud_rudsi])
judgements_df = pd.concat([judgements_df, jud_rusems])
judgements_df = pd.concat([judgements_df, jud_rushift])
judgements_df = pd.concat([judgements_df, jud_nordia])

In [None]:
judgements_df["language"] = "Russian"

In [None]:
judgements_df.loc[judgements_df["dataset"] == "NorDiaChange1", "language"] = 'Norwegian'
judgements_df.loc[judgements_df["dataset"] == "NorDiaChange2", "language"] = 'Norwegian'

In [None]:
use_rudsi = pd.DataFrame()
for i in uses_rudsi:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[0]
   use_rudsi = pd.concat([use_rudsi, Tmp])

In [None]:
use_rusems = pd.DataFrame()
for i in uses_rusem:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[1]
   use_rusems = pd.concat([use_rusems, Tmp])

In [None]:
use_rushift = pd.DataFrame()
for i in uses_rushift:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[2]
   use_rushift = pd.concat([use_rushift, Tmp])

In [None]:
use_nordia = pd.DataFrame()
for i in uses_nordia:
   Tmp = pd.read_csv(i, delimiter='\t', quoting = 3)
   Tmp['dataset'] = i.split('/')[1]
   use_nordia = pd.concat([use_nordia, Tmp])

In [None]:
use_nordia.loc[use_nordia['dataset'] == 'subset1', 'dataset'] = 'NorDiaChange1'
use_nordia.loc[use_nordia['dataset'] == 'subset2', 'dataset'] = 'NorDiaChange2'

In [None]:
usees_df = pd.DataFrame()
usees_df = pd.concat([usees_df, use_rudsi])
usees_df = pd.concat([usees_df, use_rusems])
usees_df = pd.concat([usees_df, use_rushift])
usees_df = pd.concat([usees_df, use_nordia])

In [None]:
usees_df['language'] = 'Russian'
usees_df.loc[usees_df["dataset"] == "NorDiaChange1", "language"] = 'Norwegian'
usees_df.loc[usees_df["dataset"] == "NorDiaChange2", "language"] = 'Norwegian'

In [None]:
#final judgments dataframe full format
judgments_full = pd.concat([judgment_df, judgements_df], axis = 0)

In [None]:
judgments_full.dataset.unique()

array(['tempowic_trial_all', 'tempowic_validation_all',
       'tempowic_validation', 'USim', 'dwug_de', 'dwug_en', 'dwug_sv',
       'discowug', 'durel', 'surel', 'refwug', 'dwug_es', 'diawug',
       'DUPS-WUG', 'wic_train', 'wic_dev', 'wic_test', 'raw-c',
       'Cosimlex_en', 'Cosimlex_fi', 'Cosimlex_hr', 'RuDSI',
       'rusemshift_1', 'rusemshift_2', 'rushifteval1', 'rushifteval2',
       'rushifteval3', 'NorDiaChange1', 'NorDiaChange2'], dtype=object)

In [None]:
datasets = ['tempowic_trial_all', 'tempowic_validation_all',
       'tempowic_validation', 'USim', 'wic_train', 'wic_dev', 'wic_test', 'raw-c',
       'Cosimlex_en', 'Cosimlex_fi', 'Cosimlex_hr']

In [None]:
#final uses dataframe full format
uses_full = pd.concat([uses_df_full, usees_df], axis=0)

In [None]:
#resetting the index of uses and judgments dataframes because they have repeated indices
judgments_full = judgments_full.reset_index(drop= True)
uses_full = uses_full.reset_index(drop= True)

In [None]:
#final uses and judgments in wug format
judgments_wug = judgments_full[["identifier1", "identifier2", "annotator", "judgment", "comment", "lemma", "dataset", "language"]]
uses_wug= uses_full[['lemma', 'pos', 'date', 'grouping', 'identifier', 'description', 'context', 'indexes_target_token', 'indexes_target_sentence', 'dataset', 'language']]

In [None]:
dup = uses_full[uses_full.duplicated()] #to gwt duplicates

In [None]:
uses_full.dataset.unique()

array(['dwug_de', 'dwug_en', 'dwug_sv', 'discowug', 'durel', 'surel',
       'refwug', 'dwug_es', 'diawug', 'DUPS-WUG', 'tempowic_trial_all',
       'tempowic_validation_all', 'tempowic_trial', 'USim', 'wic_train',
       'wic_dev', 'wic_test', 'raw-c', 'Cosimlex_en', 'Cosimlex_fi',
       'Cosimlex_hr', 'RuDSI', 'rusemshift_1', 'rusemshift_2',
       'rushifteval1', 'rushifteval2', 'rushifteval3', 'NorDiaChange1',
       'NorDiaChange2'], dtype=object)

In [None]:
filtered_df_uses = uses_wug[~uses_wug['dataset'].isin(datasets)] #filtered uses dataframe

In [None]:
filtered_df_judgments = judgments_wug[~judgments_wug['dataset'].isin(datasets)] #filtered judgments dataframe

In [None]:
filtered_df_judgments.to_csv('final_judgments.csv',index = False, sep='\t', encoding='utf-8', quoting=csv.QUOTE_NONE, quotechar = '')

In [None]:
filtered_df_uses.to_csv('final_uses.csv',index = False, sep='\t', encoding='utf-8', quoting=csv.QUOTE_NONE, quotechar = '')

In [None]:
fin_df_use = filtered_df_uses.to_dict(orient='list')

In [None]:
result_df1 = pd.merge(filtered_df_judgments, filtered_df_uses, left_on='identifier1', right_on='identifier', how='left')

In [None]:
result_df2 = pd.merge(judgments_wug, uses_wug, left_on='identifier2', right_on='identifier', how='left')

In [None]:
resultt = pd.merge(result_df1, result_df2)

In [None]:
for i in list(judgments_wug["dataset"].value_counts().index):
    df_temp = judgments_wug[judgments_wug["dataset"]==i]
    if not os.path.exists(i):
        os.mkdir(i)
    df_temp.to_csv(i +'/judgments.csv',index = False, sep='\t', encoding='utf-8', quoting=csv.QUOTE_NONE, quotechar = '')

In [None]:
for i in list(uses_wug["dataset"].value_counts().index):
    df_temp = uses_wug[uses_wug["dataset"]==i]
    if not os.path.exists(i):
        os.mkdir(i)
    df_temp.to_csv(i +'/uses.csv',index = False, sep='\t', encoding='utf-8', quoting=csv.QUOTE_NONE, quotechar = '')

In [None]:
for i in list(judgments_full["dataset"].value_counts().index):
    df_temp = judgments_full[judgments_full["dataset"]==i]
    if not os.path.exists(i):
        os.mkdir(i)
    df_temp.to_csv(i +'/judgments.csv',index = False, sep='\t', encoding='utf-8', quoting=csv.QUOTE_NONE, quotechar = '')

In [None]:
for i in list(uses_full["dataset"].value_counts().index):
    df_temp = uses_full[uses_full["dataset"]==i]
    if not os.path.exists(i):
        os.mkdir(i)
    df_temp.to_csv(i +'/uses.csv',index = False, sep='\t', encoding='utf-8', quoting=csv.QUOTE_NONE, quotechar = '')