# <center>Script for extracting the German "gloss sentences"</center>
## <center>from the EAF transcripts A and B of the annotated DGS Corpus</center>

<hr style="border:1.0px solid gray"></hr> 

A "gloss sentence" here means a German sentence written in gloss annotation.  
These gloss sentences are the source language for translation into German (the German sentences are the target language).


An example of a German gloss sentence with its translation into a German sentence:  

*Source language (German gloss sentence):* **SEHEN1* SELBST1A* LEBEN1A* SEHEN1 $GEST-OFF^** 

*Target language (German sentence translation):* **Wie mein Leben aussieht?** 

<hr style="border:1.0px solid gray"></hr> 

In [1]:
#imports 
from bs4 import BeautifulSoup 
import requests 
from urllib.parse import urljoin 
import urllib.request 
import pandas as pd 
import pickle 
from operator import itemgetter 
from tqdm import tqdm   

In [2]:
#url of the DGS Corpus 
url_dgs_corpus = "https://www.sign-lang.uni-hamburg.de/meinedgs/ling/start-name_en.html" 

#request the dgs corpus page 
r = requests.get(url_dgs_corpus) 

#get the html content of the dgs corpus page 
html = r.text 

In [3]:
#create a content soup from the html content of the dgs corpus page with BeautifulSoup
content_soup = BeautifulSoup(html, 'html.parser') 

In [4]:
#rows with all types of files - ILEX, EAF, MP4...
rows_with_transcripts = content_soup.find('table', {'class': 'transcripts'}).find_all('tr') 

In [5]:
#list with all hrefs of the EAF files  
list_eaf_files = [] 

In [6]:
#get all the cells with transcripts data 
for r in rows_with_transcripts[1:]: 
    cells_with_transcripts = r.find_all('td') 
    
    #cells with the EAF transcript files 
    eaf_files = cells_with_transcripts[5] 
    
    #add the href of each EAF transcript file to a list   
    if eaf_files.find('a') is not None:
        list_eaf_files.append(eaf_files.find('a').attrs['href']) 

In [7]:
#list with the absolute urls of each EAF transcript 
absolute_paths_eaf_transcripts = [] 

In [8]:
#create an absolute path for each EAF transcript with 
#taking the base url of the DGS Corpus and 
#the href of each EAF transcript from the list_eaf_files 

for single_eaf in list_eaf_files: 
    absolute_url = urljoin(url_dgs_corpus, single_eaf) 
    absolute_paths_eaf_transcripts.append(absolute_url) 

<hr style="border:1.0px solid gray"></hr> 

### <center>Extract the gloss sentences corresponding to the German sentences of signers A</center>
    
<hr style="border:1.0px solid gray"></hr> 

#### Very important for the sorting of glosses: 
To form a sentence, the glosses are sorted first by their starting time. If a dominant-hand gloss and a non-dominant-hand gloss start at the same time, the dominant-hand gloss is placed first, followed by the non-dominant-hand gloss.
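
The sorting rule can be sketched on a few invented glosses. The helper name `sort_glosses` and the gloss values are assumptions made for illustration. The tie-breaking trick is the one used later in this notebook: the tag of the dominant hand is upper-cased, and upper-case letters sort before lower-case letters.

```python
from operator import itemgetter

# Each gloss: [text, start, end, hand]; the values are invented for illustration.
glosses = [
    ["B-LEFT", 5, 7, "l"],
    ["A-RIGHT", 5, 8, "r"],
    ["C-RIGHT", 2, 4, "r"],
]

def sort_glosses(glosses, dominant):
    """Sort by start time; on equal start times put the dominant hand first."""
    # Upper-case the dominant tag: "R" and "L" sort before "r" and "l".
    tagged = [
        [g[0], g[1], g[2], g[3].upper() if g[3] == dominant else g[3]]
        for g in glosses
    ]
    return sorted(tagged, key=itemgetter(1, 3))

ordered = sort_glosses(glosses, dominant="r")
print([g[0] for g in ordered])
# ['C-RIGHT', 'A-RIGHT', 'B-LEFT']
```

With `dominant="l"` the two glosses starting at time 5 swap places, so the left-hand gloss comes first.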

<hr style="border:1.0px solid gray"></hr>  

In [9]:
#a list to save all the *gloss sentences A*  
gloss_data_a = []  
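
The extraction cell that follows walks from each `ANNOTATION_VALUE` tag up to its `TIER`. The sketch below shows the same structure on a tiny hand-written EAF-like document. It is not the notebook's method: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and it resolves each time slot reference to its `TIME_VALUE` instead of taking the number from the slot id. The sample document, the helper name `read_tier`, and the `L_translation` type are assumptions. The sketch also assumes that every time slot carries a `TIME_VALUE`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real transcripts contain many more tiers and slots.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="400"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="900"/>
  </TIME_ORDER>
  <TIER TIER_ID="Deutsche_Übersetzung_A" LINGUISTIC_TYPE_REF="L_translation">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>Wie mein Leben aussieht?</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="Lexem_Gebärde_r_A" LINGUISTIC_TYPE_REF="L_tokens_right_left__finer_granularity">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>SEHEN1*</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>LEBEN1A*</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

def read_tier(root, tier_id):
    """Return [text, start_ms, end_ms] for each annotation of the tier `tier_id`."""
    # Map every time slot id (e.g. "ts1") to its TIME_VALUE in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            text = ann.findtext("ANNOTATION_VALUE", default="")
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            rows.append([text, start, end])
    return rows

root = ET.fromstring(SAMPLE_EAF)
print(read_tier(root, "Deutsche_Übersetzung_A"))
print(read_tier(root, "Lexem_Gebärde_r_A"))
```

Parsing the document once and selecting tiers by `TIER_ID` avoids downloading and parsing the same transcript twice.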

In [10]:
%%time 

with open("path_to_save_statistics_for_transcripts_a", "w") as t: 
    for transcript in tqdm(absolute_paths_eaf_transcripts): 
        #this is a list with the content of all tags that have the attribute "ANNOTATION_VALUE" (they include german glosses, german sentences, 
        #english glosses, english sentences, etc.)   

        #from this content extract FIRST *only* the tags with german sentences from signers A  
        transcript_content_a = [] 

        #this is a list for the specific time encoding of each sentence 
        time_encodings_a = [] 

        #this is a list of the german sentences A   
        german_sentences_a = [] 

        #for each transcript, extract first all the german sentences from signers A 
        with urllib.request.urlopen(transcript) as f:
            content = f.read().decode('utf-8') 
            transcript_content_a = BeautifulSoup(content, 'xml').find_all(name="ANNOTATION_VALUE") 
            time_encodings_a = BeautifulSoup(content, 'xml').find_all(name="TIME_SLOT") 
            for value in range(0, len(transcript_content_a)): 
                #if the value of the tags attribute TIER_ID is a german sentence from signer A, extract it
                if transcript_content_a[value].parent.parent.parent.attrs['TIER_ID'] == "Deutsche_Übersetzung_A": 
                    #this is the time encoding for the sentence (both starting and ending) 
                    time = transcript_content_a[value].parent.attrs 
                    #this is the starting time of the sentence 
                    start = time['TIME_SLOT_REF1'] 
                    #this is the ending time of the sentence  
                    end = time['TIME_SLOT_REF2'] 
                    #the sentence itself 
                    sentence_a = transcript_content_a[value].text 
                    #group the sentence + its start + its end 
                    sentence_group_a = [sentence_a, int(start[2:]), int(end[2:])]   
                    #add the german sentence to the list of german sentences 
                    german_sentences_a.append(sentence_group_a) 


        #from the whole content extract *only* the tags with german glosses from signers A 
        transcript_content_a_gloss = [] 

        #this is a list for the specific time encoding of each gloss  
        time_encodings_a_gloss = [] 

        #this is a list of the german glosses A    
        german_glosses_a = [] 

        count_r = 0 

        count_l = 0 

        with urllib.request.urlopen(transcript) as f:
            content = f.read().decode('utf-8') 
            transcript_content_a_gloss = BeautifulSoup(content, 'xml').find_all(name="ANNOTATION_VALUE") 
            time_encodings_a_gloss = BeautifulSoup(content, 'xml').find_all(name="TIME_SLOT") 
            for value in range(0, len(transcript_content_a_gloss)): 
                if (transcript_content_a_gloss[value].parent.parent.parent.attrs['TIER_ID'] == "Lexem_Gebärde_r_A" and transcript_content_a_gloss[value].parent.parent.parent.attrs['LINGUISTIC_TYPE_REF'] == "L_tokens_right_left__finer_granularity"):  
                    #tag indicating a gloss made with the right hand
                    tag = "r" 

                    #count the glosses made with the right hand; the counts decide later which hand is dominant
                    count_r = count_r + 1 

                    #this is the time encoding for the gloss (both starting and ending) 
                    time = transcript_content_a_gloss[value].parent.attrs 
                    #this is the starting time of the gloss  
                    start = time['TIME_SLOT_REF1'] 
                    #this is the ending time of the gloss  
                    end = time['TIME_SLOT_REF2'] 
                    #the gloss itself 
                    gloss_a = transcript_content_a_gloss[value].text 
                    #group the gloss + its start + its end + its tag  
                    gloss_group_a = [gloss_a, int(start[2:]), int(end[2:]), tag]   
                    #add the german gloss to the list of german glosses  
                    german_glosses_a.append(gloss_group_a) 
                if (transcript_content_a_gloss[value].parent.parent.parent.attrs['TIER_ID'] == "Lexem_Gebärde_l_A" and transcript_content_a_gloss[value].parent.parent.parent.attrs['LINGUISTIC_TYPE_REF'] == "L_tokens_right_left__finer_granularity"):  
                    #tag indicating a gloss made with the left hand
                    tag = "l" 

                    #count the glosses made with the left hand; the counts decide later which hand is dominant
                    count_l = count_l + 1 

                    #this is the time encoding for the gloss (both starting and ending) 
                    time = transcript_content_a_gloss[value].parent.attrs 
                    #this is the starting time of the gloss  
                    start = time['TIME_SLOT_REF1'] 
                    #this is the ending time of the gloss  
                    end = time['TIME_SLOT_REF2'] 
                    #the gloss itself 
                    gloss_a = transcript_content_a_gloss[value].text 
                    #group the gloss + its start + its end + its tag  
                    gloss_group_a = [gloss_a, int(start[2:]), int(end[2:]), tag]   
                    #add the german gloss to the list of german glosses  
                    german_glosses_a.append(gloss_group_a) 


         
        transcript_tag = "" 
        
        #######################################################################################################
        
        #decide which hand is dominant by comparing the gloss counts, then
        #sort the list of German glosses A according to:
        #1. their starting time
        #2. their tag: the tag of the dominant hand is upper-cased so that it sorts
        #   before the lower-case tag of the non-dominant hand
        #   (see the description "Very important for the sorting of glosses" above)

        #if more glosses are made with the right hand, it is the dominant hand
        if (count_r > count_l):
            transcript_tag = "R"
            for gloss in german_glosses_a:
                if (gloss[3] == "r"):
                    gloss[3] = "R"

        #if more glosses are made with the left hand, it is the dominant hand
        elif (count_l > count_r):
            transcript_tag = "L"
            for gloss in german_glosses_a:
                if (gloss[3] == "l"):
                    gloss[3] = "L"

        else:
            transcript_tag = "No signer A for the transcript!"

        #sort in every case, so that no stale list from a previous transcript is reused
        sorted_german_glosses_a = sorted(german_glosses_a, key = itemgetter(1, 3))

        #######################################################################################################
        
        full_annotation_a = [] 

        for sentence in german_sentences_a: 
            line = '' 
            gloss_sequence = ''
            for gloss in sorted_german_glosses_a: 
                if ((gloss[1]) >= sentence[1] and gloss[2] <= sentence[2]): 
                    line = '{} {}'.format(line, gloss[0])  
            full_annotation_a.append(line)   

        gloss_data_a = gloss_data_a + full_annotation_a 
        
        t.write(f"{transcript} - {transcript_tag}\n") 

    print(len(gloss_data_a)) 
    

100%|██████████████████████████████████████████████████████████████████████████████| 406/406 [1:19:07<00:00, 11.69s/it]

31542
Wall time: 1h 19min 7s
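
The last step of the cell above assigns a gloss to a sentence when the gloss starts no earlier and ends no later than the sentence. A minimal sketch of that containment test follows. The helper name `glosses_per_sentence` and all time values are invented. Unlike the cell above, the sketch joins the glosses with `" ".join`, so the lines have no leading space.

```python
def glosses_per_sentence(sentences, glosses):
    """For each sentence, join the glosses whose time span lies inside the sentence span."""
    lines = []
    for _, s_start, s_end in sentences:
        inside = [g[0] for g in glosses if g[1] >= s_start and g[2] <= s_end]
        lines.append(" ".join(inside))
    return lines

# Invented spans: [text, start, end] for sentences, [text, start, end, tag] for glosses.
sentences = [["first sentence", 0, 10], ["second sentence", 10, 20]]
glosses = [
    ["ICH1", 0, 4, "R"],
    ["SEHEN1", 4, 10, "R"],
    ["LEBEN1A*", 8, 14, "R"],  # straddles both sentences
    ["KLAR1A*", 12, 18, "R"],
]

print(glosses_per_sentence(sentences, glosses))
# ['ICH1 SEHEN1', 'KLAR1A*']
```

A gloss that straddles a sentence boundary, such as `LEBEN1A*` here, matches neither sentence and is dropped.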





Unnamed: 0,German Gloss Sentence
0,SEHEN1* SELBST1A* LEBEN1A* SEHEN1 $GEST-OFF^
1,AUFWACHSEN1A* ICH1 TAUB-GEHÖRLOS1A*
2,MEHR1* MEIN1* GEFÜHL3 VORSTELLUNG1A* ICH1* AL...
3,ICH1 MEHR1* TAUB-GEHÖRLOS1A* TREFFEN1* ZUSAMM...
4,KLAR1A* $GEST^
...,...
100,$INDEX1* RICHTIG1B* $INDEX1* FRANKFURT1* UMGE...
101,SCHÖN5 ERINNERUNG2 ICH1 SCHÖN1A* $INDEX1*
102,ICH1 MEISTENS1B $INDEX1* MEISTENS1B NUR2B ESS...
103,$INDEX1 INTERNAT1A KANN2B LOCKER1


In [14]:
#save the list of german gloss sentences A using pickle 
#to be able to use them later without extracting them again 

with open("path", "wb") as fp: 
    pickle.dump(gloss_data_a, fp)  

In [15]:
with open("path", "rb") as fp: 
    b = pickle.load(fp) 

In [16]:
with open("path", "rb") as fp: 
    a = pickle.load(fp)  

In [17]:
print(len(a)) 
print(len(b)) 

31542
31542
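
The save-and-reload check can be reproduced on a small list. This is a sketch only: the file name `gloss_data_a.pkl` and the temporary directory are assumptions, not the paths used in this notebook.

```python
import os
import pickle
import tempfile

gloss_sentences = [
    "SEHEN1* SELBST1A* LEBEN1A* SEHEN1 $GEST-OFF^",
    "KLAR1A* $GEST^",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory.
    pickle_path = os.path.join(tmp_dir, "gloss_data_a.pkl")
    with open(pickle_path, "wb") as fp:
        pickle.dump(gloss_sentences, fp)
    with open(pickle_path, "rb") as fp:
        restored = pickle.load(fp)

# The reloaded list should match the original in length and content.
print(len(restored), restored == gloss_sentences)
```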


In [18]:
df_a = pd.DataFrame(gloss_data_a, columns=["German Gloss Sentence"]) 

df_a 

Unnamed: 0,German Gloss Sentence
0,SEHEN1* SELBST1A* LEBEN1A* SEHEN1 $GEST-OFF^
1,AUFWACHSEN1A* ICH1 TAUB-GEHÖRLOS1A*
2,MEHR1* MEIN1* GEFÜHL3 VORSTELLUNG1A* ICH1* AL...
3,ICH1 MEHR1* TAUB-GEHÖRLOS1A* TREFFEN1* ZUSAMM...
4,KLAR1A* $GEST^
...,...
31537,DANACH2A* SPÄTER10* PAUSE1 ANDERS1*
31538,DU1 WENN1A DAZU1^* FERTIG1B HIN2 STIMMT1A WIE...
31539,DU1 MEHR1^* AUCH1A BEISPIEL1 GESCHICHTE3* SCH...
31540,ANDERS1* NACHBAR2B* FREUND7* NACHBAR2B $INDEX...


In [20]:
#save a data frame with the gloss sentences a 
df_a.to_csv(f"path", encoding="utf-8-sig", index=False, header=False)     

<hr style="border:1.0px solid gray"></hr> 

### <center>Extract the gloss sentences corresponding to the German sentences of signers B</center>
    
<hr style="border:1.0px solid gray"></hr> 

In [21]:
#a list to save all the gloss sentences B 
gloss_data_b = []  

In [22]:
%%time 


with open("path_to_save_statistics_for_transcripts_b", "w") as t:

    for transcript in tqdm(absolute_paths_eaf_transcripts): 
        #this is a list with the content of all tags that have the attribute "ANNOTATION_VALUE" (they include german glosses, german sentences, 
        #english glosses, english sentences, etc.)   

        #from this content extract FIRST *only* the tags with german sentences from signers B  
        transcript_content_b = [] 

        #this is a list for the specific time encoding of each sentence 
        time_encodings_b = [] 

        #this is a list of the german sentences B   
        german_sentences_b = [] 

        #for each transcript, extract first all the german sentences from signers B  
        with urllib.request.urlopen(transcript) as f:
            content = f.read().decode('utf-8') 
            transcript_content_b = BeautifulSoup(content, 'xml').find_all(name="ANNOTATION_VALUE") 
            time_encodings_b = BeautifulSoup(content, 'xml').find_all(name="TIME_SLOT") 
            for value in range(0, len(transcript_content_b)): 
                #if the value of the tags attribute TIER_ID is a german sentence from signers B, extract it
                if transcript_content_b[value].parent.parent.parent.attrs['TIER_ID'] == "Deutsche_Übersetzung_B": 
                    #this is the time encoding for the sentence (both starting and ending) 
                    time = transcript_content_b[value].parent.attrs 
                    #this is the starting time of the sentence 
                    start = time['TIME_SLOT_REF1'] 
                    #this is the ending time of the sentence  
                    end = time['TIME_SLOT_REF2'] 
                    #the sentence itself 
                    sentence_b = transcript_content_b[value].text 
                    #group the sentence + its start + its end 
                    sentence_group_b = [sentence_b, int(start[2:]), int(end[2:])]   
                    #add the german sentence to the list of german sentences 
                    german_sentences_b.append(sentence_group_b) 


        #from the whole content extract *only* the tags with german glosses from signers B  
        transcript_content_b_gloss = [] 

        #this is a list for the specific time encoding of each gloss  
        time_encodings_b_gloss = [] 

        #this is a list of the german glosses B    
        german_glosses_b = [] 

        count_r = 0 

        count_l = 0 

        with urllib.request.urlopen(transcript) as f:
            content = f.read().decode('utf-8') 
            transcript_content_b_gloss = BeautifulSoup(content, 'xml').find_all(name="ANNOTATION_VALUE") 
            time_encodings_b_gloss = BeautifulSoup(content, 'xml').find_all(name="TIME_SLOT") 
            for value in range(0, len(transcript_content_b_gloss)): 
                if (transcript_content_b_gloss[value].parent.parent.parent.attrs['TIER_ID'] == "Lexem_Gebärde_r_B" and transcript_content_b_gloss[value].parent.parent.parent.attrs['LINGUISTIC_TYPE_REF'] == "L_tokens_right_left__finer_granularity"):  
                    #tag indicating a gloss made with the right hand
                    tag = "r" 

                    #count the glosses made with the right hand; the counts decide later which hand is dominant
                    count_r = count_r + 1 

                    #this is the time encoding for the gloss (both starting and ending) 
                    time = transcript_content_b_gloss[value].parent.attrs 
                    #this is the starting time of the gloss  
                    start = time['TIME_SLOT_REF1'] 
                    #this is the ending time of the gloss  
                    end = time['TIME_SLOT_REF2'] 
                    #the gloss itself 
                    gloss_b = transcript_content_b_gloss[value].text 
                    #group the gloss + its start + its end + its tag  
                    gloss_group_b = [gloss_b, int(start[2:]), int(end[2:]), tag]   
                    #add the german gloss to the list of german glosses  
                    german_glosses_b.append(gloss_group_b) 
                if (transcript_content_b_gloss[value].parent.parent.parent.attrs['TIER_ID'] == "Lexem_Gebärde_l_B" and transcript_content_b_gloss[value].parent.parent.parent.attrs['LINGUISTIC_TYPE_REF'] == "L_tokens_right_left__finer_granularity"):  
                    #tag indicating a gloss made with the left hand
                    tag = "l" 

                    #count the glosses made with the left hand; the counts decide later which hand is dominant
                    count_l = count_l + 1 

                    #this is the time encoding for the gloss (both starting and ending) 
                    time = transcript_content_b_gloss[value].parent.attrs 
                    #this is the starting time of the gloss  
                    start = time['TIME_SLOT_REF1'] 
                    #this is the ending time of the gloss  
                    end = time['TIME_SLOT_REF2'] 
                    #the gloss itself 
                    gloss_b = transcript_content_b_gloss[value].text 
                    #group the gloss + its start + its end + its tag  
                    gloss_group_b = [gloss_b, int(start[2:]), int(end[2:]), tag]   
                    #add the german gloss to the list of german glosses  
                    german_glosses_b.append(gloss_group_b) 


        transcript_tag = "" 

        #######################################################################################################

        #decide which hand is dominant by comparing the gloss counts, then
        #sort the list of German glosses B according to:
        #1. their starting time
        #2. their tag: the tag of the dominant hand is upper-cased so that it sorts
        #   before the lower-case tag of the non-dominant hand
        #   (see the description "Very important for the sorting of glosses" above)

        #if more glosses are made with the right hand, it is the dominant hand
        if (count_r > count_l):
            transcript_tag = "R"
            for gloss in german_glosses_b:
                if (gloss[3] == "r"):
                    gloss[3] = "R"

        #if more glosses are made with the left hand, it is the dominant hand
        elif (count_l > count_r):
            transcript_tag = "L"
            for gloss in german_glosses_b:
                if (gloss[3] == "l"):
                    gloss[3] = "L"

        else:
            transcript_tag = "No signer B for the transcript!"

        #sort in every case, so that no stale list from a previous transcript is reused
        sorted_german_glosses_b = sorted(german_glosses_b, key = itemgetter(1, 3))

        #######################################################################################################


        full_annotation_b = [] 

        for sentence in german_sentences_b: 
            line = '' 
            gloss_sequence = ''
            for gloss in sorted_german_glosses_b: 
                if ((gloss[1]) >= sentence[1] and gloss[2] <= sentence[2]): 
                    line = '{} {}'.format(line, gloss[0])   
            full_annotation_b.append(line)  

        gloss_data_b = gloss_data_b + full_annotation_b 
        
        t.write(f"{transcript} - {transcript_tag}\n") 


    print(len(gloss_data_b)) 




32380
Wall time: 59min 33s





In [23]:
df_b = pd.DataFrame(gloss_data_b, columns=["German Gloss Sentence"]) 

df_b 

Unnamed: 0,German Gloss Sentence
0,$INDEX-MONITOR1 DARUM1 $INDEX-MONITOR1 $ALPHA...
1,GRUND4B $INDEX-MONITOR1 $ALPHA1:D-A VOR1G^ $I...
2,$INDEX1 BESTE1* LIEB1A $ALPHA1:D-N-A PERSON1 ...
3,ICH2 DARUM1 $INDEX1 SCHOCK2A* UNFALL8 ICH1 TR...
4,WIE3A* $INDEX1 NICHT-MEHR1A PERSON1 ABLAUF2^*...
...,...
32375,HAUSAUFGABE1* JA1A MATHEMATIK1B* UND1 $LIST1:...
32376,UND2A WIE3B* $INDEX1* DAZU1^* BEISPIEL1* ICH2...
32377,WENN1A VERGESSEN1 $GEST-OFF^ ICH2 WIE3B* STRA...
32378,FERTIG1A*


In [24]:
#save the list of german gloss sentences B using pickle 
#to be able to use it later without extracting them again 

with open("path", "wb") as fp: 
    pickle.dump(gloss_data_b, fp) 

In [2]:
with open("path", "rb") as fp: 
    c = pickle.load(fp)  

In [3]:
len(c) 

32380

In [4]:
df_b = pd.DataFrame(c, columns=["German Gloss Sentence"]) 

df_b  

Unnamed: 0,German Gloss Sentence
0,$INDEX-MONITOR1 DARUM1 $INDEX-MONITOR1 $ALPHA...
1,GRUND4B $INDEX-MONITOR1 $ALPHA1:D-A VOR1G^ $I...
2,$INDEX1 BESTE1* LIEB1A $ALPHA1:D-N-A PERSON1 ...
3,ICH2 DARUM1 $INDEX1 SCHOCK2A* UNFALL8 ICH1 TR...
4,WIE3A* $INDEX1 NICHT-MEHR1A PERSON1 ABLAUF2^*...
...,...
32375,HAUSAUFGABE1* JA1A MATHEMATIK1B* UND1 $LIST1:...
32376,UND2A WIE3B* $INDEX1* DAZU1^* BEISPIEL1* ICH2...
32377,WENN1A VERGESSEN1 $GEST-OFF^ ICH2 WIE3B* STRA...
32378,FERTIG1A*


In [9]:
with open("path", "rb") as fp: 
    a = pickle.load(fp) 

In [10]:
df_a = pd.DataFrame(a, columns=["German Gloss Sentence"]) 

df_a  

Unnamed: 0,German Gloss Sentence
0,SEHEN1* SELBST1A* LEBEN1A* SEHEN1 $GEST-OFF^
1,AUFWACHSEN1A* ICH1 TAUB-GEHÖRLOS1A*
2,MEHR1* MEIN1* GEFÜHL3 VORSTELLUNG1A* ICH1* AL...
3,ICH1 MEHR1* TAUB-GEHÖRLOS1A* TREFFEN1* ZUSAMM...
4,KLAR1A* $GEST^
...,...
31537,DANACH2A* SPÄTER10* PAUSE1 ANDERS1*
31538,DU1 WENN1A DAZU1^* FERTIG1B HIN2 STIMMT1A WIE...
31539,DU1 MEHR1^* AUCH1A BEISPIEL1 GESCHICHTE3* SCH...
31540,ANDERS1* NACHBAR2B* FREUND7* NACHBAR2B $INDEX...


In [5]:
#save a data frame with the gloss sentences b  
df_b.to_csv(f"path", encoding="utf-8-sig", index=False, header=False)     

#### Concatenating df_a and df_b gives a total of 63922 gloss sentences 

In [11]:
glosses_frames = [df_a, df_b] 

result = pd.concat(glosses_frames)  

In [12]:
#save the data frame with the concatenated data frames for gloss sentences a and gloss sentences b 
result.to_csv(f"path", encoding="utf-8-sig", index=False, header=False)    