## Objective

## Guiding Questions:
1. What is the easiest way to parse the CoNLL file?
2. Can I use a regex to extract a `(token, tag)` tuple from each line?

## Key findings
*Pattern*
- `(?:(\S+)(?:\t)(\S+))+`
*Pre-processing*
- open file, and let `doc = f.read()`
- let `split_doc = doc.split('\n\n')` (i.e. split on empty lines, which are sentence delimiters)
*Match*
- `[re.findall(pattern, sentence) for sentence in split_doc]` returns a list with one entry per sentence, where each entry is a list of `(token, tag)` tuples.
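
The steps above can be sketched as one small helper. This is a minimal sketch rather than a final implementation: the function name `parse_conll` and the inline `sample` string are assumptions made for illustration, and the sample stands in for the contents of the real file.

```python
import re

# Pattern from the key findings: one token, a tab, one tag.
PATTERN = re.compile(r'(?:(\S+)(?:\t)(\S+))+')

def parse_conll(doc):
    """Return one list of (token, tag) tuples per sentence (hypothetical helper)."""
    split_doc = doc.split('\n\n')  # blank lines delimit sentences
    return [PATTERN.findall(sentence) for sentence in split_doc]

# Made-up two-sentence sample in the same token<TAB>tag layout.
sample = 'Karsten\tO\nog\tO\n\nLøvefilm\tB-targ-Negative\nsom\tO'
print(parse_conll(sample))
```

Each inner list corresponds to one sentence, so sentence boundaries are preserved.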

## Imports and setup

In [131]:
import warnings
warnings.filterwarnings('ignore')

import re

Path to one of the CoNLL datasets:

In [132]:
filepath = '/home/fero/Desktop/nlp/in5550-2020-exam/data/raw/dev.conll'

Test lines:

In [133]:
line = 'Karsten	O'
sentence = '''Karsten	O
og	O
Petra	O
på	O
safari	O
:	O'''

#### Regex pattern 1.

In [134]:
p1 = re.compile(r'(?:(\S+)(?:\t)(\S+))+')
match_line1 = re.match(p1, line)
match_sentence1 = re.match(p1, sentence)
search_line1 = re.search(p1, line)
search_sentence1 = re.search(p1, sentence)
findall1 = re.findall(p1, sentence)

In [135]:
print('P1 line, match.groups()')
match_line1.groups()

P1 line, match.groups()


('Karsten', 'O')

In [136]:
print('P1 line, search.groups()')
search_line1.groups()

P1 line, search.groups()


('Karsten', 'O')

In [137]:
print('P1 sentence, match.groups()')
match_sentence1.groups()

P1 sentence, match.groups()


('Karsten', 'O')

In [138]:
print('P1 sentence, search.groups()')
search_sentence1.groups()

P1 sentence, search.groups()


('Karsten', 'O')

In [139]:
print('P1 sentence, findall()')
findall1

P1 sentence, findall()


[('Karsten', 'O'),
 ('og', 'O'),
 ('Petra', 'O'),
 ('på', 'O'),
 ('safari', 'O'),
 (':', 'O')]
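
Since `\S+` cannot cross a newline and each line holds exactly one tab-separated pair, every match of pattern 1 covers a single line. The outer `(?:...)+` repetition and the non-capturing group around `\t` therefore do no extra work on this input. A quick check, assuming the same test sentence as above; `p_simple` is a name introduced only for this illustration:

```python
import re

sentence = 'Karsten\tO\nog\tO\nPetra\tO\npå\tO\nsafari\tO\n:\tO'

p1 = re.compile(r'(?:(\S+)(?:\t)(\S+))+')
p_simple = re.compile(r'(\S+)\t(\S+)')  # hypothetical simplification of p1

# Compare the (token, tag) tuples produced by both patterns on this input.
same = p1.findall(sentence) == p_simple.findall(sentence)
print(same)
```

This only holds for lines with a single tab; a file with extra tab-separated columns would need a different pattern.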

#### Regex pattern 2.

In [140]:
p2 = re.compile(r'(?:(\S+)(?:\t)(\S+)\n?)+')
match_line2 = re.match(p2, line)
match_sentence2 = re.match(p2, sentence)
search_line2 = re.search(p2, line)
search_sentence2 = re.search(p2, sentence)
findall2 = re.findall(p2, sentence)

In [141]:
print('P2 line, match.groups()')
match_line2.groups()

P2 line, match.groups()


('Karsten', 'O')

In [142]:
print('P2 line, search.groups()')
search_line2.groups()

P2 line, search.groups()


('Karsten', 'O')

In [143]:
print('P2 sentence, match.groups()')
match_sentence2.groups()

P2 sentence, match.groups()


(':', 'O')

In [144]:
print('P2 sentence, search.groups()')
search_sentence2.groups()

P2 sentence, search.groups()


(':', 'O')

In [145]:
print('P2 sentence, findall()')
findall2

P2 sentence, findall()


[(':', 'O')]
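
Pattern 2 returns only the last pair because the optional `\n?` lets the outer `+` consume the whole sentence in a single match, and a capture group that repeats keeps only its final iteration. A small sketch of that behaviour, using a shortened two-line input (`two_lines` is a made-up example, not part of the dataset code):

```python
import re

two_lines = 'Karsten\tO\nog\tO'
p2 = re.compile(r'(?:(\S+)(?:\t)(\S+)\n?)+')

m = p2.match(two_lines)
print(m.group(0) == two_lines)  # the single match spans both lines
print(m.groups())               # the groups hold only the last iteration
print(p2.findall(two_lines))    # so findall() yields a single tuple
```

This is why pattern 1, which stops at each newline, is the one that works with `findall()`.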

#### Regex pattern 3 (identical to pattern 2).

In [146]:
p3 = re.compile(r'(?:(\S+)(?:\t)(\S+)\n?)+')
match_line3 = re.match(p3, line)
match_sentence3 = re.match(p3, sentence)
search_line3 = re.search(p3, line)
search_sentence3 = re.search(p3, sentence)
findall3 = re.findall(p3, sentence)

In [147]:
print('P3 line, match.groups()')
match_line3.groups()

P3 line, match.groups()


('Karsten', 'O')

In [148]:
print('P3 line, search.groups()')
search_line3.groups()

P3 line, search.groups()


('Karsten', 'O')

In [149]:
print('P3 sentence, match.groups()')
match_sentence3.groups()

P3 sentence, match.groups()


(':', 'O')

In [150]:
print('P3 sentence, search.groups()')
search_sentence3.groups()

P3 sentence, search.groups()


(':', 'O')

In [151]:
print('P3 sentence, findall()')
findall3

P3 sentence, findall()


[(':', 'O')]

#### Regex pattern 4 (identical to pattern 2).

In [152]:
p4 = re.compile(r'(?:(\S+)(?:\t)(\S+)\n?)+')
match_line4 = re.match(p4, line)
match_sentence4 = re.match(p4, sentence)
search_line4 = re.search(p4, line)
search_sentence4 = re.search(p4, sentence)
findall4 = re.findall(p4, sentence)

In [153]:
print('P4 line, match.groups()')
match_line4.groups()

P4 line, match.groups()


('Karsten', 'O')

In [154]:
print('P4 line, search.groups()')
search_line4.groups()

P4 line, search.groups()


('Karsten', 'O')

In [155]:
print('P4 sentence, match.groups()')
match_sentence4.groups()

P4 sentence, match.groups()


(':', 'O')

In [156]:
print('P4 sentence, search.groups()')
search_sentence4.groups()

P4 sentence, search.groups()


(':', 'O')

In [157]:
print('P4 sentence, findall()')
findall4


P4 sentence, findall()


[(':', 'O')]

#### Try matching over several sentences:

In [158]:
several_sentences = '''Karsten	O
og	O
Petra	O
på	O
safari	O
:	O

Løvefilm	B-targ-Negative
som	O
ikke	O
biter	O

Litt	O
for	O
nusselig	O
barnefilm	B-targ-Negative
.	O
'''

In [159]:
# Use pattern 1 : (r'(?:(\S+)(?:\t)(\S+))+')
findall_ = re.findall(p1, several_sentences)

In [160]:
print('P1 several_sentences, findall()')
findall_

P1 several_sentences, findall()


[('Karsten', 'O'),
 ('og', 'O'),
 ('Petra', 'O'),
 ('på', 'O'),
 ('safari', 'O'),
 (':', 'O'),
 ('Løvefilm', 'B-targ-Negative'),
 ('som', 'O'),
 ('ikke', 'O'),
 ('biter', 'O'),
 ('Litt', 'O'),
 ('for', 'O'),
 ('nusselig', 'O'),
 ('barnefilm', 'B-targ-Negative'),
 ('.', 'O')]

Read entire file as string.

In [161]:
# Reuse the path defined above; read as UTF-8 because the data contains Norwegian characters.
with open(filepath, 'r', encoding='utf-8') as f:
    doc = f.read()

In [162]:
doc[:12]

'Karsten\tO\nog'

Split doc on double newline: `\n\n`

In [163]:
split_doc = doc.split('\n\n')
split_doc[:5]

['Karsten\tO\nog\tO\nPetra\tO\npå\tO\nsafari\tO\n:\tO',
 'Løvefilm\tB-targ-Negative\nsom\tO\nikke\tO\nbiter\tO',
 'Litt\tO\nfor\tO\nnusselig\tO\nbarnefilm\tB-targ-Negative\n.\tO',
 'Den\tO\nfjerde\tO\nKarsten\tO\nog\tO\nPetra\tO\n-\tO\nfilmen\tB-targ-Negative\ner\tO\ntrolig\tO\nnervepirrende\tO\nnok\tO\nfor\tO\nde\tO\naller\tO\nyngste\tO\n,\tO\nmen\tO\nfor\tO\nandre\tO\nfremstår\tO\nden\tO\nlike\tO\nnusselig\tO\nog\tO\nharmløs\tO\nsom\tO\nen\tO\nnyfødt\tO\nløveunge\tO\n.\tO',
 'I\tO\nsommer\tO\nfikk\tO\ndrapet\tO\npå\tO\nløven\tO\nCecil\tO\ni\tO\nZimbabwe\tO\ninternett\tO\ntil\tO\nå\tO\nkoke\tO\n.\tO']

For each item in `split_doc`, i.e. each sentence, try to apply the pattern using `findall()`.

In [164]:
pattern = re.compile(r'(?:(\S+)(?:\t)(\S+))+')
matches = []

for sent in split_doc[:5]:
    match = re.findall(pattern, sent)
    if match:
        matches.append(match)
    else:
        print(f'No match found for sentence:\n{sent}')
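
One edge case may be worth checking, under the assumption that the file can end with a blank line: `split('\n\n')` then leaves an empty final chunk, `findall()` returns an empty list for it, and the loop above reports "No match found". A minimal sketch with a made-up `sample_doc` that skips whitespace-only chunks before matching:

```python
import re

pattern = re.compile(r'(?:(\S+)(?:\t)(\S+))+')

# Made-up sample that ends with a blank line (an assumption about the file).
sample_doc = 'Karsten\tO\nog\tO\n\nLitt\tO\n.\tO\n\n'

chunks = sample_doc.split('\n\n')
print(chunks[-1] == '')  # the trailing blank line leaves an empty chunk

# Skip whitespace-only chunks so only real sentences are matched.
matches = [pattern.findall(chunk) for chunk in chunks if chunk.strip()]
print(len(matches))
```

Whether the real file ends this way is not shown here, so the filter is a precaution rather than a required step.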

In [165]:
print(*matches[0], sep='\n')
print()
print(*matches[3], sep='\n')


('Karsten', 'O')
('og', 'O')
('Petra', 'O')
('på', 'O')
('safari', 'O')
(':', 'O')

('Den', 'O')
('fjerde', 'O')
('Karsten', 'O')
('og', 'O')
('Petra', 'O')
('-', 'O')
('filmen', 'B-targ-Negative')
('er', 'O')
('trolig', 'O')
('nervepirrende', 'O')
('nok', 'O')
('for', 'O')
('de', 'O')
('aller', 'O')
('yngste', 'O')
(',', 'O')
('men', 'O')
('for', 'O')
('andre', 'O')
('fremstår', 'O')
('den', 'O')
('like', 'O')
('nusselig', 'O')
('og', 'O')
('harmløs', 'O')
('som', 'O')
('en', 'O')
('nyfødt', 'O')
('løveunge', 'O')
('.', 'O')
