# <span style="color:red">Exercise</span>
### Submitted by Leopold Lemmermann

---
##  <span style="color:red"> Exercise_1</span>

Read the country names and capital cities from [this](https://geographyfieldwork.com/WorldCapitalCities.htm) page, which lists the world's capital cities with their countries. Save the result as a <span style="color:blue">comma-separated values (CSV)</span> file.

In [1]:
!pip install requests
!pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
from csv import writer
from re import sub

from bs4 import BeautifulSoup
from requests import get

url = "https://geographyfieldwork.com/WorldCapitalCities.htm"
response = get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')

# the country/capital listing is taken from the third table on the page
table = soup.find_all('table')[2]
rows = table.find_all('tr')

# strip footnote markers such as [1] or [12]
def strip_footnotes(text: str) -> str:
  return sub(r'\[\d*\]', '', text)

# csv.writer quotes any field that itself contains a comma
with open('data/capitals.csv', 'w', newline='', encoding='utf-8') as f:
  out = writer(f)
  out.writerow(['country', 'capital'])

  for row in rows:
    cells = row.find_all('td')
    if len(cells) > 1:
      country = strip_footnotes(cells[0].text.strip())
      capital = strip_footnotes(cells[1].text.strip())
      # skip rows whose cells hold only numbers
      if not country.isdigit() and not capital.isdigit():
        out.writerow([country, capital])
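
Country names can themselves contain a comma, which would shift the columns of a hand-assembled CSV line. As a minimal sketch with made-up rows (not the scraped data), Python's standard `csv` module quotes such fields on writing and restores them on reading:

```python
import csv
import io

# Hypothetical example rows; the first country name contains a comma.
rows = [("Korea, South", "Seoul"), ("Germany", "Berlin")]

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
out = csv.writer(buffer)
out.writerow(["country", "capital"])
out.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields two-column rows again.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed)
```

The same check can be run against `data/capitals.csv` by reading it with `csv.reader` and confirming that every row has exactly two fields.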


---
##  <span style="color:red"> Exercise_2.1</span>
Modify the <span style="color:blue">regex</span> above for sentence segmentation so that the following text is split into correct sentences.

>```Fruits like apple, orange, and mango are healthy. But they are expensive, i.e Mr. Bean can't afford them! One can order some online from www.rewe.de. Prof. Karl, Dep. of Plant Science. Email: karl@plant.science.de. Regards!```


In [3]:
from re import split
text = """
Fruits like apple, orange, and mango are healthy. But they are expensive,
i.e. Mr. Bean can't afford them! One can order some online from www.rewe.de.
Prof. Karl, Dep. of Plant Science. Email: karl@plant.science.de. Regards!
"""
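# split on whitespace that follows ., ? or ! and precedes a capital letter,
# unless the period belongs to "i.e.", "e.g." or a title such as "Mr." or "Prof."
pattern = r"(?<!i\.e\.)(?<!e\.g\.)(?<![A-Z][a-z]\.)(?<![A-Z][a-z]{3}\.)(?<=[.?!])\s+(?=[A-Z])"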

sentences = split(pattern, text.replace('\n', ' '))
sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

for sentence in sentences:
  print(sentence)

Fruits like apple, orange, and mango are healthy.
But they are expensive, i.e. Mr. Bean can't afford them!
One can order some online from www.rewe.de.
Prof. Karl, Dep. of Plant Science.
Email: karl@plant.science.de.
Regards!
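
To see what the negative lookbehinds contribute, the following sketch compares a naive splitter with the guarded pattern on a short, made-up sentence pair. The names `NAIVE`, `GUARDED` and `sample` are chosen for this illustration only.

```python
import re

# Split after ., ? or ! when whitespace and a capital letter follow.
NAIVE = r"(?<=[.?!])\s+(?=[A-Z])"

# Same rule, but with fixed-width lookbehinds that protect abbreviations.
GUARDED = (
    r"(?<!i\.e\.)(?<!e\.g\.)"   # not directly after "i.e." or "e.g."
    r"(?<![A-Z][a-z]\.)"        # not after a two-letter title such as "Mr."
    r"(?<![A-Z][a-z]{3}\.)"     # not after a four-letter title such as "Prof."
    r"(?<=[.?!])\s+(?=[A-Z])"
)

sample = "I asked, i.e. Mr. Bean did not know! Prof. Karl is here."

naive_parts = re.split(NAIVE, sample)
guarded_parts = re.split(GUARDED, sample)

print(len(naive_parts), naive_parts)
print(len(guarded_parts), guarded_parts)
```

One trade-off of this design: any capitalised two-letter or four-letter word at the end of a sentence (for example "Ok.") is protected as well, so the following sentence would not be split off.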


---
##  <span style="color:red"> Exercise_2.2</span>
Modify or rewrite the word tokenization pattern given above so that you achieve near-`ideal` tokenization for the following text:
>```"I said, 'what're you? Crazy?'" said Sandowsky. "I can't afford to do that."```

See the ideal tokenization result from the `Exercise_2 - Ideal tokenization - file` in Moodle.

<img src="ideal_tokenization.png" />
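
Before the full solution, here is a minimal sketch of the idea on a shorter sentence. It assumes that only the contractions `'t` (after "can") and `'re` need expanding, and it uses `re.findall` rather than `re.split`, so no empty strings have to be filtered out. The names `TOKEN_PATTERN`, `EXPANSIONS` and `tokenize` are illustrative.

```python
import re

# Alternatives are tried left to right: "'t" directly after "can", "'re",
# runs of letters, and single punctuation marks.
TOKEN_PATTERN = r"(?<=can)'t|'re|[A-Za-z]+|[.,!?'\";]"

# Contractions that are rewritten to full words after matching.
EXPANSIONS = {"'t": "not", "'re": "are"}

def tokenize(text: str) -> list[str]:
    raw_tokens = re.findall(TOKEN_PATTERN, text)
    return [EXPANSIONS.get(token, token) for token in raw_tokens]

tokens = tokenize("I can't, what're you?")
print(tokens)
```

Other contractions such as "won't" are not covered by this pattern and would need their own alternatives.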

In [4]:
from re import split, sub

text="""
"I said, 'what're you? Crazy?'" said Sandowsky. "I can't
afford to do that."
"""
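# alternatives are tried left to right: "'t" after "can", "'re", words, punctuation
pattern = "((?<=can)'t|'re|[A-Za-z]+|[.,!?'\";])"

# the capturing group makes split() return the matched tokens as well
tokens = split(pattern, text.replace("\n", " "))
tokens = [token.strip() for token in tokens if token.strip()]
# expand the contractions
tokens = [sub(r"'re", 'are', token) for token in tokens]
tokens = [sub(r"'t", 'not', token) for token in tokens]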

for token in tokens:
  print(token)

"
I
said
,
'
what
are
you
?
Crazy
?
'
"
said
Sandowsky
.
"
I
can
not
afford
to
do
that
.
"


---
##  <span style="color:red"> Exercise_3</span>
### Lemmatization for German
There is no lemmatization library for German in NLTK. However, [<span style="color:blue">GermaLemma</span>](https://github.com/WZBSocialScienceCenter/germalemma) is an open-source lemmatizer for German. To lemmatize a word, you need to pass its POS tag as the second argument. In this exercise, you can use the German POS tagger from <span style="color:blue">pattern.de</span>, but you then have to convert its tags into `N`, `V`, `ADJ`, or `ADV`. Your task: when a word's tag falls into one of these four categories, map the tag and pass it to the lemmatizer; otherwise, return the word itself as the lemma. See the cells below for how to run the lemmatizer and the POS tagger for German.

You can install <span style="color:blue">GermaLemma</span> as
>```pip install -U germalemma```

Also make sure MySQL and its related packages are installed.
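
The mapping step described above can be sketched without either library. This is only an illustration: matching on tag prefixes is an assumption of the sketch, and `toy_find_lemma` is a hypothetical stand-in for a real lemmatizer call.

```python
from typing import Callable, Optional

# Tag prefixes (Penn Treebank style) mapped to the four lemmatizer categories.
PREFIX_TO_CATEGORY = {"NN": "N", "VB": "V", "JJ": "ADJ", "RB": "ADV"}

def map_pos(tag: str) -> Optional[str]:
    """Return N, V, ADJ or ADV for a tag, or None for any other tag."""
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if tag.startswith(prefix):
            return category
    return None

def lemmatize_token(word: str, tag: str,
                    find_lemma: Callable[[str, str], str]) -> str:
    """Lemmatize only if the tag maps to a category, else return the word."""
    category = map_pos(tag)
    if category is None:
        return word
    return find_lemma(word, category)

# Hypothetical stand-in lemmatizer backed by a tiny lookup table.
def toy_find_lemma(word: str, category: str) -> str:
    toy_lexicon = {("Mengen", "N"): "Menge", ("schadet", "V"): "schaden"}
    return toy_lexicon.get((word, category), word)

print(lemmatize_token("Mengen", "NNS", toy_find_lemma))
print(lemmatize_token("in", "IN", toy_find_lemma))
```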


In [5]:
# Install the MySQL client development files (Linux)
!sudo apt install default-libmysqlclient-dev
# Install GermaLemma
!pip install germalemma

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
default-libmysqlclient-dev is already the newest version (1.0.7).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Defaulting to user installation because normal site-packages is not writeable


In [6]:
from germalemma import GermaLemma
from pattern.de import parse, split

sentence2 = """Die Brände in Brasilien setzen erhebliche Mengen an klimaschädlichen Treibhausgasen frei.
Die Nasa hat nun simuliert, wie sich Kohlenmonoxid über Südamerika ausbreitet.
Am Boden schadet das Gas der Gesundheit erheblich."""

# map the tagger's POS tags to the four categories GermaLemma accepts
POS_MAP = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
}

lemmatizer = GermaLemma()

lemmas = []
for sentence in split(parse(sentence2)):
    for token in sentence:
        mapped_pos = POS_MAP.get(token.pos)
        if mapped_pos is None:
            # tag outside the four categories: keep the word as its own lemma
            lemmas.append(token.string)
        else:
            lemmas.append(lemmatizer.find_lemma(token.string, mapped_pos))

for lemma in lemmas:
    print(lemma)

Die
Brand
in
Brasilien
setzen
erheblich
Menge
an
klimaschädlich
Treibhausgas
frei
.
Die
Nasa
haben
nun
Simuliert
,
wie
sich
Kohlenmonoxid
über
Südamerika
Ausbreitet
.
Am
Boden
schaden
das
Gas
der
Gesundheit
erheblich
.


---
##  <span style="color:red"> Exercise_4</span>
### Lemmatizer comparison
For this exercise, you are given two lists in the data directory: <span style="color:blue"> verb_lemma.csv, noun_lemma.csv </span>. The files contain a large list of verbs and nouns along with their lemma(s). The lists are adapted from http://wordlist.aspell.net/agid-readme/. Your task is to compare the performance of different lemmatizers on both lists. For the lemmatizers, use NLTK, spaCy, LemmInflect and (optionally) Stanford Stanza.
Report the percentage of correctly lemmatized instances for each lemmatizer in the form of a table.
You don't need to use the complete lists; a random sample of 1000 words from each list is sufficient for this task. In that case, include the code used to sample the words.
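
A sampling step of this kind can be made repeatable by seeding a dedicated random generator. The sketch below is one possible approach, not part of the task statement; `sample_pairs`, the seed value and the four-entry `toy_verbs` dictionary are illustrative.

```python
import random

def sample_pairs(word_to_lemma: dict[str, str], k: int, seed: int) -> list[tuple[str, str]]:
    """Draw up to k (word, lemma) pairs; the same seed gives the same draw."""
    rng = random.Random(seed)          # private generator, global state untouched
    words = sorted(word_to_lemma)      # fixed order, so the draw depends only on the seed
    chosen = rng.sample(words, min(k, len(words)))
    return [(word, word_to_lemma[word]) for word in chosen]

# Tiny stand-in for the contents of a lemma list.
toy_verbs = {"hunting": "hunt", "coring": "core", "reneged": "renege", "lactating": "lactate"}

first = sample_pairs(toy_verbs, 2, seed=42)
second = sample_pairs(toy_verbs, 2, seed=42)
print(first)
print(first == second)
```

Keeping each word paired with the lemma from its own list also avoids ambiguity when the same form appears in both the verb and the noun list.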

In [7]:
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install lemminflect

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Defaulting to user installation because normal site-packages is not writeable


In [8]:
from csv import reader
from random import sample

# import verb_lemma.csv and noun_lemma.csv
verbs, nouns = {}, {}
with open('data/verb_lemma.csv', 'r') as f:
  csv = reader(f)
  for row in csv:
    verbs[row[0]] = row[1]

with open('data/noun_lemma.csv', 'r') as f:
  csv = reader(f)
  for row in csv:
    nouns[row[0]] = row[1]

# select a random sample of 1000 words from verbs and nouns each
verb_sample = sample(sorted(verbs), 1000)
noun_sample = sample(sorted(nouns), 1000)
sample_words = verb_sample + noun_sample
# look up each word in the list it was drawn from, since a form may occur in both lists
correct_lemmas = [verbs[word] for word in verb_sample] + [nouns[word] for word in noun_sample]

print(f"{len(sample_words)} words: ", sample_words[:10])
print(f"with {len(correct_lemmas)} lemmas: ", correct_lemmas[:10])

2000 words:  ['introspects', 'recapitalised', 'coring', 'hunting', 'phoneying', 'degustates', 'lactating', 'reneged', 'colonising', 'hypercriticizes']
with 2000 lemmas:  ['introspect', 'recapitalise', 'core', 'hunt', 'phoney', 'degustate', 'lactate', 'renege', 'colonise', 'hypercriticize']


In [9]:
from nltk.stem import WordNetLemmatizer
from spacy import load
from lemminflect import getAllLemmas

lemmas = {
  "nltk": [],
  "spacy": [],
  "lemminflect": []
}

nltkLemmatizer = WordNetLemmatizer()
spacyNlp = load("en_core_web_sm")

for word in sample_words:
  lemmas["nltk"].append(nltkLemmatizer.lemmatize(word))
  lemmas["spacy"].append(spacyNlp(word)[0].lemma_)
  result = list(getAllLemmas(word).values())
  lemmas["lemminflect"].append(result[0] if len(result) > 0 else word)

def compare_lemmas(provider: str):
  correct = sum([1 for i in range(len(sample_words)) if lemmas[provider][i] == correct_lemmas[i]])
  print(f'Correct lemmatisation percentage for {provider}: {correct/len(sample_words)*100:.2f}%')

compare_lemmas("nltk")
compare_lemmas("spacy")
compare_lemmas("lemminflect")

Correct lemmatisation percentage for nltk: 31.85%
Correct lemmatisation percentage for spacy: 76.95%
Correct lemmatisation percentage for lemminflect: 0.95%
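
Two details are worth checking when interpreting these figures: `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is passed, and lemminflect's `getAllLemmas` returns a dict that maps each POS to a tuple of candidate lemmas rather than to a single string. The following sketch shows one way a scorer could take the POS into account, accept a tuple of candidates, and report the result as a table. The lemmatizers `strip_s` and `identity` and the `gold` data are hypothetical stand-ins, so the numbers it prints say nothing about the real libraries.

```python
from typing import Callable, Sequence

GoldItem = tuple[str, str, str]                      # (word, pos, gold lemma)
Lemmatizer = Callable[[str, str], Sequence[str]]     # returns candidate lemmas

def accuracy(lemmatize: Lemmatizer, gold: Sequence[GoldItem]) -> float:
    """Percentage of items whose gold lemma is among the candidates."""
    hits = sum(1 for word, pos, lemma in gold if lemma in lemmatize(word, pos))
    return 100 * hits / len(gold)

def as_table(scores: dict[str, float]) -> str:
    """Render the scores as a Markdown table."""
    lines = ["| Lemmatizer | Correct (%) |", "|---|---|"]
    lines += [f"| {name} | {score:.2f} |" for name, score in scores.items()]
    return "\n".join(lines)

# Hypothetical stand-ins for real lemmatizers.
def strip_s(word: str, pos: str) -> tuple[str, ...]:
    return (word[:-1],) if word.endswith("s") else (word,)

def identity(word: str, pos: str) -> tuple[str, ...]:
    return (word,)

# Made-up gold data for the illustration.
gold = [
    ("introspects", "VERB", "introspect"),
    ("hunting", "VERB", "hunt"),
    ("cats", "NOUN", "cat"),
    ("core", "VERB", "core"),
]

scores = {"strip_s": accuracy(strip_s, gold), "identity": accuracy(identity, gold)}
table = as_table(scores)
print(table)
```

To use it with the real libraries, each one would be wrapped in a function with the same `(word, pos)` signature, for example by translating `pos` into the `pos` argument of NLTK or the `upos` argument of lemminflect.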
