# Tutorial 1
Course: Creation and Annotation of Linguistic Resources

Department: Computational Linguistics

Semester: Spring 2021

Lecturer: Duygu Ataman

Tutor: Tobias Weisskopf https://github.com/BlackSquirrelz

---

## Agenda

- Git
- IDE
- Virtual Environment
- Examples
- Questions
- Help with Projects

---

# Examples
Required imports

In [44]:
import xml.etree.ElementTree as ET # For XML Parsing
import csv # For reading CSV
import json
import requests # For getting content from the web
from bs4 import BeautifulSoup # For parsing content from the web

## Function definitions
Different ways to read data from the file system and from the web:

- Example 1: Regular Text File
- Example 2: CSV File
- Example 3: XML File
- Example 4: JSON File
- Example 5: Getting Content via API
- Example 6: Getting Web Content by Other Means

## Example 1 - Generic Text File

In [4]:
# Generic Function to read a file
def open_file(file_path):
    with open(file_path, 'r') as f:
        text = [line.strip() for line in f]
    return text

In [8]:
# Example 1
print("\nReading a regular Text File.")
example_1 = open_file('Data/example_1.txt')
print(f"Output: {example_1}\n")


Reading a regular Text File.
Output: ['This is a regular file.', 'It has some lines, and some punctuation.']

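The default encoding used by `open` depends on the platform, which can matter for corpora that contain non-ASCII characters. The following is a minimal sketch, assuming the files are UTF-8 encoded; `read_lines` and the temporary sample file are hypothetical and not part of the tutorial data.

```python
import tempfile
from pathlib import Path


def read_lines(file_path, encoding="utf-8"):
    """Return the non-empty, stripped lines of a text file."""
    with open(file_path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Write a small sample file so the sketch runs without the Data/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Parfüm der Bäume\n\nist Kampfstoff\n", encoding="utf-8")
    sample_lines = read_lines(sample_path)

print(sample_lines)
```

Unlike `open_file`, this variant also skips empty lines, which is often convenient when each line is treated as one unit of text.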


## Example 2 - CSV

In [12]:
# Reading CSV Files
def csv_open(file_path):
    # newline='' lets the csv module handle line endings itself
    with open(file_path, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        text = [row for row in reader]
    return text

In [13]:
print("\nReading a CSV File.")
example_2 = csv_open('Data/example_2.csv')
print(f"Output: {example_2}\n")


Reading a CSV File.
Output: [['Basel', 'CH'], ['Zurich', 'CH'], ['Bern', 'CH'], ['Boston', 'USA'], ['Beijing', 'CN'], ['Tokyo', 'JP'], ['Kuala Lumpur', 'MY'], ['Singapore', 'SG']]

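Each row comes back as a list of strings, so the columns can be unpacked directly. As a rough sketch, assuming every row has exactly two columns (city, country code) as in the output above, the cities could be grouped by country like this. The in-memory sample stands in for the real file.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for a two-column file (city, country code).
sample_csv = io.StringIO("Basel,CH\nZurich,CH\nBoston,USA\n")

cities_by_country = defaultdict(list)
for city, country in csv.reader(sample_csv):
    cities_by_country[country].append(city)

print(dict(cities_by_country))
```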


## Example 3 - XML
XML (eXtensible Markup Language) is a human- and machine-readable format used to exchange data between software systems, or to send data to middleware.

In [29]:
# Parsing XML Files
def xml_parsing(file_path):
    root_node = ET.parse(file_path).getroot()
    text = [tag.text for tag in root_node.findall('token')]
    return text

In [30]:
print("\nParsing an XML File.")
example_3 = xml_parsing('Data/example_3.xml')
print(f"Output: {example_3}\n")


Parsing an XML File.
Output: ['This', 'is', 'a', 'sentence']

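The content of `Data/example_3.xml` is not shown here. The sketch below assumes a flat structure in which the `token` elements are direct children of the root; the real file may look different.

```python
import xml.etree.ElementTree as ET

# Assumed structure; the real Data/example_3.xml may differ.
sample_xml = """
<sentence>
    <token>This</token>
    <token>is</token>
    <token>a</token>
    <token>sentence</token>
</sentence>
"""

root_node = ET.fromstring(sample_xml)
# findall('token') only matches direct children of the root element.
tokens = [tag.text for tag in root_node.findall("token")]
print(tokens)
```

If the tokens were nested more deeply, `root_node.iter('token')` would find them at any depth.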


Further reading: 

http://fedora.clarin-d.uni-saarland.de/teaching/Corpus_Linguistics/Tutorial_XML.html

## Example 4 - JSON
Like XML, JSON (JavaScript Object Notation) is used for transferring data from one system to another, and it is often used in API calls (more on that later). Compared with XML, JSON is less verbose and, in my opinion, easier to read.

In [45]:
def get_json(file_name):
    """ Generic function to retrieve data from JSON file"""
    with open(file_name) as f:
        data = json.load(f)
        return data

In [51]:
print("\nParsing a JSON File.")
example_4 = get_json('Data/example_4.json')
print(f"Output: {example_4}")


Parsing a JSON File.
Output: {'Dogs': [{'category': 'Companion dogs', 'name': 'Chihuahua'}, {'category': 'Hounds', 'name': 'Foxhound'}]}


In [58]:
type(example_4)

dict

In [59]:
dogs = example_4['Dogs']

This means that we can access the elements as in any dictionary. For instance, to find the names of the dogs:

In [63]:
for dog in dogs:
    print(dog['name'])

Chihuahua
Foxhound

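The reverse direction works the same way. Here is a small sketch of turning a dictionary back into JSON text, using the same structure as the output above; `names_by_category` is a hypothetical name chosen for illustration.

```python
import json

# Same structure as the output shown above.
data = {
    "Dogs": [
        {"category": "Companion dogs", "name": "Chihuahua"},
        {"category": "Hounds", "name": "Foxhound"},
    ]
}

names_by_category = {dog["category"]: dog["name"] for dog in data["Dogs"]}

# ensure_ascii=False keeps non-ASCII characters readable; indent pretty-prints.
json_text = json.dumps(names_by_category, ensure_ascii=False, indent=2)
print(json_text)

# Round trip: parsing the string gives back an equal dictionary.
restored = json.loads(json_text)
```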

---

## Example 5: More specialised functions - Calling APIs

Getting data from the World Wide Web. Companies and institutions sometimes offer Application Programming Interfaces (APIs), which we can use to retrieve additional data from their websites.

In [87]:
# Getting Content from the Internet
# Documentation: https://requests.readthedocs.io/
def call_api(url):
    response = requests.get(url)
    # Return None instead of failing when the request was not successful
    if response.status_code != 200:
        return None
    return response.json()

In [88]:
example_5 = call_api('http://urbanscraper.herokuapp.com/define/Dogecoin')

In [89]:
print(type(example_5))

<class 'dict'>


In [95]:
print(f"The term: {example_5['term']} is defined as '{example_5['definition']}' for example: '{example_5['example']}'")

The term: Dogecoin is defined as 'An online decentralized cryptocurrency which was originally created as a joke. Similar to Litecoin, Dogecoin uses the hashing algorithm, Scrypt.' for example: 'Dogecoin worths almost nothing, how come you believe that Dogecoin will rise?'

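API responses do not always contain every field, and indexing a missing key raises a `KeyError`. The sketch below shows one way to format a result defensively with `dict.get`. The helper `describe_term` and the sample entry are hypothetical and not part of the API.

```python
def describe_term(entry):
    """Format an API result, tolerating missing fields."""
    term = entry.get("term", "<unknown term>")
    definition = entry.get("definition", "<no definition>")
    example = entry.get("example")
    text = f"The term '{term}' is defined as '{definition}'"
    if example:
        text += f", for example: '{example}'"
    return text


# Hypothetical response without an 'example' field.
partial_entry = {"term": "Dogecoin", "definition": "A cryptocurrency."}
description = describe_term(partial_entry)
print(description)
```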

## Example 6: More specialised functions - Other Web Content

Getting data from the World Wide Web without an API means downloading the HTML pages and parsing them ourselves. This is where things get tricky. Luckily, there are already some good resources out there:

https://github.com/pleyad/HorizonsCorpus

In [16]:
# Getting Content from the Internet
# Documentation: https://requests.readthedocs.io/
def get_web_content(url):
    response = requests.get(url)
    # Return None instead of failing when the request was not successful
    if response.status_code != 200:
        return None
    return response.text

We start by fetching a single URL.

In [32]:
# Getting Content from the Web
print("\nParsing Text from the Web.")
example_4 = get_web_content('https://www.horizonte-magazin.ch/2020/12/03/parfuem-der-baeume-ist-kampfstoff/')
#print(f"Output: {example_4}\n")


Parsing Text from the Web.


In [34]:
def web_content_parsing(html_doc):
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup

In [39]:
parsed_web = web_content_parsing(example_4)

Next, we collect all the links in the document so that we can process them further later on.

In [71]:
# href=True skips <a> tags that have no href attribute
links = [link['href'] for link in parsed_web.find_all('a', href=True) if link['href'].startswith('http')]
print(links)

['https://www.horizons-mag.ch/2020/12/03/whispering-trees/', 'https://www.revue-horizons.ch/2020/12/03/larme-des-arbres-est-leur-parfum/', 'https://www.facebook.com/horizonsmagazine', 'http://www.twitter.com/horizonte_de', 'https://www.horizonte-magazin.ch/', 'https://www.horizonte-magazin.ch/', 'https://www.horizonte-magazin.ch/kategorie/fokus/', 'https://www.horizonte-magazin.ch/kategorie/fokus/diversitaet-an-hochschulen/', 'https://www.horizonte-magazin.ch/kategorie/fokus/das-perfektionierte-essen/', 'https://www.horizonte-magazin.ch/kategorie/fokus/die-lehren-aus-der-pandemie/', 'https://www.horizonte-magazin.ch/kategorie/fokus/geistreich-gegen-die-klimakatastrophe/', 'https://www.horizonte-magazin.ch/kategorie/hintergrund/', 'https://www.horizonte-magazin.ch/kategorie/kurz-und-knapp/', 'https://www.horizonte-magazin.ch/kategorie/innovation/', 'https://www.horizonte-magazin.ch/kategorie/mensch-und-meinung/', 'https://www.horizonte-magazin.ch/kategorie/in-bildern/', 'https://www.hor

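Keeping only absolute links drops relative ones such as `/kategorie/innovation/`. The following sketch resolves relative links against the page URL and keeps only those on the same host, using the standard library. The `raw_hrefs` values are made up for illustration and are not taken from the page.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.horizonte-magazin.ch/2020/12/03/parfuem-der-baeume-ist-kampfstoff/"

# Hypothetical href values as they might appear in a page.
raw_hrefs = [
    "https://www.horizonte-magazin.ch/kategorie/fokus/",
    "/kategorie/innovation/",
    "mailto:someone@example.org",
    "https://www.facebook.com/horizonsmagazine",
]

# Resolve relative links against the page URL.
absolute = [urljoin(base_url, href) for href in raw_hrefs]

# Keep only http(s) links that stay on the same host as the page.
base_host = urlparse(base_url).netloc
internal_links = [
    url for url in absolute
    if urlparse(url).scheme in ("http", "https") and urlparse(url).netloc == base_host
]
print(internal_links)
```

Restricting the list to one host is a design choice that keeps a crawler from wandering off to social media or partner sites.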
In [73]:
article_title = parsed_web.title.text

In [74]:
article_links = {'article': article_title, 'links': links}

In [75]:
print(article_links)

{'article': 'Parfüm der Bäume ist Kampfstoff - Horizonte', 'links': ['https://www.horizons-mag.ch/2020/12/03/whispering-trees/', 'https://www.revue-horizons.ch/2020/12/03/larme-des-arbres-est-leur-parfum/', 'https://www.facebook.com/horizonsmagazine', 'http://www.twitter.com/horizonte_de', 'https://www.horizonte-magazin.ch/', 'https://www.horizonte-magazin.ch/', 'https://www.horizonte-magazin.ch/kategorie/fokus/', 'https://www.horizonte-magazin.ch/kategorie/fokus/diversitaet-an-hochschulen/', 'https://www.horizonte-magazin.ch/kategorie/fokus/das-perfektionierte-essen/', 'https://www.horizonte-magazin.ch/kategorie/fokus/die-lehren-aus-der-pandemie/', 'https://www.horizonte-magazin.ch/kategorie/fokus/geistreich-gegen-die-klimakatastrophe/', 'https://www.horizonte-magazin.ch/kategorie/hintergrund/', 'https://www.horizonte-magazin.ch/kategorie/kurz-und-knapp/', 'https://www.horizonte-magazin.ch/kategorie/innovation/', 'https://www.horizonte-magazin.ch/kategorie/mensch-und-meinung/', 'https

---
# References

- https://github.com/pleyad/HorizonsCorpus
- https://beautiful-soup-4.readthedocs.io/en/latest/
- https://www.askpython.com/python/examples/python-xml-parser

In [113]:
text = parsed_web.get_text()
print(text)

Parfüm der Bäume ist Kampfstoff - Horizonte

AbonnierenArchiv

ENFR 

Horizonte




Fokus 

Diversität an Hochschulen
Das perfektionierte Essen
Die Lehren aus der Pandemie
Geistreich gegen die Klimakatastrophe
> Weitere Themen


Hintergrund
Kurz und knapp
Innovation
Mensch und Meinung
In Bildern

WALDÖKOLOGIE
Parfüm der Bäume ist Kampfstoff
Die Pflanzen im Wald setzen viele flüchtige Stoffe frei, die eine Duftwolke bilden. Sie könnte etwas über den Zustand des Forstes verraten.

Yvonne Vahlensieck, 3. Dezember 2020




Die Waldluft ist voller Duftstoffe. Die für Menschen meist wohlriechenden Moleküle sind die Schlachtrufe der Pflanzen, um gegen ihre Fressfeinde zu mobilisieren. | Foto: imageBROKER
Es ist bekannt, dass Blüten Duftstoffe produzieren, um Bestäuber wie Bienen und Schmetterlinge anzulocken. Doch nur wenige wissen, dass auch grüne Blätter ständig c