# 4. Obtaining the data

## Contents
1. Download the corpus
2. Unpack the corpus
3. Read the corpus with NLTK

The following text analyses use data from the Märchenkorpus (a fairy tale corpus, licensed under CC BY 3.0): Walter, M. (2013). "Märchenkorpus Version 1.0". Humboldt-Universität zu Berlin. doi:[10.34644/laudatio-dev-UyRUCnMB7CArCQ9C63ji](https://doi.org/10.34644/laudatio-dev-UyRUCnMB7CArCQ9C63ji).

In [1]:
# import the libraries needed in this notebook
# "standard library" imports
import pathlib
import shutil
import zipfile

# third party library imports
import requests

# prepare paths
DATA_DIR = pathlib.Path.cwd().parent.joinpath("data")
RAW_DATA_DIR = DATA_DIR.joinpath("raw")
TXT_DATA_DIR = RAW_DATA_DIR.joinpath("txt")
CORPUS_BASE_DIR = DATA_DIR.joinpath("corpus")
CORPUS_DIR = CORPUS_BASE_DIR.joinpath("grimm")

## 1. Download the corpus

The [Laudatio repository](https://www.laudatio-repository.org) lists the following "Download (TXT)" link for the [Märchenkorpus](https://www.laudatio-repository.org/browse/corpus/UyRUCnMB7CArCQ9C63ji/corpora): https://www.laudatio-repository.org/download/format/15/31/1.0

We download the zip file and unpack it. Because it contains a second zip file, we unpack again and end up with a directory of 211 text files, one per fairy tale, e.g. `grimm_aschenputtel_119-126.txt`.

In [2]:
TXT_URL = "https://www.laudatio-repository.org/download/format/15/31/1.0"

In [3]:
# download zip file; fail early on HTTP errors instead of saving an error page
response = requests.get(TXT_URL, timeout=60)
response.raise_for_status()
response

<Response [200]>

In [4]:
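filename = RAW_DATA_DIR.joinpath("txt.zip")
# make sure the target directory exists before writing
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)
# write zip file to disk
with open(filename, "wb") as f:
    f.write(response.content)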

# unpack first layer
with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall(RAW_DATA_DIR)

# unpack second layer
filename = RAW_DATA_DIR.joinpath("txt/txt_1-0.zip")
with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall(RAW_DATA_DIR)

# delete second layer zipfile
filename.unlink()

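# delete any existing corpus directory; otherwise shutil.move would place
# the new directory inside the old one instead of replacing it
shutil.rmtree(CORPUS_DIR, ignore_errors=True)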

# move unzipped directory to corpus directory path, i.e. rename it
shutil.move(TXT_DATA_DIR, CORPUS_DIR)

WindowsPath('C:/Users/Fuetterer/workspace/Textarbeit-mit-Python/data/corpus/grimm')
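The two-layer unpacking can be tried out without a network connection. The following is a minimal sketch that builds a toy nested archive in memory and unpacks it the same way. The helper names `build_nested_zip` and `extract_nested` and the file names `tale_a.txt`, `tale_b.txt`, and `inner.zip` are made up for illustration and are not part of the corpus.

```python
import io
import pathlib
import tempfile
import zipfile


def build_nested_zip() -> bytes:
    """Build a toy outer zip that contains an inner zip with two text files."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as zf:
        zf.writestr("txt/tale_a.txt", "Es war einmal")
        zf.writestr("txt/tale_b.txt", "Vor Zeiten")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as zf:
        zf.writestr("txt/inner.zip", inner.getvalue())
    return outer.getvalue()


def extract_nested(outer_bytes: bytes, target: pathlib.Path) -> list[str]:
    """Unpack the outer zip, then every zip found inside it, into target."""
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as zf:
        zf.extractall(target)
    # sorted() materializes the list before we start deleting files
    for inner_path in sorted(target.rglob("*.zip")):
        with zipfile.ZipFile(inner_path) as zf:
            zf.extractall(target)
        inner_path.unlink()  # the inner archive is no longer needed
    return sorted(p.name for p in target.rglob("*.txt"))


with tempfile.TemporaryDirectory() as tmp:
    names = extract_nested(build_nested_zip(), pathlib.Path(tmp))
    print(names)
```

Unlike the cell above, this sketch searches for inner archives with `rglob` instead of hard-coding their path, which is a design choice of the sketch and not something the corpus download requires.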

In [5]:
# Read all files using "utf-8-sig" and save them using "utf-8", removing BOM

for textfile in CORPUS_DIR.glob("*.txt"):
    textfile.write_text(textfile.read_text(encoding="utf-8-sig"), encoding="utf-8")
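To see what this re-encoding step does, here is a small sketch on an in-memory byte string (the sample sentence is made up for illustration). Decoding with `utf-8` keeps the byte order mark as the character U+FEFF, whereas `utf-8-sig` strips it, so the text written back with `utf-8` no longer starts with a BOM that could otherwise leak into the first token.

```python
import codecs

# a UTF-8 encoded text that starts with a byte order mark (BOM)
raw = codecs.BOM_UTF8 + "Es war eine Köchin".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped while decoding

print(repr(with_bom[0]))
print(without_bom == "Es war eine Köchin")

# re-encoding with plain utf-8 yields bytes without a BOM
clean = without_bom.encode("utf-8")
```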

In [6]:
from nltk.corpus import PlaintextCorpusReader

# raw string, so that \w and \. reach the regex engine unchanged
fairytale_corpus_reader = PlaintextCorpusReader(str(CORPUS_DIR), r"[\w+-]*\.txt")
fairytale_corpus_reader.fileids()[:10]

['grimm_Das_kluge_Grethel_395-397.txt',
 'grimm_allerleirauh_353-359.txt',
 'grimm_aschenputtel_119-126.txt',
 'grimm_bruder_lustig_402-413.txt',
 'grimm_bruederchen_und_schwesterchen__057-064.txt',
 'grimm_das_blaue_licht_150-154.txt',
 'grimm_das_buerle_335-340.txt',
 'grimm_das_buerle_im_himmel_331-331.txt',
 'grimm_das_dietmarsische_luegenmaerchen_294-294.txt',
 'grimm_das_eigensinnige_kind_155-155.txt']
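The second argument of `PlaintextCorpusReader` is a regular expression that selects file names. The sketch below checks the pattern against a few candidate names with `re.fullmatch`, assuming that the pattern has to match the whole file name. The last candidate is a hypothetical name added only to show a non-match.

```python
import re

# the file name pattern from the cell above, written as a raw string
FILE_PATTERN = r"[\w+-]*\.txt"

candidates = [
    "grimm_aschenputtel_119-126.txt",
    "grimm_Das_kluge_Grethel_395-397.txt",
    "txt_1-0.zip",             # wrong extension
    "notes about grimm.txt",   # hypothetical name; spaces are not in [\w+-]
]

matching = [name for name in candidates if re.fullmatch(FILE_PATTERN, name)]
print(matching)
```

Only the two `grimm_` file names match, because the character class allows word characters, `+`, and `-`, but no spaces, and the name must end in `.txt`.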

In [7]:
fairytale_corpus_reader.words()[:10]

['77', '.', 'Das', 'kluge', 'Grethel', '.', 'Es', 'war', 'eine', 'Köchin']
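The output shows that numbers and punctuation become tokens of their own. To our knowledge, the reader's default word tokenizer splits text into runs of word characters and runs of punctuation. The sketch below approximates this with a plain regular expression; the sample string is reconstructed from the tokens above and is an assumption about how the file begins.

```python
import re

# runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation)
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sample = "77. Das kluge Grethel. Es war eine Köchin"
tokens = WORD_PUNCT.findall(sample)
print(tokens)
```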

In [8]:
import session_info

session_info.show()