# Introduction to the corpus
This notebook introduces some basic operations you can perform on the corpus with the Natural Language Toolkit (NLTK), along with the steps needed to prepare the texts.

In [1]:
# Requirements
import re
import os
import nltk

Generate a list of filenames from the data folder.

In [2]:
file_list = ['data/' + file for file in os.listdir('data/') if file.endswith('.txt')]

Combine all available files into a single string. This simplifies processing and makes it possible to explore the corpus as a whole. For comparative studies, apply the following procedures to each file separately.

In [3]:
corpus = ''
for text in file_list:
    with open(text,'r',encoding='utf-8') as file:
        corpus += file.read()
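
For comparative work it can help to keep the texts apart instead of concatenating them. The following is a minimal sketch, not part of the original notebook: the helper name 'read_texts' and the throwaway demo files are assumptions made for illustration.

```python
import os
import tempfile

def read_texts(folder):
    """Return a dict mapping each .txt filename in `folder` to its contents."""
    texts = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            with open(os.path.join(folder, name), 'r', encoding='utf-8') as file:
                texts[name] = file.read()
    return texts

# Demonstration on a temporary folder with made-up files (hypothetical data)
with tempfile.TemporaryDirectory() as folder:
    for name, content in [('a.txt', 'gud oc mand'), ('b.txt', 'oc aand'), ('notes.md', 'skip me')]:
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as file:
            file.write(content)
    texts = read_texts(folder)

print(list(texts))
```

Each value in the dictionary can then be cleaned and counted on its own, so results stay attributable to a single file.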

## Cleaning the data

Import the 'punctuation' variable from the 'string' library and use it as the basis for removing punctuation. Remove punctuation last, since some punctuation marks serve a specific function in the text.

In [4]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Clean the corpus using regular expressions. Pay attention to '=', which marks words hyphenated across line breaks: these words must be rejoined before the punctuation is stripped.

In [5]:
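corpus_clean = corpus
corpus_clean = re.sub('=\n','',corpus_clean)  # rejoin words hyphenated with '=' at line breaks
corpus_clean = re.sub('\n',' ',corpus_clean)  # replace remaining line breaks with spaces
corpus_clean = re.sub('[' + re.escape(punctuation) + ']','',corpus_clean)  # strip punctuation
corpus_clean = corpus_clean.lower()  # normalise case
corpus_clean = re.sub(' +',' ',corpus_clean)  # collapse runs of spaces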
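
To see why the order of the steps matters, the same pipeline can be wrapped in a function and tried on a short string. This is a sketch: the function name 'clean_text' and the sample sentence are made up for illustration.

```python
import re
from string import punctuation

def clean_text(text):
    """Apply the cleaning steps in order: dehyphenate, flatten, strip punctuation, lowercase."""
    text = re.sub('=\n', '', text)   # rejoin words split with '=' at a line break
    text = re.sub('\n', ' ', text)   # remaining line breaks become spaces
    text = re.sub('[' + re.escape(punctuation) + ']', '', text)  # drop punctuation
    text = text.lower()
    text = re.sub(' +', ' ', text)   # collapse repeated spaces
    return text

# Made-up sample with a word hyphenated across a line break
sample = 'Gud er retfær=\ndig, oc hand\nsiger:  Amen.'
print(clean_text(sample))
```

If punctuation were stripped first, the '=' would disappear and the two halves of the hyphenated word would end up as separate tokens.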

## NLTK

In [6]:
corpus_freq = nltk.FreqDist(corpus_clean.split())

In [7]:
corpus_freq.most_common(20)

[('oc', 116625),
 ('at', 55665),
 ('som', 46747),
 ('det', 39613),
 ('til', 33813),
 ('er', 32954),
 ('hand', 32275),
 ('de', 31228),
 ('i', 31021),
 ('der', 30635),
 ('icke', 29471),
 ('den', 29092),
 ('aff', 21477),
 ('saa', 21422),
 ('paa', 20529),
 ('gud', 16835),
 ('en', 16757),
 ('for', 15682),
 ('da', 13992),
 ('men', 13888)]
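
Raw counts depend on the size of the corpus, so relative frequencies are often easier to compare. 'nltk.FreqDist' is a subclass of 'collections.Counter', so this sketch uses a plain 'Counter' on a made-up token list to stay self-contained; the same arithmetic applies to 'corpus_freq'.

```python
from collections import Counter

# Made-up tokens standing in for corpus_clean.split()
tokens = 'oc gud sagde oc det er oc'.split()
freq = Counter(tokens)
total = sum(freq.values())  # total number of tokens

# Relative frequency of each of the most common words
for word, count in freq.most_common(2):
    print(word, count, round(count / total, 3))
```

On a 'FreqDist', the methods 'N()' and 'freq(word)' return the total token count and the relative frequency directly.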

A raw frequency list is often not very informative, because the most frequent words are function words (such as 'oc', 'at' and 'som' above) that carry little meaning on their own. We therefore remove them using a stopword list.

In [8]:
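with open('stopwords/stopwords.txt','r',encoding='utf-8') as file:
    # Assumes whitespace-separated stopwords (e.g. one per line). A set matches whole words only;
    # testing against the raw string would also drop any word that is a substring of the file.
    stopwords = set(file.read().split())
corpus_no_stops = [word for word in corpus_clean.split() if word not in stopwords]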
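
One detail worth checking when filtering: 'in' applied to a string tests for substrings, while 'in' applied to a set tests for whole elements. A minimal sketch with a made-up stopword list (the contents are hypothetical, not the actual stopword file):

```python
# Made-up stopword file contents, one word per line
stopword_text = 'oc\nat\nsom\nhand\n'
words = ['oc', 'an', 'gud', 'hand']

# Membership against the raw string: 'an' is dropped because it occurs inside 'hand'
kept_by_string = [w for w in words if w not in stopword_text]

# Membership against a set of whole words: only exact stopwords are dropped
stopword_set = set(stopword_text.split())
kept_by_set = [w for w in words if w not in stopword_set]

print(kept_by_string)
print(kept_by_set)
```

A set also makes each lookup fast, which matters when filtering a corpus with hundreds of thousands of tokens.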

In [9]:
corpus_no_stops_freq = nltk.FreqDist(corpus_no_stops)

In [10]:
corpus_no_stops_freq.most_common(20)

[('gud', 16835),
 ('guds', 12965),
 ('oss', 11331),
 ('siger', 10293),
 ('alle', 8178),
 ('christus', 7194),
 ('ord', 6783),
 ('see', 5962),
 ('herre', 5534),
 ('mand', 5398),
 ('giøre', 5376),
 ('jesus', 4846),
 ('christi', 4525),
 ('lige', 4462),
 ('att', 4445),
 ('jesu', 4306),
 ('aand', 4035),
 ('herren', 3923),
 ('sagde', 3889),
 ('hellige', 3619)]

One way of exploring the corpus is the concordance function, which lets you examine one or more words in their context.
Begin by building an NLTK Text object. Here we use the cleaned text with the stopwords still included, which makes the context easier to understand.

In [11]:
corpus_nltk = nltk.Text(corpus_clean.split())
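
To make concrete what a concordance shows, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name 'kwic' is hypothetical, and it measures context in tokens, whereas NLTK's concordance width is given in characters.

```python
def kwic(tokens, keyword, window=3):
    """Return each occurrence of `keyword` with up to `window` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f'{left} [{token}] {right}')
    return lines

# Made-up token list in the style of the cleaned corpus
sample_tokens = 'nu oc udi all evighed amen oc all verdsens ære'.split()
for line in kwic(sample_tokens, 'all'):
    print(line)
```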

In [12]:
corpus_nltk.concordance('all',120)

Displaying 25 of 1884 matches:
onis huus oc hoff da haffuer hand godgen oc gierne forlat all verdsens ære høyhed oc h hed som hand haffde som den der 
e gamle surdey forlad heller med mose ægypil herlighed oc all verdsens ære end du skulle der ved hindris udi din guds t
d vil med oplæste ord stelle eder for øyen hu til gud aff all evighed hafuer eder kaldet oc udvald s siger s povel til 
tus gaff sig selff for paa det hand skulde forløse os fra all wretvijszhed rense sig selff it folck til eyedom som skul
ra synden skulle leffue i retfærdig du da som leffuer udi all wgudelighed oc wretfærhed med huad samvittighed kand du s
ynd tilregne j skulle høre med glæde see det glam som bar all verdsens synder j skulle finde sæm ært det s hans lærer e
evig ære oc loff med faderen oc den hellig aand nu oc udi all evighed amen dette hellige evangelium beskriffuer mnartus
t vaar at christus skulde opstaa fra de døde da vijder at all den christelige religion vaar intet udi nogen maade nysti
tor den m