# Introduction to the corpus
The following notebook introduces basic operations you can perform on the corpus using the Natural Language Toolkit (NLTK), as well as the steps for preparing the texts.

In [33]:
# Requirements
import re
import os
import nltk

Generate a list of filenames from the data folder.

In [16]:
file_list = ['data/'+file for file in os.listdir('data/') if file.endswith('.txt')]

Combine all available files into a single variable. This simplifies the process and makes it possible to explore the entire corpus at once. For comparative studies, perform the following procedures on each file in the corpus separately.

In [19]:
corpus = ''
for text in file_list:
    with open(text,'r',encoding='utf-8') as file:
        corpus += file.read()
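
For comparative studies, the same reading step can keep each file separate instead of concatenating everything. The following is a minimal sketch, not part of the original notebook: the helper name `read_texts` and the dictionary layout are assumptions made for illustration.

```python
import os

def read_texts(folder):
    """Return a dict mapping each .txt filename in `folder` to its text."""
    texts = {}
    # sorted() gives a reproducible order; os.listdir() order is arbitrary
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            path = os.path.join(folder, name)
            with open(path, 'r', encoding='utf-8') as file:
                texts[name] = file.read()
    return texts
```

Calling `read_texts('data/')` would then give one entry per file, and each value can be cleaned and counted on its own.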

## Cleaning the data

Import the 'punctuation' variable from the 'string' library and use it as the basis for removing punctuation. Remove punctuation last, since some punctuation marks have specific functions that need to be handled first.

In [52]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Clean the corpus using regular expressions. Pay attention to the '=', which is used for hyphenating words at line breaks.

In [37]:
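corpus_clean = corpus
# Rejoin words that were hyphenated with '=' at a line break
corpus_clean = re.sub('=\n','',corpus_clean)
# Replace the remaining line breaks with spaces
corpus_clean = re.sub('\n',' ',corpus_clean)
# Remove punctuation (the characters listed in string.punctuation)
corpus_clean = re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]','',corpus_clean)
corpus_clean = corpus_clean.lower()
# Collapse runs of spaces into a single space
corpus_clean = re.sub(' +',' ',corpus_clean)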
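
The cleaning steps can also be wrapped in a function, which is convenient when the same procedure has to be applied to each file. This is a sketch under assumptions: the name `clean_text` is hypothetical, and the character class is built from `string.punctuation` with `re.escape` rather than typed out by hand. The sample string is made up.

```python
import re
from string import punctuation

# Build the character class from string.punctuation instead of typing it out
PUNCT_PATTERN = re.compile('[' + re.escape(punctuation) + ']')

def clean_text(text):
    """Apply the cleaning steps to one text and return the result."""
    text = text.replace('=\n', '')       # rejoin words hyphenated with '='
    text = text.replace('\n', ' ')       # remaining line breaks become spaces
    text = PUNCT_PATTERN.sub('', text)   # remove punctuation
    text = text.lower()
    return re.sub(' +', ' ', text)       # collapse repeated spaces

print(clean_text('Her=\nlighed, oc\nÆre!'))   # herlighed oc ære
```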

## NLTK

In [41]:
corpus_freq = nltk.FreqDist(corpus_clean.split())

In [53]:
corpus_freq.most_common(20)

[('oc', 47970),
 ('at', 24031),
 ('som', 20653),
 ('det', 15660),
 ('hand', 15432),
 ('til', 15395),
 ('er', 14988),
 ('den', 13811),
 ('icke', 12824),
 ('de', 12553),
 ('i', 12443),
 ('der', 10374),
 ('paa', 10289),
 ('aff', 9074),
 ('jeg', 7796),
 ('saa', 7560),
 ('for', 7138),
 ('en', 6734),
 ('udi', 6682),
 ('gud', 6661)]
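
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be tried out without NLTK. A minimal sketch on a made-up token list (the sample tokens are invented for illustration, not taken from the corpus):

```python
from collections import Counter

tokens = 'gud oc mand oc gud oc ord'.split()
freq = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts
top_two = freq.most_common(2)
print(top_two)   # [('oc', 3), ('gud', 2)]

# Relative frequency: the share of all tokens taken up by one word
share_oc = freq['oc'] / sum(freq.values())
print(share_oc)
```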

Raw frequency lists like this one are often not very informative, since the most frequent words are function words that mostly add noise. We therefore remove these words using a stopword list.

In [59]:
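with open('stopwords/stopwords.txt','r',encoding='utf-8') as file:
    # Split into a set of whole words; testing against the raw string would match substrings
    # (assumes the stopwords in the file are separated by whitespace)
    stopwords = set(file.read().split())
corpus_no_stops = [word for word in corpus_clean.split() if word not in stopwords]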
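
Whether `word not in stopwords` compares whole words depends on the type of `stopwords`: on a string, `in` is a substring test, while on a set or list it matches whole entries. A small sketch with made-up stopwords (not the contents of `stopwords/stopwords.txt`):

```python
raw = 'hand\nicke\nsaa\n'      # stopword file content as one string
as_set = set(raw.split())      # {'hand', 'icke', 'saa'}

words = ['hand', 'and', 'aand', 'gud']

# Substring test: 'and' is also dropped because it occurs inside 'hand'
kept_substring = [w for w in words if w not in raw]
print(kept_substring)   # ['aand', 'gud']

# Whole-word test: only the exact stopword 'hand' is dropped
kept_whole = [w for w in words if w not in as_set]
print(kept_whole)       # ['and', 'aand', 'gud']
```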

In [60]:
corpus_no_stops_freq = nltk.FreqDist(corpus_no_stops)

In [61]:
corpus_no_stops_freq.most_common(20)

[('gud', 6661),
 ('guds', 6020),
 ('siger', 4848),
 ('alle', 3446),
 ('ord', 3166),
 ('jesus', 3036),
 ('christus', 3025),
 ('jesu', 2886),
 ('see', 2576),
 ('christi', 2442),
 ('herre', 2325),
 ('herren', 2282),
 ('aand', 2100),
 ('sagde', 2010),
 ('1', 1892),
 ('jesum', 1804),
 ('giøre', 1787),
 ('sige', 1768),
 ('saaledis', 1654),
 ('mand', 1630)]

One way of exploring the corpus is the concordance function, which lets you examine one or more words in their context.
Begin by building an NLTK text object. Here we use the cleaned text with the stopwords still included, which makes the context easier to read.

In [49]:
corpus_nltk = nltk.Text(corpus_clean.split())

In [51]:
corpus_nltk.concordance('all',120)

Displaying 25 of 871 matches:
 haffuer hand godgen oc gierne forlat all verdsens ære høyhed oc h hed som hand
d heller med mose ægypil herlighed oc all verdsens ære end du skulle der ved hi
d stelle eder for øyen hu til gud aff all evighed hafuer eder kaldet oc udvald 
or paa det hand skulde forløse os fra all wretvijszhed rense sig selff it folck
fue i retfærdig du da som leffuer udi all wgudelighed oc wretfærhed med huad sa
e høre med glæde see det glam som bar all verdsens synder j skulle finde sæm vi
 faderen oc den hellig aand nu oc udi all evighed amen dette hellige evangelium
kulde opstaa fra de døde da vijder at all den christelige religion vaar intet u
c paa det j dis bedre kunde eracte at all denne quindernis bekymring er idel ki
ng er idel kiødelig bekymring oc uden all grund da hører huad her staae udi vor
uds oc jesu aabenbaring forstandet at all saadan fryct oc fare vaar idel forfen
ke skulle finde jesum udi graffuen oc all deris wmage oc bekostning udi saa maa
 10 legem 
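
The idea behind a concordance (keyword in context) can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation; the function name `concordance_lines` and its parameters are hypothetical.

```python
def concordance_lines(tokens, word, width=40, max_lines=25):
    """Return up to `max_lines` keyword-in-context strings for `word`."""
    # Characters of context shown on each side of the keyword
    half = max(1, (width - len(word) - 2) // 2)
    lines = []
    for i, token in enumerate(tokens):
        if token != word:
            continue
        # `half` tokens are always enough to fill `half` characters
        left = ' '.join(tokens[max(0, i - half):i])[-half:]
        right = ' '.join(tokens[i + 1:i + 1 + half])[:half]
        # Right-align the left context so the keyword lines up in a column
        lines.append(f'{left:>{half}} {word} {right:<{half}}')
        if len(lines) == max_lines:
            break
    return lines

sample = 'nu oc udi all evighed amen'.split()
for line in concordance_lines(sample, 'all', width=23):
    print(line)
```

Each returned line has the same length, with the keyword in the same column, which is what makes the contexts easy to scan.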