# Introduction

This project is part of Codecademy's course [*Apply Natural Language Processing with Python*](https://www.codecademy.com/learn/paths/natural-language-processing).

Main aim of the project is to apply a simple text summarization technique using Python in order to extract the summary of a long text. The text used for this project is my Master's Thesis in Digital Humanities for the University of Groningen.

The length of the text is 47 pages, from which are exluded the *References* section and the technical information of the thesis (for instance, the title, the author, student number, etc.). As it is shown below, the overall length of the text used for this project is **16236** tokens.

In [88]:
import numpy as np
import pandas as pd
import re
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [89]:
with open("text/thesis_final_version.txt", encoding = "utf-8") as inp:
    text = inp.read()

In [90]:
stopwords = set(stopwords.words("english"))
tokens = word_tokenize(text)
print(f"The overall length of the text used is: {len(tokens)} tokens")

The overall length of the text used is: 16236 tokens


# Analysis

In [37]:
# calculate how often a word occurs in a text
freqTable = dict()

for word in tokens:
    words = word.lower()
    if word in stopwords:
        continue
# continue in order to skip stopwords in the counting
#if the word is in the freqTable then add one more
    if word in freqTable:
        freqTable[word] += 1
# if it is not in the freqTable, then initiate the counting from 0
    else:
        freqTable[word] = 1
# print(freqTable)

In [38]:
# create a dictionary that will sum the frequency of each word
# with respect to its sentence
# this will give to every sentence, based on the words that it contains
# a different weight, which will be used to summarize the text
# with the most frequent/important words-sentences

sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
                
# print(sentenceValue)

In [45]:
# sort the frequencies of the sentences in an descending order

sort_sentenceValue = sorted(sentenceValue.items(), key=lambda x: x[1], reverse=True)
for i in sort_sentenceValue:
    print(i[0], i[1])

This can be generalized for all authorship attribution cases, because when we deal with imitators of the main author’s style of writing (in our case Seneca the Younger) and the possible candidate(s) is not included in the distractors’ corpus, then it is highly possible for the model to attribute the disputed text to the main author; especially when it reports such a high confidence for the prediction.46
Looking at the top 100-1500 MFC 4-grams and 5-grams in a BCT (see Figure 2), we can observe that O and Phoenissae behave as outliers within the Senecan corpus, even though they are clustered with the genuine Senecan plays. 2944
One potential explanation for this could be because Burrow’s Delta was firstly created by Burrow for English texts.30 In addition to that, Kestemont (2014, p. 63) reports that Delta method is biased against words, which might create some problems with character n-grams because, as it was proved before, highly inflected and agglutinative languages contain some of 

In [46]:
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
print(sumValues)

773816


In [48]:
average = sumValues / len(sentenceValue)
print(average)

1700.6945054945054


In [78]:
# only sentences with a value higher than the average will be included to the summary

summary = ""
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > average):
        summary += " " + "\n" + sentence
with open("text/thesis_summary.txt", "w") as infile:
    infile.write(summary)