# Visualize output of scripts knownWords.py and hapaxCount.py

Task 1: create a bar plot of the percentage of words in each of the newspaper articles that appear in one of the two Dutch lexicons of the project IMPACT (Dutch words and Dutch names). Ideally this percentage is 100%, but when the articles contain OCR errors, the percentage will be lower. Lower scores can also be caused by unknown words and names. Results are displayed in bins 10% wide (x-axis). The y-axis contains absolute article counts. Words in the text were compared with words in the lexicon case-insensitively.

Data: one day of newspaper articles of the newspaper De Volkskrant (2 Jan 1965) obtained from delpher.nl
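
The script knownWords.py that produces these percentages is not part of this notebook. As a rough illustration of the measure, the sketch below computes a case-insensitive known-word percentage for a single article. The function name `knownWordPercentage`, the `\w+` tokenization, the handling of empty articles and the tiny example lexicon are all assumptions made for this example; the project's actual script may differ.

```python
import re

def knownWordPercentage(text, lexicon):
    # assumption: a token is any run of word characters; the real tokenizer may differ
    tokens = re.findall(r"\w+", text.lower())
    # assumption: an article without tokens gets score 0.0
    if len(tokens) == 0: return(0.0)
    # lowercase the lexicon as well, so the comparison is case-insensitive
    lexiconLower = {word.lower() for word in lexicon}
    known = sum(1 for token in tokens if token in lexiconLower)
    return(100.0*known/len(tokens))

# hypothetical mini-lexicon and sentence, not taken from the IMPACT lexicons
exampleLexicon = {"de", "kat", "Amsterdam"}
examplePercentage = knownWordPercentage("De kat zat in Amsterdam", exampleLexicon)
print(examplePercentage)
```

In this example three of the five tokens ("de", "kat" and "amsterdam") are found in the lexicon, so the article would fall in the 60-69% bin.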

In [None]:
import re
import sys

LEXICONRESULTFILENAME = "/home/erikt/projects/teamproject2018-11/data/vk19650102/analysis3.txt"
LEXICONTITLE = "Known-word percentages per article (OCR)"

def readScoreFile(fileName):
    # read one score per line: the last whitespace-separated field, with an optional trailing "%"
    try: inFile = open(fileName,"r")
    except Exception as e: sys.exit("error: "+str(e))
    scores = []
    for line in inFile:
        fields = line.strip().split()
        if len(fields) == 0: continue # skip empty lines
        scoreString = re.sub(r"%$","",fields[-1])
        scores.append(float(scoreString))
    inFile.close()
    return(scores)

lexiconScores = readScoreFile(LEXICONRESULTFILENAME)

In [None]:
import matplotlib.pyplot as plt

COLORS = ["r","g","b","orange","purple","yellow"]
OUTPUTIMAGEFILE = "visualizeScores.png"
WIDTH = 7

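def aggregate(numbersIn):
    # count the scores per bucket of 10 percentage points; keys are the lower bounds (0, 10, ..., 90)
    numbersOut = {}
    for number in numbersIn:
        roundedDown = 10*int(number/10)
        # a score of exactly 100% belongs to the top bucket (90-100%)
        if roundedDown == 100: roundedDown = 90
        if roundedDown not in numbersOut: numbersOut[roundedDown] = 0
        numbersOut[roundedDown] += 1
    return(numbersOut)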

def makeLabels(buckets):
    labels = []
    for bucket in buckets:
        if bucket == 90: labels.append(str(bucket)+"-"+str(bucket+10)+"%")
        else: labels.append(str(bucket)+"-"+str(bucket+9)+"%")
    return(labels)

def makeBarPlot(data,title):
    # sort the buckets so that bars and colors follow a fixed left-to-right order
    buckets = sorted(data.keys())
    labels = makeLabels(buckets)
    plt.figure(figsize=(15,8))
    plt.rcParams.update({'font.size': 18})
    plt.bar(buckets,[data[bucket] for bucket in buckets],color=COLORS,width=WIDTH)
    plt.title(title)
    plt.xticks(buckets,labels)
    plt.ylabel("number of articles")
    plt.savefig(OUTPUTIMAGEFILE)
    plt.show()

lexiconAggregates = aggregate(lexiconScores)
makeBarPlot(lexiconAggregates,LEXICONTITLE)

Task 2: create a bar plot of the percentage of unique words (hapaxes) in each of the newspaper articles. Ideally this percentage is 0%, but when the articles contain OCR errors, the percentage will be higher. Higher scores can also be caused by low-frequency words and names which appear in just one article. To determine which words were unique, all words in the data were considered. Results are displayed in bins 10% wide (x-axis). The y-axis contains absolute article counts. Words in the text were compared case-insensitively.

Data: one day of newspaper articles of the newspaper De Volkskrant (2 Jan 1965) obtained from delpher.nl
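
The script hapaxCount.py is not part of this notebook either. The sketch below shows one way the measure could be computed: a token counts as unique when it occurs exactly once in all articles together, and the score of an article is the share of its tokens that are unique. The function name `hapaxPercentages`, the `\w+` tokenization and the exact definition of "unique" are assumptions for this example, not the project's confirmed method.

```python
import re
from collections import Counter

def hapaxPercentages(articles):
    # assumption: lowercase first, so the comparison is case-insensitive
    tokenized = [re.findall(r"\w+", article.lower()) for article in articles]
    # count every token over all articles together
    counts = Counter(token for tokens in tokenized for token in tokens)
    percentages = []
    for tokens in tokenized:
        # assumption: an article without tokens gets score 0.0
        if len(tokens) == 0:
            percentages.append(0.0)
            continue
        hapaxes = sum(1 for token in tokens if counts[token] == 1)
        percentages.append(100.0*hapaxes/len(tokens))
    return(percentages)

# hypothetical mini-corpus; "xqz" stands in for an OCR error
exampleArticles = ["De kat zit", "de hond zit", "xqz kat"]
exampleHapaxScores = hapaxPercentages(exampleArticles)
print(exampleHapaxScores)
```

Here "hond" and "xqz" are the only tokens that occur once in the whole mini-corpus, so the first article contains no unique words, while a third of the second article and half of the third article are unique.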

In [None]:
HAPAXRESULTFILENAME = "/home/erikt/projects/teamproject2018-11/data/vk19650102/analysis4.txt"
HAPAXTITLE = "Percentages of unique words (hapaxes) per article (OCR)"

hapaxScores = readScoreFile(HAPAXRESULTFILENAME)
hapaxAggregates = aggregate(hapaxScores)
makeBarPlot(hapaxAggregates,HAPAXTITLE)

Task 3: create a scatter plot of the results of the first two tasks. The scatter plot contains one dot per article, at the location defined by the percentage of known words (x-axis) and the percentage of unique words (y-axis).

Data: same as in tasks 1 and 2

In [None]:
SCATTERTITLE = "Known words percentage (horizontal) vs hapax percentage (vertical)"

def makeScatterPlot(dataX,dataY,title):
    # one dot per article: dataX and dataY must list the articles in the same order
    plt.figure(figsize=(15,8))
    plt.scatter(dataX,dataY)
    plt.title(title)
    plt.savefig(OUTPUTIMAGEFILE)
    plt.show()
    
makeScatterPlot(lexiconScores,hapaxScores,SCATTERTITLE)

Task 4: compute the (Pearson) correlation coefficient of the results of the first two tasks. The closer the number is to 1 or -1, the more strongly the two measures are linearly correlated; a value near 0 indicates no linear relation.

Data: same as in tasks 1 and 2
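
For reference, `np.corrcoef` returns the Pearson product-moment correlation coefficients. The sketch below computes the same quantity by hand for two short lists, to make explicit what the number measures. The function name `pearson` and the example values are hypothetical; they are not results from the newspaper data.

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the standard deviations
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    meanX = sum(xs)/len(xs)
    meanY = sum(ys)/len(ys)
    covariance = sum((x-meanX)*(y-meanY) for x,y in zip(xs,ys))
    varianceX = sum((x-meanX)**2 for x in xs)
    varianceY = sum((y-meanY)**2 for y in ys)
    # raises ZeroDivisionError when one of the lists is constant
    return(covariance/math.sqrt(varianceX*varianceY))

# made-up scores: articles with fewer known words have more unique words
exampleKnown = [95.0, 80.0, 60.0, 40.0]
exampleHapax = [2.0, 10.0, 25.0, 45.0]
exampleCorrelation = pearson(exampleKnown, exampleHapax)
print(round(exampleCorrelation, 3))
```

With these made-up values the coefficient is strongly negative, which is the pattern one would expect if OCR errors lower the known-word percentage and raise the hapax percentage at the same time.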


In [None]:
import numpy as np

# round to three decimals (the earlier int(1000*x)/1000 truncated toward zero instead)
print("correlation coefficient: "+str(round(np.corrcoef(lexiconScores,hapaxScores)[0][1],3)))