## A function that calculates the error between expected word frequency and article word frequency using the KL Divergence Method


In [5]:
import pandas as pd
import numpy 
import wordfreq


In [6]:
class Analyser:

    # Assumes `words` is a loaded dict of the form {"total": <total word count>, "list": {<word>: <count>, ...}}
    @staticmethod
    def calculate_probablity_values(words):
        count = words["total"]
        words_df = pd.Series(words["list"]).to_frame("count")
        words_df = words_df.sort_values(by='count',ascending=False)
        words_df['count'] /= count
        words_df.columns = ["Article Frequency"]
        return words_df
        
    @staticmethod
    def top_n_words_probability(n, language):
        all_words = wordfreq.top_n_list(language, n)
        probability_values = [wordfreq.word_frequency(w, language) for w in all_words]
        lang_df = pd.DataFrame(data={"Language Frequency": probability_values}, index=pd.Index(all_words, name='word'))
        return lang_df
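To make the expected input concrete, here is a minimal sketch of the normalisation performed by `calculate_probablity_values`, run on a hypothetical toy dictionary. The words and counts are invented for illustration and are not taken from the experiments.

```python
import pandas as pd

# Hypothetical toy input following the {"total": ..., "list": {...}} structure.
words = {"total": 10, "list": {"the": 5, "cat": 3, "sat": 2}}

# Same steps as calculate_probablity_values: build a frame of counts,
# sort by count (descending), then divide by the total word count.
counts = pd.Series(words["list"]).to_frame("count")
counts = counts.sort_values(by="count", ascending=False)
freqs = (counts["count"] / words["total"]).to_frame("Article Frequency")

print(freqs)
```

The resulting column sums to 1 only when `"total"` equals the sum of the listed counts, which this toy input is built to satisfy.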


In [7]:
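def lang_confidence_score(word_counts, language_words_with_frequency):
    # Relative frequency of each word in the data (article) being scored
    words_df = Analyser.calculate_probablity_values(word_counts)
    words_df.columns = ["Data Frequency"]
    # Right join: keep only the words of the reference language
    combined_df = words_df.join(language_words_with_frequency, how='right')
    # Reference words absent from the data get a tiny probability so the log stays finite
    combined_df["Data Frequency"] = combined_df["Data Frequency"].fillna(1e-10)
    # Renormalise both columns so that each sums to 1 over the shared vocabulary
    combined_df["Data Frequency"] = combined_df["Data Frequency"] / combined_df["Data Frequency"].sum()
    combined_df["Language Frequency"] = combined_df["Language Frequency"] / combined_df["Language Frequency"].sum()

    # KL divergence D(Language || Data) = sum(p * log(p / q)); lower means a closer match
    score = numpy.sum(combined_df["Language Frequency"] * numpy.log(
        combined_df["Language Frequency"] /
        combined_df["Data Frequency"]))

    return score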
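
The score above is the Kullback-Leibler divergence D(P || Q) = Σ pᵢ · log(pᵢ / qᵢ), with P the reference language distribution and Q the observed data distribution. The following standalone sketch illustrates its behaviour on small hypothetical distributions. The helper name `kl_divergence`, the `eps` parameter, and the `p > 0` mask are assumptions made for this illustration and are not part of the notebook.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), with zeros in q replaced by eps."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    q = np.where(q > 0, q, eps)   # plays the same role as fillna(1e-10) above
    p = p / p.sum()               # renormalise both distributions
    q = q / q.sum()
    mask = p > 0                  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

reference = [0.5, 0.3, 0.2]       # hypothetical language frequencies

same = kl_divergence(reference, reference)              # identical distributions
close = kl_divergence(reference, [0.45, 0.35, 0.20])    # small deviation
missing = kl_divergence(reference, [0.6, 0.4, 0.0])     # one reference word never observed

print(same, close, missing)
```

Identical distributions give a score of zero, and a reference word that never occurs in the data dominates the score because its term is weighted by log(pᵢ / 1e-10).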



# Experimental Results

The choice of language had a measurable but not overwhelming impact on the outcome. Predictably, English had the lowest average error score, followed by German, with French the least similar to its expected frequencies. All languages showed significantly better frequency prediction for literary works than for wiki articles. Interestingly, the choice of wiki (a Pokémon wiki for German and French, and xkcd for English) had very little impact on prediction rates. It is postulated that as long as the articles cluster around a single range of topics, such as science, Japanese culture, or sports, the results will be quite similar for every type of content, with slight variation according to the amount of internet data on that topic.

To test performance on literary works, "Uncle Tom's Cabin", "Les Misérables", and "Also sprach Zarathustra" were chosen for their similar time periods and rich vocabulary. Interestingly, the chosen metric (KL divergence) showed increasing error scores as the number of matched words increased, reaching a maximum after about 1000 words; even then, it was still possible to tell which language the article was written in. The score differences were maximised for K = 100. All six charts, with details of the articles used, are in the /charts folder.
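
As a rough illustration of how the lowest score identifies the language, the sketch below scores one article against several reference tables and picks the minimum. Everything in it is hypothetical: the function `toy_score`, the three-word frequency tables, and the article counts are invented for the example and do not reproduce the experimental data or the `wordfreq` values.

```python
import numpy as np
import pandas as pd

def toy_score(article_counts, language_freqs, eps=1e-10):
    """KL divergence of the language table from the article, over the language's words."""
    lang = pd.Series(language_freqs, dtype=float)
    # Keep only the reference words; words missing from the article become NaN.
    data = pd.Series(article_counts, dtype=float).reindex(lang.index)
    data = data.fillna(eps)       # same smoothing idea as in lang_confidence_score
    data = data / data.sum()
    lang = lang / lang.sum()
    return float(np.sum(lang * np.log(lang / data)))

# Invented reference tables (not real frequency values).
reference_tables = {
    "en": {"the": 0.05, "of": 0.03, "and": 0.03},
    "de": {"der": 0.04, "die": 0.04, "und": 0.03},
    "fr": {"de": 0.05, "la": 0.03, "et": 0.03},
}

# Invented word counts for an English-looking article.
article_counts = {"the": 12, "and": 7, "of": 6, "die": 1, "la": 1}

scores = {name: toy_score(article_counts, table)
          for name, table in reference_tables.items()}
best = min(scores, key=scores.get)   # lowest divergence = closest language

print(scores, best)
```

One caveat of this scheme: if an article shares no words at all with a reference table, the smoothed column renormalises to a uniform distribution, which can produce a misleadingly low score. The toy article therefore contains at least one word from each table.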