# Keyword Extraction

The simple count-based method for extracting sublanguage-specific vocabulary only supports exploratory analysis. It provides no objective measure of how specific a word is to a sublanguage corpus.
To address this problem, we can use either log-likelihood or tf-idf to extract sublanguage-specific vocabulary.

### TF-IDF

To determine the difference between two or more sources, we have to assign a weight to
each word with regard to each text source. One possible measure is tf-idf, a weighting based on how exclusively a term is used in individual documents. The more documents a term is used in,
the less importance it receives under the tf-idf weight. This follows the intuition
that a term which appears in very many documents can’t be unique to a certain class or domain.
Following Wikipedia, the tf-idf value increases proportionally to the number of times a word
appears in the document, but is often offset by the frequency of the word in the corpus, which
helps to adjust for the fact that some words appear more frequently in general.[1]

For the normalized term frequency $tf(t,D)$ there are various options (see the lecture, the videos in Moodle, or your own research).
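The weighting described above can be sketched on a toy matrix. This is a minimal illustration, not the variant from the lecture: the vocabulary, the counts, and the name `toy_tfidf` are made up, the term frequency is normalized by document length (one of several options), and every term is assumed to occur in at least one document.

```python
import numpy as np

# Toy document-term matrix: rows are documents (domains), columns are terms.
# Vocabulary and counts are invented for illustration.
toy_vocab = ["motor", "bank", "tor", "jahr"]
toy_dtm = np.array([
    [8, 0, 0, 5],   # "Automobil"
    [0, 6, 0, 5],   # "Wirtschaft"
    [1, 0, 9, 5],   # "Sport"
], dtype=float)

def toy_tfidf(dtm):
    """tf-idf with length-normalized tf and idf = log(N / df)."""
    tf = dtm / dtm.sum(axis=1, keepdims=True)   # relative frequency per document
    df = (dtm > 0).sum(axis=0)                  # number of documents containing the term
    idf = np.log(dtm.shape[0] / df)             # 0 for terms present in every document
    return tf * idf

toy_weights = toy_tfidf(toy_dtm)
toy_top = [toy_vocab[i] for i in toy_weights.argmax(axis=1)]
print(toy_top)  # ['motor', 'bank', 'tor']
```

The term `jahr` occurs in all three documents, so its idf and therefore its weight is 0 everywhere, regardless of how often it is used.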

### Log Likelihood 
Another way to measure the relative importance of terms is log-likelihood.
Here we compare the term counts in each domain with the term counts in a reference corpus in order to find significant differences.
The significance test used is called the “Log-Likelihood” Ratio Test (LL). The LL value measures how strongly the observed counts of a term in the target and the reference
corpus deviate from the counts we would expect if the term were equally likely in both.
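As a rough sketch of the test, assume target counts `a`, reference counts `b`, and corpus sizes `c` and `d`. The expected counts are then `e1 = c * (a + b) / (c + d)` and `e2 = d * (a + b) / (c + d)`, and `LL = 2 * (a * ln(a / e1) + b * ln(b / e2))`. The function name `toy_log_likelihood` and the counts below are invented for illustration.

```python
import numpy as np

def toy_log_likelihood(a, b):
    """Log-likelihood ratio per term for target counts a and reference counts b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c, d = a.sum(), b.sum()                 # corpus sizes in tokens
    e1 = c * (a + b) / (c + d)              # expected count in the target
    e2 = d * (a + b) / (c + d)              # expected count in the reference
    # Treat 0 * log(0) as 0: the log is only used where the count is positive.
    with np.errstate(divide="ignore", invalid="ignore"):
        term_a = np.where(a > 0, a * np.log(a / e1), 0.0)
        term_b = np.where(b > 0, b * np.log(b / e2), 0.0)
    return 2 * (term_a + term_b)

toy_terms = ["motor", "jahr", "und"]
toy_target = [50, 5, 45]        # 100 tokens in the target corpus
toy_reference = [10, 50, 940]   # 1000 tokens in the reference corpus
toy_ll = toy_log_likelihood(toy_target, toy_reference)
print(dict(zip(toy_terms, toy_ll.round(1))))
```

In this toy setup `motor` gets by far the highest value and `jahr`, which is used at the same rate in both corpora, gets 0. Note that `und` also gets a positive value although it is rarer in the target than expected: the LL value is never negative and flags under-use as well as over-use, so the direction has to be checked separately, for example by comparing `a` with `e1`.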


### Corpora

We provide text for the three domains `Automobil`, `Wirtschaft` and `Sport`.
If you need a reference corpus, go to the Wortschatz portal and download a sufficiently large sample; around 1 million sentences should suffice.

### Text Preprocessing

Be aware that text preprocessing has considerable influence on the outcome. Part of this exercise is
to deploy a reasonable preprocessing pipeline. Make use of your knowledge of the Zipf distribution and other text preprocessing techniques.
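One way to use the Zipf distribution here is to cut off both ends of the frequency ranking: the few most frequent types are mostly function words, and the long tail of rare types is mostly noise. The following is a minimal sketch with made-up tokens; the helper name `prune_vocab` and its default values are assumptions chosen for the toy data.

```python
from collections import Counter

def prune_vocab(tokens, cut_first=2, min_freq=2):
    """Drop the cut_first most frequent types and all types rarer than min_freq."""
    ranked = Counter(tokens).most_common()   # (type, count), most frequent first
    return [t for t, n in ranked[cut_first:] if n >= min_freq]

toy_tokens = ["der"] * 5 + ["die"] * 4 + ["motor"] * 3 + ["tor"] * 2 + ["selten"]
kept = prune_vocab(toy_tokens)
print(kept)  # ['motor', 'tor']
```

Suitable cut-offs depend on corpus size and have to be found by inspecting the ranking.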

To analyze differences, we need to build a single "document" for each domain. This
means that if there is more than one document per domain, we concatenate all texts belonging to that domain into a single text source.
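A minimal sketch of this merging step, assuming the input is a list of `(domain, text)` pairs. The helper name `merge_by_domain` and the example sentences are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: several texts per domain, in arbitrary order.
toy_docs = [
    ("Sport", "Der Trainer lobt die Mannschaft."),
    ("Automobil", "Der Motor hat vier Zylinder."),
    ("Sport", "Das Tor fiel in der letzten Minute."),
]

def merge_by_domain(docs):
    """Join all texts of one domain into a single string."""
    parts = defaultdict(list)
    for domain, text in docs:
        parts[domain].append(text)
    return {domain: " ".join(texts) for domain, texts in parts.items()}

merged = merge_by_domain(toy_docs)
print(merged["Sport"])
```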

### Also:

It makes sense to first introduce a function that transforms a collection of documents into a Document-Term-Matrix (DTM). For that, the numpy library's array class is worth a look. Also, data sizes may exceed memory, so it is important to consider data structures that accommodate this, e.g. sparse arrays.
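Such a function could look roughly like the following sketch, which builds a dense numpy array from already tokenized documents. The name `build_dtm` and the toy tokens are assumptions for illustration.

```python
from collections import Counter
import numpy as np

def build_dtm(token_lists):
    """Return (vocab, dtm) for a dict mapping document name -> list of tokens."""
    counts = {name: Counter(tokens) for name, tokens in token_lists.items()}
    vocab = sorted(set().union(*counts.values()))    # all types, alphabetical
    index = {term: j for j, term in enumerate(vocab)}
    dtm = np.zeros((len(counts), len(vocab)), dtype=np.int64)
    for i, counter in enumerate(counts.values()):    # one row per document
        for term, freq in counter.items():
            dtm[i, index[term]] = freq
    return vocab, dtm

dtm_tokens = {
    "auto": ["motor", "motor", "liter"],
    "sport": ["tor", "trainer", "tor", "tor"],
}
dtm_vocab, dtm_counts = build_dtm(dtm_tokens)
print(dtm_vocab)
print(dtm_counts)
```

A dense array is fine for a handful of domain-level documents. With many documents, most cells are 0, and the same `(row, column, count)` triples could instead be stored in a sparse structure such as the ones in `scipy.sparse`.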

**Hint 1** If you use numpy, be aware that numpy contains a lot of useful functions like logarithms or sorting.

**Hint 2** Beware of numerical traps like the undefined logarithm of 0.
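One possible way around the logarithm of 0 is to compute the logarithm only where the input is positive, using the `where` and `out` arguments that numpy ufuncs accept. The helper name `safe_log` is made up, and returning 0 for a zero input is an assumption that fits products of the form `x * log(x)`; other formulas may call for a different convention.

```python
import numpy as np

def safe_log(x):
    """Natural log that returns 0.0 where x <= 0 instead of -inf or nan."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    np.log(x, out=out, where=x > 0)   # positions with x <= 0 keep their initial 0.0
    return out

print(safe_log([0.0, 1.0, np.e]))
```

Adding a tiny constant such as `1e-25` before taking the logarithm also avoids the error, but it turns zero counts into large negative values, which has to be taken into account when ranking.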

In [1]:
import nltk
import numpy as np


In [2]:
import zipfile
content = {}
with zipfile.ZipFile("keyword.zip") as zfile:
    for f in zfile.namelist():
        if f != "keyword/":
            content[f] = zfile.read(f).decode("utf8")
content.keys()      

dict_keys(['keyword/automobil_50k.txt', 'keyword/wirtschaft_50k.txt', 'keyword/sport_50k.txt'])

In [3]:
path = "../news_de.zip"
def load_zip(path):
    import zipfile
    r = {}
    with zipfile.ZipFile(path, "r") as zfile:
        for file in zfile.namelist():
            with zfile.open(file, "r") as f:
                r[file.split("_")[2]] = " ".join([x.decode("utf8").split("\t")[1].lower() for x in f.readlines()])
    return r
reference = " ".join(list(load_zip(path).values())[:2])

In [4]:
reference[:100]

'um die in jeder hinsicht zufriedenzustellen, tüftelt er einen weg aus, sinnlose bürokratie wie laden'

In [5]:
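def preprocess(txt):
    # Tokenize raw strings; input that is already a list of tokens is passed through.
    txt = nltk.word_tokenize(txt) if isinstance(txt, str) else txt
    # Keep alphabetic tokens longer than one character, lowercase them and replace "ß".
    txt = [x.replace("ß","ss").lower() for x in txt if x.isalpha() and len(x) > 1]
    return txt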

In [6]:
class TFIDF:
    def create_dtm(self, texts, cut_first=200, min_freq=3):
        
        # Clean texts
        self.texts = {k:preprocess(v) for k,v in texts.items()}
        
        # Count texts
        self.vocab = nltk.FreqDist(sum(self.texts.values(),[]))
        
        # Prune overall vocabulary
        self.vocab = sorted(list(self.vocab.items()),key= lambda x: -x[1]) 
        self.vocab = [x[0] for x in self.vocab[cut_first:] if x[1] >=min_freq]
        
        # count and restrict domain level text to the vocabulary
        self.term_frequencies = {genre: nltk.FreqDist(text) for genre, text in self.texts.items()}
        self.dtm = np.array([[v.get(w,0) for w in self.vocab] for k,v in self.term_frequencies.items()])
    
    def tfidf(self):
        tf = np.log(self.dtm + 1e-25)
        #tf = 0.5 + 0.5 * self.dtm / self.dtm.max(-1, keepdims=True) # alternative normalization.
        idf = np.log(self.dtm.shape[0] /((self.dtm>0).sum(0) + 1e-25))
        return tf * idf

    def tfidf_keywords(self, n=10):
        """Return the n highest-weighted terms for each corpus."""
        tfidf = self.tfidf()
        return {name:[self.vocab[j] for j in tfidf[i].argsort(0)[::-1][:n]] for i,name in enumerate(self.texts.keys())}
    


In [7]:
da = TFIDF()
da.create_dtm(content)

In [8]:
da.tfidf_keywords(n=25)

{'keyword/automobil_50k.txt': ['hubraum',
  'serienmässig',
  'motorhaube',
  'kühlergrill',
  'lamborghini',
  'durchschnittsverbrauch',
  'mittelkonsole',
  'rückbank',
  'fahrleistungen',
  'gti',
  'concept',
  'heckklappe',
  'selbstzünder',
  'zweisitzer',
  'supersportwagen',
  'touran',
  'vierzylinder',
  'zylinder',
  'sechszylinder',
  'cdi',
  'hinterachse',
  'motorshow',
  'newtonmeter',
  'facelift',
  'cabrios'],
 'keyword/wirtschaft_50k.txt': ['dax',
  'gdl',
  'zentralbank',
  'commerzbank',
  'iwf',
  'mehdorn',
  'arbeitsmarkt',
  'karstadt',
  'arcandor',
  'microsoft',
  'aktienmarkt',
  'währungsfonds',
  'hre',
  'finanzmärkten',
  'verdi',
  'lehman',
  'postbank',
  'notenbank',
  'grossbank',
  'tecdax',
  'zumwinkel',
  'mdax',
  'staatsanleihen',
  'grossbanken',
  'bip'],
 'keyword/sport_50k.txt': ['tsv',
  'vfb',
  'hertha',
  'sg',
  'dfb',
  'borussia',
  'torhüter',
  'tabellenführer',
  'bsc',
  'fsv',
  'gaal',
  'hoffenheim',
  'durchgang',
  'nowit

In [9]:
class LogLike:
    def create_dtm(self, texts, cut_first=200, min_freq=3):

        # Clean texts
        self.texts = {k: preprocess(v) for k, v in texts.items()}

        # Count texts
        self.term_frequencies = {genre: nltk.FreqDist(text) for genre, text in self.texts.items()}
        self.vocab = sum(self.term_frequencies.values(), nltk.FreqDist())

        # Prune overall vocabulary
        self.vocab = sorted(list(self.vocab.items()), key=lambda x: -x[1])
        self.vocab = [x[0] for x in self.vocab[cut_first:] if x[1] >= min_freq]
        self.term_frequencies = {genre: {k:freq[k] for k in (set(freq.keys()) & set(self.vocab)) }  for genre, freq in self.term_frequencies.items()}


        # count and restrict domain level text to the vocabulary
        self.dtm = np.array([[v.get(w, 0) for w in self.vocab] for k, v in self.term_frequencies.items()])

    def log_likelihood(self, corpus, n=None, threshold=None):
        i = list(self.texts.keys()).index(corpus)

        a = self.dtm[i]  # term counts in the target corpus
        b = self.dtm[list(self.texts.keys()).index("reference")]  # term counts in the reference corpus


        c = a.sum()
        d = b.sum()

        e1 = c * (a + b) / (c + d) + 1e-25
        e2 = d * (a + b) / (c + d) + 1e-25

        ll = 2 * (a * np.log(a / e1 + 1e-25) + b * np.log(b / e2 + 1e-25))

        if threshold is not None:
            return [self.vocab[k.item()] for k in np.where(ll > threshold)[0]]

        if n is not None:
            return [self.vocab[k] for k in ll.argsort(0)[::-1][:n]]


    def ll_keywords(self, n=None, threshold=None):
        return {k: self.log_likelihood(k, n=n, threshold=threshold) for k in self.texts.keys() if k != "reference"}

In [10]:
content["reference"] = reference #Add the reference corpus
da = LogLike()
da.create_dtm(content,cut_first=200)

In [11]:
llk = da.ll_keywords(n=25)
llk

{'keyword/automobil_50k.txt': ['liter',
  'bmw',
  'audi',
  'vw',
  'wagen',
  'mercedes',
  'fahrzeuge',
  'dm',
  'porsche',
  'toyota',
  'hersteller',
  'diesel',
  'motor',
  'opel',
  'litern',
  'verbrauch',
  'modell',
  'modelle',
  'marke',
  'coupé',
  'ford',
  'fahrer',
  'fahren',
  'fahrzeug',
  'gestern'],
 'keyword/wirtschaft_50k.txt': ['dm',
  'bank',
  'dollar',
  'banken',
  'krise',
  'opel',
  'quartal',
  'verlinken',
  'konzern',
  'link',
  'gm',
  'kostenfrei',
  'ubs',
  'finanzkrise',
  'kunden',
  'porsche',
  'artikel',
  'verwenden',
  'mrd',
  'bahn',
  'unten',
  'franken',
  'stehenden',
  'möchten',
  'polizei'],
 'keyword/sport_50k.txt': ['trainer',
  'saison',
  'mannschaft',
  'dm',
  'spieler',
  'sieg',
  'sv',
  'team',
  'minute',
  'wm',
  'verlinken',
  'link',
  'tor',
  'league',
  'kostenfrei',
  'bayern',
  'löw',
  'spd',
  'minuten',
  'spielen',
  'partie',
  'stehenden',
  'verwenden',
  'artikel',
  'liga']}