# Frequency of words from the KORE dataset in the entity input corpus

See [CountWords_Entity.py](https://github.com/Nadine-Schmitt/bachelorThesis-nadischm/blob/master/Code/CountWords_Entity.py) script.

For each entity from the KORE dataset, the number of occurrences in the entity input corpus is calculated and printed as output.

## Import

[pickle](https://docs.python.org/3/library/pickle.html), [argparse](https://docs.python.org/3/library/argparse.html), [os](https://docs.python.org/2/library/os.html) and [gensim](https://radimrehurek.com/gensim/) are needed for this script to work:

In [None]:
import pickle
import argparse
import os
import gensim
from gensim.models import Word2Vec



## General Usage

The usage of the script is shown by the standard `-h` or `--help` flag:

In [2]:
%%cmd
python CountWords_Entity.py --help


Microsoft Windows [Version 10.0.17134.885]
(c) 2018 Microsoft Corporation. All rights reserved.

(base) C:\Users\nadin\Documents\Bachelorarbeit\Code>python CountWords_Entity.py --help
usage: CountWords_Entity.py [-h] koreSource inputSource

Script for translating wikipedia page names

positional arguments:
  koreSource   source folder with kore dataset
  inputSource  source folder with inputlist entity

optional arguments:
  -h, --help   show this help message and exit

(base) C:\Users\nadin\Documents\Bachelorarbeit\Code>

## Functions

The `underscore_creator()` function takes an entity (e.g. `Apple Inc.`) and transforms it into its id (`Apple_Inc.`) by replacing every space with an underscore:

In [None]:
def underscore_creator(s):
    # split the entity at every space and re-join the parts with underscores
    return '_'.join(s.split(' '))
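A quick check of the transformation might look like the sketch below. `Apple Inc.` is the example from the text; the other two inputs are made up for illustration and are not claimed to be part of KORE.

```python
def underscore_creator(s):
    # same behaviour as the function above: every space becomes an underscore
    return '_'.join(s.split(' '))

# "Steve Jobs" and "Google" are illustrative inputs only
examples = ["Apple Inc.", "Steve Jobs", "Google"]
ids = [underscore_creator(e) for e in examples]
print(ids)  # ['Apple_Inc.', 'Steve_Jobs', 'Google']
```

An entity without spaces is returned unchanged.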

`readKore()` reads the KORE dataset into a list, removes a leading tab from each line where present, and transforms the entities into their ids with `underscore_creator()`:

In [None]:
def readKore(source):
    # read all lines of the KORE file, without the trailing newline
    with open(source, encoding="UTF-8") as f:
        lines = [line.rstrip('\n') for line in f]

    # strip a single leading tab, if the line has one
    for index in range(len(lines)):
        if lines[index].startswith("\t"):
            lines[index] = lines[index][1:]
    # entity: words with underscore
    listUnderscore = []
    for e in lines:
        eNew = underscore_creator(e)
        listUnderscore.append(eNew)

    return listUnderscore
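To see the reading step in isolation, the following sketch runs the same logic on a small temporary file. The file content is an assumption made only for this example (one entity per line, some lines starting with a tab); the function name `read_kore_sketch` is hypothetical as well.

```python
import os
import tempfile

def underscore_creator(s):
    return s.replace(' ', '_')

def read_kore_sketch(source):
    # hypothetical stand-in for readKore(): same steps, written compactly
    with open(source, encoding="UTF-8") as f:
        lines = [line.rstrip('\n') for line in f]
    # drop one leading tab where present
    lines = [l[1:] if l.startswith('\t') else l for l in lines]
    return [underscore_creator(l) for l in lines]

# assumed sample content, not taken from the real dataset file
sample = "Apple Inc.\n\tSteve Jobs\n\tSteve Wozniak\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='UTF-8') as tmp:
    tmp.write(sample)
    path = tmp.name
try:
    kore_ids = read_kore_sketch(path)
finally:
    os.remove(path)  # clean up the temporary file

print(kore_ids)  # ['Apple_Inc.', 'Steve_Jobs', 'Steve_Wozniak']
```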




The following function reads the entity input list file by file and, for each entity from the KORE dataset, sums up the number of occurrences over all files:

In [None]:
def loadInputList(dirname, KoreList):
    # one counter per KORE entity, in the same order as KoreList
    counterList = [0] * len(KoreList)
    filenames = [f.path for f in os.scandir(dirname)]
    for fname in filenames:
        print("read sentences from ", fname)
        # collect all whitespace-separated tokens of the current file
        loadSentence = []
        with open(fname, 'r') as f:
            for line in f:
                loadSentence.extend(line.split())
        # add this file's occurrences of every entity to its running total
        for index, e in enumerate(KoreList):
            counterList[index] += loadSentence.count(e)

    return counterList
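Calling `list.count()` once per entity scans each file's token list again for every entity. A possible alternative, sketched here and not part of the original script, is to tally all tokens once with `collections.Counter` and then look up each entity. The name `count_entities_sketch`, the explicit UTF-8 encoding, the skipping of sub-directories and the sample files are assumptions of this example.

```python
import os
import tempfile
from collections import Counter

def count_entities_sketch(dirname, kore_list):
    # hypothetical alternative to loadInputList(): one pass over all tokens
    totals = Counter()
    for entry in os.scandir(dirname):
        if not entry.is_file():
            continue  # assumption: sub-directories are ignored
        with open(entry.path, 'r', encoding='UTF-8') as f:
            for line in f:
                totals.update(line.split())
    # a Counter returns 0 for tokens that never occurred
    return [totals[e] for e in kore_list]

# two small made-up corpus files for demonstration
with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, 'part1.txt'), 'w', encoding='UTF-8') as f:
        f.write("Apple_Inc. was founded by Steve_Jobs\nSteve_Jobs left Apple_Inc.\n")
    with open(os.path.join(tmpdir, 'part2.txt'), 'w', encoding='UTF-8') as f:
        f.write("Steve_Jobs returned\n")
    counts = count_entities_sketch(tmpdir, ['Apple_Inc.', 'Steve_Jobs', 'Google'])

print(counts)  # [2, 3, 0]
```

As in the original function, only exact token matches are counted, so an entity followed directly by punctuation in the corpus would not match.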



## Configuration and Main

In [None]:
parser = argparse.ArgumentParser(description='Script for translating wikipedia page names')
parser.add_argument('koreSource', type=str, help='source folder with kore dataset')
parser.add_argument('inputSource', type=str, help='source folder with inputlist entity')

args = parser.parse_args()
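When the cell is run inside a notebook instead of from the command line, `parse_args()` can be given an explicit argument list. The sketch below does that with two placeholder paths, which are purely hypothetical. Note that `readKore()` opens `koreSource` with `open()`, so it has to point to a file, while `inputSource` is scanned as a directory.

```python
import argparse

parser = argparse.ArgumentParser(description='Script for translating wikipedia page names')
parser.add_argument('koreSource', type=str, help='source folder with kore dataset')
parser.add_argument('inputSource', type=str, help='source folder with inputlist entity')

# placeholder paths, chosen only for illustration
args = parser.parse_args(['kore_entities.txt', 'inputList_entity'])
print(args.koreSource)   # kore_entities.txt
print(args.inputSource)  # inputList_entity
```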

In [None]:
# read the KORE data
KoreList = readKore(args.koreSource)
# the first entry is overwritten with the hard-coded id 'Apple'
KoreList[0] = 'Apple'
for e in KoreList:
    print(e)  # print all entities from KORE
count = loadInputList(args.inputSource, KoreList)
for e in count:
    print(e)  # afterwards print their numbers of occurrences, in the same order
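The script prints all entities first and all counts afterwards, so the two lists have to be matched by position. If a side-by-side view is preferred, the two lists could be paired with `zip()`, as in this sketch. The entity names and numbers here are invented for the example and are not results of the script.

```python
# made-up stand-ins for KoreList and the result of loadInputList()
kore_list = ['Apple', 'Steve_Jobs', 'Google']
count = [12, 7, 0]

# one tab-separated row per entity: id, then number of occurrences
rows = [f"{e}\t{c}" for e, c in zip(kore_list, count)]
print("\n".join(rows))
```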




## Convert Jupyter Notebook into py-script

In [None]:
!jupyter nbconvert --to script CountWords_Entity.ipynb