# Data Processing

In [10]:
import os

from own.loading import load_train_test_rid_lists
from own.loading import load_reviews_and_rids

from own.processing import process_set  # called below as process_set(name, RIDs, reviews); saves to file, no return value

from own.functions import get_matching_reviews

from own.vocab import create_vocab
from own.vocab import define_min_occurrence
from own.vocab import save_vocab

## Loading the RIDs of the test and training sets

*load_train_test_rid_lists* reads the previously saved test-set and training-set files and returns two lists containing the corresponding RIDs.

In [3]:
train_rids, test_rids = load_train_test_rid_lists()

Loaded Trainset successfully
Loaded Testset successfully


## Loading the texts and RIDs from plain_reviews.txt

*load_reviews_and_rids* loads the text of each review as a list of sentences, together with the corresponding RIDs.

In [2]:
review_list, RID_list = load_reviews_and_rids(file_path = os.path.join("data","reviews","plain_reviews.txt"))

File loaded successfully


## Processing the test set

*get_matching_reviews* finds the review texts that belong to the requested RIDs.

*process_set* runs every review in a set through all processing steps described in the bachelor's thesis and then saves the set to a single file.
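The lookup performed by *get_matching_reviews* can be pictured with the following minimal sketch. It is not the implementation from `own.functions`; the name `match_reviews_sketch` is hypothetical, and the sketch only assumes what the calls in this notebook show: the RID list and the review list are parallel, and the function returns the matching reviews first and the matching RIDs second.

```python
def match_reviews_sketch(all_rids, all_reviews, wanted_rids):
    """Hypothetical stand-in for get_matching_reviews (assumption, not the original code)."""
    # A set makes each membership test O(1) instead of scanning a list.
    wanted = set(wanted_rids)
    matched_reviews, matched_rids = [], []
    # Assumes all_rids[i] is the RID of all_reviews[i].
    for rid, review in zip(all_rids, all_reviews):
        if rid in wanted:
            matched_reviews.append(review)
            matched_rids.append(rid)
    return matched_reviews, matched_rids
```

The results keep the order of the full review list, not the order of the requested RIDs.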

In [4]:
matching_reviews, matching_RIDs = get_matching_reviews(RID_list, review_list, test_rids)

Found 200 of 200 seached results


In [5]:
%%time
process_set("processed_testset", matching_RIDs, matching_reviews)

Directory  data\reviews  already exists
Saving processed_testset.txt
File saved successfully
Wall time: 1min 22s


## Processing the training set

In [6]:
matching_reviews, matching_RIDs = get_matching_reviews(RID_list, review_list, train_rids)

Found 1800 of 1800 seached results


In [7]:
%%time
process_set("processed_trainset", matching_RIDs, matching_reviews)

Directory  data\reviews  already exists
Saving processed_trainset.txt
File saved successfully
Wall time: 13min 34s


## Building the vocabulary of the training set

*load_reviews_and_rids* here loads the processed texts and RIDs of the training set.

*create_vocab* builds a count of every token that occurs in the given texts.

*define_min_occurrence* returns a list of valid tokens that occur more than $x$ times ($x=2$).

*save_vocab* saves the vocabulary to a file.
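Counting tokens and filtering them by frequency could look roughly like the sketch below. The functions `create_vocab_sketch` and `filter_min_occurrence_sketch` are hypothetical and are not the code in `own.vocab`. The sketch assumes that each review is a list of sentences and that each sentence is a whitespace-separated string. Because the text above says "more than $x$ times", it uses a strict comparison; the original function may apply the threshold differently.

```python
from collections import Counter


def create_vocab_sketch(reviews):
    """Count every token across all reviews (hypothetical helper)."""
    vocab = Counter()
    for review in reviews:
        for sentence in review:
            # Assumption: tokens are separated by whitespace.
            vocab.update(sentence.split())
    return vocab


def filter_min_occurrence_sketch(vocab, x=2):
    """Keep only tokens that occur strictly more than x times (assumed rule)."""
    return [token for token, count in vocab.items() if count > x]
```

With the toy input `[["good food good", "bad service"], ["good good"]]`, only `good` occurs more than twice, so it is the only token that survives the filter.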

In [8]:
file_path = os.path.join("data", "reviews", "processed_trainset.txt")
train_review_list, train_RID_list = load_reviews_and_rids(file_path)

File loaded successfully


In [11]:
%%time

# Create vocab for trainset
train_vocab = create_vocab(train_review_list)
train_tokens = define_min_occurrence(train_vocab)

directory = os.path.join("data", "vocabs")
file_name = "train_vocab"
save_vocab(directory, file_name, train_tokens)

Defining min_occurence
 Vocab length before truncating: 13046
 Vocab length after truncating: 8109
Directory  data\vocabs  successfully created 
File train_vocab saved successfully
Wall time: 94.9 ms
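The saving step in the cell above might be implemented along these lines. This is only a sketch: `save_vocab_sketch` is a hypothetical name, and both the one-token-per-line layout and the `.txt` extension are assumptions, since the notebook does not show the file format used by *save_vocab*.

```python
import os


def save_vocab_sketch(directory, file_name, tokens):
    """Write one token per line to <directory>/<file_name>.txt (assumed layout)."""
    # Create the target directory if it does not exist yet.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name + ".txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(tokens))
    return path
```

Returning the path is a convenience of this sketch, so that a caller can reload or log the file location.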
