__<font size="6" weight="bold">Bayes Localizer</font>__

This class holds the trained models for the individual patterns, keyed by pattern ID. <br>
A new instance starts out empty; models are either trained with `train` or, if they already exist on disk, loaded with `load_all_models`. <br>

## Importing the libraries

In [7]:
import spacy
import pickle
import os 
import os.path as path
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from nltk.corpus import stopwords


## Loading the Game_Pattern class

In [30]:
%run ./GamePatterns.ipynb

## Loading the nlp_processed_game class

In [29]:
%run ./nlpProcessedGame.ipynb

## The Bayes Localizer class

__Class members__:<br>
directoryname: name of the directory that holds the models<br>
directoryname_output: name of the directory that receives the output of the train function<br>
min_length: minimum number of words per sentence<br>
remove_stopwords: whether stop words are removed (True = yes | False = no) when a new manual is predicted<br>
lemmatize: whether sentences are lemmatized when a new manual is predicted<br>
__Functions:__

__\_\_init\_\_(remove_stopwords, lemmatize)__:
<br>Parameters<br>
remove_stopwords : Boolean (default False)<br>
lemmatize : Boolean (default False)<br>
Initialization: creates an empty modelDictionary and an empty counterVectorizeDirectory.<br><br>
__set_min_sentence_length(min_length)__:
<br>Parameters<br>
min_length : integer<br>
Minimum number of words per sentence. Sentences with fewer words are ignored.<br><br>
__load_all_models(filepath)__: 
<br>Parameters<br>
filepath : String (defaults to the empty string)<br>
Loads all trained models. filepath is optional; if none is given, the models are loaded from the default directory (directoryname).<br><br>
__load_model(id,filepath)__: 
<br>Parameters<br>
id : integer <br>
filepath : String <br>
Loads the trained model identified by id from filepath. Returns None and prints a warning if the file does not exist.
<br><br>
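__load_counterVectorize(id,filepath)__:
<br>Parameters<br>
id : integer <br>
filepath : String <br>
Loads the fitted CountVectorizer that was saved during training, identified by id and filepath. Returns None and prints a warning if the file does not exist.
<br><br>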
__train(id, dataframe)__:
<br>Parameters<br>
id : integer<br>
dataframe : pandas DataFrame<br>
Trains the model on the given dataframe (column 0 is passed to the vectorizer, column 1 holds the labels) and saves the model in the default directory "directoryname". A second file containing the accuracy, precision, recall and F1 score of the trained model is written to "directoryname"/Bayes_Output.<br>
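
The vectorize-and-fit step of the training can be illustrated with a minimal, self-contained sketch. The toy sentences and labels below are made up for illustration and are not project data; the sketch also leaves out the train/test split, the score file and the pickling that `train` performs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 1 = sentence contains the pattern, 0 = it does not.
sentences = [
    "roll the dice and move your piece",
    "roll both dice on your turn",
    "shuffle the cards before the game",
    "draw a card from the deck",
]
labels = [1, 1, 0, 0]

# binary=True records only whether a word occurs, which suits BernoulliNB.
vectorizer = CountVectorizer(binary=True)
x_train = vectorizer.fit_transform(sentences)

model = BernoulliNB()
model.fit(x_train, labels)

# New sentences must go through the same fitted vectorizer.
x_new = vectorizer.transform(["roll the dice"])
prediction = model.predict(x_new)
```

Because prediction needs the fitted vectorizer as well as the model, both objects have to be saved and loaded together.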
<br><br>
__clean_stopwords(sentence)__:
<br>Parameters<br>
sentence : String<br>
Lowercases a sentence of a manual that is to be predicted and removes its stop words. It is only applied when the "remove_stopwords" parameter is True.<br>
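
A minimal sketch of this filtering, assuming a small hypothetical stop word set in place of the file or NLTK list that the class uses. The helper name `remove_stopwords_from` is illustrative and does not exist in the class.

```python
# Hypothetical stop word set, for illustration only.
STOPWORDS = {"der", "die", "das", "und", "ein"}


def remove_stopwords_from(sentence, stopwords):
    """Lowercase the sentence and drop every word contained in stopwords."""
    words = [w.lower() for w in sentence.split() if w.lower() not in stopwords]
    return " ".join(words)


cleaned = remove_stopwords_from("Der Spieler würfelt und zieht die Figur", STOPWORDS)
```

The sentence is split on whitespace only, so a stop word with punctuation attached (for example "und,") is not matched. Unlike this sketch, the class method leaves a trailing space on the result.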
<br><br>
__create_sentence_index_list(predictions,sentences)__:
<br>Parameters<br>
predictions : list of integers<br>
sentences : list of Strings<br>
Returns a list with the indices of the sentences in which the pattern was found (prediction 1). Sentences with fewer than min_length words are skipped.<br><br>
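
The selection rule can be sketched as a standalone function. The name `sentence_indices` and the sample data are assumptions made for this illustration.

```python
def sentence_indices(predictions, sentences, min_length=0):
    """Return indices of sentences predicted as 1 with at least min_length words."""
    return [
        i
        for i, (label, sentence) in enumerate(zip(predictions, sentences))
        if label == 1 and len(sentence.split()) >= min_length
    ]


sentences = ["Roll the dice", "Go", "Draw two cards from the deck"]
predictions = [1, 1, 0]

# "Go" is predicted as 1 but has fewer than two words, so it is skipped.
hits = sentence_indices(predictions, sentences, min_length=2)
```

With the default min_length of 0 no sentence is filtered out by length.<br><br>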
__read_nlp_processed_game_to_game_patterns(processedManual)__: Takes an NLP-processed manual as its parameter and returns the Game_Pattern for that manual.

In [25]:
class BayesLocalizer: 
    directoryname="Bayes"
    directoryname_output ="Bayes_Output"
    
    def __init__(self,remove_stopwords=False,lemmatize = False):
        self.modelDictionary = {}
        self.counterVectorizeDirectory ={}
        self.min_length = 0
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        
    def set_min_sentence_length(self, min_length):
        self.min_length = min_length
                       
    def load_all_models(self,filepath=""):
        if filepath == '':
            filepath = f'{os.getcwd()}/{self.directoryname}'
        if os.path.exists(filepath):
            for file in os.listdir(filepath):
                if file.endswith(".sav") and file.startswith("B"):
                    pattern_id = str(file).replace('Bayes-', '').replace("Counter-",'').replace('.sav', '')
                    self.modelDictionary[pattern_id] = self.load_model(pattern_id, filepath)
                    self.counterVectorizeDirectory[pattern_id] = self.load_counterVectorize(pattern_id, filepath)
        else:
            print("Directory does not exist!")

    
    def load_model(self, id, filepath):
        file = self.get_filename(id, filepath, "Bayes")
        if path.exists(file):
            # Close the file handle as soon as the model is unpickled.
            with open(file, 'rb') as f:
                return pickle.load(f)
        else:
            print("Warning: No Bayes Model found for the Pattern:" + str(id))
            return None
    
    def load_counterVectorize(self, id, filepath):
        file = self.get_filename(id, filepath, "Counter")
        if path.exists(file):
            # TODO: try/except with a message
            with open(file, 'rb') as f:
                return pickle.load(f)
        else:
            print("Warning: No CountVectorizer found for the Pattern:" + str(id))
            # TODO: message!
            return None
        
    def train(self, id, dataframe):
        # Create the model and output directories if they do not exist yet.
        output_dir = self.directoryname + "/" + self.directoryname_output
        os.makedirs(self.directoryname, exist_ok=True)
        os.makedirs(output_dir, exist_ok=True)
        bayes_file = self.get_filename(id, self.directoryname, "Bayes")
        countervector_file = self.get_filename(id, self.directoryname, "Counter")
        X = dataframe[0]
        Y = dataframe[1]
        X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.9)
        naive_bayes = BernoulliNB()
        # Build binary word vectors; the vocabulary is fitted on the training split only.
        vectorizer = CountVectorizer(binary=True)
        x_train = vectorizer.fit_transform(X_train)
        x_test = vectorizer.transform(X_test)
        with open(countervector_file, "wb") as f:
            pickle.dump(vectorizer, f)
        # Train, then score on the held-out 10 % of the data.
        naive_bayes.fit(x_train, y_train)
        predictions = naive_bayes.predict(x_test)
        score_file = output_dir + "/Precision_Accuracy_Recall_F1_:" + str(id) + ".txt"
        # pos_label=0 means precision, recall and F1 are reported for class 0.
        with open(score_file, "w") as f:
            f.write('Accuracy score: ' + str(accuracy_score(y_test, predictions)) + "\n")
            f.write('Precision score: ' + str(precision_score(y_test, predictions, pos_label=0)) + "\n")
            f.write('Recall score: ' + str(recall_score(y_test, predictions, pos_label=0)) + "\n")
            f.write('f1_score score: ' + str(f1_score(y_test, predictions, pos_label=0)) + "\n")
        with open(bayes_file, 'wb') as f:
            pickle.dump(naive_bayes, f)
    
    
    def get_filename(self, id,filepath,name=""):
        filename = "/" + name +"-"+str(id)+".sav"
        return filepath + filename
    
    def clean_stopwords(self, sentence):
        # The original checked "Stopword.txt" but opened "Stopwords.txt";
        # a single path keeps the check and the read consistent.
        stopword_file = "Stopwords/Stopwords.txt"
        if os.path.isfile(stopword_file):
            with open(stopword_file) as f:
                STOPWORDS = set(line.strip() for line in f)
        else:
            STOPWORDS = set(stopwords.words('german'))
        words = [w.lower() for w in sentence.split() if w.lower() not in STOPWORDS]
        # Each word is followed by a space, as in the original concatenation.
        return "".join(word + " " for word in words)

    def create_sentence_index_list(self, predictions, sentences):
        sentenceIndexList = []
        for i in range(len(predictions)):
            # Keep sentences that are long enough and predicted as class 1.
            if len(sentences[i].split()) >= self.min_length and predictions[i] == 1:
                sentenceIndexList.append(i)
        return sentenceIndexList
    
    def read_nlp_processed_game_to_game_patterns(self, processedManual):
        gamePattern = game_patterns()
        gamePattern.set_ID(processedManual.get_ID())
        doc=processedManual.get_spacy_doc()
        sentenceTexts = []
        for sent in doc.sents:
            sentenceTexts.append(str(sent))
        gamePattern.set_sentences(sentenceTexts)
        
        if self.lemmatize:
            listLemmaSentences = list(str(sent.lemma_) for sent in doc.sents)
        else:
            listLemmaSentences = list(str(sent.text) for sent in doc.sents)
           
        if(len(listLemmaSentences) == 0):
            return gamePattern
        
        if self.remove_stopwords:
            listLemmaSentences = [self.clean_stopwords(sent) for sent in listLemmaSentences]
                    
        for key in self.modelDictionary:
            model = self.modelDictionary.get(key)
            count_vector = self.counterVectorizeDirectory.get(key)
            # Skip patterns whose model or vectorizer could not be loaded.
            if model is not None and count_vector is not None:
                testing_data = count_vector.transform(listLemmaSentences)
                predictions = model.predict(testing_data)
                indexedSentences = self.create_sentence_index_list(predictions, listLemmaSentences)
                gamePattern.add_pattern(int(key), indexedSentences)
        return gamePattern
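
The file naming convention that `get_filename` produces and `load_all_models` parses back can be sketched as follows. The two helper names and the ID 12 are assumptions made for this illustration; they are not part of the class.

```python
def model_filename(pattern_id, directory, name):
    """Build '<directory>/<name>-<pattern_id>.sav', as get_filename does."""
    return directory + "/" + name + "-" + str(pattern_id) + ".sav"


def pattern_id_from_filename(filename):
    """Recover the pattern ID, as a string, from a 'Bayes-<id>.sav' file name."""
    return filename.replace("Bayes-", "").replace(".sav", "")


full_path = model_filename(12, "Bayes", "Bayes")
pattern_id = pattern_id_from_filename("Bayes-12.sav")
```

The recovered ID is a string, which is why the dictionary keys are converted with `int(key)` before they are passed to `add_pattern`.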
                
            