
Für die Entwicklung des Klassifikators nutzen wir die Daten des SemEval 2013 Task 2b, ein Korpus mit 14882 annotierten Tweets: 

- Informationen zum Korpus: https://aclanthology.org/S13-2052/
- Corpus als zip-Datei: [semeval2013.zip](https://moodle.zdv.uni-tuebingen.de/pluginfile.php/269639/mod_assign/introattachment/0/semeval2013.zip?forcedownload=1)



In [1]:
import nltk
import re
import string

from nltk import FreqDist
from nltk import MaxentClassifier
from nltk import ConditionalExponentialClassifier
from nltk import DecisionTreeClassifier
from nltk import WekaClassifier
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.util import ngrams

from nltk import pos_tag

from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ngoc-\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ngoc-\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Ngoc-\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ngoc-\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Zuerst lesen wir die Daten ein. Diese befinden sich in:
- semeval2013/twitter-2013dev-A
- semeval2013/twitter-2013test-A
- semeval2013/twitter-2013train-A

Wir entfernen alle Wörter, die mit "@", "#", "http" anfangen, und konvertieren alle unicode-escape-characters zu deren unicode-character. 

In [2]:
with open('semeval2013/twitter-2013dev-A.txt', "r") as file:
    devlines = file.readlines()    
dev = [line.strip().split("\t") for line in devlines]
dev_ = dev

with open('semeval2013/twitter-2013test-A.txt', "r") as file:
    testlines = file.readlines()    
test = [line.strip().split("\t") for line in testlines]
test_ = test

with open('semeval2013/twitter-2013train-A.txt', "r") as file:
    trainlines = file.readlines()    
train = [line.strip().split("\t") for line in trainlines]
train_ = train

#konvertiere alle unicode-escapes zu unicode
def decode_unicode_escapes(text):
    try:
        return re.sub(r'\\[uU]([\dA-Fa-f]{4,8})', lambda m: chr(int(m.group(1), 16)), text)
    except:
        return " "

    
def remove_words(sentence):
    # Create a regular expression pattern to match words starting with "@" or "#" or "http"
    pattern = r'@\w+|#\w+|http?\S+'
    
    # Use the sub() function of the re module to replace all matched patterns with an empty string
    clean_sentence = re.sub(pattern, '', sentence)
    
    return clean_sentence    

dev = [[tag, sentiment, remove_words(decode_unicode_escapes(content))] for [tag, sentiment, content] in dev]
test = [[tag, sentiment, remove_words(decode_unicode_escapes(content))] for [tag, sentiment, content] in test]
train = [[tag, sentiment, remove_words(decode_unicode_escapes(content))] for [tag, sentiment, content] in train]
         
    
dev = [(tag, sentiment, ' '.join(word_tokenize(content))) for [tag, sentiment, content] in dev]
test = [(tag, sentiment, ' '.join(word_tokenize(content))) for [tag, sentiment, content] in test]
train = [(tag, sentiment, ' '.join(word_tokenize(content))) for [tag, sentiment, content] in train]


def generate_bigrams(tokens):
    bigrams = list(ngrams(tokens, 2))
    return [' '.join(bigram) for bigram in bigrams]


dev = [
    [tag, sentiment, ' '.join(word_tokenize(content)) + ' ' + ' '.join(generate_bigrams(word_tokenize(content)))]
    for [tag, sentiment, content] in dev
]
test = [
    [tag, sentiment, ' '.join(word_tokenize(content)) + ' ' + ' '.join(generate_bigrams(word_tokenize(content)))]
    for [tag, sentiment, content] in test
]
train = [
    [tag, sentiment, ' '.join(word_tokenize(content)) + ' ' + ' '.join(generate_bigrams(word_tokenize(content)))]
    for [tag, sentiment, content] in train
]


Anschließend lemmatisieren und stemmen wir alle Sätze 


In [3]:
lemmatizer = WordNetLemmatizer()

sste = SnowballStemmer("english")

dev = [(tag, sentiment, ' '.join(str_val)) for tag, sentiment, str_val in [[lst[0], lst[1], [sste.stem(word) for word in lst[2].split()] ] for lst in dev]]
test = [(tag, sentiment, ' '.join(str_val)) for tag, sentiment, str_val in [[lst[0], lst[1], [sste.stem(word) for word in lst[2].split()] ] for lst in test]]
train = [(tag, sentiment, ' '.join(str_val)) for tag, sentiment, str_val in [[lst[0], lst[1], [sste.stem(word) for word in lst[2].split()] ] for lst in train]]



dev = [[lst[0], lst[1], ' '.join([lemmatizer.lemmatize(word) for word in lst[2].split()])] for lst in dev]
test = [[lst[0], lst[1], ' '.join([lemmatizer.lemmatize(word) for word in lst[2].split()])] for lst in test]
train = [[lst[0], lst[1], ' '.join([lemmatizer.lemmatize(word) for word in lst[2].split()])] for lst in train]



Zuerst zählen wir die Anzahl der Label für jedes der Datensplits. Anschließend normalisieren wir über die Anzahl der Tweets im jeweiligen Datensplit. 

In [4]:
def count_sentiment(file):
    positive = 0
    neutral = 0
    negative = 0
    for [tag, sentiment, content] in file:
        if sentiment == "positive":
            positive += 1
        if sentiment == "neutral":
            neutral += 1
        if sentiment == "negative":
            negative += 1
    return [positive, neutral, negative]

def normalize_sentiment(file):
    result = [x / len(file) for x in count_sentiment(file)]
    return result
    
for file in [dev, test, train]:
    print(str(normalize_sentiment(file)))

[0.3476420798065296, 0.4467956469165659, 0.20556227327690446]
[0.41584437552861575, 0.4265576543557936, 0.15759797011559065]
[0.37587773647253203, 0.4735646427096241, 0.15055762081784388]


Da die Klassentypen ähnlich gleichmäßig über die Test/ Train/ Dev- Datensätze verteilt sind, ist der Train-Datensatz für das Training auf die Evaluierung des Test-Datensatzes geeignet. Hier ist ebenfalls erwähnenswert, dass positive und neutrale Tweets mit ungefähr doppelter Wahrscheinlichkeit auftreten als negative Tweets. Bereits ein zufälliges Zuweisen zwischen positiver Kategorie und negativer Kategorie würde somit in einer Accuracy von ca. 40% resultieren. 

Um eine Baseline zu implementieren, entnehmen wir aus den Datensätzen die häufigsten Wörter, die jeweils in Tweets mit Label "positive", "negative", "neutral" zu finden sind. Für einen zu klassifizierenden Tweet zählen wir die Aufkommen dieser Wörter und weisen das Label zu, das die meisten dieser Wörter enthält. 

Zuerst erstellen wir das Dictionary, mit dem wir den Classifier trainieren.

In [5]:
import random
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

        
    
# Extrahiere alle positiven/ negativen/ neutralen Trainings-Tweets und füge diese in ein Dictionary ein, dass für einen 
# Bayes-Klassifizierer geeignet ist

positive_tweets = [[content] for [tag, sentiment, content] in train if sentiment == "positive"]
positive_tweets = [[word for sentence in sublist for word in sentence.split()] for sublist in positive_tweets]

negative_tweets = [[content] for [tag, sentiment, content] in train if sentiment == "negative"]
negative_tweets = [[word for sentence in sublist for word in sentence.split()] for sublist in negative_tweets]

neutral_tweets = [[content] for [tag, sentiment, content] in train if sentiment == "neutral"]
neutral_tweets = [[word for sentence in sublist for word in sentence.split()] for sublist in neutral_tweets]

positive_tokens_for_model = get_tweets_for_model(positive_tweets)
negative_tokens_for_model = get_tweets_for_model(negative_tweets)
neutral_tokens_for_model = get_tweets_for_model(neutral_tweets)



positive_dataset = [(tweet_dict, "positive") for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "negative") for tweet_dict in negative_tokens_for_model]
neutral_dataset = [(tweet_dict, "neutral") for tweet_dict in neutral_tokens_for_model]

train_dataset = positive_dataset + negative_dataset + neutral_dataset


Wir müssen ebenfalls das Test-Dataset in ein dictionary-Format bringen, das für einen Naiven Bayes-Klassifizierer geeignet ist. Dies ist äquivalent zu dem Trainings-Dataset

In [6]:

# Extrahiere alle positiven/ negativen/ neutralen Test-Tweets und füge diese in ein Dictionary ein, dass für einen 
# Bayes-Klassifizierer geeignet ist
positive_test_tweets = [[content] for [tag, sentiment, content] in test if sentiment == "positive"]
positive_test_tweets = [[word for sentence in sublist for word in sentence.split()] for sublist in positive_test_tweets]

negative_test_tweets = [[content] for [tag, sentiment, content] in test if sentiment == "negative"]
negative_test_tweets = [[word for sentence in sublist for word in sentence.split()] for sublist in negative_test_tweets]

neutral_test_tweets = [[content] for [tag, sentiment, content] in test if sentiment == "neutral"]
neutral_test_tweets = [[word for sentence in sublist for word in sentence.split()] for sublist in neutral_test_tweets]

positive_test_tokens_for_model = get_tweets_for_model(positive_test_tweets)
negative_test_tokens_for_model = get_tweets_for_model(negative_test_tweets)
neutral_test_tokens_for_model = get_tweets_for_model(neutral_test_tweets)



positive_test_dataset = [(tweet_dict, "positive") for tweet_dict in positive_test_tokens_for_model]
negative_test_dataset = [(tweet_dict, "negative") for tweet_dict in negative_test_tokens_for_model]
neutral_test_dataset = [(tweet_dict, "neutral") for tweet_dict in neutral_test_tokens_for_model]

test_dataset = positive_test_dataset + negative_test_dataset + neutral_test_dataset


Anschließend können wir einen einen Naiven Bayes-Klassifizierer trainieren und testen:

In [7]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_dataset)

accuracy = classify.accuracy(classifier, test_dataset)
print("Accuracy is:", accuracy)

print(classifier.show_most_informative_features(30))



Accuracy is: 0.5841556244713842
Most Informative Features
                    luck = True           positi : neutra =     39.1 : 1.0
                   excit = True           positi : neutra =     38.2 : 1.0
                     sad = True           negati : neutra =     37.1 : 1.0
                    fuck = True           negati : neutra =     35.7 : 1.0
                   happi = True           positi : neutra =     30.0 : 1.0
                   sorri = True           negati : neutra =     29.6 : 1.0
                  injuri = True           negati : positi =     29.1 : 1.0
                    amaz = True           positi : neutra =     28.5 : 1.0
                  awesom = True           positi : neutra =     27.0 : 1.0
                   thank = True           positi : neutra =     24.9 : 1.0
                     fun = True           positi : neutra =     24.3 : 1.0
                    suck = True           negati : neutra =     23.3 : 1.0
                    cant = True           

Der Naive-Bayes-Klassifizierer resultiert in einer Accurracy von ~59%. Wir überprüfen folgenderweise, ob Wörter, die oft ausschließlich in positiven/ negativen/ neutralen Tweets auch hilfreich für den obigen Bayes-Klassifizierer sind folgendermaßen:

- Für alle positiven/negativen/ neutralen Tweets, sammle alle Wörter die in derartigen Tweets vorkommen
- Stopwörter und Punktuation (mit Ausnahme von Smileys) ist nicht aussagekräftig und wird somit entfernt
- Zähle dann mithilfe von FreqDist die häufigsten positiven, negativen, und neutralen Wörter
- Für alle positiven/ negativen/ neutralen Wortmengen, betrachte nicht diejenigen, die auch in den jeweilig anderen Wortmengen auftreten
- Die übrigbleibenden Wortmengen werden dann mit den hilfreichesten Features des Bayes-Klassifizierer verglichen.

In [8]:
positive_words = [content for [tag, sentiment, content] in train if sentiment == "positive"]
negative_words = [content for [tag, sentiment, content] in train if sentiment == "negative"]
neutral_words = [content for [tag, sentiment, content] in train if sentiment == "neutral"]

smiley_regex = re.compile(r'(:\)|:-\)|:\(|:-\(|;\)|;-\)|:\D|:-\D|<3)')

smileys = set()

for smiley in smiley_regex.findall(' '.join(map(chr, range(128)))):  
    smileys.add(smiley)

def remove_stopwords_and_punctuation(text_list):
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation) - smileys
    
    filtered_text_list = []
    for text in text_list:
        words = nltk.word_tokenize(text)
        filtered_words = [word.lower() for word in words if (word.lower() not in stop_words and word.lower() not in punctuation)]
        filtered_text_list.append(" ".join(filtered_words))
    return filtered_text_list


positive_words = " ".join(remove_stopwords_and_punctuation(positive_words)).split()
negative_words = " ".join(remove_stopwords_and_punctuation(negative_words)).split()
neutral_words = " ".join(remove_stopwords_and_punctuation(neutral_words)).split()

fdist_positive = FreqDist(positive_words)
fdist_negative = FreqDist(negative_words)
fdist_neutral = FreqDist(neutral_words)

common_positive_words = set([item[0] for item in fdist_positive.most_common(400)])
common_negative_words = set([item[0] for item in fdist_negative.most_common(400)])
common_neutral_words = set([item[0] for item in fdist_neutral.most_common(400)])

common_positive_words_ = set(common_positive_words) -  (set(common_negative_words) | set(common_neutral_words))
common_negative_words_ = set(common_negative_words) - (set(common_positive_words) | set (common_neutral_words))
common_neutral_words_ = set(common_neutral_words) - (set(common_positive_words) | set (common_negative_words))

most_informative = [word for [word, contains] in classifier.most_informative_features(200)]
print("- Frequente positive Wörter, die auch informativ sind:  \n     {}".format(common_positive_words_.intersection(most_informative)))
print("- Frequente negative Wörter, die auch informativ sind:  \n     {}".format(common_negative_words_.intersection(most_informative)))
print("- Frequente neutrale Wörter, die auch informativ sind:  \n     {}".format(common_neutral_words_.intersection(most_informative)))

- Frequente positive Wörter, die auch informativ sind:  
     {'brilliant', 'proud', 'enjoy', 'excit', 'luck', 'favorit', 'sweet', 'perfect', 'yay', 'happi', 'nice', 'funni', 'thank', 'awesom', 'cool', 'fun', 'amaz'}
- Frequente negative Wörter, die auch informativ sind:  
     {'pavol', 'tire', 'protest', 'dont', 'voic', 'rider', 'anymor', 'trayvon', 'sad', 'suck', 'sorri', 'injuri', 'bitch', 'net', 'wrong', 'ghost', 'cri', 'cancel', 'warn', 'crash', 'smh', 'hell', 'dead', 'fail', 'sick', 'piss', 'hate', 'worst', 'delay', 'wont', 'loss', 'demitra', 'die', 'kinda', 'damn'}
- Frequente neutrale Wörter, die auch informativ sind:  
     {'begin', 'intern', 'trial'}


Das Ergebnis für positive bzw. negative Tweets ist nicht überraschend, da dies zumeist sehr aussagekräftige Word-features sind. Überraschend ist lediglich die Menge der frequenten und informativen negativen Wörter im Vergleich zu der Menge der frequenten und informativen positiven Wörtern, da es um einiges mehr derartiger negativer Wörter gibt, als positiver. 

Die frequenten und informativen neutralen Wörter sind nicht überraschend, da neutrale Tweets meist weniger aussagekräftige Features (im Vgl. mit positiven und negativen) beinhalten. 



Anstatt den Bayes-Klassifizierer direkt auf den Train-Datensatz zu trainieren, können wir auch einen feature-Extraktor verwenden, der die nach obiger Methode häufigsten Wörter in positiven/ negativen/ neutralen Tweets extrahiert. 

In [9]:
common_positive_words_ = [word for word in common_positive_words_]    
common_negative_words_ = [word for word in common_negative_words_]
common_neutral_words_ = [word for word in common_negative_words_]

all_words = common_positive_words_ + common_negative_words_ + common_neutral_words_ 

def extract_features(document):    
    document_words = set(document)
    features = {}
    word_features = all_words
    for word in word_features:
        features[word] = (word in document_words)
    return features

documents = [(content.split(), sentiment) for [tag, sentiment, content] in train]

labeled_features = [(extract_features(doc), category) for (doc, category) in documents]

classifier2 = classifier.train(labeled_features)

accuracy2 = nltk.classify.accuracy(classifier2, test_dataset)

print(accuracy2)

print(classifier2.show_most_informative_features(30))



0.558782069354384
Most Informative Features
                    luck = True           positi : neutra =     39.1 : 1.0
                   excit = True           positi : neutra =     38.2 : 1.0
                     sad = True           negati : neutra =     37.1 : 1.0
                   happi = True           positi : neutra =     30.0 : 1.0
                   sorri = True           negati : neutra =     29.6 : 1.0
                  injuri = True           negati : positi =     29.1 : 1.0
                    amaz = True           positi : neutra =     28.5 : 1.0
                  awesom = True           positi : neutra =     27.0 : 1.0
                   thank = True           positi : neutra =     24.9 : 1.0
                     fun = True           positi : neutra =     24.3 : 1.0
                    suck = True           negati : neutra =     23.3 : 1.0
                    fail = True           negati : positi =     22.5 : 1.0
                     cri = True           negati : neutr




Die Nützlichkeit von POS-Tagging kann mithilfe des nltk-Unigram Taggers untersucht werden. Dieser wird auf dem brown-dataset trainiert.

In [10]:
from nltk import UnigramTagger
from nltk.corpus import brown

brown_train = brown.tagged_sents()
ut = UnigramTagger(brown_train)

Dieser Unigram-Tagger kann verbessert werden, indem wir für unschlüssige Tags mithilfe dem Suffix (oder dem Auftreten von Nummern) erweitert taggen. Ebenfalls ändern wir das output von einem (word, tag)-Tuple zu einem word-tag String für Einfachkeit bei der folgenden Konvertierung in ein geeignetes Dictionary. 


In [11]:

def utplus (sent) :
    out = [(word,"ZONK") for word in sent]   
    
    out = ut.tag(sent)   
    for i in range(len(out)):
        if out[i][1] == None:            
            #replace all None tags which contain a number as NUM:
            if(bool(re.search('1|2|3|4|5|6|7|8|9|0|one|two|three|four|five|six|seven|eight|nine|ten', out[i][0].lower()))):
                out[i] = (out[i][0], 'NUM')              
            #replace all None tags which have a capital Letter as the first letter with NOUN:
            if(out[i][0][0].isupper()):
                out[i] = (out[i][0], 'NN')
            #replace all None tags which have -ly | -able suffix with ADV:                
            if(out[i][0].endswith(('ly', 'able'))):
                out[i] = (out[i][0], 'ADV')
            #replace all None tags which have -ally | -ous | -est | -ial |-ic with ADJ:
            if(out[i][0].endswith(('ally', 'ous', 'est', 'ial', 'ate', 'ic', 'al', 'ent'))):
                out[i] = (out[i][0], 'ADJ')
            #replace all None tags which have -ies | -cy | -age | -'s |-ing | -ion |s |-ity |-ment | -ty suffix with NN:
            if(out[i][0].endswith(('ies', 'cy', 'age', '\'s', 'ing', 'ion', 's', 'ity', 'ment', 'ty', 'ance'))):
                out[i] = (out[i][0], 'NN')
            #replace all None tags which have -ed suffix with VBD
            if(out[i][0].endswith(('ed'))):
                out[i] = (out[i][0], 'VBD')  
            #replace all None tags which have -ing suffix with VBG
            if(out[i][0].endswith(('ing'))):
                out[i] = (out[i][0], 'VBG')                
            else:
                out[i] = (out[i][0], 'NN')
                
    for (word, tag) in out:
        if(tag == None):
            print(word)
    out = [word.lower() + "-" + tag for (word, tag) in out]        
    print(out)
    return out

Wir wiederholen jetzt obigen Code unter Betrachtung der POS-Tags, allerdings ohne Lemmatisierung.

In [12]:
dev_ = [[tag, sentiment, utplus(remove_words(decode_unicode_escapes(content)).split())] for [tag, sentiment, content] in dev_]

test_ = [[tag, sentiment, utplus(remove_words(decode_unicode_escapes(content)).split())] for [tag, sentiment, content] in test_]

train_ = [[tag, sentiment, utplus(remove_words(decode_unicode_escapes(content)).split())] for [tag, sentiment, content] in train_]


['won-NN', 'the-AT', 'match-VB', '.-.', 'plus,-NN', 'tomorrow-NR', 'is-BEZ', 'a-AT', 'very-QL', 'busy-JJ', 'day,-NN', 'with-IN', 'awareness-NN', 'day’s-NN', 'and-CC', 'debates.-NN', 'gulp.-NN', 'debates...-NN']
['some-DTI', 'areas-NNS', 'of-IN', 'new-JJ-TL', 'england-NP', 'could-MD', 'see-VB', 'the-AT', 'first-OD', 'flakes-NNS', 'of-IN', 'the-AT', 'season-NN', 'tuesday.-NN']
['2nd-OD', 'worst-JJT', 'qb.-NN', 'definitely-NN', 'tony-NP', 'romo.-NN', 'the-AT', 'man-NN', 'who-WPS', 'likes-VBZ', 'to-TO', 'share-NN', 'the-AT', 'ball-NN', 'with-IN', 'everyone.-NN', 'including-IN', 'the-AT', 'other-AP', 'team.-NN']
['washington-NP', '--IN', 'us-NN', 'president-NN-TL', 'barack-NN', 'obama-NN', 'vowed-VBD', 'wednesday-NR', 'as-CS', 'he-PPS', 'visited-VBD', 'storm-ravaged-NN', 'new-JJ-TL', 'jersey-NP-TL', 'shore-NN', 'to...-NN']
['did-DOD', 'y𠆚ll-NN', 'hear-VB', 'what-WDT', 'tony-NP', 'romo-NN', 'dressed-VBN', 'up-RP', 'as-CS', 'for-IN', 'halloween?-NN', 'a-AT', 'giants-NNS-TL', 'quaterback!-NN',

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [13]:
positive_tweets = [content for [tag, sentiment, content] in train_ if sentiment == "positive"]
negative_tweets = [content for [tag, sentiment, content] in train_ if sentiment == "negative"]
neutral_tweets = [content for [tag, sentiment, content] in train_ if sentiment == "neutral"]

positive_tokens_for_model = get_tweets_for_model(positive_tweets)
negative_tokens_for_model = get_tweets_for_model(negative_tweets)
neutral_tokens_for_model = get_tweets_for_model(neutral_tweets)

positive_dataset = [(tweet_dict, "positive") for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "negative") for tweet_dict in negative_tokens_for_model]
neutral_dataset = [(tweet_dict, "neutral") for tweet_dict in neutral_tokens_for_model]

train_dataset = positive_dataset + negative_dataset + neutral_dataset



Das gleiche ebenfalls für das Test-Dataset:

In [14]:
positive_test_tweets = [content for [tag, sentiment, content] in test_ if sentiment == "positive"]
negative_test_tweets = [content for [tag, sentiment, content] in test_ if sentiment == "negative"]
neutral_test_tweets = [content for [tag, sentiment, content] in test_ if sentiment == "neutral"]

positive_test_tokens_for_model = get_tweets_for_model(positive_test_tweets)
negative_test_tokens_for_model = get_tweets_for_model(negative_test_tweets)
neutral_test_tokens_for_model = get_tweets_for_model(neutral_test_tweets)

positive_test_dataset = [(tweet_dict, "positive") for tweet_dict in positive_test_tokens_for_model]
negative_test_dataset = [(tweet_dict, "negative") for tweet_dict in negative_test_tokens_for_model]
neutral_test_dataset = [(tweet_dict, "neutral") for tweet_dict in neutral_test_tokens_for_model]

test_dataset = positive_test_dataset + negative_test_dataset + neutral_test_dataset

Als Resultant haben wir dann:

In [15]:
tagged_classifier = NaiveBayesClassifier.train(train_dataset)

tagged_accuracy = classify.accuracy(tagged_classifier, test_dataset)

print("Accuracy is:", tagged_accuracy)

print(tagged_classifier.show_most_informative_features(20))


Accuracy is: 0.5390470820411616
Most Informative Features
                 fuck-VB = True           negati : neutra =     70.2 : 1.0
                  fun-NN = True           positi : neutra =     34.0 : 1.0
             happy-JJ-TL = True           positi : neutra =     33.3 : 1.0
                 luck-NN = True           positi : neutra =     33.2 : 1.0
                thank-VB = True           positi : neutra =     32.3 : 1.0
             excited-VBN = True           positi : neutra =     26.5 : 1.0
                happy-JJ = True           positi : neutra =     26.0 : 1.0
              awesome-JJ = True           positi : neutra =     25.6 : 1.0
                great-JJ = True           positi : neutra =     24.6 : 1.0
                sorry-JJ = True           negati : neutra =     24.5 : 1.0
                 hate-VB = True           negati : positi =     24.5 : 1.0
                  sad-JJ = True           negati : neutra =     23.3 : 1.0
                   :(-NN = True           

Die resultierende Accuracy ist (minimal) niedriger als die Accuracy der Baseline ohne Lemmatisierung. Dies ist zu erwarten, da verschiedene Wortformmen des gleichen Wortstammes bei informativen Wortstammen keinen großen Einfluss auf das Sentiment haben. Da jedoch durch das POS-Tagging beim Bayes-Klassifizier als unterschiedliche Features betrachtet werden, stehen dem Klassifizierer weniger Trainingsdaten zur Verfügung. Weiterhin betrachtet der Tagger nicht Kontraktionen (z.B. he'd, can't, we'll, it's, won't,...). Als Resultat ist die Accuracy minimal schlechter. Es lohnt sich also nur, den Klassifizierter auf lemmatisierte Datensätze zu trainieren, aber nicht auf getaggte. 

Erwähnenswert ist, dass das Aufkommen des Feature "fuck-VB" im Vergleich zu den vorherigen Klassifizierern einen betrachtlicheren Informationsgewinn einnimmt. Unter Betrachtung des POS-tags (und ohne Lemmatisierung) verdoppelt sich nahehin die Rate, in der sich dieses Feature in negativen Tweets finden lässt. 



Neben der Baseline, in der ein Bayes-Klassifizierer auf das gesamte Test-Dataset trainiert wird, wird ebenfalls noch ein Bayes-Klassifizierer nur auf die häufigsten Wörter trainiert, die jeweils (fast nur) in positiven, negativen, und neutralen Tweets auftauchen. Da der Baseline somit mehr Trainingsdaten zur Verfügung stehen, ist es nicht überraschend dass diese besser ausfällt als das zweite Verfahren: 

In [16]:
print("Klassifizierer auf alle Wörter trainiert: " + str(accuracy))
print("Klassifizierer auf frequente Wörter trainiert: " + str(accuracy2))


Klassifizierer auf alle Wörter trainiert: 0.5841556244713842
Klassifizierer auf frequente Wörter trainiert: 0.558782069354384
