<a href="https://colab.research.google.com/github/AlexanderVerheecke/TwitterSentimentAnalysis/blob/main/Traditional_Machine_Learning_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

How to run:
The datasets used in this Colab notebook are read from my personal Google Drive folder; I was unable to link the GitLab file here. If the system is not connected to my Google Drive folder, the user will need to download the datasets themselves from: https://drive.google.com/drive/folders/1hxBtAXu-IfoajBtJIG_8q7lFcmcLWYJu?usp=sharing
- SemEval data: SemEval 2017 -> SemEval2017_DataSet.csv
- English: OwnTweets -> English -> latestEnglish.csv
- English translation: OwnTweets -> latestGermanTranlatedToEnglish.csv
- DAI data: DAI TU Berlin -> de_sentiment_UNIQUE.csv
- German: OwnTweets -> German -> latestGerman.csv
- German translation: OwnTweets -> German -> latestEnglishTranslatedToGERMAN.csv




The datasets then need to be uploaded to Colab's Files panel on the left, and each file path copied into the corresponding dataset-reading call. Once all data is loaded correctly, the user simply needs to select 'Run all' under 'Runtime'.
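
To avoid editing six separate paths by hand, the paths can be derived from one base folder. This is a minimal sketch, assuming the Drive folder layout listed above is reproduced under a single directory. `DATA_DIR`, `DATASET_FILES`, and `dataset_path` are hypothetical names that do not appear in the notebook.

```python
from pathlib import PurePosixPath

# Hypothetical base folder: point this at wherever the datasets were uploaded.
DATA_DIR = PurePosixPath("/content/data")

# Relative locations copied from the folder listing above (assumed layout).
DATASET_FILES = {
    "SemEval": "SemEval 2017/SemEval2017_DataSet.csv",
    "English": "OwnTweets/English/latestEnglish.csv",
    "German_translated": "OwnTweets/latestGermanTranlatedToEnglish.csv",
    "DAI": "DAI TU Berlin/de_sentiment_UNIQUE.csv",
    "German": "OwnTweets/German/latestGerman.csv",
    "English_translated": "OwnTweets/German/latestEnglishTranslatedToGERMAN.csv",
}


def dataset_path(name, base=DATA_DIR):
    """Return the full path of one dataset as a string usable by pd.read_csv."""
    return str(base / DATASET_FILES[name])


for name in DATASET_FILES:
    print(name, "->", dataset_path(name))
```

With this in place, a reading call could take the form `pd.read_csv(dataset_path("DAI"))`, and only `DATA_DIR` changes between machines.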

Under 'MODEL TRAINING AND PREDICTION', the models are trained on the datasets and output their performance in the form of a classification report.

A comparison of the best-performing model's (SVM) predictions with the true labels can be seen under 'PREDICTION COMPARISON WITH TRUE LABELS'. It has two subsections: 'GERMAN VS ENGLISH TRANSLATION' and 'ENGLISH VS GERMAN TRANSLATION'.


# Imports


In [None]:
import pandas as pd
import numpy as np
import re

# nltk and its downloads
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

# sklearn and the various models to train on training data
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
# to evaluate the models
from sklearn.metrics import accuracy_score, classification_report


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Reading SemEval and DAI (de_sentiment_UNIQUE) datasets

In [None]:
# Reading data to be trained on

SemEval = pd.read_csv('/content/drive/MyDrive/Education/University/Master/Classes/Thesis/Data/SemEval2017/SemEval2017_DataSet.csv')
SemEval = pd.DataFrame(SemEval)

DAI = pd.read_csv('/content/drive/MyDrive/Education/University/Master/Classes/Thesis/Data/DAI TU Berlin/de_sentiment_UNIQUE.csv')
DAI = pd.DataFrame(DAI)


# Reading german labelled and its English translation
German = pd.read_csv('/content/drive/MyDrive/Education/University/Master/Classes/Thesis/Data/OwnTweets/German/latestGerman.csv')
German = pd.DataFrame(German)

German_translated = pd.read_csv('/content/drive/MyDrive/Education/University/Master/Classes/Thesis/Data/OwnTweets/latestGermanTranlatedToEnglish.csv')
German_translated = pd.DataFrame(German_translated)

# Reading english labelled and its German translation
English = pd.read_csv('/content/drive/MyDrive/Education/University/Master/Classes/Thesis/Data/OwnTweets/English/latestEnglish.csv')
English = pd.DataFrame(English)

English_translated = pd.read_csv('/content/drive/MyDrive/Education/University/Master/Classes/Thesis/Data/OwnTweets/German/latestEnglishTranslatedToGERMAN.csv')
English_translated = pd.DataFrame(English_translated)

pd.set_option('display.max_colwidth', None)

# Pre-processing

In [None]:
import re
import nltk
from nltk.stem.cistem import Cistem  # German stemmer
from nltk.stem.porter import PorterStemmer  # English stemmer

from nltk.corpus import stopwords

from nltk.tokenize import TweetTokenizer 


nltk.download("stopwords")
tweet_tokenizer = TweetTokenizer()
stemmer_ENG = PorterStemmer()
stemmer_GER = Cistem()


"""
Some of the functionality of the tweet_to_words functions was inspired by https://gist.github.com/arybanary/4d6f7596825a4c95d74d9ec1597daefd#file-stemming_tweets-py .
Additional functionality has been added to it.
Extra additions to the processing functions:
- remove symbols and single German characters
- expand English contractions

"""
remove_symbols = '(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])'  #emojis, symbols, and punctuation
remove_symbols_GER = '(@[A-Za-z0-9]+)|([^0-9A-ZÄÜÖẞa-zäüöß \t])'  #emojis, symbols, and punctuation BUT KEEPING GERMAN SPECIAL CHARS
single_CHAR_GER = '(^| ).( |$)' #matches any single character between spaces

# removing negation from stopwords in German and English
Ger_stop = stopwords.words("german")
Ger_negation = ['nicht', 'nichts', 'keine', 'keinen', ]
for i in Ger_negation:
  Ger_stop.remove(i)

english_stop = stopwords.words('english')
english_negative = ['nor', 'not']
for i in english_negative:
  english_stop.remove(i)

#function to expand English contractions to their full form. Any contraction matching this will be expanded to WORD and EXPANSION i.e., you're -> you are
def expand(text): 
    
    # specific
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    # general
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    text = re.sub(r"wanna","want to", text)
    return text



def tweet_to_words_GER(text):
    # print("Original: ",text)
    text = text.lower() # Convert to lower case
    # print("Lowered: ", text)
    text = re.sub(r"(http\S+)|(www\S+)", "XXXURLXXX", text) # Replace links with meaningless URL indicator
    # print("Links: ",text)
    text = re.sub(r"#", "", text) # Remove '#' in front of hashtags so the words following the hashtag can still be analysed
    # print("Hashtag: ", text)
    text = re.sub(r"@\S+", "XXXUSERNAMEXXX", text) # Replace mentions with meaningless Username indicator
    # print("Mentions: ", text)
    text = re.sub(remove_symbols_GER," ", text) # removes symbols
    # print("Symbols: ",text)
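    text = re.sub(r"(?:(?<=^)|(?<=\s))(\d+[.,]*)+(?=$|\s)", "", text) # Remove all numbers not being part of alphanumeric word (raw string avoids invalid-escape warnings)
    # print("non-alpha: ",text)
    text = re.sub(r"rt ", "", text) # Remove the retweet marker 'rt'; note the pattern also matches inside words that end in 'rt'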
    # print("RT: ", text)  
    text = re.sub(single_CHAR_GER, " ", text)
    # print("Single CHAR: ", text)
    words = tweet_tokenizer.tokenize(text)
    # print("Tokenised: ",words)
    words = [w for w in words if w not in Ger_stop] # Remove stopwords
    # print("Stopwords: ", words)
    words = [stemmer_GER.stem(w) for w in words] # Stem
    
    return words
  
def tweet_to_words_ENG(text):
    # print("Original text: ", text)
    text = text.lower() # Convert to lower case
    # print("Lowered: ", text)
    text = re.sub(r"(http\S+)|(www\S+)", "XXXURLXXX", text) # Replace links with meaningless URL indicator
    # print("Links: ", text)
    text = re.sub(r"#", "", text) # Remove '#' in front of hashtags so the words following the hashtag can still be analysed
    # print("Hashtag: ", text)
    text = re.sub(r"@\S+", "XXXUSERNAMEXXX", text) # Replace mentions with meaningless Username indicator
    # print("Mentions: ", text)
    text = re.sub(remove_symbols," ", text) # removes symbols
    # print("Symbols: ",text)
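    text = expand(text) # expands contractions; apostrophes were already replaced by spaces in the step above, so forms like "haven't" arrive here as "haven t"
    # print("Expand: ", text)
    text = re.sub(r"(?:(?<=^)|(?<=\s))(\d+[.,]*)+(?=$|\s)", "", text) # Remove all numbers not being part of alphanumeric word (raw string avoids invalid-escape warnings)
    # print("non-Alpha: ", text)
    text = re.sub(r"rt ", "", text) # Remove the retweet marker 'rt'; note the pattern also matches inside words that end in 'rt'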
    # print("RT: ", text)
    words = tweet_tokenizer.tokenize(text)
    words = [w for w in words if w not in english_stop] # Remove stopwords
    words = [stemmer_ENG.stem(w) for w in words] # Stem
    
    return words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
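
Two details of the cleaning functions above depend on ordering and on pattern boundaries. In `tweet_to_words_ENG`, symbols are stripped before `expand` runs, so an apostrophe has already become a space by the time the contraction patterns are tried. The pattern `rt ` also matches inside ordinary words. The sketch below isolates both effects with simplified stand-ins. `strip_symbols`, `expand_negations`, and `remove_rt_marker` are hypothetical helpers written for illustration, not the notebook's functions, and adopting them would change the recorded outputs.

```python
import re

SYMBOLS = re.compile(r"[^0-9A-Za-z \t]")  # same character class as remove_symbols, minus the mention branch


def expand_negations(text):
    """Reduced version of expand(): only the negation contractions."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    return re.sub(r"n't", " not", text)


def strip_symbols(text):
    """Replace every non-alphanumeric symbol with a space."""
    return SYMBOLS.sub(" ", text)


def remove_rt_marker(text):
    """Remove 'rt' only when it stands alone as a word."""
    return re.sub(r"\brt\b\s*", "", text)


tweet = "they haven't voted"
strip_then_expand = expand_negations(strip_symbols(tweet))
expand_then_strip = strip_symbols(expand_negations(tweet))
print(strip_then_expand)  # they haven t voted
print(expand_then_strip)  # they have not voted

print(re.sub(r"rt ", "", "rt support you"))  # suppoyou
print(remove_rt_marker("rt support you"))    # support you
```

Expanding first keeps the negation word `not`, which the stopword lists above were deliberately edited to preserve.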


In [None]:
SemEval['clean'] = SemEval["Tweet"].apply(tweet_to_words_ENG)
DAI['clean'] = DAI['Tweet'].apply(tweet_to_words_GER)


# creates a randomly shuffled dataset of SemEval and DAI
# the shuffle is unseeded, so results differ slightly from run to run (different train/test rows), but the difference is insignificant
# pd.concat is used because DataFrame.append was removed in pandas 2.0
combinedData = pd.concat([SemEval, DAI], ignore_index=True)
combinedData = combinedData.sample(frac=1)

German['clean'] = German['Tweet'].apply(tweet_to_words_GER)
German_translated['clean'] = German_translated['Tweet'].apply(tweet_to_words_ENG)

English['clean'] = English["Tweet"].apply(tweet_to_words_ENG)

English_translated['clean'] = English_translated["Tweet"].apply(tweet_to_words_GER)


# Cleaning function demonstration

Uncomment the print statements in the two cleaning functions above to show the effect of each step, for example:
- print("Original: ", text)

In [None]:
to_clean_ENG = ["The left has really gone Full retard haven't they?", "@user @user @user @user many didn't vote GOP because they wanted a plan for social security and jobs.", "To motivate deep learning & resilience help youth develop initiative. Great tips from @user"]
for i in to_clean_ENG:
  print(tweet_to_words_ENG(i))

['left', 'realli', 'gone', 'full', 'retard']
['xxxusernamexxx', 'xxxusernamexxx', 'xxxusernamexxx', 'xxxusernamexxx', 'mani', 'vote', 'gop', 'want', 'plan', 'social', 'secur', 'job']
['motiv', 'deep', 'learn', 'resili', 'help', 'youth', 'develop', 'initi', 'great', 'tip', 'xxxusernamexxx']


In [None]:
to_clean_GER = ["Wie ich einfach schon wieder kotzen könnte. Und heulen. Am Besten abwechselnd. Scheiße.", "Wenn Zeit ist werde ich heute einen eBook-Reader von Sony mal testen ( beruflich natürlich)", "Wie die Zeit vergeht; wenn man Spaß hat."]
for i in to_clean_GER:
  print(tweet_to_words_GER(i))

# ['vollerei', 'lass', 'gruss', 'xxxusernametokenxxx', 'wirklich', 'langweil', 'nich', 'schlaf', ',', 'xxxurltokenxxx', 'geh', 'lieblingsserie', 'schau', ':d', '!', 'norma']

['einfach', 'schon', 'kotz', 'heul', 'bes', 'abwechsel', 'scheiss']
['zeit', 'heu', 'ebook', 'read', 'sony', 'mal', 'tes', 'beruflich', 'naturlich']
['zeit', 'vergeh', 'spass']


# Train Test split

In [None]:
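# training 75%, test 25%
# NOTE: iloc[:9213] stops before row 9213 and iloc[9214:] starts after it, so row 9213 is in neither split
SemEval_train = SemEval.iloc[:9213]
SemEval_test = SemEval.iloc[9214:]


# NOTE: likewise, row 1336 of DAI is in neither split
DAI_train = DAI.iloc[:1336]
DAI_test = DAI.iloc[1337:]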
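
The slices `iloc[:9213]` and `iloc[9214:]` leave row 9213 out of both sets, and the DAI slices do the same with row 1336. A single cut index used for both slices avoids that gap. The sketch below shows the idea on a plain list. `split_rows` is a hypothetical helper, and the row count of 12,284 is an illustrative figure chosen to be consistent with the slice boundaries above.

```python
def split_rows(rows, train_fraction=0.75):
    """Split a sequence into train/test parts that share one cut index."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


rows = list(range(12284))  # stand-in for the rows of a DataFrame
train, test = split_rows(rows)

print(len(train), len(test))           # 9213 3071
print(len(train) + len(test) == len(rows))  # True: no row is dropped
```

The same pattern carries over to a DataFrame as `df.iloc[:cut]` and `df.iloc[cut:]`. Switching to it would add one row to each test set, so the reported support counts would shift by one.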


In [None]:

# # For clarity:  function to convert string sentiment to lower, or int to lower string
# def format(df):
#   # labels = {-1: 'negative', 0: 'neutral', 1: 'positive'}
#   labels = {"Negative": 'negative', "Neutral" : 'neutral', "Positive" : 'positive'}
#   df['Sentiment'] = df['Sentiment'].map(labels)
#   return df[["Sentiment", "Tweet"]]



# SKLEARN MODELS

Preparation for easier model input


In [None]:

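# NOTE: The 'clean' column already holds lists of tokens, whereas the vectorizer's default lowercasing and regex tokenization expect raw strings.
# The identity lambdas for preprocessor and tokenizer pass the token lists through unchanged, which is why fit_transform only works with them.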
tfidf_SemEval = TfidfVectorizer(max_features=2000, ngram_range=(1,1), lowercase=False,preprocessor=lambda x: x, tokenizer=lambda x: x)
tfidf_DAI = TfidfVectorizer(max_features=2000, ngram_range=(1,1), lowercase=False, preprocessor=lambda x: x, tokenizer=lambda x: x)


classification_labels = ['Negative', 'Neutral', 'Positive']


# tfidf_SemEval = CountVectorizer(max_features=2000, ngram_range=(1,2), lowercase=False,preprocessor=lambda x: x, tokenizer=lambda x: x)
# tfidf_DAI = CountVectorizer(max_features=2000, ngram_range=(1,2), lowercase=False,preprocessor=lambda x: x, tokenizer=lambda x: x)
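
The identity `preprocessor` and `tokenizer` are needed because the `clean` column holds token lists rather than strings. The sketch below imitates the preprocess-then-tokenize chain in plain Python to show where a list input fails. `analyze` and `identity` are hypothetical stand-ins, not scikit-learn code. The regular expression is the default `token_pattern` documented for scikit-learn's text vectorizers.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def identity(x):
    """Pass the input through unchanged."""
    return x


def analyze(doc, preprocessor, tokenizer):
    """Simplified stand-in for a vectorizer's analysis chain."""
    return tokenizer(preprocessor(doc))


raw_tweet = "Great tips from user"
clean_tokens = ["great", "tip", "xxxusernamexxx"]  # what the 'clean' column holds

# Raw string with string-based steps: works.
print(analyze(raw_tweet, str.lower, TOKEN_PATTERN.findall))

# Token list with identity steps: the tokens are used as they are.
print(analyze(clean_tokens, identity, identity))

# Token list with the regex tokenizer: fails, because a regex needs a string.
try:
    analyze(clean_tokens, identity, TOKEN_PATTERN.findall)
except TypeError as err:
    print("TypeError:", err)
```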


In [None]:
#SemEval
SemEval_train_tweet = tfidf_SemEval.fit_transform(SemEval_train['clean'])
SemEval_train_label = SemEval_train['Sentiment']

SemEval_test_tweet = tfidf_SemEval.transform(SemEval_test['clean'])
SemEval_test_label = SemEval_test['Sentiment']

#DAI
DAI_train_tweet = tfidf_DAI.fit_transform(DAI_train['clean'])
DAI_train_label = DAI_train["Sentiment"]

DAI_test_tweet = tfidf_DAI.transform(DAI_test['clean'])
DAI_test_label = DAI_test["Sentiment"]

In [None]:

#German labelled and English translation
German_tweet = tfidf_DAI.transform(German['clean'])
German_label = German["Sentiment"]

German_translated_tweet = tfidf_SemEval.transform(German_translated['clean'])
German_translated_label = German_translated["Sentiment"]

#English labelled and German translation
English_tweet = tfidf_SemEval.transform(English['clean'])
English_label = English["Sentiment"]

English_translated_tweet = tfidf_DAI.transform(English_translated['clean'])
English_translated_label = English_translated["Sentiment"]


In [None]:
German.head(13)

Unnamed: 0,Sentiment,Chloe,Tweet,clean
0,0,-1.0,@Marcel126610 Irgendwie schon. Aber ...wer will ihn denn dann auch irgendwo rumsitzen haben...🤷‍♀️,"[xxxusernamexxx, irgendwie, schon, wer, irgendwo, rumsitz]"
1,0,1.0,@PewPeeew aight nichts leichter als das,"[xxxusernamexxx, aigh, nich, leich]"
2,-1,-1.0,"@LibertyHannes @RikeWaldfee @MarcoBuschmann Uhhhhh echt jetzt "" ich darf töten wenn ich will"" mimimi 😭 und was mit euren Blagen ist, ist mir doch egal. Puh was ich dazu sage #FDPmachtkrankundarm und #FDPunter5Prozent . Ihr seit auf einen guten Weg . 😂 Ach #dielinke erholt sich gerade in den ersten Umfragen wider. 😂","[xxxusernamexxx, xxxusernamexxx, xxxusernamexxx, uhhhhh, ech, darf, tot, mimimi, blag, egal, puh, sag, fdpmachtkrankundarm, fdpu, 5proz, seit, gut, ach, dielink, erhol, rad, ers, umfrag, wider]"
3,0,-1.0,"Unter #IhrHabtEuchSelbstAusgegrenzt meinte gerade ein Querdenker, es sollte doch konsequenzlos möglich sein sich nicht impfen zu lassen. Tja was soll ich sagen: Für Pfleger und Ärzte am Ende ihrer Kräfte war es auch nicht konsequenzlos, wenn Querdenker die Impfung verweigern.","[ihrhabteuchselbstausgegrenz, mein, rad, querdenk, konsequenzlo, moglich, nich, impf, lass, tja, sag, pfleg, arz, end, kraf, nich, konsequenzlo, querdenk, impfung, verweig]"
4,0,0.0,"Und zur Sicherheit noch einmal: es geht hier um die Etikettierung, nicht das Testprodukt selbst. Kein Einfluss auf Ergebnis. https://t.co/Nne9nYSPgV","[sicherhei, geh, etikettierung, nich, testproduk, einfluss, ergebni, xxxurlxxx]"
5,1,1.0,"@MarcBrup @broennimann Welches Land findest Du inspirierend? (Ehrlich gemeinte Frage) Mich inspirieren Menschen, Natur, Musik, etc..aber ein ganzes Land?🤔 pS: ich bin sehr viel gereist, und könnte höchstens sagen, dass die unterschiedlichsten Menschen und Kulturen inspirierend waren. Die Vielfalt.","[xxxusernamexxx, xxxusernamexxx, land, find, inspirier, ehrlich, mein, frag, inspirier, mensch, natur, musik, etc, ganz, land, ps, reis, hoch, sag, unterschiedlich, mensch, kultur, inspirier, vielfal]"
6,-1,-1.0,@besserossi @deprecatedCode @kotzlpotzl @drlisamaria Leichenfledderei! 😠,"[xxxusernamexxx, xxxusernamexxx, xxxusernamexxx, xxxusernamexxx, leichenfledderei]"
7,-1,-1.0,"Ihr wurdet nie ausgegrenzt, ihr habt euch separiert Ihr habt auf die Solidargemeinschaft geschissen Ihr habt euch von Rechtsextremisten unterstützen lassen Ihr werdet in Geschichtsbüchern nicht als Opfer sondern als Täter stehen #wirhabenausgegrenzt #SolidaritaetmitderWoelfin","[wurd, nie, ausgegrenz, hab, separie, hab, solidargemeinschaf, schiss, hab, rechtsextremi, unterstutz, lass, werd, schichtsbuch, nich, opfer, tater, steh, wirhabenausgegrenz, solidaritaetmitderwoelfi]"
8,-1,-1.0,"Der Anwalt Chan-jo Jun hat Twitter verlassen. Ständig forderte er, dass wir mehr tun im Bereich Hasskriminalität u. bekam sehr viele Morddrohungen. Verstehe, dass er jetzt genug hat. Tragisch ist sein Rückzug hier für unsere Demokratie dennoch. Hoffe sehr, er kommt zurück.","[anwal, cha, jo, jun, twitt, verlass, standig, ford, mehr, tun, bereich, hasskriminalita, bekam, viel, morddrohung, versteh, genug, tragisch, ruckzug, demokratie, dennoch, hoff, komm, zuruck]"
9,0,0.0,Günstiges #Russland-#Gas ist das beste Mittel gegen Vermögensfraß beim Bürger. Stattdessen kommen #Gruene mit #Gasumlage und anderen #Grunen Plagen.,"[gunstig, russla, gas, bes, mittel, vermogensfrass, beim, burg, stattdess, komm, gru, gasumlag, gru, plag]"


In [None]:
German_translated.head(13)

Unnamed: 0,Sentiment,Tweet,clean
0,0,@Marcel126610 Somehow yes. But...who wants it sitting around somewhere...🤷‍♀️,"[xxxusernamexxx, somehow, ye, want, sit, around, somewher]"
1,0,@PewPeeew aight nothing easier than that,"[xxxusernamexxx, aight, noth, easier]"
2,-1,"@LibertyHannes @RikeWaldfee @MarcoBuschmann Uhhhhh really now ""I can kill if I want"" mimimi 😭 and I don't care what's up with your brats. Puh what I say to #FDPmachtsickundarm and #FDPunter5Percent. You are on the right track. 😂 Oh #dielinke is recovering in the first polls. 😂","[xxxusernamexxx, xxxusernamexxx, xxxusernamexxx, uhhhhh, realli, kill, want, mimimi, care, brat, puh, say, fdpmachtsickundarm, fdpunter, 5percent, right, track, oh, dielink, recov, first, poll]"
3,0,"Under #YourHaveYouselfExcluded, a lateral thinker said that it should be possible not to be vaccinated without any consequences. Well, what can I say: For nurses and doctors at the end of their strength, it was not without consequences when lateral thinkers refused to be vaccinated.","[yourhaveyouselfexclud, later, thinker, said, possibl, not, vaccin, without, consequ, well, say, nurs, doctor, end, strength, not, without, consequ, later, thinker, refus, vaccin]"
4,0,"And to be on the safe side again: this is about the labeling, not the test product itself. No influence on the result. https://t.co/Nne9nYSPgV","[safe, side, label, not, test, product, influenc, result, xxxurlxxx]"
5,1,"@MarcBrup @broennimann Which country do you find inspirational? (Honest question) I'm inspired by people, nature, music, etc.. but a whole country?🤔 PS: I've traveled a lot and could at most say that the most diverse people and cultures were inspiring. The diversity.","[xxxusernamexxx, xxxusernamexxx, countri, find, inspir, honest, question, inspir, peopl, natur, music, etc, whole, countri, ps, travel, lot, could, say, divers, peopl, cultur, inspir, divers]"
6,-1,@besserossi @deprecatedCode @kotzlpotzl @drlisamaria scavenging! 😠,"[xxxusernamexxx, xxxusernamexxx, xxxusernamexxx, xxxusernamexxx, scaveng]"
7,-1,"You were never excluded, you separated. You gave a shit about the community of solidarity. You let right-wing extremists support you. You will not appear in history books as a victim but as a perpetrator #we have excluded #SolidaritaetmitderWoelfin","[never, exclud, separ, gave, shit, commun, solidar, let, right, wing, extremist, suppoy, not, appear, histori, book, victim, perpetr, exclud, solidaritaetmitderwoelfin]"
8,-1,Lawyer Chan-jo Jun has left Twitter. He kept demanding that we do more on hate crime and received many death threats. Understand that he's had enough now. His withdrawal here is still tragic for our democracy. I really hope he comes back.,"[lawyer, chan, jo, jun, left, twitter, kept, demand, hate, crime, receiv, mani, death, threat, understand, enough, withdraw, still, tragic, democraci, realli, hope, come, back]"
9,0,"Cheap #Russia #gas is the best remedy against the citizens' wealth being eaten up. Instead, #greens come with #gas surcharges and other #green plagues.","[cheap, russia, ga, best, remedi, citizen, wealth, eaten, instead, green, come, ga, surcharg, green, plagu]"


# ======= MODEL TRAINING AND PREDICTION =======

# Multinomial NB

In [None]:
#SemEval alone
nb_SemEval = MultinomialNB()
nb_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
pred_MNB_SemEval = nb_SemEval.predict(SemEval_test_tweet)
MNB_SemEval = classification_report(SemEval_test_label, pred_MNB_SemEval, target_names= classification_labels)

#SemEval on English labelled
nb_SemEval_english = MultinomialNB()
nb_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
pred_MNB_SemEval_english = nb_SemEval_english.predict(English_tweet)
MNB_SemEval_ENGLISH = classification_report(English_label, pred_MNB_SemEval_english, target_names= classification_labels)

#SemEval on German labelled translated into English
nb_SemEval_german = MultinomialNB()
nb_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
pred_MNB_SemEval_german = nb_SemEval_german.predict(German_translated_tweet)
MNB_SemEval_GERMAN = classification_report(German_translated_label, pred_MNB_SemEval_german, target_names= classification_labels)

#DAI alone
nb_DAI = MultinomialNB()
nb_DAI.fit(DAI_train_tweet, DAI_train_label)
pred_MNB_DAI = nb_DAI.predict(DAI_test_tweet)
MNB_DAI = classification_report(DAI_test_label, pred_MNB_DAI, target_names= classification_labels)


#DAI on German
nb_DAI_german = MultinomialNB()
nb_DAI_german.fit(DAI_train_tweet, DAI_train_label)
pred_MNB_DAI_german = nb_DAI_german.predict(German_tweet)
MNB_DAI_GERMAN = classification_report(German_label, pred_MNB_DAI_german, target_names= classification_labels)


#DAI on English labelled translated into German
nb_german = MultinomialNB()
nb_german.fit(DAI_train_tweet, DAI_train_label)
pred_MNB_German = nb_german.predict(English_translated_tweet)
MNB_DAI_ENGLISH  = classification_report(English_translated_label, pred_MNB_German, target_names= classification_labels)


  _warn_prf(average, modifier, msg_start, len(result))
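
Each group of three classifiers above is fitted on the same training matrix and labels, so for a deterministic estimator such as `MultinomialNB` the three fitted models are equivalent. One fitted model can therefore be scored on several test sets. The sketch below shows that pattern with a toy estimator so that it runs without scikit-learn. `MajorityClassifier`, `accuracy`, and `evaluate_on` are hypothetical names, and plain accuracy stands in for the classification report.

```python
class MajorityClassifier:
    """Toy stand-in for an estimator with fit/predict methods."""

    def fit(self, X, y):
        labels = list(y)
        # Remember the most frequent training label.
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def evaluate_on(model, test_sets):
    """Score one already-fitted model on every (X, y) pair in test_sets."""
    return {name: accuracy(y, model.predict(X)) for name, (X, y) in test_sets.items()}


model = MajorityClassifier().fit([[0], [1], [2]], [0, 0, 1])  # fitted once
scores = evaluate_on(model, {
    "held_out": ([[3], [4]], [0, 1]),
    "own_tweets": ([[5]], [0]),
})
print(scores)  # {'held_out': 0.5, 'own_tweets': 1.0}
```

Any object exposing `predict` could be passed as `model`, including the fitted classifiers in this notebook.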


# Logistic Regression:

In [None]:
#SemEval alone
LRmodel_SemEval = LogisticRegression( max_iter = 100)
LRmodel_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
pred_LR_SemEval = LRmodel_SemEval.predict(SemEval_test_tweet)
LOGREG_SemEval = classification_report(SemEval_test_label, pred_LR_SemEval, target_names= classification_labels)


#SemEval on English labelled
LRmodel_SemEval_english = LogisticRegression(max_iter = 100)
LRmodel_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
pred_LR_SemEval_english = LRmodel_SemEval_english.predict(English_tweet)
LOGREG_SemEval_ENGLISH = classification_report(English_label, pred_LR_SemEval_english, target_names= classification_labels)

#SemEval on German labelled translated into English
LRmodel_SemEval_german = LogisticRegression(max_iter = 100)
LRmodel_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
pred_LR_SemEval_german = LRmodel_SemEval_german.predict(German_translated_tweet)
LOGREG_SemEval_GERMAN = classification_report(German_translated_label, pred_LR_SemEval_german, target_names= classification_labels)

#DAI alone
LRmodel_DAI = LogisticRegression(max_iter = 100)
LRmodel_DAI.fit(DAI_train_tweet, DAI_train_label)
pred_LR_DAI = LRmodel_DAI.predict(DAI_test_tweet)
LOGREG_DAI = classification_report(DAI_test_label, pred_LR_DAI, target_names= classification_labels)

#DAI on German
LRmodel_DAI_german = LogisticRegression(max_iter = 100)
LRmodel_DAI_german.fit(DAI_train_tweet, DAI_train_label)
pred_LR_DAI_german = LRmodel_DAI_german.predict(German_tweet)
LOGREG_DAI_GERMAN = classification_report(German_label, pred_LR_DAI_german, target_names= classification_labels)


#DAI on English labelled translated into German
LRmodel_german = LogisticRegression(max_iter = 100)
LRmodel_german.fit(DAI_train_tweet, DAI_train_label)
pred_LR_german = LRmodel_german.predict(English_translated_tweet)
LOGREG_DAI_ENGLISH  = classification_report(English_translated_label, pred_LR_german, target_names= classification_labels)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

# Support Vector Machine

In [None]:
#SemEval alone
svcl_SemEval = svm.SVC()
svcl_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
pred_SVC_SemEval = svcl_SemEval.predict(SemEval_test_tweet)
SVM_SemEval = classification_report(SemEval_test_label, pred_SVC_SemEval, target_names= classification_labels) # arguments: true labels, predicted labels

#SemEval on English labelled
svcl_SemEval_english = svm.SVC()
svcl_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
pred_SVC_SemEval_english = svcl_SemEval_english.predict(English_tweet)
SVM_SemEval_ENGLISH = classification_report(English_label, pred_SVC_SemEval_english, target_names= classification_labels) # arguments: true labels, predicted labels


#SemEval on German labelled translated into English
svcl_SemEval_german = svm.SVC()
svcl_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
pred_SVC_SemEval_german = svcl_SemEval_german.predict(German_translated_tweet)
SVM_SemEval_GERMAN = classification_report(German_translated_label, pred_SVC_SemEval_german, target_names= classification_labels) # arguments: true labels, predicted labels

#DAI alone
svcl_DAI = svm.SVC()
svcl_DAI.fit(DAI_train_tweet, DAI_train_label)
pred_SVC_DAI = svcl_DAI.predict(DAI_test_tweet)
SVM_DAI = classification_report(DAI_test_label, pred_SVC_DAI, target_names= classification_labels) # arguments: true labels, predicted labels

#DAI on German
svcl_DAI_german = svm.SVC()
svcl_DAI_german.fit(DAI_train_tweet, DAI_train_label)
pred_SVC_DAI_german = svcl_DAI_german.predict(German_tweet)
SVM_DAI_GERMAN = classification_report(German_label, pred_SVC_DAI_german, target_names= classification_labels) # arguments: true labels, predicted labels


#DAI on English labelled translated into German
svcl_german = svm.SVC()
svcl_german.fit(DAI_train_tweet, DAI_train_label)
pred_SVC_german = svcl_german.predict(English_translated_tweet)
SVM_DAI_ENGLISH = classification_report(English_translated_label, pred_SVC_german, target_names= classification_labels) # arguments: true labels, predicted labels


# K Nearest Neighbor

In [None]:
#SemEval alone
knn_SemEval = KNeighborsClassifier(n_neighbors=4)
knn_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
pred_knn_SemEval = knn_SemEval.predict(SemEval_test_tweet)
KNN_SemEval = classification_report(SemEval_test_label, pred_knn_SemEval, target_names= classification_labels) # arguments: true labels, predicted labels

#SemEval on English labelled
knn_SemEval_english = KNeighborsClassifier(n_neighbors=4)
knn_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
pred_knn_SemEval_english = knn_SemEval_english.predict(English_tweet)
KNN_SemEval_ENGLISH = classification_report(English_label, pred_knn_SemEval_english, target_names= classification_labels) # arguments: true labels, predicted labels

#SemEval on German labelled translated into English
knn_SemEval_german = KNeighborsClassifier(n_neighbors=4)
knn_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
pred_knn_SemEval_german = knn_SemEval_german.predict(German_translated_tweet)
KNN_SemEval_GERMAN = classification_report(German_translated_label, pred_knn_SemEval_german, target_names= classification_labels) # arguments: true labels, predicted labels

#DAI alone
knn_DAI = KNeighborsClassifier(n_neighbors=4)
knn_DAI.fit(DAI_train_tweet, DAI_train_label)
pred_knn_DAI = knn_DAI.predict(DAI_test_tweet)
KNN_DAI = classification_report(DAI_test_label, pred_knn_DAI, target_names= classification_labels) # arguments: true labels, predicted labels

#DAI on German
knn_DAI_german = KNeighborsClassifier(n_neighbors=4)
knn_DAI_german.fit(DAI_train_tweet, DAI_train_label)
pred_knn_DAI_german = knn_DAI_german.predict(German_tweet)
KNN_DAI_GERMAN = classification_report(German_label, pred_knn_DAI_german, target_names= classification_labels) # arguments: true labels, predicted labels

#DAI on English labelled translated into German
knn_german = KNeighborsClassifier(n_neighbors=4)
knn_german.fit(DAI_train_tweet, DAI_train_label)
pred_knn_german = knn_german.predict(English_translated_tweet)
KNN_DAI_ENGLISH = classification_report(English_translated_label, pred_knn_german, target_names= classification_labels) # arguments: true labels, predicted labels


# SEMEVAL ALONE RESULTS

In [None]:
print("====================== RESULTS FOR SemEval on itself ==================\n")

print("=====Support Vector Machine_SemEval=====\n")
print(SVM_SemEval)

print("=====Multinominal NB_SemEval=====\n")
print(MNB_SemEval)

print("=====Logistic Regression_SemEval=====\n")
print(LOGREG_SemEval)

print("=====K Nearest Neighbor_SemEval=====\n")
print(KNN_SemEval)

# print("===== SGD_SemEval=====\n")
# print(SGD_SemEval)

# print("=====LINEAR SVC_SemEval=====\n")
# print(LINEARSVC_SemEval)

# print("=====Bernoulli NB_SemEval=====\n")
# print(BNB_SemEval)

# print("=====Random Forest_SemEval=====\n")
# print(RANFOR_SemEval)


=====Support Vector Machine_SemEval=====

              precision    recall  f1-score   support

    Negative       0.68      0.53      0.59      1010
     Neutral       0.60      0.80      0.69      1476
    Positive       0.71      0.39      0.50       584

    accuracy                           0.63      3070
   macro avg       0.67      0.57      0.60      3070
weighted avg       0.65      0.63      0.62      3070

=====Multinominal NB_SemEval=====

              precision    recall  f1-score   support

    Negative       0.67      0.56      0.61      1010
     Neutral       0.61      0.77      0.68      1476
    Positive       0.64      0.39      0.48       584

    accuracy                           0.63      3070
   macro avg       0.64      0.57      0.59      3070
weighted avg       0.63      0.63      0.62      3070

=====Logistic Regression_SemEval=====

              precision    recall  f1-score   support

    Negative       0.66      0.57      0.61      1010
     Neutral
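
The summary rows of a classification report can be recomputed from the per-class rows, which helps when reading the tables above. The sketch below uses the recall, f1-score, and support figures copied from the Support Vector Machine_SemEval report. The macro average is an unweighted mean over the classes, the weighted average weights each class by its support, and the support-weighted recall equals the accuracy. The variable names are illustrative only.

```python
# Per-class figures copied from the Support Vector Machine_SemEval report.
rows = {
    "Negative": {"recall": 0.53, "f1": 0.59, "support": 1010},
    "Neutral": {"recall": 0.80, "f1": 0.69, "support": 1476},
    "Positive": {"recall": 0.39, "f1": 0.50, "support": 584},
}

total = sum(r["support"] for r in rows.values())

# Macro average: every class counts equally.
macro_recall = sum(r["recall"] for r in rows.values()) / len(rows)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(r["recall"] * r["support"] for r in rows.values()) / total
weighted_f1 = sum(r["f1"] * r["support"] for r in rows.values()) / total

print(total)                      # 3070
print(round(macro_recall, 2))     # 0.57
print(round(weighted_recall, 2))  # 0.63, the accuracy row
print(round(weighted_f1, 2))      # 0.62
```

Because the inputs here are already rounded to two decimals, some recomputed cells can differ from the printed report by 0.01.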

In [None]:
print("====================== RESULTS FOR SemEval ON ENGLISH DATASET ==================\n")

print("=====Support Vector Machine_SemEval_ENGLISH=====\n")
print(SVM_SemEval_ENGLISH)

print("=====Multinominal NB_SemEval_ENGLISH=====\n")
print(MNB_SemEval_ENGLISH)

print("=====Logistic Regression_SemEval_ENGLISH=====\n")
print(LOGREG_SemEval_ENGLISH)

print("=====K Nearest Neighbor_SemEval_ENGLISH=====\n")
print(KNN_SemEval_ENGLISH)

# print("===== SGD_SemEval_ENGLISH=====\n")
# print(SGD_SemEval_ENGLISH)

# print("=====LINEAR SVC_SemEval_ENGLISH=====\n")
# print(LINEARSVC_SemEval_ENGLISH)

# print("=====Bernoulli NB_SemEval_ENGLISH=====\n")
# print(BNB_SemEval_ENGLISH)

# print("=====Random Forest_SemEval_ENGLISH=====\n")
# print(RANFOR_SemEval_ENGLISH)


=====Support Vector Machine_SemEval_ENGLISH=====

              precision    recall  f1-score   support

    Negative       0.37      0.39      0.38       132
     Neutral       0.77      0.84      0.80       679
    Positive       0.63      0.39      0.48       167

    accuracy                           0.70       978
   macro avg       0.59      0.54      0.56       978
weighted avg       0.70      0.70      0.69       978

=====Multinominal NB_SemEval_ENGLISH=====

              precision    recall  f1-score   support

    Negative       0.35      0.44      0.39       132
     Neutral       0.77      0.82      0.79       679
    Positive       0.63      0.31      0.42       167

    accuracy                           0.68       978
   macro avg       0.58      0.52      0.53       978
weighted avg       0.69      0.68      0.67       978

=====Logistic Regression_SemEval_ENGLISH=====

              precision    recall  f1-score   support

    Negative       0.36      0.45      0.4

In [None]:
print("====================== RESULTS FOR SemEval ON GERMAN Translated DATASET ==================\n")

print("=====Support Vector Machine_SemEval_GERMAN=====\n")
print(SVM_SemEval_GERMAN)

print("=====Multinominal NB_SemEval_GERMAN=====\n")
print(MNB_SemEval_GERMAN)

print("=====Logistic Regression_SemEval_GERMAN=====\n")
print(LOGREG_SemEval_GERMAN)

print("=====K Nearest Neighbor_SemEval_GERMAN=====\n")
print(KNN_SemEval_GERMAN)

# print("===== SGD_SemEval_GERMAN=====\n")
# print(SGD_SemEval_GERMAN)

# print("=====LINEAR SVC_SemEval_GERMAN=====\n")
# print(LINEARSVC_SemEval_GERMAN)

# print("=====Bernoulli NB_SemEval_GERMAN=====\n")
# print(BNB_SemEval_GERMAN)

# print("=====Random Forest_SemEval_GERMAN=====\n")
# print(RANFOR_SemEval_GERMAN)


=====Support Vector Machine_SemEval_GERMAN=====

              precision    recall  f1-score   support

    Negative       0.36      0.61      0.45       132
     Neutral       0.88      0.78      0.83       735
    Positive       0.58      0.52      0.55        79

    accuracy                           0.73       946
   macro avg       0.61      0.63      0.61       946
weighted avg       0.78      0.73      0.75       946

=====Multinominal NB_SemEval_GERMAN=====

              precision    recall  f1-score   support

    Negative       0.33      0.61      0.43       132
     Neutral       0.87      0.75      0.81       735
    Positive       0.52      0.44      0.48        79

    accuracy                           0.71       946
   macro avg       0.57      0.60      0.57       946
weighted avg       0.76      0.71      0.73       946

=====Logistic Regression_SemEval_GERMAN=====

              precision    recall  f1-score   support

    Negative       0.34      0.67      0.45  

In [None]:
print("====================== RESULTS FOR DAI ALONE ==================\n")

print("=====Support Vector Machine_DAI=====\n")
print(SVM_DAI)

print("=====Multinominal NB_DAI=====\n")
print(MNB_DAI)

print("=====Logistic Regression_DAI=====\n")
print(LOGREG_DAI)

print("=====K Nearest Neighbor_DAI=====\n")
print(KNN_DAI)

# print("===== SGD_DAI=====\n")
# print(SGD_DAI)

# print("=====LINEARSVC_DAI=====\n")
# print(LINEARSVC_DAI)

# print("=====Bernoulli NB_DAI=====\n")
# print(BNB_DAI)

# print("=====Random Forest_DAI=====\n")
# print(RANFOR_DAI)


=====Support Vector Machine_DAI=====

              precision    recall  f1-score   support

    Negative       1.00      0.04      0.08        74
     Neutral       0.68      0.98      0.80       291
    Positive       0.65      0.22      0.32        79

    accuracy                           0.68       444
   macro avg       0.78      0.41      0.40       444
weighted avg       0.73      0.68      0.60       444

=====Multinominal NB_DAI=====

              precision    recall  f1-score   support

    Negative       1.00      0.01      0.03        74
     Neutral       0.67      0.99      0.80       291
    Positive       0.69      0.11      0.20        79

    accuracy                           0.67       444
   macro avg       0.79      0.37      0.34       444
weighted avg       0.73      0.67      0.56       444

=====Logistic Regression_DAI=====

              precision    recall  f1-score   support

    Negative       0.62      0.07      0.12        74
     Neutral       0.69 
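
The f1-score is the harmonic mean of precision and recall, so a class with perfect precision but almost no recall still scores close to zero. A small sketch using the rounded Negative-class values from the Support Vector Machine_DAI report above; the function name `f1_from` is illustrative, and returning 0.0 when both inputs are zero is a choice made for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Negative class of the DAI-trained SVM: precision 1.00, recall 0.04
negative_f1 = f1_from(1.00, 0.04)
print(round(negative_f1, 2))  # 0.08
```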

In [None]:
print("====================== RESULTS FOR DAI ON GERMAN ==================\n")

print("=====Support Vector Machine_DAI_GERMAN=====\n")
print(SVM_DAI_GERMAN)

print("=====Multinominal NB_DAI_GERMAN=====\n")
print(MNB_DAI_GERMAN)

print("=====Logistic Regression_DAI_GERMAN=====\n")
print(LOGREG_DAI_GERMAN)

print("=====K Nearest Neighbor_DAI_GERMAN=====\n")
print(KNN_DAI_GERMAN)

# print("===== SGD_DAI_GERMAN=====\n")
# print(SGD_DAI_GERMAN)

# print("=====LINEAR SVC_DAI_GERMAN=====\n")
# print(LINEARSVC_DAI_GERMAN)

# print("=====Bernoulli NB_DAI_GERMAN=====\n")
# print(BNB_DAI_GERMAN)

# print("=====Random Forest_DAI_GERMAN=====\n")
# print(RANFOR_DAI_GERMAN)


=====Support Vector Machine_DAI_GERMAN=====

              precision    recall  f1-score   support

    Negative       0.75      0.02      0.04       132
     Neutral       0.79      0.99      0.88       735
    Positive       0.57      0.10      0.17        79

    accuracy                           0.78       946
   macro avg       0.70      0.37      0.36       946
weighted avg       0.76      0.78      0.70       946

=====Multinominal NB_DAI_GERMAN=====

              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00       132
     Neutral       0.78      1.00      0.88       735
    Positive       1.00      0.04      0.07        79

    accuracy                           0.78       946
   macro avg       0.59      0.35      0.32       946
weighted avg       0.69      0.78      0.69       946

=====Logistic Regression_DAI_GERMAN=====

              precision    recall  f1-score   support

    Negative       0.24      0.05      0.09       132
   

In [None]:
print("====================== RESULTS FOR DAI ENGLISH TRANSLATED ==================\n")

print("=====Support Vector Machine_DAI_ENGLISH=====\n")
print(SVM_DAI_ENGLISH)

print("=====Multinominal NB_DAI_ENGLISH=====\n")
print(MNB_DAI_ENGLISH)

print("=====Logistic Regression_DAI_ENGLISH=====\n")
print(LOGREG_DAI_ENGLISH)

print("=====K Nearest Neighbor_DAI_ENGLISH=====\n")
print(KNN_DAI_ENGLISH)

# print("===== SGD_DAI_ENGLISH=====\n")
# print(SGD_DAI_ENGLISH)

# print("=====LINEARSVC_DAI_ENGLISH=====\n")
# print(LINEARSVC_DAI_ENGLISH)

# print("=====Bernoulli NB_DAI_ENGLISH=====\n")
# print(BNB_DAI_ENGLISH)

# print("=====Random Forest_DAI_ENGLISH=====\n")
# print(RANFOR_DAI_ENGLISH)


=====Support Vector Machine_DAI_ENGLISH=====

              precision    recall  f1-score   support

    Negative       0.25      0.01      0.01       132
     Neutral       0.70      0.98      0.82       679
    Positive       0.54      0.08      0.15       167

    accuracy                           0.70       978
   macro avg       0.50      0.36      0.33       978
weighted avg       0.61      0.70      0.60       978

=====Multinominal NB_DAI_ENGLISH=====

              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00       132
     Neutral       0.70      1.00      0.82       679
    Positive       0.83      0.03      0.06       167

    accuracy                           0.70       978
   macro avg       0.51      0.34      0.29       978
weighted avg       0.63      0.70      0.58       978

=====Logistic Regression_DAI_ENGLISH=====

              precision    recall  f1-score   support

    Negative       0.38      0.08      0.13       1
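
One way to read the DAI-trained results is to compare their accuracy with a baseline that always predicts the most frequent class. A rough sketch using the support counts printed in the reports above; the baseline is an illustration only and was not part of the original experiments, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
# Support counts (true samples per class) copied from the reports above.
supports = {
    "German tweets (946)": {"Negative": 132, "Neutral": 735, "Positive": 79},
    "English tweets (978)": {"Negative": 132, "Neutral": 679, "Positive": 167},
}

def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

baselines = {name: majority_baseline_accuracy(counts)
             for name, counts in supports.items()}
for name, accuracy in baselines.items():
    print(f"{name}: {accuracy:.2f}")
```

With these counts the baseline comes out at roughly 0.78 for the German tweets and 0.69 for the English tweets, close to the 0.78 and 0.70 accuracy reported for the DAI-trained SVM. That fits the Neutral recall of 0.98 to 0.99 in those reports.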

I also tested how well a combined dataset of SemEval and German_UNIQUE performs with traditional machine learning (TML), as it did well with BERT.

=> IT DID NOT DO WELL WITH TML


# ====== PREDICTION COMPARISON WITH TRUE LABELS ======

# GERMAN VS ENGLISH TRANSLATION

In [None]:
# German tweets translated into English: true labels, SVM predictions and tweet text side by side
English_translation_tweet = pd.DataFrame(German_translated["Tweet"])
English_translation_y_true = pd.DataFrame(German_translated_label)
English_translation_y_predicted = pd.DataFrame(pred_SVC_SemEval_german)
# this is the same object as English_translation_y_true, so the columns below are added to it in place
SVM_UNIGRAM_ENGLISH_TRANS_DATAFRAME = English_translation_y_true
SVM_UNIGRAM_ENGLISH_TRANS_DATAFRAME["Predicted"] = English_translation_y_predicted
SVM_UNIGRAM_ENGLISH_TRANS_DATAFRAME["Tweet"] = English_translation_tweet

# original German tweets: true labels, SVM predictions and tweet text side by side
German_tweet = pd.DataFrame(German["Tweet"])
German_y_true = pd.DataFrame(German_label)
German_y_predicted = pd.DataFrame(pred_SVC_DAI_german)

SVM_UNIGRAM_GERMAN_DATAFRAME = German_y_true
SVM_UNIGRAM_GERMAN_DATAFRAME["Predicted"] = German_y_predicted
SVM_UNIGRAM_GERMAN_DATAFRAME["Tweet"] = German_tweet


In [None]:
SVM_UNIGRAM_GERMAN_DATAFRAME.head(13)

Unnamed: 0,Sentiment,Predicted,Tweet
0,0,0,@Marcel126610 Irgendwie schon. Aber ...wer will ihn denn dann auch irgendwo rumsitzen haben...🤷‍♀️
1,0,0,@PewPeeew aight nichts leichter als das
2,-1,0,"@LibertyHannes @RikeWaldfee @MarcoBuschmann Uhhhhh echt jetzt "" ich darf töten wenn ich will"" mimimi 😭 und was mit euren Blagen ist, ist mir doch egal. Puh was ich dazu sage #FDPmachtkrankundarm und #FDPunter5Prozent . Ihr seit auf einen guten Weg . 😂 Ach #dielinke erholt sich gerade in den ersten Umfragen wider. 😂"
3,0,0,"Unter #IhrHabtEuchSelbstAusgegrenzt meinte gerade ein Querdenker, es sollte doch konsequenzlos möglich sein sich nicht impfen zu lassen. Tja was soll ich sagen: Für Pfleger und Ärzte am Ende ihrer Kräfte war es auch nicht konsequenzlos, wenn Querdenker die Impfung verweigern."
4,0,0,"Und zur Sicherheit noch einmal: es geht hier um die Etikettierung, nicht das Testprodukt selbst. Kein Einfluss auf Ergebnis. https://t.co/Nne9nYSPgV"
5,1,0,"@MarcBrup @broennimann Welches Land findest Du inspirierend? (Ehrlich gemeinte Frage) Mich inspirieren Menschen, Natur, Musik, etc..aber ein ganzes Land?🤔 pS: ich bin sehr viel gereist, und könnte höchstens sagen, dass die unterschiedlichsten Menschen und Kulturen inspirierend waren. Die Vielfalt."
6,-1,0,@besserossi @deprecatedCode @kotzlpotzl @drlisamaria Leichenfledderei! 😠
7,-1,0,"Ihr wurdet nie ausgegrenzt, ihr habt euch separiert Ihr habt auf die Solidargemeinschaft geschissen Ihr habt euch von Rechtsextremisten unterstützen lassen Ihr werdet in Geschichtsbüchern nicht als Opfer sondern als Täter stehen #wirhabenausgegrenzt #SolidaritaetmitderWoelfin"
8,-1,0,"Der Anwalt Chan-jo Jun hat Twitter verlassen. Ständig forderte er, dass wir mehr tun im Bereich Hasskriminalität u. bekam sehr viele Morddrohungen. Verstehe, dass er jetzt genug hat. Tragisch ist sein Rückzug hier für unsere Demokratie dennoch. Hoffe sehr, er kommt zurück."
9,0,0,Günstiges #Russland-#Gas ist das beste Mittel gegen Vermögensfraß beim Bürger. Stattdessen kommen #Gruene mit #Gasumlage und anderen #Grunen Plagen.


In [None]:
SVM_UNIGRAM_ENGLISH_TRANS_DATAFRAME.head(13)

Unnamed: 0,Sentiment,Predicted,Tweet
0,0,0,@Marcel126610 Somehow yes. But...who wants it sitting around somewhere...🤷‍♀️
1,0,-1,@PewPeeew aight nothing easier than that
2,-1,-1,"@LibertyHannes @RikeWaldfee @MarcoBuschmann Uhhhhh really now ""I can kill if I want"" mimimi 😭 and I don't care what's up with your brats. Puh what I say to #FDPmachtsickundarm and #FDPunter5Percent. You are on the right track. 😂 Oh #dielinke is recovering in the first polls. 😂"
3,0,-1,"Under #YourHaveYouselfExcluded, a lateral thinker said that it should be possible not to be vaccinated without any consequences. Well, what can I say: For nurses and doctors at the end of their strength, it was not without consequences when lateral thinkers refused to be vaccinated."
4,0,0,"And to be on the safe side again: this is about the labeling, not the test product itself. No influence on the result. https://t.co/Nne9nYSPgV"
5,1,-1,"@MarcBrup @broennimann Which country do you find inspirational? (Honest question) I'm inspired by people, nature, music, etc.. but a whole country?🤔 PS: I've traveled a lot and could at most say that the most diverse people and cultures were inspiring. The diversity."
6,-1,0,@besserossi @deprecatedCode @kotzlpotzl @drlisamaria scavenging! 😠
7,-1,-1,"You were never excluded, you separated. You gave a shit about the community of solidarity. You let right-wing extremists support you. You will not appear in history books as a victim but as a perpetrator #we have excluded #SolidaritaetmitderWoelfin"
8,-1,-1,Lawyer Chan-jo Jun has left Twitter. He kept demanding that we do more on hate crime and received many death threats. Understand that he's had enough now. His withdrawal here is still tragic for our democracy. I really hope he comes back.
9,0,0,"Cheap #Russia #gas is the best remedy against the citizens' wealth being eaten up. Instead, #greens come with #gas surcharges and other #green plagues."
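
The two tables above can be compared row by row. A minimal sketch that uses only the ten rows displayed above (not the full datasets) to count how often each prediction matches the true label; the list and function names are illustrative.

```python
# Labels copied from the ten rows shown above: -1 negative, 0 neutral, 1 positive.
y_true = [0, 0, -1, 0, 0, 1, -1, -1, -1, 0]
pred_on_german = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
pred_on_english_translation = [0, -1, -1, -1, 0, -1, 0, -1, -1, 0]

def agreement(true_labels, predicted_labels):
    """Fraction of rows where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

print(agreement(y_true, pred_on_german))               # 0.5
print(agreement(y_true, pred_on_english_translation))  # 0.6
```

Ten rows are far too few to draw conclusions, but they show the pattern from the reports: the predictions on the German tweets are all Neutral here, while the predictions on the English translation also pick up several of the Negative tweets.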


# ENGLISH VS GERMAN TRANSLATION

In [None]:
# original English tweets: true labels, SVM predictions and tweet text side by side
English_tweet = pd.DataFrame(English['Tweet'])
English_y_true = pd.DataFrame(English_label)
English_y_pred = pd.DataFrame(pred_SVC_SemEval_english)
# this is the same object as English_y_true, so the columns below are added to it in place
SVM_UNIGRAM_ENGLISH_DATAFRAME = English_y_true
SVM_UNIGRAM_ENGLISH_DATAFRAME['Predicted'] = English_y_pred
SVM_UNIGRAM_ENGLISH_DATAFRAME['Tweet'] = English_tweet

# English tweets translated into German: true labels, SVM predictions and tweet text side by side
German_translation_tweet = pd.DataFrame(English_translated["Tweet"])
German_translation_y_true = pd.DataFrame(English_translated_label)
German_translation_y_pred = pd.DataFrame(pred_SVC_german)

SVM_UNIGRAM_GERMAN_TRANSLATION_DATAFRAME = German_translation_y_true
SVM_UNIGRAM_GERMAN_TRANSLATION_DATAFRAME["Predicted"] = German_translation_y_pred
SVM_UNIGRAM_GERMAN_TRANSLATION_DATAFRAME["Tweet"] = German_translation_tweet


In [None]:
SVM_UNIGRAM_ENGLISH_DATAFRAME.head(13)

Unnamed: 0,Sentiment,Predicted,Tweet
0,-1,0,If you got ex drama please don’t hit me up ✌🏽
1,0,0,"If the World Bank wants to support the people, it must urge #Ethiopia to #EndTigraySiege and allow unhindered humanitarian access to #Tigray. #WorldBankStopFundingTigrayGenocide @SecBlinken @JosepBorrellF @WorldBank @UN_HRC @DavidMalpassWBG @IMFNews"
2,1,0,"Amusan and Duplantis, two athletes that broke records at Oregon 22 - https://t.co/z4ry6UPGgm"
3,0,0,FACT https://t.co/0ZJlCLlytP
4,0,0,220726 © 时尚先生Esquire — ESQUIRE China shares a teaser image of Wu Lei on the upcoming cover of ESQUIRE’s August issue! #吴磊 • #wulei • #leowu • #อู๋เหล่ย ࿐ https://t.co/lbm2HKv0Df
5,0,0,@_FacundoZapata Only listen to your soul!
6,1,0,@HibaycOfficial Good project with most potential and huge opportunity @malicknajim @FaridaZamou @JuniorS54441117 #NFTs #nftart #ETH #Hibayc #Airdrops
7,0,1,Jai Nice got engaged and nobody knows who her man even is… MOOD🥂
8,-1,0,The Employment Tribunal found that Garden Court Chambers discriminated against me because of my gender critical belief when it published a statement that I was under investigation and in upholding Stonewall’s complaint against me. #AllisonBaileyWins
9,-1,-1,@fra27236945 @thatdayin1992 Why more than 80% of world population genuinely hate US/NATO/EU? Why do all of them so eagerly wait a moment of liberation? Russia is not alone. 80% of world is on Russia's side. The evil western empire will inevitably fall. The westerners cannot exploit the others for ever.
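
To see which kinds of mistakes occur in the rows above, the (true label, predicted label) pairs can be tallied. A short sketch over the ten displayed English rows only, using `collections.Counter` from the standard library; the variable names are illustrative.

```python
from collections import Counter

# Labels copied from the ten English rows shown above.
y_true = [-1, 0, 1, 0, 0, 0, 1, 0, -1, -1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, -1]

# Count each (true label, predicted label) combination.
pair_counts = Counter(zip(y_true, y_pred))
for (true_label, predicted_label), count in sorted(pair_counts.items()):
    print(f"true={true_label:2d} predicted={predicted_label:2d} rows={count}")
```

In this small sample most errors are Negative or Positive tweets that were predicted as Neutral.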


In [None]:
SVM_UNIGRAM_GERMAN_TRANSLATION_DATAFRAME.head(13)

# UNUSED FUNCTIONS ============================================

# UNUSED get tag

Function that maps the POS tag of a given token to the WordNet POS code used for lemmatisation



In [None]:
# function to map the POS tag of a given token to a WordNet POS code, to achieve the optimal lemmatisation of the token
# note: despite its name, the 'token' parameter receives the POS tag (e.g. "VBD"), not the word itself
def get_tag(token):
    if token in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):  # VERBS
        return 'v'
    elif token in ("JJ", "JJR", "JJS"):  # ADJECTIVES
        return 'a'
    elif token in ("RB", "RBR", "RBS"):  # ADVERBS
        return 'r'
    elif token in ("NN", "NNP", "NNS"):  # NOUNS
        return 'n'
    else:
        # any other tag is not lemmatised with a specific POS
        return "unwanted tag"
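
The same mapping can also be written as a lookup table, which keeps the tag list in one place. A sketch, assuming the tags are Penn Treebank tags as returned by `nltk.pos_tag`; `PENN_TO_WORDNET` and `get_tag_from_table` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Penn Treebank tag -> WordNet POS code, mirroring the branches of get_tag.
PENN_TO_WORDNET = {
    **dict.fromkeys(["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"], "v"),  # verbs
    **dict.fromkeys(["JJ", "JJR", "JJS"], "a"),                       # adjectives
    **dict.fromkeys(["RB", "RBR", "RBS"], "r"),                       # adverbs
    **dict.fromkeys(["NN", "NNP", "NNS"], "n"),                       # nouns
}

def get_tag_from_table(tag):
    """Return the WordNet POS code for a tag, or "unwanted tag" if unmapped."""
    return PENN_TO_WORDNET.get(tag, "unwanted tag")

print(get_tag_from_table("VBG"))   # v
print(get_tag_from_table("NNPS"))  # unwanted tag (NNPS is not listed in get_tag either)
```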

# UNUSED base

Function to get the base state of a given word through lemmatisation.

Note: It fails to lemmatise some words correctly.

In [None]:
lemmatizer = WordNetLemmatizer()

# function 'base' tokenises the input, then gets the POS tag of every token, then uses
# the POS tag to determine how to accurately lemmatise the token
def base(text):
    # example input: "Peter is walking very slowly to the banks He is the slowest walker"
    tokens = word_tokenize(text)   # tokenises the sentence
    tagged = nltk.pos_tag(tokens)  # gets the POS tag for each token: (word, tag) pairs
    lemmatized = ""
    for word, tag in tagged:
        pos = get_tag(tag)
        if pos in ('v', 'a', 'r', 'n'):
            # verbs, adjectives, adverbs and nouns are lemmatised with their POS
            lemmatized += lemmatizer.lemmatize(word, pos) + " "
        else:
            # tokens with any other tag are kept unchanged
            lemmatized += word + " "
    return lemmatized

# UNUSED remove stopwords

In [None]:
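# function to remove ENGLISH stopwords from a text
def removeENG_stopwords(text):
    # gets the stop words for English from the nltk corpus; "not" is kept in the text
    stop_words = set(nltk.corpus.stopwords.words('english'))
    stop_words.discard("not")
    for word in stop_words:
        # removes the stopword only where it stands alone between whitespace;
        # re.escape guards against regex characters and the stray '+' quantifier is dropped
        # TODO: change to an easier-to-understand function
        text = re.sub(r'(?<!\S)' + re.escape(word) + r'(?!\S)', "", text, flags=re.IGNORECASE)
    return text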
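
A simpler alternative to one regular expression per stopword is to split the text on whitespace and filter the tokens. This is only a sketch: `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for the NLTK list so that the example runs without downloads, and `remove_stopwords_simple` is an illustrative name.

```python
# Tiny stand-in for nltk.corpus.stopwords.words('english'); "not" is left out
# on purpose so that negations survive, as in removeENG_stopwords.
SAMPLE_STOPWORDS = {"this", "is", "a", "the", "at", "all"}

def remove_stopwords_simple(text, stop_words=SAMPLE_STOPWORDS):
    """Drop whitespace-delimited tokens that are stopwords (case-insensitive)."""
    kept = [token for token in text.split() if token.lower() not in stop_words]
    return " ".join(kept)

print(remove_stopwords_simple("This is not a good day at all"))  # not good day
```

Unlike the regex version, this variant also collapses runs of whitespace, because the tokens are re-joined with single spaces.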

# UNUSED Data Cleaning function

This is a bit messy but seems to do the job. I could have imported a library to clean the tweets up, but I wanted to see if I could do it myself.

Note: removeENG_stopwords is disabled because removing the stopwords lowers the scores by 1%. Not a lot, but still a negative influence.

In [None]:
# variables and function calls to clean the data.
# The issue probably lies here for the bad model performance but output texts seem fine?

remove_symbols = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])'  # emojis, symbols, and punctuation


remove_spaces = ' +' # multiple spaces between words
remove_RTandAT2 = r'(RT |@\w+|@ \w+)' # RT and variants of @username : '@username', '@username_username' , '@ username'
apostrophe = '&#39;'  # HTML entity for an apostrophe (currently unused)
apostrophe_csv = "u2019"  # literal "u2019" left in the CSV where the apostrophe U+2019 should be
quote = '&quot;'     # HTML entity for "
andSymbol = '&amp;|&' # HTML entity for &, or a bare &
greaterThan = '&gt;' # HTML entity for >
hashtag = r'(#\w+)' # hashtags
double = r'(.)\1{2,}' # a character repeated 3 or more times (raw string, so \1 is a backreference and not \x01)
dots = r'\.' # dots, including those of an ellipsis
numbers = r'([\d]+(?:st| st|nd| nd|rd| rd|th| th))|(\d)' # ordinal numbers (1st, 2nd, ...) and any remaining digits
singleCHAR = '(?:^| )[b-hj-z](?= |$)' # matches single characters except 'a' and 'i' -> [b-hj-z]
website = r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
special = r'[^\w& ]+' # matches special characters except &
multiple = "(.)\\1{2,}" # matches a character repeated 3 or more times

def cleanENGText(text):
    cleaned = re.sub(website, "", text) #? removes links
    cleaned = re.sub(apostrophe_csv, "'", cleaned) #? converts the literal "u2019" to an apostrophe
    cleaned = expand(cleaned) #? expand() must already be defined (it is not part of this cell)
    cleaned = re.sub(remove_RTandAT2, "", cleaned) #? removes RT and variants of @username : '@username', '@username_username' , '@ username'
    cleaned = re.sub(hashtag, "", cleaned)  #? replaces hashtags with nothing
    cleaned = re.sub(special, "", cleaned) #? removes special characters except &
    cleaned = base(cleaned) #? lemmatises every token
    cleaned = re.sub(double, r'\1\1', cleaned) #? trims characters repeated 3 or more times until 2 are left
    cleaned = cleaned.lower() #? lowercases the text
    cleaned = re.sub(quote, "\"", cleaned) #? converts &quot; to a double quote
    cleaned = re.sub(andSymbol, "and", cleaned) #? replaces &amp; and & with 'and'
    cleaned = re.sub(greaterThan, "", cleaned) #? replaces &gt; with nothing
    # cleaned = removeENG_stopwords(cleaned) #? removes English stopwords
    cleaned = re.sub(remove_symbols," ", cleaned) #? replaces emojis, symbols, and punctuation with a space
    cleaned = re.sub(remove_spaces," ", cleaned) #? removes multiple spaces
    cleaned = re.sub(numbers, "", cleaned) #? replaces numbers with nothing
    cleaned = re.sub(dots, " ", cleaned) #? replaces dots with a space
    cleaned = re.sub(singleCHAR, "", cleaned) #? removes stray single letters
    cleaned = re.sub(remove_spaces," ", cleaned) #? removes multiple spaces
    cleaned = re.sub(multiple, '\\1', cleaned) #? replaces multiple instances of a character with 1 instance
    cleaned = cleaned.strip() #? trims leading and trailing whitespace

    return cleaned
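
In a normal (non-raw) Python string literal, `\1` is an octal escape for the control character `\x01`, so a backreference only reaches the regex engine when the pattern is a raw string or the backslash is doubled. A short sketch of the difference, using a sample word that is not taken from the datasets:

```python
import re

non_raw = '(.)\1{2,}'  # "\1" is parsed by Python as the control character \x01
raw = r'(.)\1{2,}'     # "\1" reaches the regex engine as a backreference

print('\x01' in non_raw)                # True
print(re.sub(non_raw, r'\1', "coool"))  # coool (no match, text unchanged)
print(re.sub(raw, r'\1', "coool"))      # col   (collapse to 1 instance)
print(re.sub(raw, r'\1\1', "coool"))    # cool  (collapse to 2 instances)
```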

# UNUSED BADLY PERFORMING MODELS =========================

# SGDClassifier

In [None]:
# from sklearn.linear_model import SGDClassifier

# #SemEval alone
# model_SemEval = SGDClassifier()
# model_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
# pred_SDG_SemEval = model_SemEval.predict(SemEval_test_tweet)
# SGD_SemEval = classification_report(SemEval_test_label, pred_SDG_SemEval, target_names= classification_labels)

# #SemEval on English labelled
# model_SemEval_english = SGDClassifier()
# model_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
# pred_SDG_SemEval_english = model_SemEval_english.predict(English_tweet)
# SGD_SemEval_ENGLISH = classification_report(English_label, pred_SDG_SemEval_english, target_names= classification_labels)

# #SemEval on German labelled translated into English
# model_SemEval_german = SGDClassifier()
# model_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
# pred_SDG_SemEval_german = model_SemEval_german.predict(German_translated_tweet)
# SGD_SemEval_GERMAN = classification_report(German_translated_label, pred_SDG_SemEval_german, target_names= classification_labels)

# #DAI alone
# model_DAI = SGDClassifier()
# model_DAI.fit(DAI_train_tweet, DAI_train_label)
# pred_SDG_DAI = model_DAI.predict(DAI_test_tweet)
# SGD_DAI = classification_report(DAI_test_label, pred_SDG_DAI, target_names= classification_labels)

# #DAI on German
# model_DAI_german = SGDClassifier()
# model_DAI_german.fit(DAI_train_tweet, DAI_train_label)
# pred_SDG_DAI_german = model_DAI_german.predict(German_tweet)
# SGD_DAI_GERMAN = classification_report(German_label, pred_SDG_DAI_german, target_names= classification_labels)

# #DAI on English labelled translated into German
# model_DAI = SGDClassifier()
# model_DAI.fit(DAI_train_tweet, DAI_train_label)
# pred_SDG_German = model_DAI.predict(English_translated_tweet)
# SGD_DAI_ENGLISH  = classification_report(English_translated_label, pred_SDG_German, target_names= classification_labels)


# LINEAR SVC

In [None]:
# #SemEval alone
# clf_SemEval = LinearSVC(max_iter= 1000)
# clf_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
# pred_SVC_SemEval = clf_SemEval.predict(SemEval_test_tweet)
# LINEARSVC_SemEval = classification_report(SemEval_test_label, pred_SVC_SemEval, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #SemEval on English labelled
# clf_SemEval_english = LinearSVC(max_iter= 1000)
# clf_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
# pred_SVC_SemEval_english = clf_SemEval_english.predict(English_tweet)
# LINEARSVC_SemEval_ENGLISH = classification_report(English_label, pred_SVC_SemEval_english, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #SemEval on German labelled translated into English
# clf_SemEval_german = LinearSVC(max_iter= 1000)
# clf_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
# pred_SVC_SemEval_german = clf_SemEval_german.predict(German_translated_tweet)
# LINEARSVC_SemEval_GERMAN = classification_report(German_translated_label, pred_SVC_SemEval_german, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #DAI alone
# clf_DAI = LinearSVC(max_iter= 1000)
# clf_DAI.fit(DAI_train_tweet, DAI_train_label)
# pred_SVC_DAI = clf_DAI.predict(DAI_test_tweet)
# LINEARSVC_DAI = classification_report(DAI_test_label, pred_SVC_DAI, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #DAI on German
# clf_DAI_german = LinearSVC(max_iter= 1000)
# clf_DAI_german.fit(DAI_train_tweet, DAI_train_label)
# pred_SVC_DAI_german = clf_DAI_german.predict(German_tweet) # predict with the model fitted in this block
# LINEARSVC_DAI_GERMAN = classification_report(German_label, pred_SVC_DAI_german, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #DAI on English labelled translated into German
# clf_german = LinearSVC(max_iter= 1000)
# clf_german.fit(DAI_train_tweet, DAI_train_label)
# pred_SVC_German = clf_german.predict(English_translated_tweet)
# LINEARSVC_DAI_ENGLISH  = classification_report(English_translated_label, pred_SVC_German, target_names= classification_labels) # arguments: true test labels, predicted test labels



#  Bernoulli Naive Bayes

In [None]:
# #SemEval alone
# BNBmodel_SemEval = BernoulliNB()
# BNBmodel_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
# pred_BNB_SemEval = BNBmodel_SemEval.predict(SemEval_test_tweet)
# BNB_SemEval = classification_report(SemEval_test_label, pred_BNB_SemEval, target_names= classification_labels)

# #SemEval on English labelled
# BNBmodel_SemEval_english = BernoulliNB()
# BNBmodel_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
# pred_BNB_SemEval_english = BNBmodel_SemEval_english.predict(English_tweet)
# BNB_SemEval_ENGLISH = classification_report(English_label, pred_BNB_SemEval_english, target_names= classification_labels)

# #SemEval on German labelled translated into English
# BNBmodel_SemEval_german = BernoulliNB()
# BNBmodel_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
# pred_BNB_SemEval_german = BNBmodel_SemEval_german.predict(German_translated_tweet)
# BNB_SemEval_GERMAN = classification_report(German_translated_label, pred_BNB_SemEval_german, target_names= classification_labels)

# #German_DAI alone
# BNBmodel_DAI = BernoulliNB()
# BNBmodel_DAI.fit(DAI_train_tweet, DAI_train_label)
# pred_BNB_DAI = BNBmodel_DAI.predict(DAI_test_tweet)
# BNB_DAI = classification_report(DAI_test_label, pred_BNB_DAI, target_names= classification_labels)

# #German_DAI on German
# BNBmodel_DAI = BernoulliNB()
# BNBmodel_DAI.fit(DAI_train_tweet, DAI_train_label)
# pred_BNB_DAI = BNBmodel_DAI.predict(German_tweet)
# BNB_DAI_GERMAN = classification_report(German_label, pred_BNB_DAI, target_names= classification_labels)


# #German_DAI on English labelled translated into German
# BNBmodel_german = BernoulliNB()
# BNBmodel_german.fit(DAI_train_tweet, DAI_train_label)
# pred_BNB_German = BNBmodel_german.predict(English_translated_tweet)
# BNB_DAI_ENGLISH  = classification_report(English_translated_label, pred_BNB_German, target_names= classification_labels)


# Random Forest

In [None]:
# #SemEval alone
# rfc_SemEval = RandomForestClassifier(n_estimators=200, random_state=0)
# rfc_SemEval.fit(SemEval_train_tweet, SemEval_train_label)
# pred_rfc_SemEval = rfc_SemEval.predict(SemEval_test_tweet)
# RANFOR_SemEval = classification_report(SemEval_test_label, pred_rfc_SemEval, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #SemEval on English labelled
# rfc_SemEval_english = RandomForestClassifier(n_estimators=200, random_state=0)
# rfc_SemEval_english.fit(SemEval_train_tweet, SemEval_train_label)
# pred_rfc_SemEval_english = rfc_SemEval_english.predict(English_tweet)
# RANFOR_SemEval_ENGLISH = classification_report(English_label, pred_rfc_SemEval_english, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #SemEval on German labelled translated into English
# rfc_SemEval_german = RandomForestClassifier(n_estimators=200, random_state=0)
# rfc_SemEval_german.fit(SemEval_train_tweet, SemEval_train_label)
# pred_rfc_SemEval_german = rfc_SemEval_german.predict(German_translated_tweet)
# RANFOR_SemEval_GERMAN = classification_report(German_translated_label, pred_rfc_SemEval_german, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #DAI alone
# rfc_DAI = RandomForestClassifier(n_estimators=200, random_state=0)
# rfc_DAI.fit(DAI_train_tweet, DAI_train_label)
# pred_rfc_DAI = rfc_DAI.predict(DAI_test_tweet)
# RANFOR_DAI = classification_report(DAI_test_label, pred_rfc_DAI, target_names= classification_labels) # arguments: true test labels, predicted test labels

# #DAI on German
# rfc_DAI_german = RandomForestClassifier(n_estimators=200, random_state=0)
# rfc_DAI_german.fit(DAI_train_tweet, DAI_train_label)
# pred_rfc_DAI_german = rfc_DAI_german.predict(German_tweet) # predict with the model fitted in this block
# RANFOR_DAI_GERMAN = classification_report(German_label, pred_rfc_DAI_german, target_names= classification_labels) # arguments: true test labels, predicted test labels


# #DAI on English labelled translated into German
# rfc_german = RandomForestClassifier(n_estimators=200, random_state=0)
# rfc_german.fit(DAI_train_tweet, DAI_train_label)
# pred_rfc_german = rfc_german.predict(English_translated_tweet)
# RANFOR_DAI_ENGLISH = classification_report(English_translated_label, pred_rfc_german, target_names= classification_labels) # arguments: true test labels, predicted test labels


