# Attempt 1: Assignment Algorithm

This attempt follows the guide at: https://investigate.ai/bloomberg-tweet-topics/assigning-categories-to-text-using-keyword-matching/.

The approach could be described as **keyword matching** or **bag of words** (BoW).

A good explanation of BoW: https://machinelearningmastery.com/gentle-introduction-bag-words-model/

The following short summary is taken from it:
> *A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:*
>
> 1. *A vocabulary of known words.*
> 2. *A measure of the presence of known words.*
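
The two ingredients named in the quote can be illustrated with a minimal sketch in plain Python. The vocabulary, the example sentence, and the helper name `bag_of_words` are invented for illustration; the notebook itself relies on scikit-learn's vectorizers further down.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each known word occurs in `text` (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Only words from the fixed vocabulary are tracked; everything else is ignored.
    return {word: counts[word] for word in vocabulary}

vocabulary = ["bahn", "streik", "ticket"]   # 1. a vocabulary of known words
text = "Streik bei der Bahn: Die Bahn erstattet jedes Ticket."

print(bag_of_words(text, vocabulary))       # 2. a measure of their presence
# → {'bahn': 2, 'streik': 1, 'ticket': 1}
```

Word order and all words outside the vocabulary are discarded, which is why the representation is called a "bag".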

In [1]:
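# Load libraries (the Stemmer module is provided by the PyStemmer package)
import pandas as pd
import Stemmer

In [2]:
# Load the tweet data; the first CSV column becomes the index
df = pd.read_csv("mergerDaten/orr.csv", index_col=0)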

In [3]:
#topics = pd.read_csv("mergerDaten/termsSave.csv")

## Addendum: How the Topics Were Composed

The categories were derived from the first topic-modeling runs, which provisionally produced 30 categories. For each category, the first 25 words were printed. A manual post-selection then removed words that were obviously unhelpful (for example, the word "gekommen" ("came") in the topic "Polizeiaktionen" (police operations), since it is clearly ambiguous and does not point directly to the topic). In most cases, roughly five to six words were removed per topic.

Put simply: **of the 25 words per topic, only those that unambiguously fit the assigned category were included in the categories for the bag-of-words procedure. In addition, where categories indisputably overlap, they are merged.**
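
Merging overlapping topics tends to leave repeated keywords behind. A minimal sketch of combining several keyword lists while dropping repeats follows; the helper name `merge_topics` and the two short lists are hypothetical and only illustrate the idea, not the manual procedure that was actually used.

```python
def merge_topics(*keyword_lists):
    """Merge keyword lists into one, keeping first occurrences in order."""
    merged = []
    seen = set()
    for keywords in keyword_lists:
        for word in keywords:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged

# Two illustrative topics that clearly describe the same subject
topic_a = ["corona", "inzidenz", "rki", "pandemie"]
topic_b = ["corona", "maskenpflicht", "pandemie", "quarantäne"]

print(merge_topics(topic_a, topic_b))
# → ['corona', 'inzidenz', 'rki', 'pandemie', 'maskenpflicht', 'quarantäne']
```

Repeats are harmless for the matching itself, but removing them keeps the keyword lists easier to review by hand.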

In [22]:
categories = {
    'multimediaSelbstverweise' : ['ard', 'swr', 'wdr', 'fernsehen', 'ardmediathek', '@mdrde',
                    '@ard_presse', 'mdr', 'doku', 'podcast', 'folge',
                    'preis', 'serie', 'folgen', 'film', '@daserste', 'mediathek', 'live'],
    
    'umweltKlima' : ['wasser', 'tiere', 'klimawandel', 'trockenheit', 'wald', 'landwirtschaft', 'landwirte',
                     'zoo', 'fischsterben', 'hitze', 'wolf', 'umwelt', 'bäume', 'sommer', 'müll',
                    'fische', 'pflanzen'],
    
    'covid' : ['corona', 'inzidenz', 'neuinfektionen', 'coronavirus', 'rki',
               'corona', 'impfpflicht', 'omikron', 'impfung', 'lauterbach', 'pandemie', 'coronavirus', 
               'impfstoff', 'variante', 'bundesgesundheitsminister', 'gesundheitsminister',
               'kliniken', 'welle', 'patienten', 'impfen', 'impfungen',
               'covid', '@karl_lauterbach', 'geimpft',
               'corona', 'regeln', 'schulen', 'maskenpflicht', 'maßnahmen', 'schüler',
               'tests', '2g', 'quarantäne', 'lockerungen', 'pandemie', 'maske', 'regel', 'masken',
               'gastronomie'],
    
    'ostdeutschland' : ['sachsen', 'thüringen', 'anhalt', 'erfurt', 'thüringer', 'magdeburg',
                       'dresden', 'leipzig', 'halle', 'thueringen', 'mdr', 'anhalts', 'weimar',
                       'chemnitz', 'gera', 'jena'],
    
    'energiekrise' : ['preise', 'energie', 'energiepreise', 'energiekrise', 'gas', 'kosten',
                      'sparen', 'strom', 'inflation', 'teuer', 'steigenden', 'energiekosten',
                      'kunden', 'steigende', 'verbraucher', 'diesel', 'lebensmittel'],
    
    'öpnv' : ['bahn', 'fahren', 'züge', 'bus', 'strecke', 'ticket', 'zug',
              '9euroticket', 'öpnv', 'probleme', 'gespräch', 'bahnhof',
             'zugunglück', 'fahrgäste', 'euro', 'busse', 'bahnen', 'nahverkehr', 'fährt'],
    
    'ukrainekrieg' : ['ukraine', 'scholz', 'russland', 'bundeskanzler', 'krieg', 'putin',
                      'kanzler', 'bundeswehr', 'steinmeier', 'waffen', 'bundesregierung',
                      'bundespräsident', 'präsident', 'baerbock', 'schröder',
                      'russischen', 'waffenlieferungen',
                      'ukraine', 'russland', 'russischen', 'russische', 'ukrainische',
                      'ukrainischen', 'kiew', 'selenskyj', 'russlands', 'mariupol', 'moskau',
                      'angriff', 'truppen', 'ukrainekrieg', 'soldaten', 'usa', 'armee',
                      'ukraine', 'krieg', 'geflüchtete', 'flüchtlinge',
                      'hilfe', 'helfen', 'geflüchteten', 'ukrainische', 'spenden',
                      'ukrainekrieg', 'ukrainer', 'flucht', 'ukrainischen',
                      'russland', 'nato', 'usa', 'präsident', 'kommission',
                      'g7', 'sanktionen', 'gipfel', 'großbritannien', 'britische', 'schweden',
                      'finnland'],
    
    'politikWahlen' : ['cdu', 'spd', 'afd', 'grünen', 'fdp', 'wahl', 'partei', 'landtagswahl', 'grüne',
                        'landtag', 'ministerpräsident', 'gewählt', 'politiker', 'stimmen', 
                        'prozent', 'linke', 'amt', 'bürgermeister', 'oberbürgermeister', 'mehrheit'],
    
    'entlastungspolitik' : ['euro', 'millionen', 'geld', 'milliarden', 'kosten', 'lindner',
                            '9euroticket', 'entlastungspaket', 'finanzminister', 'mio'],
    
    'energieversorgung' : ['gas', 'habeck', 'russland', 'öl', 'wirtschaftsminister', 'energie',
                          'bundeswirtschaftsminister', 'strom', 'energiewende', 'pipeline', 'akw',
                           'erdgas', 'netz', 'gaslieferungen', 'schwedt', 'energiekrise', 'kohle'],
    
    'streiksDemos' : ['flughafen', 'gewerkschaft', 'lufthansa', 'streik', 'demonstriert',
                      'beschäftigten', 'flüge', 'proteste', 'warnstreik','verdi',
                      'tausende', 'demonstration', 'mitarbeiter'],
}
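
Some keywords appear under more than one category (for example `gas` under both `energiekrise` and `energieversorgung`), so a single tweet can be flagged for several categories at once. A minimal sketch for listing such shared keywords, run here on a shortened excerpt of the dictionary; the helper name `shared_keywords` is hypothetical.

```python
from collections import defaultdict

def shared_keywords(categories):
    """Map each keyword that occurs in 2+ categories to those category names."""
    owners = defaultdict(set)
    for name, keywords in categories.items():
        for word in keywords:
            owners[word].add(name)
    return {word: sorted(names) for word, names in owners.items() if len(names) > 1}

# Shortened excerpt of the category dictionary, for illustration only
excerpt = {
    "energiekrise": ["preise", "energie", "gas", "strom"],
    "energieversorgung": ["gas", "habeck", "energie", "strom"],
    "öpnv": ["bahn", "ticket", "euro"],
    "entlastungspolitik": ["euro", "geld", "lindner"],
}

for word, names in sorted(shared_keywords(excerpt).items()):
    print(word, "->", names)
```

Whether such overlaps are intended is a modelling decision; the listing only makes them visible.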

In [23]:
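# Snowball stemmer for German
stemmer = Stemmer.Stemmer('de')

# One DataFrame per category: the category name plus its stemmed keywords
dfs = []

for key, values in categories.items():
    words = pd.DataFrame({'category': key, 'term': stemmer.stemWords(values)})
    dfs.append(words)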

In [24]:
terms_df = pd.concat(dfs)

In [25]:
terms_df.to_csv('terms_df.csv')

In [50]:
# pd.set_option('display.max_columns', None)
terms_df

Unnamed: 0,category,term
0,multimediaSelbstverweise,ard
1,multimediaSelbstverweise,swr
2,multimediaSelbstverweise,wdr
3,multimediaSelbstverweise,fernseh
4,multimediaSelbstverweise,ardmediathek
...,...,...
8,streiksDemos,warnstreik
9,streiksDemos,verdi
10,streiksDemos,tausend
11,streiksDemos,demonstration


In [27]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer


stemmer = Stemmer.Stemmer('de')

# CountVectorizer variant whose analyzer stems every token,
# so that document tokens are comparable with the stemmed terms
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(list(analyzer(doc)))

In [38]:
# Take the 'term' column from our list of terms (set() removes duplicates)
term_list = list(set(terms_df.term))

# vocabulary= restricts the features to the terms we're interested in tracking
# binary=False keeps raw term counts (not 0/1) as input to the TF-IDF weighting
vectorizer = TfidfVectorizer(binary=False, vocabulary=term_list)
matrix = vectorizer.fit_transform(df.content)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
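
Because `term_list` contains stemmed terms, a document token only matches when it is reduced to the same form before the lookup; with the default analyzer, an inflected form such as `ukraine` does not match the vocabulary entry `ukrain`. The sketch below illustrates the point in plain Python. `toy_stem` is a deliberately crude, hypothetical stand-in for a real stemmer and does not reproduce what `Stemmer.Stemmer('de')` does internally.

```python
def toy_stem(token):
    """Crude stand-in for a real stemmer: strip a few common German endings."""
    for suffix in ("en", "e", "n", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 3:
            return token[: -len(suffix)]
    return token

vocabulary = {"ukrain", "impfung", "streik"}   # stemmed terms, as in term_list
tokens = "die ukraine meldet neue streiks".split()

# Lookup without stemming the document tokens
raw_hits = [t for t in tokens if t in vocabulary]
# Lookup after stemming the tokens the same way as the vocabulary
stemmed_hits = [toy_stem(t) for t in tokens if toy_stem(t) in vocabulary]

print(raw_hits)      # → []
print(stemmed_hits)  # → ['ukrain', 'streik']
```

The `StemmedCountVectorizer` defined above wraps scikit-learn's analyzer in the same spirit, so using it (or a TF-IDF counterpart built the same way) would be one option for aligning documents and vocabulary.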

In [39]:
#term_list
words_df.head(20)

Unnamed: 0,demonstriert,landtagswahl,ukrain,landwirtschaft,sanktion,inzidenz,chemnitz,gera,inflation,waff,kohl,opnv,zug,impfpflicht,dresd,gewerkschaft,@mdrde,bundesgesundheitsminist,stimm,zugungluck,sachs,impfung,ministerprasident,verbrauch,nahverkehr,maskenpflicht,jena,gefluchtet,demonstration,grossbritanni,mull,@ard_press,euro,regeln,mask,burgermeist,fdp,9euroticket,klinik,krieg,tier,gipfel,lauterbach,erdgas,fisch,doku,well,pflanz,baum,@daserst,schwed,fluchtling,scholz,energiekris,@karl_lauterbach,wdr,bundeswirtschaftsminist,gasliefer,verdi,fahrgast,teu,russland,putin,wald,impfstoff,geld,streck,impf,thuring,ol,soldat,oberburgermeist,covid,bundeswehr,entlastungspaket,test,podcast,somm,bundeskanzl,patient,bundesregier,coronavirus,spd,ard,mitarbeit,umwelt,rki,kanzl,landtag,mehrheit,steinmei,tausend,grun,fischsterb,wolf,finnland,partei,weimar,schwedt,spend,polit,lufthansa,milliard,wirtschaftsminist,finanzminist,gesundheitsminist,britisch,geimpft,erfurt,bus,folg,flughaf,klimawandel,lindn,2g,magdeburg,gewahlt,arme,preis,schul,kiew,swr,fernseh,lebensmittel,locker,netz,ukrainekrieg,gesprach,strom,energi,spar,bahnhof,energiekost,energiepreis,selenskyj,ticket,hilf,ardmediathek,variant,kund,kommission,afd,fahr,pipelin,wahl,film,hitz,anhalt,mio,seri,schrod,bahn,corona,gas,steigend,flug,mdr,bundesprasident,landwirt,g7,angriff,energiew,fahrt,baerbock,warnstreik,regel,prot,helf,flucht,zoo,hall,diesel,gastronomi,moskau,amt,leipzig,beschaftigt,link,wass,trock,neuinfektion,cdu,million,thuering,kost,problem,prozent,liv,omikron,usa,quarantan,nato,habeck,waffenliefer,trupp,massnahm,russisch,buss,mariupol,akw,prasident,pandemi,mediathek,streik
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
# Group the terms by category, then loop through each category
for category_name, rows in terms_df.groupby('category'):
    # Convert the terms for that category into a simple list,
    # for example ['streik', 'warnstreik', 'verdi']
    terms = list(rows['term'])
    print(f"Looking at {category_name} with terms {terms}")

    # words_df[terms] selects the columns for this category's terms
    # .any(axis=1) checks whether any of them is non-zero, gives True/False
    # .astype(int) converts True/False to 1/0
    # df[category_name] = stores that flag as a new column, e.g. df['covid']
    df[category_name] = words_df[terms].any(axis=1).astype(int)

Looking at covid with terms ['corona', 'inzidenz', 'neuinfektion', 'coronavirus', 'rki', 'corona', 'impfpflicht', 'omikron', 'impfung', 'lauterbach', 'pandemi', 'coronavirus', 'impfstoff', 'variant', 'bundesgesundheitsminist', 'gesundheitsminist', 'klinik', 'well', 'patient', 'impf', 'impfung', 'covid', '@karl_lauterbach', 'geimpft', 'corona', 'regeln', 'schul', 'maskenpflicht', 'massnahm', 'schul', 'test', '2g', 'quarantan', 'locker', 'pandemi', 'mask', 'regel', 'mask', 'gastronomi']
Looking at energiekrise with terms ['preis', 'energi', 'energiepreis', 'energiekris', 'gas', 'kost', 'spar', 'strom', 'inflation', 'teu', 'steigend', 'energiekost', 'kund', 'steigend', 'verbrauch', 'diesel', 'lebensmittel']
Looking at energieversorgung with terms ['gas', 'habeck', 'russland', 'ol', 'wirtschaftsminist', 'energi', 'bundeswirtschaftsminist', 'strom', 'energiew', 'pipelin', 'akw', 'erdgas', 'netz', 'gasliefer', 'schwedt', 'energiekris', 'kohl']
Looking at entlastungspolitik with terms ['euro'
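
The assignment rule in the loop above is simple: a tweet receives a 1 for a category as soon as at least one of that category's terms has a non-zero weight. A minimal plain-Python sketch of the same rule, with invented weights for a single tweet and a hypothetical helper `flag_categories`:

```python
def flag_categories(term_weights, category_terms):
    """Return {category: 0 or 1}; 1 if any of the category's terms has weight > 0."""
    return {
        category: int(any(term_weights.get(term, 0.0) > 0 for term in terms))
        for category, terms in category_terms.items()
    }

# Stemmed terms per category (short excerpt for illustration)
category_terms = {
    "covid": ["corona", "impfung", "mask"],
    "öpnv": ["bahn", "ticket", "zug"],
    "streiksDemos": ["streik", "warnstreik", "verdi"],
}

# Invented weights for one tweet: only 'bahn' and 'streik' occur
term_weights = {"bahn": 0.6, "streik": 0.8}

print(flag_categories(term_weights, category_terms))
# → {'covid': 0, 'öpnv': 1, 'streiksDemos': 1}
```

Under this rule the size of the weight does not matter, only whether it is non-zero, and one tweet can belong to several categories.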

In [41]:
df.groupby('user').ostdeutschland.sum().sort_values(ascending=False)

user
MDRAktuell         1012
mdr_th              734
MDR_SAN             692
MDR_SN              590
MDRpresse           236
rbb24                60
mdrde                43
hessenschau          39
BR24                 36
SWRAktuellBW         25
ndr                  15
rbb24Inforadio       13
WDRaktuell           12
NDRnds                9
DeutscheWelle         4
NDRsh                 3
dlfkultur             3
SWRpresse             3
NDRinfo               3
SWRAktuellRP          3
rbbabendschau         1
butenunbinnen         1
WDR                   1
SRaktuell             1
dlfnova               0
hrPresse              0
SRKommunikation       0
DLF                   0
BR_Presse             0
ARTEde                0
Name: ostdeutschland, dtype: int64

In [42]:
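# Sum all numeric columns per account; numeric_only=True skips text columns explicitly
overall = df.groupby('user').sum(numeric_only=True)
overall.to_csv("ueberblick_gruppiert_user.csv")
overall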


user,tweetID,replies,retweets,likes,quotes,isRetweeted,repliedTo_ID,cashtags,userID,followerAmount,covid,energiekrise,energieversorgung,entlastungspolitik,multimediaSelbstverweise,ostdeutschland,politikWahlen,streiksDemos,ukrainekrieg,umweltKlima,öpnv
ARTEde,7406246889977992893,2169,6503,73344,1051,0.0,5.43574e+20,0.0,15777717876,203096536,2,1,6,3,93,0,5,0,25,4,7
BR24,841105500156987464,131626,80746,270054,25872,0.0,1.567368e+21,0.0,1601743068630,1588970990,2257,813,1905,833,270,36,1116,171,3105,348,992
BR_Presse,5675942677297697351,449,549,1880,114,0.0,7.486367e+19,0.0,38012705840,12879016,8,2,5,0,60,0,5,0,7,2,3
DLF,3796672158204130138,4276,3984,16785,1238,0.0,1.00542e+21,0.0,353870173622,520198504,148,26,118,21,21,0,176,1,209,14,29
DeutscheWelle,5850705722675183222,3154,5287,12744,658,0.0,6.523488e+19,0.0,29362562775,311721835,73,52,122,41,39,4,46,2,281,22,32
MDRAktuell,-5165568195108735060,84167,87290,347702,24436,0.0,2.096608e+20,0.0,264452764949,1294543971,1636,655,1462,1055,490,1012,1596,204,2720,135,1122
MDR_SAN,-2193034106078293704,2088,3608,10390,798,0.0,1.394817e+20,0.0,844427374295,102631428,190,31,38,92,44,692,115,27,58,29,98
MDR_SN,4516105826483677468,3915,5713,13988,1261,0.0,6.148856e+19,0.0,1555641912314,99496866,331,56,51,102,110,590,140,51,103,64,118
MDRpresse,-8705109928517028455,448,1073,3021,210,0.0,6.848871e+19,0.0,159960490209,3667373,4,9,12,2,343,236,1,1,23,9,5
NDRinfo,3232177995861581151,10934,10557,42120,2898,0.0,1.856287e+20,0.0,4199132158760,79163740,257,90,215,82,65,3,219,28,329,18,92


In [43]:
# Save the tweets together with their category flags
df.to_csv('categorizedTweetsORR.csv')