# Notebook: Create Wordcloud Analysis 

This notebook creates wordclouds from the tweets classified by our trained sentiment model, for the complete dataset as well as per party and per month.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
from reportlab.graphics import renderPDF
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from svglib.svglib import svg2rlg
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
import nltk
import re
import os

## Parameters

In [2]:
PLOTS_PATH = "../Plots/"
PARTIES = ["SPD", "CDU_CSU", "GRUENE", "FDP", "AFD", "LINKE"]
DATASET_PATH_PREDICTIONS =  "../Datasets/complete_dataset_predictions/"
DATASET_PATH = "../Datasets/dataset/"
WORD_CLOUDS_PATH = "../Plots/wordclouds/"
FONT_PATH = 'fonts/manrope-regular.otf'

## Setup Packages

In [3]:
nltk.download('punkt')
nltk.download('stopwords')
STOPWORDS = set(stopwords.words("german"))
STOPWORDS.update(["mehr", "heute", "https", "bundestag", "thread", "anzeigen", "http", "www", "co"])

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nils_hellwig/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nils_hellwig/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Code

### 1. Load Data

In [4]:
df = pd.DataFrame({})

for party in PARTIES:
    for subdir, _, files in os.walk(DATASET_PATH_PREDICTIONS + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH_PREDICTIONS):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Read CSV file as pandas dataframe
                df_acc_data = pd.read_csv(DATASET_PATH + party + "/" + file)
                ids = df_acc_data["id"].values
                df_acc_data = df_acc_data[["tweet", "source_party", "source_account", "date"]].reset_index().drop(columns='index')
                
                df_pred = pd.read_csv(DATASET_PATH_PREDICTIONS + party + "/" + file)
                df_pred = df_pred[df_pred["id"].isin(ids)][["pred"]].reset_index().drop(columns='index')
                
                matched_df = pd.concat([df_acc_data, df_pred], axis=1)
                matched_df = matched_df.rename(columns={"pred": "sentiment", "tweet": "text"})
                
                df = pd.concat([df, matched_df], axis=0)

df = df.reset_index().drop(columns='index')
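
The cell above pairs tweets and predictions by row position after filtering on `id`, which only lines up if both CSV files list the tweets in the same order. A minimal sketch of an order-independent alternative, assuming both files share a unique `id` column (the toy frames below are made-up illustration data, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-ins for one account's tweet file and prediction file
df_acc_data = pd.DataFrame({
    "id": [11, 12, 13],
    "tweet": ["a", "b", "c"],
})
df_pred = pd.DataFrame({
    "id": [13, 11, 12, 99],  # different order, plus one id without a tweet
    "pred": [2, 0, 1, 0],
})

# An inner join on "id" keeps only tweets that have a prediction and
# attaches each prediction to its own tweet regardless of row order
matched_df = (
    df_acc_data.merge(df_pred[["id", "pred"]], on="id", how="inner")
    .rename(columns={"pred": "sentiment", "tweet": "text"})
)
```

Whether this is needed depends on how the prediction files were written; if they preserve the original row order, the positional approach gives the same result.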

In [5]:
df

| | text | source_party | source_account | date | sentiment |
|---|---|---|---|---|---|
| 0 | Wichtige wissenschaftliche Erkenntnis- nun mus... | SPD | KarambaDiaby | 2021-01-09 19:35:29 | 0 |
| 1 | @KarambaDiaby @HalleSpd @SPD_LSA Ich gratulier... | SPD | KarambaDiaby | 2021-01-09 17:09:28 | 0 |
| 2 | @KarambaDiaby @HalleSpd @SPD_LSA Herzlichen Gl... | SPD | KarambaDiaby | 2021-01-09 13:16:13 | 0 |
| 3 | @KarambaDiaby @HalleSpd @SPD_LSA Wann werden k... | SPD | KarambaDiaby | 2021-01-09 12:32:40 | 1 |
| 4 | @KarambaDiaby @HalleSpd @SPD_LSA Glückwunsch. | SPD | KarambaDiaby | 2021-01-09 12:13:06 | 0 |
| ... | ... | ... | ... | ... | ... |
| 707236 | @b_riexinger Klima oder Verkehr fast gleich...... | LINKE | b_riexinger | 2021-12-17 08:19:23 | 1 |
| 707237 | @b_riexinger @Linksfraktion Na ob das noch lan... | LINKE | b_riexinger | 2021-12-17 08:18:07 | 1 |
| 707238 | @b_riexinger Ich wünsch Dir viel Erfolg. | LINKE | b_riexinger | 2021-12-17 07:47:59 | 0 |
| 707239 | @b_riexinger Nun, da gibt es ja genügend zu tu... | LINKE | b_riexinger | 2021-12-17 02:07:26 | 2 |


### 2. Clean Text

Remove the mention of the source account from each tweet.

In [6]:
def remove_source_account(text, source_account):
    pattern = re.compile(re.escape(source_account), re.IGNORECASE)
    return re.sub(pattern, '', text)

# Apply the function to the "text" and "source_account" columns
df['text'] = df.apply(lambda row: remove_source_account(row['text'], "@"+row['source_account']), axis=1)
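
To illustrate what this step does to a single tweet, here is a small self-contained sketch. The example string is invented; the function body mirrors the one above.

```python
import re

def remove_source_account(text, source_account):
    # Escape the handle so that any regex metacharacters are matched literally
    pattern = re.compile(re.escape(source_account), re.IGNORECASE)
    return pattern.sub("", text)

# Made-up tweet: the source account is mentioned twice, in different casing
example = "@KarambaDiaby @HalleSpd Glückwunsch an @karambadiaby!"
cleaned = remove_source_account(example, "@KarambaDiaby")
# cleaned == " @HalleSpd Glückwunsch an !"
```

Mentions of other accounts are left in place. Because this is plain substring matching, a longer handle that starts with the source handle would also lose that prefix.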

In [7]:
df

| | text | source_party | source_account | date | sentiment |
|---|---|---|---|---|---|
| 0 | Wichtige wissenschaftliche Erkenntnis- nun mus... | SPD | KarambaDiaby | 2021-01-09 19:35:29 | 0 |
| 1 | @HalleSpd @SPD_LSA Ich gratuliere, auch wenn ... | SPD | KarambaDiaby | 2021-01-09 17:09:28 | 0 |
| 2 | @HalleSpd @SPD_LSA Herzlichen Glückwunsch und... | SPD | KarambaDiaby | 2021-01-09 13:16:13 | 0 |
| 3 | @HalleSpd @SPD_LSA Wann werden konkret massiv... | SPD | KarambaDiaby | 2021-01-09 12:32:40 | 1 |
| 4 | @HalleSpd @SPD_LSA Glückwunsch. | SPD | KarambaDiaby | 2021-01-09 12:13:06 | 0 |
| ... | ... | ... | ... | ... | ... |
| 707236 | Klima oder Verkehr fast gleich....Hauptsache ... | LINKE | b_riexinger | 2021-12-17 08:19:23 | 1 |
| 707237 | @Linksfraktion Na ob das noch lange gut geht?... | LINKE | b_riexinger | 2021-12-17 08:18:07 | 1 |
| 707238 | Ich wünsch Dir viel Erfolg. | LINKE | b_riexinger | 2021-12-17 07:47:59 | 0 |
| 707239 | Nun, da gibt es ja genügend zu tuen. Paris ma... | LINKE | b_riexinger | 2021-12-17 02:07:26 | 2 |


Source of the text cleaning function below: https://data-dive.com/german-nlp-binary-text-classification-of-reviews-part1

In [8]:
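def clean_text(text):
    # Patterns for whitespace runs, HTML tags, non-letter characters
    # (despite its name, RE_ASCII keeps accented letters) and single letters
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)
    
    # Replace each unwanted match with a space, then collapse whitespace
    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)
    
    # Tokenize, lowercase every token and join back into one string
    word_tokens = word_tokenize(text)
    text = [word.lower() for word in word_tokens]
    text = " ".join(text)
    
    return text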
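
The effect of the four regular expressions is easiest to see on one string. The sketch below reproduces the cleaning steps with the standard library only. It is an approximation: `str.split()` stands in for `word_tokenize`, the name `clean_text_sketch` and the sample tweet are invented for illustration, and the `re.IGNORECASE` flags are omitted for brevity.

```python
import re

RE_TAGS = re.compile(r"<[^>]+>")                # HTML-like tags
RE_NON_LETTER = re.compile(r"[^A-Za-zÀ-ž ]")    # anything but letters and spaces
RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b")  # isolated single letters
RE_WSPACE = re.compile(r"\s+")                  # runs of whitespace

def clean_text_sketch(text):
    text = RE_TAGS.sub(" ", text)
    text = RE_NON_LETTER.sub(" ", text)
    text = RE_SINGLECHAR.sub(" ", text)
    text = RE_WSPACE.sub(" ", text)
    # Assumption: whitespace splitting replaces nltk's word_tokenize here
    return " ".join(word.lower() for word in text.split())

sample = "<b>Glückwunsch</b> an @HalleSpd! Mehr: https://t.co/x1"
result = clean_text_sketch(sample)
# result == "glückwunsch an hallespd mehr https co"
```

URL fragments such as `https` and `co` survive the cleaning as ordinary words, which is why they appear in the `STOPWORDS` list defined earlier.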

### 3. SVG to PDF

In [9]:
def svg_to_pdf(svg_filepath, pdf_filepath):
    drawing = svg2rlg(svg_filepath)
    renderPDF.drawToFile(drawing, pdf_filepath)

### 4. Create Wordclouds for Positive, Negative and Neutral Tweets

In [10]:
def get_sentiment_as_name(sentiment_code):
    if sentiment_code == 0:
        return "positive"
    if sentiment_code == 1:
        return "negative"
    if sentiment_code == 2: 
        return "neutral"

In [11]:
def generate_wordclouds(df, mode_name):
    # Group dataframe by sentiment
    grouped = df.groupby('sentiment')

    # Iterate over each group
    for sentiment, group in grouped:
        print(f'Word Cloud for Sentiment: {sentiment}')

        # Create a list of all the tweets for the current sentiment
        tweets = group['text'].tolist()

        # Create a single string of all the tweets for the current sentiment
        text = ' '.join(tweets)

        # Clean Text
        text = clean_text(text)

        # Create a wordcloud
        wordcloud = WordCloud(background_color="white", max_words=100, width=1000, height=700, stopwords=STOPWORDS, font_path=FONT_PATH).generate(text)

        # Save wordcloud as svg
        wordcloud_svg = wordcloud.to_svg(embed_font=True)
        svg_path = WORD_CLOUDS_PATH + mode_name + "_wordcloud_" + get_sentiment_as_name(sentiment) + ".svg"
        with open(svg_path, "w", encoding="utf-8") as f:
            f.write(wordcloud_svg)

        # Save wordcloud as pdf
        pdf_path = WORD_CLOUDS_PATH + mode_name + "_wordcloud_" + get_sentiment_as_name(sentiment) + ".pdf"
        svg_to_pdf(svg_path, pdf_path)

        # Save wordcloud as png
        png_path = WORD_CLOUDS_PATH + mode_name + "_wordcloud_" + get_sentiment_as_name(sentiment) + ".png"
        wordcloud.to_file(png_path)

        # Display the wordcloud
        #plt.imshow(wordcloud, interpolation='bilinear')
        #plt.axis("off")
        #plt.show()
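
The function above builds three output paths per sentiment by string concatenation, which relies on `WORD_CLOUDS_PATH` ending in a slash and on the directory already existing. A minimal sketch of a helper that removes both assumptions; `build_output_paths` is a hypothetical name and not part of the notebook:

```python
import os
import tempfile

def build_output_paths(base_dir, prefix, sentiment_name):
    # Hypothetical helper: create the target directory if needed and
    # return one path per export format, all sharing the same stem
    os.makedirs(base_dir, exist_ok=True)
    stem = os.path.join(base_dir, f"{prefix}_wordcloud_{sentiment_name}")
    return {ext: f"{stem}.{ext}" for ext in ("svg", "pdf", "png")}

# Demonstrated in a temporary directory so nothing is written to ../Plots/
with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, "wordclouds")
    paths = build_output_paths(target_dir, "complete_dataset", "positive")
    directory_created = os.path.isdir(target_dir)
```

The returned dictionary could then be used as `paths["svg"]`, `paths["pdf"]` and `paths["png"]` in place of the three concatenated strings.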

In [12]:
generate_wordclouds(df, "complete_dataset")

Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2


### 5. Create Wordclouds for Parties

In [13]:
def generate_wordclouds_for_party(df, mode_name, party):
    df = df[df["source_party"] == party]
    
    # Group dataframe by sentiment
    grouped = df.groupby('sentiment')

    # Iterate over each group
    for sentiment, group in grouped:
        print(f'Word Cloud for Sentiment: {sentiment} | Party: {party}')

        # Create a list of all the tweets for the current sentiment
        tweets = group['text'].tolist()

        # Create a single string of all the tweets for the current sentiment
        text = ' '.join(tweets)

        # Clean Text
        text = clean_text(text)

        # Create a wordcloud
        wordcloud = WordCloud(background_color="white", max_words=100, width=1000, height=700, stopwords=STOPWORDS, font_path=FONT_PATH).generate(text)

        # Save wordcloud as svg
        wordcloud_svg = wordcloud.to_svg(embed_font=True)
        svg_path = WORD_CLOUDS_PATH + mode_name + party + "_wordcloud_" + get_sentiment_as_name(sentiment) + ".svg"
        with open(svg_path, "w", encoding="utf-8") as f:
            f.write(wordcloud_svg)

        # Save wordcloud as pdf
        pdf_path = WORD_CLOUDS_PATH + mode_name + party + "_wordcloud_" + get_sentiment_as_name(sentiment) + ".pdf"
        svg_to_pdf(svg_path, pdf_path)

        # Save wordcloud as png
        png_path = WORD_CLOUDS_PATH + mode_name + party + "_wordcloud_" + get_sentiment_as_name(sentiment) + ".png"
        wordcloud.to_file(png_path)

        # Display the wordcloud
        #plt.imshow(wordcloud, interpolation='bilinear')
        #plt.axis("off")
        #plt.show()

In [14]:
for party in PARTIES:
    generate_wordclouds_for_party(df, "party_", party)

Word Cloud for Sentiment: 0 | Party: SPD
Word Cloud for Sentiment: 1 | Party: SPD
Word Cloud for Sentiment: 2 | Party: SPD
Word Cloud for Sentiment: 0 | Party: CDU_CSU
Word Cloud for Sentiment: 1 | Party: CDU_CSU
Word Cloud for Sentiment: 2 | Party: CDU_CSU
Word Cloud for Sentiment: 0 | Party: GRUENE
Word Cloud for Sentiment: 1 | Party: GRUENE
Word Cloud for Sentiment: 2 | Party: GRUENE
Word Cloud for Sentiment: 0 | Party: FDP
Word Cloud for Sentiment: 1 | Party: FDP
Word Cloud for Sentiment: 2 | Party: FDP
Word Cloud for Sentiment: 0 | Party: AFD
Word Cloud for Sentiment: 1 | Party: AFD
Word Cloud for Sentiment: 2 | Party: AFD
Word Cloud for Sentiment: 0 | Party: LINKE
Word Cloud for Sentiment: 1 | Party: LINKE
Word Cloud for Sentiment: 2 | Party: LINKE


### 6. Word Clouds for Months

In [15]:
# English month names keyed by month number
MONTH_NAMES = {
    1: "January", 2: "February", 3: "March", 4: "April",
    5: "May", 6: "June", 7: "July", 8: "August",
    9: "September", 10: "October", 11: "November", 12: "December",
}

def get_month_code(month_num):
    # Returns None for values outside 1-12, as before
    return MONTH_NAMES.get(month_num)

In [16]:
for i in range(1, 13):
    # Keep only the tweets posted in month i and pass that subset on
    df_month = df[pd.to_datetime(df['date']).dt.month == i]
    generate_wordclouds(df_month, "month_"+get_month_code(i))

Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
Word Cloud for Sentiment: 0
Word Cloud for Sentiment: 1
Word Cloud for Sentiment: 2
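
The month filter relies on `pd.to_datetime(...).dt.month`. A small sketch on made-up rows (not the real dataset) shows how that mask selects rows, with the dates parsed once rather than in every loop iteration:

```python
import pandas as pd

# Invented example rows in the same date format as the dataset
toy = pd.DataFrame({
    "date": ["2021-01-09 19:35:29", "2021-01-31 08:00:00", "2021-12-17 08:19:23"],
    "sentiment": [0, 1, 2],
})

# Parse the date strings once and reuse the resulting month numbers
months = pd.to_datetime(toy["date"]).dt.month

january = toy[months == 1]    # two rows
december = toy[months == 12]  # one row
february = toy[months == 2]   # empty: groupby would yield no groups for it
```

For a month without tweets the filtered frame is empty, so the `groupby('sentiment')` loop has nothing to iterate over and no files are written for that month.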

### 7. Word Clouds for Party and Month

In [17]:
for i in range(1, 13):
    # Keep only the tweets posted in month i, then split that subset by party
    df_month = df[pd.to_datetime(df['date']).dt.month == i]
    for party in PARTIES:
        generate_wordclouds_for_party(df_month, "month_"+ get_month_code(i)+ "_party_", party)

Word Cloud for Sentiment: 0 | Party: SPD
Word Cloud for Sentiment: 1 | Party: SPD
Word Cloud for Sentiment: 2 | Party: SPD
Word Cloud for Sentiment: 0 | Party: CDU_CSU
Word Cloud for Sentiment: 1 | Party: CDU_CSU
Word Cloud for Sentiment: 2 | Party: CDU_CSU
Word Cloud for Sentiment: 0 | Party: GRUENE
Word Cloud for Sentiment: 1 | Party: GRUENE
Word Cloud for Sentiment: 2 | Party: GRUENE
Word Cloud for Sentiment: 0 | Party: FDP
Word Cloud for Sentiment: 1 | Party: FDP
Word Cloud for Sentiment: 2 | Party: FDP
Word Cloud for Sentiment: 0 | Party: AFD
Word Cloud for Sentiment: 1 | Party: AFD
Word Cloud for Sentiment: 2 | Party: AFD
Word Cloud for Sentiment: 0 | Party: LINKE
Word Cloud for Sentiment: 1 | Party: LINKE
Word Cloud for Sentiment: 2 | Party: LINKE
Word Cloud for Sentiment: 0 | Party: SPD
Word Cloud for Sentiment: 1 | Party: SPD
Word Cloud for Sentiment: 2 | Party: SPD
Word Cloud for Sentiment: 0 | Party: CDU_CSU
Word Cloud for Sentiment: 1 | Party: CDU_CSU
Word Cloud for Sentime