# Notebook: Create Subset

This notebook is used to create a subset of **2000** tweets, which will then be annotated with respect to their sentiment.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
import random
import os

## Parameters

In [2]:
ANNOTATION_DATASET_PATH = '../Datasets/annotation_dataset'
DATASET_PATH = '../Datasets/dataset/'
SUBSET_SIZE = 2000
SEED_VALUE = 1
PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]
ADD_ADDITIONAL_TWEET_FOR_PARTY = False

## Code

### 1. Get Reproducible Results

In [3]:
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
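
One way to sanity-check the seeding is to draw the same values twice and compare them. The sketch below is illustrative only: `draw_values` and the toy DataFrame are hypothetical and not part of the notebook. Note that setting `PYTHONHASHSEED` from inside a running interpreter does not change hash randomization for that process, since the variable is read at interpreter startup; it only affects child processes started afterwards.

```python
import random

import numpy as np
import pandas as pd

SEED_VALUE = 1


def draw_values(seed):
    """Hypothetical helper: reseed both generators and draw one value from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())


# Reseeding with the same value reproduces the same draws
first = draw_values(SEED_VALUE)
second = draw_values(SEED_VALUE)
print(first == second)  # True

# pandas sampling is reproducible when random_state is passed explicitly
toy = pd.DataFrame({"id": range(10)})
sample_a = toy.sample(n=3, random_state=SEED_VALUE)
sample_b = toy.sample(n=3, random_state=SEED_VALUE)
print(sample_a.equals(sample_b))  # True
```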

### 2. Calculate Number of Tweets crawled for each party

In [4]:
n_tweets_total = 0
party_statistics = {}

In [5]:
for party in PARTIES:
    n_tweets_party = 0
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]

                # Read dataframe
                df = pd.read_csv(DATASET_PATH + party + "/" +
                                 file, sep=",", index_col=0)

                # Add to counter
                n_tweets_party += df.shape[0]

                # Add length to n_tweets_total
                n_tweets_total += df.shape[0]

    party_statistics[party] = n_tweets_party


In [6]:
n_tweets_total

706945

In [7]:
party_statistics

{'CDU_CSU': 227560,
 'SPD': 228393,
 'AFD': 57565,
 'FDP': 79796,
 'GRUENE': 73160,
 'LINKE': 40471}

### 3. Check which party gets an additional Tweet

In [8]:
def get_key_with_max_value_under_0_5(dictionary):
    filtered_dict = {}
    for key, value in dictionary.items():
        if value < 0.5:
            filtered_dict[key] = value
    return max(filtered_dict, key=filtered_dict.get)

In [9]:
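def truncate(x, d):
    # Cut x off after d decimal places without rounding
    return int(x*(10.0**d))/(10.0**d)

# Overwrite each party's tweet count with the fractional part of its
# proportional share of the subset (share minus its truncated integer part)
for party in party_statistics:
    share = (SUBSET_SIZE / n_tweets_total) * party_statistics[party]
    party_statistics[party] = share - truncate(share, 0)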

In [10]:
# Check which Party will get an additional Tweet
party_with_additional_tweet = get_key_with_max_value_under_0_5(party_statistics)
party_with_additional_tweet

'LINKE'
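
The proportional allocation behind this step can be reproduced from the counts reported in step 2. The following is a minimal sketch, not the notebook's own code: the names `counts`, `shares`, `fractions`, `allocation`, and `candidate` are introduced here for illustration. It shows that rounding each party's share already sums to the subset size, which is consistent with `ADD_ADDITIONAL_TWEET_FOR_PARTY = False`.

```python
SUBSET_SIZE = 2000

# Tweet counts per party as reported in step 2
counts = {
    "CDU_CSU": 227560,
    "SPD": 228393,
    "AFD": 57565,
    "FDP": 79796,
    "GRUENE": 73160,
    "LINKE": 40471,
}
n_total = sum(counts.values())

# Exact (fractional) share of the subset for each party
shares = {party: (SUBSET_SIZE / n_total) * n for party, n in counts.items()}

# Fractional part of each share and the rounded allocation
fractions = {party: share - int(share) for party, share in shares.items()}
allocation = {party: round(share) for party, share in shares.items()}

# Party that was rounded down by the largest amount
below_half = {party: frac for party, frac in fractions.items() if frac < 0.5}
candidate = max(below_half, key=below_half.get)

print(allocation)
print(sum(allocation.values()))  # 2000
print(candidate)  # LINKE
```

If the rounded allocation summed to one less than `SUBSET_SIZE`, the candidate party would be the natural recipient of the extra tweet.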

### 4. Get Random Tweets From Each Account

In [11]:
n_subset_total = 0

In [12]:
annotation_dataset = pd.DataFrame()
for party in PARTIES:
    # Initialize an empty DataFrame to store the tweets from accounts of a party
    df_party = pd.DataFrame()

    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]

                # Read dataframe
                df_account = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0)

                # Add dataframe to party dataframe
                df_party = pd.concat([df_party, df_account], axis=0).reset_index().drop(columns='index')

    n_tweets_party = df_party.shape[0]
    n_tweets_party_for_subset = round((SUBSET_SIZE / n_tweets_total) * n_tweets_party)

    if party_with_additional_tweet == party and ADD_ADDITIONAL_TWEET_FOR_PARTY:
        n_tweets_party_for_subset += 1

    n_subset_total += n_tweets_party_for_subset

    df_samples_for_party = df_party.sample(n=n_tweets_party_for_subset, random_state=SEED_VALUE)
    annotation_dataset = pd.concat([annotation_dataset, df_samples_for_party], axis=0).reset_index().drop(columns='index')      

In [13]:
n_subset_total

2000
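
The per-party sampling can be illustrated on toy data. This is a sketch under stated assumptions: the `tweets` DataFrame, its `party` column, and the party labels `A`, `B`, `C` are hypothetical stand-ins for the crawled CSV files. It collects the per-party samples in a list and concatenates once, which avoids repeated `pd.concat` calls inside the loop.

```python
import pandas as pd

SEED_VALUE = 1
SUBSET_SIZE = 10

# Toy stand-in for the crawled tweets (hypothetical data)
tweets = pd.DataFrame({
    "id": range(100),
    "party": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
})

samples = []
for party, df_party in tweets.groupby("party"):
    # Number of tweets proportional to the party's share of all tweets
    n_for_subset = round((SUBSET_SIZE / len(tweets)) * len(df_party))
    samples.append(df_party.sample(n=n_for_subset, random_state=SEED_VALUE))

# Concatenate once and renumber the rows
subset = pd.concat(samples, ignore_index=True)
party_counts = subset["party"].value_counts().to_dict()
print(party_counts)
```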

### 5. Create Sub Datasets for Annotation

Save dataset for annotation

In [14]:
# Create the output folder; do nothing if it already exists
os.makedirs(ANNOTATION_DATASET_PATH, exist_ok=True)

In [15]:
annotation_dataset = annotation_dataset.sample(frac=1, random_state=SEED_VALUE).reset_index()
annotation_dataset.to_csv(ANNOTATION_DATASET_PATH + "/annotation_dataset.csv")

For the whole corpus, we do not delete duplicates. Duplicates can occur because a tweet can mention several politicians at once, which means that a tweet could be crawled for multiple politicians. However, we want to avoid evaluating the trained BERT model with tweets that were also used for training. Therefore, we make sure that there are no duplicates among the 2000 annotated tweets that we will later use for training and evaluation of our BERT model.

In [16]:
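# Check if the 'id' column of the annotation subset is unique
# (`df` would only refer to the last account file read in step 2)
is_unique = annotation_dataset['id'].is_unique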
print("Dataset uniqueness: ", is_unique)

Dataset uniqueness:  True
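
The check above only reports whether duplicates exist. If it ever returned `False`, duplicates could be inspected and removed by tweet ID. The following is a sketch on hypothetical toy data, not part of the original notebook:

```python
import pandas as pd

# Toy data (hypothetical): tweet 2 was crawled for two different accounts
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "source_account": ["a", "b", "c", "d"],
})

print(toy["id"].is_unique)  # False

# Show every row that shares its id with another row
duplicate_rows = toy[toy.duplicated(subset="id", keep=False)]
print(duplicate_rows)

# Keep the first occurrence of each id and renumber the rows
deduplicated = toy.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(deduplicated["id"].is_unique)  # True
```

Removing rows after sampling would shrink the subset below the target size, so in practice the sample would need to be topped up or deduplication applied before sampling.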


Add a column for the sentiment label and keep the columns with information that might be helpful for annotators

In [17]:
annotation_dataset["sentiment"] = ""
annotation_dataset = annotation_dataset.loc[:, ['id', 'username', 'date', 'sentiment', 'tweet', 'link', 'source_account']]

In [18]:
df_session1 = annotation_dataset[:int(SUBSET_SIZE/2)]
df_session2 = annotation_dataset[int(SUBSET_SIZE/2):]

In [19]:
df_session1.to_csv(ANNOTATION_DATASET_PATH + "/tweets_session_1.csv")
df_session1.to_excel(ANNOTATION_DATASET_PATH + "/tweets_session_1.xlsx")
df_session1

Unnamed: 0,id,username,date,sentiment,tweet,link,source_account
0,1382741317021790218,NinaHase1,2021-04-15 18:03:14,,@Karl_Lauterbach Man kann sich nur noch selbst...,https://twitter.com/NinaHase1/status/138274131...,Karl_Lauterbach
1,1437256523231731712,Tom26273989,2021-09-13 04:27:12,,@KlausBreuer6 @nordlohner @cem_oezdemir @Die_G...,https://twitter.com/Tom26273989/status/1437256...,cem_oezdemir
2,1427332322353991683,CharlyCgn65,2021-08-16 19:11:58,,@OlafScholz Die Regierungsparteien stimmten im...,https://twitter.com/CharlyCgn65/status/1427332...,OlafScholz
3,1443667240054894596,EsWirdSchlimmer,2021-09-30 21:01:06,,@AfD_Muenster @AfDimBundestag @Alice_Weidel @T...,https://twitter.com/EsWirdSchlimmer/status/144...,Beatrix_vStorch
4,1474015959253925890,Q_Paxxx,2021-12-23 13:56:05,,@FnlspcMooncake @MartinWsws @MPKretschmer @c_l...,https://twitter.com/Q_Paxxx/status/14740159592...,OlafScholz
...,...,...,...,...,...,...,...
995,1390050771551637508,GewitterTwiter,2021-05-05 22:08:24,,@CatGangBerlin @ChristianFuchs_ @janboehm @jen...,https://twitter.com/GewitterTwiter/status/1390...,jensspahn
996,1384973973490966528,DobermannHerr,2021-04-21 21:55:01,,@PaulZiemiak Ich hoffe die JU unterstützt Herr...,https://twitter.com/DobermannHerr/status/13849...,PaulZiemiak
997,1387188508545142784,compete101,2021-04-28 00:34:47,,@GregorGysi Der Weg wurde in Deutschland schon...,https://twitter.com/compete101/status/13871885...,GregorGysi
998,1399601070595321862,CologneMarine,2021-06-01 06:37:53,,@BSchmidtMattern grillte grad im @DLF @ArminLa...,https://twitter.com/CologneMarine/status/13996...,ArminLaschet


In [20]:
df_session2.to_csv(ANNOTATION_DATASET_PATH + "/tweets_session_2.csv")
df_session2.to_excel(ANNOTATION_DATASET_PATH + "/tweets_session_2.xlsx")
df_session2

Unnamed: 0,id,username,date,sentiment,tweet,link,source_account
1000,1442435320876699653,Brennpunkt6,2021-09-27 11:25:54,,@faznet @ArminLaschet @Markus_Soeder @CDU @SPD...,https://twitter.com/Brennpunkt6/status/1442435...,CDU
1001,1409437100701794306,grrzzzly,2021-06-28 10:02:45,,@Die_Diktatur @MLevitt_NP2013 @jengleruk Und? ...,https://twitter.com/grrzzzly/status/1409437100...,Karl_Lauterbach
1002,1410716291451834370,NLirady,2021-07-01 22:45:48,,"@spdde ""Wir beteiligen uns nicht an unsachlich...",https://twitter.com/NLirady/status/14107162914...,spdde
1003,1369333225890385929,TM01015,2021-03-09 17:04:16,,@CDU @cdurlp @Junge_Union Wieder leere Worthül...,https://twitter.com/TM01015/status/13693332258...,Junge_Union
1004,1358401630908071938,HansJrg70066168,2021-02-07 13:06:01,,@hahnflo @jusos Diesen Gewalt verherrlichenden...,https://twitter.com/HansJrg70066168/status/135...,jusos
...,...,...,...,...,...,...,...
1995,1475817035917635592,Olaf1911,2021-12-28 13:12:55,,@caracas_marion @erick_haase @jurischnoeller @...,https://twitter.com/Olaf1911/status/1475817035...,cem_oezdemir
1996,1380923932551839747,lazyb0y,2021-04-10 17:41:36,,@EskenSaskia Wenn Sie „beruflich wie privat“ s...,https://twitter.com/lazyb0y/status/13809239325...,OlafScholz
1997,1414308108353081349,SoWieRito,2021-07-11 20:38:24,,@161felixx @dieLinke @Janine_Wissler @partei_d...,https://twitter.com/SoWieRito/status/141430810...,dieLinke
1998,1471871996048859138,CCPlayer3D,2021-12-17 15:56:44,,@twiag1985 @DrNathanCole @Lashersday @_Friedri...,https://twitter.com/CCPlayer3D/status/14718719...,_FriedrichMerz


## IMPORTANT: NEXT STEPS

1. Create a new folder "annotated_dataset" in `/Datasets`
2. Add the annotated datasets in .xlsx format
3. Name these:
`
['../Datasets/annotated_dataset/tweets_session_1_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_3.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_3.xlsx']
`

Schema:

`
../Datasets/annotated_dataset/tweets_session_{SESSION_ID}_{ANNOTATOR_ID}.xlsx
`
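
The expected file names can be generated from the schema. This is a small sketch: the constant names `ANNOTATED_DATASET_PATH`, `SESSION_IDS`, and `ANNOTATOR_IDS` are assumptions introduced here, with two sessions and three annotators inferred from the list above.

```python
# Path and ID ranges taken from the file list above (constant names are hypothetical)
ANNOTATED_DATASET_PATH = "../Datasets/annotated_dataset"
SESSION_IDS = [1, 2]
ANNOTATOR_IDS = [1, 2, 3]

# One file per (session, annotator) pair, following the naming schema
expected_files = [
    f"{ANNOTATED_DATASET_PATH}/tweets_session_{session_id}_{annotator_id}.xlsx"
    for session_id in SESSION_IDS
    for annotator_id in ANNOTATOR_IDS
]

for path in expected_files:
    print(path)
```

Comparing this list against the folder contents (for example with `os.path.exists`) is one way to confirm that all annotated files are in place before continuing.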