# Notebook: Create Subset

This notebook draws a subset of **2000** tweets, sampled in proportion to the number of tweets crawled for each party, which will then be annotated with respect to their sentiment.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
import random
import os

## Parameters

In [2]:
ANNOTATION_DATASET_PATH = '../Datasets/annotation_dataset'
DATASET_PATH = '../Datasets/dataset/'
SUBSET_SIZE = 2000
SEED_VALUE = 0
PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]

## Code

### 1. Get Reproducible Results

In [3]:
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)

### 2. Calculate Number of Tweets Crawled for Each Party

In [4]:
n_tweets_total = 0
party_statistics = {}

In [5]:
for party in PARTIES:
    n_tweets_party = 0
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]

                # Read dataframe
                df = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0)

                # Add to counter
                n_tweets_party += df.shape[0]

                # Add length to n_tweets_total
                n_tweets_total += df.shape[0]

    party_statistics[party] = n_tweets_party

In [6]:
n_tweets_total

707241

In [7]:
party_statistics

{'CDU_CSU': 227683,
 'SPD': 228415,
 'AFD': 57572,
 'FDP': 79815,
 'GRUENE': 73261,
 'LINKE': 40495}

### 3. Check Which Party Gets an Additional Tweet

In [8]:
def get_key_with_max_value_under_0_5(dictionary):
    filtered_dict = {}
    for key, value in dictionary.items():
        if value < 0.5:
            filtered_dict[key] = value
    return max(filtered_dict, key=filtered_dict.get)

In [9]:
def truncate(x, d):
    return int(x*(10.0**d))/(10.0**d)

# Scale each party's tweet count to its proportional (fractional) share of the subset
for party in party_statistics:
    party_statistics[party] = (SUBSET_SIZE / n_tweets_total) * party_statistics[party]

In [10]:
party_statistics

{'CDU_CSU': 643.8625588731422,
 'SPD': 645.932574610352,
 'AFD': 162.8073033096215,
 'FDP': 225.70806839535604,
 'GRUENE': 207.1740750324147,
 'LINKE': 114.51541977911349}
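
The fractional quotas above have to be turned into whole tweet counts that sum to exactly `SUBSET_SIZE`. One possible way to do this, shown here as a minimal sketch and not as the method this notebook uses, is the largest-remainder method: floor every quota, then give the leftover tweets to the parties with the largest fractional parts. The function name `allocate_largest_remainder` is hypothetical; the counts are copied from the output of cell 7.

```python
def allocate_largest_remainder(counts, subset_size):
    """Split subset_size proportionally to counts so the parts sum exactly."""
    total = sum(counts.values())
    # Integer arithmetic: floor of the quota and its remainder, without float error
    allocation = {k: (v * subset_size) // total for k, v in counts.items()}
    remainders = {k: (v * subset_size) % total for k, v in counts.items()}
    # Tweets still to be handed out after flooring every quota
    leftover = subset_size - sum(allocation.values())
    # Give one extra tweet to each of the parties with the largest remainders
    for party in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[party] += 1
    return allocation


# Tweet counts per party, copied from the output of cell 7
counts = {
    "CDU_CSU": 227683,
    "SPD": 228415,
    "AFD": 57572,
    "FDP": 79815,
    "GRUENE": 73261,
    "LINKE": 40495,
}
allocation = allocate_largest_remainder(counts, 2000)
print(allocation)
print(sum(allocation.values()))
```

For these counts the sketch yields 644, 646, 163, 226, 207 and 114 tweets for CDU_CSU, SPD, AFD, FDP, GRUENE and LINKE, which sums to 2000.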

### 4. Get Random Tweets From Each Account

In [11]:
n_subset_total = 0

In [12]:
annotation_dataset = pd.DataFrame()
for party in PARTIES:
    # Initialize an empty DataFrame to store the tweets from accounts of a party
    df_party = pd.DataFrame()

    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]

                # Read dataframe
                df_account = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0)

                # Add dataframe to party dataframe
                df_party = pd.concat([df_party, df_account], axis=0).reset_index(drop=True)

    n_tweets_party = df_party.shape[0]
    n_tweets_party_for_subset = round((SUBSET_SIZE / n_tweets_total) * n_tweets_party)

    n_subset_total += n_tweets_party_for_subset

    df_samples_for_party = df_party.sample(n=n_tweets_party_for_subset, random_state=SEED_VALUE)
    annotation_dataset = pd.concat([annotation_dataset, df_samples_for_party], axis=0).reset_index(drop=True)

In [13]:
n_subset_total

2001

### 5. Create Sub-Datasets for Annotation

Save dataset for annotation

In [14]:
os.makedirs(ANNOTATION_DATASET_PATH, exist_ok=True)

In [15]:
annotation_dataset = annotation_dataset.sample(frac=1, random_state=SEED_VALUE).reset_index(drop=True)

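Because each party's share is rounded independently, the rounded counts do not necessarily sum to exactly 2000. Here they sum to 2001, so one tweet must be removed. It is taken from LINKE, whose share (114.52) was rounded up by the narrowest margin.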

In [16]:
party_to_exclude_tweets = annotation_dataset[annotation_dataset['source_party'] == 'LINKE']
tweet_to_delete = party_to_exclude_tweets.sample(n=1)
annotation_dataset = annotation_dataset.drop(tweet_to_delete.index).reset_index(drop=True)
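
The removal step can also be written so that it does not assume a surplus of exactly one tweet. The following is a minimal sketch under stated assumptions: the helper `drop_surplus` and the toy data are hypothetical, and the party to trim is still chosen by the caller. It raises an error if the party has fewer rows than the surplus.

```python
import pandas as pd


def drop_surplus(df, party, target_size, seed, party_col="source_party"):
    """Remove as many randomly chosen rows of `party` as needed to reach target_size."""
    surplus = len(df) - target_size
    if surplus <= 0:
        # Nothing to remove; only normalize the index
        return df.reset_index(drop=True)
    candidates = df[df[party_col] == party]
    # A fixed random_state makes the choice of removed rows repeatable
    to_drop = candidates.sample(n=surplus, random_state=seed)
    return df.drop(to_drop.index).reset_index(drop=True)


# Toy data standing in for annotation_dataset (made-up values)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "source_party": ["SPD", "LINKE", "LINKE", "FDP", "LINKE"],
})
trimmed = drop_surplus(toy, party="LINKE", target_size=4, seed=0)
print(trimmed["source_party"].value_counts())
```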

In [17]:
annotation_dataset['source_party'].value_counts()

SPD        646
CDU_CSU    644
FDP        226
GRUENE     207
AFD        163
LINKE      114
Name: source_party, dtype: int64

In [18]:
tweet_to_delete

Unnamed: 0,id,conversation_id,created_at,date,timezone,tweet,language,hashtags,cashtags,user_id,...,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,source_account,source_party
163,1363829626531942404,1363824748535365634,1613997000000.0,2021-02-22 12:34:56,0,@lgbeutin @Linksfraktion Sie und die @dieLinke...,de,[],[],4895322000.0,...,,,,"[{'screen_name': 'lgbeutin', 'name': 'Lorenz G...",,,,,dieLinke,LINKE


In [19]:
annotation_dataset

Unnamed: 0,id,conversation_id,created_at,date,timezone,tweet,language,hashtags,cashtags,user_id,...,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,source_account,source_party
0,1460589265759481859,1460580145450917894,1.637067e+12,2021-11-16 12:43:12,0,@FrankieLix2020 @andfra9 @GoeringEckardt @ABae...,de,[],[],1.430499e+18,...,,,,"[{'screen_name': 'FrankieLix2020', 'name': 'Li...",,,,,OlafScholz,SPD
1,1470141403045019654,1470141403045019654,1.639344e+12,2021-12-12 21:19:59,0,Alter! Der @Karl_Lauterbach ist ja schon wiede...,de,"['karllauterbach', 'annewill']",[],7.619394e+17,...,,,,[],,,,,Karl_Lauterbach,SPD
2,1405950770466496521,1405870103284092930,1.624040e+12,2021-06-18 19:09:19,0,@Karl_Lauterbach Sie sehen auch so aus.,de,[],[],8.620021e+17,...,,,,"[{'screen_name': 'Karl_Lauterbach', 'name': 'P...",,,,,Karl_Lauterbach,SPD
3,1350126525220315136,1350126090673672193,1.610730e+12,2021-01-15 17:03:42,0,@PaulZiemiak @CDU 2/x »Wenn die Bürger die Wah...,de,[],[],1.225431e+18,...,,,,"[{'screen_name': 'PaulZiemiak', 'name': 'Paul ...",,,,,PaulZiemiak,CDU_CSU
4,1443988844500733952,1441846648020168707,1.633109e+12,2021-10-01 18:19:03,0,@ArminLaschet Laschet ist als CDU-Parteivorsit...,de,[],[],9.547100e+17,...,,,,"[{'screen_name': 'ArminLaschet', 'name': 'Armi...",,,,,CDU,CDU_CSU
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1400720644279504898,1400478935851413505,1.622793e+12,2021-06-04 08:46:40,0,@spdde @OlafScholz mit einem Wumms aus der Kri...,de,[],[],9.058720e+17,...,,,,"[{'screen_name': 'spdde', 'name': 'SPD Parteiv...",,,,,OlafScholz,SPD
1996,1415980314871087107,1415969720176574465,1.626431e+12,2021-07-16 11:23:09,0,@manager_magazin Wieviel kriegt @OlafScholz we...,de,[],[],1.331199e+18,...,,,,"[{'screen_name': 'manager_magazin', 'name': 'm...",,,,,OlafScholz,SPD
1997,1419538832488374275,1419405863199064080,1.627279e+12,2021-07-26 07:03:26,0,@jensteutrine @fdp @c_lindner Immer und immer ...,de,[],[],7.605764e+07,...,,,,"[{'screen_name': 'jensteutrine', 'name': 'Jens...",,,,,fdp,FDP
1998,1436026687548952584,1435973268817723392,1.631210e+12,2021-09-09 19:00:17,0,@Mirko17817058 @sebastiankurz @ArminLaschet Na...,de,[],[],1.285508e+18,...,,,,"[{'screen_name': 'Mirko17817058', 'name': 'Mir...",,,,,ArminLaschet,CDU_CSU


Save the entire annotation_dataset (unlabeled) as CSV

In [20]:
annotation_dataset.to_csv(ANNOTATION_DATASET_PATH + "/annotation_dataset.csv")

In [21]:
annotation_dataset['source_party'].value_counts()

SPD        646
CDU_CSU    644
FDP        226
GRUENE     207
AFD        163
LINKE      114
Name: source_party, dtype: int64

For the whole corpus, we do not delete duplicates. Duplicates can occur because a tweet can mention several politicians at once, which means that a tweet could be crawled for multiple politicians. However, we want to avoid evaluating the trained BERT model with tweets that were also used for training. Therefore, we make sure that there are no duplicates among the 2000 annotated tweets that we will later use for training and evaluation of our BERT model.

In [22]:
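# Check that the 'id' column of the annotation subset is unique
# (the subset is annotation_dataset, not df, which only holds the last account file read)
is_unique = annotation_dataset['id'].is_unique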
print("Dataset uniqueness: ", is_unique)

Dataset uniqueness:  True
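
If the check ever reported `False`, the duplicated rows would have to be inspected and removed. The following is a minimal sketch on made-up toy data, not part of the original notebook, that uses `duplicated` and `drop_duplicates` from pandas. Dropping rows shrinks the subset, so replacement tweets would then have to be sampled to get back to 2000.

```python
import pandas as pd

# Toy data: tweet 101 was crawled for two different accounts (made-up values)
toy = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "source_account": ["OlafScholz", "spdde", "spdde", "fdp"],
})

# Show every row whose id occurs more than once
duplicates = toy[toy["id"].duplicated(keep=False)]
print(duplicates)

# Keep only the first occurrence of each id
deduplicated = toy.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print("Dataset uniqueness: ", deduplicated["id"].is_unique)
```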


Add a column for the sentiment label and keep only the columns that might be helpful for annotators

In [23]:
annotation_dataset["sentiment"] = ""
annotation_dataset = annotation_dataset.loc[:, ['id', 'username', 'date', 'sentiment', 'tweet', 'link', 'source_account', 'source_party']]

In [24]:
df_session1 = annotation_dataset[:SUBSET_SIZE // 2]
df_session2 = annotation_dataset[SUBSET_SIZE // 2:]

In [25]:
df_session1.to_csv(ANNOTATION_DATASET_PATH + "/tweets_session_1.csv")
df_session1.to_excel(ANNOTATION_DATASET_PATH + "/tweets_session_1.xlsx")
df_session1

Unnamed: 0,id,username,date,sentiment,tweet,link,source_account,source_party
0,1460589265759481859,vdr137,2021-11-16 12:43:12,,@FrankieLix2020 @andfra9 @GoeringEckardt @ABae...,https://twitter.com/vdr137/status/146058926575...,OlafScholz,SPD
1,1470141403045019654,karsten_mo,2021-12-12 21:19:59,,Alter! Der @Karl_Lauterbach ist ja schon wiede...,https://twitter.com/karsten_mo/status/14701414...,Karl_Lauterbach,SPD
2,1405950770466496521,BalianIbelin65,2021-06-18 19:09:19,,@Karl_Lauterbach Sie sehen auch so aus.,https://twitter.com/BalianIbelin65/status/1405...,Karl_Lauterbach,SPD
3,1350126525220315136,DeineMeinungung,2021-01-15 17:03:42,,@PaulZiemiak @CDU 2/x »Wenn die Bürger die Wah...,https://twitter.com/DeineMeinungung/status/135...,PaulZiemiak,CDU_CSU
4,1443988844500733952,BaumannProf,2021-10-01 18:19:03,,@ArminLaschet Laschet ist als CDU-Parteivorsit...,https://twitter.com/BaumannProf/status/1443988...,CDU,CDU_CSU
...,...,...,...,...,...,...,...,...
995,1453263530090762250,smartmax900,2021-10-27 08:33:20,,Die stille Treppe im deutschen Bundestag. Wuss...,https://twitter.com/smartmax900/status/1453263...,AfDimBundestag,AFD
996,1436245908358934529,Muttihatfrei,2021-09-10 09:31:23,,@kreuzundquerma1 @Karl_Lauterbach @revan_sithl...,https://twitter.com/Muttihatfrei/status/143624...,Karl_Lauterbach,SPD
997,1381172615688155136,davmohr,2021-04-11 10:09:47,,@moellerdav @OlafScholz Was für ein bullshit... 😭,https://twitter.com/davmohr/status/13811726156...,OlafScholz,SPD
998,1364160606866190340,Bernd86391886,2021-02-23 10:30:08,,@Karl_Lauterbach Im Klimageschehen sind Sie no...,https://twitter.com/Bernd86391886/status/13641...,Karl_Lauterbach,SPD


In [26]:
df_session2.to_csv(ANNOTATION_DATASET_PATH + "/tweets_session_2.csv")
df_session2.to_excel(ANNOTATION_DATASET_PATH + "/tweets_session_2.xlsx")
df_session2

Unnamed: 0,id,username,date,sentiment,tweet,link,source_account,source_party
1000,1432673237494898692,jastebo,2021-08-31 12:54:52,,@maischberger @Amira_M_Ali @dieLinke @Beatrix_...,https://twitter.com/jastebo/status/14326732374...,AfD,AFD
1001,1392521582942334978,Heikoala2,2021-05-12 17:46:31,,@Markus__Lanz @MariamLau1 @Karl_Lauterbach Mar...,https://twitter.com/Heikoala2/status/139252158...,Karl_Lauterbach,SPD
1002,1471852747356053504,Eisbaer_LE,2021-12-17 14:40:15,,@CDU @_FriedrichMerz Zurück in die Zukunft. Wü...,https://twitter.com/Eisbaer_LE/status/14718527...,_FriedrichMerz,CDU_CSU
1003,1459496838629838848,WernerE93875908,2021-11-13 12:22:17,,@LisaLies12 @Beatrix_vStorch Ich habe NICHT be...,https://twitter.com/WernerE93875908/status/145...,Beatrix_vStorch,AFD
1004,1436969755689299972,super_tooper,2021-09-12 09:27:42,,@jensspahn @MikeMohring Mein Ort hat nur 200 E...,https://twitter.com/super_tooper/status/143696...,jensspahn,CDU_CSU
...,...,...,...,...,...,...,...,...
1995,1400720644279504898,Cole1971Ger,2021-06-04 08:46:40,,@spdde @OlafScholz mit einem Wumms aus der Kri...,https://twitter.com/Cole1971Ger/status/1400720...,OlafScholz,SPD
1996,1415980314871087107,Eingeloggt_,2021-07-16 11:23:09,,@manager_magazin Wieviel kriegt @OlafScholz we...,https://twitter.com/Eingeloggt_/status/1415980...,OlafScholz,SPD
1997,1419538832488374275,deshbeta,2021-07-26 07:03:26,,@jensteutrine @fdp @c_lindner Immer und immer ...,https://twitter.com/deshbeta/status/1419538832...,fdp,FDP
1998,1436026687548952584,grnFlip,2021-09-09 19:00:17,,@Mirko17817058 @sebastiankurz @ArminLaschet Na...,https://twitter.com/grnFlip/status/14360266875...,ArminLaschet,CDU_CSU


## IMPORTANT: NEXT STEPS

1. Create a new folder "annotated_dataset" in `/Datasets`
2. Add the annotated datasets in .xlsx format
3. Name them as follows:
`
['../Datasets/annotated_dataset/tweets_session_1_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_1_3.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_1.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_2.xlsx',
 '../Datasets/annotated_dataset/tweets_session_2_3.xlsx']
`

Schema:

`
../Datasets/annotated_dataset/tweets_session_{SESSION_ID}_{ANNOTATOR_ID}.xlsx
`
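
The expected file names can be generated from the schema. The sketch below assumes two sessions and three annotators per session, which is how the list above reads. The constant names `ANNOTATED_DATASET_PATH`, `SESSION_IDS` and `ANNOTATOR_IDS` are hypothetical.

```python
# Assumptions read off the file list above: 2 sessions, 3 annotators each
SESSION_IDS = [1, 2]
ANNOTATOR_IDS = [1, 2, 3]
ANNOTATED_DATASET_PATH = "../Datasets/annotated_dataset"  # hypothetical constant name

# Build one path per (session, annotator) pair following the naming schema
expected_files = [
    f"{ANNOTATED_DATASET_PATH}/tweets_session_{session_id}_{annotator_id}.xlsx"
    for session_id in SESSION_IDS
    for annotator_id in ANNOTATOR_IDS
]

for path in expected_files:
    print(path)
```

Comparing this list against the files actually present (for example with `os.path.exists`) is one way to spot a missing or misnamed annotation file before loading.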