# Notebook: Create Subset

This notebook creates a subset of **2000** tweets, sampled from each party in proportion to its share of the full dataset, which will then be annotated for sentiment.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [70]:
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
import random
import os

## Parameters

In [71]:
ANNOTATION_DATASET_PATH = '../Datasets/annotation_dataset'
DATASET_PATH = '../Datasets/dataset/'
SUBSET_SIZE = 2000
SEED_VALUE = 0
PARTIES = ["CDU_CSU", "SPD", "AfD", "FDP", "GRUENE", "LINKE"]

## Code

### 1. Get Reproducible Results

In [72]:
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
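
As a quick check that the seeding behaves as intended, the sketch below draws the same sample twice and compares the results. The helper `draw_sample` and the toy DataFrame are illustrative assumptions and are not part of the notebook.

```python
import random

import numpy as np
import pandas as pd

SEED_VALUE = 0


def draw_sample(df, n, seed):
    """Reseed the global RNGs, then sample n rows with a fixed random_state."""
    random.seed(seed)
    np.random.seed(seed)
    return df.sample(n=n, random_state=seed)


# Toy stand-in for a tweet DataFrame (hypothetical data)
toy = pd.DataFrame({"tweet": [f"tweet {i}" for i in range(100)]})

first = draw_sample(toy, 10, SEED_VALUE)
second = draw_sample(toy, 10, SEED_VALUE)

# Both draws should select exactly the same rows
print(first.index.equals(second.index))
```

Note that setting `PYTHONHASHSEED` through `os.environ` inside a running interpreter does not change hash randomization for the current process, which is fixed at startup; it only affects subprocesses started afterwards.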

### 2. Calculate Number of Tweets

In [73]:
n_tweets_total = 0

In [74]:
for party in PARTIES:
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            # Only consider CSV files that sit directly in a party directory
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Read the tweets of one account (one CSV file per account)
                df = pd.read_csv(os.path.join(subdir, file), sep=",", index_col=0)

                # Add the number of tweets in this file to the running total
                n_tweets_total += df.shape[0]


In [75]:
n_tweets_total

326928
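
The counting step above can also be sketched as a small function. This is a minimal sketch, not the notebook's implementation: `count_rows` is a hypothetical helper, it uses `pathlib` instead of `os.walk`, and the demo runs on a temporary directory with made-up files rather than on the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_rows(dataset_dir, parties):
    """Sum the row counts of all CSV files directly inside each party folder."""
    total = 0
    for party in parties:
        for csv_path in sorted(Path(dataset_dir, party).glob("*.csv")):
            total += len(pd.read_csv(csv_path, index_col=0))
    return total


# Demo on a throwaway directory layout: <tmp>/<party>/<account>.csv
with tempfile.TemporaryDirectory() as tmp:
    for party, n_rows in [("SPD", 3), ("FDP", 2)]:
        party_dir = Path(tmp, party)
        party_dir.mkdir()
        pd.DataFrame({"tweet": ["x"] * n_rows}).to_csv(party_dir / "account.csv")
    demo_total = count_rows(tmp, ["SPD", "FDP"])

print(demo_total)
```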

### 3. Get Random Tweets From Each Account

In [76]:
n_subset_total = 0

In [77]:
annotation_dataset = pd.DataFrame()

In [78]:
for party in PARTIES:
    # Initialize an empty DataFrame to store the tweets from accounts of a party
    df_party = pd.DataFrame()
    
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Read dataframe
                df_account = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0)
                
                # Save the information for which account the tweet was crawled
                df_account['source_account'] = username
                df_account['source_party'] = party
                
                # Add dataframe to party dataframe
                df_party = pd.concat([df_party, df_account], axis=0, ignore_index=True)
                
    n_tweets_party = df_party.shape[0]
    n_tweets_party_for_subset = round((SUBSET_SIZE / n_tweets_total) * n_tweets_party)
    n_subset_total += n_tweets_party_for_subset
                
    df_samples_for_party = df_party.sample(n=n_tweets_party_for_subset, random_state=SEED_VALUE)
    annotation_dataset = pd.concat([annotation_dataset, df_samples_for_party], axis=0, ignore_index=True)
    #print(party, username, n_tweets_party, n_tweets_party_for_subset, (SUBSET_SIZE / n_tweets_total) * n_tweets_party, n_tweets_party_for_subset)


In [79]:
n_subset_total

2001
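
The total is 2001 rather than `SUBSET_SIZE` because each party's share is rounded independently, so the rounding errors do not necessarily cancel out. One possible remedy, shown here as a sketch and not as the method used in this notebook, is largest-remainder allocation: floor every share, then hand the leftover slots to the parties with the largest fractional parts. The function name and the toy counts are hypothetical.

```python
import math


def allocate_proportionally(counts, target):
    """Split `target` slots across groups in proportion to `counts`,
    using the largest-remainder method so the result sums to `target`."""
    total = sum(counts.values())
    exact = {key: target * value / total for key, value in counts.items()}
    allocation = {key: math.floor(share) for key, share in exact.items()}

    # Distribute the slots lost to flooring, largest fractional part first
    leftover = target - sum(allocation.values())
    by_remainder = sorted(counts, key=lambda key: exact[key] - allocation[key], reverse=True)
    for key in by_remainder[:leftover]:
        allocation[key] += 1
    return allocation


# Hypothetical tweet counts per group, chosen so that naive rounding overshoots
toy_counts = {"party_a": 3, "party_b": 3, "party_c": 1}
toy_target = 4

naive = {key: round(toy_target * value / sum(toy_counts.values())) for key, value in toy_counts.items()}
allocation = allocate_proportionally(toy_counts, toy_target)

print(sum(naive.values()), sum(allocation.values()))
```

With these toy counts, independent rounding assigns five slots in total, while the largest-remainder allocation assigns exactly four.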

### 4. Create Sub Datasets for Annotation 

In [80]:
annotation_dataset = annotation_dataset.sample(frac=1, random_state=SEED_VALUE).reset_index(drop=True)
annotation_dataset = annotation_dataset.loc[:, ['id', 'username', 'date', 'tweet']]

In [81]:
df_session1 = annotation_dataset[:int(SUBSET_SIZE/2)]
df_session2 = annotation_dataset[int(SUBSET_SIZE/2):]
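
Because the split point is derived from `SUBSET_SIZE` and not from the actual length of the DataFrame, any rows beyond the first 2000 end up in the second session. The sketch below illustrates this with a stand-in DataFrame of 2001 rows, matching `n_subset_total` above; the `toy` data itself is hypothetical.

```python
import pandas as pd

SUBSET_SIZE = 2000

# Stand-in for annotation_dataset with 2001 rows (hypothetical content)
toy = pd.DataFrame({"id": range(2001)})

half = SUBSET_SIZE // 2
toy_session1 = toy.iloc[:half]   # rows 0 .. 999
toy_session2 = toy.iloc[half:]   # rows 1000 .. end

print(len(toy_session1), len(toy_session2))
```

If two equally sized sessions are required, the subset would need to be trimmed to `SUBSET_SIZE` rows before splitting.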

In [82]:
df_session1.to_csv(ANNOTATION_DATASET_PATH + "/tweets_session_1.csv")
df_session1

| | id | username | date | tweet |
|---|---|---|---|---|
| 0 | 1368659137761054724 | bam_pyro | 2021-03-07 20:25:41 | @Karl_Lauterbach @annewill Die Wahrscheinlichk... |
| 1 | 1379383386532368384 | GanzerG | 2021-04-06 10:40:01 | @Karl_Lauterbach Herr Lauterbach plappern Sie ... |
| 2 | 1354360396300300293 | laengerals4 | 2021-01-27 09:27:35 | @Joerg_Meuthen Glaubst du den Mist eigentlich ... |
| 3 | 1392026965117476864 | Bavarian_Propag | 2021-05-11 08:01:05 | @RenateTuebingen @PaulZiemiak @IsraelinGermany... |
| 4 | 1416311865525817344 | WernerHAlbrech1 | 2021-07-17 08:20:37 | @SHomburg @Markus_Soeder Aiwanger verdient Res... |
| ... | ... | ... | ... | ... |
| 995 | 1428244778668396545 | AlexMKonrad | 2021-08-19 06:37:45 | @lcywbb @spdde Hat bei den letzten Wahlen ja a... |
| 996 | 1446349477661626368 | MEBEKE2005 | 2021-10-08 05:39:22 | @StBrandner Ermittlungen, Herr Brandner. Schla... |
| 997 | 1476966956318310401 | huhnerbaron | 2021-12-31 17:22:17 | @Tomsch24 @ulumbamulumba @julianberger86 @StBr... |
| 998 | 1443568956493549572 | ChristianHKuhn | 2021-09-30 13:30:34 | @DrStefanBerger @KuehniKev @drumheadberlin @BI... |


In [83]:
df_session2.to_csv(ANNOTATION_DATASET_PATH + "/tweets_session_2.csv")
df_session2

| | id | username | date | tweet |
|---|---|---|---|---|
| 1000 | 1461671888313360397 | Reihner_pop | 2021-11-19 12:25:09 | @GesetzDesTages @willy_wusel @Glamour_is_free ... |
| 1001 | 1359508051267575809 | Honkthefirst | 2021-02-10 14:22:32 | @Aggregat11 @AfDimEUParl @Joerg_Meuthen @vonde... |
| 1002 | 1433700794948276244 | rudyratlos | 2021-09-03 07:58:01 | @faznet @HeikoMaas Der Herr Maas ist nicht zu ... |
| 1003 | 1400690315086946310 | claravia9 | 2021-06-04 05:46:09 | @Schmidtlepp @ArminLaschet @TspCheckpoint Step... |
| 1004 | 1447258986634416130 | BitcoinUndScifi | 2021-10-10 17:53:25 | @SteHaller @MartinDheBoss @davstier @f_schaeff... |
| ... | ... | ... | ... | ... |
| 1996 | 1424070892221513735 | KaStBe2 | 2021-08-07 18:12:13 | @Karl_Lauterbach Also ist die Trennung von Ver... |
| 1997 | 1387380782671478788 | Pedro39884887 | 2021-04-28 12:18:49 | @wmoebius @StBrandner Da Sie ideologisch verbl... |
| 1998 | 1373421922898284549 | Ilona_GR_DE | 2021-03-20 23:51:18 | @unsperrbare @dergruenepunkt @UweNess @SvenjaS... |
| 1999 | 1445771645075951637 | markrudolph2701 | 2021-10-06 15:23:15 | @StefanThumann @Markus_Wojahn @CDU Ich glaube ... |
