# Notebook: Create Subset

This notebook draws a random subset of roughly **2000** tweets, sampled from each party in proportion to its share of the full dataset. The subset will then be annotated for sentiment.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [95]:
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
import random
import os

## Parameters

In [96]:
DATASET_PATH = '../Datasets/dataset/'
SUBSET_SIZE = 2000
SEED_VALUE = 0
PARTIES = ["CDU_CSU", "SPD", "AfD", "FDP", "GRUENE", "LINKE"]

## Code

### 1. Get Reproducible Results

In [97]:
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
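
As a quick check of what a fixed seed buys, the sketch below uses a toy DataFrame (hypothetical data, not the tweet dataset) to show that `DataFrame.sample` with a fixed `random_state` selects the same rows on every call. One caveat: Python reads `PYTHONHASHSEED` when the interpreter starts, so assigning it through `os.environ` inside an already running kernel only affects processes launched afterwards.

```python
import pandas as pd

SEED_VALUE = 0

# Toy data standing in for the tweet tables (hypothetical values)
toy = pd.DataFrame({"id": range(100), "tweet": [f"tweet {i}" for i in range(100)]})

# Two draws with the same random_state select identical rows
first_draw = toy.sample(n=10, random_state=SEED_VALUE)
second_draw = toy.sample(n=10, random_state=SEED_VALUE)

print(first_draw.equals(second_draw))
```

Because the sampling calls later in this notebook pass `random_state` explicitly, they do not depend on the global `random` or NumPy seeds.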

### 2. Calculate Number of Tweets

In [98]:
n_tweets_total = 0

In [99]:
for party in PARTIES:
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Read dataframe
                df = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0)
                
                # Add length to n_tweets_total
                n_tweets_total += df.shape[0]


In [100]:
n_tweets_total

326928
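
The counting step can also be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: `count_rows` is a hypothetical name, it uses the standard-library `csv` module instead of pandas, and it assumes every file has exactly one header row. The demo data is a throwaway directory with made-up accounts.

```python
import csv
import tempfile
from pathlib import Path


def count_rows(dataset_path, parties):
    """Count data rows (header excluded) in the CSV files directly inside each party folder."""
    total = 0
    for party in parties:
        for csv_path in sorted(Path(dataset_path, party).glob("*.csv")):
            with open(csv_path, newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                next(reader, None)  # skip the header row
                total += sum(1 for row in reader if row)  # ignore blank lines
    return total


# Tiny throwaway dataset (hypothetical accounts) to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    for party, n_rows in [("SPD", 3), ("FDP", 2)]:
        party_dir = Path(tmp, party)
        party_dir.mkdir()
        with open(party_dir / "some_account.csv", "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["", "id", "tweet"])
            writer.writerows([[i, 1000 + i, f"text {i}"] for i in range(n_rows)])
    demo_total = count_rows(tmp, ["SPD", "FDP"])

print(demo_total)
```

Using `glob("*.csv")` on the party folder only matches files directly inside it, which mirrors the `subdir[len(DATASET_PATH):] in PARTIES` filter in the loop above.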

### 3. Get Random Tweets From Each Account

In [101]:
n_subset_total = 0

In [102]:
annotation_dataset = pd.DataFrame()

In [103]:
for party in PARTIES:
    # Initialize an empty DataFrame to store the tweets from accounts of a party
    df_party = pd.DataFrame()
    
    for subdir, _, files in os.walk(DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Read dataframe
                df_account = pd.read_csv(DATASET_PATH + party + "/" + file, sep=",", index_col=0)
                
                # Save the information for which account the tweet was crawled
                df_account['source_account'] = username
                df_account['source_party'] = party
                
                # Add dataframe to party dataframe
                df_party = pd.concat([df_party, df_account], axis=0, ignore_index=True)
                
    n_tweets_party = df_party.shape[0]
    n_tweets_party_for_subset = round((SUBSET_SIZE / n_tweets_total) * n_tweets_party)
    n_subset_total += n_tweets_party_for_subset
                
    df_samples_for_party = df_party.sample(n=n_tweets_party_for_subset, random_state=SEED_VALUE)
    annotation_dataset = pd.concat([annotation_dataset, df_samples_for_party], axis=0, ignore_index=True)
    #print(party, username, n_tweets_party, n_tweets_party_for_subset, (SUBSET_SIZE / n_tweets_total) * n_tweets_party, n_tweets_party_for_subset)


In [104]:
n_subset_total

2001
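
The total is 2001 rather than `SUBSET_SIZE` because each party's share is rounded independently, so the rounded quotas need not sum to the target. If an exact total matters, one option is largest-remainder allocation. The sketch below is an assumption about how that could look, not what this notebook does; `allocate_proportionally` and the example counts are hypothetical.

```python
import math


def allocate_proportionally(counts, subset_size):
    """Largest-remainder allocation: the returned quotas sum exactly to subset_size."""
    total = sum(counts.values())
    exact = {name: subset_size * n / total for name, n in counts.items()}
    # Start from the floor of each exact share
    quotas = {name: math.floor(share) for name, share in exact.items()}
    leftover = subset_size - sum(quotas.values())
    # Hand the remaining slots to the largest fractional parts
    by_remainder = sorted(exact, key=lambda name: exact[name] - quotas[name], reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas


# Hypothetical per-group counts, chosen so that plain rounding overshoots the target
example_counts = {"A": 1500, "B": 1500, "C": 3000}
rounded = {name: round(7 * n / 6000) for name, n in example_counts.items()}
quotas = allocate_proportionally(example_counts, 7)

print(sum(rounded.values()), sum(quotas.values()))
```

In the notebook, the per-party tweet counts would take the place of `example_counts`, and each resulting quota would be passed as `n` to `df_party.sample`.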

In [105]:
annotation_dataset = annotation_dataset.sample(frac=1, random_state=SEED_VALUE)

In [106]:
annotation_dataset = annotation_dataset.loc[:, ['id', 'username', 'date', 'tweet']]

In [107]:
annotation_dataset

| | id | username | date | tweet |
|---|---|---|---|---|
| 677 | 1368659137761054724 | bam_pyro | 2021-03-07 20:25:41 | @Karl_Lauterbach @annewill Die Wahrscheinlichk... |
| 980 | 1379383386532368384 | GanzerG | 2021-04-06 10:40:01 | @Karl_Lauterbach Herr Lauterbach plappern Sie ... |
| 1240 | 1354360396300300293 | laengerals4 | 2021-01-27 09:27:35 | @Joerg_Meuthen Glaubst du den Mist eigentlich ... |
| 156 | 1392026965117476864 | Bavarian_Propag | 2021-05-11 08:01:05 | @RenateTuebingen @PaulZiemiak @IsraelinGermany... |
| 522 | 1416311865525817344 | WernerHAlbrech1 | 2021-07-17 08:20:37 | @SHomburg @Markus_Soeder Aiwanger verdient Res... |
| ... | ... | ... | ... | ... |
| 835 | 1424070892221513735 | KaStBe2 | 2021-08-07 18:12:13 | @Karl_Lauterbach Also ist die Trennung von Ver... |
| 1216 | 1387380782671478788 | Pedro39884887 | 2021-04-28 12:18:49 | @wmoebius @StBrandner Da Sie ideologisch verbl... |
| 1653 | 1373421922898284549 | Ilona_GR_DE | 2021-03-20 23:51:18 | @unsperrbare @dergruenepunkt @UweNess @SvenjaS... |
| 559 | 1445771645075951637 | markrudolph2701 | 2021-10-06 15:23:15 | @StefanThumann @Markus_Wojahn @CDU Ich glaube ... |


In [108]:
annotation_dataset.to_csv("out.csv")