# Notebook: Clean Dataset

This notebook cleans the raw dataset. The individual steps are explained below.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [79]:
import pandas as pd
import os

## Parameters

In [80]:
RAW_DATASET_PATH = "../Datasets/raw_dataset/"
DATASET_PATH = "../Datasets/dataset/"
PARTIES = ["CDU_CSU", "SPD", "AfD", "FDP", "GRUENE", "LINKE"]

## Code

### 1. Create new Directories

In [81]:
# Create one output directory per party. PARTIES from the parameters is reused
# so the directory names match the paths written in step 2 ("AfD", not "AFD").
for party in PARTIES:
    # exist_ok=True makes this a no-op if the directory already exists
    os.makedirs(os.path.join(DATASET_PATH, party), exist_ok=True)
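
A minimal sketch of the same step using `pathlib`, run against a temporary directory so it does not touch the real dataset folder. The temporary base path and the repeated loop are assumptions made for illustration; `mkdir(parents=True, exist_ok=True)` can be called repeatedly without raising, which is why the loop runs twice.

```python
import tempfile
from pathlib import Path

# Party names as listed in the notebook parameters
parties = ["CDU_CSU", "SPD", "AfD", "FDP", "GRUENE", "LINKE"]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for DATASET_PATH
    base = Path(tmp) / "dataset"
    # Run twice to show that re-creating existing directories is harmless
    for _ in range(2):
        for party in parties:
            (base / party).mkdir(parents=True, exist_ok=True)
    created = sorted(p.name for p in base.iterdir())

print(created)
```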

### 2. Clean Dataframe and store as CSV
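
This step applies two filters to each account file: rows written by the account itself are dropped, and only rows with language `de` are kept. The sketch below illustrates both filters on a small in-memory DataFrame. The helper name `clean_account_frame`, the toy rows, and the `text` column are hypothetical; only the `username` and `language` columns come from the notebook.

```python
import pandas as pd


def clean_account_frame(df: pd.DataFrame, account: str) -> pd.DataFrame:
    """Drop rows written by `account` itself and keep only German rows."""
    not_own = df["username"] != account
    is_german = df["language"] == "de"
    return df[not_own & is_german]


# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "username": ["OlafScholz", "user_a", "user_b", "user_c"],
    "language": ["de", "de", "en", "de"],
    "text": ["...", "...", "...", "..."],
})

cleaned = clean_account_frame(toy, account="OlafScholz")
print(cleaned["username"].tolist())
```

Combining both conditions into one boolean mask gives the same result as filtering twice in sequence, since each condition only looks at its own column.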

In [84]:
n_tweets_total = 0
for party in PARTIES:
    n_tweets_party = 0
    for subdir, _, files in os.walk(RAW_DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(RAW_DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Load dataframe of an account
                df = pd.read_csv(RAW_DATASET_PATH + party + "/" + file, sep=",")
                
                # 1. Filter out rows where the username is the politician/party account itself
                df = df[df.username != username]
                
                # 2. Keep only German tweets
                df = df[df.language == "de"]
                
                n_tweets_party += df.shape[0]
                print(username, df.shape[0])
                
                # Save dataframe (os.path.join avoids the doubled "/" after DATASET_PATH)
                df.to_csv(os.path.join(DATASET_PATH, party, username + ".csv"))
    n_tweets_total += n_tweets_party
    print(party, n_tweets_party)
print("Total: ", n_tweets_total)

ArminLaschet 17187
HBraun 1465
andreasscheuer 873
CSU 6089
DerLenzMdB 90
Markus_Soeder 13371


ANiebler 5
MarkusFerber 14
Junge_Union 416
ManfredWeber 85
DoroBaer 2216
rbrinkhaus 581
tj_tweets 408
DaniLudwigMdB 1065
JuliaKloeckner 1484
cducsubt 4996
n_roettgen 1393


jensspahn 20672
groehe 290
_FriedrichMerz 6643
hahnflo 387
PaulZiemiak 7758
smueller 0


CDU 17308
CDU_CSU 104796
KarambaDiaby 128
Ralf_Stegner 2852
hubertus_heil 1536


OlafScholz 14161
jusos 244
spdbt 6332


Karl_Lauterbach 46511
KuehniKev 3265
larsklingbeil 2008
HeikoMaas 3596
MiRo_SPD 189
EskenSaskia 3395
spdde 8068
SPD 92285
MalteKaufmann 1954
AfD 4676
PetrBystronAFD 154
StBrandner 5523
JoanaCotar 2152
Beatrix_vStorch 1801
GtzFrmming 600
Alice_Weidel 4296
AfDimBundestag 2871
AfDBerlin 341
gottfriedcurio 117
Joerg_Meuthen 2966
Tino_Chrupalla 944
AfD 28395
f_schaeffler 1178
ria_schroeder 448
fdpbt 4335
c_lindner 10150
MaStrackZi 1052
fdp_nrw 439
fdp 14017
LindaTeuteberg 264
Wissing 1447
Lambsdorff 730
KonstantinKuhle 1685
MarcoBuschmann 3792
johannesvogel 1714
FDP 41251
GoeringEckardt 3821
Ricarda_Lang 2681
BriHasselmann 829
KathaSchulze 784
GrueneBundestag 2510
cem_oezdemir 4252
nouripour 132
MiKellner 544
JTrittin 637
KonstantinNotz 1112
RenateKuenast 1137
Die_Gruenen 23113
gruene_jugend 1065
GRUENE 42617
SWagenknecht 2847
dieLinke 5879
Linksfraktion 1649
Janine_Wissler 1151
dielinkeberlin 1209
DietmarBartsch 1129
SusanneHennig 537
GregorGysi 708
jankortemdb 353
anked 361
SevimDagdelen 