# Notebook: Clean Dataset

This notebook cleans the raw dataset. The individual cleaning steps are explained below.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [109]:
import csv  # needed for csv.QUOTE_MINIMAL when saving the cleaned data
import os
import re

import pandas as pd

## Parameters

In [110]:
RAW_DATASET_PATH = "../Datasets/raw_dataset/"
DATASET_PATH = "../Datasets/dataset/"
PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]

## Code

### 1. Create new Directories

In [111]:
# Iterate over the parties
for party in PARTIES:
    # Try to create the directory for the party
    try:
        os.makedirs(DATASET_PATH + party)
    except FileExistsError:
        # The directory already exists, so do nothing
        pass
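
The same step can also be written without the `try`/`except` block by passing `exist_ok=True` to `os.makedirs`. The following is a minimal sketch of that variant, not the notebook's own code: the helper name `make_party_dirs` is hypothetical, and a temporary directory stands in for `DATASET_PATH` so the example does not touch the real dataset folder.

```python
import os
import tempfile

PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]


def make_party_dirs(base_path, parties):
    """Hypothetical helper: create one sub-directory per party."""
    for party in parties:
        # exist_ok=True makes the call a no-op if the directory already exists
        os.makedirs(os.path.join(base_path, party), exist_ok=True)


# Assumption: a temporary directory replaces DATASET_PATH for this demo
with tempfile.TemporaryDirectory() as tmp:
    make_party_dirs(tmp, PARTIES)
    make_party_dirs(tmp, PARTIES)  # second call does not raise
    created = sorted(os.listdir(tmp))

print(created)
```

Running the helper twice shows that the cell can be re-executed safely, which is the purpose of catching `FileExistsError` in the original loop.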

### 2. Clean Dataframe and Store as CSV
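
Before the full loop over all accounts, the per-account cleaning can be illustrated on a small in-memory example. This is only a sketch under assumptions: the helper name `clean_account_frame`, the account name `some_account`, and the toy rows are all made up for illustration. The column names (`id`, `username`, `language`, `tweet`) are the ones used by the cell below.

```python
import re

import pandas as pd


def clean_account_frame(df: pd.DataFrame, account: str) -> pd.DataFrame:
    """Hypothetical helper that mirrors the cleaning steps of this notebook."""
    df = df.copy()
    # Remove all line breaks from the tweet text (nothing is put in their place)
    df["tweet"] = df["tweet"].apply(lambda x: re.sub(r"\r\n|\r|\n", "", x))
    # 1. Drop tweets written by the politician/party account itself
    df = df[df.username != account]
    # 2. Keep only tweets labelled as German
    df = df[df.language == "de"]
    # Renumber the remaining rows from 0
    return df.reset_index(drop=True)


# Toy data (invented for this example)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "username": ["some_account", "user_a", "user_b", "user_c"],
    "language": ["de", "de", "en", "de"],
    "tweet": ["Eigener Tweet", "Hallo\nWelt", "Hello", "Guten\r\nTag"],
})

cleaned = clean_account_frame(toy, "some_account")
print(cleaned[["id", "tweet"]])
```

Only the rows with `id` 2 and 4 remain. Because line breaks are replaced by an empty string, the words on either side of a break are joined (`"Hallo\nWelt"` becomes `"HalloWelt"`).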

In [112]:
n_tweets_total = 0
for party in PARTIES:
    n_tweets_party = 0
    for subdir, _, files in os.walk(RAW_DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(RAW_DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Load dataframe of an account
                df = pd.read_csv(RAW_DATASET_PATH + party + "/" + file, sep=",", index_col=0, lineterminator="\n")
                
                # Remove all line breaks from the values in the "tweet" column
                df['tweet'] = df['tweet'].apply(lambda x: re.sub(r'\r\n|\r|\n', '', x))
                
                if df["id"].nunique() == len(df):
                    print("All values in the column are unique.", username)
                else:
                    print("There are duplicate values in the column.", username)
                
                # 1. Filter out rows where the username is the politician/party account itself
                df = df[df.username != username]
                
                # 2. Keep only German tweets
                df = df[df.language == "de"]
                
                # Reset the index of the dataframe
                df = df.reset_index(drop=True)
                
                n_tweets_party += df.shape[0]
                print(username, df.shape[0])
                
                # Save dataframe (DATASET_PATH already ends with "/")
                df.to_csv(DATASET_PATH + party + "/" + username + ".csv", sep=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    n_tweets_total += n_tweets_party
    print(party, n_tweets_party)
print("Total: ", n_tweets_total)

All values in the column are unique. ArminLaschet
ArminLaschet 36570
All values in the column are unique. HBraun
HBraun 3250
All values in the column are unique. andreasscheuer
andreasscheuer 2456
All values in the column are unique. CSU
CSU 9201
All values in the column are unique. DerLenzMdB
DerLenzMdB 244
All values in the column are unique. Markus_Soeder
Markus_Soeder 30846
All values in the column are unique. ANiebler
ANiebler 25
All values in the column are unique. MarkusFerber
MarkusFerber 22
All values in the column are unique. Junge_Union
Junge_Union 911
All values in the column are unique. ManfredWeber
ManfredWeber 533
All values in the column are unique. DoroBaer
DoroBaer 2591
All values in the column are unique. rbrinkhaus
rbrinkhaus 4331
All values in the column are unique. tj_tweets
tj_tweets 399
All values in the column are unique. DaniLudwigMdB
DaniLudwigMdB 3891
All values in the column are unique. JuliaKloeckner
JuliaKloeckner 3386
All values in the column are unique. cducsubt
cducsubt 13482
All values in the column are unique. n_roettgen
n_roettgen 4719
All values in the column are unique. jensspahn
jensspahn 35963
All values in the column are unique. groehe
groehe 80
All values in the column are unique. _FriedrichMerz
_FriedrichMerz 23687
All values in the column are un

All values in the column are unique. OlafScholz
OlafScholz 27688
All values in the column are unique. jusos
jusos 1914
All values in the column are unique. spdbt
spdbt 9900

All values in the column are unique. Karl_Lauterbach
Karl_Lauterbach 133491
All values in the column are unique. KuehniKev
KuehniKev 5202
All values in the column are unique. larsklingbeil
larsklingbeil 5728
All values in the column are unique. HeikoMaas
HeikoMaas 6471
All values in the column are unique. MiRo_SPD
MiRo_SPD 354
All values in the column are unique. EskenSaskia
EskenSaskia 7292

All values in the column are unique. spdde
spdde 22401
SPD 230325
All values in the column are unique. MalteKaufmann
MalteKaufmann 5168
All values in the column are unique. AfD
AfD 8449
All values in the column are unique. PetrBystronAFD
PetrBystronAFD 501
All values in the column are unique. StBrandner
StBrandner 12024
All values in the column are unique. JoanaCotar
JoanaCotar 4363
All values in the column are unique. Beatrix_vStorch
Beatrix_vStorch 3959
All values in the column are unique. GtzFrmming
GtzFrmming 993
All values in the column are unique. Alice_Weidel
Alice_Weidel 9414
All values in the column are unique. AfDimBundestag
AfDimBundestag 4776
All values in the column are unique. AfDBerlin
AfDBerlin 371
All values in the column are unique. gottfriedcurio
gottfriedcurio 271
All values in the column are unique. Joerg_Meuthen
Joerg_Meuthen 4857
All values in the column are unique. Tino_Chrupalla
Tino_Chrupalla 2900
AFD 58046
All values in the column are unique. f_schaeffler
f_s

All values in the column are unique. fdp
fdp 28303
All values in the column are unique. LindaTeuteberg
LindaTeuteberg 332
All values in the column are unique. Wissing
Wissing 2817
All values in the column are unique. Lambsdorff
Lambsdorff 888
All values in the column are unique. KonstantinKuhle
KonstantinKuhle 2740
All values in the column are unique. MarcoBuschmann
MarcoBuschmann 10115
All values in the column are unique. johannesvogel
johannesvogel 2146
FDP 80546
All values in the column are unique. GoeringEckardt
GoeringEckardt 5240
All values in the column are unique. Ricarda_Lang
Ricarda_Lang 3582
All values in the column are unique. BriHasselmann
BriHasselmann 1815
All values in the column are unique. KathaSchulze
KathaSchulze 4653
All values in the column are unique. GrueneBundestag
GrueneBundestag 6540
All values in the column are unique. cem_oezdemir
cem_oezdemir 9969
All values in the column are unique. nouripour
nouripour 507
All values in the column are unique. MiKellner
Mi

All values in the column are unique. Die_Gruenen
Die_Gruenen 30944
All values in the column are unique. gruene_jugend
gruene_jugend 1299
GRUENE 74005
All values in the column are unique. SWagenknecht
SWagenknecht 7202
All values in the column are unique. dieLinke
dieLinke 14305
All values in the column are unique. Linksfraktion
Linksfraktion 3010
All values in the column are unique. Janine_Wissler
Janine_Wissler 1045
All values in the column are unique. dielinkeberlin
dielinkeberlin 1241
All values in the column are unique. DietmarBartsch
DietmarBartsch 3437
All values in the column are unique. SusanneHennig
SusanneHennig 4502
All values in the column are unique. GregorGysi
GregorGysi 1732
All values in the column are unique. jankortemdb
jankortemdb 743
All values in the column are unique. anked
anked 959
All values in the column are unique. SevimDagdelen
SevimDagdelen 172
All values in the column are unique. katjakipping
katjakipping 1089
All values in the column are unique. b_riexing