# Notebook: Clean Dataset

This notebook cleans the crawled dataset. The cleaning steps are explained below.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
import pandas as pd
import csv
import re
import os

## Parameters

In [2]:
RAW_DATASET_PATH = "../Datasets/raw_dataset/"
DATASET_PATH = "../Datasets/dataset/"
PARTIES = ["CDU_CSU", "SPD", "AFD", "FDP", "GRUENE", "LINKE"]

## Code

### 1. Create new Directories

In [3]:
# Create one output subdirectory per party (no error if it already exists)
for party in PARTIES:
    os.makedirs(DATASET_PATH + party, exist_ok=True)

### 2. Clean Dataframe and Store as CSV
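
The cleaning in this section comes down to a uniqueness check on the `id` column followed by two row filters. The sketch below replays those steps on a small, invented DataFrame so the effect of each filter is easy to see. The helper name `clean_account_df`, the account name `some_account`, and all toy values are assumptions made for illustration; only the column names `id`, `username`, and `language` are taken from the notebook.

```python
import pandas as pd


def clean_account_df(df: pd.DataFrame, account: str) -> pd.DataFrame:
    """Hypothetical helper: keep German tweets that `account` did not write."""
    # Drop tweets posted by the account itself
    df = df[df["username"] != account]
    # Keep only tweets tagged as German
    df = df[df["language"] == "de"]
    # Renumber rows from 0 so the stored index is contiguous
    return df.reset_index(drop=True)


# Toy data, invented purely for illustration
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "username": ["some_account", "user_a", "user_b", "user_c"],
        "language": ["de", "de", "en", "de"],
    }
)

# True if any tweet id occurs more than once
has_duplicates = toy["id"].duplicated().any()

cleaned = clean_account_df(toy, "some_account")
print(has_duplicates, cleaned["id"].tolist())
```

Note that the uniqueness check only reports duplicates. If duplicates ever did occur, `df.drop_duplicates(subset="id")` would be one way to remove them before filtering.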

In [4]:
n_tweets_total = 0
for party in PARTIES:
    n_tweets_party = 0
    for subdir, _, files in os.walk(RAW_DATASET_PATH + party):
        for file in files:
            if file.endswith('.csv') and subdir[len(RAW_DATASET_PATH):] in PARTIES:
                # Get username of CSV file
                username = file[:-4]
                
                # Load dataframe of an account
                df = pd.read_csv(RAW_DATASET_PATH + party + "/" + file, sep=",", index_col=0, lineterminator="\n")
                
                # Check whether any tweet was crawled more than once (we have not observed duplicates when using twint)
                if df["id"].nunique() == len(df):
                    print("All values in the column are unique.", username)
                else:
                    print("There are duplicate values in the column.", username)
                
                # 1. Drop tweets written by the politician/party account itself
                df = df[df.username != username]
                
                # 2. Keep only German tweets
                df = df[df.language == "de"]
                
                # Reset the index of the dataframe
                df = df.reset_index(drop=True)
                
                n_tweets_party += df.shape[0]
                print(username, df.shape[0])
                
                # Save dataframe
                # DATASET_PATH already ends with "/", so no extra separator is needed
                df.to_csv(DATASET_PATH + party + "/" + username + ".csv", sep=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                
    n_tweets_total += n_tweets_party
    print(party, n_tweets_party)
    
print("Total: ", n_tweets_total)

All values in the column are unique. ArminLaschet
ArminLaschet 36161
All values in the column are unique. HBraun
HBraun 3212
All values in the column are unique. andreasscheuer
andreasscheuer 2431
All values in the column are unique. CSU
CSU 9072
All values in the column are unique. DerLenzMdB
DerLenzMdB 236
All values in the column are unique. Markus_Soeder
Markus_Soeder 30495
All values in the column are unique. ANiebler
ANiebler 25
All values in the column are unique. MarkusFerber
MarkusFerber 21
All values in the column are unique. Junge_Union
Junge_Union 931
All values in the column are unique. ManfredWeber
ManfredWeber 527
All values in the column are unique. DoroBaer
DoroBaer 2560
All values in the column are unique. rbrinkhaus
rbrinkhaus 4280
All values in the column are unique. tj_tweets
tj_tweets 396
All values in the column are unique. DaniLudwigMdB
DaniLudwigMdB 3821
All values in the column are unique. JuliaKloeckner
JuliaKloeckner 3357
All values in the column are unique.