# Exercises on Dataframes in R
* Author: Johannes Maucher
* Last Update: 21.09.2017, a few modifications by OK in 2019
* Corresponding lecture notebook: [01OperationsOnDataframes](../02DataManagement/01OperationsOnDataframes.ipynb)

## Solve the tasks ...

Your solution should contain
* the implemented code in code cells,
* the output of this code,
* answers to the questions in markdown cells,
* and, optionally, your remarks, discussion and comments on the solution in markdown cells.

Send me the resulting Jupyter notebook.

In [694]:
library(tidyverse)

## Tasks



1. Read the data from [lobbyPediaParteispenden.csv](../Lecture/data/lobbyPediaParteispenden.csv) into an R dataframe. (This is not a typical data set in the context of IT Product Management and Digital Analytics, but it is important from a socio-political point of view; as the German saying goes: "Die Hand, die mich füttert, die beiß ich nicht", i.e. "I don't bite the hand that feeds me" :-).) This file is a dump of the [Lobbypedia Database](https://lobbypedia.de/wiki/Hauptseite), which contains all donations of more than 10,000 euros to German political parties.


In [150]:
lobbyPediaParteispenden <- read.csv(file="../data/lobbyPediaParteispenden.csv", header=TRUE, 
                       sep=",")
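
The `<fct>` columns in the `glimpse()` output further down suggest that strings were read as factors, which was the `read.csv()` default before R 4.0.0. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so on a newer installation the argument would have to be set explicitly to reproduce those types. A minimal sketch on a small temporary file (the file contents and the ASCII column name `Empfaenger` are made up for illustration):

```r
# Write a tiny, made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Geldgeber,Betrag,Empfaenger",
             "Firma A,\"15000,00\",CDU",
             "Firma B,\"22500,00\",FDP"), csv_path)

# Request factor columns explicitly instead of relying on the version-dependent default
toy <- read.csv(csv_path, header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(toy)
```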

2. Remove the columns `Bundesland`, `Branche` and `Schlagworte`, since these columns contain no data. Determine the number of rows and columns in the resulting dataframe and display its head.


In [151]:
lobbyPediaParteispenden <- subset(lobbyPediaParteispenden, select=-c(Bundesland, Branche, Schlagworte))
glimpse(lobbyPediaParteispenden)

Observations: 2,466
Variables: 7
$ X         <fct> "Parteispende:Münchener Rückversicherungs-Gesellschaft AG...
$ Geldgeber <fct> Münchener Rückversicherungs-Gesellschaft AG, Münchener Rü...
$ Kategorie <fct> Kategorie:Parteispende, Kategorie:Parteispende, Kategorie...
$ Betrag    <fct> "15000,00 \200", "15000,00 \200", "15000,00 \200", "15000...
$ Empfänger <fct> CDU, CDU, SPD, SPD, CSU, CSU, CDU, FDP, SPD, FDP, FDP, CD...
$ Jahr      <int> 2014, 2011, 2012, 2011, 2005, 2015, 2009, 2011, 2015, 200...
$ Ort       <fct> München, München, München, München, München, München, Mün...
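
`glimpse()` reports the dimensions and a preview in one call. The task's wording can also be followed literally with `dim()` and `head()`. A minimal base-R sketch on a made-up data frame (the rows are invented; only the column names mirror the exercise):

```r
# Toy data frame with two empty columns, mimicking the structure of the exercise data
toy <- data.frame(
  Geldgeber  = c("Firma A", "Firma B", "Firma C"),
  Bundesland = NA,
  Branche    = NA,
  Betrag     = c(15000, 22500, 30000)
)

# Drop the empty columns by name
toy <- toy[, !(names(toy) %in% c("Bundesland", "Branche"))]

dims <- dim(toy)        # number of rows, number of columns
print(dims)
print(head(toy, n = 2)) # first rows of the dataframe
```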


3. Transform the values in column `Betrag` into a numeric representation. Keep in mind that the values use a decimal comma and carry a trailing Euro sign, both of which have to be handled.

 Pay attention to performance and do not use for- or while-loops.


In [152]:
# split off the trailing € symbol (shown as "\200" in the output above)
lobbyPediaParteispenden <- lobbyPediaParteispenden %>% separate(Betrag, c("Betrag", "leftover"), sep = " ", remove = TRUE, convert = TRUE)
# drop the column holding the € symbol
lobbyPediaParteispenden <- subset(lobbyPediaParteispenden, select=-c(leftover))
# replace the decimal comma by a point and convert character to numeric
lobbyPediaParteispenden$Betrag <- as.numeric(gsub(",", ".", lobbyPediaParteispenden$Betrag))
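
As an alternative sketch, the same conversion can be done with two vectorised regular-expression calls and no helper column. The helper name `to_numeric_amount` and the sample strings are assumptions for illustration; the Euro sign is written as `\u20ac` to stay independent of the file encoding:

```r
# Made-up sample values in the same "decimal comma + Euro sign" format
betrag_raw <- c("15000,00 \u20ac", "10033,84 \u20ac", "850000,00 \u20ac")

to_numeric_amount <- function(x) {
  x <- gsub("[^0-9,]", "", x)                 # keep only digits and the decimal comma
  as.numeric(sub(",", ".", x, fixed = TRUE))  # decimal comma -> decimal point
}

betrag <- to_numeric_amount(betrag_raw)
print(betrag)
```

Both `gsub()` and `sub()` operate on the whole vector at once, so no explicit loop is needed.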

4. Transform the party-names in column `Empfänger`, such that they only contain alpha-numeric characters. I.e. whitespaces and slashes shall be removed. This can be done efficiently by the `gsub()`-function with the regular expression `\\W` assigned to the `pattern`-argument. Implement this transformation of the dataframe-values by defining a corresponding function, which is then used by the `sapply()`-function. Note that after this transformation the column `Empfänger` shall again be a factor-variable, not a character.

 Pay attention to the performance and do not use for- or while-loops.



In [153]:
keepAlphaNumeric <- function(datafield){
    datafield <- as.factor(gsub("\\W", "", as.character(datafield)))
    return (datafield)
}

lobbyPediaParteispenden$Empfänger <- sapply(lobbyPediaParteispenden$Empfänger, keepAlphaNumeric)

lobbyPediaParteispenden

X,Geldgeber,Kategorie,Betrag,Empfänger,Jahr,Ort
Parteispende:Münchener Rückversicherungs-Gesellschaft AG in München-CDU-2014,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,CDU,2014,München
Parteispende:Münchener Rückversicherungs-Gesellschaft AG in München-CDU-2011,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,CDU,2011,München
Parteispende:Münchener Rückversicherungs Gesellschaft AG-SPD-2012,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,SPD,2012,München
Parteispende:Münchener Rückversicherungs Gesellschaft AG-SPD-2011,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,SPD,2011,München
Parteispende:Münchener Rückversicherungsgesellschaft AG-CSU-2005,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,CSU,2005,München
Parteispende:Münchener Rückversicherungsges. AG-CSU-2015,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,CSU,2015,München
Parteispende:Münchener Rückversicherungsgesellschaft AG-CDU-2009,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,30000.00,CDU,2009,München
Parteispende:Münchener Rückversicherungsgesellschaft AG*-FDP-2011,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,FDP,2011,München
Parteispende:Münchener Rückversicherungsges. AG-SPD-2015,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,15000.00,SPD,2015,München
Parteispende:Münchener Rückversicherungsgesellschaft AG-FDP-2009,Münchener Rückversicherungs-Gesellschaft AG,Kategorie:Parteispende,22500.00,FDP,2009,München
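
Because `gsub()` is already vectorised, one possible alternative to the element-wise `sapply()` call is to clean the factor levels once, which touches each distinct party name only a single time and keeps the column a factor. This is a sketch on made-up ASCII names, not the solution requested by the task (which asks for `sapply()`):

```r
# Made-up party names containing whitespace and slashes
empfaenger <- factor(c("CDU", "DIE PARTEI", "B90/GRUENE", "CDU"))

# Rewrite the levels instead of every single element
levels(empfaenger) <- gsub("\\W", "", levels(empfaenger))

print(empfaenger)
```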


5. Calculate and display a summary of univariate descriptive statistics for this dataframe, then create the listings and answer the questions below. Pay attention to performance in all subtasks and do not use for- or while-loops:

    1. Create a data frame that lists the companies with their corresponding number of donations and total donated amount. Which company makes the most donations (number of donations, not amount)?
    2. Create a data frame that lists the parties with their corresponding number of donations and total amount received. Which party receives the most donations (number of donations, not amount)?
    3. Create a data frame that lists the parties with the corresponding number, amount, minimum, maximum and mean value of the party donations. What are the total amount, minimum, maximum and mean value over all party donations?
  

In [154]:
A <- lobbyPediaParteispenden %>%
        group_by(Geldgeber) %>%
        summarise(
            count = n(),
            amount = sum(Betrag, na.rm = TRUE)
        )
# sort by number of donations, largest first
A[order(-A$count),]

Geldgeber,count,amount
BMW Bayerische Motoren Werke AG,69,5402967.7
Verband der Chemischen Industrie,66,4474381.9
VBM Verband der Bayerischen Metall- und Elektroindustrie e.V.,61,8751393.0
Philip Morris GmbH,49,830853.8
Südwestmetall Verband der Metall- und Elektroindustrie Baden-Württemberg e.V.,49,4352531.8
Allianz AG/SE,48,3002673.8
METALL NRW - Verband der Metall- und Elektroindustrie Nordrhein-Westfalen e.V.,47,2748481.6
Daimler,46,3827331.2
Evonik Industries AG,37,1884000.0
Deutsche Bank AG,34,4809357.5


BMW Bayerische Motoren Werke AG makes the most donations: 69.

In [155]:
B <- lobbyPediaParteispenden %>%
        group_by(Empfänger) %>%
        summarise(
            count = n(),
            amount = sum(Betrag, na.rm = TRUE)
        )
B

Empfänger,count,amount
CDU,1085,44954948.0
SPD,356,14632198.8
CSU,422,22191860.1
FDP,429,20048850.0
Bündnis90DieGrünen,161,4808215.0
AFD,3,90000.0
SSW,6,1413698.9
,1,15000.0
DIEPARTEI,2,28506.8
LINKE,1,60000.0


CDU received the most donations: 1085.

In [158]:
C <- lobbyPediaParteispenden %>%
        group_by(Empfänger) %>%
        summarise(
            count = n(),
            amount = sum(Betrag, na.rm = TRUE),
            min = min(Betrag, na.rm = TRUE),
            max = max(Betrag, na.rm = TRUE),
            mean = mean(Betrag, na.rm = TRUE)
        )
C

# overall statistics across all parties
summarise(  C,
            totalmean = mean(lobbyPediaParteispenden$Betrag, na.rm = TRUE),
            totalamount = sum(amount, na.rm = TRUE),
            totalmin = min(min, na.rm = TRUE),
            totalmax = max(max, na.rm = TRUE)
        )



Empfänger,count,amount,min,max,mean
CDU,1085,44954948.0,10100.0,425400,41433.13
SPD,356,14632198.8,10100.0,300000,41101.68
CSU,422,22191860.1,10033.84,770000,52587.35
FDP,429,20048850.0,10050.0,850000,46733.92
Bündnis90DieGrünen,161,4808215.0,10070.0,204516,29864.69
AFD,3,90000.0,20000.0,50000,30000.0
SSW,6,1413698.9,118518.0,475726,235616.49
,1,15000.0,15000.0,15000,15000.0
DIEPARTEI,2,28506.8,12756.8,15750,14253.4
LINKE,1,60000.0,60000.0,60000,60000.0


totalmean,totalamount,totalmin,totalmax
43894.27,108243278,10033.84,850000


Party donations:
 - total amount: 108,243,278 €
 - minimum: 10,033.84 €
 - maximum: 850,000 €
 - mean: 43,894.27 €
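
For comparison, the per-group statistics can also be sketched in base R with `aggregate()` and a function that returns several values at once. The data frame `donations` and its English column names are made up for illustration:

```r
# Made-up donations for two hypothetical parties
donations <- data.frame(
  party  = c("A", "A", "B", "B", "B"),
  amount = c(10000, 30000, 20000, 40000, 60000)
)

# One call computes count, sum, min, max and mean per party
stats <- aggregate(amount ~ party, data = donations,
                   FUN = function(x) c(count = length(x), sum = sum(x),
                                       min = min(x), max = max(x), mean = mean(x)))

# aggregate() stores the statistics as a matrix column; flatten it into plain columns
stats <- do.call(data.frame, stats)
print(stats)
```

The flattened columns are named `amount.count`, `amount.sum`, and so on.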

6. Write a function that returns the sum, minimum, maximum and mean value of the donations received by an arbitrary party. The function's arguments shall be the name of a party and the dataframe constructed in the previous subtasks. In addition, a third argument `onlySum` shall be implemented: if it is `TRUE`, the function shall return only the sum instead of all descriptive statistics.
Call this function for all parties in the dataframe and sort the result by the amount of donations.

    Pay attention to performance and interpretability. Do not use for- or while-loops.


In [384]:
info <- function(partyName, df, onlySum){
    # keep only the donations received by the given party
    df <- df[df$Empfänger == partyName,]
    if (onlySum){
        result <- summarise(df,
        Party = partyName,
        amount = sum(Betrag, na.rm = TRUE)
        )
    }
    else{
        result <- summarise(df,
        Party = partyName,
        amount = sum(Betrag, na.rm = TRUE),
        min = min(Betrag),
        max = max(Betrag),
        mean = mean(Betrag)
        )
    }
    return (as.data.frame(result))
}

# call the function for each unique party
PartyInfos <- lapply(unique(lobbyPediaParteispenden$Empfänger), info, df=lobbyPediaParteispenden, onlySum = FALSE)
# combine the list of dataframes into one single dataframe
PartyInfos <- Reduce(function(...) merge(..., all=TRUE), PartyInfos)
# order by amount, largest first
PartyInfos[order(-PartyInfos$amount),]



Unnamed: 0,Party,amount,min,max,mean
1,CDU,44954948.0,10100.0,425400,41433.13
3,CSU,22191860.1,10033.84,770000,52587.35
4,FDP,20048850.0,10050.0,850000,46733.92
2,SPD,14632198.8,10100.0,300000,41101.68
5,Bündnis90DieGrünen,4808215.0,10070.0,204516,29864.69
7,SSW,1413698.9,118518.0,475726,235616.49
6,AFD,90000.0,20000.0,50000,30000.0
10,LINKE,60000.0,60000.0,60000,60000.0
9,DIEPARTEI,28506.8,12756.8,15750,14253.4
8,,15000.0,15000.0,15000,15000.0
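
A compact variant of the same idea, sketched in base R on made-up data: the flag gets a default value, and the per-party results are stacked with `do.call(rbind, ...)`. The names `party_stats`, `only_sum` and `donations` are hypothetical and not part of the exercise:

```r
# Made-up donations for two hypothetical parties
donations <- data.frame(
  party  = c("A", "A", "B", "B", "B"),
  amount = c(10000, 30000, 20000, 40000, 60000)
)

party_stats <- function(party, df, only_sum = FALSE) {
  x <- df$amount[df$party == party]   # donations of this party only
  out <- data.frame(party = party, sum = sum(x, na.rm = TRUE))
  if (!only_sum) {
    out$min  <- min(x, na.rm = TRUE)
    out$max  <- max(x, na.rm = TRUE)
    out$mean <- mean(x, na.rm = TRUE)
  }
  out
}

# Apply to every party, stack the one-row results, sort by sum (largest first)
all_stats <- do.call(rbind, lapply(unique(donations$party), party_stats, df = donations))
all_stats <- all_stats[order(-all_stats$sum), ]
print(all_stats)
```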


7. Create a new dataframe which lists, for each donor (Geldgeber), the sum of all donations and the distribution of the donations across the parties. This dataframe makes it easy to answer questions like
 * *Which companies are the strongest donors?*
 * *Which parties do they support?*
 
 Each row of the data frame corresponds to a donor company (Geldgeber). The column entries are the donated amounts (in Euro) for the different parties. The dataframe shall be ordered according to the values in column *Sum*.
 
 Pay attention to performance.


In [646]:
# sum of Betrag per Geldgeber
Sum <- as.data.frame(setNames(aggregate(lobbyPediaParteispenden$Betrag,
                by=list(Geldgeber=lobbyPediaParteispenden$Geldgeber), FUN=sum),
                c("Geldgeber","Betrag")))

# convert Geldgeber from factor to character
Sum$Geldgeber <- as.character(Sum$Geldgeber)


# sum of Betrag for each Geldgeber-Empfänger combination
distribution <- setNames(aggregate(lobbyPediaParteispenden$Betrag ~ as.character(lobbyPediaParteispenden$Empfänger) 
                           + as.character(lobbyPediaParteispenden$Geldgeber), 
                           data = lobbyPediaParteispenden[c("Betrag", "Empfänger", "Geldgeber")] , 
                           FUN = 'sum'),
                           c("Empfänger","Geldgeber","Betrag"))

# one row per Geldgeber, one column per Empfänger
distribution <- distribution %>% spread(key = Empfänger, value = Betrag)

# delete autogenerated column V1
distribution["V1"] <- NULL

# attach the totals, matched by Geldgeber name rather than by row position
distribution["Summe"] <- Sum$Betrag[match(distribution$Geldgeber, Sum$Geldgeber)]

# order by Summe, largest first
distribution <- distribution[order(-distribution$Summe),]

# drop rownames
rownames(distribution) <- NULL

#display
distribution

Geldgeber,AFD,Bündnis90DieGrünen,CDU,CSU,DIEPARTEI,FDP,LINKE,SPD,SSW,Summe
VBM Verband der Bayerischen Metall- und Elektroindustrie e.V.,,424512.07,,6741533.9,,1100451.9,,484895.2,,8751393.0
BMW Bayerische Motoren Werke AG,,495460.78,980534.5,1899269.2,,787865.7,,1239837.5,,5402967.7
Deutsche Bank AG,,70451.60,2835494.7,195564.5,,1327846.7,,380000.0,,4809357.5
Verband der Chemischen Industrie,,77500.00,2119815.9,363153.2,,1087798.3,,826114.4,,4474381.9
Südwestmetall Verband der Metall- und Elektroindustrie Baden-Württemberg e.V.,,463064.50,2343032.0,,,929177.3,,617258.0,,4352531.8
Daimler,,290000.00,1400000.0,330000.0,,390000.0,,1417331.2,,3827331.2
Allianz AG/SE,,461134.00,723957.5,561699.5,,557489.8,,698393.0,,3002673.8
Deutsche Vermögensberatung AG DVAG,,60000.00,1837588.7,13950.0,,775120.9,,150000.0,,2836659.6
METALL NRW - Verband der Metall- und Elektroindustrie Nordrhein-Westfalen e.V.,,30000.00,1649843.0,,,818074.1,,250564.5,,2748481.6
DaimlerChrysler AG,,74999.94,1015741.4,200000.0,,340149.4,,859999.0,,2490889.8


VBM Verband der Bayerischen Metall- und Elektroindustrie e.V., BMW Bayerische Motoren Werke AG and Deutsche Bank AG are the strongest donors. Between them they support the parties CDU, CSU, Bündnis90DieGrünen, FDP and SPD; the listing shows no CDU donation from VBM, whose money goes mostly to the CSU.
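
A donor-by-party table of this kind can also be sketched in base R with `xtabs()`, which sums the amounts for every combination in one step. The data below are made up, and unlike the `spread()` result above, combinations without donations appear as 0 rather than as missing values:

```r
# Made-up donations of three hypothetical donors to two hypothetical parties
donations <- data.frame(
  donor  = c("X", "X", "Y", "Y", "Z"),
  party  = c("A", "B", "A", "A", "B"),
  amount = c(100, 200, 50, 25, 500)
)

# Cross-tabulate: rows = donors, columns = parties, cells = summed amounts
wide <- as.data.frame.matrix(xtabs(amount ~ donor + party, data = donations))

# Total per donor, then sort with the strongest donor first
wide$Sum <- rowSums(wide)
wide <- wide[order(-wide$Sum), ]
print(wide)
```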


8. Create a new dataframe which lists, for each donor (Geldgeber), the distribution of the donations and the number of donations across the parties, sorted by donor (Geldgeber). Use tidyverse syntax.

 * *Why is there more than one row for the donor "Adolf Würth GmbH" and the party "CDU"?*  
 
 Pay attention to performance and do not use for- or while-loops.

In [704]:
donationsPerCompany <- lobbyPediaParteispenden %>%
        group_by(Geldgeber, Empfänger) %>%
        summarise(
            Count = n(),
            Betrag = sum(Betrag, na.rm = TRUE)
        ) %>%
        # sort by donor; the original ordered by a column of the unrelated dataframe `distribution`
        arrange(Geldgeber)
donationsPerCompany


Geldgeber,Empfänger,Count,Betrag
A. Zovko GmbH & Co. KG,CDU,1,10300.00
A.T.U Auto-Teile-Unger GmbH & Co. KG,CSU,1,10775.00
Aachener Verlagsgesellschaft mbH,CDU,1,12500.00
ABB Asea Brown Boveri AG,CDU,1,10556.45
Abcdruck GmbH,CDU,2,25470.87
Abels &Grey GmbH,CDU,1,78080.00
Accon Köln GmbH,SPD,1,15000.00
Accumulata Immobilien Development GmbH,CSU,1,25000.00
"ADIB Agrar-, Dienstleistungs-, Industrie- und Baugesellschaft mbH",CDU,1,13500.00
Adler-Schiffe GmbH & Co. KG,CDU,1,13714.10


There are three rows for the donor Adolf Würth GmbH and the party CDU because the donor's name is spelled in three different ways in the data, so each spelling is treated as a separate donor.
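
How such spelling variants might be merged depends on how the names actually differ, which the data would have to show. As a rough sketch, assuming the variants differ only in case and whitespace (the three strings below are invented, not taken from the file), a simple normalisation step could look like this:

```r
# Invented spelling variants of one donor name (hypothetical, for illustration only)
donor_raw <- c("Adolf Wuerth GmbH", "Adolf  Wuerth GmbH ", "ADOLF WUERTH GMBH")

normalise_name <- function(x) {
  x <- tolower(trimws(x))   # ignore case and surrounding whitespace
  gsub("\\s+", " ", x)      # collapse repeated inner whitespace
}

donor_clean <- normalise_name(donor_raw)
print(table(donor_clean))
```

Grouping by the normalised name instead of the raw one would then yield a single row per donor and party. Variants that differ in wording or abbreviations would need a manual mapping instead.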

9. Save the dataframe to a file `donationsPerCompany.csv`.

In [705]:
write.csv(donationsPerCompany,"donationsPerCompany.csv", row.names = FALSE)