# Wrangling data and plotting

Based on the dataset 
https://docs.google.com/spreadsheets/d/1iEl565M1mICTubTtoxXMdxzaHzAcPTnb3kpRndsrfyY/edit?ts=5bd7f609#gid=671375968

##  Packages in this notebook
### tidyverse

This collection of packages includes dplyr (data wrangling) and ggplot2 (plotting), both of which are used throughout this notebook.

#### Piping

The %>% operator pipes commands together: it passes the result of the expression on its left as the first argument of the function on its right.
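
As a minimal sketch of what piping does, the snippet below computes the same value twice, once with nested calls and once with a pipe. The vector `approve` is a hypothetical toy example, not taken from the dataset.

```r
library(dplyr)  # provides the %>% operator

# Hypothetical toy vector of approval percentages (an assumption for illustration).
approve <- c(59, 57, 55, 55)

# Nested calls are read from the inside out.
nested <- round(mean(approve), 1)

# Piped calls are read left to right; each result feeds the next function.
piped <- approve %>% mean() %>% round(1)

# Both forms give the same result.
identical(nested, piped)
```

The piped form tends to be easier to follow once several steps are chained, which is why the rest of the notebook uses it.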

#### Projection

To select columns:
 - data %>% select(vbl1, vbl2,..., vbln)
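
A short sketch of projection, assuming a hypothetical toy data frame `toy` whose column names merely resemble those used later in the notebook:

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration).
toy <- data.frame(
  President  = c("A", "B"),
  Approve    = c(59, 45),
  Disapprove = c(37, 50)
)

# Keep only the two named columns; all rows are retained.
kept <- toy %>% select(President, Approve)
names(kept)
```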
 
#### Selection

To select rows that satisfy a condition:

 - data %>% filter(vbl1 == value)

To extract a single column as a vector, use pull().

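
The following sketch shows the difference between filter(), which returns a data frame with fewer rows, and pull(), which returns a plain vector. The data frame `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration).
toy <- data.frame(
  President = c("A", "A", "B"),
  Approve   = c(59, 57, 45)
)

# filter() keeps the rows where the condition is TRUE; the result is still a data frame.
a_rows <- toy %>% filter(President == "A")

# pull() extracts one column as a vector.
a_values <- a_rows %>% pull(Approve)

a_rows
a_values
```
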
### lubridate

This package allows us to manipulate dates. From it we will use ymd() and dmy().
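
As a small sketch, the two parsers differ only in the order of date components they expect. The date strings below are illustrative; the first follows the day/month/year layout seen in the dataset.

```r
library(lubridate)

# dmy() expects day, month, year; ymd() expects year, month, day.
d1 <- dmy("16/01/2017")
d2 <- ymd("2017-01-16")

# Both produce the same Date object.
d1 == d2
class(d1)
```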
 
### zoo
This package provides rolling-window functions. From it we will use rollmean() to compute rolling averages over time.
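
A minimal sketch of rollmean() on a hypothetical toy vector, assuming the zoo package is installed. It uses `fill = NA`, which as far as I know is the current spelling of the older `na.pad = TRUE` argument.

```r
library(zoo)

# Hypothetical toy series (an assumption for illustration).
x <- c(1, 2, 3, 4, 5)

# Window of 3 values, right-aligned: each result is the mean of the current
# value and the two before it. Positions without a full window are NA.
rolled <- rollmean(x, k = 3, fill = NA, align = "right")
rolled
```

With a right-aligned window the first k - 1 positions have no complete window, so they are padded with NA and the output keeps the same length as the input.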

In [1]:
library(tidyverse)
library(lubridate)
library(zoo)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.0.6     ✔ dplyr   1.0.4
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




ERROR: Error in library(zoo): there is no package called ‘zoo’


## Dataset

This dataset is cleaned from the source mentioned in the header and is available on Brightspace.  

In [16]:
dfsource <- "./data/POTUS_approval.csv"
polls <- read.csv(dfsource, stringsAsFactors = FALSE)

In [17]:
head(polls)

| | President | Start.Date | End.Date | Approving | Disapproving | Unsure.NoData |
|---|---|---|---|---|---|---|
| | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; |
| 1 | Barack Obama | 16/01/2017 | 19/01/2017 | 59 | 37 | 4 |
| 2 | Barack Obama | 09/01/2017 | 15/01/2017 | 57 | 39 | 4 |
| 3 | Barack Obama | 02/01/2017 | 08/01/2017 | 55 | 42 | 3 |
| 4 | Barack Obama | 26/12/2016 | 01/01/2017 | 55 | 40 | 5 |
| 5 | Barack Obama | 19/12/2016 | 25/12/2016 | 56 | 40 | 4 |
| 6 | Barack Obama | 12/12/2016 | 18/12/2016 | 56 | 40 | 4 |


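Some of the column names are either too long or have dots in them. By default, read.csv() converts spaces and other characters that are not valid in R names to dots when a csv file is read into a dataframe. We rename these columns to something shorter.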
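
To illustrate the dot conversion, here is a small sketch that reads a hypothetical two-column CSV supplied inline. It relies on the `check.names` argument of read.csv(), which defaults to TRUE.

```r
# Hypothetical CSV text with spaces in the header (an assumption for illustration).
csv_text <- "Start Date,End Date\n16/01/2017,19/01/2017"

# Default behaviour: names are made syntactically valid, so spaces become dots.
demo <- read.csv(text = csv_text)
names(demo)

# check.names = FALSE keeps the header text exactly as written.
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)
```

Keeping the default and renaming afterwards, as this notebook does, avoids having to quote column names with backticks later on.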

In [3]:
# rename() takes pairs of the form new_name = old_name
polls <- polls %>% 
  rename(
    Date = Start.Date,
    EndDate = End.Date,
    Approve = Approving,
    Disapprove = Disapproving
    )

ERROR: Error in rename(., Date = Start.Date, EndDate = End.Date, Approve = Approving, : object 'polls' not found


In [4]:
polls$Date <- dmy(polls$Date) # Parse the day/month/year strings into Date objects

str(polls)

ERROR: Error in lapply(list(...), .num_to_date): object 'polls' not found


Let's see what Trump's approval rate was on different dates 

In [5]:
TrumpApprove <- polls %>% 
  select(President, Date, Approve) %>%
  filter(President == "Donald Trump")

ERROR: Error in select(., President, Date, Approve): object 'polls' not found


## Aggregation

Grouping is done using group_by(variable).

Summarising functions such as mean() and median() are then applied to each group with summarise().

Display the mean approval percentage for each president over all polls taken during their presidency.

In [6]:
polls %>% 
    group_by(President) %>%
    summarise(MeanApproval = mean(Approve))

ERROR: Error in group_by(., President): object 'polls' not found


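If any president's mean comes back as NA, at least one Approve value is missing, because mean() returns NA whenever its input contains an NA. Check how many rows are lost when we drop the NAs.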

In [7]:
nrow(polls)
polls <- drop_na(polls)
nrow(polls)

ERROR: Error in nrow(polls): object 'polls' not found
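
As an alternative to dropping rows, mean() can skip missing values itself. The sketch below uses a hypothetical toy data frame to compare the two behaviours.

```r
library(dplyr)

# Hypothetical toy data: one missing approval value for president "B".
toy <- data.frame(
  President = c("A", "A", "B", "B"),
  Approve   = c(50, 60, 40, NA)
)

# Without na.rm, any group containing an NA gets an NA mean.
with_na <- toy %>%
  group_by(President) %>%
  summarise(MeanApproval = mean(Approve))

# With na.rm = TRUE, the NA is skipped and no rows need to be dropped.
without_na <- toy %>%
  group_by(President) %>%
  summarise(MeanApproval = mean(Approve, na.rm = TRUE))

with_na
without_na
```

Note that drop_na() removes a row if any of its columns is missing, whereas `na.rm = TRUE` only ignores missing values in the column being summarised.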


In [8]:
# Store the per-president means so that a column can be extracted as a vector
Avgpolls <- polls %>% 
    group_by(President) %>%
    summarise(MeanApproval = mean(Approve)) 
Avgpolls

ERROR: Error in group_by(., President): object 'polls' not found


In [9]:
# Extract the MeanApproval column as a vector
Avgpolls %>% pull(MeanApproval)

ERROR: Error in pull(., MeanApproval): object 'Avgpolls' not found


In [10]:
Avgpolls$MeanApproval

ERROR: Error in eval(expr, envir, enclos): object 'Avgpolls' not found


### Getting moving averages

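A moving average is the mean taken over a sliding window of consecutive observations. For this, we need the lubridate package to handle the dates and rollmean() from the zoo package to compute the average.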

In [11]:
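library(lubridate)
date <- ymd("2021-02-17")    # Parse a year-month-day string into a Date
date
month(date)                  # Month as a number
month(date, label = TRUE)    # Month as an abbreviated name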

In [12]:
TrumpPolls <- polls %>%
    select(President, Date, Approve, Disapprove) %>%
    filter(President == "Donald Trump") %>%
    arrange(Date) 
head(TrumpPolls)

ERROR: Error in select(., President, Date, Approve, Disapprove): object 'polls' not found


In [13]:
# 10-poll rolling mean; the window is right-aligned, so the first 9 rows are NA
TrumpApprove <- TrumpPolls %>%
    mutate(AvgApprove = rollmean(Approve, 10, na.pad=TRUE, align="right"))

ERROR: Error in mutate(., AvgApprove = rollmean(Approve, 10, na.pad = TRUE, align = "right")): object 'TrumpPolls' not found


In [14]:
summary(TrumpApprove)

ERROR: Error in summary(TrumpApprove): object 'TrumpApprove' not found


Plotting a line chart of Trump's rolling average approval

In [15]:
ggplot(data = TrumpApprove, aes(x=Date,y=AvgApprove)) + 
  geom_line()

ERROR: Error in ggplot(data = TrumpApprove, aes(x = Date, y = AvgApprove)): object 'TrumpApprove' not found


# For the exercise

1 - Create a new attribute in the polls dataframe, calling it inaugurated. Initialize it to 1st Jan 1970 for every row. Using Google search, find out the first inauguration date of each of the former presidents, back to Ronald Reagan. Update each row with the inauguration date for that president.

2 - Create a new attribute Days, giving it a value of Date - inaugurated. Check to see if the data makes sense. Do you need to change the data type?

3 - Using a rolling mean, get the rolling average approval of each president over their presidency and store it as an extra column AvgApprove in the dataframe.

4 - Start plotting your data. Using ggplot, create a multi-line plot with a line for each president. The x-axis should show the number of days since the start of the presidency and the y-axis should show the rolling approval rating at that time. You may need to change the size of your plot so you can see it! Try the function options(), adjusting repr.plot.width and repr.plot.height to your satisfaction. Save your plot in a variable (e.g. p).

5 - Enhance your plot. Suggestions:

 - Add a title, x-axis label and y-axis label.
 - Make your plot readable to colourblind people by creating a palette suitable for them, and using it.
 - Increase the size of the colour block in your legend.
 - Modify the look and style of your title, axis labels and legend labels.
 - Modify the ticks on your x-axis, depending on the range of values.

6 - Add a geom_line for one of the presidents whose line you think tells a story, making it thicker than the previous line but using the same colour.
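
As a starting point for the plotting steps, here is a minimal sketch on hypothetical toy data. The data frame `toy`, the sizes passed to options(), and the choice of the Okabe-Ito palette are all assumptions for illustration; it is not a solution to the exercise.

```r
library(ggplot2)

# Hypothetical toy data standing in for the per-president rolling averages.
toy <- data.frame(
  President  = rep(c("A", "B"), each = 4),
  Days       = rep(c(0, 100, 200, 300), times = 2),
  AvgApprove = c(55, 52, 50, 48, 45, 47, 44, 42)
)

# Enlarge plots in a Jupyter R notebook (values are in inches).
options(repr.plot.width = 12, repr.plot.height = 6)

# Okabe-Ito palette, a commonly used colour-blind-friendly set of hex colours.
cb_palette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
                "#0072B2", "#D55E00", "#CC79A7", "#000000")

# One line per president, coloured from the palette, with title and axis labels.
p <- ggplot(toy, aes(x = Days, y = AvgApprove, colour = President)) +
  geom_line() +
  scale_colour_manual(values = cb_palette) +
  labs(title = "Rolling approval (toy data)",
       x = "Days since inauguration",
       y = "Rolling average approval (%)")
p
```

Because the plot is stored in `p`, further layers and theme changes can be added with `p + ...` without rebuilding it.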