## Let's get comfortable with R

### How to install a package

You need to install a package only once on your computer.  
You will not need to re-install it in later R sessions.

In [None]:
install.packages('beepr') # optionally add type = 'binary' (Windows/macOS) to avoid compiling from source
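Because installation is needed only once, it can be convenient to install a package only when it is missing. The sketch below is one possible way to do that; `install_if_missing` is a hypothetical helper name, not a function that ships with R.

```r
# Hypothetical helper (not part of base R): install a package only if it is missing
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded in this example
install_if_missing("stats")
```

For this notebook the call would be `install_if_missing("beepr")`.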

Typing a question mark before a function name (`?function`) or calling `help(function)` opens the documentation page for that R function. The page describes the exact arguments, the possible options, and what the function returns. It is highly advisable to read the help page for every function that you use.

In [None]:
?install.packages
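A few related ways to reach the same information are sketched below. `formals()` is used here only to list argument names programmatically; the help page remains the full description of what each argument does.

```r
# Equivalent ways to open a help page (they open a viewer, so run them interactively)
# ?install.packages
# help("install.packages")

# List the argument names of a function without opening the help page
arg_names <- names(formals(utils::install.packages))
arg_names

# Search the installed help pages when you do not know the function name
# help.search("install")   # same as ??install
```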

### Loading a package into your workspace
Once the package is installed on your computer, you need to "load" it into your working environment in every new R session. You cannot call the functions in the package by name unless you load it with the following command.

In [None]:
library(beepr)
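The difference between an installed package and a loaded one can be sketched with `tools`, a package that ships with R but is not attached in a fresh session. The example also shows the `package::function` form, which calls a function from an installed package without loading it. The object names are chosen only for illustration.

```r
# Is 'tools' attached to the search path yet?
attached_before <- "package:tools" %in% search()
attached_before

# The package::function form works without library()
title_a <- tools::toTitleCase("veterinary statistics")

# After library(), the function can be called by name alone
library(tools)
attached_after <- "package:tools" %in% search()
title_b <- toTitleCase("veterinary statistics")

identical(title_a, title_b)
```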

In [None]:
beep(sound = 8)

In [None]:
#?beep

In [None]:
beep()

### Use the command to produce the "MARIO" game sound

### Manage your R working environment
* Setting your working environment
* Make sure you have the correct folder on your computer selected as your working directory.  
* It is the folder from which you will read your data files and where you will save your outputs.


#### What is your current working directory?

In [None]:
getwd() #the current working directory

#### Change it to the appropriate working directory

In [None]:
setwd('C:/Users/prana/Desktop/directory/Applied-Veterinary-Population-Statistics/Week 1/Code notebooks/data') # replace with the path to your own data folder

In [None]:
getwd()
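To see how the working directory affects where files are read and written, here is a small sketch that uses a temporary folder, so it does not depend on any particular computer. The folder name `wd_demo` and the file name `note.txt` are made up for this example.

```r
old_dir <- getwd()                            # remember where we started

demo_dir <- file.path(tempdir(), "wd_demo")   # hypothetical folder for the demo
dir.create(demo_dir, showWarnings = FALSE)
setwd(demo_dir)

# A relative file name is resolved against the working directory
writeLines("hello", "note.txt")
list.files()

# The file was created inside demo_dir
in_wd <- file.exists(file.path(demo_dir, "note.txt"))
in_wd

setwd(old_dir)                                # go back to the original folder
```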

#### `ls()` lists all of the variables and functions stored in the global environment (your working R session):

In [None]:
ls()

#### I have nothing right now, so let's create an object

In [None]:
a = 2

In [None]:
ls()

If you want to delete all objects in your environment, you can use the following command:

In [None]:
rm(list = ls())

In [None]:
ls()
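`rm()` can also remove a single object by name, which is often safer than clearing everything. A minimal sketch, with object names invented for the example:

```r
demo_a <- 2
demo_b <- "keep me"

rm(demo_a)            # removes only demo_a

exists("demo_a")      # FALSE: the object is gone
exists("demo_b")      # TRUE: other objects are untouched
```

Note that `rm(list = ls())` only deletes objects; it does not unload packages or reset options that were changed during the session.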

## Key R packages that we will use to explore the data

* tidyverse: a collection of packages that share the same philosophy of data management and visualization
* ggplot2: Grammar of Graphics
* dplyr: data wrangling package; check the cheatsheet here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

In [None]:
#install.packages("tidyverse", type = 'binary')

#### Reading the data

In [None]:
COVID = read.csv('covid-tracking-data-master_COVID_Tracking_Project_2020_08_07_daily.csv') # relative path: the file must be in your working directory

In [None]:
COVID = read.csv('C:/Users/prana/Desktop/directory/MPM_200/data/covid-tracking-data-master_COVID_Tracking_Project_2020_08_07_daily.csv') # alternatively, give the full path to the file
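If the data file is not at hand, the behaviour of `read.csv` can be tried on a tiny file written to a temporary folder. The file name and its contents below are invented for illustration and are not part of the COVID Tracking Project data.

```r
# Write a small CSV file to a temporary location
toy_path <- file.path(tempdir(), "toy_covid.csv")
writeLines(c("date,state,positive",
             "20200806,CA,100",
             "20200807,CA,120",
             "20200807,NY,"),      # the last value is left blank on purpose
           toy_path)

# Read it back; keep text columns as character rather than factor
toy <- read.csv(toy_path, stringsAsFactors = FALSE)
str(toy)

# A blank field in a numeric column is read as a missing value (NA)
is.na(toy$positive)
```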

In [None]:
ls()

## Check the first few lines of your dataframe

In [None]:
head(COVID)

In [None]:
#View(COVID)

* Explore the Environment pane in RStudio and click on "COVID"
* Explore the various columns of the data frame
* The presentation is similar to an Excel spreadsheet

### Exploring dataframe structure

In [None]:
nrow(COVID)

In [None]:
ncol(COVID)

In [None]:
dim(COVID)

In [None]:
colnames(COVID)

#### Calling a single column from the dataframe

In [None]:
COVID$state

In [None]:
str(COVID)

## Find unique values in a column

In [None]:
unique(COVID$state)

In [None]:
table(COVID$state) ## what is this function doing?

In [None]:
length(unique(COVID$state))
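The relationship between `unique()`, `table()` and `length()` is easier to see on a short vector. The values below are made up for the example.

```r
states <- c("CA", "NY", "CA", "TX", "CA")

unique(states)             # each distinct value once

counts <- table(states)    # how many times each distinct value occurs
counts
counts[["CA"]]             # the count for one value

n_states <- length(unique(states))   # number of distinct values
n_states
```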

#### Checking missing values

In [None]:
is.na(COVID$state)
# is.na() is a logical function: it returns TRUE where a value is missing
# and FALSE otherwise

In [None]:
sum(is.na(COVID$state))

#### Which columns truly have missing data?

In [None]:
?table
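One possible way to answer this question is to count the missing values in every column at once with `colSums(is.na(...))`. The sketch below uses a toy data frame with invented values; the same call can be applied to `COVID`.

```r
toy <- data.frame(
  state    = c("CA", "CA", "NY", "NY"),
  positive = c(100, NA, 50, 60),
  death    = c(NA, NA, 1, 2)
)

# is.na() on a data frame gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Names of the columns that contain at least one missing value
names(missing_per_column)[missing_per_column > 0]

# table() can also report missing values if asked to
table(toy$positive, useNA = "ifany")
```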

### Exporting data / saving outputs

In [None]:
table(COVID$state)

In [None]:
as.data.frame(table(COVID$state)) ## we are creating a table here

In [None]:
State_table = as.data.frame(table(COVID$state)) ## we are storing the table as a data frame
## State_table now appears in the Environment pane
#View(State_table)

In [None]:
ls()

`write.csv` will help us save our dataframe as a CSV file, which can be opened in Excel.  
`write.csv(Name_of_Dataframe, 'file_location_on_your_computer/Name_of_file.csv')`  
Here we want to save our `State_table` in the working directory as `State_table.csv`.  
**If a file with the same name already exists, it will be overwritten without warning.**  


In [None]:
write.csv(State_table, "State_table.csv")
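By default `write.csv` also writes the row names as an extra first column. A minimal sketch of how to switch that off and check the result, using a toy table and a temporary file (both invented for the example):

```r
toy_table <- as.data.frame(table(c("CA", "NY", "CA")))

out_path <- file.path(tempdir(), "toy_table.csv")

# row.names = FALSE leaves out the column of row numbers
write.csv(toy_table, out_path, row.names = FALSE)

# Read the file back to confirm what was saved
read_back <- read.csv(out_path)
read_back
```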

Where is it saved? 

In [None]:
getwd()

## Plotting with ggplot2

In [None]:
library(repr)
options(repr.plot.width=6, repr.plot.height=4)

#### ggplot2 is a package whose name stands for Grammar of Graphics 
It is one of the most popular packages for plotting in R.

In [None]:
library(ggplot2)

#### Let's see if we can plot the table that we generated

In [None]:
head(State_table)

In [None]:
ggplot(State_table) ## we got our canvas to draw our plot

In [None]:
ggplot(State_table) +
aes(x = Var1)

In [None]:
ggplot(State_table) +
aes(x = Var1)+
aes(y = Freq)

In [None]:
ggplot(State_table)+
aes(x = Var1)+
aes(y = Freq)+
geom_bar(stat="identity")

In [None]:
ggplot(State_table)+
aes(x = Var1)+
aes(y = Freq)+
geom_bar(stat="identity")+
coord_flip()

In [None]:
ggplot(State_table)+
aes(x = Var1)+
aes(y = Freq)+
geom_bar(stat="identity", width = 0.5,
color = 'steelblue', fill = 'steelblue')+
coord_flip()

In [None]:
ggplot(State_table)+
aes(x = Var1)+
aes(y = Freq)+
geom_bar(stat="identity", width = 0.5,
color = 'steelblue', fill = 'steelblue')+
xlab('State')+ ylab('Number of rows in the dataset')+
labs(title = "Covid data description")+
coord_flip()
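Two optional variations on the bar chart are sketched below with a toy table (values invented): `geom_col()` is a shorthand for `geom_bar(stat = "identity")`, and `reorder()` sorts the bars by their height instead of alphabetically.

```r
library(ggplot2)

toy_table <- data.frame(Var1 = c("CA", "NY", "TX"), Freq = c(3, 1, 2))

p <- ggplot(toy_table) +
  aes(x = reorder(Var1, Freq), y = Freq) +   # order the states by Freq
  geom_col(width = 0.5, fill = "steelblue") +
  xlab("State") + ylab("Number of rows in the dataset") +
  coord_flip()

p
```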

What if I would like to create a point plot instead?

In [None]:
ggplot(State_table)+
aes(x = Var1)+
aes(y = Freq)+
geom_point(stat="identity", color = 'steelblue', fill = 'steelblue')+
xlab('State')+ ylab('Number of rows in the dataset')+
labs(title = "Covid data description")+
coord_flip()

In [None]:
ggplot(State_table)+
aes(x = Var1)+
aes(y = Freq)+
geom_point(stat="identity", color = 'steelblue', fill = 'steelblue')+
xlab('State')+ ylab('Number of rows in the dataset')+
labs(title = "Covid data description")+
coord_flip()
# by default, ggsave() saves the last plot displayed; the file goes to the working directory
ggsave("state_points.pdf", width = 4, height = 4)
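An alternative that does not rely on "the last plot" is to store the plot in an object and pass it to `ggsave()` explicitly. This is a sketch with a toy table and a temporary output file, both invented for the example.

```r
library(ggplot2)

toy_table <- data.frame(Var1 = c("CA", "NY", "TX"), Freq = c(3, 2, 1))

# Store the plot in an object instead of only printing it
p <- ggplot(toy_table) +
  aes(x = Var1, y = Freq) +
  geom_point(color = "steelblue") +
  coord_flip()

out_file <- file.path(tempdir(), "toy_points.pdf")

# plot = p states exactly which plot is written to the file
ggsave(out_file, plot = p, width = 4, height = 4)
file.exists(out_file)
```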

In [None]:
getwd()

## Summarizing data 

#### Subsetting data
Let's say we want to explore the data only for the state of California.

In [None]:
unique(COVID$state)

In [None]:
CA = COVID[COVID$state == 'CA', ] # keep the rows where state is 'CA', and all columns

In [None]:
CA

In [None]:
COVID[2,1]

In [None]:
COVID[1, 1] # use square brackets [row, column]; COVID(1,1) fails because COVID is not a function

In [None]:
COVID$state == 'CA'

In [None]:
COVID

In [None]:
ls()

In [None]:
head(CA)
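The pieces used in this section (a logical test, then square brackets with `[rows, columns]`) can be followed step by step on a toy data frame. The values are invented for the example.

```r
toy <- data.frame(state = c("CA", "NY", "CA"),
                  positive = c(100, 50, 120))

toy[2, 1]                        # one cell: row 2, column 1

is_ca <- toy$state == "CA"       # logical vector, one TRUE/FALSE per row
is_ca

ca_rows <- toy[is_ca, ]          # rows where the test is TRUE, all columns
ca_rows

ca_positive <- toy[is_ca, "positive"]   # same rows, one column only
ca_positive
```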

## Use of the ***dplyr*** package for summarizing variables

We’re going to learn some of the most common dplyr functions: 
select(), filter(),
group_by(), and summarize().
* select() keeps only the columns that you name.
* filter() keeps only the rows that meet a condition.
* group_by() splits your data into groups, so that later steps work within each group.
* summarize() collapses each group into one row of summary values, giving a summary table.

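### The pipe operator

The pipe operator `%>%` passes the result on its left as the first argument of the function on its right. In RStudio, its keyboard shortcut is Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac).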
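A small sketch of what the pipe does: the two computations below are equivalent, but the piped version reads from left to right in the order the steps happen. The numbers are chosen only for illustration.

```r
library(dplyr)   # dplyr makes %>% available

values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# The same steps with the pipe
piped <- values %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)   # TRUE
```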

For more details on the data, check this description:  
https://covidtracking.com/data/api

In [None]:
library(dplyr)

In [None]:
COVID %>% 
select(state)

In [None]:
# typeof(state) fails with "object 'state' not found": the column exists only inside the data frame

In [None]:
typeof(COVID$state)

In [None]:
COVID %>% 
select(state, positive, positiveIncrease) ## total cases and increase in cases every day
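The four dplyr verbs listed earlier can be tried together on a toy data frame. The column names mirror the COVID data, but the values are invented, and `total_new` is a name chosen for this example.

```r
library(dplyr)

toy <- data.frame(
  state            = c("CA", "CA", "NY", "NY"),
  positive         = c(100, 120, 50, 60),
  positiveIncrease = c(10, 20, 5, 10)
)

toy %>% select(state, positiveIncrease)   # keep two columns

toy %>% filter(state == "CA")             # keep the rows for one state

# One row per state with the sum of the daily increases
by_state <- toy %>%
  group_by(state) %>%
  summarize(total_new = sum(positiveIncrease))
by_state
```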


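## How can we determine the TOTAL number of cases for each state?
`positive`: cumulative number of cases reported up to that day  
`positiveIncrease`: daily increase in cases (incidence)  
Because `positive` is already cumulative, adding it up across days counts the same cases repeatedly; the sum of `positiveIncrease` is the column that gives the total for each state.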

In [None]:
COVID %>% 
select(state, positive, positiveIncrease) %>% 
group_by(state) %>% 
summarise_all(list()) ## which function should go inside list() to get totals?

In [None]:
total_cases = COVID %>% 
select(state, positive, positiveIncrease)  %>% 
group_by(state) %>% 
summarise_all(list(sum)) ## sum every selected column within each state

In [None]:
total_cases
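The difference between a cumulative column and a daily-increase column can be checked on toy numbers (invented for this sketch). Under the assumption that the daily increases add up to the cumulative count, the sum of `positiveIncrease` should match the largest value of `positive` in each state. `na.rm = TRUE` is added so that missing values do not turn the result into `NA`; the summary column names are made up for the example.

```r
library(dplyr)

toy <- data.frame(
  state            = c("CA", "CA", "CA", "NY", "NY", "NY"),
  positive         = c(NA, 10, 30, 5, 5, 12),   # cumulative count
  positiveIncrease = c(0, 10, 20, 5, 0, 7)      # new cases each day
)

totals <- toy %>%
  group_by(state) %>%
  summarise(
    total_from_increase = sum(positiveIncrease, na.rm = TRUE),
    latest_cumulative   = max(positive, na.rm = TRUE)
  )
totals
```

Summing `positive` itself would give a much larger number, because each day's value already includes all earlier days.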

In [None]:
ggplot(total_cases)+
aes(x = state)+
aes(y = positiveIncrease)+
geom_bar(stat="identity", color = 'steelblue', fill = 'steelblue')+
xlab('State')+ ylab('Total COVID cases')+
labs(title = "COVID tracker data")+
coord_flip()
#ggsave("state_points.pdf", width = 4, height = 4)