## First programming assignment 
Three functions are requested, each meant to work with the dataset that accompanies this assignment. 

Write a function named 'pollutantmean' that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'. Given a vector of monitor ID numbers, 'pollutantmean' reads those monitors' particulate matter data from the directory specified in the 'directory' argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. 

In [45]:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the CSV files
    
    ## 'pollutant' is a character vector of length 1 indicating
    ## the name of the pollutant for which we will calculate the
    ## mean; either "sulfate" or "nitrate".
    
    ## 'id' is an integer vector indicating the monitor ID numbers
    ## to be used
    
    ## Returns the mean of the pollutant across all monitors listed
    ## in the 'id' vector (ignoring NA values)
    
    ## CSV files found in 'directory' (sorted by name), subset by monitor ID
    dfs <- Sys.glob(file.path(directory, "*.csv"))[id];
    
    total_data <- c()
    
    for (data in dfs) {
        file_data <- read.csv(data, sep = ",");   ## read the data first
        pollutant_data <- file_data[,pollutant];  ## what column name is requested
        pollutant_data <- pollutant_data[!is.na(pollutant_data)]    ## remove na values
        total_data <- c(total_data, pollutant_data)       # combine with total data then repeat
    }
    
    return(mean(total_data));
}
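
The pooling step can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the assignment's dataset: it writes two tiny CSV files into a temporary directory, assumes files are named with zero-padded monitor IDs (`001.csv`, `002.csv`), and uses a hypothetical helper `pooled_mean` that builds file names from the IDs instead of relying on glob order.

```r
## Hypothetical helper: pooled mean of one pollutant column across monitor files.
## Assumes files are named like 001.csv, 002.csv, ... inside 'directory'.
pooled_mean <- function(directory, pollutant, id) {
    paths <- file.path(directory, sprintf("%03d.csv", id))
    values <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
    mean(values, na.rm = TRUE)   ## NA values are dropped before averaging
}

## Build a throwaway directory with two small monitor files
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, NA), nitrate = c(NA, 4, 6), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(9), nitrate = c(NA), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_sulfate <- pooled_mean(demo_dir, "sulfate", 1:2)  ## (1 + 2 + 9) / 3 = 4
demo_nitrate <- pooled_mean(demo_dir, "nitrate", 1)    ## (4 + 6) / 2 = 5
```

Note that the result is the mean over all pooled observations, not the mean of per-monitor means: averaging the two monitor means for sulfate here would give (1.5 + 9) / 2 = 5.25 rather than 4.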

In [42]:
pollutantmean("specdata", "sulfate", 1:10)

In [43]:
pollutantmean("specdata", "nitrate", 70:72)

In [44]:
pollutantmean("specdata", "nitrate", 23)

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column identifies the file by its monitor ID and the second column is the number of complete cases. 

In [50]:
complete <- function(directory, id = 1:332) {
    
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the CSV files
    
    ## 'id' is an integer vector indicating the monitor ID numbers
    ## to be used
    
    ## Returns a data frame of the form:
    ## id nobs
    ## 1  117
    ## 2  1041
    ## ...
    ## where 'id' is the monitor ID number and 'nobs' is the
    ## number of complete cases
    ## CSV files found in 'directory' (sorted by name)
    files <- Sys.glob(file.path(directory, "*.csv"));
    nobs <- c();
    
    for (iloc in id) {
        file_data <- read.csv(files[iloc], sep = ",");
        num_complete_cases <- sum(complete.cases(file_data));   ## rows with no NA in any column
        nobs <- c(nobs, num_complete_cases);
    }
    
    return(data.frame(cbind(id, nobs)));
}
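
To make the counting step concrete, here is a minimal sketch on a made-up data frame (the `toy` values are invented for illustration and are not from the dataset). It shows that `complete.cases` returns one logical value per row, so the number of complete cases can be obtained either by summing that vector or by subsetting and counting rows.

```r
## Toy monitor data: only rows 1 and 4 have no missing values
toy <- data.frame(
    sulfate = c(2.1, NA, 3.3, 4.0),
    nitrate = c(0.5, 0.7, NA, 0.9),
    ID = 1
)

row_is_complete <- complete.cases(toy)   ## TRUE FALSE FALSE TRUE
toy_nobs <- sum(row_is_complete)         ## TRUE counts as 1, so this is 2

## Same count, obtained by subsetting and counting rows
toy_nobs_by_subset <- nrow(toy[row_is_complete, ])
```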

In [51]:
complete("specdata", 1)

id,nobs
1,117


In [52]:
complete("specdata", c(2, 4, 8, 10, 12))

id,nobs
2,1041
4,474
8,192
10,148
12,96


In [53]:
complete("specdata", 30:25)

id,nobs
30,932
29,711
28,475
27,338
26,586
25,463


In [54]:
complete("specdata", 3)

id,nobs
3,243


Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0.

In [56]:
corr <- function(directory, threshold = 0) {
    
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the CSV files
    
    ## 'threshold' is a numeric vector of length 1 indicating the
    ## number of completely observed observations (on all
    ## variables) required to compute the correlation between
    ## nitrate and sulfate; the default is 0
    
    ## Returns a numeric vector of correlations
    
    ## CSV files found in 'directory' (sorted by name)
    files <- Sys.glob(file.path(directory, "*.csv"));
    
    correlations <- c()
    
    for (file in files) {
        pollution_data <- read.csv(file, sep = ",");
        complete_cases <- pollution_data[complete.cases(pollution_data),];   ## keep rows with no NA
        if (nrow(complete_cases) > threshold) {
            correlations <- c(correlations, cor(complete_cases$sulfate, complete_cases$nitrate))
        }
    }
    
    return(correlations)
}

This function uses R's 'cor' function, which calculates the correlation between two vectors.
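
As a small illustration of that call, the sketch below uses an invented `toy` data frame (not the assignment's data) in which nitrate is exactly twice sulfate on the complete rows. It compares dropping incomplete rows first, as `corr` does, with letting `cor` discard incomplete pairs through its `use = "complete.obs"` argument.

```r
## Toy data with an exact linear relationship on the complete rows
toy <- data.frame(
    sulfate = c(1, 2, NA, 4, 5),
    nitrate = c(2, 4, 6, NA, 10)
)

## Drop rows with any NA first, as corr() does
complete_rows <- toy[complete.cases(toy), ]
r_subset <- cor(complete_rows$sulfate, complete_rows$nitrate)

## Equivalent: let cor() discard incomplete pairs itself
r_builtin <- cor(toy$sulfate, toy$nitrate, use = "complete.obs")
```

Both values equal 1 here because the three complete rows lie on a straight line. Monitors with very few complete cases can produce extreme correlations in the same way, which is worth keeping in mind when the threshold is low.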

In [57]:
cr <- corr("specdata", 150)
head(cr)

In [58]:
summary(cr)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.21057 -0.04999  0.09463  0.12525  0.26844  0.76313 

In [59]:
cr <- corr("specdata", 400)
head(cr)

In [60]:
summary(cr)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.17623 -0.03109  0.10021  0.13969  0.26849  0.76313 

In [61]:
cr <- corr("specdata", 5000)
summary(cr)

Length  Class   Mode 
     0   NULL   NULL 

In [62]:
length(cr)
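
The summary above reports class NULL because a vector grown from `c()` that never receives an element stays `NULL`. The task statement asks for a numeric vector of length 0, so one possible adjustment, shown here only as a sketch, is to start the accumulator from `numeric(0)` instead. The variable names below are illustrative.

```r
empty_by_c <- c()              ## NULL, the starting value when growing from c()
empty_numeric <- numeric(0)    ## a numeric vector with no elements

is.null(empty_by_c)            ## TRUE
is.numeric(empty_by_c)         ## FALSE
is.numeric(empty_numeric)      ## TRUE
length(empty_by_c) == length(empty_numeric)   ## TRUE, both are 0

## Growing the empty numeric vector with c() yields an ordinary numeric vector
grown <- c(empty_numeric, 0.25, -0.1)
```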

In [63]:
cr <- corr("specdata")
summary(cr)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.00000 -0.05282  0.10718  0.13684  0.27831  1.00000 

In [64]:
length(cr)

## This assignment is graded through a quiz 

What follows are the quiz questions.

In [65]:
## What value is returned by the following call to pollutantmean()? You should round your output to 3 digits.
pollutantmean("specdata", "sulfate", 1:10)

In [66]:
## What value is returned by the following call to pollutantmean()? You should round your output to 3 digits.
pollutantmean("specdata", "nitrate", 70:72)

In [67]:
## What value is returned by the following call to pollutantmean()? You should round your output to 3 digits.
pollutantmean("specdata", "sulfate", 34)

In [68]:
## What value is returned by the following call to pollutantmean()? You should round your output to 3 digits.

pollutantmean("specdata", "nitrate")

In [70]:
## What value is printed at the end of the following code?
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310))
print(cc$nobs)

[1] 228 148 124 165 104 460 232


In [72]:
## What value is printed at the end of the following code?
cc <- complete("specdata", 54)
print(cc$nobs)

[1] 219


In [73]:
## What value is printed at the end of the following code?
RNGversion("3.5.1")  
set.seed(42)
cc <- complete("specdata", 332:1)
use <- sample(332, 10)
print(cc[use, "nobs"])

"non-uniform 'Rounding' sampler used"

 [1] 711 135  74 445 178  73  49   0 687 237
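
The `RNGversion("3.5.1")` call selects the sampling algorithm that R 3.5.1 used, which is why R 3.6.0 and later print the "non-uniform 'Rounding' sampler used" warning. A minimal sketch of why the quiz pins both the RNG version and the seed: with the same settings and the same seed, `sample` returns the same draw every time. The variable names are illustrative, and no particular draw values are claimed.

```r
## Reseeding with the same RNG settings reproduces the same draw
suppressWarnings(RNGversion("3.5.1"))   ## on R >= 3.6.0 this selects the older "Rounding" sampler
set.seed(42)
first_draw <- sample(332, 10)

set.seed(42)
second_draw <- sample(332, 10)

same_draw <- identical(first_draw, second_draw)   ## TRUE

## Restore the default settings for the running R version
RNGversion(as.character(getRversion()))
```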


In [74]:
## What value is printed at the end of the following code?
cr <- corr("specdata")                
cr <- sort(cr)   
RNGversion("3.5.1")
set.seed(868)                
out <- round(cr[sample(length(cr), 5)], 4)
print(out)

"non-uniform 'Rounding' sampler used"

[1]  0.2688  0.1127 -0.0085  0.4586  0.0447


In [75]:
## What value is printed at the end of the following code?
cr <- corr("specdata", 129)                
cr <- sort(cr)                
n <- length(cr)    
RNGversion("3.5.1")
set.seed(197)                
out <- c(n, round(cr[sample(n, 5)], 4))
print(out)

"non-uniform 'Rounding' sampler used"

[1] 243.0000   0.2540   0.0504  -0.1462  -0.1680   0.5969


In [76]:
## What value is printed at the end of the following code?
cr <- corr("specdata", 2000)                
n <- length(cr)                
cr <- corr("specdata", 1000)                
cr <- sort(cr)
print(c(n, round(cr, 4)))

[1]  0.0000 -0.0190  0.0419  0.1901
