# Data Summarization with R


## Summarizing Vectors

As we have seen in prior modules, vectors are summarized using measures of central tendency and variability. 
In this section we look at other descriptive statistics and functions for summarizing vectors. 
We will work with the same King County housing prices dataset.

In [None]:
housing_prices <- read.csv("/dsa/data/all_datasets/house_sales_in_king_county/kc_house_data.csv")

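apply(), lapply(), sapply(), mapply(), and tapply() are some of the base R functions you can use to apply a summarizing function to the rows, columns, or groups of a dataset (ddply() from the plyr package serves a similar purpose for data frames).
Let's look at each of the base R functions in turn. 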

In [None]:
# apply() is used to apply a function to the rows or columns of a matrix (a data frame is converted to a 
# matrix first). It collapses each row or column to the result of the function. 
# The middle parameter specifies either rows (i.e., 1) or columns (i.e., 2).  In the function below we are 
# applying the mean function to all columns, so we specified "2".

# In this dataset, date is read in as a factor (or character) variable, and we cannot apply the mean 
# function to it.  The id column is just an identifier for the row, so its mean is not meaningful.  
# We will exclude these two columns from our dataset. 
# The data frame without the date and id variables is named "less_data".

less_data = housing_prices[!names(housing_prices) %in% c('date','id')]

head(less_data)

In [None]:
apply(less_data, 2, mean)

In [None]:
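# colMeans(), rowMeans(), colSums(), and rowSums() are dedicated functions you can use instead of apply() 
# when you only need the means or sums of the columns or rows of a matrix or numeric data frame. 

colMeans(less_data)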
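As a minimal sketch of that equivalence, the cell below builds a small made-up matrix (`toy` and its values are hypothetical, not taken from the King County data) and checks that `colMeans()` and `rowSums()` agree with the corresponding `apply()` calls.

```r
# Hypothetical toy matrix, used only for illustration
toy <- cbind(bedrooms = c(2, 3, 4), bathrooms = c(1, 2, 2.5))

col_means_apply <- apply(toy, 2, mean)   # column means via apply()
col_means_fast  <- colMeans(toy)         # same result with the dedicated function

row_sums_apply <- apply(toy, 1, sum)     # row sums via apply()
row_sums_fast  <- rowSums(toy)           # same result with the dedicated function

print(col_means_fast)
print(isTRUE(all.equal(col_means_apply, col_means_fast)))
print(isTRUE(all.equal(row_sums_apply, row_sums_fast)))
```

The dedicated functions are usually the better choice for plain sums and means; `apply()` remains useful when you need an arbitrary function such as `median` or `sd`.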

In [None]:
# Create a list containing the bedrooms and bathrooms variables.  These variables become the elements of the list.
# x[[1]] contains bedrooms and x[[2]] contains bathrooms. Look at the structure of x below. 

x = list(less_data$bedrooms, less_data$bathrooms)
str(x)

In [None]:
# lapply() is used to apply a function to each element of a list. It always returns a list of the same 
# length as its input, here one result per element of x.

lapply_example = lapply(x, FUN = mean)

lapply_example

# lapply() calculates the mean of each element of x (x[[1]] and x[[2]]) and returns the two results as 
# lapply_example[[1]] and lapply_example[[2]].  The results are returned as a list.

class(lapply_example)

In [None]:
# sapply() is also used when you want to apply a function to each element of a list (here, each column of 
# the data frame).  lapply() and sapply() are similar, except that lapply() always returns a list, whereas 
# sapply() simplifies the result to a vector (or matrix) whenever possible.

sapply_example = sapply(less_data, FUN = mean)

sapply_example

class(sapply_example)
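To make the difference in return types concrete, here is a small sketch on a made-up list (`toy_list` is hypothetical). It also shows `vapply()`, a base R variant of `sapply()` that checks each result against a declared type and length.

```r
# Hypothetical toy list, used only for illustration
toy_list <- list(bedrooms = c(2, 3, 4), bathrooms = c(1, 2, 3))

as_list   <- lapply(toy_list, mean)               # always a list
as_vector <- sapply(toy_list, mean)               # simplified to a named numeric vector
as_strict <- vapply(toy_list, mean, numeric(1))   # like sapply(), but each result must be one number

print(class(as_list))
print(class(as_vector))
print(as_vector)
```

Because `sapply()` decides on its return type from the results it gets, `vapply()` is often preferred inside functions, where a predictable type matters.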

In [None]:
# mapply() is used to apply a function to multiple list or vector arguments in parallel.  It is a multivariate 
# version of sapply().  It vectorizes arguments to functions that don't usually work element by element.
# mapply() applies the function to the first element of each ... argument, and then applies the function 
# to the second element of each argument, and then the third, and so on.  The result is simplified to a 
# vector/array when possible, as with sapply().

# For example, in our dataset there are different variables measuring different areas -- e.g., sqft_living, 
# sqft_lot, sqft_above, sqft_basement, sqft_living15, and sqft_lot15. If we wanted to find the sum of these
# areas for each house, we could use the mapply() function.

mapply_example = mapply(sum, less_data$sqft_living, less_data$sqft_lot, less_data$sqft_above,
    less_data$sqft_basement, less_data$sqft_living15, less_data$sqft_lot15)

head(mapply_example)

In [None]:
head(less_data)

With the mapply() function, the values of the six arguments, as found in the first row, are added together
to generate the first value in the output.  The values of the six arguments, as found in the second row, are 
added together to generate the second output, and so on.  

In [None]:
# The values for these six variables, as shown in the first row, are  1180 (sqft_living), 5650 (sqft_lot),
# 1180 (sqft_above), 0 (sqft_basement), 1340 (sqft_living15), and 5650 (sqft_lot15).  The sum of these numbers
# is the value returned by mapply().  

1180 + 5650 + 1180 + 0 + 1340 + 5650
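For a plain row-wise sum like this one, `rowSums()` gives the same totals in a single vectorized call. The sketch below uses three short made-up vectors (hypothetical values, not rows from the dataset) to show that the two approaches agree.

```r
# Hypothetical toy vectors standing in for three of the area columns
living   <- c(1000, 2000)
lot      <- c(5000, 7000)
basement <- c(0, 400)

# mapply() calls sum(living[i], lot[i], basement[i]) for each position i
total_mapply <- mapply(sum, living, lot, basement)

# rowSums() adds across the columns of a matrix built from the same vectors
total_rowsums <- rowSums(cbind(living, lot, basement))

print(total_mapply)
print(isTRUE(all.equal(total_mapply, total_rowsums)))
```

`mapply()` earns its place when the function being applied is not itself vectorized, for example a function that takes several scalar arguments.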

tapply() is used to apply a function to a subset of a vector, where the subset is defined by some other vector -- usually a factor (i.e., a categorical variable).

It can be thought of as a grouped version of the apply() function: its second argument should be a categorical variable (i.e., a factor, or a vector that can be treated as one), and the supplied function is applied to the values in each group.

In [None]:
# Use tapply() to find the average price of a home, given the number of bedrooms in the house.
# t() transposes the result so that it prints as a single row.

t(tapply(less_data$price, less_data$bedrooms, mean))
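The grouping that `tapply()` performs can be seen more easily on a few made-up values (the `price` and `bedrooms` vectors below are hypothetical). The second call spells out the same computation in two steps: split the vector by group, then apply the function to each piece.

```r
# Hypothetical toy data, used only for illustration
price    <- c(100, 200, 300, 500)
bedrooms <- c(2, 2, 3, 3)

# One mean per distinct value of bedrooms
mean_by_bedrooms <- tapply(price, bedrooms, mean)
print(mean_by_bedrooms)

# Equivalent two-step version: split by group, then summarize each group
mean_by_split <- sapply(split(price, bedrooms), mean)
print(mean_by_split)
```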

<span style="color:#d37c08; font-weight:700"> `by` </span> <span style="color:#6a85dd"> function</span>
------
tapply() can be used to summarize one variable, based on another variable.  But what if we want to summarize many variables?  The `by` function is like an extended version of the tapply() command.
 
The `by` function subsets a data frame by the values of one or more factors (i.e., categorical variables) and applies a function to each subset.

In [None]:
# Here, the data frame is being subset by the "view" variable (which has a min value of 0 and a 
# max value of 4); and within each subset, the summary function is run on both the price and the 
# sqft_living variables.  

by_function_example <- by(less_data[c('price','sqft_living')], less_data$view, summary)

by_function_example

### 2-way tables
------
2-way tables are very informative. The table() function cross-tabulates categorical variables, counting the observations in each combination of values.

The table below shows the distribution of the number of bathrooms (see column headings) for each count of bedrooms (see row labels).  
Each cell is the number of houses with that combination, and the row and column sums (labeled Sum) give 
the total number of houses with each bedroom count and each bathroom count. 

In [None]:
# The command below produces a 2-way table with a count for every combination of bedrooms and bathrooms. 

# addmargins() adds the sum of each row and each column to the table.

bed_and_bath = table(less_data$bedrooms, less_data$bathrooms)

addmargins(bed_and_bath)
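Counts are often easier to compare as proportions. As a small sketch on made-up values (the `beds` and `baths` vectors are hypothetical), `prop.table()` converts a table of counts into shares, here within each row.

```r
# Hypothetical toy data, used only for illustration
beds  <- c(2, 2, 3, 3, 3)
baths <- c(1, 1, 1, 2, 2)

counts <- table(beds, baths)
print(addmargins(counts))                     # counts plus row and column totals

# margin = 1 divides each cell by its row total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)
print(row_share)
```

Using `margin = 2` instead would give shares within each column, and omitting `margin` gives shares of the grand total.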

Below, the xtabs() function extends the 2-way table with a third dimension. We can see the same bedroom-by-bathroom counts as above, but as a separate table for each value of view (0, 1, 2, 3, 4).

In [None]:
bed_bath_view <- xtabs(~ bedrooms + bathrooms + view, data = housing_prices)
bed_bath_view

In [None]:
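# The stat.desc() function from the pastecs package returns an extensive set of descriptive statistics for 
# each column of its input (e.g., number of values, min, max, median, mean, and standard deviation).

library(pastecs)

# Avoid scientific notation in the printed output
options(scipen = 999)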

stat.desc(less_data)

In [None]:
# aggregate() works much like GROUP BY in SQL. Here we are grouping the data by the number of bedrooms. The means 
# of 3 columns (i.e., price, bathrooms, and sqft_living) are returned for each distinct number of bedrooms.  

aggregate(less_data[c("price","bathrooms","sqft_living")], by = list(bedrooms = less_data$bedrooms), mean)
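`aggregate()` also accepts a formula, which some readers find closer to the SQL wording. The sketch below runs both forms on a small made-up data frame (`toy` and its columns are hypothetical) and produces the same group means either way.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  bedrooms = c(2, 2, 3, 3),
  price    = c(100, 200, 300, 500),
  sqft     = c(800, 1000, 1500, 1700)
)

# List interface, as in the cell above
by_list <- aggregate(toy[c("price", "sqft")],
                     by = list(bedrooms = toy$bedrooms), FUN = mean)

# Formula interface: summarize price and sqft for each value of bedrooms
by_formula <- aggregate(cbind(price, sqft) ~ bedrooms, data = toy, FUN = mean)

print(by_formula)
```

One practical difference is that the formula interface drops rows with missing values by default, so the two forms can differ on data containing NAs.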

[Additional reading on Summarizing data is suggested](http://www.cookbook-r.com/Manipulating_data/Summarizing_data/)

---

## Visual Summarization of Data 

The following examples show how to visualize summaries of data with multiple dimensions, using scatterplot matrices, correlations, and distributions. They use small example datasets such as iris and USArrests rather than the housing data.

In [None]:
library("FactoMineR")
library("factoextra")
library("car")

data(iris)

In [None]:
options(repr.plot.width=12, repr.plot.height=12)


scatterplotMatrix(iris)

In [None]:
library(GGally)
# Pairwise plots of the four numeric columns (column 5, Species, is dropped)
ggpairs(iris[, -5]) + theme_bw()

In [None]:
# Color the pairwise plots by species.
p <- ggpairs(iris, aes(color = Species)) + theme_bw()
# Change the colors manually:
# loop through each plot in the matrix, changing the relevant scales
for(i in 1:p$nrow) {
  for(j in 1:p$ncol){
    p[i,j] <- p[i,j] + 
        scale_fill_manual(values=c("#00AFBB", "#E7B800", "#FC4E07")) +
        scale_color_manual(values=c("#00AFBB", "#E7B800", "#FC4E07"))  
  }
}
p


In [None]:
# This is an example of how to visualize and summarize the results of a principal component analysis (PCA).

my_data <- iris[, -5] # Remove the grouping variable
res.pca <- prcomp(my_data, scale. = TRUE) # Standardize each variable before the PCA
fviz_pca_biplot(res.pca, col.ind = iris$Species,
                palette = "jco", geom = "point")

In [None]:
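# Hierarchical clustering of the USArrests data, drawn as a dendrogram.
library(magrittr) # Provides the %>% pipe, in case it is not already attached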
USArrests %>%
  scale() %>%                           # Scale the data
  dist() %>%                            # Compute distance matrix
  hclust(method = "ward.D2") %>%        # Hierarchical clustering
  fviz_dend(cex = 0.5, k = 4, palette = "jco") # Visualize and cut 
                                              # into 4 groups

In [None]:
data("housetasks")

In [None]:
library(ggpubr)
theme_set(theme_pubr())

# Balloon plot of the housetasks table; the fill color of each point encodes the cell value

ggballoonplot(housetasks, fill = "value")+
  scale_fill_viridis_c(option = "C")