# Extra Lab - Multivariate Data 


Let's continue our discussion of bivariate data analysis a little bit using the housing prices data as well as the abdominal circumference data.

Can we say that abdominal circumference varies with gestation period? Generally, we would expect this to be true. We can test this assumption by analyzing the data. Read the dataset into a dataframe object called `ac_data`.

In [None]:
ac_data=read.csv("/dsa/data/all_datasets/abdominal circumference/ac.csv")
head(ac_data)

When you look at the first few rows you can't say for sure if long gestation periods will correlate with abdominal circumference. Let's get into the descriptive statistics of the data...

In [None]:
summary(ac_data)

The summary alone cannot tell us whether the gawks and ac variables are normally distributed, but nothing looks out of place and there are no NA values. Let's check the data type of the variables using the `str()` command.

In [None]:
str(ac_data)

From the histograms we plotted in the Vector.ipynb lab, the data looked normally distributed, but histograms tell us nothing about the relationship between the two variables, such as whether it's linear, non-linear, or absent altogether. Scatter plots are one of the best ways to visualize data to identify relationships.

In [None]:
library(ggplot2)

ggplot(ac_data, aes(x = ac, y = gawks)) +   # refer to columns by name inside aes()
    geom_point() +
    geom_smooth(method = lm,   # Add linear regression line
                se = FALSE) +  # Don't add shaded confidence region
    xlab("Abdominal circumference") +
    ylab("Gestation period")

The plot above shows a clear positive linear relationship between ac and gawks. So, bivariate data analysis can identify relationships among variables. When a dataset has just a few features, you can make visualizations to identify the relationships, but if there are 100 features you can't plot graphs for all of them.

We end up relying on numerical statistics when the number of dimensions gets large. So how do we know whether two variables are related without plotting them against each other? There are functions in `R` that calculate how two variables are related. One of the common methods is discussed below.

### Pearson correlation coefficient
----


Recall that the Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. If the relationship between the variables is not linear, then the correlation coefficient does not adequately represent the strength of the relationship between the variables and other methods must be used.
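As a minimal sketch of that caveat, the toy data below (made up for illustration, not part of the lab's datasets) has `y` determined exactly by `x`, yet the relationship is not linear, so Pearson's r comes out at essentially zero. The same `x` values in a strictly linear relationship give an r of 1.

```r
# Toy data (not from the lab): y depends on x exactly, but not linearly.
x <- seq(-3, 3, by = 0.5)
y <- x^2

# Pearson's r only measures the linear part of a relationship.
r_nonlinear <- cor(x, y)

# A strictly linear relationship on the same x values, for comparison.
r_linear <- cor(x, 2 * x + 1)

print(round(c(nonlinear = r_nonlinear, linear = r_linear), 3))
```

This is why a scatter plot is worth checking before trusting a correlation coefficient near zero.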

Remember that Pearson's r (r is the symbol used to denote correlation coefficient) can range from -1 to 1. An r of -1 indicates a perfect negative linear relationship between variables, an r of 0 indicates no linear relationship between variables, and an r of 1 indicates a perfect positive linear relationship between variables.
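For reference, the sample coefficient is usually defined as follows. The notation is introduced here and does not come from the lab: $x_i$ and $y_i$ are the $n$ paired observations, and $\bar{x}$ and $\bar{y}$ are their sample means.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is positive when the two variables tend to be above or below their means together and negative when they move in opposite directions; the denominator rescales the result so that it lies between -1 and 1.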

To calculate the Pearson (linear) correlation coefficient for a pair of variables, you can use the `cor.test()` function in R.

##### Positive correlation


In [None]:
cor.test(ac_data$gawks,ac_data$ac)

The correlation coefficient, about 0.986, indicates a very strong positive correlation. The P-value for the statistical test of whether the correlation coefficient differs from zero is reported as less than 2.2e-16, which is far smaller than 0.05 (the usual cutoff for statistical significance). So there is very strong evidence that the correlation is non-zero.

##### Negative correlation

When the correlation coefficient is negative, the variables are negatively correlated: as one increases, the other tends to decrease. Again, if the P-value is smaller than 0.05 (the usual cutoff for statistical significance), we can say the correlation coefficient is different from 0.
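The lab's datasets do not include an obvious negatively correlated pair, so the sketch below borrows R's built-in `mtcars` dataset (an assumption made for illustration only) and tests car weight against miles per gallon.

```r
# mtcars ships with base R (datasets package); it is not part of this lab's data.
neg_test <- cor.test(mtcars$wt, mtcars$mpg)

neg_test$estimate   # Pearson's r: negative, since heavier cars get fewer miles per gallon
neg_test$p.value    # compare against the 0.05 cutoff
```

The output is read exactly as in the positive case; only the sign of the estimate changes.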

##### Zero correlation

When the correlation coefficient is close to zero, there is little or no linear correlation between the variables. The P-value will typically be larger than 0.05, which means we cannot reject the hypothesis that the correlation is zero.

So the sign of Pearson's r doesn't matter for determining whether the relationship is non-zero. Whether positive or negative, the magnitude of r tells us how strong the linear relationship is, and its square gives the proportion of variability in one variable that the other explains through a linear fit.
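Returning to the earlier point about datasets with too many features to plot: one possible approach is to compute every pairwise coefficient at once with `cor()` and then list only the strong pairs. The sketch below uses a small synthetic data frame; the column names `a` to `d`, the seed, and the 0.7 threshold are all arbitrary choices for illustration.

```r
# Synthetic stand-in for a wide dataset (hypothetical columns, fixed seed).
set.seed(42)
n <- 200
toy <- data.frame(a = rnorm(n))
toy$b <- 2 * toy$a + rnorm(n, sd = 0.5)   # strongly related to a
toy$c <- -toy$a + rnorm(n, sd = 2)        # weakly, negatively related to a
toy$d <- rnorm(n)                         # unrelated noise

# All pairwise Pearson coefficients in one call.
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# List the pairs whose |r| exceeds the chosen threshold (upper triangle only,
# so each pair appears once and the diagonal is skipped).
strong <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(var1 = rownames(cor_matrix)[strong[, "row"]],
                           var2 = colnames(cor_matrix)[strong[, "col"]],
                           r    = cor_matrix[strong])
print(strong_pairs)
```

On a real data frame, the non-numeric columns would have to be dropped before calling `cor()`.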

----


We have seen how to analyze bivariate data using scatter plots. Advancing our discussion into multivariate data analysis, you will generally start with univariate data analysis followed by bivariate data analysis before plotting complex 3d plots for multivariate data analysis. 

Let's start the discussion by loading the housing prices data into a dataframe called `housing_prices`. We will skip univariate and bivariate analysis since we have seen that in previous labs. 

### Loading data...

In [None]:
housing_prices <- read.csv("/dsa/data/all_datasets/house_sales_in_king_county/kc_house_data.csv")

As usual, let's take a quick look at the data to make sure we read the data correctly into the dataframe...

In [None]:
head(housing_prices)

In [None]:
#The structure of the dataframe as follows
str(housing_prices)

`str()` gave us an overall sense of the data. The data is a combination of numeric and integer variables. We will not worry about the id and date variables since they are not predictors. Let's dig deeper into the data by first doing some univariate analysis, just like we did in module 1.

Run a summary() on all variables...

In [None]:
summary(housing_prices)

### Multivariate Data Analysis Using Plots

We have seen in previous labs that Bedrooms, bathrooms, sqft_living, grade, sqft_above, sqft_basement vary linearly with price of the house. These might be the most decisive variables in predicting price of the house. floors, yr_built, yr_renovated, condition, lat and long tend to have positive correlation with price. Let's plot condition and floors variables against price to look into them in detail. We will use view as another dimension in these plots. 

In [None]:
ggplot(housing_prices, aes(x=condition,y=price)) + xlab("condition")+ ylab("Price") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm)
    
ggplot(housing_prices, aes(x=floors,y=price)) + xlab("floors")+ ylab("Price") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm)

Price does not vary much with condition, but it increases with floors, so floors can be a predictor. The relationship between latitude, longitude and price could also be interesting. To see how they vary together, let's plot all three variables in a three dimensional plot with view as the 4th dimension.

In [None]:
library(scatterplot3d)
#Assigning a color to each view
housing_prices$colors[housing_prices$view==0] <- "red"
housing_prices$colors[housing_prices$view==1] <- "blue"
housing_prices$colors[housing_prices$view==2] <- "green"
housing_prices$colors[housing_prices$view==3] <- "magenta"
housing_prices$colors[housing_prices$view==4] <- "cyan"


with(housing_prices, {
   scatterplot3d(lat, long, price,     # x, y and z axes
                 type="h",             # vertical lines from each point down to the horizontal plane
                 angle = 45,           # angle between the x and y axes, i.e. how the graph is oriented
                 pch = 16,             # symbol used to draw the points
                 color=colors,         # use the colors column defined above
                 main="Location vs Prices",
                 xlab="Latitude",
                 ylab="Longitude",
                 zlab="Price")

   legend("topleft", inset=.05,      # location where the legend should be positioned on the graph
       bty="n", cex=.5,              # suppress legend box, shrink text 50%
       title="Number of Views",
       legend=c("0", "1", "2", "3", "4"), fill=c("red", "blue", "green", "magenta", "cyan"))
})

The plot is very difficult to interpret. We were hoping to see a pattern in house prices across the geographic coordinates. You can rotate the graph by changing the angle from 45 to any value between 0 and 360 degrees, but the plot doesn't get any easier to interpret. The problem is that there are too many data points and they are clustered very close to each other.

Since we have geographic coordinates, we can try plotting the data on a Google map. A Google map may be more readable than a 3d plot. While long and lat are the X and Y coordinates, view will be the 3rd dimension in the plot.

In [None]:
library(ggmap)

apikey <- scan("/dsa/data/all_datasets/ggmap_api_key.txt", what="character")
register_google(key = apikey)


In [None]:
# get_map() downloads the map for the location supplied as an argument from the specified source. Different options are
# available for the source and maptype arguments. We need the map for King County, Washington.
map <- get_map(location = 'King County, Washington', source = 'google', maptype = 'roadmap', zoom = 9)

# ggmap() plots the map downloaded above, with long and lat on the x and y axes. Houses with a view of 4 are colored
# red and houses with a view of 0 are colored white. scale_colour_gradient() gives a smooth transition from white to
# red across the view levels (0 to 4).
ggmap(map) + geom_point(data=housing_prices, aes(x = long, y = lat, color = view), size=1, alpha = 0.8) +
    scale_colour_gradient(low="white", high="red", space="Lab")

Although the graph above doesn't tell us anything about price and its relationship to latitude, longitude, or view, it is much easier to read than the 3d plot. Houses near the river or lake tend to have higher view values (3 or 4), and as a general assumption they will be priced higher than houses with lower view values.

Let's plot only those houses whose price is greater than 2 million. We will use price as the 3rd dimension and views as 4th dimension.

In [None]:
# subset() is a base R function. subset(housing_prices, price > 2000000) keeps only the rows of housing_prices
# where price is greater than 2 million. Column names can be used directly inside subset().
ggmap(map, darken = c(.5, "white")) + geom_point(data = subset(housing_prices, price > 2000000),
                         aes(x = long, y = lat, color = price, shape = factor(view)), size=1, alpha = 0.8) +
                         scale_colour_gradient(low="blue", high="red", space="Lab")

In [None]:
table(housing_prices$price>2000000)
table(housing_prices$price>4000000)
table(housing_prices$price>6000000)

Run the cell above. In the earlier plot we showed houses priced above `$2` million, along with their view values, on the Google map. As the counts show, only 198 houses are priced above `$2` million, 11 above `$4` million, and 3 above `$6` million. Price served as the third dimension through the color option, and view as the fourth dimension through the shape argument.

We were hoping to see more red (higher price) than blue (lower price). The data is skewed; in other words, few houses have prices near the extreme end of 7.7 million. Most of the houses shown are in the range of 2 million to 3 million. Let's draw the same plot again for prices greater than 4 million.

In [None]:
map <- get_map(location = 'King County, Washington', source = 'google', maptype = 'roadmap', zoom = 10)
ggmap(map, darken = c(.5, "white")) + geom_point(data = subset(housing_prices, price > 4000000),
                         aes(x = long, y = lat, color = price, shape = factor(view)), size=2, alpha = 0.8) +
                         scale_colour_gradient(low="green", high="red", space="Lab")

Our assumption that high priced houses will have either 3 or 4 views was wrong. If you look at the plot above you will see that 3 houses have 0 views, 1 house has 2 views, 1 house has 3 views, and 5 houses have 4 views. 

Of all the independent variables, sqft_living and bathrooms are the two variables which are related to price the most; in other words, they vary most with price. Both variables are numeric. We used price as our third dimension in the plot above. You can use either bathrooms or sqft_living as the fourth dimension, like the size parameter as shown below. 

In [None]:
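ggmap(map, darken = c(.5, "white")) +
    geom_point(data = subset(housing_prices, price > 2000000),
               aes(x = long, y = lat, color = price, size = bathrooms),  # size is mapped here, so no constant size is set
               alpha = 0.8) +
    scale_size_continuous(range = c(1, 6)) +   # smallest and largest point size to draw
    scale_colour_gradient(low="white", high="red", space="Lab")

In the plot above, point size grows with the number of bathrooms. Two details matter for that to work: the column is referred to as `bathrooms` inside `aes()` so that it is taken from the subsetted data, and no constant `size` is passed outside `aes()`, because a constant would override the mapping and draw every point at the same size. Beyond showing which of these houses have more bathrooms, the plot tells us little that the previous plots did not.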

Can we add a 5th dimension to the plot? Maybe view, sqft_living, grade, or yr_renovated could be added as a 5th dimension? The only option left for adding another dimension is shape. The variable mapped to shape should be a factor with no more than 6 levels. Both grade and yr_renovated have more than 6 levels. sqft_living is numeric with a wide range of values, so we cannot use it directly either. So we are stuck with view again as our 5th dimension. We can try different variables for the color and size parameters though.
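One possible workaround, which this lab does not pursue, is to bin a numeric variable into a handful of ordered groups so that it has few enough levels to be mapped to shape. The sketch below uses `cut()` on made-up square footage values; the break points and labels are arbitrary assumptions, not values taken from the housing data.

```r
# Toy living-area values in square feet (made up for illustration).
sqft <- c(800, 1200, 1500, 1900, 2400, 3100, 4200, 5600, 7300, 9800)

# Bin into 6 ordered groups using fixed, arbitrary break points.
# Each interval includes its upper bound, e.g. (1000, 2000].
sqft_bin <- cut(sqft,
                breaks = c(0, 1000, 2000, 3000, 5000, 8000, Inf),
                labels = c("<1k", "1-2k", "2-3k", "3-5k", "5-8k", ">8k"))

nlevels(sqft_bin)   # number of groups; must stay at or below 6 for shape
table(sqft_bin)     # how many values fall in each group
```

The binned factor could then be used in place of `factor(view)` in a shape mapping, at the cost of losing the detail within each bin.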

The two plots below show houses priced above 2 million: the first for houses with a view of 3 or more, the second for houses with a view below 3.

In [None]:
# size is mapped to bathrooms inside aes(); a constant size outside aes() would override that mapping
plot1 = ggmap(map, darken = c(.5, "white")) +
       geom_point(data = subset(housing_prices, price > 2000000 & view >= 3),
                  aes(x = long, y = lat, color = price, size = bathrooms, shape = factor(view)), alpha = 0.8) +
                  scale_size_continuous(range = c(1, 6)) +
                  scale_colour_gradient(low="white", high="red", space="Lab")

plot2 = ggmap(map, darken = c(.5, "white")) +
       geom_point(data = subset(housing_prices, price > 2000000 & view < 3),
                  aes(x = long, y = lat, color = price, size = bathrooms, shape = factor(view)), alpha = 0.8) +
                  scale_size_continuous(range = c(1, 6)) +
                  scale_colour_gradient(low="white", high="red", space="Lab")

plot1+xlim(-122.45,-122)+ ylim(47.4,47.75)+labs(title="Houses with price > $2M and views >= 3")
plot2+xlim(-122.45,-122)+ ylim(47.5,47.75)+labs(title="Houses with price > $2M and views < 3")
             

Let's plot price vs bathrooms using grade and bedrooms as the third and fourth dimensions in ggplot. Grade is mapped to size, so the circle gets bigger with increasing grade. Bedrooms is our fourth dimension and is mapped to color, so the color of the circle changes as the number of bedrooms increases.

In [None]:
# library(ggplot2)
ggplot(housing_prices, aes(x=bathrooms, y=price, size=grade, color=as.numeric(bedrooms))) + xlab("bathrooms") + ylab("price") +
  geom_point()

Again, the plot would have helped us see trends in the data if there were fewer observations. We can see that price increases with the number of bathrooms, but we cannot draw clear conclusions about how bedrooms and grade relate to price.

We were able to use the five dimensions lat, long, price, bathrooms, and view in some of the plots above, but we also want to see the effect of sqft_living on prices. Unfortunately, we can't visualize it on the plot as we are running out of options. So there is a limit to the advantages of visualizations. We have to get back to numerical statistics to understand relationships between features when the number of dimensions is more than a graph can handle.

### Descriptive Statistics on Vectors

We tried to do descriptive statistics on data in the beginning using summary() function. Let's dive in a bit more into data exploration.

In [None]:
# Use sapply() to get means for all variables in data frame housing_prices.
# Since we don't have NA values in the data, we don't have to worry about excluding missing values. If you have any,
# you can exclude them by including "na.rm=TRUE" as the third parameter in the command below.

# Also, date and colors are not numeric variables, so mean() cannot be applied to them. We exclude them when computing the means.
sapply(housing_prices[,!names(housing_prices) %in% c('date','colors')], mean)

Subset the data based on yr_built and run a summary on the new sub datasets. subset1 should have data before year 1990 (including 1990) and subset2 should have data with yr_built after 1990.

In [None]:
subset1=subset(housing_prices,yr_built <= 1990)
summary(subset1)

In [None]:
subset2=subset(housing_prices,yr_built > 1990)
summary(subset2)

In [None]:
#one-way table.  
table(housing_prices$bedrooms)

In [None]:
# Two-way table. The command below produces a 2-way table with the count of every combination of bedrooms and bathrooms.
# addmargins() appends the row and column sums of these counts.
bed_vs_bath = table(housing_prices$bedrooms,housing_prices$bathrooms)
addmargins(bed_vs_bath)

2-way tables are very informative. In the table above, we have the distribution of bathrooms for every count of bedrooms. The row and column sums are also displayed; they give the number of houses with a specific number of bedrooms or bathrooms.
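Raw counts can be hard to compare across rows of very different sizes. As a hedged sketch of one common follow-up, `prop.table()` converts a two-way table into proportions. The bedroom and bathroom values below are made up for illustration; with the lab's data, the same call could be applied to the two-way table built above.

```r
# Toy bedroom/bathroom values (made up; the real table comes from housing_prices).
beds  <- c(2, 2, 3, 3, 3, 4, 4, 4)
baths <- c(1, 1, 1, 2, 2, 2, 2, 3)
toy_table <- table(bedrooms = beds, bathrooms = baths)

# Row proportions: within each bedroom count, the share of each bathroom count.
row_share <- prop.table(toy_table, margin = 1)
print(round(row_share, 2))
```

Using `margin = 2` instead would give column proportions, that is, the share of each bedroom count within every bathroom count.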

Down below is an extended version of the table command, adding a 3rd dimension to 2-way table. We can see the same information as above but for every kind of view (0,1,2,3,4).

In [None]:
bed_bath_view <- xtabs(~bedrooms+bathrooms+view, data=housing_prices)
bed_bath_view

Let's draw bar plots from the table commands. Essentially, a bar plot displays the counts that a table command computes.

In [None]:
par(mfrow=c(1,2))
barplot(margin.table(bed_vs_bath,1))
barplot(margin.table(bed_vs_bath,2))

In [None]:
# install.packages("pastecs",repo="https://cran.cnr.berkeley.edu/")
library(pastecs)

In [None]:
# The stat.desc() function gives detailed descriptive statistics for the input object. Most of the statistics are well known and
# commonly used.
options(scipen=999)
stat.desc(housing_prices)

From the table above, we can see there are 13 rows of data where the number of bedrooms is 0. Most of these rows also have 0 bathrooms. Let's take a look at these rows.

In [None]:
housing_prices[housing_prices$bedrooms==0,]

The rows of data above appear to be outliers; there are no bedrooms for 13 rows and no bathrooms for some of them. Also, two rows have a price greater than `$1` million. We can go ahead and delete them from the dataset.

In [None]:
housing_prices=housing_prices[!housing_prices$bedrooms %in% c(0),]

Let's see how sqft_living varies using a five number summary function. We will also draw a boxplot of sqft_living for each view value to show how boxplots display five number summaries.
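As a small aside on a toy vector (not the housing data), the sketch below compares `fivenum()` with the five values that a boxplot actually draws, which base R exposes through `boxplot.stats()`. When there are no outliers, the two agree.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)   # toy vector with no outliers

five <- fivenum(x)                  # minimum, lower hinge, median, upper hinge, maximum
box  <- boxplot.stats(x)$stats      # lower whisker, lower hinge, median, upper hinge, upper whisker

print(five)
print(box)
```

When outliers are present, the whiskers stop at the most extreme points within 1.5 times the hinge spread, so the two ends of a boxplot can differ from the minimum and maximum that `fivenum()` reports.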

In [None]:
# Boxplot elements. Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum)
fivenum(housing_prices$sqft_living)

library(ggplot2)
ggplot(housing_prices, aes(factor(view), sqft_living)) + geom_boxplot()

In [None]:
which.max(housing_prices$price) # Returns the index of the (first) maximum of a numeric vector

# which.min() works the same way for the minimum


In [None]:
# Mode by frequencies:
# We have seen the table command before. Here, table() first counts how often each zip code occurs. sort() orders
# counts in ascending order, so negating the table with '-' puts the most frequent zip code first. We are interested
# in the zip codes rather than their counts, so names() returns the zip codes ordered from most to least frequent;
# the first one is the mode.
names(sort(-table(housing_prices$zipcode)))

In [None]:
# tapply()  Descriptive statistics by forming groups of data 
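# Store the result under a descriptive name; calling it 'mean' would mask the mean() function
mean_price_by_bedrooms <- t(tapply(housing_prices$price, housing_prices$bedrooms, mean))
mean_price_by_bedrooms
# The same pattern works for other statistics, e.g. the standard deviation:
# sd_price_by_bedrooms <- tapply(housing_prices$price, housing_prices$bedrooms, sd)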

The table above gives the average price of houses for each number of bedrooms. The last two columns, with 11 and 33 bedrooms, seem like outliers: their prices are very low compared to the mean prices of 4 and 5 bedroom houses. We should look in more depth at 9 and 10 bedroom houses too.

In [None]:
housing_prices[housing_prices$bedrooms>=9,]

The prices of 9 and 10 bedroom houses look reasonable, but the view and grade variables cast doubt on these records. Below is the definition of every grade.


Grade represents the construction quality of improvements. Grades run from grade 1 to 13 and are defined as:

* 1-3 Falls short of minimum building standards. Normally cabin or inferior structure.

* 4 Generally older, low quality construction. Does not meet code.

* 5 Low construction costs and workmanship. Small, simple design.

* 6 Lowest grade currently meeting building code. Low quality materials and simple designs.

* 7 Average grade of construction and design. Commonly seen in plats and older sub-divisions.

* 8 Just above average in construction and design. Usually better materials in both the exterior and interior finish work.

* 9 Better architectural design with extra interior and exterior design and quality.

* 10 Homes of this quality generally have high quality features. Finish work is better and more design quality is seen in the floor plans. Generally have a larger square footage.

* 11 Custom design and higher quality finish work with added amenities of solid woods, bathroom fixtures and more luxurious options.

* 12 Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.

* 13 Generally custom designed and built. Mansion level. Large amount of highest quality cabinet work, wood trim, marble, entry ways etc. 

Definitions taken from http://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r


**Most of these houses were built a long time ago and have not been renovated. Maybe that's the reason for their poor grades.**

In [None]:
# aggregate() works much like GROUP BY in SQL. Here we group the data by bedrooms. We are interested in the columns price,
# bathrooms and sqft_living, and apply the mean function to them within every group (i.e. every number of bedrooms).

aggregate(housing_prices[c("price","bathrooms","sqft_living")],by=list(bedrooms=housing_prices$bedrooms), mean)

In [None]:
#Below we are trying to aggregate data for price to show how bathrooms, bedrooms, sqft_living and view will help determine 
#the price.

price_analysis <- aggregate(housing_prices[c("price")],by=list(bedrooms=housing_prices$bedrooms, 
                            bathrooms=housing_prices$bathrooms, sqft_living=housing_prices$sqft_living, 
                            view=housing_prices$view), mean)
price_analysis <- price_analysis[order(price_analysis$price),]
head(price_analysis)

In [None]:
hist(housing_prices$price[housing_prices$view==0], breaks="FD", main="price vs view", xlab="price",col="red")
hist(housing_prices$price[housing_prices$view==1], breaks="FD", main="price vs view", xlab="price",col="blue")
hist(housing_prices$price[housing_prices$view==2], breaks="FD", main="price vs view", xlab="price",col="red")
hist(housing_prices$price[housing_prices$view==3], breaks="FD", main="price vs view", xlab="price",col="blue")
hist(housing_prices$price[housing_prices$view==4], breaks="FD", main="price vs view", xlab="price",col="red")