**Install packages**

In [None]:
install.packages("UsingR")

**Load essential libraries**

In [None]:
library(ggplot2) # library for plotting
library(dplyr) # library for data wrangling
library(UsingR)

**Data Matrix**

Patient dataset corresponding to 4 patients and 3 features:

![Patient dataset](https://bl3302files.storage.live.com/y4mlspYO-L_1kEGpBOCUilkrcj3evQtgjGXDt6v2NgJwtsJf2OZVnwRnUht7CmW_wk8VMlMyGfhDqgRubB3pLHXAOe3r-pQ5wtYUuOqR_gsZzHWCqE2IEbhBjUZob5suLplmONyMsAjr1twDPK7eGODrKyav1dP1aX3lWx1YV0hiLvuTEZ7-GujIypTMkaSV2or?width=256&height=153&cropmode=none)

In [None]:
# Create dataframe with 3 columns
HR = c(76, 74, 72, 78)
BP = c(126, 120, 118, 136)
Temp = c(38, 38, 37.5, 37)
pData = data.frame(HR, BP, Temp)
print(pData)
cat(sprintf('---------------------\n'))

# Convert dataframe to matrix
P = as.matrix(pData)
print(dim(P))
cat(sprintf('---------------------\n'))
print(P)

**Vectors from the data matrix**

![Patient dataset](https://bl3302files.storage.live.com/y4mTMCQdiTnIFj1IALg09CRz7pPWl0g4HpigAPbwyMmF0QNliGAgK3aEsBESo0BNFCy-0-kR6pllskO1DPVt2-76bYsQaACRWhkOebqJ545BbtWcGr1CJG72BZJPrYbQDWNAC0h1EHhpewBlORT_xtahEu-bite73OVi-4CzGeQf6GDw11H6kn72VocdC2bLAsJ?width=256&height=167&cropmode=none)

1st feature vector (heart rate) for all patients:
$$p_1 = \begin{bmatrix}76\\74\\72\\78\end{bmatrix}$$

1st patient vector for all features:
$$p^{(1)} = \begin{bmatrix}76\\126\\38\end{bmatrix}$$

In [None]:
# Vector for 1st feature (HR)
p_1 = P[, 1]
print(p_1)
cat(sprintf('\n'))

# Vector for 2nd feature (BP)
p_2 =P[, 2]
print(p_2)
cat(sprintf('\n'))

# Vector for 1st patient
p1 = P[1, ]
print(p1)
cat(sprintf('\n'))

# Vector for 2nd patient
p2 = P[2, ]
print(p2)

In [None]:
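# Both objects are plain numeric vectors, not matrices:
# dim() returns NULL for them, while length() gives the number of components
str(p_1)
str(p1)
dim(p_1)
dim(p1)
length(p_1)
length(p1)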

**Components of a vector and matrix**

The components of a vector $a$ are denoted as $a_1,a_2,\ldots.$

The component in the $i$th row and $j$th column of a matrix $P$ is represented as $p_{ij}.$

2nd patient, 1st feature (heart rate) value is $p^{(2)}_1.$

1st feature (heart rate), 2nd patient value is $\left(p_1\right)_2.$




In [None]:
# 2nd patient vector
p2 = P[2, ]

# 1st feature (heart rate) vector
p_1 = P[, 1]

# 2nd patient, 1st feature (heart rate) value
print(p2[1])
cat(sprintf('\n'))

# 1st feature (heart rate), 2nd patient value
print(p_1[2])
cat(sprintf('\n'))

# Directly from the data matrix
print(P[2,1])

**Two ways of looking at a vector**:

$$\underbrace{\begin{bmatrix}76\\74\\72\\78\end{bmatrix}}_{\text{a 4 component vector (or) a }4\times1\text{ matrix}}$$

In [None]:
# 1st feature (heart rate) values
p_1 = P[, 1]
str(p_1)
str(as.matrix(p_1))

**Some commonly used vectors and matrices**

<p class="fragment roll-in">$$\underbrace{\begin{bmatrix}0\\0\\0\end{bmatrix}}_{\pmb{0}:\,\text{Zero vector}}\quad\underbrace{\begin{bmatrix}0\\0\\0\\0\\0\end{bmatrix}}_{\pmb{0}:\,\text{Zero vector}}\quad \underbrace{\begin{bmatrix}1\\0\\0\end{bmatrix}}_{e_1:\,\text{Unit vector}}\quad \underbrace{\begin{bmatrix}0\\1\\0\\0\end{bmatrix}}_{e_2:\,\text{Unit vector}}\quad \underbrace{\begin{bmatrix}1\\1\\1\\\vdots\\1\\1\end{bmatrix}}_{\pmb{1}:\,\text{ones vector}}$$</p>

<p class="fragment roll-in">$$\underbrace{\begin{bmatrix}0&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&0\end{bmatrix}}_{\pmb{0}_{m\times n}:\,\text{Zero matrix}}\quad\underbrace{\begin{bmatrix}1 & 0 & \dots & 0 \\0 & 1 & \dots & 0 \\    \vdots & \vdots & \ddots & \vdots \\0 & 0 & \dots & 1\end{bmatrix}}_{\pmb{I}:\,\text{Identity matrix}}\quad\underbrace{\begin{bmatrix}\alpha & 0 & \dots & 0 \\0 & \alpha & \dots & 0 \\    \vdots & \vdots & \ddots & \vdots \\0 & 0 & \dots & \alpha\end{bmatrix}}_{\pmb{D}:\,\text{Diagonal matrix}}$$</p>

In [None]:
# Identity matrix (diag(4) builds the 4 x 4 identity)
I = diag(4)
print(I)
cat(sprintf('\n'))

# Unit vector
e_1 = I[, 1]
print(e_1)
e_3 = I[, 3]
print(e_3)
cat(sprintf('\n'))

# Zero vector
z = replicate(4, 0)
print(z)
cat(sprintf('\n'))

# Ones vector
o = replicate(4, 1)
print(o)

**Addition and subtraction of vectors, scalar multiplication (apply operation componentwise)**

![Vector addition](https://bl3302files.storage.live.com/y4mMlnDRWzIoNKWynOZFhzhFNDlReoFxf7XwSeFwNWW8f1lu5ssj_SvgMAEN9BWiQ2F-meER7rD2an2n2tfDoWffBHE8aD_WBsL0LAbHxnIpZtZu6hNJAvZ88m746S_ktA9-h-oo108AQjkXQHkYrgJ5AUCpvKB2dipeNG1VfIK_38Q8fsq6OKD43adplgy0H1k?width=200&height=80&cropmode=none)

![Vector subtraction](https://bl3302files.storage.live.com/y4mnQkNUONVVKJJ6dCEqV9lEuP360lE0yRumSIgl9LaQH_qBqjgI9wvUd64xJ-UNIjR7wJXZyaXZ_kf1_gAB9sXjMWaMxWhSnX6zcyvVtTrCDeO1MNWzj3A1YqI5YLALK-CGCSMurNV938QLH3C2u1-BE8_addFYSeO7DmCKz5TdWGf7qtC8M9rRN26RMqpk8iu?width=200&height=80&cropmode=none)

![Scalar-vector multiplication](https://bl3302files.storage.live.com/y4mYNwLMmuKRl3sNDSo0yyXYs0KFw1LBnQCU6nAgSawanlGNgLq7Bd93DQ0ojamRpGLx_PZvnsSG-6K-3TsdDctw5sm-QxnWUHSTJGalDR4JmUp27_Hf3ESAQukZ1Jk5G16ykO7H3AKmLSQxE4vVIAtMFbCnyxtsQEfpyb_SK5jIjVtjl7yoFcBDzsRDGzo5cZM?width=200&height=80&cropmode=none)

In [None]:
# Vector addition
a = c(0, 7, 3)
b = c(1, 2, 0)
print(a+b)
cat(sprintf('\n'))

# Vector subtraction
a = c(1, 9)
b = c(1, 1)
print(a-b)
cat(sprintf('\n'))

# Scalar-vector multiplication
a = c(1, 9, 6)
alpha = -2
print(alpha * a)

**Dot Product of Vectors**

A scalar obtained by multiplying corresponding elements and summing the products: $$a^{\color{cyan}T}b = {\color{red}{a_1b_1}}+{\color{green}{a_2b_2}}+\cdots+{\color{magenta}{a_nb_n}}$$

The <font color="cyan">symbol</font> ${\color{cyan}T}$ is a notation that will be explained soon.

![Dot product](https://bl3302files.storage.live.com/y4mMdtIf3y5snN3f1UUvREiqT8_k5jmBTkmmHAcyh73wrou4WczQvmbtUg-W43NEg3wcUsuNBYWnGzDk9r6zvBfUTa1u7-8qMkQoQ_eK5tCU1Y7MBUcARPvaXW_lFqO48SOaOrwvys388KdjnTBMzze_ed2nPrCniHq5Kx-pqEABRNMVs7HU98UuAPSMwur4iQ8?width=400&height=100&cropmode=none)

In [None]:
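a = c(-1, 2, 2)
b = c(1, 0, -3)
# Dot product via matrix multiplication (returned as a 1 x 1 matrix)
print(a %*% b)
# Elementwise products a_i * b_i
print(a*b)
# Sum of the elementwise products gives the same dot product as a plain number
print(sum(a*b))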

**Some Useful Dot products**
					<ul>
						<li class="fragment roll-in"><p>$e_i^Ta = {\color{magenta}{0\times a_1}}+\cdots+{\color{cyan}{1\times a_i}}+\cdots+{\color{magenta}{0\times a_n}} = {\color{cyan}{a_i}}$</p> dot product with unit vector picks $i$th element</li>	
						<li class="fragment roll-in"><p>$\pmb{1}^Ta = 1\times a_1+1\times a_2+\cdots+1\times a_n = a_1+a_2+\cdots+a_n$</p> dot product with ones vector gives the sum of the elements.</li>	
						<li class="fragment roll-in"><p>$(\pmb{1}/n)^Ta = (1/n)\times a_1+(1/n)\times a_2+\cdots+(1/n)\times a_n$</p><p> $= (a_1+a_2+\cdots+a_n)/n$</p> dot product with $(1/n)$-scaled ones vector gives the average of the elements of the vector denoted as ${\color{green}{\text{avg}(a)}}$.</li>
						<li class="fragment roll-in"><p>$a^Ta = a_1\times a_1+a_2\times a_2+\cdots+a_n\times a_n = a_1^2+a_2^2+\cdots+a_n^2$</p> dot product with itself gives the sum of the squares of the elements.</li>
					</ul>

In [None]:
# Unit vectors
e_1 = c(1, 0, 0, 0)
e_3 = c(0, 0, 1, 0)

# Ones vector
o = replicate(4, 1)

# Vector for 1st feature (heart rate)
p_1 = P[, 1]
print(p_1)
cat(sprintf('\n'))

# Get 1st component of vector (heart rate for 1st patient)
print(e_1 %*% p_1)
cat(sprintf('\n'))

# Get 3rd component of vector (heart rate for 3rd patient)
print(e_3 %*% p_1)
cat(sprintf('\n'))

# Get sum of 1st feature values (heart rate) for all patients
print(o %*% p_1)
print(sum(p_1))
cat(sprintf('\n'))

# Average of 1st feature values (heart rate)
n = length(p_1)
print(((1/n)*o) %*% p_1)
print(mean(p_1))

**Norm of a vector**

A scalar representing how <font color="cyan">big a vector</font> is:$$\lVert a\rVert =\sqrt{a_1^2+a_2^2+\cdots+a_n^2} = \sqrt{a^\mathrm{T}a}$$				
<p class="fragment">The <font color="cyan">symbol</font> ${\color{cyan}{\lVert\,\rVert}}$ represents the norm of a vector.</p>

![Norm of a vector](https://bl3302files.storage.live.com/y4m8A4FYuLT9fy5RKHMEsnS-vKtnF2AHO9UTNerw_A84S_kM8U2FkhLb1-9O-_hN_aI0WWflvQS0kXTSG_K06nDWj9pzLnFw1S0hSrmKQw_dJdxW5r2k5OlqhqmeFoKKZDdivpoudbm6my5YZHU5RKMWBB39Fu5EEVQ7hfqjRuBhqoYkSmP0fbRLJY-XKsMCJRj?width=256&height=62&cropmode=none)

![Geometry of vector](https://bl3302files.storage.live.com/y4mTwTyJtDFs8AJR9CbNJCrCPk3413w4UWANehYJyao_43H2CyM90kdttDdDqiURZjtw57BgL34bmpsrrazm7r-Re31BHxH04LEeXTU85-TLXGjotJGGoqzCH-J_nXhARQdoHQ5J450Rw-fq30GMbCjeDWz9kf12a7cmjU73fr6gfROSrsAgI7GMHlusSxljDFe?width=360&height=100&cropmode=none)

In [None]:
# Norm of the 1st feature (heart rate) vector
p_1 = P[, 1]
print(p_1)
print(sqrt(sum(p_1*p_1)))
print(sqrt(p_1 %*% p_1))
print(norm(p_1, type = '2'))

**Root Mean Square (RMS) value of a vector**

A scalar representing the <font color="cyan">typical absolute value</font> of an element: $$\text{rms}(a)=\sqrt{\frac{a_1^2+a_2^2+\cdots+a_n^2}{n}} = \frac{\lVert a\rVert}{\sqrt{n}}$$
<p class="fragment">The <font color = "cyan">rms value</font> of the vector $\begin{bmatrix}1\\-1\\-1\\1\end{bmatrix}$ is ${\color{cyan}1}$ whereas the <font color="green">avg($a$)</font> is ${\color{green}0}$.</p>	
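
The example above can be checked numerically. This is a minimal sketch; the helper name `rms` is introduced here for illustration and is not a base R function.

```r
# Hypothetical helper (not in base R): RMS value of a numeric vector
rms = function(a) sqrt(sum(a^2) / length(a))

a = c(1, -1, -1, 1)
rms_a = rms(a)   # square root of (1 + 1 + 1 + 1) / 4
avg_a = mean(a)  # (1 - 1 - 1 + 1) / 4
print(rms_a)
print(avg_a)
```

The positive and negative entries cancel in the average but not in the RMS value, because squaring removes the signs.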

In [None]:
# RMS value of the 1st feature (heart rate) vector
n = length(p_1)
print((1/sqrt(n) * sqrt(p_1 %*% p_1)))

# Compare with the average of the 1st feature (heart rate) vector
print(mean(p_1))

**Deviation from the average**

The vector $a-\text{avg}(a)\mathbf{1}$ is called the de-meaned version of $a.$

![Demeaned vector](https://bl3302files.storage.live.com/y4mRZeCpRTmxCVjlTBh63SSC-9ykIfmf-ZvIaXrp6prd1XpSPT0RhhaQ1j0vFp9lpKPaZtL8S1CSa64Bsw34x2_ncg1dAwBjR5lt6J8qOVqORlZIOw4gtOu0IIRHVbP3Zy4fZDeIzn8mgHFCc5y75W4PDkq0mhs-6VTSULUZQSefLhbTlv04pY_nAjg-rfyQ5jl?width=400&height=200&cropmode=none)

In [None]:
print(p_1-mean(p_1))

**Standard deviation of a vector**

A measure of how much the <font color="cyan">elements of a vector typically deviate from their average value</font>
<p class="fragment">$$\text{std}(x) = \sqrt{\frac{\left[x_1-\text{avg}(x)\right]^2+\cdots+\left[x_n-\text{avg}(x)\right]^2}{n}} = \frac{1}{\sqrt{n}}\sqrt{(x-\text{avg}(x)\mathbf{1})^\mathrm{T}(x-\text{avg}(x)\mathbf{1})}$$</p>
<ul>
<li class="fragment roll-in">It can be shown that standard deviation of a vector $x$ is the RMS value of the <font color ="magenta">de-meaned</font> vector ${\color{magenta}{x-\text{avg}(x)\pmb{1}}}$ represented as ${\color{magenta}{\tilde{x}}}$</li>		
<li class="fragment roll-in">Standard deviation of a vector is small when its entries are nearly the same</li>	
<li class="fragment roll-in">For any vector $x$, ${\color{red}{\text{rms}(x)^2}} = {\color{green}{\text{avg}(x)^2}}+{\color{yellow}{\text{std}(x)^2}}$</li>
</ul>	
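
The identity in the last bullet can be verified numerically. The sketch below uses the divide-by-$n$ definition of the standard deviation given above; the helper names `rms` and `std` are assumptions introduced for illustration and are not base R functions.

```r
# Hypothetical helpers (not in base R), using the divide-by-n definitions above
rms = function(x) sqrt(mean(x^2))
std = function(x) rms(x - mean(x))  # RMS value of the de-meaned vector

x = c(76, 74, 72, 78)  # heart rates from the patient data matrix
lhs = rms(x)^2
rhs = mean(x)^2 + std(x)^2
print(c(lhs, rhs))
print(all.equal(lhs, rhs))

# sd() divides by n - 1, so it differs from std() by a factor sqrt(n / (n - 1))
n = length(x)
print(all.equal(sd(x), std(x) * sqrt(n / (n - 1))))
```

Note that the identity holds for the divide-by-$n$ definition; it does not hold exactly if `sd()` is substituted for `std()`.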

In [None]:
# Standard deviation of the 1st feature (heart rate) vector
# Note: sd() divides by n-1 (sample standard deviation), whereas the formula above divides by n
n = length(p_1)
print(mean(p_1))
print((1/sqrt(n-1)) * sqrt((p_1-mean(p_1)) %*% (p_1-mean(p_1))))
print(sd(p_1))

**Standardization of a vector**	

Standardized version of vector $x$ is a vector $z$ with mean zero and standard deviation 1
<p class="fragment">$$z = \frac{x-\text{avg}(x)\pmb{1}}{\text{std}(x)}$$</p>
<ul>
<li class="fragment roll-in">Elements of $z$ are called $z$-scores associated with the elements of the original vector $x$</li>		
<li class="fragment roll-in">For example, if a vector $x$ gives the heart rates of $4$ patients admitted to a hospital, the $z$-scores tell us how high or low each patient’s heart rate is relative to the group average, measured in standard deviations.</li>
</ul>

In [None]:
# Standardized version of 1st feature (heart rate) vector
z = (p_1-mean(p_1)) / sd(p_1)
print(p_1)
print(z)

**Component plots of a vector, its de-meaned version, and its standardized version**

In [None]:
# Generate a 10-vector representing the heart rates of 10 random patients
samplesize = 10
h = runif(samplesize, min = 50, max = 110)
h

# De-meaned heart rate
h_demeaned = h-mean(h)

# Standardized heart rate
h_standardized = h_demeaned / sd(h)

# Component plot of heart rate vector
df = as.data.frame(cbind(c(1:samplesize), h, h_demeaned, h_standardized))
colnames(df) = c('PatientNumber', 'HeartRate', 'HeartRateDemeaned', 'HeartRateStandardized')
p1 = ggplot(data = df) +
  geom_point(size = 2, color = 'blue', aes(x = PatientNumber, y = HeartRate)) +
  geom_hline(yintercept = mean(h), linetype = "dashed", color = "blue") +
  geom_point(size = 2, color = 'green', aes(x = PatientNumber, y = HeartRateDemeaned)) +
  geom_hline(yintercept = mean(h_demeaned), linetype = "dashed", color = "green") +
  geom_point(size = 2, color = 'red', aes(x = PatientNumber, y = HeartRateStandardized)) +
  geom_hline(yintercept = mean(h_standardized), linetype = "dashed", color = "red") +
  scale_x_continuous(breaks = seq(0, samplesize, by = 1)) +
  theme(axis.text = element_text(size = 12),
  axis.text.x = element_text(size = 10),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 10, face = "bold")) +
  labs(x = 'Patient Number',
       y = 'Heart Rate',
       title = 'Component Plot of Heart Rate Vector')
p1

In [None]:
# Highlight samples beyond 1.5 standard deviations from the mean
k = 1.5
p1 = ggplot(data = df) +
  geom_point(size = 2, color = 'red', aes(x = PatientNumber, y = HeartRateStandardized)) +
  geom_hline(yintercept = mean(h_standardized), linetype = "dashed", color = "black") +
  geom_hline(yintercept = mean(h_standardized)+k*sd(h_standardized), linetype = "dashed", color = "black") +
  geom_hline(yintercept = mean(h_standardized)-k*sd(h_standardized), linetype = "dashed", color = "black") +
  scale_x_continuous(breaks = seq(0, samplesize, by = 1)) +
  theme(axis.text = element_text(size = 12),
  axis.text.x = element_text(size = 10),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 10, face = "bold")) +
  labs(x = 'Patient Number',
       y = 'Standardized Heart Rate',
       title = 'Component Plot of Standardized Heart Rate Vector')
p1

**Correlation between two vectors**

The correlation coefficient measures the strength of the linear relationship between two vectors $a$ and $b.$ Let $\tilde{a}$ and $\tilde{b}$ represent their de-meaned versions. The correlation coefficient is defined as:

$$\rho = \frac{\tilde{a}^\mathrm{T}\tilde{b}}{\lVert \tilde{a}\rVert\lVert \tilde{b}\rVert}.$$

This is a value between $-1$ and $1.$
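
As a minimal sketch, the definition can be evaluated step by step and compared with R's built-in `cor()`. The two vectors are the heart rate and blood pressure columns of the patient data; the variable names `a_tilde`, `b_tilde` and `rho` are chosen here for illustration.

```r
a = c(76, 74, 72, 78)      # heart rate (HR) column of the patient data
b = c(126, 120, 118, 136)  # blood pressure (BP) column of the patient data

# De-meaned versions of a and b
a_tilde = a - mean(a)
b_tilde = b - mean(b)

# Correlation coefficient from the definition
rho = sum(a_tilde * b_tilde) / (sqrt(sum(a_tilde^2)) * sqrt(sum(b_tilde^2)))
print(rho)

# Compare with the built-in Pearson correlation
print(all.equal(rho, cor(a, b)))
```

Because both the numerator and the denominator use de-meaned vectors, the result does not depend on whether the standard deviation divides by $n$ or $n-1$.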

In [None]:
# Correlation between 1st feature (heart rate) and 2nd feature (BP) vectors
p_1 = P[, 1]
p_2 = P[, 2]
print(cor(p_1, p_2))

# Why not use the original heart rate and blood pressure vectors?
print((p_1 %*% p_2) / (norm(p_1, type = '2')*norm(p_2, type = '2')))
plot(p_1, p_2)

In [None]:
# Normalization vs. Standardization
p_1 = P[, 1]
print(p_1 / norm(p_1, type = '2')) # normalized version
print((p_1 - mean(p_1))/sd(p_1)) # standardized version

**A sample data matrix**

In [None]:
sData = data.frame("X1" = c(5, 3), "X2" = c(2, 6))
print(sData)
plot(sData$X1, sData$X2)

**Geometric representation of vectors**

In [None]:
p1 = sData %>% ggplot() + 
  geom_segment(aes(x = 0, y = 0, xend = X1[1], yend = X1[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = X2[1], yend = X2[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_text(aes(x = X1[1], y = X1[2]), label = 'x1', hjust = -0.5, vjust = 0, size = 10) +
  geom_text(aes(x = X2[1], y= X2[2]), label = 'x2', hjust = -0.5, vjust = 0.5, size = 10) +
  labs( x = '', y = '', title = 'Vectors') +
  geom_vline(xintercept = 0, linetype = 'dashed') + 
  geom_hline(yintercept = 0, linetype = 'dashed') +
  theme(axis.text.x = element_text(size = 20),
   axis.text.y= element_text(size = 20),
   axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid")) +
   scale_x_continuous(expand = c(0, 0), limits = c(-11, 11)) +
   scale_y_continuous(expand = c(0, 0), limits = c(-11, 11))
p1

In [None]:
X = as.matrix(sData)
# Add new columns with normalized vectors
sData$X1Normalized  = X[,1] / norm(X[,1], type = '2')
sData$X2Normalized  = X[,2] / norm(X[,2], type = '2')
print(sData)

In [None]:
p2 = sData %>% ggplot() + 
  geom_segment(aes(x = 0, y = 0, xend = X1[1], yend = X1[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = X2[1], yend = X2[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_segment(aes(x = 0, y = 0, xend = X1Normalized[1], yend = X1Normalized[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red', linetype = 'dashed') + 
  geom_segment(aes(x = 0, y = 0, xend = X2Normalized[1], yend = X2Normalized[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue', linetype = 'dashed') + 
  geom_text(aes(x = X1[1], y = X1[2]), label = 'x1', hjust = -0.5, vjust = 0, size = 10) +
  geom_text(aes(x = X2[1], y= X2[2]), label = 'x2', hjust = -0.5, vjust = 0.5, size = 10) +
  labs( x = '', y = '', title = 'Vectors') +
  geom_vline(xintercept = 0, linetype = 'dashed') + 
  geom_hline(yintercept = 0, linetype = 'dashed') +
  theme(axis.text.x = element_text(size = 20),
   axis.text.y= element_text(size = 20),
   axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid")) +
   scale_x_continuous(expand = c(0, 0), limits = c(-11, 11)) +
   scale_y_continuous(expand = c(0, 0), limits = c(-11, 11))
p2

In [None]:
# Length (norm) of normalized columns
norm(sData$X1Normalized, type = '2')
norm(sData$X2Normalized, type = '2')

In [None]:
# Add new columns with standardized vectors
sData$X1Standardized  = (X[,1] - mean(X[,1])) / sd(X[,1])
sData$X2Standardized  = (X[,2] - mean(X[,2])) / sd(X[,2])
print(sData)

In [None]:
# Mean and standard deviation of the standardized columns
mean(sData$X1Standardized)
sd(sData$X1Standardized)

In [None]:
p3 = sData %>% ggplot() + 
  geom_segment(aes(x = 0, y = 0, xend = X1[1], yend = X1[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = X2[1], yend = X2[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_segment(aes(x = 0, y = 0, xend = X1Normalized[1], yend = X1Normalized[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red', linetype = 'dashed') + 
  geom_segment(aes(x = 0, y = 0, xend = X2Normalized[1], yend = X2Normalized[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue', linetype = 'dashed') + 
  geom_segment(aes(x = 0, y = 0, xend = X1Standardized[1], yend = X1Standardized[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red', linetype = 1) + 
  geom_segment(aes(x = 0, y = 0, xend = X2Standardized[1], yend = X2Standardized[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue', linetype = 1) +  
  geom_text(aes(x = X1[1], y = X1[2]), label = 'x1', hjust = -0.5, vjust = 0, size = 10) +
  geom_text(aes(x = X2[1], y= X2[2]), label = 'x2', hjust = -0.5, vjust = 0.5, size = 10) +
  labs( x = '', y = '', title = 'Vectors') +
  geom_vline(xintercept = 0, linetype = 'dashed') + 
  geom_hline(yintercept = 0, linetype = 'dashed') +
  theme(axis.text.x = element_text(size = 20),
   axis.text.y= element_text(size = 20),
   axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid")) +
   scale_x_continuous(expand = c(0, 0), limits = c(-11, 11)) +
   scale_y_continuous(expand = c(0, 0), limits = c(-11, 11))
p3

In [None]:
p5 = sData %>% ggplot() + 
  geom_segment(aes(x = 0, y = 0, xend = X1[1], yend = X2[1]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = X1[2], yend = X2[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_text(aes(x = X1[1], y = X2[1]), label = 'x^1', hjust = -0.5, vjust = 0, size = 10) +
  geom_text(aes(x = X1[2], y= X2[2]), label = 'x^2', hjust = -0.5, vjust = 0.5, size = 10) +
  labs( x = '', y = '', title = 'Vectors') +
  geom_vline(xintercept = 0, linetype = 'dashed') + 
  geom_hline(yintercept = 0, linetype = 'dashed') +
  theme(axis.text.x = element_text(size = 20),
   axis.text.y= element_text(size = 20),
   axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid")) +
   scale_x_continuous(expand = c(0, 0), limits = c(-11, 11)) +
   scale_y_continuous(expand = c(0, 0), limits = c(-11, 11))
p5

In [None]:
X = as.matrix(sData)
print(X)

**Projection of vectors and its relationship to dot product**

![Vector projection](https://bl3302files.storage.live.com/y4miuCtKP9ptv6lIB8EqEU_u7cbEydy0UsEgHl4ECni2UVONtvKZgf73pIQ4vuA99ZHP8K96W_1i-QuhSIN12IudLaUTF3_jZzFqVfsaRK7QubMS9p5C1ErN6tB8I_UqQZnSY2JSGnu0IvJQrRcd2rX2Hzngfka3tCqJhbAMdElywcis2gRaoiuEGDVqaXpZYYp?width=256&height=209&cropmode=none)

In [None]:
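# Use only the original columns X1 and X2; X also holds the normalized and standardized columns
x1 = X[1, c('X1', 'X2')]
x2 = X[2, c('X1', 'X2')]
# Scalar projection of x^2 onto x^1
shadowLength = (x2 %*% x1) / norm(x1, type = '2')
shadowLength = as.numeric(shadowLength)
# Vector projection of x^2 onto x^1
unitVector = x1 / norm(x1, type = '2')
X2Projected = shadowLength * unitVector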
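
As a sanity check on the projection, the sketch below repeats the calculation on the two rows of the sample data matrix, written out explicitly, and confirms that the leftover part of $x^2$ is orthogonal to $x^1$. The variable names are chosen here for illustration.

```r
x1 = c(5, 2)  # 1st sample (row) of the sample data matrix
x2 = c(3, 6)  # 2nd sample (row) of the sample data matrix

norm_x1 = sqrt(sum(x1^2))

# Scalar projection (shadow length) of x2 onto x1
shadow = sum(x2 * x1) / norm_x1

# Vector projection: shadow length times the unit vector along x1
x2_proj = shadow * (x1 / norm_x1)
print(shadow)
print(x2_proj)

# The residual x2 - x2_proj should be orthogonal to x1 (dot product near 0)
residual = x2 - x2_proj
print(sum(residual * x1))
```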

In [None]:
p6 = sData %>% ggplot() + 
  geom_segment(aes(x = 0, y = 0, xend = X1[1], yend = X2[1]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = X1[2], yend = X2[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_segment(aes(x = 0, y = 0, xend = X2Projected[1], yend = X2Projected[2]), size = 1,
   arrow = arrow(length = unit(0.5,"cm")), color = 'black', linetype = 'dashed') + 
  geom_text(aes(x = X1[1], y = X2[1]), label = 'x^1', hjust = -0.5, vjust = 0, size = 10) +
  geom_text(aes(x = X1[2], y= X2[2]), label = 'x^2', hjust = -0.5, vjust = 0.5, size = 10) +
  labs( x = '', y = '', title = 'Vectors') +
  geom_vline(xintercept = 0, linetype = 'dashed') + 
  geom_hline(yintercept = 0, linetype = 'dashed') +
  theme(axis.text.x = element_text(size = 20),
   axis.text.y= element_text(size = 20),
   axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid")) +
   scale_x_continuous(expand = c(0, 0), limits = c(-11, 11)) +
   scale_y_continuous(expand = c(0, 0), limits = c(-11, 11))
p6

In [None]:
print(P)
# Project all patients onto the e_1 direction
e_1 = c(1, 0, 0)
print(e_1 %*% P[1, ] / norm(e_1, type = '2'))
print(e_1 %*% P[2, ] / norm(e_1, type = '2'))
print(e_1 %*% P[3, ] / norm(e_1, type = '2'))
print(e_1 %*% P[4, ] / norm(e_1, type = '2'))

# How different (or varying) are the patients w.r.t. the features?
sd(P[, 1])
sd(P[, 2])
sd(P[, 3])

In [None]:
## Load data - refer to http://openmv.net/info/food-texture for data description 
file = 'http://openmv.net/file/food-texture.csv'
foodData = read.csv(file, header = TRUE, row.names = 1)
## Print structure of data frame
str(foodData)

In [None]:
## Print first 5 samples of data frame
head(foodData, n = 5)

In [None]:
## Modify data frame
# Recode the Crispy column as 'high' (Crispy > 11) or 'low' (otherwise)
foodData = foodData %>% mutate(Crispy = ifelse(Crispy > 11, 'high', 'low'))

# Change Crispy column to factor type
foodData['Crispy'] = lapply(foodData['Crispy'], factor)

In [None]:
## Print structure of modified data frame
str(foodData)

In [None]:
## Print first 5 samples of modified data frame
head(foodData, n = 5)

In [None]:
## Scatter plot between Density (x-axis) and Hardness (y-axis)
p1 = ggplot(data = foodData, aes(x = Density, y = Hardness)) +
  geom_point(size = 1) 
p1

In [None]:
## Scatter plot between Density (x-axis) and Oil (y-axis)
p1 = ggplot(data = foodData, aes(x = Density, y = Oil)) +
  geom_point(size = 1) 
p1

In [None]:
## Scatter plot between Density (x-axis) and Oil (y-axis) color coded using Crispy
p2 = ggplot(data = foodData, aes(x = Density, y = Oil, color = factor(Crispy))) +
  geom_point(size = 1) 
p2

In [None]:
## Print sample correlation between Density and Oil
cor(foodData$Density, foodData$Oil, method = 'pearson')

In [None]:
## Print sample correlation between Density and Hardness
cor(foodData$Density, foodData$Hardness, method = 'pearson')

**Question-4**: Select data frame without Crispy column.

In [None]:
## Select data frame without Crispy column
fData = foodData %>% dplyr::select(-c('Crispy'))

In [None]:
str(fData)

In [None]:
# Sample Correlation matrix corresponding to the continuous features 
print(cor(fData))

In [None]:
plot(fData$Density, fData$Oil)

In [None]:
# Sample Covariance matrix corresponding to the continuous features 
print(cov(fData))

In [None]:
# Sample Covariance matrix corresponding to the standardized continuous features 
print(cov(scale(fData)))

In [None]:
# Calculate sample covariance between Density and Oil
((fData$Density - mean(fData$Density)) %*% (fData$Oil - mean(fData$Oil))) / (length(fData$Density)-1)  

# Calculate sample correlation between Density and Oil
(((fData$Density - mean(fData$Density))/sd(fData$Density)) %*% ((fData$Oil - mean(fData$Oil))/sd(fData$Oil))) / (length(fData$Density)-1)  
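
The relationship between the covariance of standardized features and the correlation of the original features can be checked on a small example. This is a sketch on made-up numbers (the data frame `d` is an assumption for illustration, not the food texture data).

```r
# Small synthetic data set; values are assumptions for illustration only
d = data.frame(u = c(1, 2, 4, 7, 9), v = c(2, 1, 5, 6, 10))
n = nrow(d)

# Sample covariance from the definition (divide by n - 1)
cov_uv = sum((d$u - mean(d$u)) * (d$v - mean(d$v))) / (n - 1)
print(all.equal(cov_uv, cov(d$u, d$v)))

# Sample correlation = sample covariance of the standardized columns
z = scale(d)  # centers each column and divides by its sd()
print(all.equal(cov(z), cor(d), check.attributes = FALSE))
```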

In [None]:
# Matrix-vector multiplications
A = matrix(seq(1, 4), 4, 1)
B = matrix(seq(8, 5, -1), 4, 1)
print(dim(A))
print(dim(B))
A[,1] %*% B[,1] #?
t(A[,1]) %*% B[,1] #?
A[,1] %*% t(B[,1]) #?
t(A[,1]) %*% t(B[,1]) #?
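
The four products above behave differently because `%*%` promotes a plain vector to a row or column matrix as needed to make the arguments conformable. A minimal sketch of what to expect, using the same values as plain vectors:

```r
a = c(1, 2, 3, 4)
b = c(8, 7, 6, 5)

# vector %*% vector of the same length: inner (dot) product, returned as a 1 x 1 matrix
inner = a %*% b
print(dim(inner))

# t(a) is a 1 x 4 matrix; b is then treated as a 4 x 1 column, giving a 1 x 1 result
print(dim(t(a) %*% b))

# a is treated as a 4 x 1 column against the 1 x 4 matrix t(b): 4 x 4 outer product
outer_ab = a %*% t(b)
print(dim(outer_ab))

# (1 x 4) %*% (1 x 4) is non-conformable, so R signals an error
res = tryCatch(t(a) %*% t(b), error = function(e) "non-conformable")
print(res)
```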

In [None]:
## Select data frame consisting of two features: Density & Hardness
# dplyr::select is used explicitly because MASS (attached by UsingR) masks select()
fDataTwoFeatures = fData %>% dplyr::select(c('Density', 'Hardness'))

In [None]:
head(fDataTwoFeatures, n = 5)

In [None]:
## Calculate sample correlation matrix of data frame selected above
corMatrix = cor(fDataTwoFeatures)
print(corMatrix)

In [None]:
# Calculate eigenvalues & eigenvectors of sample correlation matrix
e = eigen(corMatrix)
u = e$vectors
lambda = e$values 
print(u)
print(lambda)

In [None]:
## Extract scaled data matrix from data frame
X = scale(fDataTwoFeatures)

In [None]:
## Project samples onto the direction of the first and second eigenvectors

# Calculate shadow length of data
shadowLength1 = X %*% u[, 1]
shadowLength1 = as.numeric(shadowLength1)
shadowLength2 = X %*% u[, 2]
shadowLength2 = as.numeric(shadowLength2)

# Vector projection
projectedSamples1 = u[, 1] %*% t(as.matrix(shadowLength1))
projectedSamples2 = u[, 2] %*% t(as.matrix(shadowLength2))
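
One property worth checking at this point is that the sample variance of the shadow lengths along each eigenvector equals the corresponding eigenvalue of the correlation matrix. The sketch below illustrates this on synthetic data (the generated features `f1` and `f2` are assumptions for illustration, not the food texture data).

```r
set.seed(1)  # synthetic data; the values are assumptions for illustration only
n = 50
f1 = rnorm(n)
f2 = 0.8 * f1 + 0.6 * rnorm(n)
Xs = scale(cbind(f1, f2))  # standardized data matrix

e = eigen(cor(Xs))
u = e$vectors
lambda = e$values

# Shadow lengths (scores) of every sample along each eigenvector
scores = Xs %*% u

# The sample variance of the scores along each eigenvector matches its eigenvalue
print(apply(scores, 2, var))
print(lambda)

# The two eigenvectors are orthonormal: t(u) %*% u is the 2 x 2 identity
print(round(t(u) %*% u, 10))
```

This works because the covariance matrix of the standardized data equals the correlation matrix, so the variance along a unit eigenvector is the matching eigenvalue.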

In [None]:
## Scale data frame and add Crispy column to data frame
fDataTwoFeaturesScaled = as.data.frame(scale(fDataTwoFeatures))
fDataTwoFeaturesScaled['Crispy'] = foodData['Crispy']

In [None]:
## Scatter plot of Density and Hardness, color coded using Crispy and first
## two eigenvectors with the projected data on to the first principal direction
## also color coded using Crispy
u = -u # this is a minor adjustment to flip the sign of the eigenvectors
p3 = fDataTwoFeaturesScaled %>% ggplot(aes(x = Density, y = Hardness, color = factor(Crispy))) +
  geom_point(size = 1) +
  geom_segment(aes(x = 0, y = 0, xend = u[1, 1], yend = u[2, 1]), size = 0.5,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = u[1, 2], yend = u[2, 2]), size = 0.5,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_point(aes(x = projectedSamples1[1, ], y = projectedSamples1[2, ], color = factor(Crispy)), shape = 4, size = 2.0) 
p3

In [None]:
## Scatter plot of Density and Hardness, color coded using Crispy and first
## two eigenvectors with the projected data on to the second principal direction
## also color coded using Crispy
p4 = fDataTwoFeaturesScaled %>% ggplot(aes(x = Density, y = Hardness, color = factor(Crispy))) +
  geom_point(size = 1) +
  geom_segment(aes(x = 0, y = 0, xend = u[1, 1], yend = u[2, 1]), size = 0.5,
   arrow = arrow(length = unit(0.5,"cm")), color = 'red') +
  geom_segment(aes(x = 0, y = 0, xend = u[1, 2], yend = u[2, 2]), size = 0.5,
   arrow = arrow(length = unit(0.5,"cm")), color = 'blue') +
  geom_point(aes(x = projectedSamples2[1, ], y = projectedSamples2[2, ], color = factor(Crispy)), shape = 4, size = 2.0) 
p4

In [None]:
# Father-Son dataset
mu = colMeans(father.son)
mu

In [None]:
ggplot(father.son, aes(x=fheight, y=sheight)) +
 geom_point(size=2, alpha=0.7) + xlab("Height of father") + 
  ylab("Height of son") + ggtitle("Father-son Height Data") +
  stat_ellipse(level=0.68,color="green")+
  stat_ellipse(level=0.95,color="blue")

In [None]:
cov(father.son)
cor(father.son)

In [None]:
# Samples 10 and 20
x10 = father.son[10, ]
x20 = father.son[20, ]
x10
x20
# Euclidean distance between samples 10 and 20
norm(x10-x20, type = '2')
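# Mahalanobis distance between samples 10 and 20
# mahalanobis.dist() comes from the StatMatch package, which is not loaded above;
# install it first with install.packages("StatMatch") if needed
#str(father.son)
M = StatMatch::mahalanobis.dist(father.son, father.son)
str(M)
M[10, 20]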
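
For comparison, the Mahalanobis distance between two samples can also be computed with base R alone. This is a sketch on a small made-up data set (the matrix `D` is an assumption for illustration, not the father-son data); note that base R's `mahalanobis()` returns the squared distance.

```r
# Small synthetic data set; values are assumptions for illustration only
D = cbind(h1 = c(65, 63, 70, 68, 72, 66),
          h2 = c(66, 64, 69, 70, 71, 65))
S = cov(D)  # sample covariance matrix

x = D[1, ]
y = D[3, ]

# Euclidean distance between the two samples
d_euclid = sqrt(sum((x - y)^2))

# Mahalanobis distance from the definition: sqrt((x - y)^T S^{-1} (x - y))
d_mahal = sqrt(as.numeric(t(x - y) %*% solve(S) %*% (x - y)))

# Base R mahalanobis() returns the squared distance, hence the sqrt()
d_mahal_base = sqrt(mahalanobis(x, center = y, cov = S))

print(d_euclid)
print(c(d_mahal, d_mahal_base))
```

Unlike the Euclidean distance, the Mahalanobis distance accounts for the scale of each feature and the correlation between features through the covariance matrix.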