# Week 1: Linear Algebra

In [1]:
#importing libraries
import numpy as np
import numpy.linalg as nla
import scipy.linalg as sla
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

## Day 1: Vectors
* Linear algebra is the branch of mathematics that studies vectors and matrices.
* Vectors can be thought of as lists/arrays of numbers.
* From a more abstract perspective, many objects in mathematics can be seen as "vectors".
* We will only deal with vectors that are "lists/arrays of (real) numbers".

### Creating vectors

In [5]:
# creating vectors with numpy.array

a = np.array([1, 0, 2])

b = np.array([0.5, 1.2, -3.0])

c = np.array([np.sqrt(2), np.sqrt(10), 2**1.5])

O = np.zeros(3)

e = np.ones(3)

#np.full_like() # good if you need to fill a vector with NaN values

array([1., 1., 1.])
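
The commented-out `np.full_like()` hint above can be made concrete with a small sketch. It assumes the goal is a placeholder vector of NaN values; the names `nan_vec`, `nan_like_b` and `nan_like_a` are illustrative only. Note that an integer array cannot hold NaN, so a float `dtype` is requested explicitly in the last case.

```python
import numpy as np

a = np.array([1, 0, 2])            # integer vector
b = np.array([0.5, 1.2, -3.0])     # float vector

# A new vector of length 3 filled with NaN
nan_vec = np.full(3, np.nan)

# Same shape and dtype as b (float), filled with NaN
nan_like_b = np.full_like(b, np.nan)

# a is an integer array, so ask for a float dtype to be able to store NaN
nan_like_a = np.full_like(a, np.nan, dtype=float)

print(nan_vec, nan_like_b, nan_like_a)
```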

In [8]:
# Vector dimensionality with .size and .shape

print(c.size)

print(c.shape)

3
(3,)


### Operations with vectors

In [10]:
# Component-wise operations (addition, subtraction, multiplication, division)
print('a = ', a)

print('b = ', b)

print(a + b)

print(a - b)

print(a / b)

print(a * b)

print(b / a)  # a[1] is 0, so this yields inf and a RuntimeWarning

a =  [1 0 2]
b =  [ 0.5  1.2 -3. ]
[ 1.5  1.2 -1. ]
[ 0.5 -1.2  5. ]
[ 2.          0.         -0.66666667]
[ 0.5  0.  -6. ]
[ 0.5  inf -1.5]




In [12]:
# Scalar multiplication, i.e. scaling
k = -2

print('k*a = ', k*a)

print('a / k = ', a/k)


k*a =  [-2  0 -4]
a / k =  [-0.5 -0.  -1. ]


In [14]:
# Combining multiple operations
a*b - 1.5*c

d = np.array([-1, 2, -3, 4])

a + d  # raises ValueError: a has 3 components, d has 4

ValueError: operands could not be broadcast together with shapes (3,) (4,) 
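
The error arises because component-wise operations need vectors of equal length. The following is a minimal sketch of catching the error and of adding vectors once their lengths agree; `error_message` and `d_short` are names introduced here for illustration.

```python
import numpy as np

a = np.array([1, 0, 2])
d = np.array([-1, 2, -3, 4])

# Adding vectors of different lengths raises ValueError
error_message = None
try:
    a + d
except ValueError as err:
    error_message = str(err)
    print('Cannot add a and d:', error_message)

# Vectors of equal length add component by component
d_short = d[:3]        # keep only the first three components of d
print(a + d_short)
```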

### Norms: Vector Lengths
* **Euclidean**, or **2-norm**: the usual distance in space. Given a vector  $v = (v_1, v_2, \ldots, v_n)$,  its Euclidean norm $\|v\|_2$ is given by: \begin{equation} \|v\|_2 = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2} = \left( \sum_{i=1}^{n} v_i^2 \right) ^ {1/2}\end{equation}

In [16]:
# Calculating norms from scratch
# Euclidean norm (usual length)
def norm_of_vector(vector):
    return np.sqrt((vector**2).sum())
    
# Test:
v = np.array([3, 4])
norm_of_vector(v) # output = 5

5.0

* **Minkowski *p*-norm** is a generalization of the Euclidean norm. In short, substitute the 2's with *p*'s and you get a norm of order *p*. Note that $p \geqslant 1$. Thus: \begin{equation} \|v\|_p = \left( \sum_{i=1}^{n} \left|v_i\right|^p \right) ^ {1/p}\end{equation}

In [27]:
# Minkowski's p-norm
def pnorm_of_vector(vector, p):
    # Sum |v_i|^p over all components, then take the p-th root
    return (np.abs(vector)**p).sum()**(1/p)

# Test d = [-1, 2, -3, 4]
pnorm_of_vector(d, 3)

4.641588833612778

Special cases of the *p*-norms:

* If $p=1$, then $\|v\|_1 = \sum_{i=1}^{n} |v_i|$. This norm is called **taxicab** or **Manhattan norm**

* If $p \to \infty$, then $\|v\|_\infty = \max\big\{ |v_1|, |v_2|, \ldots, |v_n| \big\}$. This norm is called **max** or **Chebyshev norm**

In [26]:
# Using NumPy's numpy.linalg.norm(x, p)
np.linalg.norm(d, 3)



4.641588833612778
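
The special cases listed above can be checked with the same function. This is a small sketch using the test vector `d`; the variable names `manhattan`, `euclidean` and `chebyshev` are illustrative.

```python
import numpy as np

d = np.array([-1, 2, -3, 4])

manhattan = np.linalg.norm(d, 1)        # 1 + 2 + 3 + 4
euclidean = np.linalg.norm(d, 2)        # sqrt(1 + 4 + 9 + 16)
chebyshev = np.linalg.norm(d, np.inf)   # largest absolute component

print(manhattan, euclidean, chebyshev)

# As p grows, the p-norm approaches the max norm
for p in (1, 2, 10, 100):
    print(p, np.linalg.norm(d, p))
```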

In [None]:
# Plotting a 'unit circle' under different norms
# (unit circle = circle centered at the origin (0,0) with radius = 1)
# we will use https://www.desmos.com/calculator to speed up things
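
As an alternative to Desmos, the unit circles can also be drawn with matplotlib. The sketch below assumes one simple approach: take points on the ordinary (Euclidean) circle and rescale each one so that its *p*-norm equals 1. The helper `unit_circle_points` is hypothetical and not part of NumPy.

```python
import numpy as np
import matplotlib.pyplot as plt

def unit_circle_points(p, n=400):
    """Return n points in the plane whose p-norm equals 1 (illustrative helper)."""
    theta = np.linspace(0, 2 * np.pi, n)
    directions = np.column_stack([np.cos(theta), np.sin(theta)])
    # Divide every direction by its own p-norm to land on the unit circle
    norms = np.linalg.norm(directions, ord=p, axis=1)
    return directions / norms[:, None]

fig, ax = plt.subplots(figsize=(5, 5))
for p, label in [(1, 'p = 1'), (2, 'p = 2'), (np.inf, 'p = inf')]:
    pts = unit_circle_points(p)
    ax.plot(pts[:, 0], pts[:, 1], label=label)
ax.set_aspect('equal')
ax.legend()
plt.show()
```

With this construction the curve for p = 1 is a diamond, p = 2 is the usual circle, and p = inf is a square.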



* Unit vectors: vectors whose norm equals 1.
* This is a norm-dependent concept (it depends on which norm we use to measure distances).
* The process of converting a vector to a unit vector is called **normalization**. To normalize a nonzero vector $v$, multiply it by the reciprocal of its norm $\|v\|$. In short:
\begin{equation} v_{\text{unit}} = \frac{1}{\|v\|} \cdot v \end{equation}

In [31]:
# Unit Vectors
print(d)

norm_d = d / np.linalg.norm(d, 2)
print(norm_d)

print('norm_d = ', np.linalg.norm(norm_d, 2))

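# scikit-learn offers sklearn.preprocessing.normalize for the same task
# The printed norm is 0.9999999999999999 rather than exactly 1
# because floating-point arithmetic rounds intermediate results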

[-1  2 -3  4]
[-0.18257419  0.36514837 -0.54772256  0.73029674]
norm_d =  0.9999999999999999


### Multiplying Vectors. Angle between two vectors
* Multiplying vectors can be defined in multiple ways. Here we discuss only the **dot-product** of two vectors. If $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$, then their dot-product is given by:
\begin{equation} a\cdot b = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n \end{equation}
* The result of the dot-product of two vectors is a scalar, i.e. a number.

In [None]:
# Dot-Product of two vectors
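
One possible way to fill in this cell, shown here as a sketch: compute the dot-product from the formula above and compare it with NumPy's `np.dot` and the `@` operator. The vectors are the `a` and `b` defined earlier; the result names are illustrative.

```python
import numpy as np

a = np.array([1, 0, 2])
b = np.array([0.5, 1.2, -3.0])

# Directly from the definition: multiply component-wise, then sum
manual = (a * b).sum()

# Built-in alternatives
with_dot = np.dot(a, b)
with_at = a @ b

print(manual, with_dot, with_at)
```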



* If we "normalize" the dot-product of two vectors by dividing it by the product of the Euclidean norms of the vectors, then the resulting number is the **cosine of the angle** between the vectors. In other words, if $\langle a, b \rangle$ is the angle between the vectors $a$ and $b$, then:
\begin{equation} \cos{\langle a, b \rangle} = \frac{a \cdot b}{\|a\|_2 \cdot \|b\|_2} \end{equation}

In [None]:
# Angle between two vectors
u = np.array([1, 0])
v = np.array([2, 1])
w = np.array([1, -1])
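
A possible completion of this cell, as a sketch: apply the cosine formula above and convert the result to an angle with `np.arccos`. The helper `angle_between` is a name introduced here, and the `np.clip` call is an added safeguard against rounding pushing the cosine slightly outside [-1, 1].

```python
import numpy as np

def angle_between(a, b):
    """Angle between two nonzero vectors, in radians (illustrative helper)."""
    cos_angle = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Guard against tiny rounding errors before taking the arccosine
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

u = np.array([1, 0])
v = np.array([2, 1])
w = np.array([1, -1])

print('u, v:', np.degrees(angle_between(u, v)))
print('u, w:', np.degrees(angle_between(u, w)))
print('v, w:', np.degrees(angle_between(v, w)))
```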



### Similarity between two vectors
The cosine of the angle between two vectors can be used as a measure of **similarity** or **concordance**.
* $\cos{\langle u, v \rangle} \approx 1 \, \Rightarrow \,  \langle u, v \rangle \approx 0$  (vectors point in roughly the same direction)
* $\cos{\langle u, v \rangle} \approx 0 \, \Rightarrow \,  \langle u, v \rangle \approx 90^\circ = \frac{\pi}{2}\text{ rad}$ (vectors are close to perpendicular)
* $\cos{\langle u, v \rangle} \approx -1 \, \Rightarrow \,  \langle u, v \rangle \approx 180^\circ = \pi \text{ rad}$ (vectors point in almost opposite directions)

In [None]:
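# Cosine similarity between two vectors

def cos_sim(v, w):
    # Dot-product divided by the product of the Euclidean norms
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Test
x = np.array([1, 1, 1, 1])
y = np.array([1.0, 0.5, 0.1, 2.4])
cos_sim(x, y)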

In [None]:
# Cosine similarity using scikit-learn (generates similarity matrix)
from sklearn.metrics.pairwise import cosine_similarity

z = np.array([0.5, 0.5, 0.5, 0.4])

# Create a dataframe of vectors x, y and z
data = {'x' : x, 'y' : y, 'z' : z}
df = pd.DataFrame(data)

# cosine_similarity compares rows, so transpose df to make x, y and z the rows
cosine_similarity(df.T)

In [None]:
# Example of cosine similarity application: text analysis
# see: https://www.machinelearningplus.com/nlp/cosine-similarity/ for complete details

# Step 1: Generate the documents
doc_trump = 'Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin'
doc_election = 'President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election'
doc_putin = 'Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career'
documents = [doc_trump, doc_election, doc_putin]


# Step 2: Create the document matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

doc_term_matrix = sparse_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(doc_term_matrix, 
                  columns=count_vectorizer.get_feature_names_out(), 
                  index=['doc_trump', 'doc_election', 'doc_putin'])


# Step 3: Calculate the cosine similarity for the dataframe df
c_sim = cosine_similarity(df)
print(c_sim)

df_sim = pd.DataFrame(c_sim,
                      columns = ['doc_trump', 'doc_election', 'doc_putin'])


plt.figure()
sns.heatmap(df_sim,
            annot=True,
            fmt='0.3f',
            cmap='RdYlGn',
            yticklabels=df_sim.columns
           )
plt.yticks(rotation=0)
plt.show()

## Practice Assignment: the *iris* dataset
The iris dataset contains data about the length and width of sepals and petals of three varieties of iris flowers (*setosa*, *versicolor* and *virginica*). This is a frequently used dataset in statistics and machine learning. Your task:
* Load the dataset as a Pandas DataFrame
* For every variety of iris flowers, construct the similarity matrix (e.g. a similarity matrix for the *setosa* variety, for the four vectors: sepal_length, sepal_width, petal_length, petal_width)

In [None]:
# Loading the iris dataset
df_iris = pd.read_csv('iris.csv')

