# Introduction to Data Science & Linear Algebra in Python

This lab introduces some of the most commonly used data science packages in Python, focusing on Pandas and NumPy. Because much of data science is rooted in linear algebra, we will also review matrix operations using NumPy. A working understanding of matrices is essential in this course, as many of the algorithms we cover rely on matrix operations.

In [None]:
#%pip install numpy
import numpy as np
from numpy.linalg import eig # https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html
#%pip install pandas
import pandas as pd
#%pip install scikit-learn==0.23
from sklearn.datasets import load_iris # used to obtain the pre-loaded data
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
#%pip install matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('classic')
#%pip install scipy
from scipy.linalg import svd

### Pandas

Pandas is a powerful data analysis and manipulation tool. When working with Pandas, we will typically be working with what is called a 'DataFrame,' which we can think of as a two-dimensional structure for storing tabular data. 

To showcase some of its functionality, we will load a dataset using another package, scikit-learn, which we will cover in detail in the next lab. 

In [None]:
iris_data = load_iris() # load the data
print(iris_data.keys()) # what does our data object contain?

In [None]:
# can use 'DESCR' to get more details about our data
print(iris_data.DESCR)

In [None]:
# transform data into a pandas DataFrame using the 'data' and 'feature_names' keys
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df.head() # preview the first 5 rows

In [None]:
# how many rows/columns does our data have? in other words, what is its shape?

iris_df.shape # number of rows, number of columns

Next we will add the target variable, the species, to our DataFrame. To do so, we will convert it to a **Series** object: a one-dimensional labeled array that can hold data of any type and consists of an index and values.

In [None]:
pd.Series(iris_data.target)

In [None]:
# add our target variable, the species to our DataFrame as a series object
iris_df['species'] = pd.Series(iris_data.target)
iris_df.head()
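To make the "index and values" idea concrete, here is a minimal sketch using a small made-up Series. The labels and numbers are illustrative assumptions, not taken from the iris data.

```python
import pandas as pd

# hypothetical toy data: three measurements with custom index labels
toy_series = pd.Series([5.1, 4.9, 4.7], index=["a", "b", "c"])

print(toy_series.index.tolist())  # the index labels
print(toy_series.values)          # the underlying values as a NumPy array
print(toy_series["b"])            # look up a value by its index label
```

When no index is given, as in the cell above, pandas assigns the default integer index 0, 1, 2, and so on.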

But what if we want our target variable to be the actual species name? We can use the replace method!

In [None]:
# replacing target variable
iris_df['species'] = iris_df['species'].replace(to_replace=[0, 1, 2], value=iris_data.target_names)
iris_df.head()
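An explicit dictionary can make the same kind of relabeling easier to read. The sketch below uses a small made-up Series of codes and assumes the pairing 0 -> setosa, 1 -> versicolor, 2 -> virginica, which is the order of `iris_data.target_names`. It uses the `map` method as an alternative to `replace`, not as the approach taken above.

```python
import pandas as pd

# hypothetical toy codes standing in for iris_data.target
codes = pd.Series([0, 2, 1, 0])

# assumed pairing of integer codes to species names
code_to_name = {0: "setosa", 1: "versicolor", 2: "virginica"}

names = codes.map(code_to_name)  # look up each code in the dictionary
print(names.tolist())
```

One difference worth knowing: with `map`, any code that is missing from the dictionary becomes NaN, whereas `replace` leaves unmatched values unchanged.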

There may be scenarios where we wish to filter to a subset of the data that meets certain conditions. This can be easily done using our pandas DataFrame. For example, what if we only want the rows where the species is 'setosa'?

In [None]:
# filtering data
setosa_data = iris_df[iris_df['species'] == 'setosa']
setosa_data.head()

We can also specify multiple conditions to filter on as such:

In [None]:
# use & when we want both conditions to be met
setosa_data_long_sepals = iris_df[(iris_df.species == 'setosa') & (iris_df['sepal length (cm)'] > 5.0)]
setosa_data_long_sepals.sample(5)

In [None]:
### YOUR CODE: in the code above, replace '&' with '|' - can you tell the difference?

We may also wish to sort our data by one of its columns. This can be done with the DataFrame's sort_values method, to which we pass the column to sort on and, optionally, whether to sort the values from low to high or high to low.

In [None]:
# sorting data, default is to sort values from low to high 
iris_df.sort_values('petal width (cm)').head(10)

In [None]:
### YOUR CODE: add the argument 'ascending=False' to sort_values - what is the effect? 

When exploring our data, we usually want to get an understanding of the distribution of the values of our data. This can be done by applying the describe method to our data.

In [None]:
# use 'describe' to get a better understanding of our features
iris_df.describe(include="all")

Next, we can use the nunique and unique methods to understand how many unique values exist in the column and what they are - this is especially useful for categorical data such as our 'species' column.

In [None]:
# how many different species? what are they?
print("Number of unique species:", iris_df.species.nunique())
print("Species:", iris_df.species.unique())

We can also use the value_counts method to compute how many samples of each species we have in our data.

In [None]:
# how many of each species is in our data?
iris_df.species.value_counts()

What if we want to compute the mean/median/max/etc. of a column for each category/group? We can use the groupby method as such:

In [None]:
# example; average petal length for each species
iris_df.groupby('species')['petal length (cm)'].mean()

In [None]:
### YOUR CODE: what is the maximum sepal width for each species?

Another useful function is get_dummies, which creates a dummy variable for each unique value/category of a column, with a value of 0 or 1 denoting whether or not the instance belongs to that category. We can also use the drop_first argument to create k-1 dummy variables, where k is the number of unique categories. This is important when we need our columns to be linearly independent, since the k dummy columns always sum to 1 in every row. We can apply this function to our species column as such:

In [None]:
dummy_data = pd.get_dummies(iris_df, columns = ["species"]) 
dummy_data

In [None]:
### YOUR CODE: add the argument 'drop_first = True' to the code above - does it behave as expected?
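The linear independence point can be checked numerically. The sketch below uses a made-up `color` column (an assumption for illustration, not the iris data) and adds a column of ones, as a linear model with an intercept would. It then compares the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

# hypothetical toy column with k = 3 categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

full = pd.get_dummies(toy, columns=["color"], dtype=float)  # k columns
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=float)  # k-1 columns

# add an intercept column of ones, as a linear model would
ones = np.ones((len(toy), 1))
full_design = np.hstack([ones, full.to_numpy()])
reduced_design = np.hstack([ones, reduced.to_numpy()])

# rank smaller than the number of columns means the columns are linearly dependent
print(full_design.shape[1], np.linalg.matrix_rank(full_design))
print(reduced_design.shape[1], np.linalg.matrix_rank(reduced_design))
```

With all k dummy columns, the intercept column equals their sum, so one column is redundant. Dropping the first dummy removes that redundancy.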

We may then choose to join these new columns to our original data. This can be done using the merge method. In the example below, we are joining the DataFrames on their indexes and are using an inner join meaning that we will only keep instances that match/exist in both dataframes.

In [None]:
merged_data = iris_df.merge(dummy_data[['species_setosa', 'species_versicolor', 'species_virginica']], 
              how="inner", left_index=True, right_index=True)

merged_data.head()

In [None]:
# always a good check to make sure we have the number of rows/columns we expected

merged_data.shape # still have 150 rows, added 2/3 columns depending on which dummy variables were used
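In the merge above, both DataFrames share exactly the same index, so the inner join keeps every row. To see what an inner join does when the indexes only partly overlap, here is a minimal sketch with two made-up frames (the names `left` and `right` and their contents are assumptions for illustration).

```python
import pandas as pd

# hypothetical toy frames whose indexes only partly overlap
left = pd.DataFrame({"x": [10, 20, 30]}, index=[0, 1, 2])
right = pd.DataFrame({"y": ["a", "b", "c"]}, index=[1, 2, 3])

inner = left.merge(right, how="inner", left_index=True, right_index=True)
outer = left.merge(right, how="outer", left_index=True, right_index=True)

print(inner)  # only index labels 1 and 2 appear in both frames
print(outer)  # all four labels, with NaN where a frame had no match
```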

### NumPy

Another popular Python library is NumPy. NumPy is great for working with arrays, as it makes it easy and fast to apply functions to data that is formatted as an array. We can think of an array as a collection of elements that have associated indexes or positions, similar to a list. However, unlike in a list, array elements must all be of the same data type. We will use NumPy below to review matrices and matrix operations.
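The "same data type" requirement can be seen directly. In this small sketch (the variable names are illustrative), a list that mixes integers and a float is converted to an array, and every element is upcast to a single dtype.

```python
import numpy as np

mixed_list = [1, 2.5, 3]         # a list can mix ints and floats
as_array = np.array(mixed_list)  # the array upcasts everything to one dtype

print(as_array, as_array.dtype)  # all elements are stored as floats

int_array = np.array([1, 2, 3])
print(int_array.dtype.kind)      # 'i' means a signed integer dtype
```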

### 1. Matrices

A **matrix** is an array of numbers that are organized into a fixed number of rows and columns. We will start with the basics of matrices:
- shape
- transpose
- mathematical operations (sum, mean, max, etc.)
- matrix addition
- element-wise multiplication
- matrix multiplication
- inverse
- broadcasting

In [None]:
# define a matrix
A = np.array([[1,2,3],[4,5,6]]) # use numpy to define our matrix
print(A)

#### 1.1 Shape

The shape of the matrix refers to how many rows and columns the matrix is composed of.

In [None]:
print("Shape of A:", A.shape) # 2 rows, 3 columns

#### 1.2 Transpose

Transposing a matrix can be thought of as flipping the matrix about the diagonal.

In [None]:
print(A.T)

In [None]:
print(A.T.shape) # see that we now have 3 rows, 2 columns

#### 1.3 Mathematical Operations

When applying a mathematical operation to our matrix, we can apply it to the entire matrix (the default) or specify the 'axis': axis=0 computes the operation down each column and axis=1 computes it across each row.

In [None]:
# sum
print(A.sum()) # sums all entries of the matrix

In [None]:
# mean
print(A.mean(axis=0)) # computes mean for each column

In [None]:
### YOUR CODE: compute the maximum value for each row of matrix A

#### 1.4 Matrix Addition

Matrix addition adds elements position by position. Therefore, the matrices must have the same shape!

In [None]:
B = np.array([[1,3,5], [2,4,6]])
print(A, "\n")
print(B)
A + B 

#### 1.5 Matrix Element-Wise Multiplication

Similar to adding matrices, we can multiply two matrices element-wise by multiplying the elements that occupy the same position. Again, this means that the matrices must be of the same shape!

In [None]:
A*B

#### 1.6 Matrix Multiplication

You can use `@` to represent matrix multiplication. However, the size of the two arrays must be compatible to do this operation. For example, if A has shape (M, P) and matrix B has shape (P, N), they are considered compatible since the inner dimensions match (P=P). The result of `A @ B` will have shape (M, N) - the outer dimensions of the matrices.

In [None]:
A @ B

# think: why does this cause an error? consider their shapes...

In [None]:
print(A.shape, B.shape) # inner dimensions do not match, 3 != 2

In [None]:
print(A.shape, B.T.shape) # inner dimensions now match, 3 = 3

In [None]:
A @ B.T
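The shape rule can be sketched with two small made-up arrays. The sizes M, P, N and the names `left` and `right` below are assumptions chosen for illustration.

```python
import numpy as np

M, P, N = 2, 3, 4                      # hypothetical sizes
left = np.arange(M * P).reshape(M, P)  # shape (2, 3)
right = np.ones((P, N))                # shape (3, 4)

product = left @ right
print(product.shape)                   # (M, N)

# each entry is the dot product of a row of `left` with a column of `right`
entry_00 = np.dot(left[0, :], right[:, 0])
print(entry_00, product[0, 0])
```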

#### 1.7 Inverse

When working with numbers, the inverse of a number is simply its reciprocal. We can think of the inverse of a matrix in a similar fashion, as it follows a similar principle: $A \times A^{-1} = A^{-1} \times A = I$, where A is a square matrix and I is the identity matrix. Therefore, we can think of the identity matrix as the matrix equivalent of the "1" we get when we multiply a number by its reciprocal. The identity matrix is always square (i.e., number of rows = number of columns) with values of 1 along the diagonal and 0 elsewhere. Note that only square matrices can have an inverse, and not every square matrix has one. For more information on how the inverse is computed: https://www.mathsisfun.com/algebra/matrix-inverse.html

In [None]:
# example of I, the identity matrix 

I = np.array([[1,0], [0,1]])
print(I)

In [None]:
X = A @ B.T # matrix multiplication
print(X, "\n")
X_inverse = np.linalg.inv(X)
print(X_inverse)

In [None]:
# checks

print(X @ X_inverse, "\n")

print(X_inverse @ X)
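Because of floating-point round-off, the products printed above are only approximately the identity matrix. A minimal sketch of a more robust check uses `np.allclose` and `np.eye`. It also shows a made-up matrix (`singular`, an illustrative assumption) that has no inverse.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[1, 3, 5], [2, 4, 6]])
X = A @ B.T
X_inverse = np.linalg.inv(X)

# compare with the identity matrix up to a small numerical tolerance
print(np.allclose(X @ X_inverse, np.eye(2)))

# a matrix with linearly dependent rows has determinant 0 and no inverse
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    inverted = True
except np.linalg.LinAlgError:
    inverted = False
print("singular matrix inverted?", inverted)
```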

#### 1.8 Broadcasting

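If arrays have different shapes but compatible dimensions, the smaller array is broadcast (virtually repeated) across the larger one. Shapes are compared starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. This allows us to apply arithmetic operations between arrays of different shapes.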

In [None]:
# integers can be easily broadcast to the whole array
print(A)
print(A + 2) # adds two to each position

In [None]:
### YOUR CODE: multiply each element of matrix A by 2

In [None]:
# broadcasting A to a matrix C of higher dimensions  
C = np.array([[[1,1,1],[1,1,1]], [[2,2,2],[2,2,2]]])
print(C.shape, A.shape) # second two dimensions are the same

In [None]:
print("A:\n", A, "\n")
print("C:\n", C, "\n")
print("A+C:\n", A + C) # adds A to each of the two 2x3 slices of C
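Broadcasting also works between a matrix and a single row or column. The sketch below uses made-up arrays `row` and `col` (illustrative assumptions) and ends with a pair of shapes that cannot be broadcast.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])        # shape (2, 1)

print(A + row)  # row is added to every row of A
print(A + col)  # col is added to every column of A

# shapes (2, 3) and (2,) are not compatible: trailing sizes 3 and 2 differ
try:
    A + np.array([1, 2])
    compatible = True
except ValueError:
    compatible = False
print("compatible?", compatible)
```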

### 2. Eigenvalues

It is often useful to express things in simpler, smaller parts as properties of these simpler and smaller parts often speak volumes about the characteristics or behavior of the whole. They can also allow us to calculate things faster.

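This is where eigenvalues and eigenvectors come in. An **eigenvector** of a matrix is a special nonzero vector that the matrix only stretches, shrinks, or flips, without otherwise changing its direction. The corresponding **eigenvalue** (which can be positive, negative, or zero, and in general complex) is the factor by which the eigenvector is scaled.

Therefore, for a matrix A, eigenvector v and eigenvalue λ,
    $Av = λv$, where:
    - A is an n x n matrix, a “square” matrix **remember this requirement!**
    - λ is a scalar; an n x n matrix has n eigenvalues (counted with multiplicity)
    - v is a nonzero vector of length n 

When A has n linearly independent eigenvectors, this allows us to represent A using just its eigenvalues and eigenvectors! Let's see an example.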

In [None]:
# this will fail, let's ask why?
print("The shape of A is ", A.shape)
try:
    eig(A) # try to perform eigendecomposition
except np.linalg.LinAlgError:
    print("There was an error trying to do eigendecomposition on A!")
assert A.shape[0] == A.shape[1], "This matrix is not square!"

#### Why do matrices have to be square?
The definition of the eigenvalue problem, $Av = \lambda v$, only makes sense when $A$ is square.

Suppose $A$ has size $m \times n$, and let's check the shapes on each side of the definition:

* [$m \times n$] @ [$n \times 1$] = $\lambda$ * [$n \times 1$]

* On the left side, we have the matrix product of $A$ and an eigenvector $v$
* For $Av$ to be defined, $v$ must have length $n$, and the product $Av$ then has length $m$
* On the right side we have **scalar multiplication**, not a matrix product... 
* So $\lambda v$ has the same length as $v$, which is $n$
* The two sides can only be equal if they have the same length
* Therefore $m = n$, i.e., $A$ must be square

caveat: this is an informal shape-checking argument meant to build intuition

In [None]:
a_square_matrix = np.array([[5, 1], [3, 3]])
the_eigenvalues, the_eigenvectors = eig(a_square_matrix)
print("our square matrix: \n", a_square_matrix)
print("our eigenvalues: \n", the_eigenvalues)
print("our eigenvectors: \n", the_eigenvectors) # matrix of eigenvectors

print("---- Let's check that Av = lambda v holds for each eigenpair ----")

Av = a_square_matrix @ the_eigenvectors # matrix multiplication: column j is A @ v_j
lambda_v = the_eigenvalues * the_eigenvectors # broadcasting: column j is lambda_j * v_j
print('A@v\n', Av)
print("lambda * v\n", lambda_v)
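The same check can be done one eigenpair at a time, which makes it explicit that the eigenvectors are the columns of the returned matrix. This is a minimal sketch using the same 2 x 2 values as above (the name `M` is an illustrative assumption). It also rebuilds the matrix from its eigenvalues and eigenvectors, which works here because the two eigenvectors are linearly independent.

```python
import numpy as np

M = np.array([[5.0, 1.0], [3.0, 3.0]])  # same values as a_square_matrix
eigenvalues, eigenvectors = np.linalg.eig(M)

# check M v = lambda v for each eigenpair (eigenvectors are the columns)
for j in range(len(eigenvalues)):
    v = eigenvectors[:, j]
    print(eigenvalues[j], np.allclose(M @ v, eigenvalues[j] * v))

# rebuild the matrix: M = V diag(lambda) V^{-1}
rebuilt = eigenvectors @ np.diag(eigenvalues) @ np.linalg.inv(eigenvectors)
print(np.allclose(rebuilt, M))
```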

### 3. Singular Value Decomposition (SVD)

One of the major drawbacks of eigendecomposition is the requirement that the matrix be square, as we will frequently encounter situations where this is not the case. However, other matrix decomposition methods, such as **Singular Value Decomposition (SVD)**, can be used in its place. SVD is similar to eigendecomposition in that the original matrix can be rebuilt from a product of matrices, $A = U 𝚺 V^T$. However, unlike eigendecomposition, it works with non-square matrices. It is built from the products of the matrix with its transpose: the columns of U are eigenvectors of $AA^T$ and the columns of V are eigenvectors of $A^TA$. It also differs from eigendecomposition in that the diagonal entries of 𝚺 are the square roots of the eigenvalues of $A^TA$, referred to as **singular values** rather than eigenvalues. We will refer to these singular values using 𝛔.

So, for SVD we have 3 matrices: 
    - U - the left singular vectors
    - V - the right singular vectors
    - 𝚺 - a diagonal matrix holding the singular values shared by U and V

In [None]:
# source: https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/

U, s, VT = svd(a_square_matrix) # apply svd: returns U, the singular values s (a 1-D array), and V transpose

# s contains singular values but we need to insert them into a proper matrix with the values along the diagonal
Sigma = np.zeros(a_square_matrix.shape) # initialize matrix of all zeros with the shape of the original matrix
Sigma[:len(s), :len(s)] = np.diag(s) # add values along the diagonal

print("U\n", U)
print("Sigma\n", Sigma)
print("V transpose\n", VT)

print("---- Let's see if A = U Sigma V' ----\n")

print(
    "U Sigma V'\n",
    U.dot(Sigma.dot(VT))
)

print(
    "Original Matrix\n",
    a_square_matrix
)
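Since the main appeal of SVD is that it handles non-square matrices, here is a minimal sketch on a 2 x 3 matrix. It uses `np.linalg.svd` rather than the `scipy.linalg.svd` imported above, purely to keep the example NumPy-only. It also checks the square-root relationship between singular values and eigenvalues.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 2 x 3, not square

U, s, VT = np.linalg.svd(A)  # U is 2x2, s has 2 values, VT is 3x3

Sigma = np.zeros(A.shape)    # Sigma must match the shape of A: 2 x 3
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ Sigma @ VT, A))

# singular values are the square roots of the eigenvalues of A @ A.T
eigenvalues = np.linalg.eigvalsh(A @ A.T)  # returned in ascending order
print(np.allclose(np.sort(s), np.sqrt(eigenvalues)))
```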

### 4. Principal Component Analysis (PCA)

As a review, variance is a measure of the spread or dispersion of data and is an important characteristic of data. With this in mind, PCA can be seen as a dimensionality reduction technique that maps N columns to k new columns (the principal components, with k ≤ N) while preserving as much of the variance of the original N columns as possible. PCA and SVD are similar in that both work on square and non-square data matrices and are related to eigenvalues. However, PCA differs from SVD in that it is defined through the eigendecomposition of the data's covariance matrix rather than through a decomposition of the original matrix. In practice, implementations such as scikit-learn's PCA obtain the same result from an SVD of the centered data.
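The link between PCA, the covariance matrix, and SVD can be sketched with NumPy alone. The example below uses synthetic random data (an assumption for illustration, not the iris data) and is a minimal by-hand version of the idea, not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not iris
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
data = rng.normal(size=(100, 3)) @ mixing

centered = data - data.mean(axis=0)   # PCA works on centered data
cov = np.cov(centered, rowvar=False)  # 3 x 3 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is for symmetric matrices
order = np.argsort(eigenvalues)[::-1]            # sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
projected = centered @ eigenvectors[:, :k]       # N = 3 columns reduced to k = 2
explained_ratio = eigenvalues / eigenvalues.sum()

# the SVD of the centered data gives the same variances: sigma^2 / (n - 1)
singular_values = np.linalg.svd(centered, compute_uv=False)
print(np.allclose(singular_values**2 / (len(data) - 1), eigenvalues))
print(explained_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are read as "variance explained".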

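One important thing to keep in mind with PCA is that variables with more variability will get more "power." Therefore, it is often important to scale our variables beforehand, for example by standardizing each feature with scikit-learn's StandardScaler. Note that setting whiten=True in PCA is not a substitute for this: whitening rescales the resulting components to unit variance after the projection, rather than scaling the input variables.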

Lastly, one thing you may be wondering is how to choose the optimal number of principal components, k. Some approaches include:
 - Choose manually: If theory suggests an underlying dimension for the dataset (for example, 3 factors that drive a process), then this could be a choice
 - Scree plot: This can be somewhat subjective but it is simple; look for an elbow in the plot or pick a cumulative variance cutoff point
 - Cross-validate: If used in a predictive task, treat k as a tuning/hyperparameter
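
The cumulative variance cutoff can be sketched in a few lines. The ratios and the 90% threshold below are made-up values for illustration; with scikit-learn, the ratios would come from the `explained_variance_ratio_` attribute of a fitted PCA.

```python
import numpy as np

# hypothetical explained-variance ratios for five components, largest first
explained_variance_ratio = np.array([0.60, 0.25, 0.08, 0.05, 0.02])

cumulative = np.cumsum(explained_variance_ratio)
threshold = 0.90  # assumed cutoff, a judgment call

# first position where the cumulative ratio reaches the threshold, plus 1
k = int(np.argmax(cumulative >= threshold)) + 1
print(cumulative)
print("components needed:", k)
```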

In [None]:
# Shamelessly taken from https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html
# special note about this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

iris = load_iris() # reload the data
X = iris.data # features
y = iris.target # target/outcome variable
target_names = iris.target_names

pca = PCA(n_components=2, whiten=True) # apply PCA where k=2
X_r = pca.fit(X).transform(X)

# percentage of variance explained
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

# plot the results of our PCA
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8, lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset - x axis is first component, y the second')
plt.show() # see pretty good separation between the 3 species

Looking at the results, we see that most of the variance is explained by the first principal component. This makes sense because the iris data has only four features, several of which are strongly correlated. However, with more complex datasets, additional principal components will be important for capturing the remaining variance in the data.

In the next lab, we will build off of this introduction and will dive into applying machine learning techniques using scikit-learn.