# Module 1: Data Science Fundamentals
## Sprint 3: Intro to Modeling
## Subproject 1: Linear Algebra Refresher

Welcome to the final sprint of the Data Science Fundamentals module! We've done EDA, and we've learned to test whether an observed change is statistically significant. There's one more skill set every data scientist should have: modeling. Modeling comes in many flavors, such as statistical modeling and machine learning.

We will explore the fundamental idea that **data can be represented in a compressed way - with a model**. You can think of a model as an approximation of a dataset. Models not only help us humans understand data better, but also let us make predictions about unseen data (we'll explore the concept of automatically learning relationships in the data).

## Learning outcomes

- Basics of linear algebra: vector operations, matrix operations.
- Intermediate NumPy proficiency: broadcasting, vectorization, higher-level APIs.

## Linear Algebra Refresher

Go through all videos and exercises at [KhanAcademy's linear algebra module](https://www.khanacademy.org/math/linear-algebra). After the course, you should have refreshed your knowledge of linear algebra basics - vectors, matrices, their operations, inner product.
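
As a quick self-check after the course, the basic operations can be tried out directly in NumPy. The snippet below is a minimal sketch; the vectors `u`, `v` and the matrix `A` are made-up values chosen only for illustration.

```python
import numpy as np

# Illustrative values only; any vectors/matrices of matching shapes would do.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])

vector_sum = u + v   # element-wise vector addition
scaled = 2 * u       # scalar multiplication
inner = u @ v        # inner (dot) product: 1*4 + 2*5 + 3*6 = 32
mat_vec = A @ u      # matrix-vector product, shape (2,)
gram = A @ A.T       # matrix-matrix product, shape (2, 2)

print(vector_sum, scaled, inner)
print(mat_vec)
print(gram)
```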

## Advanced NumPy

We've had a chance to try out some NumPy in Sprint 1. We've barely scratched the surface - there's so much more you should know about NumPy APIs as a data scientist, which we'll focus on throughout this subproject. This subproject will mostly consist of exercises.

### Aggregations

By this point, you should be familiar with Pandas aggregation techniques. Often, you need to crunch large datasets with pure NumPy (after all, a Pandas Series is typically backed by a NumPy array under the hood), so it's worth being familiar with NumPy data aggregation techniques: sums and summary statistics (mean, median, quantiles, percentiles).

Go through [this](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html) tutorial on NumPy aggregation APIs. By the end of this tutorial, you should be able to apply basic NumPy aggregation techniques when needed.
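
To get a feel for these aggregations before starting the tutorial, here is a minimal sketch. The `data` array is a hypothetical 3x4 dataset invented for this example; the `axis` argument controls whether the aggregation runs over everything, per column, or per row.

```python
import numpy as np

# Hypothetical dataset: 3 rows (samples) x 4 columns (features).
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0],
                 [9.0, 10.0, 11.0, 12.0]])

total = data.sum()                         # sum over all elements
col_means = data.mean(axis=0)              # one mean per column
row_max = data.max(axis=1)                 # one maximum per row
median = np.median(data)                   # median over all elements
q25, q75 = np.percentile(data, [25, 75])   # 25th and 75th percentiles

print(total, col_means, row_max, median, q25, q75)
```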

### Broadcasting and vectorization

Get familiar with the concept of [broadcasting](https://cs231n.github.io/python-numpy-tutorial/#broadcasting). It allows NumPy to perform arithmetic on arrays of different shapes.
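
A small sketch of broadcasting in action follows. The array names are made up for illustration; the point is that shapes are compared from the trailing dimension backwards, and a dimension of size 1 (or a missing one) is stretched to match.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])          # shape (3,)
col = np.array([[100], [200]])        # shape (2, 1)

plus_row = matrix + row   # row is added to every row of matrix
plus_col = matrix + col   # col is added to every column of matrix
outer_sum = col + row     # shapes (2, 1) and (3,) broadcast to (2, 3)

print(plus_row)
print(plus_col)
print(outer_sum)

# Shapes (2, 3) and (2,) are not compatible: trailing dimensions 3 and 2 differ.
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print("cannot broadcast:", err)
```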

Afterwards, get used to the concept of [vectorization](https://realpython.com/numpy-array-programming/#what-is-vectorization). This technique replaces explicit Python loops with operations on whole arrays, which often makes loop-based algorithms run many times faster.
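
The idea can be sketched by writing the same computation twice, once with a Python loop and once with array operations. The function names below are hypothetical and exist only for this comparison; to measure the actual speed difference on your machine, you could time both with `timeit`.

```python
import numpy as np

def squared_distance_loop(a, b):
    """Sum of squared differences using an explicit Python loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return total

def squared_distance_vectorized(a, b):
    """Same quantity computed with whole-array operations."""
    diff = a - b
    return float(np.dot(diff, diff))

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
a = rng.random(10_000)
b = rng.random(10_000)

loop_result = squared_distance_loop(a, b)
vec_result = squared_distance_vectorized(a, b)
print(np.isclose(loop_result, vec_result))
```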

### Exercises

Solve exercises [26](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md#26-what-is-the-output-of-the-following-script-), [13](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md#13-create-a-10x10-array-with-random-values-and-find-the-minimum-and-maximum-values-) and [45](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md#45-create-random-vector-of-size-10-and-replace-the-maximum-value-by-0-). Check the solutions [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints_with_solutions.md). You can implement them in this notebook.

To wrap up this subproject, we invite you to solve some more advanced NumPy exercises. From the document [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md), choose 2 exercises of difficulty level 3 (★★★) and solve them. Afterwards, verify your solutions using the answer sheet [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints_with_solutions.md).

In [None]:
import numpy as np

26. What is the output of the following script? (★☆☆)

In [None]:
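# Author: Jake VanderPlas

# Built-in sum: the second argument is the start value, so -1 + (0+1+2+3+4) = 9
print(sum(range(5),-1))
# The star import shadows the built-in sum with numpy.sum, whose second argument is the axis
from numpy import *
print(sum(range(5),-1))


9
10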


In [1]:
import numpy as np
# `import numpy as np` does not shadow the built-in sum, so both calls print 9
print(sum(range(5),-1))

print(sum(range(5),-1))

9
9


In [None]:
v=list(range(5)),-1
v

([0, 1, 2, 3, 4], -1)
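
The difference behind exercise 26 can be made explicit by calling both functions by their full names, so the result does not depend on what has been imported earlier in the session. This is a minimal sketch; the variable names are chosen for illustration.

```python
import builtins
import numpy as np

values = range(5)   # 0, 1, 2, 3, 4

# Built-in sum: the second argument is the start value, so -1 + 10 = 9.
builtin_result = builtins.sum(values, -1)

# numpy.sum: the second argument is the axis (-1 means the last axis), so the sum is 10.
numpy_result = np.sum(values, -1)

print(builtin_result, numpy_result)
```

This is one reason to prefer `import numpy as np` over `from numpy import *`: the star import silently replaces built-ins such as `sum`, `min` and `max`.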

45. Create random vector of size 10 and replace the maximum value by 0 (★★☆)

In [None]:
Ran = np.random.random(10)   # 10 uniform samples from [0, 1)
print(Ran)
print(Ran.argmax())          # index of the maximum value
print('')
Ran[Ran.argmax()] = 0        # replace the maximum value with 0
print(Ran)

[0.0224639  0.68227967 0.71005824 0.1548906  0.14950709 0.94079284
 0.11979777 0.52664008 0.90386678 0.79061684]
5

[0.0224639  0.68227967 0.71005824 0.1548906  0.14950709 0.
 0.11979777 0.52664008 0.90386678 0.79061684]


In [None]:
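# Prints 10 only when sum has been shadowed by numpy.sum (for example by an earlier `from numpy import *`)
print(sum(range(5),-1))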

10


40. Create a random vector of size 10 and sort it (★★☆)

In [None]:
num = np.random.random(10)
num.sort()
num

array([0.13198769, 0.29741545, 0.38174644, 0.51736253, 0.67848693,
       0.73894452, 0.74179442, 0.87292216, 0.9400086 , 0.98084481])

-----

13. Create a 10x10 array with random values and find the minimum and maximum values (★☆☆)

In [None]:
Z = np.random.random((10,10))
Zmin, Zmax = Z.min(), Z.max()
print(Zmin, Zmax)

0.006631342960017994 0.9782430099094102


64. Consider a given vector, how to add 1 to each element indexed by a second vector (be careful with repeated indices)? (★★★)
hint: np.bincount | np.add.at



In [6]:
Z = np.ones(10)
I = np.random.randint(0,len(Z),20)
Z += np.bincount(I, minlength=len(Z))
print(Z)



[3. 1. 5. 4. 3. 4. 3. 3. 2. 2.]


In [12]:
Z = np.ones(10)
I = np.random.randint(0,len(Z),20)
print(I)
np.add.at(Z, I, 1)
print(Z)

[1 6 2 4 0 2 4 3 3 8 9 6 9 4 6 3 1 1 4 6]
[2. 4. 3. 4. 5. 1. 5. 1. 2. 3.]


In [10]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [9]:
np.random.randint(0,len(Z),20)

array([3, 7, 6, 5, 1, 3, 4, 7, 1, 8, 7, 9, 3, 7, 5, 4, 4, 6, 7, 1])
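
The "be careful with repeated indices" warning in exercise 64 is easiest to see with a small fixed index vector. The sketch below uses made-up values and hypothetical variable names; it compares plain fancy-index assignment with the two approaches from the hint.

```python
import numpy as np

indices = np.array([0, 0, 2, 2, 2])   # indices 0 and 2 are repeated

# Fancy-index in-place addition is buffered: each repeated index is incremented only once.
buffered = np.ones(4)
buffered[indices] += 1

# np.add.at is unbuffered: every occurrence of an index contributes.
unbuffered = np.ones(4)
np.add.at(unbuffered, indices, 1)

# np.bincount counts occurrences per index, giving the same result as np.add.at here.
counted = np.ones(4)
counted += np.bincount(indices, minlength=len(counted))

print(buffered)     # [2. 1. 2. 1.]
print(unbuffered)   # [3. 1. 4. 1.]
print(counted)      # [3. 1. 4. 1.]
```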

65. How to accumulate elements of a vector (X) to an array (F) based on an index list (I)? (★★★)
hint: np.bincount

In [4]:
X = [1,2,3,4,5]
I = [1,3,9,3,1]
F = np.bincount(I,X)
print(F)

[0. 6. 0. 6. 0. 0. 0. 0. 0. 3.]
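
To see what `np.bincount` with weights computes in exercise 65, it can be compared against an explicit loop. This is a minimal sketch reusing the `X` and `I` values from the cell above; `F_loop` and `F_bincount` are names introduced only for this comparison.

```python
import numpy as np

X = [1, 2, 3, 4, 5]
I = [1, 3, 9, 3, 1]

# Explicit accumulation: add each value of X into the slot given by I.
F_loop = np.zeros(max(I) + 1)
for index, value in zip(I, X):
    F_loop[index] += value

# Vectorized equivalent: X is passed as the weights of the bin counts.
F_bincount = np.bincount(I, weights=X)

print(F_loop)
print(np.array_equal(F_loop, F_bincount))
```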


69. How to get the diagonal of a dot product? (★★★)
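
One possible way to approach exercise 69 is sketched below, assuming two square random matrices `A` and `B` (made up for this example). Since the i-th diagonal entry of `A @ B` is the sum over j of `A[i, j] * B[j, i]`, the full matrix product does not have to be computed.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
A = rng.random((5, 5))
B = rng.random((5, 5))

# Straightforward version: computes the whole product, then keeps only the diagonal.
diag_full = np.diag(A @ B)

# Element-wise version: multiply A by the transpose of B and sum each row.
diag_rowsum = np.sum(A * B.T, axis=1)

# einsum version: states the index pattern sum_j A[i, j] * B[j, i] directly.
diag_einsum = np.einsum("ij,ji->i", A, B)

print(np.allclose(diag_full, diag_rowsum), np.allclose(diag_full, diag_einsum))
```

The last two variants avoid computing the off-diagonal entries, which matters more as the matrices grow.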

## Summary

Linear algebra basics will be fundamental for the upcoming subprojects of this sprint. By now, you should be comfortable with vector and matrix operations and able to use NumPy, the de facto standard linear algebra library for Python. Be aware that although NumPy is usually much faster than pure Python, knowing how to vectorize your code is crucial to make it blazing fast.