### A Foundational Python Data Science Course
## Session 04: Numpy: basic vector arithmetic, linear algebra, and broadcasting.

***

### 0. What do we want to do today?

Our goal in Session 04 is to learn the basics of **Numpy**, a powerful number-crunching machinery that turns Python into a **vector programming language** - a kind of language ideally suited for mathematical statistics and Data Science.  

Along the way we will begin to understand how Numpy runs under Pandas and how the two relate. We will also cover the basics of vector arithmetic in Numpy, of course, and how to vectorize a function with Numpy. And plenty of other things as well! 

### 1. Where am I?

In [None]:
import os
work_dir = os.getcwd()
print(work_dir)
data_dir = os.path.join(work_dir,"_data")
print(data_dir)


### 2. Numpy, alright. A gentle introduction.

In [3]:
import numpy as np
# set RNG seed
np.random.seed(777)

**N.B.** The `np.random.seed(777)` thing will be explained in the session.

Lists seem like an ideal Python structure to hold numerical data. But there are things lists can and cannot do:

In [None]:
import math
l_1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
math.sqrt(l_1[-1])

What if we need to take the square root of **all** elements in `l_1`?

In [None]:
import math
l_1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sqrt_l_1 = [math.sqrt(x) for x in l_1]
print(sqrt_l_1)

That would do. But this:

In [None]:
math.sqrt(l_1)

**does not work.** Now,

In [None]:
v1_array = np.array(l_1)
print(v1_array)
print(type(v1_array))
np.sqrt(v1_array)

The key difference between a Python list and a NumPy array when taking square roots:

-  __Python list__ (`[]`): `math.sqrt` cannot be applied to a list directly. You need to iterate over the elements and compute each square root individually. Example: `[math.sqrt(x) for x in my_list]`.

-  __NumPy array__ (`np.array([])`): supports element-wise operations. You can apply `np.sqrt()` to the entire array, and it computes the square root of every element efficiently. Example: `np.sqrt(my_array)`.

NumPy arrays are optimized for mathematical operations, whereas lists are general-purpose collections.

More fun:

In [None]:
l_1 + 1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[19], line 1
----> 1 l_1 + 1

TypeError: can only concatenate list (not "int") to list

In [None]:
[x+1 for x in l_1]

However, w. Numpy:

In [None]:
v1_array + 1

Operators like `+`, `-`, `*`, `**` and similar, as well as Numpy's *vectorized* functions, are applied to all elements of a Numpy vector at once; with a list, we would need to iterate over the elements to achieve the same effect.

In [None]:
v1_array *2
v1_array ** 2


# Scalars and Vectors

**Scalar: A scalar is a single number, representing one value. For example, 5, 3.14, or -2 are scalar values. Scalars have zero dimensions (0D).**

    > Example in Python: scalar = 5

**Vector: A vector is a one-dimensional array of numbers, representing multiple values. It is essentially a collection of scalars arranged in a specific order. Vectors have one dimension (1D).**

    > Example: [1, 2, 3] or a NumPy array np.array([1, 2, 3])

A NumPy vector is simply a one-dimensional NumPy array, like np.array([1, 2, 3]). It allows for efficient mathematical operations and is optimized for numerical computation.

A __vectorized function__ is a function that performs an operation on entire arrays (or vectors) element-wise __without__ the need for explicit loops. These functions take advantage of NumPy's internal implementation in C, making them highly optimized and faster than looping in Python.

__Key characteristics:__
 -  Operates on entire arrays instead of individual elements.
 -  Eliminates the need for for loops.
 -  Provides concise and readable code.

In [None]:
# plain Python: no NumPy array, loop over the list
# square every number in the list
l1 = [1,2,3,4]
[x**2 for x in l1]

# now with a vectorized operation
import numpy as np
l1_array = np.array(l1)  # np.array([l1]) would create a 2D array of shape (1, 4)
l1_array**2


### Why Use Vectorized Functions?

-  __Speed__: NumPy's vectorized operations are implemented in C and are much faster than Python loops.
-  __Simplicity__: Code is easier to read and write.
-  __Efficiency__: Minimizes overhead and reduces Python's looping overhead.
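
A rough way to see the speed difference is to time both versions on the same data. The sketch below assumes 200,000 integers and uses the standard-library `timeit` module; the names `values`, `values_list`, `loop_time` and `vector_time` are illustrative only, and the actual timings depend on your machine.

```python
import timeit
import numpy as np

values = np.arange(200_000, dtype=np.int64)
values_list = values.tolist()

# Python-level loop: squares one element at a time
loop_time = timeit.timeit(lambda: [x**2 for x in values_list], number=5)

# vectorized: the loop runs inside NumPy's compiled code
vector_time = timeit.timeit(lambda: values**2, number=5)

print(f"list comprehension: {loop_time:.4f} s")
print(f"vectorized:         {vector_time:.4f} s")
```

Both versions compute the same squares; only the place where the loop runs differs.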

#### Vectorizing functions with Numpy

In [None]:
def plus_one(x):
    return(x+1)
plus_one(5)

In [None]:
plus_one(l_1[0])

In [None]:
plus_one(l_1)

In [None]:
plus_one_v = np.vectorize(plus_one)
plus_one_v(l_1)


**N.B.** My call to `plus_one_v()`, i.e. `plus_one_v(l_1)` has automatically turned the `l_1` list into a Numpy vector before execution. Now `plus_one_v()` is a *vectorized* version of `plus_one()`.

## Important

### Vectorization:
-  It's the process of applying a function to an entire array (or list) at once, without using explicit loops.
-  It makes your code shorter, cleaner, and faster, as NumPy's operations are optimized in C.

### Automatic Conversion:

-  When you pass a list like l_1 to the vectorized function, NumPy automatically treats it like a NumPy array internally. This allows for efficient element-wise operations.

### Why Use Vectorization?

-  Normal Python loop for applying plus_one to a list:

In [None]:
result = [plus_one(x) for x in l_1]
#This is explicit and works fine but can be slower for large datasets.

-  Vectorized version:

In [None]:
result = plus_one_v(l_1)

# NumPy handles the iteration internally, which is optimized for performance.

Without Vectorization:

In [None]:
l_1 = [1,2,3,4]
result = [plus_one(x) for x in l_1]
print(result)

With __Vectorization__ (numpy):

In [None]:
import numpy as np
l_1 = [1, 2, 3, 4]
plus_one_v = np.vectorize(plus_one)
result = plus_one_v(l_1)
print(result)  # Output: [2 3 4 5] (a NumPy array, printed without commas)


Purpose of np.vectorize:

-  It transforms a function that operates on a single value at a time into a function that can be applied element-wise to entire lists, arrays, or other array-like inputs.

After vectorizing:

-  Once a function is wrapped with np.vectorize, you can pass it a list, a NumPy array, or another array-like object, and NumPy will apply the function to each element for you.

***

Let's vectorize something else, w/o help from `np.vectorize()`. In Decision Theory, specifically in Choice under Risk, there is the concept of the Expected Utility of a set of risky options (called a *lottery* when put together), e.g.:

- **Option A**: Stay home, watch a movie. (costs \$5 to rent a movie via online services)
- **Option B**: Go to cinema, watch a movie. (costs \$10 to buy a ticket)



- **Option A1**: Stay home, play a board game. (costs \$10 to buy a board game)
- **Option B1:** Go to the board game club, play a board game. (costs \$3.5 to enter the club)

Let's assume that a decision maker has a Utility Function $u(x)$ for money that maps monetary value to utility, and that $u(x)$ takes the form of a power-utility function, i.e. $u(x) = x^{\rho}$ with some exponent $\rho$ controlling the function. 

Before we proceed, here is a brief explanation of the concepts above:

__Decision Theory__ is a concept rooted in statistics, mathematics, and economics, focusing on making optimal choices under conditions of uncertainty or risk. It combines statistical reasoning with human decision-making processes.

-  Key Components of Decision Theory:

> Decision Maker

> Alternatives (The set of possible actions or choices (e.g., invest in A or B?)).

> Outcomes

> States of Nature (Possible scenarios or conditions beyond the control of the decision-maker (e.g., weather conditions, market changes).)

> Payoff (The result or "reward" associated with each choice and scenario, often measured in terms of profit, utility, or benefit.)

> Probability (The likelihood of each state of nature or outcome.)

### A utility function like $u(x) = x^{\rho}$ is a way to quantify the satisfaction (utility) a decision-maker derives from a certain amount of money $x$. This specific form is known as a power-utility function, and the exponent $\rho$ controls the behavior of the function.

Utility Function u(x):
-  It represents how much "value" or "satisfaction" someone gets from money.
-  The utility function is often non-linear because the value of money doesn't grow equally for everyone. For example, \$10 might be more valuable to someone with only \$100 than to someone with \$10,000.

Power-Utility Function $u(x) = x^{\rho}$

-  The function raises $x$ (money) to the power of $\rho$, where $\rho$ controls the curvature of the function.

-  $\rho > 0$: the function is increasing, meaning more money provides more utility.

-  The shape of the function depends on the value of $\rho$.


In [None]:
import pandas as pd
import numpy as np
rho = .67
x = np.linspace(0,10,100)
ux = x**rho
display_data = pd.DataFrame({'money':x,'utility':ux})
display_data.plot.line(x ='money', y='utility')


What is `np.linspace()`?

In [None]:
np.linspace(0,10,100)

__np.linspace()__ is a NumPy function used to generate an array of evenly spaced numbers over a specified range. It's particularly useful when you want to divide a range into a certain number of intervals.

-  Visualization: Often used to generate data points for plotting functions (e.g., graphs in matplotlib).

-  Sampling: Divide a continuous range into discrete intervals for numerical analysis.

In [None]:
# Example: np.linspace(start, stop, num); the third argument is the number of points, not a step size

import numpy as np
arr = np.linspace(0,10,5)
print(arr)
# output [ 0.   2.5  5.   7.5 10. ]

-  The range from 0 to 10 is divided into 5 evenly spaced numbers.
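
To make the "number of points, not step size" idea concrete, here is a minimal sketch. It relies on the `retstep=True` option of `np.linspace`, which also returns the spacing, and contrasts it with `np.arange`, which takes a step size instead. The variable names are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive values
arr, step = np.linspace(0, 10, 5, retstep=True)
print(arr)
print(step)   # (10 - 0) / (5 - 1)

# np.arange takes a step size instead of a count and excludes the stop value
arr_arange = np.arange(0, 10, 2.5)
print(arr_arange)
```

With `np.linspace` you fix how many values you get; with `np.arange` you fix how far apart they are.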

In [None]:
import numpy as np
arr = np.linspace(0,10, endpoint=False) # num defaults to 50; the stop value 10 is excluded
print(arr)
# output [0.  0.2 0.4 0.6 0.8 1.  1.2 1.4 1.6 1.8 2.  2.2 2.4 2.6 2.8 3.  3.2 3.4 3.6 3.8 4.  4.2 4.4 4.6 4.8 5.  5.2 5.4 5.6 5.8 6.  6.2 6.4 6.6 6.8 7.  7.2 7.4 7.6 7.8 8.  8.2 8.4 8.6 8.8 9.  9.2 9.4 9.6 9.8]

Now, let's assume that the probabilities of options `A`, `B`, `A1`, and `B1` are known to the decision maker, and that we have, of course, $P(A)+P(B)=1$ and $P(A1)+P(B1)=1$:

- **P(A) = .55**: Stay home, watch a movie. (costs \$5 to rent a movie via online services)
- **P(B) = .45**: Go to cinema, watch a movie. (costs \$10 to buy a ticket)


- **P(A1) = .75**: Stay home, play a board game. (costs \$10 to buy a board game)
- **P(B1) = .25**: Go to the board game club, play a board game. (costs \$3.5 to enter the club)

In [None]:
import numpy as np
import pandas as pd

# exponent of the power-utility function

rho = 0.67

# lottery (A,B)
pa = 0.55
pb = 0.45
ca = 5 # cost of option A
cb = 10 # cost of option B
# expected utility of the lottery (A,B):
# EU(A,B) = P(A)*u(cA) + P(B)*u(cB)
eu_ab = pa*ca**rho + pb*cb**rho
print('EU(A,B) = '+ str(eu_ab))

# lottery (A1,B1)
pa1 = 0.75
pb1 = 0.25
ca1 = 10 # cost of option A1
cb1 = 3.5 # cost of option B1
eu_a1b1 = pa1*ca1**rho + pb1*cb1**rho
print('EU(A1,B1) = '+ str(eu_a1b1))


-  $u(c_A) = c_A^{\rho}$ is the utility of spending \$5.
-  $u(c_B) = c_B^{\rho}$ is the utility of spending \$10.

### Why $u(c_A) = c_A^{\rho}$ for $c_A = 5$?

-  $c_A$ represents the cost of staying home to watch a movie, which is \$5.
-  The utility function maps this monetary cost ($c_A$) into a subjective level of satisfaction. The function $c_A^{\rho}$ takes into account the diminishing returns to utility from money if $\rho < 1$.
-  For $\rho = 0.67$, spending \$5 yields a utility of $5^{0.67} \approx 2.94$, which reflects how much "value" the decision maker derives from that \$5 based on their preferences.

And since the monetary values constitute costs in this example, the decision maker - if they care only about the utility of money - should choose $(A,B)$ over $(A1,B1)$: the values computed above are approximately 3.72 for $(A,B)$ and 4.09 for $(A1,B1)$, and with costs the lower value is the better one.

-   Under the principle of Maximum Expected Utility, a higher expected utility normally reflects a better combination of satisfaction (utility) and likelihood of outcomes. Here, however, the amounts are money spent rather than money gained, so the larger value for $(A1,B1)$ means a larger expected loss, not more overall value.

A Python function to compute the expected utility of a lottery:

In [None]:
def eu(p1, p2, v1, v2, rho):
    leu= p1*v1**rho + p2*v2**rho
    return(leu)
#test eu
eu(p1=0.75,v1=10,p2=0.25,v2=3.5,rho=0.67)

Ok, now that works. However: what if we need to evaluate many lotteries at once?

In [None]:
pA = [.15, .33, .84]
vA = [10, 20, 30]
pB = [.85, .67, .16] # each pA + pB pair must sum to 1
vB = [8, 17, 45]

lots = pd.DataFrame({'pA':pA, 'vA':vA,'pB':pB,'vB':vB})
lots

Vectorized `eu()` via Numpy:

In [None]:
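def eu_v(lotteries, rho):
    # utilities of both outcomes, computed for all lotteries at once
    uA=np.power(lotteries['vA'],rho)
    uB=np.power(lotteries['vB'],rho)
    # take the probabilities from the DataFrame, not from the global lists
    leu=lotteries['pA']*uA+lotteries['pB']*uB
    return leu
eu_v(lots,rho)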

**Yes:** you can perform vectorized operations and use vectorized functions on `pd.DataFrame` columns.

**Remember:** a `pd.Series` object is, essentially, a `np.array` with an index attached.
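
The relationship can be inspected directly. The following is a minimal sketch with a made-up Series `s` (not part of the session data); `to_numpy()` exposes the underlying array, and a NumPy function applied to the Series keeps the index.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

values = s.to_numpy()      # the underlying NumPy array
print(type(values))
print(s.index.tolist())    # the labels attached to the values

# a NumPy function applied to a Series returns a Series with the same index
roots = np.sqrt(s)
print(roots)
```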

Check (first lottery only):

In [None]:
eu1 = pA[0]*vA[0]**rho + pB[0]*vB[0]**rho 
print(eu1)

Check (second lottery only):

In [None]:
eu2 = pA[1]*vA[1]**rho + pB[1]*vB[1]**rho
print(eu2)

**N.B.** `np.vectorize` is not really recommended. As the Numpy documentation states:
> The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
[numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html)

Whenever possible, write your own vectorized functions, built from Numpy's array operations, instead.
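
As an illustration, a function containing an `if` works on one number at a time but fails on a whole array, because the condition no longer has a single True/False value. One common way to vectorize it by hand is `np.where`. The functions `keep_positive_scalar` and `keep_positive` below are hypothetical examples, not part of the session code.

```python
import numpy as np

def keep_positive_scalar(x):
    # works for a single number only: `if` needs one True/False value
    if x > 0:
        return x
    return 0

def keep_positive(x):
    # np.where takes values from x where the condition holds and 0 elsewhere
    x = np.asarray(x)
    return np.where(x > 0, x, 0)

data = [-2, -1, 0, 1, 2]
print([keep_positive_scalar(x) for x in data])
print(keep_positive(data))
```

The second version contains no Python-level loop, so the iteration happens inside NumPy.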

### 3.  __Numpy__ step by step:

In [None]:
a = [1,2,3]
b = [2,2,2]

# result1 = a*b
# - a*b raises an error for lists:
# - TypeError: can't multiply sequence by non-int of type 'list'

# but if we multiply them as arrays:
import numpy as np
a = np.array([1,2,3])
b = np.array([2,2,2])
result2 = a*b
print(result2)
# [2 4 6]

#### Element-wise operations

In [None]:
a = np.array([1,2,3])
b = np.array([4,5,6])
a+b

# array([5, 7, 9])

In [None]:
a = np.array([10,10,10])
b = np.array([2,3,4])

a**b

# array([  100,  1000, 10000])

In [None]:
a = np.repeat(10, repeats=3)
print(a)

# [10 10 10]

b = np.array([2,3,4])
a**b

# array([  100,  1000, 10000])

In [None]:
np.repeat(10, repeats=3)

# array([10, 10, 10])

In [None]:
a = np.repeat(10, repeats=3)
print(type(a))
# <class 'numpy.ndarray'>
b = a.tolist()
print(type(b))
# <class 'list'>

# the .tolist() method converts an array to a Python list

In [None]:
a = np.repeat(10, repeats = 3)
list(a)

# [10, 10, 10]

#### Matrices

A matrix is a two-dimensional array (rows × columns) used to represent data in rows and columns.

Matrices can be created as a NumPy array (__np.array__) or as a NumPy matrix (__np.matrix__):

In [None]:
matrix_array = np.array([[1, 2, 3], [4, 5, 6]])
display(matrix_array)

# array([[1, 2, 3],
#          [4, 5, 6]])

-   Using np.array: general-purpose, supports multi-dimensional data.

-   Using np.matrix: specialized for 2D data (less flexible, not recommended).

In [None]:
matrix = np.matrix([[1, 2, 3], [4, 5, 6]])
display(matrix)

# matrix([[1, 2, 3],
#        [4, 5, 6]])

In [None]:
mat = np.array([[1,2,3],
             [4,5,6],
             [7,8,9]])

display(mat)


In [None]:
np.shape(mat)
# expected shape 3 rows 3 columns
# output (3, 3)

#### Subsetting vectors, matrices, and multidimensional arrays

In [None]:
print(mat)
mat[0,0]
#output 1


In [None]:
print(mat)
mat[0,1]

#output 2

# Slicing the mat [ROWS,COLUMNS]

In [None]:
print(mat)
mat[:,0]

# expected output is 1,4,7

In [None]:
print(mat)
mat[:,2]

#expected output 3,6,9

Use a list to subset a NumPy array

In [None]:
v = np.linspace(1, 10, 10, dtype=int)
print(v)

In [None]:
v[[0,2,4]] #[[]] NumPy fancy indexing:

### Numpy __Fancy Indexing__ :

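-   Fancy indexing means passing a list or array of indices inside the indexing brackets; hence the double square brackets in `v[[0,2,4]]`: the outer pair does the indexing, the inner pair is the list.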

-   Fancy indexing returns a new array and does not alter the original one.

-   Fancy indexing is a powerful way to select specific elements from a NumPy array using an array (or list) of indices. It allows for flexible selection of multiple non-consecutive elements.
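
The "returns a new array" point can be checked by modifying the result and looking at the original. The sketch below contrasts fancy indexing with ordinary slicing, which returns a view on the same data; the array `v` here is a fresh example array.

```python
import numpy as np

v = np.arange(1, 11)

picked = v[[0, 2, 4]]      # fancy indexing: a new array (a copy)
picked[0] = 99
unchanged = int(v[0])
print(unchanged)           # the original still holds 1

window = v[0:3]            # basic slicing: a view on the same data
window[0] = 99
print(v[0])                # the original now holds 99
```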

In [None]:
a = np.array([1,2,3])
a.shape
#expected (3,)
a.ndim
#expected 1
mat.ndim
#expected 2

In [None]:
multiarray = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]], 
                        [[10, 11, 12], [13, 14, 15], [16, 17, 18]],
                        [[19, 20, 21], [22, 23, 24], [25, 26, 27]]])
print(multiarray)
multiarray.ndim

In [None]:
multiarray = np.linspace(1,27,27, dtype=int)
print(multiarray)
multiarray = np.reshape(multiarray, (3,3,3)) # shape passed positionally; the newshape keyword is deprecated in recent NumPy versions
print(multiarray)

In [None]:
np.linspace(-10,10,100)

In [None]:
np.arange(2,2)

## np.arange()

-   np.arange() is a NumPy function that generates a sequence of numbers. It is similar to Python's built-in range() function but more flexible, because it accepts non-integer steps and returns a NumPy array directly.

-   Signature: `np.arange([start, ] stop[, step], dtype=None)`
-   Example: in `np.arange(1, 10, 2)`, 1 is the start, 10 the stop (excluded), and 2 the step size.


In [None]:
np.arange(1,10) # end value not included
np.arange(1,11) # end value not included

Subsetting: work outside in

First layer of a 3D structure:

In [None]:
multiarray[0, :, :]

Second layer of a 3D structure:

In [None]:
multiarray[1, :, :]

The comma , inside the square brackets in multiarray[0, :, :] is part of NumPy's multi-dimensional indexing syntax. It allows you to access specific layers, rows, and columns in a multi-dimensional array in an intuitive way.

Explanation of multiarray [0 , : , : ]:

> 0: Refers to the first layer of the 3D array.

> : (row selector): Means "select all rows" from this layer.

> : (column selector): Means "select all columns" for each row in this layer.

***

NumPy uses explicit dimensional indexing:

> The comma separates dimensions in the array.

-   multiarray[0] → Accesses the first layer (0th index) but implicitly selects the entire 2D slice, equivalent to multiarray[0, :, :].

-   multiarray[0, :] → Accesses all rows from the first layer (2D array).
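
The effect of each additional index is easiest to see from the shapes. This is a small sketch using a fresh 3×3×3 array called `cube` (an illustrative name) filled with the numbers 1 to 27:

```python
import numpy as np

cube = np.arange(1, 28).reshape(3, 3, 3)

print(cube[0].shape)        # one whole layer: a 2D array
print(cube[0, 1].shape)     # one row of that layer: a 1D array
print(cube[0, 1, 2])        # a single element

# trailing colons can be left out
same = np.array_equal(cube[0], cube[0, :, :])
print(same)
```

Each index that is a single number removes one dimension; each `:` keeps its dimension.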


In [None]:
#Stepsize

a = [1,2,3,4,5,6,7,8,9]

a[0:10:2]

Set the value of an element

In [None]:
print(mat)
mat[0,1] = 17
print(mat)

# expected: 17 instead of 2 in the first row, 2nd column

Change whole row

In [None]:
mat[0, :] = [8,9,11]
print(mat)

Change mat back to original.

In [None]:
mat[:,:] = [[1,2,3],[4,5,6],[7,8,9]]
print(mat)

# Now try again to change the whole row 0

mat[0,:] = [3,2,1]
print(mat)

# In [0,:], 0 selects row 0 (originally 1,2,3), and : selects all columns in that row.

# Now change only the 2nd and 3rd column in the first row

mat[0,1:3] = [2,3]
print(mat)

# Row 0 is now [3, 2, 3].

Stacking arrays

> np.vstack - Stacks arrays vertically (row-wise), one on top of the other.

In [None]:
v1 = np.array([1,1,1,1])
v2 = np.array([2,2,2,2])
vstacked = np.vstack([v1,v2])
print(vstacked)

> np.hstack - Stacks arrays horizontally (column-wise), side by side. For 1D arrays such as these, the result is a single, longer 1D array.

In [None]:
v1 = np.array([1,1,1,1])
v2 = np.array([2,2,2,2])
hstacked = np.hstack([v1,v2])
print(hstacked)

### Some algebraic operations

__Transpose__

In [None]:
mat

In [None]:
mat.T

In [None]:
np.transpose(mat)

In [None]:
print(lots)
lots[['vA','vB']].T

Multiply matrix by a scalar constant, elementwise

In [None]:
C = 3
print(mat)
mat * C

Matrix times matrix, **elementwise**

In [None]:
mat1 = np.array(([1, 1, 1], [2, 2, 2], [3, 3, 3]))
print(mat1)
print(mat)
print("Element-wise product is: ")
mat*mat1

the same as:

In [None]:
np.multiply(mat,mat1)

Vector by vector, **elementwise**:

In [None]:
x = mat[0,:]
y = mat[1,:]
x*y


Algebraic operations: **the dot product**

-    The dot product, also known as the scalar product, is a fundamental operation in linear algebra that combines two vectors to produce a single scalar value.

In [None]:
v1 = np.array([1, 2, 3])
v2 = np.array([5, 6, 7])
print(v1)
print(v2)


In [None]:
np.dot(v2,v1)

Also, w. `@`:

In [None]:
v1 @ v2 # Dot
v2 @ v1 

Do not forget that this product **is not commutative for matrices**:

In [None]:
print(mat1)
print(mat)
print("Dot product: np.dot(mat, mat1)")
np.dot(mat, mat1)
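
A small case makes the non-commutativity easy to see. The sketch below uses two hypothetical 2×2 matrices, `m1` and `m2`, which are not part of the session data:

```python
import numpy as np

m1 = np.array([[1, 2],
               [3, 4]])
m2 = np.array([[0, 1],
               [1, 0]])

left = m1 @ m2    # swaps the columns of m1
right = m2 @ m1   # swaps the rows of m1
print(left)
print(right)
print(np.array_equal(left, right))
```

Changing the order of the factors changes the result, so `np.dot(mat, mat1)` and `np.dot(mat1, mat)` should not be expected to agree in general.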

Dot product: matrix times vector

In [None]:
a = np.array([1,2,3])
print(a)
print(mat)
np.dot(mat,a.T) # a is 1D, so a.T is the same as a

Type

In [None]:
a.dtype

In [None]:
a.size

In [None]:
a = np.array([[1.1, 2, 3.14], [2, 2.22, 1.41]])
a.dtype

__np.outer()__

-   The outer product of two vectors results in a matrix where each element is the product of elements from the two vectors. In NumPy, you can compute this using the np.outer() function.

-   The resulting matrix is formed by multiplying each element of v1 (rows) with each element of v2 (columns):

In [None]:
print(v1)
print(v2)
np.outer(v1,v2)

a1 = np.array([1,2,3])
a2 = np.array([4,5,6])

np.outer(a1,a2)

### Broadcasting

From the Numpy [Broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html#) documentation:

> The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

In [None]:
a = np.array([[[1,2,3],[4,5,6],[7,8,9]]])
b = np.array([1,2,3])
a+b
a*b

Example from NumPy [Broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html) documentation: *A Practical Example: Vector Quantization*

General Broadcasting Rules:

> Shapes are compared dimension by dimension, starting from the trailing (rightmost) dimension. Two dimensions are compatible when:

1. they are equal, or

2. one of them is 1.

A dimension of size 1 is stretched to match the other dimension; this is what rule 2 permits.


-   If these conditions are not met, a __ValueError: operands could not be broadcast together__ exception is thrown, indicating that the arrays have incompatible shapes.

-   Input arrays do not need to have the same number of dimensions. The resulting array will have the same number of dimensions as the input array with the greatest number of dimensions, where the size of each dimension is the largest size of the corresponding dimension among the input arrays. Note that missing dimensions are assumed to have size one.
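
These rules can be tried out without building large arrays. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; it reports the shape two inputs would broadcast to. The final part triggers the `ValueError` mentioned above on purpose.

```python
import numpy as np

# shapes are compared from the last dimension backwards
shape_1 = np.broadcast_shapes((3, 3), (3,))     # (3,) is treated as (1, 3)
shape_2 = np.broadcast_shapes((4, 1), (1, 5))   # both size-1 dimensions stretch
print(shape_1, shape_2)

# incompatible: the trailing dimensions are 3 and 2, and neither is 1
failed = False
try:
    np.ones((3, 3)) + np.ones(2)
except ValueError as err:
    failed = True
    print("ValueError:", err)
```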



### Example: array a with shape (5, 1)

> To match a target shape of (5, 6):

-   The first dimension (5) already equals 5.

-   The second dimension (1) can be expanded to 6.

### Result:
-   __a behaves like a (5, 6) array in which each value a[i, 0] is repeated across all 6 columns of row i.__
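
The (5, 1) to (5, 6) case can be made visible with `np.broadcast_to`, which shows what an array looks like after broadcasting. This is a sketch with illustrative arrays `a` and `b`:

```python
import numpy as np

a = np.arange(5).reshape(5, 1)     # shape (5, 1)
b = np.zeros((5, 6), dtype=int)    # shape (5, 6)

# what `a` looks like once it is stretched to (5, 6)
stretched = np.broadcast_to(a, (5, 6))
print(stretched)

# the same stretching happens implicitly in arithmetic
total = a + b
print(total.shape)
```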

In [None]:
# - one data point
observation = np.array([111.0, 188.0])
# - several data points
codes = np.array([[102.0, 203.0],
    [132.0, 193.0],
    [45.0, 155.0],
    [57.0, 173.0]])

# - the minimal distance between codes and observation:

diff = codes - observation
print(diff)

# - Euclidean distances
dist = np.sqrt(np.sum(diff**2,axis=1))
print(dist)
# - index of the minimum
w_min = np.argmin(dist)
print(w_min)
# - minimal distance
print(dist[w_min])


### More repeating of things

In [None]:
np.ones(10, dtype=int)
np.ones((2, 5))

In [None]:
np.ones((2, 5))

In [None]:
np.zeros(4, dtype=int)

In [None]:
np.zeros((2,5))

The `np.full()` function creates a new array of a specified shape and fills it with a constant value.

> np.full(shape, fill_value, dtype=None)

In [2]:
np.full((2,2),10)


In [None]:
np.random.random((3,4))

-   np.random.randint(low, high=None, size=None, dtype=int)

> low: Lowest (inclusive) integer to be drawn from the distribution.

> high (optional): If provided, the largest (exclusive) integer to be drawn. If not provided, the range becomes [0, low).

> size (optional): Output shape. For example, size=(m, n) creates an m x n array. If None, a single integer is returned.
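
Random draws are only repeatable if the generator is seeded, which is why the session started with `np.random.seed(777)`. The sketch below re-seeds before each draw to show that the same seed gives the same numbers, and checks that the upper bound is never drawn. The names `first` and `second` are illustrative.

```python
import numpy as np

np.random.seed(777)
first = np.random.randint(10, 15, size=(3, 4))

np.random.seed(777)                 # same seed: same sequence of draws
second = np.random.randint(10, 15, size=(3, 4))

print(np.array_equal(first, second))
print(first.min() >= 10, first.max() < 15)   # 15 itself is never drawn
```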

In [None]:
np.random.randint(10,15, size=(3,4))

In [None]:
v = np.array([1,3,5])
print(v)
np.repeat(v, repeats=3) # each element of v is repeated 3 times

In [None]:
a = np.array([[1,2,3],[4,5,6]])
print(a)

In [None]:
np.repeat(a, repeats=2, axis=0)

When using __np.repeat(a, repeats=2, axis=0)__, you're instructing NumPy to repeat each row of the array a twice along the rows (since axis=0 refers to rows)

In [None]:
np.repeat(a, repeats=2, axis=1) #columns, horizontally

In [3]:
a = np.array([[1, 2, 3]])
print(a.ndim)
np.repeat(a, repeats=2, axis=1)


### Find elements based on conditions

In [135]:
v1 = np.linspace(1,100,100, dtype=int)

In [None]:
print(v1)

In [None]:
cond1 = v1 > 50
print(cond1)

In [None]:
v1[cond1] # keeps only the elements where cond1 is True

In [None]:
v1[v1<50]

In [None]:
v1[(v1<50)&(v1>50)] # empty: no value is both below and above 50; with | (OR) you get everything except 50

In [None]:
print(mat)

In [None]:
print(mat>5)

`np.any()`

-   The np.any() function in NumPy tests whether any elements in an array evaluate to True. If at least one element is True, the function returns True; otherwise, it returns False. This function is particularly useful for evaluating conditions across arrays, especially when combined with comparison operations.

In [6]:
import numpy as np
my_matrix = np.array([
            [1,2,3],
            [4,5,6],
            [7,8,9]
])

print(my_matrix)


[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [7]:
np.any(my_matrix) #True

True

In [8]:
np.any(my_matrix>10)

False

np.any() on columns:

In [9]:
np.any(my_matrix>7,axis=0)

array([False,  True,  True])

np.any() on rows:

In [None]:
np.any(my_matrix>7, axis=1)

 Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1).

 > Axis 0: Runs vertically down the rows. Operations along this axis are column-wise. 

 > Axis 1: Runs horizontally across the columns. Operations along this axis are row-wise. 

In [None]:
a = np.array([[1,2,3],
              [4,5,6]])

np.sum(a,axis=0) # Column wise 

# output - array([5, 7, 9])

np.sum(a,axis=1)

# output - array([ 6, 15])

`np.all()`

-   The np.all() function in NumPy checks if all elements in an array evaluate to True. If every element meets this condition, the function returns True; otherwise, it returns False. This function is particularly useful for validating that all elements in an array satisfy a specific condition. 

In [10]:
np.all(my_matrix)

True

In [11]:
np.all(my_matrix > 7)

False

In [12]:
np.all(my_matrix > 0)
#expected True

True

### The treatment of missing values in NumPy

In [None]:
import numpy as np
v = np.array([1, 2, 3, 4, np.nan, 6, 7, np.nan, 8, 9])
print(v)

print(np.isnan(v))
print(np.logical_not(np.isnan(v)))


`np.isnan(v)` returns a boolean array of the same shape as `v`, in which each element is True if the corresponding element of `v` is NaN, and False otherwise.
`np.logical_not()` inverts the boolean array, so that True becomes False and vice versa.


In [None]:
v.mean()

In [None]:
s = v[np.logical_not(np.isnan(v))] # filters out the NaNs and creates a new array of the values that are not NaN
print(s)
s = np.sum(v[np.logical_not(np.isnan(v))])
print(f"Sum of array is {s}")
n = v[np.logical_not(np.isnan(v))].size
print(n)
s/n


You can also do:

> See working task02 to refresh this.

#13. Using the following logical operators, you can combine several logical conditions to extract data based on those conditions:
   - `~` NOT

   - `|` OR

   - `&` AND

In [None]:
s = np.sum(v[~(np.isnan(v))])
print(s)

v = np.array([1,4,19,np.nan,18.2,np.nan,np.nan])
v_mean = np.mean(v[~(np.isnan(v))])
print(v_mean)
v_mean = np.nanmean(v)
print(v_mean)

# Perfect built-in for this situation :D

But you should definitely use `numpy.nanmean()`:

> np.nanmean() ignores the NaN values in the array passed to it and computes the mean of the remaining values.
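
`np.nanmean()` has siblings that treat NaN the same way. A short sketch with `np.nansum` and `np.nanmax`, using the same example vector as above:

```python
import numpy as np

v = np.array([1, 4, 19, np.nan, 18.2, np.nan, np.nan])

plain_sum = np.sum(v)         # nan: one missing value spoils the whole sum
valid_sum = np.nansum(v)      # sum of the valid values only
valid_max = np.nanmax(v)
n_valid = np.count_nonzero(~np.isnan(v))   # number of valid values

print(plain_sum, valid_sum, valid_max, n_valid)
```

Dividing `valid_sum` by `n_valid` reproduces what `np.nanmean(v)` returns.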

### 4. Apply a Linear Regression Model w. known coefficients in Numpy:

Please first watch this brief introduction to linear regression analysis (under 10 minutes), which explains the key concepts: https://www.youtube.com/watch?v=zPG4NjIkCjc&list=LL&index=1

A group of ten students took tests A, B, and C. We are now interested in their performance on some test D taken in the end of their first year in college.

To our best knowledge, and knowing what major programs they have enrolled, we judge the test A result to be the most important indicator of their future performance, test B to be a little less relevant, and test C a poor indicator.

We hypothesize that 

- if we put multiplicative weights on the test results, i.e.
- weight test A by $\beta_1$, test B by $\beta_2$, and test C by $\beta_3$,
- and add some constant, say $\beta_0$, to $\beta_1A + \beta_2B + \beta_3C$,
- we could predict their 1st year performance in college by

$$D = \beta_0 + \beta_1A + \beta_2B + \beta_3C$$

Following a convention, we could say that we have a linear regression model of the following form 

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3$$

where $Y$ represents their 1st year performance on test D, while $X_1$, $X_2$, and $X_3$ stand for our A, B, and C test scores. We call $X_1$, $X_2$, and $X_3$ **the predictors**.

Let's assume that we already know the optimal values for $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_3$, and that these values are:

- $\beta_0=7.96$,
- $\beta_1=5.71$,
- $\beta_2=2.23$, and
- $\beta_3=.65$.

We will also say that $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are our **regression coefficients**

Given the following data, how could we make Numpy compute $Y$ - the model predictions - for us?

In [None]:
import pandas as pd
import numpy as np

test_scores = pd.DataFrame({'A':[10, 7, 6, 2, 8, 8, 9, 5, 10, 7],
                            'B':[6, 3, 3, 7, 7, 1, 4, 2, 10, 9],
                            'C':[4, 5, 8, 7, 8, 9, 10, 2, 9, 10]})
display(test_scores)

test_scores.shape

Let's now define a vector of coefficients, $\beta$:

Why Define Coefficients as a Vector?

-   Compact representation: representing the coefficients as a vector lets you express the linear regression model in a concise mathematical form, $Y = X \cdot \beta$.

-   $X$ is the design matrix: a leading column of ones (for the intercept) followed by all the predictor variables (test scores _A, B, C_).

-   $\beta$ is the coefficient vector $(\beta_0, \beta_1, \beta_2, \beta_3)$.

-   $\cdot$ denotes a dot product.

> With a coefficient vector, predictions can be computed for all data points at once using matrix operations, which are computationally efficient.

> If you add more predictors (e.g., test D, E), the coefficient vector simply grows. The formula and computation remain consistent regardless of the number of predictors.

> Numpy is optimized for vectorized operations. Using a vector of coefficients makes it easy to calculate predictions $Y$ for all students simultaneously, rather than computing them one at a time.

In [None]:
import numpy as np
betas = np.array([7.96, 5.71, 2.23, .65]) 
betas


How to Use the Coefficient Vector:

- The model's formula, $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3$,

can be rewritten in matrix form as:

-  __Y__ = __X__ ⋅ $\beta$ 

- where __Y__ is the vector of predictions (the output: Y values for all students), and
- __X__ is the design matrix (its first column is all ones to account for the intercept, $\beta_0$).
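
Written out for $n$ students, the matrix form looks as follows. This is only a notational sketch: $X_{i1}$, $X_{i2}$ and $X_{i3}$ are assumed here to denote the A, B and C scores of student $i$, and $Y_i$ the prediction for that student.

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & X_{13} \\
1 & X_{21} & X_{22} & X_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & X_{n1} & X_{n2} & X_{n3}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
$$

Each row of the product is $Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3}$, which is the model formula applied to one student.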

​
### 1. The intercept ($\beta_0$)

-   What is the intercept? __The intercept, $\beta_0$, is the baseline prediction when all predictors _(X1, X2, X3)_ are zero. It represents the starting value of Y and adjusts the model to better fit the data__. Imagine you are predicting student performance ($Y$) from test scores A, B, and C only. If A = 0, B = 0 and C = 0 (which could happen in a dataset), the equation

$$Y = \beta_0 + \beta_1A + \beta_2B + \beta_3C$$

would simplify to:

$$Y = \beta_0$$

> Thus, $\beta_0$ gives a meaningful prediction even when all input values (predictors) are zero. Without $\beta_0$, the model would be forced to pass through the origin, which is often unrealistic.


### 2. Why np.ones? Why not np.zeros, for example?

> Adding a column of ones to the design matrix ensures that the intercept ($\beta_0$) is included in the dot product during the calculation. Here is why:

-   If you use `np.zeros` instead of `np.ones`, the contribution of $\beta_0$ to the model is always zero, because $0 \cdot \beta_0 = 0$.

-   This effectively eliminates the intercept, which means the model no longer accounts for the baseline prediction.
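
A small sketch of the difference, assuming the same coefficient values as `betas` and the first two rows of `test_scores`. The variable names are illustrative only.

```python
import numpy as np

X = np.array([[10.0, 6.0, 4.0],
              [7.0, 3.0, 5.0]])
beta = np.array([7.96, 5.71, 2.23, 0.65])  # beta_0 is the intercept

n_rows = X.shape[0]
X_ones = np.hstack((np.ones((n_rows, 1)), X))    # intercept column of ones
X_zeros = np.hstack((np.zeros((n_rows, 1)), X))  # intercept column of zeros

pred_ones = X_ones @ beta
pred_zeros = X_zeros @ beta

# With zeros, beta_0 is multiplied by 0 in every row and drops out,
# so the two predictions differ by exactly beta_0.
print(pred_ones - pred_zeros)
```

With the column of zeros the result is the same as multiplying `X` by the last three coefficients only, i.e. a model without an intercept.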

### 3. Dot Product in Linear Regression

-   What is a __Dot__ product?

- The dot product is a fundamental operation in linear algebra. For two vectors:

    a = [a1,a2,a3] and b = [b1,b2,b3]

-   The dot product is defined as:

    a ⋅ b = a1b1 + a2b2 + a3b3

It effectively multiplies corresponding elements and sums the results.
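
The definition can be checked directly in Numpy. This is a minimal sketch with two made-up vectors; `np.dot` and the `@` operator are both standard ways to compute the dot product of 1-D arrays.

```python
import numpy as np

# Two hypothetical vectors, chosen only for illustration.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Definition: multiply corresponding elements, then sum.
manual = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]   # 1*4 + 2*5 + 3*6
elementwise_sum = np.sum(a * b)                    # vectorized version of the same idea
with_dot = np.dot(a, b)
with_operator = a @ b

print(manual, elementwise_sum, with_dot, with_operator)
```

All four expressions evaluate to 32.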


#### 4.1 How is the dot product used in linear regression?

In matrix form, the linear regression model is:

__Y__ = __X__ ⋅ $\beta$

- The dot product computes the linear combination of the predictors with their respective coefficients for all rows (students) in the dataset.

In [None]:
def linear_predict(design_matrix,coeffs):
    # add column of ones for the intercept:
    features_with_intercept = np.hstack((np.ones((design_matrix.shape[0],1)), design_matrix))
    # compute the predictions
    predictions = features_with_intercept @ coeffs
    return predictions
predictions = linear_predict(test_scores,betas)
print(predictions.shape)
print(predictions)

Let's analyse the function above step by step.

Add a column of 1s for $\beta_0$ (the intercept):

In [None]:
np.ones((test_scores.shape[0], 1))

`design_matrix.shape[0]` is the number of rows in the design matrix. It ensures that the added column of ones matches the row count of the original matrix.

The index `[0]` is used because rows are the first dimension of a two-dimensional Numpy array: the `.shape` tuple lists the dimensions in the order (rows, columns).


### Main points here:

-   `design_matrix.shape[0]` ensures that the column of ones has the same number of rows as `design_matrix`.

-   The `1` in the shape `(design_matrix.shape[0], 1)` specifies that the array has exactly one column.
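
The shapes involved can be inspected on a small stand-in matrix. The array `m` below is hypothetical (10 rows, 3 columns, like `test_scores`), and the other names are illustrative only.

```python
import numpy as np

# Hypothetical 10 x 3 matrix standing in for the design matrix.
m = np.arange(30).reshape(10, 3)
print(m.shape)             # (10, 3): (rows, columns)
print(m.shape[0])          # 10: number of rows

# A 2-D column of ones with one row per row of m.
ones_col = np.ones((m.shape[0], 1))
print(ones_col.shape)      # (10, 1)

# Stacking horizontally adds one column and keeps the row count.
augmented = np.hstack((ones_col, m))
print(augmented.shape)     # (10, 4)
```

Note that `np.ones(m.shape[0])` without the trailing `1` would give a 1-D array of shape `(10,)`, which is not a column; that is why the shape is passed as the tuple `(rows, 1)`.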

Augmented design matrix:

In [None]:
aug_features = np.hstack((np.ones((test_scores.shape[0], 1)), test_scores))
print(aug_features)

The regression coefficients:

In [None]:
betas

Prediction of D for the first student (first row in `test_scores`):

In [None]:
7.96 + 5.71*10 + 2.23*6 + .65*4

Key part: adding the column of ones.

`np.ones((design_matrix.shape[0], 1))` creates the column of ones that corresponds to the intercept ($\beta_0$) in the linear regression model.

-   `design_matrix.shape` returns the shape of the matrix as a tuple: (number of rows, number of columns).

-   `design_matrix.shape[0]` gives the number of rows in the matrix, i.e., the number of data points (students, in our case).

-   For the `test_scores` DataFrame defined above, `design_matrix.shape[0]` returns 10 because there are 10 rows (students).

-   `np.ones((design_matrix.shape[0], 1))` creates a column of ones with the same number of rows as `design_matrix`. The shape of this column is (number of rows, 1). For our example, it creates:


[[1.],
 [1.],
 [1.],
 [1.],
 [1.],
 [1.],
 [1.],
 [1.],
 [1.],
 [1.]]

Exercise:

In [37]:
import pandas as pd
import numpy as np

test_scores = pd.DataFrame({'A':[10, 7, 6, 2, 8, 8, 9, 5, 10, 7],
                            'B':[6, 3, 3, 7, 7, 1, 4, 2, 10, 9],
                            'C':[4, 5, 8, 7, 8, 9, 10, 2, 9, 10]})
display(test_scores)

test_scores.shape

def lin_pred(matrix_model,coeff):
    matrix_model_plus_ones = np.hstack((np.ones((matrix_model.shape[0],1)), matrix_model))
    linear_reg = matrix_model_plus_ones @ coeff
    return linear_reg

lin_pred(test_scores,betas)


Unnamed: 0,A,B,C
0,10,6,4
1,7,3,5
2,6,3,8
3,2,7,7
4,8,7,8
5,8,1,9
6,9,4,10
7,5,2,2
8,10,10,9
9,7,9,10


array([81.04, 57.87, 54.11, 39.54, 74.45, 61.72, 74.77, 42.27, 93.21,
       74.5 ])

### 5. Numpy and Pandas

In [13]:
import pandas as pd
import numpy as np
import os
# build the path from its parts so it works on any operating system
data_set = pd.read_csv(os.path.join("_data", "MovieRatings.csv"), index_col=0)

data_set.head()

Unnamed: 0,FILM,RottenTomatoes,Metacritic,IMDB,Fandango_Stars
0,Avengers: Age of Ultron (2015),74,66,7.8,5.0
1,Cinderella (2015),85,67,7.1,5.0
2,Ant-Man (2015),80,64,7.8,5.0
3,Do You Believe? (2015),18,22,5.4,5.0
4,Hot Tub Time Machine 2 (2015),14,29,5.1,3.5


In [21]:
rt = data_set.iloc[:,1]
rt.head(5)
rt[0:10]

0    74
1    85
2    80
3    18
4    14
5    63
6    42
7    86
8    99
9    89
Name: RottenTomatoes, dtype: int64

In [25]:
rt.mean()

60.84931506849315

In [26]:
rt.median()

63.5

In [27]:
rt.var()

910.1564478034954

In [28]:
rt.std()

30.168799243647324
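
One detail worth knowing when moving between the two libraries: by default, the Pandas methods `.var()` and `.std()` compute the sample statistics (`ddof=1`), while `np.var` and `np.std` compute the population statistics (`ddof=0`). The sketch below uses only the first ten RottenTomatoes values printed above, so its numbers will not match the full-column results; `rt_sample` and `values` are names introduced for this illustration.

```python
import numpy as np
import pandas as pd

# First ten RottenTomatoes values from the output above (not the full column).
rt_sample = pd.Series([74, 85, 80, 18, 14, 63, 42, 86, 99, 89])
values = rt_sample.to_numpy()

# Pandas default (ddof=1) matches Numpy only when ddof=1 is passed explicitly.
print(rt_sample.var(), np.var(values, ddof=1))
print(rt_sample.std(), np.std(values, ddof=1))

# Numpy default (ddof=0) divides by n instead of n - 1, so it is smaller.
print(np.var(values), np.std(values))
```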

In [35]:
rt + 1

0       75
1       86
2       81
3       19
4       15
      ... 
141     88
142     98
143     98
144    101
145     88
Name: RottenTomatoes, Length: 146, dtype: int64

You can do that in the DataFrame directly:

In [36]:
print(data_set.head(5))
data_set['RottenTomatoes'] = data_set["RottenTomatoes"] + 1
print(data_set.head(5))


                             FILM  RottenTomatoes  Metacritic  IMDB  \
0  Avengers: Age of Ultron (2015)              74          66   7.8   
1               Cinderella (2015)              85          67   7.1   
2                  Ant-Man (2015)              80          64   7.8   
3          Do You Believe? (2015)              18          22   5.4   
4   Hot Tub Time Machine 2 (2015)              14          29   5.1   

   Fandango_Stars  
0             5.0  
1             5.0  
2             5.0  
3             5.0  
4             3.5  
                             FILM  RottenTomatoes  Metacritic  IMDB  \
0  Avengers: Age of Ultron (2015)              75          66   7.8   
1               Cinderella (2015)              86          67   7.1   
2                  Ant-Man (2015)              81          64   7.8   
3          Do You Believe? (2015)              19          22   5.4   
4   Hot Tub Time Machine 2 (2015)              15          29   5.1   

   Fandango_Stars  
0     

In [37]:
type(rt)

pandas.core.series.Series
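
A Pandas Series keeps its values in an array that can be retrieved with `.to_numpy()`, and Numpy functions can be applied to the Series directly. A minimal sketch, again using only the first ten RottenTomatoes values shown earlier (`rt_sample`, `values` and `roots` are illustrative names):

```python
import numpy as np
import pandas as pd

rt_sample = pd.Series([74, 85, 80, 18, 14, 63, 42, 86, 99, 89],
                      name="RottenTomatoes")

# Extract the underlying values as a Numpy array.
values = rt_sample.to_numpy()
print(type(values))

# Numpy universal functions accept a Series and return a Series,
# keeping the index and the name.
roots = np.sqrt(rt_sample)
print(type(roots))
print(roots.head(3))

# Vector arithmetic gives the same numbers either way.
print((rt_sample + 1).to_numpy())
print(values + 1)
```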