<a href="https://colab.research.google.com/github/AvinashShrikhande/Python_For_DataScience/blob/main/NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **A Quick Introduction to Numerical Data Manipulation with Python and NumPy**

### **What is NumPy?**

NumPy stands for Numerical Python. It's the backbone of scientific and numerical computing in Python.

And since machine learning is all about turning data into numbers and then figuring out the patterns, NumPy often comes into play.

### Why NumPy?

You can do numerical calculations using pure Python. At first, pure Python may seem fast enough, but once your data gets large, you'll start to notice slowdowns.

One of the main reasons to use NumPy is that it's fast. Behind the scenes, the core routines are written in optimized C, a compiled language that runs numerical loops much faster than interpreted Python.

The benefit of this being behind the scenes is you don't need to know any C to take advantage of it. You can write your numerical computations in Python using NumPy and get the added speed benefits.

If you're curious about what causes this speed benefit, it's a process called vectorization. Vectorization performs a calculation on a whole array at once instead of looping over the elements one by one in Python, since those loops can become bottlenecks.

NumPy extends vectorized operations to arrays of different shapes through a mechanism called broadcasting.
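
To make the idea of vectorization concrete, here is a minimal sketch comparing a Python loop with the equivalent whole-array expression. The array `values` and the variable names are made up for illustration.

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Loop version: square each element one at a time in Python
squares_loop = []
for v in values:
    squares_loop.append(int(v) ** 2)

# Vectorized version: one expression applied to the whole array at once
squares_vectorized = values ** 2

print(squares_loop)        # [1, 4, 9, 16, 25]
print(squares_vectorized)  # [ 1  4  9 16 25]
```

Both versions give the same numbers; the vectorized one hands the loop to NumPy's compiled code, which is where the speed-up on large arrays comes from.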

### **0. Importing NumPy**

In [2]:
import numpy as np


### **1. DataTypes and attributes**

NOTE: The main type in NumPy is ndarray. Even seemingly different kinds of arrays are still ndarrays, which means an operation that works on one array will generally work on another.

In [None]:
# 1-dimensional array, also referred to as a vector
a1 = np.array([1, 2, 3])

# 2-dimensional array, also referred to as matrix
a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

# 3-dimensional array (here, a stack of two 3x3 matrices)
a3 = np.array([[[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]],
                [[10, 11, 12],
                 [13, 14, 15],
                 [16, 17, 18]]])

In [None]:
a1.shape, a1.ndim, a1.dtype, a1.size, type(a1)

((3,), 1, dtype('int64'), 3, numpy.ndarray)

In [None]:
a2.shape, a2.ndim, a2.dtype, a2.size, type(a2)

((2, 3), 2, dtype('float64'), 6, numpy.ndarray)

In [None]:
a3.shape, a3.ndim, a3.dtype, a3.size, type(a3)

((2, 3, 3), 3, dtype('int64'), 18, numpy.ndarray)

In [None]:
a1

array([1, 2, 3])

In [None]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [None]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])


### **Anatomy of an array**

Key terms:

*   Array - A list of numbers, can be multi-dimensional.
*   Scalar - A single number (e.g. 7).
*   Vector - A list of numbers with 1 dimension (e.g. np.array([1, 2, 3])).
*   Matrix - A (usually) multi-dimensional list of numbers (e.g. np.array([[1, 2, 3], [4, 5, 6]])).


### **pandas DataFrame out of NumPy arrays**

This illustrates how NumPy is the backbone of many other libraries.



In [None]:
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 3)), 
                                    columns=['a', 'b', 'c'])
df

   a  b  c
0  7  3  4
1  8  6  8
2  3  4  1
3  1  4  9
4  4  7  4


In [None]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [None]:
df2 = pd.DataFrame(a2)
df2

     0    1    2
0  1.0  2.0  3.3
1  4.0  5.0  6.5


### **2. Creating arrays**



*   np.array()
*   np.ones()
*   np.zeros()
*   np.random.rand(5, 3)
*   np.random.randint(10, size=5)
*   np.random.seed() - pseudo random numbers
*   Searching the documentation example (finding np.unique() and using it)

In [None]:
# Create a simple array
simple_array = np.array([1, 2, 3])
simple_array

array([1, 2, 3])

In [None]:
simple_array = np.array((1, 2, 3))
simple_array, simple_array.dtype

(array([1, 2, 3]), dtype('int64'))

In [None]:
# Create an array of ones
ones = np.ones((10, 2))
ones

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [None]:
# The default datatype is 'float64'
ones.dtype

dtype('float64')

In [None]:
# You can change the datatype with .astype()
ones.astype(int)

array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])

In [None]:
# Create an array of zeros
zeros = np.zeros((5, 3, 3))
zeros

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

In [None]:
zeros.dtype

dtype('float64')

In [None]:
# Create an array within a range of values
range_array = np.arange(0, 10, 2)
range_array

array([0, 2, 4, 6, 8])

In [None]:
# Random array
random_array = np.random.randint(10, size=(5, 3))
random_array

array([[9, 3, 1],
       [0, 4, 8],
       [3, 8, 2],
       [3, 9, 2],
       [4, 0, 5]])

In [None]:
# Random array of floats (between 0 & 1)
np.random.random((5, 3))

array([[0.57316283, 0.13848311, 0.43716189],
       [0.37465423, 0.72535326, 0.17897145],
       [0.66077011, 0.79386231, 0.69283944],
       [0.76291165, 0.64921025, 0.89235129],
       [0.79805494, 0.31915385, 0.09419216]])

In [None]:
np.random.random((5, 3))

array([[0.71340282, 0.68224087, 0.09370401],
       [0.2699475 , 0.6872088 , 0.82141698],
       [0.19160835, 0.87218561, 0.1659593 ],
       [0.092336  , 0.52296592, 0.43855169],
       [0.88186459, 0.67527228, 0.36989894]])

In [None]:
# Random 5x3 array of floats (between 0 & 1), similar to above
np.random.rand(5, 3)

array([[0.80235839, 0.648698  , 0.52887164],
       [0.6857698 , 0.74016951, 0.48412763],
       [0.06838505, 0.09818475, 0.59157726],
       [0.11978153, 0.11918818, 0.298131  ],
       [0.64571443, 0.19462665, 0.62968663]])

In [None]:
np.random.rand(5, 3)

array([[0.60288132, 0.21149634, 0.22361013],
       [0.18594674, 0.80601252, 0.0959047 ],
       [0.32337443, 0.95785006, 0.83687221],
       [0.13189855, 0.80356297, 0.71041159],
       [0.42639587, 0.84626526, 0.77462847]])

NumPy generates pseudo-random numbers: they look random, but they're produced by a deterministic algorithm, so the same starting point (the seed) always yields the same sequence.

For reproducibility, you might want the random numbers you generate to be the same each time you run an experiment.

To do this, you can use np.random.seed().

What this does is tell NumPy, "Hey, I want you to create random numbers, but start from this seed so they come out the same every time."

Let's see it.

In [None]:
# Set random seed to 0
np.random.seed(0)

# Make 'random' numbers
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])


With np.random.seed() set, every time you run the cell above, the same random numbers will be generated.

What if np.random.seed() wasn't set?

Every time you run the cell below, a new set of numbers will appear.

In [None]:
# Make more random numbers
np.random.randint(10, size=(5, 3))

array([[2, 3, 8],
       [1, 3, 3],
       [3, 7, 0],
       [1, 9, 9],
       [0, 4, 7]])


Let's see it in action again. We'll stay consistent and set the random seed to 0.

In [None]:
# Set random seed to same number as above
np.random.seed(0)

# The same random numbers come out
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])

Because np.random.seed() is set to 0 again, the random numbers are the same as in the earlier cell that also set the seed to 0.

Setting np.random.seed() isn't strictly necessary, but it's helpful for keeping numbers the same throughout your experiments.

For example, say you wanted to split your data randomly into training and test sets.

Every time you randomly split, you might get different rows in each set.

If you shared your work with someone else, they'd get different rows in each set too.

Setting np.random.seed() keeps the randomness but makes it repeatable, which is why the numbers are called 'pseudo-random'.
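
As a rough sketch of the train/test idea, the snippet below shuffles row indices with a fixed seed and splits them. The helper `split_indices` and its parameters are hypothetical names made up for this illustration, not part of NumPy.

```python
import numpy as np

def split_indices(n_rows, n_train, seed):
    """Hypothetical helper: shuffle row indices with a fixed seed, then split."""
    np.random.seed(seed)                      # fix the starting point
    shuffled = np.random.permutation(n_rows)  # shuffled copy of 0..n_rows-1
    return shuffled[:n_train], shuffled[n_train:]

train_a, test_a = split_indices(n_rows=10, n_train=8, seed=0)
train_b, test_b = split_indices(n_rows=10, n_train=8, seed=0)

# Same seed, same split, no matter who runs it
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))  # True True
```

Anyone who runs this with the same seed should get the same rows in each set, which is what makes a shared experiment repeatable.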

In [None]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
df

   0  1  2
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1


### **What unique values are in the array a3?**

Now that you've seen a few different ways to create arrays, try this exercise: find out which NumPy function you could use to find the unique values within the a3 array.

You might want to search for something like, "how to find the unique values in a numpy array".
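
One possible answer, shown as a small sketch, is np.unique(), which returns the sorted distinct values of an array. Here a3 is re-created so the snippet runs on its own, and `repeats` is a made-up array containing duplicates.

```python
import numpy as np

a3 = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
               [[10, 11, 12], [13, 14, 15], [16, 17, 18]]])

# np.unique flattens the input and returns the sorted distinct values
unique_values = np.unique(a3)
print(unique_values)

# With duplicates present, each value appears only once in the result
repeats = np.array([3, 1, 3, 2, 1])
print(np.unique(repeats))  # [1 2 3]
```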

### **3. Viewing arrays and matrices (indexing)**

Remember, because arrays and matrices are both ndarrays, they can be viewed in similar ways.

Let's check out our 3 arrays again.

In [None]:
a1

array([1, 2, 3])

In [None]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [None]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

In [None]:
a1[0]

1

In [None]:
a2[0]

array([1. , 2. , 3.3])

In [None]:
a3[0]

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [None]:
# Get 2nd row (index 1) of a2
a2[1]

array([4. , 5. , 6.5])

In [None]:
# Get the first 2 values of the first 2 rows of both arrays
a3[:2, :2, :2]

array([[[ 1,  2],
        [ 4,  5]],

       [[10, 11],
        [13, 14]]])

This takes a bit of practice, especially when the dimensions get higher. Usually, it takes me a little trial and error of trying to get certain values, viewing the output in the notebook and trying again.

NumPy arrays get printed from outside to inside. The first number in the shape is how many outermost blocks there are, and the last number in the shape is the length of the innermost rows of numbers.

In [None]:
a4 = np.random.randint(10, size=(2, 3, 4, 5))
a4

array([[[[4, 2, 0, 6, 6],
         [5, 9, 6, 0, 8],
         [3, 6, 2, 4, 8],
         [1, 1, 6, 8, 8]],

        [[0, 2, 5, 4, 4],
         [8, 9, 3, 6, 1],
         [9, 4, 6, 2, 8],
         [9, 4, 1, 9, 9]],

        [[6, 1, 6, 2, 9],
         [4, 3, 8, 8, 4],
         [9, 6, 3, 9, 8],
         [7, 1, 6, 2, 6]]],


       [[[8, 9, 1, 4, 6],
         [8, 9, 5, 6, 5],
         [5, 2, 7, 0, 6],
         [6, 7, 0, 7, 9]],

        [[4, 6, 6, 8, 8],
         [7, 5, 7, 8, 0],
         [7, 1, 2, 8, 8],
         [3, 3, 9, 7, 0]],

        [[4, 0, 4, 6, 4],
         [8, 8, 4, 9, 4],
         [8, 8, 4, 3, 7],
         [7, 8, 4, 7, 4]]]])

In [None]:
a4.shape

(2, 3, 4, 5)

In [None]:
# Get only the first 4 numbers of each single vector
a4[:, :, :, :4]

array([[[[4, 2, 0, 6],
         [5, 9, 6, 0],
         [3, 6, 2, 4],
         [1, 1, 6, 8]],

        [[0, 2, 5, 4],
         [8, 9, 3, 6],
         [9, 4, 6, 2],
         [9, 4, 1, 9]],

        [[6, 1, 6, 2],
         [4, 3, 8, 8],
         [9, 6, 3, 9],
         [7, 1, 6, 2]]],


       [[[8, 9, 1, 4],
         [8, 9, 5, 6],
         [5, 2, 7, 0],
         [6, 7, 0, 7]],

        [[4, 6, 6, 8],
         [7, 5, 7, 8],
         [7, 1, 2, 8],
         [3, 3, 9, 7]],

        [[4, 0, 4, 6],
         [8, 8, 4, 9],
         [8, 8, 4, 3],
         [7, 8, 4, 7]]]])

a4's shape is (2, 3, 4, 5), which means it gets displayed like so:

*   Innermost array = size 5
*   Next array = size 4
*   Next array = size 3
*   Outermost array = size 2
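
A quick way to check this nesting is to step inwards one level at a time with len(). The sketch below uses a made-up array of zeros, `a4_like`, with the same shape as a4.

```python
import numpy as np

a4_like = np.zeros((2, 3, 4, 5))  # same shape as a4, values don't matter here

print(len(a4_like))           # 2 (outermost)
print(len(a4_like[0]))        # 3
print(len(a4_like[0][0]))     # 4
print(len(a4_like[0][0][0]))  # 5 (innermost)

# Slicing the last axis only shortens the innermost arrays
print(a4_like[:, :, :, :4].shape)  # (2, 3, 4, 4)
```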

### **4. Manipulating and comparing arrays**



*   Arithmetic
        +, -, *, /, //, **, %
        np.exp()
        np.log()
        Dot product - np.dot()
        Broadcasting
*   Aggregation
        np.sum() - faster than Python's built-in sum() on NumPy arrays
        np.mean()
        np.std()
        np.var()
        np.min()
        np.max()
        np.argmin() - find index of minimum value
        np.argmax() - find index of maximum value
        These work on all ndarrays
        a4.min(axis=0) -- you can use axis as well

*  Reshaping
        np.reshape()
*  Transposing
        a3.T
*  Comparison operators
        >
        <
        <=
        >=
        x != 3
        x == 3
        np.sum(x > 3)

### **Arithmetic**

In [None]:
a1

array([1, 2, 3])

In [None]:
ones = np.ones(3)
ones

array([1., 1., 1.])

In [None]:
# Add two arrays
a1 + ones

array([2., 3., 4.])

In [None]:
# Subtract two arrays
a1 - ones

array([0., 1., 2.])

In [None]:
# Multiply two arrays
a1 * ones

array([1., 2., 3.])

In [None]:
# Multiply two arrays of different shapes (a1 is broadcast across each row of a2)
a1 * a2

array([[ 1. ,  4. ,  9.9],
       [ 4. , 10. , 19.5]])

In [None]:
a1.shape, a2.shape

((3,), (2, 3))

In [None]:
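# Raises a ValueError because shapes (2, 3) and (2, 3, 3) can't be broadcast together
a2 * a3

ValueError: operands could not be broadcast together with shapes (2,3) (2,3,3)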

### **Broadcasting**

*   What is broadcasting?
      Broadcasting is a NumPy feature that performs an operation across arrays of different shapes without making copies of the data, which saves time and memory. For example, if you have a 3x3 array (A) and add a 1x3 array (B), NumPy adds the row of (B) to every row of (A).
*   Rules of broadcasting
      1. If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
      2. If the shapes of the two arrays do not match in some dimension, the array whose size is 1 in that dimension is stretched to match the other shape.
      3. If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

**The broadcasting rule:** Two arrays can be broadcast together when, comparing their shapes from the trailing (rightmost) axis backwards, each pair of axis sizes is either equal or one of them is 1.
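
The rules above can be sketched with the 3x3 plus 1x3 example from the text. The arrays `A`, `B`, `column` and `bad` are made up for illustration.

```python
import numpy as np

A = np.arange(9).reshape(3, 3)   # shape (3, 3)
B = np.array([[10, 20, 30]])     # shape (1, 3)

# Rule 2: B has size 1 on the first axis, so its row is reused for every row of A
result = A + B
print(result.shape)  # (3, 3)

# A (3, 1) column is stretched across the columns instead
column = np.array([[1], [2], [3]])
print((A + column).shape)  # (3, 3)

# Rule 3: trailing sizes 3 and 2 disagree and neither is 1, so this fails
bad = np.ones((3, 2))
failed = False
try:
    A + bad
except ValueError as error:
    failed = True
    print("Broadcast failed:", error)
```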


In [None]:
a1

array([1, 2, 3])

In [None]:
a1.shape

In [None]:
a2.shape

In [None]:
a2

In [None]:
a1 + a2

In [None]:
a2 + 2

In [None]:
# Raises an error because there's a shape mismatch
a2 + a3

In [None]:
# Divide two arrays
a1 / ones

In [None]:
# Divide using floor division
a2 // a1

In [None]:
# Take an array to a power
a1 ** 2

In [None]:
# You can also use np.square()
np.square(a1) 

In [None]:
# Modulus divide (what's the remainder)
a1 % 2  

You can also find the log or exponential of an array using np.log() and np.exp().

In [None]:

# Find the log of an array
np.log(a1)

In [None]:
# Find the exponential of an array
np.exp(a1)

### **Aggregation**
Aggregation means bringing many values together into a single summary value, such as a sum, a mean or a maximum.

In [None]:
sum(a1)

In [None]:
np.sum(a1)

Use NumPy's np.sum() on NumPy arrays and Python's sum() on Python lists.

In [None]:
massive_array = np.random.random(100000)
massive_array.size

In [None]:
%timeit sum(massive_array) # Python sum()
%timeit np.sum(massive_array) # NumPy np.sum()

In [None]:
import random 
massive_list = [random.randint(0, 10) for i in range(100000)]
len(massive_list) 

In [None]:
massive_list[:10]

In [None]:
%timeit sum(massive_list)
%timeit np.sum(massive_list)

In [None]:
a2

In [None]:
# Find the mean
np.mean(a2)

In [None]:
# Find the max
np.max(a2)

In [None]:
# Find the min
np.min(a2)

In [None]:
# Find the standard deviation
np.std(a2)

In [None]:
# Find the variance
np.var(a2)

In [None]:
# The standard deviation is the square root of the variance
np.sqrt(np.var(a2))

**What's the mean?**

The mean is the same as the average. You can find it by adding a set of numbers up and dividing by how many there are.

**What's standard deviation?**

Standard deviation is a measure of how spread out numbers are.

**What's variance?**

The [variance](https://www.mathsisfun.com/data/standard-deviation.html) is the average of the squared differences from the mean.

To work it out, you:

1. Work out the mean
2. For each number, subtract the mean and square the result
3. Find the average of the squared differences
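
The three steps can be followed by hand and compared with NumPy's result. This is a small sketch on a made-up array, `numbers`; note that np.var() divides by the number of values by default (the population variance).

```python
import numpy as np

numbers = np.array([2, 4, 6, 8, 10])

# 1. Work out the mean
mean = numbers.sum() / numbers.size             # 6.0

# 2. For each number, subtract the mean and square the result
squared_diffs = (numbers - mean) ** 2           # [16.  4.  0.  4. 16.]

# 3. Find the average of the squared differences
variance = squared_diffs.sum() / numbers.size   # 8.0

# The standard deviation is the square root of the variance
std = variance ** 0.5

print(variance, np.var(numbers))         # 8.0 8.0
print(np.isclose(std, np.std(numbers)))  # True
```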

      

In [8]:
# Demo of variance
high_var_array = np.array([1, 100, 200, 300, 4000, 5000])
low_var_array = np.array([2, 4, 6, 8, 10])

np.var(high_var_array), np.var(low_var_array)

(4296133.472222221, 8.0)

In [9]:
np.std(high_var_array), np.std(low_var_array)

(2072.711623024829, 2.8284271247461903)

In [10]:
# The standard deviation is the square root of the variance
np.sqrt(np.var(high_var_array))

2072.711623024829

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(high_var_array)
plt.show()

In [None]:
plt.hist(low_var_array)
plt.show()

### **Reshaping**

In [None]:
a2

In [None]:
a2.shape

In [None]:
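# Raises a ValueError: shapes (2, 3) and (2, 3, 3) can't be broadcast together
a2 + a3

In [None]:
# Add a trailing axis of size 1 so the shape becomes (2, 3, 1)
a2.reshape(2, 3, 1)

In [None]:
# Now broadcasting works: (2, 3, 1) + (2, 3, 3) gives shape (2, 3, 3)
a2.reshape(2, 3, 1) + a3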

### **Transpose**

In [None]:
a2.shape

In [None]:
# Transpose swaps the axes: shape (2, 3) becomes (3, 2)
a2.T

In [None]:
a2.T.shape

In [None]:
matrix = np.random.random(size=(5,3,3))
matrix

In [None]:
matrix.shape

In [None]:
# For a 3-D array, .T reverses the axis order: shape (5, 3, 3) becomes (3, 3, 5)
matrix.T

In [None]:
matrix.T.shape

### **Dot**


*   TODO - create graphic for dot versus element-wise also known as Hadamard product
*   TODO - why would someone use dot versus element-wise?
*   A dot product models real-world problems well; it's a way of combining and finding patterns between two sets of data
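
To show the difference between the two kinds of multiplication, here is a minimal sketch on two made-up 2x2 arrays, `A` and `B`.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Element-wise (Hadamard) product: multiply matching positions
elementwise = A * B
print(elementwise.tolist())  # [[5, 12], [21, 32]]

# Dot product: each entry is a row of A multiplied with a column of B, then summed
# e.g. top-left = 1*5 + 2*7 = 19
dot = np.dot(A, B)
print(dot.tolist())  # [[19, 22], [43, 50]]
```

The element-wise product needs the shapes to match (or broadcast), while the dot product needs the number of columns of the first array to equal the number of rows of the second.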




In [None]:
np.random.seed(0)
mat1 = np.random.randint(10, size=(3, 3))
mat2 = np.random.randint(10, size=(3, 2))

mat1.shape, mat2.shape

In [None]:
mat1

In [None]:
mat2

In [None]:
np.dot(mat1, mat2)

In [None]:
np.random.seed(0)
mat3 = np.random.randint(10, size=(4,3))
mat4 = np.random.randint(10, size=(4,3))
mat3

In [None]:
mat4

In [None]:
# Raises a ValueError: for shapes (4, 3) and (4, 3) the inner dimensions (3 and 4) don't match
np.dot(mat3, mat4)

In [None]:
mat3.T.shape

In [None]:
# Dot product after transposing mat3: (3, 4) dot (4, 3) gives shape (3, 3)
np.dot(mat3.T, mat4)

In [None]:
# Element-wise multiplication, also known as Hadamard product
mat3 * mat4

### **Dot product practical example, nut butter sales**

In [None]:
np.random.seed(0)
sales_amounts = np.random.randint(20, size=(5, 3))
sales_amounts

In [None]:
weekly_sales = pd.DataFrame(sales_amounts,
                            index=["Mon", "Tues", "Wed", "Thurs", "Fri"],
                            columns=["Almond butter", "Peanut butter", "Cashew butter"])
weekly_sales

In [None]:
prices = np.array([10, 8, 12])
prices

In [None]:
butter_prices = pd.DataFrame(prices.reshape(1, 3),
                             index=["Price"],
                             columns=["Almond butter", "Peanut butter", "Cashew butter"])
butter_prices.shape

In [None]:
weekly_sales.shape

In [None]:
# Find the total sales for each day (raises a ValueError: shapes (3,) and (5, 3) aren't aligned)
total_sales = prices.dot(sales_amounts)
total_sales

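The shapes aren't aligned: prices has shape (3,) and sales_amounts has shape (5, 3). For a dot product, the inner (middle) two numbers must be the same, as in (1, 3) and (3, 5).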
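
The alignment requirement can be sketched on a smaller, made-up example: two days of sales instead of five. The values in `sales` are invented for illustration.

```python
import numpy as np

prices = np.array([10, 8, 12])   # shape (3,): one price per product
sales = np.array([[2, 7, 1],
                  [9, 4, 16]])   # shape (2, 3): 2 days, 3 products

# (3,) dot (2, 3): inner sizes are 3 and 2, so this is not aligned
misaligned = False
try:
    prices.dot(sales)
except ValueError:
    misaligned = True

# (3,) dot (3, 2): transposing makes the inner sizes match
totals = prices.dot(sales.T)
print(totals)  # [ 88 314]

# Equivalent: (2, 3) dot (3,) also lines up without a transpose
print(np.array_equal(totals, sales.dot(prices)))  # True
```

Day one works out as 2*10 + 7*8 + 1*12 = 88, which matches the first entry.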

In [None]:
prices

In [None]:
sales_amounts.T.shape

In [None]:
# To make the middle numbers the same, we can transpose
total_sales = prices.dot(sales_amounts.T)
total_sales

In [None]:
butter_prices.shape, weekly_sales.shape

In [None]:
daily_sales = butter_prices.dot(weekly_sales.T)
daily_sales

In [None]:
# Need to transpose again
weekly_sales["Total"] = daily_sales.T
weekly_sales

### **Comparison operators**


In [None]:
a1

In [None]:
a2

In [None]:
# Element-wise comparison (a1 is broadcast across each row of a2)
a1 > a2

In [None]:
a1 >= a2

In [None]:
a1 > 5

In [None]:
a1 == a1

In [None]:
a1 == a2
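
Comparisons return arrays of True and False values, and since True counts as 1, np.sum() on the result tells you how many elements passed the test. A small sketch with a made-up array `x`:

```python
import numpy as np

x = np.array([1, 5, 3, 7])

# Each element is compared with 3, giving a boolean array of the same shape
mask = x > 3
print(mask)  # [False  True False  True]

# True counts as 1 and False as 0, so the sum is the number of matches
count = np.sum(mask)
print(count)  # 2
```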

### **5. Sorting arrays**


*   np.sort()
*   np.argsort()
*   np.argmax()
*   np.argmin()




In [None]:
random_array

In [None]:
# Sort the values within each row (np.sort works along the last axis by default)
np.sort(random_array)

In [None]:
np.argsort(random_array)

In [None]:
a1

In [None]:
# Return the indices that would sort an array
np.argsort(a1)

In [None]:
# No axis
np.argmin(a1)

In [None]:
random_array

In [None]:
# Across the horizontal: index of the maximum within each row
np.argmax(random_array, axis=1)

In [None]:
# Down the vertical: index of the minimum within each column
np.argmin(random_array, axis=0)
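
Because random_array changes between runs, here is a sketch of the same calls on a small fixed array, `arr`, made up so the results can be checked by eye.

```python
import numpy as np

arr = np.array([[9, 3, 1],
                [0, 4, 8],
                [3, 8, 2]])

# Sort the values within each row
print(np.sort(arr).tolist())     # [[1, 3, 9], [0, 4, 8], [2, 3, 8]]

# Indices that would sort each row
print(np.argsort(arr).tolist())  # [[2, 1, 0], [0, 1, 2], [2, 0, 1]]

# axis=1: position of the maximum within each row
print(np.argmax(arr, axis=1))    # [0 2 1]

# axis=0: position of the minimum within each column
print(np.argmin(arr, axis=0))    # [1 0 0]
```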

### **6. Use case**
  Turning an image of a panda into numbers.

In [None]:
from matplotlib.image import imread

panda = imread('../images/numpy-panda.png')
print(type(panda))

In [None]:
panda.shape

In [None]:
panda

In [None]:
car = imread("../images/numpy-car-photo.png")
car.shape

In [None]:
# Keep only the first 3 channels (e.g. red, green, blue, dropping an alpha channel if present)
car[:,:,:3].shape

In [None]:
dog = imread("../images/numpy-dog-photo.png")
dog.shape

In [None]:
dog
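
If the image files above aren't available, the same idea can be sketched with a tiny made-up image. The layout assumed here (height, width, channels with float values between 0 and 1) is what imread typically returns for PNG files, but treat that as an assumption about your particular images.

```python
import numpy as np

# A made-up 2x2 RGBA "image": height 2, width 2, 4 channels per pixel
image = np.zeros((2, 2, 4), dtype=np.float32)
image[..., 3] = 1.0           # alpha channel: every pixel fully opaque
image[0, 0, :3] = [1, 0, 0]   # top-left pixel: pure red

# Drop the alpha channel to keep only red, green and blue
rgb = image[:, :, :3]

print(image.shape, rgb.shape)  # (2, 2, 4) (2, 2, 3)
print(rgb[0, 0])               # [1. 0. 0.]
```

An image is just an ndarray of numbers, so everything covered above (indexing, slicing, aggregation) applies to it as well.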