<a href="https://colab.research.google.com/github/AvinashShrikhande/Python_For_DataScience/blob/main/NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **A Quick Introduction to Numerical Data Manipulation with Python and NumPy**

### **What is NumPy?**

NumPy stands for Numerical Python. It's the backbone of much of the scientific and numerical computing done in Python.

And since machine learning is all about turning data into numbers and then figuring out the patterns, NumPy often comes into play.

### Why NumPy?

You can do numerical calculations using pure Python. At first, pure Python may seem fast enough, but once your data gets large, you'll start to notice slowdowns.

One of the main reasons to use NumPy is that it's fast. Behind the scenes, much of the code has been optimized to run in C, a compiled programming language that can do numerical work much faster than Python.

The benefit of this being behind the scenes is you don't need to know any C to take advantage of it. You can write your numerical computations in Python using NumPy and get the added speed benefits.

If you're curious about what causes this speed benefit, it's a technique called vectorization. Vectorization aims to do calculations without explicit Python loops, since loops can create bottlenecks.

NumPy extends vectorized operations to arrays of different shapes through a mechanism called broadcasting.
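
To make the idea concrete, here is a minimal sketch comparing an explicit Python loop with the equivalent vectorized expression, followed by two small broadcasting cases. The variable names and sizes are made up for illustration and are not part of the original notebook.

```python
import numpy as np

values = list(range(1_000))

# Pure-Python approach: an explicit loop over every element
squares_loop = [v * v for v in values]

# Vectorized approach: one expression, the loop runs inside NumPy's compiled code
arr = np.array(values)
squares_vec = arr * arr

# Broadcasting with a scalar: 10 is applied to every element of arr
shifted = arr + 10

# Broadcasting between compatible shapes:
# a (3, 1) column combined with a (4,) row gives a (3, 4) grid
column = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
grid = column + row

print(squares_vec[:5], shifted[:3], grid.shape)
```

Both approaches give the same squares. The vectorized version is usually shorter to write and, on large arrays, typically much faster.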

### **0. Importing NumPy**

In [2]:
import numpy as np


### **1. Data types and attributes**

NOTE: The main type in NumPy is ndarray. Even seemingly different kinds of arrays (1-dimensional, 2-dimensional and so on) are still ndarrays. This means an operation you can do on one array will generally work on another.

In [20]:
# 1-dimensional array, also referred to as a vector
a1 = np.array([1, 2, 3])

# 2-dimensional array, also referred to as a matrix
a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

# 3-dimensional array, sometimes loosely called a matrix (or a tensor)
a3 = np.array([[[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]],
               [[10, 11, 12],
                [13, 14, 15],
                [16, 17, 18]]])

In [6]:
a1.shape, a1.ndim, a1.dtype, a1.size, type(a1)

((3,), 1, dtype('int64'), 3, numpy.ndarray)

In [7]:
a2.shape, a2.ndim, a2.dtype, a2.size, type(a2)

((2, 3), 2, dtype('float64'), 6, numpy.ndarray)

In [8]:
a3.shape, a3.ndim, a3.dtype, a3.size, type(a3)

((2, 3, 3), 3, dtype('int64'), 18, numpy.ndarray)

In [9]:
a1

array([1, 2, 3])

In [10]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [11]:
a3

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])
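
The attributes shown above are related to each other, which the following sketch tries to show: ndim is the length of shape, and size is the product of the numbers in shape. It rebuilds the three arrays so it can run on its own. Using np.arange() with .reshape() to build a3 is a shortcut chosen for this example.

```python
import numpy as np

a1 = np.array([1, 2, 3])
a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])
a3 = np.arange(1, 19).reshape(2, 3, 3)  # same values as a3 above

for arr in (a1, a2, a3):
    print(
        arr.shape,
        arr.ndim == len(arr.shape),            # number of axes
        arr.size == int(np.prod(arr.shape)),   # total number of elements
        type(arr).__name__,                    # always 'ndarray'
    )

# The same operation works regardless of the number of dimensions
totals = [arr.sum() for arr in (a1, a2, a3)]
```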


### **Anatomy of an array**

Key terms:

*   Array - A list of numbers, can be multi-dimensional.
*   Scalar - A single number (e.g. 7).
*   Vector - A list of numbers with 1 dimension (e.g. np.array([1, 2, 3])).
*   Matrix - A (usually) multi-dimensional list of numbers (e.g. np.array([[1, 2, 3], [4, 5, 6]])).
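
As a small illustration of the key terms above, the sketch below builds one of each and prints its number of dimensions and shape. Wrapping the scalar in np.array() is only done here so that it has .ndim and .shape attributes to inspect.

```python
import numpy as np

scalar = np.array(7)                        # 0 dimensions, shape ()
vector = np.array([1, 2, 3])                # 1 dimension, shape (3,)
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2 dimensions, shape (2, 3)

for name, arr in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, arr.ndim, arr.shape)
```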


### **pandas DataFrame out of NumPy arrays**

This example shows how NumPy serves as the backbone of many other libraries.



In [12]:
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 3)),
                  columns=['a', 'b', 'c'])
df

   a  b  c
0  7  3  4
1  8  6  8
2  3  4  1
3  1  4  9
4  4  7  4


In [13]:
a2

array([[1. , 2. , 3.3],
       [4. , 5. , 6.5]])

In [14]:
df2 = pd.DataFrame(a2)
df2

     0    1    2
0  1.0  2.0  3.3
1  4.0  5.0  6.5
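
The relationship goes both ways. As a rough sketch, you can get the underlying values back out of a DataFrame as a NumPy array with .to_numpy(). The column names "x", "y" and "z" are illustrative and not from the original notebook.

```python
import numpy as np
import pandas as pd

a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])

# NumPy array -> DataFrame (column names are made up for this example)
df2 = pd.DataFrame(a2, columns=["x", "y", "z"])

# DataFrame -> NumPy array
back = df2.to_numpy()

print(type(back).__name__)
print(np.array_equal(a2, back))
print(df2.dtypes.tolist())
```

The DataFrame keeps the float64 data type of the array it was built from.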


### **2. Creating arrays**


*   np.array()
*   np.ones()
*   np.zeros()
*   np.arange()
*   np.random.random((5, 3))
*   np.random.rand(5, 3)
*   np.random.randint(10, size=5)
*   np.random.seed() - pseudo-random numbers
*   Searching the documentation example (finding np.unique() and using it)

In [15]:
# Create a simple array
simple_array = np.array([1, 2, 3])
simple_array

array([1, 2, 3])

In [19]:
simple_array = np.array((1, 2, 3))
simple_array, simple_array.dtype

(array([1, 2, 3]), dtype('int64'))

In [17]:
# Create an array of ones
ones = np.ones((10, 2))
ones

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [18]:
# The default datatype is 'float64'
ones.dtype

dtype('float64')

In [21]:
# You can change the datatype with .astype()
ones.astype(int)

array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])
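
A couple of details about .astype() are worth seeing in code. The sketch below uses smaller arrays than the cells above, and the names are made up. It shows that .astype() returns a new array rather than changing the original, that converting floats to integers drops the fractional part, and that you can also pass dtype when creating an array.

```python
import numpy as np

ones = np.ones((2, 2))             # float64 by default
ones_int = ones.astype(int)        # a new array; `ones` itself is unchanged

# Converting floats to integers truncates toward zero
truncated = np.array([1.7, -1.7]).astype(int)

# The data type can also be chosen at creation time
zeros_int = np.zeros((2, 2), dtype=int)

print(ones.dtype, ones_int.dtype, truncated, zeros_int.dtype)
```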

In [22]:
# Create an array of zeros
zeros = np.zeros((5, 3, 3))
zeros

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

In [23]:
zeros.dtype

dtype('float64')

In [24]:
# Create an array within a range of values
range_array = np.arange(0, 10, 2)
range_array

array([0, 2, 4, 6, 8])

In [25]:
# Random array
random_array = np.random.randint(10, size=(5, 3))
random_array

array([[9, 3, 1],
       [0, 4, 8],
       [3, 8, 2],
       [3, 9, 2],
       [4, 0, 5]])

In [26]:
# Random array of floats (between 0 & 1)
np.random.random((5, 3))

array([[0.57316283, 0.13848311, 0.43716189],
       [0.37465423, 0.72535326, 0.17897145],
       [0.66077011, 0.79386231, 0.69283944],
       [0.76291165, 0.64921025, 0.89235129],
       [0.79805494, 0.31915385, 0.09419216]])

In [27]:
np.random.random((5, 3))

array([[0.71340282, 0.68224087, 0.09370401],
       [0.2699475 , 0.6872088 , 0.82141698],
       [0.19160835, 0.87218561, 0.1659593 ],
       [0.092336  , 0.52296592, 0.43855169],
       [0.88186459, 0.67527228, 0.36989894]])

In [28]:
# Random 5x3 array of floats (between 0 & 1), similar to above
np.random.rand(5, 3)

array([[0.80235839, 0.648698  , 0.52887164],
       [0.6857698 , 0.74016951, 0.48412763],
       [0.06838505, 0.09818475, 0.59157726],
       [0.11978153, 0.11918818, 0.298131  ],
       [0.64571443, 0.19462665, 0.62968663]])

In [29]:
np.random.rand(5, 3)

array([[0.60288132, 0.21149634, 0.22361013],
       [0.18594674, 0.80601252, 0.0959047 ],
       [0.32337443, 0.95785006, 0.83687221],
       [0.13189855, 0.80356297, 0.71041159],
       [0.42639587, 0.84626526, 0.77462847]])
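
The two functions above differ mainly in how you pass the shape: np.random.random() takes one tuple, while np.random.rand() takes separate arguments. The sketch below checks this by resetting the generator to the same starting point before each call with np.random.seed(), which is explained in the text that follows. The seed value 42 is arbitrary.

```python
import numpy as np

np.random.seed(42)                        # arbitrary seed for the comparison
from_random = np.random.random((5, 3))    # shape passed as a single tuple

np.random.seed(42)                        # reset to the same starting point
from_rand = np.random.rand(5, 3)          # shape passed as separate arguments

same = np.array_equal(from_random, from_rand)
print(same, from_rand.shape)
```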

NumPy uses pseudo-random numbers. This means the numbers look random, but they are produced by a deterministic algorithm, so the sequence is determined by its starting point (the seed).

For consistency, you might want the random numbers you generate to be the same across experiments.

To do this, you can use np.random.seed().

What this does is it tells NumPy, "Hey, I want you to create random numbers but keep them aligned with the seed."

Let's see it.

In [30]:
# Set random seed to 0
np.random.seed(0)

# Make 'random' numbers
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])


With np.random.seed() set, every time you run the cell above, the same random numbers will be generated.

What if a cell doesn't call np.random.seed()?

The generator simply continues from where it left off, so every time you run the cell below, a new set of numbers will appear.

In [32]:
# Make more random numbers
np.random.randint(10, size=(5, 3))

array([[2, 3, 8],
       [1, 3, 3],
       [3, 7, 0],
       [1, 9, 9],
       [0, 4, 7]])
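
The behaviour described above can be sketched in a single cell. Without re-seeding, each call continues the same sequence. Re-seeding restarts it, so the whole sequence repeats. The variable names here are made up for illustration.

```python
import numpy as np

np.random.seed(0)
first = np.random.randint(10, size=3)
second = np.random.randint(10, size=3)    # no re-seed: continues the sequence

np.random.seed(0)                         # re-seed: the sequence starts over
first_again = np.random.randint(10, size=3)
second_again = np.random.randint(10, size=3)

print(np.array_equal(first, first_again), np.array_equal(second, second_again))
```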


Let's see it in action again. We'll stay consistent and set the random seed to 0.

In [33]:
# Set random seed to same number as above
np.random.seed(0)

# The same random numbers come out
np.random.randint(10, size=(5, 3))

array([[5, 0, 3],
       [3, 7, 9],
       [3, 5, 2],
       [4, 7, 6],
       [8, 8, 1]])

Because np.random.seed() is set to 0 again, the random numbers are the same as in the earlier cell that also set the seed to 0.

Setting np.random.seed() is not 100% necessary but it's helpful to keep numbers the same throughout your experiments.

For example, say you wanted to split your data randomly into training and test sets.

Every time you randomly split, you might get different rows in each set.

If you shared your work with someone else, they'd get different rows in each set too.

Setting np.random.seed() doesn't remove the randomness, it just makes the randomness repeatable. Hence the term 'pseudo-random' numbers.

In [34]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
df

   0  1  2
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1
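
To tie this back to the training and test set example, here is a minimal sketch of a repeatable random split. The helper split_rows() and the toy data are hypothetical and written for this illustration. It uses np.random.permutation(), which is not used elsewhere in this notebook.

```python
import numpy as np

data = np.arange(20).reshape(10, 2)       # 10 made-up rows, 2 columns

def split_rows(data, seed, test_fraction=0.2):
    """Hypothetical helper: shuffle row indices with a fixed seed, then split."""
    np.random.seed(seed)                  # note: this sets NumPy's global seed
    indices = np.random.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    train = data[indices[n_test:]]
    test = data[indices[:n_test]]
    return train, test

train_a, test_a = split_rows(data, seed=0)
train_b, test_b = split_rows(data, seed=0)

print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```

With the same seed, anyone running this code should get the same rows in each set.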


### **What unique values are in the array a3?**

Now that you've seen a few different ways to create arrays, as an exercise, try to find out which NumPy function you could use to find the unique values within the a3 array.

You might want to search for something like, "how to find the unique values in a numpy array".
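
If you'd like to check your answer, one possible solution is np.unique(), sketched below. The array a3 is rebuilt with np.arange() and .reshape() so the example runs on its own. The repeated array is made up to show what happens when there are duplicates.

```python
import numpy as np

a3 = np.arange(1, 19).reshape(2, 3, 3)    # same values as a3 above

# np.unique() flattens the array and returns its sorted unique values
unique_values = np.unique(a3)

# With duplicates, return_counts=True also reports how often each value occurs
repeated = np.array([3, 1, 3, 2, 1])
values, counts = np.unique(repeated, return_counts=True)

print(unique_values)
print(values, counts)
```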

### **3. Viewing arrays and matrices (indexing)**

Remember, because arrays and matrices are both ndarrays, they can be viewed in similar ways.

Let's check out our 3 arrays again.