## Numpy


### Revisiting Python List

A Python List can be used to store a group of elements together in a sequence. It can contain heterogeneous elements.

Following are some examples of List:


In [1]:
item_list = ['Bread','Milk','Eggs','Butter','Cocoa']
student_marks = [78,47,96,55,34]
hetero_list = [1, 2, 3.0, 'text', True, 3+2j]

To perform operations on the List elements, one needs to iterate through the List. For example, if five extra marks need to be awarded to all the entries in the student marks list, the following approach can be used:


In [2]:
student_marks = [78,47,96,55,34]
for i in range(len(student_marks)):
    student_marks[i]+=5
print(student_marks)

[83, 52, 101, 60, 39]


Observe that a loop is required. The code is lengthy, and the loop becomes computationally expensive as the size of the List increases.

Data Science is the field that utilizes scientific methods and algorithms to generate insights from the data. These insights can be made actionable and applied across a broad range of application domains. Data Science deals with large datasets. Operating on such data with lists and loops is time consuming and computationally expensive.


### Comparing Python List and Numpy performance


Let us understand why Python Lists can become a bottleneck if they are used for large data.

Consider two lists of about 1 million numbers each, whose elements must be added pairwise.


In [3]:
%%time

list1 = list(range(1,1000000))
list2 = list(range(2,1000001))

list3 =[]

for i in range(len(list1)):
    list3.append(list1[i] + list2[i])

CPU times: total: 46.9 ms
Wall time: 199 ms


<b>Note:</b> The time taken will differ across systems.

Let us see how Numpy can perform the same operation in far less time.

<b>Note: Ignore the syntax for now and focus only on the output.</b>


In [13]:
%%time

import numpy as np

a = np.arange(1,1000000)
b = np.arange(2,1000001)

c = a + b

CPU times: total: 0 ns
Wall time: 19.7 ms


It can be observed that the same operation has been completed in 19.7 milliseconds, compared to the 199 milliseconds taken by the Python List, which is roughly ten times faster. As the data size and the complexity of operations increase, the difference between the performance of Numpy and Python Lists widens.

In Data Science, there are millions of records to be dealt with. The performance limitations of Python Lists can be managed by using advanced Python libraries like Numpy.
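
Outside a notebook, where the %%time magic is not available, a similar comparison can be sketched with the standard-library timeit module. The array size and the helper names add_lists and add_arrays are assumptions chosen for illustration, and the measured times will differ across systems.

```python
import timeit

import numpy as np

n = 1_000_000  # number of elements in each sequence (assumed size)

list1 = list(range(n))
list2 = list(range(1, n + 1))
a = np.arange(n)
b = np.arange(1, n + 1)


def add_lists():
    # element-by-element addition driven by a Python-level loop
    return [x + y for x, y in zip(list1, list2)]


def add_arrays():
    # single vectorized addition executed in compiled code
    return a + b


# best of 5 runs for each approach, in seconds
list_time = min(timeit.repeat(add_lists, number=1, repeat=5))
array_time = min(timeit.repeat(add_arrays, number=1, repeat=5))

print(f"Python list: {list_time * 1000:.2f} ms")
print(f"Numpy array: {array_time * 1000:.2f} ms")
```

Taking the best of several runs reduces the effect of background activity on the measurement.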


### Introduction to Numpy


Numpy, short for Numerical Python, is a Python library that is used for numeric and scientific operations. It serves as a building block for many libraries available in Python.

#### Data Structures in Numpy

The main data structure of NumPy is the ndarray or n-dimensional array.

The ndarray is a multidimensional container of elements of the same type as depicted below. It can easily deal with matrix and vector operations.

<img src="assests/Array1627301877492.png">


Following are the benefits of using Numpy

<ul>
    <li>As the array size increases, the advantage of Numpy's optimized, pre-compiled operations grows, thereby making computation faster. When the array size gets close to 5,000,000, NumPy can be around 120 times faster than a Python List, although the exact factor depends on the operation and the system.</li>
    <li>NumPy has many optimized built-in mathematical functions. These functions help in performing a variety of complex mathematical computations faster and with minimal code.</li>
    <li>Another great feature of NumPy is that it has multidimensional array data structures that can represent vectors and matrices. This is useful because many machine learning algorithms rely on matrix operations.</li>
</ul>
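
As a small illustration of the last point, the sketch below represents a matrix and a vector as ndarrays and multiplies them. The values are made up for the example.

```python
import numpy as np

# a 2x2 matrix and a vector of length 2 (illustrative values)
matrix = np.array([[1, 2],
                   [3, 4]])
vector = np.array([10, 20])

# matrix-vector product using the @ operator
product = matrix @ vector
print(product)    # [ 50 110]

# transpose of the matrix: rows become columns
transposed = matrix.T
print(transposed)
```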


### Getting Started


#### Importing Numpy

The Numpy library needs to be imported into the environment before it can be used, as shown below. 'np' is the standard alias used for Numpy.


In [1]:
import numpy as np

#### Numpy object creation

A Numpy array can be created by using the array() function. The array() function in Numpy returns an array object of type ndarray.

<b>Syntax: np.array(object,dtype)</b>
object - a Python object (for example, a list)
dtype - desired data type of the array elements (for example, integer); optional

Example: Consider the following marks scored by students:

<table>
    <tr>
        <th>Student ID</th>
        <th>Marks</th>
    </tr>
    <tr>
        <td>1</td>
        <td>78</td>
    </tr>
    <tr>
        <td>2</td>
        <td>92</td>
    </tr>
    <tr>
        <td>3</td>
        <td>36</td>
    </tr>
    <tr>
        <td>4</td>
        <td>64</td>
    </tr>
    <tr>
        <td>5</td>
        <td>89</td>
    </tr>
</table>

These marks can be represented in one-dimensional Numpy array as shown below:


In [2]:
import numpy as np
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr

array([78, 92, 36, 64, 89])

This is one way to create a simple one-dimensional array.


### Numpy object creation demo 1D array

The following dataset has been provided by XYZ Custom Cars. This data comes in a csv file format.
<img src="assests/11627367976870.PNG">


There are various columns in this datasheet. Each column contains multiple values. These values can be represented as lists of items. Since each column contains homogeneous values, Numpy arrays can be used to represent them.

Let us see how to represent the car 'horsepower' values in a Numpy array.


In [3]:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
horsepower_arr

array([130, 165, 150, 150, 140])

### Numpy object creation demo - 2D array

#### How can multiple columns be represented together?

This can be achieved by creating the Numpy array from List of Lists.

Let us see how to represent the car 'mpg', 'horsepower', and 'acceleration' values in a Numpy array.


In [5]:
#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318, 304, 302]]
#creating a numpy array from car_attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr

array([[ 18,  15,  18,  16,  17],
       [130, 165, 150, 150, 140],
       [307, 350, 318, 304, 302]])

The example demonstrates that the Numpy array created using the List of Lists results in a two-dimensional array.


### Shape of ndarray

The <b>numpy.ndarray.shape</b> attribute returns a tuple that describes the shape of the array.

For example:

<ul>
    <li>a one-dimensional array having 10 elements will have a shape as (10,)</li>
    <li>a two-dimensional array having 10 elements distributed evenly in two rows will have a shape as (2,5)</li>
</ul>

Let us see how to find the shape of the car attributes array.


In [7]:
#creating a list of lists of  mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318, 304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.shape

(3, 5)

Here, 3 represents the number of rows and 5 represents the number of elements in each row.
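
Alongside shape, the ndarray attributes ndim and size describe the number of dimensions and the total number of elements. A minimal sketch using the same car attributes:

```python
import numpy as np

car_attributes_arr = np.array([[18, 15, 18, 16, 17],
                               [130, 165, 150, 150, 140],
                               [307, 350, 318, 304, 302]])

# number of dimensions (axes) of the array
print(car_attributes_arr.ndim)   # 2

# total number of elements, which equals rows * columns
print(car_attributes_arr.size)   # 15

# the shape tuple can be unpacked into separate variables
rows, cols = car_attributes_arr.shape
print(rows, cols)                # 3 5
```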


### 'dtype' of ndarray

<b>'dtype'</b> refers to the data type of the data contained by the array. Numpy supports multiple datatypes like integer, float, string, boolean etc.

Below is an example of using dtype property to identify the data type of elements in an array.


In [8]:
#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318, 304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.dtype

dtype('int32')

#### Changing dtype

Numpy dtype can be changed as per requirements. For example, an array of integers can be converted to float.

Below is an example of using dtype as an argument of np.array() function to convert the data type of elements from integer to float.


In [9]:
#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318, 304, 302]]
#converting dtype
car_attributes_arr = np.array(car_attributes, dtype = 'float')
print(car_attributes_arr)
print(car_attributes_arr.dtype)

[[ 18.  15.  18.  16.  17.]
 [130. 165. 150. 150. 140.]
 [307. 350. 318. 304. 302.]]
float64
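
When the array already exists, its elements can be converted with the astype() method, which returns a new array and leaves the original unchanged. The sketch below uses a few of the mpg values; the variable names are chosen for illustration.

```python
import numpy as np

# an existing integer array
mpg_arr = np.array([18, 15, 18, 16, 17])

# astype() returns a converted copy; mpg_arr itself is not modified
mpg_float_arr = mpg_arr.astype(float)

print(mpg_float_arr)         # [18. 15. 18. 16. 17.]
print(mpg_float_arr.dtype)   # float64
print(mpg_arr.dtype.kind)    # i (the original is still an integer array)
```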


## Operations on Numpy arrays

### Accessing Numpy arrays

The elements in the ndarray are accessed using an index within square brackets []. In Numpy, both positive and negative indices can be used to access elements in the ndarray. Positive indices count from the beginning of the array, while negative indices count from the end of the array. Indexing starts from 0 in positive indexing and from -1 in negative indexing.

<img src="assests/MicrosoftTeamsimage61627560454959.png">

Below are some examples of accessing data from numpy arrays:

#### 1. Accessing element from 1D array


In [10]:
#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing the second car from the array
cars[1]

'buick skylark 320'

#### 2. Accessing element from a 2D array


In [11]:
#Creating a 2D array consisting car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
car_hp_arr

array([['chevrolet chevelle malibu', 'buick skylark 320',
        'plymouth satellite', 'amc rebel sst', 'ford torino'],
       ['130', '165', '150', '150', '140']], dtype='<U25')
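
Note that the horsepower values appear as strings in the output above. Since an ndarray holds elements of a single type, combining names and numbers converts every element to a string. A short sketch of how the numeric row could be converted back when calculations are needed (the name hp_values is an assumption for illustration):

```python
import numpy as np

car_names = ['chevrolet chevelle malibu', 'buick skylark 320',
             'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])

# 'U' means that every element is stored as a unicode string
print(car_hp_arr.dtype.kind)   # U

# convert the horsepower row back to integers before doing arithmetic
hp_values = car_hp_arr[1].astype(int)
print(hp_values.mean())        # 147.0
```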

In [12]:
#Creating a 2D array consisting car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
#Accessing car names
car_hp_arr[0]

array(['chevrolet chevelle malibu', 'buick skylark 320',
       'plymouth satellite', 'amc rebel sst', 'ford torino'], dtype='<U25')

In [13]:
#Creating a 2D array consisting car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
#Accessing horsepower
car_hp_arr[1]

array(['130', '165', '150', '150', '140'], dtype='<U25')

In [14]:
#Creating a 2D array consisting car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
#Accessing second car - 0 represents 1st row and 1 represents 2nd element of the row
car_hp_arr[0,1]

'buick skylark 320'

In [15]:
#Creating a 2D array consisting car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
#Accessing name of last car using negative indexing
car_hp_arr[0,-1]

'ford torino'

### Slicing

Slicing is a way to access and obtain subsets of ndarray in Numpy.

<b>Syntax: array_name[start:end]</b> - index starts at 'start' and ends at 'end-1'.

#### 1. Slicing from 1D array


In [16]:
#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing a subset of cars from the array
cars[1:4]

array(['buick skylark 320', 'plymouth satellite', 'amc rebel sst'],
      dtype='<U25')
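
The start and end values may also be omitted, and an optional third value sets the step. These forms are not part of the syntax shown above, so the sketch below is offered as an extension of it:

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# omitting 'start' begins the slice at the first element
print(horsepower_arr[:2])     # [130 165]

# omitting 'end' continues the slice to the last element
print(horsepower_arr[2:])     # [150 150 140]

# a step of 2 picks every second element
print(horsepower_arr[::2])    # [130 150 140]

# a step of -1 walks the array in reverse order
print(horsepower_arr[::-1])   # [140 150 150 165 130]
```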

#### 2. Slicing from 2D array


In [17]:
#Creating a 2D array consisting car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
car_hp_acc_arr

array([['chevrolet chevelle malibu', 'buick skylark 320',
        'plymouth satellite', 'amc rebel sst', 'ford torino'],
       ['130', '165', '150', '150', '140'],
       ['18', '15', '18', '16', '17']], dtype='<U25')

In [18]:
#Creating a 2D array consisting car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
#Accessing name and horsepower 
car_hp_acc_arr[0:2]

array([['chevrolet chevelle malibu', 'buick skylark 320',
        'plymouth satellite', 'amc rebel sst', 'ford torino'],
       ['130', '165', '150', '150', '140']], dtype='<U25')

In [19]:
#Creating a 2D array consisting car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
#Accessing name and horsepower of last two cars 
car_hp_acc_arr[0:2, 3:5]

array([['amc rebel sst', 'ford torino'],
       ['150', '140']], dtype='<U25')

In [20]:
#Creating a 2D array consisting car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
#Accessing name, horsepower and acceleration of first three cars
car_hp_acc_arr[0:3, 0:3]

array([['chevrolet chevelle malibu', 'buick skylark 320',
        'plymouth satellite'],
       ['130', '165', '150'],
       ['18', '15', '18']], dtype='<U25')
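
A bare colon selects everything along an axis, which makes it possible to pick out a single column. In this array each column holds all the attributes of one car; the name second_car below is chosen for illustration.

```python
import numpy as np

car_names = ['chevrolet chevelle malibu', 'buick skylark 320',
             'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])

# ':' selects all rows, 1 selects the second column (the second car)
second_car = car_hp_acc_arr[:, 1]
print(second_car)   # ['buick skylark 320' '165' '15']
```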

### Mean and Median

#### Problem Statement:

The engineers at XYZ Custom Cars want to know about the mean and median of horsepower.

#### Solution:

The mean and median can be calculated with the help of following code:


In [22]:
%%time

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#mean horsepower
print("Mean horsepower = ",np.mean(horsepower_arr))

Mean horsepower =  147.0
CPU times: total: 0 ns
Wall time: 0 ns


In [24]:
%%time

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#median horsepower
print("Median horsepower = ",np.median(horsepower_arr))

Median horsepower =  150.0
CPU times: total: 0 ns
Wall time: 195 μs
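
For a two-dimensional array, mean() and median() accept an axis argument so that the statistic is computed per row or per column. A minimal sketch using the car attributes array from the earlier sections:

```python
import numpy as np

car_attributes_arr = np.array([[18, 15, 18, 16, 17],
                               [130, 165, 150, 150, 140],
                               [307, 350, 318, 304, 302]])

# axis=1 computes one value per row (per attribute)
row_means = np.mean(car_attributes_arr, axis=1)
row_medians = np.median(car_attributes_arr, axis=1)

print(row_means)     # means of the three rows: 16.8, 147.0 and 316.2
print(row_medians)   # medians of the three rows: 17.0, 150.0 and 307.0
```

Using axis=0 instead would compute one value per column, which is rarely meaningful here because each column mixes different attributes.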


### Min and Max

#### Problem Statement:

The engineers at XYZ Custom Cars want to know about the minimum and maximum horsepower.

#### Solution:

The min and max can be calculated with the help of following code:


In [26]:
%%time

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Minimum horsepower: ", np.min(horsepower_arr))
print("Maximum horsepower: ", np.max(horsepower_arr))

Minimum horsepower:  130
Maximum horsepower:  165
CPU times: total: 0 ns
Wall time: 0 ns


#### Finding the index of minimum and maximum values:

'argmin()' and 'argmax()' return the index of minimum and maximum values in an array respectively.


In [44]:
%%time

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Index of Minimum horsepower: ", np.argmin(horsepower_arr))
print("Index of Maximum horsepower: ", np.argmax(horsepower_arr))


Index of Minimum horsepower:  0
Index of Maximum horsepower:  1
CPU times: total: 0 ns
Wall time: 1.04 ms
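
These indices become useful when a second array is aligned with the first. The sketch below assumes that the cars array from the earlier sections lists the cars in the same order as the horsepower values, and uses the indices to look up the car names.

```python
import numpy as np

cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320',
                 'plymouth satellite', 'amc rebel sst', 'ford torino'])
horsepower_arr = np.array([130, 165, 150, 150, 140])

# use the index of the extreme value to look up the matching car name
weakest_car = cars[np.argmin(horsepower_arr)]
strongest_car = cars[np.argmax(horsepower_arr)]

print(weakest_car)     # chevrolet chevelle malibu
print(strongest_car)   # buick skylark 320
```

When several elements share the minimum or maximum value, argmin() and argmax() return the index of the first occurrence.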


### Querying/searching in an array

#### Problem Statement:

The engineers at XYZ Custom Cars want to know the horsepower values that are greater than or equal to 150.

#### Solution:

The 'where' function can be used for this requirement. Given a condition, the 'where' function returns the indexes of the array elements that satisfy the condition. Using these indexes, the respective values from the array can be obtained.


In [45]:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
x = np.where(horsepower_arr >= 150)
print(x) # gives the indices 
# With the indices , we can find those values 
horsepower_arr[x]

(array([1, 2, 3], dtype=int64),)


array([165, 150, 150])
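
The 'where' function also has a three-argument form, np.where(condition, x, y), which picks a value from x where the condition holds and from y elsewhere. The labels 'high' and 'low' below are made up for the example.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# choose 'high' where the condition is True and 'low' where it is False
labels = np.where(horsepower_arr >= 150, 'high', 'low')
print(labels)   # ['low' 'high' 'high' 'high' 'low']
```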

### Filter Data

#### Problem Statement:

The engineers at XYZ Custom Cars want to create a separate array consisting of horsepower values greater than 135.

#### Solution:

Getting some elements out of an existing array based on certain conditions and creating a new array out of them is called filtering.

The following code can be used to accomplish this:


In [47]:
%%time

#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#creating filter array
filter_arr = horsepower_arr > 135
newarr = horsepower_arr[filter_arr]
print(filter_arr)
print(newarr)

[False  True  True  True  True]
[165 150 150 140]
CPU times: total: 0 ns
Wall time: 0 ns
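
Several conditions can be combined into one filter with the element-wise operators & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. The upper limit of 160 is an assumption for illustration.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# keep values that are greater than 135 AND less than 160
filter_arr = (horsepower_arr > 135) & (horsepower_arr < 160)
mid_range = horsepower_arr[filter_arr]

print(filter_arr)   # [False False  True  True  True]
print(mid_range)    # [150 150 140]
```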


### Sorting an Array

#### Problem Statement:

The engineers at XYZ Custom Cars want the horsepower in sorted order.

#### Solution:

The numpy array can be sorted by passing the array to the function <i>np.sort(array)</i> or by calling the method <i>array.sort()</i>.

So, what is the difference between these two, given that both provide the same functionality?


In [48]:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using sort(array)
print('original array: ', horsepower_arr)
print('Sorted array: ', np.sort(horsepower_arr))
print('original array after sorting: ', horsepower_arr)

original array:  [130 165 150 150 140]
Sorted array:  [130 140 150 150 165]
original array after sorting:  [130 165 150 150 140]


In [49]:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using array.sort()
print('original array: ', horsepower_arr)
horsepower_arr.sort()
print('original array after sorting: ', horsepower_arr)

original array:  [130 165 150 150 140]
original array after sorting:  [130 140 150 150 165]


The difference is that the array.sort() function modifies the original array by default, whereas the sort(array) function does not.
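
Two related tasks can be sketched in the same way: sorting in descending order by reversing the sorted result, and using argsort() to obtain the indices that would sort the array so that a second, aligned array can be reordered with them. The cars array is assumed to list the cars in the same order as the horsepower values.

```python
import numpy as np

cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320',
                 'plymouth satellite', 'amc rebel sst', 'ford torino'])
horsepower_arr = np.array([130, 165, 150, 150, 140])

# descending order: sort ascending, then reverse with a slice
descending = np.sort(horsepower_arr)[::-1]
print(descending)    # [165 150 150 140 130]

# indices that would sort the array; a stable sort keeps tied values
# (the two 150s) in their original relative order
order = np.argsort(horsepower_arr, kind='stable')
print(order)         # [0 4 2 3 1]

# car names arranged from lowest to highest horsepower
print(cars[order])
```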


### Vectorized Operations

Mathematical operations can be performed on Numpy arrays. Numpy makes use of optimized, pre-compiled code to perform mathematical operations on each array element. This eliminates the need for explicit loops, thereby enhancing the performance. This process is called vectorization.

Numpy provides various mathematical functions such as sum(), add(), subtract(), log(), sin(), etc., which use vectorization.

Consider an example of marks scored by a student:

<table>
    <tr>
        <th>Subject</th>
        <th>Marks</th>
    </tr>
    <tr>
        <td>English</td>
        <td>78</td>
    </tr>
        <tr>
        <td>Mathematics</td>
        <td>92</td>
    </tr>
        <tr>
        <td>Physics</td>
        <td>36</td>
    </tr>
        <tr>
        <td>Chemistry</td>
        <td>64</td>
    </tr>
        <tr>
        <td>Biology</td>
        <td>89</td>
    </tr>
</table>

#### Problem Statement:

Calculate the sum of all the marks.

#### Solutions:

The sum() function can be used, which internally uses vectorization.


In [50]:
student_marks_arr = np.array([78, 92, 36, 64, 89])
print(np.sum(student_marks_arr))

359


#### Problem Statement:

Award extra marks in subjects as follows:

English: +2
Mathematics: +2
Physics: +5
Chemistry: +10
Biology: +1

#### Solution

Below is the solution to the problem:


In [51]:
additional_marks = [2, 2, 5, 10, 1]
student_marks_arr += additional_marks
student_marks_arr

array([80, 94, 41, 74, 90])

Also, the same operation can be performed as shown below:


In [52]:
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr = np.add(student_marks_arr, additional_marks)
student_marks_arr

array([80, 94, 41, 74, 90])

Both of the above methods use vectorization internally, eliminating the need for loops.

Other arithmetic operations can also be performed in a similar manner.
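
A brief sketch of a few such operations on the student marks is given below. When one operand is a single number, Numpy applies it to every element of the array. The values 5, 2 and 100 are chosen for illustration.

```python
import numpy as np

student_marks_arr = np.array([78, 92, 36, 64, 89])

# subtraction: the operator and np.subtract() give the same result
reduced = student_marks_arr - 5
print(reduced)                              # [73 87 31 59 84]
print(np.subtract(student_marks_arr, 5))    # [73 87 31 59 84]

# multiplication of every element by 2
doubled = np.multiply(student_marks_arr, 2)
print(doubled)                              # [156 184  72 128 178]

# division always produces a float array
fractions = student_marks_arr / 100
print(fractions)                            # [0.78 0.92 0.36 0.64 0.89]
```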

<img src="assests/131627370079558.PNG">

In addition to arithmetic operations, several other mathematical operations like exponents, logarithms and trigonometric functions are also available in Numpy. This makes Numpy a very useful tool for scientific computing.
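
A few of these functions are sketched below on small illustrative arrays. Each function is applied to every element without an explicit loop.

```python
import numpy as np

values = np.array([1.0, 10.0, 100.0])
angles = np.array([0.0, np.pi / 2])

# base-10 logarithm of every element
logs = np.log10(values)
print(logs)       # [0. 1. 2.]

# every element raised to the power 2
squares = np.power(np.array([1, 2, 3]), 2)
print(squares)    # [1 4 9]

# sine of every angle (angles are in radians)
sines = np.sin(angles)
print(sines)      # [0. 1.]

# e raised to the power of every element
exps = np.exp(np.array([0.0, 1.0]))
print(exps)       # approximately 1.0 and 2.71828
```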
