---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.2</h1>

# _02-Array vs List.ipynb_
#### [Click me to learn more about Numpy Library](https://www.w3schools.com/python/numpy/numpy_intro.asp)

# Learning agenda of this notebook
1. A Comparison
    - Python List
    - Python Arrays
    - NumPy Arrays
2. Memory Consumption of Python List and Numpy Array
3. Operation cost on Python List and Numpy Array

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: NumPy operations and functions are implemented internally in compiled C code, which makes them much faster than equivalent Python statements and loops that are interpreted at runtime.

The last section of this notebook compares dot products computed with Python loops and with NumPy arrays on two vectors with a million elements each.

## 1. A Comparison
**Python List:** 
1. A Python list (part of core Python) is an ordered and mutable data structure that is used to store a collection of items.
2. A Python list is one-dimensional by default. You can build an N-dimensional list, but it is still a 1-D list whose items are other 1-D lists.
3. Items can be of the same or different data types.
4. More memory hungry, because every item is a full Python object and the list stores a reference to each one.
5. The list itself holds a contiguous block of references, but the items they point to can be scattered anywhere in memory.
6. Operations on lists are typically slower; however, an append operation takes amortized O(1) time.
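
Point 2 above can be checked directly. The following is a minimal sketch (the name `grid` is only illustrative) showing that a "2-D" list is an outer list of ordinary inner lists, and that nothing forces the rows to share a length or an item type.

```python
# A "2-D" list is really an outer list whose items are ordinary inner lists.
grid = [[1, 2, 3], [4, 5, 6]]

first_row = grid[0]                  # indexing the outer list returns an inner list
print(type(first_row))               # <class 'list'>
print(grid[1][2])                    # row 1, column 2 -> 6

# Nothing forces the rows to have equal length or a single item type.
grid.append(["seven", 8.0])
row_lengths = [len(row) for row in grid]
print(row_lengths)                   # [3, 3, 2]
```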


**NumPy Arrays:** 
1. NumPy is the fundamental package for scientific computing in Python. NumPy arrays support advanced mathematical and other types of operations on large amounts of data.
2. A NumPy array is also an ordered and mutable data structure that is used to store a collection of items.
3. A NumPy array can be N-dimensional.
4. Items must be of the same data type (structured arrays, similar to arrays of structs, are also possible).
5. Less memory hungry.
6. Items are stored contiguously in memory.
7. Operations on arrays are typically faster; however, an append operation takes O(n) time because all existing items are copied into a new array.
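
The different append costs follow from how the two operations behave. As a small sketch (variable names are illustrative), `list.append` grows the existing list in place, while `np.append` leaves its input untouched and returns a newly allocated array.

```python
import numpy as np

# list.append grows the existing list in place.
numbers = [1, 2, 3]
alias = numbers
numbers.append(4)
print(alias is numbers, alias)       # True [1, 2, 3, 4]

# np.append never modifies its input; it copies everything into a new array.
values = np.array([1, 2, 3])
extended = np.append(values, 4)
print(extended is values)            # False
print(values, extended)              # [1 2 3] [1 2 3 4]
```

For this reason, when many items must be collected one by one, it is usually cheaper to append to a list and convert it to an array once at the end.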

In [None]:
# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install numpy

In [2]:
import numpy as np
np.__version__ , np.__path__

('1.19.5',
 ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/numpy'])

### a. Python List

In [None]:
# creating a list containing elements belonging to different data types  
list1 = [1, "Data Science", ['a','e']]  
print(list1) 


# creating a list having elements of different data types
list2 = [1, 4, '2', 5.3, False]
print(list2)



### b. Python Array

In [None]:
# A simple Python array is a sequence of objects of the same data type
# Python's array module requires all array elements to be of the same type. 
# Moreover, to create an array, you need to specify a type code for the values. 

# To use Python arrays, you have to import Python's built-in array module
import array as arr  

# declaring array of integers
arr1 = arr.array("i", [3, 6, 9, 2])  
print(arr1)  
print(type(arr1))  

# declaring array of floats
arr2 = arr.array("f", [3.4, 6.7, 9.5, 2])  
print(arr2)  
print(type(arr2))  
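
To see what "same type" means in practice, here is a short sketch using the same `array` module. It assumes nothing beyond the standard library: a value that does not match the type code is rejected, and the type code `"f"` stores single-precision floats, so a value such as 3.4 is only approximated.

```python
import array as arr

ints = arr.array("i", [3, 6, 9, 2])
# itemsize is platform dependent (commonly 4 bytes for type code "i")
print(ints.typecode, ints.itemsize)

# A float is rejected for an integer array instead of being silently converted.
try:
    arr.array("i", [3, 6.5])
    rejected = False
except TypeError as err:
    rejected = True
    print("TypeError:", err)

# Type code "f" is single precision, "d" is double precision.
single = arr.array("f", [3.4])
double = arr.array("d", [3.4])
print(single[0] == 3.4)              # False: 3.4 was rounded to single precision
print(double[0] == 3.4)              # True
```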

### c. Numpy Array

In [3]:
np.sctypes

{'int': [numpy.int8, numpy.int16, numpy.int32, numpy.int64],
 'uint': [numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64],
 'float': [numpy.float16, numpy.float32, numpy.float64, numpy.float128],
 'complex': [numpy.complex64, numpy.complex128, numpy.complex256],
 'others': [bool, object, bytes, str, numpy.void]}
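
The output above comes from NumPy 1.19.5; as far as I know, `np.sctypes` is no longer available from NumPy 2.0 onward. A sketch that should work on both old and new versions is to inspect individual types with `np.dtype`, `np.iinfo`, and `np.finfo`:

```python
import numpy as np

# Size and value range of some integer types
for int_type in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    dt = np.dtype(int_type)
    print(f"{dt.name:>7}: {dt.itemsize} byte(s), range {info.min} to {info.max}")

# Bit width and machine epsilon of some floating-point types
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(f"{np.dtype(float_type).name:>7}: {info.bits} bits, eps = {info.eps}")
```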

In [None]:
# To create a NumPy array, you only need to specify the items (enclosed in square brackets) 
# Optionally you can mention the dtype

import numpy as np  

mylist = [1, 4, 2, 5, 3]
arr = np.array(mylist, dtype=np.uint16)
print(arr)
print(type(arr))  


# NumPy upcasts all elements to a common, wider data type if the input is heterogeneous
# (here the int, float and bool values all become floats)
arr2 = np.array([1, 4.6, 2, False])
print(arr2)


# If any element is a string, every element is converted to a string
arr2 = np.array([3, 5.1, '5', False])
print(arr2)
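
The upcasting can be confirmed by looking at the `dtype` attribute of each result. This is a small sketch with illustrative variable names:

```python
import numpy as np

# int, float and bool are upcast to a common floating-point type
mixed_numbers = np.array([1, 4.6, 2, False])
print(mixed_numbers.dtype)           # float64
print(mixed_numbers[3])              # 0.0 (False became 0.0)

# as soon as one string is present, everything becomes a string
with_string = np.array([3, 5.1, '5', False])
print(with_string.dtype.kind)        # 'U' stands for a Unicode string dtype
print(with_string)

# an explicit dtype overrides the automatic choice
forced = np.array([1, 2, 3], dtype=np.float32)
print(forced.dtype)                  # float32
```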


## 2. Memory Consumption of NumPy Array and Python List
- Python Lists consume more memory than NumPy arrays

In [None]:
# Let's have an example to compare the memory consumption of NumPy array and Python List

# importing system module
import sys
  
# declaring a list of 1000 elements 
# (list() is required: range(1000) alone is a lazy range object, not a list)
list1 = list(range(1000))

# sys.getsizeof(list1) reports only the list object itself (its header plus one reference per item)
print("Size of the list object in bytes: ", sys.getsizeof(list1))
  
# every item is a separate Python int object, so its size must be added to get the approximate total
size = sys.getsizeof(list1) + sum(sys.getsizeof(item) for item in list1)
print("Size of the whole list in bytes: ", size)

# declaring a Numpy array of 1000 elements 
array1 = np.arange(1000)
  
# printing size of each element of the Numpy array
print("Size of each element of the Numpy array in bytes: ", array1.itemsize)
  
# printing size of the whole Numpy array
print("Size of the whole Numpy array in bytes: ", array1.nbytes)
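
The memory used by a NumPy array also depends on the `dtype` you choose, since `nbytes` is simply the number of items times `itemsize`. The sketch below (with an illustrative `count` of 1000 items) shows the effect:

```python
import numpy as np

count = 1000
for dtype in (np.int8, np.int16, np.int32, np.int64, np.float64):
    data = np.ones(count, dtype=dtype)
    print(f"{data.dtype.name:>8}: itemsize = {data.itemsize}, nbytes = {data.nbytes}")

small = np.ones(count, dtype=np.int8)    # 1 byte per item
large = np.ones(count, dtype=np.int64)   # 8 bytes per item
print(small.nbytes, large.nbytes)        # 1000 8000
```

Picking the smallest dtype that can safely hold your values can therefore reduce memory use considerably.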

## 3. Operations on NumPy Arrays vs Python Lists
- NumPy arrays are stored in one contiguous block of memory, unlike the items of a list, so the processor can access and manipulate them very efficiently. 
- This behavior is called **locality of reference** in computer science. 
- Together with the fact that NumPy runs its loops in compiled code, this is the main reason why NumPy is faster than lists. 
- As a proof of concept, we can multiply two lists and two arrays element by element, and compare the time each takes.
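
The contiguous layout can be inspected through the `flags` and `strides` attributes of an array. This is a minimal sketch; the `dtype` is fixed to `np.int64` so that the stride values are the same on every platform.

```python
import numpy as np

matrix = np.arange(6, dtype=np.int64).reshape(2, 3)

# A freshly created array occupies one contiguous block, row after row.
print(matrix.flags["C_CONTIGUOUS"])      # True

# strides = bytes to skip to reach the next row / the next column
print(matrix.strides)                    # (24, 8): 3 items * 8 bytes, then 8 bytes

# A column view skips over the other columns, so it is no longer contiguous.
column = matrix[:, 0]
print(column.flags["C_CONTIGUOUS"])      # False
```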

### Effect of * operator on NumPy Array and Python List

In [4]:
# You can multiply two NumPy arrays element by element using the * operator
import numpy as np
myarray1 = np.array([1, 2, 3, 4, 5, 6])
myarray2 = np.array([1, 2, 3, 4, 5, 6])
myarray3 = myarray1 * myarray2
myarray3

array([ 1,  4,  9, 16, 25, 36])

In [13]:
# the * operator does not multiply two lists element by element, so you have to use a loop
mylist1 = [1, 2, 3, 4, 5, 6]
mylist2 = [1, 2, 3, 4, 5, 6]
mylist3 = [0, 0, 0, 0, 0, 0]
for i in range(0,6):
    mylist3[i] = mylist1[i] * mylist2[i]
mylist3

[1, 4, 9, 16, 25, 36]
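
For lists, the `*` operator means repetition, not multiplication. The following sketch (with short illustrative lists) shows what `*` does on a list, and a more compact alternative to the index-based loop using `zip` and a list comprehension:

```python
mylist1 = [1, 2, 3]
mylist2 = [4, 5, 6]

# list * int repeats the list
repeated = mylist1 * 2
print(repeated)                          # [1, 2, 3, 1, 2, 3]

# list * list is not defined at all
try:
    mylist1 * mylist2
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

# element-wise multiplication with zip and a list comprehension
products = [a * b for a, b in zip(mylist1, mylist2)]
print(products)                          # [4, 10, 18]
```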

In [27]:
# importing required packages
import time
import numpy as np


# Creating two large arrays and multiplying them element by element
size = 1000000
array1 = np.arange(size)
array2 = np.arange(size)
# capturing time before the multiplication of NumPy arrays
# (time.perf_counter() is the clock recommended for measuring short durations)
initialTime = time.perf_counter()
# multiplying elements of both NumPy arrays and storing the result in another NumPy array
array3 = array1 * array2
# capturing time again after the multiplication is done
finishTime = time.perf_counter()
print("\nTime taken by NumPy Arrays to perform multiplication:", finishTime - initialTime, "seconds")


# Creating two large lists and multiplying them element by element
list1 = list(range(size))
list2 = list(range(size))
list3 = list(range(size))
# capturing time before the multiplication of Python Lists
initialTime = time.perf_counter()
# multiplying elements of both lists and storing the result in another list
# simply run a loop and overwrite the elements of the new list with the resulting value
for i in range(0, len(list1)):
    list3[i] = list1[i] * list2[i]

# capturing time again after the multiplication is done
finishTime = time.perf_counter()
print("\nTime taken by Lists to perform multiplication:", finishTime - initialTime, "seconds")


Time taken by NumPy Arrays to perform multiplication: 0.001650094985961914 seconds

Time taken by Lists to perform multiplication: 0.2171339988708496 seconds
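
A single timing like the one above can vary from run to run. As a sketch of a more robust measurement, the standard-library `timeit` module can repeat each operation several times and report the best run. The helper functions `multiply_lists` and `multiply_arrays` and the smaller `size` are illustrative choices, not part of the original experiment.

```python
import timeit
import numpy as np

size = 100_000
list1 = list(range(size))
list2 = list(range(size))
# int64 is requested explicitly so the products cannot overflow on any platform
array1 = np.arange(size, dtype=np.int64)
array2 = np.arange(size, dtype=np.int64)

def multiply_lists():
    return [a * b for a, b in zip(list1, list2)]

def multiply_arrays():
    return array1 * array2

runs = 10
# take the best of 3 repeats, each averaging over `runs` calls
list_time = min(timeit.repeat(multiply_lists, number=runs, repeat=3)) / runs
array_time = min(timeit.repeat(multiply_arrays, number=runs, repeat=3)) / runs

print(f"Lists : {list_time:.6f} seconds per multiplication")
print(f"NumPy : {array_time:.6f} seconds per multiplication")
```

The exact numbers depend on your machine, so only the relative difference is meaningful.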


- The element-wise multiplication of two vectors followed by a sum of the results is called the **dot product**.
- Let us compare the dot product of two vectors with a million elements each, computed first with a loop over two Python lists and then with the `np.dot()` function on two NumPy arrays.

In [26]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

In [27]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

CPU times: user 194 ms, sys: 2.46 ms, total: 196 ms
Wall time: 195 ms


833332333333500000

In [28]:
# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [29]:
%%time
np.dot(arr1_np, arr2_np)

CPU times: user 1.97 ms, sys: 680 µs, total: 2.65 ms
Wall time: 1.34 ms


833332333333500000

As you can see, `np.dot` is around 150 times faster than the `for` loop in this run (1.34 ms vs. 195 ms wall time). This makes NumPy especially useful when working with really large datasets with tens of thousands or millions of data points.
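
The dot product can be written in several equivalent ways in NumPy. The sketch below uses two tiny illustrative vectors, `v1` and `v2`, so the result is easy to verify by hand (1*4 + 2*5 + 3*6 = 32):

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

# plain Python loop
loop_result = 0
for x1, x2 in zip(v1.tolist(), v2.tolist()):
    loop_result += x1 * x2
print(loop_result)                       # 32

# element-wise multiplication followed by a sum
print((v1 * v2).sum())                   # 32

# the dedicated function and the @ operator
print(np.dot(v1, v2))                    # 32
print(v1 @ v2)                           # 32
```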