# NumPy vs. Python Collections

![Python Collections compared to NumPy](./../images/data_munging_01-Numpy-02.png)

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/01-Numpy/02.01-Numpy-over-Python-Collections.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on windows)
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

# Setup and suchlike

In [2]:
import time
import sys
import numpy as np

In [3]:
np.__version__

'1.26.2'

# Performance over Python Lists

In [4]:
# NumPy is faster
# 1. Contiguous storage
# 2. Leverage datatypes

# ten million
big_number = 10000000

# List
python_list = list(range(big_number))

start_time = time.time()
sum_list = sum(python_list)
list_time = time.time() - start_time

# NumPy Array
numpy_array = np.array(range(big_number), dtype=np.int64)

start_time_np = time.time()
sum_array = np.sum(numpy_array)
numpy_time = time.time() - start_time_np

print(f"Python List Time: {list_time}")
print(f"NumPy Array Time: {numpy_time}")
print(f"Numpy performing {list_time/numpy_time} times faster than Python Lists")

Python List Time: 0.2040574550628662
NumPy Array Time: 0.00608515739440918
Numpy performing 33.53363632801786 times faster than Python Lists
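
A single `time.time()` measurement can be noisy, so the ratio above will vary from run to run and from machine to machine. As a rough sketch (the smaller array size and the repeat counts here are arbitrary choices, not taken from the cell above), the same comparison can be repeated with the standard-library `timeit` module, keeping the best of several runs:

```python
import timeit
import numpy as np

# one million elements keeps the sketch quick; the size is an arbitrary choice
n = 1_000_000
python_list = list(range(n))
numpy_array = np.arange(n, dtype=np.int64)

# run each statement 5 times per measurement, take 3 measurements, keep the best
list_best = min(timeit.repeat(lambda: sum(python_list), number=5, repeat=3))
numpy_best = min(timeit.repeat(lambda: numpy_array.sum(), number=5, repeat=3))

print(f"Python list best of 3: {list_best:.6f} s")
print(f"NumPy array best of 3: {numpy_best:.6f} s")
print(f"Ratio: {list_best / numpy_best:.1f}x")
```

Taking the minimum reduces the influence of other processes running on the machine; the exact ratio still depends on hardware and NumPy version.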


In [5]:
# one million, two million
lst1 = [i for i in range(1000000)]
lst2 = [i for i in range(1000000, 2000000)]

arr1 = np.array(lst1)
arr2 = np.array(lst2)

# Python List
start_time1 = time.time()
result_list = [a + b for a, b in zip(lst1, lst2)]
list_time1 = time.time() - start_time1

# NumPy Array
start_time_np1 = time.time()
result_array = arr1 + arr2
numpy_time1 = time.time() - start_time_np1

print(f"Python List Operation Time: {list_time1}")
print(f"NumPy Array Operation Time: {numpy_time1}")
print(f"Numpy performing {list_time1/numpy_time1} times faster than Python Lists")

Python List Operation Time: 0.0559992790222168
NumPy Array Operation Time: 0.0009839534759521484
Numpy performing 56.91252725951054 times faster than Python Lists


# How NumPy handles Data Types

When the elements have mixed types, NumPy promotes them to the type with the smallest size and smallest scalar kind that fits all the elements.
This type promotion can sometimes be counterintuitive.
See: 
* [Type Casting Rules](https://numpy.org/doc/stable/user/basics.ufuncs.html#type-casting-rules)
* [```numpy.result_type```](https://numpy.org/doc/stable/reference/generated/numpy.result_type.html#numpy-result-type)
* [```numpy.promote_types```](https://numpy.org/doc/stable/reference/generated/numpy.promote_types.html#numpy-promote-types)
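
A few promotions can be checked directly with the two functions linked above. This is a small sketch; the specific dtype pairs are picked only for illustration:

```python
import numpy as np

# int8 and uint8 cannot hold each other's full range, so NumPy moves up to int16
small_ints = np.promote_types(np.int8, np.uint8)
print(small_ints)

# float32 cannot represent every int32 exactly, so the result is float64
int_and_float = np.result_type(np.int32, np.float32)
print(int_and_float)

# the same rules apply when an array is built from mixed Python values
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)

# numbers mixed with text are promoted to a Unicode string dtype;
# the string width depends on the platform's default integer size
mixed_text = np.array([1, "two", 3])
print(mixed_text.dtype.kind)
```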


In [6]:
# handling inconsistent data in NumPy
try:
	inconsistent_array1 = np.array([1, "two", 3, '!'])
	print("inconsistent_array1.dtype = ",inconsistent_array1.dtype)
except Exception as e:
	print(e)
# 
try:
	# raises a ValueError for inconsistent_array2 because 'two' cannot be converted to the specified dtype
	inconsistent_array2 = np.array([1, "two", 3, '!'], dtype=np.int32)
	print(inconsistent_array2.dtype)
except Exception as e:
	print("inconsistent_array2 exception: ",e)
# 
try:
	# specify the dtype - makes things more reliable (and faster!)
	inconsistent_array3 = np.array([1, 2, 3, 4], dtype=np.int64)
	print("inconsistent_array3.dtype = ",inconsistent_array3.dtype)
except Exception as e:
	print(e)
# 

inconsistent_array1.dtype =  <U11
inconsistent_array2 exception:  invalid literal for int() with base 10: 'two'
inconsistent_array3.dtype =  int64


In [7]:
inconsistent_array3.strides

(8,)

# NumPy uses contiguous blocks of data in memory
  
![Row-Wise Representation of Data](./../images/PyDataGlobal2023-PythonvsNumpy-rowandcol.drawio.png)

In [8]:
# NumPy array
arr = np.array([1, 2, 3, 4], dtype=np.int32)    
# 
print(f"NumPy ctypes {arr.ctypes.data}\n{arr.ctypes.strides}\n{arr.nbytes}")
print(f"NumPy ctypes.data {arr.ctypes.data}")
# add another element to the array - see how size changes
arr = np.append(arr, [5])
print(f"NumPy ctypes {arr.ctypes.data}\n{arr.ctypes.strides}\n{arr.nbytes}")
print(f"NumPy ctypes.data {arr.ctypes.data}")
# 
for idx in range(len(arr)):
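    # arr[idx] builds a temporary NumPy scalar on every access, so id() and
    # __array_interface__['data'] (a tuple: buffer address, read-only flag)
    # describe that temporary, not the element's slot inside arr.
    # The values repeat because the temporary is freed and its memory reused each iteration.
    # NumPy itself reaches an element by adding an offset to the base address ('strides'),
    # a strategy that only works when traversing a contiguous block of data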
    print(f"NumPy array element {idx}: memory address = {id(arr[idx])}, {arr[idx].__array_interface__['data']}")
# 
# Python list
lst = [1, 2, 3, 4]
for idx, item in enumerate(lst):
    print(f"Python list element {idx}: memory address = {id(item)}")

NumPy ctypes 1770104670544
<numpy.core._internal.c_longlong_Array_1 object at 0x0000019C31A592D0>
16
NumPy ctypes.data 1770104670544
NumPy ctypes 1770104670576
<numpy.core._internal.c_longlong_Array_1 object at 0x0000019C31A591D0>
20
NumPy ctypes.data 1770104670576
NumPy array element 0: memory address = 1770883228048, (1770081209120, False)
NumPy array element 1: memory address = 1770883228048, (1770081209120, False)
NumPy array element 2: memory address = 1770883228048, (1770081209120, False)
NumPy array element 3: memory address = 1770883228048, (1770081209120, False)
NumPy array element 4: memory address = 1770883228048, (1770081209120, False)
Python list element 0: memory address = 140736721097512
Python list element 1: memory address = 140736721097544
Python list element 2: memory address = 140736721097576
Python list element 3: memory address = 140736721097608
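
The addresses of the individual array elements can be derived from the base address and the stride. A minimal sketch, assuming a one-dimensional `int32` array like the one above (the variable names are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4], dtype=np.int32)

base = arr.ctypes.data   # address of the first element
step = arr.strides[0]    # bytes to move to reach the next element

# each element sits at base + index * step
addresses = [base + i * step for i in range(arr.size)]
offsets = [address - base for address in addresses]

print(f"bytes per element: {arr.itemsize}, stride: {step}")
print(f"offsets from the base address: {offsets}")
print(f"C-contiguous: {arr.flags['C_CONTIGUOUS']}")
```

For a contiguous one-dimensional array the stride equals the item size, so the elements sit back to back in memory.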


# How strict data-types in NumPy reduce memory overhead

We saw that Python data structures come with a lot of functions that help with duck-typing and other general-purpose data analysis tasks. 

Python lists have a significant memory overhead because they store more than just the data: the list holds a pointer to each element, and each element is a full Python object carrying its own type info, size, reference count, etc.  
NumPy arrays, being homogeneous, store the raw values of a single dtype side by side and cut down on this overhead.

In [9]:
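# NumPy consuming less memory
# note: sys.getsizeof(lst) counts only the list's own array of pointers,
# not the int objects those pointers refer to, so the list's real footprint is larger
lst = list(range(big_number))
lst_size = sys.getsizeof(lst)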
print(f"Size of Python list: {lst_size} bytes or {round(lst_size/1024, 2)} Kb")

np_arr = np.array(lst, dtype = np.dtype(int))
np_arr_size = np_arr.nbytes
print(f"Size of NumPy array: {np_arr_size} bytes or {round(np_arr_size/1024, 2)} Kb")
print('\n')
print(f"Compared to Python lists, NumPy consumes approximately \
{round(((lst_size-np_arr_size)/lst_size)*100,2)}% less memory")

Size of Python list: 80000056 bytes or 78125.05 Kb
Size of NumPy array: 40000000 bytes or 39062.5 Kb


Compared to Python lists, NumPy consumes approximately 50.0% less memory
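
The figure above compares the array against the list's pointer storage only. A rough sketch of a fuller accounting also adds the size of every integer object the list refers to. It assumes CPython, uses a smaller list to keep the loop quick, and counts each element once even though CPython shares some small integers:

```python
import sys
import numpy as np

n = 100_000
lst = list(range(n))

# size of the list object itself: header plus one pointer per element
container_bytes = sys.getsizeof(lst)
# size of the int objects the pointers refer to
element_bytes = sum(sys.getsizeof(item) for item in lst)
list_total = container_bytes + element_bytes

arr = np.arange(n, dtype=np.int64)

print(f"list container only : {container_bytes} bytes")
print(f"list plus elements  : {list_total} bytes")
print(f"NumPy int64 array   : {arr.nbytes} bytes")
```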


Some references for you to keep handy when dealing with NumPy Arrays and data types ([dtypes](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)):
* NumPy [Structured Arrays](https://numpy.org/doc/stable/user/basics.rec.html#structured-arrays)
* [The array interface protocol](https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface)
* [Data type objects](https://numpy.org/doc/stable/reference/arrays.dtypes.html#arrays-dtypes-constructing)
* Built-in [Scalars](https://numpy.org/doc/stable/reference/arrays.scalars.html#scalars)

# Implementing a naive ```GroupBy``` with NumPy

We have our Movies dataset and we want to answer the following questions:
* How many movies were released each year?
* On average, how many movies were released per year?

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

# Locate the data

In [10]:
datalocation = "./../data/ml-latest-small/"

In [11]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

# Load the dataset(s)

From the ```README.txt``` file in the small MovieLens dataset:
The dataset files are written as [**comma-separated values**](http://en.wikipedia.org/wiki/Comma-separated_values) files with a **single header row**. Columns that contain commas (`,`) are **escaped using double-quotes (`"`)**. These files are encoded as **UTF-8**. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.

So, we specify:
* Separator - ```,```
* Escape Character - ```"```
* Encoding - ```UTF-8```

We need to find a way to load the dataset so that titles like "American President, The (1995)" don't break the CSV loading process.
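
For comparison, Python's standard-library `csv` module already understands double-quoted fields. The sketch below parses a small in-memory sample rather than the real file; the two data rows are illustrative and written in the format the README describes:

```python
import csv
import io

# illustrative sample in the same format as movies.csv
sample = (
    'movieId,title,genres\n'
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)   # skip the single header row
rows = list(reader)

for row in rows:
    print(row)
```

The comma inside the quoted title stays part of the title, and the surrounding quotes are removed.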

In [12]:
# we'll need regex to handle the escape characters and extract the year
import re

We'll use regex to match here.  
Something like [regex101](https://regex101.com/r/pWPPbM/1) is really helpful in building the expression.  

If we wanted to select only a comma that is immediately followed by a double quote, we would use a technique called [lookahead](https://www.regular-expressions.info/lookaround.html). 

``` Python
# Matches a comma only when the next character is a double quote
# r',(?=")'
```
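
To see what a lookahead does in practice, the sketch below applies two patterns to one illustrative line. The second, longer pattern is not from this notebook; it is a common idiom that splits only on commas followed by an even number of double quotes, that is, commas outside any quoted field:

```python
import re

# illustrative line in the movies.csv format
line = '11,"American President, The (1995)",Comedy|Drama|Romance'

# comma followed directly by a double quote: only the first comma qualifies
simple_parts = re.split(r',(?=")', line)
print(simple_parts)

# comma followed by an even number of quotes up to the end of the line
outside_quotes = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'
parts = re.split(outside_quotes, line)
print(parts)

# the quotes around the title are still present and can be stripped afterwards
title = parts[1].strip('"')
print(title)
```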

Wait, there's no general-purpose 'Group by' in NumPy.  
So, we'll have to implement our own.   
Fun!

In [13]:
# load the movies.csv dataset, knowing the idiosyncrasies involved (like that comma-quote thing etc.)
data = []
# regex to extract 4 digits between parentheses (raw string, so the backslashes reach the regex engine untouched)
year_match_regex = r'\((\d{4})\)'
with open(file_path_movies, 'r', encoding='utf-8') as file:
	next(file)  # Skip the header line
	for line in file:
		# split by the comma
		parts = re.split(r',', line.strip())
		# skip empty or malformed lines
		if len(parts) >= 3:
			# 
			movie_id = int(parts[0])
			# Combine all elements except the first (movie ID) and last (genres)
			title = ','.join(parts[1:-1]).strip('"')
			# extract year into a new column
			yr_match = re.search(year_match_regex, title)
			year = int(yr_match.group(1)) if yr_match else None
			# 
			genres = parts[-1]
			data.append([movie_id, title, year, genres])
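# dtype=object keeps the mixed int/str/None values as Python objects instead of relying on NumPy's inference
movies = np.array(data, dtype=object)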

In [14]:
print('data shape: ',len(data),',', len(data[0]))
print('movies.shape: ',movies.shape)

data shape:  9742 , 4
movies.shape:  (9742, 4)


In [15]:
# head
print(data[:5])
movies[:10]

[[1, 'Toy Story (1995)', 1995, 'Adventure|Animation|Children|Comedy|Fantasy'], [2, 'Jumanji (1995)', 1995, 'Adventure|Children|Fantasy'], [3, 'Grumpier Old Men (1995)', 1995, 'Comedy|Romance'], [4, 'Waiting to Exhale (1995)', 1995, 'Comedy|Drama|Romance'], [5, 'Father of the Bride Part II (1995)', 1995, 'Comedy']]


array([[1, 'Toy Story (1995)', 1995,
        'Adventure|Animation|Children|Comedy|Fantasy'],
       [2, 'Jumanji (1995)', 1995, 'Adventure|Children|Fantasy'],
       [3, 'Grumpier Old Men (1995)', 1995, 'Comedy|Romance'],
       [4, 'Waiting to Exhale (1995)', 1995, 'Comedy|Drama|Romance'],
       [5, 'Father of the Bride Part II (1995)', 1995, 'Comedy'],
       [6, 'Heat (1995)', 1995, 'Action|Crime|Thriller'],
       [7, 'Sabrina (1995)', 1995, 'Comedy|Romance'],
       [8, 'Tom and Huck (1995)', 1995, 'Adventure|Children'],
       [9, 'Sudden Death (1995)', 1995, 'Action'],
       [10, 'GoldenEye (1995)', 1995, 'Action|Adventure|Thriller']],
      dtype=object)

In [16]:
# define a group by method, use dictionaries
def group_by(data, key_index):
	groups = {}
	for row in data:
		key = row[key_index]
		if key not in groups:
			groups[key] = []
		groups[key].append(row)
	return groups

In [17]:
# define a count function on top of the group_by
def count(data, key_index):
	groups = group_by(data, key_index)
	count_dict = {key: len(value) for key, value in groups.items()}
	return np.array(list(count_dict.items()), dtype='object')

In [18]:
count_of_movies_by_year = count(movies,2)
average_number_of_movies_released_each_year = np.mean(count_of_movies_by_year[:,1].astype(int))

In [19]:
count_of_movies_by_year[:,1]

array([259, 237, 276, 44, 167, 42, 198, 43, 63, 47, 87, 147, 147, 142, 16,
       25, 35, 92, 59, 33, 36, 37, 42, 165, 260, 39, 10, 16, 22, 33, 34,
       31, 23, 11, 16, 37, 39, 23, 18, 30, 23, 21, 23, 17, 15, 20, 13, 18,
       30, 25, 9, 42, 45, 47, 69, 153, 139, 89, 59, 126, 42, 40, 83, 101,
       20, 12, 14, 1, 258, 4, 5, 7, 4, 263, 283, 5, 1, 1, 4, 4, 294, 311,
       279, 2, 1, 5, 279, 4, 1, 273, 295, 1, 13, 1, 284, 269, 282, 247,
       254, 233, 239, 278, 274, 218, 147, 41, 1], dtype=object)

In [20]:
average_number_of_movies_released_each_year

91.04672897196262
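
For the counting case specifically, `np.unique` with `return_counts=True` can stand in for the dictionary-based approach. The sketch below uses a tiny made-up array of years and assumes every value is a plain integer; entries with a missing year would need to be filtered out first, because `np.unique` sorts its input:

```python
import numpy as np

# made-up release years, for illustration only
years = np.array([1995, 1995, 1996, 1994, 1996, 1995])

# unique values come back sorted, with a matching array of counts
unique_years, counts = np.unique(years, return_counts=True)

for year, number_of_movies in zip(unique_years, counts):
    print(year, number_of_movies)

average_per_year = counts.mean()
print(average_per_year)
```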

# Row Oriented vs Columnar

<font color='red'><em>This bit gets real confusing, real fast, so please ignore it in the first pass.</em></font>

In [21]:
# row order, column order

arr_c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order='C')
arr_f = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order='F')

print("Row-major order:")
print(arr_c.ravel(order='C'))
print(arr_c)
# 
print("\nColumn-major order:")
print(arr_f.ravel(order='F'))
print(arr_f)
# 
arr_c_to_f = arr_c.ravel(order='F').reshape(arr_c.shape, order = 'F')
arr_f_to_c = arr_f.ravel(order='C').reshape(arr_f.shape, order = 'C')
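# ravel(order='F') reads the elements column by column into a 1-D array,
# and reshape(..., order='F') fills the new array column by column,
# so the logical array printed below is unchanged; only the memory layout differs.
# The order='C' pair does the same thing row by row.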
print("\nRow-major to Column-major:")
print(arr_c_to_f.ravel(order='F'))
print(arr_c_to_f)
print("\nColumn-major to Row-major:")
print(arr_f_to_c.ravel(order='C'))
print(arr_f_to_c)

Row-major order:
[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Column-major order:
[1 4 7 2 5 8 3 6 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Row-major to Column-major:
[1 4 7 2 5 8 3 6 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Column-major to Row-major:
[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [22]:
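def print_memory(arr):
    # caveat: arr[i, j] creates a temporary NumPy scalar and .data is a memoryview of it,
    # so the printed value is the address of a short-lived memoryview object,
    # not the element's location inside arr; that is why every line shows the same address
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            print(arr[i, j], arr[i, j].data)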

print("Memory locations for C-order:")
print_memory(arr_c)
print("\nMemory locations for F-order:")
print_memory(arr_f)

Memory locations for C-order:
1 <memory at 0x0000019C31A560C0>
2 <memory at 0x0000019C31A560C0>
3 <memory at 0x0000019C31A560C0>
4 <memory at 0x0000019C31A560C0>
5 <memory at 0x0000019C31A560C0>
6 <memory at 0x0000019C31A560C0>
7 <memory at 0x0000019C31A560C0>
8 <memory at 0x0000019C31A560C0>
9 <memory at 0x0000019C31A560C0>

Memory locations for F-order:
1 <memory at 0x0000019C31A560C0>
2 <memory at 0x0000019C31A560C0>
3 <memory at 0x0000019C31A560C0>
4 <memory at 0x0000019C31A560C0>
5 <memory at 0x0000019C31A560C0>
6 <memory at 0x0000019C31A560C0>
7 <memory at 0x0000019C31A560C0>
8 <memory at 0x0000019C31A560C0>
9 <memory at 0x0000019C31A560C0>
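
The difference between the two layouts shows up in the strides rather than in the printed addresses above. A minimal sketch, assuming `int64` elements (8 bytes each); `byte_offset` is a helper introduced here for illustration:

```python
import numpy as np

values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr_c = np.array(values, dtype=np.int64, order='C')
arr_f = np.array(values, dtype=np.int64, order='F')

# strides = bytes to step along (rows, columns)
print("C-order strides:", arr_c.strides)
print("F-order strides:", arr_f.strides)

def byte_offset(arr, i, j):
    """Offset of element (i, j) from the start of the array's buffer."""
    return i * arr.strides[0] + j * arr.strides[1]

c_offsets = [[byte_offset(arr_c, i, j) for j in range(3)] for i in range(3)]
f_offsets = [[byte_offset(arr_f, i, j) for j in range(3)] for i in range(3)]

print("C-order offsets:", c_offsets)
print("F-order offsets:", f_offsets)
```

In C order the neighbours within a row are 8 bytes apart; in F order the neighbours within a column are 8 bytes apart. Both arrays print identically because the logical indexing is the same.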
