### Combined usage of list and string data types

In [20]:
s = "This is a string"

# s.split() creates a list of substrings
# that are separated by whitespace
s.split()

['This', 'is', 'a', 'string']

In [21]:
# To get all characters of a string in a list
l = list(s)
print(l)
print(type(l))

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g']
<class 'list'>


In [22]:
# Using the join method to join the elements of a list into a string
# Here the elements of l are joined with an empty string as separator
s_joined = "".join(l)
print(s_joined)
print(type(s_joined))

This is a string
<class 'str'>
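
The cells above can be combined into a round trip from a string to a list and back. The following is a minimal sketch; the variable names `words`, `rebuilt` and `dashed` are only illustrative and do not appear in the cells above.

```python
s = "This is a string"

# split() without arguments splits on runs of whitespace
words = s.split()

# join() uses the string it is called on as the separator
rebuilt = " ".join(words)
dashed = "-".join(words)

print(words)
print(rebuilt)
print(dashed)

# Splitting on whitespace and joining with a single space
# gives back the original string in this case
print(rebuilt == s)
```

Note that the round trip only reproduces the original string when its words are separated by single spaces, because split() discards the exact whitespace it finds.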


### Deeper Knowledge of python 'list'

-> A python list is a very flexible data structure that lets us perform complex tasks easily, but this flexibility comes at the cost of performance. Because a list is mutable and can contain objects of any data type, it has to keep track of a lot of memory addresses and pointers, which makes it an inefficient base for high performance numerical computation.

-> Furthermore, a list can grow and shrink on demand, so it is a dynamic data structure that doesn't have a predefined size or shape.

-> We will see in future projects that although a python 'list' lets us work on various problems quite easily, we will often have to stop using it and instead use the 'ndarray' object of numpy for large scale data processing and analysis.
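
The points above can be illustrated with a few lines of plain python. This is only a sketch of the behaviour described, and the names `mixed`, `numbers`, `inner` and `outer` are made up for this example.

```python
# A list can hold objects of different data types at the same time
mixed = [1, 2.5, "three", [4, 5], None]
type_names = [type(item).__name__ for item in mixed]
print(type_names)

# A list has no predefined size: it grows and shrinks in place
numbers = []
for i in range(5):
    numbers.append(i)
numbers.pop()  # removes the last element
print(numbers)

# A list stores references to objects, not copies of them
inner = [1, 2]
outer = [inner]
inner.append(3)
print(outer[0] is inner)
print(outer)
```

Because the outer list only holds a reference, the change made through `inner` is visible through `outer` as well.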

In [23]:
# In the next few cells we will analyze how 'inefficient' the list data structure and the python language are.
# Please don't think that this means that Python is a 'bad' or 'slow' language. What we will actually
# learn from this is how seamlessly we can make Python flexible or performant as necessary.
# 
# The example we will look at here is very basic: we will add 10000000 numbers, starting from 1 and ending at 10000000

In [24]:
# Standard Library to help us benchmark performance
import time
# Third Party Library
import numpy as np

In [25]:
l = list(range(1, 10000001))
nparray = np.array(l)

In [26]:
# Using a python for loop

start = time.perf_counter_ns()

s = 0
for number in l:
    s = s + number

end = time.perf_counter_ns()

print(f"The for loop method runtime: {end - start} nanosec")

for_method = end - start

The for loop method runtime: 1192794800 nanosec


In [27]:
# Using built-in sum function provided by python

start = time.perf_counter_ns()

sum(l)

end = time.perf_counter_ns()

print(f"The sum method runtime: {end - start} nanosec")

sum_method = end - start

The sum method runtime: 316956100 nanosec


In [28]:
# Using the sum function and ndarray object provided by the numpy library

start = time.perf_counter_ns()

np.sum(nparray)

end = time.perf_counter_ns()

print(f"The np.sum method runtime: {end - start} nanosec")

npsum_method = end - start

The np.sum method runtime: 6113900 nanosec


In [29]:
print("The ratio of the benchmarks is: ")
print(f"npsum : sum : for = {1} : {(sum_method/npsum_method):.2f} : {(for_method/npsum_method):.2f}")

The ratio of the benchmarks is: 
npsum : sum : for = 1 : 51.84 : 195.10


In [30]:
# Boosting the performance of numpy
# by using numba 'decorators'. A decorator
# is an advanced concept: it is a function that
# takes in a function and returns a processed version of it.
# We apply a decorator with the '@' syntax by
# writing the decorator name directly above the function
# we want to decorate.

import numba as nb

@nb.njit()
def nb_np_sum(a: np.ndarray) -> int:
    return np.sum(a)
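
To make the idea of a decorator more concrete, here is a minimal sketch of a hand-written one that uses only the standard library. The decorator `timed` and the function `loop_sum` are hypothetical names for this illustration. They are not part of numba, and numba's decorators do far more than this (they compile the function).

```python
import functools
import time


def timed(func):
    """Hypothetical decorator: print how long each call to func takes."""

    @functools.wraps(func)  # keep the name and docstring of func
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter_ns() - start
        print(f"{func.__name__} runtime: {elapsed} nanosec")
        return result

    return wrapper


# '@timed' is shorthand for: loop_sum = timed(loop_sum)
@timed
def loop_sum(numbers):
    total = 0
    for number in numbers:
        total = total + number
    return total


result = loop_sum(range(1, 101))
print(result)
```

The decorated function is called exactly like the original one. The extra timing behaviour comes from the wrapper that the decorator returns.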

In [33]:
# Using the sum function and ndarray object provided by the numpy library
# and boosting its performance further by using numba

# We have to run a numba decorated function once before timing it,
# as numba compiles the function during the first call.
# Thus when we use a numba function for the first time, it might
# perform worse than the others

start = time.perf_counter_ns()

nb_np_sum(nparray)

end = time.perf_counter_ns()

print(f"The numba np.sum method runtime: {end - start} nanosec")

np_nb_sum_method = end - start

The numba np.sum method runtime: 2390600 nanosec


In [34]:
print("The ratio of the benchmarks is: ")
print(f"np_nb_sum : npsum : sum : for = {1} : {(npsum_method/np_nb_sum_method):.2f} : {(sum_method/np_nb_sum_method):.2f} : {(for_method/np_nb_sum_method):.2f}")

The ratio of the benchmarks is: 
np_nb_sum : npsum : sum : for = 1 : 2.56 : 132.58 : 498.95


We can see that, in this run, the core list data type combined with a python for loop is about 500 times slower than numpy and numba combined, even for a basic numerical computation. The exact ratio will vary between machines and runs, and the performance loss may become more significant when working on more complex numerical computations.
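
Each benchmark above was measured with a single run, so the numbers are noisy. One possible way to get steadier measurements is the standard library `timeit` module, which repeats a call many times. The sketch below compares only the for loop and the built-in `sum` on a smaller list so that it runs quickly. The names `loop_sum`, `loop_times` and `sum_times` and the chosen repeat counts are assumptions made for this example.

```python
import timeit

numbers = list(range(1, 100001))


def loop_sum(values):
    total = 0
    for value in values:
        total = total + value
    return total


calls = 10  # calls per measurement
# repeat=5 gives five independent measurements of 'calls' calls each
loop_times = timeit.repeat(lambda: loop_sum(numbers), number=calls, repeat=5)
sum_times = timeit.repeat(lambda: sum(numbers), number=calls, repeat=5)

# The minimum is commonly used, as it is least affected by other processes
best_loop = min(loop_times) / calls
best_sum = min(sum_times) / calls

print(f"The for loop method runtime: {best_loop * 1e9:.0f} nanosec per call")
print(f"The sum method runtime: {best_sum * 1e9:.0f} nanosec per call")
print(f"sum : for = 1 : {best_loop / best_sum:.2f}")
```

The printed ratio depends on the machine and on the python version, so it should be read as a rough indication only.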

This doesn't mean that the list data structure provided by python is useless. It is actually really useful for analyzing and preprocessing large scale datasets, especially datasets that contain strings and mixed data types.

Also, for a specialized branch of AI, Natural Language Processing (NLP), we will often have to work with a combination of list, string and dictionary objects.
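
As a small taste of how these three data types work together, the sketch below counts how often each word occurs in a few sentences. The example sentences and the names `sentences` and `counts` are made up for this illustration, and real NLP pipelines usually need more careful tokenization than `split()`.

```python
# A list of strings as a tiny text dataset
sentences = ["This is a string", "This is another string"]

# A dictionary that maps each word to the number of times it occurs
counts = {}
for sentence in sentences:
    # lower() makes the count case-insensitive, split() gives a list of words
    for word in sentence.lower().split():
        counts[word] = counts.get(word, 0) + 1

print(counts)
```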