![AIFI_logo.jpg](attachment:AIFI_logo.jpg)


# Bootcamp - Python and Coding - Primer

# Data Types and Structures

# About this Notebook

This is an introduction to the basic data types and data structures in Python. It is by no means a complete presentation of all data types, but a brief overview of the most common and important types and some of their commonly used functions. For an overview of all types and functions, see the [Python documentation](https://docs.python.org/3/library/datatypes.html). 

# Imports

In [1]:
import numpy as np
import pandas as pd
from random import randint
import decimal
from decimal import Decimal
import keyword
import os
import time
from copy import deepcopy
import warnings
warnings.filterwarnings(action='ignore')

# Python is an interpreted language

Compiled and interpreted languages differ in how they execute code.

In compiled languages like C++, the source code is transformed into machine-readable code (executable code) and is saved as a standalone program. This code is then executed directly by the computer's processor. The advantage of this approach is that once the code is compiled, it can be executed very quickly, as the computer does not have to interpret the code each time it is run.

In interpreted languages like Python, the source code is executed by an interpreter rather than being compiled ahead of time into a standalone executable. (CPython, the reference implementation, first translates the source into bytecode and then runs that bytecode statement by statement.) The advantage of this approach is a shorter edit-and-run cycle during development and testing, as there is no separate compile step each time the code is modified. Additionally, interpreted languages like Python tend to be more flexible, as it is easier to change the behavior of the code at runtime.

In summary, compiled languages typically run faster, while interpreted languages are more flexible and quicker to develop with.

![compiled_vs_interpreted-2.png](attachment:compiled_vs_interpreted-2.png)
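As a small illustration of the interpreter at work, the sketch below uses the standard-library `dis` module to list the bytecode instructions that CPython generates for a function. The function `add_one` is a hypothetical example, and the exact instruction names vary between Python versions.

```python
import dis

# Hypothetical example function; any small function would do.
def add_one(x):
    return x + 1

# CPython compiles the function body to bytecode, which the interpreter executes.
opnames = [instruction.opname for instruction in dis.get_instructions(add_one)]
print(opnames)

# The function can be called right away, with no separate compile step.
result = add_one(41)
print(result)  # 42
```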

# Built-in Data Types 

| Data type | Python |
|---|---|
| Text Type | str |
| Numeric Types | int, float, complex |
| Sequence Types | list, tuple, range |
| Mapping Type | dict |
| Set Types | set, frozenset |
| Boolean Type | bool |
| Binary Types | bytes, bytearray, memoryview |
| None Type | NoneType |

In Python, everything is an object, and the language is dynamically typed: type information is attached to the object a name refers to and is checked at runtime, whereas in statically typed languages like Java the type of a variable must be declared at compile time. For this to work, a Python value cannot be just the raw memory for the data itself. It is packaged into an object that also carries extra information, such as its type and (in CPython) a reference count:

![cint_vs_pyint.png](attachment:cint_vs_pyint.png)
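A minimal sketch of both points, assuming CPython: the same name can be rebound to objects of different types, and even a small integer object occupies more memory than a raw machine integer. The exact size reported by `sys.getsizeof` depends on the Python version and platform.

```python
import sys

x = 42                    # the name x refers to an int object
first_type = type(x)
x = "forty-two"           # the same name can later refer to a str object
second_type = type(x)
print(first_type, second_type)

# An int object stores bookkeeping data besides the value itself,
# so it is larger than the 8 bytes of a raw 64-bit integer.
int_size = sys.getsizeof(1)
print(int_size)
```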

Python lists carry even more overhead, because every element is a full object with its own type information. This is what allows a single list to mix types freely:

In [2]:
L = [ True, 99, 'hello', 5.1, b"8"]
[type(l) for l in L]

[bool, int, str, float, bytes]

This flexibility has its price. Data of a homogeneous type can be stored and processed much more efficiently if each element does not have to be wrapped in its own Python object.

This is what *arrays* are for, in particular `numpy` *arrays*:

![array_vs_list.png](attachment:array_vs_list.png)

In [3]:
# Sum a list with 10 million elements
a = list(range(10_000_000))
start = time.perf_counter()  # perf_counter() is a monotonic clock intended for timing
sum_a = sum(a)
print("Time taken by list:", time.perf_counter() - start)

# Sum a numpy array with 10 million elements
b = np.arange(10_000_000)
start = time.perf_counter()
sum_b = np.sum(b)
print("Time taken by numpy array:", time.perf_counter() - start)

Time taken by list: 0.06699419021606445
Time taken by numpy array: 0.006251811981201172


A `numpy` *array* contains data of a single data type. The standard numerical data types are:

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (removed in NumPy 2.0; use ``float64``).| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (removed in NumPy 2.0; use ``complex128``).| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 
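As a short sketch of how these data types are used, the `dtype` argument fixes the element type when an array is created, and `itemsize` reports the bytes used per element. The array names are illustrative only.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8)
print(small.dtype, small.itemsize)      # int8 1

single = np.array([1, 2, 3], dtype=np.float32)
print(single.dtype, single.itemsize)    # float32 4

# Without an explicit dtype, numpy picks one type that can hold every element.
mixed = np.array([1, 2.5])
print(mixed.dtype)                      # float64
```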

In the timing example above, summing the elements of a list takes roughly ten times longer than summing the elements of a numpy array (about 0.067 s versus 0.006 s in this run). This is because a numpy array stores its elements in a contiguous block of memory with a single fixed data type, so the sum can run in compiled code without inspecting a separate Python object for every element.

In conclusion, arrays, especially numpy arrays, are much more efficient than lists for storing and processing data of a homogeneous type, because they do not need to wrap every single element in a Python object. This yields significant performance improvements when working with large datasets.
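The storage difference can also be estimated directly. The following is a rough sketch, assuming CPython: `sys.getsizeof` reports the size of the list's reference table and of each integer object, while `nbytes` reports the raw data buffer of the array. The list figure is an approximation and varies by Python version.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores references; each referenced int is a separate Python object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The array stores the raw 8-byte values back to back.
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)
```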

___

## Basic Data Types

### Integers

In [4]:
a = 42
print('Type:', type(a))
print('Bits: ', a.bit_length())  # bit_length() counts bits, not bytes

b = 10000
print('Bits: ', b.bit_length())

Type: <class 'int'>
Bits:  6
Bits:  14


### Floats

In [5]:
c = 0.42
print('Type:', type(c))

Type: <class 'float'>
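Floats are binary floating-point numbers, so some decimal fractions cannot be represented exactly. The sketch below shows the classic case, together with two common ways of dealing with it: a tolerance-based comparison via `math.isclose`, and exact decimal arithmetic via `Decimal`.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True

# Decimal values built from strings are stored exactly.
exact = Decimal('0.1') + Decimal('0.2')
print(exact == Decimal('0.3'))   # True
```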


### Booleans

In [6]:
42 > 3  

True

In [7]:
type(42 > 3)

bool

In [8]:
type(False)

bool

In [9]:
42 >= 3  

True

In [10]:
42 < 3  

False

In [11]:
42 <= 3  

False

In [12]:
42 == 3  

False

In [13]:
42 != 3  

True

In [14]:
True and True

True

In [15]:
True and False

False

In [16]:
False and False

False

In [17]:
True or True

True

In [18]:
True or False

True

In [19]:
False or False

False

In [20]:
not True

False

In [21]:
not False

True

In [22]:
(42 > 3) and (2 > 3)

False

In [23]:
(42 == 3) or (42 != 3)

True

In [24]:
not (42 != 42)

True

In [25]:
(not (42 != 42)) and (2 == 3)

False

In [26]:
int(True)

1

In [27]:
int(False)

0
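The conversions above work because `bool` is a subclass of `int`. One practical consequence, sketched here with a hypothetical list of flags, is that `sum` counts the `True` values:

```python
print(isinstance(True, int))  # True

flags = [True, False, True, True]  # hypothetical example data
true_count = sum(flags)            # each True counts as 1, each False as 0
print(true_count)                  # 3
```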

### Strings

In [28]:
t = 'this is a Python string object'

In [29]:
t.capitalize()

'This is a python string object'

In [30]:
t.split()

['this', 'is', 'a', 'Python', 'string', 'object']

In [31]:
# find the index of the first occurrence of a substring
t.find('Python')

10

In [32]:
# if string not found, returns -1
t.find('bootcamp')

-1

In [33]:
# replacing within a string
t.replace(' ', '*')

'this*is*a*Python*string*object'

In [34]:
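# strip() removes any leading and trailing characters that appear in its argument;
# the argument is treated as a set of characters, not as a prefix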
'https://www.aifinanceinstitute.com/'.strip('https://')

'www.aifinanceinstitute.com'
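Because `strip` works on a set of characters, it can remove more than intended when the text after the prefix starts with one of those characters. A sketch with a hypothetical URL, assuming Python 3.9 or later for `removeprefix`:

```python
url = 'https://shop.example.com/'  # hypothetical URL for illustration

# 's' and 'h' are in the character set, so they are stripped from 'shop' too.
stripped = url.strip('https://')
print(stripped)  # op.example.com

# removeprefix() removes only the exact leading substring (Python 3.9+).
cleaned = url.removeprefix('https://').rstrip('/')
print(cleaned)   # shop.example.com
```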

----

# Python Data Structures

![python-data-structures.jpg](attachment:python-data-structures.jpg)


### Lists

In [35]:
l = [42, 42.5, 'python']
l[2]

'python'

In [36]:
l = list(t[:-14])
l

['t',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 'a',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n']

In [37]:
type(l)

list

In [38]:
# append() adds its argument as a single element (here, a nested list)
l.append([' ', 42, 43])
l

['t',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 'a',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 [' ', 42, 43]]

In [39]:
# extend() adds each element of the iterable individually
l.extend([1.0, 1.5, 2.0])
l

['t',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 'a',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 [' ', 42, 43],
 1.0,
 1.5,
 2.0]

In [40]:
# make an insertion
l.insert(1, 'this is an insertion')  
l

['t',
 'this is an insertion',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 'a',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 [' ', 42, 43],
 1.0,
 1.5,
 2.0]

In [41]:
# remove() deletes the first element equal to the given value
l.remove('this is an insertion')  
l.remove('a')  
l

['t',
 'h',
 'i',
 's',
 ' ',
 'i',
 's',
 ' ',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 [' ', 42, 43],
 1.0,
 1.5,
 2.0]

In [42]:
l_2 = l.pop(1)  # remove and return the element at index 1
l_3 = l[-10:]   # slice the last 10 elements into a new list
print('l: ', l, '\n')
print('l_2: ', l_2)
print('l_3: ', l_3)


l:  ['t', 'i', 's', ' ', 'i', 's', ' ', ' ', 'P', 'y', 't', 'h', 'o', 'n', [' ', 42, 43], 1.0, 1.5, 2.0] 

l_2:  h
l_3:  ['P', 'y', 't', 'h', 'o', 'n', [' ', 42, 43], 1.0, 1.5, 2.0]


In [43]:
l[2:5]  

['s', ' ', 'i']
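Slices follow the pattern `[start:stop:step]`, where `start` is inclusive and `stop` is exclusive. A short sketch on a hypothetical list of numbers:

```python
nums = list(range(10))  # [0, 1, ..., 9]

middle = nums[2:5]      # start index inclusive, stop index exclusive
tail = nums[-3:]        # negative indices count from the end
evens = nums[::2]       # every second element
backwards = nums[::-1]  # a step of -1 reverses the order

print(middle)     # [2, 3, 4]
print(tail)       # [7, 8, 9]
print(evens)      # [0, 2, 4, 6, 8]
print(backwards)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

Each slice creates a new list, so the original `nums` is left unchanged.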

### Shallow vs deep copy


A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original.

A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original. This allocates new memory for the copied objects, comparable to what a custom copy constructor does in C++.

![shallow_vs_deep_copy.png](attachment:shallow_vs_deep_copy.png)

The left image shows a shallow copy: `a[1]` and `b[1]` are lists, which are __mutable objects__, and their memory addresses __are the same__. Therefore, operations on `b[1]` will affect `a[1]`.

The right image shows a deep copy. The biggest difference is that __a new list__ is created for `b[1]`, so its __memory address is different__ from that of `a[1]`. After a deep copy, no operation on `b` affects `a`. A deep copy consumes more memory, but it is safer.

In [44]:
# Original list with a nested list at index 1
a = [1, [2, 3], 4]

# Creating a shallow copy of a
b = a.copy()  # You can also use b = a[:]

# Memory addresses before modification
print(f"Memory address of a[1]: {id(a[1])}")
print(f"Memory address of b[1]: {id(b[1])}")

# Modify the nested list through b
b[1].append(5)

# Effects of the modification
print(f"a: {a}")  # a is affected by the modification through b
print(f"b: {b}")

# Memory addresses after modification to confirm they are the same
print(f"Memory address of a[1] after modification: {id(a[1])}")
print(f"Memory address of b[1] after modification: {id(b[1])}")


Memory address of a[1]: 5173936384
Memory address of b[1]: 5173936384
a: [1, [2, 3, 5], 4]
b: [1, [2, 3, 5], 4]
Memory address of a[1] after modification: 5173936384
Memory address of b[1] after modification: 5173936384


In [45]:

# Original list with a nested list at index 1
a = [1, [2, 3], 4]

# Creating a deep copy of a
b = deepcopy(a)

# Memory addresses before modification
print(f"Memory address of a[1]: {id(a[1])}")
print(f"Memory address of b[1]: {id(b[1])}")

# Modify the nested list through b
b[1].append(5)

# Effects of the modification
print(f"a: {a}")  # a is not affected by the modification through b
print(f"b: {b}")

# Memory addresses after modification to confirm they are different
print(f"Memory address of a[1] after modification: {id(a[1])}")
print(f"Memory address of b[1] after modification: {id(b[1])}")


Memory address of a[1]: 5173886912
Memory address of b[1]: 5173936640
a: [1, [2, 3], 4]
b: [1, [2, 3, 5], 4]
Memory address of a[1] after modification: 5173886912
Memory address of b[1] after modification: 5173936640
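It helps to contrast both kinds of copy with plain assignment, which creates no copy at all. The sketch below uses the `is` operator to check object identity; the variable names are illustrative.

```python
from copy import copy, deepcopy

a = [1, [2, 3], 4]
alias = a           # plain assignment: a second name for the same list
shallow = copy(a)   # new outer list, shared inner list
deep = deepcopy(a)  # new outer list and new inner list

print(alias is a)          # True
print(shallow is a)        # False
print(shallow[1] is a[1])  # True
print(deep[1] is a[1])     # False
```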


### Dictionaries

In [46]:
d = {
     'Name' : 'The Hitchhiker\'s Guide to the Galaxy',
     'Author' : 'Douglas Adams',
     'Genre' : 'Science Fiction',
     'Pages' : 193
     }
type(d)

dict

In [47]:
print(d['Name'], d['Pages'])

The Hitchhiker's Guide to the Galaxy 193


In [48]:
d.keys()

dict_keys(['Name', 'Author', 'Genre', 'Pages'])

In [49]:
d.values()

dict_values(["The Hitchhiker's Guide to the Galaxy", 'Douglas Adams', 'Science Fiction', 193])

In [50]:
d.items()

dict_items([('Name', "The Hitchhiker's Guide to the Galaxy"), ('Author', 'Douglas Adams'), ('Genre', 'Science Fiction'), ('Pages', 193)])

In [51]:
for item in d.items():
    print(item)

('Name', "The Hitchhiker's Guide to the Galaxy")
('Author', 'Douglas Adams')
('Genre', 'Science Fiction')
('Pages', 193)


In [52]:
for value in d.values():
    print(type(value))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'int'>
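Dictionaries are mutable, so entries can be added, overwritten, and removed after creation. A brief sketch of the most common operations, using a hypothetical `stock` dictionary:

```python
stock = {'apples': 3, 'pears': 5}  # hypothetical example data

stock['plums'] = 7                 # add a new key
stock['apples'] = 4                # overwrite an existing value
missing = stock.get('kiwis', 0)    # get() returns a default instead of raising KeyError
has_pears = 'pears' in stock       # membership test on the keys
removed = stock.pop('pears')       # remove a key and return its value

# A dict comprehension builds a new dictionary from an existing one.
doubled = {name: count * 2 for name, count in stock.items()}

print(stock)    # {'apples': 4, 'plums': 7}
print(doubled)  # {'apples': 8, 'plums': 14}
```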


### Sets

In [53]:
s = set(['a', 'b', 'ab', 'ba', 'b', 'ab'])
s

{'a', 'ab', 'b', 'ba'}

In [54]:
t = set(['a', 'aa', 'bb', 'b'])

In [55]:
s.union(t)  

{'a', 'aa', 'ab', 'b', 'ba', 'bb'}

In [56]:
s.intersection(t)  

{'a', 'b'}

In [57]:
s.difference(t)  

{'ab', 'ba'}

In [58]:
t.difference(s)  

{'aa', 'bb'}

In [59]:
s.symmetric_difference(t)  

{'aa', 'ab', 'ba', 'bb'}
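The set methods above also have operator forms. The sketch below rebuilds the two sets with set literals and checks that each operator matches its method:

```python
s = {'a', 'b', 'ab', 'ba'}
t = {'a', 'aa', 'bb', 'b'}

print((s | t) == s.union(t))                 # True
print((s & t) == s.intersection(t))          # True
print((s - t) == s.difference(t))            # True
print((s ^ t) == s.symmetric_difference(t))  # True

# Membership tests work with the in operator.
print('ab' in s)  # True
print('ab' in t)  # False
```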

### Tuples

In [60]:
t = (42, 42.5, 'data')
type(t)

tuple

In [61]:
t[2]

'data'

In [62]:
type(t[2])

str

In [63]:
t.count('data')

1

In [64]:
t.index(42)

0
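The main difference from a list is that a tuple is immutable. A minimal sketch showing that item assignment fails, and that a tuple can be unpacked into separate names (the names `number`, `ratio`, and `label` are illustrative):

```python
t = (42, 42.5, 'data')

# Item assignment is not allowed on a tuple.
try:
    t[0] = 0
except TypeError as error:
    message = str(error)
    print(message)

# Unpacking assigns each element to its own name.
number, ratio, label = t
print(number, ratio, label)  # 42 42.5 data
```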