# Data Types and Functions in Python

In [None]:
from typing import List

## Data Types

Python supports many data types that can be assigned to variables.

### Strings

Strings store alphanumeric text, such as characters or words. For example:

In [None]:
x = "Hello"
type(x)

### Numerical

There are two main types of numerical data in Python: integers and floats. Integers store whole numbers, and floats store decimal numbers.

In [None]:
x = 1
type(x)

In [None]:
x = 1.5
type(x)
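As a small illustration of how the two numeric types interact, the sketch below mixes them in arithmetic. The variable names are illustrative only.

```python
# Mixing an int and a float produces a float
total = 1 + 1.5
print(type(total))   # <class 'float'>

# True division always returns a float, even when the result is whole
quotient = 6 / 3
print(quotient)      # 2.0

# Floor division between two ints returns an int
floored = 7 // 2
print(floored)       # 3
```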

### Lists

When we want to store an ordered collection of data, we can use a list.

In [None]:
x = [1,2,3,4,5]
type(x)

Elements of a list can be of any data type. Here each element is an integer:

In [None]:
type(x[1])

A single list can also mix types, including other lists:

In [None]:
y = [[1,2,3], 4, "Hello"]
print(type(y))
print(type(y[0]))
print(type(y[1]))
print(type(y[2]))
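The short sketch below shows a few common ways of reading from a list like the one above: indices start at zero, negative indices count from the end, and nested lists are reached by chaining indices.

```python
y = [[1, 2, 3], 4, "Hello"]

print(len(y))    # 3, the number of top-level elements
print(y[0])      # [1, 2, 3], the first element (indices start at 0)
print(y[-1])     # Hello, the last element
print(y[0][1])   # 2, the second element of the nested list
```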

### Dictionaries

Dictionaries store key-value pairs. They are a lot like lists, except that each value is looked up by a label (its key) that we choose, rather than by a numeric index.

In [None]:
person = {"Name": "Alice", "Age": 23}
print(type(person))

Anything can be a value, including lists and other dictionaries:

In [None]:
person = {"Name": "Alice", "Age": 23, "Friends": [{"Name": "Bob", "Age": 43}, {"Name": "Peter", "Age": 33}]}
print(type(person["Name"]))
print(type(person["Friends"]))
print(type(person["Friends"][0]))
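A minimal sketch of looking values up by key. The `"Email"` key is a hypothetical example of a key that is missing; `in` and `dict.get` let us handle that case without raising a `KeyError`.

```python
person = {"Name": "Alice", "Age": 23}

# Look up a value by its key
print(person["Age"])                     # 23

# Check whether a key exists before using it ("Email" is a hypothetical key)
print("Email" in person)                 # False

# get() returns a fallback instead of raising KeyError for a missing key
print(person.get("Email", "unknown"))    # unknown
```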

By combining these data types, we can construct complex data structures to represent what we are modelling:

In [None]:
client = {"Name": "Claire", "Age": 34, "Career": {"Employer": "NHS", "Years_of_service": 4, "Role": "Nurse"}, "Children": ["David", "Elliot"]}
display(client)

In [None]:
clients = [{"Name": "Bob", "Age": 34, "Career": {"Employer": "NHS", "Years_of_service": 4, "Role": "Nurse"}, "Children": ["David", "Elliot"]}, {"Name": "Sam", "Age": 24, "Career": {"Employer": "BT", "Years_of_service": 1, "Role": "Technician"}, "Children": []}]
display(clients)
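To read from a nested structure like this, we chain list indices and dictionary keys. The sketch below uses a trimmed-down, illustrative version of the client list; the `child_counts` summary is an assumption added for the example.

```python
# A trimmed-down version of the client list, for illustration only
clients = [
    {"Name": "Bob", "Age": 34, "Children": ["David", "Elliot"]},
    {"Name": "Sam", "Age": 24, "Children": []},
]

# Index into the list first, then look up keys in the dictionary
first_client_name = clients[0]["Name"]
first_child = clients[0]["Children"][0]

# Loop over the list to build a summary for every client
child_counts = {}
for client in clients:
    child_counts[client["Name"]] = len(client["Children"])

print(first_client_name)   # Bob
print(first_child)         # David
print(child_counts)        # {'Bob': 2, 'Sam': 0}
```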

## Mutable Data Vs Immutable Data

Data types differ in whether their values can change after they are created. If data is not allowed to change once it is assigned, we refer to it as immutable. If the data can change, it is referred to as mutable.

### Mutable Data

#### Lists

In [None]:
x = [1,2,3]
print(x[1])

In [None]:
x[1] = 3
print(x)

#### Dictionaries

In [None]:
x = {"Name": "Pete", "Age": 23}
print(x["Name"])

In [None]:
x["Name"] = "Sam"
print(x)

### Immutable Data

#### Tuples

In [None]:
x = ("Apple", "Mango")
print(x[0])

In [None]:
# Raises TypeError: 'tuple' object does not support item assignment
x[0] = "Banana"
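The sketch below catches that error and shows one way to get an "updated" tuple: build a new one and leave the original untouched. The name `updated` is illustrative.

```python
x = ("Apple", "Mango")

try:
    x[0] = "Banana"
except TypeError as error:
    # Tuples reject item assignment
    print(f"Cannot modify a tuple: {error}")

# Build a new tuple instead of changing the existing one
updated = ("Banana",) + x[1:]
print(updated)   # ('Banana', 'Mango')
print(x)         # ('Apple', 'Mango'), the original is unchanged
```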

Generally speaking, it's a bad idea to mutate your data, because it makes code much harder to debug and reason about. However, sometimes it's unavoidable.
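One reason mutation is hard to reason about is that two names can refer to the same list, so a change made through one name shows up through the other. The sketch below contrasts a function that mutates its argument with one that returns a new list. Both function names are hypothetical.

```python
from typing import List

def add_bonus_in_place(scores: List[int]) -> List[int]:
    # Hypothetical helper: changes the caller's list directly
    for i in range(len(scores)):
        scores[i] += 10
    return scores

def add_bonus(scores: List[int]) -> List[int]:
    # Hypothetical helper: builds a new list and leaves the input untouched
    return [score + 10 for score in scores]

original = [50, 60]
alias = original             # both names refer to the same list object

add_bonus_in_place(alias)
print(original)              # [60, 70], changed even though we passed alias

fresh = [50, 60]
result = add_bonus(fresh)
print(fresh)                 # [50, 60], the input is unchanged
print(result)                # [60, 70]
```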

## Functions

The typical role of a function in Python is to transform data. Functions often take data as input and return data as output.

In [None]:
def add_3(num: int) -> int:
    return num + 3 

add_3(5)

Notice the type hints in the function signature. They document the expected input and output types, which makes functions much easier to understand!
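Type hints are documentation rather than enforcement: Python stores them on the function but does not check them when the function is called. A minimal sketch of this behaviour:

```python
def add_3(num: int) -> int:
    return num + 3

# The hints are stored on the function as metadata
print(add_3.__annotations__)

# They are not checked at runtime, so a float is accepted without error
print(add_3(2.5))   # 5.5
```

To have hints checked, a separate static type checker has to be run over the code.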

### DRY Principle

Don't repeat yourself (DRY). If you find that your code does the same thing multiple times, you should write a function instead:

In [None]:
# Combine data
x = [1,2,3,4,5,6]
y = [2,4,7,8,4]

sum_x = 0
for val in x:
    sum_x += val

sum_y = 0
for val in y:
    sum_y += val
    
combined_data = sum_x + sum_y
print(combined_data)

This is better represented as:

In [None]:
def combine_list(data: List[int]) -> int:
    sum_data = 0 
    for val in data:
        sum_data += val
    return sum_data

In [None]:
# Combine data
x = [1,2,3,4,5,6]
y = [2,4,7,8,4]
combined_data = combine_list(x) + combine_list(y)
combined_data 

Say we need to change how we combine lists so that we take the maximum element of each list instead of the sum. This requires a lot of changes in the first example:

In [None]:
# Combine data
x = [1,2,3,4,5,6]
y = [2,4,7,8,4]

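max_x = x[0]  # start from the first element so negative values are handled
for val in x:
    if val > max_x:
        max_x = val

max_y = y[0]  # start from the first element so negative values are handled
for val in y:
    if val > max_y:
        max_y = val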
    
combined_data = max_x + max_y
print(combined_data)

Versus just updating the function:

In [None]:
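def combine_list(data: List[int]) -> int:
    # Start from the first element so negative values are handled correctly
    # (assumes a non-empty list)
    max_data = data[0]
    for val in data:
        if val > max_data:
            max_data = val
    return max_data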

In [None]:
# Combine data
x = [1,2,3,4,5,6]
y = [2,4,7,8,4]
combined_data = combine_list(x) + combine_list(y)
combined_data 
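One way to take this a step further is to pass the combining step in as a parameter, so that switching strategy needs no change to the function at all. This is a sketch, not the notebook's own design: `combine_lists` and its `combine` parameter are hypothetical names, and the built-in `sum` and `max` stand in for the hand-written loops.

```python
from typing import Callable, List

def combine_lists(
    x: List[int],
    y: List[int],
    combine: Callable[[List[int]], int],
) -> int:
    # Hypothetical helper: apply the same strategy to both lists and add the results
    return combine(x) + combine(y)

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 7, 8, 4]

print(combine_lists(x, y, sum))   # 46 (21 + 25)
print(combine_lists(x, y, max))   # 14 (6 + 8)
```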

## Function cohesion

Functions should generally do one thing well. If a function has too many responsibilities, it becomes difficult to understand, reason about, test, and maintain.

In [None]:
def process_data(data: List[int], threshold: int) -> float:
    data_without_duplication = []
    for datum in data:
        if datum not in data_without_duplication:
            data_without_duplication.append(datum)
    absolute_value_of_data = []
    for datum in data_without_duplication:
        absolute_value_of_data.append(abs(datum))
    data_without_outliers = []
    for datum in absolute_value_of_data:
        if datum <= threshold:
            data_without_outliers.append(datum)
    sum_of_data = 0
    counter = 0
    for datum in data_without_outliers:
        sum_of_data += datum
        counter += 1
    return sum_of_data/counter


process_data([1,2,3,-5,6,3], 5)

The above is a mess. It's really hard to understand what is happening and to debug. It's also going to be difficult to change the process if we need to in the future, and it will be hard to test.

In [None]:
def deduplicate_data(data: List[int]) -> List[int]:
    data_without_duplication = []
    for datum in data:
        if datum not in data_without_duplication:
            data_without_duplication.append(datum)
    return data_without_duplication

def absolute_value_of_data(data: List[int]) -> List[int]:
    # Use a local name that does not shadow the function's own name
    absolute_values = []
    for datum in data:
        absolute_values.append(abs(datum))
    return absolute_values

def remove_outliers(data: List[int], threshold: int) -> List[int]:
    data_without_outliers = []
    for datum in data:
        if datum <= threshold:
            data_without_outliers.append(datum)
    return data_without_outliers

def compute_mean(data: List[int]) -> float:
    return sum(data)/len(data)

def process_data(data: List[int], threshold: int) -> float:
    data = deduplicate_data(data)
    data = absolute_value_of_data(data)
    data = remove_outliers(data, threshold)
    return compute_mean(data)

process_data([1,2,3,-5,6,3], 5)
    

This code is much easier to understand. Each function does only one thing and is named to explain its task, so we can reuse the functions in other parts of the codebase. It's also now easy to understand the data-processing pipeline. We can test each function individually and easily add or update functions if the process changes. This code has high cohesion.
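As a rough sketch of what testing each function individually can look like, the checks below use plain `assert` statements on small inputs. The functions are repeated (in condensed form) so the cell runs on its own, and the test inputs are made up for illustration.

```python
from typing import List

def deduplicate_data(data: List[int]) -> List[int]:
    result = []
    for datum in data:
        if datum not in result:
            result.append(datum)
    return result

def remove_outliers(data: List[int], threshold: int) -> List[int]:
    return [datum for datum in data if datum <= threshold]

def compute_mean(data: List[int]) -> float:
    return sum(data) / len(data)

# Each function can be checked in isolation with a tiny, made-up input
assert deduplicate_data([1, 2, 2, 3, 1]) == [1, 2, 3]
assert remove_outliers([1, 5, 6], 5) == [1, 5]
assert compute_mean([1, 2, 3, 5]) == 2.75
print("all checks passed")
```

If one of these checks fails, the name of the failing function points directly at the step that needs fixing, which is much harder to achieve with a single large function.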