# CDCS Summer School
# A Gentle Introduction to Coding for Data Analysis
## Session 10: The Final Syntactic Frontier

---------------

### Learning objectives for this session:

By the end of this notebook you will:

1. Have recapped lists, functions and loops.
2. Know how to use functions, loops and lists together.
3. Understand why we sometimes need `try-except` blocks.
4. Know a bit about computational efficiency and why it is important to consider.

--------

## 1. Recapping lists, functions and loops.

It has been a very long week so far, and you are doing great! Let us bring together some of the most important concepts covered up to now. Below we recap lists, functions and loops, all in the context of penguins, ready for the rest of the week, which is all about data!

Here, we create a simple list containing penguin species and another list with flipper lengths. We demonstrate how to access elements in a list, which is one of the most basic operations in list manipulation. Accessing list elements by index allows us to retrieve specific data points from a dataset.

In [None]:
# List of penguin species
species = ['Adelie', 'Chinstrap', 'Gentoo']

# List of flipper lengths (in mm)
flipper_lengths = [181, 195, 210, 182, 191, 213]

# Display the first species and the first flipper length
print("First species:", species[0])
print("First flipper length:", flipper_lengths[0])

Functions are vital for code reusability and organisation. Here, we define a function to calculate the average of numerical data. This type of function is useful when you need to perform statistical analysis on data, such as finding the average flipper length of penguins. This demonstrates how functions can encapsulate logic that can be repeatedly applied to different data sets.

In [None]:
# Function to calculate the average of a list
def average(data):
    return sum(data) / len(data)

# Calculate the average flipper length
avg_flipper_length = average(flipper_lengths)
print("Average flipper length:", avg_flipper_length)

Loops allow us to automate repetitive tasks, such as processing each item in a list. Here, we use a `for` loop to iterate over the species list and print each one. This is a common pattern when dealing with data, allowing us to apply operations to each element of a dataset systematically.

In [None]:
# Print each species in the species list
for s in species:
    print("Species:", s)

Python's list comprehensions provide a concise way to create lists. They are often more readable and efficient than using a loop to create a list. In this example, we filter flipper lengths to include only those greater than 190 mm, demonstrating a common data filtering operation.

In [None]:
# List flipper lengths greater than 190 mm using list comprehension
long_flippers = [length for length in flipper_lengths if length > 190]
print("Flipper lengths greater than 190 mm:", long_flippers)
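To make the comparison with a loop concrete, here is a minimal sketch that builds the same filtered list both ways. The names `example_lengths`, `long_flippers_loop` and `long_flippers_comp` are made up for this illustration so that nothing defined above is overwritten.

```python
# Hypothetical example list with the same values as flipper_lengths above
example_lengths = [181, 195, 210, 182, 191, 213]

# Loop version: start with an empty list and append each matching value
long_flippers_loop = []
for length in example_lengths:
    if length > 190:
        long_flippers_loop.append(length)

# List comprehension version: the same logic written on one line
long_flippers_comp = [length for length in example_lengths if length > 190]

print("Loop result:", long_flippers_loop)
print("Comprehension result:", long_flippers_comp)
print("Same result?", long_flippers_loop == long_flippers_comp)
```

Both versions produce the same list, so choosing between them is mostly a question of which one you find easier to read.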

Modifying data in a list directly within a loop is a frequent necessity in data processing. Here, we increment each flipper length by 5 mm, simulating an adjustment or calibration of measurement data. This kind of operation shows how loops can be used to modify data in place.

In [None]:
# Increase each flipper length by 5 mm
for i in range(len(flipper_lengths)):
    flipper_lengths[i] += 5

print("Updated flipper lengths:", flipper_lengths)
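The same in-place update can also be written with `enumerate`, which hands you the position and the value together. The sketch below uses made-up lists (`example_lengths`, `original_lengths`) so that it does not change `flipper_lengths` a second time, and it also shows the alternative of building a new list and leaving the original alone.

```python
# Hypothetical example list, kept separate from flipper_lengths above
example_lengths = [181, 195, 210]

# enumerate() gives both the position (i) and the value (length)
for i, length in enumerate(example_lengths):
    example_lengths[i] = length + 5

print("Updated in place:", example_lengths)

# Alternative: build a new list and leave the original untouched
original_lengths = [181, 195, 210]
adjusted_lengths = [length + 5 for length in original_lengths]
print("Original (unchanged):", original_lengths)
print("Adjusted copy:", adjusted_lengths)
```

Keeping the original list untouched can be the safer choice in a notebook, because re-running a cell that modifies data in place applies the change again.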

----

## 2. Using functions, loops and lists together.

Building on our basic understanding of lists, functions, and loops, we now explore how to effectively combine these tools to perform more complex data processing tasks. This integration is crucial for efficiently managing and analysing data, particularly in scenarios that simulate real-world challenges like those found in the Palmer Penguins dataset.

First, we'll look at how functions can be used to transform lists. Here, we create a function to 'normalise' data, a common preprocessing step in data analysis. This function subtracts the minimum length in the list from each flipper length, so the smallest value becomes 0 and every other value shows how far it is above the minimum. This can help in standardising measurements for comparison.

In [None]:
# Function to normalize flipper lengths (subtract the minimum length)
def normalize_lengths(lengths):
    min_length = min(lengths)
    normalized = [length - min_length for length in lengths]
    return normalized

# Normalize flipper lengths
normalized_lengths = normalize_lengths(flipper_lengths)
print("Normalized flipper lengths:", normalized_lengths)

### Filtering Data Using Functions and Loops

Next, we demonstrate how to use functions and loops together to filter data. This approach is useful for extracting subsets of data based on specific criteria, such as selecting penguins observed at a particular time of day or having certain characteristics. Here, the function keeps only the flipper lengths that are above the average and discards the rest, a typical way of isolating the larger values in a dataset.

In [None]:
# Function to filter flipper lengths that are above the average
def filter_above_average(lengths):
    avg = average(lengths)  # Using the average function defined earlier
    return [length for length in lengths if length > avg]

# Filter flipper lengths
filtered_lengths = filter_above_average(flipper_lengths)
print("Filtered flipper lengths above average:", filtered_lengths)

### Using Functions to Process and Categorize List Items in a Loop

Integrating loops with functions allows us to categorise or modify each item in a list systematically. In this example, we categorise penguins based on their body mass, a typical task in data processing that involves applying a classification rule across a dataset. The function holds the rule, and a list comprehension applies it to every item. This method is instrumental in preparing data for further analysis or reporting.

In [None]:
# Function to categorize body mass
def categorize_mass(mass):
    if mass > 5000:
        return 'Heavy'
    elif mass > 3000:
        return 'Moderate'
    else:
        return 'Light'

# List of body masses (in grams)
body_masses = [4500, 3800, 3200, 2900, 5500, 6100]

# Categorize each body mass
mass_categories = [categorize_mass(mass) for mass in body_masses]
print("Body mass categories:", mass_categories)

### Aggregating Data from Lists Using Loops and Functions

Data aggregation is a critical function in data science, often requiring the integration of loops and functions to compile statistics or metrics from data. Here, we create a function to count occurrences of each species in a list. This example highlights how functions can encapsulate the logic for counting, while loops iterate over the dataset to apply this logic, providing a clear summary of data.

In [None]:
# Function to count occurrences of each species
def count_species(species_list):
    species_count = {}
    for spec in species_list:
        if spec in species_count:
            species_count[spec] += 1
        else:
            species_count[spec] = 1
    return species_count

# List of species observed
species_observations = ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo']

# Count each species
species_counts = count_species(species_observations)
print("Species counts:", species_counts)
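Counting items is common enough that Python's standard library includes a ready-made tool for it, `collections.Counter`. The sketch below is an alternative to the hand-written function above, not a replacement for understanding it. The variable names `observations` and `counts` are made up for this illustration.

```python
from collections import Counter

# Hypothetical list with the same values as species_observations above
observations = ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo']

# Counter builds the same kind of species -> count mapping in one step
counts = Counter(observations)

print("Species counts:", dict(counts))
print("Most common species:", counts.most_common(1))
```

Writing the loop yourself shows what is happening, while `Counter` is handy once the idea is familiar.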

### Nested Loops for Data Comparisons

Nested loops are particularly useful when you need to perform operations that require comparing or combining elements from two different lists or datasets. This technique is essential in many data processing tasks, such as finding matches or correlations between different data points.

In this example, we use nested loops to find all matches between two lists of species. This kind of operation is common in many fields, including bioinformatics, ecology, and general data science, where you might need to cross-reference or match datasets against each other to find commonalities or relationships.

In [None]:
# Function to find all matches between two lists of species
def find_matches(list1, list2):
    matches = []
    for item1 in list1:
        for item2 in list2:
            if item1 == item2 and item1 not in matches:
                matches.append(item1)
    return matches

# Another list of species for matching
additional_species = ['Gentoo', 'Adelie', 'Emperor']

# Find matches
matching_species = find_matches(species_observations, additional_species)
print("Matching species:", matching_species)

This function demonstrates the power and utility of nested loops in data analysis:

- Outer Loop: Iterates over each species in the first list (`species_observations`).
- Inner Loop: For each species in the outer loop, the inner loop iterates over each species in the second list (`additional_species`).
- Conditional Check: Inside the inner loop, we check if the species from the first list matches any species in the second list and if it is not already included in the matches list to avoid duplicates.
- Appending Matches: If a match is found and it is unique, it is appended to the matches list.

----

## 3. Try-except blocks.

In this section, we'll explore how using `try-except` blocks can help us manage errors more gracefully in our Python programs. Error handling is a critical part of developing robust code, especially when processing uncertain or variable data like that found across the arts and humanities sphere.

We start by addressing a common error in data processing: division by zero. This type of error can occur when calculating ratios or percentages, especially if the data has not been fully cleaned or validated. Proper handling of this error is essential to maintain the integrity of your data analysis processes.

In [None]:
# Function to calculate the ratio of flipper length to body mass
def flipper_body_ratio(flipper_length, body_mass):
    try:
        ratio = flipper_length / body_mass
    except ZeroDivisionError:
        ratio = None  # Assign a None value if body mass is zero
    return ratio

# Example data
flipper_length_example = 210
body_mass_example = 0  # This might be a data entry error

# Calculate ratio
ratio_result = flipper_body_ratio(flipper_length_example, body_mass_example)
print("Flipper to body mass ratio:", ratio_result)

In this code, the `try-except` block is used to catch the `ZeroDivisionError`. If the error occurs (due to a body mass of zero in this case), the function returns `None` instead of crashing the program. This approach ensures that your data processing can continue even if some data points are flawed.
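To see processing continue past a flawed data point, here is a minimal sketch that applies the same idea to several penguins in a loop. The lists `sample_flippers` and `sample_masses` are made-up values, with one body mass of zero standing in for a data entry error.

```python
# Same idea as the function above, repeated so this cell runs on its own
def flipper_body_ratio(flipper_length, body_mass):
    try:
        return flipper_length / body_mass
    except ZeroDivisionError:
        return None  # Mark the ratio as missing instead of crashing

# Hypothetical measurements; the second body mass is a deliberate error
sample_flippers = [181, 195, 210]
sample_masses = [3750, 0, 5000]

# zip() pairs up the items of the two lists, one pair per penguin
ratios = []
for flipper, mass in zip(sample_flippers, sample_masses):
    ratios.append(flipper_body_ratio(flipper, mass))

print("Ratios:", ratios)

# Keep only the ratios that could actually be calculated
valid_ratios = [r for r in ratios if r is not None]
print("Number of valid ratios:", len(valid_ratios))
```

The loop reaches the third penguin even though the second one could not be processed, and the `None` marker makes the missing value easy to filter out afterwards.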

### Handling Index Errors in Lists

Next, we address another frequent issue when dealing with lists: index errors. These errors happen when trying to access elements at non-existing indexes, which is a common mistake during data manipulation.

In [None]:
# Function to safely retrieve an element from a list by index
def safe_access(data_list, index):
    try:
        return data_list[index]
    except IndexError:
        return "Index out of range"

# Example list of penguin species
species = ['Adelie', 'Chinstrap', 'Gentoo']

# Attempt to access an out-of-range index
print("Accessing fifth element:", safe_access(species, 4))
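One detail worth knowing is that negative indexes are valid in Python and count from the end of the list, so they only raise an `IndexError` when they reach past the start. A short sketch, using a made-up list name `penguin_species`:

```python
# Same function as above, repeated so this cell runs on its own
def safe_access(data_list, index):
    try:
        return data_list[index]
    except IndexError:
        return "Index out of range"

# Hypothetical list with the same values as species above
penguin_species = ['Adelie', 'Chinstrap', 'Gentoo']

last_item = safe_access(penguin_species, -1)   # -1 means the last element
too_far = safe_access(penguin_species, -4)     # only 3 elements, so this fails

print("Index -1:", last_item)
print("Index -4:", too_far)
```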

### Handling Type Errors

Type errors can disrupt a program when an operation receives a data type it does not support, such as adding a number to a string. (Multiplying an integer by a string does not raise an error in Python. It repeats the string, so `210 * 'two'` would quietly produce a very long string instead.) Handling these errors is crucial for data integrity and user experience.

In [None]:
# Function to add an adjustment (must be a number) to a flipper length
def adjust_flipper_length(length, adjustment):
    try:
        return length + adjustment
    except TypeError:
        return "Adjustment must be a number"

# Example usage
print("Correct adjustment:", adjust_flipper_length(210, 2))
print("Incorrect adjustment:", adjust_flipper_length(210, 'two'))

### Handling Errors in Data Conversion

Data conversion errors are common when processing data that comes from varied sources, such as user input or external files. A value that cannot be interpreted in the requested way, such as the text `'unknown'` passed to `int()`, raises a `ValueError`, which can disrupt data processing if not properly handled.

In [None]:
# Function to convert string data to integer
def convert_to_integer(data):
    try:
        return int(data)
    except ValueError:
        return "Invalid input: cannot convert to integer"

# Example data (might be fetched from a CSV file as strings)
data_entries = ['3500', '4100', 'unknown']

# Convert data and handle errors
converted_data = [convert_to_integer(entry) for entry in data_entries]
print("Converted data:", converted_data)
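Returning a message string works for display, but it leaves the list holding a mix of numbers and text, which makes later calculations awkward. One possible variation, sketched below with the made-up names `convert_or_none`, `raw_entries` and `valid_values`, returns `None` for bad entries so they can be filtered out before doing any maths.

```python
# Hypothetical variant of convert_to_integer that returns None on failure
def convert_or_none(data):
    try:
        return int(data)
    except ValueError:
        return None

# Hypothetical entries, as they might arrive from a CSV file
raw_entries = ['3500', '4100', 'unknown']

converted = [convert_or_none(entry) for entry in raw_entries]
print("Converted (None marks a failed conversion):", converted)

# Drop the failed conversions, then calculate with what is left
valid_values = [value for value in converted if value is not None]
average_value = sum(valid_values) / len(valid_values)
print("Valid values:", valid_values)
print("Average of valid values:", average_value)
```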

### Using Finally for Cleanup Actions

A `finally` block holds code that runs after the `try` and `except` blocks, regardless of whether an exception was raised. This is particularly useful for cleanup actions, such as closing files, because it ensures these actions occur no matter what happens in the `try` and `except` blocks.

In [None]:
# Function to process data with a cleanup action
def process_data(data):
    try:
        result = [int(d) for d in data]  # Attempt to convert all entries to integers
    except ValueError:
        result = "Error processing data"
    finally:
        print("Cleanup actions can be performed here.")
    return result

# Example data with a possible error
data_example = ['3000', '4500', 'five thousand']
processed_data = process_data(data_example)
print("Processed data:", processed_data)
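Since closing files is the classic cleanup action, here is a minimal sketch of `finally` doing exactly that. The file name `penguin_masses_example.txt` and its contents are made up for this illustration, and the file is written to a temporary folder and deleted at the end so nothing is left behind.

```python
import os
import tempfile

# Create a small temporary file so the example runs on its own
temp_path = os.path.join(tempfile.gettempdir(), "penguin_masses_example.txt")
with open(temp_path, "w") as f:
    f.write("3000\n4500\nfive thousand\n")

# Open the file, try to convert every line, and always close it afterwards
data_file = open(temp_path)
try:
    masses = [int(line) for line in data_file]
except ValueError:
    masses = "Error processing data"
finally:
    data_file.close()  # Runs whether or not the conversion failed

print("Result:", masses)
print("File closed?", data_file.closed)

# Remove the temporary file
os.remove(temp_path)
```

In everyday code a `with open(...)` block, as used above to write the file, closes the file automatically. The explicit `finally` here is only meant to show the mechanism.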

Using `try-except` blocks allows us to handle unexpected errors gracefully, maintaining the integrity and continuity of our programs. Proper error handling is especially important in data science, where real-world data often includes anomalies and irregularities.

----

## 4. Computational efficiency.

This section delves into the concept of computational efficiency, which encompasses both time complexity and space complexity. Efficient coding practices are essential for handling large datasets effectively, reducing runtime and memory usage, and ultimately improving the performance of software applications.

As a note, this section becomes quite technical and mathematical. The concepts here are important to be aware of, but not the be-all-and-end-all of programming. So give it a go and if it doesn't sink in the first time, that's ok!

### What is time complexity?

Time complexity refers to the amount of time it takes to run an algorithm as a function of the length of the input. Understanding this helps predict how algorithms will perform as datasets grow larger. For instance, an algorithm that works well for small data might become impractically slow when applied to larger datasets, such as those commonly found in data science. There are some handy things to know when judging time complexity:

- Linear Time (O(n)): A loop over n items that performs a fixed amount of work for each item. The total time is roughly the time for one pass through the loop multiplied by n.

- Quadratic Time (O(n²)): Nested loops over the same dataset, like comparing each item with every other item. The work grows with the **square** of the input size (doubling n roughly quadruples the running time), so it gets very slow very quickly.

In [None]:
import time

# Generate a list of numbers
data = list(range(1, 1001))

# Linear time algorithm: summing all elements
start_time = time.time()
total_sum = sum(data)
end_time = time.time()
print("Linear time - Total sum:", total_sum)
print("Execution time (linear):", end_time - start_time)

# Quadratic time algorithm: sum of products of all pairs
start_time = time.time()
total_sum_pairs = 0
for i in data:
    for j in data:
        total_sum_pairs += i * j
end_time = time.time()
print("Quadratic time - Total sum of pairs:", total_sum_pairs)
print("Execution time (quadratic):", end_time - start_time)
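Timings vary from machine to machine, so it can be clearer to count steps instead. The sketch below uses two made-up helper functions, `count_linear_steps` and `count_quadratic_steps`, to count how many times the innermost line runs for a single loop and for a nested loop as n doubles.

```python
# Count the steps taken by a single loop over n items
def count_linear_steps(n):
    steps = 0
    for i in range(n):
        steps += 1
    return steps

# Count the steps taken by a nested loop over n items
def count_quadratic_steps(n):
    steps = 0
    for i in range(n):
        for j in range(n):
            steps += 1
    return steps

# Double n each time and compare how the step counts grow
for n in [10, 20, 40]:
    print("n =", n,
          "| linear steps:", count_linear_steps(n),
          "| quadratic steps:", count_quadratic_steps(n))
```

Each time n doubles, the linear count doubles while the quadratic count becomes four times larger.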

### What is space complexity?

Space complexity deals with the amount of memory an algorithm needs to run. This becomes crucial when dealing with large datasets where inefficient memory use can lead to excessive resource consumption, slowing down or even halting processes. Efficient data handling and structures can significantly reduce a program’s memory footprint.

Different data structures are suited to different tasks and can drastically affect the performance of a program. For example, accessing an element in a list by its index is fast, but checking whether a value is somewhere in a list means scanning through it item by item, and growing a very large list can occasionally involve copying its contents to a new memory location. Structures like dictionaries and sets are designed for fast lookups and can be more efficient for this kind of check.

In [None]:
import time

# List of species with duplicates
species_list = ['Adelie'] * 500 + ['Chinstrap'] * 500 + ['Gentoo'] * 500

# Checking for duplicates using a set
start_time = time.time()
duplicates = len(species_list) != len(set(species_list))
end_time = time.time()
print("Using a set to check duplicates:", duplicates)
print("Execution time (set):", end_time - start_time)

# Inserting and checking for duplicates using a list
start_time = time.time()
duplicates = False  # Reset the flag so this check does not reuse the result above
seen_species = []
for name in species_list:  # 'name' avoids overwriting the 'species' list defined earlier
    if name in seen_species:
        duplicates = True
        break
    seen_species.append(name)
end_time = time.time()
print("Using a list to check duplicates:", duplicates)
print("Execution time (list):", end_time - start_time)
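The timings above compare speed rather than memory. To look at memory use directly, one option is `sys.getsizeof`, which reports the size of an object in bytes. The sketch below compares a list holding 100,000 numbers with a `range` that produces the same numbers on demand. The names `n_items`, `numbers_list` and `numbers_range` are made up for this illustration, and the exact sizes will differ between Python versions and machines.

```python
import sys

n_items = 100_000

# A list stores a reference to every number at once
numbers_list = list(range(n_items))

# A range only remembers its start, stop and step, and produces numbers on demand
numbers_range = range(n_items)

# getsizeof measures the container itself, not the numbers it refers to
list_size = sys.getsizeof(numbers_list)
range_size = sys.getsizeof(numbers_range)

print("Size of list (bytes):", list_size)
print("Size of range (bytes):", range_size)

# Both give the same answer when summed
print("Same total?", sum(numbers_list) == sum(numbers_range))
```

When you only need to loop over values once, an object that produces them on demand can save a lot of memory compared with building the full list first.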

------

## ⭐️⭐️⭐️💥 What you learned in this session: Three stars and a wish.
**In your own words** write in the Markdown cell below:

- 3 things you would like to remember from this notebook.
- 1 thing you wish to understand better in the future or a question you'd like to ask.

*Add your reflections here.*

--------------

## Topic Overview

In [None]:
# Function to categorize body mass
def categorize_mass(mass):
    if mass > 5000:
        return 'Heavy'
    elif mass > 3000:
        return 'Moderate'
    else:
        return 'Light'

# List of body masses and applying the function
body_masses = [4500, 3800, 3200, 2900, 5500, 6100]
mass_categories = [categorize_mass(mass) for mass in body_masses]
print("Body mass categories:", mass_categories)

In [None]:
# Normalize flipper lengths
def normalize_lengths(lengths):
    min_length = min(lengths)
    return [length - min_length for length in lengths]

flipper_lengths = [181, 195, 210, 182, 191, 213]
normalized_lengths = normalize_lengths(flipper_lengths)
print("Normalized flipper lengths:", normalized_lengths)

In [None]:
# Function to safely convert string data to integer
def convert_to_integer(data):
    try:
        return int(data)
    except ValueError:
        return "Invalid input: cannot convert to integer"

data_entries = ['3500', '4100', 'unknown']
converted_data = [convert_to_integer(entry) for entry in data_entries]
print("Converted data:", converted_data)

-----------

# ⛏ Exercise: Process the Penguins

Imagine you are provided with a dataset containing information about a group of penguins. The data for each penguin includes species, island, flipper length, body mass, and whether the penguin was observed in the morning or evening. The dataset is represented as a list of dictionaries, where each dictionary corresponds to a single penguin.

In [None]:
penguins_data = [
    {'species': 'Adelie', 'island': 'Torgersen', 'flipper_length': 181, 'body_mass': 3750, 'time_of_day': 'morning'},
    {'species': 'Chinstrap', 'island': 'Dream', 'flipper_length': 195, 'body_mass': 3800, 'time_of_day': 'evening'},
    {'species': 'Gentoo', 'island': 'Biscoe', 'flipper_length': 210, 'body_mass': 5000, 'time_of_day': 'morning'},
    {'species': 'Adelie', 'island': 'Torgersen', 'flipper_length': 186, 'body_mass': 3200, 'time_of_day': 'evening'},
    {'species': 'Gentoo', 'island': 'Biscoe', 'flipper_length': 222, 'body_mass': 5200, 'time_of_day': 'morning'},
]

1. Data Filtering Function:
Start by writing a function to filter penguins by species and time_of_day. The function should return a list of dictionaries for penguins matching both criteria.

In [None]:
# try to solve the task here

2. Data Processing Function:
Write a function to calculate the average flipper length and body mass for a given list of penguin data. Ensure your function is robust against potential errors like division by zero or missing data keys.

In [None]:
# try to solve the task here

# ⛏ Exercise: Considering quadratic time

Discuss, with your partner, the impact of quadratic time complexity when processing large datasets, such as those with thousands of penguins and their attributes. Think about how nested loops can affect performance and what alternatives might exist, such as using more efficient algorithms or data structures.

*write your thoughts here.*