# Table of Contents
- [Data Engineering Debugging Techniques](#Data-Engineering-Debugging-Techniques)
- [Mastering Clean Coding with PEP8](#Mastering-Clean-Coding-with-PEP8)
- [Notes](#CNotes)

# Introduction to Data Engineering and Common Challenges

1. Brief introduction to data engineering.
  - Overview of common challenges and errors faced in data engineering.

2. Types of Errors in Data Engineering

  - Syntax Errors: Explain with examples of incorrect code syntax.
  - Logical Errors: Demonstrate through examples where the logic of a data pipeline is flawed.
  - Runtime Errors: Discuss errors that occur during execution, like connection failures, with examples.
  - Data-Related Errors: Explain errors related to data quality, format inconsistencies, etc.

3. Debugging Techniques
  - Understanding Error Messages: How to read and understand error messages and stack traces.
  - Logging: Implementing logging in data pipelines. Show examples using Python's logging module.
  - Unit Testing: Writing and running unit tests for data processing functions. Use Python's unittest or pytest framework for examples.
  - Interactive Debugging: Demonstrate the use of interactive debugging tools (e.g., Python Debugger (pdb)).
  - Version Control for Debugging: Using git to track changes and find when bugs were introduced.

4. Case Studies and Use Cases
  - Case Study 1: Debugging a Data Pipeline Failure.
  - Case Study 2: Solving Data Quality Issues in a Data Lake.
  - Case Study 3: Performance Tuning in Data Processing.

5. Best Practices in Error Handling and Prevention
  - Writing robust error handling code.
  - Strategies to prevent common errors in data engineering.

6. Conclusion
  - Summary of key points.

Data engineering involves the design and management of data workflows and pipelines. In this field, professionals often encounter various challenges, including data inconsistency, pipeline failures, and performance issues.

In this notebook, we will explore common errors in data engineering and discuss effective debugging techniques.


## Types of Errors in Data Engineering:

### Syntax Errors

Syntax errors occur when the code written does not conform to the rules of the programming language.

Imagine you're writing a recipe for a cake in a language you're still learning. If you misuse grammar or vocabulary, the person reading the recipe might not understand what you're trying to say. In programming, syntax errors are similar. They occur when the code is not written according to the grammatical rules of the programming language. For instance, missing a comma or a parenthesis in Python can lead to a syntax error. It's like forgetting a full stop in a sentence.

**Example 1: Missing the colon for the `for` loop**

In [None]:
for i in range(10)
    print(i)

This will cause a syntax error because the colon (:) after range(10) is missing.

**Example 2: Missing Quotes in String**

In [None]:
name = 'Alice  # Missing closing quote
print(name)

Error: SyntaxError due to the unclosed string.

**Example 3: Incorrect Indentation**

In [None]:
def greet(name):
print("Hello, " + name)

Error: IndentationError because the print statement is not correctly indented within the function.

**Example 4: Missing Closing Bracket**

In [None]:
list = [1, 2, 3, 4
print(list[2])

Error: SyntaxError due to a missing closing bracket for the list.

### Logical Errors

Logical errors arise when the code does not perform the intended task due to incorrect logic.

Suppose you successfully wrote the cake recipe, but you accidentally wrote "bake for 5 hours" instead of "bake for 50 minutes." While the recipe is grammatically correct, the logic (baking time) is incorrect. In programming, logical errors are when the syntax is right, but the code doesn't do what you intend it to do. The program runs but gives the wrong result.

**Example 1:**

In [None]:
def double_number(num):
    return num * num  # Intended to multiply by 2, but squared instead

Here, the function squares the number (`num * num`) instead of doubling it (`num * 2`), so `double_number(3)` returns 9 rather than 6.

**Example 2: Incorrect Comparison**

In [None]:
def is_adult(age):
    return age < 18  # Logic should be 'age >= 18'

print(is_adult(20))

Issue: The function incorrectly returns `False` for an adult age.

**Example 3: Wrong Arithmetic Operation**

In [None]:
def calculate_discount(price):
    return price / 0.1  # Should be 'price * 0.1' for a 10% discount

print(calculate_discount(100))

Issue: The function calculates the wrong discount value.

**Example 4: Incorrect Loop Condition**

In [None]:
i = 0
while i != 10:
    i += 3

Issue: This loop will never terminate because `i` takes the values 3, 6, 9, 12, ... and never equals 10. A condition such as `i < 10` would end the loop.


### Runtime Errors

Runtime errors are encountered during the execution of a program, such as database connection failures.

Imagine you've written the perfect cake recipe, but when someone tries to make it, they realize they don't have a crucial ingredient, like flour. This is a runtime error - an error that occurs during the execution of the program. It could be due to external factors like missing data, a failed database connection, or insufficient memory.

**Example 1:**

In [None]:
numbers = [1, 2, 3]
print(numbers[3])  # Index out of range error

Here, you're trying to access the fourth element (index 3) of a list that only has three elements.

**Example 2: Division by Zero**

In [None]:
x = 10
y = 0
print(x / y)

Error: ZeroDivisionError at runtime.

**Example 3: Accessing Invalid List Index**

In [None]:
my_list = [1, 2, 3]
print(my_list[5])

Error: IndexError as the index 5 does not exist in `my_list`.

**Example 4: File Not Found**

In [None]:
with open('nonexistent_file.txt', 'r') as file:
    data = file.read()

Error: FileNotFoundError because the specified file does not exist.
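
Runtime errors like the ones above can be caught and handled so that a single failure does not crash the whole pipeline. The following is a minimal sketch; the helper names `read_text_or_default` and `safe_divide` are illustrative assumptions, not part of the original examples.

```python
def read_text_or_default(path, default=""):
    """Return the file's contents, or `default` if the file is missing."""
    try:
        with open(path, "r") as file:
            return file.read()
    except FileNotFoundError:
        # The file does not exist; fall back instead of crashing.
        return default


def safe_divide(x, y):
    """Return x / y, or None when y is zero."""
    try:
        return x / y
    except ZeroDivisionError:
        return None


print(repr(read_text_or_default("nonexistent_file.txt")))
print(safe_divide(10, 0))
print(safe_divide(10, 4))
```

Catching only the specific exception you expect (rather than a bare `except:`) keeps unrelated bugs visible.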

### Data-Related Errors

These errors are related to the quality and consistency of data, including missing values or incorrect data formats.

Finally, let's say you write a recipe calling for "a cup of sugar," but don't specify what kind of cup (metric or imperial). If someone uses the wrong type, the cake won't turn out as expected. In data engineering, data-related errors occur when the data is not what your program expects. This could be due to incorrect data formats, missing values, or inconsistent data.

**Example 1:**

In [None]:
data = {"name": "Alice", "age": "Twenty-five"}
age = int(data["age"])  # Error: cannot convert the string "Twenty-five" to an integer


Here, the program expects a numeric value for age, but it receives the non-numeric string "Twenty-five", so `int()` raises a `ValueError`.

**Example 2: Incorrect Data Type**

In [None]:
age = "twenty-five"  # Age is a string, not a number
if age > 20:
    print("Adult")

Issue: TypeError as the comparison is between a string and an integer.

**Example 3: Missing Data**

In [None]:
data = {'name': 'Alice', 'age': None}
print("Age:", data['age'] + 1)

Issue: `TypeError`, because `None` cannot be added to an integer.

**Example 4: Unexpected Data Format**

In [None]:
from datetime import datetime

date_str = "2020-31-02"  # Unusual date format, looks like YYYY-DD-MM
try:
    # Parse the string assuming the expected YYYY-MM-DD format
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
except ValueError as e:
    print("Error:", e)

Issue: This code raises a **`ValueError`** because `strptime` reads `31` as the month, and there is no 31st month. The error arises from an incorrect assumption about the format of the date string. This is a common issue in data engineering, where data arrives in unexpected or inconsistent formats and causes processing errors.

In this example, the code expects the date in the format **`YYYY-MM-DD`**, but the input is in **`YYYY-DD-MM`** order, so parsing fails.



Understanding these errors through everyday analogies can help you grasp the concepts and apply them in your work.
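
Data-related errors are often easiest to manage by validating values before using them. A minimal sketch, assuming a hypothetical helper `parse_age` and made-up sample records:

```python
def parse_age(record):
    """Return the age as an int, or None if it is missing or not numeric."""
    raw = record.get("age")
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Covers values such as "Twenty-five" that cannot be converted.
        return None


records = [
    {"name": "Alice", "age": "25"},
    {"name": "Bob", "age": "Twenty-five"},
    {"name": "Cara", "age": None},
]
for record in records:
    print(record["name"], parse_age(record))
```

Returning `None` for bad values is only one possible design; a pipeline might instead log the record or route it to a separate "rejected" dataset.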

## Debugging Techniques

Effective debugging is crucial in data engineering. Here are some techniques:

### 1. Understanding Error Messages

Error messages provide clues about what went wrong. It's essential to understand how to interpret these messages.

Imagine you're assembling a complicated piece of IKEA furniture but something goes wrong. The instruction manual often has troubleshooting tips pointing out common mistakes. Error messages in programming are similar. They are like hints or specific instructions pointing out what might have gone wrong in your code. Learning to understand these messages is like learning to decipher those troubleshooting tips to figure out where you might have made a mistake.

Analogy: A furniture assembly manual saying, "Screw A should not be used with Panel B." This is similar to an error message in programming, guiding you to the exact problem.

**Example 1: Index Error**

In [None]:
my_list = [1, 2, 3]
print(my_list[3])

Error Message: **`IndexError: list index out of range`** <br>
Interpretation: The code is trying to access an index that doesn't exist in **`my_list`**.

**Example 2: Type Error**

In [None]:
age = "25"
print(age + 5)

Error Message: **`TypeError: can only concatenate str (not "int") to str`** <br>
Interpretation: The code is trying to add an integer to a string, which is not allowed.

**Example 3: Syntax Error**

In [None]:
for i in range(10)
    print(i)

Error Message: **`SyntaxError: invalid syntax`** (Python 3.10 and later report **`SyntaxError: expected ':'`**) <br>
Interpretation: The code is missing a colon (`:`) at the end of the **`for`** statement.
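
When an error occurs inside nested function calls, Python prints a stack trace that lists each call leading to the failure, with the exception type and message on the last line. The sketch below captures a trace as text with the standard `traceback` module; the function names `load_value` and `run_step` are illustrative assumptions.

```python
import traceback


def load_value(values, index):
    return values[index]


def run_step():
    # Index 3 does not exist in a three-element list.
    return load_value([1, 2, 3], 3)


try:
    run_step()
except IndexError:
    trace_text = traceback.format_exc()
    print(trace_text)
    # The last line of a stack trace names the exception and its message.
    last_line = trace_text.strip().splitlines()[-1]
    print("Summary:", last_line)
```

Reading a trace from the bottom up usually works well: start with the final line to see what went wrong, then move upward to find which of your own functions triggered it.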

### 2. Logging

Implementing logging helps in monitoring and troubleshooting data pipelines. Python's `logging` module can be used for this purpose.

Think of logging like keeping a diary of what happens during a science experiment. In data engineering, logging is the practice of recording what your code is doing, especially when it's processing data. This can help you understand what happened leading up to an error. Python's logging module lets you easily record these events.

Analogy: A lab journal that notes every step of an experiment. If the experiment fails, you can look back to see what might have caused the problem.

**Example 1: Basic Logging**

In [1]:
import logging

logging.basicConfig(level=logging.INFO)
logging.info("This is an info message")

Explanation: This sets up basic logging and records an informational message.

**Example 2: Logging with Different Levels**

In [None]:
logging.debug("This is for debugging")
logging.warning("This is a warning message")

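Explanation: Records messages at different severity levels. With the level set to `INFO` in Example 1, the `DEBUG` message is suppressed and only the warning is shown.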

**Example 3: Logging to a File**

In [None]:
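# force=True (Python 3.8+) replaces the handlers configured earlier;
# without it, a second basicConfig call has no effect
logging.basicConfig(filename='example.log', level=logging.ERROR, force=True)
logging.error("An error has occurred")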

Explanation: This configures logging to write error messages to a file named **`example.log`**.
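
In a data pipeline, it often helps to give each stage its own named logger and to record the full stack trace when a record fails. The sketch below is one way to do this; the logger name `pipeline.transform` and the helper `to_int` are assumptions for illustration, and an in-memory stream stands in for a log file.

```python
import io
import logging

log_stream = io.StringIO()  # Stands in for a log file in this sketch
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("pipeline.transform")  # Hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # Keep these messages out of the root logger


def to_int(value):
    """Convert a value to int, logging the failure instead of crashing."""
    try:
        return int(value)
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace.
        logger.exception("Could not convert %r to int", value)
        return None


logger.info("Starting transform")
results = [to_int(v) for v in ["1", "2", "three"]]
logger.info("Finished transform with %d bad values", results.count(None))
print(log_stream.getvalue())
```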

### 3. Unit Testing

Unit testing ensures that individual components of the data pipeline work as expected. Python's `unittest` or `pytest` frameworks are commonly used.

Unit testing in programming is like testing each individual component of a machine before assembling the whole machine. You ensure that each part (or unit) works correctly on its own. This makes it easier to pinpoint problems. Python provides frameworks like unittest or pytest for this purpose.

Analogy: Before building a car, each part, such as the engine, brakes, and lights, is tested separately to ensure it works properly.

**`Note:`** There is also the concept of **End-to-End Testing** <br>



#### **Example 1: Testing a Function**

In [None]:
import unittest

def add(a, b):
    return a + b

class TestAddFunction(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(1, 2), 3)

# argv and exit are set so the call also works inside a notebook
unittest.main(argv=[''], exit=False)

Explanation: A simple test case for a function that adds two numbers.

#### **Example 2: Testing for an Exception**

In [None]:
def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

class TestDivideFunction(unittest.TestCase):
    def test_divide_zero(self):
        with self.assertRaises(ValueError):
            divide(10, 0)

# argv and exit are set so the call also works inside a notebook
unittest.main(argv=[''], exit=False)

Explanation: Testing that the **`divide`** function correctly raises a **`ValueError`** when dividing by zero.

#### **Example 3: Testing a Data Processing Function**

In [None]:
import unittest

def clean_data(data):
    """
    Cleans the input string by trimming whitespace, converting to lowercase,
    and removing special characters.
    """
    # Trim whitespace
    cleaned = data.strip()

    # Convert to lowercase
    cleaned = cleaned.lower()

    # Remove special characters
    cleaned = ''.join(char for char in cleaned if char.isalnum() or char.isspace())

    return cleaned


class TestDataCleaning(unittest.TestCase):

    def test_whitespace_removal(self):
        self.assertEqual(clean_data("   Hello World   "), "hello world")

    def test_lowercase_conversion(self):
        self.assertEqual(clean_data("HeLLo WorLD"), "hello world")

    def test_special_character_removal(self):
        self.assertEqual(clean_data("Hello@# World!!"), "hello world")

    def test_combined_cleaning(self):
        self.assertEqual(clean_data("  HeLLo@# WoRLD!!  "), "hello world")

    def test_empty_string(self):
        self.assertEqual(clean_data(""), "")

if __name__ == '__main__':
    unittest.main(argv=[''], exit=False)


#### **Explanation of Test Cases** <br>

1. **test_whitespace_removal**: Verifies that leading and trailing whitespace is removed.
2. **test_lowercase_conversion**: Checks if the function converts all characters to lowercase.
3. **test_special_character_removal**: Tests if special characters (non-alphanumeric) are removed.
4. **test_combined_cleaning**: A comprehensive test to ensure the function performs all cleaning tasks together correctly.
5. **test_empty_string**: Checks the behavior of the function with an empty string as input.

Each of these test cases calls the **`clean_data`** function with different input strings and checks if the output matches the expected cleaned string. This way, we can ensure that our data cleaning function behaves as intended for a variety of input scenarios.

By running these tests, especially after modifications to the function, you can quickly catch and fix any regressions or bugs, ensuring the reliability and correctness of your data cleaning logic.
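
The same checks can also be written for `pytest`, which discovers functions whose names start with `test_` and uses plain `assert` statements. The sketch below repeats a compact version of `clean_data` so that it is self-contained; the file name `test_clean_data.py` mentioned afterwards is an assumption.

```python
def clean_data(data):
    """Trim whitespace, lowercase, and drop non-alphanumeric characters."""
    cleaned = data.strip().lower()
    return "".join(ch for ch in cleaned if ch.isalnum() or ch.isspace())


def test_whitespace_removal():
    assert clean_data("   Hello World   ") == "hello world"


def test_special_character_removal():
    assert clean_data("Hello@# World!!") == "hello world"


def test_empty_string():
    assert clean_data("") == ""
```

Saved as, for example, `test_clean_data.py`, these tests would run with the `pytest` command from the same directory.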








#### **End-to-End Testing** <br>

End-to-end testing, on the other hand, is like testing the whole car after it has been fully assembled to ensure all parts work well together. This involves running scenarios that mimic real-world use of the entire system to validate the complete flow of data or processes.

**Analogy:** Once the car is fully assembled, you take it for a test drive. This checks how well the engine, brakes, lights, and other components work together in real-world conditions. In data engineering, end-to-end testing would mean running the entire data pipeline from data ingestion, processing, to storage and ensuring the whole process works as expected.

**Key Points:**

- Tests the entire system as a whole.
- Validates the integration and interaction between different components.
- Mimics real-world usage and checks the overall system behavior.

**Differences and Importance**

- Scope: Unit testing checks individual components, while end-to-end testing evaluates the entire system.
- Complexity: Unit tests are simpler and focus on the logic of small parts. End-to-end tests are more complex and involve testing the system as a whole.
- Purpose: Unit testing ensures that each component functions correctly on its own, while end-to-end testing ensures that all parts of the system work together correctly in a real-world scenario.
- Detection of Issues: Unit testing can quickly identify which specific component has a problem. End-to-end testing helps to identify issues in the interaction between different components.

#### **Example 1: Testing a Data Pipeline** <br>

Suppose you have a data pipeline that extracts data from a source, transforms it, and then loads it into a database. An end-to-end test would check this entire process.

**Python Pseudocode:**

In [None]:
import unittest
from my_data_pipeline import DataPipeline

class TestDataPipeline(unittest.TestCase):
    def test_pipeline_flow(self):
        pipeline = DataPipeline()
        pipeline.extract_data("source_data.csv")
        pipeline.transform_data()
        success = pipeline.load_data("destination_database")
        self.assertTrue(success)

if __name__ == '__main__':
    unittest.main()

Explanation: This test simulates running the entire pipeline with a specific dataset and checks whether the data is successfully loaded into the destination database.

**Expected Output:** <br>
For this test case, the expected output is a confirmation message indicating the test passed. This means the data pipeline successfully extracted, transformed, and loaded the data as intended.

If the test passes, the output will typically be along the lines of:

![Success report](https://raw.githubusercontent.com/ehiughele/Project-COVID19-DE-PROJECT/main/image/test%201.png)

If there's a failure at any stage of the pipeline (e.g., extraction, transformation, or loading fails), the test framework will report a failure, and the output will look something like this:

![Failure report](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%202.png?raw=true)
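
The pseudocode above depends on a `my_data_pipeline` module that is not shown. As a runnable illustration of the same idea, the sketch below builds a tiny pipeline from three hypothetical functions (`extract`, `transform`, `load`) and tests it end to end against an in-memory SQLite database. The table name and sample data are assumptions.

```python
import csv
import io
import sqlite3
import unittest


def extract(csv_text):
    """Read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Normalize names and convert ages to integers."""
    return [(row["name"].strip().title(), int(row["age"])) for row in rows]


def load(rows, connection):
    """Write the transformed rows into a 'people' table."""
    connection.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    connection.executemany("INSERT INTO people VALUES (?, ?)", rows)
    connection.commit()


class TestPipelineEndToEnd(unittest.TestCase):
    def test_pipeline_flow(self):
        source = "name,age\n alice ,30\nBOB,41\n"
        connection = sqlite3.connect(":memory:")
        load(transform(extract(source)), connection)
        stored = connection.execute(
            "SELECT name, age FROM people ORDER BY name"
        ).fetchall()
        connection.close()
        # The test checks the final stored state, not each stage separately.
        self.assertEqual(stored, [("Alice", 30), ("Bob", 41)])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPipelineEndToEnd)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Using an in-memory database keeps the test fast and leaves no files behind; a real end-to-end test would typically target a dedicated test environment instead.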

#### **Example 2: Testing an API End-to-End** <br>

If you have a REST API that processes data requests, an end-to-end test would involve sending a request and verifying the response and the state of the system.

**Python with Requests:**

In [None]:
import requests
import unittest

class TestApiEndToEnd(unittest.TestCase):
    def test_data_processing(self):
        response = requests.post("http://example.com/api/process", json={"data": "test"})
        self.assertEqual(response.status_code, 200)
        # Further checks can be added to verify database changes or other side effects

if __name__ == '__main__':
    unittest.main()

Explanation: This test sends a POST request to the API and verifies that the response indicates success. Additional checks can be added to ensure the data was processed correctly in the backend.

**Expected Output:** <br>
For the API test, a successful test will output a message indicating the test passed, confirming that the API responded correctly to the POST request and any subsequent checks (like database validation) were successful.

A successful test output might look like:

![Success report](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%203.png?raw=true)

If the API does not respond as expected or if the subsequent data checks fail, the output will indicate a failure:

![Failure report](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%203-1.png?raw=true)

#### **Example 3: Testing a Batch Data Processing Job** <br>

In many data engineering contexts, batch data processing is a common task. An end-to-end test for such a job would involve running the job with a predefined dataset and then verifying that it produced the correct output.

**Python Pseudocode for a Batch Job Test:**

In [None]:
import unittest
from batch_processor import BatchProcessor

class TestBatchProcessing(unittest.TestCase):
    def test_batch_job(self):
        processor = BatchProcessor()
        processor.load_data("input_dataset.csv")
        processor.run_job()
        success = processor.verify_output("expected_output.csv")
        self.assertTrue(success)

if __name__ == '__main__':
    unittest.main()


Explanation: This test simulates the execution of a batch processing job, ensuring that the job processes the input data as expected and produces the correct output. The **`verify_output`** method is assumed to compare the job's output against a predefined expected output, validating the correctness of the job.

**Expected Output:** <br>
For the batch job testing, a passing test indicates that the batch processing job correctly processed the input data and produced the correct output, as verified against the expected result.

A successful test will show:

![Success report](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%204.png?raw=true)

If the batch job fails to produce the correct output, the test will fail, and the output might look like this:

![Failure report](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%204-1.png?raw=true)

### 4. Interactive Debugging

Tools like Python Debugger (pdb) can be used for interactive debugging.

Interactive debugging is akin to having a conversation with someone while trying to solve a puzzle. You ask questions (commands) and get immediate responses that help you understand the puzzle better. Tools like Python Debugger (pdb) allow you to pause your program, inspect variables, and step through your code line by line to find out where things are going wrong.

Analogy: It's like having a guide who helps you through each step of a maze, providing immediate feedback on each decision you make.

#### **Example 1: Using Python Debugger (pdb)**

Imagine you have a script where you're unsure why a loop is behaving unexpectedly. You can insert a breakpoint using **`pdb`** to inspect it.

In [None]:
import pdb

for i in range(5):
    pdb.set_trace()  # Breakpoint
    print(i)


Explanation: When you run this script, it will pause at the **`pdb.set_trace()`** line. You can inspect variables, step through the code, and continue execution interactively.

**Expected Interaction and Output:** <br>
When the script encounters the **`pdb.set_trace()`** line, it will pause execution and enter the debugger. You'll see a prompt like this:

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%205.png?raw=true)

Here, you can type commands like **`p i`** to print the current value of **`i`**, or **`c`** to continue execution. The script will stop at the breakpoint each time the loop iterates.

#### **Example 2: Exploring Variables with pdb**

In [None]:
def compute_sum(numbers):
    sum = 0
    for number in numbers:
        pdb.set_trace()  # Breakpoint
        sum += number
    return sum

compute_sum([1, 2, 3])


Explanation: During each iteration of the loop, you can use pdb to inspect the **`number`** and **`sum`** variables.

**Expected Interaction and Output:**
Similar to Example 1, the script will pause at the **`pdb.set_trace()`** line inside the loop. The debugger prompt will be shown:

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%206.png?raw=true)

At this point, you can inspect variables like **`number`** and **`sum`** using commands like **`p number`** or **`p sum`**. After inspecting, you can continue (**`c`**) to the next iteration or exit (**`q`**) the debugger.

#### **Example 3: Using pdb to Identify Logical Errors**

In [None]:
def find_max(numbers):
    max_number = numbers[0]
    for number in numbers:
        pdb.set_trace()  # Breakpoint
        if number > max_number:
            max_number = number
    return max_number

find_max([1, 3, 2])


Explanation: By stepping through the function, you can watch how **`max_number`** is updated and verify the logic.

**Expected Interaction and Output:** <br>
As the function executes, it will hit the breakpoint. The debugger will open:

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%207.png?raw=true)

In the debugger, you can step through each line (**`n`** for next line), inspect variables (**`p max_number`** and **`p number`**), and understand how **`max_number`** is updated in each iteration. This interactive session helps you trace the logic step-by-step and identify where it might be going wrong.

In each of these examples, **`pdb`** serves as an interactive tool that allows you to pause execution, inspect the state of your program, and execute commands to understand how your code is behaving. This is particularly useful for identifying and resolving logical errors and understanding the flow of execution.
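
Since Python 3.7, the built-in `breakpoint()` function offers the same behavior as `pdb.set_trace()` without an explicit import. Pausing on every iteration can be tedious, so a breakpoint is often placed behind a condition. The sketch below is one possible approach; the `debug` parameter is an assumption added for illustration.

```python
def find_max(numbers, debug=False):
    """Return the largest number; optionally pause when the maximum changes."""
    max_number = numbers[0]
    for number in numbers:
        if number > max_number:
            if debug:
                breakpoint()  # Opens the debugger only when debug=True
            max_number = number
    return max_number


# debug defaults to False, so this call runs without pausing.
print(find_max([1, 3, 2]))
```

Calling `find_max([1, 3, 2], debug=True)` would pause just before `max_number` is updated. Setting the environment variable `PYTHONBREAKPOINT=0` disables all `breakpoint()` calls, which is useful if one is accidentally left in the code.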

### 5. Version Control for Debugging

Using version control systems like git helps in tracking changes and identifying when bugs were introduced.

Using version control systems like git in programming is similar to keeping a detailed history of the drafts of a novel. Each change is recorded, and if something goes wrong, you can look back through the changes to find out where the error was introduced. It's a powerful tool for collaboration and tracking the evolution of your codebase.

Analogy: Imagine writing a book with multiple drafts. If the latest draft has a problem, you can compare it with earlier drafts to see what changed and find where the error might have started.

#### **Example 1: Browsing History with git**

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%208.png?raw=true)

Explanation: By reviewing commit messages and diffs, you can identify when a specific change was made that may have introduced a bug.

**Expected Output:** <br>
The **`git log`** command provides a list of recent commits in the repository, including the commit message, author, and date.

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%2011.png?raw=true)

####  **Example 2: Comparing Changes**

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%209.png?raw=true)

Explanation: This command allows you to see what changed between two commits, which can be useful in pinpointing when a bug was introduced.

**Expected Output:** <br>
Using **`git diff <commit_id_1> <commit_id_2>`**, you'll see a detailed comparison between two commits. It shows what has been added or removed in the files that changed.

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%2012.png?raw=true)

####  **Example 3: Checking Out an Older Version**

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%2010.png?raw=true)

Explanation: If you suspect a recent change caused a bug, you can revert to an older version of the code to see if the problem persists.

**Expected Output:** <br>
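When you run **`git checkout <commit_id>`**, git switches your working directory to the state it was in at the specified commit. The output confirms the switch and notes that you are in a "detached HEAD" state, meaning you are viewing that commit rather than working on a branch.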

![Output](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/test%2013.png?raw=true)

In these examples, the outputs from git provide valuable information for debugging. They allow you to explore the history of changes, compare different states of the code, and even revert to previous versions to understand when and how bugs were introduced. This is a crucial part of the debugging process, especially in complex projects where changes are continuously integrated.

## Case Studies and Use Cases

## Case Study 1: Debugging a Data Pipeline Failure

This case study explores the steps taken to debug a failing data pipeline.

Analogy: Imagine a water pipeline system in a city. One day, water stops flowing to certain areas. The engineers must find the cause: Is it a blockage, a leak, or a pump failure? They check the pipeline segment by segment until they locate the problem.

In data engineering, debugging a data pipeline is similar. You check each part of the pipeline - data ingestion, processing, and storage - to locate the issue. It might be a failed data source connection, a processing error in the transformation stage, or an issue with the final data load into the storage system.

**Steps for Debugging:**

- Check logs to identify where the pipeline failed.
- Test each component individually (source, processing, storage).
- Correct the identified issue and monitor the pipeline to ensure it's resolved.
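
The steps above can be sketched in code: run each stage in order, log the outcome, and report the first stage that fails. The stage functions and sample data below are hypothetical stand-ins for real ingestion, processing, and storage code.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")  # Hypothetical logger name


def ingest(_):
    """Pretend to read raw values from a source."""
    return ["1", "2", "x"]


def process(rows):
    """Convert raw values to integers; fails on the bad value 'x'."""
    return [int(row) for row in rows]


def store(values):
    """Pretend to store the values and return how many were stored."""
    return len(values)


def run_pipeline():
    """Run each stage in order and return the name of the first failure."""
    data = None
    for name, stage in [("ingest", ingest), ("process", process), ("store", store)]:
        try:
            data = stage(data)
            logger.info("Stage %s succeeded", name)
        except Exception:
            # The log now shows exactly which stage failed, with a stack trace.
            logger.exception("Stage %s failed", name)
            return name
    return None


failed_stage = run_pipeline()
print("Failed stage:", failed_stage)
```

Here the log shows that ingestion succeeded and processing failed, which narrows the investigation to the transformation logic and the data it received.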

## Case Study 2: Solving Data Quality Issues in a Data Lake

Here, we discuss strategies to address and rectify data quality issues in a data lake.

Analogy: Consider a librarian organizing a library. If books are randomly placed without any system, finding a specific book becomes difficult. The librarian must organize and possibly clean out old, irrelevant books.

In a data lake, solving data quality issues is akin to organizing this library. The data might be in various formats, incomplete, outdated, or inaccurate. The task is to clean (remove inaccuracies), organize (correct formats, schema), and update (ensure data is current and relevant).

**Strategies:**

- Implement data cleaning processes to remove or correct inaccurate records.
- Standardize data formats and schemas for consistency.
- Regularly update the data and remove outdated or irrelevant information.
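
As a small illustration of these strategies, the sketch below standardizes dates to a single format, drops incomplete or unparseable records, and removes duplicates. The list of accepted date formats, the field names, and the sample records are all assumptions.

```python
from datetime import datetime


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_records(records):
    """Standardize, filter, and de-duplicate a list of records."""
    cleaned, seen = [], set()
    for record in records:
        date = standardize_date(record.get("date") or "")
        name = (record.get("name") or "").strip()
        if not name or date is None:
            continue  # Drop incomplete or unparseable records
        key = (name.lower(), date)
        if key in seen:
            continue  # Drop duplicates
        seen.add(key)
        cleaned.append({"name": name, "date": date})
    return cleaned


raw = [
    {"name": "Alice", "date": "2024-01-05"},
    {"name": "alice ", "date": "05/01/2024"},
    {"name": "Bob", "date": "20240211"},
    {"name": "", "date": "2024-03-01"},
    {"name": "Cara", "date": "not a date"},
]
print(clean_records(raw))
```

In practice, dropped records are usually logged or written to a quarantine area rather than silently discarded, so that the source of bad data can be investigated.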

## Case Study 3: Performance Tuning in Data Processing

This case focuses on optimizing the performance of data processing tasks.

Analogy: Think of a factory assembly line. If one part of the line is slower, it bottlenecks the entire production. To optimize, each segment of the line must operate efficiently.

Similarly, in data processing, performance tuning involves identifying bottlenecks or inefficient code segments in your data processing tasks. This could be an inefficient database query, a slow data transformation process, or inadequate resource allocation.

**Optimization Techniques:**

- Profile the processing tasks to identify slow operations.
- Optimize resource-intensive processes, like using more efficient algorithms or indexing in databases.
- Scale resources as needed, such as increasing computational power or optimizing storage.
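
Profiling can be done with Python's built-in `cProfile` module. The sketch below profiles two hypothetical de-duplication functions that return the same result: one scans a list for every value, and the other relies on dictionary keys.

```python
import cProfile
import io
import pstats


def slow_dedupe(values):
    """Remove duplicates using a list lookup for each value."""
    seen = []
    for value in values:
        if value not in seen:  # Scans the whole list every time
            seen.append(value)
    return seen


def fast_dedupe(values):
    """Remove duplicates using dictionary keys, preserving order."""
    return list(dict.fromkeys(values))


data = [i % 500 for i in range(20_000)]

profiler = cProfile.Profile()
profiler.enable()
slow_result = slow_dedupe(data)
fast_result = fast_dedupe(data)
profiler.disable()

# Collect the report as text, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists the time spent in each function, which points to the operation worth optimizing first. Exact timings vary by machine, so none are shown here.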

## Best Practices in Error Handling and Prevention

- Writing robust error handling code to gracefully manage unexpected failures.
- Proactively preventing common errors through best practices in code development and data management.
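
One common pattern for robust error handling is retrying an operation that may fail temporarily, such as a network or database call. The sketch below is a minimal version under stated assumptions: the `retry` helper, its parameters, and the `flaky_fetch` function are hypothetical and only simulate a transient failure.

```python
import logging
import time

logger = logging.getLogger("pipeline.retry")  # Hypothetical logger name


def retry(operation, attempts=3, delay_seconds=0.01, exceptions=(ConnectionError,)):
    """Call `operation`, retrying on the given exceptions with growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except exceptions as error:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, error)
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the error
            # Double the wait after each failed attempt.
            time.sleep(delay_seconds * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_fetch():
    """Simulate a source that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "data"


print(retry(flaky_fetch))
```

Retrying only specific exception types matters: a `ValueError` caused by bad data will not fix itself on a second attempt, so it is better to let it surface immediately.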


## Conclusion

In this notebook, we have covered various aspects of debugging in data engineering, from understanding different types of errors to applying practical debugging techniques. Remember, effective error handling and debugging are key to maintaining reliable and efficient data pipelines.


# Mastering Clean Coding with PEP8

## Educational Aims

This training module will equip you with the skills to:

- Adhere to PEP-8 standards for pristine code writing
- Apply proper indentation and adhere to recommended comment formatting
- Craft docstrings effectively and employ PEP-8 compliant variable naming conventions

## Training Structure

The module is organized in the following manner:

- An introductory overview of PEP-8 guidelines for refined code writing
- Best practices for imports, indentation, and overall code layout
- Methods for integrating comments and docstrings into your code
- Guidance on naming conventions and the strategic use of whitespace in expressions and statements

## Introduction

As you delve into code writing, understanding the art of styling your code becomes essential. Coding transcends the mere act of typing words and expecting them to function; it's about the organization and style of those words. Imagine trying to comprehend an essay with poor punctuation; similarly, unformatted code can be challenging to decipher. To enhance the readability of our code, we turn to the Python Enhancement Proposal (PEP) guidelines, specifically PEP-8.

A Python Enhancement Proposal (PEP) is a document submitted for proposing significant changes or improvements to the Python language. PEP-8, one of the earliest PEPs, provides a collection of rules and recommendations for formatting Python code. This is crucial in projects involving cross-departmental collaboration, as cleanly written code fosters better communication and efficiency within teams.

Below is a *PEP-8 Cheat Sheet* image for your reference. In this training, we will delve deeper into each of these guidelines.

![Pep-8 Image](https://github.com/ehiughele/Project-COVID19-DE-PROJECT/blob/main/image/PEP_8_Guide.jpg?raw=true)

## Managing Imports

In Python data science, it's typical to use various libraries and modules. The initial step in script creation should be to position your library imports at the beginning of the document.

Place imports right after any module comments and docstrings, and before module-level globals and constants. For clarity and organization, each library should be imported on its own line.

* **Good Import Practices**

Here are some examples of effective techniques for importing libraries.

In [None]:
# Proper Import Example (1)
import matplotlib.pyplot as plt


In [None]:
# Additional Example of Effective Imports (2)
from datetime import datetime
import seaborn as sns

* **Inadvisable Import Methods**

Below are examples of import practices that are best avoided. Importing several modules on one line (1) makes the import list harder to scan and to change. Wildcard imports (2) are problematic because they make it unclear which names end up in your namespace: two modules can export the same name with different behaviour, and the later import silently replaces the earlier one. This also makes it harder to pinpoint which module a misbehaving function actually came from.

In [None]:
# Improper Import Example (1): several modules on one line
import math, random


In [None]:
# Further Example of Inadvisable Imports (2): wildcard imports
from matplotlib import *
from scipy import *
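
To see why wildcard imports are risky, the sketch below uses only the standard library. Both `math` and `cmath` export a function called `sqrt`, so the second wildcard import quietly replaces the first. The imports run through `exec` in a throwaway dictionary (the `namespace` name is just for this illustration) so the demonstration does not pollute the current notebook.

```python
import cmath
import math

# Run each wildcard import in an isolated namespace so the demo
# does not add dozens of names to the current module.
namespace = {}
exec("from math import *", namespace)
sqrt_after_math = namespace["sqrt"]

exec("from cmath import *", namespace)
sqrt_after_cmath = namespace["sqrt"]

print(sqrt_after_math is math.sqrt)    # True
print(sqrt_after_cmath is cmath.sqrt)  # True: the name was silently rebound
print(sqrt_after_cmath(4))             # (2+0j), a complex number
```

With explicit imports (`import math` and `import cmath`), every call site states which `sqrt` it means.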

## Managing Indentation

Python sets itself apart from languages like Java and C++ by using indentation instead of braces to delimit blocks. PEP 8 recommends four spaces per indentation level and names spaces, not tabs, as the preferred indentation method; tabs should only be used to stay consistent with code that is already tab-indented. Because it is easy to type three or five spaces by hand, configure your editor so that the Tab key inserts four spaces. It is also good practice to keep each line of code within 79 characters; a longer line should be continued on the next line.

The importance of these guidelines will be evident in the upcoming examples.

* **Effective Indentation Practices**

In the examples below, every indentation level uses exactly four spaces.

In [None]:
# Example of Clear Indentation
def calculate_area(shape, dimensions):
    if shape == 'circle':  # 4 white spaces used
        radius = dimensions[0]  # 8 white spaces used
        return 3.14 * radius ** 2
    elif shape == 'square':
        side = dimensions[0]
        return side * side
    else:
        return 'Shape not recognized'

In [None]:
# Example of Readable Function Definition
def calculate_volume(length, width, height):
    if length > 0 and width > 0 and height > 0:  # 4 white spaces used
        volume = length * width * height  # 8 white spaces used
        return volume
    else:
        return 'Invalid dimensions'

In [None]:
# Example of Well-Indented Conditional Statements
def check_eligibility(age, citizenship):
    if age >= 18 and citizenship == 'Yes':  # 4 white spaces used
        return 'Eligible to vote'
    else:
        return 'Not eligible to vote'

* **Incorrect Indentation Practices**

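The examples below indent their blocks with inconsistent numbers of spaces. Python accepts any indentation width as long as every line in a block matches, so the first example runs, but it is hard to read and its misaligned `return` ends up inside the `else` branch. The second and third examples fail with an `IndentationError`. Python 3 also disallows mixing tabs and spaces for indentation and raises a `TabError`. Consistency is key: use four spaces per level throughout the code.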

In [None]:
# Example of Incorrect Indentation
def calculate_discount(price, discount):
    if price > 100: # 4 white spaces used
      discount_rate = 0.10 # 6 white spaces used
    elif price > 50: # 4 white spaces used
       discount_rate = 0.05 # 7 white spaces used
    else: # 4 white spaces used
        discount_rate = 0.02 # 8 white spaces used
        return price - (price * discount_rate)

In [None]:
# Incorrect Indentation in Loop
def list_even_numbers(numbers):
    for number in numbers: # 4 white spaces used
     if number % 2 == 0: # 5 white spaces used
        print(number) # 8 white spaces used
      else: # 6 white spaces used
          continue

In [None]:
# Mismatched Indentation in Function
def find_max_value(values):
    max_value = values[0] # 4 white spaces used
     for value in values: # 5 white spaces used
        if value > max_value: # 8 white spaces used
            max_value = value # 12 white spaces used
    return max_value
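
The errors described above can be reproduced without breaking a notebook cell. The sketch below passes small source strings to the built-in `compile()` and reports which indentation error, if any, Python raises. The helper name `indentation_error` and the three sample strings are made up for this illustration.

```python
def indentation_error(source):
    """Return the name of the indentation error raised by source, or None."""
    try:
        compile(source, "<demo>", "exec")
    except IndentationError as exc:  # TabError is a subclass of IndentationError
        return type(exc).__name__
    return None


# Every line of the function body is indented with four spaces.
consistent = "def f():\n    x = 1\n    return x\n"

# The second body line has one space too many.
extra_space = "def f():\n    x = 1\n     return x\n"

# One body line uses a tab, the next uses eight spaces.
tabs_and_spaces = "if True:\n\tx = 1\n        y = 2\n"

print(indentation_error(consistent))       # None
print(indentation_error(extra_space))      # IndentationError
print(indentation_error(tabs_and_spaces))  # TabError
```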


## Organizing Code Layout

The structure of your code significantly influences its readability. To create code that is both clear and adheres to PEP-8 standards, consider these key points about using blank lines and managing line length.

### Utilizing Blank Lines

Strategically placed blank lines can enhance the readability of your code. Compact code can be hard to follow, while too many blank lines can make your code appear disjointed and lead to excessive scrolling. Here are essential guidelines for using blank lines effectively in your code.


Surround top-level functions and classes with two blank lines. These elements should be relatively independent, and functions should be treated as distinct units. Adding extra blank lines between each function helps in making them easily identifiable in your code:

In [None]:
def first_function():
    print("This is the first function")
    return None


def second_function():
    print("This is the second function")
    return None


def third_function():
    print("This is the third function")
    return None

On the other hand, method definitions inside a class should be surrounded by a single blank line. The methods belong to the same class, so the smaller gap separates them while still showing that they are related:

In [None]:
class Calculator:
    def calculate_sum(self, a, b):
        return a + b

    def multiply_numbers(self, x, y):
        return x * y

    def divide_numbers(self, m, n):
        if n != 0:
            return m / n
        else:
            return "Division by zero is not allowed"

### Managing Line Length

PEP 8 recommends keeping lines to a maximum of 79 characters for code and 72 for docstrings and comments. The limit is commonly traced back to traditional 80-column terminals, and it still pays off today: it makes side-by-side file viewing practical and reduces the need for line wrapping. PEP 8 also allows teams that agree on it to raise the code limit to 99 characters, provided docstrings and comments still wrap at 72.

For wrapping longer lines, PEP 8 prefers implied line continuation inside parentheses, brackets, and braces over a backslash. This approach allows long expressions to be split over several lines. Backslashes are still appropriate in a few cases: before Python 3.10, a long `with` statement with several context managers could not use implied continuation, so a backslash was the accepted way to wrap it:

In [None]:
with open('/path/to/the/source/file') as source_file, \
     open('/path/to/the/destination/file', 'w') as destination_file:
    destination_file.write(source_file.read())
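
For contrast with the backslash form, here is a minimal sketch of implied line continuation. The function `total_price` and the `order` dictionary are invented for this example; the point is only that parentheses and braces let an expression span several lines with no backslash.

```python
def total_price(unit_price, quantity, tax_rate, shipping_cost):
    # The parentheses let the expression span several lines without
    # backslashes; each line break comes before a binary operator.
    return (unit_price * quantity
            + unit_price * quantity * tax_rate
            + shipping_cost)


# Brackets and braces allow the same implied continuation.
order = {
    'unit_price': 10.0,
    'quantity': 2,
    'tax_rate': 0.5,
    'shipping_cost': 5.0,
}

print(total_price(**order))  # 35.0
```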

## Utilizing Comments

Comments serve as guides for readers to understand the intentions behind code segments. This is especially valuable in scenarios where project responsibilities are transferred, allowing newcomers to grasp the existing work more easily.

Python, unlike C++ or Java, doesn't distinguish between block and inline comments through syntax; it uses the hash symbol "#" for both. Comments should be employed to clarify: the reasoning behind specific code decisions, and to highlight crucial aspects of the problem your code addresses.

A key principle to remember is that while code illustrates the process, comments should elucidate the rationale.

In [None]:
# Example of a block comment
tasks = ['walk barefoot', 'maintain positivity', 'explore boldly', 'prioritize development']  # Example of an inline comment
goals = ['develop a remarkable project', 'commit to team goals']  # Another inline comment

In [None]:
# This is a block comment explaining the purpose of the function
# It calculates the average of a list of numbers and prints the result
def calculate_average(numbers):
    total_sum = sum(numbers)  # Inline comment: summing the list of numbers
    count = len(numbers)  # Inline comment: getting the count of numbers
    average = total_sum / count  # Inline comment: calculating the average
    print("The average is:", average)

In [None]:
# Here we define a function to check if a number is prime
# This is useful for mathematical computations and algorithms
def is_prime(number):
    if number > 1:  # Inline comment: all primes are greater than 1
        for i in range(2, number):
            if (number % i) == 0:  # Inline comment: checking for factors
                print(number, "is not a prime number")
                break
        else:
            print(number, "is a prime number")
    else:
        print(number, "is not a prime number")  # Inline comment: 1 is not prime
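
The principle that comments should explain the rationale can be made concrete with a variation on the prime check. This is only a sketch: it returns a boolean instead of printing, and it stops the search at the square root of the number, which is a different design from the cell above. The comment records why the loop can stop early, something the code alone does not reveal.

```python
import math


def is_prime(number):
    """Return True if number is prime, otherwise False."""
    if number < 2:
        return False
    # A composite number always has a factor no larger than its square
    # root, so testing larger candidates would only repeat earlier work.
    for candidate in range(2, math.isqrt(number) + 1):
        if number % candidate == 0:
            return False
    return True


print([n for n in range(1, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```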

## Documenting with Docstrings

Effective coding, especially in Object-Oriented Programming (OOP), necessitates the inclusion of docstrings for methods, classes, and functions. Docstrings, short for documentation strings, are textual annotations enclosed within triple double quotes (""") or triple single quotes (''') situated at the beginning of any function, class, method, or module. Their purpose is to provide a concise description and documentation of a code segment. Adhering to the PEP-257 guidelines, docstrings play a critical role in maintaining clean and understandable code.

For comprehensive examples of docstrings, refer to the PEP-257 documentation.

In [None]:
# Single-line docstring
"""This is an example of a single-line docstring"""
# Example function
def greet(name):
    """Greets a person with their name."""
    print(f"Hello, {name}!")


In [None]:
# Multi-line docstring
"""Performs a division

Takes two numbers and divides the first by the second.
"""
# Example function
def divide(x, y):
    """Divides the first number by the second and returns the result.

    Parameters
    ----------
    x: float
        The numerator
    y: float
        The denominator, should not be zero

    Returns
    -------
    result: float
        The division of x by y
    """
    result = x / y
    return result
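
Docstrings are more than comments: Python stores them on the object, so tools can read them back. The short sketch below redefines a simplified `divide` (its docstring wording is an assumption for this example) and reads the text through `__doc__` and the standard `inspect.getdoc()` function.

```python
import inspect


def divide(x, y):
    """Divide x by y and return the result.

    Raises ZeroDivisionError if y is zero.
    """
    return x / y


# __doc__ holds the docstring text as stored on the function object.
print(divide.__doc__)

# inspect.getdoc() returns the docstring with uniform indentation removed.
summary = inspect.getdoc(divide).splitlines()[0]
print(summary)  # Divide x by y and return the result.
```

Calling `help(divide)` in a notebook displays the same text.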


## Adhering to Naming Conventions

Maintaining a consistent naming scheme is crucial for the clarity and readability of code. While several styles exist, this course will focus on specific conventions that enhance code cleanliness. For a more thorough exploration, refer to the PEP-8 documentation (see appendix for link).

- `lowercase`: Used for naming variables.
- `UPPERCASE`: Designates constants.
- `camelCase`: Though rarely employed in Python, it's mentioned for completeness.
- `CapitalizedWords` (or CamelCase): Recommended for class names.
- Use `underscores` (`_`) to represent spaces.

In [None]:
# lowercase (for variable names)
age = 25
city = 'london'

# lowercase_underscores (suitable for function names)
def calculate_tax():
    tax_rate = 0.2
    return tax_rate

# UPPERCASE (ideal for constants)
MAX_SPEED = 120
EARTH_RADIUS = 6371

# UPPERCASE_UNDERSCORES (also ideal for constants)
WATER_BOILING_POINT = 100
DEFAULT_LANGUAGE = 'EN'

# camelCase (more common in Java and C++, less in Python)
userAge = 30
userName = 'John'

# CapitalizedWords (preferred for class names)
class UserAccount:
    default_status = 'active'

# Capitalized_Words (not recommended in Python coding standards)
Social_Security_Number = '123-45-6789'
Full_Name = 'Jane Doe'

## Managing Whitespace

While whitespace can significantly improve the readability of code, its overuse or incorrect application can lead to cluttered and hard-to-read code. This section will cover fundamental guidelines to ensure your code remains clean and well-organized by judicious use of whitespace.

```python
# Proper use of whitespace
calculate_area(width[1:10], height[5:15])
coffee = (4,)
serve_drink(3)

# Improper use of whitespace
calculate_area( width [ 1 : 10 ], height [ 5 : 15 ] )
coffee = (4, )
serve_drink (3)
```

# Appendix

- [PEP8 Guidelines](https://pep8.org/)

- [Official PEP8 Python Guide](https://www.python.org/dev/peps/pep-0008/)

- [Python Coding Style and Layout](https://realpython.com/python-pep8/#code-layout)

- [Python Unittest Documentation](https://docs.python.org/3/library/unittest.html)

# Notes

## Comparison Operators, Loops and Functions:

### Lists, dicts, tuples, etc.

In [15]:
f_test = float(input("What is your first test score ?"))
s_test = float(input("What is your second test score ?"))
attdnce = float(input("What is your attendance score ?"))
exam = float(input("What is your exam score ?"))

total_score = f_test + s_test + attdnce + exam
total_score

What is your first test score ? 1
What is your second test score ? 6
What is your attendance score ? 9
What is your exam score ? 30


46.0

In [16]:
first_name = str(input("What is your first name ?"))
last_name = str(input("What is your last name ?"))

full_name = first_name + " " + last_name
full_name

What is your first name ? keith
What is your last name ? bishop


'keith bishop'

In [27]:
#Slicing sequence[start:stop:step]
full_name[0:5]

'keith'

In [26]:
text = "abcdef"
text[0::2]

'ace'

In [1]:
name = str(input("What is your  name ?"))
age = int(input("How old are you?"))

print("my name is {} and I am {} years old".format(name,age))


What is your  name ? annie
How old are you? 42


my name is annie and I am 42 years old


In [5]:
name = str(input("What is your  name ?"))
age = int(input("How old are you?"))

print("my name is %s and I am %s years old" %(name,age))

What is your  name ? bis
How old are you? 66


my name is bis and I am 66 years old


In [6]:
name = str(input("What is your  name ?"))
age = int(input("How old are you?"))

print(f"my name is {name} and I am {age} years old")

What is your  name ? hh
How old are you? 88


my name is hh and I am 88 years old


In [7]:
# dictionaries = mutable mappings of unique keys to values (insertion-ordered since Python 3.7)
my_dict = {"key1": "value1"}
my_dict

{'key1': 'value1'}

In [8]:
my_dict["key1"]

'value1'

In [11]:
#tuples =ordered sequence of items that cannot be modified
tum=("sign", 10,9.77,"bis","clan")
tum

('sign', 10, 9.77, 'bis', 'clan')

In [12]:
tum[0]=44

TypeError: 'tuple' object does not support item assignment

In [14]:
# sets = unordered collections of unique items
num={0,0,0,0,9,9,9,9,7,7,6,5,4,4,4,3,3,2,22,2,2,11,1,1}
num


{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 22}

In [15]:
num.add(56)
num

{0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 22, 56}

In [16]:
names =("bish", "bish","joe","lua","clan")
names

('bish', 'bish', 'joe', 'lua', 'clan')

In [19]:
u_names= set(names)

In [20]:
u_names.add("annie")
u_names

{'annie', 'bish', 'clan', 'joe', 'lua'}

In [None]:
#boolean


In [23]:
import re

def verify_password(password):
    """Checks if a password meets security criteria and returns error messages if invalid."""
    if len(password) < 8:
        return "Password must be at least 8 characters long."

    if not re.search(r"[A-Z]", password):
        return "Password must contain at least one uppercase letter."

    if not re.search(r"[a-z]", password):
        return "Password must contain at least one lowercase letter."

    if not re.search(r"\d", password):
        return "Password must contain at least one digit."

    if not re.search(r"[!@#$%^&*(),.?\":{}|<>]", password):
        return "Password must contain at least one special character (!@#$%^&* etc.)."

    return None  # Password is valid

# Maximum allowed attempts
max_attempts = 3
attempts = 0

while attempts < max_attempts:
    password = input("Enter your password: ")
    error = verify_password(password)
    
    if error is None:
        print("Password is strong and valid.")
        break  # Exit the loop when the password is valid
    else:
        attempts += 1
        print(f"Error: {error}")
        if attempts < max_attempts:
            print(f"Attempt {attempts}/{max_attempts}. Please try again.\n")
        else:
            print("Too many failed attempts. Exiting...")

Enter your password:  nhjg6


Error: Password must be at least 8 characters long.
Attempt 1/3. Please try again.



Enter your password:  kliAq124


Error: Password must contain at least one special character (!@#$%^&* etc.).
Attempt 2/3. Please try again.



Enter your password:  hyut456


Error: Password must be at least 8 characters long.
Too many failed attempts. Exiting...


In [28]:
# Example usage
verify_password("&123Aergjjj")

In [30]:
name = "efem"
name[::-1]

'mefe'

### filter, map, lambda

In [34]:
words= ["efe", "hello", "world", "level", "civic"]

def palindrome(x):
    return x == x[::-1]

pali = filter(palindrome, words)
list(pali)

['efe', 'level', 'civic']

In [36]:
# map function
# applies a function to each item in an iterable

def cal_fah(degree):
    return (degree * 9/5) + 32

temp_c = [10,25,45,90,0,-45]

#apply the map function
temp_f = map(cal_fah, temp_c)
list(temp_f)

[50.0, 77.0, 113.0, 194.0, 32.0, -49.0]

In [41]:
#lambda function (anonymous func)

age = list(range(18,31))
age_10 = map(lambda x: x+10, age)

print(age)
print(list(age_10))

[18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
[28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
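
For comparison, the `filter` and `map` cells above can also be written as list comprehensions, which many Python programmers find easier to read. This sketch reuses the same sample data; the variable names `palindromes` and `temp_f` are chosen for this example.

```python
words = ["efe", "hello", "world", "level", "civic"]
temp_c = [10, 25, 45, 90, 0, -45]

# Equivalent to list(filter(palindrome, words))
palindromes = [word for word in words if word == word[::-1]]

# Equivalent to list(map(cal_fah, temp_c))
temp_f = [degree * 9 / 5 + 32 for degree in temp_c]

print(palindromes)  # ['efe', 'level', 'civic']
print(temp_f)       # [50.0, 77.0, 113.0, 194.0, 32.0, -49.0]
```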


## Logging

In [3]:
import logging

# This sets up basic logging and records an informational message.
logging.basicConfig(level=logging.INFO)
logging.info("This is an info message")

INFO:root:This is an info message


In [4]:
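# Records messages of different severity levels.
# With the level set to INFO above, the debug message is suppressed
# and only the warning is emitted.
logging.debug("This is for debugging")
logging.warning("This is a warning message")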



In [5]:
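# basicConfig() is a no-op once the root logger has handlers, so this call is ignored
# and the error still prints to the console; pass force=True (Python 3.8+) to reconfigure.
logging.basicConfig(filename='example.log', level=logging.ERROR)
logging.error("An error has occurred")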

ERROR:root:An error has occurred


## Unit Testing

- Unit testing ensures that individual components of the data pipeline work as expected. Python's unittest or pytest frameworks are commonly used.
- **Example 1: Testing a Function**
- **Example 2: Testing for an Exception**
- **Example 3: Testing a Data Processing Function**
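
The three examples listed above have no code in these notes, so here is a minimal sketch of what they might look like with `unittest`. The helper functions (`celsius_to_fahrenheit`, `safe_divide`, `drop_incomplete_rows`) and the test class name are hypothetical and exist only to give the tests something to check.

```python
import unittest


def celsius_to_fahrenheit(degree):
    return degree * 9 / 5 + 32


def safe_divide(numerator, denominator):
    if denominator == 0:
        raise ValueError("denominator must not be zero")
    return numerator / denominator


def drop_incomplete_rows(rows):
    """Keep only the rows (dicts) in which no value is None."""
    return [row for row in rows if None not in row.values()]


class TestPipelineHelpers(unittest.TestCase):
    def test_function(self):
        # Example 1: testing a function's return value
        self.assertEqual(celsius_to_fahrenheit(100), 212.0)

    def test_exception(self):
        # Example 2: testing that an exception is raised
        with self.assertRaises(ValueError):
            safe_divide(1, 0)

    def test_data_processing(self):
        # Example 3: testing a data processing function
        rows = [{"id": 1, "score": 9.5}, {"id": 2, "score": None}]
        self.assertEqual(drop_incomplete_rows(rows), [{"id": 1, "score": 9.5}])


# In a notebook, load and run the tests explicitly instead of calling
# unittest.main(), which would parse the kernel's arguments and then exit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPipelineHelpers)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.wasSuccessful())
```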

## Fully Automate Data Cleaning with Python in 5 Steps

Step 1 – Run Basic Data Quality Checks
- Missing values in each column, Duplicate rows, Basic data characteristics

In [None]:
def check_data_quality(df):
    # Store initial data quality metrics
    quality_report = {
        'missing_values': df.isnull().sum().to_dict(),
        'duplicates': df.duplicated().sum(),
        'total_rows': len(df),
        'memory_usage': df.memory_usage().sum() / 1024**2  # in MB
    }
    return quality_report

Step 2 – Standardize Data Types
- Converting string dates to datetime objects, Identifying and converting numeric strings to actual numbers, Ensuring categorical variables are properly encoded


In [None]:
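import pandas as pd

def standardize_datatypes(df):
    for column in df.columns:
        # Only text (object) columns are candidates for conversion
        if df[column].dtype == 'object':
            try:
                # Try converting string dates to datetime
                df[column] = pd.to_datetime(df[column])
                print(f"Converted {column} to datetime")
            except (ValueError, TypeError):
                # Try converting to numeric if datetime fails
                try:
                    # regex=False treats '$' as a literal character, not an end-of-string anchor
                    cleaned = df[column].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
                    df[column] = pd.to_numeric(cleaned)
                    print(f"Converted {column} to numeric")
                except (ValueError, TypeError, AttributeError):
                    # Leave the column unchanged if it holds neither dates nor numbers
                    pass
    return df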

Step 3 – Handle Missing Values
- Using median imputation for numeric columns, Applying mode imputation for categorical data, Maintaining the statistical properties of the dataset while filling gaps

In [None]:

from sklearn.impute import SimpleImputer

def handle_missing_values(df):
    # Handle numeric columns
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_columns) > 0:
        num_imputer = SimpleImputer(strategy='median')
        df[numeric_columns] = num_imputer.fit_transform(df[numeric_columns])

    # Handle categorical columns
    categorical_columns = df.select_dtypes(include=['object']).columns
    if len(categorical_columns) > 0:
        cat_imputer = SimpleImputer(strategy='most_frequent')
        df[categorical_columns] = cat_imputer.fit_transform(df[categorical_columns])

    return df

Step 4 – Detect and Handle Outliers
- An approach using the Interquartile Range (IQR) method: Calculate the IQR for numeric columns, Identify values beyond 1.5 * IQR from the quartiles, Apply capping to extreme values rather than removing them. This preserves data while managing extreme values.

In [None]:
def remove_outliers(df):
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    outliers_removed = {}

    for column in numeric_columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Count outliers before capping
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)].shape[0]

        # Cap the values instead of removing them
        df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

        if outliers > 0:
            outliers_removed[column] = outliers

    return df, outliers_removed
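
To make the capping rule easy to check by hand, here is a dependency-free sketch of the same IQR calculation on a small list, using the standard `statistics` module instead of pandas. The function name `cap_outliers` and the sample values are made up for this illustration; `method="inclusive"` interpolates linearly between data points and should agree with the default behaviour of the pandas `quantile()` call used above.

```python
import statistics


def cap_outliers(values):
    """Cap values lying more than 1.5 * IQR outside the quartiles."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Count the outliers, then cap them instead of dropping them.
    outlier_count = sum(1 for v in values if v < lower_bound or v > upper_bound)
    capped = [min(max(v, lower_bound), upper_bound) for v in values]
    return capped, outlier_count


capped, outlier_count = cap_outliers([1, 2, 3, 4, 100])
print(capped)         # [1, 2, 3, 4, 7.0]
print(outlier_count)  # 1
```

Here Q1 is 2 and Q3 is 4, so the IQR is 2 and the bounds are -1 and 7; the value 100 is pulled down to 7 while the list keeps its length.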

Step 5 – Validate the Results
- Confirm no remaining missing values, Check for any remaining duplicates, Validate data integrity and consistency, Generate a comprehensive cleaning report

In [None]:
def validate_cleaning(df, original_shape, cleaning_report):
    validation_results = {
        'rows_remaining': len(df),
        'missing_values_remaining': df.isnull().sum().sum(),
        'duplicates_remaining': df.duplicated().sum(),
        'data_loss_percentage': (1 - len(df)/original_shape[0]) * 100
    }

    # Add validation results to the cleaning report
    cleaning_report['validation'] = validation_results
    return cleaning_report

     

## Putting It All Together in a complete pipeline:

In [None]:
def automated_cleaning_pipeline(df):
    # Store original shape for reporting
    original_shape = df.shape

    # Initialize cleaning report
    cleaning_report = {}

    # Execute each step and collect metrics
    cleaning_report['initial_quality'] = check_data_quality(df)

    df = standardize_datatypes(df)
    df = handle_missing_values(df)
    df, outliers = remove_outliers(df)
    cleaning_report['outliers_removed'] = outliers

    # Validate and finalize report
    cleaning_report = validate_cleaning(df, original_shape, cleaning_report)

    return df, cleaning_report

# Web Scraping Fundamentals

> Beautifulsoup
> [Selenium](https://www.selenium.dev/documentation/overview/details/)
> [Scrapy](https://docs.scrapy.org/en/latest/)
> [website](http://quotes.toscrape.com/)