
# 🔹 1. Introduction: Why Python?

Python is one of the most widely used languages in data engineering because it is readable, has a large library ecosystem, and is supported across AWS services such as Lambda, Glue, and SageMaker.

**Key points to mention:**
- Easy to learn and write (compared to Java/C++)
- Massive ecosystem (pandas, boto3, etc.)
- An official AWS SDK for Python (`boto3`)
- Runs in Lambda, Glue, and cloud notebooks



# 🔹 2. Python Basics: Variables, Types, and Printing

Let's start with simple variable assignment and printing.


In [None]:

# Variable assignment
name = "Cape Town"  # str (Python comments start with #, not //)
stage = 2
is_power_on = True
temp = 19.5

# Operators
next_stage = stage + 1  # + adds two numbers

# Print with formatting
print(f"{name} - Load Shedding Stage: {stage}")
print("Temperature:", temp)



Python is dynamically typed, which means you don't have to declare types explicitly.

**Basic types:**
- `str`: String (text)
- `int`: Integer (whole number)
- `float`: Floating-point (decimal) number
- `bool`: Boolean (True/False)
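
To see dynamic typing in action, the built-in `type()` function reports the type of a value. A minimal sketch that reuses the variables from the cell above (redefined here so the snippet runs on its own):

```python
# Same values as the earlier cell, redefined so this snippet is self-contained
name = "Cape Town"
stage = 2
temp = 19.5
is_power_on = True

# type() returns the class of the value each name currently refers to
for value in (name, stage, temp, is_power_on):
    print(repr(value), "is a", type(value).__name__)

# A name is not tied to one type: it can be rebound to a different kind of value
stage = "two"
print(type(stage).__name__)  # str
```

Rebinding `stage` to a string works because the type belongs to the value, not to the variable name.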



# 🔹 3. Python Basics: Data Structures

**Why Data Structures?**

Data structures help organize and store collections of data.

They let you group related data and perform operations efficiently.

Different structures suit different purposes.


**Lists (ordered, mutable collections)**

- Store multiple items (can be mixed types).
- Access by index.
- Can be changed (add/remove items).


**Example**

fruits = ["apple", "banana", "cherry"]
print(fruits[1])  # banana

**Traversing a list:**
for fruit in fruits:
    print(fruit)

**Tuples (ordered, immutable collections)**
- Like lists but cannot be changed.
- Good for fixed collections.

coordinates = (10, 20)
print(coordinates[0])  # 10

**Traversing a Tuple:**
for i in coordinates:
    print(i)
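
The difference between mutable lists and immutable tuples is easiest to see by trying to change each one. A short sketch using the same `fruits` and `coordinates` values as above:

```python
fruits = ["apple", "banana", "cherry"]
fruits.append("mango")      # add an item to the end
fruits.remove("banana")     # remove the first matching item
fruits[0] = "apricot"       # replace an item by index
print(fruits)               # ['apricot', 'cherry', 'mango']

coordinates = (10, 20)
try:
    coordinates[0] = 99     # tuples do not support item assignment
except TypeError as error:
    print("Cannot modify a tuple:", error)
```

The list changes in place, while the attempted tuple assignment raises a `TypeError` and leaves `coordinates` untouched.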

**Dictionaries (key-value pairs)**
- Store data as pairs: key -> value
- Fast lookup by key.
- Useful for mapping data.
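
As with lists and tuples, a small example helps. The sketch below uses made-up stage values purely for illustration and shows lookup, insertion, and traversal:

```python
# Map each city to its current load-shedding stage (illustrative values)
stages = {"Cape Town": 2, "Durban": 1}

print(stages["Durban"])            # look up a value by its key
stages["Pretoria"] = 3             # add a new key-value pair
print(stages.get("Joburg", 0))     # .get() returns a default when the key is missing

# Traversing a dictionary: .items() yields (key, value) pairs
for city, stage in stages.items():
    print(city, "->", stage)
```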



# 🔹 3. Lists & Loops

Lists are ordered collections. You can loop through them easily using `for` loops.


In [None]:

# A list of cities
cities = ["Cape Town", "Joburg", "Durban"]

# Loop through the list
for city in cities:
    print(f"🔋 Checking power for {city}")


In [None]:

# Loop with index
for i in range(len(cities)):
    print(f"{i + 1}. {cities[i]}")



- `range(n)` creates a sequence from 0 to n-1.
- `for` loops iterate directly over items or via index.
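
When you need both the position and the item, the built-in `enumerate()` is a common alternative to `range(len(...))`. A minimal sketch with the same city list:

```python
cities = ["Cape Town", "Joburg", "Durban"]

# range(3) produces 0, 1, 2
print(list(range(len(cities))))  # [0, 1, 2]

# enumerate() yields (index, item) pairs; start=1 makes the numbering 1-based
for number, city in enumerate(cities, start=1):
    print(f"{number}. {city}")
```

This prints the same numbered list as the index-based loop, without indexing into `cities` by hand.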



# 🔹 4. Dictionaries: Like a Mini Database

Dictionaries store data as key-value pairs.


In [None]:

city_data = {
    "name": "Joburg",
    "stage": 2,
    "hours_off": 4,
    "province": "Gauteng"
}

# Access values
print(city_data["name"])
print(city_data["hours_off"])

# Add new key
city_data["load_date"] = "2025-06-18"
print(city_data)



This structure is similar to JSON or a row in a CSV file.
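
To make that comparison concrete, the sketch below converts the same dictionary to a JSON string and to a one-row CSV using only the standard library. The in-memory buffer is an assumption made here to avoid writing a real file:

```python
import csv
import io
import json

city_data = {"name": "Joburg", "stage": 2, "hours_off": 4, "province": "Gauteng"}

# As JSON: the dictionary becomes a JSON object with the same keys
as_json = json.dumps(city_data)
print(as_json)

# As CSV: the keys become the header and the values become one row
buffer = io.StringIO()  # in-memory text buffer standing in for a file
writer = csv.DictWriter(buffer, fieldnames=list(city_data))
writer.writeheader()
writer.writerow(city_data)
print(buffer.getvalue())
```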



# 🔹 5. Functions: Reusable Logic

Functions allow you to encapsulate logic and reuse it.


In [None]:

def economic_impact(stage, population):
    cost = stage * 100
    return population * cost

impact = economic_impact(2, 1_000_000)
print(f"Estimated loss: R{impact}")



**Explanation:**
- Define a function with `def`
- Use parameters and return a value
- Real-world example: estimate economic loss from load shedding
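
Parameters can also have default values and be passed by name. The variant below is a sketch, not a real economic model: `cost_per_stage` is a hypothetical parameter introduced here, and its default of 100 mirrors the constant in the cell above.

```python
def economic_impact(stage, population, cost_per_stage=100):
    """Return a rough loss estimate in rand.

    cost_per_stage is a hypothetical per-person cost for each stage;
    its default of 100 matches the constant used in the earlier cell.
    """
    return population * stage * cost_per_stage


# Positional arguments use the default cost
print(economic_impact(2, 1_000_000))  # 200000000

# Keyword arguments make the call self-documenting and can override the default
print(economic_impact(stage=2, population=1_000_000, cost_per_stage=150))  # 300000000
```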



# 🔹 6. Mini Exercise

**Task**: Create a list of 3 cities with stages, loop through them and print the expected economic loss if stage is 2.


In [None]:

cities = [
    {"name": "Cape Town", "stage": 2},
    {"name": "Durban", "stage": 1},
    {"name": "Pretoria", "stage": 3},
]

for city in cities:
    # Only report cities currently at stage 2, as the task asks
    if city["stage"] == 2:
        loss = city["stage"] * 1_000_000
        print(f"{city['name']}: R{loss} loss")



# 🔹 7. Preview: Where This Fits with AWS

These Python basics enable you to interact with AWS.

You can:
- Upload data to **S3** using `boto3`
- Write results to **CSV** with `csv` or `pandas`
- Use **dictionaries** inside AWS **Lambda functions**

```python
# Example: Using boto3 to upload to S3
import boto3

s3 = boto3.client("s3")
s3.upload_file("local_file.csv", "my-bucket", "uploads/file.csv")
```

This opens the door to automating infrastructure, processing data, and running cloud-native applications.
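
As an illustration of dictionaries inside a Lambda function, here is a minimal sketch of a Python handler. Lambda calls the handler with an `event` and a `context` argument; the `city` and `stage` event fields and the `statusCode`/`body` response shape are assumptions chosen for this example, not something a particular service requires.

```python
import json


def lambda_handler(event, context):
    """Entry point in the shape AWS Lambda uses for Python functions.

    The "city" and "stage" keys are hypothetical fields of the incoming event.
    """
    city = event.get("city", "unknown")
    stage = event.get("stage", 0)

    result = {"city": city, "stage": stage, "power_on": stage == 0}

    # Return a dictionary; the body is serialized to a JSON string
    return {"statusCode": 200, "body": json.dumps(result)}


# Local test call: no AWS account needed, context is unused here
response = lambda_handler({"city": "Durban", "stage": 1}, None)
print(response)
```

Because the handler is an ordinary function that takes and returns dictionaries, it can be tested locally before it is deployed.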


## ❓ Beginner-Friendly Questions to Explore

We’ll answer the following questions:

1. What is the structure of the dataset (rows and columns)?
2. What are the column names?
3. Are there any missing values in the dataset?
4. What is the data type of each column?
5. What is the date range covered in the dataset?
6. What is the average residual demand?
7. What are the maximum and minimum values of residual demand?
8. How does residual demand vary over time (line plot)?
9. What is the average international export and import?
10. What is the total electricity generated over time?
11. What is the trend in thermal vs nuclear generation?
12. What is the correlation between residual demand and dispatchable generation?
13. Which hour of the day has the highest average residual demand?
14. What is the installed renewable capacity over time?
15. How does load shedding vary over the months or years?
16. Can we identify any patterns in total UCLF (Unplanned Capability Loss Factor)?
17. What is the average non-commercial sent-out?
18. How much variation is there in the Drakensberg generation unit hours?
19. Compare Palmiet and Ingula generation hours over time.
20. Which variables are most highly correlated?