# Day 5: Pandas Basics

1. Overview  
2. Importing & Creating DataFrames  
3. Indexing, Selecting & Filtering  
4. GroupBy & Aggregations  
5. Exercise 1: Load & Inspect Solar Data  
6. Exercise 2: Filter High-Demand Days  
7. Exercise 3: Regional Generation GroupBy  
8. Exercise 4: Time-Series Resampling  

---

## 1. Overview

Welcome to Day 5 of my Energy Analytics journey! Today’s goals are to:

- Load and inspect tabular energy data with Pandas  
- Select, filter, and slice DataFrames  
- Group and aggregate by categorical fields  
- Resample time-series data at different intervals  
- Complete four field-related coding exercises in my Jupyter notebook  

---

## 2. Importing & Creating DataFrames

In this section, I will:

- Import Pandas with the standard alias  
- Create a DataFrame from a Python dict  
- Inspect my DataFrame using `.head()`, `.info()`, and `.describe()`

In [1]:
import pandas as pd

# Create from dict
data = {
    "day" : [1, 2, 3, 4, 5],
    "irradiance_khm" : [5.2, 4.6, 4.9, 5.4, 5.1],
    "demand_gw" : [45.3, 54.1, 41.4, 38.5, 44.3]
}

df = pd.DataFrame(data)
print(df.head())

   day  irradiance_khm  demand_gw
0    1             5.2       45.3
1    2             4.6       54.1
2    3             4.9       41.4
3    4             5.4       38.5
4    5             5.1       44.3
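
The cell above only exercises `.head()`. As a minimal sketch of the other two inspection calls listed for this section, reusing the same sample values: `.info()` prints its report directly and returns `None`, while `.describe()` returns a new DataFrame of summary statistics. The `summary` name is just an illustrative choice.

```python
import pandas as pd

# Same sample values as the cell above
df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "irradiance_khm": [5.2, 4.6, 4.9, 5.4, 5.1],
    "demand_gw": [45.3, 54.1, 41.4, 38.5, 44.3],
})

# .info() prints column names, dtypes, and non-null counts by itself
# and returns None, so it is called on its own line, not inside print()
df.info()

# .describe() returns a DataFrame, so its rows can be selected by label
summary = df.describe()
print(summary.loc[["mean", "min", "max"]])
```

Calling `.info()` on its own line avoids a stray `None` in the printed output.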


---

## 3. Indexing, Selecting & Filtering

In this section, I will:

- Select columns (`df["column"]`)  
- Select rows by position (`.iloc`) and by label (`.loc`)  
- Apply boolean masks for filtering

In [2]:
# Column selection
irr = df["irradiance_khm"]

# Row selection by position & label
print(df.iloc[0], df.loc[ df["day"] == 3 ])

# Boolean filtering 
high_demand = df[ df["demand_gw"] >= 50 ]
print(high_demand)

day                1.0
irradiance_khm     5.2
demand_gw         45.3
Name: 0, dtype: float64    day  irradiance_khm  demand_gw
2    3             4.9       41.4
   day  irradiance_khm  demand_gw
1    2             4.6       54.1
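
Boolean masks can also be combined, and `.loc` can take a row mask and a list of column labels in one call. The sketch below reuses the sample data from Section 2; the thresholds (44 GW and 5.0) are arbitrary values picked only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "irradiance_khm": [5.2, 4.6, 4.9, 5.4, 5.1],
    "demand_gw": [45.3, 54.1, 41.4, 38.5, 44.3],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses because & binds tighter than >=.
mask = (df["demand_gw"] >= 44) & (df["irradiance_khm"] > 5.0)

# Rows selected by the mask, columns selected by label
subset = df.loc[mask, ["day", "demand_gw"]]
print(subset)
```

With these thresholds only days 1 and 5 satisfy both conditions.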


---

## 4. GroupBy & Aggregations

In this section, I will:

- Add a categorical column (`region`)  
- Use `.groupby()` and `.agg()` to compute summaries  
- Apply built-in functions like `sum`, `mean`, and `max`

In [3]:
# Simulate region tag
df["region"] = ["North", "South", "North", "East", "South"]
grp = df.groupby("region")["demand_gw"].agg(["sum", "mean", "max"])
print(grp)

         sum   mean   max
region                   
East    38.5  38.50  38.5
North   86.7  43.35  45.3
South   98.4  49.20  54.1
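
An alternative I could use here is named aggregation, where each output column gets an explicit name. This is a sketch on the same sample data; the output column names (`total_demand_gw` and so on) are my own illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "demand_gw": [45.3, 54.1, 41.4, 38.5, 44.3],
    "region": ["North", "South", "North", "East", "South"],
})

# Named aggregation: new_column=(source_column, function)
grp = df.groupby("region").agg(
    total_demand_gw=("demand_gw", "sum"),
    mean_demand_gw=("demand_gw", "mean"),
    peak_demand_gw=("demand_gw", "max"),
)
print(grp)
```

The numbers match the `sum`, `mean`, and `max` columns above; only the column labels differ.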


---

### Exercise 1: Load & Inspect Solar Data

In this cell, I:

- Read the `irr.csv` file from Day 3 using `pd.read_csv()`  
- Display `.info()`, `.describe()`, and the first 5 rows with `.head()`

- **Why?**  
  To verify data types, check for missing values, and understand basic statistics before analysis.

In [8]:
df = pd.read_csv("../day3/irr.csv")
print(df.info(), df.describe(), df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   day            2 non-null      int64
 1   irradiance_wh  2 non-null      int64
dtypes: int64(2)
memory usage: 164.0 bytes
None             day  irradiance_wh
count  2.000000       2.000000
mean   1.500000    5000.000000
std    0.707107     282.842712
min    1.000000    4800.000000
25%    1.250000    4900.000000
50%    1.500000    5000.000000
75%    1.750000    5100.000000
max    2.000000    5200.000000    day  irradiance_wh
0    1           5200
1    2           4800
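
To check for missing values explicitly, `.isna().sum()` gives a per-column count. The sketch below does not read the real file: it assumes the CSV holds just the two rows shown in the output above and rebuilds them from an in-memory string, and `solar` and `csv_text` are illustrative names.

```python
import io

import pandas as pd

# Stand-in for ../day3/irr.csv, assuming it contains only these two rows
csv_text = "day,irradiance_wh\n1,5200\n2,4800\n"
solar = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each column
missing = solar.isna().sum()
print(missing)

# Column dtypes as a Series, handy for a quick type check
print(solar.dtypes)
```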


---

### Exercise 2: Filter High-Demand Days

In this cell, I:

- Filter the DataFrame to include only rows where `demand_gw` ≥ 50  
- Reset the index with `.reset_index(drop=True)` and display the result

- **Why?**  
  To identify days with peak grid demand for targeted operational planning.

In [9]:
# Convert Wh to kWh
df["irradiance_kwh"] = df["irradiance_wh"] / 1000

# Add sample demand data
df["demand_gw"] = [45.6, 53.4]

# Display high-demand rows and reset the index
high_demand = df[df["demand_gw"] >= 50].reset_index(drop=True)
print(high_demand)

   day  irradiance_wh  irradiance_kwh  demand_gw
0    2           4800             4.8       53.4
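
The effect of `drop=True` is easier to see side by side. In this sketch (same two rows as above, with illustrative variable names) the filtered row keeps its original label `1` until the index is reset, and without `drop=True` the old label is kept as a new `index` column.

```python
import pandas as pd

df = pd.DataFrame({
    "day": [1, 2],
    "irradiance_wh": [5200, 4800],
    "demand_gw": [45.6, 53.4],
})

# Filtering keeps the original row labels
filtered = df[df["demand_gw"] >= 50]
print(filtered.index.tolist())

# Without drop=True, the old labels move into a column named "index"
kept = filtered.reset_index()
print(kept.columns.tolist())

# With drop=True, the old labels are discarded
dropped = filtered.reset_index(drop=True)
print(dropped.columns.tolist())
```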


---

### Exercise 3: Regional Generation GroupBy

In this cell, I:

- Add a `region` column to the loaded DataFrame (assigning at least two distinct regions)  
- Group by `region` and compute total and average values for `irradiance_kwh` and `demand_gw`  
- Display the aggregated DataFrame

- **Why?**  
  To compare solar performance and demand across different regions of the grid.

In [10]:
# Assign regions
df["region"] = ["North", "South"]

# Group by region and aggregate
agg_df = df.groupby("region").agg({
    "irradiance_kwh": ["sum", "mean"],
    "demand_gw":       ["sum", "mean"]
})

# Display result
print(agg_df)

       irradiance_kwh      demand_gw      
                  sum mean       sum  mean
region                                    
North             5.2  5.2      45.6  45.6
South             4.8  4.8      53.4  53.4
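
Passing a dict of lists to `.agg()` produces two-level column labels, as the output above shows. One way to flatten them into single names is to join the levels; this is a sketch on the same two-row data, and the `flat` name and the underscore separator are my own choices.

```python
import pandas as pd

df = pd.DataFrame({
    "irradiance_kwh": [5.2, 4.8],
    "demand_gw": [45.6, 53.4],
    "region": ["North", "South"],
})

agg_df = df.groupby("region").agg({
    "irradiance_kwh": ["sum", "mean"],
    "demand_gw": ["sum", "mean"],
})

# Each column label is a tuple such as ("demand_gw", "mean");
# join the two levels into one string per column
flat = agg_df.copy()
flat.columns = ["_".join(col) for col in agg_df.columns]
print(flat)
```

Flat names such as `demand_gw_mean` are easier to reference later than tuple labels.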


---

### Exercise 4: Time-Series Resampling

In this cell, I:

- Build a `date` column with `pd.date_range()` starting May 1, 2025 (two periods here, because the loaded file has only two rows)  
- Set the new `date` column as index  
- Resample the DataFrame at a 2-day frequency (`'2D'`) and compute mean `irradiance_kwh`  
- Display the resampled series

- **Why?**  
  To practice aggregating irregular or daily data into custom time intervals for trend analysis.

In [11]:
df["date"] = pd.date_range("2025-05-01", periods=2, freq="D")
df.set_index("date", inplace=True)

resampled = df["irradiance_kwh"].resample("2D").mean()
print(resampled)

date
2025-05-01    5.0
Freq: 2D, Name: irradiance_kwh, dtype: float64
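
With only two rows, the 2-day resample collapses to a single bin. To see several bins, here is a sketch over May 1–5, 2025 that reuses the five irradiance values from Section 2, assuming they are daily kWh readings.

```python
import pandas as pd

# Five daily readings (values borrowed from the Section 2 sample data,
# assumed here to be in kWh) indexed by date
irradiance = pd.Series(
    [5.2, 4.6, 4.9, 5.4, 5.1],
    index=pd.date_range("2025-05-01", periods=5, freq="D"),
    name="irradiance_kwh",
)

# Group the days into 2-day bins starting at the first timestamp
# and take the mean of each bin
two_day_mean = irradiance.resample("2D").mean()
print(two_day_mean)
```

The bins start on May 1, May 3, and May 5. The last bin holds a single day, so its mean is just that day's value.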
