In [0]:
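# Explicit imports instead of `import *`; note that max, round and sum
# still shadow the Python built-ins of the same name in this notebook.
from pyspark.sql.functions import avg, col, count, date_format, max, round, sum, when

# Load the Brazil weather CSV with a header row and inferred column types.
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load('/Volumes/sandeshmsdatabricks/sourcefiles/sourcevolume/weather/Brazil_weather_data.csv')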

Group by Country and compute the total number of observations, the average Temp_Mean, and the maximum Precipitation_Sum.

In [0]:
df.groupBy("Country")\
  .agg(
    count("*").alias("total_count"),
    # alias() goes on the outer round() so the output column gets the short name
    round(avg("Temp_Mean"), 1).alias("avg_temp_mean"),
    round(max("Precipitation_Sum"), 1).alias("max_Precipitation_Sum")
  )\
  .show()

+-------+-----------+-------------+---------------------+
|Country|total_count|avg_temp_mean|max_Precipitation_Sum|
+-------+-----------+-------------+---------------------+
| Brazil|       8766|         26.8|                124.4|
+-------+-----------+-------------+---------------------+



Using the Brazil weather DataFrame, calculate the average monthly sunshine duration for Brazil by:

Grouping the data by month,

Computing the average of the Sunshine_Duration column for each month,

Returning one row per month with month and avg_sunshine_hours.

In [0]:
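# Note: "MM/yyyy" labels are strings, so orderBy sorts them lexically
# (01/2000, 01/2001, ...), not chronologically.
display(df.groupBy(date_format("Date", "MM/yyyy").alias("Month-Year"))\
  .agg(
    round(avg("Sunshine_Duration"), 2).alias("Monthly_AVG_Sunshine_Duration_in_Sec")
  ).orderBy("Month-Year"))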

Month-Year,Monthly_AVG_Sunshine_Duration_in_Sec
01/2000,31361.31
01/2001,36086.71
01/2002,29895.33
01/2003,28323.07
01/2004,20826.6
01/2005,28413.79
01/2006,33973.3
01/2007,31829.74
01/2008,30111.57
01/2009,35032.64
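
The rows above run 01/2000, 01/2001, 01/2002, ... because MM/yyyy labels are strings and are compared character by character, so every January sorts before any February. A minimal plain-Python sketch of the effect (no Spark needed; the label lists are made up for illustration), next to a year-first label that sorts chronologically:

```python
# Hypothetical month labels, written in both formats for comparison.
month_first = ["02/2000", "01/2001", "01/2000", "12/2000"]
year_first = ["2000-02", "2001-01", "2000-01", "2000-12"]

# Strings sort character by character, so the month digits dominate.
sorted_month_first = sorted(month_first)
# With the year in front, string order equals chronological order.
sorted_year_first = sorted(year_first)

print(sorted_month_first)
print(sorted_year_first)
```

In the notebook, the analogous change would be to pass "yyyy-MM" to date_format, assuming a year-first label is acceptable for the report.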


### Question: Monthly weather KPIs for Brazil
Using your Brazil weather DataFrame, calculate monthly KPIs by grouping on month-year (use the Date column for this). For each month, compute all of the following metrics in the same aggregation:

**hot_days**
Number of days where Temp_Mean > 30.

**heavy_rain_days**

Number of days where Precipitation_Sum >= 20.

**high_wind_sunshine**

Total Sunshine_Duration on days where Windgusts_Max > 15.

**total_days**

Total number of records in that month (for Brazil).

**Return one row per month with:**
Month_Year, hot_days, heavy_rain_days, high_wind_sunshine, total_days, ordered by Month_Year.

In [0]:
# Conditional aggregation: count(when(...)) counts only rows that meet the condition.
# No orderBy is applied here, so show() lists the months in arbitrary order.
df.groupBy(date_format("Date", "MM/yyyy").alias("Month-Year"))\
    .agg(
        count(when(col("Temp_Mean") > 30, 1)).alias("hot_days"),
        count(when(col("Precipitation_Sum") >= 20, 1)).alias("heavy_rain_days"),
        round(sum(when(col("Windgusts_Max") > 15, col("Sunshine_Duration")).otherwise(0)), 2).alias("high_wind_sunshine"),
        count("*").alias("Total_days")
    ).show()


+----------+--------+---------------+------------------+----------+
|Month-Year|hot_days|heavy_rain_days|high_wind_sunshine|Total_days|
+----------+--------+---------------+------------------+----------+
|   06/2000|       0|              0|        1098389.04|        30|
|   11/2013|       0|              3|         936092.77|        30|
|   09/2022|       3|              0|        1112831.45|        30|
|   07/2007|       0|              0|        1168253.19|        31|
|   06/2023|       0|              0|        1126886.87|        30|
|   01/2012|       0|              0|         1007433.0|        31|
|   01/2023|       0|              6|         871847.56|        31|
|   09/2002|       6|              0|        1159450.84|        30|
|   01/2010|       0|              3|        1029814.69|        31|
|   08/2023|       1|              0|        1152967.68|        31|
|   06/2001|       0|              0|        1123563.26|        30|
|   01/2005|       0|              3|         88
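
The count(when(condition, 1)) pattern used for hot_days and heavy_rain_days works because when without otherwise yields null for rows that fail the condition, and count skips nulls. The sum(when(...).otherwise(0)) pattern adds the value on matching rows and 0 elsewhere. A plain-Python model of both, using made-up rows and None standing in for null (a sketch of the semantics, not Spark code):

```python
# Made-up daily rows: (Temp_Mean, Windgusts_Max, Sunshine_Duration).
rows = [(31.2, 18.0, 40000.0), (28.5, 12.0, 35000.0), (30.4, 16.5, 42000.0)]

# Model of when(Temp_Mean > 30, 1): 1 on a match, None (null) otherwise.
flags = [1 if temp > 30 else None for temp, _, _ in rows]
# Model of count(...): only non-null values are counted.
hot_days = len([f for f in flags if f is not None])

# Model of sum(when(Windgusts_Max > 15, Sunshine_Duration).otherwise(0)).
high_wind_sunshine = sum(sun if gust > 15 else 0 for _, gust, sun in rows)

print(hot_days, high_wind_sunshine)
```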

Using the same Brazil weather DataFrame, build a country-level extreme-weather KPI table (still only Brazil in your data, but write it as if there could be more countries). Group by Country and compute all of these metrics in a single groupBy + agg:

**very_hot_days**
Number of days where Temp_Mean > 32.

**cool_days**
Number of days where Temp_Mean < 20.

**stormy_days**
Number of days where Windgusts_Max > 20 and Precipitation_Sum >= 10 on the same day.

**dry_sunny_hours**
Total Sunshine_Duration on days where Precipitation_Sum = 0.

**records**
Total number of rows per country.

**Return columns:**
Country, very_hot_days, cool_days, stormy_days, dry_sunny_hours, records.

In [0]:
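# One conditional aggregate per KPI, all computed in a single groupBy + agg.
df.groupBy("Country")\
.agg(
    count(when(col("Temp_Mean") > 32, 1)).alias("Very_hot_days"),
    count(when(col("Temp_Mean") < 20, 1)).alias("Very_Cool_days"),
    # Both conditions must hold on the same row (day).
    count(when((col("Windgusts_Max") > 20) & (col("Precipitation_Sum") >= 10), 1)).alias("Stormy_Days"),
    # Raw sum of Sunshine_Duration; no unit conversion is applied despite the "hours" name.
    sum(when(col("Precipitation_Sum") == 0, col("Sunshine_Duration")).otherwise(0)).alias("dry_sunny_hours"),
    count("*").alias("Total_days")
).show()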

+-------+-------------+--------------+-----------+--------------------+----------+
|Country|Very_hot_days|Very_Cool_days|Stormy_Days|     dry_sunny_hours|Total_days|
+-------+-------------+--------------+-----------+--------------------+----------+
| Brazil|           79|             8|       1055|1.5666575244999984E8|      8766|
+-------+-------------+--------------+-----------+--------------------+----------+
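
The dry_sunny_hours figure above is a raw sum of Sunshine_Duration. If that column is recorded in seconds (an assumption here, suggested by the _in_Sec alias used in the monthly sunshine cell), dividing by 3600 would express it in hours. A quick check of the arithmetic:

```python
SECONDS_PER_HOUR = 3600

# Value copied from the dry_sunny_hours column in the output above.
dry_sunny_seconds = 1.5666575244999984e8

# Assumes Sunshine_Duration is recorded in seconds.
dry_sunny_hours = round(dry_sunny_seconds / SECONDS_PER_HOUR, 2)
print(dry_sunny_hours)
```

In the aggregation itself, this would amount to dividing the summed column by 3600 before aliasing it.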

