When assessing countries against global governance indicators, it is often useful to know the year in which each indicator reached its lowest value for each country. Given a government-provided dataset that holds indicator values for several countries across multiple years, your task is to write an SQL query (solved below with the equivalent PySpark DataFrame code) that returns the following information:
For each country and indicator combination, determine the year in which the indicator value was lowest, along with the corresponding indicator value. Sort the output by country name and indicator name.

What is stack in PySpark?
stack is a Spark SQL generator function, typically called from PySpark through selectExpr or expr, that helps unpivot a DataFrame. It turns multiple columns into rows, which is what you need when wide data (one column per year) has to become long data (one row per year).

Syntax:
stack(n, col1, val1, col2, val2, ..., colN, valN)
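n: The number of rows to produce for each input row. When every row is one label/value pair, this equals the number of columns being unpivoted.
col1, val1...: Label/value pairs. Each label is a literal (here the year as a string) and each value is the column that holds the data for that label.
With label/value pairs like these, stack generates two columns (named col0 and col1 unless you alias them):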

One for the column names (you can rename it later).
One for the values associated with those columns.
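
To make the role of n concrete: stack splits its arguments into n rows of equal width. The sketch below emulates that for a single input row in plain Python. It is an illustration only, not how Spark implements the generator; stack_rows is a hypothetical helper name, and the three values are taken from the Canada, Control of Corruption row of the sample data.

```python
# Plain-Python emulation of stack(n, e1, ..., ek) for ONE input row.
# Illustration only: stack_rows is a hypothetical helper, not a Spark API.
def stack_rows(n, *exprs):
    """Split exprs into n output rows of equal width, padding with None."""
    width = -(-len(exprs) // n)  # ceil(len(exprs) / n) columns per output row
    padded = list(exprs) + [None] * (n * width - len(exprs))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]

# One wide row (first three years of Canada, Control of Corruption)
wide = {"year_2010": 1.46, "year_2011": 1.61, "year_2012": 1.71}

# Three label/value pairs -> n = 3 output rows, each with two columns
long_rows = stack_rows(
    3,
    "2010", wide["year_2010"],
    "2011", wide["year_2011"],
    "2012", wide["year_2012"],
)

for year_number, indicator_value in long_rows:
    print(year_number, indicator_value)
```

If the arguments do not divide evenly, the last row is padded: stack_rows(2, 1, 2, 3) gives [(1, 2), (3, None)], mirroring how Spark fills the missing cell with NULL.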

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("GlobalIndicators").getOrCreate()

# Sample input data
data = [
    ("United States", "Control of Corruption", 1.26, 1.51, 1.52, 1.50, 1.46),
    ("United States", "Government Effectiveness", 1.27, 1.45, 1.28, 1.25, 1.27),
    ("United States", "Regulatory Quality", 1.28, 1.63, 1.63, 1.54, 1.62),
    ("United States", "Rule of Law", 1.32, 1.61, 1.60, 1.54, 1.62),
    ("United States", "Voice and Accountability", 1.30, 1.11, 1.13, 1.08, 1.05),
    ("Canada", "Control of Corruption", 1.46, 1.61, 1.71, 1.50, 1.56),
    ("Canada", "Government Effectiveness", 1.47, 1.55, 1.38, 1.35, 1.47),
    ("Canada", "Regulatory Quality", 1.38, 1.73, 1.63, 1.59, 1.68),
    ("Canada", "Rule of Law", 1.42, 1.71, 1.80, 1.64, 1.72),
    ("Canada", "Voice and Accountability", 1.40, 1.19, 1.21, 1.16, 1.09)
]

schema = ["country_name", "indicator_name", "year_2010", "year_2011", "year_2012", "year_2013", "year_2014"]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

# Step 1: Unpivot the data
unpivoted_df = df.selectExpr(
    "country_name",
    "indicator_name",
    "stack(5, '2010', year_2010, '2011', year_2011, '2012', year_2012, '2013', year_2013, '2014', year_2014) as (year_number, indicator_value)"
)

# Converting year_number to integer
unpivoted_df = unpivoted_df.withColumn("year_number", col("year_number").cast("int"))

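# Step 2: Rank years within each country/indicator by ascending indicator_value;
# year_number breaks ties so the earliest year wins deterministically
window_spec = Window.partitionBy("country_name", "indicator_name").orderBy(col("indicator_value").asc(), col("year_number").asc())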
ranked_df = unpivoted_df.withColumn("rn", row_number().over(window_spec))

# Step 3: Filter for the minimum indicator value
result_df = ranked_df.filter(col("rn") == 1).select(
    "country_name", "indicator_name", "indicator_value", "year_number"
).orderBy("country_name", "indicator_name")

# Show the result (display() is available in Databricks notebooks;
# use result_df.show() in other environments)
result_df.display()


country_name,indicator_name,indicator_value,year_number
Canada,Control of Corruption,1.46,2010
Canada,Government Effectiveness,1.35,2013
Canada,Regulatory Quality,1.38,2010
Canada,Rule of Law,1.42,2010
Canada,Voice and Accountability,1.09,2014
United States,Control of Corruption,1.26,2010
United States,Government Effectiveness,1.25,2013
United States,Regulatory Quality,1.28,2010
United States,Rule of Law,1.32,2010
United States,Voice and Accountability,1.05,2014
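
The task statement asks for an SQL query, while the cell above uses the DataFrame API. The same unpivot, rank, and filter steps can be sketched in SQL as follows. This is a minimal sketch under stated assumptions: it runs on Python's built-in sqlite3 module as a stand-in engine (window functions need SQLite 3.25 or newer), the table name indicators is hypothetical, only two rows of the sample data are loaded, and UNION ALL replaces stack because SQLite has no such function. In Spark SQL the unpivoted CTE could use stack instead.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real SQL engine (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators (country_name TEXT, indicator_name TEXT, "
    "year_2010 REAL, year_2011 REAL, year_2012 REAL, year_2013 REAL, year_2014 REAL)"
)

# Two rows from the sample data are enough to show the pattern
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("United States", "Government Effectiveness", 1.27, 1.45, 1.28, 1.25, 1.27),
        ("Canada", "Voice and Accountability", 1.40, 1.19, 1.21, 1.16, 1.09),
    ],
)

query = """
WITH unpivoted AS (
    -- Step 1: unpivot, one SELECT per year column
    SELECT country_name, indicator_name, 2010 AS year_number, year_2010 AS indicator_value FROM indicators
    UNION ALL SELECT country_name, indicator_name, 2011, year_2011 FROM indicators
    UNION ALL SELECT country_name, indicator_name, 2012, year_2012 FROM indicators
    UNION ALL SELECT country_name, indicator_name, 2013, year_2013 FROM indicators
    UNION ALL SELECT country_name, indicator_name, 2014, year_2014 FROM indicators
),
ranked AS (
    -- Step 2: rank years per country/indicator, lowest value first
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY country_name, indicator_name
        ORDER BY indicator_value ASC, year_number ASC
    ) AS rn
    FROM unpivoted
)
-- Step 3: keep the lowest value per group
SELECT country_name, indicator_name, indicator_value, year_number
FROM ranked
WHERE rn = 1
ORDER BY country_name, indicator_name
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For the two rows loaded here, the query returns the same values as the corresponding lines of the PySpark output: 1.09 in 2014 for Canada, Voice and Accountability, and 1.25 in 2013 for United States, Government Effectiveness.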
