## ICC Points Table
### Problem:
Generate a report that shows, for each country, the number of matches played, the number of wins, and the number of losses.

The final output should look like:
| Team_Name | no_of_matches_played | no_of_wins | no_of_losses |
|-----------|----------------------|------------|--------------|
| India | 2 | 2 | 0 |

In [0]:
-- Switch to my Catalog
USE CATALOG workspace;

-- Create schema if not exists
CREATE SCHEMA IF NOT EXISTS sql_pyspark_practice;

-- Use this schema
USE sql_pyspark_practice;

In [0]:
create or replace table icc_world_cup
(
Team_1 Varchar(20),
Team_2 Varchar(20),
Winner Varchar(20)
);

INSERT INTO icc_world_cup values('India','SL','India');
INSERT INTO icc_world_cup values('SL','Aus','Aus');
INSERT INTO icc_world_cup values('SA','Eng','Eng');
INSERT INTO icc_world_cup values('Eng','NZ','NZ');
INSERT INTO icc_world_cup values('Aus','India','India');

select * from icc_world_cup;

## SQL Solution

In [0]:
select Team_Name,
       count(1) as no_of_matches_played,
       sum(win_flag) as no_of_wins,
       count(1) - sum(win_flag) as no_of_losses
from (
  select team_1 as Team_Name, case when team_1=winner then 1 else 0 end as win_flag
  from icc_world_cup
  union all
  select team_2 as Team_Name, case when team_2=winner then 1 else 0 end as win_flag
  from icc_world_cup 
) A
group by Team_Name
order by no_of_wins desc;
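
As a quick check outside Databricks, the same query can be run against an in-memory SQLite database. This is only a sketch under a few assumptions: SQLite's dialect is close enough for this query, a plain `create table` stands in for `create or replace table`, and a `Team_Name` tie-breaker is added to the `order by` so the row order is deterministic. The names `MATCHES`, `QUERY` and `points_table` are illustrative and not part of the notebook.

```python
import sqlite3

# The five matches inserted into icc_world_cup above: (Team_1, Team_2, Winner)
MATCHES = [
    ("India", "SL", "India"),
    ("SL", "Aus", "Aus"),
    ("SA", "Eng", "Eng"),
    ("Eng", "NZ", "NZ"),
    ("Aus", "India", "India"),
]

# Same logic as the SQL solution; the extra Team_Name sort key is an assumption
# added here only to make ties between teams come out in a stable order.
QUERY = """
select Team_Name,
       count(1) as no_of_matches_played,
       sum(win_flag) as no_of_wins,
       count(1) - sum(win_flag) as no_of_losses
from (
  select Team_1 as Team_Name, case when Team_1 = Winner then 1 else 0 end as win_flag
  from icc_world_cup
  union all
  select Team_2 as Team_Name, case when Team_2 = Winner then 1 else 0 end as win_flag
  from icc_world_cup
) A
group by Team_Name
order by no_of_wins desc, Team_Name
"""


def points_table(matches):
    """Load the matches into a throwaway SQLite table and run the report query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table icc_world_cup "
            "(Team_1 varchar(20), Team_2 varchar(20), Winner varchar(20))"
        )
        conn.executemany("insert into icc_world_cup values (?, ?, ?)", matches)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for row in points_table(MATCHES):
    print(row)
```

With the sample data, India comes out on top with 2 matches, 2 wins and 0 losses, which matches the expected output in the problem statement.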

## PySpark Solution

In [0]:
%python
from pyspark.sql import functions as F

# Read Source table
df = spark.table("icc_world_cup")

# Stack Team_1 and Team_2 into a single Team_Name column,
# flagging each row with 1 if that team won the match, else 0
df_union = (
    df.select(
        F.col("Team_1").alias("Team_Name"),
        F.when(F.col("Team_1") == F.col("Winner"), 1).otherwise(0).alias("win_flag"),
    )
    .unionAll(
        df.select(
            F.col("Team_2").alias("Team_Name"),
            F.when(F.col("Team_2") == F.col("Winner"), 1).otherwise(0).alias("win_flag"),
        )
    )
)

# Group, Aggregate, Derive and Order Pipeline
df_final = (
    df_union
    .groupBy("Team_Name")
    .agg(
        F.count("Team_Name").alias("no_of_matches_played"),
        F.sum("win_flag").alias("no_of_wins")
        )
    .withColumn(
        "no_of_losses",
        F.col("no_of_matches_played") - F.col("no_of_wins")
        )
    .orderBy("no_of_wins", ascending=False)
    )

display(df_final)


## Notes
1. **Read table**
   `df = spark.table("icc_world_cup")`
   → Load the SQL table into a PySpark DataFrame.

2. **Build two selections** (one for `Team_1`, one for `Team_2`)
   → Select the team name as `Team_Name` and use `F.when(...).otherwise(...)` to set `win_flag` to 1 when that team is the `Winner`, else 0.

3. **Union both**
   → Stack the two selections with `unionAll` so that every team gets one row per match played. Duplicate rows are kept, as with SQL `UNION ALL`.

4. **Group & Aggregate**
   → `groupBy("Team_Name")`, then `count()` → matches played and `sum("win_flag")` → wins.

5. **Add derived column**
   → `no_of_losses = no_of_matches_played - no_of_wins`.

6. **Order results**
   → Sort by `no_of_wins` (descending).

7. **Display final output**
   → `display(df_final)` in Databricks.
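
The steps above can also be traced without Spark. The following is a minimal plain-Python sketch of the same union, group, aggregate, derive and order logic, run on the five sample matches. The function name `build_report` and the alphabetical tie-breaker on the team name are assumptions made for illustration; the notebook itself sorts only by `no_of_wins`.

```python
from collections import defaultdict

# Sample rows from icc_world_cup: (Team_1, Team_2, Winner)
MATCHES = [
    ("India", "SL", "India"),
    ("SL", "Aus", "Aus"),
    ("SA", "Eng", "Eng"),
    ("Eng", "NZ", "NZ"),
    ("Aus", "India", "India"),
]


def build_report(matches):
    # Steps 2-3: one (Team_Name, win_flag) row per team per match,
    # i.e. the Team_1 rows followed by the Team_2 rows, duplicates kept.
    team_rows = [(t1, int(t1 == winner)) for t1, _, winner in matches]
    team_rows += [(t2, int(t2 == winner)) for _, t2, winner in matches]

    # Step 4: group by team, counting rows (matches) and summing win_flag (wins).
    played = defaultdict(int)
    wins = defaultdict(int)
    for team, win_flag in team_rows:
        played[team] += 1
        wins[team] += win_flag

    # Step 5: losses are derived as matches played minus wins.
    report = [
        (team, played[team], wins[team], played[team] - wins[team])
        for team in played
    ]

    # Step 6: most wins first; team name breaks ties (an added assumption).
    return sorted(report, key=lambda row: (-row[2], row[0]))


for row in build_report(MATCHES):
    print(row)
```

Each match contributes exactly two rows to `team_rows`, so the matches-played counts always add up to twice the number of matches.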

---

## Learnings
### 10th Nov 2025
- Learnt about Git integration in Databricks.
- Learnt the use cases for `UNION` and `UNION ALL` in SQL.
- Learnt PySpark syntax.
- Used `functions`, `select`, `unionAll`, aggregate functions, `groupBy` and `orderBy` in PySpark.
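
To make the `UNION` versus `UNION ALL` point concrete, here is a small sketch using an in-memory SQLite database (an assumption; the notebook itself runs on Databricks). It counts how many rows the inner query produces for one team under each set operator. The helper name `rows_for_team` is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table icc_world_cup (Team_1 text, Team_2 text, Winner text)")
conn.executemany(
    "insert into icc_world_cup values (?, ?, ?)",
    [
        ("India", "SL", "India"),
        ("SL", "Aus", "Aus"),
        ("SA", "Eng", "Eng"),
        ("Eng", "NZ", "NZ"),
        ("Aus", "India", "India"),
    ],
)

# The inner query of the solution, with the set operator left as a placeholder.
TEMPLATE = """
select count(1)
from (
  select Team_1 as Team_Name, case when Team_1 = Winner then 1 else 0 end as win_flag
  from icc_world_cup
  {set_op}
  select Team_2 as Team_Name, case when Team_2 = Winner then 1 else 0 end as win_flag
  from icc_world_cup
) A
where Team_Name = ?
"""


def rows_for_team(team, set_op):
    """Count the (Team_Name, win_flag) rows for one team under 'union' or 'union all'."""
    return conn.execute(TEMPLATE.format(set_op=set_op), (team,)).fetchone()[0]


# India won both of its matches, so both of its rows are ('India', 1).
print("union all:", rows_for_team("India", "union all"))
print("union:    ", rows_for_team("India", "union"))
```

`UNION` removes duplicate rows, so India's two identical `('India', 1)` rows collapse into one and its match count would be understated. `UNION ALL` keeps both, which is why the solution uses it.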