# Approach to Complex SQLquery Building in Kaggle

## Project Overview

In this mini-project, you will explore an IPL (Indian Premier League) database to practice building complex SQL queries. You will load the data, inspect its structure, and run a series of SQL queries that build up to comprehensive career metrics for IPL players.

---

## Task

1. Load and Explore the Data

- Import the necessary libraries: pandas and sqlite3.
- Connect to the IPL database and query the sqlite_master table to list the tables it contains.
- Load all the tables and print their column names to identify common columns.

2. Query 1: Select All Columns from the Player_Match Table

- Write and execute a SQL query to select all columns from the Player_Match table.

3. Query 2: Batsman vs Runs

- Write and execute a SQL query to calculate the total runs scored by each batsman.

4. Query 3: Fifties and Hundreds

- Write and execute a SQL query to calculate the number of fifties and hundreds scored by each batsman.

5. Query 4: Best Bowling Figures

- Write and execute a SQL query to find the best bowling figures for each bowler.

6. Query 5: Comprehensive Career Metrics

- Combine all the previous chunks into a single comprehensive query to get detailed career metrics for players.

---





![Entity Relationship Diagram](ERD.png)

[extracted from here](https://www.kaggle.com/code/arvinthsss/approach-to-complex-sqlquery-building-in-kaggle)

## 1. Load and Explore the Data

- Import the necessary libraries: pandas and sqlite3.
- Connect to the IPL database and query the sqlite_master table to list the tables it contains.
- Load all the tables and print their column names to identify common columns.


In [1]:
# import necessary libraries
import pandas as pd
import sqlite3

In [2]:
# connect to the database
conn = sqlite3.connect("database.sqlite")

In [3]:
# get the list of tables
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql(tables_query, conn)
print(f'## the tables are: {tables}')
print('\n' + '=' * 40 + '\n' + '=' * 40 + '\n')

# function to get column names from table
def get_column_names(table_name):
    columns_query = f'PRAGMA table_info({table_name});'
    columns = pd.read_sql(columns_query, conn)
    return columns[['name']]

# sweep through each table and print its column names
for table in tables['name']:
    print(f'## Table: {table}')
    column_names = get_column_names(table)
    print(column_names)
    print()


## the tables are:                name
0            Player
1        Extra_Runs
2    Batsman_Scored
3     Batting_Style
4     Bowling_Style
5           Country
6            Season
7              City
8           Outcome
9            Win_By
10     Wicket_Taken
11            Venue
12       Extra_Type
13         Out_Type
14    Toss_Decision
15           Umpire
16             Team
17     Ball_by_Ball
18      sysdiagrams
19  sqlite_sequence
20            Match
21            Rolee
22     Player_Match


## Table: Player
            name
0      Player_Id
1    Player_Name
2            DOB
3   Batting_hand
4  Bowling_skill
5   Country_Name

## Table: Extra_Runs
            name
0       Match_Id
1        Over_Id
2        Ball_Id
3  Extra_Type_Id
4     Extra_Runs
5     Innings_No

## Table: Batsman_Scored
          name
0     Match_Id
1      Over_Id
2      Ball_Id
3  Runs_Scored
4   Innings_No

## Table: Batting_Style
           name
0    Batting_Id
1  Batting_hand

## Table: Bowling_Style
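
The printed column lists can also be scanned programmatically for names that appear in more than one table, which is a quick way to spot join keys. The following is a minimal sketch, assuming a toy in-memory database: `conn_demo`, `find_shared_columns`, and the two simplified tables are hypothetical stand-ins and not part of the original notebook.

```python
import sqlite3
from collections import defaultdict

# Toy in-memory database; simplified stand-ins for two IPL tables (assumption).
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE Player (Player_Id INTEGER, Player_Name TEXT);
    CREATE TABLE Player_Match (Match_Id INTEGER, Player_Id INTEGER, Team_Id INTEGER);
""")

def find_shared_columns(connection):
    """Map each column name to the tables containing it, keeping only shared names."""
    table_names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
        )
    ]
    tables_by_column = defaultdict(list)
    for table_name in table_names:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        for column in connection.execute(f'PRAGMA table_info("{table_name}");'):
            tables_by_column[column[1]].append(table_name)
    return {name: tabs for name, tabs in tables_by_column.items() if len(tabs) > 1}

shared = find_shared_columns(conn_demo)
print(shared)  # {'Player_Id': ['Player', 'Player_Match']}
```

Passing the notebook's own connection instead of `conn_demo` should list the shared columns of the real database, although a shared name is only a hint that two tables can be joined, not a guarantee.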
        

## 2. Query 1: Select All Columns from the Player_Match Table

Write and execute a SQL query to select all columns from the Player_Match table.


In [4]:
# table to select
table = 'Player_Match'

# create the sql query
query = f'SELECT * FROM {table};'

# execute the query
df = pd.read_sql(query, conn)

# display the dataframe
display(df)

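|  | Match_Id | Player_Id | Role_Id | Team_Id |
|---|---|---|---|---|
| 0 | 335987 | 1 | 1 | 1 |
| 1 | 335987 | 2 | 3 | 1 |
| 2 | 335987 | 3 | 3 | 1 |
| 3 | 335987 | 4 | 3 | 1 |
| 4 | 335987 | 5 | 3 | 1 |
| ... | ... | ... | ... | ... |
| 12689 | 981024 | 385 | 3 | 11 |
| 12690 | 981024 | 394 | 3 | 11 |
| 12691 | 981024 | 429 | 3 | 11 |
| 12692 | 981024 | 434 | 3 | 2 |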
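
Because SQL placeholders can bind values but not identifiers, a table name has to be interpolated into the query string. One hedged way to keep that safe is to check the name against the database's own catalog first. The sketch below assumes a toy in-memory table; `conn_demo` and `select_all` are hypothetical names introduced only for illustration.

```python
import sqlite3

# Toy in-memory table standing in for Player_Match (assumption, two rows only).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE Player_Match (Match_Id INTEGER, Player_Id INTEGER);")
conn_demo.execute("INSERT INTO Player_Match VALUES (335987, 1), (335987, 2);")

def select_all(connection, table_name):
    """Return every row of table_name after checking that the table exists."""
    known_tables = {
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table';"
        )
    }
    if table_name not in known_tables:
        raise ValueError(f"unknown table: {table_name!r}")
    # The name matched an entry in sqlite_master, so interpolating it is safe.
    return connection.execute(f'SELECT * FROM "{table_name}";').fetchall()

rows = select_all(conn_demo, "Player_Match")
print(rows)  # [(335987, 1), (335987, 2)]
```

For a notebook with a hard-coded table name this check is optional; it matters mostly when the name comes from user input.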


## 3. Query 2: Batsman vs Runs

Write and execute a SQL query to calculate the total runs scored by each batsman.


In [5]:
# query to calculate total runs by each batsman
query = """
SELECT
    pl.Player_Id,
    pl.Player_Name,
    SUM(bsco.Runs_Scored) AS Total_Runs
FROM
    Ball_by_Ball byb
JOIN
    Batsman_Scored bsco ON byb.Ball_Id = bsco.Ball_Id
                         AND byb.Match_Id = bsco.Match_Id
                         AND byb.Over_Id = bsco.Over_Id
                         AND byb.Innings_No = bsco.Innings_No
-- only Player is joined, for the name; joining Player_Match on Match_Id alone
-- would repeat every ball once per player in the match and inflate the totals
JOIN
    Player pl ON byb.Striker = pl.Player_Id
GROUP BY
    pl.Player_Id, pl.Player_Name
ORDER BY
    pl.Player_Id;
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

|  | Player_Id | Player_Name | Total_Runs |
|---|---|---|---|
| 0 | 1 | SC Ganguly | 1349 |
| 1 | 2 | BB McCullum | 2435 |
| 2 | 3 | RT Ponting | 91 |
| 3 | 4 | DJ Hussey | 1322 |
| 4 | 5 | Mohammad Hafeez | 64 |
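
A point worth checking in any aggregate over joined tables is whether the join key uniquely identifies rows on the joined side. If it does not, each source row is repeated and `SUM` is multiplied. The following minimal sketch shows the effect on toy data; the `Balls` table, its rows, and `conn_demo` are hypothetical and much simpler than the real schema.

```python
import sqlite3

# Toy data (assumption): one striker faces two deliveries in a match with three listed players.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE Balls (Match_Id INTEGER, Striker INTEGER, Runs_Scored INTEGER);
    CREATE TABLE Player_Match (Match_Id INTEGER, Player_Id INTEGER);
    INSERT INTO Balls VALUES (100, 1, 4), (100, 1, 6);
    INSERT INTO Player_Match VALUES (100, 1), (100, 2), (100, 3);
""")

# Join on Match_Id only: every ball pairs with all three Player_Match rows.
fan_out_total = conn_demo.execute("""
    SELECT SUM(b.Runs_Scored)
    FROM Balls b
    JOIN Player_Match pm ON b.Match_Id = pm.Match_Id
    WHERE b.Striker = 1;
""").fetchone()[0]

# Join on the full key (match and player): every ball pairs with exactly one row.
keyed_total = conn_demo.execute("""
    SELECT SUM(b.Runs_Scored)
    FROM Balls b
    JOIN Player_Match pm ON b.Match_Id = pm.Match_Id
                        AND b.Striker = pm.Player_Id
    WHERE b.Striker = 1;
""").fetchone()[0]

print(fan_out_total, keyed_total)  # 30 10
```

Comparing a row count before and after each join is a cheap way to catch this kind of multiplication in a larger query.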


## 4. Query 3: Fifties and Hundreds

Write and execute a SQL query to calculate the number of fifties and hundreds scored by each batsman.


In [6]:
# query to calculate the number of fifties and hundreds per batsman
query = """
WITH Innings_Runs_CTE AS ( -- CTE for Innings Runs 50s & 100s
    SELECT
        byb.Striker,
        pl.Player_Name,
        bs.Match_Id,
        byb.Innings_No,
        SUM(bs.Runs_Scored) AS Runs,
        CASE
            WHEN SUM(bs.Runs_Scored) >= 50 AND SUM(bs.Runs_Scored) < 100 THEN 1
            ELSE 0
        END AS Fifties,
        CASE
            WHEN SUM(bs.Runs_Scored) >= 100 THEN 1
            ELSE 0
        END AS Hundreds
    FROM
        Ball_by_Ball byb
    JOIN
        Batsman_Scored bs
    ON
        byb.Match_Id = bs.Match_Id
        AND byb.Over_Id = bs.Over_Id
        AND byb.Ball_Id = bs.Ball_Id
        AND byb.Innings_No = bs.Innings_No
    JOIN
        Player pl
    ON
        byb.Striker = pl.Player_Id
    GROUP BY
        byb.Striker, pl.Player_Name, bs.Match_Id, byb.Innings_No
),
Aggregate_CTE AS ( -- CTE that sums the per-innings flags for each batsman
    SELECT
        Striker,
        Player_Name,
        SUM(Fifties) AS Fifties,
        SUM(Hundreds) AS Hundreds
    FROM
        Innings_Runs_CTE
    GROUP BY
        Striker, Player_Name
)
SELECT -- main query
    Striker,
    Player_Name,
    Fifties,
    Hundreds
FROM
    Aggregate_CTE
ORDER BY
    Striker;
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

|  | Striker | Player_Name | Fifties | Hundreds |
|---|---|---|---|---|
| 0 | 1 | SC Ganguly | 7 | 0 |
| 1 | 2 | BB McCullum | 11 | 2 |
| 2 | 3 | RT Ponting | 0 | 0 |
| 3 | 4 | DJ Hussey | 5 | 0 |
| 4 | 5 | Mohammad Hafeez | 0 | 0 |
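
The two-step pattern used above (total the runs per innings, then count how many totals fall in each band) can be checked on a handful of rows. This is a minimal sketch under stated assumptions: the `Scoring` table is hypothetical, and its rows are chunks of runs rather than real deliveries, chosen so the innings totals are easy to verify by hand.

```python
import sqlite3

# Toy data (assumption): three innings for one striker, totalling 55, 104 and 12.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE Scoring (Striker INTEGER, Match_Id INTEGER, Runs_Scored INTEGER);
    INSERT INTO Scoring VALUES
        (1, 1, 30), (1, 1, 25),
        (1, 2, 60), (1, 2, 44),
        (1, 3, 12);
""")

milestones = conn_demo.execute("""
    WITH Innings AS (          -- step 1: one row per striker and innings
        SELECT Striker, Match_Id, SUM(Runs_Scored) AS Runs
        FROM Scoring
        GROUP BY Striker, Match_Id
    )
    SELECT                     -- step 2: count the innings in each band
        Striker,
        SUM(CASE WHEN Runs >= 50 AND Runs < 100 THEN 1 ELSE 0 END) AS Fifties,
        SUM(CASE WHEN Runs >= 100 THEN 1 ELSE 0 END) AS Hundreds
    FROM Innings
    GROUP BY Striker;
""").fetchall()

print(milestones)  # [(1, 1, 1)]
```

The `Runs < 100` bound keeps a hundred from also being counted as a fifty, which matches the convention in the query above.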


## 5. Query 4: Best Bowling Figures

Write and execute a SQL query to find the best bowling figures for each bowler.


In [7]:
query = """
WITH Wickets_CTE AS ( -- CTE for Wickets
    SELECT
        byb.Match_Id,
        byb.Bowler,
        COUNT(wkt.Player_Out) AS Wickets
    FROM
        Ball_by_Ball byb
    JOIN
        Wicket_Taken wkt
    ON
        byb.Match_Id = wkt.Match_Id
        AND byb.Over_Id = wkt.Over_Id
        AND byb.Ball_Id = wkt.Ball_Id
        AND byb.Innings_No = wkt.Innings_No
    GROUP BY
        byb.Match_Id, byb.Bowler
),
Runs_CTE AS ( -- CTE for Runs
    SELECT
        byb.Match_Id,
        byb.Bowler,
        SUM(bs.Runs_Scored) AS Runs_Given
    FROM
        Ball_by_Ball byb
    JOIN
        Batsman_Scored bs
    ON
        byb.Match_Id = bs.Match_Id
        AND byb.Over_Id = bs.Over_Id
        AND byb.Ball_Id = bs.Ball_Id
        AND byb.Innings_No = bs.Innings_No
    GROUP BY
        byb.Match_Id, byb.Bowler
),
Combined_CTE AS ( -- CTE for Combining Wickets and Runs
    SELECT
        wt.Match_Id,
        wt.Bowler,
        wt.Wickets,
        rt.Runs_Given
    FROM
        Wickets_CTE wt
    JOIN
        Runs_CTE rt
    ON
        wt.Match_Id = rt.Match_Id
        AND wt.Bowler = rt.Bowler
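),
Ranked_CTE AS ( -- rank each bowler's matches: most wickets first, then fewest runs
    SELECT
        c.Bowler,
        c.Wickets,
        c.Runs_Given,
        ROW_NUMBER() OVER (
            PARTITION BY c.Bowler
            ORDER BY c.Wickets DESC, c.Runs_Given ASC
        ) AS Figure_Rank
    FROM
        Combined_CTE c
)
SELECT -- main query: keep only each bowler's best match
    r.Bowler,
    p.Player_Name,
    r.Wickets AS Max_Wickets,
    r.Runs_Given,

    -- concatenate to get stats Best_Bowling_Figure
    r.Wickets || '-' || r.Runs_Given AS Best_Bowling_Figure

FROM
    Ranked_CTE r
JOIN
    Player p
ON
    r.Bowler = p.Player_Id
WHERE
    r.Figure_Rank = 1
ORDER BY
    r.Wickets DESC, r.Runs_Given ASC, r.Bowler ASC;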
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

|  | Bowler | Player_Name | Max_Wickets | Runs_Given | Best_Bowling_Figure |
|---|---|---|---|---|---|
| 0 | 102 | Sohail Tanvir | 6 | 14 | 6-14 |
| 1 | 334 | AD Russell | 6 | 19 | 6-19 |
| 2 | 430 | A Zampa | 6 | 19 | 6-19 |
| 3 | 362 | DJG Sammy | 6 | 22 | 6-22 |
| 4 | 124 | A Kumble | 5 | 5 | 5-5 |
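
Best bowling figures are conventionally ordered by most wickets first and fewest runs conceded second, and the best figure for a bowler is a single match's pair of numbers. One way to pick that row per bowler is a window function, sketched below on toy data. The `Match_Figures` table and its rows are hypothetical, and `ROW_NUMBER()` assumes SQLite 3.25 or newer.

```python
import sqlite3

# Toy per-match figures (assumption): bowler 7 has three matches, bowler 8 has one.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE Match_Figures (
        Bowler INTEGER, Match_Id INTEGER, Wickets INTEGER, Runs_Given INTEGER
    );
    INSERT INTO Match_Figures VALUES
        (7, 1, 3, 20),
        (7, 2, 3, 15),
        (7, 3, 2, 5),
        (8, 1, 4, 30);
""")

best_figures = conn_demo.execute("""
    WITH Ranked AS (
        SELECT
            Bowler, Wickets, Runs_Given,
            ROW_NUMBER() OVER (
                PARTITION BY Bowler
                ORDER BY Wickets DESC, Runs_Given ASC
            ) AS Figure_Rank
        FROM Match_Figures
    )
    SELECT Bowler, Wickets || '-' || Runs_Given AS Best_Bowling_Figure
    FROM Ranked
    WHERE Figure_Rank = 1      -- one row per bowler: the top-ranked match
    ORDER BY Bowler;
""").fetchall()

print(best_figures)  # [(7, '3-15'), (8, '4-30')]
```

Bowler 7's 2-5 is the cheapest spell but not the best figure, because wickets take priority over runs in the ordering.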


## 6. Query 5: Comprehensive Career Metrics

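Combine the previous building blocks into a single comprehensive query that returns detailed career metrics for players. The query below covers the bowling side: wickets, economy rate, strike rate, and best bowling figures.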


In [8]:
# query for life, the universe, and everything else
query = """
-- calculate wickets taken by each bowler and group by player name
WITH Wickets_CTE AS (
    SELECT
        bbb.Bowler AS Bowler_Id,
        pl.Player_Name,
        COUNT(wt.Player_Out) AS Wickets
    FROM
        Wicket_Taken wt
        JOIN Ball_by_Ball bbb
            ON wt.Match_Id = bbb.Match_Id
            AND wt.Over_Id = bbb.Over_Id
            AND wt.Ball_Id = bbb.Ball_Id
            AND wt.Innings_No = bbb.Innings_No
        JOIN Player pl
            ON bbb.Bowler = pl.Player_Id
    GROUP BY
        bbb.Bowler, pl.Player_Name
),

-- calculate balls bowled by each bowler
Balls_Bowled_CTE AS (
    SELECT
        bbb.Bowler AS Bowler_Id,
        COUNT(bbb.Ball_Id) AS Balls_Bowled
    FROM
        Ball_by_Ball bbb
    GROUP BY
        bbb.Bowler
),

-- calculate runs given by each bowler
Economy_CTE AS (
    SELECT
        bbb.Bowler AS Bowler_Id,
        SUM(bs.Runs_Scored) AS Runs_Given
    FROM
        Ball_by_Ball bbb
        LEFT JOIN Batsman_Scored bs
            ON bbb.Match_Id = bs.Match_Id
            AND bbb.Over_Id = bs.Over_Id
            AND bbb.Ball_Id = bs.Ball_Id
            AND bbb.Innings_No = bs.Innings_No
    GROUP BY
        bbb.Bowler
),

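-- calculate a best-bowling string for each bowler
-- note: the two subqueries are joined on Bowler only, so MAX(Wickets) and
-- MIN(Runs_Given) are taken independently and may come from different matches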
Best_Bowling_CTE AS (
    SELECT
        wt.Bowler AS Bowler_Id,
        MAX(wt.Wickets) || '-' || MIN(rt.Runs_Given) AS Best_Bowling_Figure
    FROM (
        SELECT
            bbb.Match_Id,
            bbb.Bowler,
            COUNT(wt.Player_Out) AS Wickets
        FROM
            Ball_by_Ball bbb
            JOIN Wicket_Taken wt
                ON bbb.Match_Id = wt.Match_Id
                AND bbb.Over_Id = wt.Over_Id
                AND bbb.Ball_Id = wt.Ball_Id
                AND bbb.Innings_No = wt.Innings_No
        GROUP BY
            bbb.Match_Id, bbb.Bowler
    ) wt
    JOIN (
        SELECT
            bbb.Match_Id,
            bbb.Bowler,
            SUM(bs.Runs_Scored) AS Runs_Given
        FROM
            Ball_by_Ball bbb
            LEFT JOIN Batsman_Scored bs
                ON bbb.Match_Id = bs.Match_Id
                AND bbb.Over_Id = bs.Over_Id
                AND bbb.Ball_Id = bs.Ball_Id
                AND bbb.Innings_No = bs.Innings_No
        GROUP BY
            bbb.Match_Id, bbb.Bowler
    ) rt ON wt.Bowler = rt.Bowler
    GROUP BY
        wt.Bowler
)

-- combine all metrics into the final result
SELECT
    wc.Bowler_Id,
    wc.Player_Name,
    wc.Wickets,
    6 * (ec.Runs_Given * 1.0 / bb.Balls_Bowled) AS Economy_Rate,
    (bb.Balls_Bowled * 1.0 / wc.Wickets) AS Bowler_Strike_Rate,
    bbct.Best_Bowling_Figure
FROM
    Wickets_CTE wc
    JOIN Balls_Bowled_CTE bb
        ON wc.Bowler_Id = bb.Bowler_Id
    JOIN Economy_CTE ec
        ON wc.Bowler_Id = ec.Bowler_Id
    JOIN Best_Bowling_CTE bbct
        ON wc.Bowler_Id = bbct.Bowler_Id
ORDER BY
    wc.Wickets DESC;
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

|  | Bowler_Id | Player_Name | Wickets | Economy_Rate | Bowler_Strike_Rate | Best_Bowling_Figure |
|---|---|---|---|---|---|---|
| 0 | 194 | SL Malinga | 159 | 6.0 | 15.138365 | 5-6 |
| 1 | 71 | DJ Bravo | 137 | 7.558294 | 15.40146 | 4-2 |
| 2 | 136 | A Mishra | 132 | 6.924574 | 18.681818 | 5-6 |
| 3 | 50 | Harbhajan Singh | 128 | 6.630197 | 21.421875 | 5-7 |
| 4 | 67 | PP Chawla | 127 | 7.332524 | 19.464567 | 4-8 |
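
The formulas in the final `SELECT` can be verified by hand on a few rows: economy rate is runs conceded per six balls, and bowler strike rate is balls bowled per wicket. The sketch below, which assumes a hypothetical `Match_Figures` table with pre-aggregated per-match rows, also contrasts two ways of building a figure string: taking `MAX(Wickets)` and `MIN(Runs_Given)` independently, versus picking one match's pair.

```python
import sqlite3

# Toy per-match rows for one bowler (assumption): 60 balls, 52 runs, 4 wickets in total.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE Match_Figures (
        Bowler INTEGER, Match_Id INTEGER,
        Balls INTEGER, Runs_Given INTEGER, Wickets INTEGER
    );
    INSERT INTO Match_Figures VALUES
        (7, 1, 24, 30, 1),
        (7, 2, 24, 18, 3),
        (7, 3, 12,  4, 0);
""")

economy, strike_rate, mixed_figure = conn_demo.execute("""
    SELECT
        6 * (SUM(Runs_Given) * 1.0 / SUM(Balls)),   -- runs per over
        SUM(Balls) * 1.0 / SUM(Wickets),            -- balls per wicket
        MAX(Wickets) || '-' || MIN(Runs_Given)      -- independent max and min
    FROM Match_Figures
    WHERE Bowler = 7;
""").fetchone()

# A figure taken from a single match: most wickets, then fewest runs.
single_match_figure = conn_demo.execute("""
    SELECT Wickets || '-' || Runs_Given
    FROM Match_Figures
    WHERE Bowler = 7
    ORDER BY Wickets DESC, Runs_Given ASC
    LIMIT 1;
""").fetchone()[0]

print(round(economy, 2), strike_rate)      # 5.2 15.0
print(mixed_figure, single_match_figure)   # 3-4 3-18
```

Here `3-4` pairs the wickets from match 2 with the runs from match 3, so it does not describe any spell the bowler actually bowled, while `3-18` does.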
