# Approach to Complex SQL Query Building in Kaggle

## Project Overview

In this mini-project, you will explore and analyze an IPL (Indian Premier League) database to practice building complex SQL queries. You will load and clean the data, then execute several SQL queries that build up to comprehensive career metrics for IPL players.

---

## Task

1. Load and Explore the Data

- Import the necessary libraries: pandas and sqlite3.
- Connect to the IPL database and load the master table to understand the structure.
- Load all the tables and print their column names to identify common columns.

2. Query 1: Select All Columns from the Player_Match Table

- Write and execute a SQL query to select all columns from the Player_Match table.

3. Query 2: Batsman vs Runs

- Write and execute a SQL query to calculate the total runs scored by each batsman.

4. Query 3: Fifties and Hundreds

- Write and execute a SQL query to calculate the number of fifties and hundreds scored by each batsman.

5. Query 4: Best Bowling Figures

- Write and execute a SQL query to find the best bowling figures for each bowler.

6. Query 5: Comprehensive Career Metrics

- Combine all the previous chunks into a single comprehensive query to get detailed career metrics for players.

---





![Entity Relationship Diagram](ERD.png)

Extracted from [this Kaggle notebook](https://www.kaggle.com/code/arvinthsss/approach-to-complex-sqlquery-building-in-kaggle).

## 1. Load and Explore the Data

- Import the necessary libraries: pandas and sqlite3.
- Connect to the IPL database and load the master table to understand the structure.
- Load all the tables and print their column names to identify common columns.
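
The last bullet stops at printing column names, which leaves the shared columns to be spotted by eye. A minimal sketch of automating that step follows. It runs against a tiny in-memory database whose table and column names only mimic the IPL schema (the tables are empty stand-ins), and `common_columns` is a hypothetical helper, not part of the original notebook.

```python
import sqlite3
from collections import defaultdict

# Toy in-memory database: names mimic the IPL schema, the tables hold no data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Player (Player_Id INTEGER, Player_Name TEXT, Batting_hand INTEGER);
    CREATE TABLE Player_Match (Match_Id INTEGER, Player_Id INTEGER, Role_Id INTEGER);
    CREATE TABLE Batsman_Scored (Match_Id INTEGER, Over_Id INTEGER, Runs_Scored INTEGER);
""")

def common_columns(conn):
    """Map each lower-cased column name to the tables containing it,
    keeping only columns that appear in more than one table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    seen = defaultdict(list)
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name
        for col in conn.execute(f'PRAGMA table_info("{table}")'):
            seen[col[1].lower()].append(table)
    return {col: tbls for col, tbls in seen.items() if len(tbls) > 1}

shared = common_columns(conn)
for col, tbls in sorted(shared.items()):
    print(f"{col}: {', '.join(tbls)}")
```

Column names are lower-cased before comparison because the schema mixes spellings such as `Batting_hand` and `Match_Id`; shared names are candidates for join keys, not proof of a relationship.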


In [1]:
# import necessary libraries
import pandas as pd
import sqlite3

In [2]:
# connect to the database
conn = sqlite3.connect("database.sqlite")

In [3]:
# get the list of tables
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql(tables_query, conn)
print(f'## the tables are: {tables}')
print('\n' + '=' * 40 + '\n' + '=' * 40 + '\n')

# function to get column names from table
def get_column_names(table_name):
    columns_query = f'PRAGMA table_info({table_name});'
    columns = pd.read_sql(columns_query, conn)
    return columns[['name']]

# sweep through each table and print its column names
for table in tables['name']:
    print(f'## Table: {table}')
    column_names = get_column_names(table)
    print(column_names)
    print()


## the tables are:                name
0            Player
1        Extra_Runs
2    Batsman_Scored
3     Batting_Style
4     Bowling_Style
5           Country
6            Season
7              City
8           Outcome
9            Win_By
10     Wicket_Taken
11            Venue
12       Extra_Type
13         Out_Type
14    Toss_Decision
15           Umpire
16             Team
17     Ball_by_Ball
18      sysdiagrams
19  sqlite_sequence
20            Match
21            Rolee
22     Player_Match


## Table: Player
            name
0      Player_Id
1    Player_Name
2            DOB
3   Batting_hand
4  Bowling_skill
5   Country_Name

## Table: Extra_Runs
            name
0       Match_Id
1        Over_Id
2        Ball_Id
3  Extra_Type_Id
4     Extra_Runs
5     Innings_No

## Table: Batsman_Scored
          name
0     Match_Id
1      Over_Id
2      Ball_Id
3  Runs_Scored
4   Innings_No

## Table: Batting_Style
           name
0    Batting_Id
1  Batting_hand

## Table: Bowling_Style
        

## 2. Query 1: Select All Columns from the Player_Match Table

Write and execute a SQL query to select all columns from the Player_Match table.


In [4]:
# table to select
table = 'Player_Match'

# create the sql query
query = f'SELECT * FROM {table};'

# execute the query
df = pd.read_sql(query, conn)

# display the dataframe
display(df)

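| | Match_Id | Player_Id | Role_Id | Team_Id |
|---|---|---|---|---|
| 0 | 335987 | 1 | 1 | 1 |
| 1 | 335987 | 2 | 3 | 1 |
| 2 | 335987 | 3 | 3 | 1 |
| 3 | 335987 | 4 | 3 | 1 |
| 4 | 335987 | 5 | 3 | 1 |
| ... | ... | ... | ... | ... |
| 12689 | 981024 | 385 | 3 | 11 |
| 12690 | 981024 | 394 | 3 | 11 |
| 12691 | 981024 | 429 | 3 | 11 |
| 12692 | 981024 | 434 | 3 | 2 |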
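
Table names cannot be bound as `?` parameters, which is why the cell above interpolates `table` with an f-string. If the name ever comes from outside the notebook, one cautious option is to check it against `sqlite_master` before building the query. The sketch below assumes a throwaway in-memory table holding one row copied from the output above; `select_all` is a hypothetical helper.

```python
import sqlite3

# Stand-in for the real database: one table, one row taken from the output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Player_Match (Match_Id INT, Player_Id INT, Role_Id INT, Team_Id INT)")
conn.execute("INSERT INTO Player_Match VALUES (335987, 1, 1, 1)")

def select_all(conn, table, limit=5):
    """Return up to `limit` rows from `table`, refusing names that are not real tables."""
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")
    # The identifier is quoted and interpolated; the row limit is bound as a parameter.
    return conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()

rows = select_all(conn, "Player_Match")
print(rows)
```

The `LIMIT` is optional; it is included here only because the full table runs to several thousand rows.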


## 3. Query 2: Batsman vs Runs

Write and execute a SQL query to calculate the total runs scored by each batsman.


In [5]:
# query to calculate total runs by each batsman
query = """
-- CTE to get the detailed data for each ball
WITH detailed_batting_data AS (
    SELECT
        byb.match_id AS match_id,
        byb.over_id AS over_id,
        byb.ball_id AS ball_id,
        byb.innings_no AS innings_no,
        byb.team_batting AS team_batting,
        striker,
        non_striker,
        bowler,
        role_desc,
        bs.batting_hand AS batting_hand,
        pl.player_name AS striker_name,
        runs_scored AS runs
    FROM ball_by_ball byb
    -- join the tables to get the batting hand, player names, etc.
    JOIN batsman_scored bsco ON byb.ball_id = bsco.ball_id 
                              AND byb.match_id = bsco.match_id 
                              AND byb.over_id = bsco.over_id 
                              AND byb.innings_no = bsco.innings_no
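    -- caveat: this matches every player in the match, not only the striker, so
    -- role_desc comes from an arbitrary player; also matching
    -- pm.player_id = byb.striker would tie the role to the striker
    JOIN player_match pm ON bsco.match_id = pm.match_id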
    JOIN player pl ON byb.striker = pl.player_id
    JOIN rolee re ON pm.role_id = re.role_id
    JOIN batting_Style bs ON pl.batting_hand = bs.batting_id
    GROUP BY striker, byb.match_id, byb.over_id, byb.ball_id
    ORDER BY bsco.innings_no ASC
),
-- CTE to calculate the required aggregates
aggregated_batting_data AS (
    SELECT
        a.striker,
        COUNT(DISTINCT a.match_id) AS batting_innings,
        a.striker_name,
        a.role_desc,
        a.batting_hand,
        SUM(a.runs) AS runs
    FROM detailed_batting_data a
    GROUP BY a.striker_name
)

-- main query
SELECT * 
FROM aggregated_batting_data b
ORDER BY b.runs DESC;
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

| | striker | batting_innings | striker_name | role_desc | batting_hand | runs |
|---|---|---|---|---|---|---|
| 0 | 21 | 143 | SK Raina | Player | Left-hand bat | 4106 |
| 1 | 8 | 131 | V Kohli | Captain | Right-hand bat | 4105 |
| 2 | 57 | 137 | RG Sharma | Captain | Right-hand bat | 3874 |
| 3 | 40 | 130 | G Gambhir | Player | Left-hand bat | 3634 |
| 4 | 162 | 91 | CH Gayle | Player | Left-hand bat | 3431 |
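
In the query above, `player_match` is joined on `match_id` alone, so each ball is paired with every player who took part in that match. The per-ball `GROUP BY` then collapses those copies, which keeps the run totals intact but leaves `role_desc` to be picked from an arbitrary player. The sketch below makes the fan-out visible on a hypothetical toy database (the rows are made up; only the table and column names follow the notebook) and compares it with a join that also matches `pm.player_id` to the striker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ball_by_ball (match_id INT, over_id INT, ball_id INT, innings_no INT, striker INT);
    CREATE TABLE batsman_scored (match_id INT, over_id INT, ball_id INT, innings_no INT, runs_scored INT);
    CREATE TABLE player_match (match_id INT, player_id INT, role_id INT);
    CREATE TABLE rolee (role_id INT, role_desc TEXT);

    -- one match: player 1 faces two balls; player 2 is the captain
    INSERT INTO ball_by_ball VALUES (1, 1, 1, 1, 1), (1, 1, 2, 1, 1);
    INSERT INTO batsman_scored VALUES (1, 1, 1, 1, 4), (1, 1, 2, 1, 6);
    INSERT INTO player_match VALUES (1, 1, 3), (1, 2, 1);
    INSERT INTO rolee VALUES (1, 'Captain'), (3, 'Player');
""")

base = """
    SELECT byb.striker, re.role_desc, SUM(bsco.runs_scored) AS runs
    FROM ball_by_ball byb
    JOIN batsman_scored bsco ON byb.match_id = bsco.match_id
                            AND byb.over_id = bsco.over_id
                            AND byb.ball_id = bsco.ball_id
                            AND byb.innings_no = bsco.innings_no
    JOIN player_match pm ON {pm_condition}
    JOIN rolee re ON pm.role_id = re.role_id
    GROUP BY byb.striker, re.role_desc
    ORDER BY re.role_desc
"""

# Join on the match only: the striker is paired with every player's role.
match_only = conn.execute(
    base.format(pm_condition="pm.match_id = byb.match_id")).fetchall()

# Join on match and player: only the striker's own role survives.
match_and_player = conn.execute(base.format(
    pm_condition="pm.match_id = byb.match_id AND pm.player_id = byb.striker")).fetchall()

print(match_only)
print(match_and_player)
```

A player's role can still differ from match to match, so a career-level `role_desc` would need its own rule, for example the most frequent role.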


## 4. Query 3: Fifties and Hundreds

Write and execute a SQL query to calculate the number of fifties and hundreds scored by each batsman.


In [6]:
# query to calculate the number of fifties and hundreds per batsman
query = """
-- CTE to calculate runs & determine 30s, 50s and 100s
WITH runs_and_milestones AS (
    SELECT 
        striker,
        bs.match_id,
        player_name,
        byb.innings_no, 
        SUM(runs_scored) AS runs,
        -- calculate milestones: thirties, fifties, and hundreds
        -- (each threshold is inclusive, so a hundred also counts as a fifty and a thirty)
        (CASE WHEN SUM(runs_scored) >= 30 THEN 1 ELSE 0 END) AS thirties,
        (CASE WHEN SUM(runs_scored) >= 50 THEN 1 ELSE 0 END) AS fifties,
        (CASE WHEN SUM(runs_scored) >= 100 THEN 1 ELSE 0 END) AS hundreds
    FROM ball_by_ball byb
    -- join with batsman_scored to get the runs scored in each ball
    JOIN batsman_scored bs ON byb.match_id = bs.match_id 
                              AND byb.over_id = bs.over_id 
                              AND byb.ball_id = bs.ball_id 
                              AND byb.innings_no = bs.innings_no
    -- join with player to get the player's name
    JOIN player pl ON byb.striker = pl.player_id
    GROUP BY bs.match_id, player_name
    ORDER BY byb.innings_no
),
-- CTE to sum up fifties and hundreds per player (thirties are computed above but not reported)
aggregated_milestones AS (
    SELECT 
        b.striker,
        b.player_name,
        SUM(b.fifties) AS fifties,
        SUM(b.hundreds) AS hundreds
    FROM runs_and_milestones b
    GROUP BY b.player_name
)

-- main query
SELECT * 
FROM aggregated_milestones
ORDER BY striker;
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

| | striker | player_name | fifties | hundreds |
|---|---|---|---|---|
| 0 | 1 | SC Ganguly | 7 | 0 |
| 1 | 2 | BB McCullum | 13 | 2 |
| 2 | 3 | RT Ponting | 0 | 0 |
| 3 | 4 | DJ Hussey | 5 | 0 |
| 4 | 5 | Mohammad Hafeez | 0 | 0 |
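
Because the `CASE` thresholds in the query are inclusive, every score of 100 or more is also counted as a fifty. If fifties are meant in the usual scorecard sense (50 to 99), the condition can be bounded. The sketch below shows both variants side by side on a hypothetical `innings_runs` table with four made-up scores; it is an illustration, not a claim about which convention the author intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical per-innings totals for a single batsman
    CREATE TABLE innings_runs (striker INT, match_id INT, runs INT);
    INSERT INTO innings_runs VALUES (1, 1, 45), (1, 2, 62), (1, 3, 120), (1, 4, 50);
""")

row = conn.execute("""
    SELECT
        striker,
        -- inclusive threshold, as in the query above
        SUM(CASE WHEN runs >= 50 THEN 1 ELSE 0 END) AS fifties_incl_hundreds,
        -- bounded threshold: 50 to 99 only
        SUM(CASE WHEN runs BETWEEN 50 AND 99 THEN 1 ELSE 0 END) AS fifties_only,
        SUM(CASE WHEN runs >= 100 THEN 1 ELSE 0 END) AS hundreds
    FROM innings_runs
    GROUP BY striker
""").fetchone()

print(row)  # (1, 3, 2, 1)
```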


## 5. Query 4: Best Bowling Figures

Write and execute a SQL query to find the best bowling figures for each bowler.


In [7]:
# query to find the best bowling figures for each bowler
query = """
-- CTE to calculate wickets per bowler
WITH wickets_data AS (
    SELECT 
        byb.match_id,
        bowler,
        COUNT(byb.ball_id) AS wickets
    FROM ball_by_ball byb
    -- join with wicket_taken to count the wickets
    JOIN wicket_taken wkt ON byb.match_id = wkt.match_id 
                            AND byb.over_id = wkt.over_id 
                            AND byb.ball_id = wkt.ball_id 
                            AND byb.innings_no = wkt.innings_no
    GROUP BY byb.match_id, bowler
),
-- CTE to calculate the runs given per bowler
runs_data AS (
    SELECT 
        byb.match_id,
        bowler,
        SUM(runs_scored) AS runs_given
    FROM ball_by_ball byb
    -- join with batsman_scored to get the runs given by the bowler
    JOIN batsman_scored bs ON byb.match_id = bs.match_id 
                            AND byb.over_id = bs.over_id 
                            AND byb.ball_id = bs.ball_id 
                            AND byb.innings_no = bs.innings_no
    GROUP BY bs.match_id, bowler
),
-- CTE to combine wickets and runs for each bowler
combined_data AS (
    SELECT 
        wt.match_id,
        wt.bowler,
        wt.wickets,
        rt.runs_given
    FROM wickets_data wt
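    -- caveat: joining on match_id alone pairs a bowler's wickets with the runs of
    -- every bowler in that match; also matching rt.bowler = wt.bowler would keep
    -- each bowler's own runs
    JOIN runs_data rt ON rt.match_id = wt.match_id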
)

-- main query
SELECT 
    a.match_id,
    a.bowler,
    MAX(a.wickets) AS wickets,
    a.runs_given,
    MAX(a.wickets) || '-' || a.runs_given AS Best_Bowling_Figure
FROM combined_data a
GROUP BY a.bowler
ORDER BY a.wickets DESC;
"""

# execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# display the dataframe first rows
display(df.head())

| | match_id | bowler | wickets | runs_given | Best_Bowling_Figure |
|---|---|---|---|---|---|
| 0 | 980984 | 430 | 6 | 10 | 6-10 |
| 1 | 598061 | 362 | 6 | 4 | 6-4 |
| 2 | 980968 | 334 | 6 | 8 | 6-8 |
| 3 | 336010 | 102 | 6 | 2 | 6-2 |
| 4 | 729308 | 364 | 5 | 10 | 5-10 |
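
In the query above, `combined_data` joins wickets to runs on `match_id` only, so a bowler's wickets can end up next to another bowler's runs from the same match. A possible alternative is sketched below, under two assumptions that are not in the original: the join also matches on `bowler`, and "best" means most wickets with ties broken by fewest runs. The two tables are hypothetical stand-ins for the `wickets_data` and `runs_data` CTEs, filled with made-up numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical per-match totals standing in for the two CTEs
    CREATE TABLE wickets_data (match_id INT, bowler INT, wickets INT);
    CREATE TABLE runs_data (match_id INT, bowler INT, runs_given INT);
    INSERT INTO wickets_data VALUES (1, 10, 3), (1, 11, 1), (2, 10, 3), (2, 11, 2);
    INSERT INTO runs_data VALUES (1, 10, 25), (1, 11, 30), (2, 10, 18), (2, 11, 40);
""")

# Joining on match_id alone pairs each bowler with every bowler's runs in that match.
n_match_only = conn.execute("""
    SELECT COUNT(*)
    FROM wickets_data wt
    JOIN runs_data rt ON rt.match_id = wt.match_id
""").fetchone()[0]

best = conn.execute("""
    WITH combined_data AS (
        SELECT wt.match_id, wt.bowler, wt.wickets, rt.runs_given
        FROM wickets_data wt
        JOIN runs_data rt ON rt.match_id = wt.match_id
                         AND rt.bowler = wt.bowler
    )
    SELECT c.bowler, c.match_id, c.wickets || '-' || c.runs_given AS best_bowling_figure
    FROM combined_data c
    -- keep a row only if the same bowler has no better spell:
    -- more wickets, or equal wickets for fewer runs
    WHERE NOT EXISTS (
        SELECT 1
        FROM combined_data o
        WHERE o.bowler = c.bowler
          AND (o.wickets > c.wickets
               OR (o.wickets = c.wickets AND o.runs_given < c.runs_given))
    )
    ORDER BY c.bowler
""").fetchall()

print(n_match_only)  # 8 rows from only 4 bowler-match pairs
print(best)
```

If a bowler has two equally good spells, this version returns both rows. The runs here, as in the original query, come from `batsman_scored` only, so extras are not included.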


## 6. Query 5: Comprehensive Career Metrics

Combine all the previous chunks into a single comprehensive query to get detailed career metrics for players.


In [8]:
# query to combine batting totals with fifties and hundreds into one result
query = """
WITH

----------------------------- query to calculate total runs by each batsman
-- CTE to get the detailed data for each ball
detailed_batting_data AS (
    SELECT
        byb.match_id AS match_id,
        byb.over_id AS over_id,
        byb.ball_id AS ball_id,
        byb.innings_no AS innings_no,
        byb.team_batting AS team_batting,
        striker,
        non_striker,
        bowler,
        role_desc,
        bs.batting_hand AS batting_hand,
        pl.player_name AS striker_name,
        runs_scored AS runs
    FROM ball_by_ball byb
    -- join the tables to get the batting hand, player names, etc.
    JOIN batsman_scored bsco ON byb.ball_id = bsco.ball_id 
                              AND byb.match_id = bsco.match_id 
                              AND byb.over_id = bsco.over_id 
                              AND byb.innings_no = bsco.innings_no
    -- caveat: matches every player in the match, so role_desc is not tied to the striker
    JOIN player_match pm ON bsco.match_id = pm.match_id
    JOIN player pl ON byb.striker = pl.player_id
    JOIN rolee re ON pm.role_id = re.role_id
    JOIN batting_Style bs ON pl.batting_hand = bs.batting_id
    GROUP BY striker, byb.match_id, byb.over_id, byb.ball_id
    ORDER BY bsco.innings_no ASC
),
-- CTE to calculate the required aggregates
aggregated_batting_data AS (
    SELECT
        a.striker,
        COUNT(DISTINCT a.match_id) AS batting_innings,
        a.striker_name,
        a.role_desc,
        a.batting_hand,
        SUM(a.runs) AS runs
    FROM detailed_batting_data a
    GROUP BY a.striker_name
),


----------------------------- query to calculate the number of fifties and hundreds
-- CTE to calculate runs & determine 30s, 50s and 100s
runs_and_milestones AS (
    SELECT 
        striker,
        bs.match_id,
        player_name,
        byb.innings_no, 
        SUM(runs_scored) AS runs,
        -- calculate milestones: thirties, fifties, and hundreds
        (CASE WHEN SUM(runs_scored) >= 30 THEN 1 ELSE 0 END) AS thirties,
        (CASE WHEN SUM(runs_scored) >= 50 THEN 1 ELSE 0 END) AS fifties,
        (CASE WHEN SUM(runs_scored) >= 100 THEN 1 ELSE 0 END) AS hundreds
    FROM ball_by_ball byb
    -- join with batsman_scored to get the runs scored in each ball
    JOIN batsman_scored bs ON byb.match_id = bs.match_id 
                              AND byb.over_id = bs.over_id 
                              AND byb.ball_id = bs.ball_id 
                              AND byb.innings_no = bs.innings_no
    -- join with player to get the player's name
    JOIN player pl ON byb.striker = pl.player_id
    GROUP BY bs.match_id, player_name
    ORDER BY byb.innings_no
),
-- CTE to sum up fifties and hundreds per player (thirties are computed above but not reported)
aggregated_milestones AS (
    SELECT 
        b.striker,
        b.player_name,
        SUM(b.fifties) AS fifties,
        SUM(b.hundreds) AS hundreds
    FROM runs_and_milestones b
    GROUP BY b.player_name
)


----------------------------- combine the output
SELECT
    b.striker,
    b.batting_innings,
    b.striker_name AS player_name,
--    b.bowler AS player_name,
    b.role_desc,
    b.batting_hand,
    b.runs,
    m.fifties,
    m.hundreds
FROM aggregated_batting_data b
INNER JOIN aggregated_milestones m ON b.striker_name = m.player_name

ORDER BY b.striker;  
"""

# Execute the query and store result in dataframe
df = pd.read_sql(query, conn)

# Display the dataframe first rows
display(df.head())


| | striker | batting_innings | player_name | role_desc | batting_hand | runs | fifties | hundreds |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | SC Ganguly | Captain | Left-hand bat | 1349 | 7 | 0 |
| 1 | 2 | 92 | BB McCullum | Captain | Right-hand bat | 2435 | 13 | 2 |
| 2 | 3 | 9 | RT Ponting | Captain | Right-hand bat | 91 | 0 | 0 |
| 3 | 4 | 61 | DJ Hussey | Captain | Right-hand bat | 1322 | 5 | 0 |
| 4 | 5 | 8 | Mohammad Hafeez | Captain | Right-hand bat | 64 | 0 | 0 |
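
The combined query relies on one structural rule: a single `WITH` keyword, the CTEs separated by commas, and one final `SELECT` that joins them. A stripped-down sketch of that shape is below, on a hypothetical `innings_scores` table with made-up rows. Unlike the query above, it joins on the player id instead of the name and uses a `LEFT JOIN` with `COALESCE`, so a player with no milestone row still appears with a zero; both are design choices of this sketch, not the author's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical per-innings scores, one row per player per match
    CREATE TABLE innings_scores (player_id INT, match_id INT, runs INT);
    INSERT INTO innings_scores VALUES (1, 1, 60), (1, 2, 20), (2, 1, 10), (2, 2, 15);
""")

rows = conn.execute("""
    WITH
    -- first chunk: career totals
    totals AS (
        SELECT player_id, COUNT(*) AS innings, SUM(runs) AS runs
        FROM innings_scores
        GROUP BY player_id
    ),
    -- second chunk: milestones (players without a fifty get no row here)
    milestones AS (
        SELECT player_id, COUNT(*) AS fifties
        FROM innings_scores
        WHERE runs >= 50
        GROUP BY player_id
    )
    -- combine the chunks on the shared key
    SELECT t.player_id, t.innings, t.runs, COALESCE(m.fifties, 0) AS fifties
    FROM totals t
    LEFT JOIN milestones m ON m.player_id = t.player_id
    ORDER BY t.player_id
""").fetchall()

print(rows)  # [(1, 2, 80, 1), (2, 2, 25, 0)]
```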


I haven't the slightest clue how the bowler relates to the player.
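
One reading worth checking against the ERD: the queries above already treat `ball_by_ball.striker` as a `player_id`, and `bowler` plausibly stores a `player_id` in the same way. Under that assumption, the `player` table can simply be joined twice under different aliases. The sketch below uses a hypothetical two-player toy database with made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- made-up players and a single delivery
    CREATE TABLE player (player_id INT, player_name TEXT);
    CREATE TABLE ball_by_ball (match_id INT, over_id INT, ball_id INT,
                               innings_no INT, striker INT, bowler INT);
    INSERT INTO player VALUES (1, 'A Batter'), (2, 'B Bowler');
    INSERT INTO ball_by_ball VALUES (1, 1, 1, 1, 1, 2);
""")

rows = conn.execute("""
    SELECT s.player_name AS striker_name,
           b.player_name AS bowler_name
    FROM ball_by_ball byb
    -- same table, two aliases: one lookup per role on the ball
    JOIN player s ON byb.striker = s.player_id
    JOIN player b ON byb.bowler = b.player_id
""").fetchall()

print(rows)  # [('A Batter', 'B Bowler')]
```

If the assumption holds, a per-bowler aggregate keyed on `bowler` could be joined to the batting aggregate on the player id to add bowling columns to the career metrics.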