# NBA Data Analysis for Prediction Model

## Table of Contents <a id="table_of_contents"></a>

0. [Import Libraries](#imports)
1. [Import Data](#import_data)
2. Data Exploration
3. Data Cleaning
4. Data Preparation
5. Benchmark Model
6. Visualization ?

## 0. Importing Libraries <a id="imports"></a>

[Back to top](#table_of_contents)

In [1]:
!pip install opendatasets

In [2]:
pip install pypyodbc

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import sqlite3 as sql
import matplotlib
from matplotlib import pyplot as plt
import opendatasets as od
import os

## 1. Import Data <a id="import_data"></a>

[Back to top](#table_of_contents)

In [4]:
# Download the Kaggle dataset that contains the SQLite database file
dataset = "https://www.kaggle.com/datasets/wyattowalsh/basketball"
od.download(dataset)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds

Dataset URL: https://www.kaggle.com/datasets/wyattowalsh/basketball
Downloading basketball.zip to ./basketball


100%|██████████████████████████████████████████████████████████████████████████████████| 697M/697M [00:26<00:00, 27.7MB/s]





In [5]:
# 1. Connect to the SQLite database
connection = sql.connect('basketball/nba.sqlite')
# Test the connection with a query against the game table
test_query = "SELECT * FROM game"
# Load the data into a pandas DataFrame
df = pd.read_sql_query(test_query, connection)
# Display the DataFrame
print(df)
# Close the connection
connection.close()

      season_id team_id_home team_abbreviation_home           team_name_home  \
0         21946   1610610035                    HUS          Toronto Huskies   
1         21946   1610610034                    BOM        St. Louis Bombers   
2         21946   1610610032                    PRO  Providence Steamrollers   
3         21946   1610610025                    CHS            Chicago Stags   
4         21946   1610610028                    DEF          Detroit Falcons   
...         ...          ...                    ...                      ...   
65693     42022   1610612748                    MIA               Miami Heat   
65694     42022   1610612748                    MIA               Miami Heat   
65695     42022   1610612743                    DEN           Denver Nuggets   
65696     32022   1610616834                    LBN              Team LeBron   
65697     32022   1610616834                    LBN              Team LeBron   

          game_id            game_date 

## 2. Data Exploration

See the [NBA stat glossary](https://www.nba.com/stats/help/glossary) for NBA-specific terms and Wikipedia's [basketball statistics](https://en.wikipedia.org/wiki/Basketball_statistics) page for general acronyms.

Each table is also available as a CSV file, which is convenient for viewing the tables here: /basketball/csv/[table name].csv

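* fgm = field goals made
* fga = field goals attempted
* ftm = free throws made
* fta = free throws attempted
* pts = points
* reb = rebounds
* oreb = offensive rebounds
* dreb = defensive rebounds
* ast = assists
* stl = steals
* blk = blocks
* tov = turnovers
* pf = personal fouls
* td = triple-double
* wl = win/loss
* Columns ending in _home or _away hold the stat for that side, e.g. ft_pct_away = free throw percentage of the away team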

NBA's efficiency rating: (PTS + REB + AST + STL + BLK − ((FGA − FGM) + (FTA − FTM) + TOV))
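
As a quick check of the formula above, here is a minimal Python sketch. The function name `efficiency` and the box-score numbers are made up for illustration; they are not taken from the dataset.

```python
def efficiency(pts, reb, ast, stl, blk, fga, fgm, fta, ftm, tov):
    """Efficiency rating: positive contributions minus missed shots and turnovers."""
    missed_field_goals = fga - fgm
    missed_free_throws = fta - ftm
    return (pts + reb + ast + stl + blk) - (missed_field_goals + missed_free_throws + tov)


# Made-up team box score, for illustration only
example = efficiency(pts=100, reb=40, ast=25, stl=8, blk=5,
                     fga=85, fgm=40, fta=20, ftm=15, tov=12)
print(example)
```

The same arithmetic is what a SQL `UPDATE` over the `_home` and `_away` columns would compute row by row.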

## 3. Data Cleaning

In [6]:
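# 1. Connect to the SQLite database
# Note: CREATE TABLE IF NOT EXISTS leaves an existing game_cleaned table unchanged,
# so drop the table first if the SELECT below has been edited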
explore_connection = sql.connect('basketball/nba.sqlite')

cursor = explore_connection.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS game_cleaned AS
    SELECT team_id_home, 
        team_id_away, 
        team_name_home, 
        team_name_away, 
        game_id, 
        wl_home, 
        fga_home, 
        fg_pct_home, 
        ftm_home, 
        pts_home, 
        fta_home, 
        fgm_home, 
        ftm_away, 
        oreb_home, 
        dreb_home, 
        reb_home, 
        ast_home, 
        stl_home, 
        blk_home, 
        tov_home, 
        pf_home, 
        wl_away, 
        fg_pct_away, 
        ft_pct_away, 
        fta_away, 
        pts_away, 
        fga_away, 
        fgm_away, 
        oreb_away, 
        dreb_away, 
        reb_away, 
        ast_away, 
        stl_away, 
        blk_away, 
        tov_away, 
        pf_away
    FROM game
    WHERE fga_home IS NOT NULL
        AND fta_home IS NOT NULL
        AND ftm_home IS NOT NULL
        AND oreb_home IS NOT NULL
        AND dreb_home IS NOT NULL
        AND stl_home IS NOT NULL
        AND ast_home IS NOT NULL
        AND reb_home IS NOT NULL
        AND blk_home IS NOT NULL
        AND tov_home IS NOT NULL
        AND pf_home IS NOT NULL
        AND pts_home IS NOT NULL
        AND team_id_away IS NOT NULL
        AND team_name_away IS NOT NULL
        AND wl_away IS NOT NULL
        AND fg_pct_away IS NOT NULL
        AND ft_pct_away IS NOT NULL
        AND pts_away IS NOT NULL
        AND fgm_away IS NOT NULL
        AND oreb_away IS NOT NULL
        AND dreb_away IS NOT NULL
        AND reb_away IS NOT NULL
        AND ast_away IS NOT NULL
        AND stl_away IS NOT NULL
        AND blk_away IS NOT NULL
        AND tov_away IS NOT NULL
        AND pf_away IS NOT NULL
        AND fga_away IS NOT NULL;
''')
# Display the cleaned table, then close the connection
display_game_cleaned_query = "SELECT * FROM game_cleaned"
game_cleaned_df = pd.read_sql_query(display_game_cleaned_query, explore_connection)
print(game_cleaned_df)
explore_connection.close()

      team_id_home team_abbreviation_home       team_name_home     game_id  \
0       1610612760                    SEA  Seattle SuperSonics  0047700045   
1       1610612764                    WAS   Washington Bullets  0047700046   
2       1610612764                    WAS   Washington Bullets  0047700047   
3       1610612760                    SEA  Seattle SuperSonics  0047700048   
4       1610612760                    SEA  Seattle SuperSonics  0047700049   
...            ...                    ...                  ...         ...   
46659   1610612748                    MIA           Miami Heat  0042200403   
46660   1610612748                    MIA           Miami Heat  0042200404   
46661   1610612743                    DEN       Denver Nuggets  0042200405   
46662   1610616834                    LBN          Team LeBron  0032200001   
46663   1610616834                    LBN          Team LeBron  0032200001   

      wl_home  fga_home  fg_pct_home  ftm_home  pts_home  fta_h
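
The cleaning query keeps only rows where a long list of columns is not NULL. One way to keep such a filter manageable is to build it from a Python list. The sketch below assumes a shortened, hypothetical column list and a toy in-memory table standing in for `game`; it is not the notebook's actual data.

```python
import sqlite3

# Hypothetical, shortened column list; the real query checks many more columns
required_columns = ["fga_home", "fta_home", "pts_home", "pts_away"]

# Build "col IS NOT NULL AND col IS NOT NULL ..." from the list.
# The names are hard-coded here, so formatting them into the SQL string is safe.
not_null_filter = " AND ".join(f"{column} IS NOT NULL" for column in required_columns)

# Toy in-memory table standing in for the game table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game (game_id TEXT, fga_home REAL, fta_home REAL, pts_home REAL, pts_away REAL)"
)
conn.executemany(
    "INSERT INTO game VALUES (?, ?, ?, ?, ?)",
    [
        ("g1", 85, 20, 100, 95),    # complete row, kept
        ("g2", None, 20, 100, 95),  # missing fga_home, dropped
        ("g3", 80, 18, 90, None),   # missing pts_away, dropped
    ],
)
conn.execute(f"CREATE TABLE game_cleaned AS SELECT * FROM game WHERE {not_null_filter}")
kept_game_ids = [row[0] for row in conn.execute("SELECT game_id FROM game_cleaned")]
conn.close()
print(kept_game_ids)
```

Listing each column once also makes it harder to repeat or forget a condition.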

## 4. Data Preparation / Manipulation

### Calculating the efficiency of the home vs away teams for each game

In [9]:
# 1.Connect to SQL database
add_calculation_connection = sql.connect('basketball/nba.sqlite')
cursor = add_calculation_connection.cursor()
cursor.execute('''
    ALTER TABLE game_cleaned ADD COLUMN eff_home;
''')

cursor.execute('''
    ALTER TABLE game_cleaned ADD COLUMN eff_away;
''')

cursor.execute('''
    UPDATE game_cleaned
    SET eff_home = (
        pts_home + 
        reb_home + 
        ast_home + 
        stl_home + 
        blk_home - 
        ((fga_home - fgm_home) + 
        (fta_home - ftm_home) + 
        tov_home)
    ),
    eff_away = (
        pts_away + 
        reb_away + 
        ast_away + 
        stl_away + 
        blk_away - 
        ((fga_away - fgm_away) + 
        (fta_away - ftm_away) + 
        tov_away)
    )
''')

# Commit so the UPDATE is saved before the connection is closed
add_calculation_connection.commit()

display_game_calculated_query = "SELECT * FROM game_cleaned"
game_calculated_df = pd.read_sql_query(display_game_calculated_query, add_calculation_connection)
print(game_calculated_df)
add_calculation_connection.close()

      team_id_home team_abbreviation_home       team_name_home     game_id  \
0       1610612760                    SEA  Seattle SuperSonics  0047700045   
1       1610612764                    WAS   Washington Bullets  0047700046   
2       1610612764                    WAS   Washington Bullets  0047700047   
3       1610612760                    SEA  Seattle SuperSonics  0047700048   
4       1610612760                    SEA  Seattle SuperSonics  0047700049   
...            ...                    ...                  ...         ...   
46659   1610612748                    MIA           Miami Heat  0042200403   
46660   1610612748                    MIA           Miami Heat  0042200404   
46661   1610612743                    DEN       Denver Nuggets  0042200405   
46662   1610616834                    LBN          Team LeBron  0032200001   
46663   1610616834                    LBN          Team LeBron  0032200001   

      wl_home  fga_home  fg_pct_home  ftm_home  pts_home  fta_h
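
Rerunning the efficiency cell is likely to fail, because SQLite raises an error when `ALTER TABLE ... ADD COLUMN` names a column that already exists. A possible guard, sketched on a toy in-memory table, checks `PRAGMA table_info` first. The helper name `add_column_if_missing` and the `REAL` column type are assumptions made for this example, not part of the notebook.

```python
import sqlite3


def add_column_if_missing(conn, table, column, column_type="REAL"):
    """Add a column only when PRAGMA table_info does not already list it."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk)
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {column_type}")
        return True
    return False


# Toy in-memory table standing in for game_cleaned
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_cleaned (game_id TEXT, pts_home REAL)")
first_call = add_column_if_missing(conn, "game_cleaned", "eff_home")
second_call = add_column_if_missing(conn, "game_cleaned", "eff_home")  # no error on rerun
columns = [row[1] for row in conn.execute("PRAGMA table_info(game_cleaned)")]
conn.close()
print(first_call, second_call, columns)
```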

### Duplicate the team table so that I can have a city for each team in the game_cleaned table

In [None]:
duplicate_team_table_connection = sql.connect('basketball/nba.sqlite')
cursor = duplicate_team_table_connection.cursor()
# IF NOT EXISTS lets this cell be rerun without raising an error
cursor.execute('''
    CREATE TABLE IF NOT EXISTS duplicate_team_table AS
    SELECT * FROM team
''')
duplicate_team_table_connection.close()

### Joining the duplicated table with the cleaned game table and renaming the city and state fields for the away team

In [None]:
join_duplicated_team_w_cleaned_game = sql.connect('basketball/nba.sqlite')
# Join each game to the duplicated team table on the away team's id.
# INNER JOIN drops games whose away team has no row in the team table.
display_game_team_duplicate_joined_query = '''
    SELECT 
        game_cleaned.team_id_home, 
        game_cleaned.team_name_home, 
        game_cleaned.team_name_away, 
        game_cleaned.game_id,
        game_cleaned.eff_home,
        game_cleaned.eff_away,
        duplicate_team_table.state AS state_away,
        duplicate_team_table.city AS city_away
    FROM game_cleaned
    INNER JOIN duplicate_team_table ON game_cleaned.team_id_away = duplicate_team_table.id;
'''
# Run the join itself (not SELECT * FROM game_cleaned) so the new columns appear in the DataFrame
game_team_duplicated_df = pd.read_sql_query(display_game_team_duplicate_joined_query, join_duplicated_team_w_cleaned_game)
print(game_team_duplicated_df)
join_duplicated_team_w_cleaned_game.close()

### Joining the original team table with the cleaned game table and renaming the city and state fields for the home team

In [None]:
join_teams_original_and_games_cleaned_connection = sql.connect('basketball/nba.sqlite')
# game_cleaned itself has no state_away or city_away columns, so both team tables are
# joined here: team for the home side and duplicate_team_table for the away side
display_fully_joined_game_query = '''
    SELECT  
        game_cleaned.team_id_home, 
        game_cleaned.team_id_away, 
        game_cleaned.team_name_home, 
        game_cleaned.team_name_away, 
        game_cleaned.game_id,
        game_cleaned.eff_home,
        game_cleaned.eff_away,
        duplicate_team_table.state AS state_away,
        duplicate_team_table.city AS city_away,
        team.state AS state_home,
        team.city AS city_home
    FROM game_cleaned
    INNER JOIN team ON game_cleaned.team_id_home = team.id
    INNER JOIN duplicate_team_table ON game_cleaned.team_id_away = duplicate_team_table.id;
'''

# Run the join itself so the city and state columns appear in the DataFrame
game_fully_joined_df = pd.read_sql_query(display_fully_joined_game_query, join_teams_original_and_games_cleaned_connection)
print(game_fully_joined_df)
join_teams_original_and_games_cleaned_connection.close()
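
The joins above need the team table twice, once for the home side and once for the away side. As an alternative to copying the table, SQL table aliases let a single query reference `team` twice. The sketch below runs on a toy in-memory database; the ids and rows are made up for illustration, and the alias names `home_team` and `away_team` are arbitrary choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE team (id TEXT, city TEXT, state TEXT);
    CREATE TABLE game_cleaned (game_id TEXT, team_id_home TEXT, team_id_away TEXT);
    -- Toy rows with made-up ids, for illustration only
    INSERT INTO team VALUES ('1', 'Miami', 'Florida'), ('2', 'Denver', 'Colorado');
    INSERT INTO game_cleaned VALUES ('g1', '1', '2');
''')

# The same team table is joined twice under two different aliases
rows = conn.execute('''
    SELECT
        game_cleaned.game_id,
        home_team.city AS city_home,
        home_team.state AS state_home,
        away_team.city AS city_away,
        away_team.state AS state_away
    FROM game_cleaned
    INNER JOIN team AS home_team ON game_cleaned.team_id_home = home_team.id
    INNER JOIN team AS away_team ON game_cleaned.team_id_away = away_team.id
''').fetchall()
conn.close()
print(rows)
```

With aliases there is no second copy of the table to keep in sync if `team` ever changes.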

### Add a column with the timezone of the away team vs the home team.
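
One possible shape for this step, sketched under loud assumptions: the lookup table `UTC_OFFSET_HOURS` is hypothetical, covers only three cities, and uses standard-time offsets, so it ignores daylight saving time. A real version would need an entry for every team city in the data.

```python
# Hypothetical lookup: standard-time UTC offsets in hours for a few team cities.
# A real version would need every team city and would have to handle daylight saving time.
UTC_OFFSET_HOURS = {
    "Miami": -5,
    "Denver": -7,
    "Los Angeles": -8,
}


def timezone_difference(city_home, city_away):
    """Hours the away team's own clock is ahead (+) or behind (-) the host city's clock."""
    return UTC_OFFSET_HOURS[city_away] - UTC_OFFSET_HOURS[city_home]


shift = timezone_difference(city_home="Miami", city_away="Denver")
print(shift)
```

A negative value means the visiting team comes from a city whose clock runs behind the host city's.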

### Add a column with the distance between the two cities.
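
The tables shown so far hold city and state names, not coordinates, so a distance column would first need latitude and longitude for each city from some other source. Assuming those are available, a common choice is the haversine great-circle formula. This is a minimal sketch; the function name `haversine_km` is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Sanity check on points with a known answer: a quarter of the way around the equator
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter_equator, 1))
```

The result is a straight-line distance over the Earth's surface, not actual travel distance.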
