<h3> Data Analysis with SQL </h3>

<p> Using the SQLite3 integration in Jupyter Notebook, we perform further data analysis with SQL using the additional_info file, which contains more information about the players. The main advantage of SQL for this analysis is its simple query syntax compared to Python, while Python is better suited to data wrangling. The following cells install the ipython-sql extension, connect to a SQLite database, and load the dataframes into it. </p>

In [1]:
import sqlite3
import pandas as pd

In [None]:
!pip install ipython-sql

In [2]:
cnn = sqlite3.connect('nba_player_analysis.db')

In [4]:
merged=pd.read_csv("./data/player_shooting.csv")

In [5]:
additional_info=pd.read_csv("./data/additional_info.csv")

In [None]:
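# Use plain table names so later queries can reference player_shooting and additional_info
merged.to_sql('player_shooting', cnn, if_exists='replace')
additional_info.to_sql('additional_info', cnn, if_exists='replace')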

In [3]:
%load_ext sql 
%sql sqlite:///nba_player_analysis.db
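
Because the later cells query tables named player_shooting and additional_info, it can help to confirm which table names actually exist in the database. The following is a minimal sketch using only the standard `sqlite3` module. The in-memory database and the helper name `list_tables` are illustrative assumptions; in the notebook, the existing `cnn` connection would be passed instead.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database with two empty tables (hypothetical columns).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE player_shooting (name TEXT, n INTEGER)")
demo.execute("CREATE TABLE additional_info (name TEXT, college TEXT)")

tables = list_tables(demo)
print(tables)
```

If a table shows up under an unexpected name, such as a file path, the queries that follow will fail with a "no such table" error.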

In [4]:
%%sql 

CREATE temporary TABLE player_data (
name varchar(30),
num_shots int,
real_height float,
release_time float,
release_vel float,
max_vel float, 
left_to_right float,
right_to_left float,
arc float,
raw_arc float,
`release` varchar(100),
`jump_dist` varchar(100),
height int, 
career_fg float,
career_fg_3 float,
college varchar(100),
draft_team varchar(100),
position varchar(100),
hand varchar(100),
weight varchar(50),
`status` int
);

 * sqlite:///nba_player_analysis.db
Done.


[]

#### Quick Data Cleaning:
- Renaming and selecting necessary columns for analysis
- Joining the shooting data with additional_info table, which contains further details about players

In [7]:
%%sql
INSERT INTO player_data 
WITH p_stats AS (
	SELECT name, n AS num_shots, ROUND(hght, 1) as real_height, rt AS release_time, rv AS release_vel, mxv AS max_vel, lr1t AS left_to_right, rl1t AS right_to_left, plr AS arc, arc_angle AS raw_arc, `Release` AS `release` , jump_dist, approx_h AS height, `status`
    FROM player_shooting
)
SELECT p.name, num_shots, real_height, release_time, release_vel, max_vel, left_to_right, right_to_left, arc, raw_arc, `release`, jump_dist, height, career_fg, career_fg_3, college, draft_team, 
position, hand, weight, `status` FROM p_stats p
INNER JOIN (SELECT name, `career_FG%` AS `career_fg`, `career_FG3%` AS `career_fg_3`, college, draft_team, `position`, shoots AS hand, weight FROM additional_info) ai
ON p.name = ai.name;

UPDATE player_data 
SET 
    weight = REPLACE(weight, 'lb', '');

-- SQLite has no ALTER COLUMN clause, so the declared type of weight cannot be changed in place.
-- The column keeps its varchar declaration; the 'lb' suffix has already been stripped above.

 * sqlite:///nba_player_analysis.db
370 rows affected.
370 rows affected.
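
SQLite's ALTER TABLE does not support ALTER COLUMN, so the type of weight cannot be changed in place. One common workaround is to copy the rows into a new table whose column is declared as an integer, casting the values on the way. The sketch below illustrates this on an in-memory database with two made-up rows; the table name `player_data_new` is a hypothetical placeholder, and a regular table is used here instead of the temporary one from the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_data (name TEXT, weight VARCHAR(50))")
conn.executemany(
    "INSERT INTO player_data VALUES (?, ?)",
    [("Player A", "185lb"), ("Player B", "220lb")],  # made-up rows
)

# Build a replacement table whose weight column is declared INTEGER,
# stripping the 'lb' suffix and casting while the rows are copied.
conn.executescript(
    """
    CREATE TABLE player_data_new (name TEXT, weight INTEGER);
    INSERT INTO player_data_new
        SELECT name, CAST(REPLACE(weight, 'lb', '') AS INTEGER)
        FROM player_data;
    DROP TABLE player_data;
    ALTER TABLE player_data_new RENAME TO player_data;
    """
)

rows = conn.execute(
    "SELECT name, weight, typeof(weight) FROM player_data ORDER BY name"
).fetchall()
print(rows)
```

A lighter alternative is to leave the table as it is and write `CAST(weight AS INTEGER)` in each query that compares weights numerically.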


### Analysis

#### 1. Ranking players by career three-point percentage and shot arc (top 10 shooters)

In [32]:
%%sql 

WITH ranks AS (
SELECT name, career_fg_3, raw_arc, draft_team,
RANK() OVER (ORDER BY career_fg_3 DESC) AS fg_rank,
RANK() OVER (ORDER BY raw_arc DESC) AS arc_rank
FROM (SELECT * FROM player_data WHERE `status` = 1) p )
SELECT name, fg_rank, arc_rank, ROUND(CAST(arc_rank AS float)/185, 2) AS top_percent FROM ranks
LIMIT 10;

 * sqlite:///nba_player_analysis.db
Done.


name,fg_rank,arc_rank,top_percent
Stephen Curry,1,78,0.42
Kyle Korver,2,51,0.28
Klay Thompson,3,87,0.47
Anthony Morrow,4,86,0.46
Matt Bonner,5,162,0.88
Joe Ingles,6,42,0.23
Jose Calderon,7,84,0.45
Mike Miller,7,119,0.64
Danny Green,9,29,0.16
CJ McCollum,10,152,0.82
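
The query above divides the arc rank by a hard-coded 185. As a hedged alternative, the player count can be computed inside the query with `COUNT(*) OVER ()`, so the percentage stays correct if the filter changes. The sketch below uses an in-memory database with four made-up players and assumes a SQLite build with window function support (3.25 or newer). The column name `n_players` is an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_data (name TEXT, career_fg_3 REAL, raw_arc REAL)")
conn.executemany(
    "INSERT INTO player_data VALUES (?, ?, ?)",
    [("A", 43.0, 45.0), ("B", 41.0, 50.0), ("C", 40.0, 47.0), ("D", 38.0, 52.0)],
)

query = """
WITH ranks AS (
    SELECT name,
           RANK() OVER (ORDER BY career_fg_3 DESC) AS fg_rank,
           RANK() OVER (ORDER BY raw_arc DESC)     AS arc_rank,
           COUNT(*) OVER ()                        AS n_players
    FROM player_data
)
SELECT name, fg_rank, arc_rank,
       ROUND(CAST(arc_rank AS REAL) / n_players, 2) AS top_percent
FROM ranks
ORDER BY fg_rank  -- explicit ordering so LIMIT keeps the best shooters
LIMIT 10;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The explicit `ORDER BY fg_rank` also makes the top-10 selection independent of the order in which SQLite happens to return the rows.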


#### 2. Comparing the arc, release time and release velocity of left- and right-handed players

In [12]:
%%sql

SELECT 
    AVG(arc) AS avg_arc,
    AVG(release_time) AS avg_rt,
    AVG(release_vel) AS avg_rv,
    hand
FROM
    player_data
GROUP BY hand;

 * sqlite:///nba_player_analysis.db
Done.


avg_arc,avg_rt,avg_rv,hand
1.1007916013086256,0.4214642857142858,14.148099023169278,Left
1.102327995460179,0.4248734939759037,14.160897556877202,Right


#### 3. Average field goal percentage by handedness and shoot pocket (top 3 averages)

In [13]:
%%sql

SELECT 
    hand,
    CASE
        WHEN left_to_right > right_to_left THEN 'R'
        WHEN left_to_right < right_to_left THEN 'L'
        WHEN left_to_right = 0 AND right_to_left = 0 THEN 'S'
    END AS shoot_pocket,
    ROUND(AVG(career_fg), 2) AS avg_fg
FROM
    player_data
GROUP BY hand , shoot_pocket
ORDER BY avg_fg DESC
LIMIT 3;

 * sqlite:///nba_player_analysis.db
Done.


hand,shoot_pocket,avg_fg
Left,,46.21
Right,,43.8
Left,L,43.74


#### 4. Average metrics for each weight class (> 200 lb: Very Heavy, < 170 lb: Light, in between: Normal)

In [14]:
%%sql

SELECT 
    AVG(arc) AS avg_arc,
    AVG(career_fg) AS avg_fg,
    AVG(release_time) AS avg_rt,
    CASE
        WHEN 170 < weight AND weight < 200 THEN 'Normal'
        WHEN weight < 170 THEN 'Light'
        WHEN weight > 200 THEN 'Very Heavy'
        ELSE 'Light'
    END AS weight_class
FROM
    player_data
GROUP BY weight_class;

 * sqlite:///nba_player_analysis.db
Done.


avg_arc,avg_fg,avg_rt,weight_class
1.09562194816986,42.5,0.4247,Light
1.1092179639137765,43.255555555555546,0.4223555555555557,Normal
1.1004032672170223,43.75439999999998,0.4254416666666669,Very Heavy
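
The CASE in this query uses strict inequalities, so a weight of exactly 170 or 200 matches no WHEN branch and falls through to `ELSE 'Light'`. The sketch below shows one possible variant, under the assumption that both boundary values should count as "Normal" and that weight is stored as an integer. It runs on an in-memory table with made-up weights.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_data (weight INTEGER)")
conn.executemany(
    "INSERT INTO player_data VALUES (?)",
    [(165,), (170,), (185,), (200,), (230,)],  # made-up weights in lb
)

classes = conn.execute(
    """
    SELECT weight,
           CASE
               WHEN weight < 170 THEN 'Light'
               WHEN weight <= 200 THEN 'Normal'      -- covers 170 through 200
               WHEN weight > 200 THEN 'Very Heavy'
           END AS weight_class                        -- NULL weight stays unlabeled
    FROM player_data
    ORDER BY weight
    """
).fetchall()
print(classes)
```

Leaving out the ELSE branch keeps rows with a missing weight out of the "Light" group.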


#### 5. Average metrics for height classes 

In [15]:
%%sql

SELECT 
    AVG(arc), AVG(career_fg_3), AVG(release_time), real_height
FROM
    player_data
WHERE
    real_height != 0
GROUP BY real_height;

 * sqlite:///nba_player_analysis.db
Done.


AVG(arc),AVG(career_fg_3),AVG(release_time),real_height
1.11491373979176,36.0,0.4985,5.7
1.134196596654705,35.1,0.5189999999999999,5.8
1.0859221454547026,37.0,0.41875,5.9
1.12704485157724,36.6,0.4570000000000001,6.0
1.1447571865424655,35.775,0.4314583333333334,6.1
1.1060590827811636,35.70571428571429,0.413,6.2
1.073468199426431,36.28750000000001,0.3965625,6.3
1.0786323094899295,35.19444444444444,0.4201714285714286,6.4
1.0833819638954127,34.64285714285713,0.417190476190476,6.5
1.0857139014996433,36.755,0.4318205128205127,6.6


#### 6. Does jump distance lead to higher FG percentage?

In [8]:
%%sql

WITH  t_ AS (SELECT *,
DENSE_RANK() OVER(PARTITION BY position ORDER BY jump_dist) AS j_rank FROM player_data),
t__ AS (SELECT career_fg, j_rank, position,
LAG(career_fg, 1) OVER (PARTITION BY position ORDER BY j_rank DESC) AS `prev`
FROM t_)
SELECT DISTINCT position, j_rank, IIF(career_fg - `prev` > 0, 'Better', 'Not Better') AS better_than_prev
FROM t__;

 * sqlite:///nba_player_analysis.db
Done.


position,j_rank,better_than_prev
Center,1,Not Better
Center and Power Forward,2,Not Better
Center and Power Forward,2,Better
Center and Power Forward,1,Not Better
Point Guard,3,Not Better
Point Guard,3,Better
Point Guard,2,Not Better
Point Guard,2,Better
Point Guard,1,Better
Point Guard,1,Not Better


#### 7. Does release time affect three-point performance?

In [17]:
%%sql

SELECT 
    AVG(career_fg_3) AS three_pt_fg, `release`
FROM
    player_data
GROUP BY `release`;

 * sqlite:///nba_player_analysis.db
Done.


three_pt_fg,release
36.04117647058824,1
35.9058510638298,2
34.85773195876288,3


#### 8. Which college produces the best scorers?  

In [18]:
%%sql

SELECT 
    college, AVG(career_fg) AS fg
FROM
    player_data
GROUP BY college
ORDER BY fg DESC;

 * sqlite:///nba_player_analysis.db
Done.


college,fg
San Diego State University,49.5
Louisiana Tech University,49.1
Gonzaga University,47.8
Davidson College,47.7
Virginia Commonwealth University,47.2
University of South Carolina,46.7
University of Virginia,46.4
University of Illinois at Urbana-Champaign,46.2
Washington State University,45.9
Lehigh University,45.5


#### 9. Comparing all metrics between players that made and missed the shot.

REMARK: Every even-numbered row holds the made-shot metrics and every odd-numbered row holds the missed-shot metrics.

In [9]:
%%sql

WITH tt as (
SELECT *,
LEAD(release_time,1) OVER lead_window AS rt_win,
LEAD(release_vel,1) OVER lead_window AS rv_win,
LEAD(max_vel,1) OVER lead_window AS mv_win,
LEAD(left_to_right,1) OVER lead_window AS lr_win,
LEAD(right_to_left,1) OVER lead_window AS rl_win,
LEAD(arc,1) OVER lead_window AS arc_win
FROM player_data 
WINDOW lead_window AS (partition by name))
SELECT * FROM (SELECT name, ROUND(release_time - rt_win, 1) AS rt_diff,
ROUND(release_vel - rv_win, 1) AS rv_diff,
ROUND(max_vel - mv_win, 1) AS mv_diff,
ROUND(left_to_right - lr_win, 1) AS lr_diff,
ROUND(right_to_left - rl_win, 1) AS rl_diff,
ROUND(arc - arc_win, 1) AS arc_diff
FROM tt) tt_
WHERE rt_diff IS NOT NULL;

 * sqlite:///nba_player_analysis.db
Done.


name,rt_diff,rv_diff,mv_diff,lr_diff,rl_diff,arc_diff
Aaron Brooks,0.0,0.1,-2.0,,-0.0,-0.0
Al-Farouq Aminu,0.0,-0.5,1.5,0.0,0.0,0.0
Alan Anderson,0.2,0.7,0.9,,0.2,0.0
Andre Iguodala,0.1,-2.2,-1.4,0.1,0.1,-0.0
Anthony Morrow,0.0,-0.1,-0.8,,-0.0,-0.0
Anthony Tolliver,0.0,-0.3,-0.2,-0.3,-0.2,0.0
Arron Afflalo,-0.0,0.1,1.2,-0.0,-0.1,0.0
Austin Rivers,-0.0,-0.7,0.1,,-0.0,-0.0
Avery Bradley,0.0,-0.7,0.5,,-0.0,0.0
Ben McLemore,-0.1,-0.5,1.4,-0.1,-0.1,-0.1
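
The window in this query partitions by name without an ORDER BY, so which of a player's two rows LEAD treats as the "next" one depends on the order in which SQLite scans the table. A hedged sketch of a more explicit pairing is shown below. It assumes that `status = 1` marks made shots and `status = 0` missed shots, and it uses made-up values in an in-memory table, so each difference is made minus missed. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_data (name TEXT, status INTEGER, release_vel REAL)")
conn.executemany(
    "INSERT INTO player_data VALUES (?, ?, ?)",
    # made-up rows, deliberately inserted in mixed order
    [("A", 0, 14.0), ("A", 1, 14.5), ("B", 1, 13.0), ("B", 0, 13.25)],
)

query = """
WITH paired AS (
    SELECT name, release_vel,
           -- status DESC puts the made row (1) first, so LEAD reads the missed row (0)
           LEAD(release_vel) OVER (PARTITION BY name ORDER BY status DESC) AS rv_next
    FROM player_data
)
SELECT name, ROUND(release_vel - rv_next, 2) AS rv_diff
FROM paired
WHERE rv_next IS NOT NULL
ORDER BY name;
"""
diffs = conn.execute(query).fetchall()
print(diffs)
```

With the ordering fixed, the sign of each difference has a stable meaning: positive values mean the metric was higher on made shots.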


#### 10. Between made and missed shots, how different is the amount players "pull" back at release?

In [20]:
%%sql

SELECT status, AVG(ry) pull_back FROM player_shooting
GROUP BY status
ORDER BY AVG(ry);

 * sqlite:///nba_player_analysis.db
Done.


status,pull_back
1,-0.0116422670219144
0,-0.0112015604845404


## Conclusion

Given the analysis above, we now look at the results to conclude which metrics improve scoring performance for NBA players.

##### #1:
The query lists the ten players with the highest career three-point percentage together with their arc rank among all players. Seven of the ten have an arc rank in the top half (top_percent below 0.50), but only Danny Green (0.16) and Joe Ingles (0.23) are in the top quarter. Hence, the best shooters tend to shoot with a good amount of arc, but not so much that it hurts their accuracy.

##### #2: 
Comparing arc, release time and release velocity for left- and right-handed players, right-handed players are slightly higher on all three averages: a higher arc, a slower release and a higher release velocity. The differences are all below 1%, so they are too small to support strong conclusions. Still, the marginally faster release of left-handed players, such as James Harden, is worth noting.
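
To put the gaps between the two groups in perspective, the averages from the output of query 2 can be turned into relative differences. The sketch below copies the six averages from that output; the variable names are illustrative.

```python
# Averages copied from the output of query 2 (left- vs right-handed players).
left = {
    "avg_arc": 1.1007916013086256,
    "avg_rt": 0.4214642857142858,
    "avg_rv": 14.148099023169278,
}
right = {
    "avg_arc": 1.102327995460179,
    "avg_rt": 0.4248734939759037,
    "avg_rv": 14.160897556877202,
}

# Relative difference of right-handed over left-handed players, in percent.
rel_diff_pct = {
    key: round((right[key] - left[key]) / left[key] * 100, 2) for key in left
}
print(rel_diff_pct)
```

All three relative differences come out positive and below 1%, with release time showing the largest gap.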

##### #3: 
The top 3 average field goal percentages belong to left-handed players with an empty shoot pocket (46.21), right-handed players with an empty shoot pocket (43.8), and left-handed players with a left pocket (43.74). An empty shoot_pocket value means that none of the CASE branches matched, which we read here as having no pocket (a straight shooting hand). Connecting this with #2, this could suggest that the slightly faster release of left-handed players helps scoring, and that either having no pocket or having a pocket on the same side as the dominant hand can improve accuracy.

##### #4: 
Looking at the weight classes, the "Very Heavy" players have the highest field goal percentage (43.75), probably because they take fewer three-point shots. However, the "Normal" class has the lowest release time (0.4224) and the highest arc (1.109). Given #1 and #3, a high arc and a quick release appear to benefit performance. Therefore, being fit while having solid physicality may support your scoring ability.

##### #5: 
After grouping players by height, we calculated the average arc, three-point percentage and release time. No clear trend stands out, except that the ideal height seems to be between 5.9 ft and 6.4 ft when looking at all metrics.

##### #6: 
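Within each position, we tried to see whether a larger jump distance leads to a better field goal percentage. We ranked jump distance within each position and checked whether field goal percentage got better or not as jump distance increased. The results are mixed, with "Better" and "Not Better" appearing at the same ranks, so jump distance does not consistently improve field goal percentage.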

##### #7: 
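Does release time affect 3pt performance? Yes, slightly. The average three-point percentage drops from 36.04 in release category 1 to 34.86 in category 3, so, reading category 1 as the fastest release, a faster release goes with better three-point shooting.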

##### #8: 
This query shows which colleges have the best average scorers. It does not say anything about individual players, but when this data was collected, San Diego State University was the best, with Murray State being the worst.

##### #9: 
This query compares the metrics of "made" shots against those of "missed" shots for each player.
- Surprisingly, there is no real difference in release time and arc between missed and made shots.
- The same holds for the x-translation of the shot.
- The major differences are in release velocity and max velocity.

Hence, the data suggests shooting with a stronger initial flick.

##### #10: 

The table compares whether the "ry" metric (how much players "pull back" on their shot) makes a difference in making the shot. It does not: the two averages differ by less than 0.0005.

To conclude, to be a good scorer in general, you need to:
1. Have a decent arc on your shot.
2. Practice releasing the shot quickly.
3. Shoot straight if right-handed, or shoot straight or from the left pocket if left-handed.
4. Have a slim but muscular build with a height between 5.9 ft and 6.4 ft.
5. Flick strongly when you release.

## References

- https://shootinschool.com/the-formula-to-being-a-great-shooter/ (Qualities that make a good shooter)
- https://www.inpredictable.com/2021/01/nba-player-shooting-motions-data-dump.html (Blog on the dataset used in this project)
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9590067/ (Shooting analysis involving Kinematics)
- https://scholarworks.umt.edu/cgi/viewcontent.cgi?article=6850&context=etd (Shooting study on male basketball players)
- https://www.btsbioengineering.com/wp-content/uploads/2023/05/Struzik-et-al-2014.pdf (Biomechanical Analysis of a Basketball shot)