# NBA Performance Effects

UC San Diego: Introduction to Machine Learning<br> 
Winter 2024 | Instructor: Jason Fleischer

## Group Members
- Sohaib Khan
- Simon Zheng
- Steven Xie
- Alexander Tang

## Abstract
This project presents a comprehensive analysis of the factors that contribute to NBA teams’ success using machine learning techniques. By merging and leveraging several datasets, we apply unsupervised learning techniques, including PCA, DBSCAN, and hierarchical clustering, to identify the attributes associated with successful NBA teams. We aim to provide insights into team performance and the factors behind it, such as team management and individual players’ contributions, that can inform evaluation and prediction in this and related sports. Notable findings include identifying the most impactful players across NBA teams from player performance metrics, identifying the most impactful and successful teams over the years, and analyzing specific team metrics such as points scored, rebounds per player and team, and assists. All results come from dimensionality-reduced data clustered with a hyperparameter-tuned DBSCAN model and a hierarchical clustering model.

## Background
In recent years, the intersection of sports and data analytics has become an increasingly significant area of research, offering a wealth of insights into the intricacies of team dynamics, player contributions, and overall performance. As the popularity of sports analytics continues to rise, the application of machine learning techniques to unravel the factors influencing the success of NBA teams has garnered particular attention [1]. This project aims to delve into a comprehensive analysis utilizing unsupervised machine learning methodologies to determine the complex elements contributing to the success of NBA teams.

The overarching goal of our research is to provide valuable insights that go beyond simple retrospective analyses, seeking to understand the underlying factors driving team success and to develop predictive models that can evaluate future outcomes [2]. The evaluations encompass not only team performance but also individual player contributions, offering a holistic perspective on team performance dynamics. The results and models from this project can assist both current and future NBA teams in evaluating their chances of success. Beyond this, the models and techniques have the potential to be adapted for predicting success in other sports, thereby broadening the applicability and impact of the current research in sports analytics [3].

When analyzing data recorded on the NBA, it is important to note that official data records date back many years. This allows for the analysis of a large dataset, which could lead to more concrete findings. In addition, the number of features collected over this timespan is quite extensive (player awards, rankings, individual/team statistics, etc.), allowing for a detailed analysis of which specific features are the most impactful for team success. With all of this data available, it is worthwhile to look for outstanding features or statistics that could offer teams an advantage.

## Problem Statement
Our group will analyze data on various NBA teams to see if there are any outstanding features/attributes (team offensive/defensive metrics, individual player talent metrics, etc.) that lead to overall team success (number of wins, championships, seeding, etc.) with the goal of identifying specific features/attributes that make an NBA team successful.

## NBA Team Datasets
Link: https://okcthunder.app.box.com/s/pwyt8jlcfv4bnqeks2usdtkkgznermk5

Size: four datasets are included:

- team_rebounding_data_22.csv:

Number of observations: 2460<br>
Variables: 7<br>
Detail: offensive rebounding data for each team/game in the 2022 season<br>
Critical variables: team, opposing team, offensive rebounds, game number<br>
Handling: the dataset covers only the 2022 season, which severely limits how it can be combined with the other datasets.<br>


- awards_data.csv:

Number of observations: 4329<br>
Variables: 23<br>
Detail: This dataset has awards data for each player/season combo<br>
Critical variables: All-NBA first, second, and third team; player of the month; player of the week; all ranks<br>
Handling: the dataset requires cleaning null values and duplicated rows, though some of them may carry meaning and should be checked before removal.<br>

- team_stats.csv:

Number of observations: 450<br>
Variables: 11<br>
Detail: This dataset has a few stats for each team/season combo<br>
Critical variables: team wins and losses, number of games<br>
Handling: the dataset is reasonably clean; we need to make the feature names readable and create new features for winning and losing percentage.

- player_stats.csv:

Number of observations: 8492<br>
Variables: 49<br>
Detail: This dataset has a variety of stats and measures for each player/team/season combination.<br>
Critical variables: all shooting percentages and attempts, number of games, minutes, and all playing stats.<br>
Handling: the dataset has many null and duplicated values that require cleaning.<br>


These four datasets mainly cover the teams, players, and other important metrics that determine success. After wrangling and understanding more about the dataset, we will determine the best approach to analysis, machine learning, and evaluation.
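The handling steps described for the datasets above (dropping duplicated rows and nulls, then deriving winning and losing percentage) can be sketched roughly as follows. This is a minimal illustration only: the rows are made up, and the column names `team`, `season`, `W`, and `L` are assumptions rather than the actual schema of team_stats.csv.

```python
import pandas as pd

# Hypothetical rows standing in for team_stats.csv; the column names
# ("team", "season", "W", "L") are assumptions, not the real schema.
team_stats = pd.DataFrame(
    {
        "team": ["AAA", "BBB", "BBB", "CCC"],
        "season": [2020, 2020, 2020, 2020],
        "W": [50, 30, 30, 41],
        "L": [32, 52, 52, 41],
    }
)

# Drop exact duplicate rows, then any rows missing the columns we need.
cleaned = team_stats.drop_duplicates().dropna(subset=["W", "L"]).copy()

# Derive winning and losing percentage from wins and losses.
games = cleaned["W"] + cleaned["L"]
cleaned["win_pct"] = cleaned["W"] / games
cleaned["loss_pct"] = cleaned["L"] / games

print(cleaned)
```

Whether duplicates and nulls should be dropped outright depends on what they represent in each file, so a real pipeline would inspect them first.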

## Proposed Solution
To identify the features that make NBA teams successful, we will first use PCA to perform dimensionality reduction on our dataset, then cluster the NBA teams based on their success-related features. NBA teams' success metrics involve multiple variables (team offensive/defensive metrics, individual player talent metrics, etc.), which can lead to a high-dimensional dataset. Using PCA, we can reduce the dataset’s dimensionality, emphasizing variation and bringing out strong patterns in the data. At the same time, by focusing on the principal components with the most variance, PCA helps filter out noise from less significant variables. Since PCA is sensitive to the variances of the features, we start by standardizing the dataset using StandardScaler from scikit-learn. After that, we use the PCA class in scikit-learn and call its fit_transform() method to get the results.
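The standardize-then-reduce step described above can be sketched as follows. The data here is synthetic (random features on deliberately different scales), and keeping two components is an assumption made for illustration; the project notebooks may use different settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a stats table: 100 rows, 6 numeric features
# on very different scales (this is not the project's real data).
X = rng.normal(size=(100, 6)) * np.array([1, 10, 100, 1, 5, 50])

# Standardize so that no feature dominates PCA purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components (an assumed choice).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance captured by the retained components.
explained = pca.explained_variance_ratio_.sum()
print(f"Variance captured by 2 components: {explained:.2%}")
```

Checking `explained_variance_ratio_` is how figures like "percent of variance captured in the first two components" are typically obtained.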

Using the dimensionality-reduced data, we will apply hierarchical clustering and DBSCAN to group the NBA teams based on their similarities across the principal components. Hierarchical clustering produces a dendrogram that visualizes the clustering process, offering insights into the data's structure; we believe this can help in identifying relationships between teams based on their success metrics. In addition, hierarchical clustering can capture complex relationships between data points, which can be beneficial if the success factors of NBA teams are not linearly separable. We will use the common agglomerative (bottom-up) approach by importing the AgglomerativeClustering class from scikit-learn, initializing the model with AgglomerativeClustering(), and fitting it with clustering.fit(). We will then analyze the clusters to identify common features or attributes within each group and see what makes the most successful cluster of teams stand out from the rest. We will also apply DBSCAN to the dimensionality-reduced data to flag potential outliers. Tuning the hyperparameters of this model makes the outlier detection more or less sensitive, which helps narrow down the potential outliers.
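A rough sketch of the two clustering steps on already-reduced data is shown below. The points are synthetic (two blobs plus one distant point), and the values `n_clusters=3`, `eps=1.0`, and `min_samples=5` are illustrative assumptions, not the tuned values from the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

rng = np.random.default_rng(0)
# Synthetic stand-in for PCA output: two tight blobs plus one far-away point.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
outlier = np.array([[20.0, -20.0]])
X_reduced = np.vstack([blob_a, blob_b, outlier])

# Agglomerative (bottom-up) clustering with Ward linkage.
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = clustering.fit_predict(X_reduced)

# DBSCAN assigns the label -1 to points it treats as noise (outliers).
# A smaller eps or a larger min_samples flags more points as noise.
dbscan = DBSCAN(eps=1.0, min_samples=5)
db_labels = dbscan.fit_predict(X_reduced)
outlier_idx = np.where(db_labels == -1)[0]

print("Agglomerative cluster sizes:", np.bincount(agg_labels))
print("DBSCAN outlier row indices:", outlier_idx)
```

In practice `eps` and `min_samples` would be chosen per dataset, since a setting that isolates a handful of outliers in one table can label most of another as noise.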

## EDA/Data Wrangling & Configuration
Please refer to the EDA.ipynb file

## Data
Please refer to the analysis and modeling.ipynb file

## Evaluation Metrics
To evaluate the performance of both hierarchical clustering and DBSCAN, we will use the Silhouette Score as our evaluation metric. For a single sample, the Silhouette Score is s = (b - a) / max(a, b), where a is the average distance between the sample and all other points in the same cluster and b is the average distance between the sample and all points in the nearest other cluster. The score ranges from -1 to 1 and measures how similar an object is to its own cluster compared to other clusters; a high score indicates that the sample is well matched to its own cluster and poorly matched to neighboring clusters. If the silhouette score for our hierarchical clustering is lower than that of the benchmark model, our hierarchical clustering isn’t performing well on the dataset.
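As a small worked example of the formula above, the sketch below computes the score by hand for one point in a tiny made-up dataset and compares it with scikit-learn's `silhouette_samples`. The four points and their labels are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Tiny hand-made example: two well-separated clusters on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Manual computation for the first sample (x = 0).
a = 1.0                  # mean distance to the other point in its own cluster
b = (10.0 + 11.0) / 2    # mean distance to the points in the nearest other cluster
s_manual = (b - a) / max(a, b)

# scikit-learn's per-sample score for the same point, and the overall mean.
s_sklearn = silhouette_samples(X, labels)[0]
mean_score = silhouette_score(X, labels)

print(f"manual: {s_manual:.4f}, sklearn: {s_sklearn:.4f}, mean: {mean_score:.4f}")
```

One caveat when scoring DBSCAN output this way: scikit-learn treats the noise label -1 as just another cluster, so noise points may need to be filtered out before computing the score.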

## Results

Our results from applying DBSCAN and hierarchical clustering to the dimensionality-reduced data showed numerous outliers in each of the awards, player statistics, team rebounding, and team wins datasets. After PCA, the first two principal components capture only 56% of the variance in the player dataset and 30% in the awards dataset, compared to 82% in the team dataset and 81% in the rebounding dataset.

To compensate for the low explained variance in the player and awards datasets, t-tests were performed to see whether player performance is related to the awards a player wins. With high t-statistics and low p-values in the awards dataset, there was a clear relationship between player awards and performance. The t-tests on player awards and team win percentage showed no significant relationship between the two, suggesting that even though a team may have a good player, the surrounding cast may not be as strong overall.
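A two-sample t-test of this kind might look like the sketch below. The report does not state which variant was used, so Welch's unequal-variance test is an assumption here, and the two groups are synthetic points-per-game values rather than the project's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic points-per-game values for award winners and non-winners
# (made-up numbers, not drawn from the project's datasets).
award_winners = rng.normal(loc=25.0, scale=3.0, size=40)
non_winners = rng.normal(loc=10.0, scale=4.0, size=200)

# Welch's t-test (equal_var=False) does not assume equal group variances.
t_stat, p_value = stats.ttest_ind(award_winners, non_winners, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

A large t-statistic with a small p-value indicates that the group means differ; it does not by itself measure the strength of a correlation.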

Looking at the DBSCAN results, we see that tuning the hyperparameters for each dataset surfaces specific outliers. In the player dataset, Jawun Evans, who played for OKC in the 2018 season, is the smallest outlier, and Russell Westbrook, who played for OKC in the 2016 season, is the largest outlier. This result makes sense given Russell Westbrook’s historic 2016 season, in which he won league MVP, averaged a triple-double, and set the single-season record for triple-doubles, compared to Jawun Evans’s lackluster NBA career.

Looking at the general shape of the awards dataset, we see that the data is shaped like an “L”, with clusters on the vertical and horizontal components of the “L”, as well as a few outliers outside of the shape. Looking at certain players in each cluster, such as Damian Lillard in the vertical cluster as the smallest outlier and LeBron James in the horizontal cluster as the biggest outlier, we can see that this result makes sense: both players have won multiple accolades, but Damian Lillard hasn’t won any championships while LeBron James has won four. In addition, LeBron James has significantly more career wins than Damian Lillard, which further supports the results.

Looking at the rebounding dataset, we see that the game between CHA and MEM on 2022-11-04 is the smallest outlier and the game between CHA and HOU on 2022-10-26 is the largest outlier. Looking at the general shape of the player dataset, we see that a few outliers surround the outside of the general cluster, indicating that those players excel in certain statistical areas (such as points, rebounds, assists, etc.).

Looking at the general shape of the team wins dataset, we see one loose main cluster with a lot of outliers. This makes sense because there are many different ways for a team to win and be successful, which explains the high variance in this dataset. One notable successful team was the 2015-2016 Golden State Warriors, who currently hold the record for most regular-season wins at 73-9. In contrast, one of the biggest outliers among the worst-performing teams was the 2011-2012 New Orleans Hornets, who posted a record of 21-45 in the regular season.

Looking at the hierarchical clustering plot, we again see that multiple players/teams stand out from the main clusters, which suggests that certain statistical metrics do affect overall team success and performance. Based on the hierarchical clustering, there are around three major clusters, each containing teams/players of differing success.

## Discussion

Interpreting the results from our analysis of NBA player and team statistics reveals some critical insights into players’ talents and the competitiveness of teams. We were able to identify a smaller cluster with both our base model (hierarchical clustering) and our stronger model (DBSCAN), distinguishing elite performers (players and teams) from the broader pool. This suggests that despite the complex statistics and metrics that players and teams accumulate over seasons, there is a group of talented players in the NBA whose performance is statistically distinct from the rest. Such consistency across the two models supports our hypothesis that there are indeed stellar players in the NBA.

As seen in the clustering plots from hierarchical clustering and DBSCAN, there is a small cluster of players and teams, which we verified to have exceptional statistics compared to the larger cluster. Players in the smaller cluster have significantly better statistics, such as points per game. Hierarchical clustering shows that a smaller cluster of elite players exists, and DBSCAN sharpens this picture with more interpretable results. Compared to our base model of hierarchical clustering, DBSCAN performs much better at identifying outliers and showing where the elite cluster lies.

With these unsupervised learning models, we can identify a small, elite cluster of players, which has significant implications for talent scouting and team composition strategies. Using these clustering techniques, scouts and team analysts can target their recruitment efforts toward players categorized as high performers, thereby optimizing resource allocation and enhancing team competitiveness.

Our model could also be used to study the dynamics between players from different clusters. For instance, if players from the elite cluster were on the same team as players from the average cluster, would the average players eventually perform better? Our results indicate that this is not the case, because average players have significantly different statistics compared to elite players, and these are unlikely to change as player age increases. The reasons behind this are also worth studying.


## Limitations

Our study on clustering NBA players using unsupervised learning has limitations due to the scope of the data, the extent of hyperparameter exploration, and temporal variability. Most notably, the rebounding data is only available for the 2022 season, which severely limits the conclusions we can draw about that particular aspect of the game. The limited range of performance metrics and the depth of hyperparameter tuning could also affect the identification of clusters. The dynamic nature of player performance over seasons and the lack of consideration for qualitative contributions such as leadership and game intelligence are further limitations. Addressing these through expanded datasets, comprehensive hyperparameter analysis, or a more longitudinal study could enhance the study's findings and applicability.


## Ethics & Privacy [4]

We will ensure that any personal or sensitive information about individuals, such as fans or players, is handled with care and in accordance with relevant privacy laws and regulations. Additionally, we will have to be mindful of potential biases in the data that could unfairly impact certain groups of fans or players. Biases may arise from factors such as demographic information, ticket pricing, or player performance metrics.

## Conclusion

Through a thorough analysis of the collected data, it has become increasingly evident that there exists a significant discrepancy in performance levels between teams composed of players with superior statistical profiles and those whose players are not as highly rated. This disparity is not merely confined to individual achievements but extends to the overall success of the teams. It suggests a strong correlation between the caliber of the players and the team's success, indicating that teams with statistically superior players tend to achieve higher levels of performance and success in comparison to their counterparts. This trend highlights the pivotal role player quality plays in shaping the outcomes and achievements of NBA teams.

## Footnotes
1.^: George Foster, Norm O'Reilly, Zachary Naidu. (5 Jul 2021) Playing-Side Analytics in Team Sports: Multiple Directions, Opportunities, and Challenges. Frontiers in Sports and Active Living. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8287128/
2.^: Matt Dawson. (24 April 2023) The iron cage of efficiency: analytics, basketball and the logic of modernity. Sport in Society. https://ieeexplore.ieee.org/abstract/document/10399370/authors#authors
3.^: Jaime Sampaio, Tim McGarry, Julio Calleja-González, Sergio Jiménez Sáiz, Xavi Schelling i del Alcázar, Mindaugas Balciunas. (14 July 2014) Exploring Game Performance in the National Basketball Association Using Player Tracking Data. PLOS ONE. https://springerplus.springeropen.com/articles/10.1186/s40064-016-3108-2#citeas
4.^: ChatGPT