# COGS 118B - Project Proposal

# Names

- Alexander Tang
- Sohaib Khan
- Steven Xie
- Simon Zheng

# Abstract 

This project analyzes the factors that contribute to NBA teams’ success using machine learning techniques. After merging several team-level and player-level datasets, we will use unsupervised learning techniques, including PCA and hierarchical clustering, alongside prediction models, to identify the attributes associated with successful teams. We aim to draw insights from team performance and from the factors behind it, such as team management, and to offer evaluations and predictions that can inform future sports analytics, assessments of player contributions, and team strategies. Our findings may also help current and future NBA teams evaluate their strategies and identify key areas of focus to improve their chances of success. Similar models and techniques could be adapted to evaluate chances of success in other sports.

# Background

In recent years, the intersection of sports and data analytics has become an increasingly significant area of research, offering a wealth of insights into the intricacies of team dynamics, player contributions, and overall performance. As the popularity of sports analytics continues to rise, the application of machine learning techniques to unravel the factors influencing the success of NBA teams has garnered particular attention <a name="first paper"></a>[<sup>[1]</sup>](#first). This project aims to delve into a comprehensive analysis utilizing unsupervised machine learning methodologies to determine the complex elements contributing to the success of NBA teams.

The overarching goal of our research is to provide valuable insights that go beyond simple retrospective analyses, seeking to understand underlying factors driving team success and develop predictive models that can evaluate future outcomes <a name="second paper"></a>[<sup>[2]</sup>](#second). The basis of the evaluations not only encompasses team performance but also takes into account individual player contributions to suggest a holistic perspective on team performance dynamics. The results and models from this project can assist both current and future NBA teams in evaluating their chances of success. Beyond this, the models and techniques have the potential to be adapted for predicting success in other sports, thereby broadening the applicability and impact of the current research in sports analytics <a name="third paper"></a>[<sup>[3]</sup>](#third).

Official NBA data records date back many years, which allows for the analysis of a large dataset and could lead to more concrete findings. The number of features collected over this timespan is also extensive (player awards, rankings, individual/team statistics, etc.), allowing for a detailed analysis of which specific features are the most impactful for team success. With this much data available, it is worthwhile to look for features or statistics that stand out and could offer teams an advantage.

# Problem Statement

Our group will analyze data on various NBA teams to see if there are any outstanding features/attributes (team offensive/defensive metrics, individual player talent metrics, etc.) that lead to overall team success (number of wins, championships, seeding, etc.) with the goal of identifying specific features/attributes that make an NBA team successful. 

## NBA team datasets
Link: https://okcthunder.app.box.com/s/pwyt8jlcfv4bnqeks2usdtkkgznermk5

Size: four datasets are included:
* team_rebounding_data_22.csv
    * Number of observations: 2460
    * Variables: 7
    * Detail: offensive rebounding data for each team/game in the 2022 season
    * Critical variables: team, opposing team, offensive rebounds, game number
    * Handling: the dataset covers only the 2022 season, so it can be linked to the other datasets only in a limited way.

* awards_data.csv
    * Number of observations: 4329
    * Variables: 23
    * Detail: awards data for each player/season combination
    * Critical variables: All-NBA First, Second, and Third Team selections; Player of the Month; Player of the Week; all ranks
    * Handling: the dataset requires cleaning of null values and duplicated rows, although some of them may be meaningful and should be checked before removal.

* team_stats.csv
    * Number of observations: 450
    * Variables: 11
    * Detail: a few stats for each team/season combination
    * Critical variables: team wins and losses, number of games
    * Handling: the dataset is clean enough; we need to make the features readable and create new features for winning and losing percentage.

* player_stats.csv
    * Number of observations: 8492
    * Variables: 49
    * Detail: a variety of stats and measures for each player/team/season combination
    * Critical variables: all scoring percentages and attempts, games played, minutes, and all playing stats
    * Handling: the dataset has many null and duplicated values that require cleaning.


These four datasets mainly cover the teams, players, and other important metrics that determine success. After wrangling and exploring the datasets, we will determine the best approach to analysis, machine learning, and evaluation.
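
The handling step described for team_stats.csv (removing duplicates and deriving winning and losing percentages) could be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names `W` and `L`, the helper `add_win_pct`, and the toy DataFrame are hypothetical, because the actual column names of the file are not listed in this proposal.

```python
import pandas as pd


def add_win_pct(team_stats: pd.DataFrame,
                wins_col: str = "W",
                losses_col: str = "L") -> pd.DataFrame:
    """Return a cleaned copy with winning and losing percentage columns.

    The default column names are assumptions, not the real schema.
    """
    # Drop exact duplicate rows and rows missing a win or loss count.
    out = team_stats.drop_duplicates().dropna(subset=[wins_col, losses_col]).copy()
    games = out[wins_col] + out[losses_col]
    out["win_pct"] = out[wins_col] / games
    out["loss_pct"] = out[losses_col] / games
    return out


# Toy stand-in for team_stats.csv (one duplicated row on purpose).
toy = pd.DataFrame({
    "team": ["A", "B", "B"],
    "season": [2021, 2021, 2021],
    "W": [41, 60, 60],
    "L": [41, 22, 22],
})
cleaned = add_win_pct(toy)
print(cleaned)
```

With the real file, `toy` would be replaced by the result of `pd.read_csv("team_stats.csv")`, with the column arguments adjusted to match the actual headers.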

# Proposed Solution

To identify the features that make NBA teams successful, we will first use PCA to perform dimensionality reduction on our dataset, then perform hierarchical clustering to group the teams based on their success-related features. NBA teams' success metrics involve many variables (team offensive/defensive metrics, individual player talent metrics, etc.), which leads to a high-dimensional dataset. PCA reduces the dataset’s dimensionality while emphasizing variation and bringing out strong patterns in the data. By keeping only the principal components with the most variance, PCA also helps filter out noise from less significant variables. Since PCA is sensitive to the variances of the features, we will start by standardizing the dataset using StandardScaler from scikit-learn. After that, we will use the PCA class in scikit-learn and call pca.fit_transform() to obtain the reduced data.
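
The standardize-then-reduce step described above might look like the following minimal sketch. The data here is synthetic (random numbers standing in for numeric team features), and the 90% explained-variance threshold is an assumption chosen for illustration, not a decision made in this proposal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 team-seasons with 8 numeric features.
X = rng.normal(size=(60, 8))
# Make one feature strongly correlated with another, as real stats often are.
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)

# PCA is variance-sensitive, so put every feature on the same scale first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to explain that
# fraction of the variance (0.90 is an assumed threshold).
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print(X_reduced.shape, explained.round(3))
```

Inspecting `pca.components_` afterwards shows how strongly each original feature loads on each principal component, which is one way to relate the components back to team statistics.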

Using the dimensionality-reduced data, we will apply hierarchical clustering and DBSCAN to group the NBA teams based on their similarities across the principal components. Hierarchical clustering produces a dendrogram that visualizes the clustering process and offers insight into the data's structure; we believe this can help identify relationships between teams based on their success metrics. In addition, hierarchical clustering can capture complex relationships between data points, which is beneficial if the success factors of NBA teams are not linearly separable. We will use the common agglomerative (bottom-up) approach by importing the AgglomerativeClustering class from scikit-learn, initializing the model with AgglomerativeClustering(), and fitting it with clustering.fit(). We will then analyze the clusters to identify common features or attributes within each group and see what makes the most successful cluster of teams stand out from the rest. We will also apply DBSCAN to the dimensionality-reduced data to flag potential outliers. Tuning DBSCAN's hyperparameters (eps and min_samples in scikit-learn) makes the outlier detection more or less sensitive.
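
A rough sketch of the clustering step is shown below. It runs on synthetic two-dimensional blobs standing in for the PCA-reduced team data, with one far-away point added by hand to play the role of an outlier. The number of clusters, the Ward linkage, and the DBSCAN values `eps=1.0` and `min_samples=5` are assumptions for illustration only; real values would have to be tuned on the actual data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for PCA-reduced data: three well-separated groups.
X_reduced, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)
# Append one distant point to act as an outlier.
X_reduced = np.vstack([X_reduced, [[30.0, 30.0]]])

# Agglomerative (bottom-up) clustering; every point receives a cluster label.
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward")
hier_labels = clustering.fit_predict(X_reduced)

# Linkage matrix that scipy.cluster.hierarchy.dendrogram can draw.
Z = linkage(X_reduced, method="ward")

# DBSCAN labels points in low-density regions as noise (label -1).
db = DBSCAN(eps=1.0, min_samples=5).fit(X_reduced)
outlier_mask = db.labels_ == -1

print("hierarchical cluster sizes:", np.bincount(hier_labels))
print("points flagged as noise by DBSCAN:", int(outlier_mask.sum()))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` (with matplotlib installed) draws the dendrogram. Note that the agglomerative model must place the distant point in some cluster, whereas DBSCAN can leave it unassigned, which is why the two methods are used together here.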

# Evaluation Metrics

To evaluate the performance of both the hierarchical clustering and DBSCAN, we will use the Silhouette Score as our evaluation metric. For a single sample, the score is s = (b - a) / max(a, b), where a is the average distance between the sample and all other points in the same cluster and b is the average distance between the sample and all points in the nearest neighboring cluster; the overall score is the mean over all samples. The metric measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, and a high score indicates that the sample is well matched to its own cluster and poorly matched to neighboring clusters. If the silhouette score for our hierarchical clustering is lower than that of the benchmark model, our hierarchical clustering isn’t performing well on the dataset.
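
The evaluation described above could be computed with scikit-learn's `silhouette_score`, as in this minimal sketch on synthetic data. Excluding DBSCAN's noise points (label -1) before scoring is an assumption made here for illustration; the proposal does not specify how noise should be treated. The cluster count and DBSCAN parameters are likewise placeholder values.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced team data.
X, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Mean silhouette over all samples for the hierarchical clustering.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
hier_score = silhouette_score(X, hier_labels)

# For DBSCAN, score only the points that were assigned to a cluster.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
assigned = db_labels != -1
n_db_clusters = len(set(db_labels[assigned]))
# The silhouette score is undefined with fewer than two clusters.
db_score = (
    silhouette_score(X[assigned], db_labels[assigned])
    if n_db_clusters >= 2
    else float("nan")
)

print(f"hierarchical silhouette: {hier_score:.3f}")
print(f"DBSCAN silhouette (noise excluded): {db_score:.3f}")
```

Because dropping noise points changes the set of samples being scored, the two numbers are not strictly comparable; reporting the fraction of points labeled as noise alongside the DBSCAN score would make the comparison fairer.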

# Ethics & Privacy  <a name="fourth paper"></a>[<sup>[4]</sup>](#fourth) 

## Privacy Concerns:
We will ensure that any personal or sensitive information about individuals, such as fans or players, is handled with care and in accordance with relevant privacy laws and regulations. Additionally, we will have to be mindful of potential biases in the data that could unfairly impact certain groups of fans or players. Biases may arise from factors such as demographic information, ticket pricing, or player performance metrics.
## Transparency and Accountability:
All methods and datasets will be made open-source so as to maintain transparency about the study. We will also document any assumptions, limitations, or uncertainties associated with the analysis to ensure that stakeholders or interested parties have an informative picture of the potential implications of the findings and how they may influence decision-making.
## Impact on Players and Teams:
Perhaps most importantly, we will have to consider the potential impact of the analysis on players, teams, and other stakeholders. Drawing conclusions that could harm reputations or lead to unfair treatment will have to be seriously considered and accurately represented in our report.

# Team Expectations

* Communicate with group members regularly, effectively, and in a timely manner.
* Ensure all criticism is respectful and constructive.
* Collaborate productively with other team members to ensure quality work is completed within planned deadlines.
* Attend team meetings, provide constructive feedback, and engage in proactive problem solving to overcome obstacles.
* Make a conscious effort to be adaptable and flexible with any unforeseen circumstances in order to reach the overall goal.


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/10  |  2:30 PM |  Brainstorm ideas/topics  | Get to know each other, vote on topic | 
| 2/18  |  9:00 PM |  Browse Optimal Datasets | Datasets observed, whether they are feasible, assign tasks for Project Proposal | 
| 2/19  | 11:00 AM  | Follow up from previous  | Continue discussing |
| 2/20  | 8:00 PM  | Finish Tasks for Project Proposal | Finalize Project Proposal |
| 2/24  | 12:00 PM  | Import & Wrangle Data; EDA | Review Wrangling, EDA and discuss Analysis Plan |
| 3/2  | 12:00 PM | Finalize wrangling/EDA; Begin programming for project | Discuss project code and finalize it |
| 3/9  | 12:00 PM  | Complete analysis; Draft results/conclusion/discussion | Rundown of the whole project, any changes needed to be made? |
| 3/16  | 12:00 PM  | Finalize the last changes | Submit project |

# Footnotes
<a name="first paper"></a>1.[^](#first): George Foster, Norm O'Reilly, Zachary Naidu. (5 Jul 2021) Playing-Side Analytics in Team Sports: Multiple Directions, Opportunities, and Challenges. *Frontiers in Sports and Active Living*. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8287128/<br> 
<a name="second paper"></a>2.[^](#second): Matt Dawson. (24 April 2023) The iron cage of efficiency: analytics, basketball and the logic of modernity. *Sport in Society*. https://ieeexplore.ieee.org/abstract/document/10399370/authors#authors<br>
<a name="third paper"></a>3.[^](#third): Jaime Sampaio, Tim McGarry, Julio Calleja-González, Sergio Jiménez Sáiz, Xavi Schelling i del Alcázar, Mindaugas Balciunas. (14 July 2014) Exploring Game Performance in the National Basketball Association Using Player Tracking Data. *PLOS ONE*. https://springerplus.springeropen.com/articles/10.1186/s40064-016-3108-2#citeas<br>
<a name="fourth paper"></a>4.[^](#fourth): ChatGPT<br>