# COGS 118B - Project Proposal

# Names

- Alexander Tang
- Sohaib Khan
- Siddhant Joshi
- Steven Xie
- Simon Zheng

# Abstract 

This project analyzes which factors contribute to the success of soccer teams using machine learning. After merging several player-, team-, and match-level datasets, we will apply unsupervised learning techniques, including PCA and hierarchical clustering, together with predictive models, to identify the attributes associated with team success. We aim to relate team performance to the factors behind it, such as player contributions and team management. The resulting evaluations and predictions are intended to inform future sports analytics and team strategy, and to help current and future team managers assess their strategies and identify key areas of focus. Similar models and techniques could also be adapted to evaluate the chances of success in other sports.


# Background

In recent years, the intersection of sports and data analytics has become an increasingly significant area of research, offering insight into team dynamics, player contributions, and overall performance. As sports analytics has grown in popularity, the application of machine learning techniques to uncover the factors influencing the success of soccer teams has garnered particular attention <a name="first paper"></a>[<sup>[1]</sup>](#first). This project applies unsupervised machine learning methods to identify the elements that contribute to the success of soccer teams.

The overarching goal of our research is to provide valuable insights that go beyond simple retrospective analyses, seeking to understand underlying factors driving team success and develop predictive models that can evaluate future outcomes <a name="second paper"></a>[<sup>[2]</sup>](#second). The basis of the evaluations not only encompasses team performance but also takes into account individual player contributions to suggest a holistic perspective on team performance dynamics. The results and models from this project can assist both current and future team managers in evaluating their chances of success. Beyond this, the models and techniques have the potential to be adapted for predicting success in other sports, thereby broadening the applicability and impact of the current research in sports analytics <a name="third paper"></a>[<sup>[3]</sup>](#third).


# Problem Statement

Our group will analyze data on various soccer teams to see if there are any outstanding features/attributes (team offensive/defensive metrics, individual player talent metrics, team/player salaries, etc.) that lead to overall team success (number of wins, championships, seeding, popularity, etc.) with the goal of identifying specific features/attributes that make a soccer team successful. 

# Data

## FIFA World Cup 2022 Player Data
Link: https://www.kaggle.com/datasets/swaptr/fifa-world-cup-2022-player-data?select=player_defense.csv

Size: 11 datasets are available for use:
* player_defense.csv 
    * Number of observations: 680
    * Variables: 22
    * Detail: defensive actions and metrics for individual players
    * Critical variables: player name, position, team name, age, minutes played, tackles, interceptions, and clearances.
    * Handling: requires data cleaning for missing values and normalization of defensive metrics.
    
* player_gca.csv 
    * Number of observations: 680
    * Variables: 22
    * Detail: player involvement in goal-creating actions
    * Critical variables: player name, position, team name, goal-creating actions (GCA), and shot-creating actions (SCA)
    * Handling: requires data cleaning for missing values and normalization.
* player_keepers.csv 
    * Number of observations: 41
    * Variables: 25
    * Detail: saves, goals against, and clean sheets
    * Critical variables: player name, position, team name, saves, clean sheets, and goals against
    * Handling: requires analysis separate from field players and normalization of metrics to account for playing time.
* player_keepersadv.csv 
    * Number of observations: 41
    * Variables: 31
    * Detail: same as player_keepers.csv, but with more detailed metrics
    * Critical variables: post-shot xG, goalkeeping actions outside the penalty area.
    * Handling: some of the advanced metrics require careful interpretation
* player_misc.csv 
    * Number of observations: 680
    * Variables: 22
    * Detail: miscellaneous player metrics.
    * Critical variables: yellow cards, red cards, fouls, offsides
    * Handling: evaluation of the impact of these metrics on team success
* player_passing_types.csv 
    * Number of observations: 680
    * Variables: 21
    * Detail: breakdown of passes by type
    * Critical variables: short passes, long passes, crosses, accuracy
    * Handling: evaluation of the key passing types and the strategies they reflect.
* player_passing.csv 
    * Number of observations: 680
    * Variables: 29
    * Detail: players’ passing statistics
    * Critical variables: total passes, pass completion rate, key passes.
    * Handling: aggregation and normalization of passing data for team-level insights
* player_playingtime.csv 
    * Number of observations: 829
    * Variables: 27
    * Detail: playing time and appearances
    * Critical variables: minutes, appearances, starts
    * Handling: evaluation of the impact of playing time on team success
* player_possession.csv 
    * Number of observations: 680
    * Variables: 41
    * Detail: player possession and dribbling
    * Critical variables: dribbles completed, players dribbled past, touches
    * Handling: normalization to account for playing time and role in team strategy analysis
* player_shooting.csv 
    * Number of observations: 680
    * Variables: 23
    * Detail: shooting data of players
    * Critical variables: shots, shots on target, goals, shooting accuracy, and xG
    * Handling: analysis of shooting efficiency and contribution to team success, and normalization for playing time.
* Player_stats.csv
    * Number of observations: 680
    * Variables: 31
    * Detail: broad overview of player data covering multiple aspects of the game
    * Critical variables: most of the variables
    * Handling: requires aggregation for team-level insights, plus data cleaning and standardization of the relevant variables.
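
Several of the handling notes above call for normalizing by playing time and aggregating player rows to the team level. The following is a minimal sketch of both steps, assuming pandas and using a toy table in place of the real CSV files. The column names (`player`, `team`, `minutes`, `tackles`) and the helper `per_90` are hypothetical and would need to be matched to the actual files.

```python
import pandas as pd

# Toy stand-in for a player-level table; the column names are hypothetical.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "team": ["X", "X", "Y", "Y"],
    "minutes": [270, 90, 180, 0],
    "tackles": [6, 1, 4, 0],
})


def per_90(df, stat_cols, minutes_col="minutes"):
    """Return a copy of df with each stat rescaled to a per-90-minute rate."""
    out = df.copy()
    nineties = out[minutes_col] / 90.0
    for col in stat_cols:
        # Players with zero minutes get NaN instead of a division artifact.
        out[col + "_per90"] = (out[col] / nineties).where(nineties > 0)
    return out


players_p90 = per_90(players, ["tackles"])

# Team-level aggregation: sum the raw counts first, then form the rate,
# so that players with few minutes do not distort the team figure.
team_totals = players.groupby("team")[["minutes", "tackles"]].sum()
team_totals["tackles_per90"] = team_totals["tackles"] / (team_totals["minutes"] / 90.0)

print(players_p90)
print(team_totals)
```

Summing counts before dividing is one possible design choice; averaging the per-player rates would weight a substitute with ten minutes the same as a regular starter.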
    
## Historical World Cup data
* Link: https://github.com/jfjelstul/worldcup/tree/master
* Size: 27 datasets, covering data similar to the FIFA World Cup 2022 player data plus more detail on goals, management, teams, and tournaments.
* Description: the match-level data (teams, scores, and tournament stages) are the most important for our purposes. Variables such as match outcomes, goals, team statistics, and progression stages come in both numerical and categorical formats, so they will need normalization and cleaning of null values.

## EPL 21-22 Matches and Players Statistics
* Link: https://www.kaggle.com/datasets/azminetoushikwasi/epl-21-22-matches-players?select=all_players_stats.csv
* Size: 3 files covering match results, player stats, and the points table
    * All match results
        * Number of observations: 380
        * Variables: 4
        * Detail: results of matches, including teams and scores
        * Critical variables: team names, goals scored, match outcomes
        * Handling: merging with team and player statistics for deeper analysis
    * All players stats
        * Number of observations: 623
        * Variables: 10
        * Detail: player statistics across various performance measures
        * Critical variables: players’ names, team affiliations, and performance indicators including goals, assists, etc.
        * Handling: further analysis and normalization for comparison across players.
    * Points table
        * Number of observations: 20
        * Variables: 10
        * Detail: teams’ points, wins, losses, and other key standings metrics
        * Critical variables: team names, points, wins, goal difference
        * Handling: evaluation of team success.

These three data sources mainly cover the teams, matches, and other important metrics relevant to the question of what determines success. After exploring the datasets further, we will merge the three sources and select the final dataset for analysis, modeling, and evaluation.
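
As an illustration of the merging step, the sketch below joins a team-level feature table to a standings table with pandas. It is only a sketch under assumptions: both tables are toy data, and the column names (`team`, `goals`, `points`) are hypothetical rather than taken from the actual files. The `validate` and `indicator` arguments of `DataFrame.merge` help catch duplicated or mismatched team names, which are a common problem when combining sources.

```python
import pandas as pd

# Hypothetical team-level features (e.g. aggregated from player rows).
team_features = pd.DataFrame({
    "team": ["X", "Y", "Z"],
    "goals": [60, 45, 30],
})

# Hypothetical standings table from a second source.
points_table = pd.DataFrame({
    "team": ["X", "Y"],
    "points": [80, 55],
})

merged = team_features.merge(
    points_table,
    on="team",
    how="left",             # keep every team from the feature table
    validate="one_to_one",  # raise if a team name is duplicated on either side
    indicator=True,         # adds a "_merge" column recording the match status
)

# Teams present in the features but missing from the standings table.
unmatched = merged.loc[merged["_merge"] != "both", "team"].tolist()
print(merged)
print("Unmatched teams:", unmatched)
```

Inspecting the unmatched teams before dropping them makes it easier to spot spelling differences in team names between sources.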


# Proposed Solution

To identify the features that make soccer teams successful, we will first use PCA to reduce the dimensionality of our dataset and then apply hierarchical clustering to group teams based on their success-related features. Team success involves many variables (team offensive/defensive metrics, individual player talent metrics, team/player salaries, etc.), which leads to a high-dimensional dataset. PCA reduces this dimensionality while emphasizing variation and bringing out strong patterns in the data. By keeping only the principal components with the most variance, it also helps filter out noise from less significant variables. Because PCA is sensitive to the variances of the features, we will first standardize the dataset using StandardScaler from scikit-learn. We will then use scikit-learn's PCA class and call pca.fit_transform() to obtain the reduced data.
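
The standardize-then-project step described above could look roughly like the following. This is a sketch on synthetic data, not the project's final pipeline: the feature matrix `X` is random, and the 90% explained-variance threshold is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 hypothetical teams described by 6 features.
X = rng.normal(size=(20, 6))
# Make one feature a large-scale near-copy of another, to mimic
# correlated metrics measured in very different units.
X[:, 1] = 1000 * X[:, 0] + rng.normal(size=20)

# Standardize so that no feature dominates just because of its units.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction (assumed 90% here).
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)
print("Cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```

Without the scaling step, the feature multiplied by 1000 would dominate the first principal component purely because of its scale.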

Using the dimensionality-reduced data, we will apply hierarchical clustering to group the soccer teams based on their similarities across the principal components. Hierarchical clustering produces a dendrogram that visualizes the clustering process and offers insight into the data's structure, which we believe will help identify relationships between teams based on their success metrics. In addition, hierarchical clustering can capture complex relationships between data points, which can be beneficial if the success factors of soccer teams are not linearly separable. We will use the common agglomerative (bottom-up) approach by importing the AgglomerativeClustering class from scikit-learn, initializing the model with AgglomerativeClustering(), and fitting it with clustering.fit(). We will then analyze the clusters to identify common features or attributes within each group and see what makes the most successful cluster of teams stand out from the rest.
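
A minimal sketch of the agglomerative step follows, again on synthetic data. Two details are assumptions rather than decisions made in this proposal: Ward linkage is used, and the dendrogram is built with SciPy's `linkage` and `dendrogram` functions, since scikit-learn's AgglomerativeClustering does not draw one itself.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Synthetic stand-in for PCA output: two well-separated groups of 10 teams.
X_reduced = np.vstack([
    rng.normal(0, 0.5, size=(10, 3)),
    rng.normal(5, 0.5, size=(10, 3)),
])

# Bottom-up clustering; Ward linkage merges the pair of clusters that
# gives the smallest increase in within-cluster variance.
clustering = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = clustering.fit_predict(X_reduced)

# SciPy's linkage matrix records every merge; dendrogram() lays it out.
# no_plot=True returns the layout without requiring matplotlib.
Z = linkage(X_reduced, method="ward")
tree = dendrogram(Z, no_plot=True)

print("Cluster labels:", labels)
print("Leaf order in the dendrogram:", tree["leaves"])
```

To draw the tree, the same `dendrogram(Z)` call can be made without `no_plot=True` inside a matplotlib figure.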

As a benchmark model, we could use K-Means clustering with a predefined number of clusters, where the number of clusters will be based on the number of tiers of team performance typically observed in soccer leagues. This benchmark can help us evaluate the effectiveness of our hierarchical clustering approach.
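
The benchmark could be set up along these lines. The number of tiers (`n_tiers = 3`) is purely an assumption for the example, and the data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the reduced data: three groups of 8 teams each.
X_reduced = np.vstack([rng.normal(c, 0.5, size=(8, 2)) for c in (0, 5, 10)])

n_tiers = 3  # assumed number of performance tiers, for illustration only

# n_init restarts guard against a poor random initialization;
# random_state makes the run reproducible.
kmeans = KMeans(n_clusters=n_tiers, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_reduced)

print("Labels:", kmeans_labels)
print("Cluster centers:\n", kmeans.cluster_centers_)
```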

# Evaluation Metrics

To evaluate both the hierarchical clustering and the K-Means benchmark, we will use the Silhouette Score as our evaluation metric. For a single sample, S = (b - a) / max(a, b), where a is the average distance between the sample and all other points in the same cluster and b is the average distance between the sample and all points in the nearest neighboring cluster; the overall score is the mean of S over all samples. The score ranges from -1 to 1. It measures how similar a sample is to its own cluster compared to other clusters: a high score indicates that the sample is well matched to its own cluster and poorly matched to neighboring clusters. If the silhouette score for our hierarchical clustering is lower than that of the benchmark model, this suggests that hierarchical clustering produces less cohesive, less well-separated clusters than K-Means on this dataset.
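
To make the formula concrete, the sketch below computes S for one sample directly from the definition, checks it against scikit-learn's `silhouette_samples`, and then compares the two models with `silhouette_score`. The data are synthetic and the helper `silhouette_of_sample` is hypothetical, written only for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_samples, silhouette_score


def silhouette_of_sample(X, labels, i):
    """S = (b - a) / max(a, b) for sample i, using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # a excludes the sample itself
    if not same.any():
        return 0.0  # convention for a cluster containing a single sample
    a = dists[same].mean()
    # b: smallest mean distance to the points of any other cluster.
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.7, size=(10, 2)) for c in (0, 4)])

hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

manual = silhouette_of_sample(X, hier_labels, 0)
reference = silhouette_samples(X, hier_labels)[0]
print("Manual vs. scikit-learn for sample 0:", manual, reference)

hier_score = silhouette_score(X, hier_labels)
km_score = silhouette_score(X, km_labels)
print("Hierarchical:", hier_score, "K-Means:", km_score)
```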

# Ethics & Privacy  <a name="fourth paper"></a>[<sup>[4]</sup>](#fourth) 

## Privacy Concerns:
We will ensure that any personal or sensitive information about individuals, such as fans or players, is handled with care and in accordance with relevant privacy laws and regulations. Additionally, we will have to be mindful of potential biases in the data that could unfairly impact certain groups of fans or players. Biases may arise from factors such as demographic information, ticket pricing, or player performance metrics.
## Transparency and Accountability:
All methods and datasets will be made open-source so as to maintain transparency about the study. We will also document any assumptions, limitations, or uncertainties associated with the analysis to ensure that stakeholders or interested parties have an informative picture of the potential implications of the findings and how they may influence decision-making.
## Impact on Players and Teams:
Perhaps most importantly, we will have to consider the potential impact of the analysis on players, teams, and other stakeholders. Any conclusion that could harm reputations or lead to unfair treatment must be weighed carefully and represented accurately in our report.

# Team Expectations

* Communicate regularly, effectively, and promptly with group members.
* Ensure all criticism is respectful and constructive.
* Collaborate productively with other team members to ensure quality work is completed within planned deadlines.
* Attend team meetings, provide constructive feedback, and engage in proactive problem solving to overcome obstacles.
* Make a conscious effort to be adaptable and flexible with any unforeseen circumstances in order to reach the overall goal.


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/10  |  2:30 PM |  Brainstorm ideas/topics  | Get to know each other, vote on topic | 
| 2/18  |  9:00 PM |  Browse Optimal Datasets | Datasets observed, whether they are feasible, assign tasks for Project Proposal | 
| 2/19  | 11:00 AM  | Follow up from previous  | Continue discussing |
| 2/20  | 8:00 PM  | Finish Tasks for Project Proposal | Finalize Project Proposal |
| 2/24  | 12:00 PM  | Import & Wrangle Data; EDA | Review Wrangling, EDA and discuss Analysis Plan |
| 3/2  | 12:00 PM | Finalize wrangling/EDA; Begin programming for project | Discuss project code and finalize it |
| 3/9  | 12:00 PM  | Complete analysis; Draft results/conclusion/discussion | Rundown of the whole project, any changes needed to be made? |
| 3/16  | 12:00 PM  | Finalize the last changes | Submit project |

# Footnotes
<a name="first paper"></a>1.[^](#first): George Foster, Norm O'Reilly, Zachary Naidu. (5 Jul 2021) Playing-Side Analytics in Team Sports: Multiple Directions, Opportunities, and Challenges. *Frontiers in Sports and Active Living*. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8287128/<br> 
<a name="second paper"></a>2.[^](#second): Zhiqiang Pu, Yi Pan, Shijie Wang, Boyin Liu, Min Chen, Hao Ma, Yixiong Cui. (Jan 2024) Orientation and Decision-Making for Soccer Based on Sports Analytics and AI: A Systematic Review. *Institute of Electrical and Electronics Engineers*. https://ieeexplore.ieee.org/abstract/document/10399370/authors#authors<br>
<a name="third paper"></a>3.[^](#third): Robert Rein, Daniel Memmert. (24 Aug 2016) Big data and tactical analysis in elite soccer: future challenges and opportunities for sports science. *SpringerPlus*. https://springerplus.springeropen.com/articles/10.1186/s40064-016-3108-2#citeas<br>
<a name="fourth paper"></a>4.[^](#fourth): ChatGPT<br>