# COGS 118B - Final Project

# COGS 118B - Positional Analysis Using Clustering Models on a FIFA (Video Game) Dataset

## Group members

- Kavin G Raj
- Michael Daoud 
- Shay Samat
- Hyun Choi
- Akhil Vasanth

# Abstract 
We use player data from EA's hit video game series FIFA, now known as EA Sports FC (EAFC). Our goal is to apply the clustering methods learned in the course to group players by several performance statistics, and then test how well the resulting clusters differentiate player positions.


# Background

Clustering is commonly used in sports research to investigate a wide range of problems. One reason is that members of the same cluster are likely to respond more similarly to one another than to members of a different cluster, and it is often easier to sample from clusters of individuals, such as sports teams, than from individuals<a name="hayen"></a>[<sup>[1]</sup>](#hayennote). Many projects and studies have used clustering to analyze player data and to determine players' positions from the clusters they fall into. Partition clustering methods have been used to cluster soccer players by their performance data in order to analyze the composition of teams and to find players similar to a given player <a name="akhanli"></a>[<sup>[2]</sup>](#akhanlinote). Additionally, techniques like K-Means clustering and model-based clustering have been used to identify new, unique clusters of NBA players that are more informative than the standard player positions <a name="stern"></a>[<sup>[3]</sup>](#sternnote). Building on this prevalence of clustering in sports, our project would allow people to easily filter and distinguish between players based on their performance and find the type of player they want.


# Problem Statement

Soccer is a very fluid game, which makes it hard to determine a player's position from statistics alone. Our question is: how accurately can in-game player statistics predict a player's position, based on the player's similarity to other players?


# Proposed Solution

Our proposed solution is to use K-Means clustering to group players into 4 clusters, intended to correspond to goalkeepers, defenders, midfielders, and attackers, and to evaluate the clusters using silhouette scores and other evaluation metrics. We will also measure how closely the clusters agree with a grouping based on the players' actual positions by using the Adjusted Rand Index. We plan to use the scikit-learn library to implement K-Means and to compute the Adjusted Rand Index and silhouette scores with its built-in functions.
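The pipeline described above could look roughly like the following. This is a minimal sketch, not our final implementation: it runs on synthetic data from `make_blobs` as a stand-in for the real player statistics, and the feature count, the standardization step, and the `random_state` values are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for player statistics (NOT real FIFA data): four blobs
# playing the role of goalkeepers, defenders, midfielders, and attackers.
X, true_groups = make_blobs(
    n_samples=400, centers=4, n_features=6, cluster_std=1.0, random_state=0
)

# Standardize so that no single statistic dominates the Euclidean distance
# (an assumed preprocessing choice).
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with one cluster per position group.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Internal validation: how compact and well separated are the clusters?
sil = silhouette_score(X_scaled, cluster_labels)

# External validation: agreement with the known groups, corrected for chance.
ari = adjusted_rand_score(true_groups, cluster_labels)

print(f"silhouette score: {sil:.3f}")
print(f"adjusted Rand index: {ari:.3f}")
```

On the real dataset, `X` would be the matrix of selected player statistics and `true_groups` the position group derived from each player's listed position.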

# Evaluation Metrics

After obtaining clusters from K-Means clustering, we will use the Silhouette Score to assess how effectively the clustering groups players based on their performance statistics. The Silhouette Score measures how similar a data point is to its own cluster compared with other clusters. It ranges from -1 to 1, where values near 1 indicate that the point is well matched to its own cluster and poorly matched to neighboring clusters.

Formula of the Silhouette Score:

$$
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$$

where $a(i)$ is the mean distance between sample $i$ and all other points in the same cluster, and $b(i)$ is the mean distance between sample $i$ and all points in the nearest other cluster.
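To make the definition concrete, the sketch below computes $s(i)$ by hand for a tiny toy dataset and checks it against scikit-learn's `silhouette_samples`. The six 2-D points and the helper name `silhouette_of_point` are made up for illustration, and Euclidean distance is assumed (scikit-learn's default).

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data: two well-separated groups of three points each (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_of_point(X, labels, i):
    """Silhouette s(i) for sample i, using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # a(i) excludes the point itself
    if not same.any():
        return 0.0  # convention for a cluster containing a single point
    a = dists[same].mean()
    # b(i): smallest mean distance to the points of any other cluster.
    b = min(dists[labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_of_point(X, labels, i) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print(np.allclose(manual, reference))
```

The overall Silhouette Score reported for a clustering is the mean of $s(i)$ over all samples.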


Next, we will validate our clustering with the Adjusted Rand Index (ARI). The cluster labels produced by K-Means are arbitrary integers that do not inherently correspond to particular positions, which makes direct label-by-label validation challenging. The ARI avoids this problem by comparing the partition produced by K-Means with the ground-truth partition, regardless of how the clusters are named. In our case, this means comparing how players are grouped into clusters against how they are grouped by their actual positions. Importantly, the ARI also corrects for the agreement between the clustering and the ground truth that would be expected by chance. The ARI equals 1 when the two partitions are identical, is close to 0 for random labelings, and can be negative when agreement is worse than chance.


Adjusted Rand Index:


$$
\text{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}
$$


where:
- $n_{ij}$ is the number of objects that are in both cluster $i$ of the first partition (the K-Means clusters) and group $j$ of the second partition (the actual positions).
- $a_i$ is the number of objects in cluster $i$ of the first partition.
- $b_j$ is the number of objects in group $j$ of the second partition.
- $n$ is the total number of objects.
- $\binom{x}{2} = \frac{x(x-1)}{2}$ is the binomial coefficient, i.e., the number of pairs that can be formed from $x$ items.
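As a sanity check on the formula, the following sketch builds the contingency table of counts $n_{ij}$, evaluates the ARI expression term by term, and compares the result with scikit-learn's `adjusted_rand_score`. The helper name `ari_from_counts` and the eight example labels are hypothetical. The sketch does not handle the degenerate case where the denominator is zero.

```python
from math import comb

import numpy as np
from sklearn.metrics import adjusted_rand_score


def ari_from_counts(labels_a, labels_b):
    """Adjusted Rand Index computed directly from the contingency table."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)

    # table[i, j] = n_ij: objects in cluster i of A and group j of B.
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)

    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))  # row totals a_i
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))  # column totals b_j

    expected = sum_a * sum_b / comb(len(a_idx), 2)  # chance-level agreement
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical cluster assignments and position groups for eight players.
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
positions = ["GK", "GK", "DEF", "DEF", "MID", "ATT", "ATT", "ATT"]

manual_ari = ari_from_counts(clusters, positions)
sklearn_ari = adjusted_rand_score(positions, clusters)
print(abs(manual_ari - sklearn_ari) < 1e-12)
```

Because the ARI only looks at which pairs of players are grouped together, relabeling the clusters (for example swapping cluster 0 and cluster 1) leaves the score unchanged.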


The Silhouette Score lets us validate the quality of the K-Means clustering internally, and the Adjusted Rand Index lets us validate it externally by comparing the clusters against the ground truth of the players' positions. Together, the two metrics describe how well our unsupervised learning algorithm recovers player positions from performance statistics.


# Preprocessing (Data Wrangling)

# EDA

# Feature Engineering

# Baseline Model

# Performance Models

# Data

Data: https://www.kaggle.com/datasets/stefanoleone992/ea-sports-fc-24-complete-player-dataset?select=male_players.csv

Initial size of dataset: 109 variables, 180,021 observations

The dataset contains the name, club, nationality, preferred foot, attack stats, defending stats, dribbling stats, goalie stats, mentality, date of birth, etc. of every player who was in the FIFA video game from 2014 to 2024. The critical variables are the attack, defending, and dribbling stats, each of which is recorded as a number from 0 to 100. Each of these categories is broken down into more specific stats for individual actions. For example, attack includes attack_crossing, attack_finishing, and attack_shortpassing. There are also many extra columns, such as nationality, club jersey, and league name, some of which we will remove from the dataset. We also plan to use only FIFA 24, so we will remove every row that does not belong to FIFA 24.
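A sketch of this filtering step in pandas is shown below. It runs on a tiny in-memory table rather than the real CSV, and every column name in it (`fifa_version`, `player_positions`, `nationality_name`, and the stat columns) is an assumption that should be checked against the actual file. The mapping from listed positions to four groups, and the choice to use only the first listed position, are also illustrative assumptions.

```python
import pandas as pd

# Tiny stand-in for male_players.csv. All column names are ASSUMED and must
# be checked against the real dataset.
players = pd.DataFrame({
    "short_name": ["A", "B", "C", "D", "E"],
    "fifa_version": [24, 24, 23, 24, 24],
    "player_positions": ["GK", "CB, RB", "ST", "CM, CDM", "ST, LW"],
    "nationality_name": ["X", "Y", "Z", "X", "Y"],
    "attack_finishing": [12, 35, 88, 64, 85],
    "goalkeeping_diving": [84, 10, 9, 11, 8],
})

# Assumed mapping from a listed position to one of four broad groups.
POSITION_GROUPS = {
    "GK": "goalkeeper",
    "CB": "defender", "LB": "defender", "RB": "defender",
    "LWB": "defender", "RWB": "defender",
    "CDM": "midfielder", "CM": "midfielder", "CAM": "midfielder",
    "LM": "midfielder", "RM": "midfielder",
    "ST": "attacker", "CF": "attacker", "LW": "attacker", "RW": "attacker",
}


def prepare(df, version=24, drop_cols=("nationality_name",)):
    """Keep one game version, drop unused columns, add a position group."""
    out = df[df["fifa_version"] == version].copy()
    out = out.drop(columns=list(drop_cols), errors="ignore")
    # Use the first listed position as the player's primary position.
    primary = out["player_positions"].str.split(",").str[0].str.strip()
    out["position_group"] = primary.map(POSITION_GROUPS)
    return out.reset_index(drop=True)


fifa24 = prepare(players)
print(fifa24[["short_name", "position_group"]])
```

With the real file, `players` would instead come from `pd.read_csv` on the downloaded `male_players.csv`, and `drop_cols` would list whichever extra columns we decide to remove.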


# Results




# Discussion

### Interpreting the result


### Limitations


### Ethics & Privacy

The main ethical and privacy concern is not where our data was sourced from, but how the underlying player information is used by EA Sports FC. The original data linked from Kaggle was collected from sofifa.com and carries a CC0: Public Domain license, meaning no copyright is claimed. SoFIFA is a database used by EA FIFA/EAFC, and its stats are calculated from different linear combinations of a player's main attributes. The rights to players' names and images could be a concern for EA Sports. Sole ownership of image rights starts with the player, which prohibits their exploitation without the player's explicit consent. When signing with a club, a player typically grants the club partial rights to their image so that it can be used in the club's marketing. If the player is selected to represent their national team, similar image rights are often extended to the national association.<sup>1</sup> This concern falls more on EA Sports than on our use of the dataset, but it is still important to address.

### Conclusion


# Footnotes
