# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names


- Hinn Zhang
- Jingyue Xu

# Abstract 

Our project seeks to re-evaluate the football (soccer) player ratings provided in the FIFA game and to offer an alternative interpretation of those ratings grounded in other sources of knowledge about the players. Our dataset captures diverse attributes such as player styles, team lineups, and detailed match events from over 25,000 matches across several European countries from 2008 to 2016. Through data preprocessing, exploratory data analysis, feature selection, and model selection, we seek to identify the most predictive features and construct an optimal model for player performance. The project will leverage various machine learning techniques, including linear regression, logistic regression, and multiple regression, with a focus on model interpretability. The performance of the predictive model will be assessed using a suite of metrics, including RMSE, MAE, and R², to ensure a comprehensive understanding of the model's predictive power and generalizability.


# Background

Although EA Sports and FIFA have parted ways after over three decades of partnership<a name="1"></a>[<sup>[1]</sup>](#1note), the football game developed by Electronic Arts has undoubtedly shaped the football simulation industry and the way fans interact with the sport<a name="2"></a>[<sup>[2]</sup>](#2note). Notably, the game features realistic football athletes along with ratings of their performance and skills, such as passing, tackling, dribbling, and shooting.

Prediction with FIFA ratings has garnered significant attention in data science and machine learning. Previous studies have applied machine learning techniques to players’ FIFA ratings to predict player values in the transfer market<a name="3"></a>[<sup>[3]</sup>](#3note), and researchers have established correlations between players’ performance values (ratings) and their attributes<a name="4"></a>[<sup>[4]</sup>](#4note). Yet there has been no holistic evaluation of these player statistics, nor any disclosure of how they are calculated. That is, while a player’s overall rating can be read as a measure of overall performance, the specific statistics do not follow a rigorous, evidence-based approach that accounts for a player’s impact, for example in tactical terms. Interpreting the overall FIFA ratings therefore extends quantitative evaluation toward football knowledge and builds on prior work on the relationship between players’ attributes and their true performance.


# Problem Statement

Existing research focuses on using FIFA player ratings to predict performance and other player characteristics. Yet the lack of tools for evaluating the ratings themselves raises the question of whether overall FIFA player scores are strong evidence of athletes’ success. In this project, to establish a clear relationship between players’ real-world performance and their overall ratings, we aim to build a regression model that predicts players’ ratings from real-world performance metrics. Given appropriate data, such as players’ real-life match performance, the model lets us dissect what drives the predicted overall score.

# Data

For the purposes of our project, we have already identified two public datasets, the European Soccer Database and the Football Data from Transfermarkt.

The European Soccer Database<a name="5"></a>[<sup>[5]</sup>](#5note) consists of data on 10,000+ players in 25,000+ matches over the seasons from 2008 to 2016. The player attributes data is sourced from the FIFA game series, which reflects real-world player information. The player attributes include specific features of the player’s style and detailed in-match event ratings. The player attributes data contains 42 variables and 184,000+ observations. Important variables include the player ID, the in-game rating (the value we will try to verify and compare against), and multiple player attributes recorded as categories or numerical values. Selected attributes will be used in our analysis to construct a prediction model for player evaluation.

The Football Data from Transfermarkt<a name="6"></a>[<sup>[6]</sup>](#6note) contains 28,000+ players and 300,000+ historical player market valuation records. We plan to use part of this data to evaluate player ratings from the angle of each player’s market valuation. The valuation data contains 422k observations, and we will have to clean it in order to match it to the player attributes by player name.
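Matching the two sources by player name could look roughly like the sketch below. It is only an illustration under stated assumptions: the toy rows, the column names (`player_name`, `overall_rating`, `name`, `market_value_in_eur`), and the helper `normalize_name` are hypothetical stand-ins, not a description of the actual files.

```python
import unicodedata

import pandas as pd


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())


# Toy stand-ins for the two datasets (hypothetical rows and column names).
attributes = pd.DataFrame({
    "player_name": ["Luka Modrić", "Sergio  Ramos", "Unmatched Player"],
    "overall_rating": [87, 88, 70],
})
valuations = pd.DataFrame({
    "name": ["Luka Modric", "sergio ramos"],
    "market_value_in_eur": [40_000_000, 35_000_000],
})

# Build a shared join key from the cleaned names.
attributes["name_key"] = attributes["player_name"].map(normalize_name)
valuations["name_key"] = valuations["name"].map(normalize_name)

# A left join keeps every rated player; `indicator=True` flags rows with no valuation.
merged = attributes.merge(
    valuations[["name_key", "market_value_in_eur"]],
    on="name_key",
    how="left",
    indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "player_name"].tolist()
```

A name-only key can collide when two different players share a name, so a second field such as date of birth may be needed if both sources provide one.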

In addition, we look forward to identifying additional features that may contribute and help us describe a player’s rating by continuing to explore new datasets.

# Proposed Solution

Our goal is to predict a football player's rating based on attributes such as play styles, in-match events, and market value. This is essentially a regression problem, where the target variable (the player's rating) is continuous. We propose to use ensemble learning methods, specifically Random Forest and Gradient Boosting regressors, for this task. These models are robust, handle high-dimensional data well, and require less data preprocessing than many other regression models.

Random Forests work by building a set of decision trees, each from a randomly selected subset of the training set, and then aggregating the trees' outputs to produce the final prediction. For a regression task such as ours, the aggregation is an average of the individual tree predictions rather than a vote. This method is relatively robust to overfitting and can handle non-linear relationships between features.
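As a small illustration of that aggregation step, the sketch below fits scikit-learn's `RandomForestRegressor` on synthetic data (an assumption standing in for the player data) and checks that the forest's prediction matches the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data; a placeholder for the real player features.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Collect each fitted tree's predictions: shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Averaging over trees should reproduce the ensemble's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
print(np.allclose(manual_average, forest_prediction))
```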

Gradient Boosting is another ensemble learning method that builds sequential models, each correcting the errors from the previous one. It is known for its effectiveness and flexibility, and can capture complex patterns in the data.
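The sequential error correction can be observed with `GradientBoostingRegressor.staged_predict`, which yields the ensemble's prediction after each boosting stage. The sketch below uses synthetic data as an assumption and tracks training error only, so it shows how the fit tightens stage by stage rather than how well the model generalizes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data; a placeholder for the real player features.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

booster = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, random_state=0
).fit(X, y)

# Training MSE after each stage: every new tree is fit to the remaining errors.
stage_mse = [mean_squared_error(y, pred) for pred in booster.staged_predict(X)]
print(f"MSE after stage 1: {stage_mse[0]:.1f}, after stage 50: {stage_mse[-1]:.1f}")
```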

To implement these models, we will use the scikit-learn library in Python, which provides pre-built classes for both Random Forest (RandomForestRegressor) and Gradient Boosting (GradientBoostingRegressor). We will also use GridSearchCV for hyperparameter tuning, to find the optimal set of parameters for each model. The data will be split into training and testing sets using train_test_split, ensuring that the model's performance is evaluated on unseen data.
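A minimal sketch of this workflow is shown below. The synthetic data from `make_regression`, the small parameter grids, and the choice of 3-fold cross-validation are all assumptions made for illustration; the real grids and data would differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the player features and ratings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each candidate pairs an estimator with an illustrative (deliberately small) grid.
candidates = {
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "gradient_boosting": (
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validated search on the training split only; the test split stays unseen.
    search = GridSearchCV(estimator, grid, cv=3, scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_rmse": -search.best_score_,  # flip the sign back to a positive RMSE
        "test_r2": search.best_estimator_.score(X_test, y_test),  # R² on held-out data
    }

for name, summary in results.items():
    print(name, summary)
```

Keeping the test split out of the grid search means the reported test score is not influenced by hyperparameter tuning.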

# Evaluation Metrics

Given that we are dealing with a regression problem, there are several appropriate metrics we can use to evaluate the performance of our predictive models. In this context, three particularly suitable metrics are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² score.

RMSE is a commonly used metric for regression problems, and it measures the average magnitude of the error. It does this by squaring the differences between the predicted and actual values, averaging these squared differences, and then taking the square root of the result. This metric gives a higher weight to large errors due to the squaring operation.

MAE is another metric for regression problems. It calculates the average of the absolute differences between the predicted and actual values. Unlike RMSE, MAE treats all errors equally, regardless of their magnitude.

The R² score, or coefficient of determination, provides a measure of how well future samples are likely to be predicted by the model. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).

In all these cases, a lower RMSE or MAE and a higher R² indicate a better fit of the model to the data.
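The three metrics can be computed by hand and cross-checked against scikit-learn, as in the sketch below. The four rating values are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted ratings for four players.
y_true = np.array([80.0, 75.0, 90.0, 65.0])
y_pred = np.array([78.0, 77.0, 85.0, 66.0])

errors = y_pred - y_true                       # [-2, 2, -5, 1]

rmse = np.sqrt(np.mean(errors ** 2))           # sqrt((4 + 4 + 25 + 1) / 4) = sqrt(8.5)
mae = np.mean(np.abs(errors))                  # (2 + 2 + 5 + 1) / 4 = 2.5

ss_res = np.sum(errors ** 2)                   # residual sum of squares = 34
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares = 325
r2 = 1.0 - ss_res / ss_tot                     # 1 - 34/325

# The library functions should agree with the manual calculations.
sk_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
sk_mae = mean_absolute_error(y_true, y_pred)
sk_r2 = r2_score(y_true, y_pred)
```

The single error of 5 points contributes 25 of the 34 squared-error units but only 5 of the 10 absolute-error units, which is why RMSE (about 2.92) exceeds MAE (2.5) here.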

# Ethics & Privacy

The data for this project is publicly available and has been anonymized to maintain the privacy of individual players and teams. Even though the data is anonymized, it is important to handle it responsibly. We recognize that predictive models can unintentionally exhibit or amplify biases in the data. We will take steps to identify and mitigate any potential bias in the dataset, and we will discuss potential biases in our report. We also recognize that our project may have real-world impacts, as it contains real names and places. We will address these concerns in our project discussion.


# Team Expectations 

* Communicate promptly in the Discord/iMessage group chat each week to address problems left over from meetings
* All members contribute equally and stick to their responsibility as assigned
* Collaborate via Google Colab and Google Docs
* Address comments and issues received from instructors and peer reviews
* Be respectful to every team member

# Project Timeline Proposal

We plan to follow the schedule:

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/12  | All Afternoon |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 5/16  |  5 PM |  Finalize research space and address previous work (Hinn) | Discuss ideal dataset(s) and ethics (Jingyue); complete and submit project proposal | 
| 5/19  | 11 AM  | Import & Wrangle datasets (all)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 5/22  | 5 PM  | Finalize previous sections and start EDA (Maradonna) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 5/28  | 12 PM  | Begin analysis (all) | Discuss/edit project code; Complete project |
| 6/01  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 6/03  | Before 11:59 PM  | Review and Finalize write-up | Turn in Final Project  |

# Footnotes
<a name="1note"></a>1.[^](#1):  https://www.nytimes.com/2022/05/10/sports/soccer/fifa-ea-sports.html<br>
<a name="2note"></a>2.[^](#2): https://www.ea.com/games/fifa/news/ea-sports-fifa-and-the-impact-on-soccer-in-the-usa<br>
<a name="3note"></a>3.[^](#3): https://ieeexplore.ieee.org/abstract/document/9721908<br>
<a name="4note"></a>4.[^](#4): https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8474750<br>
<a name="5note"></a>5.[^](#5): https://www.kaggle.com/datasets/hugomathien/soccer<br>
<a name="6note"></a>6.[^](#6): https://www.kaggle.com/datasets/davidcariboo/player-scores<br>
