### 1. Business Understanding

In the world of football, data is becoming increasingly important. It plays a significant role in the business side of the sport, helping clubs understand players' strengths and weaknesses. Data not only provides insights into performance but also helps evaluate positional suitability. Over the course of a career, a player may play multiple positions, which raises the question: why does this happen?

Haven't they become good footballers because they excelled in the positions they were originally scouted for? Not always. This is where data comes in: it helps identify which role best suits a player, reducing unnecessary position changes. By focusing on their natural strengths, players can refine the skills that matter most for that role, benefiting both the club and themselves. This targeted development allows them to maximize their potential in the position that aligns with their traits.

The metrics used for analysis include:

- Shooting power
- Positioning
- Defending
- Standing tackle
- Sliding tackle
- Short passes
- Long passes

These are just some of the metrics that help determine a player's profile. Using these data points, we can assess whether a player is better suited to a defensive or an attacking role, so that the player and the club can tailor development accordingly.

Although we are using a dataset from the game FIFA, it's important to note that this data may not perfectly reflect reality. However, the player stats in FIFA are themselves generated through data analysis, so they can be used to simulate how real-world players might be evaluated in a similar system. Goalkeepers are excluded from the dataset because they require an entirely different set of metrics than outfield players.

### 2. Data Understanding

The dataset was collected by football experts working for EA Sports, the developer of the popular FIFA video game series (since rebranded as EA Sports FC). The data is publicly available on Kaggle as a CSV file and provides a comprehensive overview of football players across top-tier leagues, including some second and third divisions.
 
The dataset is designed to reflect real-world player attributes, which are used in-game to simulate player performance. These attributes are assessed by EA Sports' team of analysts, which makes the data a useful, if imperfect, basis for football analytics.
 
### Data Description ###
 
The dataset contains 51 attributes for each player, covering:
 
- Player Identification & Metadata
  - `name`, `full_name`: Player's name and full name.
  - `birth_date`, `age`: Birthdate and current age.
  - `height_cm`, `weight_kgs`: Physical attributes.
  - `nationality`, `positions`: Player's country and primary/secondary positions (e.g., CF, RW, ST).

- Performance & Skill Attributes
  - Attacking Skills: `crossing`, `finishing`, `dribbling`, `shot_power`, `long_shots`, etc.
  - Defensive Skills: `heading_accuracy`, `interceptions`, `standing_tackle`, `sliding_tackle`, etc.
  - Physical & Mental Attributes: `strength`, `aggression`, `stamina`, `composure`, `vision`, etc.

- Career & Club Details
  - `overall_rating`, `potential`: Player's current and potential skill rating (0–100 scale).
  - `value_euro`, `wage_euro`: Estimated market value and weekly wage.
  - `club`, `national_team`: Current affiliations.
 
### Summary Statistics ###
 
After goalkeepers are excluded, the dataset contains 15,889 outfield players, with stats on 51 different attributes.


### 3. Data Cleaning

1. Removing Goalkeepers
All goalkeepers were removed from the data first, because they could influence the prediction too much. Goalkeepers have very different attributes than outfield players, so removing them should make the later predictions more reliable.
 
2. Handling Missing Values
Columns like `national_team`, `release_clause_euro`, and `national_rating` had missing values but were deemed non-essential for the classification task.
Critical columns (e.g., `attacking_score`, `defending_score`) were checked for completeness, and rows with missing values in these fields were dropped. As a result, the dataset was reduced from ~17,953 to 15,889 entries (goalkeepers excluded).
Dropped Irrelevant Columns: Metadata such as `name`, `birth_date`, and financial details (`wage_euro`) was removed to focus on performance metrics.
 
3. Data Integration
Single-Source Data: The dataset was sourced directly from Kaggle as one CSV file (fifa_players.csv), so no integration was needed.
Subsetting: Goalkeepers (positions containing GK) were filtered out to focus on field players.
 
4. Normalization
Min-max scaling (0–1) was applied to all numeric features. This gives every feature equal weight and avoids bias toward high-magnitude features, which is essential for distance-based models such as kNN.

5. Categorical Encoding
`player_type` (categorical) was encoded numerically for modeling.
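
The preparation steps above could be sketched in pandas roughly as follows. This is an illustration on a small invented frame, not the project's actual script: the rows, the choice of `finishing` and `standing_tackle` as the critical columns, and the 0/1 label mapping are all assumptions made for the example.

```python
import pandas as pd

# Tiny invented stand-in for fifa_players.csv (the real file would be read with pd.read_csv).
players = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "positions": ["ST,CF", "GK", "CB", "RW,ST", "CB,LB"],
    "finishing": [85, 12, 30, None, 40],
    "standing_tackle": [25, 14, 82, 30, 78],
    "national_team": [None, "X", None, None, "Y"],
    "player_type": ["Attacker", None, "Defender", "Attacker", "Defender"],
})

# 1. Remove goalkeepers: any player whose positions string contains "GK".
outfield = players[~players["positions"].str.contains("GK")]

# 2. Drop rows with missing values in the critical skill columns only.
skill_cols = ["finishing", "standing_tackle"]  # assumed subset for this sketch
outfield = outfield.dropna(subset=skill_cols)

# 3. Drop metadata columns that are not used as features.
outfield = outfield.drop(columns=["name", "national_team"])

# 4. Min-max scale each numeric feature to the 0-1 range.
scaled = outfield.copy()
col_min = scaled[skill_cols].min()
col_max = scaled[skill_cols].max()
scaled[skill_cols] = (scaled[skill_cols] - col_min) / (col_max - col_min)

# 5. Encode the categorical target numerically (mapping chosen for illustration).
scaled["label"] = scaled["player_type"].map({"Defender": 0, "Attacker": 1})

print(scaled[skill_cols + ["label"]])
```

In a real pipeline the scaling minimum and maximum would normally be computed on the training split only and then reused for the test split, so that no information from the test set leaks into training.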

### 4. Modeling

### Model Selection
Models Considered: Only the Naive Bayes classifier was considered due to the assignment's specific requirements.

Rationale: Naive Bayes is a simple yet effective probabilistic classifier, suitable for binary classification tasks like distinguishing between "Attacker" and "Defender" roles. It assumes that features are conditionally independent given the class. That assumption holds only approximately for player attributes, but it works well for this dataset because the attacking and defending attributes separate the two roles clearly.

### Parameter Tuning
Hyperparameters: The Gaussian Naive Bayes classifier was used with default parameters (no explicit tuning was performed).

Tuning Strategy: Since the assignment focused on implementing Naive Bayes, no advanced tuning (e.g., grid search) was applied. The model relies on the class priors and the per-feature means and variances estimated directly from the training data.

### Training & Validation
Data Split: The dataset was split into:

Training Set: 70% of the data.
Testing Set: 30% of the data.

Random State: A fixed random state (random_state=42) was used for reproducibility.
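
A minimal sketch of this split and of fitting the default Gaussian Naive Bayes model with scikit-learn is shown below. The two synthetic features stand in for the scaled attacking and defending attributes; the data, the class means, and the variable names are assumptions for illustration, not the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: column 0 ~ attacking ability, column 1 ~ defending ability.
rng = np.random.default_rng(0)
n = 500
attackers = rng.normal(loc=[0.7, 0.3], scale=0.1, size=(n, 2))
defenders = rng.normal(loc=[0.3, 0.7], scale=0.1, size=(n, 2))
X = np.vstack([attackers, defenders])
y = np.array([1] * n + [0] * n)  # 1 = Attacker, 0 = Defender

# 70/30 split with a fixed seed so the result is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Gaussian Naive Bayes with default parameters (no tuning).
model = GaussianNB()
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Passing `stratify=y` to `train_test_split` would additionally keep the attacker/defender ratio the same in both splits; the text does not say whether the original split was stratified.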

### Evaluation Metrics
Metrics Used:

- Accuracy: Measures the proportion of correct predictions (95% in this case).
- Precision: Indicates the model's ability to avoid false positives (95.15%).
- Recall (Sensitivity): Measures the model's ability to identify all relevant instances (93.61%).
- F1 Score: Harmonic mean of precision and recall (94.38%).
- Confusion Matrix: Visualizes true positives, true negatives, false positives, and false negatives.
- Classification Report: Provides a summary of precision, recall, and F1 score for each class.
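
For reference, these metrics can all be computed with functions from `sklearn.metrics`. The sketch below uses a tiny hand-made pair of label vectors (not the project's predictions) so that the values are easy to check by hand; class 1 is treated as the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels for illustration: 1 = Attacker, 0 = Defender.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # correct predictions / all predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Rows are actual classes, columns are predicted classes, in sorted label order (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall, f1)
print(cm)
print(classification_report(y_true, y_pred, target_names=["Defender", "Attacker"]))
```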

### Comparison to Baseline
Baseline: A baseline model could predict the majority class (e.g., always predicting "Attacker" if it is more frequent). The Naive Bayes model significantly outperforms such a baseline, as evidenced by the high accuracy (95%) and balanced precision/recall.

Performance: The model shows strong performance across all metrics on the held-out test set, which suggests it generalizes well to unseen data. The confusion matrix further confirms that misclassifications are rare.

### Additional Notes
The model's high performance suggests that the selected features (attacking and defending attributes) are well-suited for distinguishing between the two roles.

No cross-validation was performed, but the train-test split provides a reasonable estimate of the model's performance.
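
If cross-validation were added later, one common approach would be k-fold cross-validation with scikit-learn's `cross_val_score`. The sketch below is only an illustration on synthetic data; the choice of 5 folds and the data itself are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with two well-separated classes.
rng = np.random.default_rng(1)
n = 300
X = np.vstack([
    rng.normal(loc=[0.7, 0.3], scale=0.1, size=(n, 2)),  # attackers
    rng.normal(loc=[0.3, 0.7], scale=0.1, size=(n, 2)),  # defenders
])
y = np.array([1] * n + [0] * n)

# For a classifier and an integer cv, scikit-learn uses stratified folds
# and reports accuracy for each fold.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
mean_score, std_score = scores.mean(), scores.std()
print(f"Accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

The spread across folds gives a sense of how much a single train-test split might over- or under-state performance.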

The classification report and confusion matrix offer detailed insights into the model's behavior for each class ("Attacker" and "Defender").





### 5. Evaluation

The Naive Bayes classifier achieved the following performance metrics on the test set:

| Metric | Score | Interpretation |
|---|---|---|
| Accuracy | 0.95 | 95% of predictions were correct. |
| Precision | 0.9515 | When the model predicts "Attacker," it is correct 95.15% of the time. |
| Recall | 0.9361 | The model correctly identifies 93.61% of actual "Attackers." |
| F1 Score | 0.9438 | A balanced measure of precision and recall. |

### Confusion Matrix:

| | Predicted Defender | Predicted Attacker |
|---|---|---|
| Actual Defender | 2381 (TP) | 100 (FN) |
| Actual Attacker | 137 (FP) | 2149 (TN) |

In this matrix, "Defender" is treated as the positive class:

- True Positives (Defenders): 2381
- False Negatives (Defenders misclassified as Attackers): 100
- False Positives (Attackers misclassified as Defenders): 137
- True Negatives (Attackers): 2149
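
As a quick arithmetic check, the overall accuracy can be recomputed from the four counts above. The variable names are chosen for this sketch only.

```python
# Counts taken from the confusion matrix (actual class -> predicted class).
defender_as_defender = 2381
defender_as_attacker = 100
attacker_as_defender = 137
attacker_as_attacker = 2149

total = (
    defender_as_defender
    + defender_as_attacker
    + attacker_as_defender
    + attacker_as_attacker
)
correct = defender_as_defender + attacker_as_attacker
errors = defender_as_attacker + attacker_as_defender

# Accuracy = correctly classified players / all players in the test set.
accuracy = correct / total
print(f"{correct} of {total} correct, {errors} errors, accuracy = {accuracy:.4f}")
```

The total of 4,767 players is consistent with a 30% test split of the 15,889 outfield players.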

### Model Interpretability

Naive Bayes Probabilities: The model calculates the probability of a player being an "Attacker" or "Defender" based on their attacking and defending attributes.
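
For reference, the standard Gaussian Naive Bayes decision rule can be written as follows, where $x_1, \dots, x_n$ are a player's feature values, $P(c)$ is the prior probability of class $c$, and $\mu_{c,i}$ and $\sigma_{c,i}^2$ are the mean and variance of feature $i$ within class $c$, all estimated from the training data. The notation is introduced here for illustration.

```latex
\hat{y} = \arg\max_{c \in \{\text{Attacker},\, \text{Defender}\}}
  P(c) \prod_{i=1}^{n}
  \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
  \exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
```

The product over features is where the independence assumption enters: each attribute contributes its own likelihood term, regardless of the values of the other attributes.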

Key Features:

- Attackers tend to have higher `dribbling`, `sprint_speed`, and `shot_power`.
- Defenders tend to have higher `strength`, `heading_accuracy`, and `standing_tackle`.

Decision Boundary: Players are classified based on whether their `attacking_score` is greater than their `defending_score`.
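
The comparison described above can be illustrated with a short sketch. The text does not say how `attacking_score` and `defending_score` are computed, so this example assumes each is a plain average of a few related attributes; the column choices, the invented rows, and the handling of ties (assigned to "Defender" here) are all assumptions.

```python
import numpy as np
import pandas as pd

# Invented attribute values for three players.
players = pd.DataFrame({
    "finishing": [88, 35, 60],
    "dribbling": [90, 40, 62],
    "shot_power": [85, 50, 58],
    "standing_tackle": [30, 84, 70],
    "sliding_tackle": [25, 80, 66],
    "interceptions": [35, 82, 68],
})

# Assumption: each composite score is the mean of a few related attributes.
attacking_cols = ["finishing", "dribbling", "shot_power"]
defending_cols = ["standing_tackle", "sliding_tackle", "interceptions"]
players["attacking_score"] = players[attacking_cols].mean(axis=1)
players["defending_score"] = players[defending_cols].mean(axis=1)

# Label each player by whichever composite score is higher.
players["player_type"] = np.where(
    players["attacking_score"] > players["defending_score"],
    "Attacker",
    "Defender",
)
print(players[["attacking_score", "defending_score", "player_type"]])
```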

Baseline Model: A simple majority-class classifier (always predicting the most frequent class).
Because the dataset contains 55% defenders and 45% attackers, the baseline accuracy would be 55%.
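
Such a baseline can be reproduced with scikit-learn's `DummyClassifier`. The sketch below builds an artificial label vector with the 55/45 split stated above; the placeholder features are ignored by the dummy model and exist only because `fit` expects them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Artificial labels with the stated class balance: 0 = Defender, 1 = Attacker.
y = np.array([0] * 55 + [1] * 45)
X = np.zeros((len(y), 1))  # placeholder features, not used by the baseline

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```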

Naive Bayes Performance (95% Accuracy):

- Significantly better than the 55% baseline.
- High precision and recall on the held-out test set suggest that the model generalizes well to unseen data.

Strengths

- High Accuracy (95%) – The model performs exceptionally well on the test set.
- Simple & Fast – Naive Bayes is computationally efficient.

Weaknesses

- Assumes Feature Independence – Real-world player attributes may have correlations (e.g., `sprint_speed` and `agility`).
- No Hyperparameter Tuning – Using default parameters may not be optimal.


### Deployment Plan
The model will be turned into a simple web tool where coaches and scouts can enter a player's stats (like speed, passing, and tackling) to instantly see if they're better as an attacker or defender. It will work on phones and computers without needing special software.

### Keeping the Model Working Well
We'll regularly check if the model's predictions still match real-world performance. If FIFA updates their rating system or we notice more mistakes happening, we'll update the model with new data. Scouts will be able to report wrong predictions to help improve accuracy over time.

### Next Improvements
Next steps include:

- Adding more positions like midfielders
- Using actual game stats (not just FIFA ratings) to make predictions more accurate
- Showing clear explanations for why a player was classified a certain way

### Potential Issues
The main challenges will be keeping up with changes in how players are rated and handling players who don't fit neatly into attacker or defender categories. We'll solve this by updating the model every time new FIFA data comes out and eventually adding more flexible position categories.

The model is ready to use now, but will keep getting better as we gather more feedback and data. The goal is to create a tool that helps scouts quickly identify players' best positions while still allowing for human judgment.

## Conclusion

This football analytics project successfully created a model that classifies players as attackers or defenders with 95% accuracy using FIFA game data. By analyzing key attributes like shooting, tackling and passing, it helps identify players' natural positions. While currently limited to two positions and game-based stats, the model provides clubs with a practical scouting tool. Future improvements will add midfield positions and incorporate real match data. The project demonstrates how data analysis can effectively support player development decisions in modern football.