# Data Analysis: Women's Football - FIFA

This notebook contains exploratory analysis and linear regression models that estimate the wages of female football players from FIFA data. Project by Alejandro Galindo Valencia and Carla Moreno Molina.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [None]:
# Load datasets
players_path = "../datos/female_players.csv"
teams_path = "../datos/female_teams.csv"

female_players = pd.read_csv(players_path)
female_teams = pd.read_csv(teams_path)

# Merge datasets by team ID
merged_data = female_players.merge(female_teams, left_on='club_team_id', right_on='team_id', how='inner')
merged_data.head()
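
An inner join silently drops every player whose `club_team_id` has no match in the teams file. The sketch below shows one way to count those rows with `indicator=True` before committing to the inner join. The two small frames are invented for illustration and are not taken from the dataset; only the key column names follow the cell above.

```python
import pandas as pd

# Toy stand-ins for the real CSVs; all values are invented for illustration.
players = pd.DataFrame({
    "short_name": ["A", "B", "C", "D"],
    "club_team_id": [1, 1, 2, 9],  # team 9 does not exist in the teams frame
})
teams = pd.DataFrame({"team_id": [1, 2, 3], "team_name": ["X", "Y", "Z"]})

# A left join with indicator=True labels each row by where its key was found.
check = players.merge(
    teams,
    left_on="club_team_id",
    right_on="team_id",
    how="left",
    indicator=True,
    validate="many_to_one",  # raises if team_id is duplicated in teams
)

merge_counts = check["_merge"].value_counts()
n_unmatched = int((check["_merge"] == "left_only").sum())
print(merge_counts)
print(f"Players an inner join would drop: {n_unmatched}")
```

Running the same check on `female_players` and `female_teams` would show how many players the analysis leaves out.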


### Age distribution of players

In [None]:
plt.figure(figsize=(8, 6))
merged_data['age'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution of Players')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()


### Salary distribution

In [None]:
plt.figure(figsize=(8, 6))
merged_data['wage_eur'].dropna().hist(bins=20, edgecolor='black')
plt.title('Salary Distribution of Players (EUR)')
plt.xlabel('Salary (EUR)')
plt.ylabel('Frequency')
plt.show()


### Most common nationalities

In [None]:
top_nationalities = merged_data['nationality_name_x'].value_counts().head(10)
plt.figure(figsize=(8, 6))
top_nationalities.plot(kind='bar', edgecolor='black')
plt.title('Top 10 Nationalities of Players')
plt.xlabel('Nationality')
plt.ylabel('Number of Players')
plt.show()


### Relationship between performance and salary

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(merged_data['overall_x'], merged_data['wage_eur'], alpha=0.5)
plt.title('Relationship between Overall and Salary')
plt.xlabel('Overall')
plt.ylabel('Salary (EUR)')
plt.show()


## Regression model: Individual variables

In [None]:
features = ['overall_x', 'potential', 'age', 'international_reputation', 'skill_moves']
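# Drop rows with a missing feature or a missing wage so X and y stay aligned
model_data = merged_data[features + ['wage_eur']].dropna()
X = model_data[features]
y = model_data['wage_eur']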

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Regression Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")
print("Model Coefficients:")
for feature, coef in zip(features, regressor.coef_):
    print(f"{feature}: {coef}")


## Regression model: Team-related variables

In [None]:
team_features = ['starting_xi_average_age', 'whole_team_average_age', 'international_prestige', 'domestic_prestige']
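# Drop rows with a missing feature or a missing wage so X_team and y_team stay aligned
team_model_data = merged_data[team_features + ['wage_eur']].dropna()
X_team = team_model_data[team_features]
y_team = team_model_data['wage_eur']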

X_train_team, X_test_team, y_train_team, y_test_team = train_test_split(X_team, y_team, test_size=0.2, random_state=42)

regressor_team = LinearRegression()
regressor_team.fit(X_train_team, y_train_team)

y_team_pred = regressor_team.predict(X_test_team)
mse_team = mean_squared_error(y_test_team, y_team_pred)
r2_team = r2_score(y_test_team, y_team_pred)

print("Team Model Regression Evaluation:")
print(f"Mean Squared Error (MSE): {mse_team}")
print(f"R-squared (R2): {r2_team}")
print("Team Model Coefficients:")
for feature, coef in zip(team_features, regressor_team.coef_):
    print(f"{feature}: {coef}")


## 🧾 Extended Conclusions

The analysis of FIFA women's football data suggests several points relevant to fair investment, player development, and competitive equity:

### ⚽ Player Demographics and Salaries
- Most female players are aged between **20 and 30**, aligning with peak physical performance in sports.
- **Wage distribution** is heavily skewed: while the majority earn lower to mid-range salaries, a small group of elite players earns significantly more. This suggests a **pay gap** even within professional women’s football.
- **National diversity** is limited, with most players coming from a handful of countries such as **France, England, USA, Spain, and Germany**, reflecting unequal development and investment in women’s football across nations.
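
The skew described above can be quantified rather than read off the histogram. The following is a minimal sketch: `wage_skew_summary` is a hypothetical helper, and the wages passed to it are invented for illustration. On the real data, the input would be `merged_data['wage_eur']`.

```python
import numpy as np
import pandas as pd


def wage_skew_summary(wages: pd.Series) -> dict:
    """Summarize how lopsided a wage column is (hypothetical helper)."""
    w = wages.dropna()
    w = w[w > 0]  # the log below is only defined for positive wages
    top_cut = w.quantile(0.90)
    return {
        "skew_raw": float(w.skew()),          # sample skewness of raw wages
        "skew_log": float(np.log(w).skew()),  # skewness after a log transform
        "median": float(w.median()),
        "mean": float(w.mean()),
        # Share of the total wage bill earned at or above the 90th percentile
        "top10_share": float(w[w >= top_cut].sum() / w.sum()),
    }


# Invented wages, used only to exercise the helper.
toy_wages = pd.Series([500, 600, 700, 800, 900, 1000, 1200, 1500, 4000, 20000])
summary = wage_skew_summary(toy_wages)
print(summary)
```

A mean well above the median and a large top-decile share are the usual signs of a right-skewed distribution; a lower skewness after the log transform suggests that modelling log wages may be worth trying.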

### 💸 Regression Analysis — Individual Performance
- The **linear regression model** using player variables (overall, potential, age, reputation, skill moves) achieves an **R² of 0.70** on the held-out test set, meaning these individual factors explain about 70% of wage variance.
- The strongest predictor is **overall performance**, meaning current ability is highly rewarded.
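- Surprisingly, **potential** has a negative coefficient once the other variables are held fixed. This may indicate that clubs prioritize current contribution over future promise, although potential and overall are strongly correlated, so the sign should be read with caution.
- **Age** also has a negative coefficient: at a given overall rating, younger players tend to earn more.
- **International reputation** is associated with a higher wage, which suggests that branding and recognition matter.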
- **Skill moves** contribute positively but modestly.
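
Raw regression coefficients depend on the scale of each variable, so ranking predictors by them can mislead. One common alternative, sketched below, is to standardize the features first so each coefficient is expressed per standard deviation. This is not the notebook's original method: `standardized_coefficients` is a hypothetical helper, and the data is synthetic, generated only to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit a linear model on z-scored features (hypothetical helper)."""
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X, y)
    coefs = model.named_steps["linearregression"].coef_
    # Sort by absolute size so the largest per-standard-deviation effect is first
    return pd.Series(coefs, index=X.columns).sort_values(key=np.abs, ascending=False)


# Synthetic data: skill_moves has the larger raw coefficient (300 vs 100),
# but overall_x varies more, so its standardized effect is larger.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "overall_x": rng.normal(70, 6, n),
    "skill_moves": rng.integers(1, 6, n).astype(float),
})
toy_wage = 100 * toy["overall_x"] + 300 * toy["skill_moves"] + rng.normal(0, 50, n)

std_coefs = standardized_coefficients(toy, toy_wage)
print(std_coefs)
```

Applied to `X` and `y` from the individual-variables model, the same helper would give a scale-free ranking of the five features.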

### 🏟️ Regression Analysis — Team Characteristics
- A second regression using team-related variables (average age, prestige) performed poorly (**R² = 0.048**), meaning these features explain less than 5% of salary differences.
- Neither **domestic** nor **international prestige** had a meaningful effect on individual salaries in this model.
- The conclusion is that **these team characteristics explain little of player pay**: individual attributes matter most.
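
Team-level features take the same value for every player on a team, so a random row split places teammates in both the training and the test set. A group-aware split is one way to check whether the team model's score changes when each team is kept on one side only. The sketch below uses scikit-learn's `GroupShuffleSplit` on an invented frame; grouping by `team_id` is an assumption about how one might apply it here, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Invented data: 10 teams with 5 players each, one team-level feature.
toy = pd.DataFrame({
    "team_id": np.repeat(np.arange(10), 5),
    "domestic_prestige": np.repeat(np.arange(1, 11), 5),
})

# test_size is a fraction of groups (teams), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["team_id"]))

train_teams = set(toy.iloc[train_idx]["team_id"])
test_teams = set(toy.iloc[test_idx]["team_id"])
shared_teams = train_teams & test_teams
print(f"Teams in both splits: {len(shared_teams)}")
```

The returned positional indices can be passed to `.iloc` on `X_team` and `y_team` in place of the `train_test_split` call.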

### 📊 Tactical Attributes by Position
- Radar analysis of player attributes across positions reveals specialization:
  - **Wingers (LW, RW)** excel in pace and dribbling.
  - **Strikers (ST)** dominate in shooting and physical traits.
  - **Midfielders (CM)** have balanced skills, crucial for transitions and ball distribution.
  - **Center backs (CB)** are strongest in defending and physicality, but weakest in offensive traits.
- These profiles reflect the **tactical demands** of each position and **could influence salary valuation** accordingly.
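
The position profiles described above can be computed as per-position attribute means and drawn on a polar axis. The following is a sketch only: the column names `player_positions`, `pace`, `shooting`, `passing`, `dribbling`, `defending` and `physic` are assumed to exist in the players file, both functions are hypothetical helpers, and the toy rows are invented.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumed attribute columns; adjust to the columns actually present in the data.
ATTRS = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]


def position_profiles(df: pd.DataFrame, positions: list) -> pd.DataFrame:
    """Mean attribute values per primary position (hypothetical helper)."""
    # Treat the first listed position as the primary one, e.g. "ST, LW" -> "ST"
    primary = df["player_positions"].str.split(",").str[0].str.strip()
    means = df.assign(primary_position=primary).groupby("primary_position")[ATTRS].mean()
    return means.reindex(positions)


def plot_radar(profiles: pd.DataFrame):
    """Draw one closed polygon per position on a polar axis."""
    angles = np.linspace(0, 2 * np.pi, len(ATTRS), endpoint=False)
    closed = np.append(angles, angles[0])  # repeat the first angle to close the shape
    fig, ax = plt.subplots(figsize=(6, 6), subplot_kw={"projection": "polar"})
    for position, row in profiles.iterrows():
        values = np.append(row.to_numpy(), row.iloc[0])
        ax.plot(closed, values, label=position)
        ax.fill(closed, values, alpha=0.1)
    ax.set_xticks(angles)
    ax.set_xticklabels(ATTRS)
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
    return fig


# Invented rows, used only to exercise the helpers.
toy = pd.DataFrame({
    "player_positions": ["ST", "ST, LW", "CB", "CB, RB"],
    "pace": [80, 84, 60, 64],
    "shooting": [82, 78, 40, 44],
    "passing": [70, 72, 60, 62],
    "dribbling": [79, 81, 55, 57],
    "defending": [35, 39, 80, 82],
    "physic": [75, 77, 81, 79],
})
profiles = position_profiles(toy, ["ST", "CB"])
fig = plot_radar(profiles)
```

With the real data, `position_profiles(merged_data, ["LW", "RW", "ST", "CM", "CB"])` would produce the five profiles discussed above, provided those columns are present.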

### 🏆 Recommendations for Competitive Equity
To address talent concentration and foster equality in global women’s football, a **3-pillar strategic approach** is proposed:

1. **Subsidize Smaller Clubs**  
   Provide conditional grants to less financially capable clubs, especially in emerging leagues, for investment in:
   - Youth academies
   - Training infrastructure
   - Coaching staff

2. **Reinforce National Leagues**  
   Boost local league competitiveness by:
   - Introducing salary caps or foreign player limits
   - Encouraging development of homegrown talent
   - Offering performance-based incentives

3. **Support Regional Outreach**  
   Promote football in underrepresented areas through:
   - School scouting programs
   - Local tournaments
   - Inclusion initiatives in vulnerable communities

These policies, supported by transparent monitoring and evaluation, are intended to promote fair access to resources, reduce concentration of talent, and enhance the visibility and sustainability of women’s football worldwide.
