<a href="https://colab.research.google.com/github/SamAbr/FPL-Squad-Selection/blob/main/FPL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🏆 Fantasy Premier League (FPL) Data Collection

### 📘 Description

This section collects live Fantasy Premier League (FPL) data using the **official FPL public API**.  
The API provides comprehensive, real-time information about players, teams, fixtures, and gameweeks, which will form the foundation for building an **FPL team selection and performance optimization model**.

The goal of this step is to load the key FPL data tables directly into memory as **Pandas DataFrames** and take a first look at them. These DataFrames are used later for data analysis, feature engineering, and optimization.

---

### 🌐 Data Source

**API Base URL:**  
`https://fantasy.premierleague.com/api/`

**Main Endpoints Used:**
| Endpoint | Description |
|-----------|--------------|
| `/bootstrap-static/` | Returns metadata on all players, teams, and gameweeks. |
| `/fixtures/` | Provides all Premier League fixtures and difficulty ratings. |

The API is open and does not require authentication.  
Data is collected using simple HTTP GET requests via Python's `requests` library.

---

### 🧾 Collected DataFrames

| DataFrame | Description | Key Columns |
|------------|--------------|--------------|
| `players_df` | Contains player-level statistics and attributes. | `web_name`, `team`, `element_type`, `price_m`, `form`, `total_points`, `ict_index` |
| `teams_df` | Contains information about all Premier League teams. | `id`, `name`, `strength_overall_home`, `strength_overall_away` |
| `positions_df` | Provides mappings for player positions. | `id`, `singular_name` (e.g., Goalkeeper, Defender) |
| `fixtures_df` | Contains fixture list with home/away teams and difficulty ratings. | `event`, `team_h`, `team_a`, `team_h_difficulty`, `team_a_difficulty` |
| `events_df` | Stores metadata for each gameweek (past and upcoming). | `id`, `name`, `deadline_time`, `finished` |

---
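
As a rough illustration of how the fixture columns listed above can be used, the sketch below averages fixture difficulty per team. The rows are made up for the example, and it assumes that `team_h_difficulty` is the rating faced by the home team and `team_a_difficulty` the rating faced by the away team.

```python
import pandas as pd

# Made-up fixtures that reuse the column names of fixtures_df (not real API data).
fixtures = pd.DataFrame({
    "event": [1, 1, 2, 2],
    "team_h": [1, 3, 2, 4],
    "team_a": [2, 4, 3, 1],
    "team_h_difficulty": [3, 2, 4, 5],
    "team_a_difficulty": [4, 3, 2, 2],
})

# Build one row per (team, fixture) by stacking the home and away views.
home = fixtures[["event", "team_h", "team_h_difficulty"]].rename(
    columns={"team_h": "team", "team_h_difficulty": "difficulty"}
)
away = fixtures[["event", "team_a", "team_a_difficulty"]].rename(
    columns={"team_a": "team", "team_a_difficulty": "difficulty"}
)
long_fixtures = pd.concat([home, away], ignore_index=True)

# Lower average difficulty suggests an easier run of fixtures.
avg_difficulty = long_fixtures.groupby("team")["difficulty"].mean().sort_values()
print(avg_difficulty)
```

On the real `fixtures_df`, the same reshaping could be restricted to upcoming gameweeks by filtering on `event` first.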

### ⚙️ Technical Notes

- Data is fetched using HTTPS requests directly from the official FPL API.
- Player costs (`now_cost`) are stored in tenths of a million; converted to millions (`price_m`) for clarity.
- All data is stored **in-memory** as Pandas DataFrames (no CSV export required).
- The API data is updated daily by the FPL system and reflects the latest prices, points, and form.
- The script includes a short delay between requests to prevent overloading the server.
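
The cost conversion can be tried offline on a tiny stand-in for the API payload. The dictionary below is hand-written for illustration: it only mimics the `elements` list of `/bootstrap-static/`, and the `now_cost` values are inferred from the prices shown in the notebook preview (5.8 and 6.6), not fetched from the API.

```python
import pandas as pd

# Hand-written stand-in for a slice of the bootstrap payload (illustrative values).
sample_bootstrap = {
    "elements": [
        {"id": 1, "web_name": "Raya", "now_cost": 58},
        {"id": 5, "web_name": "Gabriel", "now_cost": 66},
    ]
}

# A list of dicts becomes one DataFrame row per player.
sample_players = pd.DataFrame(sample_bootstrap["elements"])

# now_cost is in tenths of a million, so dividing by 10 gives millions.
sample_players["price_m"] = sample_players["now_cost"] / 10
print(sample_players)
```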


In [6]:
# Loading important libraries
import requests
import pandas as pd
import time

BASE = "https://fantasy.premierleague.com/api"

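def fetch_json(url, sleep=0.5, timeout=10):
    """Fetch JSON from the FPL API, then pause briefly to be polite to the server."""
    r = requests.get(
        url,
        headers={"User-Agent": "colab-fpl-analyzer/1.0"},
        timeout=timeout,  # fail instead of hanging on a stalled connection
    )
    r.raise_for_status()  # raise on 4xx/5xx responses
    time.sleep(sleep)
    return r.json()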

# Fetch bootstrap data
bootstrap = fetch_json(f"{BASE}/bootstrap-static/")

# Extract useful tables
players_df = pd.DataFrame(bootstrap["elements"])
teams_df = pd.DataFrame(bootstrap["teams"])
positions_df = pd.DataFrame(bootstrap["element_types"])
events_df = pd.DataFrame(bootstrap["events"])

# Convert cost units (tenths of millions → millions)
players_df["price_m"] = players_df["now_cost"] / 10

# Keep only important columns
keep_cols = [
    "id","web_name","team","element_type","price_m",
    "total_points","minutes","goals_scored","assists",
    "clean_sheets","form","ict_index","selected_by_percent"
]
players_df = players_df[keep_cols]

# Fetch fixtures
fixtures_df = pd.DataFrame(fetch_json(f"{BASE}/fixtures/"))

print("✅ Data loaded into memory successfully.")
print("Players shape:", players_df.shape)
print("Teams shape:", teams_df.shape)
print("Fixtures shape:", fixtures_df.shape)

# Display a quick preview
players_df.head()


✅ Data loaded into memory successfully.
Players shape: (748, 13)
Teams shape: (20, 21)
Fixtures shape: (380, 17)


Unnamed: 0,id,web_name,team,element_type,price_m,total_points,minutes,goals_scored,assists,clean_sheets,form,ict_index,selected_by_percent
0,1,Raya,1,1,5.8,52,900,0,0,7,6.0,16.8,31.5
1,2,Arrizabalaga,1,1,4.2,0,0,0,0,0,0.0,0.0,0.5
2,3,Hein,1,1,4.0,0,0,0,0,0,0.0,0.0,0.3
3,4,Setford,1,1,4.0,0,0,0,0,0,0.0,0.0,0.2
4,5,Gabriel,1,2,6.6,80,900,1,2,7,11.0,42.8,41.4


## Feature Engineering

### Objective

The purpose of this step is to derive additional, insightful metrics from the raw FPL data that can better capture each player's **value**, **form**, and **potential**.  
These engineered features will help our model and optimization process make smarter decisions when selecting the optimal FPL team.

---

### 🔍 Key Engineered Features

| Feature | Formula / Description | Intuition |
|----------|-----------------------|------------|
| `points_per_million` | `total_points / price_m` | Measures cost-effectiveness: how many FPL points a player has produced per £1.0m. |
| `form_numeric` | Numeric conversion of the `form` column (originally a string) | Enables quantitative comparisons between players. |
| `team_name` | Merged from `teams_df` using the `team` ID | Gives readable team names (e.g., Arsenal, Liverpool). |
| `position` | Mapped from `element_type` via `positions_df` | Shows the player's position (Goalkeeper, Defender, Midfielder, Forward). |
| `value_index` | `(form_numeric * points_per_million)` | Combines form and cost-efficiency to rank high-performing, undervalued players. |


In [7]:
# --- Feature Engineering ---
# Convert form to numeric
players_df["form_numeric"] = players_df["form"].astype(float)

# Compute cost-effectiveness
players_df["points_per_million"] = players_df["total_points"] / players_df["price_m"]

# Merge readable team names
players_df = players_df.merge(
    teams_df[["id", "name"]],
    left_on="team",
    right_on="id",
    how="left"
).rename(columns={"name": "team_name"}).drop(columns=["id_y"]).rename(columns={"id_x": "id"})

# Merge position names
players_df = players_df.merge(
    positions_df[["id", "singular_name"]],
    left_on="element_type",
    right_on="id",
    how="left"
).rename(columns={"singular_name": "position"}).drop(columns=["id_y"]).rename(columns={"id_x": "id"})

# Create a simple value metric
players_df["value_index"] = players_df["form_numeric"] * players_df["points_per_million"]

# Sort by best performing value players
players_features_df = players_df.sort_values(by="value_index", ascending=False)

# Display top 10
players_features_df.head(10)[["web_name", "team_name", "position", "price_m", "form_numeric", "points_per_million", "value_index"]]

Unnamed: 0,web_name,team_name,position,price_m,form_numeric,points_per_million,value_index
4,Gabriel,Arsenal,Defender,6.6,11.0,12.121212,133.333333
648,Van de Ven,Spurs,Defender,4.8,8.3,11.875,98.5625
282,Guéhi,Crystal Palace,Defender,5.0,6.0,12.8,76.8
634,Mukiele,Sunderland,Defender,4.2,6.7,11.190476,74.97619
19,Rice,Arsenal,Midfielder,6.8,7.7,9.264706,71.338235
245,James,Chelsea,Defender,5.5,8.7,8.181818,71.181818
506,Casemiro,Man Utd,Midfielder,5.5,9.0,7.636364,68.727273
7,J.Timber,Arsenal,Defender,6.1,6.0,10.819672,64.918033
6,Calafiori,Arsenal,Defender,5.8,6.0,10.344828,62.068966
476,Haaland,Man City,Forward,14.8,9.0,6.621622,59.594595
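
The top row of this ranking can be reproduced by hand, which is a quick check on the formulas. The sketch below rebuilds Gabriel's row from the values shown in the earlier preview (80 points, price 6.6, form 11.0). The `form` value is written as a string here to mirror the string-to-float conversion used in the notebook.

```python
import pandas as pd

# Single-row frame with values copied from the notebook preview.
row = pd.DataFrame({
    "web_name": ["Gabriel"],
    "total_points": [80],
    "price_m": [6.6],
    "form": ["11.0"],
})

row["form_numeric"] = row["form"].astype(float)                        # 11.0
row["points_per_million"] = row["total_points"] / row["price_m"]       # 80 / 6.6
row["value_index"] = row["form_numeric"] * row["points_per_million"]   # 11.0 * 12.12...

print(row[["web_name", "points_per_million", "value_index"]].round(6))
```

Because `value_index` multiplies the two inputs, a player needs both good recent form and good season-long points per million to rank highly.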


## 🧩 Data Cleaning and Preprocessing

In this step, we clean the `players_features_df` dataset to prepare it for analysis.  
This dataset already includes important FPL performance indicators such as:
- `total_points`, `minutes`, `form`, and `ict_index` for performance tracking.  
- `price_m`, `points_per_million`, and `value_index` for value-based analysis.  
- `team_name` and `position` for grouping and visualization.  

We will:
1. Keep the most relevant columns.
2. Ensure correct data types.
3. Rename and reorder columns for clarity.

### 🧩 Data Wrangling Function

To ensure reproducibility and maintain a clean workflow, the data cleaning and preprocessing steps are wrapped in a single function called `clean_fpl_data()`.  

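This function:
1. Selects the most relevant columns from the raw dataset (skipping any that are absent).  
2. Renames `web_name`, `team_name`, and `price_m` to `player_name`, `team`, and `price_million`.  
3. Converts numeric fields to the proper data type.  
4. Handles missing values by filling them with zero.  
5. Reorders the columns and returns a clean, analysis-ready DataFrame.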
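
The type-conversion and fill steps can be seen in isolation on a small made-up frame. This is only a sketch: the `"n/a"` and `None` entries are invented to show what `errors="coerce"` does and are not values observed in the FPL data.

```python
import pandas as pd

# Made-up raw values: numbers stored as strings, plus two unparseable entries.
raw = pd.DataFrame({
    "form": ["6.0", "n/a", None],
    "ict_index": ["16.8", "0.0", "42.8"],
})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
cleaned = raw.apply(pd.to_numeric, errors="coerce")

# Count the coerced values before filling, so silent data loss is visible.
n_missing = int(cleaned.isna().sum().sum())

cleaned = cleaned.fillna(0)
print(f"{n_missing} value(s) replaced with 0")
print(cleaned)
```

One trade-off worth keeping in mind: after `fillna(0)`, a missing value can no longer be told apart from a genuine zero, so counting the gaps beforehand may be useful.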


In [16]:
def clean_fpl_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Cleans and preprocesses the Fantasy Premier League (FPL) player dataset.
    """

    # Select relevant columns
    cols = [
        'web_name', 'team_name', 'position',
        'price_m', 'total_points', 'minutes',
        'goals_scored', 'assists', 'clean_sheets',
        'form', 'ict_index', 'selected_by_percent',
        'form_numeric', 'points_per_million', 'value_index'
    ]

    # Keep only available ones (safe selection)
    cols = [c for c in cols if c in df.columns]
    fpl_df = df[cols].copy()

    # Rename columns for consistency
    rename_map = {
        'web_name': 'player_name',
        'team_name': 'team',
        'position': 'position',
        'price_m': 'price_million'
    }
    fpl_df.rename(columns=rename_map, inplace=True)

    # Convert numeric columns
    numeric_cols = [
        'price_million', 'total_points', 'minutes', 'goals_scored',
        'assists', 'clean_sheets', 'form', 'ict_index',
        'selected_by_percent', 'form_numeric', 'points_per_million', 'value_index'
    ]
    numeric_cols = [c for c in numeric_cols if c in fpl_df.columns]
    fpl_df[numeric_cols] = fpl_df[numeric_cols].apply(pd.to_numeric, errors='coerce')

    # Fill missing values
    fpl_df.fillna(0, inplace=True)

    # Reorder for readability; form_numeric is left out here because it
    # duplicates the now-numeric form column
    ordered_cols = [
        'player_name', 'team', 'position', 'price_million',
        'total_points', 'minutes', 'goals_scored', 'assists', 'clean_sheets',
        'form', 'ict_index', 'selected_by_percent', 'points_per_million', 'value_index'
    ]
    ordered_cols = [c for c in ordered_cols if c in fpl_df.columns]
    fpl_df = fpl_df[ordered_cols]

    print(f"✅ Cleaned dataset ready: {fpl_df.shape[0]} players × {fpl_df.shape[1]} columns")
    return fpl_df

In [18]:
fpl_df = clean_fpl_data(players_features_df)
fpl_df.head()

✅ Cleaned dataset ready: 748 players × 14 columns


Unnamed: 0,player_name,team,position,price_million,total_points,minutes,goals_scored,assists,clean_sheets,form,ict_index,selected_by_percent,points_per_million,value_index
4,Gabriel,Arsenal,Defender,6.6,80,900,1,2,7,11.0,42.8,41.4,12.121212,133.333333
648,Van de Ven,Spurs,Defender,4.8,57,889,3,0,4,8.3,39.3,32.6,11.875,98.5625
282,Guéhi,Crystal Palace,Defender,5.0,64,900,1,3,4,6.0,43.5,34.3,12.8,76.8
634,Mukiele,Sunderland,Defender,4.2,47,720,1,0,3,6.7,32.6,5.5,11.190476,74.97619
19,Rice,Arsenal,Midfielder,6.8,63,803,2,4,6,7.7,63.6,16.9,9.264706,71.338235
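
Since `position` is kept for grouping, one possible next step is to pick the best-value player in each position. The sketch below uses five rows copied from the outputs above rather than the full `fpl_df`; the variable names `sample` and `best_per_position` are introduced here for illustration only.

```python
import pandas as pd

# A few rows copied from the notebook outputs (subset of columns).
sample = pd.DataFrame({
    "player_name": ["Gabriel", "Van de Ven", "Rice", "Casemiro", "Haaland"],
    "position": ["Defender", "Defender", "Midfielder", "Midfielder", "Forward"],
    "value_index": [133.333333, 98.5625, 71.338235, 68.727273, 59.594595],
})

# Sort by value, then keep the first (highest) row within each position.
best_per_position = (
    sample.sort_values("value_index", ascending=False)
    .groupby("position", sort=False)
    .head(1)
    .reset_index(drop=True)
)
print(best_per_position)
```

Replacing `head(1)` with `head(n)` would give a shortlist of `n` candidates per position.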
