# Laboratory Exercise: Predicting Champion Picks in the LoL Worlds Final

**Goal:** In this laboratory exercise, you will build a machine learning model capable of predicting whether a League of Legends champion will be selected for the **Season 13 Worlds Final**.
Using detailed performance statistics from **Season 12** and **Season 13**, you will explore, clean, and merge the datasets, construct meaningful features, analyze champion trends, and ultimately train a classification model that identifies which champions are strong enough to appear on the world stage.


## 1. Overview of the Task

You are given two datasets:

- `season_12_data.csv`
- `season_13_data.csv`

Each dataset contains the following features:

- **Name**
  The name of the champion (e.g., *Ahri*, *Garen*, *Lee Sin*).

- **Class**
  The gameplay class or archetype of the champion.
  Possible values include: **Fighter**, **Assassin**, **Mage**, **Marksman**, **Support**, **Tank**.

- **Role**
  The primary lane or position where the champion is played.
  Possible roles: **Top**, **Mid**, **ADC**, **Support**, **Jungle**.

- **Tier**
  The overall performance/strength ranking of the champion for that season.
  Possible tiers: **S+**, **S**, **A**, **B**, **C**, **D**.

- **Trend**
  The performance trend or momentum indicator (e.g., positive/negative score trend across the season).

- **Win %**
  The win rate percentage of the champion in the given role and season (e.g., `51%`).

- **Role %**
  The percentage of games in which this champion is played in the **given role** (e.g., a champion being played top lane 80% of the time).

- **Pick %**
  The pick rate percentage, i.e., how often the champion is selected overall.

- **Ban %**
  The ban rate percentage.
  - This is split by side, e.g., `"Blue (23%) / Red (34%)"`, indicating separate ban rates depending on the team side.

- **KDA**
  The champion's Kill/Death/Assist ratio for the season, conventionally computed as (kills + assists) / deaths.

Your final objective is to build a machine learning model that predicts whether a champion will be selected for the **Season 13 Worlds Final** (fictionally played every two seasons).

To achieve this, you will build a complete data-science pipeline:
- Data loading
- Column renaming
- Full outer merging
- Data cleaning (strings, percentages, KDA formatting)
- Handling missing values
- Exploratory Data Analysis (EDA)
- Feature selection
- Categorical encoding
- Target variable creation
- Train–test splitting
- ML model training
- Final evaluation

## 2. Renaming Columns by Season

Both datasets share identical column names.
To avoid confusion after merging, rename all non-key columns so that their season is clearly indicated.

Keep the following **unchanged**:
- `Name`
- `Role`

Rename all other columns using this format:

- `Win % (season 12)`
- `Win % (season 13)`

This ensures clarity and prevents accidental overwriting during merging or preprocessing.
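One possible way to apply this renaming is sketched below. The helper name `add_season_suffix` and the one-row frame are made up for illustration; in the exercise the frame would come from `pd.read_csv("season_12_data.csv")`.

```python
import pandas as pd

KEYS = ["Name", "Role"]  # merge keys, left unchanged


def add_season_suffix(df: pd.DataFrame, season: int) -> pd.DataFrame:
    """Return a copy with ' (season N)' appended to every non-key column."""
    mapping = {
        col: f"{col} (season {season})"
        for col in df.columns
        if col not in KEYS
    }
    return df.rename(columns=mapping)


# Made-up row standing in for season_12_data.csv
season12_raw = pd.DataFrame(
    {"Name": ["Ahri"], "Role": ["Mid"], "Tier": ["S"], "Win %": ["51%"]}
)
season12 = add_season_suffix(season12_raw, 12)
print(list(season12.columns))
```

The same helper can be called with `13` for the second dataset, so both frames follow one naming scheme.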

## 3. Full Outer Join on `Name` and `Role`

Perform a **full outer merge** between the two datasets using:

- `Name`
- `Role`

Why a full outer join?
- Some champions exist in Season 13 but not in Season 12 (new releases)
- Some champions appear in both seasons but not in the same roles, so a given `Name`/`Role` pair may have features for only one season

A full outer join ensures:
- Champions appearing only in Season 12 are included
- Champions appearing only in Season 13 are included
- Shared champions receive both seasons’ features
- Missing values naturally appear where data is unavailable

After merging:
- Inspect the shape
- Check for missing values
- Show initial descriptive statistics

How to perform the merge:

- `df_full = season12.merge(season13, on=["Name","Role"], how="outer")`
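As a minimal sketch of the merge and the follow-up inspection, the snippet below uses two tiny made-up frames in place of the real renamed datasets. The `indicator=True` argument is an optional addition, not something the exercise requires; it adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

# Made-up stand-ins for the renamed season frames
season12 = pd.DataFrame({
    "Name": ["Ahri", "Garen"],
    "Role": ["Mid", "Top"],
    "Win % (season 12)": ["51%", "50%"],
})
season13 = pd.DataFrame({
    "Name": ["Ahri", "Milio"],
    "Role": ["Mid", "Support"],
    "Win % (season 13)": ["52%", "49%"],
})

df_full = season12.merge(
    season13, on=["Name", "Role"], how="outer", indicator=True
)

print(df_full.shape)                     # rows x columns after the merge
print(df_full["_merge"].value_counts())  # both / left_only / right_only
print(df_full.isna().sum())              # missing values per column
print(df_full.describe(include="all"))   # initial descriptive statistics
```

Rows marked `left_only` or `right_only` are exactly the ones that carry `NaN` for the other season's features.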

## 4. Cleaning Numeric-Like Columns Stored as Strings

Several columns appear numeric but are actually stored as strings. These must be converted into clean, consistent numerical values before any modeling or missing-value handling.

Examples of problematic formats:

- `"78%"` → should become `78.0`
- `"Blue (23%)/Red (34%)"` → Ban rate split by side

### Your tasks:

1. **Strip percentage symbols (`%`)** from all percentage columns.
2. **Handle the Ban % special case**.
   - The `Ban %` column contains values such as: `"Blue (23%)/Red (34%)"`
   - Take the **sum** of the two sides: `23 + 34 = 57`
3. **Convert all cleaned values to `float`**.
4. Ensure that all missing values stay **`NaN`**

This preprocessing step must be completed **before handling missing values**.
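The tasks above could be sketched as follows. The helper names `parse_percent` and `parse_ban` and the sample values are illustrative assumptions; the regular expression assumes every percentage is written as a non-negative number followed by `%`.

```python
import re

import numpy as np
import pandas as pd

# Matches numbers such as "78%" or "1.5 %" and captures the numeric part
PCT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def parse_percent(value):
    """'78%' -> 78.0; missing or unparseable values -> NaN."""
    if pd.isna(value):
        return np.nan
    match = PCT.search(str(value))
    return float(match.group(1)) if match else np.nan


def parse_ban(value):
    """'Blue (23%)/Red (34%)' -> 57.0 (sum of both sides); missing -> NaN."""
    if pd.isna(value):
        return np.nan
    parts = PCT.findall(str(value))
    return float(sum(float(p) for p in parts)) if parts else np.nan


# Made-up sample values
df = pd.DataFrame({
    "Win % (season 12)": ["78%", None, "51.5%"],
    "Ban % (season 12)": ["Blue (23%)/Red (34%)", None, "Blue (1.5%) / Red (2%)"],
})
df["Win % (season 12)"] = df["Win % (season 12)"].map(parse_percent)
df["Ban % (season 12)"] = df["Ban % (season 12)"].map(parse_ban)
print(df)
print(df.dtypes)
```

Missing entries pass through as `NaN`, so the later missing-value analysis still sees them.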

## 5. Handling Missing Values

You must:

- Analyze why values are missing
- Apply appropriate techniques:
  - **Simple Imputation (mean, mode, median)**
  - **Advanced Imputation (MICE, KNN)**
  - **Dropping rows/columns** when justified

Justify each decision.
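As a rough illustration of two of the listed techniques, the sketch below applies median imputation and KNN imputation from scikit-learn to a small made-up numeric frame. The column names and values are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up numeric frame with gaps
df = pd.DataFrame({
    "Win % (season 12)": [50.0, 52.0, np.nan, 48.0],
    "Pick % (season 12)": [5.0, np.nan, 3.0, 4.0],
})

# Share of missing values per column, a starting point for the analysis
print(df.isna().mean())

# Simple imputation: replace each gap with the column median
median_imputer = SimpleImputer(strategy="median")
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

# Advanced imputation: fill each gap from the 2 most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)

print(df_median)
print(df_knn)
```

In a stricter setup the imputer would be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.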

## 6. Exploratory Data Analysis (EDA)

Perform EDA to understand data patterns, distributions, and relationships.

Suggested visualizations:

- Distribution plots (histograms)
- Boxplots for numeric columns
- Correlation heatmap (numeric features)
- Role distribution (Season 12 & 13)
- Tier distribution
- Missing-value visualization heatmap

Use visualizations to justify feature selection and preprocessing choices.
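Before plotting, it can help to compute the tables that the suggested charts would display. The sketch below does this with pandas only, on a small made-up frame; a plotting library of your choice could then render these tables as bar charts or heatmaps.

```python
import numpy as np
import pandas as pd

# Made-up merged frame
df = pd.DataFrame({
    "Role": ["Top", "Mid", "Mid", "ADC"],
    "Tier (season 13)": ["S", "A", "S", np.nan],
    "Win % (season 13)": [50.0, 52.0, 51.0, np.nan],
    "Pick % (season 13)": [4.0, 8.0, 7.0, np.nan],
})

role_counts = df["Role"].value_counts()                          # role distribution
tier_counts = df["Tier (season 13)"].value_counts(dropna=False)  # tier distribution, NaN included
missing_share = df.isna().mean().sort_values(ascending=False)    # basis for a missing-value plot
corr = df.select_dtypes("number").corr()                         # basis for a correlation heatmap

print(role_counts, tier_counts, missing_share, corr, sep="\n\n")
```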

## 7. Feature Selection

Not all features are equally valuable.
You must determine:

- Which numeric features matter most
- Which categorical features are useful
- Whether any columns are redundant or irrelevant

## 8. Creating the Target Column: `World Cup Suitable`

Create a new binary label based on Tier performance:

### A champion is labeled **1 (World Cup Suitable)** if:

- `Tier (season 12)` is **S+**,
  **OR**
- `Tier (season 13)` is **S** or **S+**

Otherwise → label **0**.

This new column will be your **prediction target**.

After generating the label:

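- Decide whether to drop or encode the Tier columns; the label is computed directly from them, so keeping them as features lets a model recover the labeling rule (target leakage)
- Ensure proper categorical encoding if kept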
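The labeling rule can be expressed as a boolean mask, as in the sketch below. The four rows are made up; note that a missing Tier compares as `False`, so a champion absent from one season can still be labeled from the other.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each branch of the rule
df = pd.DataFrame({
    "Name": ["Ahri", "Garen", "Lee Sin", "Milio"],
    "Tier (season 12)": ["S+", "S", "B", np.nan],
    "Tier (season 13)": ["A", "B", "S", "S+"],
})

suitable = (df["Tier (season 12)"] == "S+") | df["Tier (season 13)"].isin(["S", "S+"])
df["World Cup Suitable"] = suitable.astype(int)

print(df)
print(df["World Cup Suitable"].value_counts())  # class balance of the target
```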

## 9. Encoding Categorical Features

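Choose an encoding strategy for each categorical feature:

- **One-Hot Encoding** (suited to unordered categories such as `Role` or `Class`)
- **Ordinal Encoding** (suited to ordered categories such as `Tier`)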
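A minimal sketch of both strategies, using pandas only, is shown below. The `tier_order` mapping and the `encoded` column name are assumptions chosen for illustration; the ordering itself follows the tier list given in the overview.

```python
import pandas as pd

# Made-up rows
df = pd.DataFrame({
    "Role": ["Top", "Mid", "ADC"],
    "Tier (season 13)": ["S+", "B", "D"],
})

# Ordinal encoding: Tier has a natural order D < C < B < A < S < S+
tier_order = {"D": 0, "C": 1, "B": 2, "A": 3, "S": 4, "S+": 5}
df["Tier (season 13) encoded"] = df["Tier (season 13)"].map(tier_order)

# One-hot encoding: Role has no inherent order
df_encoded = pd.get_dummies(df, columns=["Role"], prefix="Role")

print(df_encoded)
```

Values missing from the mapping (including `NaN`) stay `NaN` after `map`, so encoding does not hide missing tiers.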

## 10. Train–Test Split

Split your dataset into:

- **Training set** (80%)
- **Test set** (20%)
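One way to obtain the 80/20 split is scikit-learn's `train_test_split`, sketched below on synthetic data. The `stratify=y` and `random_state=42` arguments are optional choices, not requirements of the exercise; stratifying keeps the class ratio similar in both sets, which matters when the positive class is small.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced label, for illustration only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "win": rng.normal(50, 2, 100),
    "pick": rng.uniform(0, 20, 100),
})
y = pd.Series([1] * 30 + [0] * 70, name="World Cup Suitable")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # positive-class share in each split
```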

## 11. Model Training

Train at least one classification model.

## 12. Model Evaluation

Evaluate your model on the test set using classification metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix).
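As a hedged end-to-end sketch of training and evaluation, the snippet below fits a random forest on synthetic data and reports common classification metrics. The model choice, its hyperparameters, and the synthetic label are assumptions for illustration; any scikit-learn classifier could be substituted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic features and label standing in for the prepared dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" compensates for the smaller positive class
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, digits=3))
```

With an imbalanced target, precision, recall, and F1 for the positive class are usually more informative than accuracy alone.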
