# Supervised learning case study
## Dataset information

* Dataset scraped from Badminton World Federation (BWF) website
* Each record is a specific match played at a professional tournament, with the data split up into disciplines
* The purpose of the analysis is to identify whether any checkpoints in matches, such as the mid-game interval at 11 points, can be used to predict the outcome.

Variable | Description
---------|------------
tournament | The name of the tournament
city | The city where the tournament is played
country | The country where the tournament is played
date | The date on which the match is played
tournament_type | The type of this tournament (Super 100 - Super 1000)
discipline | The discipline of this match (useful when merging different datasets)
round | The round of this match in the tournament
winner | Number (1 or 2) specifying if team 1 or team 2 won the match (0 if one of the teams retired)
nb_sets | Number of sets played
retired | Boolean stating if a team retired
game_i_score | Final score of game i
team_one_players | Name of the player of team one
team_two_players | Name of the player of team two
team_one_nationalities | Abbreviation of the nationality of the player of team one
team_two_nationalities | Abbreviation of the nationality of the player of team two
team_one_total_points | Total number of points scored by team one during the match
team_two_total_points | Total number of points scored by team two during the match
team_one_most_consecutive_points | Most consecutive points scored by team one during the match
team_two_most_consecutive_points | Most consecutive points scored by team two during the match
team_one_game_points | Total number of game points team one had during the match
team_two_game_points | Total number of game points team two had during the match
team_one_most_consecutive_points_game_i | Most consecutive points scored by team one during game i
team_two_most_consecutive_points_game_i | Most consecutive points scored by team two during game i
team_one_game_points_game_i | Total number of game points team one had during game i
team_two_game_points_game_i | Total number of game points team two had during game i
game_i_scores | List of all score changes during game i


# Import libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns
import plotly.express as px

In [3]:
%matplotlib inline

In [4]:
pd.set_option('display.max_columns', None)

### Preparing the data
The disciplines of badminton are split into two groups: singles and doubles. The datasets share the majority of columns; however, the doubles datasets feature two additional columns for the two additional players in each game. As we are not concerned with directly comparing players, the player name and nationality columns will be dropped and comparisons will be between 'Team 1' and 'Team 2'.

In [None]:
# Read all disciplines into respective DataFrames
ms = pd.read_csv('/content/drive/MyDrive/Badminton/ms.csv')
ws = pd.read_csv('/content/drive/MyDrive/Badminton/ws.csv')
wd = pd.read_csv('/content/drive/MyDrive/Badminton/wd.csv')
md = pd.read_csv('/content/drive/MyDrive/Badminton/md.csv')
xd = pd.read_csv('/content/drive/MyDrive/Badminton/xd.csv')

# Check head

# Remove names and nationalities from the datasets (for doubles, please remove information about both players)

# Concatenate all disciplines into one dataframe

# Reset the index
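
The cleaning steps listed above could be sketched as follows, using two tiny made-up DataFrames in place of the real CSV files. The player and nationality column names follow the variable table, but the name of the extra doubles column (`team_one_player_two`) is purely hypothetical, so the sketch selects columns by pattern rather than by exact name. The helper `drop_identity_columns` is also an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for a singles and a doubles file (all values are made up)
ms = pd.DataFrame({
    'discipline': ['MS', 'MS'],
    'winner': [1, 2],
    'team_one_players': ['A', 'B'],
    'team_two_players': ['C', 'D'],
    'team_one_nationalities': ['DEN', 'JPN'],
    'team_two_nationalities': ['INA', 'CHN'],
})
md = pd.DataFrame({
    'discipline': ['MD'],
    'winner': [1],
    'team_one_players': ['E'],
    'team_one_player_two': ['F'],  # hypothetical name for an extra doubles column
    'team_two_players': ['G'],
    'team_one_nationalities': ['KOR'],
    'team_two_nationalities': ['MAS'],
})


def drop_identity_columns(frame):
    """Drop every column whose name mentions players or nationalities."""
    identity_cols = frame.filter(regex='player|nationalit').columns
    return frame.drop(columns=identity_cols)


# Concatenate the cleaned frames; ignore_index=True also resets the index
df = pd.concat([drop_identity_columns(d) for d in (ms, md)], ignore_index=True)
print(df.head())
```

Passing `ignore_index=True` to `pd.concat` has the same effect here as calling `reset_index(drop=True)` afterwards.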

### Understanding the data
Now that we have one DataFrame containing all matches across all disciplines, the next step is to understand the data to identify features of interest, gaps in the data, and any areas to clean or process.

In [None]:
# The shape of the data

## Data pre-processing
### Missing value imputation

The dataset has a significant number of Null values, which can be explained by considering the dynamics of a badminton match. The columns with the most Null values relate to game three, which is only played if each team has won one game. Consequently, a Null value for game three indicates that either a team won the match in two straight games or a team retired.

# Reassign the DataFrame with retired matches filtered out and drop the column as it is no longer needed (filter retired == False and then drop the column)


# Check na

By removing matches involving a retirement, the Null values present in game two have been removed, as well as a portion of those in game three. There are still a significant number of Null values in game three, which we can verify are 'valid' Null values because the match was completed in two straight games.

If this assumption is proved true, the Null values will be replaced with either '0' or '0-0' depending on the column.

# Return rows where three games have been played (nb_sets == 3) yet the 'game_3_score' is Null (should return 0)


# Replace 'valid' Null values with appropriate values
* game_3_score: fillna with '0-0'
* team_one_most_consecutive_points_game_3, team_two_most_consecutive_points_game_3, team_one_game_points_game_3, team_two_game_points_game_3: fillna with 0

# Recheck null values for game_2_scores and game_1_scores and remove them for simplicity

# Reset the index
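
A minimal sketch of the missing-value steps above, run on a four-row toy DataFrame rather than the real data. Only a subset of the columns is shown, and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nb_sets': [2, 3, 1, 2],
    'retired': [False, False, True, False],
    'game_1_score': ['21-15', '21-19', '11-4', None],
    'game_2_score': ['21-18', '17-21', None, '21-10'],
    'game_3_score': [None, '21-12', None, None],
    'team_one_game_points_game_3': [np.nan, 3.0, np.nan, np.nan],
})

# Keep completed matches only, then drop the now-constant column
df = df[df['retired'] == False].drop(columns='retired')

# Sanity check: a three-game match should never have a missing game three score
suspicious = df[(df['nb_sets'] == 3) & (df['game_3_score'].isna())]
print(len(suspicious))

# Fill the 'valid' Nulls left by unplayed third games
df['game_3_score'] = df['game_3_score'].fillna('0-0')
df['team_one_game_points_game_3'] = df['team_one_game_points_game_3'].fillna(0)

# Drop any remaining rows with a missing game one or game two score
df = df.dropna(subset=['game_1_score', 'game_2_score']).reset_index(drop=True)
print(df)
```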

# Feature engineering


# Split the 'game_i_score' column into six new columns to extract each team's individual score per game
* hint: df[['t1_pts_g1', 't2_pts_g1']] = df.game_1_score.str.split('-', expand=True)
* Apply the same logic to game_2_score and game_3_score

# Find the winner of each game based on previously created score columns
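* hint: df['g1_winner'] = np.where(df['t1_pts_g1'].astype(int) > df['t2_pts_g1'].astype(int), 1, 2) (cast to int first: the split columns hold strings, and strings compare character by character, so '9' > '21' would be True)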
* Apply the same logic to create new columns g2_winner and g3_winner
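
One detail worth checking when applying the split-and-compare steps above: `str.split` returns strings, and strings compare character by character rather than numerically. The sketch below, on three made-up scores, contrasts a string comparison with an integer comparison. The variable `string_winner` exists only to show the pitfall.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'game_1_score': ['21-9', '9-21', '21-19']})

# Split 'a-b' into two columns; the values are still strings at this point
df[['t1_pts_g1', 't2_pts_g1']] = df['game_1_score'].str.split('-', expand=True)

# Pitfall: comparing the raw strings is lexicographic, so '9' > '21' is True
string_winner = np.where(df['t1_pts_g1'] > df['t2_pts_g1'], 1, 2)

# Casting to int first gives the intended numeric comparison
df[['t1_pts_g1', 't2_pts_g1']] = df[['t1_pts_g1', 't2_pts_g1']].astype(int)
df['g1_winner'] = np.where(df['t1_pts_g1'] > df['t2_pts_g1'], 1, 2)

print(string_winner.tolist())
print(df['g1_winner'].tolist())
```

Storing the scores as integers at this stage also avoids repeating `astype(int)` in the later point-difference steps.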

# Function to find the first 'instance' of 11 points being scored, signalling the interval

* For example ['9-10', '9-11'] => detect the first time 11 points scored

In [None]:
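def interval_score(array):
    '''
    Return the first score in a single game at which a team has reached
    11 points (the mid-game interval), or None if no such score exists.
    '''
    for a in array:
        if '11' in a:
            # Strip the quotes, brackets and spaces left over from the stringified list
            return a.strip("'] [ ")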

# Extract the score at the mid-game interval from the list of each game's scores.


In [None]:
df['g1_int'] = df['game_1_scores'].apply(lambda x: interval_score(x.split(',')) if isinstance(x, str) else '0-0')
df['g2_int'] = df['game_2_scores'].apply(lambda x: interval_score(x.split(',')) if isinstance(x, str) else '0-0')
df['g3_int'] = df['game_3_scores'].apply(lambda x: interval_score(x.split(',')) if isinstance(x, str) else '0-0')
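
As a quick illustration of what the interval extraction returns, the sketch below runs the same logic on one made-up game. It assumes, as the `split(',')` and `strip` calls above suggest, that each `game_i_scores` cell is a list stored as a string. The sample score sequences are invented.

```python
def interval_score(array):
    """Return the first score containing an 11, or None if there is none."""
    for a in array:
        if '11' in a:
            # Remove quotes, brackets and spaces left over from the stringified list
            return a.strip("'] [ ")


# One game stored as a stringified list of score changes (abbreviated)
raw = "['0-0', '1-0', '5-4', '9-10', '9-11', '10-11', '21-19']"
first_interval = interval_score(raw.split(','))
print(first_interval)  # 9-11

# A game that stopped before either side reached 11 yields None
short_game = "['0-0', '1-0', '2-0']"
print(interval_score(short_game.split(',')))  # None
```

The `None` case is one possible source of the Null values that are rechecked later in the notebook.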

In [None]:
df.head()

# Split each 'gi_int' interval score column into two new columns (six in total) to extract each team's individual score at each game's interval
* hint: df[['t1_pts_g1_int', 't2_pts_g1_int']] = df.g1_int.str.split('-', expand=True)
* Apply the same logic to g2_int and g3_int

# Check head

# Find the 'winner' of each interval based on previously created interval score columns
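* hint: df['g1_int_winner'] = np.where(pd.to_numeric(df['t1_pts_g1_int'], errors='coerce') > pd.to_numeric(df['t2_pts_g1_int'], errors='coerce'), 1, 2) (convert to numbers first: the split columns hold strings, which compare character by character)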
* apply the same logic to create new columns: g2_int_winner and g3_int_winner

# Replace the erroneous entries introduced above with 0 or '0-0', where appropriate, when a game three was not played


In [None]:
df.loc[(df.nb_sets==2), 'g3_winner']=0
df.loc[(df.nb_sets==2), 'g3_int']='0-0'
df.loc[(df.nb_sets==2), 't1_pts_g3_int']=0
df.loc[(df.nb_sets==2), 't2_pts_g3_int']=0
df.loc[(df.nb_sets==2), 'g3_int_winner']=0

In [None]:
df.head()

# From this point, to simplify the dataset and identify which features may correlate with a win, the analysis is centered on 'Team 1'. If they win, the value is '1', else '0'. This way we do not have to create two rows for each match played to account for one team winning and one losing
* hint: replace winner = 1 with 1, else = 0

In [None]:
df['win'] = np.where(df['winner']==1, 1, 0)

# Calculate relative point difference for each game
* hint: df['g1_pt_dif'] = np.where(df['g1_winner']==1, df['t2_pts_g1'].astype(int) / df['t1_pts_g1'].astype(int),
                           df['t1_pts_g1'].astype(int) / df['t2_pts_g1'].astype(int))
* Apply the same logic to create new columns: g2_pt_dif and g3_pt_dif

# Recheck null values probably due to some previous steps
* fill na with 0 for t1_pts_g2_int, t2_pts_g2_int, t1_pts_g3_int, t2_pts_g3_int

# Calculate relative point difference at each interval
* hint: df['g1_int_pt_dif'] = np.where(df['g1_int_winner']==1, df['t2_pts_g1_int'].astype(int) / df['t1_pts_g1_int'].astype(int),
                           df['t1_pts_g1_int'].astype(int) / df['t2_pts_g1_int'].astype(int))
* Apply the same logic to create other columns: g2_int_pt_dif and g3_int_pt_dif

# Replace any Null values introduced with 0 due to no game three being played (g3_pt_dif, g3_int_pt_dif)
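
The relative point difference from the hints above, together with the Null replacement for an unplayed game three, could look like the sketch below. The helper name `relative_point_difference` and the toy values are assumptions for illustration. The column names `t1_pts_g3` and `t2_pts_g3` are inferred from the naming pattern used for game one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    't1_pts_g1': [21, 15], 't2_pts_g1': [14, 21], 'g1_winner': [1, 2],
    't1_pts_g3': [0, 21], 't2_pts_g3': [0, 18], 'g3_winner': [0, 1],
})


def relative_point_difference(frame, game):
    """Loser's points divided by winner's points for one game, as in the hint."""
    t1 = frame[f't1_pts_g{game}'].astype(int)
    t2 = frame[f't2_pts_g{game}'].astype(int)
    return np.where(frame[f'g{game}_winner'] == 1, t2 / t1, t1 / t2)


df['g1_pt_dif'] = relative_point_difference(df, 1)

# For an unplayed game three both scores are 0, so 0 / 0 produces NaN
df['g3_pt_dif'] = relative_point_difference(df, 3)
df['g3_pt_dif'] = df['g3_pt_dif'].fillna(0)

print(df[['g1_pt_dif', 'g3_pt_dif']])
```

With this definition the ratio is at most 1, and values close to 1 indicate a tight game.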


# Only concerned with 'Team 1' so most consecutive points per game is t1's most consecutive points


In [None]:
# Only concerned with 'Team 1' so most consecutive points per game is t1's most consecutive points
df['g1_mst_con_pts'] = df['team_one_most_consecutive_points_game_1']
df['g2_mst_con_pts'] = df['team_one_most_consecutive_points_game_2']
df['g3_mst_con_pts'] = df['team_one_most_consecutive_points_game_3']

# Import MinMaxScaler to normalise the newly created features


In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Normalise game point difference: g1_pt_dif, g2_pt_dif, g3_pt_dif


In [None]:
# Normalise interval point difference: g1_int_pt_dif, g2_int_pt_dif, g3_int_pt_dif


In [None]:
# Normalise most consecutive points per game: g1_mst_con_pts, g2_mst_con_pts, g3_mst_con_pts
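
A minimal sketch of one of the normalisation steps, assuming the three game point difference columns already exist. The values are made up. The same pattern would apply to the interval and consecutive-point columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'g1_pt_dif': [0.50, 0.75, 1.00],
    'g2_pt_dif': [0.20, 0.60, 0.40],
    'g3_pt_dif': [0.00, 0.90, 0.00],
})

cols = ['g1_pt_dif', 'g2_pt_dif', 'g3_pt_dif']
scaler = MinMaxScaler()

# fit_transform rescales each column independently to the [0, 1] range
df[cols] = scaler.fit_transform(df[cols])
print(df)
```

Each column is scaled using its own minimum and maximum, so a value of 1 means "largest observed in that column", not the same raw value across columns.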



A significant number of new features have been created by expanding on the existing data and normalising it. Each row still represents a single match, though the statistics are now centered around only one team. This way, the statistics surrounding point distribution and checkpoints within matches should prove more insightful and may reveal correlation with winning the match.

Consequently, many of the original features can now be removed from the dataset.

# Keep ['win', 'g1_pt_dif', 'g2_pt_dif', 'g3_pt_dif', 'g1_int_pt_dif', 'g2_int_pt_dif',    'g3_int_pt_dif', 'g1_mst_con_pts', 'g2_mst_con_pts', 'g3_mst_con_pts'] for df

### Data exploration
As all features were extracted from original features which were then removed, exploratory data analysis was not considered necessary previously. A brief examination of some of the new features is useful before beginning analysis.

# Check correlation

In [None]:
corr=df.corr() # gives us the correlation values

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot = False, cmap="BuPu")  # let's visualise the correlation matrix
plt.title('Correlation heatmap of new features')
plt.show()

# Recheck null values and remove them if present

# Reset the index

# Create x and y for test and train sets for analysis (y: win)

# Train test split

In [None]:
# Import the scikit-learn function that randomly splits data into train and test sets
from sklearn.model_selection import train_test_split

# Divide the dataset into train and test sets for both x and y; test_size=0.2 holds out 20% of the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
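
A self-contained sketch of the split on a ten-row toy DataFrame. The feature values are invented, and `random_state=0` is an added assumption that only serves to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'win': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'g1_pt_dif': [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.5],
    'g1_int_pt_dif': [0.2, 0.8, 0.1, 0.9, 0.4, 0.6, 0.3, 0.7, 0.5, 0.4],
})

# Features are every column except the target 'win'
x = df.drop(columns='win')
y = df['win']

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)
```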

In [None]:
# Import the scikit-learn function used to cross-validate models on the training data
from sklearn.model_selection import cross_val_score

# Build several models and report your best model
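
One possible way to compare models with `cross_val_score` is sketched below. The choice of logistic regression and a decision tree, the hyperparameters, and the synthetic data are all assumptions made for illustration. In the notebook, `x_train` and `y_train` would come from the split above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for nine engineered features (random values, not match data)
x_train = rng.random((200, 9))
# Synthetic label loosely tied to the first feature so there is a signal to learn
y_train = (x_train[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(max_depth=3, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate model
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, x_train, y_train, cv=5, scoring='accuracy')
    mean_scores[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')

best_model = max(mean_scores, key=mean_scores.get)
print('best:', best_model)
```

Cross-validating on the training set keeps the test set untouched, so it can still give an unbiased estimate for whichever model is reported as best.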