### Context

League of Legends is a MOBA (multiplayer online battle arena) where 2 teams (blue and red) face off. There are 3 lanes, a jungle, and 5 roles. The goal is to take down the enemy Nexus to win the game.

### Content

This dataset contains stats from the first 10 minutes of approx. 10k ranked games (SOLO QUEUE) at high ELO (DIAMOND I to MASTER). Players are at roughly the same skill level.

Each game is unique. The gameId can help you to fetch more attributes from the Riot API.

There are 19 features per team (38 in total) collected after 10min in-game. This includes kills, deaths, gold, experience, level… It's up to you to do some feature engineering to get more insights.

The column blueWins is the target value (the value we are trying to predict). A value of 1 means the blue team has won. 0 otherwise.

As far as I know, there are no missing values.

### Glossary

- Warding totem: An item that a player can put on the map to reveal the nearby area. Very useful for map/objectives control.
- Minions: NPC that belong to both teams. They give gold when killed by players.
- Jungle minions: NPC that belong to NO TEAM. They give gold and buffs when killed by players.
- Elite monsters: Monsters with high hp/damage that give a massive bonus (gold/XP/stats) when killed by a team.
- Dragons: Elite monster which gives team bonus when killed. The 4th dragon killed by a team gives a massive stats bonus. The 5th dragon (Elder Dragon) offers a huge advantage to the team.
- Herald: Elite monster which gives stats bonus when killed by the player. It helps to push a lane and destroys structures.
- Towers: Structures you have to destroy to reach the enemy Nexus. They give gold.
- Level: Champion level. Start at 1. Max is 18.

### Importing major libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data=pd.read_csv('../input/league-of-legends-diamond-ranked-games-10-min/high_diamond_ranked_10min.csv')
data.head()

In [None]:
# Checking the shape of the data
data.shape

### Basic EDA & Preprocessing the data

In [None]:
# Checking null values
data.isnull().sum().sum()

In [None]:
# checking data types of the columns
data.info()

In [None]:
#checking for quasi constants
data.nunique()

As seen above, no column holds a single value throughout, so there are no constant columns to drop.
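`nunique()` only catches columns that are strictly constant. A quasi-constant column is one where a single value dominates almost every row. The sketch below shows one way to flag such columns. The helper name `quasi_constant_columns`, the 99% threshold, and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def quasi_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers at least `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the most common value in this column
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy frame with made-up values (not the real dataset)
toy = pd.DataFrame({
    "constant": [1] * 200,
    "almost_constant": [0] * 199 + [1],
    "varied": list(range(200)),
})
flagged = quasi_constant_columns(toy)
print(flagged)
```

On the real data this would be called as `quasi_constant_columns(data)`; the threshold is a judgment call and may need adjusting.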

#### Importing required libraries

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

In [None]:
X=data.drop(['blueWins', 'gameId'], axis=1)
y=data['blueWins']

The gameId column is dropped because it is only an identifier and has a different value in every row.
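One quick way to confirm that a column behaves like an identifier is to check whether all of its values are distinct. This is a minimal sketch on a toy frame with made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "gameId": [101, 102, 103, 104],
    "blueKills": [5, 7, 5, 3],
    "blueWins": [1, 0, 1, 0],
})

# A column looks like a pure identifier if every row carries a distinct value
id_like = [col for col in toy.columns if toy[col].is_unique]
print(id_like)
```

Treat the result as a hint only: a continuous numeric feature can also happen to be unique in every row.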

#### Splitting the data into train and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# Method 1: Feature Selection using different methods and checking with different models

## Feature Selection using Feature importance of Random Forest Classifier

In [None]:
sel_rf=SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1))

sel_rf.fit(X_train, y_train)

In [None]:
sel_rf.get_support()

As you can see, many features are marked False, indicating that they are less important than the others.
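The boolean mask is easier to read once it is mapped back to column names. The sketch below does that on synthetic data generated with `make_classification`; the `feat_*` column names are placeholders, not the real features. For a tree ensemble, `SelectFromModel` with its default settings keeps the features whose importance is at least the mean importance.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for X_train / y_train (not the real dataset)
X_arr, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=3, random_state=1)
X_toy = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(10)])

selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=1))
selector.fit(X_toy, y_toy)

# Boolean mask -> readable lists of kept and dropped column names
mask = selector.get_support()
kept = X_toy.columns[mask].tolist()
dropped = X_toy.columns[~mask].tolist()
print("Kept:", kept)
print("Dropped:", dropped)
```

In the notebook the same idea would read `X_train.columns[sel_rf.get_support()]`.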

#### How many features remain after above procedure

In [None]:
print("Total number of features in the database: ", len(X_train.columns))
print("Total number of features after removing according to RF feature importances: ", sel_rf.get_support().sum())
print("Total features removed: ", int(len(X_train.columns)-sel_rf.get_support().sum()))

#### Let's transform the data now and check the accuracy

In [None]:
X_train_rfc=sel_rf.transform(X_train)
X_test_rfc=sel_rf.transform(X_test)

In [None]:
# Let's check the shape of the data now to confirm that they have 16 features now
X_train_rfc.shape, X_test_rfc.shape

### Let's create a function with RandomForest and Gradient Boost Classifier, once we find the best classifier, we can further fine tune it using hyperparameter tuning

In [None]:
def classifier_model(X_train, X_test, y_train, y_test, method, data):
    # Random Forest on the given split
    rf_clf=RandomForestClassifier(n_estimators=1000, random_state=1)
    rf_clf.fit(X_train, y_train)
    y_pred_rf=rf_clf.predict(X_test)
    score_rf=accuracy_score(y_test, y_pred_rf)
    print("---Feature Selection method: {}---".format(method))
    print("---Checking Accuracy with {}---".format(data))
    print("The accuracy score of Random Forest:", score_rf)

    # Gradient Boosting on the same split
    gb_clf=GradientBoostingClassifier(n_estimators=1000, random_state=1)
    gb_clf.fit(X_train, y_train)
    y_pred_gb=gb_clf.predict(X_test)
    score_gb=accuracy_score(y_test, y_pred_gb)
    # Report the Gradient Boosting score
    print("The accuracy score of Gradient Boosting:", score_gb)

### Accuracy with Reduced features

In [None]:
classifier_model(X_train_rfc, X_test_rfc, y_train, y_test, "Random Forest Feature importance", "Reduced Features")

### Accuracy with all features

In [None]:
classifier_model(X_train, X_test, y_train, y_test, "Random Forest Feature importance", "All Features")

##### As you can see above, accuracy dropped after feature removal, so let's try another method to reduce the feature space
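`classifier_model` scores each configuration on a single train/test split, so small accuracy differences may be noise. A hedged way to check is k-fold cross-validation, which gives a mean and a spread per configuration. The sketch below uses synthetic data and smaller forests to stay fast; it is an illustration of the comparison, not the notebook's own procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real features and target
X_toy, y_toy = make_classification(n_samples=400, n_features=12,
                                   n_informative=4, random_state=1)

# Configuration 1: classifier on all features
all_features = RandomForestClassifier(n_estimators=50, random_state=1)

# Configuration 2: selection + classifier in one pipeline, so the
# selector is refit inside every fold and never sees the held-out part
reduced = make_pipeline(
    SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=1)),
    RandomForestClassifier(n_estimators=50, random_state=1),
)

scores_all = cross_val_score(all_features, X_toy, y_toy, cv=5, scoring="accuracy")
scores_reduced = cross_val_score(reduced, X_toy, y_toy, cv=5, scoring="accuracy")
print(f"All features:     {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"Reduced features: {scores_reduced.mean():.3f} +/- {scores_reduced.std():.3f}")
```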

## Feature Selection using Recursive Feature Elimination (RFE)

In [None]:
sel_rfe=RFE(RandomForestClassifier(n_estimators=100, random_state=1),n_features_to_select=20)
sel_rfe.fit(X_train, y_train)


In [None]:
# Total features selected:
sel_rfe.get_support().sum()

In [None]:
#### Let's transform the data now;
X_train_rfe=sel_rfe.transform(X_train)
X_test_rfe=sel_rfe.transform(X_test)

### Let's run the classifiers now

### Accuracy with reduced features

In [None]:
classifier_model(X_train_rfe, X_test_rfe, y_train, y_test, "Recursive feature elimination with RF", "Reduced Features")

### Accuracy with All features

In [None]:
classifier_model(X_train, X_test, y_train, y_test, "Recursive feature elimination with RF", "All Features")

### Recursive Feature Elimination using Gradient Boosting

In [None]:
sel_rfe_gb=RFE(GradientBoostingClassifier(n_estimators=100, random_state=1), n_features_to_select=22)
sel_rfe_gb.fit(X_train, y_train)

X_train_rfe_gb=sel_rfe_gb.transform(X_train)
X_test_rfe_gb=sel_rfe_gb.transform(X_test)

    

## Let's run the model


### Accuracy with reduced features

In [None]:
classifier_model(X_train_rfe_gb, X_test_rfe_gb, y_train, y_test, "Recursive feature elimination with GB", "Reduced Features")

## The Gradient Boosting algorithm had the highest accuracy. Now let's check how many features give the best accuracy

In [None]:
for index in range(14,39):
    sel_rfe_gb=RFE(GradientBoostingClassifier(n_estimators=100, random_state=1), n_features_to_select=index)
    sel_rfe_gb.fit(X_train, y_train)

    X_train_rfe_gb=sel_rfe_gb.transform(X_train)
    X_test_rfe_gb=sel_rfe_gb.transform(X_test)
    
    clf_gb=GradientBoostingClassifier(n_estimators=200, random_state=1)
    clf_gb.fit(X_train_rfe_gb, y_train)
    y_pred_gb=clf_gb.predict(X_test_rfe_gb)
    score_gb=accuracy_score(y_test, y_pred_gb)
    print("Number of features: ", index)
    print("Accuracy: ", score_gb)
    print()

### It is clear from the output above that the best number of features is 16:
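Instead of reading the best count off the printed output, the loop's accuracies can be stored in a dictionary and the best count picked programmatically. In the sketch below the accuracies are made-up placeholders, not results from this notebook, and `scores_by_n` is a hypothetical name. Ties are broken in favour of fewer features.

```python
# Hypothetical accuracies keyed by number of features. These numbers are
# made up for illustration; in the notebook they would be filled inside the
# loop with scores_by_n[index] = score_gb
scores_by_n = {14: 0.71, 15: 0.72, 16: 0.73, 17: 0.73, 18: 0.72}

# Highest accuracy wins; on a tie, prefer the smaller feature count
best_n = min(scores_by_n, key=lambda n: (-scores_by_n[n], n))
print("Best number of features:", best_n)
print("Accuracy:", scores_by_n[best_n])
```

Because the count is chosen using test-set accuracy, the score reported for that count is likely to be somewhat optimistic.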

#### Now let's transform the data to keep only 16 features and then tune the model on them

In [None]:
sel_rfe_gb_new=RFE(GradientBoostingClassifier(n_estimators=1000, random_state=1), n_features_to_select=16)
sel_rfe_gb_new.fit(X_train, y_train)

X_train_final=sel_rfe_gb_new.transform(X_train)
X_test_final=sel_rfe_gb_new.transform(X_test)

## GRADIENT BOOST CLASSIFIER

### Checking with reduced and important features

In [None]:
gb_clf_1=GradientBoostingClassifier(n_estimators=400, random_state=1)

gb_clf_1.fit(X_train_final, y_train)
y_pred_gb_1=gb_clf_1.predict(X_test_final)

score_gb_1=accuracy_score(y_test, y_pred_gb_1)

print("Accuracy:" ,score_gb_1)

In [None]:
params_grid_gb={'n_estimators' : [100,200,400,600,1000,1200],
                'min_samples_split': [100,200,300,400],
                'min_samples_leaf' : [10,20,30,40,60,100],
                'max_depth' : [2,4,6,8],
                'learning_rate' : [0.01, 0.05, 0.1, 0.5, 1, 5, 10]
               }

In [None]:
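from sklearn.model_selection import RandomizedSearchCV

# Randomized search samples a subset of the grid (10 combinations by default) instead of trying all of them
gridsearch_gb=RandomizedSearchCV(estimator=GradientBoostingClassifier(random_state=1), param_distributions=params_grid_gb, cv=5, scoring='accuracy', random_state=1)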


In [None]:
gridsearch_gb.fit(X_train_final, y_train)

In [None]:
gridsearch_gb.best_score_

In [None]:
gridsearch_gb.best_params_

In [None]:
#### Checking accuracy on Test set
y_pred_final_gb=gridsearch_gb.predict(X_test_final)

print("Accuracy of GBM with reduced features on test set", accuracy_score(y_test, y_pred_final_gb))

### checking the model with all features

In [None]:
gridsearch_gb.fit(X_train, y_train)

In [None]:
gridsearch_gb.best_score_

In [None]:
gridsearch_gb.best_params_

In [None]:
#### Checking accuracy on Test set
y_pred_final_gb_all=gridsearch_gb.predict(X_test)

print("Accuracy of GBM with all features on test set", accuracy_score(y_test, y_pred_final_gb_all))

### The scores above are the maximum accuracy achieved with feature selection alone. Now let's do some feature engineering to improve the accuracy further

# Method 2: Performing feature Engineering and checking with different models now

In [None]:
data.head()

In [None]:
# Let's calculate the difference of values b/w Blue and red teams in all the columns

In [None]:
cols=[x[4:] for x in data.columns if "blue" in x and x[4:]!= 'Wins']
cols

These are the columns for which we need to take the difference between the Blue and Red teams.
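As a small worked example of the differencing idea, the sketch below builds `Diff_` columns on a toy frame. The values are made up, and the column names are assumed to follow the dataset's `blue*`/`red*` naming pattern.

```python
import pandas as pd

# Toy rows with made-up values; names follow the blue*/red* pattern
toy = pd.DataFrame({
    "blueKills": [6, 3],
    "redKills": [4, 7],
    "blueTowersDestroyed": [1, 0],
    "redTowersDestroyed": [0, 0],
})

# Suffixes shared by both teams, taken from the blue-prefixed columns
suffixes = [c[len("blue"):] for c in toy.columns if c.startswith("blue")]

# One Diff_ column per suffix: a positive value means blue is ahead
diffs = pd.DataFrame({f"Diff_{s}": toy[f"blue{s}"] - toy[f"red{s}"] for s in suffixes})
print(diffs)
```

Each pair of team columns collapses into one signed feature, which halves the feature count while keeping the relative advantage.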

In [None]:
# Below columns to be dropped  because they are already the difference of blue and red
cols_to_drop=['GoldDiff', 'ExperienceDiff']
final_cols=[x for x in cols if x not in cols_to_drop]

In [None]:
final_cols

In [None]:
data_new=pd.DataFrame()

for col in final_cols:
    data_new[f'Diff_{col}'] =data[f'blue{col}']-data[f'red{col}']

    

In [None]:
# Keeping values corresponding to only Red in ['GoldDiff', 'ExperienceDiff'] i.e redGoldDiff and redExperienceDiff
for col_ in cols_to_drop:
    data_new[col_]=data[f'red{col_}']

In [None]:
data_new.head()

In [None]:
# Now split the dataset into train and test
X_new=data_new
y_new=data['blueWins']
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.2, random_state=1, stratify=y_new)

In [None]:
### Create 2 dataset tuples so the models can be run easily on both the new (feature engineered) and the old (original) dataset

# Original data
dataset_1=(X_train, X_test, y_train, y_test, 'dataset_1')

# Feature engineered data
dataset_2=(X_train_new, X_test_new, y_train_new, y_test_new, 'dataset_2')

#### Create a function to test different classifiers

In [None]:
def run_classifier(model, dataset):
    model.fit(dataset[0], dataset[2])
    y_pred=model.predict(dataset[1])
    score_=accuracy_score(dataset[3], y_pred)
    # Format as a percentage with 2 decimals (avoids float artifacts such as 72.82999999999999%)
    return f'{score_ * 100:.2f}%'

### A quick run on different algorithms

In [None]:
model_dict={ 'Decision Tree' : DecisionTreeClassifier(max_depth=6,random_state=1),
            'Random Forest' : RandomForestClassifier(n_estimators=100, random_state=1),
           'Support Vector Classification': SVC(random_state=1), 
           'Gaussian Naive Bayes': GaussianNB(),
           'Gradient Boosting Classifier': GradientBoostingClassifier(random_state=1),
           'XG Boost Classifier': XGBClassifier()
                 
          }

### On original dataset

In [None]:
for model in model_dict:
    print(f'model:{model} -accuracy: {run_classifier(model_dict[model],dataset_1)}')

### On feature Engineered dataset

In [None]:
for model in model_dict:
    print(f'model:{model} -accuracy: {run_classifier(model_dict[model],dataset_2)}')