# Hello mister data scientist 

Imagine that we work at a large analytics company and our new task is to model the probability of the Counter-Terrorist (**CT** for short) team winning a round of Counter-Strike: Global Offensive (**CSGO** for short).

The rules of the game are simple: there are two teams, terrorists and counter-terrorists, each consisting of 5 players. At the start of each round, every player buys weapons, armor and other equipment, and the objective is to eliminate every member of the opposing team.

To read more about the game visit the official website: https://blog.counter-strike.net/index.php/about/

This esport is very popular, and our analytics company is trying to break into the gaming market with a very accurate model that will be shown on TV, on gaming streams and in other places.

If we define: 

$$ \mathbb{Y}_{i} \in \{0, 1\}, \forall i = 1, ..., n$$

$$ \mathbb{X}_{i} \in \mathbb{R}^{p}, \forall i = 1, ..., n$$

Where

$i$ - index of an observation.

$n$ - total number of observations.

$p$ - number of features.

Then we are trying to create a model for the probability:

$$P(\mathbb{Y} = 1 | \mathbb{X}) \in (0, 1)$$

The function $f$ that links $\mathbb{X}$ to $\mathbb{Y}$ is the machine learning model which we are trying to build:

$$ f: \mathbb{X} \rightarrow \mathbb{Y} $$

Because we are trying to predict which of two categories an observation falls into, the machine learning model $f$ can be called a *classifier*.
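
To make the link between the modeled probability and the predicted category concrete, here is a minimal sketch. It assumes a decision threshold of 0.5 and uses made-up probabilities; neither the `classify` helper nor the numbers come from the actual model built later.

```python
def classify(probabilities, threshold=0.5):
    """Map predicted P(Y = 1 | X) values to class labels 0 or 1.

    The threshold of 0.5 is an assumption made for illustration only.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities of a CT win for four observations
example_probabilities = [0.91, 0.35, 0.50, 0.08]

# Observations at or above the threshold are labeled 1 (CT win), the rest 0
example_labels = classify(example_probabilities)
print(example_labels)
```

A different threshold can be chosen when one type of error is more costly than the other.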

We roll up our sleeves and start working on each of the steps: EDA, data splitting, model creation and validation.

# Python package imports 

The first thing that any developer or ML practitioner does is load the packages which are installed on their machine.

In [4]:
# Data reading 
import pandas as pd 

# Main modeling class
import xgboost as xgb 

# Data splitting 
from sklearn.model_selection import train_test_split

# Reading data 

Finding, cleaning and labelling data is usually a long and painful process. This is not the main emphasis of this book, so let's imagine that we have already spent months creating the beautiful dataset which we will read.

The original dataset can be found here: https://www.kaggle.com/christianlillelund/csgo-round-winner-classification

In [6]:
# Using pandas to read a csv file 
d = pd.read_csv("data/data.csv")

# Printing the shape of data 
print(f"Number of observations: {d.shape[0]}")
print(f"Number of features: {d.shape[1]}")

Number of observations: 122410
Number of features: 97
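
Note that `d.shape[1]` counts every column in the file, including the label column `round_winner` that is stored alongside the inputs. The sketch below illustrates the distinction on a hypothetical three-row CSV; the values are invented and only the column names are borrowed from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the values are made up for illustration
csv_text = """time_left,ct_score,t_score,round_winner
175.0,0,0,CT
156.03,0,0,CT
96.03,0,0,T
"""

# read_csv accepts a file-like object as well as a path
toy = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_columns = toy.shape

# Model inputs are all columns except the label column
feature_columns = [c for c in toy.columns if c != "round_winner"]

print(f"Rows: {n_rows}, columns: {n_columns}, features: {len(feature_columns)}")
```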


In [10]:
# Getting the feature names 
d.columns.values

array(['time_left', 'ct_score', 't_score', 'map', 'bomb_planted',
       'ct_health', 't_health', 'ct_armor', 't_armor', 'ct_money',
       't_money', 'ct_helmets', 't_helmets', 'ct_defuse_kits',
       'ct_players_alive', 't_players_alive', 'ct_weapon_ak47',
       't_weapon_ak47', 'ct_weapon_aug', 't_weapon_aug', 'ct_weapon_awp',
       't_weapon_awp', 'ct_weapon_bizon', 't_weapon_bizon',
       'ct_weapon_cz75auto', 't_weapon_cz75auto', 'ct_weapon_elite',
       't_weapon_elite', 'ct_weapon_famas', 't_weapon_famas',
       'ct_weapon_g3sg1', 't_weapon_g3sg1', 'ct_weapon_galilar',
       't_weapon_galilar', 'ct_weapon_glock', 't_weapon_glock',
       'ct_weapon_m249', 't_weapon_m249', 'ct_weapon_m4a1s',
       't_weapon_m4a1s', 'ct_weapon_m4a4', 't_weapon_m4a4',
       'ct_weapon_mac10', 't_weapon_mac10', 'ct_weapon_mag7',
       't_weapon_mag7', 'ct_weapon_mp5sd', 't_weapon_mp5sd',
       'ct_weapon_mp7', 't_weapon_mp7', 'ct_weapon_mp9', 't_weapon_mp9',
       'ct_weapon_negev', 't_

In [12]:
# Displaying a snippet of data
print(d.head())

   time_left  ct_score  t_score       map  bomb_planted  ct_health  t_health  \
0     175.00       0.0      0.0  de_dust2         False      500.0     500.0   
1     156.03       0.0      0.0  de_dust2         False      500.0     500.0   
2      96.03       0.0      0.0  de_dust2         False      391.0     400.0   
3      76.03       0.0      0.0  de_dust2         False      391.0     400.0   
4     174.97       1.0      0.0  de_dust2         False      500.0     500.0   

   ct_armor  t_armor  ct_money  ...  t_grenade_flashbang  \
0       0.0      0.0    4000.0  ...                  0.0   
1     400.0    300.0     600.0  ...                  0.0   
2     294.0    200.0     750.0  ...                  0.0   
3     294.0    200.0     750.0  ...                  0.0   
4     192.0      0.0   18350.0  ...                  0.0   

   ct_grenade_smokegrenade  t_grenade_smokegrenade  \
0                      0.0                     0.0   
1                      0.0                     2.0

A short description of the data from the Kaggle source:

*The dataset consists of round snapshots from about 700 demos from high level tournament play in 2019 and 2020. Warmup rounds and restarts have been filtered, and for the remaining live rounds a round snapshot has been recorded every 20 seconds until the round is decided. Following its initial publication, It has been pre-processed and flattened to improve readability and make it easier for algorithms to process. The total number of snapshots is 122411. **Snapshots are i.i.d and should be treated as individual data points**, not as part of a match.*

The column that will be used to create the $\mathbb{Y}$ variable is **round_winner**. If the CT team won the round, the value of $\mathbb{Y}$ will be 1, and 0 otherwise.

In [13]:
# Creating the Y variable 
d['Y'] = [1 if x == 'CT' else 0 for x in d['round_winner']]
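
As a quick sanity check of this encoding, the sketch below applies it to a hypothetical five-row frame, compares it with a vectorized pandas alternative, and computes the share of each class. The toy values are invented; only the `round_winner` column name and the `'CT'` label come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the real dataset
toy = pd.DataFrame({"round_winner": ["CT", "T", "T", "CT", "CT"]})

# Same encoding as the list comprehension above: 1 if CT won, 0 otherwise
toy["Y"] = [1 if x == "CT" else 0 for x in toy["round_winner"]]

# Vectorized alternative: compare the whole column at once, then cast to int
y_vectorized = (toy["round_winner"] == "CT").astype(int)
labels_match = bool((toy["Y"] == y_vectorized).all())

# Share of each class; a strongly skewed share would matter for validation
class_share = toy["Y"].value_counts(normalize=True)
print(labels_match)
print(class_share)
```

Both approaches should produce the same labels; the vectorized form is usually faster on large frames.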

# Exploratory Data Analysis 