# Predicting Starcraft 2 players
[Kaggle Starcraft II Prediction Challenge](https://www.kaggle.com/c/insa-5if-2018)

## Initialisation
We will use Python 3 with [NumPy](http://www.numpy.org/) for linear algebra, [Pandas](https://pandas.pydata.org/) for data processing and CSV file I/O, and [scikit-learn](https://scikit-learn.org/stable/) for predictions. The environment is defined by the [Kaggle Python docker image](https://github.com/kaggle/docker-python).

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression 
import os

## Reading data
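Let's define some functions to read data from the CSV files. Each line describes one game: we keep the race in one column and store the game's actions as a single comma-separated string in another. Training lines also start with the player's battle.net id, which we return separately.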

In [None]:
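def read_train(file):
    """Read the training file; return (DataFrame of race/actions, Series of player ids)."""
    ids = []
    data = []
    with open(file,'r') as f:
        for line in f:
            # Each line is: battlenet_id, race, action_1, action_2, ...
            line_data = line.replace('\n','').split(',')
            battlenet_id, race, actions = line_data[0],line_data[1],line_data[2:]
            ids.append(battlenet_id)
            # Keep the whole action sequence as one comma-separated string
            data.append([race, ", ".join(actions)])
    df = pd.DataFrame(data, columns=['race','actions'])
    series = pd.Series(ids)
    return df, series

def read_test(file):
    """Read the test file (same layout, without the player id); return a DataFrame."""
    data = []
    with open(file,'r') as f:
        for line in f:
            # Each line is: race, action_1, action_2, ...
            line_data = line.replace('\n','').split(',')
            race, actions = line_data[0],line_data[1:]
            data.append([race, ", ".join(actions)])
    df = pd.DataFrame(data, columns=['race','actions'])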
    return df
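
To make the line layout concrete, here is a minimal sketch of how one training line splits into its parts. The sample line (URL, race and actions) is invented for illustration and is not taken from the dataset.

```python
# Hypothetical training line: player URL, race, then the action sequence.
sample_line = "http://example.invalid/player/1/,Protoss,hotkey10,hotkey12,t5,hotkey10,t10\n"

# Same splitting logic as read_train: drop the newline, split on commas.
fields = sample_line.replace("\n", "").split(",")
battlenet_id, race, actions = fields[0], fields[1], fields[2:]

print(battlenet_id)
print(race)
print(actions)
```

A test line would look the same without the leading URL, so the race sits at index 0 and the actions start at index 1.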
        

Input data files are available in the `../input/` directory.
Any results written to the current directory are saved as output.

In [None]:
train_data, train_ids = read_train('../input/train.csv/TRAIN.CSV')
train_data.head()

In [None]:
train_ids.head()

In [None]:
test_data = read_test('../input/test.csv/TEST.CSV')
test_data.head()

## First look at the data

### Race distribution

In [None]:
races = train_data['race'].value_counts()
print(races)
plt = races.plot.bar()
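
Raw counts can be easier to compare as proportions. The sketch below uses a small invented series (`toy_races`) in place of `train_data['race']`; the values are not from the dataset.

```python
import pandas as pd

# Toy stand-in for train_data['race']; the values are invented.
toy_races = pd.Series(["Protoss", "Zerg", "Terran", "Zerg"])

# normalize=True returns fractions that sum to 1 instead of raw counts.
race_share = toy_races.value_counts(normalize=True)
print(race_share)
```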

### Number of games per player

In [None]:
ids = train_ids.value_counts()
print(ids.describe())
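
Beyond the summary statistics, it can help to see how many players have only a handful of games, since those classes are harder to learn. This is a sketch on invented ids; `min_games` is a hypothetical threshold, not a value used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for train_ids; the player names are invented.
toy_ids = pd.Series(["player_a", "player_b", "player_a", "player_c", "player_a"])

games_per_player = toy_ids.value_counts()

min_games = 2  # hypothetical threshold
rare_players = games_per_player[games_per_player < min_games]
print(len(rare_players), "of", len(games_per_player), "players have fewer than", min_games, "games")
```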

### List of actions (hotkeys)

In [None]:
def find_unique_actions(df):
    unique_actions = set()
    actions = df['actions']
    for action in actions:
        for a in action.split(','):
            a = a.strip()
            if a!='' and a[0]!='t' and 'hotkey' in a:
                unique_actions.add(a)
    return unique_actions

unique_hotkeys = sorted(list(find_unique_actions(train_data)))
print(unique_hotkeys)
    

## Defining features
We define the features as the player's race plus the number of times each hotkey is used in the first 10 seconds of the game, that is, every hotkey action that appears before the `t10` marker.
Let's try this approach by writing a function to generate those features.

We use dummy variables (one-hot encoding) to convert the race string into integer columns.

In [None]:
def generate_features(df):
    """Build one row per game: the race plus a count per hotkey before 't10'."""
    features = []
    hotkeys = unique_hotkeys
    set_hotkeys = set(hotkeys)  # set for fast membership tests
    for index, row in df.iterrows():
        race = row["race"]
        actions = row["actions"]
        # Start every hotkey at zero so all rows share the same columns
        hotkeys_count = {hotkey:0 for hotkey in hotkeys}
        for action in actions.split(','):
            action = action.strip()
            if action == 't10':
                # Stop at the 10-second marker
                break
            elif action in set_hotkeys:
                hotkeys_count[action]+=1
        current = [race, *[hotkeys_count[hotkey] for hotkey in hotkeys]]
        features.append(current)
    new_df = pd.DataFrame(features, columns=['race', *hotkeys])
    return new_df
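
As a worked example of the counting rule, the sketch below applies the same loop to a single invented action string. `toy_actions` and `known_hotkeys` are stand-ins for one `actions` value and for `unique_hotkeys`.

```python
from collections import Counter

# Invented action string in the same comma-separated style as the 'actions' column.
toy_actions = "hotkey10, hotkey12, t5, hotkey12, t10, hotkey10"
known_hotkeys = {"hotkey10", "hotkey12"}  # stand-in for unique_hotkeys

counts = Counter()
for action in toy_actions.split(","):
    action = action.strip()
    if action == "t10":
        # Everything after the 10-second marker is ignored.
        break
    if action in known_hotkeys:
        counts[action] += 1

print(dict(counts))
```

The final `hotkey10` comes after `t10`, so it is not counted: the result is one `hotkey10` and two `hotkey12`.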

In [163]:
train_features = generate_features(train_data)
train_features = pd.get_dummies(train_features, columns = ['race']) # convert race to integers
train_features.head()

Unnamed: 0,hotkey00,hotkey01,hotkey02,hotkey10,hotkey11,hotkey12,hotkey20,hotkey21,hotkey22,hotkey30,hotkey31,hotkey32,hotkey40,hotkey41,hotkey42,hotkey50,hotkey51,hotkey52,hotkey60,hotkey61,hotkey62,hotkey70,hotkey71,hotkey72,hotkey80,hotkey81,hotkey82,hotkey90,hotkey91,hotkey92,race_Protoss,race_Terran,race_Zerg
0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,1,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [162]:
test_features = generate_features(test_data)
test_features = pd.get_dummies(test_features, columns = ['race']) # convert race to integers
test_features.head()

Unnamed: 0,hotkey00,hotkey01,hotkey02,hotkey10,hotkey11,hotkey12,hotkey20,hotkey21,hotkey22,hotkey30,hotkey31,hotkey32,hotkey40,hotkey41,hotkey42,hotkey50,hotkey51,hotkey52,hotkey60,hotkey61,hotkey62,hotkey70,hotkey71,hotkey72,hotkey80,hotkey81,hotkey82,hotkey90,hotkey91,hotkey92,race_Protoss,race_Terran,race_Zerg
0,1,0,12,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,12,0,0,0,0,0,0,1,0,0,0,0,1
1,1,0,0,0,0,0,0,0,0,1,0,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,2,0,2,1,0,0,1,0,0,0,0,0,1,0,8,1,0,0,1,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
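
In the outputs above, the train and test frames end up with the same columns. Because `pd.get_dummies` is applied to each frame separately, that is not guaranteed in general: a race missing from one file would leave its column out. One possible guard, sketched here on invented toy frames, is to reindex the test columns on the training columns.

```python
import pandas as pd

# Toy frames; the test frame deliberately lacks two of the races.
toy_train = pd.get_dummies(pd.DataFrame({"race": ["Protoss", "Terran", "Zerg"]}), columns=["race"])
toy_test = pd.get_dummies(pd.DataFrame({"race": ["Zerg", "Zerg"]}), columns=["race"])

# Match the training layout; columns absent from the test frame are filled with 0.
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)
print(list(toy_test_aligned.columns))
```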


## Predicting players

### First approach: Logistic regression

In [167]:
classifier = LogisticRegression(random_state=0, solver='liblinear', multi_class='auto')
classifier.fit(train_features,train_ids)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=0, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [168]:
predicted_ids = classifier.predict(test_features)
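
The test set has no labels, so the only local way to estimate accuracy is to hold out part of the training data. The sketch below shows the idea on synthetic data: `X`, `y` and the two "players" are invented, and this is not a step the notebook itself performs. On the real data, `stratify` would fail for players with a single game, so those would need separate handling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic hotkey counts: two invented players with different habits.
X = np.vstack([
    rng.poisson(lam=[5, 1], size=(50, 2)),
    rng.poisson(lam=[1, 5], size=(50, 2)),
])
y = np.array(["player_a"] * 50 + ["player_b"] * 50)

# Keep 20% of the labelled games aside for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(random_state=0, solver="liblinear")
clf.fit(X_fit, y_fit)

# score() returns the mean accuracy on the held-out games.
val_accuracy = clf.score(X_val, y_val)
print(val_accuracy)
```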

In [171]:
predicted_ids_df = pd.DataFrame(predicted_ids, columns=['prediction'])
predicted_ids_df.index = range(1,len(predicted_ids_df)+1)
predicted_ids_df.index.name = 'RowId'
predicted_ids_df.head()

RowId,prediction
1,http://eu.battle.net/sc2/en/profile/3538115/1/...
2,http://eu.battle.net/sc2/en/profile/2896854/1/...
3,http://eu.battle.net/sc2/en/profile/3973341/1/...
4,http://eu.battle.net/sc2/en/profile/250458/1/V...
5,http://eu.battle.net/sc2/en/profile/950504/1/G...


## Generate the output file

In [None]:
predicted_ids_df.to_csv('out.csv')

Let's check the first lines of the file:

In [172]:
def read_csv_head(path):
    with open(path, 'r') as f:
        c = 0
        for line in f:
            print(line)
            c+=1
            if c==10:
                break
read_csv_head('./out.csv')

RowId,prediction

1,http://eu.battle.net/sc2/en/profile/3538115/1/Golden/

2,http://eu.battle.net/sc2/en/profile/2896854/1/MǂForGG/

3,http://eu.battle.net/sc2/en/profile/3973341/1/yoeFWSan/

4,http://eu.battle.net/sc2/en/profile/250458/1/VortiX/

5,http://eu.battle.net/sc2/en/profile/950504/1/Grubby/

6,http://eu.battle.net/sc2/en/profile/2896854/1/MǂForGG/

7,http://eu.battle.net/sc2/en/profile/4234852/1/First/

8,http://eu.battle.net/sc2/en/profile/884897/1/LiquidSnute/

9,http://eu.battle.net/sc2/en/profile/2526293/1/Krr/
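
As an alternative to reading the file back from disk, the CSV layout can be checked in memory. This sketch writes a small invented frame (`toy_predictions`) to a `StringIO` buffer and inspects the resulting lines.

```python
import io
import pandas as pd

# Toy predictions with the same index setup as predicted_ids_df.
toy_predictions = pd.DataFrame({"prediction": ["player_a", "player_b"]})
toy_predictions.index = range(1, len(toy_predictions) + 1)
toy_predictions.index.name = "RowId"

# to_csv accepts any file-like object, so nothing is written to disk.
buffer = io.StringIO()
toy_predictions.to_csv(buffer)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)  # ['RowId,prediction', '1,player_a', '2,player_b']
```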

