# random forest kobe bryant

This project is inspired by Siraj Raval's [video](https://www.youtube.com/watch?v=QHOazyP-YlM) and [notebook](https://github.com/llSourcell/random_forests/blob/master/Random%20Forests%20.ipynb) about random forests.

## gini index

The gini index scores a split of the data by measuring how mixed the resulting groups are. If a group contains instances of only one category, its gini index is zero; the more evenly the categories are mixed within a group, the higher the score.

The gini index of one group is calculated like this:

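```python
# `group` is a pandas DataFrame, `categories` the list of possible labels
gini = 1.0
for c in categories:
    # proportion of rows in the group that belong to category c
    proportion = (group['category'] == c).sum() / len(group)
    gini -= proportion ** 2
```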
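The loop above can be wrapped into a function and extended to score a whole split. The following is a minimal sketch, not the notebook's own implementation: the names `gini_group` and `gini_split` are hypothetical, the label column is assumed to be called `class` (the name this notebook gives it during data processing), and weighting each group by its share of the rows is the usual decision-tree convention rather than something stated above.

```python
import pandas as pd


def gini_group(group, label='class'):
    """gini index of one group: 1 minus the sum of squared class proportions."""
    if len(group) == 0:
        return 0.0  # assumption: an empty group counts as pure
    proportions = group[label].value_counts(normalize=True)
    return 1.0 - float((proportions ** 2).sum())


def gini_split(groups, label='class'):
    """size-weighted average of the gini index of each group in a split."""
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0
    return sum(len(g) / total * gini_group(g, label) for g in groups)


# toy groups, made up for illustration
pure = pd.DataFrame({'class': [1, 1, 1]})
mixed = pd.DataFrame({'class': [0, 1, 0, 1]})
print(gini_group(pure))           # 0.0
print(gini_group(mixed))          # 0.5
print(gini_split([pure, mixed]))  # 3/7 * 0.0 + 4/7 * 0.5, about 0.286
```

A lower split score means purer groups, so a tree-building routine would prefer the candidate split with the smallest value.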

```python
import pandas as pd
import numpy as np

np.random.seed(0)  # fixed seed so the random sampling is reproducible
```

## data processing

```python
data = pd.read_csv('data.csv')

# 'cat' marks a categorical feature, 'num' a numerical one
feature_type = {'action_type': 'cat',
                'combined_shot_type': 'cat',
                'game_event_id': 'num',
                'game_id': 'num',
                'lat': 'num',
                'loc_x': 'num',
                'loc_y': 'num',
                'lon': 'num',
                'minutes_remaining': 'num',
                'period': 'cat',
                'playoffs': 'cat',
                'season': 'cat',
                'seconds_remaining': 'num',
                'shot_distance': 'num',
                'shot_type': 'cat',
                'shot_zone_area': 'cat',
                'shot_zone_range': 'cat',
                'team_id': 'cat',
                'team_name': 'cat',
                'game_date': 'cat',
                'matchup': 'cat',
                'opponent': 'cat'}
```

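```python
# keep only the labeled rows, then rename the label column to 'class'
# (rename returns a new frame, which avoids modifying a filtered slice in place)
data = data[pd.notna(data['shot_made_flag'])]
data = data.rename(columns={'shot_made_flag': 'class'})
```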

## helper functions

```python
def subsample(df, sample_size):
    """draw sample_size rows at random, with replacement (np.random.choice default)"""
    indexes = np.random.choice(df.index, sample_size)
    return indexes, df.loc[indexes]
```
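Because `np.random.choice` samples with replacement by default, a sample of the same size as the data contains repeated rows and leaves some rows out. That is the bootstrap step that bagging in a random forest relies on. The sketch below illustrates the effect on a toy frame; the frame, its column name `x` and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng_state = np.random.RandomState(0)  # local generator, leaves the global seed alone
toy = pd.DataFrame({'x': np.arange(10000)})

# same call shape as in subsample: draw len(toy) index labels with replacement
indexes = rng_state.choice(toy.index, len(toy))
sample = toy.loc[indexes]

unique_labels = np.unique(indexes)
unique_fraction = len(unique_labels) / len(toy)
out_of_bag = toy.drop(unique_labels)  # rows that were never drawn

print(len(sample))       # same length as toy, duplicates included
print(unique_fraction)   # expected to be close to 1 - 1/e (about 0.632)
print(len(out_of_bag))   # the remaining rows could serve as a validation set
```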

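```python
def cross_validation_split(df, n_folds):
    """
    returns n_folds disjoint dataframes out of df with equal length
    df: pandas.DataFrame with a unique index (left unmodified)
    n_folds: integer
    """
    dfs = []
    remaining = df  # only reassigned below, so the caller's df is not emptied
    fold_size = len(df) // n_folds
    for i in range(n_folds):
        # sample without replacement so no row is duplicated within a fold
        indexes = np.random.choice(remaining.index, fold_size, replace=False)
        dfs.append(remaining.loc[indexes])
        remaining = remaining.drop(indexes)
    return dfs
```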
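The folds returned by a function like `cross_validation_split` are typically used by holding one fold out for testing and training on the rest, once per fold. A minimal sketch of that loop, assuming the folds are a list of DataFrames; the toy folds and the helper name `train_test_pairs` are hypothetical and only illustrate the pattern:

```python
import pandas as pd


def train_test_pairs(folds):
    """yield (train, test) for each fold, training on all other folds."""
    for i, test in enumerate(folds):
        train = pd.concat([f for j, f in enumerate(folds) if j != i])
        yield train, test


# three toy folds with two rows each
folds = [pd.DataFrame({'x': [k, k + 1], 'class': [0, 1]}, index=[k, k + 1])
         for k in (0, 2, 4)]

for train, test in train_test_pairs(folds):
    print(len(train), len(test))  # 4 2 on every iteration
```

Averaging the score over all pairs gives a steadier estimate than a single train/test split.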

```python
def split_by(df, feature, value, is_numerical=True):
    """
    make the split of a decision tree node
    df: pd.DataFrame
    feature: column name (mostly string)
    value: any
    is_numerical: bool
    return: (pd.DataFrame, pd.DataFrame)
    """
    if is_numerical:
        return df[df[feature] < value], df[df[feature] >= value]
    else:
        # categorical feature
        return df[df[feature] == value], df[df[feature] != value]
```
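To show how the two branches of `split_by` behave, here is a small self-contained sketch on a toy frame. It mirrors the same boolean-mask logic inline rather than calling the function, and it looks up whether a feature is numerical in a `feature_type`-style dictionary. The toy values and the helper name `split_with_types` are assumptions for illustration.

```python
import pandas as pd

# toy data, made up for this example
toy = pd.DataFrame({'shot_distance': [2, 15, 24, 8],
                    'shot_type': ['2PT', '2PT', '3PT', '2PT'],
                    'class': [1, 0, 0, 1]})
toy_types = {'shot_distance': 'num', 'shot_type': 'cat'}


def split_with_types(df, feature, value, types):
    """split df on feature, choosing the comparison from the types mapping."""
    if types[feature] == 'num':
        mask = df[feature] < value   # numerical: below the threshold goes left
    else:
        mask = df[feature] == value  # categorical: equal to the value goes left
    # note: with ~mask, rows with a missing feature value land on the right
    return df[mask], df[~mask]


left, right = split_with_types(toy, 'shot_distance', 10, toy_types)
print(len(left), len(right))  # 2 2
left, right = split_with_types(toy, 'shot_type', '3PT', toy_types)
print(len(left), len(right))  # 1 3
```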
        
```python
def accuracy(predicted, target):
    """
    return fraction of correctly predicted cases
    predicted: np.array
    target: np.array
    return: float between 0 and 1
    """
    assert len(predicted) == len(target)
    return (predicted == target).sum() / len(predicted)
```
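A quick way to sanity-check an accuracy function is to compare a few hand-made predictions against known targets. The arrays below are made up for illustration. The computation is the same element-wise comparison, written here with `np.mean`, which is equivalent to dividing the number of matches by the length.

```python
import numpy as np

target = np.array([1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 1, 0])

# element-wise comparison gives a boolean array; its mean is the hit rate
acc = np.mean(predicted == target)
print(acc)  # 0.8

# baseline: always predicting the most frequent class in target
majority = np.bincount(target).argmax()
baseline = np.mean(np.full_like(target, majority) == target)
print(baseline)  # 0.6
```

Comparing a model against such a majority-class baseline shows whether it has learned anything beyond the class frequencies.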