In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Kobe Bryant shot selection

Competition link: https://www.kaggle.com/c/kobe-bryant-shot-selection

Goal: Fun and education

Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

This data contains the location and circumstances of every field goal Kobe Bryant attempted during his 20-year career. Your task is to predict whether the basket went in (shot_made_flag).

We have removed 5000 of the shot_made_flags (represented as missing values in the csv file). These are the test set shots for which you must submit a prediction. You are provided a sample submission file with the correct shot_ids needed for a valid prediction.

In [2]:
data = pd.read_csv('Kobe.csv')

In [3]:
data.head()

Unnamed: 0,action_type,combined_shot_type,game_event_id,game_id,lat,loc_x,loc_y,lon,minutes_remaining,period,...,shot_type,shot_zone_area,shot_zone_basic,shot_zone_range,team_id,team_name,game_date,matchup,opponent,shot_id
0,Jump Shot,Jump Shot,10,20000012,33.9723,167,72,-118.1028,10,1,...,2PT Field Goal,Right Side(R),Mid-Range,16-24 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,1
1,Jump Shot,Jump Shot,12,20000012,34.0443,-157,0,-118.4268,10,1,...,2PT Field Goal,Left Side(L),Mid-Range,8-16 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,2
2,Jump Shot,Jump Shot,35,20000012,33.9093,-101,135,-118.3708,7,1,...,2PT Field Goal,Left Side Center(LC),Mid-Range,16-24 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,3
3,Jump Shot,Jump Shot,43,20000012,33.8693,138,175,-118.1318,6,1,...,2PT Field Goal,Right Side Center(RC),Mid-Range,16-24 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,4
4,Driving Dunk Shot,Dunk,155,20000012,34.0443,0,0,-118.2698,6,2,...,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,1610612747,Los Angeles Lakers,2000-10-31,LAL @ POR,POR,5


In [4]:
target = 'shot_made_flag'

**Tasks:**

1. Analyze the data. Many good examples of analysis can be found here: https://www.kaggle.com/c/kobe-bryant-shot-selection/kernels
2. Prepare features for model training: generate new features, handle missing values, check for possible outliers, encode categorical features, etc.
3. Train a linear model, Lasso, and Ridge on the same features; build a comparison table of the coefficients and draw conclusions about how the coefficient magnitudes change and which ones are driven to zero. Compute the RSS.

**Additional**
4. Compare results on a test set: do a train_test_split at the very beginning, prepare the variables, and compare the performance of the same three classifiers using the ROC AUC metric.
5. Run PCA on the prepared features and see which components account for the largest share of the variance of the target variable.
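
For the PCA task, note that `explained_variance_ratio_` measures the variance of the features, not of the target. A minimal sketch on synthetic data (the three columns are made up for illustration and are not the Kobe features), with the features standardized first so that no column dominates by scale alone:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic features: two nearly identical columns plus one independent column
base = rng.normal(size=(300, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(300, 1)),
    base + 0.1 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])

# Standardize, then fit PCA with all components kept
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Share of total feature variance carried by each component, and the running total
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(ratios)
print(cumulative)
```

The first component should absorb most of the variance here because two of the three columns are almost copies of each other.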

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30697 entries, 0 to 30696
Data columns (total 25 columns):
action_type           30697 non-null object
combined_shot_type    30697 non-null object
game_event_id         30697 non-null int64
game_id               30697 non-null int64
lat                   30697 non-null float64
loc_x                 30697 non-null int64
loc_y                 30697 non-null int64
lon                   30697 non-null float64
minutes_remaining     30697 non-null int64
period                30697 non-null int64
playoffs              30697 non-null int64
season                30697 non-null object
seconds_remaining     30697 non-null int64
shot_distance         30697 non-null int64
shot_made_flag        25697 non-null float64
shot_type             30697 non-null object
shot_zone_area        30697 non-null object
shot_zone_basic       30697 non-null object
shot_zone_range       30697 non-null object
team_id               30697 non-null int64
team_name         

In [6]:
data['remaining_time'] = data['minutes_remaining'] * 60 + data['seconds_remaining']

In [7]:
data['season'] = data['season'].apply(lambda x: int(x.split('-')[1]) )

In [8]:
data['season'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 97,
       98, 99,  0])
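
Taking the two-digit suffix maps the seasons 1996-97 through 1999-00 to 97, 98, 99 and 0, so the numeric order of `season` no longer matches chronological order. A minimal sketch of one alternative, assuming the raw labels have the form `'2000-01'`; `season_start_year` is a hypothetical helper and is not used elsewhere in this notebook:

```python
import pandas as pd

def season_start_year(season: str) -> int:
    """Return the first year of a season label such as '2000-01'."""
    return int(season.split('-')[0])

# Toy labels spanning the century boundary
seasons = pd.Series(['1996-97', '1999-00', '2000-01', '2015-16'])
start_years = seasons.apply(season_start_year)
print(start_years.tolist())
```

With the start year as the feature, a linear model sees the seasons in the order they were actually played.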

In [9]:
data['shot_type'].unique()

array(['2PT Field Goal', '3PT Field Goal'], dtype=object)

In [10]:
data['shot_zone_area'].unique()

array(['Right Side(R)', 'Left Side(L)', 'Left Side Center(LC)',
       'Right Side Center(RC)', 'Center(C)', 'Back Court(BC)'],
      dtype=object)

In [11]:
data['shot_zone_basic'].unique()

array(['Mid-Range', 'Restricted Area', 'In The Paint (Non-RA)',
       'Above the Break 3', 'Right Corner 3', 'Backcourt',
       'Left Corner 3'], dtype=object)

In [12]:
data['shot_zone_range'].unique()

array(['16-24 ft.', '8-16 ft.', 'Less Than 8 ft.', '24+ ft.',
       'Back Court Shot'], dtype=object)

In [13]:
data[['season','shot_made_flag']].groupby(['season'])['shot_made_flag'].agg('count')

season
0     1312
1     1575
2     1708
3     1852
4     1371
5     1127
6     1924
7     1579
8     1819
9     1851
10    1772
11    1521
12    1416
13    1328
14      59
15     593
16     932
97     383
98     810
99     765
Name: shot_made_flag, dtype: int64

In [14]:
one = data[data['shot_made_flag'] == 1.0 ]

In [15]:
one['shot_made_flag'].value_counts()

1.0    11465
Name: shot_made_flag, dtype: int64

In [16]:
one[['season','shot_made_flag']].groupby(['season']).agg('count').sort_values(by=['shot_made_flag'],ascending = False)

season,shot_made_flag
6,873
9,866
8,852
3,808
10,804
2,783
1,735
7,723
11,679
13,608


In [17]:
noshot = data[data['shot_made_flag'] == 0.0 ] 

In [19]:
noshot[['season','shot_made_flag']].groupby(['season']).agg('count').sort_values(by=['shot_made_flag'],ascending = False)

season,shot_made_flag
6,1051
3,1044
9,985
10,968
8,967
2,925
7,856
11,842
1,840
12,813
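
The separate counts of made and missed shots can be combined into a per-season accuracy in a single groupby. A minimal sketch on a toy frame (the values are invented; only the column names `season` and `shot_made_flag` come from the notebook):

```python
import pandas as pd

# Toy data: one unlabeled shot (None) mimics the held-out test rows
toy = pd.DataFrame({
    'season': [1, 1, 1, 2, 2, 2],
    'shot_made_flag': [1.0, 0.0, None, 1.0, 1.0, 0.0],
})

# Keep only labeled shots, then aggregate attempts, makes and accuracy per season
labelled = toy.dropna(subset=['shot_made_flag'])
summary = labelled.groupby('season')['shot_made_flag'].agg(
    attempts='count', made='sum', accuracy='mean'
)
print(summary)
```

Because `shot_made_flag` is 0/1, its mean within a group is the fraction of shots made.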


In [20]:
drops = ['shot_id', 'team_id', 'team_name', 'opponent',
         'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
         'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
# Drop the columns that are not used as features in a single call
data = data.drop(columns=drops)

In [21]:
data.head()

Unnamed: 0,action_type,combined_shot_type,period,playoffs,season,shot_distance,shot_made_flag,shot_type,shot_zone_area,shot_zone_basic,shot_zone_range,remaining_time
0,Jump Shot,Jump Shot,1,0,1,18,,2PT Field Goal,Right Side(R),Mid-Range,16-24 ft.,627
1,Jump Shot,Jump Shot,1,0,1,15,0.0,2PT Field Goal,Left Side(L),Mid-Range,8-16 ft.,622
2,Jump Shot,Jump Shot,1,0,1,16,1.0,2PT Field Goal,Left Side Center(LC),Mid-Range,16-24 ft.,465
3,Jump Shot,Jump Shot,1,0,1,22,0.0,2PT Field Goal,Right Side Center(RC),Mid-Range,16-24 ft.,412
4,Driving Dunk Shot,Dunk,2,0,1,0,1.0,2PT Field Goal,Center(C),Restricted Area,Less Than 8 ft.,379


In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30697 entries, 0 to 30696
Data columns (total 12 columns):
action_type           30697 non-null object
combined_shot_type    30697 non-null object
period                30697 non-null int64
playoffs              30697 non-null int64
season                30697 non-null int64
shot_distance         30697 non-null int64
shot_made_flag        25697 non-null float64
shot_type             30697 non-null object
shot_zone_area        30697 non-null object
shot_zone_basic       30697 non-null object
shot_zone_range       30697 non-null object
remaining_time        30697 non-null int64
dtypes: float64(1), int64(5), object(6)
memory usage: 2.8+ MB


In [23]:
test=data[data['shot_made_flag'].isnull()]
test.shape

(5000, 12)

In [24]:
train=data[data['shot_made_flag'].notnull()]
train.shape

(25697, 12)

In [25]:
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)


In [26]:
train = MultiColumnLabelEncoder(columns = ['action_type','combined_shot_type', 'shot_type', 'shot_zone_basic', 'shot_zone_range', 'shot_zone_area']).fit_transform(train)

In [27]:
train.head()

Unnamed: 0,action_type,combined_shot_type,period,playoffs,season,shot_distance,shot_made_flag,shot_type,shot_zone_area,shot_zone_basic,shot_zone_range,remaining_time
1,25,3,1,0,1,15,0.0,0,3,4,2,622
2,25,3,1,0,1,16,1.0,0,2,4,0,465
3,25,3,1,0,1,22,0.0,0,4,4,0,412
4,4,1,2,0,1,0,1.0,0,1,5,4,379
5,25,3,3,0,1,14,0.0,0,3,4,2,572


In [28]:
test = MultiColumnLabelEncoder(columns = ['action_type','combined_shot_type', 'shot_type', 'shot_zone_basic', 'shot_zone_range', 'shot_zone_area']).fit_transform(test)
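
Calling `fit_transform` separately on `train` and `test` fits a fresh `LabelEncoder` on each subset, so the same category can receive different integers whenever a category is missing from one of them. A minimal sketch of one way to keep the codes aligned, on toy frames; `encode_consistently` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_train = pd.DataFrame({'shot_type': ['2PT Field Goal', '3PT Field Goal', '2PT Field Goal']})
toy_test = pd.DataFrame({'shot_type': ['3PT Field Goal', '3PT Field Goal']})

def encode_consistently(train_df, test_df, columns):
    """Fit one LabelEncoder per column on train and test together, then transform both."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        encoder = LabelEncoder().fit(pd.concat([train_df[col], test_df[col]]))
        train_out[col] = encoder.transform(train_df[col])
        test_out[col] = encoder.transform(test_df[col])
    return train_out, test_out

train_enc, test_enc = encode_consistently(toy_train, toy_test, ['shot_type'])

# For comparison: encoding the test frame on its own
separate_codes = LabelEncoder().fit_transform(toy_test['shot_type']).tolist()
print(train_enc['shot_type'].tolist(), test_enc['shot_type'].tolist(), separate_codes)
```

In the toy example, the separately fitted encoder assigns code 0 to '3PT Field Goal', while the shared encoder assigns it code 1 in both frames.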

In [31]:
from sklearn.linear_model import LinearRegression

predictors = train.columns.tolist()
predictors.remove('shot_made_flag')

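# The `normalize` argument (used here and in the Ridge and Lasso cells) was removed in scikit-learn 1.2
linreg = LinearRegression(normalize=True)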
linreg.fit(train[predictors], train[target])
y_pred = linreg.predict(train[predictors])

In [32]:
rss = sum((y_pred - train[target]) ** 2)
print(rss)

5961.91205826129


In [33]:
print(linreg.coef_)

[ 1.08237418e-03 -1.07460850e-01 -1.02516671e-02 -1.59660178e-03
 -3.12293328e-04 -1.43653960e-02  1.40614116e-01 -1.46071860e-02
  2.54917234e-02 -1.66387623e-02  3.94052140e-05]


In [34]:
print(linreg.intercept_)

0.9165085429924158


In [35]:
from sklearn.linear_model import Ridge

# Fit
ridgereg = Ridge(alpha=5, normalize=True)
ridgereg.fit(train[predictors], train[target])
y_pred = ridgereg.predict(train[predictors])

# rss
rss = sum((y_pred-train[target]) ** 2)
print(rss)

6184.342752266698


In [36]:
print(ridgereg.coef_)

[ 2.22224452e-05 -1.23892942e-02 -2.02178302e-03 -4.46271858e-04
 -3.07181880e-05 -1.39394199e-03 -1.68926197e-02 -4.31157210e-03
  4.97875256e-03  6.46314384e-03  1.03114815e-05]


In [37]:
print(ridgereg.intercept_)

0.4908918365536861


In [38]:
from sklearn.linear_model import Lasso

# fit
lassoreg = Lasso(alpha=5, normalize=True, max_iter=100000)  # max_iter must be an integer
lassoreg.fit(train[predictors], train[target])
y_pred = lassoreg.predict(train[predictors])
    
# rss
rss = sum((y_pred-train[target]) ** 2)
print(rss)

6349.763785655984


In [39]:
print(lassoreg.coef_)

[-0. -0. -0. -0. -0. -0. -0. -0.  0.  0.  0.]
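
With `alpha=5` the L1 penalty is strong enough to zero every coefficient, so this Lasso fit reduces to an intercept-only model. A minimal sketch of the comparison table asked for in the tasks, on synthetic data rather than the Kobe features. `StandardScaler` in a pipeline stands in for `normalize=True` here; the two scale features differently (standard deviation versus L2 norm), so the `alpha` values are not directly comparable to the cells above. The feature names `f0`..`f3` and the smaller `alpha=0.01` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: only f0 and f1 influence the target
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y = 0.5 * X['f0'] - 0.2 * X['f1'] + rng.normal(scale=0.5, size=500)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=5),
    'lasso_0.01': Lasso(alpha=0.01),
    'lasso_5': Lasso(alpha=5),
}

# One column of coefficients per model, plus the training RSS of each fit
coef_table = pd.DataFrame(index=X.columns)
rss = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model).fit(X, y)
    coef_table[name] = pipe[-1].coef_
    rss[name] = float(np.sum((y - pipe.predict(X)) ** 2))

print(coef_table)
print(rss)
```

Reading the table column by column shows Ridge shrinking all coefficients toward zero and Lasso setting some (or, for a large penalty, all) of them exactly to zero, with the training RSS growing as the penalty increases.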


In [40]:
print(lassoreg.intercept_)

0.44616103047048294
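
The RSS values above are computed on the training data. For the additional task of comparing the three models on held-out data with ROC AUC, a minimal sketch on synthetic data follows. The data-generating process, the `alpha` values and the 25% split are assumptions for illustration. The regressors' continuous predictions are used directly as ranking scores, which is all ROC AUC needs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic binary target driven by the first two of three features
X = rng.normal(size=(1000, 3))
prob = 1 / (1 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))
y = (rng.uniform(size=1000) < prob).astype(int)

# Split before any fitting so the test rows never influence the models
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=5),
    'lasso': Lasso(alpha=0.01),
}

# Fit on the training part, score the held-out part
auc = {}
for name, model in models.items():
    scores = model.fit(X_tr, y_tr).predict(X_te)
    auc[name] = roc_auc_score(y_te, scores)

print(auc)
```

An intercept-only model, such as the Lasso fit with `alpha=5` above, gives the same score to every shot and therefore an ROC AUC of 0.5.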
