# Project: Kobe Bryant Shot Selection
### Which shots did Kobe sink? 

## 1. Classification vs Regression

Which type of supervised machine learning problem is this: classification or regression? Why?

**Answer:**

This is a classification problem. Classification maps an input to one of a small number of discrete labels. Here, the goal is to predict whether or not each of Kobe's shots finds the bottom of the net (binary classification: yes or no).

Regression, by contrast, predicts a continuous value: given a set of points, it estimates a real-valued output for a new point.

In short, classification maps an input to one of a small set of discrete values, while regression maps an input to a real number.
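
To make the distinction concrete, here is a minimal sketch on made-up data (the distances, labels, and the "hang time" target are all hypothetical and not taken from the shot dataset). The same feature is fed to a classifier, which returns one of the discrete labels, and to a regressor, which returns a real number.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy, made-up data: one feature (shot distance in feet) for eight shots.
distance = np.array([[1], [2], [3], [4], [20], [22], [25], [28]])

# Classification target: a discrete label (1 = made, 0 = missed).
made = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Regression target: a continuous quantity (hypothetical hang time in seconds).
hang_time = np.array([0.4, 0.5, 0.5, 0.6, 1.1, 1.2, 1.3, 1.4])

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(distance, made)
reg_demo = DecisionTreeRegressor(max_depth=1, random_state=0).fit(distance, hang_time)

new_shot = np.array([[24]])
predicted_label = clf.predict(new_shot)[0]       # one of the discrete labels {0, 1}
predicted_value = reg_demo.predict(new_shot)[0]  # a real number

print("Predicted class: {}".format(predicted_label))
print("Predicted value: {}".format(predicted_value))
```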

## 2. Exploring the Data

Let's go ahead and read in the shot dataset first.


In [31]:
# Import libraries
import numpy as np
import pandas as pd

In [32]:
# Read data
data = pd.read_csv("data.csv")
print("Data read successfully!")
# Note: The column 'shot_made_flag' is the target/label, all other are feature columns

Data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of shots
- Number of made shots
- Number of missed shots
- Rate of successful shots (%)
- Number of features



In [33]:
n_shots = data.shape[0]  # Total number of shots, including rows with no label
n_features = data.shape[1] - 1  # The column "shot_made_flag" is the target label

print(data["shot_made_flag"].unique())

label_counts = data["shot_made_flag"].value_counts()  # NaN labels are not counted
n_made = label_counts[1]
n_missed = label_counts[0]

# Success rate among labeled shots only (rows with a NaN label are excluded)
rate = float(n_made) / (n_made + n_missed) * 100

print("Total number of shots: {}".format(n_shots))
print("Number of shots which Kobe made: {}".format(n_made))
print("Number of shots which Kobe missed: {}".format(n_missed))
print("Rate of successful shots: {:.2f}%".format(rate))

print("Number of features: {}".format(n_features))

[ nan   0.   1.]
Total number of shots: 30697
Number of shots which Kobe made: 11465
Number of shots which Kobe missed: 14232
Rate of successful shots: 44.62%
Number of features: 24


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the column (`'shot_made_flag'`) is the target or label we are trying to predict.
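
As a rough illustration of this step, the sketch below separates features from the target and lists the non-numeric features. It runs on a tiny made-up frame (`toy` and its values are hypothetical stand-ins, not rows from `data.csv`); only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the shot data; the values are made up for illustration.
toy = pd.DataFrame({
    "action_type": ["Jump Shot", "Driving Dunk Shot", "Jump Shot"],
    "shot_distance": [18, 1, 24],
    "shot_made_flag": [0.0, 1.0, None],
})

target_col = "shot_made_flag"
feature_cols = [col for col in toy.columns if col != target_col]

X_toy = toy[feature_cols]
y_toy = toy[target_col]

# Any feature that pandas does not treat as numeric needs conversion later.
non_numeric_cols = [col for col in feature_cols
                    if not pd.api.types.is_numeric_dtype(X_toy[col])]

print("Feature columns: {}".format(feature_cols))
print("Non-numeric feature columns: {}".format(non_numeric_cols))
```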

In [34]:
# Drop columns that will not be used as features
def drop_columns(data, column_names):
    new_data = data.drop(column_names, axis=1)
    return new_data

In [35]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.items():
        # If non-numeric, convert to one or more dummy variables
        if (col_data.dtype == object and col != 'season'):
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'action_type' => 'action_type_Jump Shot', 
                                                             #'action_type_Driving Dunk Shot'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

In [36]:
def PreProcessData():
    
    Seasons = data["season"].unique()

    columns_ = [#'action_type',
                'combined_shot_type',
                'game_event_id',
                'game_id',
                'lat',
                'loc_x',
                'loc_y',
                'lon',
                'minutes_remaining',
                'period',
                'playoffs',
                'season',
                'seconds_remaining',
                'shot_distance',
                #'shot_made_flag' (this is what you are predicting),
                'shot_type',
                'shot_zone_area',
                'shot_zone_basic',
                'shot_zone_range',
                'team_id',
                'team_name',
                'game_date',
                'matchup',
                'opponent',
                #'shot_id'
               ]

    data['Away'] =  data.apply(home_away, axis = 1)
    data_dropped = drop_columns(data,columns_)
    data_dropped_processed = preprocess_features(data_dropped)

    # Rows without a label are the shots we must predict
    missing_data = data_dropped_processed.loc[data_dropped_processed['shot_made_flag'].isnull()]

    # Rows with a label are used for training and testing
    good_data = data_dropped_processed.loc[data_dropped_processed['shot_made_flag'].notnull()]

    target_col = 'shot_made_flag'  # This column is the target/label
    feature_cols = [col for col in data_dropped_processed.columns if (col != 'shot_made_flag' and col != 'shot_id')]

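    X_all = good_data[feature_cols]
    y_all = good_data[target_col]

    # Return the pieces that later cells rely on
    return X_all, y_all, missing_data, feature_cols


# home_away (defined in cell In [40]) must already exist when this runs
X_all, y_all, missing_data, feature_cols = PreProcessData()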


In [37]:
# Put any import statements you need for this code block here
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

def shuffle_split_data(X, y):
    """ Shuffles and splits data into 70% training and 30% testing subsets,
        then returns the training and testing subsets. """
    
    # Shuffle and split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Return the training and testing data subsets
    return X_train, y_train, X_test, y_test


# Test shuffle_split_data
try:
    X_train, y_train, X_test, y_test = shuffle_split_data(X_all, y_all)
    print("Successfully shuffled and split the data!")
except Exception:
    print("Something went wrong with shuffling and splitting the data.")

Successfully shuffled and split the data!


In [38]:
# Put any import statements you need for this code block here
from sklearn import metrics

def performance_metric(y_true, y_predict):
    """ Calculates and returns the total error between true and predicted values
        based on a performance metric chosen by the student. """
    
    return metrics.mean_squared_error(y_true, y_predict) #evaluate performance 

# Test performance_metric
try:
    total_error = performance_metric(y_train, y_train)
    print("Successfully performed a metric calculation!")
except Exception:
    print("Something went wrong with performing a metric calculation.")


Successfully performed a metric calculation!


In [39]:
# Put any import statements you need for this code block
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def fit_model(X, y):
    """ Tunes a decision tree regressor model using GridSearchCV on the input data X 
        and target labels y and returns this optimal model. """

    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()

    # Set up the parameters we wish to tune
    parameters = {'max_depth':(1,2,3,4,5,6,7,8,9,10)}

    # Make an appropriate scoring function
    # functions ending with _error or _loss return a value to minimize, the lower the better
    scoring_function = metrics.make_scorer(metrics.mean_squared_error, greater_is_better=False)

    # Make the GridSearchCV object
    reg = GridSearchCV(regressor, param_grid = parameters, scoring = scoring_function)

    # Fit the learner to the data to obtain the optimal model with tuned parameters
    reg.fit(X, y)

    # Return the optimal model
    return reg


# Test fit_model on entire dataset
try:
    reg = fit_model(X_all, y_all)
    print("Successfully fit a model!")
except Exception:
    print("Something went wrong with fitting a model.")

Successfully fit a model!


In [40]:
def home_away(row):
    """ Returns 1 if the matchup string marks an away game (contains '@'), else 0. """
    if (row['matchup'].find('@')>0):
        return 1
    else:
        return 0

In [41]:
# Predict the outcome of the shots whose label is missing
missing_data_features = missing_data[feature_cols]
shot_prediction = reg.predict(missing_data_features)

# convert to CSV
submission = pd.DataFrame({'shot_id': missing_data['shot_id'],
                           'shot_made_flag': shot_prediction })

submission[['shot_id', 'shot_made_flag']].to_csv('submission.csv', index=False)

  
  


### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Those with only two possible values can reasonably be converted into `1`/`0` (binary) values.

Other columns, like `action_type` and `shot_zone_area`, have more than two values and are known as _categorical variables_. The recommended way to handle such a column is to create one column per possible value (e.g. `action_type_Jump Shot`, `action_type_Driving Dunk Shot`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
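
A minimal sketch of that transformation, on a made-up three-row column (the frame `toy_actions` is hypothetical; only the column name comes from the dataset). Passing `dtype=int` is a choice made here so the output shows `1`/`0` regardless of the pandas version.

```python
import pandas as pd

# Made-up values for a single categorical column.
toy_actions = pd.DataFrame(
    {"action_type": ["Jump Shot", "Driving Dunk Shot", "Jump Shot"]})

# One new column per distinct value, named '<prefix>_<value>'.
dummies = pd.get_dummies(toy_actions["action_type"], prefix="action_type", dtype=int)

print(dummies)
```

Each row has exactly one `1` across the generated columns, which is what makes the encoding usable by algorithms that expect numeric input.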

Here are the columns that we want to transform:<br/>
`action_type`, `combined_shot_type`, `season`, `shot_type`, `shot_zone_area`, `shot_zone_basic`, `shot_zone_range`, `opponent`

We also want to eliminate columns that will provide no value to our final prediction:<br/>
`game_event_id`, `game_id`, `team_id`, `team_name`, `game_date`, `shot_id`

The following column needs special attention:<br/>
`matchup`: This column is used to construct additional indicator columns, "Home" and "Away", which record whether Kobe played at home or away.
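
One possible way to derive both indicators in a vectorized form is sketched below. It assumes that away games are written with `@` (e.g. `LAL @ POR`) and home games with `vs.`; the sample strings are illustrative, and `toy_games` is a hypothetical frame rather than the notebook's `data`.

```python
import pandas as pd

# Illustrative matchup strings; assumed format: '@' marks an away game.
toy_games = pd.DataFrame({"matchup": ["LAL @ POR", "LAL vs. UTA", "LAL @ SAS"]})

# 1 when the matchup contains '@', otherwise 0.
toy_games["Away"] = toy_games["matchup"].str.contains("@", regex=False).astype(int)

# Home is simply the complement of Away.
toy_games["Home"] = 1 - toy_games["Away"]

print(toy_games)
```

Because `Home` is fully determined by `Away`, a single one of the two columns carries all the information a model needs.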

### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.
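
As a variation on a plain 70/30 split, the sketch below passes `stratify` to `train_test_split` so that both subsets keep roughly the same share of made shots. The data is synthetic (100 fake shots, 40 of them made), and the `*_demo` names are hypothetical, chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 shots, 40 made (1) and 60 missed (0).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 40 + [0] * 60)

# 70% training, 30% testing, preserving the class ratio in both subsets.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

print("Training set: {} samples, made rate {:.2f}".format(
    len(y_train_demo), y_train_demo.mean()))
print("Test set: {} samples, made rate {:.2f}".format(
    len(y_test_demo), y_test_demo.mean()))
```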

## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem.
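
The section does not say which three models to use, so the sketch below picks three common scikit-learn classifiers purely as an assumption (logistic regression, a decision tree, and Gaussian naive Bayes) and compares them with 5-fold cross-validation. It runs on synthetic data from `make_classification`, not on the shot dataset, so the scores say nothing about Kobe's shots.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X_synth, y_synth = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate models (an illustrative choice, not prescribed by the project).
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=5, random_state=0),
    "GaussianNB": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_synth, y_synth, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print("{}: mean accuracy = {:.3f}".format(name, cv_scores[name]))
```

To try this on the real data, `X_synth` and `y_synth` would be replaced by the training features and labels produced in the preparation step.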