The National Basketball Association (NBA) is a professional basketball league in North America. The league is composed of 30 teams (29 in the United States and 1 in Canada) and is one of the four major professional sports leagues in the United States and Canada. It is the premier men's professional basketball league in the world.

Career longevity depends on many factors for players in any sport, and NBA rookies are no exception. These factors include the number of games played and the player's other per-game statistics.

The objective is to determine whether a player's career will flourish, using machine learning techniques.

The evaluation metric is the accuracy score.
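Accuracy is the fraction of predictions that match the true labels. A minimal sketch of that definition in plain Python follows; the `accuracy` helper and the toy labels are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (made up for illustration, not taken from the dataset)
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 3 of 4 match -> 0.75
```

Because the measure only counts matches, swapping the two arguments gives the same value.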

Data description:

The dataset contains player statistics for NBA rookies. There are 1100+ observations in the training dataset, with 19 variables excluding the target variable (Target).

1. GP: Games played (some values may be decimals; take the floor of the value, so 12.789 means the player played 12 games)
2. The values of the following attributes are averaged over all games played by each player

MIN: Minutes played

PTS: Points per game

FGM: Field goals made

FGA: Field goals attempted

FG%: Field goal percentage

3P Made: 3-pointers made

3PA: 3-pointers attempted

3P%: 3-point percentage

FTM: Free throws made

FTA: Free throws attempted

FT%: Free throw percentage

OREB: Offensive rebounds

DREB: Defensive rebounds

REB: Rebounds

AST: Assists

STL: Steals

BLK: Blocks

TOV: Turnovers

Target: 0 if career years played < 5, 1 if career years played >= 5
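The two data rules stated above (flooring GP and the five-year threshold for Target) can be sketched in a few lines. The helper names and the `years_played` input are hypothetical: the dataset ships with Target already computed and does not include career length.

```python
import math

def games_played(raw_gp):
    """Floor a possibly fractional GP value to a whole number of games."""
    return math.floor(raw_gp)

def career_target(years_played):
    """Return 1 for a career of 5 or more years, else 0."""
    return 1 if years_played >= 5 else 0

print(games_played(12.789))                # 12
print(career_target(4), career_target(5))  # 0 1
```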

In [191]:
#Import libraries
from numpy import floor  #Only floor is used; a wildcard import would shadow builtins such as sum and max
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [192]:
from sklearn import metrics
from sklearn.metrics import accuracy_score

In [193]:
import warnings
warnings.filterwarnings("ignore")

In [194]:
#Load train data
df = pd.read_csv('Train_data.csv')
df.head()

,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,Target
0,59.0,12.8,3.4,1.3,2.6,51.0,0.2,0.3,50.0,0.7,0.8,78.0,1.1,2.3,3.3,0.5,0.3,0.4,0.5,1
1,31.0,10.7,3.4,1.2,3.3,35.3,0.5,2.1,25.8,0.5,0.9,55.2,0.3,1.1,1.4,0.4,0.3,0.1,0.2,0
2,48.0,9.3,4.5,1.7,3.4,49.7,0.0,0.1,0.0,1.2,1.9,61.5,0.4,0.8,1.2,0.8,0.5,0.4,1.0,0
3,80.0,27.7,11.2,3.5,9.4,37.4,1.3,4.1,32.9,2.8,3.3,85.0,0.8,1.6,2.4,3.9,1.3,0.1,2.2,1
4,58.0,18.4,5.8,1.9,5.3,36.7,0.0,0.1,25.0,1.9,3.1,61.7,0.5,0.7,1.2,1.9,1.1,0.2,1.7,0


In [195]:
#Check the dimension
df.shape

(1101, 20)

In [196]:
#Check for missing values
df.isna().sum()

GP         0
MIN        0
PTS        0
FGM        0
FGA        0
FG%        0
3P Made    0
3PA        0
3P%        0
FTM        0
FTA        0
FT%        0
OREB       0
DREB       0
REB        0
AST        0
STL        0
BLK        0
TOV        0
Target     0
dtype: int64

In [197]:
#Check for the types of parameters
df.dtypes

GP         float64
MIN        float64
PTS        float64
FGM        float64
FGA        float64
FG%        float64
3P Made    float64
3PA        float64
3P%        float64
FTM        float64
FTA        float64
FT%        float64
OREB       float64
DREB       float64
REB        float64
AST        float64
STL        float64
BLK        float64
TOV        float64
Target       int64
dtype: object

In [198]:
#Apply floor to Games played
df['GP'] = df['GP'].apply(floor)

In [199]:
#Separate features and target
X = df.drop(['Target'], axis = 1)
y = df.Target

In [200]:
X.shape

(1101, 19)

In [201]:
df['Target'].value_counts()

0    551
1    550
Name: Target, dtype: int64
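The classes are almost perfectly balanced, so a model that always predicts the majority class would score about 50% accuracy. A small sketch of that baseline, using the counts printed above (computed on the full training file, not on the later test split):

```python
# Class counts copied from the value_counts() output above
counts = {0: 551, 1: 550}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))  # 0 0.5005
```

Any accuracy score reported below is best read against this baseline.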

In [202]:
#Split the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state= 0)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape) 
print("y_train:", y_train.shape) 
print("y_test:", y_test.shape)

X_train: (770, 19)
X_test: (331, 19)
y_train: (770,)
y_test: (331,)
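The shapes above follow from simple arithmetic on the 1101 rows. The sketch below assumes that a fractional test count is rounded up and the remainder goes to the training set, which is consistent with the printed output.

```python
import math

n_samples = 1101   # rows in the training file
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 770 331
```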


#Apply ML algorithms and Calculate the accuracy score

Logistic Regression

In [203]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
sco = accuracy_score(y_test, y_pred)  #Argument order is (y_true, y_pred)
print(sco)

0.6918429003021148


Support Vector Machine (SVC)

In [204]:
from sklearn.svm import SVC 

model_s = SVC(kernel = 'rbf', C = 5, gamma = 'scale', probability= True)
model_s.fit(X_train, y_train)
y_pred_svm = model_s.predict(X_test)
svc_sco = accuracy_score(y_test, y_pred_svm)
print(svc_sco)

0.6676737160120846


Linear SVC

In [205]:
from sklearn.svm import LinearSVC
lin_svc = LinearSVC()
lin_svc.fit(X_train, y_train)
y_pred_lin = lin_svc.predict(X_test)
lin_sco = accuracy_score(y_test, y_pred_lin)
print(lin_sco)

0.6435045317220544


Decision Tree

In [206]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_sco = accuracy_score(y_test, y_pred_dt)
print(dt_sco)

0.6404833836858006


Random Forest

In [207]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_sco = accuracy_score(y_test, y_pred_rf)
print(rf_sco)

0.7311178247734139
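A single 70/30 split gives only one estimate of accuracy, and the forest is built without a fixed `random_state`, so the score can vary between runs. K-fold cross-validation averages over several splits instead. Below is a minimal sketch of how the fold indices partition the 770 training rows; `k_fold_indices` is a hypothetical helper written for illustration and does no shuffling. In practice scikit-learn's `KFold` and `cross_val_score` cover this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(770, 5))
print([len(val) for _, val in folds])  # [154, 154, 154, 154, 154]
```

Each row serves as validation data exactly once, and the model would be refit on the remaining rows for every fold.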


KNN or k-nearest neighbors

In [208]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
knn_sco = accuracy_score(y_test, y_pred_knn)
print(knn_sco)

0.6858006042296072


Gaussian Naive Bayes

In [209]:
from sklearn.naive_bayes import GaussianNB
g = GaussianNB()
g.fit(X_train, y_train)
y_pred_g = g.predict(X_test)
sco_g = accuracy_score(y_test, y_pred_g)
print(sco_g)

0.6706948640483383


Stochastic Gradient Descent

In [210]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()
sgd.fit(X_train, y_train)
y_pred_sgd = sgd.predict(X_test)
sgd_sco = accuracy_score(y_test, y_pred_sgd)
print(sgd_sco)

0.5709969788519638
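Linear models trained with SGD are sensitive to feature scale, and the features here span very different ranges (GP in the tens, BLK mostly below 1). One common remedy is z-score standardization fitted on the training split only. The sketch below shows the idea in plain Python on made-up GP values; `fit_scaler` and `transform` are hypothetical helpers, and whether scaling would change the score above has not been tested here. scikit-learn's `StandardScaler` does the same job for real use.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # Constant column: avoid division by zero
    return mu, sigma

def transform(column, mu, sigma):
    """Apply z-score scaling with previously fitted statistics."""
    return [(x - mu) / sigma for x in column]

# Toy GP values (made up for illustration)
train_gp = [20.0, 40.0, 60.0, 80.0]
test_gp = [50.0]

mu, sigma = fit_scaler(train_gp)              # Statistics come from train only
train_scaled = transform(train_gp, mu, sigma)
test_scaled = transform(test_gp, mu, sigma)   # Reuse train statistics on test
print(test_scaled)  # [0.0]
```

Fitting the statistics on the training split alone keeps information from the test split out of the model.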


Gradient Boosting Classifier

In [211]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_pred_gbc = gbc.predict(X_test)
gbc_sco = accuracy_score(y_test, y_pred_gbc)
print(gbc_sco)

0.7129909365558912


In [212]:
#Load test data
new_test_data = pd.read_csv('Test_data.csv')

In [213]:
new_test_data.head()

,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV
0,44.0,13.0,6.6,2.5,5.6,45.3,0.4,1.3,32.7,1.1,1.7,65.3,0.8,0.6,1.4,1.1,0.7,0.2,1.0
1,51.0,9.1,2.7,1.0,2.7,39.0,0.1,0.3,23.5,0.6,0.8,69.8,0.3,0.7,1.0,0.9,0.6,0.1,0.7
2,51.0,15.1,5.7,2.2,5.2,41.2,0.3,0.8,32.5,1.1,1.6,69.1,0.4,1.3,1.7,2.1,0.7,0.0,1.4
3,15.0,7.9,1.9,0.7,2.5,27.0,0.0,0.0,0.0,0.5,0.8,66.7,0.5,1.1,1.5,0.5,0.1,0.1,1.0
4,36.0,14.4,5.8,2.3,5.4,43.1,0.0,0.1,50.0,1.1,1.4,82.0,1.1,1.4,2.4,0.9,0.3,0.2,0.9


In [214]:
new_test_data.columns

Index(['GP', 'MIN', 'PTS', 'FGM', 'FGA', 'FG%', '3P Made', '3PA', '3P%', 'FTM',
       'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV'],
      dtype='object')

In [215]:
#Check the dimension
new_test_data.shape

(555, 19)

In [216]:
#Check model with the highest score
models = pd.DataFrame({
    'Model': ['LogisticRegression', 'SVM', 'Linear SVC', 'DecisionTree', 'RandomForest', 'KNN', 'Gaussian', 'SGD', 'GBC'],
    'Score': [sco, svc_sco, lin_sco, dt_sco, rf_sco, knn_sco, sco_g, sgd_sco, gbc_sco]})
models.sort_values(by = 'Score', ascending= False)

,Model,Score
4,RandomForest,0.731118
8,GBC,0.712991
0,LogisticRegression,0.691843
5,KNN,0.685801
6,Gaussian,0.670695
1,SVM,0.667674
2,Linear SVC,0.643505
3,DecisionTree,0.640483
7,SGD,0.570997
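With only 331 test rows, small differences between models in the table can be within sampling noise. A rough check is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below applies it to the two top scores; it is an approximation that treats test rows as independent and ignores run-to-run variation of the models themselves.

```python
import math

n_test = 331            # Size of the held-out split
acc_rf = 0.731118       # RandomForest score from the table
acc_gbc = 0.712991      # GBC score from the table

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
se_rf = math.sqrt(acc_rf * (1 - acc_rf) / n_test)
gap = acc_rf - acc_gbc

print(round(se_rf, 4), round(gap, 4))
```

The standard error comes out at roughly 0.024, while the gap between the two models is about 0.018, so this split alone does not clearly separate them.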


In [217]:
#Create new csv file
new_test_data['GP'] = new_test_data['GP'].apply(floor)  #Same GP preprocessing as the training data
target = rf.predict(new_test_data)
d = pd.DataFrame({'prediction': target}, index=new_test_data.index)
d.to_csv('prediction.csv', index=False)

In [218]:
a = pd.read_csv('prediction.csv')
a.head()

,prediction
0,0
1,0
2,0
3,0
4,1
