# Capstone Project : NBA Player Analysis and Hall Of Fame Predictions
By Chan Song Yuan SG-DSI-14

## Notebook 1. Data Cleaning

## Problem Statement

For this capstone project, we want to predict which NBA players have a higher probability of being nominated for the Hall of Fame in the near future. Using the datasets available on Kaggle, we would like to answer the following questions:

    1) Which features correlate most strongly with the Hall of Fame predictions / classifications?
    2) What feature engineering is required to increase the accuracy of the predictions / classifications?
    3) Which features should be chosen for the modeling?
    
Answering these questions will let us produce the Hall of Fame predictions for the NBA players.

## Table of Contents

- [1. Import Data](#1.-Import-Data)<br>
- [2. Data Cleaning](#2.-Data-Cleaning)<br>
    - [2.1 Stats & Player df Cleaning](#2.1-Stats-&-Player-df-Cleaning)<br>
    - [2.2 Data df Cleaning](#2.2-Data-df-Cleaning)<br>
- [3. Data Frames Export](#3.-Data-Frames-Export)<br>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Show at most 10 columns and 10 rows; larger frames are truncated with '...'
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

%matplotlib inline

### 1. Import Data

In [2]:
stats = pd.read_csv('../datasets/Seasons_Stats.csv')
data = pd.read_csv('../datasets/player_data.csv')
player = pd.read_csv('../datasets/Players.csv')

In [3]:
stats.head()

,Unnamed: 0,Year,Player,Pos,Age,...,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,...,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,...,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,...,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,...,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,...,,,,27.0,59.0


In [4]:
stats.shape

(24691, 53)

In [5]:
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24691 entries, 0 to 24690
Data columns (total 53 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  24691 non-null  int64  
 1   Year        24624 non-null  float64
 2   Player      24624 non-null  object 
 3   Pos         24624 non-null  object 
 4   Age         24616 non-null  float64
 5   Tm          24624 non-null  object 
 6   G           24624 non-null  float64
 7   GS          18233 non-null  float64
 8   MP          24138 non-null  float64
 9   PER         24101 non-null  float64
 10  TS%         24538 non-null  float64
 11  3PAr        18839 non-null  float64
 12  FTr         24525 non-null  float64
 13  ORB%        20792 non-null  float64
 14  DRB%        20792 non-null  float64
 15  TRB%        21571 non-null  float64
 16  AST%        22555 non-null  float64
 17  STL%        20792 non-null  float64
 18  BLK%        20792 non-null  float64
 19  TOV%        19582 non-nul

### 2. Data Cleaning

#### 2.1 Stats & Player df Cleaning

In [6]:
stats.isnull().sum()

Unnamed: 0       0
Year            67
Player          67
Pos             67
Age             75
              ... 
STL           3894
BLK           3894
TOV           5046
PF              67
PTS             67
Length: 53, dtype: int64

There are 67 rows with no data. They are separators between season years, so we need to drop them.
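
One way to confirm that such rows are empty separators, rather than partially recorded seasons, is to check that every column other than the row counter is null. The sketch below runs on a small made-up frame; `toy` and its values are illustrative assumptions, not the real `stats` data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `stats` (hypothetical values): row 2 mimics an empty separator row.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Year": [1950.0, 1950.0, np.nan, 1951.0],
    "Player": ["A", "B", np.nan, "C"],
    "PTS": [458.0, 279.0, np.nan, 120.0],
})

# A separator row has nulls in every column except the row counter.
value_cols = toy.columns.drop("Unnamed: 0")
is_separator = toy[value_cols].isnull().all(axis=1)

# Rows missing `Year` should be exactly the separator rows.
same_rows = bool((toy["Year"].isnull() == is_separator).all())

n_separators = int(is_separator.sum())
cleaned = toy[~is_separator]
```

If `same_rows` is `True` on the real data, filtering on a null `Year` removes the separators and nothing else.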

In [7]:
# Drop the two columns that contain only null values
# ('blanl' is the column name as spelled in the source CSV)
stats = stats.drop(columns=['blanl', 'blank2'])

In [8]:
# Drop the rows where the Year column is null
stats = stats[~stats['Year'].isnull()]

In [9]:
len(stats)

24624

In [10]:
stats.shape

(24624, 51)

In [11]:
stats.isnull().sum()

Unnamed: 0       0
Year             0
Player           0
Pos              0
Age              8
              ... 
STL           3827
BLK           3827
TOV           4979
PF               0
PTS              0
Length: 51, dtype: int64

In [12]:
# Rename the 'Unnamed: 0' column to 'id' and use it as the index
stats = stats.rename(columns ={'Unnamed: 0' : 'id'})
stats = stats.set_index('id')

In [13]:
stats.head(5)

id,Year,Player,Pos,Age,Tm,...,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,...,,,,217.0,458.0
1,1950.0,Cliff Barker,SG,29.0,INO,...,,,,99.0,279.0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,...,,,,192.0,438.0
3,1950.0,Ed Bartels,F,24.0,TOT,...,,,,29.0,63.0
4,1950.0,Ed Bartels,F,24.0,DNN,...,,,,27.0,59.0


> Most player stats were not fully recorded from the 1950s to the 1970s, so there are NaN values for those periods. We will consider excluding that data during feature engineering.
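
To decide where to cut off the early seasons, it may help to look at the share of missing values per season for the affected columns. This is a minimal sketch on made-up numbers; the frame `toy`, its values, and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `stats` with two stat columns (hypothetical values).
toy = pd.DataFrame({
    "Year": [1950.0, 1950.0, 1974.0, 1974.0, 1980.0, 1980.0],
    "STL": [np.nan, np.nan, 50.0, np.nan, 80.0, 95.0],
    "TOV": [np.nan, np.nan, np.nan, np.nan, 120.0, 110.0],
})

# Fraction of missing values per season for each stat column.
missing_share = toy.drop(columns="Year").isnull().groupby(toy["Year"]).mean()

# Seasons in which every listed stat is missing in at most half of the rows.
threshold = 0.5
complete_enough = (missing_share <= threshold).all(axis=1)
usable_years = missing_share.index[complete_enough].tolist()
```

On the real data, `missing_share` would show from which season each statistic starts being recorded, which can guide the cutoff year.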

In [14]:
player.head(5)

,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


In [15]:
len(player)

3922

In [16]:
player = player.rename(columns ={'Unnamed: 0':'id'})

In [17]:
player = player[~player['Player'].isnull()]


In [23]:
player.shape

(3921, 8)

In [18]:
player.head(5)

,id,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


#### 2.2 Data df Cleaning

In [19]:
data.head(5)

,name,year_start,year_end,position,height,weight,birth_date,college
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles"
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University


In [20]:
data['id'] = data.index

In [24]:
data.shape

(4550, 9)

In [21]:
data.head(5)

,name,year_start,year_end,position,height,weight,birth_date,college,id
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University,0
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University,1
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles",2
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University,3
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University,4


### 3. Data Frames Export

In [22]:
stats.to_csv('../datasets/season_stats_cleaned.csv',index=False)
data.to_csv('../datasets/player_data_cleaned.csv',index=False)
player.to_csv('../datasets/players_cleaned.csv',index=False)
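
Because `stats` uses `id` as its index (section 2.1), `to_csv(..., index=False)` writes the file without that column, whereas `data` and `player` keep `id` as an ordinary column. The sketch below shows the difference on a toy frame written to an in-memory buffer; the frame and its values are made up for illustration.

```python
import io

import pandas as pd

# Toy frame with `id` as the index, mirroring how `stats` is set up (hypothetical values).
toy = pd.DataFrame({"id": [0, 1], "Player": ["A", "B"]}).set_index("id")

# index=False omits the index, so `id` is not written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

# index=True (the default) writes `id` as the first column.
with_index = io.StringIO()
toy.to_csv(with_index)
cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
```

If later notebooks need `id` for `stats`, exporting with the index included, or calling `reset_index()` before the export, would keep it.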