# NBA Player Career Duration Prediction - Feature Engineering Phase

## **Introduction**


Feature engineering helps determine which attributes in the data best predict a given outcome.

In this project, I'll provide insights for the National Basketball Association (NBA), a professional North American basketball league. The goal is to help NBA managers and coaches identify which players are most likely to thrive in the high-pressure environment of professional basketball and contribute to their team's success over time.

To do this, I'll analyze a subset of data that contains information about NBA players and their performance records. I'll conduct feature engineering to determine which features will most effectively predict whether a player's NBA career will last at least five years. The insights gained will then be used in the next stage of the project: building the predictive model.


## **Step 1: Imports** 


Start by importing `pandas`.

In [28]:
# Import pandas.
import pandas as pd

In [29]:
# RUN THIS CELL TO IMPORT THE DATA.
data = pd.read_csv("nba-players.csv", index_col=0)

## **Step 2: Data exploration** 

Display the first 10 rows of the data to get a sense of what it entails.

In [30]:
# Display first 10 rows of data.

data.head(10)

Unnamed: 0,name,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,...,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs
0,Brandon Ingram,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,...,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,Andrew Harrison,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,...,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,JaKarr Sampson,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,...,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,Malik Sealy,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,...,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,Matt Geiger,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,...,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1
5,Tony Bennett,75,11.4,3.7,1.5,3.5,42.3,0.3,1.1,32.5,...,0.5,73.2,0.2,0.7,0.8,1.8,0.4,0.0,0.7,0
6,Don MacLean,62,10.9,6.6,2.5,5.8,43.5,0.0,0.1,50.0,...,1.8,81.1,0.5,1.4,2.0,0.6,0.2,0.1,0.7,1
7,Tracy Murray,48,10.3,5.7,2.3,5.4,41.5,0.4,1.5,30.0,...,0.8,87.5,0.8,0.9,1.7,0.2,0.2,0.1,0.7,1
8,Duane Cooper,65,9.9,2.4,1.0,2.4,39.2,0.1,0.5,23.3,...,0.5,71.4,0.2,0.6,0.8,2.3,0.3,0.0,1.1,0
9,Dave Johnson,42,8.5,3.7,1.4,3.5,38.3,0.1,0.3,21.4,...,1.4,67.8,0.4,0.7,1.1,0.3,0.2,0.0,0.7,0


Display the number of rows and the number of columns to get a sense of how much data is available.

In [31]:
# Display number of rows, number of columns.

data.shape

(1340, 21)

- There are 1,340 observations and 21 columns (the player `name`, 19 performance statistics, and the target), which is a lot of candidate features.

Display all column names, using the `columns` property in pandas, to get a sense of the kinds of metadata available about each player.


In [32]:
# Display all column names.

data.columns

Index(['name', 'gp', 'min', 'pts', 'fgm', 'fga', 'fg', '3p_made', '3pa', '3p',
       'ftm', 'fta', 'ft', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'tov',
       'target_5yrs'],
      dtype='object')

The following table provides a description of the data in each column.

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

Display a summary of the data to get additional information about the DataFrame, including the types of data in the columns.

In [33]:
# Use .info() to display a summary of the DataFrame.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1340 entries, 0 to 1339
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         1340 non-null   object 
 1   gp           1340 non-null   int64  
 2   min          1340 non-null   float64
 3   pts          1340 non-null   float64
 4   fgm          1340 non-null   float64
 5   fga          1340 non-null   float64
 6   fg           1340 non-null   float64
 7   3p_made      1340 non-null   float64
 8   3pa          1340 non-null   float64
 9   3p           1340 non-null   float64
 10  ftm          1340 non-null   float64
 11  fta          1340 non-null   float64
 12  ft           1340 non-null   float64
 13  oreb         1340 non-null   float64
 14  dreb         1340 non-null   float64
 15  reb          1340 non-null   float64
 16  ast          1340 non-null   float64
 17  stl          1340 non-null   float64
 18  blk          1340 non-null   float64
 19  tov   

- All columns are numerical except `name`, which is categorical (dtype `object`).

### Checking for missing values

Review the data to determine whether it contains any missing values.
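The split between numerical and categorical columns can also be read off programmatically rather than from the `.info()` printout. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative stand-ins, not the real dataset), using `DataFrame.select_dtypes`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the NBA data (illustrative values only).
toy = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "gp": [36, 74],
    "pts": [7.4, 5.2],
    "target_5yrs": [0, 1],
})

# Columns with a numeric dtype (int or float).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Running the same two calls on `data` should list `name` as the only non-numeric column.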

In [34]:
# Display the number of missing values in each column.
data.isna().sum()

name           0
gp             0
min            0
pts            0
fgm            0
fga            0
fg             0
3p_made        0
3pa            0
3p             0
ftm            0
fta            0
ft             0
oreb           0
dreb           0
reb            0
ast            0
stl            0
blk            0
tov            0
target_5yrs    0
dtype: int64

- There is no missing data.

- It's important to check for missing values before building a model, because many models cannot handle them and any gaps would first have to be dropped or imputed.
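This dataset has no gaps, but for reference, here is a minimal sketch of what the check and one simple remedy could look like if a value were missing. The toy frame and the choice to drop rows (rather than impute) are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing `pts` value (illustrative only).
toy = pd.DataFrame({
    "gp": [36, 74, 58],
    "pts": [7.4, np.nan, 5.7],
})

# Count missing values per column, as done for the real data above.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple remedy: drop any row that contains a missing value.
cleaned = toy.dropna().reset_index(drop=True)
print(cleaned)
```

Dropping rows is the bluntest option; imputing a value (for example, the column median) keeps the row at the cost of introducing an estimate.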

## **Step 3: Statistical tests** 



Check the class balance in the data. To understand how balanced the dataset is, display the percentage of values that belong to each class in the target column. In this context, class 1 indicates an NBA career duration of at least five years, while class 0 indicates an NBA career duration of less than five years.

In [35]:
# Display percentage (%) of values for each class (1, 0) represented in the target column of this dataset.

data['target_5yrs'].value_counts(normalize=True)*100

1    62.014925
0    37.985075
Name: target_5yrs, dtype: float64

- About 62% of the values in the target column belong to class 1, and about 38% of the values belong to class 0. In other words, about 62% of players represented by this data have an NBA career duration of at least five years, and about 38% do not.
- The dataset is not perfectly balanced, but an exact 50-50 split is a rare occurrence in datasets, and a 62-38 split is not too imbalanced. However, if the majority class made up 90% or more of the dataset, that would be a concern, and it would be advisable to tackle the problem using techniques such as upsampling or downsampling.

It's important to check class balance because if one class is represented far more than the other, the model may be biased toward the majority class. When this happens, predictions for the minority class may be inaccurate.
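Upsampling is not needed for a 62-38 split, but a short sketch shows what it would involve on a more skewed target. The toy frame below (an 80-20 split) is hypothetical, and sampling the minority class with replacement via `DataFrame.sample` is just one simple approach.

```python
import pandas as pd

# Hypothetical, deliberately imbalanced toy data: 8 rows of class 1, 2 rows of class 0.
toy = pd.DataFrame({
    "pts": [7.4, 7.2, 5.2, 5.7, 4.5, 3.7, 6.6, 5.7, 2.4, 3.7],
    "target_5yrs": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
})

# Identify the majority class and split the rows accordingly.
majority_label = toy["target_5yrs"].value_counts().idxmax()
majority = toy[toy["target_5yrs"] == majority_label]
minority = toy[toy["target_5yrs"] != majority_label]

# Upsample: draw minority rows with replacement until both classes are the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_upsampled], ignore_index=True)

balance_pct = balanced["target_5yrs"].value_counts(normalize=True) * 100
print(balance_pct)
```

In practice, resampling like this is applied to the training split only, so that duplicated rows do not leak into the evaluation data.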

## **Step 4: Results and evaluation** 
Perform feature engineering, with the goal of identifying and creating features that will serve as useful predictors for the target variable, `target_5yrs`.

### Feature selection

The following table contains descriptions of the data in each column:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

- I should avoid using the `name` column as a feature. A player's name carries no information about performance, and basing predictions on names would be ethically questionable because it could introduce bias.

- While the number of games played (`gp`) offers some context, points scored per game (`pts`) is likely a stronger signal of career length. Multiplying `gp` by `pts` to get total points captures both, so I'll do that during feature extraction.

- Once I have total points, I can also compute total minutes played (`min * gp`) and use it to measure player efficiency in points per minute, since `min` on its own may be less informative.

- Among the field goal statistics, the percentage of field goals made (`fg`) better reflects performance than the raw counts of made (`fgm`) or attempted (`fga`) shots, offering a clearer comparison. The same applies to three-point and free throw percentages.

- Since total rebounds (`reb`) include both offensive (`oreb`) and defensive rebounds (`dreb`), using just the total makes more sense.

- Additionally, stats like assists (`ast`), steals (`stl`), blocks (`blk`), and turnovers (`tov`) are valuable indicators of player performance and can help predict career duration.

Therefore, at this stage of the feature engineering process, it would be most effective to select the following columns: 

`gp`, `min`, `pts`, `fg`, `3p`, `ft`, `reb`, `ast`, `stl`, `blk`, `tov`.
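The claim that `reb` makes `oreb` and `dreb` redundant can be sanity-checked directly. The sketch below uses the first five rows of those three columns as shown in the data preview earlier; the column name `gap` is introduced here only for illustration.

```python
import pandas as pd

# First five rows of oreb, dreb, and reb, copied from the data preview above.
rebounds = pd.DataFrame({
    "oreb": [0.7, 0.5, 0.5, 1.0, 1.0],
    "dreb": [3.4, 2.0, 1.7, 0.9, 1.5],
    "reb": [4.1, 2.4, 2.2, 1.9, 2.5],
})

# Absolute difference between (oreb + dreb) and the reported total.
rebounds["gap"] = (rebounds["oreb"] + rebounds["dreb"] - rebounds["reb"]).abs()
max_gap = rebounds["gap"].max()

print(rebounds)
print("largest gap:", max_gap)
```

The totals match up to the rounding of the per-game averages, which supports keeping only `reb`.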

Select the columns to proceed with, making sure to include the target column, `target_5yrs`. Display the first few rows to confirm they are as expected.

In [36]:
# Select the columns to proceed with and save the DataFrame in new variable `selected_data`.
# Include the target column, `target_5yrs`.

selected_data = data[['gp', 'min', 'pts', 'fg', '3p', 'ft', 'reb', 'ast', 'stl', 'blk', 'tov','target_5yrs']]

# Display the first few rows.

selected_data.head(5)


Unnamed: 0,gp,min,pts,fg,3p,ft,reb,ast,stl,blk,tov,target_5yrs
0,36,27.4,7.4,34.7,25.0,69.9,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,29.6,23.5,76.5,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,42.2,24.4,67.0,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,42.6,22.6,68.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,52.4,0.0,67.4,2.5,0.3,0.3,0.4,0.8,1


### Feature transformation

A key part of feature transformation is feature encoding. Categorical columns I plan to use as features must be converted into numerical formats.



- Many types of models require numerical input, and some also benefit from features that are normalized, standardized, or log-transformed when their distribution is skewed. Transforming categorical features into numerical features is therefore an important step.
- In this particular dataset, `name` is the only categorical column and the other columns are numerical. Given that `name` is not selected as a feature, all of the features selected at this point are already numerical and do not require encoding.
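No encoding is needed for this dataset, but for reference, the sketch below shows the two transformations mentioned above on a toy frame. The `position` column is hypothetical (the NBA data has no such column) and the values are illustrative; `pd.get_dummies` performs one-hot encoding and `np.log1p` applies a log transform that is safe at zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: `position` does NOT exist in the NBA dataset.
toy = pd.DataFrame({
    "position": ["guard", "center", "guard"],
    "total_points": [266.4, 252.0, 3848.0],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["position"], dtype=int)

# Log-transform a right-skewed numeric column; log1p computes log(1 + x).
encoded["log_total_points"] = np.log1p(encoded["total_points"])

print(encoded)
```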

### Feature extraction

The column descriptions are repeated here for reference:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

In [37]:
# Display the first few rows of `selected_data` for reference.

selected_data.head(10)


Unnamed: 0,gp,min,pts,fg,3p,ft,reb,ast,stl,blk,tov,target_5yrs
0,36,27.4,7.4,34.7,25.0,69.9,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,29.6,23.5,76.5,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,42.2,24.4,67.0,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,42.6,22.6,68.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,52.4,0.0,67.4,2.5,0.3,0.3,0.4,0.8,1
5,75,11.4,3.7,42.3,32.5,73.2,0.8,1.8,0.4,0.0,0.7,0
6,62,10.9,6.6,43.5,50.0,81.1,2.0,0.6,0.2,0.1,0.7,1
7,48,10.3,5.7,41.5,30.0,87.5,1.7,0.2,0.2,0.1,0.7,1
8,65,9.9,2.4,39.2,23.3,71.4,0.8,2.3,0.3,0.0,1.1,0
9,42,8.5,3.7,38.3,21.4,67.8,1.1,0.3,0.2,0.0,0.7,0


- **Columns**:
  - **`gp`**: Total number of games a player has participated in.
  - **`pts`**: Average points scored by the player per game.
  - **`min`**: Average minutes played by the player per game.

- **Combining Features**:
  - By multiplying `gp` and `pts`, I can calculate the total points scored by the player across all games and store this in a new column called `total_points`. This metric can provide insights into the player’s performance and their potential career longevity.

- **Efficiency Measure**:
  - Additionally, I can divide the new `total_points` by total minutes played (`min * gp`) to derive another feature called `efficiency`, measured in points per minute. Player efficiency could influence predictions about career duration.


In [38]:
# Extract two features that would help predict target_5yrs.
# Create a new variable named `extracted_data`.

extracted_data = selected_data.copy()

# Add a new column named `total_points`; 
# Calculate total points earned by multiplying the number of games played by the average number of points earned per game.

extracted_data["total_points"] = extracted_data["gp"] * extracted_data["pts"]


# Add a new column named `efficiency`. Calculate efficiency by dividing the total points earned by the total number 
# of minutes played, which yields points per minute. (Note that `min` represents avg. minutes per game.)
extracted_data["efficiency"] = extracted_data["total_points"] / (extracted_data["min"] * extracted_data["gp"])

# Display the first few rows of `extracted_data` to confirm that the new columns were added.
extracted_data.head()


Unnamed: 0,gp,min,pts,fg,3p,ft,reb,ast,stl,blk,tov,target_5yrs,total_points,efficiency
0,36,27.4,7.4,34.7,25.0,69.9,4.1,1.9,0.4,0.4,1.3,0,266.4,0.270073
1,35,26.9,7.2,29.6,23.5,76.5,2.4,3.7,1.1,0.5,1.6,0,252.0,0.267658
2,74,15.3,5.2,42.2,24.4,67.0,2.2,1.0,0.5,0.3,1.0,0,384.8,0.339869
3,58,11.6,5.7,42.6,22.6,68.9,1.9,0.8,0.6,0.1,1.0,1,330.6,0.491379
4,48,11.5,4.5,52.4,0.0,67.4,2.5,0.3,0.3,0.4,0.8,1,216.0,0.391304
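Because `gp` appears in both the numerator and the denominator, the efficiency formula reduces algebraically to `pts / min`. The sketch below checks this on the first three rows shown above. The zero-minutes guard at the end is an added assumption, not part of the original notebook, and would only matter if some player averaged 0 minutes.

```python
import numpy as np
import pandas as pd

# First three rows of gp, min, and pts from the preview above.
rows = pd.DataFrame({
    "gp": [36, 35, 74],
    "min": [27.4, 26.9, 15.3],
    "pts": [7.4, 7.2, 5.2],
})

# Efficiency as computed in the notebook: total points / total minutes.
via_totals = (rows["gp"] * rows["pts"]) / (rows["min"] * rows["gp"])

# Equivalent simplified form: points per game / minutes per game.
per_minute = rows["pts"] / rows["min"]

# Assumption: treat zero minutes as missing to avoid dividing by zero.
safe_minutes = rows["min"].replace(0, np.nan)
guarded = rows["pts"] / safe_minutes

print(np.allclose(via_totals, per_minute))
print(per_minute.round(6).tolist())
```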


To prepare for the Naive Bayes model I will build later in the project, I need to clean the extracted data and keep it concise. The Naive Bayes algorithm assumes that features are independent of one another given the class. To meet this requirement, I may need to remove original features if I aggregate them to create new ones. As a result, I will drop the columns used for feature extraction.


**Note:** There are other types of models that do not involve independence assumptions, so this would not be required in those instances. In fact, keeping the original features may be beneficial.

In [39]:
# Remove any columns from `extracted_data` that are no longer needed.

extracted_data = extracted_data.drop(["gp", "pts", "min"], axis=1)


# Display the first few rows of `extracted_data` to ensure that column drops took place.

extracted_data.head()


Unnamed: 0,fg,3p,ft,reb,ast,stl,blk,tov,target_5yrs,total_points,efficiency
0,34.7,25.0,69.9,4.1,1.9,0.4,0.4,1.3,0,266.4,0.270073
1,29.6,23.5,76.5,2.4,3.7,1.1,0.5,1.6,0,252.0,0.267658
2,42.2,24.4,67.0,2.2,1.0,0.5,0.3,1.0,0,384.8,0.339869
3,42.6,22.6,68.9,1.9,0.8,0.6,0.1,1.0,1,330.6,0.491379
4,52.4,0.0,67.4,2.5,0.3,0.3,0.4,0.8,1,216.0,0.391304
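One rough way to probe the independence assumption is to look at pairwise correlations among the remaining features. The sketch below uses four columns from the five rows shown above, so the numbers are far too few to draw conclusions from; the `threshold` of 0.8 is an arbitrary, hypothetical cutoff. Correlation only captures linear relationships and does not condition on the class, so this is a heuristic rather than a test of the Naive Bayes assumption.

```python
import numpy as np
import pandas as pd

# Four feature columns from the five rows of `extracted_data` shown above.
sample = pd.DataFrame({
    "fg": [34.7, 29.6, 42.2, 42.6, 52.4],
    "reb": [4.1, 2.4, 2.2, 1.9, 2.5],
    "total_points": [266.4, 252.0, 384.8, 330.6, 216.0],
    "efficiency": [0.270073, 0.267658, 0.339869, 0.491379, 0.391304],
})

# Pairwise Pearson correlations between features.
corr = sample.corr()

# Keep only the upper triangle so each pair appears once, then flag strong pairs.
threshold = 0.8  # hypothetical cutoff, chosen only for illustration
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > threshold]

print(corr.round(2))
print(strong_pairs)
```

On the full dataset, the same approach would be applied to `extracted_data.drop(columns="target_5yrs")`.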


Export the extracted data as a new .csv file. I will use this later when making the Naive Bayes model.

In [40]:
# Export the extracted data without writing the row index as an extra column.

extracted_data.to_csv("extracted_nba_players_data.csv", index=False)
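A quick round-trip check can confirm that an exported file contains only the intended columns and no stray index column. The sketch below uses a hypothetical two-row frame and a temporary directory so it does not touch the real export; the file name `toy_extracted.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for `extracted_data` (illustrative values).
toy = pd.DataFrame({
    "fg": [34.7, 29.6],
    "total_points": [266.4, 252.0],
    "target_5yrs": [0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_extracted.csv")

    # Write without the index, then read the file straight back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.shape)
```

If the index had been written, the reloaded frame would contain an extra `Unnamed: 0` column.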

## Conclusion

In this phase of the project, I focused on feature engineering to help predict NBA player career longevity based on performance metrics.

- **Objective**: My aim was to identify key features that can predict whether a player's career will last at least five years, aiding NBA managers and coaches in their decision-making.

- **Data Overview**: The dataset consists of 1,340 observations and 21 columns, primarily numerical, capturing player performance metrics.

- **Feature Selection**: I carefully selected relevant features such as games played, points scored, shooting percentages, and various performance statistics to create meaningful insights.

- **Engineered Features**: Through the feature engineering process, the following new features were created:
  - **`total_points`**: Total points scored across all games, calculated by combining `gp` and `pts`.
  - **`efficiency`**: Points earned per minute, calculated as `total_points` divided by the total minutes played (`min * gp`).

- **Data Integrity**: I ensured that the data is clean and free of missing values, as this is crucial for building an effective predictive model.

- **Statistical Balance**: I evaluated the class distribution in the target variable, which indicates career duration, noting that while it is somewhat imbalanced, it is not severely skewed.

- **Next Steps**: Moving forward, I will use the exported data to build a Naive Bayes model. With that model's feature-independence assumption in mind, the original `gp`, `pts`, and `min` columns were dropped after being combined into the new features.

- **Insights for Stakeholders**: Key attributes will be summarized for stakeholders, providing them with an understanding of how player performance metrics will inform the predictive model and assist in strategic decisions.
