# Feature Engineering

In this project we will explore the process and principle of feature engineering to help us determine which attributes in our dataset can best predict certain measures when building a model.

For this project we will use the case study below.

**Case Study:** 

In this project, you are working for a firm that provides insights to the National Basketball Association (NBA), a professional North American basketball league. You will help NBA managers and coaches identify which players are most likely to thrive in the high-pressure environment of professional basketball and help the team be successful over time.

To do this, we will analyze a subset of data that contains information about NBA players and their performance records. We will conduct feature engineering to determine which features will most effectively predict whether a player's NBA career will last at least five years. The insights gained will then be used in our next project: building the predictive model (Naive Bayes).

**Note**: This project will not include the typical exhaustive EDA process as its main aim is to show the process and principle of Feature Engineering. EDA is a very important part of any machine learning project and should always be carried out properly.

## **Step 1: Imports and data loading**


In [44]:
# import packages
import pandas as pd

In [58]:
# load data
df = pd.read_csv(r"C:\Users\Ghost\Desktop\project_files\Files\nba-players.csv")

## **Step 2: Data exploration** 

#### **Data overview and summary statistics**

Use the following methods and attributes on the dataframe:

* `head()`
* `shape`
* `info()`

It's always helpful to gather this information at the beginning of a project so that we can refer back to it when needed.

In [59]:
df.head()

| |Unnamed: 0|name|gp|min|pts|fgm|fga|fg|3p_made|3pa|...|fta|ft|oreb|dreb|reb|ast|stl|blk|tov|target_5yrs|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|0|Brandon Ingram|36|27.4|7.4|2.6|7.6|34.7|0.5|2.1|...|2.3|69.9|0.7|3.4|4.1|1.9|0.4|0.4|1.3|0|
|1|1|Andrew Harrison|35|26.9|7.2|2.0|6.7|29.6|0.7|2.8|...|3.4|76.5|0.5|2.0|2.4|3.7|1.1|0.5|1.6|0|
|2|2|JaKarr Sampson|74|15.3|5.2|2.0|4.7|42.2|0.4|1.7|...|1.3|67.0|0.5|1.7|2.2|1.0|0.5|0.3|1.0|0|
|3|3|Malik Sealy|58|11.6|5.7|2.3|5.5|42.6|0.1|0.5|...|1.3|68.9|1.0|0.9|1.9|0.8|0.6|0.1|1.0|1|
|4|4|Matt Geiger|48|11.5|4.5|1.6|3.0|52.4|0.0|0.1|...|1.9|67.4|1.0|1.5|2.5|0.3|0.3|0.4|0.8|1|


Drop the `Unnamed: 0` column, which only repeats the row index in the rows shown above.

In [46]:
# drop Unnamed column
df = df.drop('Unnamed: 0', axis = 1)

# confirm changes
df.head(3)

| |name|gp|min|pts|fgm|fga|fg|3p_made|3pa|3p|...|fta|ft|oreb|dreb|reb|ast|stl|blk|tov|target_5yrs|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Brandon Ingram|36|27.4|7.4|2.6|7.6|34.7|0.5|2.1|25.0|...|2.3|69.9|0.7|3.4|4.1|1.9|0.4|0.4|1.3|0|
|1|Andrew Harrison|35|26.9|7.2|2.0|6.7|29.6|0.7|2.8|23.5|...|3.4|76.5|0.5|2.0|2.4|3.7|1.1|0.5|1.6|0|
|2|JaKarr Sampson|74|15.3|5.2|2.0|4.7|42.2|0.4|1.7|24.4|...|1.3|67.0|0.5|1.7|2.2|1.0|0.5|0.3|1.0|0|
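A column named `Unnamed: 0` typically appears when a CSV file was written together with its row index and later read back without declaring that column as the index. The sketch below uses a small in-memory CSV (three rows and three columns borrowed from the output above) to show that passing `index_col=0` to `read_csv` is an alternative to dropping the column after loading. Whether `nba-players.csv` was actually produced this way is an assumption.

```python
import io

import pandas as pd

# Small in-memory CSV whose first column has an empty header,
# which is what `to_csv()` writes for an unnamed index.
csv_text = """,name,gp
0,Brandon Ingram,36
1,Andrew Harrison,35
2,JaKarr Sampson,74
"""

# Default read: the header-less column is loaded as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Alternative: tell pandas that the first column is the index.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Either approach leaves the same data columns; dropping after loading, as done in this project, keeps the default `RangeIndex`.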


In [48]:
df.shape

(1340, 21)

Generate summary information using the `info()` method.

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         1340 non-null   object 
 1   gp           1340 non-null   int64  
 2   min          1340 non-null   float64
 3   pts          1340 non-null   float64
 4   fgm          1340 non-null   float64
 5   fga          1340 non-null   float64
 6   fg           1340 non-null   float64
 7   3p_made      1340 non-null   float64
 8   3pa          1340 non-null   float64
 9   3p           1340 non-null   float64
 10  ftm          1340 non-null   float64
 11  fta          1340 non-null   float64
 12  ft           1340 non-null   float64
 13  oreb         1340 non-null   float64
 14  dreb         1340 non-null   float64
 15  reb          1340 non-null   float64
 16  ast          1340 non-null   float64
 17  stl          1340 non-null   float64
 18  blk          1340 non-null   float64
 19  tov          1340 non-null   float64
 20  target_5yrs  1340 non-null   int64  
dtypes: float64(18), int64(2), object(1)

The following table provides a description of the data in each column.

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

### Check for missing values

Check for missing values using the `isna()` method, summing the result for each column.

In [17]:
df.isna().sum()

name           0
gp             0
min            0
pts            0
fgm            0
fga            0
fg             0
3p_made        0
3pa            0
3p             0
ftm            0
fta            0
ft             0
oreb           0
dreb           0
reb            0
ast            0
stl            0
blk            0
tov            0
target_5yrs    0
dtype: int64

## **Step 3: Statistical tests** 



Next, use a statistical technique to check the class balance in the data. To understand how balanced the dataset is in terms of class, display the percentage of values that belong to each class in the target column. In this context, class 1 indicates an NBA career duration of at least five years, while class 0 indicates an NBA career duration of less than five years.

In [51]:
# check class balance
round(df['target_5yrs'].value_counts(normalize = True) * 100, 2)

target_5yrs
1    62.01
0    37.99
Name: proportion, dtype: float64

Players whose NBA careers lasted at least five years account for 62.01% of the data, while players with careers shorter than five years make up the remaining 37.99%.

A 62:38 class split is only mildly imbalanced, so it is acceptable to proceed without resampling.
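One way to read the class split: a classifier that always predicts the majority class would be correct about 62% of the time on this data, so that figure is a natural floor for any later model. The sketch below reproduces the calculation on a small made-up target column; the values are illustrative and are not the NBA data.

```python
import pandas as pd

# Illustrative binary target: five ones and three zeros (not the NBA data).
target = pd.Series([1, 1, 0, 1, 0, 1, 1, 0], name="target_5yrs")

# Share of each class, as a percentage.
class_share = (target.value_counts(normalize=True) * 100).round(2)
print(class_share)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline}%")
```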

## **Step 4: Results and evaluation** 


Now, perform feature engineering, with the goal of identifying and creating features that will serve as useful predictors for the target variable, `target_5yrs`. 

### Feature selection

The following table contains descriptions of the data in each column:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

##### Which columns should we select and avoid selecting as features, and why? Keep in mind the goal is to identify features that will serve as useful predictors for the target variable, `target_5yrs`. 

- `name`: will not be selected, since a player's name carries no information for predicting `target_5yrs`.
- `gp`: the number of games played is informative and can be combined with other features to extract new ones.
- `min`: will be selected; the average number of minutes played per game can be a very good predictor.
- `pts`: will be selected; the average number of points per game gives valuable information about a player's offensive performance.
- `fg`: the average percentage of field goals made per game. It summarizes the field goal information, so we do not need to select `fga` and `fgm`.
- `3p`: the average percentage of three-point field goals made per game. It summarizes the three-point information, so we do not need to select `3pa` and `3p_made`.
- `ft`: the average percentage of free throws made per game. It summarizes the free throw information, so we do not need to select `fta` and `ftm`.
- `reb`: the average number of rebounds per game. It is the overall figure, combining `dreb` and `oreb`, so there is no need to include those two.
- `ast`, `stl`, `blk`, and `tov`: will all be selected, as they are all good performance metrics.
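The claim that the percentage and total columns summarize their component columns can be spot-checked. The sketch below copies the first five rows shown in the `df.head()` output and compares `reb` with `oreb` + `dreb`, and `fg` with 100 * `fgm` / `fga`. Five rows are only a sanity check, not proof for the whole dataset, and the looser match for `fg` is presumably because the per-game averages are themselves rounded.

```python
import pandas as pd

# First five rows of the dataset, copied from the `df.head()` output.
sample = pd.DataFrame({
    "fgm":  [2.6, 2.0, 2.0, 2.3, 1.6],
    "fga":  [7.6, 6.7, 4.7, 5.5, 3.0],
    "fg":   [34.7, 29.6, 42.2, 42.6, 52.4],
    "oreb": [0.7, 0.5, 0.5, 1.0, 1.0],
    "dreb": [3.4, 2.0, 1.7, 0.9, 1.5],
    "reb":  [4.1, 2.4, 2.2, 1.9, 2.5],
})

# Gap between the reported total rebounds and the sum of its two parts.
reb_gap = (sample["oreb"] + sample["dreb"] - sample["reb"]).abs()

# Gap between the reported field goal percentage and made / attempted.
fg_gap = (100 * sample["fgm"] / sample["fga"] - sample["fg"]).abs()

print("largest rebound gap:", round(reb_gap.max(), 2))
print("largest field goal percentage gap:", round(fg_gap.max(), 2))
```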

Next, we select the columns we want to proceed with.

In [52]:
# select the columns to proceed with and save the DataFrame in new variable `selected_data`.

selected_data = df[['gp', 'min', 'pts', 'fg', '3p', 'ft', 'reb', 'ast', 'stl', 'blk', 'tov', 'target_5yrs']]

selected_data.head()

| |gp|min|pts|fg|3p|ft|reb|ast|stl|blk|tov|target_5yrs|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|36|27.4|7.4|34.7|25.0|69.9|4.1|1.9|0.4|0.4|1.3|0|
|1|35|26.9|7.2|29.6|23.5|76.5|2.4|3.7|1.1|0.5|1.6|0|
|2|74|15.3|5.2|42.2|24.4|67.0|2.2|1.0|0.5|0.3|1.0|0|
|3|58|11.6|5.7|42.6|22.6|68.9|1.9|0.8|0.6|0.1|1.0|1|
|4|48|11.5|4.5|52.4|0.0|67.4|2.5|0.3|0.3|0.4|0.8|1|


### Feature transformation

An important aspect of feature transformation is feature encoding: if there are categorical columns that we want to use as features, they should be converted to numerical values.

The only categorical column in our dataset, `name`, was not selected because it provides no value to our model. Every remaining feature is numeric, so there is no need for feature encoding and we can move to the next step.
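For reference, feature encoding would look roughly like the sketch below if a categorical feature were present. The `position` column is hypothetical and does not exist in the NBA dataset used here; one-hot encoding with `pd.get_dummies` is only one of several possible encodings.

```python
import pandas as pd

# Hypothetical data: `position` is NOT a column in the NBA dataset used here.
players = pd.DataFrame({
    "position": ["guard", "center", "forward", "guard"],
    "pts": [7.4, 4.5, 5.7, 7.2],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(players, columns=["position"], dtype=int)
print(encoded)
```

Each category becomes its own indicator column, so the encoded frame contains only numeric values.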

### Feature extraction

Which columns lend themselves to feature extraction, and what new features can be extracted from those columns?

In [53]:
selected_data.head()

| |gp|min|pts|fg|3p|ft|reb|ast|stl|blk|tov|target_5yrs|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|36|27.4|7.4|34.7|25.0|69.9|4.1|1.9|0.4|0.4|1.3|0|
|1|35|26.9|7.2|29.6|23.5|76.5|2.4|3.7|1.1|0.5|1.6|0|
|2|74|15.3|5.2|42.2|24.4|67.0|2.2|1.0|0.5|0.3|1.0|0|
|3|58|11.6|5.7|42.6|22.6|68.9|1.9|0.8|0.6|0.1|1.0|1|
|4|48|11.5|4.5|52.4|0.0|67.4|2.5|0.3|0.3|0.4|0.8|1|


We will extract two features that we believe will help predict `target_5yrs`, then create a new variable named `extracted_data` that contains the features from `selected_data` as well as the extracted ones.

Two new features we can create:
- total points scored = `gp` * `pts`
- efficiency = total points scored / (`gp` * `min`), that is, points scored per minute played

In [55]:
# create a new variable named `extracted_data`.
extracted_data = selected_data.copy()

# extract new features
# total points scored
extracted_data['total_pts'] = extracted_data['gp'] * extracted_data['pts']

# efficiency 
extracted_data['efficiency'] = round(extracted_data['total_pts'] / (extracted_data['gp'] * extracted_data['min']), 2)

extracted_data.head()

| |gp|min|pts|fg|3p|ft|reb|ast|stl|blk|tov|target_5yrs|total_pts|efficiency|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|36|27.4|7.4|34.7|25.0|69.9|4.1|1.9|0.4|0.4|1.3|0|266.4|0.27|
|1|35|26.9|7.2|29.6|23.5|76.5|2.4|3.7|1.1|0.5|1.6|0|252.0|0.27|
|2|74|15.3|5.2|42.2|24.4|67.0|2.2|1.0|0.5|0.3|1.0|0|384.8|0.34|
|3|58|11.6|5.7|42.6|22.6|68.9|1.9|0.8|0.6|0.1|1.0|1|330.6|0.49|
|4|48|11.5|4.5|52.4|0.0|67.4|2.5|0.3|0.3|0.4|0.8|1|216.0|0.39|
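Because `gp` appears in both the numerator and the denominator of the efficiency formula, it cancels, and the feature reduces to `pts` / `min`. The sketch below checks this on the first five rows copied from the output above. It assumes `gp` and `min` are never zero, which holds for these rows but has not been verified here for the full dataset.

```python
import pandas as pd

# First five rows of `selected_data`, copied from the output above.
sample = pd.DataFrame({
    "gp":  [36, 35, 74, 58, 48],
    "min": [27.4, 26.9, 15.3, 11.6, 11.5],
    "pts": [7.4, 7.2, 5.2, 5.7, 4.5],
})

# Definition used in this project: total points over total minutes.
total_pts = sample["gp"] * sample["pts"]
efficiency = (total_pts / (sample["gp"] * sample["min"])).round(2)

# Algebraically, `gp` cancels, leaving points per minute.
pts_per_min = (sample["pts"] / sample["min"]).round(2)

print(efficiency.tolist())
print(pts_per_min.tolist())
```

Both lines print the same values as the `efficiency` column in the output above.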


Now, to prepare for the Naive Bayes model that we will build in the next project, clean the extracted data and ensure it is concise. Naive Bayes assumes that features are independent of each other given the class. To stay closer to that assumption, when certain features are combined to yield new features, it may be necessary to remove the original features. Therefore, drop the columns that were used to extract the new features.

**Note:** There are other types of models that do not involve independence assumptions, so this would not be required in those instances. In fact, keeping the original features may be beneficial.

In [56]:
# drop the columns from `extracted_data` that were used to derive the new features.
extracted_data = extracted_data.drop(columns = ['gp', 'min', 'pts'])

# confirm changes
extracted_data.head()

| |fg|3p|ft|reb|ast|stl|blk|tov|target_5yrs|total_pts|efficiency|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|34.7|25.0|69.9|4.1|1.9|0.4|0.4|1.3|0|266.4|0.27|
|1|29.6|23.5|76.5|2.4|3.7|1.1|0.5|1.6|0|252.0|0.27|
|2|42.2|24.4|67.0|2.2|1.0|0.5|0.3|1.0|0|384.8|0.34|
|3|42.6|22.6|68.9|1.9|0.8|0.6|0.1|1.0|1|330.6|0.49|
|4|52.4|0.0|67.4|2.5|0.3|0.3|0.4|0.8|1|216.0|0.39|
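A rough way to screen the remaining features for dependence is to look at pairwise correlations. The sketch below defines a hypothetical helper, `highly_correlated_pairs`, and runs it on made-up data rather than the NBA dataset. Pearson correlation only captures linear association and ignores the class label, so it is at best a proxy for the conditional independence that Naive Bayes assumes; the 0.9 threshold is an arbitrary choice for illustration.

```python
from itertools import combinations

import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = frame.corr()
    pairs = []
    for col_a, col_b in combinations(frame.columns, 2):
        r = corr.loc[col_a, col_b]
        if abs(r) >= threshold:
            pairs.append((col_a, col_b, round(float(r), 3)))
    return pairs


# Illustrative data (not the NBA dataset): `b` is an exact multiple of `a`.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

print(highly_correlated_pairs(toy))
```

Applied to `extracted_data.drop(columns="target_5yrs")`, the same helper would list any feature pairs worth a second look before modeling.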


Next, export the extracted data as a new .csv file.

In [57]:
# export the extracted data.
extracted_data.to_csv(r"C:\Users\Ghost\Desktop\project_files\Files\extracted_nba_players.csv", index=False)
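Writing without the index matters because it prevents an extra unnamed column from appearing when the file is read back in the next project. The sketch below does a round trip on two rows copied from the `extracted_data` output (a subset of columns), using a temporary directory and a hypothetical file name so that it runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the `extracted_data` output (subset of columns).
frame = pd.DataFrame({
    "fg": [34.7, 29.6],
    "target_5yrs": [0, 0],
    "total_pts": [266.4, 252.0],
    "efficiency": [0.27, 0.27],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory.
    csv_path = Path(tmp_dir) / "extracted_sample.csv"
    frame.to_csv(csv_path, index=False)  # do not write the row index
    reloaded = pd.read_csv(csv_path)

# The reloaded frame has the same columns and no extra "Unnamed: 0" column.
print(reloaded.columns.tolist())
pd.testing.assert_frame_equal(frame, reloaded)
```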

In [39]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fg           1340 non-null   float64
 1   3p           1340 non-null   float64
 2   ft           1340 non-null   float64
 3   reb          1340 non-null   float64
 4   ast          1340 non-null   float64
 5   stl          1340 non-null   float64
 6   blk          1340 non-null   float64
 7   tov          1340 non-null   float64
 8   target_5yrs  1340 non-null   int64  
 9   total_pts    1340 non-null   float64
 10  efficiency   1340 non-null   float64
dtypes: float64(10), int64(1)
memory usage: 115.3 KB


##### We have successfully performed feature engineering on the NBA players dataset. We exported a new CSV file consisting of 1340 entries and 11 columns (10 features plus the target):

`fg`, `3p`, `ft`, `reb`, `ast`, `stl`, `blk`, `tov`, `target_5yrs`, `total_pts`, and `efficiency`
##### We will use this extracted dataset to build a Naive Bayes model in our next project.