## NBA Player Statistics — Preprocessing & Feature Engineering
### Team TSIAUF — Sanzhar Ilichbekov, Nurbekov Mirlan, Aydraliev Atai.

This notebook shows the full preprocessing steps for the NBA Player Statistics dataset.
The goal is to make the data clean, correct, and ready for analysis and machine learning models.

We load the data, check it, fix missing values, standardize important columns, create new features, and prepare the final dataset for modeling and visualization.

In [3]:
# Importing the needed libraries

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import joblib

In [4]:
# Configure pandas display limits so wide DataFrames show all columns

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 20)

In [5]:
file_path = r"C:\Users\sanca\Videos\NBA_Project_TSIAUF\data\PlayerStatistics.csv"
df = pd.read_csv(file_path)

df.shape, df.columns.tolist()


((1637622, 35),
 ['firstName',
  'lastName',
  'personId',
  'gameId',
  'gameDateTimeEst',
  'playerteamCity',
  'playerteamName',
  'opponentteamCity',
  'opponentteamName',
  'gameType',
  'gameLabel',
  'gameSubLabel',
  'seriesGameNumber',
  'win',
  'home',
  'numMinutes',
  'points',
  'assists',
  'blocks',
  'steals',
  'fieldGoalsAttempted',
  'fieldGoalsMade',
  'fieldGoalsPercentage',
  'threePointersAttempted',
  'threePointersMade',
  'threePointersPercentage',
  'freeThrowsAttempted',
  'freeThrowsMade',
  'freeThrowsPercentage',
  'reboundsDefensive',
  'reboundsOffensive',
  'reboundsTotal',
  'foulsPersonal',
  'turnovers',
  'plusMinusPoints'])

The dataset has 1,637,622 rows and 35 columns; the list above shows the column names.

In [7]:
display(df.head())
df.info()
df.isna().sum().sort_values(ascending=False).head(20)

Unnamed: 0,firstName,lastName,personId,gameId,gameDateTimeEst,playerteamCity,playerteamName,opponentteamCity,opponentteamName,gameType,gameLabel,gameSubLabel,seriesGameNumber,win,home,numMinutes,points,assists,blocks,steals,fieldGoalsAttempted,fieldGoalsMade,fieldGoalsPercentage,threePointersAttempted,threePointersMade,threePointersPercentage,freeThrowsAttempted,freeThrowsMade,freeThrowsPercentage,reboundsDefensive,reboundsOffensive,reboundsTotal,foulsPersonal,turnovers,plusMinusPoints
0,Chris,Paul,101108,22500300,2025-11-29 17:00:00,LA,Clippers,Dallas,Mavericks,,,,,0,1,11.03,0.0,3.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,2.0,0.0,-2.0
1,D'Angelo,Russell,1626156,22500300,2025-11-29 17:00:00,Dallas,Mavericks,LA,Clippers,,,,,1,0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Kris,Dunn,1627739,22500300,2025-11-29 17:00:00,LA,Clippers,Dallas,Mavericks,,,,,0,1,32.46,6.0,2.0,0.0,2.0,7.0,3.0,0.429,4.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0,2.0,1.0,8.0
3,Ivica,Zubac,1627826,22500300,2025-11-29 17:00:00,LA,Clippers,Dallas,Mavericks,,,,,0,1,37.53,19.0,4.0,1.0,1.0,12.0,8.0,0.667,0.0,0.0,0.0,3.0,3.0,1.0,9.0,2.0,11.0,5.0,2.0,4.0
4,John,Collins,1628381,22500300,2025-11-29 17:00:00,LA,Clippers,Dallas,Mavericks,,,,,0,1,36.03,21.0,1.0,3.0,0.0,10.0,9.0,0.9,2.0,1.0,0.5,2.0,2.0,1.0,4.0,0.0,4.0,2.0,1.0,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1637622 entries, 0 to 1637621
Data columns (total 35 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   firstName                1637622 non-null  object 
 1   lastName                 1637622 non-null  object 
 2   personId                 1637622 non-null  int64  
 3   gameId                   1637622 non-null  int64  
 4   gameDateTimeEst          1637622 non-null  object 
 5   playerteamCity           1637622 non-null  object 
 6   playerteamName           1637622 non-null  object 
 7   opponentteamCity         1637622 non-null  object 
 8   opponentteamName         1637622 non-null  object 
 9   gameType                 1629175 non-null  object 
 10  gameLabel                94707 non-null    object 
 11  gameSubLabel             6166 non-null     object 
 12  seriesGameNumber         135048 non-null   float64
 13  win                      1637622 non-null 

gameSubLabel               1631456
gameLabel                  1542915
seriesGameNumber           1502574
numMinutes                  163305
gameType                      8447
fieldGoalsPercentage          1219
threePointersAttempted        1219
threePointersMade             1219
threePointersPercentage       1219
freeThrowsAttempted           1219
assists                       1219
fieldGoalsAttempted           1219
freeThrowsMade                1219
freeThrowsPercentage          1219
reboundsDefensive             1219
reboundsOffensive             1219
reboundsTotal                 1219
foulsPersonal                 1219
turnovers                     1219
fieldGoalsMade                1219
dtype: int64

In [15]:
df['gameDateTimeEst'] = pd.to_datetime(df['gameDateTimeEst'], utc=True, errors='coerce')
df['gameYear'] = df['gameDateTimeEst'].dt.year
df['gameMonth'] = df['gameDateTimeEst'].dt.month

# .copy() avoids a possible SettingWithCopyWarning on the column assignments in later cells
df = df[df['gameYear'] >= 2016].copy()

#### Keeping Data from 2016 and Later

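We converted the `gameDateTimeEst` column to datetime and extracted the **year** and **month** into separate columns. Then, we filtered the dataset to keep only rows from **2016 or later**. Because `errors='coerce'` turns unparseable dates into `NaT`, rows without a valid date have no year and are removed by this filter as well.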
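
The effect of the date filter can be checked on a small example. This is a minimal sketch on a hypothetical three-row frame (the values are made up; only the column name comes from the notebook), showing that a row with an unparseable date is dropped together with the pre-2016 row.

```python
import pandas as pd

# Toy frame with hypothetical values; the column name follows the notebook.
toy = pd.DataFrame({
    "gameDateTimeEst": ["2015-12-31 19:00:00", "2016-01-01 19:00:00", "not a date"]
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["gameDateTimeEst"] = pd.to_datetime(toy["gameDateTimeEst"], utc=True, errors="coerce")
toy["gameYear"] = toy["gameDateTimeEst"].dt.year

# NaT gives a NaN year, and NaN >= 2016 evaluates to False, so that row is dropped too.
kept = toy[toy["gameYear"] >= 2016].copy()
print(kept)
```

Only the 2016 row survives; the year column is stored as float because it held a NaN before filtering.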

In [18]:
df.drop(columns=["gameSubLabel", "gameLabel", "seriesGameNumber", "gameType"], inplace=True)

#### Dropping Low-Value Columns
Some columns contain too many missing values to be useful.
In the raw data, gameSubLabel, gameLabel, and seriesGameNumber are each more than 90% empty.
gameType is mostly populated (8,447 missing values), but it is not used in the later steps of this notebook.
We remove all four columns to keep the dataset cleaner and easier to use.
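
The share of missing values per column can be computed before deciding what to drop. The sketch below uses a hypothetical toy frame; the cell values and the 0.5 cutoff are assumptions for illustration, not settings taken from this project.

```python
import numpy as np
import pandas as pd

# Toy frame; the cell values are hypothetical placeholders.
toy = pd.DataFrame({
    "gameSubLabel": [np.nan, np.nan, np.nan, "some label"],
    "gameType": ["type A", "type B", np.nan, "type A"],
    "points": [10.0, 4.0, 0.0, 22.0],
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)

threshold = 0.5  # hypothetical cutoff, chosen only for this example
sparse_cols = missing_share[missing_share > threshold].index.tolist()

toy_clean = toy.drop(columns=sparse_cols)
print(missing_share)
print("Dropped:", sparse_cols)
```

A threshold-based rule like this only flags sparse columns; a well-populated column that is simply not needed still has to be dropped by name.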

In [21]:
df["numMinutes"] = pd.to_numeric(df["numMinutes"], errors="coerce")
df["numMinutes"] = df.groupby("personId")["numMinutes"].transform(lambda x: x.fillna(x.median()))
df["numMinutes"] = df["numMinutes"].fillna(df["numMinutes"].mean())

print("Missing values in minutes:", df["numMinutes"].isna().sum())

Missing values in minutes: 0


#### Fixing Missing Playing Time
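The column numMinutes has over 160,000 missing values in the raw data.
To fix this, we fill a player's missing minutes with that player's median playing time.
If a player has no valid minutes at all, we fill the value with the overall mean of the column.
Using the per-player median keeps each filled value close to that player's usual playing time. A missing value may also mean the player did not play, so this step can overstate minutes for such rows.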
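
The two-stage fill can be traced on a few rows. This is a minimal sketch with hypothetical player IDs and minutes; it uses `transform("median")`, which should give the same per-player medians as the lambda in the code cell, followed by the same overall-mean fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical data: player 1 has one gap, player 3 has no valid minutes at all.
toy = pd.DataFrame({
    "personId":   [1, 1, 1, 2, 2, 3],
    "numMinutes": [30.0, np.nan, 34.0, 10.0, 12.0, np.nan],
})

# Stage 1: fill each gap with that player's own median (NaN if the player has none).
player_median = toy.groupby("personId")["numMinutes"].transform("median")
toy["numMinutes"] = toy["numMinutes"].fillna(player_median)

# Stage 2: whatever is still missing gets the overall mean of the column.
toy["numMinutes"] = toy["numMinutes"].fillna(toy["numMinutes"].mean())
print(toy)
```

Player 1's gap becomes the median of 30 and 34, while player 3 falls through to the column mean computed after the first stage.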

In [24]:
stat_cols = [
    "points", "assists", "blocks", "steals", "turnovers", "foulsPersonal",
    "reboundsTotal", "reboundsDefensive", "reboundsOffensive",
    "fieldGoalsAttempted", "fieldGoalsMade", "fieldGoalsPercentage",
    "threePointersAttempted", "threePointersMade", "threePointersPercentage",
    "freeThrowsAttempted", "freeThrowsMade", "freeThrowsPercentage",
    "plusMinusPoints"
]

for col in stat_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(0)

print("Missing values in points:", df["points"].isna().sum())

Missing values in points: 0


#### Fixing Missing Shooting Statistics
The raw data has around 1,200 missing values in each shooting-related column, and the same holds for several other box-score columns such as assists and rebounds.
This is a small number compared to the total number of rows, so we convert every column in stat_cols to numeric and fill the missing values with 0.
The assumption is that a missing value means the player did not record that statistic in the game.
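
Since zero-filling hides how many values were missing, it can help to count them first. The following is a small sketch on hypothetical string input, showing how `errors="coerce"` and `fillna(0)` interact.

```python
import pandas as pd

# Hypothetical raw values: a real number, a missing entry, and a non-numeric string.
toy = pd.DataFrame({"points": ["12", None, "n/a", "7"]})

# Non-numeric entries become NaN rather than raising an error.
toy["points"] = pd.to_numeric(toy["points"], errors="coerce")

# Record the number of gaps before they are overwritten with 0.
n_missing = int(toy["points"].isna().sum())
toy["points"] = toy["points"].fillna(0)

print("Filled", n_missing, "values with 0")
print(toy)
```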

In [27]:
df["efficiency"] = (
    df["points"] + df["assists"] + df["reboundsTotal"] +
    df["steals"] + df["blocks"] - df["turnovers"]
)

df["points_per36"] = df["points"] / (df["numMinutes"].replace(0, np.nan) / 36)
df["points_per36"] = df["points_per36"].fillna(0)

#### Feature Engineering

To help machine learning models capture player performance, we create two new features.
The first is a simple efficiency score: points, assists, total rebounds, steals, and blocks added together, minus turnovers.
The second is points per 36 minutes, which normalizes scoring by playing time: points divided by (minutes / 36).
If minutes are 0, we replace them with NaN before dividing to avoid division by zero, and then fill the resulting NaN with 0.
The per-36 rate lets players with different amounts of playing time be compared on the same scale.
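
Both features can be verified by hand on a few rows. In this sketch the first two rows reuse the box-score values shown for Kris Dunn and Kris Humphries in the previews; the third row is a hypothetical zero-minute entry added to exercise the division guard.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "points":        [6.0, 11.0, 0.0],
    "assists":       [2.0, 0.0, 0.0],
    "reboundsTotal": [3.0, 3.0, 0.0],
    "steals":        [2.0, 0.0, 0.0],
    "blocks":        [0.0, 1.0, 0.0],
    "turnovers":     [1.0, 0.0, 0.0],
    "numMinutes":    [32.46, 12.0, 0.0],  # last row: hypothetical zero minutes
})

# Efficiency: positive contributions summed, turnovers subtracted.
toy["efficiency"] = (
    toy["points"] + toy["assists"] + toy["reboundsTotal"]
    + toy["steals"] + toy["blocks"] - toy["turnovers"]
)

# Per-36 scoring: zero minutes become NaN so the division yields NaN, then 0.
minutes = toy["numMinutes"].replace(0, np.nan)
toy["points_per36"] = (toy["points"] / (minutes / 36)).fillna(0)
print(toy[["efficiency", "points_per36"]])
```

The second row shows a side effect worth keeping in mind: 11 points in 12 minutes scales to roughly 33 points per 36, so short appearances can produce very high rates.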

In [31]:
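categorical_cols = ['playerteamName', 'opponentteamName', 'playerteamCity', 'opponentteamCity']

# Fit a separate encoder per column and keep it, so the integer codes can be
# mapped back to names later (a single shared encoder only remembers the last column)
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le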

#### Categorical Encoding

We also convert categorical text columns, such as team names and cities, into integer codes using Label Encoding. This step is necessary because our machine learning models require numerical input. Finally, we save the cleaned and processed dataset to a new CSV file for use in the next stages of the project.
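
Label-encoded codes are only interpretable if the mapping is kept. One way to do this, sketched here on a hypothetical two-column frame, is to fit one `LabelEncoder` per column and store it in a dictionary (the name `encoders` is an assumption for this example) so the codes can be decoded with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; team names are taken from the preview rows.
toy = pd.DataFrame({
    "playerteamName":   ["Clippers", "Mavericks", "Clippers"],
    "opponentteamName": ["Mavericks", "Clippers", "Mavericks"],
})

encoders = {}  # hypothetical container: column name -> fitted encoder
for col in ["playerteamName", "opponentteamName"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# Map the integer codes back to the original team names.
decoded = encoders["playerteamName"].inverse_transform(toy["playerteamName"])
print(toy)
print(list(decoded))
```

The codes follow the sorted order of the labels and carry no basketball meaning, which matters for models that treat integers as ordered quantities.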

In [34]:
df.head()

Unnamed: 0,firstName,lastName,personId,gameId,gameDateTimeEst,playerteamCity,playerteamName,opponentteamCity,opponentteamName,win,home,numMinutes,points,assists,blocks,steals,fieldGoalsAttempted,fieldGoalsMade,fieldGoalsPercentage,threePointersAttempted,threePointersMade,threePointersPercentage,freeThrowsAttempted,freeThrowsMade,freeThrowsPercentage,reboundsDefensive,reboundsOffensive,reboundsTotal,foulsPersonal,turnovers,plusMinusPoints,gameYear,gameMonth,efficiency,points_per36
0,Chris,Paul,101108,22500300,2025-11-29 17:00:00+00:00,14,5,6,17,0,1,11.03,0.0,3.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,2.0,0.0,-2.0,2025.0,11.0,6.0,0.0
1,D'Angelo,Russell,1626156,22500300,2025-11-29 17:00:00+00:00,6,17,14,5,1,0,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2025.0,11.0,0.0,0.0
2,Kris,Dunn,1627739,22500300,2025-11-29 17:00:00+00:00,14,5,6,17,0,1,32.46,6.0,2.0,0.0,2.0,7.0,3.0,0.429,4.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0,2.0,1.0,8.0,2025.0,11.0,12.0,6.654344
3,Ivica,Zubac,1627826,22500300,2025-11-29 17:00:00+00:00,14,5,6,17,0,1,37.53,19.0,4.0,1.0,1.0,12.0,8.0,0.667,0.0,0.0,0.0,3.0,3.0,1.0,9.0,2.0,11.0,5.0,2.0,4.0,2025.0,11.0,34.0,18.22542
4,John,Collins,1628381,22500300,2025-11-29 17:00:00+00:00,14,5,6,17,0,1,36.03,21.0,1.0,3.0,0.0,10.0,9.0,0.9,2.0,1.0,0.5,2.0,2.0,1.0,4.0,0.0,4.0,2.0,1.0,1.0,2025.0,11.0,28.0,20.982515


In [36]:
df.tail()

Unnamed: 0,firstName,lastName,personId,gameId,gameDateTimeEst,playerteamCity,playerteamName,opponentteamCity,opponentteamName,win,home,numMinutes,points,assists,blocks,steals,fieldGoalsAttempted,fieldGoalsMade,fieldGoalsPercentage,threePointersAttempted,threePointersMade,threePointersPercentage,freeThrowsAttempted,freeThrowsMade,freeThrowsPercentage,reboundsDefensive,reboundsOffensive,reboundsTotal,foulsPersonal,turnovers,plusMinusPoints,gameYear,gameMonth,efficiency,points_per36
352789,Elfrid,Payton,203901,21500490,2016-01-01 19:00:00+00:00,24,16,33,33,0,0,24.0,3.0,7.0,0.0,1.0,6.0,1.0,0.167,2.0,1.0,0.5,0.0,0.0,0.0,1.0,0.0,1.0,2.0,3.0,-17.0,2016.0,1.0,9.0,4.5
352790,Aaron,Gordon,203932,21500490,2016-01-01 19:00:00+00:00,24,16,33,33,0,0,24.0,6.0,3.0,0.0,0.0,6.0,3.0,0.5,1.0,0.0,0.0,1.0,0.0,0.0,4.0,3.0,7.0,0.0,1.0,-6.0,2016.0,1.0,15.0,9.0
352791,Jarell,Eddie,204067,21500490,2016-01-01 19:00:00+00:00,33,33,24,16,1,1,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2016.0,1.0,0.0,0.0
352792,Drew,Gooden,2400,21500490,2016-01-01 19:00:00+00:00,33,33,24,16,1,1,4.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,4.0,0.0,-5.0,2016.0,1.0,2.0,0.0
352793,Kris,Humphries,2743,21500490,2016-01-01 19:00:00+00:00,33,33,24,16,1,1,12.0,11.0,0.0,1.0,0.0,6.0,4.0,0.667,3.0,1.0,0.333,2.0,2.0,1.0,3.0,0.0,3.0,2.0,0.0,1.0,2016.0,1.0,15.0,33.0


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 352749 entries, 0 to 352793
Data columns (total 35 columns):
 #   Column                   Non-Null Count   Dtype              
---  ------                   --------------   -----              
 0   firstName                352749 non-null  object             
 1   lastName                 352749 non-null  object             
 2   personId                 352749 non-null  int64              
 3   gameId                   352749 non-null  int64              
 4   gameDateTimeEst          352749 non-null  datetime64[ns, UTC]
 5   playerteamCity           352749 non-null  int32              
 6   playerteamName           352749 non-null  int32              
 7   opponentteamCity         352749 non-null  int32              
 8   opponentteamName         352749 non-null  int32              
 9   win                      352749 non-null  int64              
 10  home                     352749 non-null  int64              
 11  numMinutes        

In [40]:
output_path = r"../data/cleaned_player_stats_for_model.csv"
df.to_csv(output_path, index=False)
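
CSV files do not store dtypes, so the datetime column comes back as text unless it is parsed again on load. The sketch below demonstrates this with an in-memory buffer instead of the real output file; the single row is hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "gameDateTimeEst": pd.to_datetime(["2016-01-01 19:00:00"], utc=True),
    "points": [3.0],
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the timestamp is read back as plain text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the column is restored to a datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["gameDateTimeEst"])

print(plain.dtypes)
print(parsed.dtypes)
```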