Overview of this Notebook:
- Library Imports:
Essential libraries for data manipulation (pandas), visualization (matplotlib, seaborn), and machine learning preprocessing (StandardScaler, PCA, TSNE).
- Data Reading and Initial Display:
The data is read from a CSV file, and initial statistics, including head and info, are displayed.
- Anomaly Correction:
I corrected anomalies in percentage features (FG%, 3P%, FT%) based on made and attempted features.
- Missing Values:
I checked for missing values and displayed the rows with a missing 3P%. The raw data has missing values only in 3P%; after the percentages are recomputed, the remaining NaN values (zero attempts) are filled with 0.
- Duplicate Rows:
Duplicate checks and handling are performed, keeping only rows with the maximum GP for each player.
- Target Distribution Visualization:
I visualized the distribution of the target variable (TARGET_5Yrs).
- Correlation Analysis:
A correlation matrix is created, and features with high correlation (both positive and negative) are identified and displayed.
- Dimensionality Reduction:
PCA and t-SNE are used to visualize the separation of classes in a 2D space.
- Feature Engineering:
New features like PER (Player Efficiency Rating) and PPM (Points per Minute) are created to enhance the dataset.


In [1]:
# ==================================================================
# IMPORT LIBRARIES
# ==================================================================
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import os  # Import os to handle directory creation


import missingno as msno

from IPython.display import display #To display dataframe with a nice format.

# Setting display precision for pandas
pd.set_option("display.precision", 2)
# Set display option to show all columns
pd.set_option('display.max_columns', None)

blue_green      = (82/255, 162/255, 160/255)
blue_green_dark = (41/255, 81/255, 81/255)

# ==================================================================================================
# Set the fontsize & Bold for each GRAPH !!!
# ==================================================================================================
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['axes.labelsize'] = 14
plt.rcParams["axes.titleweight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["lines.linewidth"] = 3
plt.rcParams["lines.markersize"] = 10
plt.rcParams["xtick.labelsize"] = 12
plt.rcParams["ytick.labelsize"] = 12
plt.rcParams['axes.titlepad'] = 20 

# ==================================================================
# READ DATA
# ==================================================================
# Load the data
df = pd.read_csv('nba_logreg.csv')

# Display the first rows and the dataset summary
display(df.head())
display(df.info())  # df.info() prints its summary and returns None, hence the trailing "None" in the output



Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
0,Brandon Ingram,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0.0
1,Andrew Harrison,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0.0
2,JaKarr Sampson,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0.0
3,Malik Sealy,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1.0
4,Matt Geiger,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         1340 non-null   object 
 1   GP           1340 non-null   int64  
 2   MIN          1340 non-null   float64
 3   PTS          1340 non-null   float64
 4   FGM          1340 non-null   float64
 5   FGA          1340 non-null   float64
 6   FG%          1340 non-null   float64
 7   3P Made      1340 non-null   float64
 8   3PA          1340 non-null   float64
 9   3P%          1329 non-null   float64
 10  FTM          1340 non-null   float64
 11  FTA          1340 non-null   float64
 12  FT%          1340 non-null   float64
 13  OREB         1340 non-null   float64
 14  DREB         1340 non-null   float64
 15  REB          1340 non-null   float64
 16  AST          1340 non-null   float64
 17  STL          1340 non-null   float64
 18  BLK          1340 non-null   float64
 19  TOV   

None

NOTE:
- The data loaded correctly (1340 rows).
- Each feature has been assigned an appropriate dtype.
- 3P% is the only column listed with fewer than 1340 non-null values (1329).

In [2]:
# ==================================================================
# EXPLANATION OF THE FEATURES
# =================================================================
# Source : https://www.nba.com/stats/help/glossary

| Feature Name | Description | Detailed Description |
|---|---|---|
| Name | The name of the player | This is simply the player's name. |
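| GP | Games Played | This counts how many games the player has participated in. Values range from 11 to 82 in this dataset, which is consistent with a single regular season rather than a whole career. |
| MIN | Minutes Played | This records the average amount of time the player has spent on the court per game (values range from 3.1 to 40.9 in this dataset). |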
| PTS | Points Per Game | This is the average number of points the player scores in each game. It's a key indicator of their scoring ability. |
| FGM | Field Goals Made | This is the number of successful field goals the player has made. Field goals include both 2-point and 3-point shots, but not free throws. |
| FGA | Field Goal Attempts | This is the total number of field goals the player has attempted (2-point and 3-point shots combined). |
| FG% | Field Goal Percent | This is the percentage of field goals the player has made, calculated by dividing FGM by FGA. It shows how accurate their shooting is. |
| 3P Made | 3-Point Made | This is the number of successful 3-point shots the player has made. |
| 3PA | 3-Point Attempts | This is the total number of 3-point shots the player has attempted. |
| 3P% | 3-Point Percent | This is the percentage of 3-point shots the player has made, calculated by dividing 3P Made by 3PA. It shows how accurate their 3-point shooting is. |
| FTM | Free Throw Made | This is the number of successful free throws the player has made. Free throws are awarded when a player is fouled while shooting. |
| FTA | Free Throw Attempts | This is the total number of free throws the player has attempted. |
| FT% | Free Throw Percent | This is the percentage of free throws the player has made, calculated by dividing FTM by FTA. It shows how accurate their free throw shooting is. |
| OREB | Offensive Rebounds | This is the number of rebounds the player has grabbed on the offensive end of the court, giving their team another chance to shoot. |
| DREB | Defensive Rebounds | This is the number of rebounds the player has grabbed on the defensive end of the court, preventing the opposing team from getting another shot. |
| REB | Rebounds | This is the total number of rebounds the player has grabbed, combining both offensive and defensive rebounds. |
| AST | Assists | This is the number of times the player has passed the ball to a teammate who has scored a basket. It shows their ability to create scoring opportunities for others. |
| STL | Steals | This is the number of times the player has stolen the ball from an opposing player. |
| BLK | Blocks | This is the number of times the player has blocked an opponent's shot. |
| TOV | Turnovers | This is the number of times the player has lost possession of the ball due to a mistake, such as a bad pass or dribbling out of bounds. |
| TARGET_5Yrs | Outcome: 1 if career length >= 5 yrs, 0 if < 5 yrs | This is a binary variable indicating whether the player's career lasted 5 or more years (1) or less than 5 years (0). |

In [3]:
# ==================================================================
# Some stats for the features
# =================================================================
# Descriptive statistics
display(df.describe())

Unnamed: 0,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
count,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1329.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0
mean,60.41,17.62,6.8,2.63,5.89,44.17,0.25,0.78,19.31,1.3,1.82,70.3,1.01,2.03,3.03,1.55,0.62,0.37,1.19,0.62
std,17.43,8.31,4.36,1.68,3.59,6.14,0.38,1.06,16.02,0.99,1.32,10.58,0.78,1.36,2.06,1.47,0.41,0.43,0.72,0.49
min,11.0,3.1,0.7,0.3,0.8,23.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.3,0.0,0.0,0.0,0.1,0.0
25%,47.0,10.88,3.7,1.4,3.3,40.2,0.0,0.0,0.0,0.6,0.9,64.7,0.4,1.0,1.5,0.6,0.3,0.1,0.7,0.0
50%,63.0,16.1,5.55,2.1,4.8,44.1,0.1,0.3,22.4,1.0,1.5,71.25,0.8,1.7,2.5,1.1,0.5,0.2,1.0,1.0
75%,77.0,22.9,8.8,3.4,7.5,47.9,0.4,1.2,32.5,1.6,2.3,77.6,1.4,2.6,4.0,2.0,0.8,0.5,1.5,1.0
max,82.0,40.9,28.2,10.2,19.8,73.7,2.3,6.5,100.0,7.7,10.2,100.0,5.3,9.6,13.9,10.6,2.5,3.9,4.4,1.0


NOTE:
- Some features are derived from others, e.g. FG% from FGM and FGA.
  --> I will remove some of them later (after the correlation check).
- There seem to be no anomalies in the raw values: no negative values, and no "Attempts" feature smaller than its "Made" counterpart.
- BUT the "Percentage" features are not always consistent with their components. For example, 3P% should equal 100 * (3P Made) / (3PA).

Note:
What I consider an anomaly in the data at this stage:
1. A negative value in any numerical feature (e.g. games played, 3P Made, 3PA and 3P% cannot be negative).
2. An "Attempts" value lower than the matching "Made" value.
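The two rules above can be checked programmatically rather than by reading `describe()`. The following is a minimal sketch, not the notebook's own code: the helper name `find_anomalies` and the toy values are assumptions made for illustration, and only the column names come from the dataset.

```python
import pandas as pd

def find_anomalies(frame: pd.DataFrame) -> dict:
    """Count rows that violate the two sanity rules described above."""
    numeric = frame.select_dtypes(include="number")
    # Rule 1: no numerical feature may be negative
    report = {"negative_values": int((numeric < 0).any(axis=1).sum())}
    # Rule 2: a "Made" value may not exceed the matching "Attempts" value
    for made, attempted in [("FGM", "FGA"), ("3P Made", "3PA"), ("FTM", "FTA")]:
        report[f"{made} > {attempted}"] = int((frame[made] > frame[attempted]).sum())
    return report

# Tiny illustrative frame (made-up values, not taken from nba_logreg.csv)
toy = pd.DataFrame({
    "FGM": [2.0, 3.0], "FGA": [5.0, 2.5],
    "3P Made": [0.5, 0.0], "3PA": [1.5, 0.0],
    "FTM": [1.0, -0.2], "FTA": [2.0, 1.0],
})
anomalies = find_anomalies(toy)
print(anomalies)
```

On the real data, `find_anomalies(df)` should be called before the percentage columns are recomputed, so that it describes the file as loaded.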

# I. Checking for anomalies

In [4]:
# ==================================================================
# Correcting anomalies in the "Percentage" features
# =================================================================
df['FG%'] = 100*df['FGM']/df['FGA']
df['3P%'] = 100*df['3P Made']/df['3PA']
df['FT%'] = 100*df['FTM']/df['FTA']

# /!\ When the number of attempts is 0 (e.g. 3PA == 0), the ratio is 0/0 and the percentage becomes NaN.
# These NaN values are replaced by 0 in the missing-values step below.
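As an alternative to computing the ratio and filling NaN afterwards, both steps can be wrapped in one helper. This is a sketch under the same convention used in this notebook (zero attempts gives a percentage of 0); the function name `safe_percentage` is hypothetical and the sample values are invented.

```python
import pandas as pd

def safe_percentage(made: pd.Series, attempted: pd.Series) -> pd.Series:
    """Return 100 * made / attempted, with 0 where there were no attempts."""
    # Replace zero denominators with NaN so the division never produces inf
    ratio = made.div(attempted.where(attempted != 0))
    return (100 * ratio).fillna(0.0)

# Invented values: the second player never attempted a shot
made = pd.Series([0.5, 0.0, 1.0])
attempted = pd.Series([2.0, 0.0, 4.0])
pct = safe_percentage(made, attempted)
print(pct.tolist())
```

Whether 0 is the right value for a player with no attempts is a modelling choice; keeping NaN and letting the model or an imputer handle it would be another option.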


# II. Checking for missing values

In [5]:
# ==================================================================
# Missing Values
# =================================================================
# Check for missing values
display(df.isnull().sum())

Name             0
GP               0
MIN              0
PTS              0
FGM              0
FGA              0
FG%              0
3P Made          0
3PA              0
3P%            360
FTM              0
FTA              0
FT%              1
OREB             0
DREB             0
REB              0
AST              0
STL              0
BLK              0
TOV              0
TARGET_5Yrs      0
dtype: int64

In [6]:
# Investigate the missing values in the 3P% feature
# Get rows with missing values in the "3P%" column
missing_rows = df[df['3P%'].isnull()]

# Print the rows
display(missing_rows)

Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
12,Lorenzo Williams,27,6.6,1.3,0.6,1.3,46.15,0.0,0.0,,0.1,0.3,33.33,0.6,1.4,2.0,0.2,0.2,0.6,0.3,1.0
14,Elmore Spencer,44,6.4,2.4,1.0,1.9,52.63,0.0,0.0,,0.4,0.7,57.14,0.4,1.0,1.4,0.2,0.2,0.4,0.6,1.0
16,Stephen Howard,49,5.3,2.1,0.7,1.9,36.84,0.0,0.0,,0.7,1.1,63.64,0.5,0.7,1.2,0.2,0.3,0.2,0.5,0.0
25,Larry Stewart,76,29.3,10.4,4.0,7.8,51.28,0.0,0.0,,2.5,3.1,80.65,2.4,3.5,5.9,1.6,0.7,0.6,1.5,1.0
29,Donald Hodge,51,20.7,8.4,3.2,6.4,50.00,0.0,0.0,,2.0,2.9,68.97,2.3,3.1,5.4,0.8,0.5,0.5,1.5,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1314,Clarence Weatherspoon,82,32.4,15.6,6.0,12.8,46.88,0.0,0.0,,3.5,5.0,70.00,2.2,5.0,7.2,1.8,1.0,0.8,2.1,1.0
1316,Sean Rooks,72,29.0,13.5,5.1,10.4,49.04,0.0,0.0,,3.3,5.4,61.11,2.7,4.7,7.4,1.3,0.5,1.1,2.2,1.0
1318,Anthony Avent,82,27.9,9.8,4.2,9.8,42.86,0.0,0.0,,1.4,2.1,66.67,2.2,4.0,6.2,1.1,0.7,0.9,1.7,1.0
1330,Adam Keefe,82,18.9,6.6,2.3,4.6,50.00,0.0,0.0,,2.0,2.9,68.97,2.1,3.2,5.3,1.0,0.7,0.2,1.2,1.0


NOTE:
Every missing '3P%' value corresponds to a '3PA' value of 0. In these cases, I set '3P%' to 0.
The same rule is applied to 'FT%' (one missing value) and, for consistency, to 'FG%' (no missing values).

In [7]:
df['3P%'] = df['3P%'].fillna(0)
df['FG%'] = df['FG%'].fillna(0)
df['FT%'] = df['FT%'].fillna(0)

In [8]:
# Check again for missing values
display(df.isnull().sum())

Name           0
GP             0
MIN            0
PTS            0
FGM            0
FGA            0
FG%            0
3P Made        0
3PA            0
3P%            0
FTM            0
FTA            0
FT%            0
OREB           0
DREB           0
REB            0
AST            0
STL            0
BLK            0
TOV            0
TARGET_5Yrs    0
dtype: int64

# III. Checking for duplicates 

In [9]:
# ==================================================================
# DUPLICATES
# =================================================================
# Check for duplicate rows
print(f'Duplicate rows: {df.duplicated(keep=False).sum()}')

# Check for duplicate rows and display them
duplicate_rows = df[df.duplicated(keep=False)]
display(duplicate_rows)

Duplicate rows: 24


Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
162,Charles Smith,60,8.7,2.9,1.0,2.2,45.45,0.0,0.1,0.0,0.9,1.3,69.23,0.2,0.9,1.2,1.7,0.6,0.1,0.6,1.0
163,Charles Smith,60,8.7,2.9,1.0,2.2,45.45,0.0,0.1,0.0,0.9,1.3,69.23,0.2,0.9,1.2,1.7,0.6,0.1,0.6,1.0
165,Charles Smith,71,30.4,16.3,6.1,12.4,49.19,0.0,0.0,0.0,4.0,5.5,72.73,2.4,4.1,6.5,1.5,1.0,1.3,2.1,1.0
166,Charles Smith,71,30.4,16.3,6.1,12.4,49.19,0.0,0.0,0.0,4.0,5.5,72.73,2.4,4.1,6.5,1.5,1.0,1.3,2.1,1.0
168,Charles Smith,34,8.6,3.5,1.4,3.7,37.84,0.4,1.4,28.57,0.2,0.3,66.67,0.4,0.4,0.8,0.6,0.3,0.2,0.8,1.0
169,Charles Smith,34,8.6,3.5,1.4,3.7,37.84,0.4,1.4,28.57,0.2,0.3,66.67,0.4,0.4,0.8,0.6,0.3,0.2,0.8,1.0
242,Reggie Williams,35,24.5,10.4,4.3,12.2,35.25,0.4,1.7,23.53,1.4,1.9,73.68,1.6,1.8,3.4,1.7,0.8,0.6,1.8,1.0
243,Reggie Williams,35,24.5,10.4,4.3,12.2,35.25,0.4,1.7,23.53,1.4,1.9,73.68,1.6,1.8,3.4,1.7,0.8,0.6,1.8,1.0
338,Ken Johnson,64,12.7,4.1,1.8,3.3,54.55,0.0,0.0,0.0,0.6,1.3,46.15,1.4,2.4,3.8,0.3,0.2,0.3,0.9,0.0
339,Ken Johnson,64,12.7,4.1,1.8,3.3,54.55,0.0,0.0,0.0,0.6,1.3,46.15,1.4,2.4,3.8,0.3,0.2,0.3,0.9,0.0


NOTE:

This dataset has two problems:
1. Some rows are exact duplicates of each other.
2. Several rows share the same player name, e.g. Charles Smith (rows 163, 166, 169). I keep the row with the largest GP (Games Played) value, which I assume to be the most recent data for that player. Note that this treats all rows sharing a name as one player.
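Before collapsing rows by name, it can help to see how many names are affected and whether their rows disagree on the target. The sketch below uses made-up rows and hypothetical variable names; it only illustrates the two problems listed above and the keep-largest-GP rule.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from nba_logreg.csv)
toy = pd.DataFrame({
    "Name": ["A. Player", "A. Player", "A. Player", "B. Player"],
    "GP": [60, 60, 71, 35],
    "TARGET_5Yrs": [1.0, 1.0, 0.0, 1.0],
})

# Problem 1: exact duplicates
deduped = toy.drop_duplicates()

# Problem 2: names that still appear more than once
rows_per_name = deduped.groupby("Name").size()
repeated_names = rows_per_name[rows_per_name > 1]

# Names whose remaining rows disagree on the target deserve a manual look
labels_per_name = deduped.groupby("Name")["TARGET_5Yrs"].nunique()
conflicting = labels_per_name[labels_per_name > 1].index.tolist()

# Keep the row with the largest GP for each name
kept = deduped.loc[deduped.groupby("Name")["GP"].idxmax()]
print(repeated_names, conflicting, kept, sep="\n")
```

A name with conflicting labels may indicate two different players who happen to share a name, in which case merging them by name would discard a valid row.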


In [10]:
# Check for fully duplicated rows and remove them
print(f'Number of fully duplicated rows: {df.duplicated().sum()}')
df = df.drop_duplicates()

# For each player, keep only the row with the maximum 'GP' (Games Played)
df = df.sort_values('GP', ascending=False).drop_duplicates(subset='Name', keep='first')

print(f'Remaining rows after cleaning: {df.shape[0]}')

Number of fully duplicated rows: 12
Remaining rows after cleaning: 1294


# IV. Some Graphs (target distribution, correlation)

In [11]:
# ==================================================================
# TARGET DISTRIBUTION
# =================================================================
# Distribution of the target variable
target_distribution = df['TARGET_5Yrs'].value_counts(normalize=True)
print("Distribution of the target variable:")
print(target_distribution)

# Create the directory if it doesn't exist
output_dir = 'image/target_distribution'
os.makedirs(output_dir, exist_ok=True)

plt.figure(figsize=(8, 6))
target_distribution.plot(kind='bar')
plt.title("Distribution of the target variable (TARGET_5Yrs)")
plt.xlabel("Career >= 5 years")
plt.ylabel("Proportion")
plt.savefig(os.path.join(output_dir, 'target_distribution.png'))
plt.close()



Distribution of the target variable:
TARGET_5Yrs
1.0    0.62
0.0    0.38
Name: proportion, dtype: float64


NOTE: 

/!\ The target is imbalanced (about 62% positive vs. 38% negative).
At modelling time I will need something like:
- Class weight balancing
- SMOTE (Synthetic Minority Over-sampling Technique)
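To make the first option concrete, here is a small sketch of how class weights could be derived from the target. It uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also the formula scikit-learn documents for `class_weight="balanced"`. The function name and the synthetic labels are assumptions for illustration.

```python
import pandas as pd

def balanced_class_weights(y: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    return {label: len(y) / (len(counts) * count) for label, count in counts.items()}

# Synthetic target with the same 62/38 split as above
y = pd.Series([1.0] * 62 + [0.0] * 38)
weights = balanced_class_weights(y)
print(weights)
```

The minority class (0.0) receives the larger weight, so errors on short careers would count more during training. SMOTE, by contrast, changes the training data itself and requires an extra library, so it is not sketched here.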

In [12]:
# ==================================================================
# CORRELATION BETWEEN FEATURES (NUMERICAL FEATURES ONLY)
# =================================================================
# Create the directory if it doesn't exist
output_dir = 'image/correlation'
os.makedirs(output_dir, exist_ok=True)

# Keep only the numerical columns
df_numerique = df.select_dtypes(include=['number'])

# Correlations
correlation_matrix = df_numerique.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title("Correlation matrix")
# plt.savefig('correlation_matrix.png')
plt.savefig(os.path.join(output_dir, 'correlation_matrix.png'))
plt.close()

# Top 5 correlations with the target variable (position 0 is the target itself, so it is skipped)
top_correlations = correlation_matrix['TARGET_5Yrs'].sort_values(key=abs, ascending=False)[1:6]
print("Top 5 correlations with TARGET_5Yrs:")
print(top_correlations)



# Visualizations for the most correlated variables
for feature in top_correlations.index:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='TARGET_5Yrs', y=feature, data=df_numerique)
    plt.title(f"Distribution of {feature} by TARGET_5Yrs")
    # plt.savefig(f'{feature}_boxplot.png')
    plt.savefig(os.path.join(output_dir, f'{feature}_boxplot.png'))
    plt.close()



Top 5 correlations with TARGET_5Yrs:
GP     0.41
MIN    0.33
FGM    0.32
PTS    0.32
FTM    0.31
Name: TARGET_5Yrs, dtype: float64


NOTE: 
Some features are highly positively correlated with each other (e.g. MIN & PTS, FGM & MIN).

NOTE:
The boxplots show some outliers, but these values do not seem to be anomalies. Instead, they look like plausible extreme values.
Example: for the feature "PTS", the maximum is about 28, which is an average number of points per game.
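The points drawn individually on a boxplot are, by default, those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly so the flagged rows can be listed and inspected; the helper name `iqr_outliers` and the sample values are illustrative assumptions, not data from the notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points beyond k * IQR from the quartiles (the usual boxplot whisker rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative points-per-game values only
pts = pd.Series([3.7, 5.5, 8.8, 4.1, 6.0, 28.2])
mask = iqr_outliers(pts)
print(pts[mask].tolist())
```

Being flagged by this rule only means a value is far from the bulk of the distribution. As the note above says, a high scoring average is a legitimate extreme, not a data error.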

In [13]:
# ==================================================================
# HIGH POSITIVE & NEGATIVE CORRELATION BETWEEN FEATURES
# =================================================================
# Get features with high correlation 
high_correlations = correlation_matrix[((correlation_matrix > 0.9) | (correlation_matrix < -0.9)) & (correlation_matrix != 1.0)]

# Get the pairs of features with high correlation
high_correlation_pairs = high_correlations.stack().reset_index()
high_correlation_pairs.columns = ['feature1', 'feature2', 'correlation']

# Filter out duplicate pairs
high_correlation_pairs = high_correlation_pairs[(high_correlation_pairs['feature1'] < high_correlation_pairs['feature2'])]

# Print the pairs
print("Pairs of features with high correlation (>0.9):")
print(high_correlation_pairs)

Pairs of features with high correlation (>0.9):
   feature1 feature2  correlation
0       MIN      PTS         0.91
6       FGM      MIN         0.90
7       FGM      PTS         0.99
9       FGA      MIN         0.91
10      FGA      PTS         0.98
11      FGA      FGM         0.98
12  3P Made      3PA         0.98
15      FTA      FTM         0.98
16     OREB      REB         0.93
17     DREB      REB         0.98


NOTE:

Now that the highly correlated pairs of features are known, I remove one feature from each pair.

1. Remove MIN and keep PTS.
Reason: PTS is a more direct indicator of performance.

2. Remove OREB and DREB and keep REB.
Reason: REB combines both offensive and defensive rebounds.

3. Remove the "Made" and "Attempts" features and keep the "Percentage" features.
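The pair listing that feeds this decision can also be built from the strict upper triangle of the correlation matrix, which lists each pair exactly once without comparing feature names. This is a sketch with a hypothetical helper `high_corr_pairs` and synthetic data; it is an alternative to the cell above, not a description of it.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """List each feature pair once, keeping those with |correlation| above the threshold."""
    # Mask everything except the strict upper triangle (no diagonal, no mirrored pairs)
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature1", "feature2", "correlation"]
    return pairs[pairs["correlation"].abs() > threshold].reset_index(drop=True)

# Synthetic data: b is an exact multiple of a, c is only weakly related to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy.corr())
print(pairs)
```

Because the diagonal is excluded by position and not by value, a pair of distinct features whose correlation is exactly 1 is still reported here, whereas a `!= 1.0` filter could drop it.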

In [14]:
# Keep a full copy for feature engineering, then remove the redundant features
df_copy = df.copy()
df = df.drop(columns=['MIN', 'OREB', 'DREB', 'FGM', 'FGA', '3P Made', '3PA', 'FTM', 'FTA'])

# V. Visually identifying patterns and clusters

A 2D scatter plot of the projected data can reveal clusters, indicating the presence of groups or categories. If the points of the two target classes are well separated, it suggests that the features are effective in distinguishing between the classes.

If the points overlap significantly, it may indicate that the current features are not sufficient for distinguishing between the classes. In that case, feature engineering should be considered.
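When reading such a 2D plot, it is worth knowing how much of the total variance the two axes retain, since a projection that keeps little variance can hide separation that exists in higher dimensions. The sketch below computes a 2D PCA projection and its explained variance ratio with NumPy's SVD on synthetic data; the function name `pca_2d` and the random data are assumptions for illustration, and scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def pca_2d(X: np.ndarray):
    """Project standardized data onto its first two principal components via SVD."""
    # Standardize each column (population std, as StandardScaler does)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    # Share of the total variance carried by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return Xs @ Vt[:2].T, explained[:2]

# Synthetic data with two strongly correlated columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)
projection, ratio = pca_2d(X)
print(projection.shape, ratio.sum())
```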

In [15]:
# ==================================================================
# PCA 2D
# =================================================================
# Create the directory if it doesn't exist
output_dir = 'image/reduction_dimensionelle'
os.makedirs(output_dir, exist_ok=True)

# PCA to visualize the separation of the classes
features = df.drop(['Name', 'TARGET_5Yrs'], axis=1)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

pca = PCA(n_components=2)
pca_result = pca.fit_transform(features_scaled)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(pca_result[:, 0], pca_result[:, 1], c=df['TARGET_5Yrs'], cmap='viridis')
plt.colorbar(scatter)
plt.title("PCA - Visualization of the first two principal components")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
# plt.savefig('pca_visualization.png')
plt.savefig(os.path.join(output_dir, 'pca_visualization.png'))
plt.close()

print("The exploratory data analysis is complete. The visualizations have been saved as PNG files.")

The exploratory data analysis is complete. The visualizations have been saved as PNG files.


In [16]:
# ==================================================================
# TSNE (t-distributed Stochastic Neighbor Embedding) 2D
# =================================================================
from sklearn.manifold import TSNE

# Standardize the features as you've done before
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=100, random_state=42)
tsne_result = tsne.fit_transform(features_scaled)

# Plot the t-SNE result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=df['TARGET_5Yrs'], cmap='viridis')
plt.colorbar(scatter)
plt.title("t-SNE - Visualization of the two embedding dimensions")
plt.xlabel("First t-SNE dimension")
plt.ylabel("Second t-SNE dimension")
# plt.savefig('tsne_visualization.png')
plt.savefig(os.path.join(output_dir, 'tsne_visualization.png'))

plt.close()

print("t-SNE visualization complete. The image has been saved as a PNG file.")


t-SNE visualization complete. The image has been saved as a PNG file.


Both PCA and t-SNE (tried with different perplexity values) show heavily overlapping classes, which suggests that the current features are not sufficient for distinguishing between the classes of the TARGET.
--> I will need to do feature engineering.

In [17]:
# ==================================================================
# Save the clean Dataset (NO Feature Engineering)
# =================================================================
# Save the cleaned DataFrame to a CSV file
df.to_csv('nba_logreg_clean_NO_FE.csv', index=False)

In [18]:
# ==================================================================					
# Feature Engineering					
# =================================================================					
import numpy as np					
					
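# Efficiency Metrics
# NOTE: 'PER' here is a simple box-score sum divided by GP, not the official Player Efficiency Rating formula.
df_copy['PER'] = (df_copy['PTS'] + df_copy['REB'] + df_copy['AST'] + df_copy['STL'] + df_copy['BLK'] - df_copy['TOV']) / df_copy['GP']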
df_copy['PPM'] = df_copy['PTS'] / df_copy['MIN']					
df_copy['Usage_Rate'] = (df_copy['FGA'] + 0.44 * df_copy['FTA'] + df_copy['TOV']) / df_copy['MIN']					
					
# Scoring Efficiency					
df_copy['True_Shooting_Percentage'] = df_copy['PTS'] / (2 * (df_copy['FGA'] + 0.44 * df_copy['FTA']))					
df_copy['Effective_FG_Percentage'] = (df_copy['FGM'] + 0.5 * df_copy['3P Made']) / df_copy['FGA']					
df_copy['Points_Per_Shot'] = df_copy['PTS'] / (df_copy['FGA'] + 0.44 * df_copy['FTA'])					
					
# Rebounding					
df_copy['Total_Rebound_Percentage'] = df_copy['REB'] / (df_copy['GP'] * df_copy['MIN'] / 48)					
df_copy['Offensive_Rebound_Percentage'] = df_copy['OREB'] / (df_copy['GP'] * df_copy['MIN'] / 48)					
df_copy['Defensive_Rebound_Percentage'] = df_copy['DREB'] / (df_copy['GP'] * df_copy['MIN'] / 48)					
					
# Passing and Ball Handling					
df_copy['Assist_Percentage'] = df_copy['AST'] / (df_copy['MIN'] / 48 * df_copy['GP'] * 5)					
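# NOTE: this line writes to df (the reduced frame), not df_copy, so 'AST/TOV' does not appear in df_copy below.
df['AST/TOV'] = df['AST'] / np.where(df['TOV'] == 0, 1, df['TOV'])  # Avoid division by zero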
df_copy['AST_Ratio'] = df_copy['AST'] * 100 / df_copy['FGA']					
					
# Defensive Metrics					
df_copy['Steal_Percentage'] = df_copy['STL'] / (df_copy['GP'] * df_copy['MIN'] / 48)					
df_copy['Block_Percentage'] = df_copy['BLK'] / (df_copy['GP'] * df_copy['MIN'] / 48)					
df_copy['Stocks'] = df_copy['STL'] + df_copy['BLK']  # Steals + Blocks					
					
# Versatility Metrics					
df_copy['Versatility_Index'] = df_copy['PTS'] + df_copy['REB'] + df_copy['AST'] + df_copy['STL'] + df_copy['BLK']					
df_copy['Offensive_Versatility'] = df_copy['PTS'] + df_copy['AST'] + df_copy['OREB']					
df_copy['Defensive_Versatility'] = df_copy['DREB'] + df_copy['STL'] + df_copy['BLK']					
					
# Shooting Metrics					
df_copy['Three_Point_Rate'] = df_copy['3PA'] / df_copy['FGA']					
df_copy['Free_Throw_Rate'] = df_copy['FTA'] / df_copy['FGA']					
					
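# Per Game and Per 36 Minutes Metrics
# NOTE: the base stats already look like per-game averages (e.g. MIN tops out at 40.9), so '{stat}_Per_Game' divides an average by GP.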
for stat in ['PTS', 'REB', 'AST', 'STL', 'BLK', 'TOV']:
    df_copy[f'{stat}_Per_Game'] = df_copy[stat] / df_copy['GP']
    df_copy[f'{stat}_Per_36'] = df_copy[stat] / df_copy['MIN'] * 36			
					
# Advanced Metrics					
df_copy['Box_Plus_Minus'] = (df_copy['PTS'] + 0.2 * df_copy['OREB'] + 0.8 * df_copy['DREB'] + 2.7 * df_copy['AST'] + df_copy['STL'] + 0.7 * df_copy['BLK'] - 0.7 * df_copy['FGA'] - 0.4 * (df_copy['FTA'] - df_copy['FTM']) - 1.2 * df_copy['TOV']) / df_copy['GP']					
					
# Consistency Metrics					
df_copy['Scoring_Consistency'] = df_copy['PTS'] / (df_copy['FGA'] + df_copy['FTA'])					
df_copy['Usage_Consistency'] = df_copy['Usage_Rate'] / df_copy['GP']					
					
# Compound Metrics					
df_copy['Offensive_Impact'] = (df_copy['PTS'] + df_copy['AST']) / df_copy['MIN']					
df_copy['Defensive_Impact'] = (df_copy['DREB'] + df_copy['STL'] + df_copy['BLK']) / df_copy['MIN']					
df_copy['Overall_Impact'] = df_copy['Offensive_Impact'] + df_copy['Defensive_Impact']					
					
# Efficiency Ratios					
df_copy['Points_Per_Touch'] = df_copy['PTS'] / (df_copy['FGA'] + df_copy['FTA'] + df_copy['AST'] + df_copy['TOV'])					
df_copy['Production_Per_Possession'] = (df_copy['PTS'] + df_copy['AST'] + df_copy['OREB']) / (df_copy['FGA'] - df_copy['OREB'] + df_copy['TOV'] + 0.44 * df_copy['FTA'])					
					
# Relative Performance Metrics					
df_copy['PTS_to_Usage_Ratio'] = df_copy['PTS'] / df_copy['Usage_Rate']					
df_copy['AST_to_Usage_Ratio'] = df_copy['AST'] / df_copy['Usage_Rate']					
					
# Shooting Splits					
df_copy['2P%'] = (df_copy['FGM'] - df_copy['3P Made']) / (df_copy['FGA'] - df_copy['3PA'])					
df_copy['Points_Per_Shot_Attempt'] = df_copy['PTS'] / df_copy['FGA']					
					
# Physical Impact Metrics					
df_copy['Physical_Impact'] = (df_copy['REB'] + df_copy['BLK'] + df_copy['STL']) / df_copy['MIN']					
					
# Clutch Performance (assuming we had this data)					
# df_copy['Clutch_Rating'] = (df_copy['Clutch_PTS'] + df_copy['Clutch_AST'] + df_copy['Clutch_REB']) / df_copy['Clutch_MIN']					



print("Feature engineering complete. New features added to the dataframe.")					
print(f"Total number of features: {df_copy.shape[1]}")					
					
# Display the first few rows of the dataframe with new features					
display(df_copy.head())					


# Correlation matrix for new features
new_features = df_copy.columns.drop(['Name', 'GP', 'MIN', 'PTS', 'FGM', 'FGA', 'FG%', '3P Made', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'TARGET_5Yrs'])
correlation_matrix_new = df_copy[list(new_features) + ['TARGET_5Yrs']].corr()

plt.figure(figsize=(20, 16))
sns.heatmap(correlation_matrix_new, annot=False, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix of New Features")
plt.tight_layout()
plt.savefig(os.path.join(output_dir, 'new_features_correlation.png'))
plt.close()

print("Correlation matrix of new features has been saved.")

# Top correlations with TARGET_5Yrs
top_new_correlations = correlation_matrix_new['TARGET_5Yrs'].sort_values(key=abs, ascending=False)[1:11]
print("\nTop 10 correlations of new features with TARGET_5Yrs:")
print(top_new_correlations)

Feature engineering complete. New features added to the dataframe.
Total number of features: 65


Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs,PER,PPM,Usage_Rate,True_Shooting_Percentage,Effective_FG_Percentage,Points_Per_Shot,Total_Rebound_Percentage,Offensive_Rebound_Percentage,Defensive_Rebound_Percentage,Assist_Percentage,AST_Ratio,Steal_Percentage,Block_Percentage,Stocks,Versatility_Index,Offensive_Versatility,Defensive_Versatility,Three_Point_Rate,Free_Throw_Rate,PTS_Per_Game,PTS_Per_36,REB_Per_Game,REB_Per_36,AST_Per_Game,AST_Per_36,STL_Per_Game,STL_Per_36,BLK_Per_Game,BLK_Per_36,TOV_Per_Game,TOV_Per_36,Box_Plus_Minus,Scoring_Consistency,Usage_Consistency,Offensive_Impact,Defensive_Impact,Overall_Impact,Points_Per_Touch,Production_Per_Possession,PTS_to_Usage_Ratio,AST_to_Usage_Ratio,2P%,Points_Per_Shot_Attempt,Physical_Impact
244,Greg Anderson,82,24.2,11.7,4.6,9.2,50.0,0.0,0.1,0.0,2.4,4.0,60.0,2.0,4.3,6.3,1.0,0.7,1.5,1.7,1.0,0.24,0.48,0.52,0.53,0.5,1.07,0.15,0.0484,0.1,0.00484,10.87,0.02,0.0363,2.2,21.2,14.7,6.5,0.01,0.43,0.14,17.4,0.08,9.37,0.01,1.49,0.00854,1.04,0.0183,2.23,0.02,2.53,0.13,0.89,0.00638,0.52,0.27,0.79,0.74,1.38,22.36,1.91,0.51,1.27,0.35
83,Bimbo Coles,82,16.5,4.9,2.0,4.8,41.67,0.1,0.4,25.0,0.9,1.2,75.0,0.7,1.2,1.9,2.8,0.8,0.1,1.2,1.0,0.11,0.3,0.4,0.46,0.43,0.92,0.07,0.0248,0.04,0.0199,58.33,0.03,0.00355,0.9,10.5,8.4,2.1,0.08,0.25,0.06,10.69,0.02,4.15,0.03,6.11,0.00976,1.75,0.00122,0.22,0.01,2.62,0.12,0.82,0.00482,0.47,0.13,0.59,0.49,1.44,12.39,7.08,0.43,1.02,0.17
411,Paul Thompson,82,21.1,9.0,3.8,8.1,46.91,0.1,0.5,20.0,1.4,1.8,77.78,1.5,2.3,3.8,1.5,0.8,0.5,0.9,0.0,0.18,0.43,0.46,0.51,0.48,1.01,0.11,0.0416,0.06,0.00832,18.52,0.02,0.0139,1.3,15.6,12.0,3.6,0.06,0.22,0.11,15.36,0.05,6.48,0.02,2.56,0.00976,1.36,0.0061,0.85,0.01,1.54,0.12,0.91,0.00566,0.5,0.17,0.67,0.73,1.45,19.39,3.23,0.49,1.11,0.24
894,Chris Duhon,82,26.5,5.9,2.1,6.0,35.0,1.1,3.2,34.38,0.6,0.8,75.0,0.3,2.3,2.6,4.9,1.0,0.0,1.4,1.0,0.16,0.22,0.29,0.46,0.44,0.93,0.06,0.00663,0.05,0.0216,81.67,0.02,0.0,1.0,14.4,11.1,3.3,0.53,0.13,0.07,8.02,0.03,3.53,0.06,6.66,0.0122,1.36,0.0,0.0,0.02,1.9,0.2,0.87,0.00357,0.41,0.12,0.53,0.45,1.49,20.17,16.75,0.36,0.98,0.14
321,Joe Dumars*,82,23.9,9.4,3.5,7.3,47.95,0.1,0.2,50.0,2.3,2.9,79.31,0.7,0.7,1.5,4.8,0.8,0.1,1.9,1.0,0.18,0.39,0.44,0.55,0.49,1.1,0.04,0.0171,0.02,0.0235,65.75,0.02,0.00245,0.9,16.6,14.9,1.6,0.03,0.4,0.11,14.16,0.02,2.26,0.06,7.23,0.00976,1.21,0.00122,0.15,0.02,2.86,0.2,0.92,0.00535,0.59,0.07,0.66,0.56,1.52,21.45,10.95,0.48,1.29,0.1


Correlation matrix of new features has been saved.

Top 10 correlations of new features with TARGET_5Yrs:
Versatility_Index           0.35
PTS_to_Usage_Ratio          0.34
Offensive_Versatility       0.34
Defensive_Versatility       0.31
Stocks                      0.30
Usage_Consistency          -0.29
Overall_Impact              0.26
Points_Per_Shot             0.25
True_Shooting_Percentage    0.25
Steal_Percentage           -0.25
Name: TARGET_5Yrs, dtype: float64
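Several of the engineered features are ratios, so a zero denominator (for example `FGA - 3PA` in `2P%` when every attempt was a 3-pointer) yields NaN or inf, which can quietly distort correlations and break later models. The sketch below counts non-finite values per column and then replaces them; the helper name `non_finite_report`, the toy rows, and the choice of 0 as the fill value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def non_finite_report(frame: pd.DataFrame) -> pd.Series:
    """Count NaN and +/-inf values per numeric column, keeping only affected columns."""
    numeric = frame.select_dtypes(include="number")
    bad = (~np.isfinite(numeric)).sum()
    return bad[bad > 0]

# Made-up rows: the second player only attempted 3-pointers (FGA == 3PA)
toy = pd.DataFrame({
    "FGM": [2.0, 1.0], "FGA": [5.0, 2.0],
    "3P Made": [0.5, 1.0], "3PA": [1.5, 2.0],
})
# Same formula as the 2P% feature; the second row evaluates 0 / 0
toy["2P%"] = (toy["FGM"] - toy["3P Made"]) / (toy["FGA"] - toy["3PA"])

report = non_finite_report(toy)
# One possible clean-up: treat undefined ratios as 0, as done for the percentage features
cleaned = toy.replace([np.inf, -np.inf], np.nan).fillna(0.0)
print(report)
```

Running `non_finite_report(df_copy)` after the feature engineering cell would show whether any of the new columns need this treatment before they are used.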


In [19]:
display(df_copy)

Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs,PER,PPM,Usage_Rate,True_Shooting_Percentage,Effective_FG_Percentage,Points_Per_Shot,Total_Rebound_Percentage,Offensive_Rebound_Percentage,Defensive_Rebound_Percentage,Assist_Percentage,AST_Ratio,Steal_Percentage,Block_Percentage,Stocks,Versatility_Index,Offensive_Versatility,Defensive_Versatility,Three_Point_Rate,Free_Throw_Rate,PTS_Per_Game,PTS_Per_36,REB_Per_Game,REB_Per_36,AST_Per_Game,AST_Per_36,STL_Per_Game,STL_Per_36,BLK_Per_Game,BLK_Per_36,TOV_Per_Game,TOV_Per_36,Box_Plus_Minus,Scoring_Consistency,Usage_Consistency,Offensive_Impact,Defensive_Impact,Overall_Impact,Points_Per_Touch,Production_Per_Possession,PTS_to_Usage_Ratio,AST_to_Usage_Ratio,2P%,Points_Per_Shot_Attempt,Physical_Impact
244,Greg Anderson,82,24.2,11.7,4.6,9.2,50.00,0.0,0.1,0.00,2.4,4.0,60.00,2.0,4.3,6.3,1.0,0.7,1.5,1.7,1.0,0.24,0.48,0.52,0.53,0.50,1.07,0.15,4.84e-02,0.10,4.84e-03,10.87,0.02,3.63e-02,2.2,21.2,14.7,6.5,0.01,0.43,0.14,17.40,0.08,9.37,1.22e-02,1.49,8.54e-03,1.04,1.83e-02,2.23,0.02,2.53,0.13,0.89,6.38e-03,0.52,0.27,0.79,0.74,1.38,22.36,1.91,0.51,1.27,0.35
83,Bimbo Coles,82,16.5,4.9,2.0,4.8,41.67,0.1,0.4,25.00,0.9,1.2,75.00,0.7,1.2,1.9,2.8,0.8,0.1,1.2,1.0,0.11,0.30,0.40,0.46,0.43,0.92,0.07,2.48e-02,0.04,1.99e-02,58.33,0.03,3.55e-03,0.9,10.5,8.4,2.1,0.08,0.25,0.06,10.69,0.02,4.15,3.41e-02,6.11,9.76e-03,1.75,1.22e-03,0.22,0.01,2.62,0.12,0.82,4.82e-03,0.47,0.13,0.59,0.49,1.44,12.39,7.08,0.43,1.02,0.17
411,Paul Thompson,82,21.1,9.0,3.8,8.1,46.91,0.1,0.5,20.00,1.4,1.8,77.78,1.5,2.3,3.8,1.5,0.8,0.5,0.9,0.0,0.18,0.43,0.46,0.51,0.48,1.01,0.11,4.16e-02,0.06,8.32e-03,18.52,0.02,1.39e-02,1.3,15.6,12.0,3.6,0.06,0.22,0.11,15.36,0.05,6.48,1.83e-02,2.56,9.76e-03,1.36,6.10e-03,0.85,0.01,1.54,0.12,0.91,5.66e-03,0.50,0.17,0.67,0.73,1.45,19.39,3.23,0.49,1.11,0.24
894,Chris Duhon,82,26.5,5.9,2.1,6.0,35.00,1.1,3.2,34.38,0.6,0.8,75.00,0.3,2.3,2.6,4.9,1.0,0.0,1.4,1.0,0.16,0.22,0.29,0.46,0.44,0.93,0.06,6.63e-03,0.05,2.16e-02,81.67,0.02,0.00e+00,1.0,14.4,11.1,3.3,0.53,0.13,0.07,8.02,0.03,3.53,5.98e-02,6.66,1.22e-02,1.36,0.00e+00,0.00,0.02,1.90,0.20,0.87,3.57e-03,0.41,0.12,0.53,0.45,1.49,20.17,16.75,0.36,0.98,0.14
321,Joe Dumars*,82,23.9,9.4,3.5,7.3,47.95,0.1,0.2,50.00,2.3,2.9,79.31,0.7,0.7,1.5,4.8,0.8,0.1,1.9,1.0,0.18,0.39,0.44,0.55,0.49,1.10,0.04,1.71e-02,0.02,2.35e-02,65.75,0.02,2.45e-03,0.9,16.6,14.9,1.6,0.03,0.40,0.11,14.16,0.02,2.26,5.85e-02,7.23,9.76e-03,1.21,1.22e-03,0.15,0.02,2.86,0.20,0.92,5.35e-03,0.59,0.07,0.66,0.56,1.52,21.45,10.95,0.48,1.29,0.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
809,Demetris Nichols,14,3.1,1.1,0.4,1.6,25.00,0.2,0.9,22.22,0.0,0.0,0.00,0.0,0.4,0.4,0.1,0.0,0.2,0.3,0.0,0.11,0.35,0.61,0.34,0.31,0.69,0.44,0.00e+00,0.44,2.21e-02,6.25,0.00,2.21e-01,0.2,1.8,1.2,0.6,0.56,0.00,0.08,12.77,0.03,4.65,7.14e-03,1.16,0.00e+00,0.00,1.43e-02,2.32,0.02,3.48,0.03,0.69,4.38e-02,0.39,0.19,0.58,0.55,0.63,1.79,0.16,0.29,0.69,0.19
735,Cedric Jackson,12,6.3,1.7,0.5,1.7,29.41,0.1,0.5,20.00,0.6,1.0,60.00,0.2,0.5,0.7,1.2,0.3,0.2,0.9,0.0,0.27,0.27,0.48,0.40,0.32,0.79,0.44,1.27e-01,0.32,1.52e-01,70.59,0.19,1.27e-01,0.5,4.1,3.1,1.0,0.29,0.59,0.14,9.71,0.06,4.00,1.00e-01,6.86,2.50e-02,1.71,1.67e-02,1.14,0.07,5.14,0.28,0.63,4.02e-02,0.46,0.16,0.62,0.35,1.09,3.52,2.49,0.33,1.00,0.19
596,Caris LeVert,12,15.5,4.7,1.6,4.4,36.36,0.8,2.6,30.77,0.7,1.0,70.00,0.3,2.3,2.6,1.3,0.9,0.2,0.6,0.0,0.76,0.30,0.35,0.49,0.45,0.97,0.67,7.74e-02,0.59,6.71e-02,29.55,0.23,5.16e-02,1.1,9.7,6.3,3.4,0.59,0.23,0.39,10.92,0.22,6.04,1.08e-01,3.02,7.50e-02,2.09,1.67e-02,0.46,0.05,1.39,0.60,0.87,2.92e-02,0.39,0.22,0.61,0.64,1.23,13.39,3.70,0.44,1.07,0.24
918,James Thomas,11,10.6,2.4,1.1,1.8,61.11,0.0,0.0,0.00,0.2,0.5,40.00,1.5,1.8,3.4,0.4,0.4,0.4,0.6,0.0,0.58,0.23,0.25,0.59,0.61,1.19,1.40,6.17e-01,0.74,3.29e-02,22.22,0.16,1.65e-01,0.8,7.0,4.3,2.6,0.00,0.28,0.22,8.15,0.31,11.55,3.64e-02,1.36,3.64e-02,1.36,3.64e-02,1.36,0.05,2.04,0.35,1.04,2.25e-02,0.26,0.25,0.51,0.73,3.84,9.71,1.62,0.61,1.33,0.40


1. Derived Statistical Features:
1.1. Efficiency Metrics:
- Player Efficiency Rating (PER): A composite measure of a player's statistical contributions per minute played.
1.2. Scoring Efficiency:
- Points per Minute (PPM): This shows how effectively a player scores based on their playing time.
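
As a minimal sketch of the scoring-efficiency feature, the snippet below computes PPM as PTS / MIN on two rows copied from the table above. The zero-minute guard is an assumption added for illustration. The PER formula used in this notebook is not shown in this section, so it is not reproduced here.

```python
import numpy as np
import pandas as pd

# Two rows copied from the displayed df_copy (per-game averages)
sample = pd.DataFrame({
    "Name": ["Greg Anderson", "Paul Thompson"],
    "MIN": [24.2, 21.1],
    "PTS": [11.7, 9.0],
})

# Points per minute; a zero MIN becomes NaN instead of inf (illustrative guard)
sample["PPM"] = sample["PTS"] / sample["MIN"].replace(0, np.nan)
print(sample.round(2))
```

Rounded to two decimals, both values agree with the PPM column displayed for these players.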

2. Ratios and Percentages:
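2.1. Assist-to-Turnover Ratio:
- Assist-to-Turnover Ratio (AST / TOV): the number of assists a player records per turnover committed.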
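
A minimal sketch of this ratio, assuming the usual definition AST / TOV. The column name `AST_TOV_Ratio` and the third row are hypothetical and serve only to show the zero-turnover case; the first two rows reuse AST and TOV values from the table above.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Greg Anderson", "Chris Duhon", "Hypothetical Player"],
    "AST": [1.0, 4.9, 2.0],
    "TOV": [1.7, 1.4, 0.0],  # last value is made up to exercise the guard
})

# Assists per turnover; TOV == 0 yields NaN rather than inf (illustrative choice)
sample["AST_TOV_Ratio"] = sample["AST"] / sample["TOV"].replace(0, np.nan)
print(sample.round(2))
```

Leaving zero-turnover rows as NaN keeps infinite values out of later scaling steps; whether to impute or drop them afterwards is a separate decision.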


In [20]:
# ==================================================================
# PCA 2D
# ==================================================================
# Create the output directory if it doesn't exist
output_dir = 'image/reduction_dimensionelle'
os.makedirs(output_dir, exist_ok=True)

# PCA to visualize class separation
features = df_copy.drop(['Name', 'TARGET_5Yrs'], axis=1)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

pca = PCA(n_components=2)
pca_result = pca.fit_transform(features_scaled)

plt.figure(figsize=(10, 8))
# Color by the target taken from df_copy so the labels stay aligned with the rows used for PCA
scatter = plt.scatter(pca_result[:, 0], pca_result[:, 1], c=df_copy['TARGET_5Yrs'], cmap='viridis')
plt.colorbar(scatter)
plt.title("PCA - Visualization of the first two principal components")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
# plt.savefig('pca_visualization.png')
plt.savefig(os.path.join(output_dir, 'pca_visualization_with_FE.png'))
plt.close()

print("Exploratory data analysis complete. The visualizations have been saved as PNG files.")

Exploratory data analysis complete. The visualizations have been saved as PNG files.
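
A 2D PCA scatter is only as informative as the share of variance the two components retain. The sketch below shows one way to check this with `explained_variance_ratio_`. It runs on synthetic data (an assumption for illustration, not this notebook's dataset), so the printed numbers say nothing about the NBA features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Six correlated columns built from two latent factors, plus a little noise
X = np.hstack([base, base @ rng.normal(size=(2, 4))]) + 0.1 * rng.normal(size=(200, 6))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Cumulative:", round(float(ratios.sum()), 3))
```

If the cumulative share is low on the real features, overlapping classes in the plot may reflect the projection rather than the data.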


In [21]:
# ==================================================================
# TSNE (t-distributed Stochastic Neighbor Embedding) 2D
# ==================================================================
from sklearn.manifold import TSNE

# Standardize the features (same scaling as for the PCA above)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=100, random_state=42)
tsne_result = tsne.fit_transform(features_scaled)

# Plot the t-SNE result, colored by the target from df_copy to stay aligned with the features
plt.figure(figsize=(10, 8))
scatter = plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=df_copy['TARGET_5Yrs'], cmap='viridis')
plt.colorbar(scatter)
plt.title("t-SNE - Visualization of the two embedding dimensions")
plt.xlabel("First t-SNE dimension")
plt.ylabel("Second t-SNE dimension")
# plt.savefig('tsne_visualization.png')
plt.savefig(os.path.join(output_dir, 'tsne_visualization_with_FE.png'))
plt.close()

print("t-SNE visualization complete. The image has been saved as a PNG file.")


t-SNE visualization complete. The image has been saved as a PNG file.
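
The t-SNE layout depends heavily on `perplexity`, which scikit-learn requires to be smaller than the number of samples. The sketch below fits two perplexity values on synthetic data (an assumption for illustration) and prints the final KL divergence of each fit. The values 5 and 30 are arbitrary and are not a recommendation for this dataset.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(80, 5))  # synthetic stand-in for the scaled feature matrix
X_scaled = StandardScaler().fit_transform(X)

embeddings = {}
for perplexity in (5, 30):  # both values are below n_samples = 80
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)
    # kl_divergence_ is the objective value reached at the end of optimization
    print(perplexity, embeddings[perplexity].shape, round(float(tsne.kl_divergence_), 3))
```

Plotting the embeddings side by side for a few perplexity values is a cheap way to see whether an apparent cluster structure is stable.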


In [22]:
# ==================================================================
# Save the clean Dataset
# =================================================================
# Save the cleaned DataFrame to a CSV file
df_copy.to_csv('nba_logreg_clean_WITH_FE.csv', index=False)

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import os

# 1. Analyze new features' distributions and relationships with target variable
output_dir = 'image/feature_analysis'
os.makedirs(output_dir, exist_ok=True)

def plot_feature_distribution(df, feature, target):
    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x=feature, hue=target, kde=True, element="step")
    plt.title(f'Distribution of {feature} by {target}')
    plt.savefig(os.path.join(output_dir, f'{feature}_distribution.png'))
    plt.close()

# Select top 10 correlated features with the target
df_numerique = df_copy.select_dtypes(include=['number'])
correlation_with_target = df_numerique.corr()['TARGET_5Yrs'].abs().sort_values(ascending=False)
top_features = correlation_with_target[1:11].index.tolist()

for feature in top_features:
    plot_feature_distribution(df_copy, feature, 'TARGET_5Yrs')

print("Feature distribution plots saved.")





Feature distribution plots saved.
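
The `[1:11]` slice in the cell above relies on `TARGET_5Yrs` sorting first because its correlation with itself is 1. A slightly more explicit variant drops the target by name before ranking. The sketch below uses a toy frame with hypothetical columns (`strong`, `weak`, `noise`) rather than `df_copy`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)

# Toy data: one feature closely tied to the target, one loosely, one unrelated
toy = pd.DataFrame({
    "strong": target + 0.3 * rng.normal(size=n),
    "weak": target + 1.5 * rng.normal(size=n),
    "noise": rng.normal(size=n),
    "TARGET_5Yrs": target,
})

# Remove the target's self-correlation by label, then rank by absolute value
corr = toy.corr()["TARGET_5Yrs"].drop("TARGET_5Yrs")
top_features = corr.abs().sort_values(ascending=False).head(2).index.tolist()
print(top_features)
```

Dropping by label keeps the selection correct even if the column order or a tie in the sort changes where the target lands.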
